Replies: 3 comments
-
I think one step is missing. Between
-
Everything else being equal, you want to do as few passes over the data as possible (for speed and simplicity). That is why I propose a single function that checks whether the input requires any reserialization. If the URL can be left unchanged, then you should immediately return the offsets. Otherwise, with the approach you propose, I am concerned that we might effectively compute the offsets twice, needlessly.
-
The question is: how do we recognize a simple URL string that can be used as is? We must scan the scheme, arrive at a '//', find a host, possibly locate a ':', and so forth. As you do this, you effectively compute the offsets, so you might as well return them. To speed things up, you can probably generate some useful data for future parsing even if you don't have a simple URL. E.g., suppose that you have identified the scheme and it is just fine: then you can record the offset and so forth. Even if the path needs to be rewritten, you can still use the information that the fast scan provided. So take this

Ultimately, I imagine that a future version of ada would often just do a fast scan and check whether we are in the simple case. If we are, then we are done (super quickly); if not, we push through the current ada (possibly with some hints). In this manner, we could be very fast. We could probably double our performance.
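To make the idea concrete, here is a minimal sketch of such a one-pass scan. Everything in it is an assumption made for illustration: the names `scan_offsets` and `fast_scan` are hypothetical and are not part of ada, and the acceptance rules are deliberately conservative (http/https only, lowercase host, no port, no dot segments, no percent-encoding). A real check would need more rules (for example for punycode labels). Anything the scan is unsure about is simply rejected, on the assumption that it would then go through the full parser.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical offsets, for illustration only (not ada's actual layout).
struct scan_offsets {
  uint32_t scheme_end;      // index of the ':' that ends the scheme
  uint32_t host_start;      // first byte after "://"
  uint32_t host_end;        // one past the last host byte; the path starts here
  uint32_t query_start;     // end of the path (index of '?' when a query exists)
  uint32_t fragment_start;  // end of the query (index of '#' when present)
};

inline bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
inline bool is_digit(char c) { return c >= '0' && c <= '9'; }
inline bool is_unreserved(char c) {
  return is_lower(c) || is_digit(c) || (c >= 'A' && c <= 'Z') ||
         c == '-' || c == '_' || c == '.' || c == '~';
}

// Single pass over the input. Returns the offsets when the string can be
// used unchanged, and std::nullopt when the full parser should take over.
std::optional<scan_offsets> fast_scan(std::string_view in) {
  const size_t n = in.size();
  if (n > UINT32_MAX) return std::nullopt;
  size_t i = 0;
  while (i < n && is_lower(in[i])) i++;  // scheme, lowercase letters only
  if (i == 0 || in.substr(i, 3) != "://") return std::nullopt;
  const std::string_view scheme = in.substr(0, i);
  if (scheme != "http" && scheme != "https") return std::nullopt;
  scan_offsets o{};
  o.scheme_end = uint32_t(i);
  i += 3;
  o.host_start = uint32_t(i);
  size_t label = i;  // start of the current host label
  while (i < n && (is_lower(in[i]) || is_digit(in[i]) || in[i] == '-' || in[i] == '.')) {
    if (in[i] == '.') {
      if (label == i) return std::nullopt;  // empty label
      label = i + 1;
    }
    i++;
  }
  // Reject empty hosts, trailing dots, and a last label that starts with a
  // digit (it might be an IPv4 address that needs normalization).
  if (label == i || is_digit(in[label])) return std::nullopt;
  o.host_end = uint32_t(i);
  // The path must be present and start with '/'; a port (':') is rejected here.
  if (i == n || in[i] != '/') return std::nullopt;
  while (i < n && (is_unreserved(in[i]) || in[i] == '/')) {
    // A '.' right after '/' could begin a dot segment that must be rewritten.
    if (in[i] == '.' && in[i - 1] == '/') return std::nullopt;
    i++;
  }
  o.query_start = uint32_t(i);
  if (i < n && in[i] == '?') {
    i++;
    while (i < n && (is_unreserved(in[i]) || in[i] == '=' || in[i] == '&')) i++;
  }
  o.fragment_start = uint32_t(i);
  if (i < n && in[i] == '#') {
    i++;
    while (i < n && is_unreserved(in[i])) i++;
  }
  if (i != n) return std::nullopt;  // hit a byte the fast path does not handle
  return o;
}
```

Note that even when the scan gives up, the offsets found so far (say, the end of the scheme) could be handed to the slow path as hints; this sketch does not do that.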
-
Currently, the core ada struct is made of several independent strings. A competing design would be to have one large string, with offsets into it.
The std::string in C++ and short string optimization
In C++, std::string instances are typically implemented using a 'short string/long string model'. This will vary depending on the standard library/compiler. In some cases, the threshold is 22 characters according to this StackOverflow answer.
Size of our structures
Though these numbers can vary...
We need 8 offsets to indicate the location of each segment of a URL (32 bytes):
Roughly speaking, the ada/url struct is made of 8 strings, so 8 times 32 bytes is 256 bytes. (We use more because we have booleans and std::optional).
Ada memory usage vs 'optimal' memory usage
Assuming that we store the encoded URL (say 80 bytes), we could add 8 integers (32 bytes) and represent a parsed URL using (say) 112 bytes.
We use 264 bytes plus a bit of allocation. How much?
Let us pick a 'typical' URL like...
In this instance, the path clearly exceeds the short-string limit and will be allocated on the heap.
Let us pick another...
In that instance, it is the query/search string that would exceed the short-string threshold.
In any case, we may therefore reasonably assume that in most instances the following holds true...
For a hypothetical 80-byte URL, we might (say) have 64 bytes of heap memory plus 264 bytes, or about 328 bytes.
Roughly speaking we might use three times as much memory as we might need to.
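The arithmetic can be written down as a small sketch. The helper names are hypothetical, and the 8 extra bytes are only an assumption that reconciles 8 strings of 32 bytes (256) with the 264 bytes quoted above; `sizeof(std::string)` itself varies by standard library, so it is passed in as a parameter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical compact layout: 8 offsets of 32 bits each.
struct offsets8 {
  uint32_t v[8];
};
static_assert(sizeof(offsets8) == 32, "8 offsets x 4 bytes");

// Single-buffer design: the serialized URL plus the 8 offsets.
constexpr size_t compact_bytes(size_t url_len) {
  return url_len + sizeof(offsets8);
}

// Multi-string design: 8 string objects, some extra fields (booleans,
// std::optional), and whatever spills to the heap once a component
// exceeds the short-string threshold.
constexpr size_t multi_string_bytes(size_t heap_bytes, size_t string_size,
                                    size_t extra_bytes) {
  return 8 * string_size + extra_bytes + heap_bytes;
}
```

With the numbers used here, `compact_bytes(80)` is 112 and `multi_string_bytes(64, 32, 8)` is 328, a ratio of about 2.9.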
Worst case scenario for ada
The worst case scenario for ada is a trivial string that requires no non-trivial parsing. E.g.,
You could quickly just copy this string (if a copy is even needed) and compute the offsets. With something like 64 bytes, and little work, you are done. Ada would use 5 times more memory.
What about more complex cases?
The standard requires us to be able to modify the URL: we have setters. In that scenario, all our extra weight pays off. To some degree, modifying the path only means writing to the path string. There is no need to move data around as there would be if we had a single string. Because we don't hardcode the offsets, there is no need for any recomputation.
Future plans?
It seems like a good extra feature for ada would be the production of a lightweight immutable URL instance. It could be produced from an existing URL: you would serialize the reconstructed string, and add the 8 offsets.
If you are parsing an existing URL string and you are expecting a simple case (no setters, and you expect that you won't need to modify the input string), you could just scan it and produce a lightweight structure on top of it, possibly made of just the 8 offsets.
Finally, you'd want to be able to convert this lightweight immutable structure into an ada/url.
The simplest possible design for this lightweight immutable data structure would be just the 8 offsets, but you would have to carry the serialized URL string somehow. Maybe you could have a reference to the serialized string coupled with the 8 offsets, plus some extra functions for convenience.
It could be implemented as the following functions:
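As one possible shape, here is a hedged sketch of a view that couples a borrowed serialized string with 8 offsets and a few convenience accessors. All names (`url_offsets`, `url_view`, `owned_url`, `to_owned`) and the meaning given to each offset are assumptions made for this example; they are not taken from ada. Only some accessors are shown, and `owned_url` merely stands in for a mutable struct made of independent strings.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical marker for "no port in the URL".
constexpr uint32_t no_port = 0xffffffff;

// Hypothetical layout: 8 values of 32 bits, 32 bytes in total.
struct url_offsets {
  uint32_t protocol_end;    // one past the ':' that ends the scheme
  uint32_t username_end;    // equals host_start when there are no credentials
  uint32_t host_start;      // first byte of the host
  uint32_t host_end;        // one past the last byte of the host
  uint32_t port;            // port number, or no_port when omitted
  uint32_t pathname_start;  // index of the leading '/' of the path
  uint32_t search_start;    // index of '?', or hash_start when there is no query
  uint32_t hash_start;      // index of '#', or the string length when absent
};

// Immutable view: it borrows the serialized string, which must outlive it.
struct url_view {
  std::string_view serialized;
  url_offsets at;

  std::string_view slice(uint32_t begin, uint32_t end) const {
    return serialized.substr(begin, end - begin);
  }
  std::string_view protocol() const { return slice(0, at.protocol_end); }
  std::string_view host() const { return slice(at.host_start, at.host_end); }
  std::string_view pathname() const { return slice(at.pathname_start, at.search_start); }
  std::string_view search() const { return slice(at.search_start, at.hash_start); }
  std::string_view hash() const { return serialized.substr(at.hash_start); }
};

// Hypothetical mutable counterpart made of independent strings.
struct owned_url {
  std::string protocol, host, pathname, search, hash;
};

// Copies each component out of the view so that it can be modified freely.
owned_url to_owned(const url_view& v) {
  return {std::string(v.protocol()), std::string(v.host()),
          std::string(v.pathname()), std::string(v.search()),
          std::string(v.hash())};
}
```

The view costs one string reference plus 32 bytes and never allocates; the conversion to the owning form is where the copies (and any heap allocations) happen, so you only pay for them when a setter is actually needed.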