Introducing immutable URL instances with a single string buffer and offsets #253

lemire · 2023-03-01T22:21:52Z

lemire
Mar 1, 2023
Maintainer

Currently, the core ada struct is made of several independent strings. A competing design would be to have one large string, with offsets into it.

The std::string in C++ and short string optimization

In C++, std::string instances are typically implemented using a 'short string/long string model'. This will vary depending on the standard library/compiler. In some cases, the threshold is 22 characters according to this StackOverflow answer.
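To see where a particular standard library draws the line, you can probe it directly. The following is a minimal sketch: the helper names `stored_inline` and `inline_capacity` are mine (they are not part of ada), and the address comparison is a trick that works on the mainstream implementations but is not something the C++ standard promises.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ada): true when the characters of `s` live
// inside the std::string object itself, i.e. the short string optimization is
// in effect, and false when they live in a separate heap buffer.
// Assumption: converting pointers to integers and comparing them is
// meaningful, which holds on mainstream platforms but is not guaranteed.
bool stored_inline(const std::string& s) {
  const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
  const auto object_end = object_begin + sizeof(std::string);
  const auto data = reinterpret_cast<std::uintptr_t>(s.data());
  return data >= object_begin && data < object_end;
}

// Number of characters an empty string can hold before it must allocate.
std::size_t inline_capacity() { return std::string().capacity(); }
```

On a given toolchain, `inline_capacity()` reports the threshold discussed above, and `stored_inline(component)` tells you whether a particular URL component stays within it.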

Size of our structures

Though these numbers can vary...

ada::result : 272 bytes
ada/url (struct) : 264 bytes
an empty string: 32 bytes
std::optional<std::string> : 40 bytes
a single 32-bit integer: 4 bytes

We need 8 offsets to indicate the location of each segment of a URL (32 bytes):

     uint32_t protocol_end{0};
     uint32_t username_end{0};
     uint32_t host_start{0};
     uint32_t host_end{0};
     uint32_t port{omitted};
     uint32_t pathname_start{0};
     uint32_t search_start{omitted};
     uint32_t hash_start{omitted};

Roughly speaking, the ada/url struct is made of 8 strings, so 8 times 32 bytes is 256 bytes. (We use more because we have booleans and std::optional).
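To make the single-buffer alternative concrete, here is a sketch of how the 8 offsets could be paired with a view of the serialized string. The type names `url_offsets` and `url_view`, the value chosen for `omitted`, and the exact meaning given to each offset (for instance, that `protocol_end` points one past the ':') are assumptions made for illustration; the sketch also ignores credentials and the port.

```cpp
#include <cstdint>
#include <string_view>

// Assumed sentinel meaning "this component is absent".
constexpr std::uint32_t omitted = 0xffffffff;

struct url_offsets {
  std::uint32_t protocol_end{0};
  std::uint32_t username_end{0};
  std::uint32_t host_start{0};
  std::uint32_t host_end{0};
  std::uint32_t port{omitted};
  std::uint32_t pathname_start{0};
  std::uint32_t search_start{omitted};
  std::uint32_t hash_start{omitted};
};
static_assert(sizeof(url_offsets) == 32, "8 offsets of 4 bytes each");

// A non-owning, immutable view: the caller keeps the serialized URL alive.
struct url_view {
  std::string_view buffer;
  url_offsets o;

  std::uint32_t size() const { return std::uint32_t(buffer.size()); }
  // Where the hash begins, or the end of the buffer when there is no hash.
  std::uint32_t hash_begin() const {
    return o.hash_start == omitted ? size() : o.hash_start;
  }
  // Where the search begins, or where the hash begins when there is none.
  std::uint32_t search_begin() const {
    return o.search_start == omitted ? hash_begin() : o.search_start;
  }

  std::string_view protocol() const { return buffer.substr(0, o.protocol_end); }
  std::string_view host() const {
    return buffer.substr(o.host_start, o.host_end - o.host_start);
  }
  std::string_view pathname() const {
    return buffer.substr(o.pathname_start, search_begin() - o.pathname_start);
  }
  std::string_view search() const {
    return buffer.substr(search_begin(), hash_begin() - search_begin());
  }
  std::string_view hash() const { return buffer.substr(hash_begin()); }
};
```

Every getter is a subtraction and a `substr`, with no allocation: for `https://www.google.com/webhp?hl=en#top` and offsets `{6, 8, 8, 22, omitted, 22, 28, 34}`, `pathname()` yields `/webhp` and `search()` yields `?hl=en`.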

Ada memory usage vs 'optimal' memory usage

Assuming that we store the encoded URL (say 80 bytes), we could add 8 integers (32 bytes) and represent a parsed URL using (say) 112 bytes.

We use 264 bytes plus a bit of allocation. How much?

Let us pick a 'typical' URL like...

https://static.files.bbci.co.uk/orbit/737a4ee2bed596eb65afc4d2ce9af568/js/polyfills.js

In this instance, the path clearly exceeds the short-string limit and will be allocated on the heap.

Let us pick another...

https://www.google.com/webhp?hl=en&ictx=2&sa=X&ved=0ahUKEwil_oSxzJj8AhVtEFkFHTHnCGQQPQgI

In that instance, it is the query/search string that would exceed the short-string threshold.

In any case, we may therefore reasonably assume that in most instances the following holds true...

At most one of the inner strings within the ada/url struct will use heap allocation.

For a hypothetical 80-byte URL, we might have (say) 64 bytes of heap memory plus 264 bytes, or about 328 bytes.

Roughly speaking we might use three times as much memory as we might need to.
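The back-of-the-envelope estimate can be written out as compile-time arithmetic. The constants are the figures quoted in this post (the 80-byte URL and the 64 bytes of heap are the same guesses as above, not measurements), and the names are only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Single buffer plus offsets.
constexpr std::size_t url_length = 80;                            // assumed URL size
constexpr std::size_t offsets_bytes = 8 * sizeof(std::uint32_t);  // 32
constexpr std::size_t compact_bytes = url_length + offsets_bytes; // 112

// Current design: the struct itself plus one heap-allocated component.
constexpr std::size_t ada_struct_bytes = 264;                     // figure quoted above
constexpr std::size_t heap_bytes = 64;                            // assumed allocation
constexpr std::size_t ada_bytes = ada_struct_bytes + heap_bytes;  // 328

// 328 / 112 is roughly 2.9, hence "three times as much".
constexpr double ratio = double(ada_bytes) / double(compact_bytes);

static_assert(compact_bytes == 112, "80 + 32");
static_assert(ada_bytes == 328, "264 + 64");
```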

Worst case scenario for ada

The worst case scenario for ada is a trivial string that requires no non-trivial parsing. E.g.,

http://www.google.com/johbn?cds=1

You could quickly just copy this string (if a copy is even needed) and compute the offsets. With something like 64 bytes, and little work, you are done. Ada would use 5 times more memory.

What about more complex cases?

The standard requires us to be able to modify the URL: we have setters. In that scenario, all our extra weight pays off. To some degree, modifying the path only means writing to the path string. There is no need to move data around as there would be if we had a single string. Because we don't hardcode the offsets, there is no need for any recomputation.
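For contrast, here is roughly what a path setter would have to do with a single buffer. This is a sketch under assumptions: `flat_url` is a hypothetical type that keeps only the three offsets relevant here, and it performs no validation or encoding of the new path.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

constexpr std::uint32_t omitted = 0xffffffff;  // assumed "absent" marker

// Hypothetical single-buffer URL, reduced to the fields the setter touches.
struct flat_url {
  std::string buffer;
  std::uint32_t pathname_start{0};
  std::uint32_t search_start{omitted};
  std::uint32_t hash_start{omitted};

  std::uint32_t pathname_end() const {
    if (search_start != omitted) return search_start;
    if (hash_start != omitted) return hash_start;
    return std::uint32_t(buffer.size());
  }

  void set_pathname(std::string_view new_path) {
    const std::uint32_t old_len = pathname_end() - pathname_start;
    const std::uint32_t new_len = std::uint32_t(new_path.size());
    // When the lengths differ, replace() moves every byte after the path.
    buffer.replace(pathname_start, old_len, new_path);
    // Every offset located after the path must then be shifted.
    if (search_start != omitted) search_start = search_start - old_len + new_len;
    if (hash_start != omitted) hash_start = hash_start - old_len + new_len;
  }
};
```

With independent strings, the same operation is a plain assignment to the path member; the data movement and the offset bookkeeping above are the price of the compact layout.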

Future plans?

It seems like a good extra feature for ada would be the production of a lightweight immutable URL instance. It could be produced from an existing URL: you would serialize the reconstructed string, and add the 8 offsets.

If you are parsing an existing URL string and you are expecting a simple case (no setters, and you expect that you won't need to modify the input string), you could just scan it and produce a lightweight structure on top of it, possibly made of just the 8 offsets.

Finally, you'd want to be able to convert this lightweight immutable string into an ada/url.

The simplest possible design for this lightweight immutable data structure would be just the 8 offsets, but you'd have to somehow carry the serialized URL string. Maybe you could have a reference to the serialized string coupled with the 8 offsets, plus some extra functions for convenience.

It could be implemented as the following functions:

1. Produce a structure made of 8 offsets from an existing ada/url. This is in the works at https://github.com/ada-url/ada/pull/250/files#diff-8722980a29ef1d4759c4c7232ba181d5ba8da70448a1b8f602f2b2a5cede749b
2. From an existing string, attempt to construct a structure of offsets. Such a function would be non-throwing and non-allocating. In the general case, you'd need to validate the input (e.g., the string 'http://www.GOoogle.com/path/../doc' is not simple and requires parsing and reserialization). The most efficient would be a function that just scans a string and returns a structure of offsets upon success, or some 'error' condition otherwise.
3. Have a fast function that can go in reverse: from a structure made of the 8 offsets, build an ada/url structure. This is probably nearly trivial.
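As an illustration of the scanning function in the list above (the one that works from an existing string), here is a deliberately conservative sketch. The name `try_scan_simple`, the `omitted` value and, above all, the accepted character sets are my own simplifications, not ada's code and not the full rules of the URL standard: only `http://` and `https://` URLs without credentials or port are accepted, and anything that might need rewriting (uppercase or numeric-looking hosts, punycode labels, percent signs, possible dot segments, a missing path) is rejected. A rejection therefore means "not obviously simple, use the full parser", not "invalid".

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

constexpr std::uint32_t omitted = 0xffffffff;  // assumed "absent" marker

struct url_offsets {
  std::uint32_t protocol_end{0};
  std::uint32_t username_end{0};
  std::uint32_t host_start{0};
  std::uint32_t host_end{0};
  std::uint32_t port{omitted};
  std::uint32_t pathname_start{0};
  std::uint32_t search_start{omitted};
  std::uint32_t hash_start{omitted};
};

inline bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
inline bool is_digit(char c) { return c >= '0' && c <= '9'; }
inline bool is_alnum(char c) {
  return is_lower(c) || is_digit(c) || (c >= 'A' && c <= 'Z');
}
// True when c is alphanumeric or one of the extra characters listed.
inline bool allowed(char c, std::string_view extra) {
  return is_alnum(c) || extra.find(c) != std::string_view::npos;
}

// Single pass, no allocation, no exceptions.
std::optional<url_offsets> try_scan_simple(std::string_view in) noexcept {
  if (in.size() >= omitted) return std::nullopt;  // offsets must fit in 32 bits
  url_offsets o;
  std::size_t i = 0;
  if (in.substr(0, 7) == "http://") {
    i = 7;
  } else if (in.substr(0, 8) == "https://") {
    i = 8;
  } else {
    return std::nullopt;
  }
  o.protocol_end = std::uint32_t(i - 2);             // one past the ':'
  o.username_end = o.host_start = std::uint32_t(i);  // no credentials

  // Host: lowercase letters, digits, '-' and '.', with no empty label.
  std::size_t label_start = i;
  for (; i < in.size() && in[i] != '/'; ++i) {
    const char c = in[i];
    if (c == '.') {
      if (i == label_start) return std::nullopt;
      label_start = i + 1;
    } else if (!is_lower(c) && !is_digit(c) && c != '-') {
      return std::nullopt;
    }
  }
  if (i == in.size()) return std::nullopt;  // no path: a '/' would be added
  // The last label must start with a letter so the host cannot be a number.
  if (label_start == i || !is_lower(in[label_start])) return std::nullopt;
  const std::string_view host = in.substr(o.host_start, i - o.host_start);
  if (host.find("xn--") != std::string_view::npos) return std::nullopt;
  o.host_end = o.pathname_start = std::uint32_t(i);

  // Path: a '.' right after '/' could start a dot segment, so give up.
  for (; i < in.size() && in[i] != '?' && in[i] != '#'; ++i) {
    if (in[i] == '.') {
      if (in[i - 1] == '/') return std::nullopt;
    } else if (!allowed(in[i], "/-_~")) {
      return std::nullopt;
    }
  }
  if (i < in.size() && in[i] == '?') {
    o.search_start = std::uint32_t(i);
    for (++i; i < in.size() && in[i] != '#'; ++i) {
      if (!allowed(in[i], "-_.~=&")) return std::nullopt;
    }
  }
  if (i < in.size()) {  // in[i] is '#'
    o.hash_start = std::uint32_t(i);
    for (++i; i < in.size(); ++i) {
      if (!allowed(in[i], "-_.~")) return std::nullopt;
    }
  }
  return o;
}
```

With these rules, `http://www.google.com/johbn?cds=1` is accepted as is, while `http://www.GOoogle.com/path/../doc` is rejected at the first uppercase letter and would go through the regular parser.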

anonrig · 2023-03-01T23:31:48Z

anonrig
Mar 1, 2023
Maintainer

I think one step is missing. Between #1 and #2 we need to:

Add a method to check whether the input requires any encoding or decoding. This means that, eventually, we don't even need to copy the string. This intermediate function can also be used in future optimizations for Node.js.


lemire · 2023-03-01T23:48:01Z

lemire
Mar 1, 2023
Maintainer Author

@anonrig

Everything else being equal, you want to do as few passes over the data as possible (for speed and simplicity). That is why I propose that there should be just one function that checks if the input requires any reserialization. If the URL can be left unchanged, then you should immediately return the offsets.

Otherwise, with the approach you propose, I am concerned that we might be effectively computing the offsets twice, needlessly.


lemire · 2023-03-02T21:45:24Z

lemire
Mar 2, 2023
Maintainer Author

@anonrig

The question is: how do we recognize a simple URL string that can be used as is? We must scan the scheme, arrive at '//', find a host, possibly locate a ':', and so forth. As you do this, you effectively compute the offsets, and you might as well compute them and return them.

To speed things up, you can probably generate some useful data for future parsing even if you don't have a simple URL. E.g., suppose that you have identified the scheme and it is just fine; then you can record the offset, and so forth. Even if the path needs to be rewritten, you can still use the information that the fast scan provided.

So take this http://google.com/éscape. Well, the path needs to be encoded, bad luck. But you can still record (somehow) that the domain and scheme are fine.

Ultimately, I imagine that a future version of ada would often just do a fast scan and check whether we are in the simple case. If we are, then we are done (super quickly); if not, we push through the current ada (possibly with some hints).

In this manner, we could be very fast. We could probably double our performance.
