Conversation
It looks good to me! It could also be interesting to experiment with https://www.sqlite.org/sharedcache.html and see how it affects memory pressure and performance.
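For context, SQLite's shared-cache mode lets several connections in one process share a single page cache instead of each keeping its own. A minimal sketch of such an experiment, assuming the rusqlite crate (the crate choice and file name are illustrative, not necessarily what sqld uses):

```rust
use rusqlite::{Connection, OpenFlags};

fn main() -> rusqlite::Result<()> {
    // SQLITE_OPEN_SHARED_CACHE asks SQLite to share one page cache
    // between these connections instead of caching pages twice.
    let flags = OpenFlags::SQLITE_OPEN_READ_WRITE
        | OpenFlags::SQLITE_OPEN_CREATE
        | OpenFlags::SQLITE_OPEN_SHARED_CACHE;
    let conn_a = Connection::open_with_flags("data.db", flags)?;
    let conn_b = Connection::open_with_flags("data.db", flags)?;
    conn_a.execute_batch("CREATE TABLE IF NOT EXISTS t(x INTEGER)")?;
    conn_b.execute_batch("INSERT INTO t VALUES (1)")?;
    Ok(())
}
```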
I do think that refusing requests at 90% memory usage is a bit too conservative; I would maybe set it as a multiple of the max response size. If you have a machine with 100 GB of RAM, you can definitely still respond with 10 GB of RAM left.
Yeah, fair enough. Perhaps we should go with something generic like "min(10%, 50MiB)", so that machines with 128 or 256 MiB of RAM don't get half of their resources reserved just in case.
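To make the "min(10%, 50MiB)" idea concrete, here's a tiny sketch (the function name is hypothetical) of how that reserve could be computed: a 256 MiB machine reserves ~25.6 MiB (10%), while a 100 GiB machine still reserves only 50 MiB.

```rust
/// Hypothetical helper: reserve the smaller of 10% of total RAM or 50 MiB.
fn reserved_bytes(total_ram: u64) -> u64 {
    const FIFTY_MIB: u64 = 50 * 1024 * 1024;
    (total_ram / 10).min(FIFTY_MIB)
}

fn main() {
    // 256 MiB machine: the 10% term wins, so ~25.6 MiB is reserved.
    assert_eq!(reserved_bytes(256 * 1024 * 1024), 256 * 1024 * 1024 / 10);
    // 100 GiB machine: the reserve is capped at 50 MiB.
    assert_eq!(reserved_bytes(100 * 1024 * 1024 * 1024), 50 * 1024 * 1024);
}
```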
It isn't applied anywhere yet, but it will be in the hrana and http result builders.
Once the limit of total response sizes in flight is reached, queries start to fail in order to free memory.
Concurrency is throttled more aggressively if we detect that the total response size is heading towards its predefined maximum.
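A minimal sketch of how that accounting could look (the names and the 32MiB budget are assumptions for illustration, not sqld's actual code): a global atomic counter is charged while a response is being built, and a query fails once the budget is exhausted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical global budget for all responses currently in flight.
static IN_FLIGHT_BYTES: AtomicU64 = AtomicU64::new(0);
const MAX_TOTAL_RESPONSE_SIZE: u64 = 32 * 1024 * 1024; // illustrative

/// Charge `len` bytes against the budget, failing the query if it's exhausted.
fn charge(len: u64) -> Result<(), &'static str> {
    let prev = IN_FLIGHT_BYTES.fetch_add(len, Ordering::SeqCst);
    if prev + len > MAX_TOTAL_RESPONSE_SIZE {
        IN_FLIGHT_BYTES.fetch_sub(len, Ordering::SeqCst);
        return Err("total response size limit reached");
    }
    Ok(())
}

/// Credit the bytes back. As the discussion below points out, calling this
/// right after the response is generated (rather than after it is sent) is
/// what lets many concurrent large reads still overcommit memory.
fn release(len: u64) {
    IN_FLIGHT_BYTES.fetch_sub(len, Ordering::SeqCst);
}

fn main() {
    let len = 10 * 1024 * 1024;
    if charge(len).is_ok() {
        // ... build and send the response here ...
        release(len);
    }
}
```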
lgtm
This series adds a few overload protection mechanisms, introduced in separate commits:

1. A limit on the total size of responses in flight, expressed in terms of `max_response_size`. My experiments showed that even 16MiB is a much safer choice, but 32MiB is much better than nothing.
2. More aggressive concurrency throttling once the total response size heads towards its predefined maximum.
3. A check based on the `sysinfo` crate that reads system memory. If less than 10% of memory is available, requests are refused. On Linux, reading system memory stats was empirically measured to cost <40 microseconds, which satisfies the definition of "cheap enough"; that would have to be verified for other platforms, especially virtualized ones. A sketch of this check follows below.

As for (1.), one substantial problem with it is that the global response size counter is decremented too early: right after the response is generated, but before it is sent. Because of that, a sufficiently high number of large requests (e.g. `SELECT *` on a large table) can still overcommit memory; that problem is alleviated by (2.) and (3.). Ideally we should hold the permit until the memory is actually freed, but I didn't have any idea how to elegantly code that without overhauling lots of interfaces. Suggestions welcome.

Each mechanism is subject to discussion, so please voice your opinions, folks. We should add at least one of them to make sqld more robust, but in particular, my experiments with highly concurrent large reads showed that only (1.), (2.), and (3.) combined actually prevented OOM.
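A minimal sketch of what the check in (3.) could boil down to, assuming a recent `sysinfo` release where the memory getters return bytes (the 10% threshold matches the description above; the function name is hypothetical):

```rust
use sysinfo::System;

/// Hypothetical gate: accept a request only if at least 10% of RAM is free.
fn memory_pressure_ok(sys: &mut System) -> bool {
    // Refreshing only the memory stats keeps the call cheap
    // (tens of microseconds on Linux, per the measurement above).
    sys.refresh_memory();
    sys.available_memory() * 10 >= sys.total_memory()
}

fn main() {
    let mut sys = System::new();
    if memory_pressure_ok(&mut sys) {
        println!("accepting requests");
    } else {
        println!("refusing requests: less than 10% of memory available");
    }
}
```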
That said, OOM-ing the process and restarting it is in itself a very nice, though drastic, overload protection mechanism 😇