
A few overload protection mechanisms #552

Merged
merged 7 commits into libsql:main on Aug 1, 2023

Conversation

@psarna (Contributor) commented Jul 26, 2023

This series adds a few overload protection mechanisms, introduced in separate commits.

  1. A total response size limit, defaulting to 32MiB, which is the global equivalent of max_response_size. My experiments showed that 16MiB is an even safer choice, but 32MiB is much better than nothing.
  2. Concurrency throttling is now dependent on how much memory is currently occupied by response buffers: under high memory pressure, less concurrency is allowed, which is achieved by requiring more semaphore units per request.
  3. Concurrency throttling also monitors the number of waiters on the semaphore. If that number reaches 128, subsequent requests are refused (a combined sketch of (2.) and (3.) follows the next paragraph).
  4. There's a pre-OOM panic check, which relies on the sysinfo crate to read system memory statistics. If less than 10% of memory is available, requests are refused; a minimal sketch follows this list. On Linux, reading system memory stats was empirically measured to cost <40 microseconds, which satisfies the definition of "cheap enough". That would have to be verified for other platforms, especially virtualized ones.
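
For illustration, a minimal sketch of such a check, assuming a 2023-era sysinfo where the memory accessors live on the SystemExt trait and return bytes; the helper name and threshold wiring are hypothetical:

```rust
use sysinfo::{System, SystemExt};

/// Hypothetical pre-OOM guard: refuse new requests when less than 10% of
/// system memory is still available.
fn low_on_memory(sys: &mut System) -> bool {
    // Only re-read memory stats; this is the part measured at <40us on Linux.
    sys.refresh_memory();
    sys.available_memory() < sys.total_memory() / 10
}

fn main() {
    let mut sys = System::new();
    if low_on_memory(&mut sys) {
        eprintln!("refusing request: less than 10% of memory available");
    }
}
```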

As for (1.), one substantial problem is that the global response size counter is decremented too early: right after the response is generated, but before it is sent. Because of that, a sufficiently high number of large requests (e.g. SELECT * on a large table) can still overcommit memory; that problem is alleviated by (2.) and (3.). Ideally we should hold the permit until the memory is actually freed, but I didn't see how to code that elegantly without overhauling lots of interfaces. Suggestions welcome.
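
For illustration, here is a minimal sketch of how (2.) and (3.) could hang together around a tokio weighted semaphore. All names, the constructor, and the 1-to-4 unit scaling are hypothetical, not sqld's actual interfaces:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::{Semaphore, SemaphorePermit};

// Illustrative constants matching the defaults described above.
const MAX_TOTAL_RESPONSE_SIZE: usize = 32 * 1024 * 1024; // mechanism (1.)
const MAX_WAITERS: usize = 128;                          // mechanism (3.)

struct Throttle {
    sem: Semaphore,
    resp_bytes: AtomicUsize, // response bytes currently buffered
    waiters: AtomicUsize,    // requests parked on the semaphore
}

impl Throttle {
    fn new(max_concurrency: usize) -> Self {
        Throttle {
            // At 1 unit per request this allows max_concurrency concurrent
            // requests; at 4 units per request, only a quarter of that.
            sem: Semaphore::new(max_concurrency),
            resp_bytes: AtomicUsize::new(0),
            waiters: AtomicUsize::new(0),
        }
    }

    /// Returns a permit for one request, or None if the server is overloaded.
    async fn admit(&self) -> Option<SemaphorePermit<'_>> {
        // Mechanism (3.): refuse outright once too many requests are queued.
        if self.waiters.load(Ordering::Relaxed) >= MAX_WAITERS {
            return None;
        }
        // Mechanism (2.): the closer the buffered responses are to the global
        // cap, the more semaphore units one request must hold, so effective
        // concurrency shrinks under memory pressure (1 to 4 units here).
        let used = self
            .resp_bytes
            .load(Ordering::Relaxed)
            .min(MAX_TOTAL_RESPONSE_SIZE);
        let units = 1 + (used * 3 / MAX_TOTAL_RESPONSE_SIZE) as u32;

        self.waiters.fetch_add(1, Ordering::Relaxed);
        let permit = self.sem.acquire_many(units).await.ok();
        self.waiters.fetch_sub(1, Ordering::Relaxed);
        permit
    }
}
```

Keeping the returned permit alive inside the response object, so it is dropped only after the body has actually been sent, would be one way to address the early-decrement problem; the sketch above sidesteps that question just as the current code does.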

Each mechanism is subject to discussion, so please voice your opinions, folks. We should add at least one of them to make sqld more robust; in particular, my experiments with highly concurrent large reads showed that only (1.), (2.), and (3.) combined actually prevented OOM.

That said, OOM-ing the process and restarting it is in itself a very nice, though drastic, overload protection mechanism 😇

@MarinPostma (Collaborator) commented:

It looks good to me!

It could also be interesting to experiment with https://www.sqlite.org/sharedcache.html and see how it affects memory pressure and performance.
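
A minimal rusqlite sketch of what that experiment could look like, assuming rusqlite's OpenFlags exposes the shared-cache bit (sqld's own connection setup differs):

```rust
use rusqlite::{Connection, OpenFlags, Result};

/// Open a connection in SQLite shared-cache mode, so connections to the same
/// database within this process share one page cache instead of each holding
/// its own.
fn open_shared(path: &str) -> Result<Connection> {
    Connection::open_with_flags(
        path,
        OpenFlags::SQLITE_OPEN_READ_WRITE
            | OpenFlags::SQLITE_OPEN_CREATE
            | OpenFlags::SQLITE_OPEN_SHARED_CACHE,
    )
}
```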

@MarinPostma (Collaborator) commented:

I do think that refusing requests at 90% memory usage is a bit too conservative; I would maybe set the threshold as a multiple of the max response size. If you have a machine with 100GB of RAM, you can definitely still respond with 10GB of RAM left.

@psarna (Contributor, Author) commented Jul 27, 2023

> I do think that refusing requests at 90% memory usage is a bit too conservative; I would maybe set the threshold as a multiple of the max response size. If you have a machine with 100GB of RAM, you can definitely still respond with 10GB of RAM left.

Yeah, fair enough. Perhaps we should go with something generic like "min(10%, 50MiB)", so that machines with 128 or 256MiB of RAM don't get half of their resources reserved just in case.
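
A sketch of that reserve computation, with hypothetical naming:

```rust
/// Refuse requests only once available memory drops below this reserve:
/// the smaller of 10% of total RAM or 50MiB.
fn reserve_bytes(total_memory: u64) -> u64 {
    const FIFTY_MIB: u64 = 50 * 1024 * 1024;
    (total_memory / 10).min(FIFTY_MIB)
}
// 128MiB machine -> ~12.8MiB reserved (the 10% term wins)
// 100GB machine  -> 50MiB reserved (the 50MiB cap wins)
```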

It isn't applied anywhere yet, but will be in hrana and http result builders.
Once the limit of total response sizes in flight is reached, queries start to fail in order to free memory.
Concurrency is throttled more aggressively if we detect that the total response size is heading towards its predefined maximum.
@penberg (Contributor) left a comment:

lgtm

@penberg added this pull request to the merge queue on Aug 1, 2023
Merged via the queue into libsql:main with commit 09b0b60 on Aug 1, 2023
7 checks passed