
A few overload protection mechanisms #552

Merged
merged 7 commits into libsql:main on Aug 1, 2023

Conversation

@psarna (Contributor) commented Jul 26, 2023

This series adds a few overload protection mechanisms, introduced in separate commits.

  1. A total response size limit, defaulting to 32MiB, which is the global equivalent of max_response_size. My experiments showed that 16MiB is an even safer choice, but 32MiB is much better than nothing.
  2. Concurrency throttling is now dependent on how much memory is currently occupied by response buffers: under high memory pressure, less concurrency is allowed, which is achieved by requiring more semaphore units per request.
  3. Concurrency throttling also monitors the number of waiters on the semaphore. If that number reaches 128, subsequent requests are refused (a combined sketch of (2.) and (3.) follows the next paragraph).
  4. There's a pre-OOM panic check, which relies on the sysinfo crate to read system memory statistics. If less than 10% of memory is available, requests are refused; a minimal sketch follows this list. On Linux, reading system memory stats was empirically measured to cost <40 microseconds, which satisfies the definition of "cheap enough". That would have to be verified for other platforms, especially virtualized ones.
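
For illustration, a minimal sketch of such a check, assuming a 2023-era sysinfo where the memory accessors live on the SystemExt trait and return bytes; the helper name and threshold wiring are hypothetical:

```rust
use sysinfo::{System, SystemExt};

/// Hypothetical pre-OOM guard: refuse new requests when less than 10% of
/// system memory is still available.
fn low_on_memory(sys: &mut System) -> bool {
    // Only re-read memory stats; this is the part measured at <40us on Linux.
    sys.refresh_memory();
    sys.available_memory() < sys.total_memory() / 10
}

fn main() {
    let mut sys = System::new();
    if low_on_memory(&mut sys) {
        eprintln!("refusing request: less than 10% of memory available");
    }
}
```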

As for (1.), one substantial problem is that the global response size counter is decremented too early: right after the response is generated, but before it is sent. Because of that, a sufficiently high number of large requests (e.g. SELECT * on a large table) can still overcommit memory; that problem is alleviated by (2.) and (3.). Ideally we should hold the permit until the memory is actually freed, but I didn't see how to code that elegantly without overhauling lots of interfaces. Suggestions welcome.
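
For illustration, here is a minimal sketch of how (2.) and (3.) could hang together around a tokio weighted semaphore. All names, the constructor, and the 1-to-4 unit scaling are hypothetical, not sqld's actual interfaces:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::{Semaphore, SemaphorePermit};

// Illustrative constants matching the defaults described above.
const MAX_TOTAL_RESPONSE_SIZE: usize = 32 * 1024 * 1024; // mechanism (1.)
const MAX_WAITERS: usize = 128;                          // mechanism (3.)

struct Throttle {
    sem: Semaphore,
    resp_bytes: AtomicUsize, // response bytes currently buffered
    waiters: AtomicUsize,    // requests parked on the semaphore
}

impl Throttle {
    fn new(max_concurrency: usize) -> Self {
        Throttle {
            // At 1 unit per request this allows max_concurrency concurrent
            // requests; at 4 units per request, only a quarter of that.
            sem: Semaphore::new(max_concurrency),
            resp_bytes: AtomicUsize::new(0),
            waiters: AtomicUsize::new(0),
        }
    }

    /// Returns a permit for one request, or None if the server is overloaded.
    async fn admit(&self) -> Option<SemaphorePermit<'_>> {
        // Mechanism (3.): refuse outright once too many requests are queued.
        if self.waiters.load(Ordering::Relaxed) >= MAX_WAITERS {
            return None;
        }
        // Mechanism (2.): the closer the buffered responses are to the global
        // cap, the more semaphore units one request must hold, so effective
        // concurrency shrinks under memory pressure (1 to 4 units here).
        let used = self
            .resp_bytes
            .load(Ordering::Relaxed)
            .min(MAX_TOTAL_RESPONSE_SIZE);
        let units = 1 + (used * 3 / MAX_TOTAL_RESPONSE_SIZE) as u32;

        self.waiters.fetch_add(1, Ordering::Relaxed);
        let permit = self.sem.acquire_many(units).await.ok();
        self.waiters.fetch_sub(1, Ordering::Relaxed);
        permit
    }
}
```

Keeping the returned permit alive inside the response object, so it is dropped only after the body has actually been sent, would be one way to address the early-decrement problem; the sketch above sidesteps that question just as the current code does.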

Each mechanism is subject to discussion, so please voice your opinions, folks. We should add at least one of them to make sqld more robust; in particular, my experiments with highly concurrent large reads showed that only (1.), (2.), and (3.) combined actually prevented OOM.

That said, OOM-ing the process and restarting it is in itself a very nice, though drastic, overload protection mechanism 😇

@MarinPostma (Collaborator) commented:

It looks good to me!

It could also be interesting to experiment with https://www.sqlite.org/sharedcache.html and see how it affects memory pressure and performance.
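
A minimal rusqlite sketch of what that experiment could look like, assuming rusqlite's OpenFlags exposes the shared-cache bit (sqld's own connection setup differs):

```rust
use rusqlite::{Connection, OpenFlags, Result};

/// Open a connection in SQLite shared-cache mode, so connections to the same
/// database within this process share one page cache instead of each holding
/// its own.
fn open_shared(path: &str) -> Result<Connection> {
    Connection::open_with_flags(
        path,
        OpenFlags::SQLITE_OPEN_READ_WRITE
            | OpenFlags::SQLITE_OPEN_CREATE
            | OpenFlags::SQLITE_OPEN_SHARED_CACHE,
    )
}
```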

@MarinPostma (Collaborator) commented:

I do think that refusing requests at 90% memory usage is a bit too conservative; I would maybe set the threshold as a multiple of the max response size. If you have a machine with 100GB of RAM, you can definitely still respond with 10GB of RAM left.

@psarna (Contributor, Author) commented Jul 27, 2023

> I do think that refusing requests at 90% memory usage is a bit too conservative; I would maybe set the threshold as a multiple of the max response size. If you have a machine with 100GB of RAM, you can definitely still respond with 10GB of RAM left.

Yeah, fair enough. Perhaps we should go with something generic like "min(10%, 50MiB)", so that machines with 128 or 256MiB of RAM don't get half of their resources reserved just in case.
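
A sketch of that reserve computation, with hypothetical naming:

```rust
/// Refuse requests only once available memory drops below this reserve:
/// the smaller of 10% of total RAM or 50MiB.
fn reserve_bytes(total_memory: u64) -> u64 {
    const FIFTY_MIB: u64 = 50 * 1024 * 1024;
    (total_memory / 10).min(FIFTY_MIB)
}
// 128MiB machine -> ~12.8MiB reserved (the 10% term wins)
// 100GB machine  -> 50MiB reserved (the 50MiB cap wins)
```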

It isn't applied anywhere yet, but will be in hrana and http result builders.
Once the limit of total response sizes in flight is reached, queries start to fail in order to free memory.
Concurrency is throttled more aggressively if we detect that the total response size is heading towards its predefined maximum.
@penberg (Contributor) left a comment:

lgtm

@penberg added this pull request to the merge queue on Aug 1, 2023
Merged via the queue into libsql:main with commit 09b0b60 on Aug 1, 2023
7 checks passed