Skip to content
This repository has been archived by the owner on Oct 18, 2023. It is now read-only.

turn off sqlite auto_checkpoint and introduce period-based checkpoint calls #547

Merged
merged 12 commits into from
Aug 22, 2023

Conversation

Horusiath
Copy link
Contributor

This PR turns off auto_checkpoint in sqld-embed SQLite instance (which by default is every 1000 pages) and introduces a fiber loop, which checkpoints given configured time interval (by default it's every 1h).

@Horusiath Horusiath requested a review from psarna July 24, 2023 11:38
@Horusiath Horusiath force-pushed the bart/sqld-interval-checkpoint branch from 2216505 to ce6b596 Compare July 24, 2023 12:45
sqld/src/lib.rs Show resolved Hide resolved
sqld/src/replication/primary/logger.rs Outdated Show resolved Hide resolved
sqld/src/replication/primary/logger.rs Outdated Show resolved Hide resolved
@psarna
Copy link
Contributor

psarna commented Jul 25, 2023

@MarinPostma this commit makes sqld take over checkpointing, which is what we discussed a few times in context of deduplicating WAL. I'd appreciate you take a look to see if it doesn't interfere with replication in any way (it shouldn't, but worth double-checking)

psarna
psarna previously requested changes Jul 25, 2023
Copy link
Contributor

@psarna psarna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Horusiath also, we should turn off auto_checkpoint in every single database connection that opens, because by default, each connection opens with autocheckpoint set to 1000 pages. So, all places that call libsql_open, like

let rc = rusqlite::ffi::libsql_open_v2(
, need to run sqlite3_wal_autocheckpoint(db, 0) as well. If we miss even one, it will cause a checkpoint to happen eventually

@MarinPostma
Copy link
Collaborator

we should limit the WAL size as well. The disk size on the platform is rather, and the WAL can grow very fast

sqld/src/lib.rs Show resolved Hide resolved
sqld/src/lib.rs Outdated Show resolved Hide resolved
sqld/src/replication/primary/logger.rs Outdated Show resolved Hide resolved
"Checkpoint failed: database journal_mode is not WAL"
))
} else {
conn.execute("VACUUM", ())?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure a vacuum is necessary on every checkpoint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the original code. I'm not sure, what would be the best course of action here. We can leave VACUUM as an explicit request from the user side.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was different, it was meant for replication log recovery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'll just remove it and let users perform it explicitly. It shouldn't be a problem most of the time.

@Horusiath Horusiath requested a review from psarna July 25, 2023 12:13
@Horusiath Horusiath force-pushed the bart/sqld-interval-checkpoint branch from eaacb8c to 075c370 Compare July 26, 2023 12:17
@psarna
Copy link
Contributor

psarna commented Jul 31, 2023

@Horusiath also, we should turn off auto_checkpoint in every single database connection that opens, because by default, each connection opens with autocheckpoint set to 1000 pages. So, all places that call libsql_open, like

let rc = rusqlite::ffi::libsql_open_v2(

, need to run sqlite3_wal_autocheckpoint(db, 0) as well. If we miss even one, it will cause a checkpoint to happen eventually

@Horusiath was the comment about optionally turning off autocheckpoint for every single connection that we open already applied, or not yet?

@Horusiath
Copy link
Contributor Author

@psarna question would be - should we turn off auto_checkpoint by default on connection open? In general this only happens if checkpoint interval and bottomless are set up. Given that these can be set either as env vars or command line params, it gets extremely tedious to pass that information down to connection open.

@psarna
Copy link
Contributor

psarna commented Aug 1, 2023

@Horusiath to the best of my knowledge we open db connections in two main places for user traffic:

[sarna@sarna-pc sqld]$ git grep sqld_libsql_bindings::Connection::open\(
sqld/src/database/libsql.rs:    sqld_libsql_bindings::Connection::open(path, flags, wal_methods, hook_ctx)
sqld/src/replication/replica/injector.rs:        let conn = sqld_libsql_bindings::Connection::open(

and two places for our own maintenance ops:

[sarna@sarna-pc sqld]$ git grep rusqlite::Connection::open\(
sqld/src/main.rs:    let conn = rusqlite::Connection::open(db_path.join("data"))?;
sqld/src/replication/primary/logger.rs:        let conn = rusqlite::Connection::open(data_path)?;

, and moreover, I think we only ever intend to replicate to S3 from the primary, so the connection open in replica/injector.rs can be ignored for now.

That leaves 3 places to handle, and I think all of them at some point have access to our Config instance, which contains all the information we need to determine if we should disable autocheckpoint or not.

Since autocheckpoint is a per-connection option, and we want to control how often checkpoints happen, I don't see any other option rather than consistently disable it if needed for each connection that we open. There's also an SQLite compilation flag that allows changing the default threshold, but it's not that interesting from our point of view, since we need to be able to determine the autocheckpoint threshold from command-line parameters. Another option to consider would be to contribute to libSQL and add a global option that sets the default WAL autocheckpoint for new connections, but that sounds a little too specific to be generally useful, so I'm not sure. In any case, we need to be able to consistently disable autocheckpointing for all connections that we create, otherwise we'd have spurious checkpoints that we don't want. /cc @haaawk

@Horusiath
Copy link
Contributor Author

Horusiath commented Aug 1, 2023

Generally, sqld_libsql_bindings::Connection::open is used by:

  1. replica::injector::FrameInjector - leave auto_checkpoint to default?
  2. As part of libsql::open_db which is used in few different places:
    • storage_monitor which is read only (so we can leave autocheckpoint untouched)
    • DumpLoader which can execute any arbitrary scripts.
    • libsql::Connection<'a> which can be called from anywhere and doesn't seem to have access to bottomless config

@psarna
Copy link
Contributor

psarna commented Aug 11, 2023

@Horusiath after giving it some more thought, I think the following makes sense:

replica::injector::FrameInjector - leave auto_checkpoint to default?

yes, replicas don't use bottomless anyway

As part of libsql::open_db which is used in few different places:
storage_monitor which is read only (so we can leave autocheckpoint untouched)

yes

DumpLoader which can execute any arbitrary scripts.

does it only ever get called on startup, before bottomless is initialized? If so, we can leave autocheckpoint on, or always turn it off, I don't care much either way

libsql::Connection<'a> which can be called from anywhere and doesn't seem to have access to bottomless config

here's the only place where we actually need to push the configuration option, so that we know if we should enable or disable autocheckpoint, and to which value

... and let me also reiterate my mantra that the most important bit is to make sure that the checkpointing fiber uses a connection with our WAL methods from primary/logger.rs registered, so that it calls bottomless code on checkpoint.

@MarinPostma
Copy link
Collaborator

MarinPostma commented Aug 11, 2023

does it only ever get called on startup, before bottomless is initialized? If so, we can leave autocheckpoint on, or always turn it off, I don't care much either way

actually, we probably shouldn't even for that, maybe a handle to manually trigger checkpoints that we are sure bottomless will pick up. But we can't really completely turn it off either, because a dump can be very large

@Horusiath
Copy link
Contributor Author

@psarna could you review again?

sqld/src/main.rs Outdated Show resolved Hide resolved
sqld/src/main.rs Outdated Show resolved Hide resolved
sqld/src/lib.rs Outdated Show resolved Hide resolved
sqld/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@haaawk haaawk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good except for one place when we're using QueryBuilderConfig::default in libsql.rs

@Horusiath Horusiath force-pushed the bart/sqld-interval-checkpoint branch from bb890d6 to c2185e6 Compare August 22, 2023 10:45
@@ -192,7 +192,7 @@ impl Default for Options {
Options {
create_bucket_if_not_exists: true,
verify_crc: true,
use_compression: CompressionKind::None,
use_compression: CompressionKind::Gzip,
Copy link
Collaborator

@haaawk haaawk Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this change related to auto checkpointing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, I've left it when started running comparisons with Litestream. Keeping non-compressed WAL frames around is extremely expensive and we definitely shouldn't do that by default.

@haaawk haaawk dismissed psarna’s stale review August 22, 2023 12:35

Changes were made by Bartosz to address the comments

@Horusiath Horusiath added this pull request to the merge queue Aug 22, 2023
Merged via the queue into libsql:main with commit a61dffa Aug 22, 2023
7 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants