This repository has been archived by the owner on Oct 18, 2023. It is now read-only.

bottomless: recover when a checkpoint happens outside of bottomless replication control #597

Open
psarna opened this issue Aug 11, 2023 · 2 comments

Comments

@psarna
Contributor

psarna commented Aug 11, 2023

The current bottomless replication implementation depends heavily on the fact that we control checkpoints: it replicates data straight from the WAL file, so it needs to know when a checkpoint happens in order to make sure that everything gets replicated and that its own metadata gets updated.
If a checkpoint happens outside of bottomless replication control, e.g. one triggered by another database connection that doesn't use the bottomless virtual WAL methods, we can see a log entry like this:

2023-08-11T08:42:25.917655Z ERROR bottomless::replicator: [BUG] Local max valid frame is 0, while replicator thinks it's 10

Right now bottomless just logs the error and continues, but perhaps we should consider a more robust mechanism, e.g. marking the current generation as potentially corrupt/partial and creating a new one ASAP, so that we can always restore the state safely.
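
To make the proposal concrete, here is a minimal sketch of the "mark as partial and start a new generation" idea. It is not bottomless code: `Generation`, `Replicator`, `last_replicated_frame`, `retired` and `reconcile` are hypothetical names made up for illustration, and the real work of starting a generation (uploading state, updating metadata) is left out entirely.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Generation:
    # Hypothetical stand-in for a bottomless generation.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    partial: bool = False


@dataclass
class Replicator:
    # Hypothetical stand-in for the replicator's own bookkeeping.
    generation: Generation = field(default_factory=Generation)
    last_replicated_frame: int = 0
    retired: list = field(default_factory=list)

    def reconcile(self, local_max_valid_frame: int) -> bool:
        """Return True if an out-of-band checkpoint was detected and handled."""
        if local_max_valid_frame >= self.last_replicated_frame:
            # The local WAL still contains everything we replicated.
            return False
        # The WAL shrank behind our back: flag the current generation
        # instead of only logging, then switch to a fresh one.
        self.generation.partial = True
        self.retired.append(self.generation)
        self.generation = Generation()
        self.last_replicated_frame = 0
        return True
```

With the numbers from the log line above, a replicator whose `last_replicated_frame` is 10 and that observes a local max valid frame of 0 would flag its generation and move on to a new one. Note that this check only catches the case where the WAL is shorter than expected; a WAL that was reset and then grew back past the replicated frame count would need a different signal.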

Opinions? cc @Horusiath @MarinPostma

NOTE: There's a separate sub-issue here: we saw the log error above in sqld itself, which wasn't supposed to happen -- perhaps we have a connection somewhere that didn't properly disable wal_autocheckpoint?

@psarna
Contributor Author

psarna commented Aug 11, 2023

One way to trigger such a state manually is to run sqld --enable-bottomless-replication, inject some data, and then open a separate shell connection directly on the data file, e.g. sqlite3 data.sqld/dbs/default/data, and run PRAGMA wal_checkpoint(TRUNCATE) on it.
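
The same effect can be shown without sqld at all. The sketch below uses only Python's standard sqlite3 module and a throwaway database in a temporary directory (the path and the function name `truncate_checkpoint_demo` are made up for this example): one connection writes to the WAL, a second plain connection runs the TRUNCATE checkpoint, and the WAL file is emptied underneath the first connection. It says nothing about bottomless itself, it only illustrates why frames the replicator counted can disappear.

```python
import os
import sqlite3
import tempfile


def truncate_checkpoint_demo():
    db_path = os.path.join(tempfile.mkdtemp(), "data")
    wal_path = db_path + "-wal"

    # Stands in for the connection that replication would be watching.
    main = sqlite3.connect(db_path, isolation_level=None)
    main.execute("PRAGMA journal_mode=WAL")
    main.execute("PRAGMA wal_autocheckpoint=0")  # no automatic checkpoints here
    main.execute("CREATE TABLE t (x INTEGER)")
    main.execute("INSERT INTO t VALUES (1)")
    wal_before = os.path.getsize(wal_path)

    # A plain side connection, like the sqlite3 shell in the comment above.
    side = sqlite3.connect(db_path, isolation_level=None)
    busy = side.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
    side.close()
    wal_after = os.path.getsize(wal_path)

    main.close()
    return wal_before, wal_after, busy


wal_before, wal_after, busy = truncate_checkpoint_demo()
print(f"WAL bytes before: {wal_before}, after: {wal_after}, busy: {busy}")
```

After the side connection's checkpoint the WAL file is zero bytes long, even though the first connection never asked for a checkpoint.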

@psarna
Contributor Author

psarna commented Aug 11, 2023

Ok, update: since neither #547 nor #574 has been applied yet, we don't really disable autocheckpoint on connections. That means we often perform a checkpoint outside of bottomless control, which explains why we see the error in the logs from time to time.
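
For reference, autocheckpoint is a per-connection setting in SQLite, so it has to be turned off on every connection that touches the database. The sketch below shows the idea with Python's standard sqlite3 module; the helper names `open_without_autocheckpoint` and `autocheckpoint_threshold` are made up for this example, and it makes no claim about how #547 or #574 do this inside sqld.

```python
import os
import sqlite3
import tempfile


def open_without_autocheckpoint(path):
    conn = sqlite3.connect(path, isolation_level=None)
    # A value of 0 (or any non-positive value) turns autocheckpoint off,
    # but only for this connection.
    conn.execute("PRAGMA wal_autocheckpoint=0")
    return conn


def autocheckpoint_threshold(conn):
    # Reading the pragma back returns the current threshold in pages.
    return conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]


path = os.path.join(tempfile.mkdtemp(), "data")
conn = open_without_autocheckpoint(path)
print(autocheckpoint_threshold(conn))
conn.close()
```

Any connection opened without this step keeps the build's default threshold and can still checkpoint on its own once the WAL grows past it.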

It's also very important to remember that the database connection that performs our periodic checkpoint must use the bottomless WAL methods -- a regular db connection is not enough, since a checkpoint made through it won't trigger our custom replication code, which runs in the on_checkpoint callback.
