Timeline offloading persistence #9444
Conversation
5316 tests run: 5102 passed, 0 failed, 214 skipped (full report)
Flaky tests (4): Postgres 17
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results: 5ad1198 at 2024-10-22T21:02:42.725Z
Force-pushed from b630c9c to b9d33fb
The test has some smaller modifications compared to test_timeline_offloading:

* the method by which we determine the offloaded state
* we do manual offloading all the time
* only one offloaded branch
* a test at the end that deletion of a tenant works (so that timeline deletion deletes the manifest as well)
It's unclear to me how we deal with interrupted offload operations. Or more broadly, tenant manifest conflicts, given that we duplicate the tenant manifest across shards.
The RFC says that shard 0 should be authoritative -- is it worth giving it special treatment, such that we ensure shard 0's manifest is uploaded before the other shards? Do we rely on the offloaded state being eventually consistent, and a failed actor being required to retry until that's the case? I'm not familiar enough with the design to enumerate all of the possible hazards here, but you get my drift.
It would be worth spelling all of this out in the manifest's doc comment, along with safety arguments and appropriate test cases.
There is no cross-shard offload operation: each shard can decide on the offload on its own, which it does during compaction. So that's why one shouldn't have a unified tenant manifest either. I don't think it makes much sense to have only some shards offloaded as an intentional state, there is nothing gained by that. But it's also not an erroneous condition, and most features of the pageserver are per-shard, not per-tenant. We'd otherwise need to make offloading coordinated by the storage controller, something we gain little from. However, this made me think about shard splits. We'll probably need to copy over the manifests, or write them from scratch (and what to do with the pre-split manifests? Probably we can delete them).
I see. Making offloading a shard-local operation seems reasonable. I was confused by the below from the PR description, as well as the RFC's reference to an authoritative manifest:
It seems worthwhile to me to explicitly scope the manifest to a shard rather than a tenant, e.g. by referring to it as a tenant shard manifest (or just shard manifest), and point out that it's perfectly valid (if unexpected) for them to diverge across shards. This would prevent confusion and footguns if we later try to use it as tenant-global. But maybe that's just me -- feel free to get others' opinion on this as well.
Yes, good catch, we'll need to propagate these.
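To illustrate the shard-scoped naming being suggested above, here is a minimal sketch of deriving a manifest key per tenant shard. The types and function here (`TenantShardId`, `manifest_key`) are illustrative assumptions, not the pageserver's actual path helpers.

```rust
/// Illustrative sketch only: the real pageserver has its own remote path
/// helpers; the names below are assumptions for the sake of the example.
pub struct TenantShardId {
    pub tenant_id: String,
    pub shard_number: u8,
    pub shard_count: u8,
}

/// Scoping the manifest key to a single shard means manifests of different
/// shards can diverge (e.g. only some shards have offloaded a timeline)
/// without ever overwriting each other.
pub fn manifest_key(shard: &TenantShardId) -> String {
    format!(
        "tenants/{}-{:02x}{:02x}/tenant-manifest.json",
        shard.tenant_id, shard.shard_number, shard.shard_count
    )
}
```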
This reverts commit d4fec5f.
Also, handle dangling references in the tenant loading code
this made me think about shard splits. We'll probably need to copy over the manifests, or write them from scratch
For posterity: this has been added as a follow-up task on #8088.
We support multiple storage backends now, so remove the `_s3_` from the name. Analogous to the names adopted for tenant manifests added in #9444.
Before, we didn't copy over the `index-part.json` of offloaded timelines to the new shard's location, resulting in the new shard not knowing that the timeline even exists. In #9444, we copy over the manifest, but we also need to do this for `index-part.json`.

As the operations are mostly the same for offloaded and non-offloaded timelines, we can iterate over all of them in the same loop, after introducing a `TimelineOrOffloadedArcRef` type to generalize over the two cases. This is analogous to the deletion code added in #8907.

The added test also ensures that the sharded archival config endpoint works, something that had not previously been covered by tests.

Part of #8088
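As a rough illustration of the pattern described above (not the actual neon code; the surrounding types are simplified stand-ins), such a reference type can be a small enum that lets one loop handle both active and offloaded timelines:

```rust
use std::sync::Arc;

// Simplified stand-ins for the real pageserver types; only the pattern matters.
struct Timeline {
    timeline_id: String,
}
struct OffloadedTimeline {
    timeline_id: String,
}

/// Borrowed reference that generalizes over active and offloaded timelines,
/// so shard-split (and deletion) code can iterate over both in a single loop.
enum TimelineOrOffloadedArcRef<'a> {
    Timeline(&'a Arc<Timeline>),
    Offloaded(&'a Arc<OffloadedTimeline>),
}

impl TimelineOrOffloadedArcRef<'_> {
    fn timeline_id(&self) -> &str {
        match self {
            TimelineOrOffloadedArcRef::Timeline(t) => &t.timeline_id,
            TimelineOrOffloadedArcRef::Offloaded(t) => &t.timeline_id,
        }
    }
}
```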
Persist timeline offloaded state to S3.
Right now, as of #8907, all offloaded state is lost at each restart of the pageserver, so we load the full timeline again. As it starts with an empty local directory, we may download some files again, leading to wasteful downloads.
This patch adds support for persisting the offloaded state, allowing us to never load offloaded timelines in the first place. Persistence works via a new tenant-global file in S3 that contains a list of all offloaded timelines. It is updated each time we offload or unoffload a timeline, and is otherwise never touched.
This choice means that tenants where no offloading is happening will not immediately get a manifest, keeping the change very minimal at the start.
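To make the mechanism concrete, here is a minimal sketch of what such a manifest could contain; the struct and field names are assumptions for illustration, not the exact format this PR introduces.

```rust
use serde::{Deserialize, Serialize};

/// Illustrative sketch only: uploaded to S3 whenever a timeline is offloaded
/// or unoffloaded, and read at tenant attach so offloaded timelines are
/// skipped during loading. Field names are assumptions.
#[derive(Serialize, Deserialize)]
struct TenantManifest {
    /// Timelines that are currently offloaded; anything not listed here is
    /// treated as a normal, loadable timeline at startup.
    offloaded_timelines: Vec<OffloadedTimelineEntry>,
}

#[derive(Serialize, Deserialize)]
struct OffloadedTimelineEntry {
    timeline_id: String,
    ancestor_timeline_id: Option<String>,
}
```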
We leave generation support for future work. It is important to support generations: in the worst case, the manifest could be overwritten by an older generation after a timeline has been unoffloaded (and unarchived), so the next pageserver process might wrongly believe that a timeline is still offloaded even though it should be active.
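As a hedged sketch of one way generation support could look (mirroring how other remote files are generation-suffixed; the exact scheme below is an assumption, not what this PR implements):

```rust
/// Hypothetical: suffix the manifest key with the generation number so an
/// older pageserver generation can never clobber a newer manifest in place;
/// readers would pick the entry with the highest generation.
fn manifest_key_with_generation(base: &str, generation: u32) -> String {
    format!("{base}-{generation:08x}")
}
```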
Part of #9386, #8088