Storage & Compute release 2024-09-23 #9095
Merged
Conversation
This is a prerequisite for #8681.
If the primary is started at an LSN within the first page of a 16 MB WAL segment, the "long XLOG page header" at the beginning of the segment was not initialized correctly. That has gone unnoticed, because under normal circumstances nothing looks at the page header. The WAL that is streamed to the safekeepers starts at the new record's LSN, not at the beginning of the page, so the bogus page header didn't propagate elsewhere, and a primary server doesn't normally read the WAL it has written. That is just as well, because the contents of the page would be bogus anyway: it wouldn't contain any of the records before the LSN where the new record is written. Except that in the following cases a primary does read its own WAL:

1. When there are two-phase transactions in prepared state at checkpoint. The checkpointer reads the two-phase state from the XLOG_XACT_PREPARE record and writes it to a file in pg_twophase/.
2. Logical decoding reads the WAL starting from the replication slot's restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's sufficient to initialize the page header correctly. The checkpointer only needs to read XLOG_XACT_PREPARE records that were generated after server startup, so it's still OK that older WAL is missing or bogus. I have not investigated whether we have a problem with logical decoding, however; let's deal with that separately.

Special thanks to @Lzjing-1997, who independently found the same bug and opened a PR to fix it, although I did not use that PR.
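For illustration, here is a minimal sketch, not the actual fix, of what the long header at the segment start has to carry. The field names mirror PostgreSQL's `XLogLongPageHeaderData`, but the helper and the magic value are placeholders:

```rust
/// Flag bit in `xlp_info` marking the long header on a segment's first page.
const XLP_LONG_HEADER: u16 = 0x0002;

/// Mirrors PostgreSQL's XLogLongPageHeaderData (standard + long portion).
struct LongPageHeader {
    xlp_magic: u16,       // WAL page magic; the real value depends on the PG version
    xlp_info: u16,        // must include XLP_LONG_HEADER on the segment's first page
    xlp_tli: u32,         // timeline ID
    xlp_pageaddr: u64,    // LSN of the page start, i.e. the segment boundary,
                          // NOT the LSN where the first new record is written
    xlp_rem_len: u32,     // bytes of a record continued from the previous page; 0 here
    xlp_sysid: u64,       // system identifier, must match pg_control
    xlp_seg_size: u32,    // WAL segment size (16 MB by default)
    xlp_xlog_blcksz: u32, // WAL block size (8 kB by default)
}

/// Hypothetical helper: build the header for the segment containing `start_lsn`.
fn long_header_for(start_lsn: u64, timeline_id: u32, system_id: u64) -> LongPageHeader {
    const SEG_SIZE: u64 = 16 * 1024 * 1024;
    LongPageHeader {
        xlp_magic: 0, // placeholder; use the version-specific XLOG_PAGE_MAGIC
        xlp_info: XLP_LONG_HEADER,
        xlp_tli: timeline_id,
        // The page address is the segment start, not the startup LSN itself.
        xlp_pageaddr: start_lsn - (start_lsn % SEG_SIZE),
        xlp_rem_len: 0,
        xlp_sysid: system_id,
        xlp_seg_size: SEG_SIZE as u32,
        xlp_xlog_blcksz: 8192,
    }
}

fn main() {
    // e.g. a primary starting at an LSN inside the first page of segment 0x1/01000000
    let hdr = long_header_for(0x0000_0001_0100_0150, 1, 42);
    assert_eq!(hdr.xlp_pageaddr, 0x0000_0001_0100_0000);
}
```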
We frequently mess up our submodule references. This adds one safeguard: it checks that the submodule references are only updated "forwards", not to an older commit or to a commit that is not a descendant of the previous one. As a next step, I'm thinking that we should automate things so that when you merge a PR to the 'neon' repository that updates the submodule references, the REL_*_STABLE_neon branches are automatically updated to match. That way you never need to manually merge PRs in the postgres repository; it's all triggered from commits in the 'neon' repository. But that's not included here.
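A minimal sketch of how such a forwards-only check can be expressed (the actual CI script may look different). It relies on `git merge-base --is-ancestor`, which exits with status 0 exactly when the first commit is an ancestor of (or equal to) the second:

```rust
use std::process::Command;

/// Returns true if `new_sha` is a descendant of (or equal to) `old_sha`
/// inside the given submodule checkout, i.e. the update is a fast-forward.
fn is_forward_update(submodule_path: &str, old_sha: &str, new_sha: &str) -> std::io::Result<bool> {
    let status = Command::new("git")
        .args([
            "-C", submodule_path,
            "merge-base", "--is-ancestor",
            old_sha, new_sha,
        ])
        .status()?;
    Ok(status.success())
}
```

In CI this would be run for each submodule path touched by the PR, failing the job whenever the check returns false.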
We were already building v0.4.0 as an indirect dependency, so this avoids having to build two different versions of it.
To eliminate one version of it from our dependency tree.
cargo update ciborium iana-time-zone lazy_static schannel uuid
cargo update [email protected]
cargo update --precise 2.9.7 ureq

It might be worthwhile to just update all our dependencies at some point, but this is aimed at pruning the dependency tree, to make the build a little faster. That's also why I didn't update ureq to the latest version: that would have added a dependency on yet another version of rustls.
These were not referenced in any of the other Cargo.toml files in the workspace. They were not being built because of that, so there was little harm in having them listed, but let's be tidy.
vipvap requested review from knizhnik and arpad-m and removed request for a team on September 23, 2024 06:05
5012 tests run: 4848 passed, 0 failed, 164 skipped (full report)
Flaky tests (3): Postgres 17, Postgres 14
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results: fc67f8d at 2024-09-24T14:05:57.279Z
## Problem

Previously, the storage controller could send compute notifications containing stale pageservers (i.e. the pageserver serving the shard had been detached). This happened because detaches did not update the compute hook state.

## Summary of Changes

Update the compute hook state on shard detach.

Fixes #8928
Not used in production, but in benchmarks, to demonstrate minimal RTT. (It would be nice not to have to copy the 8KiB of zeroes, but that would require larger protocol changes.) Found this useful in the investigation of #8952.
(#8935) Refs #8184, stacked atop #8934. This PR changes from ignoring the config field to rejecting configs that contain it. PR neondatabase/infra#1903 removes the field usage from `pageserver.toml`; it rolls into prod sooner than, or in the same release as, this PR.
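As a rough sketch of the "reject instead of ignore" approach (the field and type names here are made up, not the pageserver's actual config code), assuming serde-based TOML parsing with the `serde`, `toml`, and `anyhow` crates:

```rust
use serde::Deserialize;

// Hypothetical config struct: keep the removed knob in the schema so that its
// presence can be turned into a clear, early error instead of being ignored.
#[derive(Deserialize)]
struct PageserverToml {
    listen_pg_addr: Option<String>,
    #[serde(default)]
    some_removed_knob: Option<bool>, // deprecated; any value is an error now
}

fn parse_config(text: &str) -> anyhow::Result<PageserverToml> {
    let cfg: PageserverToml = toml::from_str(text)?;
    if cfg.some_removed_knob.is_some() {
        anyhow::bail!("`some_removed_knob` is no longer supported; remove it from pageserver.toml");
    }
    Ok(cfg)
}
```

Rejecting at load time surfaces the stale setting immediately at startup, rather than letting it be silently ignored.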
There is a discrepancy with the spec: it has PUT.
Part of #8002. Close #8920.

Legacy compaction (as well as gc-compaction) relies on the GC process to remove unused layer files, but that depends on many factors (e.g., key partitioning) to ensure that data in a dropped table can eventually be removed. In gc-compaction, we consider the keyspace information when doing the compaction process. If a key is not in the keyspace, we skip it and do not include it in the final output. However, this is not easy to implement, because gc-compaction considers branch points (i.e., retain_lsns) and the retained keyspaces could change across different LSNs. Therefore, for now, we only remove aux v1 keys in the compaction process.

## Summary of changes

* Add `FilterIterator` to filter out keys (see the sketch below).
* Integrate `FilterIterator` with gc-compaction.
* Add `collect_gc_compaction_keyspace` for a spec of keyspaces that can be retained during the gc-compaction process.

---------

Signed-off-by: Alex Chi Z <[email protected]>
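The sketch mentioned in the list above: a simplified stand-in for `FilterIterator` (the key, LSN, and value types here are placeholders, not the pageserver's real ones) that drops entries whose key lies outside the retained keyspace:

```rust
use std::ops::Range;

type Key = u64; // placeholder for the pageserver's key type
type Lsn = u64; // placeholder for the pageserver's LSN type

/// Wraps a merged (key, lsn, value) stream and yields only entries whose key
/// falls inside one of the retained keyspace ranges.
struct FilterIterator<I> {
    inner: I,
    retained: Vec<Range<Key>>, // e.g. collected from the retained keyspaces
}

impl<I, V> Iterator for FilterIterator<I>
where
    I: Iterator<Item = (Key, Lsn, V)>,
{
    type Item = (Key, Lsn, V);

    fn next(&mut self) -> Option<Self::Item> {
        let retained = &self.retained;
        // Skip entries whose key lies outside every retained range
        // (e.g. aux v1 keys that the compaction should drop).
        self.inner
            .find(|(key, _, _)| retained.iter().any(|r| r.contains(key)))
    }
}

fn main() {
    let items = vec![(1u64, 10u64, "a"), (5, 10, "b"), (9, 10, "c")];
    let filtered = FilterIterator {
        inner: items.into_iter(),
        retained: vec![0..2, 8..16],
    };
    // Key 5 lies outside every retained range and is dropped.
    assert_eq!(filtered.map(|(k, _, _)| k).collect::<Vec<_>>(), vec![1, 9]);
}
```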
It was pretty noisy. It changed from debug to info level in commit 78938d1, but I believe that was not on purpose.
The metrics include a histogram of how long we need to wait for a GetPage request, the number of reconnects, and the number of requests, among other things. The metrics are not yet exported anywhere, but you can query them manually.

Note: This does *not* bump the default version of the 'neon' extension. We will do that later, as a separate PR. The reason is that this allows us to roll back the compute image smoothly, if necessary. Once the image that includes the new extension .so file with the new functions has been rolled out, and we're confident that we don't need to roll back the image anymore, we can change the default extension version and actually start using the new functions and views.

This is what the view looks like:

```
postgres=# select * from neon_perf_counters ;
                metric                 | bucket_le |  value
---------------------------------------+-----------+----------
 getpage_wait_seconds_count            |           |      300
 getpage_wait_seconds_sum              |           | 0.048506
 getpage_wait_seconds_bucket           |     2e-05 |        0
 getpage_wait_seconds_bucket           |     3e-05 |        0
 getpage_wait_seconds_bucket           |     6e-05 |       71
 getpage_wait_seconds_bucket           |    0.0001 |      124
 getpage_wait_seconds_bucket           |    0.0002 |      248
 getpage_wait_seconds_bucket           |    0.0003 |      279
 getpage_wait_seconds_bucket           |    0.0006 |      297
 getpage_wait_seconds_bucket           |     0.001 |      298
 getpage_wait_seconds_bucket           |     0.002 |      298
 getpage_wait_seconds_bucket           |     0.003 |      298
 getpage_wait_seconds_bucket           |     0.006 |      300
 getpage_wait_seconds_bucket           |      0.01 |      300
 getpage_wait_seconds_bucket           |      0.02 |      300
 getpage_wait_seconds_bucket           |      0.03 |      300
 getpage_wait_seconds_bucket           |      0.06 |      300
 getpage_wait_seconds_bucket           |       0.1 |      300
 getpage_wait_seconds_bucket           |       0.2 |      300
 getpage_wait_seconds_bucket           |       0.3 |      300
 getpage_wait_seconds_bucket           |       0.6 |      300
 getpage_wait_seconds_bucket           |         1 |      300
 getpage_wait_seconds_bucket           |         2 |      300
 getpage_wait_seconds_bucket           |         3 |      300
 getpage_wait_seconds_bucket           |         6 |      300
 getpage_wait_seconds_bucket           |        10 |      300
 getpage_wait_seconds_bucket           |        20 |      300
 getpage_wait_seconds_bucket           |        30 |      300
 getpage_wait_seconds_bucket           |        60 |      300
 getpage_wait_seconds_bucket           |       100 |      300
 getpage_wait_seconds_bucket           |  Infinity |      300
 getpage_prefetch_requests_total       |           |       69
 getpage_sync_requests_total           |           |      231
 getpage_prefetch_misses_total         |           |        0
 getpage_prefetch_discards_total       |           |        0
 pageserver_requests_sent_total        |           |      323
 pageserver_requests_disconnects_total |           |        0
 pageserver_send_flushes_total         |           |      323
 file_cache_hits_total                 |           |        0
(39 rows)
```
## Problem

An LFC cache entry is a chunk (currently the chunk size is 1 MB). The LFC statistics show the number of chunks, but not the number of used pages, and the autoscaling team wants to know how sparse the LFC is: https://neondb.slack.com/archives/C04DGM6SMTM/p1726782793595969

It is possible to obtain this with `select count(*) from local_cache`, but that is an expensive operation that enumerates all entries in the LFC under a lock.

## Summary of changes

This PR adds "file_cache_used_pages" to the `neon_lfc_stats` view:

```
select * from neon_lfc_stats;
        lfc_key        | lfc_value
-----------------------+-----------
 file_cache_misses     |   3139029
 file_cache_hits       |   4098394
 file_cache_used       |      1024
 file_cache_writes     |   3173728
 file_cache_size       |      1024
 file_cache_used_pages |     25689
(6 rows)
```

Please note that this PR doesn't change the neon extension API, so there is no need to create a new version of the Neon extension.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

Co-authored-by: Konstantin Knizhnik <[email protected]>
…g names (#9105)

The warnings:

warning: elided lifetime has a name
  --> pageserver/src/metrics.rs:1386:29
   |
1382 | pub(crate) fn start_timer<'c: 'a, 'a>(
   |                               -- lifetime `'a` declared here
...
1386 | ) -> Option<impl Drop + '_> {
   |                         ^^ this elided lifetime gets resolved as `'a`
   |
   = note: `#[warn(elided_named_lifetimes)]` on by default

warning: elided lifetime has a name
  --> pageserver/src/metrics.rs:1537:46
   |
1534 | pub(crate) fn start_recording<'c: 'a, 'a>(
   |                                   -- lifetime `'a` declared here
...
1537 | ) -> BasebackupQueryTimeOngoingRecording<'_, '_> {
   |                                          ^^ this elided lifetime gets resolved as `'a`

warning: elided lifetime has a name
  --> pageserver/src/metrics.rs:1537:50
   |
1534 | pub(crate) fn start_recording<'c: 'a, 'a>(
   |                                   -- lifetime `'a` declared here
...
1537 | ) -> BasebackupQueryTimeOngoingRecording<'_, '_> {
   |                                              ^^ this elided lifetime gets resolved as `'a`

warning: elided lifetime has a name
  --> pageserver/src/tenant.rs:3630:25
   |
3622 | async fn prepare_new_timeline<'a>(
   |                               -- lifetime `'a` declared here
...
3630 | ) -> anyhow::Result<UninitializedTimeline> {
   |      ^^^^^^^^^^^^^^^^^^^^^ this elided lifetime gets resolved as `'a`
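A self-contained toy example of the pattern and its fix (the types here are made up, not the pageserver's): when a named lifetime is in scope, spelling it out in the return type instead of relying on elision is what silences `elided_named_lifetimes`:

```rust
struct Guard<'a> {
    label: &'a str,
}

struct Metric {
    label: String,
}

impl Metric {
    // Returning `Guard<'_>` here would trigger `elided_named_lifetimes`,
    // because the elided lifetime silently resolves to the named `'a`.
    // Writing `Guard<'a>` states explicitly what the compiler infers.
    fn start_timer<'c: 'a, 'a>(&'a self, _ctx: &'c str) -> Guard<'a> {
        Guard {
            label: self.label.as_str(),
        }
    }
}

fn main() {
    let m = Metric { label: "getpage_wait_seconds".to_string() };
    let _guard = m.start_timer("request context");
}
```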
Seems nice to keep all these together. This also provides a nice place for a README file to describe the compute image build process. For now, it briefly describes the contents of the directory, but can be expanded.
Instead of adding them to the VM image late in the build process, when putting together the final VM image, include them in the earlier compute image already. That makes it more convenient to edit the files, and to test them.
Part of #8128, fixes #8872.

## Problem

See #8872.

## Summary of changes

- Retry `list_timeline_blobs` another time (see the sketch below) if:
  - there are layer file keys listed but no index;
  - the index failed to download.
- Instrument the code with `analyze-tenant` and `analyze-timeline` spans.
- Remove the `initdb_archive` check; it could have been deleted.
- Return with exit code 1 on fatal errors if the `--exit-code` parameter is set.

Signed-off-by: Yuchen Liang <[email protected]>
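The sketch referenced in the list above, with hypothetical names standing in for the scrubber's internals; the point is simply that the two "suspicious but possibly transient" outcomes get one more listing attempt before being reported:

```rust
/// Simplified stand-in for the result of inspecting a timeline's remote listing.
enum ListingOutcome {
    Consistent,
    LayersWithoutIndex,  // layer file keys listed, but no index
    IndexDownloadFailed, // index key listed, but the download failed
}

/// Run the listing, and give it one more try if the first attempt looks like
/// it raced with an in-flight upload or deletion.
fn analyze_timeline_blobs(mut list: impl FnMut() -> ListingOutcome) -> ListingOutcome {
    match list() {
        // Could be a transient race rather than real corruption: retry once.
        ListingOutcome::LayersWithoutIndex | ListingOutcome::IndexDownloadFailed => list(),
        other => other,
    }
}

fn main() {
    // Pretend the first listing raced with an upload and the retry succeeded.
    let mut calls = 0;
    let outcome = analyze_timeline_blobs(|| {
        calls += 1;
        if calls == 1 { ListingOutcome::LayersWithoutIndex } else { ListingOutcome::Consistent }
    });
    assert!(matches!(outcome, ListingOutcome::Consistent));
}
```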
## Problem

We need to be able to run the regression tests against a cloud-based Neon staging instance to prepare the migration to the arm architecture.

## Summary of changes

Some tests were modified to work on the cloud instance (e.g. passwords were added, server-side copy was changed to client-side, etc.).

---------

Co-authored-by: Alexander Bayandin <[email protected]>
## Problem

Scheduling at tenant creation uses different heuristics than the scheduling done during background optimizations. This results in scenarios where shards are created and then immediately migrated by the optimizer.

## Summary of changes

1. Make the scheduler aware of the type of shard it is scheduling (attached vs. secondary), since we want different heuristics for each.
2. For attached shards, include the attached shard count from the scheduling context in the node score calculation. This brings initial shard scheduling in line with what the optimization passes do (see the sketch after this list).
3. Add a test for (2).

This looks like a bigger change than required, but the refactoring serves as the basis for AZ-aware shard scheduling, where we also need to make the distinction between attached and secondary shards.

Closes #8969
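A rough sketch of the idea behind (1) and (2); the names and the scoring formula are illustrative only, not the storage controller's actual scheduler:

```rust
#[derive(Clone, Copy)]
enum ShardKind {
    Attached,
    Secondary,
}

struct NodeState {
    shard_count: usize,          // all shards (attached + secondary) on the node
    attached_shard_count: usize, // attached shards only
}

/// Lower score means a more attractive node for the shard being placed.
fn node_score(node: &NodeState, kind: ShardKind, context_attached: usize) -> usize {
    match kind {
        // Attached shards are the expensive ones: weigh the node's attached
        // count plus any attachments already planned in this scheduling pass.
        ShardKind::Attached => node.attached_shard_count + context_attached,
        // Secondary locations mostly cost disk, so the total shard count suffices.
        ShardKind::Secondary => node.shard_count,
    }
}

fn main() {
    let node = NodeState { shard_count: 12, attached_shard_count: 5 };
    // Two attachments already planned in this pass make the node look busier.
    assert_eq!(node_score(&node, ShardKind::Attached, 2), 7);
    assert_eq!(node_score(&node, ShardKind::Secondary, 2), 12);
}
```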
The PostgreSQL 17 vendor module is now based on postgres/postgres @ d7ec59a63d745ba74fba0e280bbf85dc6d1caa3e, presumably the final code change before the V17 tag.
conradludgate approved these changes on Sep 24, 2024
arpad-m approved these changes on Sep 24, 2024
NB there are more changes in origin/release that haven't been deployed to most hosts yet: