Commit
Fix init of WAL page header at startup (#8914)
If the primary is started at an LSN within the first page of a 16 MB WAL segment, the "long XLOG page header" at the beginning of the segment was not initialized correctly. That has gone unnoticed because, under normal circumstances, nothing looks at the page header. The WAL that is streamed to the safekeepers starts at the new record's LSN, not at the beginning of the page, so the bogus page header didn't propagate elsewhere, and a primary server doesn't normally read the WAL it has written. That is just as well, because the contents of the page would be bogus anyway: it wouldn't contain any of the records before the LSN where the new record is written.

However, in the following cases a primary does read its own WAL:

1. When there are two-phase transactions in prepared state at checkpoint. The checkpointer reads the two-phase state from the XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.
2. Logical decoding reads the WAL starting from the replication slot's restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's sufficient to initialize the page header correctly. The checkpointer only needs to read XLOG_XACT_PREPARE records that were generated after the server startup, so it's still OK that older WAL is missing or bogus.

I have not investigated whether we have a problem with logical decoding, however. Let's deal with that separately.

Special thanks to @Lzjing-1997, who independently found the same bug and opened a PR to fix it, although I did not use that PR.
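To make the condition in the message concrete, here is a minimal sketch of the arithmetic that decides whether a start LSN falls on the page that carries the long header. It is not the code from this commit. The 16 MB segment size comes from the message above; the 8 kB page size and the 24-byte and 40-byte header sizes are assumed default-build values, and all function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, hard-coded for this sketch only. */
#define WAL_SEGMENT_SIZE  (16u * 1024u * 1024u) /* 16 MB, as stated in the commit message */
#define WAL_PAGE_SIZE     8192u                 /* assumption: default WAL page size */
#define SHORT_HEADER_SIZE 24u                   /* assumption: regular page header size */
#define LONG_HEADER_SIZE  40u                   /* assumption: long page header size */

typedef uint64_t Lsn; /* hypothetical alias: an LSN is a 64-bit byte position in the WAL */

/* Byte offset of the LSN within its WAL segment. */
uint32_t lsn_segment_offset(Lsn lsn)
{
    return (uint32_t) (lsn % WAL_SEGMENT_SIZE);
}

/* Start address of the WAL page that contains the LSN. */
Lsn lsn_page_start(Lsn lsn)
{
    return lsn - (lsn % WAL_PAGE_SIZE);
}

/* True if the LSN lies on the first page of a segment, the only page
 * that begins with the long header. */
bool lsn_in_first_page_of_segment(Lsn lsn)
{
    return lsn_segment_offset(lsn) < WAL_PAGE_SIZE;
}

/* Size of the header at the start of the page that contains the LSN. */
uint32_t page_header_size_for(Lsn lsn)
{
    uint32_t size = lsn_in_first_page_of_segment(lsn) ? LONG_HEADER_SIZE
                                                      : SHORT_HEADER_SIZE;

    /* Sanity check: a header always fits inside its page. */
    assert(size < WAL_PAGE_SIZE);
    return size;
}
```

Under these assumptions, a start LSN of 0x1000028 sits 40 bytes into the second segment, so its page begins with a long header. A start LSN of 0x1002000 is on the second page of that segment and only needs the short header.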
9a32aa8
5091 tests run: 4917 passed, 0 failed, 174 skipped (full report)
Flaky tests (10)
Postgres 17
- test_pageserver_compaction_smoke: release-arm64
- test_neon_cli_basics: release-arm64
- test_scrubber_physical_gc[4]: release-x86-64, debug-x86-64
- test_subscriber_restart: release-x86-64
- test_timeline_archive[4]: debug-x86-64
- test_delete_timeline_client_hangup: debug-x86-64
Postgres 16
- test_neon_cli_basics: release-arm64
Postgres 15
- test_neon_cli_basics: release-arm64
- test_subscriber_restart: release-x86-64
Code coverage* (full report)
- functions: 31.9% (7435 of 23317 functions)
- lines: 49.9% (59940 of 120118 lines)
* collected from Rust tests only
9a32aa8 at 2024-09-21T09:57:08.493Z