Stalls Caused by Big Transaction Commits or Aborts
One of our customers occasionally observed simple insertions taking a lot longer than expected to complete. Usually, these insertions completed in milliseconds, but sometimes they took hundreds of seconds. These stalls indicated the existence of a serialization bug in the fractal tree software, so the hunt was on. We found that these stalls occurred when a big transaction was committing and the fractal tree software was taking a checkpoint. This problem was fixed in both TokuDB 7.0.3 and TokuMX 1.0.3. Please read on as we describe some details about this bug and how we fixed it. We describe some of the relevant fractal tree algorithms first.
Each transaction builds a rollback log as it performs fractal tree operations. The rollback log is maintained in memory until it gets too big and is spilled to the tokudb rollback file. We define a small transaction as one whose rollback log resides in memory. We define a big transaction as one whose rollback log is spilled to the rollback file.
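To make the distinction concrete, here is a minimal sketch of a rollback log that spills to disk once it grows past a threshold. It is illustrative only and not the actual fractal tree code; the type names, the spill threshold parameter, and the spill_to_rollback_file helper are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Illustrative rollback entry; the real entries record the operation type,
// key, value, and target dictionary.
struct RollbackEntry { /* ... */ };

class RollbackLog {
public:
    explicit RollbackLog(size_t spill_threshold_bytes)
        : spill_threshold_(spill_threshold_bytes) {}

    void append(const RollbackEntry &entry, size_t entry_bytes) {
        in_memory_.push_back(entry);
        bytes_ += entry_bytes;
        if (bytes_ > spill_threshold_) {
            spill_to_rollback_file();   // hypothetical helper: write entries to the rollback file
            spilled_ = true;
        }
    }

    // A transaction whose rollback log has spilled is a "big" transaction;
    // one whose log is still entirely in memory is a "small" transaction.
    bool is_big() const { return spilled_; }

private:
    void spill_to_rollback_file() { in_memory_.clear(); /* persist entries to disk */ }

    std::vector<RollbackEntry> in_memory_;
    size_t bytes_ = 0;
    size_t spill_threshold_;
    bool spilled_ = false;
};
```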
When the transaction commits or aborts, the entire rollback log for the transaction is traversed, and for each rollback entry the corresponding commit or abort action is executed. For example, if the transaction is committing and the rollback entry is an insertion, then the fractal tree software does nothing since we implicitly commit the insertion. If the transaction is aborting an insertion, then the fractal tree software sends an abort message into the appropriate fractal tree. The time to traverse the rollback log depends on the number of operations that the transaction logged. This time can be huge.
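The sketch below shows the shape of that commit/abort walk, using hypothetical entry and helper names rather than the real fractal tree API. The point is that every logged entry is visited, so the work is proportional to the number of operations the transaction performed.

```cpp
#include <vector>

enum class EntryType { Insert, Delete /* , ... */ };

struct RollbackEntry { EntryType type; /* key, value, target tree, ... */ };

void apply_rollback_log(const std::vector<RollbackEntry> &log, bool committing) {
    for (const RollbackEntry &entry : log) {
        if (committing) {
            if (entry.type == EntryType::Insert) {
                // Nothing to do: the insertion is implicitly committed.
                continue;
            }
            // Other entry types may require commit work here.
        } else {
            if (entry.type == EntryType::Insert) {
                // send_abort_message(entry);   // hypothetical helper: undo the
                // insertion by sending an abort message into the fractal tree.
            }
        }
    }
}
```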
Each transaction also appends recovery entries to the recovery log as it performs fractal tree operations. When recovering from a crash, the recovery log is traversed and its log entries are executed. Since we don't want to process the recovery log from the beginning of time, we periodically checkpoint the open fractal trees and write checkpoint log entries to the recovery log. A checkpoint creates a new version of the fractal tree and persists it to storage. The checkpoint log entries allow crash recovery to only have to replay operations that occurred since the last checkpoint, not since the beginning of time.
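Here is a rough sketch of why checkpoints bound recovery work, assuming illustrative names (an lsn field, a replay helper) rather than the real recovery code: recovery only replays log entries at or after the last begin-checkpoint entry.

```cpp
#include <cstdint>
#include <vector>

struct LogEntry { std::uint64_t lsn; /* logged operation payload */ };

// Replay the recovery log starting at the most recent begin-checkpoint LSN.
// Entries before that point are already reflected in the checkpointed trees.
void recover(const std::vector<LogEntry> &recovery_log,
             std::uint64_t last_begin_checkpoint_lsn) {
    for (const LogEntry &entry : recovery_log) {
        if (entry.lsn < last_begin_checkpoint_lsn) {
            continue;   // covered by the checkpoint; no need to replay
        }
        // replay(entry);   // hypothetical helper: re-execute the logged operation
    }
}
```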
A checkpoint is split into two phases: the begin phase and the end phase. The begin checkpoint phase logs the state of the open fractal trees and the live transactions. It also marks the blocks that need to be written by the end checkpoint phase. The details of checkpoints will not be covered here since they are beyond the scope of this blog.
A reader writer lock is used to serialize transaction commits, aborts and the checkpoint begin phase. Transaction commits and aborts take a read lock on this lock. Checkpoints take a write lock on this lock. We use a fair reader writer lock so that checkpoints are not starved forever in the face of lots of transaction commits or aborts.
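The original locking scheme looks roughly like this. It is only a sketch: std::shared_mutex stands in for the fair reader-writer lock (std::shared_mutex itself makes no fairness guarantee), and the function names are illustrative.

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex checkpoint_lock;   // stand-in for the fair reader-writer lock

void transaction_commit_or_abort() {
    std::shared_lock<std::shared_mutex> read_lock(checkpoint_lock);
    // ... walk the rollback log and apply the commit or abort actions ...
}

void checkpoint_begin_phase() {
    std::unique_lock<std::shared_mutex> write_lock(checkpoint_lock);
    // ... log the live transactions and open trees, mark blocks to write ...
}
```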
Here is how the stalls occur.
- A big transaction grabs a read lock on the checkpoint lock when it wants to commit or abort.
- Other transactions can still grab a read lock on the checkpoint lock and commit. So far, so good.
- A checkpoint attempts to grab a write lock on the checkpoint lock. Since there is a read lock active, the write lock attempt is blocked.
- When other transactions try to grab a read lock on the checkpoint lock, they are blocked because there is a write lock pending.
- Now, all transaction commits and aborts are blocked until the big transaction completes.
To fix this problem, we split the checkpoint lock into a low priority reader writer lock and a high priority reader writer lock. A big transaction grabs a read lock on the low priority reader writer lock. A small transaction grabs a read lock on the high priority reader writer lock. The checkpoint first grabs a write lock on the low priority lock. It then grabs a write lock on the high priority lock.
Zardosht and Leif tell me that this design pattern is a special case of a graded lock that supports multiple priority levels. We use two levels here.
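Here is a minimal sketch of the two-level scheme, again using std::shared_mutex as a stand-in for the fair reader-writer lock and illustrative function names; it is not the actual TokuDB/TokuMX code. Big transactions and small transactions take read locks on different levels, and the checkpoint takes the write locks in order, low priority first, then high priority.

```cpp
#include <mutex>
#include <shared_mutex>

struct GradedCheckpointLock {
    std::shared_mutex low_priority;    // read-locked by big transactions
    std::shared_mutex high_priority;   // read-locked by small transactions
};

GradedCheckpointLock checkpoint_lock;

void big_transaction_commit_or_abort() {
    std::shared_lock<std::shared_mutex> read_lock(checkpoint_lock.low_priority);
    // ... walk the spilled rollback log; this can take a long time ...
}

void small_transaction_commit_or_abort() {
    std::shared_lock<std::shared_mutex> read_lock(checkpoint_lock.high_priority);
    // ... walk the in-memory rollback log; this is fast ...
}

void checkpoint_begin_phase() {
    // Wait for big transactions first; small transactions keep committing
    // on the high priority lock while we wait.
    std::unique_lock<std::shared_mutex> low(checkpoint_lock.low_priority);
    // Only now briefly block small transactions for the begin phase itself.
    std::unique_lock<std::shared_mutex> high(checkpoint_lock.high_priority);
    // ... log the live transactions and open trees, mark blocks to write ...
}
```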
Here is how the stalls are avoided.
- A big transaction grabs a read lock on the low priority checkpoint lock.
- Other big transactions can still grab a read lock on the low priority checkpoint lock.
- Other small transactions can grab a read lock on the high priority checkpoint lock as well.
- A checkpoint attempts to grab a write lock on the low priority checkpoint lock. Since there is a read lock active, the write lock attempt is blocked.
- Other small transactions can continue to grab a read lock on the high priority checkpoint lock since there are no write locks.
- Eventually, the big transaction releases the low priority lock, the checkpoint thread grabs it, and then the checkpoint thread tries to grab the high priority lock.
- This blocks the small transactions while the checkpoint begin phase runs, but we expect the begin phase to be relatively fast.
We give priority to small transactions over big transactions in this design. While big transactions are running through their rollback logs, small transactions continue to commit or abort.
This design does have some drawbacks.
- Since the checkpoint thread is suspended behind one or more big transactions, the time between checkpoints is proportional to the size of the big transaction. Since this can be arbitrarily large, the recovery log can get very large and so can the crash recovery time.
- Any fractal tree operation that needs a checkpoint can be delayed arbitrarily long by a big transaction. Hot indexing, for example, requires a checkpoint.
Here is a simple way to exercise this behavior and see the fix in action.
- Set global tokudb_checkpoint_on_flush_logs=true.
- Abort a big transaction.
- Flush logs (this should block).
- Insert a row into the same table that the big transaction was writing to. This should succeed.
- When the big transaction finally completes, the flush logs should do a checkpoint and complete.