Deep Dive: InnoDB Transactions and Write Paths
From the client connection to physical storage
Marko Mäkelä, Lead Developer InnoDB�Michaël de Groot, MariaDB Consultant
Second Edition, for
MariaDB Developers Unconference, Tampere, Finland
June 28, 2018
InnoDB Concepts
Some terms that an Advanced DBA should be familiar with
A mini-transaction is an atomic set of page reads or writes, with write-ahead redo log.
A transaction writes undo log before modifying indexes.
The read view of a transaction may access the undo logs of newer transactions to retrieve old versions.
Purge may remove old undo logs and delete-marked records once no read view needs them.
The “storage stack” of InnoDB
A Comparison to the OSI Model (Upper Layers)
The Open Systems Interconnection Model
7. Application layer
6. Presentation layer
5. Session layer
Some layers inside the MariaDB Server:
7. client connection
6. SQL
5. Storage engine interface
A Comparison to the OSI Model (Lower Layers)
The Open Systems Interconnection Model
4. Transport layer
3. Network layer
2. Data link
1. Physical
InnoDB Storage Engine
4. Transaction
3. Mini-transaction
2. Operating System (file system, block device)
1. Hardware/Firmware (physical storage)
SQL and storage engine interface
Step 1: SQL Layer
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Step 2a: Read via the Storage Engine Interface
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Step 2b: Filtering the Rows
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Step 2c: Locking the rows
Transaction layer
InnoDB Transaction Layer
The InnoDB transaction layer relies on atomically updated data pages forming persistent data structures:
A Page View of the InnoDB Transaction Layer
128
B-tree root page
B-tree leaf page
DB_ROLL_PTR
TRX_SYS�Page�(ibdata1)
Rollback segment header page
Table�.ibd
Index
History List (to-purge committed)
Uncommitted transactions
(pkN,child page)
(pkM,child page)
B-tree internal page
(pkN1,child page)
(pkN2,child page)
(pk2,DB_TRX_ID,DB_ROLL_PTR,cols)
(pk1,DB_TRX_ID,DB_ROLL_PTR,cols)
Undo Log�(linked list of pages)
Undo Log�(linked list of pages)
DB_TRX_ID and DB_ROLL_PTR
Step 3: Updating the Matching Rows
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Step 3a: Creating undo log records
Step 3b: Modifying the Indexes
Step 4a: Commit
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Alternate step 4a: Rollback
Step 4b: Cleaning up
After User COMMIT or ROLLBACK is done, there are some clean-up steps:
Mini-Transaction Layer
The InnoDB Mini-Transaction Layer
Each layer provides service to the upper layers, relying on the service provided by the lower layer, adding some encapsulation or refinement:
Mini-Transactions and Page Locking
A mini-transaction is not a user transaction. It is a short operation comprising:
There is no rollback for mini-transactions.
Latches of unmodified pages can be released any time. Commit will:
Index page latch acquisition must avoid deadlocks (server hangs):
Mini-Transactions: RW-Locks and Redo Logs
Mini-Transaction������
Memo:�Locks or Buffer-Fixes
Index tree latch (dict_index_t::lock): covers internal pages
Tablespace latch�(fil_space_t::latch): allocating/freeing pages
Log:�Page Changes
Data Files�FIL_PAGE_LSN
Flush (after log written)
Redo Log Files�(ib_logfile*)
Redo Log Buffer�(log_sys_t::buf)
Write ahead (of page flush) to log (make durable)
Buffer pool page�buf_page_t::oldest_modification
commit
A mini-transaction commit stores the log position (LSN) to each changed page.
�Recovery will redo changes: Apply log if the page LSN is older than the log record LSN.
Log position (LSN)
Mini-transactions and Durability
Depending on when the redo log buffer was written to the redo log files, recovery may miss some latest mini-transaction commits.
When innodb_flush_log_at_trx_commit=1, the only mini-transaction that is synchronously persisted to disk is that of a User transaction commit. All earlier mini-transactions are also persisted.
Mini-transaction Write operations on Index Trees
In InnoDB, a table is a collection of indexes: PRIMARY KEY (clustered index, storing all materialized columns), and optional secondary indexes.
A mini-transaction can only modify one index at a time.
Read Mini-Transactions in Transactions
In InnoDB, a table is a collection of indexes: PRIMARY KEY (clustered index, storing all materialized columns), and optional secondary indexes.
A mini-transaction can read only 1 secondary index, the clustered index and its undo log records
Flushing dirty pages
Redo Log Checkpoint (Truncating Redo Log)
On startup, any redo log after the latest checkpoint will be read and applied.
One More Layer: the Doublewrite Buffer
What if the server was killed in the middle of a page flush?
Page writes first go to the doublewrite buffer (128 pages in ibdata1).
Bugs and improvement potential: MDEV-11799, MDEV-12699, MDEV-12905
Crash Recovery
On startup, InnoDB performs some recovery steps:
Transaction Isolation Levels and Multi-Version Concurrency Control
Read Views and Isolation Levels
The lowest transaction isolation level is READ UNCOMMITTED.
REPEATABLE READ uses non-locking read views:
READ COMMITTED is similar to REPEATABLE READ, but the read view is
Locks and Isolation Levels
The highest transaction isolation level SERIALIZABLE implies�SELECT…LOCK IN SHARE MODE
Read Views, Implicit Locks, and the Undo Log
InnoDB may have to read undo logs for the purposes of:
MVCC Read Views and the Undo Log
Multi-versioning concurrency control: provide non-locking reads from a virtual ‘snapshot’ that corresponds to a read view: The current DB_TRX_ID and a list of DB_TRX_IDs that are not committed at that time.
The read view contains:
“Too new” INSERT can be simply ignored. DELETE and UPDATE make the previous version available by pointing to the undo log record.
MVCC Read Views in the PRIMARY Index
The hidden PRIMARY index fields DB_TRX_ID, DB_ROLL_PTR and undo log records constitute a singly-linked list, from the newest version to the oldest.
A non-locking read iterates this list in a loop, starting from the newest version:
Return the record, or skip it if delete-marked.
Skip the record (the oldest version was inserted in the future).
MVCC Read Views and Secondary Indexes
Secondary indexes only contain a PAGE_MAX_TRX_ID.
Purging Old History
Purging Old History
UPDATE talk SET attendees = 25 WHERE conference="M|18"
AND name="Deep Dive";
Purge Lag and Long-Running Transactions
The purge threads can start removing history as soon as it is no longer visible to any active read view.
Undo log is by default stored in the InnoDB system tablespace (ibdata1).
Secondary Indexes and the Purge Lag
In the worst case, MVCC and implicit lock checks will require a clustered index lookup for each secondary index record, followed by undo log lookups.
Thank you!
To be continued in the other Deep Dive:�InnoDB Transactions and Replication