
Fast ⚡ Crash Recovery in PostgreSQL [WIP]: Design, Implementation and Challenges

Srinath Reddy Sadipiralla

Bengaluru PUG, 15/11/2025


©EDB 2025 — ALL RIGHTS RESERVED.


Agenda

  • What is WAL?
  • Internals of WAL
  • Crash Recovery in Postgres
  • The problem
  • Is it even a problem?
  • Solution
  • Design
  • Challenges
  • Current state
  • Q&A

Srinath Reddy Sadipiralla, Staff SDE at EDB

Postgres Hacker


What is WAL? (1/2)

  • WAL == Write Ahead Log
  • Used to record changes to the database before they are applied to the actual data files (sketched below).
  • In the event of a crash, the server can use the WAL to restore the database to a consistent state.
  • Enables replication and point-in-time recovery by providing a sequential log of all changes.
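
To make the ordering rule concrete, here is a minimal, self-contained C sketch of "log first, data later" using ordinary files (purely illustrative: the file names and record format are made up, and this is not PostgreSQL code):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Append a record to the log and force it to durable storage. */
    static void
    durable_append(int fd, const char *buf, size_t len)
    {
        if (write(fd, buf, len) < 0 || fsync(fd) < 0)
            _exit(1);
    }

    int
    main(void)
    {
        int         wal = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int         dat = open("table.dat", O_RDWR | O_CREAT, 0644);
        const char *change = "page 7: counter = 42\n";

        /* 1. Log the change first ... */
        durable_append(wal, change, strlen(change));

        /* 2. ... only then touch the data file; if we crash in between,
         *    the log still lets us redo the change. */
        if (pwrite(dat, "42", 2, 7 * 8192) < 0)
            _exit(1);

        close(wal);
        close(dat);
        return 0;
    }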


What is WAL? (2/2)

  • WAL is written to files in the pg_wal directory by multiple server processes, such as backends, the walwriter, the checkpointer, and the bgwriter.


Internals of WAL (1/3)


Internals of WAL (2/3)


Internals of WAL (3/3)


Crash Recovery in Postgres (1/3)


Crash Recovery in Postgres (2/3)

  • When the Postgres server starts, the postmaster launches a child process called the startup process.
  • The startup process reads the control file and checks the state of the server.
  • If the state is not DB_SHUTDOWNED, the server was not shut down cleanly.
  • This is treated as a server crash.
  • The startup process gets the last checkpoint location from the control file (see the sketch below).
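
A rough sketch of that decision, with simplified stand-ins for ControlFileData and DBState (the real definitions live in pg_control.h; this is an illustration, not the actual StartupXLOG code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for the real definitions in pg_control.h. */
    typedef enum DBState
    {
        DB_STARTUP,
        DB_SHUTDOWNED,
        DB_IN_CRASH_RECOVERY,
        DB_IN_PRODUCTION
    } DBState;

    typedef struct ControlFileData
    {
        DBState     state;          /* how the server last stopped       */
        uint64_t    checkPoint;     /* LSN of the last checkpoint record */
    } ControlFileData;

    /* Anything other than a clean shutdown is treated as a crash,
     * and redo starts from the last checkpoint recorded above. */
    static bool
    needs_crash_recovery(const ControlFileData *control)
    {
        return control->state != DB_SHUTDOWNED;
    }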


Crash Recovery in Postgres (3/3)


The Problem (1/3)


The Problem (2/3)

  • Currently the time crash recovery takes is roughly proportional(*) to the time taken to replay the WAL.
  • If the server has a lot of WAL to replay, crash recovery takes longer to bring the server up and running.

* other factors matter too, e.g. WAL read speed depends on the hardware.


The Problem (3/3)

  • Reducing the checkpoint interval helps: dirty pages get flushed and old WAL files are removed more frequently.
  • So there is less WAL to replay during crash recovery.
  • But at the expense of slowing down backends executing queries, because they need to read pages from disk more frequently.


Is it even a problem? (1/2)


Is it even a problem? (2/2)

  • Until all the WAL is replayed, clients can't connect to the server because it is not yet consistent, which means downtime for clients.
  • The more WAL there is to replay after a crash, the longer recovery takes, which means more server downtime.


Solution (1/2)


Solution (2/2)

  • The design idea is to skip the WAL replay that the startup process would normally do during crash recovery.
  • Then we bring the server up and running so that it can accept client connections.
  • So when is the data recovered? On demand, when a page is read from disk into memory.


Design


Skipping the WAL?

  • To make crash recovery fast and reduce server downtime, the startup process skips replaying WAL records, except RM_XACT_ID records, which are still replayed to keep the transaction status and snapshots consistent (see the sketch below).
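
A hypothetical sketch of what that filtered startup loop could look like, using PostgreSQL's WAL reader API (XLogReadRecord, XLogRecGetRmid and RM_XACT_ID are real; the two helper functions are assumptions of this design, not core code):

    #include "postgres.h"
    #include "access/rmgr.h"
    #include "access/xlogreader.h"

    /* Hypothetical helpers from this design, not part of core PostgreSQL. */
    extern void replay_xact_record(XLogReaderState *reader);
    extern void remember_for_ondemand_replay(XLogReaderState *reader);

    static void
    fast_crash_recovery_scan(XLogReaderState *xlogreader)
    {
        char       *errmsg;

        for (;;)
        {
            XLogRecord *record = XLogReadRecord(xlogreader, &errmsg);

            if (record == NULL)
                break;          /* reached the end of readable WAL */

            if (XLogRecGetRmid(xlogreader) == RM_XACT_ID)
                replay_xact_record(xlogreader);             /* keep xact status / snapshots consistent */
            else
                remember_for_ondemand_replay(xlogreader);   /* just note BufferTag -> LSN for later */
        }
    }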


When will the data be recovered then?

  • The actual data recovery happens after the server is up and running and users can connect to it.
  • When a page is read from disk into shared buffers by any process, the WAL records related to that page are applied on demand.


Isn’t it slow to read and filter the WAL per page request?

  • To address this, a hashtable is used whose key is the page's BufferTag and whose value is the list of LSNs of the WAL records that affected that page.
  • This hashtable is populated while the startup process scans the WAL from disk: for every record read, the BufferTag and LSN are extracted and inserted into the hashtable (see the sketch below).
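
One possible shape for that hashtable, sketched with PostgreSQL's dynahash API (BufferTag, XLogRecPtr and hash_search are real; the PageRedoEntry layout and the helper are assumptions of this design, and growing the LSN array is omitted):

    #include "postgres.h"
    #include "access/xlogdefs.h"        /* XLogRecPtr */
    #include "storage/buf_internals.h"  /* BufferTag  */
    #include "utils/hsearch.h"          /* HTAB, hash_search */

    /* Hypothetical entry: one page and the WAL records that touched it. */
    typedef struct PageRedoEntry
    {
        BufferTag   tag;        /* hash key: which page                    */
        XLogRecPtr *lsns;       /* LSNs of WAL records affecting this page */
        int         nlsns;      /* how many of them are still unapplied    */
    } PageRedoEntry;

    /* Called by the startup scan for every block touched by a WAL record
     * (the BufferTag can be built from XLogRecGetBlockTag()). */
    static void
    remember_lsn_for_page(HTAB *page_redo_map, const BufferTag *tag, XLogRecPtr lsn)
    {
        bool            found;
        PageRedoEntry  *entry;

        entry = hash_search(page_redo_map, tag, HASH_ENTER, &found);
        if (!found)
        {
            entry->lsns = NULL;
            entry->nlsns = 0;
        }
        /* append lsn to entry->lsns here (array growth omitted) */
    }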


checkpointer and bgwriter processes (1/3)

  • During normal crash recovery, while the startup process replays WAL and reads pages from disk, the checkpointer and background writer processes are also running.
  • They reduce the load on the startup process by flushing dirty pages from time to time, and the checkpointer also removes/recycles WAL files from pg_wal.


checkpointer and bgwriter processes (2/3)

  • But in the fast crash recovery case, the startup process neither reads pages nor replays WAL, so there is no need to start the checkpointer and bgwriter to flush dirty pages, and the WAL files must not be removed/recycled because they are needed for on-demand recovery.
  • So these two processes are not started during crash recovery.
  • After the skipped crash recovery, when clients can connect to the server, the bgwriter is started but not the checkpointer.


checkpointer and bgwriter processes (3/3)

  • The bgwriter is started because it takes some load off backends by flushing dirty pages during on-demand WAL replay.
  • The checkpointer is not started because not all pages have been recovered yet, and when its timeout is reached it would remove WAL files that we still need.
  • The checkpointer is started again once all the WAL has been applied and the data is fully recovered.


What if no one reads a page for a long time? (1/2)

  • That is a problem: since the checkpointer is not running, if new pages keep being created while old pages are not yet fully recovered, new WAL accumulates and there is no one to remove/recycle it.
  • Also, until all pages are recovered, every read request has to hit the hashtable to check whether there is WAL to replay on demand, which makes reads slower than when everything is fully recovered.


What if no one reads a page for a long time? (2/2)

  • To solve this, a new child process is introduced that continuously iterates through the hashtable in the background, reading in the remaining pages and replaying the WAL records related to each page (see the sketch below).
  • This process confirms that the "deferred recovery" has completed, which lets us start the checkpointer again.
  • It also handles cleanup, such as destroying the hashtable, and then exits, so there is no overhead of an extra long-lived process.
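
A rough sketch of that background loop, reusing the PageRedoEntry layout from the earlier sketch (hash_seq_init/hash_seq_search and ReleaseBuffer are real; the page-reading and checkpointer-signalling helpers are assumptions of this design):

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"
    #include "utils/hsearch.h"

    /* Hypothetical helpers from this design. */
    extern Buffer read_page_with_ondemand_replay(const BufferTag *tag);
    extern void allow_checkpointer_to_start(void);

    static void
    deferred_recovery_main(HTAB *page_redo_map)
    {
        HASH_SEQ_STATUS scan;
        PageRedoEntry  *entry;

        hash_seq_init(&scan, page_redo_map);
        while ((entry = hash_seq_search(&scan)) != NULL)
        {
            /* Reading the page through the normal buffer manager path
             * triggers the same on-demand replay a backend would do. */
            Buffer      buf = read_page_with_ondemand_replay(&entry->tag);

            ReleaseBuffer(buf);
            /* dynahash allows deleting the entry just returned by the scan */
            hash_search(page_redo_map, &entry->tag, HASH_REMOVE, NULL);
        }

        /* Every page is recovered, so the checkpointer can run again
         * and the hashtable can be destroyed (cleanup omitted). */
        allow_checkpointer_to_start();
    }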


What if 2 processes try to read same page at a time?

  • The on-demand replay is done between StartBufferIO and TerminateBufferIO while a process is reading a page from disk (see the sketch below).
  • The buffer state is flagged BM_IO_IN_PROGRESS until TerminateBufferIO is called.
  • If anyone else tries to read the same page, they block in WaitIO, which loops until BM_IO_IN_PROGRESS is cleared by TerminateBufferIO.
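
Where that hook would sit, roughly (pre-PG 18 layout; StartBufferIO, TerminateBufferIO, BM_IO_IN_PROGRESS and WaitIO are real bufmgr machinery whose exact signatures vary by release, and replay_pending_wal_for_page is a hypothetical helper of this design):

    /* Inside the page-read path in bufmgr.c (simplified, pre-PG 18): */
    if (StartBufferIO(bufHdr, true))        /* sets BM_IO_IN_PROGRESS */
    {
        /* We won the I/O.  Any other process that wants this page now
         * blocks in WaitIO() until the flag is cleared. */

        /* 1. read the stale page image from disk (smgrread)            */
        /* 2. apply the WAL records recorded for this BufferTag, i.e.
         *    the on-demand replay (hypothetical helper):               */
        replay_pending_wal_for_page(bufHdr);

        /* 3. mark the buffer valid and wake up waiters */
        TerminateBufferIO(bufHdr, false, BM_VALID);
    }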


Challenges

  1. The plan was to keep the hashtable in shared memory using a dynahash hashtable, but for that the number of entries must be known upfront.
  2. The on-demand replay planned between StartBufferIO and TerminateBufferIO is no longer possible from PG 18, because AIO has moved the page read and TerminateBufferIO into a critical section, so replay cannot happen there.


Current design “experiments” for the above challenges

  1. Exploring Dynamic Shared Memory (DSM) to see whether the hashtable can be kept there.
  2. Trying to acquire an exclusive lock before the page is read inside the StartBufferIO/TerminateBufferIO block, do the on-demand replay after the critical section has ended, and then release the lock.


Current state

  • The current patch can do on-demand WAL replay on pages in backend processes.
  • I tested with a small dataset and it was able to recover that data while skipping the replay at startup.
  • This was possible because, to overcome the previous challenges for testing purposes, I kept the size of the shared hashtable fixed and removed the critical-section macros around the page-read code.

Future of this patch: I will post the patch(es) with performance numbers ASAP on the pgsql-hackers mailing list.


Summary (1/4)


Summary (2/4)


Summary (3/4)


Summary (4/4)


After all pages are recovered


Q&A


Thank you

Connect with me on LinkedIn: linkedin.com/in/srinath-reddy-sadipiralla/

Reach out to me: srinath.reddy@enterprisedb.com


©EDB 2025 — ALL RIGHTS RESERVED.