1 of 22

CSE 451

Operating Systems

Lecture 23 – Transactional File Systems

Slides by: Tom Anderson

Baris Kasikci

2 of 22

Recap

Persistent disk operations are block oriented
To build a file system, need to consistently update complex data structures spread over multiple blocks
Disk and server failures can disrupt any sequence of updates so that only some are completed
Approaches

Careful sequencing of file system operations + post-hoc recovery
Copy-on-write (WAFL, ZFS, Apple File System)
Journalling (NTFS, linux ext4)
Log structure (flash storage)

3 of 22

FFS: Create a File

Normal operation:

Allocate data block
Write data block
Allocate inode
Write inode block
Update bitmap of free blocks
Update directory with file name -> file number (inode number)
Update modify time for directory

Recovery:

Scan inode table
If any unlinked files (not in any directory), delete
Compare free block bitmap against inode trees

Go fix

Scan directories for missing update/access times

Go fix

Time proportional to size of disk

4 of 22

FFS: Move a File

Normal operation:

Remove filename from old directory
Add filename to new directory

Recovery:

Scan all directories to determine set of live files
Consider files with valid inodes and not in any directory

New file being created?
File move?
File deletion?

5 of 22

Application File Save

To update a file, create a new version of the file containing the update

As a temporary backup file
Never update in place
Atomic swap to replace file (rename)

Related concept

git commit update: new version of repo containing some update

Write new versions of the data into files
Create new version of repo pointing to mix of new and old data
Atomic swap to replace repo

6 of 22

Application Save File (eg, Microsoft Word)

Normal operation:

Keep a list of open files for writing
fsync(dir), fsync(data)
Write new version of file to backup

Or append changes to end of file

Will need app-level checksum

fsync(dir), fsync(data)
Rename backup file to file

In POSIX, rename is atomic!

Delete file from open file list on close
fsync(dir)

Recovery:

Were any files left open?
If so, look for backup file

Or scan for deltas appended to file

If backup file is valid, ask user to compare versions

Or automatically apply those changes

Warning: append of multiple blocks is not atomic!

7 of 22

Careful Ordering

Pros

Works with most disk drives
Works for most multi-step operations

Cons

Can require time-consuming recovery after a failure
Can require frequent synchronous flushes of pending writes to disk
Difficult to reduce every operation to a safely interruptible sequence of writes
Difficult to achieve consistency when multiple operations occur concurrently
Need to control when writes are flushed to storage

8 of 22

Approach #3: Write Ahead Logging

Instead of modifying data structures on disk directly, write changes to a journal/log

Intention list: set of changes we intend to make
Log is append-only

Once changes are on log, safe to apply changes to data structures on disk

Recovery can read log to see what changes were intended

Once changes are applied to data structures on disk, safe to remove/garbage collect log

9 of 22

Write Ahead (Redo) Logging

Recovery

Read log
Redo any operations for committed

transactions

Ignore operations for uncommitted

transactions

Garbage collect log

If crash during recovery, restart recovery process

Prepare

Write all changes (in transaction) to log
Set of intended changes

Commit

Single disk write to make transaction

durable (including checksum of changes)

Redo

Apply intended changes to disk data

structures

Garbage collection

Reclaim space in log

10 of 22

Before Transaction Start

11 of 22

After Updates Are Logged

12 of 22

After Commit Logged

13 of 22

After Copy Back

14 of 22

After Garbage Collection

15 of 22

Write Ahead (Redo) Logging

Recovery

Read log
Redo any operations for committed

transactions

Ignore operations for uncommitted

transactions

Garbage collect log

If crash during recovery, restart recovery process

Prepare

Write all changes (in transaction) to log
Set of intended changes

Commit

Single disk write to make transaction

durable

Redo

Apply intended changes to disk data

structures

Garbage collection

Reclaim space in log

16 of 22

Questions

What happens if machine crashes?

Before transaction start
After transaction start, before operations are logged
After operations are logged, before commit
After commit, before write back
After write back before garbage collection

What happens if machine crashes during recovery?

17 of 22

Performance

Log written sequentially
Asynchronous write back (in any order) provided

Synchronous commit
All changes are logged before commit block is written
All write backs occur after commit is written

Multiple concurrent transactions?

Transaction ID in each log entry
Transaction completed iff its commit record is in log

18 of 22

Redo Log Implementation

19 of 22

xk Implementation Note

For lab 4, you do not need to handle multiple concurrent file updates!
(A real system does)

20 of 22

Caveat

Most file systems implement a transactional model internally

Most often: write ahead (redo) logging

Ordering, atomicity for (some) system calls

File create, file rename, file move, …
No ordering for file data operations (read/write) except with fsync (fence)
Multi-block operations are not atomic or linearizable

Transactional updates for application data needs to be built on top of file system primitives

Application level atomic commit (eg, append log of changes, write new version to backup file, fsync, atomic rename, implicit delete of old file)

21 of 22

File System Write Ahead Logging

Pros

Correct behavior regardless of failures
Fast recovery (if log is garbage collected frequently)
High throughput if commits can be batched
On magnetic disk, write backs can be ordered to reduce seek time

Cons

Every write performed twice

Can wear down an SSD

Depending on commit batch size, potential for low throughput/higher latency

22 of 22

Question

Do we need the copy back?

What if update in place is expensive? (eg, RAID)
What if the underlying device doesn’t do update in place? (eg, flash)

Can we write the changed data just once?