1 of 22

CSE 451

Operating Systems

Lecture 23 – Transactional File Systems

Slides by: Tom Anderson

Baris Kasikci

2 of 22

Recap

  • Persistent disk operations are block oriented
  • To build a file system, need to consistently update complex data structures spread over multiple blocks
  • Disk and server failures can disrupt any sequence of updates so that only some are completed
  • Approaches
    • Careful sequencing of file system operations + post-hoc recovery
    • Copy-on-write (WAFL, ZFS, Apple File System)
    • Journalling (NTFS, linux ext4)
    • Log structure (flash storage)

3 of 22

FFS: Create a File

Normal operation:

  • Allocate data block
  • Write data block
  • Allocate inode
  • Write inode block
  • Update bitmap of free blocks
  • Update directory with file name -> file number (inode number)
  • Update modify time for directory

Recovery:

  • Scan inode table
  • If any unlinked files (not in any directory), delete
  • Compare free block bitmap against inode trees
    • Go fix
  • Scan directories for missing update/access times
    • Go fix

Time proportional to size of disk

4 of 22

FFS: Move a File

Normal operation:

  • Remove filename from old directory
  • Add filename to new directory

Recovery:

  • Scan all directories to determine set of live files
  • Consider files with valid inodes and not in any directory
    • New file being created?
    • File move?
    • File deletion?

5 of 22

Application File Save

  • To update a file, create a new version of the file containing the update
    • As a temporary backup file
    • Never update in place
    • Atomic swap to replace file (rename)

  • Related concept
    • git commit update: new version of repo containing some update
      • Write new versions of the data into files
      • Create new version of repo pointing to mix of new and old data
      • Atomic swap to replace repo

6 of 22

Application Save File (eg, Microsoft Word)

Normal operation:

  • Keep a list of open files for writing
  • fsync(dir), fsync(data)
  • Write new version of file to backup
    • Or append changes to end of file
      • Will need app-level checksum
  • fsync(dir), fsync(data)
  • Rename backup file to file
    • In POSIX, rename is atomic!
  • Delete file from open file list on close
  • fsync(dir)

Recovery:

  • Were any files left open?
  • If so, look for backup file
    • Or scan for deltas appended to file
  • If backup file is valid, ask user to compare versions
    • Or automatically apply those changes

  • Warning: append of multiple blocks is not atomic!

7 of 22

Careful Ordering

  • Pros
    • Works with most disk drives
    • Works for most multi-step operations
  • Cons
    • Can require time-consuming recovery after a failure
    • Can require frequent synchronous flushes of pending writes to disk
    • Difficult to reduce every operation to a safely interruptible sequence of writes
    • Difficult to achieve consistency when multiple operations occur concurrently
    • Need to control when writes are flushed to storage

8 of 22

Approach #3: Write Ahead Logging

  • Instead of modifying data structures on disk directly, write changes to a journal/log
    • Intention list: set of changes we intend to make
    • Log is append-only
  • Once changes are on log, safe to apply changes to data structures on disk
    • Recovery can read log to see what changes were intended
  • Once changes are applied to data structures on disk, safe to remove/garbage collect log

9 of 22

Write Ahead (Redo) Logging

  • Recovery
    • Read log
    • Redo any operations for committed

transactions

    • Ignore operations for uncommitted

transactions

    • Garbage collect log
  • If crash during recovery, restart recovery process
  • Prepare
    • Write all changes (in transaction) to log
    • Set of intended changes
  • Commit
    • Single disk write to make transaction

durable (including checksum of changes)

  • Redo
    • Apply intended changes to disk data

structures

  • Garbage collection
    • Reclaim space in log

10 of 22

Before Transaction Start

11 of 22

After Updates Are Logged

12 of 22

After Commit Logged

13 of 22

After Copy Back

14 of 22

After Garbage Collection

15 of 22

Write Ahead (Redo) Logging

  • Recovery
    • Read log
    • Redo any operations for committed

transactions

    • Ignore operations for uncommitted

transactions

    • Garbage collect log

If crash during recovery, restart recovery process

  • Prepare
    • Write all changes (in transaction) to log
    • Set of intended changes
  • Commit
    • Single disk write to make transaction

durable

  • Redo
    • Apply intended changes to disk data

structures

  • Garbage collection
    • Reclaim space in log

16 of 22

Questions

  • What happens if machine crashes?
    • Before transaction start
    • After transaction start, before operations are logged
    • After operations are logged, before commit
    • After commit, before write back
    • After write back before garbage collection
  • What happens if machine crashes during recovery?

17 of 22

Performance

  • Log written sequentially
  • Asynchronous write back (in any order) provided
    • Synchronous commit
    • All changes are logged before commit block is written
    • All write backs occur after commit is written
  • Multiple concurrent transactions?
    • Transaction ID in each log entry
    • Transaction completed iff its commit record is in log

18 of 22

Redo Log Implementation

19 of 22

xk Implementation Note

  • For lab 4, you do not need to handle multiple concurrent file updates!
  • (A real system does)

20 of 22

Caveat

  • Most file systems implement a transactional model internally
    • Most often: write ahead (redo) logging
  • Ordering, atomicity for (some) system calls
    • File create, file rename, file move, …
    • No ordering for file data operations (read/write) except with fsync (fence)
    • Multi-block operations are not atomic or linearizable
  • Transactional updates for application data needs to be built on top of file system primitives
    • Application level atomic commit (eg, append log of changes, write new version to backup file, fsync, atomic rename, implicit delete of old file)

21 of 22

File System Write Ahead Logging

  • Pros
    • Correct behavior regardless of failures
    • Fast recovery (if log is garbage collected frequently)
    • High throughput if commits can be batched
    • On magnetic disk, write backs can be ordered to reduce seek time
  • Cons
    • Every write performed twice
      • Can wear down an SSD
    • Depending on commit batch size, potential for low throughput/higher latency

22 of 22

Question

  • Do we need the copy back?
    • What if update in place is expensive? (eg, RAID)
    • What if the underlying device doesn’t do update in place? (eg, flash)

  • Can we write the changed data just once?