1 of 35

CSE 451

Operating Systems

L21 – Spinning disks & File system abstraction

Slides by: Tom Anderson

Rohan Kadekodi

2 of 35

Magnetic Disk

3 of 35

Magnetic Disk Anatomy

Two pictures: one shows how the disk physically looks, and one is a schematic.

Definitions: the disk itself is a stack of platters. The disk heads form a column alongside those platters. They read both the upper and lower surface of each platter, so there is information on both sides of the platter.

So if you have 10 platters, you will have 20 disk heads. The spindle is the thing that rotates the platters, thousands of times per minute, at a really high speed.

The actuator moves the disk arm in and out to locate the bit or block of information that you are reading or writing.

As the platter rotates under the disk head, the head, which is a magnet, reads or writes the data in a particular block. This gives me the ability to read or write a particular track. If I want to read other tracks, I move the head.

Data is laid out in concentric tracks; each track is another ring of blocks.

These tracks are very small. Disks store on the order of a terabit of information per square inch, so the tracks are very tightly packed.

4 of 35

Disk Tracks

~ 1 micron wide

Wavelength of light is ~ 0.5 micron
Resolution of human eye: 50 microns
100K tracks on a typical 2.5” disk

Separated by unused guard regions

Reduces likelihood neighboring tracks are corrupted during writes
Bit level corruption still happens with non-zero chance

Track length varies across disk

Outside: More sectors per track, higher bandwidth
Disk is organized into regions of tracks with same # of sectors/track
Only outer half of radius is used

Most of the disk area is in the outer regions of the disk

5 of 35

Disk Performance

Disk Latency =

Seek Time + Rotation Time + Transfer Time

Seek Time: time to move disk arm over track (1-20ms)

Fine-grained position adjustment necessary for head to “settle”

Head switch time ~ track switch time (on modern disks)

Rotation Time: time to wait for disk to rotate under disk head

Disk rotation: 4 – 15ms (depending on price of disk)
On average, only need to wait half a rotation

Transfer Time: time to transfer data onto/off of disk

Disk head transfer rate: 100-250MB/s (5-10 usec/sector)

Host transfer rate dependent on I/O connector (USB, SATA, …)

6 of 35

Western Digital Ultrastar DC HC690

Capacity	32 TB, 7 platters
Spin Speed	7200 RPM
Sustained Transfer Rate	270 MB/s (read), 260 MB/s (write)
Interface Transfer Rate	1.2 GB/s
Seek time (avg)	8 ms (read), 8.6 ms (write)
Rotational latency (avg)	4.16 ms
Cache	512 MB DRAM
Idle/Operating Power	5.8W/9.7W
Bit Error Rate (read)	10^-15
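As a rough worked example of the Disk Latency formula (seek + rotation + transfer), the sketch below plugs in the Ultrastar numbers from the table above. The function names and the 4 KB and 1 MB request sizes are illustrative assumptions, not something from the slides.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """On average we wait half a rotation for the sector to arrive."""
    ms_per_rotation = 60_000.0 / rpm
    return ms_per_rotation / 2.0


def disk_latency_ms(seek_ms: float, rpm: float,
                    transfer_mb_per_s: float, request_bytes: int) -> float:
    """Seek time + rotation time + transfer time, all in milliseconds."""
    transfer_ms = request_bytes / (transfer_mb_per_s * 1e6) * 1000.0
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms


# From the spec table: 8 ms average read seek, 7200 RPM, 270 MB/s sustained read.
rotation_ms = avg_rotational_latency_ms(7200)
random_4kb_ms = disk_latency_ms(8.0, 7200, 270.0, 4096)
random_1mb_ms = disk_latency_ms(8.0, 7200, 270.0, 1_000_000)

print(f"average rotational latency: {rotation_ms:.2f} ms")
print(f"random 4 KB read: {random_4kb_ms:.2f} ms")
print(f"random 1 MB read: {random_1mb_ms:.2f} ms")
```

With these numbers, a small random read is dominated by seek and rotation; the transfer itself takes only a small fraction of a millisecond.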

7 of 35

File Systems

8 of 35

File System Abstraction

File system

Persistent, named data
Hierarchical organization (directories, subdirectories)
Access control on data

File: named collection of data

Linear sequence of bytes (or a set of sequences)
Read/write or memory mapped

Crash and storage error tolerance

Operating system crashes (and disk errors) leave file system in a valid state

On top of block-oriented storage devices (flash, magnetic disk)

Achieve close to the hardware performance limit in the average case despite complex performance/reliability device characteristics

That’s what file systems are for. The question: how do we build file system structures on top of hardware structures to give us something that is more useful for users?

This is the abstraction that we have on top of the storage layer. We want persistent, named data in the form of a file: I give it a name, and I should be able to find the data that I put in that file later on.

Most file systems are organized hierarchically. I want to be able to organize my data in a hierarchical fashion, with directories, subdirectories, etc.

I want access control: which users have access to which data, and which users don’t.

I want to read the data that I wrote. If I put data in the file, I want to get it back again.

As we discussed previously with bit error rates, this is not trivial. When we read data off of an SSD, we might get errors from it. Also, at any kind of scale (for example, in a cloud), we will have failures, such as a disk failing with data on it. You still want to read that data. The question is: how do you get back data that was written on some media that then crashes or becomes unavailable?

I want my data back exactly the way I wrote it, and the fact that some device failed shouldn’t affect me.

So I want this file that’s a named collection of data. Logically, it is a linear sequence of bytes.

I can read/write or memory map the file, but I also want to be storage error tolerant and crash tolerant: if the OS crashes while I am editing or doing something, that should leave the file system in a valid state.

In the average case, we want the best performance possible, despite the complex device characteristics.

9 of 35

File System Abstraction

Directory

Group of named files or subdirectories
Mapping from file name to file metadata location

Path

String that uniquely identifies file or directory
Ex: /cse/www/education/courses/cse451/26sp/index.html

Links

Hard link: one file name points to the same file metadata as another file
Soft link: one file name points to another file name

Mount

Mapping from name in one file system to root of another
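To make the hard link vs. soft link distinction concrete, here is a small sketch using Python's standard `os` module. It assumes a POSIX-like system where creating hard links and symbolic links is permitted; the file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, hard)      # hard link: a second name for the same file metadata
    os.symlink(original, soft)   # soft link: a name that stores another file name

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    soft_target = os.readlink(soft)

    os.unlink(original)          # remove only the first name

    with open(hard) as f:
        hard_contents = f.read()              # data is still reachable via the hard link
    soft_dangling = not os.path.exists(soft)  # the soft link now names a missing file

print(same_inode, hard_contents, soft_dangling)
```

The hard link keeps the data alive because it points at the file metadata directly, while the soft link breaks as soon as the name it stores goes away.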

10 of 35

UNIX File System API

create, link, unlink, createdir, rmdir

Create file, link to file, remove link
Create directory, remove directory

open, close, read, write, seek

Open/close a file for reading/writing
Seek resets current position

fsync

File modifications can be cached, written back to disk in any order
fsync forces modifications to disk (like a memory barrier)
Warning: programmers who use fsync very often use it incorrectly

You are probably very familiar with the file system API.
We have the ability to create, link, unlink, create directories, etc.
We can open, close, read, write: the normal things.
An interesting thing that you may not have thought about: what does it mean when you write data? Intuitively, when a system call that writes a block of data or creates a file returns, you might expect that the data is on disk, such that if the system crashes after that, I will still find the data on disk.
Answer: it doesn’t work that way. Disk is slow enough that writes are buffered. These are the UNIX semantics (also Windows and others). A write to the file system only buffers the data, which is eventually written to disk. If you want to ensure that the data is actually on disk, you need to do an additional thing called fsync(). This forces the file system to write the data back from the buffer to disk.
Example: if I save a copy of a file that is supposed to be a backup, the backup doesn’t help if I don’t force the write to disk. The activity is similar to a memory barrier: any changes made in a critical section are flushed to memory (not left in a register), and are visible to anyone who later acquires the critical section.
We don’t know what happens if there is a crash between the modification and the fsync(). But after fsync() returns, anyone trying to read the data I wrote before the fsync() will find it.
We will talk a bit about reliability of systems: how to ensure changes are not just persistent but also consistent, rather than having some parts of a change on disk and not others.
That is, how to program with persistent data so that it stays consistent despite failures.
To use fsync() correctly, you need to think about the ordering of operations, and which operations can be concurrent and which cannot.

A study looked at modern applications that use fsync(). E.g., git wants to make sure that git commit writes to disk, but it had a bug that could cause data loss even though it called fsync().
Most users assume that when I do reads and writes, they are carried out in the order in which I issue them. On disk, that is not guaranteed; you need to use fsync() for that.
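As a sketch of what "using fsync with attention to ordering" can look like, here is one commonly used pattern: write to a temporary file, fsync it, rename it into place, then fsync the directory. This is an illustration under assumptions (a POSIX-like system where a directory can be opened and fsync'd), not the method of any particular application; `durable_write` is a hypothetical helper name.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, forcing each step to disk in order."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's user-space buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its buffered data to the device

    os.replace(tmp_path, path)  # swap the new file into place under the final name

    # Persist the directory entry as well (assumed to be supported here).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "backup.dat")
    durable_write(target, b"important bytes")
    with open(target, "rb") as f:
        read_back = f.read()

print(read_back)
```

The point of the ordering is that the file contents are forced to disk before the name that refers to them, so a crash should leave either the old file or the complete new one.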

11 of 35

Named Data in a File System

File number = location of inode

inode contains table to find storage for any offset

In rough terms: we give the file system a path name (file name) and an offset (via open(), read()). The directory information gets us from the file name to the file number (the location of the inode). That is, the data in a directory gives the location of the inode. We will talk more about this at the end.

Next: given that I have the location of the information about a file, how do I find the storage block where that offset lives?

That is the translation from the inode number (the location on disk where information about the file is stored) to where the data for the file is stored.

The table in the inode needs to have information in it that allows us to find the storage that we are using for any offset. This is somewhat similar to multi-level page tables, which allow us to go from a virtual address to a physical address, which is the physical frame where I find the data.

This is similar. I am taking a virtual address within the file (the offset within the file) and translating it to a location on storage.

However, the data structures for memory are not the same as the data structures for disk. We will not use 4-5 levels of page tables to do this translation. We will do something different, because the characteristics of disk storage are different from those of memory.

In memory, we reference things with byte addressing. The cost of going to memory is relatively low, about 100 times slower than the CPU. Flash is about 100K times slower, and magnetic disk about 10 million times slower. As a consequence, I want to make as few accesses to disk as possible to keep performance reasonable. That will cause me to use different abstractions for what that lookup will be like. In general, it is still going to be tree oriented.
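The first half of the lookup, from path name to file number, can be sketched with a toy in-memory model. Everything here is made up for illustration: the inode numbers, the `directories` dictionary standing in for on-disk directory contents, and the `resolve` helper.

```python
ROOT_INODE = 2  # hypothetical file number of the root directory

# Toy "disk": directory file number -> {name: file number}.
directories = {
    2: {"cse": 5},
    5: {"www": 9},
    9: {"index.html": 14},
}


def resolve(path: str) -> int:
    """Walk the path one component at a time and return the file number."""
    inode = ROOT_INODE
    for name in path.strip("/").split("/"):
        if name == "":
            continue  # the path "/" has no components
        inode = directories[inode][name]  # KeyError if the name does not exist
    return inode


file_number = resolve("/cse/www/index.html")
print(file_number)
```

In a real file system each step of this walk can cost a disk access, which is one reason the notes stress keeping the number of accesses small.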

12 of 35

inode -> disk block

Index structure

How do we locate the blocks of a file? (why not use multilevel page tables?)

Index granularity

What block size do we use? (4KB vs. larger?)

Free space

How do we find unused blocks on disk?

Locality

Do we preserve spatial locality?

Reliability

What if machine crashes in middle of a file system operation?

13 of 35

File System Workload

File sizes

Are most files small or large?
Which accounts for more total storage: small or large files?

14 of 35

File System Workload

File sizes

Are most files small or large?

SMALL

Which accounts for more total storage: small or large files?

LARGE

15 of 35

File System Workload

File access

Are most accesses to small or large files?
Which accounts for more total I/O bytes: small or large files?

16 of 35

File System Workload

File access

Are most accesses to small or large files?

SMALL

Which accounts for more total I/O bytes: small or large files?

LARGE

17 of 35

File System Workload

Most files are read/written sequentially
Some files are read/written randomly

Ex: database files, swap files

Some files have a pre-defined size at creation
Most files start empty and grow over time

Ex: compiler output, system logs

18 of 35

File System Design Constraints

For small files:

Small blocks for storage efficiency
Concurrent ops more efficient than sequential
Files used together should be stored together

For large files:

Storage efficient (large blocks)
Contiguous allocation for sequential access
Efficient lookup for random access

May not know at file creation

Whether file will become small or large
Whether file is persistent or temporary
Whether file will be used sequentially or randomly

19 of 35

File System Design Options

	FAT	FFS	NTFS
Index structure	Linked list	Tree (fixed, asymmetric)	Tree (dynamic)
Index granularity	Block	Block	Extent
Free space allocation	FAT array	Bitmap (fixed location)	Bitmap (file)
Locality	Defragmentation	Block groups + reserve space	Extents, best fit, defragmentation

20 of 35

Microsoft File Allocation Table (FAT)

Linked list index structure

Simple, easy to implement
Still widely used (e.g., embedded devices, SD cards)

File table:

Linear map of all blocks on disk
Each file a linked list of blocks

21 of 35

FAT

22 of 35

FAT

Pros:

Easy to find free block
Easy to append to a file
Easy to delete a file

Cons:

Small file access is slow
Random access is very slow
Fragmentation

File blocks for a given file may be scattered
Files in the same directory may be scattered
Problem becomes worse as disk fills
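The FAT pros and cons above follow from the linked-list structure, which a small sketch makes visible. The table contents, the `FREE` and `END_OF_FILE` marker values, and the helper names are illustrative assumptions, not the real on-disk FAT encoding.

```python
END_OF_FILE = -1   # marks the last block of a file (illustrative value)
FREE = 0           # marks an unused block (illustrative value)

# fat[i] holds the number of the block that follows block i in its file.
# Block 0 is treated as reserved so that 0 can mean "free".
fat = [END_OF_FILE, FREE, 6, FREE, FREE, 2, 9, FREE, FREE, END_OF_FILE]
first_block = 5    # a directory entry would store this for the file


def blocks_of(start: int) -> list:
    """Follow the chain from the first block to the end of the file."""
    chain = []
    block = start
    while block != END_OF_FILE:
        chain.append(block)
        block = fat[block]
    return chain


def block_at(start: int, index: int) -> int:
    """Random access: must follow `index` links from the start."""
    block = start
    for _ in range(index):
        block = fat[block]
    return block


def append_block(start: int) -> int:
    """Find the first free entry and link it onto the end of the file."""
    new_block = fat.index(FREE)
    last = blocks_of(start)[-1]
    fat[last] = new_block
    fat[new_block] = END_OF_FILE
    return new_block


chain = blocks_of(first_block)
third = block_at(first_block, 2)
appended = append_block(first_block)
new_chain = blocks_of(first_block)
print(chain, third, appended, new_chain)
```

Finding a free block and appending are simple scans and pointer updates, but reaching block N of a file means walking N links, and the chain 5, 2, 6, 9, 1 shows how a file's blocks end up scattered.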

23 of 35

Berkeley UNIX FFS (Fast File System)

inode table

Analogous to FAT table

inode

Metadata

File owner, access permissions, access times, …

Set of 12 data pointers
With 4KB blocks => max size of 48KB files

24 of 35

FFS inode

Metadata

File owner, access permissions, access times, …

Set of 12 data pointers

With 4KB blocks => max size of 48KB files

Indirect block pointer

pointer to disk block of data pointers

Indirect block: 1K data blocks => 4MB (+48KB)

25 of 35

FFS inode

Metadata

File owner, access permissions, access times, …

Set of 12 data pointers

With 4KB blocks => max size of 48KB

Indirect block pointer

pointer to disk block of data pointers
4KB block size => 1K data blocks => 4MB

Doubly indirect block pointer

Doubly indirect block => 1K indirect blocks
4GB (+ 4MB + 48KB)

26 of 35

FFS inode

Metadata

File owner, access permissions, access times, …

Set of 12 data pointers

With 4KB blocks => max size of 48KB

Indirect block pointer

pointer to disk block of data pointers
4KB block size => 1K data blocks => 4MB

Doubly indirect block pointer

Doubly indirect block => 1K indirect blocks
4GB (+ 4MB + 48KB)

Triply indirect block pointer

Triply indirect block => 1K doubly indirect blocks
4TB (+ 4GB + 4MB + 48KB)
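The sizes on these slides can be checked with a few lines of arithmetic. The sketch assumes 4-byte block pointers, which is what makes a 4KB block hold 1K pointers; the variable names are illustrative.

```python
BLOCK_SIZE = 4096      # 4KB blocks, as on the slide
POINTER_SIZE = 4       # assumption: 4-byte block pointers
DIRECT_POINTERS = 12

pointers_per_block = BLOCK_SIZE // POINTER_SIZE   # 1K pointers per index block

direct_bytes = DIRECT_POINTERS * BLOCK_SIZE             # 48KB
indirect_bytes = pointers_per_block * BLOCK_SIZE        # 4MB
double_bytes = pointers_per_block ** 2 * BLOCK_SIZE     # 4GB
triple_bytes = pointers_per_block ** 3 * BLOCK_SIZE     # 4TB

max_file_bytes = direct_bytes + indirect_bytes + double_bytes + triple_bytes

print(direct_bytes // 2**10, "KB direct")
print(indirect_bytes // 2**20, "MB via the indirect block")
print(double_bytes // 2**30, "GB via the doubly indirect block")
print(triple_bytes // 2**40, "TB via the triply indirect block")
```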

27 of 35

28 of 35

FFS Asymmetric Tree

Small files: shallow tree

Efficient storage for small files

Large files: deep tree

Efficient lookup for random access in large files

Sparse files: only fill pointers if needed
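To see why the tree is shallow for small files and deep only for large ones, the sketch below computes how many levels of index blocks sit between the inode and the data block for a given file offset. It assumes 4KB blocks, 4-byte pointers, and 12 direct pointers; `indirection_level` is a hypothetical helper.

```python
def indirection_level(offset: int, block_size: int = 4096,
                      pointer_size: int = 4, direct: int = 12) -> int:
    """Return 0 for a direct pointer, 1/2/3 for singly/doubly/triply indirect."""
    per_block = block_size // pointer_size
    block_index = offset // block_size
    limit = direct                      # number of blocks covered at this level
    for level in range(4):
        if block_index < limit:
            return level
        block_index -= limit            # skip past the blocks this level covers
        limit = per_block ** (level + 1)
    raise ValueError("offset beyond maximum file size")


KB, MB, GB = 2**10, 2**20, 2**30
levels = [
    indirection_level(0),                        # first byte of the file
    indirection_level(48 * KB),                  # first byte past the direct blocks
    indirection_level(48 * KB + 4 * MB),         # first byte past the indirect block
    indirection_level(48 * KB + 4 * MB + 4 * GB),
]
print(levels)
```

Ignoring caching, the level is the number of index blocks that must be read before the data block itself, so small files pay nothing extra while random access deep into a large file costs at most three extra reads.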

29 of 35

FFS Locality

Block group allocation

Block group is a set of nearby cylinders
Files in same directory located in same group
Subdirectories located in different block groups

inode table spread throughout disk

inodes, bitmap near file blocks

First fit allocation

Small files fragmented, large files contiguous
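A simplified model of first fit within one block group: scan the free-space bitmap from the start and claim the first free blocks found. The bitmap layout and the `first_fit` helper are illustrative assumptions, not the actual FFS allocator.

```python
def first_fit(bitmap: list, count: int) -> list:
    """Claim the first `count` free blocks, scanning from the start of the group."""
    chosen = [i for i, used in enumerate(bitmap) if not used][:count]
    if len(chosen) < count:
        raise RuntimeError("not enough free blocks in this block group")
    for i in chosen:
        bitmap[i] = True   # mark the block as in use
    return chosen


# True = in use. Two small holes near the front, a long free run at the end.
group = [True, False, True, True, False, True] + [False] * 10

small_file = first_fit(group, 2)   # fills the scattered holes
large_file = first_fit(group, 6)   # lands in the contiguous free run

print(small_file, large_file)
```

In this run the small file ends up in two separate holes, while the large file, allocated once the holes are gone, gets a contiguous run.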

30 of 35

31 of 35

FFS First Fit Block Allocation

32 of 35

FFS First Fit Block Allocation

33 of 35

FFS First Fit Block Allocation

34 of 35

FFS

Pros

Efficient storage for both small and large files
Locality for both small and large files
Locality for metadata and data

Cons

Inefficient for tiny files (a 1 byte file requires both an inode and a data block)
Inefficient encoding when file is mostly contiguous on disk
Need to reserve 10-20% of free space to prevent fragmentation

35 of 35

NTFS

Master File Table

Flexible 1KB storage for metadata and data

Extents

Block pointers cover runs of blocks
Similar approach in Linux ext4
File create can provide hint as to size of file
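A minimal sketch of how extents map a file block to a disk block: each extent covers a run of blocks, so one entry replaces many per-block pointers. The extent tuples and the `disk_block` helper are made up for illustration and do not reflect the actual NTFS or ext4 on-disk format.

```python
from bisect import bisect_right

# Each extent: (first file block, first disk block, length in blocks).
# Values are invented for illustration; the list is sorted by first file block.
extents = [(0, 1000, 8), (8, 5000, 100), (108, 7200, 4)]
starts = [e[0] for e in extents]


def disk_block(file_block: int) -> int:
    """Find the extent covering `file_block` and return the disk block."""
    i = bisect_right(starts, file_block) - 1
    if i < 0:
        raise IndexError("block not mapped")
    file_start, disk_start, length = extents[i]
    if file_block >= file_start + length:
        raise IndexError("block not mapped")
    return disk_start + (file_block - file_start)


lookups = [disk_block(0), disk_block(7), disk_block(8), disk_block(110)]
print(lookups)
```

Three entries here describe 112 blocks, which is the encoding win over one pointer per block when a file is mostly contiguous on disk.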

Journalling for reliability

Next lecture