2 of 46

Filesystems

Filesystem

“ Block”

Fundamental allocation unit (i.e. smallest size) a Filesystem uses

Remember: vs a Sector which is the smallest individual reference-able region on a Disk

Main Filesystem tasks:

Persistence of Data & its structure

Associate sets of Data Blocks with names (Files)

Associate names with each other (Directories)

Can implement Filesystems on Disk, over Network, in Memory, in Non-Volatile RAM (NVRAM), on Tape, …

Focus on Disk and generalize later

CS446/646 C. Papachristos

3 of 46

Filesystems

File

Associated Blocks on Disk, with a name

Data with some properties
Contents, size, owner, last read/write time, protection, etc.

How a File’s data is managed by the Filesystem (more later)

Basic idea (in Unix):�A struct called an Index Node : The “inode”

Describes where on the Disk the Blocks for a given File are located

inode struct also holds some extra information (Metadata)

Disk stores an array of inodes : The “inode Table”

inode # is the index in the inode Table : The “i-number”

CS446/646 C. Papachristos

4 of 46

Filesystems

Filetypes

A File can also have a Type

Understood by the Filesystem

Block Device, Character Device, UNIX Socket, Symbolic Link (more later), etc.

Understood by other parts of the OS or runtime libraries

executable, dll, source, object, text, etc.

A File’s Type can be encoded in its Filename or Contents

Windows encodes type in name (.com, .exe, .bat, .dll, etc.)

Unix also encodes type in contents (magic numbers, initial characters)

e.g. #! for Shellscripts

CS446/646 C. Papachristos

5 of 46

Filesystems

Basic Filesystem Operations

Unix

creat(name) (Note: Redundant due to open())

rename(oldname, newname)

open(name, how)

read(fd, buf, len)

write(fd, buf, len)

truncate(fd, len)

sync(fd)

lseek(fd, pos)

close(fd)

unlink(name)

CS446/646 C. Papachristos

Windows

CreateFile(name, CREATE)

CreateFile(name, OPEN)

ReadFile(handle, …)

WriteFile(handle, …)

FlushFileBuffers(handle, …)

SetFilePointer(handle, …)

CloseHandle(handle, …)

DeleteFile(name)

CopyFile(name)

MoveFile(name)

6 of 46

Filesystems

CS446/646 C. Papachristos

For our Lectures

7 of 46

Filesystems

Directories

Problem:�How to reference Files

Should users remember where on Disk their Files are located (i.e. Disk Sector Number) ?

Human interface means we also need human-readable names
We use Directories to map names to Files

Remember: A File represents a set of associated Data Blocks, it does not have a name

�Directories serve two purposes

For users, they provide a structured way to organize Files
For Filesystem, they provide a convenient naming interface that allows the separation of Logical Organization of Files, from Physical Placement on the Disk

CS446/646 C. Papachristos

8 of 46

Filesystems

Hierarchical Directories

Used since CTSS (1960s)

Unix picked up and used nicely

Large Namespaces tend to be Hierarchical

IP Addresses, Domain Names, scoping in Programming Languages, etc.

CS446/646 C. Papachristos

/ (FileSystem Root Directory)

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

9 of 46

Filesystems

Directory Implementation

A Directory is a list of “Directory Entries”

Directory Entry:

<name, inode#> tuple: The inode # gets us to an inode, which contains a Disk location

Directories stored on Disk just like regular Files

Its Filetype is: Directory
Users can read its Data just like any other File
Only special System Calls can write to its Data
Data is list of Directory Entries that point to other Disk locations (through inode#s)
File pointed to by an inode #, may also be a Directory (inode Type is Directory)
Filesystem becomes a Hierarchical Tree

Simple, and any speedup in File operations also speeds up Directory operations

CS446/646 C. Papachristos

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

<afs,1021>

<tmp,1020>

<bin,1022>

<cdrom,4123>

<dev,1001>

<sbin,1011>

…

File contents for /

10 of 46

Filesystems

Pathname Translation (simple & high-level overview)

To open File /one/two/three the Filesystem will translate as follows:

Directory Entries map Filenames to Disk location(s) (via the inode # that gets us to an inode)

Access Directory "/": Root Directory is always inode #2 (more later)

Search its Data (list of Directory Entries) for the Directory Entry named "one", �get inode # and then inode of "one", get its Disk location

Access Directory one, search its Data (list of Directory Entries) for the one named "two",�get inode # and then inode of "two", get its Disk location

Access Directory two, search its Data (list of Directory Entries) for the one named "three",�get inode # and then inode of "three", get its Disk location

Access File three

CS446/646 C. Papachristos

11 of 46

Filesystems

Unix Example

/a/b/c.c :

CS446/646 C. Papachristos

12 of 46

Filesystems

Unix inodes and Path Search (more detailed overview)

Unix inodes are not Directories

An inode describes where on the Disk the Data Blocks for a File are located
A Directory is like a File, so its inode just describes the location of its Data Blocks on Disk; it’s just that a Directory’s Data Blocks actually store a list of its Directory Entries

Each Directory Entry maps a Filename to an inode

For a not-so-elaborate Filesystem structure, to open File "/one":

Read inode Table (more later) into Memory and find for "/" its inode and then its Disk Block Number(s)�That is the location of its Data Block(s) on Disk
Read the Data Block of "/" into Memory, then look into its Data for a Directory Entry named "one"�This Directory Entry contains the inode# for "one"
Find the inode for "one" in the inode Table (should be regular File) and then its Disk Block Number(s)�That is the location of its Data Block(s) on Disk
Read the Data Block(s) into Memory to access the File Data of "one"

CS446/646 C. Papachristos

What does it mean to “Access” a Directory?

13 of 46

Filesystems

Naming

Bootstrapping: Where does (File)Name search start from?

Root Directory always inode #2

Special names:

Root directory: /
Current directory: .
Parent directory: ..

Some special aliases are provided by the Shell, not the Filesystem:

User’s Home Directory: ∼
Globbing: foo.* (expands to all files starting with foo.)

Using the given naming, only need two operations to navigate the entire Namespace:

cd <name> : Move into (change Current Working Directory context to) Directory Name
ls : Enumerate all names in Current Working Directory (context)

CS446/646 C. Papachristos

Note:

#0 used as a special NULL value (indicates an inode does not exist); i.e. there is no inode #0
First inode is inode #1, and is reserved for�recording defective Blocks on the Disk

14 of 46

Filesystems

Basic Directory Operations

Unix

Directories implemented in Files

At the Filesystem level, File operations�are used to create Directories

C-library provides a higher-level�abstraction:

opendir(name)

readdir(DIR*)

seekdir(DIR*)

closedir(DIR*)

CS446/646 C. Papachristos

Windows

Explicit Directory operations

CreateDirectory(name)

RemoveDirectory(name)

Very different method for reading Directory Entries

FindFirstFile(pattern)

FindNextFile()

15 of 46

Filesystems

Default Context: Working Directory

Cumbersome to always have to specify full Pathnames

In Unix, each process has a “Current Working Directory” (cwd)
Filenames not beginning with / are assumed to be relative to the cwd

Otherwise translation happens as before

A Shell also tracks a default list of active Environment Contexts

A “Search Path” for the Programs you are running

( : -separated list of Pathnames)

Given a Search Path A:B:C , the Shell will check in A, then B, then C
Can get around this by using explicit paths, e.g: ./foo

Example of Locality

CS446/646 C. Papachristos

16 of 46

Filesystems

Hardlinks

More than one Directory Entries can refer to a given File

“Hardlink” creates a synonym for a File (i.e. creates another Directory Entry for it)
Unix stores count of references (Hardlinks) to the inode (of that File) itself
If one of the Hardlinks is removed (e.g. using rm), the data are still accessible through any other Hardlink that remains
If all Hardlinks are removed, then the space of that File’s Data is considered as freed

CS446/646 C. Papachristos

foo

inode #31279

links_count=1

ln foo bar

Existing File

Hardlink to create

foo

bar

inode #31279

links_count=2

17 of 46

Filesystems

Softlinks / Symlinks

Softlinks / Sym(bolic)links : Synonyms for Filenames

Point to a File(/Dir)name, but object can be deleted from underneath it

i.e. does not even have to exist

Unix implements these as inodes (just like with Directories):

inode Type indicates it is a Symlink, and its Data contain name of Symlink target

When the Filesystem encounters a Symlink Type inode, it automatically translates it

CS446/646 C. Papachristos

foo

bar

ln –s bar barz

barz

Existing File

Symlink to create

inode #31279

links_count=2

File Data

"bar"

inode #35382

links_count=1

18 of 46

Filesystems

File Sharing

File Sharing has been around since Time-sharing

Easy to do on a single machine
PCs, workstations, and networks get us there (mostly)

File Sharing is important for getting work done

Basis for Communication and Synchronization

2 key issues when sharing Files:

1) Semantics of Concurrent Access

What happens when one Process reads while another writes?
What happens when two Processes open a File for writing?
What should be used to coordinate these?

2) Protection

CS446/646 C. Papachristos

19 of 46

Filesystems

Protection

Filesystems implement a Protection system to be able to manage:

Who can access a File
What access rights they have (read/write/etc.)

A Protection system dictates whether a given Action performed by a given Subject �on a given Object should be allowed

Examples:

Your User account can read and/or write your own Files, but other Users cannot
You can read /etc/motd , but you cannot write to it

CS446/646 C. Papachristos

20 of 46

Filesystems

Protection – Conceptual Representation

1) Access Control Lists (ACL)

For each Object, maintain a�list of Subjects and their�permitted Actions

CS446/646 C. Papachristos

2) Capabilities

For each Subject, maintain a list of Objects and the permitted Actions

	/one	/two	/three
root	rw	rw	rw
alice	rw	-	r
bob	w	r	rw

Objects

Subjects

ACL

Capability

21 of 46

Filesystems

ACLs and Capabilities

Capabilities are easier to transfer

“A Capability is a token, ticket, or key that gives the possessor permission to�access an entity or Object in a computer system” (Dennis and Van Horn, 1966)

ACLs are easier to manage

Object-centric, easy to grant, revoke
To revoke Capabilities, have to keep track of all Subjects that have the Capability (ticket)

A challenging problem

But, ACLs have a problem when Objects are heavily shared (i.e. many Subjects (/Users))

The ACLs become very large
Mitigation:

Use Subject (/User) Groups (e.g. in Unix)

CS446/646 C. Papachristos

22 of 46

Filesystems

Unix File Protection

Unix uses both Protection approaches in its Filesystem

1) ACLs:�https://linux.die.net/man/5/acl

Example: Unix File Permissions

View ACL: getfacl <filename>

2) Capabilities:

Example: File Descriptors

How are 1) and 2) used together?

e.g. making an open() System Call which gets us a File Descriptor

CS446/646 C. Papachristos

int fd = open("file.txt", O_WRONLY);

if (fd == -1)

exit(-1);

for (int i = 0; i < 100; i++)

write(fd, buf + i * 4, 4);

ACL check, expensive

Capability

Use Capability from now on

23 of 46

Filesystems

Filesystem Implementation

A File Descriptor is an application of Capabilities

When a File is open()ed, a File Descriptor is created and stored in the “fdtable”

Each Process has a files_struct.fdtable, which is stored in Kernel-Space

The User-Space Application is only provided with the index of the File Descriptor

We often call this index itself the File Descriptor, actually just an index to the fdtable.fd

The fdtable.fd is actually a Capability list (array of Kernel-space struct file*)

Each struct file entry contains a Permission part that describes what the Process can do to this File; it also contains a struct inode* that corresponds to the File’s inode

Note: A struct file in the Process’ fdtable.fd corresponds to a Capability that allows the Process to access the File in a specific way, even after a change in the ACL of the File

(i.e. the “token / ticket / key” is not revoked)

CS446/646 C. Papachristos

24 of 46

Filesystems

Filesystems vs Virtual Memory

CS446/646 C. Papachristos

25 of 46

Filesystems

CS446/646 C. Papachristos

26 of 46

Filesystems

Filesystems vs Virtual Memory

Trends:

Disk Bandwidth and cost-per-bit improving exponentially

Similar to CPU speed, Memory size, etc.

Seek Time and Rotational Delay improving very slowly

Require mechanical motion (e.g. disk arm)

Disk access is a huge system bottleneck & getting worse

Bandwidth increase lets system (pre-)fetch large chunks for about the same cost as a smaller chunk
Trade Bandwidth for Latency if you can get lots of related stuff

Desktop Memory size increasing faster than typical workloads

More and more of workload fits in File Cache
Disk traffic changes: Mostly writes and new data

Memory and CPU resources increasing

Use Memory and CPU to make better decisions
Complex prefetching to support more I/O patterns
Delay in Data Placement decisions reduces random I/O

CS446/646 C. Papachristos

27 of 46

Filesystems

Filesystems vs Virtual Memory

Goal: Operations should have as few Disk Accesses as possible & at minimal Space overhead

i.e. by grouping related things

What’s hard about grouping Blocks?

Remember: Virtual Memory Pages and Process Locality

Like Page Tables, the Filesystem Metadata construct mappings

e.g. Page Table: Provides mapping of Virtual Page Number to Physical Page Number

File Metadata:�Provide mapping of Byte Offset to Disk Block Address
Directory Data (i.e. List of Directory Entries):�Provide mapping of Name to inode # (i-number)

CS446/646 C. Papachristos

28 of 46

Filesystems

Filesystems vs Virtual Memory

In both Filesystems and Virtual Memory management, keeping operations location-agnostic is desired

Application shouldn’t care about particular Disk Blocks or Physical Memory locations

In some ways, Filesystems have an easier job than Virtual Memory:

CPU time to do Filesystem mappings not a big deal
No Virtual Address Space invalidation on Context Switch, TLB Misses, etc.
Page Tables deal with sparse Address Spaces and Random Access
Files often denser ([0 … filesize − 1] ∼ Sequentially Accessed)

In some ways, a Filesystem’s problem is harder:

Each layer of translation = potential Disk Access
Space a huge premium! (But Disk can be huge)

Disk Cache Space never enough; amount of Data you can get in a single fetch never enough

Range very extreme: Many Files < 10 KB-long, other Files GB-long

CS446/646 C. Papachristos

29 of 46

Filesystems

Filesystems vs Virtual Memory

Some Working Intuitions:

Filesystem performance dominated by # of Disk Accesses

If each Disk Access costs ∼10 milliseconds, touching the Disk 100 times = 1 second
Can perform a billion ALU operations in same time

Disk Access cost dominated by Mechanical Motion, not Transfer:

1 Sector: 5𝑚𝑠 + 4𝑚𝑠 + 5μ𝑠 (≈ 512 𝐵/(100 𝑀𝐵/𝑠)) ≈ 9𝑚𝑠
50 Sectors: 5𝑚𝑠 + 4𝑚𝑠 + .25𝑚𝑠 = 9.25𝑚𝑠
Can get 50x the data for only ∼3% more overhead

Important Observations:

1) All Blocks in same File tend to be used together, sequentially

2) All Names in same Directory tend to be used together

3) All Files under same Directory tend to be accessed together

CS446/646 C. Papachristos

30 of 46

Filesystems

Problem: How to Track a File’s Data

Disk management:

Need to keep track of where File contents are on Disk
Must be able to use this to map Byte Offset to Disk Block Address

The “Index Node” –or– “inode”

A structure that tracks a File’s Disk Sectors

Note: The inodes must be stored on Disk as well (Remember: Persistence)

Things to keep in mind while designing File Data storage structure:

Most Files are small
Much of the Disk is allocated to large Files
Many of the I/O operations are made to large Files
Want good Sequential Access and good Random Access

Each imposes its own requirements

CS446/646 C. Papachristos

31 of 46

Filesystems

Idea 1: Contiguous Allocation

“Extent-based”: Allocate Files similarly to Virtual Memory Segmentation model

When creating a File, do “Allocate-on-Flush”
inode contents: Location, Size

Example: IBM OS/360, B-Tree FS (BTRFS)

Advantages

Simple, fast Access, both Sequential and Random

Disadvantages (think of corresponding Virtual Memory Segmentation scheme)

Files may not dynamically grow after creation (unless Extents hybridized, see BTRFS)
External Fragmentation

CS446/646 C. Papachristos

Fragmentation:�What if File c needs 2 Sectors

32 of 46

Filesystems

Idea 2: Linked Files

Basically a Singly-Linked List on Disk

Keep a Singly-Linked List of all Free Blocks
inode contents: a pointer to File’s first Block
In each Block, keep a pointer to the next one

Examples (sort-of): Alto, TOPS-10, DOS FAT

Advantages

Easy dynamic growth & Sequential Access, no External Fragmentation

Disadvantages

Linked Lists on Disk a bad idea because of Access Times
Random Access very slow (e.g. have to traverse whole File to Access last Block)
Pointers necessary in every Block, skewing Data alignment

CS446/646 C. Papachristos

A single File’s Clusters are “Fragmented” throughout the Disk

33 of 46

Filesystems

Idea 2 (fixed): File Allocation Table (FAT) – Example : DOS Filesystem

Linked Files with key optimization:

Links are separately placed in fixed-size “File Allocation Table” (FAT), rather than indi-vidually inside each Block
Still require Pointer chasing, but can cache entire FAT in Memory (cheaper than Disk Access)

CS446/646 C. Papachristos

34 of 46

Filesystems

CS446/646 C. Papachristos

35 of 46

Filesystems

Idea 3: Indexed Files

Have an Array that holds all of a File’s Block locations (Disk Linear Addresses)

Similarly to a Page Table, so will have similar issues
Max File Size determined by Block Array’s size

Static or Dynamic Block Array?

Allocate Block Array to hold File’s Block locations�on File creation
Allocate actual Blocks on demand using Freelist of the Disk’s Blocks

Advantages

Both Sequential and Random Access easy

Disadvantages

The Mapping Table (for all Files’ Arrays) requires large chunk of Contiguous Space
Same problem we were trying to solve initially

CS446/646 C. Papachristos

36 of 46

Filesystems

CS446/646 C. Papachristos

File Index Array

Disk Blocks

(Determined by desired Max File Size)

37 of 46

Filesystems

Idea 3 (fixed): Multi-Level Indexed Files – Example : Unix inodes

Unix inode:

inode = 15 Block Pointers (+ Metadata)

First 12 are Direct Blocks

Giving faster Access�performance for initial Blocks

Then 1 Single-indirect Block,� 1 Double-indirect Block,� 1 Triple-indirect Block�

�

CS446/646 C. Papachristos

inode

Metadata

Note: �Indirect Blocks: Contain BLKSIZE/PTRSIZE amount of Block Entries

38 of 46

Filesystems

More about inodes

Note: Information about Type (File/Directory/Symlink/Character Device/etc.) could be stored in corresponding � Directory Entry that points to this inode, e.g.: <inode, file_type, name_len, name>

� �

�

CS446/646 C. Papachristos

struct inode

Data Block

Indirect

Data Block

type_perm (Type & Permissions)

uid (Owner)

gid (Group)

size (of File in Bytes)

blocks (# Blocks of this File – whether

atime (Access Time)

mtime (Modification Time)

ctime (Creation Time)

dtime (Deletion Time)

links_count (Hardlinks – # of Paths)

block[15] (12+1+1+1 Data Blocks)

used or not)

39 of 46

Filesystems

More about inodes

The “inode Table” : Array that stores inodes (one inode for every File)

(Similar functionality to the Mapping Table for all Files on the Filesystem)

inode Table: Fixed-size (can support max # of inodes) when Disk is initialized; can’t be changed
Lives in known Location, originally at outer edge of Disk

This resulted in long seeks between inodes and File Data, improved by having many small inode Tables spread across the Disk

The index of an inode in the inode Table called the “i-number”

Internally, the OS refers to Files by i-number
When File is opened, its inode is read into Memory
Written back to Disk when modified and File closed, or timeout elapses

CS446/646 C. Papachristos

40 of 46

Filesystems

More about inodes

Locality Problem: inode Table located at start of Disk

CS446/646 C. Papachristos

41 of 46

Filesystems

Remember: Unix inodes and Path Search

Unix inodes are not Directories

An inode describes where on the Disk the Data Blocks for a File are located
A Directory is like a File, so its inode just describes the location of its Data Blocks

Remember: Its Data is a List of Directory Entries

Directory Entries map Filenames to inodes

To open /one, read inode #2 ( "/" ) from inode Table on Disk into Memory
Read Data (Directory Entries) of "/" from Disk into Memory, find Entry for "one"
This Entry gives the inode # for "one"
From the inode Table again, find the inode for "one"
The inode says where the Data Blocks of "one" are on Disk
Read its first Block into Memory to access the initial Data stored by the File

CS446/646 C. Papachristos

Note: 4 Disk Accesses required

42 of 46

Filesystems

Unix Example: /a/b/c.c

inode #2 holds location (Data Blocks) with Directory Entries under /
inode #3 holds location (Data Blocks) with Directory Entries under a
inode #5 holds location (Data Blocks) with Directory Entries under b
inode #14 holds location (Data Blocks) with File Data of c.c

CS446/646 C. Papachristos

43 of 46

Filesystems

Write Caching

OSes typically also do “Write-Back Caching”

Maintain a queue of Uncommitted Blocks
Periodically flush the queue to Disk (30 second threshold)
If Blocks changed many times in 30 secs, only need one I/O
If Blocks deleted before 30 secs (e.g. /tmp), no I/O needed

Unreliable, but practical

On a crash, all writes within last 30 secs are lost
Modern OSes do this by default; too slow otherwise

System Calls exist to enable Applications to force-write data to Disk

e.g. fsync(): Flush OS Buffers to Physical Media (Device)

CS446/646 C. Papachristos

44 of 46

Filesystems

Read Caching – Read-Ahead (Prefetching)

Many Filesystems also implement “Read-Ahead”

Filesystem “predicts” that the Process will request a next Block
Filesystem requests it from the Disk ahead of time
This can happen while the Process is executing on the previous Block

Overlap I/O with execution

When the Process requests next Block, it will already be in the Read-Ahead Cache
Complements the Disk (Hardware) Cache

Remember: Disk is also is doing Read-Ahead

For Sequentially-Accessed Files can be a big win

Unless Blocks for the File are scattered across the Disk
Filesystems try to prevent that though (through proper File allocation)

CS446/646 C. Papachristos

45 of 46

Filesystems

Read Caching – File Buffering

Disk operations are slow, but Applications exhibit Locality for reading and writing Files

Idea: Cache already accessed File Blocks into Memory to leverage Locality

Called the “File Buffer Cache”
File Buffer Cache is system-wide, used and shared by all Processes
Reading from the File Buffer Cache helps Disk-based operations to perform more like Memory-based ones
Even a small File Buffer Cache can be very effective

Issues

The File Buffer Cache competes with Virtual Memory for Space (tradeoff here)
Like with Virtual Memory, there are actual limitations (due to Physical Memory)
Need Replacement Algorithms again (LRU usually used)

CS446/646 C. Papachristos

1 of 46

2 of 46

3 of 46

4 of 46

5 of 46

6 of 46

7 of 46

8 of 46

9 of 46

10 of 46

11 of 46

12 of 46

13 of 46

14 of 46

15 of 46

16 of 46

17 of 46

18 of 46

19 of 46

20 of 46

21 of 46

22 of 46

23 of 46

24 of 46

25 of 46

26 of 46

27 of 46

28 of 46

29 of 46

30 of 46

31 of 46

32 of 46

33 of 46

34 of 46

35 of 46

36 of 46

37 of 46

38 of 46

39 of 46

40 of 46

41 of 46

42 of 46

43 of 46

44 of 46

45 of 46

46 of 46