1 of 46

Filesystems

CS-446/646

C. Papachristos

Robotic Workers (RoboWork) Lab

University of Nevada, Reno

2 of 46

Filesystems

Filesystem

Block”

  • Fundamental allocation unit (i.e. smallest size) a Filesystem uses
    • Remember: vs a Sector which is the smallest individual reference-able region on a Disk

Main Filesystem tasks:

  • Persistence of Data & its structure

  • Associate sets of Data Blocks with names (Files)

  • Associate names with each other (Directories)

  • Can implement Filesystems on Disk, over Network, in Memory, in Non-Volatile RAM (NVRAM), on Tape, …
    • Focus on Disk and generalize later

CS446/646 C. Papachristos

3 of 46

Filesystems

File

Associated Blocks on Disk, with a name

  • Data with some properties
  • Contents, size, owner, last read/write time, protection, etc.

How a File’s data is managed by the Filesystem (more later)

  • Basic idea (in Unix):�A struct called an Index Node : The “inode
    • Describes where on the Disk the Blocks for a given File are located
      • inode struct also holds some extra information (Metadata)

    • Disk stores an array of inodes : The “inode Table

    • inode # is the index in the inode Table : The “i-number

CS446/646 C. Papachristos

4 of 46

Filesystems

Filetypes

A File can also have a Type

  • Understood by the Filesystem
    • Block Device, Character Device, UNIX Socket, Symbolic Link (more later), etc.
  • Understood by other parts of the OS or runtime libraries
    • executable, dll, source, object, text, etc.

A File’s Type can be encoded in its Filename or Contents

  • Windows encodes type in name (.com, .exe, .bat, .dll, etc.)

  • Unix also encodes type in contents (magic numbers, initial characters)
    • e.g. #! for Shellscripts

CS446/646 C. Papachristos

5 of 46

Filesystems

Basic Filesystem Operations

Unix

creat(name) (Note: Redundant due to open())

rename(oldname, newname)

open(name, how)

read(fd, buf, len)

write(fd, buf, len)

truncate(fd, len)

sync(fd)

lseek(fd, pos)

close(fd)

unlink(name)

CS446/646 C. Papachristos

Windows

CreateFile(name, CREATE)

CreateFile(name, OPEN)

ReadFile(handle, …)

WriteFile(handle, …)

FlushFileBuffers(handle, …)

SetFilePointer(handle, …)

CloseHandle(handle, …)

DeleteFile(name)

CopyFile(name)

MoveFile(name)

6 of 46

Filesystems

 

CS446/646 C. Papachristos

For our Lectures

7 of 46

Filesystems

Directories

  • Problem:�How to reference Files

Should users remember where on Disk their Files are located (i.e. Disk Sector Number) ?

  • Human interface means we also need human-readable names
  • We use Directories to map names to Files
    • Remember: A File represents a set of associated Data Blocks, it does not have a name

Directories serve two purposes

  • For users, they provide a structured way to organize Files
  • For Filesystem, they provide a convenient naming interface that allows the separation of Logical Organization of Files, from Physical Placement on the Disk

CS446/646 C. Papachristos

8 of 46

Filesystems

Hierarchical Directories

  • Used since CTSS (1960s)
    • Unix picked up and used nicely

  • Large Namespaces tend to be Hierarchical
    • IP Addresses, Domain Names, scoping in Programming Languages, etc.

CS446/646 C. Papachristos

/ (FileSystem Root Directory)

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

9 of 46

Filesystems

Directory Implementation

A Directory is a list of “Directory Entries

Directory Entry:

  • <name, inode#> tuple: The inode # gets us to an inode, which contains a Disk location

  • Directories stored on Disk just like regular Files

    • Its Filetype is: Directory
    • Users can read its Data just like any other File
    • Only special System Calls can write to its Data
    • Data is list of Directory Entries that point to other Disk locations (through inode#s)
    • File pointed to by an inode #, may also be a Directory (inode Type is Directory)
    • Filesystem becomes a Hierarchical Tree

  • Simple, and any speedup in File operations also speeds up Directory operations

CS446/646 C. Papachristos

/

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

<afs,1021>

<tmp,1020>

<bin,1022>

<cdrom,4123>

<dev,1001>

<sbin,1011>

File contents for /

10 of 46

Filesystems

Pathname Translation (simple & high-level overview)

  • To open File /one/two/three the Filesystem will translate as follows:

  • Directory Entries map Filenames to Disk location(s) (via the inode # that gets us to an inode)

  • Access Directory "/": Root Directory is always inode #2 (more later)

  • Search its Data (list of Directory Entries) for the Directory Entry named "one", �get inode # and then inode of "one", get its Disk location

  • Access Directory one, search its Data (list of Directory Entries) for the one named "two",�get inode # and then inode of "two", get its Disk location

  • Access Directory two, search its Data (list of Directory Entries) for the one named "three",�get inode # and then inode of "three", get its Disk location

  • Access File three

CS446/646 C. Papachristos

11 of 46

Filesystems

Unix Example

  • /a/b/c.c :

CS446/646 C. Papachristos

12 of 46

Filesystems

Unix inodes and Path Search (more detailed overview)

  • Unix inodes are not Directories
    • An inode describes where on the Disk the Data Blocks for a File are located
    • A Directory is like a File, so its inode just describes the location of its Data Blocks on Disk; it’s just that a Directory’s Data Blocks actually store a list of its Directory Entries

  • Each Directory Entry maps a Filename to an inode

For a not-so-elaborate Filesystem structure, to open File "/one":

    • Read inode Table (more later) into Memory and find for "/" its inode and then its Disk Block Number(s)�That is the location of its Data Block(s) on Disk
    • Read the Data Block of "/" into Memory, then look into its Data for a Directory Entry named "one"This Directory Entry contains the inode# for "one"
    • Find the inode for "one" in the inode Table (should be regular File) and then its Disk Block Number(s)That is the location of its Data Block(s) on Disk
    • Read the Data Block(s) into Memory to access the File Data of "one"

CS446/646 C. Papachristos

What does it mean to “Access” a Directory?

13 of 46

Filesystems

Naming

  • Bootstrapping: Where does (File)Name search start from?
    • Root Directory always inode #2

  • Special names:
    • Root directory: /
    • Current directory: .
    • Parent directory: ..

  • Some special aliases are provided by the Shell, not the Filesystem:
    • User’s Home Directory:
    • Globbing: foo.* (expands to all files starting with foo.)

  • Using the given naming, only need two operations to navigate the entire Namespace:
    • cd <name> : Move into (change Current Working Directory context to) Directory Name
    • ls : Enumerate all names in Current Working Directory (context)

CS446/646 C. Papachristos

Note:

  • #0 used as a special NULL value (indicates an inode does not exist); i.e. there is no inode #0
  • First inode is inode #1, and is reserved for�recording defective Blocks on the Disk

 

14 of 46

Filesystems

Basic Directory Operations

Unix

Directories implemented in Files

  • At the Filesystem level, File operations�are used to create Directories

  • C-library provides a higher-level�abstraction:

opendir(name)

readdir(DIR*)

seekdir(DIR*)

closedir(DIR*)

CS446/646 C. Papachristos

Windows

Explicit Directory operations

CreateDirectory(name)

RemoveDirectory(name)

Very different method for reading Directory Entries

FindFirstFile(pattern)

FindNextFile()

15 of 46

Filesystems

Default Context: Working Directory

  • Cumbersome to always have to specify full Pathnames
    • In Unix, each process has a “Current Working Directory” (cwd)
    • Filenames not beginning with / are assumed to be relative to the cwd
      • Otherwise translation happens as before

  • A Shell also tracks a default list of active Environment Contexts
    • A “Search Path” for the Programs you are running
      • ( : -separated list of Pathnames)
    • Given a Search Path A:B:C , the Shell will check in A, then B, then C
    • Can get around this by using explicit paths, e.g: ./foo

  • Example of Locality

CS446/646 C. Papachristos

16 of 46

Filesystems

Hardlinks

More than one Directory Entries can refer to a given File

  • “Hardlink” creates a synonym for a File (i.e. creates another Directory Entry for it)
  • Unix stores count of references (Hardlinks) to the inode (of that File) itself
  • If one of the Hardlinks is removed (e.g. using rm), the data are still accessible through any other Hardlink that remains
  • If all Hardlinks are removed, then the space of that File’s Data is considered as freed

CS446/646 C. Papachristos

foo

inode #31279

links_count=1

ln foo bar

Existing File

Hardlink to create

foo

bar

inode #31279

links_count=2

17 of 46

Filesystems

Softlinks / Symlinks

Softlinks / Sym(bolic)links : Synonyms for Filenames

  • Point to a File(/Dir)name, but object can be deleted from underneath it
    • i.e. does not even have to exist
  • Unix implements these as inodes (just like with Directories):
    • inode Type indicates it is a Symlink, and its Data contain name of Symlink target
  • When the Filesystem encounters a Symlink Type inode, it automatically translates it

CS446/646 C. Papachristos

foo

bar

ln –s bar barz

barz

Existing File

Symlink to create

inode #31279

links_count=2

File Data

"bar"

inode #35382

links_count=1

18 of 46

Filesystems

File Sharing

  • File Sharing has been around since Time-sharing
    • Easy to do on a single machine
    • PCs, workstations, and networks get us there (mostly)

  • File Sharing is important for getting work done
    • Basis for Communication and Synchronization

  • 2 key issues when sharing Files:
    • 1) Semantics of Concurrent Access
      • What happens when one Process reads while another writes?
      • What happens when two Processes open a File for writing?
      • What should be used to coordinate these?
    • 2) Protection

CS446/646 C. Papachristos

19 of 46

Filesystems

Protection

  • Filesystems implement a Protection system to be able to manage:
    • Who can access a File
    • What access rights they have (read/write/etc.)

  • A Protection system dictates whether a given Action performed by a given Subject �on a given Object should be allowed
    • Examples:
      • Your User account can read and/or write your own Files, but other Users cannot
      • You can read /etc/motd , but you cannot write to it

CS446/646 C. Papachristos

20 of 46

Filesystems

Protection – Conceptual Representation

1) Access Control Lists (ACL)

  • For each Object, maintain a�list of Subjects and their�permitted Actions

CS446/646 C. Papachristos

2) Capabilities

  • For each Subject, maintain a list of Objects and the permitted Actions

/one

/two

/three

root

rw

rw

rw

alice

rw

-

r

bob

w

r

rw

Objects

Subjects

ACL

Capability

21 of 46

Filesystems

ACLs and Capabilities

  • Capabilities are easier to transfer
    • “A Capability is a token, ticket, or key that gives the possessor permission to�access an entity or Object in a computer system” (Dennis and Van Horn, 1966)

  • ACLs are easier to manage
    • Object-centric, easy to grant, revoke
    • To revoke Capabilities, have to keep track of all Subjects that have the Capability (ticket)
      • A challenging problem

  • But, ACLs have a problem when Objects are heavily shared (i.e. many Subjects (/Users))
    • The ACLs become very large
    • Mitigation:
      • Use Subject (/User) Groups (e.g. in Unix)

CS446/646 C. Papachristos

22 of 46

Filesystems

Unix File Protection

Unix uses both Protection approaches in its Filesystem

  • 1) ACLs:�https://linux.die.net/man/5/acl
    • Example: Unix File Permissions
      • View ACL: getfacl <filename>

  • 2) Capabilities:
    • Example: File Descriptors

  • How are 1) and 2) used together?
    • e.g. making an open() System Call which gets us a File Descriptor

CS446/646 C. Papachristos

int fd = open("file.txt", O_WRONLY);

if (fd == -1)

exit(-1);

for (int i = 0; i < 100; i++)

write(fd, buf + i * 4, 4);

ACL check, expensive

Capability

Use Capability from now on

23 of 46

Filesystems

Filesystem Implementation

  • A File Descriptor is an application of Capabilities

  • When a File is open()ed, a File Descriptor is created and stored in the “fdtable

    • Each Process has a files_struct.fdtable, which is stored in Kernel-Space

    • The User-Space Application is only provided with the index of the File Descriptor

      • We often call this index itself the File Descriptor, actually just an index to the fdtable.fd
    • The fdtable.fd is actually a Capability list (array of Kernel-space struct file*)
      • Each struct file entry contains a Permission part that describes what the Process can do to this File; it also contains a struct inode* that corresponds to the File’s inode

Note: A struct file in the Process’ fdtable.fd corresponds to a Capability that allows the Process to access the File in a specific way, even after a change in the ACL of the File

  • (i.e. the “token / ticket / key is not revoked)

CS446/646 C. Papachristos

24 of 46

Filesystems

Filesystems vs Virtual Memory

CS446/646 C. Papachristos

25 of 46

Filesystems

 

CS446/646 C. Papachristos

26 of 46

Filesystems

Filesystems vs Virtual Memory

  • Trends:

  • Disk Bandwidth and cost-per-bit improving exponentially
    • Similar to CPU speed, Memory size, etc.

  • Seek Time and Rotational Delay improving very slowly
    • Require mechanical motion (e.g. disk arm)

  • Disk access is a huge system bottleneck & getting worse
    • Bandwidth increase lets system (pre-)fetch large chunks for about the same cost as a smaller chunk
    • Trade Bandwidth for Latency if you can get lots of related stuff

  • Desktop Memory size increasing faster than typical workloads
    • More and more of workload fits in File Cache
    • Disk traffic changes: Mostly writes and new data

  • Memory and CPU resources increasing
    • Use Memory and CPU to make better decisions
    • Complex prefetching to support more I/O patterns
    • Delay in Data Placement decisions reduces random I/O

CS446/646 C. Papachristos

27 of 46

Filesystems

Filesystems vs Virtual Memory

  • Goal: Operations should have as few Disk Accesses as possible & at minimal Space overhead
    • i.e. by grouping related things

  • What’s hard about grouping Blocks?
    • Remember: Virtual Memory Pages and Process Locality

  • Like Page Tables, the Filesystem Metadata construct mappings
      • e.g. Page Table: Provides mapping of Virtual Page Number to Physical Page Number
    • File Metadata:�Provide mapping of Byte Offset to Disk Block Address
    • Directory Data (i.e. List of Directory Entries):�Provide mapping of Name to inode # (i-number)

CS446/646 C. Papachristos

28 of 46

Filesystems

Filesystems vs Virtual Memory

  • In both Filesystems and Virtual Memory management, keeping operations location-agnostic is desired
    • Application shouldn’t care about particular Disk Blocks or Physical Memory locations

  • In some ways, Filesystems have an easier job than Virtual Memory:
    • CPU time to do Filesystem mappings not a big deal
    • No Virtual Address Space invalidation on Context Switch, TLB Misses, etc.
    • Page Tables deal with sparse Address Spaces and Random Access
    • Files often denser ([0 … filesize − 1]Sequentially Accessed)

  • In some ways, a Filesystem’s problem is harder:
    • Each layer of translation = potential Disk Access
    • Space a huge premium! (But Disk can be huge)
      • Disk Cache Space never enough; amount of Data you can get in a single fetch never enough
    • Range very extreme: Many Files < 10 KB-long, other Files GB-long

CS446/646 C. Papachristos

29 of 46

Filesystems

Filesystems vs Virtual Memory

  • Some Working Intuitions:

  • Filesystem performance dominated by # of Disk Accesses
    • If each Disk Access costs ∼10 milliseconds, touching the Disk 100 times = 1 second
    • Can perform a billion ALU operations in same time

  • Disk Access cost dominated by Mechanical Motion, not Transfer:
    • 1 Sector: 5𝑚𝑠 + 4𝑚𝑠 + 5μ𝑠 (≈ 512 𝐵/(100 𝑀𝐵/𝑠)) ≈ 9𝑚𝑠
    • 50 Sectors: 5𝑚𝑠 + 4𝑚𝑠 + .25𝑚𝑠 = 9.25𝑚𝑠
    • Can get 50x the data for only ∼3% more overhead

  • Important Observations:

1) All Blocks in same File tend to be used together, sequentially

2) All Names in same Directory tend to be used together

3) All Files under same Directory tend to be accessed together

CS446/646 C. Papachristos

30 of 46

Filesystems

Problem: How to Track a File’s Data

  • Disk management:
    • Need to keep track of where File contents are on Disk
    • Must be able to use this to map Byte Offset to Disk Block Address

The “Index Node” –or– “inode”

  • A structure that tracks a File’s Disk Sectors
    • Note: The inodes must be stored on Disk as well (Remember: Persistence)

  • Things to keep in mind while designing File Data storage structure:
    • Most Files are small
    • Much of the Disk is allocated to large Files
    • Many of the I/O operations are made to large Files
    • Want good Sequential Access and good Random Access
      • Each imposes its own requirements

CS446/646 C. Papachristos

31 of 46

Filesystems

Idea 1: Contiguous Allocation

  • Extent-based”: Allocate Files similarly to Virtual Memory Segmentation model
    • When creating a File, do “Allocate-on-Flush
    • inode contents: Location, Size

Example: IBM OS/360, B-Tree FS (BTRFS)

Advantages

  • Simple, fast Access, both Sequential and Random

Disadvantages (think of corresponding Virtual Memory Segmentation scheme)

  • Files may not dynamically grow after creation (unless Extents hybridized, see BTRFS)
  • External Fragmentation

CS446/646 C. Papachristos

Fragmentation:�What if File c needs 2 Sectors

32 of 46

Filesystems

Idea 2: Linked Files

  • Basically a Singly-Linked List on Disk
    • Keep a Singly-Linked List of all Free Blocks
    • inode contents: a pointer to File’s first Block
    • In each Block, keep a pointer to the next one

  • Examples (sort-of): Alto, TOPS-10, DOS FAT

Advantages

  • Easy dynamic growth & Sequential Access, no External Fragmentation

Disadvantages

  • Linked Lists on Disk a bad idea because of Access Times
  • Random Access very slow (e.g. have to traverse whole File to Access last Block)
  • Pointers necessary in every Block, skewing Data alignment

CS446/646 C. Papachristos

A single File’s Clusters are “Fragmented” throughout the Disk

33 of 46

Filesystems

Idea 2 (fixed): File Allocation Table (FAT) – Example : DOS Filesystem

Linked Files with key optimization:

  • Links are separately placed in fixed-size “File Allocation Table” (FAT), rather than indi-vidually inside each Block
  • Still require Pointer chasing, but can cache entire FAT in Memory (cheaper than Disk Access)

CS446/646 C. Papachristos

34 of 46

Filesystems

 

CS446/646 C. Papachristos

 

35 of 46

Filesystems

Idea 3: Indexed Files

  • Have an Array that holds all of a File’s Block locations (Disk Linear Addresses)
    • Similarly to a Page Table, so will have similar issues
    • Max File Size determined by Block Array’s size
      • Static or Dynamic Block Array?
    • Allocate Block Array to hold File’s Block locations�on File creation
    • Allocate actual Blocks on demand using Freelist of the Disk’s Blocks

Advantages

  • Both Sequential and Random Access easy

Disadvantages

  • The Mapping Table (for all Files’ Arrays) requires large chunk of Contiguous Space
  • Same problem we were trying to solve initially

CS446/646 C. Papachristos

36 of 46

Filesystems

 

CS446/646 C. Papachristos

File Index Array

Disk Blocks

(Determined by desired Max File Size)

37 of 46

Filesystems

Idea 3 (fixed): Multi-Level Indexed Files – Example : Unix inodes

Unix inode:

  • inode = 15 Block Pointers (+ Metadata)

  • First 12 are Direct Blocks
    • Giving faster Access�performance for initial Blocks
  • Then 1 Single-indirect Block,� 1 Double-indirect Block,� 1 Triple-indirect Block�

CS446/646 C. Papachristos

inode

Metadata

Note: �Indirect Blocks: Contain BLKSIZE/PTRSIZE amount of Block Entries

38 of 46

Filesystems

More about inodes

Note: Information about Type (File/Directory/Symlink/Character Device/etc.) could be stored in corresponding � Directory Entry that points to this inode, e.g.: <inode, file_type, name_len, name>

CS446/646 C. Papachristos

struct inode

Data Block

Data Block

Data Block

Indirect

Data Block

type_perm (Type & Permissions)

uid (Owner)

gid (Group)

size (of File in Bytes)

blocks (# Blocks of this File – whether

atime (Access Time)

mtime (Modification Time)

ctime (Creation Time)

dtime (Deletion Time)

links_count (Hardlinks – # of Paths)

block[15] (12+1+1+1 Data Blocks)

used or not)

39 of 46

Filesystems

More about inodes

The “inode Table” : Array that stores inodes (one inode for every File)

  • (Similar functionality to the Mapping Table for all Files on the Filesystem)

  • inode Table: Fixed-size (can support max # of inodes) when Disk is initialized; can’t be changed
  • Lives in known Location, originally at outer edge of Disk
    • This resulted in long seeks between inodes and File Data, improved by having many small inode Tables spread across the Disk

  • The index of an inode in the inode Table called the “i-number
    • Internally, the OS refers to Files by i-number
    • When File is opened, its inode is read into Memory
    • Written back to Disk when modified and File closed, or timeout elapses

CS446/646 C. Papachristos

40 of 46

Filesystems

More about inodes

Locality Problem: inode Table located at start of Disk

CS446/646 C. Papachristos

41 of 46

Filesystems

Remember: Unix inodes and Path Search

  • Unix inodes are not Directories
    • An inode describes where on the Disk the Data Blocks for a File are located
    • A Directory is like a File, so its inode just describes the location of its Data Blocks
      • Remember: Its Data is a List of Directory Entries

  • Directory Entries map Filenames to inodes
    • To open /one, read inode #2 ( "/" ) from inode Table on Disk into Memory
    • Read Data (Directory Entries) of "/" from Disk into Memory, find Entry for "one"
    • This Entry gives the inode # for "one"
    • From the inode Table again, find the inode for "one"
    • The inode says where the Data Blocks of "one" are on Disk
    • Read its first Block into Memory to access the initial Data stored by the File

CS446/646 C. Papachristos

Note: 4 Disk Accesses required

42 of 46

Filesystems

Unix Example: /a/b/c.c

  • inode #2 holds location (Data Blocks) with Directory Entries under /
  • inode #3 holds location (Data Blocks) with Directory Entries under a
  • inode #5 holds location (Data Blocks) with Directory Entries under b
  • inode #14 holds location (Data Blocks) with File Data of c.c

CS446/646 C. Papachristos

43 of 46

Filesystems

Write Caching

  • OSes typically also do “Write-Back Caching
    • Maintain a queue of Uncommitted Blocks
    • Periodically flush the queue to Disk (30 second threshold)
    • If Blocks changed many times in 30 secs, only need one I/O
    • If Blocks deleted before 30 secs (e.g. /tmp), no I/O needed

  • Unreliable, but practical
    • On a crash, all writes within last 30 secs are lost
    • Modern OSes do this by default; too slow otherwise

  • System Calls exist to enable Applications to force-write data to Disk
    • e.g. fsync(): Flush OS Buffers to Physical Media (Device)

CS446/646 C. Papachristos

44 of 46

Filesystems

Read Caching – Read-Ahead (Prefetching)

  • Many Filesystems also implement “Read-Ahead
    • Filesystem “predicts” that the Process will request a next Block
    • Filesystem requests it from the Disk ahead of time
    • This can happen while the Process is executing on the previous Block
      • Overlap I/O with execution
    • When the Process requests next Block, it will already be in the Read-Ahead Cache
    • Complements the Disk (Hardware) Cache
      • Remember: Disk is also is doing Read-Ahead

  • For Sequentially-Accessed Files can be a big win
    • Unless Blocks for the File are scattered across the Disk
    • Filesystems try to prevent that though (through proper File allocation)

CS446/646 C. Papachristos

45 of 46

Filesystems

Read Caching – File Buffering

  • Disk operations are slow, but Applications exhibit Locality for reading and writing Files

  • Idea: Cache already accessed File Blocks into Memory to leverage Locality
    • Called the “File Buffer Cache
    • File Buffer Cache is system-wide, used and shared by all Processes
    • Reading from the File Buffer Cache helps Disk-based operations to perform more like Memory-based ones
    • Even a small File Buffer Cache can be very effective

  • Issues
    • The File Buffer Cache competes with Virtual Memory for Space (tradeoff here)
    • Like with Virtual Memory, there are actual limitations (due to Physical Memory)
    • Need Replacement Algorithms again (LRU usually used)

CS446/646 C. Papachristos

46 of 46

Time for Questions !

CS-446/646

CS446/646 C. Papachristos