1 of 46

Filesystems

CS-446/646

C. Papachristos

Robotic Workers (RoboWork) Lab

University of Nevada, Reno

2 of 46

Filesystems

Filesystem

Block”

  • Fundamental allocation unit (i.e. smallest size) a Filesystem uses
    • Remember: vs a Sector which is the smallest individual reference-able region on a Disk

Main Filesystem tasks:

  • Persistence of Data & its structure

  • Associate sets of Data Blocks with names (Files)

  • Associate names with each other (Directories)

  • Can implement Filesystems on Disk, over Network, in Memory, in Non-Volatile RAM (NVRAM), on Tape, …
    • Focus on Disk and generalize later

CS446/646 C. Papachristos

3 of 46

Filesystems

File

Associated Blocks on Disk, with a name

  • Data with some properties
  • Contents, size, owner, last read/write time, protection, etc.

How a File’s data is managed by the Filesystem (more later)

  • Basic idea (in Unix):�A struct called an Index Node : The “inode
    • Describes where on the Disk the Blocks for a given File are located
      • inode struct also holds some extra information (Metadata)

    • Disk stores an array of inodes : The “inode Table

    • inode # is the index in the inode Table : The “i-number

CS446/646 C. Papachristos

4 of 46

Filesystems

Filetypes

A File can also have a Type

  • Understood by the Filesystem
    • Block Device, Character Device, UNIX Socket, Symbolic Link (more later), etc.
  • Understood by other parts of the OS or runtime libraries
    • executable, dll, source, object, text, etc.

A File’s Type can be encoded in its Filename or Contents

  • Windows encodes type in name (.com, .exe, .bat, .dll, etc.)

  • Unix also encodes type in contents (magic numbers, initial characters)
    • e.g. #! for Shellscripts

CS446/646 C. Papachristos

5 of 46

Filesystems

Basic File Operations

Unix

creat(name)

rename(oldname, newname)

open(name, how)

read(fd, buf, len)

write(fd, buf, len)

truncate(fd, len)

sync(fd)

lseek(fd, pos)

close(fd)

unlink(name)

CS446/646 C. Papachristos

Windows

CreateFile(name, CREATE)

CreateFile(name, OPEN)

ReadFile(handle, …)

WriteFile(handle, …)

FlushFileBuffers(handle, …)

SetFilePointer(handle, …)

CloseHandle(handle, …)

DeleteFile(name)

CopyFile(name)

MoveFile(name)

6 of 46

Filesystems

 

CS446/646 C. Papachristos

For our Lectures

7 of 46

Filesystems

Directories

  • Problem:�How to reference Files

Should users remember where on Disk their Files are located (i.e. Disk Sector Number) ?

  • Human interface means we also need human-readable names
  • We use Directories to map names to Files
    • Remember: A File represents a set of associated Data Blocks, it does not have a name

Directories serve two purposes

  • For users, they provide a structured way to organize Files
  • For Filesystem, they provide a convenient naming interface that allows the separation of Logical Organization of Files, from Physical Placement on the Disk

CS446/646 C. Papachristos

8 of 46

Filesystems

Hierarchical Directories

  • Used since CTSS (1960s)
    • Unix picked up and used nicely

  • Large Namespaces tend to be Hierarchical
    • IP Addresses, Domain Names, scoping in Programming Languages, etc.

CS446/646 C. Papachristos

/ (FileSystem Root Directory)

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

9 of 46

Filesystems

Directory Internals

A Directory is a list of Directory Entries

Directory Entry:

  • <name, inode#> tuple: The inode # gets us to an inode, which contains a Disk location

  • Directories stored on Disk just like regular Files

    • Its Filetype is: Directory
    • Users can read its Data just like any other File
    • Only special System Calls can write to its Data
    • Data is list of Directory Entries that point to other Disk locations (through inode#s)
    • File pointed to by an inode #, may also be a Directory (inode Type is Directory)
    • Filesystem becomes a Hierarchical Tree

  • Simple, and any speedup in File operations also speeds up Directory operations

CS446/646 C. Papachristos

/

afs

bin

cdrom

dev

sbin

tmp

awk

chmod

chown

<afs,1021>

<tmp,1020>

<bin,1022>

<cdrom,4123>

<dev,1001>

<sbin,1011>

File contents for /

10 of 46

Filesystems

Pathname Translation (simple & high-level overview)

  • To open File /one/two/three the Filesystem will translate as follows:

  • Directory Entries map Filenames to Disk location(s) (via the inode # that gets us to an inode)

  • Access Directory "/": Root Directory is always inode #2 (more later)

  • Search its Data (list of Directory Entries) for the Directory Entry named "one", �get inode # and then inode of "one", get its Disk location

  • Access Directory one, search its Data (list of Directory Entries) for the one named "two",�get inode # and then inode of "two", get its Disk location

  • Access Directory two, search its Data (list of Directory Entries) for the one named "three",�get inode # and then inode of "three", get its Disk location

  • Access File three, access its Data

CS446/646 C. Papachristos

11 of 46

Filesystems

Unix Example

  • /a/b/c.c :

CS446/646 C. Papachristos

12 of 46

Filesystems

Unix inodes and Path Search (more detailed overview)

  • Unix inodes are not Directories
    • An inode describes where on the Disk the Data Blocks for a File are located
    • A Directory is like a File, so its inode just describes the location of its Data Blocks on Disk; it’s just that a Directory’s Data Blocks actually store a list of its Directory Entries

  • Each Directory Entry maps a Filename to an inode

For a not-so-elaborate Filesystem structure, to open File "/one":

    • Read inode Table (more later) into Memory and find for "/" its inode and then its Disk Block Number(s)�That is the location of its Data Block(s) on Disk
    • Read the Data Block of "/" into Memory, then look into its Data for a Directory Entry named "one"This Directory Entry contains the inode# for "one"
    • Find the inode for "one" in the inode Table (should be regular File) and then its Disk Block Number(s)That is the location of its Data Block(s) on Disk
    • Read the Data Block(s) into Memory to access the File Data of "one"

CS446/646 C. Papachristos

13 of 46

Filesystems

Naming

  • Bootstrapping: Where do you start looking?
    • Root Directory always inode #2

  • Special names:
    • Root directory: /
    • Current directory: .
    • Parent directory: ..

  • Some special aliases are provided by the Shell, not the Filesystem:
    • User’s Home Directory:
    • Globbing: foo.* (expands to all files starting with foo.)

  • Using the given naming, only need two operations to navigate the entire Namespace:
    • cd <name> : Move into (change Current Working Directory context to) Directory Name
    • ls : Enumerate all names in Current Working Directory (context)

CS446/646 C. Papachristos

Note:

  • #0 used as a special NULL value (indicates an inode does not exist); i.e. there is no inode #0
  • First inode is inode #1, and is reserved for�recording defective Blocks on the Disk

 

14 of 46

Filesystems

Basic Directory Operations

Unix

Directories implemented in Files

  • At the Filesystem level, File operations�are used to create Directories

  • C-library provides a higher-level�abstraction:

opendir(name)

readdir(DIR*)

seekdir(DIR*)

closedir(DIR*)

CS446/646 C. Papachristos

Windows

Explicit Directory operations

CreateDirectory(name)

RemoveDirectory(name)

Very different method for reading Directory Entries

FindFirstFile(pattern)

FindNextFile()

15 of 46

Filesystems

Default Context: Working Directory

  • Cumbersome to always have to specify full Pathnames
    • In Unix, each process has a “Current Working Directory” (cwd)
    • Filenames not beginning with / are assumed to be relative to the cwd
      • Otherwise translation happens as before

  • A Shell also tracks a default list of active Environment Contexts
    • A “Search Path” for the Programs you are running
      • ( : -separated list of Pathnames)
    • Given a Search Path A:B:C , the Shell will check in A, then B, then C
    • Can get around this by using explicit paths, e.g: ./foo

  • Example of Locality

CS446/646 C. Papachristos

16 of 46

Filesystems

Hardlinks

More than one Directory Entries can refer to a given File

  • “Hardlink” creates a synonym name for a File (i.e. creates another Directory Entry for it)
  • Unix stores count of references (Hardlinks) to the inode (of that File) itself
  • If one of the Hardlinks is removed (e.g. using rm), the data are still accessible through any other Hardlink that remains
  • If all Hardlinks are removed, then the space of that File’s Data is considered as freed

CS446/646 C. Papachristos

foo

inode #31279

links_count=1

ln foo bar

Existing File

Hardlink to create

foo

bar

inode #31279

links_count=2

17 of 46

Filesystems

Softlinks / Symlinks

Softlinks / Sym(bolic)links : Synonyms for Filenames

  • Point to a File(/Dir)name, but object can be deleted from underneath it
    • i.e. does not even have to exist
  • Unix implements these like Directories:
    • inode Type indicates it is a Symlink, and its Data contain name of Symlink target
  • When the Filesystem encounters a Symlink inode, it automatically translates it

CS446/646 C. Papachristos

foo

bar

ln –s bar barz

barz

Existing File

Symlink to create

inode #31279

links_count=2

File Data

"bar"

inode #35382

links_count=1

18 of 46

Filesystems

File Sharing

  • File Sharing has been around since Time-sharing
    • Easy to do on a single machine
    • PCs, workstations, and networks get us there (mostly)

  • File Sharing is important for getting work done
    • Basis for Communication and Synchronization

  • 2 key issues when sharing Files:
    • 1) Semantics of Concurrent Access
      • What happens when one Process reads while another writes?
      • What happens when two Processes open a File for writing?
      • What should be used to coordinate these?
    • 2) Protection

CS446/646 C. Papachristos

19 of 46

Filesystems

Protection

  • Filesystems implement a Protection system to mange:
    • Who can access a File
    • How they are allowed to access it (read/write/etc.)

  • A Protection system dictates whether a given Action performed by a given Subject �on a given Object should be allowed
    • Examples:
      • Your User account can read and/or write your own Files, but other Users cannot
      • You can read /etc/motd , but you cannot write to it

CS446/646 C. Papachristos

20 of 46

Filesystems

Protection – Conceptual Representation

1) Access Control Lists (ACL)

  • For each Object, maintain a�list of Subjects and their�permitted Actions

CS446/646 C. Papachristos

2) Capabilities

  • For each Subject, maintain a list of Objects and the permitted Actions

/one

/two

/three

root

rw

rw

rw

alice

rw

-

r

bob

w

r

rw

Objects

Subjects

ACL

Capability

21 of 46

Filesystems

ACLs and Capabilities

  • Capabilities are easier to transfer
    • “A Capability is a token, ticket, or key that gives the possessor permission to�access an entity or Object in a computer system” (Dennis and Van Horn, 1966)

  • ACLs are easier to manage
    • Object-centric, easy to grant, revoke
    • To revoke Capabilities, have to keep track of all Subjects that have the Capability (ticket)
      • A challenging problem

  • But, ACLs have a problem when Objects are heavily shared (i.e. many Subjects (/Users))
    • The ACLs become very large
    • Mitigation:
      • Use Subject (/User) Groups (e.g. in Unix)

CS446/646 C. Papachristos

22 of 46

Filesystems

Unix File Protection

Unix uses both Protection approaches in its Filesystem

  • 1) ACLs:�https://linux.die.net/man/5/acl
    • Example: Unix File Permissions
      • View ACL: getfacl <filename>

  • 2) Capabilities:
    • Example: File Descriptors

  • How are 1) and 2) used together?
    • e.g. making an open() System Call which gets us a File Descriptor

CS446/646 C. Papachristos

int fd = open("file.txt", O_WRONLY);

if (fd == -1)

exit(-1);

for (int i = 0; i < 100; i++)

write(fd, buf + i * 4, 4);

ACL check, expensive

Capability

Use Capability from now on

23 of 46

Filesystems

Filesystem Implementation

  • A File Descriptor is an application of Capabilities

  • When a File is open()ed, a File Descriptor is created and stored in the “filp Table”

    • Each Process has a filp Table, which is stored in Kernel-Space

    • The User-Space Application is only provided with the index of the File Descriptor

      • We often call this index itself the File Descriptor, but it is actually just an index to the filp Table

    • The filp Table is actually a Capability list (contains a list of File Descriptors)
      • Each File Descriptor contains a Permission part that describes what the Process can do to this file; the File Descriptor also contains an identifier, which is the address of the file’s inode

Note: A File Descriptor in the Process’ filp Table is a Capability that allows the Process to access the File in a specific way, even after a change in the ACL of the File

  • (i.e. the “token / ticket / key is not revoked)

CS446/646 C. Papachristos

24 of 46

Filesystems

Filesystems vs Virtual Memory

CS446/646 C. Papachristos

25 of 46

Filesystems

 

CS446/646 C. Papachristos

26 of 46

Filesystems

Filesystems vs Virtual Memory

  • Trends:

  • Disk Bandwidth and cost-per-bit improving exponentially
    • Similar to CPU speed, Memory size, etc.

  • Seek Time and Rotational Delay improving very slowly
    • Require mechanical motion (e.g. disk arm)

  • Disk access is a huge system bottleneck & getting worse
    • Bandwidth increase lets system (pre-)fetch large chunks for about the same cost as a smaller chunk
    • Trade Bandwidth for Latency if you can get lots of related stuff

  • Desktop Memory size increasing faster than typical workloads
    • More and more of workload fits in File Cache
    • Disk traffic changes: Mostly writes and new data

  • Memory and CPU resources increasing
    • Use Memory and CPU to make better decisions
    • Complex prefetching to support more I/O patterns
    • Delay in Data Placement decisions reduces random I/O

CS446/646 C. Papachristos

27 of 46

Filesystems

Filesystems vs Virtual Memory

  • Goal: Operations should have as few Disk Accesses as possible & at minimal Space overhead
    • i.e. by grouping related things

  • What’s hard about grouping Blocks?
    • Remember: Virtual Memory Pages and Process Locality

  • Like Page Tables, the Filesystem Metadata construct mappings
      • e.g. Page Table: Provides mapping of Virtual Page Number to Physical Page Number
    • File Metadata:�Provide mapping of Byte Offset to Disk Block Address
    • Directory Data (i.e. List of Directory Entries):�Provide mapping of Name to inode # (i-number)

CS446/646 C. Papachristos

28 of 46

Filesystems

Filesystems vs Virtual Memory

  • In both Filesystems and Virtual Memory management, keeping operations location-agnostic is desired
    • Application shouldn’t care about particular Disk Blocks or Physical Memory locations

  • In some ways, Filesystems have an easier job than Virtual Memory:
    • CPU time to do Filesystem mappings not a big deal
    • No Virtual Address Space invalidation on Context Switch, TLB Misses, etc.
    • Page Tables deal with sparse Address Spaces and random access
    • Files often denser ([0 … filesize − 1]Sequentially Accessed)

  • In some ways, a Filesystem’s problem is harder:
    • Each layer of translation = potential Disk Access
    • Space a huge premium! (But Disk can be huge)
      • Disk Cache Space never enough; amount of Data you can get in a single fetch never enough
    • Range very extreme: Many Files < 10 KB-long, other Files GB-long

CS446/646 C. Papachristos

29 of 46

Filesystems

Filesystems vs Virtual Memory

  • Some Working Intuitions:

  • Filesystem performance dominated by # of Disk Accesses
    • If each Disk Access costs ∼10 milliseconds, touching the Disk 100 times = 1 second
    • Can perform a billion ALU operations in same time

  • Disk Access cost dominated by Mechanical Motion, not Transfer:
    • 1 Sector: 5𝑚𝑠 + 4𝑚𝑠 + 5μ𝑠 (≈ 512 𝐵/(100 𝑀𝐵/𝑠)) ≈ 9𝑚𝑠
    • 50 Sectors: 5𝑚𝑠 + 4𝑚𝑠 + .25𝑚𝑠 = 9.25𝑚𝑠
    • Can get 50x the data for only ∼3% more overhead

  • Important Observations:
    • All Blocks in same File tend to be used together, sequentially
    • All Names in same Directory tend to be used together
    • All Files under same Directory tend to be accessed together

CS446/646 C. Papachristos

30 of 46

Filesystems

Problem: How to Track a File’s Data

  • Disk management:
    • Need to keep track of where File contents are on Disk
    • Must be able to use this to map Byte Offset to Disk Block Address

The “Index Node” –or– “inode”

  • A structure that tracks a File’s Disk Sectors
    • Note: The inodes must be stored on Disk as well (Remember: Persistence)

  • Things to keep in mind while designing File Data storage structure:
    • Most Files are small
    • Much of the Disk is allocated to large Files
    • Many of the I/O operations are made to large Files
    • Want good Sequential and good Random Access
      • Each imposes its own requirements

CS446/646 C. Papachristos

31 of 46

Filesystems

Idea 1: Contiguous Allocation

  • Extent-based”: Allocate Files similarly to Virtual Memory Segmentation model
    • When creating a File, do “Allocate-on-Flush
    • inode contents: Location, Size

Example: IBM OS/360

Advantages

  • Simple, fast Access, both Sequential and Random

Disadvantages (think of corresponding Virtual Memory Segmentation scheme)

  • Files may not dynamically grow after creation
  • External Fragmentation

CS446/646 C. Papachristos

Fragmentation:�What if File c needs 2 Sectors

32 of 46

Filesystems

Idea 2: Linked Files

  • Basically a Singly-Linked List on Disk
    • Keep a Singly-Linked List of all Free Blocks
    • inode contents: a pointer to File’s first Block
    • In each Block, keep a pointer to the next one

  • Examples (sort-of): Alto, TOPS-10, DOS FAT

Advantages

  • Easy dynamic growth & Sequential Access, no External Fragmentation

Disadvantages

  • Linked Lists on Disk a bad idea because of Access Times
  • Random Access very slow (e.g. have to traverse whole File to Access last Block)
  • Pointers necessary in every Block, skewing Data alignment

CS446/646 C. Papachristos

A single File’s Clusters are “Fragmented” throughout the Disk

33 of 46

Filesystems

Idea 2 (fixed): File Allocation Table (FAT) – Example : DOS Filesystem

Linked Files with key optimization:

  • Links are separately placed in fixed-size “File Allocation Table” (FAT), rather than indi-vidually inside each Block
  • Still require Pointer chasing, but can cache entire FAT in Memory (cheaper than Disk Access)

CS446/646 C. Papachristos

34 of 46

Filesystems

 

CS446/646 C. Papachristos

35 of 46

Filesystems

Idea 3: Indexed Files

  • Have an Array that holds all of a File’s Block Pointers
    • Similarly to a Page Table, so will have similar issues
    • Max File Size determined by Array’s size
      • Static or Dynamic?
    • Allocate Array to hold File’s Block Pointers�on File creation
    • Allocate actual Blocks on demand using Freelist of the Disk’s Blocks

Advantages

  • Both Sequential and Random Access easy

Disadvantages

  • The Mapping Table (for all Files’ Arrays) requires large chunk of Contiguous Space
  • Same problem we were trying to solve initially

CS446/646 C. Papachristos

36 of 46

Filesystems

 

CS446/646 C. Papachristos

File Index Array

Disk Blocks

(Determined by desired Max File Size)

37 of 46

Filesystems

Idea 3 (fixed): Multi-Level Indexed Files – Example : Unix inodes

Unix inode:

  • inode = 15 Block Pointers (+ Metadata)

  • First 12 are Direct Blocks
    • Giving faster Access�performance for initial Blocks
  • Then 1 Single-indirect Block,� 1 Double-indirect Block,� 1 Triple-indirect Block�

CS446/646 C. Papachristos

inode

Metadata

Note: �Indirect Blocks: Contain BLKSIZE/PTRSIZE amount of Block Entries

38 of 46

Filesystems

More about inodes

Note: Information about Type (File/Directory/Symlink/Character Device/etc.) could be stored in corresponding � Directory Entry that points to this inode, e.g.: <inode, file_type, name_len, name>

CS446/646 C. Papachristos

struct inode

Data Block

Data Block

Data Block

Indirect

Data Block

type_perm (Type & Permissions)

uid (Owner)

gid (Group)

size (of File in Bytes)

blocks (# Blocks of this File – whether

atime (Access Time)

mtime (Modification Time)

ctime (Creation Time)

dtime (Deletion Time)

links_count (Hardlinks – # of Paths)

block[15] (12+1+1+1 Data Blocks)

used or not)

39 of 46

Filesystems

More about inodes

The “inode Table” : Array that stores inodes (one inode for every File)

  • (Similar functionality to the Mapping Table for all Files on the Filesystem)

  • inode Table: Fixed-size (can support max # of inodes) when Disk is initialized; can’t be changed
  • Lives in known Location, originally at outer edge of Disk
    • This resulted in long seeks between inodes and File Data, improved by having many small inode Tables spread across the Disk

  • The index of an inode in the inode Table called the “i-number
    • Internally, the OS refers to Files by i-number
    • When File is opened, its inode is brought in to Memory
    • Written back to Disk when modified and File closed, or timeout elapses

CS446/646 C. Papachristos

40 of 46

Filesystems

More about inodes

Locality Problem: inode Table located at start of Disk

CS446/646 C. Papachristos

41 of 46

Filesystems

Remember: Unix inodes and Path Search

  • Unix inodes are not Directories
    • An inode describes where on the Disk the Data Blocks for a File are located
    • A Directory is like a File, so its inode just describes the location of its Data Blocks
      • Remember: Its Data is a List of Directory Entries

  • Directory Entries map Filenames to inodes
    • To open /one, read inode #2 ( "/" ) from inode Table on Disk into Memory
    • Read Data (Directory Entries) of "/" from Disk into Memory, find Entry for "one"
    • This Entry gives the inode # for "one"
    • From the inode Table again, find the inode for "one"
    • The inode says where the Data Blocks of "one" are on Disk
    • Read its first Block into Memory to access the initial Data stored by the File

CS446/646 C. Papachristos

Note: 4 Disk Accesses required

42 of 46

Filesystems

Unix Example: /a/b/c.c

  • inode #2 holds location (Data Blocks) with Directory Entries under /
  • inode #3 holds location (Data Blocks) with Directory Entries under a
  • inode #5 holds location (Data Blocks) with Directory Entries under b
  • inode #14 holds location (Data Blocks) with File Data of c.c

CS446/646 C. Papachristos

43 of 46

Filesystems

Write Caching

  • OSes typically do “Write-Back Caching
    • Maintain a queue of Uncommitted Blocks
    • Periodically flush the queue to Disk (30 second threshold)
    • If Blocks changed many times in 30 secs, only need one I/O
    • If Blocks deleted before 30 secs (e.g. /tmp), no I/Os needed

  • Unreliable, but practical
    • On a crash, all writes within last 30 secs are lost
    • Modern OSes do this by default; too slow otherwise

  • System Calls exist to enable Applications to force-write data to Disk
    • e.g. fsync(): Flush OS Buffers to Physical Media (Device)

CS446/646 C. Papachristos

44 of 46

Filesystems

Read Caching

  • Many Filesystems implement “Read-Ahead
    • Filesystem “predicts” that the Process will request a next Block
    • Filesystem requests it from the Disk ahead of time
    • This can happen while the Process is executing on the previous Block
      • Overlap I/O with execution
    • When the Process requests next Block, it will already be in the Read-Ahead Cache
    • Complements the Disk (Hardware) Cache
      • Remember: Disk is also is doing Read-Ahead

  • For Sequentially-Accessed Files can be a big win
    • Unless Blocks for the File are scattered across the Disk
    • Filesystems try to prevent that though (through proper File allocation)

CS446/646 C. Papachristos

45 of 46

Filesystems

Read Caching at Higher-Level: File Buffering

  • Disk operations are slow, but Applications exhibit Locality for reading and writing Files

  • Idea: Cache File Blocks into Memory to capture Locality
    • Called the “File Buffer Cache
    • File Buffer Cache is system-wide, used and shared by all Processes
    • Reading from the File Buffer Cache helps Disk-based operations to perform more like Memory-based ones
    • Even a small File Buffer Cache can be very effective

  • Issues
    • The File Buffer Cache competes with Virtual Memory for Space (tradeoff here)
    • Like with Virtual Memory, there are actual limitations (due to Physical Memory)
    • Need Replacement Algorithms again (LRU usually used)

CS446/646 C. Papachristos

46 of 46

Time for Questions !

CS-446/646

CS446/646 C. Papachristos