1 of 50

CSE 451

Operating Systems

L15 - Address Translation

Slides by: Tom Anderson

Baris Kasikci

2 of 50

Main Points

  • Address Translation Mechanism
    • How do we convert a virtual address to a physical address?
    • Paging
    • Multilevel paging
  • Efficient Address Translation
    • Translation Lookaside Buffers
    • Virtually and physically addressed caches
  • Leveraging Address Translation
    • Efficient process creation, process startup, …
  • Virtual Machine Address Translation

3 of 50

Address Translation Concept

Intel 57 bits

Kernel trap

Intel 52 bits

Cache line: 64 bytes

  • Extension: I/O bus

cache coherent (~10TB)

4 of 50

Virtually Addressed Base and Bounds

Base

Base + bound

5 of 50

Virtually Addressed Base and Bounds

  • Pros?
    • Simple
    • Fast (2 registers, adder, comparator)
    • Can relocate in physical memory without changing process
  • Cons?
    • Can’t keep program from accidentally overwriting its own code
    • Can’t share code/data with other processes
    • Can’t grow stack/heap as needed
    • Fragmentation – hard to efficiently use memory if each program is variable size
    • What if virtual address space > physical address space?

6 of 50

Paged Translation

  • Manage memory in fixed size units, or pages
  • Finding free memory is easy
    • Bitmap allocation: 0011111100000001100 (0 can represent free memory pages)
    • Each bit represents one physical page frame
  • Each process has its own page table
    • Stored in physical memory
    • Hardware registers for page table start, length

Intel: cr3 register will have the address of the page table

7 of 50

8 of 50

64 bit address

12 bit offset

45 bit page #

4 KB frames

5

5

12

Virtual address

12 bit

40 bit

Physical address

offset

frame

9 of 50

Process View

A B C D

E F G H

I J K L

I J K L

E F G H

A B C D

4

3

1

Page Table

Physical Memory

3

2

8 page

frames

4 byte

pages

0

9

010

01

000

00

010

001

01

100

00

10 of 50

Page Table Entry (Intel 64 bit)

PTE not present: Unused virtual pages do not need a physical page frame

  • Trap into kernel if process references memory on that page

11 of 50

don’t need a page frame for all virtual addresses

instruction pages = 0

kernel pages = 0

page replacement

set by HW,

cleared by OS

}

12 of 50

Paging Questions

  • With paging, what is saved/restored on a process context switch?

  • What if page size is very small?
  • What if page size is very large?

13 of 50

Paging Questions

  • With paging, what is saved/restored on a process context switch?
    • Pointer to page table, size of page table
    • Page table itself is in main memory

  • What if page size is very small?
    • Overhead: Lots of space taken up by the page table
  • What if page size is very large?
    • Internal fragmentation: if we don’t need all of the space inside a page

14 of 50

Paging Questions - 2

How big is the page table based on what we learned so far?

245 entries, 1 word= 64 bits = 8 bytes => 248 bytes (256 TB!)

An array is not going to work!

15 of 50

Single Level Page Table

  • Pros?
    • Simple memory allocation
    • Safe
    • Efficient as long as page size not too large (internal fragmentation)
    • Page permission can prevent overwriting code by accident
    • Can share physical page frames (two copies of same program)
    • Unused virtual pages don’t need to be backed by physical frames
  • Cons?
    • An extra memory lookup on every memory reference
    • Need page table to map entire virtual address space
    • Page table itself needs to be contiguous in memory (fragmentation)

16 of 50

Two Level Page Tables?

  • Level 1 table: each PTE points to a page table
  • Level 2 table: each PTE points to a page
    • Typically, one page frame
  • Can share/protect at either level 1 or level 2

Level 1 Table

...

Level 2 Tables

...

12 bit

36 bit

offset

page

9 bit

offset into

level-2 page

index into level-1 page

page frame

of level-2

table

page frame

page 512 entries

17 of 50

Three Level Page Tables

12 bit

9

bit

9

bit

27 bit

no mapping

no mapping

no mapping

If PTE is invalid in level 1,

don’t need a level 2 table

If it is invalid in level 2,

don’t need a level 3 table

18 of 50

x86 Multilevel Paging

  • Each level of page table fits in one 4KB page
  • 32-bit: two level page table
    • Page table entry is 32 bit, so 2^10 entries per page
  • 64-bit: five level page table
    • Page table entry is 64 bit, so 2^9 entries per page
  • Omit sub-tree if page table entry is invalid
    • Good for sparse address space

19 of 50

Paging and Sharing

  • Can we use page tables to share memory between processes?

  • Set page table entry in each process to point to same page frame (4KB)
    • Or pointer to page table (both point to same page table) (2MB)
    • Or pointer to pointer to page tables (both point to same page directory) (1GB)
    • Or pointer to pointer to pointer to page tables (0.5TB)
      • Example: kernel portion of address space

  • Need core map for reference counting
    • Array of information about each physical page frame
    • Set of processes pointing to that page frame
    • When reference count goes to zero, reclaim!

20 of 50

Page Table Walk in xk (x86_vm64.c)

21 of 50

Page Translation in the OS

  • OS kernels manage their own data structures about virtual memory
    • List of memory objects in each process (in xk: vregions), possibly shared
    • Set of physical page frames used by this process
      • Reclaim on process exit
    • Set of virtual pages pointing to the same page frame
      • For a shared page frame, when is it safe to reclaim it?
      • Use/dirty bit: set in page table entry by hardware
      • Each process has its own page table
      • Use/dirty bit is the union of the bits set in any page table referencing the physical frame
  • Portability: if we want OS kernel to run on x86, Apple M1, ARM, …
  • Features: copy on write, stack extension, etc.

22 of 50

xk Paging Data Structures (vspace.h)

23 of 50

Multilevel Paging

Pros

  • Efficient memory allocation
  • Efficient for sparse addresses
  • Efficient disk transfers
  • Easier to build translation lookaside buffers
  • Variable granularity for protection/sharing

Cons

  • Every memory access involves 6 memory reads in the general case

24 of 50

Efficient Address Translation

  • Translation lookaside buffer (TLB)
    • Cache of recent virtual page -> physical page translations
    • If cache hit, use translation
    • If cache miss, walk multi-level page table
  • Cost of translation =

Cost of TLB lookup + Prob(TLB miss) * cost of page table lookup

25 of 50

TLB and Page Table Translation

26 of 50

TLB Lookup

Hit

Miss

27 of 50

Question

  • What happens on a context switch?
    • Reuse TLB?
    • Discard TLB? (xk resets TLB)

  • Solution: Tagged TLB
    • Each TLB entry has process ID
    • TLB hit only if process ID matches current process

28 of 50

29 of 50

Question

  • What happens when the OS changes the permissions on a page?
    • For demand paging, copy on write, zero on reference, …
  • TLB may contain old translation
    • OS must ask hardware to purge TLB entry
  • On a multicore: TLB shootdown
    • OS must ask each CPU to purge TLB entry

30 of 50

TLB Shootdown

31 of 50

Question

What is the cost of a TLB miss on a modern processor?

32 of 50

Hardware Design Principle

The bigger the memory, the slower the memory

Why we have multiple levels of cache

L1 cache –

L2 cache – 2 MB,

L3 cache – 320 MB,

80 KB, 1.5 ns

4ns

22ns

Also why we have multiple levels of TLBs

33 of 50

Question

  • What is the cost of a first level TLB miss?
    • Second level TLB lookup
  • What is the cost of a second level TLB miss?
    • 64 bit x86: 4-level page table walk
  • How expensive is a 4-level page table walk?
    • Page table entries are in main memory, so they can be cached

34 of 50

Superpages

  • On x86 and ARM, TLB entry can be
    • A page
    • A superpage: a set of contiguous, aligned pages
  • x86: superpage is set of pages in one page table
    • One page: 4KB
    • One page table: 2MB
    • One page table of page tables: 1GB
  • With superpages, TLB with 64 entries maps
    • 64 * 4KB = 256KB
    • 64 * 2MB = 128MB
    • 64 * 1GB = 64GB

35 of 50

Superpages

36 of 50

Virtually Addressed, Physically Tagged Caches

  • First level cache has at most a few cycles
    • Any delay slows every instruction fetch and data reference
  • Lookup TLB to get physical address, then lookup physical address in the cache?
    • Too slow!
  • Instead, lookup virtual address in cache in parallel with TLB lookup
    • Cache stores both virtual address and physical address (and data)
  • Use data only if both virtual address and physical address match

37 of 50

Virtually Addressed, Physically Tagged Cache

38 of 50

Virtually Addressed, Physically Tagged Cache

39 of 50

Address Translation Goals

  • Memory protection
    • Isolate process to its only memory
    • Prevent virus from re-writing machine instructions
  • Memory sharing
    • Shared libraries, interprocess communication
  • Sparse addresses
    • Dynamically allocated regions: heaps, stacks, mmap
  • Efficiency
    • Reduce fragmentation and copying
    • Runtime lookup cost and TLB hit rate
    • Translation table size
  • Portability

40 of 50

Bonus Feature

  • What if the kernel can regain control whenever a program reads or writes a particular virtual memory location?
  • Examples:
    • Copy on write
    • Zero on reference
    • Fill on demand
    • Demand paging
    • Memory mapped files

41 of 50

Process Regions

  • Every process has logical regions
    • Contiguous region of virtual (process) memory
  • Examples: code, data, heap, stack, dynamic libraries (code, data),

memory mapped files, …

  • Each memory region has its own
    • protection: read-only, read-write, execute-only
    • sharing: code vs. data
    • access pattern: code vs. mmap file

42 of 50

Question

  • How big a user stack should I allocate?
  • What if some programs need a large stack and others need a small one?

43 of 50

Expand Stack on Reference

  • When program references memory beyond end of stack
    • Page fault into OS kernel
    • Kernel allocates some additional memory
      • How much?
    • Remember to zero the memory to avoid accidentally leaking information!
    • Modify page table
    • Resume process

44 of 50

UNIX fork seems inefficient

  • Make a complete copy of process
  • Throw copy away on exec
  • Do we need to make the copy?

45 of 50

Copy on Write

  • Paging allows an efficient fork
    • Copy page table of parent into child
    • Mark all pages (in new/old page tables) as read-only
    • Start child process; restart parent
    • Trap into kernel on write (in child or parent)
    • Copy page
    • Mark both as writeable
    • Resume execution
  • With multilevel page tables
    • Can efficiently mark a region (top level page table entry) as read-only

46 of 50

Question

  • Can I start running a program before all of its code is in memory?

47 of 50

Fill On Demand

  • Set all page table entries to invalid
  • When a page is referenced for first time, kernel trap
  • Kernel brings page in from disk
  • Resume execution
  • Remaining pages can be transferred in the background while program is running

48 of 50

Address Translation Uses

  • Process isolation
    • Keep a process from touching anyone else’s memory, or the kernel’s
  • Efficient interprocess communication
    • Shared regions of memory between processes
  • Shared code segments
    • E.g., common libraries used by many different programs
  • Program initialization
    • Start running a program before it is entirely in memory
  • Dynamic memory allocation
    • Allocate and initialize stack/heap pages on demand

49 of 50

Address Translation (more)

  • Cache management
    • Page coloring
  • Program debugging
    • Data breakpoints when address is accessed
  • Zero-copy I/O
    • Directly from I/O device into/out of user memory
  • Memory mapped files
    • Access file data using load/store instructions
  • Demand-paged virtual memory
    • Illusion of near-infinite memory, backed by disk or memory on other machines

50 of 50

Address Translation (even more)

  • Checkpointing/restart
    • Transparently save a copy of a process, without stopping the program while the save happens
  • Persistent data structures
    • Implement data structures that can survive system reboots
  • Process migration
    • Transparently move processes between machines
  • Information flow control
    • Track what data is being shared externally
  • Distributed shared memory
    • Illusion of memory that is shared between machines