1 of 50

CSE 451

Operating Systems

L15 - Address Translation

Slides by: Tom Anderson

Baris Kasikci

2 of 50

Main Points

Address Translation Mechanism

How do we convert a virtual address to a physical address?
Paging
Multilevel paging

Efficient Address Translation

Translation Lookaside Buffers
Virtually and physically addressed caches

Leveraging Address Translation

Efficient process creation, process startup, …

Virtual Machine Address Translation

3 of 50

Address Translation Concept

Intel 57 bits

Kernel trap

Intel 52 bits

Cache line: 64 bytes

Extension: I/O bus

cache coherent (~10TB)

4 of 50

Virtually Addressed Base and Bounds

Base

Base + bound

5 of 50

Virtually Addressed Base and Bounds

Pros?

Simple
Fast (2 registers, adder, comparator)
Can relocate in physical memory without changing process

Cons?

Can’t keep program from accidentally overwriting its own code
Can’t share code/data with other processes
Can’t grow stack/heap as needed
Fragmentation – hard to efficiently use memory if each program is variable size
What if virtual address space > physical address space?

6 of 50

Paged Translation

Manage memory in fixed size units, or pages
Finding free memory is easy

Bitmap allocation: 0011111100000001100 (0 can represent free memory pages)
Each bit represents one physical page frame

Each process has its own page table

Stored in physical memory
Hardware registers for page table start, length

Intel: cr3 register will have the address of the page table

7 of 50

8 of 50

64 bit address

12 bit offset

45 bit page #

4 KB frames

5

12

Virtual address

12 bit

40 bit

Physical address

offset

frame

9 of 50

Process View

A B C D
E F G H
I J K L


I J K L

E F G H
A B C D

4
3
1

Page Table

Physical Memory

3

2

8 page

frames

4 byte

pages

0

9

010

01

000

00

010

001

01

100

00

10 of 50

Page Table Entry (Intel 64 bit)

PTE not present: Unused virtual pages do not need a physical page frame

Trap into kernel if process references memory on that page

11 of 50

don’t need a page frame for all virtual addresses

instruction pages = 0

kernel pages = 0

page replacement

set by HW,

cleared by OS

}

12 of 50

Paging Questions

With paging, what is saved/restored on a process context switch?

What if page size is very small?

What if page size is very large?

13 of 50

Paging Questions

With paging, what is saved/restored on a process context switch?

Pointer to page table, size of page table
Page table itself is in main memory

What if page size is very small?

Overhead: Lots of space taken up by the page table

What if page size is very large?

Internal fragmentation: if we don’t need all of the space inside a page

14 of 50

Paging Questions - 2

How big is the page table based on what we learned so far?

2⁴⁵ entries, 1 word= 64 bits = 8 bytes => 2⁴⁸bytes (256 TB!)

An array is not going to work!

15 of 50

Single Level Page Table

Pros?

Simple memory allocation
Safe
Efficient as long as page size not too large (internal fragmentation)
Page permission can prevent overwriting code by accident
Can share physical page frames (two copies of same program)
Unused virtual pages don’t need to be backed by physical frames

Cons?

An extra memory lookup on every memory reference
Need page table to map entire virtual address space
Page table itself needs to be contiguous in memory (fragmentation)

16 of 50

Two Level Page Tables?

Level 1 table: each PTE points to a page table
Level 2 table: each PTE points to a page

Typically, one page frame

Can share/protect at either level 1 or level 2

Level 1 Table

...

Level 2 Tables

...

12 bit

36 bit

offset

page

9 bit

offset into

level-2 page

index into level-1 page

page frame

of level-2

table

page frame

page 512 entries

17 of 50

Three Level Page Tables

12 bit

9

bit

9

bit

27 bit

no mapping

If PTE is invalid in level 1,

don’t need a level 2 table

If it is invalid in level 2,

don’t need a level 3 table

18 of 50

x86 Multilevel Paging

Each level of page table fits in one 4KB page
32-bit: two level page table

Page table entry is 32 bit, so 2^10 entries per page

64-bit: five level page table

Page table entry is 64 bit, so 2^9 entries per page

Omit sub-tree if page table entry is invalid

Good for sparse address space

19 of 50

Paging and Sharing

Can we use page tables to share memory between processes?

Set page table entry in each process to point to same page frame (4KB)

Or pointer to page table (both point to same page table) (2MB)
Or pointer to pointer to page tables (both point to same page directory) (1GB)
Or pointer to pointer to pointer to page tables (0.5TB)

Example: kernel portion of address space

Need core map for reference counting

Array of information about each physical page frame
Set of processes pointing to that page frame
When reference count goes to zero, reclaim!

20 of 50

Page Table Walk in xk (x86_vm64.c)

21 of 50

Page Translation in the OS

OS kernels manage their own data structures about virtual memory

List of memory objects in each process (in xk: vregions), possibly shared
Set of physical page frames used by this process

Reclaim on process exit

Set of virtual pages pointing to the same page frame

For a shared page frame, when is it safe to reclaim it?
Use/dirty bit: set in page table entry by hardware
Each process has its own page table
Use/dirty bit is the union of the bits set in any page table referencing the physical frame

Portability: if we want OS kernel to run on x86, Apple M1, ARM, …
Features: copy on write, stack extension, etc.

22 of 50

xk Paging Data Structures (vspace.h)

23 of 50

Multilevel Paging

Pros

Efficient memory allocation
Efficient for sparse addresses
Efficient disk transfers
Easier to build translation lookaside buffers
Variable granularity for protection/sharing

Cons

Every memory access involves 6 memory reads in the general case

24 of 50

Efficient Address Translation

Translation lookaside buffer (TLB)

Cache of recent virtual page -> physical page translations
If cache hit, use translation
If cache miss, walk multi-level page table

Cost of translation =

Cost of TLB lookup + Prob(TLB miss) * cost of page table lookup

25 of 50

TLB and Page Table Translation

26 of 50

TLB Lookup

Hit

Miss

27 of 50

Question

What happens on a context switch?

Reuse TLB?
Discard TLB? (xk resets TLB)

Solution: Tagged TLB

Each TLB entry has process ID
TLB hit only if process ID matches current process

28 of 50

29 of 50

Question

What happens when the OS changes the permissions on a page?

For demand paging, copy on write, zero on reference, …

TLB may contain old translation

OS must ask hardware to purge TLB entry

On a multicore: TLB shootdown

OS must ask each CPU to purge TLB entry

30 of 50

TLB Shootdown

31 of 50

Question

What is the cost of a TLB miss on a modern processor?

32 of 50

Hardware Design Principle

The bigger the memory, the slower the memory

Why we have multiple levels of cache

L1 cache –

L2 cache – 2 MB,

L3 cache – 320 MB,

80 KB, 1.5 ns

4ns

22ns

Also why we have multiple levels of TLBs

33 of 50

Question

What is the cost of a first level TLB miss?

Second level TLB lookup

What is the cost of a second level TLB miss?

64 bit x86: 4-level page table walk

How expensive is a 4-level page table walk?

Page table entries are in main memory, so they can be cached

34 of 50

Superpages

On x86 and ARM, TLB entry can be

A page
A superpage: a set of contiguous, aligned pages

x86: superpage is set of pages in one page table

One page: 4KB
One page table: 2MB
One page table of page tables: 1GB

With superpages, TLB with 64 entries maps

64 * 4KB = 256KB
64 * 2MB = 128MB
64 * 1GB = 64GB

35 of 50

Superpages

36 of 50

Virtually Addressed, Physically Tagged Caches

First level cache has at most a few cycles

Any delay slows every instruction fetch and data reference

Lookup TLB to get physical address, then lookup physical address in the cache?

Too slow!

Instead, lookup virtual address in cache in parallel with TLB lookup

Cache stores both virtual address and physical address (and data)

Use data only if both virtual address and physical address match

37 of 50

Virtually Addressed, Physically Tagged Cache

38 of 50

Virtually Addressed, Physically Tagged Cache

39 of 50

Address Translation Goals

Memory protection

Isolate process to its only memory
Prevent virus from re-writing machine instructions

Memory sharing

Shared libraries, interprocess communication

Sparse addresses

Dynamically allocated regions: heaps, stacks, mmap

Efficiency

Reduce fragmentation and copying
Runtime lookup cost and TLB hit rate
Translation table size

Portability

40 of 50

Bonus Feature

What if the kernel can regain control whenever a program reads or writes a particular virtual memory location?
Examples:

Copy on write
Zero on reference
Fill on demand
Demand paging
Memory mapped files
…

41 of 50

Process Regions

Every process has logical regions

Contiguous region of virtual (process) memory

Examples: code, data, heap, stack, dynamic libraries (code, data),

memory mapped files, …

Each memory region has its own

protection: read-only, read-write, execute-only
sharing: code vs. data
access pattern: code vs. mmap file

42 of 50

Question

How big a user stack should I allocate?

What if some programs need a large stack and others need a small one?

43 of 50

Expand Stack on Reference

When program references memory beyond end of stack

Page fault into OS kernel
Kernel allocates some additional memory

How much?

Remember to zero the memory to avoid accidentally leaking information!
Modify page table
Resume process

44 of 50

UNIX fork seems inefficient

Make a complete copy of process
Throw copy away on exec
Do we need to make the copy?

45 of 50

Copy on Write

Paging allows an efficient fork

Copy page table of parent into child
Mark all pages (in new/old page tables) as read-only
Start child process; restart parent
Trap into kernel on write (in child or parent)
Copy page
Mark both as writeable
Resume execution

With multilevel page tables

Can efficiently mark a region (top level page table entry) as read-only

46 of 50

Question

Can I start running a program before all of its code is in memory?

47 of 50

Fill On Demand

Set all page table entries to invalid
When a page is referenced for first time, kernel trap
Kernel brings page in from disk
Resume execution
Remaining pages can be transferred in the background while program is running

48 of 50

Address Translation Uses

Process isolation

Keep a process from touching anyone else’s memory, or the kernel’s

Efficient interprocess communication

Shared regions of memory between processes

Shared code segments

E.g., common libraries used by many different programs

Program initialization

Start running a program before it is entirely in memory

Dynamic memory allocation

Allocate and initialize stack/heap pages on demand

49 of 50

Address Translation (more)

Cache management

Page coloring

Program debugging

Data breakpoints when address is accessed

Zero-copy I/O

Directly from I/O device into/out of user memory

Memory mapped files

Access file data using load/store instructions

Demand-paged virtual memory

Illusion of near-infinite memory, backed by disk or memory on other machines

50 of 50

Address Translation (even more)

Checkpointing/restart

Transparently save a copy of a process, without stopping the program while the save happens

Persistent data structures

Implement data structures that can survive system reboots

Process migration

Transparently move processes between machines

Information flow control

Track what data is being shared externally

Distributed shared memory

Illusion of memory that is shared between machines