Operating Systems and C, Fall 2022, Performance Track
6. Locality
01.10.2020
Parallel Tracks
perflab:
attacklab:
exciting part of the course!
security-track lecture nr. 1: what you'll need for attacklab.
security-track lecture nr. 2: Linux culture.
performance-track lecture nr. 1: what you'll need for perflab. Array layout, what it means for performance, the cache hierarchy, how associativity is organized in the cache. (Important stuff for anyone.)
performance-track lecture nr. 2: what you'll need for perflab. Optimizations: how to write code so the compiler can derive performant code; manual transformations, blocking, loop unrolling.
Outline
Jim Gray, 2006
http://jimgray.azurewebsites.net/jimgraytalks.htm
He vanished without a trace in 2007. His work has motivated and driven a lot of systems research since.
Jim Gray, 2006
http://jimgray.azurewebsites.net/jimgraytalks.htm
layout of data in memory
Data Systems Group Promotion
performance: benchmarking, microarchitectural analysis, how resources are used. (How much time does the CPU spend waiting for memory?)
Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
Recently referenced items are likely to be referenced again in the near future
Spatial locality:
Items with nearby addresses tend to be referenced close together in time
Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references:
- Reference array elements in succession (stride-1 reference pattern): spatial locality.
- Reference variable sum each iteration: temporal locality.
Instruction references:
- Reference instructions in sequence: spatial locality.
- Cycle through loop repeatedly: temporal locality.
Qualitative Estimates of Locality
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
Question: Does this function have good locality with respect to array a?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Q: which is faster?
Memory Hierarchies
Jim Gray, 2006
how is data transferred from one level to the next?
http://jimgray.azurewebsites.net/jimgraytalks.htm
I/O Bus

[Figure: the CPU chip (register file, ALU, bus interface) connects via the system bus to the I/O bridge, via the memory bus to main memory, and via the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]

A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices.
Reading a Disk Sector (1)
The CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
Reading a Disk Sector (2)
The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory. Note: the CPU is not involved in this.
Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU). We'll learn about interrupts in a future lecture.
Solid-State Drives
[Figure: SSD internals. The drive exposes a logical address space for reads and writes; the controller performs scheduling & mapping, wear leveling, and garbage collection, turning them into read/program/erase operations on a flash memory array of chips organized into channels (the physical address space).]
The disk interface stayed unchanged for a long time, which let us replace mechanical disks with SSDs. But SSDs are very different internally (they have a controller).
Open-Channel SSDs: Design Space
FTL (flash translation layer)
Idea: separate front-end and back-end SSD management.
LightNVM separates (application-customisable) front-end SSD management from (media-specific) back-end SSD management.
Mathias Bjørling, PhD at ITU, upstreamed the Linux driver. Now used by Google, Amazon, Intel, Alibaba, Microsoft, …
Switching to… NVMe (controlled by Intel).
Computational storage: move processing to storage devices!
Programming the Storage Controller
Niclas works �(worked?) on this!
Data Systems Group Promotion
Memory Read Transaction (1)
Load operation: movl A, %eax
The CPU places address A on the memory bus.
Memory Read Transaction (2)
Load operation: movl A, %eax
Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
Memory Read Transaction (3)
Load operation: movl A, %eax
The CPU reads word x from the bus and copies it into register %eax.
Memory Write Transaction (1)
Store operation: movl %eax, A
The CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.
Memory Write Transaction (2)
Store operation: movl %eax, A
The CPU places data word y on the bus. Main memory reads data word y from the bus and stores it at address A.
Conventional DRAM Organization
d x w DRAM:
d * w total bits, organized as d supercells of size w bits

[Figure: a 16 x 8 DRAM chip: 16 supercells of 8 bits each, laid out as a 4 x 4 grid of rows and columns. The memory controller (to/from the CPU) sends a 2-bit address to select a row and then a column, e.g. supercell (2,1); the selected row passes through an internal row buffer, and 8 data bits are transferred.]
RAM is organized as an array of "supercells".
The memory controller takes care of translating between addresses and physical storage.
An Example Memory Hierarchy
L0: Registers. CPU registers hold words retrieved from L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Holds files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).

Going down the hierarchy: larger, slower, cheaper per byte. Going up: smaller, faster, costlier per byte.
This is actually a myth; the local network can be faster than some local disks.
Caches
car mechanic analogy
General Cache Concepts
[Figure: memory pictured as lines numbered 0..15; the cache above it currently holds copies of lines 8, 9, 14, and 3.]

Memory: larger, slower, cheaper memory, viewed as partitioned into "lines". Data is copied in line-sized transfer units.
Cache: smaller, faster, more expensive memory that caches a subset of the lines.

Notes: memory is split into blocks, aka lines, of size 64 bytes. Unit of transfer: 1 line. A number represents a requested memory location.
General Cache Concepts: Hit
[Figure: the cache holds lines 8, 9, 14, and 3.]

Request: 14. Data in line b is needed; line b is in the cache: hit!
General Cache Concepts: Miss
[Figure: the cache holds lines 8, 9, 14, and 3.]

Request: 12. Data in line b is needed; line b is not in the cache: miss! Line b is fetched from memory and stored in the cache.
General Caching Concepts: Types of Cache Misses
Cold (compulsory) miss
Cold misses occur because the cache is empty.
Conflict miss
Most caches limit lines at level k+1 to a small subset (sometimes a singleton) of the line positions at level k.
Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k line.
Capacity miss
Occurs when set of active cache lines (working set) is larger than the cache.
thrashing = every access misses.
placement policy: avoids conflict misses.
replacement policy: must take care, to avoid capacity misses.
Examples of Caching in the Hierarchy
Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By
Registers            | 4-8 byte words       | CPU core            | 0                | Compiler
TLB                  | Address translations | On-Chip TLB         | 0                | Hardware
L1 cache             | 64-byte lines        | On-Chip L1          | 1                | Hardware
L2 cache             | 64-byte lines        | On/Off-Chip L2      | 10               | Hardware
Virtual Memory       | 4-KB pages           | Main memory         | 100              | Hardware + OS
Buffer cache         | Parts of files       | Main memory         | 100              | OS
Disk cache           | Disk sectors         | Disk controller     | 100,000          | Disk firmware
Network buffer cache | Parts of files       | Local disk          | 10,000,000       | AFS/NFS client
Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser
Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server
e.g.: an access to main memory is 100x more expensive than an L1 cache access.
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
Hold frequently accessed blocks of main memory
CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
Typical system structure:
[Figure: CPU chip with register file, ALU, and cache memories behind the bus interface; the system bus leads to the I/O bridge, and the memory bus to main memory.]
given some data, where is it going to be located?
General Cache Organization (S, E, B)
S = 2^s sets; E = 2^e lines per set.
Each line holds a valid bit v, a tag, and B = 2^b bytes of data (bytes 0, 1, 2, …, B-1).

Cache size: C = S x E x B data bytes.

This is the organization of a cache: S sets, each with E lines.
Cache Read
The address of a word is split into t tag bits, s set-index bits, and b block-offset bits.
- set index: selects one of the S = 2^s sets
- tag: compared against the tag of each of the E = 2^e lines in that set (the valid bit must be set)
- block offset: the data begins at this offset within the B = 2^b-byte line
Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 block/set.
Address bits: t=1 (tag), s=2 (set index), b=1 (block offset).

Address trace (reads, one byte per read):

0 [0000]: miss  -> Set 0: v=1, tag=0, block M[0-1]
1 [0001]: hit
7 [0111]: miss  -> Set 3: v=1, tag=0, block M[6-7]
8 [1000]: miss  -> Set 0: v=1, tag=1, block M[8-9]
0 [0000]: miss  -> Set 0: v=1, tag=0, block M[0-1]

We ask for 1 byte, but we transfer 1 line at a time. Here, a line is 2 bytes.
Why index with the middle bits?

Every cell on the right is 1 byte; line size is 2 bytes.
Q: which indexing strategy is best? Hint: sequential access.
A: both give a 50% miss rate on a sequential scan (one miss per 2-byte line), but middle-bit indexing has better temporal behavior, spreading consecutive lines across all sets; high-order-bit indexing maps consecutive lines to the same set, leaving the cache underutilized.
Array Allocation
Basic Principle
T A[L];
A is an Array of data type T and length L
Contiguously allocated region of L * sizeof(T) bytes
The compiler is going to reserve space for the array. Examples follow.
Array Allocation
8 bytes = 64 bits (address size = word size)
Array Access
int val[5] = {0, 1, 2, 3, 4};
Array of data type int and length 5.
Identifier val can be used as a pointer to array element 0: type int *. (Assume val starts at address x.)

Reference | Type  | Value
val[4]    | int   | 4
val       | int * | x
val+1     | int * | x + 4
&val[2]   | int * | x + 8
val[5]    | int   | ?? (undefined behavior)
*(val+1)  | int   | 1
val + i   | int * | x + 4i
Array Example
void zincr(zip_dig z) {
    int i;
    for (i = 0; i < ZLEN; i++)
        z[i]++;
}

Array loop example (x86-64):

    # %rdx = z
    movq $0, %rax              # %rax = i
.L4:                           # loop:
    addl $1, (%rdx,%rax,4)     # z[i]++  (4-byte int element)
    addq $1, %rax              # i++
    cmpq $5, %rax              # i:ZLEN (ZLEN == 5)
    jne .L4                    # if !=, goto loop
Notes: $1 is the constant 1; %r… denotes a register; the addressing mode (%rdx,%rax,4) computes z + 4*i.
A Higher Level Example
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
not in the same cache line as the previous two.
In perflab: use the debugger to find out how the data structure is laid out.
What about writes?
Multiple copies of data exist:
L1, L2, Main Memory, Disk
What to do on a write-hit?
Write-through (write immediately to memory)
Write-back (defer write to memory until replacement of line)
What to do on a write-miss?
Write-allocate (load into cache, update line in cache)
No-write-allocate (writes immediately to memory)
Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate (faster)
Intel Core i7 Cache Hierarchy
[Figure: the processor package contains cores 0..3; each core has its registers, an L1 d-cache, an L1 i-cache, and a unified L2 cache. All cores share a unified L3 cache, backed by main memory.]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles.
L2 unified cache: 256 KB, 8-way, access: 11 cycles.
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles.
Line size: 64 bytes for all caches.
Cache Performance Metrics
Latency Numbers Every Programmer Should Know
https://colin-scott.github.io/personal_website/research/interactive_latency.html
We've seen this. Penalty: from 1 to 100 cycles, a huge difference.
Let's think about those numbers
Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why “miss rate” is used instead of “hit rate”
Arrays and Cache Metrics
C arrays allocated in row-major order
Stepping through columns in one row:
sum += a[0][i];
compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
sum += a[i][0];
compulsory miss rate = 1 (i.e. 100%)
Writing Cache Friendly Code
Make the common case go fast
Focus on the inner loops of the core functions
Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
Locality Example #1
Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][k][j];
    return sum;
}
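One possible answer, as a sketch: the access is a[i][k][j], so the last subscript j must vary fastest; making the j-loop innermost gives a stride-1 pattern. (Variable-length array parameters are used here so the sketch is self-contained.)

```c
/* Same sum, loops permuted so the inner loop is stride-1. */
int sum_array_3d_good(int M, int N, int a[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)   /* j innermost: stride-1 */
                sum += a[i][k][j];
    return sum;
}
```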
Locality Example #2
Assume:
Line size = 32B (big enough for four 64-bit words)
Matrix dimension (N) is very large
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
[Figure: matrix A indexed by (i,k), B by (k,j), C by (i,j).]
Matrix Multiplication
Matrix Multiplication Example
Description:
Multiply N x N matrices
O(N^3) total operations
N reads per source element
N values summed per destination
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Variable sum
held in register
Matrix Multiplication (ijk)
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.
Matrix Multiplication (jik)
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.
Matrix Multiplication (kij)
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.
Matrix Multiplication (ikj)
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.
Matrix Multiplication (jki)
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.
Matrix Multiplication (kji)
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.
Summary of Matrix Multiplication
ijk (& jik): k as inner loop (1.25 misses/iteration)

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): j as inner loop (0.5 misses/iteration)

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): i as inner loop (2.0 misses/iteration)

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Core i7 Matrix Multiply Performance
[Figure: cycles per inner-loop iteration vs. matrix size, one curve per loop-order family: jki/kji (slowest), ijk/jik, kij/ikj (fastest).]
Sharing is a Vulnerability
attackers exploit sharing.
if A & B share E, A can observe/affect B through E.
different attacks depending on what is shared
Attacks: Side-Channel
"somebody toucha my spaghet!"
fundamental tension: security (isolation) vs. performance (sharing)
a note on security
Imitating the Ideal
ideal computer: infinite cores, infinite memory.
fake it
important process requirement: isolation. Processes share resources.
isolation can be violated! (Unintended communication/interference)
Attacks: Side-Channel - Hardware
How It Works
Can I use that resource?
No.
Why not?
Bob’s process is using it.
Why is Bob’s process using it?
Because Bob’s process took the then-branch (not else) in procedure-
Aha! From this, I conclude…!
CPU Timing Attacks
Attack: Process A monitors the CPU load of Process B.
Attack: Race conditions.
table of processes, task manager
who writes to storage first
Attacks: Side-Channel - Hardware - CPU
Processes Share a Cache
Attacks: Side-Channel - Hardware - Cache
different address space, though. How do they communicate?
Cache Timing Attack: Prime+Probe
Attacks: Side-Channel - Hardware- Cache
estimate the nr. of cache misses with a timer.
(Remember the memory hierarchy.)
Willard Rafnsson, IT University of Copenhagen
wilr@itu.dk, https://www.willardthor.com/
goal: tools that developers can use to write secure SW.
sample research (past supervisions):
I like code, and I like proofs.
I created the “Applied Information Security” course.
I’m a barista in Analog.
my research: the intersection of formal methods, programming languages, and computer security.
shameless self-promotion
Take-Aways