Operating Systems and C, Fall 2022, Performance Track
6. Locality
01.10.2020
Parallel Tracks
perflab:
attacklab:
exciting part of the course!
security-track lecture nr. 1: what you'll need for attacklab.
security-track lecture nr. 2: Linux culture.
performance-track lecture nr. 1: what you'll need for perflab. Array layout, what it means for performance, the cache hierarchy, how associativity is organized in the cache. (Important stuff for anyone.)
performance-track lecture nr. 2: what you'll need for perflab. Optimizations: how to write code so the compiler can derive performant code; manual transformations, blocking, loop unrolling.
Outline
Jim Gray, 2006
http://jimgray.azurewebsites.net/jimgraytalks.htm
He vanished without a trace in 2007. His work has motivated and driven a lot of systems research since.
Jim Gray, 2006
http://jimgray.azurewebsites.net/jimgraytalks.htm
layout of data in memory
Data Systems Group Promotion
performance: benchmarking, microarchitectural analysis, how resources are used. (How much time does the CPU spend waiting for memory?)
Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
Recently referenced items are likely to be referenced again in the near future
Spatial locality:
Items with nearby addresses tend to be referenced close together in time
Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references:
- Reference array elements in succession (stride-1 reference pattern): spatial locality.
- Reference variable sum each iteration: temporal locality.
Instruction references:
- Reference instructions in sequence: spatial locality.
- Cycle through loop repeatedly: temporal locality.
Qualitative Estimates of Locality
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
Question: Does this function have good locality with respect to array a?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Q: which is faster?
Memory Hierarchies
Jim Gray, 2006
how is data transferred from one level to the next?
http://jimgray.azurewebsites.net/jimgraytalks.htm
I/O Bus

[Figure: the CPU chip (register file, ALU, bus interface) connects via the system bus to the I/O bridge, via the memory bus to main memory, and via the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]

A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices.
Reading a Disk Sector (1)
The CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
Reading a Disk Sector (2)
The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory. Note: the CPU is not involved in this.
Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU). We'll learn about interrupts in a future lecture.
Solid-State Drives
[Figure: SSD internals. The drive exposes a logical address space for reads and writes; the controller performs scheduling & mapping, wear leveling, and garbage collection, turning them into read/program/erase operations on a flash memory array of chips organized into channels (the physical address space).]
The disk interface stayed unchanged for a long time, which let us replace mechanical disks with SSDs. But SSDs are very different internally (they have a controller).
Open-Channel SSDs: Design Space
FTL (flash translation layer)
Idea: separate front-end and back-end SSD management.
LightNVM separates (application-customisable) front-end SSD management from (media-specific) back-end SSD management.
Mathias Bjørling, PhD at ITU, upstreamed the Linux driver. Now used by Google, Amazon, Intel, Alibaba, Microsoft, …
Switching to… NVMe (controlled by Intel).
Computational storage: move processing to storage devices!
Programming the Storage Controller
Niclas works �(worked?) on this!
Data Systems Group Promotion
Memory Read Transaction (1)
Load operation: movl A, %eax
The CPU places address A on the memory bus.
Memory Read Transaction (2)
Load operation: movl A, %eax
Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
Memory Read Transaction (3)
Load operation: movl A, %eax
The CPU reads word x from the bus and copies it into register %eax.
Memory Write Transaction (1)
Store operation: movl %eax, A
The CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.
Memory Write Transaction (2)
Store operation: movl %eax, A
The CPU places data word y on the bus. Main memory reads data word y from the bus and stores it at address A.
Conventional DRAM Organization
d x w DRAM:
d * w total bits, organized as d supercells of size w bits

[Figure: a 16 x 8 DRAM chip: 16 supercells of 8 bits each, laid out as a 4 x 4 grid of rows and columns. The memory controller (to/from the CPU) sends a 2-bit address to select a row and then a column, e.g. supercell (2,1); the selected row passes through an internal row buffer, and 8 data bits are transferred.]
RAM is organized as an array of "supercells".
The memory controller takes care of translating between addresses and physical storage.
An Example Memory Hierarchy
L0: Registers. CPU registers hold words retrieved from L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Holds files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).

Going down the hierarchy: larger, slower, cheaper per byte. Going up: smaller, faster, costlier per byte.
This is actually a myth; the local network can be faster than some local disks.
Caches
car mechanic analogy
General Cache Concepts
[Figure: memory pictured as lines numbered 0..15; the cache above it currently holds copies of lines 8, 9, 14, and 3.]

Memory: larger, slower, cheaper memory, viewed as partitioned into "lines". Data is copied in line-sized transfer units.
Cache: smaller, faster, more expensive memory that caches a subset of the lines.

Notes: memory is split into blocks, aka lines, of size 64 bytes. Unit of transfer: 1 line. A number represents a requested memory location.
General Cache Concepts: Hit
[Figure: the cache holds lines 8, 9, 14, and 3.]

Request: 14. Data in line b is needed; line b is in the cache: hit!
General Cache Concepts: Miss
[Figure: the cache holds lines 8, 9, 14, and 3.]

Request: 12. Data in line b is needed; line b is not in the cache: miss! Line b is fetched from memory and stored in the cache.
General Caching Concepts: Types of Cache Misses
Cold (compulsory) miss
Cold misses occur because the cache is empty.
Conflict miss
Most caches limit lines at level k+1 to a small subset (sometimes a singleton) of the line positions at level k.
Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k line.
Capacity miss
Occurs when set of active cache lines (working set) is larger than the cache.
thrashing = every access misses.
placement policy: avoids conflict misses.
replacement policy: must take care, to avoid capacity misses.
Examples of Caching in the Hierarchy
Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By
Registers            | 4-8 byte words       | CPU core            | 0                | Compiler
TLB                  | Address translations | On-Chip TLB         | 0                | Hardware
L1 cache             | 64-byte lines        | On-Chip L1          | 1                | Hardware
L2 cache             | 64-byte lines        | On/Off-Chip L2      | 10               | Hardware
Virtual Memory       | 4-KB pages           | Main memory         | 100              | Hardware + OS
Buffer cache         | Parts of files       | Main memory         | 100              | OS
Disk cache           | Disk sectors         | Disk controller     | 100,000          | Disk firmware
Network buffer cache | Parts of files       | Local disk          | 10,000,000       | AFS/NFS client
Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser
Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server
e.g.: an access to main memory is 100x more expensive than an L1 cache access.
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
Hold frequently accessed blocks of main memory
CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
Typical system structure:
[Figure: CPU chip with register file, ALU, and cache memories behind the bus interface; the system bus leads to the I/O bridge, and the memory bus to main memory.]
given some data, where is it going to be located?
General Cache Organization (S, E, B)
S = 2^s sets; E = 2^e lines per set.
Each line holds a valid bit v, a tag, and B = 2^b bytes of data (bytes 0, 1, 2, …, B-1).

Cache size: C = S x E x B data bytes.

This is the organization of a cache: S sets, each with E lines.
Cache Read
The address of a word is split into t tag bits, s set-index bits, and b block-offset bits.
- set index: selects one of the S = 2^s sets
- tag: compared against the tag of each of the E = 2^e lines in that set (the valid bit must be set)
- block offset: the data begins at this offset within the B = 2^b-byte line
Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 block/set.
Address bits: t=1 (tag), s=2 (set index), b=1 (block offset).

Address trace (reads, one byte per read):

0 [0000]: miss  -> Set 0: v=1, tag=0, block M[0-1]
1 [0001]: hit
7 [0111]: miss  -> Set 3: v=1, tag=0, block M[6-7]
8 [1000]: miss  -> Set 0: v=1, tag=1, block M[8-9]
0 [0000]: miss  -> Set 0: v=1, tag=0, block M[0-1]

We ask for 1 byte, but we transfer 1 line at a time. Here, a line is 2 bytes.
Why index with the middle bits?

Every cell on the right is 1 byte; line size is 2 bytes.
Q: which indexing strategy is best? Hint: sequential access.
A: both give a 50% miss rate on a sequential scan (one miss per 2-byte line), but middle-bit indexing has better temporal behavior, spreading consecutive lines across all sets; high-order-bit indexing maps consecutive lines to the same set, leaving the cache underutilized.
Array Allocation
Basic Principle
T A[L];
A is an Array of data type T and length L
Contiguously allocated region of L * sizeof(T) bytes
The compiler is going to reserve space for the array. Examples follow.
Array Allocation
8 bytes = 64 bits (address size = word size)
Array Access
int val[5] = {0, 1, 2, 3, 4};
Array of data type int and length 5.
Identifier val can be used as a pointer to array element 0: type int *. (Assume val starts at address x.)

Reference | Type  | Value
val[4]    | int   | 4
val       | int * | x
val+1     | int * | x + 4
&val[2]   | int * | x + 8
val[5]    | int   | ?? (undefined behavior)
*(val+1)  | int   | 1
val + i   | int * | x + 4i
Array Example
void zincr(zip_dig z) {
    int i;
    for (i = 0; i < ZLEN; i++)
        z[i]++;
}

Array loop example (x86-64):

    # %rdx = z
    movq $0, %rax              # %rax = i
.L4:                           # loop:
    addl $1, (%rdx,%rax,4)     # z[i]++  (4-byte int element)
    addq $1, %rax              # i++
    cmpq $5, %rax              # i:ZLEN (ZLEN == 5)
    jne .L4                    # if !=, goto loop
Notes: $1 is the constant 1; %r… denotes a register; the addressing mode (%rdx,%rax,4) computes z + 4*i.
A Higher Level Example
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
not in the same cache line as the previous two.
In perflab: use the debugger to find out how the data structure is laid out.
What about writes?
Multiple copies of data exist:
L1, L2, Main Memory, Disk
What to do on a write-hit?
Write-through (write immediately to memory)
Write-back (defer write to memory until replacement of line)
What to do on a write-miss?
Write-allocate (load into cache, update line in cache)
No-write-allocate (writes immediately to memory)
Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate (faster)
Intel Core i7 Cache Hierarchy
[Figure: the processor package contains cores 0..3; each core has its registers, an L1 d-cache, an L1 i-cache, and a unified L2 cache. All cores share a unified L3 cache, backed by main memory.]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles.
L2 unified cache: 256 KB, 8-way, access: 11 cycles.
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles.
Line size: 64 bytes for all caches.
Cache Performance Metrics
Latency Numbers Every Programmer Should Know
https://colin-scott.github.io/personal_website/research/interactive_latency.html
We've seen this. Penalty: from 1 to 100 cycles, a huge difference.
Let's think about those numbers
Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why “miss rate” is used instead of “hit rate”
Arrays and Cache Metrics
C arrays allocated in row-major order
Stepping through columns in one row:
sum += a[0][i];
compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
sum += a[i][0];
compulsory miss rate = 1 (i.e. 100%)
Writing Cache Friendly Code
Make the common case go fast
Focus on the inner loops of the core functions
Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
Locality Example #1
Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][k][j];
    return sum;
}
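One possible answer, as a sketch: the access is a[i][k][j], so the last subscript j must vary fastest; making the j-loop innermost gives a stride-1 pattern. (Variable-length array parameters are used here so the sketch is self-contained.)

```c
/* Same sum, loops permuted so the inner loop is stride-1. */
int sum_array_3d_good(int M, int N, int a[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)   /* j innermost: stride-1 */
                sum += a[i][k][j];
    return sum;
}
```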
Locality Example #2
Assume:
Line size = 32B (big enough for four 64-bit words)
Matrix dimension (N) is very large
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
[Figure: matrix A indexed by (i,k), B by (k,j), C by (i,j).]
Matrix Multiplication
Matrix Multiplication Example
Description:
Multiply N x N matrices
O(N^3) total operations
N reads per source element
N values summed per destination
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Variable sum
held in register
Matrix Multiplication (ijk)
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.
Matrix Multiplication (jik)
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A 0.25, B 1.0, C 0.0.
Matrix Multiplication (kij)
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.
Matrix Multiplication (ikj)
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A 0.0, B 0.25, C 0.25.
Matrix Multiplication (jki)
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.
Matrix Multiplication (kji)
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A 1.0, B 0.0, C 1.0.
Summary of Matrix Multiplication
ijk (& jik): k as inner loop (1.25 misses/iteration)

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): j as inner loop (0.5 misses/iteration)

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): i as inner loop (2.0 misses/iteration)

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Core i7 Matrix Multiply Performance
[Figure: cycles per inner-loop iteration vs. matrix size, one curve per loop-order family: jki/kji (slowest), ijk/jik, kij/ikj (fastest).]
Sharing is a Vulnerability
attackers exploit sharing.
if A & B share E, A can observe/affect B through E.
different attacks depending on what is shared
Attacks: Side-Channel
"somebody toucha my spaghet!"
fundamental tension: security (isolation) vs. performance (sharing)
a note on security
Imitating the Ideal
ideal computer: infinite cores, infinite memory.
fake it
important process requirement: isolation. Processes share resources.
isolation can be violated! (Unintended communication/interference)
Attacks: Side-Channel - Hardware
How It Works
Can I use that resource?
No.
Why not?
Bob’s process is using it.
Why is Bob’s process using it?
Because Bob’s process took the then-branch (not else) in procedure-
Aha! From this, I conclude…!
CPU Timing Attacks
Attack: Process A monitors the CPU load of Process B.
Attack: Race conditions.
table of processes, task manager
who writes to storage first
Attacks: Side-Channel - Hardware - CPU
Processes Share a Cache
Attacks: Side-Channel - Hardware - Cache
different address space, though. How do they communicate?
Cache Timing Attack: Prime+Probe
Attacks: Side-Channel - Hardware- Cache
estimate the nr. of cache misses with a timer.
(Remember the memory hierarchy.)
Willard Rafnsson, IT University of Copenhagen
wilr@itu.dk, https://www.willardthor.com/
goal: tools that developers can use to write secure SW.
sample research (past supervisions):
I like code, and I like proofs.
I created the “Applied Information Security” course.
I’m a barista in Analog.
my research: the intersection of formal methods, programming languages, and computer security.
shameless self-promotion
Take-Aways