1 of 120

WES237B: Software for Embedded Systems

Pat Pannuto

Department of Computer Science and Engineering

University of California, San Diego

Summer Session 2023

2 of 120

Today

  • Morning
    • Performance & Optimization
    • Optimization at different levels (in the context of a single core)
      • Algorithmic (Cache Optimization), Compiler and Hardware
      • Memory Hierarchy

  • Afternoon
    • Lab session
      • Applying optimizations: Loop unrolling, hardware vectorization (NEON)
      • FIR Filters

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

3 of 120

Logistics

  • Assignment 2 due tomorrow
  • Assignment 3 released
    • Generally: First bit of assignment is lab portion, the rest is take-home
    • Finish the lab first if needed

  • Late policy?
    • 1 week: 10% off
    • 2 weeks: Last chance, 30% off
    • Email with any exceptional circumstances, we’ll work it out

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

4 of 120

Recap

  • Measuring Performance
    • Speedup / performance always relative and workload specific
    • Amdahl’s Law: Optimization benefit limited to proportion it improves
  • Optimizations
    • Hard & Getting Harder due to heterogeneity
    • Must learn systems-level design and analysis
  • Microarchitecture
    • ILP, Superscalar, SIMD, and Pipelining
    • (perf: throughput)
  • Memory Hierarchy
    • Memory is slow, exploit spatial and temporal locality for performance

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

5 of 120

A Modern Memory Hierarchy

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

Register File, 32 words, sub-nsec

L1 Cache, ~32 KB, ~nsecs

L2 cache, 512KB – 1 MB, ~nsecs

L3 Cache, ~ 10 nsecs

Main memory (DRAM), GB, ~100 ns

Disk, 100 GB, ~10 msec

6 of 120

The guts of most interesting stuff is often ~this

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

void Foo(float* const A, float* const B, float* const C, const int M, const int N, const int K) {

    for (int m = 0; m < M; m++) {

        for (int n = 0; n < N; n++) {

            for (int k = 0; k < K; k++) {

                C[m * N + n] += A[m * K + k] * B[k * N + n];

            }

        }

    }

}

And these are huge

Lab2: image.tif

480x640 pixels * 4 bytes/float ~= 1.2 MB (greyscale!)
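One of the classic cache-level rewrites of a loop nest like this is blocking (tiling), so that a small tile of A, B, and C stays resident in cache and gets reused before it is evicted. A minimal sketch, not the lab's required solution; BLOCK is a tuning parameter you would pick for the target's cache sizes:

```c
#define BLOCK 32  /* tile size; tune so three BLOCK x BLOCK tiles fit comfortably in cache */

void FooTiled(float* const A, float* const B, float* const C,
              const int M, const int N, const int K) {
    for (int m0 = 0; m0 < M; m0 += BLOCK) {
        for (int n0 = 0; n0 < N; n0 += BLOCK) {
            for (int k0 = 0; k0 < K; k0 += BLOCK) {
                /* multiply one BLOCK x BLOCK tile; the "&&" conditions guard the edges */
                for (int m = m0; m < m0 + BLOCK && m < M; m++) {
                    for (int k = k0; k < k0 + BLOCK && k < K; k++) {
                        const float a = A[m * K + k];        /* reused across the inner loop */
                        for (int n = n0; n < n0 + BLOCK && n < N; n++) {
                            C[m * N + n] += a * B[k * N + n];
                        }
                    }
                }
            }
        }
    }
}
```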

7 of 120

EXAMPLES OF SIMPLE CACHES

7

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

7

8 of 120

A simple cache

  • A cache that can put a line of data anywhere is called __________________________
  • The most popular replacement strategy is LRU ( ).

[Figure: a 4-entry cache; each entry is a (tag, data) pair. The tag identifies the address of the cached data. 4 entries, each block holds one word, any block can hold any word.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4
(each address is also shown on the slide in 8-bit binary, e.g. 4 = 00000100, 8 = 00001000, 20 = 00010100)

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

8

9 of 120

A simple cache

  • A cache that can put a line of data anywhere is called Fully Associative
  • The most popular replacement strategy is LRU ( Least Recently Used ).

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

9

10 of 120

A simpler cache

  • A cache that can put a line of data in exactly one place is called __________________.
  • Advantages/disadvantages vs. fully-associative?

[Figure: a 4-entry cache; an index is used to determine which line an address might be found in. 4 entries, each block holds one word, each word in memory maps to exactly one cache location. Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

10

11 of 120

A simpler cache

  • A cache that can put a line of data in exactly one place is called direct mapped
  • Advantages/disadvantages vs. fully-associative?

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

11

12 of 120

A set-associative cache

  • A cache that can put a line of data in exactly n places is called n-way ______________________.
  • The cache lines/blocks that share the same index are a cache ____________.

[Figure: a 4-entry cache organized as sets of two (tag, data) ways. 4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines. Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

12

13 of 120

A set-associative cache

  • A cache that can put a line of data in exactly n places is called n-way set-associative.
  • The cache lines/blocks that share the same index are a cache set.

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

13

14 of 120

Longer Cache Blocks

  • Large cache blocks take advantage of spatial locality.
  • Too large of a block size can waste cache space.
  • Longer cache blocks require less tag space

[Figure: a 4-entry cache with (tag, data) entries. 4 entries, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches). Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

14

15 of 120

Longer Cache Blocks

  • Large cache blocks take advantage of spatial locality.
  • Too large of a block size can waste cache space.
  • Longer cache blocks require less tag space

[Figure: same as the previous slide, with the data column now 64 bits wide.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

15

16 of 120

Q: Describing Cache Type Tradeoffs?

  1. Exceptional usage of the cache space in exchange for a slow hit time
  2. Poor usage of the cache space in exchange for an excellent hit time
  3. Reasonable usage of cache space in exchange for a reasonable hit time

Selection   Fully-Associative   4-way Set Associative   Direct Mapped
A           3                   2                       1
B           3                   3                       2
C           1                   2                       3
D           3                   2                       1
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

16

17 of 120

Back to Block Size

  • If block size increases spatial locality, should we just make the cache block size really, really big????

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

17

18 of 120

Block Size and Miss Rate

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

18

19 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

tag

data

tag

data

Bytes per block

Blocks per set

Sets per Cache

Warning / Notice—Things that count towards “cache size”: cache data

Things that do not count towards “cache size”: tags, valid bits, etc…

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

19

20 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

tag

data

tag

data

Bytes per block

Blocks per set

Sets per Cache

Warning / Notice—Things that count towards “cache size”: cache data

Things that do not count towards “cache size”: tags, valid bits, etc…

Is this very confusing?

Yes, until you implement one.

It is what the world has settled on, so you need to know this and become comfortable with it.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

20

21 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

  • 128 blocks, 32-byte block size, direct mapped, size = ?

  • 128 KB cache, 64-byte blocks, 512 sets, associativity = ?

(always keep in mind “cache size” only counts the data storage)
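The two questions above are just this formula rearranged; a throwaway check in C, using the numbers from the slide:

```c
#include <stdio.h>

int main(void) {
    /* 128 blocks, 32-byte block size, direct mapped */
    int size = 128 * 32 * 1;                 /* sets * block size * associativity */
    printf("cache size = %d bytes = %d KB\n", size, size / 1024);

    /* 128 KB cache, 64-byte blocks, 512 sets */
    int assoc = (128 * 1024) / (512 * 64);   /* size / (sets * block size) */
    printf("associativity = %d-way\n", assoc);
    return 0;
}
```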

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

21

22 of 120

Q: How many bits for each field?

  • Generally, we have variables block_size, cache_size, and memory_size
    • Let’s work it out for
      • BS = 8 bytes
      • CS = 1 KB
      • MS = 4 MB
    • And we have a 4-way set-associative cache?
  • What is the…
    • Number of bits for the block offset: ____
    • Number of bits for the index: ____
    • Number of bits for the tag: ____

Address fields (carved out of the N-bit address): tag | index | block offset
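A generic way to compute the three field widths from the parameters above (a sketch; it assumes every size is a power of two and that memory_size fixes the address width):

```c
#include <stdio.h>

/* integer log2; assumes x is a power of two */
static int log2i(unsigned long x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned long BS = 8, CS = 1024, MS = 4UL * 1024 * 1024;  /* this slide's numbers */
    int ways = 4;

    int offset_bits = log2i(BS);
    int sets        = (int)(CS / (BS * ways));
    int index_bits  = log2i(sets);
    int tag_bits    = log2i(MS) - index_bits - offset_bits;

    printf("offset = %d, index = %d, tag = %d bits\n", offset_bits, index_bits, tag_bits);
    return 0;
}
```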

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

22

23 of 120

Q: How many bits for each field?

  • Generally, we have variables block_size, cache_size, and memory_size
    • Let’s work it out for
      • BS = 32 bytes
      • CS = 16 KB
      • MS = 8 MB
    • And we have a 8-way set-associative cache?
  • What is the…
    • Number of bits for the block offset: ____
    • Number of bits for the index: ____
    • Number of bits for the tag: ____

Address fields (carved out of the N-bit address): tag | index | block offset

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

23

24 of 120

Handling a Cache Access

  • 1. Use index and tag to access cache and determine hit/miss.
  • 2. If hit, return requested data.
  • 3. If miss, select a cache block to be replaced, and access memory or next lower cache (possibly stalling the processor).
    • load entire missed cache line into cache
    • return requested data to CPU (or higher cache)
  • 4. If next lower memory is a cache, goto step 1 for that cache.
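A sketch of those steps for a single direct-mapped level (a simplified software model, not any particular hardware; a miss simply charges the fill against a flat next_level_mem array):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SETS   2048
#define BLOCK_SIZE 32

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct line cache[NUM_SETS];

/* returns true on hit; on a miss, loads the whole block from the next level */
bool access_cache(uint32_t addr, uint8_t *out_byte, const uint8_t *next_level_mem) {
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;       /* step 1: index selects the line */
    uint32_t tag    = addr / BLOCK_SIZE / NUM_SETS;
    struct line *l  = &cache[index];

    bool hit = l->valid && l->tag == tag;                   /* step 1: tag check */
    if (!hit) {                                             /* step 3: fill the selected line */
        memcpy(l->data, next_level_mem + (addr - offset), BLOCK_SIZE);
        l->valid = true;
        l->tag   = tag;
    }
    *out_byte = l->data[offset];                            /* step 2/3: return requested data */
    return hit;
}
```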

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

24

25 of 120

Accessing a Sample Cache

  • 64 KB cache, direct-mapped, 32-byte cache block size

Address bits: 31…16 = tag (16 bits) | 15…5 = index (11 bits) | 4…0 = block offset (5 bits)

[Figure: the 11 index bits select one of the 2 K cache blocks/sets (64 KB / 32 bytes, entries 0…2047); each entry holds a valid bit, a 16-bit tag, and 32 bytes (256 bits) of data. The stored tag is compared (=) against the address tag to produce hit/miss, and the block offset selects the word within the block.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

25

26 of 120

Accessing a Sample Cache

  • 32 KB cache, 2-way set-associative, 16-byte block size

Address bits: 31…14 = tag (18 bits) | 13…4 = index (10 bits) | 3…0 = block offset (4 bits)

[Figure: the 10 index bits select one of the 1 K cache sets (32 KB / 16 bytes / 2 ways, sets 0…1023); each set holds two ways, each with a valid bit, an 18-bit tag, and a data block. Both stored tags are compared (=) against the address tag to produce hit/miss, and the block offset selects the word.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

26

27 of 120

Cache Alignment

  • The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
  • This results in
    • no overlap of cache lines
    • easy mapping of addresses to cache lines (no additions)
    • data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.
  • Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
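Because lines are aligned, the base address of the cache line containing any byte is just the address with the block-offset bits cleared (a one-liner, assuming the block size is a power of two):

```c
#define BLOCK_SIZE 32u  /* power of two */

/* byte X lives at offset (X % BLOCK_SIZE) within the block that starts here */
static inline unsigned long block_base(unsigned long addr) {
    return addr & ~(unsigned long)(BLOCK_SIZE - 1);
}
```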

[Figure: a memory address split into tag | index | block offset, and main memory drawn as consecutive cache-line-sized blocks 0, 1, 2, …, 10, …]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

27

28 of 120

How does the cache actually connect to the execution core?

  • This is a “standard, five stage pipeline”

  • For machines with no caches, the “DM” looks more like this:

[Figure: the five-stage pipeline IM → Reg → ALU → DM → Reg. In the no-cache case, the ALU feeds a Memory Interface Module in place of DM, with Address, Data, Read?, and Write? signals going out and Data coming back.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

28

29 of 120

How does the cache actually connect to the execution core?

  • This is a “standard, five stage pipeline”

  • For machines with no caches, the “DM” looks more like this:

[Figure: the same pipeline and Memory Interface Module, now with a Stall signal back to the pipeline and a Memory Bus connecting to a DRAM Controller, DRAM, and Other Peripherals.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

29

30 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: zoomed view of the ALU/DM stage: the Memory Interface Module (Address, Data, Read?, Write?, Stall) connects over the Memory Bus to the DRAM Controller, DRAM, and Other Peripherals.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

30

31 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: same as the previous slide, with Cache(s) inserted between the Memory Interface Module and the Memory Bus.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

31

32 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: same as the previous slide, with the L1 Cache drawn inside the Memory Interface Module: a real interface (from a simple design).]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

32

33 of 120

Cache Alignment Revisited

  • The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
  • Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
  • “Block-aligned” access is similar to the “word-aligned” access you have been thinking about already
    • Core <> Cache interface: Word aligned
    • Cache <> Mem interface: Block aligned

[Figure: a memory address split into tag | index | block offset, and main memory drawn as consecutive cache-line-sized blocks 0, 1, 2, …, 10, …]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

33

34 of 120

When executing code, what does all this have to do with performance and alignment?

[Figure: the five-stage pipeline IM → Reg → ALU → DM → Reg.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

34

35 of 120

Which of the following things are possible?

  1. Is it possible to make a “set-associative” cache that has N=1 sets?
  2. Is it possible for a set-associative cache to have N=3 sets?
  3. Is it possible for a cache to have a 12-byte block size?

Selection   Fully-Associative   4-way Set Associative   Direct Mapped
A           3                   2                       1
B           3                   3                       2
C           1                   2                       3
D           3                   2                       1
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

35

36 of 120

Associative Caches

  • Higher hit rates, but...

  • longer access time
    • (longer to determine hit/miss, more muxing of outputs)
  • more space (longer tags)
    • 16 KB, 16-byte blocks, DM, tag = ?
    • 16 KB, 16-byte blocks, 4-way, tag = ?
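A quick way to check those two blanks, assuming 32-bit addresses (the slide does not fix the address width, so that part is an assumption):

```c
#include <stdio.h>

static int log2i(unsigned long x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    unsigned long cache = 16 * 1024, block = 16;
    int addr_bits = 32;                                       /* assumption */
    int dm_tag  = addr_bits - log2i(cache / block)     - log2i(block);
    int sa4_tag = addr_bits - log2i(cache / block / 4) - log2i(block);
    printf("DM tag = %d bits, 4-way tag = %d bits\n", dm_tag, sa4_tag);  /* 18 vs 20 */
    return 0;
}
```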

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

36

37 of 120

for (int i = 0; i < 10000000; i++)

    sum += A[i];

Assume each element of A is 4 bytes and sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks. Assume this loop is visited many times.

Which changes would help the hit rate of the above code?

Selection   Change
A           Increase to 2-way set associativity
B           Increase block size to 64 bytes
C           Increase cache size to 64 KB
D           A and C combined
E           A, B, and C combined

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

37

38 of 120

for (int i = 0; i < 10000000; i++)

    for (int j = 0; j < 8192; j++)

        sum += A[j] - B[j];

Assume each element of A and B are 4 bytes.

Assume each array is at least 32KB in size.

Assume sum is kept in a register.

Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks.

Which changes would help the hit rate of the above code?

Selection   Change
A           Increase to 2-way set associativity
B           Increase block size to 64 bytes
C           Increase cache size to 64 KB
D           A and C combined
E           A, B, and C combined

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

38

39 of 120

Dealing with Stores

  • Stores must be handled differently than loads, because...
    • they don’t necessarily require the CPU to stall.
    • they change the content of cache/memory (creating memory consistency issues)
  • Q: Can you think of a situation when you might need to load from memory before you can execute a store?
    • Can you think of another one?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

39

40 of 120

Policy decisions for stores

  • Keep memory and cache identical?
    • write-through => all writes go to both cache and main memory
    • write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.
  • Make room in cache for store miss?
    • write-allocate => on a store miss, bring written line into the cache
    • write-around => on a store miss, ignore cache

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

40

41 of 120

Dealing with stores

  • On a store hit, write the new data to cache.
    • In a write-through cache, write the data immediately to memory.
    • In a write-back cache, mark the line as dirty.
  • On a store miss, initiate a cache block load from memory for a write-allocate cache.
    • Write directly to memory for a write-around cache.
  • On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.
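A sketch of that store path for a write-back, write-allocate cache (simplified to a single line; the two helpers are stand-ins for traffic to the next level of the hierarchy):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_BLOCK 32

struct wb_line { bool valid, dirty; uint32_t tag; uint8_t data[WB_BLOCK]; };

/* stand-ins for the next level of the hierarchy */
static void writeback_to_memory(struct wb_line *l) { (void)l; }
static void fill_from_memory(struct wb_line *l, uint32_t tag) { l->valid = true; l->dirty = false; l->tag = tag; }

void store_byte(struct wb_line *l, uint32_t tag, uint32_t offset, uint8_t value) {
    if (!(l->valid && l->tag == tag)) {   /* store miss */
        if (l->valid && l->dirty)
            writeback_to_memory(l);       /* dirty victim: write it back first */
        fill_from_memory(l, tag);         /* write-allocate: bring the missed line in */
    }
    l->data[offset] = value;              /* store hit path: write the cache only... */
    l->dirty = true;                      /* ...and mark the line dirty (write-back) */
}
```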

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

41

42 of 120

Cache Performance

CPI = BCPI + MCPI

    • BCPI = base CPI, which means the CPI assuming perfect memory
    • MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

MCPI = accesses/instruction * miss rate * miss penalty

    • this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
    • If the miss penalty or miss rate is different for Inst cache and data cache (common case), then

MCPI = I$ accesses/inst x I$MissRate x I$MissPenalty

+ D$ accesses/inst x D$MissRate x D$MissPenalty
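Plugging hypothetical numbers (not from the slide) into the split formula:

```c
#include <stdio.h>

int main(void) {
    /* hypothetical workload */
    double BCPI = 1.0;
    double i_acc = 1.0, i_miss = 0.02, i_pen = 12.0;   /* I$ accesses/inst, miss rate, penalty */
    double d_acc = 0.3, d_miss = 0.05, d_pen = 12.0;   /* D$ accesses/inst, miss rate, penalty */

    double MCPI = i_acc * i_miss * i_pen + d_acc * d_miss * d_pen;
    printf("CPI = %.2f\n", BCPI + MCPI);               /* 1.0 + 0.24 + 0.18 = 1.42 */
    return 0;
}
```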

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

42

43 of 120

In fact…

  • Can generalize this “formula” further for other stalls
    • (This is just putting a label on the ”penalty” idea we introduced earlier)

  • CPI = BCPI + DHSPI + BHSPI + MCPI
    • DHSPI = data hazard stalls per instruction
    • BHSPI = branch hazard stalls per instruction.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

43

44 of 120

Cache Performance

  • Instruction cache miss rate of 4%
  • Data cache miss rate of 10%
  • BaseCPI = 1.0 (no data or control hazards)
  • 20% of instructions are loads and stores
  • Miss penalty = 12 cycles

CPI = ???

Selection   CPI (rounded if necessary)
A           1.24
B           1.34
C           1.48
D           1.72
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

44

45 of 120

Cache Performance

  • Unified cache
  • 25% of instructions are loads and stores
  • BaseCPI = 1.2, miss penalty of 10 cycles

  • If we improve the miss rate from 10% to 4% (e.g. with a larger cache), how much do we improve performance?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

45

46 of 120

Cache Performance

  • BaseCPI = 1
  • Miss rate of 8% overall, 20% loads, miss penalty 20 cycles, never stalls on stores.

  • What is the speedup from doubling the CPU clock rate?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

46

47 of 120

Example -- DEC Alpha 21164 Caches

  • ICache and DCache -- 8 KB, DM, 32-byte lines
  • L2 cache -- 96 KB, ?-way SA, 32-byte lines
  • L3 cache -- 1 MB, DM, 32-byte lines

[Figure: the 21164 CPU core with its on-chip Instruction Cache and Data Cache, a unified on-chip L2 Cache, and an Off-Chip L3 Cache.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

47

48 of 120

Three types of cache misses

  • Compulsory (or cold-start) misses
    • first access to the data.
  • Capacity misses
    • we missed only because the cache isn’t big enough.
  • Conflict misses
    • we missed because the data maps to the same line as other data that forced it out of the cache.

[Figure: the same 4-entry direct-mapped (tag, data) cache as before.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

48

49 of 120

Q: Categorizing Misses

  • Suppose you experience a cache miss on a block (let's call it block A).
  • You have accessed block A in the past.
  • There have been precisely 1027 different blocks accessed between your last access to block A and your current miss.
  • Your block size is 32-bytes and you have a 64KB cache. What kind of miss was this?

Selection   Cache Miss
A           Compulsory
B           Capacity
C           Conflict
D           Both Capacity and Conflict
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

49

50 of 120

So, then, how do we decrease...

  • Compulsory misses?
  • Capacity misses?
  • Conflict misses?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

50

51 of 120

Cache Associativity

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

51

52 of 120

LRU replacement algorithms

  • only needed for associative caches
  • requires one bit for 2-way set-associative, 8 bits (per set, 2/line) for 4-way, 24 bits for 8-way…
  • can be emulated with log n bits (NMRU)
  • can be emulated with use bits for highly associative caches (like page tables)
  • However, for most caches (eg, associativity <= 8), LRU is calculated exactly.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

52

53 of 120

Caches in Current Processors

  • Not long ago, they were DM at lowest level (closest to CPU), associative further away. Today they are less associative near the processor (2-4+), and more associative farther away (4-16).
  • split I and D close (L1) to the processor (for throughput rather than miss rate), unified further away (L2 and beyond).
  • write-through and write-back both common, but never write-through all the way to memory.
  • 64-byte cache lines common (but getting larger)

  • Non-blocking
    • processor doesn’t stall on a miss, but only on the use of a miss (if even then)
    • this means the cache must be able to keep track of multiple outstanding accesses, even multiple outstanding misses.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

53

54 of 120

Intel Nehalem (i7)

[Figure: four cores, each with its own I$, D$, and private L2 cache, all sharing a single L3 cache.]

Instruction Cache: 32 KB, 4-way, 64-byte line

Data Cache: 32 KB, 8-way, 64-byte line, write-back, write-allocate

Unified L2 Cache: 256 KB, 8-way, 64-byte line, write-back, write-allocate

Shared, unified L3 Cache: 8 MB, 16-way, 64-byte line, write-back, write-allocate

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

54

55 of 120

What does PYNQ Provide?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

55

56 of 120

Key Points

  • Caches give illusion of a large, cheap memory with the access time of a fast, expensive memory.
  • Caches take advantage of memory locality, specifically temporal locality and spatial locality.
  • Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time to maximize performance.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

56

57 of 120

Cache Extras

Additional Examples / Alternative Drawings

57

58 of 120

Cache Concepts

58

[Figure: main memory drawn as blocks 0–15; the cache currently holds copies of blocks 8, 9, 14, and 3, and blocks (e.g., 4 and 10) move between memory and cache in block-sized units.]

Larger, slower, cheaper memory is viewed as partitioned into “blocks”. Data is copied in block-sized transfer units. The smaller, faster, more expensive memory caches a subset of the blocks.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

59 of 120

Cache Concepts

59

[Figure: the same memory (blocks 0–15) and cache (holding blocks 8, 9, 14, 3).]

Request: 14. Data in block b is needed; block b is in the cache: Hit!

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

60 of 120

60

[Figure: the same memory and cache.]

Request: 12. Data in block b is needed; block b is not in the cache: Miss! Block b (12) is fetched from memory and stored in the cache.

  • Placement policy:�determines where b goes
  • Replacement policy:�determines which block�gets evicted (victim)

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

61 of 120

Cache Concepts

  • Cold (compulsory) miss
    • Cold misses occur because the cache is empty.
  • Conflict miss
    • Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
      • E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
    • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
      • E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
  • Capacity miss
    • Occurs when the set of active cache blocks (working set) is larger than the cache.

61

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

62 of 120

Cache Concept Summary

  • The speed gap between CPU, memory and mass storage continues to widen.

  • Well-written programs exhibit a property called locality.

  • Memory hierarchies based on caching close the gap by exploiting locality.

62

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

63 of 120

Types of Cache Implementation

  • Direct mapped cache
    • Simple but lower spatial efficiency
    • Fast
  • Fully associative cache
    • Best spatial efficiency, complicated hardware design
    • Slow
  • N-way set associative cache
    • Frequently used in practice to balance trade-off between performance and complexity
    • The cache is divided into groups of blocks, called sets.
    • Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set.

63

64 of 120

Cache Abstraction & Metrics

  • Cache hit rate = #hits/#accesses
  • Average memory access time = hit rate * hit latency + miss rate * miss latency

64

[Figure: an Address goes to the Tag Store (“Is this address in the cache?” + bookkeeping), which produces Hit/miss, and to the Data Store (stores memory blocks), which produces the Data.]
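With hypothetical numbers (not from the slide), the metric works out like this:

```c
#include <stdio.h>

int main(void) {
    /* hypothetical: 95% hit rate, 1-cycle hits, 20-cycle misses */
    double hit_rate = 0.95, hit_latency = 1.0, miss_latency = 20.0;
    double amat = hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency;
    printf("AMAT = %.2f cycles\n", amat);   /* 0.95*1 + 0.05*20 = 1.95 */
    return 0;
}
```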

65 of 120

Blocks and Addressing the Cache

  • Memory is logically divided into fixed size blocks
  • Each block maps to a location into the cache
    • Determined by the index bits in the address

  • Cache access
    • Use index bits to index to cache line
    • Check the valid bit stored in the tag store
    • Check tag bits against the tag stored in the tag store
  • If valid bit is true, and tag matches 🡺 cache hit

65

Address fields: tag (2b) | index (3b) | offset (3b)

66 of 120

Direct Mapped Cache

  • Assume byte-addressable memory, 256 bytes of memory
  • 8-byte blocks (each block contains 8 bytes) 🡪 32 blocks
  • Assume a 64-byte cache 🡪 8 cache blocks

[Figure: the address split as tag (2b) | index (3b) | offset (3b). The tag store holds a valid bit and a tag per line; the data store holds the blocks. The index selects a line, the stored tag is compared (==/!=) with the address tag to produce Hit?, and the offset selects the Data.]
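For the parameters on this slide (8-bit addresses, 8-byte blocks, 8 cache lines), pulling the three fields out of an address is just shifts and masks; a small sketch with a hypothetical address:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t addr = 0x64;                   /* hypothetical 8-bit address */
    unsigned offset = addr & 0x7;          /* low 3 bits: byte within the 8-byte block */
    unsigned index  = (addr >> 3) & 0x7;   /* next 3 bits: which of the 8 cache lines */
    unsigned tag    = addr >> 6;           /* top 2 bits: tag stored with the line */
    printf("tag = %u, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}
```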

67 of 120

Direct Mapped Cache Example

  • Let us assume memory is 64 bytes (byte addressable).
  • How many address bits do we need? We need 6 address bits (m = 6).
  • Each block contains 4 bytes. What is the number of unique locations we can address?

67

68 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16

68

[Figure: the 64 memory bytes, addresses 0–63, drawn as 16 blocks of 4 bytes each.]

69 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?

69

[Figure: the same 64 bytes with the blocks numbered: block 0 holds bytes 0–3, block 1 holds bytes 4–7, …, block 15 holds bytes 60–63.]

70 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?

70

[Figure: the same 16 blocks of 4 bytes.]

Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits

71 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?
  • What is the value of l ?

71

[Figure: the same 16 blocks of 4 bytes.]

Block Size = 4 = 2² 🡺 b = 2 block-offset bits. The m-bit address splits into Block Address (m-b bits) | Block Offset (b = 2 bits).

72 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

72

[Figure: the 16 memory blocks on one side and a 4-line cache on the other; each cache line (0–3) holds one 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. Address = Block Address (m-b bits) | Block Offset (b = 2 bits).

73 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

73

[Figure: the 16 memory blocks and the 4-line cache; each cache line now stores a tag alongside its 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. #lines = 4 = 2² 🡺 l = 2. Address = tag | line index (l = 2 bits) | block offset (b = 2 bits).

74 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

74

[Figure: the same memory and cache; each of the 4 cache lines holds a tag plus one 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. #lines = 4 = 2² 🡺 l = 2. Address = tag | line index (l = 2 bits) | block offset (b = 2 bits).

75 of 120

Direct Mapped Cache Example

75

[Figure: each block address 0–15 written in 4-bit binary; the low 2 bits (the line index) pick one of the 4 cache lines, and the high 2 bits form the tag stored with that line.]

76 of 120

Direct Mapped Cache Example

76

[Figure: the same block-address-to-cache-line mapping as the previous slide.]

77 of 120

Direct Mapped Cache Example

77

[Figure: the same block-address-to-cache-line mapping as the previous slide.]

78 of 120

2-way set associative cache

  • MM = 64 bytes
  • CS = 32 bytes
  • BS = 4 bytes
  • Set size = 2 cache lines
  • #cache lines = CS/BS = 8
  • #sets = #cache lines/set size = 4

78

Address fields: tag | set index | Block Offset (b=2); the Block Address (m-b bits) splits into the tag and the set index.

  • 2-way set associative cache
    • Find the set
    • Compare tag with the tags stored

79 of 120

Types of Cache Implementation

  • Direct mapped cache
    • Simple but low efficiency
  • Fully associative cache
    • Has high efficiency, complicated hardware design
  • N-way set associative cache
    • Frequently used in practice to balance trade-off between performance and complexity
    • The cache is divided into groups of blocks, called sets.
    • Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set.

79

80 of 120

Resources

  • Some of the slides are adapted from Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

80

81 of 120

Compiler Level Optimization

81

81

82 of 120

Compiler Level Optimizations

  • Many of these optimizations can be done manually by the programmer, and many of them (e.g., the loop transformations we will see) can be done by the compiler (the compiler can be smart).
  • If the compiler can optimize the program, the compiler should do the job!
  • Sometimes it is still necessary to understand these optimizations because of the compiler's limitations

82

83 of 120

Code Motion

  • Reduce the frequency with which a computation is performed 
    • If it will always produce the same result
    • Especially moving code out of loops

83

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

void set_row(double *a, double *b,
   long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
       a[n*i+j] = b[j];
}

void set_row(double *a, double *b,
   long i, long n)
{
    long j;
    long ni = n*i;
    for (j = 0; j < n; j++)
       a[ni+j] = b[j];
}

84 of 120

Compiler Generated Code Motion

84

set_row:
  testq %rcx, %rcx             # Test n
  jle .L1                      # If 0, goto done
  imulq %rcx, %rdx             # ni = n*i
  leaq (%rdi,%rdx,8), %rdx     # rowp = A + ni*8
  movl $0, %eax                # j = 0
.L3:                           # loop:
  movsd (%rsi,%rax,8), %xmm0   # t = b[j]
  movsd %xmm0, (%rdx,%rax,8)   # M[A+ni*8 + j*8] = t
  addq $1, %rax                # j++
  cmpq %rcx, %rax              # j:n
  jne .L3                      # if !=, goto loop
.L1:                           # done:
  rep ; ret

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

long j;
for (j = 0; j < n; j++)
   a[n*i+j] = b[j];

long j;
long ni = n*i;
for (j = 0; j < n; j++)
    a[ni+j] = b[j];

85 of 120

Elimination (share) of Common Subexpressions

85

3 multiplications: i*n, (i–1)*n, (i+1)*n

1 multiplication: i*n

leaq   1(%rsi), %rax  # i+1

leaq   -1(%rsi), %r8  # i-1

imulq  %rcx, %rsi     # i*n

imulq  %rcx, %rax     # (i+1)*n

imulq  %rcx, %r8      # (i-1)*n

addq   %rdx, %rsi     # i*n+j

addq   %rdx, %rax     # (i+1)*n+j

addq   %rdx, %r8      # (i-1)*n+j

imulq %rcx, %rsi  # i*n

addq %rdx, %rsi  # i*n+j

movq %rsi, %rax  # i*n+j

subq %rcx, %rax  # i*n+j-n

leaq (%rsi,%rcx), %rcx # i*n+j+n

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum neighbors of i,j */
up =    val[(i-1)*n + j  ];
down =  val[(i+1)*n + j  ];
left =  val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

long inj = i*n + j;
up =    val[inj - n];
down =  val[inj + n];
left =  val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

86 of 120

Reduction in Strength

  • Replace a costly operation with a simpler one 
    • Shift and add instead of multiply or divide 
    • 16*x --> x << 4 
    • Replace * with + (as in the example below)

86

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n;
}

87 of 120

Memory Aliasing

87

# sum_rows1 inner loop

.L4:

        movsd   (%rsi,%rax,8), %xmm0 # FP load

        addsd   (%rdi), %xmm0 # FP add

        movsd   %xmm0, (%rsi,%rax,8) # FP store

        addq    $8, %rdi

        cmpq    %rcx, %rdi

        jne     .L4

    • Code updates b[i] on every iteration​
    • Extra store

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum rows of n x n matrix a
   and store in vector b  */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

88 of 120

Memory Aliasing

88

    • No extra store for intermediate result

# sum_rows2 inner loop

.L10:

        addsd   (%rdi), %xmm0 # FP load + add

        addq    $8, %rdi

        cmpq    %rax, %rdi

        jne     .L10

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum rows of n x n matrix a
   and store in vector b  */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

89 of 120

Loop Interchange

  • Improved locality

89

Original loop:

for(j=0; j<M; j++)
    for(i=0; i<N; i++)
        c[i][j] = a[i][j] + b[i][j];

Interchanged loop:

for(i=0; i<N; i++)
    for(j=0; j<M; j++)
        c[i][j] = a[i][j] + b[i][j];

90 of 120

Loop Fusion

90

Original loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Fused loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

91 of 120

Loop Fusion

  • Reducing loop overhead
  • Increasing the amount of work done in a loop
  • Improving locality by combining loops that reference the same array

91

Original loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Fused loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

92 of 120

Loop Distribution

  • It is opposite of loop fusion

92

Distributed loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Original loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

93 of 120

Loop Distribution

  • It is opposite of loop fusion

93

Distributed loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Original loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

  • It can improve locality in multi-core processing
  • It can improve hardware prefetch by separating different data streams

94 of 120

Loop Invariant Computations

  • Calculations that do not change between loop iterations
  • Can be moved outside a loop

94

Original loop:

for(i=0; i<N; i++)

a[i] = i* (SOME_NUMBER/100)

Transformed loop:

for(i=0; i<N; i++)

a[i] = i* funct();

95 of 120

Loop Unrolling

  • Increases ILP
  • To reduce loop overhead instructions
  • Increases code size

95

void foo(int A[8], int B[8]) {
    for (int i = 0; i < 8; i++) {
        B[i] = A[i]*3;
    }
}

void foo(int A[8], int B[8]) {
    for (int i = 0; i < 8; i += 2) {
        B[i]   = A[i]*3;
        B[i+1] = A[i+1]*3;
    }
}

96 of 120

Dead Code Elimination

96

int global;

void f ()

{

int i;

i = 1; /* dead store */

global = 1; /* dead store */

global = 2;

return;

global = 3; /* unreachable */

}

int global;

void f ()

{

global = 2;

return;

}

97 of 120

Constant Folding

  • Evaluate constant expressions at compile time
  • Only possible when side-effect freeness guaranteed

97

int a = 4+5;   🡺   int a = 9;

!true          🡺   false

  • Caveat: floating-point implementations can differ between machines
    • -ffast-math (really -funsafe-math-optimizations) lets the compiler ignore this

98 of 120

Constant Propagation

  • Variables that have constant value, e.g. c := 3
    • Later uses of c can be replaced by the constant
    • If no change of c between!
  • Analysis needed, as b can be assigned more than once!

98

int b = 3;
int c = 1 + b;
int d = b + c;

int b = 3;
int c = 1 + 3;
int d = 3 + c;

99 of 120

Compiler Optimization Passes

  • Loop Optimizations
    • Loop fission
    • Loop fusion
    • Loop interchange
    • Loop invariant code motion
    • Loop unrolling
  • Data Flow Optimizations
    • Common subexpression elimination
    • Constant folding and propagation
    • Dead code elimination
  • Code Generation Optimization
    • Register Allocation
    • Instruction Scheduling
    • Instruction Selection
    • Reordering computations
    • ,…

99

100 of 120

LLVM: Optimizing Compiler

  • LLVM is a compiler infrastructure designed as a set of reusable libraries with well defined interfaces
  • LLVM IR is target independent

100

[Figure: LLVM IR flows through a sequence of optimization passes, each taking LLVM IR in and producing LLVM IR out.]

101 of 120

Compiler Optimization Passes

  • Loop Optimizations
    • Loop fission
    • Loop fusion
    • Loop interchange
    • Loop invariant code motion
    • Loop unrolling
  • Data Flow Optimizations
    • Common subexpression elimination
    • Constant folding
    • Dead store elimination
  • Code Generation Optimization
    • Register Allocation
    • Instruction Scheduling
    • Instruction Selection
    • Reordering computations
    • ,…

101

102 of 120

Optimizing Compilers

  • Provide efficient mapping of program to machine 
    • register allocation 
    • code selection and ordering (scheduling) 
    • dead code elimination
    • ...
    • eliminating minor inefficiencies 
  • Do not optimize (usually) the "algorithm" 
    • up to programmer to select best overall algorithm for a given task
      • Sorting: O(n²) vs. O(n log n)
    • big-O savings are (often) more important
  • Have difficulty overcoming inherent limitations (cache, inherent parallelism)

102

103 of 120

Limitations of Optimizing Compilers

  • Operate under a fundamental constraint 
    • Must not cause any change in program behavior 
    • Except, possibly, when the program makes use of nonstandard language features 
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles 
    • e.g., data ranges may be more limited than variable types suggest 
  • Most analysis is performed only within procedures 
    • Whole-program analysis is too expensive in most cases 
    • Newer versions of GCC do interprocedural analysis within individual files 
    • But not between code in different files 
  • Most analysis is based only on static information 
    • The compiler has difficulty anticipating run-time inputs

103

When in doubt, the compiler must be conservative​

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

104 of 120

GCC O1, O2, O3 levels

104

-O1

-O2

-O3

105 of 120

Inline Assembly

  • Ability to embed low level assembly code in high-level C/C++ code

  • __asm [volatile] (code); /* Basic inline assembly syntax */
    • "ADD R0, R1, R2
  • Why use inline assembly ?

105

106 of 120

Inline Assembly

  • Ability to embed low level assembly code in high-level C/C++ code

  • __asm [volatile] (code); /* Basic inline assembly syntax */
    • "ADD R0, R1, R2
  • Why use inline assembly ?
    • Optimization for code that cannot be naturally expressed in a high-level language
      • Reversing bits
    • Access to processor-specific instructions
      • NEON
    • System calls
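As an illustration of the bit-reversal case, a sketch using GCC/Clang extended inline assembly and the ARM RBIT instruction (this assumes a 32-bit ARMv7-A target such as the Cortex-A9; the operand constraints let the compiler pick the registers):

```c
#include <stdint.h>

/* reverse the bit order of a 32-bit word using the ARMv7 RBIT instruction */
static inline uint32_t reverse_bits32(uint32_t x) {
    uint32_t r;
    __asm__ ("rbit %0, %1" : "=r"(r) : "r"(x));
    return r;
}
```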

106

107 of 120

SIMD

107

107

108 of 120

Flynn Taxonomy

108

Instruction Stream x Data Stream:

  SISD: single instruction, single data (Single-Core Processors)

  SIMD: single instruction, multiple data (GPUs, Intel SSE/AVX extensions, …)

  MISD: multiple instruction, single data (Dataflow architectures, Systolic Arrays [debatably], …)

  MIMD: multiple instruction, multiple data (VLIW, Parallel Computers)

109 of 120

SIMD Basics

  • SIMD: Single instruction multiple data
  • A SIMD register (or a vector register) can hold many values (2 - 16 values or more) of a single type
  • Each value in a SIMD register is called a SIMD lane
  • SIMD instructions can operate on several (typically all) values on a SIMD register

109

110 of 120

SIMD

  • SIMD Operations: Single Precision

110

[Figure: 256-bit %ymm0 and %ymm1 registers. vaddsd %ymm0, %ymm1, %ymm1 adds only a single lane, while vaddpd %ymm0, %ymm1, %ymm1 adds every lane of the two registers in parallel.]

  • SIMD Operations: Double Precision

Some example AVX-512F (a subset of AVX-512) instructions

Syntax

zmm0 ...zmm31 are 512 bit registers; each can hold 16 single-precision (float of C; 32 bits) or 8 double-precision (double of C; 64 bits) floating point numbers

111 of 120

SIMD

  • SIMD is good at parallelizing computations doing almost exactly the same series of instructions on contiguous data
  • ⇒ generally, main targets are simple loops whose index values can be easily identified

111

Original loop:

for(i=0; i<N; i++){

    S(i)

}

Vectorized loop (L lanes per iteration):

for(i=0; i<N; i+=L){

    S(i : i+L)

}
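Concretely, for L = 4 lanes, the strip-mined form of a simple element-wise loop looks like this (plain C showing what the vectorizer does conceptually; the second loop handles the remainder when N is not a multiple of 4):

```c
void scale(float *a, const float *b, int N) {
    int i;
    for (i = 0; i + 4 <= N; i += 4) {   /* one "SIMD iteration": 4 lanes at once */
        a[i]   = 2.0f * b[i];
        a[i+1] = 2.0f * b[i+1];
        a[i+2] = 2.0f * b[i+2];
        a[i+3] = 2.0f * b[i+3];
    }
    for (; i < N; i++)                  /* scalar remainder */
        a[i] = 2.0f * b[i];
}
```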

112 of 120

Programming SIMD

  • Auto vectorization
    • loop vectorization
  • Language extensions/directives for SIMD
    • SIMD directives for loops/functions (OpenMP 4.0)
    • Domain Specific Languages (Halide)
  • Vector types
    • GCC vector extensions
  • Intrinsics
  • Assembly programming

112

Which vector registers is this using??

113 of 120

Auto Vectorization

  • Write a loop-based code and hope compiler does (figure out) the vectorization

113

int len_vec = 128;

void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {

int i;

for (i=0; i<len_vec; i++) {

vec_C[i] = vec_A[i] + vec_B[i];

}

}

armclang --target=aarch64-arm-none-eabi -g -c -O1 vec_add.c

  • Automatically performs 4 float adds at a time using the Neon extension (a Neon SIMD register is 128 bits)

gcc -o simd_auto -march=native -O3 vec_add.c

114 of 120

OpenMP SIMD

  • OpenMP simd pragma
    • allows an explicit vectorization of for loops

114

int len_vec=128;

void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {

int i;

#pragma omp simd

for (i=0; i<len_vec; i++) {

vec_C[i] = vec_A[i] + vec_B[i];

}

}

115 of 120

GCC Vector Types

  • Gcc allows to define a vector types

115

typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));

floatv x, y, z;

z += x * y;

float a, b;

floatv x, y;

y = a * x + b;

  • You can define operations on vector types
  • You can mix scalars with vector types

116 of 120

GCC Vector Types

116

typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));

#define V(lv) *((floatv*)&(lv))

/* scalar loop */
for (long i = 0; i < n; i++) {
    y[i] = a * x[i];
}

/* the same loop using the vector type, L floats at a time */
for (long i = 0; i < n; i += L) {
    V(y[i]) = a * V(x[i]);
}

117 of 120

Intrinsics

  • Processor/platform-specific functions and types
  • For example, ARM Neon instrinsics

117

#include <arm_neon.h>

/* what the Neon intrinsic vadd_s16 does: a lane-wise add of two 4-lane int16 vectors */
int16x4_t vadd_s16(int16x4_t a, int16x4_t b) {
    int16x4_t c;
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
    return c;
}
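For reference, a small hypothetical use of these intrinsics: vld1_s16 loads 4 lanes, vadd_s16 adds them lane-wise, and vst1_s16 stores the result, with a scalar loop for the tail:

```c
#include <arm_neon.h>
#include <stdint.h>

void add_arrays_s16(const int16_t *a, const int16_t *b, int16_t *c, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        int16x4_t va = vld1_s16(a + i);        /* load 4 lanes from a */
        int16x4_t vb = vld1_s16(b + i);        /* load 4 lanes from b */
        vst1_s16(c + i, vadd_s16(va, vb));     /* lane-wise add, store 4 lanes */
    }
    for (; i < n; i++)                         /* scalar tail */
        c[i] = a[i] + b[i];
}
```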

118 of 120

Why (auto) vectorization fails?

  • Potential aliasing makes auto vectorization difficult/impossible
  • Complex control flows make vectorization impossible or less profitable
  • Non-contiguous data accesses make vectorization impossible or less profitable
  • Giving hints to the compiler sometimes (not always) addresses the problem

118

119 of 120

How do I know if my code is vectorized?

  • Most compilers provide tools to get a report

119

  • Add a trick to the generated assembly (gcc -S)
    • enclose loops with inline assembler comments

Compiler   Report options
gcc        -fopt-info-vec-{optimized,missed}
clang      -Rpass=vectorize

asm volatile ("# xxxxxx loop begins");

for (i = 0; i < n; i++) {
    ... /* loop to be vectorized */
}

asm volatile ("# xxxxxx loop ends");

120 of 120

How do I know if my code is vectorized?

120

clang -O3 -Rpass=loop-vectorize -S add_vec.c -o /dev/null

add_vec.c:4:5: remark: 

vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)

void difficult_function_to_vectorize(int *A, int *B, int Length) {

for (int i = 0; i < Length; i++)

A[B[i]]++;

}

clang -O3 -Rpass=loop-vectorize -S difficult_to_vectorize.c -o /dev/null

difficult_to_vectorize.c:3:5: remark:

loop not vectorized: cannot identify array bounds

for (int i = 0; i < Length; i++)

^