1 of 120

WES237B: Software for Embedded Systems

Pat Pannuto

Department of Computer Science and Engineering

University of California, San Diego

Summer Session 2023

2 of 120

Today

  • Morning
    • Performance & Optimization
    • Optimization at different levels (in the context of a single core)
      • Algorithmic (Cache Optimization), Compiler and Hardware
      • Memory Hierarchy

  • Afternoon
    • Lab session
      • Applying optimizations: Loop unrolling, hardware vectorization (NEON)
      • FIR Filters

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

3 of 120

Logistics

  • Assignment 2 due tomorrow
  • Assignment 3 released
    • Generally: First bit of assignment is lab portion, the rest is take-home
    • Finish the lab first if needed

  • Late policy?
    • 1 week: 10% off
    • 2 weeks: Last chance, 30% off
    • Email with any exceptional circumstances, we’ll work it out

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

4 of 120

Recap

  • Measuring Performance
    • Speedup / performance always relative and workload specific
    • Amdahl’s Law: Optimization benefit limited to proportion it improves
  • Optimizations
    • Hard & Getting Harder due to heterogeneity
    • Must learn systems-level design and analysis
  • Microarchitecture
    • ILP, Superscalar, SIMD, and Pipelining
    • (perf: throughput)
  • Memory Hierarchy
    • Memory is slow, exploit spatial and temporal locality for performance

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

5 of 120

A Modern Memory Hierarchy

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

Register File, 32 words, sub-nsec

L1 Cache, ~32 KB, ~nsecs

L2 cache, 512KB – 1 MB, ~nsecs

L3 Cache, ~ 10 nsecs

Main memory (DRAM), GB, ~100 ns

Disk, 100 GB, ~10 msec

6 of 120

The guts of most interesting stuff is often ~this

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

void Foo(float* const A, float* const B, float* const C, const int M, const int N, const int K) {

    for (int m = 0; m < M; m++) {

        for (int n = 0; n < N; n++) {

            for (int k = 0; k < K; k++) {

                C[m * N + n] += A[m * K + k] * B[k * N + n];

            }

        }

    }

}

And these are huge

Lab2: image.tif

480x640 pixels * 4 bytes/float ~= 1.2 MB (greyscale!)
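One of the classic cache-level rewrites of a loop nest like this is blocking (tiling), so that a small tile of A, B, and C stays resident in cache and gets reused before it is evicted. A minimal sketch, not the lab's required solution; BLOCK is a tuning parameter you would pick for the target's cache sizes:

```c
#define BLOCK 32  /* tile size; tune so three BLOCK x BLOCK tiles fit comfortably in cache */

void FooTiled(float* const A, float* const B, float* const C,
              const int M, const int N, const int K) {
    for (int m0 = 0; m0 < M; m0 += BLOCK) {
        for (int n0 = 0; n0 < N; n0 += BLOCK) {
            for (int k0 = 0; k0 < K; k0 += BLOCK) {
                /* multiply one BLOCK x BLOCK tile; the "&&" conditions guard the edges */
                for (int m = m0; m < m0 + BLOCK && m < M; m++) {
                    for (int k = k0; k < k0 + BLOCK && k < K; k++) {
                        const float a = A[m * K + k];        /* reused across the inner loop */
                        for (int n = n0; n < n0 + BLOCK && n < N; n++) {
                            C[m * N + n] += a * B[k * N + n];
                        }
                    }
                }
            }
        }
    }
}
```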

7 of 120

EXAMPLES OF SIMPLE CACHES

7

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

7

8 of 120

A simple cache

  • A cache that can put a line of data anywhere is called __________________________
  • The most popular replacement strategy is LRU ( ).

[Figure: a 4-entry cache; each entry is a (tag, data) pair. The tag identifies the address of the cached data. 4 entries, each block holds one word, any block can hold any word.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4
(each address is also shown on the slide in 8-bit binary, e.g. 4 = 00000100, 8 = 00001000, 20 = 00010100)

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

8

9 of 120

A simple cache

  • A cache that can put a line of data anywhere is called Fully Associative
  • The most popular replacement strategy is LRU ( Least Recently Used ).

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

9

10 of 120

A simpler cache

  • A cache that can put a line of data in exactly one place is called __________________.
  • Advantages/disadvantages vs. fully-associative?

[Figure: a 4-entry cache; an index is used to determine which line an address might be found in. 4 entries, each block holds one word, each word in memory maps to exactly one cache location. Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

10

11 of 120

A simpler cache

  • A cache that can put a line of data in exactly one place is called direct mapped
  • Advantages/disadvantages vs. fully-associative?

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

11

12 of 120

A set-associative cache

  • A cache that can put a line of data in exactly n places is called n-way ______________________.
  • The cache lines/blocks that share the same index are a cache ____________.

[Figure: a 4-entry cache organized as sets of two (tag, data) ways. 4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines. Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

12

13 of 120

A set-associative cache

  • A cache that can put a line of data in exactly n places is called n-way set-associative.
  • The cache lines/blocks that share the same index are a cache set.

[Figure and address string: same as the previous slide.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

13

14 of 120

Longer Cache Blocks

  • Large cache blocks take advantage of spatial locality.
  • Too large of a block size can waste cache space.
  • Longer cache blocks require less tag space

[Figure: a 4-entry cache with (tag, data) entries. 4 entries, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches). Example address shown: 00000100.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

14

15 of 120

Longer Cache Blocks

  • Large cache blocks take advantage of spatial locality.
  • Too large of a block size can waste cache space.
  • Longer cache blocks require less tag space

[Figure: same as the previous slide, with the data column now 64 bits wide.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

15

16 of 120

Q: Describing Cache Type Tradeoffs?

  1. Exceptional usage of the cache space in exchange for a slow hit time
  2. Poor usage of the cache space in exchange for an excellent hit time
  3. Reasonable usage of cache space in exchange for a reasonable hit time

Selection   Fully-Associative   4-way Set Associative   Direct Mapped
A           3                   2                       1
B           3                   3                       2
C           1                   2                       3
D           3                   2                       1
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

16

17 of 120

Back to Block Size

  • If block size increases spatial locality, should we just make the cache block size really, really big????

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

17

18 of 120

Block Size and Miss Rate

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

18

19 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

tag

data

tag

data

Bytes per block

Blocks per set

Sets per Cache

Warning / Notice—Things that count towards “cache size”: cache data

Things that do not count towards “cache size”: tags, valid bits, etc…

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

19

20 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

tag

data

tag

data

Bytes per block

Blocks per set

Sets per Cache

Warning / Notice—Things that count towards “cache size”: cache data

Things that do not count towards “cache size”: tags, valid bits, etc…

Is this very confusing?

Yes, until you implement one.

It is what the world has settled on, so you need to know this and become comfortable with it.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

20

21 of 120

Cache Parameters

Cache size = Number of sets * block size * associativity

  • 128 blocks, 32-byte block size, direct mapped, size = ?

  • 128 KB cache, 64-byte blocks, 512 sets, associativity = ?

(always keep in mind “cache size” only counts the data storage)
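The two questions above are just this formula rearranged; a throwaway check in C, using the numbers from the slide:

```c
#include <stdio.h>

int main(void) {
    /* 128 blocks, 32-byte block size, direct mapped */
    int size = 128 * 32 * 1;                 /* sets * block size * associativity */
    printf("cache size = %d bytes = %d KB\n", size, size / 1024);

    /* 128 KB cache, 64-byte blocks, 512 sets */
    int assoc = (128 * 1024) / (512 * 64);   /* size / (sets * block size) */
    printf("associativity = %d-way\n", assoc);
    return 0;
}
```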

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

21

22 of 120

Q: How many bits for each field?

  • Generally, we have variables block_size, cache_size, and memory_size
    • Let’s work it out for
      • BS = 8 bytes
      • CS = 1 KB
      • MS = 4 MB
    • And we have a 4-way set-associative cache?
  • What is the…
    • Number of bits for the block offset: ____
    • Number of bits for the index: ____
    • Number of bits for the tag: ____

Address fields (carved out of the N-bit address): tag | index | block offset
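A generic way to compute the three field widths from the parameters above (a sketch; it assumes every size is a power of two and that memory_size fixes the address width):

```c
#include <stdio.h>

/* integer log2; assumes x is a power of two */
static int log2i(unsigned long x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned long BS = 8, CS = 1024, MS = 4UL * 1024 * 1024;  /* this slide's numbers */
    int ways = 4;

    int offset_bits = log2i(BS);
    int sets        = (int)(CS / (BS * ways));
    int index_bits  = log2i(sets);
    int tag_bits    = log2i(MS) - index_bits - offset_bits;

    printf("offset = %d, index = %d, tag = %d bits\n", offset_bits, index_bits, tag_bits);
    return 0;
}
```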

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

22

23 of 120

Q: How many bits for each field?

  • Generally, we have variables block_size, cache_size, and memory_size
    • Let’s work it out for
      • BS = 32 bytes
      • CS = 16 KB
      • MS = 8 MB
    • And we have a 8-way set-associative cache?
  • What is the…
    • Number of bits for the block offset: ____
    • Number of bits for the index: ____
    • Number of bits for the tag: ____

Address fields (carved out of the N-bit address): tag | index | block offset

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

23

24 of 120

Handling a Cache Access

  • 1. Use index and tag to access cache and determine hit/miss.
  • 2. If hit, return requested data.
  • 3. If miss, select a cache block to be replaced, and access memory or next lower cache (possibly stalling the processor).
    • load entire missed cache line into cache
    • return requested data to CPU (or higher cache)
  • 4. If next lower memory is a cache, goto step 1 for that cache.
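A sketch of those steps for a single direct-mapped level (a simplified software model, not any particular hardware; a miss simply charges the fill against a flat next_level_mem array):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SETS   2048
#define BLOCK_SIZE 32

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct line cache[NUM_SETS];

/* returns true on hit; on a miss, loads the whole block from the next level */
bool access_cache(uint32_t addr, uint8_t *out_byte, const uint8_t *next_level_mem) {
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;       /* step 1: index selects the line */
    uint32_t tag    = addr / BLOCK_SIZE / NUM_SETS;
    struct line *l  = &cache[index];

    bool hit = l->valid && l->tag == tag;                   /* step 1: tag check */
    if (!hit) {                                             /* step 3: fill the selected line */
        memcpy(l->data, next_level_mem + (addr - offset), BLOCK_SIZE);
        l->valid = true;
        l->tag   = tag;
    }
    *out_byte = l->data[offset];                            /* step 2/3: return requested data */
    return hit;
}
```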

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

24

25 of 120

Accessing a Sample Cache

  • 64 KB cache, direct-mapped, 32-byte cache block size

Address bits: 31…16 = tag (16 bits) | 15…5 = index (11 bits) | 4…0 = block offset (5 bits)

[Figure: the 11 index bits select one of the 2 K cache blocks/sets (64 KB / 32 bytes, entries 0…2047); each entry holds a valid bit, a 16-bit tag, and 32 bytes (256 bits) of data. The stored tag is compared (=) against the address tag to produce hit/miss, and the block offset selects the word within the block.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

25

26 of 120

Accessing a Sample Cache

  • 32 KB cache, 2-way set-associative, 16-byte block size

Address bits: 31…14 = tag (18 bits) | 13…4 = index (10 bits) | 3…0 = block offset (4 bits)

[Figure: the 10 index bits select one of the 1 K cache sets (32 KB / 16 bytes / 2 ways, sets 0…1023); each set holds two ways, each with a valid bit, an 18-bit tag, and a data block. Both stored tags are compared (=) against the address tag to produce hit/miss, and the block offset selects the word.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

26

27 of 120

Cache Alignment

  • The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
  • This results in
    • no overlap of cache lines
    • easy mapping of addresses to cache lines (no additions)
    • data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.
  • Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
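Because lines are aligned, the base address of the cache line containing any byte is just the address with the block-offset bits cleared (a one-liner, assuming the block size is a power of two):

```c
#define BLOCK_SIZE 32u  /* power of two */

/* byte X lives at offset (X % BLOCK_SIZE) within the block that starts here */
static inline unsigned long block_base(unsigned long addr) {
    return addr & ~(unsigned long)(BLOCK_SIZE - 1);
}
```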

[Figure: a memory address split into tag | index | block offset, and main memory drawn as consecutive cache-line-sized blocks 0, 1, 2, …, 10, …]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

27

28 of 120

How does the cache actually connect to the execution core?

  • This is a “standard, five stage pipeline”

  • For machines with no caches, the “DM” looks more like this:

[Figure: the five-stage pipeline IM → Reg → ALU → DM → Reg. In the no-cache case, the ALU feeds a Memory Interface Module in place of DM, with Address, Data, Read?, and Write? signals going out and Data coming back.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

28

29 of 120

How does the cache actually connect to the execution core?

  • This is a “standard, five stage pipeline”

  • For machines with no caches, the “DM” looks more like this:

[Figure: the same pipeline and Memory Interface Module, now with a Stall signal back to the pipeline and a Memory Bus connecting to a DRAM Controller, DRAM, and Other Peripherals.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

29

30 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: zoomed view of the ALU/DM stage: the Memory Interface Module (Address, Data, Read?, Write?, Stall) connects over the Memory Bus to the DRAM Controller, DRAM, and Other Peripherals.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

30

31 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: same as the previous slide, with Cache(s) inserted between the Memory Interface Module and the Memory Bus.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

31

32 of 120

How does the cache actually connect to the execution core? [Simplified view! Very different between embedded & high-perf!]

[Figure: same as the previous slide, with the L1 Cache drawn inside the Memory Interface Module: a real interface (from a simple design).]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

32

33 of 120

Cache Alignment Revisited

  • The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
  • Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
  • “Block-aligned” access is similar to the “word-aligned” access you have been thinking about already
    • Core <> Cache interface: Word aligned
    • Cache <> Mem interface: Block aligned

[Figure: a memory address split into tag | index | block offset, and main memory drawn as consecutive cache-line-sized blocks 0, 1, 2, …, 10, …]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

33

34 of 120

When executing code, what does all this have to do with performance and alignment?

[Figure: the five-stage pipeline IM → Reg → ALU → DM → Reg.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

34

35 of 120

Which of the following things are possible?

  1. Is it possible to make a “set-associative” cache that has N=1 sets?
  2. Is it possible for a set-associative cache to have N=3 sets?
  3. Is it possible for a cache to have a 12-byte block size?

Selection   Fully-Associative   4-way Set Associative   Direct Mapped
A           3                   2                       1
B           3                   3                       2
C           1                   2                       3
D           3                   2                       1
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

35

36 of 120

Associative Caches

  • Higher hit rates, but...

  • longer access time
    • (longer to determine hit/miss, more muxing of outputs)
  • more space (longer tags)
    • 16 KB, 16-byte blocks, DM, tag = ?
    • 16 KB, 16-byte blocks, 4-way, tag = ?
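A quick way to check those two blanks, assuming 32-bit addresses (the slide does not fix the address width, so that part is an assumption):

```c
#include <stdio.h>

static int log2i(unsigned long x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    unsigned long cache = 16 * 1024, block = 16;
    int addr_bits = 32;                                       /* assumption */
    int dm_tag  = addr_bits - log2i(cache / block)     - log2i(block);
    int sa4_tag = addr_bits - log2i(cache / block / 4) - log2i(block);
    printf("DM tag = %d bits, 4-way tag = %d bits\n", dm_tag, sa4_tag);  /* 18 vs 20 */
    return 0;
}
```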

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

36

37 of 120

for (int i = 0; i < 10000000; i++)

    sum += A[i];

Assume each element of A is 4 bytes and sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks. Assume this loop is visited many times.

Which changes would help the hit rate of the above code?

Selection   Change
A           Increase to 2-way set associativity
B           Increase block size to 64 bytes
C           Increase cache size to 64 KB
D           A and C combined
E           A, B, and C combined

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

37

38 of 120

for (int i = 0; i < 10000000; i++)

    for (int j = 0; j < 8192; j++)

        sum += A[j] - B[j];

Assume each element of A and B are 4 bytes.

Assume each array is at least 32KB in size.

Assume sum is kept in a register.

Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks.

Which changes would help the hit rate of the above code?

Selection   Change
A           Increase to 2-way set associativity
B           Increase block size to 64 bytes
C           Increase cache size to 64 KB
D           A and C combined
E           A, B, and C combined

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

38

39 of 120

Dealing with Stores

  • Stores must be handled differently than loads, because...
    • they don’t necessarily require the CPU to stall.
    • they change the content of cache/memory (creating memory consistency issues)
  • Q: Can you think of a situation when you might need to load from memory before you can execute a store?
    • Can you think of another one?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

39

40 of 120

Policy decisions for stores

  • Keep memory and cache identical?
    • write-through => all writes go to both cache and main memory
    • write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.
  • Make room in cache for store miss?
    • write-allocate => on a store miss, bring written line into the cache
    • write-around => on a store miss, ignore cache

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

40

41 of 120

Dealing with stores

  • On a store hit, write the new data to cache.
    • In a write-through cache, write the data immediately to memory.
    • In a write-back cache, mark the line as dirty.
  • On a store miss, initiate a cache block load from memory for a write-allocate cache.
    • Write directly to memory for a write-around cache.
  • On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.
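A sketch of that store path for a write-back, write-allocate cache (simplified to a single line; the two helpers are stand-ins for traffic to the next level of the hierarchy):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_BLOCK 32

struct wb_line { bool valid, dirty; uint32_t tag; uint8_t data[WB_BLOCK]; };

/* stand-ins for the next level of the hierarchy */
static void writeback_to_memory(struct wb_line *l) { (void)l; }
static void fill_from_memory(struct wb_line *l, uint32_t tag) { l->valid = true; l->dirty = false; l->tag = tag; }

void store_byte(struct wb_line *l, uint32_t tag, uint32_t offset, uint8_t value) {
    if (!(l->valid && l->tag == tag)) {   /* store miss */
        if (l->valid && l->dirty)
            writeback_to_memory(l);       /* dirty victim: write it back first */
        fill_from_memory(l, tag);         /* write-allocate: bring the missed line in */
    }
    l->data[offset] = value;              /* store hit path: write the cache only... */
    l->dirty = true;                      /* ...and mark the line dirty (write-back) */
}
```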

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

41

42 of 120

Cache Performance

CPI = BCPI + MCPI

    • BCPI = base CPI, which means the CPI assuming perfect memory
    • MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

MCPI = accesses/instruction * miss rate * miss penalty

    • this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
    • If the miss penalty or miss rate is different for Inst cache and data cache (common case), then

MCPI = I$ accesses/inst x I$MissRate x I$MissPenalty

+ D$ accesses/inst x D$MissRate x D$MissPenalty
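Plugging hypothetical numbers (not from the slide) into the split formula:

```c
#include <stdio.h>

int main(void) {
    /* hypothetical workload */
    double BCPI = 1.0;
    double i_acc = 1.0, i_miss = 0.02, i_pen = 12.0;   /* I$ accesses/inst, miss rate, penalty */
    double d_acc = 0.3, d_miss = 0.05, d_pen = 12.0;   /* D$ accesses/inst, miss rate, penalty */

    double MCPI = i_acc * i_miss * i_pen + d_acc * d_miss * d_pen;
    printf("CPI = %.2f\n", BCPI + MCPI);               /* 1.0 + 0.24 + 0.18 = 1.42 */
    return 0;
}
```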

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

42

43 of 120

In fact…

  • Can generalize this “formula” further for other stalls
    • (This is just putting a label on the ”penalty” idea we introduced earlier)

  • CPI = BCPI + DHSPI + BHSPI + MCPI
    • DHSPI = data hazard stalls per instruction
    • BHSPI = branch hazard stalls per instruction.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

43

44 of 120

Cache Performance

  • Instruction cache miss rate of 4%
  • Data cache miss rate of 10%
  • BaseCPI = 1.0 (no data or control hazards)
  • 20% of instructions are loads and stores
  • Miss penalty = 12 cycles

CPI = ???

Selection   CPI (rounded if necessary)
A           1.24
B           1.34
C           1.48
D           1.72
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

44

45 of 120

Cache Performance

  • Unified cache
  • 25% of instructions are loads and stores
  • BaseCPI = 1.2, miss penalty of 10 cycles

  • If we improve the miss rate from 10% to 4% (e.g. with a larger cache), how much do we improve performance?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

45

46 of 120

Cache Performance

  • BaseCPI = 1
  • Miss rate of 8% overall, 20% loads, miss penalty 20 cycles, never stalls on stores.

  • What is the speedup from doubling the CPU clock rate?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

46

47 of 120

Example -- DEC Alpha 21164 Caches

  • ICache and DCache -- 8 KB, DM, 32-byte lines
  • L2 cache -- 96 KB, ?-way SA, 32-byte lines
  • L3 cache -- 1 MB, DM, 32-byte lines

[Figure: the 21164 CPU core with its on-chip Instruction Cache and Data Cache, a unified on-chip L2 Cache, and an Off-Chip L3 Cache.]

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

47

48 of 120

Three types of cache misses

  • Compulsory (or cold-start) misses
    • first access to the data.
  • Capacity misses
    • we missed only because the cache isn’t big enough.
  • Conflict misses
    • we missed because the data maps to the same line as other data that forced it out of the cache.

[Figure: the same 4-entry direct-mapped (tag, data) cache as before.]

Address string: 4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

48

49 of 120

Q: Categorizing Misses

  • Suppose you experience a cache miss on a block (let's call it block A).
  • You have accessed block A in the past.
  • There have been precisely 1027 different blocks accessed between your last access to block A and your current miss.
  • Your block size is 32-bytes and you have a 64KB cache. What kind of miss was this?

Selection   Cache Miss
A           Compulsory
B           Capacity
C           Conflict
D           Both Capacity and Conflict
E           None of the above

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

49

50 of 120

So, then, how do we decrease...

  • Compulsory misses?
  • Capacity misses?
  • Conflict misses?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

50

51 of 120

Cache Associativity

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

51

52 of 120

LRU replacement algorithms

  • only needed for associative caches
  • requires one bit for 2-way set-associative, 8 bits (per set, 2/line) for 4-way, 24 bits for 8-way…
  • can be emulated with log n bits (NMRU)
  • can be emulated with use bits for highly associative caches (like page tables)
  • However, for most caches (eg, associativity <= 8), LRU is calculated exactly.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

52

53 of 120

Caches in Current Processors

  • Not long ago, they were DM at lowest level (closest to CPU), associative further away. Today they are less associative near the processor (2-4+), and more associative farther away (4-16).
  • split I and D close (L1) to the processor (for throughput rather than miss rate), unified further away (L2 and beyond).
  • write-through and write-back both common, but never write-through all the way to memory.
  • 64-byte cache lines common (but getting larger)

  • Non-blocking
    • processor doesn’t stall on a miss, but only on the use of a miss (if even then)
    • this means the cache must be able to keep track of multiple outstanding accesses, even multiple outstanding misses.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

53

54 of 120

Intel Nehalem (i7)

[Figure: four cores, each with its own I$, D$, and private L2 cache, all sharing a single L3 cache.]

Instruction Cache: 32 KB, 4-way, 64-byte line

Data Cache: 32 KB, 8-way, 64-byte line, write-back, write-allocate

Unified L2 Cache: 256 KB, 8-way, 64-byte line, write-back, write-allocate

Shared, unified L3 Cache: 8 MB, 16-way, 64-byte line, write-back, write-allocate

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

54

55 of 120

What does PYNQ Provide?

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

55

56 of 120

Key Points

  • Caches give illusion of a large, cheap memory with the access time of a fast, expensive memory.
  • Caches take advantage of memory locality, specifically temporal locality and spatial locality.
  • Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time to maximize performance.

WES 237B

CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.

56

57 of 120

Cache Extras

Additional Examples / Alternative Drawings

57

58 of 120

Cache Concepts

58

[Figure: main memory drawn as blocks 0–15; the cache currently holds copies of blocks 8, 9, 14, and 3, and blocks (e.g., 4 and 10) move between memory and cache in block-sized units.]

Larger, slower, cheaper memory is viewed as partitioned into “blocks”. Data is copied in block-sized transfer units. The smaller, faster, more expensive memory caches a subset of the blocks.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

59 of 120

Cache Concepts

59

[Figure: the same memory (blocks 0–15) and cache (holding blocks 8, 9, 14, 3).]

Request: 14. Data in block b is needed; block b is in the cache: Hit!

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

60 of 120

60

[Figure: the same memory and cache.]

Request: 12. Data in block b is needed; block b is not in the cache: Miss! Block b (12) is fetched from memory and stored in the cache.

  • Placement policy:�determines where b goes
  • Replacement policy:�determines which block�gets evicted (victim)

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

61 of 120

Cache Concepts

  • Cold (compulsory) miss
    • Cold misses occur because the cache is empty.
  • Conflict miss
    • Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
      • E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
    • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
      • E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
  • Capacity miss
    • Occurs when the set of active cache blocks (working set) is larger than the cache.

61

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

62 of 120

Cache Concept Summary

  • The speed gap between CPU, memory and mass storage continues to widen.

  • Well-written programs exhibit a property called locality.

  • Memory hierarchies based on caching close the gap by exploiting locality.

62

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

63 of 120

Types of Cache Implementation

  • Direct mapped cache
    • Simple but lower spatial efficiency
    • Fast
  • Fully associative cache
    • Best spatial efficiency, complicated hardware design
    • Slow
  • N-way set associative cache
    • Frequently used in practice to balance trade-off between performance and complexity
    • The cache is divided into groups of blocks, called sets.
    • Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set.

63

64 of 120

Cache Abstraction & Metrics

  • Cache hit rate = #hits/#accesses
  • Average memory access time = hit rate * hit latency + miss rate * miss latency

64

[Figure: an Address goes to the Tag Store (“Is this address in the cache?” + bookkeeping), which produces Hit/miss, and to the Data Store (stores memory blocks), which produces the Data.]
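With hypothetical numbers (not from the slide), the metric works out like this:

```c
#include <stdio.h>

int main(void) {
    /* hypothetical: 95% hit rate, 1-cycle hits, 20-cycle misses */
    double hit_rate = 0.95, hit_latency = 1.0, miss_latency = 20.0;
    double amat = hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency;
    printf("AMAT = %.2f cycles\n", amat);   /* 0.95*1 + 0.05*20 = 1.95 */
    return 0;
}
```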

65 of 120

Blocks and Addressing the Cache

  • Memory is logically divided into fixed size blocks
  • Each block maps to a location into the cache
    • Determined by the index bits in the address

  • Cache access
    • Use index bits to index to cache line
    • Check the valid bit stored in the tag store
    • Check tag bits against the tag stored in the tag store
  • If valid bit is true, and tag matches 🡺 cache hit

65

Address fields: tag (2b) | index (3b) | offset (3b)

66 of 120

Direct Mapped Cache

  • Assume byte-addressable memory, 256 bytes of memory
  • 8-byte blocks (each block contains 8 bytes) 🡪 32 blocks
  • Assume a 64-byte cache 🡪 8 cache blocks

[Figure: the address split as tag (2b) | index (3b) | offset (3b). The tag store holds a valid bit and a tag per line; the data store holds the blocks. The index selects a line, the stored tag is compared (==/!=) with the address tag to produce Hit?, and the offset selects the Data.]
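For the parameters on this slide (8-bit addresses, 8-byte blocks, 8 cache lines), pulling the three fields out of an address is just shifts and masks; a small sketch with a hypothetical address:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t addr = 0x64;                   /* hypothetical 8-bit address */
    unsigned offset = addr & 0x7;          /* low 3 bits: byte within the 8-byte block */
    unsigned index  = (addr >> 3) & 0x7;   /* next 3 bits: which of the 8 cache lines */
    unsigned tag    = addr >> 6;           /* top 2 bits: tag stored with the line */
    printf("tag = %u, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}
```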

67 of 120

Direct Mapped Cache Example

  • Let us assume memory is 64 bytes (byte addressable).
  • How many address bits do we need? We need 6 address bits (m = 6).
  • Each block contains 4 bytes. What is the number of unique locations we can address?

67

68 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16

68

[Figure: the 64 memory bytes, addresses 0–63, drawn as 16 blocks of 4 bytes each.]

69 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?

69

[Figure: the same 64 bytes with the blocks numbered: block 0 holds bytes 0–3, block 1 holds bytes 4–7, …, block 15 holds bytes 60–63.]

70 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?

70

[Figure: the same 16 blocks of 4 bytes.]

Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits

71 of 120

Direct Mapped Cache Example

  • m = 6 & each block contains 4 bytes
  • What is the number of blocks ?
  • Number of blocks = 16
  • What is the b (block offset)?
  • What is the value of l ?

71

[Figure: the same 16 blocks of 4 bytes.]

Block Size = 4 = 2² 🡺 b = 2 block-offset bits. The m-bit address splits into Block Address (m-b bits) | Block Offset (b = 2 bits).

72 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

72

[Figure: the 16 memory blocks on one side and a 4-line cache on the other; each cache line (0–3) holds one 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. Address = Block Address (m-b bits) | Block Offset (b = 2 bits).

73 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

73

[Figure: the 16 memory blocks and the 4-line cache; each cache line now stores a tag alongside its 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. #lines = 4 = 2² 🡺 l = 2. Address = tag | line index (l = 2 bits) | block offset (b = 2 bits).

74 of 120

Direct Mapped Cache Example

  • Let us assume m = 6 & each block contains 4 bytes
  • What is the number of unique locations we can address ?
  • What is the number of blocks ?
  • What is the value of b ?
  • What is the value of l ?

74

[Figure: the same memory and cache; each of the 4 cache lines holds a tag plus one 4-byte block.]

Block Size = 4 = 2² 🡺 b = 2. #lines = 4 = 2² 🡺 l = 2. Address = tag | line index (l = 2 bits) | block offset (b = 2 bits).

75 of 120

Direct Mapped Cache Example

75

[Figure: each block address 0–15 written in 4-bit binary; the low 2 bits (the line index) pick one of the 4 cache lines, and the high 2 bits form the tag stored with that line.]

76 of 120

Direct Mapped Cache Example

76

[Figure: the same block-address-to-cache-line mapping as the previous slide.]

77 of 120

Direct Mapped Cache Example

77

[Figure: the same block-address-to-cache-line mapping as the previous slide.]

78 of 120

2-way set associative cache

  • MM = 64 bytes
  • CS = 32 bytes
  • BS = 4 bytes
  • Set size = 2 cache lines
  • #cache lines = CS/BS = 8
  • #sets = #cache lines/set size = 4

78

Address fields: tag | set index | Block Offset (b=2); the Block Address (m-b bits) splits into the tag and the set index.

  • 2-way set associative cache
    • Find the set
    • Compare tag with the tags stored

79 of 120

Types of Cache Implementation

  • Direct mapped cache
    • Simple but low efficiency
  • Fully associative cache
    • Has high efficiency, complicated hardware design
  • N-way set associative cache
    • Frequently used in practice to balance trade-off between performance and complexity
    • The cache is divided into groups of blocks, called sets.
    • Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set.

79

80 of 120

Resources

  • Some of the slides are adapted from Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition

80

81 of 120

Compiler Level Optimization

81

81

82 of 120

Compiler Level Optimizations

  • Many of these optimizations can be done manually by the programmer, and many of them (e.g., the loop transformations we will see) can be done by the compiler (the compiler can be smart).
  • If the compiler can optimize the program, the compiler should do the job!
  • Sometimes it is still necessary to understand these optimizations because of the compiler's limitations

82

83 of 120

Code Motion

  • Reduce the frequency with which a computation is performed 
    • If it will always produce the same result
    • Especially moving code out of loops

83

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

void set_row(double *a, double *b,
   long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
       a[n*i+j] = b[j];
}

void set_row(double *a, double *b,
   long i, long n)
{
    long j;
    long ni = n*i;
    for (j = 0; j < n; j++)
       a[ni+j] = b[j];
}

84 of 120

Compiler Generated Code Motion

84

set_row:
  testq %rcx, %rcx             # Test n
  jle .L1                      # If 0, goto done
  imulq %rcx, %rdx             # ni = n*i
  leaq (%rdi,%rdx,8), %rdx     # rowp = A + ni*8
  movl $0, %eax                # j = 0
.L3:                           # loop:
  movsd (%rsi,%rax,8), %xmm0   # t = b[j]
  movsd %xmm0, (%rdx,%rax,8)   # M[A+ni*8 + j*8] = t
  addq $1, %rax                # j++
  cmpq %rcx, %rax              # j:n
  jne .L3                      # if !=, goto loop
.L1:                           # done:
  rep ; ret

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

long j;
for (j = 0; j < n; j++)
   a[n*i+j] = b[j];

long j;
long ni = n*i;
for (j = 0; j < n; j++)
    a[ni+j] = b[j];

85 of 120

Elimination (share) of Common Subexpressions

85

3 multiplications: i*n, (i–1)*n, (i+1)*n

1 multiplication: i*n

leaq   1(%rsi), %rax  # i+1

leaq   -1(%rsi), %r8  # i-1

imulq  %rcx, %rsi     # i*n

imulq  %rcx, %rax     # (i+1)*n

imulq  %rcx, %r8      # (i-1)*n

addq   %rdx, %rsi     # i*n+j

addq   %rdx, %rax     # (i+1)*n+j

addq   %rdx, %r8      # (i-1)*n+j

imulq %rcx, %rsi  # i*n

addq %rdx, %rsi  # i*n+j

movq %rsi, %rax  # i*n+j

subq %rcx, %rax  # i*n+j-n

leaq (%rsi,%rcx), %rcx # i*n+j+n

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum neighbors of i,j */
up =    val[(i-1)*n + j  ];
down =  val[(i+1)*n + j  ];
left =  val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

long inj = i*n + j;
up =    val[inj - n];
down =  val[inj + n];
left =  val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

86 of 120

Reduction in Strength

  • Replace a costly operation with a simpler one 
    • Shift and add instead of multiply or divide 
    • 16*x --> x << 4 
    • Replace * with + (as in the example below)

86

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n;
}

87 of 120

Memory Aliasing

87

# sum_rows1 inner loop

.L4:

        movsd   (%rsi,%rax,8), %xmm0 # FP load

        addsd   (%rdi), %xmm0 # FP add

        movsd   %xmm0, (%rsi,%rax,8) # FP store

        addq    $8, %rdi

        cmpq    %rcx, %rdi

        jne     .L4

    • Code updates b[i] on every iteration​
    • Extra store

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum rows of n x n matrix a
   and store in vector b  */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

88 of 120

Memory Aliasing

88

    • No extra store for intermediate result

# sum_rows2 inner loop

.L10:

        addsd   (%rdi), %xmm0 # FP load + add

        addq    $8, %rdi

        cmpq    %rax, %rdi

        jne     .L10

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

/* Sum rows of n x n matrix a
   and store in vector b  */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

89 of 120

Loop Interchange

  • Improved locality

89

Original loop:

for(j=0; j<M; j++)
    for(i=0; i<N; i++)
        c[i][j] = a[i][j] + b[i][j];

Interchanged loop:

for(i=0; i<N; i++)
    for(j=0; j<M; j++)
        c[i][j] = a[i][j] + b[i][j];

90 of 120

Loop Fusion

90

Original loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Fused loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

91 of 120

Loop Fusion

  • Reducing loop overhead
  • Increasing the amount of work done in a loop
  • Improving locality by combining loops that reference the same array

91

Original loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Fused loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

92 of 120

Loop Distribution

  • It is opposite of loop fusion

92

Distributed loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Original loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

93 of 120

Loop Distribution

  • It is opposite of loop fusion

93

Distributed loops:

for(i=0; i<N; i++)

a[i] = b[i] + 10

for(i=0; i<N; i++)

c[i] = a[i]*2

Original loops:

for(i=0; i<N; i++){

a[i] = b[i] + 10

c[i] = a[i]*2

}

  • It can improve locality in multi-core processing
  • It can improve hardware prefetch by separating different data streams

94 of 120

Loop Invariant Computations

  • Calculations that do not change between loop iterations
  • Can be moved outside a loop

94

Original loop:

for(i=0; i<N; i++)

a[i] = i* (SOME_NUMBER/100)

Transformed loop:

for(i=0; i<N; i++)

a[i] = i* funct();

95 of 120

Loop Unrolling

  • Increases ILP
  • To reduce loop overhead instructions
  • Increases code size

95

void foo(int A[8], int B[8]) {
    for (int i = 0; i < 8; i++) {
        B[i] = A[i]*3;
    }
}

void foo(int A[8], int B[8]) {
    for (int i = 0; i < 8; i += 2) {
        B[i]   = A[i]*3;
        B[i+1] = A[i+1]*3;
    }
}

96 of 120

Dead Code Elimination

96

int global;

void f ()

{

int i;

i = 1; /* dead store */

global = 1; /* dead store */

global = 2;

return;

global = 3; /* unreachable */

}

int global;

void f ()

{

global = 2;

return;

}

97 of 120

Constant Folding

  • Evaluate constant expressions at compile time
  • Only possible when side-effect freeness guaranteed

97

int a = 4+5;   🡺   int a = 9;

!true          🡺   false

  • Caveat: floating-point implementations can differ between machines
    • -ffast-math (really -funsafe-math-optimizations) lets the compiler ignore this

98 of 120

Constant Propagation

  • Variables that have constant value, e.g. c := 3
    • Later uses of c can be replaced by the constant
    • If no change of c between!
  • Analysis needed, as b can be assigned more than once!

98

int b = 3;
int c = 1 + b;
int d = b + c;

int b = 3;
int c = 1 + 3;
int d = 3 + c;

99 of 120

Compiler Optimization Passes

  • Loop Optimizations
    • Loop fission
    • Loop fusion
    • Loop interchange
    • Loop invariant code motion
    • Loop unrolling
  • Data Flow Optimizations
    • Common subexpression elimination
    • Constant folding and propagation
    • Dead code elimination
  • Code Generation Optimization
    • Register Allocation
    • Instruction Scheduling
    • Instruction Selection
    • Reordering computations
    • ,…

99

100 of 120

LLVM: Optimizing Compiler

  • LLVM is a compiler infrastructure designed as a set of reusable libraries with well defined interfaces
  • LLVM IR is target independent

100

[Figure: LLVM IR flows through a sequence of optimization passes, each taking LLVM IR in and producing LLVM IR out.]

101 of 120

Compiler Optimization Passes

  • Loop Optimizations
    • Loop fission
    • Loop fusion
    • Loop interchange
    • Loop invariant code motion
    • Loop unrolling
  • Data Flow Optimizations
    • Common subexpression elimination
    • Constant folding
    • Dead store elimination
  • Code Generation Optimization
    • Register Allocation
    • Instruction Scheduling
    • Instruction Selection
    • Reordering computations
    • ,…

101

102 of 120

Optimizing Compilers

  • Provide efficient mapping of program to machine 
    • register allocation 
    • code selection and ordering (scheduling) 
    • dead code elimination
    • ...
    • eliminating minor inefficiencies 
  • Do not optimize (usually) the "algorithm" 
    • up to programmer to select best overall algorithm for a given task
      • Sorting: O(n²) vs. O(n log n)
    • big-O savings are (often) more important
  • Have difficulty overcoming inherent limitations (cache, inherent parallelism)

102

103 of 120

Limitations of Optimizing Compilers

  • Operate under a fundamental constraint 
    • Must not cause any change in program behavior 
    • Except, possibly, when the program makes use of nonstandard language features 
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles 
    • e.g., data ranges may be more limited than variable types suggest 
  • Most analysis is performed only within procedures 
    • Whole-program analysis is too expensive in most cases 
    • Newer versions of GCC do interprocedural analysis within individual files 
    • But not between code in different files 
  • Most analysis is based only on static information 
    • The compiler has difficulty anticipating run-time inputs

103

When in doubt, the compiler must be conservative​

Bryant and O'Hallaron, Computer Systems: A programmer's Perspective

104 of 120

GCC O1, O2, O3 levels

104

-O1

-O2

-O3

105 of 120

Inline Assembly

  • Ability to embed low level assembly code in high-level C/C++ code

  • __asm [volatile] (code); /* Basic inline assembly syntax */
    • "ADD R0, R1, R2
  • Why use inline assembly ?

105

106 of 120

Inline Assembly

  • Ability to embed low level assembly code in high-level C/C++ code

  • __asm [volatile] (code); /* Basic inline assembly syntax */
    • "ADD R0, R1, R2
  • Why use inline assembly ?
    • Optimization for code that cannot be naturally expressed in a high-level language
      • Reversing bits
    • Access to processor-specific instructions
      • NEON
    • System calls
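As an illustration of the bit-reversal case, a sketch using GCC/Clang extended inline assembly and the ARM RBIT instruction (this assumes a 32-bit ARMv7-A target such as the Cortex-A9; the operand constraints let the compiler pick the registers):

```c
#include <stdint.h>

/* reverse the bit order of a 32-bit word using the ARMv7 RBIT instruction */
static inline uint32_t reverse_bits32(uint32_t x) {
    uint32_t r;
    __asm__ ("rbit %0, %1" : "=r"(r) : "r"(x));
    return r;
}
```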

106

107 of 120

SIMD

107

107

108 of 120

Flynn Taxonomy

108

Instruction Stream x Data Stream:

  SISD: single instruction, single data (Single-Core Processors)

  SIMD: single instruction, multiple data (GPUs, Intel SSE/AVX extensions, …)

  MISD: multiple instruction, single data (Dataflow architectures, Systolic Arrays [debatably], …)

  MIMD: multiple instruction, multiple data (VLIW, Parallel Computers)

109 of 120

SIMD Basics

  • SIMD: Single instruction multiple data
  • A SIMD register (or a vector register) can hold many values (2 - 16 values or more) of a single type
  • Each value in a SIMD register is called a SIMD lane
  • SIMD instructions can operate on several (typically all) values on a SIMD register

109

110 of 120

SIMD

  • SIMD Operations: Single Precision

110

[Figure: 256-bit %ymm0 and %ymm1 registers. vaddsd %ymm0, %ymm1, %ymm1 adds only a single lane, while vaddpd %ymm0, %ymm1, %ymm1 adds every lane of the two registers in parallel.]

  • SIMD Operations: Double Precision

Some example AVX-512F (a subset of AVX-512) instructions

Syntax

zmm0 ...zmm31 are 512 bit registers; each can hold 16 single-precision (float of C; 32 bits) or 8 double-precision (double of C; 64 bits) floating point numbers

111 of 120

SIMD

  • SIMD is good at parallelizing computations doing almost exactly the same series of instructions on contiguous data
  • ⇒ generally, main targets are simple loops whose index values can be easily identified

111

Original loop:

for(i=0; i<N; i++){

    S(i)

}

Vectorized loop (L lanes per iteration):

for(i=0; i<N; i+=L){

    S(i : i+L)

}
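Concretely, for L = 4 lanes, the strip-mined form of a simple element-wise loop looks like this (plain C showing what the vectorizer does conceptually; the second loop handles the remainder when N is not a multiple of 4):

```c
void scale(float *a, const float *b, int N) {
    int i;
    for (i = 0; i + 4 <= N; i += 4) {   /* one "SIMD iteration": 4 lanes at once */
        a[i]   = 2.0f * b[i];
        a[i+1] = 2.0f * b[i+1];
        a[i+2] = 2.0f * b[i+2];
        a[i+3] = 2.0f * b[i+3];
    }
    for (; i < N; i++)                  /* scalar remainder */
        a[i] = 2.0f * b[i];
}
```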

112 of 120

Programming SIMD

  • Auto vectorization
    • loop vectorization
  • Language extensions/directives for SIMD
    • SIMD directives for loops/functions (OpenMP 4.0)
    • Domain Specific Languages (Halide)
  • Vector types
    • GCC vector extensions
  • Intrinsics
  • Assembly programming

112

Which vector registers is this using??

113 of 120

Auto Vectorization

  • Write a loop-based code and hope compiler does (figure out) the vectorization

113

int len_vec = 128;

void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {

int i;

for (i=0; i<len_vec; i++) {

vec_C[i] = vec_A[i] + vec_B[i];

}

}

armclang --target=aarch64-arm-none-eabi -g -c -O1 vec_add.c

  • Automatically performs 4 float adds at a time using the Neon extension (a Neon SIMD register is 128 bits)

gcc -o simd_auto -march=native -O3 vec_add.c

114 of 120

OpenMP SIMD

  • OpenMP simd pragma
    • allows an explicit vectorization of for loops

114

int len_vec=128;

void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {

int i;

#pragma omp simd

for (i=0; i<len_vec; i++) {

vec_C[i] = vec_A[i] + vec_B[i];

}

}

115 of 120

GCC Vector Types

  • Gcc allows to define a vector types

115

typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));

floatv x, y, z;

z += x * y;

float a, b;

floatv x, y;

y = a * x + b;

  • You can define operations on vector types
  • You can mix scalars with vector types

116 of 120

GCC Vector Types

116

typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));

#define V(lv) *((floatv*)&(lv))

/* scalar loop */
for (long i = 0; i < n; i++) {
    y[i] = a * x[i];
}

/* the same loop using the vector type, L floats at a time */
for (long i = 0; i < n; i += L) {
    V(y[i]) = a * V(x[i]);
}

117 of 120

Intrinsics

  • Processor/platform-specific functions and types
  • For example, ARM Neon instrinsics

117

#include <arm_neon.h>

/* what the Neon intrinsic vadd_s16 does: a lane-wise add of two 4-lane int16 vectors */
int16x4_t vadd_s16(int16x4_t a, int16x4_t b) {
    int16x4_t c;
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
    return c;
}
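For reference, a small hypothetical use of these intrinsics: vld1_s16 loads 4 lanes, vadd_s16 adds them lane-wise, and vst1_s16 stores the result, with a scalar loop for the tail:

```c
#include <arm_neon.h>
#include <stdint.h>

void add_arrays_s16(const int16_t *a, const int16_t *b, int16_t *c, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        int16x4_t va = vld1_s16(a + i);        /* load 4 lanes from a */
        int16x4_t vb = vld1_s16(b + i);        /* load 4 lanes from b */
        vst1_s16(c + i, vadd_s16(va, vb));     /* lane-wise add, store 4 lanes */
    }
    for (; i < n; i++)                         /* scalar tail */
        c[i] = a[i] + b[i];
}
```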

118 of 120

Why (auto) vectorization fails?

  • Potential aliasing makes auto vectorization difficult/impossible
  • Complex control flows make vectorization impossible or less profitable
  • Non-contiguous data accesses make vectorization impossible or less profitable
  • Giving hints to the compiler sometimes (not always) addresses the problem

118

119 of 120

How do I know if my code is vectorized?

  • Most compilers provide tools to get a report

119

  • Add a trick to the generated assembly (gcc -S)
    • enclose loops with inline assembler comments

Compiler   Report options
gcc        -fopt-info-vec-{optimized,missed}
clang      -Rpass=vectorize

asm volatile ("# xxxxxx loop begins");

for (i = 0; i < n; i++) {
    ... /* loop to be vectorized */
}

asm volatile ("# xxxxxx loop ends");

120 of 120

How do I know if my code is vectorized?

120

clang -O3 -Rpass=loop-vectorize -S add_vec.c -o /dev/null

add_vec.c:4:5: remark: 

vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)

void difficult_function_to_vectorize(int *A, int *B, int Length) {

for (int i = 0; i < Length; i++)

A[B[i]]++;

}

clang -O3 -Rpass=loop-vectorize -S difficult_to_vectorize.c -o /dev/null

difficult_to_vectorize.c:3:5: remark:

loop not vectorized: cannot identify array bounds

for (int i = 0; i < Length; i++)

^