WES237B: Software for Embedded Systems
Pat Pannuto
Department of Computer Science and Engineering
University of California, San Diego
Summer Session 2023
Today
CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai
Logistics
Recap
A Modern Memory Hierarchy
Register File, 32 words, sub-nsec
L1 Cache, ~32 KB, ~nsecs
L2 cache, 512KB – 1 MB, ~nsecs
L3 Cache, ~ 10 nsecs
Main memory (DRAM), GB, ~100 ns
Disk, 100 GB, ~10 msec
The guts of most interesting stuff is often ~this
void Foo(float* const A, float* const B, float* const C, const int M, const int N, const int K)
{
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
for (int k = 0; k < K; k++) {
C[m * N + n] += A[m * K + k] * B[k * N + n];
}
}
}
}
And these are huge
Lab2: image.tif
480x640 pixels * 4 bytes/float ≈ 1.2 MB (greyscale!)
EXAMPLES OF SIMPLE CACHES
7
WES 237B
CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen, Leo Porter, and the rest of the UCSD faculty.
A simple cache
tag
data
the tag identifies the address of the cached data
4 entries, each block holds one word; any block can hold any word.
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
8
A simpler cache
an index is used to determine which line an address might be found in
4 entries, each block holds one word; each word in memory maps to exactly one cache location.
00000100
tag
data
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
10
A set-associative cache
tag
data
4 entries, each block holds one word; each word in memory maps to one of a set of n cache lines.
00000100
tag
data
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
12
Longer Cache Blocks
tag
data
4 entries, each block holds two words; each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
00000100
14
Longer Cache Blocks
tag
data (now 64 bits)
4 entries, each block holds two words; each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
00000100
15
Q: Describing Cache Type Tradeoffs?
Selection | Fully-Associative | 4-way Set Associative | Direct Mapped |
A | 3 | 2 | 1 |
B | 3 | 3 | 2 |
C | 1 | 2 | 3 |
D | 3 | 2 | 1 |
E | None of the above | ||
16
Back to Block Size
17
Block Size and Miss Rate
18
Cache Parameters
Cache size = Number of sets * block size * associativity
(Bytes per block) × (Blocks per set) × (Sets per cache); each block stores a tag alongside its data.
Warning / Notice—Things that count towards “cache size”: cache data
Things that do not count towards “cache size”: tags, valid bits, etc…
19
Cache Parameters
Cache size = Number of sets * block size * associativity
(Bytes per block) × (Blocks per set) × (Sets per cache); each block stores a tag alongside its data.
Warning / Notice—Things that count towards “cache size”: cache data
Things that do not count towards “cache size”: tags, valid bits, etc…
Is this very confusing?
Yes, until you implement one.
It is what the world has settled on, so you need to know this and become comfortable with it.
20
Cache Parameters
Cache size = Number of sets * block size * associativity
(always keep in mind “cache size” only counts the data storage)
21
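A minimal sketch (not from the slides) of how these parameters relate, assuming power-of-two sizes and 32-bit addresses; the helper names are made up for illustration:

    #include <stdio.h>

    // Hypothetical helper: derive cache geometry from size, block size, and associativity.
    static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

    int main(void) {
        unsigned cache_bytes = 32 * 1024;  // 32 KB of data storage (tags/valid bits not counted)
        unsigned block_bytes = 32;         // bytes per block
        unsigned ways        = 1;          // associativity (1 = direct mapped)
        unsigned sets        = cache_bytes / (block_bytes * ways);
        unsigned offset_bits = log2u(block_bytes);
        unsigned index_bits  = log2u(sets);
        printf("sets=%u offset=%u index=%u tag=%u\n",
               sets, offset_bits, index_bits, 32 - index_bits - offset_bits);
        return 0;
    }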
Q: How many bits for each field?
tag index block offset
N-bit address
22
Handling a Cache Access
24
Accessing a Sample Cache
Address (32 bits): tag = bits 31-16 (16 bits), index = bits 15-5 (11 bits), block offset = bits 4-0 (5 bits).
A 64 KB direct-mapped cache with 32-byte blocks has 64 KB / 32 bytes = 2 K cache blocks/sets (entries 0-2047); each entry holds a valid bit, a 16-bit tag, and 256 bits (32 bytes) of data, and the stored tag is compared (=) with the address tag to produce hit/miss.
25
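A minimal sketch (not from the slides) of pulling those fields out of an address for this 64 KB direct-mapped, 32-byte-block configuration, assuming 32-bit addresses:

    #include <stdint.h>

    #define OFFSET_BITS 5   // 32-byte blocks
    #define INDEX_BITS  11  // 2048 sets

    static inline uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    static inline uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static inline uint32_t tag_bits(uint32_t addr)     { return addr >> (OFFSET_BITS + INDEX_BITS); }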
Accessing a Sample Cache
Address (32 bits): tag = bits 31-14 (18 bits), index = bits 13-4 (10 bits), block offset = bits 3-0 (4 bits).
A 32 KB two-way set-associative cache with 16-byte blocks has 32 KB / 16 bytes / 2 = 1 K sets (entries 0-1023); each set holds two ways, each with its own valid bit, 18-bit tag, and 16 bytes of data, and both stored tags are compared (=) with the address tag to produce hit/miss.
26
Cache Alignment
memory address = tag | index | block offset
(Diagram: memory drawn as a sequence of block-sized, block-aligned chunks numbered 0, 1, 2, …; the tag and index of an address name one of these blocks, and the block offset selects a byte within it.)
27
How does the cache actually connect to the execution core?
IM
Reg
ALU
DM
Reg
ALU
DM
Memory Interface Module
Address
Data
Read?
Write?
Data
28
How does the cache actually connect to the execution core?
IM
Reg
ALU
DM
Reg
ALU
DM
Memory Interface Module
Address
Data
Read?
Write?
Data
Stall
DRAM Controller
DRAM
Other
Peripherals
Memory Bus
29
How does the cache actually connect to the execution core? [ Simplified View! Very different between embedded & high-perf! ]
ALU
DM
Memory Interface Module
Address
Data
Read?
Write?
Data
Stall
DRAM Controller
DRAM
Other
Peripherals
Memory Bus
30
How does the cache actually connect to the execution core? [ Simplified View! Very different between embedded & high-perf! ]
ALU
DM
Memory Interface Module
Address
Data
Read?
Write?
Data
Stall
DRAM Controller
DRAM
Other
Peripherals
Memory Bus
Cache(s)
31
How does the cache actually connect to the execution core? [ Simplified View! Very different between embedded & high-perf! ]
ALU
DM
Memory Interface Module
Address
Data
Read?
Write?
Data
Stall
DRAM Controller
DRAM
Other
Peripherals
Memory Bus
L1 Cache
A real interface (from a simple design)
32
Cache Alignment Revisited
memory address = tag | index | block offset
(Diagram: memory drawn as a sequence of block-sized, block-aligned chunks numbered 0, 1, 2, …; the tag and index of an address name one of these blocks, and the block offset selects a byte within it.)
33
When executing code, what does all this have to do with performance and alignment?
IM
Reg
ALU
DM
Reg
34
Which of the following things are possible?
Selection | Fully-Associative | 4-way Set Associative | Direct Mapped |
A | 3 | 2 | 1 |
B | 3 | 3 | 2 |
C | 1 | 2 | 3 |
D | 3 | 2 | 1 |
E | None of the above | ||
35
Associative Caches
36
for (int i = 0; i < 10000000; i++)
sum += A[i];
Assume each element of A is 4 bytes and sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks. Assume this loop is visited many times.
Which changes would help the hit rate of the above code?
Selection | Change |
A | Increase to 2-way set associativity |
B | Increase block size to 64 bytes |
C | Increase cache size to 64 KB |
D | A and C combined |
E | A, B, and C combined |
37
for (int i = 0; i < 10000000; i++)
    for (int j = 0; j < 8192; j++)
        sum += A[j] - B[j];
Assume each element of A and B are 4 bytes.
Assume each array is at least 32KB in size.
Assume sum is kept in a register.
Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks.
Which changes would help the hit rate of the above code?
Selection | Change |
A | Increase to 2-way set associativity |
B | Increase block size to 64 bytes |
C | Increase cache size to 64 KB |
D | A and C combined |
E | A, B, and C combined |
38
Dealing with Stores
39
Policy decisions for stores
40
Dealing with stores
41
Cache Performance
CPI = BCPI + MCPI
MCPI = accesses/instruction * miss rate * miss penalty
MCPI = I$ accesses/inst x I$MissRate x I$MissPenalty
+ D$ accesses/inst x D$MissRate x D$MissPenalty
42
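A worked example with made-up numbers (illustrative only, not the parameters of the question that follows): with BCPI = 1.0, one I$ access per instruction at a 3% miss rate, 0.4 D$ accesses per instruction at a 5% miss rate, and a 10-cycle miss penalty for both, MCPI = 1 × 0.03 × 10 + 0.4 × 0.05 × 10 = 0.3 + 0.2 = 0.5, so CPI = 1.0 + 0.5 = 1.5.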
In fact…
43
Cache Performance
CPI = ???
Selection | CPI (rounded if necessary) |
A | 1.24 |
B | 1.34 |
C | 1.48 |
D | 1.72 |
E | None of the above |
44
Cache Performance
45
Cache Performance
46
Example -- DEC Alpha 21164 Caches
21164 CPU
core
Instruction
Cache
Data
Cache
Unified
L2
Cache
Off-Chip
L3 Cache
47
Three types of cache misses
tag
data
address string:
4 00000100
8 00001000
12 00001100
4 00000100
8 00001000
20 00010100
4 00000100
8 00001000
20 00010100
24 00011000
12 00001100
8 00001000
4 00000100
DM cache
48
Q: Categorizing Misses
Selection | Cache Miss |
A | Compulsory |
B | Capacity |
C | Conflict |
D | Both Capacity and Conflict |
E | None of the above |
49
So, then, how do we decrease...
50
Cache Associativity
51
LRU replacement algorithms
52
Caches in Current Processors
53
Intel Nehalem (i7)
CPU
I$
D$
L2 cache
CPU
I$
D$
L2 cache
CPU
I$
D$
L2 cache
CPU
I$
D$
L2 cache
L3 cache
Instruction Cache
-32 KB, 4-way
-64-byte line
Data Cache
-32 KB, 8-way
-64-byte line
-write-back, write-allocate
Unified L2 Cache
-256 KB, 8-way
-64-byte line
-write-back, write-allocate
Shared, unified L3 Cache
-8 MB, 16-way
-64-byte line
-write-back, write-allocate
54
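Plugging these parameters into the cache-size formula from earlier: the L1 data cache has 32 KB / (64 B × 8 ways) = 64 sets (6 index bits, 6 block-offset bits), and the shared L3 has 8 MB / (64 B × 16 ways) = 8192 sets.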
What does PYNQ Provide?
55
Key Points
56
Cache Extras
Additional Examples / Alternative Drawings
57
Cache Concepts
58
(Diagram, from Bryant and O'Hallaron: main memory partitioned into blocks 0-15; the cache currently holds copies of blocks 8, 9, 14, and 3.)
Memory: the larger, slower, cheaper memory is viewed as partitioned into "blocks".
Cache: the smaller, faster, more expensive memory caches a subset of the blocks; data is copied between them in block-sized transfer units.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition
Cache Concepts
59
(Diagram: memory blocks 0-15; the cache holds blocks 8, 9, 14, and 3.)
Data in block 14 is needed. Request: 14. Block 14 is in the cache: Hit!
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition
60
(Diagram: memory blocks 0-15; the cache holds blocks 8, 9, 14, and 3.)
Data in block 12 is needed. Request: 12. Block 12 is not in the cache: Miss!
Block 12 is fetched from memory and stored in the cache.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition
Cache Concepts
61
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition
Cache Concept Summary
62
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective; Third Edition
Types of Cache Implementation
63
Cache Abstraction & Metrics
64
Address
Tag Store
(Is this address in the cache? + bookkeeping)
Data Store (Stores memory blocks)
Hit/miss
Data
Blocks and Addressing the Cache
65
Address split: tag (2 bits) | index (3 bits) | offset (3 bits)
Direct Mapped Cache
Address split: tag (2 bits) | index (3 bits) | offset (3 bits)
(Diagram: the index selects one entry of the tag store, which holds a valid bit V and a Tag, and the matching entry of the data store; the stored tag is compared (=?) with the address tag to produce Hit?, and the selected block supplies Data.)
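A minimal simulation sketch (not from the slides) of that lookup, using the 2/3/3-bit split above; the names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES   8                   // 3 index bits
    #define BLOCK_BYTES 8                   // 3 offset bits

    struct line { bool valid; uint8_t tag; uint8_t data[BLOCK_BYTES]; };
    static struct line cache[NUM_LINES];    // tag store + data store, one entry per line

    // Look up one byte for an 8-bit address (2-bit tag, 3-bit index, 3-bit offset).
    static bool dm_lookup(uint8_t addr, uint8_t *byte_out) {
        uint8_t index = (addr >> 3) & 0x7;
        uint8_t tag   = addr >> 6;
        if (cache[index].valid && cache[index].tag == tag) {   // tag match on a valid line: hit
            *byte_out = cache[index].data[addr & 0x7];         // block offset selects the byte
            return true;
        }
        return false;                                          // miss: the block must be fetched
    }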
Direct Mapped Cache Example
67
Direct Mapped Cache Example
68
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 |
| | | |
| | | |
60 | 61 | 62 | 63 |
Direct Mapped Cache Example
69
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Direct Mapped Cache Example
70
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits
Direct Mapped Cache Example
71
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits
Address = Block Address (m-b bits) | Block Offset (b = 2 bits)
Direct Mapped Cache Example
72
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits
Address = Block Address (m-b bits) | Block Offset (b = 2 bits)
0 | 1 | 2 | 3 | 0 |
4 | 5 | 6 | 7 | 1 |
8 | 9 | 10 | 11 | 2 |
12 | 13 | 14 | 15 | 3 |
Direct Mapped Cache Example
73
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits
Address = Block Address (m-b bits) | Block Offset (b = 2 bits)
0 | 1 | 2 | 3 | 0 |
4 | 5 | 6 | 7 | 1 |
8 | 9 | 10 | 11 | 2 |
12 | 13 | 14 | 15 | 3 |
#lines = 4 = 2² 🡺 l = 2
Cache line
tag
Direct Mapped Cache Example
74
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
Block Size = 4 = 2² 🡺 b = 2 🡺 the number of block-offset bits
Address = Block Address (m-b bits) | Block Offset (b = 2 bits)
#lines = 4 = 2² 🡺 l = 2
Cache line
tag
tag | 0 | 1 | 2 | 3 | | 0 |
tag | 4 | 5 | 6 | 7 | | 1 |
tag | 8 | 9 | 10 | 11 | | 2 |
tag | 12 | 13 | 14 | 15 | | 3 |
Direct Mapped Cache Example
75
0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 1 |
4 | 0 | 1 | 0 | 0 |
5 | 0 | 1 | 0 | 1 |
6 | 0 | 1 | 1 | 0 |
7 | 0 | 1 | 1 | 1 |
8 | 1 | 0 | 0 | 0 |
9 | 1 | 0 | 0 | 1 |
10 | 1 | 0 | 1 | 0 |
11 | 1 | 0 | 1 | 1 |
12 | 1 | 1 | 0 | 0 |
13 | 1 | 1 | 0 | 1 |
14 | 1 | 1 | 1 | 0 |
15 | 1 | 1 | 1 | 1 |
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
| | | | |
| | | | |
15 | 60 | 61 | 62 | 63 |
tag | 0 | 1 | 2 | 3 | | 0 |
tag | 4 | 5 | 6 | 7 | | 1 |
tag | 8 | 9 | 10 | 11 | | 2 |
tag | 12 | 13 | 14 | 15 | | 3 |
2-way set associative cache
78
(Diagram: the four cache lines are organized as two sets of two ways, with a tag and data stored per way.)
Address = Block Address (m-b bits) | Block Offset (b = 2 bits); the block address further splits into a tag and a set index.
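A minimal sketch (not from the slides) of the matching lookup for this tiny 2-way cache; the names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 2
    #define SETS 2                                   // 4 lines arranged as 2 sets of 2 ways

    struct way { bool valid; uint8_t tag; uint8_t data[4]; };
    static struct way cache[SETS][WAYS];

    // 6-bit address: block offset = bits 1:0, set index = bit 2, tag = bits 5:3.
    static bool sa_lookup(uint8_t addr, uint8_t *byte_out) {
        uint8_t set = (addr >> 2) & 0x1;
        uint8_t tag = addr >> 3;
        for (int w = 0; w < WAYS; w++) {             // check every way in the selected set
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *byte_out = cache[set][w].data[addr & 0x3];
                return true;                         // hit in way w
            }
        }
        return false;                                // miss: a replacement policy picks the victim way
    }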
Types of Cache Implementation
79
Resources
80
Compiler Level Optimization
81
Compiler Level Optimizations
82
Code Motion
83
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
void set_row(double *a, double *b,
long i, long n)
{
long j;
for (j = 0; j < n; j++)
a[n*i+j] = b[j];
}
void set_row(double *a, double *b,
long i, long n)
{
long j;
int ni = n*i;
for (j = 0; j < n; j++)
a[ni+j] = b[j];
}
Compiler Generated Code Motion
84
set_row:
    testq   %rcx, %rcx                 # Test n
    jle     .L1                        # If 0, goto done
    imulq   %rcx, %rdx                 # ni = n*i
    leaq    (%rdi,%rdx,8), %rdx        # rowp = A + ni*8
    movl    $0, %eax                   # j = 0
.L3:                                   # loop:
    movsd   (%rsi,%rax,8), %xmm0       # t = b[j]
    movsd   %xmm0, (%rdx,%rax,8)       # M[A+ni*8 + j*8] = t
    addq    $1, %rax                   # j++
    cmpq    %rcx, %rax                 # j:n
    jne     .L3                        # if !=, goto loop
.L1:                                   # done:
    rep ; ret
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
long j;
for (j = 0; j < n; j++)
a[n*i+j] = b[j];
long j;
int ni = n*i;
for (j = 0; j < n; j++)
a[ni+j] = b[j];
Elimination (share) of Common Subexpressions
85
3 multiplications: i*n, (i-1)*n, (i+1)*n
1 multiplication: i*n
leaq 1(%rsi), %rax # i+1
leaq -1(%rsi), %r8 # i-1
imulq %rcx, %rsi # i*n
imulq %rcx, %rax # (i+1)*n
imulq %rcx, %r8 # (i-1)*n
addq %rdx, %rsi # i*n+j
addq %rdx, %rax # (i+1)*n+j
addq %rdx, %r8 # (i-1)*n+j
imulq %rcx, %rsi # i*n
addq %rdx, %rsi # i*n+j
movq %rsi, %rax # i*n+j
subq %rcx, %rax # i*n+j-n
leaq (%rsi,%rcx), %rcx # i*n+j+n
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
/* Sum neighbors of i,j */
up = val[(i-1)*n + j ];
down = val[(i+1)*n + j ];
left = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;
long inj = i*n + j;
up = val[inj - n];
down = val[inj + n];
left = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;
Reduction in Strength
86
for (i = 0; i < n; i++) {
int ni = n*i;
for (j = 0; j < n; j++)
a[ni + j] = b[j];
}
int ni = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++)
a[ni + j] = b[j];
ni += n;
}
Memory Aliasing
87
# sum_rows1 inner loop
.L4:
movsd (%rsi,%rax,8), %xmm0 # FP load
addsd (%rdi), %xmm0 # FP add
movsd %xmm0, (%rsi,%rax,8) # FP store
addq $8, %rdi
cmpq %rcx, %rdi
jne .L4
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
/* Sum the rows of an n x n matrix a
   and store the result in vector b */
void sum_rows1(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
b[i] = 0;
for (j = 0; j < n; j++)
b[i] += a[i*n + j];
}
}
Memory Aliasing
88
# sum_rows2 inner loop
.L10:
addsd (%rdi), %xmm0 # FP load + add
addq $8, %rdi
cmpq %rax, %rdi
jne .L10
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
/* Sum the rows of an n x n matrix a
   and store the result in vector b */
void sum_rows2(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
double val = 0;
for (j = 0; j < n; j++)
val += a[i*n + j];
b[i] = val;
}
}
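A related option (an addition, not shown on the slides): C99's restrict qualifier makes the no-aliasing promise explicit, so the compiler itself may keep b[i] in a register:

    /* Sketch: promise the compiler that a and b never overlap. */
    void sum_rows3(double *restrict a, double *restrict b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            b[i] = 0;
            for (j = 0; j < n; j++)
                b[i] += a[i*n + j];   /* the accumulation can now stay in a register */
        }
    }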
Loop Interchange
89
Original loop:
for(j=0; j<M; j++)
for(i=0; i<N; i++)
c[i][j]=a[i][j]+b[i][j];
Interchanged loop:
for(i=0; i<N; i++)
for(j=0; j<M; j++)
c[i][j]=a[i][j]+b[i][j];
Loop Fusion
90
Original loops:
for(i=0; i<N; i++)
    a[i] = b[i] + 10;
for(i=0; i<N; i++)
    c[i] = a[i]*2;
Fused loops:
for(i=0; i<N; i++){
    a[i] = b[i] + 10;
    c[i] = a[i]*2;
}
Loop Distribution
92
Distributed loops:
for(i=0; i<N; i++)
    a[i] = b[i] + 10;
for(i=0; i<N; i++)
    c[i] = a[i]*2;
Original loops:
for(i=0; i<N; i++){
    a[i] = b[i] + 10;
    c[i] = a[i]*2;
}
Loop Invariant Computations
94
Original loop:
for(i=0; i<N; i++)
    a[i] = i * (SOME_NUMBER/100);
Transformed loop (the invariant SOME_NUMBER/100 is computed once, outside the loop):
int k = SOME_NUMBER/100;
for(i=0; i<N; i++)
    a[i] = i * k;
Loop Unrolling
95
void foo(
int A[8],
int B[8]) {
for(int i=0;i<8;i++){
B[i]=A[i]*3;
}
}
void foo(
int A[8],
int B[8]) {
for(int i=0;i<8; i+=2){
B[i]=A[i]*3;
B[i+1]=A[i+1]*3;
}
}
Dead Code Elimination
96
int global;
void f ()
{
int i;
i = 1; /* dead store */
global = 1; /* dead store */
global = 2;
return;
global = 3; /* unreachable */
}
int global;
void f ()
{
global = 2;
return;
}
Constant Folding
97
int a = 4+5; 🡺 int a = 9;
!true; 🡺 false;
Constant Propagation
98
int b = 3;
int c = 1 + b;
int d = b + c;
🡺
int b = 3;
int c = 1 + 3;
int d = 3 + c;
Compiler Optimization Passes
99
LLVM: Optimizing Compiler
100
(Diagram: LLVM IR flows through a sequence of optimization passes: LLVM IR → Optimization Pass → LLVM IR → Optimization Pass → … → LLVM IR.)
Compiler Optimization Passes
101
Optimizing Compilers
102
Limitations of Optimizing Compilers
103
When in doubt, the compiler must be conservative
Bryant and O'Hallaron, Computer Systems: A programmer's Perspective
GCC O1, O2, O3 levels
104
-O1
-O2
-O3
Inline Assembly
105
Inline Assembly
106
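The slides leave the example to the board; as a minimal sketch (assuming GCC/Clang extended asm on a 32-bit ARM target such as the PYNQ's Cortex-A9, with an illustrative function name), inline assembly looks like this:

    #include <stdint.h>

    static inline uint32_t add_asm(uint32_t a, uint32_t b) {
        uint32_t c;
        asm volatile("add %0, %1, %2"
                     : "=r"(c)           /* output operand in any general-purpose register */
                     : "r"(a), "r"(b)    /* input operands */
                     : /* no clobbers */);
        return c;
    }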
SIMD
107
Flynn Taxonomy
108
| | Data Stream: Single | Data Stream: Multi |
| Instruction Stream: Single | SISD (Single-Core Processors) | SIMD (GPUs, Intel SSE/AVX extensions, …) |
| Instruction Stream: Multi | MISD (Dataflow architectures, Systolic Arrays [debatably], …) | MIMD (VLIW, Parallel Computers) |
SIMD Basics
109
SIMD
110
(Diagram: with %ymm0 and %ymm1 as source registers, vaddsd %ymm0, %ymm1, %ymm1 performs a single scalar double-precision add, while vaddpd %ymm0, %ymm1, %ymm1 adds all of the packed double-precision lanes in parallel.)
Some example AVX-512F (a subset of AVX-512) instructions
Syntax
zmm0 ...zmm31 are 512 bit registers; each can hold 16 single-precision (float of C; 32 bits) or 8 double-precision (double of C; 64 bits) floating point numbers
SIMD
111
Original loop:
for(i=0; i<N; i++){
S(i)
}
Vectorized loop (each step handles L elements at once):
for(i=0; i<N; i+=L){
    S(i:i+L)
}
Programming SIMD
112
Which vector registers is this using??
Auto Vectorization
113
int len_vec = 128;
void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {
int i;
for (i=0; i<len_vec; i++) {
vec_C[i] = vec_A[i] + vec_B[i];
}
}
armclang --target=aarch64-arm-none-eabi -g -c -O1 vec_add.c
gcc -o simd_auto -march=native -O3 vec_add.c
OpenMP SIMD
114
int len_vec=128;
void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {
int i;
#pragma omp simd
for (i=0; i<len_vec; i++) {
vec_C[i] = vec_A[i] + vec_B[i];
}
}
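Note (an addition, not on the slide): with GCC and Clang the pragma typically only takes effect when compiling with -fopenmp, or with the lighter-weight -fopenmp-simd, which enables just the SIMD directives.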
GCC Vector Types
115
typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));
floatv x, y, z;
z += x * y;
float a, b;
floatv x, y;
y = a * x + b;
GCC Vector Types
116
for (long i = 0; i < n; i += L) {
V(y[i]) = a * V(x[i]);
}
typedef float floatv __attribute__((vector_size(64), aligned(sizeof(float))));
#define V(lv) *((floatv*)&(lv))
for (long i = 0; i < n; i++) {
y[i] = a * x[i];
}
Intrinsics
117
#include <arm_neon.h>
/* Semantics of the NEON intrinsic vadd_s16 (provided by <arm_neon.h>), lane by lane: */
int16x4_t vadd_s16(int16x4_t a, int16x4_t b) {
int16x4_t c;
for (int i = 0; i < 4; i++) {
c[i] = a[i] + b[i];
}
return c;
}
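A minimal usage sketch (not from the slides) calling real NEON intrinsics from <arm_neon.h>; the function name and pointers are illustrative:

    #include <arm_neon.h>
    #include <stdint.h>

    // Add four int16 elements at a time using NEON.
    void add4(const int16_t *a, const int16_t *b, int16_t *c) {
        int16x4_t va = vld1_s16(a);            // load 4 lanes from a
        int16x4_t vb = vld1_s16(b);            // load 4 lanes from b
        vst1_s16(c, vadd_s16(va, vb));         // lane-wise add, then store 4 lanes to c
    }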
Why does (auto)vectorization fail?
118
How do I know if my code is vectorized?
119
Compiler | Report options |
gcc | -fopt-info-vec-{optimized,missed} |
clang | -Rpass=vectorize |
asm volatile ("# xxxxxx loop begins");
for (i = 0; i < n; i++) {
... /* loop to be vectorized */
}
asm volatile ("# xxxxxx loop ends");
How do I know if my code is vectorized?
120
clang -O3 -Rpass=loop-vectorize -S add_vec.c -o /dev/null
add_vec.c:4:5: remark:
vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)
void difficult_function_to_vectorize(int *A, int *B, int Length) {
for (int i = 0; i < Length; i++)
A[B[i]]++;
}
clang -O3 -Rpass=loop-vectorize -S difficult_to_vectorize.c -o /dev/null
difficult_to_vectorize.c:3:5: remark:
loop not vectorized: cannot identify array bounds
for (int i = 0; i < Length; i++)
^