1 of 60

CS61C: Great Ideas in Computer Architecture (aka Machine Structures)

Lecture 27: Caches IV

Instructors: Dan Garcia, Justin Yokota


CS 61C

Fall 2023

2 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


3 of 60

Caching Example: Matrix Multiply

  • For today, we'll be using Matrix Multiply as an example.
  • Assumptions:
    • Elements are 4 bytes
      • But only the bottom byte will be shown in the cache for space
    • Block size is 16 bytes (one row of this matrix)
    • Matrix A stored at 0x1000 (16-bit addresses)
    • Matrix B stored at 0x2000
    • Matrix C stored at 0x3000
    • No other memory used
    • Caches start cold
    • To speed things up, we'll transpose matrix B to take more advantage of spatial locality. More on this in lab!
  • Let's see how this works on a write-back direct-mapped cache with 4 blocks
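The setup above can be sketched in a few lines of Python (a sketch for intuition, not code from the lab; variable names are ours). Multiplying with a pre-transposed B means both A and Bt are walked row by row, so each group of four consecutive accesses lands in one 16-byte block.

```python
# Sketch: 4x4 matrix multiply with a pre-transposed B.
# Walking row i of A and row j of Bt keeps every inner-loop access
# sequential in memory -- one 16-byte block per 4-element row.
N = 4
A = [[r * N + c + 1 for c in range(N)] for r in range(N)]    # elements 1..16
B = [[r * N + c + 17 for c in range(N)] for r in range(N)]   # elements 17..32
Bt = [[B[r][c] for r in range(N)] for c in range(N)]         # transpose of B

C = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        # inner product of A's row i with Bt's row j (= B's column j)
        C[i][j] = sum(A[i][k] * Bt[j][k] for k in range(N))

print(C[0])  # [250, 260, 270, 280] -- the values the later slides show
```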

Bt (B stored transposed; elements 17-32):

  17 21 25 29
  18 22 26 30
  19 23 27 31
  20 24 28 32

A (elements 1-16):

   1  2  3  4
   5  6  7  8
   9 10 11 12
  13 14 15 16

C (not yet written)


4 of 60

Caching Example: Write-Back Direct-Mapped

  • Start: Cache starts cold
    • Note: Index isn't actually stored, but we'll show it anyway
    • Note: Random data stored in the cache (but valid bits are 0, so it doesn't matter)
  • 2^4-byte blocks = 4-bit offset
  • 2^2 indexes = 2-bit index
  • 16-bit addresses: 16 - 4 - 2 = 10-bit tag
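The field breakdown above can be sanity-checked with a small helper (a hypothetical function of ours, matching the slide's 4-bit offset / 2-bit index / 10-bit tag split):

```python
# Split a 16-bit address into (tag, index, offset) for a direct-mapped
# cache with 16-byte blocks (4 offset bits) and 4 blocks (2 index bits),
# leaving 16 - 4 - 2 = 10 tag bits.
OFFSET_BITS = 4
INDEX_BITS = 2

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1000))  # (64, 0, 0): tag 0x040, index 0, offset 0
```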

Cache state:

Index  Tag    V  Dirty  Data
0      0x100  0  0      0xBF 0x16 0x88 0x2B
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


5 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[0]: 0x1000
    • Address in binary: 0b0001 0000 0000 0000
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x100  0  0      0xBF 0x16 0x88 0x2B
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


6 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[0]: 0x2000
    • Address in binary: 0b0010 0000 0000 0000
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0
      • Uh oh

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


7 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[1]: 0x1004
    • Address in binary: 0b0001 0000 0000 0100
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0
      • Oh no

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


8 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[1]: 0x2004
    • Address in binary: 0b0010 0000 0000 0100
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


9 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[2]: 0x1008
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


10 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[2]: 0x2008
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


11 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[3]: 0x100C
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


12 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[3]: 0x200C
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


13 of 60

Caching Example: Write-Back Direct-Mapped

  • Access C[0]: 0x3000
    • Index 0, Tag 0x0C0
    • Miss, so replace the block in index 0
  • So far, 9 misses, 0 hits
    • We keep evicting the block even though we have 3 unused spots in our cache!

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


14 of 60

Caching Example: Write-Back Direct-Mapped

  • A[0]: Miss
    • Evicts the C[0] block, so we write the data back to the C matrix
  • B[4]: Miss, but now address is 0x2010
    • Address in binary: 0b0010 0000 0001 0000
    • We're now in index 1, so we don't evict A[0]

Cache state:

Index  Tag    V  Dirty  Data
0      0x0C0  1  1      0xFA 0x?? 0x?? 0x??
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


15 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[1]: 0x1004
    • Address in binary: 0b0001 0000 0000 0100
    • Index 0, Tag 0x040
    • Tag matches and Valid bit is on, so hit!

(Memory: C[0][0] has been written back as 250)

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


16 of 60

Caching Example: Write-Back Direct-Mapped

  • B[5]: Hit, … , C[1]: Miss, evicting Block 0x1000
  • Overall for this element of C: 3 misses, 6 hits
    • Better!

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


17 of 60

Caching Example: Write-Back Direct-Mapped

  • In general: if the element of C is on the diagonal, 9 misses, 0 hits
  • Otherwise, 3 misses, 6 hits (most of the time; exact count is tedious)
  • Total: 66 misses, 78 hits, or 54% hit rate

Cache state:

Index  Tag    V  Dirty  Data
0      0x0C0  1  1      0xFA 0x04 0x?? 0x??
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


18 of 60

Two bad cache types

  • Direct-mapped caches are prone to thrashing if our matrices land at unlucky memory addresses
  • But fully-associative caches use a lot of extra circuitry and are much slower than direct-mapped caches

Is there another way?


19 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


20 of 60

N-Way Set Associative Cache (1/3)

  • Memory address fields:
    • Tag: same as before
    • Offset: same as before
    • Index: points us to the correct “row” (called a set in this case)
  • So what’s the difference?
    • each set contains multiple blocks
    • once we’ve found correct set, must compare with all tags in that set to find our data
    • Size of $ is # sets x N blocks/set x block size
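Plugging in small illustrative numbers (our choice, not the slide's) makes the size formula concrete:

```python
# Cache size = #sets x N blocks/set x block size.
# Illustrative parameters: a 2-way cache with 4 sets of 16-byte blocks.
num_sets, ways, block_size = 4, 2, 16
print(num_sets * ways * block_size)  # 128 bytes of cached data
```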


21 of 60

Associative Cache Example

  • Here’s a simple 2-way set associative cache.
    • 2 sets, 2 blocks in set, 1 byte per block

[Diagram: 16 one-byte memory locations (0x0-0xF) alongside a 2-set, 2-way cache (2 blocks per set, 1 byte per block)]


22 of 60

N-Way Set Associative Cache (2/3)

  • Basic Idea
    • Cache is direct-mapped w/respect to sets
    • Each set is fully associative with N blocks in it
  • Given memory address:
    • Find correct set using Index value.
    • Compare Tag with all Tag values in that set.
    • If a match occurs, hit!, otherwise a miss.
    • Finally, use the offset field as usual to find the desired data within the block.
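The lookup steps above can be sketched as a toy model (parameters and names are ours for illustration; real hardware does the tag comparisons in parallel, and the eviction policy is covered in the next section):

```python
# Toy set-associative cache: index selects a set, then the tag is
# compared against every block in that set.
class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.sets = [[] for _ in range(num_sets)]  # each set: list of tags
        self.ways = ways

    def access(self, tag, index):
        s = self.sets[index]
        if tag in s:          # compare with all tags in the set
            return "hit"
        if len(s) == self.ways:
            s.pop(0)          # set is full: evict (FIFO, for simplicity)
        s.append(tag)
        return "miss"

cache = SetAssocCache(num_sets=2, ways=2)
print(cache.access(tag=0x040, index=0))  # miss (cold)
print(cache.access(tag=0x080, index=0))  # miss, but no eviction: 2 ways
print(cache.access(tag=0x040, index=0))  # hit -- no thrashing this time
```

Note how the two tags that thrashed in the direct-mapped example now coexist in one set.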


23 of 60

4-Way Set Associative Cache Circuit

[Circuit diagram: the index selects a set; the tag is compared against all four ways in parallel, producing a one-hot hit signal]

24 of 60

Analyzing Cache Effectiveness

  • Let's compare the hit patterns of a few cache types
  • The below diagram shows the hit/miss pattern of various caches when run on the Matrix Multiply example
    • Red = Miss, White = Hit
  • Also one more cache: A fully associative cache with an absurd number of blocks (an "infinite" cache which will hit as often as possible)
  • Hit rates are 91.7%, 83.3%, 75%, 54.2%, respectively

[Diagram: hit/miss grids for four caches: FA 1M blocks, FA 4 blocks, 2SA 4 blocks, DM 4 blocks]


25 of 60

N-Way Set Associative Cache (3/3)

  • What’s so great about this?
    • Even a 2-way set assoc cache avoids a lot of conflict misses
    • Hardware cost isn’t that bad: only need N comparators
  • In fact, for a cache with M blocks,
    • It’s Direct-Mapped if it’s 1-way set associative
    • It’s Fully Associative if it’s M-way set associative
    • So these two are just special cases of the more general set associative design


26 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


27 of 60

Block Replacement Policy

  • Direct-Mapped Cache
    • index completely specifies which position a block goes in on a miss
  • N-Way Set Assoc
    • index specifies a set, but block can occupy any position within the set on a miss
  • Fully Associative
    • block can be written into any position
  • Question: if we have the choice, where should we write an incoming block?
    • If any valid bit is off, write the new block into the first invalid position.
    • If all are valid, pick a replacement policy
      • rule for which block gets “cached out” on a miss.


28 of 60

Eviction Policies

  • Least Recently Used (LRU): evict the block in the set that was least recently accessed.
    • Pro: temporal locality
      • recent past use implies likely future use: in fact, this is a very effective policy
    • Con: Requires complicated hardware and much time to keep track of this
  • Most Recently Used (MRU): evict the block in the set that was most recently accessed.
  • First-In-First-Out (FIFO): evict the oldest block in the set (queue).
  • Last-In-First-Out (LIFO): evict the newest block in the set (stack).
    • FIFO and LIFO ignore any accesses a block receives after it enters the set, but they serve as good approximations to LRU and MRU without adding much extra hardware
  • Random: randomly selects a block to evict
    • Works surprisingly okay, especially given a low temporal locality workload
    • Does not perform exceedingly well, but also not exceedingly poorly
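A minimal sketch of LRU bookkeeping for one set (our illustration, using Python's OrderedDict; real hardware tracks recency with per-block counters or pseudo-LRU bits instead):

```python
from collections import OrderedDict

# LRU bookkeeping for one set: the OrderedDict keeps blocks ordered
# from least to most recently used.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data (data omitted here)

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # refresh recency on a hit
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
s.access(1); s.access(2)
s.access(1)              # tag 1 is now most recently used
print(s.access(3))       # miss: evicts tag 2, not tag 1
print(s.access(1))       # hit: tag 1 survived
```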


29 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Let's see how a WB Fully Associative Cache with LRU replacement policy works
  • 0 misses, 0 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  0  0      0    0xBF 0x16 0x88 0x2B
0x733  0  0      0    0x3B 0x18 0xF1 0xB3
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


30 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Access A[0] = 0x1000
    • Miss
  • Access Bt[0] = 0x2000
    • Miss

Cache state:

Tag    V  Dirty  LRU  Data
0x100  0  0      0    0xBF 0x16 0x88 0x2B
0x733  0  0      0    0x3B 0x18 0xF1 0xB3
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


31 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[1]: Hit Bt[1]: Hit
  • A[2]: Hit Bt[2]: Hit
  • A[3]: Hit Bt[3]: Hit
  • 6 Hits, 2 Misses!

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      2    0x01 0x02 0x03 0x04
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


32 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Write to C[0] = 0x3000
    • Miss + Write, so set dirty bit
  • Main memory not updated yet

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      2    0x01 0x02 0x03 0x04
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


33 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[0]: Hit Bt[4]: Miss … A[3]: Hit Bt[7]: Hit. Write C[1]: Hit
  • 8 Hits, 1 Miss here
  • Totals: 4 Misses, 14 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x200  1  0      2    0x11 0x15 0x19 0x1D
0x300  1  1      1    0xFA 0x?? 0x?? 0x??
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


34 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[0]: Hit Bt[8]: Miss, evict block 0x200, … , Write C[2]: Hit
  • 8 Hits, 1 Miss here
  • Totals: 5 Misses, 22 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x200  1  0      4    0x11 0x15 0x19 0x1D
0x300  1  1      1    0xFA 0x04 0x?? 0x??
0x201  1  0      2    0x12 0x16 0x1A 0x1E


35 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[3]: 8 hits, 1 miss, block 0x201 evicted
  • A[4]: Miss, Block 0x202 evicted
  • 8 Hits, 2 Misses here
  • Totals: 7 Misses, 30 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x202  1  0      2    0x13 0x17 0x1B 0x1F
0x300  1  1      1    0xFA 0x04 0x0E 0x??
0x201  1  0      4    0x12 0x16 0x1A 0x1E


36 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Bt[0]: Miss, evict block 0x100
  • A[5]-Bt[3]: 6 hits
  • Total: 8 misses, 36 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      4    0x01 0x02 0x03 0x04
0x101  1  0      1    0x05 0x06 0x07 0x08
0x300  1  1      2    0xFA 0x04 0x0E 0x18
0x203  1  0      3    0x14 0x18 0x1C 0x20


37 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Write C[4]: Miss, Write to dirty block, evict block 0x203
  • A[4]: Hit
  • Total: 9 misses, 37 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x101  1  0      2    0x05 0x06 0x07 0x08
0x300  1  1      3    0xFA 0x04 0x0E 0x18
0x203  1  0      4    0x14 0x18 0x1C 0x20


38 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Bt[4]: Miss, evict 0x300
    • Only at this point does main memory get updated
  • Total: 10 misses, 37 hits, 1 write

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      3    0x11 0x15 0x19 0x1D
0x101  1  0      1    0x05 0x06 0x07 0x08
0x300  1  1      4    0xFA 0x04 0x0E 0x18
0x301  1  1      2    0x6A 0x?? 0x?? 0x??


39 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[5]-Write C[5]: 7 hits
  • Compute C[6]: 1 miss, 8 hits
  • Compute C[7]: 1 miss, 8 hits
  • Total: 12 misses, 60 hits, 1 write

(Memory: C row 0 has been written back as 250 260 270 280)

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      4    0x11 0x15 0x19 0x1D
0x101  1  0      2    0x05 0x06 0x07 0x08
0x201  1  0      1    0x12 0x16 0x1A 0x1E
0x301  1  1      3    0x6A 0x84 0x9E 0xB8


40 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[8]: 3 misses, 6 hits
  • Compute C[9]: 1 miss, 8 hits, 1 write
  • Compute C[10]: 1 miss, 8 hits
  • Compute C[11]: 1 miss, 8 hits
  • Total: 18 misses, 90 hits, 2 writes

Cache state:

Tag    V  Dirty  LRU  Data
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x101  1  0      3    0x05 0x06 0x07 0x08
0x203  1  0      2    0x14 0x18 0x1C 0x20
0x301  1  1      1    0x6A 0x84 0x9E 0xB8


41 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[12]: 3 misses, 6 hits
  • Compute C[13]: 1 miss, 8 hits, 1 write
  • Compute C[14]: 1 miss, 8 hits
  • Compute C[15]: 1 miss, 8 hits
  • Total: 24 misses, 120 hits, 3 writes

(Memory now also holds C row 1 = 618 658 670 696)

Cache state:

Tag    V  Dirty  LRU  Data
0x102  1  0      3    0x09 0x0A 0x0B 0x0C
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x302  1  1      1    0xDA 0x04 0x2E 0x58
0x203  1  0      2    0x14 0x18 0x1C 0x20


42 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Final step: Flush the cache
    • Write back all dirty blocks
    • Invalidate all blocks
  • Final Count: 24 misses, 120 hits, 4 writes

(Memory now also holds C row 2 = 986 1028 1070 1112)

Cache state:

Tag    V  Dirty  LRU  Data
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x103  1  0      3    0x0D 0x0E 0x0F 0x10
0x203  1  0      2    0x14 0x18 0x1C 0x20
0x303  1  1      1    0x4A 0x84 0xBE 0xF8


43 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Final Count: 24 misses, 120 hits, 4 writes
  • 83% hit rate

(After the flush, memory holds all of C; row 3 = 1354 1412 1470 1528)

Cache state:

Tag    V  Dirty  LRU  Data
0x202  0  0      4    0x13 0x17 0x1B 0x1F
0x103  0  0      3    0x0D 0x0E 0x0F 0x10
0x203  0  0      2    0x14 0x18 0x1C 0x20
0x303  0  0      1    0x4A 0x84 0xBE 0xF8


44 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


45 of 60

Definition Slide

  • The hit rate of a cache is the percentage of accesses that result in a hit
  • The miss rate of a cache is the percentage of accesses that result in a miss
  • Hit time is how long it takes to check the cache. The Miss Penalty is how long it takes to access main memory after a miss.
    • If we get a cache hit, runtime is equal to hit time
    • If we get a cache miss, runtime is equal to hit time + miss penalty
    • If we didn't have a cache, we'd always take miss penalty time to access main memory
  • Analogy: Hit time = how long to check the bookshelf, Miss penalty = how long to go to the library
  • For example, if our hit rate is ⅚ = 83.33%, our miss rate is ⅙ = 16.66%
  • The Average Memory Access Time (or AMAT) is the average amount of time it takes for one memory access, given a hit rate.
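These definitions combine into the standard formula AMAT = hit time + miss rate × miss penalty; a one-liner makes that concrete (the numbers here preview the next slide's direct-mapped cache):

```python
# AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle cache with a 50% miss rate and 10-cycle miss penalty:
print(amat(1, 0.5, 10))  # 6.0 cycles
```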


46 of 60

AMAT computation

  • In our matrix multiply example: DM cache had ~50% hit rate, 2WSA cache had ~75% hit rate, FA cache had ~80% hit rate
  • Let's say that accessing main memory takes 10 clock cycles, and accessing our cache takes 1 cycle for DM, 2 cycles for 2WSA, and 4 cycles for FA cache
  • DM cache AMAT: 50% chance for hit (1 cycle), 50% chance for miss (10+1 cycles)
    • AMAT = 1*0.5+11*0.5 = 6 cycles
  • FA cache AMAT: 80% chance for hit (4 cycles), 20% chance for miss (10+4 cycles)
    • AMAT = 4*0.8 + (4*0.2 + 10*0.2) = 6 cycles
  • 2WSA cache AMAT: 2 cycles for hit time, 25% chance to get a 10 cycle miss penalty
    • AMAT = 2 + 0.25*10 = 4.5 cycles
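The three computations above agree with the equivalent form AMAT = hit time + miss rate × miss penalty, written out here as a check:

```python
# Same numbers as the bullets above, written the other way:
dm = 1 + 0.50 * 10    # direct-mapped: 1-cycle hit, 50% miss rate
fa = 4 + 0.20 * 10    # fully associative: 4-cycle hit, 20% miss rate
twsa = 2 + 0.25 * 10  # 2-way set associative: 2-cycle hit, 25% miss rate
print(dm, fa, twsa)   # 6.0 6.0 4.5
```

The 2-way cache wins despite the fully associative cache's better hit rate, because its hit time is lower.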


47 of 60

Big Idea

  • How to choose between associativity, block size, replacement & write policy?
  • Design against a performance model
    • Minimize Average Memory Access Time
    • influenced by technology & program behavior
  • Create the illusion of a memory that is large, cheap, and fast - on average
  • How can we improve miss penalty?


48 of 60

Improving Miss Penalty

  • When caches first became popular, Miss Penalty ~ 10 processor clock cycles
  • Today: a 3 GHz processor (1/3 ns per clock cycle) and 80 ns to go to DRAM → ~200 processor clock cycles!
  • Solution: another cache between memory and the processor cache: Second Level Cache


49 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


50 of 60

Multilevel Caches

  • Solution: Add multiple layers of caches.
    • Check L1 cache. If hit, we're done
    • If miss, check L2 cache. If hit, we're done, and bring data up to L1 cache
    • If miss, check L3 cache. If hit, we're done, and bring data up to L1 and L2 cache
  • Each level of cache is larger than the previous and should have a subset of the data in the next cache
  • Library analogy: We buy a bookshelf for right next to us, then we buy a larger bookshelf to put in our garage, then make a shed to store more books.


51 of 60

Multilevel Caches

  • Generally, the L1 cache is small, but also on the core
    • Each core gets its own L1 cache
    • Fast access, but also fairly small
  • L2 cache is larger and slightly slower, but usually a bit farther away
    • Sometimes shared, sometimes on the core
  • L3 cache is normally much larger and much slower, and is shared with all cores
  • Using the numbers from the previous page as examples:
    • L1 hit time = 1 ns
    • L2 hit time = 1 ns + 4 ns = 5 ns
    • L3 hit time = 1 ns + 4 ns + 40 ns = 45 ns
    • L3 miss = 1 ns + 4 ns + 40 ns + 80 ns = 125 ns
  • Miss rate is calculated only for accesses that "reach" that cache
    • L2 hit rate is % of L2 accesses that hit, so L1 hits don't count as L2 hits OR L2 misses


52 of 60

AMAT in Multilevel Caches

  • Let's say we have a memory system with the following properties. What would be the AMAT of this system?
  • Respond at PollEv.com/jy314
    • Hit rates are the % of accesses to that cache that hit
      • So L1 hits aren't counted in L2 accesses
    • Hit times are just for that cache
      • So accessing L2 takes 1 + 4 = 5 ns

             Hit time  Hit rate
L1 Cache       1 ns      50%
L2 Cache       4 ns      75%
L3 Cache      40 ns      80%
Main memory   80 ns     100%


54 of 60

AMAT in Multilevel Caches

  • Let's say we have a memory system with the following properties. What would be the AMAT of this system?
  • 50% of accesses are L1 hits
  • 50% of accesses are L1 misses
    • 50%*75% = ⅜ are L2 hits
  • ⅛ of accesses are L2 misses
    • ⅛*80% = 1/10 are L3 hits
  • 1/40 of accesses are L3 misses
    • And therefore go to main memory
  • Total: 1 ns + 0.5*4 ns + ⅛ * 40 ns + 1/40 * 80 ns
  • 10 ns access time = 8x speedup
    • Note: In practice, L1 cache hit rates are often around 90-95%, so we tend to get even higher speedup factors.
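The chain of fractions above can be checked directly (same numbers as the table; the key point is that each level's miss rate applies only to the accesses that reach it):

```python
# Per-level access times (ns) and hit rates from the table:
l1_t, l2_t, l3_t, mem_t = 1, 4, 40, 80
l1_h, l2_h, l3_h = 0.5, 0.75, 0.8

reach_l2 = 1 - l1_h                 # 0.5   of accesses miss L1
reach_l3 = reach_l2 * (1 - l2_h)    # 0.125 miss L2 as well
reach_mem = reach_l3 * (1 - l3_h)   # 0.025 miss all three caches

# Every access pays L1's time; each deeper level is paid only by
# the fraction of accesses that reach it.
amat = l1_t + reach_l2 * l2_t + reach_l3 * l3_t + reach_mem * mem_t
print(round(amat, 3))  # 10.0 ns
```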


55 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


56 of 60

An Actual CPU – Early PowerPC

  • Cache
    • 32 KiB Instructions & 32 KiB Data L1 caches
    • External L2 Cache interface with integrated controller and cache tags, supports up to 1 MiB external L2 cache
    • Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
  • Pipelining
    • Superscalar (3 inst/cycle)
    • 6 execution units (2 integer and 1 double precision IEEE floating point)


57 of 60

An Actual CPU – Pentium M

[Die photo: the 32 KiB instruction and 32 KiB data L1 caches are visible on the die]


58 of 60

Multilevel Caches


59 of 60

Multilevel Caches


60 of 60

Conclusion

  • Caching is a powerful tool that lets us save memory access time
    • Not just limited to memory caches! Software memoization, streaming caches, filesystem caches: anything where recomputing or refetching something is expensive
  • Tons of different ways to customize your cache, with different advantages/disadvantages
    • Direct-mapped to Fully Associative
    • Eviction policy
    • Write-back/Write-through
    • Multi-level caching
    • Shared vs independent caches
  • Hardware design ends up trying to pick the best parameters so that the cache works well for most programs
  • Software design can be improved by knowing your cache system and writing code that uses it efficiently.
