1 of 60

CS61C: Great Ideas in Computer Architecture (aka Machine Structures)

Lecture 27: Caches IV

Instructors: Dan Garcia, Justin Yokota


CS 61C

Fall 2023

2 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


3 of 60

Caching Example: Matrix Multiply

  • For today, we'll be using Matrix Multiply as an example.
  • Assumptions:
    • Elements are 4 bytes
      • But only the bottom byte will be shown in the cache for space
    • Block size is 16 bytes (one row of this matrix)
    • Matrix A stored at 0x1000 (16-bit addresses)
    • Matrix B stored at 0x2000
    • Matrix C stored at 0x3000
    • No other memory used
    • Caches start cold
    • To speed things up, we'll transpose matrix B to take more advantage of spatial locality. More on this in lab!
  • Let's see how this works on a write-back direct-mapped cache with 4 blocks
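The setup above can be sketched in a few lines of Python (a sketch for intuition, not code from the lab; variable names are ours). Multiplying with a pre-transposed B means both A and Bt are walked row by row, so each group of four consecutive accesses lands in one 16-byte block.

```python
# Sketch: 4x4 matrix multiply with a pre-transposed B.
# Walking row i of A and row j of Bt keeps every inner-loop access
# sequential in memory -- one 16-byte block per 4-element row.
N = 4
A = [[r * N + c + 1 for c in range(N)] for r in range(N)]    # elements 1..16
B = [[r * N + c + 17 for c in range(N)] for r in range(N)]   # elements 17..32
Bt = [[B[r][c] for r in range(N)] for c in range(N)]         # transpose of B

C = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        # inner product of A's row i with Bt's row j (= B's column j)
        C[i][j] = sum(A[i][k] * Bt[j][k] for k in range(N))

print(C[0])  # [250, 260, 270, 280] -- the values the later slides show
```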

Bt (B stored transposed; elements 17-32):

  17 21 25 29
  18 22 26 30
  19 23 27 31
  20 24 28 32

A (elements 1-16):

   1  2  3  4
   5  6  7  8
   9 10 11 12
  13 14 15 16

C (not yet written)


4 of 60

Caching Example: Write-Back Direct-Mapped

  • Start: Cache starts cold
    • Note: Index isn't actually stored, but we'll show it anyway
    • Note: Random data stored in the cache (but valid bits are 0, so it doesn't matter)
  • 2^4-byte blocks = 4-bit offset
  • 2^2 indexes = 2-bit index
  • 16-bit addresses: 16 - 4 - 2 = 10-bit tag
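The field breakdown above can be sanity-checked with a small helper (a hypothetical function of ours, matching the slide's 4-bit offset / 2-bit index / 10-bit tag split):

```python
# Split a 16-bit address into (tag, index, offset) for a direct-mapped
# cache with 16-byte blocks (4 offset bits) and 4 blocks (2 index bits),
# leaving 16 - 4 - 2 = 10 tag bits.
OFFSET_BITS = 4
INDEX_BITS = 2

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1000))  # (64, 0, 0): tag 0x040, index 0, offset 0
```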

Cache state:

Index  Tag    V  Dirty  Data
0      0x100  0  0      0xBF 0x16 0x88 0x2B
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


5 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[0]: 0x1000
    • Address in binary: 0b0001 0000 0000 0000
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x100  0  0      0xBF 0x16 0x88 0x2B
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


6 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[0]: 0x2000
    • Address in binary: 0b0010 0000 0000 0000
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0
      • Uh oh

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


7 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[1]: 0x1004
    • Address in binary: 0b0001 0000 0000 0100
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0
      • Oh no

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


8 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[1]: 0x2004
    • Address in binary: 0b0010 0000 0000 0100
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


9 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[2]: 0x1008
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


10 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[2]: 0x2008
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


11 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[3]: 0x100C
    • Index 0, Tag 0x040
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


12 of 60

Caching Example: Write-Back Direct-Mapped

  • Access B[3]: 0x200C
    • Index 0, Tag 0x080
    • Miss, so replace the block in index 0

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


13 of 60

Caching Example: Write-Back Direct-Mapped

  • Access C[0]: 0x3000
    • Index 0, Tag 0x0C0
    • Miss, so replace the block in index 0
  • So far, 9 misses, 0 hits
    • We keep evicting the block even though we have 3 unused spots in our cache!

Cache state:

Index  Tag    V  Dirty  Data
0      0x080  1  0      0x11 0x15 0x19 0x1D
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


14 of 60

Caching Example: Write-Back Direct-Mapped

  • A[0]: Miss
    • Evicts the C[0] block, so we write the data back to the C matrix
  • B[4]: Miss, but now address is 0x2010
    • Address in binary: 0b0010 0000 0001 0000
    • We're now in index 1, so we don't evict A[0]

Cache state:

Index  Tag    V  Dirty  Data
0      0x0C0  1  1      0xFA 0x?? 0x?? 0x??
1      0x133  0  0      0x3B 0x18 0xF1 0xB3
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


15 of 60

Caching Example: Write-Back Direct-Mapped

  • Access A[1]: 0x1004
    • Address in binary: 0b0001 0000 0000 0100
    • Index 0, Tag 0x040
    • Tag matches and Valid bit is on, so hit!

(Memory: C[0][0] has been written back as 250)

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


16 of 60

Caching Example: Write-Back Direct-Mapped

  • B[5]: Hit, … , C[1]: Miss, evicting Block 0x1000
  • Overall for this element of C: 3 misses, 6 hits
    • Better!

Cache state:

Index  Tag    V  Dirty  Data
0      0x040  1  0      0x01 0x02 0x03 0x04
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


17 of 60

Caching Example: Write-Back Direct-Mapped

  • In general: if the element of C is on the diagonal, 9 misses, 0 hits
  • Otherwise, 3 misses, 6 hits (most of the time; exact count is tedious)
  • Total: 66 misses, 78 hits, or 54% hit rate

Cache state:

Index  Tag    V  Dirty  Data
0      0x0C0  1  1      0xFA 0x04 0x?? 0x??
1      0x080  1  0      0x12 0x16 0x1A 0x1E
2      0x156  0  0      0xE6 0x57 0x49 0xEE
3      0x0E9  0  0      0xB5 0x81 0x67 0x3F


18 of 60

Two bad cache types

  • Direct-mapped caches are prone to thrashing if our matrices land at unlucky memory addresses
  • But fully-associative caches use a lot of extra circuitry and are much slower than direct-mapped caches

Is there another way?


19 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


20 of 60

N-Way Set Associative Cache (1/3)

  • Memory address fields:
    • Tag: same as before
    • Offset: same as before
    • Index: points us to the correct “row” (called a set in this case)
  • So what’s the difference?
    • each set contains multiple blocks
    • once we’ve found correct set, must compare with all tags in that set to find our data
    • Size of $ is # sets x N blocks/set x block size
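Plugging in small illustrative numbers (our choice, not the slide's) makes the size formula concrete:

```python
# Cache size = #sets x N blocks/set x block size.
# Illustrative parameters: a 2-way cache with 4 sets of 16-byte blocks.
num_sets, ways, block_size = 4, 2, 16
print(num_sets * ways * block_size)  # 128 bytes of cached data
```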


21 of 60

Associative Cache Example

  • Here’s a simple 2-way set associative cache.
    • 2 sets, 2 blocks in set, 1 byte per block

[Diagram: 16 one-byte memory locations (0x0-0xF) alongside a 2-set, 2-way cache (2 blocks per set, 1 byte per block)]


22 of 60

N-Way Set Associative Cache (2/3)

  • Basic Idea
    • Cache is direct-mapped w/respect to sets
    • Each set is fully associative with N blocks in it
  • Given memory address:
    • Find correct set using Index value.
    • Compare Tag with all Tag values in that set.
    • If a match occurs, hit!, otherwise a miss.
    • Finally, use the offset field as usual to find the desired data within the block.
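The lookup steps above can be sketched as a toy model (parameters and names are ours for illustration; real hardware does the tag comparisons in parallel, and the eviction policy is covered in the next section):

```python
# Toy set-associative cache: index selects a set, then the tag is
# compared against every block in that set.
class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.sets = [[] for _ in range(num_sets)]  # each set: list of tags
        self.ways = ways

    def access(self, tag, index):
        s = self.sets[index]
        if tag in s:          # compare with all tags in the set
            return "hit"
        if len(s) == self.ways:
            s.pop(0)          # set is full: evict (FIFO, for simplicity)
        s.append(tag)
        return "miss"

cache = SetAssocCache(num_sets=2, ways=2)
print(cache.access(tag=0x040, index=0))  # miss (cold)
print(cache.access(tag=0x080, index=0))  # miss, but no eviction: 2 ways
print(cache.access(tag=0x040, index=0))  # hit -- no thrashing this time
```

Note how the two tags that thrashed in the direct-mapped example now coexist in one set.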


23 of 60

4-Way Set Associative Cache Circuit

[Circuit diagram: the index selects a set; the tag is compared against all four ways in parallel, producing a one-hot hit signal]

24 of 60

Analyzing Cache Effectiveness

  • Let's compare the hit patterns of a few cache types
  • The below diagram shows the hit/miss pattern of various caches when run on the Matrix Multiply example
    • Red = Miss, White = Hit
  • Also one more cache: A fully associative cache with an absurd number of blocks (an "infinite" cache which will hit as often as possible)
  • Hit rates are 91.7%, 83.3%, 75%, 54.2%, respectively

[Diagram: hit/miss grids for four caches: FA 1M blocks, FA 4 blocks, 2SA 4 blocks, DM 4 blocks]


25 of 60

N-Way Set Associative Cache (3/3)

  • What’s so great about this?
    • Even a 2-way set assoc cache avoids a lot of conflict misses
    • Hardware cost isn’t that bad: only need N comparators
  • In fact, for a cache with M blocks,
    • It’s Direct-Mapped if it’s 1-way set associative
    • It’s Fully Associative if it’s M-way set associative
    • So these two are just special cases of the more general set associative design


26 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


27 of 60

Block Replacement Policy

  • Direct-Mapped Cache
    • index completely specifies which position a block goes in on a miss
  • N-Way Set Assoc
    • index specifies a set, but block can occupy any position within the set on a miss
  • Fully Associative
    • block can be written into any position
  • Question: if we have the choice, where should we write an incoming block?
    • If any valid bit is off, write the new block into the first invalid position.
    • If all are valid, pick a replacement policy
      • rule for which block gets “cached out” on a miss.


28 of 60

Eviction Policies

  • Least Recently Used (LRU): evict the block in the set that was least recently accessed.
    • Pro: temporal locality
      • recent past use implies likely future use: in fact, this is a very effective policy
    • Con: Requires complicated hardware and much time to keep track of this
  • Most Recently Used (MRU): evict the block in the set that was most recently accessed.
  • First-In-First-Out (FIFO): evict the oldest block in the set (queue).
  • Last-In-First-Out (LIFO): evict the newest block in the set (stack).
    • FIFO and LIFO ignore any accesses a block receives after it enters the set, but they serve as good approximations to LRU and MRU without adding much extra hardware
  • Random: randomly selects a block to evict
    • Works surprisingly okay, especially given a low temporal locality workload
    • Does not perform exceedingly well, but also not exceedingly poorly
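A minimal sketch of LRU bookkeeping for one set (our illustration, using Python's OrderedDict; real hardware tracks recency with per-block counters or pseudo-LRU bits instead):

```python
from collections import OrderedDict

# LRU bookkeeping for one set: the OrderedDict keeps blocks ordered
# from least to most recently used.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data (data omitted here)

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # refresh recency on a hit
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
s.access(1); s.access(2)
s.access(1)              # tag 1 is now most recently used
print(s.access(3))       # miss: evicts tag 2, not tag 1
print(s.access(1))       # hit: tag 1 survived
```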


29 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Let's see how a WB Fully Associative Cache with LRU replacement policy works
  • 0 misses, 0 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  0  0      0    0xBF 0x16 0x88 0x2B
0x733  0  0      0    0x3B 0x18 0xF1 0xB3
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


30 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Access A[0] = 0x1000
    • Miss
  • Access Bt[0] = 0x2000
    • Miss

Cache state:

Tag    V  Dirty  LRU  Data
0x100  0  0      0    0xBF 0x16 0x88 0x2B
0x733  0  0      0    0x3B 0x18 0xF1 0xB3
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


31 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[1]: Hit Bt[1]: Hit
  • A[2]: Hit Bt[2]: Hit
  • A[3]: Hit Bt[3]: Hit
  • 6 Hits, 2 Misses!

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      2    0x01 0x02 0x03 0x04
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


32 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Write to C[0] = 0x3000
    • Miss + Write, so set dirty bit
  • Main memory not updated yet

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      2    0x01 0x02 0x03 0x04
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x156  0  0      0    0xE6 0x57 0x49 0xEE
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


33 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[0]: Hit Bt[4]: Miss … A[3]: Hit Bt[7]: Hit. Write C[1]: Hit
  • 8 Hits, 1 Miss here
  • Totals: 4 Misses, 14 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x200  1  0      2    0x11 0x15 0x19 0x1D
0x300  1  1      1    0xFA 0x?? 0x?? 0x??
0x4E9  0  0      0    0xB5 0x81 0x67 0x3F


34 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[0]: Hit Bt[8]: Miss, evict block 0x200, … , Write C[2]: Hit
  • 8 Hits, 1 Miss here
  • Totals: 5 Misses, 22 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x200  1  0      4    0x11 0x15 0x19 0x1D
0x300  1  1      1    0xFA 0x04 0x?? 0x??
0x201  1  0      2    0x12 0x16 0x1A 0x1E


35 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[3]: 8 hits, 1 miss, block 0x201 evicted
  • A[4]: Miss, Block 0x202 evicted
  • 8 Hits, 2 Misses here
  • Totals: 7 Misses, 30 Hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      3    0x01 0x02 0x03 0x04
0x202  1  0      2    0x13 0x17 0x1B 0x1F
0x300  1  1      1    0xFA 0x04 0x0E 0x??
0x201  1  0      4    0x12 0x16 0x1A 0x1E


36 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Bt[0]: Miss, evict block 0x100
  • A[5]-Bt[3]: 6 hits
  • Total: 8 misses, 36 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x100  1  0      4    0x01 0x02 0x03 0x04
0x101  1  0      1    0x05 0x06 0x07 0x08
0x300  1  1      2    0xFA 0x04 0x0E 0x18
0x203  1  0      3    0x14 0x18 0x1C 0x20


37 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Write C[4]: Miss, Write to dirty block, evict block 0x203
  • A[4]: Hit
  • Total: 9 misses, 37 hits

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      1    0x11 0x15 0x19 0x1D
0x101  1  0      2    0x05 0x06 0x07 0x08
0x300  1  1      3    0xFA 0x04 0x0E 0x18
0x203  1  0      4    0x14 0x18 0x1C 0x20


38 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Bt[4]: Miss, evict 0x300
    • Only at this point does main memory get updated
  • Total: 10 misses, 37 hits, 1 write

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      3    0x11 0x15 0x19 0x1D
0x101  1  0      1    0x05 0x06 0x07 0x08
0x300  1  1      4    0xFA 0x04 0x0E 0x18
0x301  1  1      2    0x6A 0x?? 0x?? 0x??


39 of 60

Caching Example: Write-Back Fully Associative with LRU

  • A[5]-Write C[5]: 7 hits
  • Compute C[6]: 1 miss, 8 hits
  • Compute C[7]: 1 miss, 8 hits
  • Total: 12 misses, 60 hits, 1 write

(Memory: C row 0 has been written back as 250 260 270 280)

Cache state:

Tag    V  Dirty  LRU  Data
0x200  1  0      4    0x11 0x15 0x19 0x1D
0x101  1  0      2    0x05 0x06 0x07 0x08
0x201  1  0      1    0x12 0x16 0x1A 0x1E
0x301  1  1      3    0x6A 0x84 0x9E 0xB8


40 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[8]: 3 misses, 6 hits
  • Compute C[9]: 1 miss, 8 hits, 1 write
  • Compute C[10]: 1 miss, 8 hits
  • Compute C[11]: 1 miss, 8 hits
  • Total: 18 misses, 90 hits, 2 writes

Cache state:

Tag    V  Dirty  LRU  Data
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x101  1  0      3    0x05 0x06 0x07 0x08
0x203  1  0      2    0x14 0x18 0x1C 0x20
0x301  1  1      1    0x6A 0x84 0x9E 0xB8


41 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Compute C[12]: 3 misses, 6 hits
  • Compute C[13]: 1 miss, 8 hits, 1 write
  • Compute C[14]: 1 miss, 8 hits
  • Compute C[15]: 1 miss, 8 hits
  • Total: 24 misses, 120 hits, 3 writes

(Memory now also holds C row 1 = 618 658 670 696)

Cache state:

Tag    V  Dirty  LRU  Data
0x102  1  0      3    0x09 0x0A 0x0B 0x0C
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x302  1  1      1    0xDA 0x04 0x2E 0x58
0x203  1  0      2    0x14 0x18 0x1C 0x20


42 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Final step: Flush the cache
    • Write back all dirty blocks
    • Invalidate all blocks
  • Final Count: 24 misses, 120 hits, 4 writes

(Memory now also holds C row 2 = 986 1028 1070 1112)

Cache state:

Tag    V  Dirty  LRU  Data
0x202  1  0      4    0x13 0x17 0x1B 0x1F
0x103  1  0      3    0x0D 0x0E 0x0F 0x10
0x203  1  0      2    0x14 0x18 0x1C 0x20
0x303  1  1      1    0x4A 0x84 0xBE 0xF8


43 of 60

Caching Example: Write-Back Fully Associative with LRU

  • Final Count: 24 misses, 120 hits, 4 writes
  • 83% hit rate

(After the flush, memory holds all of C; row 3 = 1354 1412 1470 1528)

Cache state:

Tag    V  Dirty  LRU  Data
0x202  0  0      4    0x13 0x17 0x1B 0x1F
0x103  0  0      3    0x0D 0x0E 0x0F 0x10
0x203  0  0      2    0x14 0x18 0x1C 0x20
0x303  0  0      1    0x4A 0x84 0xBE 0xF8


44 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


45 of 60

Definition Slide

  • The hit rate of a cache is the percentage of accesses that result in a hit
  • The miss rate of a cache is the percentage of accesses that result in a miss
  • Hit time is how long it takes to check the cache. The Miss Penalty is how long it takes to access main memory after a miss.
    • If we get a cache hit, runtime is equal to hit time
    • If we get a cache miss, runtime is equal to hit time + miss penalty
    • If we didn't have a cache, we'd always take miss penalty time to access main memory
  • Analogy: Hit time = how long to check the bookshelf, Miss penalty = how long to go to the library
  • For example, if our hit rate is ⅚ = 83.33%, our miss rate is ⅙ = 16.66%
  • The Average Memory Access Time (or AMAT) is the average amount of time it takes for one memory access, given a hit rate.
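These definitions combine into the standard formula AMAT = hit time + miss rate × miss penalty; a one-liner makes that concrete (the numbers here preview the next slide's direct-mapped cache):

```python
# AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle cache with a 50% miss rate and 10-cycle miss penalty:
print(amat(1, 0.5, 10))  # 6.0 cycles
```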


46 of 60

AMAT computation

  • In our matrix multiply example: DM cache had ~50% hit rate, 2WSA cache had ~75% hit rate, FA cache had ~80% hit rate
  • Let's say that accessing main memory takes 10 clock cycles, and accessing our cache takes 1 cycle for DM, 2 cycles for 2WSA, and 4 cycles for FA cache
  • DM cache AMAT: 50% chance for hit (1 cycle), 50% chance for miss (10+1 cycles)
    • AMAT = 1*0.5+11*0.5 = 6 cycles
  • FA cache AMAT: 80% chance for hit (4 cycles), 20% chance for miss (10+4 cycles)
    • AMAT = 4*0.8 + (4*0.2 + 10*0.2) = 6 cycles
  • 2WSA cache AMAT: 2 cycles for hit time, 25% chance to get a 10 cycle miss penalty
    • AMAT = 2 + 0.25*10 = 4.5 cycles
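The three computations above agree with the equivalent form AMAT = hit time + miss rate × miss penalty, written out here as a check:

```python
# Same numbers as the bullets above, written the other way:
dm = 1 + 0.50 * 10    # direct-mapped: 1-cycle hit, 50% miss rate
fa = 4 + 0.20 * 10    # fully associative: 4-cycle hit, 20% miss rate
twsa = 2 + 0.25 * 10  # 2-way set associative: 2-cycle hit, 25% miss rate
print(dm, fa, twsa)   # 6.0 6.0 4.5
```

The 2-way cache wins despite the fully associative cache's better hit rate, because its hit time is lower.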


47 of 60

Big Idea

  • How to choose between associativity, block size, replacement & write policy?
  • Design against a performance model
    • Minimize Average Memory Access Time
    • influenced by technology & program behavior
  • Create the illusion of a memory that is large, cheap, and fast - on average
  • How can we improve miss penalty?


48 of 60

Improving Miss Penalty

  • When caches first became popular, Miss Penalty ~ 10 processor clock cycles
  • Today: a 3 GHz processor (1/3 ns per clock cycle) and 80 ns to go to DRAM → ~200 processor clock cycles!
  • Solution: another cache between memory and the processor cache: Second Level Cache


49 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


50 of 60

Multilevel Caches

  • Solution: Add multiple layers of caches.
    • Check L1 cache. If hit, we're done
    • If miss, check L2 cache. If hit, we're done, and bring data up to L1 cache
    • If miss, check L3 cache. If hit, we're done, and bring data up to L1 and L2 cache
  • Each level of cache is larger than the previous and should have a subset of the data in the next cache
  • Library analogy: We buy a bookshelf for right next to us, then we buy a larger bookshelf to put in our garage, then make a shed to store more books.


51 of 60

Multilevel Caches

  • Generally, the L1 cache is small, but also on the core
    • Each core gets its own L1 cache
    • Fast access, but also fairly small
  • L2 cache is larger and slightly slower, but usually a bit farther away
    • Sometimes shared, sometimes on the core
  • L3 cache is normally much larger and much slower, and is shared with all cores
  • Using the numbers from the previous page as examples:
    • L1 hit time = 1 ns
    • L2 hit time = 1 ns + 4 ns = 5 ns
    • L3 hit time = 1 ns + 4 ns + 40 ns = 45 ns
    • L3 miss = 1 ns + 4 ns + 40 ns + 80 ns = 125 ns
  • Miss rate is calculated only for accesses that "reach" that cache
    • L2 hit rate is % of L2 accesses that hit, so L1 hits don't count as L2 hits OR L2 misses


52 of 60

AMAT in Multilevel Caches

  • Let's say we have a memory system with the following properties. What would be the AMAT of this system?
  • Respond at PollEv.com/jy314
    • Hit rates are the % of accesses to that cache that hit
      • So L1 hits aren't counted in L2 accesses
    • Hit times are just for that cache
      • So accessing L2 takes 1 + 4 = 5 ns

             Hit time  Hit rate
L1 Cache       1 ns      50%
L2 Cache       4 ns      75%
L3 Cache      40 ns      80%
Main memory   80 ns     100%


54 of 60

AMAT in Multilevel Caches

  • Let's say we have a memory system with the following properties. What would be the AMAT of this system?
  • 50% of accesses are L1 hits
  • 50% of accesses are L1 misses
    • 50%*75% = ⅜ are L2 hits
  • ⅛ of accesses are L2 misses
    • ⅛*80% = 1/10 are L3 hits
  • 1/40 of accesses are L3 misses
    • And therefore go to main memory
  • Total: 1 ns + 0.5*4 ns + ⅛ * 40 ns + 1/40 * 80 ns
  • 10 ns access time = 8x speedup
    • Note: In practice, L1 cache hit rates are often around 90-95%, so we tend to get even higher speedup factors.
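The chain of fractions above can be checked directly (same numbers as the table; the key point is that each level's miss rate applies only to the accesses that reach it):

```python
# Per-level access times (ns) and hit rates from the table:
l1_t, l2_t, l3_t, mem_t = 1, 4, 40, 80
l1_h, l2_h, l3_h = 0.5, 0.75, 0.8

reach_l2 = 1 - l1_h                 # 0.5   of accesses miss L1
reach_l3 = reach_l2 * (1 - l2_h)    # 0.125 miss L2 as well
reach_mem = reach_l3 * (1 - l3_h)   # 0.025 miss all three caches

# Every access pays L1's time; each deeper level is paid only by
# the fraction of accesses that reach it.
amat = l1_t + reach_l2 * l2_t + reach_l3 * l3_t + reach_mem * mem_t
print(round(amat, 3))  # 10.0 ns
```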


55 of 60

Agenda

  • Matrix Multiply
  • Set Associative Caches
  • Replacement Policy
  • Average Memory Access Time
  • Multilevel Caches
  • Actual CPUs


56 of 60

An Actual CPU – Early PowerPC

  • Cache
    • 32 KiB Instructions & 32 KiB Data L1 caches
    • External L2 Cache interface with integrated controller and cache tags, supports up to 1 MiB external L2 cache
    • Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
  • Pipelining
    • Superscalar (3 inst/cycle)
    • 6 execution units (2 integer and 1 double precision IEEE floating point)


57 of 60

An Actual CPU – Pentium M

[Die photo: the 32 KiB instruction and 32 KiB data L1 caches are visible on the die]


58 of 60

Multilevel Caches


59 of 60

Multilevel Caches


60 of 60

Conclusion

  • Caching is a powerful tool that lets us save memory access time
    • Not just limited to memory caches! Software memoization, streaming caches, filesystem caches: anything where recomputing or refetching something is expensive
  • Tons of different ways to customize your cache, with different advantages/disadvantages
    • Direct-mapped to Fully Associative
    • Eviction policy
    • Write-back/Write-through
    • Multi-level caching
    • Shared vs independent caches
  • Hardware design ends up trying to pick the best parameters so that the cache works well for most programs
  • Software design can be improved by knowing your cache system and writing code that uses it efficiently.
