1 of 77

Cache-Conscious Concurrent Data Structures for Near-Memory Computing

Maurice Herlihy

1 November 2023

2 of 77

3 of 77

Memory Wall


Source: Computer Architecture, A Quantitative Approach by Hennessy and Patterson

4 of 77

Memory Wall

5 of 77

Memory Wall


6 of 77

Memory Wall

[Diagram: the host processor (with its cache) sends a data request across the interconnect to memory and waits for the data to come back.]

7 of 77

Near-Memory Processing (NMP)

[Diagram: a near-memory compute unit sits next to memory; the host processor offloads computation across the interconnect and receives only the computation result, while its cache holds host-side data.]

8 of 77

Hardware Enablers of NMP

3D die-stacking:

High Bandwidth Memory (HBM)

Hybrid Memory Cube (HMC)

Image sources:

Jun et al. HBM DRAM Technology and Architecture. IMW 2017.

Hybrid Memory Cube Specification 2.1. 2015.

9 of 77

Hardware Enablers of NMP

Buffer chip area on DIMMs:

Load-Reduced DIMM (LR-DIMM)

Image source: Rosenfeld. Performance Exploration of the Hybrid Memory Cube. University of Maryland. 2014.

10 of 77

Latest HW enabler: BBcube 3D

The proposed technology uses a stacked design in which processing units (xPU) sit atop multiple interconnected memory layers (DRAM). Replacing wires with through-silicon vias (TSVs) shortens the connections, improving overall electrical performance.

Source: www.titech.ac.jp/english/news/2023/067046

11 of 77

Adopting NMP in real systems…

  1. Is NMP really effective?
    • Are NMP-based applications better than the state of the art?
    • Can (hidden) overheads diminish the benefits?

  2. Is NMP easy to implement and use?
    • What HW configuration makes sense?
    • Is it easy for an application developer to use?

12 of 77

NMP-based concurrent data structures

Concurrent data structures:

Basic building blocks of software

Real use cases in high-performance systems

Provided as packaged software libraries → easy to use

Irregular memory accesses in pointer-chasing data structures

NMP-based data structures:

Preserve high concurrency & correctness guarantees

Retain high on-chip cache locality (when applicable)

Utilize NMP hardware most effectively


13 of 77

Baseline NMP Architecture

[Diagram: host processors, each with an L1 cache, share a last-level cache and connect through an interconnect to main memory.]

14 of 77

Baseline NMP Architecture

[Diagram: the same architecture, with main memory split into host-accessible main memory and NMP-capable memory.]

15 of 77

Baseline NMP Architecture

[Diagram: the NMP-capable memory is divided into NMP partitions, each coupled with its own NMP core.]

16 of 77

Baseline NMP Architecture

Each NMP partition can be accessed by its coupled NMP core only.

NMP core:

  • simple in-order, single-cycle processor
  • small scratchpad memory:
    • memory-mapped for communication w/ host
    • stores instructions for data structure kernel
  • single node-sized buffer as data cache (based on findings from preliminary work)

17 of 77

Baseline NMP Architecture

Load-Reduced DIMM (LR-DIMM)

Hybrid Memory Cube (HMC)

Image sources:

Hybrid Memory Cube Specification 2.1. 2015.

Rosenfeld. Performance Exploration of the Hybrid Memory Cube. University of Maryland. 2014.

18 of 77

Baseline NMP Architecture

[Diagram: the full baseline NMP architecture — host processors with L1 caches and a shared last-level cache, an interconnect, host-accessible main memory, and NMP partitions each coupled to an NMP core.]

19 of 77

NMP-based Data Structures: Flat Combining


Hendler et al. (SPAA ‘10), Liu et al. (SPAA ‘17)

[Diagram: host processors post operations op1–op4 to a publication list; the NMP core picks them up and applies them to the data structure in its NMP partition.]
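The flat-combining pattern in the diagram can be sketched as follows. This is our own toy Python illustration (the class and names are ours, not from Hendler et al.): threads publish operations to a shared list, and whoever holds the combiner lock applies every pending operation, so the sequential structure is only ever touched by one thread at a time — the role the single NMP core plays here.

```python
import threading

class FlatCombiningSet:
    """Toy flat-combining wrapper around a sequential sorted list."""
    def __init__(self):
        self._ops = []                       # publication list: [op, arg, done, result]
        self._ops_lock = threading.Lock()
        self._combiner = threading.Lock()    # only the combiner touches _data
        self._data = []

    def _apply(self, op, arg):
        if op == "add":
            self._data.append(arg)
            self._data.sort()
            return True
        if op == "contains":
            return arg in self._data

    def execute(self, op, arg):
        record = [op, arg, threading.Event(), None]
        with self._ops_lock:
            self._ops.append(record)         # publish the operation
        while not record[2].is_set():
            if self._combiner.acquire(blocking=False):
                try:
                    with self._ops_lock:     # drain the publication list
                        pending, self._ops = self._ops, []
                    for rec in pending:      # combine: run every published op
                        rec[3] = self._apply(rec[0], rec[1])
                        rec[2].set()
                finally:
                    self._combiner.release()
            else:
                record[2].wait(0.001)        # wait until a combiner serves us
        return record[3]
```

In the NMP setting the combiner is fixed (the NMP core) rather than elected, but the publication-list protocol is the same.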

20 of 77

NMP-based Data Structures: Partitioning

[Diagram: the skiplist is range-partitioned across NMP partitions 1–3, each served by its own NMP core; host processors issue read(7), remove(4), read(13), and add(21), routing each operation to the core that owns the key's range.]

Liu et al. (SPAA 2017)
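The host-side routing decision can be sketched as follows; this is a hypothetical illustration (the boundary values and names are ours, not from Liu et al.), where each NMP core owns a contiguous key range and the host only decides which core to send an operation to.

```python
import bisect

# Assumed partition boundaries for three NMP cores:
# core 0 owns keys < 10, core 1 owns 10..19, core 2 owns >= 20.
PARTITION_BOUNDS = [10, 20]

def route(key, bounds=PARTITION_BOUNDS):
    """Return the index of the NMP core that owns `key`."""
    return bisect.bisect_right(bounds, key)
```

With these boundaries, read(7) goes to core 0, read(13) to core 1, and add(21) to core 2.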

21 of 77

Hierarchical Data Structures

Data structures in which nodes are organized by levels in a top-down hierarchy to provide O(log N) access time.

Example: skiplists, trees

Useful in applications that require quick key-based data lookup

Example: online transaction processing (OLTP) systems

Subject to memory latency bottlenecks due to pointer-chasing

Our NMP-based data structure work focuses on

skiplists

B+ trees


22 of 77

Skip Lists

Probabilistic Data Structure

No global rebalancing

Logarithmic-time search

[Diagram: a skiplist over keys 0, 2, 5, 7, 8, 9.]

23 of 77

Skip List Property

Each layer is sub-list of lower levels


24 of 77

Skip List Property

Each layer is sub-list of lower levels


25 of 77

Skip List Property

Each layer is sub-list of lower levels


26 of 77

Skip List Property

Each layer is sub-list of lower levels


27 of 77

Skip List Property

Each layer is sub-list of lower levels

Lowest level is entire list


28 of 77

Skip List Property

Each layer is sub-list of lower levels

Not easy to preserve in concurrent implementations …


29 of 77

Search


read(8)

Too far

30 of 77

Search


read(8)

OK

31 of 77

Search


read(8)

Too far

32 of 77

Search


read(8)

Too far

33 of 77

Search


read(8)

Yes!

34 of 77

Search


read(8)

35 of 77

Logarithmic


read(8)

O(log N)

36 of 77

Why Logarithmic?

  • Property: Each pointer at layer i jumps over roughly 2^i nodes
  • Pick node heights randomly so the property is guaranteed probabilistically
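The random height rule can be sketched with the standard coin-flip construction (our illustration, not code from the talk): flip a fair coin until tails, so a node reaches layer i with probability 2^-i, and each layer holds about half the nodes of the layer below — exactly the "pointer at layer i skips ~2^i nodes" property.

```python
import random

def random_height(max_height=32, p=0.5, rng=random.random):
    """Coin-flip height: reach layer i with probability p**i."""
    h = 1
    while rng() < p and h < max_height:
        h += 1
    return h

# With ~2^k nodes the tallest towers reach height ~k, so a search that
# starts at the top level takes O(log N) expected hops.
```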

37 of 77

Hierarchical Data Structures: Skiplist

  • Sorted linked list structure with multiple levels of lists
  • Node heights are assigned randomly from a particular distribution
  • List at each level is a sub-list of the list at level immediately below
  • Similar to binary search tree, but allows high levels of concurrency

[Diagram: a four-level skiplist; read(22) and read(10) descend from the top level to the bottom list.]
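The layered search in the earlier slides ("too far" → stop moving right, drop down a level) can be sketched sequentially; this is our own minimal illustration, with concurrency omitted.

```python
class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height     # next[i] = successor at level i

def insert(head, key, height):
    """Sequential insert: find the rightmost predecessor at each level."""
    node, preds = head, [None] * len(head.next)
    for level in reversed(range(len(head.next))):
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
        preds[level] = node
    new = Node(key, height)
    for level in range(height):
        new.next[level] = preds[level].next[level]
        preds[level].next[level] = new

def search(head, key):
    """Move right while the next key is too small, else drop a level."""
    node = head
    for level in reversed(range(len(head.next))):
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
    node = node.next[0]
    return node is not None and node.key == key
```

The sentinel head node has the maximum height and a key smaller than every real key.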

38 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP skiplist vs. the host lock-free skiplist (higher is better).]

Choe et al. (SPAA 2019)

Skiplist size: 6x LLC size

Workload: 90/5/5 (read/insert/remove), uniform

39 of 77

Skiplist Performance Analysis

[Diagram: the same skiplist, with the traversal paths of read(22), read(10), read(26), and read(8) highlighted.]

Choe et al. (SPAA 2019)

40 of 77

Skiplist Performance Analysis


Assume a cache that stores half of all data:

Choe et al. (SPAA 2019)

41 of 77

Skiplist Performance Analysis


read(22)

only this access goes out to memory

Choe et al. (SPAA 2019)


42 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP-based skiplist vs. the host lock-free skiplist (higher is better).]

Choe et al. (SPAA 2019)

Skiplist size: 400x LLC size

Workload: 90/5/5 (read/insert/remove), uniform

43 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP-based skiplist (LLC size:skiplist size = 1:400) vs. host lock-free (LF) skiplists at LLC:skiplist size ratios of 1:200, 1:400, 1:800, 1:3000, and 1:25000 (higher is better).]

Choe et al. (SPAA 2019)

Workload: 90/5/5 (read/insert/remove), uniform

44 of 77

Hybrid Skiplist

Higher-level nodes (host-managed):

    • Implemented as a lock-free skiplist in host main memory
    • “Pins” frequently traversed nodes in the host cache

Lower-level nodes (NMP-managed):

    • Implemented as an NMP-based skiplist
    • Prevents infrequently accessed nodes from polluting the cache

Choe et al. (SPAA 2022)

45 of 77

Hybrid Skiplist – Concurrency

Skiplist property: list at level i is a sub-list of the list at level (i −1)

Insertions: complete NMP-side first, then host-side

Deletions: complete host-side first, then NMP-side
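The ordering rule can be sketched schematically (our illustration, with Python sets standing in for the two halves; all names are ours): because the host-side upper levels must always be a sub-list of the NMP-side bottom list, inserts publish to the NMP side first, and removes retract from the host side first.

```python
def hybrid_insert(key, height, nmp_side, host_side, boundary):
    nmp_side.add(key)                 # 1) NMP-side bottom list first
    if height > boundary:             # tall nodes also appear host-side
        host_side.add(key)            # 2) then the host-side upper levels

def hybrid_remove(key, nmp_side, host_side):
    host_side.discard(key)            # 1) host-side upper levels first
    nmp_side.discard(key)             # 2) then the NMP-side bottom list
```

At every intermediate point the host-side set is a subset of the NMP-side set, which is the skiplist property the slide states.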


Choe et al. (SPAA 2022)

46 of 77

Hybrid Skiplist – Concurrency

[Diagram: the hybrid skiplist spanning the host cache and NMP partitions 1–3; thread1's insert(5) passes through node 4 while thread2's remove(4) unlinks it — WRONG!!!]

thread1

thread2

insert(5)

Begin at node 4

remove(4)

Begin at node 1

Choe et al. (SPAA 2022)

47 of 77

Hybrid Skiplist – Concurrency

[Diagram: the same scenario; the inconsistency is detected, and thread1 retries from the host.]

retry from host

thread1

thread2

48 of 77

Hybrid Skiplist Performance

(higher is better)

YCSB-C workload (read-only, Zipfian distribution)

Skiplist size: 512x LLC size

[Plot: throughput of NMP-based, host lock-free, and hybrid skiplists; hybrid achieves a 46% increase.]

49 of 77

Hybrid Skiplist Performance

(higher is better)

Skiplist size: 512x LLC size

Uniform key distribution

8-thread execution

[Plot: throughput of host lock-free vs. hybrid skiplists; hybrid achieves 46–60% increases.]

50 of 77

B+ Tree

[Diagram: an example B+ tree; inner nodes hold navigation information, and the data is stored at the leaves.]

51 of 77

B+ Tree Split & Merge


At most this full, else split

At least this full, else merge
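The overflow/underflow thresholds can be sketched for a leaf of assumed capacity `CAP` (our own illustration, not the talk's pseudocode): a node may hold at most `CAP` keys, else it splits, and at least `CAP // 2` keys, else it merges with a sibling.

```python
import bisect

CAP = 4  # assumed node capacity for this sketch

def insert_into_leaf(keys, key):
    """Insert `key` into a sorted leaf; on overflow split into halves."""
    bisect.insort(keys, key)
    if len(keys) <= CAP:
        return keys, None, None          # still fits: no split
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]         # separator key goes up to the parent

def needs_merge(keys):
    """A node less than half full must merge with (or borrow from) a sibling."""
    return len(keys) < CAP // 2
```

Because the separator is inserted into the parent, a split (or merge) can overflow (or underflow) the parent in turn — which is exactly the recursive case on the next slides.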

52 of 77

B+ Tree Recursive Split

[Diagram: an insertion overflows a leaf — “I just overflowed this node, time to split!”]

53 of 77

B+ Tree Recursive Split

[Diagram: the leaf split pushes a separator key into the parent, which overflows in turn — “I just overflowed this node, time to split!”]

54 of 77

B+ Tree Recursive Split

[Diagram: the tree after the split has propagated upward.]

55 of 77

B+ Tree Recursive Merge

[Diagram: a removal underflows a leaf — “I just underflowed this node, time to merge!”]

56 of 77

B+ Tree Recursive Merge

[Diagram: the leaf merge leaves the parent underfull, so the merge propagates upward — “I just underflowed this node, time to merge!”]

57 of 77

Hierarchical Data Structures: B+ Tree

N-ary balanced tree

Data stored at leaf nodes, ordered by keys

Node size set to cache block size to exploit spatial locality

Nodes are always at least half-full

[Diagram: read(14) and read(56) descend from the root through inner nodes to the leaves.]
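A minimal sequential lookup over this layout might look like the following; this is our own sketch (the `Inner`/`Leaf` names are ours), with inner nodes holding separator keys and child pointers while data lives only in the leaves.

```python
import bisect

class Inner:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children   # len(children) == len(keys) + 1

class Leaf:
    def __init__(self, keys):
        self.keys = keys                            # sorted data keys

def lookup(node, key):
    while isinstance(node, Inner):
        # follow the child whose key range contains `key`
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys
```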

58 of 77

Hierarchical Data Structures: B+ Tree


Concurrent modifications are difficult to manage


59 of 77

Hybrid B+ Tree – Concurrency

[Diagram: a hybrid B+ tree; insert(57) descends through host-side inner nodes toward an NMP-side leaf holding keys 55–59.]

60 of 77

Hybrid B+ Tree – Concurrency


insert(57)

61 of 77

Hybrid B+ Tree – Concurrency


read(59)

insert(57)

thread1

thread2

A

62 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2's insert(57) splits leaf A into A and A’; thread1's read(59) still holds a pointer to A.]

63 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2 is done, but thread1's read(59) follows the stale pointer into A and misses keys moved to A’ — WRONG!!!]

64 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: each node carries a sequence number (Seq#), and each child records the Seq# of its parent.]

65 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: node A has Seq# 2; both thread1 and thread2 record “expects parent #: 2” before descending.]

66 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2's insert(57) splits node A and bumps its Seq# from 2 to 3; the children now record parent Seq# 3.]

67 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread1 reaches a node recording parent Seq# 3 but expected 2 — the mismatch is detected, and thread1 retries from the host; thread2 is done.]
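The Seq#-validation idea can be sketched as follows; this is a schematic, single-threaded illustration of the retry condition (all names are ours): a reader remembers the parent's sequence number when it takes a child pointer, a concurrent split bumps that number, and the reader detects the mismatch and retries from the host.

```python
class SeqNode:
    def __init__(self, parent_seq, keys):
        self.parent_seq = parent_seq   # parent's Seq# when this link was made
        self.keys = keys

def descend(child, expected_parent_seq):
    """Follow a child pointer optimistically; None means 'retry from host'."""
    if child.parent_seq != expected_parent_seq:
        return None                    # a concurrent split bumped the Seq#
    return child.keys
```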

68 of 77

Hybrid B+ Tree Performance


B+ tree size: 512x LLC size

Uniform key distribution

8-thread execution

[Plot: throughput of host-only vs. hybrid B+ tree (higher is better).]

69 of 77

Optimization: Non-blocking NMP calls

[Timeline: the host traverses the host-side levels, offloads to the NMP core, then sits idle waiting for the NMP-side traversal to complete before getting the result and moving on to the next operation; the NMP core is also (possibly) idle between calls.]

70 of 77

Optimization: Non-blocking NMP calls

[Timeline: while the NMP core traverses the NMP-side levels for op1, the host already traverses the host-side levels for op2; the host gets op1's result, moves on to op3, the NMP core starts op2, and the idle gaps disappear.]
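The overlap can be sketched with a single-worker executor standing in for the NMP core; this is our own illustration (`host_side` and `nmp_side` are stand-in functions, not the talk's API): the host fires off each NMP call asynchronously and does the next operation's host-side traversal while it is in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def host_side(op):
    return f"pred-for-{op}"            # stand-in for the host-level traversal

def nmp_side(op, pred):
    return f"result-of-{op}"           # stand-in for the NMP-level traversal

def run_pipelined(ops):
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as nmp_core:   # the single NMP core
        for op in ops:
            pred = host_side(op)       # overlaps with the pending NMP call
            if pending is not None:
                results.append(pending.result())   # collect the previous result
            pending = nmp_core.submit(nmp_side, op, pred)
        if pending is not None:
            results.append(pending.result())
    return results
```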

71 of 77

Hybrid B+ Tree Performance

[Plot: hybrid B+ tree throughput across workloads (higher is better).]

B+ tree size: 512x LLC size

Uniform key distribution

8-thread execution

~60% increase in throughput across all workloads!

72 of 77

Hybrid Skiplist Performance

[Plot: hybrid skiplist throughput across workloads (higher is better).]

Skiplist size: 512x LLC size

Uniform key distribution

8-thread execution

>2.5X increase in throughput across all workloads!

73 of 77

Future Research Directions…

  • Using real NMP devices to implement hybrid data structures:
    • e.g., UPMEM memory module
    • Evaluation in the context of long-running, large-scale applications (limited to relatively short-lived workloads in the simulation framework)
    • Algorithms to dynamically push hot items to the on-chip cache (useful for high-skew workloads)
74 of 77

To Learn More

Read our NMP-based data structures publications:

Jiwon Choe, Amy Huang, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “Concurrent Data Structures with Near-Data-Processing: An Architecture-Aware Implementation,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2019.

Jiwon Choe, Andrew Crotty, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “HybriDS: Cache-Conscious Concurrent Data Structures for Near-Memory Processing Architectures,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2022.

Other publications using NMP:

Jiwon Choe, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “Attacking Memory-Hard Scrypt with Near-Data-Processing,” IEEE/ACM International Symposium on Memory Systems (MEMSYS), 2019.

S. Thomas, J. Choe, O. Gordon, E. Petrank, T. Moreshet, M. Herlihy, and R. I. Bahar, “Towards Hardware Accelerated Garbage Collection with Near-Memory Processing,” IEEE High Performance Extreme Computing (HPEC), September 2022.

Casey Nelson, Joseph Izraelevitz, R. Iris Bahar, and Tamara Lehman, “Eliminating Micro-architectural Side-Channel Attacks using Near Memory Processing,” IEEE International Symposium on Secure and Private Execution Environment Design (SEED), September 2022.


75 of 77

Want to Try It Out?

Our simulator is publicly available:

Brown-SMCSim: https://github.com/jiwon-choe/Brown-SMCSim

gem5 full-system simulator for NMP architecture

Provides cycle-accurate latencies for all components of the system

Software design for the skiplist and B+ tree data structures:

Pseudocode is provided in the SPAA 2022 paper:

Includes code for host-side and NMP-side insert operations

read and update operations follow the initial part of insert

remove operations are similar to insert, but must be aborted and retried if the node is locked


76 of 77

Acknowledgements

Jiwon Choe (formerly Brown, now at Apple)

Andrew Crotty (Northwestern)

Tali Moreshet (Boston Univ.)

Maurice Herlihy (Brown)

Iris Bahar (Colorado Sch. Mines)

Supported in part by NSF CISE grant #1909715

77 of 77

Questions?

Thanks!