1 of 77

Cache-Conscious Concurrent Data Structures for Near-Memory Computing

Maurice Herlihy

1 November 2023

2 of 77

3 of 77

Memory Wall


Source: Computer Architecture, A Quantitative Approach by Hennessy and Patterson

4 of 77

Memory Wall

5 of 77

Memory Wall


6 of 77

Memory Wall

[Diagram: the host processor (with its cache) sends a data request across the interconnect to memory and waits for the data to come back.]

7 of 77

Near-Memory Processing (NMP)

[Diagram: a near-memory compute unit sits next to memory; the host processor offloads computation across the interconnect and receives only the computation result, while its cache holds host-side data.]

8 of 77

Hardware Enablers of NMP

3D die-stacking:

High Bandwidth Memory (HBM)

Hybrid Memory Cube (HMC)

Image sources:

Jun et al. HBM DRAM Technology and Architecture. IMW 2017.

Hybrid Memory Cube Specification 2.1. 2015.

9 of 77

Hardware Enablers of NMP

Buffer chip area on DIMMs:

Load-Reduced DIMM (LR-DIMM)

Image source: Rosenfeld. Performance Exploration of the Hybrid Memory Cube. University of Maryland. 2014.

10 of 77

Latest HW enabler: BBcube 3D

The proposed technology uses a stacked design in which processing units (xPU) sit atop multiple interconnected memory layers (DRAM). Replacing wires with through-silicon vias (TSVs) shortens the connections, improving overall electrical performance.

Source: www.titech.ac.jp/english/news/2023/067046

11 of 77

Adopting NMP in real systems…

  1. Is NMP really effective?
    • Are NMP-based applications better than the state of the art?
    • Can (hidden) overheads diminish the benefits?

  2. Is NMP easy to implement and use?
    • What HW configuration makes sense?
    • Is it easy for an application developer to use?

12 of 77

NMP-based concurrent data structures

Concurrent data structures:

Basic building blocks of software

Real use cases in high-performance systems

Provided as packaged software libraries → easy to use

Irregular memory accesses in pointer-chasing data structures

NMP-based data structures:

Preserve high concurrency & correctness guarantees

Retain high on-chip cache locality (when applicable)

Utilize NMP hardware most effectively


13 of 77

Baseline NMP Architecture

[Diagram: host processors, each with an L1 cache, share a last-level cache and connect through an interconnect to main memory.]

14 of 77

Baseline NMP Architecture

[Diagram: the same architecture, with main memory split into host-accessible main memory and NMP-capable memory.]

15 of 77

Baseline NMP Architecture

[Diagram: the NMP-capable memory is divided into NMP partitions, each coupled with its own NMP core.]

16 of 77

Baseline NMP Architecture

Each NMP partition can be accessed by its coupled NMP core only.

NMP core:

  • simple in-order, single-cycle processor
  • small scratchpad memory:
    • memory-mapped for communication w/ host
    • stores instructions for data structure kernel
  • single node-sized buffer as data cache (based on findings from preliminary work)

17 of 77

Baseline NMP Architecture

Load-Reduced DIMM (LR-DIMM)

Hybrid Memory Cube (HMC)

Image sources:

Hybrid Memory Cube Specification 2.1. 2015.

Rosenfeld. Performance Exploration of the Hybrid Memory Cube. University of Maryland. 2014.

18 of 77

Baseline NMP Architecture

[Diagram: the full baseline NMP architecture — host processors with L1 caches and a shared last-level cache, an interconnect, host-accessible main memory, and NMP partitions each coupled to an NMP core.]

19 of 77

NMP-based Data Structures: Flat Combining


Hendler et al. (SPAA ‘10), Liu et al. (SPAA ‘17)

[Diagram: host processors post operations op1–op4 to a publication list; the NMP core picks them up and applies them to the data structure in its NMP partition.]
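The flat-combining pattern in the diagram can be sketched as follows. This is our own toy Python illustration (the class and names are ours, not from Hendler et al.): threads publish operations to a shared list, and whoever holds the combiner lock applies every pending operation, so the sequential structure is only ever touched by one thread at a time — the role the single NMP core plays here.

```python
import threading

class FlatCombiningSet:
    """Toy flat-combining wrapper around a sequential sorted list."""
    def __init__(self):
        self._ops = []                       # publication list: [op, arg, done, result]
        self._ops_lock = threading.Lock()
        self._combiner = threading.Lock()    # only the combiner touches _data
        self._data = []

    def _apply(self, op, arg):
        if op == "add":
            self._data.append(arg)
            self._data.sort()
            return True
        if op == "contains":
            return arg in self._data

    def execute(self, op, arg):
        record = [op, arg, threading.Event(), None]
        with self._ops_lock:
            self._ops.append(record)         # publish the operation
        while not record[2].is_set():
            if self._combiner.acquire(blocking=False):
                try:
                    with self._ops_lock:     # drain the publication list
                        pending, self._ops = self._ops, []
                    for rec in pending:      # combine: run every published op
                        rec[3] = self._apply(rec[0], rec[1])
                        rec[2].set()
                finally:
                    self._combiner.release()
            else:
                record[2].wait(0.001)        # wait until a combiner serves us
        return record[3]
```

In the NMP setting the combiner is fixed (the NMP core) rather than elected, but the publication-list protocol is the same.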

20 of 77

NMP-based Data Structures: Partitioning

[Diagram: the skiplist is range-partitioned across NMP partitions 1–3, each served by its own NMP core; host processors issue read(7), remove(4), read(13), and add(21), routing each operation to the core that owns the key's range.]

Liu et al. (SPAA 2017)
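The host-side routing decision can be sketched as follows; this is a hypothetical illustration (the boundary values and names are ours, not from Liu et al.), where each NMP core owns a contiguous key range and the host only decides which core to send an operation to.

```python
import bisect

# Assumed partition boundaries for three NMP cores:
# core 0 owns keys < 10, core 1 owns 10..19, core 2 owns >= 20.
PARTITION_BOUNDS = [10, 20]

def route(key, bounds=PARTITION_BOUNDS):
    """Return the index of the NMP core that owns `key`."""
    return bisect.bisect_right(bounds, key)
```

With these boundaries, read(7) goes to core 0, read(13) to core 1, and add(21) to core 2.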

21 of 77

Hierarchical Data Structures

Data structures in which nodes are organized by levels in a top-down hierarchy to provide O(log N) access time.

Example: skiplists, trees

Useful in applications that require quick key-based data lookup

Example: online transaction processing (OLTP) systems

Subject to memory latency bottlenecks due to pointer-chasing

Our NMP-based data structure work focuses on

skiplists

B+ trees


22 of 77

Skip Lists

Probabilistic Data Structure

No global rebalancing

Logarithmic-time search

[Diagram: a skiplist over keys 0, 2, 5, 7, 8, 9.]

23 of 77

Skip List Property

Each layer is sub-list of lower levels


24 of 77

Skip List Property

Each layer is sub-list of lower levels


25 of 77

Skip List Property

Each layer is sub-list of lower levels


26 of 77

Skip List Property

Each layer is sub-list of lower levels


27 of 77

Skip List Property

Each layer is sub-list of lower levels

Lowest level is entire list


28 of 77

Skip List Property

Each layer is sub-list of lower levels

Not easy to preserve in concurrent implementations …


29 of 77

Search


read(8)

Too far

30 of 77

Search


read(8)

OK

31 of 77

Search


read(8)

Too far

32 of 77

Search


read(8)

Too far

33 of 77

Search


read(8)

Yes!

34 of 77

Search


read(8)

35 of 77

Logarithmic


read(8)

O(log N)

36 of 77

Why Logarithmic?

  • Property: Each pointer at layer i jumps over roughly 2^i nodes
  • Pick node heights randomly so the property is guaranteed probabilistically
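The random height rule can be sketched with the standard coin-flip construction (our illustration, not code from the talk): flip a fair coin until tails, so a node reaches layer i with probability 2^-i, and each layer holds about half the nodes of the layer below — exactly the "pointer at layer i skips ~2^i nodes" property.

```python
import random

def random_height(max_height=32, p=0.5, rng=random.random):
    """Coin-flip height: reach layer i with probability p**i."""
    h = 1
    while rng() < p and h < max_height:
        h += 1
    return h

# With ~2^k nodes the tallest towers reach height ~k, so a search that
# starts at the top level takes O(log N) expected hops.
```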

37 of 77

Hierarchical Data Structures: Skiplist

  • Sorted linked list structure with multiple levels of lists
  • Node heights are assigned randomly from a particular distribution
  • List at each level is a sub-list of the list at level immediately below
  • Similar to binary search tree, but allows high levels of concurrency

[Diagram: a four-level skiplist; read(22) and read(10) descend from the top level to the bottom list.]
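The layered search in the earlier slides ("too far" → stop moving right, drop down a level) can be sketched sequentially; this is our own minimal illustration, with concurrency omitted.

```python
class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height     # next[i] = successor at level i

def insert(head, key, height):
    """Sequential insert: find the rightmost predecessor at each level."""
    node, preds = head, [None] * len(head.next)
    for level in reversed(range(len(head.next))):
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
        preds[level] = node
    new = Node(key, height)
    for level in range(height):
        new.next[level] = preds[level].next[level]
        preds[level].next[level] = new

def search(head, key):
    """Move right while the next key is too small, else drop a level."""
    node = head
    for level in reversed(range(len(head.next))):
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
    node = node.next[0]
    return node is not None and node.key == key
```

The sentinel head node has the maximum height and a key smaller than every real key.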

38 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP skiplist vs. the host lock-free skiplist (higher is better).]

Choe et al. (SPAA 2019)

Skiplist size: 6x LLC size

Workload: 90/5/5 (read/insert/remove), uniform

39 of 77

Skiplist Performance Analysis

[Diagram: the same skiplist, with the traversal paths of read(22), read(10), read(26), and read(8) highlighted.]

Choe et al. (SPAA 2019)

40 of 77

Skiplist Performance Analysis


Assume a cache that stores half of all data:

Choe et al. (SPAA 2019)

41 of 77

Skiplist Performance Analysis


read(22)

only this access goes out to memory

Choe et al. (SPAA 2019)


42 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP-based skiplist vs. the host lock-free skiplist (higher is better).]

Choe et al. (SPAA 2019)

Skiplist size: 400x LLC size

Workload: 90/5/5 (read/insert/remove), uniform

43 of 77

Skiplist Performance Analysis

[Plot: throughput of the 8-partition NMP-based skiplist (LLC size:skiplist size = 1:400) vs. host lock-free (LF) skiplists at LLC:skiplist size ratios of 1:200, 1:400, 1:800, 1:3000, and 1:25000 (higher is better).]

Choe et al. (SPAA 2019)

Workload: 90/5/5 (read/insert/remove), uniform

44 of 77

Hybrid Skiplist

Higher-level nodes (host-managed):

    • Implemented as a lock-free skiplist in host main memory
    • “Pins” frequently traversed nodes in the host cache

Lower-level nodes (NMP-managed):

    • Implemented as an NMP-based skiplist
    • Prevents infrequently accessed nodes from polluting the cache

Choe et al. (SPAA 2022)

45 of 77

Hybrid Skiplist – Concurrency

Skiplist property: list at level i is a sub-list of the list at level (i −1)

Insertions: complete NMP-side first, then host-side

Deletions: complete host-side first, then NMP-side
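The ordering rule can be sketched schematically (our illustration, with Python sets standing in for the two halves; all names are ours): because the host-side upper levels must always be a sub-list of the NMP-side bottom list, inserts publish to the NMP side first, and removes retract from the host side first.

```python
def hybrid_insert(key, height, nmp_side, host_side, boundary):
    nmp_side.add(key)                 # 1) NMP-side bottom list first
    if height > boundary:             # tall nodes also appear host-side
        host_side.add(key)            # 2) then the host-side upper levels

def hybrid_remove(key, nmp_side, host_side):
    host_side.discard(key)            # 1) host-side upper levels first
    nmp_side.discard(key)             # 2) then the NMP-side bottom list
```

At every intermediate point the host-side set is a subset of the NMP-side set, which is the skiplist property the slide states.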


Choe et al. (SPAA 2022)

46 of 77

Hybrid Skiplist – Concurrency

[Diagram: the hybrid skiplist spanning the host cache and NMP partitions 1–3; thread1's insert(5) passes through node 4 while thread2's remove(4) unlinks it — WRONG!!!]

thread1

thread2

insert(5)

Begin at node 4

remove(4)

Begin at node 1

Choe et al. (SPAA 2022)

47 of 77

Hybrid Skiplist – Concurrency

[Diagram: the same scenario; the inconsistency is detected, and thread1 retries from the host.]

retry from host

thread1

thread2

48 of 77

Hybrid Skiplist Performance

(higher is better)

YCSB-C workload (read-only, Zipfian distribution)

Skiplist size: 512x LLC size

[Plot: throughput of NMP-based, host lock-free, and hybrid skiplists; hybrid achieves a 46% increase.]

49 of 77

Hybrid Skiplist Performance

(higher is better)

Skiplist size: 512x LLC size

Uniform key distribution

8-thread execution

[Plot: throughput of host lock-free vs. hybrid skiplists; hybrid achieves 46–60% increases.]

50 of 77

B+ Tree

[Diagram: an example B+ tree; inner nodes hold navigation information, and the data is stored at the leaves.]

51 of 77

B+ Tree Split & Merge


At most this full, else split

At least this full, else merge
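The overflow/underflow thresholds can be sketched for a leaf of assumed capacity `CAP` (our own illustration, not the talk's pseudocode): a node may hold at most `CAP` keys, else it splits, and at least `CAP // 2` keys, else it merges with a sibling.

```python
import bisect

CAP = 4  # assumed node capacity for this sketch

def insert_into_leaf(keys, key):
    """Insert `key` into a sorted leaf; on overflow split into halves."""
    bisect.insort(keys, key)
    if len(keys) <= CAP:
        return keys, None, None          # still fits: no split
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]         # separator key goes up to the parent

def needs_merge(keys):
    """A node less than half full must merge with (or borrow from) a sibling."""
    return len(keys) < CAP // 2
```

Because the separator is inserted into the parent, a split (or merge) can overflow (or underflow) the parent in turn — which is exactly the recursive case on the next slides.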

52 of 77

B+ Tree Recursive Split

[Diagram: an insertion overflows a leaf — “I just overflowed this node, time to split!”]

53 of 77

B+ Tree Recursive Split

[Diagram: the leaf split pushes a separator key into the parent, which overflows in turn — “I just overflowed this node, time to split!”]

54 of 77

B+ Tree Recursive Split

[Diagram: the tree after the split has propagated upward.]

55 of 77

B+ Tree Recursive Merge

[Diagram: a removal underflows a leaf — “I just underflowed this node, time to merge!”]

56 of 77

B+ Tree Recursive Merge

[Diagram: the leaf merge leaves the parent underfull, so the merge propagates upward — “I just underflowed this node, time to merge!”]

57 of 77

Hierarchical Data Structures: B+ Tree

N-ary balanced tree

Data stored at leaf nodes, ordered by keys

Node size set to cache block size to exploit spatial locality

Nodes are always at least half-full

[Diagram: read(14) and read(56) descend from the root through inner nodes to the leaves.]
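A minimal sequential lookup over this layout might look like the following; this is our own sketch (the `Inner`/`Leaf` names are ours), with inner nodes holding separator keys and child pointers while data lives only in the leaves.

```python
import bisect

class Inner:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children   # len(children) == len(keys) + 1

class Leaf:
    def __init__(self, keys):
        self.keys = keys                            # sorted data keys

def lookup(node, key):
    while isinstance(node, Inner):
        # follow the child whose key range contains `key`
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys
```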

58 of 77

Hierarchical Data Structures: B+ Tree


Concurrent modifications are difficult to manage


59 of 77

Hybrid B+ Tree – Concurrency

[Diagram: a hybrid B+ tree; insert(57) descends through host-side inner nodes toward an NMP-side leaf holding keys 55–59.]

60 of 77

Hybrid B+ Tree – Concurrency


insert(57)

61 of 77

Hybrid B+ Tree – Concurrency


read(59)

insert(57)

thread1

thread2

A

62 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2's insert(57) splits leaf A into A and A’; thread1's read(59) still holds a pointer to A.]

63 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2 is done, but thread1's read(59) follows the stale pointer into A and misses keys moved to A’ — WRONG!!!]

64 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: each node carries a sequence number (Seq#), and each child records the Seq# of its parent.]

65 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: node A has Seq# 2; both thread1 and thread2 record “expects parent #: 2” before descending.]

66 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread2's insert(57) splits node A and bumps its Seq# from 2 to 3; the children now record parent Seq# 3.]

67 of 77

Hybrid B+ Tree – Concurrency

read(59)

insert(57)

[Diagram: thread1 reaches a node recording parent Seq# 3 but expected 2 — the mismatch is detected, and thread1 retries from the host; thread2 is done.]
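The Seq#-validation idea can be sketched as follows; this is a schematic, single-threaded illustration of the retry condition (all names are ours): a reader remembers the parent's sequence number when it takes a child pointer, a concurrent split bumps that number, and the reader detects the mismatch and retries from the host.

```python
class SeqNode:
    def __init__(self, parent_seq, keys):
        self.parent_seq = parent_seq   # parent's Seq# when this link was made
        self.keys = keys

def descend(child, expected_parent_seq):
    """Follow a child pointer optimistically; None means 'retry from host'."""
    if child.parent_seq != expected_parent_seq:
        return None                    # a concurrent split bumped the Seq#
    return child.keys
```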

68 of 77

Hybrid B+ Tree Performance


B+ tree size: 512x LLC size

Uniform key distribution

8-thread execution

[Plot: throughput of host-only vs. hybrid B+ tree (higher is better).]

69 of 77

Optimization: Non-blocking NMP calls

[Timeline: the host traverses the host-side levels, offloads to the NMP core, then sits idle waiting for the NMP-side traversal to complete before getting the result and moving on to the next operation; the NMP core is also (possibly) idle between calls.]

70 of 77

Optimization: Non-blocking NMP calls

[Timeline: while the NMP core traverses the NMP-side levels for op1, the host already traverses the host-side levels for op2; the host gets op1's result, moves on to op3, the NMP core starts op2, and the idle gaps disappear.]
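The overlap can be sketched with a single-worker executor standing in for the NMP core; this is our own illustration (`host_side` and `nmp_side` are stand-in functions, not the talk's API): the host fires off each NMP call asynchronously and does the next operation's host-side traversal while it is in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def host_side(op):
    return f"pred-for-{op}"            # stand-in for the host-level traversal

def nmp_side(op, pred):
    return f"result-of-{op}"           # stand-in for the NMP-level traversal

def run_pipelined(ops):
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as nmp_core:   # the single NMP core
        for op in ops:
            pred = host_side(op)       # overlaps with the pending NMP call
            if pending is not None:
                results.append(pending.result())   # collect the previous result
            pending = nmp_core.submit(nmp_side, op, pred)
        if pending is not None:
            results.append(pending.result())
    return results
```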

71 of 77

Hybrid B+ Tree Performance

[Plot: hybrid B+ tree throughput across workloads (higher is better).]

B+ tree size: 512x LLC size

Uniform key distribution

8-thread execution

~60% increase in throughput across all workloads!

72 of 77

Hybrid Skiplist Performance

[Plot: hybrid skiplist throughput across workloads (higher is better).]

Skiplist size: 512x LLC size

Uniform key distribution

8-thread execution

>2.5X increase in throughput across all workloads!

73 of 77

Future Research Directions…

  • Using real NMP devices to implement hybrid data structures:
    • e.g., UPMEM memory module
    • Evaluation in the context of long-running, large-scale applications (limited to relatively short-lived workloads in the simulation framework)
    • Algorithms to dynamically push hot items to the on-chip cache (useful for high-skew workloads)
74 of 77

To Learn More

Read our NMP-based data structures publications:

Jiwon Choe, Amy Huang, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “Concurrent Data Structures with Near-Data-Processing: An Architecture-Aware Implementation,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2019.

Jiwon Choe, Andrew Crotty, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “HybriDS: Cache-Conscious Concurrent Data Structures for Near-Memory Processing Architectures,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2022.

Other publications using NMP:

Jiwon Choe, Tali Moreshet, Maurice Herlihy, and R. Iris Bahar, “Attacking Memory-Hard Scrypt with Near-Data-Processing,” IEEE/ACM International Symposium on Memory Systems (MEMSYS), 2019.

S. Thomas, J. Choe, O. Gordon, E. Petrank, T. Moreshet, M. Herlihy, and R. I. Bahar, “Towards Hardware Accelerated Garbage Collection with Near-Memory Processing,” IEEE High Performance Extreme Computing (HPEC), September 2022.

Casey Nelson, Joseph Izraelevitz, R. Iris Bahar, and Tamara Lehman, “Eliminating Micro-architectural Side-Channel Attacks using Near Memory Processing,” IEEE International Symposium on Secure and Private Execution Environment Design (SEED), September 2022.


75 of 77

Want to Try It Out?

Our simulator is publicly available:

Brown-SMCSim: https://github.com/jiwon-choe/Brown-SMCSim

gem5 full-system simulator for NMP architecture

Provides cycle-accurate latencies for all components of the system

Software design for the skiplist and B+ tree data structures:

Pseudocode is provided in the SPAA 2022 paper:

Includes code for host-side and NMP-side insert operations

read and update operations follow the initial part of insert

remove operations are similar to insert, but must be aborted and retried if the node is locked


76 of 77

Acknowledgements

Jiwon Choe (formerly Brown, now at Apple)

Andrew Crotty (Northwestern)

Tali Moreshet (Boston Univ.)

Maurice Herlihy (Brown)

Iris Bahar (Colorado Sch. Mines)

Supported in part by NSF CISE grant #1909715

77 of 77

Questions?

Thanks!