Cache-Conscious Concurrent Data Structures for Near-Memory Computing
Maurice Herlihy
1 November 2023
Memory Wall
Source: Computer Architecture, A Quantitative Approach by Hennessy and Patterson
[Figure: the host processor sends data requests over an interconnect to memory; the data travels back into the processor's cache.]
Near-Memory Processing (NMP)
[Figure: with NMP, the host processor offloads computation over the interconnect to a near-memory compute unit; only the computation result travels back to the host's cache.]
Hardware Enablers of NMP
3D die-stacking:
High Bandwidth Memory (HBM)
Hybrid Memory Cube (HMC)
Image sources:
Jun et al. HBM DRAM Technology and Architecture. IMW 2017.
Hybrid Memory Cube Specification 2.1. 2015.
Hardware Enablers of NMP
Buffer chip area on DIMMs:
Load-Reduced DIMM (LR-DIMM)
Image source: Rosenfeld. Performance Exploration of the Hybrid Memory Cube. University of Maryland, 2014.
Latest HW enabler: BBcube 3D
The proposed technology stacks processing units (xPUs) directly atop multiple interconnected memory layers (DRAM). Replacing long wires with through-silicon vias (TSVs) shortens the interconnections, improving overall electrical performance.
Source: www.titech.ac.jp/english/news/2023/067046
Adopting NMP in real systems…
NMP-based concurrent data structures
Concurrent data structures:
Basic building blocks of software
Real use cases in high-performance systems
Provided as packaged software libraries 🡪 easy to use
Irregular memory accesses in pointer-chasing data structures
NMP-based data structures:
Preserve high concurrency & correctness guarantees
Retain high on-chip cache locality (when applicable)
Utilize NMP hardware most effectively
Baseline NMP Architecture
[Figure: several host processors, each with an L1 cache, share a last-level cache and connect through an interconnect to main memory.]
[Figure: the memory is split into host-accessible main memory and NMP-capable memory.]
[Figure: the NMP-capable memory is divided into NMP partitions, each coupled to its own NMP core.]
Each NMP partition can be accessed only by its coupled NMP core.
NMP-based Data Structures: Flat Combining
Hendler et al. (SPAA '10), Liu et al. (SPAA '17)
[Figure: host processors publish operations (op1–op4) to a publication list; the NMP core collects them and applies them to the data structure in its partition.]
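The flat-combining pattern above can be sketched in Python (a simplified single-lock model; the class and method names are ours, not from the cited papers; in an NMP setting the NMP core plays the combiner role):

```python
import threading

class FlatCombiningStack:
    """Threads publish operations to a shared list; whichever thread
    holds the combiner lock applies all pending operations in sequence."""

    def __init__(self):
        self.publication_list = []      # pending operation records
        self.combiner_lock = threading.Lock()
        self.items = []                 # the sequential data structure

    def apply(self, op, arg=None):
        record = {"op": op, "arg": arg, "result": None,
                  "done": threading.Event()}
        self.publication_list.append(record)
        while not record["done"].is_set():
            if self.combiner_lock.acquire(blocking=False):
                try:
                    self._combine()     # we won the combiner role
                finally:
                    self.combiner_lock.release()
            record["done"].wait(timeout=0.01)
        return record["result"]

    def _combine(self):
        # Apply every published operation without per-operation locking.
        while self.publication_list:
            rec = self.publication_list.pop(0)
            if rec["op"] == "push":
                self.items.append(rec["arg"])
            elif rec["op"] == "pop":
                rec["result"] = self.items.pop() if self.items else None
            rec["done"].set()
```

The point of the pattern: contended per-node synchronization is replaced by one sequential pass over the published operations, which is exactly the role a single NMP core can take on.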
NMP-based Data Structures: Partitioning
[Figure: the key space is range-partitioned across three NMP partitions; each host operation (read(7), remove(4), read(13), add(21)) is routed to the NMP core that owns its key's range.]
Liu et al. (SPAA 2017)
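The routing step can be sketched as follows (partition boundaries and names are illustrative, not from the paper):

```python
import bisect

class PartitionedSet:
    """Range-partitions the key space; each partition models the memory
    owned by one NMP core, which serializes operations on its range."""

    def __init__(self, boundaries):
        # boundaries[i] is the smallest key owned by partition i+1,
        # e.g. [11, 21] gives ranges (-inf,11), [11,21), [21,+inf).
        self.boundaries = boundaries
        self.partitions = [set() for _ in range(len(boundaries) + 1)]

    def _route(self, key):
        # Pick the partition (i.e., the NMP core) that owns this key.
        return self.partitions[bisect.bisect_right(self.boundaries, key)]

    def add(self, key):
        self._route(key).add(key)

    def remove(self, key):
        self._route(key).discard(key)

    def read(self, key):
        return key in self._route(key)
```

Because any two operations on the same key always land on the same single-threaded core, no cross-partition synchronization is needed.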
Hierarchical Data Structures
Data structures in which nodes are organized by levels in a top-down hierarchy to provide O(log N) access time.
Example: skiplists, trees
Useful in applications that require quick key-based data lookup
Example: online transaction processing (OLTP) systems
Subject to memory latency bottlenecks due to pointer-chasing
Our NMP-based data structure work focuses on
skiplists
B+ trees
Skip Lists
Probabilistic Data Structure
No global rebalancing
Logarithmic-time search
[Figure: a skiplist over keys 0, 2, 5, 7, 8, 9 with several levels of forward pointers.]
Skip List Property
Each layer is a sub-list of the lower levels
Lowest level is the entire list
Not easy to preserve in concurrent implementations…
[Figure: the levels are built up one at a time over keys 0, 2, 5, 7, 8, 9.]
Search
read(8): start at the top level; move right while the next key is not past the target ("OK"), and drop down a level when it is ("Too far"), until the key is found.
[Figure: the search path of read(8) through the skiplist over keys 0, 2, 5, 7, 8, 9.]
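The traversal above can be sketched as a minimal skiplist (parameters are illustrative; the "too far" test is the key comparison in the inner loop):

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height   # next[i] = successor at level i

class SkipList:
    """Minimal skiplist sketch: a search moves right while the next key
    is below the target, and drops a level when it would overshoot."""

    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(float("-inf"), self.MAX_LEVEL)

    def insert(self, key):
        height = 1
        while height < self.MAX_LEVEL and random.random() < 0.5:
            height += 1               # coin flips give probabilistic balance
        node = Node(key, height)
        cur = self.head
        for level in range(self.MAX_LEVEL - 1, -1, -1):
            while cur.next[level] and cur.next[level].key < key:
                cur = cur.next[level]
            if level < height:        # splice in at the levels we occupy
                node.next[level] = cur.next[level]
                cur.next[level] = node

    def read(self, key):
        cur = self.head
        for level in range(self.MAX_LEVEL - 1, -1, -1):
            while cur.next[level] and cur.next[level].key < key:
                cur = cur.next[level]   # next key not past target: go right
            # next key would be "too far": drop down a level
        cur = cur.next[0]
        return cur is not None and cur.key == key
```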
Logarithmic
[Figure: the read(8) search path takes O(log N) steps.]
Why Logarithmic?
[Figure: in an idealized skiplist, nodes at level i are 2^i apart, so each level skips twice as far as the one below; a search halves the remaining distance at every level.]
Hierarchical Data Structures: Skiplist
[Figure: a skiplist over keys 1–31; read(22) and read(10) traverse the upper index levels down to the bottom level, which holds all keys.]
Skiplist Performance Analysis
[Chart: throughput of the 8-partition NMP skiplist vs. the host lock-free skiplist (higher is better).]
Choe et al. (SPAA 2019)
Skiplist size: 6x LLC size
Workload: 90/5/5 (read/insert/remove), uniform
[Figure: the same skiplist serving read(22), read(10), read(26), and read(8).]
Choe et al. (SPAA 2019)
Assume a cache that stores half of all data:
[Figure: the upper levels of the skiplist fit entirely in the cache.]
Choe et al. (SPAA 2019)
[Figure: with the upper levels cached, read(22) hits in the cache at every step except the last; only that final access goes out to memory.]
Choe et al. (SPAA 2019)
[Chart: throughput of the 8-partition NMP-based skiplist vs. the host lock-free skiplist (higher is better).]
Choe et al. (SPAA 2019)
Skiplist size: 400x LLC size
Workload: 90/5/5 (read/insert/remove), uniform
[Chart: throughput of the 8-partition NMP-based skiplist (LLC size:skiplist size = 1:400) vs. host lock-free (LF) skiplists at ratios 1:200, 1:400, 1:800, 1:3000, and 1:25000 (higher is better).]
Choe et al. (SPAA 2019)
Workload: 90/5/5 (read/insert/remove), uniform
Hybrid Skiplist
Higher-level nodes: host-managed
Lower-level nodes: NMP-managed
Choe et al. (SPAA 2022)
Hybrid Skiplist – Concurrency
Skiplist property: the list at level i is a sub-list of the list at level (i − 1)
Insertions: complete NMP-side first, then host-side
Deletions: complete host-side first, then NMP-side
Choe et al. (SPAA 2022)
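The ordering discipline can be sketched on a simplified two-level model (names are ours; the full protocol is in the SPAA 2022 paper):

```python
# Sketch of the hybrid skiplist's ordering rule on a two-level model:
# the NMP side holds the complete key set, the host side an index over
# a subset. Inserts finish NMP-side first; deletes finish host-side
# first, so the host index never refers to a key the NMP side lacks.

class HybridSet:
    def __init__(self):
        self.nmp_levels = set()    # lower levels: complete key set
        self.host_levels = set()   # upper levels: index over a subset

    def insert(self, key, index_it=False):
        self.nmp_levels.add(key)         # 1) complete NMP-side first
        if index_it:
            self.host_levels.add(key)    # 2) then publish host-side

    def remove(self, key):
        self.host_levels.discard(key)    # 1) retract host-side first
        self.nmp_levels.discard(key)     # 2) then remove NMP-side

    def invariant(self):
        # The skiplist property: host index is a sub-list of NMP levels.
        return self.host_levels <= self.nmp_levels
```

With this ordering, the sub-list invariant holds after every individual step, not just between operations.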
Hybrid Skiplist – Concurrency
[Figure: thread1's insert(5) begins at host-cached node 4 while thread2's remove(4) begins at node 1; if remove(4) completes first, insert(5) continues from a node that is no longer in the list. WRONG!]
Choe et al. (SPAA 2022)
[Figure: when the NMP core detects that node 4 was removed, thread1's insert(5) aborts and retries from the host.]
Hybrid Skiplist Performance
[Chart: throughput of hybrid vs. NMP-based vs. host lock-free skiplists (higher is better); the hybrid shows a 46% increase.]
YCSB-C workload (read-only, Zipfian distribution)
Skiplist size: 512x LLC size
[Chart: throughput of hybrid vs. host lock-free skiplists (higher is better); increases of 46% and 60%.]
Skiplist size: 512x LLC size
Uniform key distribution
8-thread execution
B+ Tree
[Figure: navigation information (keys 10, 20) is kept in the inner nodes; data (2, 8, 10, 15, 20) is stored at the leaves.]
B+ Tree Split & Merge
[Figure: a node may be at most full, else it splits; it must be at least half-full, else it merges.]
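The occupancy thresholds can be sketched for a single leaf (the order and names are illustrative):

```python
ORDER = 4  # maximum keys per node (illustrative)

class Leaf:
    """A B+ tree leaf obeying the occupancy rule: at most ORDER keys,
    else it splits; at least ORDER // 2 keys, else it should merge."""

    def __init__(self, keys=None):
        self.keys = sorted(keys or [])

    def insert(self, key):
        self.keys.append(key)
        self.keys.sort()
        if len(self.keys) > ORDER:        # overflow: time to split!
            mid = len(self.keys) // 2
            right = Leaf(self.keys[mid:])
            self.keys = self.keys[:mid]
            return right.keys[0], right   # separator key goes to the parent
        return None                       # no structural change needed

    def remove(self, key):
        self.keys.remove(key)
        # True signals underflow: merge with (or borrow from) a sibling.
        return len(self.keys) < ORDER // 2
```

Returning the promoted separator to the caller is what makes splits recursive: the parent inserts it and may overflow in turn.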
B+ Tree Recursive Split
[Figure: inserting into a full leaf splits it; the separator key promoted to the parent can overflow the parent, which splits in turn ("I just overflowed this node, time to split!" propagates up the tree).]
B+ Tree Recursive Merge
[Figure: removing a key underflows a leaf; merging it with a sibling removes a separator from the parent, which can underflow and merge in turn ("I just underflowed this node, time to merge!" propagates up the tree).]
Hierarchical Data Structures: B+ Tree
N-ary balanced tree
Data stored at leaf nodes, ordered by keys
Node size set to cache block size to exploit spatial locality
Nodes are always at least half-full
[Figure: read(14) and read(56) traverse from the root down to the leaves.]
Concurrent modifications are difficult to manage
Hybrid B+ Tree – Concurrency
[Figure: insert(57) descends through host-side inner nodes toward a full leaf in the NMP partition.]
[Figure: thread1 issues read(59) and thread2 issues insert(57); both descend toward the same leaf node A.]
[Figure: thread2's insert(57) splits node A into A and A′, moving some keys (including 59) into A′.]
[Figure: thread1's read(59), still holding a pointer to the old node A, searches it after the split and misses key 59. WRONG!]
Hybrid B+ Tree – Concurrency
Fix: every node carries a sequence number (Seq#), and a traversal records the Seq# of the parent it descended through.
[Figure: both threads descend expecting parent Seq# 2; thread2's insert(57) splits the node and bumps its Seq# to 3; thread1's read(59) then finds Seq# 3 instead of the expected 2, so it retries from the host.]
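The sequence-number check can be sketched as follows (a minimal model; the field and function names are ours):

```python
class Node:
    def __init__(self, keys):
        self.keys = keys
        self.seq = 0         # bumped on every structural change

class RetryFromHost(Exception):
    """Raised when validation fails; the caller restarts at the host."""

def validated_read(node, expected_seq, key):
    # A traversal remembers the seq# it saw when it took the pointer;
    # if a concurrent split bumped it, the contents may have moved.
    if node.seq != expected_seq:
        raise RetryFromHost
    return key in node.keys

def split(node):
    mid = len(node.keys) // 2
    right = Node(node.keys[mid:])
    node.keys = node.keys[:mid]
    node.seq += 1            # invalidates all in-flight traversals
    return right
```

The retry is cheap relative to the alternative of locking every node on the search path, because validation only fails when a structural change actually raced with the traversal.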
Hybrid B+ Tree Performance
[Chart: throughput of hybrid vs. host-only B+ trees (higher is better).]
B+ tree size: 512x LLC size
Uniform key distribution
8-thread execution
Optimization: Non-blocking NMP calls
[Timeline: the host traverses the host-side levels, then sits idle waiting for the NMP core to traverse the NMP-side levels; once the result arrives, the host moves on to the next operation. Each side is idle while the other works.]
[Timeline: with non-blocking calls, the host traverses the host-side levels of op2 while the NMP core traverses the NMP-side levels of op1; when op1's result arrives, the host issues op2's NMP call and starts on op3, keeping both sides busy.]
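The overlap can be sketched with a one-worker executor standing in for the single NMP core (the function names are placeholders, not the real interface):

```python
from concurrent.futures import ThreadPoolExecutor

def host_side(op):
    return f"host({op})"        # stand-in for the host-level traversal

def nmp_side(partial):
    return f"nmp({partial})"    # stand-in for the NMP-level traversal

def run_pipelined(ops):
    """Non-blocking offload: start op N+1's host-side traversal while
    the NMP core is still working on op N's NMP-side traversal."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as nmp_core:  # one NMP core
        pending = None
        for op in ops:
            partial = host_side(op)          # overlaps the pending NMP call
            if pending is not None:
                results.append(pending.result())  # collect previous op
            pending = nmp_core.submit(nmp_side, partial)
        if pending is not None:
            results.append(pending.result())
    return results
```

Results still come back in operation order; only the waiting is overlapped with useful host-side work.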
Hybrid B+ Tree Performance
[Chart: hybrid B+ tree throughput with non-blocking NMP calls (higher is better).]
B+ tree size: 512x LLC size
Uniform key distribution
8-thread execution
~60% increase in throughput across all workloads!
Hybrid Skiplist Performance
[Chart: hybrid skiplist throughput with non-blocking NMP calls (higher is better).]
Skiplist size: 512x LLC size
Uniform key distribution
8-thread execution
>2.5X increase in throughput across all workloads!
Future Research Directions…
To Learn More
Read our NMP-based data structures publications:
Jiwon Choe, Amy Huang, Tali Moreshet, Maurice Herlihy, R. Iris Bahar, “Concurrent Data Structures with Near-Data-Processing: An Architecture-Aware Implementation,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2019.
Jiwon Choe, Andrew Crotty, Tali Moreshet, Maurice Herlihy, R. Iris Bahar, “HybriDS: Cache-Conscious Concurrent Data Structures for Near-Memory Processing Architectures,” ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2022.
Other publications using NMP:
Jiwon Choe, Tali Moreshet, Maurice Herlihy, R. Iris Bahar, “Attacking Memory-Hard Scrypt with Near-Data-Processing,” IEEE/ACM International Symposium on Memory Systems (MEMSYS), 2019.
S. Thomas, J. Choe, O. Gordon, E. Petrank, T. Moreshet, M. Herlihy, R. I. Bahar, “Towards Hardware Accelerated Garbage Collection with Near-Memory Processing,” IEEE High Performance Extreme Computing (HPEC), September 2022.
Casey Nelson, Joseph Izraelevitz, R. Iris Bahar, Tamara Lehman, “Eliminating Micro-architectural Side-Channel Attacks using Near Memory Processing,” IEEE International Symposium on Secure and Private Execution Environment Design (SEED), September 2022.
Want to Try It Out?
Our simulator is publicly available:
Brown-SMCSim: https://github.com/jiwon-choe/Brown-SMCSim
gem5 full-system simulator for NMP architecture
Provides cycle-accurate latencies for all components of system
Software design for the skiplist and B+ tree data structures:
Pseudocode is provided in the SPAA 2022 paper
Includes code for host-side and NMP-side insert operations
read and update operations follow the initial part of insert
remove operations are similar to insert, but must be aborted and retried if the node is locked
Acknowledgements
Jiwon Choe,
Formerly Brown
Now at Apple
Andrew Crotty
Northwestern
Tali Moreshet
Boston Univ.
Maurice Herlihy
Brown
Supported in part by NSF CISE grant #1909715
Iris Bahar
Colorado Sch. Mines
Questions?
Thanks!