GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
Cong Guo*, Rui Zhang*, Jiale Xu, Jingwen Leng+, Zihan Liu, Ziyu Huang, Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao+, Ke Zhang
*equal contribution +corresponding author
Outline
01 Background & Motivation
02 Implementation
03 Evaluation
04 Q&A
Challenges of GPU Memory in LLMs

There is a significant memory wall in LLM training.
[Figure: GMLake's contribution level in the software stack, between DNN training frameworks (e.g., Megatron-LM) and low-level GPU runtimes (CUDA, ROCm).]
Massive and chronologically ordered allocations occur in the deep learning framework's caching allocator during the forward and backward passes.
[Figure: Deep learning framework memory management.
Native allocator: every memory request from program execution calls cuMemAlloc on the device, and every release calls cuMemFree.
Caching allocator: a memory request first goes through cachingAlloc, which checks whether a cached block can be reused. If yes, the cached memory is returned directly; if no, cuMemAlloc is invoked. On release, the block is tagged as a cached memory block and kept in the caching memory pool instead of being freed.]
The caching allocator delivers more than 5× higher throughput, whereas the native allocator leads to massive device-level memory allocation calls and overhead.
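To make the mechanism concrete, here is a minimal sketch of a size-keyed caching allocator over the CUDA driver API. The class and member names are ours for illustration, not PyTorch's actual implementation; a valid CUDA context is assumed to exist.

```cpp
#include <cuda.h>
#include <map>

// Illustrative sketch of a caching allocator: cuMemFree is never called
// on the hot path; freed blocks are tagged as cached and reused later.
class CachingAllocator {
    std::multimap<size_t, CUdeviceptr> cached_;  // size -> cached blocks

public:
    CUdeviceptr alloc(size_t size) {
        // Reuse the smallest cached block that fits, if any.
        auto it = cached_.lower_bound(size);
        if (it != cached_.end()) {
            CUdeviceptr ptr = it->second;
            cached_.erase(it);
            return ptr;             // reuse: no device-level call
        }
        CUdeviceptr ptr = 0;
        cuMemAlloc(&ptr, size);     // slow path: device-level allocation
        return ptr;
    }

    void free(CUdeviceptr ptr, size_t size) {
        // Tag as a cached memory block instead of calling cuMemFree.
        cached_.emplace(size, ptr);
    }
};
```

Note how reuse may hand back a block larger than requested; that slack, compounded by splitting, is exactly where fragmentation comes from.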
Memory-Efficient Optimizations in DNN Training
Activation recomputation (R) changes the memory access pattern:
[Figure: vanilla pattern vs. pattern after R.]
Vanilla workload: regular and even memory access pattern.
Scheduled and optimized workload: irregular and uneven pattern.
[Figure: fragmentation under different strategy combinations and different numbers of distributed training devices.]
Memory-efficient optimizations in DNN training lead to fragmentation.
Case Study: CUDA Caching Allocator

[Figure: a sequence of allocation requests served by the original caching allocator vs. GMLake. The original allocator repeatedly splits cached blocks to serve smaller requests, scattering free fragments across the memory pool until a later request (size 6) cannot be served and triggers OOM. GMLake stitches two free fragments (sizes 2 and 5) into one contiguous virtual memory range and serves the same request.]

The mechanism of the caching allocator facilitates fragmentation, but VM stitching can help.
New Opportunities and Challenges with VMM (Virtual Memory Management)

Opportunities: VMM exposes fine-grained physical memory chunks and reserved device virtual memory addresses, which can be stitched together:
1. for i in range(5): C[i] = Create(granularity)
2. VA = AddrReserve(size)
3. for i in range(5): Map(VA + i * granularity, C[i])
4. Stitch complete

Challenge: fine-grained VM stitching leads to unacceptable overhead.
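The four steps above map directly onto the CUDA driver's VMM API. Below is a minimal sketch of that mapping; the chunk count of five and the error-checking macro are ours for illustration.

```cpp
#include <cuda.h>
#include <cstdio>
#include <vector>

// Minimal sketch of the four stitching steps with the CUDA driver VMM
// API; error handling is collapsed into a macro for brevity.
#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    std::fprintf(stderr, "CUDA error %d at line %d\n", r, __LINE__);   \
    return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;  // minimum physical chunk size
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    const int kChunks = 5;
    std::vector<CUmemGenericAllocationHandle> C(kChunks);

    // 1. Create fine-grained physical memory chunks.
    for (int i = 0; i < kChunks; ++i)
        CHECK(cuMemCreate(&C[i], granularity, &prop, 0));

    // 2. Reserve one contiguous virtual address range.
    CUdeviceptr va = 0;
    CHECK(cuMemAddressReserve(&va, kChunks * granularity, 0, 0, 0));

    // 3. Map every physical chunk into the reserved range.
    for (int i = 0; i < kChunks; ++i)
        CHECK(cuMemMap(va + i * granularity, granularity, 0, C[i], 0));

    // 4. Enable access: the stitched range now acts as one buffer.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(va, kChunks * granularity, &access, 1));

    std::printf("stitched %d chunks at %p\n", kChunks, (void*)va);
    return 0;
}
```

Each allocation needing this many driver calls is where the overhead comes from, which is why naive fine-grained stitching is too slow on its own.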
02 Implementation
GMLake Overview

[Figure: GMLake architecture. New-tensor and delete-tensor requests from the tensor allocator are served either by the original caching allocator over its native memory pool, or by the GMLake allocator (ours) over a VMM-based pool on GPU memory.]

GMLake consists of three layers:
- Extended high-level APIs: Alloc, Split, Best Fit, Stitch, and Update Stitch for allocation and deallocation.
- Efficient stitching data structure: a stitched memory pool, a primitive memory pool, and a VM pool.
- Low-level VMM APIs: cuMemAddressReserve, cuMemCreate, cuMemMap, ...
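As a rough illustration of such a stitching data structure, the two pools can be modeled as below. All type and field names are hypothetical sketches under our own assumptions, not GMLake's actual definitions.

```cpp
#include <cuda.h>
#include <set>
#include <vector>

// Illustrative two-level stitching data structure: primitive blocks own
// physical chunks; stitched blocks own a reserved VA range mapped over
// one or more primitives.
struct PrimitiveBlock {
    CUmemGenericAllocationHandle handle;  // physical chunk (cuMemCreate)
    size_t size;                          // multiple of the VMM granularity
    bool   free;                          // available for reuse/stitching
};

struct StitchedBlock {
    CUdeviceptr va;                       // contiguous VA (cuMemAddressReserve)
    size_t size;                          // total size mapped into 'va'
    std::vector<PrimitiveBlock*> parts;   // primitives mapped via cuMemMap
    bool   free;
};

// Order stitched blocks by size so best-fit is a lower_bound lookup.
struct BySize {
    bool operator()(const StitchedBlock* a, const StitchedBlock* b) const {
        return a->size < b->size;
    }
};

struct GMLakePools {
    std::vector<PrimitiveBlock*> primitivePool;          // physical chunks
    std::multiset<StitchedBlock*, BySize> stitchedPool;  // reusable stitched blocks
};
```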
Allocation Strategy

[Figure: five allocation cases over the stitched and primitive memory pools, with blocks marked as allocating, allocated, stitched, or primitive. A worked sketch follows this list.]
- S1: best fit finds a single block matching the allocation size, which is returned directly.
- S2: best fit finds a larger block (Block 1), which is split and the matching part returned.
- S3: no single block fits, so multiple free blocks (Block 2, Block 3) are stitched into one.
- S4: the free blocks are insufficient, so a new primitive block (Block 4) is allocated and stitched with an existing free block (Block 2).
- S5: if even that fails, an OOM error is raised.

For algorithm details, please refer to Section 3.3 of our paper.
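The S1-S5 fallback chain can be sketched as one function over the pool types from the overview above. The helper functions are assumptions standing in for GMLake's real bookkeeping; they are declared but deliberately left undefined here.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helpers (assumed, not GMLake's actual API):
StitchedBlock* bestFit(GMLakePools&, size_t);               // smallest free block >= size
void           split(GMLakePools&, StitchedBlock*, size_t); // carve off the exact size
StitchedBlock* stitchFreeBlocks(GMLakePools&, size_t);      // S3: stitch free blocks
StitchedBlock* growAndStitch(GMLakePools&, size_t);         // S4: new primitive + stitch

StitchedBlock* allocate(GMLakePools& pools, size_t size) {
    if (StitchedBlock* b = bestFit(pools, size)) {
        if (b->size > size) split(pools, b, size);  // S2: split a larger block
        b->free = false;                            // S1: exact match
        return b;
    }
    if (StitchedBlock* b = stitchFreeBlocks(pools, size)) {  // S3
        b->free = false;
        return b;
    }
    if (StitchedBlock* b = growAndStitch(pools, size)) {     // S4
        b->free = false;
        return b;
    }
    throw std::runtime_error("out of memory");               // S5
}
```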
Case Study: GMLake vs. Original Caching Allocator

[Figure: the same sequence of memory requests R1-R6, whose sizes vary over time, served by the original memory pool and by the GMLake memory pool. The original allocator keeps creating new blocks (B1-B5) as request sizes change, while GMLake reuses B1 and B2 through stitched blocks (S1, S2, S4), yielding lower peak memory consumption.]
GMLake saves memory by properly reusing allocated memory!
03 Evaluation
Evaluation Environments

| Item | Configuration |
| --- | --- |
| Server | 16 × NVIDIA A100 SXM (80 GB) |
| Training engine | FSDP, DeepSpeed ZeRO, Colossal-AI Gemini |
| Models | 8 LLMs, including OPT, Vicuna, GPT-2, GPT-NeoX, etc. |
| Ablations | batch size, strategy combinations, number of distributed training devices, training engine |
| Total workloads | 76 fine-tuning workloads |
Effectiveness Evaluation

Evaluation on strategy combinations: GMLake saves 10 GB of GPU memory on average and 17 GB at most.
Evaluation on distributed training device numbers: GMLake saves 9 GB on average and 17 GB at most.
Effectiveness Evaluation

Evaluation on batch sizes: GMLake saves 7 GB of GPU memory on average and 12 GB at most.
Evaluation on training engines: GMLake saves 14 GB on average and 25 GB at most.
Throughput Evaluation

In our evaluations, GMLake's throughput aligns with that of PyTorch.
Memory Profile Analysis

GMLake's stitching process: GMLake takes several steps to complete the stitching process, whereas PyTorch raises an OOM exception after three iterations.
Summary
Recent Updates
- Concurrent work exists with different designs and implementations.