GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
Cong Guo*, Rui Zhang*, Jiale Xu, Jingwen Leng+, Zihan Liu, Ziyu Huang, Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao+, Ke Zhang
*equal contribution +corresponding author
Outline
01 Background & Motivation
02 Implementation
03 Evaluation
04 Q&A
Challenges of GPU Memory in LLMs

There is a significant memory wall in LLM training.
[Figure: GMLake's contribution level in the software stack, between DNN training frameworks (e.g., Megatron-LM) and low-level GPU runtimes (CUDA, ROCm).]
Massive and chronologically ordered allocations occur in the deep learning framework's caching allocator during the forward and backward passes.
[Figure: Deep learning framework memory management.
Native allocator: every memory request from program execution calls cuMemAlloc on the device, and every release calls cuMemFree.
Caching allocator: a memory request first goes through cachingAlloc, which checks whether a cached block can be reused. If yes, the cached memory is returned directly; if no, cuMemAlloc is invoked. On release, the block is tagged as a cached memory block and kept in the caching memory pool instead of being freed.]
The caching allocator delivers more than 5× higher throughput, whereas the native allocator leads to massive device-level memory allocation calls and overhead.
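To make the mechanism concrete, here is a minimal sketch of a size-keyed caching allocator over the CUDA driver API. The class and member names are ours for illustration, not PyTorch's actual implementation; a valid CUDA context is assumed to exist.

```cpp
#include <cuda.h>
#include <map>

// Illustrative sketch of a caching allocator: cuMemFree is never called
// on the hot path; freed blocks are tagged as cached and reused later.
class CachingAllocator {
    std::multimap<size_t, CUdeviceptr> cached_;  // size -> cached blocks

public:
    CUdeviceptr alloc(size_t size) {
        // Reuse the smallest cached block that fits, if any.
        auto it = cached_.lower_bound(size);
        if (it != cached_.end()) {
            CUdeviceptr ptr = it->second;
            cached_.erase(it);
            return ptr;             // reuse: no device-level call
        }
        CUdeviceptr ptr = 0;
        cuMemAlloc(&ptr, size);     // slow path: device-level allocation
        return ptr;
    }

    void free(CUdeviceptr ptr, size_t size) {
        // Tag as a cached memory block instead of calling cuMemFree.
        cached_.emplace(size, ptr);
    }
};
```

Note how reuse may hand back a block larger than requested; that slack, compounded by splitting, is exactly where fragmentation comes from.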
Memory-Efficient Optimizations in DNN Training
Activation recomputation (R) changes the memory access pattern:
[Figure: vanilla pattern vs. pattern after R.]
Vanilla workload: regular and even memory access pattern.
Scheduled and optimized workload: irregular and uneven pattern.
[Figure: fragmentation under different strategy combinations and different numbers of distributed training devices.]
Memory-efficient optimizations in DNN training lead to fragmentation.
Case Study: CUDA Caching Allocator

[Figure: a sequence of allocation requests served by the original caching allocator vs. GMLake. The original allocator repeatedly splits cached blocks to serve smaller requests, scattering free fragments across the memory pool until a later request (size 6) cannot be served and triggers OOM. GMLake stitches two free fragments (sizes 2 and 5) into one contiguous virtual memory range and serves the same request.]

The mechanism of the caching allocator facilitates fragmentation, but VM stitching can help.
New Opportunities and Challenges with VMM (Virtual Memory Management)

Opportunities: VMM exposes fine-grained physical memory chunks and reserved device virtual memory addresses, which can be stitched together:
1. for i in range(5): C[i] = Create(granularity)
2. VA = AddrReserve(size)
3. for i in range(5): Map(VA + i * granularity, C[i])
4. Stitch complete

Challenge: fine-grained VM stitching leads to unacceptable overhead.
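The four steps above map directly onto the CUDA driver's VMM API. Below is a minimal sketch of that mapping; the chunk count of five and the error-checking macro are ours for illustration.

```cpp
#include <cuda.h>
#include <cstdio>
#include <vector>

// Minimal sketch of the four stitching steps with the CUDA driver VMM
// API; error handling is collapsed into a macro for brevity.
#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    std::fprintf(stderr, "CUDA error %d at line %d\n", r, __LINE__);   \
    return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;  // minimum physical chunk size
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    const int kChunks = 5;
    std::vector<CUmemGenericAllocationHandle> C(kChunks);

    // 1. Create fine-grained physical memory chunks.
    for (int i = 0; i < kChunks; ++i)
        CHECK(cuMemCreate(&C[i], granularity, &prop, 0));

    // 2. Reserve one contiguous virtual address range.
    CUdeviceptr va = 0;
    CHECK(cuMemAddressReserve(&va, kChunks * granularity, 0, 0, 0));

    // 3. Map every physical chunk into the reserved range.
    for (int i = 0; i < kChunks; ++i)
        CHECK(cuMemMap(va + i * granularity, granularity, 0, C[i], 0));

    // 4. Enable access: the stitched range now acts as one buffer.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(va, kChunks * granularity, &access, 1));

    std::printf("stitched %d chunks at %p\n", kChunks, (void*)va);
    return 0;
}
```

Each allocation needing this many driver calls is where the overhead comes from, which is why naive fine-grained stitching is too slow on its own.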
02 Implementation
GMLake Overview

[Figure: GMLake architecture. New-tensor and delete-tensor requests from the tensor allocator are served either by the original caching allocator over its native memory pool, or by the GMLake allocator (ours) over a VMM-based pool on GPU memory.]

GMLake consists of three layers:
- Extended high-level APIs: Alloc, Split, Best Fit, Stitch, and Update Stitch for allocation and deallocation.
- Efficient stitching data structure: a stitched memory pool, a primitive memory pool, and a VM pool.
- Low-level VMM APIs: cuMemAddressReserve, cuMemCreate, cuMemMap, ...
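As a rough illustration of such a stitching data structure, the two pools can be modeled as below. All type and field names are hypothetical sketches under our own assumptions, not GMLake's actual definitions.

```cpp
#include <cuda.h>
#include <set>
#include <vector>

// Illustrative two-level stitching data structure: primitive blocks own
// physical chunks; stitched blocks own a reserved VA range mapped over
// one or more primitives.
struct PrimitiveBlock {
    CUmemGenericAllocationHandle handle;  // physical chunk (cuMemCreate)
    size_t size;                          // multiple of the VMM granularity
    bool   free;                          // available for reuse/stitching
};

struct StitchedBlock {
    CUdeviceptr va;                       // contiguous VA (cuMemAddressReserve)
    size_t size;                          // total size mapped into 'va'
    std::vector<PrimitiveBlock*> parts;   // primitives mapped via cuMemMap
    bool   free;
};

// Order stitched blocks by size so best-fit is a lower_bound lookup.
struct BySize {
    bool operator()(const StitchedBlock* a, const StitchedBlock* b) const {
        return a->size < b->size;
    }
};

struct GMLakePools {
    std::vector<PrimitiveBlock*> primitivePool;          // physical chunks
    std::multiset<StitchedBlock*, BySize> stitchedPool;  // reusable stitched blocks
};
```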
Allocation Strategy

[Figure: five allocation cases over the stitched and primitive memory pools, with blocks marked as allocating, allocated, stitched, or primitive. A worked sketch follows this list.]
- S1: best fit finds a single block matching the allocation size, which is returned directly.
- S2: best fit finds a larger block (Block 1), which is split and the matching part returned.
- S3: no single block fits, so multiple free blocks (Block 2, Block 3) are stitched into one.
- S4: the free blocks are insufficient, so a new primitive block (Block 4) is allocated and stitched with an existing free block (Block 2).
- S5: if even that fails, an OOM error is raised.

For algorithm details, please refer to Section 3.3 of our paper.
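The S1-S5 fallback chain can be sketched as one function over the pool types from the overview above. The helper functions are assumptions standing in for GMLake's real bookkeeping; they are declared but deliberately left undefined here.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helpers (assumed, not GMLake's actual API):
StitchedBlock* bestFit(GMLakePools&, size_t);               // smallest free block >= size
void           split(GMLakePools&, StitchedBlock*, size_t); // carve off the exact size
StitchedBlock* stitchFreeBlocks(GMLakePools&, size_t);      // S3: stitch free blocks
StitchedBlock* growAndStitch(GMLakePools&, size_t);         // S4: new primitive + stitch

StitchedBlock* allocate(GMLakePools& pools, size_t size) {
    if (StitchedBlock* b = bestFit(pools, size)) {
        if (b->size > size) split(pools, b, size);  // S2: split a larger block
        b->free = false;                            // S1: exact match
        return b;
    }
    if (StitchedBlock* b = stitchFreeBlocks(pools, size)) {  // S3
        b->free = false;
        return b;
    }
    if (StitchedBlock* b = growAndStitch(pools, size)) {     // S4
        b->free = false;
        return b;
    }
    throw std::runtime_error("out of memory");               // S5
}
```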
Case Study: GMLake vs. Original Caching Allocator

[Figure: the same sequence of memory requests R1-R6, whose sizes vary over time, served by the original memory pool and by the GMLake memory pool. The original allocator keeps creating new blocks (B1-B5) as request sizes change, while GMLake reuses B1 and B2 through stitched blocks (S1, S2, S4), yielding lower peak memory consumption.]
GMLake saves memory by properly reusing allocated memory!
03 Evaluation
Evaluation Environments

| Item | Configuration |
| --- | --- |
| Server | 16 × NVIDIA A100 SXM (80 GB) |
| Training engine | FSDP, DeepSpeed ZeRO, Colossal-AI Gemini |
| Models | 8 LLMs, including OPT, Vicuna, GPT-2, GPT-NeoX, etc. |
| Ablations | batch size, strategy combinations, number of distributed training devices, training engine |
| Total workloads | 76 fine-tuning workloads |
Effectiveness Evaluation

Evaluation on strategy combinations: GMLake saves 10 GB of GPU memory on average and 17 GB at most.
Evaluation on distributed training device numbers: GMLake saves 9 GB on average and 17 GB at most.
Effectiveness Evaluation

Evaluation on batch sizes: GMLake saves 7 GB of GPU memory on average and 12 GB at most.
Evaluation on training engines: GMLake saves 14 GB on average and 25 GB at most.
Throughput Evaluation

In our evaluations, GMLake's throughput aligns with that of PyTorch.
Memory Profile Analysis

GMLake's stitching process: GMLake takes several steps to complete the stitching process, whereas PyTorch raises an OOM exception after three iterations.
Summary
Recent Updates
- Concurrent work exists with different designs and implementations.