1 of 21

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cong Guo*,1,2, Rui Zhang*,3, Jiale Xu1,2, Jingwen Leng+,1,2, Zihan Liu1,2, Ziyu Huang1,2, Minyi Guo1,2, Hao Wu3, Shouren Zhao3, Junping Zhao+,3, Ke Zhang3


*equal contribution +corresponding author

2 of 21


01  Background & Motivation
02  Implementation
03  Evaluation
04  Q&A

3 of 21

Challenges of GPU Memory in LLMs

There is a significant memory wall in LLM training.


4 of 21

[Figure: DNN training stack, from frameworks such as Megatron-LM through the framework's caching allocator down to CUDA/ROCm; GMLake contributes at the caching-allocator level.]

Massive, chronological memory allocations occur in the deep learning framework's allocator during the forward and backward passes.


5 of 21

[Figure: deep learning framework memory management, native allocator vs. caching allocator.]

Native allocator: every memory request from program execution calls cuMemAlloc, and every release calls cuMemFree, which leads to massive device-level memory allocation calls and overhead.

Caching allocator: a memory request first checks whether a cached block can be reused; only if reuse is not possible does it fall back to cuMemAlloc. Freed memory is tagged as a caching memory block and kept in the caching memory pool.

The caching allocator results in more than 5× higher throughput.
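To make the caching path concrete, here is a minimal, hypothetical sketch of such a caching allocator in CUDA C++. It is not the PyTorch implementation; the class name, the size-keyed pool, and the lack of size rounding and error handling are simplifying assumptions.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Minimal, hypothetical caching allocator: freed GPU blocks are kept in a
// size-keyed pool and reused, so cudaMalloc is only called on a cache miss.
class CachingAllocator {
  std::multimap<size_t, void*> cached_;   // the "caching memory pool"
public:
  void* alloc(size_t bytes) {
    auto it = cached_.lower_bound(bytes); // reuse a cached block if possible
    if (it != cached_.end()) {
      void* p = it->second;
      cached_.erase(it);
      return p;                           // reuse: no device-level call
    }
    void* p = nullptr;
    cudaMalloc(&p, bytes);                // miss: fall back to the device API
    return p;
  }
  void free(void* p, size_t bytes) {
    cached_.emplace(bytes, p);            // tag as a caching memory block
  }
};
```

Because reuse hits avoid cuMemAlloc/cuMemFree entirely, most requests stay inside the framework, which is where the 5× throughput gap between the two allocators comes from.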


6 of 21

Memory-Efficient Optimizations in DNN Training

Activation Recomputation (R)

[Figure: memory usage pattern of the vanilla workload vs. the pattern after applying R.]

Vanilla workload: regular and even memory access pattern.
Scheduled and optimized workload: irregular and uneven pattern.


7 of 21

[Figure: fragmentation under different optimization strategies and different numbers of distributed training devices.]

Memory-efficient optimizations in DNN training lead to fragmentation.


8 of 21

Case Study: CUDA Caching Allocator

[Figure: a sequence of allocation requests handled by the original caching allocator vs. GMLake. In the original allocator, splitting blocks in the memory pool leaves fragmentation and eventually causes OOM; GMLake stitches free blocks through virtual memory (VM stitching) to serve the same requests.]

The mechanism of the caching allocator facilitates fragmentation, but VM stitching can help.


9 of 21

New Opportunities and Challenges with VMM (Virtual Memory Management)

Opportunities with VMM: the API exposes fine-grained memory chunks and reserved device memory addresses, which can be stitched together:

1. For i in range(5): C[i] = Create(granularity)
2. VA = AddrReserve(size)
3. For i in range(5): Map(VA + i * granularity, C[i])
4. Stitch

Challenges with VMM: fine-grained VM stitching leads to unacceptable overhead.
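The four numbered steps above map onto the CUDA driver VMM API roughly as in the sketch below. It is a minimal, hedged example, not GMLake's implementation: error handling is omitted, the chunk count of 5 is illustrative, and the final cuMemSetAccess call that enables access to the stitched range is added for completeness. The per-chunk create/map cost on every allocation is exactly what makes naive fine-grained stitching expensive.

```cpp
// Sketch: stitch 5 physical chunks into one contiguous virtual range with
// the CUDA VMM driver API (cuMemCreate / cuMemAddressReserve / cuMemMap).
#include <cuda.h>

void stitch_demo(int device) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;

  size_t granularity = 0;                       // minimum physical chunk size
  cuMemGetAllocationGranularity(&granularity, &prop,
                                CU_MEM_ALLOC_GRANULARITY_MINIMUM);

  const int n = 5;
  CUmemGenericAllocationHandle C[n];
  for (int i = 0; i < n; ++i)                   // 1. create physical chunks
    cuMemCreate(&C[i], granularity, &prop, 0);

  CUdeviceptr VA;                               // 2. reserve a virtual range
  cuMemAddressReserve(&VA, n * granularity, 0, 0, 0);

  for (int i = 0; i < n; ++i)                   // 3. map chunks back-to-back
    cuMemMap(VA + i * granularity, granularity, 0, C[i], 0);

  CUmemAccessDesc access = {};                  // 4. stitch complete: enable
  access.location = prop.location;              //    access to the whole range
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(VA, n * granularity, &access, 1);
}
```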


10 of 21


01  Background & Motivation
02  Implementation
03  Evaluation
04  Q&A

11 of 21

GMLake Overview

[Figure: the tensor allocator routes new/delete tensor requests either to the original caching allocator with its native memory pool, or to the GMLake allocator with its VMM pool, both on top of GPU memory.]

GMLake is organized in three layers:
  • Extended high-level APIs for allocation and deallocation: Alloc, Split, Best Fit, Stitch, and Update Stitch.
  • An efficient stitching data structure: a stitched memory pool and a primitive memory pool (the VM pool).
  • Low-level VMM APIs: cuMemAddressReserve, cuMemCreate, cuMemMap, etc.
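As a rough illustration of the stitching data structure (the type and field names here are assumptions for exposition, not GMLake's actual definitions), a primitive block wraps one physical chunk from cuMemCreate, while a stitched block owns a virtual range from cuMemAddressReserve that is mapped over several primitive blocks:

```cpp
// Hypothetical sketch of GMLake-style pool entries; real names differ.
#include <cuda.h>
#include <cstddef>
#include <vector>

struct PrimitiveBlock {                 // one physical chunk in the primitive pool
  CUmemGenericAllocationHandle handle;  // from cuMemCreate
  size_t size;                          // multiple of the allocation granularity
  bool free;                            // available for reuse or stitching
};

struct StitchedBlock {                  // one entry in the stitched memory pool
  CUdeviceptr va;                       // range from cuMemAddressReserve
  size_t size;                          // total mapped size
  std::vector<PrimitiveBlock*> parts;   // primitive blocks mapped back-to-back
};

struct GMLakePools {
  std::vector<PrimitiveBlock> primitivePool;  // physical chunks
  std::vector<StitchedBlock>  stitchedPool;   // stitched virtual ranges
};
```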


12 of 21

Allocation Strategy

[Figure: allocation scenarios S1-S5; legend: allocating / allocated / stitched / primitive memory blocks, drawn from the stitched memory pool and the primitive memory pool.]

  • S1: Best Fit finds a single free block that matches the allocation.
  • S2: Best Fit finds a larger block, which is split to serve the allocation.
  • S3: No single block is large enough, so multiple free blocks are stitched together.
  • S4: Existing blocks are insufficient, so a new block is allocated from the device.
  • S5: If allocation also fails, the request results in OOM.

For algorithm details, please refer to Section 3.3 of our paper.
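The decision order can be sketched as the simplified, hypothetical model below (best fit, then split, then stitch, then a fresh device allocation, then OOM). It only tracks block bookkeeping; the actual algorithm, including how stitched blocks are backed by VMM mappings, is in Section 3.3 of the paper.

```cpp
// Hypothetical model of the allocation decision order (S1-S5);
// not GMLake's actual implementation.
#include <cstddef>
#include <list>
#include <stdexcept>

struct Block { size_t size; bool free; bool stitched; };
using Pool = std::list<Block>;

Block* gmlakeAllocate(Pool& pool, size_t request, size_t deviceFreeBytes) {
  // S1: best fit over free blocks in the pool.
  Block* best = nullptr;
  for (auto& b : pool)
    if (b.free && b.size >= request && (!best || b.size < best->size))
      best = &b;
  if (best) {
    if (best->size > request) {                   // S2: split a larger block
      pool.push_back({best->size - request, true, false});
      best->size = request;
    }
    best->free = false;
    return best;
  }
  // S3: no single block fits; stitch several free blocks into one block
  // (GMLake maps them into one virtual range via the VMM APIs).
  size_t freeTotal = 0;
  for (auto& b : pool) if (b.free) freeTotal += b.size;
  if (freeTotal >= request) {
    size_t covered = 0;
    for (auto& b : pool) {
      if (!b.free || covered >= request) continue;
      b.free = false;                             // now backs the stitched block
      covered += b.size;
    }
    pool.push_back({request, false, true});       // the stitched block
    return &pool.back();
  }
  // S4: existing blocks are insufficient; allocate a new block on the device.
  if (deviceFreeBytes >= request) {
    pool.push_back({request, false, false});
    return &pool.back();
  }
  throw std::runtime_error("out of memory");      // S5: OOM
}
```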


13 of 21

Case Study: GMLake vs. Original Caching Allocator

[Figure: a timeline of memory requests R1-R6 with their request sizes. The original caching allocator grows its memory pool to blocks B1-B5, while GMLake serves the same requests with far fewer blocks (B1-B2) by reusing and stitching free memory (S1, S2, S4), giving lower memory consumption.]

GMLake saves memory by properly reusing allocated memory.


14 of 21


01  Background & Motivation
02  Implementation
03  Evaluation
04  Q&A

15 of 21

Evaluation Environments

  • Server: 16 × NVIDIA A100 SXM (80 GB)
  • Training engines: FSDP, DeepSpeed ZeRO, Colossal-AI Gemini
  • Models: 8 LLMs, including OPT, Vicuna, GPT-2, GPT-NeoX, and so on
  • Ablations: batch size, strategies, number of distributed training devices, and different training engines
  • Total workloads: 76 fine-tuning workloads


16 of 21

Effectiveness Evaluation

  • Evaluation on strategy combinations: 10 GB memory reduction on average, 17 GB at most.
  • Evaluation on distributed training device numbers: 9 GB memory reduction on average, 17 GB at most.


17 of 21


Effectiveness Evaluation

  • Evaluation on batch sizes: 7 GB memory reduction on average, 12 GB at most.
  • Evaluation on training engines: 14 GB memory reduction on average, 25 GB at most.


18 of 21

Throughput Evaluation


In our evaluations, GMLake's throughput aligns with that of PyTorch


19 of 21

Memory Profile Analysis

GMLake's Stitching Process

GMLake takes several steps to complete the stitching process, while PyTorch raises an OOM exception after three iterations.


20 of 21

Summary


  • We performed a characterization study to show that the caching allocator used in existing DL frameworks suffers from up to 30% memory fragmentation.
  • We designed and implemented GMLake, a novel memory allocator that effectively reduces memory fragmentation and improves memory utilization.
  • We evaluated GMLake on multiple prominent LLM optimization platforms with a set of representative open-source LLMs to demonstrate its effectiveness, efficiency, and robustness.


21 of 21

Recent Updates


  • Comparison with PyTorch expandable segments
    - Concurrent work with different designs and implementations
  • Extended as a PyTorch pluggable allocator
  • Expanded to support cross-stream GPU memory reuse
    - Promising, open-source, and configurable
  • Expanded to resolve KV cache fragmentation in LLM inference
    - Different from PagedAttention; only minor changes required to existing attention kernels
