1 of 39

CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems

Chao Chen, Chris Porter, Santosh Pande

PPoPP 2022

2 of 39

High performance computing

  • GPUs have become an essential element in modern computing

Frontier (each node: 1 CPU + 4 GPUs):

  • 1 HPC- and AI-optimized 3rd Gen AMD EPYC CPU (up to 128 cores)
  • 4 purpose-built AMD Instinct MI250X GPUs (128GB, 220 CUs)

Summit (previous generation):

  • 2 IBM POWER9 CPUs (up to 24 cores)
  • 6 NVIDIA Volta GPUs (up to 32GB, 80 SMs)

3 of 39

The issue

  • They are underutilized
    • ~30% utilization for many scientific applications
    • Similar observations in data centers, e.g. AWS
  • They are expensive
    • A GPU costs 2 to 5x the price of a high-end Intel Xeon CPU
    • A GPU VM instance in the cloud costs nearly 10x as much as a regular VM

There is a need to improve GPU utilization

  • Key reason it goes unaddressed: it is unsafe to schedule multiple CUDA tasks on one device, since doing so might lead to out-of-memory errors

4 of 39

Our talk

  • A compiler-assisted scheduling framework for multi-GPU systems
    • Targets a single node with multiple GPUs attached to it
    • Shares the node among independent, uncooperative applications
    • Improves system utilization with negligible performance interference per workload
    • Improves throughput for a batch of GPU tasks by up to 2-3x
    • Compiler-generated resource requirements enable simple scheduling at runtime
    • Fully automated approach

5 of 39

Outline

  • The state-of-the-art work
  • CASE - Compiler Assisted Scheduling
  • Evaluation

6 of 39

Outline

  • The state-of-the-art work
  • CASE - Compiler Assisted Scheduling
  • Evaluation

7 of 39

State-of-the-art work

  • Slurm
    • Traditional job queueing and resource reservation
  • FLEP
    • Kernel-slicing technique
    • Allows preemption and reassignment for high-priority jobs
    • Utilization is not the main goal (does not co-execute multiple GPU jobs on one device)
  • Nvidia MPS
    • Simultaneous execution of multiple kernels
    • "Cooperative": developers must carefully design MPI tasks and not overload the devices
    • Out-of-memory errors are possible when running independent processes

8 of 39

State-of-the-art work: schedGPU

  • Guarantees memory safety
  • Requires manual program rewrite to estimate memory usage
  • No scheduling support for device independence or multiple GPUs

Carlos Reaño, Federico Silla, Dimitrios S. Nikolopoulos, and Blesson Varghese. 2018. Intra-Node Memory Safe GPU Co-Scheduling. IEEE Trans. Parallel Distributed Syst. 29, 5 (2018), 1089–1102. https://doi.org/10.1109/TPDS.2017.2784428

9 of 39

Outline

  • The state-of-the-art work
  • CASE - Compiler Assisted Scheduling
  • Evaluation

10 of 39

The framework

11 of 39

Overview of CASE

  • Colocate workloads based on their resource requirements

Device 0: 56 SMs, 16GB        Device 1: 56 SMs, 16GB

Application 1: K1 (32 SMs, 6GB), K2 (18 SMs, 9GB)
Application 2: K1 (36 SMs, 4GB), K2 (22 SMs, 9GB)

With per-kernel requirements known, the kernels can be packed onto the two devices safely; a minimal fit-check sketch follows.
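As a rough illustration of the colocation decision (hypothetical types and helper, not CASE's actual API), a kernel may be placed on a device only when the device's free SMs and free memory cover the kernel's stated requirements:

#include <cstdint>

struct Device    { uint32_t freeSMs; uint64_t freeMemBytes; };
struct KernelReq { uint32_t sms;     uint64_t memBytes;     };

// True if the kernel's requirements fit on the device right now.
bool fits(const Device &d, const KernelReq &k) {
    return d.freeSMs >= k.sms && d.freeMemBytes >= k.memBytes;
}

// With the numbers above: on a 56-SM, 16GB device, App1.K1 (32 SMs, 6GB) and
// App2.K2 (22 SMs, 9GB) fit together (54 SMs, 15GB), while App1.K1 plus
// App2.K1 (36 SMs, 4GB) would exceed the 56 available SMs.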

12 of 39

The challenge

  • The key is to obtain the resource requirements of each kernel
    • Computing resources (SMs) → performance interference
    • Memory footprint → crashes due to OOM errors
  • CASE approaches this via:
    • Static program analysis
    • Dynamic runtime binding

13 of 39

Constructing (virtual) GPU tasks statically

  • A GPU task is a group of GPU operations around a kernel launch
    • Memory allocation and initialization, e.g. cudaMalloc and cudaMemcpy
    • Kernel launch and result saving
    • Kernel launches sharing memory objects are merged into one task

// VecAdd is a CUDA kernel executed on the GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// main is sequential code running on the CPU
// (task_begin/task_end are the probes inserted by the CASE compiler)
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }

    task_begin(3 * N * sizeof(int), 128, N / 128); // the instrumented probe

    // allocate device memory
    cudaMalloc(&dA, N * sizeof(int)); // an input vector
    cudaMalloc(&dB, N * sizeof(int)); // an input vector
    cudaMalloc(&dC, N * sizeof(int)); // for storing the result

    // initialize the device memory
    cudaMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    cudaMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    task_end(); // the instrumented probe
}

The probe's grid/block arguments capture the computing resource, and the allocation sizes capture the memory resource; both are derived by the compiler from the kernel launch configuration and the cudaMalloc calls.
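To make this concrete, the probe's arguments can be reduced to the two quantities the scheduler tracks, a warp count and a byte count. This is an illustrative helper with assumed names, not CASE's internal representation:

#include <cstdint>

struct TaskReq { uint64_t warps; uint64_t memBytes; };

// Mirrors what a task_begin(mem, threadsPerBlock, blocks) probe conveys:
// total warps requested and bytes to be allocated on the device.
TaskReq makeTaskReq(uint64_t memBytes, uint64_t threadsPerBlock, uint64_t blocks) {
    uint64_t totalThreads = threadsPerBlock * blocks;
    uint64_t warps = (totalThreads + 31) / 32; // 32 threads per warp on NVIDIA GPUs
    return TaskReq{warps, memBytes};
}

// e.g. for VecAdd above: makeTaskReq(3 * N * sizeof(int), 128, N / 128)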

14 of 39

Lazy runtime

  • The static compiler approach may not always be applicable
  • Delay CUDA operations
    • Until they are needed (at kernel launch)
    • Record and replay

// VecAdd is a CUDA kernel executed on the GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

void init(int **A, int **B, int **C) {
    // allocate device memory (deferred: only recorded by the lazy runtime)
    lazyMalloc(A, N * sizeof(int)); // an input vector
    lazyMalloc(B, N * sizeof(int)); // an input vector
    lazyMalloc(C, N * sizeof(int)); // for storing the result
}

// main is sequential code running on the CPU
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }

    init(&dA, &dB, &dC);

    // initialize the device memory (recorded, not yet executed)
    lazyMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    lazyMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    task_begin_with_materialization(G, T, dA, dB, dC); // replay the recorded operations
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    lazyMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    lazyFree(dA);
    lazyFree(dB);
    lazyFree(dC);
    task_end();
}

Record: each lazy call returns a placeholder handle (*A = 1, *B = 2, *C = 3) and logs the deferred operation, e.g. (malloc, N) and (cpy, A, HtoD, N). Replay: task_begin_with_materialization performs the actual operations once a device has been assigned. The compiler replaces and instruments the code automatically.
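A minimal sketch of the record-and-replay mechanism (hypothetical data structures and names; CASE's real runtime also patches kernel arguments and handles the remaining transfer directions, which are omitted here):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One deferred operation: an allocation or a host-to-device copy.
struct LazyOp {
    enum { MALLOC, MEMCPY_H2D } kind;
    void *handle;     // placeholder returned by lazyMalloc
    const void *src;  // host source (copies only)
    size_t bytes;
};

static std::vector<LazyOp> g_log;        // recorded operations for the task
static std::map<void *, void *> g_real;  // placeholder handle -> real device pointer

void lazyMalloc(void **p, size_t bytes) {
    *p = (void *)(uintptr_t)(g_log.size() + 1); // placeholder handles: 1, 2, 3, ...
    g_log.push_back({LazyOp::MALLOC, *p, nullptr, bytes});
}

void lazyMemcpy(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    if (kind == cudaMemcpyHostToDevice)
        g_log.push_back({LazyOp::MEMCPY_H2D, dst, src, bytes});
    // copies after the launch (e.g. device-to-host) run eagerly; omitted here
}

// Invoked from task_begin_with_materialization once the scheduler assigns a device.
void materialize(int device) {
    cudaSetDevice(device);
    for (const LazyOp &op : g_log) {
        if (op.kind == LazyOp::MALLOC)
            cudaMalloc(&g_real[op.handle], op.bytes);
        else
            cudaMemcpy(g_real[op.handle], op.src, op.bytes, cudaMemcpyHostToDevice);
    }
    g_log.clear();
}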

15 of 39

Runtime

task_begin(mem, grid, block) handshake with the scheduler:

  1. The application sends the probe (mem, grid, block) to the scheduler
  2. The scheduler determines the device
  3. The application receives the device id
  4. The application calls cudaSetDevice(id)
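One plausible shape of the client side of this handshake (assumed message layout and IPC stub; the actual mechanism is not shown in the talk):

#include <cuda_runtime.h>
#include <cstdint>

struct Probe { uint64_t memBytes; uint32_t grid; uint32_t block; };

// Stand-in for the call into the scheduler daemon; the real call blocks until
// the scheduler replies with a device id. Here it simply returns device 0.
static int sendProbeToScheduler(const Probe &p) { (void)p; return 0; }

// What a compiler-inserted task_begin(mem, grid, block) probe boils down to:
void task_begin(uint64_t memBytes, uint32_t grid, uint32_t block) {
    Probe p{memBytes, grid, block};       // step 1: send (mem, grid, block)
    int device = sendProbeToScheduler(p); // steps 2-3: scheduler decides and replies
    cudaSetDevice(device);                // step 4: bind this process to that device
}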

16 of 39

CASE scheduling: Hardware emulation

  • Idea: only add a task to a GPU if the GPU has compute and memory available
  • Emulate the hardware: "assign" the task's threads to thread blocks on each streaming multiprocessor of the GPU, until saturation (a sketch of this emulation follows)
  • The task's memory requirement must fit in the GPU's free memory
  • Don't assign a task to a GPU whose compute or memory is already saturated
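A minimal sketch of the block-placement emulation, under an assumed per-SM thread limit (the real limits depend on the GPU and on the kernel's resource usage):

#include <cstdint>
#include <vector>

struct SM { uint32_t residentThreads = 0; };

const uint32_t kMaxThreadsPerSM = 2048; // assumption for this sketch

// "Assign" a task's thread blocks to SMs; returns how many blocks fit right now.
uint32_t placeBlocks(std::vector<SM> &sms, uint32_t blocks, uint32_t threadsPerBlock) {
    uint32_t placed = 0;
    for (uint32_t b = 0; b < blocks; ++b) {
        bool assigned = false;
        for (SM &sm : sms) {
            if (sm.residentThreads + threadsPerBlock <= kMaxThreadsPerSM) {
                sm.residentThreads += threadsPerBlock; // this block lands on this SM
                ++placed;
                assigned = true;
                break;
            }
        }
        if (!assigned) break; // compute is saturated; stop placing blocks here
    }
    return placed;
}

If the task's blocks do not all fit, or its memory requirement exceeds the free device memory, the scheduler considers the next GPU instead.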

17 of 39

CASE scheduling: Simplified

sched(task, GPUs):
    TargetG = None
    MinWarps = Inf
    for G in GPUs:
        if task.MemReq < G.FreeMem:
            if G.InUseWarps < MinWarps:
                MinWarps = G.InUseWarps
                TargetG = G
    if TargetG:
        TargetG.addWarps(task.Warps)
        TargetG.addMem(task.MemReq)
    return TargetG

Idea: for an incoming task, find the GPU with

  1. memory available for it, and
  2. the least number of active warps

Memory (1) is a hard requirement; compute (2) is soft.

18 of 39

CASE scheduling

Incoming task k0: MemReq = 6GB, Warps = 2M.

GPU state: G0 (2GB free, 7M in-use warps), G1 (10GB free, 3M), G2 (8GB free, 2M), G3 (4GB free, 9M).

G0 and G3 fail the memory check; of G1 and G2, G2 has fewer in-use warps, so k0 is placed on G2. After placement: G2.FreeMem = 2GB, G2.InUseWarps = 4M.

19 of 39

Outline

  • The state-of-the-art work
  • CASE - Compiler Assisted Scheduling
  • Evaluation

20 of 39

Machines

  • 2-GPU Nvidia P100 system (Intel Xeon E5-2670, 128 GB RAM)
    • Pascal, 16GB RAM, 3584 cores
  • 4-GPU Nvidia V100 system (Intel Xeon E5-2686, 244 GB RAM)
    • Volta, 16GB RAM, 5120 cores

21 of 39

Workloads

  • Rodinia
    • Mixed workloads
      • Randomly chosen from 7 benchmarks with realistically large kernels
    • 16- and 32-job versions, with varying ratios of large:small jobs
  • Darknet
    • Homogeneous workloads
      • NN training, NN prediction, real-time object detection, text generation
    • 8 homogeneous jobs in each load

22 of 39

Baselines

  • Single assignment (SA)
    • Basic gating of GPU requests
    • Each GPU is only ever assigned 1 kernel at a time
  • Core-to-GPU (CG)
    • Generalized SA; assign up to C kernels per GPU, where C is based on the expected number of CPU cores (workers) per GPU
    • Unsafe
  • schedGPU
    • API for developers to annotate code with their resource needs
    • Memory safety within a single GPU

23 of 39

Single assignment scheduler

(Figure: jobs k0-k3 scheduled across Time 0-2 under single assignment; each GPU holds at most one kernel at a time, so the remaining jobs wait.)

24 of 39

Core-to-GPU scheduler

(Figure: jobs k0-k7 scheduled across Time 0-2 under the core-to-GPU policy; up to C kernels share each GPU, with no memory-safety check.)

25 of 39

schedGPU scheduler

(Figure: jobs k0-k3 on a 16GB GPU across Time 0-2 under schedGPU; k0 and k1 have 8GB footprints and k2 and k3 have 10GB footprints, so the two 8GB jobs can co-reside, while a 10GB job must wait until enough memory is free.)

26 of 39

Summary of results

  • Improve GPU utilization by up to 3.36x
  • Improve throughput by an average of ~2x for both Rodinia benchmarks and Darknet neural networks
  • Improve job turnaround time by up to 4.9x
  • Limit individual kernel performance degradation to at most 2.5%

27 of 39

4xV100 throughput for 2 scheduling algs on Rodinia

(Figure: legend HE = hardware-emulation scheduler, S = simplified scheduler.)

28 of 39

4xV100 throughput on Rodinia

29 of 39

4xV100 throughput on Darknet

30 of 39

Conclusion

  • CASE: A fully automated GPU scheduling framework to uniformly and transparently manage GPU resources
    • Combination of static analysis, lazy runtime, and scheduler
    • Guarantees memory-safe execution of uncooperative processes
    • Strong empirical evidence for performance improvements:

2.5x throughput, 3.36x utilization, 4.9x job turnaround time

  • Artifact available at https://zenodo.org/record/5787410
  • New: Compiler-Assisted Scheduling for Multi-Instance GPUs
    • Modified CASE scheduler for Multi-Instance GPU feature
    • Appearing in GPGPU 2022
    • Improvement over Slurm: 1.45x throughput, 2.93x mem. utilization

31 of 39

Thank you!

Q&A

32 of 39

Extra slides

33 of 39

Workloads - Rodinia

Workload   Mix                Workload   Mix
W1         16-job, 1:1-mix    W5         32-job, 1:1-mix
W2         16-job, 2:1-mix    W6         32-job, 2:1-mix
W3         16-job, 3:1-mix    W7         32-job, 3:1-mix
W4         16-job, 5:1-mix    W8         32-job, 5:1-mix

Generated from backprop (pattern recognition), srad-v1 and srad-v2 (image processing), lavaMD (molecular dynamics), needle (bioinformatics), dwt2d (image/video compression), and bfs (graph).

Ratio mixes are of the form large:small, where large jobs have kernels with footprints >4GB (up to ~12GB), and small jobs have kernels with 1-4GB footprints.

34 of 39

Workloads - Darknet

Task                          Model
NN training                   Small Darknet architecture for CIFAR-10
NN prediction                 Pre-trained Darknet19 and Darknet53-448x448 from 1000-class ImageNet
Real-time object detection    Pre-trained yolov3-tiny architecture
Text generation               Pre-trained RNN based on Shakespeare's complete works

35 of 39

2xP100 throughput on Rodinia

36 of 39

Percentage of job crashes for CG (P100s/V100s)

# of workers   1:1 mix    2:1 mix    3:1 mix    5:1 mix
3/6            0/0        3%/6%      8%/17%     0/0
4/8            14%/13%    6%/19%     6%/25%     9%/13%
5/10           13%/15%    13%/25%    20%/20%    22%/25%
6/12           16%/33%    17%/29%    16%/38%    16%/50%

37 of 39

CASE average job turnaround speedup

GPUs      # of jobs   1:1 mix   2:1 mix   3:1 mix   5:1 mix
2xP100s   16 jobs     4.9x      2.3x      4.9x      4.3x
2xP100s   32 jobs     4.6x      3.2x      3.6x      2.0x
4xV100s   16 jobs     2.4x      2.0x      3.5x      2.6x
4xV100s   32 jobs     3.8x      2.9x      2.9x      2.6x

38 of 39

Additional state-of-the-art work

  • Nvidia MPS
    • Multiprocess service
    • Provides support for simultaneous execution of multiple cooperative kernels
    • “Cooperative:” developers must carefully design MPI tasks and not overload the devices
    • Susceptible to out-of-memory errors when running independent processes
    • Not designed for scheduling kernels across different GPU devices (i.e. responsibility falls to developer)

39 of 39

Additional state-of-the-art work

  • Job queues: Slurm
  • Scheduling: schedGPU (API), VirtCL (API and history), FLEP (preemption)
  • Deep learning: Gandiva (properties, cluster-wide), Amaral et al. (topology), MXNet (data partitioning), Hu et al. (concurrent models single GPU), MARBLE (concurrent models multi GPU), ONNX-GPU (graph IR contraction)
  • OS support: Gdev (1st class GPU resource management), PTask (task graphs)
  • Nvidia: unified memory, multi-instance GPU (MIG)