CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
Chao Chen, Chris Porter, Santosh Pande
PPoPP 2022
High performance computing
Frontier: each node has 1 CPU + 4 GPUs
Summit (previous gen): each node has 2 CPUs + 6 GPUs
The issue
There is a need to improve GPU utilization
Our talk
Outline
State-of-the-art work
State-of-the-art work: schedGPU
Carlos Reaño, Federico Silla, Dimitrios S. Nikolopoulos, and Blesson Varghese. 2018. Intra-Node Memory Safe GPU Co-Scheduling. IEEE Trans. Parallel Distributed Syst. 29, 5 (2018), 1089–1102. https://doi.org/10.1109/TPDS.2017.2784428
Outline
The framework
Overview of CASE
Device 0:
SMs: 56
Mem: 16GB
Device 1:
SMs: 56
Mem: 16GB
Application 1
K1: 32 SMs, 6GB
K2: 18 SMs, 9GB
Application 2
K1: 36 SMs, 4GB
K2: 22 SMs, 9GB
The challenge
Constructing (virtual) GPU tasks statically
// VecAdd is a CUDA kernel executed on GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// main is sequential code running on CPU
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }
    task_begin(3 * N * sizeof(int), 128, N / 128); // the instrumented probe
    // allocate device memory
    cudaMalloc(&dA, N * sizeof(int)); // an input vector
    cudaMalloc(&dB, N * sizeof(int)); // an input vector
    cudaMalloc(&dC, N * sizeof(int)); // for storing the result

    // initialize the device memory
    cudaMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    cudaMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    task_end();
}
Computing resource: the kernel's launch configuration (grid and block dimensions), captured by the probe's second and third arguments.
Memory resource: the total device memory the kernel will need, aggregated from the cudaMalloc sizes and passed as the probe's first argument.
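As a rough illustration, the sketch below folds the probe's arguments into a per-task resource estimate; the TaskReq struct, make_task_req helper, and the warps-per-launch formula are illustrative assumptions, not CASE's actual internals.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical task descriptor: memory comes from the aggregated allocation
    // sizes, compute from the launch configuration.
    struct TaskReq {
        std::size_t mem_bytes;  // sum of device-allocation sizes reaching the kernel
        std::size_t warps;      // estimated warps for the launch
    };

    // Assumed mapping from the probe's arguments to a TaskReq (32 threads per warp).
    TaskReq make_task_req(std::size_t mem_bytes, dim3 grid, dim3 block) {
        TaskReq r;
        r.mem_bytes = mem_bytes;
        r.warps = static_cast<std::size_t>((block.x + 31) / 32) * grid.x;
        return r;
    }

    // For the VecAdd listing: mem = 3 * N * sizeof(int), and a grid of N/128 blocks
    // of 128 threads gives warps = 4 * (N / 128).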
Lazy runtime
// VecAdd is a CUDA kernel executed on GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

void init(int **A, int **B, int **C) {
    // allocate device memory (lazily: no device is touched yet)
    lazyMalloc(A, N * sizeof(int)); // an input vector
    lazyMalloc(B, N * sizeof(int)); // an input vector
    lazyMalloc(C, N * sizeof(int)); // for storing the result
}

// main is sequential code running on CPU
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }
    init(&dA, &dB, &dC);

    // initialize the device memory (recorded, not yet performed)
    lazyMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    lazyMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    task_begin_with_materialization(G, T, dA, dB, dC);
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    lazyMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    lazyFree(dA);
    lazyFree(dB);
    lazyFree(dC);
    task_end();
}
[Figure: lazy-runtime bookkeeping. lazyMalloc returns placeholder handles (*A = 1, *B = 2, *C = 3) and records deferred operations such as (malloc, N) and (cpy, A, HtoD, N). The compiler replaces and instruments the code automatically; the runtime performs the actual operations once a device is chosen.]
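A minimal sketch of this bookkeeping is below, assuming a handle map and an operation log; the names (lazyMalloc, lazyMemcpy, materialize) follow the listing, but the data structures are simplified assumptions, and rewriting the kernel's handle arguments to real pointers at the launch site is omitted.

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <map>
    #include <vector>

    struct CopyOp { std::uintptr_t handle; const void *src; std::size_t bytes; };

    static std::map<std::uintptr_t, std::size_t> g_allocs;  // handle -> bytes, e.g. (malloc, N)
    static std::vector<CopyOp> g_copies;                     // deferred copies, e.g. (cpy, A, HtoD, N)
    static std::map<std::uintptr_t, void *> g_real;          // handle -> real device pointer
    static std::uintptr_t g_next_handle = 1;

    void lazyMalloc(int **p, std::size_t bytes) {
        *p = reinterpret_cast<int *>(g_next_handle);          // e.g. *A = 1, *B = 2, *C = 3
        g_allocs[g_next_handle++] = bytes;                    // no cudaMalloc yet
    }

    void lazyMemcpy(int *dst, const void *src, std::size_t bytes, cudaMemcpyKind) {
        g_copies.push_back({reinterpret_cast<std::uintptr_t>(dst), src, bytes});  // record only (H2D in this sketch)
    }

    // Replays the log with real CUDA calls once a device has been assigned;
    // called from task_begin_with_materialization().
    void materialize() {
        for (const auto &a : g_allocs) cudaMalloc(&g_real[a.first], a.second);
        for (const auto &c : g_copies)
            cudaMemcpy(g_real[c.handle], c.src, c.bytes, cudaMemcpyHostToDevice);
    }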
Runtime / scheduler interaction at task_begin(mem, grid, block):
1. The runtime sends a probe (mem, grid, block) to the scheduler.
2. The scheduler determines the target device.
3. The runtime receives the device id.
4. The runtime calls cudaSetDevice(id).
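A single-process sketch of this handshake, with schedule_on_device() standing in for the separate scheduler process (the actual transport between runtime and scheduler is not shown here):

    #include <cuda_runtime.h>
    #include <cstddef>

    struct Probe { std::size_t mem_bytes; dim3 grid; dim3 block; };

    // Stand-in for steps 1-3: in CASE the probe is sent to a scheduler process,
    // which answers with a device id.
    static int schedule_on_device(const Probe &p) {
        (void)p;
        return 0;  // a real scheduler applies the memory/warp policy shown next
    }

    void task_begin(std::size_t mem_bytes, dim3 grid, dim3 block) {
        Probe p{mem_bytes, grid, block};   // 1. build and send the probe
        int dev = schedule_on_device(p);   // 2-3. scheduler decides; runtime receives the id
        cudaSetDevice(dev);                // 4. bind this task's CUDA calls to the chosen GPU
    }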
CASE scheduling: Hardware emulation
CASE scheduling: Simplified
sched(task, GPUs):
    TargetG = None
    MinWarps = Inf
    for G in GPUs:
        if task.MemReq < G.FreeMem:
            if G.InUseWarps < MinWarps:
                MinWarps = G.InUseWarps
                TargetG = G
    if TargetG:
        TargetG.addWarps(task.Warps)
        TargetG.addMem(task.MemReq)
    return TargetG
Idea:
For an incoming task, find the GPU that (1) has enough free memory and (2) has the fewest in-use warps.
Memory (1) is a hard requirement; compute (2) is soft.
CASE scheduling
Example: an incoming task k0 with MemReq 6GB and 2M warps, and four candidate GPUs:
- G0.FreeMem: 2GB, G0.InUseWarps: 7M (not enough memory)
- G1.FreeMem: 10GB, G1.InUseWarps: 3M
- G2.FreeMem: 8GB, G2.InUseWarps: 2M
- G3.FreeMem: 4GB, G3.InUseWarps: 9M (not enough memory)
G1 and G2 both satisfy the 6GB memory requirement; G2 has fewer in-use warps, so k0 is placed on G2. After placement: G2.FreeMem: 2GB, G2.InUseWarps: 4M.
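A runnable sketch of the simplified policy applied to the numbers above (the GPU and Task structs, field names, and the main() driver are illustrative):

    #include <cstdio>
    #include <vector>

    struct GPU  { const char *name; double free_gb; double inuse_warps_M; };
    struct Task { double mem_gb; double warps_M; };

    GPU *sched(Task t, std::vector<GPU> &gpus) {
        GPU *target = nullptr;
        double min_warps = 1e30;
        for (auto &g : gpus) {
            // memory is a hard requirement, compute (in-use warps) is soft
            if (t.mem_gb < g.free_gb && g.inuse_warps_M < min_warps) {
                min_warps = g.inuse_warps_M;
                target = &g;
            }
        }
        if (target) {                        // commit the reservation
            target->inuse_warps_M += t.warps_M;
            target->free_gb -= t.mem_gb;
        }
        return target;
    }

    int main() {
        std::vector<GPU> gpus = {{"G0", 2, 7}, {"G1", 10, 3}, {"G2", 8, 2}, {"G3", 4, 9}};
        Task k0{6, 2};
        GPU *g = sched(k0, gpus);
        // Prints: k0 -> G2 (FreeMem 2GB, InUseWarps 4M), matching the figure.
        std::printf("k0 -> %s (FreeMem %.0fGB, InUseWarps %.0fM)\n",
                    g->name, g->free_gb, g->inuse_warps_M);
        return 0;
    }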
Outline
Machines
Workloads
Baselines
Single assignment scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k3 under the single-assignment scheduler.]
Core-to-GPU scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k7 under the core-to-GPU scheduler.]
schedGPU scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k3 under the schedGPU scheduler.]
k0-k1: 8GB footprint
k2-k3: 10GB footprint
GPU: 16GB memory
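For contrast, a hedged sketch of the memory-safe admission check that schedGPU-style co-scheduling performs (per Reaño et al., TPDS 2018); the GpuMem struct and try_admit() are illustrative, not schedGPU's actual API:

    #include <cstddef>

    struct GpuMem { std::size_t total_bytes; std::size_t reserved_bytes; };

    // A job may start using the GPU only if its footprint still fits alongside
    // the jobs already admitted; otherwise it waits instead of failing with an
    // out-of-memory error.
    bool try_admit(GpuMem &g, std::size_t footprint_bytes) {
        if (g.reserved_bytes + footprint_bytes > g.total_bytes) return false;
        g.reserved_bytes += footprint_bytes;
        return true;
    }

    // E.g., on a 16GB device two 8GB jobs can be admitted together (8 + 8 = 16),
    // while a further 10GB job must wait for one of them to release its memory.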
Summary of results
[Chart: 4xV100 throughput for the two CASE scheduling algorithms (HE: hardware emulation, S: simplified) on Rodinia.]
[Chart: 4xV100 throughput on Rodinia.]
[Chart: 4xV100 throughput on Darknet.]
Conclusion
2.5x throughput, 3.36x GPU utilization, 4.9x job turnaround speedup
Thank you!
Q&A
Extra slides
Workloads - Rodinia
Workload | Mix | Workload | Mix |
W1 | 16-job, 1:1-mix | W2 | 16-job, 2:1-mix |
W3 | 16-job, 3:1-mix | W4 | 16-job, 5:1-mix |
W5 | 32-job, 1:1-mix | W6 | 32-job, 2:1-mix |
W7 | 32-job, 3:1-mix | W8 | 32-job, 5:1-mix |
Generated from backprop (pattern recognition), srad-v1 and srad-v2 (image processing), lavaMD (molecular dynamics), needle (bioinformatics), dwt2d (image/video compression), and bfs (graph).
Ratio mixes are of the form large:small, where large jobs have kernels with footprints >4GB (up to ~12GB), and small jobs have kernels with 1-4GB footprints.
Workloads - Darknet
Task | Model |
NN training | Small Darknet architecture for CIFAR-10 |
NN prediction | Pre-trained Darknet19 and Darknet53-448x448 from 1000-class ImageNet |
Real-time object detection | Pre-trained yolov3-tiny architecture |
Text generation | Pre-trained RNN based on Shakespeare’s complete works |
2xP100 throughput on Rodinia
Percentage of job crashes for the core-to-GPU (CG) scheduler (P100s/V100s)
# of workers | 1:1 mix | 2:1 mix | 3:1 mix | 5:1 mix |
3/6 | 0/0 | 3%/6% | 8%/17% | 0/0 |
4/8 | 14%/13% | 6%/19% | 6%/25% | 9%/13% |
5/10 | 13%/15% | 13%/25% | 20%/20% | 22%/25% |
6/12 | 16%/33% | 17%/29% | 16%/38% | 16%/50% |
CASE average job turnaround speedup
GPUs | # of jobs | 1:1 mix | 2:1 mix | 3:1 mix | 5:1 mix |
2xP100s | 16 jobs | 4.9x | 2.3x | 4.9x | 4.3x |
2xP100s | 32 jobs | 4.6x | 3.2x | 3.6x | 2.0x |
4xV100s | 16 jobs | 2.4x | 2.0x | 3.5x | 2.6x |
4xV100s | 32 jobs | 3.8x | 2.9x | 2.9x | 2.6x |
Additional state-of-the-art work