CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
Chao Chen, Chris Porter, Santosh Pande
PPoPP 2022
High performance computing
Frontier: each node has 1 CPU + 4 GPUs
Summit (previous gen): each node has 2 CPUs + 6 GPUs
The issue
There is a need to improve GPU utilization
Our talk
Outline
State-of-the-art work
State-of-the-art work: schedGPU
Carlos Reaño, Federico Silla, Dimitrios S. Nikolopoulos, and Blesson Varghese. 2018. Intra-Node Memory Safe GPU Co-Scheduling. IEEE Trans. Parallel Distributed Syst. 29, 5 (2018), 1089–1102. https://doi.org/10.1109/TPDS.2017.2784428
Outline
The framework
Overview of CASE
Device 0:
SMs: 56
Mem: 16GB
Device 1:
SMs: 56
Mem: 16GB
Application 1
K1: 32 SMs, 6GB
K2: 18 SMs, 9GB
Application 2
K1: 36 SMs, 4GB
K2: 22 SMs, 9GB
The challenge
Constructing (virtual) GPU tasks statically
// VecAdd is a CUDA kernel executed on GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// main is sequential code running on CPU
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }
    task_begin(3 * N * sizeof(int), 128, N / 128); // the instrumented probe
    // allocate device memory
    cudaMalloc(&dA, N * sizeof(int)); // an input vector
    cudaMalloc(&dB, N * sizeof(int)); // an input vector
    cudaMalloc(&dC, N * sizeof(int)); // for storing the result

    // initialize the device memory
    cudaMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    cudaMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    task_end();
}
Computing resource: the kernel's launch configuration (grid and block dimensions), captured by the probe's second and third arguments.
Memory resource: the total device memory the kernel will need, aggregated from the cudaMalloc sizes and passed as the probe's first argument.
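As a rough illustration, the sketch below folds the probe's arguments into a per-task resource estimate; the TaskReq struct, make_task_req helper, and the warps-per-launch formula are illustrative assumptions, not CASE's actual internals.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical task descriptor: memory comes from the aggregated allocation
    // sizes, compute from the launch configuration.
    struct TaskReq {
        std::size_t mem_bytes;  // sum of device-allocation sizes reaching the kernel
        std::size_t warps;      // estimated warps for the launch
    };

    // Assumed mapping from the probe's arguments to a TaskReq (32 threads per warp).
    TaskReq make_task_req(std::size_t mem_bytes, dim3 grid, dim3 block) {
        TaskReq r;
        r.mem_bytes = mem_bytes;
        r.warps = static_cast<std::size_t>((block.x + 31) / 32) * grid.x;
        return r;
    }

    // For the VecAdd listing: mem = 3 * N * sizeof(int), and a grid of N/128 blocks
    // of 128 threads gives warps = 4 * (N / 128).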
Lazy runtime
// VecAdd is a CUDA kernel executed on GPU
__global__ void VecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

void init(int **A, int **B, int **C) {
    // allocate device memory (lazily: no device is touched yet)
    lazyMalloc(A, N * sizeof(int)); // an input vector
    lazyMalloc(B, N * sizeof(int)); // an input vector
    lazyMalloc(C, N * sizeof(int)); // for storing the result
}

// main is sequential code running on CPU
int main(int argc, char **argv) {
    int A[N], B[N], C[N], *dA, *dB, *dC;

    // initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = cos(i);
        B[i] = sin(i);
        C[i] = 0;
    }
    init(&dA, &dB, &dC);

    // initialize the device memory (recorded, not yet performed)
    lazyMemcpy(dA, A, N * sizeof(int), cudaMemcpyHostToDevice);
    lazyMemcpy(dB, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // launch the kernel on the device
    dim3 T(128), G(N / 128);
    task_begin_with_materialization(G, T, dA, dB, dC);
    VecAdd<<<G, T>>>(dA, dB, dC);

    // retrieve the result
    lazyMemcpy(C, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    lazyFree(dA);
    lazyFree(dB);
    lazyFree(dC);
    task_end();
}
[Figure: lazy-runtime bookkeeping. lazyMalloc returns placeholder handles (*A = 1, *B = 2, *C = 3) and records deferred operations such as (malloc, N) and (cpy, A, HtoD, N). The compiler replaces and instruments the code automatically; the runtime performs the actual operations once a device is chosen.]
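A minimal sketch of this bookkeeping is below, assuming a handle map and an operation log; the names (lazyMalloc, lazyMemcpy, materialize) follow the listing, but the data structures are simplified assumptions, and rewriting the kernel's handle arguments to real pointers at the launch site is omitted.

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <map>
    #include <vector>

    struct CopyOp { std::uintptr_t handle; const void *src; std::size_t bytes; };

    static std::map<std::uintptr_t, std::size_t> g_allocs;  // handle -> bytes, e.g. (malloc, N)
    static std::vector<CopyOp> g_copies;                     // deferred copies, e.g. (cpy, A, HtoD, N)
    static std::map<std::uintptr_t, void *> g_real;          // handle -> real device pointer
    static std::uintptr_t g_next_handle = 1;

    void lazyMalloc(int **p, std::size_t bytes) {
        *p = reinterpret_cast<int *>(g_next_handle);          // e.g. *A = 1, *B = 2, *C = 3
        g_allocs[g_next_handle++] = bytes;                    // no cudaMalloc yet
    }

    void lazyMemcpy(int *dst, const void *src, std::size_t bytes, cudaMemcpyKind) {
        g_copies.push_back({reinterpret_cast<std::uintptr_t>(dst), src, bytes});  // record only (H2D in this sketch)
    }

    // Replays the log with real CUDA calls once a device has been assigned;
    // called from task_begin_with_materialization().
    void materialize() {
        for (const auto &a : g_allocs) cudaMalloc(&g_real[a.first], a.second);
        for (const auto &c : g_copies)
            cudaMemcpy(g_real[c.handle], c.src, c.bytes, cudaMemcpyHostToDevice);
    }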
Runtime / scheduler interaction at task_begin(mem, grid, block):
1. The runtime sends a probe (mem, grid, block) to the scheduler.
2. The scheduler determines the target device.
3. The runtime receives the device id.
4. The runtime calls cudaSetDevice(id).
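A single-process sketch of this handshake, with schedule_on_device() standing in for the separate scheduler process (the actual transport between runtime and scheduler is not shown here):

    #include <cuda_runtime.h>
    #include <cstddef>

    struct Probe { std::size_t mem_bytes; dim3 grid; dim3 block; };

    // Stand-in for steps 1-3: in CASE the probe is sent to a scheduler process,
    // which answers with a device id.
    static int schedule_on_device(const Probe &p) {
        (void)p;
        return 0;  // a real scheduler applies the memory/warp policy shown next
    }

    void task_begin(std::size_t mem_bytes, dim3 grid, dim3 block) {
        Probe p{mem_bytes, grid, block};   // 1. build and send the probe
        int dev = schedule_on_device(p);   // 2-3. scheduler decides; runtime receives the id
        cudaSetDevice(dev);                // 4. bind this task's CUDA calls to the chosen GPU
    }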
CASE scheduling: Hardware emulation
CASE scheduling: Simplified
sched(task, GPUs):
    TargetG = None
    MinWarps = Inf
    for G in GPUs:
        if task.MemReq < G.FreeMem:
            if G.InUseWarps < MinWarps:
                MinWarps = G.InUseWarps
                TargetG = G
    if TargetG:
        TargetG.addWarps(task.Warps)
        TargetG.addMem(task.MemReq)
    return TargetG
Idea:
For an incoming task, find the GPU that (1) has enough free memory and (2) has the fewest in-use warps.
Memory (1) is a hard requirement; compute (2) is soft.
CASE scheduling
Example: an incoming task k0 with MemReq 6GB and 2M warps, and four candidate GPUs:
- G0.FreeMem: 2GB, G0.InUseWarps: 7M (not enough memory)
- G1.FreeMem: 10GB, G1.InUseWarps: 3M
- G2.FreeMem: 8GB, G2.InUseWarps: 2M
- G3.FreeMem: 4GB, G3.InUseWarps: 9M (not enough memory)
G1 and G2 both satisfy the 6GB memory requirement; G2 has fewer in-use warps, so k0 is placed on G2. After placement: G2.FreeMem: 2GB, G2.InUseWarps: 4M.
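A runnable sketch of the simplified policy applied to the numbers above (the GPU and Task structs, field names, and the main() driver are illustrative):

    #include <cstdio>
    #include <vector>

    struct GPU  { const char *name; double free_gb; double inuse_warps_M; };
    struct Task { double mem_gb; double warps_M; };

    GPU *sched(Task t, std::vector<GPU> &gpus) {
        GPU *target = nullptr;
        double min_warps = 1e30;
        for (auto &g : gpus) {
            // memory is a hard requirement, compute (in-use warps) is soft
            if (t.mem_gb < g.free_gb && g.inuse_warps_M < min_warps) {
                min_warps = g.inuse_warps_M;
                target = &g;
            }
        }
        if (target) {                        // commit the reservation
            target->inuse_warps_M += t.warps_M;
            target->free_gb -= t.mem_gb;
        }
        return target;
    }

    int main() {
        std::vector<GPU> gpus = {{"G0", 2, 7}, {"G1", 10, 3}, {"G2", 8, 2}, {"G3", 4, 9}};
        Task k0{6, 2};
        GPU *g = sched(k0, gpus);
        // Prints: k0 -> G2 (FreeMem 2GB, InUseWarps 4M), matching the figure.
        std::printf("k0 -> %s (FreeMem %.0fGB, InUseWarps %.0fM)\n",
                    g->name, g->free_gb, g->inuse_warps_M);
        return 0;
    }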
Outline
Machines
Workloads
Baselines
Single assignment scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k3 under the single-assignment scheduler.]
Core-to-GPU scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k7 under the core-to-GPU scheduler.]
schedGPU scheduler
[Figure: timeline (Time 0, Time 1, Time 2) of kernels k0–k3 under the schedGPU scheduler.]
k0-k1: 8GB footprint
k2-k3: 10GB footprint
GPU: 16GB memory
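For contrast, a hedged sketch of the memory-safe admission check that schedGPU-style co-scheduling performs (per Reaño et al., TPDS 2018); the GpuMem struct and try_admit() are illustrative, not schedGPU's actual API:

    #include <cstddef>

    struct GpuMem { std::size_t total_bytes; std::size_t reserved_bytes; };

    // A job may start using the GPU only if its footprint still fits alongside
    // the jobs already admitted; otherwise it waits instead of failing with an
    // out-of-memory error.
    bool try_admit(GpuMem &g, std::size_t footprint_bytes) {
        if (g.reserved_bytes + footprint_bytes > g.total_bytes) return false;
        g.reserved_bytes += footprint_bytes;
        return true;
    }

    // E.g., on a 16GB device two 8GB jobs can be admitted together (8 + 8 = 16),
    // while a further 10GB job must wait for one of them to release its memory.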
Summary of results
[Chart: 4xV100 throughput for the two CASE scheduling algorithms (HE: hardware emulation, S: simplified) on Rodinia.]
[Chart: 4xV100 throughput on Rodinia.]
[Chart: 4xV100 throughput on Darknet.]
Conclusion
2.5x throughput, 3.36x GPU utilization, 4.9x job turnaround speedup
Thank you!
Q&A
Extra slides
Workloads - Rodinia
Workload | Mix | Workload | Mix |
W1 | 16-job, 1:1-mix | W2 | 16-job, 2:1-mix |
W3 | 16-job, 3:1-mix | W4 | 16-job, 5:1-mix |
W5 | 32-job, 1:1-mix | W6 | 32-job, 2:1-mix |
W7 | 32-job, 3:1-mix | W8 | 32-job, 5:1-mix |
Generated from backprop (pattern recognition), srad-v1 and srad-v2 (image processing), lavaMD (molecular dynamics), needle (bioinformatics), dwt2d (image/video compression), and bfs (graph).
Ratio mixes are of the form large:small, where large jobs have kernels with footprints >4GB (up to ~12GB), and small jobs have kernels with 1-4GB footprints.
Workloads - Darknet
Task | Model |
NN training | Small Darknet architecture for CIFAR-10 |
NN prediction | Pre-trained Darknet19 and Darknet53-448x448 from 1000-class ImageNet |
Real-time object detection | Pre-trained yolov3-tiny architecture |
Text generation | Pre-trained RNN based on Shakespeare’s complete works |
2xP100 throughput on Rodinia
Percentage of job crashes for the core-to-GPU (CG) scheduler (P100s/V100s)
# of workers | 1:1 mix | 2:1 mix | 3:1 mix | 5:1 mix |
3/6 | 0/0 | 3%/6% | 8%/17% | 0/0 |
4/8 | 14%/13% | 6%/19% | 6%/25% | 9%/13% |
5/10 | 13%/15% | 13%/25% | 20%/20% | 22%/25% |
6/12 | 16%/33% | 17%/29% | 16%/38% | 16%/50% |
CASE average job turnaround speedup
GPUs | # of jobs | 1:1 mix | 2:1 mix | 3:1 mix | 5:1 mix |
2xP100s | 16 jobs | 4.9x | 2.3x | 4.9x | 4.3x |
2xP100s | 32 jobs | 4.6x | 3.2x | 3.6x | 2.0x |
4xV100s | 16 jobs | 2.4x | 2.0x | 3.5x | 2.6x |
4xV100s | 32 jobs | 3.8x | 2.9x | 2.9x | 2.6x |
Additional state-of-the-art work