Real-Time Multi-GPU
Rendering for Codec Avatars
Vasu Agrawal
2
3
Metric Telepresence
A GRAND CHALLENGE
Remote interactions that are indistinguishable
from in-person interactions
4
[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]
5
What is a Codec Avatar?
6
[Diagram: head-mounted capture sensors and the environment each feed an Encoder; the resulting Code is sent to a Decoder that drives the other person's head-mounted display.]
What is a Codec Avatar?
7
[Diagram: the same Sensor → Encoder → Code → Decoder → Display pipeline.]
A pair of functions (an encoder and a decoder) that disentangle TX, RX, and EX signals to minimize DISTORTION and LATENCY for telepresence.
Bringing the Metaverse to the
Next Billion Users via Codec Avatars
8
[GTC ’24 S63211]
What are we building now?
10
[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]
11
We need more completeness!
More expressive faces
Full body avatars
Universality
Environments & objects
Relighting & shadows
Dynamic clothing & hair
…
12
We’ll work on full body avatars!
I’ll work on relighting!
I’ll work on universality!
We’ll work on environments!
We’ll work on mobile!
I’ll work on objects!
I’ll work on hair!
LGTM!
We want to build “time machine” experiences
13
Goal: Rapid VR prototypes from hot research
14
15
16
17
Requirements: How to build a time machine
18
Time machines are easy!
19
20
I’ll work on objects!
I’ll work on hair!
We’ll work on full body avatars!
I’ll work on relighting!
I’ll work on universality!
We’ll work on environments!
We’ll work on mobile!
I wonder how research is going …
Problem: Each object can be rendered differently
21
Requirements: How to build a time machine
22
23
Each scene element is rendered differently
24
[Diagram: scene elements and their renderers: VR-NeRF, InstantNGP, and RFBGCA driven by a Code of ≈ 500 floats.]
Observation: Each renderer makes an RGBAD image
Environment
Object
Avatar 1
Avatar 2
25
Treat each renderer as a camera → RGBAD black box
26
[Diagram: each renderer (VR-NeRF, InstantNGP, RFBGCA with its ≈ 500-float Code) is wrapped behind the same Black Box Renderer interface.]
Compositing: Use depth to sort pixels and perform alpha blending
27
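The compositing code itself isn't shown in the deck; as a rough per-pixel sketch of the idea, here is depth-sorted, front-to-back alpha blending over the RGBAD layers, assuming straight (un-premultiplied) alpha. The RgbadSample and Rgba types and compositePixel are illustrative names; a real implementation would run this as a CUDA kernel over the whole framebuffer.

#include <algorithm>
#include <array>
#include <cstddef>

// One pixel's contribution from a single black-box renderer.
struct RgbadSample {
  float r, g, b, a;  // straight (un-premultiplied) alpha
  float depth;       // distance from the camera; smaller = closer
};

struct Rgba {
  float r = 0.f, g = 0.f, b = 0.f, a = 0.f;
};

// Composite N per-renderer samples for one pixel: sort by depth, then blend front-to-back.
template <size_t N>
Rgba compositePixel(std::array<RgbadSample, N> samples) {
  std::sort(samples.begin(), samples.end(),
            [](const RgbadSample& lhs, const RgbadSample& rhs) { return lhs.depth < rhs.depth; });

  Rgba out;
  for (const RgbadSample& s : samples) {
    const float weight = (1.f - out.a) * s.a;  // how much of this layer is still visible
    out.r += weight * s.r;
    out.g += weight * s.g;
    out.b += weight * s.b;
    out.a += weight;
    if (out.a >= 0.999f) break;  // pixel is effectively opaque; layers behind won't show
  }
  return out;
}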
29
30
31
33
Requirements: How to build a time machine
35
Each element & view is rendered sequentially
38
[Timeline: the Environment, Object, and Avatar 1 renders for the left and right eyes run back-to-back on a single GPU, spilling across many frame starts.]
Simplify rendering to a single renderer & view
39
[Timeline: just the Environment Left render, which by itself still spans several frame starts.]
Zoom timeline into a single frame
40
[Timeline: a single frame interval containing the Environment Left render.]
Let’s use more GPUs!
41
“The more (GPUs) you buy, the more (time) you save”
Jensen Huang
42
Lambda Quad: 4x 3090 workstation
43
Abstraction enables work distribution tuning
44
[Diagram: the Black Box Renderer is handed a row range (uint32_t row_start, uint32_t row_end), a cudaStream_t stream, and a uint32_t gpu_id, so work can be redistributed freely across GPUs.]
Start by splitting rendering evenly over GPUs
49
GPU-agnostic rendering interface
#include <cstdint>
#include <thread>
#include <vector>

template <typename Renderable, typename Config = DefaultConfig>
class MultiGpuRenderHelper {
 public:
  template <typename... Args>
  MultiGpuRenderHelper(
      std::vector<uint32_t> gpu_ids,  // CUDA GPU IDs to use for rendering
      uint32_t display_gpu_id,        // CUDA GPU ID of the display GPU (copy target)
      const Args&... args)            // additional arguments to pass to the Renderable ctor
      : gpu_ids_(std::move(gpu_ids)), display_gpu_id_(display_gpu_id) {
    // Initialize a new control thread per GPU. Each thread calls cudaSetDevice()
    // before constructing its Renderable, so all of its allocations land on that GPU.
    for (size_t gpu_index = 0; gpu_index < gpu_ids_.size(); ++gpu_index) {
      worker_threads_.emplace_back(
          &MultiGpuRenderHelper<Renderable, Config>::template workerThread<Args...>,
          this, gpu_index, args...);
    }
  }

  void updateWorkloadDistribution();
  void launchRenders(const CameraPinhole& camera, Framebuffer& rgbad_framebuffer);
  void synchronizeRenders();

 private:
  std::vector<uint32_t> gpu_ids_;
  uint32_t display_gpu_id_;
  std::vector<std::thread> worker_threads_;
};
50
GPU-agnostic rendering interface
class VrNerfRenderable {
 public:
  VrNerfRenderable(const std::filesystem::path& config_path) {
    // cudaGetDevice() fills in the device selected by the owning worker thread.
    int current_gpu = 0;
    cudaGetDevice(&current_gpu);
    initialize(static_cast<uint32_t>(current_gpu), config_path);
  }

  // Client-side rendering interface.
  // Top-level code handles allocation of rows.
  void renderAsync(
      const CameraPinhole& camera,
      const std::pair<uint32_t, uint32_t>& rows,
      CudaMemory<float4>& color_output_buffer,
      CudaMemory<float>& depth_output_buffer,
      cudaStream_t render_stream) {
    // (The row range and output buffer pointers are also passed to the kernel; elided on the slide.)
    renderKernel<<<numBlocks(rows), threadsPerBlock(rows), 0, render_stream>>>(camera);
  }
};
52
GPU-agnostic rendering interface
int main(int argc, char** argv) {
  const std::vector<uint32_t> gpu_ids = {0, 1, 2, 3};
  const uint32_t display_gpu_id = 0;
  const std::filesystem::path config_path = {};

  MultiGpuRenderHelper<VrNerfRenderable> helper(gpu_ids, display_gpu_id, config_path);

  while (true) {
    // Update camera and framebuffer
    const CameraPinhole camera = updateCamera();
    Framebuffer rgbad_framebuffer = getEmptyFramebuffer();

    // Update allocation of work to GPUs and launch rendering kernels
    helper.updateWorkloadDistribution();
    helper.launchRenders(camera, rgbad_framebuffer);

    // Wait for all rendering work to complete
    helper.synchronizeRenders();

    // Send the rendered frame(s) to the display
    sendToDisplays(rgbad_framebuffer);
  }
}
55
Start by splitting rendering evenly over GPUs
58
We expect equal per-GPU render times
59
[Timeline: GPUs 0–3 each render an equal slice of Environment Left; GPUs 1–3 then copy their slices to GPU 0 (the display GPU) before the next frame starts.]
Unfortunately, render times vary by tile
60
[Timeline: with an even split, the four slices finish at different times, so the frame waits on the slowest GPU before the copies run.]
Let’s reassign work based on throughput
61
P-controller to smoothly vary workloads
// Apply p-controller to each distribution
for (size_t gpu_index = 0; gpu_index < work_split_.size(); ++gpu_index) {
  work_split_[gpu_index] += pid_p_ * (target_ratio_[gpu_index] - work_split_[gpu_index]);
  work_split_[gpu_index] = std::max(work_split_[gpu_index], 0.01);  // minimum amount of work
}

// Normalize new distribution
double total_workload_distribution = std::accumulate(work_split_.begin(), work_split_.end(), 0.);
for (auto& ratio : work_split_) {
  ratio /= total_workload_distribution;
}

// Convert new distribution into work, handling rounding accordingly
uint32_t distributed = 0;
for (size_t gpu_index = 0; gpu_index < distributed_work_.size() - 1; ++gpu_index) {
  const uint32_t work = std::round(work_split_[gpu_index] * problem_size);
  const auto work_end = std::min(distributed + work, problem_size);
  distributed_work_[gpu_index] = std::make_pair(distributed, work_end);
  distributed = work_end;
}
distributed_work_.back() = std::make_pair(distributed, problem_size);
62
Aside: Heterogeneous GPUs also work
64
RTX 4090
RTX 3090
RTX 3090
Total render time is lower with work balancing
65
[Timeline: with balanced workloads, the four differently sized slices of Environment Left finish at roughly the same time, shortening the frame.]
Still not fast enough for VR
66
[Timeline: even with balanced slices, rendering Environment Left and Environment Right across four GPUs plus the copies to the display GPU overruns the VR frame budget.]
This strategy can scale to 20+ GPUs
67
Results from VR-NeRF on up to 20 Nvidia A40 (≈ 3090) GPUs
This strategy can scale to 20+ GPUs
68
Results from VR-NeRF on up to 18 Nvidia L40S (≈ 4090) GPUs
The Turtle: A 20 GPU rendering time machine
69
Front
Back
Adnacom S31
The Turtle: A 20 GPU rendering time machine
70
The Turtle: A 20 GPU rendering time machine
71
Requirements: How to build a time machine
72
74
Gaussian Splatting significantly increases FPS
75
Environment can now be rendered on 1 GPU
76
[Timeline: GPU 0 (the display GPU) renders both Environment Left and Environment Right within one frame; GPUs 1–3 sit idle.]
Let’s give each scene element its own GPU
77
[Timeline: GPU 0 renders the Environment, GPU 1 the Object, GPU 2 Avatar 1, and GPU 3 Avatar 2, each for both eyes; GPUs 1–3 then copy their results to the display GPU.]
All copies share the path into the display GPU, so they are serialized
78
[Timeline: the three copies into the display GPU queue up one after another, and GPUs idle while waiting for their turn.]
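As an illustration of where those copies come from, here is a minimal sketch of gathering each render GPU's output onto the display GPU with cudaMemcpyPeerAsync. The buffer and stream bookkeeping is hypothetical; the real code would reuse the helper's per-GPU streams.

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Gather each render GPU's RGBAD tile onto the display GPU.
// Even when issued on separate streams, these transfers all funnel into the
// display GPU's link, so in practice they complete one after another.
void gatherToDisplayGpu(
    const std::vector<int>& render_gpu_ids,
    const std::vector<const void*>& src_buffers,  // one device pointer per render GPU
    const std::vector<void*>& dst_buffers,        // destinations on the display GPU
    const std::vector<size_t>& num_bytes,
    int display_gpu_id,
    const std::vector<cudaStream_t>& copy_streams) {
  for (size_t i = 0; i < render_gpu_ids.size(); ++i) {
    cudaMemcpyPeerAsync(dst_buffers[i], display_gpu_id,
                        src_buffers[i], render_gpu_ids[i],
                        num_bytes[i], copy_streams[i]);
  }
}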
Use profilers to identify possible optimizations
79
Trace collected with NVIDIA Nsight Systems
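One cheap way to get readable traces like this is to annotate each renderer's work with NVTX ranges, which Nsight Systems picks up automatically; a minimal sketch (the function name is illustrative):

#include <nvtx3/nvToolsExt.h>  // NVTX annotations, shipped with recent CUDA toolkits

// Wrap each renderer's work in an NVTX range so it shows up as a labeled
// span on the timeline in Nsight Systems.
void renderOneElement(/* ... */) {
  nvtxRangePushA("Avatar 1 / left eye");
  // launch render kernels, issue copies, etc.
  nvtxRangePop();
}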
GPU utilization may have a long tail
80
[Timeline: per-element render times differ, so some GPUs go idle well before the copies and the next frame start.]
Use streams to expose inherent parallelism
81
Render each eye on a separate CUDA stream
82
[Timeline: each GPU renders its element's left and right eyes on separate CUDA streams, overlapping the two eye renders within the frame.]
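A minimal sketch of the per-eye streams, reusing the renderAsync interface from earlier. EyeBuffers and renderBothEyes are illustrative names, and a real renderer would create the streams once rather than per frame.

#include <cuda_runtime.h>
#include <cstdint>
#include <utility>

// Launch the left- and right-eye renders of one scene element on separate
// CUDA streams so the GPU can overlap / interleave them within the frame.
void renderBothEyes(VrNerfRenderable& renderable,
                    const CameraPinhole& left_cam, const CameraPinhole& right_cam,
                    EyeBuffers& left, EyeBuffers& right,  // hypothetical per-eye output buffers
                    const std::pair<uint32_t, uint32_t>& rows) {
  cudaStream_t left_stream, right_stream;
  cudaStreamCreate(&left_stream);
  cudaStreamCreate(&right_stream);

  renderable.renderAsync(left_cam, rows, left.color, left.depth, left_stream);
  renderable.renderAsync(right_cam, rows, right.color, right.depth, right_stream);

  // Wait for both eyes before handing the buffers to the copy / compositing stage.
  cudaStreamSynchronize(left_stream);
  cudaStreamSynchronize(right_stream);

  cudaStreamDestroy(left_stream);
  cudaStreamDestroy(right_stream);
}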
Overlapping copies with rendering can be slow
83
[Timeline: copies issued while the display GPU is still rendering contend with that rendering, stretching the frame.]
Finish display GPU work before starting copies
84
[Timeline: deferring the peer copies until the display GPU has finished its own renders removes the contention.]
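One way to enforce this ordering is with a CUDA event recorded on the display GPU's render stream that the copy stream waits on; a minimal sketch (function and parameter names are illustrative):

#include <cuda_runtime.h>
#include <cstddef>

// Make a peer copy wait until the display GPU has finished its own renders,
// so the incoming transfer doesn't contend with that rendering.
void copyAfterDisplayRender(cudaStream_t display_render_stream,
                            cudaStream_t copy_stream,
                            void* dst, int display_gpu_id,
                            const void* src, int src_gpu_id, size_t num_bytes) {
  cudaEvent_t display_done;
  cudaEventCreateWithFlags(&display_done, cudaEventDisableTiming);

  // Recorded on the display GPU's render stream...
  cudaEventRecord(display_done, display_render_stream);
  // ...and waited on by the copy stream before the transfer is allowed to start.
  cudaStreamWaitEvent(copy_stream, display_done, 0);
  cudaMemcpyPeerAsync(dst, display_gpu_id, src, src_gpu_id, num_bytes, copy_stream);

  cudaEventDestroy(display_done);
}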
Rendering may be slower on display GPU
85
[Timeline: the display GPU's renders take longer than the other GPUs', which now idle before the copies begin.]
Measure and use idle time
86
[Timeline: the measured idle gap before the copies is budget for more work ("Time for more work!"). Adjust render parameters to consume the idle time, e.g. …]
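A minimal sketch of measuring the render time with CUDA events; the idle headroom then follows by subtracting render, copy, and composite time from the frame budget (the function name is illustrative):

#include <cuda_runtime.h>

// Measure how long this GPU's renders took this frame. Comparing that against
// the frame budget gives the idle headroom that can be spent on extra quality.
float measureRenderMs(cudaStream_t render_stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, render_stream);
  // ... enqueue this GPU's render kernels on render_stream ...
  cudaEventRecord(stop, render_stream);
  cudaEventSynchronize(stop);

  float elapsed_ms = 0.f;
  cudaEventElapsedTime(&elapsed_ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return elapsed_ms;
}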
Ideal scheduling has no idle time before copies
87
[Timeline: with tuned workloads, every GPU finishes its renders just as the copies begin; no idle time remains before the copies.]
Real timeline includes compositing & display
88
[Timeline: the same schedule, followed by compositing and display work on the display GPU after the copies land.]
We want to build “time machine” experiences
89
GPU → GPU copies dominate the frame time
90
20x L40S GPU trace
10x L40S GPU trace
5x L40S GPU trace
Amdahl’s Law: Serial components limit scaling
“The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used”
91
Requirements: How to build a time machine
92
Copy bandwidth is an upper bound on scaling
93
The RTX 5090 includes PCIe Gen 5
94
2× improved upper bounds with PCIe Gen 5
95
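To see why copy bandwidth bounds scaling, a back-of-the-envelope calculation helps. The per-eye resolution, refresh rate, and usable PCIe bandwidths below are assumptions chosen only for illustration; the 20 bytes/pixel follows from the float4 color + float depth buffers in renderAsync.

#include <cstdio>

// Back-of-the-envelope copy budget. Resolution, refresh rate, and usable PCIe
// bandwidths are illustrative assumptions, not measurements.
int main() {
  constexpr double width = 2048, height = 2048;      // assumed per-eye resolution
  constexpr double bytes_per_pixel = 16 + 4;          // float4 color + float depth
  constexpr double eyes = 2, fps = 72;                 // assumed stereo @ 72 Hz

  constexpr double bytes_per_renderer_per_sec =
      width * height * bytes_per_pixel * eyes * fps;   // ≈ 12 GB/s per remote renderer

  constexpr double pcie_gen4_bps = 25e9;               // ~usable x16 Gen4 bandwidth
  constexpr double pcie_gen5_bps = 50e9;               // ~usable x16 Gen5 bandwidth

  std::printf("Per-renderer copy traffic: %.1f GB/s\n", bytes_per_renderer_per_sec / 1e9);
  std::printf("Max remote renderers, Gen4: %.1f\n", pcie_gen4_bps / bytes_per_renderer_per_sec);
  std::printf("Max remote renderers, Gen5: %.1f\n", pcie_gen5_bps / bytes_per_renderer_per_sec);
}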
PCIe bandwidth is outpacing VR resolution
96
Consider compressing images for transport
97
Compression tests with nvCOMP
98
Compression tests with nvCOMP
99
Compression tests with nvCOMP
100
Environment

| | Color f16x4 | Depth f16 | Color u8x4 | Depth u8 |
| --- | --- | --- | --- | --- |
| Compression Ratio | 1.58 | 1.53 | 2.76 | 1.53 |
| Compression Throughput (GB/s) | 57 | 14.3 | 28.7 | 14.3 |
| Decompression Throughput (GB/s) | 40.5 | 10.5 | 21.1 | 10.5 |

Avatar

| | Color f16x4 | Depth f16 | Color u8x4 | Depth u8 |
| --- | --- | --- | --- | --- |
| Compression Ratio | 6.6 | 8.4 | 7.95 | 8.4 |
| Compression Throughput (GB/s) | 57.8 | 14.5 | 32.2 | 16.2 |
| Decompression Throughput (GB/s) | 45 | 11.7 | 25.1 | 12.6 |

Bitcomp compression, L40S GPU, nvCOMP 3.0.5 HLIF, 8K chunk size
[Data courtesy of Nico Iskos at Nvidia]
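A quick sanity check on the table: if compression, copy, and decompression run back-to-back, the color f16x4 numbers give the effective costs per gigabyte below. The ~25 GB/s link bandwidth is an assumption; if the compress/decompress stages overlap with rendering, only the smaller copy sits on the shared link, which is where the real win is.

#include <cstdio>

// Is compress -> copy -> decompress worth it vs. copying raw? Pessimistically
// assumes the three stages run back-to-back with no pipelining.
int main() {
  constexpr double link_gbps = 25.0;  // assumed usable GPU -> GPU bandwidth

  struct Case { const char* name; double ratio, comp_gbps, decomp_gbps; };
  const Case cases[] = {
      {"Environment color f16x4", 1.58, 57.0, 40.5},
      {"Avatar color f16x4", 6.60, 57.8, 45.0},
  };

  for (const Case& c : cases) {
    // Seconds to move 1 GB of uncompressed data through each path.
    const double raw = 1.0 / link_gbps;
    const double compressed =
        1.0 / c.comp_gbps + 1.0 / (c.ratio * link_gbps) + 1.0 / c.decomp_gbps;
    std::printf("%s: raw %.1f ms/GB, compressed %.1f ms/GB (copy alone %.1fx smaller)\n",
                c.name, raw * 1e3, compressed * 1e3, c.ratio);
  }
}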
Requirements: How to build a time machine
102
104
105
106
Let’s chat!
vasuagrawal@meta.com
107
vasuagrawal.com