Real-Time Multi-GPU
Rendering for Codec Avatars
Vasu Agrawal
2
3
Metric Telepresence
A GRAND CHALLENGE
Remote interactions that are indistinguishable
from in-person interactions
4
[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]
5
What is a Codec Avatar?
6
[Diagram: head-mounted capture sensors and the environment each feed an Encoder; the resulting Code is sent to a Decoder that drives the other person's head-mounted display.]
What is a Codec Avatar?
7
[Diagram: the same Sensor → Encoder → Code → Decoder → Display pipeline.]
A pair of functions (an encoder and a decoder) that disentangle TX, RX, and EX signals to minimize DISTORTION and LATENCY for telepresence.
Bringing the Metaverse to the
Next Billion Users via Codec Avatars
8
[GTC ’24 S63211]
What are we building now?
10
[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]
11
We need more completeness!
More expressive faces
Full body avatars
Universality
Environments & objects
Relighting & shadows
Dynamic clothing & hair
…
12
We’ll work on full body avatars!
I’ll work on relighting!
I’ll work on universality!
We’ll work on environments!
We’ll work on mobile!
I’ll work on objects!
I’ll work on hair!
LGTM!
We want to build “time machine” experiences
13
Goal: Rapid VR prototypes from hot research
14
15
16
17
Requirements: How to build a time machine
18
Time machines are easy!
19
20
I’ll work on objects!
I’ll work on hair!
We’ll work on full body avatars!
I’ll work on relighting!
I’ll work on universality!
We’ll work on environments!
We’ll work on mobile!
I wonder how research is going …
Problem: Each object can be rendered differently
21
Requirements: How to build a time machine
22
23
Each scene element is rendered differently
24
[Diagram: scene elements and their renderers: VR-NeRF, InstantNGP, and RFBGCA driven by a Code of ≈ 500 floats.]
Observation: Each renderer makes an RGBAD image
Environment
Object
Avatar 1
Avatar 2
25
Treat each renderer as a camera → RGBAD black box
26
[Diagram: each renderer (VR-NeRF, InstantNGP, RFBGCA with its ≈ 500-float Code) is wrapped behind the same Black Box Renderer interface.]
Compositing: Use depth to sort pixels and perform alpha blending
27
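The compositing code itself isn't shown in the deck; as a rough per-pixel sketch of the idea, here is depth-sorted, front-to-back alpha blending over the RGBAD layers, assuming straight (un-premultiplied) alpha. The RgbadSample and Rgba types and compositePixel are illustrative names; a real implementation would run this as a CUDA kernel over the whole framebuffer.

#include <algorithm>
#include <array>
#include <cstddef>

// One pixel's contribution from a single black-box renderer.
struct RgbadSample {
  float r, g, b, a;  // straight (un-premultiplied) alpha
  float depth;       // distance from the camera; smaller = closer
};

struct Rgba {
  float r = 0.f, g = 0.f, b = 0.f, a = 0.f;
};

// Composite N per-renderer samples for one pixel: sort by depth, then blend front-to-back.
template <size_t N>
Rgba compositePixel(std::array<RgbadSample, N> samples) {
  std::sort(samples.begin(), samples.end(),
            [](const RgbadSample& lhs, const RgbadSample& rhs) { return lhs.depth < rhs.depth; });

  Rgba out;
  for (const RgbadSample& s : samples) {
    const float weight = (1.f - out.a) * s.a;  // how much of this layer is still visible
    out.r += weight * s.r;
    out.g += weight * s.g;
    out.b += weight * s.b;
    out.a += weight;
    if (out.a >= 0.999f) break;  // pixel is effectively opaque; layers behind won't show
  }
  return out;
}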
29
30
31
33
Requirements: How to build a time machine
35
Each element & view is rendered sequentially
38
[Timeline: the Environment, Object, and Avatar 1 renders for the left and right eyes run back-to-back on a single GPU, spilling across many frame starts.]
Simplify rendering to a single renderer & view
39
[Timeline: just the Environment Left render, which by itself still spans several frame starts.]
Zoom timeline into a single frame
40
[Timeline: a single frame interval containing the Environment Left render.]
Let’s use more GPUs!
41
“The more (GPUs) you buy, the more (time) you save”
Jensen Huang
42
Lambda Quad: 4x 3090 workstation
43
Abstraction enables work distribution tuning
44
[Diagram: the Black Box Renderer is handed a row range (uint32_t row_start, uint32_t row_end), a cudaStream_t stream, and a uint32_t gpu_id, so work can be redistributed freely across GPUs.]
Start by splitting rendering evenly over GPUs
49
GPU-agnostic rendering interface
#include <cstdint>
#include <thread>
#include <vector>

template <typename Renderable, typename Config = DefaultConfig>
class MultiGpuRenderHelper {
 public:
  template <typename... Args>
  MultiGpuRenderHelper(
      std::vector<uint32_t> gpu_ids,  // CUDA GPU IDs to use for rendering
      uint32_t display_gpu_id,        // CUDA GPU ID of the display GPU (copy target)
      const Args&... args)            // additional arguments to pass to the Renderable ctor
      : gpu_ids_(std::move(gpu_ids)), display_gpu_id_(display_gpu_id) {
    // Initialize a new control thread per GPU. Each thread calls cudaSetDevice()
    // before constructing its Renderable, so all of its allocations land on that GPU.
    for (size_t gpu_index = 0; gpu_index < gpu_ids_.size(); ++gpu_index) {
      worker_threads_.emplace_back(
          &MultiGpuRenderHelper<Renderable, Config>::template workerThread<Args...>,
          this, gpu_index, args...);
    }
  }

  void updateWorkloadDistribution();
  void launchRenders(const CameraPinhole& camera, Framebuffer& rgbad_framebuffer);
  void synchronizeRenders();

 private:
  std::vector<uint32_t> gpu_ids_;
  uint32_t display_gpu_id_;
  std::vector<std::thread> worker_threads_;
};
50
GPU-agnostic rendering interface
class VrNerfRenderable {
 public:
  VrNerfRenderable(const std::filesystem::path& config_path) {
    // cudaGetDevice() fills in the device selected by the owning worker thread.
    int current_gpu = 0;
    cudaGetDevice(&current_gpu);
    initialize(static_cast<uint32_t>(current_gpu), config_path);
  }

  // Client-side rendering interface.
  // Top-level code handles allocation of rows.
  void renderAsync(
      const CameraPinhole& camera,
      const std::pair<uint32_t, uint32_t>& rows,
      CudaMemory<float4>& color_output_buffer,
      CudaMemory<float>& depth_output_buffer,
      cudaStream_t render_stream) {
    // (The row range and output buffer pointers are also passed to the kernel; elided on the slide.)
    renderKernel<<<numBlocks(rows), threadsPerBlock(rows), 0, render_stream>>>(camera);
  }
};
52
GPU-agnostic rendering interface
int main(int argc, char** argv) {
  const std::vector<uint32_t> gpu_ids = {0, 1, 2, 3};
  const uint32_t display_gpu_id = 0;
  const std::filesystem::path config_path = {};

  MultiGpuRenderHelper<VrNerfRenderable> helper(gpu_ids, display_gpu_id, config_path);

  while (true) {
    // Update camera and framebuffer
    const CameraPinhole camera = updateCamera();
    Framebuffer rgbad_framebuffer = getEmptyFramebuffer();

    // Update allocation of work to GPUs and launch rendering kernels
    helper.updateWorkloadDistribution();
    helper.launchRenders(camera, rgbad_framebuffer);

    // Wait for all rendering work to complete
    helper.synchronizeRenders();

    // Send the rendered frame(s) to the display
    sendToDisplays(rgbad_framebuffer);
  }
}
55
Start by splitting rendering evenly over GPUs
58
We expect equal per-GPU render times
59
[Timeline: GPUs 0–3 each render an equal slice of Environment Left; GPUs 1–3 then copy their slices to GPU 0 (the display GPU) before the next frame starts.]
Unfortunately, render times vary by tile
60
[Timeline: with an even split, the four slices finish at different times, so the frame waits on the slowest GPU before the copies run.]
Let’s reassign work based on throughput
61
P-controller to smoothly vary workloads
// Apply p-controller to each distribution
for (size_t gpu_index = 0; gpu_index < work_split_.size(); ++gpu_index) {
  work_split_[gpu_index] += pid_p_ * (target_ratio_[gpu_index] - work_split_[gpu_index]);
  work_split_[gpu_index] = std::max(work_split_[gpu_index], 0.01);  // minimum amount of work
}

// Normalize new distribution
double total_workload_distribution = std::accumulate(work_split_.begin(), work_split_.end(), 0.);
for (auto& ratio : work_split_) {
  ratio /= total_workload_distribution;
}

// Convert new distribution into work, handling rounding accordingly
uint32_t distributed = 0;
for (size_t gpu_index = 0; gpu_index < distributed_work_.size() - 1; ++gpu_index) {
  const uint32_t work = std::round(work_split_[gpu_index] * problem_size);
  const auto work_end = std::min(distributed + work, problem_size);
  distributed_work_[gpu_index] = std::make_pair(distributed, work_end);
  distributed = work_end;
}
distributed_work_.back() = std::make_pair(distributed, problem_size);
62
Aside: Heterogeneous GPUs also work
64
RTX 4090
RTX 3090
RTX 3090
Total render time is lower with work balancing
65
[Timeline: with balanced workloads, the four differently sized slices of Environment Left finish at roughly the same time, shortening the frame.]
Still not fast enough for VR
66
[Timeline: even with balanced slices, rendering Environment Left and Environment Right across four GPUs plus the copies to the display GPU overruns the VR frame budget.]
This strategy can scale to 20+ GPUs
67
Results from VR-NeRF on up to 20 Nvidia A40 (≈ 3090) GPUs
This strategy can scale to 20+ GPUs
68
Results from VR-NeRF on up to 18 Nvidia L40S (≈ 4090) GPUs
The Turtle: A 20 GPU rendering time machine
69
Front
Back
Adnacom S31
The Turtle: A 20 GPU rendering time machine
70
The Turtle: A 20 GPU rendering time machine
71
Requirements: How to build a time machine
72
74
Gaussian Splatting significantly increases FPS
75
Environment can now be rendered on 1 GPU
76
[Timeline: GPU 0 (the display GPU) renders both Environment Left and Environment Right within one frame; GPUs 1–3 sit idle.]
Let’s give each scene element its own GPU
77
[Timeline: GPU 0 renders the Environment, GPU 1 the Object, GPU 2 Avatar 1, and GPU 3 Avatar 2, each for both eyes; GPUs 1–3 then copy their results to the display GPU.]
All copies share the path into the display GPU, so they are serialized
78
[Timeline: the three copies into the display GPU queue up one after another, and GPUs idle while waiting for their turn.]
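As an illustration of where those copies come from, here is a minimal sketch of gathering each render GPU's output onto the display GPU with cudaMemcpyPeerAsync. The buffer and stream bookkeeping is hypothetical; the real code would reuse the helper's per-GPU streams.

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Gather each render GPU's RGBAD tile onto the display GPU.
// Even when issued on separate streams, these transfers all funnel into the
// display GPU's link, so in practice they complete one after another.
void gatherToDisplayGpu(
    const std::vector<int>& render_gpu_ids,
    const std::vector<const void*>& src_buffers,  // one device pointer per render GPU
    const std::vector<void*>& dst_buffers,        // destinations on the display GPU
    const std::vector<size_t>& num_bytes,
    int display_gpu_id,
    const std::vector<cudaStream_t>& copy_streams) {
  for (size_t i = 0; i < render_gpu_ids.size(); ++i) {
    cudaMemcpyPeerAsync(dst_buffers[i], display_gpu_id,
                        src_buffers[i], render_gpu_ids[i],
                        num_bytes[i], copy_streams[i]);
  }
}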
Use profilers to identify possible optimizations
79
Trace collected with NVIDIA Nsight Systems
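One cheap way to get readable traces like this is to annotate each renderer's work with NVTX ranges, which Nsight Systems picks up automatically; a minimal sketch (the function name is illustrative):

#include <nvtx3/nvToolsExt.h>  // NVTX annotations, shipped with recent CUDA toolkits

// Wrap each renderer's work in an NVTX range so it shows up as a labeled
// span on the timeline in Nsight Systems.
void renderOneElement(/* ... */) {
  nvtxRangePushA("Avatar 1 / left eye");
  // launch render kernels, issue copies, etc.
  nvtxRangePop();
}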
GPU utilization may have a long tail
80
[Timeline: per-element render times differ, so some GPUs go idle well before the copies and the next frame start.]
Use streams to expose inherent parallelism
81
Render each eye on a separate CUDA stream
82
[Timeline: each GPU renders its element's left and right eyes on separate CUDA streams, overlapping the two eye renders within the frame.]
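A minimal sketch of the per-eye streams, reusing the renderAsync interface from earlier. EyeBuffers and renderBothEyes are illustrative names, and a real renderer would create the streams once rather than per frame.

#include <cuda_runtime.h>
#include <cstdint>
#include <utility>

// Launch the left- and right-eye renders of one scene element on separate
// CUDA streams so the GPU can overlap / interleave them within the frame.
void renderBothEyes(VrNerfRenderable& renderable,
                    const CameraPinhole& left_cam, const CameraPinhole& right_cam,
                    EyeBuffers& left, EyeBuffers& right,  // hypothetical per-eye output buffers
                    const std::pair<uint32_t, uint32_t>& rows) {
  cudaStream_t left_stream, right_stream;
  cudaStreamCreate(&left_stream);
  cudaStreamCreate(&right_stream);

  renderable.renderAsync(left_cam, rows, left.color, left.depth, left_stream);
  renderable.renderAsync(right_cam, rows, right.color, right.depth, right_stream);

  // Wait for both eyes before handing the buffers to the copy / compositing stage.
  cudaStreamSynchronize(left_stream);
  cudaStreamSynchronize(right_stream);

  cudaStreamDestroy(left_stream);
  cudaStreamDestroy(right_stream);
}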
Overlapping copies with rendering can be slow
83
[Timeline: copies issued while the display GPU is still rendering contend with that rendering, stretching the frame.]
Finish display GPU work before starting copies
84
[Timeline: deferring the peer copies until the display GPU has finished its own renders removes the contention.]
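One way to enforce this ordering is with a CUDA event recorded on the display GPU's render stream that the copy stream waits on; a minimal sketch (function and parameter names are illustrative):

#include <cuda_runtime.h>
#include <cstddef>

// Make a peer copy wait until the display GPU has finished its own renders,
// so the incoming transfer doesn't contend with that rendering.
void copyAfterDisplayRender(cudaStream_t display_render_stream,
                            cudaStream_t copy_stream,
                            void* dst, int display_gpu_id,
                            const void* src, int src_gpu_id, size_t num_bytes) {
  cudaEvent_t display_done;
  cudaEventCreateWithFlags(&display_done, cudaEventDisableTiming);

  // Recorded on the display GPU's render stream...
  cudaEventRecord(display_done, display_render_stream);
  // ...and waited on by the copy stream before the transfer is allowed to start.
  cudaStreamWaitEvent(copy_stream, display_done, 0);
  cudaMemcpyPeerAsync(dst, display_gpu_id, src, src_gpu_id, num_bytes, copy_stream);

  cudaEventDestroy(display_done);
}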
Rendering may be slower on display GPU
85
[Timeline: the display GPU's renders take longer than the other GPUs', which now idle before the copies begin.]
Measure and use idle time
86
[Timeline: the measured idle gap before the copies is budget for more work ("Time for more work!"). Adjust render parameters to consume the idle time, e.g. …]
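A minimal sketch of measuring the render time with CUDA events; the idle headroom then follows by subtracting render, copy, and composite time from the frame budget (the function name is illustrative):

#include <cuda_runtime.h>

// Measure how long this GPU's renders took this frame. Comparing that against
// the frame budget gives the idle headroom that can be spent on extra quality.
float measureRenderMs(cudaStream_t render_stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, render_stream);
  // ... enqueue this GPU's render kernels on render_stream ...
  cudaEventRecord(stop, render_stream);
  cudaEventSynchronize(stop);

  float elapsed_ms = 0.f;
  cudaEventElapsedTime(&elapsed_ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return elapsed_ms;
}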
Ideal scheduling has no idle time before copies
87
[Timeline: with tuned workloads, every GPU finishes its renders just as the copies begin; no idle time remains before the copies.]
Real timeline includes compositing & display
88
[Timeline: the same schedule, followed by compositing and display work on the display GPU after the copies land.]
We want to build “time machine” experiences
89
GPU → GPU copies dominate the frame time
90
20x L40S GPU trace
10x L40S GPU trace
5x L40S GPU trace
Amdahl’s Law: Serial components limit scaling
“The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used”
91
Requirements: How to build a time machine
92
Copy bandwidth is an upper bound on scaling
93
The RTX 5090 includes PCIe Gen 5
94
2× improved upper bounds with PCIe Gen 5
95
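To see why copy bandwidth bounds scaling, a back-of-the-envelope calculation helps. The per-eye resolution, refresh rate, and usable PCIe bandwidths below are assumptions chosen only for illustration; the 20 bytes/pixel follows from the float4 color + float depth buffers in renderAsync.

#include <cstdio>

// Back-of-the-envelope copy budget. Resolution, refresh rate, and usable PCIe
// bandwidths are illustrative assumptions, not measurements.
int main() {
  constexpr double width = 2048, height = 2048;      // assumed per-eye resolution
  constexpr double bytes_per_pixel = 16 + 4;          // float4 color + float depth
  constexpr double eyes = 2, fps = 72;                 // assumed stereo @ 72 Hz

  constexpr double bytes_per_renderer_per_sec =
      width * height * bytes_per_pixel * eyes * fps;   // ≈ 12 GB/s per remote renderer

  constexpr double pcie_gen4_bps = 25e9;               // ~usable x16 Gen4 bandwidth
  constexpr double pcie_gen5_bps = 50e9;               // ~usable x16 Gen5 bandwidth

  std::printf("Per-renderer copy traffic: %.1f GB/s\n", bytes_per_renderer_per_sec / 1e9);
  std::printf("Max remote renderers, Gen4: %.1f\n", pcie_gen4_bps / bytes_per_renderer_per_sec);
  std::printf("Max remote renderers, Gen5: %.1f\n", pcie_gen5_bps / bytes_per_renderer_per_sec);
}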
PCIe bandwidth is outpacing VR resolution
96
Consider compressing images for transport
97
Compression tests with nvCOMP
98
Compression tests with nvCOMP
99
Compression tests with nvCOMP
100
Environment

| | Color f16x4 | Depth f16 | Color u8x4 | Depth u8 |
| --- | --- | --- | --- | --- |
| Compression Ratio | 1.58 | 1.53 | 2.76 | 1.53 |
| Compression Throughput (GB/s) | 57 | 14.3 | 28.7 | 14.3 |
| Decompression Throughput (GB/s) | 40.5 | 10.5 | 21.1 | 10.5 |

Avatar

| | Color f16x4 | Depth f16 | Color u8x4 | Depth u8 |
| --- | --- | --- | --- | --- |
| Compression Ratio | 6.6 | 8.4 | 7.95 | 8.4 |
| Compression Throughput (GB/s) | 57.8 | 14.5 | 32.2 | 16.2 |
| Decompression Throughput (GB/s) | 45 | 11.7 | 25.1 | 12.6 |

Bitcomp compression, L40S GPU, nvCOMP 3.0.5 HLIF, 8K chunk size
[Data courtesy of Nico Iskos at Nvidia]
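A quick sanity check on the table: if compression, copy, and decompression run back-to-back, the color f16x4 numbers give the effective costs per gigabyte below. The ~25 GB/s link bandwidth is an assumption; if the compress/decompress stages overlap with rendering, only the smaller copy sits on the shared link, which is where the real win is.

#include <cstdio>

// Is compress -> copy -> decompress worth it vs. copying raw? Pessimistically
// assumes the three stages run back-to-back with no pipelining.
int main() {
  constexpr double link_gbps = 25.0;  // assumed usable GPU -> GPU bandwidth

  struct Case { const char* name; double ratio, comp_gbps, decomp_gbps; };
  const Case cases[] = {
      {"Environment color f16x4", 1.58, 57.0, 40.5},
      {"Avatar color f16x4", 6.60, 57.8, 45.0},
  };

  for (const Case& c : cases) {
    // Seconds to move 1 GB of uncompressed data through each path.
    const double raw = 1.0 / link_gbps;
    const double compressed =
        1.0 / c.comp_gbps + 1.0 / (c.ratio * link_gbps) + 1.0 / c.decomp_gbps;
    std::printf("%s: raw %.1f ms/GB, compressed %.1f ms/GB (copy alone %.1fx smaller)\n",
                c.name, raw * 1e3, compressed * 1e3, c.ratio);
  }
}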
Requirements: How to build a time machine
102
104
105
106
Let’s chat!
vasuagrawal@meta.com
107
vasuagrawal.com