1 of 108

Real-Time Multi-GPU Rendering for Codec Avatars

Vasu Agrawal

2 of 108

2

3 of 108

3

Metric Telepresence

A GRAND CHALLENGE

Remote interactions that are indistinguishable from in-person interactions

4 of 108

4

[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]

5 of 108

5

6 of 108

What is a Codec Avatar?

6

[Diagram: for each user, head-mounted capture Sensors feed an Encoder that produces a compact Code; a Decoder reconstructs the other person's avatar on the head-mounted Display, situated in a shared Environment]

7 of 108

What is a Codec Avatar?

7

[Diagram: Sensor → Encoder → Code → Decoder → Display, with the Environment as context]

A pair of functions (Encoder, Decoder) that disentangle TX, RX, and EX signals to minimize DISTORTION and LATENCY for telepresence

8 of 108

Bringing the Metaverse to the Next Billion Users via Codec Avatars

8

[GTC ’24 S63211]

9 of 108

What are we building now?

10 of 108

10

[Mark Zuckerberg: First Interview in the Metaverse | Lex Fridman Podcast #398]

11 of 108

11

We need more completeness!

  • More expressive faces
  • Full body avatars
  • Universality
  • Environments & objects
  • Relighting & shadows
  • Dynamic clothing & hair

12 of 108

12

We’ll work on full body avatars!

I’ll work on relighting!

I’ll work on universality!

We’ll work on environments!

We’ll work on mobile!

I’ll work on objects!

I’ll work on hair!

LGTM!

13 of 108

We want to build “time machine” experiences

13


14 of 108

Goal: Rapid VR prototypes from hot research

  • Quickly turn the latest offline research into low-latency real-time experiences in VR
  • Iterate quickly and offer fast feedback on future research direction and optimizations
  • Identify problems not visible in 2D images
  • Build small-scale “time machine” demos which offer a glimpse into the future, without requiring costly optimizations
  • Not constrained by product requirements

14

15 of 108

15

16 of 108

16

17 of 108

17

18 of 108

Requirements: How to build a time machine

  • Support [1, ∞) different scene elements
  • May run tethered to a single (powerful) workstation
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers

18

19 of 108

Time machines are easy!

19

20 of 108

20

I’ll work on objects!

I’ll work on hair!

We’ll work on full body avatars!

I’ll work on relighting!

I’ll work on universality!

We’ll work on environments!

We’ll work on mobile!

I wonder how research is going …

21 of 108

Problem: Each object can be rendered differently

  • Environments & Objects:
    • Mesh rasterization
    • Ray- or path-tracing
    • InstantNGP
    • HybridNeRF
    • Gaussian Splatting
  • Avatars:
    • Deep Appearance Models
    • Pixel Codec Avatars
    • Mixture of Volumetric Primitives
    • Relightable Gaussian Codec Avatars

21

22 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
  • Support [1, ∞) different scene elements
  • May run tethered to a single (powerful) workstation
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers

22

23 of 108

23

24 of 108

Each scene element is rendered differently

24

[Diagram: different elements use different renderers: VR-NeRF, InstantNGP, and RFBGCA driven by a Code (≈ 500 floats)]

25 of 108

Observation: Each renderer makes an RGBAD image

[Example RGBAD layers: Environment, Object, Avatar 1, Avatar 2]

25
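As a concrete sketch of what "RGBAD" means here (types and names are mine, not the talk's):

#include <cstdint>
#include <limits>
#include <vector>

struct RgbadPixel {
  float r = 0.f, g = 0.f, b = 0.f;  // color
  float a = 0.f;                    // alpha / coverage
  float d = std::numeric_limits<float>::infinity();  // metric depth; +inf where the layer is empty
};

// One layer per renderer per eye; layers are later depth-sorted and alpha-blended.
struct RgbadLayer {
  uint32_t width = 0, height = 0;
  std::vector<RgbadPixel> pixels;  // row-major, width * height entries
};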

26 of 108

Treat each renderer as a camera → RGBAD black box

26

[Diagram: each renderer (VR-NeRF, InstantNGP, and RFBGCA driven by a Code of ≈ 500 floats) is wrapped as a Black Box Renderer that maps a camera to an RGBAD image]

27 of 108

Compositing: Use depth to sort pixels and perform alpha blending

27
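A minimal sketch of that composite for a single output pixel (my illustration, not the talk's code): sort the per-layer samples front-to-back by depth, then accumulate with the standard "over" operator.

#include <algorithm>
#include <array>
#include <cstddef>

struct LayerSample {
  float r, g, b, a;  // straight (un-premultiplied) color and alpha from one layer
  float depth;       // metric depth; +inf where the layer is empty
};

// Front-to-back "over" compositing of one pixel across N layers.
// The result's r/g/b come out premultiplied by its accumulated alpha.
template <std::size_t N>
LayerSample compositePixel(std::array<LayerSample, N> samples) {
  std::sort(samples.begin(), samples.end(),
            [](const LayerSample& x, const LayerSample& y) { return x.depth < y.depth; });
  LayerSample out{0.f, 0.f, 0.f, 0.f, samples.front().depth};
  for (const LayerSample& s : samples) {
    const float weight = s.a * (1.f - out.a);  // contribution not yet occluded by closer layers
    out.r += weight * s.r;
    out.g += weight * s.g;
    out.b += weight * s.b;
    out.a += weight;
  }
  return out;
}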

28 of 108

29 of 108

29

30 of 108

30

31 of 108

31

32 of 108

33 of 108

33

34 of 108

35 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
  • Support [1, ∞) different scene elements
  • May run tethered to a single (powerful) workstation
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers

35

36 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
    • Treat each renderer as a black box and do pixel-wise depth-based alpha blending
  • Support [1, ∞) different scene elements
    • Render each element into a separate RGBAD layer and composite
  • May run tethered to a single (powerful) workstation
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers

36

38 of 108

Each element & view is rendered sequentially

38

[Timeline: within each frame, a single GPU renders Environment Left, Environment Right, Object Left, Object Right, Avatar 1 Left, and Avatar 1 Right one after another]

39 of 108

Simplify rendering to a single renderer & view

39

[Timeline: the same frame timeline reduced to just the Environment Left render]

40 of 108

Zoom timeline into a single frame

40

[Timeline: a single frame, showing the Environment Left render between two frame starts]

41 of 108

Let’s use more GPUs!

41

42 of 108

“The more (GPUs) you buy, the more (time) you save”

Jensen Huang

42

43 of 108

Lambda Quad: 4x 3090 workstation

43

44 of 108

Abstraction enables work distribution tuning

44

Black Box Renderer: uint32_t row_start, uint32_t row_end, cudaStream_t stream, uint32_t gpu_id

49 of 108

Start by splitting rendering evenly over GPUs

  • Give each of N GPUs 1/N of the total number of rows
  • Copy each rendered tile to the main (display) GPU
  • Display aggregated image on the screen or in VR for N-times speedup!

49
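A sketch of the even split (helper and names are mine): carve the image height into N nearly equal, contiguous row ranges; any remainder rows go to the first few GPUs.

#include <cstdint>
#include <utility>
#include <vector>

// Split `height` rows across `num_gpus` GPUs as [row_start, row_end) ranges.
std::vector<std::pair<uint32_t, uint32_t>> evenRowSplit(uint32_t height, uint32_t num_gpus) {
  std::vector<std::pair<uint32_t, uint32_t>> ranges;
  ranges.reserve(num_gpus);
  uint32_t row = 0;
  for (uint32_t g = 0; g < num_gpus; ++g) {
    const uint32_t rows = height / num_gpus + (g < height % num_gpus ? 1u : 0u);
    ranges.emplace_back(row, row + rows);
    row += rows;
  }
  return ranges;
}

Each GPU renders its range, and the finished tile can then be copied to the display GPU, e.g. with cudaMemcpyPeerAsync.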

50 of 108

GPU-agnostic rendering interface

template <typename Renderable, typename Config = DefaultConfig>
class MultiGpuRenderHelper {
 public:
  template <typename... Args>
  MultiGpuRenderHelper(
      std::vector<uint32_t> gpu_ids,  // CUDA GPU IDs to use for rendering
      uint32_t display_gpu_id,        // CUDA GPU ID of the display GPU (copy target)
      const Args&... args) {          // additional arguments to pass to Renderable ctor
    // Initialize a new control thread per GPU. Each thread calls cudaSetDevice() before calling ctor.
    for (size_t gpu_index = 0; gpu_index < gpu_ids_.size(); ++gpu_index) {
      worker_threads_.emplace_back(
          &MultiGpuRenderHelper<Renderable, Config>::template workerThread<Args...>,
          this, gpu_index, args...);
    }
  }

  void updateWorkloadDistribution();
  void launchRenders(const CameraPinhole& camera, Framebuffer& rgbad_framebuffer);
  void synchronizeRenders();
};

50

52 of 108

GPU-agnostic rendering interface

class VrNerfRenderable {
 public:
  VrNerfRenderable(const std::filesystem::path& config_path) {
    // The owning control thread has already called cudaSetDevice() for this GPU.
    int current_gpu = 0;
    cudaGetDevice(&current_gpu);
    initialize(current_gpu, config_path);
  }

  // Client-side rendering interface
  // Top-level code handles allocation of rows
  void renderAsync(
      const CameraPinhole& camera,
      const std::pair<uint32_t, uint32_t>& rows,
      CudaMemory<float4>& color_output_buffer,
      CudaMemory<float>& depth_output_buffer,
      cudaStream_t render_stream) {
    renderKernel<<<numBlocks(rows), threadsPerBlock(rows), 0, render_stream>>>(camera);
  }
};

52

55 of 108

GPU-agnostic rendering interface

int main(int argc, char** argv) {
  const std::vector<uint32_t> gpu_ids = {0, 1, 2, 3};
  const uint32_t display_gpu_id = 0;
  const std::filesystem::path config_path = {};

  MultiGpuRenderHelper<VrNerfRenderable> helper(gpu_ids, display_gpu_id, config_path);

  while (true) {
    // Update camera and framebuffer
    const CameraPinhole camera = updateCamera();
    Framebuffer rgbad_framebuffer = getEmptyFramebuffer();

    // Update allocation of work to GPUs and launch rendering kernels
    helper.updateWorkloadDistribution();
    helper.launchRenders(camera, rgbad_framebuffer);

    // Wait for all rendering work to complete
    helper.synchronizeRenders();

    // Send the rendered frame(s) to the display
    sendToDisplays(rgbad_framebuffer);
  }
}

55

58 of 108

Start by splitting rendering evenly over GPUs

  • Give each of N GPUs 1/N of the total number of rows
  • Copy each rendered tile to the main (display) GPU
  • Display aggregated image on the screen or in VR for N-times speedup!

58

59 of 108

We expect equal per-GPU render times

59

[Timeline: GPUs 0-3 each render an equal-height tile of Environment Left in the same amount of time; GPUs 1-3 then copy their tiles to the display GPU 0 before the next frame start]

60 of 108

Unfortunately, render times vary by tile

60

[Timeline: the four Environment Left tiles take different amounts of time, so the frame is gated on the slowest GPU before the copies to the display GPU complete]

61 of 108

Let’s reassign work based on throughput

  • Each pixel will take different amounts of time to render depending on scene content
  • Other background GPU load may also affect throughput
  • Measure previous frame throughput (rows / second)
  • Assume throughput is linear in the number of rows and use it to proportionally scale work assigned to each GPU
  • +50% FPS in VR-NeRF!

61
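One way the target ratios could be derived from the previous frame's measurements (a sketch in the same style as the code on the next slide; the names are mine):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert last frame's per-GPU timings into target work ratios:
// a GPU with higher measured throughput (rows / second) should receive more rows.
void updateTargetRatios(const std::vector<double>& render_seconds,
                        const std::vector<uint32_t>& rows_rendered,
                        std::vector<double>& target_ratio) {
  std::vector<double> throughput(render_seconds.size());
  double total = 0.;
  for (std::size_t i = 0; i < render_seconds.size(); ++i) {
    throughput[i] = rows_rendered[i] / std::max(render_seconds[i], 1e-6);
    total += throughput[i];
  }
  for (std::size_t i = 0; i < throughput.size(); ++i) {
    target_ratio[i] = throughput[i] / total;  // assumes render time is linear in row count
  }
}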

62 of 108

P-controller to smoothly vary workloads

// Apply p-controller to each distribution
for (size_t gpu_index = 0; gpu_index < work_split_.size(); ++gpu_index) {
  work_split_[gpu_index] += pid_p_ * (target_ratio_[gpu_index] - work_split_[gpu_index]);
  work_split_[gpu_index] = std::max(work_split_[gpu_index], 0.01);  // minimum amount of work
}

// Normalize new distribution
double total_workload_distribution = std::accumulate(work_split_.begin(), work_split_.end(), 0.);
for (auto& ratio : work_split_) {
  ratio /= total_workload_distribution;
}

// Convert new distribution into work, handling rounding accordingly
uint32_t distributed = 0;
for (size_t gpu_index = 0; gpu_index < distributed_work_.size() - 1; ++gpu_index) {
  const uint32_t work = std::round(work_split_[gpu_index] * problem_size);
  const auto work_end = std::min(distributed + work, problem_size);
  distributed_work_[gpu_index] = std::make_pair(distributed, work_end);
  distributed = work_end;
}
distributed_work_.back() = std::make_pair(distributed, problem_size);

62

64 of 108

Aside: Heterogeneous GPUs also work

64

[Photo: a workstation mixing one RTX 4090 with two RTX 3090s]

65 of 108

Total render time is lower with work balancing

65

[Timeline: with rebalanced tile heights, GPUs 0-3 finish their Environment Left tiles at nearly the same time, shortening the frame before the copies to the display GPU]

66 of 108

Still not fast enough for VR

66

[Timeline: even with balancing, rendering Environment Left and Environment Right across GPUs 0-3 plus the copies back to the display GPU still overruns the frame]

67 of 108

This strategy can scale to 20+ GPUs

67

Results from VR-NeRF on up to 20 Nvidia A40 (≈ 3090) GPUs

68 of 108

This strategy can scale to 20+ GPUs

68

Results from VR-NeRF on up to 18 Nvidia L40S (≈ 4090) GPUs

69 of 108

The Turtle: A 20 GPU rendering time machine

69

[Photos: front and back of the rack; Adnacom S31]

70 of 108

The Turtle: A 20 GPU rendering time machine

  • GPUs: 19x Nvidia L40S, 1x Nvidia RTX 6000 Ada
  • Servers: 2x Dell R7515, primary & backup
    • Epyc 7313P CPU
    • 256 GB DDR4 RAM
    • 2x Gen4x16 Liqid PCIe host cards
    • 1x Gen3x8 Adnacom PCIe host card
  • PCIe: Liqid Director, Switch, and 2x JBOG
  • USB: Adnacom S31 w/ HighPoint RocketU 1344A
    • Fiber: 2x 30m OS2 LC/LC (MG-OS2LCDX)
    • Transceivers: 4x 40G QSFP+ (FTL4C1QL2C)
  • PDU: Vertiv Geist NU30217, 8.6 kW
  • Network: Cisco Catalyst C9300-48UN-A
  • Rack: Soundproof USystems 24U Edge 3
  • Blanking: HotLok panel w/ temperature strip

70

71 of 108

The Turtle: A 20 GPU rendering time machine

71

72 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
    • Treat each renderer as a black box and do pixel-wise depth-based alpha blending
  • Support [1, ∞) different scene elements
    • Render each element into a separate RGBAD layer and composite
  • May run tethered to a single (powerful) workstation
    • Use 20x GPU “Turtle” machine if possible, 4x GPU workstation otherwise
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers

72

74 of 108

74

75 of 108

Gaussian Splatting significantly increases FPS

  • 3DGS can be >= 10x faster than NeRF-style methods like VR-NeRF
  • Fits directly into our multi-GPU rendering framework due to the black box renderer abstraction
  • Good thing we didn’t spend much time optimizing the old method!
  • Algorithmic innovations are where the orders-of-magnitude wins come from

75

76 of 108

Environment can now be rendered on 1 GPU

76

[Timeline: GPU 0 (display) now renders both Environment Left and Environment Right within the frame; GPUs 1-3 sit idle]

77 of 108

Let’s give each scene element its own GPU

77

[Timeline: GPU 0 (display) renders the Environment, GPU 1 the Object, GPU 2 Avatar 1, and GPU 3 Avatar 2, each for both eyes; GPUs 1-3 then copy their layers to the display GPU]

78 of 108

PCIe is a bus, so all copies are serialized

78

[Timeline: the three layer copies to the display GPU run one after another, leaving the worker GPUs idle at the end of the frame]

79 of 108

Use profilers to identify possible optimizations

79

Trace collected with NVIDIA Nsight Systems

80 of 108

GPU utilization may have long tail

80

[Timeline: the same per-element schedule; some renders have a long, low-utilization tail that delays the copies]

81 of 108

Use streams to expose inherent parallelism

  • Rendering code is unlikely to have 100% GPU utilization
  • Schedule additional renders (e.g. second eye) on another CUDA stream to allow GPU to run both concurrently
  • Less value as the CUDA code gets more efficient; the extra memory may not be worth it
  • Additional parallelism can help overcome inefficiency

81
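A sketch of the per-eye stream split (renderAsync as on the earlier slides; the cameras, buffers, and renderable are assumed to exist):

// Launch left- and right-eye renders on separate streams so the GPU can overlap them.
cudaStream_t left_stream, right_stream;
cudaStreamCreate(&left_stream);
cudaStreamCreate(&right_stream);

renderable.renderAsync(left_camera, rows, left_color, left_depth, left_stream);
renderable.renderAsync(right_camera, rows, right_color, right_depth, right_stream);

// Wait for both eyes before handing the layers off for copying and compositing.
cudaStreamSynchronize(left_stream);
cudaStreamSynchronize(right_stream);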

82 of 108

Render each eye on a separate CUDA stream

82

[Timeline: each GPU renders its element's left and right eyes on separate streams so the two renders overlap, shortening per-GPU render time]

83 of 108

Overlapping copies with rendering can be slow

83

[Timeline: peer copies that overlap the display GPU's own Environment renders contend with them and stretch the frame]

84 of 108

Finish display GPU work before starting copies

84

[Timeline: the display GPU's own Environment renders are scheduled to finish before the other GPUs' copies begin, so the copies no longer contend with its rendering]
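This ordering can be expressed with stock CUDA events (a sketch; the stream, buffer, and ID variables are assumed): record an event once the display GPU's own renders are enqueued, and make every peer copy wait on it.

// On the display GPU: mark the point where its own render work ends.
cudaEvent_t display_renders_done;
cudaEventCreateWithFlags(&display_renders_done, cudaEventDisableTiming);
cudaEventRecord(display_renders_done, display_render_stream);

// On each worker GPU's copy stream: wait for that event, then copy the layer over PCIe,
// so the copies never contend with the display GPU's own rendering.
cudaStreamWaitEvent(copy_stream, display_renders_done, 0);
cudaMemcpyPeerAsync(dst_on_display_gpu, display_gpu_id,
                    src_on_worker_gpu, worker_gpu_id,
                    layer_bytes, copy_stream);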

85 of 108

Rendering may be slower on display GPU

85

[Timeline: the display GPU's renders run slower than the workers' renders, so the worker GPUs go idle waiting for their copy slots]

86 of 108

Measure and use idle time

86

[Timeline: the worker GPUs finish early, leaving measurable idle time before their copies: time for more work!]

Adjust render parameters to consume idle time, e.g.:

  • Higher CLoD
  • More marching steps
  • Weaker foveation
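A toy version of that feedback (the knob and gain are invented; the real parameters are whatever the renderer exposes, such as CLoD or marching step count):

#include <algorithm>

// Nudge a quality knob so the measured idle fraction trends toward a small safety margin.
double updateQuality(double quality, double idle_fraction, double target_idle = 0.05) {
  const double gain = 0.5;  // conservative, so quality does not visibly pump frame to frame
  quality *= 1.0 + gain * (idle_fraction - target_idle);
  return std::clamp(quality, 0.25, 4.0);  // keep the knob in a sane range
}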


87 of 108

Ideal scheduling has no idle time before copies

87

[Timeline: every GPU's renders finish exactly when its copy slot begins, with no idle gaps before the copies]

88 of 108

Real timeline includes compositing & display

88

[Timeline: the same schedule with the display GPU's compositing and display submission added after the copies, filling out the real frame]

89 of 108

We want to build “time machine” experiences

89


90 of 108

GPU → GPU copies dominate the frame time

90

20x L40S GPU trace

10x L40S GPU trace

5x L40S GPU trace

91 of 108

Amdahl’s Law: Serial components limit scaling

“The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used”

91

92 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
    • Treat each renderer as a black box and do pixel-wise depth-based alpha blending
  • Support [1, ∞) different scene elements
    • Render each element into a separate RGBAD layer and composite
  • May run tethered to a single (powerful) workstation
    • Use 20x GPU “Turtle” machine if possible, 4x GPU workstation otherwise
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers
    • Carefully distribute work across many GPUs tile-wise, view-wise, and element-wise

92

93 of 108

Copy bandwidth is an upper bound on scaling

  • We need to copy #renderers × 2 eyes × (2K × 2K) RGBAD pixels to the display GPU every frame for compositing and VR output
  • Limited by PCIe bus speed, ~10-18 GB/s for Gen4x16 (3090 & 4090)
  • Quantize 1 Quest 3 frame to fewer bits, assuming 18 GB/s copy performance:
    • fp32: 10.4ms copy, 1.3x max scaling
    • fp16: 5.2ms copy, 2.7x max scaling
    • uint8: 2.6ms copy, 5.3x max scaling
  • I use uint8 color, fp16 depth

93
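For intuition, those copy times follow from the frame geometry if each eyebuffer is taken to be roughly 2160 × 2160 pixels with 5 channels (RGBA + D); the resolution is my assumption to make the numbers line up, and "max scaling" here reads as the number of such per-renderer copies that fit in one 72 FPS frame.

// Back-of-envelope for one renderer's per-frame copy at fp32.
const double pixels      = 2.0 * 2160 * 2160;        // both eyes
const double bytes_fp32  = pixels * 5 * 4;           // 5 channels x 4 bytes ≈ 187 MB
const double copy_ms     = bytes_fp32 / 18e9 * 1e3;  // ≈ 10.4 ms at 18 GB/s
const double frame_ms    = 1000.0 / 72.0;            // ≈ 13.9 ms budget at 72 FPS
const double max_scaling = frame_ms / copy_ms;       // ≈ 1.3x; fp16 and uint8 scale accordingly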

94 of 108

RTX 5090 includes PCIe Gen 5

94

95 of 108

2x improved upper bounds with PCIe gen 5

  • Assuming 2x improvement in transfer speeds to 36 GB/s in practice:
    • fp32: 5.2ms copy, 2.7x max scaling
    • fp16: 2.6ms copy, 5.3x max scaling
    • uint8: 1.3ms copy, 10.6x max scaling
  • Note: Numbers currently theoretical.

95

96 of 108

PCIe bandwidth is outpacing VR resolution

  • PCIe bandwidth is now increasing exponentially, doubling every few years
  • VR pixels / second increasing linearly
  • Copy time should trend towards 0
  • Fighting back against Amdahl!

96

97 of 108

Consider compressing images for transport

  • It’s possible to get 500+ GB/s lossless compression of image data on GPU with nvCOMP or dietGPU
  • Compression, PCIe copy, and decompression may be faster than directly copying raw bytes
  • Savings depend on compressibility
  • Avatars are much easier to compress than spaces due to black background

97
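A quick way to sanity-check whether that trade pays off (a sketch, not the talk's code): compare the raw copy time against compress + copy-compressed + decompress, using a measured ratio and the throughputs from the following slides, and pessimistically assuming the stages do not overlap.

// All throughputs in GB/s, sizes in bytes; ratio is uncompressed / compressed size.
bool compressionWins(double bytes, double ratio, double pcie_gbps,
                     double compress_gbps, double decompress_gbps) {
  const double raw_s        = bytes / (pcie_gbps * 1e9);
  const double compressed_s = bytes / (compress_gbps * 1e9)         // compress on the source GPU
                            + (bytes / ratio) / (pcie_gbps * 1e9)   // copy fewer bytes over PCIe
                            + bytes / (decompress_gbps * 1e9);      // decompress on the display GPU
  return compressed_s < raw_s;
}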

98 of 108

Compression tests with nvCOMP

98

99 of 108

Compression tests with nvCOMP

99

100 of 108

Compression tests with nvCOMP

100

Environment

                                  Color f16x4   Depth f16   Color u8x4   Depth u8
Compression Ratio                 1.58          1.53        2.76         1.53
Compression Throughput (GB/s)     57            14.3        28.7         14.3
Decompression Throughput (GB/s)   40.5          10.5        21.1         10.5

Avatar

                                  Color f16x4   Depth f16   Color u8x4   Depth u8
Compression Ratio                 6.6           8.4         7.95         8.4
Compression Throughput (GB/s)     57.8          14.5        32.2         16.2
Decompression Throughput (GB/s)   45            11.7        25.1         12.6

Bitcomp compression, L40S GPU, nvCOMP 3.0.5 HLIF, 8K chunk size

[Data courtesy of Nico Iskos at Nvidia]

102 of 108

Requirements: How to build a time machine

  • Handle elements made with bespoke, arbitrary rendering algorithms
    • Treat each renderer as a black box and do pixel-wise depth-based alpha blending
  • Support [1, ∞) different scene elements
    • Render each element into a separate RGBAD layer and composite
  • May run tethered to a single (powerful) workstation
    • Use 20x GPU “Turtle” machine if possible, 4x GPU workstation otherwise
  • Render to a VR headset (Quest 3): 72 FPS, dual 2K × 2K eyebuffers
    • Carefully distribute work across many GPUs tile-wise, view-wise, and element-wise
    • Efficiently copy elements back to the display GPU for VR display

102

104 of 108

104

105 of 108

105

106 of 108

106

107 of 108

Let’s chat!

vasuagrawal@meta.com

107

vasuagrawal.com

108 of 108

108