1 of 48

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

Chaojian Li, Sixu Li, Yang Zhao, Wenbo Zhu, and Yingyan (Celine) Lin

Georgia Institute of Technology

Efficient and Intelligent Computing Lab

2 of 48

NeRF as a Tool to Generate Novel Views

  • Neural Radiance Fields (NeRF) can generate arbitrary new views of a specific scene given sparsely sampled scene images

Video source: youtu.be/HfJpQCBTqZs

Inputs: Sparsely sampled views

Outputs: Images of any new view

3 of 48

SOTA Efficient NeRF’s Pipeline: How Does It Work?

  • Volume rendering 1) emits a ray from the origin of the view for each pixel and 2) aggregates the queried features of points along the ray

Video source: [Mildenhall et al., ECCV’20]
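The two volume-rendering steps on this slide can be sketched as follows. This is a minimal illustration, assuming the standard NeRF alpha-compositing rule; the function names `sample_ray` and `composite` and the uniform sampling are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def sample_ray(origin, direction, near, far, n_samples):
    """Step 1: emit a ray from the view origin and place points along it."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    return points, t

def composite(densities, colors, t):
    """Step 2: aggregate the queried per-point features into one pixel color."""
    # Spacing between neighbouring samples (last spacing is repeated)
    deltas = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    # Opacity contributed by each sample
    alphas = 1.0 - np.exp(-densities * deltas)
    # Fraction of light that reaches each sample unabsorbed
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = transmittance * alphas
    return weights @ colors
```

In this sketch, samples with zero density receive zero weight, which is why skipping empty space (discussed later in the deck) does not change the rendered color.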

4 of 48

Real-Time NeRF Is Increasingly Demanded

  • Real-Time NeRF can enhance numerous applications and features

Virtual Meetings

Metaverse

Autonomous Driving

Simulation

Source: shorturl.at/gCFMW

Source: shorturl.at/kmnvZ

Source: shorturl.at/fGUY7

5 of 48

SOTA Efficient NeRF vs. Desired Real-Time NeRF

  • Limitation 1: Large memory requirement
    • Requires > 54 GB for caching intermediate results [Garbin et al., ICCV’21]
    • The Oculus Quest 2 VR headset has only 6 GB of memory

[Figure: memory cost by technique, with a 6 GB reference line. Techniques: NeRF [Mildenhall et al., ECCV’20], FastNeRF [Garbin et al., ICCV’21], TensoRF [Chen et al., ECCV’22], our RT-NeRF]

6 of 48

SOTA Efficient NeRF vs. Desired Real-Time NeRF

  • Limitation 1: Large memory requirement
    • Requires > 54 GB for caching intermediate results [Garbin et al., ICCV’21]
    • The Oculus Quest 2 VR headset has only 6 GB of memory
  • Limitation 2: Low throughput
    • Real-time immersive interaction requires > 30 FPS
    • Rendering 800x800 images on an edge GPU achieves only 0.01 FPS [Chen et al., ECCV’22]

[Figure: memory cost by technique, with a 6 GB reference line, and throughput by technique, with a 30 FPS reference line. Techniques: NeRF [Mildenhall et al., ECCV’20], FastNeRF [Garbin et al., ICCV’21], TensoRF [Chen et al., ECCV’22], our RT-NeRF]

7 of 48

Contribution 1: Analyze the Efficiency Bottlenecks

  • Runtime breakdown on a SOTA efficient NeRF solution [Chen et al., ECCV’22]

Source: [Mildenhall et al., ECCV’20]

[Figure: runtime breakdown bar chart, y-axis 0% to 100%]

Map pixels to rays

8 of 48

Contribution 1: Analyze the Efficiency Bottlenecks

  • Runtime breakdown on a SOTA efficient NeRF solution [Chen et al., ECCV’22]

Source: [Mildenhall et al., ECCV’20]

[Figure: runtime breakdown bar chart, y-axis 0% to 100%]

Query the features of points along the rays

9 of 48

Contribution 1: Analyze the Efficiency Bottlenecks

  • Runtime breakdown on a SOTA efficient NeRF solution [Chen et al., ECCV’22]

Source: [Mildenhall et al., ECCV’20]

[Figure: runtime breakdown bar chart, y-axis 0% to 100%]

Render pixels’ colors

10 of 48

Contribution 2: Identify Two Key Bottlenecks

  • Dominant step: Query the features of points along the rays
    • Bottleneck 1 - Locate pre-existing points

Bottleneck 1

11 of 48

Contribution 2: Identify Two Key Bottlenecks

  • Dominant step: Query the features of points along the rays
    • Bottleneck 1 - Locate pre-existing points
    • Bottleneck 2 - Compute points’ embeddings

Bottleneck 1

Bottleneck 2

12 of 48

Zoom-in Bottleneck 1 (Locate Pre-Existing Points)

  • Existing works [Chen et al., ECCV’22]:
    • Query points’ existence based on a 3D binary occupancy grid
      1) If zero: skip the following steps
      2) If non-zero: continue with the following steps

13 of 48

Zoom-in Bottleneck 1 (Locate Pre-Existing Points)

  • Existing works [Chen et al., ECCV’22]:
    • Query points’ existence based on a 3D binary occupancy grid
      1) If zero: skip the following steps
      2) If non-zero: continue with the following steps

We identify the corresponding cause: Irregular accesses to the occupancy grid, because rays can come from any direction
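The occupancy-grid query described on this slide can be sketched as below. This is a minimal illustration, assuming an axis-aligned grid over a bounding box; the function name `occupied_samples` and the parameters `grid_min` and `grid_max` are hypothetical. The gather in the last step touches cells that depend on the ray direction, which is the source of the irregular accesses.

```python
import numpy as np

def occupied_samples(grid, origin, direction, t_vals, grid_min, grid_max):
    """Return the sample points on one ray that fall in non-zero occupancy cells."""
    pts = origin + t_vals[:, None] * direction            # (N, 3) sample positions
    res = np.array(grid.shape)
    # Map world coordinates to integer cell indices
    idx = np.floor((pts - grid_min) / (grid_max - grid_min) * res).astype(int)
    inside = np.all((idx >= 0) & (idx < res), axis=1)
    keep = np.zeros(len(pts), dtype=bool)
    # Data-dependent gather: which cells are read depends on the ray direction
    keep[inside] = grid[idx[inside, 0], idx[inside, 1], idx[inside, 2]] != 0
    # Zero cells are skipped; only non-zero cells continue to the next steps
    return pts[keep]
```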

14 of 48

Zoom-in Bottleneck 2 (Compute Points’ Embeddings)

  • Existing works [Chen et al., ECCV’22]:
    • Fetch the embeddings from a 3D decomposed grid

[Figure: the 3D grid is decomposed into matrix-vector pairs; each pair is combined by scalar multiplication (×) and the products are summed (+)]

15 of 48

Zoom-in Bottleneck 2 (Compute Points’ Embeddings)

  • Existing works [Chen et al., ECCV’22]:
    • Fetch the embeddings from a 3D decomposed grid

[Figure: the 3D grid is decomposed into matrix-vector pairs; each pair is combined by scalar multiplication (×) and the products are summed (+)]

We identify the corresponding cause: The sparse decomposed embedding grid is treated as a dense one, i.e., the sparsity was not leveraged.
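Fetching an embedding from matrix-vector pairs can be sketched as follows. This is a simplified illustration, assuming a TensoRF-style vector-matrix decomposition reduced to one scalar feature per point; the function name `vm_feature`, the rank `R`, and the array layout are assumptions, not taken from the slides.

```python
import numpy as np

def vm_feature(vecs, mats, x, y, z):
    """Evaluate one grid entry from three matrix-vector pairs.

    vecs: (vx, vy, vz), each of shape (R, n)
    mats: (myz, mxz, mxy), each of shape (R, n, n)
    """
    vx, vy, vz = vecs
    myz, mxz, mxy = mats
    # One scalar multiplication per pair and rank component, then a sum
    return (vx[:, x] * myz[:, y, z]
            + vy[:, y] * mxz[:, x, z]
            + vz[:, z] * mxy[:, x, y]).sum()
```

A dense evaluation like this one performs every multiplication even when a factor is zero, which is the inefficiency the slide points to.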

16 of 48

Overview of the Proposed RT-NeRF

  • To alleviate Bottleneck 1 (Locate the Pre-Existing Points): propose a new efficient rendering pipeline
    • Only query points in the cube

17 of 48

Overview of the Proposed RT-NeRF

  • To alleviate Bottleneck 1 (Locate the Pre-Existing Points): propose a new efficient rendering pipeline
    • Only query points in the cube
  • To alleviate Bottleneck 2 (Compute Points’ Embeddings): propose a hybrid sparse encoding scheme & bi-direction trees
    • Denser: bitmap-based
    • Sparser: coordinate-based

18 of 48

RT-NeRF’s Algorithm Contribution

  • To alleviate Bottleneck 1 (Locate the Pre-Existing Points): propose a new efficient rendering pipeline
    • Only query points in the cube
  • To alleviate Bottleneck 2 (Compute Points’ Embeddings): propose a hybrid sparse encoding scheme & bi-direction trees
    • Denser: bitmap-based
    • Sparser: coordinate-based

19 of 48

Contribution 3: Efficient Rendering Pipeline

  • Loop over the non-zeros of the occupancy grid instead of rays to be rendered, utilizing the corresponding 3D geometry

Only query points in the cube

20 of 48

Our Proposed Efficient Rendering Pipeline: Motivation

  • Regular accesses to the stored non-zero cubes’ location

Existing rendering pipeline [Chen et al., ECCV’22]: irregular accesses to the stored non-zero cubes’ location

Our proposed efficient rendering pipeline: regular accesses to the stored non-zero cubes’ location

21 of 48

Efficient Rendering Pipeline: Implementation

  • Challenge: Locating which points are inside each non-zero cube needs to loop over all sampled points

22 of 48

Efficient Rendering Pipeline: Implementation

  • Challenge: Locating which points are inside each non-zero cube needs to loop over all sampled points

Loop over all sampled points to locate the blue one

23 of 48

Efficient Rendering Pipeline: Implementation

  • Opportunity: All sampled points are regularly placed on the rendering plane when they are projected to the plane

Irregular point cloud in the 3D space

Regular grid in the 2D plane

24 of 48

Efficient Rendering Pipeline: Implementation

  • Step 1: Approximate each non-zero cube as a ball

25 of 48

Efficient Rendering Pipeline: Implementation

  • Step 1: Approximate each non-zero cube as a ball

26 of 48

Efficient Rendering Pipeline: Implementation

  • Step 2: Locate the intersection points of the ball and rays that pass through the ball
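Steps 1 and 2 can be sketched for a single non-zero cube as below. This is a rough illustration under stated assumptions, not the authors' implementation: a pinhole camera at the origin looking along +z, a circumscribed ball (the slides do not say which ball is used), and a ball fully in front of the camera. The function name `rays_hitting_cube` and its parameters are hypothetical. Because pixels form a regular 2D grid, only those near the projected ball centre are visited.

```python
import numpy as np

def rays_hitting_cube(center, side, focal, width, height):
    """Return (u, v, t_near, t_far) for pixels whose rays cross the cube's ball."""
    c = np.asarray(center, dtype=float)
    radius = side * np.sqrt(3.0) / 2.0              # Step 1: assumed circumscribed ball
    # Project the ball centre onto the regular pixel grid
    u0 = focal * c[0] / c[2] + width / 2.0
    v0 = focal * c[1] / c[2] + height / 2.0
    pad = focal * radius / (c[2] - radius)          # padded pixel radius (assumes c[2] > radius)
    hits = []
    for v in range(max(0, int(np.floor(v0 - pad))), min(height, int(np.ceil(v0 + pad)) + 1)):
        for u in range(max(0, int(np.floor(u0 - pad))), min(width, int(np.ceil(u0 + pad)) + 1)):
            d = np.array([(u - width / 2.0) / focal, (v - height / 2.0) / focal, 1.0])
            d /= np.linalg.norm(d)
            # Step 2: ray-ball intersection, solving |t*d - c|^2 = radius^2 for t
            b = d @ c
            disc = b * b - (c @ c - radius * radius)
            if disc >= 0.0:
                root = np.sqrt(disc)
                hits.append((u, v, b - root, b + root))
    return hits
```

Looping this routine over the stored non-zero cubes visits them in storage order, in contrast to per-ray lookups into the occupancy grid.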

27 of 48

RT-NeRF’s Hardware Contributions

  • To alleviate Bottleneck 1 (Locate the Pre-Existing Points): propose a new efficient rendering pipeline
    • Only query points in the cube
  • To alleviate Bottleneck 2 (Compute Points’ Embeddings): propose a hybrid sparse encoding scheme & bi-direction trees
    • Denser: bitmap-based
    • Sparser: coordinate-based

28 of 48

Contribution 4: Hybrid Sparse Encoding

  • For dense (< 80% sparsity) matrices
    • Encoding: Improved bitmap-based encoding format
    • Decoding: High-density sparse search unit
  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format
    • Decoding: Dual-purpose bi-direction adder & search tree

29 of 48

Our Proposed Hybrid Sparse Encoding: Motivation

  • Imbalanced sparsity patterns among different types of weights (4% to 92%)
  • Imbalanced sparsity patterns among different datasets (46% to 88%)

30 of 48

Our Proposed Hybrid Sparse Encoding: Motivation

  • Imbalanced sparsity patterns among different types of weights (4% to 92%)
  • Imbalanced sparsity patterns among different datasets (46% to 88%)

A uniform sparse encoding/decoding scheme is suboptimal

31 of 48

Hybrid Sparse Encoding: Design Targets

  • Hybrid sparse encoding to fulfill
    • Small storage size
    • High decoding throughput
    • High resource utilization

  • For dense (< 80% sparsity) matrices

Encoding Scheme  | Storage Size (↓) | Decoding Throughput (↑) | Resource Utilization (↑)
Bitmap-based     | 🌟🌟🌟           | 🌟                      | 🌟🌟🌟
Our proposed     | 🌟🌟🌟           | 🌟🌟🌟                  | 🌟🌟🌟

  • For sparse (≥ 80% sparsity) matrices

Encoding Scheme  | Storage Size (↓) | Decoding Throughput (↑) | Resource Utilization (↑)
Coordinate-based | 🌟🌟🌟           | 🌟🌟🌟                  | 🌟
Our proposed     | 🌟🌟🌟           | 🌟🌟🌟                  | 🌟🌟🌟

32 of 48

Vanilla Hybrid Sparse Encoding & Decoding Implementation

1) If sparsity ratio < 80%: Bitmap-based encoding & decoding
   Low decoding throughput due to location-dependent decoding latencies

2) If sparsity ratio ≥ 80%: Coordinate-based encoding & decoding
   Low decoding resource utilization when matrices are too sparse
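The vanilla hybrid selection rule can be sketched as below. This is a minimal illustration, assuming the 80% threshold from the slides; the dictionary layout and the function names (`sparsity`, `encode_bitmap`, `encode_coordinate`, `encode_hybrid`) are assumptions for illustration only.

```python
import numpy as np

SPARSITY_THRESHOLD = 0.80   # from the slides: < 80% bitmap, >= 80% coordinate

def sparsity(mat):
    """Fraction of zero entries in the matrix."""
    return 1.0 - np.count_nonzero(mat) / mat.size

def encode_bitmap(mat):
    """Bitmap-based: 1 bit of metadata per entry plus the non-zero values."""
    mask = mat != 0
    return {"format": "bitmap", "bitmap": mask, "values": mat[mask]}

def encode_coordinate(mat):
    """Coordinate-based: one (row, col) pair per non-zero value."""
    rows, cols = np.nonzero(mat)
    coords = list(zip(rows.tolist(), cols.tolist()))
    return {"format": "coordinate", "coords": coords, "values": mat[rows, cols]}

def encode_hybrid(mat):
    """Pick the encoding format from the matrix's sparsity ratio."""
    if sparsity(mat) >= SPARSITY_THRESHOLD:
        return encode_coordinate(mat)
    return encode_bitmap(mat)
```

The storage trade-off behind the threshold is that bitmap metadata grows with the matrix size, while coordinate metadata grows with the number of non-zeros.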

33 of 48

Low Decoding Throughput For Bitmap-based Decoding

Varying decoding latencies → Low decoding throughput

  • Decoding latency depends on the element’s location, so decoding throughput drops whenever a long-latency decode occurs

3 cycles for decoding*

8 cycles for decoding*

*Assuming an adder tree w/ 7 adders

34 of 48

Underutilization When Decoding Sparse Matrices

Wasted hardware resources → Underutilization

  • When the matrices are too sparse, the search trees that decode sparse (≥ 80% sparsity) matrices are underutilized

35 of 48

Contribution 5: Improved Bitmap-based Scheme to Boost Throughput

  • For dense (< 80% sparsity) matrices
    • Encoding: Improved bitmap-based encoding format

[Figure: sparse bitmap matrix, non-zero element array, and matrix row pointer vector (0, 4, 6, 10, 15, 17, 20, 24)]

Bitmap encoding: 1-bit binary metadata (1 or 0)

Matrix row pointer vector: Addresses of the 1st non-zero element of each row

36 of 48

Contribution 5: Improved Bitmap-based Scheme to Boost Throughput

  • For dense (< 80% sparsity) matrices
    • Encoding: Improved bitmap-based encoding format
    • Decoding: High-density sparse search unit → Fixed 3-cycle decoding to boost throughput

[Figure: sparse bitmap matrix, matrix row pointer vector (0, 4, 6, 10, 15, 17, 20, 24), and non-zero element array, connected through an index control unit and a 1-bit adder tree; in the example, the target row pointer value is 6 and the target element location is 7]

Cycle 1: Check whether the bitmap matrix element is 1 or 0

Cycle 2: Sum up the 1-bit bitmap vector and then add the row pointer value

Cycle 3: Fetch the target non-zero element
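The three decoding cycles can be sketched in software as below. This is a minimal illustration, assuming the non-zero element array is stored in row-major order and that Cycle 2 counts the 1-bits before the target column in its row; both conventions, and the function names `encode_bitmap_rows` and `decode_element`, are assumptions rather than details confirmed by the slides.

```python
import numpy as np

def encode_bitmap_rows(mat):
    """Build the bitmap, the row pointer vector, and the non-zero element array."""
    bitmap = mat != 0
    values = mat[bitmap]                                   # row-major non-zero elements
    counts = np.count_nonzero(bitmap, axis=1)
    # Address of the first non-zero element of each row
    row_ptr = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return bitmap, row_ptr, values

def decode_element(bitmap, row_ptr, values, r, c):
    """Return mat[r, c] from the encoded form."""
    # Cycle 1: check whether the bitmap entry is 1 or 0
    if not bitmap[r, c]:
        return 0
    # Cycle 2: sum the 1-bits before column c in row r, then add the row pointer
    addr = int(row_ptr[r]) + int(np.count_nonzero(bitmap[r, :c]))
    # Cycle 3: fetch the target non-zero element
    return values[addr]
```

Because the row pointer bounds the summation to a single row, the amount of work in Cycle 2 no longer depends on how far into the matrix the target element sits.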

37 of 48

Contribution 6: Bi-Direction Trees to Boost Utilization

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format
    • Decoding: Dual-purpose bi-direction adder & search tree

Mode 1: Adder Sub-tree A and Adder Sub-tree B (Compute Points’ Embeddings)

[Figure: 8×8 sparse matrix with row and column indices 0 to 7; embedding computation shown as products (×) that are summed (+)]

38 of 48

Contribution 6: Bi-Direction Trees to Boost Utilization

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format
    • Decoding: Dual-purpose bi-direction adder & search tree

Mode 1: Adder Sub-tree A and Adder Sub-tree B

Mode 2: Adder Sub-tree A and Search Sub-tree B

[Figure: 8×8 sparse matrix with row and column indices 0 to 7]

39 of 48

Bi-Direction Trees: Reconfigurable Implementation

[Figure legend: Search Path, Adder Path, Shared Path; Leaf Node, Trunk Node]

40 of 48

Evaluation Setup

  • Datasets: 8 scenes of the NeRF-Synthetic dataset [Mildenhall et al., ECCV’20]

  • Baselines from two categories:
    • 2 edge devices: Jetson Nano, ICARUS [Rao et al., SIGGRAPH Asia’2022]
    • 3 cloud devices: RTX 2080Ti, Tesla 2080Ti, Threadripper 3970x

41 of 48

RT-NeRF’s Speedup Over Baselines

  • Enables real-time NeRF with a 3,000× speedup over baselines

[Figure: comparison with edge devices on different datasets, with a 30 FPS reference line]

42 of 48

RT-NeRF’s Energy Efficiency Over Baselines

  • Enables 4,000× more energy-efficient NeRF on resource-constrained AR/VR devices

[Figure: comparison with edge devices on different datasets]

43 of 48

Summary

  • RT-NeRF: The first algorithm-hardware co-design acceleration of NeRF
  • RT-NeRF algorithm:
    • Integrates an efficient rendering pipeline to leverage the sparsity of pre-existing points
  • RT-NeRF hardware:
    • Adopts a hybrid encoding scheme and bi-direction trees to ensure efficient sparse decoding

Our RT-NeRF framework has delivered the first real-time neural rendering solution suited for edge applications

44 of 48

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

Chaojian Li, Sixu Li, Yang Zhao, Wenbo Zhu, and Yingyan (Celine) Lin

Georgia Institute of Technology

This work is supported by the National Science Foundation (NSF) through the CCRI program and the NSF RTML program

45 of 48

Hybrid Sparse Encoding: How to Encode Sparse Matrices?

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format

46 of 48

Hybrid Sparse Encoding: How to Encode Sparse Matrices?

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format

[Figure: 8×8 sparse matrix with row and column indices 0 to 7]

47 of 48

Hybrid Sparse Encoding: How to Encode Sparse Matrices?

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format

[Figure: 8×8 sparse matrix with row and column indices 0 to 7, next to a search tree]

Search tree: Store the coordinates in the leaves: (x,y) = (2,1), (0,3), (2,6), (3,7), (6,0), (5,1), (6,5), (4,6)

48 of 48

Hybrid Sparse Encoding: How to Decode Sparse Matrices?

  • For sparse (≥ 80% sparsity) matrices
    • Encoding: Coordinate-based encoding format

[Figure: 8×8 sparse matrix with row and column indices 0 to 7, next to a search tree, with the target non-zero element highlighted]

Search tree: Store the coordinates in the leaves: (x,y) = (2,1), (0,3), (2,6), (3,7), (6,0), (5,1), (6,5), (4,6)
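Coordinate-based encoding and lookup can be sketched in software as below. This is a minimal illustration: a binary search over lexicographically sorted coordinates stands in for the hardware search tree, and that ordering is an assumption (the slides do not state how the leaves are ordered). The function names `encode_coordinate` and `lookup` are hypothetical.

```python
from bisect import bisect_left

def encode_coordinate(entries):
    """entries: dict mapping (x, y) to a non-zero value.

    Returns the sorted coordinates (the tree leaves) and the aligned values.
    """
    coords = sorted(entries)
    values = [entries[c] for c in coords]
    return coords, values

def lookup(coords, values, x, y):
    """Search for (x, y); return its value, or 0 if the entry is zero."""
    i = bisect_left(coords, (x, y))
    if i < len(coords) and coords[i] == (x, y):
        return values[i]
    return 0

# Coordinates from the slide; the values 1..8 are made up for illustration.
slide_coords = [(2, 1), (0, 3), (2, 6), (3, 7), (6, 0), (5, 1), (6, 5), (4, 6)]
entries = {c: v for v, c in enumerate(slide_coords, start=1)}
coords, values = encode_coordinate(entries)
```

With eight stored coordinates, the search inspects at most a few entries per lookup, independent of the 64 positions in the matrix.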