1 of 70

Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

Sixu Li*, Chaojian Li*, Wenbo Zhu, Boyang Yu, Yang Zhao,

Cheng Wan, Haoran You, Huihong Shi, and Yingyan (Celine) Lin

Georgia Institute of Technology

The 50th International Symposium on

Computer Architecture (ISCA 2023)

2 of 70

Have you ever used 3D reconstruction to create something in the digital world?

[Source: https://jonbarron.info/mipnerf360]

3 of 70

Background: What is 3D Reconstruction?

[Figure: sparsely sampled input images → 3D Reconstruction → rendered novel views]

[Source: https://jonbarron.info/mipnerf360]

  • Input: Sparsely sampled images
  • Output: 2D images from any new view of the same scene

4 of 70

3D Reconstruction Demand has Surged

 

5 of 70

3D Reconstruction Demand has Surged

Virtual Telepresence · Metaverse · Rescue Robots

[Sources: y2u.be/TX9qSaGXFyg, y2u.be/afdnbXXbBTg, y2u.be/XXXMrD7aWNs]

 

  • Estimated market value: > $1.8 billion by 2028

[Source: shorturl.at/aDHY6]

6 of 70

On-Device 3D Reconstruction: Not Yet Possible 

  • Instant on-device 3D reconstruction is highly desirable

7 of 70

On-Device 3D Reconstruction: Not Yet Possible 

 

[Müller et al., SIGGRAPH’22]

8 of 70

Contribution 1: Identify the Bottleneck

 


15 of 70

Contribution 2: Instant-3D Algorithm

  • We leverage the fact that the reconstruction quality has different sensitivities to the color and density branches

16 of 70

Contribution 2: Instant-3D Algorithm

  • We leverage the fact that the reconstruction quality has different sensitivities to the color and density branches

17 of 70

Contribution 2: Instant-3D Algorithm

  • Experiments on the two branches w/ different grid sizes

| Norm. Grid Size of Density Branch | Norm. Grid Size of Color Branch | Avg. Training Runtime (s)* | Avg. Test PSNR/Accuracy** |
| 1    | 1    | 72           | 26.0 |
| 0.25 | 1    | 65 (↓ 9.7%)  | 25.4 |
| 1    | 0.25 | 63 (↓ 12.5%) | 26.0 | 👍 Winning configuration

* Training time is measured on an edge GPU [Source: shorturl.at/rFMS0]
** PSNR/accuracy is measured on the NeRF-Synthetic dataset [Mildenhall et al., ECCV’20]

[Figure: the color and density branches of the model, with the winning configuration highlighted]

18 of 70

Contribution 2: Instant-3D Algorithm

  • Experiments on the two branches w/ different update freq.

| Norm. Update Freq. of Density Branch | Norm. Update Freq. of Color Branch | Avg. Training Runtime (s)* | Avg. Test PSNR/Accuracy** |
| 1   | 1   | 72          | 26.0 |
| 0.5 | 1   | 67 (↓ 6.9%) | 24.3 |
| 1   | 0.5 | 65 (↓ 9.7%) | 25.9 | 👍 Winning configuration

* Training time is measured on an edge GPU [Source: shorturl.at/rFMS0]
** PSNR/accuracy is measured on the NeRF-Synthetic dataset [Mildenhall et al., ECCV’20]

[Figure: the color and density branches of the model, with the winning configuration highlighted]

19 of 70

Contribution 2: Instant-3D Algorithm

  • We leverage the fact that the reconstruction quality has different sensitivities to the color and density branches
  • Our Instant-3D algorithm therefore allocates different model complexities to the two branches (see the sketch below):
    • Density branch: larger grid size, more frequent weight updates
    • Color branch: smaller grid size, less frequent weight updates
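As a concrete illustration of this asymmetric allocation, here is a minimal PyTorch-style sketch (not the authors' implementation); the grid resolutions, update interval, and toy loss are assumed values for illustration only.

```python
# Hypothetical sketch of Instant-3D's asymmetric allocation: the density branch
# gets the larger grid and is updated every step, while the color branch gets a
# smaller grid and is updated less frequently. All sizes are placeholders.
import torch

DENSITY_RES, COLOR_RES = 64, 32          # density branch: larger grid
COLOR_UPDATE_EVERY = 2                   # color branch: less frequent updates

density_grid = torch.nn.Embedding(DENSITY_RES ** 3, 2)
color_grid = torch.nn.Embedding(COLOR_RES ** 3, 2)

opt_density = torch.optim.Adam(density_grid.parameters(), lr=1e-2)
opt_color = torch.optim.Adam(color_grid.parameters(), lr=1e-2)

def toy_loss(dg, cg):
    # Stand-in for volume rendering + photometric loss against training views.
    idx_d = torch.randint(0, dg.num_embeddings, (256,))
    idx_c = torch.randint(0, cg.num_embeddings, (256,))
    return dg(idx_d).square().mean() + cg(idx_c).square().mean()

for step in range(100):
    loss = toy_loss(density_grid, color_grid)
    loss.backward()
    opt_density.step()                   # density branch: update every iteration
    opt_density.zero_grad()
    if step % COLOR_UPDATE_EVERY == 0:   # color branch: update only every 2nd iteration
        opt_color.step()
        opt_color.zero_grad()
```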

20 of 70

Contribution 3: Instant-3D Accelerator

  • We observe: the memory access pattern during embedding grid interpolation is predictable

21 of 70

Contribution 3: Instant-3D Accelerator

  • We observe: the memory access pattern during embedding grid interpolation is predictable

 

 

 

 

 

 

[Figure: memory addresses of embedding-grid reads, which correlate with their positions in the 3D grid; avg. inter-group distance: 60,000; avg. intra-group distance: 2]
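To make this observation concrete, below is a small, self-contained profiling sketch (not from the paper) that measures how far apart addresses are within a burst of grid lookups versus across bursts; the synthetic trace and the notion of a "burst" are assumptions for illustration.

```python
# Hypothetical profiling sketch: given a trace of grid-lookup addresses emitted
# in bursts, measure how far apart addresses are within a burst vs. across bursts.
import random

def avg_intra_inter(bursts):
    intra, inter = [], []
    for b in bursts:
        intra += [abs(b[i + 1] - b[i]) for i in range(len(b) - 1)]
    for b0, b1 in zip(bursts, bursts[1:]):
        inter.append(abs(b1[0] - b0[-1]))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Synthetic trace: each burst touches nearby addresses; bursts land far apart.
bursts = []
for _ in range(100):
    base = random.randrange(0, 2**20)
    bursts.append([base + 2 * i for i in range(8)])

intra, inter = avg_intra_inter(bursts)
print(f"avg intra-burst distance: {intra:.1f}, avg inter-burst distance: {inter:.1f}")
```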

22 of 70

Contribution 3: Instant-3D Accelerator

  • Our Instant-3D accelerator reorganizes memory accesses to reduce data movement
  • We observe: the memory access pattern during embedding grid interpolation is predictable

23 of 70

Instant-3D Accelerator: Overview

  • The Instant-3D accelerator leverages the properties of the Instant-3D algorithm and our observations on its memory access patterns

Observation 1: Low utilization in read requests to SRAM (idle banks)

[Figure: SRAM bank activity over timesteps, showing idle banks]

24 of 70

Instant-3D Accelerator: Overview

  • The Instant-3D accelerator leverages the properties of the Instant-3D algorithm and our observations on its memory access patterns

Observation 1: Low utilization in read requests to SRAM
Observation 2: Frequent write requests to the same memory address

[Figure: memory addresses of write requests over time]

25 of 70

Instant-3D Accelerator: Overview

  • The Instant-3D accelerator leverages the properties of the Instant-3D algorithm and our observations on its memory access patterns

Observation 1: Low utilization in read requests to SRAM
Observation 2: Frequent write requests to the same memory address
Observation 3: Different model sizes for different applications

26 of 70

Instant-3D Accelerator: Overview

  • The Instant-3D accelerator further reduces the dominant memory accesses based on these observations

| Observation | Proposed Technique |
| Low utilization in read requests to SRAM | Pre-fetch & pre-execute memory accesses (Feed-Forward Read Mapper) |
| Frequent write requests to the same address | Accumulate requests into only the necessary ones (Back-Propagation Update Merger) |
| Different model sizes for different applications | Fuse processing cores for flexibility (Multi-Core Fusion) |

27 of 70

Instant-3D Acc.: Feed-Forward Read Mapper

  • We observe:
    • The large inter-group distance results in low bandwidth utilization when accessing multi-bank SRAM

28 of 70

Instant-3D Acc.: Feed-Forward Read Mapper

 

 

 

 

 

 

 

 

[Figure: memory addresses of grid reads and the SRAM banks they access; avg. inter-group distance: 60,000; avg. intra-group distance: 2; idle banks cause low utilization]

  • We observe:
    • The large inter-group distance results in low bandwidth utilization when accessing multi-bank SRAM

29 of 70

Instant-3D Acc.: Feed-Forward Read Mapper

  • We observe:
    • The large inter-group distance results in low bandwidth utilization when accessing multi-bank SRAM
    • Sequential read requests access different SRAM banks

[Figure: SRAM bank to be accessed at each timestep]

30 of 70

Instant-3D Acc.: Feed-Forward Read Mapper

  • We observe:
    • The large inter-group distance results in low bandwidth utilization when accessing multi-bank SRAM
    • Sequential read requests access different SRAM banks

  • Our Feed-Forward Read Mapper (FRM) pre-fetches and pre-executes memory accesses whenever idle banks exist (a behavioral sketch follows)

[Figure: Feed-Forward Read Mapper (FRM) unit]
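A minimal behavioral sketch of the FRM idea, assuming a simple bank-conflict model (this is an illustration, not the RTL): when the in-order read stream would leave SRAM banks idle in a cycle, look ahead in the request queue and issue pending reads that target those idle banks.

```python
# Behavioral sketch of a feed-forward read mapper (illustrative, not the RTL).
# Each pending read targets one SRAM bank; per cycle at most one read can be
# issued per bank. The mapper looks ahead in the queue and "pre-executes" reads
# whose banks would otherwise sit idle this cycle.
from collections import deque

NUM_BANKS = 4

def schedule(reads, lookahead=8):
    queue, cycles = deque(reads), 0
    while queue:
        busy, issued = set(), []
        # Scan up to `lookahead` pending reads; issue each one whose bank is free.
        for r in list(queue)[:lookahead]:
            bank = r % NUM_BANKS
            if bank not in busy:
                busy.add(bank)
                issued.append(r)
        for r in issued:
            queue.remove(r)
        cycles += 1
    return cycles

# Worst case for a naive in-order scheduler: many consecutive reads hit bank 0.
trace = [0, 4, 8, 1, 12, 16, 2, 3] * 10
print("cycles with look-ahead mapping:", schedule(trace))
print("cycles issuing strictly in order:", schedule(trace, lookahead=1))
```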

31 of 70

Instant-3D Acc.: Back-Prop. Update Merger

  • We observe: there exist frequent memory write requests to the same address during the back-propagation process

32 of 70

Instant-3D Acc.: Back-Prop. Update Merger

  • We observe: there exist frequent memory write requests to the same address during the back-propagation process

[Figure: write requests within a sliding window of 1,000 continuous accesses]

33 of 70

Instant-3D Acc.: Back-Prop. Update Merger

  • We observe: there exist frequent memory write requests to the same address during the back-propagation process

[Figure: memory addresses of write requests over time, exhibiting temporal locality]

34 of 70

Instant-3D Acc.: Back-Prop. Update Merger

  • We observe: there exist frequent memory write requests to the same address during the back-propagation process

  • Our Back-Propagation Update Merger (BUM) accumulates write requests in a buffer so that only the necessary SRAM accesses are issued (a behavioral sketch follows)
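A minimal behavioral sketch of the BUM idea (an illustration, not the RTL): gradient writes to the same hash-table address are merged in a small buffer, so only one SRAM update per unique address is needed when the buffer is flushed; the buffer capacity and request trace are assumed values.

```python
# Behavioral sketch of a back-propagation update merger (illustrative, not the RTL):
# gradient write requests to the same address are accumulated in a small buffer,
# so only one SRAM update per unique address is issued when the buffer is flushed.
def merge_updates(write_requests, buffer_capacity=16):
    buffer, sram_writes = {}, 0
    for addr, grad in write_requests:
        if addr in buffer:
            buffer[addr] += grad            # merged in the buffer, no SRAM access
        else:
            if len(buffer) == buffer_capacity:
                sram_writes += len(buffer)  # flush: one SRAM update per entry
                buffer.clear()
            buffer[addr] = grad
    sram_writes += len(buffer)              # final flush
    return sram_writes

reqs = [(a % 8, 0.1) for a in range(1000)]  # heavy reuse of 8 addresses
print("SRAM writes with merging:", merge_updates(reqs))
print("SRAM writes without merging:", len(reqs))
```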


39 of 70

Instant-3D Acc.: Multi-Core Fusion Scheme

  • We observe: different grid sizes are necessary to accommodate the Instant-3D algorithm and scenes of varying scales

40 of 70

Instant-3D Acc.: Multi-Core Fusion Scheme

  • We observe: different grid sizes are necessary to accommodate the Instant-3D algorithm and scenes of varying scales

Different Grid Sizes in Instant-3D Algorithm

41 of 70

Instant-3D Acc.: Multi-Core Fusion Scheme

  • We observe: different grid sizes are necessary to accommodate the Instant-3D algorithm and scenes of varying scales

Different Grid Sizes in Instant-3D Algorithm

Different Grid Sizes in Scenes with Varying Scales

42 of 70

Instant-3D Acc.: Multi-Core Fusion Scheme

  • We observe: different grid sizes are necessary to accommodate the Instant-3D algorithm and scenes of varying scales
  • Our Multi-Core Fusion Scheme fuses processing cores, each built for a fixed grid size, to support flexible grid sizes (a routing sketch follows)
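An illustrative sketch of how fusion could be modeled (an assumption about the scheme, not the actual hardware): each core owns a fixed-size slice of grid storage, and a larger grid is supported by fusing several cores and routing each lookup to the core that owns its slice.

```python
# Illustrative sketch of multi-core fusion: each core owns a fixed-size slice of
# hash-table storage; a larger grid is supported by fusing several cores and
# routing each lookup to the core that owns its slice. Capacity is an assumed value.
CORE_TABLE_ENTRIES = 2**14   # fixed per-core capacity (assumed)

def cores_needed(total_entries):
    return -(-total_entries // CORE_TABLE_ENTRIES)   # ceiling division

def route(address, total_entries):
    """Return (core_id, local_address) for a lookup into the fused table."""
    assert address < total_entries
    return address // CORE_TABLE_ENTRIES, address % CORE_TABLE_ENTRIES

total = 3 * CORE_TABLE_ENTRIES + 123    # a grid larger than one core can hold
print("fused cores:", cores_needed(total))
print("lookup 40000 ->", route(40000, total))
```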

43 of 70

Instant-3D Acc.: Multi-Core Fusion Scheme

  • Our Multi-Core Fusion Scheme fuses processing cores with a fixed grid size to support flexible grid sizes

 


47 of 70

Evaluation Setup

  • Consider 13 scenes from 3 datasets
    • Commonly-used NeRF-Synthetic dataset [Mildenhall et al., ECCV’20]
    • Large-scale SILVR dataset [Courteaux et al., MMSys’22]
    • Real-world-captured ScanNet dataset [Dai et al., CVPR’17]

48 of 70

Evaluation Setup

  • Consider 13 scenes from 3 datasets
    • Commonly-used NeRF-Synthetic dataset [Mildenhall et al., ECCV’20]
    • Large-scale SILVR dataset [Courteaux et al., MMSys’22]
    • Real-world-captured ScanNet dataset [Dai et al., CVPR’17]
  • Benchmark against 3 baselines with various computing resources

| Device    | Jetson Nano | Jetson TX2 | Xavier NX | Instant-3D (Ours) |
| SRAM      | 2.5 MB      | 5 MB       | 11 MB     | 1.5 MB            |
| Area      | 118 mm²     | N/A        | 350 mm²   | 6.8 mm²           |
| Frequency | 0.9 GHz     | 1.4 GHz    | 1.1 GHz   | 0.8 GHz           |

49 of 70

Instant-3D’s Speedup Over Baselines

 

[Figure: training runtime of Instant-NGP [Müller et al., SIGGRAPH’22] on an edge GPU vs. our Instant-3D]

50 of 70

Key Insights in Instant-3D

Instant-3D Algorithm
  • Observation: different quality sensitivities on the color and density branches
  • Solution: allocate different model complexities for color/density

Instant-3D Accelerator
  • Observation: predictable memory access patterns during the dominant steps
  • Solution: dedicated design to reorganize memory accesses and fuse cores

51 of 70

Key Insights in Instant-3D

Our Instant-3D has delivered the first instant on-device Neural Radiance Fields (NeRF)-based 3D reconstruction

  • Instant-3D Algorithm: allocate different model complexities for color/density
  • Instant-3D Accelerator: dedicated design to reorganize memory accesses and fuse cores

52 of 70

Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

The 50th International Symposium on

Computer Architecture (ISCA 2023)

This work was supported by the NSF through the CCF program and CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

Sixu Li*, Chaojian Li*, Wenbo Zhu, Boyang Yu, Yang Zhao,

Cheng Wan, Haoran You, Huihong Shi, and Yingyan (Celine) Lin

Georgia Institute of Technology

53 of 70

FRM & BUM Overheads

  • As shown in Fig. 15b, the FRM and BUM units take 25% of the area and energy but provide a 3.1× speedup.

54 of 70

Speedup Breakdown

55 of 70

Compare with NeRF Inference Accelerator

56 of 70

Why On-The-Fly 3D Reconstruction

  • Although offloading to the cloud is possible, it may cause undesired latency and privacy concerns
  • On-device 3D reconstruction offers
    • Smaller communication volume (a 20 MB reconstructed model vs. 120 MB of JPEG images)
    • Enhanced privacy
    • An alternative to offloading when the internet is unstable or unavailable

[Dai et al., CVPR’17]

57 of 70

Implementation Details

  • Technology Node: 28nm HPC+
  • Corner: TT 25℃
  • Instant-3D accelerator is implemented in RTL
  • Toolchain:
    • Synthesis: Synopsys Design Compiler
    • Place & Route: Cadence Innovus
  • IP used: DesignWare Floating Point Units

58 of 70

What If the Hash Table Size is Larger

  • For hash tables larger than the on-chip SRAM, the required processing time grows with the table size
  • For example, a 2 MB hash table needs to be processed in two passes (sketched below):
    • Step 1: load the first 50% of the table into SRAM and process the input 3D points
    • Step 2: load the remaining 50% of the table and process the same input again
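A minimal sketch of this two-pass scheme (illustrative only; the SRAM capacity and data layout are assumed):

```python
# Illustrative sketch of the two-pass scheme for a hash table larger than the
# on-chip SRAM: load one half of the table, process all sampled points that hit
# that half, then load the other half and process the same points again.
SRAM_ENTRIES = 1 << 16                       # assumed on-chip capacity (entries)

def process(points, table, chunks=2):
    chunk = len(table) // chunks
    results = [0.0] * len(points)
    for c in range(chunks):                  # load one chunk into SRAM at a time
        lo, hi = c * chunk, (c + 1) * chunk
        on_chip = table[lo:hi]
        for i, addr in enumerate(points):    # reuse the same input points
            if lo <= addr < hi:
                results[i] = on_chip[addr - lo]
    return results

table = [float(i) for i in range(2 * SRAM_ENTRIES)]   # table twice the SRAM capacity
points = [5, SRAM_ENTRIES + 7, 123]
print(process(points, table))
```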

59 of 70

Why Temporal Locality Exists

  • Different training rays can pass through the same sampled point, corresponding to the same memory address

60 of 70

Only Compare with Commercial Devices

  • We have tried our best to ensure fairness in terms of computing resources

| Device    | Jetson Nano | Jetson TX2 | Xavier NX | Instant-3D (Ours) |
| SRAM      | 2.5 MB      | 5 MB       | 11 MB     | 1.5 MB            |
| Area      | 118 mm²     | N/A        | 350 mm²   | 6.8 mm²           |
| Frequency | 0.9 GHz     | 1.4 GHz    | 1.1 GHz   | 0.8 GHz           |

61 of 70

Only Compare with Commercial Devices

  • We have tried our best to ensure fairness in terms of computing resources
  • As this is the first accelerator of its kind, we can only compare with the best commercial devices

62 of 70

Only Compare with Commercial Devices

  • We have tried our best to ensure fairness in terms of computing resources
  • As this is the first accelerator of its kind, we can only compare with the best commercial devices
  • We believe our Instant-3D can serve as a starting point for multiple communities' efforts to build the next 3D reconstruction accelerators

63 of 70

SOTA Efficient Algorithm: How Does it Work?

  • Volume rendering 1) casts a ray from the camera origin through each pixel and 2) aggregates the queried features of the points sampled along the ray (see the formulation below)

[Mildenhall et al., ECCV’20]
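For reference, the standard volume-rendering formulation from NeRF [Mildenhall et al., ECCV’20], which aggregates the densities σ_i and colors c_i of the samples along a ray with spacing δ_i:

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\,\mathbf{c}_i,
\qquad
T_i = \exp\!\Bigl(-\textstyle\sum_{j=1}^{i-1} \sigma_j \delta_j\Bigr)
```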

64 of 70

SOTA Efficient Algorithm: How Does it Work?

 

[Mildenhall et al., ECCV’20]

65 of 70

SOTA Efficient Algorithm: How Does it Work?

  • The queried features are generated by an MLP that takes the point's viewing direction and an embedding fetched from a hash table as input (a simplified sketch follows)

[Müller et al., SIGGRAPH’22]
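A simplified sketch of this query path (hash-grid embedding plus a small MLP). It collapses the real pipeline (trilinear interpolation of eight corner embeddings across multiple resolutions, with separate density and color networks) into a single lookup and a single MLP for brevity; the table size, feature width, and layer sizes are assumed values.

```python
# Simplified sketch of an Instant-NGP-style query: hash a 3D grid vertex into a
# feature table, then feed the gathered feature (plus view direction) through a
# small MLP. Sizes are illustrative, not the paper's exact configuration.
import torch

TABLE_SIZE, FEAT_DIM = 2**14, 2
table = torch.nn.Embedding(TABLE_SIZE, FEAT_DIM)
mlp = torch.nn.Sequential(
    torch.nn.Linear(FEAT_DIM + 3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)

PRIMES = (1, 2654435761, 805459861)   # spatial-hash primes from Müller et al.

def hash_vertex(ix, iy, iz):
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % TABLE_SIZE

def query(vertex, view_dir):
    feat = table(torch.tensor([hash_vertex(*vertex)]))          # gather embedding
    out = mlp(torch.cat([feat, view_dir.unsqueeze(0)], dim=-1)) # predict density + color
    return out[..., 0], out[..., 1:]

density, color = query((12, 5, 9), torch.tensor([0.0, 0.0, 1.0]))
print(density.shape, color.shape)
```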

66 of 70

Reason for the Observed Memory Access Pattern

  • We observe:
    • The large inter-group distance results in low bandwidth utilization when accessing multi-bank SRAM
    • Sequential read requests access different SRAM banks

  • These observations are unique to the SOTA efficient algorithm because of its random hash mapping: the corners of the same grid cell are accessed at scattered, effectively random addresses (see the sketch below)

[Müller et al., SIGGRAPH’22]
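As an illustration of this scattering, the sketch below hashes the eight corners of one grid cell with an Instant-NGP-style spatial hash (the primes follow Müller et al.; the table size is an assumed value) and prints the resulting, widely spread table addresses.

```python
# Illustrative check of why corner lookups scatter: hash the eight corners of a
# single grid cell with an Instant-NGP-style spatial hash and print how far
# apart the resulting table addresses are.
TABLE_SIZE = 2**19                       # assumed table size
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(ix, iy, iz):
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % TABLE_SIZE

cell = (40, 17, 92)                      # one cell; its 8 corners are neighbors in 3D
corners = [(cell[0] + dx, cell[1] + dy, cell[2] + dz)
           for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
addrs = sorted(spatial_hash(*c) for c in corners)
print(addrs)                             # spatially adjacent corners, scattered addresses
```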

67 of 70

Why Not Directly Update SRAM Frequently?

  • SRAMs are needed for large memory such as hash tables
    • SRAMs need 1 clock to read, 1 clock to write
  • If we directly access SRAM, we need 3 clocks
    • 1 for read, 1 for computation, 1 for write
  • If we utilize register-based buffers
    • Using combinational logic, read, compute, and write can be completed in a single cycle

68 of 70

We Already Have GPUs, Why Accelerator?

  • GPUs cannot satisfy the requirement of instant (< 5 seconds) 3D reconstruction on edge devices

69 of 70

We Already Have GPUs, Why Accelerator?

  • GPUs cannot satisfy the requirement of instant (< 5 seconds) 3D reconstruction on edge devices
  • It is difficult for GPUs to perform the precise, fine-grained memory accesses required by our proposed Instant-3D

70 of 70

A Good Time for Building Accelerator?

  • 3D reconstruction is a next-technology disruptor, calling for joint efforts from both the algorithm and hardware communities
  • Instant on-device NeRF training enables multiple applications
  • The SOTA efficient algorithm has attracted 340+ citations and 11k+ GitHub stars within one year and is widely adopted
  • Our efforts contribute to this critical stage of efficient NeRF development