1 of 21

CUDA Ray Tracer

Abhi S.

Noah K.

Jared G.

2 of 21

What is CUDA?

  • CUDA (Compute Unified Device Architecture) is a parallel computing platform and API by NVIDIA.
  • An API for using GPUs to process tasks traditionally handled by CPUs, significantly speeding up computations.
  • Used for GPGPU (general-purpose computing on GPUs): applying the GPU to applications other than graphics processing.

3 of 21

Ray Tracing

  • Fire a ray from the camera position through a pixel in the near-plane.
  • Collide the ray with every sphere in the scene; save the minimum distance and the center of that sphere.
  • Accumulate the color of the sphere multiplied by the light contribution.
  • Compute the normal by subtracting the center of the hit sphere from the collision position.
  • Reflect the ray across the normal; repeat collision and accumulation until max_iter or until the ray misses all spheres. Add the skybox contribution on a miss.
  • Divide the accumulated color by the number of bounces.
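The loop above can be sketched as a per-pixel CUDA kernel. This is a minimal sketch, not the project's actual code: `Ray`, `Hit`, `MAX_BOUNCES`, and helpers like `rayThroughPixel`, `hitScene`, `lightContribution`, `skyboxColor`, `reflectRay`, and the `float3` operators are assumed placeholders.

```cuda
__global__ void traceKernel(float3 *framebuffer, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    Ray ray = rayThroughPixel(x, y);        // camera -> near-plane pixel
    float3 color = make_float3(0, 0, 0);
    int bounce = 0;
    for (; bounce < MAX_BOUNCES; ++bounce) {
        Hit hit;
        if (!hitScene(ray, &hit)) {         // test against every sphere
            color += skyboxColor(ray.dir);  // miss: add skybox contribution
            break;
        }
        color += hit.color * lightContribution(hit);
        float3 n = normalize(hit.pos - hit.center);  // normal from sphere center
        ray = reflectRay(ray, hit.pos, n);  // bounce and continue
    }
    framebuffer[y * width + x] = color / float(bounce + 1);  // average over bounces
}
```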

4 of 21

Ray Tracing

(Same ray-tracing steps as the previous slide, shown as a diagram.)

[Figure: a ray fired from the camera bounces off spheres 1, 2, and 3, then misses the scene.]

5 of 21

Comparison

[Images: the same scene rendered by the serial version and the CUDA version.]

6 of 21

Performance with Varying Sphere Count

7 of 21

Performance with Varying Sphere Count

8 of 21

CUDA Memory Management

CUDA offers a "Unified Memory" abstraction that keeps the device and host address spaces in sync.

Our problem was simple enough that we could manage device memory manually.

Steps:

  1. Allocate input/output buffers on device and host
     a. Inputs: Spheres, Materials, Camera, etc.
     b. Outputs: Framebuffer (pixel data)

For each frame:

  • Memcpy input data (host -> device)
  • Run Kernel
  • Call cudaDeviceSynchronize() to block the CPU until the kernel finishes
  • Memcpy framebuffer (device -> host)
  • Render it!
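The per-frame flow above, sketched with the CUDA runtime API (buffer names, sizes, and the kernel's parameter list here are illustrative, not the project's actual code):

```cuda
// One-time setup: allocate device buffers.
Sphere *d_spheres;
float3 *d_framebuffer;
cudaMalloc(&d_spheres, numSpheres * sizeof(Sphere));
cudaMalloc(&d_framebuffer, width * height * sizeof(float3));

// Each frame:
cudaMemcpy(d_spheres, h_spheres, numSpheres * sizeof(Sphere),
           cudaMemcpyHostToDevice);          // input: host -> device
raytraceKernel<<<blocks, threads>>>(d_spheres, numSpheres, d_framebuffer);
cudaDeviceSynchronize();                     // block CPU until kernel finishes
cudaMemcpy(h_framebuffer, d_framebuffer, width * height * sizeof(float3),
           cudaMemcpyDeviceToHost);          // output: device -> host
```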

9 of 21

CUDA Abstraction

  • Threads are the smallest unit of execution.
  • Threads are grouped into blocks.
  • All the blocks form the grid.

Threads < Thread-Blocks < Grid
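The hierarchy fixes each thread's position: inside a kernel, a thread's global 1-D index is `blockIdx.x * blockDim.x + threadIdx.x`. The same arithmetic in plain C++ (the function name is ours, for illustration):

```cpp
// Global 1-D thread index, as a kernel computes it from its block
// coordinate, block size, and thread coordinate within the block.
int globalIndex(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}
```

For example, thread 5 of block 2 with 64 threads per block has global index 2 * 64 + 5 = 133.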

10 of 21

CUDA Architecture

  • Blocks are executed on a Streaming Multiprocessor (SM).
  • The device is a grid of many SMs.
  • Threads on an SM share an L1 cache.
  • All threads share a global L2 cache.

[Diagram: SMs 1-4, each with its own L1 cache, all sharing a global L2 cache.]

11 of 21

CUDA Architecture

In hardware, SMs are implemented with warps.

[Diagram: SM 1 contains warps W0-W3, each a group of threads T0, T1, T2, T3, ....]

A warp is a set of 32 threads within a block.

- All threads in a warp execute the same instruction.

- Context switches are done for whole warps.

- Threads in a warp that branch differently become serial.

(thread-divergent branches)


12 of 21

Thread-Divergent Branches

if (threadIdx.x % 2)
    a();
else
    b();
c();

a and b won't run in parallel on threads in the same warp.

Since the threads have diverged, c effectively runs at half the speed!

Thread-divergent branches are bad for parallelism.

13 of 21

CUDA Kernel Configuration

  • 1 Thread per pixel
  • 1 Ray per thread
    • No dependencies between rays.
  • 50 Bounces per ray for reflections
    • Ends early if ray “misses”
  • All bounces are calculated sequentially.

  • Want to group threads into blocks such that all threads in a block have similar runtime, to avoid thread-divergent branches.

[Figure: a ray fired from the camera bounces off spheres 1, 2, and 3, then misses the scene.]

14 of 21

CUDA Kernel Configuration

15 of 21

CUDA Kernel Configuration

Observation:

Every pixel in the orange square will have the same runtime: 0 bounces.

16 of 21

CUDA Kernel Configuration

Observation:

Pixels within a square will generally have a similar # of bounces.


17 of 21

CUDA Kernel Configuration

Observation: Nearby pixels will

  • Have similar number of bounces (runtime).
  • Access nearby memory in the framebuffer and accumulation buffer.
    • Better utilize SM's L1 cache.

Warps are assigned 32 threads, so we make each block an 8x8 chunk of pixels (64 threads = 2 warps per block).


dim3 blocks(width / 8, height / 8);

dim3 threads(8, 8);

raytraceKernel<<<blocks, threads>>>(...);
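With this configuration, each thread recovers its pixel as x = blockIdx.x * 8 + threadIdx.x and y = blockIdx.y * 8 + threadIdx.y. The mapping, checked in plain C++ (function name is illustrative):

```cpp
#include <utility>

// Pixel coordinates for a thread in an 8x8 block layout:
// (blockX, blockY) is the block's grid position, (threadX, threadY)
// the thread's position within its block.
std::pair<int, int> pixelFor(int blockX, int blockY, int threadX, int threadY) {
    return {blockX * 8 + threadX, blockY * 8 + threadY};
}
```

For example, thread (3, 5) of block (2, 1) shades pixel (19, 13).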

18 of 21

Scatter

No scatter

Too perfect

With scatter

Too noisy

19 of 21

Accumulation

With Accumulation

Just right :)

[Diagram: Frame 1 + Frame 2 + Frame 3 + ... accumulated together.]

No Accumulation

Too noisy
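Accumulation can be implemented as a running average over frames; a minimal host-side C++ sketch of the per-channel update (the incremental-average form is our choice, not necessarily the project's):

```cpp
// Fold the latest noisy sample into a per-pixel running average:
// avg_n = avg_{n-1} + (sample - avg_{n-1}) / n, with frames numbered from 1.
float accumulateSample(float avg, float sample, int frame) {
    return avg + (sample - avg) / frame;
}
```

After three frames with values 0.9, 0.3, and 0.6, the average settles to (0.9 + 0.3 + 0.6) / 3 = 0.6.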

20 of 21

Ray Tracing

(Recap of the ray-tracing steps above.)

21 of 21

Collisions and Light Contribution

Light: on intersection, a dot product is taken against the surface normal to calculate the light contribution applied to the color.

Specular: after the intersection, a new ray is cast in the direction reflected across the normal.

Scattering: a random, uniformly distributed vector is added to the subsequent ray direction, scaled according to the roughness.

Intersect: take the dot product of the difference between the ray origin and the sphere center with the ray direction; using the radius, apply the Pythagorean theorem, then subtract and add to get the two hit distances.
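The intersection test described above, sketched in plain C++ (struct and function names are illustrative): project the origin-to-center vector onto the ray direction with a dot product, apply the Pythagorean theorem against the radius, then subtract and add the half-chord to get the two hit distances.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Ray-sphere intersection. `dir` must be normalized. Writes the nearest
// positive hit distance to *tHit and returns true, or returns false on a miss.
bool intersectSphere(Vec3 origin, Vec3 dir, Vec3 center, float radius,
                     float *tHit) {
    Vec3 oc = sub(center, origin);
    float tCenter = dot(oc, dir);                 // projection onto the ray
    float d2 = dot(oc, oc) - tCenter * tCenter;   // squared ray-to-center distance
    if (d2 > radius * radius) return false;       // ray passes outside the sphere
    float half = std::sqrt(radius * radius - d2); // Pythagorean half-chord
    float t = tCenter - half;                     // near hit; tCenter + half is the far hit
    if (t < 0) t = tCenter + half;                // ray origin inside the sphere
    if (t < 0) return false;                      // sphere entirely behind the ray
    *tHit = t;
    return true;
}
```

A ray along +z from the origin hits a unit sphere centered at z = 5 at distance 4; shifting the sphere off to the side makes it miss.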