2 of 23

What is it?

X = 10-year anniversary of the Cycles release in 2011
Time for a refresh to address some long-standing issues
6-month project worked on by Brecht and Sergey
Development happens in the cycles-x branch, with freedom to make breaking changes; a prototype has been in the works for about a month

This presentation contains a mix of planned work and speculative future direction that may or may not happen in those 6 months.

3 of 23

User Level Goals

Remove burden of tweaking settings to get least noise and best performance
More interactive viewport rendering
Workflow for resumable and incremental rendering
Improved performance for CPU and GPU rendering

4 of 23

Render Kernel

5 of 23

Kernel

Single split kernel implementation for CPU and GPU
Goals

Better GPU occupancy and performance
Keep UI responsive and avoid GPU timeouts
Prepare for CPU batch shading and packet tracing
Reduce code duplication and maintenance cost
Reduce kernel compile time
Make branched path tracing obsolete and simplify controls

6 of 23

Kernel Graph

Init From Camera

Intersect Closest

Shade Surface

Shade Background

Intersect Shadow

Shade Shadow

Ray-tracing isolated to own kernels, for CPU packet tracing and hopefully better GPU performance.

Surface kernel does shader evaluation, light sampling and evaluation, and BSDF sampling and evaluation.

Kernels that do shading can be specialized to include only needed shader nodes.

Shadow tracing branches off from main path and can be handled in parallel with next bounce.

Transparent shadow tracing is a loop to keep GPU memory usage for intersections constant.

Shade Light

7 of 23

Kernel Graph

with volumes and sss

Init From Camera

Intersect Closest

Shade Surface

Shade Background

Intersect Shadow

Shade Shadow

Shade Volume

Volume Stack Init

Intersect Subsurface

For most scenes we can determine the camera position volume stack in advance and run this only once before the render loop.

If the volume stack is non-empty, a volume rendering kernel runs. The ray may either scatter or pass through to the surface.

If the surface kernel sampled a subsurface closure instead of a BSDF, continue to the subsurface kernel, which does ray-traced scattering, and then go back to Shade Surface to shade the exit point.

Surface and volume kernels may cast shadow rays.

On GPU, an existing shadow ray must be handled before executing the next kernel that could cast another one.

Volume Stack Init

Just like for cameras, this volume stack init can be skipped most of the time.

8 of 23

Kernel Memory

Must keep memory usage low to have many GPU paths in flight
Persistent State

PathState
Ray
Intersection
Volume Stack
Shadow Ray & Intersection
SSS Parameters
Aiming for 512 bytes per path

ShaderData is local in kernels that do shader evaluation
Persistent state uses explicit SoA storage for GPU, and possibly CPU for batch shading and ray-tracing.
Eliminate PathRadiance, write passes directly to render buffers
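
As a rough illustration of what the 512-byte target and explicit SoA storage imply, here is a minimal Python sketch. The field names and the choice of fields are hypothetical and do not reflect the actual Cycles state layout; only the 512 bytes per path figure comes from the slide.

```python
from array import array

BYTES_PER_PATH = 512  # target stated on the slide


def state_memory_mib(num_paths: int, bytes_per_path: int = BYTES_PER_PATH) -> float:
    """Memory needed for the persistent state of num_paths paths, in MiB."""
    return num_paths * bytes_per_path / (1024 * 1024)


class PathStateSoA:
    """Structure-of-arrays storage: one contiguous array per field.

    Neighbouring threads that read the same field of neighbouring paths
    touch adjacent memory. Field names are illustrative assumptions.
    """

    def __init__(self, num_paths: int):
        self.num_paths = num_paths
        self.ray_origin_x = array("f", [0.0]) * num_paths
        self.ray_origin_y = array("f", [0.0]) * num_paths
        self.ray_origin_z = array("f", [0.0]) * num_paths
        self.throughput = array("f", [1.0]) * num_paths
        self.bounce = array("i", [0]) * num_paths
```

Under this budget, about one million paths in flight would need 512 MiB of persistent state, which is why the per-path size matters for occupancy.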

9 of 23

Kernel GPU Scheduling

New implementation

Each path has persistent location in global path state array
Kernels set flag and increment atomic counter to indicate next kernel to execute
Greedily schedule kernel with highest counter
Build path index array of paths for that kernel
For Shade Surface, paths are sorted by shader
Execute kernel with path index array to ensure every thread is occupied
Regenerate paths when number of active paths drops too low
When number of active paths is low, switch to megakernel
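
The scheduling steps above can be sketched on the host side as follows. This is a simplified sketch, not the Cycles implementation: the real counters are incremented atomically by the kernels themselves, whereas here they are recounted from a per-path "next kernel" list. The kernel names and the `shader_of_path` array are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def pick_next_kernel(queued_kernel: List[Optional[str]]) -> Optional[str]:
    """Greedily pick the kernel with the most queued paths.

    queued_kernel[i] is the kernel that path i wants to run next,
    or None if the path has terminated.
    """
    counts = Counter(k for k in queued_kernel if k is not None)
    if not counts:
        return None
    # Highest counter wins; ties are broken by name to stay deterministic.
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]


def build_path_indices(queued_kernel: List[Optional[str]], kernel: str,
                       shader_of_path: Optional[List[int]] = None) -> List[int]:
    """Compact the indices of all paths queued for the given kernel.

    For the surface shading kernel, indices are sorted by shader so that
    neighbouring threads evaluate the same shader.
    """
    indices = [i for i, k in enumerate(queued_kernel) if k == kernel]
    if kernel == "shade_surface" and shader_of_path is not None:
        indices.sort(key=lambda i: shader_of_path[i])
    return indices
```

Launching the kernel over the compacted index array, rather than over the full state array, is what keeps every thread occupied.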

Things to try

Queues with work items, without or only partial persistent location for each path

Possibly more memory usage/traffic, but more coherent reads without path index?

Device side enqueue instead of megakernel

10 of 23

Shade Surface Kernel

Performs many tasks, not ideal for coherence and occupancy

ShaderData setup
Surface shader evaluation
Light sampling
Light shader evaluation
BSDF evaluation and sampling

But want to avoid making ShaderData and closures persistent state to keep memory usage under control
Ideas:

Perform light sampling before the Shade Surface kernel? Many-light sampling then can't take BSDFs into account. Maybe OK if we can statically determine whether there is e.g. refraction.
Perform light shader evaluation after the Shade Surface kernel for non-constant lights? Seems practical, but possibly increases state memory usage.
Pick a single BSDF closure and split off BSDF sampling and/or evaluation? Would noise increase too much?
Specialize Shade Surface kernel for different materials? For example with/without shader ray-tracing.

Start with single kernel, profile and go from there

11 of 23

Shadow Kernels

Too many different code paths now, aim to unify
Basic algorithm:

Find opaque intersection or up to N closest transparent intersections
Evaluate shaders and compute throughput for transparent intersections
If more intersections remaining, trace again

N can be tweaked per device
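
The basic algorithm above can be sketched as a loop. This is a minimal sketch under stated assumptions: `Hit`, `trace_shadow` and `shadow_throughput` are hypothetical names, and the trace is simulated by filtering a precomputed hit list rather than traversing a BVH. Shader evaluation is reduced to a single transmittance value per hit.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    t: float              # distance along the shadow ray
    opaque: bool          # True if the surface blocks all light
    transmittance: float  # fraction of light let through by a transparent hit


def trace_shadow(hits: List[Hit], t_min: float,
                 max_hits: int) -> Tuple[bool, List[Hit], bool]:
    """Hypothetical stand-in for one shadow-ray trace starting past t_min.

    Returns (blocked, up to max_hits closest transparent hits, more_remaining).
    """
    remaining = sorted((h for h in hits if h.t > t_min), key=lambda h: h.t)
    if any(h.opaque for h in remaining):
        return True, [], False
    return False, remaining[:max_hits], len(remaining) > max_hits


def shadow_throughput(hits: List[Hit], max_hits: int) -> float:
    """Fraction of light that reaches the shading point along the shadow ray."""
    if max_hits < 1:
        raise ValueError("max_hits must be at least 1")
    throughput = 1.0
    t_min = 0.0
    while True:
        blocked, batch, more = trace_shadow(hits, t_min, max_hits)
        if blocked:
            return 0.0
        for hit in batch:
            throughput *= hit.transmittance  # stands in for shader evaluation
        if not more:
            return throughput
        t_min = batch[-1].t  # trace again past the last recorded hit
```

Because each trace records at most N hits, memory for intersections stays constant regardless of how many transparent surfaces the ray crosses, and the result does not depend on N.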

12 of 23

Other Ideas

Light baking: share kernels with path tracing, using an Init From Bake kernel and keeping rest of kernels the same.
Device abstraction: reduce it to handling memory allocation, queues and kernel execution in a more abstract way. New classes to handle scheduling, load balancing and multi-device rendering.
Network rendering device: seems impractical already with Embree, OptiX, OSL, texture caching, etc. Remove for now, any new implementation should sync the scene graph instead?

13 of 23

Render Pipeline

14 of 23

Progressive & Adaptive

Assume per-pixel adaptive sampling
Assume progressive rendering

Prepare for rendering algorithms that require progressive passes
Pause and resume final renders

Memory usage

Split up render into big e.g. 2K tiles to support very high res renders
But no longer rely on these tiles as a mechanism for work distribution between devices
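
For illustration, splitting a high-resolution render into big tiles could look like the sketch below. The function name and the `(x, y, width, height)` tile representation are assumptions; only the 2K tile size comes from the slide.

```python
from typing import List, Tuple


def split_into_tiles(width: int, height: int,
                     tile_size: int = 2048) -> List[Tuple[int, int, int, int]]:
    """Split an image into (x, y, tile_width, tile_height) rectangles.

    Tiles on the right and bottom edges are cropped to the image bounds.
    """
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append((x, y,
                          min(tile_size, width - x),
                          min(tile_size, height - y)))
    return tiles
```

In this scheme the tiles only bound memory usage; scheduling work across devices happens inside each tile.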

Performance

Aim for fine grained automatic scheduling for multi-device and multi-threading
Batch together multiple samples to keep occupancy high
Use better GPU kernel scheduling to make per-pixel adaptive sampling possible
Make Cycles own GPU display buffer for final render, like viewport

15 of 23

Denoising

Perform denoising on big tiles, either with padding or at the end of the render. No more smart logic to keep neighboring tiles around
Consider removing NLM denoiser in favor of AI denoisers?
Main missing functionality: cross-frame denoising

Hope for OpenImageDenoise to add it

16 of 23

Resumable & Incremental Rendering

OpenEXR multilayer files should contain enough info to resume rendering automatically, including for adaptive sampling and denoising
Native Blender render UI support for pause and resume
Auto save and load render on reopening .blend files?

17 of 23

Viewport Rendering

Revisit logic for viewport drawing and resets to make it feel more interactive
Batch together multiple samples to speed up convergence after the first few samples
Perform render to display buffer conversion in render thread rather than main thread
Add GPUDisplay abstraction for Cycles integrations to handle GPU display textures. In Blender this will use the GPU API that abstracts OpenGL and Vulkan.

18 of 23

Rendering Algorithms

19 of 23

Light Sampling

Multiple importance sampling

Make Intersect Closest handle light intersections
Preferably as real geometry, using new point primitive for point and spot

Many lights

Finish GSoC many light sampling implementation
Cross-check with PBRT v4 implementation

Volumes

Unify CPU and GPU implementation
GPU friendly distance and equiangular sampling

Basic idea: sample random number in advance and march up to matching step

Product importance sampling instead of MIS?
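
One way to read the "sample random number in advance and march up to matching step" idea for distance sampling is sketched below. This is an interpretation, not the Cycles code: it assumes a piecewise-constant extinction coefficient per step, and the function and parameter names are hypothetical.

```python
import math
from typing import Optional, Sequence


def sample_distance(sigma_t_steps: Sequence[float], step_size: float,
                    xi: float) -> Optional[float]:
    """Sample a scatter distance by marching to a pre-drawn optical depth.

    sigma_t_steps: extinction coefficient in each step (piecewise constant).
    xi: uniform random number in [0, 1), drawn before marching starts.
    Returns the scatter distance, or None if the ray passes through.
    """
    target = -math.log(1.0 - xi)  # optical depth at which scattering occurs
    accumulated = 0.0
    for i, sigma_t in enumerate(sigma_t_steps):
        step_depth = sigma_t * step_size
        if sigma_t > 0.0 and accumulated + step_depth >= target:
            # The target optical depth is reached inside this step.
            return i * step_size + (target - accumulated) / sigma_t
        accumulated += step_depth
    return None
```

Since the random number is fixed up front, every thread runs the same simple loop and only the exit step differs, which is what makes the approach GPU friendly.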

20 of 23

Volumes

Delta tracking should replace ray marching for efficiency and unbiasedness

Overlapping OpenVDB volumes currently are handled poorly

Precision issues
Duplicated work for overlapping volume segments

Idea

Volume stepping and light sampling as if a single volume
Global volume BVH that includes all OpenVDB volumes?

Query if a ray segment overlaps it
Query delta tracking density bounds
Query OpenVDB grids at point
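
For reference, textbook delta tracking along a single ray can be sketched as follows. This is the generic algorithm, not a Cycles implementation; `density_at` stands in for the "query grids at point" lookup and `majorant` for the density bound queried from the volume BVH, and both names are assumptions.

```python
import math
import random
from typing import Callable, Optional


def delta_tracking(density_at: Callable[[float], float], majorant: float,
                   t_max: float, rng: random.Random) -> Optional[float]:
    """Sample a scatter distance in a heterogeneous volume.

    density_at(t): extinction coefficient at distance t along the ray.
    majorant: upper bound on density_at over [0, t_max], must be positive.
    Returns the distance of a real collision, or None if the ray exits.
    """
    t = 0.0
    while True:
        # Tentative collision, sampled as if the volume were homogeneous
        # with the majorant density.
        t -= math.log(1.0 - rng.random()) / majorant
        if t >= t_max:
            return None
        # Accept as a real collision with probability density / majorant,
        # otherwise it is a null collision and tracking continues.
        if rng.random() < density_at(t) / majorant:
            return t
```

Tighter density bounds mean fewer null collisions, which is why a query for local bounds is useful.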

21 of 23

Subsurface Scattering

Ideally only random walk SSS now that noise is as good as BSSRDF

However requires improvements to handling of internal geometry

BSSRDF requires branched path for efficiency

If still needed, candidate sampling could help avoid branching

Extend Principled BSDF to support thin volume scattering with random walk

To easily render water, ice, glass

22 of 23

Caustics etc.

Path guiding seems like the most suitable solution for production rendering and GPU
Regardless of the best algorithm, using progressive rather than tiled rendering helps these kinds of algorithms fit in well
A few more pragmatic tricks:

Automatic transparent shadows for sharp refraction
Splatting to reduce noise from specular fireflies in DoF and motion blur

23 of 23

Shadow Catcher / Matte

Revamp implementation to take into account indirect light
Render scene with and without synthetic objects and compute differences
Practical implementation:

After shadow catcher hit, duplicate path state and continue tracing two paths
One path uses visibility flag to skip intersecting certain objects
Paths write combined result to separate render passes
Divide results and composite into combined render pass
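
The final divide-and-composite step might look like the sketch below for one pixel. This is an assumption about how the two passes are combined: the ratio of the pass with synthetic objects to the pass without them is used to darken a backdrop pixel. The function name, the `backdrop` input and the `eps` guard are all hypothetical.

```python
from typing import List, Sequence


def shadow_catcher_composite(with_synthetic: Sequence[float],
                             without_synthetic: Sequence[float],
                             backdrop: Sequence[float],
                             eps: float = 1e-6) -> List[float]:
    """Scale a backdrop pixel by the per-channel ratio of the two passes.

    with_synthetic: catcher radiance with synthetic objects present.
    without_synthetic: catcher radiance with synthetic objects skipped.
    A ratio below 1 means the synthetic objects shadow the catcher.
    """
    result = []
    for w, wo, b in zip(with_synthetic, without_synthetic, backdrop):
        ratio = w / wo if wo > eps else 1.0  # avoid dividing by (near) zero
        result.append(b * ratio)
    return result
```

Because both passes include indirect light, the ratio also captures indirect shadowing and bounce light from the synthetic objects.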