1 of 23

Cycles X

Technical Presentation

2 of 23

What is it?

  • X = 10-year anniversary of the Cycles release in 2011
  • Time for a refresh to address some long-standing issues
  • 6-month project worked on by Brecht and Sergey
  • Development happens in the cycles-x branch, with freedom to make breaking changes. A prototype has been in the works for about a month.

This presentation contains a mix of planned work and speculative future direction that may or may not happen in those 6 months.

3 of 23

User Level Goals

  • Remove burden of tweaking settings to get least noise and best performance
  • More interactive viewport rendering
  • Workflow for resumable and incremental rendering
  • Improved performance for CPU and GPU rendering

4 of 23

Render Kernel

5 of 23

Kernel

  • Single split kernel implementation for CPU and GPU
  • Goals
    • Better GPU occupancy and performance
    • Keep UI responsive and avoid GPU time outs
    • Prepare for CPU batch shading and packet tracing
    • Reduce code duplication and maintenance cost
    • Reduce kernel compile time
    • Make branched path tracing obsolete and simplify controls

6 of 23

Kernel Graph

Init From Camera

Intersect Closest

Shade Surface

Shade Background

Intersect Shadow

Shade Shadow

Ray-tracing isolated to own kernels, for CPU packet tracing and hopefully better GPU performance.

Surface kernel does shader evaluation, light sampling and evaluation, and BSDF sampling and evaluation.

Kernels that do shading can be specialized to include only needed shader nodes.

Shadow tracing branches off from main path and can be handled in parallel with next bounce.

Transparent shadow tracing is a loop to keep GPU memory usage for intersections constant.

Shade Light

7 of 23

Kernel Graph

with volumes and SSS

Init From Camera

Intersect Closest

Shade Surface

Shade Background

Intersect Shadow

Shade Shadow

Shade Volume

Volume Stack Init

Intersect Subsurface

For most scenes we can determine the volume stack at the camera position in advance and run this only once before the render loop.

If the volume stack is non-empty, a volume rendering kernel runs. The ray may either scatter or pass through to the surface.

If the surface kernel sampled a subsurface closure instead of a BSDF, the path continues to the subsurface kernel, which does ray-traced scattering, and then goes back to Shade Surface to shade the exit point.

Surface and volume kernels may cast shadow rays.

On GPU, existing shadow rays must be handled before execution of the next kernel that could cast them.

Volume Stack Init

Just like for cameras, this volume stack init can be skipped most of the time.
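The branching described on this slide can be sketched as a small transition function. This is only an illustration: the kernel names are lower-cased versions of the node labels, the function names and arguments are hypothetical, and the ordering of the checks (volume before background, light hits as a separate case) is an assumption rather than something the slides specify.

```python
def kernel_after_intersect_closest(hit, volume_stack_empty):
    """Pick the shading kernel to run after Intersect Closest.

    hit is None (ray left the scene), "light" or "surface".
    All names here are hypothetical and for illustration only.
    """
    if not volume_stack_empty:
        # A non-empty volume stack means the volume kernel runs first;
        # the ray may scatter there or pass through to the surface.
        return "shade_volume"
    if hit is None:
        return "shade_background"
    if hit == "light":
        return "shade_light"
    return "shade_surface"


def kernel_after_shade_surface(sampled_subsurface, path_terminated):
    """Pick the kernel to run after Shade Surface, or None if the path ended."""
    if path_terminated:
        return None
    if sampled_subsurface:
        # Ray-traced scattering, then back to Shade Surface at the exit point.
        return "intersect_subsurface"
    return "intersect_closest"
```

Shadow rays are left out of this sketch because they branch off the main path instead of replacing the next kernel.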

8 of 23

Kernel Memory

  • Must keep memory usage low to have many GPU paths in flight
  • Persistent State
    • PathState
    • Ray
    • Intersection
    • Volume Stack
    • Shadow Ray & Intersection
    • SSS Parameters
    • Aiming for 512 bytes per path
  • ShaderData is local in kernels that do shader evaluation
  • Persistent state uses explicit SoA storage for GPU, and possibly CPU for batch shading and ray-tracing.
  • Eliminate PathRadiance, write passes directly to render buffers
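To make the memory budget concrete, here is a rough sketch of how the 512 bytes per path target translates into paths in flight, together with a toy structure-of-arrays (SoA) container. The field names and the class are hypothetical; the real persistent state has many more members.

```python
from array import array

BYTES_PER_PATH_TARGET = 512  # target mentioned on the slide


def max_paths_in_flight(budget_bytes, bytes_per_path=BYTES_PER_PATH_TARGET):
    """Number of paths whose persistent state fits in the given budget."""
    return budget_bytes // bytes_per_path


class PathStateSoA:
    """Toy SoA storage: one contiguous float array per field.

    The field names are made up for illustration.
    """

    FIELDS = ("ray_ox", "ray_oy", "ray_oz", "ray_t", "throughput")

    def __init__(self, num_paths):
        self.num_paths = num_paths
        self.columns = {
            name: array("f", [0.0] * num_paths) for name in self.FIELDS
        }

    def bytes_per_path(self):
        # Sum of the element sizes of all columns.
        return sum(col.itemsize for col in self.columns.values())
```

With SoA, neighboring threads that read the same field of neighboring paths touch adjacent memory, which is the usual motivation for this layout on GPUs.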

9 of 23

Kernel GPU Scheduling

  • New implementation
    • Each path has persistent location in global path state array
    • Kernels set flag and increment atomic counter to indicate next kernel to execute
    • Greedily schedule kernel with highest counter
    • Build path index array of paths for that kernel
    • For Shade Surface, paths are sorted by shader
    • Execute kernel with path index array to ensure every thread is occupied
    • Regenerate paths when number of active paths drops too low
    • When number of active paths is low, switch to megakernel
  • Things to try
    • Queues with work items, without or only partial persistent location for each path
      • Possibly more memory usage/traffic, but more coherent reads without path index?
    • Device side enqueue instead of megakernel
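The scheduling steps listed under "New implementation" can be sketched on the host side as follows. This is a simplified model, not the actual implementation: the per-path "next kernel" list stands in for the flags and atomic counters, and all function and kernel names are hypothetical.

```python
from collections import Counter


def pick_kernel(queued_kernel):
    """Greedily pick the kernel with the most queued paths.

    queued_kernel[i] is the name of the next kernel for path i, or None
    if the path has terminated. Ties are broken alphabetically so the
    result is deterministic.
    """
    counts = Counter(k for k in queued_kernel if k is not None)
    if not counts:
        return None
    return max(sorted(counts), key=lambda k: counts[k])


def build_path_indices(queued_kernel, kernel, shader_of_path=None):
    """Build the compact array of path indices queued for one kernel."""
    indices = [i for i, k in enumerate(queued_kernel) if k == kernel]
    if kernel == "shade_surface" and shader_of_path is not None:
        # Sort by shader so neighboring threads evaluate the same shader.
        indices.sort(key=lambda i: shader_of_path[i])
    return indices
```

For example, with `queued = ["shade_surface", "intersect_closest", "shade_surface", None, "shade_surface"]` and `shaders = [2, 0, 1, 0, 2]`, the scheduler picks `shade_surface` and runs it over the paths in the order `[2, 0, 4]`.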

10 of 23

Shade Surface Kernel

  • Performs many tasks, not ideal for coherence and occupancy
    • ShaderData setup
    • Surface shader evaluation
    • Light sampling
    • Light shader evaluation
    • BSDF evaluation and sampling
  • But want to avoid making ShaderData and closures persistent state to keep memory usage under control
  • Ideas:
    • Perform light sampling before Shade Surface kernel? Then many-light sampling can’t take BSDFs into account. Maybe ok if we can statically determine if there is e.g. refraction.
    • Perform light shader evaluation after Shade Surface kernel for non-constant lights? Seems practical, possibly increases state memory usage.
    • Pick a single BSDF closure and split off BSDF sampling and/or evaluation? Would noise increase too much?
    • Specialize Shade Surface kernel for different materials? For example with/without shader ray-tracing.
  • Start with single kernel, profile and go from there

11 of 23

Shadow Kernels

  • Too many different code paths now, aim to unify
  • Basic algorithm:
    • Find opaque intersection or up to N closest transparent intersections
    • Evaluate shaders and compute throughput for transparent intersections
    • If more intersections remaining, trace again
  • N can be tweaked per device
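A minimal sketch of the basic algorithm, assuming the ray-tracing step is replaced by a pre-computed list of hits along the shadow ray. The hit representation and the function name are hypothetical; a transmittance of 0.0 marks an opaque surface.

```python
def trace_shadow(hits, max_hits):
    """Return (throughput, number_of_traces) for a shadow ray.

    hits is a list of (distance, transmittance) pairs. Each loop
    iteration stands in for one trace that records at most max_hits
    of the closest transparent intersections, so the storage needed
    for intersections stays constant.
    """
    remaining = sorted(hits)
    throughput = 1.0
    num_traces = 0
    while True:
        num_traces += 1
        # Any opaque intersection blocks the light completely.
        if any(tr == 0.0 for _, tr in remaining):
            return 0.0, num_traces
        batch, remaining = remaining[:max_hits], remaining[max_hits:]
        # "Evaluate shaders" for the recorded transparent hits.
        for _, tr in batch:
            throughput *= tr
        if not remaining:
            return throughput, num_traces
```

With three 50% transparent surfaces and `max_hits=2`, this takes two traces and returns a throughput of 0.125. A larger N means fewer traces at the cost of more memory per path.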

12 of 23

Other Ideas

  • Light baking: share kernels with path tracing, using an Init From Bake kernel and keeping rest of kernels the same.
  • Device abstraction: reduce to handling memory allocation, queues and kernel execution in a more abstract way. New classes to handle scheduling, load balancing and multi-device rendering.
  • Network rendering device: seems impractical already with Embree, OptiX, OSL, texture caching, etc. Remove for now, any new implementation should sync the scene graph instead?

13 of 23

Render Pipeline

14 of 23

Progressive & Adaptive

  • Assume per-pixel adaptive sampling
  • Assume progressive rendering
    • Prepare for rendering algorithms that require progressive passes
    • Pause and resume final renders
  • Memory usage
    • Split up render into big e.g. 2K tiles to support very high res renders
    • But no longer rely on these tiles as a mechanism for work distribution between devices
  • Performance
    • Aim for fine grained automatic scheduling for multi-device and multi-threading
    • Batch together multiple samples to keep occupancy high
    • Use better GPU kernel scheduling to make per-pixel adaptive sampling possible
    • Make Cycles own GPU display buffer for final render, like viewport
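As a small illustration of the big-tile idea, the following sketch splits an image into tiles of at most 2048 pixels per side. The function name and the tuple layout are hypothetical, and the tile size is just the example value from the slide.

```python
def split_into_tiles(width, height, tile_size=2048):
    """Return a list of (x, y, tile_width, tile_height) covering the image."""
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append(
                (x, y, min(tile_size, width - x), min(tile_size, height - y))
            )
    return tiles
```

A 4096x3000 render yields four tiles, two of them only 952 pixels high. These tiles only bound memory usage; work distribution between devices would happen at a finer granularity inside each tile.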

15 of 23

Denoising

  • Perform denoising on big tiles, either with padding or at the end of render. No more smart logic to keep around neighboring tiles
  • Consider removing NLM denoiser in favor of AI denoisers?
  • Main missing functionality: cross-frame denoising
    • Hope for OpenImageDenoise to add it

16 of 23

Resumable & Incremental Rendering

  • OpenEXR multilayer files should contain enough info to resume rendering automatically, including for adaptive sampling and denoising
  • Native Blender render UI support for pause and resume
  • Auto save and load render on reopening .blend files?

17 of 23

Viewport Rendering

  • Revisit logic for viewport drawing and resets to make it feel more interactive
  • Batch together multiple samples to speed up convergence after the first few samples
  • Perform render to display buffer conversion in render thread rather than main thread
  • Add GPUDisplay abstraction for Cycles integrations to handle GPU display textures. In Blender this will use the GPU API that abstracts OpenGL and Vulkan.

18 of 23

Rendering Algorithms

19 of 23

Light Sampling

  • Multiple importance sampling
    • Make Intersect Closest handle light intersections
    • Preferably as real geometry, using new point primitive for point and spot
  • Many lights
    • Finish GSoC many light sampling implementation
    • Cross-check with PBRT v4 implementation
  • Volumes
    • Unify CPU and GPU implementation
    • GPU friendly distance and equiangular sampling
      • Basic idea: sample random number in advance and march up to matching step
    • Product importance sampling instead of MIS?
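The "sample random number in advance" idea for distance sampling can be sketched like this. It assumes the density is constant within each step and covers distance sampling only, not equiangular sampling. `density_at` and the other names are hypothetical.

```python
import math


def sample_distance_marching(density_at, t_max, step, xi):
    """Sample a scatter distance along a ray, or None if the ray escapes.

    xi is a random number in [0, 1) drawn before marching. It fixes the
    optical depth at which scattering happens, so the march can stop at
    the first step that reaches that depth without storing all steps.
    """
    target = -math.log(1.0 - xi)
    accumulated = 0.0
    num_steps = math.ceil(t_max / step)
    for i in range(num_steps):
        t0 = i * step
        dt = min(step, t_max - t0)
        sigma = density_at(t0 + 0.5 * dt)  # density at the step midpoint
        if sigma > 0.0 and accumulated + sigma * dt >= target:
            # Scatter inside this step.
            return t0 + (target - accumulated) / sigma
        accumulated += sigma * dt
    return None
```

Because no per-step data has to be kept, this form needs only a constant amount of state per path.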

20 of 23

Volumes

  • Delta tracking should replace ray marching for efficiency and unbiasedness

  • Overlapping OpenVDB volumes currently are handled poorly
    • Precision issues
    • Duplicated work for overlapping volume segments
  • Idea
    • Volume stepping and light sampling as if a single volume
    • Global volume BVH that includes all OpenVDB volumes?
      • Query if a ray segment overlaps it
      • Query delta tracking density bounds
      • Query OpenVDB grids at point
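For reference, a textbook sketch of delta (Woodcock) tracking for sampling a scatter distance. It is not the planned Cycles implementation; `density_at` and `sigma_max` are hypothetical stand-ins for the "grids at point" and "density bounds" queries, and `sigma_max` must be an upper bound on the density along the ray.

```python
import math
import random


def delta_tracking_distance(density_at, sigma_max, t_max, rng):
    """Return a scatter distance in [0, t_max), or None if the ray escapes."""
    t = 0.0
    while True:
        # Tentative collision in a fictitious homogeneous medium.
        t -= math.log(1.0 - rng.random()) / sigma_max
        if t >= t_max:
            return None
        # Accept as a real collision with probability density / sigma_max.
        if rng.random() * sigma_max < density_at(t):
            return t
```

A quick sanity check: for a homogeneous medium with density 1 over a unit-length segment, the fraction of rays that escape should approach exp(-1), independent of the chosen `sigma_max`. A tighter bound just means fewer rejected collisions.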

21 of 23

Subsurface Scattering

  • Ideally only random walk SSS now that noise is as good as BSSRDF
    • However requires improvements to handling of internal geometry
  • BSSRDF requires branched path for efficiency
    • If still needed, candidate sampling could help avoid branching
  • Extend Principled BSDF to support thin volume scattering with random walk
    • To easily render water, ice, glass

22 of 23

Caustics etc.

  • Path guiding seems like the most suitable solution for production rendering and GPU
  • Regardless of the best algorithm, using progressive rather than tiled rendering helps these kinds of algorithms fit in well
  • A few more pragmatic tricks:
    • Automatic transparent shadows for sharp refraction
    • Splatting to reduce noise from specular fireflies in DoF and motion blur

23 of 23

Shadow Catcher / Matte

  • Revamp implementation to take into account indirect light
  • Render scene with and without synthetic objects and compute differences
  • Practical implementation:
    • After shadow catcher hit, duplicate path state and continue tracing two paths
    • One path uses visibility flag to skip intersecting certain objects
    • Paths write combined result to separate render passes
    • Divide results and composite into combined render pass
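The divide step can be illustrated per channel as below. This is a sketch under assumptions: the function names are hypothetical, values are plain lists of floats, and treating a near-zero denominator as "no change" is a choice made here, not something the slides specify.

```python
def shadow_catcher_ratio(with_synthetic, without_synthetic, eps=1e-6):
    """Per-channel ratio of the render with and without synthetic objects."""
    ratios = []
    for w, wo in zip(with_synthetic, without_synthetic):
        # Where the catcher receives no light at all, leave it unchanged.
        ratios.append(w / wo if wo > eps else 1.0)
    return ratios


def composite_onto_backdrop(backdrop, ratios):
    """Darken (or brighten) the backdrop by the computed ratio."""
    return [b * r for b, r in zip(backdrop, ratios)]
```

A ratio below 1 corresponds to a shadow or occlusion from a synthetic object, and a ratio above 1 to light it bounces onto the catcher, which is how indirect light is taken into account.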