1 of 17

Faster, easier 2D vector rendering

RustWeek • 2025-05-13

Raph Levien • Google Fonts

2 of 17

Limitations of existing Vello

  • Requires reasonably modern GPU
    • doesn’t work at all on WebGL
  • High and unpredictable memory usage
    • buffers of adequate size must be allocated in advance
  • Not easy to integrate into existing renderers
    • rendering done by compute shader
    • can’t integrate with fragment shaders / existing render pass
  • Some performance cliffs
    • GPU hotspot in very zoomed-out case
  • Compute shader logic is complex
    • not everyone is a rocket scientist

3 of 17

Sparse strips

4 of 17

Sparse strips

  • variable width, fixed height (4 or 8 pixels)
  • efficient representation of rendered path:
    • modest memory usage
    • minimal number of primitives to render
  • efficient computation
    • areas without coverage aren’t touched
    • solid interior regions have only per-strip setup cost, no alpha
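
As a rough illustration of the representation described above, the sketch below stores alpha only for edge strips and records solid interior spans without any alpha. All names (`Strip`, `SolidSpan`, `SparsePath`, `coverage_at`) are hypothetical and are not taken from the Vello code; the 4-pixel strip height and the column-major alpha layout are assumptions.

```rust
/// Strip height in pixels. The slides mention 4 or 8; 4 is assumed here.
const STRIP_HEIGHT: usize = 4;

/// An edge strip: variable width, fixed height, with per-pixel alpha.
/// Hypothetical layout, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Strip {
    x: u16,
    y: u16,
    width: u16,
    /// Offset of this strip's coverage values in the shared alpha buffer.
    alpha_start: usize,
}

/// A fully covered span inside the path: setup cost only, no alpha stored.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SolidSpan {
    x: u16,
    y: u16,
    width: u16,
}

#[derive(Debug, Default)]
struct SparsePath {
    strips: Vec<Strip>,
    solids: Vec<SolidSpan>,
    alphas: Vec<u8>,
}

impl SparsePath {
    /// Adds an edge strip. `coverage` is column-major with
    /// STRIP_HEIGHT values per pixel column.
    fn push_strip(&mut self, x: u16, y: u16, coverage: &[u8]) {
        assert_eq!(coverage.len() % STRIP_HEIGHT, 0);
        let width = (coverage.len() / STRIP_HEIGHT) as u16;
        self.strips.push(Strip { x, y, width, alpha_start: self.alphas.len() });
        self.alphas.extend_from_slice(coverage);
    }

    /// Adds a solid interior span one strip high.
    fn push_solid(&mut self, x: u16, y: u16, width: u16) {
        self.solids.push(SolidSpan { x, y, width });
    }

    /// Coverage of one pixel: stored alpha inside an edge strip,
    /// 255 inside a solid span, 0 everywhere else.
    fn coverage_at(&self, px: u16, py: u16) -> u8 {
        let h = STRIP_HEIGHT as u16;
        for s in &self.strips {
            if py >= s.y && py < s.y + h && px >= s.x && px < s.x + s.width {
                let col = (px - s.x) as usize;
                let row = (py - s.y) as usize;
                return self.alphas[s.alpha_start + col * STRIP_HEIGHT + row];
            }
        }
        for f in &self.solids {
            if py >= f.y && py < f.y + h && px >= f.x && px < f.x + f.width {
                return 255;
            }
        }
        0
    }
}
```

The linear search in `coverage_at` is only there to make the meaning of the data explicit; a real renderer would walk strips in order rather than look up pixels one at a time.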

5 of 17

Sparse strips

  • Pipeline
    • stroke expansion (strokes only) -> filled shapes
    • flattening -> lines
    • tiling
    • sort tiles
    • organize tiles into strips
    • merge tiles with the same (x, y) coordinates and render to alpha values
    • coarse rasterization - generate sequence of drawing commands per wide tile
    • render sparse strip representation
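
The middle stages of the pipeline above (sort tiles, merge tiles at the same coordinates, organize tiles into strips) can be sketched roughly as follows. This is a simplified illustration, not the Vello implementation: `Tile`, `StripRun`, and `tiles_to_strips` are hypothetical names, coordinates are in tile units, and merging is reduced to de-duplication, whereas the real pipeline accumulates coverage from every line segment that touches a tile.

```rust
/// Tile position in tile units. `y` comes first so that the derived
/// ordering sorts row by row, then left to right. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Tile {
    y: u16,
    x: u16,
}

/// A run of horizontally adjacent tiles in one row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StripRun {
    y: u16,
    x: u16,
    width_tiles: u16,
}

/// Sorts tiles, merges duplicates, and groups neighbors into strips.
fn tiles_to_strips(mut tiles: Vec<Tile>) -> Vec<StripRun> {
    // Sort tiles into row-major order.
    tiles.sort_unstable();
    // Merge tiles with the same (x, y); simplified to de-duplication here.
    tiles.dedup();

    let mut strips: Vec<StripRun> = Vec::new();
    for t in tiles {
        // Extend the current strip if this tile is directly to its right.
        if let Some(s) = strips.last_mut() {
            if s.y == t.y && s.x + s.width_tiles == t.x {
                s.width_tiles += 1;
                continue;
            }
        }
        // Otherwise start a new strip; the gap before it is never touched.
        strips.push(StripRun { y: t.y, x: t.x, width_tiles: 1 });
    }
    strips
}
```

Sorting first is what makes the strip-building pass a single linear scan: adjacent tiles in the same row end up next to each other in the list.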

6 of 17

CPU-driven to GPU-driven spectrum

  • Fully CPU driven (vello_cpu)
    • very portable, small simple codebase
    • superpower: rendering emoji
    • decent performance through SIMD optimization
  • CPU/GPU hybrid (vello_hybrid)
    • geometry & scheduling done on CPU
    • painting of pixels done in GPU rasterization pipeline
    • rendering of alpha values is currently done on the CPU with SIMD; will move to a compute shader
  • Open research topic: GPU-driven rendering
    • requires advanced GPU execution model: indirect command encoding? work graphs?

7 of 17

Performance philosophy

  • Move allocation & scheduling work to CPU
    • Fully GPU-driven is still a goal but has many practical challenges
  • Do all per-pixel calculation on GPU
  • SIMD and multithreading for CPU work
  • Cache paths & avoid needless work
  • Schedule GPU work as efficiently as possible
    • Sparse (avoid unneeded work)
    • Minimize barriers (exploit maximal parallelism)
  • Use bounded resources on GPU

8 of 17

Path caching

  • Previous Vello: dynamic rendering of all paths
    • but many (most) UI workloads benefit from caching
  • Generalizes glyph caching
  • Memory footprint is O(n) in the glyph's linear size n; a dense glyph atlas is O(n^2)
    • useful visual here: sparse representation of a glyph at different sizes
  • Avoids need for 2D atlas allocation

9 of 17

Sparse strips scale efficiently

10 of 17

Clip optimization

11 of 17

Clip optimization

[Figure: three clip cases, labeled "composite", "just draw", and "no GPU work"]
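
The figure for this slide is not reproduced in the text. One plausible reading of its three labels (composite, just draw, no GPU work) is a per-wide-tile decision based on how much of the tile the clip path covers. The sketch below illustrates that reading only; `ClipAction` and `classify_clip` are hypothetical names, and the mapping from coverage to action is an assumption rather than something the slides state.

```rust
/// What to do with a wide tile's content under a clip (assumed mapping).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClipAction {
    /// Clip has zero coverage in the tile: nothing can be visible.
    NoGpuWork,
    /// Clip fully covers the tile: draw the content as if unclipped.
    JustDraw,
    /// Clip partially covers the tile: draw, then composite with the mask.
    Composite,
}

/// Classifies a wide tile from the clip's per-pixel coverage in that tile.
fn classify_clip(clip_alpha: &[u8]) -> ClipAction {
    if clip_alpha.iter().all(|&a| a == 0) {
        ClipAction::NoGpuWork
    } else if clip_alpha.iter().all(|&a| a == 255) {
        ClipAction::JustDraw
    } else {
        ClipAction::Composite
    }
}
```

In a sparse-strip renderer the classification would more likely come from the strip data itself (no strips, a solid span, or an edge strip in the tile) than from scanning pixels; the pixel scan is used here only to keep the example self-contained.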

12 of 17

Spatio-temporal allocation

[Figure: allocation compared under "abundant memory" and "tight memory"]

13 of 17

Exploiting parallelism

  • Early pipeline stages (up to coarse rasterization)
    • Fully parallel by draw object
  • Coarse rasterization
    • (currently) serial but fast - simple calculation per wide tile
  • Fine rasterization
    • Fully parallel by wide tile (CPU rasterization)
    • GPU accelerated

  • All CPU stages: lots of SIMD
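
"Fully parallel by wide tile" can be sketched with scoped threads from the standard library: each wide tile owns a disjoint slice of the output, so threads need no locking. This is an illustration under assumptions, not how vello_cpu is structured; the 256x4 wide-tile size, `rasterize_wide_tile`, and `rasterize_parallel` are hypothetical, and the per-tile work is a placeholder pattern rather than real fine rasterization.

```rust
use std::thread;

/// Assumed wide-tile size: 256 x 4 pixels, one byte per pixel.
const WIDE_TILE_PIXELS: usize = 256 * 4;

/// Placeholder for fine rasterization of one wide tile.
fn rasterize_wide_tile(tile_index: usize, pixels: &mut [u8]) {
    for (i, p) in pixels.iter_mut().enumerate() {
        *p = ((tile_index + i) % 256) as u8;
    }
}

/// Splits the buffer into batches of whole wide tiles, one batch per thread.
fn rasterize_parallel(framebuffer: &mut [u8], threads: usize) {
    let threads = threads.max(1);
    let tile_count = (framebuffer.len() + WIDE_TILE_PIXELS - 1) / WIDE_TILE_PIXELS;
    let tiles_per_thread = ((tile_count + threads - 1) / threads).max(1);

    thread::scope(|s| {
        let batches = framebuffer.chunks_mut(tiles_per_thread * WIDE_TILE_PIXELS);
        for (batch, chunk) in batches.enumerate() {
            // Each thread gets exclusive access to its own batch of tiles.
            s.spawn(move || {
                for (i, tile) in chunk.chunks_mut(WIDE_TILE_PIXELS).enumerate() {
                    rasterize_wide_tile(batch * tiles_per_thread + i, tile);
                }
            });
        }
    });
}
```

Because wide tiles do not share output pixels, the parallel result is identical to rasterizing the tiles one after another.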

14 of 17

SIMD

  • Single Instruction Multiple Data
  • One CPU instruction handles a vector
  • Significant speedups from fine-grained parallelism
  • Speedups for many different operations:
    • flattening, tiling, alpha rendering, fine rasterization
  • Neon fp16 is hot
    • but currently need to write asm
  • We need better infrastructure in Rust!
    • https://linebender.org/blog/towards-fearless-simd/
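
As a small illustration of the kind of per-pixel work that benefits from SIMD, the sketch below scales a run of 8-bit pixel values by 8-bit coverage in fixed-size chunks. It uses only stable Rust and relies on the compiler to vectorize the inner loop, which is not guaranteed and depends on the target and optimization level; it does not use explicit intrinsics or fp16. `LANES`, `mul_div_255`, and `apply_coverage` are hypothetical names.

```rust
/// Number of bytes processed per chunk (assumed; matches a 128-bit vector).
const LANES: usize = 16;

/// Rounded (v * a) / 255 using only shifts and adds.
#[inline]
fn mul_div_255(v: u8, a: u8) -> u8 {
    let t = v as u16 * a as u16 + 128;
    ((t + (t >> 8)) >> 8) as u8
}

/// Multiplies each value in `dst` by the matching coverage value.
fn apply_coverage(dst: &mut [u8], coverage: &[u8]) {
    assert_eq!(dst.len(), coverage.len());
    let mut d = dst.chunks_exact_mut(LANES);
    let mut c = coverage.chunks_exact(LANES);

    // Main loop: fixed-size chunks give the compiler a known trip count.
    for (dv, cv) in d.by_ref().zip(c.by_ref()) {
        for i in 0..LANES {
            dv[i] = mul_div_255(dv[i], cv[i]);
        }
    }

    // Tail: the remaining elements that do not fill a whole chunk.
    for (dv, &cv) in d.into_remainder().iter_mut().zip(c.remainder()) {
        *dv = mul_div_255(*dv, cv);
    }
}
```

Relying on autovectorization like this is fragile, which is part of the motivation for better explicit SIMD infrastructure in Rust.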

15 of 17

Community

  • Code base is simpler and more modular than most renderers
    • more amenable to community contributions
  • The team:
    • Taj Pereira, Canva
    • Alex Gemberg, Canva
    • Andrew Jakubowicz, Canva
    • Laurenz Stampfl, ETH Zurich
    • Tom Churchman
    • Daniel McNab, Linebender, funded by Google Fonts
    • Nico Burns, Dioxus/Blitz
  • Renderer office hours every Wednesday
  • Linebender is a great community for learning & building

16 of 17

Current status

  • vello_cpu 0.0.1 on crates.io
    • imaging model includes gradients, images, clips, blends, blurred rounded rectangles
    • text: variable fonts, hinting, and both COLRv1 and bitmap emoji
    • no_std
    • back end for the AnyRender abstraction (used by Blitz)
  • vello_hybrid in active development
    • imaging model includes images & clips
    • also contains WebGL2 back-end (no wgpu dependency)

17 of 17

Roadmap

  • See roadmap doc
  • Full imaging model for CPU and hybrid
  • Set of image filters for both CPU and GPU
  • Glyph caching
  • Continued performance work
    • lots of SIMD - could use better Rust infrastructure
  • Future work
    • HDR color
    • conflation-free compositing