3 of 40

Motivation

Room for 4x CPU speedup + GPU sometimes relies on CPU for concave paths

curr/maxrss loops min stddev samples config bench

23/23 MB 99 4.23µs 2% ▅▁▁▁▁▁▁█▂▁ 8888 path_fill_big_rotated_rect_aa_45

31/32 MB 156 955ns 5% ▇▁▁▁█▁▁▁▁▁ 8888 path_fill_big_rotated_rect_noaa_45

4 of 40

Result -- fully landed

5 of 40

Time to fill big triangle in ms

6 of 40

Related work

Signal Processing:

Post-filtering: FXAA, SMAA
Pre-filtering: NSAA (Non-sampled), �AAA (Analytic) : Fancy, beyond simple coverage

Coverage Sampling:�MSAA (Multisample) / FSAA (Full-scene) / SSAA (Supersample)

Analytic Coverage

Catmull 1978, criticized by Carpenter 1984 as �“this can be extremely expensive” �(it also detects hidden surfaces)
Pathfinder (video) (thanks to mtklein@, allanmac@)
allanmac@, csmartdalton@

https://www.beyond3d.com/content/articles/122/13

7 of 40

Rasterization and Coverage-based Anti-Aliasing

No AA

The number to compute: coverage per pixel�(Area of the Intersection)

A single pixel

(unit area = 1.0 = 0xFF = 255)

8 of 40

It’s all about computing the area

Rasterization

Supersampling, MSAA
GPU Triangle Fanning
Surveyor’s Formula�(Shoelace Formula)
Our CPU Analytic AA...�

Area Computation

Analytic (e.g., ab/2, 𝝿/2*r^2)
Monte Carlo Sampling

count # black pixels

compute area in each pixel

9 of 40

Sampling: Monte Carlo for Area Computation

16x Supersampling

(Scanline in our previous CPU code)

16x MSAA

(Sampling in GPU)

5 / 16 = 0.3125

6 / 16 = 0.375

10 of 40

Analytic Area Computation

Principle: decompose to a summation of primitive shapes

11 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

12 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

13 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

14 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

15 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

OCD = (C.x * D.y - C.y * D.x) / 2

16 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

OCD = (C.x * D.y - C.y * D.x) / 2

17 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

OCD = (C.x * D.y - C.y * D.x) / 2

ODA = (D.x * A.y - D.y * A.x) / 2

18 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

OCD = (C.x * D.y - C.y * D.x) / 2

ODA = (D.x * A.y - D.y * A.x) / 2

19 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

OBC = (B.x * C.y - B.y * C.x) / 2

OCD = (C.x * D.y - C.y * D.x) / 2

ODA = (D.x * A.y - D.y * A.x) / 2

---------------------------------

Area = Shoelace (Surveyor) Formula

GPU Rasterization (see, e.g., csmartdalton@)

20 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 1 [Sin]:

OA x OB = |OA|*|OB|*sin(AOB)

...but why sin(AOB)?

21 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 1+:

OA x OB = |OA|*|OB|*sin(AOB)because

Multiple blackboards of these… (from khanacademy.org)

22 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 2 [Triangle]:

OAB = -|BH| * A.y / 2

= -(B.x - H.x) * A.y / 2

= -(B.x - A.x * B.y / A.y) * A.y / 2

= (A.x * B.y - A.y * B.x) / 2

23 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 3 [Trapezoid / Inverse-Allen]:

OAB = IABO + JBO - IAO

= -(A.x + B.x)*(A.y - B.y)/2

- B.x * B.y / 2 + A.x * A.y / 2

= (A.x * B.y - A.y * B.x) / 2

24 of 40

Analytic Area Computation

Allan’s Surveyor / Pathfinder / CPU Scanline: Trapezoid (incl. rect) Primitives

ABCD = IABJ + JBCL + LCDM + MDAI

(allanmac@)

25 of 40

CPU Scanline: 4x overhead, not 16x overhead

16x supersampling

No AA

26 of 40

CPU Scanline: 4x overhead, not 16x overhead

16x supersampling�is actually 4x rect primitives

No AA

27 of 40

CPU Scanline: old vs. new

There are only trapezoids (triangles, rects) between scanlines

(similar to Cairo and Direct2D, except we never tesselate explicitly)

16x supersampling:�4 scan-lines per pixel

Analytic AA:

1 scan-line per pixel, per edge endpoint, per intersection

Fractional scan-lines & pixels

28 of 40

Intersection of Trapezoid and Pixel

Principle: decompose to a summation of primitive shapes

Instead of computing area A using a general formula (e.g., shoelace), we compute 1 - B - C

2 multiplications

12 or 6 multiplications

29 of 40

General algorithm

Inclusion

Consider trapezoids as the join of two linear half-space
Independently compute the exclusion per pixel of those two linear half-spaces
The coverage is 1 - exclusions

Exclusion 2

Exclusion 1

30 of 40

Exclusion computation

(a + b) / 2 = (a + b) >> 1

a x b / 2 = a x a x slope >> 1

1 - a’ x a’ x slope >> 1

a x a x slope >> 1

a x slope + slope >> 1

a x slope + slope >> 1 + slope * i

...

1 - a’ x a’ x slope >> 1

a’

pathfinder:

31 of 40

More speedup for coverage computation

Directly compute coverage (inclusion) as (a + b) >> 1

Break into 3 pieces, directly compute coverage (inclusion) using our exclusion method (pathfinder?) on the left/right;�full coverage in the middle.

32 of 40

Intersections: how our solution evolves

Reuse old scan-line procedure, run it twice and emit intersection in the first run during edge insertion-sort
In the first run, jump between edge endpoints instead of all pixel boundaries
Bentley-Ottmann (balanced-tree instead of insertion sort) for O(n log n + K) asymptotic performance
Go back to solution 1., but instead of run twice, combine two runs together
Don’t compute intersections at all; emit y + 1/4, y + 2/4, y + 3/4 whenever edges may change order between y, y + 1 (fast, pretty enough, but only approximately correct)

33 of 40

Intersection approximation: potential hazards

Under-cover for blit-once:

No constant error bounds, �only bounded by slope (dx/dy)

Over-cover for blit-all / winding

(GPU & Compute & pathfinder?):

No constant bounds,

only bounded by the number edges in a pixel

(what happens for evenodd?)

34 of 40

Intersections: deferred blit speedup

Deferred blit, blit only once even if there are two fractional pixel rows

Can’t deferred blit because edges change, blit twice, one for each fractional pixel row

35 of 40

Blitting Coverage

Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
Safe check: clip bounds, path bounds, alpha bounds (<= 255)

36 of 40

Blitting Coverage

Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
Safe check: clip bounds, path bounds, alpha bounds (<= 255)
Mask vs RLE vs Delta (possible future work, defer all sorting and summing)

255

64,�1

255, 4

---

255 - 64

---

-255

37 of 40

Blitting Coverage

Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
Safe check: clip bounds, path bounds, alpha bounds (<= 255)
Mask vs RLE vs Delta (possible future work, defer all sorting and summing)

255

64,�1

255, 4

---

255 - 64

---

-255

Input : <key, value> = <edge_id, edge>

Map : edge_id -> [<y, (x, delta)>,...]

(possibly fractional y)

Reduce: y -> {

1. Sort (x, delta) by x;

2. Compute prefix sum of delta;

Blit coverage based on the� prefix sum and a possible� winding transformation� function

}

(we may also have a z for paint order)

Future work on delta?

38 of 40

Miscellaneous speedup and lessons

Chop quads/cubics at integral y
Smooth jump
Assembly code is really hard to reason…
The following things are often slow: asymptotic, virtual, if, heap, inline-(or-not)

The fast code is asymptotically suboptimal; we don’t have much code reuse (maybe macro and template can save us); we expand 10-line “if” block into 30-line block without “if”; local stack variable blows up the code; and we have all different kinds of inlines based on experiments… ugly but fast

Chromium layout tests rebaseline

Rietveld is bad, Gerrit will save us

39 of 40

Caveat

Worst-case speedup is not good for paths with more verbs than resolution

High-level idea/algorithm is simple, but the devil is in the detail, and it’s actually quite hard to be faster than our original supersampling AA algorithm. That unfortunately makes my simple algorithm look quite complicated...

Maybe only 1%-10% of the paths are like these. But they took probably 80% of render time

40 of 40

Thank you! Questions?

Time to fill big triangle in ms

Thanks to reed@, caryclark@, djsollen@, fmalita@, mtklein@, msarett@, bungeman@, herb@, halcanary@, jvanverth@, egdaniel@, bsalomon@, ethannicholas@, allanmac@, csmartdalton@, kjlubick@, rmistry@, benjaminwagner@, stephana@, agable@, ...