1 of 40

Analytic Anti-Aliasing�(mainly CPU; also a little discussion for GPU/Compute)

Yuqian Li

liyuqian@google.com

Thanks to many Skia and Google engineers

2 of 40

Motivation

Quality

b/30130346 by Android Wear 2.0

GPU has them too

3 of 40

Motivation

Room for 4x CPU speedup + GPU sometimes relies on CPU for concave paths

curr/maxrss loops min stddev samples config bench

23/23 MB 99 4.23µs 2% ▅▁▁▁▁▁▁█▂▁ 8888 path_fill_big_rotated_rect_aa_45

31/32 MB 156 955ns 5% ▇▁▁▁█▁▁▁▁▁ 8888 path_fill_big_rotated_rect_noaa_45

4 of 40

Result -- fully landed

5 of 40

Time to fill big triangle in ms

6 of 40

Related work

Signal Processing:

  • Post-filtering: FXAA, SMAA
  • Pre-filtering: NSAA (Non-sampled), �AAA (Analytic) : Fancy, beyond simple coverage

Coverage Sampling:�MSAA (Multisample) / FSAA (Full-scene) / SSAA (Supersample)

Analytic Coverage

  • Catmull 1978, criticized by Carpenter 1984 as �“this can be extremely expensive” �(it also detects hidden surfaces)
  • Pathfinder (video) (thanks to mtklein@, allanmac@)
  • allanmac@, csmartdalton@

7 of 40

Rasterization and Coverage-based Anti-Aliasing

No AA

AA

The number to compute: coverage per pixel�(Area of the Intersection)

A single pixel

(unit area = 1.0 = 0xFF = 255)

??

8 of 40

It’s all about computing the area

Rasterization

  • Supersampling, MSAA
  • GPU Triangle Fanning
  • Surveyor’s Formula�(Shoelace Formula)
  • Our CPU Analytic AA...�

Area Computation

  • Analytic (e.g., ab/2, 𝝿/2*r^2)
  • Monte Carlo Sampling

count # black pixels

compute area in each pixel

9 of 40

Sampling: Monte Carlo for Area Computation

16x Supersampling

(Scanline in our previous CPU code)

16x MSAA

(Sampling in GPU)

5 / 16 = 0.3125

6 / 16 = 0.375

10 of 40

Analytic Area Computation

Principle: decompose to a summation of primitive shapes

11 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

12 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

13 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

14 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

15 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

+

OCD = (C.x * D.y - C.y * D.x) / 2

16 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

+

OCD = (C.x * D.y - C.y * D.x) / 2

17 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

+

OCD = (C.x * D.y - C.y * D.x) / 2

+

ODA = (D.x * A.y - D.y * A.x) / 2

18 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

+

OCD = (C.x * D.y - C.y * D.x) / 2

+

ODA = (D.x * A.y - D.y * A.x) / 2

19 of 40

Analytic Area Computation

Shoelace / Surveyor / GPU: Triangle Primitives

OAB = (A.x * B.y - A.y * B.x) / 2

+

OBC = (B.x * C.y - B.y * C.x) / 2

+

OCD = (C.x * D.y - C.y * D.x) / 2

+

ODA = (D.x * A.y - D.y * A.x) / 2

---------------------------------

Area = Shoelace (Surveyor) Formula

GPU Rasterization (see, e.g., csmartdalton@)

20 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 1 [Sin]:

OA x OB = |OA|*|OB|*sin(AOB)

...but why sin(AOB)?

21 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 1+:

OA x OB = |OA|*|OB|*sin(AOB)because

Multiple blackboards of these… (from khanacademy.org)

22 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 2 [Triangle]:

OAB = -|BH| * A.y / 2

= -(B.x - H.x) * A.y / 2

= -(B.x - A.x * B.y / A.y) * A.y / 2

= (A.x * B.y - A.y * B.x) / 2

23 of 40

Analytic Area Computation

Triangle Area as Cross Product (det):

OAB = (A.x * B.y - A.y * B.x) / 2

Proof 3 [Trapezoid / Inverse-Allen]:

OAB = IABO + JBO - IAO

= -(A.x + B.x)*(A.y - B.y)/2

- B.x * B.y / 2 + A.x * A.y / 2

= (A.x * B.y - A.y * B.x) / 2

24 of 40

Analytic Area Computation

Allan’s Surveyor / Pathfinder / CPU Scanline: Trapezoid (incl. rect) Primitives

ABCD = IABJ + JBCL + LCDM + MDAI

(allanmac@)

25 of 40

CPU Scanline: 4x overhead, not 16x overhead

16x supersampling

No AA

26 of 40

CPU Scanline: 4x overhead, not 16x overhead

16x supersampling�is actually 4x rect primitives

No AA

27 of 40

CPU Scanline: old vs. new

There are only trapezoids (triangles, rects) between scanlines

(similar to Cairo and Direct2D, except we never tesselate explicitly)

16x supersampling:�4 scan-lines per pixel

Analytic AA:

1 scan-line per pixel, per edge endpoint, per intersection

Fractional scan-lines & pixels

28 of 40

Intersection of Trapezoid and Pixel

Principle: decompose to a summation of primitive shapes

A

B

C

Instead of computing area A using a general formula (e.g., shoelace), we compute 1 - B - C

2 multiplications

12 or 6 multiplications

29 of 40

General algorithm

Inclusion

  1. Consider trapezoids as the join of two linear half-space
  2. Independently compute the exclusion per pixel of those two linear half-spaces
  3. The coverage is 1 - exclusions

Exclusion 2

Exclusion 1

30 of 40

Exclusion computation

(a + b) / 2 = (a + b) >> 1

a x b / 2 = a x a x slope >> 1

1 - a’ x a’ x slope >> 1

a x a x slope >> 1

a x slope + slope >> 1

a x slope + slope >> 1 + slope * i

...

1 - a’ x a’ x slope >> 1

a

b

a

a’

pathfinder:

31 of 40

More speedup for coverage computation

Directly compute coverage (inclusion) as (a + b) >> 1

Break into 3 pieces, directly compute coverage (inclusion) using our exclusion method (pathfinder?) on the left/right;�full coverage in the middle.

32 of 40

Intersections: how our solution evolves

  1. Reuse old scan-line procedure, run it twice and emit intersection in the first run during edge insertion-sort
  2. In the first run, jump between edge endpoints instead of all pixel boundaries
  3. Bentley-Ottmann (balanced-tree instead of insertion sort) for O(n log n + K) asymptotic performance
  4. Go back to solution 1., but instead of run twice, combine two runs together
  5. Don’t compute intersections at all; emit y + 1/4, y + 2/4, y + 3/4 whenever edges may change order between y, y + 1 (fast, pretty enough, but only approximately correct)

33 of 40

Intersection approximation: potential hazards

Under-cover for blit-once:

No constant error bounds, �only bounded by slope (dx/dy)

Over-cover for blit-all / winding

(GPU & Compute & pathfinder?):

No constant bounds,

only bounded by the number edges in a pixel

(what happens for evenodd?)

34 of 40

Intersections: deferred blit speedup

Deferred blit, blit only once even if there are two fractional pixel rows

Can’t deferred blit because edges change, blit twice, one for each fractional pixel row

35 of 40

Blitting Coverage

  • Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
  • Safe check: clip bounds, path bounds, alpha bounds (<= 255)

36 of 40

Blitting Coverage

  • Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
  • Safe check: clip bounds, path bounds, alpha bounds (<= 255)
  • Mask vs RLE vs Delta (possible future work, defer all sorting and summing)

64

255

255

255

255

0

64,�1

255, 4

---

---

---

0,

0

64

255 - 64

---

---

---

-255

37 of 40

Blitting Coverage

  • Snap alpha (coverage): blit 255 (or 0) is much faster than 254 (or 1)
  • Safe check: clip bounds, path bounds, alpha bounds (<= 255)
  • Mask vs RLE vs Delta (possible future work, defer all sorting and summing)

64

255

255

255

255

0

64,�1

255, 4

---

---

---

0,

0

64

255 - 64

---

---

---

-255

Input : <key, value> = <edge_id, edge>

Map : edge_id -> [<y, (x, delta)>,...]

(possibly fractional y)

Reduce: y -> {

1. Sort (x, delta) by x;

2. Compute prefix sum of delta;

Blit coverage based on the� prefix sum and a possible� winding transformation� function

}

(we may also have a z for paint order)

Future work on delta?

38 of 40

Miscellaneous speedup and lessons

  • Chop quads/cubics at integral y
  • Smooth jump
  • Assembly code is really hard to reason…
  • The following things are often slow: asymptotic, virtual, if, heap, inline-(or-not)
    • The fast code is asymptotically suboptimal; we don’t have much code reuse (maybe macro and template can save us); we expand 10-line “if” block into 30-line block without “if”; local stack variable blows up the code; and we have all different kinds of inlines based on experiments… ugly but fast
  • Chromium layout tests rebaseline
    • Rietveld is bad, Gerrit will save us

39 of 40

Caveat

Worst-case speedup is not good for paths with more verbs than resolution

High-level idea/algorithm is simple, but the devil is in the detail, and it’s actually quite hard to be faster than our original supersampling AA algorithm. That unfortunately makes my simple algorithm look quite complicated...

Maybe only 1%-10% of the paths are like these. But they took probably 80% of render time

40 of 40

Thank you! Questions?

A

B

C

Time to fill big triangle in ms

Thanks to reed@, caryclark@, djsollen@, fmalita@, mtklein@, msarett@, bungeman@, herb@, halcanary@, jvanverth@, egdaniel@, bsalomon@, ethannicholas@, allanmac@, csmartdalton@, kjlubick@, rmistry@, benjaminwagner@, stephana@, agable@, ...