Analytic Anti-Aliasing�(mainly CPU; also a little discussion for GPU/Compute)
Motivation
Quality
b/30130346 by Android Wear 2.0
GPU has them too
Motivation
Room for 4x CPU speedup + GPU sometimes relies on CPU for concave paths
curr/maxrss loops min stddev samples config bench
23/23 MB 99 4.23µs 2% ▅▁▁▁▁▁▁█▂▁ 8888 path_fill_big_rotated_rect_aa_45
31/32 MB 156 955ns 5% ▇▁▁▁█▁▁▁▁▁ 8888 path_fill_big_rotated_rect_noaa_45
Result -- fully landed
Time to fill big triangle in ms
Related work
Signal Processing:
Coverage Sampling:�MSAA (Multisample) / FSAA (Full-scene) / SSAA (Supersample)
Analytic Coverage
Rasterization and Coverage-based Anti-Aliasing
No AA
AA
The number to compute: coverage per pixel�(Area of the Intersection)
A single pixel
(unit area = 1.0 = 0xFF = 255)
??
It’s all about computing the area
Rasterization
Area Computation
count # black pixels
compute area in each pixel
Sampling: Monte Carlo for Area Computation
16x Supersampling
(Scanline in our previous CPU code)
16x MSAA
(Sampling in GPU)
5 / 16 = 0.3125
6 / 16 = 0.375
Analytic Area Computation
Principle: decompose to a summation of primitive shapes
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
+
OCD = (C.x * D.y - C.y * D.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
+
OCD = (C.x * D.y - C.y * D.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
+
OCD = (C.x * D.y - C.y * D.x) / 2
+
ODA = (D.x * A.y - D.y * A.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
+
OCD = (C.x * D.y - C.y * D.x) / 2
+
ODA = (D.x * A.y - D.y * A.x) / 2
Analytic Area Computation
Shoelace / Surveyor / GPU: Triangle Primitives
OAB = (A.x * B.y - A.y * B.x) / 2
+
OBC = (B.x * C.y - B.y * C.x) / 2
+
OCD = (C.x * D.y - C.y * D.x) / 2
+
ODA = (D.x * A.y - D.y * A.x) / 2
---------------------------------
Area = Shoelace (Surveyor) Formula
GPU Rasterization (see, e.g., csmartdalton@)
Analytic Area Computation
Triangle Area as Cross Product (det):
OAB = (A.x * B.y - A.y * B.x) / 2
Proof 1 [Sin]:
OA x OB = |OA|*|OB|*sin(AOB)
...but why sin(AOB)?
Analytic Area Computation
Triangle Area as Cross Product (det):
OAB = (A.x * B.y - A.y * B.x) / 2
Proof 1+:
OA x OB = |OA|*|OB|*sin(AOB)because
Multiple blackboards of these… (from khanacademy.org)
Analytic Area Computation
Triangle Area as Cross Product (det):
OAB = (A.x * B.y - A.y * B.x) / 2
Proof 2 [Triangle]:
OAB = -|BH| * A.y / 2
= -(B.x - H.x) * A.y / 2
= -(B.x - A.x * B.y / A.y) * A.y / 2
= (A.x * B.y - A.y * B.x) / 2
Analytic Area Computation
Triangle Area as Cross Product (det):
OAB = (A.x * B.y - A.y * B.x) / 2
Proof 3 [Trapezoid / Inverse-Allen]:
OAB = IABO + JBO - IAO
= -(A.x + B.x)*(A.y - B.y)/2
- B.x * B.y / 2 + A.x * A.y / 2
= (A.x * B.y - A.y * B.x) / 2
Analytic Area Computation
Allan’s Surveyor / Pathfinder / CPU Scanline: Trapezoid (incl. rect) Primitives
ABCD = IABJ + JBCL + LCDM + MDAI
(allanmac@)
CPU Scanline: 4x overhead, not 16x overhead
16x supersampling
No AA
CPU Scanline: 4x overhead, not 16x overhead
16x supersampling�is actually 4x rect primitives
No AA
CPU Scanline: old vs. new
There are only trapezoids (triangles, rects) between scanlines
(similar to Cairo and Direct2D, except we never tesselate explicitly)
16x supersampling:�4 scan-lines per pixel
Analytic AA:
1 scan-line per pixel, per edge endpoint, per intersection
Fractional scan-lines & pixels
Intersection of Trapezoid and Pixel
Principle: decompose to a summation of primitive shapes
A
B
C
Instead of computing area A using a general formula (e.g., shoelace), we compute 1 - B - C
2 multiplications
12 or 6 multiplications
General algorithm
Inclusion
Exclusion 2
Exclusion 1
Exclusion computation
(a + b) / 2 = (a + b) >> 1
a x b / 2 = a x a x slope >> 1
1 - a’ x a’ x slope >> 1
a x a x slope >> 1
a x slope + slope >> 1
a x slope + slope >> 1 + slope * i
...
1 - a’ x a’ x slope >> 1
a
b
a
a’
pathfinder:
More speedup for coverage computation
Directly compute coverage (inclusion) as (a + b) >> 1
Break into 3 pieces, directly compute coverage (inclusion) using our exclusion method (pathfinder?) on the left/right;�full coverage in the middle.
Intersections: how our solution evolves
Intersection approximation: potential hazards
Under-cover for blit-once:
No constant error bounds, �only bounded by slope (dx/dy)
Over-cover for blit-all / winding
(GPU & Compute & pathfinder?):
No constant bounds,
only bounded by the number edges in a pixel
(what happens for evenodd?)
Intersections: deferred blit speedup
Deferred blit, blit only once even if there are two fractional pixel rows
Can’t deferred blit because edges change, blit twice, one for each fractional pixel row
Blitting Coverage
Blitting Coverage
64
255
255
255
255
0
64,�1
255, 4
---
---
---
0,
0
64
255 - 64
---
---
---
-255
Blitting Coverage
64
255
255
255
255
0
64,�1
255, 4
---
---
---
0,
0
64
255 - 64
---
---
---
-255
Input : <key, value> = <edge_id, edge>
Map : edge_id -> [<y, (x, delta)>,...]
(possibly fractional y)
Reduce: y -> {
1. Sort (x, delta) by x;
2. Compute prefix sum of delta;
Blit coverage based on the� prefix sum and a possible� winding transformation� function
}
(we may also have a z for paint order)
Future work on delta?
Miscellaneous speedup and lessons
Caveat
Worst-case speedup is not good for paths with more verbs than resolution
High-level idea/algorithm is simple, but the devil is in the detail, and it’s actually quite hard to be faster than our original supersampling AA algorithm. That unfortunately makes my simple algorithm look quite complicated...
Maybe only 1%-10% of the paths are like these. But they took probably 80% of render time
Thank you! Questions?
A
B
C
Time to fill big triangle in ms
Thanks to reed@, caryclark@, djsollen@, fmalita@, mtklein@, msarett@, bungeman@, herb@, halcanary@, jvanverth@, egdaniel@, bsalomon@, ethannicholas@, allanmac@, csmartdalton@, kjlubick@, rmistry@, benjaminwagner@, stephana@, agable@, ...