HyperLogLog - A Layman’s Overview
by Andrew Sy
1
Intro
2
Count-Distinct Problem�
Count number of distinct elements in a multiset (i.e. collection with duplicate elements).
E.g.
SQL: COUNT DISTINCT
3
Example Problems (involving large data sets)
* Note some of the above have small expected distinct values (DV) cardinality. E.g. English language has only ~170k words.
4
Exact Count - Approaches
5
Probabilistic Count (aka DV Sketch) - Approaches
Trade accuracy for smaller size.
Examples:
6
HLL: current “standard” for DV Sketching
HyperLogLog (and its variants, HyperLogLog++)
7
HLL - Mechanics and Motivation
8
HLL Steps - outline
9
HLL Steps - diagram 0
Setup Registers - used to keep track of estimated count
E.g.
Setup Register with:
10
r0 | | | |
r1 | | | |
r2 | | | |
r3 | | | |
r4 | | | |
r5 | | | |
r6 | | | |
r7 | | | |
HLL Steps - diagram 1
Assume: m=8 registers, p=3 bits / register
11
input_A
murmur2
1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
Register# 5
N=2 leading zeroes
Record max(r5,N+1) into Register# 5
r0 | | | |
r1 | | | |
r2 | | | |
r3 | | | |
r4 | | | |
r5 | 0 | 1 | 1 |
r6 | | | |
r7 | | | |
HLL Steps - diagram 2
Assume: m=8 registers, p=3 bits / register
12
input_B
murmur2
0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
Register# 3
N=1 leading zeros
Record max(r3,N+1) into Register# 3
r0 | | | |
r1 | | | |
r2 | | | |
r3 | 0 | 1 | 0 |
r4 | | | |
r5 | 0 | 1 | 1 |
r6 | | | |
r7 | | | |
HLL Steps - diagram 3
Assume: m=8 registers, p=3 bits / register
13
input_C
murmur2
0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
Register# 3
N=4 leading zeros
Record max(r3,N+1) into Register# 3
r0 | | | |
r1 | | | |
r2 | | | |
r3 | 1 | 0 | 1 |
r4 | | | |
r5 | 0 | 1 | 1 |
r6 | | | |
r7 | | | |
HLL Steps - diagram 4
Assume: m=8 registers, p=3 bits / register
14
input_D
murmur2
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Register# 2
N=5 leading zeros
Record max(r2,N+1) into Register# 2
r0 | | | |
r1 | | | |
r2 | 1 | 1 | 0 |
r3 | 1 | 0 | 1 |
r4 | | | |
r5 | 0 | 1 | 1 |
r6 | | | |
r7 | | | |
* p=3 is the minimum number of bits required to hold max value of N+1 = 6 (i.e. p=2 won’t do).
HLL Steps - diagram 5
Estimated Count = Harmonic Mean of all registers.
15
2 |
3 |
2 |
6 |
5 |
3 |
3 |
2 |
harmonic_mean(2,3,2,6,5,3,3,2) = 2.7907
Count Leading Zeros - Rationale (part 1)
Given a bunch of uniformly distributed hash values, expect
1/2 to look like 1xxxxxxx
1/4 to look like 01xxxxxx
1/8 to look like 001xxxxx
1/16 to look like 0001xxxx
16
Count Leading Zeros - Rationale (part 2)
Flipping logic around, if max number of leading zeros N is:
DV = 2^(N+1)
* Not a very rigorous or satisfying explanation
17
Counting Leading Zeros = Coin Flip Run Length
Still not convinced?
Coin Flip Analogy: I spent some period of time flipping a coin.
18
Probabilistic Count - Intuition
A Wager:
Given collection of interger values uniformly distributed between [1, 100]. The smallest of these is 20, so I claim there are probably 5 distinct values. I then challenge you to make your own best guess (other than 5) as to the number of distinct values. Would you be willing to accept my wager? (5 has the highest probability of being the right guess, so that would not be wise of you). And what if the smallest value were 25 or 50?
19
Average of Multiple Hash Values
20
Average of Hash Values in Multiple Buckets
21
0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
Register# 3
N=4 leading zeros
Record max(r3,4) into Register# 3
HLL Accuracy, Data Struct and Size
22
HLL Accuracy - Table
Given m registers, expected error = 1.04 / sqrt(m).
23
Register pointer with k-bits | # of Registers = 2^k = m | Error = 1.04 / sqrt(2^k) |
0 | 1 | 104% |
1 | 2 | 73.5% |
2 | 4 | 52% |
... | ... | ... |
10 | 1024 | 3.25% |
11 | 2048 | 2.298% |
12 | 4096 | 1.625% |
HLL Accuracy - Graph
Register pointer size (k-bits) vs Expected Error: Diminishing Returns as k increases
24
k-bits (register pointer)
err
HLL Data Struct Size - Calculation
25
| | …. | | | | …. | | | | | |
11-bit bucket pointer
53-bits: count # of leading zeros in here
HLL Data Struct - Diagram
2048 registers x 6-bits / register
= 1536 bytes
26
register 0 | | | | | | |
register 1 | | | | | | |
register 2 | | | | | | |
| | | | | | |
…. | | | | | | |
| | | | | | |
| | | | | | |
register 2047 | | | | | | |
6-bits / register
HLL Unions
27
Union (Merge) of HLL Data Structs
Given 2 or more HLL Data Structs, both constructed using
You can take their union by:
28
HLL Union - Diagram
Union: take the max of each register comparison
29
2 |
8 |
6 |
5 |
3 |
3 |
4 |
4 |
2 |
5 |
3 |
8 |
6 |
5 |
5 |
∪
=
HLL Merge - Algebraic Operation
HLL Merge Operation is algebraic, i.e. commutative and associative (which has implications for efficient parallel and distributed computation).
30
Distributed Parallel Construction
31
Construct HLL via MapReduce - Map Step
Map: input tuple (deviceId, adCampaign)
32
Construct HLL via MapReduce - Combine Step
Combine: input (adCampaign, listOfHll)
Merge listOfHll into hllAccum
33
Construct HLL via MapReduce -Reduce Step
Reduce: input (adCampaign, listOfHll)
Same as Combine Step, except that Reduce involves shuffling. For each key, the data sent from a mapper task to the reducer node consists of a single HLL struct, so it is quite small (e.g. 1.5KB)
34
(k1, hll_1)
(k1, hll_2)
(k1, hll_3)
Application To AdTech (AdCampaigns, Target Segments, etc)
35