
HyperLogLog - A Layman’s Overview

by Andrew Sy


Intro


Count-Distinct Problem

Count the number of distinct elements in a multiset (i.e. a collection that may contain duplicates).

E.g.

SQL: SELECT COUNT(DISTINCT col) FROM some_table


Example Problems (involving large data sets)

  • Web Site - number of unique visitors
  • IP Network - number of unique IP addresses passing through
  • Mobile Advertising - number of unique deviceIDs reached per adCampaign
  • Biology - number of distinct motifs in a DNA sequence
  • Sensor Network - number of incidents (distinct timestamps) per event type
  • Linguistics - number of distinct words in a large body of work

* Note: some of the above have a small expected distinct-value (DV) cardinality. E.g. the English language has only ~170k words.


Exact Count - Approaches

  • Hash Map
    • suitable if the expected cardinality of distinct values (DV) is relatively small
  • Bit Map (with zero collisions)
    • map each distinct value to a bit field
    • more compact than a Hash Map, so can support larger DV cardinality
    • but still a relatively large data structure. E.g. supporting up to 1B distinct values requires at least ~2^30 bits (128MB). See the sketch below.
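A minimal sketch of the bit-map approach, assuming each distinct value maps collision-free to an index in [0, capacity):

    import java.util.BitSet

    // Exact distinct count via a bit map: one bit per possible distinct value.
    class BitMapCounter(capacity: Int) {
      private val bits = new BitSet(capacity)
      def add(index: Int): Unit = bits.set(index)  // mark this value as seen
      def count: Int = bits.cardinality()          // set bits = exact distinct count
    }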


Probabilistic Count (aka DV Sketch) - Approaches

Trade accuracy for smaller size.

Examples:

  • HLL (and its variants, such as HyperLogLog++)
  • K-Minimum Values (KMV)
  • Linear Probabilistic Counting (bit-map with collisions)


HLL: current “standard” for DV Sketching

HyperLogLog (and its variants, such as HyperLogLog++)

  • current Gold Standard for Probabilistic Counting with large expected DV cardinality
  • Good balance between
    • High accuracy
    • Fast Performance
    • Low space requirements


HLL - Mechanics and Motivation


HLL Steps - outline

  • Initialize the HLL data struct
    • Consists of m registers (aka buckets), each p bits in size
  • Distribute inputs uniformly by hashing (e.g. into a 64-bit hash value)
  • Partition the hash value
    • First k bits: determine the storage bucket (aka register)
    • Remaining bits: count the number of leading zeros
  • Store the max count of leading zeros (+1) into the appropriate register (sketch below)
  • Derive the cardinality estimate of distinct values (DV) from the harmonic mean across all the registers
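A minimal sketch of the update step in Scala. Assumptions beyond the slides: k = 11 pointer bits, one byte per register, and a 32-bit MurmurHash3 standing in for the 64-bit hash described later:

    import scala.util.hashing.MurmurHash3

    object HllSketch {
      val k = 11
      val m = 1 << k  // 2048 registers

      def add(registers: Array[Byte], value: String): Unit = {
        val hash   = MurmurHash3.stringHash(value)       // uniform 32-bit hash
        val bucket = hash >>> (32 - k)                   // first k bits pick the register
        val rest   = hash << k                           // remaining 32 - k bits, shifted up
        val n      = Integer.numberOfLeadingZeros(rest)  // run of leading zeros
        // store max(current, N+1); real implementations also cap the run length
        registers(bucket) = math.max(registers(bucket), n + 1).toByte
      }
    }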


HLL Steps - diagram 0

Set up the registers, used to keep track of the estimated count.

E.g. set up with:

  • m=8 registers, p=3 bits / register
  • 8 x 3 = 24 bits total

r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7 (all registers initialized to 0)


HLL Steps - diagram 1

Assume: m=8 registers, p=3 bits / register

input_A → murmur2 → 1010 0111

  • First 3 bits: 101 → Register# 5
  • Remaining 5 bits: 00111 → N = 2 leading zeros
  • Record max(r5, N+1) = 3 into Register# 5

Registers after: r5 = 011 (3); all others 0


HLL Steps - diagram 2

Assume: m=8 registers, p=3 bits / register

input_B → murmur2 → 0110 1111

  • First 3 bits: 011 → Register# 3
  • Remaining 5 bits: 01111 → N = 1 leading zero
  • Record max(r3, N+1) = 2 into Register# 3

Registers after: r3 = 010 (2), r5 = 011 (3); all others 0


HLL Steps - diagram 3

Assume: m=8 registers, p=3 bits / register

input_C → murmur2 → 0110 0001

  • First 3 bits: 011 → Register# 3
  • Remaining 5 bits: 00001 → N = 4 leading zeros
  • Record max(r3, N+1) = 5 into Register# 3

Registers after: r3 = 101 (5), r5 = 011 (3); all others 0


HLL Steps - diagram 4

Assume: m=8 registers, p=3 bits / register

input_D → murmur2 → 0100 0000

  • First 3 bits: 010 → Register# 2
  • Remaining 5 bits: 00000 → N = 5 leading zeros
  • Record max(r2, N+1) = 6 into Register# 2

Registers after: r2 = 110 (6), r3 = 101 (5), r5 = 011 (3); all others 0

* p=3 is the minimum number of bits required to hold the max possible value of N+1 = 6 (i.e. p=2 won't do).


HLL Steps - diagram 5

Estimated Count is derived from the Harmonic Mean across all the registers.

Register values: r0=2, r1=3, r2=2, r3=6, r4=5, r5=3, r6=3, r7=2

harmonic_mean(2,3,2,6,5,3,3,2) = 2.7907
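The harmonic mean by itself is not the final count; HLL scales it by m and a bias-correction constant. A sketch of both pieces, following the standard raw estimate E = alpha_m * m^2 / sum(2^-r_j) (the published small- and large-range corrections are omitted):

    // Harmonic mean: n divided by the sum of reciprocals; dampens large outliers.
    def harmonicMean(xs: Seq[Double]): Double = xs.size / xs.map(1.0 / _).sum

    // Raw HLL estimate; alphaM is the usual bias-correction constant (accurate for large m).
    def rawEstimate(registers: Array[Byte]): Double = {
      val m = registers.length
      val alphaM = 0.7213 / (1 + 1.079 / m)
      alphaM * m * m / registers.map(r => math.pow(2.0, -r)).sum
    }

    harmonicMean(Seq(2, 3, 2, 6, 5, 3, 3, 2))  // ≈ 2.7907, matching the slide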


Count Leading Zeros - Rationale (part 1)

Given a bunch of uniformly distributed hash values, we expect:

1/2 to look like 1xxxxxxx

1/4 to look like 01xxxxxx

1/8 to look like 001xxxxx

1/16 to look like 0001xxxx


Count Leading Zeros - Rationale (part 2)

Flipping the logic around, if the max number of leading zeros observed, N, is:

  • N=0 (i.e. 1xxxxxxx) => estimate DV = 2
  • N=1 (i.e. 01xxxxxx) => estimate DV = 4
  • N=2 (i.e. 001xxxxx) => estimate DV = 8
  • N=3 (i.e. 0001xxxx) => estimate DV = 16

DV = 2^(N+1)

* Not a very rigorous or satisfying explanation. (Intuition: a hash with N leading zeros followed by a 1 occurs with probability 2^-(N+1), so observing one suggests roughly 2^(N+1) values were hashed.)


Counting Leading Zeros = Coin Flip Run Length

Still not convinced?

Coin Flip Analogy: I spent some period of time flipping a coin.

  • If the max run length of heads I saw was 2, how many times do you think I flipped the coin?
  • If the max run length of heads I saw was 16, how many times do you think I flipped the coin?


Probabilistic Count - Intuition

A Wager:

Given a collection of integer values uniformly distributed between [1, 100], suppose the smallest of these is 20. I claim there are probably 5 distinct values (n uniform values split [1, 100] into roughly equal gaps, so a minimum near 20 suggests about 100/20 = 5 values). I then challenge you to make your own best guess (other than 5) as to the number of distinct values. Would you be willing to accept my wager? (5 has the highest probability of being the right guess, so that would not be wise of you.) And what if the smallest value were 25, or 50?


Average of Multiple Hash Values

  • Using a single hash value is too unreliable
    • E.g. 100 distinct values, but one happened to hash to 0000 0001 (N=7, so est DV = 2^8 = 256)
  • Solution: run each input through m different hash functions
    • Estimated DV = average of the m per-hash estimates
    • To dampen large outliers, use the Harmonic Mean


Average of Hash Values in Multiple Buckets

  • Higher m (number of hash functions) => higher accuracy
  • Too many hash functions => slow performance
  • Compromise Solution:
    • Single Hash Function
    • Partition hash output: bucket pointer (m buckets) + rest of hash value (to count leading zeros)

  • Est DV = derived from the Harmonic Mean across the m registers

[Diagram: hash 0110 0001 → first 3 bits 011 select Register# 3; the remaining 5 bits 00001 give N = 4 leading zeros; record max(r3, N+1) into Register# 3]

22 of 35

HLL Accuracy, Data Struct and Size


HLL Accuracy - Table

Given m registers, expected error = 1.04 / sqrt(m).

k (register pointer bits) | # of Registers m = 2^k | Error = 1.04 / sqrt(2^k)
--------------------------|------------------------|-------------------------
 0                        |    1                   | 104%
 1                        |    2                   | 73.5%
 2                        |    4                   | 52%
 ...                      |  ...                   | ...
10                        | 1024                   | 3.25%
11                        | 2048                   | 2.298%
12                        | 4096                   | 1.625%
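The error column can be reproduced directly from the slide's formula (a one-liner sketch):

    // Expected relative error given k register-pointer bits (m = 2^k registers).
    def expectedError(k: Int): Double = 1.04 / math.sqrt((1L << k).toDouble)

    expectedError(11)  // = 0.02298..., i.e. ~2.3%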


HLL Accuracy - Graph

Register pointer size (k-bits) vs Expected Error: Diminishing Returns as k increases

[Graph: expected error (y-axis) vs register pointer size k in bits (x-axis)]


HLL Data Struct Size - Calculation

  • Desired accuracy ~97%, i.e. expected error = 1.04/sqrt(2^11) = 2.3%
  • Need m = 2^11 = 2048 buckets
  • Assume a 64-bit hash: 64 - 11 bits (register pointer) = 53 bits left for run-length counting

  • Each register must be big enough to hold a max value of 53 (the max possible number of leading zeros)
  • A 5-bit register is too small: 2^5 - 1 = 31
  • A 6-bit register is big enough: 2^6 - 1 = 63 (we say register size p=6)
  • HLL data struct is 2048 x 6-bit registers = 12288 bits = 1536 bytes (exactly 1.5KB)

[Diagram: 64-bit hash = 11-bit bucket pointer followed by 53 bits in which to count leading zeros]
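The same arithmetic as a sketch:

    // Sizing for the example above: 11 pointer bits out of a 64-bit hash.
    val k = 11
    val m = 1 << k                                     // 2048 registers
    val maxRun = 64 - k                                // up to 53 leading zeros to record
    val p = 32 - Integer.numberOfLeadingZeros(maxRun)  // 6 bits per register (holds up to 63)
    val totalBytes = m * p / 8                         // 12288 bits / 8 = 1536 bytes = 1.5KB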


HLL Data Struct - Diagram

2048 registers x 6-bits / register

= 1536 bytes

[Diagram: register 0 | register 1 | register 2 | … | register 2047, at 6 bits per register]


HLL Unions


Union (Merge) of HLL Data Structs

Given 2 or more HLL data structs, all constructed using

  • same hash function
  • same values for k (register pointer size) and p (number of bits per register)

You can take their union by:

  • for (i <- 0 until m) hll_union.register(i) = max(hll_1.register(i), hll_2.register(i))
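As a self-contained function (a sketch, assuming one byte per register as in the earlier snippets):

    // Union two compatible HLLs (same hash function, same k and p).
    def merge(a: Array[Byte], b: Array[Byte]): Array[Byte] = {
      require(a.length == b.length, "both HLLs must have the same register count m")
      Array.tabulate(a.length)(i => math.max(a(i), b(i)).toByte)
    }

The merged struct estimates the cardinality of the union of the underlying sets, since each register ends up holding the max run length that either input stream produced.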


HLL Union - Diagram

Union: take the max of each register comparison

[Diagram: two 8-register HLLs combined register-by-register via max into a third HLL]


HLL Merge - Algebraic Operation

The HLL merge operation is algebraic, i.e. commutative and associative, which has implications for efficient parallel and distributed computation: partial unions can be built in any order, on any node, and merged later without affecting the result.


Distributed Parallel Construction


Construct HLL via MapReduce - Map Step

Map: input tuple (deviceId, adCampaign)

  • hash(deviceId)
  • Derive:
    • register pointer
    • run length (aka number of leading zeros)
  • Create an HLL data struct
  • Set the appropriate register in the HLL to the run length
  • Output: (k,v) pair = (adCampaign, HLL data struct). See the sketch below.
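A sketch of the Map step. AdEvent and emit are hypothetical names; add and m come from the earlier HllSketch snippet:

    // Hypothetical input record and emit callback; only the shape matters here.
    case class AdEvent(deviceId: String, adCampaign: String)

    def mapStep(event: AdEvent, emit: (String, Array[Byte]) => Unit): Unit = {
      val hll = new Array[Byte](HllSketch.m)  // fresh HLL, all registers zero
      HllSketch.add(hll, event.deviceId)      // hash deviceId, record the run length
      emit(event.adCampaign, hll)             // key = adCampaign, value = HLL struct
    }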


Construct HLL via MapReduce - Combine Step

Combine: input (adCampaign, listOfHll)

Merge listOfHll into hllAccum

  • hllAccum = new HLLDataStruct
  • for (hll <- listOfHll) { hllAccum.merge(hll) }
  • Output: (adCampaign, hllAccum)
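A sketch of the Combine body, reusing merge from the union slide:

    def combineStep(adCampaign: String,
                    hlls: Iterable[Array[Byte]]): (String, Array[Byte]) =
      (adCampaign, hlls.reduce(merge))  // register-wise max across all partial HLLs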


Construct HLL via MapReduce - Reduce Step

Reduce: input (adCampaign, listOfHll)

  • hllAccum = new HLLDataStruct
  • for (hll <- listOfHll) { hllAccum.merge(hll) }
  • Output: (adCampaign, hllAccum)

Same as the Combine Step, except that Reduce involves shuffling. For each key, the data sent from a mapper task to the reducer node consists of a single HLL struct, so it is quite small (e.g. 1.5KB).

[Diagram: shuffle delivers (k1, hll_1), (k1, hll_2), (k1, hll_3) to the reducer for key k1]


Application To AdTech (AdCampaigns, Target Segments, etc)

  • Storage of HLL Data Structs
    • keyed/partitioned per adCampaign per hour
    • each HLL data struct is quite small (e.g. 1.5KB)
  • Union Across Multiple AdCampaigns
  • Union Across Time Ranges
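For example, uniques for one campaign across a time range (hourlyHlls is a hypothetical lookup returning that campaign's stored per-hour HLL structs):

    // Unique devices for a campaign over a day: union the 24 hourly HLLs, then estimate.
    def dailyUniques(hourlyHlls: Seq[Array[Byte]]): Double =
      rawEstimate(hourlyHlls.reduce(merge))  // merge and rawEstimate from earlier sketches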
