1 of 26

CS365: Foundations of Data Science

Lecture 12

(3/15)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 26

SELECT COUNT(DISTINCT)

Estimating F0


3 of 26

Naive algorithmic solutions

  • Algorithm 1
    • Maintain a bitmask with n bits, one per item in the universe.
    • When an element x appears in the stream, if BITMASK[x] = 0, set it to 1.

Space complexity: O(n)

  • Algorithm 2
    • Store the whole stream.

Space complexity: O(m log n)

  • Algorithm 3 (a Julia sketch follows below)
    • d = Dict{}
    • For each x in the stream
      • If x is in dictionary d, do nothing
      • Else add x to d
    • Return the size of d

Worst-case space complexity: O(min(n, m log n))
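
For concreteness, here is a minimal Julia sketch of Algorithm 3 (exact counting with a set in place of the dictionary); it is an illustration, not part of the original slides:

```julia
# Exact distinct counting (Algorithm 3): store every distinct item seen.
# Worst-case space is O(min(n, m log n)) bits, i.e., proportional to F0.
function exact_f0(stream)
    seen = Set{eltype(stream)}()
    for x in stream
        push!(seen, x)      # no-op if x was already inserted
    end
    return length(seen)
end

exact_f0([10, 4, 1, 10, 1, 9, 3])   # returns 5
```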


4 of 26

Algorithm 1: Linear counting

Suppose we send each item x in the stream to one of k buckets according to some random (idealized, for now) hash function.

Example stream: 10 4 1 10 1 9 3 ...

Buckets: 1 2 3 ... k-1 k

5 of 26

Algorithm 1: Linear counting

Example stream: 10 4 1 10 1 9 3 ...

Buckets: 1 2 3 ... k-1 k

After hashing, e.g., items 10 and 4 land in one bucket, item 1 in another, and items 3 and 9 in a third.

Let Z be the number of buckets that did not receive any item. In expectation, we obtain

E[Z] = k(1 − 1/k)^F0 ≈ k·e^(−F0/k)

Estimator (inverting the expectation; a Julia sketch follows below):

F̂0 = k·ln(k/Z)
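
A minimal Julia sketch of linear counting, using Julia's built-in hash as a stand-in for the idealized random hash function (an illustration, not the original slide code):

```julia
# Linear counting: hash each item into one of k buckets, count empty buckets Z,
# and estimate F0 ≈ k * ln(k / Z).
function linear_counting(stream, k::Int)
    occupied = falses(k)                # bitmask over the k buckets
    for x in stream
        b = Int(hash(x) % UInt(k)) + 1  # bucket index in 1..k
        occupied[b] = true
    end
    Z = count(!, occupied)              # number of empty buckets
    Z == 0 && return Inf                # every bucket was hit: k was too small
    return k * log(k / Z)
end

linear_counting([10, 4, 1, 10, 1, 9, 3], 64)   # ≈ 5, up to hash randomness
```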

6 of 26

Algorithm 1: Linear counting

  • Setting the number of bins k requires knowing F0, the quantity we wish to estimate.
  • By first computing the variance of Z and then applying Chebyshev's inequality, we obtain the following corollary: setting k = F0/12 yields a standard error of less than 1%...
  • ... However, F0 can be O(n), so the space can also be prohibitively large using this approach.

Can we do better than this, and if yes, how much better can we do?


7 of 26

Algorithm 2: Idealized F0 estimation

Suppose we have access to a random hash function h:U→[0,1].

  • V ← +∞
  • For each x in the stream
    • If h(x) < V, then V ← h(x)
  • At the end of the stream, output 1/V − 1 as our estimate of the number of distinct elements (a Julia sketch follows below)
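
A minimal Julia sketch of this idealized estimator; mapping Julia's built-in hash into [0,1) stands in for the truly random hash function (an illustration only):

```julia
# Idealized F0 estimation: track the minimum hashed value V, output 1/V - 1.
h01(x) = hash(x) / typemax(UInt)    # pseudo-uniform value in [0,1)

function min_value_estimate(stream)
    V = Inf
    for x in stream
        V = min(V, h01(x))
    end
    return 1 / V - 1    # E[min of F0 uniform values] = 1/(F0 + 1), see next slide
end
```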


8 of 26

Minimum of n uniform random variables

Let’s assume that the hash function is fully random.

  • Let X1, ..., Xn be independent random variables, each uniform on [0,1].
  • Let Z = min(X1, ..., Xn). What is the expectation E[Z]?
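
The calculation the slide asks for, written out:

```latex
% Minimum of n i.i.d. Uniform[0,1] random variables.
\[
  \Pr[Z > t] \;=\; \prod_{i=1}^{n} \Pr[X_i > t] \;=\; (1-t)^n , \qquad t \in [0,1],
\]
\[
  \mathbb{E}[Z] \;=\; \int_0^1 \Pr[Z > t]\, dt \;=\; \int_0^1 (1-t)^n\, dt \;=\; \frac{1}{n+1}.
\]
% Hence, if V is the minimum hash value over F0 distinct items, E[V] = 1/(F0 + 1),
% which motivates outputting 1/V - 1 on the previous slide.
```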


9 of 26

Generalizing this idea: K Minimum Values (kMV) sketch

Basic idea: Use a “good enough” hash function h : [n] → R and keep the k smallest hashed values.

KMV-Init(k)

  • Pick a hash function h
  • Initialize L = [], a list of at most k (item, h(item)) pairs

KMV-Update(x)

  • If x is not in L
    • L ← L ∪ {(x, h(x))}
    • If |L| > k, remove the pair with the largest h-value from L

KMV-Query()

  • v ← largest hash value in L
  • Return (k − 1)·R/v, where R is the size of the hash range (for hashes in [0,1], the estimate is simply (k − 1)/v)
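
A minimal KMV sketch in Julia, hashing into [0,1) via the built-in hash (a stand-in for a "good enough" hash function; an illustrative sketch, not the original implementation):

```julia
# K Minimum Values (KMV): keep the k smallest hash values seen so far and
# estimate F0 from the k-th smallest one.
h01(x) = hash(x) / typemax(UInt)     # pseudo-uniform value in [0,1)

mutable struct KMV
    k::Int
    L::Dict{Any,Float64}             # item => hash value, at most k entries
end
KMV(k::Int) = KMV(k, Dict{Any,Float64}())

function update!(s::KMV, x)
    haskey(s.L, x) && return s
    s.L[x] = h01(x)
    if length(s.L) > s.k
        worst = argmax(key -> s.L[key], keys(s.L))   # item with the largest hash value
        delete!(s.L, worst)
    end
    return s
end

function query(s::KMV)
    length(s.L) < s.k && return float(length(s.L))   # fewer than k distinct items seen
    v = maximum(values(s.L))
    return (s.k - 1) / v             # hash range is [0,1), so R = 1
end
```

Usage: `sketch = KMV(64); foreach(x -> update!(sketch, x), stream); query(sketch)`.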


10 of 26

Flajolet-Martin (FM) sketch

  • Key assumption: We have an idealized hash function that maps each element of the universe to a string of random bits (i.e., Pr[bit = 0] = Pr[bit = 1] = 1/2).

  • Intuition: If we see the prefix 11110xx..., we have probably seen more than 32 distinct items.
  • Idea: Keep track of prefixes of the form 1^k 0 (k ones followed by a zero).


11 of 26

Flajolet-Martin (FM) sketch

Important functions:

1. h(x) = hash function that maps x to a uniformly random binary string

2. ρ(x) = position of the leftmost 0 when the bits are written least-significant bit first (e.g., ρ(111010100000) = 4); equivalently, in the usual msb-to-lsb order, the position of the rightmost 0 counted from the right (e.g., ρ(000001010111) = 4)

Algorithm

  • Initialize a bitmask of L bits to all zeros, i.e., BITMASK = 00...0
  • For each element x in the stream
    • BITMASK[ρ(h(x))] ← 1
  • Set R ← ρ(BITMASK), the position of the leftmost 0 of the bitmask
  • Output 2^R (the bias-corrected estimator 2^R/φ is discussed on the next slide)


12 of 26

Flajolet-Martin (FM) sketch

  • FM estimator: Clearly, 2^R, where R is the largest length of such an all-ones prefix, approximates the number of distinct elements. However, it turns out there is some small bias.
  • Flajolet and Martin proved a remarkable exact result: up to negligible terms, E[R] = log2(φ·F0) with φ ≈ 0.77351; the analysis involves the function ν(p) = number of bits equal to 1 in the binary representation of p.
  • Thus, an unbiased (modulo negligible terms) estimator becomes 2^R / φ.
  • Fascinating analysis of an algorithm!


13 of 26

Implementation details

  • Bitmask operations can be done very efficiently.
  • E.g., isolating the lowest 0 bit of x, which determines ρ(x):

x = 1215        # 0b10010111111
x + 1           # 0b10011000000
(x + 1) & ~x    # 0b00001000000  ← the lowest 0 bit of x


14 of 26

FM sketch in Julia (prototype)
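
A minimal prototype consistent with the algorithm on the previous slides; Julia's built-in hash stands in for the idealized hash function, and the bitmask length and the use of the constant φ are assumptions of this sketch:

```julia
# Flajolet-Martin sketch (prototype).
# rho0(y) = number of trailing 1-bits of y = 0-indexed position of the lowest 0 bit.
rho0(y) = trailing_ones(y)

function fm_estimate(stream; L::Int = 64)
    bitmask = falses(L)                          # BITMASK, initially all zeros
    for x in stream
        idx = min(rho0(hash(x)) + 1, L)          # +1: Julia arrays are 1-indexed
        bitmask[idx] = true
    end
    firstzero = findfirst(!, bitmask)            # leftmost 0 of the bitmask
    R = firstzero === nothing ? L : firstzero - 1
    phi = 0.77351                                # Flajolet-Martin correction constant
    return 2.0^R / phi                           # estimate of F0
end

fm_estimate(rand(1:10_000, 100_000))   # a single sketch gives the right order of magnitude
```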


15 of 26

Flajolet-Martin (FM) sketch

  • Is the constant 1/φ in the estimator unexpected?
    • Intuition: if we see a prefix 1111...10 of length R, there are likely to be more than 2^R and fewer than 2^(R+1) distinct elements.
  • The variance of the estimator is 1.257, which in practice means it can be off by a factor of 2.
  • Question: how do we reduce the variance?
    • Idea 1: use multiple hash functions [space-expensive]
    • Idea 2: stochastic averaging


16 of 26

Stochastic Averaging

  • We can improve the accuracy of the algorithm by a multiplicative factor of 1/√k by averaging the outputs of k different hash functions ⇒ expensive.
  • Better idea: use the first bits of the hash to create substreams! Split the elements into k = 2^l substreams using the first l bits of the hash. E.g., for l = 2:

h(v) = b1 b2 | b3 b4 b5 ...
       (substream id) (hash value used within the substream)

A sketch of this stochastic-averaging scheme follows below.
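
A Julia sketch of stochastic averaging over FM bitmasks; the final combination rule, (k/φ)·2^(average of the R_j), is the standard PCSA rule and is an assumption here, since the slide does not show the formula:

```julia
# Stochastic averaging (PCSA-style): split the stream into k = 2^l substreams
# using the first l bits of the hash, keep one FM bitmask per substream,
# and combine the per-substream values R_j through their average.
function pcsa_estimate(stream; l::Int = 6, L::Int = 64)
    k = 2^l
    bitmasks = falses(L, k)                  # one column of L bits per substream
    for x in stream
        hx = hash(x)
        j = Int(hx & (UInt(k) - 1)) + 1      # substream id from the first l bits
        rest = hx >> l                       # remaining bits used for rho
        idx = min(trailing_ones(rest) + 1, L)
        bitmasks[idx, j] = true
    end
    Rsum = 0
    for j in 1:k                             # R_j = run of 1s at the start of column j
        firstzero = findfirst(!, view(bitmasks, :, j))
        Rsum += firstzero === nothing ? L : firstzero - 1
    end
    phi = 0.77351
    return (k / phi) * 2.0^(Rsum / k)        # PCSA combination rule (assumed)
end
```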


17 of 26

Flajolet-Martin theorem

The estimator Z is asymptotically unbiased, i.e., E_n[Z] → n, and the coefficient of variation using k bitmasks is of order 1/√k.

Memory: O(k log n)

Example: we can count cardinalities up to 10^9 with error ≤ 6% using 4 kB of memory (= 4096 bytes).**

“Caveat” of their work: Practical implementations of the hash function are not discussed.

** See also Mitzenmacher-Vadhan: Why simple hash functions work: Exploiting the entropy in a data stream


18 of 26

AMS sketch

  • Flajolet and Martin assumed the existence of a family of hash functions with ideal properties.
  • Alon, Matias, and Szegedy provided a slight modification of the FM sketch that uses 2-wise independent hash functions.
  • Algorithm (a Julia sketch follows below). For each element x in the stream:
    • Compute r(h(x)) = the largest value r such that the rightmost r bits of h(x) are all 0.
    • R = max(R, r(h(x)))

Output Y = 2^R
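
A minimal Julia sketch of this estimator; the built-in hash stands in for a 2-wise independent hash function (a real implementation would draw, e.g., a random degree-1 polynomial modulo a prime):

```julia
# AMS-style F0 sketch: track the maximum number of trailing zeros of h(x)
# over the stream and output 2^R.
function ams_f0(stream)
    R = 0
    for x in stream
        r = trailing_zeros(hash(x))   # largest r s.t. the rightmost r bits are 0
        R = max(R, r)
    end
    return 2.0^R
end
```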


19 of 26

AMS Sketch for F0

Theorem: For every c>2 there exists an algorithm that outputs an estimate Y of F0 such that the probability that the ratio between Y and F0 is not between 1/c and c is at most 2/c.

Space usage: O(log n) bits to store the hash function, O(log log n) bits to store R.

Median trick: the failure probability can be reduced to δ using the standard median trick, i.e., repeat the process O(log(1/δ)) times independently and output the median. This yields an (O(1), δ)-approximation scheme.
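
A sketch of the median-trick wrapper; the helper name and its arguments are illustrative, not from the slides:

```julia
using Statistics   # for median

# Run t = O(log(1/δ)) independent copies of a randomized estimator and return
# the median of their outputs. Each call to run_once must use fresh, independent
# randomness (e.g., an independently chosen hash function).
function median_trick(run_once::Function, t::Int)
    estimates = [run_once() for _ in 1:t]
    return median(estimates)
end
```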


20 of 26

HyperLogLog++ (HLL++) [Heule-Nunkesser-Hall]

  • Engineering F0 algorithms
  • Low-end correction is very important in practice
    • Consider a small business relying on Google Analytics to estimate the number of unique visitors

For small cardinalities, linear counting is better than HLL; for large cardinalities, HLL is better than linear counting.

21 of 26

Algorithms for estimating F2

22 of 26

AMS sketch for F2

Algorithm

  • Let h:[n] → {-1,+1} be chosen from a 4-wise independent hash family H.
  • X← 0
  • For each element x in the stream
    • X← X+h(x)
  • Output X^2 (a Julia sketch follows below)
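
A minimal Julia sketch of the tug-of-war estimator; a ±1 sign derived from the built-in hash stands in for a 4-wise independent hash family (a real implementation would use, e.g., a random degree-3 polynomial over a finite field):

```julia
# AMS (tug-of-war) sketch for F2: each occurrence of an item pulls the counter
# by +1 or -1 according to a random sign; X^2 is the estimate of F2.
sign_hash(x) = isodd(hash(x)) ? 1 : -1

function ams_f2(stream)
    X = 0
    for x in stream
        X += sign_hash(x)
    end
    return X^2
end

ams_f2([1, 1, 2, 3, 3, 3])   # F2 = 2^2 + 1^2 + 3^2 = 14; a single sketch is noisy
```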


23 of 26

Why does this work?

  • Some elements push X in the positive direction (h(e) = +1), some in the negative direction (h(e) = −1)
    • Hence the name "tug-of-war" sketch
  • Define the random variable Y_j = h(j), so that X = Σ_j f_j·Y_j, where f_j is the number of occurrences of item j in the stream. Then E[Y_j] = 0 and Y_j^2 = 1,

and therefore, by linearity of expectation and taking the 2-wise independence (implied by 4-wise independence) of the variables Y_i, Y_j into account, we get

E[X^2] = Σ_{i,j} f_i·f_j·E[Y_i·Y_j] = Σ_j f_j^2 = F2.



24 of 26

Second moment

Now we compute the variance of X^2. We get Var[X^2] = E[X^4] − E[X^2]^2 = E[X^4] − F2^2. We expand and bound the expectation E[X^4] as shown below.
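
The expansion, written out (reconstructed from the definitions on the previous slide):

```latex
% Fourth moment of the tug-of-war sketch X = sum_j f_j Y_j with Y_j in {-1,+1}.
\[
  \mathbb{E}[X^4]
  \;=\; \sum_{i,j,k,l} f_i f_j f_k f_l \,\mathbb{E}[Y_i Y_j Y_k Y_l]
  \;=\; \sum_{j} f_j^4 \;+\; 6 \sum_{i<j} f_i^2 f_j^2
  \;\le\; 3\Big(\sum_j f_j^2\Big)^2 \;=\; 3 F_2^2 ,
\]
% since, by 4-wise independence, E[Y_i Y_j Y_k Y_l] is nonzero only when the
% indices pair up (all four equal, or two equal pairs). Therefore
\[
  \mathrm{Var}[X^2] \;=\; \mathbb{E}[X^4] - F_2^2 \;\le\; 2 F_2^2 .
\]
```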


25 of 26

Reminder

Theorem: Let X be an unbiased estimator of a quantity Q. Let {X_ij}, 1 ≤ i ≤ t, 1 ≤ j ≤ s, be a collection of independent RVs with X_ij distributed identically to X, where t = O(log(1/δ)) and s = O(Var[X] / (ε^2·E[X]^2)).

Let Z = median_{1≤i≤t} ( (1/s)·Σ_{j=1}^{s} X_ij ). Then Pr[ |Z − Q| > ε·Q ] ≤ δ.

By using this median-of-means improvement we obtain an (ε, δ)-approximation for F2 using O( (1/ε^2)·log(1/δ)·(log n + log m) ) bits of space.


26 of 26

Readings
