1 of 26

CS365: Foundations of Data Science

Lecture 12

(3/15)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 26

SELECT COUNT(DISTINCT)

Estimating F0


3 of 26

Naive algorithmic solutions

  • Algorithm 1
    • Maintain a bitmask with n bits, one per item in the universe.
    • When an element x appears in the stream, if BITMASK[x] = 0, set it to 1.

Space complexity: O(n)

  • Algorithm 2
    • Store the whole stream.

Space complexity: O(m log n)

  • Algorithm 3 (a Julia sketch follows below)
    • d = Dict{}
    • For each x in the stream
      • If x is in dictionary d, do nothing
      • Else add x to d
    • Return the size of d

Worst-case space complexity: O(min(n, m log n))
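
For concreteness, here is a minimal Julia sketch of Algorithm 3 (exact counting with a set in place of the dictionary); it is an illustration, not part of the original slides:

```julia
# Exact distinct counting (Algorithm 3): store every distinct item seen.
# Worst-case space is O(min(n, m log n)) bits, i.e., proportional to F0.
function exact_f0(stream)
    seen = Set{eltype(stream)}()
    for x in stream
        push!(seen, x)      # no-op if x was already inserted
    end
    return length(seen)
end

exact_f0([10, 4, 1, 10, 1, 9, 3])   # returns 5
```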


4 of 26

Algorithm 1: Linear counting

Suppose we send each item x in the stream to one of k buckets according to some random (idealized, for now) hash function.

Example stream: 10 4 1 10 1 9 3 ...

Buckets: 1 2 3 ... k-1 k

5 of 26

Algorithm 1: Linear counting

Example stream: 10 4 1 10 1 9 3 ...

Buckets: 1 2 3 ... k-1 k

After hashing, e.g., items 10 and 4 land in one bucket, item 1 in another, and items 3 and 9 in a third.

Let Z be the number of buckets that did not receive any item. In expectation, we obtain

E[Z] = k(1 − 1/k)^F0 ≈ k·e^(−F0/k)

Estimator (inverting the expectation; a Julia sketch follows below):

F̂0 = k·ln(k/Z)
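
A minimal Julia sketch of linear counting, using Julia's built-in hash as a stand-in for the idealized random hash function (an illustration, not the original slide code):

```julia
# Linear counting: hash each item into one of k buckets, count empty buckets Z,
# and estimate F0 ≈ k * ln(k / Z).
function linear_counting(stream, k::Int)
    occupied = falses(k)                # bitmask over the k buckets
    for x in stream
        b = Int(hash(x) % UInt(k)) + 1  # bucket index in 1..k
        occupied[b] = true
    end
    Z = count(!, occupied)              # number of empty buckets
    Z == 0 && return Inf                # every bucket was hit: k was too small
    return k * log(k / Z)
end

linear_counting([10, 4, 1, 10, 1, 9, 3], 64)   # ≈ 5, up to hash randomness
```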

6 of 26

Algorithm 1: Linear counting

  • Setting the number of bins k requires knowing F0, the quantity we wish to estimate.
  • By first computing the variance of Z and then applying Chebyshev's inequality, we obtain the following corollary: setting k = F0/12 yields a standard error of less than 1%...
  • ... However, F0 can be O(n), so the space can also be prohibitively large using this approach.

Can we do better than this, and if yes, how much better can we do?


7 of 26

Algorithm 2: Idealized F0 estimation

Suppose we have access to a random hash function h:U→[0,1].

  • V ← +∞
  • For each x in the stream
    • If h(x) < V, then V ← h(x)
  • At the end of the stream, output 1/V − 1 as our estimate of the number of distinct elements (a Julia sketch follows below)
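
A minimal Julia sketch of this idealized estimator; mapping Julia's built-in hash into [0,1) stands in for the truly random hash function (an illustration only):

```julia
# Idealized F0 estimation: track the minimum hashed value V, output 1/V - 1.
h01(x) = hash(x) / typemax(UInt)    # pseudo-uniform value in [0,1)

function min_value_estimate(stream)
    V = Inf
    for x in stream
        V = min(V, h01(x))
    end
    return 1 / V - 1    # E[min of F0 uniform values] = 1/(F0 + 1), see next slide
end
```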


8 of 26

Minimum of n uniform random variables

Let’s assume that the hash function is fully random.

  • Let X1, ..., Xn be independent random variables, each uniform on [0,1].
  • Let Z = min(X1, ..., Xn). What is the expectation E[Z]?
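
The calculation the slide asks for, written out:

```latex
% Minimum of n i.i.d. Uniform[0,1] random variables.
\[
  \Pr[Z > t] \;=\; \prod_{i=1}^{n} \Pr[X_i > t] \;=\; (1-t)^n , \qquad t \in [0,1],
\]
\[
  \mathbb{E}[Z] \;=\; \int_0^1 \Pr[Z > t]\, dt \;=\; \int_0^1 (1-t)^n\, dt \;=\; \frac{1}{n+1}.
\]
% Hence, if V is the minimum hash value over F0 distinct items, E[V] = 1/(F0 + 1),
% which motivates outputting 1/V - 1 on the previous slide.
```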


9 of 26

Generalizing this idea: K Minimum Values (kMV) sketch

Basic idea: Use a “good enough” hash function h : [n] → R and keep the k smallest hashed values.

KMV-Init(k)

  • Pick a hash function h
  • Initialize L = [], a list of at most k (item, h(item)) pairs

KMV-Update(x)

  • If x is not in L
    • L ← L ∪ {(x, h(x))}
    • If |L| > k, remove the pair with the largest h-value from L

KMV-Query()

  • v ← largest hash value in L
  • Return (k − 1)·R/v, where R is the size of the hash range (for hashes in [0,1], the estimate is simply (k − 1)/v)
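
A minimal KMV sketch in Julia, hashing into [0,1) via the built-in hash (a stand-in for a "good enough" hash function; an illustrative sketch, not the original implementation):

```julia
# K Minimum Values (KMV): keep the k smallest hash values seen so far and
# estimate F0 from the k-th smallest one.
h01(x) = hash(x) / typemax(UInt)     # pseudo-uniform value in [0,1)

mutable struct KMV
    k::Int
    L::Dict{Any,Float64}             # item => hash value, at most k entries
end
KMV(k::Int) = KMV(k, Dict{Any,Float64}())

function update!(s::KMV, x)
    haskey(s.L, x) && return s
    s.L[x] = h01(x)
    if length(s.L) > s.k
        worst = argmax(key -> s.L[key], keys(s.L))   # item with the largest hash value
        delete!(s.L, worst)
    end
    return s
end

function query(s::KMV)
    length(s.L) < s.k && return float(length(s.L))   # fewer than k distinct items seen
    v = maximum(values(s.L))
    return (s.k - 1) / v             # hash range is [0,1), so R = 1
end
```

Usage: `sketch = KMV(64); foreach(x -> update!(sketch, x), stream); query(sketch)`.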


10 of 26

Flajolet-Martin (FM) sketch

  • Key assumption: We have an idealized hash function that maps each element of the universe to a string of random bits (i.e., Pr[bit = 0] = Pr[bit = 1] = 1/2).

  • Intuition: If we see the prefix 11110xx..., we have probably seen more than 32 distinct items.
  • Idea: Keep track of prefixes of the form 1^k 0 (k ones followed by a zero).


11 of 26

Flajolet-Martin (FM) sketch

Important functions:

1. h(x) = hash function that maps x to a uniformly random binary string

2. ρ(x) = position of the leftmost 0 when the bits are written least-significant bit first (e.g., ρ(111010100000) = 4); equivalently, in the usual msb-to-lsb order, the position of the rightmost 0 counted from the right (e.g., ρ(000001010111) = 4)

Algorithm

  • Initialize a bitmask of L bits to all zeros, i.e., BITMASK = 00...0
  • For each element x in the stream
    • BITMASK[ρ(h(x))] ← 1
  • Set R ← ρ(BITMASK), the position of the leftmost 0 of the bitmask
  • Output 2^R (the bias-corrected estimator 2^R/φ is discussed on the next slide)


12 of 26

Flajolet-Martin (FM) sketch

  • FM estimator: Clearly, 2^R, where R is the largest length of such an all-ones prefix, approximates the number of distinct elements. However, it turns out there is some small bias.
  • Flajolet and Martin proved a remarkable exact result: up to negligible terms, E[R] = log2(φ·F0) with φ ≈ 0.77351; the analysis involves the function ν(p) = number of bits equal to 1 in the binary representation of p.
  • Thus, an unbiased (modulo negligible terms) estimator becomes 2^R / φ.
  • Fascinating analysis of an algorithm!


13 of 26

Implementation details

  • Bitmask operations can be done very efficiently.
  • E.g., isolating the lowest 0 bit of x, which determines ρ(x):

x = 1215        # 0b10010111111
x + 1           # 0b10011000000
(x + 1) & ~x    # 0b00001000000  ← the lowest 0 bit of x


14 of 26

FM sketch in Julia (prototype)
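
A minimal prototype consistent with the algorithm on the previous slides; Julia's built-in hash stands in for the idealized hash function, and the bitmask length and the use of the constant φ are assumptions of this sketch:

```julia
# Flajolet-Martin sketch (prototype).
# rho0(y) = number of trailing 1-bits of y = 0-indexed position of the lowest 0 bit.
rho0(y) = trailing_ones(y)

function fm_estimate(stream; L::Int = 64)
    bitmask = falses(L)                          # BITMASK, initially all zeros
    for x in stream
        idx = min(rho0(hash(x)) + 1, L)          # +1: Julia arrays are 1-indexed
        bitmask[idx] = true
    end
    firstzero = findfirst(!, bitmask)            # leftmost 0 of the bitmask
    R = firstzero === nothing ? L : firstzero - 1
    phi = 0.77351                                # Flajolet-Martin correction constant
    return 2.0^R / phi                           # estimate of F0
end

fm_estimate(rand(1:10_000, 100_000))   # a single sketch gives the right order of magnitude
```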


15 of 26

Flajolet-Martin (FM) sketch

  • Is the constant 1/φ in the estimator unexpected?
    • Intuition: if we see a prefix 1111...10 of length R, there are likely to be more than 2^R and fewer than 2^(R+1) distinct elements.
  • The variance of the estimator is 1.257, which in practice means it can be off by a factor of 2.
  • Question: how do we reduce the variance?
    • Idea 1: use multiple hash functions [space-expensive]
    • Idea 2: stochastic averaging


16 of 26

Stochastic Averaging

  • We can improve the accuracy of the algorithm by a multiplicative factor of 1/√k by averaging the outputs of k different hash functions ⇒ expensive.
  • Better idea: use the first bits of the hash to create substreams! Split the elements into k = 2^l substreams using the first l bits of the hash. E.g., for l = 2:

h(v) = b1 b2 | b3 b4 b5 ...
       (substream id) (hash value used within the substream)

A sketch of this stochastic-averaging scheme follows below.
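
A Julia sketch of stochastic averaging over FM bitmasks; the final combination rule, (k/φ)·2^(average of the R_j), is the standard PCSA rule and is an assumption here, since the slide does not show the formula:

```julia
# Stochastic averaging (PCSA-style): split the stream into k = 2^l substreams
# using the first l bits of the hash, keep one FM bitmask per substream,
# and combine the per-substream values R_j through their average.
function pcsa_estimate(stream; l::Int = 6, L::Int = 64)
    k = 2^l
    bitmasks = falses(L, k)                  # one column of L bits per substream
    for x in stream
        hx = hash(x)
        j = Int(hx & (UInt(k) - 1)) + 1      # substream id from the first l bits
        rest = hx >> l                       # remaining bits used for rho
        idx = min(trailing_ones(rest) + 1, L)
        bitmasks[idx, j] = true
    end
    Rsum = 0
    for j in 1:k                             # R_j = run of 1s at the start of column j
        firstzero = findfirst(!, view(bitmasks, :, j))
        Rsum += firstzero === nothing ? L : firstzero - 1
    end
    phi = 0.77351
    return (k / phi) * 2.0^(Rsum / k)        # PCSA combination rule (assumed)
end
```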


17 of 26

Flajolet-Martin theorem

The estimator Z is asymptotically unbiased, i.e., E_n[Z] → n, and the coefficient of variation using k bitmasks is of order 1/√k.

Memory: O(k log n)

Example: we can count cardinalities up to 10^9 with error ≤ 6% using 4 kB of memory (= 4096 bytes).**

“Caveat” of their work: Practical implementations of the hash function are not discussed.

** See also Mitzenmacher-Vadhan: Why simple hash functions work: Exploiting the entropy in a data stream


18 of 26

AMS sketch

  • Flajolet and Martin assumed the existence of a family of hash functions with ideal properties.
  • Alon, Matias, and Szegedy provided a slight modification of the FM sketch that uses 2-wise independent hash functions.
  • Algorithm (a Julia sketch follows below). For each element x in the stream:
    • Compute r(h(x)) = the largest value r such that the rightmost r bits of h(x) are all 0.
    • R = max(R, r(h(x)))

Output Y = 2^R
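
A minimal Julia sketch of this estimator; the built-in hash stands in for a 2-wise independent hash function (a real implementation would draw, e.g., a random degree-1 polynomial modulo a prime):

```julia
# AMS-style F0 sketch: track the maximum number of trailing zeros of h(x)
# over the stream and output 2^R.
function ams_f0(stream)
    R = 0
    for x in stream
        r = trailing_zeros(hash(x))   # largest r s.t. the rightmost r bits are 0
        R = max(R, r)
    end
    return 2.0^R
end
```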


19 of 26

AMS Sketch for F0

Theorem: For every c>2 there exists an algorithm that outputs an estimate Y of F0 such that the probability that the ratio between Y and F0 is not between 1/c and c is at most 2/c.

Space usage: O(log n) bits to store the hash function, O(log log n) bits to store R.

Median trick: the failure probability can be reduced to δ using the standard median trick, i.e., repeat the process O(log(1/δ)) times independently and output the median. This yields an (O(1), δ)-approximation scheme.
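
A sketch of the median-trick wrapper; the helper name and its arguments are illustrative, not from the slides:

```julia
using Statistics   # for median

# Run t = O(log(1/δ)) independent copies of a randomized estimator and return
# the median of their outputs. Each call to run_once must use fresh, independent
# randomness (e.g., an independently chosen hash function).
function median_trick(run_once::Function, t::Int)
    estimates = [run_once() for _ in 1:t]
    return median(estimates)
end
```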


20 of 26

HyperLogLog++ (HLL++) [Heule-Nunkesser-Hall]

  • Engineering F0 algorithms
  • Low-end correction is very important in practice
    • Consider a small business relying on Google Analytics to estimate the number of unique visitors

For small cardinalities, linear counting is better than HLL; for large cardinalities, HLL is better than linear counting.

21 of 26

Algorithms for estimating F2

22 of 26

AMS sketch for F2

Algorithm

  • Let h:[n] → {-1,+1} be chosen from a 4-wise independent hash family H.
  • X← 0
  • For each element x in the stream
    • X← X+h(x)
  • Output X^2 (a Julia sketch follows below)
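
A minimal Julia sketch of the tug-of-war estimator; a ±1 sign derived from the built-in hash stands in for a 4-wise independent hash family (a real implementation would use, e.g., a random degree-3 polynomial over a finite field):

```julia
# AMS (tug-of-war) sketch for F2: each occurrence of an item pulls the counter
# by +1 or -1 according to a random sign; X^2 is the estimate of F2.
sign_hash(x) = isodd(hash(x)) ? 1 : -1

function ams_f2(stream)
    X = 0
    for x in stream
        X += sign_hash(x)
    end
    return X^2
end

ams_f2([1, 1, 2, 3, 3, 3])   # F2 = 2^2 + 1^2 + 3^2 = 14; a single sketch is noisy
```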


23 of 26

Why does this work?

  • Some elements push X in the positive direction (h(e) = +1), some in the negative direction (h(e) = −1)
    • Hence the name "tug-of-war" sketch
  • Define the random variable Y_j = h(j), so that X = Σ_j f_j·Y_j, where f_j is the number of occurrences of item j in the stream. Then E[Y_j] = 0 and Y_j^2 = 1,

and therefore, by linearity of expectation and taking the 2-wise independence (implied by 4-wise independence) of the variables Y_i, Y_j into account, we get

E[X^2] = Σ_{i,j} f_i·f_j·E[Y_i·Y_j] = Σ_j f_j^2 = F2.



24 of 26

Second moment

Now we compute the variance of X^2. We get Var[X^2] = E[X^4] − E[X^2]^2 = E[X^4] − F2^2. We expand and bound the expectation E[X^4] as shown below.
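
The expansion, written out (reconstructed from the definitions on the previous slide):

```latex
% Fourth moment of the tug-of-war sketch X = sum_j f_j Y_j with Y_j in {-1,+1}.
\[
  \mathbb{E}[X^4]
  \;=\; \sum_{i,j,k,l} f_i f_j f_k f_l \,\mathbb{E}[Y_i Y_j Y_k Y_l]
  \;=\; \sum_{j} f_j^4 \;+\; 6 \sum_{i<j} f_i^2 f_j^2
  \;\le\; 3\Big(\sum_j f_j^2\Big)^2 \;=\; 3 F_2^2 ,
\]
% since, by 4-wise independence, E[Y_i Y_j Y_k Y_l] is nonzero only when the
% indices pair up (all four equal, or two equal pairs). Therefore
\[
  \mathrm{Var}[X^2] \;=\; \mathbb{E}[X^4] - F_2^2 \;\le\; 2 F_2^2 .
\]
```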


25 of 26

Reminder

Theorem: Let X be an unbiased estimator of a quantity Q. Let {X_ij}, 1 ≤ i ≤ t, 1 ≤ j ≤ s, be a collection of independent RVs with X_ij distributed identically to X, where t = O(log(1/δ)) and s = O(Var[X] / (ε^2·E[X]^2)).

Let Z = median_{1≤i≤t} ( (1/s)·Σ_{j=1}^{s} X_ij ). Then Pr[ |Z − Q| > ε·Q ] ≤ δ.

By using this median-of-means improvement we obtain an (ε, δ)-approximation for F2 using O( (1/ε^2)·log(1/δ)·(log n + log m) ) bits of space.


26 of 26

Readings
