CS365: Foundations of Data Science
Lecture 12
(3/15)
Charalampos E. Tsourakakis (ctsourak@bu.edu)
SELECT COUNT(DISTINCT)
Estimating F0
Naive algorithmic solutions
Space complexity: O(n) bits (keep one bit per element of the universe [n])
Space complexity: O(m log n) bits (store every distinct item seen in a stream of length m)
Worst-case space complexity: O(min(n, m log n))
Algorithm 1: Linear counting
Stream: 10, 4, 1, 10, 1, 9, 3, ...
Buckets: 1, 2, 3, ..., k-1, k
Suppose we send each item x in the stream to a bucket according to some random (idealized for now) hash function.
Algorithm 1: Linear counting
Stream: 10, 4, 1, 10, 1, 9, 3, ...
Buckets: 1, 2, 3, ..., k-1, k
Bucket contents after hashing: {10, 4, 10} | {1} | {3, 9} | ...
Let Z be the number of buckets that didn’t receive any item. In expectation, we obtain
E[Z] = k·(1 − 1/k)^F0 ≈ k·e^(−F0/k)
Estimator: F̂0 = k·ln(k/Z)
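A minimal Julia sketch of linear counting (an illustration, not the lecture's code; it uses Julia's built-in hash as a stand-in for the idealized random hash function and assumes at least one bucket stays empty):

# Linear counting: hash items into k buckets, estimate F0 from the empty-bucket count.
function linear_counting(stream, k)
    occupied = falses(k)                  # k buckets, all empty initially
    for x in stream
        b = Int(hash(x) % UInt(k)) + 1    # bucket index of x in 1..k
        occupied[b] = true
    end
    Z = count(!, occupied)                # number of buckets that received no item
    return k * log(k / Z)                 # invert E[Z] ≈ k * exp(-F0 / k)
end

# usage: linear_counting(rand(1:1_000, 100_000), 4_096)  (expect an estimate near 1000)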
Algorithm 1: Linear counting
Can we do better than this, and if yes, how much better can we do?
Algorithm 2: Idealized F0 estimation
Suppose we have access to a random hash function h:U→[0,1].
Minimum of n uniform random variables
Let’s assume that the hash function is fully random.
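The key calculation behind this slide (filled in here, since the formula itself did not survive the export): the minimum of n i.i.d. Uniform(0,1) random variables has expectation 1/(n+1).

\[
\Pr\Big[\min_{1\le i\le n} X_i > t\Big] = (1-t)^n
\quad\Longrightarrow\quad
\mathbb{E}\Big[\min_{1\le i\le n} X_i\Big] = \int_0^1 (1-t)^n \, dt = \frac{1}{n+1}.
\]

Hence, for Y = min over the stream of h(x), we have E[Y] = 1/(F0 + 1), which suggests the estimator F̂0 = 1/Y − 1.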
Generalizing this idea: K Minimum Values (kMV) sketch
Basic idea: Use a “good enough” hash function h: [n] → [0, R] and keep the k smallest hashed values.
KMV-Init(k)
  L ← ∅
KMV-Update(x)
  If x is not in L:
    L ← L ∪ {(x, h(x))}
    if |L| > k then remove the pair with the largest h(x) value from L
KMV-Query()
  v ← largest hash value in L
  Return (k−1)·R/v
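A compact Julia sketch of KMV (illustrative only; it hashes to [0, 1) by rescaling Julia's built-in 64-bit hash, so the range R in the query formula equals 1):

# K Minimum Values sketch: retain the k smallest hash values seen so far.
struct KMV
    k::Int
    L::Dict{Any,Float64}                  # item => hash value; at most k entries kept
end
KMV(k::Int) = KMV(k, Dict{Any,Float64}())

hash01(x) = hash(x) / typemax(UInt64)     # pseudo-uniform value in [0, 1)

function update!(s::KMV, x)
    haskey(s.L, x) && return s            # x already in L
    s.L[x] = hash01(x)
    if length(s.L) > s.k
        _, worst = findmax(s.L)           # key with the largest hash value (Julia ≥ 1.7)
        delete!(s.L, worst)
    end
    return s
end

# estimate F0 as (k−1)·R/v with R = 1 and v the largest retained hash value
query(s::KMV) = (s.k - 1) / maximum(values(s.L))

# usage: s = KMV(256); foreach(x -> update!(s, x), stream); query(s)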
Flajolet-Martin (FM) sketch
Bitmap after processing a stream (low positions almost surely set, high positions still 0): 1 | 1 | 1 | 1 | 0 | x | x |
Flajolet-Martin (FM) sketch
Important functions:
1. h(x) = hash function that transforms x into a uniform binary string
2. ρ(x) = position of the leftmost 0, reading the string from the least significant bit (e.g., ρ(111010100000) = 4); equivalently, in the usual msb-to-lsb ordering, the position of the rightmost 0 counted from the right (e.g., ρ(000001010111) = 4)
Algorithm
  Initialize a bitmap B of length O(log n) to all zeros:  0 | 0 | …. | 0 |
  For each item x in the stream, set the bit of B at position ρ(h(x)) to 1.
  At the end, let R = ρ(B), the position of the leftmost 0 of the bitmap, and output an estimate proportional to 2^R (scaled by a correction constant φ ≈ 0.77351 so that the estimate is asymptotically unbiased).
Flajolet-Martin (FM) sketch
Implementation details
x = 1215             # 0b10010111111
x + 1                # 0b10011000000
(x + 1) & ~x         # 0b00001000000  (isolates the lowest 0-bit of x, i.e., 2^(ρ(x)−1))
FM sketch in Julia (prototype)
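The prototype itself did not survive the export of these notes; the following is a minimal stand-in sketch in Julia (built-in hash as the pseudo-random hash, a single bitmap, and the correction constant φ ≈ 0.77351 from the FM paper):

# Flajolet-Martin sketch with a single bitmap (prototype-style sketch).
const PHI = 0.77351                       # FM correction constant

function fm_estimate(stream)
    bitmap = UInt64(0)
    for x in stream
        r = trailing_zeros(hash(x))       # geometric level: Pr[r = j] = 2^-(j+1)
        bitmap |= UInt64(1) << r          # mark that some item reached level r
    end
    R = trailing_ones(bitmap)             # index of the lowest 0-bit of the bitmap
    return 2.0^R / PHI
end

# usage
stream = rand(1:50_000, 1_000_000)        # roughly 50,000 distinct values
println(fm_estimate(stream))

A single bitmap gives a very noisy estimate (it can only output a power of two divided by φ); the next slides reduce the variance by combining k bitmaps.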
Flajolet-Martin (FM) sketch
Stochastic Averaging
Split the hash of each item into two parts: a substream id (which of the k bitmaps to update) and the hash value used within that substream.
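A small illustration of the split in Julia (hypothetical helper, assuming k is a power of two): the low log2(k) bits of the hash choose the substream, and the remaining bits serve as that substream's hash value.

# Stochastic averaging: one hash function, split into (substream id, hash value).
function route(x, k::Int)                 # k is assumed to be a power of two
    h = hash(x)
    substream = Int(h & UInt(k - 1)) + 1  # low log2(k) bits pick one of k substreams
    value = h >> trailing_zeros(k)        # remaining bits act as the hash value
    return substream, value
end

Each of the k substreams maintains its own FM bitmap, and the k per-substream estimates are combined (averaged); this is where the 0.78/√k behavior on the next slide comes from.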
Flajolet-Martin theorem
The estimator Z is asymptotically unbiased, i.e., E_n[Z]/n → 1, and the coefficient of variation (standard deviation over mean) using k bitmaps is approximately 0.78/√k.
Memory: O(k log n)
Example: can count cardinalities up to 10^9 with error ≤ 6% using 4 kB of memory (= 4096 bytes)**
“Caveat” of their work: Practical implementations of the hash function are not discussed.
** See also Mitzenmacher-Vadhan: Why simple hash functions work: Exploiting the entropy in a data stream
AMS sketch
Output Y = 2^R
AMS Sketch for F0
Theorem: For every c>2 there exists an algorithm that outputs an estimate Y of F0 such that the probability that the ratio between Y and F0 is not between 1/c and c is at most 2/c.
Space usage: O(log n) to store the hash functions, O(log log n) to store R.
Median trick: the failure probability can be driven down to δ using the standard median trick, i.e., repeat the process O(log(1/δ)) times and output the median. This yields an (O(1), δ)-approximation scheme.
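A sketch of the basic estimator together with the median trick in Julia (assumptions: the per-copy estimate is Y = 2^R with R the largest number of trailing zeros of h(x) seen in the stream, and Julia's seeded hash stands in for independent pairwise-independent hash functions):

using Statistics                           # for median

# One copy of the basic AMS F0 estimator: Y = 2^R.
function ams_f0(stream, seed::UInt)
    R = 0
    for x in stream
        R = max(R, trailing_zeros(hash(x, seed)))
    end
    return 2.0^R
end

# Median trick: run t independent copies and report the median estimate.
ams_f0_median(stream, t::Int) = median([ams_f0(stream, UInt(i)) for i in 1:t])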
HyperLogLog++ (HLL++) [Heule-Nunkesser-Hall]
For small cardinalities, linear counting is better than HLL; for large cardinalities, HLL is better than linear counting.
Algorithms for estimating F2
AMS sketch for F2
Algorithm (tug-of-war sketch)
  Pick a 4-wise independent hash function h: [n] → {−1, +1}.
  Maintain a counter X, initialized to 0; for each stream item e, update X ← X + h(e).
  Output X² as the estimate of F2.
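A minimal Julia sketch of this estimator (a salted built-in hash plays the role of the 4-wise independent ±1 hash; a single copy, no averaging yet):

# AMS "tug-of-war" sketch for F2: X = sum of random signs h(e) over the stream.
sign_hash(e, seed::UInt) = isodd(hash(e, seed)) ? 1 : -1   # stand-in for a 4-wise independent ±1 hash

function ams_f2(stream, seed::UInt = UInt(0))
    X = 0
    for e in stream
        X += sign_hash(e, seed)
    end
    return X^2              # estimate of F2 = sum of squared frequencies
end

The variance bound derived on the next slides is what justifies averaging several independent copies of this estimator.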
Why does this work?
Write X = Σ_i f_i Y_i, where Y_i = h(i) ∈ {−1, +1} and f_i is the frequency of item i in the stream. Then X² = Σ_i f_i² Y_i² + Σ_{i≠j} f_i f_j Y_i Y_j, and therefore, by linearity of expectation and taking the 4-wise (which implies 2-wise) independence of the variables Y_i, Y_j into account, we get E[X²] = Σ_i f_i² E[Y_i²] + Σ_{i≠j} f_i f_j E[Y_i]E[Y_j] = Σ_i f_i² = F2.
(Illustration: each element e is assigned a random sign, h(e) = +1 or h(e) = −1.)
Second moment
Now we compute the variance of X². We get
Var[X²] = E[X⁴] − (E[X²])² = E[X⁴] − F2².
We expand and bound the expectation E[X⁴] as follows:
E[X⁴] = Σ_i f_i⁴ E[Y_i⁴] + 6 Σ_{i<j} f_i² f_j² E[Y_i² Y_j²] = Σ_i f_i⁴ + 6 Σ_{i<j} f_i² f_j² ≤ 3 (Σ_i f_i²)² = 3 F2²,
where every term containing some Y_i raised to an odd power vanishes by the 4-wise independence. Hence Var[X²] ≤ 2 F2².
Reminder
Theorem: Let X be an unbiased estimator of a quantity Q. Let {X_ij : 1 ≤ i ≤ t, 1 ≤ j ≤ s} be a collection of independent RVs with each X_ij distributed identically to X, where
t = O(log(1/δ)) and s = O(Var[X] / (ε²·E[X]²)).
Let Z = median_{1≤i≤t} ( (1/s)·Σ_{j=1}^{s} X_ij ). Then, Pr[ |Z − Q| > εQ ] ≤ δ.
By using the median-of-means improvement we obtain that we can get an (ε, δ)-approximation for F2 using O( (1/ε²)·log(1/δ)·(log n + log m) ) bits of space.
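For completeness, a generic median-of-means combiner in Julia (a sketch; est is any zero-argument function returning a fresh independent estimate, e.g. a closure around ams_f2 above with a random seed):

using Statistics                           # mean, median

# Median of means: t groups of s independent estimates;
# average inside each group, then take the median across groups.
function median_of_means(est, s::Int, t::Int)
    group_means = [mean(est() for _ in 1:s) for _ in 1:t]
    return median(group_means)
end

# usage: median_of_means(() -> ams_f2(stream, rand(UInt)), 100, 10)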
Readings