Probability
Bloom Filters
Shreya Jayaraman and Luxi Wang
Alex Tsun
Agenda
Data Structures
Data Structures
Deterministic Data Structures
1 | 2 | 3 | 4 |
4
6
2
1
3
5
7
Key2
2
Value2
Key1
1
Value1
Key0
0
Value0
arrays/lists
maps/dictionaries
trees
Probabilistic Data Structures
Probabilistic Data Structures
Random Picture
Definition of
Expectation
Hashing and Probabilities
Bloom Filters
Bloom Filters
Hashing (Motivation):
If we want to store elements in some
structure like a list/array, checking
whether an element is in it requires
searching the entire array. Can we do
better?
Hashing
Hashing
Hashing
0 | 1 | 2 | 3 | 4 | 5 |
| | | | | |
Hashing
0 | 1 | 2 | 3 | 4 | 5 |
“hello” | | | “bye” | | |
Hashing
0 | 1 | 2 | 3 | 4 | 5 |
“hello” | | | “bye” | | |
Hashing
0 | 1 | 2 | 3 | 4 | 5 |
“hello” | | | “bye” | | |
Hashing
0 | 1 | 2 | 3 | 4 | 5 |
“hello” | | | “bye” | | |
Bloom Filters
Bloom Filters: Motivation
Bloom Filters: Motivation
Bloom Filters: Motivation
Bloom Filters: Motivation
Bloom Filters
Bloom Filters
Bloom Filters
Bloom Filters
Hash Tables
Bloom Filters
Bloom Filters: Initialization
Size of bloom filter
Number of hash functions
for each hash function, initialize an empty bit vector of size m
Bloom Filters: Example
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 0 | 0 | 0 |
t2 | 0 | 0 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 0 |
bloom filter t of length m = 5 that uses k = 3 hash functions
Bloom Filters: Add
for each hash function hi
hi(x) → result of hash function hi on x
Bloom Filters: Add
for each hash function hi
Index into ith bit-vector, at index produced by hash function and set to 1
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“thisisavirus.com”)
h1(“thisisavirus.com”) → 2
h2(“thisisavirus.com”) → 1
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 0 | 0 | 0 |
t2 | 0 | 0 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 0 |
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“thisisavirus.com”)
h1(“thisisavirus.com”) → 2
h2(“thisisavirus.com”) → 1
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 0 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 0 |
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“thisisavirus.com”)
h2(“thisisavirus.com”) → 1
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 0 |
h1(“thisisavirus.com”) → 2
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“thisisavirus.com”)
h1(“thisisavirus.com”) → 2
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
h2(“thisisavirus.com”) → 1
Bloom Filters: Contains
Returns True if the bit vector for each hash function has bit 1 at index determined by that hash function, otherwise returns False
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
contains(“thisisavirus.com”)
h1(“thisisavirus.com”) → 2
h2(“thisisavirus.com”) → 1
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
True
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
contains(“thisisavirus.com”)
h2(“thisisavirus.com”) → 1
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
True
True
h1(“thisisavirus.com”) → 2
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
contains(“thisisavirus.com”)
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
True
True
True
h2(“thisisavirus.com”) → 1
h1(“thisisavirus.com”) → 2
Bloom Filters: Example
bloom filter t of length m = 5 that uses k = 3 hash functions
contains(“thisisavirus.com”)
h3(“thisisavirus.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
True
True
True
Since all conditions satisfied, returns True (correctly)
h2(“thisisavirus.com”) → 1
h1(“thisisavirus.com”) → 2
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“totallynotsuspicious.com”)
h1(“totallynotsuspicious.com”) → 1
h2(“totallynotsuspicious.com”) → 0
h3(“totallynotsuspicious.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 0 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“totallynotsuspicious.com”)
h1(“totallynotsuspicious.com”) → 1
h2(“totallynotsuspicious.com”) → 0
h3(“totallynotsuspicious.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 0 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“totallynotsuspicious.com”)
h1(“totallynotsuspicious.com”) → 1
h2(“totallynotsuspicious.com”) → 0
h3(“totallynotsuspicious.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“totallynotsuspicious.com”)
h1(“totallynotsuspicious.com”) → 1
h2(“totallynotsuspicious.com”) → 0
h3(“totallynotsuspicious.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Collision, is already set to 1
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
add(“totallynotsuspicious.com”)
h1(“totallynotsuspicious.com”) → 1
h2(“totallynotsuspicious.com”) → 0
h3(“totallynotsuspicious.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
contains(“verynormalsite.com”)
h1(“verynormalsite.com”) → 2
h2(“verynormalsite.com”) → 0
h3(“verynormalsite.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
True
contains(“verynormalsite.com”)
h1(“verynormalsite.com”) → 2
h2(“verynormalsite.com”) → 0
h3(“verynormalsite.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
True
True
contains(“verynormalsite.com”)
h2(“verynormalsite.com”) → 0
h3(“verynormalsite.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
h1(“verynormalsite.com”) → 2
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
True
True
True
contains(“verynormalsite.com”)
h3(“verynormalsite.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
h2(“verynormalsite.com”) → 0
h1(“verynormalsite.com”) → 2
Bloom Filters: False Positives
bloom filter t of length m = 5 that uses k = 3 hash functions
True
True
True
Since all conditions satisfied, returns True (incorrectly)
contains(“verynormalsite.com”)
h3(“verynormalsite.com”) → 4
Index → | 0 | 1 | 2 | 3 | 4 |
t1 | 0 | 1 | 1 | 0 | 0 |
t2 | 1 | 1 | 0 | 0 | 0 |
t3 | 0 | 0 | 0 | 0 | 1 |
Bloom Filters
h2(“verynormalsite.com”) → 0
h1(“verynormalsite.com”) → 2
Bloom Filters: Summary
END PIC
Alex Tsun