1 of 52

Probability

Bloom Filters

Shreya Jayaraman and Luxi Wang

Alex Tsun

2 of 52

Agenda

  • Deterministic Data Structures
  • Probabilistic Data Structures
  • Bloom Filters

3 of 52

Data Structures

4 of 52

Data Structures

  • Handle data organization, management, and storage
  • Collection of data, relationships between data, and operations that manipulate the data

5 of 52

Deterministic Data Structures

  • The ones you’re likely familiar with
  • Starting with an empty data structure and applying a series of operations (insert, delete, etc.) gives the same resulting structure

1

2

3

4

4

6

2

1

3

5

7

Key2

2

Value2

Key1

1

Value1

Key0

0

Value0

arrays/lists

maps/dictionaries

trees

6 of 52

Probabilistic Data Structures

  • Data structures that rely on a random number generator, or have some aspect of their output probabilistic
  • This makes their output after a series of operations unpredictable

7 of 52

Probabilistic Data Structures

  • Data structures that rely on a random number generator, or have some aspect of their output probabilistic
  • This makes their output after a series of operations unpredictable
  • Why use them?
    • Reduce time complexity (in expectation)
    • Reduce space complexity (in expectation)
    • Reduce “code complexity” (meaning “easier to understand” or “easier to code up”)

8 of 52

Random Picture

Definition of

Expectation

9 of 52

Hashing and Probabilities

10 of 52

Bloom Filters

  • Probabilistic implementation of sets (unordered collection of unique objects)
  • Rely on hashing

11 of 52

Bloom Filters

  • Probabilistic implementation of sets (unordered collection of unique objects)
  • Rely on hashing

Hashing (Motivation):

If we want to store elements in some

structure like a list/array, checking

whether an element is in it requires

searching the entire array. Can we do

better?

12 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.

13 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

14 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

0

1

2

3

4

5

15 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

0

1

2

3

4

5

“hello”

“bye”

16 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

  • To insert or find a string, just use h to determine its location!

0

1

2

3

4

5

“hello”

“bye”

17 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

  • To insert or find a string, just use h to determine its location!
  • Want to have minimal collisions (multiple strings hashing to same location.

0

1

2

3

4

5

“hello”

“bye”

18 of 52

Hashing

  • We want to store the strings “hello” and “bye” in a structure, and have quick insert AND lookup times.
  • Suppose we have a “magical” function h that maps strings to indices (integers). e.g. h(“bye”) = 3, h(“hello”) = 0

  • To insert or find a string, just use h to determine its location!
  • Want to have minimal collisions (multiple strings hashing to same location.
  • Have our hash function h behave “uniformly at random”, but consistent.

0

1

2

3

4

5

“hello”

“bye”

19 of 52

Bloom Filters

20 of 52

Bloom Filters: Motivation

  • Google Chrome has a database of malicious URLs, but it takes a long time to query.

21 of 52

Bloom Filters: Motivation

  • Google Chrome has a database of malicious URLs, but it takes a long time to query.
  • Want an in-browser structure, so needs to be efficient and be space-efficient

22 of 52

Bloom Filters: Motivation

  • Google Chrome has a database of malicious URLs, but it takes a long time to query.
  • Want an in-browser structure, so needs to be efficient and be space-efficient
  • Want it so that can check if a URL is in structure:
    • If return False, then definitely not in the structure (don’t need to do expensive database lookup, website is safe)

23 of 52

Bloom Filters: Motivation

  • Google Chrome has a database of malicious URLs, but it takes a long time to query.
  • Want an in-browser structure, so needs to be efficient and be space-efficient
  • Want it so that can check if a URL is in structure:
    • If return False, then definitely not in the structure (don’t need to do expensive database lookup, website is safe)
    • If return True, the URL may or may not be in the structure. Have to perform expensive lookup in this rare case.

24 of 52

Bloom Filters

  • Supports two key operations:
  • add(x) - adds x to bloom filter
  • contains(x) - returns true if x in bloom filter, otherwise returns false
    1. If return false, definitely not in structure.
    2. If return true, possibly in the structure (some false positives).

25 of 52

Bloom Filters

  • Supports two key operations:
  • add(x) - adds x to bloom filter
  • contains(x) - returns true if x in bloom filter, otherwise returns false
    • If return false, definitely not in structure.
    • If return true, possibly in the structure (some false positives).

  • Does not support the following operations:
  • removing an element from bloom filter
  • giving a collection of elements in the structure

26 of 52

Bloom Filters

  • Why accept false positives?
    • Speed - add and contains have good time complexity
    • Space - doesn’t require much space relative to items to be stored

27 of 52

Bloom Filters

Hash Tables

  • These store the data itself
  • Therefore, they need at least as much space as all the data being stored
  • E.g. in the case of malicious URLS, store strings

Bloom Filters

  • These don’t store the data itself
  • They only store bit arrays - 0/1s - bits are very small and use much less space

28 of 52

Bloom Filters: Initialization

Size of bloom filter

Number of hash functions

for each hash function, initialize an empty bit vector of size m

29 of 52

Bloom Filters: Example

Index →

0

1

2

3

4

t1

0

0

0

0

0

t2

0

0

0

0

0

t3

0

0

0

0

0

bloom filter t of length m = 5 that uses k = 3 hash functions

30 of 52

Bloom Filters: Add

for each hash function hi

hi(x) → result of hash function hi on x

31 of 52

Bloom Filters: Add

for each hash function hi

Index into ith bit-vector, at index produced by hash function and set to 1

32 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“thisisavirus.com”)

h1(“thisisavirus.com”) → 2

h2(“thisisavirus.com”) → 1

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

0

0

0

t2

0

0

0

0

0

t3

0

0

0

0

0

33 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“thisisavirus.com”)

h1(“thisisavirus.com”) → 2

h2(“thisisavirus.com”) → 1

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

0

0

0

0

t3

0

0

0

0

0

34 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“thisisavirus.com”)

h2(“thisisavirus.com”) → 1

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

0

h1(“thisisavirus.com”) → 2

35 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“thisisavirus.com”)

h1(“thisisavirus.com”) → 2

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

h2(“thisisavirus.com”) → 1

36 of 52

Bloom Filters: Contains

Returns True if the bit vector for each hash function has bit 1 at index determined by that hash function, otherwise returns False

37 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

contains(“thisisavirus.com”)

h1(“thisisavirus.com”) → 2

h2(“thisisavirus.com”) → 1

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

True

38 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

contains(“thisisavirus.com”)

h2(“thisisavirus.com”) → 1

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

True

True

h1(“thisisavirus.com”) → 2

39 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

contains(“thisisavirus.com”)

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

True

True

True

h2(“thisisavirus.com”) → 1

h1(“thisisavirus.com”) → 2

40 of 52

Bloom Filters: Example

bloom filter t of length m = 5 that uses k = 3 hash functions

contains(“thisisavirus.com”)

h3(“thisisavirus.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

True

True

True

Since all conditions satisfied, returns True (correctly)

h2(“thisisavirus.com”) → 1

h1(“thisisavirus.com”) → 2

41 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“totallynotsuspicious.com”)

h1(“totallynotsuspicious.com”) → 1

h2(“totallynotsuspicious.com”) → 0

h3(“totallynotsuspicious.com”) → 4

Index →

0

1

2

3

4

t1

0

0

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

42 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“totallynotsuspicious.com”)

h1(“totallynotsuspicious.com”) → 1

h2(“totallynotsuspicious.com”) → 0

h3(“totallynotsuspicious.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

0

1

0

0

0

t3

0

0

0

0

1

43 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“totallynotsuspicious.com”)

h1(“totallynotsuspicious.com”) → 1

h2(“totallynotsuspicious.com”) → 0

h3(“totallynotsuspicious.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

44 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“totallynotsuspicious.com”)

h1(“totallynotsuspicious.com”) → 1

h2(“totallynotsuspicious.com”) → 0

h3(“totallynotsuspicious.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

Collision, is already set to 1

45 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

add(“totallynotsuspicious.com”)

h1(“totallynotsuspicious.com”) → 1

h2(“totallynotsuspicious.com”) → 0

h3(“totallynotsuspicious.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

46 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

contains(“verynormalsite.com”)

h1(“verynormalsite.com”) → 2

h2(“verynormalsite.com”) → 0

h3(“verynormalsite.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

47 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

True

contains(“verynormalsite.com”)

h1(“verynormalsite.com”) → 2

h2(“verynormalsite.com”) → 0

h3(“verynormalsite.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

48 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

True

True

contains(“verynormalsite.com”)

h2(“verynormalsite.com”) → 0

h3(“verynormalsite.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

h1(“verynormalsite.com”) → 2

49 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

True

True

True

contains(“verynormalsite.com”)

h3(“verynormalsite.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

h2(“verynormalsite.com”) → 0

h1(“verynormalsite.com”) → 2

50 of 52

Bloom Filters: False Positives

bloom filter t of length m = 5 that uses k = 3 hash functions

True

True

True

Since all conditions satisfied, returns True (incorrectly)

contains(“verynormalsite.com”)

h3(“verynormalsite.com”) → 4

Index →

0

1

2

3

4

t1

0

1

1

0

0

t2

1

1

0

0

0

t3

0

0

0

0

1

Bloom Filters

h2(“verynormalsite.com”) → 0

h1(“verynormalsite.com”) → 2

51 of 52

Bloom Filters: Summary

  • An empty bloom filter is an empty k x m bit matrix with all values initialized to zeros
    • k = number of hash functions
    • m = size of bloom filter
  • add(x) runs in O(k) time
  • contains(x) runs in O(k) time
  • requires O(km) space
  • Probability of false positives from collisions can be reduced by increasing the size of the bloom filter

52 of 52

END PIC

Alex Tsun