
Hashing

CS202 - Fundamental Structures of Computer Science II


Initially prepared by Dr. İlyas Çiçekli; improved by various Bilkent CS202 instructors.


Hashing

  • Using balanced search trees (2-3, 2-3-4, red-black, and AVL trees), we implement table operations in O(log N) time
    • retrieval, insertion, and deletion

  • Can we find a data structure so that we can perform these table operations even faster (e.g., in O(1) time)?
    • Hash Tables


Hash Tables

  • In hash tables, we have

    • An array (index ranges 0 … n – 1) and
      • Each array location is called a bucket

    • An address calculator (hash function), which maps a search key into an array index between 0 … n – 1


Hash Function -- Address Calculator


[Figure: the hash function acts as an address calculator, mapping each search key to a bucket of the hash table]

Hashing

  • A hash function tells us where to place an item in an array called a hash table.
    • This method is known as hashing.

  • A hash function maps a search key into an integer between 0 and n – 1.
    • We can have different hash functions.
    • The hash function depends on the key type (int, string, ...)
      • E.g., h(x) = x mod n, where x is an integer


Collisions

  • A perfect hash function maps each search key into a unique location of the hash table.
    • A perfect hash function is possible if we know all search keys in advance.
    • In practice we do not know all search keys, and thus, a hash function can map more than one key into the same location.

  • Collisions occur when a hash function maps more than one item into the same array location.
    • We have to resolve the collisions using a certain mechanism.


Hash Functions

  • We can design different hash functions.

  • But a good hash function should
    • be easy and fast to compute
    • place items uniformly (evenly) throughout the hash table.

  • We will consider only integer hash functions
    • On a computer, everything is represented with bits.
    • They can be converted into integers.
      • 1001010101001010000110….


Everything is an Integer

  • If search keys are strings, think of them as integers, and apply a hash function for integers.

  • For example, strings can be encoded using ASCII codes of characters.

  • Consider the string “NOTE”
    • ASCII codes: N = 4Eh (01001110), O = 4Fh (01001111), T = 54h (01010100), E = 45h (01000101)
    • Concatenate the four binary numbers to get a new binary number:

01001110 01001111 01010100 01000101 = 4E4F5445h = 1313821765
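This concatenation is just base-256 packing of the 8-bit character codes; a minimal sketch (the function name `packKey` is ours, not from the slides):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack up to 4 characters into one integer by concatenating their
// 8-bit codes, exactly as in the "NOTE" example above.
uint32_t packKey(const std::string &key) {
    uint32_t value = 0;
    for (unsigned char c : key)
        value = (value << 8) | c;   // shift left 8 bits, append next code
    return value;
}
```

For "NOTE" this reproduces 4E4F5445h = 1313821765.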


How to Design a Hash Function?

  • Three possibilities
    • Selecting digits
    • Folding
    • Modular Arithmetic
  • Or, their combinations


Hash Functions -- Selecting Digits

  • Select certain digits and combine them to create the address.

  • For example, suppose that we have nine-digit ID numbers
    • Define a hash function that selects the 2nd and 5th most significant digits:

h(033475678) = 37

h(023455678) = 25

    • Define the table size as 100

  • Is this a good hash function?
    • No, since it does not place items uniformly.


Hash Functions -- Folding

  • Folding – selecting all digits and adding them.

  • For example, consider the previous nine-digit numbers
    • Define a hash function that selects all digits and adds them

h(033475678) = 0 + 3 + 3 + 4 + 7 + 5 + 6 + 7 + 8 = 43

h(023455678) = 0 + 2 + 3 + 4 + 5 + 5 + 6 + 7 + 8 = 40

    • Define the table size as 82

  • We can select a group of digits and add the digits in this group as well.


Hash Functions -- Modular Arithmetic

  • Modular arithmetic – provides a simple and effective hash function.

h(x) = x mod tableSize

  • The table size should be a prime number.
    • Why? Think about it.

  • We will use modular arithmetic as our hash function in the rest of our discussions.


Why Primes?

  • Assume you hash the following with x mod 8:
    • 64, 100, 128, 200, 300, 400, 500

Hashing these keys with x mod 8 fills only buckets 0 and 4:

  bucket 0: 64, 128, 200, 400
  bucket 4: 100, 300, 500
  buckets 1, 2, 3, 5, 6, 7: empty

Why Primes?

  • Now try it with x mod 7
    • 64, 100, 128, 200, 300, 400, 500

With x mod 7, the keys spread across the table:

  bucket 0: (empty)
  bucket 1: 64, 400
  bucket 2: 100, 128
  bucket 3: 500
  bucket 4: 200
  bucket 5: (empty)
  bucket 6: 300

Rationale

  • If we insert keys a1, a2, …, an into a table of size m using h(x) = x mod m
    • All keys hash into buckets that are multiples of

gcd(a1, a2, …, an, m)

    • For example, if we add 64, 100, 128, 200, 300, 400, 500 to a table of size 8, all keys hash to 0 or 4:

gcd(64, 100, 128, 200, 300, 400, 500, 8) = 4

    • When m is a prime, gcd(a1, a2, …, an, m) = 1 (unless every key is a multiple of m, which is rare), so the keys can hash anywhere:

gcd(64, 100, 128, 200, 300, 400, 500, 7) = 1
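The gcd argument can be checked directly; a quick sketch (helper names are ours):

```cpp
#include <cassert>
#include <numeric>   // std::gcd (C++17)
#include <set>
#include <vector>

// gcd of all keys together with the table size m.
int gcdAll(const std::vector<int> &keys, int m) {
    int g = m;
    for (int k : keys) g = std::gcd(g, k);
    return g;
}

// Buckets actually used when hashing the keys with h(x) = x mod m.
std::set<int> usedBuckets(const std::vector<int> &keys, int m) {
    std::set<int> buckets;
    for (int k : keys) buckets.insert(k % m);
    return buckets;
}
```

With the slide's keys, `gcdAll(..., 8)` is 4 and only buckets 0 and 4 are used; with m = 7 the gcd is 1 and five different buckets are hit.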


Collision Resolution

  • Collision resolution – two general approaches
    • Open addressing: each entry holds one item
    • Chaining: each entry can hold more than one item (buckets hold a certain number of items)


Open Addressing

  • Open addressing – probes for some other empty location when a collision occurs.

  • Probe sequence: sequence of examined locations. Different open-addressing schemes:
    • Linear Probing
    • Quadratic Probing
    • Double Hashing


Open Addressing -- Linear Probing

  • Linear probing: search the table sequentially, starting from the original hash location.
    • If a location is occupied, check the next location.
    • Wrap around from the last table location to the first.


Linear Probing -- Example

  • Example:
    • Table size is 11 (0..10)
    • Hash function: h(x) = x mod 11
    • Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
      • 20 mod 11 = 9
      • 30 mod 11 = 8
      • 2 mod 11 = 2
      • 13 mod 11 = 2 → 2+1 = 3
      • 25 mod 11 = 3 → 3+1 = 4
      • 24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5
      • 10 mod 11 = 10
      • 9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0


  Index:  0  1  2  3   4   5   6  7  8   9   10
  Key:    9  –  2  13  25  24  –  –  30  20  10
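A minimal sketch of the insertion loop above (a table of ints with −1 marking an empty slot; the function names are ours, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Insert key into an open-addressing table using linear probing.
// Returns the index where the key was placed, or -1 if the table is full.
int linearInsert(std::vector<int> &table, int key) {
    int n = (int)table.size();
    int home = key % n;
    for (int i = 0; i < n; i++) {              // probe h, h+1, h+2, ... mod n
        int pos = (home + i) % n;
        if (table[pos] == -1) { table[pos] = key; return pos; }
    }
    return -1;                                 // table full
}

// Build a table of size n from a list of keys, as in the example.
std::vector<int> buildTable(const std::vector<int> &keys, int n) {
    std::vector<int> t(n, -1);
    for (int k : keys) linearInsert(t, k);
    return t;
}
```

Inserting the example keys 20, 30, 2, 13, 25, 24, 10, 9 into an 11-slot table reproduces the layout shown above.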


Linear Probing -- Clustering Problem

  • One problem with linear probing is that table items tend to cluster together in the hash table,
    • i.e., the table contains groups of consecutively occupied locations.

  • This phenomenon is called primary clustering.
    • Clusters can get close to one another and merge into a larger cluster.
    • Thus, one part of the table might be quite dense, even though another part has relatively few items.
    • Primary clustering causes long probe searches and therefore decreases overall efficiency.


Open Addressing -- Quadratic Probing

  • Quadratic probing: almost eliminates the clustering problem

  • Approach:
    • Start from the original hash location i
    • If the location is occupied, check locations i + 1², i + 2², i + 3², i + 4², ...
    • Wrap around the table, if necessary.


Quadratic Probing -- Example

  • Example:
    • Table size is 11 (0..10)
    • Hash function: h(x) = x mod 11
    • Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
      • 20 mod 11 = 9
      • 30 mod 11 = 8
      • 2 mod 11 = 2
      • 13 mod 11 = 2 → 2 + 1² = 3
      • 25 mod 11 = 3 → 3 + 1² = 4
      • 24 mod 11 = 2 → 2 + 1², 2 + 2² = 6
      • 10 mod 11 = 10
      • 9 mod 11 = 9 → 9 + 1², (9 + 2²) mod 11, (9 + 3²) mod 11 = 7


  Index:  0  1  2  3   4   5  6   7  8   9   10
  Key:    –  –  2  13  25  –  24  9  30  20  10

Open Addressing -- Double Hashing

  • Double hashing also reduces clustering.
  • Idea: increment using a second hash function h2, which must satisfy

h2(key) ≠ 0

h2 ≠ h1

  • Probe the following locations until an unoccupied place is found:

h1(key)

h1(key) + h2(key)

h1(key) + 2·h2(key)

... (all taken mod the table size)


Double Hashing -- Example

  • Example:
    • Table size is 11 (0..10)
    • Hash functions:

h1(x) = x mod 11

h2(x) = 7 – (x mod 7)

    • Insert keys: 58, 14, 91
      • 58 mod 11 = 3
      • 14 mod 11 = 3 → 3 + 7 = 10
      • 91 mod 11 = 3 → 3 + 7, (3 + 2·7) mod 11 = 6


  Index:  0  1  2  3   4  5  6   7  8  9  10
  Key:    –  –  –  58  –  –  91  –  –  –  14
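The double-hashing walk-through above can be sketched as follows (function names are ours; −1 marks an empty slot):

```cpp
#include <cassert>
#include <vector>

// Double-hashing insertion with the example's hash functions
// h1(x) = x mod 11 and h2(x) = 7 - (x mod 7).
int doubleHashInsert(std::vector<int> &table, int key) {
    int n = (int)table.size();
    int h1 = key % n;
    int h2 = 7 - (key % 7);
    for (int i = 0; i < n; i++) {          // probe h1, h1+h2, h1+2*h2, ... (mod n)
        int pos = (h1 + i * h2) % n;
        if (table[pos] == -1) { table[pos] = key; return pos; }
    }
    return -1;                             // no empty slot reached
}

// Build the table from the example: insert 58, 14, 91 into an 11-slot table.
std::vector<int> demoTable() {
    std::vector<int> t(11, -1);
    for (int k : {58, 14, 91}) doubleHashInsert(t, k);
    return t;
}
```

Note that this probe sequence is only guaranteed to visit every slot when the table size is prime, as it is here.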


Open Addressing -- Retrieval & Deletion

  • Retrieving an item with a given key:
    • (same probe sequence as insertion): probe locations until we find the desired item or we reach an empty location.

  • Deletions in open addressing cause complications
    • We CANNOT simply empty an item's location: the new empty location would stop a later retrieval prematurely, incorrectly indicating a failure.
    • Solution: distinguish three kinds of locations in the hash table: Occupied, Empty, Deleted.
    • A deleted location is treated as an occupied location during retrieval, so probing continues past it.
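One way to realize the three-state scheme with linear probing (a sketch; all names are ours):

```cpp
#include <cassert>
#include <vector>

enum class Slot { Empty, Occupied, Deleted };

struct Entry { Slot state = Slot::Empty; int key = 0; };

struct ProbeTable {
    std::vector<Entry> t;
    explicit ProbeTable(int n) : t(n) {}
    int h(int key) const { return key % (int)t.size(); }

    void insert(int key) {
        int pos = h(key);
        for (int i = 0; i < (int)t.size(); i++) {
            // a Deleted slot may be reused, just like an Empty one
            if (t[pos].state != Slot::Occupied) { t[pos] = {Slot::Occupied, key}; return; }
            pos = (pos + 1) % (int)t.size();
        }
    }
    bool search(int key) const {
        int pos = h(key);
        for (int i = 0; i < (int)t.size(); i++) {
            if (t[pos].state == Slot::Empty) return false;   // only Empty stops the probe
            if (t[pos].state == Slot::Occupied && t[pos].key == key) return true;
            pos = (pos + 1) % (int)t.size();                 // Deleted: keep probing
        }
        return false;
    }
    void remove(int key) {
        int pos = h(key);
        for (int i = 0; i < (int)t.size(); i++) {
            if (t[pos].state == Slot::Empty) return;
            if (t[pos].state == Slot::Occupied && t[pos].key == key) {
                t[pos].state = Slot::Deleted;                // mark, don't empty
                return;
            }
            pos = (pos + 1) % (int)t.size();
        }
    }
};

// 2 and 13 collide at slot 2; after deleting 2, 13 must still be found.
bool demo() {
    ProbeTable p(11);
    p.insert(2); p.insert(13);
    p.remove(2);
    return p.search(13);
}
```

If `remove` emptied the slot instead of marking it Deleted, `search(13)` would stop at slot 2 and wrongly report failure.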


Separate Chaining

  • Another way to resolve collisions is to change the structure of the hash table.
    • In open addressing, each location holds only one item.
  • Idea 1: each location is itself an array, called a bucket
    • Store the items that hash into the same location in this array.
    • Problem: what should the size of a bucket be?
  • Idea 2: each location is itself a linked list – known as separate chaining.
    • Each entry of the hash table is a pointer to a linked list (the chain) of the items that the hash function has mapped into that location.
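A minimal separate-chaining sketch (names are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Separate chaining: each table entry is a linked list of keys.
struct ChainTable {
    std::vector<std::list<int>> buckets;
    explicit ChainTable(int n) : buckets(n) {}
    int h(int key) const { return key % (int)buckets.size(); }

    void insert(int key) { buckets[h(key)].push_front(key); }
    bool search(int key) const {
        const auto &b = buckets[h(key)];
        return std::find(b.begin(), b.end(), key) != b.end();
    }
    void remove(int key) { buckets[h(key)].remove(key); }  // deletion is easy
};

// 2, 13, 24 all hash to bucket 2 (mod 11) and share one chain.
bool chainDemo() {
    ChainTable t(11);
    for (int k : {20, 30, 2, 13, 24}) t.insert(k);
    t.remove(13);
    return t.search(24) && !t.search(13);
}
```

Unlike open addressing, removing a key here is just a linked-list deletion, with no tombstone bookkeeping.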


Separate Chaining

[Figure: each entry of the hash table points to a linked list (chain) of the items hashed to that location]

Hashing Analysis


Hashing -- Analysis

  • An analysis of the average-case efficiency of hashing involves the load factor α:

α = (current number of items) / tableSize

  • α measures how full a hash table is.
    • Hash table should not be too loaded if we want to get better performance from hashing.
  • In average case analyses, we assume that the hash function uniformly distributes keys in the hash table.
  • Unsuccessful searches generally require more time than successful searches.


Separate Chaining -- Analysis

  • Separate chaining – approximate average number of comparisons (probes) that a search requires:

    1 + α/2        for a successful search
    α              for an unsuccessful search

  • It is the most efficient collision resolution scheme.
  • But it requires more storage (it needs storage for the pointers).
  • Deletion is easy – just remove a node from a linked list. Deletion is more difficult in open addressing.


  • Linear probing – approximate average number of comparisons (probes) that a search requires:

    (1/2) (1 + 1/(1 − α))        for a successful search
    (1/2) (1 + 1/(1 − α)²)       for an unsuccessful search

  • Insert and search cost depend on the length of the cluster.
    • Average length of a cluster = α = N / M.
    • Worst case: all keys hash to the same cluster.


Linear Probing -- Analysis

  • Linear probing – approximate average number of comparisons (probes) that a search requires:

    (1/2) (1 + 1/(1 − α))        for a successful search
    (1/2) (1 + 1/(1 − α)²)       for an unsuccessful search

  • As the load factor increases, the number of collisions increases, causing increased search times.
  • To maintain efficiency, it is important to prevent the hash table from filling up.


Linear Probing -- Analysis

Example: find the average number of probes for a successful search and an unsuccessful search in this hash table (the table built with linear probing above). Hash function: h(x) = x mod 11.

Successful search – try 20, 30, 2, 13, 25, 24, 10, 9:

  20: 9          30: 8            2: 2       13: 2, 3
  25: 3, 4       24: 2, 3, 4, 5   10: 10     9: 9, 10, 0

Average number of probes = (1+1+1+2+2+4+1+3)/8 ≈ 1.9

Unsuccessful search – try keys hashing to each slot, e.g. 0, 1, 35, 3, 4, 5, 6, 7, 8, 31, 32:

  0: 0, 1          1: 1           35: 2, 3, 4, 5, 6
  3: 3, 4, 5, 6    4: 4, 5, 6     5: 5, 6
  6: 6             7: 7           8: 8, 9, 10, 0, 1
  31: 9, 10, 0, 1  32: 10, 0, 1

Average number of probes = (2+1+5+4+3+2+1+1+5+4+3)/11 ≈ 2.8

  Index:  0  1  2  3   4   5   6  7  8   9   10
  Key:    9  –  2  13  25  24  –  –  30  20  10

Quadratic Probing & Double Hashing -- Analysis

  • The approximate average number of comparisons (probes) that a search requires:

    (−ln(1 − α)) / α       for a successful search
    1 / (1 − α)            for an unsuccessful search

  • On average, both methods require fewer comparisons than linear probing.


The relative efficiency of four collision-resolution methods


What Constitutes a Good Hash Function

  • A hash function should be easy and fast to compute.

  • A hash function should scatter the data evenly throughout the hash table.
    • How well does the hash function scatter random data?
    • How well does the hash function scatter non-random data?

  • Two general principles :
    • The hash function should use the entire key in the calculation.
    • If a hash function uses modulo arithmetic, the table size should be prime.


Example: Hash Functions for Strings


Hash Function 1

  • Add up the ASCII values of all characters of the key.


int hash(const string &key, int tableSize)
{
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal += key[i];
    return hashVal % tableSize;
}

  • Simple to implement and fast.
  • However, if the table size is large, the function does not distribute the keys well.
    • e.g., table size = 10000, key length ≤ 8: the hash function can produce values only between 0 and 1016 (8 × 127)


Hash Function 2

  • Examine only the first 3 characters of the key.


int hash(const string &key, int tableSize)
{
    return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;
}

  • In theory, 26 × 26 × 26 = 17,576 different combinations can be generated. However, English is not random; only 2,851 different combinations actually occur.
  • Thus this function, although easily computable, is also not appropriate if the hash table is reasonably large.


Hash Function 3


int hash(const string &key, int tableSize)
{
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = 37 * hashVal + key[i];

    hashVal %= tableSize;
    if (hashVal < 0)      // in case overflow occurs
        hashVal += tableSize;
    return hashVal;
}


Hash function for strings:

key = "ali"   (KeySize = 3)

ASCII codes: key[0] = 'a' = 97, key[1] = 'l' = 108, key[2] = 'i' = 105

hash("ali") = (105·1 + 108·37 + 97·37²) % 10,007 = 6803

so "ali" is stored at index 6803 of the table (indices 0 … 10,006; TableSize = 10,007).


Hash Table versus Search Trees

  • In most of the operations, the hash table performs better than search trees.

  • However, traversing the data in a hash table in sorted order is very difficult.
    • For such operations, a hash table is not a good choice (e.g., finding all items in a certain range).


Performance

  • With either chaining or open addressing:
    • Search - O(1) expected, O(n) worst case
    • Insert - O(1) expected, O(n) worst case
    • Delete - O(1) expected, O(n) worst case
    • Min, Max, Predecessor, Successor – O(n + m) expected and worst case (m: table size)
  • Pragmatically, a hash table is often the best data structure to maintain a dictionary/table. However, the worst-case time is unpredictable.
  • The best worst-case bounds come from balanced binary trees.


Other applications of hash tables

  • To implement Table ADT, Dictionary ADT
  • Compilers
  • Spelling checkers
  • Games
  • Substring Pattern Matching
  • Searching
  • Document comparison


Applications of Hashing

  • Compilers use hash tables to implement the symbol table (a data structure to keep track of declared variables).
  • Game programs use hash tables to keep track of positions they have encountered (transposition tables)
  • Online spelling checkers.


Substring Pattern Matching

  • Input: a text string t and a pattern string p.
  • Problem: does t contain the pattern p as a substring, and if so, where?
  • e.g., is "Bilkent" in the news?
  • Brute force: overlay the pattern string at every position in the text → O(mn) (m: size of pattern, n: size of text)
  • Via hashing: compare a hash of the pattern p with the hash of the m-character substring starting at each position i of t → O(n) expected
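A sketch of the hashing approach as a Rabin-Karp-style rolling hash (the base and modulus choices, and the function name, are ours):

```cpp
#include <cassert>
#include <string>

// Substring search via hashing: compare rolling hash values first,
// then verify a match directly to rule out collisions.
// Returns the first match index, or -1 if p does not occur in t.
int hashSearch(const std::string &t, const std::string &p) {
    const long long B = 256, M = 1000000007;    // base and modulus (our choice)
    int n = (int)t.size(), m = (int)p.size();
    if (m == 0 || m > n) return -1;
    long long hp = 0, ht = 0, pow = 1;          // pow = B^(m-1) mod M
    for (int i = 0; i < m; i++) {
        hp = (hp * B + p[i]) % M;
        ht = (ht * B + t[i]) % M;
        if (i < m - 1) pow = pow * B % M;
    }
    for (int i = 0; ; i++) {
        if (ht == hp && t.compare(i, m, p) == 0) return i;   // verify
        if (i + m >= n) return -1;
        // roll the window: drop t[i], append t[i+m]
        ht = ((ht - t[i] * pow % M + M) % M * B + t[i + m]) % M;
    }
}
```

Each window's hash is updated in O(1), so the scan is O(n) expected rather than O(mn).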


Slide by Steven Skiena


Hashing, Hashing, and Hashing

  • Udi Manber says that the three most important algorithms at Yahoo are hashing, hashing, and hashing.
  • Hashing has a variety of clever applications beyond just speeding up search, by giving you a short but distinctive representation of a larger document.


Document Comparison

  • Is this new document different from the rest in a large database? – Hash the new document, and compare it to the hash codes of database.

  • How can I convince you that a file isn’t changed? – Check if the cryptographic hash code of the file you give me today is the same as that of the original. Any changes to the file will change the hash code.


SOME APPLICATIONS OF HASHING


Set ADT

  • Stores unique values with no order
    • Multiset: non-unique values are allowed
  • Same concept as sets in mathematics
  • Can be implemented using possibly many data structures
    • Arrays, Linked-Lists, Hash Tables, etc.
  • Typical operations:
    • union, intersection, difference, subset, is_element_of, is_empty, size, iterate, build, add, remove, etc.


Set Membership Test: Bloom Filters


Initialize a bit vector and take h hash functions.

  • Insertion: put 1s at the positions given by the hash functions
  • Query: are there 1s at all positions given by the hash functions?

  • O(hk) insertion (h hash functions over keys of length k)
  • O(hk) query
  • Note: a plain Bloom filter does not support deletion – clearing a bit could remove other items too (counting variants allow it)

(h = 2)

  Item:  A  B  C  D  E  F  G  H  I  J
  H1:    4  3  4  2  8  9  2  8  0  0
  H2:    0  4  7  0  7  0  3  7  3  4

Bloom filter bits after inserting all items (positions 0 … 12):

  Position:  0  1  2  3  4  5  6  7  8  9  10  11  12
  Bit:       1  0  1  1  1  0  0  1  1  1  0   0   0

Bloom filter

  • A Bloom filter encodes a set of elements
  • It uses a bit array B of length m and d hash functions
    • to insert x, we set B[hi(x)] = 1, for i = 1, …, d
    • to query y, we check whether B[hi(y)] all equal 1, for i = 1, …, d
  • We need an estimate of n, the number of elements to insert
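A minimal sketch of the insert/query operations above; simulating the d hash functions by salting a single `std::hash` with different seeds is our own shortcut, not part of the slides:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct Bloom {
    std::vector<bool> bits;
    int d;
    Bloom(int m, int d) : bits(m, false), d(d) {}

    // i-th "hash function": std::hash over the key salted with i.
    size_t h(const std::string &x, int i) const {
        return std::hash<std::string>{}(x + '#' + std::to_string(i)) % bits.size();
    }
    void insert(const std::string &x) {
        for (int i = 0; i < d; i++) bits[h(x, i)] = true;
    }
    bool mayContain(const std::string &x) const {
        for (int i = 0; i < d; i++)
            if (!bits[h(x, i)]) return false;   // a 0 bit: definitely absent
        return true;                            // all 1s: probably present
    }
};

// No false negatives: every inserted item is always reported present.
bool bloomDemo() {
    Bloom b(1000, 3);
    b.insert("ali"); b.insert("veli");
    return b.mayContain("ali") && b.mayContain("veli");
}
```

Queries for non-inserted items may still return true (false positives), which is the trade-off the next slides quantify.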


Bloom filter example

  • a and b are inserted into a Bloom filter with m = 10, n = 2, d = 3
  • c is not in the set, since some of its bits are 0
  • d has not been inserted, but is still reported in the set – a false positive
  • Bloom filters have no false negatives

[Figure: the bit array after inserting a and b; one of c's positions holds a 0, while all of d's positions happen to hold 1s]


Bloom filter

  • Storing n k-mers in an m-bit array with d hash functions has a false positive rate of

(1 − e^(−dn/m))^d

  • Given n and m, the optimal d is (m/n) ln 2
  • Examples: m = 8n, d = 5 gives a 2.16% false positive rate (FPR); m = 6n, d = 4 gives 5.6% FPR; m = 4n, d = 3 gives 14.6% FPR
  • m = 8n corresponds to storing 1 byte per element

Space: m = 1.44 n log₂(1/ε), where ε is the false positive rate
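The quoted rates can be checked numerically from the formula (function name is ours):

```cpp
#include <cassert>
#include <cmath>

// False-positive rate of a Bloom filter storing n items
// in m bits with d hash functions: (1 - e^(-dn/m))^d.
double bloomFPR(double n, double m, double d) {
    return std::pow(1.0 - std::exp(-d * n / m), d);
}
```

With m = 8n and d = 5 this evaluates to about 0.0217, matching the 2.16% figure above.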


Document Comparison: MinHash

  • Suppose we need to find near-duplicate documents among N = 1 million documents
    • Naïvely, we would have to compute pairwise similarities for every pair of docs
      • N(N − 1)/2 ≈ 5×10¹¹ comparisons
      • At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days
    • For N = 10 million, it takes more than a year

  • Similarly: given a dataset of 10B documents, quickly find the document that is most similar to a query document q.


Slides from: http://www.mmds.org/


Document Comparison

  • Shingling: Converts a document into a set representation (Boolean vector)
  • Min-Hashing: Convert large sets to short signatures, while preserving similarity
  • Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

(skipped for now)


Shingling

Shingling: Converts a document into a set

  • A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the document
    • Tokens can be characters, words, etc., depending on the application
    • For example, assume that tokens = characters
  • To compress long shingles, we can hash them
  • Represent a document by the set of hash values of its k-shingles


Shingling: Example

  • k = 2; document D1 = abcab
  • Set of 2-shingles: S(D1) = {ab, bc, ca}
  • Hash the shingles: h(D1) = {1, 5, 7}

  • k = 8, 9, or 10 is often used in practice
  • Benefits:
    • Documents that are intuitively similar will have many shingles in common
    • Changing a word only affects k-shingles within distance k-1 from the word
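The shingle-set construction, plus the Jaccard similarity of two shingle sets used as the similarity measure, can be sketched as follows (character tokens, as in the example; names are ours):

```cpp
#include <cassert>
#include <set>
#include <string>

// Character k-shingles of a document, as a set.
std::set<std::string> shingles(const std::string &doc, int k) {
    std::set<std::string> s;
    for (int i = 0; i + k <= (int)doc.size(); i++)
        s.insert(doc.substr(i, k));
    return s;
}

// Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets.
double jaccard(const std::set<std::string> &a, const std::set<std::string> &b) {
    int inter = 0;
    for (const auto &x : a) inter += (int)b.count(x);
    int uni = (int)(a.size() + b.size()) - inter;
    return uni == 0 ? 1.0 : (double)inter / uni;
}
```

For D1 = abcab with k = 2 this yields the set {ab, bc, ca} from the example (the repeated "ab" collapses in the set).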


Shingle Similarity

  • Document Di is represented by the set of its k-shingles Ci = S(Di)
  • A natural similarity measure is the Jaccard similarity:

sim(Di, Dj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Jaccard distance: d(Di, Dj) = 1 − sim(Di, Dj)

  • Example: two documents with 2 shingles in the intersection and 9 in the union have Jaccard similarity 2/9

60 of 66

Sets to Boolean Matrices

Encode sets using bitvectors

  • Rows = elements (shingles)
  • Columns = sets (documents)
    • 1 in row e and column s if and only if e is a member of s
    • Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)

  • Each document is a column:

    • Example: sim(C1, C2)
      • Size of intersection = 2; size of union = 7; Jaccard similarity (not distance) = 2/7
      • d(C1, C2) = 1 – (Jaccard similarity) = 5/7

[Figure: a Boolean matrix with shingles as rows and documents as columns; columns C1 and C2 agree on 2 of the 7 rows in their union]


Hashing Columns

  • Idea: hash each column C to a small signature h(C), such that:
    • sim(C1, C2) is the same as the “similarity” of the signatures h(C1) and h(C2)

  • Goal: find a hash function h(·) such that:
    • If sim(C1, C2) is high, then with high probability h(C1) = h(C2)
    • If sim(C1, C2) is low, then with high probability h(C1) ≠ h(C2)

  • Idea: hash documents into buckets; expect that most pairs of near-duplicate documents hash into the same bucket


MinHash

  • Ideally: Permute the rows of the Boolean matrix using some permutation π

  • Define the minhash function for this permutation π: hπ(C) = the index of the first row (in the permuted order) in which column C has value 1.
    • Denote this as: hπ(C) = minπ(C)

  • Apply, to all columns, several randomly chosen permutations π to create a signature for each column

  • Result is a signature matrix: Columns = sets, Rows = minhash values for each permutation π


MinHash Example

[Figure: an input Boolean matrix, several row permutations, and the resulting signature matrix]


MinHash Implementation Trick

  • Permuting the rows even once is prohibitive
  • Row hashing:
    • Pick K = 100 hash functions hi
    • Ordering the rows by hi gives a random permutation of the rows
  • One-pass implementation
    • For each column c and hash function hi, keep a “slot” M(i, c) for the minhash value of column c under hash i
    • Initialize all M(i, c) = ∞
      • Scan the rows looking for 1s
      • Suppose row j has a 1 in column c
      • Then for each hi:
        • If hi(j) < M(i, c), then M(i, c) ← hi(j)

  • The expected similarity of two signatures equals the Jaccard similarity of the columns (sets) that the signatures represent
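The one-pass slot scheme can be sketched as follows (the affine hash functions and K = 2 are our simplifications of the slides' K = 100; all names are ours):

```cpp
#include <cassert>
#include <vector>

// One-pass MinHash signatures via the slot scheme above.
// rows[j] lists the columns (documents) whose row (shingle) j is 1.
// Hash functions are simple affine maps h_i(j) = (a_i*j + b_i) % P.
std::vector<std::vector<int>> minhashSignatures(
        const std::vector<std::vector<int>> &rows, int numCols) {
    const int P = 10007;                    // prime larger than the row count
    const int a[] = {3, 5}, b[] = {1, 2};   // K = 2 hash functions (our choice)
    const int K = 2, INF = 1 << 30;
    std::vector<std::vector<int>> M(K, std::vector<int>(numCols, INF));
    for (int j = 0; j < (int)rows.size(); j++)     // scan the rows once
        for (int c : rows[j])                      // row j has a 1 in column c
            for (int i = 0; i < K; i++) {
                int hij = (a[i] * j + b[i]) % P;
                if (hij < M[i][c]) M[i][c] = hij;  // keep the minimum
            }
    return M;
}
```

Identical columns necessarily receive identical signatures, which is the degenerate case of the similarity property above.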


SimHash: an LSH Technique

  • Distance approximation
  • Used, for example, in approximation algorithms for the MAX-CUT problem
  • Google uses SimHash to identify near-duplicate web pages


SimHash (a type of Locality-Sensitive Hashing)

  • Goal: generate the same hash value for similar sets of items
    • Example input: a sentence (a set of items)
    • Items: the words of the sentence (represented by their hash values)

  • Count the net difference between 1s and 0s at each bit position

Words and (example) 8-bit hash values for the sentence
“This is an example sentence to generate a hash value”:

  This      0b01111000      to        0b01011101
  is        0b11011101      generate  0b10011100
  an        0b00011100      a         0b11000001
  example   0b01000001      hash      0b01001011
  sentence  0b11110000      value     0b00101011

Counter vector (net difference between 1s and 0s at each bit position):

  [−2, +4, −4, +2, +4, −2, −6, +2]  →  SimHash value 0b01011001

Changing one word (“hash” → “SimHash”, hash value 0b11111101) gives the counter vector

  [0, +4, −2, +4, +4, 0, −8, +2]  →  SimHash value 0b01011001

i.e., the two similar sentences receive the same SimHash value.
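The bit-counting scheme can be sketched as follows (function name is ours; it takes the 8-bit word-hash values as input):

```cpp
#include <cassert>
#include <vector>

// 8-bit SimHash of a set of word hash values: for each bit position,
// count +1 for a 1 and -1 for a 0 across all hashes, then take the
// majority bit (counter > 0 means more 1s than 0s).
unsigned simhash8(const std::vector<unsigned> &wordHashes) {
    int counter[8] = {0};
    for (unsigned h : wordHashes)
        for (int bit = 0; bit < 8; bit++)
            counter[bit] += (h >> bit & 1) ? 1 : -1;
    unsigned result = 0;
    for (int bit = 0; bit < 8; bit++)
        if (counter[bit] > 0) result |= 1u << bit;
    return result;
}
```

Feeding in the example's two word-hash lists (differing only in 0x4B vs 0xFD for the changed word) produces the same value 0b01011001 = 0x59 for both, as the slide shows.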
