Hashing
CS202 - Fundamental Structures of Computer Science II
1
Initially prepared by Dr. İlyas Çiçekli; improved by various Bilkent CS202 instructors.
Hashing
CS202 - Fundamental Structures of Computer Science II
2
Hash Tables
CS202 - Fundamental Structures of Computer Science II
3
Hash Function -- Address Calculator
CS202 - Fundamental Structures of Computer Science II
4
Hash Function
Hash Table
Hashing
CS202 - Fundamental Structures of Computer Science II
5
Collisions
CS202 - Fundamental Structures of Computer Science II
6
Hash Functions
CS202 - Fundamental Structures of Computer Science II
7
Everything is an Integer
T is 54h(01010100), E is 45h (01000101)
01001110010011110101010001000101= 4E4F5445h = 1313821765
CS202 - Fundamental Structures of Computer Science II
8
How to Design a Hash Function?
CS202 - Fundamental Structures of Computer Science II
9
Hash Function
Hash Table
Hash Functions -- Selecting Digits
h(033475678) = 37
h(023455678) = 25
CS202 - Fundamental Structures of Computer Science II
10
Hash Functions -- Folding
h(033475678) = 0 + 3 + 3 + 4 + 7 + 5 + 6 + 7 + 8 = 43
h(023455678) = 0 + 2 + 3 + 4 + 5 + 5 + 6 + 7 + 8 = 40
CS202 - Fundamental Structures of Computer Science II
11
Hash Functions -- Modular Arithmetic
h(x) = x mod tableSize
CS202 - Fundamental Structures of Computer Science II
12
Why Primes?
CS202 - Fundamental Structures of Computer Science II
13
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
64
100
128
200
400
300
500
Why Primes?
CS202 - Fundamental Structures of Computer Science II
14
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
64
100
128
200
400
300
500
Rationale
gcd(a1, a2, a3 … a4 ,m)
gcd(64,100,128,200,300,400,500, 8) = 4
gcd(64,100,128,200,300,400,500,7) = 1
unless gcd(a1, a2, a3 … a4 ) = m, which is rare.
CS202 - Fundamental Structures of Computer Science II
15
Collision Resolution
Each entry holds one item
Each entry can hold more than one item
(Buckets – hold certain number of items)
CS202 - Fundamental Structures of Computer Science II
16
Table size is 101
Open Addressing
CS202 - Fundamental Structures of Computer Science II
17
Open Addressing -- Linear Probing
CS202 - Fundamental Structures of Computer Science II
18
Linear Probing -- Example
CS202 - Fundamental Structures of Computer Science II
19
0 | 9 |
1 | |
2 | 2 |
3 | 13 |
4 | 25 |
5 | 24 |
6 | |
7 | |
8 | 30 |
9 | 20 |
10 | 10 |
Linear Probing -- Clustering Problem
CS202 - Fundamental Structures of Computer Science II
20
Open Addressing -- Quadratic Probing
i+32, i+42 ...
CS202 - Fundamental Structures of Computer Science II
21
Quadratic Probing -- Example
9+32 mod 11 =7
CS202 - Fundamental Structures of Computer Science II
22
0 | |
1 | |
2 | 2 |
3 | 13 |
4 | 25 |
5 | |
6 | 24 |
7 | 9 |
8 | 30 |
9 | 20 |
10 | 10 |
Open Addressing -- Double Hashing
h2(key) ≠0
h2≠h1
h1(key)
h1(key) + h2(key)
h1(key) + 2*h2(key),
...
CS202 - Fundamental Structures of Computer Science II
23
Double Hashing -- Example
h1(x) = x mod 11
h2(x) = 7 – (x mod 7)
CS202 - Fundamental Structures of Computer Science II
24
0 | |
1 | |
2 | |
3 | 58 |
4 | |
5 | |
6 | 91 |
7 | |
8 | |
9 | |
10 | 14 |
Open Addressing -- Retrieval & Deletion
CS202 - Fundamental Structures of Computer Science II
25
Separate Chaining
CS202 - Fundamental Structures of Computer Science II
26
Separate Chaining
CS202 - Fundamental Structures of Computer Science II
27
Hashing Analysis
CS202 - Fundamental Structures of Computer Science II
28
Hashing -- Analysis
α= (current number of items) / tableSize
CS202 - Fundamental Structures of Computer Science II
29
Separate Chaining -- Analysis
CS202 - Fundamental Structures of Computer Science II
30
for a successful search
for an unsuccessful search
Linear Probing -- Analysis
CS202 - Fundamental Structures of Computer Science II
31
for a successful search
for an unsuccessful search
Linear Probing -- Analysis
CS202 - Fundamental Structures of Computer Science II
32
for a successful search
for an unsuccessful search
Linear Probing -- Analysis
Example: Find the average number of probes for a successful
search and an unsuccessful search for this hash table? Use the
following hash function: h(x) = x mod 11
Successful Search: Try 20, 30, 2, 13, 25, 24, 10, 9
20: 9 30: 8 2: 2 13: 2,3
25: 3, 4 24: 2, 3, 4, 5 10: 10 9: 9, 10, 0
Avg. no of probes = (1+1+1+2+2+4+1+3)/8 = 1.9
Unsuccessful Search: Try 0, 1, 35, 3, 4, 5, 6, 7, 8, 31, 32
0: 0, 1 1: 1 35: 2, 3, 4, 5, 6
3: 3, 4, 5, 6 4: 4, 5, 6 5: 5, 6
6: 6 7: 7 8: 8, 9, 10, 0, 1
31: 9, 10, 0, 1 32: 10, 0, 1
Avg. no of probes =(2+1+5+4+3+2+1+1+5+4+3)/11 = 2.8
CS202 - Fundamental Structures of Computer Science II
33
0 | 9 |
1 | |
2 | 2 |
3 | 13 |
4 | 25 |
5 | 24 |
6 | |
7 | |
8 | 30 |
9 | 20 |
10 | 10 |
Quadratic Probing & Double Hashing -- Analysis
CS202 - Fundamental Structures of Computer Science II
34
for a successful search
for an unsuccessful search
The relative efficiency of�four collision-resolution methods
CS202 - Fundamental Structures of Computer Science II
35
What Constitutes a Good Hash Function
CS202 - Fundamental Structures of Computer Science II
36
Example: Hash Functions for Strings
CS202 - Fundamental Structures of Computer Science II
37
Hash Function 1
CS202 - Fundamental Structures of Computer Science II
38
int hash(const string &key, int tableSize)
{
int hasVal = 0;
for (int i = 0; i < key.length(); i++)
hashVal += key[i];
return hashVal % tableSize;
}
Hash Function 2
CS202 - Fundamental Structures of Computer Science II
39
int hash (const string &key, int tableSize)
{
return (key[0]+27 * key[1] + 729*key[2]) % tableSize;
}
Hash Function 3
CS202 - Fundamental Structures of Computer Science II
40
int hash (const string &key, int tableSize)
{
int hashVal = 0;
for (int i = 0; i < key.length(); i++)
hashVal = 37 * hashVal + key[i];�
hashVal %=tableSize;
if (hashVal < 0) /* in case overflows occurs */
hashVal += tableSize;
return hashVal;
};
Hash function for strings:
CS202 - Fundamental Structures of Computer Science II
41
a
l
i
key
KeySize = 3;
98
108
105
hash(“ali”) = (105 * 1 + 108*37 + 98*372) % 10,007 = 8172
0 1 2
i
key[i]
hash�function
ali
……
……
0
1
2
8172
10,006 (TableSize)
“ali”
Hash Table versus Search Trees
CS202 - Fundamental Structures of Computer Science II
42
Performance
CS202 - Fundamental Structures of Computer Science II
43
Other applications of hash tables
CS202 - Fundamental Structures of Computer Science II
44
Applications of Hashing
CS202 - Fundamental Structures of Computer Science II
45
Substring Pattern Matching
CS202 - Fundamental Structures of Computer Science II
46
Slide by Steven Skiena
Hashing, Hashing, and Hashing
CS202 - Fundamental Structures of Computer Science II
47
Slide by Steven Skiena
Document Comparison
CS202 - Fundamental Structures of Computer Science II
48
Slide by Steven Skiena
SOME APPLICATIONS OF HASHING
49
Set ADT
50
Set Membership Test: Bloom Filters
51
Init a bitvector
Take h hash functions
Insertion: put 1’s at positions given by hash functions
Query: are there 1’s at all positions given by hash functions?
(h=2)
1 |
0 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
Bloom Filter
Item | H1 | H2 |
A | 4 | 0 |
B | 3 | 4 |
C | 4 | 7 |
D | 2 | 0 |
E | 8 | 7 |
F | 9 | 0 |
G | 2 | 3 |
H | 8 | 7 |
I | 0 | 3 |
J | 0 | 4 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
Positions
Bloom filter
Bloom filter example
d
b
a
c
Bloom filter
≈(1-e-d n/m)d
Space: m = 1.44n log2(ε) where ε is the false positive rate
Document Comparison: MinHash
55
Slides from: http://www.mmds.org/
Document Comparison
(skipped for now)
56
Slides from: http://www.mmds.org/
Shingling
Shingling: Converts a document into a set
57
Slides from: http://www.mmds.org/
Shingling: Example
58
Slides from: http://www.mmds.org/
Shingle Similarity
Jaccard Distance:
59
Slides from: http://www.mmds.org/
2 in intersection
9 in union
Jaccard Similarity = 2/9
Sets to Boolean Matrices
Encode sets using bitvectors
60
0 | 1 | 1 | 0 |
1 | 1 | 0 | 1 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 |
0 | 0 | 1 | 0 |
1 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 |
Documents
Shingles
Slides from: http://www.mmds.org/
Hashing Columns
61
Slides from: http://www.mmds.org/
MinHash
62
Slides from: http://www.mmds.org/
MinHash Example
63
Slides from: http://www.mmds.org/
MinHash Implementation Trick
64
Slides from: http://www.mmds.org/
SimHash: an LSH Technique
65
SimHash (type of Locality Sensitive Hashing)
Hash Values
0b01111000
0b11011101
0b00011100
0b01000001
0b11110000
0b01011101
0b10011100
0b11000001
0b01001011
0b00101011
This is an example sentence to generate a hash value
Words
This
is
an
example
sentence
to
generate
a
hash
value
Hash Values
0b01111000
0b11011101
0b00011100
0b01000001
0b11110000
0b01011101
0b10011100
0b11000001
0b11111101
0b00101011
Bitwise Sum
(Net difference
between 0s and 1s)
[0, +4, -2, +4, +4, 0, -8, +2]
0b01011001
Counter Vector
SimHash Value
Words
This
is
an
example
sentence
to
generate
a
SimHash
value
[-2, +4, -4, +2, +4, -2, -6, +2]
66