Lecture 08: Hash Collision Resolutions + Runtime
CSE 373: Data Structures and Algorithms
1
Please log onto tinyurl.com/Summer373L8 to answer the daily lecture participation question
Announcements
Please check Canvas Gradebook!
Practice
(2, b) (3,c) (10,j) (9,i) (1,a) (12,l)
3
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(10, j)
(2, b)
(3, c)
(1, a)
(12, l)
(9, i)
Lecture Outline
ArrayMap
DirectAccessMap
SimpleHashMap
SeparateChaining
HashMap
4
3
2
1
FASTER: Jump directly to element, only int keys
MORE FLEXIBLE: Hash function supports any type of key
YOUR BEST FRIEND: Addresses limitations with hash collisions, but still fast!
Review
MAP ADT
As seen on
Project 2
As seen on
Project 2
Hash functions: translating a piece of data to an int
CSE 373 20 SP – CHAMPION & CHUN
Hash function definition
“review”: Integer remainder with % “mod”
14 % 4 is 2� � 3 43� 4 ) 14 5 ) 218� 12 20� 2 18� 15� 3
CSE 142 SP 18 – BRETT WORTZMAN
6
218 % 5 is 3
For more review/practice, check out https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/what-is-modular-arithmetic
Limit keys to indices within array
Equivalently, to find a % b (for a,b > 0):
while(a > b-1)
a -= b;
return a;
First Hash Function: % table size
CSE 373 SP 22 - CHAMPION
7
indices | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
elements | | | | | | | | | | |
put(0, “foo”);
put(5, “bar”);
put(11, “biz”)
put(18, “bop”);
“foo”
0 % 10 = 0
5 % 10 = 5
11 % 10 = 1
18 % 10 = 8
“bop”
“bar”
“biz”
Implement First Hash Function
CSE 373 SU 19 - ROBBIE WEBER
8
SimpleHashMap<Integer>
put mod key by table size, put item at result
get mod key by table size, get item at result
containsKey mod key by table size, return data[result] == null remove mod key by table size, nullify element at result
size return count of items in dictionary
state
behavior
Data[]
size
Operation | Array w/ indices as keys | |
put(key,value) | best | 𝚹(1) |
worst | 𝚹(1) | |
get(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
containsKey(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
Note: % is just a math operator like +, -, /, *, so it’s constant runtime
Questions?
things we talked about:
First Hash Function: % table size
CSE 373 SP 22 - CHAMPION
10
indices | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
elements | | | | | | | | | | |
put(0, “foo”);
put(5, “bar”);
put(11, “biz”)
put(18, “bop”);
Collision!
“foo”
0 % 10 = 0
5 % 10 = 5
11 % 10 = 1
18 % 10 = 8
20 % 10 = 0
“bop”
“bar”
“biz”
“:(”
put(20, “:(”);
Hash Obsession: Collisions
CSE 373 SU 19 - ROBBIE WEBER
11
Strategies to handle hash collision
CSE 373 AU 18 – SHRI MARE
12
Separate chaining
CSE 373 ROBBIE WEBER + HANNAH TANG
13
Separate chaining
public boolean containsKey(int key) {
int bucketIndex = key % data.length;
loop through data[bucketIndex]
return true if we find the key in
data[bucketIndex]
return false if we get to here (didn’t
find it)
}
CSE 373 ROBBIE WEBER + HANNAH TANG
14
Reminder: the implementations of put/get/containsKey are all very similar, and almost always will have the same complexity class runtime
runtime analysis
Are there different possible states for our Hash Map that make this code run slower/faster, assuming there are already n key-value pairs being stored?
Yes! If we had to do a lot of loop iterations to find the key in the bucket, our code will run slower.
A best case situation for separate chaining
CSE 373 20 SP – CHAMPION & CHUN
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(0, b)
(2, b)
(3, b)
(4, b)
(5, b)
(6, b)
(7, b)
(8, b)
It’s possible (and likely if you follow some best-practices) that everything is spread out across the buckets pretty evenly. This is the opposite of the last slide: when we have minimal collisions, our runtime should be less. For example, if we have a bucket with only 0 or 1 element in it, checking containsKey for something in that bucket will only take a constant amount of time.
We’re going to try a lot of stuff we can to make it more likely we achieve this beautiful state ☺.
In-practice situations for separate chaining
CSE 373 20 SP – CHAMPION & CHUN
Operation | Array w/ indices as keys | |
put(key,value) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | 𝚹(n) | |
get(key) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | 𝚹(1) | |
remove(key) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | O(n) | |
Reminder: the in-practice runtimes are assuming an even distribution of the keys inside the array and following of best-practices to ensure the average chain length is low.
Best practices (pay attention to this for the hw)
CSE 373 20 SP – CHAMPION & CHUN
It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.
Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?
Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.
What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.
Best practices (pay attention to this for the hw)
CSE 373 20 SP – CHAMPION & CHUN
It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.
Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?
Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.
What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.
The pattern we’re getting to is that the expected runtime is approximately: # of pairs / array.length (AKA n / c where n is the number of elements and c is the number of possible chains). If array.length is fixed for your whole program, then this is an order-n runtime, but if the array.length also increases (because you re-size) and you redistribute out the values evenly across the buckets, you can keep your runtime low. In particular, if you resize when when your n / c ratio increases to about 1, you’re expected to have 1 element or fewer in each bucket at all times. (do this on your homework).
Tip: make sure you re-hash (re-distribute) your keys by the new array length after re-sizing so they don’t get clustered in the old array length range.
Lambda + resizing rephrased
However, if you resize once you hit that 1:1 threshold, the current λ is expected to be less than 1 (which is a constant / constant runtime, so we can simplify to O(1)).
CSE 373 SU 19 - ROBBIE WEBER
19
Operation | Array w/ indices as keys | |
put(key,value) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
get(key) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
remove(key) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
What about non integer keys?
CSE 373 SU 19 - ROBBIE WEBER
20
Hash function definition
(Before we % by length, we have to convert the data into an int)
}
CSE 373 SU 19 - ROBBIE WEBER
21
Pro: super fast
Con: lots of collisions!
Pro: still really fast
Con: some collisions
Pro: few collisions
Con: slow, gigantic integers
Java’s hashCode (relevant for project)
CSE 373 20 SP – CHAMPION & CHUN
Best practices for an nice distribution of keys recap
CSE 373 20 SP – CHAMPION & CHUN
Practice
CSE 373 SU 19 - ROBBIE WEBER
24
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(1, a)
(5, b)
(11, a)
(17, f)
(1, g)
(12, e)
(7, d)
(25, h)
Practice
CSE 373 SU 19 - ROBBIE WEBER
25
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(“a”, 1)
(“abcd”, 5)
(“c”, 3)
(“five”, 7)
(“abc”, 4)
(“ab”, 2)
(“hello world”, 8)
(“abcdabcd”, 6)
Java and Hash Functions
If a.equals(b) is true then a.hashCode() == b.hashCode() MUST also be true
CSE 373 SU 19 - ROBBIE WEBER
26
Resizing
Don’t forget to re-distribute your keys!
As seen on
Project 2
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
(7,blue) | |
(4,orange) | |
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
(1,red) | |
(22,tan) | |
(22,tan) | |
(7,blue) | |
(77,aqua) | |
(4,orange) | |
(1,red) | |
(6,pink) | |
(8,lilac) | |
(53,puce) | |
(6,pink) | |
(77,aqua) | |
(53,puce) | |
(8,lilac) | |
If we just expand the buckets array, several values are hashed in the wrong place
How to Resize:
When to Resize?
LOAD FACTOR λ
(22,tan) | |
(7,blue) | |
(77,aqua) | |
(4,orange) | |
0 | |
1 | |
2 | |
3 | |
4 | |
(1,red) | |
(6,pink) | |
(8,lilac) | |
(53,puce) | |
When to Resize?
LOAD FACTOR λ
Questions?
Good Hashing
The hash function of a HashDictionary gets called a LOT:
This is why it is so important to have a “good” hash function. A good hash function is:
CSE 373 20 WI – HANNAH TANG
31
public int hashFn(String s) {
return random.nextInt()
}
public int hashFn(String s) {
int retVal = 0;
for (int I = 0; I < s.length(); i++) {
for (int j = 0; j < s.length(); j++) {
retVal += helperFun(s, I, j);
}
}
return retVal;
}
public int hashFn(String s) {
if (s.length() % 2 == 0) {
if (s.length(). % 2 == 0) {
return 17;
} else {
return 43;
}
}
}
NOT deterministic
NOT efficient
NOT uniform
Handling Collisions
int index = natrualHash % TableSize;
while (index in use) {
i++;
}
return index;
CSE 373 SP 18 - KASEY CHAMPION
32
Linear Probing
CSE 373 SP 18 - KASEY CHAMPION
33
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions
1, 5, 11, 7, 12, 17, 6, 25
1
5
11
7
12
17
6
25
Linear Probing
CSE 373 SP 18 - KASEY CHAMPION
34
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions
38, 19, 8, 109, 10
38
19
8
8
109
10
Problem:
Primary Clustering
When probing causes long chains of occupied slots within a hash table
Runtime
CSE 373 SP 18 - KASEY CHAMPION
35
Can we do better?
CSE 373 SP 18 - KASEY CHAMPION
36
Quadratic Probing
CSE 373 SP 18 - KASEY CHAMPION
37
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(49 % 10 + 0 * 0) % 10 = 9
(49 % 10 + 1 * 1) % 10 = 0
(58 % 10 + 0 * 0) % 10 = 8
(58 % 10 + 1 * 1) % 10 = 9
(58 % 10 + 2 * 2) % 10 = 2
89
18
49
Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions
89, 18, 49, 58, 79, 27
58
79
(79 % 10 + 0 * 0) % 10 = 9
(79 % 10 + 1 * 1) % 10 = 0
(79 % 10 + 2 * 2) % 10 = 3
Problems:
If λ≥ ½ we might never find an empty spot
Infinite loop!
Can still get clusters
27
Now try to insert 9.
Uh-oh
Quadratic Probing
There were empty spots. What Gives?
Quadratic probing is not guaranteed to check every possible spot in the hash table
The following is true:
Notice we have to assume p is prime to get that guarantee
Secondary Clustering
CSE 373 SP 18 - KASEY CHAMPION
39
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions
19, 39, 29, 9
39
29
19
9
Secondary Clustering
When using quadratic probing sometimes need to probe the same sequence of table cells, not necessarily next to one another
Probing
CSE 373 SP 18 - KASEY CHAMPION
40
Questions
Topics Covered:
CSE 373 20 SP – CHAMPION & CHUN
41
Double Hashing
int index = natrualHash % TableSize;
while (index in use) {
i++;
}
return index;
CSE 373 SP 18 - KASEY CHAMPION
42
<- Most effective if g(k) returns value relatively prime to table size
Second Hash Function
CSE 373 SP 18 - KASEY CHAMPION
43
Resizing: Open Addressing
Running Times
CSE 332 SU 18 – ROBBIE WEBER
In-Practice
Summary
CSE 373 SP 18 - KASEY CHAMPION
47
No clustering
Potentially more “compact” (λ can be higher)
Managing clustering can be tricky
Less compact (keep λ < ½)
Array lookups tend to be a constant factor faster than traversing pointers
Summary
Extra optimizations
CSE 373 SP 18 - KASEY CHAMPION
49
Other Hashing Applications
CSE 373 20 WI – HANNAH TANG
50
Cryptography
Hashing also ”hides” the data by translating it, this can be used for security
Fingerprinting
git hashes (“identification”)
Ad Tracking
YouTube Content ID
Caching
File Verification / Error Checking: