CS61B, 2022
Lecture 19: Hashing
Data Indexed Arrays
Sets
We’ve now seen several implementations of the Set (or Map) ADT.
Set
ArraySet
BST
2-3 Tree
LLRB
Map
| contains(x) | add(x) | Notes |
ArraySet | Θ(N) | Θ(N) | |
BST | Θ(N) | Θ(N) | Random trees are Θ(log N). |
2-3 Tree | Θ(log N) | Θ(log N) | Beautiful idea. Very hard to implement. |
LLRB | Θ(log N) | Θ(log N) | Maintains bijection with 2-3 tree. Hard to implement. |
Worst case runtimes
Limits of Search Tree Based Sets
Our search tree based sets require items to be comparable.
Our search tree sets have excellent performance, but could maybe be better?
Today we’ll see the answer to both of the questions above is yes.
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing nothing
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
T
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
T
F
F
F
F
T
F
F
F
F
F
F
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0, 5
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
diis.add(5);
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
T
F
F
F
F
T
F
F
F
F
T
F
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0, 5, 10
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
diis.add(5);
diis.add(10);
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
T
F
F
F
F
T
F
F
F
F
T
T
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0, 5, 10, 11
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
diis.add(5);
diis.add(10);
diis.add(11);
DataIndexedIntegerSet Implementation
T
F
F
F
F
T
F
F
F
F
T
T
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0, 5, 10, 11
...
public class DataIndexedIntegerSet {
private boolean[] present;
public DataIndexedIntegerSet() {
present = new boolean[2000000000];
}
public void add(int i) {
present[i] = true;
}
public boolean contains(int i) {
return present[i];
}
}
DataIndexedIntegerSet Implementation
public class DataIndexedIntegerSet {
private boolean[] present;
public DataIndexedIntegerSet() {
present = new boolean[2000000000];
}
public void add(int i) {
present[i] = true;
}
public boolean contains(int i) {
return present[i];
}
}
| contains(x) | add(x) |
ArraySet | Θ(N) | Θ(N) |
BST | Θ(N) | Θ(N) |
2-3 Tree | Θ(log N) | Θ(log N) |
LLRB | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Using Data as an Index
One extreme approach: Create an array of booleans indexed by data!
T
F
F
F
F
T
F
F
F
F
T
T
F
F
F
F
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Set containing 0, 5, 10, 11
...
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();
diis.add(0);
diis.add(5);
diis.add(10);
diis.add(11);
Downsides of this approach (that we will try to address):
DataIndexedEnglishWordSet
Generalizing the DataIndexedIntegerSet Idea
Suppose we want to add(“cat”)
The key question:
What’s wrong with this approach?
F
F
F
T
a
b
c
d
y
z
0
1
2
3
4
25
26
...
F
“chupacabra” collides with “cat”
F
F
Avoiding Collisions
Use all digits by multiplying each by a power of 27.
Why this specific pattern?
F
F
a
b
c
d
y
z
cas
cat
cau
0
1
2
3
4
25
26
2233
2234
2235
...
F
...
T
F
F
F
...
F
F
F
Generalizing the DataIndexedIntegerSet Idea
Ideally, we want a data indexed set that can store arbitrary types.
But previous idea only supports integers!
DataIndexedSet<String> dis =
new DataIndexedSet<>();
dis.add("cat");
dis.add("bee");
dis.add("dog");
cat
dog
F
F
F
F
F
0
1
2
3
4
5
6
F
F
...
F
Where do cat, bee, and dog go???
bee
???
The Decimal Number System vs. Our System for Strings
In the decimal number system, we have 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Want numbers larger than 9? Use a sequence of digits.
Example: 7091 in base 10
Our system for strings is almost the same, but with letters.
Test Your Understanding
Convert the word “bee” into a number by using our “powers of 27” strategy.
Reminder: cat27 = (3 x 272) + (1 x 271) + (20 x 270) = 223410
Hint: ‘b’ is letter 2, and ‘e’ is letter 5.
Test Your Understanding
Convert the word “bee” into a number by using our “powers of 27” strategy.
Reminder: cat27 = (3 x 272) + (1 x 271) + (20 x 270) = 223410
Hint: ‘b’ is letter 2, and ‘e’ is letter 5.
Uniqueness
As long as we pick a base ≥ 26, this algorithm is guaranteed to give each lowercase English word a unique number!
In other words: Guaranteed that we will never have a collision.
Can write an englishToInt function (see two hidden slides that follow this one).
Implementing englishToInt (optional)
Optional exercise: Try to write a function englishToInt that can convert English strings to integers by adding characters scaled by powers of 27.
Examples:
Implementing englishToInt (optional) (solution)
/** Converts ith character of String to a letter number.� * e.g. 'a' -> 1, 'b' -> 2, 'z' -> 26 */
public static int letterNum(String s, int i) {
int ithChar = s.charAt(i);
if ((ithChar < 'a') || (ithChar > 'z'))
{ throw new IllegalArgumentException(); }
return ithChar - 'a' + 1;
}
public static int englishToInt(String s) {
int intRep = 0;
for (int i = 0; i < s.length(); i += 1) {
intRep = intRep * 27;
intRep = intRep + letterNum(s, i);
}
return intRep;
}
DataIndexedEnglishWordSet Implementation
public class DataIndexedEnglishWordSet {
private boolean[] present;
public DataIndexedEnglishWordSet() {
present = new boolean[2000000000];
}
public void add(String s) {
present[englishToInt(s)] = true;
}
public boolean contains(int i) {
return present[englishToInt(s)];
}
}
F
F
a
b
c
d
y
z
cas
cat
cau
0
1
2
3
4
25
26
2233
2234
2235
...
F
...
T
F
F
F
...
F
F
F
Set containing “cat”
DataIndexedStringSet
DataIndexedStringSet
Using only lowercase English characters is too restrictive.
ASCII Characters
The most basic character set used by most computers is ASCII format.
biggest value is 126
DataIndexedStringSet
Using only lowercase English characters is too restrictive.
Examples:
Implementing asciiToInt
The corresponding integer conversion function is actually even simpler than englishToInt from the hidden slides. Using the raw character value means we avoid the need for a helper method.
What if we want to use characters beyond ASCII?
public static int asciiToInt(String s) {
int intRep = 0;
for (int i = 0; i < s.length(); i += 1) {
intRep = intRep * 126;
intRep = intRep + s.charAt(i);
}
return intRep;
}
Going Beyond ASCII
chars in Java also support character sets for other languages and symbols.
Example: Computing Unique Representations of Chinese
The largest possible value for Chinese characters is 40,959*, so we’d need to use this as our base if we want to have a unique representation for all possible strings of Chinese characters.
Example:
F
守门呗
守门员
守门呙
39,3120,2486,9367
39,3120,2486,9368
39,3120,2486,9369
...
F
...
T
*If you’re curious, the last character is:
Integer Overflow and Hash Codes
Major Problem: Integer Overflow
In Java, the largest possible integer is 2,147,483,647.
int x = 2147483647;
System.out.println(x);
System.out.println(x + 1);
jug ~/Dropbox/61b/lec/hashing
$ javac BiggestPlusOne.java
$ java BiggestPlusOne
2147483647
-2147483648
Consequence of Overflow: Collisions
Because Java has a maximum integer, we won’t get the numbers we expect!
Consequence of Overflow: Collisions
Because Java has a maximum integer, we won’t get the numbers we expect!
Overflow can result in collisions, causing incorrect answers.
public void moo() {
DataIndexedStringSet disi = new DataIndexedStringSet();
disi.add("melt banana");
disi.contains("subterrestrial anticosmetic");
//asciiToInt for these strings is 839099497
}
returns true!
Hash Codes and the Pigeonhole Principle
The official term for the number we’re computing is “hash code”.
Hash Codes and the Pigeonhole Principle
The official term for the number we’re computing is “hash code”.
Pigeonhole principle tells us that if there are more than 4294967296 possible items, multiple items will share the same hash code.
Bottom line: Collisions are inevitable.
Two Fundamental Challenges
Two fundamental challenges:
Hash Tables:
Handling Collisions
Resolving Ambiguity
Pigeonhole principle tells us that collisions are inevitable due to integer overflow.
F
0
717
718
719
...
F
T
F
...
0
717
718
719
...
...
moo
Ñep
Suppose N items have the same numerical representation h:
How to implement a “bucket”?
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
111239443
111239444
111239445
...
...
0
1
Initially all buckets are empty.
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
add(“a”)
111239443
111239444
111239445
...
...
0
1
a
Bucket 1 now has a length 1 list.
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
add(“a”)
add(“abomamora”)
111239443
111239444
111239445
...
...
0
1
a
abomamora
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
add(“a”)
add(“abomamora”)
add(“adevilish”)
111239443
111239444
111239445
...
...
0
1
a
abomamora
adevilish
Both have hash code 111239444 using englishToInt
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
add(“a”)
add(“abomamora”)
add(“adevilish”)
add(“abomamora”)
111239443
111239444
111239445
...
...
0
1
a
abomamora
adevilish
Both have hash code 111239444 using englishToInt
The Separate Chaining Data Indexed Array
Each bucket in our array is initially empty. When an item x gets added at index h:
�We might call this a “separate chaining data indexed array”.
add(“a”)
add(“abomamora”)
add(“adevilish”)
add(“abomamora”)
contains(“adevilish”)
111239443
111239444
111239445
...
...
0
1
a
abomamora
adevilish
Separate Chaining Performance
Observation: Worst case runtime will be proportional to length of longest list.
Worst case time | contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Data Indexed Array | Θ(Q) | Θ(Q) |
Q: Length of longest list
Why Q and not 1?
0
1598
2234
3328
111239443
111239442
cat
dog
bee
...
...
...
...
abomamora
adevilish
Saving Memory Using Separate Chaining
Observation: We don’t really need billions of buckets.
0
1
2
3
4
5
6
7
8
9
Q: If we use the 10 buckets on the right, where should our five items go?
0
1598
2234
3328
111239443
111239442
cat
dog
bee
...
...
...
...
abomamora
adevilish
Saving Memory Using Separate Chaining and Modulus
Observation: Can use modulus of hashcode to reduce bucket count.
0
1598
2234
3328
111239443
111239442
cat
dog
bee
...
...
...
...
abomamora
adevilish
0
1
2
3
4
5
6
7
8
9
cat
bee
dog
abomamora
adevilish
Q: If we use the 10 buckets on the right, where should our five items go?
The Hash Table
What we’ve just created here is called a hash table.
0
1
2
3
4
5
6
7
8
9
bee
hug
抱抱
están
抱抱
unicodeToInt
1034854400
% 10
0
data
hash code
hash function
reduce
index
hash table
dog
الطبيعة
शानदार
포옹
In Java there’s a caveat here. Will revisit later.
Hash Table Performance
Hash Table Runtime
The good news: We use way less memory and can now handle arbitrary data.
The bad news: Worst case runtime is now Θ(Q), where Q is the length of the longest list.
0
1
2
3
4
Q: Length of longest list
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table | Θ(Q) | Θ(Q) |
Worst case runtimes
Hash Table Runtime: yellkey.com/attorney
0
1
2
3
4
Q: Length of longest list
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table | Θ(Q) | Θ(Q) |
Worst case runtimes
For the hash table above with 5 buckets, give the order of growth of Q with respect to N.
Hash Table Runtime
0
1
2
3
4
Q: Length of longest list
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table | Θ(Q) | Θ(Q) |
Worst case runtimes
For the hash table above with 5 buckets, give the order of growth of Q with respect to N.
C. Q is Θ(N)
All items evenly distributed.
All items in same list.
In the best case, the length of the longest list will be N/5. In the worst case, it will be N. In both cases, Q(N) is Θ(N).
Improving the Hash Table
0
1
2
3
4
Q: Length of longest list
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table | Θ(Q) | Θ(Q) |
Worst case runtimes
Suppose we have:
Major problem: Even if items are spread out evenly, lists are of length Q = N/M.
Hash Table Runtime
Suppose we have:
Q: Length of longest list
Worst case runtimes
0
1
2
3
4
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table | Θ(Q) | Θ(Q) |
Major problem: Even if items are spread out evenly, lists are of length Q = N/M.
Hash Table Runtime
Suppose we have:
As long as M = Θ(N), then O(N/M) = O(1).
One example strategy: When N/M is ≥ 1.5, then double M.
0
1
2
3
4
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
N = 0 M = 4 N / M = 0
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
7
N = 1 M = 4 N / M = 0.25
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
7
N = 2 M = 4 N / M = 0.5
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
7
3
N = 3 M = 4 N / M = 0.75
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
7
3
11
N = 4 M = 4 N / M = 1
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
20
7
3
11
N = 5 M = 4 N / M = 1.25
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
13
20
7
3
11
N = 6 M = 4 N / M = 1.5
N/M is too large. Time to double!
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
13
20
7
3
11
N = 6 M = 4 N / M = 1.5
N/M is too large. Time to double!
?
?
?
0
1
2
3
4
5
6
7
?
?
?
?
?
Hash Table Resizing Example
When N/M is ≥ 1.5, then double M.
0
1
2
3
16
13
20
7
3
11
N = 6 M = 4 N / M = 1.5
N/M is too large. Time to double!
0
1
2
3
4
5
6
7
N = 6 M = 8 N / M = 0.75
16
20
13
7
3
11
Resizing Hash Table Runtime
Suppose we have:
As long as M = Θ(N), then O(N/M) = O(1).
Assuming items are evenly distributed (as above), lists will be approximately N/M items long, resulting in Θ(N/M) runtimes.
0
1
2
3
4
N = 19 M = 5 N / M = 3.8
Resizing Hash Table Runtime
Suppose we have:
As long as M = Θ(N), then O(N/M) = O(1).
One important thing to consider is the cost of the resize operation.
0
1
2
3
4
N = 19 M = 5 N / M = 3.8
Hash Table Runtime
Hash table operations are on average constant time if:
Worst case runtimes
0
1
2
3
4
*: Indicates “on average”.
†: Assuming items are evenly spread.
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
DataIndexedArray | Θ(1) | Θ(1) |
Separate Chaining Hash Table With No Resizing | Θ(N) | Θ(N) |
… With Resizing | Θ(1)† | Θ(1)*† |
Because Q = Θ(N)
Because Q = Θ(1)
Regarding Even Distribution
Even distribution of item is critical for good hash table performance.
Will need to discuss how to ensure even distribution.
x
x
Hash Tables in Java
The Ubiquity of Hash Tables
Hash tables are the most popular implementation for sets and maps.
In Java, implemented as java.util.HashMap and java.util.HashSet.
Objects
All classes are hyponyms of Object.
Default implementation simply returns the memory address of the object.
Examples of Real Java HashCodes
We can see that Strings in Java override hashCode, doing something vaguely like what we did earlier.
System.out.println("a".hashCode());
System.out.println("bee".hashCode());
System.out.println("포옹".hashCode());
System.out.println("kamala lifefully".hashCode());
System.out.println("đậu hũ".hashCode());
jug ~/Dropbox/61b/lec/hashing
$ java JavaHashCodeExamples
97
97410
1732557
1732557
-2108180664
"a"
"bee"
"포옹"
"kamala lifefully"
"đậu hũ"
Using Negative hash codes: yellkey.com/writer
Suppose that ‘s hash code is -1.
0
1
2
3
Using Negative hash codes
Suppose that ‘s hash code is -1.
0
1
2
3
-1
Using Negative hash codes in Java
Suppose that ‘s hash code is -1.
0
1
2
3
public class ModTest {
public static void main(String[] args) {
System.out.println(-1 % 4);
System.out.println(Math.floorMod(-1, 4));
}
}
$ java ModTest
-1
3
Hash Tables in Java
Java hash tables:
포옹
a
0
1
2
3
đậu hũ
đậu hũ
hashCode()
-2108180664
Math.floorMod(x, 4)
0
data
hash code
hash function
reduce
index
kamala lifefully
bee
Hash Table Review in Java assuming “cow”.hashCode() == 6
Someone asks me (the hash table) for “cow”:
.key [포옹]
.value [9]
key: [a]
value: 5
0
1
2
3
key: [đậu hũ]
value: 3
.key [cow]
.value [12]
.key: [bee]
value: 2
Two Important Warnings When Using HashMaps/HashSets
Warning #1: Never store objects that can change in a HashSet or HashMap!
Warning #2: Never override equals without also overriding hashCode.
Good HashCodes (Extra)
What Makes a good .hashCode()?
Goal: We want hash tables that look like the table on the right.
Hashbrowns and Hash Codes
How do you make hashbrowns?
Can think of multiplying data by powers of some base as ensuring that all the data gets scrambled together into a seemingly random integer.
Example hashCode Function
The Java 8 hash code for strings. Two major differences from our hash codes:
@Override
public int hashCode() {
int h = cachedHashValue;
if (h == 0 && this.length() > 0) {
for (int i = 0; i < this.length(); i++) {
h = 31 * h + this.charAt(i);
}
cachedHashValue = h;
}
return h;
}
Example: Choosing a Base
Java’s hashCode() function for Strings:
Our asciiToInt function for Strings:
Which is better?
Example: Base 126
Major collision problem:
Any string that ends in the same last 32 characters has the same hash code.
Typical Base
A typical hash code base is a small prime.
A full treatment of good hash codes is well beyond the scope of our class.
Hashbrowns and Hash Codes
How do you make hashbrowns?
Using a prime base yields better “randomness” than using something like base 126.
Example: Hashing a Collection
Lists are a lot like strings: Collection of items each with its own hashCode:
To save time hashing: Look at only first few items.
@Override
public int hashCode() {
int hashCode = 1;
for (Object o : this) {
hashCode = hashCode * 31;
hashCode = hashCode + o.hashCode();
}
return hashCode;
}
elevate/smear the current hash code
add new item’s hash code
Example: Hashing a Recursive Data Structure
Computation of the hashCode of a recursive data structure involves recursive computation.
@Override
public int hashCode() {
if (this.value == null) {
return 0;
}
return this.value.hashCode() +
31 * this.left.hashCode() +
31 * 31 * this.right.hashCode();
}
Summary
Hash Tables in Java
Hash tables:
đậu hũ
hashCode()
-2108180664
Math.floorMod(x, 4)
0
data
hash code
hash function
reduce
index
*: Indicates “on average”.
†: Assuming items are evenly spread.
| contains(x) | add(x) |
Bushy BSTs | Θ(log N) | Θ(log N) |
Separate Chaining Hash Table With No Resizing | Θ(N) | Θ(N) |
… With Resizing | Θ(1)† | Θ(1)*† |
Collision Resolution With Linear Probing (Extra)
Open Addressing: An Alternate Disambiguation Strategy (Extra)
An alternate way to handle collisions is to use “open addressing”.
If target bucket is already occupied, use a different bucket, e.g.
In 61B, we’ll settle for separate chaining.
Citations
What is the distinction between hash set, hash map, and hash table?
A hash set is an implementation of the Set ADT using the “hash table” as its engine.
A hash map is an implementation of the Map ADT using the “hash table” as its engine.
A “hash table” is a way of storing information, where you have M buckets that store N items. Each item has a “hashCode” that tells you which of M buckets to put that item in.