3 of 96

Sets

We’ve now seen several implementations of the Set (or Map) ADT.

Set

ArraySet

BST

2-3 Tree

LLRB

Map

	contains(x)	add(x)	Notes
ArraySet	Θ(N)	Θ(N)
BST	Θ(N)	Θ(N)	Random trees are Θ(log N).
2-3 Tree	Θ(log N)	Θ(log N)	Beautiful idea. Very hard to implement.
LLRB	Θ(log N)	Θ(log N)	Maintains bijection with 2-3 tree. Hard to implement.

Worst case runtimes

datastructur.es

4 of 96

Limits of Search Tree Based Sets

Our search tree based sets require items to be comparable.

Need to be able to ask “is X < Y?” Not true of all types.
Could we somehow avoid the need for objects to be comparable?

Our search tree sets have excellent performance, but could maybe be better?

Θ(log N) is amazing. 1 billion items is still only height ~30.
Could we somehow do better than Θ(log N)?

Today we’ll see the answer to both of the questions above is yes.

datastructur.es

5 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing nothing

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

datastructur.es

6 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing 0

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

datastructur.es

7 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing 0, 5

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

diis.add(5);

datastructur.es

8 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing 0, 5, 10

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

diis.add(5);

diis.add(10);

datastructur.es

9 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing 0, 5, 10, 11

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

diis.add(5);

diis.add(10);

diis.add(11);

datastructur.es

10 of 96

DataIndexedIntegerSet Implementation

Set containing 0, 5, 10, 11

...

public class DataIndexedIntegerSet {

private boolean[] present;

public DataIndexedIntegerSet() {

present = new boolean[2000000000];

}

public void add(int i) {

present[i] = true;

}

public boolean contains(int i) {

return present[i];

}

datastructur.es

11 of 96

DataIndexedIntegerSet Implementation

public class DataIndexedIntegerSet {

private boolean[] present;

public DataIndexedIntegerSet() {

present = new boolean[2000000000];

}

public void add(int i) {

present[i] = true;

}

public boolean contains(int i) {

return present[i];

}

	contains(x)	add(x)
ArraySet	Θ(N)	Θ(N)
BST	Θ(N)	Θ(N)
2-3 Tree	Θ(log N)	Θ(log N)
LLRB	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)

datastructur.es

12 of 96

Using Data as an Index

One extreme approach: Create an array of booleans indexed by data!

Initially all values are false.
When an item is added, set appropriate index to true.

Set containing 0, 5, 10, 11

...

DataIndexedIntegerSet diis;

diis = new DataIndexedIntegerSet();

diis.add(0);

diis.add(5);

diis.add(10);

diis.add(11);

Downsides of this approach (that we will try to address):

Extremely wasteful of memory. To support checking presence of all positive integers, we need > 2 billion booleans.
Need some way to generalize beyond integers.

datastructur.es

13 of 96

DataIndexedEnglishWordSet

datastructur.es

14 of 96

Generalizing the DataIndexedIntegerSet Idea

Suppose we want to add(“cat”)

The key question:

What is the catth element of a list?
One idea: Use the first letter of the word as an index.

a = 1, b = 2, c = 3, …, z = 26

What’s wrong with this approach?

Other words start with c.

contains(“chupacabra”) : true

Can’t store “=98yae98fwyawef”

...

“chupacabra” collides with “cat”

datastructur.es

15 of 96

Avoiding Collisions

Use all digits by multiplying each by a power of 27.

a = 1, b = 2, c = 3, …, z = 26
Thus the index of “cat” is (3 x 27²) + (1 x 27¹) + (20 x 27⁰) = 2234.

Why this specific pattern?

Let’s review how numbers are represented in decimal.

cas

cat

cau

2233

2234

2235

...

datastructur.es

16 of 96

Generalizing the DataIndexedIntegerSet Idea

Ideally, we want a data indexed set that can store arbitrary types.

But previous idea only supports integers!

Let’s talk about storing Strings.
Will get to generic types later.

DataIndexedSet<String> dis =

new DataIndexedSet<>();

dis.add("cat");

dis.add("bee");

dis.add("dog");

cat

dog

...

Where do cat, bee, and dog go???

bee

???

datastructur.es

17 of 96

The Decimal Number System vs. Our System for Strings

In the decimal number system, we have 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Want numbers larger than 9? Use a sequence of digits.

Example: 7091 in base 10

7091₁₀= (7 x 10³) + (0 x 10²) + (9 x 10¹) + (1 x 10⁰)

Our system for strings is almost the same, but with letters.

datastructur.es

18 of 96

Test Your Understanding

Convert the word “bee” into a number by using our “powers of 27” strategy.

Reminder: cat₂₇= (3 x 27²) + (1 x 27¹) + (20 x 27⁰) = 2234₁₀

Hint: ‘b’ is letter 2, and ‘e’ is letter 5.

datastructur.es

19 of 96

Test Your Understanding

Convert the word “bee” into a number by using our “powers of 27” strategy.

Reminder: cat₂₇= (3 x 27²) + (1 x 27¹) + (20 x 27⁰) = 2234₁₀

Hint: ‘b’ is letter 2, and ‘e’ is letter 5.

bee₂₇= (2 x 27²) + (5 x 27¹) + (5 x 27⁰) = 1598₁₀

datastructur.es

20 of 96

Uniqueness

cat₂₇= (3 x 27²) + (1 x 27¹) + (20 x 27⁰) = 2234₁₀
bee₂₇= (2 x 27²) + (5 x 27¹) + (5 x 27⁰) = 1598₁₀

As long as we pick a base ≥ 26, this algorithm is guaranteed to give each lowercase English word a unique number!

Using base 27, no other words will get the number 1598.

In other words: Guaranteed that we will never have a collision.

Can write an englishToInt function (see two hidden slides that follow this one).

datastructur.es

21 of 96

Implementing englishToInt (optional)

Optional exercise: Try to write a function englishToInt that can convert English strings to integers by adding characters scaled by powers of 27.

Examples:

a: 1
z: 26
aa: 28
bee: 1598
cat: 2234
dog: 3328
potato: 237,949,071

datastructur.es

22 of 96

Implementing englishToInt (optional) (solution)

/** Converts ith character of String to a letter number.� * e.g. 'a' -> 1, 'b' -> 2, 'z' -> 26 */

public static int letterNum(String s, int i) {

int ithChar = s.charAt(i);

if ((ithChar < 'a') || (ithChar > 'z'))

{ throw new IllegalArgumentException(); }

return ithChar - 'a' + 1;

}

public static int englishToInt(String s) {

int intRep = 0;

for (int i = 0; i < s.length(); i += 1) {

intRep = intRep * 27;

intRep = intRep + letterNum(s, i);

}

return intRep;

}

datastructur.es

23 of 96

DataIndexedEnglishWordSet Implementation

public class DataIndexedEnglishWordSet {

private boolean[] present;

public DataIndexedEnglishWordSet() {

present = new boolean[2000000000];

}

public void add(String s) {

present[englishToInt(s)] = true;

}

public boolean contains(int i) {

return present[englishToInt(s)];

}

cas

cat

cau

2233

2234

2235

...

Set containing “cat”

datastructur.es

24 of 96

DataIndexedStringSet

datastructur.es

25 of 96

DataIndexedStringSet

Using only lowercase English characters is too restrictive.

What if we want to store strings like “2pac” or “eGg!”?
To understand what value we need to use for our base, let’s discuss briefly discuss the ASCII standard.

datastructur.es

26 of 96

ASCII Characters

The most basic character set used by most computers is ASCII format.

Each possible character is assigned a value between 0 and 127.
Characters 33 - 126 are “printable”, and are shown below.
For example, char c = ’D’ is equivalent to char c = 68.

biggest value is 126

datastructur.es

27 of 96

DataIndexedStringSet

Using only lowercase English characters is too restrictive.

What if we want to store strings like “2pac” or “eGg!”?
Maximum possible value for english-only text including punctuation is 126, so let’s use 126 as our base in order to ensure unique values for all possible strings.

Examples:

bee₁₂₆= (98 x 126²) + (101 x 126¹) + (101 x 126⁰) = 1,568,675
2pac₁₂₆= (50 x 126³) + (112 x 126²) + (97 x 126¹) + (99 x 126⁰) = 101,809,233
eGg!₁₂₆= (98 x 126³) + (71 x 126²) + (98 x 126¹) + (33 x 126⁰) = 203,178,213

datastructur.es

28 of 96

Implementing asciiToInt

The corresponding integer conversion function is actually even simpler than englishToInt from the hidden slides. Using the raw character value means we avoid the need for a helper method.

What if we want to use characters beyond ASCII?

public static int asciiToInt(String s) {

int intRep = 0;

for (int i = 0; i < s.length(); i += 1) {

intRep = intRep * 126;

intRep = intRep + s.charAt(i);

}

return intRep;

}

datastructur.es

29 of 96

Going Beyond ASCII

chars in Java also support character sets for other languages and symbols.

char c = ’☂’ is equivalent to char c = 9730.
char c = ’鳌’ is equivalent to char c = 40140.
char c = ’혜’ is equivalent to char c = 54812.
This encoding is known as Unicode. Table is too big to list.

datastructur.es

30 of 96

Example: Computing Unique Representations of Chinese

The largest possible value for Chinese characters is 40,959*, so we’d need to use this as our base if we want to have a unique representation for all possible strings of Chinese characters.

Example:

守门员₄₀₉₅₉ = (23432 x 40959²) + (38376 x 40959¹) + (21592 x 40959⁰) = 39,312,024,869,368

守门呗

守门员

守门呙

39,3120,2486,9367

39,3120,2486,9368

39,3120,2486,9369

...

*If you’re curious, the last character is:

datastructur.es

31 of 96

Integer Overflow and Hash Codes

datastructur.es

32 of 96

Major Problem: Integer Overflow

In Java, the largest possible integer is 2,147,483,647.

If you go over this limit, you overflow, starting back over at the smallest integer, which is -2,147,483,648.
In other words, the next number after 2,147,483,647 is -2,147,483,648.

int x = 2147483647;

System.out.println(x);

System.out.println(x + 1);

jug ~/Dropbox/61b/lec/hashing

$ javac BiggestPlusOne.java

$ java BiggestPlusOne

2147483647

-2147483648

datastructur.es

33 of 96

Consequence of Overflow: Collisions

Because Java has a maximum integer, we won’t get the numbers we expect!

With base 126, we will run into overflow even for short strings.

Example: omens₁₂₆= 28,196,917,171, which is much greater than the maximum integer!
asciiToInt(’omens’) will give us -1,867,853,901 instead.

datastructur.es

34 of 96

Consequence of Overflow: Collisions

Because Java has a maximum integer, we won’t get the numbers we expect!

With base 126, we will run into overflow even for short strings.

Example: omens₁₂₆= 28,196,917,171, which is much greater than the maximum integer!
asciiToInt(’omens’) will give us -1,867,853,901 instead.

Overflow can result in collisions, causing incorrect answers.

public void moo() {

DataIndexedStringSet disi = new DataIndexedStringSet();

disi.add("melt banana");

disi.contains("subterrestrial anticosmetic");

//asciiToInt for these strings is 839099497

}

returns true!

datastructur.es

35 of 96

Hash Codes and the Pigeonhole Principle

The official term for the number we’re computing is “hash code”.

Via Wolfram Alpha: a hash code “projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members.”
Here, our target set is the set of Java integers, which is of size 4294967296.

datastructur.es

36 of 96

Hash Codes and the Pigeonhole Principle

The official term for the number we’re computing is “hash code”.

Via Wolfram Alpha: a hash code “projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members.”
Here, our target set is the set of Java integers, which is of size 4294967296.

Pigeonhole principle tells us that if there are more than 4294967296 possible items, multiple items will share the same hash code.

There are more than 4294967296 planets.

Each has mass, xPos, yPos, xVel, yVel, imgName.

There are more than 4294967296 strings.

“one”, “two”, … “nineteen quadrillion”, ...

Bottom line: Collisions are inevitable.

datastructur.es

37 of 96

Two Fundamental Challenges

Two fundamental challenges:

How do we resolve hashCode collisions (“melt banana” vs. “subterrestrial anticosmetic”)?

We’ll call this collision handling.

How do we compute a hash code for arbitrary objects?

We’ll call this computing a hashCode.
Example: Our hashCode for “melt banana” was 839099497.
For Strings, this was relatively straightforward (treat as a base 27 or base 126 or base 40959 number).

datastructur.es

38 of 96

Hash Tables:

Handling Collisions

datastructur.es

39 of 96

Resolving Ambiguity

Pigeonhole principle tells us that collisions are inevitable due to integer overflow.

Example: hash code for “moo” and “Ñep” might both be 718.

717

718

719

...

717

718

719

...

moo

Ñep

Suppose N items have the same numerical representation h:

Instead of storing true in position h, store a “bucket” of these N items at position h.

How to implement a “bucket”?

Conceptually simplest way: LinkedList.
Could also use ArrayLists.
Could also use an ArraySet.
Will see it doesn’t really matter what you do.

datastructur.es

40 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

111239443

111239444

111239445

...

Initially all buckets are empty.

datastructur.es

41 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

add(“a”)

111239443

111239444

111239445

...

Bucket 1 now has a length 1 list.

datastructur.es

42 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

add(“a”)

add(“abomamora”)

111239443

111239444

111239445

...

abomamora

datastructur.es

43 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

add(“a”)

add(“abomamora”)

add(“adevilish”)

111239443

111239444

111239445

...

abomamora

adevilish

Both have hash code 111239444 using englishToInt

datastructur.es

44 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

add(“a”)

add(“abomamora”)

add(“adevilish”)

add(“abomamora”)

111239443

111239444

111239445

...

abomamora

adevilish

Both have hash code 111239444 using englishToInt

datastructur.es

45 of 96

The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item x gets added at index h:

If bucket h is empty, we create a new list containing x and store it at index h.
If bucket h is already a list, we add x to this list if it is not already present.

�We might call this a “separate chaining data indexed array”.

Bucket #h is a “separate chain” of all items that have hash code h.

add(“a”)

add(“abomamora”)

add(“adevilish”)

add(“abomamora”)

contains(“adevilish”)

Look at all items in bucket 111239443 to see if “adevilish” is present.

111239443

111239444

111239445

...

abomamora

adevilish

datastructur.es

46 of 96

Separate Chaining Performance

Observation: Worst case runtime will be proportional to length of longest list.

Worst case time	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Data Indexed Array	Θ(Q)	Θ(Q)

Q: Length of longest list

Why Q and not 1?

1598

2234

3328

111239443

111239442

cat

dog

bee

...

abomamora

adevilish

datastructur.es

47 of 96

Saving Memory Using Separate Chaining

Observation: We don’t really need billions of buckets.

Q: If we use the 10 buckets on the right, where should our five items go?

1598

2234

3328

111239443

111239442

cat

dog

bee

...

abomamora

adevilish

datastructur.es

48 of 96

Saving Memory Using Separate Chaining and Modulus

Observation: Can use modulus of hashcode to reduce bucket count.

Downside: Lists will be longer.

1598

2234

3328

111239443

111239442

cat

dog

bee

...

abomamora

adevilish

cat

bee

dog

abomamora

adevilish

Q: If we use the 10 buckets on the right, where should our five items go?

Put in bucket = hashCode % 10

datastructur.es

49 of 96

The Hash Table

What we’ve just created here is called a hash table.

Data is converted by a hash function into an integer representation called a hash code.
The hash code is then reduced to a valid index, usually using the modulus operator, e.g. 2348762878 % 10 = 8.

bee

hug

抱抱

están

抱抱

unicodeToInt

1034854400

% 10

data

hash code

hash function

reduce

index

hash table

dog

الطبيعة

शानदार

포옹

In Java there’s a caveat here. Will revisit later.

datastructur.es

50 of 96

Hash Table Performance

datastructur.es

51 of 96

Hash Table Runtime

The good news: We use way less memory and can now handle arbitrary data.

The bad news: Worst case runtime is now Θ(Q), where Q is the length of the longest list.

Q: Length of longest list

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table	Θ(Q)	Θ(Q)

Worst case runtimes

datastructur.es

52 of 96

Hash Table Runtime: yellkey.com/attorney

Q: Length of longest list

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table	Θ(Q)	Θ(Q)

Worst case runtimes

For the hash table above with 5 buckets, give the order of growth of Q with respect to N.

Q is Θ(1)
Q is Θ(log N)
Q is Θ(N)
Q is Θ(N log N)
Q is Θ(N²)

datastructur.es

53 of 96

Hash Table Runtime

Q: Length of longest list

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table	Θ(Q)	Θ(Q)

Worst case runtimes

For the hash table above with 5 buckets, give the order of growth of Q with respect to N.

C. Q is Θ(N)

All items evenly distributed.

All items in same list.

In the best case, the length of the longest list will be N/5. In the worst case, it will be N. In both cases, Q(N) is Θ(N).

datastructur.es

54 of 96

Improving the Hash Table

Q: Length of longest list

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table	Θ(Q)	Θ(Q)

Worst case runtimes

Suppose we have:

A fixed number of buckets M.
An increasing number of items N.

Major problem: Even if items are spread out evenly, lists are of length Q = N/M.

For M = 5, that means Q = Θ(N). Results in linear time operations.
Hard question: How can we improve our design to guarantee that N/M is Θ(1)?

datastructur.es

55 of 96

Hash Table Runtime

Suppose we have:

An increasing number of buckets M.
An increasing number of items N.

Q: Length of longest list

Worst case runtimes

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table	Θ(Q)	Θ(Q)

Major problem: Even if items are spread out evenly, lists are of length Q = N/M.

For M = 5, that means Q = Θ(N). Results in linear time operations.
Hard question: How can we improve our design to guarantee that N/M is Θ(1)?

datastructur.es

56 of 96

Hash Table Runtime

Suppose we have:

An increasing number of buckets M.
An increasing number of items N.�

As long as M = Θ(N), then O(N/M) = O(1).

One example strategy: When N/M is ≥ 1.5, then double M.

We often call this process of increasing M “resizing”.
N/M is often called the “load factor”. It represents how full the hash table is.
This rule ensures that the average list is never more than 1.5 items long!

datastructur.es

57 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 0 M = 4 N / M = 0

datastructur.es

58 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 1 M = 4 N / M = 0.25

datastructur.es

59 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 2 M = 4 N / M = 0.5

datastructur.es

60 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 3 M = 4 N / M = 0.75

datastructur.es

61 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 4 M = 4 N / M = 1

datastructur.es

62 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 5 M = 4 N / M = 1.25

datastructur.es

63 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

N = 6 M = 4 N / M = 1.5

N/M is too large. Time to double!

datastructur.es

64 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

Yellkey question: Where will the bucket go?
Or: Draw the results after doubling M.

N = 6 M = 4 N / M = 1.5

N/M is too large. Time to double!

datastructur.es

65 of 96

Hash Table Resizing Example

When N/M is ≥ 1.5, then double M.

Draw the results after doubling M.

N = 6 M = 4 N / M = 1.5

N/M is too large. Time to double!

N = 6 M = 8 N / M = 0.75

datastructur.es

66 of 96

Resizing Hash Table Runtime

Suppose we have:

An increasing number of buckets M.
An increasing number of items N.�

As long as M = Θ(N), then O(N/M) = O(1).

Assuming items are evenly distributed (as above), lists will be approximately N/M items long, resulting in Θ(N/M) runtimes.

Our doubling strategy ensures that N/M = O(1).
Thus, worst case runtime for all operations is Θ(N/M) = Θ(1).

… unless that operation causes a resize.

N = 19 M = 5 N / M = 3.8

datastructur.es

67 of 96

Resizing Hash Table Runtime

Suppose we have:

An increasing number of buckets M.
An increasing number of items N.�

As long as M = Θ(N), then O(N/M) = O(1).

One important thing to consider is the cost of the resize operation.

Resizing takes Θ(N) time. Have to redistribute all items!
Most add operations will be Θ(1). Some will be Θ(N) time (to resize).

Similar to our ALists, as long as we resize by a multiplicative factor, the average runtime will still be Θ(1).
Note: We will eventually analyze this in more detail.

N = 19 M = 5 N / M = 3.8

datastructur.es

68 of 96

Hash Table Runtime

Hash table operations are on average constant time if:

We double M to ensure constant average bucket length.
Items are evenly distributed.

Worst case runtimes

*: Indicates “on average”.

†: Assuming items are evenly spread.

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table With No Resizing	Θ(N)	Θ(N)
… With Resizing	Θ(1)^†	Θ(1)*^†

Because Q = Θ(N)

Because Q = Θ(1)

datastructur.es

69 of 96

Regarding Even Distribution

Even distribution of item is critical for good hash table performance.

Both tables below have load factor of N/M = 1.
Left table is much worse!

Contains is Θ(N) for x.

Will need to discuss how to ensure even distribution.

First, let’s talk a little bit about how hash tables work in Java.

datastructur.es

70 of 96

Hash Tables in Java

datastructur.es

71 of 96

The Ubiquity of Hash Tables

Hash tables are the most popular implementation for sets and maps.

Great performance in practice.
Don’t require items to be comparable.
Implementations often relatively simple.
Python dictionaries are just hash tables in disguise.

In Java, implemented as java.util.HashMap and java.util.HashSet.

How does a HashMap know how to compute each object’s hash code?

Good news: It’s not “implements Hashable”.
Instead, all objects in Java must implement a .hashCode() method.

datastructur.es

72 of 96

Objects

All classes are hyponyms of Object.

String toString()
boolean equals(Object obj)
Class<?> getClass()
int hashCode()
protected Object clone()
protected void finalize()
void notify()
void notifyAll()
void wait()
void wait(long timeout)
void wait(long timeout, int nanos)

Default implementation simply returns the memory address of the object.

datastructur.es

73 of 96

Examples of Real Java HashCodes

We can see that Strings in Java override hashCode, doing something vaguely like what we did earlier.

Will see the actual hashCode() function later!

System.out.println("a".hashCode());

System.out.println("bee".hashCode());

System.out.println("포옹".hashCode());

System.out.println("kamala lifefully".hashCode());

System.out.println("đậu hũ".hashCode());

jug ~/Dropbox/61b/lec/hashing

$ java JavaHashCodeExamples

97410

1732557

-2108180664

"a"

"bee"

"포옹"

"kamala lifefully"

"đậu hũ"

datastructur.es

74 of 96

Using Negative hash codes: yellkey.com/writer

Suppose that ‘s hash code is -1.

Philosophically, into which bucket is it most natural to place this item?

datastructur.es

75 of 96

Using Negative hash codes

Suppose that ‘s hash code is -1.

Philosophically, into which bucket is it most natural to place this item?

I say 3, since -1 → 3, 0 → 0, 1 → 1, 2 → 2, 3 → 3, 4 → 0, ...

-1

datastructur.es

76 of 96

Using Negative hash codes in Java

Suppose that ‘s hash code is -1.

Unfortunately, -1 % 4 = -1. Will result in index errors!
Use Math.floorMod instead.

public class ModTest {

public static void main(String[] args) {

System.out.println(-1 % 4);

System.out.println(Math.floorMod(-1, 4));

}

$ java ModTest

-1

datastructur.es

77 of 96

Hash Tables in Java

Java hash tables:

Data is converted by the hashCode method an integer representation called a hash code.
The hash code is then reduced to a valid index, using something like the floorMod function, e.g. Math.floorMod(1732557 % 4) = 8.

포옹

đậu hũ

hashCode()

-2108180664

Math.floorMod(x, 4)

data

hash code

hash function

reduce

index

kamala lifefully

bee

datastructur.es

78 of 96

Hash Table Review in Java assuming “cow”.hashCode() == 6

Someone asks me (the hash table) for “cow”:

(Hashcode of cow) % array size: get back 2
!n.key.equals(“cow”) so go to next.
!n.key.equals(“cow”) so go to next.
n.key.equals(“cow”), so: return n.value

.key [포옹]

.value [9]

key: [a]

value: 5

key: [đậu hũ]

value: 3

.key [cow]

.value [12]

.key: [bee]

value: 2

datastructur.es

79 of 96

Two Important Warnings When Using HashMaps/HashSets

Warning #1: Never store objects that can change in a HashSet or HashMap!

Such objects are also called “mutable” objects, e.g. they can change.

Example: You’d never want to make a HashSet<List<Integer>>.

If an object’s variables changes, then its hashCode changes. May result in items getting lost.

Warning #2: Never override equals without also overriding hashCode.

Can also lead to items getting lost and generally weird behavior.
HashMaps and HashSets use equals to determine if an item exists in a particular bucket.
See study guide problems.

datastructur.es

80 of 96

Good HashCodes (Extra)

datastructur.es

81 of 96

What Makes a good .hashCode()?

Goal: We want hash tables that look like the table on the right.

Want a hashCode that spreads things out nicely on real data.

Example #1: return 0 is a bad hashCode function.
Example #2: just returning the first character of a word, e.g. “cat” → 3 was also a bad hash function.
Example #3: Adding chars together is bad. “ab” collides with “ba”.
Example #4: returning string treated as a base B number can be good.

Writing a good hashCode() method can be tricky.

datastructur.es

82 of 96

Hashbrowns and Hash Codes

How do you make hashbrowns?

Chopping a potato into nice predictable segments? No way!
Similarly, adding up the characters is not nearly “random” enough.

Can think of multiplying data by powers of some base as ensuring that all the data gets scrambled together into a seemingly random integer.

datastructur.es

83 of 96

Example hashCode Function

The Java 8 hash code for strings. Two major differences from our hash codes:

Represents strings as a base 31 number.

Why such a small base? Real hash codes don’t care about uniqueness.

Stores (caches) calculated hash code so future hashCode calls are faster.

@Override

public int hashCode() {

int h = cachedHashValue;

if (h == 0 && this.length() > 0) {

for (int i = 0; i < this.length(); i++) {

h = 31 * h + this.charAt(i);

}

cachedHashValue = h;

}

return h;

}

datastructur.es

84 of 96

Example: Choosing a Base

Java’s hashCode() function for Strings:

h(s) = s₀× 31^n-1 + s₁ × 31^n-2 + … + s_n-1

Our asciiToInt function for Strings:

h(s) = s₀× 126^n-1 + s₁ × 126^n-2 + … + s_n-1

Which is better?

Might seem like 126 is better. Ignoring overflow, this ensures a unique numerical representation for all ASCII strings.
… but overflow is a particularly bad problem for base 126!

datastructur.es

85 of 96

Example: Base 126

Major collision problem:

“geocronite is the best thing on the earth.”.hashCode() yields 634199182.
“flan is the best thing on the earth.”.hashCode() yields 634199182.
“treachery is the best thing on the earth.”.hashCode() yields 634199182.
“Brazil is the best thing on the earth.”.hashCode() yields 634199182.

Any string that ends in the same last 32 characters has the same hash code.

Why? Because of overflow.
Basic issue is that 126^32 = 126^33 = 126^34 = ... 0.

Thus upper characters are all multiplied by zero.
See CS61C for more.

datastructur.es

86 of 96

Typical Base

A typical hash code base is a small prime.

Why prime?

Never even: Avoids the overflow issue on previous slide.
Lower chance of resulting hashCode having a bad relationship with the number of buckets: See study guide problems and hw3.

Why small?

Lower cost to compute.

A full treatment of good hash codes is well beyond the scope of our class.

datastructur.es

87 of 96

Hashbrowns and Hash Codes

How do you make hashbrowns?

Chopping a potato into nice predictable segments? No way!

Using a prime base yields better “randomness” than using something like base 126.

datastructur.es

88 of 96

Example: Hashing a Collection

Lists are a lot like strings: Collection of items each with its own hashCode:

To save time hashing: Look at only first few items.

Higher chance of collisions but things will still work.

@Override

public int hashCode() {

int hashCode = 1;

for (Object o : this) {

hashCode = hashCode * 31;

hashCode = hashCode + o.hashCode();

}

return hashCode;

}

elevate/smear the current hash code

add new item’s hash code

datastructur.es

89 of 96

Example: Hashing a Recursive Data Structure

Computation of the hashCode of a recursive data structure involves recursive computation.

For example, binary tree hashCode (assuming sentinel leaves):

@Override

public int hashCode() {

if (this.value == null) {

return 0;

}

return this.value.hashCode() +

31 * this.left.hashCode() +

31 * 31 * this.right.hashCode();

}

datastructur.es

90 of 96

Summary

datastructur.es

91 of 96

Hash Tables in Java

Hash tables:

Data is converted into a hash code.
The hash code is then reduced to a valid index.
Data is then stored in a bucket corresponding to that index.
Resize when load factor N/M exceeds some constant.
If items are spread out nicely, you get Θ(1) average runtime.

đậu hũ

hashCode()

-2108180664

Math.floorMod(x, 4)

data

hash code

hash function

reduce

index

*: Indicates “on average”.

†: Assuming items are evenly spread.

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
Separate Chaining Hash Table With No Resizing	Θ(N)	Θ(N)
… With Resizing	Θ(1)^†	Θ(1)*^†

datastructur.es

92 of 96

Collision Resolution With Linear Probing (Extra)

datastructur.es

93 of 96

Open Addressing: An Alternate Disambiguation Strategy (Extra)

An alternate way to handle collisions is to use “open addressing”.

If target bucket is already occupied, use a different bucket, e.g.

Linear probing: Use next address, and if already occupied, just keep scanning one by one.

Demo: http://goo.gl/o5EDvb

Quadratic probing: Use next address, and if already occupied, try looking 4 ahead, then 9 ahead, then 16 ahead, …
Many more possibilities. See the optional reading for today (or CS170) for a more detailed look.

In 61B, we’ll settle for separate chaining.

datastructur.es

94 of 96

Citations

http://www.nydailynews.com/news/national/couple-calls-911-forgotten-mcdonalds-hash-browns-article-1.1543096

http://en.wikipedia.org/wiki/Pigeonhole_principle#mediaviewer/File:TooManyPigeons.jpg

https://cookingplanit.com/public/uploads/inventory/hashbrown_1366322674.jpg

datastructur.es

95 of 96

datastructur.es

96 of 96

What is the distinction between hash set, hash map, and hash table?

A hash set is an implementation of the Set ADT using the “hash table” as its engine.

A hash map is an implementation of the Map ADT using the “hash table” as its engine.

A “hash table” is a way of storing information, where you have M buckets that store N items. Each item has a “hashCode” that tells you which of M buckets to put that item in.

datastructur.es