3 of 50

Practice

Insert the following key value pairs into the HashTable below. Use the following hash function:
public int hashCode(int key) {
return key % arr.length;
}

(2, b) (3,c) (10,j) (9,i) (1,a) (12,l)

0	1	2	3	4	5	6	7	8	9

(10, j)

(2, b)

(3, c)

(1, a)

(12, l)

(9, i)

4 of 50

Lecture Outline

ArrayMap

DirectAccessMap

SimpleHashMap

SeparateChaining

HashMap

FASTER: Jump directly to element, only int keys

MORE FLEXIBLE: Hash function supports any type of key

YOUR BEST FRIEND: Addresses limitations with hash collisions, but still fast!

Review

MAP ADT

As seen on

Project 2

As seen on

Project 2

5 of 50

Hash functions: translating a piece of data to an int

In our case: we want to translate int keys to a valid index in our array. If our array is length 10 but our input key is 500, we need to make sure we have a way of mapping that to a number between 0 and 9 (the valid indices for a length 10 array). This mapping that we decide on is a hash function.

One simple thing we can do (and that you will do when you implement this in your project):
Hash function: take your key and % it by the length of the array.
ex: key is 500, and array is length 10 – if you take 500 % 10, you will get the number 0, so we’d just plop 500 and it’s value at index 0.

CSE 373 20 SP – CHAMPION & CHUN

Hash function definition

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

6 of 50

“review”: Integer remainder with % “mod”

The % operator computes the remainder from integer division.

14 % 4 is 2� � 3 43� 4 ) 14 5 ) 218� 12 20� 2 18� 15� 3

Applications of % operator:

Obtain last digit of a number: 230857 % 10 is 7
See whether a number is odd: 7 % 2 is 1, 42 % 2 is 0
Limit integers to specific range: 8 % 12 is 8, 18 % 12 is 6

CSE 142 SP 18 – BRETT WORTZMAN

218 % 5 is 3

For more review/practice, check out https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/what-is-modular-arithmetic

Limit keys to indices within array

Equivalently, to find a % b (for a,b > 0):

while(a > b-1)

a -= b;

return a;

7 of 50

First Hash Function: % table size

CSE 373 SP 22 - CHAMPION

indices	0	1	2	3	4	5	6	7	8	9
elements

put(0, “foo”);

put(5, “bar”);

put(11, “biz”)

put(18, “bop”);

“foo”

0 % 10 = 0

5 % 10 = 5

11 % 10 = 1

18 % 10 = 8

“bop”

“bar”

“biz”

8 of 50

Implement First Hash Function

public void put(int key, int value) {
data[hashToValidIndex(key)] = value;
}

public V get(int key) {
return data[hashToValidIndex(key)];
}

public int hashToValidIndex(int k) {
return k % this.data.length;
}

CSE 373 SU 19 - ROBBIE WEBER

SimpleHashMap<Integer>

put mod key by table size, put item at result

get mod key by table size, get item at result

containsKey mod key by table size, return data[result] == null remove mod key by table size, nullify element at result

size return count of items in dictionary

state

behavior

Data[]

size

Operation		Array w/ indices as keys
put(key,value)	best	𝚹(1)
put(key,value)	worst	𝚹(1)
get(key)	best	𝚹(1)
get(key)	worst	𝚹(1)
containsKey(key)	best	𝚹(1)
containsKey(key)	worst	𝚹(1)

Note: % is just a math operator like +, -, /, *, so it’s constant runtime

9 of 50

Questions?

things we talked about:

review of ArrayMap + LinkedMap
DirectAccessMap
% as a hash function andSimpleHashMap

10 of 50

First Hash Function: % table size

CSE 373 SP 22 - CHAMPION

indices	0	1	2	3	4	5	6	7	8	9
elements

put(0, “foo”);

put(5, “bar”);

put(11, “biz”)

put(18, “bop”);

Collision!

“foo”

0 % 10 = 0

5 % 10 = 5

11 % 10 = 1

18 % 10 = 8

20 % 10 = 0

“bop”

“bar”

“biz”

“:(”

put(20, “:(”);

11 of 50

Hash Obsession: Collisions

Collision: multiple keys translate to the same location of the array

Future big idea: the fewer the collisions, the better the runtime! (we’ll see this when we figure out that resolving these leads to worse runtime)

Two questions:
1. When we have a collision, how do we resolve it?
2. How do we minimize the number of collisions?

CSE 373 SU 19 - ROBBIE WEBER

12 of 50

Strategies to handle hash collision

There are multiple strategies. In this class, we’ll cover the following ones:

1. Separate chaining
2. Open addressing

Linear probing
Quadratic probing
Double hashing

CSE 373 AU 18 – SHRI MARE

13 of 50

Separate chaining

Solution 1: Separate Chaining
Each index in our array represents a “bucket”. When an item x hashes to index h:

If the bucket at index h is empty: create a new list containing x
If the bucket at index h is already a list: add x if it is not already present

in other words:
If multiple things hash to the same index, then we’ll just put all of those in that same index bucket. Often, you’ll see the data structure chosen is a linked-list like structure.

CSE 373 ROBBIE WEBER + HANNAH TANG

14 of 50

Separate chaining

// some pseudocode

public boolean containsKey(int key) {

int bucketIndex = key % data.length;

loop through data[bucketIndex]

return true if we find the key in

data[bucketIndex]

return false if we get to here (didn’t

find it)

}

CSE 373 ROBBIE WEBER + HANNAH TANG

Reminder: the implementations of put/get/containsKey are all very similar, and almost always will have the same complexity class runtime

runtime analysis

Are there different possible states for our Hash Map that make this code run slower/faster, assuming there are already n key-value pairs being stored?

Yes! If we had to do a lot of loop iterations to find the key in the bucket, our code will run slower.

15 of 50

A best case situation for separate chaining

CSE 373 20 SP – CHAMPION & CHUN

0	1	2	3	4	5	6	7	8	9

(0, b)

(2, b)

(3, b)

(4, b)

(5, b)

(6, b)

(7, b)

(8, b)

It’s possible (and likely if you follow some best-practices) that everything is spread out across the buckets pretty evenly. This is the opposite of the last slide: when we have minimal collisions, our runtime should be less. For example, if we have a bucket with only 0 or 1 element in it, checking containsKey for something in that bucket will only take a constant amount of time.

We’re going to try a lot of stuff we can to make it more likely we achieve this beautiful state ☺.

16 of 50

In-practice situations for separate chaining

CSE 373 20 SP – CHAMPION & CHUN

Operation		Array w/ indices as keys
put(key,value)	best	𝚹(1)
	In-practice	𝚹(1)
	worst	𝚹(n)
get(key)	best	𝚹(1)
	In-practice	𝚹(1)
	worst	𝚹(1)
remove(key)	best	𝚹(1)
	In-practice	𝚹(1)
	worst	O(n)

Reminder: the in-practice runtimes are assuming an even distribution of the keys inside the array and following of best-practices to ensure the average chain length is low.

17 of 50

Best practices (pay attention to this for the hw)

what about resizing?

for data structures like ArrayMap or ArrayList or ArrayStack we had to resize when we’re full just because we couldn’t store any more things! But our Separate Chaining Hash Map is a little bit different: we aren’t ever forced to resize our main array, since the buckets are flexible size.

CSE 373 20 SP – CHAMPION & CHUN

It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.

Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?

Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.

What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.

18 of 50

Best practices (pay attention to this for the hw)

CSE 373 20 SP – CHAMPION & CHUN

It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.

Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?

Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.

What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.

The pattern we’re getting to is that the expected runtime is approximately: # of pairs / array.length (AKA n / c where n is the number of elements and c is the number of possible chains). If array.length is fixed for your whole program, then this is an order-n runtime, but if the array.length also increases (because you re-size) and you redistribute out the values evenly across the buckets, you can keep your runtime low. In particular, if you resize when when your n / c ratio increases to about 1, you’re expected to have 1 element or fewer in each bucket at all times. (do this on your homework).

Tip: make sure you re-hash (re-distribute) your keys by the new array length after re-sizing so they don’t get clustered in the old array length range.

19 of 50

Lambda + resizing rephrased

To be more precise, the in-practice runtime depends on λ, the current average chain length.

However, if you resize once you hit that 1:1 threshold, the current λ is expected to be less than 1 (which is a constant / constant runtime, so we can simplify to O(1)).

CSE 373 SU 19 - ROBBIE WEBER

Operation		Array w/ indices as keys
put(key,value)	best	O(1)
	In-practice	O(λ)
	worst	O(n)
get(key)	best	O(1)
	In-practice	O(λ)
	worst	O(n)
remove(key)	best	O(1)
	In-practice	O(λ)
	worst	O(n)

20 of 50

What about non integer keys?

CSE 373 SU 19 - ROBBIE WEBER

Hash function definition

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

21 of 50

(Before we % by length, we have to convert the data into an int)

Implementation 1: Simple aspect of values
public int hashCode(String input) {
return input.length();
}
Implementation 2: More aspects of value
public int hashCode(String input) {
int output = 0;
for(char c : input) {
out += (int)c;
}
return output;
}
Implementation 3: Multiple aspects of value + math!
public int hashCode(String input) {
int output = 1;
for (char c : input) {
int nextPrime = getNextPrime();
out *= Math.pow(nextPrime, (int)c);

}

return Math.pow(nextPrime, input.length());
}

CSE 373 SU 19 - ROBBIE WEBER

Pro: super fast

Con: lots of collisions!

Pro: still really fast

Con: some collisions

Pro: few collisions

Con: slow, gigantic integers

22 of 50

Java’s hashCode (relevant for project)

Luckily, most of these design decisions have been made for us by smart people. All objects in java come with a `hashCode()` method that does some magic (see previous slide for the not-magic version) to turn any object type (like String, ArrayList, Point, Scanner) into an integer. These hashCodes are designed to distribute pretty evenly / not have lots of collisions, so we use them as the starting point for determining the bucket index.

high level steps to figure out which bucket a key goes into

call the key.hashCode() to get an int representation of the object
% by the array table length to convert it to a valid index for your hash map

CSE 373 20 SP – CHAMPION & CHUN

23 of 50

Best practices for an nice distribution of keys recap

resize when lambda (number of elements / number of buckets) increases up to 1
when you resize, you can choose a the table length that will help reduce collisions if you multiply the array length by 2 and then choose the nearest prime number
design the hashCode of your keys to be somewhat complex and lead to a distribution of different output numbers

CSE 373 20 SP – CHAMPION & CHUN

24 of 50

Practice

Consider an IntegerDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList where we append new key-value pairs to the end.
Now, suppose we insert the following key-value pairs. What does the dictionary internally look like?
(1, a) (5,b) (11,a) (7,d) (12,e) (17,f) (1,g) (25,h)

CSE 373 SU 19 - ROBBIE WEBER

0	1	2	3	4	5	6	7	8	9

(1, a)

(5, b)

(11, a)

(17, f)

(1, g)

(12, e)

(7, d)

(25, h)

25 of 50

Practice

Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList. Use the following hash function:
public int hashCode(String input) {
return input.length() % arr.length;
}
Now, insert the following key-value pairs. What does the dictionary internally look like?
(“a”, 1) (“ab”, 2) (“c”, 3) (“abc”, 4) (“abcd”, 5) (“abcdabcd”, 6) (“five”, 7) (“hello world”, 8)

CSE 373 SU 19 - ROBBIE WEBER

0	1	2	3	4	5	6	7	8	9

(“a”, 1)

(“abcd”, 5)

(“c”, 3)

(“five”, 7)

(“abc”, 4)

(“ab”, 2)

(“hello world”, 8)

(“abcdabcd”, 6)

26 of 50

Java and Hash Functions

Object class includes default functionality:

equals
hashCode

If you want to implement your own hashCode you should:

Override BOTH hashCode() and equals()

If a.equals(b) is true then a.hashCode() == b.hashCode() MUST also be true

That requirement is part of the Object interface. �Other people’s code will assume you’ve followed this rule.
Java’s HashMap (and HashSet) will assume you follow these rules and conventions for your custom objects if you want to use your custom objects as keys.

CSE 373 SU 19 - ROBBIE WEBER

27 of 50

Resizing

Don’t forget to re-distribute your keys!

As seen on

Project 2

0
1
2
3
4
5
6
7
8
9

(7,blue)

(4,orange)

0
1
2
3
4
5
6
7
8
9

(1,red)

(22,tan)

(22,tan)

(7,blue)

(77,aqua)

(4,orange)

(1,red)

(6,pink)

(8,lilac)

(53,puce)

(6,pink)

(77,aqua)

(53,puce)

(8,lilac)

If we just expand the buckets array, several values are hashed in the wrong place

How to Resize:

Expand the buckets array
For every element in the old hash table, re-distribute! Recompute its position by taking the mod with the new length

28 of 50

When to Resize?

In ArrayList, we were forced to resize when we ran out of room

In SeparateChainingHashMap, never forced to resize, but we want to make sure the buckets don’t get too long for good runtime

How do we quantify “too full”?

Look at the average bucket size: number of elements / number of buckets

LOAD FACTOR λ

(22,tan)

(7,blue)

(77,aqua)

(4,orange)

0
1
2
3
4

(1,red)

(6,pink)

(8,lilac)

(53,puce)

29 of 50

When to Resize?

In ArrayList, we were forced to resize when we ran out of room

In SeparateChainingHashMap, never forced to resize, but we want to make sure the buckets don’t get too long for good runtime

How do we quantify “too full”?

Look at the average bucket size: number of elements / number of buckets

LOAD FACTOR λ

If we resize when λ hits some constant value like 1:

We expect to see 1 element per bucket: constant runtime!
If we double the capacity each time, the expensive resize operation becomes less and less frequent

31 of 50

Good Hashing

The hash function of a HashDictionary gets called a LOT:

When first inserting something into the map
When checking if a key is already in the map
When resizing and redistributing all values into new structure

This is why it is so important to have a “good” hash function. A good hash function is:

Deterministic – same input should generate the same output
Efficiency - it should take a reasonable amount o time
Uniformity – inputs should be spread “evenly” over output range

CSE 373 20 WI – HANNAH TANG

public int hashFn(String s) {

return random.nextInt()

}

public int hashFn(String s) {

int retVal = 0;

for (int I = 0; I < s.length(); i++) {

for (int j = 0; j < s.length(); j++) {

retVal += helperFun(s, I, j);

}

return retVal;

}

public int hashFn(String s) {

if (s.length() % 2 == 0) {

if (s.length(). % 2 == 0) {

return 17;

} else {

return 43;

}

NOT deterministic

NOT efficient

NOT uniform

32 of 50

Handling Collisions

Solution 2: Open Addressing
Resolves collisions by choosing a different location to store a value if natural choice is already full.
Type 1: Linear Probing
If there is a collision, keep checking the next element until we find an open spot.
int findFinalLocation(Key s)
int naturalHash = this.getHash(s);

int index = natrualHash % TableSize;

while (index in use) {

i++;

index = (naturalHash + i) % TableSize;

}

return index;

CSE 373 SP 18 - KASEY CHAMPION

33 of 50

Linear Probing

CSE 373 SP 18 - KASEY CHAMPION

0	1	2	3	4	5	6	7	8	9

Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions

1, 5, 11, 7, 12, 17, 6, 25

34 of 50

Linear Probing

CSE 373 SP 18 - KASEY CHAMPION

0	1	2	3	4	5	6	7	8	9

Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions

38, 19, 8, 109, 10

109

Problem:

Linear probing causes clustering
Clustering causes more looping when probing

Primary Clustering

When probing causes long chains of occupied slots within a hash table

35 of 50

Runtime

When is runtime good?
When we hit an empty slot

(or an empty slot is a very short distance away)

When is runtime bad?
When we hit a “cluster”

Maximum Load Factor?
λ at most 1.0

When do we resize the array?
λ ≈ ½ is a good rule of thumb

CSE 373 SP 18 - KASEY CHAMPION

36 of 50

Can we do better?

CSE 373 SP 18 - KASEY CHAMPION

37 of 50

Quadratic Probing

CSE 373 SP 18 - KASEY CHAMPION

0	1	2	3	4	5	6	7	8	9

(49 % 10 + 0 * 0) % 10 = 9

(49 % 10 + 1 * 1) % 10 = 0

(58 % 10 + 0 * 0) % 10 = 8

(58 % 10 + 1 * 1) % 10 = 9

(58 % 10 + 2 * 2) % 10 = 2

Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions

89, 18, 49, 58, 79, 27

(79 % 10 + 0 * 0) % 10 = 9

(79 % 10 + 1 * 1) % 10 = 0

(79 % 10 + 2 * 2) % 10 = 3

Problems:

If λ≥ ½ we might never find an empty spot

Infinite loop!

Can still get clusters

Now try to insert 9.

Uh-oh

38 of 50

Quadratic Probing

There were empty spots. What Gives?

Quadratic probing is not guaranteed to check every possible spot in the hash table

The following is true:

Notice we have to assume p is prime to get that guarantee

39 of 50

Secondary Clustering

CSE 373 SP 18 - KASEY CHAMPION

0	1	2	3	4	5	6	7	8	9

Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions

19, 39, 29, 9

Secondary Clustering

When using quadratic probing sometimes need to probe the same sequence of table cells, not necessarily next to one another

40 of 50

Probing

h(k) = the natural hash
h’(k, i) = resulting hash after probing
i = iteration of the probe
T = table size

Linear Probing:
h’(k, i) = (h(k) + i) % T
Quadratic Probing
h’(k, i) = (h(k) + i²) % T

CSE 373 SP 18 - KASEY CHAMPION

41 of 50

Questions

Topics Covered:

Writing good hash functions
Open addressing to resolve collisions:

Linear probing
Quadratic probing

CSE 373 20 SP – CHAMPION & CHUN

42 of 50

Double Hashing

Probing causes us to check the same indices over and over- can we check different ones instead?

Use a second hash function!
h’(k, i) = (h(k) + i * g(k)) % T

int findFinalLocation(Key s)
int naturalHash = this.getHash(s);

int index = natrualHash % TableSize;

while (index in use) {

i++;

index = (naturalHash + i*jumpHash(s)) % TableSize;

}

return index;

CSE 373 SP 18 - KASEY CHAMPION

<- Most effective if g(k) returns value relatively prime to table size

43 of 50

Second Hash Function

Effective if g(k) returns a value that is relatively prime to table size

If T is a power of 2, make g(k) return an odd integer
If T is a prime, make g(k) return anything except a multiple of the TableSize

CSE 373 SP 18 - KASEY CHAMPION

44 of 50

Resizing: Open Addressing

45 of 50

Running Times

CSE 332 SU 18 – ROBBIE WEBER

46 of 50

In-Practice

47 of 50

Summary

1. Pick a hash function to:

Avoid collisions
Uniformly distribute data
Reduce hash computational costs

2. Pick a collision strategy

Chaining

LinkedList
AVL Tree

Probing

Linear
Quadratic
Double Hashing

CSE 373 SP 18 - KASEY CHAMPION

No clustering

Potentially more “compact” (λ can be higher)

Managing clustering can be tricky

Less compact (keep λ < ½)

Array lookups tend to be a constant factor faster than traversing pointers

49 of 50

Extra optimizations

CSE 373 SP 18 - KASEY CHAMPION

50 of 50

Other Hashing Applications

We use it for hash tables but there are lots of uses! Hashing is a really good way of taking arbitrary data and creating a succinct and unique summary of data.

CSE 373 20 WI – HANNAH TANG

Cryptography

Hashing also ”hides” the data by translating it, this can be used for security

For password verification: Storing passwords in plaintext is insecure. So your passwords are stored as a hash
Digital signatures

Fingerprinting

git hashes (“identification”)

That crazy number that is attached to each of your commits
SHA-1 hash incorporates the contents of your change, the name of the files and the lines of the files you changes

Ad Tracking

track who has seen an ad if they saw it on a different device (if they saw it on their phone don’t want to show it on their laptop)
https://panopticlick.eff.org will show you what is being hashed about you

YouTube Content ID

Do two files contain the same thing? Copyright infringement
Change the files a bit!

Caching

you’ve downloaded a large video file, You want to know if a new version is available, Rather than re-downloading the entire file, compare your file’s hash value with the server's hash value.

File Verification / Error Checking:

compare the hash of a file instead of the file itself
Find similar substrings in a large collection of strings – detecting plagiarism

1 of 50

2 of 50

3 of 50

4 of 50

5 of 50

6 of 50

7 of 50

8 of 50

9 of 50

10 of 50

11 of 50

12 of 50

13 of 50

14 of 50

15 of 50

16 of 50

17 of 50

18 of 50

19 of 50

20 of 50

21 of 50

22 of 50

23 of 50

24 of 50

25 of 50

26 of 50

27 of 50

28 of 50

29 of 50

30 of 50

31 of 50

32 of 50

33 of 50

34 of 50

35 of 50

36 of 50

37 of 50

38 of 50

39 of 50

40 of 50

41 of 50

42 of 50

43 of 50

44 of 50

45 of 50

46 of 50

47 of 50

48 of 50

49 of 50

50 of 50