1 of 50

Lecture 08: Hash Collision Resolutions + Runtime

CSE 373: Data Structures and Algorithms

1

Please log onto tinyurl.com/Summer373L8 to answer the daily lecture participation question

2 of 50

Announcements

  • Exercise 2 due Friday July 15th
  • Project 2 due Wednesday July 20th
    • Once again please start early!

Please check Canvas Gradebook!

  • Lecture Polls
  • Extra Credit Survey #1

3 of 50

Practice

  • Insert the following key value pairs into the HashTable below. Use the following hash function:
  • public int hashCode(int key) {
  • return key % arr.length;
  • }

(2, b) (3,c) (10,j) (9,i) (1,a) (12,l)

3

0

1

2

3

4

5

6

7

8

9

(10, j)

(2, b)

(3, c)

(1, a)

(12, l)

(9, i)

4 of 50

Lecture Outline

ArrayMap

DirectAccessMap

SimpleHashMap

SeparateChaining

HashMap

4

3

2

1

FASTER: Jump directly to element, only int keys

MORE FLEXIBLE: Hash function supports any type of key

YOUR BEST FRIEND: Addresses limitations with hash collisions, but still fast!

Review

MAP ADT

As seen on

Project 2

As seen on

Project 2

5 of 50

Hash functions: translating a piece of data to an int

  • In our case: we want to translate int keys to a valid index in our array. If our array is length 10 but our input key is 500, we need to make sure we have a way of mapping that to a number between 0 and 9 (the valid indices for a length 10 array). This mapping that we decide on is a hash function.

  • One simple thing we can do (and that you will do when you implement this in your project):
  • Hash function: take your key and % it by the length of the array.
  • ex: key is 500, and array is length 10 – if you take 500 % 10, you will get the number 0, so we’d just plop 500 and it’s value at index 0.

CSE 373 20 SP – CHAMPION & CHUN

Hash function definition

hash function is any function that can be used to map data of arbitrary size to fixed-size values.

6 of 50

“review”: Integer remainder with % “mod”

  • The % operator computes the remainder from integer division.

14 % 4 is 2� � 3 43� 4 ) 14 5 ) 218� 12 202 18� 15� 3

  • Applications of % operator:
    • Obtain last digit of a number: 230857 % 10 is 7
    • See whether a number is odd: 7 % 2 is 1, 42 % 2 is 0
    • Limit integers to specific range: 8 % 12 is 8, 18 % 12 is 6

CSE 142 SP 18 – BRETT WORTZMAN

6

218 % 5 is 3

Limit keys to indices within array

Equivalently, to find a % b (for a,b > 0):

while(a > b-1)

a -= b;

return a;

7 of 50

First Hash Function: % table size

CSE 373 SP 22 - CHAMPION

7

indices

0

1

2

3

4

5

6

7

8

9

elements

put(0, “foo”);

put(5, “bar”);

put(11, “biz”)

put(18, “bop”);

“foo”

0 % 10 = 0

5 % 10 = 5

11 % 10 = 1

18 % 10 = 8

“bop”

“bar”

“biz”

8 of 50

Implement First Hash Function

  • public void put(int key, int value) {
  • data[hashToValidIndex(key)] = value;
  • }

  • public V get(int key) {
  • return data[hashToValidIndex(key)];
  • }

  • public int hashToValidIndex(int k) {
  • return k % this.data.length;
  • }

CSE 373 SU 19 - ROBBIE WEBER

8

SimpleHashMap<Integer>

put mod key by table size, put item at result

get mod key by table size, get item at result

containsKey mod key by table size, return data[result] == null remove mod key by table size, nullify element at result

size return count of items in dictionary

state

behavior

Data[]

size

Operation

Array w/ indices as keys

put(key,value)

best

𝚹(1)

worst

𝚹(1)

get(key)

best

𝚹(1)

worst

𝚹(1)

containsKey(key)

best

𝚹(1)

worst

𝚹(1)

Note: % is just a math operator like +, -, /, *, so it’s constant runtime

9 of 50

Questions?

things we talked about:

  • review of ArrayMap + LinkedMap
  • DirectAccessMap
  • % as a hash function andSimpleHashMap

10 of 50

First Hash Function: % table size

CSE 373 SP 22 - CHAMPION

10

indices

0

1

2

3

4

5

6

7

8

9

elements

put(0, “foo”);

put(5, “bar”);

put(11, “biz”)

put(18, “bop”);

Collision!

“foo”

0 % 10 = 0

5 % 10 = 5

11 % 10 = 1

18 % 10 = 8

20 % 10 = 0

“bop”

“bar”

“biz”

:(

put(20, “:(”);

11 of 50

Hash Obsession: Collisions

  • Collision: multiple keys translate to the same location of the array

  • Future big idea: the fewer the collisions, the better the runtime! (we’ll see this when we figure out that resolving these leads to worse runtime)

  • Two questions:
  • 1. When we have a collision, how do we resolve it?
  • 2. How do we minimize the number of collisions?

CSE 373 SU 19 - ROBBIE WEBER

11

12 of 50

Strategies to handle hash collision

  • There are multiple strategies. In this class, we’ll cover the following ones:

  • 1. Separate chaining
  • 2. Open addressing
    • Linear probing
    • Quadratic probing
    • Double hashing

CSE 373 AU 18 – SHRI MARE

12

13 of 50

Separate chaining

  • Solution 1: Separate Chaining
  • Each index in our array represents a “bucket”. When an item x hashes to index h:
    • If the bucket at index h is empty: create a new list containing x
    • If the bucket at index h is already a list: add x if it is not already present

  • in other words:
  • If multiple things hash to the same index, then we’ll just put all of those in that same index bucket. Often, you’ll see the data structure chosen is a linked-list like structure.

CSE 373 ROBBIE WEBER + HANNAH TANG

13

14 of 50

Separate chaining

  • // some pseudocode

public boolean containsKey(int key) {

int bucketIndex = key % data.length;

loop through data[bucketIndex]

return true if we find the key in

data[bucketIndex]

return false if we get to here (didn’t

find it)

}

CSE 373 ROBBIE WEBER + HANNAH TANG

14

Reminder: the implementations of put/get/containsKey are all very similar, and almost always will have the same complexity class runtime

runtime analysis

Are there different possible states for our Hash Map that make this code run slower/faster, assuming there are already n key-value pairs being stored?

Yes! If we had to do a lot of loop iterations to find the key in the bucket, our code will run slower.

15 of 50

A best case situation for separate chaining

CSE 373 20 SP – CHAMPION & CHUN

0

1

2

3

4

5

6

7

8

9

(0, b)

(2, b)

(3, b)

(4, b)

(5, b)

(6, b)

(7, b)

(8, b)

It’s possible (and likely if you follow some best-practices) that everything is spread out across the buckets pretty evenly. This is the opposite of the last slide: when we have minimal collisions, our runtime should be less. For example, if we have a bucket with only 0 or 1 element in it, checking containsKey for something in that bucket will only take a constant amount of time.

We’re going to try a lot of stuff we can to make it more likely we achieve this beautiful state ☺.

16 of 50

In-practice situations for separate chaining

  •  

CSE 373 20 SP – CHAMPION & CHUN

Operation

Array w/ indices as keys

put(key,value)

best

𝚹(1)

In-practice

𝚹(1)

worst

𝚹(n)

get(key)

best

𝚹(1)

In-practice

𝚹(1)

worst

𝚹(1)

remove(key)

best

𝚹(1)

In-practice

𝚹(1)

worst

O(n)

Reminder: the in-practice runtimes are assuming an even distribution of the keys inside the array and following of best-practices to ensure the average chain length is low.

17 of 50

Best practices (pay attention to this for the hw)

  • what about resizing?
    • for data structures like ArrayMap or ArrayList or ArrayStack we had to resize when we’re full just because we couldn’t store any more things! But our Separate Chaining Hash Map is a little bit different: we aren’t ever forced to resize our main array, since the buckets are flexible size.

CSE 373 20 SP – CHAMPION & CHUN

It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.

Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?

Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.

What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.

18 of 50

Best practices (pay attention to this for the hw)

CSE 373 20 SP – CHAMPION & CHUN

It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.

Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?

Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.

What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.

The pattern we’re getting to is that the expected runtime is approximately: # of pairs / array.length (AKA n / c where n is the number of elements and c is the number of possible chains). If array.length is fixed for your whole program, then this is an order-n runtime, but if the array.length also increases (because you re-size) and you redistribute out the values evenly across the buckets, you can keep your runtime low. In particular, if you resize when when your n / c ratio increases to about 1, you’re expected to have 1 element or fewer in each bucket at all times. (do this on your homework).

Tip: make sure you re-hash (re-distribute) your keys by the new array length after re-sizing so they don’t get clustered in the old array length range.

19 of 50

Lambda + resizing rephrased

  • To be more precise, the in-practice runtime depends on λ, the current average chain length.

However, if you resize once you hit that 1:1 threshold, the current λ is expected to be less than 1 (which is a constant / constant runtime, so we can simplify to O(1)).

CSE 373 SU 19 - ROBBIE WEBER

19

Operation

Array w/ indices as keys

put(key,value)

best

O(1)

In-practice

O(λ)

worst

O(n)

get(key)

best

O(1)

In-practice

O(λ)

worst

O(n)

remove(key)

best

O(1)

In-practice

O(λ)

worst

O(n)

 

20 of 50

What about non integer keys?

  •  

CSE 373 SU 19 - ROBBIE WEBER

20

Hash function definition

hash function is any function that can be used to map data of arbitrary size to fixed-size values.

21 of 50

(Before we % by length, we have to convert the data into an int)

  • Implementation 1: Simple aspect of values
  • public int hashCode(String input) {
  • return input.length();
  • }
  • Implementation 2: More aspects of value
  • public int hashCode(String input) {
  • int output = 0;
  • for(char c : input) {
  • out += (int)c;
  • }
  • return output;
  • }
  • Implementation 3: Multiple aspects of value + math!
  • public int hashCode(String input) {
  • int output = 1;
  • for (char c : input) {
  • int nextPrime = getNextPrime();
  • out *= Math.pow(nextPrime, (int)c);

}

  • return Math.pow(nextPrime, input.length());
  • }

CSE 373 SU 19 - ROBBIE WEBER

21

Pro: super fast

Con: lots of collisions!

Pro: still really fast

Con: some collisions

Pro: few collisions

Con: slow, gigantic integers

22 of 50

Java’s hashCode (relevant for project)

  • Luckily, most of these design decisions have been made for us by smart people. All objects in java come with a `hashCode()` method that does some magic (see previous slide for the not-magic version) to turn any object type (like String, ArrayList, Point, Scanner) into an integer. These hashCodes are designed to distribute pretty evenly / not have lots of collisions, so we use them as the starting point for determining the bucket index.

  • high level steps to figure out which bucket a key goes into
    • call the key.hashCode() to get an int representation of the object
    • % by the array table length to convert it to a valid index for your hash map

CSE 373 20 SP – CHAMPION & CHUN

23 of 50

Best practices for an nice distribution of keys recap

  • resize when lambda (number of elements / number of buckets) increases up to 1
  • when you resize, you can choose a the table length that will help reduce collisions if you multiply the array length by 2 and then choose the nearest prime number
  • design the hashCode of your keys to be somewhat complex and lead to a distribution of different output numbers

CSE 373 20 SP – CHAMPION & CHUN

24 of 50

Practice

  • Consider an IntegerDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList where we append new key-value pairs to the end.
  • Now, suppose we insert the following key-value pairs. What does the dictionary internally look like?
  • (1, a) (5,b) (11,a) (7,d) (12,e) (17,f) (1,g) (25,h)

CSE 373 SU 19 - ROBBIE WEBER

24

0

1

2

3

4

5

6

7

8

9

(1, a)

(5, b)

(11, a)

(17, f)

(1, g)

(12, e)

(7, d)

(25, h)

25 of 50

Practice

  • Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList. Use the following hash function:
  • public int hashCode(String input) {
  • return input.length() % arr.length;
  • }
  • Now, insert the following key-value pairs. What does the dictionary internally look like?
  • (“a”, 1) (“ab”, 2) (“c”, 3) (“abc”, 4) (“abcd”, 5) (“abcdabcd”, 6) (“five”, 7) (“hello world”, 8)

CSE 373 SU 19 - ROBBIE WEBER

25

0

1

2

3

4

5

6

7

8

9

(“a”, 1)

(“abcd”, 5)

(“c”, 3)

(“five”, 7)

(“abc”, 4)

(“ab”, 2)

(“hello world”, 8)

(“abcdabcd”, 6)

26 of 50

Java and Hash Functions

  • Object class includes default functionality:
    • equals
    • hashCode

  • If you want to implement your own hashCode you should:
    • Override BOTH hashCode() and equals()

If a.equals(b) is true then a.hashCode() == b.hashCode() MUST also be true

  • That requirement is part of the Object interface. �Other people’s code will assume you’ve followed this rule.
  • Java’s HashMap (and HashSet) will assume you follow these rules and conventions for your custom objects if you want to use your custom objects as keys.

CSE 373 SU 19 - ROBBIE WEBER

26

27 of 50

Resizing

Don’t forget to re-distribute your keys!

As seen on

Project 2

0

1

2

3

4

5

6

7

8

9

(7,blue)

(4,orange)

0

1

2

3

4

5

6

7

8

9

(1,red)

(22,tan)

(22,tan)

(7,blue)

(77,aqua)

(4,orange)

(1,red)

(6,pink)

(8,lilac)

(53,puce)

(6,pink)

(77,aqua)

(53,puce)

(8,lilac)

If we just expand the buckets array, several values are hashed in the wrong place

How to Resize:

  1. Expand the buckets array
  2. For every element in the old hash table, re-distribute! Recompute its position by taking the mod with the new length

28 of 50

When to Resize?

  • In ArrayList, we were forced to resize when we ran out of room
    • In SeparateChainingHashMap, never forced to resize, but we want to make sure the buckets don’t get too long for good runtime
  • How do we quantify “too full”?
    • Look at the average bucket size: number of elements / number of buckets

LOAD FACTOR λ

 

(22,tan)

(7,blue)

(77,aqua)

(4,orange)

0

1

2

3

4

(1,red)

(6,pink)

(8,lilac)

(53,puce)

 

29 of 50

When to Resize?

  • In ArrayList, we were forced to resize when we ran out of room
    • In SeparateChainingHashMap, never forced to resize, but we want to make sure the buckets don’t get too long for good runtime
  • How do we quantify “too full”?
    • Look at the average bucket size: number of elements / number of buckets

LOAD FACTOR λ

 

  • If we resize when λ hits some constant value like 1:
    • We expect to see 1 element per bucket: constant runtime!
    • If we double the capacity each time, the expensive resize operation becomes less and less frequent

30 of 50

Questions?

31 of 50

Good Hashing

The hash function of a HashDictionary gets called a LOT:

    • When first inserting something into the map
    • When checking if a key is already in the map
    • When resizing and redistributing all values into new structure

This is why it is so important to have a “good” hash function. A good hash function is:

    • Deterministic – same input should generate the same output
    • Efficiency - it should take a reasonable amount o time
    • Uniformity – inputs should be spread “evenly” over output range

CSE 373 20 WI – HANNAH TANG

31

public int hashFn(String s) {

return random.nextInt()

}

public int hashFn(String s) {

int retVal = 0;

for (int I = 0; I < s.length(); i++) {

for (int j = 0; j < s.length(); j++) {

retVal += helperFun(s, I, j);

}

}

return retVal;

}

public int hashFn(String s) {

if (s.length() % 2 == 0) {

if (s.length(). % 2 == 0) {

return 17;

} else {

return 43;

}

}

}

NOT deterministic

NOT efficient

NOT uniform

32 of 50

Handling Collisions

  • Solution 2: Open Addressing
  • Resolves collisions by choosing a different location to store a value if natural choice is already full.
  • Type 1: Linear Probing
  • If there is a collision, keep checking the next element until we find an open spot.
  • int findFinalLocation(Key s)
  • int naturalHash = this.getHash(s);

int index = natrualHash % TableSize;

while (index in use) {

i++;

  • index = (naturalHash + i) % TableSize;

}

return index;

CSE 373 SP 18 - KASEY CHAMPION

32

33 of 50

Linear Probing

CSE 373 SP 18 - KASEY CHAMPION

33

0

1

2

3

4

5

6

7

8

9

Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions

1, 5, 11, 7, 12, 17, 6, 25

1

5

11

7

12

17

6

25

34 of 50

Linear Probing

CSE 373 SP 18 - KASEY CHAMPION

34

0

1

2

3

4

5

6

7

8

9

Insert the following values into the Hash Table using a hashFunction of % table size and linear probing to resolve collisions

38, 19, 8, 109, 10

38

19

8

8

109

10

Problem:

    • Linear probing causes clustering
    • Clustering causes more looping when probing

Primary Clustering

When probing causes long chains of occupied slots within a hash table

35 of 50

Runtime

  • When is runtime good?
  • When we hit an empty slot
    • (or an empty slot is a very short distance away)

  • When is runtime bad?
  • When we hit a “cluster”

  • Maximum Load Factor?
  • λ at most 1.0

  • When do we resize the array?
  • λ ≈ ½ is a good rule of thumb

CSE 373 SP 18 - KASEY CHAMPION

35

36 of 50

Can we do better?

CSE 373 SP 18 - KASEY CHAMPION

36

37 of 50

Quadratic Probing

CSE 373 SP 18 - KASEY CHAMPION

37

0

1

2

3

4

5

6

7

8

9

(49 % 10 + 0 * 0) % 10 = 9

(49 % 10 + 1 * 1) % 10 = 0

(58 % 10 + 0 * 0) % 10 = 8

(58 % 10 + 1 * 1) % 10 = 9

(58 % 10 + 2 * 2) % 10 = 2

89

18

49

Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions

89, 18, 49, 58, 79, 27

58

79

(79 % 10 + 0 * 0) % 10 = 9

(79 % 10 + 1 * 1) % 10 = 0

(79 % 10 + 2 * 2) % 10 = 3

Problems:

If λ≥ ½ we might never find an empty spot

Infinite loop!

Can still get clusters

27

Now try to insert 9.

Uh-oh

38 of 50

Quadratic Probing

 

There were empty spots. What Gives?

Quadratic probing is not guaranteed to check every possible spot in the hash table

The following is true:

Notice we have to assume p is prime to get that guarantee

39 of 50

Secondary Clustering

CSE 373 SP 18 - KASEY CHAMPION

39

0

1

2

3

4

5

6

7

8

9

Insert the following values into the Hash Table using a hashFunction of % table size and quadratic probing to resolve collisions

19, 39, 29, 9

39

29

19

9

Secondary Clustering

When using quadratic probing sometimes need to probe the same sequence of table cells, not necessarily next to one another

40 of 50

Probing

    • h(k) = the natural hash
    • h’(k, i) = resulting hash after probing
    • i = iteration of the probe
    • T = table size
  • Linear Probing:
  • h’(k, i) = (h(k) + i) % T
  • Quadratic Probing
  • h’(k, i) = (h(k) + i2) % T

CSE 373 SP 18 - KASEY CHAMPION

40

41 of 50

Questions

Topics Covered:

  • Writing good hash functions
  • Open addressing to resolve collisions:
    • Linear probing
    • Quadratic probing

CSE 373 20 SP – CHAMPION & CHUN

41

42 of 50

Double Hashing

  • Probing causes us to check the same indices over and over- can we check different ones instead?

  • Use a second hash function!
  • h’(k, i) = (h(k) + i * g(k)) % T

  • int findFinalLocation(Key s)
  • int naturalHash = this.getHash(s);

int index = natrualHash % TableSize;

while (index in use) {

i++;

  • index = (naturalHash + i*jumpHash(s)) % TableSize;

}

return index;

CSE 373 SP 18 - KASEY CHAMPION

42

<- Most effective if g(k) returns value relatively prime to table size

43 of 50

Second Hash Function

  • Effective if g(k) returns a value that is relatively prime to table size
    • If T is a power of 2, make g(k) return an odd integer
    • If T is a prime, make g(k) return anything except a multiple of the TableSize

CSE 373 SP 18 - KASEY CHAMPION

43

44 of 50

Resizing: Open Addressing

45 of 50

Running Times

CSE 332 SU 18 – ROBBIE WEBER

46 of 50

In-Practice

  •  

47 of 50

Summary

  • 1. Pick a hash function to:
    • Avoid collisions
    • Uniformly distribute data
    • Reduce hash computational costs
  • 2. Pick a collision strategy
    • Chaining
      • LinkedList
      • AVL Tree
    • Probing
      • Linear
      • Quadratic
      • Double Hashing

CSE 373 SP 18 - KASEY CHAMPION

47

No clustering

Potentially more “compact” (λ can be higher)

Managing clustering can be tricky

Less compact (keep λ < ½)

Array lookups tend to be a constant factor faster than traversing pointers

48 of 50

Summary

  •  

49 of 50

Extra optimizations

  •  

CSE 373 SP 18 - KASEY CHAMPION

49

50 of 50

Other Hashing Applications

  • We use it for hash tables but there are lots of uses! Hashing is a really good way of taking arbitrary data and creating a succinct and unique summary of data.

CSE 373 20 WI – HANNAH TANG

50

Cryptography

Hashing also ”hides” the data by translating it, this can be used for security

    • For password verification: Storing passwords in plaintext is insecure. So your passwords are stored as a hash
    • Digital signatures

Fingerprinting

git hashes (“identification”)

    • That crazy number that is attached to each of your commits
    • SHA-1 hash incorporates the contents of your change, the name of the files and the lines of the files you changes

Ad Tracking

    • track who has seen an ad if they saw it on a different device (if they saw it on their phone don’t want to show it on their laptop)
    • https://panopticlick.eff.org will show you what is being hashed about you

YouTube Content ID

    • Do two files contain the same thing? Copyright infringement
    • Change the files a bit!

Caching

    • you’ve downloaded a large video file, You want to know if a new version is available, Rather than re-downloading the entire file, compare your file’s hash value with the server's hash value.

File Verification / Error Checking:

    • compare the hash of a file instead of the file itself
    • Find similar substrings in a large collection of strings – detecting plagiarism