Lecture 7: Recurrences and Intro to Hashing
CSE 373: Data Structures and Algorithms
1
Please log onto Live Lecture Ed Thread to submit live lecture questions
Please log onto tinyurl.com/Summer373L7 to answer the daily lecture participation question
Announcements
CSE 373 22 SP – CHAMPION
2
For real, though, it will take you 2 weeks, do not wait until next week to start
Warm Up!
CSE 373 SP 22 – CHAMPION
3
What’s the theta bound for the runtime function for this piece of code?
public void method1(int n) {
if (n <= 100) {
System.out.println(“:3”);
} else {
System.out.println(“:D”);
for (int i = 0; i<16; i++) {
method1(n / 4);
}
}
}
Please fill out the Poll at- tinyurl.com/Summer373L7
a = 16, b = 4, c = 0
If
If
If
then
then
then
Master Theorem
2 > 0
Questions
CSE 373 20 SP – CHAMPION & CHUN
4
Wednesday’s Warm Up
CSE 373 22 SP – CHAMPION
5
All false!
Try on your own!
Case Study: Linear Search
CSE 373 SP 22 - CHAMPION
int linearSearch(int[] arr, int toFind) {
for (int i = 0; i < arr.length; i++) {
if (arr[i] == toFind)
return i;
}
return -1;
}
2 | 3 | 9 | 4 | 5 |
arr
toFind
2
2 | 3 | 9 | 4 | 5 |
toFind
8
i
arr
i
i
i
i
i
Best Case
Worst Case
On Lucky Earth
On Unlucky Earth (where it’s 2020 every year)
2 | 3 | 9 | 4 | 5 |
arr
toFind
2
i
2 | 3 | 9 | 4 | 5 |
arr
toFind
8
i
f(n) = 3n + 1
f(n) = 2
O(1)
Θ(1)
O(n)
Θ(n)
After asymptotic analysis:
After asymptotic analysis:
Questions
CSE 373 20 SP – CHAMPION & CHUN
8
Modeling Recursive Code
CSE 373 20 SP – CHAMPION & CHUN
9
Recurrence to Big-Θ
It’s still really hard to tell what the big-O is just by looking at it.
But fancy mathematicians have a formula for us to use!
CSE 373 SP 20 – CHUN & CHAMPION
If
If
If
then
then
then
Master Theorem
Understanding Master Theorem
CSE 373 SP 20 – CHUN & CHAMPION
11
If
If
If
then
then
then
Master Theorem
Recursive Patterns
CSE 373 20 WI – HANNAH TANG
12
Binary Search Θ(logn)
Merge Sort
Merge Sort
CSE 373 SP 18 - KASEY CHAMPION
13
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
8 | 2 | 91 | 22 | 57 | 1 | 10 | 6 | 7 | 4 |
Divide
0 | 1 | 2 | 3 | 4 |
8 | 2 | 91 | 22 | 57 |
5 | 6 | 7 | 8 | 9 |
1 | 10 | 6 | 7 | 4 |
Conquer
0 |
8 |
0 |
8 |
0 | 1 | 2 | 3 | 4 |
2 | 8 | 22 | 57 | 91 |
5 | 6 | 7 | 8 | 9 |
1 | 4 | 6 | 7 | 10 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
1 | 2 | 4 | 6 | 7 | 8 | 10 | 22 | 57 | 91 |
Combine
Merge Sort
CSE 373 SP 18 - KASEY CHAMPION
14
mergeSort(input) {
if (input.length == 1)
return
else
smallerHalf = mergeSort(new [0, ..., mid])
largerHalf = mergeSort(new [mid + 1, ...])
return merge(smallerHalf, largerHalf)
}
0 | 1 | 2 | 3 | 4 |
8 | 2 | 57 | 91 | 22 |
0 | 1 |
8 | 2 |
0 | 1 | 2 |
57 | 91 | 22 |
0 |
8 |
0 |
2 |
0 |
57 |
0 | 1 |
91 | 22 |
0 |
91 |
0 |
22 |
0 | 1 |
22 | 91 |
0 | 1 | 2 |
22 | 57 | 91 |
0 | 1 |
2 | 8 |
0 | 1 | 2 | 3 | 4 |
2 | 8 | 22 | 57 | 91 |
1 if n<= 1
2T(n/2) + n otherwise
T(n) =
Pattern #2 – Constant size input and doing work
Take a guess! What is the Big-O of worst case merge sort?
Merge Sort Recurrence to Big-Θ
CSE 373 SP 20 – CHUN & CHAMPION
If
If
If
then
then
then
Master Theorem
1 if n<= 1
2T(n/2) + n otherwise
T(n) =
Recursive Patterns
CSE 373 20 WI – HANNAH TANG
16
Binary Search Θ(logn)
Merge Sort Θ(nlogn)
Calculating Fibonacci
Calculating Fibonacci
CSE 373 20 WI – HANNAH TANG
17
Almost
f(4)
f(3)
f(2)
f(2)
f(2)
f(1)
f(1)
f(1)
f(1)
f(1)
f(1)
Pattern #3 – Doubling the Input
Calculating Fibonacci Recurrence to Big-Θ
CSE 373 20 WI – HANNAH TANG
18
d
2T(n-C1) + C2
Master Theorem
Can we use master theorem?
Uh oh, our model doesn’t match that format…
Maybe geometry can help!
Can we intuit a pattern?
T(1) = d
T(2) = 2T(2-1) + c = 2(d) + c
T(3) = 2T(3-1) + c = 2(2(d) + c) + c = 4d + 3c
T(4) = 2T(4-1) + c = 2(4d + 3c) + c = 8d + 7c
T(5) = 2T(5-1) + c = 2(8d + 7c) + c = 16d +25c
Looks like something’s happening but it’s tough
Finish the recurrence, what is the model for the recursive case?
2T(n-C1) + C2
Calculating Fibonacci Recurrence to Big-Θ
How many layers in the function call tree?
How many layers will it take to transform “n” to the base case of “1” by subtracting 1
For our example, 4 -> Height = n
How many function calls per layer?
CSE 373 20 WI – HANNAH TANG
19
Layer | Function calls |
1 | 1 |
2 | 2 |
3 | 4 |
4 | 8 |
How many function calls on layer k?
2k-1
How many function calls TOTAL
for a tree of k layers?
1 + 2 + 3 + 4 + … + 2k-1
f(4)
f(3)
f(2)
f(2)
f(2)
f(1)
f(1)
f(1)
f(1)
f(1)
f(1)
Calculating Fibonacci Recurrence to Big-Θ
CSE 373 20 SP – CHAMPION & CHUN
20
How many function calls on layer k?
2k-1
How many function calls TOTAL for a tree of k layers?
1 + 2 + 4 + 8 + … + 2k-1
Total runtime = (total function calls) x (runtime of each function call)
Total runtime = (1 + 2 + 4 + 8 + … + 2k-1) x (constant work)
1 + 2 + 4 + 8 + … + 2k-1 =
How many layers in the function call tree?
n
Summation Identity
Finite Geometric Series
Recursive Patterns
CSE 373 20 WI – HANNAH TANG
21
Binary Search Θ(logn)
Merge Sort Θ(nlogn)
Calculating Fibonacci Θ(2n)
Questions
CSE 373 20 SP – CHAMPION & CHUN
22
Intro to Hashing
CSE 373 20 SP – CHAMPION & CHUN
23
Dictionaries (aka Maps)
CSE 373 19 SU - ROBBIE WEBER
// two types of Map implementations supposedly covered in CSE 143
Map<String, Integer> map1 = new HashMap<>();
Map<String, String> map2 = new TreeMap<>();
Review: Maps
CSE 373 19 SU - ROBBIE WEBER
key | value |
“you" | 22 |
key | value |
“in" | 37 |
key | value |
“the" | 56 |
key | value |
“at" | 43 |
map.get("the")
56
Dictionary ADT
put(key, item) add item to collection indexed with key
get(key) return item associated with key
containsKey(key) return if key already in use
remove(key) remove item and associated key
size() return count of items
state
behavior
Set of items & keys
Count of items
supported operations:
Implementing a Map with an Array
CSE 373 22 SP - CHAMPION
ArrayMap<K, V>
put find key, overwrite value if there. Otherwise create new pair, add to next available spot, grow array if necessary
get scan all pairs looking for given key, return associated item if found
containsKey scan all pairs, return if key is found
remove scan all pairs, replace pair to be removed with last pair in collection
size return count of items in dictionary
state
behavior
Pair<K, V>[] data
Big O Analysis – (if key is the last one looked at / not in the dictionary)
put()
get()
containsKey()
remove()
size()
O(1) constant
O(N) linear
O(N) linear
O(N) linear
O(N) linear
0 | 1 | 2 | 3 |
| | | |
containsKey(‘c’)
get(‘d’)
put(‘b’, 97)
put(‘e’, 20)
(‘a’, 1)
(‘b’, 2)
Map ADT
put(key, item) add item to collection indexed with key
get(key) return item associated with key
containsKey(key) return if key already in use
remove(key) remove item and associated key
size() return count of items
state
behavior
Set of items & keys
Count of items
(‘c’, 3)
97)
(‘d’, 4)
Big O Analysis – (if the key is the first one looked at)
put()
get()
containsKey()
remove()
size()
O(1) constant
O(1) constant
O(1) constant
O(1) constant
O(1) constant
|
4
(‘e’, 20)
Implementing a Map with Nodes
CSE 373 19 SU - ROBBIE WEBER
LinkedMap<K, V>
put if key is unused, create new with pair, add to front of list, else replace with new value
get scan all pairs looking for given key, return associated item if found
containsKey scan all pairs, return if key is found
remove scan all pairs, skip pair to be removed
size return count of items in dictionary
state
behavior
front
size
containsKey(‘c’)
get(‘d’)
put(‘b’, 20)
Map ADT
put(key, item) add item to collection indexed with key
get(key) return item associated with key
containsKey(key) return if key already in use
remove(key) remove item and associated key
size() return count of items
state
behavior
Set of items & keys
Count of items
front
‘c’
9
‘b’
7
‘d’
4
‘a’
1
20
Big O Analysis – (if key is the last one looked at / not in the dictionary)
put()
get()
containsKey()
remove()
size()
O(1) constant
O(N) linear
O(N) linear
O(N) linear
O(N) linear
Big O Analysis – (if the key is the first one looked at)
put()
get()
containsKey()
remove()
size()
O(1) constant
O(1) constant
O(1) constant
O(1) constant
O(1) constant
Can we do better?
Let’s simplify the problem we’re working with + combine it with some facts about arrays.
Problem Simplification: only worry about supporting integer keys
Array Facts: accessing (data[i]) or updating an element (data[i] = …) at a given index takes Theta(1) runtime.
If we store the Key-Value pairs at the data[key] then we don’t have to do any looping to find it. For example consider `containsKey` or `get` -- we can just jump directly to data[key] to figure out the return answer.
CSE 373 SP 22 - CHAMPION
28
DirectAccessMap<Integer, V>
put put item at given index
get get item at given index
containsKey if data[] null at index, return false, return true otherwise
remove nullify element at index
size return count of items in dictionary
state
behavior
Data[]
size
indices | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
data | | | | (3, Sherdil) | | | | | | |
put(3, “Sherdil”);
get(3);
Can we do better? -- Direct Access Map impl.
return this.array[key];
CSE 373 SP 22 - CHAMPION
29
DirectAccessMap<Integer, V>
put put item at given index
get get item at given index
containsKey if data[] null at index, return false, return true otherwise
remove nullify element at index
size return count of items in dictionary
state
behavior
Data[]
size
Operation | Array w/ indices as keys | |
put(key,value) | best | 𝚹(1) |
worst | 𝚹(1) | |
get(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
containsKey(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
Direct Access Map tradeoffs:
CSE 373 20 SP – CHAMPION & CHUN
- what’s a benefit of using DirectAccessMap?
- what’s a bad thing when using DirectAccessMap?
Can we do this for any integer?
CSE 373 SU 19 - ROBBIE WEBER
31
Hash functions: translating a piece of data to an int
CSE 373 20 SP – CHAMPION & CHUN
Hash function definition
“review”: Integer remainder with % “mod”
14 % 4 is 2� � 3 43� 4 ) 14 5 ) 218� 12 20� 2 18� 15� 3
CSE 142 SP 18 – BRETT WORTZMAN
33
218 % 5 is 3
For more review/practice, check out https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/what-is-modular-arithmetic
Limit keys to indices within array
Equivalently, to find a % b (for a,b > 0):
while(a > b-1)
a -= b;
return a;
First Hash Function: % table size
CSE 373 SP 22 - CHAMPION
34
indices | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
elements | | | | | | | | | | |
put(0, “foo”);
put(5, “bar”);
put(11, “biz”)
put(18, “bop”);
“foo”
0 % 10 = 0
5 % 10 = 5
11 % 10 = 1
18 % 10 = 8
“bop”
“bar”
“biz”
Implement First Hash Function
CSE 373 SU 19 - ROBBIE WEBER
35
SimpleHashMap<Integer>
put mod key by table size, put item at result
get mod key by table size, get item at result
containsKey mod key by table size, return data[result] == null remove mod key by table size, nullify element at result
size return count of items in dictionary
state
behavior
Data[]
size
Operation | Array w/ indices as keys | |
put(key,value) | best | 𝚹(1) |
worst | 𝚹(1) | |
get(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
containsKey(key) | best | 𝚹(1) |
worst | 𝚹(1) | |
Note: % is just a math operator like +, -, /, *, so it’s constant runtime
Questions?
things we talked about:
First Hash Function: % table size
CSE 373 SP 22 - CHAMPION
37
indices | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
elements | | | | | | | | | | |
put(0, “foo”);
put(5, “bar”);
put(11, “biz”)
put(18, “bop”);
Collision!
“foo”
0 % 10 = 0
5 % 10 = 5
11 % 10 = 1
18 % 10 = 8
20 % 10 = 0
“bop”
“bar”
“biz”
“:(”
put(20, “:(”);
Hash Obsession: Collisions
CSE 373 SU 19 - ROBBIE WEBER
38
Roadmap for lecture content today
CSE 373 20 SP – CHAMPION & CHUN
Strategies to handle hash collision
CSE 373 AU 18 – SHRI MARE
40
Separate chaining
CSE 373 ROBBIE WEBER + HANNAH TANG
41
Separate chaining
public boolean containsKey(int key) {
int bucketIndex = key % data.length;
loop through data[bucketIndex]
return true if we find the key in
data[bucketIndex]
return false if we get to here (didn’t
find it)
}
CSE 373 ROBBIE WEBER + HANNAH TANG
42
Reminder: the implementations of put/get/containsKey are all very similar, and almost always will have the same complexity class runtime
runtime analysis
Are there different possible states for our Hash Map that make this code run slower/faster, assuming there are already n key-value pairs being stored?
Yes! If we had to do a lot of loop iterations to find the key in the bucket, our code will run slower.
A best case situation for separate chaining
CSE 373 20 SP – CHAMPION & CHUN
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(0, b)
(2, b)
(3, b)
(4, b)
(5, b)
(6, b)
(7, b)
(8, b)
It’s possible (and likely if you follow some best-practices) that everything is spread out across the buckets pretty evenly. This is the opposite of the last slide: when we have minimal collisions, our runtime should be less. For example, if we have a bucket with only 0 or 1 element in it, checking containsKey for something in that bucket will only take a constant amount of time.
We’re going to try a lot of stuff we can to make it more likely we achieve this beautiful state ☺.
In-practice situations for separate chaining
CSE 373 20 SP – CHAMPION & CHUN
Operation | Array w/ indices as keys | |
put(key,value) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | 𝚹(n) | |
get(key) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | 𝚹(1) | |
remove(key) | best | 𝚹(1) |
In-practice | 𝚹(1) | |
worst | O(n) | |
Reminder: the in-practice runtimes are assuming an even distribution of the keys inside the array and following of best-practices to ensure the average chain length is low.
Best practices (pay attention to this for the hw)
CSE 373 20 SP – CHAMPION & CHUN
It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.
Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?
Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.
What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.
Best practices (pay attention to this for the hw)
CSE 373 20 SP – CHAMPION & CHUN
It turns out we still want to resize “every so often” to make sure the average/expected length of each bucket is a small number.
Consider what happens if we had the array length 10 like on the left, but had 100 key-value pairs?
Assuming our in-practice niceness (not-worst case) you would expect on average each of the 10 buckets has about 10 key-value pairs in it.
What happens if we stick with the same size array but add 100 more key-value pairs? Each bucket gets about 10 more –key-value pairs and the runtime is getting worse and worse.
The pattern we’re getting to is that the expected runtime is approximately: # of pairs / array.length (AKA n / c where n is the number of elements and c is the number of possible chains). If array.length is fixed for your whole program, then this is an order-n runtime, but if the array.length also increases (because you re-size) and you redistribute out the values evenly across the buckets, you can keep your runtime low. In particular, if you resize when when your n / c ratio increases to about 1, you’re expected to have 1 element or fewer in each bucket at all times. (do this on your homework).
Tip: make sure you re-hash (re-distribute) your keys by the new array length after re-sizing so they don’t get clustered in the old array length range.
Lambda + resizing rephrased
However, if you resize once you hit that 1:1 threshold, the current λ is expected to be less than 1 (which is a constant / constant runtime, so we can simplify to O(1)).
CSE 373 SU 19 - ROBBIE WEBER
47
Operation | Array w/ indices as keys | |
put(key,value) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
get(key) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
remove(key) | best | O(1) |
In-practice | O(λ) | |
worst | O(n) | |
What about non integer keys?
CSE 373 SU 19 - ROBBIE WEBER
48
Hash function definition
(Before we % by length, we have to convert the data into an int)
}
CSE 373 SU 19 - ROBBIE WEBER
49
Pro: super fast
Con: lots of collisions!
Pro: still really fast
Con: some collisions
Pro: few collisions
Con: slow, gigantic integers
Java’s hashCode (relevant for project)
CSE 373 20 SP – CHAMPION & CHUN
Best practices for an nice distribution of keys recap
CSE 373 20 SP – CHAMPION & CHUN
Practice
CSE 373 SU 19 - ROBBIE WEBER
52
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(1, a)
(5, b)
(11, a)
(17, f)
(1, g)
(12, e)
(7, d)
(25, h)
Practice
CSE 373 SU 19 - ROBBIE WEBER
53
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| | | | | | | | | |
(“a”, 1)
(“abcd”, 5)
(“c”, 3)
(“five”, 7)
(“abc”, 4)
(“ab”, 2)
(“hello world”, 8)
(“abcdabcd”, 6)
Java and Hash Functions
If a.equals(b) is true then a.hashCode() == b.hashCode() MUST also be true
CSE 373 SU 19 - ROBBIE WEBER
54