Discussion 5
Sorting and Hashing
CS 186 - Fall 2017
TA: Tammy Nguyen
tmmydngyn.me
Agenda
Administrivia
Homework 2 was due yesterday, be aware of your slip hours
Midterm 1 tomorrow during lecture�→ Covers up to yesterday’s lecture and today’s discussion
Sorting
Why sorting?
Sorting specs
Parameters
Output
Terminology
Internal sort: sorting a collection in main memory�→ a.k.a. in-place sort�→ e.g. quick sort, heap sort, insertion sort
External sort: sorting a collection that doesn’t fit into main memory�→ a.k.a. out-of-core sort�→ e.g. general external merge sort
External merge sort
Idea: Conquer and combine
Algorithm:
Parameters
N: # of pages in file
B: # of pages that fit in main memory
B
Pass 0
Pass 1, 2, …, etc.
Run ⌈N/B⌉
...
Run 2
Run 1
1
2
B-1
...
...
Conquer phase
write
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
Parameters
N: 10 pages
B: 4 pages
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
Pull B pages of the file into main memory at a time.
7, 4, 1
10, 2, 8
Conquer phase
write
Parameters
N: 8 pages
B: 4 pages
Internally sort B pages.
2, 2, 3
3, 5, 6
7, 8, 8
9, 10, 11
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
Conquer phase
Parameters
N: 8 pages
B: 4 pages
Write the sorted run to disk.
read
2, 2, 3
3, 5, 6
7, 8, 8
9, 10, 11
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
write
Conquer phase
Parameters
N: 10 pages
B: 4 pages
Repeat until all pages of the file have been turned into sorted runs.
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
write
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
Conquer phase
Parameters
N: 10 pages
B: 4 pages
Repeat until all pages of the file have been turned into sorted runs.
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
write
2, 3, 4
5, 5, 6
7, 8, 8
9, 16, 19
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
Conquer phase
Parameters
N: 10 pages
B: 4 pages
Repeat until all pages of the file have been turned into sorted runs.
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
write
7, 4, 1
10, 2, 8
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
Conquer phase
Parameters
N: 10 pages
B: 4 pages
Repeat until all pages of the file have been turned into sorted runs.
read
8, 6, 10
11, 5, 7
2, 9, 3
3, 8, 2
19, 6, 8
9, 3, 8
6, 16, 3
5, 2, 3
7, 4, 1
10, 2, 8
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
write
1, 2, 4
7, 2, 8
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
2
2
1
1
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
2
2
2
2
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
2
2
4
2
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
2
2
4
2
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
3
2
4
2
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
3
3
4
3
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
3
3
4
3
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
5
3
4
3
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
5
4
4
4
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
5
5
4
4
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
5
5
7
5
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5
Combine phase
Parameters
N: 10 pages
B: 4 pages
Stream B-1 runs into memory and merge into output buffer.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
6
5
7
5
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5
Combine phase
Parameters
N: 10 pages
B: 4 pages
And so on...
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
6
5
7
5
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5
Combine phase
Parameters
N: 10 pages
B: 4 pages
Until we have merged B-1 runs.
read
write
2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11
2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19
1, 2, 4, 7, 8, 10
1
* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)
1, 2
1, 2, 2
1, 2, 2, 2
1, 2, 2, 2, 2
1, 2, 2, 2, 2, 3
1, 2, 2, 2, 2, 3, 3
1, 2, 2, 2, 2, 3, 3, 3
1, 2, 2, 2, 2, 3, 3, 3, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5
1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 11, 16, 19
What happens after each pass?
B
Pass 0
Pass 1, 2, …, etc.
Run ⌈N/B⌉
...
Run 2
Run 1
1
2
B-1
...
...
How big are these runs?
How many runs are left after each pass?
Pass 0
B
B
B
B
B
B
B
B
1
2
3
4
5
6
⌈N/B⌉
...
⌈N/B⌉-1
Pass 1
B(B - 1)
B(B - 1)
B(B - 1)
B(B - 1)
1
2
3
⌈N/B⌉ / (B - 1)
...
Pass 2
B(B - 1)(B - 1)
⌈N/B⌉ / (B - 1)2 runs
B(B - 1)(B - 1)
...
Pass ???
N
sorted file
.
.
.
N
original file
Total # passes:
1+ ⌈logB-1⌈N/B⌉⌉
Sort pass
Merge passes
What happens after each pass?
B
Pass 0
Pass 1, 2, …, etc.
Run ⌈N/B⌉
...
Run 2
Run 1
1
2
B-1
...
...
How big are these runs?
How many runs are left after each pass?
(B-1) * (size of run from previous pass)
(num previous runs) / (B-1)
Cost of the algorithm
B
Pass 0
Pass 1, 2, …, etc.
Run ⌈N/B⌉
...
Run 2
Run 1
1
2
B-1
...
...
How many passes does it take to merge ⌈N/B⌉ runs into a single file B-1 pages at a time?
→ 1 + ⌈logB-1⌈N/B⌉⌉
Each pass, we read and write the entire file of size N.
→ 2N I/Os per pass
2N * (1 + ⌈logB-1⌈N/B⌉⌉)
Worksheet
Problems
You have 4 buffer pages and 108 pages of records to sort.
Problems
You have 4 buffer pages and 108 pages of records to sort.
Pass 0: ceil(108/4) = 27 sorted runs of 4 pages each
Pass 1: ceil(27/3) = 9 sorted runs of 12 pages each
Pass 2: ceil(9/3) = 3 sorted runs of 36 pages each
Pass 3: ceil(3/4) = 1 sorted run = sorted file
→ 4 passes
Problems
You have 4 buffer pages and 108 pages of records to sort.
The algorithm takes 4 passes.
Each pass, we read and write all 108 pages of the file.
→ 2(108)*4 = 864 I/Os
There would be no change.
The first pass splits up the file into 27 runs, they’re just already sorted.
Problems
You have 4 buffer pages and 108 pages of records to sort.
formula: num passes = 1 + ⌈logB-1⌈N/B⌉⌉
3 = 1 + logB-1⌈N/B⌉
2B-1 = ⌈N/B⌉
→ N = B(2B-1)
Hashing
not on midterm!
Why hashing?
Hashing specs
Parameters
Output
External hashing
Idea: Divide and conquer
Algorithm:
Parameters
N: # of pages in file
B: # of pages that fit in main memory
B
Pass 0
Pass 1
Part. B-1
...
Part. 2
Part. 1
1
2
B-1
...
...
Cost of the algorithm
In two passes, we can hash a file of size B(B-1). Use recursive partitioning for bigger files.
Each pass, we read and write the entire file of size N.
→ 2N I/Os per pass
2N * 2 = 4N I/Os
B
Pass 0
Pass 1
Part. B-1
...
Part. 2
Part. 1
1
2
B-1
...
...
Worksheet
Discussion
Discussion
No, # of keys for different buckets varies greatly which will lead to uneven � buckets.� e.g. key = 1 ⇒ 10 values, key = 3 ⇒ 1000 values
When we don’t care about order, just groups.� e.g. removing duplicates