1 of 46

Discussion 5

Sorting and Hashing

CS 186 - Fall 2017

TA: Tammy Nguyen

tmmydngyn.me

2 of 46

Agenda

  • Administrivia
  • Sorting
  • Hashing - not on midterm!
  • Midterm questions

3 of 46

Administrivia

Homework 2 was due yesterday, be aware of your slip hours

Midterm 1 tomorrow during lecture�→ Covers up to yesterday’s lecture and today’s discussion

4 of 46

Sorting

5 of 46

Why sorting?

  • Group similar records
    • GROUP BY
    • Eliminating duplicates for DISTINCT
    • Sort/Hash algorithm (later)

  • Order records
    • ORDER BY
    • Bulk-loading B+ trees

6 of 46

Sorting specs

Parameters

  • A file F
    • containing a multiset of records R
    • N: number of blocks that file takes up on disk
  • Two disks each with >> N blocks of free storage
  • Memory capacity in RAM
    • B: number of blocks that fit into RAM buffers

Output

  • A file FS
    • with contents R stored in order by a given sorting criterion

7 of 46

Terminology

Internal sort: sorting a collection in main memory�→ a.k.a. in-place sort�→ e.g. quick sort, heap sort, insertion sort

External sort: sorting a collection that doesn’t fit into main memory�→ a.k.a. out-of-core sort�→ e.g. general external merge sort

8 of 46

External merge sort

Idea: Conquer and combine

Algorithm:

  • Pass 0: Internal sort ceil(N/B) runs of size B
  • Pass 1, 2, …, etc.: Merge B - 1 runs at a time into output buffer to runs of size

Parameters

N: # of pages in file

B: # of pages that fit in main memory

B

Pass 0

Pass 1, 2, …, etc.

Run ⌈N/B⌉

...

Run 2

Run 1

1

2

B-1

...

...

9 of 46

Conquer phase

write

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

Parameters

N: 10 pages

B: 4 pages

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

Pull B pages of the file into main memory at a time.

7, 4, 1

10, 2, 8

10 of 46

Conquer phase

write

Parameters

N: 8 pages

B: 4 pages

Internally sort B pages.

2, 2, 3

3, 5, 6

7, 8, 8

9, 10, 11

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

11 of 46

Conquer phase

Parameters

N: 8 pages

B: 4 pages

Write the sorted run to disk.

read

2, 2, 3

3, 5, 6

7, 8, 8

9, 10, 11

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

write

12 of 46

Conquer phase

Parameters

N: 10 pages

B: 4 pages

Repeat until all pages of the file have been turned into sorted runs.

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

write

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

13 of 46

Conquer phase

Parameters

N: 10 pages

B: 4 pages

Repeat until all pages of the file have been turned into sorted runs.

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

write

2, 3, 4

5, 5, 6

7, 8, 8

9, 16, 19

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

14 of 46

Conquer phase

Parameters

N: 10 pages

B: 4 pages

Repeat until all pages of the file have been turned into sorted runs.

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

write

7, 4, 1

10, 2, 8

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

15 of 46

Conquer phase

Parameters

N: 10 pages

B: 4 pages

Repeat until all pages of the file have been turned into sorted runs.

read

8, 6, 10

11, 5, 7

2, 9, 3

3, 8, 2

19, 6, 8

9, 3, 8

6, 16, 3

5, 2, 3

7, 4, 1

10, 2, 8

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

write

1, 2, 4

7, 2, 8

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

16 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

2

2

1

1

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

17 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

2

2

2

2

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

18 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

2

2

4

2

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

19 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

2

2

4

2

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

20 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

3

2

4

2

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

21 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

3

3

4

3

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

22 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

3

3

4

3

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

23 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

5

3

4

3

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

24 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

5

4

4

4

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

25 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

5

5

4

4

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4

26 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

5

5

7

5

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5

27 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Stream B-1 runs into memory and merge into output buffer.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

6

5

7

5

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5

28 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

And so on...

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

6

5

7

5

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5

29 of 46

Combine phase

Parameters

N: 10 pages

B: 4 pages

Until we have merged B-1 runs.

read

write

2, 2, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11

2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 16, 19

1, 2, 4, 7, 8, 10

1

* for simplicity, to illustrate this phase, assume that a single page can only hold one integer (vs. 3 in previous slides)

1, 2

1, 2, 2

1, 2, 2, 2

1, 2, 2, 2, 2

1, 2, 2, 2, 2, 3

1, 2, 2, 2, 2, 3, 3

1, 2, 2, 2, 2, 3, 3, 3

1, 2, 2, 2, 2, 3, 3, 3, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5

1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 11, 16, 19

30 of 46

What happens after each pass?

B

Pass 0

Pass 1, 2, …, etc.

Run ⌈N/B⌉

...

Run 2

Run 1

1

2

B-1

...

...

How big are these runs?

How many runs are left after each pass?

31 of 46

Pass 0

B

B

B

B

B

B

B

B

1

2

3

4

5

6

⌈N/B⌉

...

⌈N/B⌉-1

Pass 1

B(B - 1)

B(B - 1)

B(B - 1)

B(B - 1)

1

2

3

⌈N/B⌉ / (B - 1)

...

Pass 2

B(B - 1)(B - 1)

⌈N/B⌉ / (B - 1)2 runs

B(B - 1)(B - 1)

...

Pass ???

N

sorted file

.

.

.

N

original file

Total # passes:

1+ ⌈logB-1⌈N/B⌉⌉

Sort pass

Merge passes

32 of 46

What happens after each pass?

B

Pass 0

Pass 1, 2, …, etc.

Run ⌈N/B⌉

...

Run 2

Run 1

1

2

B-1

...

...

How big are these runs?

How many runs are left after each pass?

(B-1) * (size of run from previous pass)

(num previous runs) / (B-1)

33 of 46

Cost of the algorithm

B

Pass 0

Pass 1, 2, …, etc.

Run ⌈N/B⌉

...

Run 2

Run 1

1

2

B-1

...

...

How many passes does it take to merge ⌈N/B⌉ runs into a single file B-1 pages at a time?

1 + ⌈logB-1⌈N/B⌉⌉

Each pass, we read and write the entire file of size N.

→ 2N I/Os per pass

2N * (1 + ⌈logB-1⌈N/B⌉⌉)

34 of 46

Worksheet

35 of 46

Problems

You have 4 buffer pages and 108 pages of records to sort.

  • How many passes would it take to sort the file?
  • How many sorted runs does each pass produce?
  • What is the total cost for this sort process in terms of I/O?
  • If each of the data pages were already sorted individually, how many passes would it take to sort the file and how many IOs would it be instead?
  • Given an arbitrary N number of buffer pages, if we could sort the file in 3 passes, what order of magnitude is the size of the file in terms of number of pages?

36 of 46

Problems

You have 4 buffer pages and 108 pages of records to sort.

  • How many passes would it take to sort the file?
  • How many runs does each pass produce?

Pass 0: ceil(108/4) = 27 sorted runs of 4 pages each

Pass 1: ceil(27/3) = 9 sorted runs of 12 pages each

Pass 2: ceil(9/3) = 3 sorted runs of 36 pages each

Pass 3: ceil(3/4) = 1 sorted run = sorted file

→ 4 passes

37 of 46

Problems

You have 4 buffer pages and 108 pages of records to sort.

  • What is the total cost for this sort process in terms of I/O?

The algorithm takes 4 passes.

Each pass, we read and write all 108 pages of the file.

→ 2(108)*4 = 864 I/Os

  • If each of the data pages were already sorted individually, how many passes would it take to sort the file and how many IOs would it be instead?

There would be no change.

The first pass splits up the file into 27 runs, they’re just already sorted.

38 of 46

Problems

You have 4 buffer pages and 108 pages of records to sort.

  • Given an arbitrary B number of buffer pages, if we could sort the file in 3 passes, what order of magnitude is the size of the file in terms of number of pages?

formula: num passes = 1 + ⌈logB-1⌈N/B⌉⌉

3 = 1 + logB-1⌈N/B⌉

2B-1 = ⌈N/B⌉

→ N = B(2B-1)

39 of 46

Hashing

not on midterm!

40 of 46

Why hashing?

  • Group similar records
    • GROUP BY
    • Eliminating duplicates for DISTINCT

41 of 46

Hashing specs

Parameters

  • A file F
    • containing a multiset of records R
    • N: number of blocks that file takes up on disk
  • Two disks each with >> N blocks of free storage
  • Memory capacity in RAM
    • B: number of blocks that fit into RAM buffers

Output

  • Hashed file FH
    • R arranged on disk s.t. no 2 records that are incomparable are separated by a greater or smaller record.
    • i.e. matching records are always “stored consecutively”

42 of 46

External hashing

Idea: Divide and conquer

Algorithm:

  • Pass 0: Use hash function hp to stream records to disk partitions
  • Pass 1: Read each partition into RAM hash table and rehas using hash function hr

Parameters

N: # of pages in file

B: # of pages that fit in main memory

B

Pass 0

Pass 1

Part. B-1

...

Part. 2

Part. 1

1

2

B-1

...

...

43 of 46

Cost of the algorithm

In two passes, we can hash a file of size B(B-1). Use recursive partitioning for bigger files.

Each pass, we read and write the entire file of size N.

→ 2N I/Os per pass

2N * 2 = 4N I/Os

B

Pass 0

Pass 1

Part. B-1

...

Part. 2

Part. 1

1

2

B-1

...

...

44 of 46

Worksheet

45 of 46

Discussion

  • If we had records that we wanted to hash where the key being used was random integer values distributed uniformly, would the length of the integer be a good hash key? Why?�
  • What are some of the use-cases of hashing is preferred over sorting?

46 of 46

Discussion

  • If we had records that we wanted to hash where the key being used was random integer values distributed uniformly, would the length of the integer be a good hash key? Why?

No, # of keys for different buckets varies greatly which will lead to uneven � buckets.� e.g. key = 1 ⇒ 10 values, key = 3 ⇒ 1000 values

  • What are some of the use-cases of hashing is preferred over sorting?

When we don’t care about order, just groups.� e.g. removing duplicates