1 of 98

Announcements

  1. Nov 1st Midterm (dates/details on the cs145 website)
    1. Material covered through the Oct 20th lecture (the Oct 25th lecture is tested in the Final, not the Midterm)
    2. Exam format: ~5 questions (SQL, systems, material from the 1st 8 lectures)
    3. Questions similar to HW1, HW2, Project1 (SQL); i.e., working problems is the best prep. A practice test goes out ~Oct 27
    4. HW2 (out Oct 19) is due Oct 26th at 11:59 pm (no late days)
      1. Shorter than HW1 (some problems moved to HW3)
      2. Oct 21: section for HW2
      3. Oct 27: HW2 solutions released at 11:59 pm
      4. OAE students who need an extra day: please plan ahead now
    5. Oct 27th: test prep in class

(Timeline: HW2 out → HW2 section → HW2 due → Midterm review)

2 of 98

Announcements

Fill in the Pulse survey today

Wide class mix

  • 85%+ CS + Engineering
  • 60% Junior through CoTerm
  • Good CS exposure, but no intermediate classes yet

Need more help on RAM algorithms + OS concepts (~15%)? Come to OH with questions, e.g., if you are:

  • non-CS?
  • pre-Junior?

If you find the material basic (~10%):

  • Use Project3
  • Is CS245 a better fit?
  • Stop by my OH

3 of 98

Where we are

Week 3

  • Learned the basics of IO-aware algorithms and indexes

Week 4

  • Problem 1: How to search Amazon’s product catalog in < 1sec?

  • Problem 2: JOIN queries are slow and expensive.
    • How do we speed up by 1000x? (Today)
    • How do we speed up by another 100x? (Thursday)

Week 5

  • Problem 3: Build a Product Recommendation system (e.g., Amazon’s)

4 of 98

Q: What’s index size for Amazon’s product catalog?

(Wrapping up B+ trees example)

5 of 98

Recall: Cost Model for Indexes [baseline, simplest model]

(Diagram: N index pages (leaf) of <value, record location> entries, with fanout 'f', pointing into M data pages. The "real" data layout holds full records (including cname, prices, etc.) across Page 1 ... Page n on RAM/disk. Example lookups: where is 19? where is 33? (e.g., disk block 0x45AB).)

Example 1: Search Amazon's 1 trillion products

    • SKeySize = 8 bytes
    • PointerSize = 8 bytes
    • NumSKeys = 1 trillion

AMAZING: worst case for 1 trillion SKeys:

  • 2 IOs for the index (with 64MB pages)
  • 1 IO for the data page

PageSize    f             h
64KB        ~4000         ~4
64MB        ~4 million    ~2
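The heights in the table follow from h ~= ⌈log_f NumSKeys⌉. A minimal sketch (Python; `btree_height` is an illustrative helper, not from the slides) to check both rows:

```python
import math

def btree_height(num_skeys, fanout):
    # Levels of index nodes needed so that fanout**h >= num_skeys,
    # using the constant-fanout engineering approximation.
    return math.ceil(math.log(num_skeys, fanout))

N = 10**12                              # 1 trillion SKeys
print(btree_height(N, 4_000))           # 64KB pages, f ~= 4000  -> 4
print(btree_height(N, 4_000_000))       # 64MB pages, f ~= 4M    -> 2
```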

6 of 98

B+ tree Levels and node sizes

(Diagram: B+ tree with f^(h-1) leaf nodes, collectively pointing at up to f^h SKeys overall; M data pages of <value, record location> entries. Example lookups: where is 19? where is 33? (e.g., disk block 0x45AB).)

Consider an example with f ~= 4000 and PageSize = 64KB:

Level      Num Nodes                Size
0 (root)   1                        64 KB
1          ~4000 (f)                ~4000 * 64KB
2          ~4000^2 (f^2)            ~4000^2 * 64KB
3          ~4000^3 (f^3)            ~4000^3 * 64KB
...
h-1        ~4000^(h-1) (f^(h-1))    ~4000^(h-1) * 64KB

(The f^(h-1) nodes at level h-1 collectively point at up to f^h SKeys.)

Recall: we use the fanout 'f' as a constant, for simplicity. The algorithm uses 'f' keys and 'f+1' pointers; our engineering approximation uses 'f' for both (i.e., f+1 ~= f).

7 of 98

Search cost of B+ Tree (on RAM + Disk)

Not all nodes pictured

Read the 1st levels into the RAM buffer; keep the rest of the index on disk.

Use B (the RAM size) to keep the top few levels (e.g., the 1st 2-3 levels; overall, as many nodes as will fit in B, because they're 'zero' cost and speed up queries).
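"As many nodes as will fit in B" can be sketched as a small loop (Python; `cached_levels` is a hypothetical helper, assuming one node per page and constant fanout):

```python
def cached_levels(B, fanout):
    # Count how many complete top levels of the B+ tree fit in a
    # RAM buffer of B pages; level k has ~fanout**k one-page nodes.
    total, level = 0, 0
    while total + fanout ** level <= B:
        total += fanout ** level
        level += 1
    return level

print(cached_levels(100, 4000))    # only the root level fits -> 1
print(cached_levels(4001, 4000))   # root + all of level 1 fit -> 2
```

Every cached level saves one IO per lookup, which is why the top levels go into RAM first.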

8 of 98

Problem: How to make SQL queries fast?

Optimize!

9 of 98

Approximations

Complexity vs. Engineering Approximations

O(n) vs. O(n^2): Big-"O" notation for algorithms has a similar goal. We often approximate B ~= B+1, or f ~= f+1, for simpler equations.

Is that OK?

  • Yup, for "modern" ~2010+ machines
  • When B ~= 100, (B+1)/B ~= 1.01

Models of real-world systems are complex ⇒

  • IO cost equations are often 'dense.'
  • We will approximate and simplify (keeping most of the needed accuracy), so we can focus on the core insights.

10 of 98

Example: Basic SFW queries

Workload description (query rates: 200 times/sec and 100 times/sec):

SELECT pname
FROM Product
WHERE year = ? AND category = ?
  AND manufacturer = ?

SELECT pname
FROM Product
WHERE year = ? AND category = ?

  1. How to execute? Sort, hash first, ...?
  2. Maintain indexes for year? category? manufacturer?
  3. Per query, check multiple indexes?
  4. What's the cost of maintaining an index?
  5. Use multiple machines? ...

Intuition: manufacturer is likely the most selective. There are few categories and many more manufacturers. Maintain an index if this query happens a lot ⇒ lower cost (query and update cost).

11 of 98

Optimization Roadmap

  1. For SFW, JOIN queries:
    1. Sort? Hash? Count? Brute-force?
    2. Pre-build an index? B+ tree? Hash?
  2. What statistics can I keep to optimize?
    • E.g., selectivity of columns, values
  3. Cost in IO, resources? To query? To maintain?

Build query plans → Analyze plans

12 of 98

What you will learn about in this section

  1. RECAP: Joins

  • Nested Loop Join (NLJ)

  • Block Nested Loop Join (BNLJ)

  • Index Nested Loop Join (INLJ)

13 of 98

Problem: JOINs are slow. How to make them fast?

Optimize!

14 of 98

Semantically: A Subset of the Cross Product

SELECT R.A, B, C, D
FROM R, S
WHERE R.A = S.A

R:                S:
A  B  C           A  D
1  0  1           3  7
2  3  4           2  2
2  5  2           2  3
3  1  1

Cross product, then filter by the conditions (R.A = S.A):

A  B  C  D
2  3  4  2
2  3  4  3
2  5  2  2
2  5  2  3
3  1  1  7

Can we actually implement a join in this way?

 

15 of 98

Joins: Example

SELECT R.A, B, C, D
FROM R, S
WHERE R.A = S.A

R:                S:
A  B  C           A  D
1  0  1           3  7
2  3  4           2  2
2  5  2           2  3
3  1  1

The result builds up one matching pair at a time:

(2,3,4,2), (2,3,4,3), (2,5,2,2), (2,5,2,3), (3,1,1,7)

20 of 98

A simple way to execute JOINs:

Nested Loop Joins

21 of 98

Notation for JOIN cost estimates

We consider "IO-aware" algorithms: we care about IO cost.

Given a relation R, let:

    • T(R) = # of tuples in R
    • P(R) = # of pages in R

We'll assume B buffer pages for input (B usually << P(R), P(S))

  • +1 for output ← imagine there's always 1 extra page

Recall that we read / write entire pages (blocks).

We'll see lots of formulae from now on

⇒ Hint: Focus on how each algorithm works. It's much easier to derive the formula from 1st principles (vs. recalling formula soup).

22 of 98

Recall

Goal: Compute Census JOIN TaxInfo, for 1 billion people.

Census (SSN, Address, ...) and TaxInfo (SSN, TaxPaid, ...) are each stored as row stores, 1000 tuples/page (a million pages each).

Steps (repeat till done):
  • Read tuples from Census and TaxInfo into the buffer
  • Output matching tuples

23 of 98

Nested Loop Join (NLJ)

 

Note: This is our 1st basic algorithm. We use this baseline to introduce cost models. We'll see much better ones soon.

Is this

  • (a) Correct, (b) Incorrect?
  • (c) Fast, (d) Slow?

24 of 98

Nested Loop Join (NLJ)

 

  1. Loop over the tuples in R

We read all T(R) tuples, but they are in P(R) pages. Cost to read = P(R).

Cost: P(R)

25 of 98

Nested Loop Join (NLJ)

 

  1. Loop over the tuples in R
    • For every tuple in R, loop over all the tuples in S

It's T(R)*P(S) because, per tuple in R (i.e., T(R) times), we read every page of S (i.e., P(S)).

Cost: P(R) + T(R)*P(S)

26 of 98

Nested Loop Join (NLJ)

 

  1. Loop over the tuples in R
    • For every tuple in R, loop over all the tuples in S
    • Check against the join conditions

Note that NLJ can handle things other than equality constraints... just check in the if statement!

Cost: P(R) + T(R)*P(S)

27 of 98

Nested Loop Join (NLJ)

 

  1. Loop over the tuples in R
    • For every tuple in R, loop over all the tuples in S
    • Check against the join conditions
    • Write out (to a page; then, when the page is full, to disk)

Cost: P(R) + T(R)*P(S) + OUT

What would OUT be if our join condition is trivial (if TRUE)?

OUT = T(R)*T(S) tuples, split across multiple pages. That could be bigger than P(R)*P(S)... but usually it's not that bad.
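The steps above can be sketched as a toy in-memory version (Python; real systems work page-at-a-time, and the tuple values are taken from the earlier R ⋈ S example):

```python
def nested_loop_join(R, S, condition):
    # Tuple-at-a-time NLJ: for every tuple of R, re-scan all of S.
    # IO cost (pages): P(R) + T(R)*P(S) + OUT.
    out = []
    for r in R:                        # read R once: P(R) IOs
        for s in S:                    # re-scan S per R tuple: T(R)*P(S) IOs
            if condition(r, s):        # any predicate works, not just equality
                out.append(r + s[1:])  # write out, page-at-a-time in practice
    return out

R = [(1, 0, 1), (2, 3, 4), (2, 5, 2), (3, 1, 1)]   # (A, B, C)
S = [(3, 7), (2, 2), (2, 3)]                        # (A, D)
rows = nested_loop_join(R, S, lambda r, s: r[0] == s[0])
print(sorted(rows))
```

This reproduces the five result rows from the cross-product example.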

28 of 98

Nested Loop Join (NLJ)

 

Cost: P(R) + T(R)*P(S) + OUT

What if R ("outer") and S ("inner") are switched?

Cost: P(S) + T(S)*P(R) + OUT

Outer vs. inner selection makes a huge difference: the DBMS needs to know which relation is smaller!

29 of 98

IO-Aware Approach

30 of 98

Problem: JOINs are slow. How to make them fast?

Optimize – IO aware

(1000x faster than NLJ?)

31 of 98

Intuition: Block Nested Loop Joins

Goal: Compute Census JOIN TaxInfo, for 1 billion people. Data stored in row stores, 1000 tuples/page (a million pages each).

Given: B buffer pages for input, + 1 for output.

Idea: Use B-1 pages for Census, 1 page for TaxInfo.

Steps (repeat till done):
  • Read B-1 pages from Census into the buffer
  • Read 1 page from TaxInfo
  • Partial join into 1 output page

32 of 98

Block Nested Loop Join (BNLJ)

 

Assume B buffer pages for input, + 1 for output (for B << P(R), P(S)).

  1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output)

Cost: P(R)

Why P(R)?

  1. Overall, we read all P(R) pages, even if it's B-1 at a time.
  2. There could be some speedup here, since we're reading multiple pages sequentially; however, we'll ignore that here!

33 of 98

Block Nested Loop Join (BNLJ)

 

  1. Load in B-1 pages of R at a time (leaving 1 page for S)
    • For each (B-1)-page segment of R, load each page of S

How many R pages do we read? P(R)
How many times is each S page read? P(R)/B

Cost: P(R) + P(R)*P(S)/B

Notes:

  1. It's faster to iterate over the smaller relation first!
  2. We use 1/B in the denominator as our engineering approximation (vs. 1/(B-1)).

34 of 98

Block Nested Loop Join (BNLJ)

 

  1. Load in B-1 pages of R at a time (leaving 1 page free for S)
    • For each (B-1)-page segment of R, load each page of S
    • Check against the join conditions

BNLJ can also handle non-equality constraints.

Cost: P(R) + P(R)*P(S)/B

35 of 98

Block Nested Loop Join (BNLJ)

 

  1. Load in B-1 pages of R at a time (leaving 1 page free for S)
    • For each (B-1)-page segment of R, load each page of S
    • Check against the join conditions
    • Write out

Cost: P(R) + P(R)*P(S)/B + OUT
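A minimal page-based sketch of these steps (Python; the page layout and helper names are illustrative, not from the slides):

```python
def block_nested_loop_join(R_pages, S_pages, B):
    # Load B-1 pages of R at a time, then stream S one page at a time
    # and join in memory.  IO cost (pages): ~ P(R) + P(R)*P(S)/B + OUT.
    out, chunk = [], B - 1
    for i in range(0, len(R_pages), chunk):          # P(R) reads overall
        block = [r for page in R_pages[i:i + chunk] for r in page]
        for s_page in S_pages:                       # P(S) reads per chunk
            for s in s_page:
                for r in block:
                    if r[0] == s[0]:                 # equijoin on A
                        out.append(r + s[1:])
    return out

R_pages = [[(1, 0, 1), (2, 3, 4)], [(2, 5, 2), (3, 1, 1)]]
S_pages = [[(3, 7), (2, 2)], [(2, 3)]]
rows = block_nested_loop_join(R_pages, S_pages, B=2)
print(sorted(rows))
```

Same output as NLJ; the only change is that S is re-read once per B-1-page chunk of R instead of once per tuple.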

36 of 98

BNLJ vs. NLJ: Benefits of IO Aware

In BNLJ, by loading larger chunks of R, we minimize the number of full disk reads of S

    • Still the full cross-product, but more done only in memory

NLJ:  P(R) + T(R)*P(S) + OUT
BNLJ: P(R) + P(R)*P(S)/B + OUT

BNLJ is faster by ~ B*T(R)/P(R)

37 of 98

Example NLJ vs. BNLJ: Steel Cage Match

Example: P(R) = 1000, P(S) = 500, 100 tuples/page
⇒ T(R) = 1000*100, T(S) = 500*100

NLJ: P(R) + T(R)*P(S) + OUT
  B = 20 (+1 for output): (1000 + 1000*100*500) ⇒ IO = ~50,001,000 + OUT
  B = 100 (+1 for output): (1000 + 1000*100*500) ⇒ IO = ~50,001,000 + OUT (the extra buffer doesn't help NLJ)

BNLJ: P(R) + P(R)*P(S)/B + OUT
  B = 20 (+1 for output): (1000 + 1000*500/20) ⇒ IO = ~26,000 + OUT
  B = 100 (+1 for output): (1000 + 1000*500/100) ⇒ IO = ~6,000 + OUT

A small change in algorithm ⇒ a big speedup in JOINs (~1000x faster)

Also, notice that if we swap R and S, we can save an extra 500 IOs in BNLJ.
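The numbers above can be recomputed straight from the two cost formulas (a quick check in Python):

```python
def nlj_io(P_R, T_R, P_S):
    return P_R + T_R * P_S          # P(R) + T(R)*P(S), ignoring OUT

def bnlj_io(P_R, P_S, B):
    return P_R + P_R * P_S // B     # P(R) + P(R)*P(S)/B, ignoring OUT

P_R, P_S, tuples_per_page = 1000, 500, 100
print(nlj_io(P_R, P_R * tuples_per_page, P_S))   # 50,001,000 IOs
print(bnlj_io(P_R, P_S, 20))                     # 26,000 IOs
print(bnlj_io(P_R, P_S, 100))                    # 6,000 IOs
```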

38 of 98

Problem: JOINs are slow. How to make them fast?

(Can we do 100x faster than BNLJ?)

Idea: Smarter cross-products

39 of 98

What you will learn about in this section

  0. Index Joins
  1. Sort-Merge Join
  2. HashPartition Join

40 of 98

Smarter than Cross-Products: From Quadratic to Nearly Linear

All joins computing the full cross-product have a quadratic term.

    • For example, we saw:
      NLJ:  P(R) + T(R)*P(S) + OUT
      BNLJ: P(R) + P(R)*P(S)/B + OUT

Now we'll see some (nearly) linear joins:

    • ~ O(P(R) + P(S) + OUT)

We get this gain by taking advantage of structure: moving to equality constraints ("equijoins") only!

41 of 98

Example

Goal: Compute Census JOIN TaxInfo, for 1 billion people (row stores, 1000 tuples/page; a million pages each).

BNLJ -- see all the extra work BNLJ is doing to JOIN for 432-567-789, ...

42 of 98

Index Nested Loop Join (INLJ)

 

→ We can use an index on the inner relation (e.g., a B+ tree) to avoid the full cross-product!

Cost: P(R) + T(R)*L + OUT

where L is the IO cost to access each distinct value in the index.

Recall: L is usually small (e.g., 2-5).

→ Much better than quadratic. But what if T(R) = 1 billion?
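With L as the per-lookup cost, the INLJ formula is easy to evaluate (a sketch; L = 3 is an assumed mid-range lookup cost, not a number from the slides):

```python
def inlj_io(P_R, T_R, L):
    # Scan R once, then probe the index on S once per R tuple.
    return P_R + T_R * L            # P(R) + T(R)*L, ignoring OUT

# Census-scale example: a million pages, a billion tuples, L = 3
print(inlj_io(10**6, 10**9, 3))    # ~3 billion IOs: linear in T(R),
                                   # but the per-tuple probes still dominate
```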

43 of 98


44 of 98

FAQ: Why do we focus on IO cost?

[1] CPU cost (e.g., sort in RAM, or check tuple equality in NLJ in RAM)

  • Typical algorithms run in RAM (e.g., quicksort in n log n)

[2] IO cost from HDDs/SSDs (# of pages we read or write from HDD/SSD)

  • E.g., for ExternalSort

From Lecture 1:

⇒ For big data, focus on IO cost (i.e., it's the primary factor)

(For tiny data < 10 GBs, just use RAM)

45 of 98

FAQ: Why do we focus on scale? How does it relate to SQL?

  1. How to search Amazon's product catalog?
    1. Consumers want answers in < 1 sec
    2. Data layout and indexing:
      1. Lecture 1 vs. Lecture 6 (speeding up "search" queries from hours to < 1 sec)
  2. How to run JOINs fast?
    1. Improved by 1000x from NLJ to BNLJ
    2. Will do another 100x today

⇒ For big data, how to scale is the primary driver

(For tiny data < 10 GBs, just use RAM)

46 of 98

Reminder: reread “Why cs145” from Week 1.

47 of 98

Problem: JOINs are slow. How to make them fast?

(Can we do 100x faster than BNLJ?)

Idea: Smarter cross-products

48 of 98

Pre-process data before JOINing

Preview of doing better?

SortMergeJoin (Census and TaxInfo tables, row stores):
  -- Sort(Census), Sort(TaxInfo) on SSN
  -- Merge sorted pages

HashPartitionJoin (Census and TaxInfo tables, row stores):
  -- Hash(Census), Hash(TaxInfo) on SSN
  -- Merge partitioned pages

49 of 98

Join Algorithms: Summary

  • NLJ: an example of a non-IO-aware join algorithm
  • BNLJ: big gains just by being IO aware & reading in chunks of pages!
  • SMJ: sort R and S, then scan over them to join!
  • HPJ: partition R and S into buckets using a hash function, then join the (much smaller) matching buckets

For R ⋈ S on column A:

  • NLJ, BNLJ: quadratic in P(R), P(S), i.e., O(P(R)*P(S))
  • SMJ, HPJ: given sufficient buffer space, linear in P(R), P(S), i.e., ~O(P(R)+P(S)); achieved by only supporting equijoins & taking advantage of that structure!

50 of 98

Sort-Merge Join (SMJ)

51 of 98

What you will learn about in this section

  1. Sort-Merge Join

  • “Backup” & Total Cost

  • Optimizations

52 of 98

Sort Merge Join (SMJ)

Goal: Execute R ⋈ S on A

Key idea:

  1. Sort R and S [with external sort]
  2. Merge-scan (aka merge-join) over them!

IO cost:

    • Sort phase: Sort(R) + Sort(S)
    • Merge-join phase: ~ P(R) + P(S) + OUT

(Unsorted input relations → split & sort → merge.)
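The merge-scan phase can be sketched as a two-pointer walk over the sorted inputs (Python; the duplicate-key "backup" discussed later is handled by re-visiting the matching group of S):

```python
def sort_merge_join(R, S):
    # Sort both inputs on the join key (first column), then merge-scan.
    R, S = sorted(R), sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key, j0 = R[i][0], j
            while i < len(R) and R[i][0] == key:
                j = j0                          # "back up" S to the group start
                while j < len(S) and S[j][0] == key:
                    out.append(R[i] + S[j][1:])
                    j += 1
                i += 1
    return out

R = [(3, 'j'), (5, 'b'), (0, 'a')]
S = [(7, 'f'), (0, 'j'), (3, 'g')]
print(sort_merge_join(R, S))       # [(0, 'a', 'j'), (3, 'j', 'g')]
```

This is the 3-page example from the next slides; with no duplicate keys, each input is scanned exactly once.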

53 of 98

Example1: SMJ: R⋈𝑆 with 3 page buffer

For simplicity: let each page hold one tuple, and let the first column be the join key. (E.g., R has 3 pages with 1 tuple each.)

R: (3,j), (5,b), (0,a)
S: (7,f), (0,j), (3,g)

We show the file HEAD, which is the next value to be read!

Logical join (on Col1): (0,a,j), (3,j,g)

54 of 98

SMJ Example: R⋈𝑆 with 3 page buffer

1. Sort(R) and Sort(S) (on the 1st column):

R: (0,a), (3,j), (5,b)
S: (0,j), (3,g), (7,f)

55 of 98

SMJ Example: R⋈𝑆 with 3 page buffer

2. Merge-scan (or merge-join) – Scan and “merge” on join key!

Disk

Main Memory

Buffer

R

S

(3,g)

(7,f)

(3,j)

(5,b)

Output

(0,j)

(0,a)

(0,a)

(0,j)

Read 1 page each from (sorted) R and S

56 of 98

SMJ Example: R⋈𝑆 with 3 page buffer

2. Scan and "merge" on the join key!

Join (0,a) and (0,j) to get (0,a,j).

Output: (0,a,j)

57 of 98

SMJ Example: R⋈𝑆 with 3 page buffer

2. Scan and "merge" on the join key!

Advance both sides; join (3,j) and (3,g) to get (3,j,g).

Output: (0,a,j), (3,j,g)

58 of 98

SMJ Example: R⋈𝑆 with 3 page buffer

2. Done! The remaining tuples, (5,b) and (7,f), have no matches.

Output: (0,a,j), (3,j,g)

59 of 98

What happens with duplicate join keys?

60 of 98

Example 2: Multiple tuples with the same join key: "Backup"

1. Start with sorted relations, and begin the merge-join...

R: (0,a), (0,b), (0,c)
S: (0,j), (0,g), (7,f)

Logical join (on Col1): (0,a,j), (0,b,j), (0,c,j), (0,a,g), (0,b,g), (0,c,g)

61 of 98

Multiple tuples with Same Join Key: “Backup”

Read and join R.(0,a) and S.(0,j) to get (0,a,j).

62 of 98

Multiple tuples with Same Join Key: “Backup”

Output (0,a,j); then join R.(0,a) with S.(0,g) to get (0,a,g).

63 of 98

Multiple tuples with Same Join Key: “Backup”

Read R.(0,b) and reread S.(0,j), ...

We have to "back up" in the scan of S and reread pages!

64 of 98

Backup

  • At best, no backup → the merge-scan takes P(R) + P(S) reads
    • For example: if there are no duplicate values in the join attribute

  • At worst (e.g., full backup each time), the merge-scan could take P(R) * P(S) reads!
    • For example: if all tuples in R and S have the same value for the join attribute
    • Roughly: for each page of R, we back up and reread each page of S...

  • Often it's not that bad, however; plus we can:
    • Leave more data in the buffer (for larger buffers)
    • Design other algorithms

65 of 98

SMJ: Total cost

  • The cost of SMJ is:
    • The cost of sorting R and S (which depends on NumPasses)...
    • Plus the cost of the merge-join: ~P(R) + P(S)
      • Because of backup: in the worst case P(R)*P(S), but that's unlikely
    • Plus the cost of writing out: OUT

~ Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT

66 of 98

Example: SMJ NumPasses

Consider P(R) = 1000, P(S) = 500.

Reminder: more buffer ⇒ fewer passes for sorting.

Case 1: NumPasses for R (for B = 100): K = ⌈log_100 (1000/(2*100))⌉ = 1
Case 2: NumPasses for R (for B = 20): K = ⌈log_20 (1000/(2*20))⌉ = 2

(Repeat for S, and you get K = 1 and 2.)
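The pass count follows the external merge sort recap: initial runs of ~2B pages, then B-way merge passes. A quick check (Python; `num_passes` is an illustrative helper):

```python
import math

def num_passes(N, B):
    # Merge passes after run creation: ceil(log_B(N / (2B))), floored at 0.
    return max(0, math.ceil(math.log(N / (2 * B), B)))

print(num_passes(1000, 100))   # 1
print(num_passes(1000, 20))    # 2
```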

67 of 98

Example SMJ vs. BNLJ: Steel Cage Match

SMJ is ~linear vs. BNLJ's quadratic...

Consider P(R) = 1000, P(S) = 500.

SMJ: ~ Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT
  B = 100: sort R and S in NumPasses (np) = 1: 2*1000*(np+1) + 2*500*(np+1) = 6000; MergeJoin: 1000 + 500 = 1500 IOs ⇒ IO = 7500 IOs + OUT
  B = 20: sort R and S in NumPasses (np) = 2: 2*1000*(np+1) + 2*500*(np+1) = 9000; MergeJoin: 1000 + 500 = 1500 IOs ⇒ IO = 10,500 IOs + OUT

BNLJ: P(S) + P(S)*P(R)/B + OUT (with the smaller relation, S, as the outer)
  B = 100: (500 + 1000*500/100) ⇒ IO = 5500 + OUT
  B = 20: (500 + 1000*500/20) ⇒ IO = 25,500 + OUT

Redo the same at 10x the size? SMJ is much faster.

Error fixed 10/20 after lecture.

1. Let's use the full formulae, pre-optimization, in the next 2 slides.
2. Note: with the optimization in the next 2 slides, we'll need fewer IOs (we skip the last merge steps).
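The B = 100 column can be put together from the pieces (Python; np = 1 per relation, matching the slide, and the helper names are illustrative):

```python
def sort_io(N, num_passes):
    # External sort: read + write the N pages once per merge pass,
    # plus the initial run-creation pass: 2N * (np + 1).
    return 2 * N * (num_passes + 1)

def smj_io(P_R, P_S, np):
    return sort_io(P_R, np) + sort_io(P_S, np) + P_R + P_S  # + OUT

def bnlj_io(P_outer, P_inner, B):
    return P_outer + P_outer * P_inner // B                 # + OUT

# P(R) = 1000, P(S) = 500, B = 100 (np = 1 for both relations)
print(smj_io(1000, 500, np=1))      # 7500 IOs + OUT
print(bnlj_io(500, 1000, 100))      # 5500 IOs + OUT (S, the smaller, as outer)
```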

68 of 98

Un-Optimized SMJ

Merge / Join Phase

Sort Phase

(Ext. Merge Sort)

S

R

Split & sort

Split & sort

MergeJoin

MergeJoin

MergeJoin

MergeJoin

Given B buffer pages for input + 1 for output

Joined output file created!

Unsorted input relations

69 of 98

Simple SMJ Optimization

Merge / Join Phase

Sort Phase

(Ext. Merge Sort)

S

R

Split & sort

Split & sort

Merge

Join

Merge

Join

Joined output file created!

Unsorted input relations

<= B total runs

B-Way MergeJoin

70 of 98

Takeaway points from SMJ

SMJ needs to sort both relations.

    • SMJ is basically linear.
    • Nasty but unlikely case: many duplicate join keys.

If the relation(s) are already sorted on the join key, no further sorting work is needed.
71 of 98

What you will learn about in this section

  0. Intuition for smarter joins
  1. SortMergeJoin
  2. HashPartition Join

72 of 98

Pre-process data before JOINing

Preview of smarter joins

SortMergeJoin (Census and TaxInfo tables, row stores):
  -- Sort(Census), Sort(TaxInfo) on SSN
  -- Merge sorted pages

HashPartitionJoin (Census and TaxInfo tables, row stores):
  -- Hash(Census), Hash(TaxInfo) on SSN
  -- Merge partitioned pages

73 of 98

Hash Join (HJ) or

Hash Partition Join (HPJ)

74 of 98

Hash Join

  • Goal: Execute R ⋈ S on A

  • Key idea:
    • Partition R and S into buckets by hashing the join attribute
    • Join the pairs of (small) matching buckets!

75 of 98

HPJ Phase 1: Hash Partitioning

Goal: For each relation, partition it into buckets such that if hB(t.A) = hB(t'.A), the tuples land in the same bucket.

Given B+1 buffer pages, we partition into B buckets:

    • We use B buffer pages for output (one for each bucket), and 1 for input
      • For each tuple t in the input, copy it to the buffer page for hB(t.A)
      • When a page fills up, write it to disk
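Phase 1 can be sketched as follows (Python; the mod hash is an illustrative choice for hB, and buckets are kept in memory rather than spilled page-by-page as a real system would):

```python
def hash_partition(pages, B, h=lambda v, B: v % B):
    # One input page at a time; one output page per bucket; tuples are
    # routed by hashing the join attribute (first column).
    buckets = [[] for _ in range(B)]
    for page in pages:                        # read each input page once
        for t in page:
            buckets[h(t[0], B)].append(t)     # real systems flush full pages
    return buckets

R_pages = [[(0, 'a'), (3, 'a')], [(0, 'j'), (3, 'j')], [(5, 'a'), (5, 'b')]]
b0, b1 = hash_partition(R_pages, B=2)
print(b0)   # [(0, 'a'), (0, 'j')]
print(b1)   # [(3, 'a'), (3, 'j'), (5, 'a'), (5, 'b')] -- 5 collides with 3
```

With h2(v) = v % 2, keys 3 and 5 land in the same bucket, which is exactly the collision the next slides walk through.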

76 of 98

HPJ Phase 1: Partitioning

We partition into B = 2 buckets using hash function h2, so that we can have one buffer page for each partition (and one for input).

Given B+1 = 3 buffer pages.

For simplicity, we'll look at partitioning one of the two relations; we just do the same for the other relation!

Note our new convention for this example: pages each have two tuples (one per row).
77 of 98

HPJ Phase 1: Partitioning

1. We read pages from R into the "input" page of the buffer: first, the page ((0,a), (3,a)).
78 of 98

HPJ Phase 1: Partitioning

2. Then we use hash function h2 to route tuples to the buckets, which each have one page in the buffer.

h2(0) = 0; h2(3) = 1: copy (0,a) → Bucket 0 and (3,a) → Bucket 1, based on the hash.
79 of 98

HPJ Phase 1: Partitioning

3. We repeat until the buffer bucket pages are full: read the next page of R into the input page...
80 of 98

HPJ Phase 1: Partitioning

... h2(3) = 1: (3,j) → Bucket 1 ...
81 of 98

HPJ Phase 1: Partitioning

... h2(0) = 0: (0,j) → Bucket 0.
82 of 98

HPJ Phase 1: Partitioning

The buffer's bucket pages are now full, so we flush them to disk, into the B0 and B1 files.

83 of 98

HPJ Phase 1: Partitioning

On disk: B0 = ((0,a), (0,j)); B1 = ((3,a), (3,j)).

84 of 98

HPJ Phase 1: Partitioning

Note that collisions can occur!

Read the next page, ((5,a), (0,j)): h2(5) = h2(3) = 1, so both (5,*) and (3,*) map to Bucket 1. Collision!!!

85 of 98

HPJ Phase 1: Partitioning

Finish this pass: read the last page of R, ((5,b))...

86 of 98

HPJ Phase 1: Partitioning

... h2(5) = 1: (5,b) also goes to Bucket 1, colliding again with the (3,*) tuples.

87 of 98

HPJ Phase 1: Partitioning

Final partitions on disk:

B0: (0,a), (0,j), (0,j)
B1: (3,a), (3,j), (5,a), (5,b)

88 of 98

HPJ Phase 1: Partitioning

Given B+1 = 3 buffer pages, we wanted buckets of size B-1 = 1 page... however, we got larger ones due to:

(1) Duplicate join keys

(2) Hash collisions

89 of 98

HPJ Phase 1: Partitioning

To take care of the larger buckets caused by (2) hash collisions, we can just do another pass!

What hash function should we use? Do another pass with a different hash function, h'2, ideally such that h'2(3) != h'2(5).

90 of 98

HPJ Phase 1: Partitioning

After the second pass, the colliding keys separate:

B0: (0,a), (0,j), (0,j)
B1: (3,a), (3,j)
B2: (5,a), (5,b)

91 of 98

HPJ Phase 1: Partitioning

What about duplicate join keys? Unfortunately, this is a problem... but usually not a huge one.

We call this unevenness in the bucket sizes "skew".

92 of 98

Now that we have partitioned R and S…

93 of 98

HPJ Phase 2: Partition Join

Now, we just join the pairs of buckets from R and S that have the same hash value to complete the join!

Partition both relations with hB, then join the matching buckets: R1 ⋈ S1, R2 ⋈ S2, ...

94 of 98

HPJ: High-level procedure

2. Per-partition join: JOIN the tuples in the same partition.

We don't have to join the others, e.g., S1 and R2!

95 of 98

HPJ Summary

Given enough buffer pages:

    • Hash partitioning requires reading + writing each page of R and S
      • → 2(P(R)+P(S)) IOs
    • The partition join (with BNLJ inside each pair of buckets) requires reading each page of R and S
      • → P(R) + P(S) IOs

HPJ takes ~3(P(R)+P(S)) + OUT IOs!
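The total above is straightforward to evaluate (a quick check in Python, using the Census-scale page counts from earlier):

```python
def hpj_io(P_R, P_S):
    partition = 2 * (P_R + P_S)    # read + write both relations
    join = P_R + P_S               # read each matching bucket pair once
    return partition + join        # ~3(P(R)+P(S)), ignoring OUT

# Census example: a million pages each
print(hpj_io(10**6, 10**6))        # 6,000,000 IOs + OUT
```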

96 of 98

Join Algorithms: Summary

  • NLJ: an example of a non-IO-aware join algorithm
  • BNLJ: big gains just by being IO aware & reading in chunks of pages!
  • SMJ: sort R and S, then scan over them to join!
  • HPJ: partition R and S into buckets using a hash function, then join the (much smaller) matching buckets

For R ⋈ S on column A:

  • NLJ, BNLJ: quadratic in P(R), P(S), i.e., O(P(R)*P(S))
  • SMJ, HPJ: given sufficient buffer space, linear in P(R), P(S), i.e., ~O(P(R)+P(S)); achieved by only supporting equijoins & taking advantage of that structure!

97 of 98

IO analysis

Recap

Sorting a relation R with N pages (i.e., P(R) = N), given B buffer pages (+1 for output):

  • Sort N pages when N ~= B: ~2N IOs (because (log_B 0.5) < 0, so just one pass)
  • Sort N pages when N ~= 2*B^2: ~4N IOs (because (log_B B) = 1, so two passes)

(vs. n log n for n tuples in RAM; CPU cost is negligible for large data compared to IO, which is much, much slower.)

SortMerge and HashJoin of R & S (where P(R) and P(S) are the numbers of pages in R and S), when you have enough RAM:

  • ~3 * (P(R) + P(S)) + OUT
  • Special cases (SortMerge if already sorted; HashJoin if already partitioned): ~1 * (P(R) + P(S)) + OUT

We assume cost = 1 IO for a read and 1 IO for a write.

Alternative IO model (e.g., SSDs in HW#2): 1 IO for a read and 8 IOs for a write?

98 of 98

Where we are

Week 3

  • Learned the basics of IO-aware algorithms and indexes

Week 4

  • Problem 1: How to search Amazon’s product catalog in < 1sec?

  • Problem 2: JOIN queries are slow and expensive.
    • How do we speed up by 1000x? (Tuesday)
    • How do we speed up by another 100x? (Tuesday)

Week 5

  • Problem 3: Build a Product Recommendation system (e.g., Amazon’s)