Announcements
- Midterm review
- HW2 due
- HW2 section
- HW2 out
Announcements
Fill in Pulse survey today
Wide class mix
Need more help on RAM algorithms + OS concepts? (~15%) Go to OH with questions.
If you find the material basic (~10%)
Where we are
Week 3
Week 4
Week 5
Q: What’s index size for Amazon’s product catalog?
(Wrapping up B+ trees example)
Recall Cost Model for Indexes -- [Baseline simplest model]
[Figure: pages of rows (Row1, Row2, …), pages 1…n stored in RAM/disk]
“Real” data layout, with full records
(including cname, prices, etc.)
E.g., where is 19?
Where is 33?
(e.g., disk block 0x45AB)
N index pages (leaf)
M
data pages
‘f’ fanout
<value, record location>
Example 1: Search Amazon's 1 Trillion Products
AMAZING: Worst-case for 1 Trillion SKeys
PageSize | 64 KB | 64 MB |
f | ~4000 | ~4 million |
h | ~4 | ~2 |
B+ tree Levels and node sizes
E.g., where is 19?
Where is 33?
(e.g., disk block 0x45AB)
f^(h-1) leaf nodes (pointing at up to f^h SKeys overall)
M
data pages
(hidden)
<value, record location>
Consider Example with f ~= 4000 and PageSize = 64KB
Level | Num Nodes | Size |
0 (root) | 1 | 64 KB |
1 | ~4000 (f) | 4000 * 64 KB |
2 | ~4000^2 (f^2) | ~4000^2 * 64 KB |
3 | ~4000^3 (f^3) | ~4000^3 * 64 KB |
h-1 | ~4000^(h-1) (f^(h-1) nodes) [collectively pointing at up to f^h SKeys] | ~4000^(h-1) * 64 KB |
1 node
f nodes
f^2 nodes
. . .
Recall We use fanout ‘f’ as a constant, for simplicity. The algorithm uses ‘f’ for keys, and ‘f+1’ for pointers. Our engineering approximation uses ‘f’ for both. (i.e., f ~= f + 1)
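The height arithmetic above is easy to sanity-check. A minimal sketch (not from the slides; it uses the engineering approximation f ~= f+1 throughout):

```python
import math

def btree_height(n_keys, fanout):
    # Smallest h with fanout^h >= n_keys (engineering approximation f ~= f+1)
    return math.ceil(math.log(n_keys, fanout))

# 1 trillion SKeys, as in the Amazon example:
btree_height(10**12, 4000)       # ~4 levels with 64 KB pages (f ~= 4000)
btree_height(10**12, 4_000_000)  # ~2 levels with 64 MB pages (f ~= 4 million)
```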
Search cost of B+ Tree (on RAM + Disk)
[Figure: B+ tree nodes across levels; not all nodes pictured]
Read the first levels into a RAM buffer; the rest of the index stays on disk.
Use B (RAM size) to keep the top few levels
(e.g., the first 2-3 levels; overall, as many nodes as will fit in B, because they're 'zero' cost and speed up queries)
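A rough way to see how many top levels fit in a given RAM budget (a sketch with assumed numbers: f = 4000 and 64 KB nodes, as in the running example):

```python
def levels_in_ram(budget_bytes, f=4000, node_bytes=64 * 1024):
    # Count how many full top levels of the tree fit in budget_bytes.
    # Level 0 has 1 node, level 1 has f nodes, level 2 has f^2 nodes, ...
    nodes, level = 0, 0
    while (nodes + f**level) * node_bytes <= budget_bytes:
        nodes += f**level
        level += 1
    return level

levels_in_ram(1 * 1024**3)   # top levels that fit in ~1 GB of RAM
```

With f = 4000, level 1 alone is ~4000 nodes * 64 KB ≈ 256 MB, so ~1 GB holds the root plus level 1 but not level 2.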
Problem: How to make SQL queries fast?
Optimize!
Approximations
O(n) vs O(n^2) – Big “O” notation for algorithms
Similar goal. We often approximate B ~= B+1, or f ~= f+1 for simpler equations.
Is that OK?
Complexity
Engineering Approximations
Models of real-world systems are complex ⇒ we use engineering approximations.
Example:
Basic SFW queries
SELECT pname
FROM Product
WHERE year = ? AND category = ?
AND manufacturer = ?

SELECT pname
FROM Product
WHERE year = ? AND category = ?
Workload description
Manufacturers are likely the most selective.
There are few categories and many more manufacturers. Maintain an index if this query happens a lot.
Lower cost
(query and update cost)
Intuition
200 times/sec
100 times/sec
Optimization
Roadmap
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
What you will learn about in this section
Problem: JOINs are slow. How to make them fast?
Optimize!
Semantically: A Subset of the Cross Product
SELECT R.A, B, C, D
FROM R, S
WHERE R.A = S.A
A | D |
3 | 7 |
2 | 2 |
2 | 3 |
A | B | C |
1 | 0 | 1 |
2 | 3 | 4 |
2 | 5 | 2 |
3 | 1 | 1 |
R
S
A | B | C | D |
2 | 3 | 4 | 2 |
2 | 3 | 4 | 3 |
2 | 5 | 2 | 2 |
2 | 5 | 2 | 3 |
3 | 1 | 1 | 7 |
Cross Product
Filter by conditions
(r.A = s.A)
…
Can we actually implement a join in this way?
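As a sanity check of the semantics, here is a toy in-memory sketch (not an actual DBMS implementation) that takes the cross product of the example's R and S and filters on R.A = S.A:

```python
# R(A, B, C) and S(A, D), with the values from the example tables
R = [(1, 0, 1), (2, 3, 4), (2, 5, 2), (3, 1, 1)]
S = [(3, 7), (2, 2), (2, 3)]

# Cross product, then filter on r.A = s.A
out = [(a, b, c, d) for (a, b, c) in R for (a2, d) in S if a == a2]
```

This yields exactly the five (A, B, C, D) rows shown above, e.g. (2, 3, 4, 2) and (3, 1, 1, 7).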
Joins: Example
A | D |
3 | 7 |
2 | 2 |
2 | 3 |
A | B | C |
1 | 0 | 1 |
2 | 3 | 4 |
2 | 5 | 2 |
3 | 1 | 1 |
R
S
A | B | C | D |
2 | 3 | 4 | 2 |
| | | |
| | | |
| | | |
| | | |
SELECT R.A, B, C, D
FROM R, S
WHERE R.A = S.A
Joins: Example
A | D |
3 | 7 |
2 | 2 |
2 | 3 |
A | B | C |
1 | 0 | 1 |
2 | 3 | 4 |
2 | 5 | 2 |
3 | 1 | 1 |
R
S
A | B | C | D |
2 | 3 | 4 | 2 |
2 | 3 | 4 | 3 |
2 | 5 | 2 | 2 |
2 | 5 | 2 | 3 |
3 | 1 | 1 | 7 |
SELECT R.A, B, C, D
FROM R, S
WHERE R.A = S.A
Simple way to execute JOINs –
Nested Loop Joins
Notation for JOIN cost estimates
We consider "IO aware" algorithms: we care about IO cost.
Given a relation R, let:
- T(R) = # of tuples in R
- P(R) = # of pages in R
We'll assume B buffer pages for input (B usually << P(R), P(S))
Recall that we read / write entire pages (blocks)
We’ll see lots of formulae from now
⇒ Hint: Focus on how it works. Much easier to derive from 1st principles (vs recalling formula soup)
Recall
Census table
(row store)
Census (SSN, Address, … )
TaxInfo (SSN, TaxPaid, ...)
Census
432-567-789
134-562-184
613-416-452
For 1 Billion people
Goal: Compute Census JOIN TaxInfo
Data stored in RowStores, 1000 tuples/page (million pages)
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
Output
Steps: Repeat till done
Read tuples from Census and TaxInfo into Buffer
Output matching tuples
Nested Loop Join (NLJ)
Note: This is our 1st basic algorithm. We use this baseline to intro cost models. We’ll see much better ones soon.
Nested Loop Join (NLJ)
P(R)
We read all T(R) tuples, but they are in P(R) pages. Cost to read = P(R).
Cost:
Nested Loop Join (NLJ)
P(R) + T(R)*P(S)
It's T(R)*P(S) because for each of the T(R) tuples in R, we scan all P(S) pages of S.
Cost:
Nested Loop Join (NLJ)
P(R) + T(R)*P(S)
Note that NLJ can handle things other than equality constraints… just check in the if statement!
Cost:
Nested Loop Join (NLJ)
P(R) + T(R)*P(S) + OUT
Cost:
What would OUT be if our join condition is trivial (if TRUE)?
OUT = T(R)*T(S) tuples, split across multiple pages.
Could be bigger than P(R)*P(S)… but usually not that bad.
Nested Loop Join (NLJ)
P(R) + T(R)*P(S) + OUT
What if R (“outer”) and S (“inner”) switched?
Cost:
P(S) + T(S)*P(R) + OUT
Outer vs. inner selection makes a huge difference: the DBMS needs to know which relation is smaller!
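The NLJ loop structure can be sketched in a few lines (a toy in-memory version, not the DBMS's code; relations are modeled as lists of pages, each page a list of tuples):

```python
def nested_loop_join(R_pages, S_pages, theta):
    # Tuple-at-a-time NLJ. IO accounting: each R page is read once (P(R));
    # S is rescanned for every R *tuple*, giving P(R) + T(R)*P(S) page
    # reads, plus OUT writes for the result.
    out = []
    for r_page in R_pages:              # P(R) page reads
        for r in r_page:
            for s_page in S_pages:      # P(S) page reads per R tuple
                for s in s_page:
                    if theta(r, s):     # any predicate, not just equality
                        out.append(r + s)
    return out

# Running example (R(A,B,C), S(A,D)), 2 tuples per page:
R_pages = [[(1, 0, 1), (2, 3, 4)], [(2, 5, 2), (3, 1, 1)]]
S_pages = [[(3, 7), (2, 2)], [(2, 3)]]
matches = nested_loop_join(R_pages, S_pages, lambda r, s: r[0] == s[0])
```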
IO-Aware Approach
Problem: JOINs are slow. How to make them fast?
Optimize – IO aware
(1000x faster than NLJ?)
Intuition:
Block Nested
Loop Joins
Census table
(row store)
Census (SSN, Address, … )
TaxInfo (SSN, TaxPaid, ...)
Census
432-567-789
134-562-184
613-416-452
For 1 Billion people
Goal: Compute Census JOIN TaxInfo
Data stored in RowStores, 1000 tuples/page (million pages)
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
Given: B buffer space for input. + 1 for output
Idea: Use B-1 pages for Census, 1 page for TaxInfo
Output
Steps: Repeat till done
Read B-1 pages from Census into Buffer
Read 1 page from TaxInfo
Partial Join into 1 output page
Block Nested Loop Join (BNLJ)
Assume B buffer for input
Cost:
Why P(R)?
P(R)
Block Nested Loop Join (BNLJ)
Notes:
Cost:
How many R pages to read?
P(R)
How many times is each S page read?
P(R)/B
P(R) + P(R)*P(S)/B
Block Nested Loop Join (BNLJ)
BNLJ can also handle non-equality constraints
Cost:
P(R) + P(R)*P(S)/B
Block Nested Loop Join (BNLJ)
Cost:
P(R) + P(R)*P(S)/B + OUT
BNLJ vs. NLJ: Benefits of IO Aware
In BNLJ, by loading larger chunks of R, we minimize the number of full disk reads of S
P(R) + T(R)*P(S) + OUT
NLJ
BNLJ
P(R) + P(R)*P(S)/B + OUT
BNLJ is faster by ~ B*T(R)/P(R)
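The block-at-a-time idea can be sketched with the same paged-relation convention as the NLJ sketch (an assumed convention, not the slides' code):

```python
def block_nested_loop_join(R_pages, S_pages, B, theta):
    # Load B-1 pages of R as a block, then stream S one page at a time.
    # Each S page is now read only ~P(R)/(B-1) times, giving roughly
    # P(R) + P(R)*P(S)/B page reads, plus OUT.
    out = []
    for i in range(0, len(R_pages), B - 1):      # ~P(R)/(B-1) blocks
        block = [r for page in R_pages[i:i + B - 1] for r in page]
        for s_page in S_pages:                    # one full scan of S per block
            for s in s_page:
                for r in block:
                    if theta(r, s):
                        out.append(r + s)
    return out

R_pages = [[(1, 0, 1), (2, 3, 4)], [(2, 5, 2), (3, 1, 1)]]
S_pages = [[(3, 7), (2, 2)], [(2, 3)]]
res = block_nested_loop_join(R_pages, S_pages, 3, lambda r, s: r[0] == s[0])
```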
Example NLJ vs. BNLJ: Steel Cage Match
Example: P(R) = 1000, P(S) = 500,
100 tuples/page ⇒ T(R) = 1000*100, T(S) = 500*100
| | B = 100 (+1 for output) | B = 20 (+1 for output) |
NLJ: P(R) + T(R)*P(S) + OUT | (1000 + 1000*100*500 + OUT) ⇒ IO = ~50,001,000 IOs + OUT | (1000 + 1000*100*500 + OUT) ⇒ IO = ~50,001,000 IOs + OUT |
BNLJ: P(R) + P(R)*P(S)/B + OUT | (1000 + 1000*500/100) ⇒ IO = ~6,000 IOs + OUT | (1000 + 1000*500/20) ⇒ IO = ~26,000 IOs + OUT |
Small change in algorithm ⇒ Big speedup in JOINs (~1000x faster)
Also, notice if we swap R and S, we can save an extra 500 IOs in BNLJ
Problem: JOINs are slow. How to make them fast?
(Can we do 100x faster than BNLJ?)
Idea: Smarter cross-products
What you will learn about in this section
0. Index Joins
Smarter than Cross-Products: From Quadratic to Nearly Linear
All joins computing the full cross-product have a quadratic term
NLJ
BNLJ
We get this gain by taking advantage of structure: moving to equality constraints ("equijoin") only!
Now we’ll see some (nearly) linear joins:
P(R) + T(R)*P(S) + OUT
P(R) + P(R)*P(S)/B + OUT
Example
Census table
(row store)
Census (SSN, Address, … )
TaxInfo (SSN, TaxPaid, ...)
Census
432-567-789
134-562-184
613-416-452
For 1 Billion people
Goal: Compute Census JOIN TaxInfo
Data stored in RowStores, 1000 tuples/page (million pages)
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
BNLJ -- See all the extra work BNLJ is doing to JOIN for 432-567-789, ...
Index Nested Loop Join (INLJ)
P(R) + T(R)*L + OUT
→ We can use an index (e.g. B+ Tree) to avoid full cross-product!
Cost:
Where L is the IO cost to access each distinct value in index
Recall: L is usually small (e.g., 2-5)
→ Much better than quadratic. But what if T(R) = 1 billion?
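A toy sketch of INLJ for an equijoin (a Python dict stands in for the B+ tree; in the IO model each probe costs ~L IOs, giving P(R) + T(R)*L + OUT):

```python
from collections import defaultdict

def index_nested_loop_join(R, S_index):
    # Instead of scanning S for every R tuple, probe an index on S.A.
    out = []
    for r in R:                              # T(R) index probes
        for s in S_index.get(r[0], []):      # lookup on join key A
            out.append(r + s[1:])
    return out

# Build the stand-in "index" on S.A (values from the running example):
S_index = defaultdict(list)
for s in [(3, 7), (2, 2), (2, 3)]:
    S_index[s[0]].append(s)

joined = index_nested_loop_join([(1, 0, 1), (2, 3, 4), (2, 5, 2), (3, 1, 1)], S_index)
```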
FAQ: Why do we focus on IO cost?
[1] CPU Cost
(e.g., sort in RAM or check tuple equality in NLJ in RAM)
[2] IO Cost from HDD/SSDs
(# of pages we read or write from HDD/SSD)
Typical algorithms in RAM (e.g., quicksort in n log n)
From Lecture 1
⇒ For big data, focus on IO cost (i.e., it’s the primary factor)
(For tiny data < 10 GBs, just use RAM)
E.g., For ExternalSort
FAQ: Why do we focus on scale? How does it relate to SQL?
⇒ For big data, how to scale is the primary driver
(For tiny data < 10 GBs, just use RAM)
Reminder: reread “Why cs145” from Week 1.
Problem: JOINs are slow. How to make them fast?
(Can we do 100x faster than BNLJ?)
Idea: Smarter cross-products
Pre-process data before JOINing
Preview of doing better?
SortMergeJoin
Census table
(row store)
Census
432-567-789
134-562-184
613-416-452
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
HashPartitionJoin
Census table
(row store)
Census
432-567-789
134-562-184
613-416-452
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
-- Sort(Census), Sort(TaxInfo) on SSN
-- Merge sorted pages
-- Hash(Census), Hash(TaxInfo) on SSN
-- Merge partitioned pages
Join Algorithms: Summary
For R ⋈ S on column A
Quadratic in P(R), P(S), i.e. O(P(R)*P(S))
Given sufficient buffer space, linear in P(R), P(S)
I.e. ~O(P(R)+P(S))
By only supporting equijoins & taking advantage of this structure!
Sort-Merge Join (SMJ)
What you will learn about in this section
Sort Merge Join (SMJ)
Goal: Execute R ⋈ S on A
Key Idea:
Sort R and S [with external sort]
Merge-scan (aka merge-join) over them!
IO Cost: ~ Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT
S
R
Unsorted input relations
Split & sort
Merge
S
R
Merge
R
Example 1: SMJ: R ⋈ S with a 3-page buffer
For simplicity: let each page hold one tuple, and let the first column be the join key
(E.g., R has 3 pages with 1 tuple each)
Disk
Main Memory
Buffer
R
(3, j)
(5, b)
(0,a)
S
(7,f)
(0,j)
(3,g)
We show the file HEAD, which is the next value to be read!
0 | a |
3 | j |
5 | b |
3 | g |
7 | f |
0 | j |
R
S
Col1
Col1
0 | a | j |
3 | j | g |
Col1
R
S
Physical Join
Logical Join
SMJ Example: R⋈𝑆 with 3 page buffer
Disk
Main Memory
Buffer
R
(5,b)
(3,j)
(0,a)
S
(7,f)
(0,j)
(3,g)
(3,j)
(5,b)
(0,a)
(3,g)
(7,f)
(0,j)
1. Sort(R) and Sort(S) (on 1st column)
SMJ Example: R⋈𝑆 with 3 page buffer
2. Merge-scan (or merge-join) – Scan and “merge” on join key!
Disk
Main Memory
Buffer
R
S
(3,g)
(7,f)
(3,j)
(5,b)
Output
(0,j)
(0,a)
(0,a)
(0,j)
Read 1 page each from (sorted) R and S
SMJ Example: R⋈𝑆 with 3 page buffer
2. Scan and “merge” on join key!
Disk
Main Memory
Buffer
R
S
(3,g)
(7,f)
(3,j)
(5,b)
Output
(0,j)
(0,a)
(0,a)
(0,j)
(0,a,j)
Join (0, a) and (0, j) to get (0, a, j)
SMJ Example: R⋈𝑆 with 3 page buffer
2. Scan and “merge” on join key!
Disk
Main Memory
Buffer
R
S
(3,g)
(7,f)
(3,j)
(5,b)
Output
(0,a)
(0,j)
(0,a,j)
(3,j,g)
(3,j)
(3,g)
(5,b)
(7,f)
(3,j)
(3,g)
SMJ Example: R⋈𝑆 with 3 page buffer
2. Done!
Disk
Main Memory
Buffer
R
S
3,g
7,f
3,j
5,b
Output
(0,a)
(0,j)
(0,a,j)
(3,j)
(3,g)
(3,j,g)
(5,b)
(7,f)
(5,b)
(7, f)
What happens with duplicate join keys?
Example2: Multiple tuples with Same Join Key: “Backup”
1. Start with sorted relations, and begin merge-join…
Disk
Main Memory
Buffer
R
S
3,g
7,f
3,j
5,b
Output
(0, b)
(0,g)
(0,c)
(7,f)
(0,a)
(0,j)
(0,a)
(0,j)
0 | a |
0 | b |
0 | c |
0 | j |
0 | g |
7 | f |
R
S
Col1
Col1
0 | a | j |
0 | b | j |
0 | c | j |
0 | a | g |
0 | b | g |
0 | c | g |
Col1
R
S
Physical Join
Logical Join
Multiple tuples with Same Join Key: “Backup”
1. Start with sorted relations, and begin merge-join…
Disk
Main Memory
Buffer
R
S
3,g
7,f
3,j
5,b
Output
(0,b)
(0,g)
(0,c)
(7,f)
(0,a)
(0,a)
(0,j)
(0,j)
(0,a,j)
Read and join R.(0, a) and S.(0, j) to get (0, a, j)
Multiple tuples with Same Join Key: “Backup”
1. Start with sorted relations, and begin merge-join…
Disk
Main Memory
Buffer
R
S
(0,g)
7,f
(0,j)
5,b
Output
(0,c)
(7,f)
(0,a)
(0,a)
(0,j)
(0,a,j)
(0,a,g)
(0,g)
(0,b)
(0,g)
Output (0, a, j)
Multiple tuples with Same Join Key: “Backup”
1. Start with sorted relations, and begin merge-join…
Disk
Main Memory
Buffer
R
S
0,g
7,f
0,j
5,b
Output
(0, b)
(0,c)
(7,f)
(0,a)
(0,a,j)
(0,g)
(0,a,g)
(0,b)
Have to “backup” in the scan of S and reread pages!
(0,j)
(0,j)
(0, j)
Read R.(0, b) and reread S.(0,j), . . .
Backup
SMJ: Total cost
~ Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT
NumPasses
Example: SMJ NumPasses
Consider P(R) = 1000, P(S) = 500
Case 1: NumPasses for R (B = 100): K = ⌈log_100(1000/(2*100))⌉ = 1
Case 2: NumPasses for R (B = 20): K = ⌈log_20(1000/(2*20))⌉ = 2
Reminder: More Buffer? Fewer passes for Sorting
(Repeat for S, and you get k = 1 and 2)
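The NumPasses arithmetic above is easy to check mechanically (a sketch assuming the slides' formula, with initial sorted runs of ~2B pages from the run-creation pass):

```python
import math

def num_merge_passes(N, B):
    # Merge passes after run creation: K = ceil(log_B(N / (2B)))
    return max(0, math.ceil(math.log(N / (2 * B), B)))

num_merge_passes(1000, 100)   # Case 1: K = 1
num_merge_passes(1000, 20)    # Case 2: K = 2
```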
Example SMJ vs. BNLJ: Steel Cage Match
SMJ is ~linear vs. BNLJ is quadratic…
Redo the same with 10x the data? SMJ is much faster.
Consider P(R) = 1000, P(S) = 500

| | B = 100 | B = 20 |
SMJ: ~ Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT | (Sort R and S with NumPasses (np) = 1: 2*1000*(np+1) + 2*500*(np+1) = 6000. MergeJoin: 1000 + 500 = 1500 IOs) ⇒ IO = 7,500 IOs + OUT | (Sort R and S with NumPasses (np) = 2: 2*1000*(np+1) + 2*500*(np+1) = 9000. MergeJoin: 1000 + 500 = 1500 IOs) ⇒ IO = 10,500 IOs + OUT |
BNLJ: P(R) + P(R)*P(S)/B + OUT | (500 + 1000*500/100) ⇒ IO = 5,500 IOs + OUT | (500 + 1000*500/20) ⇒ IO = 25,500 IOs + OUT |
Error fixed 10/20 after lecture.
1. Let's use the full formulae (pre-optimization) in the next 2 slides.
2. Note: with the optimization in the next 2 slides, we'll need fewer IOs (we skip the last merge steps).
Un-Optimized SMJ
Merge / Join Phase
Sort Phase
(Ext. Merge Sort)
S
R
Split & sort
Split & sort
MergeJoin
MergeJoin
MergeJoin
MergeJoin
Given B buffer pages for input + 1 for output
Joined output file created!
Unsorted input relations
Simple SMJ Optimization
Merge / Join Phase
Sort Phase
(Ext. Merge Sort)
S
R
Split & sort
Split & sort
Merge
Join
Merge
Join
Joined output file created!
Unsorted input relations
<= B total runs
B-Way MergeJoin
Takeaway points from SMJ
SMJ needs to sort both relations.
If the relation(s) are already sorted on the join key, no further sorting work is needed.
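The merge-scan, including the "backup" for duplicate join keys, can be sketched as follows (a minimal in-memory version, not the slides' code; tuples are assumed sorted on their first field):

```python
def merge_join(R, S):
    # Merge-join over relations already sorted on the join key.
    # On duplicate keys we "back up": the matching run of S tuples is
    # rescanned for each R tuple sharing that key.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key = R[i][0]
            j_start = j                      # remember start of the S run
            while i < len(R) and R[i][0] == key:
                j = j_start                  # "backup": rescan the S run
                while j < len(S) and S[j][0] == key:
                    out.append(R[i] + S[j][1:])
                    j += 1
                i += 1
    return out
```

On the sorted Example 1 inputs this emits (0, a, j) and (3, j, g), matching the slides; with duplicate keys it produces the full run-by-run cross product.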
What you will learn about in this section
0. Intuition for smarter joins
Pre-process data before JOINing
Preview of smarter joins
SortMergeJoin
Census table
(row store)
Census
432-567-789
134-562-184
613-416-452
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
HashPartitionJoin
Census table
(row store)
Census
432-567-789
134-562-184
613-416-452
TaxInfo table
(row store)
TaxInfo
432-567-789
134-562-184
613-416-452
-- Sort(Census), Sort(TaxInfo) on SSN
-- Merge sorted pages
-- Hash(Census), Hash(TaxInfo) on SSN
-- Merge partitioned pages
Hash Join (HJ) or
Hash Partition Join (HPJ)
Hash Join
HPJ Phase 1: Hash Partitioning
Goal: For each relation, partition it into buckets such that if h_B(t.A) = h_B(t'.A), the tuples land in the same bucket
Given B+1 buffer pages, we partition into B buckets:
HPJ Phase 1: Partitioning
We partition into B = 2 buckets using hash function h2 so that we can have one buffer page for each partition (and one for input)
Disk
R
(3,j)
(0,j)
Given B+1 = 3 buffer pages
(5,b)
(5,a)
(0,j)
(0,a)
(3,a)
For simplicity, we’ll look at partitioning one of the two relations- we just do the same for the other relation!
Note our new convention for this example: pages each have two tuples (one per row)
HPJ Phase 1: Partitioning
1. We read pages from R into the “input” page of the buffer…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(3,j)
(0,j)
(5,b)
(5,a)
(0,j)
(0,a)
(3,a)
(0,a)
(3,a)
HPJ Phase 1: Partitioning
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(0,a)
(3,a)
h2(0) = 0; h2(3) = 1
(0,a)
(3,a)
(3,j)
(0,j)
(5,b)
(5,a)
(0,j)
2. Then we use hash function h2 to sort into the buckets, which each have one page in the buffer
Copy (0, a) →Bucket0 and (3, a) → Bucket1, based on hash
HPJ Phase 1: Partitioning
3. We repeat until the buffer bucket pages are full…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(0,a)
(3,a)
(3,j)
(0,j)
(5,b)
(5,a)
(0,j)
HPJ Phase 1: Partitioning
3. We repeat until the buffer bucket pages are full…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(0,a)
(3,a)
(0,j)
(5,b)
(5,a)
(0,j)
h2(3) = 1
(3,j)
(0,j)
(3,a)
(3,j)
HPJ Phase 1: Partitioning
3. We repeat until the buffer bucket pages are full…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(0,a)
(3,a)
(0,j)
(5,b)
(5,a)
(0,j)
h2(0) = 0
(3,a)
(3,j)
(0,a)
(0,j)
HPJ Phase 1: Partitioning
3. We repeat until the buffer bucket pages are full… then flush to disk
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(5,b)
(5,a)
(0,j)
B0
B1
(3,a)
(3,j)
(0,a)
(0,j)
Buffer is full. Flush to the B0 and B1 files.
HPJ Phase 1: Partitioning
3. We repeat until the buffer bucket pages are full… then flush to disk
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(5,b)
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(5,a)
(0,j)
Write relevant pages to B0 and B1 files.
HPJ Phase 1: Partitioning
Note that collisions can occur!
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
(5,b)
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
h2(5) = 1
Collision!!!
(5,a)
(0,j)
(5,a)
h2(5) = h2(3) = 1
Both (5, *) and (3, *) map to bucket 1
(0,j)
HPJ Phase 1: Partitioning
Finish this pass…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(5,a)
(0,j)
(5,b)
HPJ Phase 1: Partitioning
Finish this pass…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(5,a)
(0,j)
(5,b)
h2(5) = 1
(5,a)
(5,b)
h2(5) = h2(3) = 1
Collision!!!
HPJ Phase 1: Partitioning
Finish this pass…
Main Memory
Buffer
Input page
0
1
Output (bucket) pages
Disk
R
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
(5,a)
(5,b)
HPJ Phase 1: Partitioning
Disk
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
(5,a)
(5,b)
We wanted buckets of size B-1 = 1… however we got larger ones due to:
(1) Duplicate join keys
(2) Hash collisions
HPJ Phase 1: Partitioning
Disk
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
(5,a)
(5,b)
To take care of larger buckets caused by (2) hash collisions, we can just do another pass!
What hash function should we use?
Do another pass with a different hash function, h’2, ideally such that:
h’2(3) != h’2(5)
HPJ Phase 1: Partitioning
Disk
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
To take care of larger buckets caused by (2) hash collisions, we can just do another pass!
What hash function should we use?
Do another pass with a different hash function, h’2, ideally such that:
h’2(3) != h’2(5)
B2
(5,a)
(5,b)
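The repartition pass can be sketched in a few lines (a toy version with hypothetical hash functions h2 and h'2 chosen so that keys 3 and 5 separate on the second pass):

```python
def partition(tuples, h, B):
    # One partitioning pass: route each tuple to bucket h(key) % B.
    buckets = [[] for _ in range(B)]
    for t in tuples:
        buckets[h(t[0]) % B].append(t)
    return buckets

h2  = lambda k: k          # first pass: h2(3) % 2 == h2(5) % 2 == 1 (collision)
h2p = lambda k: k // 2     # second pass: h'2(3) = 1, h'2(5) = 2 (they separate)

overfull = [(3, 'a'), (3, 'j'), (5, 'a'), (5, 'b')]   # the oversized bucket B1
b0, b1 = partition(overfull, h2p, 2)
# Now the (3, *) and (5, *) tuples land in different buckets
```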
HPJ Phase 1: Partitioning
Disk
Given B+1 = 3 buffer pages
B0
B1
(0,a)
(0,j)
(3,a)
(3,j)
(0,j)
What about duplicate join keys? Unfortunately this is a problem… but usually not a huge one.
B2
(5,a)
(5,b)
We call this unevenness in the bucket size skew
Now that we have partitioned R and S…
HPJ Phase 2: Partition Join
Now, we just join pairs of buckets from R and S that have the same hash value to complete the join!
Disk
R
S
(3,j)
(0,j)
(0,a)
(0,a)
(3,b)
(5,b)
(0,a)
(0,j)
Disk
R1
S1
hB
S2
R2
(0,a)
(0,a)
(0,j)
(0,a)
(0,j)
(5,b)
(5,b)
Join matching buckets
(3,j)
(3,b)
HPJ: High-level procedure
Disk
R
S
(3,j)
(0,j)
(0,a)
(0,a)
(3,b)
(5,b)
(0,a)
(0,j)
Disk
R1
S1
hB
S2
R2
(0,a)
(0,a)
(0,j)
(0,a)
(0,j)
(5,b)
(5,b)
Don’t have to join the others! E.g. (S1 and R2)!
2. Per-Partition Join: JOIN tuples in same partition
(3,j)
(3,b)
HPJ Summary
Given enough buffer pages…
HJ takes ~3(P(R)+P(S)) + OUT IOs!
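Both phases can be sketched end to end (an in-memory stand-in, not a paged implementation; Python's built-in hash plays h_B):

```python
from collections import defaultdict

def hash_partition_join(R, S, B):
    # Phase 1: partition both relations on the join key with the same hash.
    def partition(rel):
        buckets = defaultdict(list)
        for t in rel:
            buckets[hash(t[0]) % B].append(t)
        return buckets

    Rb, Sb = partition(R), partition(S)
    out = []
    # Phase 2: join only matching bucket pairs (skip e.g. R2 vs. S1 entirely).
    for b, r_tuples in Rb.items():
        probe = defaultdict(list)            # per-bucket hash table on A
        for s in Sb.get(b, []):
            probe[s[0]].append(s)
        for r in r_tuples:
            for s in probe[r[0]]:
                out.append(r + s[1:])
    return out

R = [(1, 0, 1), (2, 3, 4), (2, 5, 2), (3, 1, 1)]
S = [(3, 7), (2, 2), (2, 3)]
joined = hash_partition_join(R, S, 2)
```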
Join Algorithms: Summary
For R ⋈ S on column A
Quadratic in P(R), P(S), i.e. O(P(R)*P(S))
Given sufficient buffer space, linear in P(R), P(S)
I.e. ~O(P(R)+P(S))
By only supporting equijoins & taking advantage of this structure!
IO analysis
Recap
Sorting relation R with N pages (i.e., P(R) = N), given B buffer pages (+1 for output):
- Sort N pages when N ~= B: ~2N IOs (because log_B 0.5 < 0)
- Sort N pages when N ~= 2*B^2: ~4N IOs (because log_B B = 1)
(vs. n log n for n tuples in RAM; CPU cost is negligible for large data, since IO is much, much slower)

SortMerge and HashJoin for R & S (where P(R) and P(S) are the numbers of pages in R and S), when you have enough RAM:
- ~3 * (P(R) + P(S)) + OUT
- Special cases (SortMerge if already sorted; HashJoin if already partitioned): ~1 * (P(R) + P(S)) + OUT
We assume cost = 1 IO for read and 1 IO for write.
Alternative IO model (e.g., SSDs in HW#2): 1 IO for read and 8 IOs for write?
Where we are
Week 3
Week 4
Week 5