CSE 344: Section 8
Parallel DBs
November 30, 2023
Administrivia
Parallel DBs
Parallel Query Processing
Horizontal vs Vertical Partitioning
Horizontal Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Horizontal Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Horizontal Data Partitioning
Payroll
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
ID | Name | Job | Salary |
| | | |
| | | |
Payroll1
Payroll2
Payroll3
Vertical Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Vertical Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Vertical Data Partitioning
11
Payroll
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
…
1
2
3
ID | Name |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Job |
|
|
|
|
|
|
|
|
|
Salary |
|
|
|
|
|
|
|
|
|
Payroll1
Payroll2
Payroll3
Horizontal Partitioning Strategies
Block Partitioning
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
…
T(R) = N
T(R2) = N/p
T(Rp) = N/p
T(R1) = N/p
p nodes
| |
| |
| |
| |
| |
… | … |
| |
| |
| |
| |
Range Partitioning
A | |
| |
| |
| |
| |
| |
| |
… | … |
A | |
| |
| |
A | |
| |
| |
A | |
| |
| |
…
R2, v1 < A <= v2
RN, vN < A < inf
R1, -inf < A <= v1
p nodes
Hash Partitioning
A | |
| |
| |
| |
| |
| |
| |
… | … |
A | |
| |
| |
A | |
| |
| |
A | |
| |
| |
…
R2, 2 = h(A)%p
RN, 0 = h(A)%p
R1, 1 = h(A)%p
p nodes
h(A)
Partition Quality
Worksheet Question 3 & 4
Question 3
Question 4
Question 4
Question 4 cont.
The query load consists primarily but not solely of randomly-sampled letter contents, grouped by sender. The database contains every letter sent since the creation of the Postal Service in 1771.
As before, we should choose a horizontal partitioning scheme (although we do know that the letter contents attribute will be used eventually, we don’t have information on if/which other columns are needed). If we are grouping primarily by sender, this would be a reasonable attribute to partition on.
(The founding of the postal service is a red herring in this question)
Shared-Nothing Model
Data Partitioning
We focus on shared-nothing architecture and intra-operator parallelism.
Moving Data
We have a “network layer” to move tuples temporarily between nodes.
Transferring data is expensive, so we need to be efficient (especially on joins and grouping).
Moving Data:
Partitioned Hash-Join Mechanism
Key Points:
R1, S1
R2, S2
Rn, Sn
R1’, S1’
R2’, S2’
Rn’, Sn’
...
...
Contains tuples s.t.
h(R.x) = h(S.y) = red
Contains tuples s.t.
h(R.x) = h(S.y) = green
Contains
tuples s.t.
h(R.x) = h(S.y) = blue
Moving Data:
Broadcast Join Mechanism
Takes advantage of small datasets (can all fit into main memory)
Key Points:
R1
R2
Rn
R1’, S
R2’, S
Rn’, S
...
...
S
Contains all of S
Contains all of S
Contains all of S
Parallel Query Plans
Now, we need to know how to derive parallel plans from single node plans.
⋈
σ
π
Parallel DB Practice!
We have a distributed database that holds the relations:
Drug(spec VARCHAR(255), compatibility INT)
Person(name VARCHAR(100) primary key, compatibility INT)
We want to compute:
SELECT P.name, count(D.spec)
FROM Person AS P, Drug AS D
WHERE P.compatibility = D.compatibility
GROUP BY P.name;
Drug is block-partitioned
Person is hash-partitioned on compatibility [h(n)]
You have three nodes. Draw a parallel query plan.
ƔP.name, count(D.spec)(P ⋈ D)
Node 1
Node 2
Node 3
P ⋈ D
⋈
⋈
ƔP.name,count(D.spec)
Ɣ
Ɣ
Hash [h(n)] Drug on compatibility
Takes advantage of:
Block partitioning of Drug
Section Worksheet!