Section 9
Cost Estimation + Parallel DB
November 30th, 2023
Announcements
Cost Estimation Example
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
(Hash Join)
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
(Hash Join)
(On the fly)
(Sequential
Scan)
(Sequential
Scan)
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
(Hash Join)
(On the fly)
(Sequential
Scan)
(Sequential
Scan)
(On the fly)
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
(Hash Join)
(On the fly)
(On the fly)
Cost = B(Supply) = 100
Cost = B(Supplier) = 100
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
Cost = 0
Cost = 0
C1 = 100
C2 = 100
Cost Estimation
Supply Statistics:
Supplier Statistics:
Supply
σpno=2
Supplier
σscity=’Seattle’ Λ sstate=’WA’
πsname
⋈sid=sid
Total Cost = 100 + 100 = 200 I/Os
C1 = 100
C2 = 100
C3 = 0
C4 = 0
Parallel DBs
Data Partitioning
We focus on shared-nothing architecture and intra-operator parallelism.
Moving Data
We have a “network layer” to move tuples temporarily between nodes.
Transferring data is expensive, so we need to be efficient (especially on joins and grouping).
Moving Data:
Partitioned Hash-Join Mechanism
Key Points:
R1, S1
R2, S2
Rn, Sn
R1’, S1’
R2’, S2’
Rn’, Sn’
...
...
Contains tuples s.t.
h(R.x) = h(S.y) = red
Contains tuples s.t.
h(R.x) = h(S.y) = green
Contains
tuples s.t.
h(R.x) = h(S.y) = blue
Moving Data:
Broadcast Join Mechanism
Takes advantage of small datasets (can all fit into main memory)
Key Points:
R1
R2
Rn
R1’, S
R2’, S
Rn’, S
...
...
S
Contains all of S
Contains all of S
Contains all of S
Parallel Query Plans
Now, we need to know how to derive parallel plans from single node plans.
⋈
σ
π
Parallel DB Practice!
We have a distributed database that holds the relations:
Drug(spec VARCHAR(255), compatibility INT)
Person(name VARCHAR(100) primary key, compatibility INT)
We want to compute:
SELECT P.name, count(D.spec)
FROM Person AS P, Drug AS D
WHERE P.compatibility = D.compatibility
GROUP BY P.name;
Drug is block-partitioned
Person is hash-partitioned on compatibility [h(n)]
You have three nodes. Draw a parallel query plan.
ƔP.name, count(D.spec)(P ⋈ D)
Node 1
Node 2
Node 3
P ⋈ D
⋈
⋈
ƔP.name,count(D.spec)
Ɣ
Ɣ
Hash [h(n)] Drug on compatibility
Takes advantage of:
Block partitioning of Drug