1 of 18

Section 9

Cost Estimation + Parallel DB

November 30th, 2023

2 of 18

Announcements

HW 6 part 2 due date extended to Monday!
HW 7 will be out tomorrow

3 of 18

Cost Estimation Example

4 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

(Hash Join)

5 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

(Hash Join)

(On the fly)

(Sequential

Scan)

(Sequential

Scan)

6 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

(Hash Join)

(On the fly)

(Sequential

Scan)

(Sequential

Scan)

(On the fly)

7 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

(Hash Join)

(On the fly)

Cost = B(Supply) = 100

Cost = B(Supplier) = 100

8 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

Cost = 0

C1 = 100

C2 = 100

9 of 18

Cost Estimation

Supply Statistics:

T(Supply) = 10000
B(Supply) = 100
V(Supply, pno) = 2500

Supplier Statistics:

T(Supplier) = 1000
B(Supplier) = 100
V(Supplier, scity) = 20
V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

⋈sid=sid

Total Cost = 100 + 100 = 200 I/Os

C1 = 100

C2 = 100

C3 = 0

C4 = 0

10 of 18

Parallel DBs

11 of 18

Data Partitioning

We focus on shared-nothing architecture and intra-operator parallelism.

Block Partition

Tuples partitioned by raw size (no ordering considered)

Hash partitioned on attribute A

Node contains tuples with chosen attribute hashes

Range partitioned on attribute A

Node contains tuples in chosen attribute ranges

12 of 18

Moving Data

We have a “network layer” to move tuples temporarily between nodes.

Transferring data is expensive, so we need to be efficient (especially on joins and grouping).

13 of 18

Moving Data:

Partitioned Hash-Join Mechanism

Join on some attribute (e.g. R.x and S.y)
Call hash function h(z)

Key Points:

Hash shuffle tuples on join attributes

R1, S1

R2, S2

Rn, Sn

R1’, S1’

R2’, S2’

Rn’, Sn’

...

Contains tuples s.t.

h(R.x) = h(S.y) = red

Contains tuples s.t.

h(R.x) = h(S.y) = green

Contains

tuples s.t.

h(R.x) = h(S.y) = blue

14 of 18

Moving Data:

Broadcast Join Mechanism

Takes advantage of small datasets (can all fit into main memory)

Key Points:

Partition type of R doesn’t matter
S is unpartitioned and small
Distribute S across all R

R1

R2

Rn

R1’, S

R2’, S

Rn’, S

...

S

Contains all of S

15 of 18

Parallel Query Plans

Now, we need to know how to derive parallel plans from single node plans.

Which RA operations can you do without talking to other nodes?
Which RA operations require moving tuples?

⋈

σ

π

16 of 18

Parallel DB Practice!

17 of 18

We have a distributed database that holds the relations:

Drug(spec VARCHAR(255), compatibility INT)

Person(name VARCHAR(100) primary key, compatibility INT)

We want to compute:

SELECT P.name, count(D.spec)

FROM Person AS P, Drug AS D

WHERE P.compatibility = D.compatibility

GROUP BY P.name;

Drug is block-partitioned

Person is hash-partitioned on compatibility [h(n)]

You have three nodes. Draw a parallel query plan.

18 of 18

Ɣ_{P.name, count(D.spec)}(P ⋈ D)

Node 1

Node 2

Node 3

P ⋈ D

⋈

Ɣ_{P.name,count(D.spec)}

Ɣ

Hash [h(n)] Drug on compatibility

Takes advantage of:

Hash partitioning of [h(n)]
PK uniqueness of name

Block partitioning of Drug