1 of 18

Section 9

Cost Estimation + Parallel DB

November 30th, 2023

2 of 18

Announcements

  • HW 6 part 2 due date extended to Monday!
  • HW 7 will be out tomorrow

3 of 18

Cost Estimation Example

4 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

(Hash Join)

5 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

(Hash Join)

(On the fly)

(Sequential

Scan)

(Sequential

Scan)

6 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

(Hash Join)

(On the fly)

(Sequential

Scan)

(Sequential

Scan)

(On the fly)

7 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

(Hash Join)

(On the fly)

(On the fly)

Cost = B(Supply) = 100

Cost = B(Supplier) = 100

8 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

Cost = 0

Cost = 0

C1 = 100

C2 = 100

9 of 18

Cost Estimation

Supply Statistics:

  • T(Supply) = 10000
  • B(Supply) = 100
  • V(Supply, pno) = 2500

Supplier Statistics:

  • T(Supplier) = 1000
  • B(Supplier) = 100
  • V(Supplier, scity) = 20
  • V(Supplier, sstate) = 10

Supply

σpno=2

Supplier

σscity=’Seattle’ Λ sstate=’WA’

πsname

sid=sid

Total Cost = 100 + 100 = 200 I/Os

C1 = 100

C2 = 100

C3 = 0

C4 = 0

10 of 18

Parallel DBs

11 of 18

Data Partitioning

We focus on shared-nothing architecture and intra-operator parallelism.

  • Block Partition
    • Tuples partitioned by raw size (no ordering considered)
  • Hash partitioned on attribute A
    • Node contains tuples with chosen attribute hashes
  • Range partitioned on attribute A
    • Node contains tuples in chosen attribute ranges

12 of 18

Moving Data

We have a “network layer” to move tuples temporarily between nodes.

Transferring data is expensive, so we need to be efficient (especially on joins and grouping).

13 of 18

Moving Data:

Partitioned Hash-Join Mechanism

  • Join on some attribute (e.g. R.x and S.y)
  • Call hash function h(z)

Key Points:

  • Hash shuffle tuples on join attributes

R1, S1

R2, S2

Rn, Sn

R1’, S1’

R2’, S2’

Rn’, Sn’

...

...

Contains tuples s.t.

h(R.x) = h(S.y) = red

Contains tuples s.t.

h(R.x) = h(S.y) = green

Contains

tuples s.t.

h(R.x) = h(S.y) = blue

14 of 18

Moving Data:

Broadcast Join Mechanism

Takes advantage of small datasets (can all fit into main memory)

Key Points:

  • Partition type of R doesn’t matter
  • S is unpartitioned and small
  • Distribute S across all R

R1

R2

Rn

R1’, S

R2’, S

Rn’, S

...

...

S

Contains all of S

Contains all of S

Contains all of S

15 of 18

Parallel Query Plans

Now, we need to know how to derive parallel plans from single node plans.

  • Which RA operations can you do without talking to other nodes?
  • Which RA operations require moving tuples?

σ

π

16 of 18

Parallel DB Practice!

17 of 18

We have a distributed database that holds the relations:

Drug(spec VARCHAR(255), compatibility INT)

Person(name VARCHAR(100) primary key, compatibility INT)

We want to compute:

SELECT P.name, count(D.spec)

FROM Person AS P, Drug AS D

WHERE P.compatibility = D.compatibility

GROUP BY P.name;

Drug is block-partitioned

Person is hash-partitioned on compatibility [h(n)]

You have three nodes. Draw a parallel query plan.

18 of 18

ƔP.name, count(D.spec)(P D)

Node 1

Node 2

Node 3

P D

ƔP.name,count(D.spec)

Ɣ

Ɣ

Hash [h(n)] Drug on compatibility

Takes advantage of:

  1. Hash partitioning of [h(n)]
  2. PK uniqueness of name

Block partitioning of Drug