1 of 19

CSE 414 Section 8

Cardinality and Cost Estimation

February 27th, 2025

2 of 19

Announcements

HW4 grades released
HW6 will be released soon

3 of 19

Cardinality Estimation Example

4 of 19

Cardinality Estimation (Point Selection)

Assume a uniform distribution.

σ_{color = “orange”}

X ≅ 1 / V(color)

≅ 1/4

5 of 19

Cardinality Estimation (Union)

Assume conditions are disjoint.

σ_{color = “orange” OR color = “green”}

X ≅ 1 / V(color) + 1 / V(color)

≅ 1/4 + 1/4 = 1/2
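As a minimal sketch of the point-selection and disjoint-union estimates, the Python below computes both selectivity factors with exact fractions. The function names are hypothetical, and capping the sum at 1 is an added assumption that is not on the slide.

```python
from fractions import Fraction


def point_selectivity(num_distinct: int) -> Fraction:
    """Selectivity of `attr = constant`, assuming a uniform distribution."""
    return Fraction(1, num_distinct)


def disjoint_or_selectivity(*selectivities: Fraction) -> Fraction:
    """Selectivity of an OR of disjoint conditions: the factors add.

    Capping at 1 is an assumption added here; a proportion cannot exceed 1.
    """
    return min(Fraction(1), sum(selectivities, Fraction(0)))


v_color = 4  # V(color) from the slide
x_orange = point_selectivity(v_color)
x_orange_or_green = disjoint_or_selectivity(x_orange, point_selectivity(v_color))
print(x_orange, x_orange_or_green)  # 1/4 1/2
```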

6 of 19

Cardinality Estimation (Intersection)

Assume independence.

σ_{c1 = “red” AND c2 = “yellow”}

X ≅ 1 / V(c1) · 1 / V(c2)

≅ 1/2 · 1/2 = 1/4

[Figure: example values of c1 and c2]
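To see the independence assumption at work, here is a small sketch that compares the estimate with the true fraction of matching rows. The table is hypothetical: every (c1, c2) combination appears exactly once, so the two columns are independent by construction.

```python
from fractions import Fraction
from itertools import product

# Hypothetical table: each (c1, c2) pair appears once, so V(c1) = V(c2) = 2.
rows = list(product(["red", "yellow"], ["red", "yellow"]))


def and_selectivity(*selectivities: Fraction) -> Fraction:
    """Selectivity of an AND of independent conditions: the factors multiply."""
    result = Fraction(1)
    for x in selectivities:
        result *= x
    return result


v_c1 = len({c1 for c1, _ in rows})
v_c2 = len({c2 for _, c2 in rows})
estimate = and_selectivity(Fraction(1, v_c1), Fraction(1, v_c2))

# True fraction of rows satisfying c1 = "red" AND c2 = "yellow".
matching = sum(1 for c1, c2 in rows if c1 == "red" and c2 == "yellow")
actual = Fraction(matching, len(rows))
print(estimate, actual)  # 1/4 1/4
```

When the columns are correlated, the estimate and the true fraction can differ, which is why the slide states independence as an assumption.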

7 of 19

Cardinality Estimation (Ranges)

Schema: Weather(date, pressure). Assume a uniform distribution.

σ_{pressure < p}: X ≅ (p − min) / (max − min)

σ_{pressure > p}: X ≅ (max − p) / (max − min)

σ_{p < pressure < q}: X ≅ (q − p) / (max − min)

[Figure: number line from min(pressure) to max(pressure) with points p and q marked]
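The three range formulas can be sketched as small Python functions. The pressure bounds and the constants p and q below are made-up numbers for illustration, and clipping out-of-range constants to [min, max] is an added assumption.

```python
def less_than_selectivity(p: float, lo: float, hi: float) -> float:
    """X for `attr < p`, assuming uniform values on [lo, hi] with hi > lo."""
    p = max(lo, min(hi, p))  # assumption: clip constants outside the range
    return (p - lo) / (hi - lo)


def greater_than_selectivity(p: float, lo: float, hi: float) -> float:
    """X for `attr > p` under the same assumptions."""
    p = max(lo, min(hi, p))
    return (hi - p) / (hi - lo)


def between_selectivity(p: float, q: float, lo: float, hi: float) -> float:
    """X for `p < attr < q` under the same assumptions."""
    p = max(lo, min(hi, p))
    q = max(lo, min(hi, q))
    return max(0.0, q - p) / (hi - lo)


# Hypothetical statistics for Weather.pressure.
min_pressure, max_pressure = 980.0, 1040.0
x_lt = less_than_selectivity(995.0, min_pressure, max_pressure)
x_gt = greater_than_selectivity(995.0, min_pressure, max_pressure)
x_between = between_selectivity(995.0, 1010.0, min_pressure, max_pressure)
print(x_lt, x_gt, x_between)  # 0.25 0.75 0.25
```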

8 of 19

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4

[Figure: circles colored by c1 and c2, with V(c1) = 3 and V(c2) = 4]

Equijoins are tricky. The first guess we might have for the selectivity factor is X = 1/V(c1) * 1/V(c2). This isn’t quite right though, as 1/V(c1) * 1/V(c2) = 1/3 * 1/4 = 1/12. The correct answer we want is X = 3/12 = 1/4. What went wrong?

Let’s think of what 1/V(c1) * 1/V(c2) really means. From the “Point Selection” slide, we know that 1/V(c1) is the selectivity factor for choosing one specific color from c1. For example, the selectivity factor for σ (c1 = ‘red’) is 1/V(c1) = 1/3. The same goes for 1/V(c2). For example, σ (c2 = ‘red’) has a selectivity factor of 1/V(c2) = 1/4.

From the “Intersection” slide, we know that multiplying the selectivity factors of two conditions (when the conditions are independent, as they are here) gives the selectivity factor of their AND. So σ (c1 = ‘red’ AND c2 = ‘red’) has a selectivity factor of 1/3 * 1/4 = 1/12. If we look at the table, that’s exactly right: there is just one circle that is red for both c1 and c2.

We don’t just want one circle here, though. The equijoin is like multiple selections at once: σ (c1 = ‘red’ AND c2 = ‘red’) + σ (c1 = ‘orange’ AND c2 = ‘orange’) + σ (c1 = ‘yellow’ AND c2 = ‘yellow’) = 1/12 + 1/12 + 1/12 = 1/12 * 3 = 3/12.

Is that why we multiply by min(V(c1), V(c2))? Yes! Think about this: the selectivity factor for each σ is always the same: 1/V(c1) * 1/V(c2). The only question is how many of them we need to add together. When the two columns have different numbers of distinct values, the column with fewer distinct values is the one that constrains the join. You could add 100 new colors to c2 and it would not add any matches, as c1 would not have any of those new colors. The number of terms we add only increases when V(c1) increases toward V(c2).

In conclusion, we get the formula 1/V(c1) * 1/V(c2) * min(V(c1), V(c2)), which we can simplify to 1/max(V(c1), V(c2)). (The smaller of V(c1) and V(c2) cancels with its copy in the denominator, leaving 1 over the larger of the two: 1/max(V(c1), V(c2)).)
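The argument above can be checked numerically. This sketch adds up one 1/(V(c1) · V(c2)) term per shared value and confirms the total equals 1/max(V(c1), V(c2)). The function names are hypothetical, and every pair of counts other than (3, 4) is made up for illustration.

```python
from fractions import Fraction


def equijoin_selectivity_by_sum(v1: int, v2: int) -> Fraction:
    """Add one AND-term per shared value (uniform, full-overlap assumptions)."""
    per_value = Fraction(1, v1) * Fraction(1, v2)
    shared_values = min(v1, v2)  # full overlap: the smaller set is contained in the larger
    return sum(per_value for _ in range(shared_values))


def equijoin_selectivity(v1: int, v2: int) -> Fraction:
    """Closed form from the slide: X is approximately 1 / max(V(c1), V(c2))."""
    return Fraction(1, max(v1, v2))


# (3, 4) is the example from the notes; the other pairs are hypothetical.
for v1, v2 in [(3, 4), (4, 4), (3, 5), (3, 104)]:
    assert equijoin_selectivity_by_sum(v1, v2) == equijoin_selectivity(v1, v2)

print(equijoin_selectivity_by_sum(3, 4))  # 1/4
```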

9 of 19

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4 (didn’t change)

[Figure: updated values of c1 and c2]

10 of 19

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/5 (decreased)

[Figure: updated values of c1 and c2]

11 of 19

Summary

12 of 19

Cost Estimation

13 of 19

Cost Estimation: Factors

B(R) = # blocks for relation R

T(R) = # tuples for relation R

V(R, a) = # of unique values of attribute a in relation R

M = # of available memory pages

14 of 19

Cost Estimation: Nested Loop Join (⋈)

Naive: B(R) + T(R)B(S)

for each tuple t1 in R do
    for each tuple t2 in S do
        if t1 and t2 join then output (t1, t2)
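To connect the pseudocode to the B(R) + T(R)B(S) cost, here is a toy simulation that counts block reads. Relations are modeled as lists of blocks, each block a list of tuples. The data and the function name are hypothetical, and the model assumes S is re-read from disk for every tuple of R.

```python
def naive_nested_loop_join(r_blocks, s_blocks, can_join):
    """Tuple-at-a-time nested loop join that also counts simulated block reads."""
    reads = 0
    output = []
    for r_block in r_blocks:
        reads += 1  # each block of R is read once: B(R) in total
        for t1 in r_block:
            for s_block in s_blocks:
                reads += 1  # S is rescanned for every tuple of R: T(R) * B(S)
                for t2 in s_block:
                    if can_join(t1, t2):
                        output.append((t1, t2))
    return output, reads


# Hypothetical data: B(R) = 2, T(R) = 4, B(S) = 3.
r_blocks = [[1, 2], [3, 4]]
s_blocks = [[1, 1], [2, 5], [4, 6]]
pairs, reads = naive_nested_loop_join(r_blocks, s_blocks, lambda a, b: a == b)

b_r, t_r, b_s = 2, 4, 3
print(reads, b_r + t_r * b_s)  # 14 14
```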

15 of 19

Cost Estimation: Nested Loop Join (⋈)

Block-at-a-time: B(R) + B(S)B(R)

for each block bR in R:
    for each block bS in S:
        for each tuple tR in bR:
            for each tuple tS in bS:
                if tR and tS can join:
                    output (tR, tS)

16 of 19

Cost Estimation: Nested Loop Join (⋈)

Block-nested-loop: B(R) + (B(R)/(M−1)) · B(S) ≅ B(R) + (B(R)/M) · B(S)

for each group of M−1 blocks bR in R:
    for each block bS in S:
        for each tuple tR in bR:
            for each tuple tS in bS:
                if tR and tS can join:
                    output (tR, tS)
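For a side-by-side comparison, the sketch below evaluates the three nested loop cost formulas from these slides. The statistics are hypothetical, and rounding B(R)/(M−1) up to a whole number of groups is an added assumption, since a partial group still needs its own scan of S.

```python
from math import ceil


def naive_cost(b_r: int, t_r: int, b_s: int) -> int:
    """Tuple-at-a-time: B(R) + T(R) * B(S)."""
    return b_r + t_r * b_s


def block_at_a_time_cost(b_r: int, b_s: int) -> int:
    """One block of R at a time: B(R) + B(R) * B(S)."""
    return b_r + b_r * b_s


def block_nested_loop_cost(b_r: int, b_s: int, m: int) -> int:
    """Groups of M-1 blocks of R, one page left for S: B(R) + ceil(B(R)/(M-1)) * B(S)."""
    if m < 2:
        raise ValueError("need at least 2 memory pages")
    return b_r + ceil(b_r / (m - 1)) * b_s


# Hypothetical statistics: B(R) = 100, T(R) = 10,000, B(S) = 1,000, M = 11.
b_r, t_r, b_s, m = 100, 10_000, 1_000, 11
print(naive_cost(b_r, t_r, b_s))           # 10000100
print(block_at_a_time_cost(b_r, b_s))      # 100100
print(block_nested_loop_cost(b_r, b_s, m)) # 10100
```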

17 of 19

Cost Estimation: Hash Join (⋈)

R joined with S (assume R is the smaller relation)

B(R) + B(S)

One pass (look at each table once), assuming B(R) < M: read all of R into an in-memory hash table, then scan S and probe the hash table for matches.
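A minimal in-memory sketch of the one-pass hash join, assuming R fits in memory. The relations, the key functions, and the function name are hypothetical; a real system works on disk blocks rather than Python lists.

```python
from collections import defaultdict


def hash_join(r_rows, s_rows, r_key, s_key):
    """One-pass equijoin: build a hash table on R, then probe it with S."""
    table = defaultdict(list)
    for r in r_rows:  # build phase: a single pass over R
        table[r_key(r)].append(r)

    output = []
    for s in s_rows:  # probe phase: a single pass over S
        for r in table.get(s_key(s), []):
            output.append((r, s))
    return output


# Hypothetical relations: R(id, name) and S(id, score), joined on id.
r_rows = [(1, "ann"), (2, "bob")]
s_rows = [(2, 90), (3, 75), (2, 60), (1, 88)]
joined = hash_join(r_rows, s_rows, r_key=lambda r: r[0], s_key=lambda s: s[0])
print(joined)
```

Each relation is read exactly once, which is where the B(R) + B(S) cost comes from.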

18 of 19

Cost Estimation: Sort-Merge Join (⋈)

B(R) + B(S)

One pass (look at each table once); Both must be small (B(R) + B(S) < M)

Why would we use this over Hash Join?

Tables are already sorted on the join attributes (no need to hash)
The join is a range join rather than an equijoin
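The merge step can be sketched as follows, for an equijoin with possibly repeated keys. This version sorts both inputs in memory first; if the tables are already sorted on the join attribute, the two `sorted` calls can be skipped. The data and the function name are hypothetical.

```python
def sort_merge_join(r_rows, s_rows, key):
    """Equijoin by sorting both inputs on the key and merging them in one sweep."""
    r_sorted = sorted(r_rows, key=key)
    s_sorted = sorted(s_rows, key=key)
    output = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        k_r, k_s = key(r_sorted[i]), key(s_sorted[j])
        if k_r < k_s:
            i += 1
        elif k_r > k_s:
            j += 1
        else:
            # Find the run of S tuples sharing this key.
            j_end = j
            while j_end < len(s_sorted) and key(s_sorted[j_end]) == k_r:
                j_end += 1
            # Pair every R tuple with this key against that run.
            while i < len(r_sorted) and key(r_sorted[i]) == k_r:
                for s in s_sorted[j:j_end]:
                    output.append((r_sorted[i], s))
                i += 1
            j = j_end
    return output


# Hypothetical relations, joined on the first field.
r_rows = [(2, "b"), (1, "a"), (4, "d"), (2, "c")]
s_rows = [(3, "z"), (2, "x"), (4, "w"), (2, "y")]
merged = sort_merge_join(r_rows, s_rows, key=lambda t: t[0])
print(len(merged))  # 5
```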

19 of 19

Selectivity Formulas

Selectivity Factor (X) → Proportion of total data needed

Assuming uniform distribution of data values on numeric attribute a in table R, if the condition is:

a = c → X ≅ 1 / V(R, a)
a < c → X ≅ (c − min(R, a)) / (max(R, a) − min(R, a))
c1 < a < c2 → X ≅ (c2 − c1) / (max(R, a) − min(R, a))
cond1 AND cond2 → X ≅ X1 · X2 (assuming the conditions are independent)