1 of 31

CSE 344: Section 8

Indexing & Estimation

February 22nd, 2024

2 of 31

Administrivia

HW6 Milestone 1 due tomorrow 2/23 @ 11pm

Can use late days

HW6 Milestone 2 due 3/1 @ 11 pm

Most students use late days

3 of 31

Indexing

4 of 31

5 of 31

Definitions

Index - data structure that organizes data records on disk to optimize selections on the search key fields for the index
An index contains a collection of data entries, and supports efficient retrieval of all data entries with the given search key value k
A data entry is of the format (key = k, value = pointer to records)

14
15
16

14	tuple1

15	tuple2

16	tuple3

Data file sorted on age

Index File

Index on age

6 of 31

Definitions

An index is either clustered or unclustered, depending on how the actual data is sorted on disk
Clustered index - one that has the same key ordering as what is on disk (one per table)
Unclustered index - may exist without any ordering on disk (any number per table)

7 of 31

Definitions

An index is either clustered or unclustered, depending on how the actual data is sorted on disk
Clustered index - one that has the same key ordering as what is on disk (one per table)
Unclustered index - may exist without any ordering on disk (any number per table)

8 of 31

Why Indexing?

Consider you have a deck of 52 cards. I asked you to pick out the 8 of hearts. Calculate the average number of flips required if:

The deck is randomly shuffled.
The deck is separated into four piles, each representing a suit (spades, diamonds, clubs, hearts). Each pile is kept randomly face down.

9 of 31

Problem 1

To find the 8 of hearts from a randomly shuffled cards, on average I would require 26 flips through the deck, about half the deck.

10 of 31

Problem 2

There are 4 piles, one for each suit.
Disregard the other 3 piles, i.e. clubs, spades, diamonds.
In doing so we would require 2 flips, on average, to find the heart pile.
We're left with 13 cards in the heart pile to scan, on average we would require 7 flips to find the 8 of hearts.

11 of 31

Answer

Problem 1:

Scan 52 Cards (+26)

Average Flips = 26

Problem 2:

Pick suit (+2)
Scan 13 cards (+7)

Average Flips = 9

12 of 31

Further Thinking…

Can you think of an better index?
Would your index be clustered or unclustered?

13 of 31

14 of 31

Cost Estimation

15 of 31

Cost Estimation: Factors

B(R) = # blocks for relation R

T(R) = # tuples for relation R

V(R, a) = # of unique values of attribute a in relation R

M = # of available memory pages

16 of 31

Cost Estimation: Nested Loop Join (⋈)

Naive: B(R) + T(R)B(S)

for each tuple t1 in R do

for each tuple t2 in S do

if t1 and t2 join then output (t1,t2)

17 of 31

Cost Estimation: Nested Loop Join (⋈)

Block-at-a-time: B(R) + B(S)B(R)

for each block bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

18 of 31

Cost Estimation: Nested Loop Join (⋈)

Block-nested-loop: B(R) + (B(R)/(M-1))*B(S) ≅ B(R) + (B(R)/(M)) * B(S)

for each group of M blocks bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

19 of 31

Cost Estimation: Hash Join (⋈)

R joined with S (assume R is smaller in size)

B(R) + B(S)

Assuming B(R) < M for one pass (look at each table once) efficiency, read all of R into a hash table and join with all of S

20 of 31

Cost Estimation: Sort-Merge Join (⋈)

B(R) + B(S)

One pass (look at each table once); Both must be small (B(R) + B(S) < M)

Why would we use this over Hash Join?

Tables are sorted on join attributes (No need to hash)
Range join instead of equijoin

21 of 31

Selectivity Formulas

Selectivity Factor (X) → Proportion of total data needed

Assuming uniform distribution of data values on numeric attribute a in table R, if the condition is:

a = c → X ≅1 / V(R, a)
a < c→ X ≅ (c - min(R, a))/ (max(R, a) - min(R, a))
c1 < a < c2→ X ≅ (c2 - c1)/ (max(R, a) - min(R, a))
cond1 and cond2 → X ≅ X1 * X2

22 of 31

Cardinality Estimation Example

23 of 31

Cardinality Estimation (Point Selection)

Assume a uniform distribution.

σ_{color = “orange”}

X ≅ 1 / V(color)

≅ 1/4

24 of 31

Cardinality Estimation (Union)

Assume conditions are disjoint.

σ_{color = “orange” OR color = “green”}

X ≅ 1 / V(color) + 1 / V(color)

≅ 1/4 + 1/4 = 1/2

25 of 31

Cardinality Estimation (Intersection)

Assume independence.

σ_{c1 = “red” AND c2 = “yellow”}

X ≅ 1 / V(c1) · 1 / V(c2)

≅ 1/2 · 1/2 = 1/4

c1

c2

26 of 31

Cardinality Estimation (Ranges)

Schema: Weather(date, pressure). Assume a uniform distribution.

σ_{pressure < p}

σ_{pressure > p}

σ_{p < pressure < q}

X ≅ (p − min) / (max - min)

X ≅ (max − p) / (max - min)

X ≅ (q − p) / (max - min)

min(pressure)

max(pressure)

p

q

27 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4

c1

c2

Equijoins are tricky. The first guess we might have for the selectivity factor is X = 1/V(c1) * 1/V(c2). This isn’t quite right though, as 1/V(c1) * 1/V(c2) = 1/3 * 1/4 = 1/12. The correct answer we want is X = 3/12 = 1/4. What went wrong?

Let’s think of what 1/V(c1) * 1/V(c2) really means. From the “Point Selection” slide, we know that 1/V(c1) is the selection factor for choosing one specific color from c1. For example, the selectivity factor for σ (c1 = ‘red’) is 1/V(c1) = 1/3. The same goes for 1/V(c2). For example, σ (c2 = ‘red’) has a selectivity factor of 1/V(c2) = 1/4.

From the “Intersection” slide, we know that multiplying selectivity conditions (when they are independent, and in this case they are) is the same as the AND of them. So σ (c1 = ‘red’ AND c2 = ‘red’) has a selectivity factor of 1/3 * 1/4 = 1/12. If we look at the table, that’s exactly right: there is just one circle that is red for c1 and c2.

We don’t just want one circle here, though. The equijoin is like multiple selections at once: σ (c1 = ‘red’ AND c2 = ‘red’) + σ (c1 = ‘orange’ AND c2 = ‘orange’) + σ (c1 = ‘yellow’ AND c2 = ‘yellow’) = 1/12 + 1/12 + 1/12 = 1/12 * 3 = 3/12.

Is that why we multiply by min(V(c1), V(c2))? Yes! Think about this: the selectivity factor for each σ is always the same: 1/V(c1) * 1/V(c2). The only question is how many we need to add together. When the tables are different sizes, the smaller table is the one that constrains the join. You could add 100 new colors to c2 and it would not increase the cardinality of the join as c1 would not have any of those new colors. The cardinality only increases when V(c1) increases to match V(c2).

In conclusion, we get the formula 1/V(c1) * 1/V(c2) * min(V(c1), V(c2)), which we can simplify to 1/max(V(c1), V(c2)). (When we divide the smaller of V(c1) and V(c2) by both V(c1) and V(c2), we’re left with the larger of the two--max(V(c1), V(c2)).)

28 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4 (didn’t change)

c1

c2

29 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/5 (decreased)

c1

c2

30 of 31

Summary

31 of 31

Worksheet