1 of 31

CSE 344: Section 8

Indexing & Estimation

February 22nd, 2024

2 of 31

Administrivia

  • HW6 Milestone 1 due tomorrow 2/23 @ 11pm
    • Can use late days
  • HW6 Milestone 2 due 3/1 @ 11 pm
    • Most students use late days

3 of 31

Indexing

4 of 31

5 of 31

Definitions

  • Index - data structure that organizes data records on disk to optimize selections on the search key fields for the index
  • An index contains a collection of data entries, and supports efficient retrieval of all data entries with the given search key value k
  • A data entry is of the format (key = k, value = pointer to records)

14

15

16

14

tuple1

15

tuple2

16

tuple3

Data file sorted on age

Index File

Index on age

6 of 31

Definitions

  • An index is either clustered or unclustered, depending on how the actual data is sorted on disk
  • Clustered index - one that has the same key ordering as what is on disk (one per table)
  • Unclustered index - may exist without any ordering on disk (any number per table)

7 of 31

Definitions

  • An index is either clustered or unclustered, depending on how the actual data is sorted on disk
  • Clustered index - one that has the same key ordering as what is on disk (one per table)
  • Unclustered index - may exist without any ordering on disk (any number per table)

8 of 31

Why Indexing?

Consider you have a deck of 52 cards. I asked you to pick out the 8 of hearts. Calculate the average number of flips required if:

  • The deck is randomly shuffled.
  • The deck is separated into four piles, each representing a suit (spades, diamonds, clubs, hearts). Each pile is kept randomly face down.

9 of 31

Problem 1

To find the 8 of hearts from a randomly shuffled cards, on average I would require 26 flips through the deck, about half the deck.

10 of 31

Problem 2

  • There are 4 piles, one for each suit.
  • Disregard the other 3 piles, i.e. clubs, spades, diamonds.
  • In doing so we would require 2 flips, on average, to find the heart pile.
  • We're left with 13 cards in the heart pile to scan, on average we would require 7 flips to find the 8 of hearts.

11 of 31

Answer

Problem 1:

  • Scan 52 Cards (+26)

Average Flips = 26

Problem 2:

  • Pick suit (+2)
  • Scan 13 cards (+7)

Average Flips = 9

12 of 31

Further Thinking…

  • Can you think of an better index?
  • Would your index be clustered or unclustered?

13 of 31

14 of 31

Cost Estimation

15 of 31

Cost Estimation: Factors

B(R) = # blocks for relation R

T(R) = # tuples for relation R

V(R, a) = # of unique values of attribute a in relation R

M = # of available memory pages

16 of 31

Cost Estimation: Nested Loop Join (⋈)

Naive: B(R) + T(R)B(S)

for each tuple t1 in R do

for each tuple t2 in S do

if t1 and t2 join then output (t1,t2)

17 of 31

Cost Estimation: Nested Loop Join (⋈)

Block-at-a-time: B(R) + B(S)B(R)

for each block bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

18 of 31

Cost Estimation: Nested Loop Join (⋈)

Block-nested-loop: B(R) + (B(R)/(M-1))*B(S) ≅ B(R) + (B(R)/(M)) * B(S)

for each group of M blocks bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

19 of 31

Cost Estimation: Hash Join (⋈)

R joined with S (assume R is smaller in size)

B(R) + B(S)

Assuming B(R) < M for one pass (look at each table once) efficiency, read all of R into a hash table and join with all of S

20 of 31

Cost Estimation: Sort-Merge Join (⋈)

B(R) + B(S)

One pass (look at each table once); Both must be small (B(R) + B(S) < M)

Why would we use this over Hash Join?

  • Tables are sorted on join attributes (No need to hash)
  • Range join instead of equijoin

21 of 31

Selectivity Formulas

  • Selectivity Factor (X) → Proportion of total data needed
    • Assuming uniform distribution of data values on numeric attribute a in table R, if the condition is:
      • a = c → X 1 / V(R, a)
      • a < c→ X (c - min(R, a))/ (max(R, a) - min(R, a))
      • c1 < a < c2→ X (c2 - c1)/ (max(R, a) - min(R, a))
      • cond1 and cond2 → X X1 * X2

22 of 31

Cardinality Estimation Example

23 of 31

Cardinality Estimation (Point Selection)

Assume a uniform distribution.

σcolor = “orange”

X ≅ 1 / V(color)

≅ 1/4

24 of 31

Cardinality Estimation (Union)

Assume conditions are disjoint.

σcolor = “orange” OR color = “green”

X ≅ 1 / V(color) + 1 / V(color)

≅ 1/4 + 1/4 = 1/2

25 of 31

Cardinality Estimation (Intersection)

Assume independence.

σc1 = “red” AND c2 = “yellow”

X ≅ 1 / V(c1) · 1 / V(c2)

≅ 1/2 · 1/2 = 1/4

c1

c2

26 of 31

Cardinality Estimation (Ranges)

Schema: Weather(date, pressure). Assume a uniform distribution.

σpressure < p

σpressure > p

σp < pressure < q

X ≅ (p min) / (max - min)

X ≅ (max − p) / (max - min)

X ≅ (q p) / (max - min)

min(pressure)

max(pressure)

p

q

27 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

c1 = c2

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4

c1

c2

28 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

c1 = c2

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4 (didn’t change)

c1

c2

29 of 31

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

c1 = c2

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/5 (decreased)

c1

c2

30 of 31

Summary

31 of 31

Worksheet