1 of 52

CSE 344: Section 9

Join Algorithms, B+ Trees, Indexing

November 21st, 2024

2 of 52

Announcements

Announcements

  • Homework 6 M2:
    • Due at 11:00pm on Wednesday, November 27th (late days allowed)
    • No guarantee Ed will be checked Wednesday night or Thursday (Thanksgiving)
    • Autograder might run slow in the last couple hours, try to submit early
  • Questions?

3 of 52

Join Algorithms

4 of 52

Nested Loop Join (⋈)

for each tuple t1 in R do

for each tuple t2 in S do

if t1 and t2 join then output (t1,t2)

5 of 52

Nested Loop Join (⋈)

for each block bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

6 of 52

Nested Loop Join (⋈)

for each group of M blocks bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

7 of 52

8 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

9 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

Which table is R?

10 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

Which table is R?

In this case, R is table with PK. Why?

11 of 52

Hash Join (⋈)

What if |FK table| << |PK table| ?

R is better as FK table

12 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

for x in find(y.s)

output(x, y)

What if |FK table| << |PK table| ?

R is better as FK table

13 of 52

Sort-Merge Join (⋈)

sort(R); sort(S); // Better if sorted already!

x = R.first()

y = S.first()

while y != NULL do:

case:

x.r < y.s: x.next();

x.r = y.s: output(x, y); y.next();

x.r > y.s: y.next();

Note: R is table with PK. Why?

14 of 52

B+ Trees

15 of 52

B+ Trees

Goal: Reduce + optimize disk accesses

  • Each node (except root) has d <= m <= 2d keys and m + 1 pointers, ~1 page

Keys k < 30

Keys 30<=k<120

Keys 120<=k<240

Keys 240<=k

Left pointer of k:

to keys < k

Right pointer of k:

to keys >= k

30

120

240

16 of 52

B+ Trees

Goal: Reduce + optimize disk accesses

  • Each node (except root) has d <= m <= 2d keys and m + 1 pointers, ~1 page
  • Each leaf has d <= m <= 2d keys, pointing to data records

40

50

60

70

40

50

60

Next leaf

Data records

70

17 of 52

Worksheet Problem 1: Insert

18 of 52

20

(Root)

INSERT 20

20

19 of 52

20

40

(Root)

INSERT 40

40

20

20 of 52

20

40

72

(Root)

INSERT 72

40

20

72

21 of 52

20

25

40

72

(Root)

INSERT 25

25

20

40

72

22 of 52

(Root)

INSERT 63

20

25

40

63

72

25

20

40

63

72

23 of 52

(Root)

20

25

40

63

72

Violates m <= 4, split!

25

20

40

63

72

24 of 52

40

(Root)

20

25

40

63

72

20

25

40

63

72

25 of 52

40

(Root)

INSERT 78

20

25

40

63

72

78

20

25

40

63

72

78

26 of 52

40

(Root)

INSERT 2

2

20

25

40

63

72

78

2

20

40

63

72

78

25

27 of 52

40

65

(Root)

INSERT 65

2

20

25

40

63

2

20

40

63

25

65

72

78

72

78

65

28 of 52

40

65

(Root)

INSERT 86

2

20

25

40

63

2

20

40

63

25

65

72

78

86

72

78

65

86

29 of 52

40

65

78

(Root)

INSERT 92

2

20

40

63

25

72

78

65

86

2

20

25

40

63

65

72

78

86

92

92

30 of 52

40

65

78

(Root)

INSERT 95

2

20

40

63

25

72

78

65

86

2

20

25

40

63

65

72

78

86

92

95

92

95

31 of 52

40

65

78

92

(Root)

INSERT 97

2

20

40

63

25

72

78

65

86

2

20

25

40

63

65

72

78

86

92

95

92

95

97

97

32 of 52

40

65

78

92

(Root)

INSERT 99

2

20

40

63

25

72

78

65

86

2

20

25

40

63

65

72

78

86

92

95

92

95

97

99

97

99

33 of 52

40

65

(New root)

INSERT 98

78

86

78

86

92

95

92

95

92

97

78

97

98

99

...

97

98

99

(split old root into two nodes)

34 of 52

Worksheet Problem 2: Delete

35 of 52

40

65

DELETE 92

78

86

78

86

92

95

92

95

92

97

78

97

98

99

97

98

99

...

36 of 52

40

65

78

86

78

86

95

95

95

97

78

97

98

99

97

98

99

Violates m >= 2, check left and right to steal

...

37 of 52

40

65

78

86

78

86

95

95

97

95

98

78

98

99

97

98

99

...

38 of 52

40

65

78

86

78

86

95

95

97

95

98

78

98

99

97

98

99

DELETE 95

...

39 of 52

40

65

78

86

78

86

97

97

97

98

78

98

99

98

99

Violates m >= 2, check left and right to steal

...

40 of 52

40

65

78

86

78

86

97

97

97

98

78

98

99

98

99

Nothing to steal, so merge!

...

41 of 52

40

65

78

86

78

86

97

97

98

78

98

99

98

99

2

20

40

63

25

72

65

2

20

25

40

63

65

72

Violates m >= 2, check left and right to steal

42 of 52

40

65

78

86

78

86

97

97

98

78

98

99

98

99

2

20

40

63

25

72

65

2

20

25

40

63

65

72

Nothing to steal, so merge!

43 of 52

40

65

78

98

(Root)

2

20

40

63

25

72

78

65

86

2

20

25

40

63

65

72

78

86

97

98

99

98

99

97

44 of 52

Indexing

45 of 52

Definitions

  • Index - data structure that organizes data records on disk to optimize selections on the search key fields for the index
  • An index contains a collection of data entries, and supports efficient retrieval of all data entries with the given search key value k
  • A data entry is of the format (key = k, value = pointer to records)

14

15

16

14

tuple1

15

tuple2

16

tuple3

Data file sorted on age

Index File

Index on age

46 of 52

Why Indexing?

Consider you have a deck of 52 cards. I asked you to pick out the 8 of hearts. Calculate the average number of flips required if:

  1. The deck is randomly shuffled.
  2. The deck is separated into four piles, each representing a suit (spades, diamonds, clubs, hearts). Each pile is kept randomly face down.

47 of 52

Problem 1

To find the 8 of hearts from a randomly shuffled cards, on average I would require 26 flips through the deck, about half the deck.

48 of 52

Problem 2

  • There are 4 piles, one for each suit.
  • Disregard the other 3 piles, i.e. clubs, spades, diamonds.
  • In doing so we would require 2 flips, on average, to find the heart pile.
  • We're left with 13 cards in the heart pile to scan, on average we would require 7 flips to find the 8 of hearts.

49 of 52

Answer

Problem 1:

  • Scan 52 Cards (+26)

Average Flips = 26

Problem 2:

  • Pick suit (+2)
  • Scan 13 cards (+7)

Average Flips = 9

50 of 52

Example

51 of 52

Types of Indexes

  • Clustered
    • Main data file is sorted on the attribute
    • Useful for point selections and range queries
    • You can only create one per table
  • Unclustered
    • Main data file is NOT sorted on the attribute
    • Useful for point selections, not so great for range queries
    • You can have multiple for one table

52 of 52

Definitions

  • An index is either clustered or unclustered, depending on how the actual data is sorted on disk
  • Clustered index - one that has the same key ordering as what is on disk (one per table)
  • Unclustered index - may exist without any ordering on disk (any number per table)