1 of 52

CSE 344: Section 9

Join Algorithms, B+ Trees, Indexing

November 21st, 2024

2 of 52

Announcements

Homework 6 M2:

Due at 11:00pm on Wednesday, November 27th (late days allowed)
No guarantee Ed will be checked Wednesday night or Thursday (Thanksgiving)
Autograder might run slow in the last couple hours, try to submit early

Questions?

3 of 52

Join Algorithms

4 of 52

Nested Loop Join (⋈)

for each tuple t1 in R do

for each tuple t2 in S do

if t1 and t2 join then output (t1,t2)

5 of 52

Nested Loop Join (⋈)

for each block bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

6 of 52

Nested Loop Join (⋈)

for each group of M blocks bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

7 of 52

8 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

9 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

Which table is R?

10 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

x = find(y.s)

output(x, y)

Which table is R?

In this case, R is table with PK. Why?

11 of 52

Hash Join (⋈)

What if |FK table| << |PK table| ?

R is better as FK table

12 of 52

Hash Join (⋈)

for tuple x in R:

insert(x.r, x)

for tuple y in S:

for x in find(y.s)

output(x, y)

What if |FK table| << |PK table| ?

R is better as FK table

13 of 52

Sort-Merge Join (⋈)

sort(R); sort(S); // Better if sorted already!

x = R.first()

y = S.first()

while y != NULL do:

case:

x.r < y.s: x.next();

x.r = y.s: output(x, y); y.next();

x.r > y.s: y.next();

Note: R is table with PK. Why?

14 of 52

B+ Trees

15 of 52

B+ Trees

Goal: Reduce + optimize disk accesses

Each node (except root) has d <= m <= 2d keys and m + 1 pointers, ~1 page

Keys k < 30

Keys 30<=k<120

Keys 120<=k<240

Keys 240<=k

Left pointer of k:

to keys < k

Right pointer of k:

to keys >= k

30		120		240

16 of 52

B+ Trees

Goal: Reduce + optimize disk accesses

Each node (except root) has d <= m <= 2d keys and m + 1 pointers, ~1 page
Each leaf has d <= m <= 2d keys, pointing to data records

40		50		60		70

40

50

60

Next leaf

Data records

70

17 of 52

Worksheet Problem 1: Insert

18 of 52

20

(Root)

INSERT 20

20

19 of 52

20		40

(Root)

INSERT 40

40

20

20 of 52

20		40		72

(Root)

INSERT 72

40

20

72

21 of 52

20		25		40		72

(Root)

INSERT 25

25

20

40

72

22 of 52

(Root)

INSERT 63

20		25		40		63		72

25

20

40

63

72

23 of 52

(Root)

20		25		40		63		72

Violates m <= 4, split!

25

20

40

63

72

24 of 52

40

(Root)

20		25

40		63		72

20

25

40

63

72

25 of 52

40

(Root)

INSERT 78

20		25

40		63		72		78

20

25

40

63

72

78

26 of 52

40

(Root)

INSERT 2

2		20		25

40		63		72		78

2

20

40

63

72

78

25

27 of 52

40		65

(Root)

INSERT 65

2		20		25

40		63

2

20

40

63

25

65		72		78

72

78

65

28 of 52

40		65

(Root)

INSERT 86

2		20		25

40		63

2

20

40

63

25

65		72		78		86

72

78

65

86

29 of 52

40		65		78

(Root)

INSERT 92

2

20

40

63

25

72

78

65

86

2		20		25

40		63

65		72

78		86		92

92

30 of 52

40		65		78

(Root)

INSERT 95

2

20

40

63

25

72

78

65

86

2		20		25

40		63

65		72

78		86		92		95

92

95

31 of 52

40		65		78		92

(Root)

INSERT 97

2

20

40

63

25

72

78

65

86

2		20		25

40		63

65		72

78		86

92

95

92		95		97

97

32 of 52

40		65		78		92

(Root)

INSERT 99

2

20

40

63

25

72

78

65

86

2		20		25

40		63

65		72

78		86

92

95

92		95		97		99

97

99

33 of 52

40		65

(New root)

INSERT 98

78

86

78		86

92

95

92		95

92		97

78

97		98		99

...

97

98

99

(split old root into two nodes)

34 of 52

Worksheet Problem 2: Delete

35 of 52

40		65

DELETE 92

78

86

78		86

92

95

92		95

92		97

78

97		98		99

97

98

99

...

36 of 52

40		65

78

86

78		86

95

95

95		97

78

97		98		99

97

98

99

Violates m >= 2, check left and right to steal

...

37 of 52

40		65

78

86

78		86

95

95		97

95		98

78

98		99

97

98

99

...

38 of 52

40		65

78

86

78		86

95

95		97

95		98

78

98		99

97

98

99

DELETE 95

...

39 of 52

40		65

78

86

78		86

97

97

97		98

78

98		99

98

99

Violates m >= 2, check left and right to steal

...

40 of 52

40		65

78

86

78		86

97

97

97		98

78

98		99

98

99

Nothing to steal, so merge!

...

41 of 52

40		65

78

86

78		86		97

97

98

78

98		99

98

99

2

20

40

63

25

72

65

2		20		25

40		63

65		72

Violates m >= 2, check left and right to steal

42 of 52

40		65

78

86

78		86		97

97

98

78

98		99

98

99

2

20

40

63

25

72

65

2		20		25

40		63

65		72

Nothing to steal, so merge!

43 of 52

40		65		78		98

(Root)

2

20

40

63

25

72

78

65

86

2		20		25

40		63

65		72

78		86		97

98

99

98		99

97

44 of 52

Indexing

45 of 52

Definitions

Index - data structure that organizes data records on disk to optimize selections on the search key fields for the index
An index contains a collection of data entries, and supports efficient retrieval of all data entries with the given search key value k
A data entry is of the format (key = k, value = pointer to records)

14
15
16

14	tuple1

15	tuple2

16	tuple3

Data file sorted on age

Index File

Index on age

46 of 52

Why Indexing?

Consider you have a deck of 52 cards. I asked you to pick out the 8 of hearts. Calculate the average number of flips required if:

The deck is randomly shuffled.
The deck is separated into four piles, each representing a suit (spades, diamonds, clubs, hearts). Each pile is kept randomly face down.

47 of 52

Problem 1

To find the 8 of hearts from a randomly shuffled cards, on average I would require 26 flips through the deck, about half the deck.

48 of 52

Problem 2

There are 4 piles, one for each suit.
Disregard the other 3 piles, i.e. clubs, spades, diamonds.
In doing so we would require 2 flips, on average, to find the heart pile.
We're left with 13 cards in the heart pile to scan, on average we would require 7 flips to find the 8 of hearts.

49 of 52

Answer

Problem 1:

Scan 52 Cards (+26)

Average Flips = 26

Problem 2:

Pick suit (+2)
Scan 13 cards (+7)

Average Flips = 9

50 of 52

Example

51 of 52

Types of Indexes

Clustered

Main data file is sorted on the attribute
Useful for point selections and range queries
You can only create one per table

Unclustered

Main data file is NOT sorted on the attribute
Useful for point selections, not so great for range queries
You can have multiple for one table

52 of 52

Definitions

An index is either clustered or unclustered, depending on how the actual data is sorted on disk
Clustered index - one that has the same key ordering as what is on disk (one per table)
Unclustered index - may exist without any ordering on disk (any number per table)