Data Engineering
Performance Tuning: Index Selection
1
The Declarative Contract
2
The Declarative Contract
Why Cede Control to the System?
Many good reasons to do so:
The Declarative Contract
Why Should Users Worry about Performance I
One could argue that we (as users) should leave performance aspects up to the system, and not worry about it.
Several reasons to not do so:
Reason I: revisiting our specification
The Declarative Contract
Why Should Users Worry about Performance II
Reason 2: contract breakdown
The Declarative Contract
Why Should Users Worry about Performance III
Reason 3: many heavyweight levers still under the control of users
The Declarative Contract
OK, Let’s Worry about Performance!
7
A Mental Model for a Data System
8
… Blocks …
Data System
Buffer
Controller
Data
A Mental Model for a Data System
9
… Blocks …
Data System
Buffer
Controller
Data
Bottomline: What Impacts Performance?
Thought Experiment
… Blocks …
Data
Thought Experiment
13
Indexes
key-location” pairs
14
From CS61A or your intro programming class… a Binary Search Tree
7
4
5
2
1
10
8
13
11
14
Binary Search Trees for Data Systems: B+ Trees
16
80 | | | | ||||
| | | | | |||
20 | 60 | | | ||||
| | | | | |||
100 | 120 | 140 | | ||||
| | | | | |||
10 | 15 | 18 | | ||||
| | | | | |||
20 | 30 | 40 | 50 | ||||
| | | | | |||
60 | 65 | | | ||||
| | | | | |||
80 | 85 | 90 | | ||||
| | | | | |||
10
15
18
20
30
40
50
60
65
80
85
90
Find A = 30
Find A = [30,63]
Pointers to records
Nodes
Generalizing Binary Search Trees for Data Systems: B+ Trees
17
Generalizing Binary Search Trees for Data Systems: B+ Trees
18
When Do Indexes Apply?
Generally beneficial when your relations are large and when sequential scans are too time consuming
SELECT * FROM Stops WHERE age = 15 AND location = ‘West Oakland’;
Suppose the relation has 10M tuples, spread across 100,000 blocks
Let’s say there are about 1000 tuples that satisfy the first predicate and 100 that satisfy both predicates
19
When Do Indexes Apply?
SELECT * FROM Stops WHERE age = 15 AND location = ‘West Oakland’;
Suppose the relation has 10M tuples, spread across 100,000 blocks
Let’s say there are about 1000 tuples that satisfy the first predicate and 100 that satisfy both predicates
Here, the # of blocks examined using either index is relatively small: 1000 (or 100) vs 100,000 that the index may be valuable
20
Also Useful for Joins!
SELECT * FROM Stops, Zips
WHERE Stops.location = Zips.location
21
Declaring Indexes
CREATE INDEX ageIndex ON Stops (age);
CREATE INDEX ageLocationIndex ON Stops (age, location);
DROP INDEX ageLocationIndex;
22
Demo
Indexes Seem Magical…
24
OK, Going Back to Indexes…
25
Why does cardinality matter?
26
… Blocks …
Why does cardinality matter? II
27
How Do We Decide?
28
Other Types of Indexes: Hash Indexes
29
10 | |
20 | |
30 | |
40 | |
50 | |
60 | |
70 | |
80 | |
A1 | |
A2 | |
A3 | |
A4 | |
A5 | |
A6 | |
A7 | |
A8 | |
BLOCK
DATA FILE
HASH INDEX
h(A1) = 30; h(A2) = 20; …
Types of Indexes: Hash Indexes
30
10 | |
20 | |
30 | |
40 | |
50 | |
60 | |
70 | |
80 | |
A1 | |
A2 | |
A3 | |
A4 | |
A5 | |
A6 | |
A7 | |
A8 | |
BLOCK
DATA FILE
HASH INDEX
h(A1) = 30; h(A2) = 20; …
Other Types of Indexes
31
Indexes Summary
32