1 of 32

Data Engineering

Performance Tuning: Index Selection


2 of 32

The Declarative Contract

  • So far, we’ve been imagining a declarative contract between the data system and the user, where the user specifies:
    • The relations, along with any constraints (PK, FK)
      • But not how, exactly, they are stored, e.g., on a disk drive
      • But not how these constraints are enforced
    • What they want the output of a query to look like
      • But not how it is computed
    • What they want the modified relations to look like
      • But not how exactly the modification takes place

  • The data system is then expected to hold up its end of the bargain by:
    • Figuring out how to store data, process it, enforce constraints, etc.
    • And to do so in the “best” way it knows how.


3 of 32

Why Cede Control to the System?

Many good reasons to do so:

  • Simpler for end-users
    • Just specify what you want, don’t worry about how to do it
    • Most specifications are therefore quite compact
  • Hard for (most) users to reason about how to best execute a query
    • Exponential # of ways to execute queries!
  • System has more fine-grained awareness of the data and its nuances
    • Includes data sizes, statistics, understanding of distributions, correlations, etc.
    • This can influence the “best” execution strategy


4 of 32

Why Should Users Worry about Performance I

One could argue that we (as users) should leave performance aspects up to the system and not worry about them.

Several reasons to not do so:

Reason 1: revisiting our specification

  • There are cases when even the “best” execution strategy from the data system is poor (slow)
  • We might want to understand how to modify our query into another (possibly non-equivalent) one that is faster to execute
    • E.g., rather than performing an analysis on all the stops in the Bay Area, start by doing it on Berkeley alone
  • An understanding of what leads to slowness is therefore important
    • So that we (users) can rephrase our specification in less expensive ways


5 of 32

Why Should Users Worry about Performance II

Reason 2: contract breakdown

  • Despite many decades of engineering, data systems are not perfect
    • Best to not be at the mercy of an imperfect system
  • There are scenarios where data systems struggle to find good ways to execute queries
    • The space of query execution strategies is sometimes too large, so data systems cut corners by not examining all possible ones
    • The estimates for the costs (latencies) for various strategies may be incorrect, for various reasons: insufficiently fine-grained distributional information, lack of understanding of correlations and data skew, etc.
    • Users also oftentimes have a better idea of some data characteristics and future workload than the system
  • In such cases, users can steer the system towards “better” execution strategies


6 of 32

Why Should Users Worry about Performance III

Reason 3: many heavyweight levers still under the control of users

  • There are still many performance-impacting levers that users control in data systems, incl.:
    • What the relations and materialized views should look like
    • Whether certain efficient data access structures (known as indexes) should be generated
    • The amount of resources employed, incl. memory footprint, disk space, # of machines, …

  • Many of these decisions have repercussions beyond efficiency of execution of the current query, including:
    • Ease of use
    • Monetary considerations
    • Anticipated future needs and workload


7 of 32

OK, Let’s Worry about Performance!

  • Fortunately, there are many things that we can do to improve performance
  • But this requires us to understand a bit more about data system internals and how we can “tune” them as needed to achieve what we want
    • We’re now opening Pandora’s box!

  • Let’s start by talking about a “mental model” for what a data system does

  • Disclaimer: will be massively hand-wavy! If you’d like to learn more about performance, I encourage you to take CS186!


8 of 32

A Mental Model for a Data System

  • Data is laid out on blocks/pages on stable storage (SSD or HDD)
    • Each block may have 100s of tuples
  • Data is retrieved a block at a time for reading/writing by the data system
  • Blocks are stored in frames within a main-memory buffer
  • Written back to stable storage as needed

[Figure: the data system’s controller moves blocks between the data on stable storage and a main-memory buffer.]
9 of 32

A Mental Model for a Data System

  • Usually, computation on data in main-memory is cheap relative to I/O
  • So main cost in processing SQL queries is the cost of I/O
    • i.e., fetching the blocks into the buffer and writing them out
  • Also, reading/writing data blocks sequentially is typically cheaper than random reads and writes
    • sequential access lets the system prefetch and avoid costly seeks


10 of 32

Bottomline: What Impacts Performance?

  • Rule 1: Read less data, in sequence, if possible
    • The more data you have to read/write, the slower it is
      • In some cases, you have to read the same data multiple times, which is slower than reading it just once
    • But: random accesses are slower than sequential ones

  • Rule 2…: Later!

11 of 32

Thought Experiment

  • Say I have 10M stops in my Stops relation across all cities in the US
  • And I want to find a specific stop
    • SELECT * FROM Stops WHERE id = 256
  • The relation is laid out across many pages on disk
  • Q: How would we go about doing this?


12 of 32

Thought Experiment

  • Say I have 10M stops in my Stops relation across all cities in the US
  • And I want to find a specific stop
    • SELECT * FROM Stops WHERE id = 256
  • The relation is laid out across many pages on disk
  • Q: How would we go about doing this?
    • Data is unordered by default, so we don’t know where id = 256 is
    • So, we will need to look through all of the data pages
      • An operation known as a sequential scan
    • But we could be more efficient with a lookup structure that tells us which page the tuple with id = 256 is on (see the sketch below)
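
A minimal PostgreSQL sketch of the difference (the index name is hypothetical, and the EXPLAIN output is paraphrased):

EXPLAIN SELECT * FROM Stops WHERE id = 256;
-- -> Seq Scan on stops ...             (every data page is read)

CREATE INDEX stopsIdIndex ON Stops (id);

EXPLAIN SELECT * FROM Stops WHERE id = 256;
-- -> Index Scan using stopsidindex ... (only a few pages are read)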


14 of 32

Indexes

  • Indexes help you look for data matching certain characteristics
    • Much like an index in a book! Or a library!
  • An index on an attribute, say age, allows you to find tuples with a specific value for age in Stops
  • Indexes are essentially data structures storing “index key-location” pairs
    • Index key = values of age
    • Location = where (which block, start position within the block) tuples with that age value can be found
  • This may be useful, for example, to find tuples with (SQL sketch below):
    • age = 15
    • age <= 30
    • 15 <= age <= 45
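
In SQL, these lookups might look as follows; a sketch assuming a Stops table with an age column (index creation syntax is covered formally later in these slides):

CREATE INDEX ageIndex ON Stops (age);

SELECT * FROM Stops WHERE age = 15;               -- value lookup
SELECT * FROM Stops WHERE age <= 30;              -- one-sided range
SELECT * FROM Stops WHERE age BETWEEN 15 AND 45;  -- two-sided range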


15 of 32

From CS61A or your intro programming class… a Binary Search Tree

[Figure: a binary search tree over the values 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, rooted at 7.]

16 of 32

Binary Search Trees for Data Systems: B+ Trees

  • The most popular type of index in data systems is the B+ tree
  • Essentially applying & extending the binary search tree ideas

[Figure: a B+ tree. The root holds key 80; internal nodes hold (20, 60) and (100, 120, 140); the leaf level holds the sorted keys 10, 15, 18, 20, 30, 40, 50, 60, 65, 80, 85, 90, with pointers to records. Example lookups: Find A = 30 and Find A = [30, 63].]

17 of 32

Generalizing Binary Search Trees for Data Systems: B+ Trees

  • Unlike binary search trees, B+trees have a high fanout
    • Each “node” is a block (remember our mental model!)
    • If the fanout “b” is ~1000, then with four levels:
      • b^4 = 10^12: one trillion tuples!
  • B+Trees are self-balancing
    • Uniform height throughout
    • Maintains a minimum fill factor (usually half full) by splitting nodes and percolating changes upward to maintain balance
    • We won’t worry about those algorithms in this class: see CS186!


18 of 32

Generalizing Binary Search Trees for Data Systems: B+ Trees

  • Logarithmic complexity
    • For searching (as we saw)
    • For insert, delete, update (not covered)
  • In practice, the tree is shallow: the depth (= number of nodes/blocks retrieved per lookup) is small; see the worked example after this list
    • High fanout ensures that
    • Typical trees: short & wide

  • Can handle both value lookups (id = 20) as well as range (20 <= id <= 50) lookups
    • For the latter, can follow “leaf” pointers once you have identified the start of the range
  • Works well with updates
  • In PostgreSQL this is the default index type
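
Plugging in the numbers from earlier slides (fanout b = 1000, N = 10M tuples):

height ≈ ⌈log_b N⌉ = ⌈log_1000 10^7⌉ = ⌈7/3⌉ = 3

So a point lookup reads about 3 index blocks plus 1 data block, versus the 100,000 blocks a sequential scan would read.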


19 of 32

When Do Indexes Apply?

Generally beneficial when your relations are large and when sequential scans are too time-consuming

SELECT * FROM Stops WHERE age = 15 AND location = 'West Oakland';

Suppose the relation has 10M tuples, spread across 100,000 blocks

Let’s say there are about 1000 tuples that satisfy the first predicate and 100 that satisfy both predicates

  • No-index approach: sequential scan of all 100,000 blocks of Stops, checking for each tuple whether the two predicates are satisfied
  • With an index on age, look over 1000 tuples, spread across 1000 blocks in the worst case, checking for each tuple whether the second predicate is satisfied
  • With a composite index on (age, location), look over 100 tuples, spread over 100 blocks in the worst case


20 of 32

When Do Indexes Apply?

SELECT * FROM Stops WHERE age = 15 AND location = 'West Oakland';

Suppose the relation has 10M tuples, spread across 100,000 blocks

Let’s say there are about 1000 tuples that satisfy the first predicate and 100 that satisfy both predicates

Here, the # of blocks examined using either index is relatively small (1000, or 100, vs. 100,000), so an index may well be valuable

    • Recall that random access is more expensive than sequential access
    • And index-based accesses are typically (with some exceptions that we’ll cover later) random
    • But if 1000 grows to (say) 10,000 or greater, it is possible that the benefits of an index may be outweighed by the penalty for random accesses
    • More on this later


21 of 32

Also Useful for Joins!

SELECT * FROM Stops, Zips

WHERE Stops.location = Zips.location

  • Say Stops has 10M tuples and Zips has 1M tuples
  • Q: If you have no indexes, need to consider how many pairs of tuples?
    • 10M x 1M = 10^13
  • Q: How would you use indexes here, at a high level?
    • For each Stops tuple, look up the corresponding location in Zips (or vice versa)
    • For the former, “only” 10M lookups in the index
    • Note: if I had an index on Zips.zipcode and not Zips.location, it wouldn’t help
    • There are other efficient join approaches that don’t use indexes; discussed later (sketch below)
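
A sketch of how to enable this in PostgreSQL (index name hypothetical; EXPLAIN lets you confirm the plan the system picks):

CREATE INDEX zipsLocationIndex ON Zips (location);

-- The system can now probe the index once per Stops tuple
-- (~10M lookups) instead of comparing 10^13 pairs:
EXPLAIN SELECT * FROM Stops, Zips WHERE Stops.location = Zips.location;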


22 of 32

Declaring Indexes

CREATE INDEX ageIndex ON Stops (age);

  • Allows for lookups based on age

CREATE INDEX ageLocationIndex ON Stops (age, location);

  • Imagine this to be an “index key-value” pair data structure based on concatenating attribute values
  • Allows for lookups based on pairs of age and location
  • Or just for age
  • Q: Why not location only? The keys are sorted by age first, so a lookup on location alone cannot navigate the index (see the sketch below)
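
A sketch of which predicates this composite index can serve (PostgreSQL; comments paraphrase typical planner behavior):

SELECT * FROM Stops WHERE age = 15 AND location = 'West Oakland';  -- served
SELECT * FROM Stops WHERE age = 15;                                -- served (leading-column prefix)
SELECT * FROM Stops WHERE location = 'West Oakland';               -- generally not served efficiently (no age prefix)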

DROP INDEX ageLocationIndex;


23 of 32

Demo

-- Point lookups before any index on id: watch for a sequential scan
select * from actor where id = 23456;
explain analyze select * from actor where id = 23456;

-- Add a B+ tree index on id: point and range lookups can now use it
create index idIndex on actor(id);
explain analyze select * from actor where id = 23456;
explain analyze select * from actor where id >= 23456;
explain analyze select * from actor where id >= 4000000;
explain analyze select * from actor where id >= 23456 AND id < 23457;
explain analyze select * from actor where id < 23457;
drop index idIndex;

-- Composite indexes: the order of attributes matters
explain analyze select * from actor where name = 'Hanks, Tom';
create index nameIdIndex on actor(name,id);
explain analyze select * from actor where name = 'Hanks, Tom';
explain analyze select * from actor where id < 23457;  -- id is not the leading attribute: does it help?
create index idNameIndex on actor(id,name);
explain analyze select * from actor where id < 23457;  -- now id is the leading attribute

24 of 32

Indexes Seem Magical…

  • Q: Why not automatically add indexes for all attributes & combinations of attributes? What are the downsides of indexes?
  • A: the maintenance costs during inserts, deletes, and updates

  • If an index is not maintained, it may lead to inaccurate query results
  • e.g., if a new tuple is inserted but not reflected in the index, it won’t be returned by queries that use the index (illustration below)
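
A hypothetical illustration of the maintenance cost (column names from earlier slides; assuming ageIndex and ageLocationIndex both exist):

INSERT INTO Stops (id, age, location) VALUES (10000001, 22, 'Berkeley');
-- One insert writes the table's block AND updates ageIndex AND ageLocationIndex;
-- each additional index adds work to every insert, delete, and update.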


25 of 32

OK, Going Back to Indexes…

  • Which indexes should we create?
  • Answer: it depends!

  • Many databases automatically create indexes for PRIMARY KEY or UNIQUE attributes. Several reasons:
    • It also helps enforce the constraints! Updating the index can serve as a check for the constraint
    • Many queries look up based on these attributes
      • SELECT * FROM Actor, Cast_info WHERE Actor.id = Cast_info.person_id
    • It is a high-cardinality attribute
      • That is, the number of distinct values for this attribute is high
      • In fact, the cardinality here is as high as the number of tuples
      • So an index on this attribute is especially beneficial (sketch below)… more on cardinality next
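
A minimal sketch of the PostgreSQL behavior, assuming a simple Actor table:

CREATE TABLE Actor (
    id   INTEGER PRIMARY KEY,  -- PostgreSQL automatically creates a unique B+ tree index on id
    name TEXT
);
-- Roughly the same lookup support as:
--   CREATE UNIQUE INDEX actor_pkey ON Actor (id);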


26 of 32

Why does cardinality matter?

  • Databases retrieve data as “blocks”
  • Say you are indexing based on attribute A, with 100 distinct values. You have 10,000 tuples. Each block contains 100 tuples. So we have 100 blocks.

  • Assuming that the tuples with a given value of A (A = a) are randomly distributed, every block, on average, has one tuple with A=a.
  • Thus, we can’t avoid looking at all blocks => the index doesn’t help.
  • Q: What if we had 1000 distinct values instead of 100? How many blocks would we have to look at then for A = a?
  • A: 10 blocks, so there may be some savings (arithmetic spelled out below).
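
Spelling out the arithmetic:

100 distinct values: 10,000 / 100 = 100 matching tuples per value; 100 tuples spread randomly over 100 blocks ≈ 1 per block, so essentially all 100 blocks are read

1,000 distinct values: 10,000 / 1,000 = 10 matching tuples per value; at most 10 of the 100 blocks are read, avoiding ~90% of the I/O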


27 of 32

Why does cardinality matter? II

  • As mentioned previously, sequential access is much faster than random access:
    • Depending on the characteristics of the storage media, reading 1 in 10 blocks may not be enough to justify the use of an index
      • Might be as fast to just scan
    • But reading 1 in 100 or 1 in 1000 may be enough [again, it depends]

  • So, indexes on low cardinality attributes are not a great idea
  • Exception: if data is clustered on that attribute
    • i.e., sorted based on that attribute and stored in that order on disk
    • If we knew that all values of A = a are together (as opposed to randomly shuffled), we can simply read one block as opposed to 100 blocks
    • Can force PostgreSQL to cluster your data based on an index (sketch below)
      • CLUSTER Stops USING ageIndex;
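
The full PostgreSQL sequence might look as follows (note that CLUSTER is a one-time physical reordering: PostgreSQL does not keep the table sorted as new tuples arrive, so it must be re-run periodically):

CREATE INDEX ageIndex ON Stops (age);
CLUSTER Stops USING ageIndex;   -- rewrite the table, sorted by age
ANALYZE Stops;                  -- refresh planner statistics after the rewrite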


28 of 32

How Do We Decide?

  • Good on attributes used often in WHERE clauses
  • Good on key attributes (sometimes done by default)
  • Good on attributes that are “clustered”
  • Not very good on attributes where there are many tuples per value of the attribute
  • Not very good on tables that are modified more often than queried


29 of 32

Other Types of Indexes: Hash Indexes

  • Like hash-based dictionaries or maps in programming languages
    • A hash function applied to a value a, i.e., h(a), yields the location where pointers to tuples with A = a are found
  • Main challenge: how to “extend” the hash table as the number of distinct values of A grows
    • See linear or extendible hashing

[Figure: a hash index. Hashing a key value, e.g., h(A1) = 30, h(A2) = 20, identifies the bucket holding pointers into the blocks of the data file.]

30 of 32

Types of Indexes: Hash Indexes

  • Constant expected lookup complexity: O(1)
  • Can only handle value lookups, not range lookups
    • Why? Hashing scatters nearby key values across buckets, so the index preserves no order
  • In PostgreSQL done as follows (concrete sketch below):
    • CREATE INDEX <name> ON <table> USING hash (column);
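
A concrete sketch (index name hypothetical):

CREATE INDEX locationHashIndex ON Stops USING hash (location);

SELECT * FROM Stops WHERE location = 'West Oakland';  -- can use the hash index
SELECT * FROM Stops WHERE location > 'M';             -- cannot: a hash index preserves no order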


31 of 32

Other Types of Indexes

  • Inverted Indexes: valuable for keyword search (more later, hopefully!)
    • In PostgreSQL, known as GIN indexes
  • R-Tree: valuable for range-based lookups in k-dimensions
    • e.g., searching for items in rectangular range in a map
    • In PostgreSQL, a special case of a GiST index
  • Lots of other types of indexes!
    • Partial indexes
    • Indexes on expressions


32 of 32

Indexes Summary

  • Indexes are great!
  • They speed up certain types of queries
  • But we should use them carefully
    • Only on high-cardinality (“selective”) attributes like keys
    • Or on frequently queried but rarely updated attributes
  • Lots of different types of indexes: but B+trees are most popular
