
Performance Tuning

February 12, 2024


Data 101/Info 258, Spring 2024 @ UC Berkeley

Aditya Parameswaran https://data101.org/sp24

LECTURE 07


Why worry about performance?



Beyond the Declarative Contract

So far, we’ve been imagining a declarative contract between the data system and the user, where the user specifies:

  • The relations (schema+instance) and any constraints (PK, FK, attributes, etc.)

  • What the output of a query looks like

  • What the modified relation should look like

…but not how to store data (disk? memory?), nor how to algorithmically enforce constraints

…but not how to compute the query result

…but not how the modification takes place

The data system is then expected to hold up its end of the bargain:

  • Figure out how to store data, process it, enforce constraints, etc.
  • And to do so in the “best” way it knows how.
  • (Example so far: How to process an UPDATE/DELETE reliably? Lecture 06)


Justifying the declarative contract

Why cede control to the system? Here are several reasons:

  • Simpler for end-users
    • Just specify what you want, don’t worry about how to do it.
    • Most specifications are therefore quite compact!
  • Hard for (most) users to reason about how to best execute a query
    • Exponential # of ways to execute queries!
  • The system has more fine-grained awareness of the data and its nuances.
    • Includes data sizes, statistics, understanding of distributions, correlations, etc.
    • This can influence the “best” execution strategy.


[Figure: the declarative contract between user and system]

Whether you are a data engineer or a data scientist, you need to understand the underlying system and database performance!


Why worry about performance?

Reason 1: Our specification (e.g., SQL commands) can and should be revisited.

  • Sometimes, even the “best” execution strategy from the data system is poor (slow).
  • We could modify our query into another (possibly non-equivalent) one that executes faster.
    • Example: instead of analyzing all movies ever, start with those in 2023.
  • Users need to understand what leads to slowness in order to rephrase the specification in less expensive ways.
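As a concrete sketch of such a rephrasing, the snippet below uses an in-memory SQLite table as a hypothetical stand-in for the Titles relation (schema and rows are illustrative, not the course dataset). The narrowed query is not equivalent to the original, but it touches far fewer rows:

```python
import sqlite3

# Toy stand-in for the Titles relation; schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Titles (title_id TEXT PRIMARY KEY, "
    "primary_title TEXT, premiered INTEGER)"
)
conn.executemany(
    "INSERT INTO Titles VALUES (?, ?, ?)",
    [("tt0000001", "Old Classic", 1950),
     ("tt0000002", "Recent Drama", 2023),
     ("tt0000003", "Recent Comedy", 2023)],
)

# Original specification: analyze all movies ever.
all_rows = conn.execute("SELECT * FROM Titles").fetchall()

# Rephrased (non-equivalent!) specification: start with 2023 titles only.
recent_rows = conn.execute(
    "SELECT * FROM Titles WHERE premiered = 2023"
).fetchall()

print(len(all_rows), len(recent_rows))  # 3 2
```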


Reason 2: Declarative contract breakdown!

  • Despite decades of engineering, data systems are not perfect!
  • Scenarios exist where data systems struggle to find good ways to execute queries:
    • Space of query execution strategies too large, so data systems just pick one suboptimally.
    • Cost estimates (e.g., latencies) for strategies are inaccurate/incorrect.
  • Users can steer the system towards “better” execution strategies.
    • Users could have better ideas about some data characteristics, future data workload, etc.


Reason 3: Many heavyweight levers are still under user control!

  • User-controllable performance levers include:
    • How relations are defined (also applies to materialized views)
    • Whether to generate efficient data access structures, i.e., indexes (more today)
    • How many resources to employ: memory footprint, disk space, # machines
  • Many of these decisions have repercussions beyond execution efficiency of the current query!
    • Ease of use
    • Monetary considerations
    • Anticipated future needs and workload
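For instance, the index lever is a single DDL statement. The sketch below uses SQLite as a stand-in data system, with hypothetical table and index names; EXPLAIN QUERY PLAN shows the system switching to an index-based strategy once the index exists:

```python
import sqlite3

# Hypothetical table and index; SQLite serves as the example data system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Titles (title_id TEXT, primary_title TEXT)")
conn.execute("INSERT INTO Titles VALUES ('tt11737520', 'Example Title')")

# User-controlled lever: build an access structure on title_id.
conn.execute("CREATE INDEX titles_title_id_idx ON Titles (title_id)")

# EXPLAIN QUERY PLAN reveals whether the system now uses the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM Titles WHERE title_id = 'tt11737520'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # the plan mentions titles_title_id_idx
```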


OK—Let’s Worry about Performance!

There are many things we can do to improve performance!

…but in order to do so, we need to learn about data system internals. It’s a Pandora’s Box!

Today: A mental model for what a data system does.


So far: SQL queries/commands (select-from-where, joins, window functions, DDL, DML); Relational Algebra introduction.

Today: performance tuning queries, by way of data system internals.

Disclaimer: Massively hand-wavy! For more about performance, see CS186: https://cs186berkeley.net/


System Memory Model




Cost is Disk I/O

Data is laid out on pages (AKA blocks, see here) on stable disk storage (SSD or HDD)

  • Page size: 8kB, usually. May have 100s of tuples!

The data system retrieves each page, one at a time, loading it into main memory for reading/writing.

  • Pages stored in stable storage (disk) long-term
  • Then loaded into frames for editing in main memory, within a main memory buffer

The main cost in processing SQL queries is the cost of disk I/O. (mostly true)

  • Data computation in main memory is cheap!
  • Expensive part is loading pages into the buffer for read/write, then pushing edits back to disk

Side note: sequential reads/writes of data pages are typically cheaper than random reads/writes. (more later)
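A back-of-envelope sketch of this cost model: the 8 KB page size matches the slide, and the 400-byte average tuple size is a hypothetical assumption chosen so the totals line up with the 10-million-record, 500,000-page example used shortly:

```python
import math

# Assumed sizes: 8 KB pages (per the slide); ~400-byte tuples (hypothetical).
PAGE_SIZE_BYTES = 8 * 1024
AVG_TUPLE_BYTES = 400

tuples_per_page = PAGE_SIZE_BYTES // AVG_TUPLE_BYTES  # tuples that fit per page
n_tuples = 10_000_000
n_pages = math.ceil(n_tuples / tuples_per_page)

# Under the cost model, a full sequential scan costs one I/O per page:
scan_cost = n_pages
print(tuples_per_page, n_pages, scan_cost)  # 20 500000 500000
```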


[Figure: the data system's controller moves data pages between the main-memory buffer (mem) and blocks/pages on stable storage (SSD/disk)]


Strawperson: Sequential Scan Performance Analysis

Suppose we want to find a specific title in Titles:

SELECT * FROM Titles WHERE title_id = 'tt11737520';

  • Further suppose there are ten million records in Titles.
  • These records are stored across 500,000 pages on disk.


Titles

 Column          | Type
-----------------+--------
 title_id (PK)   | text
 type            | text
 primary_title   | text
 original_title  | text
 is_adult        | bigint
 premiered       | bigint
 ended           | bigint
 runtime_minutes | bigint
 genres          | text


We approximate the cost of processing a query as the # of disk I/Os (i.e., # page loads). For sequential scan:

  • Data is unordered by default.
  • We might need to search the entire database sequentially for one record!
  • Worst-case cost: 500,000 page loads!

Sequential scan: For each page:

  • Load into memory (because we need to read the records on the page)
  • For each record on the page:
    • Check whether it matches title_id = 'tt11737520'.
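The procedure above can be sketched in Python with a tiny toy "disk" (a list of pages, each a list of records; contents are illustrative), counting page loads since those dominate the cost:

```python
# Toy disk: each page holds (title_id, record) pairs. Data is illustrative.
disk = [
    [("tt0000001", "row A"), ("tt0000002", "row B")],   # page 0
    [("tt0000003", "row C"), ("tt0000004", "row D")],   # page 1
    [("tt0000005", "row E"), ("tt11737520", "row F")],  # page 2
]

def sequential_scan(disk, target_id):
    """Load pages one at a time; check every record on each loaded page."""
    page_loads = 0
    for page in disk:
        page_loads += 1          # one disk I/O per page load
        for title_id, record in page:
            if title_id == target_id:
                return record, page_loads
    return None, page_loads

record, cost = sequential_scan(disk, "tt11737520")
print(record, cost)  # row F 3  (had to load every page)
```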

Alternate approach: Define a lookup structure, called an index, that helps us quickly tell which page each title_id is located on.
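A minimal sketch of that idea, with a Python dict standing in for the index: it maps each title_id to its page number, so a point query loads only one data page. (Real indexes, e.g. B+-trees, are far more sophisticated and are themselves stored on pages; that cost is ignored here.)

```python
# Toy disk layout as before; data is illustrative.
disk = [
    [("tt0000001", "row A"), ("tt0000002", "row B")],   # page 0
    [("tt0000003", "row C"), ("tt11737520", "row F")],  # page 1
]

# Hypothetical index: title_id -> page number (built once, kept by the system).
index = {tid: p for p, page in enumerate(disk) for tid, _ in page}

def index_lookup(disk, index, target_id):
    """Consult the index, then load just the one page holding the record."""
    page_no = index.get(target_id)
    if page_no is None:
        return None, 0           # not present: no data page loaded
    page_loads = 1               # a single disk I/O for the data page
    for title_id, record in disk[page_no]:
        if title_id == target_id:
            return record, page_loads
    return None, page_loads

record, cost = index_lookup(disk, index, "tt11737520")
print(record, cost)  # row F 1  (one page load instead of a full scan)
```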