Performance Tuning
February 12, 2024
1
Data 101/Info 258, Spring 2024 @ UC Berkeley
Aditya Parameswaran https://data101.org/sp24
LECTURE 07
Why worry about performance?
2
Lecture 07, Data 101 Spring 2024
Beyond the Declarative Contract
So far, we’ve been imagining a declarative contract between the data system and the user, �where the user specifies:
3
The Declarative Contract
user
system
Beyond the Declarative Contract
So far, we’ve been imagining a declarative contract between the data system and the user, �where the user specifies:
..but not how to store data (disk? memory?), nor how to algorithmically enforce constraints
…but not how to compute the query result
…but not how the modification takes place
The data system is then expected to hold up its end of the bargain:
4
Justifying the declarative contract
Why cede control to the system? Here are just several reasons:
5
The Declarative Contract
user
system
As a data engineer, however, you should understand the underlying system!
Even as a data scientist, however, you need to understand database performance!
Why worry about performance?
Reason 1: Our specification (e.g., SQL commands) can and should be revisited.
6
Why worry about performance?
Reason 1: Our specification (e.g., SQL commands) can and should be revisited.
Reason 2: Declarative contract breakdown!
7
Why worry about performance?
Reason 3: Many heavyweight levers are still under user control!
Reason 1: Our specification (e.g., SQL commands) can and should be revisited.
Reason 2: Declarative contract breakdown!
8
OK—Let’s Worry about Performance!
There are many things we can do to improve performance!
…but in order to do so, we need to learn about data system internals. It’s a Pandora’s Box!
Today: A mental model for what a data system does.
9
SQL queries/commands
(Select from where, joins,�window functions, DDL, DML)
Relational Algebra Introduction
Performance tuning queries
So far
Today
Disclaimer: Massively hand-wavy! For more about performance, see CS186: https://cs186berkeley.net/
data system internals
System Memory Model
10
Lecture 07, Data 101 Spring 2024
Cost is Disk I/O
Data is laid out on pages (AKA blocks, see here) on stable disk storage (SSD or HDD)
11
… Blocks / Pages …
(SSD/disk)
Data
|
|
|
|
|
|
|
|
|
|
record/�tuple
page
Cost is Disk I/O
Data is laid out on pages (AKA blocks, see here) on stable disk storage (SSD or HDD)
The data system retrieves each page, one at a time, loading it into main memory for reading/writing.
12
… Blocks / Pages …
(SSD/disk)
Buffer
(mem)
Controller
Data
Data System
Cost is Disk I/O
Data is laid out on pages (AKA blocks, see here) on stable disk storage (SSD or HDD)
The data system retrieves each page, one at a time, loading it into main memory for reading/writing.
The main cost in processing SQL queries�is the cost of disk I/O. (mostly true)
�Side note: sequential read/write of data pages typically cheaper than random read/writes. (more later)
13
Buffer
(mem)
Controller
Data System
… Blocks/Pages …
(SSD/disk)
Data
Strawperson: Sequential Scan Performance Analysis
Suppose we want to find a specific title in Titles:
SELECT * FROM Titles WHERE title_id = 'tt11737520';
14
Titles
Column | Type -----------------+--------
title_id (PK) | text
type | text
primary_title | text
original_title | text
is_adult | bigint
premiered | bigint
ended | bigint
runtime_minutes | bigint
genres | text
'tt11737520'
We approximate the cost of processing a query as # of Disk I/Os (i.e., # page loads).�For sequential scan:
Sequential scan: For each page:
Alternate approach: Define a lookup structure, called an index, that helps us quickly tell which page each title_id is located on.