Query Optimization Wrap-Up
February 28, 2024
Data 101/Info 258, Spring 2024 @ UC Berkeley
Aditya Parameswaran https://data101.org/sp24
1
LECTURE 11a
EXPLAIN ANALYZE with JOINs
EXPLAIN ANALYZE with JOINs
Rules of Thumb, Part II
Other Knobs beyond Query Optimization
2
Lecture 10, Data 101/Info 258 Spring 2024
REPEAT: explain analyze select * from actor, cast_info, movie where actor.id = cast_info.person_id and movie.id = cast_info.movie_id limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=244945.11..244946.60 rows=10 width=318) (actual time=3557.268..3904.228 rows=10 loops=1)
-> Hash Join (cost=244945.11..3855759.20 rows=24248734 width=318) (actual time=3557.267..3904.224 rows=10 loops=1)
Hash Cond: (cast_info.movie_id = movie.id)
-> Hash Join (cost=202289.55..2141712.01 rows=36244344 width=116) (actual time=3152.976..3586.569 rows=9556 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..615130.44 rows=36244344 width=42) (actual time=0.006..100.851 rows=845147 loops=1)
-> Hash (cost=97287.91..97287.91 rows=4167491 width=74) (actual time=3117.718..3117.718 rows=4167491 loops=1)
Buckets: 65536 Batches: 256 Memory Usage: 2273kB
-> Seq Scan on actor (cost=0.00..97287.91 rows=4167491 width=74) (actual time=0.025..488.901 rows=4167491 loops=1)
-> Hash (cost=15598.25..15598.25 rows=662825 width=202) (actual time=243.959..243.959 rows=662825 loops=1)
Buckets: 32768 Batches: 64 Memory Usage: 1350kB
-> Seq Scan on movie (cost=0.00..15598.25 rows=662825 width=202) (actual time=0.023..70.715 rows=662825 loops=1)
Planning Time: 0.664 ms
Execution Time: 4171.240 ms
// note 2 hash joins where actor and cast_info are joined, followed by a join with movie
create index actoridindex on actor(id);
explain analyze select * from actor, cast_info, movie where actor.id = cast_info.person_id and movie.id = cast_info.movie_id limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=42655.99..42661.32 rows=10 width=318) (actual time=236.010..257.950 rows=10 loops=1)
-> Nested Loop (cost=42655.99..12965281.89 rows=24248734 width=318) (actual time=236.009..257.947 rows=10 loops=1)
-> Hash Join (cost=42655.56..1692069.63 rows=24248734 width=244) (actual time=232.978..244.324 rows=10 loops=1)
Hash Cond: (cast_info.movie_id = movie.id)
-> Seq Scan on cast_info (cost=0.00..615130.44 rows=36244344 width=42) (actual time=0.014..10.920 rows=2401 loops=1)
-> Hash (cost=15598.25..15598.25 rows=662825 width=202) (actual time=227.684..227.685 rows=662825 loops=1)
Buckets: 32768 Batches: 64 Memory Usage: 1350kB
-> Seq Scan on movie (cost=0.00..15598.25 rows=662825 width=202) (actual time=0.005..70.086 rows=662825 loops=1)
-> Index Scan using actoridindex on actor (cost=0.43..0.45 rows=1 width=74) (actual time=1.360..1.360 rows=1 loops=10)
Index Cond: (id = cast_info.person_id)
Planning Time: 3.881 ms
Execution Time: 306.243 ms
create index movieid_castinfoindex on cast_info(movie_id);
explain analyze select * from actor, cast_info, movie where actor.id = cast_info.person_id and movie.id = cast_info.movie_id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.87..6.40 rows=10 width=318) (actual time=3.695..18.112 rows=10 loops=1)
-> Nested Loop (cost=0.87..13411901.94 rows=24248734 width=318) (actual time=3.694..18.108 rows=10 loops=1)
-> Nested Loop (cost=0.44..2138689.69 rows=24248734 width=244) (actual time=2.218..3.778 rows=10 loops=1)
-> Seq Scan on movie (cost=0.00..15598.25 rows=662825 width=202) (actual time=0.015..0.016 rows=2 loops=1)
-> Index Scan using movieid_castinfoindex on cast_info (cost=0.44..2.83 rows=37 width=42) (actual time=1.854..1.873 rows=5 loops=2)
Index Cond: (movie_id = movie.id)
-> Index Scan using actoridindex on actor (cost=0.43..0.45 rows=1 width=74) (actual time=1.431..1.431 rows=1 loops=10)
Index Cond: (id = cast_info.person_id)
Planning Time: 3.961 ms
Execution Time: 18.660 ms
drop index actoridindex;
drop index movieid_castinfoindex;
3
Two table and three table join demo!
4
Demo
Schemas: pg_catalog and information_schema
(Bonus!)
Recall from Project 1:
5
Terminology note [link]:�A table (relation) is within�A schema (namespace) is within�A catalog (database)
information_schema is portable and stable (queries transferable b/t different DBMSes).
pg_catalog contains Postgres internal stats e.g. for query optimization.
Postgres-specific system settings, statistics, values, etc.
SQL ANSI standard; can use these schema’s views in most DBMSes (postgres [docs], MySQL, Snowflake, …)
Demo
Statistics in pg_catalog
(Bonus!)
Documentation 14.2: Statistics used by the Planner [link]
6
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname = 'actor' OR relname = 'cast_info';
/* WHERE relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public'); */
Find total number of pages/tuples used by each relation:
Relation statistics
SELECT tablename, attname,
null_frac, avg_width, n_distinct
FROM pg_stats
WHERE schemaname = 'public';
SELECT tablename, attname,
most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public';
Demo
Statistics in pg_catalog
(Bonus!)
Documentation 14.2: Statistics used by the Planner [link]
7
Demo
Query Optimization Settings in pg_catalog
(Bonus!)
All runtime parameters for Postgres are in one view.
8
SELECT name
FROM pg_settings
WHERE name LIKE 'enable_%';
Demo
Rules of Thumb, Part II
EXPLAIN ANALYZE with JOINs
Rules of Thumb, Part II
Other Knobs beyond Query Optimization
9
Lecture 10, Data 101/Info 258 Spring 2024
[Recall] Summary: Why do we care?
As data engineers, we need to understand what impacts performance.
10
For query-centric: we need to understand just “enough” to guide to good plans.
We have a few knobs under our control, but it is still valuable to understand what is expensive.
For code-centric: even more essential, since we’ll be manually doing the query optimization.
Summary: Why do we care? Part 2
As data engineers, we need to understand what impacts performance.
11
For query-centric: we need to understand just “enough” to guide to good plans.
We have a few knobs under our control, but it is still valuable to understand what is expensive.
For code-centric: even more essential, since we’ll be manually doing the query optimization.
Examples of things we must do now (whereas query-centric systems do automatically):
[Recall] Rules of Thumb
12
Tell the system to avoid doing extra work.
Materialize if expensive.
Use indexes to speed up access. Best if indexes help improve queries that are frequent and slow:
Rules of Thumb: A few more
Some additional items:
1. Pipelining
2. Periodic reorganization
13
Tell the system to avoid doing extra work.
Materialize if expensive.
Use indexes to speed up access. Best if indexes help improve queries that are frequent and slow:
3. Periodic statistics recomputation
4. Additional database system knobs
Other Knobs beyond Query Optimization
EXPLAIN ANALYZE with JOINs
Rules of Thumb, Part II
Other Knobs beyond Query Optimization
14
Lecture 10, Data 101/Info 258 Spring 2024
(1/4) Additional Knobs: Pipelining
So far we have described operators in a sequential manner:
15
Better approach: pipelining
Note: Operators themselves can also be parallelized via a process called partitioned parallelism. Critical in multi-node settings!
Scan
Filter
Agg.
(2/4) Additional Knob: Periodic Reorganization
Many data systems keep old versions of tuples around! Several reasons:
However, old versions of tuples cause bloat.Impacts:
16
(2/4) Additional Knob: Periodic Reorganization
Many data systems keep old versions of tuples around! Several reasons:
However, old versions of tuples cause bloat.Impacts:
Solution 1: VACUUM command to re-pack.
Solution 2: For extreme UPDATEs:
17
(3/4) Additional Knob: Statistics Computation
Recall: query optimization relies on statistics!
Statistics can become stale after many updates!
Incorrect statistics can lead to bad query plans. Therefore VACUUM recomputes statistics, too.
Can recompute statistics as part of VACUUM
These can be set to run periodically via the autovacuum program. [Documentation 25.1]
18
(4/4) Additional Knob: Database System Knobs
There are many settings that control system behavior!
In postgres:
imdb=# select count(*) from pg_settings;
count
-------
329
Some settings of particular importance:
Lots more, see Postgres manual!
19
Note about CTEs and Postgres, as of v12
20
Documentation Ch 16
🤔
Based on this text, do CTEs usually get pushed down in the most reversion of postgres (v16)?
Note about CTEs and Postgres, as of v12
21
Documentation Ch 16
Lots of tense discussions of this, which are now outdated with PostgreSQL >v12 in >2019:
StackExchange, forcing hash join vs nested loop, StackExchange 2
For more about subqueries, see Controlling the Planner with Explicit JOINs [14.3 link].