Cultural Learnings of PostgreSQL for Make Benefit Glorious�Company Product
Amir More - Founder & CEO @ dubhub.io�PyWebIL 12/09/2022
Background
Postgres for Programmers
Previously at PyWebIL
How would you solve transactions?
Data
Committed Number
Operation # | Key | Value |
1 | Amir | 10 |
2 | Derek | 10 |
3 | Katie | 10 |
4 | Amir | 15 |
4 | Katie | 5 |
# |
1 |
2 |
3 |
How would you add:
What’s the problem with this architecture?
MVCC in Postgresql
RowID | TX | Deleted? | Data | Change TX | Next RowID |
38f2 | 2 | | Derek;50 | 3 | 9e66 |
9e66 | 3 | | Derek;40 | | |
2b29 | 4 | x | Amir;50 | 5 | |
Accessing Relational Data - SQL
Structured Query Language (SQL): The standard method of interacting with a relational database. Standardized by ANSI and ISO since the 1980s
Declarative - Say what you want, let an executor figure out how (given constraints)
select * from emp inner join dept on emp.dept_id = dept.id
Want all results, have ~10k row memory
# emp | # dept | Relationship | Algorithm - Max memory for 10k rows |
20000 | 20 | 1-1000 | Hash(dept) -> lookup (Hash Join) |
20000 | 20000 | 1-1 | Merge (sort x2) |
20 | 20000 | 1-1 (of 20000) | Index lookup (Nested Loop) |
Today
The Problem
Translation: Take an idea and make it work for 1 data and on your PC
Result: data written “literally” into database�
Outcome #1 - Reading data is slow (Developer: Mistakes were made)
<starts adding indices>
Outcome #2 - Writing data is slow
The Real Problem
The Real Problem
When do we “optimize”?
Trivial Example: Add an Index
Data: NASA Cassini Mission used by bigmachine.io�
�
Slow Query:�SELECT * FROM master_plan WHERE start_time_utc = $1
CREATE INDEX idx ON master_plan (start_time_utc);
“Advanced” Example: Add a Covering Index
�
select start_time_utc from master_plan�where team = $1 and target = $2�order by start_time_utc desc limit $3
CREATE INDEX cover_idx ON master_plan�(team, target, start_time_utc desc);
App/DB Optimization Framework
DB Optimization Mental Framework
Insertion Time
Retrieval�Time
Add Index
Disk & CPU
Retrieval Time
Precompute Query
The Query Execution Pipeline
Optimize This
Optimizer Step
The (Reduced) Hierarchy of Computation
Hash Lookup
Index (B-Tree) Lookup
Table Scan
Sort
The DB’s Toolbox - Query Execution Methods
Access Methods
Computations
Join Methods
The Optimizer
Chooses one path over all others, keeping in mind their cost
*Cost is an estimation; infeasible to know the true cost in advance
**You can’t even estimate all possibilities (General case is NP-Complete!)
ANALYZE and CREATE STATISTIC
Optimizer Plans & Explain
When Things Go Wrong
Live DB Examples
Postgresql Visibility Map
Per table, two bits per page (8KB)
One of the bits = all tuples visible to�current transaction
Updated by VACUUM
.. which runs automatically
https://www.interdb.jp/pg/pgsql06.html
VACUUM
INSERTS
Index Only Scan & Vis. Map
“PostgreSQL has to check the visibility�of the tuples in principle, and the index�tuples do not have any information�about transactions … Therefore,�PostgreSQL has to access the table data�to check the visibility of the data in the�index tuples.”
https://www.interdb.jp/pg/pgsql07.html#_7.2.
Row Level Security
https://pganalyze.com/blog/postgres-row-level-security-django-python
Tools for Mitigation - Next Time
Questions?
Shameless Plug
dubhub.io�local database copies made simple and easy