Common DB schema change mistakes
Nikolay Samokhvalov�nik@postgres.ai
Postgres.ai
This slide deck: bit.ly/pgcon2022-schema-changes
Preamble…
�We found a slow query – looks like a missing index.
We decide to create it.
What can go wrong?
Nikolay Samokhvalov
PostgreSQL user since 2005
Occasional code contributions (XML, etc.)
Co-founded 3 startups (social media; 2 successful exits)
Helped with Postgres to: GitLab, Chewy, Miro, etc.
Postgres.ai founder
👉 Created/reviewed more than 1,000 DB changes
Help companies with PostgreSQL scalability and performance
Database Lab Engine�– thin clones for Postgres
Used by:
Postgres.ai
An abstract example:
I highly recommend – GitLab’s docs and code
GitLab’s “Migration Style Guide” https://docs.gitlab.com/ee/development/migration_style_guide.html
GitLab’s migration helpers�https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/database/migration_helpers.rb
Database Lab and Postgres.ai�https://docs.gitlab.com/ee/development/database/database_lab.html
Terminology
DML – database manipulation language� (SELECT / INSERT / UPDATE / DELETE, etc.)
DDL – data definition language
(CREATE …, ALTER …, DROP …)
DB migrations – planned, incremental changes� of DB schema and/or data
DB schema migration & data migration
DB schema evolution, schema versioning
DB change management, and so on
Applying a schema migration to
a production database is always
a risk
Three big classes of DB migration mistakes
[1] Schema mismatch on dev & prod
create table t1 (� id int primary key,� val text�);�
-- dev, test, QA, staging, etc. – OK
-- prod:�ERROR: relation "t1" already exists
[2] Misuse of IF [NOT] EXISTS
create table if not exists t1 (� id int primary key,� val text�);
NOTICE: relation "t1" already exists, skipping
CREATE TABLE
🤔
Start using DB schema migration tool
Test changes in CI
Test changes in CI
�Now guess what…
“Thanks” to IF NOT EXISTS, we now may leave UNDO empty!
🤦♂️
❌ Don’t:
✅ Do:
“Three Cases Against IF NOT EXISTS / IF EXISTS in Postgres DDL”�https://postgres.ai/blog/20211103-three-cases-against-if-not-exists-and-if-exists-in-postgresql-ddl
Database Testing Landscape
Schema
Data
Test schema
Single-user DB experiments
Static analysis
Application change management
Dynamic analysis
Test data
Test DDL
Test DML
Multi-user DB experiments
Load and stress testing
EXPLAIN
Infrastructure
Test backups
Corruption check
HA tests
Microbenchmarks
...
“app”
“infra”
Reliable database changes – the hierarchy of needs
Actual, realistic testing
Review and approval process (manual)
Test DO and UNDO in CI, on an empty or small synthetic DB
Version control for DB changes: Git & Flyway / Sqitch / Liquibase / smth else
All
Many
Some
Extremely few
[3] Failed change due to statement_timeout
[4] massive change, unlimited
DELETE FROM table1 WHERE … ; -- target 10M rows
(or UPDATE)
[4] massive change, unlimited
test=# explain (buffers, analyze) update t1� set val = replace(val, '0159', 'OiSg');
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Update on t1 (cost=0.00..189165.00 rows=10000000 width=42) (actual time=76024.507..76024.508 rows=0 loops=1)
Buffers: shared hit=60154265 read=91606 dirtied=183191 written=198198
-> Seq Scan on t1 (cost=0.00..189165.00 rows=10000000 width=42) (actual time=0.367..2227.103 rows=10000000 loops=1)
Buffers: shared read=64165 written=37703
Planning:
Buffers: shared hit=17 read=1 dirtied=1
Planning Time: 0.497 ms
Execution Time: 76024.546 ms
(8 rows)
Time: 76030.399 ms (01:16.030)
Concurrency issues – holding an acquired lock for long
Lock acquired
START TRANSACTION
COMMIT
Some long-lasting actions
timeline
Locks are released only in the very end of transaction (COMMIT or ROLLBACK)
[5] Acquire an exclusive lock + wait in transaction
begin;
alter table t1 add column c123 int8;
-- do something inside or outside of the database
commit;
[6] A Transaction with DDL + massive DML
begin;
alter table t1 add column c123 int8;
copy ... -- load a lot of data, taking some time
commit;
❌ Don’t:
What is “slow” for OLTP?
“What is a slow SQL query?” https://postgres.ai/blog/20210909-what-is-a-slow-sql-query
[7] Waiting to acquire an exclusive lock ⇒ blocking others
Can this simple change be dangerous?��alter table t1 add column c123;
“Zero-downtime Postgres schema migrations need this: lock_timeout and retries”�https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries
The right way – every big/growing project must have it
begin;� set lock_timeout = 50;� lock table only test in ACCESS EXCLUSIVE MODE;� set lock_timeout = 0;�� alter table t1 ....;�commit;���“Zero-downtime Postgres schema migrations need this: lock_timeout and retries”�https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries
[8] Create an FK
alter table t1 add constraint …� foreign key (...) references t2 (...);���
-- step 1 (fast)�alter table t1 add constraint fk_t1_123� foreign key (...) references t2 (...) not valid;��-- step 2 (long) – later, in a separate transaction�alter table t1 validate constraint fk_t1_123;�
❌ Don’t:
✅ Do:
[9] Drop an FK
alter table t1 drop constraint …;
– should be deployed with lock_timeout and retry logic �
[10] Add a CHECK constraint
alter table t1 add constraint c_t1_123� check (c123 is not null);
-- step 1
alter table t1 add constraint c_t1_123� check (c123 is not null) not valid;
-- step 2
alter table t1 validate constraint c_t1_123;
❌ Don’t:
✅ Do:
[11] Add NOT NULL
alter table t1 alter column c123 set not null;
Trick 1 (PG11+): do with with DEFAULT when creating a new column with some DEFAULT, then backfill, then drop default
Trick 2 (PG12+): add a CHECK with IS NOT NULL (in 2 steps), then SET NOT NULL, then drop the CHECK constraint
❌ Don’t:
✅ Do:
[12] Change column’s data type
alter table t1 alter column id type int8;
Most common approaches:
None of them are easy.
[13] Create an index
create index i_t1_c123� on t1 using btree (c123);
create index concurrently i_t1_c123� on t1 using btree (c123);
❌ Don’t:
✅ Do:
[14] Drop an index
drop index i_t1_c123;
drop index concurrently i_t1_c123;
❌ Don’t:
✅ Do:
[15] Renaming objects
alter table t1 rename to ..;��alter table t1 rename column .. to …;��(and so on)
Application code may not expect it:
[16] Add a column with DEFAULT
alter table t1 add column c1 int8 default -1;
No worries unless you’re on Postgres 10 or older�(otherwise– some headache, if the table has a lot of rows)
Good article: https://brandur.org/postgres-default
[17] Failed CREATE INDEX CONCURRENTLY
CREATE INDEX CONCURRENTLY may fail
If it does, an invalid index is present – must be cleaned up first:
test=# select indexrelid, indexrelid::regclass as indexname, indisvalid�from pg_index�where not indisvalid and indexrelid::regclass::text = 'mytable_title_idx';�� indexrelid | indexname | indisvalid�------------+-------------------+------------� 26401 | mytable_title_idx | f�(1 row)
[18] Create a table with int4 PK
create table t1 (� id int4 primary key,� ts timestamptz�);
create table t1 (� id int8 primary key,� ts timestamptz�);
❌ Don’t:
✅ Do:
[19] Limiting text too much
[20] REFRESH MATERIALIZED VIEW (w/o CONCURRENTLY)
[21] Ignoring vacuum and bloat issues
Database Migration Testing with Database Lab
Database Migration Testing with Database Lab
Summary
Thank you!
lowercase for sql queries is ok
Some examples of failures due to lack of testing
We need better tools
Steve Jobs (1980)
Traditional DB experiments – thick clones
+
Production
“1 database copy – 10 persons”
…
…
Database Lab: use thin clones
Production
“1 database copy – 1 person”
“Thin clones” – Copy-on-Write (CoW)
The Database Lab Engine (DLE)
Open-source (AGPLv3)
The Platform (SaaS)
Proprietary (freemium)
Database Lab – Open-core model
- Thin cloning – API & CLI
- Automated provisioning and data refresh
- Data transformation, anonymization
- Supports managed Postgres (AWS RDS, etc.)
- Web console – GUI
- Access control, audit
- History, visualization
- Support
^^ use these links to start using it for your databases ^^
Database Lab unlocks “Shift-left testing”
Yes
No
Database experiments on thin clones – yes and no
DB migration testing – “stateful tests in CI”
What we want from testing of DB changes:
…and:
Perfect Lab for database experiments
Database experiments with Database Lab today (2021)
Why Database Lab was created
CI: docker pull … && docker run …
– OK only for tiny (< a few GiB) databases
– OK only for very large enterprises
Companies that do need it today
Inside the Database Lab Engine 2.x
1 (or N) physical disk(s) + CoW support
Shared cache (OpenZFS: ARC): 50% of RAM
shared_buffers: 1Gi
��Port: 600x
shared_buffers: 1Gi
��Port: 6001
shared_buffers: 1Gi
�Clone.�Port: 6001
shared_buffers: 8Gi
�“sync”�container
The main container�("dblab_server")��– control, API�
Test of a�DB migration
∀
Test of a�DB migration
∀
Test of a�DB migration
∀
DLE – the data flow (physical mode)
“Golden copy”�(initial thick clone)
backups
Fetch WALs
The “sync” instance applies WALs continuously
The main PGDATA version
– the lag behind production�is usually a few seconds
A new snapshot is created every N hours
Clone created�(based on snapshot A)
snapshot A
snapshot B
snapshot C
Clone�destroyed
Clone created�(based on snapshot A)
.. based on�snapshot B
Independent PGDATA ready,�Ready to accept any DDL, DML
Independent PGDATA ready,�Ready to accept any DDL, DML
How snapshots are created (ZFS version)
Major topics of automated (CI) testing on thin clones
https://postgres.ai/docs/platform/security
CI Observer: https://postgres.ai/docs/database-lab/cli-reference#subcommand-start-observation
Timing estimator: https://postgres.ai/docs/database-lab/timing-estimator
Making the process secure: where to place the DLE?
Production
Dev & Test
The big wall
PII here
CI runners here
Where to place the DLE? Current approach
Production
Dev & Test
The big wall
PII here
CI runners here
High-level API calls
How it looks like: CI part
Example: GitHub Actions: https://github.com/agneum/runci/runs/2519607920?check_suite_focus=true
More about dangerous lock detection
February 19, 2021 – https://gitlab.com/gitlab-org/gitlab/-/merge_requests/54564#note_512678910
…
Example: GitLab.com, testing database changes using Database Lab
More about production timing estimation
Summary – available in PR/MR and visible to whole team
Example (WIP): https://gitlab.com/postgres-ai/database-lab/-/snippets/2083427
More artifacts, details – restricted access
Database Lab Roadmap
https://postgres.ai/docs/roadmap
Where to start