1 of 79

Common DB schema change mistakes

Nikolay Samokhvalov, nik@postgres.ai

Postgres.ai

2 of 79

3 of 79

Preamble…

We found a slow query – looks like a missing index.

We decide to create it.

What can go wrong?

4 of 79

5 of 79

Nikolay Samokhvalov

PostgreSQL user since 2005

Occasional code contributions (XML, etc.)

Co-founded 3 startups (social media; 2 successful exits)

Helped with Postgres to: GitLab, Chewy, Miro, etc.

Postgres.ai founder

👉 Created/reviewed more than 1,000 DB changes

SUBSCRIBE YouTube Postgres.tv

Twitter: @samokhvalov, @Database_Lab

6 of 79

Help companies with PostgreSQL scalability and performance

Database Lab Engine – thin clones for Postgres

Used by:

Postgres.ai

An abstract example:

  • DB size: 10 TiB
  • DLE – a single VM with 10 TiB of disk space
  • A single DLE provides 30-50 thin clones, each is 10 TiB
    • Engineers work independently
    • CI/CD pipelines run automated tests
  • To get a new clone:
    • ~10 seconds and $0 (!)

7 of 79

I highly recommend – GitLab’s docs and code

8 of 79

Terminology

DML – data manipulation language (SELECT / INSERT / UPDATE / DELETE, etc.)

DDL – data definition language

(CREATE …, ALTER …, DROP …)

DB migrations – planned, incremental changes of DB schema and/or data

DB schema migration & data migration

DB schema evolution, schema versioning

DB change management, and so on

9 of 79

Applying a schema migration to

a production database is always

a risk

10 of 79

Three big classes of DB migration mistakes

  1. Correctness. E.g.:
    • unexpected schema deviations
    • schema/app code mismatch
    • unexpected data
  2. Miscellaneous. E.g.:
    • reaching statement_timeout
    • connection interrupted
    • int4 PKs
  3. Concurrency. Some examples:
    • failed to acquire a lock
    • updating too many rows at once
    • acquired an exclusive lock and left transaction open

11 of 79

[1] Schema mismatch on dev & prod

create table t1 (
  id int primary key,
  val text
);

-- dev, test, QA, staging, etc. – OK

-- prod:
ERROR: relation "t1" already exists

12 of 79

[2] Misuse of IF [NOT] EXISTS

create table if not exists t1 (
  id int primary key,
  val text
);

NOTICE: relation "t1" already exists, skipping

CREATE TABLE

🤔
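To see why that NOTICE is dangerous, consider a sketch (the pre-existing definition is illustrative): if a table with the same name but a different definition already exists, IF NOT EXISTS silently keeps the old one.

```sql
-- prod already has a different t1 (e.g., created manually long ago):
--   create table t1 (id int4, payload jsonb);

create table if not exists t1 (
  id int primary key,
  val text
);
-- NOTICE: relation "t1" already exists, skipping
-- The migration "succeeds", but prod still has the old definition:
-- no "val" column, no primary key – and nobody noticed.
```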

13 of 79

Start using DB schema migration tool

14 of 79

Test changes in CI

  • Both DO and UNDO steps are supported (can revert)
  • CI: test them all
    • Better: DO, UNDO, and DO again

15 of 79

Test changes in CI

  • Both DO and UNDO steps are supported (can revert)
  • CI: test them all
    • Better: DO, UNDO, and DO again

Now guess what…

“Thanks” to IF NOT EXISTS, we now may leave UNDO empty!

🤦‍♂️

16 of 79

❌ Don’t:

  • IF [NOT] EXISTS

✅ Do:

  • test DO-UNDO-DO in CI
  • keep schema up to date in all envs
  • don’t ignore or work-around errors

“Three Cases Against IF NOT EXISTS / IF EXISTS in Postgres DDL”
https://postgres.ai/blog/20211103-three-cases-against-if-not-exists-and-if-exists-in-postgresql-ddl

17 of 79

Database Testing Landscape

(an overview diagram; two dimensions: Schema ↔ Data and “app” ↔ “infra”)

  • Test schema: static analysis, dynamic analysis, application change management
  • Test data: test DDL, test DML
  • Single-user and multi-user DB experiments
  • Load and stress testing, EXPLAIN
  • Infrastructure: test backups, corruption check, HA tests, microbenchmarks, ...

18 of 79

Reliable database changes – the hierarchy of needs

From the bottom of the pyramid to the top (and how many projects reach each level):

  • Version control for DB changes: Git & Flyway / Sqitch / Liquibase / smth else – all
  • Test DO and UNDO in CI, on an empty or small synthetic DB – many
  • Review and approval process (manual) – some
  • Actual, realistic testing – extremely few

19 of 79

[3] Failed change due to statement_timeout
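A minimal sketch of the usual workaround, assuming the migration tool runs each change in its own session: raise (or disable) the limit for the migration session only, instead of changing it globally.

```sql
-- in the migration session only:
set statement_timeout = 0;            -- or a generous value, e.g. '60min'
alter table t1 add column c123 int8;  -- the heavy DDL that used to hit the limit
reset statement_timeout;
```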

20 of 79

[4] massive change, unlimited

DELETE FROM table1 WHERE … ; -- target 10M rows

(or UPDATE)
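A hedged sketch of the batched alternative (table name and batch size illustrative): delete in limited chunks and repeat until done, so each transaction stays short and autovacuum can keep up.

```sql
-- one batch; repeat from app code or a loop until 0 rows are affected
with batch as (
  select ctid
  from table1
  where ...          -- the original condition
  limit 10000
)
delete from table1
where ctid in (select ctid from batch);
```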

21 of 79

22 of 79

[4] massive change, unlimited

test=# explain (buffers, analyze) update t1
  set val = replace(val, '0159', 'OiSg');
                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Update on t1  (cost=0.00..189165.00 rows=10000000 width=42) (actual time=76024.507..76024.508 rows=0 loops=1)
   Buffers: shared hit=60154265 read=91606 dirtied=183191 written=198198
   ->  Seq Scan on t1  (cost=0.00..189165.00 rows=10000000 width=42) (actual time=0.367..2227.103 rows=10000000 loops=1)
         Buffers: shared read=64165 written=37703
 Planning:
   Buffers: shared hit=17 read=1 dirtied=1
 Planning Time: 0.497 ms
 Execution Time: 76024.546 ms
(8 rows)

Time: 76030.399 ms (01:16.030)

23 of 79

Concurrency issues – holding an acquired lock for long

Timeline: START TRANSACTION → lock acquired → some long-lasting actions → COMMIT

Locks are released only at the very end of the transaction (COMMIT or ROLLBACK)

24 of 79

[5] Acquire an exclusive lock + wait in transaction

begin;

alter table t1 add column c123 int8;

-- do something inside or outside of the database

commit;
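The fix is to isolate the DDL so the exclusive lock is held only briefly – a sketch:

```sql
-- ✅ commit immediately after the DDL:
begin;
alter table t1 add column c123 int8;
commit;  -- the exclusive lock is released here

-- ...and run the long-lasting actions outside of that transaction
```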

25 of 79

[6] A Transaction with DDL + massive DML

❌ Don’t:

begin;

alter table t1 add column c123 int8;

copy ... -- load a lot of data, taking some time

commit;
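Same idea here – split the change into two transactions, so the long data load no longer runs under the exclusive lock taken by the DDL (a sketch; the COPY source is illustrative):

```sql
-- ✅ step 1: a short transaction, lock held briefly
begin;
alter table t1 add column c123 int8;
commit;

-- ✅ step 2: the long data load, outside of the DDL transaction
copy t1 from '/path/to/data.csv' (format csv);
```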

26 of 79

What is “slow” for OLTP?

27 of 79

[7] Waiting to acquire an exclusive lock ⇒ blocking others

Can this simple change be dangerous?

alter table t1 add column c123 int8;

28 of 79

“Zero-downtime Postgres schema migrations need this: lock_timeout and retries”
https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries

29 of 79

The right way – every big/growing project must have it

begin;
set lock_timeout = 50;
lock table only t1 in ACCESS EXCLUSIVE MODE;
set lock_timeout = 0;

alter table t1 ...;
commit;

“Zero-downtime Postgres schema migrations need this: lock_timeout and retries”
https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries
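The retry logic itself usually lives in the migration tool or app code, but it can be sketched in PL/pgSQL too (attempt count and backoff are illustrative; each BEGIN…EXCEPTION block is a subtransaction, so a failed attempt rolls back cleanly):

```sql
do $$
begin
  for i in 1..5 loop
    begin
      set local lock_timeout = '50ms';
      alter table t1 add column c123 int8;
      exit;  -- acquired the lock and applied the change
    exception when lock_not_available then
      raise notice 'attempt % hit lock_timeout, retrying', i;
      perform pg_sleep(0.1 * i);  -- simple backoff
    end;
  end loop;
end $$;
```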

30 of 79

[8] Create an FK

❌ Don’t:

alter table t1 add constraint …
  foreign key (...) references t2 (...);

✅ Do:

-- step 1 (fast)
alter table t1 add constraint fk_t1_123
  foreign key (...) references t2 (...) not valid;

-- step 2 (long) – later, in a separate transaction
alter table t1 validate constraint fk_t1_123;

31 of 79

[9] Drop an FK

alter table t1 drop constraint …;

– should be deployed with lock_timeout and retry logic

32 of 79

[10] Add a CHECK constraint

❌ Don’t:

alter table t1 add constraint c_t1_123
  check (c123 is not null);

✅ Do:

-- step 1
alter table t1 add constraint c_t1_123
  check (c123 is not null) not valid;

-- step 2
alter table t1 validate constraint c_t1_123;

33 of 79

[11] Add NOT NULL

❌ Don’t:

alter table t1 alter column c123 set not null;

✅ Do:

Trick 1 (PG11+): for a new column, add it with NOT NULL and some DEFAULT, then backfill real values, then drop the default

Trick 2 (PG12+): add a CHECK with IS NOT NULL (in 2 steps), then SET NOT NULL, then drop the CHECK constraint
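Trick 2 spelled out as SQL (PG12+; the constraint name is illustrative) – Postgres uses the validated CHECK to prove the column has no NULLs, so SET NOT NULL skips the full-table scan:

```sql
-- step 1: add the constraint without scanning the table
alter table t1 add constraint c123_not_null
  check (c123 is not null) not valid;

-- step 2: validate – long, but holds only a weak lock
alter table t1 validate constraint c123_not_null;

-- step 3: now SET NOT NULL is fast (PG12+ uses the CHECK as proof)
alter table t1 alter column c123 set not null;

-- step 4: the CHECK is redundant now
alter table t1 drop constraint c123_not_null;
```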

34 of 79

[12] Change column’s data type

alter table t1 alter column id type int8;

Most common approaches:

  • Option 1: “new column”
  • Option 2: “new table”

None of them are easy.
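A hedged sketch of Option 1, the “new column” approach (names illustrative; the sync trigger, batching loop, and PK/index/sequence switchover are the hard parts left out here):

```sql
-- 1. add the new column (fast)
alter table t1 add column id_new int8;

-- 2. backfill in batches (repeat until done), keeping new writes
--    in sync via a trigger during the backfill
update t1 set id_new = id where id between 1 and 10000;

-- 3. switch over in one short transaction
begin;
alter table t1 rename column id to id_old;
alter table t1 rename column id_new to id;
commit;
```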

35 of 79

[13] Create an index

❌ Don’t:

create index i_t1_c123
  on t1 using btree (c123);

✅ Do:

create index concurrently i_t1_c123
  on t1 using btree (c123);

36 of 79

[14] Drop an index

❌ Don’t:

drop index i_t1_c123;

✅ Do:

drop index concurrently i_t1_c123;

37 of 79

[15] Renaming objects

alter table t1 rename to ..;

alter table t1 rename column .. to …;

(and so on)

Application code may not expect it:

  • app code not yet deployed
  • app code deployed before commit

38 of 79

[16] Add a column with DEFAULT

alter table t1 add column c1 int8 default -1;

No worries unless you’re on Postgres 10 or older (in that case, expect some headache if the table has a lot of rows)

Good article: https://brandur.org/postgres-default

39 of 79

[17] Failed CREATE INDEX CONCURRENTLY

CREATE INDEX CONCURRENTLY may fail

If it does, an invalid index is present – must be cleaned up first:

test=# select indexrelid, indexrelid::regclass as indexname, indisvalid
from pg_index
where not indisvalid and indexrelid::regclass::text = 'mytable_title_idx';

 indexrelid |     indexname     | indisvalid
------------+-------------------+------------
      26401 | mytable_title_idx | f
(1 row)
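Cleanup is then a two-step, using the index name found above (the indexed column is illustrative):

```sql
-- remove the invalid leftover (no long-held lock)
drop index concurrently mytable_title_idx;

-- ...fix the cause of the failure, then retry
create index concurrently mytable_title_idx on mytable (title);
```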

40 of 79

[18] Create a table with int4 PK

❌ Don’t:

create table t1 (
  id int4 primary key,
  ts timestamptz
);

✅ Do:

create table t1 (
  id int8 primary key,
  ts timestamptz
);

41 of 79

[19] Limiting text too much
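The slide title refers to varchar(n): tightening or loosening the limit later requires an ALTER. A common alternative (a sketch; names illustrative) is text plus a CHECK constraint, which can be dropped and re-added NOT VALID / VALIDATE without rewriting the table:

```sql
-- ❌ hard to change later:
-- alter table t1 add column title varchar(255);

-- ✅ flexible:
alter table t1 add column title text
  constraint title_len check (length(title) <= 255);
```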

42 of 79

[20] REFRESH MATERIALIZED VIEW (w/o CONCURRENTLY)
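Without CONCURRENTLY, REFRESH takes an exclusive lock and blocks readers for the whole refresh. A sketch of the concurrent variant (it requires a unique index on the view; names illustrative):

```sql
-- ❌ blocks all readers of mv1 for the duration of the refresh
refresh materialized view mv1;

-- ✅ prerequisite: a unique index on the materialized view
create unique index concurrently i_mv1_id on mv1 (id);

-- readers keep working during the refresh
refresh materialized view concurrently mv1;
```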

43 of 79

[21] Ignoring vacuum and bloat issues

44 of 79

Database Migration Testing with Database Lab

  • Realistic migration testing is hard
  • No testing = unexpected problems

45 of 79

Database Migration Testing with Database Lab

  • Realistic migration testing is hard
  • No testing = unexpected problems

  • makes realistic testing easy

46 of 79

Summary

  1. Test DB changes in CI (with data). Consider: with DB Migration Checker
  2. Create automation for each case. A good example: GitLab’s migration_helpers.rb
  3. Share your ideas with us! 👉 Slack.Postgres.ai, @Database_Lab

47 of 79

Thank you!

Slack (EN): Slack.Postgres.ai

Twitter: @Database_Lab

48 of 79

49 of 79

Lowercase for SQL queries is OK

50 of 79

Some examples of failures due to lack of testing

  • Incompatible changes – production has a different DB schema than dev & test
  • Cannot deploy – hitting statement_timeout – operations too heavy

  • During deployment, we got a failover
  • Deployment lasted 10 minutes, the app was very slow (or even down)

  • Two weeks after deployment, we realize that the high bloat growth we have now was introduced by that deployment
  • Deployment succeeded, but then we started to see errors

51 of 79

We need better tools

52 of 79

53 of 79

Steve Jobs (1980)

  1. We, humans, are great tool-makers. We amplify human abilities.

  2. Something special happens when you have 1 computer and 1 person. It’s very different than having 1 computer and 10 persons.

54 of 79

Traditional DB experiments – thick clones


Production

“1 database copy – 10 persons”

55 of 79

Database Lab: use thin clones

Production

“1 database copy – 1 person”

56 of 79

“Thin clones” – Copy-on-Write (CoW)

57 of 79

The Database Lab Engine (DLE)

Open-source (AGPLv3)

The Platform (SaaS)

Proprietary (freemium)

Database Lab – Open-core model

- Thin cloning – API & CLI

- Automated provisioning and data refresh

- Data transformation, anonymization

- Supports managed Postgres (AWS RDS, etc.)

- Web console – GUI

- Access control, audit

- History, visualization

- Support


58 of 79

Database Lab unlocks “Shift-left testing”

59 of 79

Yes

  • Check execution plan – Joe bot
    • EXPLAIN w/o execution
    • EXPLAIN (ANALYZE, BUFFERS)
      • (timing is different; structure and buffer numbers – the same)
  • Check DDL
    • index ideas (Joe bot)
    • auto-check DB migrations (CI Observer)
  • Heavy, long queries: analytics, dump/restore
    • No penalties! (think hot_standby_feedback, locks, CPU)

No

  • Load testing
  • Regular HA/DR goals
    • backups
      • (but useful to check WAL stream, recover records by mistake)
    • hot standby
      • (but useful to offload very long-running SELECTs)

Database experiments on thin clones – yes and no

60 of 79

DB migration testing – “stateful tests in CI”

What we want from testing of DB changes:

  • Ensure the change is valid
  • It will be executed in appropriate time
  • It won’t put the system down

…and:

  • What to expect? (New objects, size change, duration, etc.)

61 of 79

Perfect Lab for database experiments

  • Realistic conditions – as similar to production as possible
    • The same schema, data, environment as on production
    • Very similar background workload
  • Full automation
  • “Memory” (store, share details)
  • Low iteration overhead (time & money)
  • Everyone can test independently: allowed to fail → allowed to learn

62 of 79

Database experiments with Database Lab today (2021)

  • Realistic conditions – as similar to production as possible
    • The same schema, data, environment as on production
    • Very similar background workload
  • Fine automation
  • “Memory” (store, share details)
  • Low iteration overhead (time & money)
  • Everyone can test independently: able to fail → able to learn

63 of 79

Why Database Lab was created

  • Containers, OverlayFS (file-level CoW)

CI: docker pull … && docker run …

– OK only for tiny (< a few GiB) databases

  • Existing solutions: Oracle Snap Clones, Delphix, Actifio, etc. – $$$$, not open

– OK only for very large enterprises

64 of 79

Companies that do need it today

  • 10+ engineers
  • Multiple backend teams (or plans to split soon)
  • Microservices (or plans to move to them)
  • 100+ GiB databases
  • Frequent releases

65 of 79

Inside the Database Lab Engine 2.x

(architecture diagram) A single machine runs:

  • 1 (or N) physical disk(s) + CoW support
  • Shared cache (OpenZFS: ARC): 50% of RAM
  • The “sync” container (shared_buffers: 8Gi), keeping the data up to date
  • The main container ("dblab_server") – control, API
  • Thin clones (shared_buffers: 1Gi each, ports 600x), each running a test of a DB migration

66 of 79

DLE – the data flow (physical mode)

(data flow diagram)

  • The “golden copy” (initial thick clone) is restored from backups; WALs are fetched continuously
  • The “sync” instance applies WALs continuously – the lag of the main PGDATA version behind production is usually a few seconds
  • A new snapshot (A, B, C, …) is created every N hours
  • Clones are created from a chosen snapshot and destroyed when no longer needed
  • Each clone gets an independent PGDATA, ready to accept any DDL and DML

67 of 79

How snapshots are created (ZFS version)

  • Create a “pre” ZFS snapshot (R/O)
  • Create a “pre” ZFS clone (R/W)
  • DLE launches a temporary “promote” container
    • If needed, performs “preprocessing” steps (bash)
    • Uses “pre” clone to run Postgres and promote it to primary state
    • If needed, performs “preprocessing” SQL queries
    • Performs a clean shutdown of Postgres
  • Create a final ZFS snapshot that will be used for cloning

68 of 79

Major topics of automated (CI) testing on thin clones

  • Security

https://postgres.ai/docs/platform/security

  • Capturing dangerous locks

CI Observer: https://postgres.ai/docs/database-lab/cli-reference#subcommand-start-observation

  • Forecast production timing

Timing estimator: https://postgres.ai/docs/database-lab/timing-estimator

69 of 79

Making the process secure: where to place the DLE?

(diagram) Production – PII here | the big wall | Dev & Test – CI runners here

70 of 79

Where to place the DLE? Current approach

(diagram) Production – PII here | the big wall | Dev & Test – CI runners here

Only high-level API calls cross the big wall.

71 of 79

What it looks like: the CI part

72 of 79

More about dangerous lock detection

73 of 79

74 of 79

Example: GitLab.com, testing database changes using Database Lab

75 of 79

More about production timing estimation

76 of 79

Summary – available in PR/MR and visible to whole team

  • When, who, status
  • Duration (in the Lab + estimated for production)
  • Size changes, new objects
  • Dangerous locks
  • Error stats
  • Transaction stats
  • Query analysis summary
  • Tuple stats
  • WAL generated, checkpointer/bgwriter stats
  • Temp files stats

Example (WIP): https://gitlab.com/postgres-ai/database-lab/-/snippets/2083427

77 of 79

More artifacts, details – restricted access

  • System monitoring (resources utilization)
  • pg_stat_*
  • pg_stat_statements, pg_stat_kcache
  • logerrors
  • Postgres log
  • pgBadger (html, json)
  • wait event sampling
  • perf tracing, flamegraphs; or eBPF
  • Estimated production timing

https://gitlab.com/postgres-ai/database-lab/-/issues/226

78 of 79

Database Lab Roadmap

https://postgres.ai/docs/roadmap

  • Lower the entry bar
  • Simplify installation
  • Simplify the use
  • Easy to integrate
  • *** **** * *******

79 of 79

Where to start