Using GitHub gh-ost to run Zero downtime online schema migrations in MySQL
Chinmay Naik
By the end of this talk, you will
Have good understanding of challenges of MySQL schema migration
Know about possible solutions and how these work
Understand our journey to use gh-ost in production to migrate 100M+ rows table with 100GB+ size
Learn about building operationally simple tools for your own use cases
Not have to treat your MySQL DB as a magic blackbox
$ whoami
One2N
Founder at One2N
Engineer - Full Stack, Backend, Infrastructure, Big data, Reliability
Grey hair, curious, obsessed about reliability and pragmatism
Backend + Reliability engineering focused company
Helping startups scale in the growth journey from One to N
Focus on the βlitiesβ of Software systems (Scalability, Reliability, etc)
Context
Journey so far
Reading lots and lots of documentation
Proof Of Concepts for various tools
Putting out fires in production (broken replication, topology changes etc)
Working at a growing fintech startup in South-East Asia as an SRE
Picking up DB management responsibility (15+ months πΆ), not an expert
Akin to building a boat while swimming in the water
Outline
Challenges with naive alter table
Database topology
db1 (primary/master)
db3 (replica/slave)
db2 (replica/slave)
DB Setup
Replication
(non-GTID)
Migrating table naively
On master:
mysql> alter table ratings add column feedback text;
Query OK, 0 rows affected (57.75 sec)
Meanwhile, on the slave nodes
Meanwhile, on the slave nodes
On slave: show slave status shows
altering table
So what just happened?
Alter table on master took a while (~60 sec). Extrapolate for large tables accordingly π
Equivalent replication lag on all replicas (imagine a 5 hour lag for large tables) π
No other DML operations on replica π«
Non-pausable and irreversible operation on replicas β―
It may not be blocking the table on master, but there are more caveats (refer documentation) β οΈ
What about Online and Instant DDL?
It may be enough, but only for some cases
Types of DDL operations
π±
Itβs complex, you get the idea
We need something more universal
GitHub
Online schema migration for MySQL
Triggerless, works by reading binary logs
Operationally simple with
throttling support
dynamic control/reconfiguration
auditing, and other operational perks
Other popular option is pt-online-schema-change as part of Percona toolkit
How online schema change tools work?
Create table similar to the source table (shadow table)
Migrate shadow table while empty
Copy data from source to shadow table (think snapshot copy)
Rename tables atomically
Optionally delete the older table
What about changed data while we copy?
Binary log
Triggers
Binary log based streaming replication, typically asynchronous
Entire alter table can be throttled as it just simulate inserts on master
Different transaction space since itβs asynchronous
gh-ost operates in this fashion
Synchronous, runs as part of same transaction
Data copy can be throttled, but not the triggers
Execute in same transaction space as other statement
pt-osc operates in this fashion
Our migration use case
DB size 1.3+TB growing ~5GB/day
Table to be migrated 400+M rows, 100+GB
We had external partner on which we were dependent to carry out DB migrations
Not much expertise in the team
Self-managed DB infra
Initial migrations would take weeks and PRs were often blocked
We wanted to add a new column to the largest table
gh-ost iterations and learnings
Simulate a prod like setup on sandbox environment
Master slave setup with ProxySQL
Read write traffic split, same as production
Simulating workload using pt-archiver and sysbench
Learn about MySQL replication, topologies, common replication errors
Reading documentation, trying out POCs, rolling back when things break
gh-ost options
gh-ost Demo
D-Day Migration
Future migrations and CI CD setup
Things we can learn from gh-ost design
Gh-ost and foreign keys
We worked around the problem by adding these columns in a separate table than migrating existing table
Questions / Feedback / Comments?