1 of 29

Using GitHub gh-ost to run Zero downtime online schema migrations in MySQL

Chinmay Naik

Twitter: chinmay185

LinkedIn: chinmay185

2 of 29

By the end of this talk, you will

Have good understanding of challenges of MySQL schema migration

Know about possible solutions and how these work

Understand our journey to use gh-ost in production to migrate 100M+ rows table with 100GB+ size

Learn about building operationally simple tools for your own use cases

Not have to treat your MySQL DB as a magic blackbox

3 of 29

$ whoami

One2N

Founder at One2N

Engineer - Full Stack, Backend, Infrastructure, Big data, Reliability

Grey hair, curious, obsessed about reliability and pragmatism

Backend + Reliability engineering focused company

Helping startups scale in the growth journey from One to N

Focus on the β€œlities” of Software systems (Scalability, Reliability, etc)

4 of 29

Context

Journey so far

Reading lots and lots of documentation

Proof Of Concepts for various tools

Putting out fires in production (broken replication, topology changes etc)

Working at a growing fintech startup in South-East Asia as an SRE

Picking up DB management responsibility (15+ months πŸ‘Ά), not an expert

Akin to building a boat while swimming in the water

5 of 29

Outline

  • Challenges with naive alter table
  • Introduction to gh-ost
  • Use case: migrate a table with 400M+ rows and 100GB+ size
  • Planning and preparation
  • The DM-day
  • Future migrations
  • Learnings

6 of 29

Challenges with naive alter table

Database topology

db1 (primary/master)

db3 (replica/slave)

db2 (replica/slave)

DB Setup

  • MySQL 5.7, self hosted on cloud VMs
  • Percona XtraDB Storage Engine
  • Read-Write traffic separation by ProxySQL (not shown)
  • Process and VM Monitoring enabled

Replication

(non-GTID)

7 of 29

Migrating table naively

On master:

mysql> alter table ratings add column feedback text;

Query OK, 0 rows affected (57.75 sec)

8 of 29

Meanwhile, on the slave nodes

9 of 29

Meanwhile, on the slave nodes

On slave: show slave status shows

altering table

  • Replication lag increasing
  • No other commands processed
  • No DML operations on replicas
  • Non pausable operation on replica

10 of 29

So what just happened?

Alter table on master took a while (~60 sec). Extrapolate for large tables accordingly πŸ“ˆ

Equivalent replication lag on all replicas (imagine a 5 hour lag for large tables) 🐌

No other DML operations on replica 🚫

Non-pausable and irreversible operation on replicas ⏯

It may not be blocking the table on master, but there are more caveats (refer documentation) ⚠️

11 of 29

What about Online and Instant DDL?

It may be enough, but only for some cases

Types of DDL operations

  • Index Operations
  • Primary Key Operations
  • Column Operations
  • Generated Column Operations
  • Foreign Key Operations
  • Table Operations
  • Tablespace Operations
  • Partitioning Operations

12 of 29

13 of 29

14 of 29

15 of 29

16 of 29

😱

17 of 29

It’s complex, you get the idea

We need something more universal

18 of 29

GitHub

Online schema migration for MySQL

Triggerless, works by reading binary logs

Operationally simple with

throttling support

dynamic control/reconfiguration

auditing, and other operational perks

Other popular option is pt-online-schema-change as part of Percona toolkit

19 of 29

How online schema change tools work?

Create table similar to the source table (shadow table)

Migrate shadow table while empty

Copy data from source to shadow table (think snapshot copy)

Rename tables atomically

Optionally delete the older table

20 of 29

What about changed data while we copy?

Binary log

Triggers

Binary log based streaming replication, typically asynchronous

Entire alter table can be throttled as it just simulate inserts on master

Different transaction space since it’s asynchronous

gh-ost operates in this fashion

Synchronous, runs as part of same transaction

Data copy can be throttled, but not the triggers

Execute in same transaction space as other statement

pt-osc operates in this fashion

21 of 29

Our migration use case

DB size 1.3+TB growing ~5GB/day

Table to be migrated 400+M rows, 100+GB

We had external partner on which we were dependent to carry out DB migrations

Not much expertise in the team

Self-managed DB infra

Initial migrations would take weeks and PRs were often blocked

We wanted to add a new column to the largest table

22 of 29

gh-ost iterations and learnings

Simulate a prod like setup on sandbox environment

Master slave setup with ProxySQL

Read write traffic split, same as production

Simulating workload using pt-archiver and sysbench

Learn about MySQL replication, topologies, common replication errors

Reading documentation, trying out POCs, rolling back when things break

23 of 29

gh-ost options

  • Dry run using --test-on-replica
  • Migrate on replica
  • Migrate on master
  • Immediate vs postponing the cut-over (table rename phase)
  • Running gh-ost via screen or tmux (resurrecting a migration in a different process is not yet implemented - https://github.com/github/gh-ost/issues/205)

24 of 29

gh-ost Demo

  • Migrate a 10M rows table using gh-ost
  • Explain gh-ost cli option and throttling, replication lag settings etc.

25 of 29

D-Day Migration

  • Backup DB before the migration
  • Migrate on master than on individual replicas
  • 12+hrs for data copy
  • Sub-second replication lag (--max-lag=1000)
  • Throttle control replicas (auto-throttle observed 4 times in the logs)
  • Throttle when master MySQL threads go beyond 30
  • Row copy applier running for over 6 hours (total migration time: 18hrs)
  • Manual cut-over during low traffic
  • (add a screenshot)

26 of 29

Future migrations and CI CD setup

  • Backward compatible schema changes
  • 5 migrations in the first month
  • About 2 migrations per month
  • Canary deployment on a standalone VM
  • Separate PR for just the schema changes
  • Still running schema changes manually, planning to automate it soon via standard cli params and ansible
  • Other future roadmap (not specifically related to gh-ost)
    • GTID based setup
    • Galera cluster
    • MySQL8 upgradation

27 of 29

Things we can learn from gh-ost design

  • Make it operationally simple and predictable
    • Statically linked single binary (thanks Golang)
  • Provide dynamic reconfiguration without restart
    • Unix sockets + Http throttle endpoints :-)
    • Reconfiguring the parameters without having to restart the process
  • Status reporting
    • gh-ost status logs are stored in _ghc table
  • Topology aware
    • Knows mysql replication
    • Can be used as mysql slave for data copy, backup, etc.

28 of 29

Gh-ost and foreign keys

We worked around the problem by adding these columns in a separate table than migrating existing table

29 of 29

Questions / Feedback / Comments?