GitLab.com Database Incident - 2017/01/31
Note: This incident affected the database (including issues and merge requests) but not the git repo's (repositories and wikis).
YouTube Live stream - Follow us live debating and problem solving!
Timeline (all times UTC) 1
Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC) 2
Problems Encountered 4
External help 5
Hugops (please add kind reactions here, from twitter and elsewhere) 5
Stephen Frost 5
Sam McLeod 5
- ±6 hours of data loss
- 4613 regular projects, 74 forks, and 350 imports are lost (roughly); 5037 projects in total. Since Git repositories are NOT lost, we can recreate all of the projects whose user/group existed before the data loss, but we cannot restore any of these projects’ issues, etc.
- ±4979 (so ±5000) comments lost
- 707 users lost potentially, hard to tell for certain from the Kibana logs
- Webhooks created before Jan 31st 17:20 were restored, those created after this time are lost
Timeline (all times UTC)
- 2017/01/31 16:00/17:00 - 21:00
- YP is working on setting up pgpool and replication in staging, creates an LVM snapshot to get up to date production data to staging, hoping he can re-use this for bootstrapping other replicas. This was done roughly 6 hours before data loss.
- Getting replication to work is proving to be problematic and time consuming (estimated at ±20 hours just for the initial pg_basebackup sync). The LVM snapshot is not usable on the other replicas as far as YP could figure out. Work is interrupted due to this (as YP needs the help of another collegue who’s not working this day), and due to spam/high load on GitLab.com
- 2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack
- Blocked users based on IP address
- Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load). This was communicated with the infrastructure and support team.
- Removed users for spamming (by creating snippets) - Slack
- Database load goes back to normal, some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.
- 2017/01/31 22:00 - Replication lag alert triggered in pagerduty Slack
- Attempts to fix db2, it’s lagging behind by about 4 GB at this point
- db2.cluster refuses to replicate, /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
- db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
- YP adjusts max_wal_senders to 32 on db1, restarts PostgreSQL
- PostgreSQL complains about too many semaphores being open, refusing to start
- YP adjusts max_connections to 2000 from 8000, PostgreSQL starts again (despite 8000 having been used for almost a year)
- db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything
- At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
- 2017/01/31 23:00-ish
- YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
- 2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left - Slack
Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)
- Suggested recovery solutions:
- Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old)
- CW: Problem with web hooks, these are removed as part of the staging sync.
- Restore LVM snapshot (6 hours old)
- Sid: try to undelete files?
- CW: Not possible! `rm -Rvf` Sid: OK
- JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? Also might still have file descriptor if the file was in use by a running process according to http://unix.stackexchange.com/a/101247/213510
- YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently also really good in removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.
- SH: It appears the db1 staging server runs a separate PostgreSQL process under the gitlab_replicator directory that streams production data from db2. Due to replication lag, db2 was killed 2017-01-31 05:53, which caused the gitlab_replicator to stop. The good news is that the data up until that point looks unaltered, so we may be able to recover the WebHook data.
- Action taken:
- 2017/02/01 23:00 - 00:00: The decision is made to restore data from db1.staging.gitlab.com to db1.cluster.gitlab.com (production). While 6 hours old and without webhooks, it’s the only available snapshot. YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.
- 2017/02/01 00:36 - JN: Backup db1.staging.gitlab.com data
- 2017/02/01 00:55 - JN: Mount db1.staging.gitlab.com on db1.cluster.gitlab.com
- Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/
- 2017/02/01 01:05 - JN: nfs-share01 server commandeered as temp storage place in /var/opt/gitlab/db-meltdown
- 2017/02/01 01:18 - JN: Copy of remaining production data, including pg_xlog tar’ed up as ‘20170131-db-meltodwn-backup.tar.gz’
- 2017/02/01 01:58 - JN: Start rsync from stage to production
- 2017/02/01 02:00 - CW: Updated deploy page to explain the situation. Link
- 2017/02/01 03:00 - AR: rsync progress approximately 50% (by # of files)
- 2017/02/01 04:00 - JN: rsync progress approximately 56.4% (by # of files). Data transfer is slowed by two factors: network I/O between us-east and us-east-2 and disk throughput cap on staging server (60 Mb/s).
- 2017/02/01 07:00 - JN: Found a copy of pre-sanitized data in on db1 staging in /var/opt/gitlab_replicator/postgresql. Started db-crutch VM in us-east to backup this data to another host. Unfortunately, this system maxes out at 120 GB RAM and cannot support the production load. This copy will be used to check the database state and export the WebHook data.
- 2017/02/01 08:07 - JN: Data transfer has been slow: total transfer progress by data size is 42%.
- 2017/02/02 16:28 - JN: Data transfer complete
- 2017/02/02 16:45 - Following restore procedures below
- Restore procedure
- [x] - Make a snapshot of the DB1 server - or 2 or 3 - done at 16:36 UTC
- [x] - Upgrade db1.cluster.gitlab.com to PostgreSQL 9.6.1 as it’s still running 9.6.0 while staging uses 9.6.1 (PostgreSQL might not start otherwise) -
- To install 8.16.3-EE.1
- YP move from chef-noop to chef-client (was disabled manually)
- YP running chef-client in the host @ 16:45
- [x] - Start the DB - 16:53 UTC
- Monitor that it starts and becomes available.
- It’s back up
- [x] - Update the Sentry DSN on the DB so errors don’t go to staging.
- [x] - Increment IDs on all tables by 10k to be on the safe side when creating new projects/notes. Done using https://gist.github.com/anonymous/23e3c0d41e2beac018c4099d45ec88f5, which reads from a text file containing all sequences (one per line)
- [x] - Flush Rails/Redis cache
- [x] - Attempt to restore webhooks, if possible
- [x] Start staging using the snapshot prior to webhook removal
- [x] Check if webhooks are present
- [x] Generate SQL dump (data only) of the “web_hooks” table (if there is data)
- [x] Copy SQL dump to the production server
- [x] Import SQL dump into the production database
- [x] - Test if workers can connect via a Rails console
- [x] - Gradually start workers
- [x] - Disable deploy page
- [x] - Tweet from @gitlabstatus
- [x] - Create outage issues describing future plans/actions:
- [ ] - Create new Project records for Git repositories without a Project record, where the namespace path corresponds to an existing user/group.
- PC - I’m getting a list of these repos so we can then check on the DB that they do exists or not - it’s there
- [ ] - Delete repositories for unknown (lost) namespaces
- AR - working on script based on previous point data.
- [x] - Remove the spam users again (so they don’t cause problems again)
- [x] CDN user with 47 000 IPs
- TODO after data restored:
- Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging). Show the full hostname in the bash prompt for all users by default (e.g., “db1.staging.gitlab.com” instead of just “db1”): https://gitlab.com/gitlab-com/infrastructure/issues/1094
- Somehow disallow rm -rf for the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups
- Add alerting for backups: check S3 storage etc. Add a graph with the size of the backups over time, page when it goes down by more than 10%: https://gitlab.com/gitlab-com/infrastructure/issues/1095
- Consider adding a last successful backup time in DB so admins can see this easily (suggested by customer in https://gitlab.zendesk.com/agent/tickets/58274)
- Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of frustration arose because of this suddenly becoming a problem: https://gitlab.com/gitlab-com/infrastructure/issues/1096
- Look into increasing replication thresholds via WAL archiving / PITR - also will be useful after failed upgrades: https://gitlab.com/gitlab-com/infrastructure/issues/1097
- Create troubleshooting guide for problems users might encounter after we go online
- Experiment with moving data from one datacenter to another via AzCopy: Microsoft says this should be faster than rsync:
- This seems to be a Windows only thing, and we don’t have any Windows experts (or anybody remotely familiar enough with it to try this out properly)
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
- SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.
- Our backups to S3 apparently don’t work either: the bucket is empty
- We don’t have solid alerting/paging for when backups fails, we are seeing this in the dev host too now.
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. => we're now restoring a backup from 6 hours ago that worked
Hugops (please add kind reactions here, from twitter and elsewhere)
https://twitter.com/net_snow/status/826622954964393984 @gitlabstatus hey, I'm a PG committer, major contributor, and love what you all do. HMU if I can help in any way, I'd be happy to help.
(twitter DM): Hey Sid, Sorry to hear about your database / LVM issue, bugger of a thing to happen. Hey we run quite a number of PostgreSQL clusters (master/slave) and I noticed a few things in your report.
1. You're using Slony - that thing is a POS, not an understatement even have a laugh at it by following http://howfuckedismydatabase.com , PostgreSQL's inbuilt binary streaming replication however is rock solid and very fast, I suggest switching to that.
2. No mention of a connection pooler and mention of having thousands of connections set in postgresql.conf - this is really bad and very inefficient for performance, I suggest using pg_bouncer as a connection pooler - https://pgbouncer.github.io/ and not setting PostgreSQL's max_connection over 512-1024, realistically if you're using more than 256 active connections - you need to scale out not up.
3. The report mentions how fragile your failover and backup processes are, we wrote a simple script for postgresql failover and documentation to go with it - would you like me to provide you with it? As far as backups - we use pgbarman to perform many incremental backups during the day and fill backups twice daily bother via barman and postgresql's pg_dump command, it's important to have your backup directory on different storage from your postgresql data for both performance and resiliency / portability.
4. You're still on Azure?!?! I'd suggest getting off that crudbucket, so many internal DNS, NTP, routing and storage IO issues with Microsoft's platform it's ridiculous, I've heard some horror stories of how its held together internally too.
Let me know if you'd like any more advice on tuning PostgreSQL, I've had a lot of experience with it.
also - question - how big is your database(s) on disk? like are we talking TB here or still in the GB?
7h 7 hours ago
open sourced my failover / replication sync script:
7h 7 hours ago
Also - I see you're looking at pgpool - I would not suggest that, look at pgbouncer instead
Pgpool has lots of problems, we tested it thoroughly and then binned it
5h 5 hours ago
Also, let me know if there's anything I can say publicly on twitter or whatever to support GitLab and your transparency through this event, I know how shit these things are, we had SAN level split brain at infoxchange when I first started and I was so nervous I was vomiting!
4h 4 hours ago
Hi Sam, thanks for all the help. Mind if I paste it in a public document to share with the rest of the team?
3m 3 minutes ago
The failover script?
3m 2 minutes ago
Everything you messaged.
2m 1 minute ago
Sure, it's a public repo anyway, but yeah I'm not saying it's perfect - far from but it does work really reliable, I fail between hosts all the time without issue, but YMMV etc etc
Yeah absolute re: other recommendations too
If you can send me information about your VM that has PostgreSQL on it and your PostgreSQL.conf file I can make comments on any changes / concerns and explain each one
- Sid - Slony was only used for upgrading from 9.2 to 9.6, we use streaming replication for our regular replication needs.
Comment - Ok that's good to know, FYI, within major versions of PostgreSQL you can just use the built in replication to perform upgrades.
- Rails already pools/reuses connections, with 25 connections per process. With 20-ish processes for 20 hosts this produces 10 000 max connections, though there will only be 400-ish active concurrently (as Unicorn is single threaded)
Comment - Each PostgreSQL connection uses memory, it's inefficient to have a lot of connections open, this is where pg_bouncer comes into play, it's a fantastically simple, fast connection pooler, it does one thing and it does it well, whereas pgpool really complicates things, it can rewrite queries and prevent certain queries from working as expected, it's really designed for people not using an ORM / db framework and have complete control over the SQL gets run.
Worth reading: https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
- For load balancing, pooling, better failover, etc, we’re looking into pgpool + streaming replication, with synchronous commits (for data consistency). Pgbouncer does not do load balancing (at least out of the box) as far as we’re aware, only connection pooling. https://github.com/awslabs/pgbouncer-rr-patch might also be an option.
Question - Are you running multiple, active / active PostgreSQL nodes at present, if not, how are you performing load balancing?
Question - How many RPM does the site pull on an average day? aka how many page loads and requests per second?
# START Update from Sam McLeod 02/02/2017:
Note to GitLab: I (Sam) am more than happy to assist by gathering this information and the answers to these questions myself if you are able to provide me remote access to a host or VM snapshot that is the same as the production database host, but without any client or private day in it, I.e. Just a fresh image and schema built to your standard, it'd be very quick for me to answer and document all these questions by doing so and it'd only leave me requesting information regarding your specific hosting environment that I couldn't get / rely on from a clone (I.e, ‘cat /proc/cpuinfo’).
Otherwise… where I've requested output from specific commands or files (check files for confidential data first obviously)
Hi Sam, Sid here, it says 'waiting for information from GL Comments' but Yorick looked through the document and didn't see what he should respond to.
Ah yeah the whole block is a work in progress at the moment, the one thing that could be responded to is the output from the commands I've started to write below, for example ‘cat /etc/fstab’ etc…. which will give me an idea of how you're mounting filesystems, what fs, what options etc… oh and if you're doing sain things like keeping logs and backups in separate directories etc…
Otherwise happy just for you guys / gals to wait for me to finish my list of questions tomorrow, then reply when you can and I'll look over it and make comments / recommendations from what I've worked with / seen fail etc...
Really, I just want to do whatever might help you out, if it's actually a burden to have a resource look through and respond to questions about the config or run some (discovery) commands - I totally get that and we can look at it later. But if as part of recovering from the ‘event’ you're reconfiguring things - it might make sense to at least review the PostgreSQL.conf options, filesystem mounting options and kernel tuning.
Initial observations prior to information being further provided
- Reduce max_connections and use pg_bouncer as a connection pooler
- Investigation into VM tuning and performance configuration required
- Review of postgresql.conf required
- Suggest using binary / streaming replication for the database
- Backups should be restored and decompressed then tested for consistency regularly, an alert should create a ticket / whatever for someone to action upon any failures occurring
I created this issue https://gitlab.com/gitlab-com/infrastructure/issues/1122 for discussion this in our issue tracker
Further information required from GitLab engineers
Information required from GitLab for tuning recommendations, before we tune anything we need to form a baseline both to share knowledge and to make any future changes measurable.
Question - What are the general specs of the VMs you’re running PostgreSQL on?
Information Request - Please run the following to gather information about the VMs resources and configuration:
cat /var/lib/pgsql/9.6/data/postgresql.conf # CentOS/RHEL
cat /etc/postgresql/9.6/main/postgresql.conf # Debian
cat /etc/redhat-release # CentOS/RHEL
cat /etc/debian_version # Debian
systemctl zram status
fdisk -l /dev/<fill and repeat for each of your disks i.e. sda>
blockdev --get-ra /dev/<disk> # For each of your mounted disks
smartctl -a /dev/<whatever your first disk is, i.e. sda
rpm -qa | geel -iE ‘postgres|kernel|psql|e2fs|ext|xfs|btrfs’ # CentOS/RHEL
dpkg --list | grep -iE ‘postgres|linux-image|psql|e2fs|ext|xfs|btrfs’ # Debian
Information Request - Benchmark storage performance (Warning, this should be run on a copy of the production VM as the settings / backing performance should represent production, but you don’t want the application running while the test are in progress.
yum install fio
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randread
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randwrite
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=2048k --iodepth=64 --size=10G --readwrite=read
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=2048k --iodepth=64 --size=10G --readwrite=write
# END Update from Sam McLeod 03/02/2017