Why Version Control?

Keep track of working versions
Prevent other people from overwriting your changes
Create and test features without breaking production

Examples: Mercurial, Subversion, Git

Git is the most popular, so we’ll focus only on it

Theory of Git

Track changes in sets called commits
Each commit is chained to the previous commit, and together they form what we call a “tree” (strictly a directed acyclic graph, since a merge commit has two parents)
Each commit is part of an immutable tree that tracks a complete history of changes
Can view the state of the project at any point in history by selecting any commit in the tree
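
A minimal sketch of the commit chain, assuming a throwaway repository in a temporary directory. The file name notes.txt and the demo identity are made up for illustration.

```shell
set -e
# Throwaway repository so the sketch never touches real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, needed to commit
git config user.email "demo@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "first commit"
first=$(git rev-parse HEAD)               # ID of the first commit

echo "second draft, revised" > notes.txt
git commit -q -am "second commit"

# Each commit records its parent; that link is what chains history together.
parent=$(git rev-parse HEAD^)

# Show the chain, newest commit first.
git log --oneline --graph

# Read the file as it was at the first commit, without touching the working copy.
old=$(git show "$first:notes.txt")
echo "$old"
```

Here git show reads a past state directly; git checkout <commit> would instead switch the whole working copy to that point in history.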

Getting Started in Git

We have a folder, project/, that we want to use Git in
cd project
git init

This makes project/ a Git “repository”

git clone [repo url] [destination folder]

Download a Git repository from elsewhere (e.g. GitHub)
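
A small sketch of both commands, assuming a temporary directory. A local path stands in for the repo URL so the example runs offline; the folder name project-copy is made up.

```shell
set -e
work=$(mktemp -d)
mkdir "$work/project"
cd "$work/project"

# Turn the folder into a repository; history lives in the hidden .git directory.
git init -q
ls -d .git

# git clone accepts a local path in place of a URL.
# Cloning an empty repository only prints a warning, silenced here.
cd "$work"
git clone -q project project-copy 2>/dev/null
ls -d project-copy/.git
```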

Adding Files and Committing

By default, Git only takes care of files you tell it to: “tracked files”
Imagine project/file1, project/file2
Make some changes, then:
git add file1 file2

Git now manages the history of these two files

Commits

A set of changes to tracked files
The basic unit of history in Git
Think of it as saving state

git commit -m "made some changes"
Commit early, commit often!
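
A sketch of tracked versus untracked files, assuming a throwaway repository. The extra file scratch.tmp is a made-up example of something Git was never told about.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, needed to commit
git config user.email "demo@example.com"

echo "one" > file1
echo "two" > file2
echo "scratch" > scratch.tmp              # never added, so Git ignores its history

git add file1 file2
git commit -q -m "made some changes"

tracked=$(git ls-files)                   # files Git is managing
untracked=$(git ls-files --others)        # files Git has not been told about

echo "tracked:"; echo "$tracked"
echo "untracked:"; echo "$untracked"
```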

Local Git Workflow

To prepare a file for commit, you must stage it first
git stage my_file

(git add my_file works too)

This tells git that you want to include changes to this file in the commit
Now commit with

git commit -m "changed my_file"

Modify -> Stage -> Commit
Question: are there any issues that could arise from this model?
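
One possible answer, sketched under the assumption of a throwaway repository (it may not be the only issue the question has in mind): staging snapshots the file at that moment, so edits made after staging are left out of the commit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, needed to commit
git config user.email "demo@example.com"

echo "draft" > my_file
git add my_file
git commit -q -m "add my_file"

echo "second draft" > my_file
git stage my_file                         # snapshot taken here
echo "third draft, edited after staging" > my_file

git commit -q -m "changed my_file"

committed=$(git show HEAD:my_file)        # what the commit actually contains
pending=$(git status --short)             # non-empty: the later edit is still uncommitted

echo "committed: $committed"
echo "working copy: $(cat my_file)"
```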

Branching, Merging, Rebase

Branching: literally make a branch from the ‘trunk’ of the tree

A series of new commits that shares a history with the mainline

git checkout -b <name of branch> (create a new branch and switch to it)

Merging: combine a branch back into the mainline/trunk:

Creates a merge commit
On the trunk branch, do git merge <name of branch>

Rebase: take all commits from a branch and apply them on top of the head

No merge commit
As if you just made all the commits on the main branch anyways
Often considered “cleaner” in terms of viewing git history
On the feature branch, do git rebase <trunk branch>, then fast-forward the trunk with git merge <feature branch>
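
A sketch comparing the two, assuming a throwaway repository. The branch names feature, merged and rebased are made up, and the trunk name is read from Git rather than assumed, since it may be master or main.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, needed to commit
git config user.email "demo@example.com"
trunk=$(git symbolic-ref --short HEAD)    # whatever the default branch is called

echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# A feature branch with one commit of its own.
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"

# Meanwhile the trunk moves on, so the two histories diverge.
git checkout -q "$trunk"
echo "trunk" > trunk.txt
git add trunk.txt
git commit -q -m "trunk work"

# Option 1: merge. Done on a copy of the trunk so both results can be compared.
git checkout -q -b merged "$trunk"
git merge -q --no-edit feature
merge_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))

# Option 2: rebase. Done on a copy of the feature branch.
git checkout -q -b rebased feature
git rebase -q "$trunk"
rebase_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))

echo "merge tip has $merge_parents parents, rebase tip has $rebase_parents"
git log --oneline --graph --all
```

The merge tip has two parents (a merge commit); the rebased tip has one and sits directly on top of the trunk, which is why the history reads as a straight line.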

Pulling and Pushing

The remote/offsite copy of the repository is called a “remote”

GitHub, Bitbucket
Someone else’s computer

View remotes by doing git remote -v
You can push or pull branches from remotes, and you can fetch changes (download them, but don’t integrate them into your tree)
Question: How can I avoid having to copy the git history over every time I synchronize with a remote server?
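
A sketch of push, fetch and pull, assuming everything lives in one temporary directory. A bare repository named hub.git stands in for GitHub or Bitbucket, and alice and bob are made-up clones. Fetch only downloads the commits the local copy is missing, so the full history is not copied again on each sync.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the remote.
git init -q --bare hub.git

# First clone: commits and pushes. The empty-clone warning is silenced.
git clone -q hub.git alice 2>/dev/null
cd "$work/alice"
git config user.name "Demo User"          # placeholder identity, needed to commit
git config user.email "demo@example.com"
branch=$(git symbolic-ref --short HEAD)
echo "hello" > readme.txt
git add readme.txt
git commit -q -m "first commit"
git push -q origin "$branch"

# Second clone: starts with the first commit.
cd "$work"
git clone -q hub.git bob

# The first clone pushes more work.
cd "$work/alice"
echo "more" > extra.txt
git add extra.txt
git commit -q -m "second commit"
git push -q origin "$branch"

cd "$work/bob"
git remote -v

# Fetch downloads the new commit but leaves the working copy alone.
git fetch -q origin
after_fetch=$( [ -e extra.txt ] && echo present || echo absent )

# Pull also integrates it into the current branch.
git pull -q --ff-only origin "$branch"
after_pull=$( [ -e extra.txt ] && echo present || echo absent )

echo "after fetch: $after_fetch, after pull: $after_pull"
```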

Summary: The Git Workflow

In a git repository:

git checkout master
git checkout -b feature-branch
Edit some files
git stage files-ive-edited other-files-ive-edited
git commit -m "message about the edits I’ve made"
git push origin feature-branch
git checkout master
git merge feature-branch

Backups

Just do it
Murphy’s law

Accidental or malicious deletion
Device failure
Software failure
And a Berkeley special, theft.

Automated backups because you will forget
Don’t leak information! Backups must be secure
Make sure your backups actually work!

Routinely test backup procedure and recovering from backups
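
A minimal sketch of a restore test, assuming a tar archive as the backup format. The paths and the file report.txt are made up, and a real setup would run something like this on a schedule and alert on failure.

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/data" "$work/restore"
echo "important numbers" > "$work/data/report.txt"

# Take the backup.
tar -czf "$work/backup.tar.gz" -C "$work" data

# Restore into a separate location, never over the original.
tar -xzf "$work/backup.tar.gz" -C "$work/restore"

# Compare the restored copy with the source, file by file.
if diff -r "$work/data" "$work/restore/data" >/dev/null; then
  status=ok
else
  status=FAILED
fi
echo "restore test: $status"
```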

The 3-2-1 Rule

3. Have at least 3 copies of your data

2. Store your data on at least 2 different media: >1 hard drive, >1 backup server/computer

1. Have at least 1 copy of your data off-site, e.g. on Amazon S3, “the cloud,” under a mattress

What happens if you don’t follow what I said

GitLab 1/31/17 Database Outage
Engineer accidentally ran rm -rf on the data directory of their production PostgreSQL database

Noticed after 1 second and CTRL-C’d, but had already lost 300GB of production data

This shouldn’t have been too bad - 300GB isn’t THAT much data nowadays, and they can just recover from a backup, right?

Backup 1: Amazon S3

GitLab had an automated process to upload a backup to Amazon S3 (Amazon file storage) every 24 hours
GitLab engineers inspected their S3 bucket, hoping to find a backup
Turns out their backups had been failing for weeks due to a version mismatch, and their notification system was broken too

Backup 2: Azure Disk Snapshots

GitLab runs on Azure (Microsoft Cloud Hosting Provider)
Azure offers the option to generate snapshots of an entire disk periodically
GitLab had enabled this to run every 24 hours…
….except on the database servers, because they thought they had enough backups

Why GitLab still exists today: Hail Mary LVM Snapshots

LVM: Logical Volume Manager
Not meant to be a backup, but luckily they had these
Every 24 hours, copy data from prod to staging environment
Luckily, an engineer had run this ~6 hours before the incident
Unfortunately, it took GitLab 18 hours to recover, since staging was not meant for data recovery

Different region and slow disks

Impact

“It's hard to estimate how much data has been lost exactly, but we estimate we have lost at least 5000 projects, 5000 comments, and roughly 700 users.”
On Hacker News, it became both a feel-good story (transparency, not firing the engineer involved) and a WTF story about their backups
Good lesson on the importance of keeping backups, making sure they work, and practicing recovering from them

Tools for Backups

rsync

Simple command-line util for local <-> remote transfer
Skips copying files that are already the same at the destination, so good for backups
Uses SSH for transferring to remote hosts
rsync -av -P [source] user@host:[destination]

Rclone: “rsync for cloud storage”

Supports every major cloud storage provider
rclone sync source:path dest:path
Can mount cloud storage as local FS

More Tools

Rsnapshot

Uses rsync, but makes incremental backups that each look like a full backup
Good for storing multiple backups (-1d, -3d, -1w, etc.) w/o using too much disk space
Other incremental-backup tools: Borg, Duplicity
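
A sketch of the "incremental but looks full" idea using plain rsync with --link-dest. This illustrates the hard-link technique rather than how Rsnapshot is actually configured. The snapshot names snap.0 and snap.1 and the file names are made up, and the block does nothing if rsync is not installed.

```shell
set -e
have_rsync=no
if command -v rsync >/dev/null 2>&1; then
  have_rsync=yes
  work=$(mktemp -d)
  mkdir "$work/data"
  echo "unchanged content" > "$work/data/stable.txt"
  echo "day one" > "$work/data/diary.txt"

  # First snapshot: an ordinary full copy.
  rsync -a "$work/data/" "$work/snap.0/"

  # One file changes before the next snapshot.
  echo "day one, then day two" > "$work/data/diary.txt"

  # Second snapshot: files identical to snap.0 become hard links, not new copies.
  rsync -a --link-dest="$work/snap.0" "$work/data/" "$work/snap.1/"

  # Same inode number means both snapshots share one copy on disk.
  inode_old=$(ls -i "$work/snap.0/stable.txt" | awk '{print $1}')
  inode_new=$(ls -i "$work/snap.1/stable.txt" | awk '{print $1}')
  echo "stable.txt inodes: $inode_old and $inode_new"

  # Each snapshot still shows the full tree as of its own day.
  cat "$work/snap.0/diary.txt"
  cat "$work/snap.1/diary.txt"
fi
```

Every snapshot directory can be browsed or restored on its own, but only changed files take up new disk space.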

Some people think it’s a good idea to use git for backups … don’t.

Git is only meant for small files, text files really.
Using git for large files will take up lots of unnecessary disk space.
