1 of 28

Version Control and Backups

Max Vogel

Linux SysAdmin DeCal Spring 2022

Slides initially adapted from Liam Porr

Additional contents adapted from Hilfinger, Hug, & Git Pro Book

2 of 28

Why Version Control (VC)?

  1. We want to keep track of changes to code over time
  2. Collaborate with others without having to worry (too much) about conflicting changes
  3. Create and test features without breaking production
  4. Examples of Version Control Systems (VCS):
    • Git, Mercurial, Subversion, Perforce, Bazaar
  5. Git is the modern-day standard, so we’ll focus on it!
    • Git is not GitHub!

3 of 28

About Git

  • FOSS created by Linus Torvalds in 2005 for development of the Linux kernel
    • The developer of their previous, proprietary VCS (BitKeeper, now defunct) withdrew the free version.
    • Torvalds has quipped about the name Git, which is British English slang meaning “unpleasant person”. Torvalds said: “I’m an egotistical bastard, and I name all my projects after myself. First ‘Linux’, now ‘git’.” (wiki)
    • First implementation took ~2–3 months to create
  • Initially a collection of basic primitives (now called “plumbing”) that could be scripted together to provide the desired functionality
  • Over time, higher-level commands (“porcelain”) were built on top of these to provide a convenient user interface.

4 of 28

What makes Git Special?

  • Conceptually, Git stores snapshots (versions) of the files and directory structure of a project, keeping track of their relationships, authors, dates, and log messages.
  • The man page describes Git as “the stupid content tracker”
  • Command line is king!
    • Learning Git top-down can lead to a lot of confusion
      • GUIs implement only a subset of Git functionality
    • If you understand command line, you can probably understand the GUI — but the opposite isn’t necessarily true
    • You look cooler
  • Git Generally Only Adds Data
    • Hard to do anything that isn’t undoable
    • Possible, but often not intuitive, to revert changes
  • Git Has Integrity
    • Hashes objects with SHA-1
    • Practically impossible to change anything without Git knowing
      • Detects information lost in transit
      • Detects file corruption
  • Git is Very Fast
    • Nearly every operation is local
      • No network latency overhead
    • Browsing history of the project involves Git reading it directly from your local database
    • Works offline
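Git’s integrity guarantee is easy to see from the plumbing: hashing the same bytes always yields the same object ID. A quick sketch (assumes git is installed; the hash shown is the well-known SHA-1 for this blob, as in Pro Git):

```shell
# Hash content the way Git stores it: SHA-1 over "blob <size>\0<content>"
echo 'version 1' | git hash-object --stdin
# -> 83baae61804e65cc73a7201a7252750c76066a30
```

Because every object is addressed by its hash, a flipped bit on disk or in transit produces a different ID, and Git notices.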

5 of 28

  • Git represents project history as a directed acyclic graph of commit nodes
    • Nodes point (one-way) to the state they’re based on and there are no cycles
    • Commits correspond to project state’s tree (snapshots) which are made up of:
      • Files: “Blobs” of bits
      • Folders: “Trees” containing blobs and/or other trees
    • Branches are pointers to the head of a line of work. Default name is master
    • Head is a pointer to the local branch you’re currently on
  • Distributed: There can be many copies of a given repository, each supporting independent development, with machinery to transmit and reconcile versions between repositories.

6 of 28

Getting Started: Creating a Repository

  • We have an existing project in directory proj/
  • cd proj/
  • git init
    • This makes proj/ a Git “repository”
    • Creates a new subdirectory .git
  • We want to clone an existing repository
  • git clone <repo URL> [destination]
    • Creates a directory destination
    • Initializes a .git directory inside it
    • Pulls down data for that repository from <repo URL>
  • git clone https://github.com/0xcf/decal-labs

7 of 28

File States

  • Modified: File is changed but isn’t committed to your database (repository) yet.
  • Staged: Modified file is marked in its current version to be included in the next commit snapshot.
  • git status
  • Tracked — files git knows about
    • Files that were in the last snapshot + newly staged (added) files
    • Tracked files are either unmodified, modified, or staged
  • Untracked — everything else
    • .gitignore specifies intentionally untracked files to ignore
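The states above can be watched with git status --short in a scratch repository (a sketch; the file name and identity are made up):

```shell
set -e
cd "$(mktemp -d)" && git init -q
git config user.email you@example.com && git config user.name You  # local identity for commits
echo 'hello' > a.txt        # untracked
git status --short          # "?? a.txt"
git add a.txt               # tracked + staged
git status --short          # "A  a.txt"
git commit -q -m 'add a.txt'
echo 'more' >> a.txt        # tracked + modified
git status --short          # " M a.txt"
```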

8 of 28

Sections & Local Git Workflow

  • Staging Modified Tracked Files or New Untracked Files
  • git add <files>
    • Tells Git to track and add the file(s) in their current state to the next commit
  • Commit: Takes the files in staging area and stores them permanently as a snapshot in the local repository
    • Think of this as ‘saving’
    • git commit -m "foo"
    • Commit early & often!
  1. Working tree: single (local) checkout of the project that (from Git’s POV) contains all tracked files
  2. Staging area: New or modified files to be included in your next commit.
  3. Git directory: Where Git stores the metadata and object (hash) database
    • This is the most important part of Git; it’s what gets copied when you clone a repository from another computer.
    • Do not mess around with the .git folder!
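A sketch of how one change moves through the three areas (scratch repo, hypothetical file name):

```shell
set -e
cd "$(mktemp -d)" && git init -q
git config user.email you@example.com && git config user.name You
echo 'one' > notes.txt && git add notes.txt && git commit -q -m 'first'
echo 'two' >> notes.txt         # 1. change exists only in the working tree
git diff --name-only            # working tree vs staging area: notes.txt
git add notes.txt               # 2. change copied into the staging area
git diff --name-only            # now empty
git diff --staged --name-only   # staging area vs last commit: notes.txt
git commit -q -m 'second'       # 3. snapshot stored in the Git directory
```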

9 of 28

Walkthrough

  • git init test
  • cd test
  • echo 'version 1' > test.txt
  • git add test.txt
  • git commit -m "first commit"
  • echo 'version 2' > test.txt
  • echo 'new file' > new.txt
  • git add test.txt new.txt
  • git commit -m "second commit"
  • git read-tree --prefix=bak [first commit hash]
    • Stages a tree/directory “bak” in the index containing the snapshot of the first commit
  • git commit -m “third commit”

10 of 28

Branching

  • Idea is to create a new ‘branch’ with the current branch as the ‘trunk’
    • That is, this creates a new pointer at the same location of HEAD
    • Every time you commit, the pointer of the active branch moves forward automatically
  • Git’s ‘killer feature’ that sets it apart from other VCS
  • git branch testing
    • Creates a new pointer called testing at the current commit (HEAD)
  • git add [...]
  • git commit -m "..."
  • git checkout master
  • git add [...]
  • git commit -m "..."
  • Uh oh! We now have divergent history
  • git checkout testing
    • Moves HEAD to point to the testing branch
    • git checkout -b testing creates the branch and checks it out in one step
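The divergence above can be reproduced in a scratch repo (a sketch; -b sets the initial branch name and needs Git ≥ 2.28):

```shell
set -e
cd "$(mktemp -d)" && git init -q -b master
git config user.email you@example.com && git config user.name You
echo base > f.txt && git add f.txt && git commit -q -m 'base'
git branch testing               # new pointer at the current commit (HEAD)
git checkout -q testing
echo t > t.txt && git add t.txt && git commit -q -m 'commit on testing'
git checkout -q master
echo m > m.txt && git add m.txt && git commit -q -m 'commit on master'
git log --oneline --graph --all  # two lines of work sharing one base
```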

11 of 28

Merging

  • Idea is to combine a branch back into the mainline/trunk with a merge commit
  • For example, you have branch iss53 (issue #53) and are ready to merge it into the main codebase (which has changed since you started iss53):
    • git checkout master
      • Change HEAD to point to master
    • git merge iss53
      • Creates a merge commit (C6) joining HEAD (master) and iss53
    • git branch -d iss53
      • Deletes branch iss53
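The same flow in a scratch repo (iss53 is the slide’s hypothetical branch name):

```shell
set -e
cd "$(mktemp -d)" && git init -q -b master
git config user.email you@example.com && git config user.name You
echo base > f.txt && git add f.txt && git commit -q -m 'base'
git checkout -q -b iss53
echo fix > fix.txt && git add fix.txt && git commit -q -m 'work on issue 53'
git checkout -q master                  # HEAD points to master again
echo more >> f.txt && git commit -q -am 'master moved on'
git merge -q -m 'merge iss53' iss53     # merge commit with two parents
git branch -d iss53                     # safe to delete once merged
```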

12 of 28

Conflicts

  • If the same part of the same file is different in the two branches you’re merging, Git won’t be able to merge them cleanly and it will throw a merge conflict
    • Anything with conflicts will go unmerged
    • Git adds conflict-resolution markers to the files that have conflicts
  • See which files conflict with git status
  • git add on each file to mark it as resolved
  • git commit to finalize the merge commit

<<<<<<< HEAD

[Lines of code from HEAD (i.e. master)]

=======

[Lines of code from the merge/rebase source branch (e.g. testing, iss53, dice)]

>>>>>>> [Name of the branch or commit being merged in]
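A sketch that forces a conflict and resolves it (scratch repo; both branches edit the same line of f.txt):

```shell
set -e
cd "$(mktemp -d)" && git init -q -b master
git config user.email you@example.com && git config user.name You
echo 'line A' > f.txt && git add f.txt && git commit -q -m 'base'
git checkout -q -b testing
echo 'testing version' > f.txt && git commit -q -am 'edit on testing'
git checkout -q master
echo 'master version' > f.txt && git commit -q -am 'edit on master'
git merge testing || true     # CONFLICT; f.txt now holds the markers
grep '<<<<<<<' f.txt          # <<<<<<< HEAD ... >>>>>>> testing
echo 'resolved' > f.txt       # write the final content by hand
git add f.txt                 # mark as resolved
git commit -q -m 'merge testing'   # finalize the merge commit
```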

13 of 28

Rebase

  • Idea is to take all commits from a branch and apply them on top of HEAD
    • No merge commit
    • As if you just made all the commits on the main branch anyways (linear history)
  • git checkout experiment
  • git rebase master
    • Goes to the common ancestor of the branches (C2), saves the difference introduced by each commit on HEAD (experiment) to temporary files, resets HEAD to the same node as master, and finally applies each change on top of master.
  • git checkout master
  • git merge experiment
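The same steps in a scratch repo (a sketch; C2/C3/C4 are the slide’s node labels):

```shell
set -e
cd "$(mktemp -d)" && git init -q -b master
git config user.email you@example.com && git config user.name You
echo base > f.txt && git add f.txt && git commit -q -m 'C2 common ancestor'
git checkout -q -b experiment
echo e > e.txt && git add e.txt && git commit -q -m 'C3 on experiment'
git checkout -q master
echo m > m.txt && git add m.txt && git commit -q -m 'C4 on master'
git checkout -q experiment
git rebase -q master          # C3 replayed on top of C4 (as a new copy of C3)
git checkout -q master
git merge -q experiment       # fast-forward; no merge commit
git log --oneline             # straight, linear history
```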

14 of 28

Remotes

  • The remote/offsite copy of the repository
    • origin is the default name for a remote when you run git clone
    • Remote branch names take the form <remote>/<branch>
  • git remote -v
    • View all remotes; their names and the URLs they map to (e.g. https://github.com/[...].git)
  • git remote show <remote>
    • Show information about remote
  • git fetch <remote>
    • Fetches any new data from remote and updates your local database (moving the origin/master pointer)
    • Will not modify your working directory at all

15 of 28

Pushing & Pulling

  • git pull [remote/branch]
    • Shorthand for git fetch <remote> && git merge <remote>/<branch>
    • Looks up what server and branch your current branch is tracking, fetches from that server, then attempts to merge
      • When you clone a repository, it automatically creates a master branch that tracks origin/master
      • When you check out a remote repository’s branch, it also automatically tracks it
    • The upstream branch is the remote branch that your tracking branch, well, tracks
      • If you initialize a new repository you won’t have an upstream branch
      • git remote add origin [URL]
      • git branch -u origin/<branch>
  • Do not rebase commits that exist outside your repository and that people may have based work on.
    • Rebase involves abandoning existing commits and creating copies
    • If you push commits somewhere, someone else pulls them and bases work on them, and then you rewrite those commits (with git rebase) and push them up again, your collaborators will have to re-merge their work, and things will get messy when you pull their work back into yours.
  • git push <remote> <branch>
    • Pushes your local commits to remote/branch
    • Local branches aren’t automatically synchronized; thus, you can have private branches
    • Good practice to pull before you push
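A local sketch of the push workflow: a bare repository stands in for the server, so no network is needed (-b master needs Git ≥ 2.28):

```shell
set -e
cd "$(mktemp -d)"
git init -q --bare -b master server.git   # the "remote": a repo with no working tree
git clone -q server.git work              # clone sets up the origin remote automatically
cd work
git config user.email you@example.com && git config user.name You
echo hello > f.txt && git add f.txt && git commit -q -m 'first'
git push -q origin master                 # publish local commits to the remote
git remote -v                             # origin  .../server.git (fetch/push)
```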

16 of 28

Summary: The Git Workflow

  • git checkout master
  • git checkout -b feature
    • git branch feature && git checkout feature
  • [Modify files]
  • git stage file-changed-1 [...] file-changed-n
  • git commit -m "lorem"
  • git push origin feature
  • git checkout master
  • git pull origin master
  • git merge feature
  • git push origin master

17 of 28

18 of 28

Backups

  • Just do it
  • Murphy’s law
    • Accidental or malicious deletion
    • Device failure
    • Software failure
    • And a Berkeley special, theft.
  • Automated backups because you will forget
  • Don’t leak information! Backups must be secure
  • Make sure your backups actually work!
    • Routinely test backup procedure and recovering from backups
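A minimal sketch of an automatable backup script. All paths here are hypothetical stand-ins (created on the fly so the sketch is self-contained); point SRC and DEST at your real data and backup disk:

```shell
set -e
SRC="$(mktemp -d)/important"               # hypothetical: your data directory
mkdir -p "$SRC" && echo 'precious' > "$SRC/data.txt"
DEST="$(mktemp -d)"                        # hypothetical: second disk / backup host
# Dated, compressed archive of the data directory
tar -czf "$DEST/backup-$(date +%F).tar.gz" -C "$(dirname "$SRC")" important
# A backup you never test is not a backup: at minimum, list the archive
tar -tzf "$DEST"/backup-*.tar.gz
```

A crontab entry like `0 3 * * * /usr/local/bin/backup.sh` (a hypothetical path) would run it nightly, because you will forget.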

19 of 28

The 3-2-1 Rule

3. Have at least 3 copies of your data

2. Store your data on at least 2 different media
  • E.g. 1 hard drive, 1 backup server/computer

1. Have at least 1 copy of your data off-site
  • E.g. on Amazon S3, “the cloud,” under a mattress

20 of 28

What happens if you don’t follow what I said

  • GitLab 1/31/17 Database Outage
  • Engineer accidentally runs rm -rf on their production PostgreSQL database
    • Noticed after 1 second and Ctrl-C’d, but 300GB of production data was already gone
  • This shouldn’t have been too bad - 300GB isn’t THAT much data nowadays, and they can just recover from a backup, right?

21 of 28

Backup 1: Amazon S3

  • GitLab had an automated process to upload a backup to Amazon S3 (Amazon file storage) every 24 hours
  • GitLab engineers inspected their S3 bucket, hoping to find a backup
  • Turns out their backups had been failing for weeks due to a version mismatch, and their notification system was broken too

22 of 28

Backup 2: Azure Disk Snapshots

  • GitLab runs on Azure (Microsoft Cloud Hosting Provider)
  • Azure offers the option to generate snapshots of an entire disk periodically
  • GitLab had enabled this to run every 24 hours…
  • ….except on the database servers, because they thought they had enough backups

23 of 28

Why GitLab still exists today: Hail Mary LVM Snapshots

  • LVM: Logical Volume Manager
  • Not meant to be a backup, but luckily they had these
  • Every 24 hours, copy data from prod to staging environment
  • Luckily, an engineer had run this ~6 hours before the incident
  • Unfortunately, it took GitLab 18 hours to recover, since staging was not meant for data recovery
    • Different region and slow disks

24 of 28

Impact

  • “It's hard to estimate how much data has been lost exactly, but we estimate we have lost at least 5000 projects, 5000 comments, and roughly 700 users.”
  • Became both a feel-good story on HackerNews about transparency (and not firing the engineer involved) and a WTF story about their backups
  • Good lesson on the importance of keeping backups, making sure they work, and practicing recovering from them

25 of 28

Tools for Backups

  • rsync
    • Simple command-line util for local <-> remote transfer
    • Skips copying files that are the same @ destination, so good for backups
    • Uses SSH for transferring to remote hosts
    • rsync -av -P [source] user@host:[destination]
  • Rclone: “rsync for cloud storage”
    • Supports every major cloud storage provider
    • rclone sync source:path dest:path
    • Can mount cloud storage as local FS

26 of 28

More Tools

  • Rsnapshot
    • Uses rsync to make incremental backups that look (and restore) like full ones
    • Good for storing multiple backups (-1d, -3d, -1w, etc.) w/o using too much disk space
    • Other incremental-backup tools: Borg, Duplicity
  • Some people think it’s a good idea to use git for backups … just don’t.
    • Git is only meant for small files (really, text files).
    • Using git for large files will take up lots of unnecessary disk space.

27 of 28

Conclusion

  • Back up your shit

28 of 28

More

  • Pro Git — git-scm.com
    • The bible
    • Chapters 1–5 form a good foundation
    • Source for all the tree diagrams
  • Oh Shit, Git!?! — ohshitgit.com
    • “Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible”
    • Short guide on how to recover from some common Git mistakes
  • Shell/Editor integration
    • Vim-fugitive
  • Feedback for me