Reproducible deep learning
PhD Course in Data Science
Lecturer: Simone Scardapane
code_final_v4_r3_simone.py
Lecture 2: Code versioning
set_seed(6531463)
learning_rate=0.00413
Important!
Where’s my code?
Code version control
Code versioning
Version control (VC) is a fundamental tool in software engineering. VC tracks every change made to the code, allowing the user to review the entire history of changes and, if needed, revert to an older version.
Among VC tools, Git, originally created by Linus Torvalds, is the de facto standard among programmers, although it is not the only one (e.g., Mercurial).
Most of you have probably already used it, but have you truly understood it?
There are two “flavours” of VC: centralized VC and distributed VC (DVC).
A centralized VC tool (e.g., Subversion) has a single server which is responsible for storing the history of the project. A user can “pull” the latest code from the server, and “push” modifications back to it.
Git is an instance of DVC: users who “clone” a project from a “repository” have access to the full history, and they become equal peers. The history of the project becomes a graph of commits, which can easily branch, merge, ...
Most operations on Git are local, except for push/pull operations.
Git stores a history of snapshots of the project (commits), while a centralized VC can simply store deltas. Unmodified files in a snapshot link back to the previous snapshot.
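The snapshot idea can be illustrated with a toy content-addressed store. This is only a sketch, not Git's actual storage format: the class SnapshotStore and the file names are hypothetical, and real Git also compresses and packs its objects.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store: each distinct file content is kept once."""

    def __init__(self):
        self.objects = {}    # content hash -> file content
        self.snapshots = []  # each snapshot maps file name -> content hash

    def snapshot(self, files):
        """Record the full state of the project and return the snapshot index."""
        entry = {}
        for name, content in files.items():
            key = hashlib.sha1(content.encode("utf-8")).hexdigest()
            self.objects.setdefault(key, content)  # stored only if not seen before
            entry[name] = key
        self.snapshots.append(entry)
        return len(self.snapshots) - 1


store = SnapshotStore()
first = store.snapshot({"train.py": "lr = 0.1", "README.md": "Demo"})
second = store.snapshot({"train.py": "lr = 0.01", "README.md": "Demo"})

# README.md did not change, so both snapshots point to the same stored object.
shared = store.snapshots[first]["README.md"] == store.snapshots[second]["README.md"]
print(shared, len(store.objects))
```

Each snapshot lists every file, but the unchanged file costs no extra storage: it is only a second reference to an object that already exists.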
Files in a repository can be tracked or untracked by Git. Once a tracked file is modified, the modification can be moved to the staging area, and the staged modifications are then used to create a new commit in the repository.
Commits are identified by a hash (a SHA-1 checksum by default), not by a serial version number, which would not be feasible in a DVC system.
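Git names all of its objects (file contents, directory trees, commits) by hashing them. As a minimal sketch, the function below reproduces the identifier of a file's contents (a "blob"), assuming a repository that uses the default SHA-1 format. The helper name git_blob_id is hypothetical. Commit identifiers are computed in the same spirit, but over the commit metadata.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Identifier of a file's contents: SHA-1 over a short header plus the bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


empty_id = git_blob_id(b"")
script_id = git_blob_id(b"learning_rate = 0.00413\n")
print(empty_id)
print(script_id)
```

Because the identifier depends only on the content, two users who have never communicated compute the same identifier for the same content, with no central counter to coordinate. For the same bytes, the result should match what git hash-object reports.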
A Git project can be as simple as a linear sequence of commits on a single machine:
Or a tree of changes (branches). Note that different users might have different trees!
Git can be used from the terminal, or with a number of Git clients:
Most IDEs also have plugins to integrate the Git workflow.
The most common scenario is to use Git with a remote Git host, e.g., GitHub, GitLab (web, on-premise).
In this case, the hosted repository acts as the “official” copy of the project.
Most Git hosts offer a number of additional features beyond pure Git (e.g., pull requests, forks, actions, releases, …). These are fundamental to contributing on GitHub.
(TODO) register on GitHub if you are not already there:
Basic commands
Installing a Git client: 1.5 Getting Started - Installing Git
Several tools come with a pre-packaged Git (e.g., the GitHub GUI). As a result, you might have multiple Git versions on your system!
After installing, set up your user:
git config --global user.name "Simone"
git config --global user.email simone@example.com
There are two basic ways of initializing a repository. The first is to create a new repository from a given folder:
git init
Otherwise, we can clone an existing remote repository:
git clone username@host:/path/to/repository
As a good practice, consider setting up an SSH key for authentication.
Note: the starting branch for a repository is called main (customizable), or master on older repositories.
There are three basic steps to remember:
1) Move a file from the working directory to the staging area:
git add <file>
2) Check that there are staged, uncommitted modifications:
git status
3) Create a new commit (with HEAD as parent) from the staging area:
git commit -m "Message"
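The three steps above can be mimicked with a toy model of the working directory, the staging area, and the commit history. This is a sketch under simplifying assumptions: the class ToyRepo and its methods are hypothetical, and they only mirror the roles of git add, git status, and git commit.

```python
class ToyRepo:
    """Toy model of the working directory, staging area, and commit history."""

    def __init__(self):
        self.working = {}  # file name -> content currently on disk
        self.staged = {}   # modifications selected for the next commit
        self.commits = []  # list of (message, parent index, snapshot)

    def add(self, name):
        """Like `git add <file>`: copy the file's current content to the staging area."""
        self.staged[name] = self.working[name]

    def status(self):
        """Like `git status`, reduced to the names of the staged files."""
        return sorted(self.staged)

    def commit(self, message):
        """Like `git commit -m`: build a new snapshot on top of the latest commit."""
        parent = len(self.commits) - 1 if self.commits else None
        previous = self.commits[-1][2] if self.commits else {}
        snapshot = {**previous, **self.staged}
        self.commits.append((message, parent, snapshot))
        self.staged = {}
        return len(self.commits) - 1


repo = ToyRepo()
repo.working = {"train.py": "lr = 0.01", "notes.txt": "scratch"}
repo.add("train.py")
staged_before = repo.status()
first = repo.commit("Tune learning rate")
print(staged_before, repo.commits[first][2])
```

Only the staged file ends up in the commit: notes.txt stays in the working directory, untouched and uncommitted.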
If you cloned, you now have a commit in your local repository which is not present in the hosted repository! You can synchronize the two as:
git push origin main
Similarly, you can pull any changes that were pushed to the hosted repository:
git pull
If you did not clone (e.g., you created the repository with git init), you can manually add a remote repository:
git remote add origin <server>
When doing a push/pull, a number of conflicts can arise. Commonly, we can forget to pull before committing, or commits can be pushed to the remote while we are working.
Git tries to auto-resolve most conflicts that happen in these scenarios. If this is not possible (e.g., the same lines of a file have been modified on both sides), it asks the user to merge the modifications manually and perform a final commit.
If a push is rejected because the remote contains commits that you do not have locally, you need to pull the updated repository and resolve any conflicts before pushing again. Before pulling, you can also inspect the state of the remote repository:
git remote show origin
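The automatic resolution described above can be sketched as a three-way merge: each side is compared against the common ancestor, and a change made on only one side is accepted without asking. The function three_way_merge is hypothetical and works on whole files for brevity, whereas Git applies the same reasoning to lines within a file.

```python
def three_way_merge(base, ours, theirs):
    """Merge two project states (file name -> content) given their common ancestor.

    Returns the merged state and the list of files that need manual resolution.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o   # both sides agree (or neither side changed the file)
        elif o == b:
            result = t   # only their side changed the file
        elif t == b:
            result = o   # only our side changed the file
        else:
            conflicts.append(name)  # both sides changed it, in different ways
            continue
        if result is not None:      # None means the file was deleted
            merged[name] = result
    return merged, conflicts


base = {"model.py": "v1", "train.py": "v1"}
ours = {"model.py": "v2", "train.py": "v1"}
theirs = {"model.py": "v1", "train.py": "v3"}
clean_merge, clean_conflicts = three_way_merge(base, ours, theirs)

theirs_clash = {"model.py": "v9", "train.py": "v1"}
partial_merge, clash_conflicts = three_way_merge(base, ours, theirs_clash)
print(clean_merge, clean_conflicts, clash_conflicts)
```

In the first case the two sides touched different files, so the merge is automatic. In the second case both sides modified model.py, and the file is reported for manual resolution.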
To unstage a file that has been staged (but keep the modifications in the working directory):
git restore --staged <file>
Or you can simply discard any modification in the working directory and restore the last staged or committed version of the file:
git restore <file>
restore can be dangerous: anything committed in Git can be recovered in one way or another, but restoring a file will delete its uncommitted modifications forever.
Time to work! We will start by porting our notebook to a Git repository: https://github.com/sscardapane/reprodl2021
To experiment with Git, move to another branch:
git checkout exercise1_git
We will port the notebook to a script, adding Hydra support. If you want to see the completed exercise, check out the “completed” tag:
git checkout exercise1_git_completed
Advanced commands
Understanding branches
Branches are one of the most powerful features in Git.
A branch is a sequence of commits that deviates from the “main” branch, and can possibly be merged later on.
For deep learning, this is incredibly useful, because it allows us to freely experiment, or even keep a separate model for later use.
Creating a branch:
git branch testing
Switching to the new branch:
git checkout testing
Moving forward on the new branch:
git commit -a -m 'Experiment'
A branch can be merged into a different branch:
git checkout main
git merge testing
At this point, you can optionally delete the branch:
git branch -d testing # delete the branch
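Internally, Git stores a branch as a movable name pointing at a commit, which is why creating and deleting branches is cheap. The toy model below sketches the commands above under that view. The class ToyHistory is hypothetical, commits are plain integers, and file contents (and therefore conflicts) are ignored.

```python
class ToyHistory:
    """Toy commit graph: branches are just names pointing at commit ids."""

    def __init__(self):
        self.parents = {0: ()}        # commit id -> tuple of parent ids
        self.branches = {"main": 0}   # branch name -> commit id
        self.current = "main"         # the checked-out branch

    def branch(self, name):
        """Like `git branch <name>`: a new pointer to the current commit."""
        self.branches[name] = self.branches[self.current]

    def checkout(self, name):
        self.current = name

    def commit(self, extra_parents=()):
        """Add a commit on the current branch and move the branch pointer to it."""
        new_id = len(self.parents)
        self.parents[new_id] = (self.branches[self.current],) + tuple(extra_parents)
        self.branches[self.current] = new_id
        return new_id

    def ancestors(self, commit):
        """All commits reachable from `commit`, including itself."""
        seen, stack = set(), [commit]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.parents[c])
        return seen

    def merge(self, name):
        ours, theirs = self.branches[self.current], self.branches[name]
        if theirs in self.ancestors(ours):
            return "already up to date"
        if ours in self.ancestors(theirs):
            self.branches[self.current] = theirs  # just move the pointer forward
            return "fast-forward"
        self.commit(extra_parents=(theirs,))      # new commit with two parents
        return "merge commit"

    def delete(self, name):
        del self.branches[name]  # the commits stay, only the name is removed


history = ToyHistory()
history.branch("testing")
history.checkout("testing")
history.commit()
history.checkout("main")
outcome = history.merge("testing")
history.delete("testing")
print(outcome, history.branches)
```

Here main had not moved while the experiment was running, so the merge only advances the main pointer. If main had received its own commits in the meantime, the same call would create a merge commit with two parents.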
If conflicts arise during the merge, they must be resolved using the same strategy that we used for the pull conflicts.
In fact, git pull internally performs a git fetch operation (to get the content from the remote repository), followed by a git merge operation.
Note: Git has a powerful alternative to merge (rebasing) that we do not cover here.
Cloning a repository automatically adds a remote-tracking branch (origin/main), which is a pointer to the “main” branch in the “origin” remote.
When doing work locally, the remote-tracking branch is fixed and cannot be moved, but the state of the remote repository can change.
Fetching moves the remote-tracking branch to the current state of the remote repository, syncing the two.
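The relation between fetching, merging, and pulling can be sketched with a toy clone. The assumptions are strong and deliberate: the history is linear, the local branch has no commits of its own, and the class ToyClone with its attributes is hypothetical.

```python
class ToyClone:
    """Toy clone of a linear remote history (a list of commit names)."""

    def __init__(self, remote_history):
        self.remote_history = remote_history        # lives on the "server"
        self.origin_main = len(remote_history) - 1  # remote-tracking pointer
        self.main = self.origin_main                # local branch pointer

    def fetch(self):
        """Move the remote-tracking pointer to the current state of the remote."""
        self.origin_main = len(self.remote_history) - 1

    def merge(self):
        """Fast-forward the local branch to the remote-tracking pointer."""
        if self.main < self.origin_main:
            self.main = self.origin_main

    def pull(self):
        """A pull is a fetch followed by a merge."""
        self.fetch()
        self.merge()


server = ["c0", "c1"]
clone = ToyClone(server)
server.append("c2")  # someone else pushes a new commit

stale = clone.origin_main  # still points at c1: nothing was fetched yet
clone.fetch()
after_fetch = (clone.origin_main, clone.main)
clone.pull()
after_pull = (clone.origin_main, clone.main)
print(stale, after_fetch, after_pull)
```

After the fetch, only the remote-tracking pointer has moved. The local branch catches up only once the merge step of the pull has run.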