1 of 31

Reproducible deep learning

PhD Course in Data Science

Lecturer: Simone Scardapane

code_final_v4_r3_simone.py

Lecture 2: Code versioning

set_seed(6531463)

learning_rate=0.00413

Important!

Where’s my code?

2 of 31

Code version control

Code versioning

3 of 31

Version control (VC) is a fundamental tool in most software engineering projects. VC tracks every change made to the code, allowing the user to review the entire history of changes and, if needed, revert to an older version.

Among VC tools, Git, originally created by Linus Torvalds, is the de facto standard among programmers, although alternatives exist (e.g., Mercurial).

Most of you have probably already used it, but have you truly understood it?

4 of 31

There are two “flavours” of VC: centralized VC and distributed VC (DVC).

A centralized VC tool (e.g., Subversion) has a single server which is responsible for storing the history of the project. A user can “pull” the latest code from the server, and “push” modifications back to it.

Git is an instance of DVC: users that “clone” a project from a “repository” have access to the full history, and they become equal peers. The history of the project becomes a tree (strictly, a graph once branches are merged), which can easily branch, merge, ...

5 of 31

Most operations in Git are local; the exceptions are those that talk to a remote, such as push and pull.

6 of 31

Git stores a history of snapshots of the project (commits), while a centralized VC can simply store deltas. Unmodified files in a snapshot link back to the previous snapshot.
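The snapshot idea can be sketched with a toy, content-addressed store. This is only an illustration under simplifying assumptions, not Git's actual storage format: the names `store_blob`, `take_snapshot`, and `objects` are hypothetical. It shows why an unmodified file costs nothing extra in a new snapshot.

```python
import hashlib


def store_blob(objects, content):
    """Store file content under its SHA-1 digest and return the digest."""
    key = hashlib.sha1(content).hexdigest()
    objects.setdefault(key, content)  # identical content is stored only once
    return key


def take_snapshot(objects, files):
    """Return a snapshot: a mapping from file name to the digest of its content."""
    return {name: store_blob(objects, content) for name, content in files.items()}


objects = {}  # toy object store shared by all snapshots

first = take_snapshot(objects, {"train.py": b"lr = 0.1\n", "README.md": b"Demo\n"})
second = take_snapshot(objects, {"train.py": b"lr = 0.01\n", "README.md": b"Demo\n"})

# README.md did not change, so both snapshots point to the same stored object.
print(first["README.md"] == second["README.md"])  # True
print(len(objects))  # 3 (two versions of train.py, one README.md)
```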

7 of 31

Files in a repository can be tracked or untracked by Git. Once a tracked file is modified, it can be moved to the staging area, and the staged modifications are then used to create a new commit in the repository.

8 of 31

Commits are identified by a hash (a SHA-1 checksum by default), not by a serial version number, which is not feasible in a DVC system because peers cannot coordinate a shared counter.
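A hash can be computed by any peer from the content alone, with no coordination. As a small sketch, the function below (the name `git_blob_hash` is ours) reproduces the identifier Git assigns to the content of a file, a "blob". Commits are hashed in the same way, over a header plus the commit's own content. The result can be compared with the output of `git hash-object <file>`.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """SHA-1 of a Git blob: the header 'blob <size>' plus a NUL byte, then the raw bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same content always yields the same identifier, on any machine.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_hash(b"learning_rate=0.00413\n"))
```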

9 of 31

10 of 31

A Git project can be as simple as a linear sequence of commits on a single machine:

Or a tree of changes (branches). Note that different users might have different trees!

11 of 31

Git can be used from the terminal, or through one of the many graphical Git clients.

Most IDEs also have plugins to integrate the Git workflow.

12 of 31

The most common scenario is to use Git with a remote Git host, e.g., GitHub, GitLab (web, on-premise).

In this case, the hosted repository acts as the “official” copy of the project.

Most Git hosts offer a number of additional features beyond pure Git (e.g., pull requests, forks, actions, releases, …). These are fundamental to contributing on GitHub.

(TODO) Register on GitHub if you do not already have an account.

13 of 31

Code versioning

14 of 31

Installing a Git client: see Pro Git, section 1.5, “Getting Started - Installing Git”.

Several tools come with a pre-packaged Git (e.g., the GitHub GUI). As a result, you might have multiple Git versions on your system!

After installing, set up your user:

git config --global user.name "Simone"

git config --global user.email simone@example.com

15 of 31

There are two basic ways of initializing a repository. The first is to create a new repository from a given folder:

git init

Otherwise, we can clone an existing remote repository:

git clone username@host:/path/to/repository

As a good practice, consider setting up an SSH key for authentication.

Note: the starting branch for a repository is called main (customizable), or master on older repositories.

16 of 31

There are three basic concepts to remember:

  1. The HEAD is a pointer to the latest commit on the branch you are in.
  2. The working directory is the current state of the files in your project folder.
  3. The index (aka staging area) is where you move modified files to be committed.

17 of 31

1) Move a file from the working directory to the staging area:

git add <file>

2) Check that there are staged, uncommitted modifications:

git status

3) Create a new commit (with HEAD as parent) from the staging area:

git commit -m "Message"
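To fix the roles of the working directory, the index, and HEAD, here is a toy model of the add / status / commit cycle. It is a deliberately simplified sketch: `ToyRepo` and its fields are hypothetical names, and real Git stores far more (trees, authors, timestamps, hashes).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRepo:
    working_dir: dict = field(default_factory=dict)  # file name -> content on disk
    index: dict = field(default_factory=dict)        # staging area: next snapshot
    commits: list = field(default_factory=list)      # (parent, message, snapshot)
    head: Optional[int] = None                       # position of the latest commit

    def add(self, name):
        """Copy a file from the working directory to the index (like `git add`)."""
        self.index[name] = self.working_dir[name]

    def status(self):
        """Names of staged files that differ from the HEAD snapshot."""
        committed = self.commits[self.head][2] if self.head is not None else {}
        return sorted(n for n, c in self.index.items() if committed.get(n) != c)

    def commit(self, message):
        """Create a commit from the index, with HEAD as parent, and move HEAD."""
        self.commits.append((self.head, message, dict(self.index)))
        self.head = len(self.commits) - 1


repo = ToyRepo()
repo.working_dir["train.py"] = "lr = 0.1"
repo.add("train.py")
print(repo.status())  # ['train.py']
repo.commit("Message")
print(repo.status())  # []
```

Editing `repo.working_dir` again changes nothing in `status()` until the file is added once more, which mirrors the difference between modified and staged files.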

18 of 31

If you cloned, you now have a commit in your local repository which is not present in the hosted repository! You can synchronize the two as:

git push origin main

Otherwise, you can manually add a remote repository:

git remote add origin <server>

Similarly, you can pull any changes that were pushed to the hosted repository:

git pull

19 of 31

When doing a push/pull, a number of conflicts can arise. Commonly, we can forget to pull before committing, or commits can be pushed to the remote while we are working.

Git tries to auto-resolve most conflicts that happen in these scenarios. If this is not possible (e.g., the same lines of a file have been modified on both sides), it asks the user to merge the modifications manually and perform a final commit.
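The logic behind auto-resolution is a three-way merge against the common ancestor. The sketch below is a toy version working on whole files, under the assumption that each snapshot is a dictionary from file name to content; real Git applies the same reasoning to individual regions of lines inside each file. The function name `three_way_merge` is ours.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots against their common ancestor.

    Returns (merged, conflicts), where conflicts lists the files
    that both sides changed in different ways.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:      # both sides agree (or neither touched the file)
            result = o
        elif o == b:    # only their side changed it
            result = t
        elif t == b:    # only our side changed it
            result = o
        else:           # both changed it differently: a human must decide
            conflicts.append(name)
            continue
        if result is not None:  # None means the file was deleted
            merged[name] = result
    return merged, conflicts


base = {"model.py": "v1", "train.py": "v1"}
ours = {"model.py": "v2", "train.py": "v1"}

# Different files were modified: the merge resolves automatically.
print(three_way_merge(base, ours, {"model.py": "v1", "train.py": "v2"}))
# ({'model.py': 'v2', 'train.py': 'v2'}, [])

# The same file was modified on both sides: a conflict is reported.
print(three_way_merge(base, ours, {"model.py": "v3", "train.py": "v1"}))
# ({'train.py': 'v1'}, ['model.py'])
```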

If the conflict arises during a push, the user needs to pull the updated repository and then resolve the conflicts. Before pulling, you can also inspect the content of the remote repository:

git remote show origin

20 of 31

To unstage a file that has been staged (but keep the modifications in the working directory):

git restore --staged <file>

Or you can simply discard the modifications and restore the file as it was in the previous commit:

git restore <file>

restore can be dangerous: anything committed in Git can be recovered in one way or another, but restoring a file deletes its uncommitted modifications forever.

21 of 31

Time to work! We will start by porting our notebook to a Git repository: https://github.com/sscardapane/reprodl2021

To experiment with Git, move to another branch:

git checkout exercise1_git

We will port the notebook to a script, adding Hydra support. If you want to see the completed exercise, check out the “completed” tag:

git checkout exercise1_git_completed

22 of 31

Advanced commands

Understanding branches

Code versioning

23 of 31

Branches are one of the most powerful features in Git.

A branch is a sequence of commits that deviates from the “main” branch, and that can be merged back later on.

For deep learning, this is incredibly useful, because it allows us to freely experiment, or even keep a separate model for later use.

24 of 31

Creating a branch:

git branch testing

25 of 31

Switching to the new branch:

git checkout testing

26 of 31

Moving forward on the new branch:

git commit -a -m 'Experiment'
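Under the hood, a branch is just a movable pointer to a commit, and HEAD records which branch is checked out. The toy model below sketches the three steps (create, switch, move forward); `ToyHistory` and its fields are hypothetical names, and commits are reduced to integer ids.

```python
class ToyHistory:
    def __init__(self):
        self.parents = {0: None}     # commit id -> parent commit id
        self.branches = {"main": 0}  # branch name -> commit it points to
        self.head = "main"           # name of the checked-out branch

    def branch(self, name):
        """Create a branch at the current commit (like `git branch <name>`)."""
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        """Switch HEAD to another branch (like `git checkout <name>`)."""
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def commit(self):
        """Add a commit on top of the current branch; only that branch moves."""
        new_id = len(self.parents)
        self.parents[new_id] = self.branches[self.head]
        self.branches[self.head] = new_id
        return new_id


history = ToyHistory()
history.branch("testing")
history.checkout("testing")
history.commit()
print(history.branches)  # {'main': 0, 'testing': 1}
```

Creating a branch copies no files: it only adds one entry to `branches`, which is why branching is cheap enough to use for every experiment.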

27 of 31

A branch can be merged into a different branch:

git checkout main

git merge testing

28 of 31

At this point, you can optionally delete the branch:

git branch -d testing # delete the branch

If conflicts arise during the merge, they must be resolved using the same strategy that we used for the pull conflicts.

In fact, git pull internally performs a git fetch operation (to get the content from the remote repository), followed by a git merge operation.

Note: Git has a powerful alternative to merge (rebasing) that we do not cover here.

29 of 31

Cloning a repository automatically adds a remote-tracking branch (origin/main), which is a pointer to the “main” branch in the “origin” remote.

30 of 31

When working locally, the remote branch stays fixed and cannot be moved, even though the state of the remote repository can change:

31 of 31

Fetching moves the remote branch to its current state, syncing with the remote repository:
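The relation between the remote repository, the remote branch, and the local branch can be sketched with plain dictionaries. This is a toy model with hypothetical names: commits are integers, and `pull` only covers the simplest case in which the local branch has no commits of its own, so the merge reduces to moving the pointer forward.

```python
def fetch(local, remote):
    """Move the remote branch pointer to the remote's current state."""
    local["origin/main"] = remote["main"]


def pull(local, remote):
    """Toy pull: fetch, then fast-forward the local branch.

    Assumes no local commits, so no real merge is needed.
    """
    fetch(local, remote)
    local["main"] = local["origin/main"]


remote = {"main": 3}
local = {"main": 3, "origin/main": 3}

remote["main"] = 5           # someone else pushes two commits
print(local["origin/main"])  # 3 (unchanged until we fetch)

fetch(local, remote)
print(local["origin/main"], local["main"])  # 5 3

pull(local, remote)
print(local["main"])  # 5
```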