1 of 45

Project-based workflows with GitHub

Courtney Robichaud and Emma Hudgins

@cdrobich @emmajhudgins

2 of 45

You walk away confident in using Git/GitHub for version control with your (R-based) projects

5 of 45

What are your concerns about using or learning Git/GitHub?

I don’t know how to use it in my research

The learning curve/difficulty for new users

Little coding experience

Never got the flow of it

7 of 45

What we will cover:

  • What are R projects, Git, and GitHub?
  • Walking through how to use them
  • Demo: creating a repo, writing a script, generating figures, and committing them to the repo
  • You follow along with our demo and make your own!

8 of 45

The power of projects, Git and GitHub

9 of 45

Your current organization

Could look something like this

10 of 45

Your ideal organization

11 of 45

R project

Commit often

Push to GitHub

12 of 45

R Projects

14 of 45

What is a “project” and why is it better?

  • All your data (raw and manipulated), scripts, and output are saved in separate folders
  • Does not call on anything system-specific (e.g. absolute file paths that only exist on your computer)
  • Well commented so others (including future you) understand

15 of 45

devmountain.com

16 of 45

What else can GitHub do?

  • Can be a cloud storage service for any type of file
  • “Forking” allows people to use others’ projects as templates for their own
  • Provides a hosting service for web content
  • Allows you to freeze your work at a given moment in time as a ‘release’ which can be linked to a DOI (Required by many journals/funders)
  • Provides integration with other tools (e.g. OSF)

17 of 45

GitHub basics

18 of 45

Your moves:

Repo(sitory) - one or more folders tracked by Git; GitHub repos are also stored in the cloud

Push - send changes to the cloud

Pull - get changes from the cloud

Commit - create a named version of a set of one or more changes to the repo

Clone - copy an existing repo into a folder on your computer such that it stays connected to the original repo

Fork - copy an existing repo, as it is at that moment, into your own GitHub account. Your copy then changes independently of the original, but you can still suggest changes back to it through a pull request
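
The moves above can be sketched from the command line. This is a minimal sketch, not the demo from the workshop: to stay runnable without a GitHub account, a local "bare" repo stands in for the copy on GitHub, and the folder names, file name, and user details are placeholders.

```shell
set -eu
work="$(mktemp -d)"

# Stand-in for the repo hosted on GitHub (assumption: a local bare repo, so no network is needed)
git init --quiet --bare "$work/remote.git"

# Clone: copy the repo to a local folder; the clone remembers the original as "origin"
git clone --quiet "$work/remote.git" "$work/my-project"
cd "$work/my-project"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Commit: create a named version of one or more changes
echo 'plot(1:10)' > 01-make_figure.R
git add 01-make_figure.R
git commit --quiet -m "Add figure script"

# Push: send the commit to the remote copy
git push --quiet -u origin HEAD

# Pull: get any changes from the remote copy (nothing new here)
git pull --quiet
```

With a real GitHub repo, the path given to `git clone` would be the URL shown under the green button on the repo page.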

19 of 45

OR click the green button in the left pane

github.com

25 of 45

Structure of a repo

31 of 45

Check/change your settings in R:
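
The screenshot for this slide is not available here. As a rough equivalent, the name and email that Git attaches to your commits can be checked and changed from a terminal (for example the Terminal tab in RStudio). This sketch uses a scratch repo and placeholder values, and sets them for that repo only.

```shell
set -eu
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"

# Change: set the identity for this repo only.
# Adding --global after "config" would apply it to every repo on the computer.
git config user.name "Example User"        # placeholder name
git config user.email "user@example.com"   # placeholder email

# Check: print the values Git will use for commits in this repo
git config user.name
git config user.email
```

The email should normally match the one on your GitHub account so that commits are linked to your profile.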

32 of 45

GO TO GITHUB

33 of 45

.gitignore

Choose a template based on your main programming language (the R template ignores files like .Rhistory)

Some examples of files you probably want to ignore:

  • Sensitive information (e.g. passwords)
  • Binary files such as .RData
  • Files > 50MB. Git is made for code (e.g. .R files) and is not intended to track every change in large data files (these can be uploaded in ‘releases’ with DOIs through Zenodo)
  • Temporary files/folders with ‘disposable’ content
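
A small sketch of what such a .gitignore could look like, and how to check that it works. The file names `secrets.txt`, `tmp/`, and `analysis.R` are made up for illustration; the GitHub R template contains more entries than this.

```shell
set -eu
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"

# Each line of .gitignore is a pattern for files Git should not track
cat > .gitignore <<'EOF'
# R session files
.Rhistory
.RData
# Sensitive information (hypothetical file name)
secrets.txt
# Temporary, disposable content
tmp/
EOF

touch .RData secrets.txt analysis.R

# check-ignore prints the paths that match a .gitignore rule
git check-ignore .RData secrets.txt

# Only .gitignore and analysis.R are listed as untracked; the ignored files are hidden
git status --short
```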

34 of 45

Choosing the best license

choosealicense.com

35 of 45

Ideal folder structure

Raw Data

Metadata includes date of download or collection, original source and re-use info

(Derived) Data

Data you transformed after downloading/collecting, e.g. merging 2 databases

Scripts

Code (can separate by language)

Output

Figures, tables, results

Every folder should contain a README!
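
One way to set up this structure in a single step is sketched below. The folder names are assumptions (underscores instead of spaces, following the file naming advice), and the README text is only a placeholder to be filled in.

```shell
set -eu
project="$(mktemp -d)/my-project"   # hypothetical project location

# One folder per kind of content, each with its own README
for dir in raw_data derived_data scripts output; do
  mkdir -p "$project/$dir"
  printf '# %s\n\nDescribe the files in this folder.\n' "$dir" > "$project/$dir/README.md"
done

# Top-level README describing the project as a whole
printf '# my-project\n\nDescribe the project and how to run it.\n' > "$project/README.md"

ls "$project"
```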

36 of 45

Readme/Metadata best practices

  • Include package version information and any external software used
  • Describe files in a logical order
  • Describe any column/variable names (especially units)

37 of 45

File naming

  • Be as descriptive as possible
  • Can add leading numbers to scripts that indicate order they should be run e.g.
  • 01-data_processing.R
  • 02-model_fitting.R
  • Avoid dates/overly generic names
  • Name output similarly to script that generated it
  • Use hyphens and underscores instead of spaces

38 of 45

Clean coding

Be proactive

  • Use #### #### to separate steps
  • Describe each major step and why it’s done
  • Put yourself in the shoes of the person reading the code for the first time
  • Include code author names, software versions

39 of 45

More advanced GitHub

40 of 45

More advanced functionality

Branch - one set of version histories for a repo, including the ‘main’ original branch, and additional branches used to suggest changes, test out new ideas that may not work etc.

Pull request - a suggested set of changes (made in another branch or in a fork) that must be reviewed and approved by someone who maintains the main branch before it is merged

Pull often, commit after each change
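
A minimal local sketch of working on a branch. The branch name, file, and identity are placeholders. The final merge is done directly here; on GitHub that step would usually go through a pull request instead.

```shell
set -eu
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# The default branch is "main" or "master", depending on Git settings
main_branch="$(git symbolic-ref --short HEAD)"

# Branch: test out an idea without touching the main branch
git checkout --quiet -b try-new-model
echo 'y <- 2' >> analysis.R
git commit --quiet -am "Try a new model"

# If the idea works, bring it back into the main branch
git checkout --quiet "$main_branch"
git merge --quiet try-new-model
git log --oneline
```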

41 of 45

Revert changes

Reverting is easier before you commit, but it is possible after a commit too.

Pre-commit:

In RStudio, right click on a file and select ‘revert’
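
From the command line, both cases might look like the sketch below. It assumes that `git checkout -- <file>` is a fair stand-in for the RStudio ‘revert’ menu item (both discard uncommitted edits), and the file contents are made up.

```shell
set -eu
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Pre-commit: throw away edits that have not been committed yet
echo 'x <- 999' > analysis.R
git checkout -- analysis.R
cat analysis.R   # prints: x <- 1

# Post-commit: git revert adds a new commit that undoes an earlier one
echo 'x <- 2' > analysis.R
git commit --quiet -am "Change x"
git revert --no-edit HEAD
cat analysis.R   # prints: x <- 1
```

Note that the post-commit route keeps the full history: the unwanted commit is still there, followed by a commit that cancels it.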

42 of 45

Releases, Zenodo & DOI creation
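
A GitHub release is built on a Git tag, which marks one commit as a named version. The sketch below only shows the tagging step, with a made-up version name and message; creating the release and linking it to Zenodo for a DOI happen on the GitHub and Zenodo websites and are not shown.

```shell
set -eu
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Freeze the current state under a version name (hypothetical name and message)
git tag -a v1.0.0 -m "Version used for the submitted manuscript"

# List tags and show which tag the current commit carries
git tag --list
git describe --tags
```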

44 of 45

OpenRefine

https://openrefine.org/

45 of 45

Other helpful resources