1 of 105

Good enough practices in research computing

How to not lose your stuff, and generally be more efficient

© Lisanna Paladin - CC-BY 4.0

Bio-IT - bio-it.embl.de

Slides adapted from Edward Wallace 2020, Licence: CC BY 4.0

2 of 105

Who am I and why do I care?

  • PhD and a brief postdoc in structural biology in Italy, an environment with much less infrastructural support
  • Now project manager of Bio-IT, one of the support entities at EMBL
  • Bio-IT: community initiative to support computational biology
  • Training, Community, Resources, Information

Good habits are essential for collaboration, and for your future self

3 of 105

This lesson’s tool

4 of 105

Introduction

15 mins

5 of 105

The Data Life Cycle

From: RDMkit

1.1 What can go wrong in each of the phases?

6 of 105

Data Management Challenges in Life Sciences

From: ENGYTE

7 of 105

Data scattered in multiple repositories

8 of 105

Challenges in data-heavy biology

  • All your raw data are digital files
    • How do you manage them and keep track of your actions?
  • There are many tools to process data
  • There are many levels at which to study a problem and many steps to understand
    • Where do you even start?

9 of 105

Good news!

  • Your problems are not unique
  • People have thought about good practices and created good tools
  • You don’t have to reinvent them
  • You can learn them
  • This is an ongoing process through your career

10 of 105

Basic principles

This lesson has episodes covering data management, software, project organization, collaboration, keeping track of changes, and manuscripts.

“Good Enough Practices” rely on a shared set of principles that span these areas:

  • Planning: plan out how to work. Any plan that you can stick to is better than no plan
  • Modular organization: organize your data, code, and projects into coherent modules.
  • Names: give good names to your files, folders, and functions, that make them easy to find and to understand.
  • Documentation: explicitly write down everything you and your collaborators need to know in the future.

11 of 105

Lesson outline

  1. Data management
  2. Software
  3. Collaboration
  4. Project organization
  5. Keeping track of changes
  6. Manuscripts

Wilson et al., PLoS Comput. Biol. 2017, https://doi.org/10.1371/journal.pcbi.1005510

12 of 105

Data Management

45 mins

13 of 105

Data Management

Definition

Data management is the process of storing, documenting, organizing, and sharing the data created and collected during a project.

14 of 105

Data management problems

2.1 Comment on the ways to back up data

15 of 105

Data management practices

2.2 What “data” are we discussing now?

  • Image by storyset on Freepik

16 of 105

Data management practices

2.2 What “data” are we discussing now?

  • Backup strategies are needed for raw data in particular (processed data can be regenerated)
  • Consider changing file permissions to read-only
  • Record the procedures used to clean the data in the following steps, rather than keeping all the intermediate data files
  • Unless one of these procedures is particularly time-consuming
  • Rely on automation when possible

  • Image by storyset on Freepik

17 of 105

Data management practices

2.2 What “data” are we discussing now?

  • Specific considerations for personal or sensitive data → GDPR and equivalent
  • And commercially sensitive data
  • And data which may cause harm if made public (e.g. data related to the distribution of rare species which can be hunted)

  • Image by storyset on Freepik

18 of 105

GDPR: a parenthesis

19 of 105

Metadata: “data that describes data”

  • Allows understanding and interpretation of data
  • As important as your data
  • Should “travel together with” your data
  • Can be produced automatically or manually
  • Can be:
    • Administrative: relevant to managing it (e.g. Experimental code, PI)
    • Descriptive/citation: assists with discovery/identity (e.g. Authors, persistent identifier)
    • Structural: how the data came about & is structured (e.g. Collection method, folder structures)

20 of 105

Metadata: “data that describes data”

21 of 105

Data management practices

2.3 Which file formats do you store your data in?

2.4 Do you have a (files and variables) naming convention?

  • Image by storyset on Freepik

22 of 105

Data management practices

2.3 Which file formats do you store your data in?

File formats: Convert data from closed, proprietary formats to open, non-proprietary formats that ensure machine readability across time and computing setups. Good options include CSV for tabular data, JSON, YAML, or XML for non-tabular data such as graphs, and HDF5 for certain kinds of structured data. OME-Zarr for image data.
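For instance, a short Python sketch of such a conversion (pandas and an Excel reader such as openpyxl are assumed; the input filename is illustrative):

# Convert a closed spreadsheet format to open formats
import pandas as pd

df = pd.read_excel("measurements.xlsx")            # proprietary input (hypothetical file)
df.to_csv("measurements.csv", index=False)         # open, tabular
df.to_json("measurements.json", orient="records")  # open, non-tabular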

  • Image by storyset on Freepik

23 of 105

Data management practices

2.4 Do you have a (files and variables) naming convention?

Filenames: Store especially useful metadata as part of the filename itself, while keeping the filename regular enough for easy pattern matching. For example, a filename like 2016-05-alaska.csv makes it easy for both people and programs to select by year or by location.
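A minimal sketch of why this pays off, using only the Python standard library (filenames follow the pattern above):

from glob import glob

# Select by location: every year's file for Alaska
alaska_files = glob("*-alaska.csv")

# Select by year: every location measured in 2016
files_2016 = glob("2016-*.csv")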

  • Image by storyset on Freepik

24 of 105

Data management practices

2.4 Do you have a (files and variables) naming convention?

Variable names: Replace inscrutable variable names and artificial data codes with self-explaining alternatives, e.g., rename variables called name1 and name2 to first_name and family_name, recode the treatment variable from 1 vs. 2 to untreated vs. treated, and replace artificial codes for missing data, such as “-99”, with NA, a code used in most programming languages to indicate that data is “Not Available”.
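A hedged pandas sketch of these replacements (file and column names are illustrative):

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical input table

# Self-explaining variable names instead of name1/name2
df = df.rename(columns={"name1": "first_name", "name2": "family_name"})

# Recode artificial data codes
df["treatment"] = df["treatment"].replace({1: "untreated", 2: "treated"})

# Replace the artificial missing-data code with NA
df = df.replace(-99, pd.NA)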

  • Image by storyset on Freepik

25 of 105

Data management practices

Create the dataset you wish you had received. The goal here is to improve machine and human readability, but not to do vigorous data filtering or add external information. Machine readability allows automatic processing using computer programs, which is important when others want to reuse your data.

Non-destructive transformations:

  • Create analysis-friendly data
  • Record all the steps used to process data
  • Anticipate the need to use multiple tables, and use a unique identifier for every record

26 of 105

Analysis-friendly data

2.5 Which of the table layouts is analysis friendly? Enter your answers in the collaborative document.

27 of 105

Analysis-friendly data

  • Make each column a variable
  • Make each row an observation

28 of 105

Analysis-friendly data

  • Make each column a variable
  • Make each row an observation

AND record all the steps used to process data:

  • Write a script for every stage of the process
  • Use a digital cleaning tool such as OpenRefine

29 of 105

Analysis-friendly data

  • Make each column a variable
  • Make each row an observation

AND record all the steps used to process data:

  • Write a script for every stage of the process
  • Use a digital cleaning tool such as OpenRefine

AND anticipate the need to use multiple tables

  • using unique identifiers
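Putting these recommendations together, a minimal pandas sketch (column names are illustrative) that reshapes a wide table so each column is a variable and each row an observation, keyed by a unique identifier:

import pandas as pd

wide = pd.DataFrame({
    "sample_id": ["S1", "S2"],   # unique identifier for every record
    "day1": [0.31, 0.28],
    "day2": [0.52, 0.47],
})

# One row per observation, one column per variable
tidy = wide.melt(id_vars="sample_id", var_name="day", value_name="od600")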

30 of 105

Data submission

2.6 Comment on the ways to share data

31 of 105

Data submission

What is a DOI:

  • A digital object identifier is a persistent identifier or handle used to identify objects uniquely.
  • Data with a persistent DOI can be found even when your lab website dies.
  • DOI-issuing repositories include: zenodo, figshare, dryad.
  • More than a URL: stable in time.

32 of 105

Data submission

Places to share data, with DOIs:

  • UoE DataShare (https://datashare.is.ed.ac.uk/) local open-access repository
  • UoE DataVault (https://datavault.ed.ac.uk) local long-term retention.
  • Dataverse (http://thedata.org): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.
  • FigShare (http://figshare.com): A repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. Note that figshare is commercial.
  • Zenodo (http://zenodo.org): A repository service that enables researchers, scientists, projects, and institutions to share and showcase multidisciplinary research results (data and publications)
  • Dryad (http://datadryad.org): A repository that aims to make data archiving as simple and as rewarding as possible through a suite of services not necessarily provided by publishers or institutional websites.

33 of 105

Data submission

2.7 What are the best repositories for your type of data?

34 of 105

Data Management Plans

A formal, living document that outlines what you will do with your data at all stages of your research project

A requirement at EMBL

35 of 105

Data Management Plans

A formal, living document that outlines what you will do with your data at all stages of your research project

A requirement at EMBL

→ A requirement for several funders

  • European Commission (Horizon Europe), within the first 6 months of the project
  • DFG (German Research Foundation), submit with proposal
  • BMBF (Education and Research Ministry), submit with proposal if required

36 of 105

Data Management Plans

2.8 Let’s draft your DMP!

  • What is the project?
  • What data will be produced and used?
  • What information is needed for the data to be read and interpreted?
  • Which procedures will be used to create, process and quality control the data and metadata?
  • How will data processing operations be documented so that they can be reproduced?
  • Are there any security or access control requirements?
  • How will Intellectual Property be managed?
  • What happens to the data at the end of the project?
  • Who is responsible for which part of data management?

10 mins

37 of 105

Data Management Plans

If you want to know more:

38 of 105

Resources at EMBL: STOCKS

  • Electronic lab notebook system

39 of 105

Resources at EMBL: the Data Management App

  • Data Management tool (archive, share, move, delete), metadata annotator

40 of 105

Code and software

30 mins

41 of 105

Software problems

Neil Ferguson's covid code twitter thread

Smaller-scale versions of this problem are more common:

  • you want to re-run a data analysis that you did six months ago
  • a new postdoc starts working on a related project and needs to adapt your analysis to a new dataset
  • you publish a paper, and a masters student from the other side of the world emails you to reproduce the results for their project

42 of 105

What is research software?

  • Any code that runs in order to process your research data
  • “Record all the steps used to process data.” That means, script everything if possible. Data analysis is software.
  • R, Python, MATLAB, shell, OpenRefine, ImageJ, etc. are all scriptable. Use them. Their instructions are software too.

3.1 What can go wrong with (research) code?

43 of 105

Helpful comments

Short explanatory comments/documentation can go a long way.

* The reader doesn’t need to know what all these words mean for a comment to be useful. This comment tells the reader what words they need to look up: fuzzing, blobs, and so on.

44 of 105

Write functions

Break your script into functions (you will reuse them).

Human short-term memory cannot hold more than about seven items at once: make your life easier.
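A small illustrative sketch of a script broken into functions (names and the input file are hypothetical):

def read_counts(path):
    """Read one integer count per line from a text file."""
    with open(path) as handle:
        return [int(line) for line in handle]

def normalise(counts):
    """Scale counts so that they sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]

# Each step is now named, testable, and reusable
counts = read_counts("sample1.txt")
fractions = normalise(counts)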

45 of 105

Write functions

Consider documenting functions and other code; each programming language has specific syntax for this:

R

JavaScript

3.2 Writing helpful explanatory comments

Python
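For example, a minimal Python docstring (an illustrative stub; R uses roxygen2 #' comments and JavaScript uses JSDoc /** ... */ in the same spirit):

def net_charge(sequence, ph=7.0):
    """Return the approximate net charge of a peptide.

    Parameters
    ----------
    sequence : str
        Peptide as one-letter amino acid codes.
    ph : float, optional
        pH at which to estimate the charge (default 7.0).
    """
    ...  # body omitted; this stub only illustrates the docstring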

46 of 105

Remove duplications

Write and re-use functions instead of copying and pasting code, and use data structures like lists (or classes) instead of creating many closely-related variables, e.g. create scores = [1, 2, 3] rather than score1, score2, and score3.
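As a sketch:

# Instead of score1, score2, score3 ...
scores = [1, 2, 3]
mean_score = sum(scores) / len(scores)  # works for any number of scores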

Also avoid duplication of work. Consider consulting catalogues of libraries:

  • CRAN in R
  • PyPI in Python …

Always search for well-maintained software libraries that do what you need before writing new code yourself, but test libraries before relying on them.

47 of 105

Give functions and variables meaningful names

3.3 Name a function and a variable

Remember to follow each language’s conventions for names, such as net_charge for Python and netCharge for Java. These conventions are often described in “style guides”, and can even be checked automatically.

Tab completion helps for long variable names. No excuses.

Make your life easier by using a good programming editor or integrated development environment (IDE). More on this later.

48 of 105

Do not comment and uncomment sections of your code to control a program’s behavior

This is error-prone and makes it difficult or impossible to automate analyses. Instead, put if/else statements in the program to control what it does, and use input arguments on the command line.
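A hedged sketch using argparse from the Python standard library (argument names are illustrative):

import argparse

parser = argparse.ArgumentParser(description="Toy analysis")
parser.add_argument("input_file")
parser.add_argument("--normalise", action="store_true",
                    help="normalise counts before summarising")
args = parser.parse_args()

# Behaviour is controlled by the flag, not by (un)commenting code
if args.normalise:
    print(f"normalising counts from {args.input_file}")
else:
    print(f"using raw counts from {args.input_file}")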

Provide a simple example

Give an example of input together with its expected output. This is called an integration test. It is also very useful when you run the same code on multiple machines. Automate it for extra reassurance.
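A minimal sketch of such an automated check (the script and file paths are hypothetical):

import subprocess

# Run the analysis on a tiny example input ...
result = subprocess.run(
    ["python", "count_words.py", "tests/example_input.txt"],
    capture_output=True, text=True, check=True)

# ... and compare against the stored expected output
with open("tests/expected_output.txt") as handle:
    expected = handle.read()
assert result.stdout == expected, "output changed - investigate!"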

Code can be managed like data

Your code is like your data and also needs to be managed, backed up, and shared. Submit code to a reputable DOI-issuing repository, just as you do with data.

49 of 105

Make dependencies and requirements explicit

This is usually done on a per-project rather than per-program basis, i.e., by adding a file called something like requirements.txt to the root directory of the project, or by adding a “Getting Started” section to the README file.
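An illustrative requirements.txt, pinning the versions the project was tested with (packages and versions are examples only); it can be installed with pip install -r requirements.txt:

numpy==1.16.4
scipy==1.3.1
pandas==0.25.3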

50 of 105

Record dependencies

Python example

  • Record dependencies with conda in a YAML file
  • Re-create the environment from the YAML file
  • Manage your multiple environments with conda

name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - pip
  - pip:
      - git+https://github.com/someuser/someproject.git@d7b2c7e
      - git+https://github.com/anotheruser/anotherproject.git@sometag
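Assuming the file above is saved as environment.yml, the standard conda commands are:

conda env create -f environment.yml      # re-create the environment
conda activate student-project
conda env export > environment.lock.yml  # record the exact solved versions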

51 of 105

Record computational steps

Snakemake example

# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
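The make_plot rule continues analogously. Assuming the file above is saved as Snakefile, the whole workflow can be run with, for example:

snakemake --cores 4 --use-conda   # re-runs only out-of-date steps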

52 of 105

Record environments

Singularity definition file example (building on a Docker image)

Bootstrap: docker
From: ubuntu:latest

%post
    export VIRTUAL_ENV=/app/venv
    apt-get update && \
    apt-get install -y --no-install-recommends \
        ...
    pip install --no-cache-dir -r /app/requirements.txt

%files
    ...

%environment
    ...

%runscript
    . $VIRTUAL_ENV/bin/activate
    python /app/app.py
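Assuming the definition above is saved as container.def, typical Singularity/Apptainer usage would be:

singularity build container.sif container.def   # build the image
singularity run container.sif                   # execute the %runscript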

53 of 105

Collaboration

30 mins

54 of 105

Create an overview of your project

Create a short file in the project’s base directory that explains the purpose of the project. This file (generally called README, README.txt, README.md, or something similar) should contain:

  • The project’s title
  • A brief description
  • Up-to-date contact information
  • An example or two of how to run the most important tasks
  • Broad overview of folder structure
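An illustrative skeleton covering these items (all names and the contact are placeholders):

# Project title

One-paragraph description of the project's purpose.

Contact: Jane Doe <jane.doe@example.org>

Example: python src/runall.py data/raw/

Folders: data/ (raw data), src/ (code), results/ (generated output)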

55 of 105

Describe how to contribute to the project

You can use the README.

A separate CONTRIBUTING file is used for community projects and usually includes:

  • Dependencies that need to be installed
  • Tests that can be run to ensure that software has been installed correctly
  • Guidelines or checklists that your project adheres to.
  • A reference to a code of conduct or similar community governance practices

4.1 Evaluate two README files

56 of 105

Create a shared “to-do” list

This can be a plain text file called something like notes.txt or todo.txt, or you can use sites such as GitHub or Bitbucket to create a new issue for each to-do item. (You can even add labels such as “low hanging fruit” to point newcomers at issues that are good starting points.)

Whatever you choose, describe the items clearly so that they make sense to newcomers.

Decide on communication strategies

Check collaborations with sensitive data

57 of 105

Make the license explicit

To license something, you need to hold the intellectual property rights to it.

Copyright: exclusive rights to make certain uses of original expressions for a limited period of time.

Unlicensed code grants users no rights or permissions to copy, distribute, modify, or adapt the work. Always add a license as early as possible.

58 of 105

Make the license explicit

Intellectual Property at EMBL

59 of 105

What is a licence?

  • Specifies allowable copying and reuse
  • Without a licence, people cannot legally reuse your code or data
  • Different options for different goals and funder requirements (Apache, MIT, CC, ...)
  • For example, this presentation is reusable with attribution under a Creative Commons Attribution (CC BY) 4.0 licence.

© Edward Wallace 2020, Licence: CC BY 4.0

60 of 105

Make the license explicit

Have a LICENSE file in the project

For data and text:

  • CC-0 (“No Rights reserved”)
  • CC-BY (“Attribution”)

For software:

  • MIT
  • Apache license

Be careful about the “no commercial use” option. It applies to you too.

Check tldrlegal.com

61 of 105

Make the license explicit

4.2 Checking common licenses

62 of 105

Make the project citable

A CITATION file describes how to cite the project as a whole, and where to find related datasets, code, figures etc.

This is the one for khmer project:
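For a machine-readable alternative, a minimal CITATION.cff sketch (the Citation File Format; every value below is a placeholder, not khmer's actual file):

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "My analysis pipeline"    # placeholder
authors:
  - family-names: Doe            # placeholder
    given-names: Jane
    orcid: "https://orcid.org/0000-0000-0000-0000"
version: 1.0.0
date-released: 2020-01-01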

63 of 105

64 of 105

Project organisation

20 mins

65 of 105

README files are magic

You look at a directory (or project), you read it, and it tells you what you need to know

… as long as you keep them updated!

What is the secret? Good planning.

66 of 105

Plan and implement your project structure

Divide your work into projects, based on the research efforts sharing data or code.

  • No shared data or code: separate projects
  • Significant sharing of data or code: best managed as one project
  • The same code used on multiple data sources: a project of its own

67 of 105

Plan and implement your project structure

Recommendations for project organisation from Noble (2009); alternatives: The Turing Way, a template, etc.

  • Text documents associated with the project in the doc directory
  • Raw data and metadata in the data directory
  • Files generated during cleanup and analysis in the results directory
  • Project source code in the src directory
    • One subdirectory with the actual scripts to generate results
    • One subdirectory with controller or driver scripts (steps from start to finish)

68 of 105

Plan and implement your project structure

Recommendations for project organisation from Noble (2009); alternatives: The Turing Way, a template, etc.

  • Text documents associated with the project in the doc directory
  • Raw data and metadata in the data directory
  • Files generated during cleanup and analysis in the results directory
  • Project source code in the src directory
    • One subdirectory with the actual scripts to generate results
    • One subdirectory with controller or driver scripts (steps from start to finish)
  • Compiled programs in the bin directory (though your project may not need one)
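One possible layout following these recommendations (names are conventional, not mandatory):

my_project/
    doc/        # text documents, notes, manuscripts
    data/       # raw data and metadata (treat as read-only)
    results/    # files generated during cleanup and analysis
    src/        # source code
        analysis/   # scripts that generate results
        runall.py   # driver script: all steps from start to finish
    bin/        # compiled programs, if any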

69 of 105

Name all files with meaning and consistently

Do not use sequential numbers only, e.g. results1.csv, results2.csv etc.

File names should be:

  • Machine readable
  • Human readable
  • Descriptive of their contents
  • Consistent

Example:
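Illustrative names that satisfy all four criteria (the experiment itself is hypothetical):

2024-01-15_mouse-liver_rnaseq_rep1.fastq.gz
2024-01-15_mouse-liver_rnaseq_rep2.fastq.gz
2024-02-03_mouse-brain_rnaseq_rep1.fastq.gz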

70 of 105

Project organisation

5.1 Naming and sorting, a discussion

71 of 105

Project organisation

Some helpful organisation tools

  • Integrated Development Environments (IDEs)
    • PyCharm, VSCode for Python
    • RStudio for R
  • Notebooks
    • Jupyter
    • R markdown
    • Alternative textual formats
  • Cookie Cutter project templates

72 of 105

Keeping track of changes

30 mins

73 of 105

6.1 Issues you can relate to

http://phdcomics.com/

Wit and wisdom from Jorge Cham

74 of 105

Version control is not the only option

It can have a steep learning curve

Two sets of recommendations:

  1. Systematic manual approach
  2. Actual version control

75 of 105

Some good rules

  • Backup almost everything created by a human being as soon as it exists
  • Keep changes between versions small and self-contained
  • Share changes frequently between collaborators
  • Create, maintain and use a checklist for saving and sharing changes
  • Share projects in folders that are mirrored off the working machine
    • Institutional shared cloud or cluster
    • Remote version control hosting (GitHub/GitLab)

76 of 105

Some good rules

  • Backup almost everything created by a human being as soon as it exists
  • Keep changes between versions small and self-contained
  • Share changes frequently between collaborators
  • Create, maintain and use a checklist for saving and sharing changes
  • Share projects in folders that are mirrored off the working machine
    • Institutional shared cloud or cluster
    • Remote version control hosting (GitHub/GitLab)

Good documentation of changes includes:

  • Date of the change
  • Author of the change
  • List of affected files
  • A short description of the nature of the introduced changes AND/OR motivation behind the change
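An illustrative CHANGELOG entry containing all four elements (content is hypothetical):

2024-03-12, Jane Doe
Affected: src/clean_data.py, results/summary.csv
Fixed an off-by-one error in the date parser and regenerated the
summary table; values for January shift by one day.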

77 of 105

Manual versioning

  1. Add a file called CHANGELOG.txt to the project’s docs subfolder, and make dated notes about changes to the project in this file in reverse chronological order (i.e., most recent first). This file is the equivalent of a lab notebook.

78 of 105

Manual versioning

  • Add a file called CHANGELOG.txt to the project’s docs subfolder, and make dated notes about changes to the project in this file in reverse chronological order (i.e., most recent first). This file is the equivalent of a lab notebook.

  • Copy the entire code project whenever a significant change has been made (i.e., one that materially affects the results), and store that copy in a read-only sub-folder whose name reflects the date in the area that’s being synchronized.

79 of 105

Manual versioning

  • Add a file called CHANGELOG.txt to the project’s docs subfolder, and make dated notes about changes to the project in this file in reverse chronological order (i.e., most recent first). This file is the equivalent of a lab notebook.

  • Copy the entire code project whenever a significant change has been made (i.e., one that materially affects the results), and store that copy in a read-only sub-folder whose name reflects the date in the area that’s being synchronized.
  • Image by storyset on Freepik

80 of 105

Version Control

  • Use a version control system

6.2 Explore the GitHub repositories

Git = Version control system

  • System that records changes made to a project (group of files) over time
    • Keeps versions in a smart way
    • Allows reverting to a previous state (of the project)
  • Essential for (programming) projects involving several people
    • Allows branching (working separately on different features)
    • Allows comparing and merging changes
    • Allows synchronization of repositories in different places (distributed version control)
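The corresponding day-to-day git commands (a minimal sketch; file names are illustrative):

git init                              # start tracking the project
git add analysis.py                   # stage a changed file
git commit -m "Add first-pass analysis script"
git log --oneline                     # inspect the history
git diff HEAD~1                       # compare with the previous version
git checkout HEAD~1 -- analysis.py    # revert one file to an older state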

81 of 105

Version Control

  • Use a version control system

  • But don’t use it as-is for:
    1. Very large files → Git LFS
    2. Binary file formats, e.g. Microsoft Office files → Git LFS

82 of 105

Version Control

git.embl.de

GitLab = DevOps (development + operations) platform

[Diagram: the DevOps cycle: plan → implement → test → package → release]

83 of 105

Git resources (at EMBL)

  • Git.embl.de
  • Blog posts:

84 of 105

Manuscripts

20 mins

85 of 105

Collaborative document writing

7.1 List the tools

86 of 105

Collaborative document writing

7.1 List the tools

  • Offline document that is shared by email
  • Online document in GDrive or Office tools
  • Online Markdown-based document
    • Consider Manubot
  • Online LaTeX document in Overleaf

87 of 105

Making email-based workflows work

Give your manuscript file an informative name, and update the date and initials of last edit, for example best_practices_manuscript_2013-12-01_GW.doc would be the version edited by GW on 1st December 2013.

Choose one person to coordinate (i.e. the lead author), who is responsible for merging comments and sending out updated manuscripts to all other co-authors. And find a good way to manage references.

88 of 105

Beyond email-based workflows

Aims:

  • Ensure that text is accessible to yourself and others now and in the future by making a single main document that is available to all coauthors at all times.
  • Reduce the chances of work being lost or people overwriting each other’s work.
  • Make it easy to track and combine contributions from multiple collaborators.
  • Avoid duplication and manual entry of information, particularly in constructing bibliographies, tables of contents, and lists.
  • Make it easy to regenerate the final published form (e.g., a PDF) and to tell if it is up to date.
  • Make it easy to share that final version with collaborators and to submit it to a journal.
  • Make reference management easier

89 of 105

Beyond email-based workflows

Solution 1: Single Main Document Online

  • Google Docs or MS OneDrive

Solution 2: Text-based documents under Version Control

  • LaTeX or Markdown, then converted to PDF
  • Possible to combine manuscripts with data analysis in notebooks
  • Project organisation is crucial here
  • Separate sentences by linebreaks to simplify version control

90 of 105

Supplementary materials

  • Do NOT rely on PDF
  • Where possible, use only open formats (CSV, JSON, YAML, XML, HDF5, SVG)
  • Separate the different data tables, code, and other files

91 of 105

Authorship: not only manuscripts

  • Several different types of contributions to science

Credits: Rachel Samberg on slideshare.net

92 of 105

Authorship: not only manuscripts

  • Several different types of contributions to science
  • How to track them?

ORCID: a free, unique, persistent author identifier

7.2 Do you have an ORCID? If not, create one!

93 of 105

Authorship: not only manuscripts

  • Several different types of contributions to science
  • How to track them?
  • How to recognise them?

94 of 105

Implementing responsible research assessment

EMBL CV instructions

  • variety of research outputs
  • brief narrative (max. 300 words) summarising the impact and importance of your main outputs

95 of 105

What’s next

20 mins

96 of 105

Overview of what was covered

Data management

Software

Collaboration

Project organization

Keeping track of changes

Manuscripts

Relying on a set of shared principles:

  • Planning: plan out how to work. Any plan that you can stick to is better than no plan.
  • Modular organization: organize your data, code, and projects into coherent modules.
  • Names: give good names to your files, folders, and functions, that make them easy to find and to understand.
  • Documentation: explicitly write down everything you and your collaborators need to know in the future.

97 of 105

Some closing thoughts

  • Open practices are good practices
  • Research is changing
  • Local resources at EMBL

98 of 105

Open practices are good practices

99 of 105

Open Science practices

100 of 105

Research is changing:

“papers” aren’t paper any more

101 of 105

Resources: PLoS Computational Biology

102 of 105

  1. Keep Track of How Every Result Was Produced
  2. Avoid Manual Data Manipulation Steps
  3. Track Versions of All External Programs Used
  4. Version Control Your Protocols/Scripts
  5. Record All Intermediate Results
  6. Track Relevant Sources of Randomness
  7. Store Raw Data behind Plots
  8. Allow Layers of Detail to Be Inspected
  9. Connect Statements to Underlying Results
  10. Share Scripts, Runs, and Results

Sandve et al., Ten simple rules for reproducible computational research, PLoS Comput. Biol. 2013, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

103 of 105

Resources: External Training

Many institutions and societies run training courses, for example:

For more courses:

  • look at the posters on the walls around your lab
  • ask for advice!
  • google “bioinformatics course”, etc.

104 of 105

Resources: External sites

  • The Carpentries worldwide, https://carpentries.org/
  • Stack Overflow, https://stackoverflow.com/
  • Your colleagues!
  • What are your favourite resources?

105 of 105

Peer training time

Let’s discuss best practices