Good enough practices in research computing
How to not lose your stuff, and generally be more efficient
© Lisanna Paladin - CC-BY 4.0
Bio-IT - bio-it.embl.de
Slides adapted from Edward Wallace 2020, Licence: CC BY 4.0
Who am I and why do I care?
Good habits are essential for collaboration, and for your future self
This lesson’s tool
Introduction
15 mins
The Data Life Cycle
Data Management Challenges in Life Sciences
From: ENGYTE
Data scattered in multiple repositories
From: Liang-Bo Wang’s Blog
Challenges in data-heavy biology
Good news!
Basic principles
This lesson has episodes covering data management, software, project organization, collaboration, keeping track of changes, and manuscripts.
“Good Enough Practices” rely on a shared set of principles that span these areas:
Lesson outline
Wilson et al., PLoS Comput. Biol. 2017, https://doi.org/10.1371/journal.pcbi.1005510
Data Management
45 mins
Data Management
Definition
Data management is the process of storing, documenting, organizing, and sharing the data created and collected during a project.
Data management problems
2.1 Comment on the ways to back up data
Data management practices
2.2 What “data” are we discussing now?
GDPR: a parenthesis
From: CyberPilot
Metadata: “data that describes data”
Data management practices
2.3 Which file formats do you store your data in?
2.4 Do you have a (files and variables) naming convention?
Data management practices
2.3 Which file formats do you store your data in?
File formats: Convert data from closed, proprietary formats to open, non-proprietary formats that ensure machine readability across time and computing setups. Good options include CSV for tabular data; JSON, YAML, or XML for non-tabular data such as graphs; HDF5 for certain kinds of structured data; and OME-Zarr for image data.
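A minimal sketch of writing data in open formats, using only the Python standard library. The sample values and the function names `table_to_csv` and `record_to_json` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical tabular data: one row per sample.
rows = [
    {"sample_id": "S1", "species": "mouse", "weight_g": 21.5},
    {"sample_id": "S2", "species": "mouse", "weight_g": 19.8},
]

# Hypothetical non-tabular data: a nested description of an experiment.
experiment = {
    "name": "pilot",
    "samples": ["S1", "S2"],
    "settings": {"temperature_c": 37},
}


def table_to_csv(rows):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def record_to_json(record):
    """Return the record as indented, human-readable JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)


csv_text = table_to_csv(rows)
json_text = record_to_json(experiment)
print(csv_text)
print(json_text)
```

Both outputs are plain text, so they can be opened in any editor and read by most programming languages.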
Data management practices
2.4 Do you have a (files and variables) naming convention?
Filenames: Store especially useful metadata as part of the filename itself, while keeping the filename regular enough for easy pattern matching. For example, a filename like 2016-05-alaska.csv makes it easy for both people and programs to select by year or by location.
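To illustrate why regular filenames help, here is a small sketch that selects files by year or location with a regular expression. The file list and the `select` helper are hypothetical; only the `2016-05-alaska.csv` pattern comes from the text above.

```python
import re

# Pattern for names like 2016-05-alaska.csv: year, month, location.
FILENAME_PATTERN = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<location>[a-z]+)\.csv$"
)

# Hypothetical directory listing.
filenames = [
    "2016-05-alaska.csv",
    "2016-06-alaska.csv",
    "2017-05-yukon.csv",
    "notes.txt",
]


def select(filenames, year=None, location=None):
    """Return the filenames that follow the convention and pass the filters."""
    selected = []
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match is None:
            continue  # does not follow the naming convention
        if year is not None and match["year"] != year:
            continue
        if location is not None and match["location"] != location:
            continue
        selected.append(name)
    return selected


print(select(filenames, year="2016"))
print(select(filenames, location="yukon"))
```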
Data management practices
2.4 Do you have a (files and variables) naming convention?
Variable names: Replace inscrutable variable names and artificial data codes with self-explaining alternatives, e.g., rename variables called name1 and name2 to first_name and family_name, recode the treatment variable from 1 vs. 2 to untreated vs. treated, and replace artificial codes for missing data, such as “-99”, with NA, a code used in most programming languages to indicate that data is “Not Available”.
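The renaming and recoding described above could look roughly like this. The records are invented, and writing missing values as the text "NA" is an assumption that suits CSV output (inside Python itself, `None` would be the usual choice).

```python
# Hypothetical raw records using cryptic names and codes.
raw_records = [
    {"name1": "Ada", "name2": "Lovelace", "treatment": 1, "score": 12},
    {"name1": "Alan", "name2": "Turing", "treatment": 2, "score": -99},
]

RENAMED_COLUMNS = {"name1": "first_name", "name2": "family_name"}
TREATMENT_LABELS = {1: "untreated", 2: "treated"}
MISSING_CODE = -99


def clean(record):
    """Return a new, self-explanatory record; the raw record is left untouched."""
    cleaned = {RENAMED_COLUMNS.get(key, key): value for key, value in record.items()}
    cleaned["treatment"] = TREATMENT_LABELS[cleaned["treatment"]]
    if cleaned["score"] == MISSING_CODE:
        cleaned["score"] = "NA"  # explicit missing-value marker instead of -99
    return cleaned


clean_records = [clean(record) for record in raw_records]
print(clean_records)
```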
Data management practices
Create the dataset you wish you had received. The goal here is to improve machine and human readability, but not to do vigorous data filtering or add external information. Machine readability allows automatic processing using computer programs, which is important when others want to reuse your data.
Non-destructive transformations:
Analysis-friendly data
2.5 Which of the table layouts is analysis friendly? Enter your answers in the collaborative document.
Analysis-friendly data
AND record all the steps used to process data:
AND anticipate the need to use multiple tables
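One way to record the processing steps is to keep them in a script, so the code itself documents what was done and in which order. The sketch below is only an illustration: the step functions and the log format are assumptions, not part of the lesson.

```python
# Each processing step is a small named function.

def drop_empty_rows(rows):
    """Remove rows in which every value is empty."""
    return [row for row in rows if any(value != "" for value in row.values())]


def strip_whitespace(rows):
    """Trim leading and trailing spaces from every value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


STEPS = [drop_empty_rows, strip_whitespace]


def process(raw_rows):
    """Apply all steps to a copy of the data and return (rows, log)."""
    rows = [dict(row) for row in raw_rows]  # never modify the raw data
    log = []
    for step in STEPS:
        before = len(rows)
        rows = step(rows)
        log.append(f"{step.__name__}: {before} -> {len(rows)} rows")
    return rows, log


raw_rows = [{"id": " a1 ", "value": "3"}, {"id": "", "value": ""}]
processed_rows, processing_log = process(raw_rows)
print(processing_log)
```

The raw rows stay unchanged, which keeps the transformation non-destructive.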
Data submission
2.6 Comment on the ways to share data
Data submission
What is a DOI:
Data submission
Places to share data, with DOIs:
Data submission
2.7 What are the best repositories for your type of data?
Data Management Plans
A formal, living document that outlines what you will do with your data at all stages of your research project
→ A requirement at EMBL
→ A requirement for several funders
Data Management Plans
2.8 Let’s draft your DMP!
10 mins
Data Management Plans
Resources at EMBL: STOCKS
Resources at EMBL: the Data Management App
Code and software
30 mins
Software problems
Neil Ferguson's COVID code Twitter thread
Smaller-scale versions of this problem are more common:
What is research software?
3.1 What can go wrong with (research) code?
Helpful comments
Short explanatory comments/documentation can go a long way.
* The reader doesn’t need to know what all these words mean for a comment to be useful. This comment tells the reader what words they need to look up: fuzzing, blobs, and so on.
Write functions
Break your script into functions (you will reuse them).
Human short-term memory holds only about seven items at once, so keep each function short and make your life easier.
Write functions
Consider documenting functions and other code. Each programming language has its own syntax for this:
R
JavaScript
3.2 Writing helpful explanatory comments
Python
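In Python, function documentation goes in a docstring directly below the `def` line. A minimal sketch with a made-up function (`mean_gc_content` is hypothetical, and the layout of the docstring is just one common convention):

```python
def mean_gc_content(sequences):
    """Return the mean GC fraction of a list of DNA sequences.

    Parameters
    ----------
    sequences : list of str
        Non-empty DNA sequences made of the letters A, C, G and T.

    Returns
    -------
    float
        Mean fraction of G and C bases, between 0 and 1.
    """
    fractions = [
        (seq.upper().count("G") + seq.upper().count("C")) / len(seq)
        for seq in sequences
    ]
    return sum(fractions) / len(fractions)


# The docstring is available at run time, e.g. through help(mean_gc_content).
print(mean_gc_content(["GGCC", "ATAT"]))
```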
Remove duplications
Write and re-use functions instead of copying and pasting code, and use data structures like lists (or classes) instead of creating many closely-related variables, e.g. create score = (1, 2, 3) rather than score1, score2, and score3.
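A short sketch of the same idea in Python. The `normalise` function is an invented example of a calculation that might otherwise be copied and pasted once per variable.

```python
# Instead of score1 = 1, score2 = 2, score3 = 3 ...
scores = (1, 2, 3)


def normalise(values):
    """Scale values so that the largest one becomes 1.0."""
    largest = max(values)
    return [value / largest for value in values]


# One function call replaces three copy-pasted calculations.
normalised_scores = normalise(scores)
print(normalised_scores)
```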
Also avoid duplication of work. Consider consulting catalogues of libraries:
Always search for well-maintained software libraries that do what you need before writing new code yourself, but test libraries before relying on them.
Give functions and variables meaningful names
3.3 Name a function and a variable
Remember to follow each language’s conventions for names, such as net_charge for Python and netCharge for Java. These conventions are often described in “style guides”, and can even be checked automatically.
Tab completion helps for long variable names. No excuses.
Make your life easier by using a good programming editor or integrated development environment (IDE). More on this later.
Do not comment and uncomment sections of your code to control a program’s behavior
This is error-prone and makes it difficult or impossible to automate analyses. Instead, put if/else statements in the program to control what it does, and use input arguments on the command line.
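A minimal sketch of this approach with Python's standard `argparse` module. The option name `--normalise` and the step descriptions are hypothetical; a real script would run actual analysis code instead of building a list of strings.

```python
import argparse


def build_parser():
    """Define the command-line options that control the analysis."""
    parser = argparse.ArgumentParser(description="Run the analysis.")
    parser.add_argument("input_file", help="path to the input data")
    parser.add_argument(
        "--normalise",
        action="store_true",
        help="normalise the values before the analysis",
    )
    return parser


def run(args):
    """Return a description of the steps that would be executed."""
    steps = [f"read {args.input_file}"]
    if args.normalise:  # controlled by a flag, not by commenting code in and out
        steps.append("normalise")
    steps.append("analyse")
    return steps


# The argument list is passed explicitly here to keep the example self-contained.
# In a real script, call parse_args() without arguments to read the command line.
args = build_parser().parse_args(["data.csv", "--normalise"])
print(run(args))
```

The same script can then be run with or without `--normalise` from the command line or from a workflow, with no edits to the code.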
Provide a simple example
Include a small example input together with its expected output. Running the code on this example is a simple integration test. It is also very useful when you run the same code on multiple machines. Automate it for extra reassurance.
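Such an automated check could be sketched as follows. The `count_words` function and the example data are invented for illustration; the point is only that the example input and its expected output live next to the code.

```python
def count_words(text):
    """Count how often each lower-cased word occurs in the text."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


# Small example input and the output we expect from it.
EXAMPLE_INPUT = "the cat saw the dog"
EXPECTED_OUTPUT = {"the": 2, "cat": 1, "saw": 1, "dog": 1}


def test_example():
    """Run the code on the example and compare with the expected output."""
    assert count_words(EXAMPLE_INPUT) == EXPECTED_OUTPUT


test_example()
print("example reproduced")
```

A test runner such as pytest can collect functions named like `test_example` automatically, so the check can run on every machine where the code is installed.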
Code can be managed like data
Your code is like your data and also needs to be managed, backed up, and shared. Submit code to a reputable DOI-issuing repository, just as you do with data.
Make dependencies and requirements explicit
This is usually done on a per-project rather than per-program basis, i.e., by adding a file called something like requirements.txt to the root directory of the project, or by adding a “Getting Started” section to the README file.
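As a rough sketch, the pinned lines for a requirements.txt can be generated from the versions installed in the current environment with the standard `importlib.metadata` module. The helper names and the package list are assumptions made for this example.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each installed package name to its version; skip missing ones."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return versions


def requirement_lines(versions):
    """Format pinned requirements, one 'name==version' per line, sorted by name."""
    return [f"{name}=={version}" for name, version in sorted(versions.items())]


# Hypothetical list of the packages a project imports.
print("\n".join(requirement_lines(installed_versions(["numpy", "scipy"]))))
```

Saving the printed lines as requirements.txt records which versions the analysis was run with.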
Record dependencies
Python example
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - pip
  - pip:
    - git+https://github.com/someuser/someproject.git@d7b2c7e
    - git+https://github.com/anotheruser/anotherproject.git@sometag
Record computational steps
Snakemake example
# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    …
Record environments
Singularity/Apptainer example (definition file bootstrapped from a Docker image)
Bootstrap: docker
From: ubuntu:latest

%post
    export VIRTUAL_ENV=/app/venv
    apt-get update && \
    apt-get install -y --no-install-recommends \
    ...
    pip install --no-cache-dir -r /app/requirements.txt

%files
    ...

%environment
    ...

%runscript
    . $VIRTUAL_ENV/bin/activate
    python /app/app.py
Collaboration
30 mins
Create an overview of your project
Create a short file in the project’s base directory that explains the purpose of the project. This file (generally called README, README.txt, README.md or something similar) should contain:
Describe how to contribute to the project
You can use the README.
A separate CONTRIBUTING file is used for community projects and usually includes:
4.1 Evaluate two README files
Create a shared “to-do” list
This can be a plain text file called something like notes.txt or todo.txt, or you can use sites such as GitHub or Bitbucket to create a new issue for each to-do item. (You can even add labels such as “low hanging fruit” to point newcomers at issues that are good starting points.)
Whatever you choose, describe the items clearly so that they make sense to newcomers.
Decide on communication strategies
Check collaborations with sensitive data
Make the license explicit
To license something, you need to own its intellectual property.
Copyright: exclusive rights to make certain uses of original expressions for a limited period of time
Unlicensed code means that users have been granted no rights or permissions to copy, distribute, modify, or adapt the work. Add a license as early as possible.
Make the license explicit
Intellectual Property at EMBL
What is a licence?
© Edward Wallace 2020, Licence: CC BY 4.0
Make the license explicit
Have a LICENSE file in the project
For data and text:
For software:
Be careful about the “no commercial use” option. It applies to you too.
Check tldrlegal.com
Make the license explicit
4.2 Checking common licenses
Make the project citable
A CITATION file describes how to cite the project as a whole, and where to find related datasets, code, figures etc.
This is the one for the khmer project:
Project organisation
20 mins
README files are magic
You look at a directory (or project), you read it, and it tells you what you need to know
… as long as you keep them updated!
What is the secret? Good planning.
Plan and implement your project structure
Divide your work into projects, based on which research efforts share data or code.
Plan and implement your project structure
Recommendations for project organisation from Noble (2009); alternatives: The Turing Way, a template, etc.
Name all files with meaning and consistently
Do not use sequential numbers only, e.g. results1.csv, results2.csv etc.
File names should be:
Example:
Project organisation
5.1 Naming and sorting, a discussion
Project organisation
Some helpful organisation tools
Keeping track of changes
30 mins
6.1 Issues you can relate to
Version control is not the only option
It can have a steep learning curve
Two sets of recommendations:
Some good rules
Good documentation of changes includes:
Manual versioning
Version Control
6.2 Explore the GitHub repositories
Git = Version control system
Version Control
git.embl.de
[Figure: DevOps cycle. Stage labels: Implement, Plan, Test, Release, Package]
DevOps (development + operations) platform
GitLab = DevOps platform
Git resources (at EMBL)
Manuscripts
20 mins
Collaborative document writing
7.1 List the tools
Making email-based workflows work
Give your manuscript file an informative name, and update the date and initials of last edit, for example best_practices_manuscript_2013-12-01_GW.doc would be the version edited by GW on 1st December 2013.
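If you want to build such names consistently, a small helper can do it. This is only a sketch: `manuscript_filename` and its parameters are hypothetical, and the underscore-separated layout simply follows the example above.

```python
from datetime import date


def manuscript_filename(title, edited_on, initials, extension="doc"):
    """Build a name like best_practices_manuscript_2013-12-01_GW.doc."""
    return f"{title}_{edited_on.isoformat()}_{initials}.{extension}"


print(manuscript_filename("best_practices_manuscript", date(2013, 12, 1), "GW"))
```

Because the date is written as year-month-day, sorting the files by name also sorts them chronologically.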
Choose one person to coordinate (i.e. the lead author), who is responsible for merging comments and sending out updated manuscripts to all other co-authors. And find a good way to manage references.
Beyond email-based workflows
Aims:
Beyond email-based workflows
Solution 1: Single Main Document Online
Solution 2: Text-based documents under Version Control
Supplementary materials
Authorship: not only manuscripts
Credits: Rachel Samberg on slideshare.net
Authorship: not only manuscripts
Free, unique, persistent author identifier
7.2 Do you have an ORCID? If not, create one!
Authorship: not only manuscripts
Implementing responsible research assessment
EMBL CV instructions
What’s next
20 mins
Overview of what was covered
Data management
Software
Collaboration
Project organization
Keeping track of changes
Manuscripts
Relying on a set of shared principles:
Some closing thoughts
Open practices are good practices
Open Science practices
Research is changing:
“papers” aren’t paper any more
Resources: PLoS Computational Biology
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
Ten simple rules for reproducible computational research
Resources: External Training
Many institutions and societies run training courses, for example:
For more courses:
Resources: External sites
Peer training time
Let’s discuss best practices