1 of 13

Bioinformatics Workflows and Package Management

Caleb Kibet

2 of 13

Package Management

  • Conda
  • Mamba
  • Modules for the HPC: http://modules.sourceforge.net/

Alternatives to Conda can be categorized into system-wide (Debian-Med, Gentoo Science, BioLinux, and Homebrew) and per-user (EasyBuild, GNU Guix, and BioBuilds) installation mechanisms.

There are alternative slides on package management using Conda.

3 of 13

What is a Bioinformatics Pipeline?

A series of steps to analyse data

Can be simple or very complex

4 of 13

Example

5 of 13

Bioinformatics pipelines: what’s the problem?

Example:

user@machine:~> bwa mem -t $NSLOTS -M $BWA_INDEX_REF -R "@RG\tID:$PU\tPL:illumina\tPU:$PU\tSM:$SAMPLE" $READS1 $READS2 | samblaster --splitterFile >(samtools view -hSu /dev/stdin | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.sr.bam) --discordantFile >(samtools view -hSu /dev/stdin | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.disc.bam) | samtools view -hSu /dev/stdin | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.raw.bam

6 of 13

What is the problem?

  • 8 processes (bwa, samblaster, and three samtools view | samtools sort pairs)
    • Several are multithreaded
    • Potentially using different threading standards (pthreads, OpenMP, ...)
  • Efficiency of parallel executables unknown
    • No idea of optimal number of threads to assign to each (assuming we can)
  • Linux pipes don’t allow straightforward control over parallel execution
    • Mostly relying on operating system to do the right thing
    • Data flow through pipes and pipe buffers adds additional complications
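
To make the thread problem concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that `bwa mem -t` and each `samtools sort -@` ask for about `$NSLOTS` threads, and that samblaster and each `samtools view` use one thread; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def requested_threads(nslots):
    """Rough thread request per group of processes in the one-liner.

    Assumption: thread flags are taken at face value and every other
    process is counted as a single thread.
    """
    return {
        "bwa mem -t NSLOTS": nslots,
        "samblaster": 1,
        "samtools view (3 processes)": 3,
        "samtools sort -@ NSLOTS (3 processes)": 3 * nslots,
    }


def oversubscription(nslots, cores):
    """Ratio of requested threads to available cores."""
    return sum(requested_threads(nslots).values()) / cores


if __name__ == "__main__":
    nslots = 8
    total = sum(requested_threads(nslots).values())
    print(f"NSLOTS={nslots}: about {total} threads requested")
    print(f"On {nslots} cores: {oversubscription(nslots, nslots):.1f}x oversubscribed")
```

Under these assumptions, asking the scheduler for 8 slots leads to roughly 36 threads competing for 8 cores, and the pipe gives no direct way to rebalance them.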

7 of 13

Workflows

Tools for automating bioinformatics analyses

8 of 13

How?

  • Traditionally from script files
    • Make, bash scripts, Perl, Python
  • Now moving to frameworks or applications
    • Web-based
      • Galaxy
    • GUI and command-line
      • Apache Taverna
    • Command-line
      • Nextflow, Snakemake
    • Common workflow languages (e.g. CWL)
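
What Make-style tools and workflow frameworks add over a plain script can be sketched in a few lines of Python: each step declares its inputs and output, dependencies are built first, and a step is re-run only when its output is missing or older than an input. This is a minimal illustration, not how Make, Snakemake, or Nextflow are implemented; `Rule`, `needs_update`, `build`, and the toy actions are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One pipeline step: how to make `output` from `inputs`."""
    output: str
    inputs: List[str]
    action: Callable[[List[str], str], None]


def needs_update(rule: Rule) -> bool:
    """A step is stale if its output is missing or older than any input."""
    if not os.path.exists(rule.output):
        return True
    out_mtime = os.path.getmtime(rule.output)
    return any(os.path.getmtime(i) > out_mtime for i in rule.inputs)


def build(target: str, rules: Dict[str, Rule], log: List[str]) -> None:
    """Build `target`, first building any inputs that have their own rule."""
    rule = rules.get(target)
    if rule is None:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target}")
        return  # a plain source file, nothing to do
    for dep in rule.inputs:
        build(dep, rules, log)
    if needs_update(rule):
        rule.action(rule.inputs, rule.output)
        log.append(rule.output)


def upper(inputs, output):
    """Toy step: upper-case the input text."""
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.write(src.read().upper())


def count_lines(inputs, output):
    """Toy step: write the number of lines in the input."""
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.write(str(sum(1 for _ in src)))


def demo():
    """Run the two-step toy pipeline twice; return what each run executed."""
    with tempfile.TemporaryDirectory() as d:
        raw = os.path.join(d, "reads.txt")
        up = os.path.join(d, "reads.upper.txt")
        n = os.path.join(d, "reads.count.txt")
        with open(raw, "w") as fh:
            fh.write("acgt\nttga\n")
        rules = {up: Rule(up, [raw], upper), n: Rule(n, [up], count_lines)}
        first, second = [], []
        build(n, rules, first)   # runs both steps, in dependency order
        build(n, rules, second)  # everything is up to date, runs nothing
        return ([os.path.basename(p) for p in first],
                [os.path.basename(p) for p in second])


if __name__ == "__main__":
    print(demo())
```

Asking for the final file is enough: the order of the steps follows from the declared inputs and outputs, and a second run does no work.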

9 of 13

Scalable pipeline components. A pipeline consists of third-party tools, data parsers, and data transformations. (Fjukstad and Bongo, 2017)

10 of 13

Advantages of workflows

Workflow languages use the concept of analysis preservation, which offers several advantages:

  • Memory: Keeps the architecture of the work for easy re-analysis
  • Portability: Easy to port tools and results to other systems
  • Modularity: Each step in the pipeline is a rule, which can be updated or changed
  • Reproducibility: Snakemake, for example, lets you run the same analysis on different data sets.
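
The idea of running the same analysis on different data sets can be illustrated by a filename pattern that is filled in once per sample. The sketch below is plain Python, loosely modelled on the wildcard idea used by workflow tools; `expand_targets` is a hypothetical helper, and the sample names are made up.

```python
from itertools import product


def expand_targets(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


if __name__ == "__main__":
    # Hypothetical samples; the suffixes match the BAM files in the one-liner.
    targets = expand_targets(
        "{sample}.{kind}.bam",
        sample=["sampleA", "sampleB"],
        kind=["raw", "sr", "disc"],
    )
    for target in targets:
        print(target)
```

Adding a new data set then means adding one sample name, not copying and editing the pipeline.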

11 of 13

Common Workflows and Containers

Singularity

Snakemake

12 of 13

Which framework do you choose?

  • There are many frameworks out there.
    • Some are professional, others not.
    • Some are no longer maintained, or are maintained by only a few developers.
  • Many frameworks pass those filters. We have the luxury of choosing one among many good frameworks!
    • You need to define your requirements in terms of portability, language, reproducibility, parallelization, etc.

13 of 13

Hands-on with Snakemake