1 of 19

Codefest Report

July 20-21, 2017

www.open-bio.org/wiki/Codefest_2017

2 of 19

What is Codefest?

  • 2-day BOSC collaborative work session
  • 60+ community members
  • Learning and training
  • Building relationships
  • Writing code
  • Everyone is welcome
  • 8th successful year

3 of 19

How does Codefest work?

  • Free
  • You get wireless, power, space, coffee, food
  • Open source collaborators
  • Self-organize around projects of interest
  • Produce useful code and motivation
  • Report on accomplishments

4 of 19

Thank you

  • brmlab (https://brmlab.cz/)
  • Matúš Kalaš, Heather Wiencko
  • Repositive and Seven Bridges
  • Institute of Organic Chemistry and Biochemistry
  • OpenBio and BOSC Community

5 of 19

Themes from the Codefest

  • New contributors
  • Autonomy
  • Last-mile development
  • Standards and coordination
  • Fixing long-standing and neglected bugs

6 of 19

Table column ordering

Specify where columns should be placed in modules, config and report

Phil Ewels, Rickard Hammarén, Robin Andeer, Tim Booth, Dennis Schwartz, Dimitri Desvillechabrol, Amandine Perrin, Rowland Mosbergen, Murray Wham, Markus Ankenbrand, Raony Guimaraes, Tom Walsh

Pull requests: 18 opened, 14 merged

Issues: 9 opened, 5 closed

Scout integration

MultiQC reports embedded within Scout clinical genomics browser

Datatable by Musavvir Ahmed, group by Mello, help by i cons, Branch by Stanislav Levin from the Noun Project

New modules!

VCFtools, nonpareil, bcl2fastq(!), AfterQC

Module grouping

Just run modules related to a specific data type with new module tags.

Module help texts

New drop-down texts above plots in reports to describe what’s being shown

Collect software versions

Core MultiQC support for scraping software versions from logs
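
The idea of scraping software versions from logs can be illustrated with a minimal Python sketch. The function name, the regular expression, and the sample log lines are all hypothetical; this is not MultiQC's actual API, only one plausible way such scraping could work.

```python
import re
from typing import Optional

# Assumed log convention: a tool name followed by "version " or "v" and a dotted number.
VERSION_RE = re.compile(
    r"(?P<tool>[A-Za-z][\w.-]*)\s+(?:version\s+|v)(?P<version>\d+(?:\.\d+)+)"
)


def scrape_version(log_text: str, tool: str) -> Optional[str]:
    """Return the first version string reported for `tool` in a log, or None."""
    for match in VERSION_RE.finditer(log_text):
        if match.group("tool").lower() == tool.lower():
            return match.group("version")
    return None


# Made-up log content, for illustration only.
sample_log = "Starting run\nFastQC v0.11.5\nbcl2fastq version 2.19.0\n"
fastqc_version = scrape_version(sample_log, "fastqc")
```

Real tool logs vary widely, so a practical implementation would likely need one pattern per tool rather than a single shared expression.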

7 of 19

  • Bug fixing (4 PRs merged, 3 in progress)
  • Biopython architecture discussions
  • Python 2 retirement by 2020
  • Introduction to Snakemake + Sequanix

Biopython, Snakemake, and "other Python things"

Members: Wibowo Arindrarto, Kai Blin, Spencer Bliven, Christian Brueffer, Peter Cock, Thomas Cokelaer, Joe Greener,

Sequanix: GUI in PyQt for Snakemake pipelines; http://sequana.readthedocs.io

8 of 19

Protein structure

  1. Integrate Biopython structural entities with nglview to allow interactive visualisation of protein structures in Jupyter notebooks

Members: Spencer Bliven, Alexander Rose, David Sehnal, Joe Greener

9 of 19

Protein structure

  1. Molecular Query Language
    • Formal specification for general selection languages
    • Interchange Format to interconvert between existing query languages

User query

    Jmol: select within(5, [HEM]:A)
    PyMol: select chain A within 5 of resn HEM

Abstract Syntax Tree

    isCloseTo(
      chains('A'),
      residues('HEM'),
      5)

MolQL Interchange Format

    { kind: "isCloseTo",
      args: [
        { kind: "chains", options: ["A"] },
        { kind: "residues", options: ["HEM"] }
      ],
      options: { maxDistance: 5 }
    }

Query Result

https://github.com/MolQL/molql
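
To make the relationship between the nested interchange structure and the function-call form concrete, here is a small Python sketch. It copies the example query from the slide into a dict and renders it back as a call expression. The `render` helper is hypothetical and is not part of the MolQL code base; it only assumes the `kind` / `args` / `options` layout shown above.

```python
from typing import Any, Dict

# The example query from the slide, written as a nested Python dict.
query: Dict[str, Any] = {
    "kind": "isCloseTo",
    "args": [
        {"kind": "chains", "options": ["A"]},
        {"kind": "residues", "options": ["HEM"]},
    ],
    "options": {"maxDistance": 5},
}


def render(node: Dict[str, Any]) -> str:
    """Render a query node as a function-call expression (hypothetical helper)."""
    parts = [render(child) for child in node.get("args", [])]
    options = node.get("options", [])
    # In the slide, selectors carry a list of names while isCloseTo carries a dict.
    values = options.values() if isinstance(options, dict) else options
    parts.extend(repr(value) for value in values)
    return f"{node['kind']}({', '.join(parts)})"


rendered = render(query)
```

A translator to Jmol or PyMol syntax would walk the same tree with a different rule per `kind`.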

10 of 19

Exploring new uses and executor APIs

  • Tested the experimental Kubernetes support in Nextflow
  • Determined the issues in setting up a correctly configured Kubernetes cluster for use with a workflow
  • Explored Google Cloud usage
  • Discussed incorporating the Global Alliance for Genomics and Health (GA4GH) API as a Nextflow executor
  • Built the GA4GH API in Java using the protocol buffer definitions
  • Began exploring the API
  • Group members:
    • Paolo Di Tommaso
    • Konstantinos Krampis
    • Kevin Sayers

11 of 19

Workflows including Apache Spark based analyses

  • Discussed supporting provisioning of Apache Spark clusters vs. delegating via environment profiles in CWL
  • Bioconda recipe for ADAM, https://github.com/bigdatagenomics/adam
  • Support for ADAM and Avocado (variant caller) in bcbio-nextgen in progress
  • Demonstrated Apache Spark runtime configuration via profiles in Nextflow
  • Group members: Michael Heuer, Brad Chapman, Roman Valls Guimerà, Paolo Di Tommaso

12 of 19

Reproducible software deployment

  • Part of creating reproducible solutions is creating a workflow that uses persistent software resources
  • Major packaging efforts are in Debian, GNU Guix and Bioconda
  • Challenges in versioning, dependencies, reproducibility
  • Solution:
    • Container (Docker) -> GNU Guix -> Guix packages -> BioConda -> Conda packages
    • CWL and Galaxy can run Docker containers
  • Problem solved. OK. Maybe. Work in progress…
  • CWL and Galaxy projects are very interested
  • Group members: Pjotr Prins, Steffen Möller

13 of 19

Provenance

  • cwltool --provenance: generate a research object (RO) with provenance of workflow execution
  • Model of CWL RO structure → BagIt archive
    • Level 0: workflow job submission
    • Level 1: executing master workflow, in/out
    • Level 2: step execution
    • Level 3: nested workflows
  • Capturing input data using content-based addressing
  • Rerunnable and portable master-job.json
  • Provenance as PROV JSON-LD

Farah Zaib Khan, Stian Soiland-Reyes, Tazro Inutano Ohta

PROV

{"@context": { "@base": "app://2e1287e0-6dfb-11e7-8acf-0242ac11000/" },
 "@id": "workflow/master-job.json#",
 "@type": "WorkflowRun",
 "workflow": "workflow/packed.cwl#main",
 "inputs": [
   {"@id": "data/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
    "describedByParameter": "workflow/packed.cwl#main/in1"} ],
 "outputs": [
   {"@id": "data/00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c",
    "describedByParameter": "workflow/packed.cwl#main/in1" } ],
 "steps": [
   {"@id": "urn:uuid:4305467e-6dfb-11e7-885d-0242ac110002",
    "@type": "ProcessRun",
    "step": "workflow/packed.cwl#main/step1"},
   {"@id": "urn:uuid:c42dc36e-6dfd-11e7-bc24-0242ac110002",
    "@type": "ProcessRun",
    "step": "workflow/packed.cwl#main/step2"}
 ]
}
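
Content-based addressing, as used for the `data/...` identifiers above, can be sketched in a few lines of Python. SHA-256 is an assumption here: the 64-hex-digit identifiers in the example are consistent with it, but the slide does not name the hash function. The `content_address` helper and the sample files are hypothetical and are not taken from cwltool.

```python
import hashlib
import tempfile
from pathlib import Path


def content_address(path: Path, chunk_size: int = 65536) -> str:
    """Return a 'data/<hex digest>' identifier derived from the file's content."""
    digest = hashlib.sha256()  # assumed hash function, see the note above
    with path.open("rb") as handle:
        # Read in chunks so large input files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"data/{digest.hexdigest()}"


# Two files with identical content map to the same identifier.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_bytes(b"ACGT\n")
    second.write_bytes(b"ACGT\n")
    id_first = content_address(first)
    id_second = content_address(second)
```

Because the identifier depends only on the bytes, the same input captured in two different runs is stored once and can be compared by name.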

14 of 19

Singularity support in CWL

github.com/johnfonner/cwltool/tree/feature-singularity

  • CWL workflows using “dockerPull” can transparently use Singularity for container execution.
  • 19 commits, ~150 lines of code

Members: Isak Sylvin, John Fonner
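
One way to picture the "transparent" substitution is a helper that turns a `dockerPull` value into a Singularity command line. The sketch below is illustrative only and is not the code from the feature branch; the function name is hypothetical, and it relies on Singularity's ability to reference Docker registry images through a `docker://` URI.

```python
from typing import List


def singularity_command(docker_pull: str, tool_args: List[str]) -> List[str]:
    """Build a `singularity exec` argv that runs `tool_args` in a Docker image."""
    if not docker_pull:
        raise ValueError("dockerPull must name an image")
    # The docker:// prefix tells Singularity to fetch the image from a Docker registry.
    return ["singularity", "exec", f"docker://{docker_pull}"] + list(tool_args)


# Example image name and command, chosen for illustration.
argv = singularity_command("ubuntu:16.04", ["echo", "hello"])
```

The workflow description itself stays unchanged; only the runner's choice of container engine differs.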

15 of 19

Rabix Suite

  • https://github.com/rabix
  • Fixing issues and creating a beta release of Rabix Composer
  • Integrating Executor into Composer (prototype)
  • Syncing Rabix Executor with CWL v1.0.1 errata and releasing v1.0
  • Group members: Janko Simonović, Siniša Ivković, Luka Stojanović, Ivan Batić, Maja Nedeljković, Đole Klisić

16 of 19

CWLToil dynamic ResourceReqs

First Goal: Calculate resource requirements based on input files: number, sizes, and other metadata.

https://github.com/BD2KGenomics/toil/pull/1767

https://github.com/common-workflow-language/cwltool/issues/483

Final Goal: Calculate computational or economic costs before running a job on Toil/CWL, based on cores, input file sizes, memory, etc.

Michael Crusoe and Roman Valls Guimerà
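
The first goal, deriving resource requirements from the input files, might look roughly like the following Python sketch. The heuristics (RAM as a multiple of total input size, one core per input file) and all parameter names are assumptions made for illustration; they are not taken from the linked pull request or issue. The output keys follow the `coresMin` and `ramMin` fields of CWL's ResourceRequirement, where `ramMin` is expressed in mebibytes.

```python
import math
from typing import Dict, Iterable


def estimate_resources(input_sizes_bytes: Iterable[int],
                       ram_per_input_byte: float = 2.0,
                       min_ram_mib: int = 1024,
                       max_cores: int = 8) -> Dict[str, int]:
    """Estimate cores and RAM from input file sizes (illustrative heuristic)."""
    sizes = list(input_sizes_bytes)
    total_bytes = sum(sizes)
    # Assumed rule: reserve a fixed multiple of the total input size as RAM.
    ram_mib = max(min_ram_mib, math.ceil(total_bytes * ram_per_input_byte / 2**20))
    # Assumed rule: one core per input file, capped at max_cores, at least one.
    cores = max(1, min(max_cores, len(sizes)))
    return {"coresMin": cores, "ramMin": ram_mib}


# Two 1 GiB inputs.
requirements = estimate_resources([2**30, 2**30])
```

The final goal of cost estimation could build on the same numbers, for example by multiplying them by a per-core and per-GiB price.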

17 of 19

CWL SDKs 🛠

  • Automatically generate SDKs from the CWL specification to handle reading, manipulating, and writing CWL files

  • Multiple “generic” approaches unsuccessful.
  • Python SDK generation (direct from CWL spec) project started:
  • Ruby project started (using JSON schema):
  • Java:
  • Pre-existing TypeScript implementation from SBG:

Members: Niall Beard, Kenzo Hillion, Hervé Ménager, Anton Khodak, Denis Yuen, Luka Stojanović, Heather Wiencko with help from Ivan Batić, Maja Nedeljković, Michael Crusoe and Peter Amstutz

18 of 19

The Open Bioinformatics Community

19 of 19

Join us

  • Welcome to BOSC
    • Birds of a Feather meet up
    • Lunch time today
  • Come to Codefest next year
    • Training, conference, then Codefest
    • Multiple tracks and groups
    • Everyone is welcome

https://www.open-bio.org/wiki/Codefest_2017