1 of 64

COVID-19 analysis in Galaxy:

Importance of (open) infrastructures in responding to a pandemic

27 January 2021

17.00 CET

Nadim Rahman

Guy Cochrane

EMBL-EBI

Andrew Lonie

Björn Grüning

Frederik Coppens

usegalaxy.*

2 of 64

Housekeeping

This session will be recorded.

Please remain muted unless you’re invited to speak by the Chair.

This meeting will be run in line with the ELIXIR Code of Conduct. If you have any concerns, please refer to the Code of Conduct, found on the ELIXIR website.

Please use “Q&A” to raise questions during the presentation.

Please use the “hand-raising function” to indicate you would like to contribute directly.

Please use “Chat” for further comments or discussions.

3 of 64

Running sheet

INTRO: David Lloyd

CONTEXT [1 min]: Frederik (slide 5)

DATA [15 mins]: Guy / Nadim

  1. COVID Data Portal, ENA

ANALYSIS (Tools and Infrastructure) [15-17 mins]:

  1. Bjoern: 10m Galaxy EU - linking COVID analysis resources.
    1. What does a global research infrastructure for analysis look like?
    2. What’s been done to date?
      1. Data Portal to usegalaxy.eu?
    3. What’s possible? usegalaxy.*
  2. Andrew: 5-7m Galaxy Australia - an exemplar of a global research infrastructure ecosystem for COVID
    • Common compute resources in practice - the benefits of agreeing on compute infrastructure interoperability (Pulsar use in Australia and interoperating with EU compute as proof of principle)
    • Common tools and workflows in practice (workflow use in Australia)
    • Common data resources in practice - contributing and using (Australian data to ENA)

INTEGRATED ECOSYSTEM [10 mins]: Frederik

  1. resources, links between them
  2. Circling back to what can be achieved if we agree on standard infrastructure
  3. Wrapup

4 of 64

COVID-19 analysis in Galaxy:

Importance of (open) infrastructures in responding to a pandemic

Andrew Lonie, Nadim Rahman, Guy Cochrane

Björn Grüning, Frederik Coppens

@galaxyproject

5 of 64

Tools Ecosystem

  • The European COVID-19 Data Platform - Nadim Rahman & Guy Cochrane
  • COVID-19 Analysis in Galaxy - Björn Grüning
  • Galaxy Australia: an exemplar of research infrastructure cooperation - Andrew Lonie
  • An integrated ecosystem - Frederik Coppens

6 of 64

Nadim Rahman, Guy Cochrane

The European COVID-19 Data Platform

7 of 64

European COVID-19 Data Platform

EMBL-EBI

European Research Infrastructures

International initiatives

National Infrastructures

COVID-19 Research

8 of 64

COVID-19 Data Portal

9 of 64

COVID-19 Data Portal

10 of 64

European Nucleotide Archive

  • Established in the early 1980s, extended for new technologies and applications
  • Globally comprehensive scientific record and European node of INSDC
  • Broad scope covering raw data, through layers of interpretation and context
  • Connectivity with broader resources
  • Sequence data foundation
  • Rich submission, discovery and retrieval software, tools and services
  • Support through Helpdesk and training
  • Management, sharing, integration and dissemination of sequence data
  • Data coordination including project-specific data hubs and portals

Archive

Platform

11 of 64

ENA data reach

Drysdale et al. (2020) The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences. Bioinformatics, 2020, 1–7; http://doi.org/10.1093/bioinformatics/btz959

Cook et al. (2020) The European Bioinformatics Institute in 2020: building a global infrastructure of interconnected data resources for the life sciences. Nucleic Acids Research 48:D17-D23; http://doi.org/10.1093/nar/gkz1033

12 of 64

Further reach

Rohden F, Huang S, Dröge G, Hartman Scholz A, and contributing authors (2019). Combined study in DSI in public and private databases and DSI traceability. https://www.cbd.int/abs/DSI-peer/Study-Traceability-databases.pdf

13 of 64

SARS-CoV-2 Data Hubs

  • Toolbox supporting viral sequencing workflows
    • Management, validation, analysis, sharing and publication
  • Based on “COMPARE Data Hubs”
  • Rapid development/deployment of tools and services including
    • Data upload
    • Validation against data standards
    • Connected cloud-based analysis workflows including assembly, annotation, variation calling and phylogenetics
    • Exploration and visualization environments
  • Early prioritisation of raw viral sequence data for systematic analysis

14 of 64

COVID-19 Data Flow: Data Mobilisation

15 of 64

COVID-19 Data Flow: Data Mobilisation

  • >225,000 Raw Reads
  • >45,000 Assembled Sequences

16 of 64

17 of 64

COVID-19 Data Flow: Tools

18 of 64

COVID-19 Data Flow: Tools

  • GISAID submission conversion tool
    • Supporting submitters to also submit assembled sequence data to ENA

  • Existing submission services:
    • Webin Interactive / Submissions Portal
    • Webin REST
    • Webin-CLI
  • Bulk Webin-CLI submission tool

19 of 64

COVID-19 Data Flow: Data Discovery and Access

20 of 64

European COVID-19 Data Platform: example of use

European COVID-19 Data Platform

https://www.covid19dataportal.org/

data mobilisation

SARS-CoV-2 Data Hubs

CRG COVID Viral Beacon

https://covid19beacon.crg.eu/

COVID-19 Data Portal

analytical workflows

visualization & navigation

data access

data

users

21 of 64

COVID-19 Analysis in Galaxy

Björn Grüning

22 of 64

Tools Ecosystem

23 of 64

COVID-19 analysis on usegalaxy.*

https://covid19.galaxyproject.org

  • All publicly accessible via GitHub
  • 6 different types of analysis
  • 5 different Galaxy servers - including Galaxy Australia
  • Workflows and tools available on all servers
  • Workload shared amongst global clouds and compute resources
  • Reproducible across multiple servers
  • Analysis is ongoing

24 of 64

Proteomics

25 of 64

New tools and workflows

26 of 64

New/updated workflows: ARTIC/ONT

27 of 64

New/updated workflows: Consensus construction

28 of 64

  • Galaxy Histories for all of the SARS-CoV-2 datasets available
  • Variant files for all samples, updated daily for download
  • Tools, Workflows available
  • Download all 100,000 VCF files

~100,000 samples

WGS, Amplicon, DRS
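
For readers who download the per-sample variant files, the following is a minimal sketch of reading records from a VCF file using only the Python standard library. It assumes plain, uncompressed, tab-separated VCF text; the function name `read_variants`, the `Variant` type, and the two example records (positions and alleles) are hypothetical and made up for illustration, not taken from the published files.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per alternate allele, skipping '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # The first five fixed VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = fields[:5]
        # Multiple alternate alleles are comma-separated in the ALT column.
        for allele in alt.split(","):
            yield Variant(chrom, int(pos), ref, allele)


# Hypothetical example content; the records below are invented for illustration.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "NC_045512.2\t100\t.\tA\tG\t.\tPASS\t.\n"
    "NC_045512.2\t200\t.\tC\tT,G\t.\tPASS\t.\n"
)

variants = list(read_variants(EXAMPLE_VCF.splitlines()))
```

For real analyses a dedicated VCF library is likely a better choice; this sketch only shows the shape of the data.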

29 of 64

Mirrored data for easy access

30 of 64

Analysing → Monitoring

31 of 64

32 of 64

Galaxy Australia: an exemplar of research infrastructure cooperation

Andrew Lonie

33 of 64

34 of 64

usegalaxy.org.au

Galaxy Australia is a hosted web-based platform that lets anyone conduct accessible, reproducible, and transparent computational life sciences research. It is part of the global usegalaxy.* collaboration between large public Galaxy servers.

35 of 64

Early Pandemic - the race to publication

  • Researchers across the world were focussing on COVID-19
  • A lot of pre-prints appeared very quickly.
  • Not a lot of public sharing of the data or methods
  • Diverse analyses required
  • Tool selection/analysis techniques quite complex

How can we make it easier/more reproducible?

For everyone!

36 of 64

Workflows: Efficiency through Galaxy controlled scheduling

37 of 64

38 of 64

Pre-processing

Assembly

MRCA timing

Variation analysis

S- analysis

Evolutionary analysis

39 of 64

Genomics

40 of 64

Pulsar

To create this network of shared computational resources, we leverage Pulsar, a Task Execution Service for Galaxy. Pulsar allows a Galaxy server to automatically interact with those remote systems, ensuring job and provenance information are correctly exchanged.

https://github.com/galaxyproject/pulsar
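
The routing idea described above can be illustrated with a toy model. This is a sketch only: it is not Galaxy or Pulsar configuration and does not use their APIs. The destination names, the `covid19` tag, and the functions `choose_destination` and `run` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    name: str     # hypothetical label, not a real Galaxy destination id
    remote: bool  # True when the job would be shipped to a remote endpoint


# Hypothetical destinations, loosely modelled on the description in the talk.
LOCAL = Destination("local-slurm", remote=False)
COVID_PULSAR = Destination("remote-pulsar-covid", remote=True)


def choose_destination(job_tags: set) -> Destination:
    """Route jobs tagged 'covid19' to the remote cluster, all others locally."""
    return COVID_PULSAR if "covid19" in job_tags else LOCAL


def run(job_id: str, job_tags: set) -> dict:
    """Return a small provenance record of the routing decision.

    A real deployment would also stage inputs, execute the tool remotely and
    return outputs; this toy version records only where the job was sent.
    """
    dest = choose_destination(job_tags)
    return {"job": job_id, "destination": dest.name, "remote": dest.remote}


record = run("job-1", {"covid19", "artic"})
```

The point of the sketch is that the user submits a job once and the server decides where it runs, while keeping a record of that decision.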

41 of 64

usegalaxy.org.au

Galaxy Australia

Brisbane

Main Slurm Queue, Main storage

Pulsar

Pulsar

Pulsar

42 of 64

Getting resources to help - quickly

COVID merit allocation at Pawsey

A Pulsar Cluster in the Cloud

  • a Head node and 5 Worker nodes
  • all of n3.8c32r flavour

Setup in an afternoon

Perth Pulsar-paw

(COVID-19 Jobs)

Pulsar Server

Pulsar/Slurm/NFS

Worker Node 1

Worker Node 2

Worker Node 3

Worker Node 4

Worker Node 5

Volume

Connection with Galaxy Australia

Galaxy Australia was able to send COVID related jobs to Pawsey that day!

43 of 64

Some COVID analysis stats

  • Sequencing data from 3,200 NCBI study accessions, representing thousands of data files, processed

  • 2,000 Victorian Samples processed through ARTIC workflows

  • Pawsey Pulsar node processed 10,546 jobs April - August

  • ~15,000 sequences uploaded to ENA

44 of 64

An integrated ecosystem

Frederik Coppens

45 of 64

Virtual environment

Seamless integration of services

  • data storage
  • data management
  • data analysis
  • re-use of data

Based on standardisation

Across scientific disciplines and borders

46 of 64

AnVIL: Inverting the model of genomic data sharing

Traditional: Bring data to the researcher

  • Copying/moving data is costly
  • Harder to enforce security
  • Redundant infrastructure
  • Siloed compute

Goal: Bring researcher to the data

  • Reduced redundancy and costs
  • Active threat detection and auditing
  • Greater accessibility
  • Elastic, shared, compute

47 of 64

Virtual environment

Seamless integration of services

  • data storage
  • data management
  • data analysis
  • re-use of data

Based on standardisation

Across scientific disciplines and borders

48 of 64

OECD recommendation on Access to Research Data

On 20 January 2021, the OECD Council adopted a revised Council Recommendation on Access to Research Data from Public Funding.

... expands the scope to cover not only research data, but also related metadata, as well as bespoke algorithms, workflows, models, and software (including code), which are essential for their interpretation.

RECOGNISING that re-use and value of data can depend on the availability of relevant metadata, algorithms, code, and software, from public funding together with information on workflows and the computational environment used to generate published findings, and that providing access to these other research-relevant digital objects from public funding along with the data itself can be essential;

49 of 64

Tools collaboratory

Bio.Tools

BioContainers

Workflows

Tools

Registries

Packaging

Testing

Powered By

EDAM ontology

50 of 64

ELIXIR Tools Ecosystem

BioContainers

160943 containers

bio.tools

17007 tools

24713 tools

OpenEBench

7923 tools in Galaxy toolshed

Galaxy

Beta release 2020

72 workflows

WorkflowHub.eu

51 of 64

WorkflowHub.eu : workflow registry

Leading work on metadata standards

Workflows...

  • remain in their host repositories
  • are organised by teams, collections & properties
  • are associated with data, docs, links
  • partnerships with WMS

Contributing to WP6 metadata standards and repositories

Integration

An EOSC-Life product

52 of 64

Virtual environment

Seamless integration of services

  • data storage
  • data management
  • data analysis
  • re-use of data

Based on standardisation

Across scientific disciplines and borders

53 of 64

Across disciplines : covid19.galaxyproject.org

Webinar February 24

Webinar February 10

54 of 64

Training.galaxyproject.org

55 of 64

Across Galaxy instances

Global collaboration of managed public Galaxy instances

On demand Galaxy instances (ELIXIR Italy)

Deploy your own container

https://github.com/ELIXIR-Belgium/covid-19-galaxy-container

Open Source code: build your own

56 of 64

57 of 64

usegalaxy.* community expanding

usegalaxy.org

usegalaxy.org.au

usegalaxy.eu

usegalaxy.fr

usegalaxy.be

usegalaxy.ee

usegalaxy.es

58 of 64

Pulsar-Network

Computing centers across Europe have expressed interest in sharing their remote compute power to support the UseGalaxy.eu load:

  • DE, de.NBI cloud
  • IT, ReCaS-Bari
  • BE, Vlaams Supercomputer Centrum (VSC)
  • PT, Tecnico ULisboa
  • ES, Barcelona Supercomp. Center (INB-BSC)
  • NO, University of Bergen
  • CZ, CESNET
  • FI, CSC
  • UK, Diamond Light Source
  • FR, GenOuest

https://pulsar-network.readthedocs.io/en/latest/project/partners.html

59 of 64

Virtual environment

Seamless integration of services

  • data storage
  • data management
  • data analysis
  • re-use of data

Based on standardisation

Across scientific disciplines and borders

60 of 64

Integration with CRG COVID Viral Beacon

European COVID-19 Data Platform

https://www.covid19dataportal.org/

data mobilisation

SARS-CoV-2 Data Hubs

CRG COVID Viral Beacon

https://covid19beacon.crg.eu/

COVID-19 Data Portal

analytical workflows

visualization & navigation

data access

data

users

Webinar February 17

61 of 64

Data retrieval & submission

Alignment of queries to (re)analyse data

Submission of (cleaned) viral data

Webinar February 3
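
As an illustration of programmatic data retrieval, here is a minimal sketch that builds a search URL for SARS-CoV-2 read data. It assumes the ENA Portal API search endpoint and the parameter and field names shown (`result`, `query`, `fields`, `format`, `limit`); these should be checked against the current ENA documentation before use. The helper `build_read_query` is hypothetical, and the code only constructs the URL without contacting any server.

```python
from urllib.parse import urlencode

# Assumed ENA Portal API search endpoint; verify against the ENA documentation.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
SARS_COV_2_TAXID = 2697049  # NCBI Taxonomy identifier for SARS-CoV-2


def build_read_query(limit: int = 10) -> str:
    """Build a URL asking for read runs under the SARS-CoV-2 taxon."""
    params = {
        "result": "read_run",                       # assumed result type
        "query": f"tax_tree({SARS_COV_2_TAXID})",   # taxon and its descendants
        "fields": "run_accession,sample_accession,fastq_ftp",  # assumed fields
        "format": "tsv",
        "limit": str(limit),
    }
    # urlencode percent-encodes parentheses and commas in the values.
    return f"{ENA_SEARCH}?{urlencode(params)}"


url = build_read_query(limit=5)
```

The resulting URL could then be fetched with any HTTP client, and the returned table used to select runs for (re)analysis.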

62 of 64

Exemplar implementation of

In a global context

Building on existing, open infrastructure

Tools Ecosystem

63 of 64

Acknowledgments

usegalaxy.org efforts are funded by NIH Grants U41 HG006620 and NSF ABI Grant 1661497. usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi. Galaxy and HyPhy integration is supported by NIH grant R01 AI134384. usegalaxy.org.au is supported by Bioplatforms Australia and the Australian Research Data Commons through funding from the Australian Government National Collaborative Research Infrastructure Strategy. Hyphy.org development team is supported by NIH grant R01GM093939. usegalaxy.be is supported by the Research Foundation-Flanders (FWO) grant I002919N and the Flemish Supercomputer Center (VSC). EOSC-Life has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 824087.

64 of 64