1 of 14

The GA4GH Cloud Work Stream:

Emerging Standards for Genomics in the Cloud

Brian O’Connor & David Glazer

GA4GH Cloud Work Stream Co-Leads

ga4gh.org

2 of 14

Mission: Enable genomic data sharing for the benefit of human health

The GA4GH is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.

3 of 14

GA4GH Cloud Work Stream

The Cloud Work Stream is focused on creating specific standards for defining, sharing, and executing portable workflows and accessing data across clouds.

We work with many different Driver Projects to develop, enhance, test, and use the Cloud Work Stream APIs.

And more!

4 of 14

The Vision of the Cloud Work Stream

5 of 14

Motivations for the Cloud Work Stream Vision

Projects like TOPMed, HCA, All of Us, etc. will sequence hundreds of thousands of genomes, producing 50+ petabytes of data on clouds in the next 5 years!

6 of 14

GA4GH Cloud Work Stream APIs

Sharing Tools and Workflows

Executing Workflows

Executing Individual Tasks

Accessing Data

(now the Data Repository Service, DRS)

7 of 14

GA4GH Tool Registry Service (TRS) API

List, search, & register CWL/WDL-described Docker Tools and Workflows.

Dockstore & Biocontainers

Tool(s)

descriptor

Docker

GET list

GET search

POST register

CWL/WDL-Described Tools

WES Sharing API

CWL/WDL Workflow

&
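The list and search calls on this slide can be sketched as plain URL construction. This is a minimal sketch, not a client for any particular registry: the base URL is hypothetical, and the `/tools` and `/tools/{id}/versions/{version_id}/{type}/descriptor` paths and the `toolname` and `limit` filters follow my reading of the published TRS v2 specification, which may differ from the version current when these slides were given. The POST register call is left out because its shape is not described here.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL (assumption); a real registry publishes its own.
TRS_BASE = "https://trs.example.org/ga4gh/trs/v2"


def list_tools_url(base=TRS_BASE, **filters):
    """Build the URL for listing or searching tools (GET /tools)."""
    # Drop unset filters and sort keys so the query string is stable.
    query = urlencode({k: v for k, v in sorted(filters.items()) if v is not None})
    return f"{base}/tools" + (f"?{query}" if query else "")


def descriptor_url(tool_id, version_id, descriptor_type, base=TRS_BASE):
    """Build the URL for one tool version's CWL or WDL descriptor."""
    # Tool ids often contain '/' (e.g. registry paths), so encode them fully.
    return (
        f"{base}/tools/{quote(tool_id, safe='')}"
        f"/versions/{quote(version_id, safe='')}/{descriptor_type}/descriptor"
    )
```

A caller would pass the resulting URL to any HTTP client and parse the JSON response; keeping the URL building separate makes the sketch easy to check without network access.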

8 of 14

GA4GH Workflow Execution Service (WES) API

Execute CWL/WDL-based Workflows in a cloud- and platform-agnostic way. (TES is very similar!)

POST new task

GET task status

GET task stderr/stdout

standard Execution APIs

Tools

Docker

JSON

stderr

stdout

file(s)

status

+

environment-specific implementation

WDL/CWL

Workflow

or

Official GA4GH Standard!
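The submit, status, and stderr/stdout calls on this slide might look like the sketch below. It assumes the WES v1 paths as I understand the published specification, where the slide's "task" corresponds to a "run": `POST /runs` with multipart form fields, `GET /runs/{run_id}/status`, and `GET /runs/{run_id}` for the run log that points to stdout and stderr. The base URL and the helper names are hypothetical.

```python
import json

# Hypothetical base URL (assumption); each WES deployment has its own.
WES_BASE = "https://wes.example.org/ga4gh/wes/v1"

# States after which a run will not change again (per my reading of WES v1).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def run_request_fields(workflow_url, workflow_type, type_version, params):
    """Form fields for submitting a run (POST /runs, multipart/form-data)."""
    return {
        "workflow_url": workflow_url,            # where the CWL/WDL file lives
        "workflow_type": workflow_type,          # "CWL" or "WDL"
        "workflow_type_version": type_version,
        # Workflow inputs travel as a JSON-encoded string field.
        "workflow_params": json.dumps(params, sort_keys=True),
    }


def status_url(run_id, base=WES_BASE):
    """URL for polling run state (GET /runs/{run_id}/status)."""
    return f"{base}/runs/{run_id}/status"


def run_log_url(run_id, base=WES_BASE):
    """URL for the full run log, which references stdout and stderr."""
    return f"{base}/runs/{run_id}"


def is_finished(status_response):
    """True once the parsed status JSON reports a terminal state."""
    return status_response.get("state") in TERMINAL_STATES
```

A polling loop would call `status_url` repeatedly until `is_finished` returns true, then fetch `run_log_url` to locate the logs and outputs.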

9 of 14

GA4GH Data Repository Service (DRS) API

A cloud-agnostic way to look up cloud data objects and read/write them via signed URLs or other protocols

DRS Service

- Read

- Write

Data Browsing Portal

Index

Workflow Engines

AWS S3

Azure Bucket

Google Bucket

environment-specific implementations
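The read path on this slide, looking up an object and then resolving it to a usable URL, can be sketched as follows. This covers reads only and assumes the DRS v1 response shape as I understand the published specification: an object carries `access_methods`, each with a `type` and either a direct `access_url` or an `access_id` that must be exchanged at `/objects/{object_id}/access/{access_id}` (which is where a signed URL would come from). The base URL, the preference order, and the function names are hypothetical.

```python
from urllib.parse import quote

# Hypothetical base URL (assumption); each DRS deployment has its own.
DRS_BASE = "https://drs.example.org/ga4gh/drs/v1"


def object_url(object_id, base=DRS_BASE):
    """URL for looking up one data object (GET /objects/{object_id})."""
    return f"{base}/objects/{quote(object_id, safe='')}"


def resolve_access(drs_object, preferred=("s3", "gs", "https"), base=DRS_BASE):
    """Pick an access method from a parsed DRS object.

    Returns ("url", <url>) when the object lists a directly usable URL,
    ("fetch", <endpoint>) when a second GET is needed to obtain one,
    or None when no method of a preferred type is present.
    """
    methods = drs_object.get("access_methods", [])
    for wanted in preferred:
        for method in methods:
            if method.get("type") != wanted:
                continue
            if "access_url" in method:
                return ("url", method["access_url"]["url"])
            if "access_id" in method:
                access_id = quote(method["access_id"], safe="")
                return ("fetch", f"{object_url(drs_object['id'], base)}/access/{access_id}")
    return None
```

Ordering `preferred` by the cloud where the workflow runs lets the same code choose the S3 copy on AWS and the Google copy on GCP, which is the portability the slide describes.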

10 of 14

An Example - the NIH Data Commons

How we used GA4GH Cloud Work Stream APIs

Calcium

(Paten et al)

Helium

(Ahalt)

Xenon (Davis-Dusenbery)

Argon

(Foster)

TOPMed Alignment Workflow

TOPMed, GTEx data

Calcium (Paten et al)

Workspace

GTEx Realignment CRAMs (reads)

AWS/Google

TOPMed Variant Calling Workflow

Our task was to 1) create cloud-portable workflows and 2) align and call variants on 600+ samples across 4 different cloud stacks

GTEx VCFs (variants)

11 of 14

An Example - the NIH Data Commons

How we used GA4GH Cloud Work Stream APIs

Calcium

(Paten et al)

Helium

(Ahalt)

Xenon (Davis-Dusenbery)

Argon

(Foster)

TOPMed Alignment Workflow

TOPMed, GTEx data

Calcium (Paten et al)

Workspace

GTEx Realignment CRAMs (reads)

GTEx VCFs were then shared via GA4GH DRS

GTEx VCFs (variants)

AWS/Google

TOPMed Variant Calling Workflow

600+ WGS samples took <1 week on 4 very different environments! Would have taken months previously.

WES

TRS

DRS

DRS

TRS

DRS

12 of 14

Where Do We Stand Now?

Schemas approved by GA4GH

  • TRS: 2019 goal
  • WES: Yes!
  • TES: 2019 stretch goal
  • DRS: 2019 goal

Registry & compliance testing framework

  • TRS: Yes!
  • WES: Yes!
  • TES: 2019 goal
  • DRS: 2019 goal

At least 1 production implementation

  • TRS: Yes, 2! (Dockstore & Biocontainers)
  • WES: Multiple in progress, 2019 goal
  • TES: In progress
  • DRS: Multiple in progress, 2019 goal

Projects using API in production

  • TRS: Yes! (multiple)
  • WES: Longer term goal
  • TES: Longer term goal
  • DRS: Longer term goal

13 of 14

For More Information

14 of 14

Acknowledgements

  • GA4GH Cloud Work Stream
    • Broad Institute
    • Cincinnati Children’s Hospital
    • Curoverse
    • European Bioinformatics Institute
    • Intel
    • Institute for Systems Biology
    • Google, Microsoft, Amazon
    • Ontario Institute for Cancer Research
    • Oregon Health and Science University
    • Seven Bridges Genomics
    • University of California Santa Cruz
    • All the members of the GA4GH Cloud Work Stream