The GA4GH Cloud Work Stream:
Emerging Standards for Genomics in the Cloud
Brian O’Connor & David Glazer
GA4GH Cloud Work Stream Co-Leads
ga4gh.org
Mission: Enable genomic data sharing for the benefit of human health
The GA4GH is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.
GA4GH Cloud Work Stream
The Cloud Work Stream is focused on creating specific standards for defining, sharing, and executing portable workflows and accessing data across clouds.
We work with many different Driver Projects to develop, enhance, test, and use the Cloud Work Stream APIs.
And more!
The Vision of the Cloud Work Stream
Motivations for the Cloud Work Stream Vision
Projects like TOPMed, HCA, and All of Us will sequence hundreds of thousands of genomes, producing 50+ petabytes of data on clouds in the next 5 years!
GA4GH Cloud Work Stream APIs
Sharing Tools and Workflows
Executing Workflows
Executing Individual Tasks
Accessing Data
(now the Data Repository Service, DRS)
GA4GH Tool Registry Service (TRS) API
List, search, & register CWL/WDL-described Docker Tools and Workflows.
Dockstore & Biocontainers
Tool(s)
descriptor
Docker
GET list
GET search
POST register
CWL/WDL-Described Tools
WES Sharing API
CWL/WDL Workflow
&
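The list, search, and descriptor lookups above can be sketched as simple URL construction. This is a minimal sketch, not a reference client: the base URL `trs.example.org`, the tool ID, and the helper names are hypothetical, and the path layout and the `descriptorType` / `toolname` query parameters are assumed to follow the TRS v2 schema, so check them against the registry you actually target.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; each registry publishes its own.
TRS_BASE = "https://trs.example.org/ga4gh/trs/v2"


def list_tools_url(base=TRS_BASE, **filters):
    """Build the URL for GET list / GET search.

    Keyword arguments become query parameters (assumed TRS v2 names,
    e.g. descriptorType="CWL"). With no filters this lists all tools.
    """
    query = urlencode(sorted(filters.items()))
    return f"{base}/tools" + (f"?{query}" if query else "")


def descriptor_url(tool_id, version_id, descriptor_type, base=TRS_BASE):
    """Build the URL for one CWL or WDL descriptor of a tool version."""
    # Tool IDs often contain slashes, so each ID is percent-encoded
    # into a single path segment.
    return (
        f"{base}/tools/{quote(tool_id, safe='')}"
        f"/versions/{quote(version_id, safe='')}/{descriptor_type}/descriptor"
    )


if __name__ == "__main__":
    print(list_tools_url(descriptorType="CWL", toolname="bwa"))
    print(descriptor_url("registry.example.org/org/my-tool", "1.0.0", "CWL"))
```

Keeping URL construction separate from the HTTP call makes it easy to point the same code at different registries by swapping the base URL.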
GA4GH Workflow Execution Service (WES) API
Execute CWL/WDL-based Workflows in a cloud- and platform-agnostic way. (TES is very similar!)
POST new task
GET task status
GET task stderr/stdout
standard Execution APIs
Tools
Docker
JSON
stderr
stdout
file(s)
status
+
environment-specific implementation
WDL/CWL
Workflow
or
Official GA4GH Standard!
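The submit-then-poll pattern above can be sketched as follows. This is an illustrative sketch under stated assumptions: the endpoint `wes.example.org` and the helper names are hypothetical, and the `/runs` paths, form field names, and state names are assumed to match the WES v1 schema. No HTTP request is sent here; the functions only prepare what a client would send.

```python
import json

# Hypothetical endpoint; each WES deployment publishes its own.
WES_BASE = "https://wes.example.org/ga4gh/wes/v1"

# Run states after which polling can stop (assumed WES v1 state names).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def build_run_request(workflow_url, workflow_type, type_version, params,
                      base=WES_BASE):
    """Return (method, url, form_fields) for submitting a new run.

    The form fields would be sent as multipart/form-data by an HTTP client.
    """
    form = {
        "workflow_url": workflow_url,            # where the CWL/WDL file lives
        "workflow_type": workflow_type,          # "CWL" or "WDL"
        "workflow_type_version": type_version,
        "workflow_params": json.dumps(params, sort_keys=True),  # JSON inputs
    }
    return "POST", f"{base}/runs", form


def status_url(run_id, base=WES_BASE):
    """URL to poll for the state of a submitted run."""
    return f"{base}/runs/{run_id}/status"


def is_finished(state):
    """True once a run has reached a terminal state."""
    return state in TERMINAL_STATES


if __name__ == "__main__":
    method, url, form = build_run_request(
        "https://example.org/align.cwl", "CWL", "v1.0", {"input": "a.cram"}
    )
    print(method, url, form["workflow_params"])
    print(status_url("run-42"), is_finished("RUNNING"))
```

Because the request carries only a workflow URL, a type, and JSON parameters, the same submission can be sent unchanged to any environment-specific implementation behind the API.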
GA4GH Data Repository Service (DRS) API
A cloud-agnostic way to look up cloud data objects and read/write them via signed URLs or other protocols
DRS Service
- Read
- Write
Data Browsing Portal
Index
Workflow Engines
AWS S3
Azure Bucket
Google Bucket
environment-specific implementations
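The lookup step above, where a client resolves one logical object to a copy in a particular cloud, can be sketched as follows. This is a hedged sketch, not the DRS reference behaviour: the object record, bucket name, `drs.example.org` base URL, and helper names are all hypothetical, and the `access_methods`, `access_url`, and `access_id` fields are assumed to follow the DRS v1 schema.

```python
# Hypothetical base URL; each DRS deployment publishes its own.
DRS_BASE = "https://drs.example.org/ga4gh/drs/v1"

# Hypothetical object record, shaped like an assumed DRS v1 response.
EXAMPLE_OBJECT = {
    "id": "obj-123",
    "access_methods": [
        {"type": "gs", "access_url": {"url": "gs://example-bucket/sample.vcf.gz"}},
        {"type": "s3", "access_id": "aws-us-east"},
    ],
}


def choose_access_method(drs_object, preferred_types=("https", "s3", "gs")):
    """Pick the first access method matching the caller's preference order."""
    methods = drs_object.get("access_methods", [])
    for wanted in preferred_types:
        for method in methods:
            if method.get("type") == wanted:
                return method
    return None  # no copy is reachable with the preferred protocols


def resolve_url(method, object_id, base=DRS_BASE):
    """Return (url, needs_request).

    If the method carries a direct access_url, use it as-is. Otherwise return
    the access endpoint that must be called to obtain a (possibly signed) URL.
    """
    access_url = method.get("access_url")
    if access_url:
        return access_url["url"], False
    return f"{base}/objects/{object_id}/access/{method['access_id']}", True


if __name__ == "__main__":
    chosen = choose_access_method(EXAMPLE_OBJECT)
    print(chosen["type"], resolve_url(chosen, EXAMPLE_OBJECT["id"]))
```

A workflow engine running on AWS might pass `preferred_types=("s3",)` while one on Google passes `("gs",)`, so the same object ID resolves to the nearest copy without the workflow itself changing.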
An Example - the NIH Data Commons
How we used GA4GH Cloud Work Stream APIs
Calcium
(Paten et al)
Helium
(Ahalt)
Xenon (Davis-Dusenbery)
Argon
(Foster)
TOPMed Alignment Workflow
TOPMed, GTEx data
Calcium (Paten et al)
Workspace
GTEx Realignment CRAMs (reads)
AWS/Google
TOPMed Variant Calling Workflow
Our task was to 1) create cloud-portable workflows and 2) align and variant-call 600+ samples across 4 different cloud stacks.
GTEx VCFs (variants)
GTEx VCFs were then shared via GA4GH DRS
600+ WGS samples took <1 week on 4 very different environments! Would have taken months previously.
WES
TRS
DRS
DRS
TRS
DRS
Where Do We Stand Now?
| TRS | WES | TES | DRS |
Schemas approved by GA4GH | 2019 goal | Yes! | 2019 stretch goal | 2019 goal |
Registry & compliance testing framework | Yes! | Yes! | 2019 goal | 2019 goal |
At least 1 production implementation | Yes, 2! (Dockstore & Biocontainers) | Multiple in progress, 2019 goal | In progress | Multiple in progress, 2019 goal |
Projects using API in production | Yes! (multiple) | Longer term goal | Longer term goal | Longer term goal |
For More Information
Acknowledgements