Software ecosystem concepts for federated genomic analysis

Vince Carey PhD

Bioc Europe 2019

  • 100,000 users (downloads), on laptops and local clusters
  • 1000+ packages; ~50 GB of software/reference data
  • ~daily CI/CD; 6-month release cycle
  • CRAN + Bioc git + system requirements
  • ~1000 developers: bugfixes in release, enhancements in devel
  • 1x commons -- commercial cloud

Can it all fit together?


Road map of the talk

  • Brief sketch of a data/analysis commons for "federated genomic analysis"
  • BigRNA / HumanTranscriptomeCompendium as an example of data service mediation with Bioconductor data structures
  • The Compendium software stack as an ecosystem
  • Review of unmet needs:
    • measures of ecosystem health
    • strategies for durable compatibility between a centralized, stability-focused commons and a dynamic analysis ecosystem like Bioconductor


Why consider a federated approach?

  • Key aim: efficient progress in genome biology and its applications
  • Obstacles: large multiomic surveys, the volume of sequencer and imaging outputs, quantification uncertainty, metadata management, access control
    • redundant computation and downloads are mostly wasted effort
  • Technical advances ("cloud" computing, distributed data services, …) allow us to formulate the aim of a federated genomic data/analysis commons


from RL Grossman, PMID 30691868


What is a data/analysis commons?

  • Curated, accessible, FAIR (findable, accessible, interoperable, reusable) data resources
  • Robust APIs that resolve queries from authorized users to produce data artifacts (see the sketch after this list)
    • e.g., the OpenAPI Specification
    • Swagger tools implementing OpenAPI -- ideally hidden from users and Bioconductor developers
  • Analytic components/APIs that help explore and interpret the artifacts
  • NHGRI AnVIL (Analysis, Visualization and Informatics Lab-space) is an emerging example
    • AnVIL makes use of components of the Broad's FireCloud (now named Terra)
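As a concrete flavor of the API bullet above, a minimal sketch of querying a commons-style REST endpoint from R follows. The host, path, and query parameters are invented for illustration and do not correspond to an actual AnVIL or Gen3 service.

    ## Minimal sketch: querying a hypothetical OpenAPI-described commons service.
    ## The URL and parameters are illustrative, not a real endpoint.
    library(httr)
    library(jsonlite)

    resp <- GET("https://commons.example.org/v1/datasets",
                query = list(assay = "RNA-seq", organism = "Homo sapiens"))
    stop_for_status(resp)

    ## assume the service returns a JSON array of dataset records
    datasets <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    head(datasets)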


Components of a working data/analysis commons


Aspects of the AnVIL schematic

  • Fixed target platform: Google Compute Engine
  • Two paths in for researchers: WDL and Workspace
  • Data library in "buckets" mediated through a fixed "data model"
  • Software components in Docker containers
  • Interaction through Jupyter or RStudio


Under the hood -- Morgan, Turaga, Stubbs


Summary

  • The interaction/analysis facets of a working data/analysis commons (Terra/AnVIL) are reasonably familiar and flexible
    • GCP supplies Linux + Container registry
    • User-defined containers can be used (some licensing constraints will be relevant)
    • Bioconductor users are "good to go" once
      • access to all relevant packages is established
      • natural patterns of persisting code and data are supported
  • We have left out some key things
    • What is the "data commons" aspect?
    • How does an evolving language/ecosystem co-evolve with the "interaction/analysis" stack?
      • Containers provide great freedom … is there a downside?


What is the "data commons" aspect?

  • In AnVIL, the Gen3 platform will play a substantial role in managing metadata, access control, data modeling, and query resolution
  • I will discuss another approach to serving genomic data at cloud/commons scale


Sean Davis' BigRNA: towards a data service for RNA-seq, based in NCBI SRA

  • Background -- Sean has created the GEOquery, GEOmetadb, and SRAdb packages, permitting very convenient access to NCBI resources
  • RNA-seq is a relatively new technology -- thoughts:
    • signal analysis of large-scale surveys can be useful for identifying approaches to bias reduction
    • uniform preprocessing and quantification of "all of SRA" is feasible with cloud technologies
  • In the summer of 2017, Salmon quantifications of 181,000 human RNA-seq samples were acquired from BigRNA, transformed to HDF5, and loaded into HSDS with support from John Readey of the HDF Group (a toy version of the HDF5 step is sketched after this list)
    • this is not complex, but it can be time-consuming and should be optimized
    • HDF5 in S3 buckets can now be served in the same manner
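A toy version of the transform-to-HDF5 staging step, using the Bioconductor rhdf5 package; the matrix stands in for a block of Salmon quantifications, and the file, dataset, and chunk settings are illustrative.

    ## Toy sketch of staging quantifications in HDF5 with rhdf5.
    ## 'quants' stands in for a block of Salmon output; names are illustrative.
    library(rhdf5)

    quants <- matrix(rexp(200 * 50), nrow = 200)   # 200 features x 50 samples
    h5createFile("quants.h5")
    h5createDataset("quants.h5", "assay001", dims = dim(quants),
                    storage.mode = "double", chunk = c(200, 10), level = 6)
    h5write(quants, "quants.h5", "assay001")
    h5ls("quants.h5")   # check the layout before pushing to HSDS or S3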


BigRNA + HSDS + rhdf5client + DelayedArray

Here the HDF Scalable Data Service (HSDS) provides the RESTful back end; a minimal sketch follows.
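The sketch below follows the pattern in the rhdf5client documentation; the endpoint and domain shown are the HDF Group's public demonstration values and may have changed since.

    ## Remote slicing through HSDS via rhdf5client; endpoint/domain are the
    ## HDF Group demo values from the package docs and may have moved.
    library(rhdf5client)

    da <- HSDSArray(endpoint = "http://hsdshdflab.hdfgroup.org",
                    svrtype  = "hsds",
                    domain   = "/shared/bioconductor/darmgcls.h5",
                    dsetname = "/assay001")
    dim(da)        # shape comes from service metadata; no data transferred yet
    da[1:5, 1:3]   # only this block travels over the REST interface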


Upshots

  • SummarizedExperiment + DelayedArray are well-exercised with local HDF5 storage of large numerical data
  • (for Bioconductor users) DelayedArray + rhdf5client + HSDS facilitate natural interrogation of very large numerical data resources in S3 stores
  • This sets a standard of self-description and functionality that other services and query approaches should meet
    • X[G, S] denotes the values of features G on samples S (illustrated in the sketch below)
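For Bioconductor users, the X[G, S] idiom is ordinary SummarizedExperiment subsetting. A sketch, assuming HumanTranscriptomeCompendium::htx_load() can reach its HSDS back end and that row names are Ensembl gene identifiers:

    ## X[G, S] as SummarizedExperiment subsetting; assumes htx_load() can
    ## reach its HSDS back end and that rownames are Ensembl gene IDs.
    library(HumanTranscriptomeCompendium)
    library(SummarizedExperiment)

    htx <- htx_load()          # SummarizedExperiment with HSDS-backed assay
    G <- "ENSG00000111640"     # GAPDH
    S <- colnames(htx)[1:50]   # first 50 samples, for illustration
    x <- assay(htx[G, S])      # only this block is fetched from the service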


Example: find RNA-seq studies in which a selected gene exhibits unusually large variation across samples

Defaults: order studies by the MAD of the selected gene; retain the top 10% of studies (sketched below).
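A hedged sketch of this screen, continuing with the htx object from the previous sketch; the colData column name "study_accession" is an assumption about the compendium's sample metadata.

    ## Per-study MAD screen for one gene; continues with 'htx' from above.
    ## The colData column "study_accession" is an assumed metadata field.
    gene <- "ENSG00000111640"                    # GAPDH, a housekeeping gene
    v <- as.numeric(assay(htx[gene, ]))          # one feature, all samples
    study <- colData(htx)$study_accession
    mads <- sort(tapply(v, study, mad, na.rm = TRUE), decreasing = TRUE)
    head(mads, ceiling(0.10 * length(mads)))     # top 10% of studies by MAD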


Oddity: list studies in which a housekeeping gene exhibits unusually large variation across samples


Upshots

  • code (7 lines) at https://bit.ly/342qVQK
  • These are interactive, targeted surveys of 181,000 RNA-seq samples from 4,600 distinct studies in NCBI SRA
  • There are no file downloads
  • Similar tactics could work for the Recount and Conquer archives, among others
  • What is the relationship to the software ecosystem?


The associated ecosystem slice


Proposal

  • RESTful data services can be used for all mature data in the commons
    • X[G, S] or tidyverse-like queries should be immediate (see the sketch after this list)
    • + DelayedArray + SummarizedExperiment (or MultiAssayExperiment) for Bioconductor users
      • if adopted, a large number of active users can hit the ground running with such a commons
      • would this approach impose constraints on the development and maintenance of the commons?
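A sketch of the tidyverse-facing side of the proposal, reshaping an X[G, S] block into a long table; it continues with the htx, G, and S objects from the earlier sketches and is hypothetical in the same respects.

    ## Reshaping an X[G, S] block for tidyverse-style work; continues with
    ## 'htx', 'G', and 'S' from the earlier sketches.
    library(dplyr)
    library(tidyr)
    library(tibble)

    mat <- as.matrix(assay(htx[G, S]))
    long <- as_tibble(mat, rownames = "gene") %>%
      pivot_longer(-gene, names_to = "sample", values_to = "count")
    long %>% group_by(gene) %>% summarize(median_count = median(count))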


A definition of "analysis ecosystem"

© Carey et al., 2019


Caveats


Aspects of the AnVIL schematic

  • Fixed target platform: Google Compute Engine
  • Data library is mostly "files"
  • Binding between data and metadata is not simple; middleware is needed
  • No provision for a "software commons": every workspace needs its own software endowment
  • No versioning or detachable deployment


Conclusions

  • The federated data/analysis concept is attractive but costly
  • The burdens of data generation, QC, and management with high-throughput biotechnology strain most organizations
  • Package sharing and *Hub sharing in large organizations are not solved problems for Bioconductor or AnVIL
  • General assessment and improvement of ecosystem reliability/health needs more work
  • Data service technologies have been prototyped but remain underexplored
  • Harmonious interoperation of commons and evolving software ecosystems is a prerequisite for wide adoption of the commons concept -- those who ignore the strategies of Bioconductor are condemned to reinvent them


Extra slides follow


If 181,000 samples are not enough …


Grossman et al. 2016, PMID 29033693

OSDC has been running since 2009


OSDC schema