Software ecosystem concepts for federated genomic analysis

Vince Carey PhD

Bioc Europe 2019

  • 100,000 users (downloads), on laptops and local clusters
  • 1000+ packages; ~50 GB of software/reference data
  • ~daily CI/CD; 6-month release cycle
  • CRAN + Bioc git + system requirements
  • ~1000 developers: bugfixes in release, enhancements in devel
  • 1x commons -- commercial cloud

Can it all fit together?


Road map of the talk

  • Brief sketch of a data/analysis commons for "federated genomic analysis"
  • BigRNA / HumanTranscriptomeCompendium as an example of data service mediation with Bioconductor data structures
  • The Compendium software stack as an ecosystem
  • Review of unmet needs:
    • measures of ecosystem health
    • strategies for durable compatibility between a centralized, stability-focused commons and a dynamic analysis ecosystem like Bioconductor


Why consider a federated approach?

  • Key aim: efficient progress in genome biology and its applications
  • Obstacles: large multiomic surveys, the volume of sequencer and imaging outputs, quantification uncertainty, metadata management, access control
    • redundant computation and downloads are mostly wasted effort
  • Technical advances ("cloud" computing, distributed data services, …) allow us to formulate the aim of a federated genomic data/analysis commons


from RL Grossman, PMID 30691868


What is a data/analysis commons?

  • Curated, accessible, FAIR (findable, accessible, interoperable, reusable) data resources
  • Robust APIs that resolve queries from authorized users to produce data artifacts (see the sketch after this list)
    • e.g., the OpenAPI Specification
    • Swagger tools implementing OpenAPI -- ideally hidden from users and Bioconductor developers
  • Analytic components/APIs that help explore and interpret the artifacts
  • NHGRI AnVIL (Analysis, Visualization and Informatics Lab-space) is an emerging example
    • AnVIL makes use of components of the Broad's FireCloud (now named Terra)
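As a concrete flavor of the API bullet above, a minimal sketch of querying a commons-style REST endpoint from R follows. The host, path, and query parameters are invented for illustration and do not correspond to an actual AnVIL or Gen3 service.

    ## Minimal sketch: querying a hypothetical OpenAPI-described commons service.
    ## The URL and parameters are illustrative, not a real endpoint.
    library(httr)
    library(jsonlite)

    resp <- GET("https://commons.example.org/v1/datasets",
                query = list(assay = "RNA-seq", organism = "Homo sapiens"))
    stop_for_status(resp)

    ## assume the service returns a JSON array of dataset records
    datasets <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    head(datasets)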


Components of a working data/analysis commons


Aspects of the AnVIL schematic

  • Fixed target platform: Google Compute Engine
  • Two paths in for researchers: WDL and Workspace
  • Data library in "buckets" mediated through a fixed "data model"
  • Software components in Docker containers
  • Interaction through Jupyter or RStudio


Under the hood -- Morgan, Turaga, Stubbs


Summary

  • The interaction/analysis facets of a working data/analysis commons (Terra/AnVIL) are reasonably familiar and flexible
    • GCP supplies Linux + Container registry
    • User-defined containers can be used (some licensing constraints will be relevant)
    • Bioconductor users are "good to go" once
      • access to all relevant packages is established
      • natural patterns of persisting code and data are supported
  • We have left out some key things
    • What is the "data commons" aspect?
    • How does an evolving language/ecosystem co-evolve with the "interaction/analysis" stack?
      • Containers provide great freedom … is there a downside?


What is the "data commons" aspect?

  • In AnVIL, the Gen3 platform will play a substantial role in managing metadata, access control, data modeling, and query resolution
  • I will discuss another approach to serving genomic data at cloud/commons scale


Sean Davis' BigRNA: towards a data service for RNA-seq, based in NCBI SRA

  • Background -- Sean has created the GEOquery, GEOmetadb, and SRAdb packages, permitting very convenient access to NCBI resources
  • RNA-seq is a relatively new technology -- thoughts:
    • signal analysis of large-scale surveys can be useful for identifying approaches to bias reduction
    • uniform preprocessing and quantification of "all of SRA" is feasible with cloud technologies
  • In the summer of 2017, Salmon quantifications of 181,000 human RNA-seq samples were acquired from BigRNA, transformed to HDF5, and loaded into HSDS with support from John Readey of the HDF Group (a toy version of the HDF5 step is sketched after this list)
    • this is not complex, but it can be time-consuming and should be optimized
    • HDF5 in S3 buckets can now be served in the same manner
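A toy version of the transform-to-HDF5 staging step, using the Bioconductor rhdf5 package; the matrix stands in for a block of Salmon quantifications, and the file, dataset, and chunk settings are illustrative.

    ## Toy sketch of staging quantifications in HDF5 with rhdf5.
    ## 'quants' stands in for a block of Salmon output; names are illustrative.
    library(rhdf5)

    quants <- matrix(rexp(200 * 50), nrow = 200)   # 200 features x 50 samples
    h5createFile("quants.h5")
    h5createDataset("quants.h5", "assay001", dims = dim(quants),
                    storage.mode = "double", chunk = c(200, 10), level = 6)
    h5write(quants, "quants.h5", "assay001")
    h5ls("quants.h5")   # check the layout before pushing to HSDS or S3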


BigRNA + HSDS + rhdf5client + DelayedArray

Here the HDF Scalable Data Service (HSDS) provides the RESTful back end; a minimal sketch follows.
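The sketch below follows the pattern in the rhdf5client documentation; the endpoint and domain shown are the HDF Group's public demonstration values and may have changed since.

    ## Remote slicing through HSDS via rhdf5client; endpoint/domain are the
    ## HDF Group demo values from the package docs and may have moved.
    library(rhdf5client)

    da <- HSDSArray(endpoint = "http://hsdshdflab.hdfgroup.org",
                    svrtype  = "hsds",
                    domain   = "/shared/bioconductor/darmgcls.h5",
                    dsetname = "/assay001")
    dim(da)        # shape comes from service metadata; no data transferred yet
    da[1:5, 1:3]   # only this block travels over the REST interface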


Upshots

  • SummarizedExperiment + DelayedArray are well-exercised with local HDF5 storage of large numerical data
  • (for Bioconductor users) DelayedArray + rhdf5client + HSDS facilitate natural interrogation of very large numerical data resources in S3 stores
  • This sets a standard of self-description and functionality that other services and query approaches should meet
    • X[G, S] denotes the values of features G on samples S (illustrated in the sketch below)
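For Bioconductor users, the X[G, S] idiom is ordinary SummarizedExperiment subsetting. A sketch, assuming HumanTranscriptomeCompendium::htx_load() can reach its HSDS back end and that row names are Ensembl gene identifiers:

    ## X[G, S] as SummarizedExperiment subsetting; assumes htx_load() can
    ## reach its HSDS back end and that rownames are Ensembl gene IDs.
    library(HumanTranscriptomeCompendium)
    library(SummarizedExperiment)

    htx <- htx_load()          # SummarizedExperiment with HSDS-backed assay
    G <- "ENSG00000111640"     # GAPDH
    S <- colnames(htx)[1:50]   # first 50 samples, for illustration
    x <- assay(htx[G, S])      # only this block is fetched from the service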


Example: find RNA-seq studies in which a selected gene exhibits unusually large variation across samples

Defaults: order studies by the MAD of the selected gene; retain the top 10% of studies (sketched below).
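A hedged sketch of this screen, continuing with the htx object from the previous sketch; the colData column name "study_accession" is an assumption about the compendium's sample metadata.

    ## Per-study MAD screen for one gene; continues with 'htx' from above.
    ## The colData column "study_accession" is an assumed metadata field.
    gene <- "ENSG00000111640"                    # GAPDH, a housekeeping gene
    v <- as.numeric(assay(htx[gene, ]))          # one feature, all samples
    study <- colData(htx)$study_accession
    mads <- sort(tapply(v, study, mad, na.rm = TRUE), decreasing = TRUE)
    head(mads, ceiling(0.10 * length(mads)))     # top 10% of studies by MAD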


Oddity: list studies in which a housekeeping gene exhibits unusually large variation across samples


Upshots

  • code (7 lines) at https://bit.ly/342qVQK
  • These are interactive, targeted surveys of 181,000 RNA-seq samples from 4,600 distinct studies in NCBI SRA
  • There are no file downloads
  • Similar tactics could work for the Recount and Conquer archives, among others
  • What is the relationship to the software ecosystem?


The associated ecosystem slice


Proposal

  • RESTful data services can be used for all mature data in the commons
    • X[G, S] or tidyverse-like queries should be immediate (see the sketch after this list)
    • + DelayedArray + SummarizedExperiment (or MultiAssayExperiment) for Bioconductor users
      • if adopted, a large number of active users can hit the ground running with such a commons
      • would this approach impose constraints on the development and maintenance of the commons?
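A sketch of the tidyverse-facing side of the proposal, reshaping an X[G, S] block into a long table; it continues with the htx, G, and S objects from the earlier sketches and is hypothetical in the same respects.

    ## Reshaping an X[G, S] block for tidyverse-style work; continues with
    ## 'htx', 'G', and 'S' from the earlier sketches.
    library(dplyr)
    library(tidyr)
    library(tibble)

    mat <- as.matrix(assay(htx[G, S]))
    long <- as_tibble(mat, rownames = "gene") %>%
      pivot_longer(-gene, names_to = "sample", values_to = "count")
    long %>% group_by(gene) %>% summarize(median_count = median(count))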


A definition of "analysis ecosystem"

© Carey et al., 2019


Caveats


Aspects of the AnVIL schematic

  • Fixed target platform: Google Compute Engine
  • Data library is mostly "files"
  • Binding between data and metadata is not simple; middleware is needed
  • No provision for a "software commons": every workspace needs its own software endowment
  • No versioning or detachable deployment


Conclusions

  • The federated data/analysis concept is attractive but costly
  • The burdens of data generation, QC, and management with high-throughput biotechnology strain most organizations
  • Package sharing and *Hub sharing in large organizations are not solved problems for Bioconductor or AnVIL
  • General assessment and improvement of ecosystem reliability/health needs more work
  • Data service technologies have been prototyped but remain underexplored
  • Harmonious interoperation of commons and evolving software ecosystems is a prerequisite for wide adoption of the commons concept -- those who ignore the strategies of Bioconductor are condemned to reinvent them


Extra slides follow


If 181,000 samples are not enough …


Grossman et al. 2016, PMID 29033693

OSDC has been running since 2009


OSDC schema