1 of 30

What is AnVIL?

2 of 30

Learning Objectives

  • Appreciate the value in bringing researchers to the data
  • Introduce the main components of AnVIL
  • Understand the roadmap for data, infrastructure, and events

3 of 30

Goals of AnVIL

6 of 30

Sequencing History

2000s: Sanger Sequencing

  • Kilobase read length, low error & reliable; but slow and expensive

2010s: Genome Analyzer

  • Low error, substantially improved throughput; but very short reads (25bp)

2020s: Illumina NovaSeq

  • Low error, population-scale capacity with low costs, 250bp read lengths

9 of 30

Computational Genomics History

2000s: Institutional Clusters

  • Slow & expensive to apply at scale, large investment into IT support

2010s: Early Cloud

  • Limited tools, difficult to program or use, but great potential (Schatz, 2009)

2020s: The AnVIL

  • Integrated tools, datasets, and authentication running within a highly scalable cloud environment

10 of 30

AnVIL: Invert the model of genomic data science

Traditional: Bring data to the researcher

  • Copying/moving data is costly
  • Harder to enforce security
  • Redundant infrastructure
  • Siloed compute

Goal: Bring researcher to the data

  • Reduced redundancy and costs
  • Active threat detection and auditing
  • Greater accessibility
  • Elastic, shared compute
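
A back-of-the-envelope sketch of why copying data to each researcher is costly. Every number here (dataset size, number of groups, per-TB transfer price) is a hypothetical placeholder chosen for illustration, not an AnVIL or cloud-provider figure:

```python
def copied_storage_tb(dataset_tb: float, n_groups: int) -> float:
    """Total storage when every research group keeps its own copy."""
    return dataset_tb * n_groups


def shared_storage_tb(dataset_tb: float) -> float:
    """Total storage when all groups work against one shared copy."""
    return dataset_tb


def transfer_cost(dataset_tb: float, n_groups: int, price_per_tb: float) -> float:
    """Cost of moving one full copy of the dataset to each group."""
    return dataset_tb * n_groups * price_per_tb


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    dataset_tb = 500.0
    n_groups = 20
    price_per_tb = 80.0  # assumed transfer price per TB, arbitrary currency

    print("Storage, copy model (TB):  ", copied_storage_tb(dataset_tb, n_groups))
    print("Storage, shared model (TB):", shared_storage_tb(dataset_tb))
    print("Transfer cost, copy model: ", transfer_cost(dataset_tb, n_groups, price_per_tb))
```

In the copy model, storage and transfer both grow linearly with the number of groups; in the shared model the data is stored once and only compute scales with use.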

11 of 30

What is AnVIL?

12 of 30

What is AnVIL?

Scalable and interoperable computing resource for the genomics scientific community

  • Cloud-based infrastructure
    • Highly elastic; shared analysis and computing environment
  • Data access and security
    • Genomic datasets, phenotypes and metadata
    • Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    • dbGaP-authenticated sharing of primary and derived datasets
  • Collaborative computing environment for datasets and analysis workflows
    • Storage, scalable analytics, data visualization
    • Security, training & outreach, with new models of data access
    • ...for both users with limited computational expertise and sophisticated data scientists

14 of 30

More coming soon!

  • Workspaces and batch workflows
  • Live code, equations, visualizations and narratives
  • Analysis and comprehension of genomic data in R
  • Sharing containerized tools and workflows
  • Data models, indexing, querying
  • Accessible, reproducible, and transparent research

15 of 30

FISMA Moderate, 2 ATOs, Pursuing FedRAMP

All data use and analysis in a FISMA moderate environment

Implemented on Google Cloud Platform

Primary data storage costs are covered by AnVIL; users' private data and compute are billed directly through Google

16 of 30

Terra: Batch Workflows

Workflow Description Language (WDL) allows highly scalable analysis workflows
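
The scalability comes largely from the scatter-gather pattern: the same task is run independently on every sample, then the results are collected. The sketch below illustrates that pattern in plain Python rather than WDL; the task (`count_gc`) and the sample data are hypothetical stand-ins for a real analysis step:

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(sequence: str) -> int:
    """Toy per-sample task: count G and C bases in one sequence."""
    return sum(1 for base in sequence.upper() if base in "GC")


def scatter_gather(samples: dict[str, str], max_workers: int = 4) -> dict[str, int]:
    """Run count_gc on every sample independently, then gather results by sample name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Scatter: one independent task per sample.
        results = pool.map(count_gc, samples.values())
        # Gather: pair each result with its sample name (map preserves input order).
        return dict(zip(samples.keys(), results))


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    samples = {"sample_1": "GATTACA", "sample_2": "GGCC"}
    print(scatter_gather(samples))
```

In a batch workflow system the independent tasks are dispatched to separate cloud machines instead of local threads, which is what lets the same workflow run on a handful of samples or on many thousands.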

17 of 30

Terra: Jupyter Notebooks

Interactive Jupyter notebooks allow for transparent code, visualizations, and narratives

18 of 30

Terra: RStudio and Bioconductor

RStudio: analysis environment specifically designed for, and largely preferred by, the R community.

Bioconductor: tools and modules for the analysis and comprehension of high-throughput genomic data, implemented in R

  • 1,903 software packages available in Bioconductor release 3.11

AnVIL provides a robust, well-tested RStudio environment with the latest Bioconductor release integrated

19 of 30

Gen3: Enabling Search for AnVIL Data

Gen3 enables AnVIL users to:

  • Understand the data available in AnVIL, by reviewing Genomic Summary Result information about AnVIL datasets
  • Determine which datasets will be most useful for an analysis by creating virtual cohorts based on clinical attributes, prior to requesting access to individual subject/file data.
  • Once access is granted, send virtual cohorts made up of multiple datasets to a Terra workspace for deeper analysis

Today: AnVIL users search through listed workspaces to which they have access, cloning data into their own workspaces for analysis

Tomorrow: the Gen3 capabilities listed above

20 of 30

Gen3: The Gen3 Data Explorer

Search and design patient cohorts on the fly from existing samples
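
A minimal sketch of what building a virtual cohort amounts to: filtering subject-level metadata by clinical attributes across several datasets, then summarizing counts per dataset before any file-level access is requested. This is not the Gen3 API; the `Subject` fields, dataset names, and records are all hypothetical:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Subject:
    """Hypothetical subject-level metadata record."""
    subject_id: str
    dataset: str
    age: int
    phenotype: str


def build_cohort(
    subjects: Iterable[Subject],
    phenotype: str,
    min_age: int = 0,
    max_age: int = 200,
    datasets: Optional[set[str]] = None,
) -> list[Subject]:
    """Select subjects matching the clinical criteria, optionally limited to some datasets."""
    return [
        s for s in subjects
        if s.phenotype == phenotype
        and min_age <= s.age <= max_age
        and (datasets is None or s.dataset in datasets)
    ]


def summarize(cohort: Iterable[Subject]) -> dict[str, int]:
    """Count cohort members per dataset, to see which datasets are worth requesting."""
    return dict(Counter(s.dataset for s in cohort))


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    subjects = [
        Subject("A1", "dataset_a", 34, "asthma"),
        Subject("A2", "dataset_a", 61, "asthma"),
        Subject("B1", "dataset_b", 45, "asthma"),
        Subject("B2", "dataset_b", 52, "diabetes"),
    ]
    cohort = build_cohort(subjects, phenotype="asthma", min_age=30, max_age=50)
    print(summarize(cohort))
```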

21 of 30

Dockstore: Registry of tools and workflows

  • Tools – a container with metadata that documents the tool's interface
    • 18 WDL and 192 CWL tools currently
  • Workflows – a combination of multiple tools
    • 104 WDL and 139 CWL currently

22 of 30

Dockstore: Create, share, use

Import & execute Dockstore workflows inside Terra

23 of 30

Galaxy

  • Web-based analysis environment for running analysis tools and building workflows for users with no programming expertise
  • Galaxy ToolShed, a repository for community contributed tools and workflows, has 7,853 tools
  • Additionally, Galaxy integrates dozens of visualization tools which will also be available in AnVIL.

Available in AnVIL Oct 2020

24 of 30

Extending AnVIL

  • Bring your own tools and workflows
    • Either by registering them in Dockstore, or by uploading your own custom WDL to Terra
  • Build on top of the AnVIL APIs
    • All of the components of the AnVIL provide APIs
    • We will be providing a unified, stable API endpoint for the AnVIL with OpenAPI documentation
    • We are building API wrapper libraries in Python and R, largely generated from the OpenAPI specification but curated
    • See the repo: https://github.com/anvilproject
  • Adding new web applications
    • We are defining standards to allow containerized web applications to be hosted inside AnVIL
    • Leveraging standard container orchestration (Kubernetes) for complex applications
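
To illustrate what "generated from the OpenAPI specification" can mean, here is a minimal sketch that turns a tiny OpenAPI-style description into callable request builders, one per `operationId`. The spec, server URL, and operation names are hypothetical and are not the AnVIL API; the builders only construct the HTTP method and URL and make no network calls:

```python
from typing import Callable
from urllib.parse import urlencode

# Hypothetical, heavily trimmed OpenAPI-style document (illustration only).
HYPOTHETICAL_SPEC = {
    "servers": [{"url": "https://api.example.org/v1"}],
    "paths": {
        "/workspaces": {"get": {"operationId": "list_workspaces"}},
        "/workspaces/{name}": {"get": {"operationId": "get_workspace"}},
    },
}


def make_request_builder(base_url: str, method: str, path_template: str) -> Callable[..., tuple[str, str]]:
    """Return a function that builds (HTTP method, URL) for one operation."""
    def build(**params: object) -> tuple[str, str]:
        # Parameters named in the path template fill the path; the rest become the query string.
        path_params = {k: v for k, v in params.items() if "{" + k + "}" in path_template}
        query = {k: v for k, v in params.items() if k not in path_params}
        url = base_url + path_template.format(**path_params)
        if query:
            url += "?" + urlencode(query)
        return method.upper(), url
    return build


def generate_client(spec: dict) -> dict[str, Callable[..., tuple[str, str]]]:
    """Map each operationId in the spec to a request builder."""
    base_url = spec["servers"][0]["url"]
    client = {}
    for path, operations in spec["paths"].items():
        for method, operation in operations.items():
            client[operation["operationId"]] = make_request_builder(base_url, method, path)
    return client


if __name__ == "__main__":
    client = generate_client(HYPOTHETICAL_SPEC)
    print(client["get_workspace"](name="demo"))
    print(client["list_workspaces"](limit=5))
```

A curated wrapper library would layer authentication, error handling, and friendlier return types on top of stubs like these.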

25 of 30

What’s Next?

26 of 30

Data Roadmap

CMG

1000G

GTEx (V7 & V8)

eMERGE

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

  • Genotype & Phenotype Ingest

CCDG

  • Freeze 2 - (WGS/WES)

  • Freeze 2 - (Exomes & Subset VCFs)

  • Final Freeze - (Data Ingestion)

CSER

  • Primary Ingestion & Additional Freezes

UDN

Planning Stage

NIMH

HPP

NIA

GTEx (V9)

CEPH

eMERGE 4

COVID

PMDG

27 of 30

Infrastructure / Tools Roadmap

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

(“bring your own docker”)

(AnVIL native integration)

28 of 30

Upcoming Events

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

(10/27)

ASHG

GSP

(4/28-30)

IC-Stacks

(4/16-17)

(3/02)

ECC

(10/05)

ECC

“Train-trainer”

(3/17-18)

MaGIC

Jamboree

(June)

BioC Conference

7/21-24

Bioinformatics Community Conference

7/29-31

SACNAS

10/22-24

ECC

Interoperability

Outreach

(10/31)

ASHG

29 of 30

Summary

  • Appreciate the value in bringing researchers to the data
  • Introduce the main components of AnVIL
  • Understand the roadmap for data, infrastructure, and events

Next Steps

  • Cloud computing
  • How much does it cost?
  • Use cases

30 of 30

Contributions

  • Mike Schatz
  • Anthony Philippakis
  • Alessandro Culotti
  • Rich Silva
  • Mo Heydarian
  • Frederick Tan