1 of 30

What is AnVIL?

2 of 30

Learning Objectives

Appreciate the value in bringing researchers to the data
Introduce the main components of AnVIL
Understand the roadmap for data, infrastructure, and events

3 of 30

Goals of AnVIL

4 of 30

Sequencing History

2000s:

Sanger Sequencing

Kilobase read length, low error & reliable; but slow and expensive

5 of 30

Sequencing History

2010s:

Genome Analyzer

Low error, substantially improved throughput; but very short reads (25bp)

6 of 30

Sequencing History

2020s:

Illumina NovaSeq

Low-error, population-scale capacity with low costs, 250bp read lengths

7 of 30

Computational Genomics History

2000s:

Institutional Clusters

Slow & expensive to apply at scale, large investment into IT support

8 of 30

Computational Genomics History

2010s:

Early Cloud

Limited tools, difficult to program or use, but great potential (Schatz, 2009)

9 of 30

Computational Genomics History

2020s:

The AnVIL

Integrated tools, datasets, and authentication running within a highly scalable cloud environment

10 of 30

AnVIL: Invert the model of genomic data science

Traditional: Bring data to the researcher

Copying/moving data is costly
Harder to enforce security
Redundant infrastructure
Siloed compute

Goal: Bring the researcher to the data

Reduced redundancy and costs
Active threat detection and auditing
Greater accessibility
Elastic, shared compute

11 of 30

What is AnVIL?

12 of 30

What is AnVIL?

Scalable and interoperable computing resource for the genomics scientific community

Cloud-based infrastructure

Highly elastic; shared analysis and computing environment

Data access and security

Genomic datasets, phenotypes and metadata
Large datasets generated by NHGRI programs, as well as other initiatives / agencies
dbGaP-authenticated sharing of primary and derived datasets

Collaborative computing environment for datasets and analysis workflows

Storage, scalable analytics, data visualization
Security, training & outreach, with new models of data access
...for both users with limited computational expertise and sophisticated data scientists

13 of 30

More coming soon!

Workspaces and batch workflows

Sharing containerized tools and workflows

Data models, indexing, querying

14 of 30

More coming soon!

Live code, equations, visualizations and narratives

Analysis and comprehension of genomic data in R

Accessible, reproducible, and transparent research

15 of 30

FISMA Moderate
2 ATOs
Pursuing FedRAMP

All data use and analysis in a FISMA moderate environment

Implemented on Google Cloud Platform

Primary data storage costs covered by AnVIL; user private data and compute billed directly through Google

16 of 30

Terra: Batch Workflows

Workflow Description Language (WDL) allows highly scalable analysis workflows
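
As a rough analogy for how a batch workflow scales, the Python sketch below scatters one task per sample across workers and then gathers the results into a single summary. It is not WDL and does not call Terra. The sample names, sequences, and the `count_bases` task are hypothetical placeholders for a real analysis step.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-sample inputs; a real workflow would point at files in cloud storage.
SAMPLES = {
    "sample_A": "ACGTACGT",
    "sample_B": "GGGCCC",
    "sample_C": "ATAT",
}


def count_bases(name, sequence):
    """Per-sample task: count G/C bases (a stand-in for a real analysis step)."""
    gc = sum(1 for base in sequence if base in "GC")
    return name, gc, len(sequence)


def run_workflow(samples, max_workers=4):
    """Scatter the task over all samples, then gather one GC fraction per sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda item: count_bases(*item), samples.items())
        return {name: gc / total for name, gc, total in results}


if __name__ == "__main__":
    for name, fraction in sorted(run_workflow(SAMPLES).items()):
        print(f"{name}: GC fraction {fraction:.2f}")
```

In a workflow engine the same shape appears as a scatter over inputs followed by a gather step, with each task running in its own container rather than in a local thread.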

17 of 30

Terra: Jupyter Notebooks

Interactive Jupyter notebooks allow for transparent code, visualizations, and narratives

18 of 30

Terra: RStudio and Bioconductor

RStudio: analysis environment specifically designed for, and largely preferred by, the R community.

Bioconductor: tools and modules for the analysis and comprehension of high-throughput genomic data, implemented in R

1,903 software packages available in Bioconductor release 3.11

AnVIL provides a robust, well-tested RStudio environment with the latest Bioconductor release integrated

19 of 30

Gen3: Enabling Search for AnVIL Data

Today: AnVIL users search through listed workspaces to which they have access, cloning data into their own workspaces for analysis

Tomorrow: Gen3 enables AnVIL users to:

Understand the data available in AnVIL by reviewing Genomic Summary Result information about AnVIL datasets
Determine which datasets will be most useful for an analysis by creating virtual cohorts based on clinical attributes, prior to requesting access to individual subject/file data
Once access is granted, send virtual cohorts made up of multiple datasets to a Terra workspace for deeper analysis
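
The virtual-cohort idea on this slide can be illustrated with a minimal Python sketch: filter subjects by clinical attributes across several datasets, then list which datasets the cohort draws on before requesting access. This is not the Gen3 API. The `Subject` fields, dataset names, and records are hypothetical, and a toy in-memory list stands in for whatever index Gen3 actually queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    dataset: str      # hypothetical dataset label
    subject_id: str   # hypothetical subject identifier
    age: int
    diagnosis: str


# Hypothetical records spanning two datasets.
SUBJECTS = [
    Subject("dataset_1", "S001", 34, "control"),
    Subject("dataset_1", "S002", 61, "case"),
    Subject("dataset_2", "S101", 58, "case"),
    Subject("dataset_2", "S102", 45, "control"),
]


def build_cohort(subjects, diagnosis, min_age):
    """Select subjects matching the clinical attributes, regardless of dataset."""
    return [s for s in subjects if s.diagnosis == diagnosis and s.age >= min_age]


def datasets_needed(cohort):
    """List the datasets a cohort draws on, i.e. where access would be requested."""
    return sorted({s.dataset for s in cohort})


if __name__ == "__main__":
    cohort = build_cohort(SUBJECTS, diagnosis="case", min_age=50)
    print([s.subject_id for s in cohort])
    print(datasets_needed(cohort))
```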

20 of 30

Gen3: The Gen3 Data Explorer

Search and design patient cohorts on the fly from existing samples

21 of 30

Dockstore: Registry of tools and workflows

Tools – a container with metadata that documents the tool's interface

18 WDL and 192 CWL tools currently

Workflows – a combination of multiple tools

104 WDL and 139 CWL workflows currently

22 of 30

Dockstore: Create, share, use

Import & Execute

Dockstore workflows inside Terra

23 of 30

Galaxy

Web-based analysis environment for running analysis tools and building workflows for users with no programming expertise
Galaxy ToolShed, a repository for community contributed tools and workflows, has 7,853 tools
Additionally, Galaxy integrates dozens of visualization tools which will also be available in AnVIL.

Available in AnVIL Oct 2020

24 of 30

Extending AnVIL

Bring your own tools and workflows

Either by registering them in Dockstore, or by uploading your own custom WDL to Terra

Build on top of the AnVIL APIs

All of the components of the AnVIL provide APIs
We will be providing a unified, stable API endpoint for the AnVIL with OpenAPI documentation
We are building API wrapper libraries in Python and R, largely generated from the OpenAPI specification but curated
See the repo: https://github.com/anvilproject
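
To give a feel for what a thin, curated API wrapper might look like, here is a minimal Python sketch that only prepares an authenticated request and never sends it. Everything specific in it is an assumption: the base URL `https://anvil.example.org/api`, the `AnvilClient` class, the `workspaces` resource, and the `limit` parameter are hypothetical and do not come from the AnVIL libraries or their OpenAPI documents.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; the slides do not specify the unified AnVIL endpoint.
BASE_URL = "https://anvil.example.org/api"


class AnvilClient:
    """Hypothetical wrapper that builds (but does not send) authenticated requests."""

    def __init__(self, token, base_url=BASE_URL):
        self.token = token
        self.base_url = base_url.rstrip("/")

    def build_request(self, path, params=None):
        """Assemble a GET request with bearer-token and JSON headers."""
        url = f"{self.base_url}/{path.lstrip('/')}"
        if params:
            url = f"{url}?{urlencode(params)}"
        request = Request(url)
        request.add_header("Authorization", f"Bearer {self.token}")
        request.add_header("Accept", "application/json")
        return request

    def list_workspaces(self, limit=10):
        """Hypothetical operation: prepare a GET for a 'workspaces' resource."""
        return self.build_request("workspaces", {"limit": limit})


if __name__ == "__main__":
    req = AnvilClient("fake-token").list_workspaces(limit=5)
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes a wrapper like this easy to test without network access. A generated client would derive the paths and parameters from the OpenAPI specification instead of hard-coding them.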

Adding new web applications

We are defining standards to allow containerized web applications to be hosted inside AnVIL
Leveraging standard container orchestration (Kubernetes) for complex applications

25 of 30

What’s Next?

26 of 30

Data Roadmap

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

Datasets: CMG, 1000G, GTEx (V7 & V8), eMERGE, CCDG, CSER, UDN, NIMH, HPP, NIA, GTEx (V9), CEPH, eMERGE 4, COVID, PMDG

Milestones: Genotype & Phenotype Ingest; Freeze 2 (WGS/WES); Freeze 2 (Exomes & Subset VCFs); Final Freeze (Data Ingestion); Primary Ingestion & Additional Freezes; Planning Stage

27 of 30

Infrastructure / Tools Roadmap

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

Labels: “bring your own docker”; AnVIL native integration

28 of 30

Upcoming Events

Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020

Events and dates: (10/27); ASHG; GSP; (4/28-30); IC-Stacks; (4/16-17); (3/02); ECC; (10/05); ECC; “Train-trainer”; (3/17-18); MaGIC Jamboree; (June); BioC Conference; 7/21-24; Bioinformatics Community Conference; 7/29-31; SACNAS; 10/22-24; ECC; Interoperability; Outreach; (10/31); ASHG

29 of 30

Summary

Appreciate the value in bringing researchers to the data
Introduce the main components of AnVIL
Understand the roadmap for data, infrastructure, and events

Next Steps

Cloud computing
How much does it cost?
Use cases

30 of 30

Contributions

Mike Schatz
Anthony Philippakis
Alessandro Culotti
Rich Silva
Mo Heydarian
Frederick Tan