What is AnVIL?
Learning Objectives
Goals of AnVIL
Sequencing History
2000s:
Sanger Sequencing
Kilobase read length, low error and reliable, but slow and expensive
2010s:
Genome Analyzer
Low error, substantially improved throughput, but very short reads (25 bp)
2020s:
Illumina NovaSeq
Low error, population-scale capacity at low cost, 250 bp read lengths
Computational Genomics History
2000s:
Institutional Clusters
Slow & expensive to apply at scale, large investment into IT support
2010s:
Early Cloud
Limited tools, difficult to program or use, but great potential (Schatz, 2009)
2020s:
The AnVIL
Integrated tools, datasets, and authentication running within a highly scalable cloud environment
AnVIL: Invert the model of genomic data science
Traditional: Bring data to the researcher
Goal: Bring researcher to the data
What is AnVIL?
Scalable and interoperable computing resource for the genomics scientific community
More coming soon!
Workspaces and batch workflows
Live code, equations, visualizations and narratives
Analysis and comprehension of genomic data in R
Sharing containerized tools and workflows
Data models, indexing, querying
Accessible, reproducible, and transparent research
FISMA Moderate
2 ATOs
Pursuing FedRAMP
All data use and analysis in a FISMA moderate environment
Implemented on Google Cloud Platform
Primary data storage costs are covered by AnVIL; users' private data and compute are billed directly through Google
Terra: Batch Workflows
Workflow Description Language (WDL) allows highly scalable analysis workflows
Terra: Jupyter Notebooks
Interactive Jupyter notebooks allow for transparent code, visualizations, and narratives
Terra: RStudio and Bioconductor
RStudio: analysis environment specifically designed for, and largely preferred by, the R community
Bioconductor: tools and modules for the analysis and comprehension of high-throughput genomic data, implemented in R
AnVIL provides a robust, well-tested RStudio environment with the latest Bioconductor release integrated
Gen3: Enabling Search for AnVIL Data
Gen3 enables AnVIL users to:
AnVIL users search through listed workspaces to which they have access, cloning data into their own workspaces for analysis
Today
Tomorrow
Gen3: The Gen3 Data Explorer
Search and design patient cohorts on the fly from existing samples
Dockstore: Registry of tools and workflows
Dockstore: Create, share, use
Import and execute Dockstore workflows inside Terra
Galaxy
Available in AnVIL Oct 2020
Extending AnVIL
What’s Next?
Data Roadmap
Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020, Planning Stage
Datasets: CMG, 1000G, GTEx (V7 & V8), eMERGE, CCDG, CSER, UDN, NIMH, HPP, NIA, GTEx (V9), CEPH, eMERGE 4, COVID, PMDG
Infrastructure / Tools Roadmap
Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020
Annotations: “bring your own docker”, AnVIL native integration
Upcoming Events
Timeline: 2019, Q1 2020, Q2 2020, Q3 2020, Q4 2020
Events and dates: (10/27), ASHG, GSP, (4/28-30), IC-Stacks, (4/16-17), (3/02), ECC, (10/05), ECC, “Train-trainer”, (3/17-18), MaGIC Jamboree, (June), BioC Conference 7/21-24, Bioinformatics Community Conference 7/29-31, SACNAS 10/22-24, ECC, Interoperability, Outreach, (10/31), ASHG
Summary
Next Steps
Contributions