Variant Transforms and BigQuery
May 2018
Google Cloud Healthcare and Life Sciences
Agenda
BigQuery
Example
1000 Genomes
Example
How many true variants are there per chromosome?
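The query itself did not survive in this export, so the following is only a sketch of the counting logic in plain Python. It assumes that a "true variant" is a row in which at least one call carries a non-reference allele (genotype value greater than 0), and that rows use `reference_name`, `call`, and `genotype` fields; the sample rows are made up for illustration.

```python
from collections import Counter


def count_true_variants(rows):
    """Count rows per chromosome where some call has a non-reference allele."""
    counts = Counter()
    for row in rows:
        # Genotype 0 is the reference allele; negative values mean "no call".
        has_alt = any(gt > 0 for call in row["call"] for gt in call["genotype"])
        if has_alt:
            counts[row["reference_name"]] += 1
    return dict(counts)


# Made-up rows, shaped like the assumed table layout.
rows = [
    {"reference_name": "1", "call": [{"name": "S1", "genotype": [0, 1]}]},
    {"reference_name": "1", "call": [{"name": "S1", "genotype": [0, 0]}]},
    {"reference_name": "X", "call": [{"name": "S1", "genotype": [1, 1]},
                                     {"name": "S2", "genotype": [0, 0]}]},
]

per_chromosome = count_true_variants(rows)
```

In BigQuery the same idea would typically be written as a grouped count over the nested call records.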
Challenge
How to get variants into BigQuery?
VCF
standard format for storing variants
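To make the format concrete, here is a minimal sketch that splits one VCF data line into its eight fixed, tab-separated columns. The example line is made up for illustration, and the parser ignores sample columns and header metadata, so it is not a full VCF reader.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict of its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # several alternates are comma-separated
    info = {}
    for item in record["INFO"].split(";"):
        if item == ".":
            continue                          # "." means the column is empty
        key, sep, value = item.partition("=")
        info[key] = value if sep else True    # keys without "=" are flags
    record["INFO"] = info
    return record


# Made-up data line: chromosome 20, position 14370, G to A.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB"
record = parse_vcf_line(example)
```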
Solution
Variant Transforms
Open-source tool for loading VCF files into BigQuery
Highly scalable
Robustly handles malformed and/or incompatible VCF files
Variant Transforms
VCF validator report example
Architecture
VCF file(s) on GCS -> Variant Transforms (Dataflow) -> BigQuery
Parse file(s)
Merge (optional)
Filter
Convert to BigQuery
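The four stages above can be sketched in plain Python. This is not the tool's code (the real pipeline runs on Dataflow), and what each stage does internally is an assumption: merging by matching position and alleles, filtering with a caller-supplied predicate, and the row field names are all illustrative choices.

```python
def parse(lines, sample_name, malformed):
    """Parse stage: yield one record per data line; collect unparseable lines."""
    for line in lines:
        if line.startswith("#"):
            continue  # header and metadata lines
        cols = line.rstrip("\n").split("\t")
        try:
            gt_text = cols[9].split(":")[0].replace("|", "/")
            genotype = [-1 if a == "." else int(a) for a in gt_text.split("/")]
            record = {
                "reference_name": cols[0],
                "start_position": int(cols[1]) - 1,  # assumed 0-based in the output
                "reference_bases": cols[3],
                "alternate_bases": cols[4].split(","),
                "calls": [{"name": sample_name, "genotype": genotype}],
            }
        except (IndexError, ValueError):
            malformed.append(line)
            continue
        yield record


def merge(records):
    """Merge stage (optional): combine calls of records with identical site and alleles."""
    merged = {}
    for rec in records:
        key = (rec["reference_name"], rec["start_position"],
               rec["reference_bases"], tuple(rec["alternate_bases"]))
        if key in merged:
            merged[key]["calls"].extend(rec["calls"])
        else:
            merged[key] = rec
    return list(merged.values())


def keep(records, predicate):
    """Filter stage: keep records for which the (hypothetical) predicate holds."""
    return [r for r in records if predicate(r)]


def to_bigquery_row(rec):
    """Convert stage: shape a record as a nested, JSON-like row."""
    return {
        "reference_name": rec["reference_name"],
        "start_position": rec["start_position"],
        "reference_bases": rec["reference_bases"],
        "alternate_bases": [{"alt": a} for a in rec["alternate_bases"]],
        "call": rec["calls"],
    }


# Two made-up single-sample files; the second line of file_a is deliberately broken.
file_a = ["##fileformat=VCFv4.2", "20\t100\t.\tG\tA\t50\tPASS\t.\tGT\t0/1", "not a vcf line"]
file_b = ["20\t100\t.\tG\tA\t60\tPASS\t.\tGT\t1/1", "21\t7\t.\tC\tT\t9\tPASS\t.\tGT\t0/0"]

malformed = []
records = list(parse(file_a, "sample_a", malformed)) + list(parse(file_b, "sample_b", malformed))
rows = [to_bigquery_row(r) for r in keep(merge(records), lambda r: r["reference_name"] == "20")]
```

Here the two files share one site on chromosome 20, so merging yields a single row with two calls, and the broken line ends up in `malformed` instead of stopping the run.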
Example pipeline
Load 6 VCF files from Platinum Genomes (~31GB) into one BigQuery table.
Dataflow
Dataflow (after 23 mins)
BigQuery Table
Example query (sex inference)
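The query on this slide is not preserved. One common way to approach sex inference is to measure heterozygosity on chromosome X, since a sample with a single X should show almost no heterozygous calls there. The sketch below assumes that approach, assumes rows with `reference_name` and nested `call` records, ignores pseudoautosomal regions, and uses made-up data.

```python
from collections import defaultdict


def x_heterozygosity(rows):
    """Per sample: fraction of non-reference chrX calls that are heterozygous."""
    het = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        if row["reference_name"] not in ("X", "chrX"):
            continue
        for call in row["call"]:
            alleles = [a for a in call["genotype"] if a >= 0]  # drop no-calls
            if not alleles or all(a == 0 for a in alleles):
                continue  # skip missing and homozygous-reference calls
            total[call["name"]] += 1
            if len(set(alleles)) > 1:
                het[call["name"]] += 1
    return {name: het[name] / total[name] for name in total}


# Made-up rows: sample "F" has one heterozygous chrX call, sample "M" has none.
rows = [
    {"reference_name": "X", "call": [{"name": "F", "genotype": [0, 1]},
                                     {"name": "M", "genotype": [1, 1]}]},
    {"reference_name": "X", "call": [{"name": "F", "genotype": [1, 1]},
                                     {"name": "M", "genotype": [1, 1]}]},
    {"reference_name": "1", "call": [{"name": "F", "genotype": [0, 1]},
                                     {"name": "M", "genotype": [0, 1]}]},
]

ratios = x_heterozygosity(rows)
```

A ratio near zero suggests a single X chromosome. Where to draw the cutoff depends on the data and is left open here.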
CSQ=G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000333082|protein_coding|||||||||||2567|1||HGNC|HGNC:9571|1|P1|ENSP00000327704||||||||||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000333082|protein_coding|||||||||||2567|1||HGNC|HGNC:9571|1|P1|ENSP00000327704||||||||||||||||||||||||||||,G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000381899|protein_coding|||||||||||2582|1|cds_end_NF|HGNC|HGNC:9571|2||ENSP00000371324||||...
Annotations
Native support for parsing annotation fields from VEP
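To show what parsing such a field involves, here is a minimal sketch that splits a CSQ value like the one shown earlier into one dict per annotation. In a real file the field names come from the CSQ description in the VCF header; the eight names below are an assumption based on VEP's usual leading columns, and the example string is a shortened version of the slide's value.

```python
# Assumed names for the first eight pipe-separated fields (normally read from the header).
ASSUMED_FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL",
                  "Gene", "Feature_type", "Feature", "BIOTYPE"]


def parse_csq(info_value, field_names=ASSUMED_FIELDS):
    """Split a CSQ value into a list of dicts, one per annotation."""
    text = info_value[len("CSQ="):] if info_value.startswith("CSQ=") else info_value
    annotations = []
    for entry in text.split(","):  # one entry per allele/transcript pair
        values = entry.split("|")
        # zip stops at the shorter sequence, so extra trailing fields are dropped.
        annotations.append(dict(zip(field_names, values)))
    return annotations


example = (
    "CSQ=G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|"
    "ENST00000333082|protein_coding,"
    "T|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|"
    "ENST00000333082|protein_coding"
)
annotations = parse_csq(example)
```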
Example query using annotations
Find all high impact variants in BRCA1/BRCA2 genes:
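The query text is not part of this export, so the filter is sketched here in plain Python instead. It assumes each alternate allele carries a nested list of parsed annotations under a `CSQ` key, with `SYMBOL` and `IMPACT` fields, where `HIGH` is VEP's most severe impact category. The rows, positions, and consequences are made up.

```python
def high_impact_in_genes(rows, genes=("BRCA1", "BRCA2")):
    """Return (chromosome, start, alt, gene, consequence) for HIGH-impact annotations."""
    hits = []
    for row in rows:
        for alt in row["alternate_bases"]:
            for ann in alt.get("CSQ", []):
                if ann.get("IMPACT") == "HIGH" and ann.get("SYMBOL") in genes:
                    hits.append((row["reference_name"], row["start_position"],
                                 alt["alt"], ann["SYMBOL"], ann.get("Consequence")))
    return hits


# Made-up rows; positions and consequences are illustrative only.
rows = [
    {"reference_name": "17", "start_position": 1000,
     "alternate_bases": [{"alt": "T", "CSQ": [
         {"SYMBOL": "BRCA1", "IMPACT": "HIGH", "Consequence": "stop_gained"}]}]},
    {"reference_name": "13", "start_position": 2000,
     "alternate_bases": [{"alt": "G", "CSQ": [
         {"SYMBOL": "BRCA2", "IMPACT": "MODIFIER", "Consequence": "intron_variant"}]}]},
    {"reference_name": "20", "start_position": 3000,
     "alternate_bases": [{"alt": "A", "CSQ": [
         {"SYMBOL": "PSMF1", "IMPACT": "HIGH", "Consequence": "frameshift_variant"}]}]},
]

hits = high_impact_in_genes(rows)
```

Only the first row matches: the second is in BRCA2 but not high impact, and the third is high impact but in another gene.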
Example query using annotations
Find all high impact variants involved in double-strand break repair (GO:0006302):
Variant Transforms
BQ Export
Case Study: Color
Actionable Insights
BigQuery
BigQuery
Large number of VCF files
WSI DICOM Store
ML
Questions?
Visit our GitHub page for roadmap & feature requests:
github.com/googlegenomics/gcp-variant-transforms