1 of 26

May 2018

Google Cloud Healthcare and Life Sciences

2 of 26

Agenda

  • BigQuery overview
  • Variant Transforms overview
  • Examples

3 of 26

BigQuery

  • Highly scalable, columnar-storage data warehouse
  • Fully managed
  • Powered by multiple data centers that each have:
    • Hundreds of thousands of cores
    • Dozens of petabytes of storage
    • Terabytes of networking bandwidth
  • Low cost
    • Storage: $0.02/GB/month (or $0.01/GB/month for long-term storage)
    • Query: $5/TB
  • Supports standard SQL
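
To make the pricing bullets concrete, here is a minimal arithmetic sketch in Python. The rates are the ones quoted on this slide (May 2018) and may have changed since. The function names and the 1 TB = 2**40 bytes convention are assumptions made for illustration.

```python
# Rates as quoted on the slide (May 2018); verify current pricing before use.
STORAGE_USD_PER_GB_MONTH = 0.02
LONG_TERM_STORAGE_USD_PER_GB_MONTH = 0.01
QUERY_USD_PER_TB = 5.00


def monthly_storage_cost(size_gb, long_term=False):
    """Monthly storage cost in USD for a table of the given size in GB."""
    if long_term:
        rate = LONG_TERM_STORAGE_USD_PER_GB_MONTH
    else:
        rate = STORAGE_USD_PER_GB_MONTH
    return size_gb * rate


def query_cost(bytes_scanned):
    """Cost in USD of one query, assuming 1 TB = 2**40 bytes."""
    return bytes_scanned / 2**40 * QUERY_USD_PER_TB


if __name__ == "__main__":
    # Hypothetical 1000 GB table, queried once with a full scan of 1 TB.
    print(monthly_storage_cost(1000))
    print(monthly_storage_cost(1000, long_term=True))
    print(query_cost(2**40))
```

Because queries are billed by data scanned, selecting only the columns you need keeps the query cost well below a full-table scan in a columnar store.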

4 of 26

Example

1000 genomes

5 of 26

Example

How many true variants are there per chromosome?
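
The query shown on this slide is not preserved in the text, so the following is only a sketch of the counting logic in plain Python on toy data. It assumes a "true variant" means a record where at least one call carries a non-reference allele (genotype index above 0). The field names `reference_name`, `call`, and `genotype` are assumptions modeled loosely on a variants table layout.

```python
from collections import Counter


def count_true_variants_per_chromosome(records):
    """Count records per chromosome that have at least one non-reference call.

    Assumption: genotype values are allele indices, where 0 is the reference
    allele, values above 0 are alternate alleles, and -1 means "no call".
    """
    counts = Counter()
    for record in records:
        has_alt_call = any(
            allele > 0
            for call in record["call"]
            for allele in call["genotype"]
        )
        if has_alt_call:
            counts[record["reference_name"]] += 1
    return dict(counts)


# Toy records, invented for illustration only.
toy_records = [
    {"reference_name": "1", "call": [{"name": "S1", "genotype": [0, 1]}]},
    {"reference_name": "1", "call": [{"name": "S1", "genotype": [0, 0]}]},
    {"reference_name": "2", "call": [
        {"name": "S1", "genotype": [1, 1]},
        {"name": "S2", "genotype": [-1, -1]},
    ]},
]

if __name__ == "__main__":
    print(count_true_variants_per_chromosome(toy_records))
```

In BigQuery the same idea would be expressed as a grouped count over the chromosome column, with the per-call check done on the nested call records.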

7 of 26

Challenge

How to get variants into BigQuery?

8 of 26

VCF

Standard format for storing variants

9 of 26

Challenge

How to get variants into BigQuery?

Solution

Variant Transforms

10 of 26

Variant Transforms

Open-source tool for loading VCF files into BigQuery

  • Developed by the Google Cloud Healthcare team
  • Source of truth on GitHub
  • External contributions are welcome!

Highly scalable

  • Hundreds of thousands of files
  • Millions of samples
  • Billions of records

Robustly handles malformed and/or incompatible VCF files

  • Fixes missing/incorrect headers
  • Gracefully handles invalid records
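
As an illustration of what "gracefully handles invalid records" can mean in practice, here is a minimal sketch that parses VCF data lines and collects bad ones in an error list instead of aborting. It is not the tool's actual implementation. The function name and the output dictionary keys are hypothetical.

```python
def parse_vcf_lines(lines):
    """Parse VCF data lines, returning (records, errors).

    Invalid lines are reported as (line_number, message) tuples rather than
    raising, so one malformed record does not stop the whole load.
    """
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        try:
            # VCF has 8 mandatory columns: CHROM POS ID REF ALT QUAL FILTER INFO
            if len(fields) < 8:
                raise ValueError(
                    f"expected at least 8 columns, got {len(fields)}"
                )
            records.append({
                "chrom": fields[0],
                "pos": int(fields[1]),  # raises ValueError if not an integer
                "id": fields[2],
                "ref": fields[3],
                "alt": fields[4].split(","),
            })
        except ValueError as error:
            errors.append((line_number, str(error)))
    return records, errors


if __name__ == "__main__":
    sample = [
        "##fileformat=VCFv4.2",
        "1\t100\t.\tA\tG\t50\tPASS\t.",
        "1\tnot_a_number\t.\tA\tG\t50\tPASS\t.",
        "1\t200\t.\tA",
    ]
    good, bad = parse_vcf_lines(sample)
    print(len(good), "valid;", len(bad), "invalid")
```

Keeping the line number with each error makes it possible to produce a per-file report of what was skipped and why.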

11 of 26

VCF validator report example

13 of 26

Architecture

VCF file(s) on GCS → Variant Transforms (Dataflow) → BigQuery

Pipeline stages inside Variant Transforms:

  • Parse file(s)
  • Merge (optional)
  • Filter
  • Convert to BigQuery
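
The four stages can be sketched as plain Python functions chained together. This is an in-memory illustration only: the real tool runs as a Dataflow pipeline. The merge key, the "keep PASS records" filter rule, and the output row layout are all assumptions made for the example.

```python
def parse(files):
    """Yield one record per data line; `files` maps a sample name to lines."""
    for sample_name, lines in files.items():
        for line in lines:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt, _qual, flt = line.split("\t")[:7]
            yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
                   "filter": flt, "samples": [sample_name]}


def merge(records):
    """Combine records that describe the same variant (assumed merge key)."""
    merged = {}
    for record in records:
        key = (record["chrom"], record["pos"], record["ref"], record["alt"])
        if key in merged:
            merged[key]["samples"].extend(record["samples"])
        else:
            merged[key] = dict(record, samples=list(record["samples"]))
    return list(merged.values())


def keep_passing(records):
    """Assumed filter rule: keep records whose FILTER is PASS or missing."""
    return [r for r in records if r["filter"] in ("PASS", ".")]


def to_row(record):
    """Convert a record to a table row (hypothetical layout, 0-based start)."""
    return {
        "reference_name": record["chrom"],
        "start_position": record["pos"] - 1,
        "reference_bases": record["ref"],
        "alternate_bases": record["alt"].split(","),
        "samples": record["samples"],
    }


def run_pipeline(files, merge_variants=True):
    records = parse(files)
    records = merge(records) if merge_variants else list(records)
    return [to_row(r) for r in keep_passing(records)]


if __name__ == "__main__":
    toy_files = {
        "a": ["#header", "1\t100\t.\tA\tG\t50\tPASS\t."],
        "b": ["1\t100\t.\tA\tG\t40\tPASS\t.", "2\t5\t.\tC\tT\t10\tLowQual\t."],
    }
    for row in run_pipeline(toy_files):
        print(row)
```

Making the merge step a switch mirrors the "(optional)" label on the slide: without it, each input record becomes its own row.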

14 of 26

Example pipeline

Load 6 VCF files from Platinum Genomes (~31 GB) into a single BigQuery table.

15 of 26

Dataflow

16 of 26

Dataflow

(after 23 mins)

17 of 26

BigQuery Table

20 of 26

Example query

(sex inference)
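
The query itself is not preserved in the text. One common heuristic for sex inference is the heterozygosity rate on chromosome X: samples with a single X have almost no heterozygous calls outside the pseudoautosomal regions. The sketch below applies that heuristic in plain Python. The function name and the 0.2 threshold are hypothetical and would need tuning on real data.

```python
def infer_sex(x_genotypes, het_threshold=0.2):
    """Infer sex from diploid genotype calls on chromosome X.

    `x_genotypes` is a list of (allele, allele) pairs for one sample, taken
    from chrX outside the pseudoautosomal regions. Pairs containing -1
    (no call) and non-diploid calls are ignored.
    The threshold is an assumed value, not a calibrated one.
    """
    called = [g for g in x_genotypes if len(g) == 2 and -1 not in g]
    if not called:
        return "unknown"
    het_count = sum(1 for first, second in called if first != second)
    het_rate = het_count / len(called)
    return "female" if het_rate > het_threshold else "male"


if __name__ == "__main__":
    print(infer_sex([(0, 1), (0, 0), (1, 0), (1, 1)]))
    print(infer_sex([(0, 0), (1, 1), (1, 1), (-1, -1)]))
```

In BigQuery the same ratio can be computed per sample by grouping over the nested call records and restricting to chromosome X.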

22 of 26

CSQ=G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000333082|protein_coding|||||||||||2567|1||HGNC|HGNC:9571|1|P1|ENSP00000327704||||||||||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000333082|protein_coding|||||||||||2567|1||HGNC|HGNC:9571|1|P1|ENSP00000327704||||||||||||||||||||||||||||,G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|Transcript|ENST00000381899|protein_coding|||||||||||2582|1|cds_end_NF|HGNC|HGNC:9571|2||ENSP00000371324||||...

Annotations

Native support for parsing annotation fields from VEP
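
To show what parsing a CSQ field involves, here is a minimal sketch that splits the value into one dictionary per annotation. The field names are an assumption: in a real file they come from the `Format:` list in the CSQ header line, and only the first eight are used here. This is not the tool's own parser.

```python
# Assumed names for the first eight subfields; the authoritative list is the
# "Format:" description of the CSQ INFO header in the VCF file itself.
CSQ_FIELDS = [
    "Allele", "Consequence", "IMPACT", "SYMBOL",
    "Gene", "Feature_type", "Feature", "BIOTYPE",
]


def parse_csq(info_value, field_names=CSQ_FIELDS):
    """Split a CSQ value into a list of {field_name: value} dictionaries.

    Annotations are comma-separated; subfields within each are pipe-separated.
    Subfields beyond the supplied names are dropped by zip().
    """
    if info_value.startswith("CSQ="):
        info_value = info_value[len("CSQ="):]
    annotations = []
    for entry in info_value.split(","):
        parts = entry.split("|")
        annotations.append(dict(zip(field_names, parts)))
    return annotations


if __name__ == "__main__":
    # Shortened form of the example on the slide (first eight subfields only).
    csq = (
        "CSQ=G|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|"
        "Transcript|ENST00000333082|protein_coding,"
        "T|upstream_gene_variant|MODIFIER|PSMF1|ENSG00000125818|"
        "Transcript|ENST00000333082|protein_coding"
    )
    for annotation in parse_csq(csq):
        print(annotation["Allele"], annotation["SYMBOL"], annotation["IMPACT"])
```

Once each annotation is a set of named fields, gene symbol and impact become ordinary columns that can be filtered on.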

23 of 26

Example query using Annotations

Find all high-impact variants in the BRCA1/BRCA2 genes:
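
The query on this slide is not preserved in the text. As a sketch of the filtering logic, the Python below keeps variants that have at least one annotation with impact HIGH in one of the requested genes. The record layout, the `annotations` key, and the toy data are hypothetical. `IMPACT` and `SYMBOL` are assumed to be the annotation field names.

```python
def high_impact_variants_in_genes(variants, genes=("BRCA1", "BRCA2")):
    """Return variants with at least one HIGH-impact annotation in `genes`.

    Each variant is assumed to be a dict with an "annotations" list whose
    items carry "IMPACT" and "SYMBOL" keys.
    """
    wanted = set(genes)
    return [
        variant for variant in variants
        if any(
            annotation.get("IMPACT") == "HIGH"
            and annotation.get("SYMBOL") in wanted
            for annotation in variant["annotations"]
        )
    ]


# Toy variants with placeholder identifiers, invented for illustration only.
toy_variants = [
    {"id": "v1", "annotations": [{"SYMBOL": "BRCA1", "IMPACT": "HIGH"}]},
    {"id": "v2", "annotations": [{"SYMBOL": "BRCA2", "IMPACT": "MODIFIER"}]},
    {"id": "v3", "annotations": [{"SYMBOL": "PSMF1", "IMPACT": "HIGH"}]},
    {"id": "v4", "annotations": [
        {"SYMBOL": "BRCA2", "IMPACT": "LOW"},
        {"SYMBOL": "BRCA2", "IMPACT": "HIGH"},
    ]},
]

if __name__ == "__main__":
    hits = high_impact_variants_in_genes(toy_variants)
    print([variant["id"] for variant in hits])
```

Both conditions are checked on the same annotation, so a variant that is HIGH impact in one gene and merely present in BRCA1 or BRCA2 is not returned.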

24 of 26

Example query using Annotations

Find all high-impact variants involved in double-strand break repair (GO:0006302):

25 of 26

Case Study: Color

[Diagram labels: Large number of VCF files, Variant Transforms, BigQuery, WSI DICOM Store, BQ Export, BigQuery, ML, Actionable Insights]

26 of 26

Questions?

Visit our GitHub page for roadmap & feature requests:

github.com/googlegenomics/gcp-variant-transforms