1 of 28

VRS AnVIL: Connecting VCF to Clinical Evidence

Brian Walsh

Quinn Wai Wong

2 of 28

Introduction

https://github.com/ga4gh/va-spec

https://github.com/ga4gh/vrs

https://github.com/cancervariants/metakb

3 of 28

Introduction

[Architecture diagram; labels recovered from the figure]

  • Subject VCF/VUS
  • Genotype and Phenotype Search tools
  • VRS AnVIL Toolkit
  • Precomputed indices
  • Genomic Knowledge
  • Evidence Mapper
  • meta_kb
  • Knowledgebases: civicdb.org, moalmanac.org, clinicalgenome.org*
  • Cohort Building
  • Downstream Workflows
  • Edge labels: "is a", "normalizes", "hits"

4 of 28

Presentation Outline

Existing tools to annotate VCFs with evidence

VRS AnVIL Toolkit as a solution

Proof of concept with 1000 Genomes

Usage, discussion, & future work

5 of 28

Existing Tools: Annotating VCFs with Evidence

6 of 28

The GA4GH Variant Representation Specification (VRS) has helped standardize the exchange of variant data (vrs-python)

  • Schema to represent alleles, sequence references, sequence expressions, etc.
  • Consistent, unique global identifiers
  • Minimal dependencies (seqrepo)

7 of 28

VRS enables fixed-length ID creation from a given genomic expression by means of a VRS object (vrs-python)

VRS object: [Location] + [State]

VRS ID: [Prefix] ga4gh:VA. + [Digest] rRPCnh0XXjuePRGWerw6PhVXFYjhchwP
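
The prefix-plus-digest structure above can be sketched in a few lines. GA4GH computed identifiers use a truncated digest (sha512t24u): the first 24 bytes of a SHA-512 hash, base64url-encoded, which always gives 32 characters. This is a minimal sketch, not the vrs-python implementation: `make_identifier` is a hypothetical helper, and the input bytes are a placeholder rather than a real VRS serialization of an allele.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def make_identifier(serialized_object: bytes, type_prefix: str = "VA") -> str:
    """Hypothetical helper: join the namespace, type prefix, and digest."""
    return f"ga4gh:{type_prefix}.{sha512t24u(serialized_object)}"


# Placeholder bytes, NOT a real VRS serialization of an allele.
example_id = make_identifier(b'{"placeholder": "serialized allele"}')
prefix, digest = example_id.split(".", 1)
print(prefix, len(digest))
```

Because 24 bytes always encode to 32 base64 characters with no padding, every ID has the same length regardless of how long the underlying allele is.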

8 of 28

The VICC MetaKB is a harmonized data warehouse for clinical variant interpretations (MetaKB)

  • Consolidate variant data over minimally overlapping knowledgebases
  • Publicly accessible API
  • Uses and supports VRS IDs!

9 of 28

Solution: VRS AnVIL Toolkit

10 of 28

vrs_anvil_toolkit provides a CLI to retrieve clinical data from VCF files in a configurable fashion

  • Define configurations
  • Run CLI

11 of 28

vrs_anvil_toolkit combines variant translation and clinical interpretations into a single data collection workflow

  1. Organize runtime settings in config file
    1. Configs: performance, strictness
    2. Directories and file paths
  2. Translate variants to VRS IDs
  3. Identify matches in the MetaKB cache
  4. Write to logs and metrics file
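
Step 1 groups performance, strictness, and path settings in one place. The sketch below shows one way such a configuration could be modeled. It is an assumption for illustration only: the field names (`num_workers`, `strict`, `work_dir`, `vcf_files`) and the JSON format are hypothetical and are not taken from the toolkit.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical runtime settings; field names are illustrative only."""
    num_workers: int = 2            # performance
    strict: bool = False            # strictness: fail fast on invalid alleles
    work_dir: Path = Path("work")   # directory for logs and metrics
    vcf_files: list[Path] = field(default_factory=list)


def load_config(text: str) -> RunConfig:
    """Parse a JSON document into a RunConfig, applying defaults."""
    raw = json.loads(text)
    return RunConfig(
        num_workers=int(raw.get("num_workers", 2)),
        strict=bool(raw.get("strict", False)),
        work_dir=Path(raw.get("work_dir", "work")),
        vcf_files=[Path(p) for p in raw.get("vcf_files", [])],
    )


config = load_config(
    '{"num_workers": 4, "strict": true, "vcf_files": ["chr1.vcf.gz"]}'
)
print(config)
```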

12 of 28

vrs_anvil_toolkit combines variant translation and clinical interpretations into a single data collection workflow

  • Organize runtime settings in config file
  • Translate variants to VRS IDs
    • Parse set of VCF files
    • Multiple processes through CLI
  • Identify matches in the MetaKB cache
  • Write to logs and metrics file

nohup vrs_bulk annotate --scatter &

vrs_bulk ps

vrs_bulk annotate
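
Before translation, each VCF record has to be turned into a variant expression. The sketch below reads VCF data lines and emits gnomAD-style `chrom-pos-ref-alt` strings. It is a simplified illustration, not the toolkit's parser: `vcf_to_expressions` is a hypothetical function, and the second data record is made-up sample data.

```python
from typing import Iterable, Iterator


def vcf_to_expressions(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'chrom-pos-ref-alt' string per alternate allele."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        for alt in alts.split(","):
            # Skip missing, spanning-deletion, and symbolic alleles.
            if alt in (".", "*") or alt.startswith("<"):
                continue
            yield f"{chrom}-{pos}-{ref}-{alt}"


sample_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr19\t43551574\t.\tT\tC\t.\tPASS\t.\n",
    "chr1\t100\t.\tA\tG,T\t.\tPASS\t.\n",  # made-up multi-allelic record
]
expressions = list(vcf_to_expressions(sample_vcf))
print(expressions)
```

Since each file is processed independently, the same function could be applied to one VCF per process, which is the idea behind running several annotate processes at once.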

13 of 28

vrs_anvil_toolkit combines variant translation and clinical interpretations into a single data collection workflow

  • Organize runtime settings in config file
  • Translate variants to allele VRS IDs
  • Identify matches in the MetaKB cache
  • Write to logs and metrics file
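
Because VRS IDs are fixed-length strings, matching against a cache reduces to a dictionary lookup. The sketch below assumes the cache is a mapping from VRS ID to a list of study IDs; that layout and the `find_matches` function are assumptions for illustration, not the MetaKB cache format. The IDs are the ones that appear in this deck.

```python
def find_matches(vrs_ids, metakb_cache):
    """Return the subset of the cache keyed by VRS IDs seen in the VCF."""
    return {v: metakb_cache[v] for v in vrs_ids if v in metakb_cache}


# Assumed cache layout: VRS ID -> study IDs.
metakb_cache = {
    "ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV": ["civic.eid:673"],
    "ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz": [
        "civic.eid:1995",
        "civic.eid:2895",
    ],
}

translated = [
    "ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV",
    "ga4gh:VA.rRPCnh0XXjuePRGWerw6PhVXFYjhchwP",  # not in the cache here
]
matches = find_matches(translated, metakb_cache)
print(matches)
```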

14 of 28

vrs_anvil_toolkit combines variant translation and clinical interpretations into a single data collection workflow

  • Organize runtime settings in config file
  • Translate variants to allele VRS IDs
  • Identify matches in the MetaKB cache
  • Write to logs and metrics file
    • Log: invalid alleles, errors, progress
    • Metrics: successes, timing, matches by file
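
The split between logs (invalid alleles, errors, progress) and metrics (successes, timing, matches by file) can be sketched as follows. All names here are hypothetical: `annotate_file` and `stub_translate` stand in for the real translation step, and the file name is made up.

```python
import logging
import time
from collections import Counter

logger = logging.getLogger("vrs_sketch")

# Stand-in for real translation; pairing taken from the example in this deck.
KNOWN = {"19-43551574-T-C": "ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV"}


def stub_translate(expression: str) -> str:
    """Hypothetical translator: raises ValueError for anything unknown."""
    if expression not in KNOWN:
        raise ValueError("no translation available in this sketch")
    return KNOWN[expression]


def annotate_file(name, expressions, translate, cache):
    """Translate each expression; log failures and count outcomes."""
    metrics = Counter(successes=0, errors=0, matches=0)
    start = time.perf_counter()
    for expression in expressions:
        try:
            vrs_id = translate(expression)
        except ValueError as err:
            logger.warning("invalid allele %s in %s: %s", expression, name, err)
            metrics["errors"] += 1
            continue
        metrics["successes"] += 1
        if vrs_id in cache:
            metrics["matches"] += 1
    result = dict(metrics)
    result["elapsed_s"] = time.perf_counter() - start
    return result


file_metrics = annotate_file(
    "chr19.vcf",  # made-up file name
    ["19-43551574-T-C", "19-0-N-N"],
    stub_translate,
    cache={"ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV": ["civic.eid:673"]},
)
print(file_metrics)
```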

15 of 28

Proof of Concept: 1000 Genomes

16 of 28

About the 1000 Genomes Dataset

  • AnVIL_1000G_PRIMED-data-model
  • High coverage sequencing of samples
  • 3202 samples (patients)
  • 23 VCFs organized by chromosome

17 of 28

Using public 1000 Genomes Project data on Terra, we can consolidate cohort-level stats and evidence.

  • Organizing runtime settings in config file
  • Identifying variant matches in the MetaKB cache using VRS IDs
  • Organizing metrics into analysis-ready data
    • Get sample-level variant data
    • Aggregate MetaKB evidence
  • Creating figures and evidence from data

18 of 28

Using 3202 samples from the 1000 Genomes Project on Terra, we can consolidate cohort-level stats and evidence

  • Organizing runtime settings in config file
  • Identifying variant matches in the MetaKB cache using VRS IDs
  • Organizing metrics into analysis-ready data
  • Gathering analyses from data
    • Percent of patients w/ variant match
    • Number of variants per patient
    • Examples of evidence
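
The cohort-level analyses listed above reduce to simple aggregation once each sample is mapped to its set of matched VRS IDs. The sketch below uses toy data with four samples and shortened placeholder IDs; `cohort_summary` and the input layout are assumptions for illustration, not the toolkit's output format.

```python
from collections import Counter
from statistics import mean


def cohort_summary(matches_by_sample):
    """Summarize matches; input maps every sample to a set of VRS IDs."""
    n_samples = len(matches_by_sample)
    if n_samples == 0:
        raise ValueError("cohort is empty")
    counts = [len(ids) for ids in matches_by_sample.values()]
    carriers = Counter(v for ids in matches_by_sample.values() for v in ids)
    top_id, top_carriers = carriers.most_common(1)[0] if carriers else (None, 0)
    return {
        "percent_with_match": 100 * sum(c > 0 for c in counts) / n_samples,
        "mean_variants_per_patient": mean(counts),
        "most_common_variant": top_id,
        "most_common_percent": 100 * top_carriers / n_samples,
    }


# Toy data with placeholder IDs; samples without matches map to an empty set.
toy_cohort = {
    "sample_1": {"VA.a", "VA.b"},
    "sample_2": {"VA.a"},
    "sample_3": set(),
    "sample_4": {"VA.a"},
}
summary = cohort_summary(toy_cohort)
print(summary)
```

Applied to the full cohort, the "most common variant" figure is the same kind of ratio: carriers of one VRS ID divided by the number of samples.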

19 of 28

Patients had successful variant ID matches to the CIViC knowledgebase but not to the Molecular Oncology Almanac.

20 of 28

The spread varies between the germline and somatic-labelled MetaKB variants.

21 of 28

Even after cohort-level aggregation, MetaKB evidence remains accessible in the processed results.

Most common variant: 2952/3202 = 92.2%

(19-43551574-T-C) (ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV) (civic.eid:673)

The XRCC1 R399Q variant was correlated with increased response to platinum-based neoadjuvant chemotherapy in patients with cervical cancer. Tumor samples from 36 patients with Stage IB or IIA bulky (greater than 4 cm in size) cervical carcinomas were used in this study.

22 of 28

Therapeutic Evidence for Top 2 VRS IDs

vrs_id | study_id | type | strength | direction | predicate | therapeutic | tumor_type
ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV | civic.eid:673 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsSensitivityTo | Carboplatin | Cervical Cancer
ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV | civic.eid:673 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsSensitivityTo | Cisplatin | Cervical Cancer
ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz | civic.eid:1995 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsResistanceTo | Erlotinib | Lung Non-small Cell Carcinoma
ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz | civic.eid:1995 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsResistanceTo | Gefitinib | Lung Non-small Cell Carcinoma
ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz | civic.eid:2895 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsResistanceTo | Cisplatin | Esophageal Cancer
ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz | civic.eid:2895 | VariantTherapeuticResponseStudy | clinical cohort evidence | supports | predictsResistanceTo | Fluorouracil | Esophageal Cancer
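
Each row in this table is one therapeutic, so a study that covers two drugs appears twice. A short sketch of collapsing the rows to one entry per (VRS ID, study) pair is shown below. The row values come from the table; the tuple layout and variable names are illustrative assumptions.

```python
from collections import defaultdict

SPP = "ga4gh:VA.SPP7r7F_Wb3XbNY8Fawk91yt1U03eIVV"
ZZI = "ga4gh:VA.ZZIGEC0okanDOaqbTEXEWuXNZTrz5qYz"

# (vrs_id, study_id, predicate, therapeutic, tumor_type)
rows = [
    (SPP, "civic.eid:673", "predictsSensitivityTo", "Carboplatin", "Cervical Cancer"),
    (SPP, "civic.eid:673", "predictsSensitivityTo", "Cisplatin", "Cervical Cancer"),
    (ZZI, "civic.eid:1995", "predictsResistanceTo", "Erlotinib", "Lung Non-small Cell Carcinoma"),
    (ZZI, "civic.eid:1995", "predictsResistanceTo", "Gefitinib", "Lung Non-small Cell Carcinoma"),
    (ZZI, "civic.eid:2895", "predictsResistanceTo", "Cisplatin", "Esophageal Cancer"),
    (ZZI, "civic.eid:2895", "predictsResistanceTo", "Fluorouracil", "Esophageal Cancer"),
]

# Collect the therapeutics reported by each study for each VRS ID.
therapeutics_by_study = defaultdict(list)
for vrs_id, study_id, _predicate, therapeutic, _tumor_type in rows:
    therapeutics_by_study[(vrs_id, study_id)].append(therapeutic)

for (vrs_id, study_id), drugs in therapeutics_by_study.items():
    print(vrs_id, study_id, ", ".join(drugs))
```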

23 of 28

Usage, Discussion, & Future Work

24 of 28

Usage

vrs-anvil (private)

pip install vrs_anvil_toolkit

25 of 28

Depending on your use case, it might be helpful to use vrs-python and MetaKB individually.

  • vrs-python: direct translation from the source, annotating VCF files with VRS IDs
  • MetaKB: existing VRS ID data, one-off queries
  • vrs_anvil_toolkit: performance (threading, caching, multiprocessing), Terra support, unified configuration, error handling

26 of 28

There seems to be a performance difference between Terra and other platforms

Throughput (vrs_bulk annotate)

  • Terra: 850 variants/s
  • Mac (M3): 4,600 variants/s

Pytest (test_gnomad)

  • Terra: 10.6s
  • Raw GCP: 3.0s
  • Mac (M3): 2.9s
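
To put the gap in relative terms, the figures above can be turned into ratios. This is plain arithmetic on the numbers from this slide; the dictionary and function names are illustrative.

```python
throughput_per_s = {"Terra": 850, "Mac (M3)": 4600}
pytest_seconds = {"Terra": 10.6, "Raw GCP": 3.0, "Mac (M3)": 2.9}


def ratio(numerator: float, denominator: float) -> float:
    """How many times larger the numerator is than the denominator."""
    return numerator / denominator


# Throughput: higher is better, so compare Mac against Terra.
throughput_gap = ratio(throughput_per_s["Mac (M3)"], throughput_per_s["Terra"])

# Test duration: lower is better, so compare Terra against the others.
terra_vs_gcp = ratio(pytest_seconds["Terra"], pytest_seconds["Raw GCP"])
terra_vs_mac = ratio(pytest_seconds["Terra"], pytest_seconds["Mac (M3)"])

print(f"Mac throughput is {throughput_gap:.1f}x Terra")
print(f"Terra test time is {terra_vs_gcp:.1f}x raw GCP, {terra_vs_mac:.1f}x Mac")
```

Raw GCP and the Mac finish the test in nearly the same time, which suggests the slowdown may be specific to the Terra environment rather than to cloud hardware in general.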

27 of 28

We want to continue to build out the toolkit’s functionality to support GREGoR and other datasets

28 of 28

Acknowledgements

Wagner Lab: Kori Kuzma

Ellrott Lab: Brian Walsh, Kyle Ellrott