1 of 33

Minimizing reference bias with an impute-first approach

Naga Sai Kavya Vaddadi, Taher Mun, Ben Langmead

Genome Informatics

December 8, 2023

2 of 33

DNA-sequencing read alignment

  • A fundamental step in genomic analysis.

  • Donor/sample reads are typically aligned to a “reference genome template”.

3 of 33

Reference bias

  • Traditional reference genome template: “linear”.

  • Tendency to miss or misalign donor reads.

  • A linear reference genome does not represent a population; it lacks diversity.

4 of 33

Reference genome template: Pangenomes


Does more variation always improve the donor read alignment accuracy?

[Figure: a pangenome and a complex pangenome graph. Image credit: GA4GH]

5 of 33

FORGe - Prioritizing variants for graph genomes

(Pritt, Jacob, Nae-Chyun Chen, and Ben Langmead. "FORGe: prioritizing variants for graph genomes." Genome Biology 19, no. 1 (2018): 1-16.)

6 of 33

FORGe - Prioritizing variants for graph genomes

(Pritt, Jacob, Nae-Chyun Chen, and Ben Langmead. "FORGe: prioritizing variants for graph genomes." Genome Biology 19, no. 1 (2018): 1-16.)

[Figure annotations]

  • Pangenome graph-based: as more and more variation is added to the graph (left to right), accuracy peaks and then decreases.

  • Personalized: ideal accuracy (exact set of donor alleles).

7 of 33

Personalized references – Past Methods

Past Methods            Year
MMSeq                   2011
AlleleSeq (RNA)         2011
Yuan et al. (RNA)       2012
RefEditor               2015
iMapSplice (RNA)        2018
Groza et al. (Graph)    2020
Gramtools               2021

Limitations of past methods:

  • Limited to monoploid studies.

  • Heavy reliance on external array-based inputs.
    • May not cover all relevant markers for the donor/sample.

and/or

  • Expensive pre-processing.
    • Performs upfront variant calling.

GOAL: Develop a framework to build a personalized reference that is modular, efficient, and scales well to massive reference panels.

8 of 33

Idea

Use modern genotyping methods with efficient and scalable modern imputation methods.

Image source: Techscience

9 of 33

Impute-first Alignment Framework

Genotype Imputation + Pangenome Alignment

Personalization Phase

  • Inputs: low-coverage sample reads (FASTQ); reference panel (VCF)
  • Steps: modern genotyping, then modern imputation
  • Outputs: personalized alleles, as a personalized VCF (diploid & phased) or a personalized reference (diploid FASTA)

Downstream Analysis

  • Inputs: full set of donor reads (FASTQ); reference genome (FASTA); personalized alleles
  • Steps: pangenome alignment, then variant calling
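To make the "diploid FASTA from a phased VCF" idea concrete, here is a minimal sketch that applies phased genotypes to a reference string and emits one sequence per haplotype. The tuple layout, the function names `apply_haplotype` and `diploid_fasta`, and the rule of skipping overlapping variants are all assumptions made for illustration; the talk does not say how the personalized reference is actually built.

```python
from typing import List, Tuple

# Hypothetical record layout: (0-based position, REF, ALT, phased genotype like "0|1")
Variant = Tuple[int, str, str, str]


def apply_haplotype(reference: str, variants: List[Variant], hap: int) -> str:
    """Substitute ALT alleles into `reference` wherever haplotype `hap` (0 or 1) carries them."""
    pieces = []
    cursor = 0  # first reference base not yet copied to the output
    for pos, ref, alt, genotype in sorted(variants):
        if genotype.split("|")[hap] != "1":
            continue  # this haplotype keeps the reference allele here
        if pos < cursor:
            continue  # simplification: skip variants overlapping an applied one
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at position {pos}")
        pieces.append(reference[cursor:pos])
        pieces.append(alt)
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def diploid_fasta(name: str, reference: str, variants: List[Variant]) -> str:
    """Return a two-record FASTA string, one record per haplotype."""
    hap1 = apply_haplotype(reference, variants, 0)
    hap2 = apply_haplotype(reference, variants, 1)
    return f">{name}_hap1\n{hap1}\n>{name}_hap2\n{hap2}\n"


if __name__ == "__main__":
    toy_reference = "ACGTACGT"
    toy_variants = [(1, "C", "T", "0|1"), (4, "AC", "A", "1|1")]
    print(diploid_fasta("toy", toy_reference, toy_variants), end="")
```

On real data this step is usually done by an existing tool (for example, `bcftools consensus` can apply one haplotype of a phased VCF to a FASTA); whether the authors used such a tool is not stated on the slide.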

10 of 33

Methods & Data – Personalization phase

  • Reads: Subsampled (<1x, 1x, 2x, 5x) HG001 real reads (Illumina paired-end short reads).

  • Reference panel: HGSVC2 panel

  • Genotyping:
    • rowbowt
    • bayestyper
    • bowtie2 + bcftools

  • Imputation:
    • beagle5
    • glimpse1
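As a rough illustration of the read subsampling listed above, the sketch below computes the fraction of read pairs to keep for a target coverage and draws a reproducible random subsample. It assumes the low-coverage sets are drawn from a deeper read set of known coverage (30x is used only as an example value); the function names and the in-memory list of read pairs are hypothetical, and the talk does not describe the actual subsampling tool.

```python
import random
from typing import List, Sequence, Tuple

ReadPair = Tuple[str, str]  # hypothetical: (read 1 sequence, read 2 sequence)


def keep_fraction(target_coverage: float, full_coverage: float) -> float:
    """Fraction of read pairs to keep so expected coverage drops to the target."""
    if not 0 < target_coverage <= full_coverage:
        raise ValueError("target coverage must be in (0, full coverage]")
    return target_coverage / full_coverage


def subsample_pairs(pairs: Sequence[ReadPair], fraction: float, seed: int = 0) -> List[ReadPair]:
    """Keep each read pair independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return [pair for pair in pairs if rng.random() < fraction]


if __name__ == "__main__":
    for target in (0.5, 1, 2, 5):
        print(f"{target}x from 30x: keep {keep_fraction(target, 30):.4f} of pairs")
```

Keeping mates together (sampling pairs rather than single reads) matters for paired-end data, which is why the sketch operates on pairs.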

11 of 33

Methods & Data – Downstream analysis

  • Inputs: full set of donor reads (FASTQ); reference genome (FASTA)

  • Steps: pangenome alignment, then variant calling

12 of 33

Methods & Data – Downstream analysis

  • Pangenome graph indexing & mapping: VG Giraffe

  • Variant calling: GATK HaplotypeCaller

  • Full set of donor reads: HG001 real novaseq.pcrfree.30x

  • Reference genome: GRCh38 FASTA

  • VCF: Personalized VCFs (output of the personalization phase)
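The downstream steps listed above can be sketched as three command lines, built here as Python argument lists so they can be inspected before running. This is a sketch under assumptions: the flags follow the public `vg autoindex`, `vg giraffe`, and GATK `HaplotypeCaller` command-line documentation, the file names are placeholders, and the tool versions and exact options used by the authors are not given in the slides.

```python
import shlex
from typing import List


def autoindex_cmd(ref_fasta: str, vcf: str, prefix: str) -> List[str]:
    """Build Giraffe indexes from a linear reference (FASTA) plus a VCF."""
    return ["vg", "autoindex", "--workflow", "giraffe",
            "-r", ref_fasta, "-v", vcf, "-p", prefix]


def giraffe_cmd(prefix: str, fastq_1: str, fastq_2: str, threads: int = 8) -> List[str]:
    """Map paired-end reads to the graph; BAM output is written to stdout."""
    return ["vg", "giraffe", "-Z", f"{prefix}.giraffe.gbz",
            "-f", fastq_1, "-f", fastq_2, "-o", "BAM", "-t", str(threads)]


def haplotypecaller_cmd(ref_fasta: str, bam: str, out_vcf: str) -> List[str]:
    """Call variants; the BAM is expected to be coordinate-sorted and indexed."""
    return ["gatk", "HaplotypeCaller", "-R", ref_fasta, "-I", bam, "-O", out_vcf]


if __name__ == "__main__":
    # Placeholder file names, not the authors' paths.
    steps = [
        autoindex_cmd("GRCh38.fa", "personalized.vcf.gz", "hg001"),
        giraffe_cmd("hg001", "reads_1.fq.gz", "reads_2.fq.gz"),
        haplotypecaller_cmd("GRCh38.fa", "hg001.sorted.bam", "hg001.calls.vcf.gz"),
    ]
    for step in steps:
        print(shlex.join(step))
```

Between mapping and calling, the BAM typically needs sorting, indexing, and read-group tags; those housekeeping steps are omitted here.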


13 of 33

Evaluations: Metrics – Personalization phase

Personalization Phase

  • Reads: Subsampled (<1x, 1x, 2x, 5x) HG001 real reads.
  • Reference panel: HGSVC2 panel
  • Genotyping: rowbowt, bayestyper, bowtie2 + bcftools
  • Imputation: beagle5, glimpse1

Metrics

  • Call Accuracy
  • Window Accuracy
  • Computational Overhead

The best HG001 personalized alleles (VCF) are chosen for downstream analysis.

Chosen impute-first personalized HG001 VCFs: {rgc1, rgc5, bbgc5, bbbc5}

  • rgc1, rgc5: rowbowt + glimpse
  • bbgc5, bbbc5: bowtie2 + bcftools + glimpse/beagle

14 of 33

Evaluations: Metrics – Downstream Analysis

VCF inputs:

  • Personalized VCF:
    • Impute-first personalized HG001 calls: {rgc1, rgc5, bbgc5, bbbc5}

  • Pangenome VCF:
    • 1000 Genomes Phase 3 calls (2500 samples, 26 populations)

  • No VCF (VG Giraffe linear)

Downstream Analysis

Metrics

  • Allele balance at HET sites
  • Variant Calling Accuracy
  • Computational Overhead

Pipeline: VG Giraffe (indexing & mapping), then GATK HaplotypeCaller (variant calling).

Inputs: real HG001 novaseq.pcrfree.30x FASTQ; GRCh38 FASTA; VCF.

15 of 33

Evaluations: Allele Balance at HET sites

HET sites carry one reference and one non-reference allele; ideally, each allele accounts for 50% of the read evidence.

Lin, Mao-Jan, Sheila Iyer, Nae-Chyun Chen, and Ben Langmead. "Measuring, visualizing and diagnosing reference bias with biastools." bioRxiv (2023): 2023-09.

biastools: Poster #112

Alignments using impute-first personalized references achieved a more even allele balance than pangenome and linear references.

[Figure panels: Deletions (-), Insertions (+), SNPs]
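The allele-balance idea on this slide can be sketched as a small calculation: at each HET site, take the share of reads supporting the reference allele and compare it with the ideal 0.5. Defining balance as ref / (ref + alt) is an assumption here, as are the function names and the toy counts; biastools computes its own, more detailed measures.

```python
from typing import Iterable, List, Optional, Tuple


def allele_balance(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Share of reads supporting the reference allele, or None with no coverage."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return ref_reads / total


def mean_reference_excess(sites: Iterable[Tuple[int, int]]) -> float:
    """Average of (balance - 0.5) over covered HET sites; > 0 suggests bias toward the reference."""
    deviations: List[float] = []
    for ref_reads, alt_reads in sites:
        balance = allele_balance(ref_reads, alt_reads)
        if balance is not None:
            deviations.append(balance - 0.5)
    if not deviations:
        raise ValueError("no covered HET sites")
    return sum(deviations) / len(deviations)


if __name__ == "__main__":
    # Toy (ref, alt) read counts at three HET sites; not data from the talk.
    toy_sites = [(15, 15), (18, 12), (21, 9)]
    print(f"mean reference excess: {mean_reference_excess(toy_sites):+.3f}")
```

A value near zero corresponds to the "even balance" described above; a positive value means reads carrying the reference allele are over-represented.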

16 of 33

Evaluations: Variant Calling Accuracy

Baseline accuracy: BWA-MEM + GATK HaplotypeCaller

Truth: GIAB HG001 v4.2 VCF + high-confidence regions.

Reference: GRCh38 FASTA

Evaluation: RTG Tools vcfeval module.

17 of 33

Evaluations: Variant Calling Accuracy

Baseline accuracy (linear reference): BWA-MEM + GATK HaplotypeCaller

Truth: GIAB HG001 v4.2 VCF + high-confidence regions.

Reference: GRCh38 FASTA

Evaluation: RTG Tools vcfeval module.

18 of 33

Evaluations: Variant Calling Accuracy

Variant calls: BWA or VG alignments, followed by GATK HaplotypeCaller

Truth: GIAB HG001 v4.2 VCF + high-confidence regions

Reference: GRCh38

Evaluation: RTG Tools vcfeval module.

[Figure legend: linear reference; pangenome graph]

19 of 33

Evaluations: Variant Calling Accuracy

Variant calls: BWA or VG alignments, followed by GATK HaplotypeCaller

Truth: GIAB HG001 v4.2 VCF + high-confidence regions

Reference: GRCh38

Evaluation: RTG Tools vcfeval module.

[Figure legend: linear reference; pangenome graph; impute-first personalized]

20 of 33

Evaluations: Variant Calling Accuracy

Variant calls: BWA or VG alignments, followed by GATK HaplotypeCaller

Truth: GIAB HG001 v4.2 VCF + high-confidence regions

Reference: GRCh38

Evaluation: RTG Tools vcfeval module.

[Figure legend: linear reference; pangenome graph; impute-first personalized]

Impute-first personalized workflows consistently achieved greater precision and recall than pangenome and linear workflows.

Results are consistent across GRCh38 CMRG regions, GIAB stratifications, and allele frequency ranges.
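For readers less familiar with the accuracy measures named above, this short sketch shows how precision, recall, and F1 follow from true-positive, false-positive, and false-negative variant counts. The counts in the example are invented; in the evaluation described here these figures are reported by the vcfeval module itself, which also handles details such as variant representation that this sketch ignores.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from variant counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of calls that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of truth variants recovered
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


if __name__ == "__main__":
    # Toy counts for illustration only.
    p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```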

21 of 33

Evaluations: Computational Overhead

Personalization phase (rowbowt + glimpse1 on 1x reads): < 1 hour & 15 GB memory.

Impute-first pipelines (rgc1, rgc5, bbgc5, bbbc5) produce compact graphs and are computationally more efficient than a typical pangenome graph, with overhead on the order of a linear reference.

[Figure: VG graph sizes for the impute-first VCFs (rgc1, rgc5, bbgc5, bbbc5) and the pangenome VCF (1000G phase 3)]

22 of 33

Conclusions & Future work

A practical and modular framework for creating personalized references, even from low-coverage reads.

Impute-first: linear-genome-like computational overhead and better alignment accuracy than typical pangenomes.

Future directions, to investigate:

    • the role of reference panel size and variant inclusiveness in downstream accuracy.
    • applicability to other assays such as RNA-seq and exome sequencing.

23 of 33

Acknowledgements

Thank you all 😀

Special thanks to colleagues in the Langmead Lab (https://langmead-lab.org/) and to NIH grant R01HG011392 for the generous support.

24 of 33

Variant Calling results

GIAB HG001 v4.2 VCF benchmark in GIAB high-confidence regions (TP, FP, FN variant counts).

25 of 33

Variant Calling results

GIAB HG001 v4.2 VCF benchmark in GIAB high-confidence regions (stratified by variant type).

26 of 33

Variant Calling results

GIAB HG001 v4.2 VCF benchmark in GIAB high-confidence regions (stratified by allele frequency).

27 of 33


Variant Calling results

GIAB GRCh38 Complex Medically Relevant Gene Regions

28 of 33

Variant Calling results

GIAB GRCh38 stratifications (MHC, alldifficultregions, allOtherDifficultregions, alllowmapandsegdupregions)

29 of 33


Call accuracy: F1 scores (HGSVC2 HG001 truth VCF) at ALT & HET sites

30 of 33

Window accuracy: impute-first genomes equivalent to the truth genome in 200 bp windows.
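One way to read the window-accuracy metric is as the share of fixed-size windows in which the imputed genome carries exactly the same variants as the truth genome. The sketch below implements that reading; the definition, the function name `window_accuracy`, and the variant tuple layout are assumptions, since the slide does not spell out the computation, and haplotype phase is ignored for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[int, str, str]  # hypothetical: (0-based position, REF, ALT)


def _by_window(variants: Iterable[Variant], window: int) -> Dict[int, Set[Variant]]:
    """Group variants by the index of the window containing their start position."""
    grouped: Dict[int, Set[Variant]] = defaultdict(set)
    for variant in variants:
        grouped[variant[0] // window].add(variant)
    return grouped


def window_accuracy(truth: Iterable[Variant], imputed: Iterable[Variant],
                    genome_length: int, window: int = 200) -> float:
    """Fraction of windows whose imputed variant set equals the truth variant set."""
    if genome_length <= 0 or window <= 0:
        raise ValueError("genome_length and window must be positive")
    truth_windows = _by_window(truth, window)
    imputed_windows = _by_window(imputed, window)
    n_windows = (genome_length + window - 1) // window  # ceiling division
    matching = sum(
        1 for index in range(n_windows)
        if truth_windows.get(index, set()) == imputed_windows.get(index, set())
    )
    return matching / n_windows


if __name__ == "__main__":
    # Toy variants on a 600 bp sequence; not data from the talk.
    toy_truth = [(10, "A", "G"), (250, "C", "T")]
    toy_imputed = [(10, "A", "G"), (450, "G", "GA")]
    print(f"window accuracy: {window_accuracy(toy_truth, toy_imputed, 600):.3f}")
```

Windows with no variants in either set count as matching, so the score reflects how much of the genome is reproduced exactly rather than how many individual calls agree.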

31 of 33


Computational Overhead – Personalization phase
