Minimizing reference bias with an impute-first approach
Naga Sai Kavya Vaddadi, Taher Mun, Ben Langmead
Genome Informatics
December 8, 2023
DNA-sequencing read alignment
Reference bias
Linear reference genome: does not represent a population; lacks diversity.
Reference genome template: Pangenomes
Does more variation always improve donor read alignment accuracy?
Figures: pangenome; complex pangenome graph (image credits: GA4GH)
FORGe - Prioritizing variants for graph genomes
(Jacob Pritt, Nae-Chyun Chen, and Ben Langmead. "FORGe: prioritizing variants for graph genomes." Genome Biology 19, no. 1 (2018): 1-16.)
Adding more & more variation to the graph (left to right): pangenome graph-based accuracy peaks & decreases beyond that point.
Personalized ideal accuracy (exact set of donor alleles).
Personalized references – Past Methods
Past Methods | Year |
MMSeq | 2011 |
AlleleSeq (RNA) | 2011 |
Yuan et al. (RNA) | 2012 |
RefEditor | 2015 |
iMapSplice (RNA) | 2018 |
Groza et al. (Graph) | 2020 |
Gramtools | 2021 |
and/or
GOAL: Develop a framework to build a personalized reference that is modular, efficient & scales well to massive reference panels.
Idea
Use modern genotyping methods with efficient & scalable modern imputation methods.
Image source: Techscience
Impute-first Alignment Framework
Genotype Imputation + Pangenome Alignment
Personalization Phase
Inputs: low-coverage sample reads (FASTQ); reference panel (VCF)
Steps: modern genotyping, then modern imputation
Personalized alleles: personalized reference (diploid FASTA); personalized VCF (diploid & phased)
Downstream Analysis
Inputs: full set of donor reads (FASTQ); reference genome (FASTA)
Steps: pangenome alignment, then variant calling
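As a toy illustration of how a phased, diploid VCF relates to a diploid FASTA, the sketch below applies phased SNVs to a reference string to produce two haplotype sequences. It is not the talk's implementation: the `PhasedSnv` record layout, the function names, and the SNV-only restriction are assumptions made for illustration, and real workflows must also handle indels and multi-allelic sites.

```python
from typing import List, Tuple

# Hypothetical toy record (an assumption for this sketch):
# (0-based position, REF allele, ALT allele, phased genotype such as "0|1")
PhasedSnv = Tuple[int, str, str, str]


def personalize_diploid(reference: str, variants: List[PhasedSnv]) -> Tuple[str, str]:
    """Apply phased SNVs to a reference string and return (haplotype 1, haplotype 2)."""
    hap1, hap2 = list(reference), list(reference)
    for pos, ref, alt, genotype in variants:
        if reference[pos] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        left, right = genotype.split("|")
        if left == "1":   # ALT allele on the first haplotype
            hap1[pos] = alt
        if right == "1":  # ALT allele on the second haplotype
            hap2[pos] = alt
    return "".join(hap1), "".join(hap2)


def to_fasta(name: str, haplotypes: Tuple[str, str]) -> str:
    """Format the two haplotypes as a two-record FASTA string."""
    return "".join(
        f">{name}_hap{i}\n{seq}\n" for i, seq in enumerate(haplotypes, start=1)
    )


toy_reference = "ACGTACGT"
toy_variants = [(1, "C", "T", "0|1"), (4, "A", "G", "1|1"), (6, "G", "A", "1|0")]
hap1, hap2 = personalize_diploid(toy_reference, toy_variants)
print(to_fasta("toy", (hap1, hap2)))
```

Keeping the two haplotypes separate is what preserves the phasing: a heterozygous site changes only one of the two sequences.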
Methods & Data – Personalization phase
Inputs: low-coverage sample reads (FASTQ; Illumina paired-end short reads); reference panel (VCF)
Steps: modern genotyping, then modern imputation
Methods & Data – Downstream analysis
Inputs: full set of donor reads (FASTQ); reference genome (FASTA)
Pangenome alignment: VG Giraffe (indexing & mapping)
Variant calling: GATK HaplotypeCaller
Output: VCF
Evaluations: Metrics – Personalization phase
Best HG001 personalized alleles (VCF) are chosen for downstream analysis.
Chosen: impute-first personalized HG001 VCFs: {rgc1, rgc5, bbgc5, bbbc5}
Workflows: rowbowt + glimpse; bowtie2 + bcftools + glimpse/beagle
Evaluations: Metrics – Downstream Analysis
Inputs: real HG001 novaseq.pcrfree.30x FASTQ; GRCh38 FASTA
Pangenome alignment: VG Giraffe (indexing & mapping)
Variant calling: GATK HaplotypeCaller
Output: VCF
Evaluations: Allele Balance at HET sites
HET sites: 1 reference allele, 1 non-reference allele
(ideal balance: 50% of the read evidence for each allele)
Lin, Mao-Jan, Sheila Iyer, Nae-Chyun Chen, and Ben Langmead. "Measuring, visualizing and diagnosing reference bias with biastools." bioRxiv (2023): 2023-09.
biastools: Poster #112
Alignments using impute-first personalized references achieved more even allele balance than alignments to pangenome & linear references.
Figure panels: deletions (-), insertions (+), SNPs
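The allele balance idea on this slide can be sketched in a few lines. The sketch assumes balance is measured as the fraction of reads at a HET site that carry the reference allele, so 0.5 is ideal and values above 0.5 indicate a skew toward the reference. The function names and read counts are made up for illustration and do not reproduce how biastools computes its measures.

```python
from statistics import mean
from typing import Iterable, Tuple


def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a HET site supporting the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def mean_balance(sites: Iterable[Tuple[int, int]]) -> float:
    """Average allele balance over (ref_reads, alt_reads) pairs, one per HET site."""
    return mean(allele_balance(ref, alt) for ref, alt in sites)


# Made-up read counts at three HET sites: (reference reads, non-reference reads)
toy_sites = [(15, 15), (18, 12), (21, 9)]
for ref, alt in toy_sites:
    print(ref, alt, round(allele_balance(ref, alt), 3))
print("mean balance:", round(mean_balance(toy_sites), 3))
```

A mean well above 0.5 over many HET sites is the kind of signal that suggests reads carrying the non-reference allele are being lost or misaligned.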
Evaluations: Variant Calling Accuracy
Baseline accuracy (linear reference): BWA-MEM + GATK HaplotypeCaller
Truth: GIAB HG001 v4.2 VCF + high-confidence regions
Reference: GRCh38 FASTA
Evaluation: RTG Tools vcfeval module
Variant calls: BWA or VG alignments + GATK HaplotypeCaller
Workflows compared: linear reference, pangenome graph, impute-first personalized
Impute-first personalized workflows consistently achieved greater precision & recall than pangenome & linear workflows.
Results are consistent across GRCh38 CMRG regions, GIAB stratifications & allele frequency ranges.
Evaluations: Computational Overhead
Personalization phase: rowbowt + glimpse1 on 1x reads: < 1 hour & 15 GB of memory
Impute-first pipelines (rgc1, rgc5, bbgc5, bbbc5): more compact graphs & lower computational cost (on the order of a linear reference) than a typical pangenome graph.
Figure: VG graph sizes, impute-first VCF (rgc1, rgc5, bbgc5, bbbc5) vs. pangenome VCF (1000G phase3)
Conclusions & Future work
Practical & modular framework for creating personalized references, even from low-coverage data.
Linear-genome-like computational overhead & better alignment accuracy than typical pangenomes.
Future directions – To Investigate:
Impute-first
Acknowledgements
Thank you all 😀
Special thanks to Langmead Lab (https://langmead-lab.org/) colleagues & NIH Grant R01HG011392 for the generous support.
Variant Calling results
GIAB HG001 v4.2 VCF benchmark at GIAB high-confidence regions (TP, FP, FN variant count stats)
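For readers who want to relate TP, FP, and FN counts to the precision, recall, and F1 figures quoted in the talk, here is a minimal sketch using the standard definitions. The counts are invented toy numbers, not results from the talk. The sketch also uses a single TP count, whereas evaluation tools may report true positives separately for the truth set and the call set.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from variant-level TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy counts for illustration only
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a workflow must do well on both false positives and false negatives to score highly.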
Variant Calling results
GIAB HG001 v4.2 VCF benchmark at GIAB high-confidence regions (stratified by variant type)
Variant Calling results
GIAB HG001 v4.2 VCF benchmark at GIAB high-confidence regions (stratified by allele frequency)
Variant Calling results
GIAB GRCh38 Complex Medically Relevant Gene Regions
Variant Calling results
GIAB GRCh38 stratifications (MHC, alldifficultregions, allOtherDifficultregions, alllowmapandsegdupregions)
Call accuracy: F1 scores (HGSVC2 HG001 truth VCF) at ALT & HET sites
Window accuracy: impute-first genomes equivalent to the truth genome in 200 bp windows.
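One plausible reading of the window accuracy metric is the fraction of fixed-size windows in which a personalized haplotype sequence is identical to the truth sequence. The sketch below implements that reading only. It assumes non-overlapping windows and two sequences of equal length, so it ignores the coordinate shifts that indels introduce, and it may differ from the definition used in the talk.

```python
def window_accuracy(candidate: str, truth: str, window: int = 200) -> float:
    """Fraction of non-overlapping windows where candidate and truth are identical.

    Assumes both sequences share one coordinate system (equal length, no indels).
    """
    if len(candidate) != len(truth):
        raise ValueError("sequences must have equal length in this toy sketch")
    if not truth:
        raise ValueError("sequences must be non-empty")
    starts = range(0, len(truth), window)
    matching = sum(
        candidate[s:s + window] == truth[s:s + window] for s in starts
    )
    return matching / len(starts)


# Toy sequences: a single mismatch in the second of three 200 bp windows
toy_truth = "A" * 600
toy_candidate = "A" * 250 + "C" + "A" * 349
print(round(window_accuracy(toy_candidate, toy_truth), 3))
```

Under this reading a single mismatching base makes the whole window count as non-equivalent, so the metric rewards getting every allele in a local neighborhood right.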
Computational Overhead – Personalization phase