1 of 16

Upfront Imputation Improves Read Alignment

Taher Mun

Joint Lab Meeting 11/22/2019

2 of 16

Reference bias

3 of 16

Reference bias affects mapping accuracy

30x 100bp reads simulated from chr21 of 5 1KG samples

Ref bias = (total # reads with REF allele) / (total # reads with ALT allele)
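
A minimal sketch of the ratio above, assuming per-site read counts have already been tallied at heterozygous sites. The function names and the counts are hypothetical and only illustrate the arithmetic; with unbiased alignment the ratio would be close to 1.0.

```python
def reference_bias(ref_reads: int, alt_reads: int) -> float:
    """Ratio of reads carrying the REF allele to reads carrying the ALT allele."""
    if alt_reads == 0:
        raise ValueError("no ALT-allele reads; the ratio is undefined")
    return ref_reads / alt_reads


def pooled_reference_bias(site_counts):
    """Pool (ref, alt) read counts over all sites, then take the ratio of totals."""
    total_ref = sum(ref for ref, _ in site_counts)
    total_alt = sum(alt for _, alt in site_counts)
    return reference_bias(total_ref, total_alt)


# Hypothetical (REF, ALT) read counts at three het sites.
counts = [(16, 14), (12, 12), (20, 10)]
print(pooled_reference_bias(counts))
```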

4 of 16

Reference bias affects alignments over het sites

50x paired-end reads from NA12878 provided by GIAB (Zook et al., 2016, doi: 10.1038/sdata.2016.25)

(chr21)

5 of 16

Reference bias affects low-coverage variant calling

Bobo et al., 2018 https://doi.org/10.1101/066043

6 of 16

How can we improve alignment in the face of reference bias?

7 of 16

Solution: change the reference

Major allele reference

Pros: contains most common alleles
Cons: represents just one haplotype, ignores rare variants
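
A minimal sketch of how a major allele reference could be built, assuming biallelic SNVs with known population ALT allele frequencies. The function name, the tuple layout, and the example data are hypothetical; indels are left out because they would shift coordinates.

```python
def major_allele_reference(ref_seq: str, snvs) -> str:
    """Swap in the ALT base wherever ALT is the more common allele.

    snvs: iterable of (pos0, ref_base, alt_base, alt_freq), 0-based positions.
    """
    seq = list(ref_seq)
    for pos, ref_base, alt_base, alt_freq in snvs:
        if seq[pos] != ref_base:
            raise ValueError(f"REF mismatch at position {pos}")
        if alt_freq > 0.5:  # ALT is the major allele at this site
            seq[pos] = alt_base
    return "".join(seq)


# Hypothetical toy reference and SNVs.
snvs = [(1, "C", "T", 0.80), (4, "A", "G", 0.20), (7, "T", "C", 0.51)]
print(major_allele_reference("ACGTACGT", snvs))
```

The result is still a single haploid sequence, which is the limitation listed under Cons.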

Graph genome

Pros: can contain any number of alleles, represents any possible haplotype
Cons: blow-up negatively affects accuracy and resources (Pritt et al., 2018 https://doi.org/10.1186/s13059-018-1595-x)

8 of 16

Personalized reference is the “best-case scenario”

30x 100bp reads simulated from chr21 of 5 1KG samples

Ref bias = (total # reads with REF allele) / (total # reads with ALT allele)

9 of 16

Solution 2: build a personalized reference

Build a genome personalized to the sample

Methods: assembly, genotyping, variant calling

Pros:

Genome will contain variation specific to individual
Can align to a diploid genome rather than a single haploid genome or a graph

Cons:

All of these methods are time-consuming and resource-heavy
Requires deep-coverage reads from the sample
Might require extra technology (microarrays, long reads)

10 of 16

Proposed Solution: Create a personalized reference with sparse data

Goals:

Generate a reference specific to the study sample
Minimal resource usage
Minimal time
Closest possible approximation to personal

11 of 16

Li-Stephens model helps impute unknown genotypes

Hidden Markov Model
Built from

Haplotype reference panel
Recombination map

Existing implementations available

Beagle (Browning et al., 2018)
IMPUTE5 (Rubinacci et al., 2019)

Use the Viterbi algorithm to impute values in an incomplete genotype

Transition prob = recombination rate

Emission prob = error rate

Li and Stephens, 2003
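
A minimal haploid sketch of Viterbi decoding under a Li-Stephens-style HMM, to illustrate the idea on this slide. It is not how Beagle or IMPUTE5 are implemented. Assumptions: alleles are coded 0/1, missing values are `None`, a single uniform switch probability `recomb` stands in for a recombination map, and all names and default values are hypothetical.

```python
import math


def li_stephens_viterbi(panel, obs, recomb=0.01, error=0.001):
    """Return (path, imputed): the most likely panel haplotype per site and
    the observed sequence with missing sites filled from that haplotype."""
    if not (0.0 < recomb < 1.0 and 0.0 < error < 1.0):
        raise ValueError("recomb and error must lie strictly between 0 and 1")
    n_hap, n_sites = len(panel), len(obs)
    log_stay = math.log((1.0 - recomb) + recomb / n_hap)  # copy same haplotype
    log_switch = math.log(recomb / n_hap)                 # jump to another one

    def log_emit(k, m):
        if obs[m] is None:                                # missing: no evidence
            return 0.0
        return math.log(1.0 - error) if panel[k][m] == obs[m] else math.log(error)

    score = [-math.log(n_hap) + log_emit(k, 0) for k in range(n_hap)]
    back = []
    for m in range(1, n_sites):
        # The best switch source is the same for every target state.
        best_prev = max(range(n_hap), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(n_hap):
            stay = score[k] + log_stay
            switch = score[best_prev] + log_switch
            prev, best = (k, stay) if stay >= switch else (best_prev, switch)
            new_score.append(best + log_emit(k, m))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    imputed = [panel[k][m] if obs[m] is None else obs[m] for m, k in enumerate(path)]
    return path, imputed


# Hypothetical two-haplotype panel and a sparse observation.
panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
path, imputed = li_stephens_viterbi(panel, [0, None, 0, 1, None, 1])
print(path, imputed)
```

Working in log space avoids underflow over long chromosomes, and reusing one best switch source per site keeps each step linear in the number of panel haplotypes.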

12 of 16

1000 Genomes Project

13 of 16

14 of 16

Imputation is the dominant step

15 of 16

Future work

Resource-efficient imputation
Speed up lift-over step
Evaluate on downstream results

Variant Calls
Allele specific expression
etc.

16 of 16

Acknowledgements

Langmead Lab

Dr. Ben Langmead

Dr. Brad Solomon

Christopher Wilks

Daniel Baker

Charlotte Darby

Nae-Chyun Chen

Rone Charles