1 of 16

Upfront Imputation Improves Read Alignment

Taher Mun

Joint Lab Meeting 11/22/2019

2 of 16

Reference bias


3 of 16

Reference bias affects mapping accuracy

30x 100bp reads simulated from chr21 of 5 1KG samples

Ref bias = (total # reads with REF allele) / (total # reads with ALT allele)
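
A minimal sketch of the ratio defined above, assuming the per-site counts of reads carrying the REF and ALT alleles have already been tallied at heterozygous sites. The function name `ref_bias`, the input layout, and the example counts are hypothetical and are not taken from the talk.

```python
from typing import Iterable, Tuple


def ref_bias(site_counts: Iterable[Tuple[int, int]]) -> float:
    """Return total REF-allele reads divided by total ALT-allele reads.

    site_counts holds one (ref_reads, alt_reads) pair per het site
    (assumed input layout).
    """
    total_ref = 0
    total_alt = 0
    for ref_reads, alt_reads in site_counts:
        total_ref += ref_reads
        total_alt += alt_reads
    if total_alt == 0:
        raise ValueError("no ALT-allele reads; the ratio is undefined")
    return total_ref / total_alt


# Hypothetical counts at three het sites: 45 REF reads vs 40 ALT reads.
counts = [(16, 14), (12, 13), (17, 13)]
print(ref_bias(counts))  # 1.125
```

Under this definition a value of 1 means REF and ALT reads are equally represented, and values above 1 mean reads carrying the REF allele are over-represented.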

4 of 16

Reference bias affects alignments over het sites

50x paired-end reads from NA12878 provided by GIAB (Zook et al., 2016, doi: 10.1038/sdata.2016.25)

(chr21)

5 of 16

Reference bias affects low-coverage variant calling

6 of 16

How can we improve alignment in the face of reference bias?

7 of 16

Solution: change the reference

  • Major allele reference

    • Pros: contains most common alleles
    • Cons: represents just one haplotype, ignores rare variants
  • Graph genome

    • Pros: can contain any number of alleles, represents any possible haplotype
    • Cons: blow-up from adding many variants hurts accuracy and increases resource usage (Pritt et al., 2018, https://doi.org/10.1186/s13059-018-1595-x)
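
To illustrate the major allele reference option, here is a minimal sketch that swaps in the ALT base wherever ALT is the more common allele. It assumes biallelic SNPs only, one per position, with 0-based coordinates; the `Snp` tuple layout, the function name, and the 0.5 frequency cutoff are assumptions made for illustration, not the pipeline used in the talk.

```python
from typing import List, Tuple

# (0-based position, REF base, ALT base, ALT allele frequency); hypothetical layout
Snp = Tuple[int, str, str, float]


def major_allele_reference(ref_seq: str, snps: List[Snp]) -> str:
    """Return ref_seq with the ALT base substituted where ALT frequency > 0.5."""
    seq = list(ref_seq)
    for pos, ref_base, alt_base, alt_freq in snps:
        # Guard against coordinate or allele mix-ups in the input.
        if seq[pos] != ref_base:
            raise ValueError(f"REF mismatch at position {pos}")
        if alt_freq > 0.5:
            seq[pos] = alt_base
    return "".join(seq)


ref = "ACGTACGT"
snps = [(1, "C", "T", 0.80), (4, "A", "G", 0.10), (6, "G", "A", 0.55)]
print(major_allele_reference(ref, snps))  # ATGTACAT
```

Because only substitutions are applied, coordinates are unchanged; handling indels would additionally require tracking coordinate shifts.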

8 of 16

A personalized reference is the “best-case scenario”

30x 100bp reads simulated from chr21 of 5 1KG samples

Ref bias = (total # reads with REF allele) / (total # reads with ALT allele)

9 of 16

Solution 2: build a personalized reference

  • Build a genome personalized towards sample
    • Methods: assembly, genotyping, variant calling
  • Pros:
    • Genome will contain variation specific to individual
    • Can align to a diploid genome rather than a single haploid genome or a graph
  • Cons:
    • All of these methods are time-consuming and resource-heavy
    • Requires deep-coverage reads from the sample
    • Might require extra technology (microarrays, long reads)

10 of 16

Proposed Solution: Create a personalized reference with sparse data

Goals:

  • Generate a reference specific to study sample
  • Minimal resource usage
  • Minimal time
  • Closest possible approximation to personal

11 of 16

The Li-Stephens model helps impute unknown genotypes

  • Hidden Markov Model
  • Built from
    • Haplotype reference panel
    • Recombination map
  • Existing implementations available
    • Beagle (Browning et al., 2018)
    • IMPUTE5 (Rubinacci et al., 2019)
  • Use the Viterbi algorithm to impute missing values in an incomplete genotype

Transition prob = recombination rate

Emission prob = error rate

Li and Stephens, 2003
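
The HMM described on this slide can be sketched as a small Viterbi decoder. This is a simplified haploid illustration, not how Beagle or IMPUTE5 are implemented: hidden states are the panel haplotypes being copied, the switch probability between adjacent sites stands in for the recombination rate and is spread uniformly over the panel, and a mismatch is emitted with probability `error_rate`. All names, the 0/1 allele encoding, and the parameter values are assumptions.

```python
import math
from typing import List, Optional, Sequence


def viterbi_impute(
    panel: Sequence[Sequence[int]],      # panel[k][m]: allele of haplotype k at site m
    observed: Sequence[Optional[int]],   # None marks a missing genotype
    recomb: Sequence[float],             # recomb[m-1]: switch prob between sites m-1 and m, in (0, 1]
    error_rate: float = 0.01,            # emission mismatch probability (assumed value)
) -> List[int]:
    n_haps, n_sites = len(panel), len(observed)

    def log_emit(k: int, m: int) -> float:
        if observed[m] is None:
            return 0.0  # a missing site carries no information
        match = panel[k][m] == observed[m]
        return math.log(1.0 - error_rate if match else error_rate)

    # Uniform prior over which panel haplotype is copied at the first site.
    score = [math.log(1.0 / n_haps) + log_emit(k, 0) for k in range(n_haps)]
    backpointers: List[List[int]] = []

    for m in range(1, n_sites):
        r = recomb[m - 1]
        log_stay = math.log((1.0 - r) + r / n_haps)
        log_switch = math.log(r / n_haps)
        new_score, pointers = [], []
        for k in range(n_haps):
            # Best predecessor j for state k (simple O(K^2) scan for clarity).
            best_j = max(
                range(n_haps),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            step = log_stay if best_j == k else log_switch
            new_score.append(score[best_j] + step + log_emit(k, m))
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)

    # Trace back the most likely sequence of copied haplotypes.
    state = max(range(n_haps), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Keep observed alleles; fill missing ones from the copied haplotype.
    return [
        panel[k][m] if observed[m] is None else observed[m]
        for m, k in enumerate(path)
    ]


# Hypothetical three-haplotype panel over five sites.
panel = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
]
observed = [1, None, 0, None, 1]
print(viterbi_impute(panel, observed, recomb=[0.01] * 4))  # [1, 1, 0, 0, 1]
```

In the example the observed alleles match the second panel haplotype at every typed site, so the most likely path copies it throughout and the two missing sites are filled from it.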

12 of 16

1000 Genomes Project

13 of 16

14 of 16

Imputation is the dominant step

15 of 16

Future work

  • Resource-efficient imputation
  • Speed up lift-over step
  • Evaluate on downstream results
    • Variant calls
    • Allele-specific expression
    • etc.

16 of 16

Acknowledgements

Langmead Lab

Dr. Ben Langmead

Dr. Brad Solomon

Christopher Wilks

Daniel Baker

Charlotte Darby

Nae-Chyun Chen

Rone Charles