1 of 51

Diploidy and polyploidy

2 of 51

Definitions

Haploid, diploid, etc

Haplome

Haplotype

Homologous chromosomes

3 of 51

Human heterozygosity

Heterozygosity is the fraction of polymorphic alleles

Based on 220,247 pre-selected SNPs

Herraez et al., 2009

4 of 51

Inbreeding

Inbreeding is widely used to reduce heterozygosity for follow-up sequencing:

Takifugu rubripes

Kuroyanagi et al., BMC Genomics, 2013

Drosophila melanogaster

Swindell and Bouzat, Genetics, 2006

Danio rerio

Monson and Sadler, Zebrafish, 2010

Mus musculus

Beck et al., Nature Genetics, 2010

5 of 51

Challenge with inbreeding

Significant parts of a genome can remain outbred

Malaria mosquito

“Outbred in localized islands comprising the 8% of genome” (Holt et al., Science, 2002)

Nematode

“Up to 30% heterozygosity persists after 20 generations of inbreeding” (Barriere et al., Genome Research, 2009)

Inbreeding depression

White tiger

“Breeding practices has been linked with abnormal, debilitating, and, at times, lethal conditions” (Association of Zoos & Aquariums, 2011)

6 of 51

Outbreeding

Panthera leo ♂ × Panthera tigris ♀

outbreeding depression

7 of 51

Assembly of diploid genomes

Genome assembly of a polymorphic organism: assembly of TWO DIFFERENT (albeit similar) puzzles at once

Humans have low polymorphism rates due to low effective population size. Most species sequenced so far are

inbred (mouse, rat, worm, fly, etc.) or
haploid (bacteria, yeast, etc.)

Genome assembly of an inbred organism: assembly of a SINGLE jigsaw puzzle

8 of 51

Low and high polymorphism rates

human Homo sapiens genome size 2 x 3.1 Gb

sea squirt Ciona savignyi genome size 2 x 0.2 Gb

9 of 51

Low and high polymorphism rates

human Homo sapiens genome size 2 x 3.1 Gb

polymorphism rate 0.1%

sea squirt Ciona savignyi genome size 2 x 0.2 Gb

10 of 51

Low and high polymorphism rates

human Homo sapiens: genome assembly is reduced to assembling a single haplome

sea squirt Ciona savignyi genome size 2 x 0.2 Gb

11 of 51

Low and high polymorphism rates

human Homo sapiens genome size 2 x 3.1 Gb

polymorphism rate 0.1%

sea squirt Ciona savignyi genome size 2 x 0.2 Gb

polymorphism rate 12%

Problem: polymorphisms are too complex to be ignored

12 of 51

Low and high polymorphism rates

human Homo sapiens genome size 2 x 3.1 Gb

polymorphism rate 0.1%

sea squirt Ciona savignyi

If the two homologous chromosomes differ by 0.5% to 15%, assembly becomes problematic because the “double genome” is highly repetitive (each black region is a repeat)

13 of 51

Results of diploid assembly

Consensus contigs

Haplocontigs

Conventional assemblers produce a very fragmented assembly of both haplomes

On consensus contigs, each polymorphic allele is chosen at random from one of the two haplomes. Allelic relations can then be reconstructed by haplotype assembly

14 of 51

Haplotype assembly

15 of 51

16 of 51

17 of 51

18 of 51

Approaches to disease study

Sequencing

A single gene or gene panel
Whole Genome Sequencing
Whole Exome Sequencing: sequencing coding genomic regions (~85% of Mendelian variants)

Mapping reads to the reference
Finding genomic variations
Associating genomic variations with diseases

19 of 51

20 of 51

21 of 51

22 of 51

Detecting genomic variations

Easy to detect using alignment

23 of 51

Detecting complex genomic variations

An integrated map of structural variation in 2,504 human genomes

24 of 51

1000 genomes

25 of 51

Haplotype assembly

Haplotypes:

ACTGTCTATC
ACGGTATACC

Genotype (heterozygous at positions 3, 6 and 9):

ACTGTCTATC
  G  A  C

Possible phases:

ACTGTCTATC / ACGGTATACC

ACTGTATACC / ACGGTCTATC

ACTGTCTACC / ACGGTATATC

ACTGTATATC / ACGGTCTACC
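The number of candidate phases grows as 2^(k-1) for k heterozygous sites. The following is a minimal sketch, not part of the original slides, that enumerates them for the two haplotypes shown above. The function name `enumerate_phasings` is hypothetical, and the sketch assumes both haplotypes have the same length and differ only by substitutions.

```python
from itertools import product

def enumerate_phasings(hap_a, hap_b):
    """Return every distinct unordered haplotype pair that yields the same
    genotype as (hap_a, hap_b)."""
    # Heterozygous sites are the positions where the two haplotypes differ.
    het = [i for i, (a, b) in enumerate(zip(hap_a, hap_b)) if a != b]
    phasings = []
    # The first heterozygous site stays fixed so that (h1, h2) and (h2, h1)
    # are not counted twice; every other site is either swapped or kept.
    for swaps in product([False, True], repeat=max(len(het) - 1, 0)):
        h1, h2 = list(hap_a), list(hap_b)
        for site, swap in zip(het[1:], swaps):
            if swap:
                h1[site], h2[site] = h2[site], h1[site]
        phasings.append(("".join(h1), "".join(h2)))
    return phasings

phasings = enumerate_phasings("ACTGTCTATC", "ACGGTATACC")
for h1, h2 in phasings:
    print(h1, "/", h2)
```

With three heterozygous sites this prints four pairs, which is why the genotype alone cannot identify the true haplotypes.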

26 of 51

Applications

27 of 51

Haplotype assembly: challenges

The distance between some polymorphisms is too large for reads to link them

Whole chromosomes cannot be phased
Works only with short polymorphisms

28 of 51

Haplotype assembly models

29 of 51

Haplotype assembly

Input:

reads covering diploid genome

consensus genome

Output:

resolved phases of SNPs

Definitions:

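F: an M x N matrix

rows correspond to fragments (reads)

columns correspond to SNP sites

f_ij in {0, 1, -}, where - means the fragment has no call at that SNP site

Consensus: ACTGTATACC
f_1:       ACTGTC
f_2:       ACGGTA
f_3:         TGTCTA
f_4:         GGTNTA
f_5:            ATACC
f_6:            CTATC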

	SNP_1	SNP_2	SNP_3
f_1	1	0	-
f_2	0	1	-
f_3	1	0	-
f_4	0	-	-
f_5	-	1	0
f_6	-	0	1
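As an illustration of how the reads map to the rows of F, here is a hedged sketch that rebuilds the table above. The read start offsets, the SNP coordinates and the allele-to-bit encoding are all assumptions inferred from the slide's example (they are chosen so the output matches the table), and `build_fragment_matrix` is a hypothetical helper name.

```python
# Assumed 0-based SNP coordinates on the consensus ACTGTATACC.
SNP_POSITIONS = [2, 5, 8]
# Assumed allele encoding per SNP site, inferred from the table on the slide.
ALLELE_CODE = [{"T": 1, "G": 0}, {"A": 1, "C": 0}, {"T": 1, "C": 0}]
# (assumed start offset on the consensus, read sequence) for f_1 .. f_6.
READS = [(0, "ACTGTC"), (0, "ACGGTA"), (2, "TGTCTA"),
         (2, "GGTNTA"), (5, "ATACC"), (5, "CTATC")]

def build_fragment_matrix(reads, snp_positions, allele_code):
    """Project each read onto the SNP sites; '-' marks a missing call."""
    matrix = []
    for start, seq in reads:
        row = []
        for pos, code in zip(snp_positions, allele_code):
            offset = pos - start
            # A site outside the read, or an uncalled base such as N, gives '-'.
            base = seq[offset] if 0 <= offset < len(seq) else None
            row.append(code.get(base, "-"))
        matrix.append(row)
    return matrix

F = build_fragment_matrix(READS, SNP_POSITIONS, ALLELE_CODE)
for name, row in zip(["f_1", "f_2", "f_3", "f_4", "f_5", "f_6"], row_list := F):
    print(name, *row, sep="\t")
```

Note that f_4 gets a `-` at SNP_2 because the read carries an N there, not because the read fails to span the site.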

30 of 51

Definitions

Conflict fragments:

Distance between fragments:

	SNP_1	SNP_2	SNP_3
f_1	1	0	-
f_2	0	1	-
f_3	1	0	-
f_4	0	-	-
f_5	-	1	0
f_6	-	0	1

Consensus: ACTGTATACC
f_1:       ACTGTC
f_2:       ACGGTA
f_3:         TGTCTA
f_4:         GGTNTA
f_5:            ATACC
f_6:            CTATC

dist = 2

dist = 1
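The formulas for these two definitions did not survive on this slide. A commonly used convention, assumed here rather than taken from the slide, is that the distance between two fragments is the number of SNP sites where both have a call and the calls differ, and that two fragments conflict when this distance is positive. A minimal sketch on the matrix F from the slides:

```python
# Fragment matrix from the slides; '-' marks a missing call.
F = [
    [1, 0, "-"],    # f_1
    [0, 1, "-"],    # f_2
    [1, 0, "-"],    # f_3
    [0, "-", "-"],  # f_4
    ["-", 1, 0],    # f_5
    ["-", 0, 1],    # f_6
]

def distance(f, g):
    """Assumed definition: count sites where both fragments are called and differ."""
    return sum(1 for a, b in zip(f, g) if a != "-" and b != "-" and a != b)

def in_conflict(f, g):
    """Assumed definition: fragments conflict if they disagree at any shared site."""
    return distance(f, g) > 0

d_12 = distance(F[0], F[1])  # f_1 vs f_2 disagree at SNP_1 and SNP_2
d_14 = distance(F[0], F[3])  # f_1 vs f_4 disagree at SNP_1 only
d_13 = distance(F[0], F[2])  # f_1 vs f_3 agree everywhere they overlap
print(d_12, d_14, d_13)      # prints: 2 1 0
```

Under this convention the values 2 and 1 are consistent with the "dist" annotations on the slide, although which fragment pairs the slide annotates is not recoverable from the text.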

31 of 51

Definitions

Problem statement:

Haplotype1: {f_1, f_3, f_6}

Haplotype2: {f_2, f_4, f_5}

Consensus: ACTGTATACC
f_1:       ACTGTC
f_2:       ACGGTA
f_3:         TGTCTA
f_4:         GGTNTA
f_5:            ATACC
f_6:            CTATC

	SNP_1	SNP_2	SNP_3
f_1	1	0	-
f_2	0	1	-
f_3	1	0	-
f_4	0	-	-
f_5	-	1	0
f_6	-	0	1

32 of 51

Fragment conflict graph

G_F = (V, E)

|V| = m (number of fragments)

Edges:

Edge weights: w(v_1, v_2) = d(f_1, f_2)
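The edge rule itself is missing from the slide text. The sketch below assumes the usual rule that two fragments are joined by an edge exactly when they are in conflict, with the distance as the edge weight; `conflict_graph` is a hypothetical helper name, and vertices are numbered 1 to 6 to match the fragment labels.

```python
from itertools import combinations

# Fragment matrix from the slides; '-' marks a missing call.
F = [
    [1, 0, "-"], [0, 1, "-"], [1, 0, "-"],
    [0, "-", "-"], ["-", 1, 0], ["-", 0, 1],
]

def distance(f, g):
    """Count sites where both fragments are called and the calls differ."""
    return sum(1 for a, b in zip(f, g) if a != "-" and b != "-" and a != b)

def conflict_graph(matrix):
    """Return {(i, j): weight} for every conflicting fragment pair (1-based)."""
    edges = {}
    for i, j in combinations(range(len(matrix)), 2):
        d = distance(matrix[i], matrix[j])
        if d > 0:  # assumed rule: an edge exists only between conflicting fragments
            edges[(i + 1, j + 1)] = d
    return edges

edges = conflict_graph(F)
for (u, v), w in sorted(edges.items()):
    print(f"f_{u} -- f_{v}  weight {w}")
```

For the example matrix this yields eight edges, and every one of them joins a fragment in {f_1, f_3, f_6} to a fragment in {f_2, f_4, f_5}.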

33 of 51

Fragment conflict graph

G_F = (V, E)

|V| = m (number of fragments)

Edges:

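Consensus: ACTGTATACC
f_1:       ACTGTC
f_2:       ACGGTA
f_3:         TGTCTA
f_4:         GGTNTA
f_5:            ATACC
f_6:            CTATC

[Figure: fragment conflict graph on vertices 1 to 6, drawn with groups {1, 3, 6} and {2, 4, 5}]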

34 of 51

Fragment conflict graph is bipartite

G_F = (V, E)

|V| = m (number of fragments)

Edges:

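Consensus: ACTGTATACC
f_1:       ACTGTC
f_2:       ACGGTA
f_3:         TGTCTA
f_4:         GGTNTA
f_5:            ATACC
f_6:            CTATC

[Figure: fragment conflict graph on vertices 1 to 6, drawn with groups {1, 3, 6} and {2, 4, 5}]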
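For error-free data the two sides of the bipartition are the two haplotypes. As a hedged illustration, bipartiteness can be checked with a breadth-first 2-coloring. The sketch below assumes that an edge joins two fragments whenever they disagree at a shared SNP site; `two_color` is a hypothetical helper, not something named on the slides.

```python
from collections import deque
from itertools import combinations

# Fragment matrix from the slides; '-' marks a missing call.
F = [
    [1, 0, "-"], [0, 1, "-"], [1, 0, "-"],
    [0, "-", "-"], ["-", 1, 0], ["-", 0, 1],
]

def distance(f, g):
    return sum(1 for a, b in zip(f, g) if a != "-" and b != "-" and a != b)

# Assumed edge rule: fragments i and j are joined when they are in conflict.
edges = [(i + 1, j + 1) for i, j in combinations(range(len(F)), 2)
         if distance(F[i], F[j]) > 0]

def two_color(num_vertices, edge_list):
    """Return {vertex: 0 or 1} if the graph is bipartite, otherwise None."""
    adj = {v: [] for v in range(1, num_vertices + 1)}
    for u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # an odd cycle: the graph is not bipartite
    return color

coloring = two_color(len(F), edges)
part_a = sorted(v for v, c in coloring.items() if c == 0)
part_b = sorted(v for v, c in coloring.items() if c == 1)
print(part_a, part_b)  # prints: [1, 3, 6] [2, 4, 5]
```

When the data contain errors the graph typically acquires odd cycles and `two_color` returns None, which is the situation the problem formulations on the following slides address.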

35 of 51

Problem formulation: MER & UMER

Conflicts are caused by:

sequencing errors
false alignment to paralogs
erroneous fragments introduced into the data set

G_F contains a bipartite subgraph G_F’

Equivalent to minimum graph cut

36 of 51

Problem formulation: MFR & MSR

Equivalent to finding a maximum independent set of the graph

An independent set is a set of vertices in a graph, no two of which are adjacent

37 of 51

Problem formulation: MEC & LHR

MEC tries to break odd-length cycles by flipping a minimal number of alleles

LHR is pretty strange and almost abandoned
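To make the MEC (minimum error correction) objective concrete, here is a small brute-force sketch. It assumes a diploid model in which every SNP site is heterozygous, so the second haplotype is the bitwise complement of the first, and it is only practical for a handful of SNPs because it tries all 2^N haplotypes. The function names and the noisy matrix are hypothetical illustrations, not taken from the slides.

```python
from itertools import product

# Fragment matrix from the slides; '-' marks a missing call.
F = [
    [1, 0, "-"], [0, 1, "-"], [1, 0, "-"],
    [0, "-", "-"], ["-", 1, 0], ["-", 0, 1],
]

def mismatches(fragment, haplotype):
    """Number of called alleles in the fragment that disagree with the haplotype."""
    return sum(1 for a, h in zip(fragment, haplotype) if a != "-" and a != h)

def mec_score(matrix, haplotype):
    """Flips needed so that every fragment matches the haplotype or its complement."""
    complement = [1 - h for h in haplotype]
    return sum(min(mismatches(f, haplotype), mismatches(f, complement))
               for f in matrix)

def brute_force_mec(matrix):
    """Try every haplotype and return (best score, best haplotype)."""
    n = len(matrix[0])
    return min((mec_score(matrix, list(h)), list(h))
               for h in product([0, 1], repeat=n))

clean_score, clean_hap = brute_force_mec(F)

# Hypothetical noisy variant: one allele of f_3 is flipped, which creates an
# odd cycle f_1, f_2, f_3 in the conflict graph.
noisy = [row[:] for row in F]
noisy[2][1] = 1
noisy_score, noisy_hap = brute_force_mec(noisy)
print(clean_score, noisy_score)  # prints: 0 1
```

The clean matrix needs no corrections, while the single simulated error costs exactly one flip.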

38 of 51

Examples

39 of 51

A likelihood based approach

40 of 51

Flip-update MC

41 of 51

A difficult example for the flip-update

n columns, each spanned by d fragments.
Two haplotypes (H1,H2) are equally likely
Hard to move from one good haplotype to another

42 of 51

A new Markov chain

43 of 51

Read-haplotype consistency graph

44 of 51

Weighting R-H graph edges

45 of 51

Cuts

46 of 51

Negative cuts are good cuts

-6

47 of 51

HapCUT algorithm

48 of 51

Biological approaches to haplotype assembly

49 of 51

Sequencing trio

https://experiment.com/u/oSMmA

50 of 51

Strand-Seq

Strand-seq: a unifying tool for studies of chromosome segregation

51 of 51

Assembly of trio

Koren et al., 2018