Diploidy and polyploidy
Definitions
Human heterozygosity
Heterozygosity is the fraction of polymorphic alleles
Based on 220,247 pre-selected SNPs
Herraez et al., 2009
Inbreeding
Inbreeding is widely used to reduce heterozygosity for follow-up sequencing:
Takifugu rubripes
Kuroyanagi et al, BMC Genomics, 2013
Drosophila melanogaster
Swindell and Bouzat, Genetics, 2006
Danio rerio
Monson and Sadler, Zebrafish, 2010
Mus musculus
Beck et al, Nature Genetics, 2010
Challenge with inbreeding
Significant parts of a genome can remain outbred
Malaria mosquito
“Outbred in localized islands comprising the 8% of genome” -
Holt et al, Science, 2002
Nematode
“Up to 30% heterozygosity persists after 20 generations of inbreeding” -
Barriere et al, Genome Research, 2009
Inbreeding depression
White tiger
“Breeding practices has been linked with abnormal, debilitating, and, at times, lethal conditions”, -
Association of Zoos & Aquariums, 2011
Outbreeding
Panthera leo × Panthera tigris
outbreeding depression
Assembly of diploid genomes
Genome assembly of a polymorphic organism: assembly of TWO DIFFERENT (albeit similar) puzzles at once
Humans have low polymorphism rates due to low effective population size. Most species sequenced so far are
Genome assembly of an inbred organism: assembly of a SINGLE jigsaw puzzle
Low and high polymorphism rates
human Homo sapiens genome size 2 x 3.1 Gb
polymorphism rate 0.1%
human genome assembly is reduced to assembling a single haplome
sea squirt Ciona savignyi genome size 2 x 0.2 Gb
polymorphism rate 12%
Problem: polymorphisms are too complex to be ignored
If the two chromosomes differ by 0.5-15%, assembly becomes problematic because the “double genome” is very repetitive (each black region is a repeat)
Results of diploid assembly
Consensus contigs
Haplocontigs
Conventional assemblers produce a very fragmented assembly of both haplomes
Polymorphic alleles are randomly chosen from haplomes on consensus contigs. Allelic relations can be further reconstructed by haplotype assembly
Haplotype assembly
Approaches to disease study
Detecting genomic variations
Easy to detect using alignment
Detecting complex genomic variations
An integrated map of structural variation in 2,504 human genomes
1000 genomes
Haplotype assembly
Haplotypes:
A C T G T C T A T C
A C G G T A T A C C
Genotypes:
A C [T/G] G T [C/A] T A [T/C] C
Possible phases:
ACTGTCTATC / ACGGTATACC
ACTGTATACC / ACGGTCTATC
ACTGTCTACC / ACGGTATATC
ACTGTATATC / ACGGTCTACC
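A genotype only records which alleles are present at each site, so every heterozygous site doubles the number of candidate haplotype pairs (up to swapping the two haplotypes). The sketch below enumerates them for the example above. The function name `enumerate_phasings` and the encoding of a genotype as a list of one- or two-letter strings are assumptions made for illustration, not part of the slides.

```python
from itertools import product

def enumerate_phasings(genotype):
    """Return all unordered haplotype pairs compatible with a genotype.

    genotype: list of strings, one per site; a one-letter string is a
    homozygous site, a two-letter string lists the two alleles of a
    heterozygous site (hypothetical encoding).
    """
    het = [i for i, alleles in enumerate(genotype) if len(alleles) == 2]
    pairs = []
    # Fix the orientation of the first heterozygous site so that
    # (h1, h2) and (h2, h1) are not both reported.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        h1, h2 = [], []
        for i, alleles in enumerate(genotype):
            if i in choice:
                h1.append(alleles[choice[i]])
                h2.append(alleles[1 - choice[i]])
            else:
                h1.append(alleles)
                h2.append(alleles)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

genotype = ["A", "C", "TG", "G", "T", "CA", "T", "A", "TC", "C"]
phasings = enumerate_phasings(genotype)
for h1, h2 in phasings:
    print(h1, h2)
```

With k heterozygous sites this yields 2^(k-1) pairs, which is why exhaustive enumeration is only feasible for tiny examples.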
Applications
Haplotype assembly: challenges
Haplotype assembly models
Haplotype assembly
Input:
reads covering diploid genome
consensus genome
Output:
resolved phases of SNPs
Definitions:
F: an m x n fragment matrix
rows correspond to fragments (reads)
columns correspond to SNP sites
f_ij in {0, 1, -}, where - means fragment i has no allele call at SNP j
ACTGTATACC
ACTGTC
ACGGTA
TGTCTA
GGTNTA
ATACC
CTATC
fragment | SNP_1 | SNP_2 | SNP_3
f_1      |   1   |   0   |   -
f_2      |   0   |   1   |   -
f_3      |   1   |   0   |   -
f_4      |   0   |   -   |   -
f_5      |   -   |   1   |   0
f_6      |   -   |   0   |   1
Definitions
Conflict fragments:
Distance between fragments:
dist = 2
dist = 1
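The two values above can be reproduced with a small sketch. It assumes the usual convention, which the slide text does not spell out: the distance between two fragments is the number of SNP sites that both cover and where their alleles differ, and two fragments conflict when that distance is positive. The names `F`, `distance` and `in_conflict` are illustrative.

```python
# Fragment matrix from the example; "-" marks a site without an allele call.
F = {
    "f_1": "10-",
    "f_2": "01-",
    "f_3": "10-",
    "f_4": "0--",
    "f_5": "-10",
    "f_6": "-01",
}

def distance(a, b):
    """Count sites covered by both fragments where the alleles differ."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)

def in_conflict(a, b):
    """Assumed definition: fragments conflict if they disagree anywhere."""
    return distance(a, b) > 0

print(distance(F["f_1"], F["f_2"]))  # f_1 and f_2 differ at SNP_1 and SNP_2
print(distance(F["f_1"], F["f_4"]))  # f_1 and f_4 share only SNP_1
```

Fragments that share no covered site, such as f_4 and f_5, get distance 0 under this convention and are therefore never in conflict.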
Definitions
Problem statement:
Haplotype1: {f_1, f_3, f_6}
Haplotype2: {f_2, f_4, f_5}
Fragment conflict graph
G_F = (V, E)
|V| = m (number of fragments)
Edges:
Edge weights: w(v_1, v_2) = d(f_1, f_2)
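A minimal sketch of building G_F for the six example fragments. It assumes that an edge joins two fragments exactly when they are in conflict (they disagree on at least one SNP site that both cover) and that d counts those disagreements; the slide leaves the edge definition blank, so treat this as one plausible reading. Function and variable names are illustrative.

```python
from itertools import combinations

# Rows of the fragment matrix, keyed by fragment index.
F = {1: "10-", 2: "01-", 3: "10-", 4: "0--", 5: "-10", 6: "-01"}

def d(a, b):
    """Number of commonly covered SNP sites where two fragments differ."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)

def conflict_graph(fragments):
    """Return {(i, j): weight} for every conflicting pair i < j."""
    edges = {}
    for i, j in combinations(sorted(fragments), 2):
        w = d(fragments[i], fragments[j])
        if w > 0:
            edges[(i, j)] = w
    return edges

G = conflict_graph(F)
for (i, j), w in G.items():
    print(f"f_{i} -- f_{j}  weight {w}")
```

Comparing all pairs is quadratic in the number of fragments; for real data one would only compare reads that overlap on the genome.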
Fragment conflict graph is bipartite
[Figure: fragment conflict graph of the six example fragments, with vertex labels grouped as 1, 3, 6 and 2, 4, 5]
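For error-free reads, the two sides of the bipartite conflict graph correspond to the two haplotypes. The sketch below checks this on the example with a breadth-first 2-coloring. It assumes that edges join conflicting fragments; `two_color` and the other names are illustrative, not taken from the slides.

```python
from collections import deque
from itertools import combinations

F = {1: "10-", 2: "01-", 3: "10-", 4: "0--", 5: "-10", 6: "-01"}

def conflict(a, b):
    """True if the fragments disagree at a site that both cover."""
    return any(x != y for x, y in zip(a, b) if "-" not in (x, y))

def two_color(vertices, edges):
    """Return the two sides of a bipartite graph, or None if an odd cycle exists."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for start in sorted(vertices):
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # two adjacent vertices forced onto one side
    side0 = {v for v, c in color.items() if c == 0}
    return side0, set(color) - side0

edges = [(i, j) for i, j in combinations(sorted(F), 2) if conflict(F[i], F[j])]
partition = two_color(F, edges)
print(partition)
```

If the graph has several connected components, each component is colored independently, so the relative orientation of the components is arbitrary; this mirrors the fact that reads can only phase SNPs they connect.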
Problem formulation: MER & UMER
Conflicts are caused by:
G_F contains a bipartite subgraph G_F'
Equivalent to minimum graph cut
Problem formulation: MFR & MSR
Equivalent to maximum independent set of graph
An independent set is a set of vertices in a graph, no two of which are adjacent
Problem formulation: MEC & LHR
MEC tries to break odd-length cycles by flipping a minimal number of alleles
LHR is an unusual formulation and is almost abandoned
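To make the MEC (minimum error correction) objective concrete, here is a brute-force sketch for tiny inputs. It assumes every SNP site is heterozygous, so the second haplotype is the complement of the first, and it scores a haplotype pair by summing, over fragments, the distance to the closer haplotype. The noisy matrix `F_noisy` is a made-up variant of the example with one allele of f_3 flipped; all names are illustrative.

```python
from itertools import product

F = {1: "10-", 2: "01-", 3: "10-", 4: "0--", 5: "-10", 6: "-01"}

def mismatches(fragment, haplotype):
    """Allele calls in the fragment that disagree with the haplotype."""
    return sum(1 for x, h in zip(fragment, haplotype) if x != "-" and x != h)

def mec_brute_force(fragments, n_sites):
    """Try every haplotype h1 (h2 is its complement); return (score, h1, h2)."""
    best = None
    for bits in product("01", repeat=n_sites):
        h1 = "".join(bits)
        h2 = "".join("1" if b == "0" else "0" for b in bits)
        score = sum(min(mismatches(f, h1), mismatches(f, h2))
                    for f in fragments.values())
        if best is None or score < best[0]:
            best = (score, h1, h2)
    return best

clean = mec_brute_force(F, 3)

# Hypothetical sequencing error: SNP_2 of f_3 flipped from 0 to 1.
F_noisy = dict(F)
F_noisy[3] = "11-"
noisy = mec_brute_force(F_noisy, 3)

print(clean, noisy)
```

The search is exponential in the number of SNP sites, so it only illustrates the objective. In the noisy matrix, f_1, f_2 and f_3 form an odd cycle in the conflict graph, and a single flip is enough to break it.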
Examples
A likelihood based approach
Flip-update MC
A difficult example for the flip-update
A new Markov chain
Read-haplotype consistency graph
Weighting R-H graph edges
Cuts
Negative cuts are good cuts
-6
HapCUT algorithm
Biological approaches to haplotype assembly
Sequencing trio
https://experiment.com/u/oSMmA
Strand-Seq
Strand-seq: a unifying tool for studies of chromosome segregation
Assembly of trio
Koren et al., 2018