Variation in genome structure and copy number
Applied Computational Genomics, Lecture 13
https://github.com/quinlan-lab/applied-computational-genomics
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACAAGCAGCAAACAGTCTGCATGGGTCATCCCCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGC
ACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAATACAGGAAGAAAAAGGCAGGACAGAATTACAAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTGCCTCCTACCCTTGCGCCTCATGACCGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCAGCTGGCAAGAGCAGGGGGTGGGCAGAAAGCACCCGGTGGACTCAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCTCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCCCCCAGCTTGGCGGATGGACTCTAGCAGAGTGGCCAGCCACCGGAGGGGTCAACCACTTCCC
Early 2000s dogma: SNPs account for most human genetic variation
Segmental duplications (a.k.a. low-copy repeats)
Bailey et al, 2002
~5% of the human genome is duplicated!
Self dotplot: 10 megabases of Chr 15 (dot = 1 kb exact match)
Our understanding of structural variation is driven by technology
1940s - 1980s: Cytogenetics / Karyotyping
1990s: CGH / FISH / SKY / COBRA
2000s: Genomic microarrays (BAC-aCGH / oligo-aCGH)
Today: High-throughput DNA sequencing
2007 Science magazine breakthrough of the year? Why?
Discovery of abundant copy-number variation
76 CNVs in 20 individuals, 70 genes (Science, July 2004)
255 CNVs in 55 individuals, 127 genes (Nature Genetics, Aug. 2004)
Segdups drive non-allelic homologous recombination
Nuttle et al, 2013
Variation in genome structure. So-called "structural variation" (SV)
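[Figure: segment diagrams (A, B, C, D) comparing the Reference with five SV classes: Duplication, Inversion, Deletion, Insertion (new segment X), and Translocation (segments Q and R from another chromosome). Panels are labeled CNV or SV.]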
SV is a superset of copy number variation (CNV). Not all structural changes affect copy number (e.g., inversions)!
Why is structural variation relevant / important?
SV and human traits
Zhang et al, 2009
SV and human disease phenotypes
Zhang et al, 2009
SV and human disease phenotypes (cont).
Zhang et al, 2009
Formation of new chromatin domains determines pathogenicity of genomic duplications. Franke et al, 2016.
https://www.nature.com/articles/nature19800
Chromatin features constrain structural variation across evolutionary timescales.
Fudenberg and Pollard, 2018.
https://www.nature.com/articles/nature19800
Structural variation "breakpoints" (i.e., novel DNA junctions)
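Reference genome: ACGTCGACGGACAGATTGGTTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACACCCCAGACATTACCCCAC
Sample genome: ACGTCGACGGACAGATTGGCCCCAGACATTACCCCAC
SV (deletion) breakpoint: the novel junction in the sample genome, between ...GATTGG and CCCCAG...
Deleted in the sample genome: TTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACA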
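To make the idea of a breakpoint concrete, here is a minimal sketch that recovers the deleted interval from the two sequences on this slide by trimming their shared prefix and suffix. The function name `find_deletion` is hypothetical, and the approach assumes the two sequences differ by exactly one simple deletion with no sequencing errors.

```python
def find_deletion(ref: str, sample: str):
    """Return the 0-based, half-open reference interval missing from sample.

    Assumes sample equals ref with exactly one contiguous block removed.
    """
    # Length of the longest prefix shared by both sequences.
    p = 0
    while p < len(sample) and ref[p] == sample[p]:
        p += 1
    # Length of the longest shared suffix that does not overlap that prefix.
    s = 0
    while s < len(sample) - p and ref[-1 - s] == sample[-1 - s]:
        s += 1
    if p + s != len(sample):
        raise ValueError("sequences differ by more than one simple deletion")
    return p, len(ref) - s


ref = ("ACGTCGACGGACAGATTGGTTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACA"
       "CCCCAGACATTACCCCAC")
sample = "ACGTCGACGGACAGATTGGCCCCAGACATTACCCCAC"

start, end = find_deletion(ref, sample)
deleted = ref[start:end]
print(start, end, deleted)
```

Joining `ref[:start]` and `ref[end:]` reproduces the sample sequence; the point where they meet is the novel junction.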
The DNA gymnastics of visualizing SV breakpoints
[Figure: reference vs. sample segment diagrams for each SV class]
Deletion: Reference A B C D E F G H I J; Sample A B C H I J
Inversion: Ref. A B C D E F G H I J; Sample A B C G F E D H I J
Tandem Duplication: Ref. A B C D E F G H I J; Sample A B C D E F G D E F G H I J
Distant Insertion: Ref A B C D E F G H I J, with W X Y at a distant locus; Sample carries an inserted copy of X
Reciprocal translocation: Ref. Chr1 (A to J) and Ref. Chr2 (1 to 10) exchange ends to give Sample Chr1/2 and Sample Chr2/1
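The segment diagrams can be read as ordered lists of oriented blocks, and each SV class then shows up as a small set of adjacencies absent from the reference. The sketch below is one way to enumerate them. The `parse` and `junctions` helpers and the apostrophe notation for an inverted segment are assumptions made for illustration, not anything from the slides.

```python
def parse(text):
    """'A B C' -> [(segment, strand)]; a trailing apostrophe marks inversion."""
    return [(t.rstrip("'"), "-" if t.endswith("'") else "+")
            for t in text.split()]


def junctions(path):
    """Set of adjacencies in a path, independent of reading direction."""
    flip = {"+": "-", "-": "+"}
    found = set()
    for (a, sa), (b, sb) in zip(path, path[1:]):
        forward = ((a, sa), (b, sb))
        # The same junction as seen from the opposite strand.
        reverse = ((b, flip[sb]), (a, flip[sa]))
        found.add(min(forward, reverse))
    return found


REF = parse("A B C D E F G H I J")
SAMPLES = {
    "deletion": "A B C H I J",
    "inversion": "A B C G' F' E' D' H I J",
    "tandem duplication": "A B C D E F G D E F G H I J",
}

novel = {name: junctions(parse(order)) - junctions(REF)
         for name, order in SAMPLES.items()}
for name, found in novel.items():
    print(name, sorted(found))
```

The deletion and the tandem duplication each create one novel junction, while the inversion creates two, one at each end of the flipped block.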
Humans differ by roughly 3,000 deletions (>=500bp)
Humans differ by a few hundred duplications
Humans differ by a few hundred inversions
Humans differ by tens of retrotransposon insertions private to the reference
LINE element
Humans differ by tens of retrotransposon insertions private to the sample (not in the reference)
AluY
Size distribution of SVs in 1000 Genomes project
Sudmant et al, 2015
AluY
How do we identify structural variants via DNA sequencing?
Sequence alignment “signals” for structural variation
1. Align DNA sequences from sample to human reference genome
2. Look for evidence of structural differences
Ref. vs. Exp.
(a) Depth of coverage
(b) Paired-end mapping
(c) Split-read mapping
(d) de novo assembly
Resolution: low (a) to high (d)
Copy number changes affect the depth of sequence coverage
Normal
Tumor
Duplication
Challenges:
- need high coverage for high resolution
- deletions easier than duplications
- prone to artifacts owing to repeats, GC content, etc.
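As a rough illustration of how depth relates to copy number, the sketch below converts paired tumor and normal window depths into copy-number estimates. It assumes the median window in each sample is diploid and that both samples were binned identically. The function name and the depth values are made up for the example.

```python
from statistics import median


def copy_number_estimates(sample_depth, control_depth, ploidy=2):
    """Estimate copy number per window from a sample/control depth ratio.

    Each depth track is scaled by its own median so that differences in
    overall sequencing depth cancel out. Control windows must be non-zero.
    """
    s_med = median(sample_depth)
    c_med = median(control_depth)
    estimates = []
    for s, c in zip(sample_depth, control_depth):
        ratio = (s / s_med) / (c / c_med)
        estimates.append(ploidy * ratio)
    return estimates


# Hypothetical mean depths for six consecutive windows.
normal = [30, 31, 29, 30, 30, 30]
tumor = [40, 41, 39, 60, 20, 40]

cn = [round(x) for x in copy_number_estimates(tumor, normal)]
print(cn)
```

A ratio near 1.5 suggests a single-copy gain and a ratio near 0.5 a single-copy loss. Repeats and GC content distort the ratio, which is why real data are far noisier than this.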
Detecting CNV by counting alignments in genome "windows"
~15Mb region
Strengths:
Weaknesses:
Z-score
Genome Position
Slide in collaboration with Ira Hall
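The window-counting idea can be sketched in a few lines: bin alignment start positions into fixed-size windows, then express each count as a Z-score relative to all windows. The names and the synthetic positions below are illustrative assumptions; a real analysis would read positions from a BAM file.

```python
from statistics import mean, pstdev


def window_counts(positions, region_start, region_end, window):
    """Count alignment start positions in consecutive fixed-size windows."""
    n_windows = (region_end - region_start + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // window] += 1
    return counts


def z_scores(counts):
    """Standardize counts; returns zeros if every window is identical."""
    mu, sd = mean(counts), pstdev(counts)
    return [(c - mu) / sd if sd > 0 else 0.0 for c in counts]


# Synthetic data: one read every 10 bp, except half that density in 200-299.
positions = [p for p in range(0, 500, 10) if not 200 <= p < 300]
positions += list(range(200, 300, 20))

counts = window_counts(positions, 0, 500, 100)
z = z_scores(counts)
print(counts, z)
```

The half-density window stands out with a negative Z-score, the kind of signal expected from a heterozygous deletion.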
GC content varies dramatically in the genome
http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf
Region from chromosome 1
GC content
Each point is 20kb
Each point is 2kb
Each point is 200 bp
Why are there no points here?
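A small sketch of windowed GC content follows, showing how the window size changes the picture. It treats windows made up entirely of N as having no value. That is one plausible reason for a stretch with no points (an assembly gap), though the slide itself does not state the answer. The function names and the toy sequence are assumptions.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C, or None."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None  # nothing but N (or other ambiguity codes)
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window):
    """GC fraction for consecutive, non-overlapping windows."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence: GC-rich block, AT-rich block, a gap of Ns, a balanced block.
toy = "GGCC" * 5 + "ATAT" * 5 + "N" * 20 + "ACGT" * 5

small = windowed_gc(toy, 20)
large = windowed_gc(toy, 40)
print(small)
print(large)
```

Small windows expose the local extremes and the gap, while larger windows average them away, mirroring the 200 bp, 2 kb, and 20 kb panels.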
Correct for GC bias - convert counts to Z scores of GC distributions
Z-score
GC normalization (Z-score)
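One way to read "Z scores of GC distributions" is to compare each window only with other windows of similar GC content. The sketch below does that with 5% GC bins. The bin width, function name, and counts are assumptions for illustration, and production tools typically use more robust statistics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gc_z_scores(counts, gc, bin_percent=5):
    """Z-score each window against windows in the same GC bin."""
    bins = defaultdict(list)
    for i, g in enumerate(gc):
        # Work in integer percent to avoid floating-point bin edges.
        bins[round(g * 100) // bin_percent].append(i)
    z = [0.0] * len(counts)
    for members in bins.values():
        values = [counts[i] for i in members]
        mu, sd = mean(values), pstdev(values)
        for i in members:
            z[i] = (counts[i] - mu) / sd if sd > 0 else 0.0
    return z


# Hypothetical windows: low-GC windows get about 100 reads, high-GC about 50.
# The last high-GC window has 100 reads, i.e. twice what its GC predicts.
gc = [0.30] * 4 + [0.60] * 5
counts = [90, 100, 110, 100, 45, 50, 55, 50, 100]

z = gc_z_scores(counts, gc)
print([round(v, 2) for v in z])
```

The last window's raw count of 100 looks unremarkable among all windows, but it is a clear outlier within its own GC bin.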
Copy number segmentation
normalized & segmented
Coverage (5kb windows)
# reads
Fraction GC
Chr17:3-15mb
Slide in collaboration with Ira Hall
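Segmentation turns per-window scores into contiguous calls. The sketch below uses the simplest possible rule, merging adjacent windows that fall on the same side of a Z-score threshold. This is a stand-in chosen for brevity, not the segmentation method behind the slide; the threshold and Z-scores are invented.

```python
from itertools import groupby


def segment(z, window=5000, start=0, threshold=2.0):
    """Merge consecutive windows sharing a state into (start, end, state)."""
    def state(value):
        if value > threshold:
            return "gain"
        if value < -threshold:
            return "loss"
        return "neutral"

    segments = []
    index = 0
    for label, run in groupby(z, key=state):
        length = len(list(run))
        segments.append((start + index * window,
                         start + (index + length) * window,
                         label))
        index += length
    return segments


# Hypothetical normalized Z-scores for nine 5 kb windows.
z = [0.1, -0.3, 2.5, 3.1, 2.8, 0.2, -2.4, -2.9, 0.0]
segments = segment(z)
for seg in segments:
    print(seg)
```

A hard threshold breaks segments on single noisy windows, which is the weakness that statistical segmentation methods are designed to address.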
Depth (counts): Normal, Primary Tumor, Metastatic Tumor
Slide in collaboration with Ira Hall
Looking for "discordant" paired-end fragments
Paired-end sequencing
Ref
Sample
paired-ends map farther away than expected
2000 bp
Slide in collaboration with Ira Hall
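"Farther away than expected" needs a definition of expected. A minimal sketch, assuming the library's insert-size distribution is estimated from the data itself with a median and a median absolute deviation (MAD), so that the discordant pairs do not inflate the estimate. The cutoff of 6 MADs and the sizes are arbitrary illustrative choices.

```python
from statistics import median


def insert_size_cutoff(sizes, n_mads=6):
    """Upper bound on a concordant insert size, from median and scaled MAD."""
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    # 1.4826 scales the MAD to a standard deviation for normal data.
    return med + n_mads * 1.4826 * mad


# Hypothetical mapped distances between mates, in bp.
sizes = [480, 490, 495, 500, 500, 505, 510, 520, 2000]

cutoff = insert_size_cutoff(sizes)
discordant = [s for s in sizes if s > cutoff]
print(round(cutoff, 1), discordant)
```

A pair spanning 2000 bp in the reference, from a library of roughly 500 bp fragments, suggests that about 1500 bp present in the reference is missing from the sample.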
Looking for "discordant" paired-end fragments
Challenges:
- Difficult to achieve single-nucleotide resolution for the SV breakpoint
- Artifacts from chimeric molecules and PCR duplicates
Advantages:
- Much higher resolution than depth of coverage
- Can find any type of SV, not limited to deletions and duplications as depth of coverage is
Cluster
Ref
Sample
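Because single discordant pairs are often artifacts, callers look for clusters of pairs that agree. The sketch below is a greedy clustering under assumed parameters (`max_dist`, `min_support`), with each pair reduced to the inner coordinates of its two reads. It illustrates the idea rather than any particular tool's algorithm.

```python
def cluster_pairs(pairs, max_dist=500, min_support=2):
    """Group (left_end, right_start) pairs whose two ends both lie close."""
    clusters = []
    for left, right in sorted(pairs):
        for cluster in clusters:
            last_left, last_right = cluster[-1]
            if (abs(left - last_left) <= max_dist
                    and abs(right - last_right) <= max_dist):
                cluster.append((left, right))
                break
        else:
            clusters.append([(left, right)])
    # Clusters with too little support are treated as noise.
    return [c for c in clusters if len(c) >= min_support]


def breakpoint_interval(cluster):
    """For a deletion, the breakpoints lie between the innermost read ends."""
    return max(l for l, _ in cluster), min(r for _, r in cluster)


# Hypothetical discordant pairs: three agree, two are isolated.
pairs = [(1000, 3000), (1050, 3080), (1100, 3120),
         (5000, 9000), (20000, 22500)]

clusters = cluster_pairs(pairs)
print(clusters, breakpoint_interval(clusters[0]))
```

The interval narrows as more pairs join the cluster, but it rarely shrinks to a single base, which is the resolution limit noted in the challenges above.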
Discordant mapping "signatures" for various SV types
[Figure: read-pair mappings between the test genome and the reference genome for each signature]
concordant (+/-)
too big (+/-) = deletion
too small (+/-) = spanned insertion
everted (-/+) = tandem duplication
same strand (+/+ or -/-) = inversion
Quinlan and Hall, 2012
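The signatures above amount to a small decision table on strand orientation and mapped span. A minimal sketch, assuming the strands are given for the leftmost and rightmost read of the pair and that the concordant size range has already been estimated; the function name and thresholds are hypothetical.

```python
def classify_pair(left_strand, right_strand, span, min_size, max_size):
    """Map a read pair's orientation and span to the SV class it suggests."""
    orientation = (left_strand, right_strand)
    if left_strand == right_strand:
        return "inversion"            # same strand: +/+ or -/-
    if orientation == ("-", "+"):
        return "tandem duplication"   # everted
    # Remaining case is the normal +/- orientation; check the span.
    if span > max_size:
        return "deletion"             # too big
    if span < min_size:
        return "spanned insertion"    # too small
    return "concordant"


examples = [
    ("+", "-", 350),
    ("+", "-", 5000),
    ("+", "-", 100),
    ("-", "+", 2000),
    ("+", "+", 2000),
]
calls = [classify_pair(a, b, span, min_size=300, max_size=500)
         for a, b, span in examples]
print(calls)
```

A single pair only suggests a class; a call would normally require several pairs with the same signature at the same locus.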
Split-read mapping "signatures" for various SV types
Quinlan and Hall, 2012
Challenges:
- misalignment in low-complexity regions causes spurious calls
Advantages:
- in theory, yields single-nucleotide resolution at SV breakpoints: allows us to study the mechanism creating the SV!!
Paired-end
Sample
Ref.
Split-read
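To show why a split read gives base-pair resolution, the sketch below takes a read that crosses the deletion junction from the earlier breakpoint slide and finds where its two halves match the reference. It uses exact string matching, which assumes an error-free read and a unique placement. Real aligners use gapped local alignment, and the function name and `min_anchor` value are illustrative.

```python
def split_read_deletion(read, ref, min_anchor=8):
    """Return the deleted reference interval (0-based, half-open) or None."""
    # Longest prefix of the read that occurs exactly in the reference.
    k = len(read)
    while k >= min_anchor and ref.find(read[:k]) == -1:
        k -= 1
    if k < min_anchor or len(read) - k < min_anchor:
        return None  # read is not split, or one anchor is too short
    left = ref.find(read[:k])
    # The remainder must match downstream of the first anchor.
    right = ref.find(read[k:], left + k)
    if right == -1:
        return None
    return left + k, right


ref = ("ACGTCGACGGACAGATTGGTTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACA"
       "CCCCAGACATTACCCCAC")
sample = "ACGTCGACGGACAGATTGGCCCCAGACATTACCCCAC"

read = sample[9:29]  # a 20 bp read spanning the junction
interval = split_read_deletion(read, ref)
print(read, interval)
```

When the sequence flanking a junction is repeated (microhomology), the exact split point is ambiguous, and that ambiguity is itself a clue to the mechanism that created the SV.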
SV mechanisms revealed via ca. 1bp breakpoint resolution
Weckselblatt et al, 2015
A probabilistic framework for SV discovery
Layer et al, 2014
Ryan Layer
Lumpy integrates paired-end mapping, split-read mapping, and depth of coverage for better SV discovery accuracy
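As a toy illustration of why integrating signals helps, the sketch below represents each piece of evidence as a probability distribution over the breakpoint position and combines them by multiplying and renormalizing. This treats the signals as independent and is an assumption made for the example; it is not a description of how LUMPY itself combines evidence. All names and numbers are invented.

```python
def combine_evidence(*distributions):
    """Multiply per-position probabilities and renormalize to sum to 1."""
    shared = set(distributions[0])
    for dist in distributions[1:]:
        shared &= set(dist)
    product = {}
    for pos in shared:
        p = 1.0
        for dist in distributions:
            p *= dist[pos]
        product[pos] = p
    total = sum(product.values())
    if total == 0:
        raise ValueError("evidence does not overlap")
    return {pos: p / total for pos, p in product.items()}


# Paired-end evidence: breakpoint somewhere in a 10 bp interval (vague).
paired_end = {pos: 0.1 for pos in range(1000, 1010)}
# Split-read evidence: sharply peaked at one position.
split_read = {1003: 0.1, 1004: 0.8, 1005: 0.1}

posterior = combine_evidence(paired_end, split_read)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

The paired-end signal localizes the event and the split-read signal sharpens it, so the combined estimate is both supported and precise.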
Sequencing depth
The dirty secrets of SV discovery
Secret #1: Often many false positives
Secret #2: The false negative rate is also typically high
The power of long read sequencing for SV discovery
Oxford Nanopore Sequencing
Key Points:
- Protein nanopore array embedded in an artificial lipid membrane
- 1 DNA molecule, 1 translocating enzyme
- salt + electrodes on either side of pore
- Bases detected by change in current
- intrinsic detection of methylated cytosine
Clarke et al., 2009: Nature Nanotechnology
It works…well. This is frankly rather magical
Tom Sasani
Kelsey Rogers
Studying poxvirus genome evolution with Elde lab
Duplications of K3L increase viral fitness
when passaged in human cells
ONT recapitulates the E3L deletion (swap for LacZ)
Illumina
ONT
ONT recapitulates the K3L expansion
Illumina
ONT
Long reads allow us to look at allelic diversity in K3L expansion