1 of 50

Variation in genome structure and copy number

Applied Computational Genomics, Lecture 13

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 50

CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACAAGCAGCAAACAGTCTGCATGGGTCATCCCCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGC
ACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAATACAGGAAGAAAAAGGCAGGACAGAATTACAAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTGCCTCCTACCCTTGCGCCTCATGACCGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCAGCTGGCAAGAGCAGGGGGTGGGCAGAAAGCACCCGGTGGACTCAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCTCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCCCCCAGCTTGGCGGATGGACTCTAGCAGAGTGGCCAGCCACCGGAGGGGTCAACCACTTCCC

3 of 50

Early 2000s dogma: SNPs account for most human genetic variation

4 of 50

Segmental duplications (a.k.a. Low copy repeats)

Bailey et al, 2002

~5% of the human genome is duplicated!

Self Dotplot:

10 megabases of Chr 15

(dot = 1 kb exact match)

5 of 50

Our understanding of structural variation is driven by technology

1940s - 1980s

Cytogenetics / Karyotyping

1990s

CGH / FISH / SKY / COBRA

2000s

Genomic microarrays

BAC-aCGH / oligo-aCGH

Today

High-throughput DNA sequencing

6 of 50

2007 Science magazine breakthrough of the year? Why?

7 of 50

Discovery of abundant copy-number variation

76 CNVs in 20 individuals

70 genes

Science, July 2004

255 CNVs in 55 individuals

127 genes

Nature Genetics, Aug. 2004

      • 331 CNVs, only 11 in common
      • Half observed in only 1 individual
      • Impact "plenty" of genes
      • Correlated with segmental duplications in the reference genome

8 of 50

Segdups drive non-allelic homologous recombination

Nuttle et al, 2013

9 of 50

Variation in genome structure. So-called "structural variation" (SV)

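[Figure: a reference arrangement of segments A, B, C, D compared with rearranged sample arrangements. Panels: Reference, Deletion, Duplication, Insertion, Inversion, Translocation. Two panels are labeled CNV and three are labeled SV.]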

SV is a superset of copy number variation (CNV). Not all structural changes affect copy number (e.g., inversions)!
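The distinction above can be sketched by treating each genome as a string of segment labels and counting copies of each segment. The segment strings below are hypothetical examples, not read off the slide's figure, and segment orientation is not modeled.

```python
from collections import Counter


def copy_number_changes(reference, sample):
    """Return {segment: sample copies - reference copies} for segments whose count differs."""
    ref_counts = Counter(reference)
    sample_counts = Counter(sample)
    segments = set(ref_counts) | set(sample_counts)
    return {
        seg: sample_counts[seg] - ref_counts[seg]
        for seg in sorted(segments)
        if sample_counts[seg] != ref_counts[seg]
    }


REFERENCE = "ABCD"
# Hypothetical sample arrangements, one per SV type.
EXAMPLES = {
    "deletion": "ACD",
    "duplication": "ABBCD",
    "inversion": "ACBD",  # B and C swapped; copy number unchanged
}

for name, sample in EXAMPLES.items():
    changes = copy_number_changes(REFERENCE, sample)
    label = "CNV" if changes else "copy-neutral SV"
    print(name, changes, label)
```

An empty result means the rearrangement changed segment order without changing copy number, which is why depth-based methods cannot see inversions.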

10 of 50

Why is structural variation relevant / important?

  • They are common and affect a large fraction of the genome
    • In total, SVs impact more base pairs than all single-nucleotide differences.
  • They are a major driver of genome evolution
    • Speciation can be driven by rapid changes in genome architecture
    • Genome instability and aneuploidy: hallmarks of solid tumor genomes

11 of 50

Why is structural variation relevant / important?

  • Genetic basis of traits
    • Gene dosage effects.
    • Neuropsychiatric disease (e.g., autism, schizophrenia)
    • Spontaneous SVs implicated in so-called “genomic” and developmental disorders
    • Somatic genome instability; age-dependent disease

12 of 50

SV and human traits

Zhang et al, 2009

  • CNV of the RHD gene is responsible for determining Rh-negative blood group in Europeans
  • HIV-1 infection associated with CNV in CCL3L1
  • CYP2D6 is responsible for the metabolism of more than 30% of all orally administered drugs including many antipsychotics, antidepressants, antiarrhythmics, and opioid analgesics (10.1371/journal.pone.0113808)
    • CYP2D6 is highly polymorphic and has many alleles created by both CNV and SNPs

13 of 50

SV and human disease phenotypes

Zhang et al, 2009

14 of 50

SV and human disease phenotypes (cont.)

Zhang et al, 2009

15 of 50

Formation of new chromatin domains determines pathogenicity of genomic duplications. Franke et al, 2016.

https://www.nature.com/articles/nature19800

16 of 50

Chromatin features constrain structural variation across evolutionary timescales.

Fudenberg and Pollard, 2018.


17 of 50

Structural variation "breakpoints" (i.e., novel DNA junctions)

Reference genome:

ACGTCGACGGACAGATTGGTTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACACCCCAGACATTACCCCAC

Sample genome:

ACGTCGACGGACAGATTGGCCCCAGACATTACCCCAC

SV (deletion) breakpoint: the novel junction in the sample genome, where the reference sequence between the two shared flanks is deleted.
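As a minimal sketch of how this breakpoint can be located, the function below trims the prefix and suffix shared by the two sequences from the slide. It assumes a single simple deletion with no other differences; `find_deletion` is an illustrative name, not part of any tool discussed in this lecture.

```python
REFERENCE = (
    "ACGTCGACGGACAGATTGGTTTTTCGCGAGATTATTACCAGAGCATGAGCCCACACACCCCAGACATTACCCCAC"
)
SAMPLE = "ACGTCGACGGACAGATTGGCCCCAGACATTACCCCAC"


def find_deletion(reference, sample):
    """Locate one simple deletion in `sample` relative to `reference`.

    Returns (start, end, deleted), where reference[start:end] is the deleted
    sequence (0-based, half-open coordinates).
    """
    if len(sample) >= len(reference):
        raise ValueError("sample must be shorter than reference")

    # Length of the shared left flank.
    prefix = 0
    while prefix < len(sample) and reference[prefix] == sample[prefix]:
        prefix += 1

    # Length of the shared right flank, not overlapping the left flank.
    suffix = 0
    while (suffix < len(sample) - prefix
           and reference[-1 - suffix] == sample[-1 - suffix]):
        suffix += 1

    if prefix + suffix != len(sample):
        raise ValueError("sequences differ by more than one simple deletion")

    start, end = prefix, len(reference) - suffix
    return start, end, reference[start:end]


start, end, deleted = find_deletion(REFERENCE, SAMPLE)
print(f"deleted reference[{start}:{end}] ({len(deleted)} bp): {deleted}")
```

On the slide's sequences this reports a 38 bp deletion beginning after the first 19 reference bases. Real callers work from alignments rather than whole sequences, and must handle sequencing errors and repeats at the junction.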

18 of 50

The DNA gymnastics of visualizing SV breakpoints

[Figure: reference vs. sample segment diagrams; each new adjacency between segments in the sample is a breakpoint.]

Deletion: Reference A B C D E F G H I J; Sample A B C H I J

Inversion: Reference A B C D E F G H I J; Sample A B C G F E D H I J

Tandem duplication: Reference A B C D E F G H I J; Sample A B C D E F G D E F G H I J

Distant insertion and reciprocal translocation (Ref. Chr1 and Ref. Chr2 vs. Sample Chr1/2 and Sample Chr2/1) are drawn the same way.

19 of 50

Humans differ by roughly 3,000 deletions (>=500bp)

20 of 50

Humans differ by a few hundred duplications

21 of 50

Humans differ by a few hundred inversions

22 of 50

Humans differ by tens of retrotransposon insertions private to the reference

LINE element

23 of 50

Humans differ by tens of retrotransposon insertions private to the sample (not in the reference)

[Figure: AluY insertion, with +/- strand labels on the aligned reads]

24 of 50

Size distribution of SVs in 1000 Genomes project

Sudmant et al, 2015

AluY

25 of 50

How do we identify structural variants via DNA sequencing?

26 of 50

Sequence alignment “signals” for structural variation

1. Align DNA sequences from sample to human reference genome

2. Look for evidence of structural differences

Signals, ordered from low to high resolution:

(a) Depth of coverage

(b) Paired-end mapping

(c) Split-read mapping

(d) de novo assembly

[Figure: each signal illustrated on rows labeled Ref. and Exp.]

27 of 50

Copy number changes affect the depth of sequence coverage

Normal

Tumor

Duplication

Challenges:

- need high coverage for high resolution

- deletions easier than duplications

- prone to artifacts owing to repeats, GC content, etc.

28 of 50

Detecting CNV by counting alignments in genome "windows"

[Plot: Z-score vs. genome position across a ~15 Mb region]

Strengths:

  1. Fast and simple.
  2. Easy to identify gene amplifications.
  3. Relatively straightforward interpretation: is gene X amplified or deleted?

Weaknesses:

  1. Limited resolution (2-5 kb) = imprecise boundaries
  2. Cannot detect balanced events or reveal variant architecture.

Slide in collaboration with Ira Hall
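The window-counting idea can be sketched as follows, assuming alignments have already been reduced to 0-based start positions within one region. The function names and the toy data are illustrative; the Z-score here is taken across all windows, without the GC correction a real analysis would apply.

```python
from statistics import mean, pstdev


def window_counts(read_starts, region_length, window_size):
    """Count alignment start positions in fixed-size, non-overlapping windows."""
    n_windows = (region_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // window_size] += 1
    return counts


def z_scores(counts):
    """Express each window count as a Z-score relative to all windows."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return [0.0] * len(counts)
    return [(c - mu) / sigma for c in counts]


# Toy data: 10 windows of 1 kb; window 4 has extra reads (a candidate gain).
starts = [w * 1000 + i * 50 for w in range(10) for i in range(20)]
starts += [4000 + i * 25 for i in range(40)]

counts = window_counts(starts, region_length=10_000, window_size=1000)
for window, (count, z) in enumerate(zip(counts, z_scores(counts))):
    print(window, count, round(z, 2))
```

Window size trades resolution against noise: smaller windows give sharper boundaries but fewer reads per window.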

29 of 50

GC content varies dramatically in the genome

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

Region from chromosome 1

GC content

Each point is 20kb

Each point is 2kb

Each point is 200 bp

Why are there no points here?

30 of 50

Correct for GC bias: convert counts to Z-scores within GC-content distributions

  • Use variably-sized windows, masked for repeats
    • RepeatMasker, SSRs, “mappability”
  • Window size should yield >100 reads (median)
  • With all alignments, absolute copy number can be discerned (Sudmant et al., 2011)

[Figure: Chr17:3-15 Mb. Panels: coverage in 5 kb windows (depth, counts); # reads vs. fraction GC; GC normalization (Z-score); copy number segmentation (Z-score, normalized & segmented).]

Slide in collaboration with Ira Hall
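One way to read "Z-scores of GC distributions" is to compare each window only with windows of similar GC content. The sketch below does that with fixed-width GC bins. The binning scheme, function names, and toy values are assumptions for illustration, not the method used to make the slide.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C; None if no bases are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None  # e.g. an all-N gap in the reference
    return (seq.count("G") + seq.count("C")) / called


def gc_z_scores(counts, gc_fractions, n_bins=20):
    """Z-score each window's read count against windows in the same GC bin."""
    bins = defaultdict(list)
    for i, gc in enumerate(gc_fractions):
        if gc is not None:
            bins[min(int(gc * n_bins), n_bins - 1)].append(i)

    z = [None] * len(counts)
    for indices in bins.values():
        values = [counts[i] for i in indices]
        mu, sigma = mean(values), pstdev(values)
        for i in indices:
            z[i] = 0.0 if sigma == 0 else (counts[i] - mu) / sigma
    return z


# Toy windows: three with ~40% GC, one with ~60% GC, one all-N gap.
counts = [10, 20, 30, 100, 100]
gc = [0.41, 0.42, 0.43, 0.61, None]
print(gc_z_scores(counts, gc, n_bins=10))
```

Windows without any called bases get no GC value and therefore no Z-score, which is one reason a GC plot can have stretches with no points.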

31 of 50

Normal

Primary Tumor

Metastatic Tumor

Slide in collaboration with Ira Hall

32 of 50

Looking for "discordant" paired-end fragments

Paired-end sequencing

Ref

Sample

paired-ends map farther away than expected

2000 bp

Slide in collaboration with Ira Hall

33 of 50

Looking for "discordant" paired-end fragments

Challenges:

- Difficult to achieve single-nucleotide resolution for the SV breakpoint

- Chimeric molecules, PCR duplicates

Advantages:

- Much higher resolution than depth of coverage

- Can find any type of SV, not only the deletions and duplications detectable by depth of coverage

Cluster

Ref

Sample

34 of 50

Discordant mapping "signatures" for various SV types

Signatures of read pairs from the test genome aligned to the reference genome (strands of the left/right read in parentheses):

concordant (+/-)

too big (+/-) = deletion

too small (+/-) = spanned insertion

everted (-/+) = tandem duplication

same strand (+/+ or -/-) = inversion

Quinlan and Hall, 2012
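The signatures above translate into a small decision rule. This is a sketch under simplifying assumptions: both reads map to the same chromosome, `left_strand` belongs to the leftmost read, and the expected span range is supplied by the caller (for example, from the library's insert size distribution). The function name and thresholds are hypothetical.

```python
def classify_pair(left_strand, right_strand, span, expected_min, expected_max):
    """Classify one aligned read pair by orientation and mapped span.

    left_strand / right_strand: '+' or '-' for the leftmost and rightmost read.
    span: distance in bp between the outer ends of the two alignments.
    """
    for strand in (left_strand, right_strand):
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand: {strand!r}")

    if left_strand == right_strand:
        return "inversion"            # same strand (+/+ or -/-)
    if (left_strand, right_strand) == ("-", "+"):
        return "tandem duplication"   # everted (-/+)

    # Remaining case: normal +/- orientation, so judge by span.
    if span > expected_max:
        return "deletion"             # too big
    if span < expected_min:
        return "spanned insertion"    # too small
    return "concordant"


# Hypothetical library with an expected span of 300-700 bp.
print(classify_pair("+", "-", 2000, 300, 700))
print(classify_pair("-", "+", 450, 300, 700))
```

A single discordant pair is weak evidence (chimeras and mismapping produce them too), so callers look for clusters of pairs that agree on type and position.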

35 of 50

Split-read mapping "signatures" for various SV types

Quinlan and Hall, 2012

Challenges:

- misalignment in low-complexity regions causes spurious calls

Advantages:

- in theory, yields single-nucleotide resolution at SV breakpoints: allows us to study the mechanism creating the SV!!

[Figure: paired-end and split-read alignments of sample reads against the reference]

36 of 50

SV mechanisms revealed via ca. 1bp breakpoint resolution

Weckselblatt et al, 2015

37 of 50

A probabilistic framework for SV discovery

Layer et al, 2014

Ryan Layer

LUMPY integrates paired-end mapping, split-read mapping, and depth of coverage to improve SV discovery accuracy

38 of 50

A probabilistic framework for SV discovery

Layer et al, 2014

39 of 50

A probabilistic framework for SV discovery

Layer et al, 2014

Sequencing depth

40 of 50

The dirty secrets of SV discovery

41 of 50

Secret #1: Often many false positives

  • Short reads + heuristic alignment + repetitive genome = systematic alignment artifacts (false calls)
  • Chimeras and duplicate molecules
  • Ref. genome errors (e.g., gaps, mis-assemblies)
  • ALL SV mapping studies use strict filters for above

42 of 50

Secret #2: The false negative rate is also typically high

  • Most current datasets have low to moderate physical coverage (~10-20X) due to small insert size
  • Breakpoints are enriched in repetitive genomic regions that pose problems for sensitive read alignment
  • FILTERING!
  • The false negative rate is usually hard to measure, but is thought to be extremely high for most paired-end mapping studies (>30%)
  • When searching for spontaneous mutations in a family or a tumor/normal comparison, a false negative call in one sample can be a false positive somatic or de novo call in another.

43 of 50

The power of long read sequencing for SV discovery

44 of 50

45 of 50

Oxford Nanopore Sequencing

Key Points:

- Protein nanopore array embedded in an artificial lipid membrane

- 1 DNA molecule, 1 translocating enzyme

- salt + electrodes on either side of pore

- Bases detected by change in current

- intrinsic detection of methylated cytosine

Clarke et al., 2009: Nature Nanotechnology

46 of 50

It works…well. This is frankly rather magical

Tom Sasani

Kelsey Rogers

47 of 50

Studying poxvirus genome evolution with Elde lab

Duplications of K3L increase viral fitness

when passaged in human cells

48 of 50

ONT recapitulates the E3L deletion (swap for LacZ)

Illumina

ONT

49 of 50

ONT recapitulates the K3L expansion

Illumina

ONT

50 of 50

Long reads allow us to look at allelic diversity in K3L expansion