1 of 61

DNA Sequencing Technologies

Data Analysis in Genome Biology

GEN242

Thomas Girke

April 10, 2018

Outline

What Are We Sequencing?

  • Genomic Libraries
  • cDNA Libraries

Traditional DNA Sequencing Technologies

  • Chemical Sequencing
  • Sanger Sequencing

Next Generation Sequencing Methods

  • Solexa/Illumina: Reversible Terminator Method
  • Helicos: Single Molecule Sequencing
  • 454/Roche: Pyrosequencing Method
  • SOLiD/ABI: Supported Oligo Ligation Method
  • Third Generation Sequencing: PacBio and Others

Research Applications

References and Books

Usually We Prefer to Sequence DNA

  • DNA sequencing is more efficient because:
    • DNA cloning and amplification are easy
    • Efficient enzymatic sequencing reactions are available
  • Protein sequencing is much harder because:
    • No comparable cloning or amplification techniques are available
    • Enzymatic sequencing techniques are of limited availability
    • The chemical nature of proteins makes sequencing harder

DNA Libraries

A DNA library consists of cloned DNA fragments that can represent the entire genome of an organism (genomic DNA library) or its transcriptome (cDNA library).

Genomic library

    • Often contains the entire DNA content of an organism.
    • Suitable for determining genomic DNA sequence.
    • Requires chromosomal DNA isolation.

cDNA library

    • Contains cDNA copies of the mRNAs that are expressed in a tissue sample.
    • mRNA is used as starting material.
    • mRNA needs to be reverse transcribed into cDNA.
    • Requires mRNA isolation.
    • Challenges: cDNA libraries tend to be incomplete with regard to
      • 5’ sequences
      • Representation of all genes in the genome

Why Is it Helpful to Have Both?

      • A genomic library gives the genome sequence.
      • A cDNA library gives information about the expressed sequences in the genome.
      • For instance, cDNA sequences can be aligned to newly generated genomic sequence to
        • identify gene boundaries ⇒ gene prediction with physical evidence
        • distinguish between expressed genes and pseudogenes
      • Many other reasons

Workflow of a Genomic Sequencing Project

[Figure: workflow diagram. Recoverable labels: Annotation of Functional Features; Submit to GenBank]

Synthesis of Common Genomic Libraries

[Figure panels: Plasmid Library; λ Phage Library]

Synthesis of cDNA Library in λ Phage Vector

[Figure panels: 1. mRNA to cDNA; 2. cDNA Cloning into λ]

Cloning Vectors for Libraries

Plasmid Library [ Genomic & cDNA ]

    • Circular extra-chromosomal DNA molecule in bacteria
    • Insert size range for cloning: 1,000-20,000 bp

λ Phage Library [ Genomic & cDNA ]

    • Double-stranded linear DNA of an E. coli-infecting virus
    • Insert size range for cloning: 1,000-25,000 bp

Cosmid Library [ Genomic ]

    • λ phage-derived hybrid plasmid with cos sequences
    • Insert size range for cloning: 35,000-50,000 bp

BAC Library [ Genomic ]

    • Bacterial artificial chromosome
    • Insert size range for cloning: 150,000-350,000 bp

YAC Library [ Genomic ]

    • Yeast artificial chromosome
    • Insert size range for cloning: 100,000-3,000,000 bp

Many Additional Library Types

    • Please consult molecular biology books.
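
The insert sizes listed above can be turned into a simple lookup. The following is a minimal sketch, not a lab protocol: the dictionary and the function name `candidate_vectors` are hypothetical, and the numbers are copied from this slide only.

```python
# Insert size ranges in bp, copied from the vector list on this slide.
VECTOR_INSERT_RANGES = {
    "Plasmid": (1_000, 20_000),
    "Lambda phage": (1_000, 25_000),
    "Cosmid": (35_000, 50_000),
    "BAC": (150_000, 350_000),
    "YAC": (100_000, 3_000_000),
}


def candidate_vectors(insert_bp):
    """Return the vector types whose listed range contains insert_bp."""
    return [
        name
        for name, (low, high) in VECTOR_INSERT_RANGES.items()
        if low <= insert_bp <= high
    ]


print(candidate_vectors(40_000))   # cosmid-sized insert
print(candidate_vectors(200_000))  # insert that fits BAC and YAC ranges
```

An insert size that falls between the listed ranges (for example 30,000 bp) returns an empty list, which simply reflects the gaps in the numbers quoted here.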

What are EST Sequences?

  • Expressed sequence tags (ESTs) are short single-pass sequences from cDNA libraries (partial RNA sequences).
  • They are useful for:
    • Identification of gene boundaries in genomic sequences
    • Discovery of single nucleotide polymorphisms (SNPs)
    • Discovery of alternative splice events in transcripts
    • Quantitative (digital) gene expression analysis (DGE)
    • Many additional applications
  • Currently, there are tens of millions of ESTs available in GenBank.
  • RNA-Seq reads are the NGS version of ESTs.

History of DNA Sequencing

  • 1977: “DNA Sequencing by Chemical Degradation” is published by Allan Maxam and Walter Gilbert.
  • 1977: “DNA Sequencing by Enzymatic Synthesis” is published by Fred Sanger.
  • 1980: Fred Sanger and Walter Gilbert receive the Nobel Prize in Chemistry.
  • 1982: GenBank starts as a public repository of DNA sequences.
  • 1986: Leroy Hood’s laboratory at the California Institute of Technology announces the first semi-automated DNA sequencing machine.
  • 1997: Genome sequence of E. coli is published.
  • 2001: Draft sequence of the human genome is published.
  • 2005: First next generation sequencing technologies become available to the public.

Chemical Sequencing by Maxam & Gilbert

  1. Uses radioactively labeled DNA fragments of 500 bp.
  2. Four separate chemical treatments generate DNA breaks at the positions G, A+G, C and C+T.
  3. The fragments are size-separated by gel electrophoresis in four separate lanes.
  4. The fragments are visualized by autoradiography on X-ray film.
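
How a sequence is read from the four lanes can be sketched in a few lines. This is an idealized illustration, assuming every fragment length maps directly to one position and every band is clearly visible; the function name `read_gel` and the lane dictionary are hypothetical.

```python
def read_gel(lanes, length):
    """Infer the base at each position from the lanes that show a band there.

    lanes maps the four treatments ("G", "A+G", "C", "C+T") to the set of
    fragment lengths (positions) at which a band is visible.
    """
    bases = []
    for pos in range(1, length + 1):
        in_g, in_ag = pos in lanes["G"], pos in lanes["A+G"]
        in_c, in_ct = pos in lanes["C"], pos in lanes["C+T"]
        if in_g and in_ag:
            bases.append("G")   # band in both purine lanes
        elif in_ag:
            bases.append("A")   # band only in the A+G lane
        elif in_c and in_ct:
            bases.append("C")   # band in both pyrimidine lanes
        elif in_ct:
            bases.append("T")   # band only in the C+T lane
        else:
            bases.append("N")   # no band: position cannot be called
    return "".join(bases)


# Idealized band pattern for the sequence GATTC.
lanes = {"G": {1}, "A+G": {1, 2}, "C": {5}, "C+T": {3, 4, 5}}
print(read_gel(lanes, 5))
```

The A+G and C+T lanes are only informative in combination with the G and C lanes, which is why four treatments are needed.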

[Figure panels: Chemical DNA Degradation; Gel Electrophoresis]

Illustration of Sanger Sequencing

[Figure panels: Sequencing Principle; Radioactive Labeling; Fluorescence Labeling]

Processing of Sequencing Raw Data

  • Assign a quality score to each peak.
  • The frequently used Phred scores are log10-transformed error probabilities, Q = -10 · log10(P):
    • score = 20 corresponds to a 1% error rate
    • score = 30 corresponds to a 0.1% error rate
    • score = 40 corresponds to a 0.01% error rate
  • Base calling (A, T, G or C) assigns a Phred score to each called base.
  • Ambiguous positions with Phred scores ≤ 20 are labeled with N.
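
The relation between Phred scores and error probabilities, Q = -10 · log10(P), can be checked with a short sketch. The function names are hypothetical; the masking threshold of 20 is the one quoted on this slide.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mask_low_quality(seq, quals, threshold=20):
    """Replace every base whose Phred score is <= threshold with N."""
    return "".join(
        "N" if q <= threshold else base for base, q in zip(seq, quals)
    )


for q in (20, 30, 40):
    print(q, phred_to_error_prob(q))

print(mask_low_quality("ACGT", [40, 20, 10, 30]))
```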

Common Synonyms

  • High-throughput sequencing: HTS or HT-Seq
  • Flow cell sequencing (FCS)
  • Massively parallel sequencing (MPS)
  • Next generation sequencing (NGS)
  • Second/third generation sequencing
  • Deep sequencing
  • Sequencing by synthesis
  • Many other synonyms
  • Review article: Holt and Jones 2008

Overview: 454, SOLiD and Illumina

From review article: Medini et al 2008

Similarities and Differences of NGS Technologies

Common components

  • Flow cells as reaction chambers
  • Iterative sequencing process
  • Massive parallelization
  • Clonally amplified or single molecule templates

Differences

  • Template preparation
  • Sequencing chemistry
  • Flow cell configuration

NGS Sequencing Methods

Reversible Terminator Methods (e.g. Illumina/Solexa)

  • Use reversible versions of dye-terminator reactions.
  • Principal steps: adding one nucleotide at a time, detecting fluorescence corresponding to that position, then removing the blocking group to allow polymerization of another nucleotide.

Single Molecule Methods (e.g. Helicos, PacBio)

  • Variable strategies.

Pyrosequencing Methods (e.g. 454)

  • Also use DNA polymerization to add nucleotides.
  • Principal steps: adding one type of nucleotide at a time, then detecting and quantifying the number of nucleotides added to a given location through the light emitted by the release of attached pyrophosphates.

Supported Oligonucleotide Ligation Methods (e.g. SOLiD, Complete Genomics)

  • Use a ligation-based approach.
  • Principal steps: stepwise ligation of labeled random octamers to obtain the sequence of attached dinucleotides; the ligated dinucleotides of each ligation round are spaced by several nucleotides; continuous sequence information is obtained by offsetting the sequencing primer.

Example: Illumina/Solexa Technology

[Figure panels: Illumina HiSeq 2500 Sequencer; Flow Cell]

Basic Steps of Illumina Sequencing

Flow Cell Loading

  1. Generate DNA library (genomic- or cDNA-based) with insert length of ∼200 bp.
  2. Load library onto flow cell (nano device for liquid handling).
  3. PCR-based bridge amplification of loaded fragments to obtain DNA clusters (serves as signal amplification).

Sequencing Cycles

  4. Start reversible dye-terminator reaction containing primer and labeled dNTPs among other components.
  5. Image scan to detect the identity of the first base of each cluster via the characteristic fluorescence signal for each labeled nucleotide.
  6. De-protection step removes the blocking group and fluorescence group of the incorporated nucleotide.
  7. Repeat steps 4-6 about 50-250 times.
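
The sequencing cycle described above (incorporate one blocked nucleotide, image, de-protect, repeat) can be sketched as a toy simulation. This is an idealized, error-free model of a single cluster; the function name `sequence_by_synthesis` is hypothetical, and strand orientation is ignored for simplicity.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles):
    """Return the read produced by `cycles` reversible-terminator cycles."""
    read = []
    for cycle in range(min(cycles, len(template))):
        # Incorporation: the blocking group allows exactly one labeled
        # nucleotide, complementary to the template base, per cycle.
        incorporated = COMPLEMENT[template[cycle]]
        # Imaging: the dye colour identifies the incorporated base.
        read.append(incorporated)
        # De-protection: blocking group and dye are removed, so the
        # next loop iteration can extend the strand by one more base.
    return "".join(read)


print(sequence_by_synthesis("TACGGA", 4))
```

Because one base is added per cycle, the read length equals the number of cycles, up to the length of the template.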

Compare with illustration on next 3 slides!

Flow Cell Loading

[Figure: flow cell loading. Labels: Illumina HiSeq 2500 Sequencer; Flow Cell]

Sequencing Cycles

Details of Sequencing Reaction

Illustration shows the sequencing cycles for a single template molecule!

Single End, Paired End and Mate Pair Sequencing

[Figure panels: Single End; Paired End; Mate Pair]

AP1/AP2: flow cell adapters; SP1/SP2: sequencing primers

Paired End Chemistry: Step I

[Figure labels: Single End; Paired End; Grafted Flow Cell; Cluster Generation: Initial Extension; Linearization: periodate / two different enzymes]

Paired End Chemistry: Step II

Cluster Generation: Amplification

Paired End Chemistry: Step III

Cluster Generation: Linearization

Paired End Chemistry: Step IV

Sequencing

Processing of Illumina Sequencing Data

  • Convert cluster images to intensity values.
  • Call bases from the intensity of each fluorescent dye.
  • Generate quality scores similar to Phred scores.
  • The length of each sequence corresponds to the number of cycles, e.g. 75 cycles → 75 bp.
  • Remove low-quality reads.
  • Downstream analyses are specific to the application!

A single sequencing run with 2 × 100 cycles can generate ∼3 billion sequences and 32 TB of image data.
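
The step of removing low-quality reads can be sketched as a simple filter on the mean quality per read. This is only one possible criterion, chosen here for illustration: the function names and the cutoff of 20 are assumptions, not part of the Illumina pipeline described above.

```python
def mean_quality(quals):
    """Arithmetic mean of the per-base quality scores of one read."""
    return sum(quals) / len(quals)


def filter_reads(reads, min_mean_q=20):
    """Keep (sequence, qualities) pairs whose mean quality reaches min_mean_q.

    The cutoff of 20 is an assumed example value.
    """
    return [
        (seq, quals)
        for seq, quals in reads
        if quals and mean_quality(quals) >= min_mean_q
    ]


reads = [
    ("ACGTAC", [38, 39, 40, 37, 36, 35]),  # high quality, kept
    ("ACGTAC", [12, 10, 15, 8, 9, 11]),    # low quality, removed
]
kept = filter_reads(reads)
print(len(kept), "of", len(reads), "reads kept")
```

Real pipelines often trim low-quality read ends instead of discarding whole reads; which strategy fits depends on the downstream application.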

Sequence Format: FASTQ

FASTQ format has 4 lines per sequence

  1. '@' character followed by a sequence identifier
  2. Character string of sequence
  3. '+' character optionally followed by the same sequence identifier
  4. Quality scores in ASCII format

Example of partial FASTQ file

    @SRR446037.238 length=75
    TCAGCCTTGCGACCATACTCCCCCCGGAACCCAAAAACTTTGATTTCTCATAAGGTGCCAGCGGAGTCCTATAAG
    +SRR446037.238 length=75
    IIIIIIIIIIIGIIHIHIIIIIIIIDHDIIIIIFHDGDEFHCCGHHHHHCDDD@?6?@A@?A;??@2@BA@BBB@
    ... millions of entries ...
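
The four-line record structure can be read with a few lines of code. This is a minimal sketch, assuming each record occupies exactly four lines and that qualities use the common ASCII offset of 33 (Phred+33); the offset is an assumption not stated on the slide, and the function name `parse_fastq` is hypothetical. For real data, an established parser is preferable.

```python
def parse_fastq(lines, offset=33):
    """Return a list of (identifier, sequence, qualities) tuples.

    offset=33 assumes Phred+33 encoding of the quality characters.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    records = []
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one score as ASCII code minus offset.
        records.append((header[1:], seq, [ord(ch) - offset for ch in qual]))
    return records


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for name, seq, quals in parse_fastq(example):
    print(name, seq, quals)
```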

Helicos: Single Molecule Sequencing

  • Has similarities to Solexa/Illumina technology, but sequences single molecule templates.
  • Attaches one of the four nucleotides at a time using proprietary nucleotide-polymerase formulations. This prevents the incorporation of more than one nucleotide in each cycle in homopolymer regions.

454/Roche Sequencing Steps

Pyrosequencing Methods (e.g. 454)

  • Also use DNA polymerization to add nucleotides.
  • Principal steps: adding one type of nucleotide at a time, then detecting and quantifying the number of nucleotides added to a given location through the light emitted by the release of attached pyrophosphates.
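
The idea of flowing one nucleotide type at a time and quantifying the light signal can be sketched as a toy flowgram. This is an idealized model in which the signal equals the exact number of incorporated nucleotides; the flow order "TACG" and the function names are assumptions made for illustration.

```python
def pyrosequence(read, flow_order="TACG", max_flows=400):
    """Return a list of (nucleotide, signal) pairs, one per flow.

    `read` is the strand being synthesized. The signal is the number of
    nucleotides incorporated in that flow (0 if the flowed nucleotide does
    not match, more than 1 in homopolymer stretches).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        count = 0
        # A homopolymer run is extended completely within a single flow.
        while pos < len(read) and read[pos] == nucleotide:
            count += 1
            pos += 1
        signals.append((nucleotide, count))
        flow += 1
    return signals


def call_bases(signals):
    """Translate the flowgram back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


flows = pyrosequence("TTAGGGC")
print(flows)
print(call_bases(flows))
```

In practice the light intensity is not perfectly proportional to the run length, which is why long homopolymers are a typical error source of this method.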

For more details see: 454 Web Site (http://www.454.com)

454 Pyrosequencing

SOLiD: Sequencing by Supported Oligonucleotide Ligation and Detection

PacBio: Sequencing with Single Polymerase Molecule

Publication: Eid et al 2009

  • Read lengths: 0.5-20 kbp!
  • Much lower cost
  • Disadvantages: high error rate and low number of sequences

Nanopore Sequencing

  • Read lengths: xkbp!!
  • Much lower cost.
  • Disadvantages: currently higher error rate and low number of sequences.

Comparison of Methods

Method      Read Length         Sequences per Run   Utility
Sanger      500-1500 bp         384                 de novo and low throughput
454/Roche   300-600 bp          ~2 × 10^6           de novo and medium throughput
PacBio      0.5-20 kbp          ~1-5 × 10^6         de novo and medium throughput
Illumina    50-150 bp (1-2x)    ~1.6 × 10^9         de novo and high throughput

All numbers are estimates and apply to the situation in Dec. 2015!

Applications of NGS Methods

NGS technologies provide vast opportunities for genomics, comparative genome biology, medical diagnostics, etc. The following list gives only a few examples.

Applications

  • Genome-wide detection of SNPs and mutations (SNP-seq)
  • Methylome profiling by bisulphite sequencing (BS-seq)
  • DNA-protein interactions (ChIP-seq)
  • Transcriptome sequencing (RNA-seq)
  • mRNA expression profiling (RNA-seq and DGE)
  • Small RNA profiling and discovery
  • Hi-C to study three dimensional architecture of genomes
  • De novo genome assembly (for PacBio)

Application: De Novo Sequencing and Assembly

Application: DNA-Protein Interactions with ChIP-Seq

Reference for ChIP-Seq data analysis: Jothi et al 2008

Application: Methylome Profiling with BS-Seq

Application: RNA-Seq Gene Expression Profiling

Application: Digital Gene Expression (DGE) Profiling

Targeted Sequencing for Large Genomes

Targeted sequencing using DNA capture microarrays or beads

  • Powerful approach to make NGS more economical and versatile.
  • Example: usage of programmable microarrays (here NimbleGen) to enrich for DNA regions of interest (Albert et al 2007).

10X Genomics: Linked-Read Sequencing

Resolves many challenges inherent to short read sequencing

From Zheng et al (2016)

Database: Sequence Read Archive from NCBI

  • 1000 Human Genomes Project: http://www.1000genomes.org

  • Many more 1000 genome projects

References

Albert, T J, Molla, M N, Muzny, D M, Nazareth, L, Wheeler, D, Song, X, Richmond, T A, Middle, C M, Rodesch, M J, Packard, C J, Weinstock, G M, Gibbs, R A (2007) Direct selection of human genomic loci by microarray hybridization. Nat Methods, 4: 903-905. URL http://www.hubmed.org/display.cgi?uids=17934467

Eid, J et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science, 323: 133-138. URL http://www.hubmed.org/display.cgi?uids=19023044

Holt, RA, Jones, SJ (2008) The new paradigm of flow cell sequencing. Genome Res, 18: 839-846. URL http://www.hubmed.org/display.cgi?uids=18519653

Jothi, R, Cuddapah, S, Barski, A, Cui, K, Zhao, K (2008) Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res, 36: 5221-5231. URL http://www.hubmed.org/display.cgi?uids=18684996

Medini, D, Serruto, D, Parkhill, J, … , C, Moxon, R, Falkow, S, Rappuoli, R (2008) Microbiology in the post-genomic era. Nat Rev Microbiol, 6: 419-430. URL http://www.hubmed.org/display.cgi?uids=18475305

Zheng, G X Y et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol, 34: 303-311.
