1 of 186

Analysis and assembly methods for

microbiome sequencing data

Marcus Fedarko

3 of 186

Microbiomes

3

Steve Gschmeissner, Science Photo Library

“Scanning electron micrograph (SEM) of bacteria cultured from a sample of human faeces.”

Steve Gschmeissner, Science Photo Library

“Coloured scanning electron micrograph (SEM) of bacteria cultured from a mobile phone.”

Steve Gschmeissner, Science Photo Library

“Bacterial contamination, coloured scanning electron micrograph (SEM). Escherichia coli bacteria in a cell culture. This contamination has come from an unclean water source.”

5 of 186

This talk

  1. Introduction: Studying microbiomes
    1. Why bother?
    2. Why hasn’t this research been more useful?
    3. Defining a goal for this talk
  2. Culture-independent (a.k.a. sequencing-based) methods
    • Marker gene sequencing
    • Metagenome sequencing
  3. Metagenome assembly
    • Input (reads)
    • Outputs (contigs, an assembly graph, …)
    • Methods (de novo vs. reference-based, overlap graph vs. de Bruijn graph, …)
  4. Future work: Solving the strain separation problem

5

7 of 186

Introduction: Why bother studying microbiomes?

7

9 of 186

Introduction: Why bother studying microbiomes?

“During the Eastern Jin dynasty (AD 300–400 years), ‘Zhou Hou Bei Ji Fang’, a well-known monograph of traditional Chinese medicine (TCM) written by Hong Ge, recorded a case of treating patients with food poisoning or severe diarrhea by ingesting human fecal suspension (known as yellow soup or Huang-Long decoction).”

9

H. Du, T.-t. Kuang, S. Qiu, T. Xu, C.-L. G. Huan, G. Fan, and Y. Zhang. Fecal medicines used in traditional medical system of China: a systematic review of their names, original species, traditional uses, and modern investigations. Chinese Medicine, 14(1):1–16, 2019.

10 of 186

Introduction: Why bother studying microbiomes?

10

American Gastroenterological Association. Fecal Microbiota Transplantation (FMT): Overview. https://gastro.org/practice-guidance/gi-patient-center/topic/fecal-microbiota-transplantation-fmt/.

Fecal Microbiota Transplantation (FMT)

14 of 186

Introduction: Why bother studying microbiomes?

14

“H. pylori-negative individuals harbor a microbiota that is more complex and highly diverse compared to H. pylori-positive individuals. [...] Following infection with H. pylori, Proteobacteria and specifically H. pylori dominate the gastric microbiota. This leads to the development of chronic gastritis.”

15 of 186

Introduction: Why is it always C. diff and H. pylori?

“There are two well-documented diseases in the microbiome field that link a microbial biomarker with causation in disease: Helicobacter pylori-associated peptic ulceration and gastric cancer (Parsonnet et al., 1991) and Clostridium (or Clostridioides) difficile infection-associated diarrhea (Gupta et al., 2016).”

“However, causal inferences between complex microbiomes and other inflammatory, metabolic, neoplastic, and neuro-behavioral disorders have been neither compelling nor conclusive [...]”

15

J. Walter, A. M. Armet, B. B. Finlay, and F. Shanahan. Establishing or Exaggerating Causality for the Gut Microbiome: Lessons from Human Microbiota-Associated Rodents. Cell, 180(2):221–232, 2020.

16 of 186

Introduction: Obesity and the gut microbiome

16

17 of 186

Introduction: Obesity and the gut microbiome

17

“Adult germ-free [wild-type] C57BL/6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”

(Context: ob/ob mice have a specific mutation “[...] that produces a stereotyped, fully penetrant obesity phenotype”; +/+ mice lack this mutation.)

Ley, R. E., Bäckhed, F., Turnbaugh, P., Lozupone, C. A., Knight, R. D., & Gordon, J. I. (2005). Obesity alters gut microbial ecology. Proceedings of the national academy of sciences, 102(31), 11070-11075.

18 of 186

Introduction: Obesity and the gut microbiome

18

“Adult germ-free [wild-type] C57BL/6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). [...]”

Results: “Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota [...]”

20 of 186

Introduction: Obesity and the gut microbiome

20

Turnbaugh, P. J. (2017). Microbes and diet-induced obesity: fast, cheap, and out of control. Cell Host & Microbe, 21(3), 278-281.

21 of 186

Introduction: Obesity and the gut microbiome?

21

22 of 186

Introduction: Why is it always C. diff and H. pylori?

22

23 of 186

Introduction: Why is this so difficult?

  • Challenges common to many areas of science
    • Limited sample sizes
    • Institutional and personal biases against the publication of null results
    • Failure to pre-register studies
    • Hypothesizing after the collection of data without reporting the study as exploratory
  • Challenges more specific to microbiome / bioinformatics research
    • Limitations of using mice (or other non-human organisms) as models
    • The “curse of dimensionality”
    • Sparsity
    • Compositionality
    • Uneven sampling depths
    • Methods that only provide limited resolution about the types of microbes in a sample
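Compositionality can be made concrete with a toy calculation. The sketch below uses made-up absolute counts for three hypothetical taxa (the names and numbers are assumptions, not data from any study). Because sequencing reports proportions rather than absolute abundances, a change in one taxon shifts the apparent abundance of every other taxon.

```python
def relative_abundances(counts):
    """Convert a dict of counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical absolute abundances in two samples.
sample_a = {"taxon_1": 100, "taxon_2": 100, "taxon_3": 200}
# Only taxon_3 changes (it doubles); taxon_1 and taxon_2 stay the same.
sample_b = {"taxon_1": 100, "taxon_2": 100, "taxon_3": 400}

rel_a = relative_abundances(sample_a)
rel_b = relative_abundances(sample_b)

for taxon in sample_a:
    print(f"{taxon}: {rel_a[taxon]:.3f} -> {rel_b[taxon]:.3f}")
```

Only taxon_3 differs between the two samples, yet the relative abundances of taxon_1 and taxon_2 both drop.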

23

25 of 186

Introduction: Our goal for this talk

Improving the resolution with which we can study microbes.

25

As far as some technologies can go

26 of 186

Introduction: Our goal for this talk

Improving the resolution with which we can study microbes.

26

“Strain” level: completely unique genomes

27 of 186

Introduction: Our goal for this talk

Improving the resolution with which we can study microbes.

27

For more examples, see: Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.

29 of 186

Introduction: Our goal for this talk

Improving the resolution with which we can study microbes.

Small strain-level differences can make a big difference!

Our goal, then, is reconstructing the full genomes of all microbes in a sample.

29

For more examples, see: Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.

32 of 186

Introduction: Our goal for this talk

Improving the resolution with which we can study microbes.

Small strain-level differences can make a big difference!

Our goal, then, is reconstructing the full genomes of all microbes in a sample.

… This isn’t really possible right now, but we’ll see what we can do!

One final note for the introduction, though:

32

33 of 186

Introduction: Why is this so difficult?

  • Challenges common to many areas of science
    • Limited sample sizes
    • Institutional and personal biases against the publication of null results
    • Failure to pre-register studies
    • Hypothesizing after the collection of data without reporting the study as exploratory
  • Challenges more specific to microbiome / bioinformatics research
    • Limitations of using mice (or other non-human organisms) as models
    • The “curse of dimensionality”
    • Sparsity
    • Compositionality
    • Uneven sampling depths
    • Methods that only provide limited resolution about the types of microbes in a sample

33

35 of 186

Introduction: Our goal for this talk

35

36 of 186

This talk

  • Introduction: Studying microbiomes
    • Why bother?
    • Why hasn’t this research been more useful?
    • Defining a goal for this talk
  • Culture-independent (a.k.a. sequencing-based) methods
    • Marker gene sequencing
    • Metagenome sequencing
  • Metagenome assembly
    • Input (reads)
    • Outputs (contigs, an assembly graph, …)
    • Methods (de novo vs. reference-based, overlap graph vs. de Bruijn graph, …)
  • Future work: Solving the strain separation problem

36

37 of 186

Culture-Independent Methods

37

38 of 186

Culture-Independent Methods

38

Steve Gschmeissner, Science Photo Library

“Scanning electron micrograph (SEM) of bacteria cultured from a sample of human faeces.”

Steve Gschmeissner, Science Photo Library

“Coloured scanning electron micrograph (SEM) of bacteria cultured from a mobile phone.”

Steve Gschmeissner, Science Photo Library

“Bacterial contamination, coloured scanning electron micrograph (SEM). Escherichia coli bacteria in a cell culture. This contamination has come from an unclean water source.”

41 of 186

Culture-Independent Methods

“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”

Although not all microbes are easily culturable, all have genomes.

Idea: look at the DNA in a sample and use that to study the microbes there!

41

Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of bacteriology, 180(18), 4765-4774.

42 of 186

Culture-Independent Methods

  1. Terminal restriction fragment length polymorphism, a.k.a. T-RFLP
  2. Denaturing / temperature gradient gel electrophoresis, a.k.a. DGGE / TGGE
  3. Fluorescence in situ hybridization, a.k.a. FISH
  4. Marker gene sequencing, a.k.a. amplicon sequencing, metabarcoding, metataxonomics, ...
  5. Metagenome sequencing, a.k.a. metagenomics, shotgun metagenome sequencing, whole metagenome sequencing, ...

42

For a nice history of these and other methods, see: M. H. Fraher, P. W. O’Toole, and E. M. Quigley. Techniques used to characterize the gut microbiota: a guide for the clinician. Nature Reviews Gastroenterology & Hepatology, 9(6):312–322, 2012.

45 of 186

Culture-Independent Methods

45

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

“Reads”

46 of 186

Culture-Independent Methods

46

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

(Marker gene sequencing)

(Metagenome sequencing)

Strings from Σ = {A,C,G,T}

48 of 186

C. I. Methods: Marker gene sequencing

48

49 of 186

C. I. Methods: Carl Woese and rRNA genes

49

ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent:

(i) the eubacteria, comprising all typical bacteria;

(ii) the archaebacteria, containing methanogenic bacteria; and

(iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.

For a more thorough history of Carl Woese’s life and work, see: D. Quammen. The Scientist Who Scrambled Darwin’s Tree of Life. The New York Times, 2018.

53 of 186

C. I. Methods: What’s in a marker gene?

53

ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent:

(i) the eubacteria, comprising all typical bacteria;

(ii) the archaebacteria, containing methanogenic bacteria; and

(iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.

55 of 186

C. I. Methods: What’s in a marker gene?

55

“To determine relationships covering the entire spectrum of extant living systems, one optimally needs a molecule of appropriately broad distribution. None of the readily characterized proteins fits this requirement. However, ribosomal RNA does. It is

  • a component of all self-replicating systems;
  • it is readily isolated;
  • and its sequence changes but slowly with time

permitting the detection of relatedness among very distant species.”

56 of 186

C. I. Methods: What’s in a marker gene?

56

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

61 of 186

C. I. Methods: What’s in a marker gene?

61

Entropy in the 16S rRNA gene

More mutations

Hypervariable regions

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

Conserved regions
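One plausible way to produce a per-position entropy curve like the one in this figure is to compute Shannon entropy over the columns of a multiple sequence alignment. The sketch below assumes that approach and uses a small hypothetical alignment; it is not the procedure from the cited paper. Columns where every sequence agrees have zero entropy (conserved), and columns with many different bases have high entropy (hypervariable).

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def alignment_entropy(aligned_seqs):
    """Per-position entropy for equal-length, already aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]

# Toy alignment (hypothetical): positions 0-3 are conserved, 4-5 vary.
toy_alignment = [
    "ACGTAC",
    "ACGTCA",
    "ACGTGG",
    "ACGTTT",
]
entropies = alignment_entropy(toy_alignment)

for position, value in enumerate(entropies):
    label = "conserved" if value == 0 else "variable"
    print(position, label)
```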

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using primers that target conserved regions.

62 of 186

C. I. Methods: What’s in a marker gene?

62

Entropy in the 16S rRNA gene

More mutations

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”
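The idea of primers that bind conserved regions and bracket a variable one can be sketched as a toy "in silico PCR". Everything here is an assumption for illustration: the template, the two conserved sites, and the primers are hypothetical (they are not the 515f–806r sequences), and matching is exact, whereas real primers often contain degenerate bases and tolerate mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over {A,C,G,T}."""
    return seq.translate(COMPLEMENT)[::-1]

def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the region spanned by the two primers (inclusive), or None."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the
    # template strand we look for its reverse complement.
    rc_reverse = reverse_complement(reverse_primer)
    end = template.find(rc_reverse, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(rc_reverse)]

# Hypothetical template: two conserved sites flanking a variable region.
conserved_left = "GGCCAA"
variable_region = "TTACGT"
conserved_right = "CCTTGG"
template = "AAAA" + conserved_left + variable_region + conserved_right + "TTTT"

forward_primer = conserved_left
reverse_primer = reverse_complement(conserved_right)

amplicon = in_silico_pcr(template, forward_primer, reverse_primer)
print(amplicon)
```

The returned amplicon contains both primer sites plus the variable region between them, which is the part that differs between microbes.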

69 of 186

C. I. Methods: What’s in a marker gene?

69

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

ATCGACTGACTGACGTACTGTACGGATACCGGGGACATACTACTACTACTACTACTTTTCCCA

(The slide repeats this placeholder amplicon sequence 18 times.)

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

...

Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.
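As a rough illustration of how a variable region can tell microbes apart, the sketch below assigns a read to whichever reference sequence it mismatches least. The reference names and sequences are hypothetical, and counting mismatches between equal-length strings is a deliberate simplification; real pipelines typically rely on alignment or trained classifiers against curated reference databases.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def closest_reference(read, references):
    """Return (name, mismatches) for the reference with the fewest mismatches."""
    scored = ((name, hamming(read, seq)) for name, seq in references.items())
    return min(scored, key=lambda pair: pair[1])

# Hypothetical variable-region sequences for two microbes.
references = {
    "microbe_A": "ACGTACGTAA",
    "microbe_B": "ACGTTTGTCC",
}
read = "ACGTACGTAT"  # one mismatch away from microbe_A

best_name, best_mismatches = closest_reference(read, references)
print(best_name, best_mismatches)
```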

70 of 186

C. I. Methods: What’s in a marker gene?

70

S. Vasileiadis, E. Puglisi, M. Arena, F. Cappa, P. S. Cocconcelli, and M. Trevisan. Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLOS One, 7(8):e42671, 2012.

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae

k__Bacteria; p__Firmicutes; c__Bacilli

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
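The annotation strings above use rank prefixes (k__, p__, c__, o__, f__, g__, s__), and many stop before the species level. The sketch below parses such strings and reports the most specific rank that was actually assigned. The function names are hypothetical helpers written for this example, not part of QIIME 2.

```python
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}

def parse_taxonomy(annotation):
    """Map rank names to labels; empty labels such as 's__' are skipped."""
    parsed = {}
    for field in annotation.split(";"):
        prefix, _, label = field.strip().partition("__")
        if prefix in RANKS and label:
            parsed[RANKS[prefix]] = label
    return parsed

def deepest_rank(annotation):
    """Return the most specific rank with a non-empty label, or None."""
    parsed = parse_taxonomy(annotation)
    # Dicts preserve insertion order, so the last key is the deepest rank.
    return list(parsed)[-1] if parsed else None

# Three of the annotations shown on the slide.
annotations = [
    "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__Haemophilus; s__parainfluenzae",
]
resolved = [deepest_rank(a) for a in annotations]
print(resolved)
```

Only one of these three annotations reaches the species level, which previews the limited-resolution point made later in this section.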

With polymerase chain reaction (PCR), we can amplify specific regions of the genome using primers that target conserved regions.

Thompson et al. 2017: “Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f–806r [...]”

...

Since these amplified sequences contain hypervariable region(s), these regions help us determine which sequences came from which microbe.

Example taxonomic annotations from the QIIME 2 “Moving Pictures” tutorial: https://docs.qiime2.org

71 of 186

C. I. Methods: Marker gene sequencing

71

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

72 of 186

C. I. Methods: Marker gene sequencing

72

Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A., Abnet, C. C., Al-Ghalith, G. A., ... & Caporaso, J. G. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852-857.

Broad taxonomic comparisons

73 of 186

C. I. Methods: Marker gene sequencing

73

Willis, A., Bunge, J., & Whitman, T. (2017). Improved detection of changes in species richness in high diversity microbial communities. Journal of the Royal Statistical Society: Series C (Applied Statistics), 66(5), 963-977. (Figure from arXiv manuscript version.)

Estimating and comparing diversity within samples

Also referred to as 𝛼-diversity (“alpha diversity”)
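Two simple within-sample diversity measures are observed richness (how many features have a nonzero count) and the Shannon index. The sketch below computes both from hypothetical count vectors. These are naive plug-in estimates that ignore unobserved taxa and sequencing depth, so they are not the estimator proposed in the cited paper.

```python
import math

def observed_richness(counts):
    """Number of features with a nonzero count."""
    return sum(1 for n in counts if n > 0)

def shannon_index(counts):
    """Shannon diversity (natural log) of a vector of counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# Hypothetical samples with the same total count (40) and the same richness.
even_sample = [10, 10, 10, 10]
uneven_sample = [37, 1, 1, 1]

print(observed_richness(even_sample), shannon_index(even_sample))
print(observed_richness(uneven_sample), shannon_index(uneven_sample))
```

Both samples contain four features, but the evenly distributed one has the higher Shannon index, which is why richness alone does not capture diversity.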

74 of 186

C. I. Methods: Marker gene sequencing

74

McDonald, D., Hyde, E., Debelius, J. W., Morton, J. T., Gonzalez, A., Ackermann, G., ... & Knight, R. (2018). American Gut: an open platform for citizen science microbiome research. mSystems, 3(3), e00031-18.

Unsupervised dimensionality reduction (e.g. PCA / PCoA)

Also referred to as 𝛽-diversity (“beta diversity”)
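
Methods such as PCoA start from a matrix of pairwise dissimilarities between samples. As a minimal sketch, the following computes the Bray-Curtis dissimilarity, which is one common choice (the slide does not say which metric was used, so this choice is an assumption). Samples are lists of counts over the same ordered set of features.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 for no shared features."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same features")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def dissimilarity_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(s, t) for t in samples] for s in samples]
```

The resulting matrix is the kind of input an ordination method would then project into two or three dimensions.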

75 of 186

C. I. Methods: Marker gene sequencing

75

Fedarko, M. W., Martino, C., Morton, J. T., González, A., Rahman, G., Marotz, C. A., ... & Knight, R. (2020). Visualizing ‘omic feature rankings and log-ratios using Qurro. NAR Genomics and Bioinformatics, 2(2), lqaa023.

Identifying differentially abundant features (or ratios of features) in groups of samples
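
As a rough illustration of a "ratio of features", the sketch below computes, for one sample, the natural log of the summed counts of one group of features over the summed counts of another. Returning None when either sum is zero is a design choice made here for illustration, since the log-ratio is undefined in that case; the function name and the dictionary layout are assumptions.

```python
import math


def log_ratio(sample_counts, numerator_features, denominator_features):
    """Log of (summed numerator counts / summed denominator counts) for one sample."""
    numerator = sum(sample_counts.get(f, 0) for f in numerator_features)
    denominator = sum(sample_counts.get(f, 0) for f in denominator_features)
    if numerator == 0 or denominator == 0:
        return None  # undefined: at least one side has no counts
    return math.log(numerator / denominator)
```

Comparing these values between groups of samples (for example with a box plot) is one way to see whether the chosen features shift in abundance relative to one another.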

77 of 186

C. I. Methods: Marker gene sequencing

  • Pros
    • Relatively cheap: much less sequencing needed to profile a community at a given depth than metagenome sequencing
    • Very well-studied, so many established pipelines have been created (mothur, QIIME / QIIME 2, …)
  • Cons
    • Marker genes are usually specific to certain types of microbes
      • 16S rRNA gene: only identifies Bacteria and Archaea
      • 18S rRNA gene: only identifies certain Eukaryotes
      • Internal transcribed spacer region: mostly identifies Fungi
    • Copy number variation: the number of copies of a marker gene differs between genomes, which can bias abundance estimates
    • PCR errors can result in “chimeric” gene sequences
    • Still disagreements on how to correct errors in raw reads
    • Limited resolution: marker genes are usually only reliable up to the genus level

77

Callahan B.J. et al. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal, 11(12):2639–2643, 2017.

Schloss P.D. Amplicon sequence variants artificially split bacterial genomes into separate clusters. bioRxiv, 2021.

Knight R. et al. Best practices for analysing microbiomes. Nature Reviews Microbiology, 16(7):410–422, 2018.

82 of 186

C. I. Methods: back to Carl Woese and rRNA genes

The original paper on PCR was published in 1986.
The first automatic sequencer (the “AB370”) was developed in 1987.
Woese and Fox’s three-domain paper was published in 1977!

“While the grad students and technicians produced fingerprints, Woese spent his time staring at the spots. Was this effort tedious in practice as well as profound in its potential results? Yes. ‘There were days,’ he wrote later, ‘when I’d walk home from work saying to myself, Woese, you have destroyed your mind again today.’”

82

For a more thorough history of Carl Woese’s life and work, see: D. Quammen. The Scientist Who Scrambled Darwin’s Tree of Life. The New York Times, 2018.

Photo credit: N. Pace. Carl Woese and the Beginnings of Metagenomics. Looking in the Right Direction: Carl Woese and the New Biology, 2015. https://www.youtube.com/watch?v=h3K50DD9kIM (timestamp: 2:56)

83 of 186

C. I. Methods: Marker gene sequencing

83

(Marker gene sequencing)

(Metagenome sequencing)

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

84 of 186

C. I. Methods: Metagenome sequencing

84

(Marker gene sequencing)

(Metagenome sequencing)

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

86 of 186

C. I. Methods: Metagenome sequencing

86

Many of the “standard analyses” for marker gene sequencing data are also applicable to metagenome sequencing data:

  • Taxonomy
  • 𝛼-diversity
  • 𝛽-diversity
  • Differential abundance

88 of 186

C. I. Methods: Metagenome sequencing

88

Metagenome sequencing enables two main additional types of analyses, compared to marker gene sequencing:

  • Functional annotation
  • Sequence assembly

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

Human Microbiome Project Consortium. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207.

92 of 186

C. I. Methods: Functional annotation, in practice

92

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

(not just marker!) genes

operons

terminators

promoters

For more information on metagenomic functional annotation, see: Solovyev, V., & Salamov, A. (2011). Automatic annotation of microbial genomes and metagenomic sequences. In R. W. Li (Ed.), Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies (pp. 61-78).

...

Usually, functional annotation involves aligning (“mapping”) sequences to a reference database with information about “function” in well-studied organisms.

(But there are fancier approaches, e.g. metabolic modelling methods.)
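
As a toy illustration of the database-lookup idea above, the sketch below assigns a query sequence the label of the reference entry with which it shares the most k-mers (substrings of length k). Shared k-mer counting is used here only as a simple stand-in for real sequence alignment, and the reference entries and labels are entirely hypothetical.

```python
def kmer_set(sequence, k):
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def annotate(query, reference, k=4, min_shared=1):
    """Return the label of the best-matching reference entry, or None.

    `reference` is a list of (label, sequence) pairs. Ties go to the
    entry that appears first in the list.
    """
    query_kmers = kmer_set(query, k)
    best_label, best_shared = None, 0
    for label, sequence in reference:
        shared = len(query_kmers & kmer_set(sequence, k))
        if shared > best_shared:
            best_label, best_shared = label, shared
    return best_label if best_shared >= min_shared else None


# Hypothetical two-entry "database", for illustration only.
toy_reference = [
    ("hypothetical function A", "ACGTACGTGGCC"),
    ("hypothetical function B", "TTTTGGGGAAAA"),
]
```

A query with no sufficiently similar reference entry gets no annotation at all, which reflects a real limitation of this approach: sequences from poorly studied organisms often cannot be assigned a function.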

94 of 186

C. I. Methods: Functional annotation, in practice??

94

“The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the [human] genome, in particular outside of the well-studied protein-coding regions.”

“A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates [...]”

97 of 186

C. I. Methods: Functional annotation, in practice

97

Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2), e1000667.

102 of 186

C. I. Methods: Metagenome sequence assembly

102

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

“Reads”: strings from Σ = {A,C,G,T}

In practice there’ll be many input molecules here...

...and, ideally, one output sequence per input molecule

103 of 186

C. I. Methods: Metagenome sequence assembly

103

Lee, M. D. (2019). Happy Belly Bioinformatics: An open-source resource dedicated to helping biologists utilize bioinformatics. Journal of Open Source Education, 4(41), 53.

106 of 186

This talk

  • Introduction: Studying microbiomes
    • Why bother?
    • Why hasn’t this research been more useful?
    • Defining a goal for this talk
  • Culture-independent (a.k.a. sequencing-based) methods
    • Marker gene sequencing
    • Metagenome sequencing
  • Metagenome assembly
    • Input (reads)
    • Outputs (contigs, an assembly graph, …)
    • Methods (de novo vs. reference-based, overlap graph vs. de Bruijn graph, …)
  • Future work: Solving the strain separation problem

106

107 of 186

Assembly (in the context of a metagenome)

107

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

“Reads”: strings from Σ = {A,C,G,T}

In practice there’ll be many input molecules here...

...and, ideally, one output sequence per input molecule

111 of 186

Assembly: inputs

Reads: strings from the alphabet Σ = {A, C, G, T}
  • Occasionally this alphabet is extended if the base at a position is ambiguous:
    e.g. W = (A or T), S = (C or G), N = (A or C or G or T), …

Things that can vary based on the sequencing technology being used:
  • Read length
  • Read error rate
  • Number of reads
  • Read structure (e.g. single vs. paired-end reads)

Three (modern) sequencing technologies we’ll focus on:
  (1) short-read, (2) long, error-prone read, (3) HiFi

111

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

For more detail on this notation, see: https://en.wikipedia.org/wiki/Nucleic_acid_notation.
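
The extended alphabet can be sketched as a lookup table from each code to the set of bases it may stand for. The table below follows the standard IUPAC nucleotide codes described at the link above; the helper names are illustrative.

```python
from math import prod

# IUPAC nucleotide codes: each symbol maps to the bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def is_compatible(read, sequence):
    """True if an unambiguous sequence is one of the sequences the read could be."""
    if len(read) != len(sequence):
        return False
    return all(base in IUPAC[code] for code, base in zip(read, sequence))


def num_possible_sequences(read):
    """How many unambiguous sequences a (possibly ambiguous) read could represent."""
    return prod(len(IUPAC[code]) for code in read)
```

For example, the read "AWN" is compatible with "ATG" but not with "ACG", because W only stands for A or T.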

114 of 186

Assembly: sequencing technologies

  • Short-read technologies
    • e.g. Illumina
    • Read lengths: typically up to 150 nt (some instruments produce reads of up to about 300 nt)
    • Error rates: less than 1%
  • Long, error-prone read technologies
    • e.g. Oxford Nanopore, Pacific Biosciences
    • Read length: over 10,000 nt
    • Error rates: 10–25%
  • Long, accurate read technologies (“HiFi”)
    • Pacific Biosciences circular consensus sequencing
    • Read length: over 10,000 nt, but generally shorter than long, error-prone reads
    • Error rates:
      • less than 1% for “point” mutations
      • somewhat higher (~5%) for insertions/deletions

114

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

This simple categorization ignores a lot of finicky details!

But this should be enough to understand how these technologies can complicate or simplify assembly.
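
To get a feel for what these error rates mean per read, here is a back-of-the-envelope sketch. It assumes every base is wrong independently with the same probability, which is a simplification, and it plugs in the rough lengths and rates from the categorization above rather than measured values.

```python
def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases in one read."""
    return read_length * error_rate


def prob_error_free(read_length, error_rate):
    """Probability that a read contains no errors, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


# (label, read length in nt, per-base error rate), taken from the rough figures above.
TECHNOLOGIES = [
    ("short reads, 1% errors", 150, 0.01),
    ("long error-prone reads, 10% errors", 10_000, 0.10),
    ("long error-prone reads, 25% errors", 10_000, 0.25),
]

for label, length, rate in TECHNOLOGIES:
    print(f"{label}: ~{expected_errors(length, rate):.1f} expected errors per read")
```

Under these assumptions a short read carries about one or two errors at most, while a long, error-prone read carries on the order of a thousand or more, so essentially no such read is error-free.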

117 of 186

Assembly: impacts of technologies

  • Read lengths
    • Longer reads can span repeats (“identical, or nearly identical, stretches of DNA”).
    • Repeats that are longer than reads are especially challenging.
    • Accounting for repeats is the most important challenge in (metagenome) assembly!
    • So longer reads are generally better than short ones.
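
Why repeats longer than the reads are so troublesome can be shown with a tiny constructed example. The two toy "genomes" below contain the same repeat three times but have the unique bases between the copies in a different order. Assuming idealized, error-free reads starting at every position, reads that cannot span a whole repeat plus a base on each side are exactly the same for both genomes, so no assembler could tell them apart. All sequences here are made up for illustration.

```python
def all_reads(genome, read_length):
    """Sorted list of every substring of the given length (idealized, error-free reads)."""
    return sorted(
        genome[i:i + read_length] for i in range(len(genome) - read_length + 1)
    )


REPEAT = "ACGGCA"  # a 6 nt repeat that occurs three times in each genome

# Same pieces, but the unique bases G and C between the repeats are swapped.
genome_1 = "T" + REPEAT + "G" + REPEAT + "C" + REPEAT + "A"
genome_2 = "T" + REPEAT + "C" + REPEAT + "G" + REPEAT + "A"

# Reads of length 7 cannot span a repeat together with both flanking bases:
# the two genomes yield identical reads.
print(all_reads(genome_1, 7) == all_reads(genome_2, 7))

# Reads of length 8 can span a repeat and both flanking bases: now they differ.
print(all_reads(genome_1, 8) == all_reads(genome_2, 8))
```

The first comparison holds and the second does not, which is the sense in which longer reads "resolve" repeats.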

117

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634.

Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf.

Technically this figure changes k-mer lengths, not read lengths (we’ll define k-mers soon); however, the idea is the same.

120 of 186

Assembly: impacts of technologies

  • Read lengths
    • Longer reads can span repeats (“identical, or nearly identical, stretches of DNA”).
    • Repeats that are longer than reads are especially challenging.
    • Accounting for repeats is the most important challenge in (metagenome) assembly!
    • So longer reads are generally better than short ones.
  • Error rates
    • Lower error rates simplify variant calling, in which we attempt to distinguish real variations in the data from sequencing errors.
    • This question comes up especially often in metagenome assembly: is a variation at some position the result of an error, or does it indicate a rare strain with this variation in its genome?
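
One naive way to frame the "error or rare strain?" question is to ask how surprising the observation would be if only errors were at play. The sketch below computes the probability of seeing at least k mismatching reads out of n reads covering a position, under a binomial model with a fixed per-base error rate. It assumes errors are independent and lumps all mismatches together, so it is a simplification and not how any particular variant caller works.

```python
from math import comb


def prob_at_least(k, n, error_rate):
    """P(at least k of n reads mismatch at a position) if only errors cause mismatches."""
    return sum(
        comb(n, i) * error_rate**i * (1 - error_rate) ** (n - i)
        for i in range(k, n + 1)
    )


# Example: 5 of 100 reads covering a position disagree with the consensus.
for rate in (0.001, 0.10):
    p = prob_at_least(5, 100, rate)
    print(f"error rate {rate}: P(>= 5 mismatches by error alone) = {p:.3g}")
```

With accurate reads, five mismatches are very unlikely to be errors alone, which hints at a real variant; with error-prone reads the same observation is entirely unremarkable, which is why low error rates make rare strains easier to detect.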

120

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634.

Koren, S., Treangen, T. J., & Pop, M. (2011). Bambus 2: scaffolding metagenomes. Bioinformatics, 27(21), 2964-2971.

123 of 186

Assembly: outputs

123

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

“Reads”: strings from Σ = {A,C,G,T}

“Contigs”: (hopefully longer) strings from Σ = {A,C,G,T}

In practice there’ll be many input molecules here...

...and, ideally, one output sequence per input molecule

127 of 186

Assembly: outputs (contigs)

Ideally: one contig per input molecule of DNA (e.g. each chromosome, plasmid, …)
In practice: usually more contigs than that

Some projects attempt to group contigs together into bins that likely originate from the same “genome”.

Bins of contigs, or especially high-quality individual contigs, are referred to as metagenome-assembled genomes (MAGs).

127

Commins, J., Toft, C., & Fares, M. A. (2009). Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biological Procedures Online, 11(1), 52-78.

Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., ... & Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8), 725-731.

“We present a metagenomic HiFi assembly of a complex microbial community from sheep fecal material that resulted in 428 high-quality MAGs from a single sample, the highest resolution achieved with metagenomic deconvolution to date.”

Bickhart, D. M., Kolmogorov, M., Tseng, E., Portik, D., Korobeynikov, A., Tolstoganov, I., ... & Smith, T. P. (2021). Generation of lineage-resolved complete metagenome-assembled genomes by precision phasing. bioRxiv.

134 of 186

Assembly: outputs (assembly graph)

Most assembly algorithms model the problem as some sort of graph traversal.

Assemblers usually output assembly graphs, which (generally) show overlaps between contigs.

Ideally: one connected component per input molecule of DNA
In practice: the graph is usually tangled, fragmented, …

Assembly graphs can be useful when “finishing” assemblies, or looking at subtle variations.

134

Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352.

Ghurye, J., Treangen, T., Fedarko, M., Hervey, W. J., & Pop, M. (2019). MetaCarvel: linking assembly graph motifs to biological variants. Genome Biology, 20(1), 1-14.
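
The "one connected component per input molecule" ideal can be checked with a few lines of code. This is a minimal sketch, assuming the assembly graph has already been reduced to a list of contig names and a list of overlap links between them; the names `c1`, `c2`, ... and the function `connected_components` are hypothetical, not the output format of any particular assembler.

```python
from collections import defaultdict


def connected_components(contigs, links):
    """Group contigs into connected components, ignoring edge direction.

    contigs: iterable of contig names (hypothetical, e.g. "c1").
    links:   iterable of (contig_a, contig_b) overlap pairs.
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for contig in contigs:
        if contig in seen:
            continue
        # Depth-first search from every contig we have not visited yet.
        seen.add(contig)
        stack = [contig]
        component = set()
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Toy example: two linked groups of contigs plus one isolated contig.
contigs = ["c1", "c2", "c3", "c4", "c5", "c6"]
links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
for component in connected_components(contigs, links):
    print(sorted(component))
```

In a fragmented graph, a single genome can be split across several components; in a tangled graph, several genomes can share one component, so the component count is only a rough guide.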

136 of 186

Assembly: methods

136

“A good genome assembler is like a good sausage: you would rather not know what is inside.”

Apocryphal quote attributed to Sante Gnerre: http://rayan.chikhi.name/pdf/2019-july-19-cgsi.pdf

137 of 186

Assembly: methods (de novo vs. reference-based)

de novo assembly: Use only the read data available

Reference-based assembly: Also use available reference sequence(s)

137

“[...] reconstruction in its pure form, without consultation to previously resolved sequence including from genomes, transcripts, and proteins.”

“For some applications, sufficient information can be extracted from the mapping of reads to a reference sequence, such as a finished genome from a related individual.”

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

138 of 186

Assembly: methods (de novo vs. reference-based)

de novo assembly: Use only the read data available
  Far more commonly used when working with metagenome sequencing data.

Reference-based assembly: Also use available reference sequence(s)
  Some reference-based assemblers have been developed for metagenome sequencing data, but they have not (yet) seen widespread use in the field.

138

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

Cepeda, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. (2017). MetaCompass: reference-guided assembly of metagenomes. bioRxiv, 212506.

142 of 186

Assembly: methods (overlap vs. de Bruijn graphs)

Overlap graph (directed graph; used by overlap-layout-consensus, or OLC, assemblers)
  Nodes: input reads
  Edges: an edge is created from n1 to n2 if read n1 overlaps with read n2
  Goal: Find Hamiltonian paths (or cycles) in this graph. NP-Complete!

de Bruijn graph (DBG; also a directed graph; takes a parameter k)
  Nodes: unique (k - 1)-mers (strings of length k - 1) in the reads
  Edges: an edge is created from n1 to n2 if there is a k-mer whose prefix is n1 and whose suffix is n2
  Goal: Find Eulerian paths (or cycles) in this graph. Polynomial time!

142

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

Compeau, P., & Pevzner, P. (2021). Bioinformatics Algorithms: an active learning approach. https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-3/.

These descriptions are very simplistic: in practice, the graph structures used have many optimizations, error correction methods, etc. applied. See the Miller et al. 2010 paper referenced above for more details.

Idury, R. M., & Waterman, M. S. (1995). A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2), 291-306.
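
The two graph definitions above can be sketched in a few lines. This is a toy illustration under strong assumptions: error-free reads, a single strand (no reverse complements), and exact suffix/prefix overlaps. The function names, the `min_overlap` parameter, and the example reads are all hypothetical; real assemblers do far more than this. The Eulerian path is found with Hierholzer's algorithm, a standard polynomial-time method, not necessarily what any listed assembler uses.

```python
from collections import defaultdict
from itertools import permutations


def overlap_graph(reads, min_overlap):
    """Edge n1 -> n2 if a suffix of n1 (>= min_overlap bases) equals a prefix of n2."""
    edges = defaultdict(list)
    for n1, n2 in permutations(reads, 2):
        longest = min(len(n1), len(n2)) - 1
        for length in range(longest, min_overlap - 1, -1):
            if n1[-length:] == n2[:length]:
                edges[n1].append(n2)
                break
    return dict(edges)


def de_bruijn_graph(reads, k):
    """Nodes are (k - 1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    edges = defaultdict(list)
    for kmer in sorted(kmers):
        edges[kmer[:-1]].append(kmer[1:])
    return dict(edges)


def eulerian_path(edges):
    """Hierholzer's algorithm. Assumes the graph actually has an Eulerian path."""
    remaining = {node: list(targets) for node, targets in edges.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, targets in edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(remaining)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k - 1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGC", "GGCGT", "CGTGA"]  # toy reads from the sequence ATGGCGTGA
print(overlap_graph(reads, min_overlap=3))
print(spell(eulerian_path(de_bruijn_graph(reads, k=3))))  # ATGGCGTGA
```

Note that the de Bruijn graph here uses each distinct k-mer once, so it discards k-mer multiplicities; that is one of the many simplifications mentioned above.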

143 of 186

Assembly: methods (overlap vs. de Bruijn graphs)

143

OLC

  • Newbler
  • Celera
  • ARACHNE
  • Edena
  • SHORTY
  • HINGE
  • BAUM
  • Canu
  • hifiasm
  • ...

DBG

  • EULER
  • AllPaths
  • SOAP
  • Velvet
  • ABySS
  • E + V-SC
  • SPAdes
  • ABruijn
  • Flye
  • ...

Neither of these lists is comprehensive! Many assemblers (of both the OLC and DBG varieties) are still actively being developed and used today.
Also, most assemblers implement their own “twist” on how they use these graph structures, so cleanly categorizing assemblers in this way ignores many details.

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

Dida, F., & Yi, G. (2021). Empirical evaluation of methods for de novo genome assembly. PeerJ Computer Science, 7, e636.

Wang, A., Wang, Z., Li, Z., & Li, L. M. (2018). BAUM: improving genome assembly by adaptive unique mapping and local overlap-layout-consensus approach. Bioinformatics, 34(12), 2019-2028.

149 of 186

Assembly: methods (single-genome vs. metagenome)

  • Problem: Uneven coverage
    • Different microbes are present at different abundances in a microbiome.
    • The coverage, or number of reads “supporting” a position in a genome, varies a lot!
    • Worst case scenario: some genome(s) are only partially covered in the reads.
  • Problem: Intergenomic repeats
    • Stretches of DNA shared by many organisms are repeats!
    • For example, marker genes (e.g. rRNA genes).
    • Some approaches attempt to get around this by exploiting uneven coverages across genomes.
  • Problem: Strain mixtures
    • Similar to intergenomic repeats: a microbiome can contain many strains of a microbe.
    • How can we distinguish between “real” variations and sequencing errors?
    • Separating these strains’ genomes is very challenging, especially with short reads.

149

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
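
To make "coverage" concrete: a minimal sketch that counts how many reads support each position of a genome, assuming reads have already been aligned and each alignment is given as a hypothetical (start, end) pair with 0-based, end-exclusive coordinates. The function name and input format are illustrative only.

```python
def per_position_coverage(genome_length, alignments):
    """Return a list where entry i is the number of reads covering position i.

    alignments: iterable of (start, end) pairs, 0-based, end exclusive.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    changes = [0] * (genome_length + 1)
    for start, end in alignments:
        changes[start] += 1
        changes[end] -= 1

    coverage = []
    running = 0
    for change in changes[:genome_length]:
        running += change
        coverage.append(running)
    return coverage


coverage = per_position_coverage(10, [(0, 4), (2, 6), (3, 8)])
print(coverage)  # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
print([i for i, depth in enumerate(coverage) if depth == 0])  # [8, 9]
```

The positions with zero coverage at the end of the toy genome correspond to the worst case above: part of the genome is simply absent from the reads.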

152 of 186

Assembly: methods (single-genome vs. metagenome)

  • Problem: Strain mixtures
    • Similar to intergenomic repeats: a microbiome can contain many strains of a microbe.
    • How can we distinguish between “real” variations and sequencing errors?
    • Separating these strains’ genomes is very challenging, especially with short reads.

152

Different assemblers will do different things to deal with these sorts of subtle variations: even “smooth” contigs can conceal a lot of variation. Sometimes this is desirable—sometimes not!

Kolmogorov, M., Bickhart, D. M., Behsaz, B., Gurevich, A., Rayko, M., Shin, S. B., ... & Pevzner, P. A. (2020). metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11), 1103-1110.

“An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing [...] The component represents a bacterial genome of the Clostridia class [...] There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp out of 2.4 Mbp long genome.”

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824-834.
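
One naive way to frame the "real variation vs. sequencing error" question is a simple count-and-frequency threshold on the bases observed at each position. This is only a sketch of that framing, not a method from any of the cited papers: the `pileup` input (position mapped to a string of observed bases), the function name, and the default thresholds are all assumptions chosen for illustration.

```python
from collections import Counter


def candidate_variants(pileup, min_count=2, min_fraction=0.05):
    """Flag positions where a non-majority base is seen often enough to be suspicious.

    pileup: dict mapping position -> string of bases observed in reads there.
    Returns a dict mapping position -> (majority base, sorted list of kept minor bases).
    """
    calls = {}
    for position, bases in pileup.items():
        if not bases:
            continue  # no reads cover this position
        counts = Counter(bases).most_common()
        majority_base = counts[0][0]
        minor_bases = [
            base
            for base, count in counts[1:]
            if count >= min_count and count / len(bases) >= min_fraction
        ]
        if minor_bases:
            calls[position] = (majority_base, sorted(minor_bases))
    return calls


pileup = {
    10: "AAAAAAAAAA",  # no variation
    42: "AAAAAAAAAT",  # a single T: treated as a likely sequencing error
    77: "CCCCCCGGGG",  # G in 4 of 10 reads: kept as a candidate variant
}
print(candidate_variants(pileup))  # {77: ('C', ['G'])}
```

Fixed thresholds like these will miss genuinely rare strains at low coverage, which is part of why this problem is hard.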

156 of 186

Future work

Strain separation problem (Vicedomini et al., 2021)
“The reconstruction of partial or complete DNA sequences corresponding to strains, at the base level.”

(Local) Metagenome individual haplotyping problem (Nicholls et al., 2021)
“The ideal output [...] is the collection of whole-genome sequences representing all the individual organisms in a microbial community.”

Haplotype assembly problem (Lancia et al., 2001)
“Given a set of fragments obtained by DNA sequencing from the two copies of a chromosome, reconstruct two haplotypes that would be compatible with all the fragments observed.”

156

Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.

Nicholls, S. M., Aubrey, W., De Grave, K., Schietgat, L., Creevey, C. J., & Clare, A. (2021). On the complexity of haplotyping a microbial community. Bioinformatics, 37(10), 1360-1366.

(Vicedomini et al.’s definition of “strain” matches the one I defined ~45 minutes ago, i.e. a completely unique genome.)

Lancia, G., Bafna, V., Istrail, S., Lippert, R., & Schwartz, R. (2001, August). SNPs problems, complexity, and algorithms. In European Symposium on Algorithms (pp. 182-193). Springer, Berlin, Heidelberg.

158 of 186

Future work: “haplotype”?

Humans (or other diploid organisms) usually have two copies of each chromosome.

Ordinary assemblers often “smooth out” differences between the chromosomes, creating contigs that are chimeras of both chromosomes’ sequences.

158

Lancia, G., Bafna, V., Istrail, S., Lippert, R., & Schwartz, R. (2001, August). SNPs problems, complexity, and algorithms. In European Symposium on Algorithms (pp. 182-193). Springer, Berlin, Heidelberg.

https://www.genome.gov/genetics-glossary/Karyotype

In haplotype phasing, we attempt to fix this. Usually, this involves looking for variations which occur on the same reads.

Even in the case of a single human genome, this problem is NP-Hard. It gets worse for metagenomes!
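
The idea of "looking for variations which occur on the same reads" can be sketched for the simplest case. This is a greedy heuristic under strong assumptions, not an exact solver for the NP-hard problem: exactly two haplotypes, biallelic variants coded as 0 or 1, and each read represented as a hypothetical dict mapping variant position to the allele it carries. Ties, including variant pairs that no read links, are arbitrarily resolved as "same haplotype".

```python
def phase_adjacent_variants(reads, positions):
    """Greedily phase variants by chaining neighboring pairs seen on the same reads.

    reads:     list of dicts, variant position -> allele (0 or 1) seen on that read.
    positions: sorted list of variant positions to phase.
    Returns two complementary haplotypes as lists of alleles.
    """
    haplotype = [0]  # the first allele is fixed arbitrarily
    for previous, current in zip(positions, positions[1:]):
        same = different = 0
        for read in reads:
            if previous in read and current in read:
                if read[previous] == read[current]:
                    same += 1
                else:
                    different += 1
        if same >= different:
            haplotype.append(haplotype[-1])
        else:
            haplotype.append(1 - haplotype[-1])
    return haplotype, [1 - allele for allele in haplotype]


reads = [
    {100: 0, 250: 0},
    {100: 1, 250: 1},
    {250: 0, 400: 1},
    {250: 1, 400: 0},
    {100: 0, 250: 0, 400: 1},
]
print(phase_adjacent_variants(reads, [100, 250, 400]))
# ([0, 0, 1], [1, 1, 0])
```

With short reads, many neighboring variants are never covered by the same read, so the chain breaks; with an unknown number of strains, the two-haplotype assumption itself no longer holds.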

160 of 186

Future work: “metagenomic haplotyping”

160

A solution to [metagenomic haplotyping] is confounded by five problems:

(i) DNA from every genome needs to be extracted and sequenced to a depth sufficient for recovery,
    So increase sequencing depth!

(ii) genomes share homologous regions that require disambiguation,
    So increase read lengths!

(iii) reads may be of an insufficient length to disambiguate repeats or resolve bridges between variants,
    So increase read lengths (again)!

(iv) sequencing error can be indistinguishable from rare haplotypes and
    So decrease error rates!

(v) the presence of an unknown number of haplotypes complicates the already computationally difficult (NP-hard) (Cilibrasi et al., 2005) problem of haplotyping.
    So use better algorithms?

Nicholls, S. M., Aubrey, W., De Grave, K., Schietgat, L., Creevey, C. J., & Clare, A. (2021). On the complexity of haplotyping a microbial community. Bioinformatics, 37(10), 1360-1366.

162 of 186

Future work: next steps forward

  • Improving the detection (and phasing!) of rare mutations in HiFi metagenome sequencing data
    • with Misha Kolmogorov (UCSC) and Pavel Pevzner (UCSD)
  • Improved methods for visualizing metagenome assembly graphs
    • with Jay Ghurye (Verily Life Sciences), Todd Treangen (Rice), Misha Kolmogorov (UCSC), Pavel Pevzner (UCSD), Jacquelyn Michaelis + Harihara Muralidharan + Mihai Pop (Maryland)
  • Improved classification of prokaryotic/eukaryotic contigs
    • with Misha Kolmogorov (UCSC) and Pavel Pevzner (UCSD)

162

163 of 186

Some things I’ve been up to at UCSD

  • Contributing to various analyses using marker gene / metagenome sequencing
    • “Co-assemblies” of metagenome sequencing data: Sanders JG, Nurk S, Salido RA, Minich J, Xu ZZ, Martino C, Fedarko M, Arthur TD, Chen F, Boland BS, Humphrey GC, Brennan C, Sanders K, Gaffney J, Jepsen K, Khosroheidari M, Green C, Liyange M, Dang JW, Phelan VV, Quinn RA, Bankevich A, Chang JT, Rana TM, Conrad DJ, Sandborn WJ, Smarr L, Dorrestein PC, Pevzner PA, and Knight R (2019). “Optimizing sequencing protocols for leaderboard metagenomics by combining long and short reads.” Genome Biology, 20(1):226.
    • Analyzing marker gene sequencing data: Huey SL, Jiang L, Fedarko MW, McDonald D, Martino C, Ali F, Russell DG, Udipi SA, Thorat A, Thakker V, Ghugre P, Potdar RD, Chopra H, Rajagopalan K, Haas JD, Finkelstein JL, Knight R, and Mehta S (2020). “Nutrition and the Gut Microbiota in 10- to 18-Month-Old Children Living in Urban Slums of Mumbai, India.” mSphere, 5(5):e00731-20.
  • Developing visualization tools for microbiome sequencing data
    • Differential abundance: Fedarko MW, Martino C, Morton JT, González A, Rahman G, Marotz CA, Minich JJ, Allen EA, and Knight R (2020). “Visualizing ‘omic feature rankings and log-ratios using Qurro.” NAR Genomics and Bioinformatics, 2(2):lqaa023.
    • Phylogenetic trees (and associated data): Cantrell K*, Fedarko MW*, Rahman G, McDonald D, Yang Y, Zaw T, Gonzalez A, Janssen S, Estaki M, Haiminen N, Beck KL, Zhu Q, Sayyari E, Morton JT, Armstrong G, Tripathi A, Gauglitz JM, Marotz C, Matteson NL, Martino C, Sanders JG, Carrieri AP, Song SJ, Swafford AD, Dorrestein PC, Andersen KG, Parida L, Kim H-C, Vázquez-Baeza Y, and Knight R (2021). “EMPress Enables Tree-Guided, Interactive, and Exploratory Analyses of Multi-omic Data Sets.” mSystems, 6(2):e01216-20. (* = contributed equally)

163

164 of 186

Thank you!

Research Exam Committee and Logistics

Gary Cottrell, Vineet Bafna, Melissa Gymrek, Julie Conner

Advice/support over the years

Pavel Pevzner, Mikhail Kolmogorov, Andrey Bzikadze, Vikram Sirupurapu, Rob Knight, Yoshiki Vázquez-Baeza, Lisa Marotz, Cameron Martino, Jamie Morton, Antonio González, Gibraan Rahman, Jake Minich, Eric Allen, Dan Hakim,
Kalen Cantrell, Daniel McDonald, Yimeng Yang, Thant Zaw, Stefan Janssen, Mehrbod Estaki, Niina Haiminen, Kristen Beck, Qiyun Zhu, Erfan Sayyari, George Armstrong, Priya Tripathi, Julia Gauglitz, Nate Matteson,
Jon Sanders, Anna Paola Carrieri, Se Jin Song, Austin Swafford, Pieter Dorrestein, Kristian Andersen, Laxmi Parida, Ho-Cheol Kim, Larry Smarr, Gail Ackermann, Jeff DeReus, Michiko Souza, Justin Shaffer, Pedro Belda-Ferre,
Greg Humphrey, Celeste Allaband, Rodolfo Salido, Greg Poore, Victor Cantu, Jeffrey Chiu, Franck Lejzerowicz, Shi Huang, Sarah Adams, Tomasz Kosciolek, Zech Xu, Charles Cowart, Farhana Ali, Robert Mills,
Alison Vrbanc, Bryn Taylor, Jerry Kennedy, Yna Villanueva, Justine Debelius, Evan Bolyen, Matthew Dillon, Jay Ghurye, Jacquelyn Michaelis, Harihara Muralidharan, Nidhi Shah, Brian Brubach, Todd Treangen, Mihai Pop

164

166 of 186

Misc. Acknowledgements

Emojis: Google emojis from emojipedia.org: https://emojipedia.org/pile-of-poo/ (removed the emoji eyes manually in GIMP), https://emojipedia.org/non-potable-water/, https://emojipedia.org/potable-water/, https://emojipedia.org/mobile-phone/

Citation of Blaser 1992 in the context of H. pylori based on the Strainberry paper’s introduction: Vicedomini, R., Quince, C., Darling, A. E., & Chikhi, R. (2021). Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nature Communications, 12(1), 1-14.

Taxonomic ranks figure modified from https://en.wikipedia.org/wiki/Taxonomic_rank#/media/File:Taxonomic_Rank_Graph.svg, ℅ Annina Breen.

E. coli phylogeny from Dunne, K. A., Chaudhuri, R. R., Rossiter, A. E., Beriotto, I., Browning, D. F., Squire, D., ... & Henderson, I. R. (2017). Sequencing a piece of history: complete genome sequence of the original Escherichia coli strain. Microbial Genomics, 3(3).

Stock photos of someone in a suit kicking a can down the road all from Shutterstock.com ℅ Jim Barber (all images marked as royalty-free). Why were there five separate images of this? That’s a great question. I wish my job was literally just putting on a suit and kicking cans down a road. That would be so sick. I bet it pays better than grad school. Wait, why are you reading this? Seriously, there’s nothing important here. It’s just links. And this text.

https://www.shutterstock.com/search/kick+the+can, https://www.shutterstock.com/image-photo/politicians-shoe-stops-dented-can-rolling-85554979, https://www.shutterstock.com/image-photo/close-politicians-shoe-kicking-dented-shiny-85554970, https://www.shutterstock.com/image-photo/close-shiny-dentedl-can-sitting-on-85554964, https://www.shutterstock.com/image-photo/close-politicians-shoe-kicking-dented-shiny-85554973, https://www.shutterstock.com/image-photo/politician-kicks-shiny-dented-can-down-85554967

Harold’s face: https://www.independent.co.uk/arts-entertainment/interviews/hide-pain-harold-meme-gif-interview-model-real-name-arato-andras-thumbs-stock-photo-a7835076.html

Quote about functional annotation and E. coli distance is from C. Frioux, D. Singh, T. Korcsmaros, and F. Hildebrand. From bag-of-genes to bag-of-genomes: metabolic modelling of communities in the era of metagenome-assembled genomes. Computational and Structural Biotechnology Journal, 18:1722–1734, 2020.

Jigsaw puzzle photo: From VisitIndiana.com. Also, I acknowledge that I used this figure (and the Commins figure for assembly) in a talk I gave last December.

Original PCR paper: Mullis et al. 1986. Specific Enzymatic Amplification of DNA In Vitro: The Polymerase Chain Reaction.

Source about the AB370 being the first sequencer: https://www.hindawi.com/journals/bmri/2012/251364/

Use of the Bambus 2 “variant” figure in the context of variant calling based on Serge’s PhD defense from 9 years ago: Koren, S. (2012). Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms. http://www.sergek.umiacs.io/presentations/ThesisTalk_final.pdf. I already cited this when using the “read lengths help” figure a few slides beforehand, but I figure I might as well make that clear here. Serge’s a cool dude.

166

167 of 186

Funding

Fall 2018–Winter 2019
Standard first-year CSE department fellowship

Spring 2019–Summer 2019
Joint University Microelectronics Program (JUMP)’s Center for Research on Intelligent Storage and Processing-in-memory (CRISP)

Fall 2019–Fall 2020
IBM Research AI, via the AI Horizons Network and the UCSD Center for Microbiome Innovation (CMI)

Winter 2021
Teaching assistantship (CSE 282)

Spring 2021–
Pevzner Lab grants

167

168 of 186

Bonus: So how many microbes are there?

Turnbaugh 2007: “The vast majority of the 10–100 trillion microbes in the human gastrointestinal tract live in the colon.”

Locey and Lennon 2016: “[...] we predict that Earth is home to upward of 1 trillion microbial species.”

Willis 2016: The method used by L&L 2016 isn’t statistically admissible!

∴ Maybe the only answer right now that won’t anger any statistician or biologist: “a lot, I guess”

168

169 of 186

Introduction: Obesity and the gut microbiome

169

“We performed microbiota transplantation experiments to test directly the notion that the ob/ob microbiota has an increased capacity to harvest energy from the diet and to determine whether increased adiposity is a transmissible trait. Adult germ-free C57BL/6J mice were colonized (by gavage) with a microbiota harvested from the caecum of obese (ob/ob) or lean (+/+) donors (1 donor and 4–5 germ-free recipients per treatment group per experiment; two independent experiments). 16S-rRNA-gene-sequence-based surveys confirmed that the ob/ob donor microbiota had a greater relative abundance of Firmicutes compared with the lean donor microbiota (Supplementary Fig. 4 and Supplementary Table 7). Furthermore, the ob/ob recipient microbiota had a significantly higher relative abundance of Firmicutes compared with the lean recipient microbiota (P < 0.05, two-tailed Student’s t-test). UniFrac analysis of 16S rRNA gene sequences obtained from the recipients’ caecal microbiotas revealed that they cluster according to the input donor community (Supplementary Fig. 4): that is, the initial colonizing community structure did not exhibit marked changes by the end of the two-week experiment. There was no statistically significant difference in (1) chow consumption over the 14-day period (55.4 ± 2.5 g (ob/ob) versus 54.0 ± 1.2 g (+/+); caloric density of chow, 3.7 kcal g⁻¹), (2) initial body fat (2.7 ± 0.2 g for both groups as measured by dual-energy X-ray absorptiometry), or (3) initial weight between the recipients of lean and obese microbiotas. Strikingly, mice colonized with an ob/ob microbiota exhibited a significantly greater percentage increase in body fat over two weeks than mice colonized with a +/+ microbiota (Fig. 3c; 47 ± 8.3 versus 27 ± 3.6 percentage increase or 1.3 ± 0.2 versus 0.86 ± 0.1 g fat (dual-energy X-ray absorptiometry): at 9.3 kcal g⁻¹ fat, this corresponds to a difference of 4 kcal or 2% of total calories consumed).”
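
The closing arithmetic in the quote ("a difference of 4 kcal or 2% of total calories consumed") can be checked with a few lines of Python. This is only a rough sketch using the mean values quoted above. The variable names are made up for illustration, and averaging the two groups' chow intake to estimate total calories is an assumption, not something the quote specifies.

```python
# Mean values taken from the quoted passage (uncertainties ignored).
fat_gain_obese_g = 1.3    # fat gained by recipients of the obese-donor microbiota
fat_gain_lean_g = 0.86    # fat gained by recipients of the lean-donor microbiota
kcal_per_g_fat = 9.3      # energy density of fat used in the quote
chow_obese_g = 55.4       # chow eaten over 14 days, obese-microbiota recipients
chow_lean_g = 54.0        # chow eaten over 14 days, lean-microbiota recipients
kcal_per_g_chow = 3.7     # caloric density of chow

# Extra energy stored as fat by the obese-microbiota recipients.
extra_fat_kcal = (fat_gain_obese_g - fat_gain_lean_g) * kcal_per_g_fat

# Assumption: use the average chow intake of both groups as "total calories consumed".
total_kcal = (chow_obese_g + chow_lean_g) / 2 * kcal_per_g_chow

fraction = extra_fat_kcal / total_kcal
print(f"extra fat energy: {extra_fat_kcal:.1f} kcal")
print(f"share of calories consumed: {fraction:.1%}")
```

Under these assumptions the difference comes out at roughly 4 kcal against about 200 kcal of chow, in line with the rounded figures in the quote.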

170 of 186

Bonus: Culture-Independent methods and dark matter

“It is estimated that >99% of microorganisms observable in nature typically are not cultivated by using standard techniques.”

Some folks have used the term “dark matter” to refer to these so-far-uncultured microbes, but it’s not a great analogy...

170

Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of bacteriology, 180(18), 4765-4774.

“Tempting as it may be, perhaps we should calm down on the use of the term dark matter in biology. Biology is confusing, complicated, and mysterious enough without it.”

171 of 186

Bonus: About functional annotation...

“It says something of our ability to annotate genomes, that the proportion of a genome functionally annotated is often correlated to the genetic distance to the very well researched Escherichia coli (anecdotal observation).”

C. Frioux, D. Singh, T. Korcsmaros, and F. Hildebrand. From bag-of-genes to bag-of-genomes: metabolic modelling of communities in the era of metagenome-assembled genomes. Computational and Structural Biotechnology Journal, 18:1722–1734, 2020.

171

172 of 186

Bonus: Assembly (single genome)

172

173 of 186

Bonus: Assembly (metagenome)

173

...

...

174 of 186

Bonus: “metagenomic haplotyping”??????

174

A solution to [metagenomic haplotyping] is confounded by five problems:

(i) DNA from every genome needs to be extracted and sequenced to a depth sufficient for recovery,
So increase sequencing depth!

(ii) genomes share homologous regions that require disambiguation,
So increase read lengths!

(iii) reads may be of an insufficient length to disambiguate repeats or resolve bridges between variants,
So increase read lengths (again)!

(iv) sequencing error can be indistinguishable from rare haplotypes and
So decrease error rates!

(v) the presence of an unknown number of haplotypes complicates the already computationally difficult (NP-hard) (Cilibrasi et al., 2005) problem of haplotyping.
So use better algorithms?

175 of 186

Microbiomes

175

176 of 186

This talk

  • Introduction: Studying microbiomes
    • Why bother?
    • Why hasn’t this research been more useful?
    • Defining a goal for this talk
  • Culture-independent (a.k.a. sequencing-based) methods
    • Marker gene sequencing
    • Metagenome sequencing
  • Metagenome assembly
    • Input (reads)
    • Outputs (contigs, an assembly graph, …)
    • Methods (de novo vs. reference-based, overlap graph vs. de Bruijn graph, …)
  • Future work: Solving the strain separation problem

176

Specificity

177 of 186

This talk

  • Introduction: Studying microbiomes
    • Why bother?
    • Why hasn’t this research been more useful?
    • Defining a goal for this talk
  • Culture-independent (a.k.a. sequencing-based) methods
    • Marker gene sequencing
    • Metagenome sequencing
  • Metagenome assembly
    • Input (reads)
    • Outputs (contigs, an assembly graph, …)
    • Methods (de novo vs. reference-based, overlap graph vs. de Bruijn graph, …)
  • Future work: Solving the strain separation problem

177

How many people care about this

178 of 186

A brief history of microbiome research

178

179 of 186

1 of 4: Marcus Terentius Varro, 1st century B.C.E.

“Precautions must also be taken in the neighbourhood of swamps, both for the reasons given, and because there are bred certain minute creatures which cannot be seen by the eyes, which float in the air and enter the body through the mouth and nose and there cause serious diseases.” “What can I do,” asked Fundanius, “to prevent disease if I should inherit a farm of that kind?” “Even I can answer that question,” replied Agrius; “sell it for the highest cash price; or if you can’t sell it, abandon it.”

179

M. T. Varro and M. P. Cato. On Agriculture, page 209. Harvard University Press, Cambridge, MA, 1934. Translated by W. D. Hooper and H. B. Ash.

180 of 186

2 of 4: Hong Ge, 4th century C.E.

“During the Eastern Jin dynasty [...] Zhou Hou Bei Ji Fang, a well-known monograph of traditional Chinese medicine (TCM) written by Hong Ge, recorded a case of treating patients with food poisoning or severe diarrhea by ingesting human fecal suspension (known as yellow soup or Huang-Long decoction).”

180

H. Du, T.-t. Kuang, S. Qiu, T. Xu, C.-L. G. Huan, G. Fan, and Y. Zhang. Fecal medicines used in traditional medical system of China: a systematic review of their names, original species, traditional uses, and modern investigations. Chinese Medicine, 14(1):1–16, 2019.

181 of 186

3 of 4: Antonie van Leeuwenhoek, 17th century C.E.

“[...] among these streaks there were besides very many little animalcules ... And the motion of most of these animalcules in the water was so swift, and so various upwards, downwards and round about that ’twas wonderful to see: and I judged that some of these little creatures were above a thousand times smaller than the smallest ones I have ever yet seen upon the rind of cheese [...]”

181

Lane, N. (2015). The unseen world: reflections on Leeuwenhoek (1677) “Concerning little animals”. Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1666), 20140344.

Schloss, P. D. (2018). Identifying and overcoming threats to reproducibility, replicability, robustness, and generalizability in microbiome research. mBio, 9(3), e00525-18.

182 of 186

4 of 4: Ernest Rutherford (?), 20th century C.E.

All science is either physics or stamp collecting.

182

This quote is generally attributed to Ernest Rutherford, but its first written occurrence was in a book published two years after he died. So your guess is as good as mine; see https://quoteinvestigator.com/2015/05/08/stamp.

184 of 186

Thank you?

184
