1 of 156

FoCS Breadth:

Overview of Bioinformatics

Niema Moshiri

UC San Diego SPIS 2022

2 of 156

What is Bioinformatics?

5 of 156

What is Bioinformatics?

  • “Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data” —Wikipedia
  • “The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics” —Webster Dictionary
  • “Bioinformatics is conceptualizing biology in terms of macromolecules and then applying ‘informatics’ techniques to understand and organize the information associated with these molecules, on a large-scale” —Nick Luscombe

6 of 156

My Definition of Bioinformatics

12 of 156

My Definition of Bioinformatics

Biology

Computer Science

Computational Biology

Bioinformatics

Chemistry? Physics? Statistics?

13 of 156

The Central Dogma

17 of 156

The Central Dogma of Biology

DNA → RNA: Transcription

RNA → Protein: Translation

DNA → DNA: Replication

RNA → DNA: Reverse Transcription

RNA → RNA: Replication

19 of 156

Transcription

30 of 156

Transcription

  • DNA is transcribed to RNA
    • DNA alphabet is {A, C, G, T}
    • RNA alphabet is {A, C, G, U}
  • Mechanism
    • Transcription Factor (TF) binds to the gene’s promoter
    • RNA Polymerase binds near the transcription start site
    • RNA Polymerase transcribes DNA to RNA until it reaches the transcription end site

[Figure: TF and Pol (RNA Polymerase) bound to the DNA]

40 of 156

Transcription: Summary

DNA: GAGCTGATGGCTACTACACATATTGCCAGTTGATGGGTT

  ↓ Transcription

RNA: GAGCUGAUGGCUACUACACAUAUUGCCAGUUGAUGGGUU

AAAAAA (poly-A tail added to the end of the RNA)
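
At the level of strings, the DNA and RNA lines above differ only in T versus U. A minimal Python sketch of that view follows. The function name `transcribe` is made up for illustration, and the sketch deliberately ignores the mechanism (promoter, TF, polymerase, start and end sites) and the poly-A tail.

```python
DNA_ALPHABET = set("ACGT")

def transcribe(dna):
    """Return the RNA string obtained by replacing every T in dna with U."""
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA string, unexpected symbols: {sorted(unknown)}")
    return dna.replace("T", "U")

dna = "GAGCTGATGGCTACTACACATATTGCCAGTTGATGGGTT"
print(transcribe(dna))  # GAGCUGAUGGCUACUACACAUAUUGCCAGUUGAUGGGUU
```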

42 of 156

Translation

47 of 156

Translation

  • mRNA is translated to Protein
    • “Messenger” RNA
    • RNA alphabet is {A, C, G, U}
    • Protein alphabet has 20 letters (one per standard amino acid)
    • Each triplet (“codon”) of RNA maps to a specific amino acid (or to a STOP signal)

50 of 156

Translation: Mechanism

  • Translation starts at an AUG codon near the start of the mRNA (not necessarily the first AUG)
  • Starting with AUG, each codon is “translated” to a specific amino acid
  • Translation continues codon-by-codon until a STOP codon is reached

61 of 156

Translation: Summary

RNA: GAGCUGAUGGCUACUACACAUAUUGCCAGUUGAUGGGUU

Protein: MATTHIAS
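
The codon-by-codon process can be sketched in Python as below. Two assumptions are not from the slides: the codon table is the standard genetic code, written in the compact UCAG ordering, and translation starts at the first AUG, which is a simplification because the real start is not necessarily the first AUG. The name `translate` is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate from the first AUG until a STOP codon or the end of the string."""
    start = rna.find("AUG")  # simplification: always use the first AUG
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):  # only complete codons
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GAGCUGAUGGCUACUACACAUAUUGCCAGUUGAUGGGUU"))  # MATTHIAS
```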

62 of 156

Protein Structure

  • A protein’s function is largely based on its structure

63 of 156

The Central Dogma: Summary

DNA: GAGCTGATGGCTACTACACATATTGCCAGTTGATGGGTT

  ↓ Transcription

RNA: GAGCUGAUGGCUACUACACAUAUUGCCAGUUGAUGGGUU

  ↓ Translation

Protein: MATTHIAS

64 of 156

Natural Selection

69 of 156

Natural Selection

  • There is always natural variation (both “genotypic” and “phenotypic”) in a population of a given species
  • Natural Selection: Traits that “improve the fitness” of an organism make that organism more likely to reproduce
    • Traits that are “heritable” are passed down to the organism’s offspring
    • Individuals without the trait are less likely to reproduce
    • In the next generation, a larger portion of the population will have the trait
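
One way to see the last bullet numerically is a textbook haploid selection model, which is an assumption here and not something the slides specify. If a fraction p of the population carries a heritable trait with relative fitness w (and non-carriers have fitness 1), the expected fraction in the next generation is p·w divided by the mean fitness. The names and numbers below are hypothetical.

```python
def next_frequency(p, fitness_with_trait, fitness_without=1.0):
    """Expected trait frequency after one generation of selection."""
    mean_fitness = p * fitness_with_trait + (1 - p) * fitness_without
    return p * fitness_with_trait / mean_fitness

def simulate(p0, fitness_with_trait, generations):
    """Return the list of trait frequencies for generations 0..generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], fitness_with_trait))
    return freqs

# Hypothetical numbers: 10% of the population has the trait, and the trait
# makes an individual 1.5x as likely to reproduce.
for generation, p in enumerate(simulate(0.10, 1.5, 5)):
    print(generation, round(p, 3))
```

With any fitness above 1 the frequency rises every generation. This sketch leaves out random drift, which matters in small populations.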

70 of 156

Natural Selection: Example

[Figure: three counts shown for each generation]

Generation 0: 6, 5, 2

Generation 1: 5, 5, 3

Generation 2: 3, 6, 4

Generation 3: 2, 5, 6

If a trait is essential to an organism’s survival, it will be conserved in the population

78 of 156

Sequence Alignment

80 of 156

Pairwise Sequence Alignment

  • General Idea: Given two strings s and t, can we insert gaps into either string so that they line up better?

AGTACGTACGT

ACGTACGTAAT

81 of 156

Pairwise Sequence Alignment

  • General Idea: Given two strings s and t, can we insert gaps into either string so that they line up better?

A-GTACGTACGT

ACGTACGTAA-T

83 of 156

Pairwise Sequence Alignment

  • General Idea: Given two strings s and t, can we insert gaps into either string so that they line up better?

  • Biological Motivation: Align an important gene in human and its “ortholog” (equivalent) in mouse to see which parts are conserved

84 of 156

Pairwise Sequence Alignment: Scoring Function

Given an alignment, a gap penalty σ, and a scoring matrix M, the score of the alignment is the sum of the scores of its positions. A position scores σ if either sequence has a gap there; otherwise it scores M(c,c’), where c is the symbol at that position in one sequence and c’ is the symbol at that position in the other sequence.

99 of 156

Pairwise Sequence Alignment: Scoring Function

We want to maximize this scoring function

A-GTACGTACGT

ACGTACGTAA-T

Score: 6

      A    C    G    T
A    +1   -1   -1   -1
C    -1   +1   -1   -1
G    -1   -1   +1   -1
T    -1   -1   -1   +1

σ = -1
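
The scoring function can be written directly from its definition. Below is a small Python sketch; the function name and the dict-of-dicts encoding of M are illustrative choices. It sums σ for every position with a gap and M(c,c’) otherwise, and reproduces the score of the example alignment.

```python
def alignment_score(a, b, M, sigma):
    """Score two aligned strings of equal length; "-" denotes a gap."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for c, d in zip(a, b):
        if c == "-" or d == "-":
            score += sigma      # position with a gap in either sequence
        else:
            score += M[c][d]    # match or mismatch from the scoring matrix
    return score

# Scoring matrix from the slide: +1 on the diagonal, -1 elsewhere.
M = {c: {d: (1 if c == d else -1) for d in "ACGT"} for c in "ACGT"}

print(alignment_score("A-GTACGTACGT", "ACGTACGTAA-T", M, sigma=-1))  # 6
```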

101 of 156

The Global Alignment Problem

Given two strings s and t, a gap penalty σ, and a scoring matrix M, return a maximum-scoring alignment of s and t

AGTACGTACGT

ACGTACGTAAT

A-GTACGTACGT

ACGTACGTAA-T
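
The slides state the problem without naming an algorithm. A common approach, used here as an assumption, is dynamic programming in the style of Needleman-Wunsch: fill a table where entry (i, j) is the best score for aligning the first i symbols of s with the first j symbols of t, then trace back. When several alignments tie, the one returned may differ from the one pictured above.

```python
def global_align(s, t, M, sigma):
    """Return (score, aligned_s, aligned_t) for a maximum-scoring global alignment."""
    n, m = len(s), len(t)
    # dp[i][j] = best score for aligning s[:i] with t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * sigma
    for j in range(1, m + 1):
        dp[0][j] = j * sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + M[s[i - 1]][t[j - 1]],  # align two symbols
                dp[i - 1][j] + sigma,                      # gap in t
                dp[i][j - 1] + sigma,                      # gap in s
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + M[s[i - 1]][t[j - 1]]:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + sigma:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(a)), "".join(reversed(b))

M = {c: {d: (1 if c == d else -1) for d in "ACGT"} for c in "ACGT"}
score, a, b = global_align("AGTACGTACGT", "ACGTACGTAAT", M, sigma=-1)
print(score)
print(a)
print(b)
```

The table has (|s|+1)·(|t|+1) entries, so time and memory both grow with the product of the two lengths.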

103 of 156

The Local Alignment Problem

Given two strings s and t, a gap penalty σ, and a scoring matrix M, return a maximum-scoring alignment of

a substring of s and a substring of t

AGTACGTACGT

ACGTACGTAAT

ACGTACGT

ACGTACGT
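
A sketch of local alignment follows, assuming the Smith-Waterman style of dynamic programming, which the slides do not name. Table entries are never allowed to drop below 0, so an alignment can start anywhere, and the traceback starts from the highest-scoring entry. With a +1/-1 scoring matrix and σ = -1, the best local alignment of the two example strings is the shared substring ACGTACGT, with score 8.

```python
def local_align(s, t, M, sigma):
    """Return (score, aligned_sub_s, aligned_sub_t) for a best local alignment."""
    n, m = len(s), len(t)
    # dp[i][j] = best score of an alignment ending at s[i-1], t[j-1] (or 0)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                0,                                         # start a new alignment here
                dp[i - 1][j - 1] + M[s[i - 1]][t[j - 1]],  # align two symbols
                dp[i - 1][j] + sigma,                      # gap in t
                dp[i][j - 1] + sigma,                      # gap in s
            )
            if dp[i][j] > best:
                best, best_pos = dp[i][j], (i, j)
    # Trace back from the best cell until the score returns to 0.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and dp[i][j] > 0:
        if dp[i][j] == dp[i - 1][j - 1] + M[s[i - 1]][t[j - 1]]:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif dp[i][j] == dp[i - 1][j] + sigma:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

M = {c: {d: (1 if c == d else -1) for d in "ACGT"} for c in "ACGT"}
print(local_align("AGTACGTACGT", "ACGTACGTAAT", M, sigma=-1))
# (8, 'ACGTACGT', 'ACGTACGT')
```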

105 of 156

The Multiple Sequence Alignment Problem

Given multiple strings, a gap penalty σ, and a scoring matrix M, return a maximum-scoring alignment of the strings
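
The slides do not say how a multiple alignment is scored. One common convention, assumed here, is the sum-of-pairs score: score every pair of rows column by column with the pairwise rule, counting a gap-against-gap pair as 0. The sketch below only scores a given alignment; finding a maximum-scoring one is a much harder problem. The three rows are a made-up example.

```python
from itertools import combinations

def sum_of_pairs_score(rows, M, sigma):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for c, d in combinations(column, 2):
            if c == "-" and d == "-":
                continue            # gap vs. gap: scored 0 by convention
            if c == "-" or d == "-":
                total += sigma      # symbol vs. gap
            else:
                total += M[c][d]    # symbol vs. symbol
    return total

M = {c: {d: (1 if c == d else -1) for d in "ACGT"} for c in "ACGT"}
rows = ["A-GTACGT",
        "ACGTACGT",
        "ACGT-CGT"]
print(sum_of_pairs_score(rows, M, sigma=-1))  # 16
```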

106 of 156

Variant Calling

108 of 156

Variant Calling

  • Any two humans have genomes that are roughly 99.9% identical
  • Single Nucleotide Variants (SNVs)

ACATACGTACGT

ACGTACGTACGT

ACGTACGTACGT

ACATACGTTCGT

ACGTACGTACGT

ACGTACGTACGT

ACATACGTACGT

ACGTACGTACGT

ACGTACGTTCGT

109 of 156

Variant Calling

  • Any two humans have genomes that are roughly 99.9% identical
  • Single Nucleotide Variants (SNVs)
  • Structural Variants (SVs)

ACAGCAGCAGCAGTT

ACAGCAGTT

ACAGTT

ACAGCAGCAGTT
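
The four example sequences differ only in how many copies of CAG they contain. A quick Python sketch that counts the longest run of back-to-back copies is given below. The function name is made up, and real structural-variant calling works from sequencing reads and is far more involved than this.

```python
import re

def max_tandem_copies(seq, unit):
    """Largest number of consecutive copies of unit found anywhere in seq."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [match.group(0) for match in pattern.finditer(seq)]
    return max((len(run) // len(unit) for run in runs), default=0)

for seq in ["ACAGCAGCAGCAGTT", "ACAGCAGTT", "ACAGTT", "ACAGCAGCAGTT"]:
    print(seq, max_tandem_copies(seq, "CAG"))
# copies of CAG: 4, 2, 1, 3
```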

113 of 156

SNV Calling: General Approach

  • Sequence the DNA of the individual
  • Align the reads to the reference genome
  • For each site in the genome, predict the genotype based on the reads

ACTTACGT

GTACGTAC

TACGTACG

CTTACGTA

CGTACTTA

REF: ...ACGTACGTACGTACGTACGTACGT...

Reads covering the variant site: 50% G, 50% T

Predicted genotype: G/T
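
The last step, predicting a genotype at each site from the aligned reads, can be sketched as a simple pileup count. The slide layout that showed where each read aligns is lost, so the start positions below are hypothetical, chosen so that each read matches the reference except at one site. The thresholds `min_depth` and `min_fraction` are also illustrative assumptions; real callers use statistical models of sequencing error.

```python
from collections import Counter

def call_genotypes(ref, aligned_reads, min_depth=2, min_fraction=0.2):
    """Report sites whose supported alleles differ from the reference base.

    aligned_reads is a list of (start, read) pairs with 0-based start positions.
    """
    pileup = [Counter() for _ in ref]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    calls = {}
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about this site
        alleles = sorted(b for b, c in counts.items() if c / depth >= min_fraction)
        if alleles != [ref[pos]]:
            calls[pos] = (alleles, dict(counts))
    return calls

ref = "ACGT" * 6
# Reads from the slide; the start positions are hypothetical.
reads = [(4, "ACTTACGT"), (6, "GTACGTAC"), (3, "TACGTACG"),
         (5, "CTTACGTA"), (9, "CGTACTTA")]
print(call_genotypes(ref, reads))
# Site 6 is covered by two reads with G and two with T, so it is called G/T.
```

With this placement, the T in the last read falls on a site covered by only one read, so it is skipped by `min_depth`. That is one place where sampling error shows up.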

116 of 156

SNV Calling: Challenges

  • Some regions of the genome are difficult to sequence
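  • Sequencing technologies make errors when reading bases (sequencing error)
  • Reads are a random sample of the DNA, so a site, or one of its alleles, may be covered by few or no reads (sampling error)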

120 of 156

Population Genetics

  • Once we’ve called SNVs and SVs in enough people, what can we do?
    • Genome-Wide Association Studies (GWAS)
    • Genetic Ancestry/Admixture
    • Genetic Counseling

121 of 156

Differential Expression Analysis

126 of 156

Differential Expression Analysis: RNA-Seq

  • All cells in the body have (roughly) identical genomes
    • Differences in how they look/function are caused by “differential expression” of genes
  • Biological Question: Given two different samples, what genes are differentially expressed across them?
    • We would like to measure protein levels, but we can’t easily do so at high throughput
    • Instead, we measure RNA levels

132 of 156

Differential Expression Analysis

  • Reverse Transcribe RNA to DNA
  • Sequence the resulting DNA
  • Align the reads to the reference genome
  • Count the number of reads that mapped to each gene
  • Normalize by gene length and by sequencing depth
  • Perform differential expression statistical tests for each gene

Gene   Sample 1 FPKM   Sample 2 FPKM   Log-2 Ratio   p
A      ###             ###             ###           ###
B      ###             ###             ###           ###
C      ###             ###             ###           ###
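
The normalization and ratio columns can be sketched as follows. FPKM is taken here in its usual sense: fragments per kilobase of gene length per million mapped fragments. The counts, gene lengths, and the pseudocount of 1 are all made up for illustration. The p column is left out because a statistical test needs replicate samples, which this toy example does not have.

```python
import math

def fpkm(counts, lengths_bp):
    """FPKM per gene: count * 1e9 / (gene length in bp * total mapped count)."""
    total = sum(counts.values())
    return {gene: count * 1e9 / (lengths_bp[gene] * total)
            for gene, count in counts.items()}

def log2_ratio(fpkm_1, fpkm_2, pseudocount=1.0):
    """log2(sample 2 / sample 1) per gene; the pseudocount avoids division by zero."""
    return {gene: math.log2((fpkm_2[gene] + pseudocount) / (fpkm_1[gene] + pseudocount))
            for gene in fpkm_1}

# Hypothetical read counts and gene lengths.
lengths_bp = {"A": 1000, "B": 2000}
sample_1 = fpkm({"A": 500, "B": 500}, lengths_bp)
sample_2 = fpkm({"A": 1000, "B": 250}, lengths_bp)
for gene, ratio in log2_ratio(sample_1, sample_2).items():
    print(gene, sample_1[gene], sample_2[gene], round(ratio, 2))
```

A positive log-2 ratio means the gene is more highly expressed in sample 2, and a negative one means it is more highly expressed in sample 1.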

133 of 156

Genome Assembly

136 of 156

Genome Assembly

  • What is the genome sequence of a given organism?
  • We are able to sequence small fragments of an organism’s genome
  • How do we tie these small fragments together into a single string?

ATACAG

CAGTGG

GGAACA

CACCAT

CCATCT

...ATACAGTGGAACACCATCTG...

137 of 156

Genome Assembly

  • What is the genome sequence of a given organism?
  • We are able to sequence small fragments of an organism’s genome
  • How do we tie these small fragments together into a single string?

  • Computational Problem: Given a list of strings reads, find the shortest superstring of reads
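
Finding the shortest superstring exactly is NP-hard in general, so the sketch below uses a greedy heuristic instead. That is a swapped-in technique, not something the slides prescribe: repeatedly merge the two fragments with the longest suffix-prefix overlap. It is not guaranteed to return the shortest superstring, but on the five example reads it rebuilds the expected sequence.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedily merge the pair of fragments with the largest overlap until one remains."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    frags = [r for r in frags if not any(r != o and r in o for o in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""

reads = ["ATACAG", "CAGTGG", "GGAACA", "CACCAT", "CCATCT"]
print(greedy_superstring(reads))  # ATACAGTGGAACACCATCT
```

Real genomes contain repeats and reads contain errors, so practical assemblers rely on more robust formulations than this heuristic.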

138 of 156

Phylogenetics

139 of 156

Phylogenetics

[Figure: phylogenetic tree, annotated with Present-Day Species, Ancestors (extinct), and Evolutionary Time]

143 of 156

Models of Evolution

147 of 156

Models of Evolution

  • Models of Tree Evolution: Describe a probability distribution over the shapes of the phylogenetic trees
    • Are some tree topologies more likely to be observed?

  • Models of Sequence Evolution: Describe a probability distribution over the observed sequences
    • Are some sequences more likely to be observed (e.g. fitness)?

150 of 156

Phylogenetic Inference

  • Can we somehow reconstruct the evolutionary history of species based solely on their sequences?
    • Raw Sequences → Multiple Sequence Alignment → Tree
  • Maximum Likelihood: Given a multiple sequence alignment and a model of sequence evolution, find the tree that maximizes the “likelihood function” (i.e., the probability of observing the alignment given the tree)
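
To make the likelihood function concrete, here is the smallest possible case: a tree with just two sequences joined by a branch of length d (expected substitutions per site). The sketch assumes the Jukes-Cantor (JC69) model, which the slides do not name. Under JC69 a site stays the same with probability 1/4 + 3/4·e^(-4d/3) and changes to one specific other base with probability 1/4 - 1/4·e^(-4d/3), and the maximizing d has a closed form. The two example sequences are made up for illustration.

```python
import math

def jc69_log_likelihood(x, y, d):
    """Log-probability of aligned sequences x and y at distance d > 0 under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # the site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * e   # the site shows one specific different base
    log_likelihood = 0.0
    for a, b in zip(x, y):
        # 0.25 is the probability of the base observed in x (uniform under JC69).
        log_likelihood += math.log(0.25 * (p_same if a == b else p_diff))
    return log_likelihood

def jc69_ml_distance(x, y):
    """Closed-form maximum-likelihood estimate of d from the fraction of differing sites."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    if p >= 0.75:
        return math.inf  # the sequences look saturated; the estimate is unbounded
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

x = "ACGTACGTACGT"
y = "ACATACGTTCGT"   # differs from x at 2 of 12 sites
d_hat = jc69_ml_distance(x, y)
print(round(d_hat, 4))
print(jc69_log_likelihood(x, y, d_hat) > jc69_log_likelihood(x, y, 2 * d_hat))
```

Full phylogenetic inference repeats this kind of calculation over every branch of a candidate tree and searches over tree shapes, which is where the computational cost comes from.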

151 of 156

Summary

156 of 156

Summary

  • We started with some basic molecular biology review
  • We then introduced several important biological problems and discussed how each is formulated as a computational problem
  • Bioinformatics = BIG data!
    • We need efficient algorithms
    • We need optimized implementations of these algorithms