1 of 41

Expectation Maximization Methods for Metabolic Pathway Analysis

Dissertation Proposal

Fil Rondel

Committee: Alex Zelikovsky

Pavel Skums

Murray Patterson

Artem Rogovskyy

July 1st 2022

2 of 41

Part I - Introduction

3 of 41

Metabolism

  • Living organisms need energy to grow, move, and think
  • Metabolism is
    • a series of complex biochemical (metabolic) processes that regulate energy in organisms
    • organised into distinct metabolic pathways to either maximise the capture of energy or minimise its use
  • Understanding metabolism is key to understanding life and has been a subject of fascination with scientists for more than 150 years

4 of 41

Metabolic Pathways

  • Metabolic pathways are structured parts of metabolism
    • maximise the capture of energy
    • prevent its uncontrolled combustion
    • E.g.
      • Glycolysis – breaks down glucose to generate energy
      • Photosynthesis – used by organisms to convert light energy into chemical energy
  • Metabolic processes are catalyzed by specific proteins called enzymes
  • Metabolic pathways are regulated by enzymes and substrates
    • Lipases – help digest fats in the gut
    • Amylase – helps process starches into sugars

5 of 41

Enzymes

  • Catalyze biochemical reactions in organisms
  • Enzyme activity is directly related to substrates available
  • Enzyme Commission (EC) numbers classify enzymes by the reactions they catalyze
  • EC classes describe specific enzyme functions
    • EC 3: hydrolases (use water to break up molecules)
    • EC 3.4: hydrolases acting on peptide bonds
    • EC 3.4.11: hydrolases that cleave the amino-terminal residue from a polypeptide
    • EC 3.4.11.4: hydrolases that cleave the amino-terminal residue from a tripeptide
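
The nesting of EC numbers can be made explicit with a few lines of Python. This is a minimal sketch: the function names `ec_lineage` and `share_levels` are illustrative and not part of any pipeline described in these slides.

```python
def ec_lineage(ec_number: str) -> list[str]:
    """Return every level of the EC hierarchy for an EC number, broadest first."""
    # Accept "EC 3.4.11.4", "EC:3.4.11.4", or a bare "3.4.11.4".
    digits = ec_number.removeprefix("EC").lstrip(": ")
    parts = digits.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def share_levels(ec_a: str, ec_b: str, depth: int = 3) -> bool:
    """True when two EC numbers agree on their first `depth` levels."""
    return ec_lineage(ec_a)[:depth] == ec_lineage(ec_b)[:depth]


print(ec_lineage("EC 3.4.11.4"))
```

Each entry of the returned list is one level of the classification, so two enzymes can be compared at whatever depth is relevant, for example class only (`depth=1`) or sub-subclass (`depth=3`).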

More enzymes and substrates → faster metabolism

6 of 41

Metabolic Pathway Activity

  • Pathway activity levels can be estimated using
    • RNA-seq reads
    • Gene expression data
    • Enzyme activity
  • Metabolic pathways are not always disjoint
    • Overlap – they have genes in common
    • May use the same enzymes
  • Shared enzymes may not perform the same function in each pathway

7 of 41

Problem formulation

  • Given RNA-seq reads from a biological sample
  • Estimate
    • Gene expression
    • Enzyme expression
    • Metabolic pathway activity level

8 of 41

Challenges

  • Estimating pathway activity poses challenges
    • Genes in common → issues distinguishing between enzymes
    • Enzymes are defined as ortholog groups
      • Enzymes are grouped by sequence homology, not by the reactions they catalyze
  • Sample-specific challenge:
    • A contig can map to two different enzymes
    • Such enzymes may not be distinguished by any other contigs in the sample
  • Results in groups of indistinguishable enzymes
  • Pathway activity estimation algorithm converges inconsistently

9 of 41

Enzyme participation

  • Every enzyme in a metabolic pathway is needed for its biochemical reactions to proceed
  • Some enzymes are more active than others
    • Different substrates available
    • Different temperatures
  • More activity → higher participation in pathway activity
  • Quantifying enzyme expression as well as enzyme participation in metabolic pathways can improve metabolic pathway activity estimates

10 of 41

Contributions

  • Exploring static enzyme-in-metabolic-pathway participation coefficients
  • Developing an Expectation-Maximization (EM) based method to infer non-static enzyme-in-metabolic-pathway participation coefficients
  • Integrating enzyme participation coefficients into the metabolic pathway activity estimation pipeline to improve estimation accuracy
  • Differential analysis of metabolic pathway activity from microbial communities
  • Downstream analysis of pathway activity for infected and uninfected wild and lab mice

11 of 41

Publications

  1. Pipeline for analyzing activity of metabolic pathways in planktonic communities using metatranscriptomic data
  2. F. Rondel, R. Hosseini, B. Sahoo, S. Knyazev, I. Mandric, F. Stewart, I. I. Măndoiu, B. Pasaniuc, A. Zelikovsky, "Estimating Enzyme Participation in Metabolic Pathways for Microbial Communities from RNA-Seq Data," Proc. of International Symposium on Bioinformatics Research and Applications (ISBRA), 2020, Lecture Notes in Bioinformatics 12304, 335-343
  3. Analyzing differential metabolic pathway activity levels using enzyme participation in metabolic pathways from RNA-seq data of Mus musculus and Peromyscus leucopus (in preparation)

12 of 41

Part II - Microbial community analysis

13 of 41

Previous work

14 of 41

EMPathways

15 of 41

First and Second EM

16 of 41

EM for enzyme participation (3rd EM)

  • Calculating and normalizing weights on edges
    • Enzyme-to-pathway participation ← 1 / pathway size
  • Weights are updated in the E-step and M-step:
    • E-step
    • M-step

17 of 41

Third EM

  • Initialize with uniform participation
  • EM3: enzyme expression + pathway activity level → enzyme-in-pathway participation
  • Update enzyme-in-pathway participation
  • EM2: enzymes → pathways
  • Repeat EM3-EM2 until convergence

RESULT: more accurate estimation of

  • Enzyme-in-pathway participation
  • Metabolic pathway activity
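
To make the "EM2: enzymes → pathways" step concrete, here is a minimal sketch of a standard mixture-model EM. It estimates pathway activity from enzyme expression while the participation coefficients are held fixed. The generative model is an assumption made for illustration (a pathway is chosen with probability equal to its activity, then an enzyme with probability equal to its participation). It is not the update rule used in EMPathways, and the EM3 participation update is not reproduced here. The function name and data layout are hypothetical.

```python
def estimate_pathway_activity(expression, pathways, participation=None, iterations=200):
    """Estimate pathway activity by EM with fixed participation coefficients.

    expression: {enzyme: abundance}
    pathways: {pathway: [enzymes]}
    participation: {pathway: {enzyme: weight}}; defaults to 1 / pathway size
    """
    if participation is None:
        # Uniform start, as on the slide: each enzyme gets 1 / pathway size.
        participation = {p: {e: 1.0 / len(m) for e in m} for p, m in pathways.items()}
    activity = {p: 1.0 / len(pathways) for p in pathways}

    # Only expressed enzymes that belong to some pathway can be explained.
    covered = {e for members in pathways.values() for e in members}
    observed = {e: x for e, x in expression.items() if e in covered and x > 0}
    if not observed:
        return activity

    for _ in range(iterations):
        mass = {p: 0.0 for p in pathways}
        for enzyme, amount in observed.items():
            # E-step: split the enzyme's expression among its pathways.
            weights = {p: activity[p] * participation[p].get(enzyme, 0.0) for p in pathways}
            norm = sum(weights.values())
            if norm == 0.0:
                continue
            for p, w in weights.items():
                mass[p] += amount * w / norm
        # M-step: activity is the share of expression assigned to each pathway.
        assigned = sum(mass.values())
        activity = {p: m / assigned for p, m in mass.items()}
    return activity


activity = estimate_pathway_activity(
    {"A": 10, "B": 30, "C": 60},
    {"p1": ["A", "B"], "p2": ["B", "C"]},
)
```

In this toy example enzyme B is shared by both pathways, so its expression is divided according to the current activity estimates. The estimates settle at 1/7 for `p1` and 6/7 for `p2`, driven by the unshared enzymes A and C.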

18 of 41

Datasets

  • 26 samples taken over the course of 2 days in the Gulf of Mexico
    • 2 depths: 2 meters and 18 meters
    • 13 samples per depth
    • One sample every 4 hours
  • RNA was extracted from the samples and sequenced
  • RNA-seq data were processed through the EMPathways pipeline
  • Output of EMPathways is a list of values:
    • Expression of each metabolic pathway in time series
    • Output is processed further to obtain a list of upregulated metabolic pathways

19 of 41

Datasets

20 of 41

Sample data

  • RNA-seq data
    • Aligned and assembled
    • 181 metabolic pathways in time series, 13 time points, 2 depths
    • Normalized values that represent metabolic pathway expression
  • For each time-point there are 6 conditions:
    • PAR - photosynthetically active radiation
    • pTemp - potential temperature (°C)
    • Density - water density
    • Oxygen - amount of oxygen in water
    • PSU - practical salinity unit, i.e. the salinity of the water
    • Chl - chlorophyll, a measure of algae levels in the water, in µg/L

21 of 41

Challenge - different EM convergence points

  • Run EM over the same sample 5 times:
    • Enzymes with stable expression
    • Enzymes with unstable expression
  • Same reads → hard to distinguish two enzymes
  • Find the groups of indistinguishable enzymes and edit the dictionary:
    • Collapse the enzymes into a group
    • Update the dictionary with the collapsed groups

22 of 41

Finding pairs

  • Given: multiple enzyme expression estimates
  • Find: enzyme groups/pairs
  • Algorithm:
    • Remove stable enzymes
    • Unstable enzymes:
      • Find the difference between the frequencies of two samples
      • Same difference over multiple samples → enzyme pair
      • Paired enzymes belong to the same orthology group
  • There are some exceptions

[Figure: panels A, B, and B - A]
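
The pairing idea can be sketched as follows. This is one possible reading of the algorithm, stated as an assumption: an enzyme is "stable" when its estimate barely changes across runs, and two unstable enzymes form a pair when their changes cancel out, so that their sum stays constant. The orthology check is left out, and the function name and tolerance are hypothetical.

```python
from itertools import combinations


def find_indistinguishable_pairs(estimates, tol=1e-6):
    """Return pairs of unstable enzymes whose summed estimate is constant.

    estimates: {enzyme: [estimate in run 1, estimate in run 2, ...]}
    """
    def spread(values):
        return max(values) - min(values)

    # Step 1: remove stable enzymes (same estimate in every run).
    unstable = {e: v for e, v in estimates.items() if spread(v) > tol}

    # Step 2: among unstable enzymes, a pair trades expression back and
    # forth, so whatever one gains between runs the other loses.
    pairs = []
    for a, b in combinations(sorted(unstable), 2):
        sums = [x + y for x, y in zip(unstable[a], unstable[b])]
        if spread(sums) <= tol:
            pairs.append((a, b))
    return pairs


runs = {
    "E1": [5.0, 7.0, 6.0],
    "E2": [5.0, 3.0, 4.0],
    "E3": [2.0, 2.0, 2.0],
    "E4": [1.0, 4.0, 9.0],
}
print(find_indistinguishable_pairs(runs))
```

With real EM output the tolerance would need to be loosened, since repeated runs rarely agree to six decimal places.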

23 of 41

Exceptions in finding pairs

  • Some enzymes pair with several different enzyme groups
  • For example: EC:4.2.1.51
    • is paired with EC:4.2.1.91
    • is paired with EC:5.4.99.5
  • EC:4.2.1.51 shares a separate orthology group with each of the two
  • The three enzymes do not share a common orthology group

24 of 41

Exceptions in finding pairs

  • Some groups are triplets or quadruplets
    • Their summed expression adds up to the same value over the 13 samples
  • Triplets and quadruplets → large orthology groups

25 of 41

Results

  • Measured the performance of the model with the false discovery rate (FDR)
    • The FDR is the expected proportion of type I errors (false positives) among the discoveries
  • Generated 200 random permutations of conditions
  • Multiple regression with each of the 200 permutations
  • Marginal regression with each of the 200 permutations
  • 95% confidence interval

26 of 41

Correlation of pathways activity level with environmental parameters (old pipeline)

                                           Salinity  Temp      Oxygen    Chl       PAR       Density   MLR
1. # of significantly correlated pathways  8         14        5         8         1         4         10
2. 95% randomized CI                       1-10      1-11      1-8       0-7       0-6       1-8       0-8
3. The most correlated pathway             ec00364   ec00310   ec00281   ec00281   ec00740   ec00623   ec00623

27 of 41

Correlation of pathways activity level with environmental parameters (new pipeline)

                                           Salinity  Temp      Oxygen    Chl       PAR       Density   MLR
1. # of significantly correlated pathways  31        22        19        18        14        30        22
2. 95% randomized CI                       1-8       0-8       0-6       0-6       0-6       1-8       0-7
3. The most correlated pathway             ec00071   ec00195   ec00622   ec00460   ec00360   ec00071   ec00626

28 of 41

Enzyme-in-Pathway Participation

ec00020      D1: 12  D1: 16  D1: 20  D2: 00  D2: 04  D2: 08  D2: 12  D2: 16  D3: 00  D3: 04  D3: 12  AVE     STD
EC:1.2.4.1   12.82   21.68   20.64   33.71   35.76   30.38   21.78   23.71   32.4    28.07   21.98   25.72   6.6
EC:1.2.7.1   0.51    6.18    15.43   6.69    4.97    9.32    13.14   9.61    7.87    12.95   2.54    8.11    4.37
EC:1.2.7.3   13.99   21.46   20.32   26.74   28.96   24.87   21.26   22.22   27.08   24.44   26.7    23.46   4.02
EC:1.8.1.4   7.61    12.92   11.24   16.94   16.65   14.39   12.93   16.92   19.16   14.03   22.16   15      3.78
EC:2.3.1.12  12.82   21.68   20.64   33.71   35.76   30.38   21.78   23.71   32.4    28.07   21.98   25.72   6.6
EC:4.1.1.32  12.82   21.68   20.64   33.71   35.76   30.38   21.78   23.71   32.4    28.07   21.98   25.72   6.6
EC:4.1.1.49  14.78   23.66   23.38   32.19   36.13   37.34   26.62   28.41   35.9    33.66   25.61   28.88   6.6
EC:1.1.1.37  18.14   19.76   26.62   17.9    18.93   30.78   20.27   20.43   22.97   22.13   44.21   23.83   7.43
EC:1.1.1.41  72.88   72.85   70.78   71.2    68.42   38.66   45.68   60.11   62.77   61.29   27.09   59.25   14.74
EC:1.1.1.42  19.96   24.06   22.58   21.52   23.68   19.95   22.48   22.32   22.95   21.92   42.38   23.98   5.95
EC:1.1.5.4   0       0       0       29.35   0       0       0       20.53   0       0       0       24.94   4.41
EC:1.2.4.2   10.1    13.02   10.76   11.91   10.91   11.72   12.75   14.08   14.74   10.13   25.75   13.26   4.21
EC:1.3.5.1   21.35   27.74   28.74   34.65   39.51   30.74   29.4    29.56   36.38   33.32   46.73   32.56   6.43
EC:2.3.1.61  10.1    13.02   10.76   11.91   10.91   11.72   12.75   14.08   14.74   10.13   25.75   13.26   4.21
EC:2.3.3.1   86.31   41.26   66.16   28.14   39.2    260.4   209     93.27   70.39   107.9   96.4    99.85   68.92
EC:2.3.3.8   19.96   24.06   22.58   21.52   23.68   19.95   22.48   22.32   22.95   21.92   42.38   23.98   5.95
EC:4.2.1.2   14.54   18.81   19.68   23.77   28      20.3    19.67   20.16   24.74   22.7    32.79   22.29   4.72
EC:4.2.1.3   33.31   29.83   34.13   23.43   28.96   41.1    44.43   37.46   35.39   38.11   69.02   37.74   11.35
EC:6.2.1.4   19.96   24.06   22.58   21.52   23.68   19.95   22.48   22.32   22.95   21.92   42.38   23.98   5.95
EC:6.4.1.1   14.54   18.81   19.68   23.77   28      20.3    19.67   20.16   24.74   22.7    32.79   22.29   4.72

29 of 41

Part III - Rodent community analysis

30 of 41

Initial pipeline

31 of 41

EMPathways pipeline

32 of 41

Data Sets

  • RNA extracted from 4 groups of rodents
  • RNA-seq reads
  • IsoEM2
    • Transcript expressions
  • Transcript-to-gene mapping
  • Transcripts mapped to genes, enzymes, and pathways for further analysis
  • 4 groups of mice
    • 3 infected and 3 uninfected animals per species
      • Mus musculus
      • Peromyscus leucopus
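
The mapping chain from transcripts to genes to enzymes can be illustrated with a simple aggregation. This is a deliberately naive sketch: it adds a source's full expression to every target it maps to, whereas the pipeline on these slides resolves shared assignments with EM. The function name and all identifiers in the example are hypothetical.

```python
from collections import defaultdict


def aggregate(expression, mapping):
    """Sum expression values over a mapping {source: [targets]}."""
    totals = defaultdict(float)
    for source, value in expression.items():
        # Sources without a mapping entry are skipped.
        for target in mapping.get(source, []):
            totals[target] += value
    return dict(totals)


# Hypothetical transcript expression, e.g. as reported by a quantification tool.
transcript_expression = {"t1": 2.0, "t2": 3.0, "t3": 5.0}
transcript_to_gene = {"t1": ["g1"], "t2": ["g1"], "t3": ["g2"]}
gene_to_enzyme = {"g1": ["EC:1.1.1.35"], "g2": ["EC:1.1.1.35", "EC:2.3.1.16"]}

gene_expression = aggregate(transcript_expression, transcript_to_gene)
enzyme_expression = aggregate(gene_expression, gene_to_enzyme)
```

The same helper could be applied once more with an enzyme-to-pathway mapping, although at that level shared enzymes are exactly the ambiguity that the EM steps are meant to resolve.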

33 of 41

Enzyme coefficients

ec00062       WS_1     WS_2     WS_3     Average  StDev    WH_1     WH_2     WH_3     Average  StDev
EC:1.1.1.211  0.08483  0.08255  0.08154  0.08297  0.00169  0.07992  0.07949  0.07321  0.07754  0.00376
EC:4.2.1.134  0.09528  0.09273  0.09386  0.09396  0.00128  0.09587  0.08763  0.08563  0.08971  0.00543
EC:1.1.1.35   0.03545  0.03325  0.03408  0.03426  0.00111  0.03396  0.03003  0.02923  0.03107  0.00253
EC:1.3.1.93   0.09528  0.09273  0.09386  0.09396  0.00128  0.09587  0.08763  0.08563  0.08971  0.00543
EC:2.3.1.16   0.07277  0.06689  0.06721  0.06896  0.00331  0.06804  0.06149  0.06175  0.06376  0.00371
EC:2.3.1.199  0.09528  0.09273  0.09386  0.09396  0.00128  0.09587  0.08763  0.08563  0.08971  0.00543
EC:3.1.2.22   0.11754  0.12052  0.11921  0.11909  0.00149  0.11801  0.12478  0.12822  0.12367  0.00519
EC:1.3.1.38   0.11754  0.12052  0.11921  0.11909  0.00149  0.11801  0.12478  0.12822  0.12367  0.00519

ec00062       LS_1     LS_2     LS_3     Average  StDev    LH_1     LH_2     LH_3     Average  StDev
EC:1.1.1.211  0.07611  0.07564  0.07962  0.07712  0.00217  0.07214  0.07137  0.06937  0.07096  0.00143
EC:4.2.1.134  0.08968  0.09392  0.08879  0.09080  0.00274  0.08621  0.08510  0.08544  0.08558  0.00057
EC:1.1.1.35   0.02749  0.02697  0.02646  0.02697  0.00051  0.02663  0.02661  0.02736  0.02686  0.00043
EC:1.3.1.93   0.08968  0.09392  0.08879  0.09080  0.00274  0.08621  0.08510  0.08544  0.08558  0.00057
EC:2.3.1.16   0.06534  0.06643  0.06407  0.06528  0.00118  0.06752  0.06742  0.06776  0.06757  0.00018
EC:2.3.1.199  0.08968  0.09392  0.08879  0.09080  0.00274  0.08621  0.08510  0.08544  0.08558  0.00057
EC:3.1.2.22   0.12275  0.12035  0.12346  0.12219  0.00163  0.12638  0.12688  0.12736  0.12687  0.00049
EC:1.3.1.38   0.12275  0.12035  0.12346  0.12219  0.00163  0.12638  0.12688  0.12736  0.12687  0.00049

WS / WH - wild sick / wild healthy

LS / LH - lab sick / lab healthy

34 of 41

Results

  • Clear differences between group metabolic pathway expressions
    • Averages of infected and uninfected groups show a significant difference
  • ec00053, ec00280, ec00500
    • Ascorbate and aldarate metabolism
    • Valine, leucine and isoleucine degradation
    • Starch and sucrose metabolism

35 of 41

Differentially expressed pathway ec00053

Ascorbate and aldarate metabolism

              Healthy  Healthy  Healthy  Average  STDev    Sick     Sick     Sick     Average  STDev    Average  STDev
Wild ec00053  11.322   11.946   12.896   12.055   0.793    33.636   12.688   11.848   19.391   12.344   15.723   8.168
Lab ec00053   45.066   44.674   34.24    41.327   6.140    30.887   34.548   22.335   29.257   6.268    35.292   0.090

For ec00053:

  1. Decreased expression for sick lab mice while (almost) no change for wild mice
  2. Increased expression in lab mice compared with wild mice

36 of 41

Differentially expressed pathway ec00280

Valine, leucine and isoleucine degradation

              Healthy  Healthy  Healthy  Average  STDev    Sick     Sick     Sick     Average  STDev    Average  STDev
Wild ec00280  176.159  180.124  171.589  175.957  4.271    176.788  177.81   179.145  177.914  1.182    176.936  2.184
Lab ec00280   136.395  135.017  135.33   135.581  0.722    163.634  163.142  167.456  164.744  2.362    150.162  1.159

For ec00280:

  1. Elevated expression for sick lab mice while no change for wild mice
  2. Reduction of expression in lab mice compared with wild mice
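
The group averages and standard deviations reported for the lab mice on this slide can be recomputed from the three replicate values per group. The sketch below also computes a Welch t statistic as one possible way to quantify the healthy-versus-sick difference. The slides do not say which test was used, so that choice is an assumption, and the helper names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) of a replicate group."""
    return mean(values), stdev(values)


def welch_t(group_a, group_b):
    """Welch t statistic for the difference in means (group_b minus group_a)."""
    var_a = stdev(group_a) ** 2 / len(group_a)
    var_b = stdev(group_b) ** 2 / len(group_b)
    return (mean(group_b) - mean(group_a)) / sqrt(var_a + var_b)


# ec00280 expression for lab mice, taken from the table on this slide.
lab_healthy = [136.395, 135.017, 135.33]
lab_sick = [163.634, 163.142, 167.456]

healthy_mean, healthy_sd = summarize(lab_healthy)
sick_mean, sick_sd = summarize(lab_sick)
t_statistic = welch_t(lab_healthy, lab_sick)
```

Rounded to three decimals, the recomputed means and standard deviations match the Average and STDev columns for the lab groups. With only three replicates per group, any test statistic should be read cautiously.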

37 of 41

Differentially expressed pathway ec00500

Starch and sucrose metabolism

              Healthy  Healthy  Healthy  Average  STDev    Sick     Sick     Sick     Average  STDev    Average  STDev
Wild ec00500  157.897  159.69   160.486  159.358  1.326    114.175  153.072  155.825  141.024  23.293   150.191  15.533
Lab ec00500   145.837  129.389  138.745  137.990  8.250    89.675   133.198  87.802   103.558  25.686   120.774  12.329

For ec00500:

  1. Reduction of expression for sick lab mice while (mostly) no change for wild mice
  2. Reduction of expression in lab mice compared with wild mice

38 of 41

Part IV - Conclusions

39 of 41

Conclusions

  • EM approach works:
    • With the improved pathway expression estimates, more environmental conditions correlate with metabolic pathway expression in the microbial community
  • The third EM, which estimates enzyme-in-pathway participation, improves the evaluation of metabolic pathway expression for microbial and mouse communities
  • Pathway activity correlates significantly with environmental parameters for microbial communities
  • Enzyme participation in pathways can be measured, but does not vary significantly across metabolic pathways

40 of 41

Future work

  • Analysis of mouse metabolic pathways, enzyme activity, and enzyme participation
  • Differential analysis of enzyme and pathway activity between M. musculus and P. leucopus
    • Are there any pathways that change in wild mice from healthy to sick while being stable in lab mice?
  • Comparing stability of enzyme participation coefficients across microbial communities and mice groups

41 of 41

Acknowledgements

CS Department

  • Alex Zelikovsky
  • Murray Patterson
  • Roya Hosseini
  • Sergey Knyazev
  • Babatunde Bello

School of Medicine

  • Bogdan Pasaniuc

CS Department

  • Ion Mandoiu

Department of Microbiology

  • Frank Stewart
  • Igor Mandric

College of Veterinary Medicine and Biomedical Sciences

  • Artem Rogovskyy