1 of 63

Functional analysis

Martina Summer-Kutmon

martina.kutmon@maastrichtuniversity.nl

Maastricht Centre for Systems Biology (MaCSBio)

NUTRIOME Workshop 1, 30 May 2024

ORCID: 0000-0002-7699-8191

Part 1: Molecular processes and pathways

2 of 63

Current knowledge level

3 of 63

Introduction

4 of 63

Introduction enrichment analysis

Quantify

Isolated data points

5 of 63

Introduction enrichment analysis

Comparative statistics

Genes of interest (DESeq2)

9 of 63

Introduction enrichment analysis

Enrichment analysis

Pre-defined gene sets → functional groups

Apoptosis

Catalytic activity

GATA3 targets

10 of 63

Introduction enrichment analysis

Enrichment analysis

Pathway analysis

Pathway = gene set with information about relationships

11 of 63

Introduction enrichment analysis

Systems organization

Network analysis

12 of 63

Why enrichment analysis?

“Enrichment” of gene sets

  • Statistics
    • Analysis of groups instead of individual genes
    • Increases power and reduces dimensionality
  • Biological
    • Analysis on a functional level
    • Higher explanatory power

13 of 63

How does it work?

Gene expression

(microarray / RNA-seq)

Gene sets

Pathways, GO, gene sets

Enrichment analysis

method

Over-representation analysis

Functional class scoring

Gene set

significance

14 of 63

Gene set collections

15 of 63

Gene set collections

Group genes based on some shared characteristic, e.g.

  • Molecular processes/pathways
  • Molecular function
  • Cellular component
  • Positional (on chromosomes)
  • Hallmark gene sets
  • Motif gene sets
  • Signature gene sets
  • Disease gene sets

Molecular Signatures Database (MSigDB): https://www.gsea-msigdb.org/gsea/msigdb

Subramanian (2005) PNAS

doi: 10.1073/pnas.0506580102

16 of 63

Gene sets - level of detail

Example: Hedgehog signaling pathway

  1. Gene sets
    • Biological components pertaining to a definite biological theme
  2. Non-directed pathways
    • Describe the existence of definite interactions between the same components in the form of a network
  3. Directed pathways
    • Disclose the character of the interactions in the network. Arrows depict activation of the target component by the source component; blunt-ended edges depict inhibition.

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

17 of 63

Gene Ontology

  • “Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber 1993)

  • Ontologies for molecular biology domains developed and supported by the Gene Ontology Consortium for gene and gene product annotations for all organisms

The Gene Ontology Consortium (2023) Genetics

doi: 10.1093/genetics/iyad031

18 of 63

Gene ontology vocabularies

  • Molecular Function
    • What a product ‘does’, precise activity
  • Biological Process
    • Biological objective, accomplished via one or more ordered assemblies of functions
  • Cellular Component
    • ‘is located in’ (‘is a subcomponent of’)

19 of 63

Gene Ontology - coverage

The Gene Ontology Consortium (2023) Genetics

doi: 10.1093/genetics/iyad031

20 of 63

Gene Ontology - structure

  • Directed acyclic graph (DAG): each child may have one or more parents
  • Relationships between terms defined
  • All terms are defined, accession ID associated with definition
  • True Path: all attributes of children must hold for all parents

https://geneontology.org/docs/ontology-documentation/
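
The true path rule can be illustrated on a toy DAG. This is a minimal sketch, not how GO tooling is implemented: the term names (`root`, `process_a`, ...) and the gene `geneA` are hypothetical placeholders rather than real GO accessions, and the different relationship types between terms are ignored.

```python
# Toy ontology: child term -> set of parent terms (a DAG, so several parents are allowed).
# All term and gene names are hypothetical placeholders, not real GO accessions.
PARENTS = {
    "root": set(),
    "process_a": {"root"},
    "process_b": {"root"},
    "process_c": {"process_a", "process_b"},  # one child, two parents
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """True path rule: a gene annotated to a term is also annotated to all its ancestors."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


direct = {"geneA": {"process_c"}}
print(sorted(propagate(direct, PARENTS)["geneA"]))
```

Annotating `geneA` only to the most specific term is enough: the propagation step adds both parents and the root, which is why gene sets for broad terms are much larger than their direct annotations suggest.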

21 of 63

Gene Ontology - annotations

  • GO annotations are created by associating a gene or gene product with a GO term
  • Minimal information added by curator
    • Gene product (may be a protein, RNA, etc.)
    • GO term
    • Reference
    • Evidence (ECO ontology)

22 of 63

Gene Ontology - annotations

23 of 63

Pathway databases

  • Pathways are gene sets with graphical representation and information about the relationships between the molecules

  • Many different online databases (different species, biological focus, curation style)

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

24 of 63

Biological pathways

Pathway diagrams are found everywhere!

25 of 63

Biological pathways

  • Signaling pathways
  • Metabolic pathways
  • Gene regulation pathways

https://www.genome.gov/about-genomics/fact-sheets/Biological-Pathways-Fact-Sheet

26 of 63

Biological pathways

Pathway diagrams are found everywhere!

Utility to biologists as conceptual models is obvious

If modeled properly - immensely useful for computational analysis and interpretation of large-scale experimental data

27 of 63

Biological pathways

PDGFR-beta pathway with transcriptomic/phosphoproteomic data

www.wikipathways.org/instance/WP3972

Static image

Zhang et al, Cell 2016

28 of 63

Pathway Databases

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

29 of 63

WikiPathways

  • Launched in 2008 as an experiment in community-based curation of biological pathways

Too much data!

Difficult to keep knowledge up-to-date, accessible and integrated

Taking advantage of direct participation by a greater portion of the community (crowdsourcing)

Image: https://www.vizioninteractive.com/blog/data-overload-when-it-all-becomes-too-much/

30 of 63

WikiPathways

  • Community-curated
  • Collaborative
  • Open

Content:

  • 1,958 pathways
  • 27 species
  • 600+ editors

www.wikipathways.org

Agrawal (2024) NAR

doi: 10.1093/nar/gkad960

31 of 63

32 of 63

Community portals

  • Special interest groups
  • Portal pages to highlight communities
  • 15 community portals supported

Martens (2021) NAR

doi: 10.1093/nar/gkaa1024

33 of 63

https://academy.wikipathways.org/

34 of 63

Pathway databases - coverage

MSigDB: Human MSigDB v2023.2.Hs

19,846 protein coding genes (Ensembl GRCh38.p14)

Genes in at least one pathway of the three databases → 12,960 genes (65%)

35 of 63

File format

  • GMT (Gene Matrix Transposed) file format
    • tab delimited file format
    • each row represents a gene set
    • gene set name | description | gene list (one gene per column)
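
Reading a GMT file needs little more than splitting each row on tabs. The sketch below assumes the layout described above (name, description, then one gene per column); the function name `parse_gmt` and the two example rows are made up for illustration.

```python
import io


def parse_gmt(handle):
    """Parse GMT lines into {gene_set_name: (description, [genes])}."""
    gene_sets = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected name, description and at least one gene: {line!r}")
        name, description, *genes = fields
        # Drop empty cells that trailing tabs can leave behind.
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


# Hypothetical two-row GMT content, held in memory instead of a file.
example = io.StringIO(
    "SET_ONE\tfirst toy set\tGENE1\tGENE2\tGENE3\n"
    "SET_TWO\tsecond toy set\tGENE2\tGENE4\n"
)
sets = parse_gmt(example)
print(sets["SET_TWO"])
```

For a real file, pass an open file handle instead of the `io.StringIO` object. Rows have different lengths, which is why GMT is read line by line rather than as a rectangular table.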

36 of 63

Pathway enrichment

37 of 63

Over Representation Analysis (ORA)

  • Methodology
    • Use parametric statistics to identify differentially regulated molecules, e.g. limma

    • Choose significance thresholds, e.g. FDR < 0.05 and FC > 1.5

    • Use a statistical test to identify annotations over-represented within your list compared to what was assayed, e.g. Fisher’s exact test
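
The thresholding step can be sketched as follows. The gene names and values are hypothetical, the cutoffs are the example values from the slide, and only up-regulation is checked; a real analysis would usually also handle down-regulated genes.

```python
# Hypothetical differential-expression results: gene -> (fold change, FDR-adjusted p-value).
results = {
    "GENE1": (2.40, 0.001),
    "GENE2": (1.49, 0.010),  # fails the fold-change cutoff
    "GENE3": (3.10, 0.200),  # fails the FDR cutoff
    "GENE4": (1.80, 0.030),
}


def select_genes(results, fdr_cutoff=0.05, fc_cutoff=1.5):
    """Return genes passing both cutoffs; only up-regulation is checked in this sketch."""
    return sorted(
        gene
        for gene, (fold_change, fdr) in results.items()
        if fdr < fdr_cutoff and fold_change > fc_cutoff
    )


background = sorted(results)        # everything that was assayed
input_list = select_genes(results)  # the list handed to the over-representation test
print(input_list)
```

The background list is kept alongside the input list because the over-representation test compares the selected genes against everything that was measured, not against the whole genome.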

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

42 of 63

Over Representation Analysis (ORA)

  • N = 25: background list (total number of measured genes in experiment)
  • R = 9: input list (number of changed genes in experiment)
  • n = 9: total number of genes in pathway
  • r = 6: number of changed genes in pathway

Pathway X

Enrichment score (e.g. Z-score)
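
For the numbers on this slide (N = 25, R = 9, n = 9, r = 6), the enrichment can be worked out directly. The sketch below is one possible calculation, not necessarily the one a given tool uses: the one-sided p-value is the upper tail of the hypergeometric distribution (equivalent to a one-sided Fisher's exact test), and the Z-score formula is an assumed, commonly used standardisation based on the hypergeometric mean and variance.

```python
from math import comb, sqrt


def hypergeom_upper_tail(N, R, n, r):
    """P(X >= r) when drawing R genes from N, of which n belong to the pathway."""
    total = comb(N, R)
    upper = min(n, R)
    return sum(comb(n, k) * comb(N - n, R - k) for k in range(r, upper + 1)) / total


def z_score(N, R, n, r):
    """Standardised difference between observed and expected changed genes in the pathway."""
    p = R / N  # overall fraction of changed genes
    expected = n * p
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    return (r - expected) / sqrt(variance)


N, R, n, r = 25, 9, 9, 6
print(f"expected changed genes in pathway: {n * R / N:.2f}")
print(f"one-sided p-value: {hypergeom_upper_tail(N, R, n, r):.4f}")
print(f"Z-score: {z_score(N, R, n, r):.2f}")
```

By chance alone about 3.24 of the 9 pathway genes would be expected to change; observing 6 gives a one-sided p-value of roughly 0.025 and a Z-score of roughly 2.35.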

43 of 63

Over Representation Analysis (ORA)

  • Caveats
    • Threshold
      • what about the transcript with p = 0.050001, FC = 1.4999?
    • Equality
      • transcript-X with p = 0.0000001, FC = 100 is considered equal to transcript-Y with p = 0.049, FC = 1.51
    • Assumption of independence, both between genes and between pathways, inflates significance
    • Ignores relationships between genes/gene products
    • Significance increases with population size

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

44 of 63

Functional Class Scoring (FCS)

  • Methodology
    • Use parametric statistics to determine differential regulation for all molecules e.g. t-distribution statistics
    • Use various statistics to combine gene statistics and determine pathway statistics e.g. Wilcoxon rank sum, Kolmogorov-Smirnov
    • Permutes phenotypes and pathways to determine pathway significance
  • Applications
    • Gene Set Enrichment Analysis (GSEA)
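
The idea behind a Kolmogorov-Smirnov-like pathway statistic can be sketched with an unweighted running sum over a ranked gene list. This is a simplified illustration, not the GSEA implementation: GSEA weights hits by the ranking metric and, as noted above, typically permutes phenotypes, whereas this sketch draws random gene sets of the same size. All gene names and function names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic: maximum deviation from zero along the ranking."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise; the walk ends at zero.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Share of random gene sets of the same size whose |ES| reaches the observed |ES|."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(g in gene_set for g in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Hypothetical ranking, e.g. genes sorted from most up- to most down-regulated.
ranked = [f"GENE{i}" for i in range(1, 11)]
top_set = {"GENE1", "GENE2"}
print(enrichment_score(ranked, top_set))
print(permutation_p_value(ranked, top_set))
```

No significance cutoff is applied to individual genes: every measured gene contributes through its position in the ranking, which is the main difference from ORA.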

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

45 of 63

Functional Class Scoring (FCS)

Subramanian (2005) PNAS

doi: 10.1073/pnas.0506580102

46 of 63

Functional Class Scoring (FCS)

  • Caveats
    • Assumes independence between pathways

    • Reliance on ranking can miss the magnitude of changes between phenotypes, e.g. FC = 10 in sham and FC = 100 in treated may be ranked similarly

    • Ignores relationships between genes/gene products

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

47 of 63

Pathway Topology Based (PTB)

  • Methodology
    • Use various statistics to determine differences in gene-gene interactions (node-edge-node) for all genes (e.g. Pearson’s correlation)

    • Use various statistics to combine gene interaction statistics and determine pathway significance e.g. permutation, hypergeometric distribution

  • Applications
    • pathfindR
    • SPIA
    • PathNet
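
The first step, scoring gene-gene interactions along pathway edges, can be sketched with Pearson's correlation computed per condition. This is only an illustration of the general idea and does not reproduce pathfindR, SPIA or PathNet; the expression values, gene names and the single edge are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression values per gene and condition (four samples each).
control = {"GENE1": [1.0, 2.0, 3.0, 4.0], "GENE2": [2.0, 4.0, 6.0, 8.0]}
treated = {"GENE1": [1.0, 2.0, 3.0, 4.0], "GENE2": [8.0, 6.0, 4.0, 2.0]}

# Pathway edges (node-edge-node) taken from a hypothetical pathway model.
edges = [("GENE1", "GENE2")]

for source, target in edges:
    r_control = pearson(control[source], control[target])
    r_treated = pearson(treated[source], treated[target])
    print(source, target, round(r_control, 3), round(r_treated, 3))
```

In this toy data the two genes are perfectly correlated in control and perfectly anti-correlated in treated, so the edge changes even though GENE1 itself does not. A topology-based method would combine such edge-level statistics into a pathway-level significance.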

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

48 of 63

Pathway Topology Based (PTB) - SPIA

Adi Laurentiu Tarca (2009) Bioinformatics

doi: 10.1093/bioinformatics/btn577

49 of 63

Pathway Topology Based (PTB)

  • Caveats
    • Limited interaction knowledge, i.e. hampered by immature interaction databases (KEGG, BioCarta, Reactome, PantherDB etc.)

    • Lack of cellular and temporal resolution of interactions

García-Campos (2015) Front. Physiol.

doi: 10.3389/fphys.2015.00383

50 of 63

Interpretation

  • Be aware!
    • ORA and FCS do not take pathway topology into account!
    • You don’t know yet where the changes occur in the pathway.
    • If you have pathway models → always look at the pathway diagrams and study the changes to draw the right conclusions!

51 of 63

Tools

  • Many, many different tools to perform pathway analysis!
    • Integrated in resources
    • Standalone applications
    • Packages (R / Python / Perl / etc.)

  • Practical:
    • R-package clusterProfiler
    • implements GSEA and ORA

52 of 63

Interpretation and visualization of results

53 of 63

Analysis results

  • Table view

54 of 63

Analysis results

  • Bar or Dot plots

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

55 of 63

Gene-concept networks

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

56 of 63

Gene-concept networks

Niarakis (2023) Frontiers Immunology

doi: 10.3389/fimmu.2023.1282859

57 of 63

Enrichment maps

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

58 of 63

Tree plots

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

59 of 63

Data visualization in Cytoscape

60 of 63

Data visualization in Cytoscape

61 of 63

Multiple comparisons

Miller (2019) Frontiers Genetics

doi: 10.3389/fgene.2019.00059

62 of 63

Time-series data

Tisoncik (2012) Microbiol Mol Biol Rev.

doi: 10.1128/MMBR.05015-11

63 of 63

Questions?

Martina Summer-Kutmon

martina.kutmon@maastrichtuniversity.nl

Maastricht Centre for Systems Biology (MaCSBio)