1 of 61

Hyperlink Prediction in Biological Networks

Can Chen

02/27/2025

1

2 of 61

Motivation

  • Many biological networks involve multiway interactions

  • Compared to graphs, hypergraphs allow each hyperlink (also called hyperedge) to connect more than two nodes

2

Ecological Networks

Bairey et al. Nat. Commun. (2016)

Genomic Networks

Dotson*, Chen* et al. Nat. Commun. (2022)

Metabolic Networks

Jeony et al. Nature (2000)

3 of 61

Motivation

  • Hypergraphs can represent multidimensional relationships unambiguously
  • For example, if we use the graph model to capture metabolic networks, we are not sure if the metabolites A, B, C, D, E are in the same reaction

3

4 of 61

Motivation

  • In the early 2000s, people started exploring the frequency of interactions between genomic loci to study the genome structure
  • The Hi-C technology has inspired us to consider the human genome as a graph, and many graph techniques have been used on Hi-C data to study the genome's structure

4

3C: One-One

4C: One-Many

5C: Many-Many

=

Hi-C: All-All

Hi-C Contact Map

Lieberman-Aiden et al. Science (2009)

5 of 61

Motivation

  • In 2021, the technology Pore-C came out
  • Pore-C uses sequencing of long reads to identify multi-way contacts in the genome
  • The multi-way contacts identified from Pore-C can be naturally represented as a hypergraph

5

Dotson*, Chen* et al. Nat. Commun. (2022)

Deshpande et al. Nat. Biotech. (2022)

6 of 61

Motivation

  • Analyzing Pore-C data with hypergraph learning can provide more accurate insights in the human genome compared to graph approaches
  • However, many graph techniques cannot be naturally extended to hypergraphs due to the inherent complexity of multiway interactions

6

Dotson*, Chen* et al. Nat. Commun. (2022)

7 of 61

Overview

  • Part I: Hyperlink Prediction (IEEE TNNLS 2024)

  • Part II: Metabolic Network Gap-filling (Nat. Commun 2023 & ongoing work)

  • Part III: Drug Synergy Prediction (Bioinformatics 2025)

  • Part IV: Gene Pathway Identification (ongoing work)

7

8 of 61

Part I: Hyperlink Prediction

8

9 of 61

Hyperlink Prediction

  • Hyperlink prediction is an extension of link prediction
  • The goal of hyperlink prediction is to recover the most likely existent hyperlinks missing from the original hypergraph

9

hypergraph representation

email communication networks

incidence matrix

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

10 of 61

Hypergraph Prediction

  • Mathematically, an unweighted hypergraph where

is the node set and is the

hyperlink set with for

  • For a given potential hyperlink , the goal of hyperlink prediction methods is to learn a function

where ε is a threshold to binarize the continuous value into a label

10

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

11 of 61

Taxonomy

  • Similarity-based methods compute a similarity score such as common neighbors for a candidate hyperlink
  • Probability-based methods use probability theory techniques such as random walks to estimate the likelihood of the existence of a hyperlink

11

similarity-based

probability-based

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

12 of 61

Taxonomy

  • Matrix optimization-based methods exploit different hypergraph matrices such as adjacency matrix, incidence matrix, Laplacian matrix, to formulate matrix optimization problems for hyperlink prediction
  • Deep learning-based methods uses graph/hypergraph neural networks to predict missing hyperlinks

12

matrix optimization-based

deep learning-based

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

13 of 61

Similarity-based Methods

  • Common neighbors (CN) is a classical link prediction method, which quantifies the local similarity of two nodes in a graph

  • CN can be generalized to hyperlink prediction by calculating the average of the pairwise CN indices between the nodes within each hyperlink, i.e., for a given hyperlink

13

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

14 of 61

Probability-based Methods

  • Node2Vec is a random walk-based method that learns a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes
  • Decompose the hypergraph by clique expansion
  • Suppose that the embedding of node is denoted by
  • Given an unseen hyperlink

14

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

15 of 61

Matrix Optimization-based �Methods

  • Coordinated matrix minimization (CMM) alternatively performs nonnegative matrix factorization and least square matching in the adjacency space in order to infer a subset of candidate hyperlinks that are most suitable to fill the target hypergraph
  • To find the missing hyperlinks, CMM solves the following optimization problem by using the expectation–maximization algorithm

15

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

16 of 61

Deep Learning-based Methods

  • Node2Vec-GCN is a simple deep learning-based method
  • Decompose the hypergraph by clique expansion and apply Node2Vec
  • Refine the node embeddings with graph convolutional networks

  • Obtain the reaction embeddings with pooling
  • Fed into a linear layer

16

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

17 of 61

Benchmark Study

17

  • Deep learning-based methods prevail over other methods in hyperlink prediction

Chen et al. IEEE Trans. Neural Netw. Learn. Syst. (2024)

18 of 61

Part II: Metabolic Network Gap-filling

18

19 of 61

Metabolic Networks

  • Metabolism is fundamental for all organismal life, and metabolic networks refer to the complex set of biochemical reactions that take place within a cell or organism
  • Genome-scale metabolic models (GEMs) are powerful tools to predict cellular metabolism and physiological states in living organisms
  • Completeness of GEMs has remained a challenging problem

19

20 of 61

Traditional Methods

  • Many tools have been developed to predict missing reactions in GEMs
  • Optimization-based methods
    • Find inconsistencies between the draft model prediction and experimental data
    • Solve the inconsistencies via optimization tools

  • However, these methods heavily rely on experimental data, which may not be readily available for non-model organisms

20

Vitkin et al. Genome Biol. (2012)

MIRAGE

21 of 61

Hyperlink Prediction

  • We frame the prediction of missing reactions as a hyperlink prediction

  • Nodes represent metabolites and hyperlinks represent reactions
  • We ignore the directionality of the reactions

21

22 of 61

Current Methods

  • Clique Closure-based Coordinated Matrix Minimization (C3MM)
    • An improved version of CMM
    • C3MM includes all candidate hyperlinks during training, so it has the issue of scalability, and the model must be re-trained for new candidate hyperlink sets
  • Neural Hyperlink Predictor (NHP)
    • An improved version of Node2Vec-GCN
    • NHP can handle different candidate hyperlink sets, but it uses graph embedding methods Node2Vec to generate node embedding which can lead to a loss of higher-order information
  • More importantly, both methods were only validated using artificially removed hyperlinks and lack of external validation

22

Sharma et al. IJCAI (2021)

Yadati et al. ICIKM (2021)

23 of 61

CHESHIRE

  • We proposed a new deep learning-based method for hyperlink prediction called CHESHIRE.
  • To balance the sensitivity of the trained model, CHESHIRE requires negative sampling of hyperlinks, which involves creating fake hyperlinks during training

23

24 of 61

CHESHIRE Architecture

24

Chen et al. Nat. Commun. (2023)

25 of 61

CHESHIRE Architecture

25

Chen et al. Nat. Commun. (2023)

26 of 61

CHESHIRE Architecture

  • Initial embeddings

  • Refined embeddings

  • Pooling

26

Chen et al. Nat. Commun. (2023)

27 of 61

Internal Validation

  • We performed two sets of internal validation over GEMs from the entire BiGG database (108 GEMs) based on artificially removed reactions

27

King et al. Nucleic Acids Res (2016)

Chen et al. Nat. Commun. (2023)

28 of 61

Internal Validation First Type

28

Chen et al. Nat. Commun. (2023)

  • We compared CHESHIRE with the state-of-the-art methods C3MM and NHP
  • Since NVM, NHP, and CHESHIRE are deep learning-based methods, negative sampling was required for both training and testing in this validation. C3MM does not require negative sampling, but we intentionally introduced negative reactions into its testing set for a fair comparison
  • Based on the boxplots, CHESHIRE significantly outperforms the other three methods on all classification metrics.

29 of 61

Internal Validation Second Type

29

Chen et al. Nat. Commun. (2023)

  • First, we used a genus-specific reaction pool, which contains around 200-800 reactions, for each GEM
  • We calculated the percentage of recovered reactions by adding the top 25, 50, 100, and N reactions with the highest confidence scores, where N is the total number of removed reactions (which is different for each GEM)
  • Based on the boxplots, CHESHIRE achieves the highest recovery rate at all cutoffs

Top 25

Top 50

Top 100

Top N

30 of 61

Internal Validation Second Type

30

Chen et al. Nat. Commun. (2023)

  • We evaluated the performance using a much larger reaction pool, that is the entire BiGG reaction pool, which contains more than 15,000 reactions
  • We observed a very similar performance as above. As expected, using the BiGG reaction pool would undermine the performance of all methods

Top 25

Top 50

Top 100

Top N

31 of 61

External Validation

  • We performed the external validation via phenotypic prediction by testing whether adding predicted reactions to the GEM can improve the GEM’s power in predicting metabolic phenotypes such as fermentation

31

32 of 61

Phenotypic Prediction

32

Machado Nucleic Acids Res (2018)

24 organisms

9 metabolites

  • We used a fermentation dataset consisting of 9 metabolites and 24 bacterial organisms. The draft GEMs were constructed using a recent automatic reconstruction pipeline called CarveMe
  • The draft GEMs have poor performance in phenotypic prediction

33 of 61

Phenotypic Prediction

33

Fermentation Dataset (CarveMe)

Chen et al. Nat. Commun. (2023)

  • We considered four groups of models here. The first group is the draft GEMs produced by CarveMe, which we called CarveMe. The second and third groups are gap-filled GEMs by adding the top 200 NHP-predicted and CHESHIRE-predicted reactions, which we called NHP200 and CHESHIRE200. The last group is a gap-filled GEM by adding 200 random reactions, which we called Random200
  • We observed that the performance of NHP200 is the same as that of the draft GEMs. However, adding 200 CHESHIRE-predicted reactions significantly improves the phenotypic prediction of fermentation, and this improved performance is not simply due to having more reactions, by showing the significant differences between CHESHIRE200 and Random200

34 of 61

Phenotypic Prediction

34

  • To test whether the improvement of CHESHIRE over CarveMe-reconstructed draft models is generalizable to GEMs from other sources, we performed the same fermentation test on ModelSEED-reconstructed draft models
  • Similar results were observed, where CHESHIRE-200 improves the predictions over the draft models, and the draft models plus randomly selected 200 reactions. NHP still shows no improvement

Fermentation Dataset (ModelSEED)

Chen et al. Nat. Commun. (2023)

35 of 61

Phenotypic Prediction

35

Amino Acid Dataset (CarveMe)

  • To test if CHESHIRE can fill other types of gaps, we assessed CHESHIRE for predicting secretions of amino acids
  • We conducted phenotypic prediction on an amino acid dataset consisting of 20 amino acids and 25 bacterial organisms
  • Similar to the fermentation dataset, we found that adding 200 CHESHIRE-predicted reactions significantly improves the phenotypic prediction of amino acids

Chen et al. Nat. Commun. (2023)

36 of 61

Ongoing Work: MuSHIN

MuSHIN integrates transformer-based molecular embeddings from SMILES representations with a dynamic attention mechanism

36

37 of 61

Preliminary Results: Internal Validation

  • Internal validation over 108 BiGG models
  • MuSHIN significantly outperforms CHESHIRE and CLOSEgaps over all classification metrics
  • We are currently working on external validation

37

38 of 61

Summary and Significance

  • The performance of CHESHIRE has been rigorously examined through both internal and external validations over GEMs of a large set of microorganisms
  • To the best of our knowledge, the benchmark using fermentation and amino acid data has not been performed for previous topology-based hyperlink prediction methods

38

39 of 61

Part III: Drug Synergy Prediction

39

40 of 61

Drug Synergy

  • Combination therapies, involving multiple drugs, offer potential benefits over traditional single-drug approaches
  • However, optimizing drug combinations can be challenging, as poorly chosen combinations may lead to adverse effects

40

41 of 61

Traditional Methods

  • Identification of effective combination drugs relied on clinical experience is time-consuming and prone to trial and error
  • High-throughput screening has emerged as an affordable strategy, but certain limitations persist

41

42 of 61

Hyperlink Prediction

  • We treat drug synergy prediction as a hyperlink prediction problem, where drugs and cell lines are represented as nodes, and drug-drug-cell line triplets are represented as hyperlinks

42

43 of 61

Current Method

  • HypergraphSynergy is the first hyperlink prediction method used for drug synergy prediction
  • However, HypergraphSynergy only considers drug-drug-cell line triplets, failing to capture intricate nature of biomedical interactions
  • More importantly, its ability to generalize, especially in the context of novel drug combinations, remains a challenge

43

Liu et al.. Bioinformatics (2022)

44 of 61

HERMES

  • We propose a novel deep hypergraph learning method – Heterogeneous Entity Representation for MEdicinal Synergy prediction (HERMES) – to enhance drug synergy prediction
  • HERMES integrates a variety of data sources, including drug chemical properties, cell line gene expressions, and interactions between drugs and disease indications

44

Wu, …, Chen. Bioinformatics (2025)

45 of 61

HERMES

45

Wu, …, Chen. Bioinformatics (2025)

46 of 61

Datasets

  • Drug synergy data:
    • O’Neil dataset (18,950 samples of Loewe synergy scores for 38 drugs and 39 cancer cell lines)
    • NCI-ALMANAC dataset (74,139 measurement samples of ComboScores for 87 drugs across 55 cancer cell lines)
  • Drug molecular structure SMILES from the PubChem database
  • Gene expression in cancer cell lines from the Cell Lines Project within the COSMIC database
  • Disease indications from PrimeKG, a multi-modal knowledge graph designed to support precision medicine

46

47 of 61

Setting

  • Five unique partitioning strategies to ensure comprehensive evaluation:
    • Random: Samples are randomly divided
    • Target: Target cell lines are not present in the training
    • DrugComb: Drug combinations are not present in the training
    • DrugSingle: One drug in each combination is not present in the training
    • DrugDouble: Two drugs in each combination is not present in the training

47

48 of 61

Performance

48

NCI-ALMANAC dataset

Wu, …, Chen. Bioinformatics (2025)

49 of 61

Performance

49

O’Neil dataset

Wu, …, Chen. Bioinformatics (2025)

50 of 61

Performance

50

NCI-ALMANAC dataset

Wu, …, Chen. Bioinformatics (2025)

51 of 61

Summary and Future Work

  • HERMES achieves a superior performance on two drug synergy datasets compared to HypergraphSynergy and other methods
  • We will consider improving the scope of our analysis by incorporating a wider array of heterogeneous data sources
    • Integrate chemical, physiological, and clinical data (e.g., EHR data)
    • Drug knowledge hypergraphs
    • Multi-drug synergy, drug side effects, and drug-gene-disease associations
  • We will validate our predicted results with external validation (e.g., wet lab experiments)

51

52 of 61

Part IV: Gene Pathway Identification�(Ongoing Work with Prof. Qingyun Liu)

52

53 of 61

Gene Pathways

  • A gene pathway refers to a network of genes and their products that interact to perform specific biological functions or regulate cellular processes

53

54 of 61

Gene Pathways in Bacteria

  • In bacteria, gene pathways are integral to processes such as metabolism, stress responses, virulence, and antibiotic resistance
  • However, existing gene pathways remain incomplete due to genetic diversity, environmental influences, and their dynamic nature, with new discoveries still largely dependent on experimental investigations

54

55 of 61

Existing Methods

  • Independent component analysis
  • Gene pathways can be represented by hypergraphs
  • Predicting missing gene pathways -> hyperlink prediction

55

Sastry et al. Nature Comm. (2019)

56 of 61

Datasets

  • We focus on characterizing unknown gene pathways in Mycobacterium tuberculosis (Mtb)
  • We collected 165 gene pathways/clusters from the KEGG database and 869 gene pathways/clusters from the STRING database to construct a gene hypergraph
  • RNAseq (https://www.nature.com/articles/s41467-024-47410-5)
  • TNseq (https://www.biorxiv.org/content/10.1101/2021.03.05.434127v2)
  • CRISPRi (https://www.nature.com/articles/s41564-022-01130-y)

56

57 of 61

Preliminary Results: Internal Validation

  • The procedure of internal validation is similar to the previous studies
  • We used CHESHIRE for hyperlink prediction and compared the results with Random

57

58 of 61

External Validation

58

CHESHIRE

59 of 61

Preliminary Results: External Validation

59

Experiment validated pathways

BiGG pathways

  • We used CHESHIRE for hyperlink prediction and compared the results with Random

60 of 61

Summary and Future Work

  • We conducted a preliminary study on hyperlink prediction using CHESHIRE, and the results demonstrate that the model performs well in predicting artificially removed and unseen gene pathways/clusters
  • SDSS Seed Grant
    • Multimodal hypergraph neural networks
    • Contrastive learning
    • Model interpretability and explainability
    • Gene pathway de novo
    • Wet lab experiments

60

61 of 61

Acknowledgement

  • Dr. Yang-Yu Liu (HMS and BWH)
  • Dr. Chen Liao (SMK Cancer Center)
  • Dr. Qingyun Liu (UNC-Chapel Hill)
  • Dr. Ren Wang (IIT)
  • Dr. Jun Wen (HMS)
  • Dr. Shuai Gao (BWH)

  • Jiawei Wu (NUS)
  • Mingyuan Yan (NUS)
  • Yixiao Chen (UNC)

61