1 of 21

Host Prediction

By Malte and Varada, 23.09.2024

2 of 21

Background

Viruses affect microbial communities and therefore their environments THROUGH their hosts.

Release organic matter

Auxiliary metabolic genes

3 of 21

Background

Ideally, you would have isolated bacteria to test phages on… but we only have our sequence data

The host “signals” we look for in that data are based on biological interactions

4 of 21

What are some biological interactions?

Adsorption - attachment

Insertion of the genome into the cell

Horizontal gene transfer

Defense/anti-defense mechanisms

  • CRISPR
  • Restriction/modification

Using cellular machinery

  • tRNA
  • Ribosome binding sites
  • Regulatory RNAs
  • Auxiliary metabolic genes
  • Codon usage
  • Modifying stress response

LYSIS

5 of 21

Prophages/viral genes

Ecological Dynamics

Host Defense Mechanisms

Viral genes inside host genome

Matching Coverage Profiles

K-mer profiles

CRISPR spacers

Transduction

Auxiliary Metabolic Genes

Host genes inside viral genome

Host genome

Viral genome

Metabolic gene?

Homology Based

Non-Homology Based

tRNAs

Using cell machinery
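
As an illustration of the CRISPR spacer signal above, here is a minimal sketch that checks whether spacers from candidate hosts occur in a viral genome on either strand. The function names and the toy sequences are hypothetical, and exact string matching is an assumption made for brevity; real workflows typically use an aligner and allow a small number of mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def spacer_hits(spacers_by_host: dict[str, list[str]], viral_genome: str) -> dict[str, int]:
    """Count, per host, how many spacers match the viral genome exactly (either strand)."""
    genome = viral_genome.upper()
    hits = {}
    for host, spacers in spacers_by_host.items():
        count = 0
        for spacer in spacers:
            spacer = spacer.upper()
            if spacer in genome or reverse_complement(spacer) in genome:
                count += 1
        if count:  # hosts without any matching spacer are left out
            hits[host] = count
    return hits


# Toy data (hypothetical sequences, much shorter than real spacers).
viral_genome = "ATGCGTACGTTAGCCGATAGGCTTAACG"
spacers_by_host = {
    "host_A": ["CGTACGTTAG", "TTTTTTTTTT"],  # first spacer matches the forward strand
    "host_B": ["CGTTAAGCCT"],                # matches the reverse strand
    "host_C": ["GGGGGGGGGG"],                # no match
}
hits = spacer_hits(spacers_by_host, viral_genome)
print(hits)
```

Hosts with no matching spacer simply get no prediction, which is why this kind of signal tends to be specific but covers few phages.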

6 of 21

Where to find the hosts?

Database

Prokaryotic fraction of your metagenome

“Bins” or MAGs

CRISPR spacers DB

RefSeq

IMG/VR / MGnify

7 of 21

“Host-based” vs. “Phage-based”

Phage-host

Phage-phage

This is a feature of RaFAH also!

8 of 21

Disadvantages - homology-based

  1. A precision/recall (sensitivity) tradeoff
  2. Simply not enough matches
  3. No CRISPR arrays found?

One does not simply

blast a host
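
To make the tradeoff above concrete, here is a small sketch that computes precision and recall for two hypothetical predictors: a strict one that only predicts when it has a strong match, and a lenient one that always predicts. The phage names, host labels and the definition of recall (correct predictions divided by all phages with a known host) are assumptions for illustration only.

```python
from typing import Optional


def precision_recall(predicted: dict[str, Optional[str]], known: dict[str, str]) -> tuple[float, float]:
    """Precision = correct / predictions made; recall = correct / phages with a known host."""
    made = {phage: host for phage, host in predicted.items() if host is not None}
    correct = sum(1 for phage, host in made.items() if known.get(phage) == host)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(known) if known else 0.0
    return precision, recall


# Hypothetical ground truth (host genus per phage).
known = {
    "phage1": "Escherichia",
    "phage2": "Bacillus",
    "phage3": "Vibrio",
    "phage4": "Pseudomonas",
}

# Strict predictor: only one confident call, the rest left unassigned.
strict = {"phage1": "Escherichia", "phage2": None, "phage3": None, "phage4": None}

# Lenient predictor: always makes a call, half of them wrong.
lenient = {
    "phage1": "Escherichia",
    "phage2": "Bacillus",
    "phage3": "Escherichia",
    "phage4": "Vibrio",
}

strict_scores = precision_recall(strict, known)
lenient_scores = precision_recall(lenient, known)
print("strict :", strict_scores)
print("lenient:", lenient_scores)
```

The strict predictor is always right when it speaks but leaves most phages without a host, which is the typical profile of a homology-based signal with too few matches.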

9 of 21

Disadvantages - non-homology

High recall, but matches many hosts!

% of hosts predicted correctly
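
A rough sketch of why composition-based signals score every host: the code below builds k-mer frequency profiles and ranks candidate hosts by Euclidean distance to the viral profile. The function names, the toy sequences, the choice of k and the distance measure are all assumptions for illustration; published tools use their own measures and much longer sequences.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq: str, k: int = 4) -> list[float]:
    """Relative frequency of every possible ACGT k-mer, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)  # ignores k-mers with ambiguous bases
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_hosts(virus: str, hosts: dict[str, str], k: int = 4) -> list[tuple[float, str]]:
    """Return (distance, host) pairs sorted from most to least similar."""
    virus_profile = kmer_profile(virus, k)
    return sorted((euclidean(virus_profile, kmer_profile(seq, k)), name)
                  for name, seq in hosts.items())


# Toy sequences (hypothetical); k=2 because they are far too short for larger k.
virus = "ATATATATGCATATAT"
hosts = {
    "at_rich_host": "ATATTATAATATATTA",
    "gc_rich_host": "GCGCGGCCGCGCCGGC",
}
ranking = rank_hosts(virus, hosts, k=2)
for distance, name in ranking:
    print(f"{name}: {distance:.3f}")
```

Every host receives a distance, so there is always a "best" host; deciding where to cut off is what drives the number of spurious matches.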

10 of 21

Machine learning methods for phage-host interactions

11 of 21

Many methods – all have biases

Nie et al., Briefings in Bioinformatics, 2024

12 of 21

Many methods – all have biases

Roux et al., PLOS Biology, 2023

13 of 21

ML methods for phage-host interactions (PHI)

  • Which features are informative for PHI

  • Training data and some related caveats

  • Which ML algorithms are used to predict PHI

14 of 21

Informative features for PHI

  • WIsH: 8th order Markov models of host genomes (k-mers)

  • RaFAH: viral proteins mapped to protein families

  • Boeckaerts et al.: sequence and structure of receptor-binding proteins
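
To illustrate the Markov-model idea behind the first bullet, here is a minimal sketch: train one fixed-order Markov model per host genome, then score a viral contig by its average log-likelihood under each model. This is not the WIsH implementation; the function names, the pseudocount and the toy sequences are assumptions, and the order is set to 2 here only because the toy genomes are tiny (the slide states order 8 for WIsH).

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(genome: str, order: int, pseudocount: float = 1.0):
    """Count (context -> next base) transitions and return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    genome = genome.upper()
    for i in range(order, len(genome)):
        context, nxt = genome[i - order:i], genome[i]
        if nxt in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][nxt] += 1

    def log_prob(context: str, nxt: str) -> float:
        ctx = counts.get(context, {})  # unseen contexts fall back to a uniform estimate
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(nxt, 0.0) + pseudocount) / total)

    return log_prob


def mean_log_likelihood(contig: str, log_prob, order: int) -> float:
    """Average per-base log-likelihood of the contig under one host model."""
    contig = contig.upper()
    scores = [log_prob(contig[i - order:i], contig[i])
              for i in range(order, len(contig)) if contig[i] in ALPHABET]
    return sum(scores) / len(scores) if scores else float("-inf")


# Toy host genomes and contig (hypothetical sequences).
ORDER = 2
host_genomes = {
    "at_rich_host": "ATATTATAATATATTA" * 5,
    "gc_rich_host": "GCGCGGCCGCGCCGGC" * 5,
}
contig = "ATATATTATA"

models = {name: train_markov(seq, ORDER) for name, seq in host_genomes.items()}
scores = {name: mean_log_likelihood(contig, model, ORDER) for name, model in models.items()}
best_host = max(scores, key=scores.get)
print(scores, best_host)
```

Dividing by contig length keeps scores comparable between contigs of different sizes; comparing scores between host models of very different genomes would need further normalization, which this sketch omits.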

15 of 21

Informative features for PHI

Boeckaerts et al., Sci Rep, 2021

16 of 21

Training data – the good

Roux et al., Nat Biotechnol, 2019

17 of 21

Training data – the bad

Camargo et al., NAR, 2023

IMG/VR 4 database

18 of 21

Training data – the ugly

  • Very skewed datasets (most phages concentrated on few hosts)

-> subsample large datasets

  • No negative examples

-> random sampling of hosts distant to known hosts

-> model-based sampling
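
The first two remedies above can be sketched as follows: cap the number of phages kept per host, then draw one putative negative host per phage from hosts outside the families of its known hosts. The function names and the toy interaction list are hypothetical, and using "different family" as the notion of "distant" is an assumption made for illustration. Sampled negatives are untested pairs, not confirmed non-interactions.

```python
import random
from collections import defaultdict


def subsample_per_host(pairs, max_per_host, rng):
    """Keep at most max_per_host phages for each host to reduce skew."""
    by_host = defaultdict(list)
    for phage, host in pairs:
        by_host[host].append(phage)
    kept = []
    for host in sorted(by_host):
        phages = by_host[host]
        chosen = phages if len(phages) <= max_per_host else rng.sample(phages, max_per_host)
        kept.extend((phage, host) for phage in chosen)
    return kept


def sample_negatives(pairs, host_family, rng):
    """For each phage, pick one random host from a family it is not known to infect."""
    known_families = defaultdict(set)
    for phage, host in pairs:
        known_families[phage].add(host_family[host])
    negatives = []
    for phage in sorted(known_families):
        candidates = sorted(host for host, family in host_family.items()
                            if family not in known_families[phage])
        if candidates:
            negatives.append((phage, rng.choice(candidates)))
    return negatives


# Toy data: a skewed set of known interactions (phage names are hypothetical).
host_family = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Bacillus subtilis": "Bacillaceae",
    "Vibrio cholerae": "Vibrionaceae",
}
positives = [(f"phage{i}", "Escherichia coli") for i in range(6)]
positives.append(("phage6", "Bacillus subtilis"))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
balanced = subsample_per_host(positives, max_per_host=2, rng=rng)
negatives = sample_negatives(balanced, host_family, rng)
print(balanced)
print(negatives)
```

Model-based sampling, the third remedy, would replace the uniform `rng.choice` with a draw guided by a trained model and is not shown here.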

19 of 21

Machine learning algorithms - GNN

Hamilton et al., arXiv, 2017

20 of 21

Machine learning algorithms - GNN

21 of 21

Machine learning algorithms - Random Forest

https://catalyst.earth/catalyst-system-files/help/concepts/focus_c/oa_classif_intro_rt.html