AI for pathogen deep-sequence analytics
An AI4Health CDT PhD project (student to be recruited by Dec 2025)
Scholarship: Home fee + stipend
Supervisors (50:50 split)
Yingzhen Li (Department of Computing, Imperial College London)
http://yingzhenli.net/home/en/
Oliver Ratmann (Department of Mathematics, Imperial College London)
https://profiles.imperial.ac.uk/oliver.ratmann05
Project goal
The overall goal of this PhD is to advance multimodal deep sequence architectures (e.g., attention-based transformers, advanced RNNs) optimized for the analysis of pathogen sequence and deep-sequence data. Next-generation sequencing (e.g. Illumina, PacBio, Nanopore) has widely replaced Sanger sequencing as the standard in healthcare and public health investigations. However, the resulting wealth of data remains largely unused due to methodological challenges in generalising existing maximum-likelihood type frameworks. This project will focus on key challenges in three core application areas: tree inference, time since infection estimation, and drug resistance prediction.
Summarise main AI challenges/ advances expected
Main AI approaches to be used
What innovation/experience will the student benefit from in addition to core CDT training
Integration into a vibrant research community in statistical deep learning at Imperial (Li >10 people, Ratmann >5 people), regular community journal clubs (CSML), regular network meetings (MLGH), supported further through in-house chat space (Zulip).
Feasible student background
Knowledge of statistical machine learning and deep generative models, and coding experience with e.g. PyTorch, JAX, NumPy, NumPyro, or Stan.
What will each supervisor contribute to training
Yingzhen will contribute experience in deep learning especially for sequential data, and uncertainty quantification in deep learning. Oliver will contribute experience in statistical modelling and machine learning, especially in public good applications and phylogenetics.
=======
Project proposal (short version)
With the experience and insights gained after the COVID-19 pandemic, pathogen genomics are now recognized as a central tool for monitoring and responding to novel pandemic threats (see WHO’s Global genomic surveillance strategy). This project will focus on AI methods development to unlock breakthrough global health and healthcare analytics that we know can be achieved through the full resolution of pathogen deep-sequence data, but not through standard representations that condense the richness of the generated data into a single consensus sequence (see our Nature paper Kraemer et al 2025 AIP). Characteristically, deep-sequence reads are short (<<1 kb), and so the problem is not only vertical (large numbers of unique reads across many patients) but also horizontal (data on hundreds of sliding genomic windows rather than just one large alignment). The project milestones will focus on developing methodology that can be deployed equitably across the world. We have in-house access to data from the globally largest deep-sequence database (PANGEA-HIV, >40k individuals).
Milestone 1 - Tree estimation (generic): Standard phylogenetic tree inference is time-consuming with traditional methods (e.g. IQTree, FastTree or BEAST), even when the deep-sequence data from one sample are reduced to a single consensus sequence. Recent work within the Machine Learning & Global Health network that we co-founded has shown how trees can be mapped to integer vectors (Penn et al), substantially simplifying how deep learning methodology can be deployed for tree inference. We aim to develop structured sequence processing techniques based on innovative tree embedding schemes and new attention mechanisms tailored for phylogenetic trees (Shiv et al), and expect these to surpass state-of-the-art methodologies such as PhyloDeep.
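To illustrate the idea of mapping trees to integer vectors, the following is a minimal sketch of one simple encoding, in which leaf i is attached to the branch indexed by the i-th vector entry. It is a simplified illustration written for this document and is not claimed to be the exact scheme of Penn et al; the function name `decode_tree` and the node numbering are assumptions.

```python
def decode_tree(v):
    """Decode an integer vector into a rooted binary tree (Newick topology).

    Assumed illustrative encoding: the tree starts as the single leaf 0.
    For i = 1, 2, ..., entry v[i-1] in {0, ..., 2*(i-1)} names an existing
    node (in creation order), and leaf i is attached to the branch directly
    above that node.
    """
    children = {0: []}   # node id -> list of child ids (empty for leaves)
    parent = {0: None}   # node id -> parent id (None for the root)
    label = {0: 0}       # leaf node id -> leaf label
    root, next_id = 0, 1
    for i, j in enumerate(v, start=1):
        if not 0 <= j <= 2 * (i - 1):
            raise ValueError(f"entry {j} at position {i - 1} is out of range")
        new_leaf, new_internal = next_id, next_id + 1
        next_id += 2
        children[new_leaf] = []
        label[new_leaf] = i
        # Split the branch above node j with a new internal node.
        old_parent = parent[j]
        children[new_internal] = [j, new_leaf]
        parent[new_internal] = old_parent
        parent[j] = new_internal
        parent[new_leaf] = new_internal
        if old_parent is None:
            root = new_internal
        else:
            siblings = children[old_parent]
            siblings[siblings.index(j)] = new_internal

    def newick(node):
        if not children[node]:
            return str(label[node])
        return "(" + ",".join(newick(c) for c in children[node]) + ")"

    return newick(root) + ";"


print(decode_tree([0, 2]))  # ((0,1),2);
```

Under this encoding every valid vector of length n-1 yields a distinct rooted binary tree on n labelled leaves, so a fixed-length integer vector can serve as the input or output of a standard sequence model.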
Milestone 2 - Time since infection estimation (core application 1): Incidence of nearly all major pandemic threats has been declining with improving care and prevention, especially in Africa, but this also makes it increasingly challenging to estimate incidence statistically from available cohort data and sample sizes. Pathogens mutate under a fairly predictable molecular clock, and so the observed diversity between viral strains isolated from one patient provides information on the time of infection. Existing algorithms harnessing low-dimensional features (Golubchik et al) have poor out-of-distribution time-since-infection estimation accuracy. Our approach centers on 1) innovating a multi-modal model to combine both phylogenetic and genomic data, 2) aligning the embeddings across domains, and 3) integrating a structured cross-attention mechanism to use the information from both domains efficiently. We have already acquired exceptionally valuable new training data from >500 patients with very narrow intervals during which infection could have occurred.
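As an illustration of the kind of low-dimensional feature mentioned above, the sketch below computes the mean pairwise diversity among aligned reads from one sample and converts it to a time estimate under a naive linear clock. The helper names, the toy reads and the rate value are hypothetical and uncalibrated; this is a sketch of the baseline idea, not the method of Golubchik et al or the approach proposed here.

```python
from itertools import combinations


def pairwise_distance(a, b):
    """Proportion of differing sites between two aligned reads.

    Sites where either read has a gap or ambiguous base are skipped.
    Returns None if no comparable site exists.
    """
    sites = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not sites:
        return None
    return sum(x != y for x, y in sites) / len(sites)


def mean_pairwise_diversity(reads):
    """Average pairwise distance over all comparable pairs of reads."""
    dists = [pairwise_distance(a, b) for a, b in combinations(reads, 2)]
    dists = [d for d in dists if d is not None]
    if not dists:
        raise ValueError("no comparable read pairs")
    return sum(dists) / len(dists)


def naive_time_since_infection(diversity, rate_per_year):
    """Naive linear-clock estimate: diversity is assumed to grow as rate * time."""
    return diversity / rate_per_year


# Toy aligned reads from one genomic window of one sample (made up).
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACCTACGT"]
diversity = mean_pairwise_diversity(reads)
# The rate below is a placeholder, not an estimate for any pathogen.
years = naive_time_since_infection(diversity, rate_per_year=0.01)
```

A single summary statistic like this discards most of the information in the reads, which is one way to see why richer multi-modal models could generalise better.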
Milestone 3 - Drug resistance prediction (core application 2): A core application of deep-sequence data is drug-resistance classification based on both major and minor pathogen genomic variants within a sample. Current approaches mainly search for mutations listed e.g. in the Stanford Drug Resistance Database, but this misses genetic diversity in African populations that is unrelated to drug resistance and ignores the protein folding process that induces non-trivial correlations, so the accuracy of current approaches in determining clinical drug resistance in Africa remains limited. This third milestone will use our deep-sequence alignment video data to build drug resistance prediction AI models, expanding on the technology from milestones 1 and 2. Importantly, we have just completed population-based data collection of longitudinal viral loads to measure clinical drug resistance, and so have large-scale population-based data for supervised learning during dolutegravir roll-out in Africa, a critical international attempt to reverse rising HIV drug resistance prevalence.
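To make the notion of major and minor variants concrete, the following minimal sketch tabulates per-position base frequencies from aligned reads and flags minor variants above a frequency threshold. The function names, the toy reads and the 5% threshold are illustrative assumptions, not settings used in the project or in any existing pipeline.

```python
from collections import Counter


def variant_frequencies(reads):
    """Per-position base frequencies from equal-length aligned reads.

    Gaps and ambiguous bases are ignored when counting.
    """
    freqs = []
    for pos in range(len(reads[0])):
        counts = Counter(r[pos] for r in reads if r[pos] in "ACGT")
        total = sum(counts.values())
        freqs.append({b: c / total for b, c in counts.items()} if total else {})
    return freqs


def call_variants(freqs, minor_threshold=0.05):
    """Return (position, major base, sorted minor bases) for covered positions.

    minor_threshold is an assumed cut-off used here to separate minor
    variants from sequencing noise.
    """
    calls = []
    for pos, f in enumerate(freqs):
        if not f:
            continue
        major = max(f, key=f.get)
        minors = sorted(b for b, p in f.items() if b != major and p >= minor_threshold)
        calls.append((pos, major, minors))
    return calls


# Toy sample: nine reads carry G at position 2, one read carries A.
reads = ["ACGT"] * 9 + ["ACAT"]
calls = call_variants(variant_frequencies(reads))
```

A consensus sequence would keep only the major base at each position; the minor variant at position 2 is exactly the information that is lost.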
Project proposal (long version)
With the experience and insights gained after the COVID-19 pandemic, pathogen genomics are now a central tool in healthcare and for responding to pandemic threats (https://www.who.int/publications/i/item/9789240046979). This project will focus on AI methods development to unlock types of breakthrough global health and healthcare analytics that we know can be achieved through the full resolution in pathogen deep-sequence data, but not through standard consensus sequence data representations that condense the richness of the generated data into a single consensus sequence (https://www.nature.com/articles/s41586-024-08564-w).
Deep-sequence reads are short (<<1 kb). The problem is not only vertical (large numbers of reads across patients) but also horizontal (hundreds of sliding genomic windows across genomes). The project milestones will focus on developing methodology that can be deployed equitably across the world. We have access to data from the globally largest deep-sequence database (PANGEA-HIV), to UKHSA deep-sequence data, and to in-depth clinical data.
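The horizontal dimension can be illustrated with a minimal sketch that tiles a genome into overlapping windows. The genome length, window width and step size below are illustrative assumptions, and `sliding_windows` is a hypothetical helper rather than part of any existing pipeline.

```python
def sliding_windows(genome_length, width=250, step=10):
    """Return half-open (start, end) coordinates of overlapping windows."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    last_start = max(genome_length - width, 0)
    return [
        (start, min(start + width, genome_length))
        for start in range(0, last_start + 1, step)
    ]


# Assumed genome length of 9,000 bases, for illustration only.
windows = sliding_windows(9000)
print(len(windows), windows[0], windows[-1])  # 876 (0, 250) (8750, 9000)
```

Each window gives rise to its own read alignment, so a single sample contributes hundreds of alignments rather than one.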
Milestone 1 - Tree estimation: Standard phylogenetic tree inference is time-consuming with traditional methods (e.g. IQTree, FastTree or BEAST). Within the network https://mlgh.net/ that we co-founded, we showed how trees can be mapped to integer vectors, substantially simplifying AI for tree inference, and we aim to develop structured sequence processing techniques based on innovative tree embedding schemes and new attention mechanisms tailored for phylogenetic trees (https://papers.nips.cc/paper_files/paper/2019/hash/6e0917469214d8fbd8c517dcdc6b8dcf-Abstract.html). For validation, we will be able to use phylogenies from hundreds of thousands of sequence alignments created with likelihood-based methods from diverse patients and stored in-house at Imperial.
Milestone 2 - Time since infection estimation: Incidence of most infectious diseases is declining, but this makes it hugely challenging to estimate incidence with traditional follow-up cohort methods and shrinking incidence-cohort sample sizes. Pathogens mutate under a fairly predictable molecular clock, and so observed genetic diversity within one patient informs the time since infection. Our approach centers on 1) innovating multi-modal models to combine both phylogenetic and genomic data, 2) aligning the embeddings across domains, and 3) integrating a structured cross-attention mechanism to use the information from both domains efficiently. For training and validation, we have exceptionally valuable novel training data from UKHSA and Botswana with very narrow intervals during which infection could have occurred.
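For orientation, the sketch below shows plain scaled dot-product cross-attention in NumPy, with embeddings from one modality (here called read embeddings) attending to embeddings from the other (here called tree embeddings). All shapes, names and the random weights are assumptions for illustration; the structured cross-attention mechanism to be developed in the project is not specified here and may differ substantially.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to the context.

    queries: (n_q, d_q) embeddings from one modality
    context: (n_c, d_c) embeddings from the other modality
    w_q, w_k, w_v: projections into a shared d-dimensional space
    """
    q = queries @ w_q                         # (n_q, d)
    k = context @ w_k                         # (n_c, d)
    v = context @ w_v                         # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_c)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
read_emb = rng.normal(size=(6, 16))  # 6 hypothetical read/window embeddings
tree_emb = rng.normal(size=(4, 8))   # 4 hypothetical tree-node embeddings
d = 12                               # assumed shared attention dimension
w_q = rng.normal(size=(16, d))
w_k = rng.normal(size=(8, d))
w_v = rng.normal(size=(8, d))
fused, weights = cross_attention(read_emb, tree_emb, w_q, w_k, w_v)
```

The projections `w_q`, `w_k` and `w_v` map both modalities into one shared space, which is where the alignment of embeddings across domains would enter in a trained model.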
Milestone 3 - Drug resistance prediction: A core application of deep-sequence data is drug-resistance (DR) classification based on both major and minor pathogen genomic variants within a sample. Current approaches miss genetic diversity that is unrelated to DR and ignore the protein folding process. We will use our deep-sequence alignment video data and structural information to build DR prediction AI models, training on large-scale population-based clinical viral load data. We just completed population-based viral load surveys to measure clinical drug resistance (https://www.nature.com/articles/s41564-023-01530-8), and so have large-scale population-based data for supervised learning during dolutegravir roll-out in Africa, a critical international attempt to reverse rising HIV drug resistance prevalence.
Data resources
Deep-sequence data will come from the PANGEA-HIV consortium, which maintains the globally largest deep-sequence database (>40k patients) through core funding from the Bill & Melinda Gates Foundation (>8m USD to date), and from UKHSA. Oliver holds UKHSA accreditation and has access to UKHSA deep-sequence data. For milestone 1, Oliver is on the PANGEA-HIV executive committee and a large part of the data is mirrored at Imperial. For milestones 2-3, all additional patient meta-data (age, sex, testing history, LAg avidity data, viral load measurements) are at Imperial through existing cooperation agreements with the Botswana-Harvard AIDS Partnership and the Rakai Health Sciences Program. Co-investigators are willing to share newly generated data of interest.
1-3 references to illustrate aims, approach, data
Chen and Li ICLR 2023 https://openreview.net/forum?id=jPVAFXHlbL
Yoon et al. ICLR 2023 https://openreview.net/forum?id=bHW9njOSON
Switching dynamical systems, Balsells-Rodas et al. ICML 2024 https://arxiv.org/abs/2305.15925
In progress work - marrying Gaussian processes and RNNs/Mamba, first step work here (NeurIPS 2025): https://arxiv.org/abs/2502.08736
What impact could the data have on health and what are the main challenges involved
Milestone 1: Phylogenetics are a staple workhorse in modern healthcare globally but are computationally intensive and remain underdeveloped, which this project seeks to address.
Milestone 2: Incidence estimation is critical to assess the global burden of diseases. Deep-sequence approaches could provide an important solution to incidence estimation within existing sample sizes.
Milestone 3: The discontinuation of large-scale, multi-billion USD global aid programs and treatment interruptions for >25 million people will likely result in unprecedented prevalence of drug resistance and corresponding healthcare needs, in addition to the approx. 4 million people currently affected.
Ethics/data access
Data sharing agreements are in place, to which the student will be added following accreditation. For this, the student will need to complete a standard human subject research training course.