1 of 6

Amino acid transfer learning for antibody binding prediction

Taft and colleagues (J. M. Taft, et al., 2022) created mutagenesis libraries of the receptor-binding domain (RBD) of the SARS-CoV-2 spike protein. RBD variants were expressed on the yeast surface as a C-terminal fusion to another protein, and their binding to ACE2 and four therapeutic antibodies was measured.

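Machine learning models were trained to predict ACE2 binding and antibody escape. The models tested included k-nearest neighbors (KNN), logistic regression, naive Bayes, support vector machines (SVMs), random forests (RFs), and recurrent neural networks (RNNs). The RF and RNN models showed the best metrics.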

Minimize the size of the training dataset using transfer learning or other methods.

Background

Goal

Student:

Natalia Khotkina

(Bioinformatics Institute)

Supervisor:

Daria Balashova

(Amsterdam UMC)

2 of 6

Transfer learning is used for tasks where a large amount of data cannot be collected. In this approach, knowledge learned on one task is transferred to the target task.
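The idea can be illustrated with a minimal sketch. It deliberately swaps the LSTM for a plain logistic regression trained by SGD so that it runs with the standard library only; the function names (`train`, `log_loss`, `pretrain_then_finetune`) and the default epoch counts are assumptions made for illustration and are not taken from the project code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(weights, data, epochs, lr=0.1):
    """Run SGD on the logistic loss starting from `weights`; returns a new list."""
    w = list(weights)  # copy, so the caller's weights are not modified
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            for i, xi in enumerate(x):
                w[i] -= lr * error * xi
    return w


def log_loss(weights, data):
    """Mean binary cross-entropy of the model on `data`."""
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def pretrain_then_finetune(source_data, target_data, n_features,
                           pretrain_epochs=20, finetune_epochs=1):
    """Transfer learning: fit the (large) source task first, then
    continue training the same weights on the (small) target task."""
    pretrained = train([0.0] * n_features, source_data, pretrain_epochs)
    return train(pretrained, target_data, finetune_epochs)
```

Here `source_data` and `target_data` are lists of `(features, label)` pairs with labels 0 or 1. The only difference between the transferred model and a model trained from scratch is the starting point of the second `train` call.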

Multi-task models have several final layers, each of which solves a different task.
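A minimal sketch of this layout, assuming a single shared feed-forward layer in place of the LSTM used in the project: every task head reads the same shared features, and the training loss is summed over the tasks that have a label for a given sample. The names `shared_features`, `predict_all`, and `multitask_loss`, and the task key `antibody_B`, are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shared_features(x, trunk):
    """Shared layer: tanh of a linear map; `trunk` is a list of weight rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in trunk]


def predict_all(x, trunk, heads):
    """Return {task: probability}; each head is a weight vector over the
    shared features, so all tasks are predicted in one pass."""
    h = shared_features(x, trunk)
    return {
        task: sigmoid(sum(w * hi for w, hi in zip(weights, h)))
        for task, weights in heads.items()
    }


def multitask_loss(x, labels, trunk, heads):
    """Sum of per-task log losses for one sample.
    `labels` maps task -> 0/1; tasks without a label are skipped."""
    probs = predict_all(x, trunk, heads)
    total = 0.0
    for task, y in labels.items():
        p = min(max(probs[task], 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Because the shared layer receives gradient from every head, a task with few labeled samples can benefit from the samples of the other tasks.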

Methods

Transfer learning

Multi-task

3 of 6

Training on datasets of different sizes

As the training dataset grew, the ROC AUC score increased and then reached a plateau at about half of the training set size used by Taft and colleagues. This finding suggests that part of the data used for model training was redundant and that the training set could be reduced without loss of performance.
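For reference, ROC AUC can be computed from ranks (the Mann-Whitney U formulation), and a plateau can be located by finding the smallest training size whose score is close to the best one. The sketch below is an illustration only: the helper `smallest_plateau_size` and its `tolerance` parameter are assumptions, not the criterion used in the project.

```python
def roc_auc(labels, scores):
    """ROC AUC via rank sums; tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs both positive and negative samples")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def smallest_plateau_size(sizes, aucs, tolerance=0.01):
    """Smallest training size whose AUC is within `tolerance` of the best AUC."""
    best = max(aucs)
    for size, auc in sorted(zip(sizes, aucs)):
        if best - auc <= tolerance:
            return size
```

In practice `roc_auc` would be evaluated on a held-out test set for a model trained on each subsample size, and the resulting scores passed to `smallest_plateau_size`.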

Results

4 of 6

Transfer learning from ACE2 dataset

The LY16 training dataset was reduced from 26K to 1K. The LSTM model (from J. M. Taft, et al., 2022) was first pre-trained on the ACE2 dataset and then trained on the LY16 dataset.

Pre-training significantly increased the ROC AUC score compared with a model without pre-training. This finding suggests that some information is shared between the ACE2 binding dataset and the neutralizing antibody binding datasets, so pre-training on one dataset can improve predictions for another.

Results

Pre-training scheme

ROC AUC score in “basic” LSTM and in “pretrained” LSTM, n=40, paired t-test
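The paired t-test named in the caption compares two models run by run: the statistic is the mean of the per-run score differences divided by its standard error, with n - 1 degrees of freedom (39 for n=40). A minimal sketch of the statistic follows; the function name is an assumption, and the p-value step is left out because it needs the t distribution.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean / math.sqrt(variance / n)
    return t, n - 1
```

With SciPy available, `scipy.stats.ttest_rel(scores_a, scores_b)` returns the same statistic together with the p-value.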

5 of 6

Multi-task model

A multi-task model can predict binding to each of the neutralizing antibodies simultaneously.

For the LY16 antibody with a training dataset of 1K, the ROC AUC was significantly higher with the multi-task approach than with the basic model from J. M. Taft, et al., 2022.

Multi-task model

Training strategy for the multi-task model

ROC AUC score in “basic” LSTM and in “multi-task” LSTM, n=40, paired t-test

Results

6 of 6

Our results suggest that accurate prediction of antibody escape is possible with smaller training datasets and can be improved by transfer learning.

https://github.com/NatashaKhotkina/Amino_acid_transfer_learning_for_antibody_binding_prediction

Conclusion

GitHub