DataSAIL:
Data Splitting Against Information Leakage
Fighting inflated performance estimates
Roman Joeres, HIPS/HZI, roman.joeres@helmholtz-hips.de
Saarbruecken, Dec. 08, 2023
1
Information Leakage is the introduction of information about the target that should not be legitimately available to learn from.
adapted from Kaufman et al. (2012)
2
Recap – Random Split vs. Scaffold Split
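The difference between the two strategies can be sketched in a few lines. This is a minimal illustration, not the implementation used for the results in this talk: the molecule IDs and scaffold labels are made up, and `random_split` and `group_split` are hypothetical helper names. A real scaffold split would derive the group labels from the molecular structures.

```python
import random
from collections import defaultdict

def random_split(items, test_frac=0.2, seed=0):
    """Shuffle the items and cut off a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_frac))
    return shuffled[:-n_test], shuffled[-n_test:]

def group_split(items, group_of, test_frac=0.2):
    """Assign whole groups (e.g. scaffolds) to train or test, never splitting a group."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of[item]].append(item)
    # Largest groups fill the training set first; smaller groups end up in test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(items) - round(len(items) * test_frac)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Hypothetical toy data: molecule IDs mapped to made-up scaffold labels.
scaffold_of = {
    "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B", "m7": "B",
    "m8": "C", "m9": "C",
    "m10": "D",
}
molecules = list(scaffold_of)

r_train, r_test = random_split(molecules, test_frac=0.3)
g_train, g_test = group_split(molecules, scaffold_of, test_frac=0.3)

# Scaffolds that occur on both sides of the group-based split (empty by construction).
shared = {scaffold_of[m] for m in g_train} & {scaffold_of[m] for m in g_test}
```

A random split gives no such guarantee: molecules sharing a scaffold can land on both sides, which is one way similar molecules leak from training into test data.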
3
Effects of Information Leakage – BACE example
[Figure: comparison of DataSAIL, Scaffold Split, and Random Split on Training Data, Test Data, and Inference Data. Values as listed: 0.883, 0.873, 0.846, 0.667, 0.671, 0.656, 0.815, 0.742, 0.688]
Accuracy of binding prediction for inhibitors of human β-secretase 1 (BACE-1) using Chemprop
Heid et al. (2023). Chemprop: A machine learning package for chemical property prediction. ChemRxiv
Subramanian et al. (2016). Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors. J. Chem. Inf. Model.
4
Effects of Information Leakage – LP-PDBBind example
Random Split (Drugs × Targets): Training: 0.9, Test: 1.65, Cold Drug: 2.13, Cold Target: 3.21, Cold Both: 5.62
DataSAIL (Drugs × Targets): Training: 2.09, Test: 5.88, Inference: 5.21
MSE of Binding Affinity Prediction using DeepDTA
Li et al. (2023). Leak Proof PDBBind: A reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. arXiv
Öztürk et al. (2018). DeepDTA: deep drug-target binding affinity prediction. Bioinformatics
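The cold categories on this slide can be made concrete with a short sketch. It assumes the usual meaning of the terms in drug-target interaction prediction: a pair is "cold" in a dimension if that entity never occurs in the training data. The function name `classify_pair`, the label "seen both", and the toy IDs are hypothetical and only serve the illustration.

```python
def classify_pair(drug, target, train_drugs, train_targets):
    """Label an interaction by which of its entities occur in the training data."""
    drug_seen = drug in train_drugs
    target_seen = target in train_targets
    if drug_seen and target_seen:
        return "seen both"    # what a random split of interactions mostly tests
    if target_seen:
        return "cold drug"    # new drug, known target
    if drug_seen:
        return "cold target"  # known drug, new target
    return "cold both"        # neither entity was seen during training

# Hypothetical toy data: entities that appear in the training interactions.
train_drugs = {"d1", "d2"}
train_targets = {"t1", "t2"}

pairs = [("d1", "t1"), ("d3", "t1"), ("d2", "t3"), ("d3", "t3")]
labels = [classify_pair(d, t, train_drugs, train_targets) for d, t in pairs]
```

Only pairs labelled "cold both" measure generalization to drugs and targets that are both new to the model.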
5
Specifics of Biological Datasets
MNIST
TAPE – Remote Homology
No connection between classes in images
Evolutionary conservation between proteins
LeCun (1998). The MNIST database of handwritten digits. yann.lecun.com
Rocklin et al. (2017). Global analysis of protein folding … Science
7
Clustering Algorithms
Created with BioRender.com
8
Workflow of DataSAIL
Created with BioRender.com
9
Integer Linear Programming
Created with BioRender.com
Proven to be NP-complete
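To give a feeling for the optimization problem, here is a minimal sketch that assigns clusters to splits by brute force. It is not DataSAIL's ILP formulation and uses exhaustive enumeration in place of an ILP solver. The objective (total similarity between clusters placed in different splits), the size tolerance, the function name `best_assignment`, and all numbers are assumptions chosen for illustration.

```python
from itertools import product

def best_assignment(sizes, sim, targets, tol):
    """Try every cluster-to-split assignment and keep the cheapest feasible one."""
    n, k = len(sizes), len(targets)
    best, best_cost = None, None
    for assign in product(range(k), repeat=n):
        # Size constraint: each split must stay within tol of its target size.
        split_sizes = [0] * k
        for cluster, split in enumerate(assign):
            split_sizes[split] += sizes[cluster]
        if any(abs(split_sizes[s] - targets[s]) > tol for s in range(k)):
            continue
        # Objective: similarity summed over cluster pairs in different splits.
        cost = sum(
            sim[i][j]
            for i in range(n)
            for j in range(i + 1, n)
            if assign[i] != assign[j]
        )
        if best_cost is None or cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Hypothetical toy instance: four clusters with made-up sizes and similarities.
sizes = [3, 3, 2, 2]
sim = [
    [0, 9, 1, 2],
    [9, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
]
assignment, leaked = best_assignment(sizes, sim, targets=[6, 4], tol=1)
```

On this toy instance the two highly similar pairs of clusters each stay together in one split. The loop visits k^n assignments, so enumeration is only feasible for a handful of clusters.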
10
DataSAIL on Lipophilicity
Hersey et al. (2015). ChEMBL Deposited Data Set - AZ_dataset
11
DataSAIL on Tox21
Tox21 Challenge (2014). https://tripod.nih.gov/tox21/challenge/
12
DataSAIL on LP-PDBBind
Li et al. (2023). Leak Proof PDBBind: A reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. arXiv
13
Conclusion
14
Thanks to:
David B. Blumenthal
Olga V. Kalinina
the Drug Bioinformatics Group
the colleagues in the NextAID project
contact me: roman.joeres@helmholtz-hips.de
15
ILP Setup
16
ILP Definition
17