1 of 17

DataSAIL: Data Splitting Against Information Leakage

Fighting inflated performance estimates

Roman Joeres, HIPS/HZI, roman.joeres@helmholtz-hips.de

Saarbruecken, Dec. 08, 2023


2 of 17

Information Leakage is the introduction of information about the target that should not be legitimately available to learn from.

adapted from Kaufman et al. (2012)


3 of 17

Recap – Random Split vs. Scaffold Split
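The contrast between the two strategies can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the scaffold labels are assumed to be precomputed (the `scaffold_of` mapping and the molecule IDs are hypothetical), and the greedy largest-group-first assignment is just one common heuristic.

```python
import random
from collections import defaultdict


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and use the last test_fraction of them as test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def scaffold_split(items, scaffold_of, test_fraction=0.2):
    """Assign whole scaffold groups to train or test, never splitting a group.

    scaffold_of maps each item to a precomputed scaffold key (assumed input).
    Groups are visited from largest to smallest; a group goes to training
    if it still fits into the training budget, otherwise to test.
    """
    groups = defaultdict(list)
    for item in items:
        groups[scaffold_of[item]].append(item)
    n_train_target = len(items) - round(len(items) * test_fraction)
    train, test = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy data: hypothetical molecule IDs with hypothetical scaffold labels.
scaffold_of = {
    "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B", "m7": "B",
    "m8": "C", "m9": "C",
    "m10": "D",
}
molecules = list(scaffold_of)

train, test = scaffold_split(molecules, scaffold_of, test_fraction=0.3)
# Scaffolds that occur on both sides of the split (empty for a scaffold split).
shared = {scaffold_of[m] for m in train} & {scaffold_of[m] for m in test}

r_train, r_test = random_split(molecules, test_fraction=0.3)
```

With a random split, molecules sharing a scaffold can land on both sides, which is exactly the overlap a scaffold split rules out.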


4 of 17

Effects of Information Leakage – BACE example

[Figure: accuracy on Training Data, Test Data, and Inference Data for three splits (DataSAIL, Scaffold Split, Random Split). Values as extracted: 0.883, 0.873, 0.846, 0.667, 0.671, 0.656, 0.815, 0.742, 0.688]

Accuracy of binding prediction for inhibitors of human β-secretase 1 (BACE-1) using Chemprop

Heid et al. (2023). Chemprop: A machine learning package for chemical property prediction. ChemRxiv.

Subramanian et al. (2016). Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors. J. Chem. Inf. Model.


5 of 17

Effects of Information Leakage – LP-PDBBind example

[Figure: two Drugs × Targets grids, labelled Random Split and DataSAIL.
First grid: Training: 0.9, Test: 1.65, Cold Drug: 2.13, Cold Target: 3.21, Cold Both: 5.62.
Second grid: Training: 2.09, Test: 5.88, Inference: 5.21.]

MSE of Binding Affinity Prediction using DeepDTA

Li et al. (2023). Leak Proof PDBBind: A reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. arXiv.

Öztürk et al. (2018). DeepDTA: deep drug-target binding affinity prediction. Bioinformatics.
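The Drugs × Targets grid on this slide can be read as a two-dimensional split: every drug and every target is assigned to a split, and an interaction is only usable if both of its partners agree. The sketch below illustrates that idea under stated assumptions. The function name, the split labels, and the toy interactions are hypothetical, and dropping mixed pairs is one possible policy, not necessarily what DataSAIL does.

```python
def two_dimensional_split(interactions, drug_split, target_split):
    """Place each (drug, target, value) triple by the split of BOTH entities.

    An interaction is kept only when its drug and its target are assigned
    to the same split. Mixed pairs (one side seen in training, the other
    not) are collected under "dropped" because they would leak one side.
    """
    result = {"train": [], "test": [], "dropped": []}
    for drug, target, value in interactions:
        d, t = drug_split[drug], target_split[target]
        key = d if d == t else "dropped"
        result[key].append((drug, target, value))
    return result


# Hypothetical toy assignment of drugs and targets to splits.
drug_split = {"d1": "train", "d2": "train", "d3": "test"}
target_split = {"t1": "train", "t2": "test"}

# Hypothetical interactions: (drug, target, affinity value).
interactions = [
    ("d1", "t1", 5.0),  # both in train
    ("d2", "t1", 6.1),  # both in train
    ("d3", "t2", 7.3),  # both in test: drug and target are unseen
    ("d1", "t2", 4.2),  # known drug, unseen target: mixed
    ("d3", "t1", 6.8),  # unseen drug, known target: mixed
]

parts = two_dimensional_split(interactions, drug_split, target_split)
train_drugs = {d for d, _, _ in parts["train"]}
test_drugs = {d for d, _, _ in parts["test"]}
train_targets = {t for _, t, _ in parts["train"]}
test_targets = {t for _, t, _ in parts["test"]}
```

In this reading, the test set contains only pairs where both partners are new to the model, which plausibly corresponds to the Cold Both cell of the figure, while the mixed pairs correspond to the Cold Drug and Cold Target cells.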


6 of 17

Specifics of Biological Datasets

MNIST

TAPE – Remote Homology

LeCun (1998). The MNIST database of handwritten digits. yann.lecun.com

Rocklin et al. (2017). Global analysis of protein folding … Science


7 of 17

Specifics of Biological Datasets

MNIST

TAPE – Remote Homology

No connection between classes in images

Evolutionary conservation between proteins

LeCun (1998). The MNIST database of handwritten digits. yann.lecun.com

Rocklin et al. (2017). Global analysis of protein folding … Science
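Because related proteins share sequence, a test protein can be nearly identical to a training protein even when the two are distinct entries. One way to make this visible is to count how many test sequences have a close neighbour in the training set. The sketch below uses k-mer Jaccard similarity as a deliberately crude stand-in for alignment-based sequence identity. The sequences, the threshold, and all function names are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def leaked_fraction(train, test, threshold=0.5, k=3):
    """Fraction of test sequences with a training neighbour above threshold."""
    if not test:
        return 0.0
    leaked = sum(
        1 for t in test
        if any(jaccard(t, s, k) >= threshold for s in train)
    )
    return leaked / len(test)


# Hypothetical toy sequences: the first test sequence differs from a
# training sequence by a single residue, the second is unrelated.
train_seqs = ["MKTAYIAKQR", "GSHMLEDPVA"]
test_seqs = ["MKTAYIAKQK", "WWPFCNNQTT"]

fraction = leaked_fraction(train_seqs, test_seqs, threshold=0.5)
```

For image classes such as MNIST digits there is no comparable hidden relatedness, which is why a random split is less problematic there.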


8 of 17

Clustering Algorithms

Created with BioRender.com
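The figure for this slide is not recoverable, so the following is only a generic illustration of similarity-based clustering, not the algorithms shown in the talk. It groups items whose pairwise similarity reaches a threshold (single linkage) using a small union-find structure. The item names and similarity values are hypothetical.

```python
def threshold_clusters(items, similarity, threshold):
    """Single-linkage clustering: link every pair with similarity >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
known = {("p1", "p2"): 0.9, ("p2", "p3"): 0.8, ("p4", "p5"): 0.7}


def sim(a, b):
    return known.get((a, b), known.get((b, a), 0.1))


proteins = ["p1", "p2", "p3", "p4", "p5"]
clusters = threshold_clusters(proteins, sim, threshold=0.6)
```

Note that p1 and p3 end up together although they are not directly similar: p2 links them. Keeping such chains in one cluster is what lets a later split place whole clusters on one side.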


9 of 17

Workflow of DataSAIL

Created with BioRender.com


10 of 17

Integer Linear Programming

Created with BioRender.com

Proven to be NP-complete
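To give a feel for why the hardness matters, the sketch below solves a tiny cluster-to-split assignment by exhaustive search. It is an assumption-laden toy, not the ILP from the talk: the objective (total similarity between clusters placed in different splits), the size constraint, and all names and numbers are hypothetical.

```python
from itertools import product


def best_assignment(sizes, sim, fractions=(0.8, 0.2), tolerance=0.1):
    """Try every assignment of clusters to splits and keep the cheapest.

    sizes[i]   number of samples in cluster i
    sim[i][j]  similarity between clusters i and j (symmetric)
    fractions  desired share of samples per split
    tolerance  allowed deviation from each share, as a fraction of the total

    Cost is the summed similarity of cluster pairs in different splits.
    Returns (cost, assignment), or (None, None) if nothing is feasible.
    """
    n, k = len(sizes), len(fractions)
    total = sum(sizes)
    best_cost, best = None, None
    for assign in product(range(k), repeat=n):  # k**n candidates
        loads = [0] * k
        for idx, split in enumerate(assign):
            loads[split] += sizes[idx]
        feasible = all(
            abs(loads[s] - fractions[s] * total) <= tolerance * total
            for s in range(k)
        )
        if not feasible:
            continue
        cost = sum(
            sim[i][j]
            for i in range(n)
            for j in range(i + 1, n)
            if assign[i] != assign[j]
        )
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, assign
    return best_cost, best


# Hypothetical toy instance with four clusters.
sizes = [40, 30, 20, 10]
sim = [
    [0.0, 0.1, 0.6, 0.2],
    [0.1, 0.0, 0.2, 0.1],
    [0.6, 0.2, 0.0, 0.7],
    [0.2, 0.1, 0.7, 0.0],
]
cost, assignment = best_assignment(sizes, sim, fractions=(0.7, 0.3), tolerance=0.05)
```

The loop visits k**n assignments, which is fine for four clusters and hopeless for thousands. That growth is the practical reason to hand the problem to a dedicated solver instead of enumerating.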


11 of 17

DataSAIL on Lipophilicity

Hersey et al. (2015). ChEMBL Deposited Data Set - AZ_dataset


12 of 17

DataSAIL on Tox21

Tox21 Challenge (2014). https://tripod.nih.gov/tox21/challenge/


13 of 17

DataSAIL on LP-PDBBind

Li et al. (2023). Leak Proof PDBBind: A reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction. arXiv.


14 of 17

Conclusion

  • Information Leakage is an open challenge
  • Knowing what data the model will see at inference time is vital
  • DataSAIL offers a variety of solutions
    • including two-dimensional splits
    • no manual splitting required
  • Performance estimation gets more realistic
  • DataSAIL is conda-installable
    • CLI and Python package


15 of 17

  • Information Leakage
    • is a problem in model training
    • can lead to inflated performance estimates
  • DataSAIL accounts for biological data
  • first to offer two-dimensional splits
  • Only bad models fear DataSAIL

Thanks to:

David B. Blumenthal

Olga V. Kalinina

the Drug Bioinformatics Group

the colleagues in the NextAID project

contact me: roman.joeres@helmholtz-hips.de


16 of 17

ILP Setup


17 of 17

ILP Definition
