1 of 41

Evidence-Based Practice of Transcriptomics Data Harmonization for Survival Outcome Prediction

Li-Xuan Qin

Department of Epidemiology and Biostatistics

Memorial Sloan Kettering Cancer Center

New York

ABGOD Conference in Honor of Professor Shili Lin

The University of Texas at Dallas, Richardson, TX

March 18, 2023

2 of 41

Outline

Background

Data artifacts in transcriptomics data
Data harmonization for dealing with the artifacts
‘Off-label’ use of data harmonization in survival prediction

Methods and Results

Alternative method for managing artifacts in survival prediction
Benchmarking tool for context-specific performance assessment
Case study of our proposed method in comparison with others

Conclusion

Evidence-based practice is needed for data harmonization

Collaboration with Professors Andy Ni (OSU) and Mengling Liu (NYU).

3 of 41

Outline

Background

Data artifacts in transcriptomics data.
Data harmonization for dealing with the artifacts.
‘Off-label’ use of data harmonization in survival prediction.

Methods and Results

Alternative method for managing artifacts in survival prediction.
Benchmarking tool for context-specific performance assessment.
Case study of our proposed method in comparison with others.

Conclusion

Evidence-based practice is needed for data harmonization.

4 of 41

Data artifacts: ubiquitous in microarray data

https://directorsblog.nih.gov

5 of 41

Data artifacts: ubiquitous in sequencing data

6 of 41

Data artifacts: an example

September

November, February

October

Qin et al, Cancer Informatics 2013

7 of 41

Data artifacts: negatively impact data translation

8 of 41

Data harmonization: for data preprocessing

9 of 41

Data harmonization: many methods published

Normalization: Quantile Norm, ...

BE Correction: ComBat, ...

10 of 41

Data harmonization: quantile normalization
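
The slide gives only the method name. As a minimal sketch of the usual rank-based form of quantile normalization, assuming a genes-by-samples matrix and breaking ties by order of appearance (the function name and the toy numbers are illustrative, not from the talk):

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize a genes-by-samples matrix (ties broken by order of appearance)."""
    x = np.asarray(x, dtype=float)
    # Sort each sample (column) and average across samples at each rank
    # to obtain the shared reference distribution.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Rank of every value within its own sample (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0, kind="stable"), axis=0, kind="stable")
    # Replace each value with the reference value at its within-sample rank.
    return reference[ranks]

# Toy data: 4 genes (rows) by 3 samples (columns), made-up numbers.
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.5, 6.0],
                 [4.0, 2.0, 8.0]])
normalized = quantile_normalize(expr)
```

After this step every sample has the same empirical distribution, while the ordering of genes within each sample is unchanged.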

11 of 41

Data harmonization: ComBat

Location and Scale BE parameters with parametric empirical priors:

Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error

Batch-Effect-L ~ Normal()

Batch-Effect-S ~ Inverse Gamma()
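
In symbols, the location-and-scale model above is commonly written as follows. This is a sketch in the notation usually attributed to the original ComBat paper (Johnson, Li and Rabinovic, 2007); the indices (batch i, sample j, gene g) and the hyperparameter symbols are assumptions, since the slide leaves the prior parameters blank.

```latex
Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim N(0, \sigma_g^2),
\\
\gamma_{ig} \sim N(\gamma_i, \tau_i^2),
\qquad
\delta_{ig}^2 \sim \text{Inverse Gamma}(\lambda_i, \theta_i).
```

Here \alpha_g is the Baseline, X\beta_g the Group-Effect, \gamma_{ig} the Batch-Effect-L, and \delta_{ig} the Batch-Effect-S.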

12 of 41

Data harmonization: QN/ComBat assessed for DEA

13 of 41

Data harmonization: QN/ComBat adopted widely

5481

2929

14 of 41

DEA

Classification

Methylation Data

Survival Prediction

June 2021

“Off-Label” Uses

15 of 41

Off-Label Use: CAVEAT as QN changes across-sample ordering
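
A small numerical illustration of this caveat, assuming the usual rank-based quantile normalization and made-up expression values: a gene that is higher in sample A than in sample B before normalization can end up lower in A afterwards. This matters when the gene's values across samples are what a survival predictor uses.

```python
import numpy as np

def quantile_normalize(x):
    # Each sample's values are replaced by the mean of the sorted values
    # (across samples) at the same within-sample rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    return reference[ranks]

# Rows are genes g1..g3, columns are samples A and B (made-up numbers).
expr = np.array([[10.0, 9.0],
                 [11.0, 1.0],
                 [12.0, 2.0]])
normalized = quantile_normalize(expr)

g1_before = expr[0]        # A = 10, B = 9: sample A is higher
g1_after = normalized[0]   # A = 5.5, B = 10.5: sample B is now higher
order_flipped = (g1_before[0] > g1_before[1]) and (g1_after[0] < g1_after[1])
print(order_flipped)
```

Gene g1 is the lowest-ranked gene in sample A but the highest-ranked gene in sample B, so it receives the smallest reference value in A and the largest in B.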

16 of 41

17 of 41

Off-Label Use: CAVEAT as ComBat has no proper sample groups

Location and Scale BE parameters with parametric empirical priors:

Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error

?

hazard(Event) = function(Adjusted Expression Level)

18 of 41

Outline

Background

Data artifacts in transcriptomics data.
Data harmonization for dealing with the artifacts.
‘Off-label’ use of data harmonization in survival prediction.

Methods and Results

Alternative method for managing artifacts in survival prediction.
Benchmarking tool for context-specific performance assessment.
Case study of our proposed method in comparison with others.

Conclusion

Evidence-based practice is needed for data harmonization.

19 of 41

1. We developed an alternative method for managing artifacts in survival prediction.

20 of 41

Our method simultaneously manages BE & builds a predictor

21 of 41

BatMan cancels out the batch effects
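
The slide does not spell out the mechanism. As a minimal sketch, assuming the cancellation comes from stratifying a Cox partial likelihood by batch, and assuming a batch effect acts as a constant shift of the linear predictor within each batch, the shift appears in both the numerator and the risk-set sum and drops out. The function name, the simulated data, and the shift size below are all hypothetical.

```python
import numpy as np

def stratified_cox_loglik(risk_score, time, event, batch):
    """Cox partial log-likelihood with one stratum per batch (assumes no tied event times)."""
    total = 0.0
    for b in np.unique(batch):
        in_b = batch == b
        eta, t, d = risk_score[in_b], time[in_b], event[in_b]
        for i in np.flatnonzero(d):
            # Risk set: samples in the same batch still under observation at t[i].
            at_risk = t >= t[i]
            total += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return total

rng = np.random.default_rng(0)
n = 40
batch = np.repeat([0, 1], n // 2)
time = rng.exponential(1.0, size=n)
event = rng.integers(0, 2, size=n).astype(bool)
risk_score = rng.normal(size=n)  # linear predictor from artifact-free expression

# Hypothetical batch effect: a constant shift of the linear predictor per batch.
shift = np.where(batch == 0, 0.0, 2.5)
loglik_clean = stratified_cox_loglik(risk_score, time, event, batch)
loglik_shifted = stratified_cox_loglik(risk_score + shift, time, event, batch)
```

The two log-likelihoods agree up to floating-point error, so coefficient estimates from the stratified model are unaffected by this kind of batch shift. An unstratified model would not have this property, because its risk sets mix samples from different batches.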

22 of 41

2. We built a benchmarking tool for assessing new and existing methods with re-sampling-based simulations.

23 of 41

We first collected well-designed empirical data

Careful Design: 1 tech in 1 run

Typical Practice: 2 techs in 5 batches

Qin et al, Clinical Cancer Research 2014

24 of 41

We then used a re-sampling algorithm to simulate additional paired data sets

Careful-Design Data: Sample Effects for sample i (‘Virtual Samples’)

Difference between two arrays on each sample: Handling Effects for array j (‘Virtual Arrays’)

Virtual reassignment and re-hybridization: (Sample Effects for sample i’) + (Handling Effects for array j’)
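
The re-sampling idea on this slide can be sketched as follows. This is only an illustration under stated assumptions: random numbers stand in for the paired empirical arrays, the variable names are hypothetical, and samples are paired with arrays by a plain random permutation, whereas the actual tool may assign them according to a chosen study design.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 6, 8

# Stand-ins for the paired empirical data (random numbers, not real arrays):
# the same samples profiled once under careful design and once under typical practice.
careful = rng.normal(8.0, 1.0, size=(n_genes, n_samples))
typical = careful + rng.normal(0.5, 0.3, size=(n_genes, n_samples))

virtual_samples = careful            # sample effects for sample i
virtual_arrays = typical - careful   # handling effects for array j

# Virtual reassignment: randomly pair sample i' with array j'.
sample_idx = rng.permutation(n_samples)
array_idx = rng.permutation(n_samples)

# Virtual re-hybridization: add the two components for each new pairing.
simulated = virtual_samples[:, sample_idx] + virtual_arrays[:, array_idx]
```

Pairing every sample with its own array (the identity permutation) recovers the typical-practice data exactly, which is a quick sanity check on the decomposition.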

Qin et al, Journal of Clinical Oncology 2016

25 of 41

Qin et al, Journal of Clinical Oncology 2016

26 of 41

Qin and Levine, BMC Medical Genomics 2016

27 of 41

We used an algorithm to reassign survival outcome (PFS)

Ni and Qin, Briefings in Bioinformatics 2021

28 of 41

Scenario Notation	Batch effects in the training data	Batch effects in the test data	Batch effects correlated with outcome in the training data	Batch effects correlated with outcome in the test data
BE00Cor00	No	No	No	No
BE10Cor00	Yes	No	No	No
BE10Cor10	Yes	No	Yes	No
BE11Cor00	Yes	Yes	No	No
BE11Cor10	Yes	Yes	Yes	No
BE11Cor01	Yes	Yes	No	Yes
BE11Cor11	Yes	Yes	Yes	Yes (positively)
BE11Cor1-1	Yes	Yes	Yes	Yes (negatively)
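
The scenario notation in the table can be decoded mechanically. A small sketch follows; the `Scenario` type and `parse_scenario` helper are hypothetical names introduced here for illustration and are not part of any package mentioned in the talk.

```python
import re
from typing import NamedTuple

class Scenario(NamedTuple):
    be_train: bool   # batch effects present in the training data
    be_test: bool    # batch effects present in the test data
    cor_train: int   # 1 = batch effects correlated with outcome in training, 0 = not
    cor_test: int    # 1 = positively, -1 = negatively, 0 = not correlated in test

def parse_scenario(name: str) -> Scenario:
    # Notation: BE<train><test>Cor<train><test>, where the last field may be "-1".
    m = re.fullmatch(r"BE([01])([01])Cor([01])(-?[01])", name)
    if m is None:
        raise ValueError(f"unrecognized scenario notation: {name!r}")
    be_train, be_test, cor_train, cor_test = m.groups()
    return Scenario(be_train == "1", be_test == "1", int(cor_train), int(cor_test))

example = parse_scenario("BE11Cor1-1")
```

For instance, `BE11Cor1-1` decodes to batch effects in both data sets, correlated with outcome in the training data and negatively correlated in the test data, matching the last row of the table.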

Ni, Liu and Qin, JCO CCI in press

29 of 41

Simulation results: Oracle method

Ni, Liu and Qin, JCO CCI in press

30 of 41

Simulation results: LASSO method

Ni, Liu and Qin, JCO CCI in press

31 of 41

Simulation results: Univariate selection method

Ni, Liu and Qin, JCO CCI in press

32 of 41

We made available an R package for method benchmarking

PRECISION.array.survival

https://github.com/LXQin/PRECISION.array.survival

33 of 41

3. We conducted a case study using data from The Cancer Genome Atlas.

34 of 41

Application to TCGA ovarian miRNA array data

Ni, Liu and Qin, JCO CCI in press

35 of 41

36 of 41

Outline

Background

Data artifacts in transcriptomics data.
Data harmonization for dealing with the artifacts.
‘Off-label’ use of data harmonization in survival prediction.

Methods and Results

Alternative methods for managing artifacts in survival prediction.
Benchmarking tools for context-specific performance assessment.
Case study of our proposed method in comparison with others.

Conclusion

Evidence-based practice is needed for data harmonization.

37 of 41

Omics

Data

Artifacts

38 of 41

39 of 41

What: Data harmonization depends on the analysis context.

Why: Liberal use of harmonization faces overlooked caveats.

How: Benchmarking tools and alternative methods are needed.

So what: Evidence-based practice for data harmonization!

40 of 41

What’s next

Data Type

Analysis Goal

Data Application

(Reanalysis)

41 of 41

Happy Birthday, Shili!