1 of 41

Evidence-Based Practice of Transcriptomics Data Harmonization for Survival Outcome Prediction

Li-Xuan Qin

Department of Epidemiology and Biostatistics

Memorial Sloan Kettering Cancer Center

New York

ABGOD Conference in Honor of Professor Shili Lin

The University of Texas at Dallas, Richardson, TX

March 18, 2023

2 of 41

Outline

  • Background
    • Data artifacts in transcriptomics data
    • Data harmonization for dealing with the artifacts
    • ‘Off-label’ use of data harmonization in survival prediction

  • Methods and Results
    • Alternative method for managing artifacts in survival prediction
    • Benchmarking tool for context-specific performance assessment
    • Case study of our proposed method in comparison with others

  • Conclusion
    • Evidence-based practice is needed for data harmonization

Collaboration with Professors Andy Ni (OSU) and Mengling Liu (NYU).

4 of 41

Data artifacts: ubiquitous in microarray data

https://directorsblog.nih.gov

5 of 41

Data artifacts: ubiquitous in sequencing data

6 of 41

Data artifacts: an example

[Figure: example arrays labeled by month: September; November, February; October]

Qin et al, Cancer Informatics 2013

7 of 41

Data artifacts: negatively impact data translation

8 of 41

Data harmonization: for data preprocessing

9 of 41

Data harmonization: many methods published

  • Normalization: Quantile Norm, ...
  • BE Correction: ComBat, ...

10 of 41

Data harmonization: quantile normalization
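A minimal sketch of quantile normalization (QN) is given here for readers without the slide figure. It assumes a genes-by-samples matrix with no tied values; the function name `quantile_normalize` and the toy numbers are illustrative and not taken from the talk.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a genes x samples matrix (assumes no ties)."""
    expr = np.asarray(expr, dtype=float)
    # Sort each sample (column) to obtain its empirical quantiles.
    sorted_cols = np.sort(expr, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_cols.mean(axis=1)
    # Rank of each value within its own sample (0 = smallest).
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    # Replace every value by the reference quantile at its rank.
    return reference[ranks]

# Toy data: 4 genes (rows) measured in 3 samples (columns).
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.5, 6.0],
                 [4.0, 2.0, 8.0]])
normalized = quantile_normalize(expr)
print(normalized)
```

After QN every sample has the same set of values, and the ordering of genes within each sample is unchanged.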

11 of 41

Data harmonization: ComBat

Location and scale batch-effect (BE) parameters with parametric empirical Bayes priors:

Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error

Batch-Effect-L ~ Normal()

Batch-Effect-S ~ Inverse Gamma()
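For reference, the word equation above is usually written in symbols as follows. The symbols are not from the slide; they follow the notation commonly used for ComBat, with i indexing batches, j samples within a batch, and g genes. Note that in this formulation the priors are stated for standardized data, and the inverse gamma prior is placed on the squared scale parameter.

```latex
\begin{align*}
Y_{ijg} &= \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
  \qquad \varepsilon_{ijg} \sim N(0, \sigma_g^2), \\
\gamma_{ig} &\sim N(\gamma_i, \tau_i^2), \qquad
\delta_{ig}^2 \sim \text{Inverse Gamma}(\lambda_i, \theta_i).
\end{align*}
```

Here \alpha_g is the baseline, X\beta_g the group effect, \gamma_{ig} the location batch effect (Batch-Effect-L), and \delta_{ig} the scale batch effect (Batch-Effect-S).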

12 of 41

Data harmonization: QN/ComBat assessed for differential expression analysis (DEA)

13 of 41

Data harmonization: QN/ComBat adopted widely

5481

2929

14 of 41

DEA

Classification

Methylation Data

Survival Prediction

June 2021

“Off-Label” Uses

15 of 41

Off-Label Use: CAVEAT as QN changes across-sample ordering
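The caveat can be illustrated with a small, made-up example. The sketch below assumes a genes-by-samples matrix without ties and uses a hypothetical helper `quantile_normalize`; the numbers are chosen only to make the reversal visible and are not data from the talk.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a genes x samples matrix (assumes no ties)."""
    expr = np.asarray(expr, dtype=float)
    reference = np.sort(expr, axis=0).mean(axis=1)
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    return reference[ranks]

# Rows are genes, columns are samples A and B.
# Sample B has globally higher levels than sample A.
raw = np.array([[3.0, 5.0],
                [1.0, 10.0],
                [2.0, 20.0]])
qn = quantile_normalize(raw)

# Gene 0 is the highest gene within A but the lowest within B.
# Raw levels: A (3.0) < B (5.0). After QN the order flips: A > B.
print("gene 0 raw:", raw[0])
print("gene 0 after QN:", qn[0])
```

Because a survival predictor relates a gene's level across samples to the outcome, such a reordering can change which samples look high-risk.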

16 of 41

17 of 41

Off-Label Use: CAVEAT as ComBat has no proper sample groups

Location and scale batch-effect (BE) parameters with parametric empirical Bayes priors:

Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error

?

?

hazard(Event) = function(Adjusted Expression Level)
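A toy illustration of why a missing group term matters. This is a deliberately simplified, location-only adjustment (per-batch mean centering), not ComBat itself, and `risk` is a hypothetical stand-in for the survival outcome. It assumes batches that differ in their mix of low-risk and high-risk samples and contain no technical batch effect at all.

```python
import numpy as np

def center_by_batch(values, batch):
    """Location-only batch adjustment with no group term (not full ComBat)."""
    values = np.asarray(values, dtype=float)
    batch = np.asarray(batch)
    adjusted = values.copy()
    grand_mean = values.mean()
    for b in np.unique(batch):
        in_batch = batch == b
        # Shift each batch so that its mean equals the grand mean.
        adjusted[in_batch] += grand_mean - values[in_batch].mean()
    return adjusted

# Hypothetical risk scores; batch A holds the low-risk samples.
risk = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
batch = np.array(["A", "A", "A", "B", "B", "B"])
# A perfectly prognostic gene with NO technical batch effect.
expression = risk.copy()

adjusted = center_by_batch(expression, batch)
corr_before = np.corrcoef(expression, risk)[0, 1]
corr_after = np.corrcoef(adjusted, risk)[0, 1]
print(corr_before, corr_after)
```

With nothing in the model to represent the outcome, the adjustment treats the between-batch difference in risk as a batch effect and removes part of the prognostic signal.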

18 of 41

Outline

  • Background
    • Data artifacts in transcriptomics data.
    • Data harmonization for dealing with the artifacts.
    • ‘Off-label’ use of data harmonization in survival prediction.

  • Methods and Results
    • Alternative method for managing artifacts in survival prediction.
    • Benchmarking tool for context-specific performance assessment.
    • Case study of our proposed method in comparison with others.

  • Conclusion
    • Evidence-based practice is needed for data harmonization.

19 of 41

1. We developed an alternative method for managing artifacts in survival prediction.

20 of 41

Our method simultaneously manages BE & builds a predictor

 

21 of 41

BatMan has the batch effects canceled out
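The slide body did not survive extraction, so the following is only a sketch of one mechanism by which batch effects can cancel: a Cox partial likelihood stratified by batch. It assumes additive, batch-specific shifts in expression and no tied event times; the function name, the simulated data, and the coefficient values are all illustrative and should not be read as the authors' implementation.

```python
import numpy as np

def stratified_cox_loglik(beta, x, time, event, batch):
    """Log partial likelihood of a Cox model stratified by batch (no ties)."""
    eta = x @ beta
    total = 0.0
    for b in np.unique(batch):
        idx = np.flatnonzero(batch == b)
        for i in idx[event[idx] == 1]:
            # Risk set: samples in the SAME batch still at risk at time[i].
            at_risk = idx[time[idx] >= time[i]]
            total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return total

rng = np.random.default_rng(0)
n, p = 12, 3
x = rng.normal(size=(n, p))                 # hypothetical expression levels
time = rng.permutation(n) + 1.0             # distinct follow-up times
event = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
batch = np.repeat([0, 1, 2], 4)
beta = np.array([0.3, -0.2, 0.1])

# Additive batch effect: every sample in a batch gets the same shift per gene.
batch_shift = np.array([[0.5, -1.0, 2.0],
                        [0.0, 0.0, 0.0],
                        [-1.5, 0.7, 0.3]])
x_shifted = x + batch_shift[batch]

ll_clean = stratified_cox_loglik(beta, x, time, event, batch)
ll_shifted = stratified_cox_loglik(beta, x_shifted, time, event, batch)

# For contrast: ignore batch by placing all samples in a single stratum.
one_stratum = np.zeros(n, dtype=int)
pooled_clean = stratified_cox_loglik(beta, x, time, event, one_stratum)
pooled_shifted = stratified_cox_loglik(beta, x_shifted, time, event, one_stratum)
print(ll_clean, ll_shifted, pooled_clean, pooled_shifted)
```

Within a stratum the shift adds the same constant to every linear predictor, so it cancels between the numerator and the risk-set sum; the pooled likelihood has no such protection.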

 

22 of 41

2. We built a benchmarking tool for assessing new and existing methods with re-sampling-based simulations.

23 of 41

We first collected well-designed empirical data

  • Careful Design: 1 technician in 1 run
  • Typical Practice: 2 technicians in 5 batches

Qin et al, Clinical Cancer Research 2014

24 of 41

We then used a re-sampling algorithm to simulate additional paired data sets

  • Sample Effects for sample i
    • Careful-Design Data
    • ‘Virtual Samples’

  • Handling Effects for array j
    • Difference between two arrays on each sample
    • ‘Virtual Arrays’

  • Virtual reassignment and re-hybridization
    • (Sample Effects for sample i’) + (Handling Effects for array j’)
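The steps listed above can be sketched as follows. The matrices here are simulated stand-ins for the paired empirical data (the same samples profiled once under careful design and once under typical practice); the variable names and the random permutations used for reassignment are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2016)
n_genes, n_samples = 4, 6

# Hypothetical paired data: genes x samples, same samples in both matrices.
uniform = rng.normal(8.0, 1.0, size=(n_genes, n_samples))      # careful design
nonuniform = uniform + rng.normal(0.5, 0.3, size=(n_genes, n_samples))  # typical practice

# 'Virtual samples': sample effects taken from the careful-design data.
sample_effects = uniform
# 'Virtual arrays': handling effects as the difference between the two
# arrays measured on each sample.
handling_effects = nonuniform - uniform

# Virtual reassignment and re-hybridization: pair sample i' with array j'.
sample_idx = rng.permutation(n_samples)   # i'
array_idx = rng.permutation(n_samples)    # j'
virtual = sample_effects[:, sample_idx] + handling_effects[:, array_idx]
print(virtual.shape)
```

Pairing each sample with its own array (identity indices) reproduces the typical-practice data exactly, which is a quick sanity check on the decomposition.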

Qin et al, Journal of Clinical Oncology 2016

25 of 41

Qin et al, Journal of Clinical Oncology 2016

26 of 41

Qin and Levine, BMC Medical Genomics 2016

27 of 41

We used an algorithm to reassign the survival outcome (progression-free survival, PFS)


Ni and Qin, Briefings in Bioinformatics 2021

28 of 41

Scenario Notation | Batch effects in the training data | Batch effects in the test data | Batch effects correlated with outcome in the training data | Batch effects correlated with outcome in the test data
BE00Cor00         | No  | No  | No  | No
BE10Cor00         | Yes | No  | No  | No
BE10Cor10         | Yes | No  | Yes | No
BE11Cor00         | Yes | Yes | No  | No
BE11Cor10         | Yes | Yes | Yes | No
BE11Cor01         | Yes | Yes | No  | Yes
BE11Cor11         | Yes | Yes | Yes | Yes (positively)
BE11Cor1-1        | Yes | Yes | Yes | Yes (negatively)

Ni, Liu and Qin, JCO CCI in press
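The scenario labels follow a regular pattern: two digits after BE (training, test) and two values after Cor (training, test), with -1 marking a negative correlation. A small hypothetical helper for decoding the labels is sketched below; `Scenario` and `parse_scenario` are illustrative names and are not part of the authors' software.

```python
import re
from typing import NamedTuple

class Scenario(NamedTuple):
    be_train: bool   # batch effects in the training data
    be_test: bool    # batch effects in the test data
    cor_train: int   # 1 if batch effects correlate with outcome (training)
    cor_test: int    # 1 / -1 if correlated positively / negatively (test)

_PATTERN = re.compile(r"BE([01])([01])Cor([01])(-?[01])")

def parse_scenario(label):
    """Decode a label such as 'BE11Cor1-1' into its four flags."""
    match = _PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized scenario label: {label!r}")
    be_train, be_test, cor_train, cor_test = match.groups()
    return Scenario(be_train == "1", be_test == "1",
                    int(cor_train), int(cor_test))

for label in ["BE00Cor00", "BE10Cor10", "BE11Cor01", "BE11Cor1-1"]:
    print(label, parse_scenario(label))
```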

29 of 41

Simulation results: Oracle method

Ni, Liu and Qin, JCO CCI in press

30 of 41

Simulation results: LASSO method

Ni, Liu and Qin, JCO CCI in press

31 of 41

Simulation results: Univariate selection method

Ni, Liu and Qin, JCO CCI in press

32 of 41

We made available an R package for method benchmarking

33 of 41

3. We conducted a case study using data from The Cancer Genome Atlas.

34 of 41

Application to TCGA ovarian miRNA array data

Ni, Liu and Qin, JCO CCI in press

35 of 41

36 of 41

Outline

  • Background
    • Data artifacts in transcriptomics data.
    • Data harmonization for dealing with the artifacts.
    • ‘Off-label’ use of data harmonization in survival prediction.

  • Methods and Results
    • Alternative method for managing artifacts in survival prediction.
    • Benchmarking tool for context-specific performance assessment.
    • Case study of our proposed method in comparison with others.

  • Conclusion
    • Evidence-based practice is needed for data harmonization.

37 of 41

[Figure: Omics Data; Data Artifacts]

38 of 41

39 of 41

What: Data harmonization depends on the analysis context.

Why: Liberal use of harmonization faces overlooked caveats.

How: Benchmarking tools and alternative methods are needed.

So what: Evidence-based practice for data harmonization!

40 of 41

What’s next

  • Data Type
  • Analysis Goal
  • Data Application (Reanalysis)

41 of 41

Happy Birthday, Shili!