Evidence-Based Practice of Transcriptomics Data Harmonization for Survival Outcome Prediction
Li-Xuan Qin
Department of Epidemiology and Biostatistics
Memorial Sloan Kettering Cancer Center
New York
ABGOD Conference in Honor of Professor Shili Lin
The University of Texas at Dallas, Richardson, TX
March 18, 2023
Outline
Collaboration with Professors Andy Ni (OSU) and Mengling Liu (NYU).
Data artifacts: ubiquitous in microarray data
https://directorsblog.nih.gov
Data artifacts: ubiquitous in sequencing data
Data artifacts: an example
[Figure panel labels: September; November, February; October]
Qin et al, Cancer Informatics 2013
Data artifacts: negatively impact data translation
Data harmonization: for data preprocessing
Data harmonization: many methods published
Normalization: Quantile Norm, ...
Batch-effect (BE) Correction: ComBat, ...
Data harmonization: quantile normalization
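As a minimal sketch of the idea (not taken from the slides), quantile normalization forces every sample to share one reference distribution: sort each sample, average across samples rank by rank, then give each value the reference value at its rank. The function name `quantile_normalize`, the genes-by-samples layout, and the toy numbers are assumptions made for illustration; ties are broken by position rather than averaged.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a genes x samples matrix (each column is one sample)."""
    expr = np.asarray(expr, dtype=float)
    # Sort each sample, then average across samples at each rank
    # to obtain the shared reference distribution.
    reference = np.sort(expr, axis=0).mean(axis=1)
    # Rank of each value within its own sample (0 = smallest).
    ranks = expr.argsort(axis=0).argsort(axis=0)
    # Replace each value by the reference value at its rank.
    return reference[ranks]

# Toy data (hypothetical): 3 genes x 2 samples.
expr = np.array([[2.0, 4.0],
                 [6.0, 1.0],
                 [4.0, 9.0]])
normalized = quantile_normalize(expr)
print(normalized)
```

After normalization both columns contain the same set of values, so any between-sample difference in overall distribution is removed.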
Data harmonization: ComBat
Location and Scale BE parameters with parametric empirical priors:
Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error
Batch-Effect-L ~ Normal()
Batch-Effect-S ~ Inverse Gamma()
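For reference, the model in words above can be written in symbols. The notation below follows the usual published ComBat formulation and is an assumption here, since the slide text gives no symbols or hyperparameters: gene g, batch i, sample j, with X the design row for the sample groups.

```latex
\begin{align*}
Y_{ijg} &= \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
  \qquad \varepsilon_{ijg} \sim N(0, \sigma_g^2) \\
\gamma_{ig} &\sim N(\gamma_i, \tau_i^2), \qquad
\delta_{ig}^2 \sim \text{Inverse Gamma}(\lambda_i, \theta_i)
\end{align*}
```

Here \alpha_g is the Baseline, X\beta_g the Group-Effect, \gamma_{ig} the location batch effect (Batch-Effect-L), and \delta_{ig} the scale batch effect (Batch-Effect-S).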
Data harmonization: QN/ComBat assessed for DEA
Data harmonization: QN/ComBat adopted widely
[Figure labels: 5481; 2929; June 2021]
DEA
“Off-Label” Uses: Classification, Methylation Data, Survival Prediction
Off-Label Use: CAVEAT as QN changes across-sample ordering
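The caveat can be seen in a toy example. This is a sketch with made-up numbers, not data from the talk: for one gene, sample A is below sample B before quantile normalization and above it afterward, which matters when a predictor relies on how a gene ranks across samples.

```python
import numpy as np

# Hypothetical toy data: 3 genes (rows) x 2 samples A and B (columns).
raw = np.array([[5.0, 6.0],
                [1.0, 7.0],
                [2.0, 8.0]])

# Quantile normalization: the rank-wise mean of the sorted columns is the
# reference; each value is replaced by the reference value at its rank.
reference = np.sort(raw, axis=0).mean(axis=1)
ranks = raw.argsort(axis=0).argsort(axis=0)
normalized = reference[ranks]

# Compare the across-sample ordering of gene 0 before and after.
a_below_b_raw = bool(raw[0, 0] < raw[0, 1])
a_below_b_norm = bool(normalized[0, 0] < normalized[0, 1])
print(raw[0], normalized[0], a_below_b_raw, a_below_b_norm)
```

Gene 0 is the largest value in sample A but the smallest in sample B, so after normalization it receives the top reference value in A and the bottom one in B, reversing the raw ordering.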
Off-Label Use: CAVEAT as ComBat has no proper sample groups
Location and Scale BE parameters with parametric empirical priors:
Gene Expression Level = Baseline + Group-Effect + Batch-Effect-L + Batch-Effect-S * Error
?
?
hazard(Event) = function(Adjusted Expression Level)
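The slide leaves the function unspecified. One common choice, assumed here purely for illustration, is the Cox proportional hazards model, where x^* denotes the adjusted expression levels of a sample, h_0(t) an unspecified baseline hazard, and \beta the regression coefficients:

```latex
h(t \mid x^{*}) = h_0(t)\,\exp\!\left(\beta^{\top} x^{*}\right)
```

Under this form there is no discrete group term comparable to the Group-Effect in the ComBat model, which is the source of the caveat.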
Outline
1. We developed an alternative method for managing artifacts in survival prediction.
Our method simultaneously manages BE & builds a predictor
In BatMan, the batch effects cancel out
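The slide text does not state how the cancellation arises. One mechanism that produces it, offered here as an assumption rather than as BatMan's actual formulation, is stratifying a Cox model by batch. Suppose sample i in batch k has measured expression x_i = \tilde{x}_i + b_k, where b_k is an additive batch shift. With D_k the samples with events in batch k and R_k(t_i) the samples in batch k still at risk at time t_i, the stratified partial likelihood is:

```latex
\begin{align*}
L(\beta)
 &= \prod_{k} \prod_{i \in D_k}
    \frac{\exp\{\beta^{\top}(\tilde{x}_i + b_k)\}}
         {\sum_{j \in R_k(t_i)} \exp\{\beta^{\top}(\tilde{x}_j + b_k)\}} \\
 &= \prod_{k} \prod_{i \in D_k}
    \frac{\exp(\beta^{\top}\tilde{x}_i)}
         {\sum_{j \in R_k(t_i)} \exp(\beta^{\top}\tilde{x}_j)}
\end{align*}
```

The factor \exp(\beta^{\top} b_k) is constant within a batch, so it appears in both numerator and denominator and drops out.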
2. We built a benchmarking tool for assessing new and existing methods with re-sampling-based simulations.
We first collected well-designed empirical data
Careful Design: 1 tech in 1 run
Typical Practice: 2 techs in 5 batches
Qin et al, Clinical Cancer Research 2014
We then used a re-sampling algorithm to simulate additional paired data sets
Qin et al, Journal of Clinical Oncology 2016
Qin and Levine, BMC Medical Genomics 2016
We used an algorithm to reassign survival outcome (PFS)
Ni and Qin, Briefings in Bioinformatics 2021
Scenario Notation | Batch effects in the training data | Batch effects in the test data | Batch effects correlated with outcome in the training data | Batch effects correlated with outcome in the test data |
BE00Cor00 | No | No | No | No |
BE10Cor00 | Yes | No | No | No |
BE10Cor10 | Yes | No | Yes | No |
BE11Cor00 | Yes | Yes | No | No |
BE11Cor10 | Yes | Yes | Yes | No |
BE11Cor01 | Yes | Yes | No | Yes |
BE11Cor11 | Yes | Yes | Yes | Yes (positively) |
BE11Cor1-1 | Yes | Yes | Yes | Yes (negatively) |
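The scenario labels in the table follow a regular pattern: "BE" plus one digit each for training and test data, then "Cor" plus the training digit and a test value of 0, 1, or -1. A small parser can make that explicit. This is a sketch only; the names `Scenario` and `parse_scenario` are hypothetical and are not part of the PRECISION.array.survival package.

```python
import re
from typing import NamedTuple

class Scenario(NamedTuple):
    be_train: bool   # batch effects present in the training data
    be_test: bool    # batch effects present in the test data
    cor_train: bool  # batch effects correlated with outcome in the training data
    cor_test: int    # test data: 0 = no correlation, 1 = positive, -1 = negative

# "BE" + two 0/1 digits, then "Cor" + one 0/1 digit + one of 0, 1, -1.
_PATTERN = re.compile(r"BE([01])([01])Cor([01])(-1|[01])")

def parse_scenario(label: str) -> Scenario:
    """Decode a label such as 'BE11Cor1-1' into its four components."""
    match = _PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized scenario label: {label!r}")
    be_train, be_test, cor_train, cor_test = match.groups()
    return Scenario(be_train == "1", be_test == "1", cor_train == "1", int(cor_test))

labels = ["BE00Cor00", "BE10Cor00", "BE10Cor10", "BE11Cor00",
          "BE11Cor10", "BE11Cor01", "BE11Cor11", "BE11Cor1-1"]
for label in labels:
    print(label, parse_scenario(label))
```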
Ni, Liu and Qin, JCO CCI in press
Simulation results: Oracle method
Ni, Liu and Qin, JCO CCI in press
Simulation results: LASSO method
Ni, Liu and Qin, JCO CCI in press
Simulation results: Univariate selection method
Ni, Liu and Qin, JCO CCI in press
We made available an R package for method benchmarking
PRECISION.array.survival
3. We conducted a case study using data from The Cancer Genome Atlas (TCGA).
Application to TCGA ovarian miRNA array data
Ni, Liu and Qin, JCO CCI in press
Outline
Omics Data
Data Artifacts
What: Data harmonization depends on the analysis context.
Why: Liberal use of harmonization faces overlooked caveats.
How: Benchmarking tools and alternative methods are needed.
So what: Evidence-based practice for data harmonization!
What’s next
Data Type
Analysis Goal
Data Application (Reanalysis)
Happy Birthday, Shili!