1 of 15

CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis

10/21/2023

Gabriel Mejía

Pablo Arbelaez

Natasha Bloch

Paper ID 8

2 of 15


Why Molecular Cancer Diagnosis?

Cancer Prevalence and Diversity

19.2 M New Cases (2020)

9.9 M Deaths (2020)

Cancer Diagnosis Problems:

Many tissues
Many cancer types
Dependent on expertise level
Time-consuming

Proposed Solution:

Use a gene expression vector to predict the cancer type or normal tissue.

3 of 15

But Wait… What Public Data Do We Have?


4 of 15


Public Databases

The Cancer Genome Atlas (TCGA)

Genotype-Tissue Expression Project (GTEx)

Healthy TCGA counterpart
Atlas of healthy tissue expression
54 non-diseased tissues
Nearly 1,000 individuals
More than 17,000 RNA-seq samples

TCGA and GTEx Data Are NOT Comparable!!!

Reference Genome, Alignment and Quantification Algorithm Variability

5 of 15


Efforts to Join the GTEx and the TCGA

Vivian et al. Dataset

Wang et al. Dataset

STAR Alignment and RSEM quantification
18,354 samples with 60,498 genes

6 of 15


Related Work

Quinn et al. Tissue Detector

Binary cancer detection with residual analysis
Fits a statistical model and predicts outlier samples as cancer

Hong et al. Multitask MLPs

Classifies disease state, tissue and cancer subtype separately
Uses one multi-task and one normal MLP

7 of 15

So… Problem Solved, Right?


8 of 15


TCGA

Translation Bias

GTEx

Linear SVM

Both Datasets Present Empirical Translation Biases

Previous Results Lose Clinical Relevance

Z-score Batch Standardization Corrects Translation Biases
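A minimal sketch of z-score batch standardization, assuming each source dataset (TCGA or GTEx) is treated as one batch and that every gene is standardized independently within its batch. The function name `zscore_by_batch`, the `eps` guard and the toy data are illustrative assumptions, not taken from the CanDLE code.

```python
import numpy as np

def zscore_by_batch(expr, batch, eps=1e-8):
    """Standardize every gene to zero mean and unit variance within each batch.

    expr:  (n_samples, n_genes) expression matrix
    batch: (n_samples,) batch label per sample (assumed here: source dataset)
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    out = np.empty_like(expr)
    for b in np.unique(batch):
        rows = batch == b
        mean = expr[rows].mean(axis=0)
        std = expr[rows].std(axis=0)
        # eps avoids division by zero for genes that are constant in a batch
        out[rows] = (expr[rows] - mean) / (std + eps)
    return out

# Toy data: two batches whose genes differ by an offset and a scale
rng = np.random.default_rng(0)
tcga = rng.normal(loc=5.0, scale=1.0, size=(50, 3))
gtex = rng.normal(loc=8.0, scale=2.0, size=(40, 3))
expr = np.vstack([tcga, gtex])
batch = np.array(["TCGA"] * 50 + ["GTEx"] * 40)

standardized = zscore_by_batch(expr, batch)
```

After this step both batches share the same per-gene mean and spread, so a constant offset between datasets can no longer be used to tell them apart.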

[Figure: two panels of samples plotted on Gene 1, Gene 2 and Gene 3 axes]

9 of 15


CanDLE: Cancer Diagnosis Logistic Engine

Which cancer/tissue?

Classification & All-vs-One Detection

63 Neurons for Multiclass Classification

2 Neurons for All-vs-one Detection

Simplest Gradient Based Approach

[Figure: Gene Expression Vector (0.2, 0.0, 1.2, -0.5, -0.1, …) → SoftMax → Class Probability Vector (0.01, 0.0, 0.7, 0.2, 0.05, …)]

Multinomial Logistic Regression
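A minimal sketch of multinomial logistic regression trained by plain gradient descent: one linear layer followed by a softmax, which is the model family the slide names. The toy data, learning rate, epoch count and function names are assumptions for illustration; the slides use 63 output neurons, while this example uses 3 classes to stay small.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def train_softmax_regression(x, y, n_classes, lr=0.1, epochs=200):
    """Fit weights (n_genes, n_classes) and biases by minimizing cross-entropy."""
    n_samples, n_genes = x.shape
    w = np.zeros((n_genes, n_classes))
    b = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[y]
    for _ in range(epochs):
        probs = softmax(x @ w + b)
        grad = (probs - one_hot) / n_samples  # gradient of the loss w.r.t. logits
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

# Toy "expression" data: 3 classes, 5 genes, well-separated class centers
rng = np.random.default_rng(0)
n_classes = 3
centers = 4.0 * np.eye(n_classes, 5)
y = rng.integers(0, n_classes, size=300)
x = centers[y] + rng.normal(size=(300, 5))
x = (x - x.mean(axis=0)) / x.std(axis=0)  # z-score the inputs

w, b = train_softmax_regression(x, y, n_classes)
probs = softmax(x @ w + b)   # class probability vector per sample
pred = probs.argmax(axis=1)  # predicted class
```

Each row of `probs` sums to one, matching the class probability vector in the diagram, and the learned `w` holds one weight per gene and class.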

Previous Findings for Representation Learning

[1] Smith, A., et al., 2020: Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. DOI: 10.1186/s12859-020-3427-8

10 of 15


Experimental Setup

Random 60/20/20% Train/Val/Test Partition
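A random 60/20/20 partition can be sketched as a shuffle of sample indices followed by two cuts. This is a plain, unstratified split; whether the original experiments stratified by class or fixed a particular seed is not stated in the slides, so `random_split` and its `seed` argument are illustrative assumptions.

```python
import numpy as np

def random_split(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return disjoint train/val/test index arrays covering all samples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = random_split(1000)
```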

11 of 15


Main Results: Classification

CanDLE’s Simplicity Generalizes Better Once Biases Are Removed

*Reimplementation

Surpasses the State of the Art by +7.3% Balanced Accuracy on the Test Set

[2] Hong, J., et al., 2022: A deep learning model to classify neoplastic state and tissue origin from transcriptomic data. DOI: 10.1038/s41598-022-13665-5

Hong’s Method Takes Advantage of Translation Biases
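The slides report balanced accuracy without spelling out the formula. Under its usual definition, the mean of per-class recall, it can be sketched as below; the function name and toy labels are illustrative assumptions. The example shows why the metric matters when classes are imbalanced: missing half of a rare class barely moves plain accuracy but lowers balanced accuracy sharply.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Imbalanced toy labels: 8 samples of class 0, 2 samples of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

plain = float(np.mean(y_true == y_pred))        # 9 of 10 correct
balanced = balanced_accuracy(y_true, y_pred)    # mean of recalls 1.0 and 0.5
```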

12 of 15


Main Results: All-Vs-One Detection

The Worst-Performing Classes Generally Have Few Training Samples

Comparable Performance With Respect to the State-of-the-art Method by Quinn et al.

[3] Quinn, T., et al., 2019: Cancer as a Tissue Anomaly: Classifying Tumor Transcriptomes Based Only on Healthy Data. DOI: 10.3389/fgene.2019.00599

13 of 15


Interpretation

Select Top 1,000 Genes in Absolute Value for Each Cancer Class

Order by Number of Times That a Gene Was Selected for a Class

1,982 Genes Important for at Least 3 Cancers Were in the Final List

Gene Ontology Biological Processes Enrichment Analysis

Developmental and Morphogenesis Pathways Were Over-Represented
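The gene selection steps on this slide can be sketched as follows, assuming the ranked values are the per-class weights of the logistic regression layer (the slides say only "absolute value"). The function name `recurrent_top_genes` and the toy weight matrix are hypothetical; the slides use the top 1,000 genes per class and a threshold of 3 cancers, while this example uses a top 2 to stay readable.

```python
import numpy as np

def recurrent_top_genes(weights, top_k, min_classes):
    """Pick genes that rank in the top_k by |weight| for at least min_classes classes.

    weights: (n_classes, n_genes) matrix, one row per cancer class
    Returns gene indices ordered by selection count (descending) and the counts.
    """
    weights = np.asarray(weights, dtype=float)
    n_genes = weights.shape[1]
    counts = np.zeros(n_genes, dtype=int)
    for class_weights in weights:
        top = np.argsort(-np.abs(class_weights))[:top_k]
        counts[top] += 1
    selected = np.flatnonzero(counts >= min_classes)
    order = np.argsort(-counts[selected], kind="stable")
    return selected[order], counts[selected][order]

# Toy weights: 4 classes, 6 genes
weights = np.array([
    [ 5.0, -4.0, 0.1,  0.2, 0.0, 0.3],
    [-6.0,  0.1, 3.0,  0.2, 0.0, 0.3],
    [ 4.0, -5.0, 0.1,  0.2, 0.0, 0.3],
    [ 0.1,  7.0, 0.2, -0.3, 0.0, 6.0],
])
genes, counts = recurrent_top_genes(weights, top_k=2, min_classes=3)
```

Here genes 0 and 1 are each selected for three classes and survive the threshold. The resulting list is the kind of input a Gene Ontology enrichment tool would take.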

14 of 15

Take Home Messages

Batch effects appear even with standard quantification of RNA-seq.
There is no need for a complex model in RNA-seq based cancer diagnosis.
Machine learning cancer diagnosis is feasible and can offer valuable biological insights.

Code Availability

https://github.com/g27182818/CanDLE

15 of 15

Thank You for Your Time! Questions?


Gabriel Mejía

gm.mejia@uniandes.edu.co

Pablo Arbelaez

pa.arbelaez@uniandes.edu.co

Natasha Bloch

n.blochm@uniandes.edu.co

Biomedical Computer Vision

CanDLE’s Code Availability