1 of 15

CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis

10/21/2023

Gabriel Mejía

Pablo Arbelaez

Natasha Bloch

Paper ID 8

2 of 15


Why Molecular Cancer Diagnosis?

Cancer Prevalence and Diversity

19.2 M New Cases (2020)

9.9 M Deaths (2020)

  • Cancer Diagnosis Problems:
    • Many tissues
    • Many cancer types
    • Dependent on the examiner's level of expertise
    • Time-consuming
  • Proposed Solution:
    • Use the gene expression vector of a sample to predict its cancer type or normal tissue.

3 of 15

But Wait… What Public Data Do We Have?


4 of 15


Public Databases

 

TCGA: The Cancer Genome Atlas

GTEx: Genotype-Tissue Expression Project

  • Healthy TCGA counterpart
  • Atlas of healthy tissue expression
  • 54 non-diseased tissues
  • Nearly 1,000 individuals
  • More than 17,000 RNA-seq samples

TCGA and GTEx Data Are NOT Comparable!!!

Reference Genome, Alignment and Quantification Algorithm Variability

5 of 15


Efforts to Join the GTEx and the TCGA

Vivian et al. Dataset

Wang et al. Dataset

  • STAR Alignment and RSEM quantification
  • 18,354 samples with 60,498 genes

 

6 of 15


Quinn et al. Tissue Detector

  • Binary cancer detection with residual analysis
  • Fits statistical model and predicts outlier samples as cancer

Hong et al. Multitask MLPs

  • Classifies disease state, tissue and cancer subtype separately
  • Uses one multi-task MLP and one standard MLP

Related Work

7 of 15

So… Problem Solved, Right?


8 of 15


TCGA

Translation Bias

GTEx

 

Linear SVM

Both Datasets Present Empirical Translation Biases

Previous Results Lose Clinical Relevance

Z-score Batch Standardization Corrects Translation Biases

[Figure: two panels with axes Gene 1, Gene 2, and Gene 3]
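The z-score batch standardization named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each source dataset (TCGA or GTEx) is treated as one batch and that every gene is standardized with the mean and standard deviation of its own batch. The function name `zscore_per_batch` and the toy data layout are hypothetical.

```python
from statistics import mean, pstdev


def zscore_per_batch(samples, batches):
    """Standardize every gene separately inside each batch.

    samples: list of expression vectors (one list of floats per sample)
    batches: list of batch labels, one per sample (assumed: "TCGA" / "GTEx")
    """
    n_genes = len(samples[0])
    out = [None] * len(samples)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        # Per-gene statistics are computed from this batch only, so a
        # constant offset between datasets is removed by the subtraction.
        mus = [mean(samples[i][g] for i in idx) for g in range(n_genes)]
        sds = [pstdev([samples[i][g] for i in idx]) for g in range(n_genes)]
        for i in idx:
            out[i] = [
                (samples[i][g] - mus[g]) / sds[g] if sds[g] > 0 else 0.0
                for g in range(n_genes)
            ]
    return out


# Toy data: the second batch is shifted by +100 on the first gene.
toy_samples = [[1.0, 10.0], [3.0, 10.0], [101.0, 5.0], [103.0, 7.0]]
toy_batches = ["TCGA", "TCGA", "GTEx", "GTEx"]
standardized = zscore_per_batch(toy_samples, toy_batches)
```

After standardization the first gene takes the same values in both toy batches, so the dataset of origin can no longer be read off from a simple shift.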

9 of 15


 

CanDLE: Cancer Diagnosis Logistic Engine

Which cancer/tissue?

Classification &

all-vs-one Detection

 

63 Neurons for Multilabel Classification

2 Neurons for All-vs-one Detection

 

Simplest Gradient Based Approach

Gene Expression Vector (example): [0.2, 0.0, 1.2, -0.5, -0.1]

SoftMax

Class Probability Vector (example): [0.01, 0.0, 0.7, 0.2, 0.05]

Multinomial Logistic Regression
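A multinomial logistic regression of the kind described on this slide is a single linear layer followed by a SoftMax. The sketch below is illustrative only: the weights are toy values, three classes stand in for the 63 output neurons, and the function names are hypothetical rather than taken from the CanDLE code.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, biases):
    """One linear layer plus SoftMax.

    x: gene expression vector
    weights: one row of per-gene weights per class
    biases: one bias per class
    """
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy example: 5 genes, 3 classes (the slide's model has 63 classes).
x = [0.2, 0.0, 1.2, -0.5, -0.1]
toy_weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
toy_biases = [0.0, 0.0, 0.0]
probs = predict_proba(x, toy_weights, toy_biases)
predicted_class = probs.index(max(probs))
```

For all-vs-one detection the same structure would apply with two output rows instead of 63.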

Previous Findings for Representation Learning

[1] Smith, A., et al., 2020: Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. DOI: 10.1186/s12859-020-3427-8

10 of 15


Experimental Setup

 

Random 60/20/20% Train/Val/Test Partition
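A random 60/20/20 partition can be produced by shuffling sample indices and cutting the shuffled list in two places. The sketch below assumes a plain (non-stratified) shuffle with a fixed seed; the slide does not say whether the authors stratified by class, and `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n_samples, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Return disjoint train/val/test index lists from one random shuffle."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```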

11 of 15



Main Results: Classification

CanDLE’s Simplicity Can Generalize Better With Removed Biases

*Reimplementation

Outperforms the State of the Art by +7.3% Balanced Accuracy on the Test Set
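Balanced accuracy is commonly defined as the mean of the per-class recalls, which stops large classes from dominating the score. The sketch below assumes that common definition applies here; the function name and the toy labels are illustrative.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels: 8 samples of class "a", 2 samples of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case plain accuracy is 0.9, while balanced accuracy is 0.75 because half of the minority class is misclassified.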

 

[2] Hong, J., et al., 2022: A deep learning model to classify neoplastic state and tissue origin from transcriptomic data. DOI: 10.1038/s41598-022-13665-5

Hong’s Method Takes Advantage of Translation Biases

12 of 15


Main Results: All-Vs-One Detection

 

The Worst-Performing Classes Generally Have Few Training Samples

Comparable Performance With Respect to the State-of-the-Art Method by Quinn et al.

[3] Quinn, T., et al., 2019: Cancer as a Tissue Anomaly: Classifying Tumor Transcriptomes Based Only on Healthy Data. DOI: 10.3389/fgene.2019.00599

13 of 15


Select the Top 1,000 Genes by Absolute Value for Each Cancer Class

Order by Number of Times that a Gene Was Selected for a Class

1,982 Genes Important for at Least 3 Cancers Were in the Final List
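The selection steps above can be sketched as follows. The slides do not state which quantity is ranked by absolute value; this sketch assumes it is a per-class vector of per-gene scores (for example, the learned weights of each class). The names `rank_genes`, `top_k`, and `min_classes` are hypothetical.

```python
from collections import Counter


def rank_genes(scores, gene_names, top_k=1000, min_classes=3):
    """Count how often each gene is in the per-class top-k by |score|.

    scores: dict mapping class name -> list of per-gene values
            (assumed here to be per-class model weights)
    Returns (gene, count) pairs, most frequently selected genes first,
    keeping only genes selected for at least min_classes classes.
    """
    counts = Counter()
    for values in scores.values():
        order = sorted(
            range(len(values)), key=lambda g: abs(values[g]), reverse=True
        )
        counts.update(gene_names[g] for g in order[:top_k])
    return [(g, n) for g, n in counts.most_common() if n >= min_classes]


# Toy example: 4 genes, 3 cancer classes, top 2 genes per class.
toy_genes = ["A", "B", "C", "D"]
toy_scores = {
    "cancer_1": [5.0, -4.0, 0.1, 0.0],
    "cancer_2": [-3.0, 0.2, 2.0, 0.0],
    "cancer_3": [1.0, 0.0, 0.0, -0.5],
}
final_list = rank_genes(toy_genes and toy_scores, toy_genes, top_k=2, min_classes=3)
```

With the slide's settings (`top_k=1000`, `min_classes=3`) a list built this way would then be passed to the enrichment analysis.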

Interpretation

Gene Ontology Biological Processes Enrichment Analysis


Developmental and Morphogenesis Pathways Were Over-Represented

14 of 15

  • Batch effects appear even with standard quantification of RNA-seq.
  • There is no need for a complex model in RNA-seq based cancer diagnosis.
  • Machine learning cancer diagnosis is feasible and can offer valuable biological insights.


Take Home Messages

Code Availability

https://github.com/g27182818/CanDLE

15 of 15

Thank You for Your Time! Questions?


Gabriel Mejía

gm.mejia@uniandes.edu.co

Pablo Arbelaez

pa.arbelaez@uniandes.edu.co

Natasha Bloch

n.blochm@uniandes.edu.co

Biomedical Computer Vision

CanDLE’s Code Availability