Biomarkers of Acute Myeloid Leukemia From RNA-seq Expression and Feature Selection with Machine Learning

Jenny Smith (lead)

Vikas Peddu

David Lee (technical)

Sean Madden (Scribe)

Overview

Leukemia is the most common pediatric cancer

AML has high molecular heterogeneity

Recent interest in comparing pediatric and adult cancers to find age-related and age-unrelated markers

Interest in applying machine learning for feature selection, including "old" and newer techniques

Need for well-packaged data forms for TARGET

Variable Selection

  1. 60,488 Genes Initially x 478 Samples
    1. Filtered to remove low-count genes: kept genes with at least 1 CPM in at least 5% of Samples
  2. 21,407 Genes with Adequate expression levels x 478 Samples
  3. Split into 3 Data sets
    • David - merged Clinical Annotations and Normalized counts for each cohort separately
    • AML: 187 cases; NBL: 150 cases; WT: 132 cases
    • Biological Samples used in RNA-seq come from various time points (e.g., diagnosis, remission (?), and relapse)
  4. Batch Effect Investigation
    • PCA with different covariates
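
The low-count filter in step 1 can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual pipeline: the function names and the toy count matrix are hypothetical, and reading "5% of Samples" as "at least ceil(0.05 x number of samples)" is an assumption.

```python
import math


def min_samples_required(n_samples, min_frac=0.05):
    """Number of samples that must pass the CPM cutoff (assumed: rounded up)."""
    return max(1, math.ceil(min_frac * n_samples))


def counts_per_million(counts):
    """Scale raw counts (genes x samples, list of lists) to CPM per sample."""
    n_samples = len(counts[0])
    # Library size = total raw counts in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] * 1e6 / lib_sizes[j] if lib_sizes[j] else 0.0
         for j in range(n_samples)]
        for row in counts
    ]


def keep_expressed_genes(counts, min_cpm=1.0, min_frac=0.05):
    """Return row indices of genes with >= min_cpm in enough samples."""
    cpm = counts_per_million(counts)
    needed = min_samples_required(len(counts[0]), min_frac)
    return [
        i for i, row in enumerate(cpm)
        if sum(value >= min_cpm for value in row) >= needed
    ]


# Toy matrix (made-up values): 4 genes x 4 samples, 1,000,000 reads per sample.
toy_counts = [
    [500000, 400000, 600000, 500000],  # highly expressed
    [0, 0, 0, 0],                      # never detected
    [499998, 600000, 400000, 500000],  # highly expressed
    [2, 0, 0, 0],                      # 2 CPM in a single sample
]

kept = keep_expressed_genes(toy_counts)
```

With 478 samples, this reading of the rule requires a gene to reach 1 CPM in at least 24 samples.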

5A. Differential Expression Analysis

    • limma-voom for each cohort, comparing groups of interest (e.g., all patients who relapsed, patients with mutation X)

5B. No additional filters

Machine Learning

Data Access, Normalization, and Summaries of TARGET AML Samples

  • Download Illumina HiSeq RNA-seq gene counts by samples
  • Convert to TMM-normalized gene expression (scaled to library size)
  • Samples: N = 119 primary marrow, 36 primary peripheral blood
  • Approx. balanced survival time between ‘old’ and ‘young’ age groups (based on median)
  • Approx. balanced sex distribution between old/young age groups
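
The median-based age grouping and the balance checks above can be sketched as follows. This is an illustration under stated assumptions: the field names (`age`, `sex`, `survival_days`) and the toy records are hypothetical and are not TARGET data, and assigning samples exactly at the median to the 'young' group is an arbitrary choice.

```python
from collections import Counter
from statistics import median


def split_by_median_age(samples):
    """Split samples into 'young' and 'old' around the median age."""
    cutoff = median(s["age"] for s in samples)
    groups = {"young": [], "old": []}
    for s in samples:
        # Assumption: ties at the median are assigned to 'young'.
        groups["old" if s["age"] > cutoff else "young"].append(s)
    return cutoff, groups


def summarize_groups(groups):
    """Per-group size, sex counts, and median survival for a balance check."""
    summary = {}
    for name, members in groups.items():
        summary[name] = {
            "n": len(members),
            "sex": dict(Counter(m["sex"] for m in members)),
            "median_survival": (
                median(m["survival_days"] for m in members) if members else None
            ),
        }
    return summary


# Hypothetical toy records, for illustration only.
toy_samples = [
    {"age": 2, "sex": "F", "survival_days": 900},
    {"age": 5, "sex": "M", "survival_days": 400},
    {"age": 9, "sex": "F", "survival_days": 1500},
    {"age": 14, "sex": "M", "survival_days": 700},
    {"age": 17, "sex": "F", "survival_days": 1200},
    {"age": 20, "sex": "M", "survival_days": 300},
]

cutoff, groups = split_by_median_age(toy_samples)
summary = summarize_groups(groups)
```

Comparing `summary["young"]` with `summary["old"]` gives a quick view of whether group sizes, sex counts, and survival times are roughly balanced before any modeling.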

Feature