1 of 19

Simple Machine Learning using microRNA profiles of oral lesions to classify them as benign or malignant

University of Illinois at Chicago

College of Dentistry

2 of 19

Problem: How to characterize oral epithelial gene expression noninvasively.

To detect oral cancer noninvasively

To facilitate the use of gene expression analysis in human studies on cells other than blood cells.

3 of 19

RNA from Brush Biopsy sampling of oral mucosa

Noninvasive

Allows for repeated sampling

J.L. Schwartz, S. Panda, C. Beam, L.E. Bach, G.R. Adami 2008 J. Oral Path Med

Kolokythas et al. 2011 Oral Oncol

Kolokythas A, Bosman M, Pytynia K, Dai Y, Sroussi, Schwartz JL, Adami GR 2013 J. Oral Path Med

4 of 19

Surgical Biopsy vs Brush Biopsy

Adami, G.R., Tang, J.L. and Markiewicz, M.R. 2017 Oral Oncol

5 of 19

Oral mucosal lesions can be hard to diagnose visually.

The left lesion is oral squamous cell carcinoma (OSCC) and the right is benign inflammation.

6 of 19

Method of measurement of miRNA

Polyadenylation RT-PCR with LNA, visualized with SYBR Green, to measure miRNA.

Sample 370 miRNAs at a time, though only about 240 are expressed in oral epithelium.

Alternative methods are miSeq or NanoString nCounter, which are easier but have other limitations.

7 of 19

Class comparison gives a list of differentially expressed miRNAs using a t-test and correction for multiple testing.
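A minimal sketch of this kind of class comparison, run on synthetic data. The slide does not say which t-test variant or which multiple-testing correction was used, so the Welch t-test and the Benjamini-Hochberg procedure below are assumptions, as are the function names and the simulated expression values.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (one per input p-value)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def class_comparison(group_a, group_b, alpha=0.05):
    """Per-miRNA Welch t-test; rows are samples, columns are miRNAs."""
    _, p = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)
    q = benjamini_hochberg(p)
    return np.flatnonzero(q < alpha), p, q


# Synthetic stand-in for measured miRNA levels (not real data).
rng = np.random.default_rng(0)
n_mirna = 240
oscc = rng.normal(0.0, 1.0, size=(20, n_mirna))
normal = rng.normal(0.0, 1.0, size=(20, n_mirna))
oscc[:, :5] += 3.0  # simulate five differentially expressed miRNAs

hits, p, q = class_comparison(oscc, normal)
print("miRNA columns passing the adjusted threshold:", hits)
```

The adjusted values in `q` are never smaller than the raw p-values, which is why a list built from `q` is shorter than one built from `p` alone.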

8 of 19

Heat Map of miRNA levels, Brush Biopsy: OSCC vs. Normal

Kolokythas et al. 2015 PLoS One

10 of 19

Class Prediction Using Machine Learning

Allows you to classify an unknown sample given a profile of its features, in this case miRNA levels.

Requires a spreadsheet with miRNA levels for each sample.
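One possible layout for such a spreadsheet, sketched as a CSV with one row per sample and one column per miRNA. The column names (`sample_id`, `label`, `miR-A` and so on), the values, and the `load_profiles` helper are all hypothetical and are shown only to illustrate the shape of the input.

```python
import csv
import io

# Hypothetical spreadsheet exported as CSV: one row per sample,
# one column per miRNA. Names and values are placeholders.
EXAMPLE_CSV = """sample_id,label,miR-A,miR-B,miR-C
S1,OSCC,5.1,2.0,7.3
S2,OLP,3.2,2.1,6.8
S3,OSCC,5.4,1.9,7.9
"""


def load_profiles(handle):
    """Split the table into sample ids, feature names, feature rows and labels."""
    reader = csv.DictReader(handle)
    feature_names = [c for c in reader.fieldnames if c not in ("sample_id", "label")]
    ids, rows, labels = [], [], []
    for row in reader:
        ids.append(row["sample_id"])
        labels.append(row["label"])
        rows.append([float(row[name]) for name in feature_names])
    return ids, feature_names, rows, labels


ids, feature_names, X, y = load_profiles(io.StringIO(EXAMPLE_CSV))
print(feature_names)
print(X[0], y[0])
```

An unknown sample would be a row with the same miRNA columns but no label.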

11 of 19

Clustering of samples based on miRNAs can be supervised or unsupervised. Shown is unsupervised clustering for RNA from oral brush biopsies from tobacco smokers and never smokers.
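A minimal sketch of unsupervised clustering on synthetic profiles. The slide does not name the clustering method, so average-linkage hierarchical clustering with Euclidean distance is an assumption here, and the two simulated groups only stand in for smokers and never smokers.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in: 10 samples per group, 50 miRNAs (not real data).
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(10, 50))
group_b = rng.normal(0.0, 1.0, size=(10, 50))
group_b[:, :20] += 6.0  # simulate a strong expression difference
profiles = np.vstack([group_a, group_b])

# Group labels are never passed in: the tree is built from the
# miRNA profiles alone, which is what makes this unsupervised.
tree = linkage(profiles, method="average", metric="euclidean")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(clusters)
```

Comparing `clusters` with the known group membership afterwards shows whether the profiles separate the groups on their own.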

12 of 19

Heat Map of miRNA expression profiles OLP vs. OSCC

OLP OSCC

Supervised clustering uses more information and works well for groups with more subtle differences. Shown here is a heat map of the two groups, OLP and OSCC.

13 of 19

Random Forest relies on at least two “innovations”. First, it bootstraps the data: each decision tree is trained on a random resample of the samples, which helps to avoid overfitting.

Second, at each split it tests a random subset of the features, choosing the feature and cutoff that sort the samples into the most homogeneous groups.
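The two ideas above map onto two settings in scikit-learn's `RandomForestClassifier`, sketched here on synthetic data. The library choice, the parameter values and the simulated profiles are assumptions; the slides do not say which implementation was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 15 samples per class, 240 miRNAs (not real data).
rng = np.random.default_rng(2)
n_per_class, n_mirna = 15, 240
X = rng.normal(0.0, 1.0, size=(2 * n_per_class, n_mirna))
y = np.array([0] * n_per_class + [1] * n_per_class)
X[y == 1, :10] += 3.0  # simulate ten informative miRNAs

forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,       # each tree sees a bootstrap resample of the samples
    max_features="sqrt",  # each split considers a random subset of miRNAs
    oob_score=True,       # score each sample using trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

top = np.argsort(forest.feature_importances_)[::-1][:10]
print("out-of-bag accuracy:", forest.oob_score_)
print("highest-ranked miRNA columns:", top)
```

The out-of-bag score is a by-product of bootstrapping: samples left out of a tree's resample act as a small test set for that tree.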

14 of 19

Support Vector Machines

Plots, or “clusters”, data in multidimensional space with each axis a feature, then determines the formula of a hyperplane to separate the groups.
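A minimal sketch of a linear support vector machine on synthetic data, showing that the fitted hyperplane is the set of points where w·x + b = 0. The use of scikit-learn, the feature standardization step and the simulated profiles are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 15 samples per class, 20 miRNAs (not real data).
rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(30, 20))
y = np.array([0] * 15 + [1] * 15)
X[y == 1, :5] += 3.0  # simulate five informative miRNAs

model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)

# Hyperplane in the standardized feature space: w . x + b = 0.
svc = model.named_steps["svc"]
w, b = svc.coef_[0], svc.intercept_[0]

# The sign of w . x + b decides which side of the hyperplane a sample is on.
scaled = model.named_steps["standardscaler"].transform(X)
manual_scores = scaled @ w + b
print("training accuracy:", model.score(X, y))
```

Each entry of `w` is the weight of one miRNA axis, so large absolute weights point to the features that drive the separation.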

15 of 19

Construction of a Class predictor for OSCC versus OLP

Class prediction using a 30-sample training set of OSCC and OLP

Test of Random Forest Algorithm on classification of validation set.

16 of 19

Sources of Error

Calibrate Sample Collectors
<90% reproducibility rate for qRT-PCR based miRNA measurement with low concentration samples
Need more than 16 OLP samples
RF does not deal well with sample heterogeneity and can only do two-way classification well. Is OLP a homogeneous disease?

17 of 19

ROC curve shows 89% accuracy in prediction of OSCC using leave-one-out cross-validation (LOO-CV)
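A minimal sketch of leave-one-out cross-validation with an ROC curve, on synthetic data. Each sample is predicted by a model trained on all the other samples, and the held-out predictions are pooled to draw the curve. The classifier settings and simulated profiles are assumptions, and the numbers printed are not expected to match the figure reported on the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in: 15 samples per class, 50 miRNAs (not real data).
rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(30, 50))
y = np.array([0] * 15 + [1] * 15)
X[y == 1, :5] += 2.0  # simulate five informative miRNAs

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# One fit per sample: train on 29, predict the probability for the 1 left out.
proba = cross_val_predict(
    clf, X, y, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)
accuracy = np.mean((proba >= 0.5).astype(int) == y)
print("LOO-CV AUC:", auc)
print("LOO-CV accuracy at a 0.5 cutoff:", accuracy)
```

With small sample sets, LOO-CV uses nearly all the data for training in every round, at the cost of one model fit per sample.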

18 of 19

Solution

More samples.

More calibration.

With enough samples, deep learning methods with neural networks can be used to classify samples into multiple groups.

19 of 19

People who helped with the work

Joel Schwartz
Antonia Kolokythas
Madeleine Fine
Joel Epstein
Robert Cabay
Jessica Tang
Michael Markiewicz
Nicholas Callahan
Beth Miloro

Saurabh Sinha UIUC
Saba Ghaffari UIUC
Yalu Zhou