Simple Machine Learning using microRNA profiles of oral lesions to classify them as benign or malignant
University of Illinois at Chicago
College of Dentistry
Problem: How to characterize oral epithelial gene expression noninvasively.
To detect oral cancer noninvasively
To facilitate the use of gene expression analysis in human studies on cells other than blood cells.
RNA from Brush Biopsy sampling of oral mucosa
Noninvasive
Allows for repeated sampling
J.L. Schwartz, S. Panda, C. Beam, L.E. Bach, G.R. Adami 2008 J. Oral Path Med
Kolokythas et al. 2011 Oral Oncol
Kolokythas A, Bosman M, Pytynia K, Dai Y, Sroussi, Schwartz JL, Adami GR 2013 J. Oral Path Med
Surgical Biopsy vs Brush Biopsy
Adami, G.R., Tang, J.L. and Markiewicz, M.R. 2017 Oral Oncol
Oral mucosal lesions can be hard to diagnose visually.
The lesion on the left is oral squamous cell carcinoma (OSCC); the lesion on the right is benign inflammation.
Method of measurement of miRNA
Polyadenylation RT-PCR with LNA, visualized with SYBR Green, to measure miRNA.
Samples 370 miRNAs at a time, though only about 240 are expressed in oral epithelium.
Alternative methods are MiSeq or NanoString nCounter, which are easier but have other limitations.
Class comparison gives a list of differentially expressed miRNAs, using a t-test for each miRNA and a correction for multiple testing.
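The class comparison step can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis used in the cited work: the slide does not say which t-test variant or which multiple-testing correction was applied, so Welch's t-test and the Benjamini-Hochberg false discovery rate procedure are assumptions, and the names `benjamini_hochberg` and `class_comparison` are hypothetical.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale each sorted p-value by n / rank.
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def class_comparison(group_a, group_b, alpha=0.05):
    """Compare two groups; each array has shape (samples, miRNAs)."""
    # One Welch t-test per miRNA (column).
    _, pvals = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)
    qvals = benjamini_hochberg(pvals)
    return np.flatnonzero(qvals < alpha), qvals


# Synthetic stand-in for a spreadsheet of miRNA levels (not real data).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(15, 240))
tumor = rng.normal(0.0, 1.0, size=(15, 240))
tumor[:, :5] += 4.0  # pretend the first five miRNAs are up in tumor samples

hits, qvals = class_comparison(tumor, normal)
print("Differentially expressed miRNA columns:", hits)
```

Without the correction step, testing about 240 miRNAs at p < 0.05 would be expected to flag roughly a dozen by chance alone, which is why the adjusted values are used for the final list.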
Heat Map of miRNA levels in Brush Biopsy: OSCC vs. Normal
Kolokythas et al. 2015 PLoS One
Class Prediction Using Machine Learning
Allows you to classify an unknown sample given a profile of its features, in this case miRNA levels.
Requires a spreadsheet with miRNA levels for each sample.
Clustering of samples based on miRNAs can be supervised or unsupervised. Shown is unsupervised clustering of RNA from oral brush biopsies from tobacco smokers and never smokers.
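Unsupervised clustering of this kind can be sketched with hierarchical clustering, which is the usual basis for the dendrograms drawn beside heat maps. The data below are synthetic, and the choice of average linkage with Euclidean distance is an assumption, since the slide does not name the method used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic miRNA profiles (not real data): 10 smokers, 10 never smokers.
rng = np.random.default_rng(1)
smokers = rng.normal(0.0, 1.0, size=(10, 50))
never_smokers = rng.normal(0.0, 1.0, size=(10, 50))
smokers[:, :10] += 5.0  # pretend ten miRNAs differ strongly between groups
profiles = np.vstack([smokers, never_smokers])

# Build the tree from the profiles alone; no group labels are supplied.
tree = linkage(profiles, method="average", metric="euclidean")

# Cut the tree into two clusters and compare with the known groups.
clusters = fcluster(tree, t=2, criterion="maxclust")
print("Cluster assignment for smokers:      ", clusters[:10])
print("Cluster assignment for never smokers:", clusters[10:])
```

Because the labels are never used, any agreement between clusters and groups reflects structure in the miRNA levels themselves.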
Heat Map of miRNA expression profiles: OLP vs. OSCC
OLP OSCC
Supervised clustering uses more information and works well for groups with more subtle differences. Shown here is a heat map of the two groups, OLP and OSCC.
Random Forest relies on at least two “innovations”. First, it bootstraps the data: each decision tree is trained on a random resample of the samples, which helps to avoid overfitting.
Second, at each split it tests a random subset of the features and chooses the feature and cutoff that sort the samples into the most homogeneous groups.
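The two ideas above map onto two settings of scikit-learn's `RandomForestClassifier`. The following is a minimal sketch on synthetic data; the number of trees, the `"sqrt"` feature-subset size, and the simulated miRNA shifts are illustrative assumptions, not the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training set (not real data): 15 OLP and 15 OSCC samples.
rng = np.random.default_rng(2)
n_per_class, n_mirnas = 15, 240
olp = rng.normal(0.0, 1.0, size=(n_per_class, n_mirnas))
oscc = rng.normal(0.0, 1.0, size=(n_per_class, n_mirnas))
oscc[:, :8] += 3.0  # pretend eight miRNAs are higher in OSCC
X = np.vstack([olp, oscc])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = OLP, 1 = OSCC

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees
    bootstrap=True,       # each tree sees a bootstrap resample of the samples
    max_features="sqrt",  # random subset of miRNAs considered at each split
    oob_score=True,       # score each sample with trees that never saw it
    random_state=0,
)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)
top = np.argsort(forest.feature_importances_)[::-1][:8]
print("Most important miRNA columns:", top)
```

The out-of-bag score is a by-product of bootstrapping: every tree leaves out some samples, so those samples serve as a built-in test set for that tree.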
Support Vector Machines plot, or “cluster”, the data in multidimensional space with each axis a feature, then determine the formula of a hyperplane that separates the groups.
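For intuition, here is a minimal sketch with only two features, so the separating hyperplane is a line of the form w·x + b = 0. The data are synthetic, and the linear kernel and `C=1.0` are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic levels of two miRNAs (not real data) for 15 OLP and 15 OSCC samples.
rng = np.random.default_rng(3)
olp = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(15, 2))
oscc = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(15, 2))
X = np.vstack([olp, oscc])
y = np.array([0] * 15 + [1] * 15)  # 0 = OLP, 1 = OSCC

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

# The fitted hyperplane: samples with w.x + b > 0 are called OSCC.
w = svm.coef_[0]
b = svm.intercept_[0]
print(f"Hyperplane: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")

# Classify a hypothetical unknown sample from its two miRNA levels.
unknown = np.array([[3.5, 4.2]])
print("Predicted class:", svm.predict(unknown)[0])
```

With real profiles each of the roughly 240 expressed miRNAs would be its own axis, and the same formula applies with a longer weight vector `w`.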
Construction of a Class predictor for OSCC versus OLP
Class prediction using a 30-sample training set of OSCC and OLP
Test of Random Forest Algorithm on classification of validation set.
Sources of Error
ROC curve shows 89% accuracy in prediction of OSCC using leave-one-out cross-validation (LOO-CV)
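Leave-one-out cross-validation can be sketched as follows: each sample is held out in turn, the classifier is trained on the remaining samples, and the held-out sample is then scored. The data are synthetic and the classifier settings are assumptions. The slide does not say whether the 89% figure is an accuracy or an area under the ROC curve, so the sketch computes both.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic data set (not real data): 15 OLP and 15 OSCC samples, 50 miRNAs.
rng = np.random.default_rng(4)
n_per_class, n_mirnas = 15, 50
olp = rng.normal(0.0, 1.0, size=(n_per_class, n_mirnas))
oscc = rng.normal(0.0, 1.0, size=(n_per_class, n_mirnas))
oscc[:, :5] += 3.0  # pretend five miRNAs are higher in OSCC
X = np.vstack([olp, oscc])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = OLP, 1 = OSCC

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Each probability comes from a model that never saw that sample.
proba = cross_val_predict(
    model, X, y, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]

predicted = (proba >= 0.5).astype(int)
loo_accuracy = accuracy_score(y, predicted)
loo_auc = roc_auc_score(y, proba)
print(f"LOO-CV accuracy: {loo_accuracy:.2f}")
print(f"LOO-CV ROC AUC:  {loo_auc:.2f}")
```

If miRNAs are preselected by class comparison, that selection has to be repeated inside each fold; selecting on all samples first would leak information from the held-out sample and inflate the estimate.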
Solution
More samples.
More calibration.
With enough samples, deep learning methods using neural networks can be used to classify samples into multiple groups.
People who helped with the work