1 of 12

Machine Learning Model for Predicting Drug Sensitivity in Hepatocellular Carcinoma Cell Lines Using Gene Expression Data

Ethan Cody, Andrew Lukashchuk, Wuyan Li

2 of 12

Early Stage

Late Stage

3 of 12

Huang, Ao, et al. "Targeted therapy for hepatocellular carcinoma." Signal transduction and targeted therapy 5.1 (2020): 146.

4 of 12

5 of 12

Precision Medicine

Without Precision Medicine

With Precision Medicine

Tailored therapy

Response

No-response

Adverse

Same Therapy

DNA

6 of 12

Can we create a prediction model using genetic information to identify the best treatment response in HCC patients?

7 of 12

Key Data Sources

Cancer Cell Line Encyclopedia (CCLE)

  • Provides comprehensive gene expression data
  • Covers multiple cell lines

Genomics of Drug Sensitivity in Cancer (GDSC)

  • Source of drug response data measured as IC50 values
  • Key for understanding how HCC cell lines respond to various drugs

Simplified Molecular Input Line Entry System (SMILES)

  • Standard notation to describe the structure of chemical compounds
  • Enables the conversion of drug molecules into descriptors used for machine learning
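As a rough illustration of turning a SMILES string into numeric features, the sketch below counts atoms with a regular expression. This is an assumption made for illustration only: the slides do not say which descriptors or tools were used, and a real pipeline would normally rely on a cheminformatics toolkit such as RDKit. The function name `smiles_descriptors` and the descriptor keys are hypothetical, and aspirin is used only as a familiar example molecule.

```python
import re
from collections import Counter

# Atom tokens: bracket atoms, two-letter halogens, then single-letter atoms.
# Lowercase letters denote aromatic atoms in SMILES.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
BRACKET_PATTERN = re.compile(r"\[[^\]]+\]")


def smiles_descriptors(smiles):
    """Return a small dict of crude, hypothetical descriptors for one SMILES string."""
    tokens = ATOM_PATTERN.findall(smiles)
    elements = Counter()
    aromatic = 0
    for tok in tokens:
        if tok.startswith("["):
            # Bracket atoms are counted as heavy atoms but not parsed further.
            elements["bracket"] += 1
            continue
        if tok.islower():
            aromatic += 1
        elements[tok.capitalize()] += 1
    # Each ring closure uses the same digit twice; digits inside brackets
    # (isotopes, charges) are removed first so they are not miscounted.
    ring_digits = sum(ch.isdigit() for ch in BRACKET_PATTERN.sub("", smiles))
    return {
        "n_heavy_atoms": len(tokens),
        "n_C": elements["C"],
        "n_N": elements["N"],
        "n_O": elements["O"],
        "n_aromatic_atoms": aromatic,
        "n_ring_closures": ring_digits // 2,
    }


# Aspirin (C9H8O4), used purely as an example input.
aspirin = smiles_descriptors("CC(=O)Oc1ccccc1C(=O)O")
```

Counts like these are far cruder than toolkit-computed descriptors, but they show the general idea: each drug becomes a fixed-length numeric vector that can sit next to gene expression values in a model's input.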

8 of 12

Workflow

Predictive Model

External Dataset

+

Gene Expression with Drug resistance

Drug molecular information

Train Dataset

Validation

External Testing Set

Qiu, Zhixin, et al. "A pharmacogenomic landscape in human liver cancers." Cancer Cell 36.2 (2019): 179-193.
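The combination step in the workflow above can be sketched as follows: each training row pairs one cell line's expression vector with one drug's descriptor vector, and the target is the measured response. The function name, the dictionary layout, and every number below are hypothetical toy values, not taken from CCLE, GDSC, or the external dataset.

```python
def build_training_rows(expression, drug_features, responses):
    """Join per-cell-line and per-drug features into model-ready rows.

    expression:    {cell_line: [gene expression values]}
    drug_features: {drug: [molecular descriptors]}
    responses:     [(cell_line, drug, ic50), ...]
    """
    X, y = [], []
    for cell_line, drug, ic50 in responses:
        # Skip pairs for which either feature source is missing.
        if cell_line not in expression or drug not in drug_features:
            continue
        X.append(expression[cell_line] + drug_features[drug])
        y.append(ic50)
    return X, y


# Toy inputs (illustrative values only).
expression = {"HepG2": [5.1, 0.3], "Huh7": [4.2, 1.7]}
drug_features = {"sorafenib": [32.0, 4.0]}
responses = [
    ("HepG2", "sorafenib", 1.9),
    ("Huh7", "sorafenib", 2.4),
    ("SNU-449", "sorafenib", 3.0),  # no expression data in this toy example, so dropped
]

X, y = build_training_rows(expression, drug_features, responses)
```

Concatenating the two vectors is the simplest way to let one model see both the cell line and the drug; whether the authors did exactly this is not stated in the slides.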

9 of 12

Modeling

Linear Regression

  • Baseline model to test whether linear patterns exist in the data
  • R² score of just 0.15 between predicted and actual values during validation

XGBoost

  • Large improvement over linear regression
  • R² score of 0.79 between predicted and actual values during validation

Random Forest Regressor

  • Slightly worse than XGBoost on validation data, but slightly better on the external test data
  • R² score of 0.78 between predicted and actual values during validation
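For reference, the two metrics reported in these slides can be computed as in the minimal sketch below. It uses the standard definitions of R² and mean squared error; the function names and the example numbers are illustrative and are not the authors' code or data.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]

mse = mean_squared_error(y_true, y_pred)  # 0.25
r2 = r2_score(y_true, y_pred)             # 0.8
```

An R² of 1 means perfect prediction, while 0 means the model does no better than always predicting the mean of the actual values.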

10 of 12

External Dataset

Based on the validation results, we applied the two best models to the external test dataset. Performance dropped relative to the validation set, but both models retained moderate predictive power:

XGBoost

  • MSE: 0.66
  • R²: 0.44

Random Forest Regressor

  • MSE: 0.64
  • R²: 0.46
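As a quick consistency check on the reported numbers, MSE and R² on the same test set are linked through the variance of the actual target values. The worked arithmetic below assumes both metrics were computed on the same external samples and that the variance is the population variance of those targets; rounding in the reported values limits the precision.

```latex
\begin{align*}
R^2 &= 1 - \frac{\mathrm{MSE}}{\sigma_y^2}
\quad\Longrightarrow\quad
\sigma_y^2 = \frac{\mathrm{MSE}}{1 - R^2} \\[4pt]
\text{XGBoost:}\quad \sigma_y^2 &\approx \frac{0.66}{1 - 0.44} = \frac{0.66}{0.56} \approx 1.18 \\[4pt]
\text{Random Forest:}\quad \sigma_y^2 &\approx \frac{0.64}{1 - 0.46} = \frac{0.64}{0.54} \approx 1.19
\end{align*}
```

The two implied variances agree to within rounding, which is what one would expect if both models were scored against the same external targets.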

[Figure: external test set results, with one panel each for XGBoost and Random Forest]

11 of 12

Summary

Data Organization

  • Normalized data from the different sources so that they were comparable and usable for model training
  • Combined the datasets so the model could consider drug molecular information in addition to gene expression data
  • Joined the data entries using the R language
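The normalization method is not specified in these slides. One common choice, shown here purely as an assumption, is per-feature z-scoring, which puts every column on a comparable scale (mean 0, standard deviation 1). The authors report doing their data preparation in R; this Python sketch with the hypothetical helper `zscore_columns` is only illustrative.

```python
import statistics


def zscore_columns(rows):
    """Standardize each column of a list-of-rows table to mean 0, std 1.

    Constant columns (zero standard deviation) are mapped to 0.0.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Illustrative values only: two samples, two features.
scaled = zscore_columns([[1.0, 10.0], [3.0, 10.0]])
```

In practice the means and standard deviations would be estimated on the training data and reused on validation and external data, so that no information leaks from the held-out sets.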

Modeling

  • Trained several models and evaluated each on validation data held out through an 80-20 split
  • The most effective models were XGBoost and Random Forest Regressor
  • Both models were then evaluated on previously unseen external test data, where they retained moderate predictive power (R² of 0.44 to 0.46)
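An 80-20 split like the one described above can be sketched with the standard library alone. The shuffling strategy, the seed, and the function name `split_80_20` are assumptions; the slides do not state how the split was drawn.

```python
import random


def split_80_20(X, y, seed=0):
    """Shuffle sample indices and hold out the last 20% for validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(0.8 * len(indices))
    train_idx, val_idx = indices[:cut], indices[cut:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return X_train, y_train, X_val, y_val


# Illustrative data: 10 samples with one feature each.
X = [[float(i)] for i in range(10)]
y = [float(i) * 2 for i in range(10)]
X_train, y_train, X_val, y_val = split_80_20(X, y)
```

With 10 samples this yields 8 training rows and 2 validation rows, and each sample's features stay aligned with its target.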

12 of 12

Future Direction

  • Apply the model to newly established in-house HCC cell lines to predict candidate drugs.
  • Validate the candidate drugs with in vitro drug sensitivity assays.