1 of 16

Breast Cancer Analysis and Prediction

PRESENTED BY:

  • Blessing Orji
  • Chidinma Ukaegbu
  • Damilola Onanuga
  • Fatimah Salami
  • Genevieve Nwagwu
  • Kingsley Makinde
  • Motunrayo Koyejo
  • Taiwo Bello

2 of 16

TABLE OF CONTENTS

  • Introduction
  • Aims & Objectives
  • Methodology
  • Result
  • Feature Selection and its significance
  • Challenges
  • Conclusion
  • Recommendations

3 of 16

Introduction

Breast cancer is the most common cancer in women worldwide, and its incidence is rising, particularly in developing countries, where the majority of cases are diagnosed at late stages.

Early detection remains the cornerstone of breast cancer control because it improves outcomes and survival.

Tests such as the Fine Needle Aspirate (FNA) biopsy are carried out on a breast mass to determine whether it is malignant (cancerous) or benign (non-cancerous).

Source: Johns Hopkins Medicine

4 of 16

Useful Statistics - Nigeria

  • Total population: 195.8 million
  • Number of new cases: 26,310+
  • Deaths: 12,000+
  • Prevalent cases (5-year): 53,000+

Source: Globocan 2018

5 of 16

More Useful Statistics

Source: Johns Hopkins Medicine

6 of 16

Steps

The steps used in this classification problem include:

  • Step One: Get and analyze the dataset
  • Step Two: Visualize the data
  • Step Three: Pre-process and train the data
  • Step Four: Test the model
  • Step Five: Evaluate the model

7 of 16

Aims & Objectives

This analysis aims to reduce the time it takes to determine whether a lump is malignant or benign by training a model on datasets containing the features recorded from lump samples of tested patients.

8 of 16

Methodology

  1. Explore dataset
  2. Visualization (swarm plot, correlation matrix, frequency distribution, etc.)
  3. Label Encoder
  4. Train_Test_Split
  5. Scale the Features
  6. Classification Algorithm
  7. Model Evaluation
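The methodology steps can be sketched end to end as follows. This is a minimal illustration, not the project's actual notebook: it assumes scikit-learn is installed and uses scikit-learn's bundled breast cancer dataset as a stand-in, since the slides do not name the data source. The 25% test split, the random seed, and the forest settings are also assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
data = load_breast_cancer()
X = data.data
labels = data.target_names[data.target]  # strings: "malignant" / "benign"

# Label Encoder: turn the text diagnosis into integers (benign=0, malignant=1).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Train_Test_Split: hold out 25% of the rows (ratio and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features: fit on the training rows only, then apply to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Classification algorithm: a random forest with illustrative settings.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)

# Model evaluation: accuracy on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.3f}")
```

Fitting the scaler on the training rows only keeps information from the test rows out of the model, which is why the split comes before scaling in this sketch.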

9 of 16

Methodology (continued)

10 of 16

Result

Of the machine learning classification algorithms compared, the Random Forest Classifier (RFC) performed best on the test data, with 98.6% accuracy.
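A comparison of this kind might be set up as in the sketch below. It assumes scikit-learn and its bundled breast cancer dataset (the slides do not name the data source), and all hyperparameters, the split ratio, and the seed are illustrative, so the scores and ranking it prints need not match the figures reported in this presentation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset, 25% test split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The seven classifier families named in the conclusion, with default-like settings.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "SVC (linear)": SVC(kernel="linear"),
    "SVC (RBF)": SVC(kernel="rbf"),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # Each classifier gets its own scaler, fitted on the training rows only.
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)  # mean accuracy on test rows

# Print the classifiers from highest to lowest test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{score:.3f}")
```

Differences of a percentage point or two on a single split can come down to which rows land in the test set, so repeating the comparison with cross-validation is a reasonable follow-up.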

11 of 16

Feature Selection and its Significance

Correlation matrix of the selected features and their accuracies

12 of 16

Challenges

  • Deciding which classifiers to use and which to leave out.
  • Understanding and applying the evaluation metrics.
  • Cleaning the data.
  • Feature selection (deciding which features to keep and which to drop).
  • Interpreting the results.
  • Visualizations repeatedly crashed or froze the notebook.
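On the evaluation metrics challenge, it can help to compute the basic metrics by hand once. The sketch below uses plain Python and ten made-up labels ("M" for malignant, "B" for benign); the data and the helper names `confusion_counts` and `summarize` are hypothetical and not taken from the project.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1  # malignant lump correctly flagged
        elif guess == positive:
            fp += 1  # benign lump wrongly flagged as malignant
        elif truth == positive:
            fn += 1  # malignant lump missed
        else:
            tn += 1  # benign lump correctly cleared
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive="M"):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for ten lumps.
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

print(summarize(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```

In a diagnostic setting, recall on the malignant class is often watched closely, because a false negative means a cancerous lump is missed.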

13 of 16

Conclusion

In this project, we applied several machine learning classification algorithms and saw how different classifiers give different accuracy results. We also observed that most feature selection techniques did not significantly change accuracy, so they are mainly useful for feature analysis and model simplification.

In conclusion, a comparison of seven machine learning classification algorithms (Logistic Regression, K Nearest Neighbors, Support Vector Classifier (Linear), Support Vector Classifier (Radial Basis Function), Gaussian Naive Bayes, Decision Tree Classifier, and Random Forest Classifier (RFC)) showed that the RFC performed best on the test data, with 98.6% accuracy.

ML Classification Algorithm

Algorithm                   Accuracy
Logistic Regression         95.80%
KNeighbours Classifier      95.10%
SVC_lin                     97.20%
SVC_rbf                     96.50%
GaussianNB                  91.60%
Decision Tree Classifier    95.80%
Random Forest Classifier    98.60%

14 of 16

Recommendations

  • We recommend that the RFC be deployed in hospitals to predict whether lumps in patients are malignant or benign.
  • In addition, further research could explore an ensemble of the classifiers for predicting benign or malignant cancer.
  • A model could also be embedded in an app that uses computer vision to interpret mammography scans and give a benign or malignant diagnosis.
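As a starting point for the ensemble recommendation, the sketch below combines three of the classifiers with majority voting. It assumes scikit-learn and its bundled breast cancer dataset; the choice of members, the hard-voting scheme, and all settings are illustrative rather than a tested design.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset, 25% test split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three members; the scale-sensitive ones are wrapped with their own scaler.
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("svc", make_pipeline(StandardScaler(), SVC(kernel="linear"))),
        ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",  # each member casts one vote; the majority label wins
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {ensemble_accuracy:.3f}")
```

Whether an ensemble improves on the best single classifier would need to be checked on the project's own data, ideally with cross-validation.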

15 of 16

THANK YOU!

16 of 16