PREDICTING THE WINNER OF The Great British Baking Show
CHRIS DONNAY & EMILY GRIFFITH
Our Data
Our Parameters
Regression or Classification?
Regression
Classification
What happened with regression?
Explained variance ≈ 0.73
Probably not enough data for regression to be effective, and few clear linear relationships between the features and the rank.
Regression cannot assign each baker a unique integer rank; no one is ever predicted to rank 1.
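To illustrate the two regression issues above, here is a minimal sketch using only the standard library. The ranks and predictions are made-up numbers, not our data, and `explained_variance` is a hypothetical helper that follows the usual definition, 1 - Var(residuals) / Var(true values).

```python
from statistics import pvariance

def explained_variance(y_true, y_pred):
    """Explained variance: 1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)

# Hypothetical final ranks for six bakers and made-up regression outputs.
true_rank = [1, 2, 3, 4, 5, 6]
predicted = [2.4, 2.1, 3.3, 3.9, 4.6, 4.8]

score = explained_variance(true_rank, predicted)

# Rounding continuous outputs gives tied ranks and may never produce a 1.
rounded = [round(p) for p in predicted]

print(f"explained variance: {score:.2f}")
print("rounded ranks:", rounded)
```

In this toy case the predictions are pulled toward the middle of the range, so rounding gives ties and no baker lands on rank 1.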
Regression or Classification?
Regression
Classification
KNN
Ran 5-fold cross-validation; 3 neighbors gave the highest average accuracy.
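A search like the one above could be sketched with scikit-learn as follows. The data is synthetic (`make_classification`), and the candidate neighbor counts are assumptions for illustration, so the best value here need not be 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the baker features and labels (not the real data).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

# Mean 5-fold accuracy for each candidate number of neighbors.
mean_accuracy = {}
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=5, scoring="accuracy")
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print("best number of neighbors:", best_k)
```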
Random Forest
Ran 5-fold cross-validation; a max depth of 3 gave the highest accuracy, recall, and precision.
Feature importances show that the technical challenges and star baker are the most important features.
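The depth search described above can be sketched with scikit-learn's `cross_validate`, which reports several metrics per fold. The data is synthetic and the candidate depths and tree count are assumptions, so this shows the procedure rather than reproducing our numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

metrics = ("accuracy", "recall", "precision")
results = {}
for depth in (1, 2, 3, 5, None):  # None lets the trees grow fully
    model = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                   random_state=0)
    cv = cross_validate(model, X, y, cv=5, scoring=metrics)
    # Average each metric over the five folds.
    results[depth] = {m: cv[f"test_{m}"].mean() for m in metrics}

best_depth = max(results, key=lambda d: results[d]["accuracy"])
print("best max_depth by accuracy:", best_depth)
```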
What features are important?
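One way to read importances off a fitted random forest is the `feature_importances_` attribute. In this sketch the column names are hypothetical and the labels are built to depend only on the first two columns, so the ranking below is true by construction and says nothing about the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical column names; the last two are pure noise by design.
feature_names = ["technical_rank", "star_baker", "noise_a", "noise_b"]
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on first two columns

model = RandomForestClassifier(n_estimators=100, max_depth=3,
                               random_state=0).fit(X, y)

# Importances sum to 1; sort from most to least important.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.2f}")
```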
Naive Bayes
Had significantly lower accuracy than either KNN or RF, so we eventually dropped it from the model.
Voting
Using KNN, RF, and NB to vote, we find the following accuracy scores:
Random Forest: 0.96
Naive Bayes: 0.75
KNN: 0.83
Voting: 0.91
Nothing outperformed the RF!
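A comparison like the one above could be set up with scikit-learn's `VotingClassifier`. The slides do not say whether voting was hard or soft or which Naive Bayes variant was used; hard voting and `GaussianNB` are assumptions here, and the data is synthetic, so the scores will differ from ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the real baker features).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

base = [
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=3,
                                  random_state=0)),
    ("nb", GaussianNB()),  # assumed Naive Bayes variant
]
models = dict(base)
# Hard voting: each model casts one vote and the majority class wins.
models["voting"] = VotingClassifier(estimators=base, voting="hard")

# Mean 5-fold accuracy for each individual model and for the ensemble.
accuracy = {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}
for name, value in accuracy.items():
    print(f"{name}: {value:.2f}")
```

When one member is much weaker than the others, majority voting can score below the best single model, which matches what we saw with the RF.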
Let’s test it on the newest season! No spoilers :)
Random Forest Predicted Top 3 by Episode
Finds 1/3
Random Forest Predicted Top 3: Probability by Episode
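Picking a top 3 from per-episode probabilities can be sketched in a few lines. The baker names, probabilities, and finalists are all invented, and counting how many predicted bakers are actual finalists is our assumed reading of a score such as "1/3".

```python
def top_three(probabilities):
    """Return the three bakers with the highest predicted probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:3]

# Hypothetical win probabilities for the bakers remaining in each episode.
by_episode = {
    1: {"Baker A": 0.20, "Baker B": 0.35, "Baker C": 0.10,
        "Baker D": 0.30, "Baker E": 0.05},
    2: {"Baker A": 0.15, "Baker B": 0.40, "Baker D": 0.25, "Baker E": 0.20},
}
actual_finalists = {"Baker B", "Baker C", "Baker E"}  # made up

hits = {}
for episode, probabilities in by_episode.items():
    picks = top_three(probabilities)
    # Count how many predicted bakers are among the actual finalists.
    hits[episode] = len(set(picks) & actual_finalists)
    print(f"episode {episode}: {picks} -> finds {hits[episode]}/3")
```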