Real World Machine Learning

TJ Machine Learning Club

Introduction

  • So far in TJML:
    • Decision Tree
    • Random Forest
    • SVM
    • KNN
    • Naive Bayes
    • Neural Network
  • How do data scientists implement these today?

Setup

  • Visualizing data:
    • Excel
    • Matplotlib
    • Seaborn
  • Helpful data analysis libraries:
    • Pandas
    • NumPy

Conda

  • Open-source package and environment manager that is widely used by data scientists.
  • Can be installed through Anaconda, which bundles Conda with 150+ data science packages for convenience.
    • TensorFlow, Keras, scikit-learn
  • A lighter-weight option is Miniconda, which installs only Conda and its dependencies and lets you decide which packages to add.
  • Data scientists often choose Conda over pip alone because it installs packages into separate environments, which prevents conflicts between packages intended for different projects.

Jupyter Notebooks

  • An open-source web application for creating and sharing documents that combine code, output, and text.
  • It provides a single environment where you can document your code, run it, visualize data, and see the results.
  • Code is written in individual cells that can be run separately.

Data Preprocessing

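  • Handling null values
    • df.isnull()
    • Returns a boolean DataFrame with the same shape as df: True where a value is missing (NaN), False otherwise
    • df.dropna()
    • By default, drops every row that contains a missing value
  • Imputation
    • Process of replacing the missing values in our dataset with substitutes, such as the column mean or median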
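
The calls above can be sketched on a small example. This is only an illustration: the DataFrame and its column names are made up, and mean imputation with fillna() is one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 180.0],
    "weight": [50.0, 60.0, np.nan, 80.0],
})

# Boolean DataFrame: True wherever a value is missing.
mask = df.isnull()

# Number of missing values in each column.
missing_per_column = mask.sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2 (imputation): fill each column's gaps with that column's mean.
imputed = df.fillna(df.mean())

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping rows is simplest but throws data away (here, half of the rows), which is why imputation is often preferred on small datasets.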

Data Preprocessing

  • Standardization:
    • Transform values such that the mean of the values is 0 and the standard deviation is 1.
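
A minimal sketch of standardization, assuming a small made-up feature matrix. It computes z = (x - mean) / std per column by hand and then with scikit-learn's StandardScaler, which should give the same result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation using scikit-learn.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In practice the scaler is usually fit on the training data only and then applied to the test data with scaler.transform(), so that no information from the test set leaks into training.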

Scikit-Learn

  • Python library that provides many unsupervised and supervised learning algorithms
    • Regression, including Linear Regression
    • Classification, including Logistic Regression and K-Nearest Neighbors
    • Clustering, including K-Means (with k-means++ initialization)
    • Model selection
    • Preprocessing, including Min-Max Normalization
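
Most scikit-learn models follow the same fit / predict pattern. The sketch below assumes the built-in iris dataset and a K-Nearest Neighbors classifier; the split size, random_state, and n_neighbors values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in example dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the model on the training split only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict labels for unseen data and measure accuracy.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping in another algorithm, such as a decision tree or an SVM, generally only requires changing the import and the constructor line.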

Keras

  • One of the most widely used deep learning frameworks
  • Runs on top of a TensorFlow backend
  • Used to build deep learning models:
    • Neural networks
    • CNNs

Neural Networks with Keras
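
The code from the original slide is not available, so the following is only a sketch of what a small Keras network can look like. It assumes TensorFlow is installed and uses random made-up data; the layer sizes, optimizer, and epoch count are arbitrary.

```python
import numpy as np
from tensorflow import keras

# Hypothetical random data: 100 samples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)).astype("float32")
y = rng.integers(0, 3, size=100)

# A small fully connected network: one hidden layer, softmax output.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Integer class labels pair with sparse categorical cross-entropy.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train briefly, then get one row of class probabilities per sample.
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
probs = model.predict(X, verbose=0)
print(probs.shape)
```

Because the data here is random, the accuracy is meaningless; the point is the define, compile, fit, predict sequence.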

Hyperparameter Tuning

  • Tuning hyperparameters can largely be guess-and-check
  • Sklearn has built-in ‘guess-and-check’ processes to help you find the best parameters
    • GridSearchCV()
    • RandomizedSearchCV()
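
A minimal sketch of both search strategies, assuming the iris dataset and a K-Nearest Neighbors classifier. Note that the random-search class in scikit-learn is named RandomizedSearchCV. The candidate values in param_grid are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values; these particular choices are illustrative.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# Exhaustive search: tries all 10 combinations with 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

# Random search: evaluates only n_iter combinations sampled from the same grid.
rand = RandomizedSearchCV(
    KNeighborsClassifier(), param_grid, n_iter=4, cv=5, random_state=0
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

Grid search is thorough but its cost grows with every added parameter; random search trades completeness for a fixed budget of n_iter trials.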

Real-World ML

After splitting your data into training and testing data with sklearn, a confusion matrix is a simple way to gauge a model, once you know how to read one. A confusion matrix for a binary classification problem looks like this:

Real-World ML

  • True positives and true negatives are cases where our model is correct
  • A false positive (top-right) is when the predicted value is yes but the actual value is no
  • A false negative (bottom-left) is the opposite: the predicted value is no but the actual value is yes
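
As a sketch, scikit-learn's confusion_matrix() produces the same layout described here: rows are actual classes, columns are predicted classes. The labels below are made up, with 1 meaning yes and 0 meaning no.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = "yes", 0 = "no".
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 1]
#  [1 4]]

# Flatten the 2x2 matrix into the four named counts.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 1 1 4
```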

Precision and Recall

  • Precision: What proportion of positive identifications was actually correct?

  • Recall: What proportion of actual positives was identified correctly?
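
In terms of confusion-matrix counts, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes both by hand on made-up labels and checks them against scikit-learn's precision_score() and recall_score().

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]

# Count the three outcomes that precision and recall depend on.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 4
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 2
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1

# Precision: of everything predicted positive, how much was right?
precision = tp / (tp + fp)  # 4 / 6

# Recall: of everything actually positive, how much was found?
recall = tp / (tp + fn)     # 4 / 5

# scikit-learn gives the same values.
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
print(precision, recall)
```

Here the model finds most of the positives (recall 0.8) but a third of its positive predictions are wrong (precision about 0.67), which a single accuracy number would not reveal.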

Confusion Matrix

Results

After using np.reshape() and np.concatenate() to merge our y_pred array with a corresponding id column, writing to a .csv file is as simple as creating an output DataFrame and saving it:
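
The original slide's code is not available, so this is a sketch of one way the step could look. The predictions, the ids, the column names, and the filename submission.csv are all placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions and ids; in practice y_pred comes from model.predict().
y_pred = np.array([1, 0, 1, 1])
ids = np.arange(1, len(y_pred) + 1)

# Reshape both 1-D arrays into column vectors of shape (n, 1).
id_col = np.reshape(ids, (-1, 1))
pred_col = np.reshape(y_pred, (-1, 1))

# Join the two columns side by side into an (n, 2) array.
output = np.concatenate((id_col, pred_col), axis=1)

# Column names and filename are placeholders.
out_df = pd.DataFrame(output, columns=["id", "prediction"])
out_df.to_csv("submission.csv", index=False)
```

Passing index=False keeps pandas from writing its own row index as an extra column.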

Credits

Portions of this lecture have been adapted from Kevin Fu’s October 2019 “Real World Machine Learning” lecture.
