Intro to ML

with Yash Potdar & Ishaan Gupta

Before we begin...

  • Please download the files from the following GitHub repository:
    • Intro to ML.ipynb
    • Pokedex.csv

https://github.com/YashPotdar/DS3-Intro-ML

Agenda

  • What is Machine Learning?
    • Supervised and Unsupervised
    • Clustering, Classification, and Regression
  • Common Models with DEMO on Pokemon Dataset
    • Classification (Supervised)
      • Naive Bayes
      • K-Nearest Neighbors
      • Logistic Regression
    • Regression (Supervised)
      • Linear Regression
    • Clustering (Unsupervised)
      • K-Means Clustering

What is Machine Learning?

  • Purpose: to teach computers how to learn and act without being explicitly programmed

  • Essence: to capture a mathematical function that best fits the data and use it to predict on unseen data
    • Creating dynamic models that can learn, predict, and improve

  • Types of data: numbers, text, images, time series, etc.

Supervised Learning

  • Using a training set with labeled outputs to fit a predictive model

  • Task Categories:
    • Classification: categorizing data into binary or multiple classes
    • Regression: finding a relationship between independent and dependent variables

Unsupervised Learning

  • Finding patterns in an unlabeled dataset

  • Task Categories:
    • Clustering: dynamically discovering groups within data
    • Anomaly detection: singling out unusual or extreme data points
    • Dimensionality reduction: reducing the data to include only meaningful features

Task 1: Classification

  • Categorizing data into binary or multiple classes
    • Binary: identifying spam vs not spam emails
    • Multi-class: recognizing handwritten characters

  • Utilizes supervised ML techniques

  • Common algorithms:
    • Naive Bayes, k-Nearest Neighbors, Logistic Regression, Decision Trees

Task 2: Regression

  • Finding a relationship between independent and dependent variables, often by minimizing squared error (least squares)
    • Simple regression: prediction using 1 explanatory variable, e.g., rent = f(income)
    • Multiple regression: prediction using multiple explanatory variables, e.g., rent = f(income, # bedrooms, humidity)

  • Linear and nonlinear models exist within both simple and multiple regression

  • Utilizes supervised ML techniques

  • Common algorithms:
    • Simple/multiple (non)linear regression, lasso regression, support vector machines

Task 3: Clustering

  • Dynamically discovering groups within data
    • Centroid-based: organizes data non-hierarchically around cluster centers
      • The most common and simplest type, but sensitive to outliers
    • Density-based: connects dense areas and ignores outliers
    • Distribution-based: assigns data to clusters modeled as (typically Gaussian) probability distributions
    • Hierarchical: creates trees of clusters

  • Utilizes unsupervised ML techniques

  • Common algorithms:
    • K-Means, Mean-Shift, DBSCAN, Gaussian Mixture Models, Agglomerative Hierarchical Clustering

DEMO
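
For reference, a minimal sketch of how the demo setup might look in Python with pandas. The file name Pokedex.csv comes from the workshop repo, but the column names used in the sketches on the following slides ("Type 1", "Legendary", "HP", "Attack", "Defense", "Speed") are assumptions, not confirmed by the slides.

    # Hedged setup sketch: load the workshop dataset with pandas
    import pandas as pd

    # Pokedex.csv is the file from the GitHub repository above
    pokedex = pd.read_csv("Pokedex.csv")

    # Peek at the data before modeling
    print(pokedex.shape)
    print(pokedex.head())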

Naive Bayes

  • Commonly used for both binary and multi-class classification

  • "Naive" assumption of conditional independence between the predictor variables
    • In practice, it is highly unlikely that the predictors are truly unrelated

  • Selects the class whose probability is highest given those specific features

  • Demo: Naive Bayes to classify Pokémon types
    • Assumes continuous data are distributed normally
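
A minimal sketch of such a classifier, assuming scikit-learn (the slides do not name a library) and the assumed columns from the setup sketch; GaussianNB is the variant that treats continuous features as normally distributed:

    # Hedged sketch: Gaussian Naive Bayes for multi-class type prediction
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # continuous predictors (assumed columns)
    y = pokedex["Type 1"]                      # multi-class label (assumed column)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = GaussianNB()                       # assumes normally distributed features
    model.fit(X_train, y_train)
    print("Accuracy:", model.score(X_test, y_test))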

k-Nearest Neighbors (KNN)

  • Used for both binary and multi-class classification
    • Side note: also used for regression!
      • The predicted value is the average of the k nearest neighbors' values

  • Takes the majority class of the k nearest neighbors

  • k is often chosen by trial and error
    • Low values of k are sensitive to outliers
    • High values of k may miss smaller clusters

  • Less accurate with more features: prone to overfitting (the curse of dimensionality)
  • Demo: KNN to predict legendary Pokémon (binary) and Pokémon types (multi-class)
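
A minimal sketch of the binary case, again assuming scikit-learn and the assumed columns; features are standardized first because distance-based methods are sensitive to scale:

    # Hedged sketch: k-Nearest Neighbors to predict legendary status (binary)
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # assumed columns
    y = pokedex["Legendary"]                   # assumed boolean column

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Standardize: Euclidean distance would otherwise favor large-scale features
    scaler = StandardScaler().fit(X_train)

    # k = 5 is an arbitrary starting point; tune by trial and error per the slide
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(scaler.transform(X_train), y_train)
    print("Accuracy:", knn.score(scaler.transform(X_test), y_test))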

Linear Regression

  • Used for regression tasks

  • Output is continuous and the slope is constant
    • Multiple regression assigns a weight to each attribute

  • Fits coefficients by ordinary least squares

  • Demo: Multiple Linear Regression to predict Hit Points (HP)
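
A minimal sketch of that demo's shape, assuming scikit-learn and the assumed columns ("Speed" in particular is a guess):

    # Hedged sketch: multiple linear regression to predict HP
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["Attack", "Defense", "Speed"]]   # assumed predictor columns
    y = pokedex["HP"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    reg = LinearRegression()                      # ordinary least-squares fit
    reg.fit(X_train, y_train)

    print("Weight per attribute:", reg.coef_)     # one weight per predictor
    print("R^2 on held-out data:", reg.score(X_test, y_test))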

Logistic Regression

  • Used for binary classification
    • Predicts a discrete variable (True/False)

  • Regression technique used for classification

  • Fits an S-shaped curve (sigmoid function)
    • The curve gives the probability that the discrete variable takes a given True/False value
    • If the probability of class A is over 50%, we classify the data point as A

  • Uses maximum likelihood estimation

  • Demo: Multiple Logistic Regression to predict legendary Pokémon
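
A minimal sketch, assuming scikit-learn and the assumed columns; note how the sigmoid probabilities and the 50% cutoff from above show up in the API:

    # Hedged sketch: logistic regression to predict legendary status
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # assumed columns
    y = pokedex["Legendary"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = LogisticRegression(max_iter=1000)    # coefficients fit by maximum likelihood
    clf.fit(X_train, y_train)

    # predict_proba returns the sigmoid output; predict applies the 50% cutoff
    print(clf.predict_proba(X_test[:5]))
    print("Accuracy:", clf.score(X_test, y_test))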

k-Means Clustering

  • Used for clustering tasks (unsupervised!)
    • Not to be confused with k-Nearest Neighbors!

  • Iterative process until convergence:
    • Randomly initialize k centroids
    • Holding centroids fixed, assign each point to its closest centroid
    • Holding assignments fixed, move each centroid to the mean of its group

  • Requires numerical inputs, since it computes Euclidean distances between points

  • Demo: Clustering Attack and Defense points
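
A minimal sketch, assuming scikit-learn and the assumed columns; k must be chosen up front, and the value here is arbitrary:

    # Hedged sketch: k-means clustering on Attack and Defense
    import pandas as pd
    from sklearn.cluster import KMeans

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["Attack", "Defense"]]        # numerical inputs only

    # n_clusters (k) = 4 is an arbitrary assumption; the demo's choice is unknown
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)            # iterate assign/update until convergence

    print("Centroids:\n", kmeans.cluster_centers_)
    print("First few cluster labels:", labels[:10])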

THANKS!

Do you have any questions?

This presentation would not be possible without Avinash Navlani’s awesome tutorials (DataCamp), Josh Starmer’s videos (StatQuest), Scott Robinson’s tutorials (Stack Abuse), and Lorraine Li’s tutorial on k-means clustering.

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik