Intro to ML

with Yash Potdar & Ishaan Gupta

Before we begin...

  • Please download the files from the following GitHub repository:
    • Intro to ML.ipynb
    • Pokedex.csv

https://github.com/YashPotdar/DS3-Intro-ML

Agenda

  • What is Machine Learning?
    • Supervised and Unsupervised
    • Clustering, Classification, and Regression
  • Common Models with DEMO on Pokemon Dataset
    • Classification (Supervised)
      • Naive Bayes
      • K-Nearest Neighbors
      • Logistic Regression
    • Regression (Supervised)
      • Linear Regression
    • Clustering (Unsupervised)
      • K-Means Clustering

What is Machine Learning?

  • Purpose: to teach computers how to learn and act without being explicitly programmed

  • Essence: to capture a mathematical function that best fits the data and use it to predict on unseen data
    • Creating dynamic models that can learn, predict, and improve

  • Types of data: numbers, text, images, time series, etc.

Supervised Learning

  • Using a training set with labeled outputs to fit a predictive model

  • Task Categories:
    • Classification: categorizing data into binary or multiple classes
    • Regression: finding a relationship between independent and dependent variables

Unsupervised Learning

  • Finding patterns in an unlabeled dataset

  • Task Categories:
    • Clustering: dynamically discovering groups within data
    • Anomaly detection: singling out unusual or extreme data points
    • Dimensionality reduction: reducing the data to include only meaningful features

Task 1: Classification

  • Categorizing data into binary or multiple classes
    • Binary: identifying spam vs not spam emails
    • Multi-class: recognizing handwritten characters

  • Utilizes supervised ML techniques

  • Common algorithms:
    • Naive Bayes, k-Nearest Neighbors, Logistic Regression, Decision Trees

Task 2: Regression

  • Finding a relationship between independent and dependent variables, often by minimizing squared error (least squares)
    • Simple regression: prediction using 1 explanatory variable, e.g., rent = f(income)
    • Multiple regression: prediction using multiple explanatory variables, e.g., rent = f(income, # bedrooms, humidity)

  • Linear and nonlinear models exist within both simple and multiple regression

  • Utilizes supervised ML techniques

  • Common algorithms:
    • Simple/multiple (non)linear regression, lasso regression, support vector machines

Task 3: Clustering

  • Dynamically discovering groups within data
    • Centroid-based: organizes data non-hierarchically around cluster centers
      • The most common and simplest type, but sensitive to outliers
    • Density-based: connects dense areas and ignores outliers
    • Distribution-based: assigns data to clusters modeled as (typically Gaussian) probability distributions
    • Hierarchical: creates trees of clusters

  • Utilizes unsupervised ML techniques

  • Common algorithms:
    • K-Means, Mean-Shift, DBSCAN, Gaussian Mixture Models, Agglomerative Hierarchical Clustering

DEMO
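
For reference, a minimal sketch of how the demo setup might look in Python with pandas. The file name Pokedex.csv comes from the workshop repo, but the column names used in the sketches on the following slides ("Type 1", "Legendary", "HP", "Attack", "Defense", "Speed") are assumptions, not confirmed by the slides.

    # Hedged setup sketch: load the workshop dataset with pandas
    import pandas as pd

    # Pokedex.csv is the file from the GitHub repository above
    pokedex = pd.read_csv("Pokedex.csv")

    # Peek at the data before modeling
    print(pokedex.shape)
    print(pokedex.head())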

Naive Bayes

  • Commonly used for both binary and multi-class classification

  • "Naive" assumption of conditional independence between the predictor variables
    • In practice, it is highly unlikely that the predictors are truly unrelated

  • Selects the class whose probability is highest given those specific features

  • Demo: Naive Bayes to classify Pokémon types
    • Assumes continuous data are distributed normally
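
A minimal sketch of such a classifier, assuming scikit-learn (the slides do not name a library) and the assumed columns from the setup sketch; GaussianNB is the variant that treats continuous features as normally distributed:

    # Hedged sketch: Gaussian Naive Bayes for multi-class type prediction
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # continuous predictors (assumed columns)
    y = pokedex["Type 1"]                      # multi-class label (assumed column)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = GaussianNB()                       # assumes normally distributed features
    model.fit(X_train, y_train)
    print("Accuracy:", model.score(X_test, y_test))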

k-Nearest Neighbors (KNN)

  • Used for both binary and multi-class classification
    • Side note: also used for regression!
      • The predicted value is the average of the k nearest neighbors' values

  • Takes the majority class of the k nearest neighbors

  • k is often chosen by trial and error
    • Low values of k are sensitive to outliers
    • High values of k may miss smaller clusters

  • Less accurate with more features: prone to overfitting (the curse of dimensionality)
  • Demo: KNN to predict legendary Pokémon (binary) and Pokémon types (multi-class)
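
A minimal sketch of the binary case, again assuming scikit-learn and the assumed columns; features are standardized first because distance-based methods are sensitive to scale:

    # Hedged sketch: k-Nearest Neighbors to predict legendary status (binary)
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # assumed columns
    y = pokedex["Legendary"]                   # assumed boolean column

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Standardize: Euclidean distance would otherwise favor large-scale features
    scaler = StandardScaler().fit(X_train)

    # k = 5 is an arbitrary starting point; tune by trial and error per the slide
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(scaler.transform(X_train), y_train)
    print("Accuracy:", knn.score(scaler.transform(X_test), y_test))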

Linear Regression

  • Used for regression tasks

  • Output is continuous and the slope is constant
    • Multiple regression assigns a weight to each attribute

  • Fits coefficients by ordinary least squares

  • Demo: Multiple Linear Regression to predict Hit Points (HP)
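
A minimal sketch of that demo's shape, assuming scikit-learn and the assumed columns ("Speed" in particular is a guess):

    # Hedged sketch: multiple linear regression to predict HP
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["Attack", "Defense", "Speed"]]   # assumed predictor columns
    y = pokedex["HP"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    reg = LinearRegression()                      # ordinary least-squares fit
    reg.fit(X_train, y_train)

    print("Weight per attribute:", reg.coef_)     # one weight per predictor
    print("R^2 on held-out data:", reg.score(X_test, y_test))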

Logistic Regression

  • Used for binary classification
    • Predicts a discrete variable (True/False)

  • Regression technique used for classification

  • Fits an S-shaped curve (sigmoid function)
    • The curve gives the probability that the discrete variable takes a given True/False value
    • If the probability of class A is over 50%, we classify the data point as A

  • Uses maximum likelihood estimation

  • Demo: Multiple Logistic Regression to predict legendary Pokémon
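
A minimal sketch, assuming scikit-learn and the assumed columns; note how the sigmoid probabilities and the 50% cutoff from above show up in the API:

    # Hedged sketch: logistic regression to predict legendary status
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["HP", "Attack", "Defense"]]   # assumed columns
    y = pokedex["Legendary"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = LogisticRegression(max_iter=1000)    # coefficients fit by maximum likelihood
    clf.fit(X_train, y_train)

    # predict_proba returns the sigmoid output; predict applies the 50% cutoff
    print(clf.predict_proba(X_test[:5]))
    print("Accuracy:", clf.score(X_test, y_test))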

k-Means Clustering

  • Used for clustering tasks (unsupervised!)
    • Not to be confused with k-Nearest Neighbors!

  • Iterative process until convergence:
    • Randomly initialize k centroids
    • Holding centroids fixed, assign each point to its closest centroid
    • Holding assignments fixed, move each centroid to the mean of its group

  • Requires numerical inputs, since it computes Euclidean distances between points

  • Demo: Clustering Attack and Defense points
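
A minimal sketch, assuming scikit-learn and the assumed columns; k must be chosen up front, and the value here is arbitrary:

    # Hedged sketch: k-means clustering on Attack and Defense
    import pandas as pd
    from sklearn.cluster import KMeans

    pokedex = pd.read_csv("Pokedex.csv")
    X = pokedex[["Attack", "Defense"]]        # numerical inputs only

    # n_clusters (k) = 4 is an arbitrary assumption; the demo's choice is unknown
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)            # iterate assign/update until convergence

    print("Centroids:\n", kmeans.cluster_centers_)
    print("First few cluster labels:", labels[:10])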

THANKS!

Do you have any questions?

This presentation would not be possible without Avinash Navlani’s awesome tutorials (DataCamp), Josh Starmer’s videos (StatQuest), Scott Robinson’s tutorials (Stack Abuse), and Lorraine Li’s tutorial on k-means clustering.

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik