1 of 54

Final Review - Part 1

Manana Hakobyan and Stephanie Djajadi

2 of 54

Overview of Topics

  • Pandas & SQL
  • RegEx
  • EDA & Visualization (not covered today)
  • Dimensionality Reduction & PCA
  • Random Variables & Probability Distributions
  • Risk & Loss Functions
  • Logistic Regression
  • Classifier Evaluation
  • Decision Trees & Random Forests

3 of 54

Agenda

  • Pre-midterm review
    • Pandas, SQL, RegEx
    • PCA
    • Probability
    • Risk & Loss Functions
  • Logistic Regression
  • Classifier Evaluation
  • Decision Trees & Random Forests

4 of 54

5 of 54

PCA

[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.

A. TRUE

B. FALSE

6 of 54

PCA

[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.

A. TRUE

B. FALSE

7 of 54

PCA

The most widely used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?

  1. PCA is an unsupervised method
  2. It searches for the directions along which the data have the largest variance
  3. Maximum number of principal components <= number of features
  4. All principal components are orthogonal to each other

8 of 54

PCA

The most widely used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?

  • PCA is an unsupervised method
  • It searches for the directions along which the data have the largest variance
  • Maximum number of principal components <= number of features
  • All principal components are orthogonal to each other

9 of 54

PCA

What happens when you get features in lower dimensions using PCA?

  1. The features will still have interpretability
  2. The features will lose interpretability
  3. The features must carry all information present in data
  4. The features may not carry all information present in data

10 of 54

PCA

What happens when you get features in lower dimensions using PCA?

  • The features will still have interpretability
  • The features will lose interpretability
  • The features must carry all information present in data
  • The features may not carry all information present in data

11 of 54

PCA

Imagine you are given the following scatterplot of height versus weight.

Select the angle that will capture the maximum variability along a single axis.

A. ~ 0 degree

B. ~ 45 degree

C. ~ 60 degree

D. ~ 90 degree

12 of 54

PCA

Imagine you are given the following scatterplot of height versus weight.

Select the angle that will capture the maximum variability along a single axis.

A. ~ 0 degree

B. ~ 45 degree

C. ~ 60 degree

D. ~ 90 degree

13 of 54

PCA

Which of the following can be the first 2 principal components after applying PCA?

  1. (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
  2. (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
  3. (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
  4. (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)

14 of 54

PCA

Which of the following can be the first 2 principal components after applying PCA?

  • (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
  • (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
  • (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
  • (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
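
Since the first two principal components must be orthogonal unit vectors, each candidate pair above can be checked with a dot product. A minimal NumPy sketch of that check (not part of the original slides):

```python
import numpy as np

# First PC paired with each candidate second PC from the options above.
v1 = np.array([0.5, 0.5, 0.5, 0.5])
candidates = [
    np.array([0.71, 0.71, 0.0, 0.0]),
    np.array([0.0, 0.0, -0.71, -0.71]),
    np.array([0.5, 0.5, -0.5, -0.5]),
    np.array([-0.5, -0.5, 0.5, 0.5]),
]

for i, v2 in enumerate(candidates, start=1):
    # Valid principal components are (approximately) unit length and orthogonal.
    print(i, "dot =", round(float(v1 @ v2), 3),
          "norms =", round(float(np.linalg.norm(v1)), 3), round(float(np.linalg.norm(v2)), 3))
```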

15 of 54

PCA

Suppose X is a (100 x 5) matrix with rank 3.

What are the dimensions of U, Σ, and V?

Remember X = UΣVᵀ (or XV = UΣ)

  • U: (100 x 3)
  • Σ: (3 x 3)
  • V: (5 x 3)
    • Vᵀ: (3 x 5)
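
A quick NumPy check of these shapes (a sketch, not course code; the rank-3 matrix is built from random factors, and the compact factors come from keeping only the top 3 singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100 x 5 matrix of rank 3 as a product of (100 x 3) and (3 x 5) factors.
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
# Keep the top r singular values/vectors to get the compact (rank-3) SVD.
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
print(U_r.shape, S_r.shape, Vt_r.shape)   # (100, 3) (3, 3) (3, 5)
print(np.allclose(X, U_r @ S_r @ Vt_r))   # True, up to floating point error
```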

16 of 54

Probabilities, RVs (Fall 2017 Final)

17 of 54

Probabilities, RVs (Fall 2017 Final)

18 of 54

Probabilities, RVs (Fall 2017 Final)

19 of 54

Probabilities, RVs (Fall 2017 Final)

20 of 54

Loss Functions

21 of 54

Loss Functions

22 of 54

Loss Functions

Remember: to minimize the loss function, take the derivative (gradient) and set it equal to 0!

23 of 54

24 of 54

25 of 54

Pandas & SQL (Spring 2019 Final Q7c)

26 of 54

Pandas & SQL (Spring 2019 Final Q7c) - Solution

27 of 54

Break!

Fill out Attendance:

http://bit.ly/at-d100

28 of 54

Logistic Regression

29 of 54

Regression vs Classification

  • Regression: the problem of creating a model that takes in a point and outputs a number. We've seen regression in the form of Least Squares, Ridge Regression, and LASSO Regression.
  • Classification: the problem of creating a model that takes in a point and outputs a discrete label.

Examples:

  • Regression: predict a student's final exam grade, given their midterm and homework grades
  • Classification: predict whether a patient has a disease

30 of 54

Sigmoid Function

  • A function that maps ℝ → (0, 1)
  • Definition: σ(x) = 1 / (1 + e^(-x)) = e^x / (1 + e^x)
  • Symmetry: σ(-x) = 1 - σ(x)
  • Derivative: dσ(x)/dx = σ(x) · (1 - σ(x))
  • If we are computing σ(Xθ):
    • As |θ| increases, the steepness of the curve increases
    • Negative values reflect the curve over the y-axis
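
A small NumPy sketch that checks these properties numerically (an illustration, not course-provided code):

```python
import numpy as np

def sigmoid(x):
    """Maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# Symmetry: sigma(-x) = 1 - sigma(x)
print(np.allclose(sigmoid(-x), 1 - sigmoid(x)))             # True
# Derivative: sigma'(x) = sigma(x) * (1 - sigma(x)), checked by finite differences
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x))))  # True
```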

31 of 54

Cross Entropy Loss

  • In order to find the θ coefficients, we minimize the cross-entropy loss.
  • f(x): logistic model, predicted probabilities → often this is σ(Xθ)
  • There is no analytical solution for the optimal θ, so we need other methods (e.g., gradient descent)
  • How do you find the risk?
    • Calculate the cross-entropy loss for each point, then take the average of those losses to get the risk
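
For reference, with predicted probabilities p = σ(Xθ) and labels y, the average cross-entropy loss is -(1/n) Σ [y log(p) + (1 - y) log(1 - p)]. A minimal gradient-descent sketch under that formula (the toy data, learning rate, and iteration count are illustrative assumptions, not course code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_risk(theta, X, y):
    """Average cross-entropy loss over all points (the empirical risk)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: 2 features, binary labels drawn from a known logistic model.
rng = np.random.default_rng(42)
true_theta = np.array([1.0, -0.5])
X = rng.normal(size=(200, 2))
y = rng.binomial(1, sigmoid(X @ true_theta)).astype(float)

# No closed-form solution, so take gradient descent steps on the average loss.
theta = np.zeros(2)
lr = 0.5
for _ in range(500):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the average loss
    theta -= lr * grad

print(theta, cross_entropy_risk(theta, X, y))
```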

32 of 54

Practice - Logistic Regression (T/F)

  • True or False: A training dataset consists of 98 dog pictures and 2 cat pictures. It should always be possible to train a classifier to achieve 98% training accuracy on this dataset.
  • True or False: A classifier that always predicts 0 has a test accuracy of 50% on all binary prediction tasks.
  • True or False: While training a logistic regression model with gradient descent, we may converge to a local minimum and fail to find the global minimum of our loss function.

33 of 54

Practice - Spring 2019 Midterm 2 Q2a

Calculate the empirical risk and estimate α̂.

34 of 54

Classifier Evaluation

35 of 54

Classifiers

  • A function that outputs a prediction for y (for example, 0 or 1)
  • Decision rules are used to determine the class cut-offs.
  • Simplest decision rule: predict 1 if the predicted probability is at least 0.5, and 0 otherwise.
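
For example, thresholding predicted probabilities at 0.5 (a tiny sketch; probs is a hypothetical array of sigmoid outputs):

```python
import numpy as np

probs = np.array([0.1, 0.45, 0.5, 0.92])   # hypothetical sigmoid outputs
preds = (probs >= 0.5).astype(int)          # simplest rule: predict 1 when p >= 0.5
print(preds)                                # [0 0 1 1]
```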

36 of 54

Evaluation

  • Accuracy: (TP + TN) / n
    • Most general metric, measures proportion of correct predictions made
  • Error rate: (FP + FN) / n
  • Precision: TP / (TP + FP)
    • percentage of selected results that are relevant
  • Recall: TP / (TP + FN)
    • percentage of relevant results that are selected
  • Note: TPR = Recall

                     Truth = 1               Truth = 0
  Prediction = 1     True positive (TP)      False positive (FP)
  Prediction = 0     False negative (FN)     True negative (TN)
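
A small helper that computes these metrics from 0/1 predictions (a sketch, not course-provided code; the example numbers match the goat-classifier practice problem later in the deck):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute accuracy, error rate, precision, and recall for 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "error rate": (fp + fn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Example: the goat classifier (TP = 9, FP = 3, FN = 3, TN = 8).
y_true = np.array([1] * 12 + [0] * 11)
y_pred = np.array([1] * 9 + [0] * 3 + [1] * 3 + [0] * 8)
print(evaluate(y_true, y_pred))   # precision = 9/12 = 0.75, recall = 9/12 = 0.75
```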

37 of 54

ROC Curves

  • Recall the definitions:
    • FPR = FP / (FP + TN)
    • TPR = TP / (TP + FN)
  • Goal: maximize TPR and minimize FPR by changing the decision cutoff
    • The tradeoff is shown by the ROC curve

[Figure: ROC curve plotting true positive rate against false positive rate; one endpoint of the curve corresponds to always predicting 0, the other to always predicting 1.]
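
A sketch of how this tradeoff can be traced by sweeping the decision cutoff over predicted probabilities (y_true and probs are made-up values for illustration):

```python
import numpy as np

def roc_points(y_true, probs, thresholds):
    """Return (FPR, TPR) pairs, one per decision cutoff."""
    points = []
    for t in thresholds:
        y_pred = (probs >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs  = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
# Cutoff 0 -> always predict 1 (FPR = TPR = 1); cutoff above 1 -> always predict 0 (FPR = TPR = 0).
print(roc_points(y_true, probs, thresholds=[0.0, 0.5, 1.1]))
```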

38 of 54

Practice - Classification

You have 2 classifiers A and B, and you are trying to pick one to use for filtering spam. You train both on a dataset of 100 spam and 100 ham emails.

Classifier A has 0% accuracy on the dataset, and Classifier B has 50% accuracy. Which classifier would you rather use?

A) Classifier A

B) Classifier B

39 of 54

Practice - Precision & Recall

A classifier has a high number of false negatives, and we want to reduce this number. Which metric should we study to address this?

A) Accuracy

B) Precision

C) Recall

40 of 54

Practice - Precision & Recall

Suppose you create a classifier to predict whether an image contains a picture of a goat. You test it on 23 images.

  • There were 12 true images of goats. Your classifier predicted 9 of them to be goats, and 3 not to be goats.
  • There were 11 images that did not contain goats. Your classifier predicted 3 of them to be goats, and 8 not to be goats.

Determine the precision and recall of your goat classifier.

41 of 54

Solutions - Precision & Recall

Suppose you create a classifier to predict whether an image contains a picture of a goat. You test it on 23 images.

  • There were 12 true images of goats. Your classifier predicted 9 of them to be goats, and 3 not to be goats.
  • There were 11 images that did not contain goats. Your classifier predicted 3 of them to be goats, and 8 not to be goats.

Precision: TP / (TP + FP) = 9 / (9 + 3) = 3/4

Recall: TP / (TP + FN) = 9 / (9 + 3) = 3/4

42 of 54

43 of 54

Decision Trees

  • Nonlinear algorithm
  • Used for both classification AND regression
  • Works with numeric and categorical data (no extra work needed)
  • Splits the training data into nodes
  • A node is pure if all the points in the node have the same label
  • Entropy: "a measure of disorder, or messiness, in the node"
  • Good nodes = lower entropy

44 of 54

Entropy; Loss of a Split; Information Gain

From the lecture, the entropy of a node is S(Node) = -Σ p_C log2(p_C), summed over the labels C, where p_C is the proportion of points in the node with label C.

The loss of a split (split entropy) is the weighted average of the child entropies: (N1 · S(N1) + N2 · S(N2)) / (N1 + N2), where N1 and N2 are the number of points in the two child nodes.

Information Gain: S(Node) - entropy of the split

45 of 54

Practice with Entropy

First node:

Second node:

Loss:

Information Gain:

[Split diagram: a parent node with 40 D, 60 B splits into two children, one with 20 D, 10 B and one with 20 D, 50 B.]

46 of 54

Practice with Entropy

First node: S(N1)= -(β…”)log(β…”) - (β…“)log(β…“) = 0.918

Second node: S(N2) = -(2/7)log(2/7) - (5/7)log(5/7)=0.863

Loss: (30 * 0.918 + 70 * 0.863)/100 = 0.8795

Information Gain: S(Node) - Loss Split = 0.97 - 0.8795 = 0.0905

[Split diagram: a parent node with 40 D, 60 B splits into two children, one with 20 D, 10 B and one with 20 D, 50 B.]

S(Node) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.97
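
A short NumPy check of this arithmetic (a sketch; entropy uses log base 2, matching the numbers above):

```python
import numpy as np

def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    p = np.array(counts) / np.sum(counts)
    return -np.sum(p * np.log2(p))

parent, left, right = [40, 60], [20, 10], [20, 50]
s_parent = entropy(parent)                        # ~0.971
s_left, s_right = entropy(left), entropy(right)   # ~0.918, ~0.863
# Weighted (split) entropy and information gain.
loss = (sum(left) * s_left + sum(right) * s_right) / sum(parent)
print(round(s_parent, 3), round(loss, 4), round(s_parent - loss, 4))
```

With the unrounded parent entropy (≈0.971), the information gain comes out to ≈0.091; the 0.0905 above results from rounding S(Node) to 0.97.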

47 of 54

Problems with Decision Trees

  • A fully grown tree can reach 100% TRAINING accuracy
  • This is overfitting

  • Fix it by setting a maximum tree depth, pruning the tree, or bagging (see the sketch below)
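
As an illustration of the max-depth fix, a scikit-learn sketch (assuming scikit-learn is available; the dataset and depth value are arbitrary choices, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: typically hits 100% training accuracy (overfits).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting depth trades some training accuracy for better generalization.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("full tree", full), ("max_depth=3", shallow)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```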

48 of 54

Random Forests

  • Bagging (Bootstrap Aggregating): resample the training set into T bootstrap samples
  • Use a different random subset of features to train each of the T trees
  • Individual trees overfit in different ways, so the overall variance is lower
  • To predict, ask the T decision trees for their predictions and take the majority vote (an ensemble method); see the sketch below
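
A minimal scikit-learn sketch of this idea (the dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of trees T, each fit on a bootstrap sample (bagging);
# max_features="sqrt" means each split considers a random subset of the features.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)

# Prediction is a majority vote across the T trees.
print(rf.score(X_tr, y_tr), rf.score(X_te, y_te))
```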

49 of 54

50 of 54

Random Forests

  • Works for both regression and classification
  • No extra work needed for feature selection
  • Gives nonlinear boundaries without feature engineering
  • Does a better job than a single decision tree at reducing overfitting

51 of 54

Decision Trees and Random Forest

In a random forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is/are true about an individual tree (Tk) in a random forest?

  1. Individual tree is built on a subset of the features
  2. Individual tree is built on all the features
  3. Individual tree is built on a subset of observations
  4. Individual tree is built on full set of observations

52 of 54

Decision Trees and Random Forest

In a random forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is/are true about an individual tree (Tk) in a random forest?

  • Individual tree is built on a subset of the features
  • Individual tree is built on all the features
  • Individual tree is built on a subset of observations
  • Individual tree is built on full set of observations

53 of 54

Decision Trees and Random Forest

How do you select the best hyperparameters in tree-based models?

A) Measure performance over training data

B) Measure performance over validation data

C) Both of these

D) None of these

54 of 54

Decision Trees and Random Forest

How do you select the best hyperparameters in tree-based models?

A) Measure performance over training data

B) Measure performance over validation data

C) Both of these

D) None of these