1 of 54

Final Review - Part 1

Manana Hakobyan and Stephanie Djajadi

2 of 54

Overview of Topics

  • Pandas & SQL
  • RegEx
  • EDA & Visualization (not covered today)
  • Dimensionality Reduction & PCA
  • Random Variables & Probability Distributions
  • Risk & Loss Functions
  • Logistic Regression
  • Classifier Evaluation
  • Decision Trees & Random Forests

3 of 54

Agenda

  • Pre-midterm review
    • Pandas, SQL, RegEx
    • PCA
    • Probability
    • Risk & Loss Functions
  • Logistic Regression
  • Classifier Evaluation
  • Decision Trees & Random Forests

4 of 54

5 of 54

PCA

[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.

A. TRUE

B. FALSE

6 of 54

PCA

[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.

A. TRUE

B. FALSE

7 of 54

PCA

The most widely used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?

  1. PCA is an unsupervised method
  2. It searches for the directions along which the data have the largest variance
  3. Maximum number of principal components <= number of features
  4. All principal components are orthogonal to each other

8 of 54

PCA

The most widely used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?

  • PCA is an unsupervised method
  • It searches for the directions along which the data have the largest variance
  • Maximum number of principal components <= number of features
  • All principal components are orthogonal to each other

9 of 54

PCA

What happens when you get features in lower dimensions using PCA?

  1. The features will still have interpretability
  2. The features will lose interpretability
  3. The features must carry all information present in data
  4. The features may not carry all information present in data

10 of 54

PCA

What happens when you get features in lower dimensions using PCA?

  • The features will still have interpretability
  • The features will lose interpretability
  • The features must carry all information present in data
  • The features may not carry all information present in data

11 of 54

PCA

Imagine you are given the following scatterplot of height versus weight.

Select the angle that will capture the maximum variability along a single axis.

A. ~ 0 degree

B. ~ 45 degree

C. ~ 60 degree

D. ~ 90 degree

12 of 54

PCA

Imagine you are given the following scatterplot of height versus weight.

Select the angle that will capture the maximum variability along a single axis.

A. ~ 0 degree

B. ~ 45 degree

C. ~ 60 degree

D. ~ 90 degree

13 of 54

PCA

Which of the following can be the first 2 principal components after applying PCA?

  1. (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
  2. (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
  3. (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
  4. (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)

14 of 54

PCA

Which of the following can be the first 2 principal components after applying PCA?

  • (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
  • (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
  • (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
  • (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
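
Since the first two principal components must be orthogonal unit vectors, each candidate pair above can be checked with a dot product. A minimal NumPy sketch of that check (not part of the original slides):

```python
import numpy as np

# First PC paired with each candidate second PC from the options above.
v1 = np.array([0.5, 0.5, 0.5, 0.5])
candidates = [
    np.array([0.71, 0.71, 0.0, 0.0]),
    np.array([0.0, 0.0, -0.71, -0.71]),
    np.array([0.5, 0.5, -0.5, -0.5]),
    np.array([-0.5, -0.5, 0.5, 0.5]),
]

for i, v2 in enumerate(candidates, start=1):
    # Valid principal components are (approximately) unit length and orthogonal.
    print(i, "dot =", round(float(v1 @ v2), 3),
          "norms =", round(float(np.linalg.norm(v1)), 3), round(float(np.linalg.norm(v2)), 3))
```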

15 of 54

PCA

Suppose X is a (100 x 5) matrix with rank 3.

What are the dimensions of U, Σ, and V?

Remember X = UΣVᵀ (or XV = UΣ)

  • U: (100 x 3)
  • Σ: (3 x 3)
  • V: (5 x 3)
    • Vᵀ: (3 x 5)
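
A quick NumPy check of these shapes (a sketch, not course code; the rank-3 matrix is built from random factors, and the compact factors come from keeping only the top 3 singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100 x 5 matrix of rank 3 as a product of (100 x 3) and (3 x 5) factors.
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
# Keep the top r singular values/vectors to get the compact (rank-3) SVD.
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
print(U_r.shape, S_r.shape, Vt_r.shape)   # (100, 3) (3, 3) (3, 5)
print(np.allclose(X, U_r @ S_r @ Vt_r))   # True, up to floating point error
```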

16 of 54

Probabilities, RVs (Fall 2017 Final)

17 of 54

Probabilities, RVs (Fall 2017 Final)

18 of 54

Probabilities, RVs (Fall 2017 Final)

19 of 54

Probabilities, RVs (Fall 2017 Final)

20 of 54

Loss Functions

21 of 54

Loss Functions

22 of 54

Loss Functions

Remember: to minimize the loss function, take the derivative (gradient) and set it equal to 0!

23 of 54

24 of 54

25 of 54

Pandas & SQL (Spring 2019 Final Q7c)

26 of 54

Pandas & SQL (Spring 2019 Final Q7c) - Solution

27 of 54

Break!

Fill out Attendance:

http://bit.ly/at-d100

28 of 54

Logistic Regression

29 of 54

Regression vs Classification

  • Regression: the problem of creating a model that takes in a point and outputs a number. We've seen regression in the form of Least Squares, Ridge Regression, and LASSO Regression.
  • Classification: the problem of creating a model that takes in a point and outputs a discrete label.

Examples:

  • Regression: predict a student's final exam grade, given their midterm and homework grades
  • Classification: predict whether a patient has a disease

30 of 54

Sigmoid Function

  • A function that maps ℝ → (0, 1)
  • Definition: σ(x) = 1 / (1 + e^(-x)) = e^x / (1 + e^x)
  • Symmetry: σ(-x) = 1 - σ(x)
  • Derivative: dσ(x)/dx = σ(x) · (1 - σ(x))
  • If we are computing σ(Xθ):
    • As |θ| increases, the steepness of the curve increases
    • Negative values reflect the curve over the y-axis
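
A small NumPy sketch that checks these properties numerically (an illustration, not course-provided code):

```python
import numpy as np

def sigmoid(x):
    """Maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# Symmetry: sigma(-x) = 1 - sigma(x)
print(np.allclose(sigmoid(-x), 1 - sigmoid(x)))             # True
# Derivative: sigma'(x) = sigma(x) * (1 - sigma(x)), checked by finite differences
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x))))  # True
```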

31 of 54

Cross Entropy Loss

  • In order to find the θ coefficients, we minimize the cross-entropy loss.
  • f(x): logistic model, predicted probabilities → often this is σ(Xθ)
  • There is no analytical solution for the optimal θ, so we need other methods (e.g., gradient descent)
  • How do you find the risk?
    • Calculate the cross-entropy loss for each point, then take the average of those losses to get the risk
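
For reference, with predicted probabilities p = σ(Xθ) and labels y, the average cross-entropy loss is -(1/n) Σ [y log(p) + (1 - y) log(1 - p)]. A minimal gradient-descent sketch under that formula (the toy data, learning rate, and iteration count are illustrative assumptions, not course code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_risk(theta, X, y):
    """Average cross-entropy loss over all points (the empirical risk)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: 2 features, binary labels drawn from a known logistic model.
rng = np.random.default_rng(42)
true_theta = np.array([1.0, -0.5])
X = rng.normal(size=(200, 2))
y = rng.binomial(1, sigmoid(X @ true_theta)).astype(float)

# No closed-form solution, so take gradient descent steps on the average loss.
theta = np.zeros(2)
lr = 0.5
for _ in range(500):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the average loss
    theta -= lr * grad

print(theta, cross_entropy_risk(theta, X, y))
```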

32 of 54

Practice - Logistic Regression (T/F)

  • True or False: A training dataset consists of 98 dog pictures and 2 cat pictures. It should always be possible to train a classifier to achieve 98% training accuracy on this dataset.
  • True or False: A classifier that always predicts 0 has a test accuracy of 50% on all binary prediction tasks.
  • True or False: While training a logistic regression model with gradient descent, we may converge to a local minimum and fail to find the global minimum of our loss function.

33 of 54

Practice - Spring 2019 Midterm 2 Q2a

Calculate the empirical risk and estimate α̂.

34 of 54

Classifier Evaluation

35 of 54

Classifiers

  • A function that outputs a prediction for y (for example, 0 or 1)
  • Decision rules are used to determine the class cut-offs.
  • Simplest decision rule: predict 1 if the predicted probability is at least 0.5, and 0 otherwise.
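
For example, thresholding predicted probabilities at 0.5 (a tiny sketch; probs is a hypothetical array of sigmoid outputs):

```python
import numpy as np

probs = np.array([0.1, 0.45, 0.5, 0.92])   # hypothetical sigmoid outputs
preds = (probs >= 0.5).astype(int)          # simplest rule: predict 1 when p >= 0.5
print(preds)                                # [0 0 1 1]
```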

36 of 54

Evaluation

  • Accuracy: (TP + TN) / n
    • Most general metric, measures proportion of correct predictions made
  • Error rate: (FP + FN) / n
  • Precision: TP / (TP + FP)
    • percentage of selected results that are relevant
  • Recall: TP / (TP + FN)
    • percentage of relevant results that are selected
  • Note: TPR = Recall

                     Truth = 1               Truth = 0
  Prediction = 1     True positive (TP)      False positive (FP)
  Prediction = 0     False negative (FN)     True negative (TN)
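
A small helper that computes these metrics from 0/1 predictions (a sketch, not course-provided code; the example numbers match the goat-classifier practice problem later in the deck):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute accuracy, error rate, precision, and recall for 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "error rate": (fp + fn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Example: the goat classifier (TP = 9, FP = 3, FN = 3, TN = 8).
y_true = np.array([1] * 12 + [0] * 11)
y_pred = np.array([1] * 9 + [0] * 3 + [1] * 3 + [0] * 8)
print(evaluate(y_true, y_pred))   # precision = 9/12 = 0.75, recall = 9/12 = 0.75
```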

37 of 54

ROC Curves

  • Recall the definitions:
    • FPR = FP / (FP + TN)
    • TPR = TP / (TP + FN)
  • Goal: maximize TPR and minimize FPR by changing the decision cutoff
    • The tradeoff is shown by the ROC curve

[Figure: ROC curve plotting true positive rate against false positive rate; one endpoint of the curve corresponds to always predicting 0, the other to always predicting 1.]
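
A sketch of how this tradeoff can be traced by sweeping the decision cutoff over predicted probabilities (y_true and probs are made-up values for illustration):

```python
import numpy as np

def roc_points(y_true, probs, thresholds):
    """Return (FPR, TPR) pairs, one per decision cutoff."""
    points = []
    for t in thresholds:
        y_pred = (probs >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs  = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
# Cutoff 0 -> always predict 1 (FPR = TPR = 1); cutoff above 1 -> always predict 0 (FPR = TPR = 0).
print(roc_points(y_true, probs, thresholds=[0.0, 0.5, 1.1]))
```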

38 of 54

Practice - Classification

You have 2 classifiers A and B, and you are trying to pick one to use for filtering spam. You train both on a dataset of 100 spam and 100 ham emails.

Classifier A has 0% accuracy on the dataset, and Classifier B has 50% accuracy. Which classifier would you rather use?

A) Classifier A

B) Classifier B

39 of 54

Practice - Precision & Recall

A classifier has a high number of false negatives, and we want to reduce this number. Which metric should we study to address this?

A) Accuracy

B) Precision

C) Recall

40 of 54

Practice - Precision & Recall

Suppose you create a classifier to predict whether an image contains a picture of a goat. You test it on 23 images.

  • There were 12 true images of goats. Your classifier predicted 9 of them to be goats, and 3 not to be goats.
  • There were 11 images that did not contain goats. Your classifier predicted 3 of them to be goats, and 8 not to be goats.

Determine the precision and recall of your goat classifier.

41 of 54

Solutions - Precision & Recall

Suppose you create a classifier to predict whether an image contains a picture of a goat. You test it on 23 images.

  • There were 12 true images of goats. Your classifier predicted 9 of them to be goats, and 3 not to be goats.
  • There were 11 images that did not contain goats. Your classifier predicted 3 of them to be goats, and 8 not to be goats.

Precision: TP / (TP + FP) = 9 / (9 + 3) = 3/4

Recall: TP / (TP + FN) = 9 / (9 + 3) = 3/4

42 of 54

43 of 54

Decision Trees

  • Nonlinear algorithm
  • Used for both classification AND regression
  • Works with numeric and categorical data (no extra work needed)
  • Splits the training data into nodes
  • A node is pure if all the points in the node have the same label
  • Entropy: "a measure of disorder, or messiness, in the node"
  • Good nodes = lower entropy

44 of 54

Entropy; Loss of a Split; Information Gain

From the lecture, the entropy of a node is S(Node) = -Σ p_C log2(p_C), summed over the labels C, where p_C is the proportion of points in the node with label C.

The loss of a split (split entropy) is the weighted average of the child entropies: (N1 · S(N1) + N2 · S(N2)) / (N1 + N2), where N1 and N2 are the number of points in the two child nodes.

Information Gain: S(Node) - entropy of the split

45 of 54

Practice with Entropy

First node:

Second node:

Loss:

Information Gain:

[Split diagram: a parent node with 40 D, 60 B splits into two children, one with 20 D, 10 B and one with 20 D, 50 B.]

46 of 54

Practice with Entropy

First node: S(N1)= -(β…”)log(β…”) - (β…“)log(β…“) = 0.918

Second node: S(N2) = -(2/7)log(2/7) - (5/7)log(5/7)=0.863

Loss: (30 * 0.918 + 70 * 0.863)/100 = 0.8795

Information Gain: S(Node) - Loss Split = 0.97 - 0.8795 = 0.0905

[Split diagram: a parent node with 40 D, 60 B splits into two children, one with 20 D, 10 B and one with 20 D, 50 B.]

S(Node) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.97
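
A short NumPy check of this arithmetic (a sketch; entropy uses log base 2, matching the numbers above):

```python
import numpy as np

def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    p = np.array(counts) / np.sum(counts)
    return -np.sum(p * np.log2(p))

parent, left, right = [40, 60], [20, 10], [20, 50]
s_parent = entropy(parent)                        # ~0.971
s_left, s_right = entropy(left), entropy(right)   # ~0.918, ~0.863
# Weighted (split) entropy and information gain.
loss = (sum(left) * s_left + sum(right) * s_right) / sum(parent)
print(round(s_parent, 3), round(loss, 4), round(s_parent - loss, 4))
```

With the unrounded parent entropy (≈0.971), the information gain comes out to ≈0.091; the 0.0905 above results from rounding S(Node) to 0.97.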

47 of 54

Problems with Decision Trees

  • A fully grown tree can reach 100% TRAINING accuracy
  • This is overfitting

  • Fix it by setting a maximum tree depth, pruning the tree, or bagging (see the sketch below)
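
As an illustration of the max-depth fix, a scikit-learn sketch (assuming scikit-learn is available; the dataset and depth value are arbitrary choices, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: typically hits 100% training accuracy (overfits).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting depth trades some training accuracy for better generalization.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("full tree", full), ("max_depth=3", shallow)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```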

48 of 54

Random Forests

  • Bagging (Bootstrap Aggregating): resample the training set into T bootstrap samples
  • Use a different random subset of features to train each of the T trees
  • Individual trees overfit in different ways, so the overall variance is lower
  • To predict, ask the T decision trees for their predictions and take the majority vote (an ensemble method); see the sketch below
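
A minimal scikit-learn sketch of this idea (the dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of trees T, each fit on a bootstrap sample (bagging);
# max_features="sqrt" means each split considers a random subset of the features.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)

# Prediction is a majority vote across the T trees.
print(rf.score(X_tr, y_tr), rf.score(X_te, y_te))
```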

49 of 54

50 of 54

Random Forests

  • Works for both regression and classification
  • No extra work needed for feature selection
  • Gives nonlinear boundaries without feature engineering
  • Does a better job than a single decision tree at reducing overfitting

51 of 54

Decision Trees and Random Forest

In a random forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is/are true about an individual tree (Tk) in a random forest?

  1. Individual tree is built on a subset of the features
  2. Individual tree is built on all the features
  3. Individual tree is built on a subset of observations
  4. Individual tree is built on full set of observations

52 of 54

Decision Trees and Random Forest

In a random forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is/are true about an individual tree (Tk) in a random forest?

  • Individual tree is built on a subset of the features
  • Individual tree is built on all the features
  • Individual tree is built on a subset of observations
  • Individual tree is built on full set of observations

53 of 54

Decision Trees and Random Forest

How do you select the best hyperparameters in tree-based models?

A) Measure performance over training data

B) Measure performance over validation data

C) Both of these

D) None of these

54 of 54

Decision Trees and Random Forest

How do you select the best hyperparameters in tree-based models?

A) Measure performance over training data

B) Measure performance over validation data

C) Both of these

D) None of these