
Classification: Perceptron

Classification

  • We will learn
    • Perceptron
    • Support vector machine (SVM)
    • Logistic regression

  • To find a classification boundary


Perceptron
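  • A standard form of the perceptron's decision rule (using the sign convention introduced below):

    h(x) = sign(ω^T x + ω_0)

    i.e., the predicted class is determined by which side of the line (hyperplane) the point falls on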



Distance from a Line
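  • A standard result (assuming the line is ω^T x + ω_0 = 0): the distance from a point x to the line is

    d(x) = |ω^T x + ω_0| / ||ω||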


Sign

  • Sign with respect to a line
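  • Dropping the absolute value in the distance formula gives a signed distance: the sign of ω^T x + ω_0 tells on which side of the line x lies,

    sign(ω^T x + ω_0) = +1 on one side, −1 on the other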


Perceptron Algorithm

  • The perceptron implements a linear classifier: it predicts a class from the sign of a linear function of the features

  • Given the training set, repeat until no training point is misclassified:

  1. pick a misclassified point
  2. update the weight vector using that point (see the update rule below)
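  • A common form of the update (assuming labels y_n ∈ {−1, +1} and the bias folded into ω): for a misclassified point (x_n, y_n), i.e., sign(ω^T x_n) ≠ y_n,

    ω ← ω + y_n x_n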


Iterations of Perceptron


Diagram of Perceptron


Perceptron Loss Function
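  • A standard form of the perceptron loss (assuming y_n ∈ {−1, +1}): zero for correctly classified points, and proportional to the violation otherwise,

    ℓ(ω) = Σ_n max(0, −y_n ω^T x_n)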


Perceptron Algorithm in Python
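  A minimal NumPy sketch of the update loop described above (names are illustrative):

    import numpy as np

    def perceptron(X, y, max_iter=1000):
        """Perceptron on data X (n x d) with labels y in {-1, +1}."""
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])       # prepend a bias column
        w = np.zeros(d + 1)
        for _ in range(max_iter):
            mis = np.where(np.sign(Xb @ w) != y)[0]  # misclassified points
            if len(mis) == 0:
                break                              # all points classified: done
            i = np.random.choice(mis)              # pick a misclassified point
            w += y[i] * Xb[i]                      # perceptron update
        return w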


Scikit-Learn for Perceptron
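  A minimal usage sketch with scikit-learn's built-in Perceptron (toy data for illustration):

    import numpy as np
    from sklearn.linear_model import Perceptron

    # toy data: two separable Gaussian clusters
    X = np.vstack([np.random.randn(50, 2) + [2, 2],
                   np.random.randn(50, 2) - [2, 2]])
    y = np.hstack([np.ones(50), -np.ones(50)])

    clf = Perceptron()
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_, clf.score(X, y))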


The Best Hyperplane Separator?

  • The perceptron finds one of the many possible hyperplanes separating the data, if one exists
  • Of the many possible choices, which one is the best?

  • Utilize distance information
  • Intuitively, we want the hyperplane with the maximum margin
  • A large margin leads to good generalization on the test data
    • we will see this formally when we discuss the support vector machine (SVM)

  • Utilize distance information from all data samples
    • we will see this formally when we discuss logistic regression

  • The perceptron will later be shown to be the basic unit of neural networks and deep learning


Support Vector Machine


Classification (Linear)

  • Autonomously figure out which category (or class) an unknown item belongs to

  • Number of categories / classes
    • Binary: 2 different classes
    • Multiclass: more than 2 classes

  • Feature
    • The measurable parts that make up the unknown item (i.e., the information available for categorizing it)


Distance from a Line


Illustrative Example


Hyperplane
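  • In d dimensions the separating line generalizes to a hyperplane, the set

    { x ∈ R^d : ω^T x + ω_0 = 0 }

    with ω the normal vector; the distance formula above carries over unchanged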


Decision Making


Decision Boundary or Band


Data Generation for Classification
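  A minimal sketch of generating two labeled Gaussian clusters (cluster centers and sizes are illustrative):

    import numpy as np

    np.random.seed(0)
    n = 100
    X1 = np.random.randn(n, 2) + np.array([2, 2])    # class +1 cluster
    X0 = np.random.randn(n, 2) + np.array([-2, -2])  # class -1 cluster
    X = np.vstack([X1, X0])
    y = np.hstack([np.ones(n), -np.ones(n)])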


Optimization Formulation 1
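  • A standard feasibility formulation for the separable case (assuming y_i ∈ {−1, +1}):

    find ω, ω_0
    subject to y_i (ω^T x_i + ω_0) ≥ 1, i = 1, …, n

    Any feasible (ω, ω_0) is a separating hyperplane; the constant 1 fixes the scale of ω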


CVXPY 1
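  A minimal CVXPY sketch of this feasibility problem (assumes X, y from the data-generation step above):

    import cvxpy as cp

    w = cp.Variable(2)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    print(w.value, w0.value)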


Linear Classification: Outlier

  • Note that in the real world, you may have noise, errors, or outliers that do not accurately represent the actual phenomena

  • Linearly non-separable case


Outliers

  • No solution (separating hyperplane) exists

  • We have to allow some training examples to be misclassified!
    • but we want their number to be as small as possible


Optimization Formulation 2


  • The optimization problem for the non-separable case
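  • A standard slack formulation (the slack u_i measures the violation of point i):

    minimize   1^T u
    subject to y_i (ω^T x_i + ω_0) ≥ 1 − u_i
               u_i ≥ 0, i = 1, …, n

    Minimizing the total slack is a linear-programming surrogate for minimizing the number of misclassified points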


Expressed in a Matrix Form


CVXPY 2
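  A minimal CVXPY sketch with slack variables (assumes X, y as before):

    import cvxpy as cp

    n = X.shape[0]
    w = cp.Variable(2)
    w0 = cp.Variable()
    u = cp.Variable(n)                      # slack: one per data point

    obj = cp.Minimize(cp.sum(u))
    constraints = [cp.multiply(y, X @ w + w0) >= 1 - u, u >= 0]
    cp.Problem(obj, constraints).solve()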


Further Improvement

  • Notice that the hyperplane does not represent the division accurately, due to the outlier

  • Can we do better when there are noisy data or outliers?
  • Yes, but we need to look beyond linear programming

  • Idea: a large margin leads to good generalization on the test data


Maximize Margin
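  • With the constraint y_i (ω^T x_i + ω_0) ≥ 1, the margin (the width of the band) is 2/||ω||, so maximizing the margin is equivalent to

    minimize   ||ω||^2
    subject to y_i (ω^T x_i + ω_0) ≥ 1, i = 1, …, n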


Support Vector Machine

  • In a more compact form
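  • A standard compact (soft-margin) form, combining the margin objective with the slack penalty (γ is a trade-off parameter):

    minimize   ||ω||^2 + γ 1^T u
    subject to y_i (ω^T x_i + ω_0) ≥ 1 − u_i,  u_i ≥ 0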


Scikit-learn
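  A minimal scikit-learn sketch (C plays the role of the trade-off parameter γ above):

    from sklearn import svm

    clf = svm.SVC(kernel='linear', C=1.0)
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)
    print(clf.support_vectors_)   # the points that determine the margin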


Classifying Non-linearly Separable Data


Kernel

  • Often we want to capture nonlinear patterns in the data
    • nonlinear regression: the input-output relationship may not be linear
    • nonlinear classification: classes may not be separable by a linear boundary

  • Linear models (e.g., linear regression, linear SVM) are just not rich enough
    • fix: map the data to a higher-dimensional space where it exhibits linear patterns
    • then apply the linear model in the new feature space
    • mapping = changing the feature representation

  • Kernels: make linear models work in nonlinear settings (see the sketch below)
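  A minimal sketch of the explicit-mapping idea (the 1-D data and the map φ(x) = [x, x²] are illustrative):

    import numpy as np
    from sklearn import svm

    # 1-D data: class +1 inside [-1, 1], not separable by a single threshold
    x = np.linspace(-3, 3, 61).reshape(-1, 1)
    y2 = np.where(np.abs(x.ravel()) < 1, 1, -1)

    # map to the 2-D feature space phi(x) = [x, x^2]; there a line separates
    # the classes (a threshold on x^2)
    Phi = np.hstack([x, x**2])
    clf = svm.SVC(kernel='linear').fit(Phi, y2)
    print(clf.score(Phi, y2))   # expect 1.0 after the mapping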


Nonlinear Classification
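  The same effect can be obtained without constructing the features explicitly: an implicit mapping through a kernel (here the RBF kernel, a common default):

    from sklearn import svm

    clf = svm.SVC(kernel='rbf', gamma=1.0)   # implicit feature map via kernel
    clf.fit(X, y)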


Logistic Regression


Linear Classification: Logistic Regression

  • Logistic regression is a classification algorithm
    • don't be confused by the name

  • Perceptron: makes use of the sign of the data only

  • SVM: makes use of the margin (minimum distance)
    • distance from the closest data points

  • We want to use the distance information of all data points
    • logistic regression


Using Distances


Using all Distances


Using all Distances with Outliers

  • SVM vs. Logistic Regression

(figure: decision boundaries of SVM vs. logistic regression in the presence of outliers)

Sigmoid Function
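  • A standard definition: the sigmoid (logistic) function

    σ(z) = 1 / (1 + e^{−z})

    is a smooth approximation of the step function, squashing the signed distance ω^T x into (0, 1) so it can be read as a probability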

(figure: step function vs. sigmoid function)

Goal: We Need to Fit 𝜔 to Data


  • It is easier to work with the log-likelihood

  • The logistic regression problem can then be solved as a (convex) optimization problem:

  • Again, it is an optimization problem
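  • A standard form of that problem (assuming labels y_i ∈ {0, 1} and the sigmoid σ above):

    maximize over ω:  Σ_i [ y_i log σ(ω^T x_i) + (1 − y_i) log(1 − σ(ω^T x_i)) ]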


Logistic Regression using GD


Gradient Descent

  • To use the gradient descent method, we need the gradient of the objective

  • We need to compute the derivative of the log-likelihood with respect to ω
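  • For the log-likelihood above, the gradient takes the standard form

    ∂ℓ/∂ω = Σ_i ( y_i − σ(ω^T x_i) ) x_i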


Gradient Descent for Logistic Regression

  • It is a maximization problem (equivalently, minimize the negative log-likelihood)
  • Be careful with matrix shapes (see the sketch below)


Python Implementation
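  A minimal NumPy sketch of gradient ascent on the log-likelihood (assumes y ∈ {0, 1}; step size and iteration count are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def logistic_gd(X, y, lr=0.1, n_iter=5000):
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])     # bias column
        w = np.zeros(d + 1)
        for _ in range(n_iter):
            grad = Xb.T @ (y - sigmoid(Xb @ w))  # gradient of log-likelihood
            w += lr / n * grad                   # ascent step (maximization)
        return w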


Logistic Regression using CVXPY
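  A minimal CVXPY sketch; cp.logistic(z) is log(1 + e^z), so the objective below is the log-likelihood (assumes y ∈ {0, 1}):

    import cvxpy as cp
    import numpy as np

    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    w = cp.Variable(d + 1)

    obj = cp.Maximize(y @ (Xb @ w) - cp.sum(cp.logistic(Xb @ w)))
    cp.Problem(obj).solve()
    print(w.value)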


Probabilistic Approach (or MLE)


CVXPY Implementation


In a More Compact Form
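  • With labels recoded to y_i ∈ {−1, +1}, the same problem collapses to a single term per data point:

    minimize over ω:  Σ_i log( 1 + e^{ −y_i ω^T x_i } )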


Logistic Regression using Scikit-Learn
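  A minimal scikit-learn sketch:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)
    print(clf.predict_proba(X[:5]))   # per-class probabilities (sigmoid)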


Non-linear Classification


  • Same idea as in non-linear regression: non-linear features
    • explicit or implicit kernel


Explicit Kernel
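  A minimal sketch of the explicit route: build the non-linear features yourself, then fit an ordinary linear classifier in that space (degree 2 is illustrative):

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree=2)
    Z = poly.fit_transform(X)   # e.g. [1, x1, x2, x1^2, x1*x2, x2^2]
    clf = LogisticRegression().fit(Z, y)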


Multiclass Classification


  • Generalization to more than 2 classes is straightforward (see the sketch below)
    • one vs. all (one vs. rest)
    • one vs. one
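  A minimal sketch of both strategies using scikit-learn's wrappers (X_multi and y_multi stand for a hypothetical multiclass dataset):

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    ovr = OneVsRestClassifier(LogisticRegression()).fit(X_multi, y_multi)
    ovo = OneVsOneClassifier(LogisticRegression()).fit(X_multi, y_multi)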
