1 of 110

Classification: Perceptron

2 of 110

Classification

  • Categorizing data into predefined classes or categories based on input features

  • Examples
    • Spam detection: Classifying emails as either spam or not spam
    • Image recognition: Identifying objects, animals, or faces in images
    • Medical diagnosis: Predicting whether a patient has a particular condition based on diagnostic data

2

3 of 110

Classification

  •  

3

4 of 110

Classification

  • We will learn
    • Perceptron
    • Support vector machine (SVM)
    • Logistic regression

  • To find a classification boundary

4


7 of 110

Perceptron

  • A perceptron is a linear classifier: it computes a weighted sum of the input features and predicts the class from its sign, ŷ = sign(ω^T x + b).

7

8 of 110

Perceptron

  •  

8

9 of 110

Classification Boundary

  • The classification boundary is the set of points where the classifier's decision changes; for a linear classifier it is the hyperplane ω^T x + b = 0.

9

10 of 110

Learning a Hyperplane for Classification

  • Learning a linear classifier then means finding ω and b such that ω^T x_n + b > 0 for the points with y_n = +1 and ω^T x_n + b < 0 for the points with y_n = −1.

10


13 of 110

Perceptron Algorithm

  • The perceptron implements a simple, iterative error-correcting procedure for finding a separating hyperplane

  • Given the training set {(x_n, y_n)} with labels y_n ∈ {−1, +1}, repeat:

  1. pick a misclassified point (x_n, y_n), i.e., one with sign(ω^T x_n) ≠ y_n

  2. and update the weight vector: ω ← ω + y_n x_n
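
As a quick check of the update rule, here is a small worked example with made-up numbers (not taken from the slides). Suppose ω = (1, −1) and the point x_n = (2, 1) has label y_n = −1:

    ω^T x_n = 1·2 + (−1)·1 = 1 > 0, so the point is misclassified
    update:  ω ← ω + y_n x_n = (1, −1) − (2, 1) = (−1, −2)
    now ω^T x_n = (−1)·2 + (−2)·1 = −4 < 0, so this point is classified correctly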

13

14 of 110

Perceptron Algorithm: Illustration

(Slides 14-19: a sequence of figures illustrating the perceptron updates step by step)

14

20 of 110

Why Perceptron Updates Work?

  • If (x_n, y_n) is misclassified, then y_n ω^T x_n < 0. After the update ω ← ω + y_n x_n we have y_n (ω + y_n x_n)^T x_n = y_n ω^T x_n + ‖x_n‖² > y_n ω^T x_n, so each update moves the weight vector toward classifying the chosen point correctly.

20

21 of 110

Diagram of Perceptron

  • Perceptron can be viewed as a simple neuron in an Artificial Neural Network (ANN)

  • Perceptron Update Rule (discrete version): for a misclassified point, ω ← ω + η y_n x_n (learning rate η, taken as 1 on the previous slides)

  • Gradient Descent Update Rule (continuous version): commonly written as ω ← ω + η (y_n − ŷ_n) x_n, where ŷ_n is the neuron's real-valued output

21

22 of 110

Loss Function for Perceptron

  • The perceptron can be interpreted as (sub)gradient descent on a loss that is zero for correctly classified points and grows linearly with the violation for misclassified ones (a sketch of this loss is given below).
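
A common way to write this loss (a sketch in my own notation, not copied from the slides):

    L(ω) = Σ_n max(0, −y_n ω^T x_n)

Correctly classified points contribute zero; for a misclassified point the (sub)gradient is −y_n x_n, so a gradient descent step ω ← ω − η(−y_n x_n) = ω + η y_n x_n recovers exactly the perceptron update.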

22

23 of 110

Perceptron in Python
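
The slides' code is not reproduced in this text version; the following is a minimal NumPy sketch of the training loop described above (function and variable names are my own):

import numpy as np

def perceptron(X, y, max_iter=1000):
    """Train a perceptron. X: (N, d) features, y: (N,) labels in {-1, +1}.
    A constant feature is appended to X so the bias is absorbed into w."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        mis = np.where(np.sign(Xb @ w) != y)[0]   # currently misclassified points
        if len(mis) == 0:                         # all points correct: done
            break
        n = np.random.choice(mis)                 # pick one misclassified point
        w = w + y[n] * Xb[n]                      # perceptron update
    return w

Calling w = perceptron(X, y) on linearly separable data returns a separating weight vector; the loop stops as soon as no point is misclassified.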

23

24 of 110

Perceptron in Python

  •  

24

25 of 110

Perceptron in Python

25

 

26 of 110

Perceptron in Python

26

27 of 110

Scikit-Learn for Perceptron
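
The slide's code is not preserved here; a minimal scikit-learn sketch, with toy data standing in for the slides' dataset:

from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

# toy data: two well-separated clusters (placeholder for the slides' dataset)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = Perceptron()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # learned weights and bias
print(clf.score(X, y))             # training accuracy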

27

28 of 110

The Best Hyperplane Separator?

  • Limitations
    • The Perceptron identifies a separating hyperplane if the data is linearly separable.
    • However, it merely finds one of many possible separating hyperplanes, with no guarantee that it is a particularly good one.

28

29 of 110

The Best Hyperplane Separator?

  • Improvements
    • Identifying the best hyperplane requires an optimization framework.

  • Utilize distance information
    • Support Vector Machines (SVM) and Logistic Regression

29

30 of 110

Support Vector Machine

31 of 110

Distance from a Line

  • To build the foundation for understanding SVMs, we will first examine how to compute the distance from a point to a line.
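
For reference, the result these slides build toward: the distance from a point x_0 to the hyperplane ω^T x + b = 0 is

    d = |ω^T x_0 + b| / ‖ω‖

and dropping the absolute value gives a signed distance whose sign indicates on which side of the hyperplane x_0 lies.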

31

32 of 110

(Slides 32-35: derivation of the distance from a point to a line)

32

36 of 110

Distance between Two Parallel Lines (1/2)

  • Consider two parallel lines (hyperplanes) sharing the same normal vector ω: ω^T x + b_1 = 0 and ω^T x + b_2 = 0.

36

37 of 110

Distance between Two Parallel Lines (2/2)

  • Measuring along the common normal direction ω/‖ω‖, the distance between the two parallel lines is d = |b_1 − b_2| / ‖ω‖.

37

38 of 110

Illustrative Example

  •  

38

39 of 110

Decision Making

  • Given a learned hyperplane, a new point x is classified by the sign of ω^T x + b: predict the positive class if it is positive and the negative class otherwise.

39

40 of 110

The Key Insight in SVM: Introducing a Margin

  • Instead of accepting any separating hyperplane, require every training point to lie at least some distance (a margin) away from the boundary; this creates a buffer zone between the two classes.

40

41 of 110

Scaling to Simplify the Problem

  • Since (ω, b) and (c·ω, c·b) describe the same hyperplane for any c > 0, we can rescale so that the points closest to the boundary satisfy |ω^T x + b| = 1; the margin constraints then take the simple form used on the following slides.

41

42 of 110

Data Generation for Classification
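
The original code is not preserved here; a minimal sketch that generates two linearly separable Gaussian clusters (the means and sizes below are arbitrary choices of mine):

import numpy as np

np.random.seed(0)
N = 100
# class +1 centered at (3, 3), class -1 centered at (-3, -3): linearly separable
X1 = np.random.randn(N, 2) + np.array([3, 3])
X0 = np.random.randn(N, 2) + np.array([-3, -3])
X = np.vstack([X1, X0])
y = np.hstack([np.ones(N), -np.ones(N)])   # labels in {+1, -1}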

42

(Figure: the generated classes are separated by a buffer zone)

43 of 110

Optimization Formulation 1

  • After the rescaling above, the requirements on a separating hyperplane can be written as constraints: y_n (ω^T x_n + b) ≥ 1 for every training point (equivalently, ω^T x_n + b ≥ 1 when y_n = +1 and ≤ −1 when y_n = −1).

43

44 of 110

The First Attempt

  • These constraints ensure that all correctly classified points lie outside the margin boundaries, establishing a buffer zone that enhances the classifier's robustness.
  • However, the appropriate objective function to be minimized has not yet been determined.

44

45 of 110

CVXPY 1
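
The slides' code is not reproduced in this text version; a CVXPY sketch of the first formulation, treated as a pure feasibility problem over ω and b (variable names and toy data are my own):

import cvxpy as cp
import numpy as np

# toy separable data with labels in {+1, -1} (as on the data-generation slide)
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
y = np.hstack([np.ones(50), -np.ones(50)])

w = cp.Variable(2)
b = cp.Variable()

# every point must lie on its correct side, outside the buffer zone
constraints = [cp.multiply(y, X @ w + b) >= 1]

# no objective yet (see the next slides): any feasible (w, b) will do
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(w.value, b.value)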

45

46 of 110

CVXPY 1

46

47 of 110

Outliers

  • This formulation may fail to produce a valid boundary when the data is not linearly separable

  • In real-world scenarios, datasets often contain noise, errors, or outliers that deviate from the true underlying patterns
    • Linearly non-separable case
    • No feasible solution

47

48 of 110

Outliers

  • In that case no solution (separating hyperplane) exists

  • Allowing Misclassifications
    • Some training examples are allowed to be misclassified.
    • However, the goal remains to minimize the number of misclassified points or, more formally, to minimize the total deviation from the margin constraints

48

49 of 110

The Second Attempt

  • Introduce a slack variable ξ_n ≥ 0 for each data point, measuring how much that point violates its margin constraint.

49

50 of 110

The Second Attempt

  • The optimization problem for the non-separable case
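
A standard way to write it (a reconstruction consistent with the description above, with one slack variable ξ_n ≥ 0 per point):

    minimize    Σ_n ξ_n
    subject to  y_n (ω^T x_n + b) ≥ 1 − ξ_n,   ξ_n ≥ 0   for all n

Each ξ_n measures how far point n violates its margin constraint, so the objective minimizes the total deviation from the margin constraints.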

50

51 of 110

The Second Attempt

  •  

51

52 of 110

Expressed in a Matrix Form

52

53 of 110

CVXPY 2
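
A CVXPY sketch of that relaxed problem (again with my own variable names and toy data):

import cvxpy as cp
import numpy as np

# same style of toy data as before, labels in {+1, -1}
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
y = np.hstack([np.ones(50), -np.ones(50)])
N, d = X.shape

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(N)          # one slack variable per data point

constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
prob = cp.Problem(cp.Minimize(cp.sum(xi)), constraints)   # minimize total violation
prob.solve()
print(w.value, b.value)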

53

54 of 110

Further Improvement

  • Can we do better when the data contain noise or outliers?
  • Yes: the formulation above leaves the buffer zone highly sensitive to noise and extreme points.
    • Notice that the hyperplane no longer represents the division between the classes accurately because of the outlier

  • Idea: a large margin not only separates the data effectively but also improves the model's generalization on unseen data.

54

55 of 110

Maximize Margin

  • With the scaling above, the two margin boundaries are ω^T x + b = 1 and ω^T x + b = −1, so by the parallel-line formula the width of the buffer zone is 2/‖ω‖. Maximizing the margin is therefore equivalent to minimizing ‖ω‖ (or ‖ω‖²).

55

56 of 110

Support Vector Machine
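
Putting the pieces together, the soft-margin SVM combines the large-margin objective with the slack variables. In its standard form (the trade-off weight, often written C or γ, is not specified in this text):

    minimize    ‖ω‖² + γ Σ_n ξ_n
    subject to  y_n (ω^T x_n + b) ≥ 1 − ξ_n,   ξ_n ≥ 0   for all n

A larger γ penalizes violations more heavily and fits the buffer zone tightly to the data; a smaller γ favors a wider margin at the cost of allowing more violations.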

56

57 of 110

In a More Compact Form

57

58 of 110

Scikit-learn
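
The slide's code is not preserved here; a scikit-learn sketch of the same classifier (C controls the margin/slack trade-off; the value below is arbitrary):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# linear soft-margin SVM
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # hyperplane parameters
print(len(clf.support_vectors_))    # points that determine the margin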

58

59 of 110

Lesson from SVM

  • Throughout the development of SVM, we encourage you to recognize the continuous improvement process
    • We began with a straightforward linear classifier designed to separate clean, linearly separable data.
    • Upon encountering non-separable data due to noise, errors, or outliers, we introduced slack variables to relax the margin conditions, enhancing the model’s robustness.
    • To further improve generalization on unseen data, we adopted the concept of a large margin, which maximizes the distance between the decision boundary and the closest data points, making the model more resilient to noise and improving overall performance.

  • This step-by-step refinement mirrors the natural progression of real-world problem-solving
    • Starting with a basic concept, confronting obstacles, and evolving the model to achieve better performance and broader applicability.

59

60 of 110

Logistic Regression

61 of 110

Linear Classification: Logistic Regression

  • Logistic regression is a classification algorithm
    • despite the word "regression" in its name, so don't be confused

  • Perceptron: makes use of the sign of the data (which side of the boundary each point falls on)

  • SVM: makes use of the margin (minimum distance)
    • the distance from the two closest data points to the boundary

  • We want to use the distance information of all data points
    • this leads to logistic regression

61

62 of 110

Using Distances

(Slides 62-65: figures illustrating how the signed distances of the data points to the boundary can be used)

62

66 of 110

Using all Distances

  • Instead of looking only at the closest points (as SVM does), consider the signed distances of all data points to the boundary and combine them into a single objective.

66

67 of 110

Using all Distances

  •  

67

68 of 110

Using all Distances with Outliers

  • SVM vs. Logistic Regression

68

(Figure panels: SVM vs. Logistic Regression decision boundaries in the presence of an outlier)

69 of 110

Sigmoid Function

  • The sigmoid (logistic) function σ(z) = 1 / (1 + e^(−z)) squashes any real number into the interval (0, 1), so σ(ω^T x + b) can be read as the probability of the positive class.

69

(Figure panels: Perceptron, SVM, Logistic regression)

70 of 110

Sigmoid Function

  •  

70

71 of 110

Distance

  • Recall that the signed distance of a point x from the hyperplane ω^T x + b = 0 is (ω^T x + b)/‖ω‖; the three classifiers differ in how they use this quantity.

71

72 of 110

Distance

  • Perceptron: only the sign of ω^T x + b matters; how far a point lies from the boundary is ignored.

72

73 of 110

Distance

  • SVM: only the distance of the closest points (the margin) is used to choose the boundary.

73

74 of 110

Distance

  • Logistic Regression: the signed distances of all points are used, passed through the sigmoid to give class probabilities.

74

75 of 110

Sigmoid Function

  •  

75

76 of 110

Goal: We Need to Fit 𝜔 to Data

  • Interpreting σ(ω^T x_n) as the probability that y_n belongs to the positive class, we choose ω to maximize the likelihood of the observed labels.

76

77 of 110

Goal: We Need to Fit 𝜔 to Data

  • It would be easier to work on the log likelihood.

  • The logistic regression problem can be solved as a (convex) optimization problem:

  • Again, it is an optimization problem
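
With labels y_n ∈ {0, 1} and the model P(y_n = 1 | x_n) = σ(ω^T x_n) (this label convention is my assumption; the slides may use ±1 labels instead), the problem reads:

    maximize over ω:   ℓ(ω) = Σ_n [ y_n log σ(ω^T x_n) + (1 − y_n) log(1 − σ(ω^T x_n)) ]

The log likelihood ℓ(ω) is concave in ω, which is why this is a convex optimization problem.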

77

78 of 110

Logistic Regression using GD

78

79 of 110

Gradient Descent

  • To use the gradient descent (here, ascent) method, we need the derivative of the objective with respect to ω

  • That is, we need to compute ∇_ω ℓ(ω), the gradient of the log likelihood
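
Using σ'(z) = σ(z)(1 − σ(z)), the gradient of the log likelihood above works out to (same 0/1 label convention as before):

    ∇_ω ℓ(ω) = Σ_n ( y_n − σ(ω^T x_n) ) x_n

i.e., each point pushes ω in the direction of x_n with a weight equal to the prediction error y_n − σ(ω^T x_n).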

79

80 of 110

Gradient Descent

  •  

80

81 of 110

Gradient Descent for Logistic Regression

  • Since we maximize the log likelihood, the update is a gradient ascent step: ω ← ω + η ∇_ω ℓ(ω)
  • Be careful with matrix shapes when implementing this

81

82 of 110

Python Implementation
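
The original implementation is not preserved here; a minimal NumPy sketch of gradient ascent on the log likelihood (labels assumed in {0, 1}; learning rate and iteration count are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression by gradient ascent on the log likelihood.
    X: (N, d) feature matrix, y: (N,) labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                  # predicted probabilities
        grad = Xb.T @ (y - p) / len(y)       # average gradient of the log likelihood
        w = w + lr * grad                    # ascent step (we are maximizing)
    return w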

82

83 of 110

Python Implementation

83

84 of 110

Logistic Regression using CVXPY

84

85 of 110

Probabilistic Approach (or MLE)

  • Probabilistic view: model P(y_n = 1 | x_n) = σ(ω^T x_n) and estimate ω by maximum likelihood (MLE); this yields exactly the optimization problem above.

85

86 of 110

Probabilistic Approach (or MLE)

  •  

86

87 of 110

CVXPY Implementation
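
The slide's code is not reproduced here; a CVXPY sketch of the same maximum-likelihood problem. In this sketch the labels are taken in {−1, +1}, so the negative log likelihood of each point is log(1 + exp(−y_n ω^T x_n)), which CVXPY provides as cp.logistic (this encoding is my choice; the slides may write it differently):

import cvxpy as cp
import numpy as np

# overlapping clusters so the maximum-likelihood solution stays finite; labels in {+1, -1}
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 1, np.random.randn(50, 2) - 1])
y = np.hstack([np.ones(50), -np.ones(50)])
Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature

w = cp.Variable(Xb.shape[1])

# cp.logistic(z) = log(1 + exp(z)), so this is the negative log likelihood
loss = cp.sum(cp.logistic(-cp.multiply(y, Xb @ w)))
prob = cp.Problem(cp.Minimize(loss))
prob.solve()
print(w.value)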

87

88 of 110

In a More Compact Form

  •  

88

89 of 110

CVXPY Implementation

89

90 of 110

Logistic Regression using Scikit-Learn
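
The slide's code is not preserved here; a minimal scikit-learn sketch with toy data standing in for the slides' dataset (regularization is on by default, controlled by C):

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict_proba(X[:5]))   # per-class probabilities from the sigmoid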

90

91 of 110

Logistic Regression using Scikit-Learn

91

92 of 110

Cross-Entropy

  • The cross-entropy between a target distribution p and a predicted distribution q is H(p, q) = −Σ_i p_i log q_i. For binary labels, the cross-entropy loss −[ y log ŷ + (1 − y) log(1 − ŷ) ] is exactly the negative log likelihood minimized by logistic regression.

92

93 of 110

Multiclass Classification

93

94 of 110

Multiclass Classification: One vs. One

  • Generalization to more than 2 classes is straightforward

  • one vs. one
    • For three classes (C0, C1, C2)
      • Classifier 1: Distinguishes C0 vs. C1
      • Classifier 2: Distinguishes C0 vs. C2
      • Classifier 3: Distinguishes C1 vs. C2

94

95 of 110

Multiclass Classification: One vs. All (One vs. Rest)

  • Generalization to more than 2 classes is straightforward

  • one vs. all (one vs. rest)
    • For three classes (C0, C1, C2)
      • Classifier 1: Distinguishes C0 from C1 and C2.
      • Classifier 2: Distinguishes C1 from C0 and C2.
      • Classifier 3: Distinguishes C2 from C0 and C1.

95

96 of 110

Multiclass Classification: Softmax

  • The softmax function generalizes the sigmoid to K classes: softmax(z)_k = e^(z_k) / Σ_j e^(z_j), producing one probability per class, and the probabilities sum to 1.
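
For example, with z = (2, 1, 0) (values rounded):

    softmax(z) = (e², e¹, e⁰) / (e² + e¹ + e⁰) ≈ (7.39, 2.72, 1.00) / 11.11 ≈ (0.665, 0.245, 0.090)

The largest score gets the largest probability, and the three probabilities sum to 1.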

96

97 of 110

One-Hot Encoding

  • Each class label is encoded as a vector with a 1 in the position of that class and 0 elsewhere; for three classes, C0 → [1, 0, 0], C1 → [0, 1, 0], C2 → [0, 0, 1].

97

98 of 110

Non-linear Classification

98

99 of 110

Classifying Non-linear Separable Data

(Slides 99-101: examples of data that cannot be separated by a single linear boundary)

99

102 of 110

Nonlinear Classification

102

103 of 110

Kernel

  • Often we want to capture nonlinear patterns in the data
    • nonlinear regression: the input-output relationship may not be linear
    • nonlinear classification: the classes may not be separable by a linear boundary

  • Linear models (e.g., linear regression, linear SVM) are often not rich enough on their own
    • remedy: map the data to higher dimensions where it exhibits linear patterns
    • then apply the linear model in the new feature space
    • mapping = changing the feature representation

  • Kernels: make linear models work in nonlinear settings

103

104 of 110

Selecting the Appropriate Kernel

  • Applying the right kernel enables linear classification to handle nonlinearly distributed data
  • But, we have not yet addressed the process of selecting this optimal kernel

  • Throughout our previous discussions, we assumed that the kernel function was predefined.
  • Identifying the suitable kernel for a given dataset remains a non-trivial challenge, but we will not explore this topic further.

  • The primary reason for this decision is that in deep learning, the model architecture inherently learns effective feature transformations directly from the data.
    • Unlike traditional kernel methods, modern deep learning frameworks are capable of automatically discovering complex and data-driven feature mappings, eliminating the need for manually selecting or designing an optimal kernel.

104

105 of 110

Classifying Non-linear Separable Data

(Slides 105-107: figures illustrating classification of non-linearly separable data)

105

108 of 110

Non-linear Classification

  • Same idea as non-linear regression: non-linear features
    • Explicit or implicit Kernel

108

109 of 110

Explicit Kernel
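
No code survives on this slide; a sketch of an explicit (polynomial) feature map that makes circularly separated 2-D data linearly separable in the lifted space (the particular features and toy dataset are my choices):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# toy data: one class inside a ring formed by the other (not linearly separable)
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

def lift(X):
    # explicit feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# a plain linear classifier applied in the lifted feature space
clf = LogisticRegression().fit(lift(X), y)
print(clf.score(lift(X), y))   # should be close to 1: the lifted data is (almost) linearly separable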

109

110 of 110

Non-linear Classification

110