1 of 110

Classification: Perceptron

2 of 110

Classification

  • Categorizing data into predefined classes or categories based on input features

  • Examples
    • Spam detection: Classifying emails as either spam or not spam
    • Image recognition: Identifying objects, animals, or faces in images
    • Medical diagnosis: Predicting whether a patient has a particular condition based on diagnostic data

2

3 of 110

Classification

  •  

3

4 of 110

Classification

  • We will learn
    • Perceptron
    • Support vector machine (SVM)
    • Logistic regression

  • To find a classification boundary

4


7 of 110

Perceptron

  • A perceptron is a linear classifier: it computes a weighted sum of the input features and predicts the class from its sign, ŷ = sign(ω^T x + b).

7

8 of 110

Perceptron

  •  

8

9 of 110

Classification Boundary

  • The classification boundary is the set of points where the classifier's decision changes; for a linear classifier it is the hyperplane ω^T x + b = 0.

9

10 of 110

Learning a Hyperplane for Classification

  • Learning a linear classifier then means finding ω and b such that ω^T x_n + b > 0 for the points with y_n = +1 and ω^T x_n + b < 0 for the points with y_n = −1.

10


13 of 110

Perceptron Algorithm

  • The perceptron implements a simple, iterative error-correcting procedure for finding a separating hyperplane

  • Given the training set {(x_n, y_n)} with labels y_n ∈ {−1, +1}, repeat:

  1. pick a misclassified point (x_n, y_n), i.e., one with sign(ω^T x_n) ≠ y_n

  2. and update the weight vector: ω ← ω + y_n x_n
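
As a quick check of the update rule, here is a small worked example with made-up numbers (not taken from the slides). Suppose ω = (1, −1) and the point x_n = (2, 1) has label y_n = −1:

    ω^T x_n = 1·2 + (−1)·1 = 1 > 0, so the point is misclassified
    update:  ω ← ω + y_n x_n = (1, −1) − (2, 1) = (−1, −2)
    now ω^T x_n = (−1)·2 + (−2)·1 = −4 < 0, so this point is classified correctly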

13

14 of 110

Perceptron Algorithm: Illustration

(Slides 14-19: a sequence of figures illustrating the perceptron updates step by step)

14

20 of 110

Why Perceptron Updates Work?

  • If (x_n, y_n) is misclassified, then y_n ω^T x_n < 0. After the update ω ← ω + y_n x_n we have y_n (ω + y_n x_n)^T x_n = y_n ω^T x_n + ‖x_n‖² > y_n ω^T x_n, so each update moves the weight vector toward classifying the chosen point correctly.

20

21 of 110

Diagram of Perceptron

  • Perceptron can be viewed as a simple neuron in an Artificial Neural Network (ANN)

  • Perceptron Update Rule (discrete version): for a misclassified point, ω ← ω + η y_n x_n (learning rate η, taken as 1 on the previous slides)

  • Gradient Descent Update Rule (continuous version): commonly written as ω ← ω + η (y_n − ŷ_n) x_n, where ŷ_n is the neuron's real-valued output

21

22 of 110

Loss Function for Perceptron

  • The perceptron can be interpreted as (sub)gradient descent on a loss that is zero for correctly classified points and grows linearly with the violation for misclassified ones (a sketch of this loss is given below).
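
A common way to write this loss (a sketch in my own notation, not copied from the slides):

    L(ω) = Σ_n max(0, −y_n ω^T x_n)

Correctly classified points contribute zero; for a misclassified point the (sub)gradient is −y_n x_n, so a gradient descent step ω ← ω − η(−y_n x_n) = ω + η y_n x_n recovers exactly the perceptron update.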

22

23 of 110

Perceptron in Python
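
The slides' code is not reproduced in this text version; the following is a minimal NumPy sketch of the training loop described above (function and variable names are my own):

import numpy as np

def perceptron(X, y, max_iter=1000):
    """Train a perceptron. X: (N, d) features, y: (N,) labels in {-1, +1}.
    A constant feature is appended to X so the bias is absorbed into w."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        mis = np.where(np.sign(Xb @ w) != y)[0]   # currently misclassified points
        if len(mis) == 0:                         # all points correct: done
            break
        n = np.random.choice(mis)                 # pick one misclassified point
        w = w + y[n] * Xb[n]                      # perceptron update
    return w

Calling w = perceptron(X, y) on linearly separable data returns a separating weight vector; the loop stops as soon as no point is misclassified.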

23

24 of 110

Perceptron in Python

  •  

24

25 of 110

Perceptron in Python

25

 

26 of 110

Perceptron in Python

26

27 of 110

Scikit-Learn for Perceptron
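
The slide's code is not preserved here; a minimal scikit-learn sketch, with toy data standing in for the slides' dataset:

from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

# toy data: two well-separated clusters (placeholder for the slides' dataset)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = Perceptron()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # learned weights and bias
print(clf.score(X, y))             # training accuracy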

27

28 of 110

The Best Hyperplane Separator?

  • Limitations
    • The Perceptron identifies a separating hyperplane if the data is linearly separable.
    • However, it merely finds one of many possible separating hyperplanes, with no guarantee that it is a particularly good one.

28

29 of 110

The Best Hyperplane Separator?

  • Improvements
    • Identifying the best hyperplane requires an optimization framework.

  • Utilize distance information
    • Support Vector Machines (SVM) and Logistic Regression

29

30 of 110

Support Vector Machine

31 of 110

Distance from a Line

  • To build the foundation for understanding SVMs, we will first examine how to compute the distance from a point to a line.
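
For reference, the result these slides build toward: the distance from a point x_0 to the hyperplane ω^T x + b = 0 is

    d = |ω^T x_0 + b| / ‖ω‖

and dropping the absolute value gives a signed distance whose sign indicates on which side of the hyperplane x_0 lies.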

31

32 of 110

(Slides 32-35: derivation of the distance from a point to a line)

32

36 of 110

Distance between Two Parallel Lines (1/2)

  • Consider two parallel lines (hyperplanes) sharing the same normal vector ω: ω^T x + b_1 = 0 and ω^T x + b_2 = 0.

36

37 of 110

Distance between Two Parallel Lines (2/2)

  • Measuring along the common normal direction ω/‖ω‖, the distance between the two parallel lines is d = |b_1 − b_2| / ‖ω‖.

37

38 of 110

Illustrative Example

  •  

38

39 of 110

Decision Making

  • Given a learned hyperplane, a new point x is classified by the sign of ω^T x + b: predict the positive class if it is positive and the negative class otherwise.

39

40 of 110

The Key Insight in SVM: Introducing a Margin

  • Instead of accepting any separating hyperplane, require every training point to lie at least some distance (a margin) away from the boundary; this creates a buffer zone between the two classes.

40

41 of 110

Scaling to Simplify the Problem

  • Since (ω, b) and (c·ω, c·b) describe the same hyperplane for any c > 0, we can rescale so that the points closest to the boundary satisfy |ω^T x + b| = 1; the margin constraints then take the simple form used on the following slides.

41

42 of 110

Data Generation for Classification
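
The original code is not preserved here; a minimal sketch that generates two linearly separable Gaussian clusters (the means and sizes below are arbitrary choices of mine):

import numpy as np

np.random.seed(0)
N = 100
# class +1 centered at (3, 3), class -1 centered at (-3, -3): linearly separable
X1 = np.random.randn(N, 2) + np.array([3, 3])
X0 = np.random.randn(N, 2) + np.array([-3, -3])
X = np.vstack([X1, X0])
y = np.hstack([np.ones(N), -np.ones(N)])   # labels in {+1, -1}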

42

(Figure: the generated classes are separated by a buffer zone)

43 of 110

Optimization Formulation 1

  • After the rescaling above, the requirements on a separating hyperplane can be written as constraints: y_n (ω^T x_n + b) ≥ 1 for every training point (equivalently, ω^T x_n + b ≥ 1 when y_n = +1 and ≤ −1 when y_n = −1).

43

44 of 110

The First Attempt

  • These constraints ensure that all correctly classified points lie outside the margin boundaries, establishing a buffer zone that enhances the classifier's robustness.
  • However, the appropriate objective function to be minimized has not yet been determined.

44

45 of 110

CVXPY 1
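
The slides' code is not reproduced in this text version; a CVXPY sketch of the first formulation, treated as a pure feasibility problem over ω and b (variable names and toy data are my own):

import cvxpy as cp
import numpy as np

# toy separable data with labels in {+1, -1} (as on the data-generation slide)
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
y = np.hstack([np.ones(50), -np.ones(50)])

w = cp.Variable(2)
b = cp.Variable()

# every point must lie on its correct side, outside the buffer zone
constraints = [cp.multiply(y, X @ w + b) >= 1]

# no objective yet (see the next slides): any feasible (w, b) will do
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(w.value, b.value)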

45

46 of 110

CVXPY 1

46

47 of 110

Outliers

  • This formulation may fail to produce a valid boundary when the data is not linearly separable

  • In real-world scenarios, datasets often contain noise, errors, or outliers that deviate from the true underlying patterns
    • Linearly non-separable case
    • No feasible solution

47

48 of 110

Outliers

  • In that case no solution (separating hyperplane) exists

  • Allowing Misclassifications
    • Some training examples are allowed to be misclassified.
    • However, the goal remains to minimize the number of misclassified points or, more formally, to minimize the total deviation from the margin constraints

48

49 of 110

The Second Attempt

  • Introduce a slack variable ξ_n ≥ 0 for each data point, measuring how much that point violates its margin constraint.

49

50 of 110

The Second Attempt

  • The optimization problem for the non-separable case
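
A standard way to write it (a reconstruction consistent with the description above, with one slack variable ξ_n ≥ 0 per point):

    minimize    Σ_n ξ_n
    subject to  y_n (ω^T x_n + b) ≥ 1 − ξ_n,   ξ_n ≥ 0   for all n

Each ξ_n measures how far point n violates its margin constraint, so the objective minimizes the total deviation from the margin constraints.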

50

51 of 110

The Second Attempt

  •  

51

52 of 110

Expressed in a Matrix Form

52

53 of 110

CVXPY 2
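
A CVXPY sketch of that relaxed problem (again with my own variable names and toy data):

import cvxpy as cp
import numpy as np

# same style of toy data as before, labels in {+1, -1}
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
y = np.hstack([np.ones(50), -np.ones(50)])
N, d = X.shape

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(N)          # one slack variable per data point

constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
prob = cp.Problem(cp.Minimize(cp.sum(xi)), constraints)   # minimize total violation
prob.solve()
print(w.value, b.value)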

53

54 of 110

Further Improvement

  • Can we do better when the data contain noise or outliers?
  • Yes: the formulation above leaves the buffer zone highly sensitive to noise and extreme points.
    • Notice that the hyperplane no longer represents the division between the classes accurately because of the outlier

  • Idea: a large margin not only separates the data effectively but also improves the model's generalization on unseen data.

54

55 of 110

Maximize Margin

  • With the scaling above, the two margin boundaries are ω^T x + b = 1 and ω^T x + b = −1, so by the parallel-line formula the width of the buffer zone is 2/‖ω‖. Maximizing the margin is therefore equivalent to minimizing ‖ω‖ (or ‖ω‖²).

55

56 of 110

Support Vector Machine
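
Putting the pieces together, the soft-margin SVM combines the large-margin objective with the slack variables. In its standard form (the trade-off weight, often written C or γ, is not specified in this text):

    minimize    ‖ω‖² + γ Σ_n ξ_n
    subject to  y_n (ω^T x_n + b) ≥ 1 − ξ_n,   ξ_n ≥ 0   for all n

A larger γ penalizes violations more heavily and fits the buffer zone tightly to the data; a smaller γ favors a wider margin at the cost of allowing more violations.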

56

57 of 110

In a More Compact Form

57

58 of 110

Scikit-learn
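
The slide's code is not preserved here; a scikit-learn sketch of the same classifier (C controls the margin/slack trade-off; the value below is arbitrary):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# linear soft-margin SVM
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # hyperplane parameters
print(len(clf.support_vectors_))    # points that determine the margin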

58

59 of 110

Lesson from SVM

  • Throughout the development of SVM, we encourage you to recognize the continuous improvement process
    • We began with a straightforward linear classifier designed to separate clean, linearly separable data.
    • Upon encountering non-separable data due to noise, errors, or outliers, we introduced slack variables to relax the margin conditions, enhancing the model’s robustness.
    • To further improve generalization on unseen data, we adopted the concept of a large margin, which maximizes the distance between the decision boundary and the closest data points, making the model more resilient to noise and improving overall performance.

  • This step-by-step refinement mirrors the natural progression of real-world problem-solving
    • Starting with a basic concept, confronting obstacles, and evolving the model to achieve better performance and broader applicability.

59

60 of 110

Logistic Regression

61 of 110

Linear Classification: Logistic Regression

  • Logistic regression is a classification algorithm
    • despite the word "regression" in its name, so don't be confused

  • Perceptron: makes use of the sign of the data (which side of the boundary each point falls on)

  • SVM: makes use of the margin (minimum distance)
    • the distance from the two closest data points to the boundary

  • We want to use the distance information of all data points
    • this leads to logistic regression

61

62 of 110

Using Distances

(Slides 62-65: figures illustrating how the signed distances of the data points to the boundary can be used)

62

66 of 110

Using all Distances

  • Instead of looking only at the closest points (as SVM does), consider the signed distances of all data points to the boundary and combine them into a single objective.

66

67 of 110

Using all Distances

  •  

67

68 of 110

Using all Distances with Outliers

  • SVM vs. Logistic Regression

68

(Figure panels: SVM vs. Logistic Regression decision boundaries in the presence of an outlier)

69 of 110

Sigmoid Function

  • The sigmoid (logistic) function σ(z) = 1 / (1 + e^(−z)) squashes any real number into the interval (0, 1), so σ(ω^T x + b) can be read as the probability of the positive class.

69

(Figure panels: Perceptron, SVM, Logistic regression)

70 of 110

Sigmoid Function

  •  

70

71 of 110

Distance

  • Recall that the signed distance of a point x from the hyperplane ω^T x + b = 0 is (ω^T x + b)/‖ω‖; the three classifiers differ in how they use this quantity.

71

72 of 110

Distance

  • Perceptron: only the sign of ω^T x + b matters; how far a point lies from the boundary is ignored.

72

73 of 110

Distance

  • SVM: only the distance of the closest points (the margin) is used to choose the boundary.

73

74 of 110

Distance

  • Logistic Regression: the signed distances of all points are used, passed through the sigmoid to give class probabilities.

74

75 of 110

Sigmoid Function

  •  

75

76 of 110

Goal: We Need to Fit 𝜔 to Data

  • Interpreting σ(ω^T x_n) as the probability that y_n belongs to the positive class, we choose ω to maximize the likelihood of the observed labels.

76

77 of 110

Goal: We Need to Fit 𝜔 to Data

  • It would be easier to work on the log likelihood.

  • The logistic regression problem can be solved as a (convex) optimization problem:

  • Again, it is an optimization problem
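
With labels y_n ∈ {0, 1} and the model P(y_n = 1 | x_n) = σ(ω^T x_n) (this label convention is my assumption; the slides may use ±1 labels instead), the problem reads:

    maximize over ω:   ℓ(ω) = Σ_n [ y_n log σ(ω^T x_n) + (1 − y_n) log(1 − σ(ω^T x_n)) ]

The log likelihood ℓ(ω) is concave in ω, which is why this is a convex optimization problem.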

77

78 of 110

Logistic Regression using GD

78

79 of 110

Gradient Descent

  • To use the gradient descent (here, ascent) method, we need the derivative of the objective with respect to ω

  • That is, we need to compute ∇_ω ℓ(ω), the gradient of the log likelihood
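
Using σ'(z) = σ(z)(1 − σ(z)), the gradient of the log likelihood above works out to (same 0/1 label convention as before):

    ∇_ω ℓ(ω) = Σ_n ( y_n − σ(ω^T x_n) ) x_n

i.e., each point pushes ω in the direction of x_n with a weight equal to the prediction error y_n − σ(ω^T x_n).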

79

80 of 110

Gradient Descent

  •  

80

81 of 110

Gradient Descent for Logistic Regression

  • Since we maximize the log likelihood, the update is a gradient ascent step: ω ← ω + η ∇_ω ℓ(ω)
  • Be careful with matrix shapes when implementing this

81

82 of 110

Python Implementation
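
The original implementation is not preserved here; a minimal NumPy sketch of gradient ascent on the log likelihood (labels assumed in {0, 1}; learning rate and iteration count are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression by gradient ascent on the log likelihood.
    X: (N, d) feature matrix, y: (N,) labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                  # predicted probabilities
        grad = Xb.T @ (y - p) / len(y)       # average gradient of the log likelihood
        w = w + lr * grad                    # ascent step (we are maximizing)
    return w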

82

83 of 110

Python Implementation

83

84 of 110

Logistic Regression using CVXPY

84

85 of 110

Probabilistic Approach (or MLE)

  • Probabilistic view: model P(y_n = 1 | x_n) = σ(ω^T x_n) and estimate ω by maximum likelihood (MLE); this yields exactly the optimization problem above.

85

86 of 110

Probabilistic Approach (or MLE)

  •  

86

87 of 110

CVXPY Implementation
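
The slide's code is not reproduced here; a CVXPY sketch of the same maximum-likelihood problem. In this sketch the labels are taken in {−1, +1}, so the negative log likelihood of each point is log(1 + exp(−y_n ω^T x_n)), which CVXPY provides as cp.logistic (this encoding is my choice; the slides may write it differently):

import cvxpy as cp
import numpy as np

# overlapping clusters so the maximum-likelihood solution stays finite; labels in {+1, -1}
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + 1, np.random.randn(50, 2) - 1])
y = np.hstack([np.ones(50), -np.ones(50)])
Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature

w = cp.Variable(Xb.shape[1])

# cp.logistic(z) = log(1 + exp(z)), so this is the negative log likelihood
loss = cp.sum(cp.logistic(-cp.multiply(y, Xb @ w)))
prob = cp.Problem(cp.Minimize(loss))
prob.solve()
print(w.value)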

87

88 of 110

In a More Compact Form

  •  

88

89 of 110

CVXPY Implementation

89

90 of 110

Logistic Regression using Scikit-Learn
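
The slide's code is not preserved here; a minimal scikit-learn sketch with toy data standing in for the slides' dataset (regularization is on by default, controlled by C):

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict_proba(X[:5]))   # per-class probabilities from the sigmoid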

90

91 of 110

Logistic Regression using Scikit-Learn

91

92 of 110

Cross-Entropy

  • The cross-entropy between a target distribution p and a predicted distribution q is H(p, q) = −Σ_i p_i log q_i. For binary labels, the cross-entropy loss −[ y log ŷ + (1 − y) log(1 − ŷ) ] is exactly the negative log likelihood minimized by logistic regression.

92

93 of 110

Multiclass Classification

93

94 of 110

Multiclass Classification: One vs. One

  • Generalization to more than 2 classes is straightforward

  • one vs. one
    • For three classes (C0, C1, C2)
      • Classifier 1: Distinguishes C0 vs. C1
      • Classifier 2: Distinguishes C0 vs. C2
      • Classifier 3: Distinguishes C1 vs. C2

94

95 of 110

Multiclass Classification: One vs. All (One vs. Rest)

  • Generalization to more than 2 classes is straightforward

  • one vs. all (one vs. rest)
    • For three classes (C0, C1, C2)
      • Classifier 1: Distinguishes C0 from C1 and C2.
      • Classifier 2: Distinguishes C1 from C0 and C2.
      • Classifier 3: Distinguishes C2 from C0 and C1.

95

96 of 110

Multiclass Classification: Softmax

  • The softmax function generalizes the sigmoid to K classes: softmax(z)_k = e^(z_k) / Σ_j e^(z_j), producing one probability per class, and the probabilities sum to 1.
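
For example, with z = (2, 1, 0) (values rounded):

    softmax(z) = (e², e¹, e⁰) / (e² + e¹ + e⁰) ≈ (7.39, 2.72, 1.00) / 11.11 ≈ (0.665, 0.245, 0.090)

The largest score gets the largest probability, and the three probabilities sum to 1.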

96

97 of 110

One-Hot Encoding

  • Each class label is encoded as a vector with a 1 in the position of that class and 0 elsewhere; for three classes, C0 → [1, 0, 0], C1 → [0, 1, 0], C2 → [0, 0, 1].

97

98 of 110

Non-linear Classification

98

99 of 110

Classifying Non-linear Separable Data

(Slides 99-101: examples of data that cannot be separated by a single linear boundary)

99

102 of 110

Nonlinear Classification

102

103 of 110

Kernel

  • Often we want to capture nonlinear patterns in the data
    • nonlinear regression: the input-output relationship may not be linear
    • nonlinear classification: the classes may not be separable by a linear boundary

  • Linear models (e.g., linear regression, linear SVM) are often not rich enough on their own
    • remedy: map the data to higher dimensions where it exhibits linear patterns
    • then apply the linear model in the new feature space
    • mapping = changing the feature representation

  • Kernels: make linear models work in nonlinear settings

103

104 of 110

Selecting the Appropriate Kernel

  • Applying the right kernel enables linear classification to handle nonlinearly distributed data
  • But, we have not yet addressed the process of selecting this optimal kernel

  • Throughout our previous discussions, we assumed that the kernel function was predefined.
  • Identifying the suitable kernel for a given dataset remains a non-trivial challenge, but we will not explore this topic further.

  • The primary reason for this decision is that in deep learning, the model architecture inherently learns effective feature transformations directly from the data.
    • Unlike traditional kernel methods, modern deep learning frameworks are capable of automatically discovering complex and data-driven feature mappings, eliminating the need for manually selecting or designing an optimal kernel.

104

105 of 110

Classifying Non-linear Separable Data

(Slides 105-107: figures illustrating classification of non-linearly separable data)

105

108 of 110

Non-linear Classification

  • Same idea as non-linear regression: non-linear features
    • Explicit or implicit Kernel

108

109 of 110

Explicit Kernel
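
No code survives on this slide; a sketch of an explicit (polynomial) feature map that makes circularly separated 2-D data linearly separable in the lifted space (the particular features and toy dataset are my choices):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# toy data: one class inside a ring formed by the other (not linearly separable)
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

def lift(X):
    # explicit feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# a plain linear classifier applied in the lifted feature space
clf = LogisticRegression().fit(lift(X), y)
print(clf.score(lift(X), y))   # should be close to 1: the lifted data is (almost) linearly separable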

109

110 of 110

Non-linear Classification

110