1 of 55

(2024-25 EVEN)

UTA027

Artificial Intelligence

Machine Learning

(Classification)

Thapar Institute of Engineering and Technology

(Deemed to be University)

2 of 55

Machine Learning

Classification

Raghav B. Venkataramaiyer

Thapar Institute of Engineering and Technology

(Deemed to be University)

3 of 55

Ref

Artificial Intelligence: Structures and Strategies for Complex Problem Solving

By: Luger & Stubblefield

[Download URL]

PRML (Bishop) [Download URL]

ISL (Hastie/Tibshirani) [Website]

ITILA (MacKay) [Website]

Introduction to Prob [Google Scholar] [YouTube Playlist] [YouTube Playlist]

  • Classification
    • Probability
    • Simple Linear Regression (Closed Form Solution)
    • Multivariate Case (Closed Form Solution)
    • Using Solvers
    • Using basis functions.

4 of 55

Notations

Concepts

Set Notation: {a, b, c, …} (e.g. a set of vertices); {a, b} ≡ {b, a}; {a ∈ ℕ : a is even}

Vectors:
Row vectors: (w_1, …, w_M) or [w_1, …, w_M]
Column vectors: w = [w_1, …, w_M]^T
Closed/open intervals: [a, b], (a, b), [a, b)

Matrices: M (uppercase bold letters)
M×M identity (unit) matrix: I_M, where I_ij = 1 if i = j and I_ij = 0 if i ≠ j

Probability:
Expectation: 𝔼[X], Variance: Var(X)
Conditionals: 𝔼_x[f(x)|z], Var_x(f(x)|z)

Set Partition:

Given a set S ≡ {a, b, c, …}

Partitions of S:
S_1, S_2, S_3, … ⊆ S, such that ⋃_i S_i = S and ∀ i ≠ j, S_i ∩ S_j = ∅
(pairwise disjoint subsets that span the space)

PS:

  1. ⋃_i S_i = S enforces the spanning property;
  2. ∀ i ≠ j, S_i ∩ S_j = ∅ defines the pairwise disjoint condition.
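For instance (an illustrative example, not from the slides), the evens and odds partition S = {1, …, 6}:

$$S_1 = \{2, 4, 6\},\quad S_2 = \{1, 3, 5\}: \qquad S_1 \cup S_2 = S, \quad S_1 \cap S_2 = \varnothing$$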

5 of 55

Classification Setup

6 of 55

Coin Sorting

Image Courtesy:

Interesting Engineering

7 of 55

Stamp Sorting

Image Courtesy:

Etsy

8 of 55

Mail Sorting

Image Courtesy:

Gadgets360

9 of 55

Classification

10 of 55

Binary Classification

11 of 55

Binary Classification

Email

Ham / Spam

Camera Feed

Blank / Guest-on-door

12 of 55

Multi-class Classification

Dialogue

Happy, Angry, Fear, Sad, …

Image

Cat, Dog, Car, Cycle, Flower, …

Performance Metrics

Grade A, A-, B, B-, …

13 of 55

Classification Setup

Ground Truth

Labels

14 of 55

Classification Setup

Linear Model

15 of 55

Classification Setup

Threshold

Criterion

16 of 55

The Setup: Data

The set of observations is called data.

Generally a set of input/output pairs:
x ≡ [x_1, …, x_D]^T, y ∈ {0, 1}

x is called the features / feature vector.
y is called the label.

x|y=1 are often called positive examples;
x|y=0 are often called negative examples.

17 of 55

The Setup: Data

The set of observations is called data.

Generally a set of input/output pairs:
x ≡ [x_1, …, x_D]^T, y ∈ {0, 1}

Example: Email Sorting

Let a dictionary be
ⅆ ≡ {“buy”, “free”, … (D words)}

Let x ≡ [x_1, …, x_D]^T represent the corresponding frequencies of occurrence of those words in the email.

y = 1 indicates the email is spam;
y = 0 indicates otherwise.

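A minimal sketch of how such features might be computed, assuming a toy dictionary and a normalized frequency (the dictionary, email text, and numbers are illustrative, not from the slides):

```python
# Hypothetical dictionary of D = 4 words (illustrative only).
dictionary = ["buy", "free", "offer", "meeting"]

def features(email: str) -> list[float]:
    """x_j = frequency of occurrence of the j-th dictionary word in the email."""
    words = email.lower().split()
    return [words.count(w) / len(words) for w in dictionary]

x = features("Buy now and get one free free free")
print(x)  # [0.125, 0.375, 0.0, 0.0]
```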

18 of 55

The Setup: Data

x_i : the i-th sample in the dataset (say, the i-th email)

x_i ≡ [x_1^(i), …, x_D^(i)]^T
x_j^(i) : the j-th component (or feature) of the i-th sample in the dataset

19 of 55

The Setup: Data

y_i : label of the i-th sample in the dataset.

It may also be interpreted probabilistically as

y_i = P(i-th sample is positive)

Example:
y_i = P(i-th email is spam)

20 of 55

Linear Regression

21 of 55

Given: Evidence suggests that for features x ∈ D (for some domain D), the target y ∈ {0, 1}.

[Figure: training points labelled y = 0 and y = 1, fit with a regression line f(x) ∈ ℝ and thresholded at y = 0.5.]

22 of 55

Linear Regression

Example: Email Sorting

Let a dictionary be
ⅆ ≡ {“buy”, “free”, … (D words)}

Let x ≡ [x_1, …, x_D]^T represent the corresponding frequencies of occurrence of those words in the email.

y = 1 indicates the email is spam;
y = 0 indicates otherwise.

Set up w to weight the relevant word frequencies.

23 of 55

Linear Regression

Implementation (Notebook)
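A minimal sketch of what such a notebook might contain, assuming synthetic data and NumPy (all names and numbers are illustrative): fit w by closed-form least squares, then threshold the linear output at 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # 100 samples, D = 3 features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # labels in {0, 1}

Xb = np.hstack([X, np.ones((100, 1))])                   # append a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)               # closed-form least squares

y_hat = (Xb @ w >= 0.5).astype(float)                    # threshold at y = 0.5
print("training accuracy:", (y_hat == y).mean())
```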

24 of 55

Logistic Regression

25 of 55

σ(z) = 1/(1 + e^(−z))

Given: Evidence suggests that for features x ∈ D (for some domain D), the target y ∈ {0, 1}.

[Figure: the same labelled points (y = 0, y = 1), now fit with the logistic curve σ(z) and thresholded at y = 0.5.]

26 of 55

The Setup: Model

y = P(x is a positive sample)

27 of 55

The Model

Logistic function
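In symbols (a standard formulation, consistent with the weights w set up earlier), the model passes a linear combination of the features through the logistic function:

$$\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x}), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

so that ŷ ∈ (0, 1) and may be read as P(x is a positive sample).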

28 of 55

The Training Objective

Cross Entropy

Targets

Predictions

Cross Entropy of Targets from Predictions
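In its standard form, the cross entropy of targets y_i from predictions ŷ_i over N samples is

$$E(\mathbf{w}) = -\sum_{i=1}^{N} \Big[ y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \Big].$$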

29 of 55

The Training Objective

Assuming y follows the Binomial Distribution
(a series of coin flips).

Recall that for n trials with m successes and success rate p,

P(m, n; p) = p^m (1 − p)^(n−m)
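For a single sample this reduces to a Bernoulli trial (n = 1), and the likelihood of the whole dataset factorizes; taking the negative log recovers the cross entropy above (a standard derivation):

$$P(\mathbf{y} \mid \mathbf{w}) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i},$$

$$-\ln P(\mathbf{y} \mid \mathbf{w}) = -\sum_{i=1}^{N} \big[ y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \big] = E(\mathbf{w}).$$

So maximizing the likelihood is the same as minimizing the cross entropy.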

30 of 55

The Training Objective

31 of 55

The Training Objective

32 of 55

The Update Step

E over the whole population.

(Gradient Descent)
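In the standard form, with learning rate η (a symbol assumed here, not from the slides), each step moves w against the gradient of E computed over all N samples:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} E(\mathbf{w}).$$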

33 of 55

The Update Step

E over a population sample.

(Stochastic Gradient Descent)
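A minimal SGD sketch for this model, assuming synthetic data and illustrative hyperparameters (η = 0.1, 50 epochs; none of these values are from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, eta = np.zeros(3), 0.1
for _ in range(50):                        # epochs
    for i in rng.permutation(len(X)):      # one randomly drawn sample at a time
        y_hat = sigmoid(X[i] @ w)
        w -= eta * (y_hat - y[i]) * X[i]   # per-sample cross-entropy gradient
print("training accuracy:", ((sigmoid(X @ w) >= 0.5) == y).mean())
```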

34 of 55

The Gradients
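Reconstructed in standard form: with ŷ_i = σ(w^T x_i) and using σ′(z) = σ(z)(1 − σ(z)), the chain rule collapses the gradient of the cross entropy to

$$\nabla_{\mathbf{w}} E = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, \mathbf{x}_i,$$

which is exactly the per-sample term used in the update steps above.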

35 of 55

The Gradients

36 of 55

The Gradients

37 of 55

Appendix

38 of 55

Probability

39 of 55

What is the probability that I’ll choose a tall person in this hall?

Remember, I haven’t specified ‘today’;�or, when exactly!

40 of 55

Probability

  • Outcome: A person in the hall.
  • Random Variable: Height of the person (a proxy for outcome).
  • Event: A person is tall.
  • Probability: The chance of the event that the chosen person is tall.

41 of 55

Random Variable

X

The outcome �of a random process

42 of 55

In this hall, choose one person at random.

Random Process

43 of 55

Random Variable

X,Y

The outcomes of a Random process.

In this hall, choose one person at random.

Let their height be X.

Let their weight be Y.

Random process

44 of 55

Random Variable

BMI, computed as a function of random variables, Z = f(X, Y) = kY/X^2, is also a random variable.

45 of 55

What is probability P?

Ω : Set of all possible outcomes

(persons in a hall)

ω ∈ Ω : a single outcome

46 of 55

What is probability P?

Ω : Set of all possible outcomes, ω ∈ Ω

⅀ : Set of events, σ ∈ ⅀

Example events (by height):

σ_1 : Very short
σ_2 : Short
σ_3 : Mid height
σ_4 : Tall
σ_5 : Very tall

47 of 55

What is probability P?

P : ⅀ → ℝ
The size of the event, normalized such that P(Ω) = 1.

Choosing a person uniformly at random, for ω ∈ Ω and σ ∈ ⅀:

P(σ) = |σ| / |Ω|
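With illustrative numbers (not from the slides): if the hall holds |Ω| = 100 persons, of whom 15 fall in the event σ_4 (Tall), then

$$P(\sigma_4) = \frac{|\sigma_4|}{|\Omega|} = \frac{15}{100} = 0.15.$$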

48 of 55

Probability

  • Outcome (ω∈Ω): A person in the hall.
  • Random Variable (X): Height of the person (a proxy for the outcome).
  • Event (σ∈⅀): The person is tall.
  • Probability (P[X∈σ]): The chance of the event that the chosen person is tall.

49 of 55

Probability

P(X is short) = 0.10

Probability is related to an event.

50 of 55

Probability

P(5’ < X < 5’1”) = 0.20

Probability is related to an event.

51 of 55

Probability

Probability of all possible outcomes is 1.

P(X is short OR X is mid height OR X is tall) = 1

52 of 55

Probability

Probability of all possible outcomes is 1.

P(0 < X ⩽ 5’ OR 5’ < X ⩽ 5’1” OR 5’1” < X ⩽ tallest) = 1

53 of 55

Probability Density

P(X=5’) = 1.5

Probability density at (and near) X=5’.

54 of 55

Probability Density

P(X=5’) = 1.5

Probability density at (and near) X = 5’.
(‘near’ implies continuity.)

Recall that X = 5’ is barely an outcome: as an event it is a zero-size set!

55 of 55

Cumulative Probability

P(X⩽5’) = 0.5

So we define an event �X⩽a for all possible values of a.

This notational ambiguity is inherent in the literature, and hence we need to disambiguate contextually.
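The two notions connect in the standard way: writing F(a) = P(X ⩽ a) for the cumulative probability, the density is its derivative,

$$p(x) = \frac{\mathrm{d}F}{\mathrm{d}x}, \qquad P(a < X \leqslant b) = F(b) - F(a) = \int_a^b p(x)\, \mathrm{d}x,$$

which is why a density value such as 1.5 can exceed 1: only its integral over an interval is a probability.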