(2024-25 EVEN)
UTA027 Artificial Intelligence
Machine Learning: Classification
Raghav B. Venkataramaiyer
Thapar Institute of Engineering and Technology
(Deemed to be University)
References
Artificial Intelligence: Structures and Strategies for Complex Problem Solving, by Luger & Stubblefield
Pattern Recognition and Machine Learning (PRML), by Bishop [Download URL]
An Introduction to Statistical Learning (ISL), by Hastie/Tibshirani [Website]
Information Theory, Inference, and Learning Algorithms (ITILA), by MacKay [Website]
Introduction to Probability [Google Scholar] [YouTube Playlist] [YouTube Playlist]
Notations
Concepts
Set notation: {a, b, c, …} (e.g. a set of vertices); {a, b} ≡ {b, a}; {a ∈ ℕ : a is even}
Vectors: row vectors (w_1, …, w_M) or [w_1, …, w_M]; column vectors w = [w_1, …, w_M]^T; closed/open intervals [a, b], (a, b), [a, b)
Matrices: M (uppercase bold letters); the M×M identity (unit) matrix I_M, such that I_ij = 1 if i = j and I_ij = 0 if i ≠ j
Probability: expectation 𝔼[X]; variance Var(X); conditionals 𝔼_x[f(x) | z], Var_x(f(x) | z)
Set Partition:
Given a set S ≡ {a, b, c, …},
the subsets S_1, S_2, S_3, … ⊆ S form a partition of S if ⋃_i S_i = S and ∀ i ≠ j, S_i ∩ S_j = ∅
(pairwise disjoint subsets that span the whole set).
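Both partition conditions are easy to verify programmatically. A minimal sketch in Python, with an assumed example set S and candidate partition:

    # Check the two partition conditions: the parts must span S and be
    # pairwise disjoint. S and `parts` are assumed example values.
    S = {"a", "b", "c", "d"}
    parts = [{"a", "b"}, {"c"}, {"d"}]

    spans = set().union(*parts) == S
    disjoint = all(not (p & q) for i, p in enumerate(parts) for q in parts[i + 1:])
    print(spans and disjoint)  # True: `parts` is a partition of S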
Classification Setup
Coin sorting (image courtesy: Interesting Engineering)
Stamp sorting (image courtesy: Etsy)
Mail sorting (image courtesy: Gadgets360)
Classification
Binary classification:
Email → Ham / Spam
Camera feed → Blank / Guest at door
Multi-class classification:
Dialogue → Happy, Angry, Fear, Sad, …
Image → Cat, Dog, Car, Cycle, Flower, …
Performance metrics → Grade A, A-, B, B-, …
Classification Setup
[Figures: the ground-truth labels; a linear model fit to them; the threshold criterion for deciding the class]
The Setup: Data
The set of observations is called data:
generally a set of input/output pairs, x ≡ [x_1, …, x_D]^T, y ∈ {0, 1}.
x is called the feature vector (its components are features); y is called the label.
Samples x | y=1 are often called positive examples; samples x | y=0, negative examples.
The Setup: Data
Example: Email Sorting
Let a dictionary be 𝒹 ≡ {“buy”, “free”, … (D words)}.
Let the features x ≡ [x_1, …, x_D]^T represent the corresponding frequencies of occurrence of these words in an email.
The label y = 1 indicates the email is spam; y = 0 indicates otherwise.
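A minimal sketch of this feature construction in Python; the four-word dictionary below is an assumed stand-in for the D-word dictionary 𝒹:

    # Count how often each dictionary word occurs in an email.
    dictionary = ["buy", "free", "offer", "meeting"]  # assumed words, D = 4

    def featurize(email):
        """Return x = [x_1, ..., x_D]: per-word frequencies."""
        words = email.lower().split()
        return [words.count(w) for w in dictionary]

    print(featurize("Buy now and get a FREE free offer"))  # [1, 2, 1, 0]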
The Setup: Data
x_i: the i-th sample in the dataset (say, the i-th email),
x_i ≡ [x_1^(i), …, x_D^(i)]^T,
where x_j^(i) is the j-th component (or feature) of the i-th sample.
The Setup: Data
y_i: the label of the i-th sample in the dataset.
It may also be interpreted probabilistically as
y_i = P(i-th sample is positive).
Example: y_i = P(i-th email is spam).
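In code, the dataset is conveniently held as a matrix whose i-th row is x_i; a sketch with assumed toy values (note that Python indexes from 0 while the slides index from 1):

    import numpy as np

    # Row i of X is sample x_i; X[i, j] is feature x_j^(i); y[i] is its label.
    X = np.array([[2, 0, 1],   # word frequencies of the 1st email (assumed)
                  [0, 3, 0]])  # word frequencies of the 2nd email
    y = np.array([1, 0])       # 1st email is spam, 2nd is not
    print(X[0], y[0])          # the 1st sample and its label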
Linear Regression
Given: evidence suggests that for features x ∈ D (for some domain D), the target is y ∈ {0, 1}.
Fit a linear function f(x) ∈ ℝ to the targets, and apply the threshold at y = 0.5: predict y = 1 where f(x) > 0.5, and y = 0 otherwise.
[Figure: a linear fit through binary targets, with the threshold line at y = 0.5]
Linear Regression
Example: Email Sorting
Let a dictionary be 𝒹 ≡ {“buy”, “free”, … (D words)}.
Let x ≡ [x_1, …, x_D]^T represent the corresponding frequencies of occurrence of these words in an email.
y = 1 indicates the email is spam; y = 0 indicates otherwise.
Set up the weights w so as to weigh the relevant word frequencies: f(x) = w^T x.
Linear Regression
Implementation (Notebook)
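The course notebook is not reproduced here; the following is a minimal independent sketch of the same idea, fitting f(x) = w^T x by least squares on 0/1 targets and thresholding at 0.5, using assumed synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),    # negative class (assumed)
                   rng.normal(3, 1, (50, 2))])   # positive class (assumed)
    y = np.r_[np.zeros(50), np.ones(50)]

    Xb = np.hstack([X, np.ones((100, 1))])       # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)   # least-squares fit of w
    y_hat = (Xb @ w > 0.5).astype(int)           # threshold criterion at 0.5
    print("train accuracy:", (y_hat == y).mean())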
Logistic Regression
σ(z) = 1 / (1 + e^(−z))
Given: evidence suggests that for features x ∈ D (for some domain D), the target is y ∈ {0, 1}.
Fit the logistic curve to the targets and apply the threshold at y = 0.5.
[Figure: the logistic curve fit through binary targets, with the threshold line at y = 0.5]
The Setup: Model
ŷ = P(x is a positive sample)
The model: ŷ = σ(w^T x), where σ is the logistic function.
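A minimal sketch of this model in Python; the weights and features below are assumed illustrative values:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # σ(z) = 1 / (1 + e^(−z))

    w = np.array([0.8, -0.4, 1.2])  # assumed weights
    x = np.array([2.0, 1.0, 0.0])   # assumed feature vector
    print(sigmoid(w @ x))           # ŷ = P(x is positive) ≈ 0.77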
The Training Objective
Cross entropy of the targets from the predictions: for targets y_i and predictions ŷ_i,
E = −Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ].
The Training Objective
Assume y follows the binomial distribution (a series of coin flips).
Recall that for n trials with m successes and success rate p,
P(m, n; p) = p^m (1 − p)^(n−m).
Treating each sample as one flip with its own success rate ŷ_i, the likelihood of the observed labels is
L = Π_i ŷ_i^(y_i) (1 − ŷ_i)^(1−y_i),
and minimising the negative log-likelihood E = −log L yields exactly the cross entropy above.
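A direct transcription of the objective into Python, with assumed targets and predictions:

    import numpy as np

    def cross_entropy(y, p):
        # E = −Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ]
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 0, 1, 1])          # assumed targets
    p = np.array([0.9, 0.2, 0.8, 0.6])  # assumed predictions ŷ
    print(cross_entropy(y, p))          # ≈ 1.06; zero only for a perfect match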
The Update Step
Gradient descent: w ← w − η ∇_w E, with E computed over the whole population.
Stochastic gradient descent (SGD): the same update, with E computed over a population sample.
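The two rules differ only in how much data feeds the gradient. A sketch, where grad is assumed to compute ∇_w E (derived on the next slide) and η is the step size:

    import numpy as np

    def gd_step(w, X, y, grad, eta=0.1):
        return w - eta * grad(w, X, y)            # E over the whole population

    def sgd_step(w, X, y, grad, eta=0.1, batch=8):
        idx = np.random.choice(len(X), batch, replace=False)
        return w - eta * grad(w, X[idx], y[idx])  # E over a population sample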
The Gradients
For ŷ_i = σ(w^T x_i), the logistic function satisfies σ′(z) = σ(z)(1 − σ(z)),
so the chain rule collapses each term of ∂E/∂w to (ŷ_i − y_i) x_i, giving
∇_w E = Σ_i (ŷ_i − y_i) x_i.
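Putting the pieces together, a minimal end-to-end sketch: logistic regression trained by gradient descent using the gradient above, ∇_w E = X^T(ŷ − y), on assumed synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),    # negative class (assumed)
                   rng.normal(3, 1, (50, 2))])   # positive class (assumed)
    X = np.hstack([X, np.ones((100, 1))])        # bias column
    y = np.r_[np.zeros(50), np.ones(50)]

    w = np.zeros(3)
    for _ in range(500):
        y_hat = 1.0 / (1.0 + np.exp(-X @ w))     # ŷ_i = σ(w^T x_i)
        w -= 0.1 * (X.T @ (y_hat - y)) / len(y)  # the derived gradient, averaged
    print("train accuracy:", ((y_hat > 0.5) == y).mean())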
Appendix
Probability
What is the probability that I’ll choose a tall person in this hall?
Remember, I haven’t specified ‘today’, or when exactly!
Random Variable
X: the outcome of a random process.
Random process: in this hall, choose one person at random.
Random Variables
X, Y: the outcomes of a random process.
In this hall, choose one person at random; let their height be X and their weight be Y.
BMI, computed as a function of random variables, Z = f(X, Y) = kY/X², is also a random variable.
What is probability P?
Ω: the set of all possible outcomes (the persons in the hall); an individual outcome is ω ∈ Ω.
What is probability P?
Ω: the set of all possible outcomes, ω ∈ Ω.
⅀: the set of events, σ ∈ ⅀, e.g.:
σ₁: very short
σ₂: short
σ₃: mid height
σ₄: tall
σ₅: very tall
⋮
What is probability P?
P: ⅀ → ℝ, the (normalised) size of an event,
P(σ) = |σ| / |Ω|,
such that P(Ω) = 1.
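For a finite hall this definition is directly computable; a sketch with assumed heights and an assumed cut-off for ‘tall’:

    # P(σ) = |σ| / |Ω| on a toy sample space.
    omega = [4.9, 5.0, 5.1, 5.2, 5.4, 5.8, 6.1, 6.3]   # heights in feet (assumed)
    sigma = [h for h in omega if h >= 6.0]             # event "tall" (assumed ⩾ 6′)
    print(len(sigma) / len(omega))                     # P(tall) = 2/8 = 0.25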
Probability
Probability is related to an event:
P(X is short) = 0.10,
P(5’ < X < 5’1”) = 0.20.
Probability
The probability of all possible outcomes together is 1:
P(X is short OR X is mid height OR X is tall) = 1,
P(0 < X ⩽ 5’ OR 5’ < X ⩽ 5’1” OR 5’1” < X ⩽ tallest) = 1.
Probability Density
P(X = 5’) = 1.5: the probability density at (and near) X = 5’; ‘near’ implies continuity.
Recall that X = 5’ is barely an outcome and a zero-size set! A density is not itself a probability, which is why it may exceed 1.
Cumulative Probability
P(X⩽5’) = 0.5
So we define an event X ⩽ a for all possible values of a.
This notational ambiguity is inherent in the literature, and hence we need to disambiguate contextually.
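The density/probability distinction is easy to see numerically. A sketch using scipy, with heights modelled as a normal distribution whose parameters are assumed for illustration:

    from scipy.stats import norm

    h = norm(loc=5.0, scale=0.25)            # heights in feet: N(5.0, 0.25²) (assumed)
    print(h.pdf(5.0))                        # density at X = 5’ ≈ 1.6 — may exceed 1
    print(h.cdf(5.0))                        # P(X ⩽ 5’) = 0.5
    print(h.cdf(5.0 + 1/12) - h.cdf(5.0))    # P(5’ < X ⩽ 5’1”), a true probability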