CS60050: Machine Learning
Sourangshu Bhattacharya
CSE, IIT Kharagpur
NAÏVE BAYES
Generative vs. Discriminative Classifiers
Discriminative classifiers (e.g. Logistic Regression)
Generative classifiers (e.g. Naïve Bayes)
arg max_Y P(Y|X) = arg max_Y P(X|Y) P(Y)
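This identity is just Bayes' rule with the evidence term dropped: P(X) does not depend on Y, so it does not change the arg max.

\arg\max_Y P(Y \mid X) = \arg\max_Y \frac{P(X \mid Y)\, P(Y)}{P(X)} = \arg\max_Y P(X \mid Y)\, P(Y)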
A text classification task: Email spam filtering
From: ‘‘’’ <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
How would you write a program that would automatically detect
and delete this type of message?
Formal definition of TC: Training
Given:
a document space X
a fixed set of classes C = {c1, c2, . . . , cJ}
a training set D of labeled documents ⟨d, c⟩ ∈ X × C
Using a learning method or learning algorithm, we then wish to
learn a classifier ϒ that maps documents to classes:
ϒ : X → C
Formal definition of TC: Application/Testing
Given: a description d ∈ X of a document
Determine: ϒ(d) ∈ C, that is, the class that is most appropriate for d
Examples of how search engines use classification
Derivation of Naive Bayes rule
We want to find the class that is most likely given the document:
Apply Bayes rule
Drop denominator since P(d) is the same for all classes:
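In symbols, writing c_map for the maximum a posteriori class:

c_{\mathrm{map}} = \arg\max_{c \in \mathbb{C}} P(c \mid d)
= \arg\max_{c \in \mathbb{C}} \frac{P(d \mid c)\, P(c)}{P(d)}
= \arg\max_{c \in \mathbb{C}} P(d \mid c)\, P(c)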
Too many parameters / sparseness
Without further assumptions, estimating P(d|c) = P(⟨t1, . . . , tnd⟩|c) requires a separate parameter for every possible sequence of terms in each class, far too many to estimate reliably from any realistic amount of training data.
Naive Bayes conditional independence assumption
To reduce the number of parameters to a manageable size, we
make the Naive Bayes conditional independence assumption:
We assume that the probability of observing the conjunction of
attributes is equal to the product of the individual probabilities
P(Xk = tk |c).
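With this assumption, the class-conditional document probability factorizes over token positions:

P(d \mid c) = P(\langle t_1, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(X_k = t_k \mid c)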
The Naive Bayes classifier
The Naive Bayes classifier is a probabilistic classifier. We compute the probability of a document d being in a class c as follows:
P(c|d) ∝ P(c) ∏_{1≤k≤nd} P(tk|c)
nd is the length of the document (number of tokens).
P(tk|c) is the conditional probability of term tk occurring in a document of class c.
P(tk|c) is interpreted as a measure of how much evidence tk contributes that c is the correct class.
P(c) is the prior probability of c.
If a document's terms do not provide clear evidence for one class vs. another, we choose the c with highest P(c).
Maximum a posteriori class
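Our goal is the class with the highest posterior, estimated from the training set (hats denote estimates):

c_{\mathrm{map}} = \arg\max_{c \in \mathbb{C}} \hat{P}(c \mid d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)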
Taking the log
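Multiplying many probabilities, each much smaller than 1, can cause floating point underflow; since log is monotonic, the class with the highest log score is the same class:

c_{\mathrm{map}} = \arg\max_{c \in \mathbb{C}} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \right]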
Naive Bayes classifier
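A minimal Python sketch of the decision rule just described, scoring each class by its log prior plus the sum of log conditional probabilities of the document's terms; the dictionaries prior and condprob are hypothetical containers for the estimated parameters, not part of the slides.

    import math

    def apply_naive_bayes(classes, prior, condprob, doc_tokens):
        """Return the class with the highest log-space Naive Bayes score.

        prior[c]       -- estimated P(c)
        condprob[t][c] -- estimated P(t | c), add-one smoothed
        doc_tokens     -- list of tokens of the test document
        """
        best_class, best_score = None, float("-inf")
        for c in classes:
            score = math.log(prior[c])
            for t in doc_tokens:
                if t in condprob:  # terms unseen in training are ignored
                    score += math.log(condprob[t][c])
            if score > best_score:
                best_class, best_score = c, score
        return best_class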
Parameter estimation take 1: Maximum likelihood
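The maximum likelihood estimates are relative frequencies from the training set: N_c of the N training documents belong to class c, and T_ct counts the tokens of term t in the concatenation of all class-c training documents.

\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}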
The problem with maximum likelihood estimates: Zeros
P(China|d) ∝ P(China) · P(BEIJING|China) · P(AND|China) · P(TAIPEI|China) · P(JOIN|China) · P(WTO|China)
The problem with maximum likelihood estimates: Zeros (cont)
If a term, say WTO, never occurs in documents of class China in the training set, its maximum likelihood estimate P(WTO|China) is 0. The whole product above is then 0, and the classifier can never assign a document containing that term to China, no matter how strong the remaining evidence is.
To avoid zeros: Add-one smoothing
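Add-one (Laplace) smoothing adds one to every count before normalizing, where B = |V| is the size of the vocabulary:

\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}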
To avoid zeros: Add-one smoothing
Exercise
Example: Parameter estimates
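Assuming the training term counts of the standard worked example behind these numbers (CHINESE occurs 5 times in class c and once in c̄; JAPAN and TOKYO occur 0 times in c and once each in c̄), the add-one estimates work out to:

\hat{P}(\text{Chinese} \mid c) = \frac{5+1}{8+6} = \frac{6}{14} = \frac{3}{7}, \qquad
\hat{P}(\text{Japan} \mid c) = \hat{P}(\text{Tokyo} \mid c) = \frac{0+1}{8+6} = \frac{1}{14}

\hat{P}(\text{Chinese} \mid \bar{c}) = \frac{1+1}{3+6} = \frac{2}{9}, \qquad
\hat{P}(\text{Japan} \mid \bar{c}) = \hat{P}(\text{Tokyo} \mid \bar{c}) = \frac{1+1}{3+6} = \frac{2}{9}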
The denominators are (8 + 6) and (3 + 6) because the lengths of
textc and textc̄ are 8 and 3, respectively, and because the constant
B is 6 as the vocabulary consists of six terms.
Example: Classification
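Using the estimates above, and assuming the prior counts of the same worked example (three of the four training documents belong to c, so P̂(c) = 3/4 and P̂(c̄) = 1/4), the scores for the test document d5 are:

\hat{P}(c \mid d_5) \propto \frac{3}{4} \cdot \left(\frac{3}{7}\right)^3 \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.0003

\hat{P}(\bar{c} \mid d_5) \propto \frac{1}{4} \cdot \left(\frac{2}{9}\right)^3 \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.0001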
Thus, the classifier assigns the test document to c = China. The
reason for this classification decision is that the three occurrences
of the positive indicator CHINESE in d5 outweigh the occurrences
of the two negative indicators JAPAN and TOKYO.
Class Conditional Probabilities
Generative model
On the naïve Bayesian classifier
BAYESIAN LINEAR REGRESSION
Maximum Likelihood and Least Squares
Assume the target t is given by a deterministic function of the input plus additive Gaussian noise, t = y(x, w) + ε, where ε is zero-mean Gaussian noise with precision (inverse variance) β, so that p(t|x, w, β) = N(t | y(x, w), β⁻¹).
Maximum Likelihood and Least Squares
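A sketch of the standard derivation, with β the noise precision and Φ the design matrix (Φ_{nj} = φ_j(x_n)); these symbols are the usual conventions for this model:

p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(x_n), \beta^{-1}\right)

\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(\mathbf{w}),
\qquad E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \right)^2

Maximizing with respect to w gives the least-squares (normal equations) solution, which is why maximum likelihood under Gaussian noise coincides with least squares:

\mathbf{w}_{\mathrm{ML}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{t}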
Bayesian Linear Regression (1)
Bayesian Linear Regression (2)
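A sketch of the standard conjugate-Gaussian treatment: place a Gaussian prior on the weights and combine it with the Gaussian likelihood, which yields a Gaussian posterior.

p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0), \qquad
p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)

\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \Phi^T \mathbf{t} \right), \qquad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \Phi^T \Phi

For the common zero-mean isotropic prior p(w | α) = N(w | 0, α⁻¹ I), these reduce to:

\mathbf{m}_N = \beta\, \mathbf{S}_N \Phi^T \mathbf{t}, \qquad
\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \Phi^T \Phi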
Bayesian Linear Regression (3)
[Figure: prior over w and sample functions in data space; 0 data points observed]
Bayesian Linear Regression (4)
[Figure: likelihood of the new point, posterior over w, and sample functions in data space; 1 data point observed]
Bayesian Linear Regression (5)
[Figure: likelihood, posterior, and sample functions in data space; 2 data points observed]
Bayesian Linear Regression (6)
[Figure: likelihood, posterior, and sample functions in data space; 20 data points observed]
Predictive Distribution (1)
Predictive Distribution (2)
Predictive Distribution (3)
Predictive Distribution (4)
Predictive Distribution (5)
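The predictive distribution integrates out w against the posterior; in its standard form (same symbols as above):

p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}
= \mathcal{N}\!\left( t \mid \mathbf{m}_N^T \boldsymbol{\phi}(x),\; \sigma_N^2(x) \right)

\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T \mathbf{S}_N \boldsymbol{\phi}(x)

The first term is the noise on the data; the second reflects the remaining uncertainty about w and shrinks as more data are observed.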
GAUSSIAN MIXTURE MODELS
Mixture of Gaussians
Mixture of Gaussians
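A Gaussian mixture models the density as a convex combination of K Gaussian components with mixing coefficients π_k:

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad
0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1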
Generative Procedure
Generative Procedure
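A minimal NumPy sketch of the two-step generative procedure (draw a component index from the mixing weights, then draw the observation from that component's Gaussian); the function name and interface are illustrative.

    import numpy as np

    def sample_gmm(pis, mus, Sigmas, n_samples, rng=None):
        """Sample from a Gaussian mixture: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
        rng = np.random.default_rng() if rng is None else rng
        K = len(pis)
        zs = rng.choice(K, size=n_samples, p=pis)   # latent component labels
        xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
        return xs, zs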
Posterior distribution
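Introducing a one-of-K latent variable z for the component, the posterior probability that component k generated the point x_n (the responsibility γ) follows from Bayes' rule:

\gamma(z_{nk}) \equiv p(z_{nk} = 1 \mid \mathbf{x}_n)
= \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}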
Example
Max-likelihood
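The maximum likelihood objective for a dataset X = {x_1, ..., x_N} is the log of the mixture density summed over points:

\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

Because the sum over components sits inside the logarithm, setting the derivatives to zero gives no closed-form solution; the resulting coupled equations motivate the EM iterations below.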
KKT conditions
KKT conditions
KKT conditions
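Setting the derivatives of the log-likelihood to zero, with a Lagrange multiplier enforcing ∑_k π_k = 1 (the constrained stationarity / KKT conditions referred to above), gives the familiar coupled equations:

N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n

\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T, \qquad
\pi_k = \frac{N_k}{N}

These are not a closed-form solution, since γ(z_nk) itself depends on the parameters; alternating between computing γ and applying these updates is exactly the EM algorithm.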
Expectation-Maximization (EM) Algorithm
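A compact NumPy sketch of the EM loop for a Gaussian mixture, alternating the E-step (responsibilities) and M-step (parameter updates) given above; the initialization strategy and regularization constant are illustrative choices.

    import numpy as np

    def em_gmm(X, K, n_iters=100, seed=0):
        """Fit a K-component Gaussian mixture to data X of shape (N, D) with EM."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # crude initialization: K random data points as means, shared covariance, uniform weights
        mu = X[rng.choice(N, size=K, replace=False)]
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
        pi = np.full(K, 1.0 / K)

        for _ in range(n_iters):
            # E-step: responsibilities gamma[n, k] = p(z_k = 1 | x_n)
            dens = np.empty((N, K))
            for k in range(K):
                diff = X - mu[k]
                inv = np.linalg.inv(Sigma[k])
                norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(Sigma[k]))
                dens[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
            gamma = dens / dens.sum(axis=1, keepdims=True)

            # M-step: re-estimate pi, mu, Sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        return pi, mu, Sigma, gamma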
Example