CS60050: Machine Learning
Sourangshu Bhattacharya
CSE, IIT Kharagpur
REGRESSION
Linear Basis Function Models (1)
Example: Polynomial Curve Fitting
Linear Basis Function Models (2)
Generally, y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀ φ(x),
where φj(x) are known as basis functions.
Typically, φ0(x) = 1, so that w0 acts as a bias.
In the simplest case, we use linear basis functions: φd(x) = xd.
Linear Basis Function Models (3)
Polynomial basis functions: φj(x) = x^j.
These are global; a small change in x affects all basis functions.
Linear Basis Function Models (4)
Gaussian basis functions: φj(x) = exp(−(x − μj)² / (2s²)).
These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).
Linear Basis Function Models (5)
Sigmoidal basis functions: φj(x) = σ((x − μj) / s),
where
σ(a) = 1 / (1 + exp(−a)).
These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
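As a quick illustration of how these basis functions are used in practice, here is a minimal sketch (not from the slides; it assumes NumPy, and the centres, width, and inputs are made-up values) that builds the design matrix Φ for a polynomial basis and for a Gaussian basis.

```python
import numpy as np

def polynomial_design(x, degree):
    """Phi[n, j] = x_n ** j for j = 0..degree (the j = 0 column is the bias)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

def gaussian_design(x, centres, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), plus a constant bias column."""
    phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((x.size, 1)), phi])   # phi_0(x) = 1 acts as the bias

x = np.linspace(0, 1, 10)                            # illustrative inputs
Phi_poly = polynomial_design(x, degree=3)
Phi_gauss = gaussian_design(x, centres=np.linspace(0, 1, 9), s=0.1)
print(Phi_poly.shape, Phi_gauss.shape)               # (10, 4) and (10, 10)
```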
Least Squares Estimation
Maximum Likelihood and Least Squares (1)
Assume observations from a deterministic function with added Gaussian noise:
t = y(x, w) + ε, where p(ε | β) = N(ε | 0, β⁻¹),
which is the same as saying,
p(t | x, w, β) = N(t | y(x, w), β⁻¹).
Given observed inputs X = {x1, …, xN} and targets t = (t1, …, tN)ᵀ, we obtain the likelihood function
p(t | X, w, β) = ∏_{n=1}^N N(tn | wᵀφ(xn), β⁻¹).
Maximum Likelihood and Least Squares (2)
Taking the logarithm, we get
ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
where
E_D(w) = (1/2) Σ_{n=1}^N (tn − wᵀφ(xn))²
is the sum-of-squares error.
Maximum Likelihood and Least Squares (3)
Computing the gradient and setting it to zero yields
∇w ln p(t | w, β) = β Σ_{n=1}^N (tn − wᵀφ(xn)) φ(xn)ᵀ = 0.
Solving for w, we get
w_ML = Φ† t = (ΦᵀΦ)⁻¹ Φᵀ t,
where Φ† = (ΦᵀΦ)⁻¹ Φᵀ is the Moore-Penrose pseudo-inverse of the N×M design matrix Φ, whose elements are Φnj = φj(xn).
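A minimal sketch of the maximum-likelihood solution w_ML = Φ†t, assuming NumPy; the synthetic sinusoidal data and the polynomial basis are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)   # noisy sinusoid (illustrative)

# Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
M = 4
Phi = np.vstack([x**j for j in range(M)]).T               # shape (N, M)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, i.e. the Moore-Penrose pseudo-inverse applied to t
w_ml = np.linalg.pinv(Phi) @ t
print(w_ml)
```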
Geometry of Least Squares
Consider y = Φ w_ML, viewed as a vector in the N-dimensional space whose axes are given by the target values t1, …, tN.
S is the M-dimensional subspace spanned by the columns of Φ, i.e., the basis functions evaluated at the N data points.
w_ML minimizes the distance between t and its orthogonal projection onto S, i.e., y.
Normal Equations
The least-squares solution satisfies the normal equations A β = b, with A = XᵀX (p×p), β (p×1), and b = Xᵀy (p×1).
If A is invertible,
β = A⁻¹ b = (XᵀX)⁻¹ Xᵀ y.
When is A = XᵀX invertible?
Recall: full-rank matrices are invertible.
What if A is not invertible?
Gradient Descent
Even when A is invertible, computing A⁻¹ b might be computationally expensive if A is huge.
Treat it as an optimization problem instead: find the β minimizing J(β) = ½ ‖y − Xβ‖².
Observation: J(β) is convex in β.
[Figure: the convex curve J(β1) plotted against β1, and the convex surface J(β1, β2) over the (β1, β2) plane.]
How to find the minimizer?
Gradient Descent
Even when A is invertible, computing A⁻¹ b might be computationally expensive if A is huge.
Since J(β) is convex, move along the negative of the gradient:
Initialize: β⁽⁰⁾
Update: β⁽ᵗ⁺¹⁾ = β⁽ᵗ⁾ − α ∇J(β⁽ᵗ⁾), where α is the step size (∇J(β) = 0 exactly at the minimizer).
Stop: when some criterion is met, e.g., a fixed number of iterations, or ‖β⁽ᵗ⁺¹⁾ − β⁽ᵗ⁾‖ < ε.
Effect of step-size α
Large α ⇒ fast convergence, but a larger residual error; oscillations are also possible.
Small α ⇒ slow convergence, but a small residual error.
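The update rule above can be sketched in a few lines, assuming NumPy; the objective is the least-squares J(β) = ½‖y − Xβ‖² with the gradient averaged over the data for stability, and the step size, tolerance, and synthetic data are illustrative values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, eps=1e-6, max_iters=10000):
    """Minimise J(beta) = 0.5 * ||y - X beta||^2 by moving along the negative gradient."""
    beta = np.zeros(X.shape[1])                       # initialize beta^(0)
    for _ in range(max_iters):
        grad = X.T @ (X @ beta - y) / len(y)          # gradient of J, averaged over the data
        beta_new = beta - alpha * grad                # beta^(t+1) = beta^(t) - alpha * grad
        if np.linalg.norm(beta_new - beta) < eps:     # stop once the update is below eps
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)
print(gradient_descent(X, y))                         # approaches [1, -2, 0.5]
```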
0th Order Polynomial
n=10
1st Order Polynomial
Slide courtesy of William Cohen
3rd Order Polynomial
Slide courtesy of William Cohen
9th Order Polynomial
Slide courtesy of William Cohen
Over-fitting
Root-Mean-Square (RMS) Error
Slide courtesy of William Cohen
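As a reference for the plot on this slide: in this classic polynomial-fitting example the RMS error is usually defined as E_RMS = sqrt(2 E(w*) / N), where E(w*) is the sum-of-squares error at the fitted coefficients w* and N is the number of data points (this definition is assumed here; the slide itself only shows the plot).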
Polynomial Coefficients
Slide courtesy of William Cohen
Regularization
Penalize large coefficient values
Slide courtesy of William Cohen
Regularization:
Slide courtesy of William Cohen
Over Regularization
Slide courtesy of William Cohen
Regularization
Regularized Least Squares (1)
Consider the error function:
E_D(w) + λ E_W(w)   (data term + regularization term).
λ is called the regularization coefficient.
With the sum-of-squares error function and a quadratic regularizer, we get
E(w) = (1/2) Σ_{n=1}^N (tn − wᵀφ(xn))² + (λ/2) wᵀw,
which is minimized by
w = (λ I + ΦᵀΦ)⁻¹ Φᵀ t.
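A minimal sketch of this closed-form regularized solution, assuming NumPy; λ and the synthetic data are illustrative.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimise 0.5*||t - Phi w||^2 + 0.5*lam*||w||^2 in closed form."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)
Phi = np.vstack([x**j for j in range(10)]).T          # 9th-order polynomial basis
print(ridge_fit(Phi, t, lam=1e-3))                    # coefficients stay modest in size
```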
Regularized Least Squares (2)
With a more general regularizer, we have
E(w) = (1/2) Σ_{n=1}^N (tn − wᵀφ(xn))² + (λ/2) Σ_j |wj|^q.
q = 1 gives the lasso; q = 2 gives the quadratic regularizer.
Regularized Least Squares (3)
Lasso tends to generate sparser solutions than a quadratic regularizer.
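To see the difference in practice, here is a small sketch assuming scikit-learn is available (the data and the regularization strength alpha are arbitrary illustrative choices): the lasso drives most irrelevant coefficients exactly to zero, while the quadratic (ridge) penalty merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                         # only 3 relevant features
y = X @ w_true + rng.normal(0, 0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                    # q = 1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)                    # q = 2 penalty
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))
print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 20
```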
CLASSIFICATION
Discrete and Continuous Labels
Classification: X = Document, Y = Topic (e.g., Sports / Science / News); X = Cell Image, Y = Diagnosis (e.g., anemic cell vs. healthy cell).
Regression: Stock Market Prediction, e.g., X = Feb01, Y = ? (the quantity to be predicted).
An example application
An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients.
A decision is needed: whether to put a new patient in an intensive-care unit.
Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.
Problem: to predict high-risk patients and discriminate them from low-risk patients.
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant,
age
marital status
annual salary
outstanding debts
credit rating
etc.
Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
The data and the goal
Data: A set of data records (also called examples, instances or cases) described by
k attributes: A1, A2, … Ak.
a class: Each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Supervised learning process: two steps
Least squares classification
Least squares classification
Least squares classification
From Linear to Logistic Regression
Assumes the following functional form for P(Y|X):
P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
This is the logistic function applied to a linear function of the data.
Logistic function (or sigmoid): logit(z) = 1 / (1 + exp(−z))
[Figure: the sigmoid curve, logit(z) plotted against z.]
Features can be discrete or continuous!
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):
P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
Decision boundary: predict Y = 1 when P(Y = 1 | X) ≥ P(Y = 0 | X), which holds exactly when w0 + Σi wi Xi ≥ 0.
(Linear Decision Boundary)
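A small sketch of the functional form above, assuming NumPy; it checks that thresholding P(Y = 1 | X) at 0.5 is the same as thresholding the linear score w0 + Σi wi Xi at 0 (the weights here are arbitrary illustrative values).

```python
import numpy as np

def p_y1_given_x(X, w, w0):
    """P(Y = 1 | X) = 1 / (1 + exp(-(w0 + X . w))) -- the logistic of a linear score."""
    return 1.0 / (1.0 + np.exp(-(w0 + X @ w)))

w, w0 = np.array([1.5, -2.0]), 0.5            # illustrative weights
X = np.array([[0.0, 0.0], [2.0, 0.5], [-1.0, 1.0]])

probs = p_y1_given_x(X, w, w0)
linear_score = w0 + X @ w
# Predicting Y = 1 when P(Y=1|X) >= 0.5 is identical to predicting Y = 1 when the score >= 0:
print(probs >= 0.5)
print(linear_score >= 0)
```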
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):
P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
Equivalently, the ratio P(Y = 1 | X) / P(Y = 0 | X) = exp(w0 + Σi wi Xi), so comparing it to 1 again gives the linear boundary w0 + Σi wi Xi = 0.
Logistic Regression
Logistic Regression
Logistic Regression
Logistic Regression
Properties of Error function
Properties of Error function
Gradient Descent
Problem: min f(x)
f(x): differentiable
g(x): gradient of f(x)
The negative gradient is the steepest descent direction.
At each step, move along the negative gradient direction so that there is a “sufficient decrease”.
Gradient Descent
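Putting the last two ideas together, here is a sketch (assuming NumPy; learning rate, iteration count, and data are illustrative) that fits a two-class logistic regression by gradient descent on the negative conditional log likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    """Gradient descent on the negative conditional log likelihood (y in {0, 1})."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend a column of 1s for w0
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ w)                           # P(Y = 1 | X) under current w
        grad = Xb.T @ (p - y) / len(y)                # gradient of the averaged loss
        w -= alpha * grad                             # move along the negative gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)       # linearly separable labels (illustrative)
w = fit_logistic(X, y)
preds = (sigmoid(np.hstack([np.ones((200, 1)), X]) @ w) >= 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```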
Logistic Regression for more than 2 classes
Y ∈ {y1, …, yK}
P(Y = yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j=1}^{K−1} exp(wj0 + Σi wji Xi))   for k < K
P(Y = yK | X) = 1 / (1 + Σ_{j=1}^{K−1} exp(wj0 + Σi wji Xi))   for k = K (normalization, so no weights for this class)
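A minimal sketch of this K-class form, assuming NumPy; the weights are arbitrary illustrative values. Class K gets no weights, so its unnormalized score is fixed to exp(0) = 1.

```python
import numpy as np

def multiclass_probs(x, W, w0):
    """P(Y = y_k | x) with weights for classes 1..K-1 only; class K is the normalizer."""
    scores = np.exp(w0 + W @ x)               # exp(w_k0 + sum_i w_ki x_i), k = 1..K-1
    denom = 1.0 + scores.sum()                # the "+1" is class K's exp(0)
    return np.append(scores, 1.0) / denom     # probabilities for classes 1..K

W = np.array([[1.0, -0.5], [0.2, 0.3]])       # K-1 = 2 weight vectors, so K = 3 classes
w0 = np.array([0.1, -0.2])
p = multiclass_probs(np.array([0.5, 1.0]), W, w0)
print(p, p.sum())                             # probabilities sum to 1
```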
Multiple classes
Multiple classes
Multiple Classes
Multiple classes
MORE REGRESSION
The Bias-Variance Decomposition (1)
Recall the expected squared loss,
E[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt,
where
h(x) = E[t | x] = ∫ t p(t | x) dt.
The second term of E[L] corresponds to the noise inherent in the random variable t.
What about the first term?
The Bias-Variance Decomposition (2)
Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
{y(x; D) − h(x)}² = {y(x; D) − E_D[y(x; D)]}² + {E_D[y(x; D)] − h(x)}² + 2{y(x; D) − E_D[y(x; D)]}{E_D[y(x; D)] − h(x)}.
The Bias-Variance Decomposition (3)
Taking the expectation over D yields
E_D[{y(x; D) − h(x)}²] = {E_D[y(x; D)] − h(x)}² + E_D[{y(x; D) − E_D[y(x; D)]}²].
The Bias-Variance Decomposition (4)
Thus we can write
expected loss = (bias)² + variance + noise,
where
(bias)² = ∫ {E_D[y(x; D)] − h(x)}² p(x) dx,
variance = ∫ E_D[{y(x; D) − E_D[y(x; D)]}²] p(x) dx,
noise = ∫∫ {h(x) − t}² p(x, t) dx dt.
The Bias-Variance Decomposition (5)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Decomposition (6)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Decomposition (7)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Trade-off
From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
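This trade-off can be checked numerically. The sketch below (assuming NumPy; the 25 sinusoidal data sets, 24 Gaussian basis functions, noise level, and the two λ values are illustrative choices in the spirit of the example) estimates bias² and variance by averaging fits over repeated data sets.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                              # the "true" function h(x)
centres, s = np.linspace(0, 1, 24), 0.1

def design(x):
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

for lam in [1e-3, 10.0]:                                    # small vs. large regularization
    fits = []
    for _ in range(25):                                     # 25 data sets of size N = 25
        x = rng.uniform(0, 1, 25)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(24) + Phi.T @ Phi, Phi.T @ t)
        fits.append(design(x_test) @ w)
    fits = np.array(fits)                                   # shape (25, 100)
    bias2 = np.mean((fits.mean(axis=0) - h) ** 2)           # (average fit - h)^2
    variance = np.mean(fits.var(axis=0))                    # spread of fits around their mean
    print(f"lambda={lam}: bias^2={bias2:.3f}, variance={variance:.3f}")
```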
Bayesian Linear Regression (1)
Define a conjugate prior over w:
p(w) = N(w | m0, S0).
Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior
p(w | t) = N(w | mN, SN),
where
mN = SN (S0⁻¹ m0 + β Φᵀ t)
SN⁻¹ = S0⁻¹ + β Φᵀ Φ.
Bayesian Linear Regression (2)
A common choice for the prior is
p(w | α) = N(w | 0, α⁻¹ I),
for which
mN = β SN Φᵀ t
SN⁻¹ = α I + β Φᵀ Φ.
Next we consider an example …
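A minimal sketch of these update equations with the zero-mean isotropic prior, assuming NumPy; the straight-line basis, α = 2, β = 25, and the synthetic data mirror the flavour of the example that follows but are illustrative values, not taken from the slides.

```python
import numpy as np

alpha, beta = 2.0, 25.0                         # prior precision and noise precision (illustrative)

def design(x):
    """Simple straight-line model: phi(x) = (1, x)."""
    return np.vstack([np.ones_like(x), x]).T

def posterior(x, t):
    """Return m_N, S_N for the prior p(w) = N(w | 0, alpha^-1 I)."""
    Phi = design(x)
    S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, size=20)    # synthetic line data (illustrative)
m_N, S_N = posterior(x, t)
print(m_N)                                          # posterior mean close to (-0.3, 0.5)
```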
Bayesian Linear Regression (3)
0 data points observed
Prior
Data Space
Bayesian Linear Regression (4)
1 data point observed
Likelihood
Posterior
Data Space
Bayesian Linear Regression (5)
2 data points observed
Likelihood
Posterior
Data Space
Bayesian Linear Regression (6)
20 data points observed
Likelihood
Posterior
Data Space
Predictive Distribution (1)
Predict t for new values of x by integrating over w:
p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw = N(t | mNᵀ φ(x), σN²(x)),
where
σN²(x) = 1/β + φ(x)ᵀ SN φ(x).
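A short sketch of the predictive mean and variance, assuming NumPy; the nine Gaussian basis functions, α, β, and the four sinusoidal data points loosely mirror the example below but are illustrative values.

```python
import numpy as np

alpha, beta = 2.0, 25.0
centres, s = np.linspace(0, 1, 9), 0.1               # 9 Gaussian basis functions

def design(x):
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=4)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=4)

Phi = design(x)
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0, 1, 5)
Phi_new = design(x_new)
pred_mean = Phi_new @ m_N                                         # m_N^T phi(x)
pred_var = 1.0 / beta + np.sum(Phi_new @ S_N * Phi_new, axis=1)   # 1/beta + phi^T S_N phi
print(pred_mean)
print(pred_var)                                                   # shrinks near the observed points
```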
Predictive Distribution (2)
Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point
Predictive Distribution (3)
Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
Predictive Distribution (4)
Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
Predictive Distribution (5)
Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
Multiple Outputs (1)
Analogously to the single-output case we have:
p(t | x, W, β) = N(t | Wᵀφ(x), β⁻¹ I).
Given observed inputs X = {x1, …, xN} and targets T = [t1, …, tN]ᵀ, we obtain the log likelihood function
ln p(T | X, W, β) = (NK/2) ln(β / 2π) − (β/2) Σ_{n=1}^N ‖tn − Wᵀφ(xn)‖².
Multiple Outputs (2)
Maximizing with respect to W, we obtain
W_ML = (ΦᵀΦ)⁻¹ Φᵀ T.
If we consider a single target variable, tk, we see that
wk = (ΦᵀΦ)⁻¹ Φᵀ tk = Φ† tk,
where tk = (t1k, …, tNk)ᵀ, which is identical to the single-output case.