1 of 76

CS60050: Machine Learning

Sourangshu Bhattacharya

CSE, IIT Kharagpur

2 of 76

REGRESSION

3 of 76

Linear Basis Function Models (1)

Example: Polynomial Curve Fitting

4 of 76

Linear Basis Function Models (2)

Generally,

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

where φ_j(x) are known as basis functions.

Typically, φ_0(x) = 1, so that w_0 acts as a bias.

In the simplest case, we use linear basis functions: φ_d(x) = x_d.

5 of 76

Linear Basis Function Models (3)

Polynomial basis functions:

\phi_j(x) = x^j

These are global; a small change in x affects all basis functions.

6 of 76

Linear Basis Function Models (4)

Gaussian basis functions:

\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)

These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).
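
A minimal NumPy sketch (not part of the original slides) of how design matrices are built from these basis functions; the inputs, the centres μ_j, and the width s below are arbitrary example choices.

    import numpy as np

    def polynomial_design(x, M):
        # Columns are phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 acts as the bias).
        return np.vander(x, N=M, increasing=True)

    def gaussian_design(x, centres, s):
        # Columns are phi_j(x) = exp(-(x - mu_j)**2 / (2*s**2)), plus a constant bias column.
        Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s ** 2))
        return np.hstack([np.ones((x.shape[0], 1)), Phi])

    x = np.linspace(0.0, 1.0, 10)                                     # example inputs
    Phi_poly = polynomial_design(x, M=4)                              # shape (10, 4)
    Phi_gauss = gaussian_design(x, np.linspace(0.0, 1.0, 5), s=0.2)   # shape (10, 6)
    print(Phi_poly.shape, Phi_gauss.shape)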

7 of 76

Linear Basis Function Models (5)

Sigmoidal basis functions:

\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)

where

\sigma(a) = \frac{1}{1 + \exp(-a)}

These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).

8 of 76

Least Squares Estimation

 

9 of 76

Maximum Likelihood and Least Squares (1)

Assume observations from a deterministic function with added Gaussian noise:

t = y(x, w) + \epsilon, \quad \text{where} \quad p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})

which is the same as saying,

p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

Given observed inputs, X = {x_1, …, x_N}, and targets, t = [t_1, …, t_N]^T, we obtain the likelihood function

p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid w^T \phi(x_n), \beta^{-1} \right)

10 of 76

Maximum Likelihood and Least Squares (2)

Taking the logarithm, we get

\ln p(\mathbf{t} \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left( t_n \mid w^T \phi(x_n), \beta^{-1} \right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

where

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2

is the sum-of-squares error.

11 of 76

Maximum Likelihood and Least Squares (3)

Computing the gradient and setting it to zero yields

Solving for w, we get

where

The Moore-Penrose pseudo-inverse, .
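
A small NumPy sketch of the maximum-likelihood solution via the pseudo-inverse, on made-up sinusoidal data (the data, noise level, and polynomial degree are arbitrary example choices).

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 25, 4
    x = rng.uniform(0.0, 1.0, size=N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=N)   # noisy targets

    Phi = np.vander(x, N=M, increasing=True)        # design matrix, Phi[n, j] = phi_j(x_n) = x_n**j

    # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the Moore-Penrose pseudo-inverse.
    w_ml = np.linalg.pinv(Phi) @ t
    print(w_ml)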

12 of 76

Geometry of Least Squares

Consider

\mathbf{y} = \Phi w_{ML}

S is spanned by the columns φ_1, …, φ_M of Φ.

w_ML minimizes the distance between t (which lives in an N-dimensional space) and its orthogonal projection y onto the M-dimensional subspace S.

13 of 76

Normal Equations

If A^T A is invertible,

\hat{\beta} = \left( A^T A \right)^{-1} A^T y

(A^T A is p × p; \hat{\beta} and A^T y are p × 1.)

When is A^T A invertible?

Recall: full-rank matrices are invertible.

What if A^T A is not invertible?
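
A sketch (with a made-up toy matrix) of what happens when A^T A is rank-deficient: the normal equations have no unique solution, but the pseudo-inverse / least-squares routine still returns the minimum-norm solution, and a ridge penalty restores invertibility (see the regularization slides later).

    import numpy as np

    # Toy design matrix with a duplicated column, so A^T A is singular.
    A = np.array([[1.0, 2.0, 2.0],
                  [1.0, 3.0, 3.0],
                  [1.0, 4.0, 4.0],
                  [1.0, 5.0, 5.0]])
    y = np.array([1.0, 2.0, 2.5, 4.0])

    print(np.linalg.matrix_rank(A.T @ A))           # 2 < 3: not full rank, hence not invertible

    # lstsq returns the minimum-norm least-squares solution even in the rank-deficient case.
    beta_min_norm, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Adding lambda * I makes (A^T A + lambda * I) invertible again.
    lam = 1e-3
    beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    print(beta_min_norm, beta_ridge)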

14 of 76

Gradient Descent

Even when A^T A is invertible, computing (A^T A)^{-1} A^T y directly might be computationally expensive if A is huge.

Treat least squares as an optimization problem: minimize J(β).

Observation: J(β) is convex in β.

How to find the minimizer?

(Figures: a convex curve J(β1) over β1, and a convex surface J(β1, β2) over (β1, β2).)

15 of 76

Gradient Descent

Even when A^T A is invertible, computing (A^T A)^{-1} A^T y directly might be computationally expensive if A is huge.

Since J(β) is convex, move along the negative of the gradient:

Initialize: β^(0), t = 0

Update: β^(t+1) = β^(t) − α ∇J(β^(t))     (α is the step size; the update is zero when ∇J(β^(t)) = 0, i.e. at the minimizer)

Stop: when some criterion is met, e.g. a fixed number of iterations, or ||∇J(β^(t))|| < ε.
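
A minimal sketch of this update loop for least squares, assuming the objective J(β) = ||y − Aβ||² / (2N); the 1/(2N) scaling, step size, and tolerance are arbitrary choices for the example.

    import numpy as np

    def gradient_descent_ls(A, y, alpha=0.1, eps=1e-8, max_iter=10000):
        # Minimize J(beta) = ||y - A beta||^2 / (2N); the gradient is A^T (A beta - y) / N.
        N = A.shape[0]
        beta = np.zeros(A.shape[1])                 # Initialize: beta^(0) = 0
        for _ in range(max_iter):
            grad = A.T @ (A @ beta - y) / N
            if np.linalg.norm(grad) < eps:          # Stop: gradient (almost) zero
                break
            beta = beta - alpha * grad              # Update: beta^(t+1) = beta^(t) - alpha * grad
        return beta

    rng = np.random.default_rng(1)
    A = rng.normal(size=(50, 3))
    y = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)
    print(gradient_descent_ls(A, y))                # close to [1, -2, 0.5]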

16 of 76

Effect of step-size α

Large α ⇒ fast convergence, but larger residual error; oscillations are also possible.

Small α ⇒ slow convergence, but smaller residual error.

17 of 76

0th Order Polynomial

n=10

18 of 76

1st Order Polynomial

Slide courtesy of William Cohen

19 of 76

3rd Order Polynomial

Slide courtesy of William Cohen

20 of 76

9th Order Polynomial

Slide courtesy of William Cohen

21 of 76

Over-fitting

Root-Mean-Square (RMS) Error: E_{RMS} = \sqrt{2 E(w^*) / N}

Slide courtesy of William Cohen

22 of 76

Polynomial Coefficients

Slide courtesy of William Cohen

23 of 76

Regularization

Penalize large coefficient values:

\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \frac{\lambda}{2} \|w\|^2

Slide courtesy of William Cohen

24 of 76

Regularization:

Slide courtesy of William Cohen

25 of 76

Over Regularization

Slide courtesy of William Cohen

26 of 76

Regularization

27 of 76

Regularized Least Squares (1)

Consider the error function:

E_D(w) + \lambda E_W(w)

i.e. a data term plus a regularization term, where λ is called the regularization coefficient.

With the sum-of-squares error function and a quadratic regularizer, we get

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} w^T w

which is minimized by

w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{t}
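
A short NumPy sketch of the regularized solution w = (λI + Φ^TΦ)^{-1}Φ^T t, compared against the unregularized ML fit on made-up data (polynomial degree, noise, and λ are arbitrary example choices).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, size=10)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=10)

    M, lam = 10, 1e-3                               # high-degree polynomial, small ridge penalty
    Phi = np.vander(x, N=M, increasing=True)

    # Regularized least squares: w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

    # Unregularized maximum-likelihood solution, for comparison.
    w_ml = np.linalg.pinv(Phi) @ t
    print(np.abs(w_reg).max(), np.abs(w_ml).max())  # regularized coefficients stay much smaller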

28 of 76

Regularized Least Squares (2)

With a more general regularizer, we have

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

q = 1 gives the lasso; q = 2 gives the quadratic regularizer.

29 of 76

Regularized Least Squares (3)

Lasso tends to generate sparser solutions than a quadratic regularizer.
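
A sketch illustrating this sparsity effect with scikit-learn (assuming it is installed); the synthetic data and the penalty strengths are arbitrary example choices.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -1.5, 1.0]                   # only 3 of the 20 features matter
    y = X @ w_true + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)              # q = 1 penalty
    ridge = Ridge(alpha=0.1).fit(X, y)              # q = 2 penalty

    print("non-zero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
    print("non-zero ridge coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))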

30 of 76

CLASSIFICATION

31 of 76

Discrete and Continuous Labels

Classification (discrete labels):

  • X = Document (Sports / Science / News), Y = Topic
  • X = Cell Image (anemic cell vs. healthy cell), Y = Diagnosis

Regression (continuous labels):

  • Stock market prediction: X = Feb 01, Y = ? (value to be predicted)

32 of 76

An example application

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.

A decision is needed: whether to put a new patient in an intensive-care unit.

Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.

Problem: to predict high-risk patients and discriminate them from low-risk patients.

33 of 76

Another application

A credit card company receives thousands of applications for new cards. Each application contains information about an applicant,

  • age
  • marital status
  • annual salary
  • outstanding debts
  • credit rating
  • etc.

Problem: to decide whether an application should be approved, i.e. to classify applications into two categories, approved and not approved.

34 of 76

The data and the goal

Data: A set of data records (also called examples, instances or cases) described by

k attributes: A1, A2, … Ak.

a class: Each example is labelled with a pre-defined class.

Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

35 of 76

Supervised learning process: two steps

  • Learning (training): Learn a model using the training data
  • Testing: Test the model using unseen test data to assess the model accuracy
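
A minimal scikit-learn sketch of this two-step protocol (assuming scikit-learn is available; the dataset and the classifier are placeholder choices for illustration).

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 1 (learning): fit a model on the training data only.
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    # Step 2 (testing): assess accuracy on unseen test data.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))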

36 of 76

Least squares classification

 

37 of 76

Least squares classification

 

38 of 76

Least squares classification

39 of 76

From Linear to Logistic Regression

Assumes the following functional form for P(Y|X):

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -(w_0 + \sum_i w_i X_i) \right)}

i.e. the logistic function applied to a linear function of the data.

Logistic function (or sigmoid):

\sigma(z) = \frac{1}{1 + \exp(-z)}

(Figure: the S-shaped logistic curve plotted against z.)

Features can be discrete or continuous!
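
A small sketch of this functional form in NumPy; the weights below are made up for illustration.

    import numpy as np

    def sigmoid(z):
        # Logistic (sigmoid) function: 1 / (1 + exp(-z)).
        return 1.0 / (1.0 + np.exp(-z))

    def p_y1_given_x(X, w0, w):
        # P(Y = 1 | X) = sigmoid(w0 + sum_i w_i X_i), one row of X per example.
        return sigmoid(w0 + X @ w)

    w0, w = -1.0, np.array([2.0, -0.5])             # made-up weights
    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
    print(p_y1_given_x(X, w0, w))                   # probabilities in (0, 1)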

40 of 76

Logistic Regression is a Linear Classifier!

Assumes the following functional form for P(Y|X):

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -(w_0 + \sum_i w_i X_i) \right)}

Decision boundary: predict Y = 1 when P(Y = 1 | X) ≥ P(Y = 0 | X), which holds exactly when

w_0 + \sum_i w_i X_i \ge 0

(a linear decision boundary).

41 of 76

Logistic Regression is a Linear Classifier!

Assumes the following functional form for P(Y|X):

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -(w_0 + \sum_i w_i X_i) \right)}

Equivalently, the log-odds are linear in X:

\ln \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = w_0 + \sum_i w_i X_i

42 of 76

Logistic Regression

 

43 of 76

Logistic Regression

 

44 of 76

Logistic Regression

 

45 of 76

Logistic Regression

 

46 of 76

Properties of Error function

 

 

47 of 76

Properties of Error function

 

48 of 76

Gradient Descent

Problem: min f(x)

f(x): differentiable

g(x): gradient of f(x)

The negative gradient is the steepest-descent direction.

At each step, move in the negative gradient direction so that there is a "sufficient decrease" in f.
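
A sketch of gradient descent with a backtracking line search, one common way to enforce a "sufficient decrease" (Armijo) condition; the test function and constants are arbitrary example choices.

    import numpy as np

    def gradient_descent_backtracking(f, g, x0, max_iter=500, tol=1e-8):
        # Shrink the step until f(x - a*g) <= f(x) - c*a*||g||^2 (Armijo sufficient decrease).
        c, shrink = 1e-4, 0.5
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            grad = g(x)
            if np.linalg.norm(grad) < tol:
                break
            a = 1.0
            while f(x - a * grad) > f(x) - c * a * (grad @ grad):
                a *= shrink
            x = x - a * grad
        return x

    # Example: minimize f(x) = (x1 - 1)^2 + 10 * (x2 + 2)^2.
    f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
    g = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
    print(gradient_descent_backtracking(f, g, x0=[0.0, 0.0]))   # close to [1, -2]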

49 of 76

Gradient Descent

50 of 76

Logistic Regression for more than 2 classes

  • Logistic regression in the more general case, where Y ∈ {y_1, …, y_K}:

P(Y = y_k \mid X) = \frac{\exp\left( w_{k0} + \sum_i w_{ki} X_i \right)}{1 + \sum_{j=1}^{K-1} \exp\left( w_{j0} + \sum_i w_{ji} X_i \right)} \quad \text{for } k < K

P(Y = y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\left( w_{j0} + \sum_i w_{ji} X_i \right)} \quad \text{for } k = K \text{ (normalization, so no weights for this class)}
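
A NumPy sketch of these K-class posteriors, with class K as the reference class carrying no weights; the weights and inputs below are made up for illustration.

    import numpy as np

    def multiclass_posterior(X, W, w0):
        # Rows of X are examples; W has one weight row per class k < K, w0 the intercepts.
        scores = np.exp(w0 + X @ W.T)                   # exp(w_k0 + sum_i w_ki X_i), k < K
        denom = 1.0 + scores.sum(axis=1, keepdims=True) # class K contributes exp(0) = 1
        return np.hstack([scores / denom, 1.0 / denom]) # columns: classes 1..K

    W = np.array([[1.0, -1.0],                          # made-up weights for classes 1..K-1
                  [0.5, 0.5]])
    w0 = np.array([0.0, -0.5])
    X = np.array([[1.0, 2.0], [-1.0, 0.0]])
    P = multiclass_posterior(X, W, w0)
    print(P, P.sum(axis=1))                             # each row sums to 1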

51 of 76

Multiple classes

 

52 of 76

Multiple classes

 

53 of 76

Multiple Classes

54 of 76

Multiple classes

 

55 of 76

MORE REGRESSION

56 of 76

The Bias-Variance Decomposition (1)

Recall the expected squared loss,

E[L] = \int \left( y(x) - h(x) \right)^2 p(x) \, dx + \iint \left( h(x) - t \right)^2 p(x, t) \, dx \, dt

where

h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt

The second term of E[L] corresponds to the noise inherent in the random variable t.

What about the first term?

57 of 76

The Bias-Variance Decomposition (2)

Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have

\left( y(x; D) - h(x) \right)^2 = \left\{ y(x; D) - E_D[y(x; D)] \right\}^2 + \left\{ E_D[y(x; D)] - h(x) \right\}^2 + 2 \left\{ y(x; D) - E_D[y(x; D)] \right\} \left\{ E_D[y(x; D)] - h(x) \right\}

58 of 76

The Bias-Variance Decomposition (3)

Taking the expectation over D yields

E_D\left[ \left( y(x; D) - h(x) \right)^2 \right] = \underbrace{\left\{ E_D[y(x; D)] - h(x) \right\}^2}_{(\text{bias})^2} + \underbrace{E_D\left[ \left\{ y(x; D) - E_D[y(x; D)] \right\}^2 \right]}_{\text{variance}}

59 of 76

The Bias-Variance Decomposition (4)

Thus we can write

expected loss = (bias)^2 + variance + noise

where

(\text{bias})^2 = \int \left\{ E_D[y(x; D)] - h(x) \right\}^2 p(x) \, dx

\text{variance} = \int E_D\left[ \left\{ y(x; D) - E_D[y(x; D)] \right\}^2 \right] p(x) \, dx

\text{noise} = \iint \left( h(x) - t \right)^2 p(x, t) \, dx \, dt
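
A simulation sketch in the spirit of the following example: draw many data sets from the sinusoid, fit a regularized polynomial to each, and estimate bias² and variance (all settings here are arbitrary example choices).

    import numpy as np

    rng = np.random.default_rng(4)
    h = lambda x: np.sin(2 * np.pi * x)                 # the true function h(x)
    x_grid = np.linspace(0.0, 1.0, 100)
    Phi_grid = np.vander(x_grid, N=10, increasing=True)

    def fit_one_dataset(lam, N=25):
        # Draw one data set D of size N and return the fitted curve y(x; D) on the grid.
        x = rng.uniform(0.0, 1.0, size=N)
        t = h(x) + rng.normal(0.0, 0.3, size=N)
        Phi = np.vander(x, N=10, increasing=True)
        w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
        return Phi_grid @ w

    for lam in [1e-6, 1e-2, 10.0]:                      # small to large regularization
        Y = np.array([fit_one_dataset(lam) for _ in range(100)])
        y_bar = Y.mean(axis=0)                          # estimate of E_D[y(x; D)]
        bias2 = np.mean((y_bar - h(x_grid)) ** 2)
        variance = np.mean(Y.var(axis=0))
        print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={variance:.3f}")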

60 of 76

The Bias-Variance Decomposition (5)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

61 of 76

The Bias-Variance Decomposition (6)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

62 of 76

The Bias-Variance Decomposition (7)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

63 of 76

The Bias-Variance Trade-off

From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.

64 of 76

Bayesian Linear Regression (1)

Define a conjugate prior over w:

p(w) = \mathcal{N}(w \mid m_0, S_0)

Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_N, S_N)

where

m_N = S_N \left( S_0^{-1} m_0 + \beta \Phi^T \mathbf{t} \right)

S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

65 of 76

Bayesian Linear Regression (2)

A common choice for the prior is

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)

for which

m_N = \beta S_N \Phi^T \mathbf{t}

S_N^{-1} = \alpha I + \beta \Phi^T \Phi
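
A short NumPy sketch of this posterior computation; the precisions α and β, the basis, and the data are arbitrary example choices.

    import numpy as np

    rng = np.random.default_rng(5)
    alpha, beta, M = 2.0, 25.0, 6                       # assumed prior and noise precisions
    x = rng.uniform(0.0, 1.0, size=20)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=20)
    Phi = np.vander(x, N=M, increasing=True)            # some basis expansion

    # Posterior p(w | t) = N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I):
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    print(m_N)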

Next we consider an example …

66 of 76

Bayesian Linear Regression (3)

0 data points observed

Prior

Data Space

67 of 76

Bayesian Linear Regression (4)

1 data point observed

Likelihood

Posterior

Data Space

68 of 76

Bayesian Linear Regression (5)

2 data points observed

Likelihood

Posterior

Data Space

69 of 76

Bayesian Linear Regression (6)

20 data points observed

Likelihood

Posterior

Data Space

70 of 76

Predictive Distribution (1)

Predict t for new values of x by integrating over w:

p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw = \mathcal{N}\left( t \mid m_N^T \phi(x), \sigma_N^2(x) \right)

where

\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)
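
A self-contained sketch of the predictive mean and variance, reusing the same isotropic-prior setup as above (all numbers are arbitrary example choices).

    import numpy as np

    rng = np.random.default_rng(6)
    alpha, beta, M = 2.0, 25.0, 6
    x = rng.uniform(0.0, 1.0, size=10)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=10)
    Phi = np.vander(x, N=M, increasing=True)

    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)       # posterior covariance
    m_N = beta * S_N @ Phi.T @ t                                      # posterior mean

    # Predictive distribution N(t | m_N^T phi(x), sigma_N^2(x)) at new inputs.
    x_new = np.linspace(0.0, 1.0, 5)
    Phi_new = np.vander(x_new, N=M, increasing=True)
    pred_mean = Phi_new @ m_N
    pred_var = 1.0 / beta + np.sum(Phi_new @ S_N * Phi_new, axis=1)   # 1/beta + phi^T S_N phi
    print(pred_mean, pred_var)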

71 of 76

Predictive Distribution (2)

Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point

72 of 76

Predictive Distribution (3)

Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points

73 of 76

Predictive Distribution (4)

Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points

74 of 76

Predictive Distribution (5)

Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points

75 of 76

Multiple Outputs (1)

Analogously to the single-output case we have:

p(\mathbf{t} \mid x, W, \beta) = \mathcal{N}\left( \mathbf{t} \mid W^T \phi(x), \beta^{-1} I \right)

Given observed inputs, X = {x_1, …, x_N}, and targets, T = [t_1, …, t_N]^T, we obtain the log-likelihood function

\ln p(T \mid X, W, \beta) = \frac{NK}{2} \ln\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| t_n - W^T \phi(x_n) \right\|^2

76 of 76

Multiple Outputs (2)

Maximizing with respect to W, we obtain

W_{ML} = \left( \Phi^T \Phi \right)^{-1} \Phi^T T

If we consider a single target variable, t_k, we see that

w_k = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{t}_k

where \mathbf{t}_k = [t_{1k}, …, t_{Nk}]^T, which is identical to the single-output case.
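
A quick NumPy check of this column-wise equivalence (the design and target matrices are random placeholders).

    import numpy as np

    rng = np.random.default_rng(7)
    N, M, K = 30, 5, 3                                  # N examples, M basis functions, K outputs
    Phi = rng.normal(size=(N, M))                       # stand-in design matrix
    T = rng.normal(size=(N, K))                         # target matrix, one column per output

    W_ml = np.linalg.pinv(Phi) @ T                      # W_ML = (Phi^T Phi)^{-1} Phi^T T

    # Column k of W_ML equals the single-output solution for target column t_k.
    w_0 = np.linalg.pinv(Phi) @ T[:, 0]
    print(np.allclose(W_ml[:, 0], w_0))                 # True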