2 of 64

Introduction

Bayesian learning methods are relevant to our study of machine learning for
two different reasons.
First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
For example, the detailed study comparing the naive Bayes classifier to other learning algorithms, including decision tree and neural network algorithms.
These researchers show that the naive Bayes classifier is competitive with these other learning algorithms in many cases and that in some cases it outperforms these other methods.
In this chapter we describe the naive Bayes classifier and provide a detailed example of its use.
In particular, we discuss its application to the problem of learning to classify text documents such as electronic news articles.

3 of 64

Cont…

For such learning tasks, the naive Bayes classifier is among the most effective algorithms known.
The second reason that Bayesian methods are important to our study of machine learning is that they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
For example, in this chapter we analyze algorithms such as the FIND-S and CANDIDATE ELIMINATION algorithms of Chapter 2 to determine conditions under which they output the most probable hypothesis given the training data.
We also use a Bayesian analysis to justify a key design choice in neural network learning algorithms: choosing to minimize the sum of squared errors when searching the space of possible neural networks.
We also derive an alternative error function, cross entropy, that is more appropriate than sum of squared errors when learning target functions that predict probabilities.

4 of 64

Cont…

We use a Bayesian perspective to analyze the inductive bias of decision tree learning algorithms that favor short decision trees and examine the closely related Minimum Description Length principle.
A basic familiarity with Bayesian methods is important to understanding and characterizing the operation of many algorithms in machine learning.
Features of Bayesian learning methods include: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
In Bayesian learning, prior knowledge is provided by asserting
(1) a prior probability for each candidate hypothesis, and
(2) a probability distribution over observed data for each possible hypothesis.

5 of 64

Cont…

Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete recovery").
New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
One practical difficulty in applying Bayesian methods is that they typically require initial knowledge of many probabilities.
When these probabilities are not known in advance they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.

6 of 64

Cont…

A second practical difficulty is the significant computational cost required to determine the Bayes optimal hypothesis in the general case (linear in the number of candidate hypotheses).
In certain specialized situations, this computational cost can be significantly reduced.

7 of 64

BAYES THEOREM

In machine learning we are often interested in determining the best hypothesis from some space H, given the observed training data D.
One way to specify what we mean by the best hypothesis is to say that we demand the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
Bayes theorem provides a direct method for calculating such probabilities.
More precisely, Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
To define Bayes theorem precisely, let us first introduce a little notation. We shall write P(h) to denote the initial probability that hypothesis h holds, before we have observed the training data.
P(h) is often called the priorprobability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis.

8 of 64

Cont…

If we have no such prior knowledge, then we might simply assign the same prior probability to each candidate hypothesis.
Similarly, we will write P(D) to denote the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
Next, we will write P(D/h) to denote the probability of observing data D given some world in which hypothesis h holds.
More generally, we write P(x/y) to denote the probability of x given y.
In machine learning problems we are interested in the probability P (h/D) that h holds given the observed training data D.
P (h/D) is called the posteriorprobability of h, because it reflects our confidence that h holds after we have seen the training data D.
Notice the posterior probability P(h/D) reflects the influence of the training data D, in contrast to the prior probability P(h) , which is independent of D.

9 of 64

Cont…

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h/D), from the prior probability P(h), together with P(D) and P(D/h).

As one might intuitively expect, P(h/D) increases with P(h) and with P(D/h)

according to Bayes theorem.

It is also reasonable to see that P(h/D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.

10 of 64

Cont…

In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis
h H given the observed data D (or at least one of the maximally probable if there are several).
Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.
More precisely, we will say that is a MAP hypothesis provided.

11 of 64

Cont…

Notice in the final step above we dropped the term P(D) because it is a constant independent of h.
In some cases, we will assume that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for all hi and hj in H). In this case we can further simplify Equation (6.2) and need only consider the term P(D/h) to find the most probable hypothesis.

12 of 64

Cont…

P(D/h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D/h) is called a maximum likelihood (ML) hypothesis, hML.

In order to make clear the connection to machine learning problems, we introduced Bayes theorem above by referring to the data D as training examples of some target function and referring to H as the space of candidate target functions.
In fact, Bayes theorem is much more general than suggested by this discussion. It can be applied equally well to any set H of mutually exclusive propositions whose probabilities sum to one (e.g., "the sky is blue," and "the sky is not blue").

13 of 64

Cont…

In this chapter, we will at times consider cases where H is a hypothesis space containing possible target functions and the data D are training examples.
At other times we will consider cases where H is some other set of mutually exclusive propositions, and D is some other kind of data.

14 of 64

An Example

To illustrate Bayes rule, consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a- articular form of cancer and
(2) that the patient does not.
The available data is from a particular laboratory
test with two possible outcomes:
We have prior knowledge that over the entire population of people only .008 have this disease.
Furthermore, the lab test is only an imperfect indicator of the disease.
The test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present.
In other cases, the test returns the opposite result. The above situation can be summarized by the following probabilities:

15 of 64

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis can be found using Equation (6.2):

17 of 64

Cont…

As this example illustrates, the result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method directly.
Note also that in this example the hypotheses are not completely accepted or rejected, but rather become more or less probable as more data is observed.
Basic formulas for calculating probabilities are summarized in Table 6.1.

19 of 64

BAYES THEOREM AND CONCEPT LEARNING

What is the relationship between Bayes theorem and the problem of concept learning?
Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, we can use it as the basis for a straightforward learning algorithm that calculates the probability for each possible hypothesis, then outputs the most probable.
This section considers such a brute-force Bayesian concept learning algorithm, then compares it to concept learning algorithms we considered in Chapter 2.
As we shall see, one interesting result of this comparison is that under certain conditions several algorithms discussed in earlier chapters output the same hypotheses as this brute-force Bayesian algorithm, despite the fact that they do not explicitly manipulate probabilities and are considerably more efficient.

20 of 64

Brute-Force Bayes Concept Learning

Consider the concept learning problem first introduced in Chapter 2. In particular, assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept
As usual, we assume that the learner is given some sequence of training examples where xi is some instance from X and where di is the target value of xi (i.e., di = c(xi)).
To simplify the discussion in this section, we assume the sequence of instances is held fixed, so that the training data D can be written simply as the sequence of target values
We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:

21 of 64

Cont…

This algorithm may require significant computation, because it applies Bayes theorem to each hypothesis in H to calculate P(h/D ).
While this may prove impractical for large hypothesis spaces, the algorithm is still of interest because it provides a standard against which we may judge the performance of other concept learning algorithms.

22 of 64

Cont…

In order specify a Iearning problem for the BRUTE-FORCEMAP LEARNING algorithm we must specify what values are to be used for P(h) and for P(D/h) (as we shall see, P(D) will be determined once we choose the other two).
We may choose the probability distributions P(h) and P(D/h) in any way we wish, to describe our prior knowledge about the learning task. Here let us choose them to be consistent with the following assumptions:

23 of 64

Cont…

Given these assumptions, what values should we specify for P(h)? Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H.
Furthermore, because we assume the target concept is contained in H we should require that these prior probabilities sum to 1. Together these constraints imply that we should choose

24 of 64

Cont…

What choice shall we make for P(D/h)? P(D/h) is the probability of observing the target values for the fixed set of instances

given a world in which hypothesis h holds (i.e., given a world in which h is the correct description of the target concept c). Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di # h(xi). Therefore,

25 of 64

Cont…

In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and 0 otherwise.
Given these choices for P(h) and for P(Dlh) we now have a fully-defined problem for the above BRUTE-FORCEMAP LEARNING algorithm.
Let us consider the first step of this algorithm, which uses Bayes theorem to compute the posterior probability P(h/D) of each hypothesis h given the observed training data D.
Recalling Bayes theorem, we have

26 of 64

Cont…

First consider the case where h is inconsistent with the training data D. Since Equation (6.4) defines P(D/h) to be 0 when h is inconsistent with D, we have

The posterior probability of a hypothesis inconsistent with D is zero.
Now consider the case where h is consistent with D. Since Equation (6.4) defines P(D/h) to be 1 when h is consistent with D, we have

27 of 64

Cont…

28 of 64

Cont…

30 of 64

Cont…

To summarize, Bayes theorem implies that the posterior probability P(h/D) under our assumed P(h) and P(D/h) is

where is the number of hypotheses from H consistent with D. The evolution of probabilities associated with hypotheses is depicted schematically in Figure 6.1. Initially (Figure 6.1a) all hypotheses have the same probability.
As training data accumulates (Figures 6.1 b and 6. lc), the posterior probability for inconsistent hypotheses becomes zero while the total probability summing to one is shared equally among the remaining consistent hypotheses.

31 of 64

Cont…

The above analysis implies that under our choice for P(h) and P(D/h), every
consistent hypothesis has posterior probability , and every inconsistent hypothesis has posterior probability 0. Every consistent hypothesis is, therefore, a MAP hypothesis.

32 of 64

MAP Hypotheses and Consistent Learners

The above analysis shows that in the given setting, every hypothesis consistent with D is a MAP hypothesis.
This statement translates directly into an interesting statement about a general class of learners that we might call consistent learners.
We will say that a learning algorithm is a consistent learner provided it outputs a hypothesis that commits zero errors over the training examples.
Given the above analysis, we can conclude that every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if we assume deterministic, noise free training data (i.e., P(D /h) =1 if D and h are consistent, and 0 otherwise)

33 of 64

Cont…

Consider, for example, the concept learning algorithm FIND-S discussed in Chapter 2.
FIND-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis (i.e., a maximally specific member of the version space).
Because FIND-S outputs a consistent hypothesis, we know that it will output a MAP hypothesis under the probability distributions P(h) and P(D1h) defined above.
Of course FIND-S does not explicitly manipulate probabilities at all-it simply outputs a maximally specific member of the version space.
However, by identifying distributions for P(h) and P(D(h) under which its output hypotheses will be MAP hypotheses, we have a useful way of characterizing the behavior of FIND-S.

35 of 64

Cont…

Are there other probability distributions for P(h) and P(D1h) under which FIND-S outputs MAP hypotheses? Yes.
Because FIND-S outputs a maximally space hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favors more specific hypotheses.
More precisely, suppose 3-1 is any probability distribution P(h) over H that assigns P(h1) >=P(h2) if hl is more specific than h2. Then it can be shown that FIND-S outputs a MAP hypothesis assuming the prior distribution H and the same distribution P(D/h) discussed above.
To summarize the above discussion, the Bayesian framework allows one way to characterize the behavior of learning algorithms (e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities.

36 of 64

Cont…

By identifying probability distributions P(h) and P(D/h) under which the algorithm outputs optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions, under which this algorithm behaves optimally.
Using the Bayesian perspective to characterize learning algorithms in this way is similar in spirit to characterizing the inductive bias of the learner.
Recall that in Chapter 2 we defined the inductive bias of a learning algorithm to be the set of assumptions B sufficient to deductively justify the inductive inference performed by the learner.
For example, we described the inductive bias of the CANDIDATE-ELIMINATION algorithm as the assumption that the target concept c is included in the hypothesis space H.
Furthermore, we showed there that the output of this learning algorithm follows deductively from its inputs plus this implicit inductive bias assumption.

37 of 64

Cont…

The above Bayesian interpretation provides an alternative way to characterize the assumptions implicit in learning algorithms.
Here, instead of modeling the inductive inference method by an equivalent deductive system, we model it by an equivalent probabilistic reasoning system based on Bayes theorem.
And here the implicit assumptions that we attribute to the learner are assumptions of the form "the prior probabilities over H are given by the distribution P(h), and the strength of data in rejecting or accepting a hypothesis is given by P(D/h).
The definitions of P(h) and P(D/h) given in this section characterize the implicit assumptions of the CANDIDATE-ELIMINATION FNIND-S algorithms.
A probabilistic reasoning system based on Bayes theorem will exhibit input-output behavior equivalent to these algorithms, provided it is given these assumed probability distributions.

38 of 64

Cont…

The discussion throughout this section corresponds to a special case of Bayesian reasoning, because we considered the case where P(D/h) takes on values of only 0 and 1, reflecting the deterministic predictions of hypotheses and the assumption of noise-free training data.
As we shall see in the next section, we can also model learning from noisy training data, by allowing P(D1h) to take on values other than 0 and 1, and by introducing into P(D1h) additional assumptions about the probability distributions that govern the noise.

39 of 64

MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Bayesian analysis can sometimes be used to show that a particular learning algorithm outputs MAP hypotheses even though it may not explicitly use Bayes rule or calculate probabilities in any form.
In this section we consider the problem of learning a continuous-valued target function-a problem faced by many learning approaches such as neural network learning, linear regression, and polynomial curve fitting.
A straightforward Bayesian analysis will show that under certain assumptions any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.
The significance of this result is that it provides a Bayesian justification (under certain assumptions) for many neural network and other curve fitting methods that attempt to minimize the sum of squared errors over the training data.

40 of 64

Cont…

Consider the following problem setting. Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X (i.e., each h in H is a function of the form

,where R represents the set of real numbers).

The problem faced by L is to learn an unknown target function

drawn from H.

A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution.
More precisely, each training example is a pair of the form (xi, di) where di = f (xi) + ei.
Here f (xi) is the noise-free value of the target function and ei is a random variable representing the noise.
It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.

41 of 64

Cont…

The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP hypothesis assuming all hypotheses are equally probable a priori.
A simple example of such a problem is learning a linear function, though our
analysis applies to learning arbitrary real-valued functions. Figure 6.2 illustrates a linear target function f depicted by the solid line, and a set of noisy training examples of this target function.
The dashed line corresponds to the hypothesis hML with least-squared training error, hence the maximum likelihood hypothesis.
Notice that the maximum likelihood hypothesis is not necessarily identical to the correct hypothesis, f, because it is inferred from only a limited sample of noisy training data.

43 of 64

Cont…

Before showing why a hypothesis that minimizes the sum of squared errors in this setting is also a maximum likelihood hypothesis, let us quickly review two basic concepts from probability theory: probability densities and Normal distributions.
First, in order to discuss probabilities over continuous variables such as e, we must introduce probability densities.
The reason, roughly, is that we wish for the total probability over all possible values of the random variable to sum to one.
In the case of continuous variables we cannot achieve this by assigning a finite probability to each of the infinite set of possible values for the random variable.
Instead, we speak of a probability density for continuous variables such as e and require that the integral of this probability density over all possible values be one.

44 of 64

Cont…

In general we will use lower case p to refer to the probability density function, to distinguish it from a finite probability P (which we will sometimes refer to as a probability mass).
The probability density p(x0) is the limit as ϵ goes to zero, of times the probability that x will take on a value in the interval [xo, xo + ϵ ).
Probability density function:

45 of 64

Cont…

Second, we stated that the random noise variable e is generated by a Normal probability distribution.
A Normal distribution is a smooth, bell-shaped distribution that can be completely characterized by its mean p and its standard deviation σ.
Given this background we now return to the main issue: showing that the least-squared error hypothesis is, in fact, the maximum likelihood hypothesis within our problem setting.
We will show this by deriving the maximum likelihood hypothesis starting with our earlier definition Equation (6.3), but using lower case p to refer to the probability density.

46 of 64

Cont…

As before, we assume a fixed set of training instances (x1 . . . xm) and therefore consider the data D to be the corresponding sequence of target values D = (d1 . . .dm). Here di = f (xi) + ei. Assuming the training examples are mutually independent given h, we can write P(D/h) as the product of the various P(di/h).

47 of 64

Cont…

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ2, each di must also obey a Normal distribution with variance σ2 centered around the true target value f (xi) rather than zero.
Therefore p(di /h) can be written as a Normal distribution with variance a2 and mean p = f (xi).
Let us write the formula for this Normal distribution to describe p(di /h), beginning with the general formula for a Normal distribution and substituting the appropriate μ and σ2.
Because we are writing the expression for the probability of di given that h is the correct description of the target function f, we will also substitute μ = f (xi) = h(xi), yielding

48 of 64

Cont…

We now apply a transformation that is common in maximum likelihood calculations: Rather than maximizing the above complicated expression we shall choose to maximize its (less complicated) algorithm.
This is justified because ln p is a monotonic function of p. Therefore maximizing In p also maximizes p.

49 of 64

Cont…

The first term in this expression is a constant independent of h, and can therefore be discarded, yielding

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity.

Finally, we can again discard constants that are independent of h.

50 of 64

Cont…

Thus, Equation (6.6) shows that the maximum likelihood hypothesis HML the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
This holds under the assumption that the observed training values di are generated by adding random noise to the true target value, where this random noise is drawn independently for each example from a Normal distribution with zero mean.
As the above derivation makes clear, the squared error term

directly from the exponent in the definition of the Normal distribution. Similar derivations can be performed starting with other assumed noise distributions, producing different results.

51 of 64

MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

In the problem setting of the previous section we determined that the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the training examples.
In this section we derive an analogous criterion for a second setting that is common in neural network learning: learning to predict probabilities.
Consider the setting in which we wish to learn a nondeterministic (probabilistic) function f : X ->{0, 1}, which has two discrete output values.
For example, the instance space X might represent medical patients in terms of their symptoms, and the target function f (x) might be 1 if the patient survives the disease and 0 if not.
Alternatively, X might represent loan applicants in terms of their past credit history, and f (x) might be 1 if the applicant successfully repays their next loan and 0 if not.
In both of these cases we might well expect f to be probabilistic.

52 of 64

Cont…

For example, among a collection of patients exhibiting the same set of observable symptoms, we might find that 92% survive, and 8% do not.
This unpredictability could arise from our inability to observe all the important distinguishing features of the patients, or from some genuinely probabilistic mechanism in the evolution of the disease.
Whatever the source of the problem, the effect is that we have a target function f (x) whose output is a probabilistic function of the input.
Given this problem setting, we might wish to learn a neural network (or other
real-valued function approximator) whose output is the probability that f (x) = 1.
In other words, we seek to learn the target function, f' : X ->{0, 1}, such that f '(x) = P( f (x) = 1). In the above medical patient example, if x is one of those indistinguishable patients of which 92% survive, then f'(x) = 0.92 whereas the probabilistic function f (x) will be equal to 1 in 92% of cases and equal to 0 in the remaining 8%.

53 of 64

Cont…

How can we learn f' using, say, a neural network? One obvious, bruteforce way would be to first collect the observed frequencies of 1's and 0's for each possible value of x and to then train the neural network to output the target frequency for each x.
As we shall see below, we can instead train a neural network directly from the observed training examples of f, yet still derive a maximum likelihood hypothesis for f '.
What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in this setting? To answer this question we must first obtain an expression for P(D/h).
Let us assume the training data D is of the form D = {(xl, dl) . . . (xm, dm)}, where di is the observed 0 or 1 value for f (xi).

54 of 64

Cont…

Recall that in the maximum likelihood, least-squared error analysis of the previous section, we made the simplifying assumption that the instances (xl . . . x,) were fixed.
This enabled us to characterize the data by considering only the target values di.
Although we could make a similar simplifying assumption in this case, let us avoid it here in order to demonstrate that it has no impact on the final outcome.
Thus treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D/h) as

55 of 64

Cont…

It is reasonable to assume, furthermore, that the probability of encountering any particular instance xi is independent of the hypothesis h.
For example, the probability that our training set contains a particular patient xi is independent of our hypothesis about survival rates (though of course the survival di, of the patient does depend strongly on h).
When x is independent of h we can rewrite the above expression (applying the product rule from Table 6.1) as

56 of 64

Cont…

Now what is the probability P(di/h, xi) of observing di = 1 for a single instance xi, given a world in which hypothesis h holds? Recall that h is our hypothesis regarding the target function, which computes this very probability. Therefore, P(di = 1/h, xi) = h(xi), and in general

In order to substitute this into the Equation (6.8) for P(Dlh), let us first re-express it in a more mathematically manipulable form, as

57 of 64

Cont…

It is easy to verify that the expressions in Equations (6.9) and (6.10) are equivalent.
Notice that when di = 1, the second term from Equation (6.10),

becomes equal to 1. Hence P(di = 1/h,xi) = h(xi), which is equivalent to the first case in Equation (6.9). A similar analysis shows that the two equations are also equivalent when di = 0.

We can use Equation (6.10) to substitute for P(di /h, xi) in Equation (6.8) to obtain

58 of 64

Cont…

Now we write an expression for the maximum likelihood hypothesis

The last term is a constant independent of h, so it can be dropped

59 of 64

Cont…

The expression on the right side of Equation (6.12) can be seen as a generalization of the Binomial distribution.
The expression in Equation (6.12) describes the probability that flipping each of m distinct coins will produce the outcome (dl . . .dm), assuming that each coin xi has probability h(xi) of producing a heads.
As in earlier cases, we will find it easier to work with the log of the likelihood, yielding

60 of 64

Gradient Search to Maximize Likelihood in a Neural Net

Above we showed that maximizing the quantity in Equation (6.13) yields the maximum likelihood hypothesis. Let us use G(h, D) to denote this quantity.
In this section we derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using gradient ascent.
The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the various network weights that define the hypothesis h represented by the learned network. In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

61 of 64

Cont…

To keep our analysis simple, suppose our neural network is constructed from a single layer of sigmoid units. In this case we have

62 of 64

Cont…

where xijk is the kth input to unit j for the ith training example, and d(x) is the derivative of the sigmoid squashing function.
Finally, substituting this expression into Equation (6.14), we obtain a simple expression for the derivatives that constitute the gradient

63 of 64

Cont…

Because we seek to maximize rather than minimize P(D(h), we perform gradient ascent rather than gradient descent search.
On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight update rule

and where is a small positive constant that determines the step size of the gradient ascent search.

64 of 64

Cont…

It is interesting to compare this weight-update rule to the weight-update rule used by the BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and observed network outputs. The BACKPROPAGATION update rule for output unit weights, re-expressed using our current notation, is

1 of 64

2 of 64

3 of 64

4 of 64

5 of 64

6 of 64

7 of 64

8 of 64

9 of 64

10 of 64

11 of 64

12 of 64

13 of 64

14 of 64

15 of 64

16 of 64

17 of 64

18 of 64

19 of 64

20 of 64

21 of 64

22 of 64

23 of 64

24 of 64

25 of 64

26 of 64

27 of 64

28 of 64

29 of 64

30 of 64

31 of 64

32 of 64

33 of 64

34 of 64

35 of 64

36 of 64

37 of 64

38 of 64

39 of 64

40 of 64

41 of 64

42 of 64

43 of 64

44 of 64

45 of 64

46 of 64

47 of 64

48 of 64

49 of 64

50 of 64

51 of 64

52 of 64

53 of 64

54 of 64

55 of 64

56 of 64

57 of 64

58 of 64

59 of 64

60 of 64

61 of 64

62 of 64

63 of 64

64 of 64