1 of 94

PROBABILISTIC REASONING

2 of 94

Introduction

  • Chapter 13 introduced the basic elements of probability theory and noted the importance of independence and conditional independence relationships in simplifying probabilistic representations of the world.
  • This chapter introduces a systematic way to represent such relationships explicitly in the form of Bayesian networks.
  • We define the syntax and semantics of these networks and show how they can be used to capture uncertain knowledge in a natural and efficient way.
  • We then show how probabilistic inference, although computationally intractable in the worst case, can be done efficiently in many practical situations.
  • We also describe a variety of approximate inference algorithms that are often applicable when exact inference is infeasible.

3 of 94

REPRESENTING KNOWLEDGE IN AN UNCERTAIN DOMAIN

  • We also saw that independence and conditional independence relationships among variables can greatly reduce the number of probabilities that need to be specified in order to define the full joint distribution.
  • This section introduces a data structure called a Bayesian network to represent the dependencies among variables.
  • Bayesian networks can represent essentially any full joint probability distribution and in many cases can do so very concisely.
  • A Bayesian network is a directed graph in which each node is annotated with quantitative probability information. The full specification is as follows:

4 of 94

Cont…

  • The topology of the network the set of nodes and links specifies the conditional independence relationships that hold in the domain, in a way that will be made precise shortly.
  • The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests that causes should be parents of effects.
  • It is usually easy for a domain expert to decide what direct influences exist in the domain much easier, in fact, than actually specifying the probabilities themselves.
  • Once the topology of the Bayesian network is laid out, we need only specify a conditional probability distribution for each variable, given its parents.
  • We will see that the combination of the topology and the conditional distributions suffices to specify (implicitly) the full joint distribution for all the variables.

5 of 94

Cont…

  • Recall the simple world described in Chapter 13, consisting of the variables Toothache, Cavity, Catch, and Weather. We argued that Weather is independent of the other variables; furthermore, we argued that Toothache and Catch are conditionally independent, given Cavity.
  • These relationships are represented by the Bayesian network structure shown in Figure 14.1. Formally, the conditional independence of Toothache and Catch, given Cavity, is indicated by the absence of a link between Toothache and Catch.
  • Intuitively, the network represents the fact that Cavity is a direct cause of Toothache and Catch, whereas no direct causal relationship exists between Toothache and Catch.
  • Now consider the following example, which is just a little more complex. You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes.

6 of 94

Cont…

  • You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm.
  • John nearly always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary, on the other hand, likes rather loud music and often misses the alarm altogether.
  • Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

7 of 94

Cont…

8 of 94

Cont…

  • A Bayesian network for this domain appears in Figure 14.2. The network structure shows that burglary and earthquakes directly affect the probability of the alarm’s going off, but whether John and Mary call depends only on the alarm.
  • The network thus represents our assumptions that they do not perceive burglaries directly, they do not notice minor earthquakes, and they do not confer before calling.
  • The conditional distributions in Figure 14.2 are shown as a conditional probability table, or CPT. (This form of table can be used for discrete variables; other representations, including those suitable for continuous variables, are described in Section 14.2.)
  • Each row in a CPT contains the conditional probability of each node value for a conditioning case.

9 of 94

Cont…

  • A conditioning case is just a possible combination of values for the parent nodes a miniature possible world, if you like. Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.
  • For Boolean variables, once you know that the probability of a true value is p, the probability of false must be 1 – p, so we often omit the second number, as in Figure 14.2.
  • In general, a table for a Boolean variable with k Boolean parents contains 2k independently specifiable probabilities. A node with no parents has only one row, representing the prior probabilities of each possible value of the variable.
  • Notice that the network does not have nodes corresponding to Mary’s currently listening to loud music or to the telephone ringing and confusing John.
  • These factors are summarized in the uncertainty associated with the links from Alarm to JohnCalls and MaryCalls.

10 of 94

Cont…

  • This shows both laziness and ignorance in operation: it would be a lot of work to find out why those factors would be more or less likely in any particular case, and we have no reasonable way to obtain the relevant information anyway.
  • The probabilities actually summarize a potentially infinite set of circumstances in which the alarm might fail to go off (high humidity, power failure, dead battery, cut wires, a dead mouse stuck inside the bell, etc.) or John or Mary might fail to call and report it (out to lunch, on vacation, temporarily deaf, passing helicopter, etc.).
  • In this way, a small agent can cope with a very large world, at least approximately.
  • The degree of approximation can be improved if we introduce additional relevant information.

11 of 94

THE SEMANTICS OF BAYESIAN NETWORKS

  • The previous section described what a network is, but not what it means. There are two ways in which one can understand the semantics of Bayesian networks.
  • The first is to see the network as a representation of the joint probability distribution. The second is to view it as an encoding of a collection of conditional independence statements.
  • The two views are equivalent, but the first turns out to be helpful in understanding how to construct networks, whereas the second is helpful in designing inference procedures.

12 of 94

Representing the full joint distribution

  • Viewed as a piece of “syntax,” a Bayesian network is a directed acyclic graph with some numeric parameters attached to each node.
  • One way to define what the network means its semantics is to define the way in which it represents a specific joint distribution over all the variables.
  • To do this, we first need to retract (temporarily) what we said earlier about the parameters associated with each node. We said that those parameters correspond to conditional

13 of 94

Cont…

14 of 94

Cont…

15 of 94

A method for constructing Bayesian networks

  • Equation (14.2) defines what a given Bayesian network means. The next step is to explain how to construct a Bayesian network in such a way that the resulting joint distribution is a good representation of a given domain.
  • We will now show that Equation (14.2) implies certain conditional independence relationships that can be used to guide the knowledge engineer in constructing the topology of the network.
  • First, we rewrite the entries in the joint distribution in terms of conditional probability, using the product rule :

16 of 94

Cont…

17 of 94

Cont…

18 of 94

Cont…

  • Because each node is connected only to earlier nodes, this construction method guarantees that the network is acyclic.
  • Another important property of Bayesian networks is that they contain no redundant probability values.
  • If there is no redundancy, then there is no chance for inconsistency: it is impossible for the knowledge engineer or domain expert to create a Bayesian network that violates the axioms of probability.

19 of 94

Compactness and node ordering

  • As well as being a complete and non-redundant representation of the domain, a Bayesian network can often be far more compact than the full joint distribution.
  • This property is what makes it feasible to handle domains with many variables. The compactness of Bayesian networks is an example of a general property of locally structured (also called sparse) systems.
  • In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components.
  • Local structure is usually associated with linear rather than exponential growth in complexity. In the case of Bayesian networks, it is reasonable to suppose that in most domains each random variable is directly influenced by at most k others, for some constant k.

20 of 94

Cont…

  • If we assume n Boolean variables for simplicity, then the amount of information needed to specify each conditional probability table will be at most 2k numbers, and the complete network can be specified by n2k numbers. In contrast, the joint distribution contains 2n numbers.
  • To make this concrete, suppose we have n=30 nodes, each with five parents (k =5). Then the Bayesian network requires 960 numbers, but the full joint distribution requires over a billion.
  • There are domains in which each variable can be influenced directly by all the others, so that the network is fully connected.
  • Then specifying the conditional probability tables requires the same amount of information as specifying the joint distribution. In some domains, there will be slight dependencies that should strictly be included by adding a new link.
  • But if these dependencies are tenuous, then it may not be worth the additional complexity in the network for the small gain in accuracy.

21 of 94

Cont…

  • For example, one might object to our burglary network on the grounds that if there is an earthquake, then John and Mary would not call even if they heard the alarm, because they assume that the earthquake is the cause.
  • Whether to add the link from Earthquake to JohnCalls and MaryCalls (and thus enlarge the tables) depends on comparing the importance of getting more accurate probabilities with the cost of specifying the extra information.
  • Even in a locally structured domain, we will get a compact Bayesian network only if we choose the node ordering well. What happens if we happen to choose the wrong order?
  • Consider the burglary example again. Suppose we decide to add the nodes in the order MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
  • We then get the somewhat more complicated network shown in Figure 14.3(a). The process goes as follows:

22 of 94

Cont…

23 of 94

Cont…

24 of 94

Cont…

  • The resulting network has two more links than the original network in Figure 14.2 and requires three more probabilities to be specified.
  • What’s worse, some of the links represent tenuous relationships that require difficult and unnatural probability judgments, such as assessing the probability of Earthquake, given Burglary and Alarm.
  • This phenomenon is quite general and is related to the distinction between causal and diagnostic models.
  • If we try to build a diagnostic model with links from symptoms to causes (as from MaryCalls to Alarm or Alarm to Burglary), we end up having to specify additional dependencies between otherwise independent causes (and often between separately occurring symptoms as well).
  • If we stick to a causal model, we end up having to specify fewer numbers, and the numbers will often be easier to come up with.

25 of 94

Cont…

  • In the domain of medicine, for example, it has been shown by Tversky and Kahneman (1982) that expert physicians prefer to give probability judgments for causal rules rather than for diagnostic ones.
  • Figure 14.3(b) shows a very bad node ordering: MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
  • This network requires 31 distinct probabilities to be specified exactly the same number as the full joint distribution. It is important to realize, however, that any of the three networks can represent exactly the same joint distribution.
  • The last two versions simply fail to represent all the conditional independence relationships and hence end up specifying a lot of unnecessary numbers instead.

26 of 94

Conditional independence relations in Bayesian networks

  • We have provided a “numerical” semantics for Bayesian networks in terms of the representation of the full joint distribution, as in Equation (14.2).
  • Using this semantics to derive a method for constructing Bayesian networks, we were led to the consequence that a node is conditionally independent of its other predecessors, given its parents.
  • It turns out that we can also go in the other direction. We can start from a “topological” semantics that specifies the conditional independence relationships encoded by the graph structure, and from this we can derive the “numerical” semantics.
  • The topological semantics specifies that each variable is conditionally independent of its non-descendants, given its parents.
  • For example, in Figure 14.2, JohnCalls is independent of Burglary, Earthquake, and MaryCalls given the value of Alarm. The definition is illustrated in Figure 14.4(a).

27 of 94

Cont…

28 of 94

Cont…

  • From these conditional independence assertions and the interpretation of the network parameters θ(Xi |Parents(Xi)) as specifications of conditional probabilities P(Xi |Parents(Xi)), the full joint distribution
  • given in Equation (14.2) can be reconstructed.
  • In this sense, the “numerical” semantics and the “topological” semantics are equivalent.
  • Another important independence property is implied by the topological semantics: a node is conditionally independent of all other nodes in the network, given its parents, children, and children’s parents that is, given its Markov blanket.
  • For example, Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake. This property is illustrated in Figure 14.4(b).

29 of 94

EFFICIENT REPRESENTATION OF CONDITIONAL DISTRIBUTIONS

  • Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up to O(2k) numbers and perhaps a great deal of experience with all the possible conditioning cases.
  • In fact, this is a worst-case scenario in which the relationship between the parents and the child is completely arbitrary. Usually, such relationships are describable by a canonical distribution that fits some standard pattern.
  • In such cases, the complete table can be specified by naming the pattern and perhaps supplying a few parameters much easier than supplying an exponential number of parameters.
  • The simplest example is provided by deterministic nodes. A deterministic node has its value specified exactly by the values of its parents, with no uncertainty.
  • The relationship can be a logical one: for example, the relationship between the parent nodes Canadian, US, Mexican and the child node NorthAmerican is simply that the child is the disjunction of the parents.

30 of 94

Cont…

  • The relationship can also be numerical: for example, if the parent nodes are the prices of a particular model of car at several dealers and the child node is the price that a bargain hunter ends up paying, then the child node is the minimum of the parent values;
  • or if the parent nodes are a lake’s inflows (rivers, runoff, precipitation) and outflows (rivers, evaporation, seepage) and the child is the change in the water level of the lake, then the value of the child is the sum of the inflow parents minus the sum of the outflow parents.
  • Uncertain relationships can often be characterized by so-called noisy logical relationships. The standard example is the noisy-OR relation, which is a generalization of the logical OR.
  • In propositional logic, we might say that Fever is true if and only if Cold , Flu, or Malaria is true.

31 of 94

C0ont…

  • The noisy-OR model allows for uncertainty about the ability of each parent to cause the child to be true the causal relationship between parent and child may be inhibited, and so a patient could have a cold, but not exhibit a fever.
  • The model makes two assumptions. First, it assumes that all the possible causes are listed. (If some are missing, we can always add a so-called leak node that covers “miscellaneous causes.”)
  • Second, it assumes that inhibition of each parent is independent of inhibition of any other parents:
  • for example, whatever inhibits Malaria from causing a fever is independent of whatever inhibits Flu from causing a fever. Given these assumptions, Fever is false if and only if all its true parents are inhibited, and the probability of this is the product of the inhibition probabilities q for each parent.
  • Let us suppose these individual inhibition probabilities are as follows:

32 of 94

Cont…

33 of 94

Cont…

  • In general, noisy logical relationships in which a variable depends on k parents can be described using O(k) parameters instead of O(2k) for the full conditional probability table.
  • This makes assessment and learning much easier. For example, the CPCS network (Pradhan et al., 1994) uses noisy-OR and noisy-MAX distributions to model relationships among diseases and symptoms in internal medicine.
  • With 448 nodes and 906 links, it requires only 8,254 values instead of 133,931,430 for a network with full CPTs.

34 of 94

Bayesian nets with continuous variables

  • Many real-world problems involve continuous quantities, such as height, mass, temperature, and money; in fact, much of statistics deals with random variables whose domains are continuous.
  • By definition, continuous variables have an infinite number of possible values, so it is impossible to specify conditional probabilities explicitly for each value.
  • One possible way to handle continuous variables is to avoid them by using discretization that is, dividing up the possible values into a fixed set of intervals.
  • For example, temperatures could be divided into (<0oC), (0oC−100oC), and (>100oC). Discretization is sometimes an adequate solution, but often results in a considerable loss of accuracy and very large CPTs.
  • The most common solution is to define standard families of probability density functions that are specified by a finite number of parameters.

35 of 94

Cont…

  • For example, a Gaussian (or normal)

36 of 94

Cont…

37 of 94

Cont…

  • For this example, then, the conditional distribution for Cost is specified by naming the linear Gaussian distribution and providing the parameters at, bt, σt, af , bf, and σf . Figures 14.6(a) and (b) show these two relationships.

38 of 94

Cont…

  • Notice that in each case the slope is negative, because cost decreases as supply increases. (Of course, the assumption of linearity implies that the cost becomes negative at some point; the linear model is reasonable only if the harvest size is limited to a narrow range.)
  • Figure 14.6(c) shows the distribution P(c | h), averaging over the two possible values of Subsidy and assuming that each has prior probability 0.5.
  • This shows that even with very simple models, quite interesting distributions can be represented.
  • The linear Gaussian conditional distribution has some special properties. A network containing only continuous variables with linear Gaussian distributions has a joint distribution that is a multivariate Gaussian distribution (see Appendix A) over all the variables (Exercise 14.9).
  • Furthermore, the posterior distribution given any evidence also has this property.

39 of 94

Cont…

  • When discrete variables are added as parents (not as children) of continuous variables, the network defines a conditional Gaussian, or CG, distribution: given any assignment to the discrete variables, the distribution over the continuous variables is a multivariate Gaussian.
  • Now we turn to the distributions for discrete variables with continuous parents.
  • Consider, for example, the Buys node in Figure 14.5. It seems reasonable to assume that the customer will buy if the cost is low and will not buy if it is high and that the probability of buying varies smoothly in some intermediate region.
  • In other words, the conditional distribution is like a “soft” threshold function.

40 of 94

Cont…

  • One way to make soft thresholds is to use the integral of the standard normal distribution:

  • which means that the cost threshold occurs around μ, the width of the threshold region is proportional to σ, and the probability of buying decreases as cost increases.
  • This probit distribution (pronounced “pro-bit” and PROBIT short for “probability unit”) is illustrated in Figure 14.7(a).
  • The form can be justified by proposing that the underlying decision process has a hard threshold, but that the precise location of the threshold is subject to random Gaussian noise.

41 of 94

Cont…

42 of 94

Cont…

43 of 94

EXACT INFERENCE IN BAYESIAN NETWORKS

  • The basic task for any probabilistic inference system is to compute the posterior probability distribution for a set of query variables, given some observed event that is, some assignment of values to a set of evidence variables.
  • To simplify the presentation, we will consider only one query variable at a time; the algorithms can easily be extended to queries with multiple variables.
  • X denotes the query variable; E denotes the set of evidence variables E1, . . . ,Em, and e is a particular observed event; Y will denotes the nonevidence, nonquery variables Y1, . . . , Yl (called the hidden variables).
  • Thus, the complete set of variables is X={X}∪E Y.
  • A typical query asks for the posterior probability distribution P(X | e).

44 of 94

Cont…

  • In the burglary network, we might observe the event in which JohnCalls =true and MaryCalls =true.
  • We could then ask for, say, the probability that a burglary has occurred:

  • In this section we discuss exact algorithms for computing posterior probabilities and will consider the complexity of this task.
  • It turns out that the general case is intractable, so Section 14.5 covers methods for approximate inference.

45 of 94

Inference by enumeration

  • Any conditional probability can be computed by summing terms from the full joint distribution.
  • More specifically, a query P(X | e) can be answered using Equation (13.9), which we repeat here for convenience:

  • Now, a Bayesian network gives a complete representation of the full joint distribution. More specifically, Equation (14.2) shows that the terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network.
  • Therefore, a query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network.

46 of 94

Cont…

  • Consider the query P(Burglary | JohnCalls =true,MaryCalls =true). The hiddenvvariables for this query are Earthquake and Alarm. From Equation (13.9), using initial letters for the variables to shorten the expressions, we have

  • The semantics of Bayesian networks (Equation (14.2)) then gives us an expression in terms of CPT entries. For simplicity, we do this just for Burglary =true:

47 of 94

Cont…

48 of 94

Cont…

  • That is, the chance of a burglary, given calls from both neighbors, is about 28%.
  • The evaluation process for the expression in Equation (14.4) is shown as an expression tree in Figure 14.8. The ENUMERATION-ASK algorithm in Figure 14.9 evaluates such trees using depth-first recursion.
  • The algorithm is very similar in structure to the backtracking algorithm for solving CSPs (Figure 6.5) and the DPLL algorithm for satisfiability (Figure 7.17).
  • The space complexity of ENUMERATION-ASK is only linear in the number of variables: the algorithm sums over the full joint distribution without ever constructing it explicitly.
  • Unfortunately, its time complexity for a network with n Boolean variables is always O(2n)—better than the O(n 2n) for the simple approach described earlier, but still rather grim.

49 of 94

Cont…

  • Note that the tree in Figure 14.8 makes explicit the repeated subexpressions evaluated by the algorithm. The products P(j | a)P(m| a) and P(j | ¬a)P(m| ¬a) are computed twice, once for each value of e. The next section describes a general method that avoids such wasted computations.

50 of 94

Cont…

51 of 94

Cont…

52 of 94

The variable elimination algorithm

  • The enumeration algorithm can be improved substantially by eliminating repeated calculations of the kind illustrated in Figure 14.8.
  • The idea is simple: do the calculation once and save the results for later use. This is a form of dynamic programming.
  • There are several versions of this approach; we present the variable elimination algorithm, which is the simplest.
  • Variable elimination works by evaluating expressions such as Equation (14.4) in right-to-left order (that is, bottom up in Figure 14.8).
  • Intermediate results are stored, and summations over each variable are done only for those portions of the expression that depend on the variable.
  • Let us illustrate this process for the burglary network. We evaluate the expression

53 of 94

Cont…

  • where the “×” operator is not ordinary matrix multiplication but instead the pointwise product operation, to be described shortly.

54 of 94

Cont…

  • The process of evaluation is a process of summing out variables (right to left) from point wise products of factors to produce new factors, eventually yielding a factor that is the solution, i.e., the posterior distribution over the query variable.
  • The steps are as follows:

55 of 94

Cont…

56 of 94

Operations on factors

57 of 94

Cont…

  • Summing out a variable from a product of factors is done by adding up the submatrices formed by fixing the variable to each of its values in turn.
  • For example, to sum out A from f3(A,B,C), we write

58 of 94

Cont…

59 of 94

Cont…

60 of 94

Variable ordering and variable relevance

  • The algorithm in Figure 14.11 includes an unspecified ORDER function to choose an ordering for the variables.
  • Every choice of ordering yields a valid algorithm, but different orderings cause different intermediate factors to be generated during the calculation.
  • For example, in the calculation shown previously, we eliminated A before E; if we do it the other way, the calculation becomes

  • during which a new factor f6(A,B) will be generated.
  • In general, the time and space requirements of variable elimination are dominated by the size of the largest factor constructed during the operation of the algorithm.
  • This in turn is determined by the order of elimination of variables and by the structure of the network.

61 of 94

Cont….

62 of 94

The complexity of exact inference

  • The complexity of exact inference in Bayesian networks depends strongly on the structure of the network.
  • The burglary network of Figure 14.2 belongs to the family of networks in which there is at most one undirected path between any two nodes in the network.
  • These are called singly connected networks or polytrees, and they have a particularly nice property: The time and space complexity of exact inference in polytrees is linear in the size of the network.
  • Here, the size is defined as the number of CPT entries; if the number of parents of each node is bounded by a constant, then the complexity will also be linear in the number of nodes.
  • For multiply connected networks, such as that of Figure 14.12(a), variable elimination can have exponential time and space complexity in the worst case, even when the number of parents per node is bounded.

63 of 94

Cont…

  • This is not surprising when one considers that because it includes inference in propositional logic as a special case, inference in Bayesian networks is NP-hard.
  • In fact, it can be shown (Exercise 14.16) that the problem is as hard as that of computing the number of satisfying assignments for a propositional logic formula.
  • This means that it is #P-hard (“number-P hard”)—that is, strictly harder than NP-complete problems.
  • There is a close connection between the complexity of Bayesian network inference and the complexity of constraint satisfaction problems (CSPs).
  • Measures such as tree width, which bound the complexity of solving a CSP, can also be applied directly to Bayesian networks.
  • Moreover, the variable elimination algorithm can be generalized to solve CSPs as well as Bayesian networks.

64 of 94

Cont…

65 of 94

Clustering algorithms

  • The variable elimination algorithm is simple and efficient for answering individual queries. If we want to compute posterior probabilities for all the variables in a network, however, it can be less efficient.
  • For example, in a polytree network, one would need to issue O(n) queries costing O(n) each, for a total of O(n2) time.
  • Using clustering algorithms (also known as join tree algorithms), the time can be reduced to O(n). For this reason, these algorithms are widely used in commercial Bayesian network tools.
  • The basic idea of clustering is to join individual nodes of the network to form cluster nodes in such a way that the resulting network is a polytree.
  • For example, the multiply connected network shown in Figure 14.12(a) can be converted into a polytree by combining the Sprinkler and Rain node into a cluster node called Sprinkler+Rain, as shown in Figure 14.12(b).

66 of 94

Cont…

  • The two Boolean nodes are replaced by a “meganode” that takes on four possible values: tt, tf , ft, and ff. The meganode has only one parent, the Boolean variable Cloudy, so there are two conditioning cases.
  • Although this example doesn’t show it, the process of clustering often produces meganodes that share some variables.
  • Once the network is in polytree form, a special-purpose inference algorithm is required, because ordinary inference methods cannot handle meganodes that share variables with each other.
  • Essentially, the algorithm is a form of constraint propagation (see Chapter 6) where the constraints ensure that neighboring meganodes agree on the posterior probability of any variables that they have in common.
  • With careful bookkeeping, this algorithm is able to compute posterior probabilities for all the nonevidence nodes in the network in time linear in the size of the clustered network.

67 of 94

APPROXIMATE INFERENCE IN BAYESIAN NETWORKS

  • Given the intractability of exact inference in large, multiply connected networks, it is essential to consider approximate inference methods.
  • This section describes randomized sampling algorithms, also called Monte Carlo algorithms, that provide approximate answers whose accuracy depends on the number of samples generated.
  • Monte Carlo algorithms, of which simulated annealing (page 126) is an example, are used in many branches of science to estimate quantities that are difficult to calculate exactly.
  • In this section, we are interested in sampling applied to the computation of posterior probabilities.
  • We describe two families of algorithms: direct sampling and Markov chain sampling.
  • Two other approaches variational methods and loopy propagation are mentioned in the notes at the end of the chapter.

68 of 94

Direct sampling methods

  • The primitive element in any sampling algorithm is the generation of samples from a known probability distribution.
  • For example, an unbiased coin can be thought of as a random variable Coin with values heads, tails and a prior distribution P(Coin) = 0.5, 0.5.
  • Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails.
  • Given a source of random numbers uniformly distributed in the range [0, 1], it is a simple matter to sample any distribution on a single variable, whether discrete or continuous.
  • The simplest kind of random sampling process for Bayesian networks generates events from a network that has no evidence associated with it.
  • The idea is to sample each variable in turn, in topological order.

69 of 94

Cont…

  • The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable’s parents.
  • This algorithm is shown in Figure 14.13. We can illustrate its operation on the network in Figure 14.12(a), assuming an ordering [Cloudy, Sprinkler , Rain, WetGrass ]:

70 of 94

Cont…

71 of 94

Cont…

72 of 94

Cont…

73 of 94

Cont…

  • That is, the probability of the event can be estimated as the fraction of all complete events generated by the sampling process that match the partially specified event.
  • For example, if we generate 1000 samples from the sprinkler network, and 511 of them have Rain =true, then the estimated probability of rain, written as ˆ P(Rain =true), is 0.511.

74 of 94

Rejection sampling in Bayesian networks

75 of 94

Cont…

76 of 94

Likelihood weighting

  • Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e.
  • It is a particular instance of the general statistical technique of importance sampling, tailored for inference in Bayesian networks.
  • We begin by describing how the algorithm works; then we show that it works correctly that is, generates consistent probability estimates.
  • LIKELIHOOD-WEIGHTING fixes the values for the evidence variables E and samples only the nonevidence variables. This guarantees that each event generated is consistent with the evidence.
  • Not all events are equal, however. Before tallying the counts in the distribution for the query variable, each event is weighted by the likelihood that the event accords to the evidence, as measured by the product of the conditional probabilities for each evidence variable, given its parents.
  • Intuitively, events in which the actual evidence appears unlikely should be given less weight.

77 of 94

Cont…

78 of 94

Cont…

79 of 94

Cont…

  • To understand why likelihood weighting works, we start by examining the sampling probability SWS for WEIGHTED-SAMPLE.
  • Remember that the evidence variables E are fixed with values e. We call the nonevidence variables Z (including the query variable X).
  • The algorithm samples each variable in Z given its parent values:

80 of 94

Cont…

81 of 94

Cont…

82 of 94

Cont…

83 of 94

Cont…

84 of 94

Inference by Markov chain simulation

  • Markov chainMonte Carlo (MCMC) algorithms work quite differently from rejection sampling and likelihood weighting.
  • Instead of generating each sample from scratch, MCMC algorithms generate each sample by making a random change to the preceding sample.
  • It is therefore helpful to think of an MCMC algorithm as being in a particular current state specifying a value for every variable and generating a next state by making random changes to the current state.
  • Here we describe a particular form of MCMC called Gibbs sampling, which is especially well suited for Bayesian networks.
  • We will first describe what the algorithm does, then we will explain why it works.

85 of 94

Gibbs sampling in Bayesian networks

  • The Gibbs sampling algorithm for Bayesian networks starts with an arbitrary state (with the evidence variables fixed at their observed values) and generates a next state by randomly sampling a value for one of the nonevidence variables Xi.
  • The sampling for Xi is done conditioned on the current values of the variables in the Markov blanket of Xi.
  • The algorithm therefore wanders randomly around the state space the space of possible complete assignments flipping one variable at a time, but keeping the evidence variables fixed.
  • Consider the query P(Rain | Sprinkler =true, WetGrass =true) applied to the network in Figure 14.12(a).
  • The evidence variables Sprinkler and WetGrass are fixed to their observed values and the nonevidence variables Cloudy and Rain are initialized randomly let us say to true and false respectively. Thus, the initial state is [true, true, false, true].

86 of 94

Cont…

87 of 94

Cont…

88 of 94

Why Gibbs sampling works

  • We will now show that Gibbs sampling returns consistent estimates for posterior probabilities.
  • The material in this section is quite technical, but the basic claim is straightforward: the sampling process settles into a “dynamic equilibrium” in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability.
  • This remarkable property follows from the specific transition probability with which the process moves from one state to another, as defined by the conditional distribution given the Markov blanket of the variable being sampled.

89 of 94

Cont…

90 of 94

Cont…

91 of 94

Cont…

92 of 94

Cont…

93 of 94

Cont…

94 of 94

Cont…