MLAPP errata

- 1.4.6 logistic regression, footnote 9: it was actually Jakob Bernoulli, not Daniel. cf http://jeff560.tripod.com/b.html
- 10.2.5 final sentence "regression weights correspond to a Cholesky decomposition of $\Sigma$": however, it is not the case that W = chol(\Sigma). Also note that in the text, the "regression weights" are the w_ij.
- 10.5.3 and 19.2.1 Markov blanket = the smallest set of nodes which …
- Fig 11.13: 10% out of 5 times ?? Also cf p 356/357: the claim that MAP never encounters numerical problems is unclear, because the figure seems to show problems in the MAP case too.
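
The 10.2.5 point above can be checked numerically. A minimal sketch, assuming the linear-Gaussian DGM is written as x - mu = W (x - mu) + S z with W strictly lower triangular and S = diag(sigma); the chain structure and all numbers are made up for illustration:

```python
import numpy as np

# Hypothetical 3-node chain x1 -> x2 -> x3: W[t, s] is the weight of parent s on child t.
W = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])
S = np.diag([1.0, 0.5, 2.0])      # conditional standard deviations (assumed values)

# x - mu = W (x - mu) + S z  =>  x - mu = (I - W)^{-1} S z
U = np.linalg.inv(np.eye(3) - W)
Sigma = U @ S @ S @ U.T

L = np.linalg.cholesky(Sigma)     # lower-triangular factor, Sigma = L L^T

print(np.allclose(L, U @ S))      # the Cholesky factor is (I - W)^{-1} S ...
print(np.allclose(L, W))          # ... not W itself
print(W[2, 0], L[2, 0])           # no edge 1 -> 3 in W, yet L[2, 0] is non-zero
```

Under these assumptions the same computation also illustrates the 19.4.4.1 remark: a sparse W does not give a sparse chol(\Sigma).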

- below eq 13.28 D_t = ||\gamma_t ||_0 (?)
- above eq 13.38: is the l1 constraint really called a linear constraint (which I thought of as Ax+b>0) ? Ex 13.11 is about linear constraints too, but there it certainly is not OK to resort to a constraint in terms of an l1-norm.
- eq 13.53: the text states that eq 13.52 rewrites as 13.53. Should be “w is a local minimum iff 13.53” ?
- below eq 13.56, text describing fig 13.5a: dotted = black, solid = red. Also fig has got only c_k axis instead of c_j, w_j. Could also place +/- lambda here
- 13.4.4 “We show how to solve the lasso problem using EM” (not “using lasso”)
- eq 13.88 should be exp(-1/2 w^T D_\tau^-1 w) [^-1 must appear]

- Suggestion: drop one of \Lambda or D_\tau

- eq 13.95 suggestion: note 1/\varpi instead of \sigma^2
- eq 13.97: \pi’(w_j) -> \gamma
- p449 first line: inverting -> computing!

- Below eq 13.91 “denote the result of this E step”: should actually try to obtain 1/E[...] since |w_j| may -> 0
- 13.4.4.3 should warn against doing eq 13.91 and say “we’ll compute \bar{\Lambda}^-1 here, not \bar{\Lambda}”
- Suggestion: remove one of \bar{\Lambda} or \psi, and consider removing D_\tau from 13.88

- eq 13.149: \beta -> \beta^-1, A -> A^-1
- eq 13.150: \beta -> \beta^-1
- eq 13.151, 153, 155: alpha bold*
- fig 13.13 legend refers to top/ bottom, left/ right
- table 13.2 suggestion remind that line 1 corresponds to lasso 13.4.4, line 2 to HAL 13.6.2
- ex 13.11 the suggested substitution should read \theta = \theta_+ - \theta_-, not [..., ...]
- p464 footnote echoes footnote p238

- eq 14.27 should have minus sign (also suggest make notation consistent with p193)
- fig 14.4d, 14.5d: suspect error in code for SVM regression and classification: eg only orange points are circled as support vectors, only points above the green line are support vectors.
- Fig 14.10 for consistency with text, \ksi, \ksi^* should be \ksi^+ and \ksi^-
- Fig 14.14 C2/C3 inverted at bottom right
- 14.42, 43 replace M by N

- eq 19.12 log Z, not Z
- eq 19.19 minus signs missing before the first term
- below eq 19.19, c = -1/2 \Sigma b, not \Sigma^-1

[I think it should be

$\vSigma^{-1} = \vW$, $\vmu \defeq \vSigma \vb$,

and $c \defeq \half \vmu^T \vSigma^{-1} \vmu$ -- KPM]
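
A quick way to check the definitions in the note above is to expand the square. This sketch assumes the target is the Gaussian-like form $-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu) + c$ with symmetric W, and uses standard bold symbols rather than the book's macros:

```latex
\begin{align*}
-\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y}-\boldsymbol{\mu}) + c
&= -\tfrac{1}{2}\mathbf{y}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}
   + \mathbf{y}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}
   - \tfrac{1}{2}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + c \\
&= -\tfrac{1}{2}\mathbf{y}^T \mathbf{W} \mathbf{y} + \mathbf{b}^T \mathbf{y}
\end{align*}
```

The second line uses $\Sigma^{-1} = W$, $\Sigma^{-1} \mu = b$ (from $\mu = \Sigma b$), and $c = \frac{1}{2} \mu^T \Sigma^{-1} \mu$, which cancels the constant. With these definitions the quadratic term carries a minus sign, which appears consistent with the eq 19.19 item above.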

- below 19.21 should be > -b_s
- eq 19.22 replace 0 by e^0. Say K=3 !
- below eq 19.22 replace K by J
- 19.4.4.1, 1st sentence "and hence sparse Cholesky factorizations of covariance matrix": it is not the case that if the Gaussian DGM has missing edges (ie the regression matrix W is sparse), then chol(\Sigma) is sparse.
- sect 19.4.4.1, p674: For example in Fig 19.11(a) we see that: replace Y by X in the remainder to be consistent with figure
- eq 19.49, eq 19.50 remove the characters “p_emp(“

- eq 19.50 add 1/N to \ell_ML

- 19.5.7.1 (IPF example, problem setting unclear): Y_1 indicates gender (0 for male, 1 for female), Y_2 indicates handedness (0 for right-handed, 1 for left-handed). We are then concerned with p(Y_1, Y_2)=1/Z \psi_1(Y_1) \psi_2(Y_2). The table of model expectations is M, the table of empirical probabilities is C.
- going from eq 19.89 to 19.90: sign of E(w) should be >0 I think
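
For the 19.5.7.1 setting described above, IPF can be sketched in a few lines. The probabilities below are hypothetical (not the book's table); C and M follow the naming in that item, and psi1, psi2 are illustrative names for the two singleton potentials:

```python
import numpy as np

# Hypothetical empirical probabilities C[y1, y2]:
# y1 = gender (0 male, 1 female), y2 = handedness (0 right, 1 left).
C = np.array([[0.43, 0.05],
              [0.45, 0.07]])

psi1 = np.ones(2)   # potential on Y_1
psi2 = np.ones(2)   # potential on Y_2

def model_table(psi1, psi2):
    """Model probabilities M[y1, y2] = psi1[y1] * psi2[y2] / Z."""
    unnorm = np.outer(psi1, psi2)
    return unnorm / unnorm.sum()

for sweep in range(5):
    # Rescale each potential by (empirical marginal) / (model marginal).
    M = model_table(psi1, psi2)
    psi1 = psi1 * C.sum(axis=1) / M.sum(axis=1)
    M = model_table(psi1, psi2)
    psi2 = psi2 * C.sum(axis=0) / M.sum(axis=0)

M = model_table(psi1, psi2)
print(M.sum(axis=1), C.sum(axis=1))   # gender marginals agree
print(M.sum(axis=0), C.sum(axis=0))   # handedness marginals agree
```

Because the model has no pairwise potential, M ends up as the product of the two empirical marginals rather than reproducing C itself.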

- below 19.90 L(y,y^*) should be \tilde{L}(y,y^*)
- above eq 19.92, def of f(x;w) not consistent with usage later on, eg eq 19.95. [I have modified eqn 19.55 - KPM]
- eq 19.96: gamma should appear after max
- eq 19.95: formula shouldn’t involve f at all, and the indices are messed up. Rewrite as min_i w^T \phi(x_i, y_i) - max_{y \in \mathcal{Y} \setminus y_i} w^T \phi(x_i,y) (or more easily, introduce gamma in eq 19.92)
- section 19.7.3

- Algorithm 11 refers to Algo 19.3
- “by replacing the expression etc”: replace y by \hat{y_i}

- above eq 19.108, discussing w’ : w’^T \delta_i(y) = -L(y_i, y) + w^T \delta_i (y) (L term missing)
- algo 19.3 line 5 replace \hat(y_i) by y

- line 6 replace y by \hat{y_i}
- line 9 replace y’ and \hat{y_i} by y (then you recover exactly eq 19.100 and 19.110, the latter requiring correction cf below)
- suggestion: mention how the convergence rate for Algo 19.3 is 1/\epsilon^2 (cf Nowozin & Lampert p158 Algo 15)

- 19.101 \delta_i -> \delta_i (y)
- ex 19.2c Bishop reference
- 19.106 need R=1/2N ||w||^2_2 (add the N)
- 19.110 replace \ksi by \sum_i \ksi_i ; replace C/N \ksi by C \sum_i \ksi_i
- 19.110 replace L(y_i,\bar{y_i}) by L(y_i, y) (or replace y by \bar{y_i}, to look more like 19.109)
- Algo 19.4 line 4: ksi -> ksi_i
- below Algo 19.4 : reference to Algorithm 10 should be to Algo 19.4. “constant independent on N” should read “independent of N”
- 19.119 h_i -> h_i^*
- 19.120 replace L by \tilde{L} [suggestion = harmonize L/Ltilde between the exposition in 19.7.1, in 19.7.2 and here in 19.7.5, eg reverse L/Ltilde in 19.7.1]

- eq 20.55 : should be a large Sum, not product ! cf PGM 9.4.1

- eq 21.39 +L_i(x_i) otherwise sign error in eq 21.41 propagates down to 21.52
- p736: Suggestion: mention that unless \Lambda is singular, the only valid solution is m_1=\mu_1, m_2=\mu_2 (cf corresponding passage in Bishop chap 10). The reader is left confused otherwise.
- eq 21.82 missing kappa_0 / 2
- eq 21.97 E[log lambda] {q(lambda} -> E_{q(lambda)} [log lambda]
- eq 21.98 term -b_0 a_N/b_N from 12.96 should stay
- in code for normalGammaPdf, should use kappa, not kappa * lambda, as the precision for the Gaussian. Specifically, that’s the function used to compute q_mu q_lambda, which factorizes, so lambda must not appear in N() !
- in code: for case mu_0 kappa_0 != 0, exact posterior assumes this. Guess: this manifests itself in the lower bound not increasing monotonically ?
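
To make the normalGammaPdf point concrete, here is a minimal Python sketch of the factorised density, assuming q(mu) = N(mu | mu_N, 1/kappa_N) and q(lambda) = Ga(lambda | a_N, b_N) in the shape/rate parameterisation. The function name is hypothetical and this is not the book's code:

```python
import math

def log_q_mu_lambda(mu, lam, mu_N, kappa_N, a_N, b_N):
    """log q(mu) + log q(lam) for the factorised approximate posterior."""
    # Gaussian term: the precision is kappa_N alone, with no factor of lam.
    log_q_mu = (0.5 * math.log(kappa_N / (2 * math.pi))
                - 0.5 * kappa_N * (mu - mu_N) ** 2)
    # Gamma term, shape a_N and rate b_N.
    log_q_lam = (a_N * math.log(b_N) - math.lgamma(a_N)
                 + (a_N - 1) * math.log(lam) - b_N * lam)
    return log_q_mu + log_q_lam
```

A simple sanity check on factorisation: the difference log q(mu, lam1) - log q(mu, lam2) should not depend on mu, which would fail if lam entered the Gaussian precision.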

- eq 21.115 missing p(alpha)
- p760 algo 15 should be 21.1
- eq 21.168-170 should have p(w) not p(w|D)
- eq 21.171 LHS should be -KL( N(m_N, V_N) || N(m_0,V_0) )
- above eq 21.172, replace mu_0 by m_0
- eq 21.173 RHS missing \sum_i=1^N
- eq 21.176 V_n -> V_N
- algo 21.1 line 8: m -> psi
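
The corrected left-hand side of eq 21.171 is the negative of a KL divergence between two Gaussians, which has a standard closed form. A minimal sketch, with a hypothetical function name and the argument order following the erratum above:

```python
import numpy as np

def kl_gauss(m_N, V_N, m_0, V_0):
    """KL( N(m_N, V_N) || N(m_0, V_0) ) for full covariance matrices."""
    D = len(m_N)
    V0_inv = np.linalg.inv(V_0)
    diff = m_0 - m_N
    _, logdet_0 = np.linalg.slogdet(V_0)
    _, logdet_N = np.linalg.slogdet(V_N)
    return 0.5 * (np.trace(V0_inv @ V_N) + diff @ V0_inv @ diff
                  - D + logdet_0 - logdet_N)

# One-dimensional check against log(s0/sN) + (sN^2 + (mN - m0)^2) / (2 s0^2) - 1/2.
print(kl_gauss(np.array([0.0]), np.array([[1.0]]),
               np.array([1.0]), np.array([[4.0]])))
```

The term to plug into the lower bound would then be -kl_gauss(m_N, V_N, m_0, V_0).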

Chap 23

- Sec 23.4.1. Derivation of optimal importance weights is wrong (see email from 6th September)

Chap 24

- 24.4.1 and fig 24.10: figure legend says 200 Gibbs iterations, text 500
- Figure 24.3 (p.844) speaks of 300 data points shown in Figure 25.7 (p.888). Figure 25.8 (p.889) also speaks of 300 data points shown in Figure 25.7. Figure 25.7 itself only shows 100 data points, though.
- Page 853, after the first equation (24.55): "As long as each $q_k$ is individually valid, the overall proposal will also be valid." I think that this is too strong, the following should suffice: "As long as there is a valid $q_k$ with $w_k > 0$, the overall proposal will also be valid." And even this would be too strong (see the definition of validity in Eq. 24.52).
- Also, the sentence before the equation is a bit funny, it sounds like "If you don't know what proposal distribution to choose, try a mixture proposal, so you have to choose even more distributions plus mixture weights." I would suggest starting with "If one doesn't want to settle on a single kind of proposal distribution, one can try ..."
- 24.80 should use [] not ()
- 24.82 should use \bar{f} not f, I think
- 24.85 and 24.91 missing p() =
- legends fig 24.7 and 24.12 (and maybe other places in the text referring to this experiment): mix-up between variance and standard deviation: the values 1, 8, 500 are standard deviations, not variances ! Also, (100, 100) are the variances of the MoG components, denoted by bold \sigma, which is unfortunate (Suggestion: keep lowercase \sigma for standard deviation, and v, V or uppercase \Sigma for variance/ covariance)
- fig 24.14: legend has f, but text has p tilde. Text should say somewhere that p tilde is unnormalized.
- eq 24.111 replace q by g
- sentence just after eq 24.211: the prior partition function is Z_n, and we want to estimate Z_0 (?)

Chap 25

- line before eq 25.46: stochastic matrix is not L_rw but D^-1 W

- parag below eq 25.47: introduce T as the row-normalized U (T is not defined at all)
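
The distinction in the eq 25.46 item is easy to verify numerically. A minimal sketch, assuming the random-walk Laplacian is defined as L_rw = I - D^{-1} W; the weight matrix is hypothetical:

```python
import numpy as np

# Hypothetical symmetric weight matrix of a small graph.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
D_inv = np.diag(1.0 / W.sum(axis=1))

P = D_inv @ W            # random-walk transition matrix D^{-1} W
L_rw = np.eye(3) - P     # random-walk Laplacian under the assumed definition

print(P.sum(axis=1))     # each row sums to 1 and entries are non-negative
print(L_rw.sum(axis=1))  # each row sums to 0, and off-diagonal entries are negative
```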

Chap 26

- Suggestion: Thm 26.6.1 : the assumption that the DGM is causal/obeys the causal Markov condition, mentioned at the start of the section, should be restated in the theorem for clarity.

Chap 27

- eq 27.114 F(v) = log \sum_h exp -E(...) [and not \sum_h E(...) as stated !]. The rest of the equations must be corrected, or else in the first line say this is “log F(v)”, not F(v).
- 27.7.1.2: replace v_ir by v_r
- 27.7.1.2: what is calligraphic S ?
- 27.7.1.3 and 27.7.2.2: usage of W_rk vs w_rk inconsistent
- 27.7.1.3 why is N indexed by c ?
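
For the eq 27.114 item, the log-sum-exp form can be checked by brute force on a tiny model. This sketch assumes a binary RBM with energy E(v, h) = -(v^T W h + b^T v + c^T h); the symbol names and numbers are illustrative and may differ from the book's:

```python
import itertools
import math

# Hypothetical tiny binary RBM: 2 visible units, 3 hidden units.
W = [[0.5, -1.0, 0.2],
     [1.5, 0.3, -0.7]]   # W[r][k] couples visible r and hidden k
b = [0.1, -0.2]          # visible biases
c = [0.0, 0.4, -0.3]     # hidden biases

def energy(v, h):
    """E(v, h) = -(v^T W h + b^T v + c^T h)."""
    vWh = sum(v[r] * W[r][k] * h[k]
              for r in range(len(v)) for k in range(len(h)))
    return -(vWh + sum(bi * vi for bi, vi in zip(b, v))
             + sum(ci * hi for ci, hi in zip(c, h)))

def log_sum_exp_neg_energy(v):
    """log sum_h exp(-E(v, h)), enumerating all hidden configurations."""
    return math.log(sum(math.exp(-energy(v, h))
                        for h in itertools.product([0, 1], repeat=len(c))))

def closed_form(v):
    """b^T v + sum_k log(1 + exp(c_k + sum_r v_r W[r][k]))."""
    total = sum(bi * vi for bi, vi in zip(b, v))
    for k in range(len(c)):
        act = c[k] + sum(v[r] * W[r][k] for r in range(len(v)))
        total += math.log1p(math.exp(act))
    return total

for v in itertools.product([0, 1], repeat=len(b)):
    print(v, log_sum_exp_neg_energy(v), closed_form(v))
```

The two columns agree because the sum over h factorises across hidden units, which only works when the exponential sits inside the sum.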

- p480 an example of a a
- p483 and can be computed
- p486 the the
- p486 have shown
- p487 these techniques is hard
- fig 7.9 axes u_1, u_2 in picture but w_1, w_2 in text
- p490 algorothm
- p441 analyticall.
- below eq 13.67 || … |_2^2
- section 13.4.2 text about warm starting etc is duplicated in two places in this section
- p442 c.f. -> cf.
- p445 epxanding
- below 13.86 distibution, missing ) at Expon(....)
- eq 13.86 consider replacing gamma by lambda for consistency with eq 13.32 - KPM skip
- eq 13.88 suggestion: rearrange in lines according to eq 13.87
- p442 generating generalization pathS
- p463 this this method is called called; case can be It is conventional
- ex 13.8: intN
- sec 13.2.1: X_\gamma : sometimes gamma is bold, sometimes not, suggestion harmonize
- forgot to define w_\gamma, X_\gamma for eq 13.12
- eq 13.28: notation inconsistent \gamma_t, or ^(t). Also : extra ) after w_t+1
- “Backwards selection Backwards selection”
- p428 depende
- p486 satsify, the the, have show
- below eq 13.47, must subscript |\hat{\theta}
- eq 13.92 remove superfluous () around N
- section 19.7 that that
- p843, one but last sentence in a paragraph after eq. 24.27: superfluous "predictive"
- eq 19.93 space the quantifiers away from the inequality, or use different punctuation, eg \forall x, \forall y \in blah : blah blah
- eq 27.103 missing , between v and theta
- details on to -> on how to
- Zustandssumme (with a capital Z in German)
- dstribution
- discriminativel
- proposed in in
- there are directed edge
- principle problem -> principal
- preceeding
- matrix-vetor
- how long can the variance go -> low
- as slow as -> low
- accidently -> accidentally
- wholeheatedly
- so it unwise
- have gotten very close to
- In the true model is Fig 26.16b
- Punctuation:

- Superfluous second full stop/period/”.” after Thm 26.6.1
- idem ex 10.6

- unupervised
- and then to from its posterior
- diaogonal
- comon
- satsified
- below eq 11.34 z_i x -> z_i^*
- reference broken in 11.4.5 “section ??”
- iff there a
- 8.5.2.3 see Algorithm 8 for some pseudocode -> 8.3
- minimized the posterior -> minimize
- eq 19.84 remove v in L(vy_i …) + add large round bracket around summand of \sum_i
- below eq (8.83) old values of are forgotten
- p682 see algo 7 for the pseudocode (should be algo 19.2)
- p683 easily to generalized
- p855 applied to Bayesian inference fro
- p247 eq 8.5 f(w) undefined, should be NLL(w)
- p856 paragraph -2: it takes over 100 steps: figure seems to show it even takes over 200
- p838 footnote 2 - messages For comparison
- p39 Student t distribution (missing period) before 2.51
- p583 practioners, there the, there is lots of (sugg. there is much)
- p855 line -4 fro
- 24.5.3 footnote 5 no space before \footnote{}
- p226 so sommon
- p224 which as follows
- p498 thse are points, once … we can then (remove then)...
- p499 subjet to, we disuss margins
- p501 form a completely, the factOR of 1/2
- everywhere: replace c.f. by cf. (latin confer; the erroneous c.f. for some reason I find only in machine learning papers)
- p869 it it
- p871 ditribution
- section 21.5, first paragraph, line -1: described -> describe
- (p???) putting it altogether -> all together
- eq 21.14, 21.15 ln -> log
- ex 21.7 add “y” to designate columns in table
- p739 in to -> into
- p769 per edge. we get -> ,
- fig 22.5 caption: messasges asynchnronous x2. Also suggest to revise caption of line types (even Koller’s original could be more helpful). b) has solid twice. Dashed is different in a) and b). Could be “sync damped”, “async damped”, “async undamped”, “truth” to clarify.
- p773 damping factor Clearly: missing .
- p759 altogether -> all together
- p750 as well in -> as well as in
- p750 principle -> principal
- p126 section 4.5.1 that if…, then … -> that if …, then … (no that)
- p754 last line : unfortunately
- p747 relevancy
- 21.102 X^X
- 21.209 N(w| …)
- p734 proagation
- p746 altogether -> all together :-)
- p742 end of 21.5 described -> describe
- p740 relevancy

- “mixed directed graph” points to 931, which only refers on to 26.5.5 (where the index entry “directed mixed graph” points), which in turn refers to 19.4.4 (where “directed mixed graphical model” points). 19.4.4 is the “final destination” and should be the one pointed to.
- merge “sigmoid belief net” and “...nets”
- merge Student t, Student t distribution

- argh! “the monster” Wainwright&Jordan duplicated itself
- Kuss and C. Rasmussen (2006) -> Malte Kuss, initial missing
- Fu 1998 spelling
- Viterbi typo
- Fienberg 1970
- Yuille and He is 2012, not 2011
- Meinshausen et al 2010
- Altun 2006 - the chapter is actually called Support Vector Machine Learning for Interdependent and Structured Output Spaces / Yasemin Altun, Thomas Hofmann, and Ioannis Tsochantaridis; the book is actually called Predicting Structured Output; date 2006 (assuming we’re speaking of the same thing?)...
- Vanhatalo et al 2010
- p473 S. and Black 2009
- typos in Elkan 2006
- CarvaLHo 2010
- Bishop’s PRML is duplicated