1 of 47

Bayesian Reasoning

In Data Science

Cristiano Fanelli

12/8/2022 - Lecture 25

2 of 47

2

BRDS: Grand Fanelli Finale

3 of 47

Rationale

  • Congratulations! You made it to the last lecture of the course ;)
  • This lecture will provide a "30,000-foot view" of what we have done in this course.
    • Goal of this lecture: retain a global picture of the BRDS course
      • We will put all the "pieces" together, i.e. all the topics covered in this course on Bayesian Reasoning in Data Science.

3

4 of 47

Material

  • Course website https://cfteach.github.io/brds is an entry point to all the material (lectures, notebooks, references) used in the course

25 lectures

13 notebooks

4

5 of 47

Announcements

  • Third assignment due December 9, 11:59pm
  • Final is going to take place on Mon, December 12, from 4 to ~6pm (in person) at ISC 1291
    • You will have 22 mins for your talk + 3 mins for questions. Please be sure not to exceed the time allocated for your talk
    • After the 4 talks (25x4=100 mins), we will have +10 mins for discussion
  • Please remember to complete your evaluations at https://evals.wm.edu

5

6 of 47

L1: What is Probability?

6

  • Standard answers:

GdA, Ch. 3.1, 3.2

(i) [combinatorial definition] The ratio of the number of favorable cases to the number of all cases

(ii) [frequentist definition] The ratio of the number of times the event occurs in a test series to the total number of trials in the series

  • Neither of these statements can define the concept of probability:

(i) it lacks the clause "if all the cases are equally probable"; this definition is often labeled classical or Laplace, forgetting that Bayes, Gauss, Laplace, Bernoulli, etc. were aware of that clause.

(ii) it lacks the condition that the number of trials must be very large ("it goes to infinity"). Also, to use frequencies as a measurement of probability, we need to assume that a phenomenon that occurred in the past will happen in the future too, and with the same probability.

Laplace: probability theory is "good sense turned into calculation"

7 of 47

L1: Bayes' Rule

7

T. Bayes, 1701-1761

Statistician, philosopher

  • ...describes the probability of an event, based on prior knowledge of conditions that might be related to the event… [wikipedia]
  • One of the main applications is Bayesian inference… the theorem expresses how a degree of belief…should rationally change to account for the availability of related evidence

conditional probabilities

What is it all about? How can this be powerful?

Its simplicity can be “deceiving” as it involves the interpretation of probability.

posterior

likelihood

prior

marginal
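The four ingredients above can be wired together in a few lines; a minimal numeric sketch (the diagnostic-test numbers are invented for illustration):

```python
# Bayes' rule: posterior = likelihood * prior / marginal
# Hypothetical numbers: a test with 99% sensitivity, 5% false-positive
# rate, for a condition with 1% prevalence.
p_pos_given_d = 0.99      # likelihood P(+|D)
p_pos_given_not_d = 0.05  # P(+|not D)
p_d = 0.01                # prior P(D)

# marginal P(+) via the law of total probability
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# posterior P(D|+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ≈ 0.167: a positive test is far from certain
```

Note how the weak prior (1% prevalence) dominates the strong likelihood: this is exactly the "probability of causes" reasoning the rule enables.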

8 of 47

L1: Probability of Causes

8

A theory of probability which does not consider probabilities of hypotheses is unnatural and prevents transparent and consistent statements about the causes which may have produced the observed effects.

GdA, Ch. 1.6

True values

Measured values

When we perform a measurement we access the measured quantities, not the true values (e.g., if you repeat your measurement, the results may change depending on the precision and accuracy of your instrument).

9 of 47

L1: Bayesian is Everywhere

9

  • No Data Scientist can work without a solid grasp of conditional probability and Bayesian reasoning. → See, e.g., Bayesian Deep Learning.
  • Bayesian reasoning permeates multiple diverse fields and applications, e.g.
    • Business: pricing decisions and project risk for new product development
    • Marketing: A/B testing for click through rates
    • Stock markets: Bayesian networks used to identify future trends in stocks
    • Weather Forecast
    • Disease risk
    • Medical Diagnosis
    • Design
    • Hyperparameters optimization
    • Particle Physics Experiments
    • Image denoising
    • ….

From deciphering encrypted messages during the Second World War to hyperparameter tuning of neural networks

10 of 47

L1: Bayesian is Everywhere

10


Several of these topics were chosen for your mini-projects and final projects

11 of 47

L4: Coin Example “Revisited”

11

This problem has multiple applications: e.g., it could be also seen as the percentage of people visiting a webpage A instead of B
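For the coin (or webpage A/B) setup, the Beta-Binomial conjugacy gives the posterior in closed form; a minimal sketch (the prior parameters and counts are illustrative):

```python
# Conjugate update for a Bernoulli parameter θ with a Beta prior:
# Beta(a, b) prior + (heads, tails) data -> Beta(a + heads, b + tails)
a_prior, b_prior = 2, 2            # weakly informative prior (illustrative)
heads, tails = 7, 3                # observed data

a_post = a_prior + heads
b_post = b_prior + tails
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 3))  # Beta(9, 5), mean ≈ 0.643
```

The same update applies whether the data are coin flips or visits to page A vs page B.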

12 of 47

L4: Intro PyMC3

(cf. notebook mod1-part2)

12

PyMC3 can’t give us a formula for the posterior distribution. However, it can provide as many samples as we like from the posterior without explicitly calculating the denominator.

See Bayes’ theorem


14 of 47

L6: Credible vs Confidence Interval

14

Priors

A credible interval is an interval within which an unobserved parameter value falls with a particular probability.

It is an interval in the domain of a posterior probability distribution or a predictive distribution. The generalisation to multivariate problems is the credible region.

Credible intervals are analogous to confidence intervals and confidence regions in frequentist statistics, although they differ in the interpretation:

Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value

HDI: Highest Density Interval

ROPE: Region of Practical Equivalence

Bayesian Reasoning allows us to deal with uncertainties…
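A simple way to approximate an HDI from posterior samples (re-implemented here in NumPy for intuition; in the course notebooks ArviZ's az.hdi does this for you):

```python
import numpy as np

def hdi(samples, cred=0.94):
    """Narrowest interval containing a fraction `cred` of the samples."""
    x = np.sort(samples)
    n = len(x)
    m = int(np.floor(cred * n))          # number of points inside the interval
    widths = x[m:] - x[:n - m]           # widths of all candidate intervals
    i = np.argmin(widths)                # pick the narrowest one
    return x[i], x[i + m]

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 50_000)       # stand-in for posterior samples
lo, hi = hdi(samples)
print(round(lo, 2), round(hi, 2))        # roughly ±1.9 for a standard normal
```

Because it picks the narrowest interval, the HDI is well defined even for skewed posteriors, unlike an equal-tailed interval.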

15 of 47

L10: Bayesian Linear Regression

15

  • Highlight machine learning connection — umbrella term for a collection of methods to automatically learn patterns in data, and then use what we learn to predict future data or make decisions under uncertainty
  • Regression problems as an example of supervised learning
  • In this class we will compare the ordinary least squares (OLS) fitting procedure for linear regression with Bayesian linear regression
    • Optimization Problem (the one you are familiar with) VS Probabilistic Problem
  • We assume you are familiar with OLS as well as uncertainty propagation from previous courses. Nonetheless we will recall some concepts in class.
  • The probabilistic approach to the linear regression problem can be summarized as:
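As an illustration of the probabilistic formulation (y modeled as Normal with mean α + βx), with a Gaussian prior on the coefficients and known noise the posterior is available in closed form; a minimal NumPy sketch (the data and prior scale are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, x.size)  # true α=1, β=2

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
sigma2 = 0.1**2                            # known noise variance
tau2 = 10.0**2                             # prior variance on (α, β)

# Posterior for a N(0, τ² I) prior on the coefficients:
#   Σ = (XᵀX/σ² + I/τ²)⁻¹ ,   μ = Σ Xᵀy / σ²
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu = Sigma @ X.T @ y / sigma2
print(np.round(mu, 2))  # posterior mean close to the true (1, 2)
```

The output is a full posterior distribution over (α, β), not a single "best" point as in OLS; with a broad prior and plenty of data, its mean approaches the OLS solution.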

16 of 47

L10: Bayesian Linear Regression

16

  • Half Cauchy?
  • Why?
    • Generally works well as a regularizing prior that avoids overfitting

disturbance

17 of 47

L10: Bayesian Linear Regression

17

disturbance

μ expressed as "deterministic" — see code

Probabilistic Programming

18 of 47

L10: Bayesian Linear Regression

18

disturbance

μ expressed as "deterministic" — see code

19 of 47

L10: Bayesian Linear Regression: FAQ

19

I am familiar with linear regression models already, and I know methods for fitting, e.g., least square. Why should I use Bayesian linear regression?

  • Bayesian linear regression provides a useful mechanism to deal with insufficient or poorly distributed data. It allows you to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. [ref]
  • The aim of Bayesian Linear Regression is not to find the single “best” value of the model parameters, but rather to determine the posterior distribution for the model parameters.

In problems where we have limited data or have some prior knowledge that we want to use in our model, the Bayesian Linear Regression approach can both incorporate prior information and show our uncertainty. Bayesian Linear Regression reflects the Bayesian framework: we form an initial estimate and improve our estimate as we gather more data.

20 of 47

L10: Bayesian Polynomial Regression

20

In general, using a polynomial to fit data is not the best idea

A model that perfectly fits your data will in general do a poor job at fitting/describing unobserved data — this is called Overfitting

We will hopefully discuss more about this in the coming weeks

By extension

=> Polynomial regression

21 of 47

L13: Logistic Regression

21

Credits: University of Toronto

Credits: references [1], [2], [3]

Likely, you are familiar with logistic regression, a Machine Learning classification algorithm used to assign observations to a discrete set of classes.

Example

22 of 47

L13: Logistic Regression

22

Credits: references [1], [2], [3]

23 of 47

L14: All models are wrong…

23

  • All models are wrong, in the sense that some models will be better than others at describing a given problem
  • How to compare two or more models is a central problem in data analysis
  • Luckily we have a few strategies (and we will briefly discuss some of them)
  • Posterior Predictive Checks (mod1_part3)
  • One thing you can do is compare summary statistics (e.g., the mean) of your observed data with the same statistics computed from the posterior predictive samples. Ideally the observed value falls "centrally" within the predictive distribution; if that is not the case, there may be a disagreement between model and data…
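This check can be sketched without PyMC: compare the observed statistic to its distribution under replicated datasets (a "Bayesian p-value"; all the numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y_obs = rng.normal(0.0, 1.0, 100)          # "observed" data (simulated here)

# Stand-ins for draws of (μ, σ) from a fitted posterior:
mu_post = rng.normal(y_obs.mean(), 0.1, 1000)
sd_post = np.abs(rng.normal(y_obs.std(), 0.1, 1000))

# One replicated dataset per posterior draw; record its mean
rep_means = np.array([rng.normal(m, s, y_obs.size).mean()
                      for m, s in zip(mu_post, sd_post)])

# Fraction of replicated means above the observed mean;
# values near 0 or 1 would signal model-data disagreement
p_value = (rep_means > y_obs.mean()).mean()
print(round(p_value, 2))
```

Here the "posterior" is centered on the data by construction, so the p-value lands near 0.5; a badly misspecified model would push it toward 0 or 1.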

24 of 47

L14: Simplicity vs Accuracy, aka Bias vs Variance

24

  • Occam’s razor: “if we have two or more equivalent explanations for the same phenomenon, we should choose the simpler one…”
  • Simplicity is only part of the story. You want simplicity (or reduced complexity) but at the same time good accuracy.
  • Perhaps we should change the above quote, to this one: “Make everything as simple as possible, but not simpler” A. Einstein
  • The bias/variance tradeoff is a well known problem (and you probably encountered in other courses, e.g., on machine learning)
    • Too many parameters in your model leads to overfitting (high variance): if you are overfitting a dataset that has signal and noise, you are “memorizing” your dataset, even the noise. You are not really “learning” the meaningful behavior (related to the signal)
    • Too few parameters leads to underfitting (high bias): you are underestimating the “complexity” of your data, and missing important information (trivial example: using a model of order 0 when data are actually linearly distributed)

25 of 47

L14: Information Criteria

25

  • The exact way these quantities are derived comes from Information Theory
  • Deviance

  • Akaike Information Criterion

  • Widely Applicable Information Criterion

  • Other information criteria

N.B.: when the likelihood is Normal, the deviance corresponds to a mean squared error.

  • The lower the deviance the higher the likelihood and the agreement between model prediction and data
  • The deviance is measuring the within-sample accuracy…

pAIC is a penalization term that represents the number of parameters; it penalizes complexity. θ_mle is the maximum likelihood estimate of θ (i.e., in a Bayesian context, the maximum a posteriori); it is a point estimate, not a distribution.

  • AIC works well for non-Bayesian approaches. It does not use the posterior, so it discards information about uncertainty. It also assumes flat priors, and hence this metric is incompatible with informative and weakly informative priors.
  • lppd: log point-wise predictive density; computes the mean likelihood over the posterior samples…
  • pWAIC computes the variance of the log-likelihood over the posterior samples; the larger the number of effective parameters, the larger the spread. It is used as a penalization term.

We prefer lower values of WAIC…

  • You may hear of, e.g., BIC (Bayesian Information Criterion) and others. The name BIC is a bit misleading; it is similar to AIC and is somehow related to Bayes factors (next lectures).
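The WAIC recipe above (lppd minus a variance-based penalty) is short enough to write out by hand; a sketch starting from a matrix of pointwise log-likelihoods, one row per posterior sample (in practice az.waic does this for you; the data and "posterior draws" below are simulated):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(0, 1, 40)                      # data
mu_draws = rng.normal(0, 0.15, 500)           # stand-in posterior draws of μ

# log p(y_i | μ_s) for a Normal(μ, 1) likelihood: S x N matrix
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None])**2

# lppd: log of the mean (over samples) likelihood, summed over data points
lppd = np.log(np.exp(ll).mean(axis=0)).sum()

# p_WAIC: variance of the log-likelihood over samples, summed over points
p_waic = ll.var(axis=0).sum()

waic = -2 * (lppd - p_waic)   # deviance scale: lower is better
print(round(p_waic, 2), round(waic, 1))
```

Unlike AIC, every quantity here is computed from the posterior samples themselves, so the penalty reflects the actual spread of the posterior rather than a raw parameter count.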

26 of 47

L14: Just few lines of PyMC…

26

e.g.,

trace_l (linear model)

trace_p (polynomial)

Compare different models

  • Dataframe sorted from lowest to highest WAIC model
  • pwaic is the estimate of the effective parameters
  • dwaic is the relative difference
  • weight is useful if you want to weight the models (sometimes you do not want to pick just one model); weight can be seen as the probability of each model
  • se: standard error
  • dse: se on the difference
  • Warning flag: 0 is OK

27 of 47

L14: Bayes Factors: strength of evidence

27

  • The following are just guidelines: you should always put into context what is evaluated
  • 1-3: Anecdotal
  • 3-10: Moderate
  • 10-30: Strong
  • 30-100: Very strong
  • >100: Extreme
  • Also, remember that: posterior odds = Bayes Factor × prior odds

28 of 47

L14: Sequential MC for Bayes Factor

28

with pm.Model() as model_BF_0:
    θ = pm.Beta('θ', 4, 8)
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_0 = pm.sample_smc(2500)

with pm.Model() as model_BF_1:
    θ = pm.Beta('θ', 8, 4)
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_1 = pm.sample_smc(2500)

BF_smc = np.exp(trace_BF_0.report.log_marginal_likelihood -
                trace_BF_1.report.log_marginal_likelihood)
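For this Beta-Bernoulli setup the marginal likelihoods are available in closed form, so an SMC estimate can be sanity-checked analytically; a sketch using the Beta function (the y_d data here are invented for illustration):

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(a, b, heads, tails):
    # Beta-Bernoulli evidence: B(a + h, b + t) / B(a, b)
    return log_beta(a + heads, b + tails) - log_beta(a, b)

y_d = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0])  # illustrative data: 3 heads
h, t = int(y_d.sum()), int((1 - y_d).sum())

# Bayes factor of Beta(4, 8) prior (favoring tails) vs Beta(8, 4)
BF = np.exp(log_marginal(4, 8, h, t) - log_marginal(8, 4, h, t))
print(round(BF, 2))  # ≈ 4.8: moderate evidence for the tails-favoring model
```

With mostly tails in the data, the prior concentrated below θ = 0.5 earns the higher evidence, matching the "moderate" band in the table above.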

29 of 47

L14: Remarks on BF

29

  • Good aspect: Bayes Factors have a built-in Occam’s razor, because Bayes’ theorem leads naturally to a penalization of complex models. Why? The larger the number of parameters, the more diffuse the prior volume is compared to the likelihood volume. Some regions will be penalized.
  • Bad aspect: Computing the marginal is generally a computationally hard task.
  • Critical aspect: the marginal term depends sensitively on the values of the priors.

N.B.:

    • Criticism #1: we have seen, for the inference of θ, that changes in the prior often do not affect the inference… but when it comes to model comparison, the BFs are affected by these changes
    • Criticism #2: BFs can be used for Bayesian hypothesis testing; nonetheless many authors have pointed out that an inference approach is better suited to most problems than the hypothesis-testing approach.

30 of 47

L18: Gaussian Processes in a nutshell

30

  • Naively, a Gaussian Process is a probability distribution over possible functions
  • GPs describe a probability distribution over functions; Bayes’ theorem allows us to update our distribution over functions as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y=f(x). Given x, you have a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure we are using GP to approximate this function f. Intuitively, the observed points constrain the modeling. The more points, the more accurate the model.

31 of 47

L18: Advantages

31

“GPs know what they do not know…” The uncertainty of a fitted GP increases away from the training data. Other approaches, like RFs or NNs, just separate the regions of blue and red and keep high certainty on their predictions far from the training data…

This could be linked to the phenomenon of adversarial examples…

When you are using a GP to model a problem, the prior belief can be shaped via the choice of a kernel function as we discussed. We will expand on the kernel in the next slides.

32 of 47

L18: Key-ingredient: Covariance and Kernel

32

  • K is the covariance kernel matrix where its entries correspond to the covariance function evaluated at observations.
  • The covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.
  • In practice covariance matrices are specified using functions known as kernels, whose output can be interpreted as the similarity between two points (the closer two points are, the more similar)
  • One popular kernel is the exponentiated quadratic kernel

Bandwidth (l) controls width of the kernel

A wide variety of functions can be represented with this kernel.

This helps us build our prior.
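The exponentiated quadratic kernel and the covariance matrix it induces take only a few lines (the bandwidth value is illustrative):

```python
import numpy as np

def exp_quad_kernel(x1, x2, ell=1.0):
    """k(x, x') = exp(-(x - x')² / (2 ℓ²)); ℓ is the bandwidth."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0, 4, 5)
K = exp_quad_kernel(x, x, ell=1.0)
print(np.round(K, 3))
# Diagonal is 1 (a point is maximally similar to itself);
# off-diagonal entries decay as points get farther apart.
```

Increasing ℓ makes distant points more correlated, hence smoother prior functions; decreasing it allows wigglier ones.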

33 of 47

L20: BO in a nutshell

  • BO is a sequential strategy developed for global optimization.

  • After gathering evaluations, we build a posterior distribution used to construct an acquisition function.

  • This cheap-to-evaluate function determines the next query point.

1. Select a Sample by Optimizing the Acquisition Function.

2. Evaluate the Sample With the Objective Function.

3. Update the Data and, in turn, the Surrogate Function.

4. Go To 1.

Extension to multiple objectives
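The 4-step loop above can be sketched end to end with a NumPy GP surrogate and an upper-confidence-bound acquisition (the toy objective, kernel bandwidth, and exploration weight are all illustrative choices):

```python
import numpy as np

def k(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def objective(x):                        # "expensive" black box (toy here)
    return -(x - 0.7) ** 2

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 200)
X = rng.uniform(0, 1, 3)                 # initial evaluations
y = objective(X)

for _ in range(10):
    # GP posterior mean/variance on the grid (jitter for numerical stability)
    Kxx = k(X, X) + 1e-6 * np.eye(len(X))
    Ks = k(grid, X)
    alpha = np.linalg.solve(Kxx, y)
    mu = Ks @ alpha
    var = 1.0 - np.sum(Ks * np.linalg.solve(Kxx, Ks.T).T, axis=1)
    ucb = mu + 2.0 * np.sqrt(np.clip(var, 0, None))  # acquisition function
    x_next = grid[np.argmax(ucb)]            # 1. optimize the acquisition
    y_next = objective(np.array([x_next]))   # 2. evaluate the objective
    X = np.append(X, x_next)                 # 3. update data and surrogate
    y = np.append(y, y_next)                 # 4. go to 1.

print(round(X[np.argmax(y)], 2))  # best point found, near the optimum 0.7
```

The μ term drives exploitation and the σ term exploration, exactly the trade-off discussed on the acquisition-function slide.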

33

34 of 47

L20: Acquisition Functions

Best found so far

We are sampling x

Exploitation

Exploration

  • Exploitation: search where μ is high
  • Exploration: search where σ is high


34

35 of 47

L20: BO Applications

The approach has been applied to solve a wide range of problems, including learning to rank, computer graphics and visual design, robotics, sensor networks, automatic algorithm configuration, automatic machine learning toolboxes, reinforcement learning, planning, visual attention, architecture configuration in deep learning, static program analysis, experimental particle physics, chemistry, material design, and drug development… (source: wikipedia)

35

36 of 47

L21: Bayesian A/B Testing

  • Bayesian A/B Testing gained a lot of popularity in the last few years
    • Simple and easy to understand
    • Allows one to calculate the probability that a “treatment” is better than a “control” (A/B testing)
    • It performs better on small sample sizes compared to frequentist approaches: see [2], where experiments show the required sample size to make the “right” decision can be reduced by 75%

  • A/B testing
    • Consists of a randomized experiment that usually involves two variants
    • (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Stanford University)
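The core computation, P(treatment beats control), is a few lines with Beta posteriors (the conversion counts below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: control 40/1000 conversions, treatment 55/1000
a_c, b_c = 1 + 40, 1 + 960    # Beta(1, 1) prior + control data
a_t, b_t = 1 + 55, 1 + 945    # Beta(1, 1) prior + treatment data

# Monte Carlo estimate of P(θ_treatment > θ_control)
theta_c = rng.beta(a_c, b_c, 100_000)
theta_t = rng.beta(a_t, b_t, 100_000)
p_better = (theta_t > theta_c).mean()
print(round(p_better, 3))  # probability the treatment beats the control
```

The answer is a direct probability statement about the treatment, which is what makes the Bayesian version easy to communicate compared to a p-value.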

36

37 of 47

Rini Gupta’s mini-project on PyScript!

PyScript is a framework that allows users to create rich Python applications in the browser using HTML’s interface and the power of Pyodide, WASM, and modern web technologies.

Examples of PyScript can be found at https://pyscript.net/

37

38 of 47

L22: Markov Chain Monte Carlo

  • Monte Carlo + Markov Chain
  • Monte Carlo: see first part of notebook mod3_part3_MCMC
  • Markov Chain: see second part of notebook mod3_part3_MCMC
  • In a few words, Monte Carlo methods are a broad family of algorithms that use random sampling to simulate a given process
    • You may guess from the name which problems inspired the development of these methods :)
    • Stanislaw Ulam is one of the first developers of Monte Carlo methods
  • In a few words, a Markov chain is a sequence of states characterized by a set of transition probabilities. A chain is Markovian if the probability of moving to any other state depends only on the current state (“memory 1”). We can perform a random walk by choosing a starting point and moving to other states according to these transition probabilities.

This is the tool PyMC uses to do the sampling…
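A random walk on a small transition matrix illustrates both the Markov property and convergence to a stationary distribution (the matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Transition probabilities: P[i, j] = prob of moving from state i to state j
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

state, visits = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])  # next state depends only on the current one
    visits[state] += 1

print(np.round(visits / visits.sum(), 2))  # ≈ stationary distribution (5/6, 1/6)
```

MCMC exploits exactly this: it builds a chain whose stationary distribution is the posterior, so long-run visit frequencies are posterior samples.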

38

39 of 47

L22: Metropolis-Hastings

[1] Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng, eds. Handbook of markov chain monte carlo. CRC press, 2011.

[2] C. Fanelli, Measurement of Polarization Transfers in Real Compton Scattering by a proton target at JLab PhD thesis

[3] C. Fanelli, et al. "Polarization transfer in wide-angle compton scattering and single-pion photoproduction from the proton." Physical review letters 115.15 (2015): 152001

This is the engine of several applications we have seen

  • The TARGET distribution (the posterior distribution in Bayesian statistics) is approximated by a list of sampled parameter values
  • More on MCMC can be found here [1]
  • Where do I use it in my research? [1,2]
    • What is Bayesian in all this? [see notes in class]
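A minimal Metropolis sampler (symmetric Gaussian proposal) targeting a standard normal shows the core accept/reject step; the target, proposal width, and burn-in length are illustrative choices:

```python
import numpy as np

def log_target(x):
    return -0.5 * x**2          # log of an (unnormalized) standard normal

rng = np.random.default_rng(0)
x, chain = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(0, 1.0)          # symmetric proposal
    # Accept with probability min(1, p(proposal) / p(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    chain.append(x)                            # rejected moves repeat x

samples = np.array(chain[5_000:])              # discard burn-in
print(round(samples.mean(), 2), round(samples.std(), 2))  # ≈ 0.0 and 1.0
```

Only ratios of the target appear, which is why the intractable marginal (the denominator of Bayes' theorem) is never needed.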

39

40 of 47

L22: MCMC Diagnostics

Autocorrelation

Other metrics provided by az.summary

Check for divergences

Are posteriors from different chains similar?
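The "similar chains" check is quantified by the R-hat statistic that az.summary reports; a simplified (non-split) Gelman-Rubin version, for intuition only:

```python
import numpy as np

def rhat(chains):
    """Gelman-Rubin statistic: ~1.0 when chains agree (simplified, non-split)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
good = rng.normal(0, 1, (4, 2000))               # four well-mixed chains
bad = good + np.array([[0.], [0.], [0.], [3.]])  # one chain stuck elsewhere

print(round(rhat(good), 2), round(rhat(bad), 2))  # ≈ 1.0 vs clearly > 1
```

ArviZ actually uses a rank-normalized split-R-hat, but the diagnostic logic is the same: between-chain disagreement inflates the pooled variance relative to the within-chain one.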

40

41 of 47

L22: MCMC: Other Real-world applications

[1] https://cfteach.github.io/brds/mod3_part3_extra_MCMC_denoise.html

  • Image denoising, see notebook mod3_part3_extra_MCMC_denoise

  • Notice that for this application, in particular, we made use of the Ising Model. Do not worry, we will have more time to discuss it in the next lectures.

41

42 of 47

42

I hope you enjoyed the course

43 of 47

Backup

44 of 47

References of our course

44

https://cfteach.github.io/brds/referencesmd.html

45 of 47

L14: Remarks on BF

45

    • Computationally, in calculating a BF, when a model is better than another one, we spend more time on that model. This could be problematic because we undersample one of the models:
      • We can adjust the prior for each model in such a way as to favor the less favorable model. This will not affect the computation of the BF, because the BF is a ratio of marginal likelihoods and does not depend on the priors over the models.
    • We will see in one of our notebooks how to calculate the BF. Following [Mar18], it is recommended to use the Sequential Monte Carlo method to compute BFs
    • Using informative or weakly informative priors is a way to introduce bias in a model. If done properly, this can be a good thing to prevent overfitting, and to make predictions that generalize well. You may have encountered regularization techniques in other courses, well, Bayesian reasoning can inherently/automatically do regularization...

46 of 47

L18: Disadvantages

46

GP are computationally expensive

  • Parametric approaches distill knowledge about your data into a set of numbers, e.g., for linear regression we need two numbers, the slope and the intercept. Other approaches like NNs may need tens of millions. After they are trained, the cost of making predictions depends only on the number of parameters.
  • GPs are a non-parametric method (truthfully, kernel hyperparameters blur the picture a bit, but let us neglect that), and they need to take the whole training set into account each time they make a prediction.
    • The training data have to be kept at inference time, and the computational cost of predictions scales cubically with the number of training points (the observations).
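The cubic cost comes from solving a linear system with the n×n training covariance at prediction time; a sketch of noise-free GP prediction (the kernel, bandwidth, and data are illustrative):

```python
import numpy as np

def k(a, b, ell=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, 30))      # training inputs (kept at inference!)
y = np.sin(X)                           # training targets

x_star = np.array([1.0, 2.5])           # test points
# The O(n³) step: solving with the n x n training covariance matrix
alpha = np.linalg.solve(k(X, X) + 1e-6 * np.eye(len(X)), y)
mu_star = k(x_star, X) @ alpha          # GP posterior mean at the test points

print(np.round(mu_star, 2))  # close to sin(1.0) ≈ 0.84 and sin(2.5) ≈ 0.6
```

Contrast this with a parametric model: here every prediction touches all 30 training points, and the solve grows as n³ as observations accumulate.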

47 of 47

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions

47

Parametric VS Gaussian Process

Bayes

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions