1 of 47

Bayesian Reasoning

In Data Science

Cristiano Fanelli

12/8/2022 - Lecture 25

2 of 47

2

BRDS: Grand Fanelli Finale

3 of 47

Rationale

  • Congratulations! You made it to the last lecture of the course ;)
  • This lecture will provide a "30,000-foot view" of what we have done in this course.
    • Goal of this lecture: retain a global picture of the BRDS course
      • We will put all the "pieces" together, i.e. all the topics covered in this course on Bayesian Reasoning in Data Science.

3

4 of 47

Material

  • Course website https://cfteach.github.io/brds is an entry point to all the material (lectures, notebooks, references) used in the course

25 lectures

13 notebooks

4

5 of 47

Announcements

  • Third assignment due December 9, 11:59pm
  • Final is going to take place on Mon, December 12, from 4 to ~6pm (in person) at ISC 1291
    • You will have 22 mins for your talk + 3 mins for questions. Please be sure not to exceed the time allocated for your talk
    • After the 4 talks (25x4=100 mins), we will have +10 mins for discussion
  • Please remember to complete your evaluations at https://evals.wm.edu

5

6 of 47

L1: What is Probability?

6

  • Standard answers:

GdA, Ch. 3.1, 3.2

(i) [combinatorial definition] The ratio of the number of favorable cases to the number of all cases

(ii) [frequentist definition] The ratio of the number of times the event occurs in a test series to the total number of trials in the series

  • Neither of these statements can define the concept of probability:

(i) it lacks the clause "if all the cases are equally probable"; this definition is often labeled classical or Laplace, forgetting that Bayes, Gauss, Laplace, Bernoulli, etc. were aware of that clause.

(ii) it lacks the condition that the number of trials must be very large ("it goes to infinity"). Also, to use frequencies as a measurement of probability, we need to assume that a phenomenon that occurred in the past will happen in the future too, and with the same probability.

Laplace: probability theory is "good sense turned into calculation"

7 of 47

L1: Bayes' Rule

7

T. Bayes, 1701-1761

Statistician, philosopher

  • ...describes the probability of an event, based on prior knowledge of conditions that might be related to the event… [wikipedia]
  • One of the main applications is Bayesian inference… the theorem expresses how a degree of belief…should rationally change to account for the availability of related evidence

conditional probabilities

What is it all about? How can this be powerful?

Its simplicity can be “deceiving” as it involves the interpretation of probability.

posterior

likelihood

prior

marginal
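The four ingredients above can be wired together in a few lines; a minimal numeric sketch (the diagnostic-test numbers are invented for illustration):

```python
# Bayes' rule: posterior = likelihood * prior / marginal
# Hypothetical numbers: a test with 99% sensitivity, 5% false-positive
# rate, for a condition with 1% prevalence.
p_pos_given_d = 0.99      # likelihood P(+|D)
p_pos_given_not_d = 0.05  # P(+|not D)
p_d = 0.01                # prior P(D)

# marginal P(+) via the law of total probability
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# posterior P(D|+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ≈ 0.167: a positive test is far from certain
```

Note how the weak prior (1% prevalence) dominates the strong likelihood: this is exactly the "probability of causes" reasoning the rule enables.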

8 of 47

L1: Probability of Causes

8

A theory of probability which does not consider probabilities of hypotheses is unnatural and prevents transparent and consistent statements about the causes which may have produced the observed effects.

GdA, Ch. 1.6

True values

Measured values

When we perform a measurement we access the measured quantities, not the true values (e.g., if you repeat your measurement, the results may change depending on the precision and accuracy of your instrument).

9 of 47

L1: Bayesian is Everywhere

9

  • No Data Scientist can work without a solid grasp of conditional probability and Bayesian reasoning. → See, e.g., Bayesian Deep Learning.
  • Bayesian reasoning permeates multiple diverse fields and applications, e.g.
    • Business: pricing decisions and project risk for new product development
    • Marketing: A/B testing for click through rates
    • Stock markets: Bayesian networks used to identify future trends in stocks
    • Weather Forecast
    • Disease risk
    • Medical Diagnosis
    • Design
    • Hyperparameters optimization
    • Particle Physics Experiments
    • Image denoising
    • ….

From deciphering encrypted messages during the Second World War to hyperparameter tuning of neural networks

10 of 47

L1: Bayesian is Everywhere

10


Several of these topics were chosen for your mini-projects and final projects

11 of 47

L4: Coin Example “Revisited”

11

This problem has multiple applications: e.g., it could be also seen as the percentage of people visiting a webpage A instead of B
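For the coin (or webpage A/B) setup, the Beta-Binomial conjugacy gives the posterior in closed form; a minimal sketch (the prior parameters and counts are illustrative):

```python
# Conjugate update for a Bernoulli parameter θ with a Beta prior:
# Beta(a, b) prior + (heads, tails) data -> Beta(a + heads, b + tails)
a_prior, b_prior = 2, 2            # weakly informative prior (illustrative)
heads, tails = 7, 3                # observed data

a_post = a_prior + heads
b_post = b_prior + tails
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 3))  # Beta(9, 5), mean ≈ 0.643
```

The same update applies whether the data are coin flips or visits to page A vs page B.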

12 of 47

L4: Intro PyMC3

(cf. notebook mod1-part2)

12

PyMC3 can’t give us a formula for the posterior distribution. However, it can provide as many samples as we like from the posterior without explicitly calculating the denominator.

See Bayes’ theorem


14 of 47

L6: Credible vs Confidence Interval

14

Priors

A credible interval is an interval within which an unobserved parameter value falls with a particular probability.

It is an interval in the domain of a posterior probability distribution or a predictive distribution. The generalisation to multivariate problems is the credible region.

Credible intervals are analogous to confidence intervals and confidence regions in frequentist statistics, although they differ in the interpretation:

Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value

HDI: Highest Density Interval

ROPE: Region of Practical Equivalence

Bayesian Reasoning allows us to deal with uncertainties…
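A simple way to approximate an HDI from posterior samples (re-implemented here in NumPy for intuition; in the course notebooks ArviZ's az.hdi does this for you):

```python
import numpy as np

def hdi(samples, cred=0.94):
    """Narrowest interval containing a fraction `cred` of the samples."""
    x = np.sort(samples)
    n = len(x)
    m = int(np.floor(cred * n))          # number of points inside the interval
    widths = x[m:] - x[:n - m]           # widths of all candidate intervals
    i = np.argmin(widths)                # pick the narrowest one
    return x[i], x[i + m]

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 50_000)       # stand-in for posterior samples
lo, hi = hdi(samples)
print(round(lo, 2), round(hi, 2))        # roughly ±1.9 for a standard normal
```

Because it picks the narrowest interval, the HDI is well defined even for skewed posteriors, unlike an equal-tailed interval.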

15 of 47

L10: Bayesian Linear Regression

15

  • Highlight machine learning connection — umbrella term for a collection of methods to automatically learn patterns in data, and then use what we learn to predict future data or make decisions under uncertainty
  • Regression problems as an example of supervised learning
  • In this class we will compare the ordinary least squares (OLS) fitting procedure for linear regression with Bayesian linear regression
    • Optimization Problem (the one you are familiar with) VS Probabilistic Problem
  • We assume you are familiar with OLS as well as uncertainty propagation from previous courses. Nonetheless we will recall some concepts in class.
  • The probabilistic approach to the linear regression problem can be summarized as:
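As an illustration of the probabilistic formulation (y modeled as Normal with mean α + βx), with a Gaussian prior on the coefficients and known noise the posterior is available in closed form; a minimal NumPy sketch (the data and prior scale are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, x.size)  # true α=1, β=2

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
sigma2 = 0.1**2                            # known noise variance
tau2 = 10.0**2                             # prior variance on (α, β)

# Posterior for a N(0, τ² I) prior on the coefficients:
#   Σ = (XᵀX/σ² + I/τ²)⁻¹ ,   μ = Σ Xᵀy / σ²
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu = Sigma @ X.T @ y / sigma2
print(np.round(mu, 2))  # posterior mean close to the true (1, 2)
```

The output is a full posterior distribution over (α, β), not a single "best" point as in OLS; with a broad prior and plenty of data, its mean approaches the OLS solution.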

16 of 47

L10: Bayesian Linear Regression

16

  • Half Cauchy?
  • Why?
    • Generally works well as a regularizing prior that avoids overfitting

disturbance

17 of 47

L10: Bayesian Linear Regression

17

disturbance

μ expressed as "deterministic" — see code

Probabilistic Programming

18 of 47

L10: Bayesian Linear Regression

18

disturbance

μ expressed as "deterministic" — see code

19 of 47

L10: Bayesian Linear Regression: FAQ

19

I am familiar with linear regression models already, and I know methods for fitting, e.g., least square. Why should I use Bayesian linear regression?

  • Bayesian linear regression provides a useful mechanism to deal with insufficient or poorly distributed data. It allows you to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. [ref]
  • The aim of Bayesian Linear Regression is not to find the single “best” value of the model parameters, but rather to determine the posterior distribution for the model parameters.

In problems where we have limited data or have some prior knowledge that we want to use in our model, the Bayesian Linear Regression approach can both incorporate prior information and show our uncertainty. Bayesian Linear Regression reflects the Bayesian framework: we form an initial estimate and improve our estimate as we gather more data.

20 of 47

L10: Bayesian Polynomial Regression

20

In general, using a polynomial to fit data is not the best idea

A model that perfectly fits your data will in general do a poor job at fitting/describing unobserved data — this is called Overfitting

We will hopefully discuss more about this in the coming weeks

By extension

=> Polynomial regression

21 of 47

L13: Logistic Regression

21

Credits: University of Toronto

Credits: references [1], [2], [3]

Likely, you are familiar with logistic regression, a Machine Learning classification algorithm used to assign observations to a discrete set of classes.

Example

22 of 47

L13: Logistic Regression

22

Credits: references [1], [2], [3]

23 of 47

L14: All models are wrong…

23

  • All models are wrong, in the sense that some models will be better than others at describing a given problem
  • How to compare two or more models is a central problem in data analysis
  • Luckily we have a few strategies (and we will briefly discuss some of them)
  • Posterior Predictive Checks (mod1_part3)
  • One thing you can do is compare summary statistics (e.g., the mean) of your observed data with the same statistics computed from the posterior predictive samples. Ideally the observed value falls "centrally" within the predictive distribution; if that is not the case, there may be a disagreement between model and data…
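This check can be sketched without PyMC: compare the observed statistic to its distribution under replicated datasets (a "Bayesian p-value"; all the numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y_obs = rng.normal(0.0, 1.0, 100)          # "observed" data (simulated here)

# Stand-ins for draws of (μ, σ) from a fitted posterior:
mu_post = rng.normal(y_obs.mean(), 0.1, 1000)
sd_post = np.abs(rng.normal(y_obs.std(), 0.1, 1000))

# One replicated dataset per posterior draw; record its mean
rep_means = np.array([rng.normal(m, s, y_obs.size).mean()
                      for m, s in zip(mu_post, sd_post)])

# Fraction of replicated means above the observed mean;
# values near 0 or 1 would signal model-data disagreement
p_value = (rep_means > y_obs.mean()).mean()
print(round(p_value, 2))
```

Here the "posterior" is centered on the data by construction, so the p-value lands near 0.5; a badly misspecified model would push it toward 0 or 1.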

24 of 47

L14: Simplicity vs Accuracy, aka Bias vs Variance

24

  • Occam’s razor: “if we have two or more equivalent explanations for the same phenomenon, we should choose the simpler one…”
  • Simplicity is only part of the story. You want simplicity (or reduced complexity) but at the same time good accuracy.
  • Perhaps we should change the above quote, to this one: “Make everything as simple as possible, but not simpler” A. Einstein
  • The bias/variance tradeoff is a well known problem (and you probably encountered in other courses, e.g., on machine learning)
    • Too many parameters in your model leads to overfitting (high variance): if you are overfitting a dataset that has signal and noise, you are “memorizing” your dataset, even the noise. You are not really “learning” the meaningful behavior (related to the signal)
    • Too few parameters leads to underfitting (high bias): you are underestimating the “complexity” of your data, and missing important information (trivial example: using a model of order 0 when data are actually linearly distributed)

25 of 47

L14: Information Criteria

25

  • The exact way these quantities are derived comes from Information Theory
  • Deviance

  • Akaike Information Criterion

  • Widely Applicable Information Criterion

  • Other information criteria

N.B.: when the likelihood is Normal, the deviance corresponds to a mean squared error.

  • The lower the deviance the higher the likelihood and the agreement between model prediction and data
  • The deviance is measuring the within-sample accuracy…

pAIC is a penalization term that represents the number of parameters; it penalizes complexity. θ_mle is the maximum likelihood estimate of θ (i.e., in a Bayesian context, the maximum a posteriori); it is a point estimate, not a distribution.

  • AIC works well for non-Bayesian approaches. It does not use the posterior, so it discards information about uncertainty. It also assumes flat priors, and hence this metric is incompatible with informative and weakly informative priors.
  • lppd: log point-wise predictive density; computes the mean likelihood over the posterior samples…
  • pWAIC computes the variance of the log-likelihood over the posterior samples; the larger the number of effective parameters, the larger the spread. It is used as a penalization term.

We prefer lower values of WAIC…

  • You may hear of, e.g., BIC (Bayesian Information Criterion) and others. The name BIC is a bit misleading; it is similar to AIC and is somehow related to Bayes factors (next lectures).
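The WAIC recipe above (lppd minus a variance-based penalty) is short enough to write out by hand; a sketch starting from a matrix of pointwise log-likelihoods, one row per posterior sample (in practice az.waic does this for you; the data and "posterior draws" below are simulated):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(0, 1, 40)                      # data
mu_draws = rng.normal(0, 0.15, 500)           # stand-in posterior draws of μ

# log p(y_i | μ_s) for a Normal(μ, 1) likelihood: S x N matrix
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None])**2

# lppd: log of the mean (over samples) likelihood, summed over data points
lppd = np.log(np.exp(ll).mean(axis=0)).sum()

# p_WAIC: variance of the log-likelihood over samples, summed over points
p_waic = ll.var(axis=0).sum()

waic = -2 * (lppd - p_waic)   # deviance scale: lower is better
print(round(p_waic, 2), round(waic, 1))
```

Unlike AIC, every quantity here is computed from the posterior samples themselves, so the penalty reflects the actual spread of the posterior rather than a raw parameter count.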

26 of 47

L14: Just few lines of PyMC…

26

e.g.,

trace_l (linear model)

trace_p (polynomial)

Compare different models

  • Dataframe sorted from lowest to highest WAIC model
  • pwaic is the estimate of the effective parameters
  • dwaic is the relative difference
  • weight is useful if you want to weight the models (sometimes you do not want to pick just one model); weight can be seen as the probability of each model
  • se: standard error
  • dse: se on the difference
  • Warning flag: 0 is OK

27 of 47

L14: Bayes Factors: strength of evidence

27

  • The following are just guidelines: you should always put into context what is evaluated
  • 1-3: Anecdotal
  • 3-10: Moderate
  • 10-30: Strong
  • 30-100: Very strong
  • >100: Extreme
  • Also, remember that: posterior odds = Bayes Factor × prior odds

28 of 47

L14: Sequential MC for Bayes Factor

28

with pm.Model() as model_BF_0:
    θ = pm.Beta('θ', 4, 8)
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_0 = pm.sample_smc(2500)

with pm.Model() as model_BF_1:
    θ = pm.Beta('θ', 8, 4)
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_1 = pm.sample_smc(2500)

BF_smc = np.exp(trace_BF_0.report.log_marginal_likelihood -
                trace_BF_1.report.log_marginal_likelihood)
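For this Beta-Bernoulli setup the marginal likelihoods are available in closed form, so an SMC estimate can be sanity-checked analytically; a sketch using the Beta function (the y_d data here are invented for illustration):

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(a, b, heads, tails):
    # Beta-Bernoulli evidence: B(a + h, b + t) / B(a, b)
    return log_beta(a + heads, b + tails) - log_beta(a, b)

y_d = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0])  # illustrative data: 3 heads
h, t = int(y_d.sum()), int((1 - y_d).sum())

# Bayes factor of Beta(4, 8) prior (favoring tails) vs Beta(8, 4)
BF = np.exp(log_marginal(4, 8, h, t) - log_marginal(8, 4, h, t))
print(round(BF, 2))  # ≈ 4.8: moderate evidence for the tails-favoring model
```

With mostly tails in the data, the prior concentrated below θ = 0.5 earns the higher evidence, matching the "moderate" band in the table above.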

29 of 47

L14: Remarks on BF

29

  • Good aspect: Bayes Factors have a built-in Occam’s razor, because Bayes’ theorem leads naturally to a penalization of complex models. Why? The larger the number of parameters, the more diffuse the prior volume is compared to the likelihood volume. Some regions will be penalized.
  • Bad aspect: Computing the marginal is generally a computationally hard task.
  • Critical aspect: the marginal term depends sensitively on the values of the priors.

N.B.:

    • Criticism #1: we have seen, for the inference of θ, that changes in the prior often do not affect the inference… but when it comes to model comparison, the BFs are affected by these changes
    • Criticism #2: BFs can be used for Bayesian hypothesis testing; nonetheless many authors have pointed out that an inference approach is better suited to most problems than the hypothesis-testing approach.

30 of 47

L18: Gaussian Processes in a nutshell

30

  • Naively, a Gaussian Process is a probability distribution over possible functions
  • GPs describe a probability distribution over functions; Bayes’ theorem allows us to update our distribution over functions as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y=f(x). Given x, you have a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure we are using GP to approximate this function f. Intuitively, the observed points constrain the modeling. The more points, the more accurate the model.

31 of 47

L18: Advantages

31

“GPs know what they do not know…” The uncertainty of a fitted GP increases away from the training data. Other approaches, like RFs or NNs, just separate the regions of blue and red and keep high certainty on their predictions far from the training data…

This could be linked to the phenomenon of adversarial examples…

When you are using a GP to model a problem, the prior belief can be shaped via the choice of a kernel function as we discussed. We will expand on the kernel in the next slides.

32 of 47

L18: Key-ingredient: Covariance and Kernel

32

  • K is the covariance kernel matrix where its entries correspond to the covariance function evaluated at observations.
  • The covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.
  • In practice covariance matrices are specified using functions known as kernels, whose output can be interpreted as the similarity between two points (the closer two points are, the more similar)
  • One popular kernel is the exponentiated quadratic kernel

Bandwidth (l) controls width of the kernel

A wide variety of functions can be represented with this kernel.

This helps us build our prior.
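The exponentiated quadratic kernel and the covariance matrix it induces take only a few lines (the bandwidth value is illustrative):

```python
import numpy as np

def exp_quad_kernel(x1, x2, ell=1.0):
    """k(x, x') = exp(-(x - x')² / (2 ℓ²)); ℓ is the bandwidth."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0, 4, 5)
K = exp_quad_kernel(x, x, ell=1.0)
print(np.round(K, 3))
# Diagonal is 1 (a point is maximally similar to itself);
# off-diagonal entries decay as points get farther apart.
```

Increasing ℓ makes distant points more correlated, hence smoother prior functions; decreasing it allows wigglier ones.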

33 of 47

L20: BO in a nutshell

  • BO is a sequential strategy developed for global optimization.

  • After gathering evaluations, we build a posterior distribution used to construct an acquisition function.

  • This cheap-to-evaluate function determines the next query point.

1. Select a Sample by Optimizing the Acquisition Function.

2. Evaluate the Sample With the Objective Function.

3. Update the Data and, in turn, the Surrogate Function.

4. Go To 1.

Extension to multiple objectives
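The 4-step loop above can be sketched end to end with a NumPy GP surrogate and an upper-confidence-bound acquisition (the toy objective, kernel bandwidth, and exploration weight are all illustrative choices):

```python
import numpy as np

def k(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def objective(x):                        # "expensive" black box (toy here)
    return -(x - 0.7) ** 2

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 200)
X = rng.uniform(0, 1, 3)                 # initial evaluations
y = objective(X)

for _ in range(10):
    # GP posterior mean/variance on the grid (jitter for numerical stability)
    Kxx = k(X, X) + 1e-6 * np.eye(len(X))
    Ks = k(grid, X)
    alpha = np.linalg.solve(Kxx, y)
    mu = Ks @ alpha
    var = 1.0 - np.sum(Ks * np.linalg.solve(Kxx, Ks.T).T, axis=1)
    ucb = mu + 2.0 * np.sqrt(np.clip(var, 0, None))  # acquisition function
    x_next = grid[np.argmax(ucb)]            # 1. optimize the acquisition
    y_next = objective(np.array([x_next]))   # 2. evaluate the objective
    X = np.append(X, x_next)                 # 3. update data and surrogate
    y = np.append(y, y_next)                 # 4. go to 1.

print(round(X[np.argmax(y)], 2))  # best point found, near the optimum 0.7
```

The μ term drives exploitation and the σ term exploration, exactly the trade-off discussed on the acquisition-function slide.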

33

34 of 47

L20: Acquisition Functions

Best found so far

We are sampling x

Exploitation

Exploration

  • Exploitation: search where μ is high
  • Exploration: search where σ is high


34

35 of 47

L20: BO Applications

The approach has been applied to solve a wide range of problems, including learning to rank, computer graphics and visual design, robotics, sensor networks, automatic algorithm configuration, automatic machine learning toolboxes, reinforcement learning, planning, visual attention, architecture configuration in deep learning, static program analysis, experimental particle physics, chemistry, material design, and drug development… (source: wikipedia)

35

36 of 47

L21: Bayesian A/B Testing

  • Bayesian A/B Testing gained a lot of popularity in the last few years
    • Simple and easy to understand
    • Allows one to calculate the probability that a “treatment” is better than a “control” (A/B testing)
    • It performs better on small sample sizes compared to frequentist approaches: see [2], where experiments show the required sample size to make the “right” decision can be reduced by 75%

  • A/B testing
    • Consists of a randomized experiment that usually involves two variants
    • (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Stanford University)
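The core computation, P(treatment beats control), is a few lines with Beta posteriors (the conversion counts below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: control 40/1000 conversions, treatment 55/1000
a_c, b_c = 1 + 40, 1 + 960    # Beta(1, 1) prior + control data
a_t, b_t = 1 + 55, 1 + 945    # Beta(1, 1) prior + treatment data

# Monte Carlo estimate of P(θ_treatment > θ_control)
theta_c = rng.beta(a_c, b_c, 100_000)
theta_t = rng.beta(a_t, b_t, 100_000)
p_better = (theta_t > theta_c).mean()
print(round(p_better, 3))  # probability the treatment beats the control
```

The answer is a direct probability statement about the treatment, which is what makes the Bayesian version easy to communicate compared to a p-value.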

36

37 of 47

Rini Gupta’s mini-project on PyScript!

PyScript is a framework that allows users to create rich Python applications in the browser using HTML’s interface and the power of Pyodide, WASM, and modern web technologies.

Examples of PyScript can be found at https://pyscript.net/

37

38 of 47

L22: Markov Chain Monte Carlo

  • Monte Carlo + Markov Chain
  • Monte Carlo: see first part of notebook mod3_part3_MCMC
  • Markov Chain: see second part of notebook mod3_part3_MCMC
  • In a few words, Monte Carlo methods are a broad family of algorithms that use random sampling to simulate a given process
    • You may guess from the name which problems inspired the development of these methods :)
    • Stanislaw Ulam is one of the first developers of Monte Carlo methods
  • In a few words, a Markov chain is a sequence of states characterized by a set of transition probabilities. A chain is Markovian if the probability of moving to any other state depends only on the current state (“memory 1”). We can perform a random walk by choosing a starting point and moving to other states according to these transition probabilities.

This is the tool PyMC uses to do the sampling…
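A random walk on a small transition matrix illustrates both the Markov property and convergence to a stationary distribution (the matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Transition probabilities: P[i, j] = prob of moving from state i to state j
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

state, visits = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])  # next state depends only on the current one
    visits[state] += 1

print(np.round(visits / visits.sum(), 2))  # ≈ stationary distribution (5/6, 1/6)
```

MCMC exploits exactly this: it builds a chain whose stationary distribution is the posterior, so long-run visit frequencies are posterior samples.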

38

39 of 47

L22: Metropolis-Hastings

[1] Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng, eds. Handbook of markov chain monte carlo. CRC press, 2011.

[2] C. Fanelli, Measurement of Polarization Transfers in Real Compton Scattering by a proton target at JLab PhD thesis

[3] C. Fanelli, et al. "Polarization transfer in wide-angle compton scattering and single-pion photoproduction from the proton." Physical review letters 115.15 (2015): 152001

This is the engine of several applications we have seen

  • The TARGET distribution (the posterior distribution in Bayesian statistics) is approximated by a list of sampled parameter values
  • More on MCMC can be found here [1]
  • Where do I use it in my research? [1,2]
    • What is Bayesian in all this? [see notes in class]
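A minimal Metropolis sampler (symmetric Gaussian proposal) targeting a standard normal shows the core accept/reject step; the target, proposal width, and burn-in length are illustrative choices:

```python
import numpy as np

def log_target(x):
    return -0.5 * x**2          # log of an (unnormalized) standard normal

rng = np.random.default_rng(0)
x, chain = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(0, 1.0)          # symmetric proposal
    # Accept with probability min(1, p(proposal) / p(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    chain.append(x)                            # rejected moves repeat x

samples = np.array(chain[5_000:])              # discard burn-in
print(round(samples.mean(), 2), round(samples.std(), 2))  # ≈ 0.0 and 1.0
```

Only ratios of the target appear, which is why the intractable marginal (the denominator of Bayes' theorem) is never needed.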

39

40 of 47

L22: MCMC Diagnostics

Autocorrelation

Other metrics provided by az.summary

Check for divergences

Are posteriors from different chains similar?
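The "similar chains" check is quantified by the R-hat statistic that az.summary reports; a simplified (non-split) Gelman-Rubin version, for intuition only:

```python
import numpy as np

def rhat(chains):
    """Gelman-Rubin statistic: ~1.0 when chains agree (simplified, non-split)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
good = rng.normal(0, 1, (4, 2000))               # four well-mixed chains
bad = good + np.array([[0.], [0.], [0.], [3.]])  # one chain stuck elsewhere

print(round(rhat(good), 2), round(rhat(bad), 2))  # ≈ 1.0 vs clearly > 1
```

ArviZ actually uses a rank-normalized split-R-hat, but the diagnostic logic is the same: between-chain disagreement inflates the pooled variance relative to the within-chain one.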

40

41 of 47

L22: MCMC: Other Real-world applications

[1] https://cfteach.github.io/brds/mod3_part3_extra_MCMC_denoise.html

  • Image denoising, see notebook mod3_part3_extra_MCMC_denoise

  • Notice that for this application, in particular, we made use of the Ising Model. Do not worry, we will have more time to discuss it in the next lectures.

41

42 of 47

42

I hope you enjoyed the course

43 of 47

Backup

44 of 47

References of our course

44

https://cfteach.github.io/brds/referencesmd.html

45 of 47

L14: Remarks on BF

45

    • Computationally, in calculating a BF, when a model is better than another one, we spend more time on that model. This could be problematic because we undersample one of the models:
      • We can adjust the prior for each model in such a way as to favor the less favorable model. This will not affect the computation of the BF, because the BF is a ratio of marginal likelihoods and does not depend on the priors over the models.
    • We will see in one of our notebooks how to calculate the BF. Following [Mar18], it is recommended to use the Sequential Monte Carlo method to compute BFs
    • Using informative or weakly informative priors is a way to introduce bias in a model. If done properly, this can be a good thing to prevent overfitting, and to make predictions that generalize well. You may have encountered regularization techniques in other courses, well, Bayesian reasoning can inherently/automatically do regularization...

46 of 47

L18: Disadvantages

46

GP are computationally expensive

  • Parametric approaches distill knowledge about your data into a set of numbers, e.g., for linear regression we need two numbers, the slope and the intercept. Other approaches like NNs may need tens of millions. After they are trained, the cost of making predictions depends only on the number of parameters.
  • GPs are a non-parametric method (truthfully, kernel hyperparameters blur the picture a bit, but let us neglect that), and they need to take the whole training set into account each time they make a prediction.
    • The training data have to be kept at inference time, and the computational cost of predictions scales cubically with the number of training points (the observations).
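The cubic cost comes from solving a linear system with the n×n training covariance at prediction time; a sketch of noise-free GP prediction (the kernel, bandwidth, and data are illustrative):

```python
import numpy as np

def k(a, b, ell=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, 30))      # training inputs (kept at inference!)
y = np.sin(X)                           # training targets

x_star = np.array([1.0, 2.5])           # test points
# The O(n³) step: solving with the n x n training covariance matrix
alpha = np.linalg.solve(k(X, X) + 1e-6 * np.eye(len(X)), y)
mu_star = k(x_star, X) @ alpha          # GP posterior mean at the test points

print(np.round(mu_star, 2))  # close to sin(1.0) ≈ 0.84 and sin(2.5) ≈ 0.6
```

Contrast this with a parametric model: here every prediction touches all 30 training points, and the solve grows as n³ as observations accumulate.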

47 of 47

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions

47

Parametric VS Gaussian Process

Bayes

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions