Bayesian Reasoning
In Data Science
Cristiano Fanelli
12/8/2022, Lecture 25
BRDS: Grand Fanelli Finale
Rationale
Material
25 lectures
13 notebooks
Announcements
L1: What is Probability?
GdA, Ch. 3.1, 3.2
(i) [combinatorial definition] The ratio of the number of favorable cases to the number of all cases
(ii) [frequentist definition] The ratio of the number of times the event occurs in a test series to the total number of trials in the series
(i) it lacks the clause “if all the cases are equally probable”; this definition is often labeled classical or Laplace, forgetting that Bayes, Gauss, Laplace, Bernoulli, etc. were aware of that clause.
(ii) it lacks the condition that the number of trials must be very large (“it goes to infinity”). Also, to use frequencies as a measurement of probability, we need to assume that a phenomenon that occurred in the past will happen in the future too, with the same probability.
Laplace: probability theory is “good sense turned into calculation”
L1: Bayes’ Rule
T. Bayes, 1701-1761
Statistician, philosopher
conditional probabilities
What is it all about? How can this be powerful?
Its simplicity can be “deceiving” as it involves the interpretation of probability.
P(H | D) = P(D | H) · P(H) / P(D), where P(H | D) is the posterior, P(D | H) the likelihood, P(H) the prior, and P(D) the marginal (evidence).
L1: Probability of Causes
A theory of probability which does not consider probabilities of hypotheses is unnatural and prevents transparent and consistent statements about the causes which may have produced the observed effects.
GdA, Ch. 1.6
True values vs. measured values: when we perform a measurement we access measured quantities (e.g., if you repeat your measurement, results may change depending on the precision and accuracy of your instrument), not the true values.
L1: Bayesian is Everywhere
From deciphering encrypted messages during the Second World War to tuning the hyperparameters of neural networks
L1: Bayesian is Everywhere
These topics were chosen for your mini-projects and final projects.
L4: Coin Example “Revisited”
This problem has multiple applications: e.g., it can also be seen as the fraction of people visiting webpage A instead of B.
L4: Intro PyMC3
(cf. notebook mod1-part2)
PyMC3 can’t give us a formula for the posterior distribution. However, it can provide as many samples as we like from the posterior without explicitly calculating the denominator.
See Bayes’ theorem
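As a reminder, a minimal PyMC3 sketch of the coin model (the Beta(1, 1) prior and the simulated flips are illustrative assumptions, not necessarily those of the notebook):

import numpy as np
import pymc3 as pm

# Illustrative data: 100 flips of a biased coin (1 = heads)
data = np.random.binomial(1, 0.35, size=100)

with pm.Model() as coin_model:
    θ = pm.Beta('θ', alpha=1, beta=1)          # uniform prior on the bias
    y = pm.Bernoulli('y', p=θ, observed=data)  # likelihood of the flips
    trace = pm.sample(2000)                    # posterior samples, no denominator needed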
L6: Credible vs Confidence Interval
Priors
A credible interval is an interval within which an unobserved parameter value falls with a particular probability.
It is an interval in the domain of a posterior probability distribution or a predictive distribution. The generalisation to multivariate problems is the credible region.
Credible intervals are analogous to confidence intervals and confidence regions in frequentist statistics, although they differ in the interpretation:
Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value
https://en.wikipedia.org/wiki/Credible_interval
GdA, Chap 1,2
HDI: Highest Density Interval
ROPE: Region of Practical Equivalence
Bayesian Reasoning allows us to deal with uncertainties…
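For instance, a minimal ArviZ sketch (assuming trace holds posterior samples, e.g., as InferenceData; the 94% mass and the ROPE bounds are illustrative choices):

import arviz as az

az.hdi(trace, hdi_prob=0.94)                 # highest-density interval per parameter
az.plot_posterior(trace, rope=(0.45, 0.55))  # posterior with the ROPE marked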
L10: Bayesian Linear Regression
L10: Bayesian Linear Regression
disturbance: the noise term of the regression model
L10: Bayesian Linear Regression
μ expressed as “deterministic” (see code)
Probabilistic Programming
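A minimal PyMC3 sketch of this model (variable names, priors, and the simulated data are illustrative assumptions):

import numpy as np
import pymc3 as pm

# Illustrative data from a noisy line
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + np.random.normal(0, 0.5, size=50)

with pm.Model() as linear_model:
    α = pm.Normal('α', mu=0, sd=10)        # intercept
    β = pm.Normal('β', mu=0, sd=10)        # slope
    σ = pm.HalfNormal('σ', sd=1)           # scale of the disturbance
    μ = pm.Deterministic('μ', α + β * x)   # μ expressed as "deterministic"
    y_obs = pm.Normal('y_obs', mu=μ, sd=σ, observed=y)
    trace = pm.sample(2000)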
L10: Bayesian Linear Regression: FAQ
I am familiar with linear regression models already, and I know methods for fitting them, e.g., least squares. Why should I use Bayesian linear regression?
In problems where we have limited data or have some prior knowledge that we want to use in our model, the Bayesian Linear Regression approach can both incorporate prior information and show our uncertainty. Bayesian Linear Regression reflects the Bayesian framework: we form an initial estimate and improve our estimate as we gather more data.
L10: Bayesian Polynomial Regression
In general, using a polynomial to fit data is not the best idea
A model that perfectly fits your data will in general do a poor job at fitting/describing unobserved data — this is called Overfitting
We will hopefully discuss this further in the coming weeks of this course.
By extension, μ = α + β₁x + β₂x² + … + βₘxᵐ => polynomial regression
L13: Logistic Regression
Credits: University of Toronto
You are likely familiar with logistic regression, a machine learning classification algorithm used to assign observations to a discrete set of classes.
Example
L13: Logistic Regression
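A minimal PyMC3 sketch of a Bayesian logistic regression (a single predictor, the priors, and the simulated data are illustrative assumptions):

import numpy as np
import pymc3 as pm

# Illustrative data: binary outcomes driven by one feature
x = np.random.normal(0, 1, size=100)
y = np.random.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * x))))

with pm.Model() as logistic_model:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)
    θ = pm.Deterministic('θ', pm.math.sigmoid(α + β * x))  # probability of class 1
    y_obs = pm.Bernoulli('y_obs', p=θ, observed=y)
    trace = pm.sample(2000)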
L14: All models are wrong…
L14: Simplicity vs Accuracy, aka Bias vs Variance
L14: Information Criteria
N.B.: when the likelihood is Normal, the log-likelihood corresponds to a quadratic (mean squared) error.
p_AIC is a penalization term that represents the number of parameters; it penalizes complexity. θ_mle is the maximum likelihood estimate of θ (in a Bayesian context, the maximum a posteriori); it is a point estimate, not a distribution.
We prefer lower values of WAIC…
L14: Just a few lines of PyMC…
Compare different models, e.g., trace_l (linear model) vs. trace_p (polynomial)
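For instance, a minimal ArviZ sketch (assuming the traces have been converted to InferenceData, e.g., with az.from_pymc3):

import arviz as az

cmp = az.compare({'linear': trace_l, 'poly': trace_p}, ic='waic')
print(cmp)  # models ranked from best to worst by WAIC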
L14: Bayes Factors: strength of evidence
posterior odds = Bayes factor × prior odds:
p(M₁ | y) / p(M₂ | y) = [ p(y | M₁) / p(y | M₂) ] × [ p(M₁) / p(M₂) ]
L14: Sequential MC for Bayes Factor
import numpy as np
import pymc3 as pm

# y_d: the observed binary (0/1) data from the notebook

with pm.Model() as model_BF_0:
    θ = pm.Beta('θ', 4, 8)                    # prior favoring θ < 0.5
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_0 = pm.sample_smc(2500)

with pm.Model() as model_BF_1:
    θ = pm.Beta('θ', 8, 4)                    # prior favoring θ > 0.5
    y = pm.Bernoulli('y', θ, observed=y_d)
    trace_BF_1 = pm.sample_smc(2500)

# Ratio of the marginal likelihoods estimated by sequential Monte Carlo
BF_smc = np.exp(trace_BF_0.report.log_marginal_likelihood -
                trace_BF_1.report.log_marginal_likelihood)
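BF_smc estimates the ratio of marginal likelihoods p(y | model_BF_0) / p(y | model_BF_1): values above 1 favor the first model, and standard scales (e.g., Jeffreys’) grade the strength of that evidence.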
L14: Remarks on BF
L18: Gaussian Processes in a nutshell
What kind of problems are we talking about?
L18: Advantages
O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
Scikit-learn, Classifier Comparison: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
“GPs know what they do not know…”: the uncertainty of a fitted GP increases away from the training data. Other approaches, like random forests or neural networks, just separate the regions of blue and red and remain highly confident in their predictions far from the training data…
This could be linked to the phenomenon of adversarial examples…
When you are using a GP to model a problem, the prior belief can be shaped via the choice of a kernel function as we discussed. We will expand on the kernel in the next slides.
L18: Key ingredient: Covariance and Kernel
C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (2006), The MIT Press
O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
Bandwidth (ℓ) controls the width of the kernel.
A wide variety of functions can be represented through this kernel.
This helps us build our prior.
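A minimal NumPy sketch of the squared-exponential (RBF) kernel and of draws from the corresponding GP prior (the grid and parameter values are illustrative):

import numpy as np

def rbf_kernel(x1, x2, l=1.0, sigma=1.0):
    # k(x, x') = σ² exp(−(x − x')² / (2 ℓ²))
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / l)**2)

x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x, l=1.0)  # a larger l gives smoother functions
# Three functions drawn from the GP prior (jitter added for numerical stability)
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)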
L20: BO in a nutshell
1. Select a Sample by Optimizing the Acquisition Function.
2. Evaluate the Sample With the Objective Function.
3. Update the Data and, in turn, the Surrogate Function.
4. Go To 1.
Extension to multiple objectives
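A minimal sketch of the loop above using scikit-optimize (an assumption: the course may use a different library; gp_minimize wraps the GP surrogate and the acquisition optimization):

from skopt import gp_minimize

def objective(params):
    x = params[0]
    return (x - 2.0) ** 2       # illustrative function to minimize

res = gp_minimize(objective,
                  [(-5.0, 5.0)],   # search space for x
                  acq_func='EI',   # expected improvement acquisition
                  n_calls=20)      # iterations of the select/evaluate/update loop
print(res.x, res.fun)              # best input found and its objective value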
L20: Acquisition Functions
[Figure: surrogate of f over x, marking the best sample found so far; choosing where to sample x next trades off exploitation against exploration.]
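One common acquisition function (an assumption here, stated for concreteness) is Expected Improvement, EI(x) = E[max(0, f* − f(x))] for minimization, where f* is the best value found so far: it is large where the surrogate predicts improvement (exploitation) or where its uncertainty is high (exploration).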
L20: BO Applications
The approach has been applied to solve a wide range of problems, including learning to rank, computer graphics and visual design, robotics, sensor networks, automatic algorithm configuration, automatic machine learning toolboxes, reinforcement learning, planning, visual attention, architecture configuration in deep learning, static program analysis, experimental particle physics, chemistry, material design, and drug development… (source: Wikipedia)
L21: Bayesian A/B Testing
[1] https://towardsdatascience.com/bayesian-a-b-testing-and-its-benefits-a7bbe5cb5103
[2] https://towardsdatascience.com/exploring-bayesian-a-b-testing-with-simulations-7500b4fc55bc
[3] https://exp-platform.com/top-challenges-from-first-practical-online-controlled-experiments-summit/
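A minimal PyMC3 sketch of a Bayesian A/B test (the conversion counts, priors, and names are illustrative assumptions):

import numpy as np
import pymc3 as pm

# Illustrative data: conversions for versions A and B
y_A = np.repeat([1, 0], [45, 955])   # 45 conversions out of 1000
y_B = np.repeat([1, 0], [60, 940])   # 60 conversions out of 1000

with pm.Model() as ab_model:
    θ_A = pm.Beta('θ_A', 1, 1)
    θ_B = pm.Beta('θ_B', 1, 1)
    obs_A = pm.Bernoulli('obs_A', θ_A, observed=y_A)
    obs_B = pm.Bernoulli('obs_B', θ_B, observed=y_B)
    delta = pm.Deterministic('delta', θ_B - θ_A)  # lift of B over A
    trace = pm.sample(2000)
# P(B better than A) ≈ fraction of posterior samples with delta > 0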
Rini Gupta’s mini-project on PyScript!
PyScript is a framework that allows users to create rich Python applications in the browser using HTML’s interface and the power of Pyodide, WASM, and modern web technologies.
Examples of PyScript can be found at https://pyscript.net/
L22: Markov Chain Monte Carlo
This is the tool PyMC uses to do the sampling…
L22: Metropolis-Hastings
[1] Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng, eds. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
[2] C. Fanelli, Measurement of Polarization Transfers in Real Compton Scattering by a Proton Target at JLab, PhD thesis.
[3] C. Fanelli et al. “Polarization Transfer in Wide-Angle Compton Scattering and Single-Pion Photoproduction from the Proton.” Physical Review Letters 115.15 (2015): 152001.
This is the engine of several applications we have seen
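A minimal NumPy sketch of the Metropolis-Hastings loop (the target and proposal scale are illustrative):

import numpy as np

def metropolis_hastings(log_target, n_samples=10000, x0=0.0, step=1.0):
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        x_new = x + np.random.normal(0, step)  # symmetric (random-walk) proposal
        # Accept with probability min(1, target(x_new) / target(x))
        if np.log(np.random.rand()) < log_target(x_new) - log_target(x):
            x = x_new
        samples[i] = x
    return samples

# Illustrative target: standard normal, up to a normalizing constant
samples = metropolis_hastings(lambda x: -0.5 * x**2)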
L22: MCMC Diagnostics
Autocorrelation
Other metrics provided by az.summary
Check for divergences
Are posteriors from different chains similar?
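For instance, a minimal ArviZ sketch (assuming trace comes from pm.sample with multiple chains):

import arviz as az

az.plot_autocorr(trace)   # autocorrelation of the samples per chain
az.plot_trace(trace)      # are posteriors from different chains similar?
print(az.summary(trace))  # ess, r_hat, and other metrics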
L22: MCMC: Other Real-world applications
[1] https://cfteach.github.io/brds/mod3_part3_extra_MCMC_denoise.html
I hope you enjoyed the course
Backup
References of our course
https://cfteach.github.io/brds/referencesmd.html
L14: Remarks on BF
L18: Disadvantages
GPs are computationally expensive: exact inference requires inverting an n × n covariance matrix, which scales as O(n³) with the number of training points.
Parametric vs. Gaussian Process