Bayesian Reasoning
In Data Science
DATA 644
Cristiano Fanelli
03/21/2024 - Lecture 15
03/26/2024 - Lecture 16
03/28/2024 - Lecture 17
Outline
Main points
Bayesian Reasoning
Assuming you cannot check Obama’s height on the internet ;)
Prior: the distribution of heights of American males.
Evidence: Obama's apparent height in a photo.
Posterior: our updated belief about his height.
Example from O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
Gaussian Processes in a nutshell
What kind of problems are we talking about?
What is a Gaussian Process?
With no data and no prior information, our lack of knowledge is reflected in the wide range of possible functions explored by the GP.
Sampling from the GP in this case yields a different function every time.
Our best guess is a mean of 0.
A few observations of the function f, i.e., data points of the form (x, f(x)), are collected.
Bayes' rule updates our beliefs about the function, giving the posterior Gaussian process.
The updated GP is constrained to the functions that fit our data points (see the sketch below).
Example from O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
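To make the prior-to-posterior update concrete, here is a minimal NumPy sketch. Everything in it is illustrative: the RBF kernel, the toy inputs, and the observation values are assumptions, not the figure's actual data.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / length_scale**2)

rng = np.random.default_rng(0)
x_grid = np.linspace(-5, 5, 100)
jitter = 1e-8 * np.eye(len(x_grid))   # small diagonal term for numerical stability

# (1) Prior: no data yet, best guess is mean 0; every draw is a different function
K_prior = rbf_kernel(x_grid, x_grid)
prior_draws = rng.multivariate_normal(np.zeros(len(x_grid)), K_prior + jitter, size=3)

# (2) A few observations (x, f(x)) are collected
x_obs = np.array([-3.0, 0.0, 2.0])
y_obs = np.array([0.5, -0.4, 1.0])

# (3) Bayes' rule: condition the joint Gaussian on the observations
K   = rbf_kernel(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))
K_s = rbf_kernel(x_obs, x_grid)
post_mean = K_s.T @ np.linalg.solve(K, y_obs)
post_cov  = K_prior - K_s.T @ np.linalg.solve(K, K_s)

# Posterior draws are constrained to pass (nearly) through the data points
post_draws = rng.multivariate_normal(post_mean, post_cov + jitter, size=3)
```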
Advantages
O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
Scikit-learn, Classifier Comparison: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
“GPs know what they do not know…”: the uncertainty of a fitted GP increases away from the training data. Other approaches, such as random forests or neural networks, simply separate the blue and red regions and remain highly confident in their predictions even far from the training data…
When you use a GP to model a problem, the prior belief can be shaped via the choice of a kernel function, as we discussed. We will expand on the kernel in the next slides; the sketch below gives a first taste.
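A quick illustration with scikit-learn's GaussianProcessRegressor; the toy data and kernel settings below are assumptions chosen for the sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D training data confined to a small region
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.sin(X_train).ravel()

# The kernel choice encodes the prior belief about the function
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gpr.fit(X_train, y_train)

# Predictive std is small near the data and grows away from it:
# the GP "knows what it does not know"
X_test = np.array([[2.0], [10.0]])
mean, std = gpr.predict(X_test, return_std=True)
print(std)  # std at x = 10 is much larger than at x = 2
```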
Key ingredient: Covariance and Kernel
[1] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (2006), The MIT Press
[2] O. Knagg, https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
The bandwidth $\ell$ (length scale) controls the width of the kernel.
A wide variety of functions can be represented through these kernels, which helps us build our prior.
The kernel quantifies the “similarity” between points in the input space and dictates how information propagates from the observed data to the predictions.
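A minimal sketch of how the bandwidth $\ell$ shapes the RBF kernel; the specific points and length scales are illustrative.

```python
import numpy as np

def rbf(x, x_prime, length_scale):
    # Similarity decays with squared distance; the bandwidth l sets how fast
    return np.exp(-0.5 * (x - x_prime) ** 2 / length_scale**2)

# Same pair of points, different bandwidths
for l in (0.5, 1.0, 2.0):
    print(f"l = {l}: k(0, 1) = {rbf(0.0, 1.0, l):.3f}")
# Small l: correlation drops quickly -> wiggly prior functions
# Large l: distant points stay correlated -> smooth, slowly varying functions
```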
Disadvantages
GPs are computationally expensive: exact inference requires factorizing the n × n covariance matrix, which costs O(n³) time and O(n²) memory in the number of training points n.
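A rough, machine-dependent sketch of that cubic scaling, timing the Cholesky factorization that dominates exact GP inference (the matrix sizes are arbitrary):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):
    A = rng.standard_normal((n, n))
    K = A @ A.T + n * np.eye(n)      # a positive-definite n x n "covariance"
    t0 = time.perf_counter()
    np.linalg.cholesky(K)            # the O(n^3) step in exact GP inference
    print(n, round(time.perf_counter() - t0, 3))
# Doubling n multiplies the runtime by roughly 8 (= 2^3)
```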
What is Bayesian in GP?
The kernel encodes our prior over functions; Bayes' rule combines this prior with the likelihood of the observed data to yield a posterior GP.
[1] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (2006), The MIT Press
[2] H. Sit, Quick start to Gaussian Process Regression
Parametric vs. Gaussian Process
Bayes
Parametric vs. Gaussian Process
Bayes
Things are simpler: with a GP prior and Gaussian noise, the posterior over functions is again a GP and is available in closed form, with no need to integrate over weight parameters.
Gaussian Process
A Gaussian process is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution.
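In the notation of Rasmussen & Williams [1]:
$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \qquad m(x) = \mathbb{E}[f(x)], \quad k(x, x') = \mathrm{Cov}\big(f(x), f(x')\big)$$
and for any finite set of inputs $X = (x_1, \dots, x_n)$, the function values are jointly multivariate Gaussian:
$$\mathbf{f} = \big(f(x_1), \dots, f(x_n)\big)^{\top} \sim \mathcal{N}\big(m(X),\, K(X, X)\big), \qquad K_{ij} = k(x_i, x_j)$$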
Gaussian Process
The kernel evaluated on the training inputs $X$ and the test inputs $X_*$ yields three covariance blocks: $K(X, X)$ from the training inputs, $K(X, X_*)$ between training and test inputs, and $K(X_*, X_*)$ between the test inputs. Observation noise contributes a variance term $\sigma_n^2 I$ on the training block.
Gaussian Process
With $n$ training inputs $X$ and $n_*$ test inputs $X_*$, the noisy observations $\mathbf{y}$ and the unknown test values $\mathbf{f}_*$ form a joint vector of dimension $n + n_*$, which is itself multivariate Gaussian:
$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right)$$
where $K(X, X)$ is $n \times n$, $K(X, X_*)$ is $n \times n_*$, and $K(X_*, X_*)$ is $n_* \times n_*$.
https://cs.stanford.edu/~rpryzant/blog/gp/gp.html
https://www.cs.cmu.edu/~epxing/Class/10708-15/notes/10708_scribe_lecture21.pdf
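A sketch of assembling these blocks in NumPy; the RBF kernel, inputs, and noise level are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(xa, xb, l=1.0):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / l**2)

X       = np.array([-3.0, 0.0, 2.0])   # n = 3 training inputs
X_star  = np.linspace(-5, 5, 50)       # n* = 50 test inputs
sigma_n = 0.1                          # observation noise std

K    = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))   # n x n, with noise
K_s  = rbf_kernel(X, X_star)                            # n x n*
K_ss = rbf_kernel(X_star, X_star)                       # n* x n*

# Joint covariance of (y, f*): an (n + n*) x (n + n*) block matrix
joint_cov = np.block([[K, K_s], [K_s.T, K_ss]])
print(joint_cov.shape)  # (53, 53)
```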
Gaussian Process
Conditioning the joint Gaussian on the observed $\mathbf{y}$ gives the predictive distribution in closed form.
Predictive mean: $$\bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\mathbf{y}$$
Predictive covariance: $$\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*)$$
https://cs.stanford.edu/~rpryzant/blog/gp/gp.html
https://www.cs.cmu.edu/~epxing/Class/10708-15/notes/10708_scribe_lecture21.pdf
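Continuing the previous sketch, the predictive equations translate directly into code; note that K below already includes the $\sigma_n^2 I$ noise term.

```python
# Condition on noisy training targets y (continuing the block-matrix sketch)
y = np.array([0.5, -0.4, 1.0])

# Solve against K instead of forming an explicit inverse (numerically safer)
f_mean = K_s.T @ np.linalg.solve(K, y)              # predictive mean, shape (n*,)
f_cov  = K_ss - K_s.T @ np.linalg.solve(K, K_s)     # predictive covariance, (n*, n*)
f_std  = np.sqrt(np.diag(f_cov))                    # pointwise uncertainty band
```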
Visual Exploration of Gaussian Processes (J. Görtler et al., “A Visual Exploration of Gaussian Processes”, Distill, 2019: https://distill.pub/2019/visual-exploration-gaussian-processes/)
(i) GP Prior
(ii) Marginal Likelihood
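For a GP prior with a normal likelihood, the marginal likelihood of the data has the standard closed form of [1], which is what gets maximized to fit the kernel hyperparameters:
$$\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\big[K(X, X) + \sigma_n^2 I\big]^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\big|K(X, X) + \sigma_n^2 I\big| \;-\; \tfrac{n}{2}\log 2\pi$$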
PyMC Notebook: Gaussian Process
For a GP prior and a normal likelihood, the marginalization over the latent function can be performed analytically, as sketched below.
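A minimal sketch with PyMC's marginal GP. The data, priors, and variable names are illustrative; the API shown is pm.gp.Marginal / marginal_likelihood as in PyMC v5.

```python
import numpy as np
import pymc as pm

# Hypothetical 1-D toy data
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)

with pm.Model():
    ell   = pm.Gamma("ell", alpha=2.0, beta=1.0)    # kernel length scale
    eta   = pm.HalfNormal("eta", sigma=1.0)         # kernel amplitude
    sigma = pm.HalfNormal("sigma", sigma=1.0)       # observation noise

    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ell)
    gp  = pm.gp.Marginal(cov_func=cov)

    # Normal likelihood + GP prior: the latent f is integrated out analytically
    gp.marginal_likelihood("y_obs", X=X, y=y, sigma=sigma)

    idata = pm.sample()  # posterior over the kernel hyperparameters
```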
GPs can be used for both regression and classification… (for non-Gaussian likelihoods, e.g., classification, the marginalization is no longer analytic and requires approximation).