1 of 22

Bayesian Reasoning

In Data Science

DATA 644

Cristiano Fanelli

03/21/2024 - Lecture 15

03/26/2024 - Lecture 16

03/28/2024 - Lecture 17

2 of 22

Outline


  • Gaussian Processes: introduction

3 of 22

Main points


  • This is an introduction and we will follow a practical approach; to go deeper, please see the references at the end of the slides
  • A Gaussian process can be seen as an infinite-dimensional generalization of the Gaussian distribution that can be used to set a prior over unknown functions (which is then updated with our observed points)
  • Gaussian processes provide a principled solution to modeling arbitrary functions by effectively letting the data decide the complexity of the function, minimizing overfitting
  • The covariance matrix is a way to encode information about how the data points are related to each other
    • The covariance matrix is specified by kernel functions
  • A few technical details on the implementation (exercise in class)

4 of 22

Bayesian Reasoning


  • As we know, in Bayesian inference our beliefs about the world are represented by probability distributions. Bayes’ rule provides a way to update these probabilities as new data are collected.

[Figure: distribution of heights of American males (prior), evidence from a photo, and the updated belief (posterior). Assuming you cannot check Obama’s height on the internet ;)]

5 of 22

Gaussian Processes in a nutshell


  • Naively, a Gaussian Process is a probability distribution over possible functions
  • A GP describes a probability distribution over functions, and Bayes’ theorem allows us to update this distribution of functions as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y = f(x): given x, you get a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure we are using a GP to approximate this function f. Intuitively, the observed points constrain the modeling: the more points, the more accurate the model (see the sketch below).
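As a concrete illustration of this setup, here is a minimal sketch (not the course notebook) using scikit-learn’s GaussianProcessRegressor; the true function (a sine) and the kernel settings are assumptions made only for this example.

# Minimal sketch (illustrative, not the course notebook): “learning” an unknown f
# from a handful of noisy observations with a GP.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(8, 1))                            # a few observed inputs x
y_train = np.sin(X_train).ravel() + 0.05 * rng.standard_normal(8)    # noisy responses y = f(x)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05**2)
gp.fit(X_train, y_train)                                             # condition the GP on the data

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)                      # posterior mean and uncertainty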

6 of 22

What is a Gaussian Process?


With no data and no prior information, our lack of knowledge is reflected in the wide range of possible functions explored by the GP.

Sampling from the GP in this case yields a different function every time (sketched below).

Our best guess is the mean, 0.

A few observations of the function f, i.e., data points of the form (x, f(x)), are collected.

Bayes’ rule updates our beliefs about the function to get the posterior Gaussian process.

The updated GP is constrained to the possible functions that fit our data points.
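A minimal numpy sketch of the prior-sampling idea described above (illustrative; the exponentiated quadratic kernel used here is the one introduced on the kernel slide): each draw from the zero-mean prior is a different plausible function.

# Illustrative sketch: drawing sample functions from a zero-mean GP prior.
import numpy as np

def exp_quad_kernel(xa, xb, ell=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 ell^2)); ell is the bandwidth (length scale)
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

x = np.linspace(-5, 5, 100)
K = exp_quad_kernel(x, x) + 1e-8 * np.eye(len(x))     # small jitter for numerical stability
rng = np.random.default_rng(1)
prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)   # 5 different functions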

7 of 22

Advantages


“GPs know what they do not know”: the uncertainty of a fitted GP increases away from the training data. Other approaches, like random forests (RF) or neural networks (NN), just separate the blue and red regions and keep high certainty on their predictions far from the training data (a small check of this behavior is sketched below).

When you are using a GP to model a problem, the prior belief can be shaped via the choice of a kernel function; we will expand on kernels in the next slides.
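A small self-contained check of this behavior (illustrative data and kernel, not the course example): the predictive standard deviation is small near the training points and reverts to the prior level far away from them.

# Illustrative check: GP uncertainty grows away from the training data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.linspace(0, 10, 8)[:, None]
y_train = np.sin(X_train).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6).fit(X_train, y_train)

_, std_near = gp.predict(np.array([[5.0]]), return_std=True)     # inside the data range
_, std_far = gp.predict(np.array([[50.0]]), return_std=True)     # far outside it
print(std_near, std_far)    # std_far >> std_near: the GP “knows what it does not know”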

8 of 22

Key-ingredient: Covariance and Kernel


  • K is the covariance (kernel) matrix, whose entries correspond to the covariance function evaluated at pairs of observations.
  • The covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector (n.b.: a vector of observations).
  • In practice, covariance matrices are specified using functions known as kernels, whose output can be interpreted as the similarity between two points (the closer two points are, the more similar).
  • One popular kernel is the exponentiated quadratic kernel (written out below)

The bandwidth (ℓ) controls the width of the kernel.

A wide variety of functions can be represented through these kernels.

This helps us build our prior.

The kernel quantifies the “similarity” between points in the input space and dictates how information propagates from the observed data to the predictions.
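For reference, the exponentiated quadratic kernel mentioned above can be written as (the amplitude \sigma_f is sometimes fixed to 1):

\[ k(x, x') = \sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2 \ell^2} \right) \]

where \ell is the bandwidth (length scale) and \sigma_f sets the vertical scale of the functions drawn from the prior.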

9 of 22

Disadvantages


GPs are computationally expensive

  • Parametric approaches distill knowledge about your data into a set of numbers, e.g., for linear regression we need two numbers, the slope and the intercept. Other approaches like NNs may need tens of millions. After they are trained, the cost of making predictions depends only on the number of parameters.
  • GPs are a non-parametric method (in reality, kernel hyperparameters blur the picture a bit, but let us neglect that), and they need to take the whole training data into account each time they make a prediction.
    • The training data have to be kept at inference time, and the computational cost of predictions scales cubically with the number of training data (the observations), as sketched below.
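A sketch of the linear algebra behind this cost (written from the standard GP equations, not from the course code): every prediction relies on a factorization of the full n x n training covariance matrix.

# Illustrative sketch of why GP predictions scale cubically with n: the full n x n
# kernel matrix of the training data must be stored and factorized.
import numpy as np

n = 1000
X = np.random.rand(n, 1)
y = np.sin(3 * X).ravel()

K = np.exp(-0.5 * (X - X.T) ** 2)                     # n x n covariance matrix (exp. quadratic kernel)
L = np.linalg.cholesky(K + 1e-6 * np.eye(n))          # O(n^3) Cholesky factorization
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # reused for every predictive mean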

10 of 22

What is Bayesian in GP?


[1] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning (2006), The MIT Press

[2] H. Sit, Quick Start to Gaussian Process Regression

  • Bayesian reasoning allows us to infer a probability distribution over all possible values.
  • Let’s assume, for the sake of argument, a linear function.
  • The Bayesian approach works by specifying a prior distribution over the parameters w, and updating the probabilities based on evidence (observed data) [1]; the standard formulas are sketched below.
  • As we said, Gaussian Process Regression is non-parametric (i.e., not limited by a functional form), so rather than calculating the probability distribution of the parameters of a specific function, GPR calculates the probability distribution over all admissible functions that fit the data.
  • The function f(x) obtained from GP regression to fit the data points is intimately linked to the kernel function k(x,x′) through the GP’s prior and the inferred posterior distribution.
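For the linear example, the standard setup of [1] reads (a sketch; \Sigma_p is the prior covariance of the weights and \sigma_n^2 the noise variance):

\[ f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p) \]
\[ \text{Bayes’ rule: } p(\mathbf{w} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)} \]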

11 of 22


Parametric VS Gaussian Process

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions

Bayes
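A sketch of the standard formulas behind the items listed above, in the notation of [1] (parameters w, model Mi):

\[ \text{Data: } \mathcal{D} = \{(\mathbf{x}_j, y_j)\}_{j=1}^{n} = (X, \mathbf{y}) \]
\[ \text{Gaussian likelihood: } p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i) = \mathcal{N}\big(\mathbf{y} \mid f_{\mathbf{w}}(X),\, \sigma_n^2 I\big) \]
\[ \text{Prior: } p(\mathbf{w} \mid \mathcal{M}_i) \]
\[ \text{Posterior (Bayes): } p(\mathbf{w} \mid \mathbf{y}, X, \mathcal{M}_i) = \frac{p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)}{p(\mathbf{y} \mid X, \mathcal{M}_i)} \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X, \mathcal{M}_i) = \int p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w} \]
\[ \text{Predictions: } p(y_* \mid \mathbf{x}_*, \mathbf{y}, X, \mathcal{M}_i) = \int p(y_* \mid \mathbf{x}_*, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathbf{y}, X, \mathcal{M}_i)\, d\mathbf{w} \]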

12 of 22

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions


Parametric VS Gaussian Process

Bayes

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions

Things are simpler!
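A sketch of the corresponding standard formulas for the GP (non-parametric) column above, again following [1]:

\[ \text{GP prior: } f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big), \quad \text{typically with } m(\mathbf{x}) = 0 \]
\[ \text{Gaussian likelihood: } \mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f},\, \sigma_n^2 I) \]
\[ \text{GP posterior: } f \mid X, \mathbf{y} \sim \mathcal{GP}\big(m_{\text{post}}(\mathbf{x}),\, k_{\text{post}}(\mathbf{x}, \mathbf{x}')\big) \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f} = \mathcal{N}\big(\mathbf{y} \mid \mathbf{0},\, K(X,X) + \sigma_n^2 I\big) \]

Everything stays Gaussian, so the integrals can be carried out in closed form: that is why “things are simpler”.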

13 of 22


Gaussian Process

Multivariate Gaussian
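A compact statement of the definition behind this slide (standard, following [1]): a Gaussian process is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. It is fully specified by a mean function and a covariance (kernel) function:

\[ f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big), \qquad m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \operatorname{Cov}\big(f(\mathbf{x}), f(\mathbf{x}')\big) \]
\[ \big(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)\big) \sim \mathcal{N}(\mathbf{m}, K), \qquad m_j = m(\mathbf{x}_j), \quad K_{jk} = k(\mathbf{x}_j, \mathbf{x}_k) \]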

14 of 22


Gaussian Process

  • y and f* have a joint distribution:

Covariance Matrix between X and X*

Covariance Matrix between X* and X*

Covariance Matrix from training inputs

(Training inputs X, test inputs X*)

Noise variance
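Written out, the joint distribution takes the standard form (zero prior mean), with the blocks matching the labels on this slide:

\[ \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right) \]

Here K(X, X) is the covariance matrix from the training inputs, K(X, X_*) the covariance between training and test inputs, K(X_*, X_*) the covariance among test inputs, and \sigma_n^2 the noise variance.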

15 of 22


Gaussian Process

  • y and f* have a joint distribution:

n x n* dims

n* x n* dims

n x n dims

y and f* form a joint vector of dimension n + n*

(Training inputs X, test inputs X*)

  • The conditional distribution of f* given y, in the context of Gaussian distributions, can be obtained directly from the joint distribution

Predictive Mean

Predictive Covariance

  • We can also marginalize over the function values f to obtain the marginal likelihood
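The standard expressions for the pieces listed above, following [1]:

\[ \mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\big(\bar{\mathbf{f}}_*,\; \operatorname{cov}(\mathbf{f}_*)\big) \]
\[ \text{Predictive mean: } \bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\, \mathbf{y} \]
\[ \text{Predictive covariance: } \operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*) \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f} \]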

16 of 22


Gaussian Process

  • Performing the integration, we obtain the log marginal likelihood (written out below)

  • This corresponds to the logarithm of a multivariate normal density

  • The log-determinant term corresponds to an entropy; see the entropy of a multivariate normal
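Written out, the standard result (following [1]) is

\[ \log p(\mathbf{y} \mid X) = -\frac{1}{2}\, \mathbf{y}^\top \big(K + \sigma_n^2 I\big)^{-1} \mathbf{y} \;-\; \frac{1}{2} \log \big| K + \sigma_n^2 I \big| \;-\; \frac{n}{2} \log 2\pi \]

The first term measures the data fit, the log-determinant term penalizes model complexity (this is the term related to the entropy of a multivariate normal), and the last term is a normalization constant.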

17 of 22


Visual Exploration of Gaussian Processes

18 of 22

(i) GP Prior


19 of 22

(ii) Marginal Likelihood


20 of 22

21 of 22


PyMC notebook VS Gaussian Process

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions

For a GP prior and a normal (Gaussian) LKD, the marginalization can be performed analytically (a minimal PyMC sketch below)
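A minimal PyMC sketch of this marginal-likelihood GP (illustrative, not the course notebook; the priors and data are assumptions for this example, and the keyword for the noise term is sigma in recent PyMC versions, noise in older ones):

# Minimal PyMC sketch (illustrative). With a GP prior and a normal likelihood,
# pm.gp.Marginal integrates out the latent function analytically.
import numpy as np
import pymc as pm

X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(30)

with pm.Model() as model:
    ell = pm.Gamma("ell", alpha=2, beta=1)                  # kernel length scale (bandwidth)
    eta = pm.HalfNormal("eta", sigma=1)                     # kernel amplitude
    cov = eta**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ell)

    gp = pm.gp.Marginal(cov_func=cov)                       # GP prior + Gaussian LKD, f marginalized
    sigma_n = pm.HalfNormal("sigma_n", sigma=1)             # observation noise
    y_obs = gp.marginal_likelihood("y_obs", X=X, y=y, sigma=sigma_n)

    idata = pm.sample()                                     # infer the kernel hyperparameters

Predictions at new inputs can then be drawn with gp.conditional(...), which implements the predictive mean and covariance seen earlier.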

22 of 22


GPs can do regression and classification…

  • In class we will see (we have already started, actually) this notebook for regression
  • Multiple applications:
    • See the statistics (originally geostatistics) technique of kriging: https://en.wikipedia.org/wiki/Kriging
    • Krige estimated the most likely distribution of gold based on samples from a few boreholes
  • For classification, you can deal with problems like the following:
  • Difficult to capture with a logistic regression