1 of 22

Bayesian Reasoning

In Data Science

DATA 644

Cristiano Fanelli

03/21/2024 - Lecture 15

03/26/2024 - Lecture 16

03/28/2024 - Lecture 17

2 of 22

Outline


  • Gaussian Processes: introduction

3 of 22

Main points


  • This is an introduction and we will follow a practical approach; to go deeper, please see the references at the end of the slides
  • A Gaussian process can be seen as an infinite-dimensional generalization of the Gaussian distribution that can be used to set a prior over unknown functions (which is then updated with our observed points)
  • Gaussian processes provide a principled solution to modeling arbitrary functions by effectively letting the data decide the complexity of the function, minimizing overfitting
  • The covariance matrix is a way to encode information about how the data points are related to each other
    • The covariance matrix is specified by kernel functions
  • A few technical details on the implementation (exercise in class)

4 of 22

Bayesian Reasoning


  • As we know, in Bayesian inference our beliefs about the world are represented by probability distributions. Bayes’ rule provides a way to update these probabilities as new data are collected.

[Figure: distribution of heights of American males (prior), evidence from a photo, and the updated belief (posterior). Assuming you cannot check Obama’s height on the internet ;)]

5 of 22

Gaussian Processes in a nutshell


  • Naively, a Gaussian Process is a probability distribution over possible functions
  • A GP describes a probability distribution over functions, and Bayes’ theorem allows us to update this distribution of functions as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y = f(x): given x, you get a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure we are using a GP to approximate this function f. Intuitively, the observed points constrain the modeling: the more points, the more accurate the model (see the sketch below).
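As a concrete illustration of this setup, here is a minimal sketch (not the course notebook) using scikit-learn’s GaussianProcessRegressor; the true function (a sine) and the kernel settings are assumptions made only for this example.

# Minimal sketch (illustrative, not the course notebook): “learning” an unknown f
# from a handful of noisy observations with a GP.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(8, 1))                            # a few observed inputs x
y_train = np.sin(X_train).ravel() + 0.05 * rng.standard_normal(8)    # noisy responses y = f(x)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05**2)
gp.fit(X_train, y_train)                                             # condition the GP on the data

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)                      # posterior mean and uncertainty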

6 of 22

What is a Gaussian Process?


With no data and no prior information, our lack of knowledge is reflected in the wide range of possible functions explored by the GP.

Sampling from the GP in this case yields a different function every time (sketched below).

Our best guess is the mean, 0.

A few observations of the function f, i.e., data points of the form (x, f(x)), are collected.

Bayes’ rule updates our beliefs about the function to get the posterior Gaussian process.

The updated GP is constrained to the possible functions that fit our data points.
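A minimal numpy sketch of the prior-sampling idea described above (illustrative; the exponentiated quadratic kernel used here is the one introduced on the kernel slide): each draw from the zero-mean prior is a different plausible function.

# Illustrative sketch: drawing sample functions from a zero-mean GP prior.
import numpy as np

def exp_quad_kernel(xa, xb, ell=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 ell^2)); ell is the bandwidth (length scale)
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

x = np.linspace(-5, 5, 100)
K = exp_quad_kernel(x, x) + 1e-8 * np.eye(len(x))     # small jitter for numerical stability
rng = np.random.default_rng(1)
prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)   # 5 different functions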

7 of 22

Advantages


“GPs know what they do not know”: the uncertainty of a fitted GP increases away from the training data. Other approaches, like random forests (RF) or neural networks (NN), just separate the blue and red regions and keep high certainty on their predictions far from the training data (a small check of this behavior is sketched below).

When you are using a GP to model a problem, the prior belief can be shaped via the choice of a kernel function; we will expand on kernels in the next slides.
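A small self-contained check of this behavior (illustrative data and kernel, not the course example): the predictive standard deviation is small near the training points and reverts to the prior level far away from them.

# Illustrative check: GP uncertainty grows away from the training data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.linspace(0, 10, 8)[:, None]
y_train = np.sin(X_train).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6).fit(X_train, y_train)

_, std_near = gp.predict(np.array([[5.0]]), return_std=True)     # inside the data range
_, std_far = gp.predict(np.array([[50.0]]), return_std=True)     # far outside it
print(std_near, std_far)    # std_far >> std_near: the GP “knows what it does not know”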

8 of 22

Key-ingredient: Covariance and Kernel


  • K is the covariance (kernel) matrix, whose entries correspond to the covariance function evaluated at pairs of observations.
  • The covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector (n.b.: a vector of observations).
  • In practice, covariance matrices are specified using functions known as kernels, whose output can be interpreted as the similarity between two points (the closer two points are, the more similar).
  • One popular kernel is the exponentiated quadratic kernel (written out below)

The bandwidth (ℓ) controls the width of the kernel.

A wide variety of functions can be represented through these kernels.

This helps us build our prior.

The kernel quantifies the “similarity” between points in the input space and dictates how information propagates from the observed data to the predictions.
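For reference, the exponentiated quadratic kernel mentioned above can be written as (the amplitude \sigma_f is sometimes fixed to 1):

\[ k(x, x') = \sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2 \ell^2} \right) \]

where \ell is the bandwidth (length scale) and \sigma_f sets the vertical scale of the functions drawn from the prior.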

9 of 22

Disadvantages


GPs are computationally expensive

  • Parametric approaches distill knowledge about your data into a set of numbers, e.g., for linear regression we need two numbers, the slope and the intercept. Other approaches like NNs may need tens of millions. After they are trained, the cost of making predictions depends only on the number of parameters.
  • GPs are a non-parametric method (in reality, kernel hyperparameters blur the picture a bit, but let us neglect that), and they need to take the whole training data into account each time they make a prediction.
    • The training data have to be kept at inference time, and the computational cost of predictions scales cubically with the number of training data (the observations), as sketched below.
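A sketch of the linear algebra behind this cost (written from the standard GP equations, not from the course code): every prediction relies on a factorization of the full n x n training covariance matrix.

# Illustrative sketch of why GP predictions scale cubically with n: the full n x n
# kernel matrix of the training data must be stored and factorized.
import numpy as np

n = 1000
X = np.random.rand(n, 1)
y = np.sin(3 * X).ravel()

K = np.exp(-0.5 * (X - X.T) ** 2)                     # n x n covariance matrix (exp. quadratic kernel)
L = np.linalg.cholesky(K + 1e-6 * np.eye(n))          # O(n^3) Cholesky factorization
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # reused for every predictive mean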

10 of 22

What is Bayesian in GP?


[1] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning (2006), The MIT Press

[2] H. Sit, Quick Start to Gaussian Process Regression

  • Bayesian reasoning allows us to infer a probability distribution over all possible values.
  • Let’s assume, for the sake of argument, a linear function.
  • The Bayesian approach works by specifying a prior distribution over the parameters w, and updating the probabilities based on evidence (observed data) [1]; the standard formulas are sketched below.
  • As we said, Gaussian Process Regression is non-parametric (i.e., not limited by a functional form), so rather than calculating the probability distribution of the parameters of a specific function, GPR calculates the probability distribution over all admissible functions that fit the data.
  • The function f(x) obtained from GP regression to fit the data points is intimately linked to the kernel function k(x,x′) through the GP’s prior and the inferred posterior distribution.
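For the linear example, the standard setup of [1] reads (a sketch; \Sigma_p is the prior covariance of the weights and \sigma_n^2 the noise variance):

\[ f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p) \]
\[ \text{Bayes’ rule: } p(\mathbf{w} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)} \]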

11 of 22


Parametric VS Gaussian Process

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions

Bayes
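A sketch of the standard formulas behind the items listed above, in the notation of [1] (parameters w, model Mi):

\[ \text{Data: } \mathcal{D} = \{(\mathbf{x}_j, y_j)\}_{j=1}^{n} = (X, \mathbf{y}) \]
\[ \text{Gaussian likelihood: } p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i) = \mathcal{N}\big(\mathbf{y} \mid f_{\mathbf{w}}(X),\, \sigma_n^2 I\big) \]
\[ \text{Prior: } p(\mathbf{w} \mid \mathcal{M}_i) \]
\[ \text{Posterior (Bayes): } p(\mathbf{w} \mid \mathbf{y}, X, \mathcal{M}_i) = \frac{p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)}{p(\mathbf{y} \mid X, \mathcal{M}_i)} \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X, \mathcal{M}_i) = \int p(\mathbf{y} \mid X, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w} \]
\[ \text{Predictions: } p(y_* \mid \mathbf{x}_*, \mathbf{y}, X, \mathcal{M}_i) = \int p(y_* \mid \mathbf{x}_*, \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathbf{y}, X, \mathcal{M}_i)\, d\mathbf{w} \]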

12 of 22

  • Supervised parametric learning
    • Data
    • Model (Mi)
    • Gaussian LKD

    • Prior

    • Posterior

    • Marginal LKD

    • Making predictions


Parametric VS Gaussian Process

Bayes

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions

Things are simpler!
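A sketch of the corresponding standard formulas for the GP (non-parametric) column above, again following [1]:

\[ \text{GP prior: } f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big), \quad \text{typically with } m(\mathbf{x}) = 0 \]
\[ \text{Gaussian likelihood: } \mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f},\, \sigma_n^2 I) \]
\[ \text{GP posterior: } f \mid X, \mathbf{y} \sim \mathcal{GP}\big(m_{\text{post}}(\mathbf{x}),\, k_{\text{post}}(\mathbf{x}, \mathbf{x}')\big) \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f} = \mathcal{N}\big(\mathbf{y} \mid \mathbf{0},\, K(X,X) + \sigma_n^2 I\big) \]

Everything stays Gaussian, so the integrals can be carried out in closed form: that is why “things are simpler”.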

13 of 22


Gaussian Process

Multivariate Gaussian
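A compact statement of the definition behind this slide (standard, following [1]): a Gaussian process is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. It is fully specified by a mean function and a covariance (kernel) function:

\[ f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big), \qquad m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \operatorname{Cov}\big(f(\mathbf{x}), f(\mathbf{x}')\big) \]
\[ \big(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)\big) \sim \mathcal{N}(\mathbf{m}, K), \qquad m_j = m(\mathbf{x}_j), \quad K_{jk} = k(\mathbf{x}_j, \mathbf{x}_k) \]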

14 of 22


Gaussian Process

  • y and f* have a joint distribution:

Covariance Matrix between X and X*

Covariance Matrix between X* and X*

Covariance Matrix from training inputs

(Training inputs X, test inputs X*)

Noise variance
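Written out, the joint distribution takes the standard form (zero prior mean), with the blocks matching the labels on this slide:

\[ \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right) \]

Here K(X, X) is the covariance matrix from the training inputs, K(X, X_*) the covariance between training and test inputs, K(X_*, X_*) the covariance among test inputs, and \sigma_n^2 the noise variance.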

15 of 22


Gaussian Process

  • y and f* have a joint distribution:

n x n* dims

n* x n* dims

n x n dims

y and f* form a joint vector of dimension n + n*

(Training inputs X, test inputs X*)

  • The conditional distribution of f* given y, in the context of Gaussian distributions, can be obtained directly from the joint distribution

Predictive Mean

Predictive Covariance

  • We can also marginalize over the function values f to obtain the marginal likelihood
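The standard expressions for the pieces listed above, following [1]:

\[ \mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\big(\bar{\mathbf{f}}_*,\; \operatorname{cov}(\mathbf{f}_*)\big) \]
\[ \text{Predictive mean: } \bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\, \mathbf{y} \]
\[ \text{Predictive covariance: } \operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*) \]
\[ \text{Marginal likelihood: } p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f} \]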

16 of 22


Gaussian Process

  • Performing the integration, we obtain the log marginal likelihood (written out below)

  • This corresponds to the logarithm of a multivariate normal density

  • The log-determinant term corresponds to an entropy; see the entropy of a multivariate normal
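Written out, the standard result (following [1]) is

\[ \log p(\mathbf{y} \mid X) = -\frac{1}{2}\, \mathbf{y}^\top \big(K + \sigma_n^2 I\big)^{-1} \mathbf{y} \;-\; \frac{1}{2} \log \big| K + \sigma_n^2 I \big| \;-\; \frac{n}{2} \log 2\pi \]

The first term measures the data fit, the log-determinant term penalizes model complexity (this is the term related to the entropy of a multivariate normal), and the last term is a normalization constant.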

17 of 22


Visual Exploration of Gaussian Processes

18 of 22

(i) GP Prior


19 of 22

(ii) Marginal Likelihood


20 of 22

21 of 22


PyMC notebook VS Gaussian Process

  • Gaussian Processes (non parametric)
    • Data

    • Gaussian LKD

    • Prior

    • GP Posterior

    • Marginal LKD

    • Making predictions

For a GP prior and a normal (Gaussian) LKD, the marginalization can be performed analytically (a minimal PyMC sketch below)
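A minimal PyMC sketch of this marginal-likelihood GP (illustrative, not the course notebook; the priors and data are assumptions for this example, and the keyword for the noise term is sigma in recent PyMC versions, noise in older ones):

# Minimal PyMC sketch (illustrative). With a GP prior and a normal likelihood,
# pm.gp.Marginal integrates out the latent function analytically.
import numpy as np
import pymc as pm

X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(30)

with pm.Model() as model:
    ell = pm.Gamma("ell", alpha=2, beta=1)                  # kernel length scale (bandwidth)
    eta = pm.HalfNormal("eta", sigma=1)                     # kernel amplitude
    cov = eta**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ell)

    gp = pm.gp.Marginal(cov_func=cov)                       # GP prior + Gaussian LKD, f marginalized
    sigma_n = pm.HalfNormal("sigma_n", sigma=1)             # observation noise
    y_obs = gp.marginal_likelihood("y_obs", X=X, y=y, sigma=sigma_n)

    idata = pm.sample()                                     # infer the kernel hyperparameters

Predictions at new inputs can then be drawn with gp.conditional(...), which implements the predictive mean and covariance seen earlier.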

22 of 22


GPs can do regression and classification…

  • In class we will see (we have already started, actually) this notebook for regression
  • Multiple applications:
    • See the statistics (originally geostatistics) technique of kriging: https://en.wikipedia.org/wiki/Kriging
    • Krige estimated the most likely distribution of gold based on samples from a few boreholes
  • For classification, you can deal with problems like the following:
  • Difficult to capture with a logistic regression