James Requeima
University of Toronto
Vector Institute
John Bronskill
University of Cambridge
Dami Choi
University of Toronto, Transluce
Rich Turner
University of Cambridge
David Duvenaud
University of Toronto
Vector Institute
LLM Posteriors over Functions: A New Mode
A crazy idea
"If I believed your prior, why not show you data and ask your posterior?"
- Andrew Gelman, Objections to Bayesian Statistics, 2008
LLM-Powered Probabilistic Prediction
Text shown to LLM
Data shown to LLM
LLMs can incorporate:
New: How to elicit LLMs’ posterior over functions?
LLM-based probabilistic prediction
LLMs have rich posteriors
Probabilistic Regression | LLMs |
Natively handles numbers | Tokenization issues |
Fast | Slow |
Closed-form predictive dists. | Hard to elicit joint posterior |
Hard to incorporate context | Simply concatenate context |
Missing input values are awkward | Automatically handles missing values. |
No knowledge besides dataset | Knows history of the world! |
The Best of Both Worlds?
Want a joint dist at query points conditioned on numbers and text:
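One plausible way to write this out (notation assumed, since the original formula did not survive extraction): with observed data D = {(x_i, y_i)} and side-information text T, we want the joint predictive at query inputs x*_{1:M},

```latex
p\bigl(y^*_{1:M} \mid x^*_{1:M}, \mathcal{D}, T\bigr)
  \;=\; \prod_{m=1}^{M} p\bigl(y^*_m \mid y^*_{1:m-1}, x^*_{1:m}, \mathcal{D}, T\bigr).
```

The autoregressive factorization on the right is the one the sampling procedure later in the deck uses.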
Eliciting posteriors from LLMs
Eliciting a Distribution Over Numbers
How to represent “50.318”?
50.318
5.0318 × 10^1
50318/1000
0050.31800000
Fifty point three one eight.
110010.010001011000100110111101...
🕔🕔🕔🕔🕔 ⚪ 🕒🕒🕒 ⚪ 🕐 🕜
Eliciting a Distribution Over Numbers
“50.318”
“5.0318 × 10^1”
“50318/1000”
“0050.31800000”
“Fifty point three one eight.”
“110010.010001011000100110111101...”
“ 🕔🕔🕔🕔🕔 ⚪ 🕒🕒🕒 ⚪ 🕐 🕜 ”
"Half a century, a trinity of tenths,
and a one with an octal's eighth"
Two methods for eliciting: sampling-based and logit-based.
[Diagram: convert the observed points and the query location into a string prompt; the LLM generates a sample that continues the prompt.]
Sampling to estimate predictive dists
Repeated rejection sampling
Sampling trajectories
Sampling a Predictive Distribution
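A minimal sketch of the sampling route (the helper `sample_completion` is an assumed stand-in for whatever LLM API is used, not the authors' exact implementation): serialize the observed (x, y) pairs and the query x as text, sample short continuations, and keep those that parse as numbers.

```python
import re

def sample_predictive(observed, x_query, sample_completion, n_samples=50):
    """Estimate the predictive distribution at x_query by sampling continuations.

    `observed` is a list of (x, y) pairs; `sample_completion(prompt)` is an
    assumed helper that returns one short text continuation of `prompt`.
    """
    # Serialize the context, placing the point closest to the query last
    # (the data-ordering experiments below suggest this ordering helps).
    ordered = sorted(observed, key=lambda p: -abs(p[0] - x_query))
    prompt = "\n".join(f"{x},{y}" for x, y in ordered) + f"\n{x_query},"

    samples = []
    while len(samples) < n_samples:
        text = sample_completion(prompt)
        m = re.match(r"\s*(-?\d+(?:\.\d+)?)", text)
        if m:                       # rejection sampling: discard non-numeric output
            samples.append(float(m.group(1)))
    return samples                  # empirical predictive samples at x_query
```

Repeating this over a grid of query locations and reading off quantiles of the samples gives predictive bands like those plotted in the slides.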
From logits to numerical distributions
From logits to marginals over numbers
[Figure: the mass assigned to values between 300 and 400 is read off the logits for the leading digit; successive digits refine the interval, 300-400 → 320-330 → 325-326.]
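A rough sketch of the logit route (again with an assumed helper, `next_digit_probs`, returning the model's probability for each digit token): the mass on an interval such as [325, 326) is the product of the probabilities of its successive digits.

```python
def interval_mass(prompt, digits, next_digit_probs):
    """Mass the model assigns to numbers starting with the digit string `digits`.

    `next_digit_probs(text)` is an assumed helper returning a dict that maps
    the digit tokens '0'..'9' to probabilities given `text`. For example,
    digits='3' gives the mass on [300, 400), and digits='325' gives the mass
    on [325, 326), computed as P('3') * P('2' | '3') * P('5' | '32').
    """
    mass = 1.0
    generated = ""
    for d in digits:
        probs = next_digit_probs(prompt + generated)
        mass *= probs[d]
        generated += d
    return mass
```

Note that digits='3' corresponds to [300, 400) only once the number of integer digits is pinned down; the extra slides flag exactly this ("we need a way to establish the order of magnitude").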
Sampling vs. Logit-Based Predictive Distributions
What format to use?
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
A good balance of performance and token efficiency.
Data ordering
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
“5.3,2.2\n1.1,1.2\n2.5,4.1\n2.7,”
Closest value to target is last in prompt.
Output Scaling
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
Effect of rescaling the y-value range.
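A minimal sketch of one such rescaling (the specific min/max mapping into [0, 10] is an assumption for illustration, not necessarily the scheme used in the experiments):

```python
def make_scaler(y_values, lo=0.0, hi=10.0):
    """Return scale/unscale functions mapping y-values into [lo, hi].

    Rescaling keeps the numbers written into the prompt short and in a
    consistent range; predictions are mapped back to the original scale.
    """
    y_min, y_max = min(y_values), max(y_values)
    span = (y_max - y_min) or 1.0   # guard against constant data

    def scale(y):
        return lo + (hi - lo) * (y - y_min) / span

    def unscale(s):
        return y_min + (s - lo) * span / (hi - lo)

    return scale, unscale
```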
Top-p and Temperature
Rich joint distributions?
[Diagram: convert the observed points and the query location into a string prompt; the LLM generates a sample that continues the prompt.]
Autoregressive:
1. Append previous output to prompt
2. Add new query point
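In code, the autoregressive loop might look like this (sketch only; `sample_completion` is the same assumed LLM helper as in the sampling sketch above):

```python
import re

def sample_joint_trajectory(observed, query_xs, sample_completion):
    """Draw one joint sample over all query locations, autoregressively.

    Each sampled y is appended to the prompt before the next query location
    is added, so later predictions are conditioned on earlier ones.
    """
    prompt = "\n".join(f"{x},{y}" for x, y in observed)
    trajectory = []
    for xq in query_xs:
        prompt += f"\n{xq},"
        while True:                              # reject non-numeric continuations
            m = re.match(r"\s*(-?\d+(?:\.\d+)?)", sample_completion(prompt))
            if m:
                break
        y = float(m.group(1))
        prompt += str(y)        # 1. append previous output to the prompt
        trajectory.append(y)    # 2. the next iteration adds a new query point
    return trajectory
```

Running this repeatedly gives samples from the joint predictive over all query locations.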
Order-dependent!
How good are LLMPs on (only) numerical data?
How good of a model is it out-of-the-box?
LLMP
Mixtral-8×7B
Squared Exponential GP
Black-box Function Optimization
Model: Llama-3-8B, 10 samples
Multimodal posteriors
Multivariate data
Suppose you want to model variables c and e conditional on variables a, b, and d.
Structure your prompts like this (see the sketch below), and predict the missing variables autoregressively.
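A sketch of what such a prompt might look like (the key/value syntax here is an assumption for illustration): each observed row lists all variables, and the final row lists the conditioning variables a, b, d, leaving c and e to be generated, with e predicted only after a value for c has been sampled.

```python
def multivariate_prompt(rows, a, b, d, c=None):
    """Build a prompt for predicting c (and then e) given a, b, d.

    `rows` is a list of dicts with keys 'a', 'b', 'c', 'd', 'e'. Call once
    with c=None to sample c, then again with the sampled c to predict e
    autoregressively.
    """
    lines = [
        f"a: {r['a']}, b: {r['b']}, d: {r['d']}, c: {r['c']}, e: {r['e']}"
        for r in rows
    ]
    query = f"a: {a}, b: {b}, d: {d}, c:"
    if c is not None:
        query += f" {c}, e:"    # second pass: condition on the sampled c
    lines.append(query)
    return "\n".join(lines)
```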
Multidimensional data: weather prediction (1D in, 3D out) using Llama-3-8B
Simultaneously predict temperature, precipitation, and wind speed in London, UK.
LLMP
Mixtral-8×7B
Squared Exponential GP
Multivariate outputs
Image Reconstruction (2D in, 1D out) using Mixtral 8×7B
True
20% Observed
20% Reconstruct
50% Observed
50% Reconstruct
Blue pixels are unobserved.
Multivariate inputs
What can we do that we couldn’t before?
Conditioning on text
Predictive Distributions Conditioned on Instructions
Task: Predict house prices given 10 examples.
“30.45738, -97.75516, 78729, 107830.0, 30907, 1216.1, 1349, 3"
“Location: Austin, Texas, Latitude: 30.45738, Longitude: -97.75516, Zip Code: 78729, Median Household Income: 107830.0, Zip Code Population: 30907 people, Zip Code Density: 1216.1 people per square mile, Living Space: 1349 square feet, Number of Bedrooms: 3, Number of Bathrooms: 2”
Same numbers
Labels are helpful
Using Mixtral-8×7B Instruction Tuned
[Figure panels: Without Text Labels vs. With Text Labels]
Summary
Image Source: Google Research blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance
Regression is an LLM capability
Probabilistic Regression | LLMs |
Natively handles numbers | Tokenization issues |
Fast | Slow |
Closed-form predictive dists. | Hard to elicit joint posterior |
Hard to incorporate context | Simply concatenate context |
Missing input values are awkward | Automatically handles missing values. |
No knowledge besides dataset | Knows history of the world! |
Summary
James Requeima
University of Toronto
Vector Institute
John Bronskill
University of Cambridge
Dami Choi
University of Toronto
Rich Turner
University of Cambridge
David Duvenaud
University of Toronto
Vector Institute
Thanks!
Extra Slides
We can learn from similar problems in-context
LLM-Powered spreadsheet prediction
LLMPs on tabular data
LLM
“Location: Austin, Texas, Latitude: 30.45738, Longitude: -97.75516, Zip Code: 78729, Median Household Income: 107830.0, Zip Code Population: 30907 people, Zip Code Density: 1216.1 people per square mile, Living Space: 1349 square feet, Number of Bedrooms: 3, Number of Bathrooms: 2”
Context and Target Ordering
Predict the target point closest to the context set
“1.1,1.2\n 5.3,2.2\n 2.5,4.1\n 3.0,”
“1.1,1.2\n 2.5,4.1\n 5.3,2.2\n 3.0, \n 4.0, ”
“5.3,2.2\n 4.0, \n 3.0, \n 2.5,4.1\n 1.1,1.2\n 0.0,”
Should we be worried about output ordering?
Autoregressive
No
Yes
One problem…
We need a way to establish the order of magnitude
From logits to marginals over numbers
Note: Llama (< 3), Mixtral, Gemma, DeepSeek, and Yi tokenize individual digits; Llama-3 and GPT-4 do not.
Probabilistic Regression | LLMs | LLMPs |
Natively handles numerical values | Need to tokenize + detokenize numbers | Stress-tested numerical encoding schemes |
Fast inference + predictions | Slow | Still slow (but worth it!) |
Usually give closed-form predictive distributions. | Distribution over tokens, not clear how to elicit joint numerical posteriors | We show how to elicit arbitrarily complex numerical posterior predictive distributions |
Hard to incorporate side info. | Simply attach side info in text or images | Simply attach side info in text or images (not tested). |
Sometimes need special handling for missing values | Automatically handles special or missing values | Automatically handles special or missing values |
Ignorant of context unless explicitly incorporated into model | Automatically conditions on entire history of the world! | Automatically conditions on entire history of the world! |
Unlocking multi-modal regression
" Please predict the price I could expect to get if I sold the house at 157 Bellmont Ave, Charlotte, NC, pictured here:
Here are recent prices of other houses sold in the area and a video of the interior:
Please give me uncertainty estimates.”
Here is a forecast of the price you could expect for 157 Bellmont Ave, depending on which month you list the house:
LLM
What should we do next?
Ideas?
Context is Key Benchmark
Context is Key: A Benchmark for Forecasting with Essential Textual Information. Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin. Preprint (2024)
https://github.com/ServiceNow/context-is-key-forecasting
Direct Prompt Forecasting
I have a time series forecasting task for you.
Here is some context about the task. Make sure to factor in any background knowledge,
satisfy any constraints, and respect any scenarios.
<context>
((context))
</context>
Here is a historical time series in (timestamp, value) format:
<history>
((history))
</history>
Now please predict the value at the following timestamps: ((pred time)).
Return the forecast in (timestamp, value) format in between <forecast> and </forecast> tags.
Do not include any other information (e.g., comments) in the forecast.
Example:
<history>
(t1, v1)
(t2, v2)
(t3, v3)
</history>
<forecast>
(t4, v4)
(t5, v5)
</forecast>
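A small helper that fills the template above (the ((...)) placeholder syntax comes from the benchmark prompt; the substitution code itself is just an illustrative sketch):

```python
def fill_direct_prompt(template, context, history, pred_times):
    """Substitute the ((context)), ((history)) and ((pred time)) slots.

    `history` is a list of (timestamp, value) pairs and `pred_times` is a
    list of timestamps to forecast.
    """
    history_str = "\n".join(f"({t}, {v})" for t, v in history)
    return (template
            .replace("((context))", context)
            .replace("((history))", history_str)
            .replace("((pred time))", ", ".join(str(t) for t in pred_times)))
```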
Works well for sufficiently powerful instruction-tuned models
Predictive Power for Everyday Users
Moving Forward
So LLMs can do regression as well as traditional methods, but…
you need to use a big, slow model.
Why all this trouble to use an LLM for probabilistic prediction?
So What?
There is a huge win in being able to provide problem-specific text and leverage the knowledge captured in the LLM!
Eliciting a Distribution Over Numerical Values
[Figure: the prompt is fed to the LLM, which outputs a distribution P(token) over the token index.]
LLMPs are (Conditional) Neural Processes [1, 2]
CNP: Neural Network
LLMP: LLM
1. M. Garnelo et al., Conditional Neural Processes, 2018
2. W. Bruinsma et al., Autoregressive Conditional Neural Processes, 2023
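Spelled out (standard CNP notation; the LLMP line is a paraphrase of the analogy rather than a formula taken from the slide): both define a predictive distribution at a target input given a context set, one through a learned encoder and one through a prompt.

```latex
\text{CNP:}\quad  p_\theta\!\left(y^* \mid x^*, \mathcal{D}_c\right)
   = p_\theta\!\left(y^* \mid x^*, \operatorname{enc}_\theta(\mathcal{D}_c)\right)

\text{LLMP:}\quad p_{\text{LLM}}\!\left(y^* \mid x^*, \mathcal{D}_c, T\right)
   = p_{\text{LLM}}\!\left(y^* \mid \operatorname{prompt}(\mathcal{D}_c, T, x^*)\right)
```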
LLM Optimization.
What is going on under the hood?
Integrating Contextual Conditioning and Numerical Prediction
1. N. Gruver et al., Large Language Models are Zero-Shot Time Series Forecasters, 2023
2. K. Choi et al., LMPriors: Pre-trained Language Models as Task-Specific Priors, 2022
one-dimensional time series -> multi-dimensional regression and density estimation
condition on data and unstructured text
Prompt Formatting
Scaling y-values
Top-p and Temperature
Prompt Orders
Prompt Orders: Independent vs Autoregressive
US Housing Prices
Well Calibrated Uncertainty: Black-box Function Optimization
The objective is to find the value of x that maximizes the function using the minimum number of function queries.
Black-box Function Optimization
We start by selecting a small number of “cold-start” points and query the function there.
We choose the next location to query the function where uncertainty is highest.
Black-box Function Optimization
Model: Llama-3-8B, 10 samples
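A sketch of the acquisition step described above (illustrative only; `sample_predictive` is the sampling helper sketched earlier, and sample standard deviation stands in for the uncertainty measure):

```python
import statistics

def choose_next_query(observed, candidate_xs, sample_completion):
    """Pick the candidate x where the predictive spread is largest.

    Uses `sample_predictive` from the sampling sketch with 10 samples per
    candidate, matching the setup on the slide (Llama-3-8B, 10 samples).
    """
    def spread(x):
        samples = sample_predictive(observed, x, sample_completion, n_samples=10)
        return statistics.stdev(samples)

    return max(candidate_xs, key=spread)
```

After querying the true function at the chosen x, the new (x, y) pair is added to the observed set and the loop repeats.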
We can interpret this as the mass that the model assigns to values between 325 and 326, given that the first two digits were 3 and 2.
Method used in LLMTime: Large Language Models are Zero-Shot Time Series Forecasters, Gruver et al., NeurIPS 2023
A Logit Based Predictive Distribution
Probabilistic Regression | LLMs | LLMPs |
Natively handles numerical values | Need to tokenize + detokenize numbers | Stress-tested numerical encoding schemes |
Fast inference + predictions | Slow | Still slow (but worth it!) |
Usually give closed-form predictive distributions. | Distribution over tokens, not clear how to elicit joint numerical posteriors | We show how to elicit arbitrarily complex numerical posterior predictive distributions |
Hard to incorporate side info. | Simply attach side info in text or images | Simply attach side info in text or images |
Sometimes need special handling for missing values | Automatically handles special or missing values. | Automatically handles special or missing values. |
Ignorant of context unless explicitly incorporated into model | Automatically conditions on entire history of the world! | Automatically conditions on entire history of the world! |
The good: helping to democratize AI
The bad: along with the flexibility of LLMs, LLMPs inherit their drawbacks.
Conclusions
We can learn from similar problems in-context
LLMTime vs. LLMP prompt formats
LLMTime: “1.2, 4.1, 2.4, 2.6,”
LLMP: “1,1.2\n 2, 4.1\n 3, 2.4\n 4, 2.6\n 5”
With a missing value:
LLMTime: “1.2, 4.1, NaN, 2.6,”
LLMP: “1,1.2\n 2, 4.1\n 4, 2.6\n 5”
1. Large Language Models are Zero-Shot Time Series Forecasters, Gruver et al., NeurIPS 2023
Comparison to LLMTime
Task: predict temperature for London for 86 consecutive days commencing on Dec 12, 2023
using Llama-2-7B
LLMP
LLMTime