James Requeima
University of Toronto
Vector Institute
John Bronskill
University of Cambridge
Dami Choi
University of Toronto, Transluce
Rich Turner
University of Cambridge
David Duvenaud
University of Toronto
Vector Institute
LLM Posteriors over Functions: A New Mode
A crazy idea
"If I believed your prior, why not show you data and ask your posterior?"
- Andrew Gelman, Objections to Bayesian Statistics, 2008
LLM-Powered Probabilistic Prediction
Text shown to LLM
Data shown to LLM
LLMs can incorporate:
New: How to elicit LLMs’ posterior over functions?
LLM-based probabilistic prediction
LLMs have rich posteriors
Probabilistic Regression | LLMs |
Natively handles numbers | Tokenization issues |
Fast | Slow |
Closed-form predictive dists. | Hard to elicit joint posterior |
Hard to incorporate context | Simply concatenate context |
Missing input values are awkward | Automatically handles missing values. |
No knowledge besides dataset | Knows history of the world! |
The Best of Both Worlds?
Want a joint dist at query points conditioned on numbers and text:
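One plausible way to write this out (notation assumed, since the original formula did not survive extraction): with observed data D = {(x_i, y_i)} and side-information text T, we want the joint predictive at query inputs x*_{1:M},

```latex
p\bigl(y^*_{1:M} \mid x^*_{1:M}, \mathcal{D}, T\bigr)
  \;=\; \prod_{m=1}^{M} p\bigl(y^*_m \mid y^*_{1:m-1}, x^*_{1:m}, \mathcal{D}, T\bigr).
```

The autoregressive factorization on the right is the one the sampling procedure later in the deck uses.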
Eliciting posteriors from LLMs
Eliciting a Distribution Over Numbers
How to represent “50.318”?
50.318
5.0318 × 10^1
50318/1000
0050.31800000
Fifty point three one eight.
110010.010001011000100110111101...
🕔🕔🕔🕔🕔 ⚪ 🕒🕒🕒 ⚪ 🕐 🕜
Eliciting a Distribution Over Numbers
“50.318”
“5.0318 × 10^1”
“50318/1000”
“0050.31800000”
“Fifty point three one eight.”
“110010.010001011000100110111101...”
“ 🕔🕔🕔🕔🕔 ⚪ 🕒🕒🕒 ⚪ 🕐 🕜 ”
"Half a century, a trinity of tenths,
and a one with an octal's eighth"
Two methods for eliciting: sampling-based and logit-based.
[Diagram: convert the observed points and the query location into a string prompt; the LLM generates a sample that continues the prompt.]
Sampling to estimate predictive dists
Repeated rejection sampling
Sampling trajectories
Sampling a Predictive Distribution
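A minimal sketch of the sampling route (the helper `sample_completion` is an assumed stand-in for whatever LLM API is used, not the authors' exact implementation): serialize the observed (x, y) pairs and the query x as text, sample short continuations, and keep those that parse as numbers.

```python
import re

def sample_predictive(observed, x_query, sample_completion, n_samples=50):
    """Estimate the predictive distribution at x_query by sampling continuations.

    `observed` is a list of (x, y) pairs; `sample_completion(prompt)` is an
    assumed helper that returns one short text continuation of `prompt`.
    """
    # Serialize the context, placing the point closest to the query last
    # (the data-ordering experiments below suggest this ordering helps).
    ordered = sorted(observed, key=lambda p: -abs(p[0] - x_query))
    prompt = "\n".join(f"{x},{y}" for x, y in ordered) + f"\n{x_query},"

    samples = []
    while len(samples) < n_samples:
        text = sample_completion(prompt)
        m = re.match(r"\s*(-?\d+(?:\.\d+)?)", text)
        if m:                       # rejection sampling: discard non-numeric output
            samples.append(float(m.group(1)))
    return samples                  # empirical predictive samples at x_query
```

Repeating this over a grid of query locations and reading off quantiles of the samples gives predictive bands like those plotted in the slides.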
From logits to numerical distributions
From logits to marginals over numbers
[Figure: the mass assigned to values between 300 and 400 is read off the logits for the leading digit; successive digits refine the interval, 300-400 → 320-330 → 325-326.]
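A rough sketch of the logit route (again with an assumed helper, `next_digit_probs`, returning the model's probability for each digit token): the mass on an interval such as [325, 326) is the product of the probabilities of its successive digits.

```python
def interval_mass(prompt, digits, next_digit_probs):
    """Mass the model assigns to numbers starting with the digit string `digits`.

    `next_digit_probs(text)` is an assumed helper returning a dict that maps
    the digit tokens '0'..'9' to probabilities given `text`. For example,
    digits='3' gives the mass on [300, 400), and digits='325' gives the mass
    on [325, 326), computed as P('3') * P('2' | '3') * P('5' | '32').
    """
    mass = 1.0
    generated = ""
    for d in digits:
        probs = next_digit_probs(prompt + generated)
        mass *= probs[d]
        generated += d
    return mass
```

Note that digits='3' corresponds to [300, 400) only once the number of integer digits is pinned down; the extra slides flag exactly this ("we need a way to establish the order of magnitude").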
Sampling vs. Logit-Based Predictive Distributions
What format to use?
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
A good balance of performance and token efficiency.
Data ordering
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
“5.3,2.2\n1.1,1.2\n2.5,4.1\n2.7,”
Closest value to target is last in prompt.
Output Scaling
Experiments performed using Llama-2-7B, Llama-2-70B, and Mixtral-8x7B
Effect of rescaling the y-value range.
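A minimal sketch of one such rescaling (the specific min/max mapping into [0, 10] is an assumption for illustration, not necessarily the scheme used in the experiments):

```python
def make_scaler(y_values, lo=0.0, hi=10.0):
    """Return scale/unscale functions mapping y-values into [lo, hi].

    Rescaling keeps the numbers written into the prompt short and in a
    consistent range; predictions are mapped back to the original scale.
    """
    y_min, y_max = min(y_values), max(y_values)
    span = (y_max - y_min) or 1.0   # guard against constant data

    def scale(y):
        return lo + (hi - lo) * (y - y_min) / span

    def unscale(s):
        return y_min + (s - lo) * span / (hi - lo)

    return scale, unscale
```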
Top-p and Temperature
Rich joint distributions?
[Diagram: convert the observed points and the query location into a string prompt; the LLM generates a sample that continues the prompt.]
Autoregressive:
1. Append previous output to prompt
2. Add new query point
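In code, the autoregressive loop might look like this (sketch only; `sample_completion` is the same assumed LLM helper as in the sampling sketch above):

```python
import re

def sample_joint_trajectory(observed, query_xs, sample_completion):
    """Draw one joint sample over all query locations, autoregressively.

    Each sampled y is appended to the prompt before the next query location
    is added, so later predictions are conditioned on earlier ones.
    """
    prompt = "\n".join(f"{x},{y}" for x, y in observed)
    trajectory = []
    for xq in query_xs:
        prompt += f"\n{xq},"
        while True:                              # reject non-numeric continuations
            m = re.match(r"\s*(-?\d+(?:\.\d+)?)", sample_completion(prompt))
            if m:
                break
        y = float(m.group(1))
        prompt += str(y)        # 1. append previous output to the prompt
        trajectory.append(y)    # 2. the next iteration adds a new query point
    return trajectory
```

Running this repeatedly gives samples from the joint predictive over all query locations.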
Order-dependent!
How good are LLMPs on (only) numerical data?
How good of a model is it out-of-the-box?
LLMP
Mixtral-8×7B
Squared Exponential GP
Black-box Function Optimization
Model: Llama-3-8B, 10 samples
Multimodal posteriors
Multivariate data
Suppose you want to model variables c and e conditional on variables a, b, and d.
Structure your prompts like this (see the sketch below), and predict the missing variables autoregressively.
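A sketch of what such a prompt might look like (the key/value syntax here is an assumption for illustration): each observed row lists all variables, and the final row lists the conditioning variables a, b, d, leaving c and e to be generated, with e predicted only after a value for c has been sampled.

```python
def multivariate_prompt(rows, a, b, d, c=None):
    """Build a prompt for predicting c (and then e) given a, b, d.

    `rows` is a list of dicts with keys 'a', 'b', 'c', 'd', 'e'. Call once
    with c=None to sample c, then again with the sampled c to predict e
    autoregressively.
    """
    lines = [
        f"a: {r['a']}, b: {r['b']}, d: {r['d']}, c: {r['c']}, e: {r['e']}"
        for r in rows
    ]
    query = f"a: {a}, b: {b}, d: {d}, c:"
    if c is not None:
        query += f" {c}, e:"    # second pass: condition on the sampled c
    lines.append(query)
    return "\n".join(lines)
```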
Multidimensional data: weather prediction (1D in, 3D out) using Llama-3-8B
Simultaneously predict temperature, precipitation, and wind speed in London, UK.
LLMP
Mixtral-8×7B
Squared Exponential GP
Multivariate outputs
Image Reconstruction (2D in, 1D out) using Mixtral 8×7B
True
20% Observed
20% Reconstruct
50% Observed
50% Reconstruct
Blue pixels are unobserved.
Multivariate inputs
What can we do that we couldn’t before?
Conditioning on text
Predictive Distributions Conditioned on Instructions
Task: Predict house prices given 10 examples.
“30.45738, -97.75516, 78729, 107830.0, 30907, 1216.1, 1349, 3"
“Location: Austin, Texas, Latitude: 30.45738, Longitude: -97.75516, Zip Code: 78729, Median Household Income: 107830.0, Zip Code Population: 30907 people, Zip Code Density: 1216.1 people per square mile, Living Space: 1349 square feet, Number of Bedrooms: 3, Number of Bathrooms: 2”
Same numbers
Labels are helpful
Using Mixtral-8×7B Instruction Tuned
[Figure panels: Without Text Labels vs. With Text Labels]
Summary
Image Source: Google Research blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance
Regression is an LLM capability
Probabilistic Regression | LLMs |
Natively handles numbers | Tokenization issues |
Fast | Slow |
Closed-form predictive dists. | Hard to elicit joint posterior |
Hard to incorporate context | Simply concatenate context |
Missing input values are awkward | Automatically handles missing values. |
No knowledge besides dataset | Knows history of the world! |
Summary
James Requeima
University of Toronto
Vector Institute
John Bronskill
University of Cambridge
Dami Choi
University of Toronto
Rich Turner
University of Cambridge
David Duvenaud
University of Toronto
Vector Institute
Thanks!
Extra Slides
We can learn from similar problems in-context
LLM-Powered spreadsheet prediction
LLMPs on tabular data
LLM
“Location: Austin, Texas, Latitude: 30.45738, Longitude: -97.75516, Zip Code: 78729, Median Household Income: 107830.0, Zip Code Population: 30907 people, Zip Code Density: 1216.1 people per square mile, Living Space: 1349 square feet, Number of Bedrooms: 3, Number of Bathrooms: 2”
Context and Target Ordering
Predict the target point closest to the context set
“1.1,1.2\n 5.3,2.2\n 2.5,4.1\n 3.0,”
“1.1,1.2\n 2.5,4.1\n 5.3,2.2\n 3.0, \n 4.0, ”
“5.3,2.2\n 4.0, \n 3.0, \n 2.5,4.1\n 1.1,1.2\n 0.0,”
Should we be worried about output ordering?
Autoregressive
No
Yes
One problem…
We need a way to establish the order of magnitude
From logits to marginals over numbers
Note: Llama (< 3), Mixtral, Gemma, DeepSeek, and Yi tokenize individual digits; Llama-3 and GPT-4 do not.
Probabilistic Regression | LLMs | LLMPs |
Natively handles numerical values | Need to tokenize + detokenize numbers | Stress-tested numerical encoding schemes |
Fast inference + predictions | Slow | Still slow (but worth it!) |
Usually give closed-form predictive distributions. | Distribution over tokens, not clear how to elicit joint numerical posteriors | We show how to elicit arbitrarily complex numerical posterior predictive distributions |
Hard to incorporate side info. | Simply attach side info in text or images | Simply attach side info in text or images (not tested). |
Sometimes need special handling for missing values | Automatically handles special or missing values | Automatically handles special or missing values |
Ignorant of context unless explicitly incorporated into model | Automatically conditions on entire history of the world! | Automatically conditions on entire history of the world! |
Unlocking multi-modal regression
" Please predict the price I could expect to get if I sold the house at 157 Bellmont Ave, Charlotte, NC, pictured here:
Here are recent prices of other houses sold in the area and a video of the interior:
Please give me uncertainty estimates.”
Here is a forecast of the price you could expect for 157 Bellmont Ave, depending on which month you list the house:
LLM
What should we do next?
Ideas?
Context is Key Benchmark
Context is Key: A Benchmark for Forecasting with Essential Textual Information. Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin. Preprint (2024)
https://github.com/ServiceNow/context-is-key-forecasting
Direct Prompt Forecasting
I have a time series forecasting task for you.
Here is some context about the task. Make sure to factor in any background knowledge,
satisfy any constraints, and respect any scenarios.
<context>
((context))
</context>
Here is a historical time series in (timestamp, value) format:
<history>
((history))
</history>
Now please predict the value at the following timestamps: ((pred time)).
Return the forecast in (timestamp, value) format in between <forecast> and </forecast> tags.
Do not include any other information (e.g., comments) in the forecast.
Example:
<history>
(t1, v1)
(t2, v2)
(t3, v3)
</history>
<forecast>
(t4, v4)
(t5, v5)
</forecast>
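A small helper that fills the template above (the ((...)) placeholder syntax comes from the benchmark prompt; the substitution code itself is just an illustrative sketch):

```python
def fill_direct_prompt(template, context, history, pred_times):
    """Substitute the ((context)), ((history)) and ((pred time)) slots.

    `history` is a list of (timestamp, value) pairs and `pred_times` is a
    list of timestamps to forecast.
    """
    history_str = "\n".join(f"({t}, {v})" for t, v in history)
    return (template
            .replace("((context))", context)
            .replace("((history))", history_str)
            .replace("((pred time))", ", ".join(str(t) for t in pred_times)))
```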
Works well for sufficiently powerful instruction-tuned models
Predictive Power for Everyday Users
Moving Forward
So LLMs can do regression as well as traditional methods, but…
you need to use a big, slow model.
Why all this trouble to use an LLM for probabilistic prediction?
So What?
There is a huge win in being able to provide problem-specific text and leverage the knowledge captured in the LLM!
Eliciting a Distribution Over Numerical Values
[Figure: the prompt is fed to the LLM, which outputs a distribution P(token) over the token index.]
LLMPs are (Conditional) Neural Processes [1, 2]
CNP: Neural Network
LLMP: LLM
1. M. Garnelo et al., Conditional Neural Processes, 2018
2. W. Bruinsma et al., Autoregressive Conditional Neural Processes, 2023
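Spelled out (standard CNP notation; the LLMP line is a paraphrase of the analogy rather than a formula taken from the slide): both define a predictive distribution at a target input given a context set, one through a learned encoder and one through a prompt.

```latex
\text{CNP:}\quad  p_\theta\!\left(y^* \mid x^*, \mathcal{D}_c\right)
   = p_\theta\!\left(y^* \mid x^*, \operatorname{enc}_\theta(\mathcal{D}_c)\right)

\text{LLMP:}\quad p_{\text{LLM}}\!\left(y^* \mid x^*, \mathcal{D}_c, T\right)
   = p_{\text{LLM}}\!\left(y^* \mid \operatorname{prompt}(\mathcal{D}_c, T, x^*)\right)
```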
LLM Optimization.
What is going on under the hood?
Integrating Contextual Conditioning and Numerical Prediction
1. N. Gruver et al., Large Language Models are Zero-Shot Time Series Forecasters, 2023
2. K. Choi et al., LMPriors: Pre-trained Language Models as Task-Specific Priors, 2022
one-dimensional time series -> multi-dimensional regression and density estimation
condition on data and unstructured text
Prompt Formatting
Scaling y-values
Top-p and Temperature
Prompt Orders
Prompt Orders: Independent vs Autoregressive
US Housing Prices
Well Calibrated Uncertainty: Black-box Function Optimization
The objective is to find the value of x that maximizes the function using the minimum number of function queries.
Black-box Function Optimization
We start by selecting a small number of “cold-start” points and query the function there.
We choose the next location to query the function where uncertainty is highest.
Black-box Function Optimization
Model: Llama-3-8B, 10 samples
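A sketch of the acquisition step described above (illustrative only; `sample_predictive` is the sampling helper sketched earlier, and sample standard deviation stands in for the uncertainty measure):

```python
import statistics

def choose_next_query(observed, candidate_xs, sample_completion):
    """Pick the candidate x where the predictive spread is largest.

    Uses `sample_predictive` from the sampling sketch with 10 samples per
    candidate, matching the setup on the slide (Llama-3-8B, 10 samples).
    """
    def spread(x):
        samples = sample_predictive(observed, x, sample_completion, n_samples=10)
        return statistics.stdev(samples)

    return max(candidate_xs, key=spread)
```

After querying the true function at the chosen x, the new (x, y) pair is added to the observed set and the loop repeats.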
We can interpret this as the mass that the model assigns to values between 325 and 326, given that the first two digits were 3 and 2.
Method used in LLMTime: Large Language Models are Zero-Shot Time Series Forecasters, Gruver et al., NeurIPS 2023
A Logit Based Predictive Distribution
Probabilistic Regression | LLMs | LLMPs |
Natively handles numerical values | Need to tokenize + detokenize numbers | Stress-tested numerical encoding schemes |
Fast inference + predictions | Slow | Still slow (but worth it!) |
Usually give closed-form predictive distributions. | Distribution over tokens, not clear how to elicit joint numerical posteriors | We show how to elicit arbitrarily complex numerical posterior predictive distributions |
Hard to incorporate side info. | Simply attach side info in text or images | Simply attach side info in text or images |
Sometimes need special handling for missing values | Automatically handles special or missing values. | Automatically handles special or missing values. |
Ignorant of context unless explicitly incorporated into model | Automatically conditions on entire history of the world! | Automatically conditions on entire history of the world! |
The good: helping to democratize AI
The bad: along with the flexibility of LLMs, LLMPs inherit their drawbacks.
Conclusions
We can learn from similar problems in-context
LLMTime vs. LLMP prompt formats
LLMTime: “1.2, 4.1, 2.4, 2.6,”
LLMP: “1,1.2\n 2, 4.1\n 3, 2.4\n 4, 2.6\n 5”
With a missing value:
LLMTime: “1.2, 4.1, NaN, 2.6,”
LLMP: “1,1.2\n 2, 4.1\n 4, 2.6\n 5”
1. Large Language Models are Zero-Shot Time Series Forecasters, Gruver et al., NeurIPS 2023
Comparison to LLMTime
Task: predict temperature for London for 86 consecutive days commencing on Dec 12, 2023
using Llama-2-7B
LLMP
LLMTime