In-Context Learning of Intuitive Physics in Transformers
Cristian Alamos1, Masha Bondarenko1, Martin Guo1, Alex Nguyen1, Max Vogel1
1 University of California, Berkeley
Abstract
Introduction
Transformers can perform new tasks when prompted with unseen test examples, without updating their parameters, a phenomenon known as in-context learning (ICL). We build on the ICL literature by investigating whether transformers can learn intuitive physics in context and predict the future state of a physical system given a sequence of observations. In particular, we train a transformer to predict the next position in a sequence of coordinates representing a bouncing ball, varying parameters such as the strength of gravity and the elasticity of the ball. We then evaluate the model's performance on in-distribution and out-of-distribution parameter combinations and compare the results to RNNs and LSTMs.
Methods
Results (continued)
Transformers can learn intuitive physics in-context better than baseline models such as RNNs and LSTMs. Their performance is relatively robust to distributional shifts, such as exposure to unseen speeds at test time, but degrades rapidly when Gaussian noise is injected into the inputs.
Example of Data
Future Work
Conclusion
Future work could evaluate vision transformers' in-context learning capabilities using sequences of images rather than sequences of coordinates. In addition, a systematic exploration of which out-of-distribution examples lead to increased or decreased relative performance could help ground our findings within a theoretical framework.
Results
Toy Problem
Goal: Predict the next location of the bouncing balls given all previous locations.
Dataset Generation
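The poster does not reproduce the generation code; below is a minimal sketch of how such bouncing-ball trajectories could be produced with per-step Euler integration. The parameter names and scales (`g`, `elasticity`, `speed`, `box`) are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np

def simulate_ball(num_steps, g=0.5, elasticity=0.9, speed=25.0,
                  box=100.0, radius=2.0, seed=0):
    # One 2D bouncing ball; returns a (num_steps, 3) array of (x, y, r) frames.
    # All parameter names and scales here are illustrative assumptions.
    rng = np.random.default_rng(seed)
    pos = rng.uniform(radius, box - radius, size=2)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    vel = 0.01 * speed * np.array([np.cos(angle), np.sin(angle)])
    frames = np.empty((num_steps, 3))
    for t in range(num_steps):
        frames[t] = (pos[0], pos[1], radius)
        vel[1] -= 0.01 * g              # gravity acts along -y
        pos = pos + vel
        for axis in range(2):           # reflect off walls, damped by elasticity
            if pos[axis] < radius:
                pos[axis] = 2 * radius - pos[axis]
                vel[axis] = -elasticity * vel[axis]
            elif pos[axis] > box - radius:
                pos[axis] = 2 * (box - radius) - pos[axis]
                vel[axis] = -elasticity * vel[axis]
    return frames
```

Varying `g`, `elasticity`, and `speed` per trajectory is what produces the in-distribution and out-of-distribution parameter combinations evaluated later.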
Model Structure
Critical Components:
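One defining component of a decoder-only model is the causal attention mask, which guarantees that the prediction for frame n only sees the n−1 preceding frames. A minimal single-head sketch in NumPy (the weight matrices here are illustrative stand-ins, not the trained model's parameters):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    # X: (L, d_model) sequence of frame embeddings.
    # The upper-triangular mask blocks attention to future frames,
    # so position t only aggregates information from frames <= t.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```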
Transformers have demonstrated a remarkable capacity for adapting to novel tasks without the need for parameter updates—a phenomenon termed in-context learning (ICL). Notably, within simple function classes such as linear regression, transformers have exhibited near Bayes-optimal in-context learning capabilities.
We extend the understanding of ICL by exploring the potential of transformers to learn intuitive physics in context. In particular, we train a decoder-only transformer to predict the locations of three balls bouncing in 2D space given all previous locations of the balls. Concretely, our inputs are sequences of coordinates that represent the vectorized positions of the balls, as well as their radii:
(x1, y1, r1, x2, y2, r2, x3, y3, r3)1, …, (x1, y1, r1, x2, y2, r2, x3, y3, r3)L-1 .
Our outputs are sequences of coordinates of the predicted locations for the balls:
(x1, y1, x2, y2, x3, y3)2, …, (x1, y1, x2, y2, x3, y3)L.
The prediction for the nth frame is made using all n-1 previous frames. We use the average Euclidean distance between predicted and true positions as our loss function.
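This loss can be sketched directly from the output layout above, assuming the (x1, y1, x2, y2, x3, y3) ordering; the function name is ours:

```python
import numpy as np

def avg_euclidean_loss(pred, true):
    # pred, true: (..., 6) arrays laid out as (x1, y1, x2, y2, x3, y3).
    # Mean Euclidean distance over the three balls and any leading
    # frame/batch dimensions.
    diff = (pred - true).reshape(pred.shape[:-1] + (3, 2))
    return float(np.linalg.norm(diff, axis=-1).mean())
```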
Training Curves
Robustness to Noise
Panels: RNN, LSTM, Transformer.
Above: Transformer models are less robust than the baseline models to Gaussian noise injected into test examples. As the injected noise increases, transformer performance degrades linearly.
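The noise-injection protocol can be sketched as follows, with a simple last-position (persistence) predictor standing in for the trained models; the sigma grid and toy straight-line trajectory are illustrative assumptions:

```python
import numpy as np

def loss_under_noise(frames, predict, sigmas, seed=0):
    # Corrupt only the test-time inputs with i.i.d. Gaussian noise and
    # score predictions against the clean ground-truth next positions.
    rng = np.random.default_rng(seed)
    losses = []
    for sigma in sigmas:
        noisy = frames + rng.normal(0.0, sigma, size=frames.shape)
        preds = np.array([predict(noisy[:t]) for t in range(1, len(frames))])
        losses.append(float(np.linalg.norm(preds - frames[1:], axis=-1).mean()))
    return losses

frames = np.cumsum(np.full((50, 2), 0.3), axis=0)   # toy straight-line motion
losses = loss_under_noise(frames, lambda hist: hist[-1], [0.0, 0.5, 1.0, 2.0])
```

With no noise the persistence baseline's error is exactly one step of motion; as sigma grows, the loss grows with it, mirroring the degradation reported above.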
Top Left: Figure shows models trained with speed = 25.
Top Right: Figure shows models trained with no gravity.
Robustness to Distributional Shifts
Right: Transformer models decrease their loss as they are exposed to more in-context examples, and they perform better than RNNs and LSTMs. The greatest decreases in loss occur after the first few frames.
Above: Average Euclidean distance loss across all models per training epoch. Note that axis scales differ across graphs.
Panels: Ground truth and predicted positions at t = 1, 2, 3, 4.
Performance on In-Distribution Parameters
Performance on Out-of-Distribution Parameters
Transformers generalize well to speeds and strengths of gravity outside of their training distribution when compared to LSTMs and RNNs.
Above: Visualization of input and output sequences. The inputs and outputs are not themselves images, but rather sequences of coordinates that have been visualized above for illustrative purposes.