
In-Context Learning of Intuitive Physics in Transformers

Cristian Alamos1, Masha Bondarenko1, Martin Guo1, Alex Nguyen1, Max Vogel1


1 University of California, Berkeley

Abstract


Transformers can perform new tasks when prompted with unseen test examples, without updating their parameters, a phenomenon known as in-context learning (ICL). We build upon the ICL literature by investigating whether transformers can learn intuitive physics in-context and predict the future state of a physical system given a sequence of observations. In particular, we train a transformer to predict the next position in a sequence of coordinates representing a bouncing ball, varying parameters such as the strength of gravity and the elasticity of the ball. We then evaluate the model's performance on in-distribution and out-of-distribution parameter combinations and compare the results to those of RNNs and LSTMs.

Methods

Results (continued)

Transformers can learn intuitive physics in-context better than baseline models such as RNNs and LSTMs. Their performance is relatively robust to distributional shifts, such as being exposed to unseen speeds at test time, but it degrades rapidly when Gaussian noise is injected into the inputs.

Example of Data

Future Work

Conclusion

Future work could evaluate vision transformers’ in-context learning capabilities by using sequences of images rather than sequences of coordinates. In addition, a systematic exploration of which out-of-distribution examples improve or degrade relative performance could help ground our findings within a theoretical framework.

Results

Toy Problem

Goal: Predict the next location of the bouncing balls given all previous locations.

Dataset Generation

  • Used Pymunk 2D physics simulation library.
  • Created 100,000 sequences of length 50 depicting the positions and radii of bouncing balls in a 100×100 grid.
  • Parameters used: initial velocity, initial position, coefficient of restitution, strength of gravity.
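The generation procedure above can be sketched with a minimal hand-rolled simulator. The actual dataset was produced with Pymunk, so the time step, bounce handling, and parameter ranges below are illustrative assumptions, not the exact generation code:

```python
import numpy as np

def simulate_sequence(n_balls=3, length=50, grid=100.0, gravity=500.0,
                      restitution=0.9, max_speed=25.0, seed=0):
    """Generate one training sequence as a (length, n_balls * 3) array of
    (x, y, r) triples per ball. Stand-in for the Pymunk simulation."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(10.0, grid - 10.0, size=(n_balls, 2))   # initial positions
    vel = rng.uniform(-max_speed, max_speed, size=(n_balls, 2))  # initial velocities
    rad = rng.uniform(2.0, 5.0, size=(n_balls, 1))            # fixed radii
    dt = 0.1
    frames = []
    for _ in range(length):
        frames.append(np.concatenate([pos, rad], axis=1).reshape(-1))
        vel[:, 1] -= gravity * dt                  # gravity pulls balls downward
        pos += vel * dt
        for axis in range(2):                      # bounce off the box walls
            out = (pos[:, axis] - rad[:, 0] < 0) | (pos[:, axis] + rad[:, 0] > grid)
            vel[out, axis] *= -restitution         # coefficient of restitution
            pos[:, axis] = np.clip(pos[:, axis], rad[:, 0], grid - rad[:, 0])
    return np.stack(frames)
```

Calling `simulate_sequence` with 100,000 different seeds (and sampled gravity/restitution values) would yield a dataset of the described shape.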

Model Structure

Critical Components:

  • Inputs: Sequences of length 50 of 9-dimensional vectors containing the position and radius of all balls.
  • Embedding Layer with output dimension 128.
  • Positional Encoding
  • Decoder with GPT2 architecture, causal masking, 2 layers and 1 attention head.
  • Final Layer with output dimension 6 containing the predicted positions of all balls.
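A minimal PyTorch sketch of the architecture listed above. Names, the learned positional encoding, and the use of `nn.TransformerEncoder` with a causal mask (in place of the actual GPT-2 decoder stack) are assumptions:

```python
import torch
import torch.nn as nn

class BallTransformer(nn.Module):
    """Sketch: 9-d inputs -> 128-d embedding + positional encoding ->
    2-layer, 1-head causal transformer -> 6-d predicted positions."""
    def __init__(self, d_in=9, d_model=128, d_out=6,
                 n_layers=2, n_heads=1, max_len=50):
        super().__init__()
        self.embed = nn.Linear(d_in, d_model)        # embedding layer, output dim 128
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)    # decoder-only via causal mask
        self.head = nn.Linear(d_model, d_out)        # final layer, output dim 6

    def forward(self, x):                            # x: (batch, seq, 9)
        seq_len = x.size(1)
        h = self.embed(x) + self.pos[:seq_len]
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(h, mask=causal)              # each frame attends only to the past
        return self.head(h)                          # (batch, seq, 6)
```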

Introduction

Transformers have demonstrated a remarkable capacity for adapting to novel tasks without the need for parameter updates—a phenomenon termed in-context learning (ICL). Notably, within simple function classes such as linear regression, transformers have exhibited near Bayes-optimal in-context learning capabilities.

We extend the understanding of ICL by exploring the potential of transformers to learn intuitive physics in context. In particular, we train a decoder-only transformer to predict the locations of three balls bouncing in 2D space given all previous locations of the balls. Concretely, our inputs are sequences of coordinates that represent the vectorized positions of the balls, as well as their radii:

(x1, y1, r1, x2, y2, r2, x3, y3, r3)1, …, (x1, y1, r1, x2, y2, r2, x3, y3, r3)L-1 .

Our outputs are sequences of coordinates of the predicted locations of the balls:

(x1, y1, x2, y2, x3, y3)2, …, (x1, y1, x2, y2, x3, y3)L.

The prediction for the nth frame is made using all n-1 previous frames. We use the average Euclidean distance between predicted and true positions as our loss function.
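The loss described above can be written as a small helper (`euclidean_loss` is a hypothetical name; per-ball distances are averaged over balls and frames):

```python
import numpy as np

def euclidean_loss(pred, true, n_balls=3):
    """Average Euclidean distance between predicted and true positions.
    pred, true: arrays of shape (..., n_balls * 2) holding (x, y) pairs."""
    diff = (pred - true).reshape(*pred.shape[:-1], n_balls, 2)
    return np.linalg.norm(diff, axis=-1).mean()
```

For example, a prediction that is off by (3, 4) for every ball gives a loss of 5.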

Training Curves

Robustness to Noise

Panels (left to right): RNN, LSTM, Transformer.

Above: Transformer models are less robust to the injection of Gaussian noise into test examples than the baseline models. As the injected noise increases, the transformer's performance degrades linearly.
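The noise-injection protocol can be sketched as follows; the choice of noise levels and the decision to leave the radius entries unperturbed are assumptions not stated on the poster:

```python
import numpy as np

def add_gaussian_noise(seq, sigma, rng=None):
    """Inject i.i.d. Gaussian noise into the (x, y) coordinates of a test
    sequence. seq: (frames, 9) array of (x, y, r) triples for three balls;
    radius columns are left untouched (an assumption)."""
    rng = rng or np.random.default_rng(0)
    noisy = seq.copy()
    coord_cols = [i for i in range(seq.shape[-1]) if i % 3 != 2]  # skip r columns
    noise = rng.normal(0.0, sigma, size=noisy[..., coord_cols].shape)
    noisy[..., coord_cols] += noise
    return noisy
```

Sweeping `sigma` over a range of values and re-evaluating each model on the perturbed sequences produces the robustness curves described above.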

Top Left: Models trained with speed = 25.

Top Right: Models trained with no gravity.

Robustness to Distributional Shifts

Right: Transformer models decrease their loss as they are exposed to more in-context examples and perform better than RNNs and LSTMs. The greatest decreases in loss are seen after the first few frames.

Above: Average Euclidean distance loss across all models per training epoch. Note that axes scales differ across graphs.

Figure: ground-truth vs. predicted ball positions at t = 1, 2, 3, 4.

Performance on In-Distribution Parameters

Performance on Out-of-Distribution Parameters

Transformers generalize well to speeds and strengths of gravity outside of their training distribution when compared to LSTMs and RNNs.

Above: Visualization of input and output sequences. The inputs and outputs are not themselves images, but sequences of coordinates that have been visualized above for illustrative purposes.