1 of 18

Foundation Models for Robot Learning

Soroush Nasiriany

UT Robot Learning Reading Group

2022.04.28


What is a foundation model?

  • A “foundation” on top of which models for a wide range of downstream tasks can be trained
  • Trained on large, diverse data
  • Uses a (typically) self-supervised, task-agnostic pre-training objective


Figure from “On the Opportunities and Risks of Foundation Models”, Bommasani et al. 2021


Why do we want foundation models?

  • Obtaining training data for downstream tasks is expensive
  • Can foundation models improve data efficiency and robustness on downstream tasks? Yes! The successes in language and vision suggest so


The biggest challenge ahead…

Applying foundation models to niche, safety-critical tasks


  • Obtaining useful training data
  • Adaptation to downstream tasks
  • Safety, interpretability, and privacy


This tutorial: foundation models for robot learning

  • Vision: image representations for downstream tasks
  • Language: high-level reasoning


Pre-trained vision models for robot learning

  • Visuomotor policies are data hungry and brittle!
  • How to address this problem? Use a pre-trained vision model
  • Pre-trained ImageNet features have historically not been helpful, but this is changing with larger datasets and self-supervised learning
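The idea can be sketched as follows: a frozen pre-trained encoder supplies a fixed representation, and only a small policy head is trained on it. This is a minimal illustration, not the method of any particular paper; `frozen_encoder` and `PolicyHead` are hypothetical names, and the encoder here is a random stand-in for a real backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(image):
    """Stand-in for a pre-trained vision backbone. Its weights are fixed:
    no gradient ever flows into this function during policy training."""
    W = np.linspace(-1.0, 1.0, image.size * 32).reshape(image.size, 32)
    return np.tanh(image.reshape(-1) @ W)

class PolicyHead:
    """Small trainable head mapping frozen features to actions."""
    def __init__(self, feat_dim, act_dim):
        self.W = rng.normal(scale=0.1, size=(feat_dim, act_dim))

    def __call__(self, feat):
        return feat @ self.W

    def train_step(self, feat, target_action, lr=0.01):
        # One gradient step on a mean-squared behavior-cloning loss;
        # only the head's weights are updated, never the encoder's.
        pred = self(feat)
        self.W -= lr * np.outer(feat, pred - target_action) * 2 / target_action.size
        return float(np.mean((pred - target_action) ** 2))

image = rng.normal(size=(8, 8))    # a fake camera observation
feat = frozen_encoder(image)       # fixed representation of the observation
head = PolicyHead(feat.shape[0], act_dim=4)
losses = [head.train_step(feat, np.ones(4)) for _ in range(50)]
```

Because the encoder is frozen, the policy has far fewer parameters to fit, which is the hoped-for source of data efficiency.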


R3M: A Universal Visual Representation for Robot Manipulation


What training objectives are used?

  1. Time-contrastive learning: push embeddings of temporally close frames together, and push other frames apart
  2. Video-language alignment: learn to match frames with the correct language description via a contrastive loss
  3. Regularization: L1 and L2 penalties on the learned representation
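The time-contrastive objective (item 1) can be sketched as an InfoNCE-style loss: the temporally close frame should score higher similarity with the anchor than distant frames. A minimal sketch, not the exact R3M implementation; the cosine similarity and temperature here are illustrative choices.

```python
import numpy as np

def time_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the temporally close frame (positive) should be
    more similar to the anchor than temporally distant frames (negatives)."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([cosine(anchor, positive)]
                      + [cosine(anchor, n) for n in negatives]) / temperature
    # cross-entropy with the positive frame as the correct "class"
    return -logits[0] + np.log(np.sum(np.exp(logits)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)                   # embedding of frame t
near = anchor + 0.05 * rng.normal(size=16)     # frame t+1: almost identical
far = [rng.normal(size=16) for _ in range(8)]  # distant / other-video frames

loss_aligned = time_contrastive_loss(anchor, near, far)
loss_shuffled = time_contrastive_loss(anchor, far[0], [near] + far[1:])
```

Minimizing this loss over many (anchor, positive, negatives) triples shapes the representation so that nearby frames cluster, which is what "push temporally close frames together" means in practice.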


Experiments


Leveraging language models for robot learning

  • Robot agents need high-level reasoning to solve complex, long-horizon tasks. Can language models help?
  • Challenge: grounding the language model in the downstream robot learning task


Prior work: prompt engineering


Ungrounded: it does not take the current observation from the environment into account!


Do As I Can, Not As I Say: Grounding Language in Robotic Affordances


Ground instructions from the language model with a learned value function

How feasible is this instruction, given the current environment context?


Where does the value function come from?

  • Train a multi-task policy and value function, conditioned on language
  • Trained on a diverse set of language instructions, with a sparse reward at the end of each episode, manually annotated by humans
  • These models can perform short-horizon skills; the value function and language model are combined to plan complex tasks
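The combination can be sketched as follows: the language model scores how useful each candidate skill is for the instruction, the value function scores how feasible it is in the current state, and the product of the two selects the next skill. A SayCan-style sketch; the numbers and skill names below are made up for illustration.

```python
def select_skill(instruction_probs, affordance_values):
    """Score each candidate skill by (LM usefulness) x (value-function
    feasibility) and return the highest-scoring one."""
    combined = {skill: instruction_probs[skill] * affordance_values[skill]
                for skill in instruction_probs}
    return max(combined, key=combined.get), combined

# Hypothetical LM scores for the instruction "bring me a sponge":
lm_probs = {"pick up sponge": 0.6, "pick up apple": 0.1, "go to table": 0.3}
# Hypothetical value-function scores: feasibility in the current state
# (say the sponge is out of reach, but the table is navigable).
values = {"pick up sponge": 0.1, "pick up apple": 0.2, "go to table": 0.9}

best, scores = select_skill(lm_probs, values)
```

Multiplying the two scores means a skill is chosen only when it is both useful and currently feasible: an infeasible skill is suppressed no matter how relevant the language model thinks it is.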


Experiment setup


Evaluation


Case study


Ablations


Discussion
