1 of 18

Foundation Models for Robot Learning

Soroush Nasiriany

UT Robot Learning Reading Group

2022.04.28


What is a foundation model?

  • A “foundation” on top of which models for a wide range of downstream tasks can be trained
  • Trained on large, diverse data
  • Uses a (typically) self-supervised, task-agnostic pre-training objective


Figure from “On the Opportunities and Risks of Foundation Models”, Bommasani et al. 2021


Why do we want foundation models?

  • Obtaining training data for downstream tasks is expensive
  • Can foundation models improve data efficiency and robustness on downstream tasks? Yes! The successes in language and vision suggest so


The biggest challenge ahead…

Applying foundation models to niche, safety-critical tasks


  • Obtaining useful training data
  • Adaptation to downstream tasks
  • Safety, interpretability, and privacy


This tutorial: foundation models for robot learning

  • Vision: image representations for downstream tasks
  • Language: high-level reasoning


Pre-trained vision models for robot learning

  • Visuomotor policies are data hungry and brittle!
  • How to address this problem? Use a pre-trained vision model
  • Pre-trained ImageNet features have historically not been helpful, but this is changing with larger datasets and self-supervised learning
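The idea can be sketched as follows: a frozen pre-trained encoder supplies a fixed representation, and only a small policy head is trained on it. This is a minimal illustration, not the method of any particular paper; `frozen_encoder` and `PolicyHead` are hypothetical names, and the encoder here is a random stand-in for a real backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(image):
    """Stand-in for a pre-trained vision backbone. Its weights are fixed:
    no gradient ever flows into this function during policy training."""
    W = np.linspace(-1.0, 1.0, image.size * 32).reshape(image.size, 32)
    return np.tanh(image.reshape(-1) @ W)

class PolicyHead:
    """Small trainable head mapping frozen features to actions."""
    def __init__(self, feat_dim, act_dim):
        self.W = rng.normal(scale=0.1, size=(feat_dim, act_dim))

    def __call__(self, feat):
        return feat @ self.W

    def train_step(self, feat, target_action, lr=0.01):
        # One gradient step on a mean-squared behavior-cloning loss;
        # only the head's weights are updated, never the encoder's.
        pred = self(feat)
        self.W -= lr * np.outer(feat, pred - target_action) * 2 / target_action.size
        return float(np.mean((pred - target_action) ** 2))

image = rng.normal(size=(8, 8))    # a fake camera observation
feat = frozen_encoder(image)       # fixed representation of the observation
head = PolicyHead(feat.shape[0], act_dim=4)
losses = [head.train_step(feat, np.ones(4)) for _ in range(50)]
```

Because the encoder is frozen, the policy has far fewer parameters to fit, which is the hoped-for source of data efficiency.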


R3M: A Universal Visual Representation for Robot Manipulation


What training objectives are used?

  1. Time-contrastive learning: push embeddings of temporally close frames together, and push other frames apart
  2. Video-language alignment: learn to match frames with the correct language description via a contrastive loss
  3. Regularization: L1 and L2 penalties on the learned representation
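The time-contrastive objective (item 1) can be sketched as an InfoNCE-style loss: the temporally close frame should score higher similarity with the anchor than distant frames. A minimal sketch, not the exact R3M implementation; the cosine similarity and temperature here are illustrative choices.

```python
import numpy as np

def time_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the temporally close frame (positive) should be
    more similar to the anchor than temporally distant frames (negatives)."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([cosine(anchor, positive)]
                      + [cosine(anchor, n) for n in negatives]) / temperature
    # cross-entropy with the positive frame as the correct "class"
    return -logits[0] + np.log(np.sum(np.exp(logits)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)                   # embedding of frame t
near = anchor + 0.05 * rng.normal(size=16)     # frame t+1: almost identical
far = [rng.normal(size=16) for _ in range(8)]  # distant / other-video frames

loss_aligned = time_contrastive_loss(anchor, near, far)
loss_shuffled = time_contrastive_loss(anchor, far[0], [near] + far[1:])
```

Minimizing this loss over many (anchor, positive, negatives) triples shapes the representation so that nearby frames cluster, which is what "push temporally close frames together" means in practice.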


Experiments


Leveraging language models for robot learning

  • Robot agents need high-level reasoning to solve complex, long-horizon tasks. Can language models help?
  • Challenge: grounding the language model in the downstream robot learning task


Prior work: prompt engineering


Ungrounded: it does not take the current observation from the environment into account!


Do As I Can, Not As I Say: Grounding Language in Robotic Affordances


Ground instructions from the language model with a learned value function

How feasible is this instruction, given the current environment context?


Where does the value function come from?

  • Train a multi-task policy and value function, conditioned on language
  • Trained on a diverse set of language instructions, with a sparse reward at the end of each episode, manually annotated by humans
  • These models can perform short-horizon skills; the value function and language model are combined to plan complex tasks
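The combination can be sketched as follows: the language model scores how useful each candidate skill is for the instruction, the value function scores how feasible it is in the current state, and the product of the two selects the next skill. A SayCan-style sketch; the numbers and skill names below are made up for illustration.

```python
def select_skill(instruction_probs, affordance_values):
    """Score each candidate skill by (LM usefulness) x (value-function
    feasibility) and return the highest-scoring one."""
    combined = {skill: instruction_probs[skill] * affordance_values[skill]
                for skill in instruction_probs}
    return max(combined, key=combined.get), combined

# Hypothetical LM scores for the instruction "bring me a sponge":
lm_probs = {"pick up sponge": 0.6, "pick up apple": 0.1, "go to table": 0.3}
# Hypothetical value-function scores: feasibility in the current state
# (say the sponge is out of reach, but the table is navigable).
values = {"pick up sponge": 0.1, "pick up apple": 0.2, "go to table": 0.9}

best, scores = select_skill(lm_probs, values)
```

Multiplying the two scores means a skill is chosen only when it is both useful and currently feasible: an infeasible skill is suppressed no matter how relevant the language model thinks it is.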


Experiment setup


Evaluation


Case study


Ablations


Discussion
