
Scaling Generalist Robots

Ted Xiao


Is foundation modeling the right framing for robotics?

  • Foundation models enable emergent capabilities and homogenization
    • emergent capabilities: more complex behaviors appearing that are not present in smaller models
    • homogenization: generalization to combinatorially many downstream use cases
    • More is Different for AI, Emergence in LLMs
  • Betting on “emergent capabilities” might be required for robotics to generalize and be useful

[Figure: nested deployment scopes, from 1 bin to 1 room to 1 building to the world]

[Figure: foundation-model domains: language, images, audio, coding, music… and robotics]

Agenda

01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?

Lessons from Foundation Modeling: Data Scaling

  • Data scaling is a key ingredient in LLMs and VLMs (Source: Kaplan et al. 2020)
  • …but the internet already exists. No equivalent for robot data yet!
  • The real world where we (1) wish to deploy is also where we (2) need to create robotics training data

Real data is indeed rare and expensive, but maybe it’s not intractable?

Can we just use $$$ to train on our test distribution?

Real-world Robot Data Engines: $$$ to Data

  • 2016 - 2019: Online RL with arm farms
  • 2019 - 2020: Online RL with multitask arm farms
  • 2020 - 2022: Offline dataset creation via teleoperation + IL + RL (O(100k) demos, 13 robots, 17 months)

Data Scaling to Capabilities Research

  • SayCan: LLM as a planner, Q-function as an affordance model, grounded planning
  • RT-1: scalable Transformer robot policy, many more tasks, compatible with SayCan
  • PaLM-E: Vision-Language Model (VLM) trained on web and embodied data, better planning than LLM-only
  • RT-2: unified web-scale VLM as robot policy, generalization to new tasks and situations, chain-of-thought reasoning possible

Robot Data to Policies: RT-1

Robotics Transformer

  • Tokenized inputs and outputs
  • Decoder-only transformer, sparse categorical cross-entropy objective
  • TokenLearner for compression / faster inference
  • Image tokenizer: pre-trained FiLM EfficientNet backbone
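As a rough illustration of that objective, here is a minimal sketch of sparse categorical cross-entropy over discretized action tokens; the plain-NumPy setting and shapes are illustrative assumptions, not the RT-1 implementation.

    import numpy as np

    def action_token_loss(logits, target_tokens):
        # logits: (action_dims, 256) scores over bins; target_tokens: (action_dims,)
        logits = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        # Mean negative log-likelihood of the ground-truth bin per action dimension.
        return -log_probs[np.arange(len(target_tokens)), target_tokens].mean()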

Vision-Language Models

  • VLMs encompass both visual and semantic understanding of the world
  • Robotics has to deal with both of these constantly
  • How do we leverage all of this knowledge?

[1] PaLI: A Jointly-Scaled Multilingual Language-Image Model. Chen et al. 2022.

VLMs as Robot Policies

  • RT-1: image + text → discretized actions
  • Use large pre-trained VLMs directly as the policy!
  • How do we deal with actions when using pre-trained VLMs?
  • Similar to a Vision-Language Model (VLM), but with different output tokens

[Figure: RT-1 architecture [2]: camera images pass through a FiLM EfficientNet, the language instruction (e.g. “Pick sponge…”) through a Universal Sentence Encoder, then a Transformer with positional encoding and self-attention outputs the action. Shown alongside the PaLI architecture [1].]

[1] PaLI: A Jointly-Scaled Multilingual Language-Image Model. Chen et al. 2022.
[2] RT-1: Robotics Transformer for Real-World Control at Scale. Robotics at Google and Everyday Robots, 2022.

Representing Actions in VLMs

  • Robot actions:
    • Moving the robot arm and gripper
    • Discretized into 256 bins
  • Actions in VLMs:
    • Convert to a string of numbers (see the sketch after this list)
    • Example: “1 127 115 218 101 56 90 255”
    • Alternatives:
      • Float numbers: more tokens needed
      • Extra IDs: the least-used language tokens
      • Human language (“left”, “right”, etc.): can’t be directly executed on a robot

→ Vision-Language-Action (VLA) model!
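A minimal sketch of this discretize-then-stringify step; the per-dimension bounds and the rounding scheme are illustrative assumptions:

    import numpy as np

    def action_to_token_string(action, low, high, n_bins=256):
        # Clip each action dimension to its range, discretize into 256 bins,
        # and render the bin indices as a space-separated token string,
        # e.g. an 8-dim action becomes "1 127 115 218 101 56 90 255".
        action = np.clip(action, low, high)
        bins = np.round((action - low) / (high - low) * (n_bins - 1)).astype(int)
        return " ".join(str(b) for b in bins)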

Training data and underlying models

Models
  • PaLI-X (5B, 55B)
  • PaLM-E (12B)

Data
  • Pretraining: web data
  • Robot data:
    • RT-1 data
    • 13 robots
    • 17 months
    • 130k demos

Inference

Closed-loop robot control (1-3 Hz)
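A minimal sketch of what fixed-rate closed-loop control looks like; the `policy` and `robot` interfaces are hypothetical stand-ins, not a real API:

    import time

    def run_closed_loop(policy, robot, instruction, hz=3.0):
        # Re-query the model with the latest camera image every control step.
        period = 1.0 / hz
        while not robot.task_done():
            t0 = time.time()
            action = policy(robot.get_image(), instruction)
            robot.apply_action(action)
            # Sleep out the remainder of the control period to hold the rate.
            time.sleep(max(0.0, period - (time.time() - t0)))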

Results: Emergent skills

Results: Quantitative evals

RT-2 w/ PaLI-X-55B ablations (see the mixing sketch below):
  • Co-fine-tuning with VQA data
  • Fine-tuning on robot data only
  • Training on robot data from scratch
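To make the first ablation concrete, here is a minimal sketch of a co-fine-tuning batch mixer that interleaves web VQA examples with robot-action examples; the 50/50 ratio and the list-of-examples containers are illustrative assumptions:

    import random

    def cofinetune_batch(web_vqa, robot_episodes, batch_size=32, robot_frac=0.5):
        # Mixed batch: web data preserves VLM skills, robot data adds control.
        # Assumes both pools are at least batch_size examples large.
        n_robot = int(batch_size * robot_frac)
        batch = random.sample(robot_episodes, n_robot)
        batch += random.sample(web_vqa, batch_size - n_robot)
        random.shuffle(batch)
        return batch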

Results: Chain-of-Thought with RT-2-PaLM-E

The Open X-Embodiment Dataset

Many Robots, Many Skills, Many Scenes, Many Objects

Current data assumptions (for now)

  • Single arm
  • 2-finger, mostly parallel-yaw grippers
  • Subset of datasets with a single arm
  • Still interesting diversity!

Model architectures

  • Images and a text instruction as input
  • Discretized actions as output

Evaluation methodologies

RT-1-X: training in lab A → model checkpoints → send checkpoints over the internet → evaluation in labs B, C, D (no standardization in control infrastructure)

RT-2-X: query actions over the internet

Summary: Signs of Positive Transfer!

RT-1 vs RT-1-X
  • Does training on X-Embodiment datasets improve in-distribution performance?
  • Yes! 50% improvement

Original Methods vs RT-1-X
  • Do generalist models outperform specialist models?
  • Yes!

Large-scale data domains

  • RT-1-X underfits on large datasets
  • RT-2-X recovers performance

RT-2 generalization evals

  • RT-2 and RT-2-X perform roughly on par
  • Not unexpected, since RT-2 already generalizes well along these dimensions thanks to its VLM backbone
  • Possibly more robust to distractors, on top of VLM pre-training?

Emergent skills evaluations

  • Object-relative position understanding: the preposition modulates low-level motion
  • Absolute position understanding

Is web-scale data enough?

RT-2-X outperforms RT-2 by 3x in emergent skill evaluations (e.g. “put apple on cloth” vs. “move apple near cloth”)

Is OXE only good because of the Bridge dataset?

  • red vs orange: removing the Bridge dataset leads to a large drop in success rate
  • blue vs orange: but still almost 2x the performance; the other datasets also help

Data Scaling: Data Quantity Recap

[RT-1] Real-world robot demonstration dataset → better in-distribution performance
[RT-2] Co-train on robot data alongside internet data → generalization to internet semantics
[RT-X] Add robot data from different embodiments → generalization to spatial concepts

Increasing data interoperability by treating robot actions as just another data modality.

…but how do we encourage robotics-specific generalization and robustness?

Agenda

01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?

What does generalization in visual manipulation mean?

  • New tasks, distractors, backgrounds
  • New objects
  • New workspaces

Many notions of generalization, often overlapping and defined at different levels of granularity.

What makes it hard to generalize to new environments?

Factors of variation: camera pose, table texture, floor texture, lighting, object texture, object position, table position, distractors.

Evaluation Tasks

Factor World (sim): pick place, bin picking, door opening
Real robot: blue chip bag, blue plastic bottle, jalapeno chip bag, Oreo, Pepsi can, water bottle

Evaluation Setup

Factor World: 100 new values for each factor
Real Robot: lighting (x2), camera pose (x3), distractors (x3), background (x3), table (x3)

Evaluation metrics: success rate, generalization gap (train minus test success rate)
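As a concrete reading of these metrics, a minimal aggregation sketch; the episode record format is an illustrative assumption:

    from collections import defaultdict

    def success_and_gap(episodes):
        # episodes: dicts like {"factor": "camera_pose", "split": "train", "success": True}
        counts = defaultdict(lambda: {"train": [0, 0], "test": [0, 0]})
        for ep in episodes:
            c = counts[ep["factor"]][ep["split"]]
            c[0] += int(ep["success"])
            c[1] += 1
        report = {}
        for factor, c in counts.items():
            rate = {s: (n / d if d else 0.0) for s, (n, d) in c.items()}
            # Generalization gap: train success rate minus test success rate.
            report[factor] = {**rate, "gap": rate["train"] - rate["test"]}
        return report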

Impact of Individual Factors

“Easier” factors: background, lighting, distractors
“Harder” factors: table position, table texture, camera position, object texture

Effect of Data Augmentation

Surprisingly, crop augmentations improve generalization to new table textures.

Main Results

  • A (roughly) consistent ordering of factors across different tasks, datasets, and between real and simulation.
  • Most factors, when combined, do not have a compounding effect on generalization performance.

Question: Can we use this insight about compositional generalization during data collection?

Focusing on Compositional Generalization for Robotics

Factors of Variation: Visual and Physical Factors

Comparing Data Collection Strategies

Compositional data collection is important for the realistic setting (all factors of variation)!
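To make the contrast concrete, a minimal sketch of exhaustive vs. compositional scene coverage; the factor names and values are hypothetical:

    import itertools

    factors = {
        "table_texture": ["wood", "marble", "cloth"],
        "camera_pose":   ["front", "left", "high"],
        "lighting":      ["warm", "cold", "dim"],
    }

    # Exhaustive collection: every combination of factor values -> 27 setups.
    exhaustive = list(itertools.product(*factors.values()))

    # Compositional collection: a base scene plus one-factor-at-a-time
    # variations -> 1 + 3*2 = 7 setups, betting that factors compose at test time.
    base = {k: vs[0] for k, vs in factors.items()}
    compositional = [base] + [dict(base, **{k: v})
                              for k, vs in factors.items() for v in vs[1:]]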


Compositional Data Collection w/ Prior Data

Compositional data strategies still important in the presence of offline datasets!

Agenda

01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?

No matter how much robot data scaling you do… is it enough?

  • Can you use synthetic data to augment and enrich offline real datasets?
  • This has been a core component of foundation modeling, in LLMs and VLMs!

[Figure: synthetic data pipelines in LLMs]

Synthetic Data in Robotics: Offline Dataset → Relabeling Logic (change reward, change states, change task, change actions) → Augmented Dataset

RT-1 + Human Teleoperation is tractable but expensive

  • The secret ingredient for RT-1 was “130,000 episodes, 13 robots, over 17 months, 700 tasks”
  • Crowdsourced language annotations: “Pick up a coke can from the table”
  • Structured teleoperator commands: “pick coke can”

DIAL: Data-driven Instruction Augmentation for Language-conditioned control

Step 1: Fine-tune a pretrained VLM on a small dataset (A) of crowd-sourced instructions.
Step 2: Use the fine-tuned VLM to relabel the instructions in a larger dataset (B), producing dataset C.
Step 3: Train a language-conditioned policy on the original and relabeled datasets (A + B + C).
Step 4: Run the trained policy on unseen instructions, e.g. “Move the right apple to the left of the counter.”
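A minimal sketch of steps 2-3 as a relabeling loop; the episode format and the `relabel_fn` callable (standing in for the fine-tuned VLM) are illustrative assumptions:

    def dial_relabel(dataset_a, dataset_b, relabel_fn):
        # Step 2: the fine-tuned VLM proposes a hindsight instruction per episode.
        dataset_c = [dict(ep, instruction=relabel_fn(ep["frames"]))
                     for ep in dataset_b]
        # Step 3: the policy co-trains on original + relabeled data.
        return dataset_a + dataset_b + dataset_c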

  • Teleoperator instruction: “move blue plastic bottle near brown chip bag” → DIAL: “move plastic bottle near chip bag at the center left of the table”
  • Teleoperator instruction: “pick green rice chips from white bowl” → DIAL: “lift up the green chip bag from the bowl and drop it at the bottom left corner of the table”
  • Teleoperator instruction: “open top drawer” → DIAL: “hold and pull out the top cupboard to open it”

[Figure: structured teleoperator command categories (move, pick, pick from, open, place, close, upright, knock) vs. DIAL-predicted instruction phrases (‘near’, ‘next to’, ‘left’, ‘right’, ‘from’, ‘side’, ‘center’, ‘below’, other, no DIAL prediction)]

Evaluations on Novel Instructions

Tradeoff: more data, but less accurate labels. Need enough data quantity with enough label accuracy.

Semantic Data Augmentation

Offline Dataset → Relabeling Logic (change reward, change states, change task, change actions) → Augmented Dataset

Can we leverage advances in image generation for robot data augmentation?

ROSIE: Robot Learning with Semantically Imagined Experience
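The slides here are figure-only; as a rough sketch of the ROSIE idea (text-guided inpainting that edits scene pixels while keeping the recorded actions), where `inpaint_fn`, the mask, and the episode format are all illustrative assumptions:

    def rosie_augment(episode, mask, edit_prompt, inpaint_fn):
        # Edit every frame with a text-guided inpainting model; the recorded
        # actions still solve the edited scene, since only pixels change.
        new_frames = [inpaint_fn(frame, mask, edit_prompt)
                      for frame in episode["frames"]]
        return dict(episode, frames=new_frames)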

Overall performance

Recap: Data Scaling

  • Data Scaling: Data Quantity
    • RT-1: Robot Data
    • RT-2: Add Web Data
    • RT-X: Add Cross-Embodiment Data
  • Data Scaling: Data Quality and Diversity
    • GenGap: What makes generalization hard in robotics?
    • DataComp: Can we consider compositional generalization during data collection?
  • Data Scaling: Synthetic Data
    • DIAL: Can we use VLMs for hindsight instruction relabeling?
    • ROSIE: Can we use image generation models for semantic visual data augmentation?

Agenda

01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?

Strengths and Limitations of Language

High-level language knowledge vs. low-level robotics knowledge

Beyond Language

  • Language Representations: text instructions
  • Object-Centric Representations: segmentation masks
  • Code-Centric Representations: code as policies and rewards
  • Motion-Centric Representations: EEF trajectories
  • Language Hierarchies: granular language motions

Object-centric Representations: Manipulation of Open-World Objects (MOO)

Key Idea: Use single-pixel detection centroids from an open-vocabulary VLM for task conditioning
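A minimal sketch of this conditioning trick; the detector output format and the marker color are illustrative assumptions:

    import numpy as np

    def mark_target_pixel(image, bbox):
        # Collapse an open-vocabulary detection box for the target object
        # into a single-pixel marker drawn into the policy's input image.
        x0, y0, x1, y1 = bbox
        cy, cx = int((y0 + y1) / 2), int((x0 + x1) / 2)
        marked = image.copy()
        marked[cy, cx] = (255, 0, 255)  # hypothetical marker color
        return marked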

RT-1 data: 17 RT-1 objects (all skills)
Newly added diverse Pick Task data: 90 diverse objects (“pick” skill only)
Novel Object Evaluations: 47 novel evaluation objects (unseen during training)

Results: Generalization to Novel Open-World Objects

[Figure: success rates for MOO (our data), MOO (original RT-1 data), RT-1 (our data), and VIMA-like (our data), split into Pick Skill and Non-pick Skills across Seen Objects (49), Unseen Objects from Seen Categories (22), and Unseen Categories (25); plus qualitative examples of open-world objects, challenging textures, and new environments.]

Additional Input Modalities

  • Human pointing
  • Clicking on screen
  • Target image

Code-centric Representations: Language to Reward

Code as Policies: User (“Stand up on two feet.”) → Code as Policies (LLM) → low-level actions

Language to Reward: User (“Stand up on two feet.”) → Reward Translator (LLM) → reward code → low-level controller → low-level actions

Example reward code:

    set_torso_rewards(height=0.6, pitch=0.5*np.pi)
    set_feet_pos_rewards('back_left', height=0.0)
    set_feet_pos_rewards('back_right', height=0.0)
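A minimal sketch of the Language-to-Reward loop around such reward code; the `llm` and `controller` interfaces are hypothetical stand-ins, not the paper's API:

    def language_to_reward(llm, controller, user_request):
        # The Reward Translator LLM writes reward-setting code for the request;
        # a low-level controller (e.g. MPC) then optimizes that reward online.
        reward_code = llm.generate(
            "Translate into reward-setting calls: " + user_request)
        reward_fn = controller.compile_rewards(reward_code)
        return controller.optimize(reward_fn)  # yields low-level actions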

Results

  • Simulation and real-robot results with iterative reward code generation loops
    • “Stand up on two legs”
    • “Now walk backwards slowly”

Robot-centric Representations: RT-Trajectory

Results: Quantitative Evaluations

Results: Diverse Input Modalities

  • Human videos
  • Code as Policies
  • Foundation models


Results: Emergent Capabilities

Ego-centric trajectory representations enable broad generalization:

  • Novel motions (new heights, new shapes, new curvatures)
  • Visual distribution shifts (new furniture, new rooms, new objects, new lighting)
  • Behavior modulation within skills (specify exactly how to accomplish the task)
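As a rough sketch of what a trajectory-conditioned input could look like (the rasterization, color scheme, and normalized coordinates are illustrative assumptions, not RT-Trajectory's exact encoding):

    import numpy as np

    def trajectory_conditioning_image(points, size=(256, 256)):
        # Rasterize a 2D end-effector path into an RGB image; the red channel
        # encodes time order so the policy can read the motion's direction.
        h, w = size
        canvas = np.zeros((h, w, 3), dtype=np.uint8)
        n = max(len(points) - 1, 1)
        for t, (x, y) in enumerate(points):  # x, y normalized to [0, 1]
            u, v = int(x * (w - 1)), int(y * (h - 1))
            canvas[v, u] = (255 * t // n, 0, 255)
        return canvas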


Is language enough, if it’s hierarchical and granular?

RT-Hierarchy

  • Idea: predict granular language motions before predicting low-level robot actions
    • “move arm forward”, “rotate arm clockwise”, “close gripper”
  • Can be viewed as chain-of-thought / planning for language-based skills
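A minimal sketch of the two-stage decoding this implies; `decode_motion` and `decode_action` are hypothetical stand-ins for two queries to a single VLA:

    def rt_h_step(model, image, instruction):
        # Stage 1: decode a granular language motion (e.g. "move arm forward").
        motion = model.decode_motion(image, instruction)
        # Stage 2: decode the low-level action conditioned on that motion.
        action = model.decode_action(image, instruction, motion)
        return motion, action

The intermediate motion is also the natural hook for the language interventions discussed two slides below.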


Results: RT-H Outperforms RT-2

No other policy class (RT-1, RT-2) was able to learn from challenging new data

Results: Language Interventions

The RT-H bottleneck was often language-motion prediction rather than low-level action prediction; language motions are much easier to collect interventions for!

Recap: Beyond Language

  • Language Representations (text instructions): RT-1
  • Object-Centric Representations (segmentation masks): MOO
  • Code-Centric Representations (code as policies and rewards): Language to Reward
  • Motion-Centric Representations (EEF trajectories): RT-Trajectory
  • Language Hierarchies (granular language motions): RT-Hierarchy

Agenda

01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?


Moonshot #1: AGI Already Exists?

  • Non-robotics foundation models keep surprising us with what they can do


Moonshot #2: AI has an Evaluation Problem

  • All roads lead to generalist models, but generalist models that can “do anything” need to be evaluated on “everything”!
  • How do you scalably evaluate a broad set of capabilities?

LLMs target the human data distribution → evaluate on humans directly.
Robots target the physical data distribution → evaluate on ???

Real-to-Sim Evaluation for Real-world Robot Policies

Key Insight: A simulation "good enough" for useful evaluation signal may be much easier to build than a full digital clone for training


Thank you!

xiaoted@gmail.com


Goal-centric Representations: RT-Sketch

Key Idea: Use coarse goal conditioning that focuses goals on what actually matters in the scene


Training: Generate Goal Sketches in Hindsight
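A minimal sketch of hindsight goal generation, assuming an image-to-sketch model `to_sketch` and a simple episode format (both illustrative):

    def sketch_goal_examples(episode, to_sketch):
        # Hindsight trick: the episode's final frame is the achieved goal;
        # convert it into a coarse sketch and condition every step on it.
        goal = to_sketch(episode["frames"][-1])
        return [{"obs": f, "goal_sketch": goal, "action": a}
                for f, a in zip(episode["frames"], episode["actions"])]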


Results

Robustness to Goal Detail

Spatial Reasoning

Distractor Robustness