Scaling Generalist Robots
Ted Xiao
Proprietary + Confidential
Is foundation modeling the right framing for robotics?
the world
1 building
1 room
1 bin
language
images
audio
robotics
coding
music
Agenda
01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?
Lessons from Foundation Modeling: Data Scaling
Source: Kaplan et al. 2020
Real data is indeed rare and expensive, but maybe it’s not intractable?
Can we just use $$$ to train on our test distribution?
Real-world Robot Data Engines: $$$ to Data
2016 - 2019: Online RL: Arm Farms
2019 - 2020: Online RL: Multitask Arm Farms
2020 - 2022: Offline Dataset Creation: Teleoperation + IL + RL
O(100k) demos, 13 robots, 17 months
SayCan
LLM as a planner
Q-function as an affordance model
Grounded planning
RT-1
Scalable Transformer robot policy
Many more tasks
Compatible with SayCan
PaLM-E
Vision Language Model (VLM)
Trained on Web and embodied data
Better planning than LLM-only
RT-2
Unified web-scale VLM as robot policy
Generalization to new tasks and situations
Chain-of-thought reasoning possible
Data Scaling to Capabilities Research
Robot Data to Policies: RT-1
Robotics Transformer
Vision-Language Models
[1] PaLI: A Jointly-Scaled Multilingual Language-Image Model. Chen et al. 2022.
Google DeepMind
VLMs as Robot Policies
FiLM EfficientNet
+
Transformer
Positional encoding
Universal Sentence Encoder
Self-Attention
Camera images
Language instruction
Pick sponge…
Action
RT-1 architecture [2]
[2] RT-1: Robotics Transformer for Real-World Control at Scale, Robotics at Google and Everyday Robots, 2022.
PaLI architecture [1]
→ Vision-Language-Action (VLA) model!
Representing Actions in VLMs
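The action representation can be made concrete with a small sketch. This is not the exact RT-2 tokenizer; the 256-bin uniform discretization matches the approach described in the talk, but the action ranges and dimensionality below are illustrative assumptions.

```python
import numpy as np

# Sketch of RT-2-style action discretization: each continuous action
# dimension is mapped to one of 256 uniform bins so that actions can be
# emitted as ordinary text tokens by a VLM. Ranges are illustrative.
N_BINS = 256

def tokenize_action(action, low=-1.0, high=1.0):
    """Map a continuous action vector to integer bin tokens in [0, N_BINS)."""
    action = np.clip(np.asarray(action), low, high)
    bins = np.round((action - low) / (high - low) * (N_BINS - 1))
    return bins.astype(int).tolist()

def detokenize_action(tokens, low=-1.0, high=1.0):
    """Invert the binning: recover (quantized) continuous values."""
    tokens = np.asarray(tokens, dtype=float)
    return (tokens / (N_BINS - 1) * (high - low) + low).tolist()

# Example: a 7-DoF end-effector delta plus a gripper command.
action = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0, 1.0]
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
```

The round trip loses at most half a bin width of precision per dimension, which is what lets a text-token decoder drive a continuous controller.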
Models
Training data and underlying models
Data
Inference
Closed-loop robot control (1-3 Hz)
Results: Emergent skills
Results: Quantitative evals
RT-2 w/ PaLI-X-55B ablations
Results: Chain-of-Thought with RT-2-PaLM-E
The Open X-Embodiment Dataset
Many Robots
Many Skills
Many Scenes
Many Objects
Current data assumptions (for now):
Single arm
2-finger grippers, mostly parallel yaw
Still interesting diversity!
Subset of datasets with single arm
Model architectures
images and a text instruction as input
discretized actions as output
Evaluation methodologies
RT-1-X
Training in lab A
Model checkpoints
Send checkpoints over the internet
Evaluation in labs B, C, D
no standardization in control infrastructure
RT-2-X
Query actions over the internet
Summary: Signs of Positive Transfer!
RT-1 vs RT-1-X
50% improvement
Original Methods vs RT-1-X
Large scale data domains
RT-1-X underfits on large datasets
RT-2-X recovers performance
RT-2 generalization evals
RT-2 and RT-2-X perform roughly on par
Not unexpected, since RT-2 already generalizes well along these dimensions due to its VLM backbone
More robust to distractors, on top of VLM pre-training?
Emergent skills evaluations
Object-Relative Position Understanding
Preposition modulates low-level motion
Absolute Position Understanding
Is web-scale data enough?
RT-2-X outperforms RT-2 by 3x in emergent skill evaluations
put apple on cloth
move apple near cloth
Is OXE only good because of Bridge dataset?
Data Scaling: Data Quantity Recap
Real-world robot demonstration dataset [RT-1] → Better in-distribution performance
Co-train on robot data alongside internet data [RT-2] → Generalization to internet semantics
Add robot data from different embodiments [RT-X] → Generalization to spatial concepts
Increasing data interoperability by treating robot actions as just another data modality
…but how do we encourage robotics-specific generalization and robustness?
Agenda
01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Generalization
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?
What does generalization in visual manipulation mean?
New tasks, distractors, backgrounds
New objects
New workspaces
Many notions of generalization, often overlapping and defined at different levels of granularity
What makes it hard to generalize to new environments?
Camera Pose
Table Texture
Floor Texture
Lighting
Object Texture
Object Position
Table Position
Distractor
Evaluation Tasks
Blue chip bag
Blue plastic bottle
Jalapeno chip bag
Oreo
Pepsi can, water bottle
Pick place
Bin picking
Door opening
Factor World
Real Robot
Evaluation Setup
Factor World: 100 new values for each factor
Evaluation Metrics: success rate, generalization gap (train - test success rate)
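The two evaluation metrics are simple to state in code; a minimal sketch, where the episode outcomes are made up purely for illustration:

```python
def success_rate(outcomes):
    """Fraction of successful episodes; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def generalization_gap(train_success, test_success):
    """Generalization gap as defined above: train minus test success rate."""
    return train_success - test_success

# Illustrative numbers: 9/10 successes in training conditions,
# 6/10 under a shifted factor such as a new table texture.
train = success_rate([True] * 9 + [False])
test = success_rate([True] * 6 + [False] * 4)
gap = generalization_gap(train, test)  # ~0.3
```

A smaller gap means the policy degrades less when a single factor of variation is shifted.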
Lighting (x2)
Camera Pose (x3)
Distractors (x3)
Background (x3)
Table (x3)
Real Robot
Impact of Individual Factors
“Easier” factors: background, lighting, distractor
“Harder” factors: table position, table texture, camera position, object texture
Effect of Data Augmentation
Surprisingly, crop augmentations improve generalization to new table textures.
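A random-crop augmentation of the kind referenced above can be sketched in a few lines of NumPy; the image and crop sizes here are illustrative, not the paper's settings:

```python
import numpy as np

def random_crop(image, crop_h, crop_w, rng=None):
    """Randomly crop an HxWxC image: a standard augmentation that the
    experiments above found to help generalization to new table textures."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

patch = random_crop(np.zeros((300, 300, 3), dtype=np.uint8), 256, 256)
```

Randomizing the crop offset perturbs the apparent camera framing, which plausibly explains the benefit on table-texture shifts.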
Main Results
Question: Can we use this insight about compositional generalization during data collection?
Focusing on Compositional Generalization for Robotics
Factors of Variation: Visual and Physical Factors
Comparing Data Collection Strategies
Comparing Data Collection Strategies
Compositional data collection is important for the realistic setting (all factors of variation)!
Compositional Data Collection w/ Prior Data
Compositional data strategies remain important even in the presence of offline datasets!
Agenda
01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?
No matter how much robot data scaling you do… is it enough?
Offline Dataset
Relabeling Logic
Change reward
Change states
Change task
Change actions
Augmented Dataset
Synthetic Data in LLMs
Synthetic Data in Robotics?
RT-1 + Human Teleoperation is tractable but expensive
“Pick up a coke can from the table”
“pick coke can”
Crowdsourced Language Annotations
Structured Teleoperator Commands
DIAL: Data-driven Instruction Augmentation for Language-conditioned control
Step 1: Fine-tune a pretrained VLM on a small dataset A of crowd-sourced instructions
Step 2: Relabel the instructions in a larger dataset B using the fine-tuned VLM, producing dataset C
Step 3: Train a language-conditioned policy on the original and relabeled datasets (A + B + C)
Step 4: Run the trained policy on unseen instructions, e.g. “Move the right apple to the left of the counter”
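The DIAL steps reduce to a short data-flow sketch. All callables and the episode dict layout here (`finetune_vlm`, `train_policy`, `"frames"`) are hypothetical stand-ins; only the A/B/C data flow is taken from the slides.

```python
def dial_pipeline(dataset_a, dataset_b, pretrained_vlm,
                  finetune_vlm, train_policy):
    """Hedged sketch of the DIAL relabeling pipeline."""
    # Step 1: fine-tune the VLM on the small crowd-sourced dataset A.
    vlm = finetune_vlm(pretrained_vlm, dataset_a)
    # Step 2: relabel the larger dataset B with the fine-tuned VLM -> dataset C.
    dataset_c = [dict(ep, instruction=vlm(ep["frames"])) for ep in dataset_b]
    # Step 3: train a language-conditioned policy on A + B + C.
    return train_policy(dataset_a + dataset_b + dataset_c)
    # Step 4 (not shown): roll out the trained policy on unseen instructions.
```

Note that B is kept alongside C, so the policy sees both the structured teleoperator commands and the richer crowd-style relabels.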
Teleoperator Instruction: “move blue plastic bottle near brown chip bag”
DIAL: “move plastic bottle near chip bag at the center left of the table”
Teleoperator Instruction: “pick green rice chips from white bowl”
DIAL: “lift up the green chip bag from the bowl and drop it at the bottom left corner of the table”
Teleoperator Instruction: “open top drawer”
DIAL: “hold and pull out the top cupboard to open it”
[Figure: distribution of DIAL-predicted instruction phrases (“near”, “next to”, “left”, “right”, “from”, “side”, “center”, “below”, other) across structured teleoperator command categories (move, pick from, open, place, close, pick, upright, knock), including the fraction with no DIAL prediction]
Evaluations on Novel Instructions
Tradeoff: more data, but less accurate labels
Need enough data quantity with enough label accuracy
Semantic Data Augmentation
Offline Dataset
Relabeling Logic
Change reward
Change states
Change task
Change actions
Augmented Dataset
Can we leverage advances in image generation for robot data augmentation?
ROSIE: Robot Learning with Semantically Imagined Experience
Overall performance
Recap: Data Scaling
Agenda
01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?
Strengths and Limitations of Language
High-level Language Knowledge
Low-level Robotics Knowledge
Beyond Language
Language Representations: Text Instructions
Object-Centric Representations: Segmentation Masks
Code-Centric Representations: Code as Policies and Rewards
Motion-Centric Representations: EEF Trajectories
Language Hierarchies: Granular Language Motions
Object-centric Representations: Manipulation of Open-World Objects (MOO)
Key Idea: Use single-pixel detection centroids from an open-vocabulary VLM for task conditioning
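One way to realize the key idea is to draw the detector's centroid as a single hot pixel in an extra channel stacked with the RGB observation. The sketch below assumes that channel layout; it is illustrative, not the exact MOO implementation.

```python
import numpy as np

def add_centroid_channel(rgb, centroid):
    """Stack a single-pixel mask channel onto an HxWx3 image.

    `centroid` is the (row, col) pixel the open-vocabulary detector
    predicts for the target object; the policy is conditioned on this
    one-pixel mask rather than on the object's name.
    """
    h, w = rgb.shape[:2]
    mask = np.zeros((h, w, 1), dtype=rgb.dtype)
    r, c = centroid
    mask[r, c, 0] = 255  # the single conditioning pixel
    return np.concatenate([rgb, mask], axis=-1)

obs = add_centroid_channel(np.zeros((224, 224, 3), dtype=np.uint8), (100, 50))
```

Because the conditioning is just a pixel location, the same interface also accepts human pointing, screen clicks, or a centroid extracted from a target image, as the later slides show.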
17 RT-1 Objects
(all skills)
90 Diverse Objects
(“pick” skill only)
47 Novel Evaluation Objects
(unseen during training)
RT-1 Data
Newly added diverse Pick Task data
Novel Object Evaluations
Unseen Objects, Seen Categories (22)
Seen Objects (49)
MOO (our data)
MOO (original RT-1 data)
RT-1 (our data)
VIMA-like (our data)
Pick Skill
Non-pick Skills
Pick Skill
Non-pick Skills
Pick Skill
Non-pick Skills
Unseen Categories (25)
Results: Generalization to Novel Open-World Objects
Open-World Objects
Challenging Textures
New Environments
Additional Input Modalities
Human Pointing
Clicking on Screen
Target Image
Language to Reward: User says “Stand up on two feet.” → Reward Translator (LLM) → reward code → Low-level controller → low-level actions
Generated reward code:
set_torso_rewards(height=0.6, pitch=0.5*np.pi)
set_feet_pos_rewards('back_left', height=0.0)
set_feet_pos_rewards('back_right', height=0.0)
Code as Policies: User says “Stand up on two feet.” → Code as Policies (LLM) → low-level actions
Code-centric Representations: Language to Reward
Code as Policies
Language to Reward
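The generated reward code can be grounded with a small registry of reward terms. This is a hypothetical sketch: the setter names follow the slide, but the registry and the quadratic cost below are assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical reward-setter interface behind the slide's generated code.
# Each call records a target; a quadratic cost (an assumption) then scores
# a state dict against every recorded target.
_targets = {}

def set_torso_rewards(**desired):
    _targets.update({f"torso_{k}": v for k, v in desired.items()})

def set_feet_pos_rewards(foot, **desired):
    _targets.update({f"{foot}_{k}": v for k, v in desired.items()})

def reward(state):
    """Negative squared error between state values and recorded targets."""
    return -sum((state.get(k, 0.0) - v) ** 2 for k, v in _targets.items())

# The LLM-generated reward code from the slide:
set_torso_rewards(height=0.6, pitch=0.5 * np.pi)
set_feet_pos_rewards('back_left', height=0.0)
set_feet_pos_rewards('back_right', height=0.0)

standing = reward({'torso_height': 0.6, 'torso_pitch': 0.5 * np.pi,
                   'back_left_height': 0.0, 'back_right_height': 0.0})
```

The low-level controller then optimizes actions against `reward`, so the LLM never has to emit joint torques directly.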
Results
Robot-centric Representations: RT-Trajectory
Results: Quantitative Evaluations
Results: Diverse Input Modalities
Human Videos
Code as Policies
Foundation Models
Results: Emergent Capabilities
Ego-centric trajectory representations enable broad generalization:
Is language enough, if it’s hierarchical and granular?
RT-Hierarchy
Results: RT-H Outperforms RT-2
No other policy class (RT-1, RT-2) was able to learn from challenging new data
Results: Language Interventions
RT-H’s bottleneck was often language-motion prediction rather than low-level action prediction; language motions are easier to collect interventions for!
Recap: Beyond Language
Language Representations: Text Instructions [RT-1]
Object-Centric Representations: Segmentation Masks [MOO]
Code-Centric Representations: Code as Policies and Rewards [Language to Reward]
Motion-Centric Representations: EEF Trajectories [RT-Trajectory]
Language Hierarchies: Granular Language Motions [RT-Hierarchy]
Agenda
01 Data Scaling: Data Quantity
02 Data Scaling: Data Quality and Diversity
03 Data Scaling: Synthetic Data
04 Beyond Language: Robotics is Unique
05 Horizons: What’s Next?
Moonshot #1: AGI Already Exists?
Moonshot #2: AI has an Evaluation Problem
LLMs Target Human Data Distribution
Evaluate on Humans Directly
Robots Target Physical Data Distribution
Evaluate on ???
Real-to-Sim Evaluation for Real-world Robot Policies
Key Insight: A simulation "good enough" for useful evaluation signal may be much easier to build than a full digital clone for training
Thank you!
xiaoted@gmail.com
Goal-centric Representations: RT-Sketch
Key Idea: Use coarse goal conditioning that focuses goals on what actually matters in the scene
Training: Generate Goal Sketches in Hindsight
Results
Robustness to Goal Detail
Spatial Reasoning
Distractor Robustness