1 of 28

Generalist Models

for Robotic Manipulation

Martin Sedláček


2 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.


3 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.

  • Or, equivalently: from human instructions, perform various seen or unseen tasks in unknown environments, based on limited visual observations.


4 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.

  • Or, equivalently: from human instructions, perform various seen or unseen tasks in unknown environments, based on limited visual observations.


[figure from RT-2]

5 of 28

Set-up

“Pick up the apple” + [Franka Emika Panda robot] + [RealSense camera]
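As a rough sketch of the interface this set-up implies (the signature and names are illustrative, not from any particular library), the policy gets one camera frame plus the instruction and must return a low-level command for the arm:

```python
import numpy as np

def policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Map one RGB observation and a language instruction to an action.

    image:       (H, W, 3) uint8 frame from the RealSense camera.
    instruction: e.g. "Pick up the apple".
    returns:     e.g. a 7-D command [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    raise NotImplementedError  # the rest of the talk is about what goes here
```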

6 of 28


7 of 28


[demo from Pi0]

8 of 28


9 of 28

Adding other modalities?

  • Environment state?


  • Unknown a priori
  • Hard to estimate
  • Can be dynamic (and random)
  • Potentially large

[figure: scene objects annotated with poses (x, y, z, θ)]
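For intuition, a hypothetical container for this state (all names illustrative) makes the bullets above concrete: a variable-length set of object poses that the robot never observes directly.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectPose:
    position: np.ndarray     # (3,) x, y, z in metres
    orientation: np.ndarray  # e.g. (4,) unit quaternion, or a single yaw angle

# Unknown a priori, hard to estimate from pixels, may change at any moment,
# and can hold arbitrarily many objects -- hence "potentially large".
EnvState = list[ObjectPose]
```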

10 of 28

Adding other modalities?

  • Depth?


[figure from Brás & Neto]

  • Hard to set up
  • Costly hardware
  • Computationally expensive

11 of 28

Adding other modalities?

  • Robot state?


  • Known
  • Useful
  • Embodiment specific

[figure from Richard Savery]
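By contrast, a proprioceptive state can be read straight off the robot's encoders. A hypothetical schema for the 7-DoF Panda (field names illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotState:
    joint_positions: np.ndarray   # (7,) rad -- the Panda arm has 7 joints
    joint_velocities: np.ndarray  # (7,) rad/s
    gripper_width: float          # m, distance between the fingers

# Known and useful -- but the schema is tied to this one embodiment:
# a different arm, a mobile base, or a multi-finger hand changes it.
```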

12 of 28


  • Depth
  • State

Many Different Robots!

13 of 28

What’s in the black box?


?

14 of 28

What’s in the black box?


?

Planning?

  • Unknown state
  • Expensive to compute
  • Environment is random
  • No generalization

15 of 28

What’s in the black box?


?

Planning?

  • Unknown state
  • Expensive to compute
  • Environment is random
  • No generalization

ML?

  • Learn from millions of on-robot examples
  • Leverage large availability of data on the internet

16 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end (both sketched below)
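A minimal sketch of the contrast (every name below is an illustrative stub, not a real API; see also the image-matching example on the next slides):

```python
from typing import Any
import numpy as np

# Illustrative stubs -- each would be a full "expert" sub-system in practice.
def detect_objects(image: np.ndarray) -> list[Any]: ...
def ground_instruction(text: str, objects: list[Any]) -> Any: ...
def plan_grasp(target: Any) -> Any: ...
def solve_ik(grasp: Any) -> np.ndarray: ...

def pipeline_policy(image: np.ndarray, text: str) -> np.ndarray:
    """A) A hand-designed chain of expert modules."""
    objects = detect_objects(image)             # perception
    target = ground_instruction(text, objects)  # language grounding
    grasp = plan_grasp(target)                  # grasp planning
    return solve_ik(grasp)                      # motion generation

def end_to_end_policy(image: np.ndarray, text: str) -> np.ndarray:
    """B) One learned model maps inputs straight to an action."""
    raise NotImplementedError  # this is the route the rest of the talk takes
```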


17 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end

Recent example: image matching


[figure from MASt3R]

[figure from CVPR 2017 tutorial]

18 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end

Recent example: image matching


[figure from MASt3R]

[figure from CVPR 2017 tutorial]

One NN (and A LOT more data): +30% (~2.5×)

19 of 28

Are image and language enough?

  • Information rich modalities (especially image)
  • Lots of data available
    • …unlike robot data (very expensive to collect and hard to simulate accurately)
  • Large models are good at combining these modalities
    • i.e., capturing how words relate to images and vice versa


20 of 28

Are image and language enough?

  • Information rich modalities (especially image)
  • Lots of data available
    • …unlike robot data (very expensive to collect and hard to simulate accurately)
  • Large models are good at combining these modalities
    • i.e., capturing how words relate to images and vice versa

Multi-Modal + lots of data = generalization capabilities = Foundation Model

  • How can you add the robot action modality into this mix?


21 of 28

Vision-Language Model (VLM)

  • GOAL: via large-scale pre-training on millions of examples, learn a good shared latent space for both visual and language understanding.


[figure from PaliGemma]

22 of 28

Vision-Language Model (VLM)

  • GOAL: via large-scale pre-training on millions of examples, learn a good shared latent space for both visual and language understanding.


[figure from CLIP]
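A minimal sketch of the CLIP-style contrastive objective (simplified: the real model learns the temperature and trains on very large batches), assuming the two encoders have already produced matched batches of embeddings:

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temp: float = 0.07) -> float:
    """Symmetric InfoNCE over N matched (image, text) pairs, each (N, D)."""
    # L2-normalise so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp        # (N, N): every image vs. every text
    logits -= logits.max()             # numerical stability only

    # The i-th image matches the i-th caption, so the diagonal is the target.
    n = len(logits)
    diag = np.arange(n)
    log_p_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-(log_p_rows[diag, diag].mean() + log_p_cols[diag, diag].mean()) / 2)
```

Pulling matched pairs together and pushing the other N-1 apart, in both directions, is what forces the shared latent space the slide describes.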

23 of 28

Vision-Language Model (VLM) for robotics?

  • NEW GOAL: learn the relationship between the vision-language latent space and robot actions, (mostly) from demonstrations.


“Pick up the bag near the edge of the table.”

?

24 of 28

Vision-Language-Action (VLA)

  • NEW GOAL: learn the relationship between the vision-language latent space and robot actions, (mostly) from demonstrations.


“Pick up the bag near the edge of the table.”

  • A) Directly map to a discrete action space.
  • B) Condition a generative model (e.g., diffusion).
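A minimal sketch of option A, in the spirit of RT-2's action tokenization (RT-2 uses 256 bins per dimension; the ranges and names here are illustrative): uniformly discretize each continuous action dimension so the action becomes a short sequence of tokens a language model can predict.

```python
import numpy as np

N_BINS = 256  # one vocabulary entry per bin, per action dimension

def action_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Continuous action (e.g. 7-D end-effector delta + gripper) -> token ids."""
    frac = (np.clip(action, low, high) - low) / (high - low)   # in [0, 1]
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Inverse map: token ids back to bin-centre continuous values."""
    return low + (tokens + 0.5) / N_BINS * (high - low)
```

Option B keeps actions continuous: the same vision-language embedding instead conditions a generative head (e.g. a diffusion model) that samples an action, or a short chunk of actions.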

25 of 28

Secret sauce?

  • Enabler: the Open X-Embodiment (OXE) dataset


26 of 28


[slide from Kevin Black]

27 of 28

Beyond manipulation?

  • CrossFormer
    • Idea: learn representations that are useful for various tasks regardless of embodiment type.
    • Preliminary results show that doing this can have a small but positive impact on performance.


[figure from CrossFormer]

28 of 28

Thank you!
