1 of 28

Generalist Models

for Robotic Manipulation

Martin Sedláček


2 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.


3 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.

  • Or, equivalently: from human instructions, perform various seen or unseen tasks in unknown environments, based on limited visual observations.


4 of 28

Intro

Main Goal: Get useful generalist robots that can robustly do tasks in the real world.

  • Or, equivalently: from human instructions, perform various seen or unseen tasks in unknown environments, based on limited visual observations.


[figure from RT-2]

5 of 28

Set-up

“Pick up the apple” + [Franka Emika Panda robot] + [RealSense camera]
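As a rough sketch of the interface this set-up implies (the signature and names are illustrative, not from any particular library), the policy gets one camera frame plus the instruction and must return a low-level command for the arm:

```python
import numpy as np

def policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Map one RGB observation and a language instruction to an action.

    image:       (H, W, 3) uint8 frame from the RealSense camera.
    instruction: e.g. "Pick up the apple".
    returns:     e.g. a 7-D command [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    raise NotImplementedError  # the rest of the talk is about what goes here
```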

6 of 28


7 of 28


[demo from Pi0]

8 of 28


9 of 28

Adding other modalities?

  • Environment state?


  • Unknown a priori
  • Hard to estimate
  • Can be dynamic (and random)
  • Potentially large

[figure: scene objects annotated with poses (x, y, z, θ)]
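For intuition, a hypothetical container for this state (all names illustrative) makes the bullets above concrete: a variable-length set of object poses that the robot never observes directly.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectPose:
    position: np.ndarray     # (3,) x, y, z in metres
    orientation: np.ndarray  # e.g. (4,) unit quaternion, or a single yaw angle

# Unknown a priori, hard to estimate from pixels, may change at any moment,
# and can hold arbitrarily many objects -- hence "potentially large".
EnvState = list[ObjectPose]
```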

10 of 28

Adding other modalities?

  • Depth?


[figure from Brás & Neto]

  • Hard to set up
  • Costly hardware
  • Computationally expensive

11 of 28

Adding other modalities?

  • Robot state?


  • Known
  • Useful
  • Embodiment specific

[figure from Richard Savery]
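By contrast, a proprioceptive state can be read straight off the robot's encoders. A hypothetical schema for the 7-DoF Panda (field names illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotState:
    joint_positions: np.ndarray   # (7,) rad -- the Panda arm has 7 joints
    joint_velocities: np.ndarray  # (7,) rad/s
    gripper_width: float          # m, distance between the fingers

# Known and useful -- but the schema is tied to this one embodiment:
# a different arm, a mobile base, or a multi-finger hand changes it.
```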

12 of 28


  • Depth
  • State

Many Different Robots!

13 of 28

What’s in the black box?


?

14 of 28

What’s in the black box?


?

Planning?

  • Unknown state
  • Expensive to compute
  • Environment is random
  • No generalization

15 of 28

What’s in the black box?


?

Planning?

  • Unknown state
  • Expensive to compute
  • Environment is random
  • No generalization

ML?

  • Learn from millions of on-robot examples
  • Leverage large availability of data on the internet

16 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end (both sketched below)
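A minimal sketch of the contrast (every name below is an illustrative stub, not a real API; see also the image-matching example on the next slides):

```python
from typing import Any
import numpy as np

# Illustrative stubs -- each would be a full "expert" sub-system in practice.
def detect_objects(image: np.ndarray) -> list[Any]: ...
def ground_instruction(text: str, objects: list[Any]) -> Any: ...
def plan_grasp(target: Any) -> Any: ...
def solve_ik(grasp: Any) -> np.ndarray: ...

def pipeline_policy(image: np.ndarray, text: str) -> np.ndarray:
    """A) A hand-designed chain of expert modules."""
    objects = detect_objects(image)             # perception
    target = ground_instruction(text, objects)  # language grounding
    grasp = plan_grasp(target)                  # grasp planning
    return solve_ik(grasp)                      # motion generation

def end_to_end_policy(image: np.ndarray, text: str) -> np.ndarray:
    """B) One learned model maps inputs straight to an action."""
    raise NotImplementedError  # this is the route the rest of the talk takes
```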


17 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end

Recent example: image matching


[figure from MASt3R]

[figure from CVPR 2017 tutorial]

18 of 28

Two ways:

  • A) Build a complex pipeline of “expert” sub-systems
  • B) One system that does the action prediction end-to-end

Recent example: image matching


[figure from MASt3R]

[figure from CVPR 2017 tutorial]

One NN (and A LOT more data): +30% (~2.5×)

19 of 28

Are image and language enough?

  • Information rich modalities (especially image)
  • Lots of data available
    • …unlike robot data (very expensive to collect and hard to simulate accurately)
  • Large models are good at combining these modalities
    • i.e., capturing how words relate to images and vice versa


20 of 28

Are image and language enough?

  • Information rich modalities (especially image)
  • Lots of data available
    • …unlike robot data (very expensive to collect and hard to simulate accurately)
  • Large models are good at combining these modalities
    • i.e., capturing how words relate to images and vice versa

Multi-Modal + lots of data = generalization capabilities = Foundation Model

  • How can you add the robot action modality into this mix?


21 of 28

Vision-Language Model (VLM)

  • GOAL: via large-scale pre-training on millions of examples, learn a good shared latent space for both visual and language understanding.


[figure from PaliGemma]

22 of 28

Vision-Language Model (VLM)

  • GOAL: via large-scale pre-training on millions of examples, learn a good shared latent space for both visual and language understanding.


[figure from CLIP]
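A minimal sketch of the CLIP-style contrastive objective (simplified: the real model learns the temperature and trains on very large batches), assuming the two encoders have already produced matched batches of embeddings:

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temp: float = 0.07) -> float:
    """Symmetric InfoNCE over N matched (image, text) pairs, each (N, D)."""
    # L2-normalise so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp        # (N, N): every image vs. every text
    logits -= logits.max()             # numerical stability only

    # The i-th image matches the i-th caption, so the diagonal is the target.
    n = len(logits)
    diag = np.arange(n)
    log_p_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-(log_p_rows[diag, diag].mean() + log_p_cols[diag, diag].mean()) / 2)
```

Pulling matched pairs together and pushing the other N-1 apart, in both directions, is what forces the shared latent space the slide describes.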

23 of 28

Vision-Language Model (VLM) for robotics?

  • NEW GOAL: learn the relationship between the vision-language latent space and robot actions, (mostly) from demonstrations.


“Pick up the bag near the edge of the table.”

?

24 of 28

Vision-Language-Action (VLA)

  • NEW GOAL: learn the relationship between the vision-language latent space and robot actions, (mostly) from demonstrations.


“Pick up the bag near the edge of the table.”

  • A) Directly map to a discrete action space.
  • B) Condition a generative model (e.g., diffusion).
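A minimal sketch of option A, in the spirit of RT-2's action tokenization (RT-2 uses 256 bins per dimension; the ranges and names here are illustrative): uniformly discretize each continuous action dimension so the action becomes a short sequence of tokens a language model can predict.

```python
import numpy as np

N_BINS = 256  # one vocabulary entry per bin, per action dimension

def action_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Continuous action (e.g. 7-D end-effector delta + gripper) -> token ids."""
    frac = (np.clip(action, low, high) - low) / (high - low)   # in [0, 1]
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Inverse map: token ids back to bin-centre continuous values."""
    return low + (tokens + 0.5) / N_BINS * (high - low)
```

Option B keeps actions continuous: the same vision-language embedding instead conditions a generative head (e.g. a diffusion model) that samples an action, or a short chunk of actions.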

25 of 28

Secret sauce?

  • Enabler: the Open X-Embodiment (OXE) dataset


26 of 28


[slide from Kevin Black]

27 of 28

Beyond manipulation?

  • CrossFormer
    • Idea: learn representations that are useful for various tasks regardless of embodiment type.
    • Preliminary results show that doing this can have a small but positive impact on performance.


[figure from CrossFormer]

28 of 28

Thank you!
