1 of 81

Embodied Language Models

From multi-task learning towards the generalist robot

Thomas Dooms, Alexander Belooussov

2 of 81

Context

Research project 2 (12 sp)

  • Transformers United
  • Deep multi-task & meta-learning

CS 330: Deep Multi-Task and Meta Learning

3 of 81

Content

A brief overview of this lecture

The Basics

Reinforcement Learning

Language Conditioning

4 of 81

Multi-Task Learning

5 of 81

Definition

Let’s refresh the basics

Supervised learning

6 of 81

Motivation

Some data modalities are very hard to acquire

datapoints

NLP

CV

RL

Medical imaging

Audio

7 of 81

Transfer learning

Solve target task 𝒯b

By transferring knowledge learned from 𝒯a

Without access to 𝒟a

8 of 81

Limitations

Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Kumar, Ananya et al. (2022)

9 of 81

Multi-task learning

CS330: Deep multi-task & meta-learning

Chelsea Finn (2021)

Assumption

Tasks share some structure (and in practice, they most often do)

Learning this structure is beneficial for both tasks

Definitions

Example

10 of 81

Multi-task architectures

Hard Sharing

All Sharing

Soft Sharing
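
A minimal sketch of hard parameter sharing (names and dimensions are illustrative, not a reference implementation): a shared trunk learns the common structure, while small per-task heads hold the task-specific parameters.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_tasks, out_dim):
        super().__init__()
        # Shared trunk: every task updates these weights.
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Task-specific heads: only the active task updates its head.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_tasks)]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))
```

Soft sharing instead keeps one network per task and penalizes the distance between their parameters.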

11 of 81

Summary

Short summary of multi-task learning

Train multiple tasks together

No tradeoff between specificity and generality

Less overfitting

Higher accuracy

12 of 81

Meta learning

13 of 81

Learning to Learn

Who can figure it out?

4 🍕 6 = ?

3 🍕 5 = 18

1 🍕 2 = 3

2 🍕 3 = 8

6 🍕 1 = 12

14 of 81

Definitions

Learning to Learn with Gradients

Finn, Chelsea (2018)

Mathematically

Given data from 𝒯1 , ..., 𝒯n, quickly solve new task 𝒯test

Supervised learning: θ* = arg max_θ log p(θ | 𝒟)

Meta-learning: θ* = arg max_θ log p(θ | 𝒟_meta-train), where 𝒟_meta-train = {𝒟_1, …, 𝒟_n}

Intuitively

Find the set of parameters θ such that new tasks can be learned quickly.

15 of 81

Support & Query

[Figure: few-shot episodes. Each task provides a support set of labeled examples (classes 1 and 2) and a query set of unlabeled examples (?) to classify.]

16 of 81

Black-box

meta learning

17 of 81

Black-box

meta learning

A concrete example using RNNs
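
One way to realize this (a hedged sketch with illustrative names): an RNN reads the support set as a sequence of (x, y) pairs, then predicts labels for the query inputs. Adaptation happens in the hidden state, not through weight updates.

```python
import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """LSTM that reads (x, y) support pairs, then predicts y for query x."""
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(x_dim + y_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, y_dim)

    def forward(self, support_x, support_y, query_x):
        # Concatenate inputs with their labels; queries get a zero label.
        zeros = query_x.new_zeros(query_x.shape[0], query_x.shape[1],
                                  support_y.shape[-1])
        seq = torch.cat([
            torch.cat([support_x, support_y], dim=-1),
            torch.cat([query_x, zeros], dim=-1),
        ], dim=1)
        h, _ = self.rnn(seq)
        # Hidden states at the query positions carry the adapted "parameters".
        k = support_x.shape[1]
        return self.out(h[:, k:, :])
```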

18 of 81

Optimisation based meta learning

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Finn, Chelsea et al. (2017)

19 of 81

Quiz Time

Which marker is the best candidate for θ?

20 of 81

Summary

What kind of learning algorithms have we discussed?

Transfer learning

Solve 𝒯b by transferring knowledge from 𝒯a

Multi-task learning

Solve multiple tasks 𝒯1, 𝒯2, … , 𝒯n at once

Meta-learning

Given data from 𝒯1, 𝒯2, … , 𝒯n quickly solve 𝒯test

21 of 81

Questions?

22 of 81

Reinforcement Learning

23 of 81

Multi-Task RL

Cross-task generalization

Performance increases for all tasks

Why multi-task?

24 of 81

Multi-Task RL

Cross-task generalization

Easier exploration

Tasks share knowledge

Why multi-task?

25 of 81

Multi-Task RL

Cross-task generalization

Easier exploration

Sequencing for long-horizon tasks

Long tasks can be split into easier sub-tasks

Why multi-task?

26 of 81

Multi-Task RL

Cross-task generalization

Easier exploration

Sequencing for long-horizon tasks

Reset-free learning

No intervention needed

Generalize to different starting states and goals

Why multi-task?

27 of 81

Multi-Task RL

Cross-task generalization

Easier exploration

Sequencing for long-horizon tasks

Reset-free learning

Per-task sample-efficiency gains

Fewer examples per task needed

Why multi-task?

28 of 81

Multi-Task RL

Task can be defined by

New State/Action space

Different dynamics

Different reward function

(Optional) Task identifier inside the state (see the sketch below)

One-hot

Language description

Goal state ⇒ Goal-conditioned RL

Task Specification
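
A minimal sketch of what such task identifiers look like in practice (names are illustrative):

```python
import numpy as np

def task_conditioned_obs(obs, task_id, num_tasks):
    """Append a one-hot task identifier to the raw observation."""
    one_hot = np.zeros(num_tasks, dtype=obs.dtype)
    one_hot[task_id] = 1.0
    return np.concatenate([obs, one_hot])

def goal_conditioned_obs(obs, goal):
    """Goal-conditioned RL: the identifier is a goal state instead."""
    return np.concatenate([obs, goal])
```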

29 of 81

Defining tasks

30 of 81

Defining tasks

31 of 81

Defining tasks

32 of 81

Multi-Task RL

Hindsight Relabeling, or Hindsight Experience Replay (HER)

Hindsight experience replay Andrychowicz, Marcin, et al. (2017)

33 of 81

Multi-Task RL

Hindsight Relabeling, or Hindsight Experience Replay (HER)

Hindsight experience replay Andrychowicz, Marcin, et al. (2017)

34 of 81

Multi-Task RL

Hindsight Relabeling, or Hindsight Experience Replay (HER)

Hindsight experience replay Andrychowicz, Marcin, et al. (2017)

35 of 81

Goal-Conditioned RL

Pretend the achieved goal was the intended one, even if the real goal was not reached

Hindsight experience replay Andrychowicz, Marcin, et al. (2017)

36 of 81

Goal-Conditioned RL

Pretend the achieved goal was the intended one, even if the real goal was not reached

Hindsight experience replay Andrychowicz, Marcin, et al. (2017)

37 of 81

Goal-Conditioned RL

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills

Chebotar, Yevgen, et al. (2021)

38 of 81

Goal-Conditioned RL

Pretend the achieved goal was the intended one, even if the real goal was not reached

Use any state in the trajectory as a goal, not just the final one

Many more samples = more exploration

Always optimal data

Goal is always reached after relabeling

Use unstructured dataset

Human play

→ Turn into a supervised dataset (see the relabeling sketch below)

Imitation Learning
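
A minimal sketch of hindsight relabeling under these assumptions (goal_fn and reward_fn are hypothetical helpers that map a state to the goal it achieves and compute the reward for a state-goal pair):

```python
import numpy as np

def her_relabel(trajectory, goal_fn, reward_fn):
    """Relabel a trajectory so a state actually reached becomes the goal.

    trajectory: list of (state, action, next_state) tuples
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Use ANY later state as the goal, not just the final one.
        future = np.random.randint(t, len(trajectory))
        g = goal_fn(trajectory[future][2])
        # After relabeling, the data is optimal: the goal is always reached.
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled
```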

39 of 81

Meta-RL

Problem Statement

Inputs

Current state sₜ

K previous timesteps or rollouts from policy

Outputs

Action aₜ

Goal

Given a small amount of experience

Adapt to a new task

40 of 81

Meta RL example

Meta Train

Meta Test

41 of 81

Meta-RL

Black-Box Meta-RL

Training

  1. Sample task 𝒯ᵢ
  2. Roll out the policy for N episodes
  3. Store the sequence in the buffer for task 𝒯ᵢ
  4. Update the policy to maximize the discounted return across all tasks
  5. Repeat

+ General & expressive
+ Variety in design choices/architectures
- Hard to optimize
~ Inherits sample efficiency from the RL optimizer

42 of 81

Meta-RL

Optimisation-Based Meta-RL

Training

  1. Sample task 𝒯ᵢ
  2. Collect 𝒟ᵢᵗʳ by rolling out the policy π_θ
  3. Inner-loop adaptation: φᵢ = θ − α ∇_θ L(θ, 𝒟ᵢᵗʳ)
  4. Collect 𝒟ᵢᵗˢ by rolling out the adapted policy π_φᵢ
  5. Outer-loop update: θ ← θ − β ∇_θ Σᵢ L(φᵢ, 𝒟ᵢᵗˢ)
  6. Repeat (a code sketch follows below)
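
A supervised sketch of the inner/outer loop (illustrative; Meta-RL replaces the MSE loss with a policy-gradient objective estimated from the collected rollouts):

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def task_loss(model, params, batch):
    x, y = batch
    return F.mse_loss(functional_call(model, params, (x,)), y)

def maml_step(model, tasks, alpha=0.01, beta=0.001):
    theta = dict(model.named_parameters())
    meta_loss = 0.0
    for support, query in tasks:
        # Inner loop: phi_i = theta - alpha * grad_theta L(theta, D_i_train)
        grads = torch.autograd.grad(task_loss(model, theta, support),
                                    theta.values(), create_graph=True)
        phi = {n: p - alpha * g for (n, p), g in zip(theta.items(), grads)}
        # Evaluate the adapted parameters phi_i on the query data
        meta_loss = meta_loss + task_loss(model, phi, query)
    # Outer loop: theta <- theta - beta * grad_theta sum_i L(phi_i, D_i_test)
    outer = torch.autograd.grad(meta_loss, theta.values())
    with torch.no_grad():
        for p, g in zip(theta.values(), outer):
            p -= beta * g
```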

43 of 81

Meta-RL

Optimization-Based Meta-RL

Learning to Adapt in Dynamic Environments through Meta-RL

Nagabandi*, Clavera*, Liu, Fearing, Abbeel, Levine, Finn (2019)

+ Inductive bias
+ Easy to combine with policy gradients and model-based methods
- Hard to combine with value-based methods
- Policy gradients are very noisy

~ Inherits sample efficiency from outer RL optimizer

44 of 81

Questions?

45 of 81

Language Models

46 of 81

Language Prediction

Improving Language Understanding by Generative Pre-Training

Radford, Alec et al. (2018)

The Eiffel tower is in Paris

xᵢ → yᵢ

The Eiffel tower is in → Paris
The Eiffel tower is → in
The Eiffel tower → is
The Eiffel → tower
The → Eiffel
⟨start⟩ → The
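
The same shifted-pair construction, sketched in code:

```python
sentence = ["The", "Eiffel", "tower", "is", "in", "Paris"]

# Each prefix of the sentence predicts the next token; the empty
# prefix (start of text) predicts the first word.
pairs = [(sentence[:i], sentence[i]) for i in range(len(sentence))]

for x, y in reversed(pairs):   # longest context first, as on the slide
    print(" ".join(x) or "<start>", "->", y)
```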

47 of 81

Transformers

Improving Language Understanding by Generative Pre-Training

Radford, Alec et al. (2018)

Sequential data

A stream of tokens

Can vary in length

Three main components (the attention + FFN block is stacked N times)

Positional encoding

Attention mechanism

Position-wise FFN

48 of 81

Attention
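
The core operation, scaled dot-product attention from Vaswani et al. (2017), as a minimal sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    if mask is not None:
        # Masked positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```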

49 of 81

Encoders & Decoders

Attention is all you need

Vaswani, Ashish et al. (2017)

Decoders → look at previous tokens

Useful for text generation

Does not leak future tokens

Encoders → look at all tokens

Useful for text classification (sentiment analysis)

Global Attention

Masked Attention

50 of 81

Cross-Attention

Attention is all you need

Vaswani, Ashish et al. (2017)

Self-attention

Query the data itself

Keys & values from original tokens

Cross-attention

Query other keys and values

Keys & values source can vary
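
A small illustration of the difference using PyTorch's nn.MultiheadAttention (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 64)    # sequence of 10 tokens
ctx = torch.randn(1, 7, 64)   # e.g. encoder outputs or image patches

self_out, _ = attn(query=x, key=x, value=x)        # self-attention: Q, K, V from x
cross_out, _ = attn(query=x, key=ctx, value=ctx)   # cross-attention: K, V from elsewhere
```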


51 of 81

The Full Transformer

Attention is all you need

Vaswani, Ashish et al. (2017)

52 of 81

Foundation models

PaLM: Scaling Language Modeling with Pathways�Chowdhery, Aakanksha et al. (2022)

  • Trained on huge amounts of data
  • Largest models in existence
  • Very versatile

53 of 81

Foundation Model Overview

Name         Company    Release   Parameters      Corpus size (tokens)
GPT-2        OpenAI     2019      1.5 billion     10 billion
GPT-3        OpenAI     2020      175 billion     300 billion
PaLM         Google     2022      540 billion     768 billion
GPT-4        OpenAI     2023      ~ 1 trillion*   unknown
Chinchilla   DeepMind   2022      70 billion      1.4 trillion
Llama        Meta       2023      65 billion      1.4 trillion
PaLM 2       Google     2023      340 billion     3.5 trillion

*not actually known

54 of 81

Emergent Abilities

Emergent Abilities of Large Language Models

Wei, Jason et al. (2022)

GPT-4 Technical Report

OpenAI (2023)

55 of 81

Few-Shot Abilities

Language Models are Few-Shot Learners�Brown, Tom et al. (2020)

Fine-tuned Language Models are Zero-Shot Learners

Wei, Jason et al. (2022)

In-Context Learning (ICL)

“Wish You Were Here: 1975, The Dark Side of the Moon:”

56 of 81

Attention as a Meta-Optimiser

A Survey on In-context Learning

Dong, Qingxiu et al. (2023)

Language Models Implicitly Perform Gradient Descent as

Meta-Optimizers

Dai, Damai et al. (2023)

Attention updates the latent states much like fine-tuning (FT) updates the weights

These implicit modifications resemble explicit fine-tuning

Latent states show high similarity between ICL and FT

57 of 81

Fine-Tuning

Training language models to follow instructions with human feedback

Ouyang, Long et al. (2022)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob et al. (2018)

Transfer knowledge to specific domains

Ask open-ended questions about a set of documents

Models retain their grasp of language

Knowledge of idioms

Models retain their few-shot and ICL capabilities

Respond correctly to never-seen inputs

Tolerance to slight distribution shift

58 of 81

Language Conditioning

59 of 81

Goal

Interactive Language: Talking to Robots in Real Time

Lynch, Corey et al. (2022)

Guide robot using natural language

Complete long-term objectives

Give short-term instructions

Rectify mistakes

Free vocabulary

60 of 81

Dataset Gathering

Play dataset

Human operators

Optional objective prompts

Explore the environment as much as possible

Actions are more meaningful than random exploration

Prompts discarded ⇒ Unlabeled dataset

Interactive Language: Talking to Robots in Real Time

Lynch, Corey et al. (2022)

61 of 81

Dataset Gathering

Event-selectable Hindsight Relabeling

Select a few-second fragment from the replay

→ Better than using random windows

Add natural language annotation

→ Open vocabulary

Annotations ⇒ Labeled dataset

Interactive Language: Talking to Robots in Real Time

Lynch, Corey et al. (2022)

62 of 81

Imitation Learning

Always-optimal labeled dataset

Use simple objectives

Base imitation learning objective

Condition on language (see the loss sketch below)

Interactive Language: Talking to Robots in Real Time

Lynch, Corey et al. (2022)
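
A minimal sketch of a language-conditioned behavioral-cloning loss; the policy interface and names are assumptions, not the paper's code:

```python
def lcbc_loss(policy, batch):
    """Maximize log pi(a_t | s_t, language) over the relabeled play data."""
    states, actions, lang_emb = batch   # lang_emb: e.g. a CLIP text embedding
    dist = policy(states, lang_emb)     # assumed to return an action distribution
    return -dist.log_prob(actions).mean()   # negative log-likelihood
```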

63 of 81

Architecture: LAVA (Language Attends to Vision to Act)

Image encoding

Pre-trained ResNet + learned convolutions

Text encoding

Fine-tuned CLIP encoder

CLIP is adapted from (image, language) pairs to (video, language) pairs

64 of 81

Relating text with images (CLIP)

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec et al. (2021)

65 of 81

Relating text with images (CLIP)

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec et al. (2021)
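
CLIP's symmetric contrastive objective over a batch of matched (image, text) pairs, as a compact sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Matched pairs sit on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (B, B) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Classify the right text for each image, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```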

66 of 81

Architecture: LAVA (Language Attends to Vision to Act)

Decoder-only Transformer

Cross-attention

Multi-Layer Perceptron

Output

Vision-language embedding for certain timestep

67 of 81

Architecture: LAVA (Language Attends to Vision to Act)

Encoder-only Transformer

Self-attention

Multi-Layer Perceptron

Input

Sequence of vision-language embeddings

68 of 81

Architecture: LAVA (Language Attends to Vision to Act)

Policy

MLP blocks with residual connections

Outputs action
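
Putting the three LAVA slides together: a rough, simplified sketch (module choices, pooling, and dimensions are our assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class LavaSketch(nn.Module):
    def __init__(self, d=256, action_dim=8):
        super().__init__()
        # Per-timestep fusion: language attends to vision.
        self.cross = nn.MultiheadAttention(d, 8, batch_first=True)
        # Temporal encoder: self-attention over the fused sequence.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), num_layers=2)
        # Policy head: MLP that outputs the action.
        self.policy = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                    nn.Linear(d, action_dim))

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, T, P, d) patch tokens per timestep
        # text_tokens:   (B, L, d) language tokens
        B, T, P, d = vision_tokens.shape
        per_step = []
        for t in range(T):
            fused, _ = self.cross(text_tokens,
                                  vision_tokens[:, t], vision_tokens[:, t])
            per_step.append(fused.mean(dim=1))   # pool to one embedding per step
        seq = torch.stack(per_step, dim=1)       # (B, T, d)
        h = self.temporal(seq)
        return self.policy(h[:, -1])             # action for the current timestep
```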

69 of 81

Demonstrations

70 of 81

Summary

What do we learn from this?

Language embeddings are useful task descriptors

Progress in NLP and CV provides powerful architectures

Another use case for attention/transformers

71 of 81

PaLM-E

72 of 81

Goal

Visual-language embodied robot

Grounding: connect LLM representations to the real world

Interleave multi-modal inputs into same input embedding space

PaLM-E: An Embodied Multimodal Language Model

Driess, Danny et al. (2023)

73 of 81

Architecture

Input

Language, images, sensor readings, …

Encoders embed all inputs into same space

Output

Language

Can be used to give short-term instructions to language-conditioned policies
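
A minimal sketch of this interleaving (encoder choice and dimensions are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d = 32000, 512
word_emb = nn.Embedding(vocab_size, d)
image_encoder = nn.Linear(2048, d)   # stand-in for a ViT projecting patches to d

def build_input(prompt_ids, image_features, insert_at):
    # prompt_ids: (L,) token ids; image_features: (P, 2048) patch features
    text = word_emb(prompt_ids)           # (L, d) text-token embeddings
    img = image_encoder(image_features)   # (P, d) continuous "multimodal tokens"
    # Splice the image tokens into the text sequence at the <img> position,
    # so the LLM consumes one mixed sequence in its embedding space.
    return torch.cat([text[:insert_at], img, text[insert_at:]], dim=0)
```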

74 of 81

Example Tasks: Visual QA

Task: Object-table relation
Given: <img>. Q: Is the red object left, right, or center of the table?
Target: A: The red object is in the center of the table.

Task: Object-object relation
Given: <img>. Q: Is the yellow object below the blue object?
Target: A: No, the yellow object is not below the blue object.

Task: Plan feasibility
Given: <img>. Q: Is it possible to first grasp the blue object, then place it on the yellow object, and then grasp the yellow object?
Target: A: No, this is not possible.

75 of 81

Example Tasks: Planning

Task: Grasping
Given: <img>. Q: How to grasp the green object?
Target: A: First grasp the orange object and place it on the table, then grasp the green object.

Task: Stacking
Given: <img>. Q: How to stack the white object on top of the red object?
Target: A: First grasp the green object and place it on the table, then grasp the white object and place it on the red object.

76 of 81

Positive Transfer

Model benefits from multi-task learning

PaLM-E: An Embodied Multimodal Language Model

Driess, Danny et al. (2023)

77 of 81

Positive Transfer

Model benefits from multi-task learning

Training on language tasks helps control

Only a few robotics examples are needed

PaLM-E: An Embodied Multimodal Language Model

Driess, Danny et al. (2023)

78 of 81

Positive Transfer

Model benefits from multi-task learning

Training on language tasks helps control

Only a few robotics examples are needed

Large models preserve NLP capabilities

PaLM-E: An Embodied Multimodal Language Model

Driess, Danny et al. (2023)

79 of 81

Demos

80 of 81

Demos

81 of 81

Questions?