Embodied Language Models
From multi-task learning towards the generalist robot
Thomas Dooms, Alexander Belooussov
Context
Research project 2 (12 sp)
CS 330: Deep Multi-Task and Meta Learning
Content
A brief overview of this lecture
The Basics
Reinforcement Learning
Language Models
Language Conditioning
Multi-Task Learning
Definition
Let’s refresh the basics
Supervised learning
Motivation
Some data modalities are very hard to acquire
[Chart: approximate number of available datapoints per domain — NLP, CV, Audio, Medical imaging, RL]
Transfer learning
Solve target task 𝒯b
By transferring knowledge learned from 𝒯a
Without access to 𝒟a
Limitations
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
Kumar, Ananya et al. (2022)
Multi-task learning
CS330: Deep multi-task & meta-learning
Chelsea Finn (2021)
Assumption
Tasks share some structure → most often, they do
Learning this structure is beneficial for both tasks
Definitions
Example
Multi-task architectures
Hard Sharing
All Sharing
Soft Sharing
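To make the hard-sharing variant concrete, here is a minimal PyTorch sketch (module names and sizes are illustrative, not from the lecture): one shared trunk with a lightweight head per task. Soft sharing would instead give each task its own trunk and penalize the distance between their weights.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Hard sharing: one shared trunk, one small head per task."""
    def __init__(self, in_dim, hidden_dim, out_dims):
        super().__init__()
        self.trunk = nn.Sequential(            # parameters shared by all tasks
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(            # task-specific parameters
            [nn.Linear(hidden_dim, d) for d in out_dims]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

# Two tasks sharing one trunk; in training, per-task losses are summed
# before a single update, so the trunk learns the shared structure.
model = HardSharingNet(in_dim=16, hidden_dim=64, out_dims=[10, 3])
x = torch.randn(8, 16)
y0, y1 = model(x, task_id=0), model(x, task_id=1)
```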
Summary
Short summary of multi-task learning
Train multiple tasks together
No tradeoff between specificity and generality
Less overfitting
Higher accuracy
Meta learning
Learning to Learn
Who can figure it out?
4 🍕 6 = ?
3 🍕 5 = 18
1 🍕 2 = 3
2 🍕 3 = 8
6 🍕 1 = 12
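For the reader following along: one rule consistent with all four examples is a 🍕 b = a × (b + 1) = a·b + a, which gives 4 🍕 6 = 28.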
Definitions
Learning to Learn with Gradients
Finn, Chelsea (2018)
Mathematically
Given data from 𝒯1, ..., 𝒯n, quickly solve new task 𝒯test
Supervised learning: min_φ ℒ(φ, 𝒟)
Meta-learning: min_θ Σᵢ ℒ(f_θ(𝒟ᵢ^support), 𝒟ᵢ^query)
Intuitively
Find the set of parameters θ such that new tasks can be learned quickly.
Support & Query
Support
Query
[Figure: support set — a few labeled examples per class (1, 2); query set — unlabeled examples (?) to classify]
Black-box meta learning
A concrete example using RNNs
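A minimal sketch of that idea (illustrative names and shapes, PyTorch): the support pairs (x, y) are fed through an LSTM as a sequence, then the query x with a zero placeholder label; the hidden state acts as the learned task representation.

```python
import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """Feed support (x, y) pairs through an LSTM, then predict on the query x."""
    def __init__(self, x_dim, y_dim, hidden):
        super().__init__()
        self.rnn = nn.LSTM(x_dim + y_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, y_dim)

    def forward(self, xs, ys, x_query):
        B, K, _ = xs.shape
        support = torch.cat([xs, ys], dim=-1)                  # (B, K, x+y)
        zeros = torch.zeros(B, 1, ys.shape[-1])                # unknown label
        query = torch.cat([x_query.unsqueeze(1), zeros], -1)   # (B, 1, x+y)
        seq = torch.cat([support, query], dim=1)               # (B, K+1, x+y)
        h, _ = self.rnn(seq)                                   # task "learned" in state
        return self.out(h[:, -1])                              # prediction for query

model = BlackBoxMetaLearner(x_dim=4, y_dim=2, hidden=64)
pred = model(torch.randn(8, 5, 4), torch.randn(8, 5, 2), torch.randn(8, 4))
```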
Optimisation-based meta learning
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Finn, Chelsea et al. (2017)
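A minimal MAML sketch on the classic sinusoid-regression toy problem (hyperparameters and names are illustrative; assumes PyTorch ≥ 2.0 for `torch.func.functional_call`): one inner gradient step per task, then an outer update of θ through that step.

```python
import torch
import torch.nn as nn

def inner_adapt(model, loss_fn, x_s, y_s, lr_inner=0.01):
    # One SGD step on the support set; create_graph keeps second-order grads.
    loss = loss_fn(model(x_s), y_s)
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    return {n: p - lr_inner * g
            for (n, p), g in zip(model.named_parameters(), grads)}

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):                       # outer loop (meta-training)
    opt.zero_grad()
    for _ in range(4):                         # tasks per meta-batch
        # Sample a sinusoid task and its support/query sets.
        amp, phase = torch.rand(1) * 4 + 1, torch.rand(1) * 3.14
        x_s, x_q = torch.rand(10, 1) * 10 - 5, torch.rand(10, 1) * 10 - 5
        y_s, y_q = amp * torch.sin(x_s + phase), amp * torch.sin(x_q + phase)
        fast_weights = inner_adapt(model, loss_fn, x_s, y_s)       # φᵢ
        y_pred = torch.func.functional_call(model, fast_weights, (x_q,))
        (loss_fn(y_pred, y_q) / 4).backward()  # query loss drives the θ update
    opt.step()
```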
Quiz Time
Which marker is the best candidate for θ?
Summary
What kind of learning algorithms have we discussed?
Transfer learning
Solve 𝒯b by transferring knowledge from 𝒯a
Multi-task learning
Solve multiple tasks 𝒯1, 𝒯2, … , 𝒯n at once
Meta-learning
Given data from 𝒯1, 𝒯2, … , 𝒯n, quickly solve 𝒯test
Questions?
Reinforcement Learning
Multi-Task RL
Why multi-task?
Cross-task generalization → performance increases for all tasks
Easier exploration → tasks share knowledge
Sequencing for long-horizon tasks → long tasks can be split into easier sub-tasks
Reset-free learning → no intervention needed; generalize to different starting states and goals
Per-task sample-efficiency gains → fewer examples per task needed
Multi-Task RL
Task Specification
A task can be defined by
New state/action space
Different dynamics
Different reward function
(Optional) task identifier inside the state
→ One-hot vector
→ Language description
→ Goal state ⇒ Goal-conditioned RL
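A tiny sketch of the one-hot variant (all sizes illustrative): the task identifier is simply appended to the observation, so one policy network serves every task. Goal-conditioned RL replaces the one-hot with a goal vector g of fixed size.

```python
import torch
import torch.nn as nn

n_tasks, obs_dim, act_dim = 5, 12, 4
policy = nn.Sequential(nn.Linear(obs_dim + n_tasks, 64), nn.ReLU(),
                       nn.Linear(64, act_dim))

obs = torch.randn(obs_dim)
one_hot = nn.functional.one_hot(torch.tensor(3), n_tasks).float()  # task 3
action = policy(torch.cat([obs, one_hot]))  # same network, task-aware input
```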
Multi-Task RL
Hindsight Relabeling, or Hindsight Experience Replay (HER)
Hindsight experience replay. Andrychowicz, Marcin, et al. (2017)
Goal-Conditioned RL
Pretend the achieved goal is what we wanted to do, even if the real goal was not reached
Use any state as the goal of a trajectory, not just the final one
→ Many more samples = more exploration
Always-optimal data
→ The goal is always reached after relabeling
Use an unstructured dataset
→ Human play
→ Turn into a supervised dataset
→ Imitation Learning
Actionable models: Unsupervised offline reinforcement learning of robotic skills
Chebotar, Yevgen, et al. (2021)
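A minimal sketch of the relabeling step (illustrative names; sparse goal-reaching reward): for each transition, a handful of later states from the same trajectory are reused as goals, so every relabeled sample is optimal for its goal.

```python
import random

def relabel_with_hindsight(trajectory, k=4):
    """trajectory: list of (state, action, next_state) tuples.
    Returns (state, action, goal, reward) samples where each goal is a state
    actually reached later in the trajectory."""
    samples = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Use any future state as the goal, not just the final one.
        futures = random.sample(range(t, len(trajectory)),
                                min(k, len(trajectory) - t))
        for f in futures:
            goal = trajectory[f][2]                  # an achieved state
            reward = 1.0 if s_next == goal else 0.0  # sparse goal-reaching reward
            samples.append((s, a, goal, reward))
    return samples
```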
Meta-RL
Problem Statement
Inputs
Current state sₜ
K previous timesteps or rollouts from the policy
Outputs
Action aₜ
Goal
Given a small amount of experience, adapt to the new task
Meta RL example
Meta Train
Meta Test
Meta-RL
Black-Box Meta-RL
Training
~ Inherits sample efficiency from RL optimizer
Meta-RL
Optimisation-Based Meta-RL
Training
Learning to Adapt in Dynamic Environments through Meta-RL
Nagabandi*, Clavera*, Liu, Fearing, Abbeel, Levine, Finn. (2019)
~ Inherits sample efficiency from outer RL optimizer
Questions?
Language Models
Language Prediction
Improving Language Understanding by Generative Pre-Training
Radford, Alec et al. (2018)
The Eiffel tower is in Paris
xᵢ | yᵢ |
The Eiffel tower is in | Paris |
The Eiffel tower is | in |
The Eiffel tower | is |
The Eiffel | tower |
The | Eiffel |
| The |
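The table above can be generated mechanically; a toy word-level version (real models split text into subword tokens):

```python
sentence = "The Eiffel tower is in Paris".split()
# Every prefix predicts the next word, from the empty prefix onward.
pairs = [(" ".join(sentence[:i]), sentence[i]) for i in range(len(sentence))]
for x, y in pairs:
    print(f"{x!r} -> {y!r}")
# ('', 'The'), ('The', 'Eiffel'), ..., ('The Eiffel tower is in', 'Paris')
```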
Transformers
Improving Language Understanding by Generative Pre-Training
Radford, Alec et al. (2018)
Sequential data
A stream of tokens
Can vary in length
Three main components
Positional encoding
Attention mechanism
Position-wise FFN
(attention + FFN blocks stacked ×N)
Attention
Encoders & Decoders
Attention is all you need
Vaswani, Ashish et al. (2017)
Decoders → look at previous tokens
Useful for text generation
Does not leak future tokens
Encoders → look at all tokens
Useful for text classification (sentiment analysis)
Global Attention
Masked Attention
Cross-Attention
Attention is all you need
Vaswani, Ashish et al. (2017)
Self-attention
Query the data itself
Keys & values from original tokens
Cross-attention
Query other keys and values
Keys & values source can vary
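A compact sketch of both variants (the learned Q/K/V projection matrices are omitted for brevity): the only difference is where K and V come from, plus the causal mask for decoders.

```python
import math
import torch

def attention(q, k, v, causal=False):
    """Scaled dot-product attention: softmax(QKᵀ/√d)V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if causal:  # masked attention: a token may not look at future tokens
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

x = torch.randn(10, 64)   # decoder tokens
z = torch.randn(7, 64)    # e.g. encoder outputs
self_attn  = attention(x, x, x, causal=True)  # masked self-attention
cross_attn = attention(x, z, z)               # Q from x, K & V from z
```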
The Full Transformer
Attention is all you need
Vaswani, Ashish et al. (2017)
Foundation models
PaLM: Scaling Language Modeling with Pathways�Chowdhery, Aakanksha et al. (2022)
Foundation Model Overview
Name | Company | Release Date | Number of Parameters | Corpus Size (tokens) |
GPT-2 | OpenAI | 2019 | 1.5 billion | 10 billion |
GPT-3 | OpenAI | 2020 | 175 billion | 300 billion |
PaLM | Google | 2022 | 540 billion | 768 billion |
GPT-4 | OpenAI | 2023 | ~ 1 trillion* | unknown |
Chinchilla | DeepMind | 2022 | 70 billion | 1.4 trillion |
Llama | Meta | 2023 | 65 billion | 1.4 trillion |
PaLM 2 | Google | 2023 | 340 billion | 3.5 trillion |
*not actually known
Emergent Abilities
Emergent Abilities of Large Language Models
Wei, Jason et al. (2022)
GPT-4 Technical Report
OpenAI (2023)
Few-Shot Abilities
Language Models are Few-Shot Learners�Brown, Tom et al. (2020)
Fine-tuned Language Models are Zero-Shot Learners
Wei, Jason et al. (2022)
In-Context Learning (ICL)
“Wish You Were Here: 1975, The Dark Side of the Moon:”
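As a concrete toy prompt (the release years are real; whether a given model completes it correctly depends on the model):

```python
prompt = (
    "The Wall: 1979\n"
    "Wish You Were Here: 1975\n"
    "The Dark Side of the Moon:"
)
# A capable LM continues with " 1973": the task (album -> release year) is
# inferred from the in-context examples alone, with no parameter update.
```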
Attention as a Meta-Optimiser
A Survey on In-context Learning
Dong, Qingxiu et al. (2023)
Language Models Implicitly Perform Gradient Descent as
Meta-Optimizers
Dai, Damai et al. (2023)
Attention modifies the latent space much like fine-tuning (FT) does
Latent variables show high similarity between ICL and FT
Fine-Tuning
Training language models to follow instructions with human feedback
Ouyang, Long et al. (2022)��BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, Jacob et al. (2018)
Transfer knowledge to specific domains
→ e.g. ask open-ended questions about a set of documents
Models retain their understanding of language
→ Knowledge of idioms
Models retain few-shot capabilities and ICL
→ Respond correctly to never-seen inputs
→ Tolerance to slight distribution shift
Language Conditioning
Goal
Interactive Language: Talking to Robots in Real Time
Lynch, Corey et al. (2022)
Guide robot using natural language
Complete long-term objectives
Give short-term instructions
Rectify mistakes
Open vocabulary
Dataset Gathering
Play dataset
Human operators
Optional objective prompts
Explore the environment as much as possible
Actions are better than random
Prompts discarded ⇒ Unlabeled dataset
Interactive Language: Talking to Robots in Real Time
Lynch, Corey et al. (2022)
Dataset Gathering
Event-selectable Hindsight Relabeling
Select a few-second fragment in the replay
→ Better than using random windows
Add natural language annotation
→ Open vocabulary
Annotations ⇒ Labeled dataset
Interactive Language: Talking to Robots in Real Time
Lynch, Corey et al. (2022)
Imitation Learning
Always-optimal labeled dataset
Use simple objectives
Base imitation learning objective
Condition on language
max_θ 𝔼_(s, a, l)∼𝒟 [log π_θ(a | s, l)]
Interactive Language: Talking to Robots in Real Time
Lynch, Corey et al. (2022)
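A minimal sketch of one training step (illustrative stand-in encoders, and an MSE loss in place of the paper's action log-likelihood): standard supervised learning on the relabeled, always-optimal data.

```python
import torch
import torch.nn as nn

vision_enc = nn.Linear(512, 128)   # stand-ins for the ResNet / CLIP encoders
text_enc   = nn.Linear(384, 128)
policy     = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 8))
opt = torch.optim.Adam([*vision_enc.parameters(), *text_enc.parameters(),
                        *policy.parameters()], lr=1e-4)

def bc_step(img_feat, lang_feat, expert_action):
    """Supervised step: imitate the relabeled, always-optimal action."""
    z = torch.cat([vision_enc(img_feat), text_enc(lang_feat)], dim=-1)
    loss = nn.functional.mse_loss(policy(z), expert_action)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = bc_step(torch.randn(32, 512), torch.randn(32, 384), torch.randn(32, 8))
```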
Architecture: LAVA (Language Attends to Vision to Act)
Image encoding
Pre-trained ResNet + learned convolutions
Text encoding
Fine-tuned CLIP encoder
(image, language) → (video, language)
Relating text with images (CLIP)
Learning Transferable Visual Models From Natural Language Supervision
Radford, Alec et al. (2021)
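The core of CLIP's contrastive objective, as a sketch (encoders elided; the temperature is a learned parameter in the real model): matched (image, text) pairs lie on the diagonal of the similarity matrix and are pulled together from both directions.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Matched (image, text) pairs sit on the diagonal of the logits."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (N, N) cosine similarities
    labels = torch.arange(len(img))        # i-th image matches i-th text
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = clip_loss(torch.randn(16, 256), torch.randn(16, 256))
```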
Architecture: LAVA (Language Attends to Vision to Act)
Decoder-only Transformer
Cross-attention
Multi-Layer Perceptron
Output
Vision-language embedding for a single timestep
Architecture: LAVA (Language Attends to Vision to Act)
Encoder-only Transformer
Self-attention
Multi-Layer Perceptron
Input
Sequence of vision-language embeddings
Architecture: LAVA (Language Attends to Vision to Act)
Policy
MLP blocks with residual connections
Outputs action
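Putting the three LAVA slides together, a runnable skeleton of the data flow (all shapes, pooling choices, and module sizes are guesses for illustration, not the paper's implementation):

```python
import torch
import torch.nn as nn

d = 128
cross = nn.MultiheadAttention(d, num_heads=4, batch_first=True)   # lang -> vision
temporal = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
policy = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 8))

def lava_forward(vision_tokens, lang_tokens):
    """vision_tokens: (T, P, d) per-frame patch features; lang_tokens: (L, d)."""
    T = vision_tokens.shape[0]
    q = lang_tokens.unsqueeze(0).expand(T, -1, -1)     # language queries...
    fused, _ = cross(q, vision_tokens, vision_tokens)  # ...attend to vision
    per_step = fused.mean(dim=1)                       # (T, d): one per frame
    context = temporal(per_step.unsqueeze(0))          # self-attn over time
    return policy(context[0, -1])                      # action for last step

act = lava_forward(torch.randn(6, 49, d), torch.randn(12, d))
```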
Demonstrations
Summary
What do we learn from this?
Language embeddings are useful task descriptors
Progress in NLP and CV provides powerful architectures
Another use case for attention/transformers
PaLM-E
Goal
Visual-language embodied robot
Grounding: connect LLM representations to the real world
Interleave multi-modal inputs in the same input embedding space
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny et al. (2023)
Architecture
Input
Language, images, sensor readings, …
Encoders embed all inputs into same space
Output
Language
Can be used to give short-term instructions to language-conditioned policies
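A sketch of the interleaving idea (vocabulary size, feature dimensions, and the projection are illustrative; the real PaLM-E projects ViT features into PaLM's token-embedding space): continuous image features are spliced between ordinary word embeddings, and the unchanged decoder-only LLM consumes one mixed sequence.

```python
import torch
import torch.nn as nn

d_model = 512
word_emb = nn.Embedding(32000, d_model)   # the LLM's own token embeddings
img_proj = nn.Linear(768, d_model)        # maps image features into that space

def build_input(prefix_ids, image_feats, suffix_ids):
    """'Q: what is in <img>?' -> [word embs][projected image embs][word embs]."""
    return torch.cat([word_emb(prefix_ids),
                      img_proj(image_feats),   # image treated as ordinary tokens
                      word_emb(suffix_ids)], dim=0)

seq = build_input(torch.tensor([5, 17, 99]),  # hypothetical token ids
                  torch.randn(16, 768),       # 16 image patch features
                  torch.tensor([7]))
# seq: (3 + 16 + 1, d_model) — fed to the decoder-only LLM as one sequence.
```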
Example Tasks: Visual QA
Task | Given | Target |
Object-table relation | <img>. Q: Is the red object left, right, or center of the table? | A: The red object is in the center of the table. |
Object-object relation | <img>. Q: Is the yellow object below the blue object? | A: No, the yellow object is not below the blue object. |
Plan feasibility | <img>. Q: Is it possible to first grasp the blue object, then place it on the yellow object, and then grasp the yellow object? | A: No, this is not possible. |
Example Tasks: Planning
Task | Given | Target |
Grasping | <img>. Q: How to grasp the green object? | A: First grasp the orange object and place it on the table, then grasp the green object. |
Stacking | <img>. Q: How to stack the white object on top of the red object? | A: First grasp the green object and place it on the table, then grasp the white object and place it on the red object. |
Positive Transfer
Model benefits from multi-task learning
Training on language tasks helps control
Only a few robotics examples are needed
Large models preserve NLP capabilities
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny et al. (2023)
Demos
Questions?