1 of 22

Considering The Role of Language in Embodied Systems

Jesse Thomason

http://glamor.rocks/

2 of 22

Why Language? A Draft Manifesto

  • Learning systems that are embodied can share physical space with the people interacting with them
    • Robots, tabletop devices, camera arrays (sort of), …
  • AI systems that interact with physical spaces shared by people should interpret both explicit and implicit feedback from those people in order to:
    • 1) establish goals;
    • 2) respect constraints; and
    • 3) seek missing information

3 of 22

Interactive Learning with Explicit Human Feedback

  • Establish goals ✅
  • Respect constraints ❌
  • Seek missing information ✅

4 of 22

Interactive Learning with Explicit Human Feedback

  • Semantic parsing
  • Slot-based, information-maximizing dialogue policy
  • Words-as-classifiers (see the sketch after this list)
  • Neural networks?
    • Vision features
    • Synonym hypotheses
  • GOFAI (good old-fashioned AI)
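As a concrete illustration of the words-as-classifiers idea from this list: each word gets its own binary classifier over an object’s perceptual features, and a referring expression scores candidate objects by combining the per-word classifiers. This is a minimal sketch with hypothetical feature vectors and scikit-learn, not the original system from this line of work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class WordClassifier:
    """One binary classifier per word over perceptual features."""
    def __init__(self):
        self.clf = LogisticRegression()

    def fit(self, features, labels):
        # features: (n_examples, n_dims) perceptual vectors (e.g., color, shape)
        # labels: 1 if the word was used to describe the object, else 0
        self.clf.fit(features, labels)
        return self

    def prob(self, feats):
        # Probability that this word applies to an object with these features.
        return self.clf.predict_proba(np.asarray(feats).reshape(1, -1))[0, 1]

def ground(expression, candidates, lexicon):
    # Score each candidate object as the product of per-word probabilities,
    # skipping words with no learned classifier; return the best candidate.
    def score(feats):
        probs = [lexicon[w].prob(feats) for w in expression.split() if w in lexicon]
        return float(np.prod(probs)) if probs else 0.0
    return max(candidates, key=lambda name: score(candidates[name]))
```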

5 of 22

Enter: Large Pretrained Language and Vision Models

6 of 22

Large Pretrained Models for Robotics Today

  • Establish goals ✅
  • Respect constraints ✅
  • Seek missing information ❌
  • LPTMs have made establishing goals using language far more straightforward than was possible a few years ago
  • Huge leaps over closed vocabularies and object lists and over slot-based goal specifications, with gains even in specifying constraints

7 of 22

What’s Missing?

  • AI systems that interact with physical spaces shared by people should interpret both explicit and implicit feedback from those people in order to: 1) establish goals; 2) respect constraints; and 3) seek missing information
  • Feedback can’t just be typing; that’s nuts. We need to understand speech, gaze, gesture, and more from people
  • Interpretability of system behavior may require exposing symbolic reasoning and readable plans
  • Systems should actively seek information from people

8 of 22

Grounding Language in Actions, Multimodal Observations, and Robots (GLAMOR ✨) Lab

  • We have been tackling a few open problems related to these new challenges for robots learning from human feedback
  • Two I’ll highlight:
    • Improving speech recognition using physical context
    • Combining large language models and symbolic planning

9 of 22

Improving Speech Recognition with Context

  • Feedback can’t just be typing; that’s nuts. We need to understand speech, gaze, gesture, and more from people
  • Automatic Speech Recognition keeps getting better, but it is trained on massive corpora of clean audio like podcasts
  • A robot hears speech through a microphone attached to its own noisy, whirring body, from a person who may be across the room, amid task-related background noise
    • Warehouse floor, manufacturing, kitchen appliances, …
  • But we speak with respect to the current world context (see the rescoring sketch below)!
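One hedged sketch of how that context could be used, not the lab’s actual model: rescore the recognizer’s n-best hypotheses by how well their words match the objects the robot currently perceives. The interpolation weight and scene vocabulary below are illustrative assumptions.

```python
# Rescore n-best ASR hypotheses with a simple physical-context bonus.
def rescore(nbest, scene_objects, alpha=0.5):
    # nbest: list of (hypothesis_text, acoustic_log_prob) from the recognizer
    # scene_objects: set of object names the robot currently perceives
    def context_score(text):
        words = text.lower().split()
        hits = sum(1 for w in words if w in scene_objects)
        return hits / max(len(words), 1)  # fraction of words grounded in scene
    # Interpolate acoustic confidence with context compatibility.
    return max(nbest, key=lambda h: (1 - alpha) * h[1] + alpha * context_score(h[0]))

# Example: the noisy acoustic model slightly prefers "whisk", but the scene
# contains a fork, so contextual rescoring picks the grounded hypothesis.
best = rescore(
    nbest=[("pick up the whisk", -1.1), ("pick up the fork", -1.3)],
    scene_objects={"fork", "bowl", "counter"},
)
print(best[0])  # -> "pick up the fork"
```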

10 of 22

Improving Speech Recognition with Context

11 of 22

Improving Speech Recognition with Context

12 of 22

Prompting for Robot Control

  • Interpretability of system behavior may require exposing symbolic reasoning and readable plans
  • Large pretrained language models can sort of behave like planners, generating readable “code”
  • A lot of folks had this realization at the same time
    • For example, Google’s SayCan and Code As Policies
  • We took an approach based on designing prompts that look like code, given that LLM training data covers GitHub and the like (see the sketch below)
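A minimal sketch of what such a code-style prompt can look like. The skill stubs and the `llm_complete` callable are hypothetical stand-ins, not a real API or the exact prompt format from this work: skills appear as Python function stubs, a worked example shows the plan format, and the LLM completes a new function body.

```python
# Hypothetical skill stubs plus one worked example for the LLM to imitate.
SKILL_STUBS = '''\
def grab(obj): ...
def put_on(obj, surface): ...
def open_(receptacle): ...

# Worked example the model can imitate:
def throw_away_apple():
    open_("trash_can")
    grab("apple")
    put_on("apple", "trash_can")
'''

def make_prompt(task_name):
    # Present the new task as an unfinished function; the LLM's natural
    # continuation is a Pythonic plan written in terms of the stubbed skills.
    return SKILL_STUBS + f"\n# Task: {task_name}\ndef {task_name}():\n"

def plan(task_name, llm_complete):
    # llm_complete: any callable taking a prompt string and returning text.
    return llm_complete(make_prompt(task_name))
```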

13 of 22

Prompting for Robot Control

  • Out of the box, LLMs generate actions robots can’t perform with objects that aren’t around
  • Pythonic prompts can specify what’s available (sketched below)
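Concretely, both halves of that idea might look like the following, under illustrative assumptions (the skill set and the parsing are simplified stand-ins): embed the perceived object list in the prompt, then reject generated steps whose skill or arguments aren’t actually available.

```python
import re

# Hypothetical skill set; in practice this mirrors the stubs in the prompt.
AVAILABLE_SKILLS = {"grab", "put_on", "open_"}

def objects_header(scene_objects):
    # Embedding the perceived object list in the prompt steers the LLM
    # away from referencing objects that aren't around.
    return f"objects = {sorted(scene_objects)}\n"

def validate(plan_text, scene_objects):
    # Keep only steps of the form skill("obj", ...) whose skill exists and
    # whose arguments name objects the robot can actually see.
    valid = []
    for line in plan_text.splitlines():
        m = re.match(r'\s*(\w+)\((.*)\)\s*$', line)
        if not m:
            continue
        skill, args = m.group(1), re.findall(r'"([^"]+)"', m.group(2))
        if skill in AVAILABLE_SKILLS and all(a in scene_objects for a in args):
            valid.append((skill, args))
    return valid

# Example: the hallucinated "teleport" step and the absent "banana" are dropped.
steps = validate('grab("apple")\nteleport("kitchen")\ngrab("banana")',
                 scene_objects={"apple", "trash_can"})
# -> [("grab", ["apple"])]
```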

14 of 22

Prompting for Robot Control

  • Used to control an agent in a virtual environment
  • Perform new tasks in a zero-shot setting

15 of 22

Prompting for Robot Control

  • Also enables generating pick-and-place robot plans!

16 of 22

Language as A Medium for Human Feedback

  • Recent work in the space of RoboNLP has gotten exceptionally good at establishing goals for robots, and there is progress on communicating and respecting constraints
  • There is a rich space to explore in seeking missing information during closed-loop planning (see the sketch after this list)
  • Take inspiration from human-human problem solving!
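As a toy illustration of what that information seeking could look like in the loop: when a referring expression matches zero or several perceived objects, the agent asks rather than guessing. This is an illustrative sketch only; `ask_human` is a hypothetical callback (speech interface, GUI, etc.), not part of any released system.

```python
def resolve_referent(noun, scene_objects, ask_human):
    # Find perceived objects whose names contain the mentioned noun.
    matches = [o for o in sorted(scene_objects) if noun in o]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Nothing matches: report the gap and ask for help.
        return ask_human(f"I don't see a {noun}. Which object did you mean? ")
    # Several candidates: ask a clarifying question instead of guessing.
    return ask_human(f"I see {', '.join(matches)}. Which {noun} should I use? ")

# Example: "grab the mug" with two mugs on the table triggers a question.
scene = {"red_mug", "blue_mug", "bowl"}
target = resolve_referent("mug", scene, ask_human=input)
```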

17 of 22

Language as A Medium for Human Feedback

18 of 22

Considering The Role of Language in Embodied Systems

Jesse Thomason

http://glamor.rocks/

Slides: https://jessethomason.com/ -> News

19 of 22

Gaze Tracking and Language

20 of 22

Multimodal Continual Learning

  • Pretrained backbones are hard (sometimes impossible) to fine-tune; we need ways to adapt them to new contexts
  • Considering explicit and implicit feedback, modalities may appear and disappear depending on context and task
    • What happens to a model trained on speech and gaze when someone turns away?
  • Can we adapt models to both new tasks and the presence or absence of input modalities? (A toy fusion sketch follows.)
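To make the question concrete, here is a minimal PyTorch fusion sketch, purely illustrative and not the benchmark’s reference model: each modality gets its own encoder plus a learned “absent” embedding that stands in whenever that input disappears (e.g., gaze is lost when the person turns away).

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, dims, hidden=128):
        super().__init__()
        # dims: per-modality raw feature sizes, e.g. {"speech": 80, "gaze": 4}
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(d, hidden) for m, d in dims.items()})
        # One learned placeholder per modality, used when its input is missing.
        self.absent = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(hidden)) for m in dims})
        self.head = nn.Linear(hidden * len(dims), hidden)

    def forward(self, inputs):
        # inputs: {"speech": tensor of shape (1, d) or None, "gaze": ...}
        parts = []
        for m, enc in self.encoders.items():
            x = inputs.get(m)
            if x is None:
                # Modality dropped out: substitute its learned placeholder.
                parts.append(self.absent[m].expand(1, -1))
            else:
                parts.append(enc(x))
        return self.head(torch.cat(parts, dim=-1))

# Example: gaze disappears mid-stream, but the forward pass still runs.
model = ModalityFusion({"speech": 80, "gaze": 4})
out = model({"speech": torch.randn(1, 80), "gaze": None})
```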

21 of 22

Multimodal Continual Learning

22 of 22

Multimodal Continual Learning

  • Upshot: existing continual learning approaches are terrible at downstream transfer when a modality disappears
  • If you have an idea of how to do a better job, please check out the benchmark!