1 of 22

Considering The Role of Language in Embodied Systems

Jesse Thomason

http://glamor.rocks/

2 of 22

Why Language? A Draft Manifesto

  • Learning systems that are embodied can share physical space with the people interacting with them
    • Robots, tabletop devices, camera arrays (sort of), …
  • AI systems that interact with physical spaces shared by people should interpret both explicit and implicit feedback from those people in order to:
    • 1) establish goals;
    • 2) respect constraints; and
    • 3) seek missing information

3 of 22

Interactive Learning with Explicit Human Feedback

  • Establish goals ✅
  • Respect constraints ❌
  • Seek missing information ✅

4 of 22

Interactive Learning with Explicit Human Feedback

  • Semantic parsing
  • Slot-based, information-maximizing dialogue policy
  • Words-as-classifiers (see the sketch after this list)
  • Neural networks?
    • Vision features
    • Synonym hypotheses
  • GOFAI (good old-fashioned AI)
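As a concrete illustration of the words-as-classifiers idea from this list: each word gets its own binary classifier over an object’s perceptual features, and a referring expression scores candidate objects by combining the per-word classifiers. This is a minimal sketch with hypothetical feature vectors and scikit-learn, not the original system from this line of work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class WordClassifier:
    """One binary classifier per word over perceptual features."""
    def __init__(self):
        self.clf = LogisticRegression()

    def fit(self, features, labels):
        # features: (n_examples, n_dims) perceptual vectors (e.g., color, shape)
        # labels: 1 if the word was used to describe the object, else 0
        self.clf.fit(features, labels)
        return self

    def prob(self, feats):
        # Probability that this word applies to an object with these features.
        return self.clf.predict_proba(np.asarray(feats).reshape(1, -1))[0, 1]

def ground(expression, candidates, lexicon):
    # Score each candidate object as the product of per-word probabilities,
    # skipping words with no learned classifier; return the best candidate.
    def score(feats):
        probs = [lexicon[w].prob(feats) for w in expression.split() if w in lexicon]
        return float(np.prod(probs)) if probs else 0.0
    return max(candidates, key=lambda name: score(candidates[name]))
```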

5 of 22

Enter: Large Pretrained Language and Vision Models

6 of 22

Large Pretrained Models for Robotics Today

  • Establish goals ✅
  • Respect constraints ✅
  • Seek missing information ❌
  • LPTMs have made establishing goals using language far more straightforward than was possible a few years ago
  • Huge leaps over closed vocabularies and object lists and over slot-based goal specifications, with gains even in specifying constraints

7 of 22

What’s Missing?

  • AI systems that interact with physical spaces shared by people should interpret both explicit and implicit feedback from those people in order to: 1) establish goals; 2) respect constraints; and 3) seek missing information
  • Feedback can’t just be typing; that’s nuts. We need to understand speech, gaze, gesture, and more from people
  • Interpretability of system behavior may require exposing symbolic reasoning and readable plans
  • Systems should actively seek information from people

8 of 22

Grounding Language in Actions, Multimodal Observations, and Robots (GLAMOR ✨) Lab

  • We have been tackling a few open problems related to these new challenges for robots learning from human feedback
  • Two I’ll highlight:
    • Improving speech recognition using physical context
    • Combining large language models and symbolic planning

9 of 22

Improving Speech Recognition with Context

  • Feedback can’t just be typing; that’s nuts. We need to understand speech, gaze, gesture, and more from people
  • Automatic Speech Recognition keeps getting better, but it is trained on massive corpora of clean audio like podcasts
  • A robot hears speech through a microphone attached to its own noisy, whirring body, from a person who may be across the room, amid task-related background noise
    • Warehouse floor, manufacturing, kitchen appliances, …
  • But we speak with respect to the current world context (see the rescoring sketch below)!
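One hedged sketch of how that context could be used, not the lab’s actual model: rescore the recognizer’s n-best hypotheses by how well their words match the objects the robot currently perceives. The interpolation weight and scene vocabulary below are illustrative assumptions.

```python
# Rescore n-best ASR hypotheses with a simple physical-context bonus.
def rescore(nbest, scene_objects, alpha=0.5):
    # nbest: list of (hypothesis_text, acoustic_log_prob) from the recognizer
    # scene_objects: set of object names the robot currently perceives
    def context_score(text):
        words = text.lower().split()
        hits = sum(1 for w in words if w in scene_objects)
        return hits / max(len(words), 1)  # fraction of words grounded in scene
    # Interpolate acoustic confidence with context compatibility.
    return max(nbest, key=lambda h: (1 - alpha) * h[1] + alpha * context_score(h[0]))

# Example: the noisy acoustic model slightly prefers "whisk", but the scene
# contains a fork, so contextual rescoring picks the grounded hypothesis.
best = rescore(
    nbest=[("pick up the whisk", -1.1), ("pick up the fork", -1.3)],
    scene_objects={"fork", "bowl", "counter"},
)
print(best[0])  # -> "pick up the fork"
```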

10 of 22

Improving Speech Recognition with Context

11 of 22

Improving Speech Recognition with Context

12 of 22

Prompting for Robot Control

  • Interpretability of system behavior may require exposing symbolic reasoning and readable plans
  • Large pretrained language models can sort of behave like planners, generating readable “code”
  • A lot of folks had this realization at the same time
    • For example, Google’s SayCan and Code As Policies
  • We took an approach based on designing prompts that look like code, given that LLM training data covers GitHub and the like (see the sketch below)
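A minimal sketch of what such a code-style prompt can look like. The skill stubs and the `llm_complete` callable are hypothetical stand-ins, not a real API or the exact prompt format from this work: skills appear as Python function stubs, a worked example shows the plan format, and the LLM completes a new function body.

```python
# Hypothetical skill stubs plus one worked example for the LLM to imitate.
SKILL_STUBS = '''\
def grab(obj): ...
def put_on(obj, surface): ...
def open_(receptacle): ...

# Worked example the model can imitate:
def throw_away_apple():
    open_("trash_can")
    grab("apple")
    put_on("apple", "trash_can")
'''

def make_prompt(task_name):
    # Present the new task as an unfinished function; the LLM's natural
    # continuation is a Pythonic plan written in terms of the stubbed skills.
    return SKILL_STUBS + f"\n# Task: {task_name}\ndef {task_name}():\n"

def plan(task_name, llm_complete):
    # llm_complete: any callable taking a prompt string and returning text.
    return llm_complete(make_prompt(task_name))
```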

13 of 22

Prompting for Robot Control

  • Out of the box, LLMs generate actions robots can’t perform with objects that aren’t around
  • Pythonic prompts can specify what’s available (sketched below)
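Concretely, both halves of that idea might look like the following, under illustrative assumptions (the skill set and the parsing are simplified stand-ins): embed the perceived object list in the prompt, then reject generated steps whose skill or arguments aren’t actually available.

```python
import re

# Hypothetical skill set; in practice this mirrors the stubs in the prompt.
AVAILABLE_SKILLS = {"grab", "put_on", "open_"}

def objects_header(scene_objects):
    # Embedding the perceived object list in the prompt steers the LLM
    # away from referencing objects that aren't around.
    return f"objects = {sorted(scene_objects)}\n"

def validate(plan_text, scene_objects):
    # Keep only steps of the form skill("obj", ...) whose skill exists and
    # whose arguments name objects the robot can actually see.
    valid = []
    for line in plan_text.splitlines():
        m = re.match(r'\s*(\w+)\((.*)\)\s*$', line)
        if not m:
            continue
        skill, args = m.group(1), re.findall(r'"([^"]+)"', m.group(2))
        if skill in AVAILABLE_SKILLS and all(a in scene_objects for a in args):
            valid.append((skill, args))
    return valid

# Example: the hallucinated "teleport" step and the absent "banana" are dropped.
steps = validate('grab("apple")\nteleport("kitchen")\ngrab("banana")',
                 scene_objects={"apple", "trash_can"})
# -> [("grab", ["apple"])]
```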

14 of 22

Prompting for Robot Control

  • Used to control an agent in a virtual environment
  • Perform new tasks in a zero-shot setting

15 of 22

Prompting for Robot Control

  • Also enables generating pick-and-place robot plans!

16 of 22

Language as A Medium for Human Feedback

  • Recent work in the space of RoboNLP has gotten exceptionally good at establishing goals for robots, and there is progress on communicating and respecting constraints
  • There is a rich space to explore in seeking missing information during closed-loop planning (see the sketch after this list)
  • Take inspiration from human-human problem solving!
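As a toy illustration of what that information seeking could look like in the loop: when a referring expression matches zero or several perceived objects, the agent asks rather than guessing. This is an illustrative sketch only; `ask_human` is a hypothetical callback (speech interface, GUI, etc.), not part of any released system.

```python
def resolve_referent(noun, scene_objects, ask_human):
    # Find perceived objects whose names contain the mentioned noun.
    matches = [o for o in sorted(scene_objects) if noun in o]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Nothing matches: report the gap and ask for help.
        return ask_human(f"I don't see a {noun}. Which object did you mean? ")
    # Several candidates: ask a clarifying question instead of guessing.
    return ask_human(f"I see {', '.join(matches)}. Which {noun} should I use? ")

# Example: "grab the mug" with two mugs on the table triggers a question.
scene = {"red_mug", "blue_mug", "bowl"}
target = resolve_referent("mug", scene, ask_human=input)
```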

17 of 22

Language as A Medium for Human Feedback

18 of 22

Considering The Role of Language in Embodied Systems

Jesse Thomason

http://glamor.rocks/

Slides: https://jessethomason.com/ -> News

19 of 22

Gaze Tracking and Language

20 of 22

Multimodal Continual Learning

  • Pretrained backbones are hard (sometimes impossible) to fine-tune; we need ways to adapt them to new contexts
  • Considering explicit and implicit feedback, modalities may appear and disappear depending on context and task
    • What happens to a model trained on speech and gaze when someone turns away?
  • Can we adapt models to both new tasks and the presence or absence of input modalities? (A toy fusion sketch follows.)
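To make the question concrete, here is a minimal PyTorch fusion sketch, purely illustrative and not the benchmark’s reference model: each modality gets its own encoder plus a learned “absent” embedding that stands in whenever that input disappears (e.g., gaze is lost when the person turns away).

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, dims, hidden=128):
        super().__init__()
        # dims: per-modality raw feature sizes, e.g. {"speech": 80, "gaze": 4}
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(d, hidden) for m, d in dims.items()})
        # One learned placeholder per modality, used when its input is missing.
        self.absent = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(hidden)) for m in dims})
        self.head = nn.Linear(hidden * len(dims), hidden)

    def forward(self, inputs):
        # inputs: {"speech": tensor of shape (1, d) or None, "gaze": ...}
        parts = []
        for m, enc in self.encoders.items():
            x = inputs.get(m)
            if x is None:
                # Modality dropped out: substitute its learned placeholder.
                parts.append(self.absent[m].expand(1, -1))
            else:
                parts.append(enc(x))
        return self.head(torch.cat(parts, dim=-1))

# Example: gaze disappears mid-stream, but the forward pass still runs.
model = ModalityFusion({"speech": 80, "gaze": 4})
out = model({"speech": torch.randn(1, 80), "gaze": None})
```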

21 of 22

Multimodal Continual Learning

22 of 22

Multimodal Continual Learning

  • Upshot: existing continual learning approaches are terrible at downstream transfer when a modality disappears
  • If you have an idea of how to do a better job, please check out the benchmark!