1 of 14

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

University of Hamburg

Department of Informatics

Knowledge Technology

http://www.informatik.uni-hamburg.de/WTM/

Muhammad Burhan Hafez

Stefan Wermter

Mengdi Li

Cornelius Weber

2 of 14

How do humans and robots perceive their surroundings to uncover latent properties? [1]


  • Active perception: weigh, knock on, touch
  • Passive perception: visual monitoring, auditory monitoring

(Epistemic) uncertainty arises from low-resolution sensing, ambiguity in human instructions, and information insufficiency in individual modalities.

[1] Kroemer, Oliver, Scott Niekum, and George Konidaris. "A review of robot learning for manipulation: Challenges, representations, and algorithms." The Journal of Machine Learning Research 22.1 (2021): 1395-1476.


3 of 14

Bridge the gap with LLMs


Robots with hand-crafted design

  • Increased complexity
  • Difficulty achieving generalizability and robustness in dynamically changing environments

Humans

  • Common sense
  • Established knowledge

Robots with LLMs

  • Reasoning and planning ability, with distilled human knowledge built in
  • In-context learning ability with few-shot prompts

Matcha* agent

(Multimodal environment chatting agent)

* Named after a type of East Asian green tea. To fully appreciate matcha, one must engage multiple senses to perceive its appearance, aroma, taste, texture, and other sensory nuances.

4 of 14

Matcha agent (Structure)

Start with a vision module to describe the scene.

[Diagram: a tabletop scene with a yellow block, an orange block, and a gray block]
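A minimal sketch of this vision step, assuming a hypothetical detector that returns (color, label) pairs; the description template is illustrative only, not the paper's exact module.

```python
# Hypothetical detector output: a list of (color, label) pairs.
def describe_scene(detections):
    """Turn detections into a one-sentence scene description for the LLM."""
    def article(word: str) -> str:
        return "an" if word[0].lower() in "aeiou" else "a"
    phrases = [f"{article(color)} {color} {label}" for color, label in detections]
    return "On the table there are " + ", ".join(phrases) + "."

print(describe_scene([("yellow", "block"), ("orange", "block"), ("gray", "block")]))
# -> "On the table there are a yellow block, an orange block, a gray block."
```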

5 of 14

Matcha agent (Structure)

The scene description and the task instruction (together with few-shot examples) are fed into a large language model, which actively chooses the next perception action.

[Diagram: the block scene, the instruction "Pick up the plastic cube.", and few-shot examples flowing into the LLM]
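A minimal sketch of this step, assuming the legacy OpenAI Python SDK (< 1.0, matching the text-davinci-003 model named later in these slides); the prompt layout and few-shot text are purely illustrative, not the paper's exact prompts.

```python
import openai  # legacy openai<1.0 completions API

openai.api_key = "sk-..."  # set your API key

# Hypothetical few-shot examples showing the expected response format.
FEW_SHOT = """Scene: a red block and a blue block are on the table.
Instruction: pick up the metal block.
Action: knock on the red block."""

def choose_next_action(scene_description: str, instruction: str) -> str:
    """Ask the LLM for the next perception (or manipulation) action."""
    prompt = (
        f"{FEW_SHOT}\n\n"
        f"Scene: {scene_description}\n"
        f"Instruction: {instruction}\n"
        f"Action:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=32,
        temperature=0.0,
        stop=["\n"],
    )
    return response["choices"][0]["text"].strip()

# choose_next_action("a yellow, an orange, and a gray block are on the table",
#                    "Pick up the plastic cube.")
# -> e.g. "knock on the yellow block"
```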

6 of 14

Matcha agent (Structure)

The chosen action is carried out with motion planning.

[Diagram: the block scene, the instruction "Pick up the plastic cube.", and few-shot examples]
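A minimal sketch of this dispatch step; the action parser and the motion primitives below are hypothetical stand-ins for the actual motion-planning backend.

```python
import re

# Hypothetical motion primitives, each backed by a motion planner in practice.
def knock(obj):   print(f"[planner] knocking on the {obj}")
def touch(obj):   print(f"[planner] touching the {obj}")
def weigh(obj):   print(f"[planner] weighing the {obj}")
def pick_up(obj): print(f"[planner] picking up the {obj}")

PRIMITIVES = {"knock on": knock, "touch": touch,
              "weigh": weigh, "pick up": pick_up}

def execute(action: str) -> None:
    """Parse text like 'knock on the yellow block' and call a primitive."""
    for verb, fn in PRIMITIVES.items():
        m = re.match(rf"{verb}\s+(?:the\s+)?(.+)", action, re.IGNORECASE)
        if m:
            fn(m.group(1))
            return
    raise ValueError(f"Unrecognized action: {action!r}")

execute("knock on the yellow block")  # -> [planner] knocking on the yellow block
```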

7 of 14

Matcha agent (Structure)

The multimodal response is fed back to the LLM, and the loop repeats until the task is done.

[Diagram: the block scene, the instruction "Pick up the plastic cube.", and few-shot examples]
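Combining the previous sketches, a hedged outline of the whole interaction loop; choose_next_action and execute are the illustrative helpers above, and describe_feedback is a hypothetical stand-in for the sound/haptic/weight modules on the next slide.

```python
def run_episode(scene_description: str, instruction: str, max_steps: int = 10):
    """LLM picks an action, the robot executes it, and the verbalized
    multimodal feedback is appended to the context until the task ends."""
    context = scene_description
    for _ in range(max_steps):
        action = choose_next_action(context, instruction)
        execute(action)
        if action.startswith("pick up"):          # terminal manipulation action
            return action
        feedback = describe_feedback(action)      # e.g. "it sounds like metal"
        context += f" After '{action}': {feedback}"  # feed the response back
    return None                                   # gave up after max_steps
```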

8 of 14


  • Sound module
  • Haptic module
  • Weight module

These modules verbalize sensory feedback, which raises two issues:

  • (Overlap) Similar descriptions for different materials
  • (Conflict) Noticeably different descriptions for the same material
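A minimal sketch of how a module like the sound module might verbalize a classifier label; the label-to-phrase table is invented for illustration. Sampling among candidate phrases shows how Conflict can arise, and shared wording such as "dull" shows how Overlap can arise.

```python
import random

# Hypothetical mapping from classified material to candidate descriptions.
SOUND_PHRASES = {
    "metal":   ["a crisp, ringing sound", "a sharp metallic clink"],
    "plastic": ["a dull, hollow sound", "a light plastic tap"],
    "wood":    ["a dull, solid sound"],  # "dull" overlaps with plastic
}

def describe_feedback(action: str, material_label: str = "plastic") -> str:
    """Verbalize the sensory result of an action for the LLM."""
    # Repeated calls may yield different phrases for the same material
    # (Conflict), and different materials may share phrases (Overlap).
    return f"It makes {random.choice(SOUND_PHRASES[material_label])}."
```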

9 of 14


Matcha agent

(In simulation)

  • NICOL robot [2]
  • CoppeliaSim simulator
  • LLM: OpenAI API, text-davinci-003
  • Video playback at 4× speed

[2] Kerzel, Matthias, et al. "NICOL: A Neuro-inspired Collaborative Semi-humanoid Robot that Bridges Social Interaction and Reliable Manipulation." arXiv preprint arXiv:2305.08528 (2023).

https://youtu.be/rMMeMTWmT0k

10 of 14

Experimental results

  • Works without any fine-tuning
  • Handles flexible language instructions
  • Only larger language models with strong multi-step reasoning ability help


* Random guessing would succeed 33.33% of the time in principle (one target among three objects)

11 of 14

Generalization, Limitations, and Future Work

  • No need for massive datasets or interactions to learn common sense
  • Limitations in interpreting the real, complex, dynamic world through language
    • Large multimodal models
    • Advanced reasoning techniques to decompose tasks
  • Future work: large multimodal models and real-world robots


12 of 14

Thank You for Your Attention!


Muhammad Burhan Hafez

Stefan Wermter

Mengdi Li

Cornelius Weber

University of Hamburg

Department of Informatics

Knowledge Technology

13 of 14

[Logos]

14 of 14

  • Sound classification accuracy: 93.33%
  • The robot knocks on one of the three objects at random and classifies its material, repeating until an object is classified as the target material. With per-classification accuracy p = 93.33%, the theoretical success rate is (1/3)p + (2/3)p² = 89.18%: with probability 1/3 the target is knocked first and one correct classification suffices, and otherwise two correct classifications suffice, since the last object can be taken by elimination.
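A quick numeric check of that formula, plus a small Monte Carlo simulation of the same knock-and-classify procedure under the stated assumptions (any misclassification along the way counts as failure):

```python
import random

p = 0.9333  # per-knock sound classification accuracy

# Closed form: target knocked first (prob 1/3, one correct classification)
# or later (prob 2/3, two correct classifications suffice by elimination).
analytic = (1 / 3) * p + (2 / 3) * p ** 2
print(f"analytic success rate:  {analytic:.4f}")   # ~0.8918

def trial() -> bool:
    """One episode: knock in random order until something is classified
    as the target material; object 0 is the true target."""
    order = random.sample(range(3), 3)
    for i, obj in enumerate(order):
        if obj == 0:                      # knocking on the true target:
            return random.random() < p    # must classify it correctly
        if i == 1:                        # second non-target rejected correctly
            return random.random() < p    # -> third is taken by elimination
        if random.random() >= p:          # misclassified the first non-target
            return False
    return True

n = 200_000
print(f"simulated success rate: {sum(trial() for _ in range(n)) / n:.4f}")
```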
