1 of 26

LLM for Embodied AI

Team 12

Edouard Albert-Roulhac

&

Abdelhakim Sehad

13-03-2024

2 of 26

Motivation

Image: Qualcomm

3 of 26

Embodied AI – Definition

Build AI agents which interact with the world

  • Perception
    • Open-ended world
    • Physical (Robot)
    • Virtual (Minecraft)

  • Cognition
    • Reasoning
    • World knowledge
    • Planning

  • Action
    • Access to tools
    • Responsibility

4 of 26

Embodied AI – Challenges

  • Perception
    • Grounded Multimodal models
    • Training in virtual world (simulation)
    • Life-long learning

  • Cognition
    • Leverage LLMs for world knowledge
    • Reasoning / Planning abilities
    • Success evaluation

  • Action
    • Define / Build tools
    • Scope & Capabilities

5 of 26

VOYAGER: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang et al.

Oct. 2023

6 of 26

Voyager – Context

Minecraft as a virtual world

  • Open-ended world
  • No predefined objective
    • Explore the world
    • Define & solve tasks
    • Develop skills
  • Interact via API (no vision here)

New paradigm

  • LLMs can code
  • Build a skill library (see the sketch below)
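
To make this paradigm concrete, below is a minimal sketch of one round of such an agent loop, assuming hypothetical `llm`, `env`, and `skills` interfaces; it illustrates the idea (automatic curriculum, skill reuse, code generation, iterative prompting with feedback) rather than Voyager's actual code.

```python
def voyager_step(llm, env, skills, max_retries=4):
    """Hedged sketch of one round of a Voyager-style loop.
    `llm`, `env`, and `skills` are hypothetical interfaces, not Voyager's API."""
    task = llm.propose_next_task(env.observe(), skills.names())   # automatic curriculum
    context = skills.retrieve(task)                               # reuse relevant stored skills
    code = llm.generate_skill_code(task, context)                 # LLM writes executable code
    for _ in range(max_retries):                                  # iterative prompting
        result = env.run(code)                                    # execute via the Minecraft API
        if result["success"]:
            skills.add(task, code)                                # grow the skill library
            return code
        # feed execution errors and self-verification feedback back to the LLM
        code = llm.refine_code(code, result["errors"], result["critique"])
    return None
```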

7 of 26

Voyager – Curriculum, Skill Library, Iterative Prompting

8 of 26

Skill Library
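
This slide's figure illustrates the skill library. As a rough sketch of the underlying idea (generated programs stored alongside an embedding of their description and retrieved by similarity to the current task), consider the following; the `embed` helper and class names are placeholders, not Voyager's implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: derive a deterministic pseudo-random unit vector from the text.
    A real system would call an actual text-embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class SkillLibrary:
    """Stores generated code skills keyed by name, indexed by an embedding
    of their natural-language description."""
    def __init__(self):
        self.skills = {}   # name -> source code of the generated skill
        self.index = {}    # name -> embedding of the skill description

    def add(self, name: str, description: str, code: str):
        self.skills[name] = code
        self.index[name] = embed(description)

    def retrieve(self, query: str, k: int = 3):
        """Return the k skills whose descriptions are most similar to the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda name: -float(q @ self.index[name]))
        return [(name, self.skills[name]) for name in ranked[:k]]

lib = SkillLibrary()
lib.add("mineWoodLog", "chop a tree and collect a wood log",
        "async function mineWoodLog(bot) { /* generated JavaScript */ }")
print(lib.retrieve("collect wood")[0][0])   # -> "mineWoodLog" (the only stored skill)
```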

9 of 26

10 of 26

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Yao Mu et al.

Sept. 2023

11 of 26

EmbodiedGPT – Overview

  • Goal:
    • Enhance the ability of Large Language Models (LLMs) to understand and reason about the world using both text and visual information.

  • Approach:
    • Pre-training on a massive embodied planning dataset – EgoCOT.

  • Examples of tasks:
    • Embodied planning, embodied control, embodied VQA, etc.

  • Modules (see the sketch after this list):
    • Frozen vision model – ViT-G/14
    • Frozen language model – LLaMA 7B
    • Embodied-former with a language mapping layer – aligning visual and embodied instructions
    • Policy network – producing low-level actions based on task-relevant features
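
A hedged sketch of how these modules could be wired together is shown below; dimensions and module internals are placeholders and simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EmbodiedGPTSketch(nn.Module):
    """Hedged sketch of the module wiring listed above.
    Dimensions and internals are placeholders, not the paper's exact code."""
    def __init__(self, vis_dim=1408, query_tokens=32, lm_dim=4096, act_dim=7):
        super().__init__()
        self.vision_encoder = nn.Identity()                       # frozen ViT-G/14 (stand-in)
        self.embodied_former = nn.MultiheadAttention(vis_dim, 8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(1, query_tokens, vis_dim))
        self.language_mapping = nn.Linear(vis_dim, lm_dim)        # projection into LLaMA-7B space
        self.policy = nn.Sequential(                              # low-level control head
            nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, image_tokens):
        feats = self.vision_encoder(image_tokens)                 # (B, N, vis_dim), kept frozen
        q = self.queries.expand(feats.size(0), -1, -1)
        task_feats, _ = self.embodied_former(q, feats, feats)     # query task-relevant visual features
        lm_prompt = self.language_mapping(task_feats)             # fed to the frozen language model
        actions = self.policy(task_feats.mean(dim=1))             # features -> low-level actions
        return lm_prompt, actions

model = EmbodiedGPTSketch()
lm_prompt, actions = model(torch.randn(2, 257, 1408))
print(lm_prompt.shape, actions.shape)   # torch.Size([2, 32, 4096]) torch.Size([2, 7])
```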

12 of 26

Main approach – Key Points

  • Crafting a large-scale embodied planning dataset, EgoCOT.

  • Implementing prefix tuning on the frozen language model (see the sketch after this list).
    • Enhances EmbodiedGPT's ability to generate reliable plans containing sub-goal sequences.

  • Introducing a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control.
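
As a minimal illustration of the prefix-tuning idea mentioned above, the sketch below keeps the language model frozen and trains only a small set of prefix embeddings; real prefix tuning typically injects key/value prefixes into every attention layer rather than only at the input, so this is a simplification.

```python
import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    """Minimal illustration of prefix tuning: the language model stays frozen and
    only a small set of prefix embeddings is trained."""
    def __init__(self, frozen_lm: nn.Module, embed_dim: int, prefix_len: int = 10):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False                       # keep the language model frozen
        self.prefix = nn.Parameter(torch.randn(1, prefix_len, embed_dim) * 0.02)

    def forward(self, token_embeds):                      # (B, T, embed_dim)
        prefix = self.prefix.expand(token_embeds.size(0), -1, -1)
        return self.lm(torch.cat([prefix, token_embeds], dim=1))

# Toy stand-in for a frozen transformer.
toy_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
model = PrefixTunedLM(toy_lm, embed_dim=64)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)                                          # only ['prefix'] is updated
print(model(torch.randn(3, 16, 64)).shape)                # torch.Size([3, 26, 64])
```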

13 of 26

EgoCOT

  • Large-scale embodied planning dataset with carefully selected videos from the Ego4D dataset (an egocentric video dataset) along with corresponding high-quality language instructions.
    • Machine-generated, filtered, and finally human-verified.

  • Designed to enable effective embodied planning by generating sequences of sub-goals in a Chain-of-Thought manner.

  • 9,645 untrimmed videos with durations ranging from 5 seconds to 7 hours.

  • Example:

14 of 26

EgoCOT – Data Preparation

  • First stage
    • Filtering out missing or very short videos.
    • Excluding videos without human-object interaction (e.g. walking or watching TV).
    • Generating pairs of captions and embodied plans using EgoVLP.
    • Providing prompts and the corresponding captions to ChatGPT to generate reasonable and detailed embodied plans (Chain of Thought).
    • Performing five sampling iterations for each prompt.

  • Second stage
    • Assessing the similarity of each video-text pair with CLIP and eliminating pairs with low similarity (see the sketch below).
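
A small sketch of this second-stage filtering, assuming precomputed CLIP-style embeddings; the threshold value and helper names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(video_embs, text_embs, items, threshold=0.3):
    """Keep a (video clip, planning text) pair only if a CLIP-style
    video/text similarity exceeds a threshold (threshold is an assumption)."""
    kept = []
    for v, t, item in zip(video_embs, text_embs, items):
        if cosine(v, t) >= threshold:
            kept.append(item)
    return kept

# Toy example with random vectors standing in for CLIP features.
rng = np.random.default_rng(0)
videos = rng.standard_normal((5, 512))
texts = videos + 0.1 * rng.standard_normal((5, 512))   # mostly well-aligned pairs
texts[4] = rng.standard_normal(512)                    # one mismatched pair
print(filter_pairs(videos, texts, items=list(range(5))))
```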

15 of 26

How does EmbodiedGPT work?

16 of 26

Training process

  • Stage I: pre-training in basic cognitive and responsive skills
    • Focus on image-text conversation alignment pre-training.
      • COCO Captions, CC3M and LAION-400M re-captioned using BLIP-2.

  • Stage II: pre-training in basic cognitive and responsive skills
    • Update the language projection and prefix language adapter.
      • "Complex_Reasoning_77k" and multi-turn conversation datasets provided by "LLaVA_Instruct_150K".

  • Stage III: training the embodied AI
    • Use Conv3D to transfer the pre-trained vision model to a video encoder (see the sketch below).
    • Introduce the Chain-of-Thought vision-language pre-training paradigm.
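
A hedged sketch of the Conv3D step: one common way to reuse a pre-trained image patch embedding for video is to inflate its 2D convolution weights along a new time axis. The kernel size, stride, and rescaling below are assumptions and may differ from EmbodiedGPT's actual implementation.

```python
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, time_kernel: int = 2) -> nn.Conv3d:
    """Turn a pre-trained 2D patch-embedding conv into a Conv3d for video input
    by replicating its weights along the new time axis (a common inflation trick;
    details are assumptions, not the paper's exact recipe)."""
    out_c, in_c, kh, kw = conv2d.weight.shape
    conv3d = nn.Conv3d(
        in_c, out_c,
        kernel_size=(time_kernel, kh, kw),
        stride=(time_kernel, conv2d.stride[0], conv2d.stride[1]),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat 2D weights over time and rescale to preserve activation magnitude.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: ViT-style 14x14 patch embedding applied to an 8-frame clip.
patch2d = nn.Conv2d(3, 1408, kernel_size=14, stride=14)
patch3d = inflate_patch_embed(patch2d)
video = torch.randn(1, 3, 8, 224, 224)      # (batch, channels, time, H, W)
print(patch3d(video).shape)                  # torch.Size([1, 1408, 4, 16, 16])
```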

17 of 26

Evaluation

  • Image input tasks
    • Human evaluation
      • Object recognition accuracy
      • Spatial relationship understanding
      • Level of redundancy in the answer
      • Reasonability of the planning
      • Executability of the planning
    • Visual ChatGPT

  • Video input embodied AI tasks
    • Meta-World
    • Franka Kitchen

18 of 26

Evaluation

19 of 26

Results

  • 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark compared to BLIP-2.

  • Comparable level of object recognition and spatial relationship understanding to that of the LLaVA-13B model.

  • Less redundant and more reasonable answers.

20 of 26

Results

21 of 26

Strengths and Weaknesses

  • Strengths
    • Development of an end-to-end multi-modal model for embodied AI.
    • Creation of a large-scale embodied planning dataset.
    • Significant improvement in the success rate of embodied control tasks.

  • Weaknesses
    • Relies on a large amount of data to train the LLM.
    • Black box reasoning process.
    • Potential for unforeseen ethical/social issues.
    • No performance results reported for several of the embodied tasks.

22 of 26

Reflections

  • Interesting Aspects:

    • Introduction of a multi-modal foundational model that enables agents to perform step-by-step planning and execute low-level commands.
    • Chain of Thought Reasoning.

  • Further exploration:

    • Scalability to real-world applications.
    • Is it safe and trustworthy?

23 of 26

Embodied Learning vs. Pre-Training

  • PaLM-E
    • Focus: Develop an embodied multimodal language model
    • Embodiment: Yes (can interact with physical objects and perceive the world)
    • Chain of Thought: Not a core focus (the paper emphasizes multimodal capabilities)
    • Dataset likely open source? No (proprietary robot data)

  • Voyager
    • Focus: Develop an open-ended embodied agent using LLMs
    • Embodiment: Yes (acts in the Minecraft environment)
    • Chain of Thought: Not explicitly used (focuses on learning through trial and error)
    • Dataset likely open source? N/A (no dataset required)

  • EmbodiedGPT
    • Focus: Pre-train LLMs for better visual-language understanding and reasoning
    • Embodiment: No (trained on data, not through embodied interaction)
    • Chain of Thought: Yes
    • Dataset likely open source? Yes (built on an open-source large-scale dataset, giving good scalability)

24 of 26

Thank you!

25 of 26

Appendix

  • They also create the EgoVQA dataset as an extension of the Ego4D dataset, focusing on egocentric human-object-interaction video question answering, in order to offer a wider range of egocentric multi-modal data.

26 of 26

Appendix