1 of 26

LLM for Embodied AI

Team 12

Edouard Albert-Roulhac

&

Abdelhakim Sehad

13-03-2024

2 of 26

Motivation

Image: Qualcomm

3 of 26

Embodied AI – Definition

Build AI agents which interact with the world

  • Perception
    • Open-ended world
    • Physical (Robot)
    • Virtual (Minecraft)

  • Cognition
    • Reasoning
    • World knowledge
    • Planning

  • Action
    • Access to tools
    • Responsibility

4 of 26

Embodied AI – Challenges

  • Perception
    • Grounded Multimodal models
    • Training in virtual world (simulation)
    • Life-long learning

  • Cognition
    • Leverage LLMs for world knowledge
    • Reasoning / Planning abilities
    • Success evaluation

  • Action
    • Define / Build tools
    • Scope & Capabilities

5 of 26

VOYAGER: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang et al.

Oct. 2023

6 of 26

Voyager – Context

Minecraft as a virtual world

  • Open-ended world
  • No predefined objective
    • Explore the world
    • Define & solve tasks
    • Develop skills
  • Interact via API (no vision here)

New paradigm

  • LLMs can code
  • Build a skill library (see the sketch below)
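
To make this paradigm concrete, below is a minimal sketch of one round of such an agent loop, assuming hypothetical `llm`, `env`, and `skills` interfaces; it illustrates the idea (automatic curriculum, skill reuse, code generation, iterative prompting with feedback) rather than Voyager's actual code.

```python
def voyager_step(llm, env, skills, max_retries=4):
    """Hedged sketch of one round of a Voyager-style loop.
    `llm`, `env`, and `skills` are hypothetical interfaces, not Voyager's API."""
    task = llm.propose_next_task(env.observe(), skills.names())   # automatic curriculum
    context = skills.retrieve(task)                               # reuse relevant stored skills
    code = llm.generate_skill_code(task, context)                 # LLM writes executable code
    for _ in range(max_retries):                                  # iterative prompting
        result = env.run(code)                                    # execute via the Minecraft API
        if result["success"]:
            skills.add(task, code)                                # grow the skill library
            return code
        # feed execution errors and self-verification feedback back to the LLM
        code = llm.refine_code(code, result["errors"], result["critique"])
    return None
```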

7 of 26

Voyager – Curriculum, Skill Library, Iterative Prompting

8 of 26

Skill Library
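
This slide's figure illustrates the skill library. As a rough sketch of the underlying idea (generated programs stored alongside an embedding of their description and retrieved by similarity to the current task), consider the following; the `embed` helper and class names are placeholders, not Voyager's implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: derive a deterministic pseudo-random unit vector from the text.
    A real system would call an actual text-embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class SkillLibrary:
    """Stores generated code skills keyed by name, indexed by an embedding
    of their natural-language description."""
    def __init__(self):
        self.skills = {}   # name -> source code of the generated skill
        self.index = {}    # name -> embedding of the skill description

    def add(self, name: str, description: str, code: str):
        self.skills[name] = code
        self.index[name] = embed(description)

    def retrieve(self, query: str, k: int = 3):
        """Return the k skills whose descriptions are most similar to the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda name: -float(q @ self.index[name]))
        return [(name, self.skills[name]) for name in ranked[:k]]

lib = SkillLibrary()
lib.add("mineWoodLog", "chop a tree and collect a wood log",
        "async function mineWoodLog(bot) { /* generated JavaScript */ }")
print(lib.retrieve("collect wood")[0][0])   # -> "mineWoodLog" (the only stored skill)
```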

9 of 26

10 of 26

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Yao Mu et al.

Sept. 2023

11 of 26

EmbodiedGPT – Overview

  • Goal:
    • Enhance the ability of Large Language Models (LLMs) to understand and reason about the world using both text and visual information.

  • Approach:
    • Pre-training on a massive embodied planning dataset – EgoCOT.

  • Examples of tasks:
    • Embodied planning, embodied control, embodied VQA, etc.

  • Modules (see the sketch after this list):
    • Frozen vision model – ViT-G/14
    • Frozen language model – LLaMA 7B
    • Embodied-former with a language mapping layer – aligning visual and embodied instructions
    • Policy network – producing low-level actions based on task-relevant features
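
A hedged sketch of how these modules could be wired together is shown below; dimensions and module internals are placeholders and simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EmbodiedGPTSketch(nn.Module):
    """Hedged sketch of the module wiring listed above.
    Dimensions and internals are placeholders, not the paper's exact code."""
    def __init__(self, vis_dim=1408, query_tokens=32, lm_dim=4096, act_dim=7):
        super().__init__()
        self.vision_encoder = nn.Identity()                       # frozen ViT-G/14 (stand-in)
        self.embodied_former = nn.MultiheadAttention(vis_dim, 8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(1, query_tokens, vis_dim))
        self.language_mapping = nn.Linear(vis_dim, lm_dim)        # projection into LLaMA-7B space
        self.policy = nn.Sequential(                              # low-level control head
            nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, image_tokens):
        feats = self.vision_encoder(image_tokens)                 # (B, N, vis_dim), kept frozen
        q = self.queries.expand(feats.size(0), -1, -1)
        task_feats, _ = self.embodied_former(q, feats, feats)     # query task-relevant visual features
        lm_prompt = self.language_mapping(task_feats)             # fed to the frozen language model
        actions = self.policy(task_feats.mean(dim=1))             # features -> low-level actions
        return lm_prompt, actions

model = EmbodiedGPTSketch()
lm_prompt, actions = model(torch.randn(2, 257, 1408))
print(lm_prompt.shape, actions.shape)   # torch.Size([2, 32, 4096]) torch.Size([2, 7])
```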

12 of 26

Main approach – Key Points

  • Crafting a large-scale embodied planning dataset, EgoCOT.

  • Implementing prefix tuning on the frozen language model (see the sketch after this list).
    • Enhances EmbodiedGPT's ability to generate reliable plans containing sub-goal sequences.

  • Introducing a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control.
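
As a minimal illustration of the prefix-tuning idea mentioned above, the sketch below keeps the language model frozen and trains only a small set of prefix embeddings; real prefix tuning typically injects key/value prefixes into every attention layer rather than only at the input, so this is a simplification.

```python
import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    """Minimal illustration of prefix tuning: the language model stays frozen and
    only a small set of prefix embeddings is trained."""
    def __init__(self, frozen_lm: nn.Module, embed_dim: int, prefix_len: int = 10):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False                       # keep the language model frozen
        self.prefix = nn.Parameter(torch.randn(1, prefix_len, embed_dim) * 0.02)

    def forward(self, token_embeds):                      # (B, T, embed_dim)
        prefix = self.prefix.expand(token_embeds.size(0), -1, -1)
        return self.lm(torch.cat([prefix, token_embeds], dim=1))

# Toy stand-in for a frozen transformer.
toy_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
model = PrefixTunedLM(toy_lm, embed_dim=64)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)                                          # only ['prefix'] is updated
print(model(torch.randn(3, 16, 64)).shape)                # torch.Size([3, 26, 64])
```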

13 of 26

EgoCOT

  • Large-scale embodied planning dataset with carefully selected videos from the Ego4D dataset (an egocentric video dataset) along with corresponding high-quality language instructions.
    • Machine-generated, filtered, and finally human-verified.

  • Designed to enable effective embodied planning by generating sequences of sub-goals in a Chain-of-Thought manner.

  • 9,645 untrimmed videos with durations ranging from 5 seconds to 7 hours.

  • Example:

14 of 26

EgoCOT – Data Preparation

  • First stage
    • Filtering out missing or very short videos.
    • Excluding videos without human-object interaction (e.g. walking or watching TV).
    • Generating pairs of captions and embodied plans using EgoVLP.
    • Providing prompts and the corresponding captions to ChatGPT to generate reasonable and detailed embodied plans (Chain of Thought).
    • Performing five sampling iterations for each prompt.

  • Second stage
    • Assessing the similarity of each video-text pair with CLIP and eliminating pairs with low similarity (see the sketch below).
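
A small sketch of this second-stage filtering, assuming precomputed CLIP-style embeddings; the threshold value and helper names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(video_embs, text_embs, items, threshold=0.3):
    """Keep a (video clip, planning text) pair only if a CLIP-style
    video/text similarity exceeds a threshold (threshold is an assumption)."""
    kept = []
    for v, t, item in zip(video_embs, text_embs, items):
        if cosine(v, t) >= threshold:
            kept.append(item)
    return kept

# Toy example with random vectors standing in for CLIP features.
rng = np.random.default_rng(0)
videos = rng.standard_normal((5, 512))
texts = videos + 0.1 * rng.standard_normal((5, 512))   # mostly well-aligned pairs
texts[4] = rng.standard_normal(512)                    # one mismatched pair
print(filter_pairs(videos, texts, items=list(range(5))))
```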

15 of 26

How does EmbodiedGPT work?

16 of 26

Training process

  • Stage I: pre-training in basic cognitive and responsive skills
    • Focus on image-text conversation alignment pre-training.
      • COCO Captions, CC3M and LAION-400M re-captioned using BLIP-2.

  • Stage II: pre-training in basic cognitive and responsive skills
    • Update the language projection and prefix language adapter.
      • "Complex_Reasoning_77k" and multi-turn conversation datasets provided by "LLaVA_Instruct_150K".

  • Stage III: training the embodied AI
    • Use Conv3D to transfer the pre-trained vision model to a video encoder (see the sketch below).
    • Introduce the Chain-of-Thought vision-language pre-training paradigm.
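
A hedged sketch of the Conv3D step: one common way to reuse a pre-trained image patch embedding for video is to inflate its 2D convolution weights along a new time axis. The kernel size, stride, and rescaling below are assumptions and may differ from EmbodiedGPT's actual implementation.

```python
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, time_kernel: int = 2) -> nn.Conv3d:
    """Turn a pre-trained 2D patch-embedding conv into a Conv3d for video input
    by replicating its weights along the new time axis (a common inflation trick;
    details are assumptions, not the paper's exact recipe)."""
    out_c, in_c, kh, kw = conv2d.weight.shape
    conv3d = nn.Conv3d(
        in_c, out_c,
        kernel_size=(time_kernel, kh, kw),
        stride=(time_kernel, conv2d.stride[0], conv2d.stride[1]),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat 2D weights over time and rescale to preserve activation magnitude.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: ViT-style 14x14 patch embedding applied to an 8-frame clip.
patch2d = nn.Conv2d(3, 1408, kernel_size=14, stride=14)
patch3d = inflate_patch_embed(patch2d)
video = torch.randn(1, 3, 8, 224, 224)      # (batch, channels, time, H, W)
print(patch3d(video).shape)                  # torch.Size([1, 1408, 4, 16, 16])
```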

17 of 26

Evaluation

  • Image input tasks
    • Human evaluation
      • Object recognition accuracy
      • Spatial relationship understanding
      • Level of redundancy in the answer
      • Reasonability of the planning
      • Executability of the planning
    • Visual ChatGPT

  • Video input embodied AI tasks
    • Meta-World
    • Franka Kitchen

18 of 26

Evaluation

19 of 26

Results

  • 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark compared to BLIP-2.

  • Comparable level of object recognition and spatial relationship understanding to that of the LLaVA-13B model.

  • Less redundant and more reasonable answers.

20 of 26

Results

21 of 26

Strengths and Weaknesses

  • Strengths
    • Development of an end-to-end multi-modal model for embodied AI.
    • Creation of a large-scale embodied planning dataset.
    • Significant improvement in the success rate of embodied control tasks.

  • Weaknesses
    • Relies on a large amount of data to train the LLM.
    • Black box reasoning process.
    • Potential for unforeseen ethical/social issues.
    • No performance results reported for several of the embodied tasks.

22 of 26

Reflections

  • Interesting Aspects:

    • Introduction of a multi-modal foundational model that enables agents to perform step-by-step planning and execute low-level commands.
    • Chain of Thought Reasoning.

  • Further exploration:

    • Scalability to real-world applications.
    • Is it safe and trustworthy?

23 of 26

Embodied Learning vs. Pre-Training

  • PaLM-E
    • Focus: Develop an embodied multimodal language model
    • Embodiment: Yes (can interact with physical objects and perceive the world)
    • Chain of Thought: Not a core focus (the paper emphasizes multimodal capabilities)
    • Dataset likely open source? No (proprietary robot data)

  • Voyager
    • Focus: Develop an open-ended embodied agent using LLMs
    • Embodiment: Yes (acts in the Minecraft environment)
    • Chain of Thought: Not explicitly used (focuses on learning through trial and error)
    • Dataset likely open source? N/A (no dataset required)

  • EmbodiedGPT
    • Focus: Pre-train LLMs for better visual-language understanding and reasoning
    • Embodiment: No (trained on data, not through embodied interaction)
    • Chain of Thought: Yes
    • Dataset likely open source? Yes (built on an open-source large-scale dataset, giving good scalability)

24 of 26

Thank you!

25 of 26

Appendix

  • They also create the EgoVQA dataset as an extension of the Ego4D dataset, focusing on egocentric human-object-interaction video question answering, in order to offer a wider range of egocentric multi-modal data.

26 of 26

Appendix