Vision-and-Language Navigation
Embodied Agents Book Club, Cognitive AI Systems Lab, AIRI
Eva Bakaeva, 18.10.24
Behavior planning with pretrained language models with feedback
Outline
What is the general VLN task?
Building a system that can navigate a 3D environment using natural-language instructions and visual information
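To make this setting concrete, below is a minimal, illustrative sketch of a generic VLN episode loop; the Observation/VLNAgent/env interfaces are assumptions for illustration, not taken from any specific benchmark.

```python
# Minimal, illustrative sketch of a generic VLN episode loop.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray    # current visual observation (e.g. a panorama)
    heading: float     # agent heading in radians

class VLNAgent:
    def act(self, instruction: str, history: List[Observation]) -> str:
        """Return a discrete action such as 'forward', 'turn_left' or 'stop'."""
        raise NotImplementedError

def run_episode(env, agent: VLNAgent, instruction: str, max_steps: int = 50) -> bool:
    history = [env.reset()]
    for _ in range(max_steps):
        action = agent.act(instruction, history)
        if action == "stop":
            break
        history.append(env.step(action))
    return env.success()  # e.g. the agent stopped close enough to the goal
```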
Classification of VLN tasks
Main components of systems for solving VLN tasks
Main VLN benchmarks: R2R
input:
output:
baseline: LSTM-based sequence-to-sequence architecture with an attention mechanism
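For reference, a hedged PyTorch sketch of this kind of baseline: an LSTM encoder over the instruction and an attentive LSTM decoder over image features. The dimensions, action vocabulary, and attention choice are illustrative assumptions, not the original R2R code.

```python
import torch
import torch.nn as nn

class Seq2SeqVLN(nn.Module):
    """Illustrative LSTM seq2seq agent with attention (R2R-style baseline sketch)."""
    def __init__(self, vocab=1000, n_actions=6, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)
        self.decoder_cell = nn.LSTMCell(d + 2048, d)   # prev-action emb + image feature
        self.act_embed = nn.Embedding(n_actions, d)
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * d, n_actions)

    def forward(self, instr_tokens, img_feats, prev_actions):
        # Encode the instruction once.
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))
        h, c = h[0], c[0]
        logits = []
        for t in range(img_feats.size(1)):
            x = torch.cat([self.act_embed(prev_actions[:, t]), img_feats[:, t]], dim=-1)
            h, c = self.decoder_cell(x, (h, c))
            # Attend over the instruction context with the decoder state as query.
            attended, _ = self.attn(h.unsqueeze(1), ctx, ctx)
            logits.append(self.out(torch.cat([h, attended.squeeze(1)], dim=-1)))
        return torch.stack(logits, dim=1)   # (B, T, n_actions)
```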
Main VLN benchmarks: RxR
input:
output:
baseline: an instruction CNN encoder and a sequential LSTM decoder that computes a distribution over actions at each step
Main VLN benchmarks: REVERIE
input:
output:
baseline: Interactive Navigator-Pointer model
Main VLN benchmarks: CVDN
input:
output: navigation actions that bring the agent closer to the goal location G_j, starting from the terminal node of N_{i−1}
baseline: a sequence-to-sequence model with an LSTM encoder that takes in learnable token embeddings (LE) of the dialog history; the encoder conditions an LSTM decoder that predicts navigation actions from fixed ResNet embeddings of the visual environment frames
Main VLN benchmarks: Touchdown
input: a state s ∈ S is a pair (I, α), where I is a panorama and α is the heading angle indicating the agent's heading; the task provides a navigation instruction x_n and a start state s_1 (see the sketch after this block)
output:
baseline: LINGUNET
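As an illustration, the Touchdown navigation input described above could be represented like this; the class and field names are hypothetical, not from the benchmark code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """Touchdown state s = (I, alpha): a panorama and the agent heading."""
    panorama: np.ndarray   # RGB panorama I
    heading: float         # heading angle alpha

@dataclass
class NavigationTask:
    instruction: str       # navigation instruction x_n
    start_state: State     # start state s_1
```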
Main VLN benchmarks: VLNA
input:
output: navigation actions that bring the agent closer to the goal location
baseline: LSTM-based
Main VLN benchmarks: FAO
input:
output:
difference from REVERIE: any starting point + polar coordinates for bounding box
Main VLN benchmarks: comparison
New VLN benchmarks: CoWs on PASTURE
input:
output: the episode is successful if the agent is within c units of the target object o and o meets a visibility criterion (see the sketch after this block)
baseline: CLIP on Wheels (CoW)
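A minimal sketch of this success criterion, assuming the agent position, object position, and visibility flag come from the simulator; the threshold c and the visibility check are placeholders rather than the exact PASTURE evaluation code.

```python
import numpy as np

def episode_success(agent_pos, object_pos, object_visible: bool, c: float = 1.0) -> bool:
    """Success if the agent ends within c units of target object o
    and o meets the visibility criterion (boolean placeholder here)."""
    close_enough = np.linalg.norm(np.asarray(agent_pos) - np.asarray(object_pos)) <= c
    return bool(close_enough and object_visible)
```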
New VLN benchmarks: Memory-Maze
input: a natural-language instruction; the environment input, which includes details obtained from sensors; an API specification, i.e. the commands the agent can use and their explanations in Python; the API implementation, i.e. the actual implementation of that specification; and the initial orientation of the robot
output: a route to the goal position
baseline:
New papers: NaVid (video-based VLM for VLN-CE)
NaVid is built on top of a general-purpose video-based VLM named LLaMA-VID [57]. It inherits the main architecture of LLaMA-VID and adds task-specific designs on top of it to facilitate the transfer of general knowledge to VLN-CE and make its generalization challenges more readily solvable.
NaVid consists of a vision encoder, a query generator, an LLM, and two cross-modality projectors. Given the observations up to time t, i.e., a video sequence comprising t frames, the video is encoded into a sequence of tokens via the vision encoder (EVA-CLIP [92] in the implementation) and projected into a space aligned with language tokens; for brevity, the projected tokens are called observation tokens. As is common, the instruction is likewise tokenized into a set of instruction tokens. Observation tokens and instruction tokens are concatenated and sent to the LLM to infer the VLN actions in linguistic form. Note that the work focuses on task-specific modeling rather than model architecture.
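A hedged pseudocode sketch of the inference flow described above; all callables here are placeholders rather than the actual NaVid API, and the query generator and second projector are omitted for brevity.

```python
def navid_step(frames, instruction, vision_encoder, obs_projector,
               instr_tokenizer, llm):
    """Sketch of the NaVid inference flow; every callable is a placeholder."""
    observation_tokens = []
    for frame in frames:                                  # observations up to time t
        visual_tokens = vision_encoder(frame)             # EVA-CLIP in the paper
        observation_tokens += obs_projector(visual_tokens)  # align with language space
    instruction_tokens = instr_tokenizer(instruction)
    # The LLM receives observation tokens followed by instruction tokens and
    # answers with the next action in linguistic form.
    return llm.generate(observation_tokens + instruction_tokens)
```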
New papers: BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes
BehAV is structured into four key components:
1. Human Instruction Decomposition;
2. Behavioral Cost Map Generation;
3. Visual Landmark Estimation;
4. Behavior-Aware Planning
Behavioral Action Costs as Conditional Probabilities:
2. Behavioral Cost Map Construction
1) Segmentation Maps for Behavioral Targets:
2) Combining Segmentation Maps with Behavioral Action Costs:
3) Generating the Behavioral Cost Map C_behav (sketched below):
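One plausible reading of steps 1)-3) is that each behavioral target's segmentation map is scaled by its behavioral action cost and the scaled maps are fused into a single cost map. The sketch below follows that assumption; the element-wise max fusion is my choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def behavioral_cost_map(seg_masks: dict, action_costs: dict) -> np.ndarray:
    """Combine per-target segmentation maps with behavioral action costs
    into a single cost map C_behav (max fusion is an assumption)."""
    # seg_masks: target name -> (H, W) mask in [0, 1]
    # action_costs: target name -> scalar behavioral cost in [0, 1]
    h, w = next(iter(seg_masks.values())).shape
    c_behav = np.zeros((h, w), dtype=np.float32)
    for target, mask in seg_masks.items():
        c_behav = np.maximum(c_behav, action_costs[target] * mask)
    return c_behav
```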
3. Visual Landmark Estimation
4. Behavior-Aware Planning
BehAV's unconstrained model predictive control (MPC) planner leverages the trajectory parameterization z = (r, θ, δ, v_max) detailed in Section III-B of the paper to optimize a novel objective function that incorporates behavior-aware navigation.
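A schematic sketch of how such an unconstrained, sampling-based MPC step could look: sample candidate parameter vectors z = (r, θ, δ, v_max), roll them out into trajectories, and score each with a goal term plus the accumulated behavioral cost. The rollout model, weights, and objective terms are assumptions, not BehAV's exact objective.

```python
import numpy as np

def plan_step(sample_z, rollout, goal_cost, behavioral_cost, n_samples: int = 256):
    """Schematic sampling-based MPC step over trajectory parameters
    z = (r, theta, delta, v_max); all callables are placeholders."""
    best_z, best_cost = None, np.inf
    for _ in range(n_samples):
        z = sample_z()                       # candidate (r, theta, delta, v_max)
        traj = rollout(z)                    # (T, 2) positions implied by z
        # Behavior-aware objective: progress toward the landmark/goal plus the
        # behavioral cost accumulated along the trajectory (weight assumed to be 1).
        cost = goal_cost(traj) + 1.0 * sum(behavioral_cost(p) for p in traj)
        if cost < best_cost:
            best_z, best_cost = z, cost
    return best_z
```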