1 of 33

Vision-and-Language Navigation

Embodied Agents book club of the Cognitive AI Systems Lab, AIRI

Бакаева Ева 18.10.24

2 of 33

Behavior planning with pretrained language models and feedback

3 of 33

Outline

  1. What the vision-and-language navigation (VLN) task is
  2. Classification of VLN tasks
  3. Main components of models for solving the VLN task
  4. Benchmarks for evaluating success on the VLN task
  5. Recent papers on approaches to solving VLN tasks

4 of 33

What is the general VLN task?

Building a system that can navigate a 3D environment using natural-language instructions and visual information

  1. https://arxiv.org/pdf/2405.07060v1
  2. https://arxiv.org/pdf/2409.15451
  3. https://arxiv.org/pdf/2409.16484

5 of 33

Classification of VLN tasks

  1. Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. arXiv:2203.12667v3 [cs.CV], 3 Jun 2022
  2. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. arXiv:2004.02857v2 [cs.CV], 1 May 2020

6 of 33

Main components of systems for solving VLN tasks

7 of 33

Main VLN benchmarks: R2R

input:

output:

baseline: LSTM-based sequence-to-sequence architecture with an attention mechanism
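
For illustration, below is a minimal PyTorch-style sketch of an attention-based sequence-to-sequence follower of this kind. The module names, the toy dimensions, and the 2048-d per-step visual features are assumptions for the sketch, not details taken from the R2R baseline itself.

```python
# Minimal sketch of an attention-based seq2seq navigation baseline
# (illustrative hyperparameters; not the R2R paper's implementation).
import torch
import torch.nn as nn


class Seq2SeqFollower(nn.Module):
    def __init__(self, vocab_size=1000, n_actions=6, d=256, d_img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)   # instruction encoder
        self.decoder_cell = nn.LSTMCell(d + d_img, d)    # attended text + image feature
        self.attn = nn.Linear(d, d)
        self.action_head = nn.Linear(2 * d, n_actions)

    def forward(self, instr_tokens, visual_feats):
        # instr_tokens: (B, L) word ids; visual_feats: (B, T, d_img) per-step image features
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))               # ctx: (B, L, d)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(visual_feats.size(1)):
            # attend over instruction words with the current decoder state
            scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)  # (B, L)
            attended = torch.bmm(scores.softmax(1).unsqueeze(1), ctx).squeeze(1)
            h, c = self.decoder_cell(torch.cat([attended, visual_feats[:, t]], 1), (h, c))
            logits.append(self.action_head(torch.cat([h, attended], 1)))
        return torch.stack(logits, dim=1)                                  # (B, T, n_actions)
```

In training, the distribution over actions at each step is typically supervised with the ground-truth (teacher-forced) trajectory.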

8 of 33

Main VLN benchmarks: RxR

input:

output:

baseline: an instruction CNN encoder and a sequential LSTM decoder that computes a distribution over actions at each step

9 of 33

10 of 33

Main VLN benchmarks: REVERIE

input:

output:

baseline: Interactive Navigator-Pointer model

11 of 33

12 of 33

Main VLN benchmarks: CVDN

input:

output: navigation actions that bring the agent closer to the goal location G_j, starting from the terminal node of N_{i−1}

baseline: a sequence-to-sequence model with an LSTM encoder that takes in learnable token embeddings (LE) of the dialog history; the encoder conditions an LSTM decoder that predicts navigation actions from fixed ResNet embeddings of the visual environment frames

13 of 33

Main VLN benchmarks: Touchdown

input: a state s ∈ S is a pair (I, α), where I is a panorama and α is the heading angle indicating the agent's heading; the agent receives a navigation instruction x_n and a start state s_1

output:

baseline: LINGUNET

14 of 33

15 of 33

Main VLN benchmarks: VLNA

input:

output: navigation actions that bring the agent closer to the goal location

baseline: LSTM-based

16 of 33

Main VLN benchmarks: FAO

input:

output:

difference from REVERIE: any starting point + polar coordinates for bounding box

17 of 33

18 of 33

Main VLN benchmarks: comparison

19 of 33

New VLN benchmarks: CoWs on PASTURE

input:

output: the episode is successful if the agent is within c units of the target object o and o meets a visibility criterion
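
For clarity, a minimal sketch of this success check is given below; Object3D and its visible_from_agent flag are hypothetical stand-ins for the benchmark's own geometry and visibility tests.

```python
# Illustrative success check: the agent must end within c units of the target
# object o, and o must meet the visibility criterion. Object3D and its
# visible_from_agent flag are hypothetical stand-ins, not benchmark code.
from dataclasses import dataclass
import math


@dataclass
class Object3D:
    x: float
    y: float
    z: float
    visible_from_agent: bool  # result of the benchmark's visibility test


def episode_success(agent_xyz: tuple[float, float, float], o: Object3D, c: float = 1.0) -> bool:
    dx, dy, dz = agent_xyz[0] - o.x, agent_xyz[1] - o.y, agent_xyz[2] - o.z
    return math.sqrt(dx * dx + dy * dy + dz * dz) <= c and o.visible_from_agent
```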

20 of 33

21 of 33

baseline: CLIP on Wheels (CoW)

22 of 33

New VLN benchmarks: Memory-Maze

input:

  1. a natural-language instruction;
  2. the environment input, which includes details obtained from sensors;
  3. an API specification: the commands the agent can use and their explanations, in Python (a minimal sketch of such a specification is given below);
  4. the API implementation, i.e. the actual implementation of the API specification;
  5. the initial orientation of the robot

output: a route to the goal position

baseline:
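
A minimal sketch of how these inputs could be assembled into a prompt for an LLM agent is shown below. The API functions move, turn, and get_lidar are hypothetical examples of "commands and their explanations in Python", not the benchmark's actual specification.

```python
# Illustrative assembly of Memory-Maze-style inputs into a single LLM prompt.
# The API below is a hypothetical example, not the benchmark's actual spec.
API_SPEC = '''
def move(distance_m: float) -> None:
    """Drive the robot forward by distance_m meters."""

def turn(angle_deg: float) -> None:
    """Rotate the robot in place by angle_deg degrees (positive = left)."""

def get_lidar() -> list[float]:
    """Return the current distance readings from the robot's lidar."""
'''


def build_prompt(instruction: str, sensor_summary: str, initial_orientation_deg: float) -> str:
    return (
        f"Instruction: {instruction}\n"
        f"Environment (from sensors): {sensor_summary}\n"
        f"Initial orientation: {initial_orientation_deg} degrees\n"
        f"Available API:\n{API_SPEC}\n"
        "Produce a route to the goal position as a sequence of API calls."
    )
```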

23 of 33

24 of 33

We build NaVid on top of a general-purpose video-based VLM named LLaMA-VID [57]. For our proposed NaVid, we inherit the main architecture of LLaMA-VID and incorporate task-specific designs on top of it, to facilitate the transfer of general knowledge to VLN-CE and make its generalization challenges more readily solvable.

As illustrated in Fig. 2, NaVid consists of a vision encoder, a query generator, an LLM, and two cross-modality projectors. Given the observations up to time t, i.e., a video sequence comprising t frames, we encode this video into a sequence of tokens via the vision encoder (EVA-CLIP [92] in implementation) and project them to a space aligned with language tokens. For brevity, we call the projected tokens observation tokens. As is common, the instructions are also tokenized into a set of tokens, called instruction tokens. We concatenate the observation tokens and instruction tokens and send them to the LLM to infer the VLN actions in linguistic form. Note that our work focuses on task-specific modeling rather than model architecture, as detailed in the following.
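
Below is a rough sketch of the token flow described above. The small placeholder modules stand in for EVA-CLIP, the cross-modality projectors, and the LLM; the toy dimensions and class names are assumptions, not NaVid's implementation.

```python
# Data-flow sketch only: video frames -> observation tokens, instruction ->
# instruction tokens, concatenation -> LLM -> next-token logits (action as text).
# All modules are toy stand-ins for NaVid's actual components.
import torch
import torch.nn as nn


class NavVLMSketch(nn.Module):
    def __init__(self, d_vis=256, d_lm=512, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, d_vis)  # stand-in for EVA-CLIP
        self.obs_projector = nn.Linear(d_vis, d_lm)            # project to the language space
        self.word_embed = nn.Embedding(vocab, d_lm)
        self.llm = nn.TransformerEncoder(                      # stand-in for the LLM
            nn.TransformerEncoderLayer(d_lm, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_lm, vocab)

    def forward(self, frames, instruction_ids):
        # frames: (B, T, 3, 224, 224) observations up to time t; instruction_ids: (B, L)
        frame_feats = self.vision_encoder(frames.flatten(2))   # (B, T, d_vis)
        obs_tokens = self.obs_projector(frame_feats)           # observation tokens
        instr_tokens = self.word_embed(instruction_ids)        # instruction tokens
        x = torch.cat([obs_tokens, instr_tokens], dim=1)       # concatenate and feed the LLM
        return self.lm_head(self.llm(x)[:, -1])                # logits for the next (action) token
```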

25 of 33

26 of 33

BehAV is structured into four key components:

1. Human Instruction Decomposition;

2. Behavioral Cost Map Generation;

3. Visual Landmark Estimation;

4. Behavior-Aware Planning

27 of 33

  1. Human Instruction Decomposition Using LLMs

Behavioral Action Costs as Conditional Probabilities:
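
A hedged sketch of this idea: assume the LLM returns a structured decomposition in which each behavioral rule carries a probability-like cost in [0, 1], read as P(the action should be avoided | instruction). The JSON schema and field names are illustrative, not the paper's exact format.

```python
# Illustrative decomposition of a human instruction into a navigation target
# and behavioral rules with probability-like action costs. The schema is a
# hypothetical example, not BehAV's exact output format.
import json

llm_output = json.loads('''
{
  "navigation_target": "the red door at the end of the corridor",
  "behavioral_rules": [
    {"target": "grass",     "action": "walk on",  "cost": 0.9},
    {"target": "crosswalk", "action": "cross at", "cost": 0.1}
  ]
}
''')

# High cost ~ high conditional probability that the action violates the instruction.
action_costs = {(r["target"], r["action"]): r["cost"] for r in llm_output["behavioral_rules"]}
print(action_costs)
```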

28 of 33

2. Behavioral Cost Map Construction

1) Segmentation Maps for Behavioral Targets:

2) Combining Segmentation Maps with Behavioral Action Costs:

3) Generating the Behavioral Cost Map (C_behav):
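
A small sketch of how cost-weighted segmentation masks could be aggregated into C_behav is given below; the elementwise-maximum rule and the toy map size are illustrative choices, not necessarily the paper's exact construction.

```python
# Combine per-target segmentation masks with behavioral action costs into a
# single behavioral cost map C_behav (illustrative aggregation rule).
import numpy as np

H, W = 64, 64
seg_masks = {                                    # binary masks, one per behavioral target
    "grass": np.zeros((H, W)),
    "crosswalk": np.zeros((H, W)),
}
seg_masks["grass"][10:30, 10:30] = 1.0
seg_masks["crosswalk"][40:60, 40:60] = 1.0

action_costs = {"grass": 0.9, "crosswalk": 0.1}  # from the instruction decomposition step

# C_behav(x) = max over targets of cost(target) * mask_target(x)
C_behav = np.zeros((H, W))
for target, mask in seg_masks.items():
    C_behav = np.maximum(C_behav, action_costs[target] * mask)
```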

29 of 33

3. Visual Landmark Estimation

30 of 33

4. Behavior-Aware Planning

Our unconstrained model predictive control (MPC) planner leverages the trajectory parameterization z = (r, θ, δ, v_max) detailed in Section III-B to optimize a novel objective function that incorporates behavior-aware navigation.
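
A minimal sampling-based sketch of one planning step over z = (r, θ, δ, v_max) is shown below. The arc-shaped rollout and the objective (distance to goal plus accumulated behavioral cost along the path) are assumptions for illustration, not the paper's actual parameterization or objective function.

```python
# Sampling-based sketch of one MPC step over z = (r, theta, delta, v_max).
# The rollout and the objective are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
C_behav = np.zeros((H, W))                    # behavioral cost map from the previous step
start, goal = np.array([5.0, 5.0]), np.array([55.0, 55.0])


def rollout(z, n_steps=20):
    r, theta, delta, v_max = z                # v_max would shape the speed profile; the
    headings = theta + delta * np.linspace(0.0, 1.0, n_steps)  # geometric score ignores it
    steps = (r / n_steps) * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return start + np.cumsum(steps, axis=0)   # (n_steps, 2) waypoints


def objective(z):
    traj = np.clip(rollout(z), 0, H - 1)
    goal_term = np.linalg.norm(traj[-1] - goal)
    behav_term = C_behav[traj[:, 0].astype(int), traj[:, 1].astype(int)].sum()
    return goal_term + behav_term


candidates = rng.uniform([1.0, -np.pi, -1.0, 0.1], [30.0, np.pi, 1.0, 1.0], size=(256, 4))
z_best = min(candidates, key=objective)       # best parameter vector for this control step
```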

31 of 33

32 of 33

33 of 33