1 of 33

Vision-and-Language Navigation

Embodied Agents book club of the Cognitive AI Systems Lab, AIRI

Бакаева Ева 18.10.24

2 of 33

Behavior planning with pretrained language models and feedback

3 of 33

Outline

  1. What the vision-and-language navigation (VLN) task is
  2. Classification of VLN tasks
  3. Main components of models for solving the VLN task
  4. Benchmarks for evaluating success on the VLN task
  5. Recent papers on approaches to solving VLN tasks

4 of 33

What is the general VLN task?

Building a system that can navigate a 3D environment using natural-language instructions and visual information

  1. https://arxiv.org/pdf/2405.07060v1
  2. https://arxiv.org/pdf/2409.15451
  3. https://arxiv.org/pdf/2409.16484

5 of 33

Classification of VLN tasks

  1. Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. arXiv:2203.12667v3 [cs.CV], 3 Jun 2022
  2. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. arXiv:2004.02857v2 [cs.CV], 1 May 2020

6 of 33

Main components of systems for solving VLN tasks

7 of 33

Main VLN benchmarks: R2R

input:

output:

baseline: LSTM-based sequence-to-sequence architecture with an attention mechanism
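
For illustration, below is a minimal PyTorch-style sketch of an attention-based sequence-to-sequence follower of this kind. The module names, the toy dimensions, and the 2048-d per-step visual features are assumptions for the sketch, not details taken from the R2R baseline itself.

```python
# Minimal sketch of an attention-based seq2seq navigation baseline
# (illustrative hyperparameters; not the R2R paper's implementation).
import torch
import torch.nn as nn


class Seq2SeqFollower(nn.Module):
    def __init__(self, vocab_size=1000, n_actions=6, d=256, d_img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)   # instruction encoder
        self.decoder_cell = nn.LSTMCell(d + d_img, d)    # attended text + image feature
        self.attn = nn.Linear(d, d)
        self.action_head = nn.Linear(2 * d, n_actions)

    def forward(self, instr_tokens, visual_feats):
        # instr_tokens: (B, L) word ids; visual_feats: (B, T, d_img) per-step image features
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))               # ctx: (B, L, d)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(visual_feats.size(1)):
            # attend over instruction words with the current decoder state
            scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)  # (B, L)
            attended = torch.bmm(scores.softmax(1).unsqueeze(1), ctx).squeeze(1)
            h, c = self.decoder_cell(torch.cat([attended, visual_feats[:, t]], 1), (h, c))
            logits.append(self.action_head(torch.cat([h, attended], 1)))
        return torch.stack(logits, dim=1)                                  # (B, T, n_actions)
```

In training, the distribution over actions at each step is typically supervised with the ground-truth (teacher-forced) trajectory.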

8 of 33

Main VLN benchmarks: RxR

input:

output:

baseline: an instruction CNN encoder and a sequential LSTM decoder that computes a distribution over actions at each step

9 of 33

10 of 33

Main VLN benchmarks: REVERIE

input:

output:

baseline: Interactive Navigator-Pointer model

11 of 33

12 of 33

Main VLN benchmarks: CVDN

input:

output: navigation actions that bring the agent closer to the goal location G_j, starting from the terminal node of N_{i−1}

baseline: a sequence-to-sequence model with an LSTM encoder that takes in learnable token embeddings (LE) of the dialog history; the encoder conditions an LSTM decoder that predicts navigation actions from fixed ResNet embeddings of the visual environment frames

13 of 33

Main VLN benchmarks: Touchdown

input: a state s ∈ S is a pair (I, α), where I is a panorama and α is the heading angle indicating the agent's heading; the agent receives a navigation instruction x_n and a start state s_1

output:

baseline: LINGUNET

14 of 33

15 of 33

Main VLN benchmarks: VLNA

input:

output: navigation actions that bring the agent closer to the goal location

baseline: LSTM-based

16 of 33

Main VLN benchmarks: FAO

input:

output:

difference from REVERIE: any starting point + polar coordinates for bounding box

17 of 33

18 of 33

Main VLN benchmarks: comparison

19 of 33

New VLN benchmarks: CoWs on PASTURE

input:

output: the episode is successful if the agent is within c units of the target object o and o meets a visibility criterion
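
For clarity, a minimal sketch of this success check is given below; Object3D and its visible_from_agent flag are hypothetical stand-ins for the benchmark's own geometry and visibility tests.

```python
# Illustrative success check: the agent must end within c units of the target
# object o, and o must meet the visibility criterion. Object3D and its
# visible_from_agent flag are hypothetical stand-ins, not benchmark code.
from dataclasses import dataclass
import math


@dataclass
class Object3D:
    x: float
    y: float
    z: float
    visible_from_agent: bool  # result of the benchmark's visibility test


def episode_success(agent_xyz: tuple[float, float, float], o: Object3D, c: float = 1.0) -> bool:
    dx, dy, dz = agent_xyz[0] - o.x, agent_xyz[1] - o.y, agent_xyz[2] - o.z
    return math.sqrt(dx * dx + dy * dy + dz * dz) <= c and o.visible_from_agent
```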

20 of 33

21 of 33

baseline: CLIP on Wheels (CoW)

22 of 33

New VLN benchmarks: Memory-Maze

input:

  1. a natural-language instruction;
  2. the environment input, which includes details obtained from sensors;
  3. an API specification: the commands the agent can use and their explanations, in Python (a minimal sketch of such a specification is given below);
  4. the API implementation, i.e. the actual implementation of the API specification;
  5. the initial orientation of the robot

output: a route to the goal position

baseline:
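
A minimal sketch of how these inputs could be assembled into a prompt for an LLM agent is shown below. The API functions move, turn, and get_lidar are hypothetical examples of "commands and their explanations in Python", not the benchmark's actual specification.

```python
# Illustrative assembly of Memory-Maze-style inputs into a single LLM prompt.
# The API below is a hypothetical example, not the benchmark's actual spec.
API_SPEC = '''
def move(distance_m: float) -> None:
    """Drive the robot forward by distance_m meters."""

def turn(angle_deg: float) -> None:
    """Rotate the robot in place by angle_deg degrees (positive = left)."""

def get_lidar() -> list[float]:
    """Return the current distance readings from the robot's lidar."""
'''


def build_prompt(instruction: str, sensor_summary: str, initial_orientation_deg: float) -> str:
    return (
        f"Instruction: {instruction}\n"
        f"Environment (from sensors): {sensor_summary}\n"
        f"Initial orientation: {initial_orientation_deg} degrees\n"
        f"Available API:\n{API_SPEC}\n"
        "Produce a route to the goal position as a sequence of API calls."
    )
```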

23 of 33

24 of 33

We build NaVid on top of a general-purpose video-based VLM named LLaMA-VID [57]. For our proposed NaVid, we inherit the main architecture of LLaMA-VID and incorporate task-specific designs on top of it, to facilitate the transfer of general knowledge to VLN-CE and make its generalization challenges more readily solvable.

As illustrated in Fig. 2, NaVid consists of a vision encoder, a query generator, an LLM, and two cross-modality projectors. Given the observations up to time t, i.e., a video sequence comprising t frames, we encode this video into a sequence of tokens via the vision encoder (EVA-CLIP [92] in implementation) and project them to a space aligned with language tokens. For brevity, we call the projected tokens observation tokens. As is common, the instructions are also tokenized into a set of tokens, called instruction tokens. We concatenate the observation tokens and instruction tokens and send them to the LLM to infer the VLN actions in linguistic form. Note that our work focuses on task-specific modeling rather than model architecture, as detailed in the following.
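
Below is a rough sketch of the token flow described above. The small placeholder modules stand in for EVA-CLIP, the cross-modality projectors, and the LLM; the toy dimensions and class names are assumptions, not NaVid's implementation.

```python
# Data-flow sketch only: video frames -> observation tokens, instruction ->
# instruction tokens, concatenation -> LLM -> next-token logits (action as text).
# All modules are toy stand-ins for NaVid's actual components.
import torch
import torch.nn as nn


class NavVLMSketch(nn.Module):
    def __init__(self, d_vis=256, d_lm=512, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, d_vis)  # stand-in for EVA-CLIP
        self.obs_projector = nn.Linear(d_vis, d_lm)            # project to the language space
        self.word_embed = nn.Embedding(vocab, d_lm)
        self.llm = nn.TransformerEncoder(                      # stand-in for the LLM
            nn.TransformerEncoderLayer(d_lm, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_lm, vocab)

    def forward(self, frames, instruction_ids):
        # frames: (B, T, 3, 224, 224) observations up to time t; instruction_ids: (B, L)
        frame_feats = self.vision_encoder(frames.flatten(2))   # (B, T, d_vis)
        obs_tokens = self.obs_projector(frame_feats)           # observation tokens
        instr_tokens = self.word_embed(instruction_ids)        # instruction tokens
        x = torch.cat([obs_tokens, instr_tokens], dim=1)       # concatenate and feed the LLM
        return self.lm_head(self.llm(x)[:, -1])                # logits for the next (action) token
```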

25 of 33

26 of 33

BehAV is structured into four key components:

1. Human Instruction Decomposition;

2. Behavioral Cost Map Generation;

3. Visual Landmark Estimation;

4. Behavior-Aware Planning

27 of 33

  1. Human Instruction Decomposition Using LLMs

Behavioral Action Costs as Conditional Probabilities:
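
A hedged sketch of this idea: assume the LLM returns a structured decomposition in which each behavioral rule carries a probability-like cost in [0, 1], read as P(the action should be avoided | instruction). The JSON schema and field names are illustrative, not the paper's exact format.

```python
# Illustrative decomposition of a human instruction into a navigation target
# and behavioral rules with probability-like action costs. The schema is a
# hypothetical example, not BehAV's exact output format.
import json

llm_output = json.loads('''
{
  "navigation_target": "the red door at the end of the corridor",
  "behavioral_rules": [
    {"target": "grass",     "action": "walk on",  "cost": 0.9},
    {"target": "crosswalk", "action": "cross at", "cost": 0.1}
  ]
}
''')

# High cost ~ high conditional probability that the action violates the instruction.
action_costs = {(r["target"], r["action"]): r["cost"] for r in llm_output["behavioral_rules"]}
print(action_costs)
```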

28 of 33

2. Behavioral Cost Map Construction

1) Segmentation Maps for Behavioral Targets:

2) Combining Segmentation Maps with Behavioral Action Costs:

3) Generating the Behavioral Cost Map (C_behav):
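
A small sketch of how cost-weighted segmentation masks could be aggregated into C_behav is given below; the elementwise-maximum rule and the toy map size are illustrative choices, not necessarily the paper's exact construction.

```python
# Combine per-target segmentation masks with behavioral action costs into a
# single behavioral cost map C_behav (illustrative aggregation rule).
import numpy as np

H, W = 64, 64
seg_masks = {                                    # binary masks, one per behavioral target
    "grass": np.zeros((H, W)),
    "crosswalk": np.zeros((H, W)),
}
seg_masks["grass"][10:30, 10:30] = 1.0
seg_masks["crosswalk"][40:60, 40:60] = 1.0

action_costs = {"grass": 0.9, "crosswalk": 0.1}  # from the instruction decomposition step

# C_behav(x) = max over targets of cost(target) * mask_target(x)
C_behav = np.zeros((H, W))
for target, mask in seg_masks.items():
    C_behav = np.maximum(C_behav, action_costs[target] * mask)
```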

29 of 33

3. Visual Landmark Estimation

30 of 33

4. Behavior-Aware Planning

Our unconstrained model predictive control (MPC) planner leverages the trajectory parameterization z = (r, θ, δ, v_max) detailed in Section III-B to optimize a novel objective function that incorporates behavior-aware navigation.
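
A minimal sampling-based sketch of one planning step over z = (r, θ, δ, v_max) is shown below. The arc-shaped rollout and the objective (distance to goal plus accumulated behavioral cost along the path) are assumptions for illustration, not the paper's actual parameterization or objective function.

```python
# Sampling-based sketch of one MPC step over z = (r, theta, delta, v_max).
# The rollout and the objective are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
C_behav = np.zeros((H, W))                    # behavioral cost map from the previous step
start, goal = np.array([5.0, 5.0]), np.array([55.0, 55.0])


def rollout(z, n_steps=20):
    r, theta, delta, v_max = z                # v_max would shape the speed profile; the
    headings = theta + delta * np.linspace(0.0, 1.0, n_steps)  # geometric score ignores it
    steps = (r / n_steps) * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return start + np.cumsum(steps, axis=0)   # (n_steps, 2) waypoints


def objective(z):
    traj = np.clip(rollout(z), 0, H - 1)
    goal_term = np.linalg.norm(traj[-1] - goal)
    behav_term = C_behav[traj[:, 0].astype(int), traj[:, 1].astype(int)].sum()
    return goal_term + behav_term


candidates = rng.uniform([1.0, -np.pi, -1.0, 0.1], [30.0, np.pi, 1.0, 1.0], size=(256, 4))
z_best = min(candidates, key=objective)       # best parameter vector for this control step
```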

31 of 33

32 of 33

33 of 33