1 of 35

Are We There Yet? Learning to Localize in Embodied Instruction Following

Shane Storks*^, Qiaozi Gao*, Govind Thattai*, & Gokhan Tur*

Hybrid AI @ AAAI 2021

*Amazon Alexa AI

^University of Michigan

2 of 35

Motivation

  • Mobile robots are being widely adopted for completing various pre-programmed and demonstrated tasks
  • Embodied task learning: How can we teach a robot how to complete a new task using language?
    • Requires navigation and object manipulation in a physical space
    • Requires grounding language to visual inputs and primitive actions
    • Combines language, vision, and robotics
  • How can we best harness the rich features of the environment and the agent's capabilities to guide navigation?


3 of 35

Related Work

  • Language, vision, and robotics
    • Embodied question answering (Das et al., 2018)
    • Remote object grounding (Qi et al., 2020)
    • Robotic motion planning (Xia et al., 2020)
    • Vision-and-language navigation (Anderson et al., 2018)
    • Embodied task learning (Shridhar et al., 2020)
      • Action Learning From Realistic Environments and Directives (ALFRED)

Das, A. et al. (2018). Embodied Question Answering. In CVPR 2018.

Qi, Y. et al. (2020). REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In CVPR 2020.

Xia, F. et al. (2020). Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments. In IEEE Robotics and Automation Letters 5(2): 713-720.

Anderson, P. et al. (2018). Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In CVPR 2018.

Shridhar, M. et al. (2020). ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR 2020.


8 of 35


An instance of ALFRED consists of 3 units:

  1. Goal G
  2. Subgoals g1, g2, …, gN
    • Navigation
    • Object manipulation
      • Pick up object
      • Put down object
      • Clean object
  3. Actions a1, a2, …, aT
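As a concrete illustration, a minimal Python sketch of how one ALFRED instance might be represented (names and fields are hypothetical, not ALFRED's actual data format):

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Action:
    name: str                          # e.g. "MoveAhead", "PickupObject"
    mask: Optional[np.ndarray] = None  # pixel mask for interaction actions only

@dataclass
class Subgoal:
    kind: str          # "navigation" or a manipulation type, e.g. "CleanObject"
    instruction: str   # step-by-step language instruction for this subgoal

@dataclass
class AlfredInstance:
    goal: str                  # high-level goal G
    subgoals: List[Subgoal]    # g1, ..., gN
    actions: List[Action]      # expert action sequence a1, ..., aT
```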

9 of 35

Seq2Seq Baseline

  • The baseline model uses task inputs at each timestep to predict a primitive action
    • And (if applicable) a mask over the current visual observation to indicate the object to interact with
    • (Not pictured) Language instructions are reweighted by an attention mechanism at every timestep


[Figure: Seq2Seq baseline architecture. At each timestep t, an LSTM consumes the concatenated language instructions for the goal G and all subgoals g1–gN (LG + Lg1 + … + LgN), the previous predicted action at-1, and the visual observation vt; a Linear head predicts the action at and a DeConv head predicts the interaction mask.]
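A minimal PyTorch-style sketch of one decoder step of such a baseline; module names, dimensions, and the mask head are illustrative rather than the exact published implementation:

```python
import torch
import torch.nn as nn

class Seq2SeqStep(nn.Module):
    def __init__(self, d_lang=512, d_vis=512, d_act=64, d_hid=512, n_actions=13):
        super().__init__()
        self.attn = nn.Linear(d_hid, d_lang)            # scores instruction tokens against hidden state
        self.lstm = nn.LSTMCell(d_lang + d_vis + d_act, d_hid)
        self.action_head = nn.Linear(d_hid, n_actions)  # next primitive action
        self.mask_head = nn.Sequential(                 # small deconv head; the real model
            nn.Linear(d_hid, 64 * 7 * 7), nn.ReLU(),    # upsamples to the full frame resolution
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=4),
        )

    def forward(self, lang_enc, vis_feat, prev_act_emb, state):
        # lang_enc: (B, L, d_lang) encoded instruction tokens for G and g1..gN
        h, c = state
        scores = torch.bmm(lang_enc, self.attn(h).unsqueeze(2)).squeeze(2)       # (B, L)
        attn_lang = (lang_enc * scores.softmax(dim=1).unsqueeze(2)).sum(dim=1)   # reweighted instructions
        h, c = self.lstm(torch.cat([attn_lang, vis_feat, prev_act_emb], dim=1), (h, c))
        return self.action_head(h), self.mask_head(h), (h, c)
```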

10 of 35

Evaluation Details

  • Three granularities of inference and evaluation:
    • Goal-based
      • Can the agent achieve the goal G?
    • Subgoal-based
      • In isolation, can the agent achieve a single subgoal?
    • Action-based
      • How close is the predicted sequence of actions to the ground truth?
  • Can evaluate in rooms seen during training, or in rooms unseen during training
    • Validation seen and unseen partitions
  • ALFRED baseline: 3.6% goal success rate in seen rooms, 0.4% goal success rate in unseen rooms ☹
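For illustration, a minimal sketch of how success rate and a token-level action F1 can be computed (the official ALFRED metrics are more involved; this is an approximation for intuition):

```python
from collections import Counter

def success_rate(outcomes):
    """Percentage of episodes (or subgoals) judged successful."""
    return 100.0 * sum(outcomes) / max(len(outcomes), 1)

def action_f1(predicted, reference):
    """Token-level F1 between predicted and ground-truth action sequences."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 100.0 * 2 * precision * recall / (precision + recall)
```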


11 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


12 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


13 of 35

Key Limitations

The sequence of actions is long: the model must predict a long sequence of actions from a long sequence of text.


Goal: “Rinse off a mug and place it in the coffee maker.”

  • “Walk to the coffee maker on the right.” → MoveAhead, MoveAhead, MoveAhead, MoveAhead, RotateRight
  • “Pick up the dirty mug from the coffee maker.” → PickupObject(mug)
  • “Turn and walk to the sink.” → RotateLeft, RotateLeft, MoveAhead, MoveAhead
  • “Wash the mug in the sink.” → PutObject(mug, sink), ToggleObjectOn, ToggleObjectOff, PickupObject(mug)
  • “Pick up the mug and go back to the coffee maker.” → RotateLeft, RotateLeft, MoveAhead, MoveAhead
  • “Put the clean mug in the coffee maker.” → PutObject(mug, coffee maker)
  • STOP

14 of 35

Contribution 1: Granular Training

  • Solution: break the problem down into subgoal completion (a data-construction sketch follows the results table below)


[Figure: Seq2Seq baseline vs. granular training. The baseline LSTM maps the goal plus all N subgoal instructions to the full action sequence for the goal; under granular training, the LSTM maps each subgoal instruction (Subgoal 1, …, Subgoal N) to the actions for that subgoal alone.]

Model             | Val. Seen Action F1 (%) | Avg. Subgoal Success Rate (%)
ALFRED Baseline   | 84.5                    | 25.8
Granular Training | 91.6                    | 32.2
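A minimal sketch of the granular training data construction, reusing the hypothetical AlfredInstance structure from above and assuming each expert action is annotated with the index of the subgoal it belongs to (ALFRED's plan annotations provide such an alignment):

```python
def split_into_subgoal_examples(instance, subgoal_index_per_action):
    """Turn one (goal, subgoals, actions) trajectory into N smaller
    (subgoal instruction -> subgoal actions) training examples."""
    examples = []
    for i, subgoal in enumerate(instance.subgoals):
        actions = [a for a, idx in zip(instance.actions, subgoal_index_per_action)
                   if idx == i]
        examples.append({"instruction": subgoal.instruction, "actions": actions})
    return examples
```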

15 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


16 of 35

Key Limitations

Navigation performance is a bottleneck for overall performance.

Success rate on navigation subgoals is low relative to some other subgoal types. Why?

    • The agent is not explicitly trained to ground language during navigation.
    • The agent doesn’t learn to explore.


"Turn right and walk to the sink next to the bathtub."


17 of 35

Augmented Navigation

  • To resolve these issues, we generate new inputs for navigation:
    1. Extra segmentation masks
    2. Panoramic image observations
  • Use this data to enable the following behaviors during navigation:
    • Explicit object recognition
    • “Looking around” before taking a step


18 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


19 of 35

Additional Masks

  • The base dataset only includes segmentation masks for objects that the agent must manipulate
  • Collect masks for every visible object at every timestep
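A minimal sketch of pulling these masks from the simulator, assuming an AI2-THOR build that exposes per-object instance segmentation on the step event (the attribute names below follow recent AI2-THOR releases and are an assumption for the exact version used here):

```python
from ai2thor.controller import Controller

# Ask the simulator to render instance segmentation alongside the RGB frame.
controller = Controller(renderInstanceSegmentation=True)
event = controller.step(action="MoveAhead")

# Per-object masks and boxes for everything visible at this timestep.
masks_at_t = {obj_id: mask for obj_id, mask in event.instance_masks.items()}       # HxW bool masks
boxes_at_t = {obj_id: box for obj_id, box in event.instance_detections2D.items()}  # (x1, y1, x2, y2)
```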


20 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


21 of 35

Panoramic Image Observations

  • Performance gains have come in vision-and-language navigation (VLN) from using panoramic visual inputs
  • Training: we collect images at 8 view angles for every timestep of navigation
    • Built-in exploratory behavior
  • Inference: force the agent to “look around” 360 degrees before taking each step during navigation
    • At a cost of extra predicted actions
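A minimal sketch of the “look around” behavior, written against a hypothetical environment wrapper with rotate and frame-grab methods (the real agent issues the simulator's discrete rotation actions, which is where the extra predicted actions come from):

```python
def look_around(env, num_views=8):
    """Collect a panoramic observation by rotating in place.

    Rotates the agent through 360 degrees in equal increments, recording
    one frame per heading, and ends facing the original direction.
    `env.get_frame()` and `env.rotate(degrees)` are hypothetical wrappers.
    """
    step = 360 // num_views
    frames = []
    for _ in range(num_views):
        frames.append(env.get_frame())
        env.rotate(step)   # each rotation counts against the episode's action budget
    return frames          # 8 views covering the full 360 degrees
```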


22 of 35

Panoramic Image Observations

  • Train a model (only on navigation subgoals) with these panoramic images as input rather than a single POV image
    • Small performance improvement


Model                                   | Val. Seen Navigation Subgoal Success Rate (%)
ALFRED Baseline                         | 31.0
Granular Training                       | 30.0
Granular Training + Look Around in Nav. | 35.3

23 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


24 of 35

Introducing Object Detection

  • Using newly generated masks, train an object detection model
  • If we add this to the pipeline, the agent can explicitly identify any object it sees
    • (even in panoramic observations)
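A minimal sketch of converting the collected instance masks into bounding-box labels on which a detector (e.g., YOLO) can be trained; the helper names are illustrative:

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert a boolean HxW instance mask to an (x1, y1, x2, y2) box."""
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None  # object occupies no pixels in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def detection_labels(instance_masks, class_of):
    """Build (class name, box) labels from {objectId: mask}.

    `class_of` maps a simulator objectId (e.g. "Mug|+01.20|...") to its
    object class (e.g. "Mug").
    """
    labels = []
    for obj_id, mask in instance_masks.items():
        box = mask_to_bbox(mask)
        if box is not None:
            labels.append((class_of(obj_id), box))
    return labels
```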


25 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


26 of 35

Oracle Angle Tracking

  • In the granularly trained model, the agent loses the ability to look ahead in the instructions.
    • When navigating to the counter, the agent doesn't know that it will need a knife during the next subgoal
  • During navigation, we enable the agent to track the relative location of the precise navigation goal
    • Angle to the goal location (a sketch of this computation follows below)
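A minimal sketch of the oracle angle computation, assuming ground-truth agent and goal poses from the simulator (the x/z ground plane and clockwise heading convention follow AI2-THOR, but are an assumption here):

```python
import math

def angle_to_goal(agent_x, agent_z, agent_heading_deg, goal_x, goal_z):
    """Relative angle from the agent's current heading to the navigation goal.

    Returned as (sin, cos) so the target is continuous, with no jump
    at +/-180 degrees.
    """
    bearing = math.degrees(math.atan2(goal_x - agent_x, goal_z - agent_z))
    relative = math.radians(bearing - agent_heading_deg)
    return math.sin(relative), math.cos(relative)
```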


27 of 35


Model                                     | Navigation Subgoal Success Rate (%) | Goal Condition Success Rate (%)
ALFRED Baseline                           | 31.0                                | 1.6
Granular Training                         | 30.0                                | 1.3
Granular Training + Oracle Angle Tracking | 67.8                                | 2.8

28 of 35

Predicting the Angle

  • How can we predict this angle fairly, without oracle information, while still achieving such high performance?
  • We combine all the work so far into a localizer module:
    • Inputs at each timestep:
      • Panoramic bounding box information (coordinates and labels)
      • Current and next subgoal language instructions
    • Output:
      • Angle dt to goal (sine and cosine)


29 of 35

Projecting Bounding Boxes to 3D Space
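A minimal sketch of one way such a projection can be done: back-project a detected box center to a 3D point using the depth map and a pinhole camera model, then rotate it into the agent's frame by the heading of the view it was detected in (field of view, image size, and axis conventions below are assumptions):

```python
import math

def box_center_to_3d(box, depth, view_heading_deg, fov_deg=90.0, width=300, height=300):
    """Project a 2D box center into agent-centric 3D coordinates.

    box: (x1, y1, x2, y2) in pixels; depth: HxW depth map in meters;
    view_heading_deg: heading of the view the box was detected in,
    relative to the agent's forward direction.
    """
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)  # focal length in pixels
    d = float(depth[int(v), int(u)])
    # Camera frame: x right, y down, z forward.
    x_cam = (u - width / 2.0) * d / f
    y_cam = (v - height / 2.0) * d / f
    # Rotate about the vertical axis by the view heading into the agent's frame.
    theta = math.radians(view_heading_deg)
    x = math.cos(theta) * x_cam + math.sin(theta) * d
    z = -math.sin(theta) * x_cam + math.cos(theta) * d
    return x, y_cam, z
```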


30 of 35

Transformer-based Angular Prediction


[Figure: Localizer input and output. BERT receives [CLS], a spatial token per detected object (built from YOLO bounding-box coordinates in panoramic space plus class labels such as “bread”, “knife”, “sink”), [SEP], the word tokens of the current and next subgoal instructions (“Walk to …”), and a final [SEP]; the model outputs the goal angle dt.]
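A minimal sketch of this kind of localizer, simplified so that each detection is serialized into text before being fed to BERT together with the current and next instructions; the actual model builds dedicated spatial tokens from the raw box coordinates and class labels, so treat the serialization and head below as assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class Localizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, 2)  # (sin, cos) of d_t

    def forward(self, detections, instructions):
        # Serialize each detection as "<label> x1 y1 x2 y2" in panoramic coordinates.
        spatial = " ".join(f"{label} {x1} {y1} {x2} {y2}"
                           for label, (x1, y1, x2, y2) in detections)
        enc = self.tokenizer(spatial, instructions, return_tensors="pt", truncation=True)
        out = self.bert(**enc)
        sin_cos = self.head(out.last_hidden_state[:, 0])        # [CLS] representation
        return sin_cos / sin_cos.norm(dim=-1, keepdim=True)     # keep (sin, cos) on the unit circle

# Example: two detections plus the current and next subgoal instructions.
model = Localizer()
pred = model([("mug", (12, 40, 55, 90)), ("sink", (130, 35, 210, 120))],
             "Turn and walk to the sink. Wash the mug in the sink.")
```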

31 of 35

BERT-based Angular Prediction


Model                          | Training Avg. Absolute Error | Val. Seen Avg. Absolute Error
Feedforward                    | [0.27, 0.32]                 | [0.25, 0.30]
Feedforward + Dest. Prediction | [0.37, 0.43]                 | [0.34, 0.43]
Multimodal BERT                | [0.061, 0.069]               | [0.32, 0.38]

32 of 35

Full Structure


33 of 35

(all results on stack & place task type)


Model                                    | Action F1 (%)             | Navigation Subgoal Success Rate (%) | Goal Condition Success Rate (%)
                                         | Val. Seen / Val. Unseen   | Val. Seen / Val. Unseen             | Val. Seen / Val. Unseen
Baseline                                 | 84.5 / 75.6               | 31.0 / 27.5                         | 1.6 / 0.0
Granular Training                        | 91.6 / 85.3               | 30.0 / 26.5                         | 1.3 / 0.0
Granular Training + Oracle Goal Angle    | 93.9 / 86.9               | 67.8 / 35.4                         | 2.8 / 0.0
Granular Training + BERT-Based Localizer | 93.8 / 88.7               | 25.4 / 28.8                         | 1.4 / 0.0


34 of 35

Summary

  1. Granular training with subgoals improved performance of action prediction
  2. Augmented inputs combined with object detection gave the agent new capabilities during navigation
    1. “Looking around”
    2. Identifying objects explicitly
  3. Used these capabilities to enable spatial tracking in the model, improving action prediction and navigation performance


35 of 35

Questions?

Thank you!


@shanestorks