Are We There Yet? Learning to Localize in Embodied Instruction Following
Shane Storks*^, Qiaozi Gao*, Govind Thattai*, & Gokhan Tur*
Hybrid AI @ AAAI 2021
*Amazon Alexa AI
^University of Michigan
Motivation
Related Work
Das, A. et al. (2018). Embodied Question Answering. In CVPR 2018.
An instance of ALFRED consists of 3 units:
Seq2Seq Baseline
[Figure: seq2seq baseline architecture. Inputs: language instructions for goal G and all subgoals g1 - gN (LG + Lg1 + … + LgN), the previous predicted action at-1, and the visual observation vt. Components: LSTM, Linear, DeConv. Output: predicted action at and mask.]
Evaluation Details
Project Contributions
Key Limitations
Sequence of actions is long. The model must predict a long sequence of actions from a long sequence of text.
“Rinse off a mug and place it in the coffee maker.”
“Walk to the coffee maker on the right.”
“Pick up the dirty mug from the coffee maker.”
“Turn and walk to the sink.”
“Wash the mug in the sink.”
“Pick up the mug and go back to the coffee maker.”
“Put the clean mug in the coffee maker.”
MoveAhead, MoveAhead, MoveAhead, MoveAhead, RotateRight
PickupObject(mug)
RotateLeft, RotateLeft, MoveAhead, MoveAhead
PutObject(mug, sink), ToggleObjectOn, ToggleObjectOff, PickupObject(mug)
RotateLeft, RotateLeft, MoveAhead, MoveAhead
PutObject(mug, coffee maker)
STOP
Contribution 1: Granular Training
[Figure: seq2seq baseline vs. granular training. Seq2seq baseline: the LSTM takes the goal + N subgoals and predicts all actions to achieve the goal. Granular training: the LSTM takes one subgoal at a time (subgoal 1, subgoal 2, …, subgoal N) and predicts only the actions for that subgoal.]
Model | Val. Seen Action F1 (%) | Avg. Subgoal Success Rate (%) |
ALFRED Baseline | 84.5 | 25.8 |
Granular Training | 91.6 | 32.2 |
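The difference between the two training setups can be illustrated with the mug example from the Key Limitations slide. The sketch below is a minimal illustration, not the authors' data pipeline: the function names `baseline_example` and `granular_examples` are hypothetical, and whether each per-subgoal sequence gets its own end-of-sequence token is not stated in the slides, so none is added here.

```python
from typing import List, Tuple

# Goal and subgoal instructions with their aligned low-level actions,
# copied from the example on the Key Limitations slide.
GOAL = "Rinse off a mug and place it in the coffee maker."
SUBGOALS: List[Tuple[str, List[str]]] = [
    ("Walk to the coffee maker on the right.",
     ["MoveAhead", "MoveAhead", "MoveAhead", "MoveAhead", "RotateRight"]),
    ("Pick up the dirty mug from the coffee maker.",
     ["PickupObject(mug)"]),
    ("Turn and walk to the sink.",
     ["RotateLeft", "RotateLeft", "MoveAhead", "MoveAhead"]),
    ("Wash the mug in the sink.",
     ["PutObject(mug, sink)", "ToggleObjectOn", "ToggleObjectOff",
      "PickupObject(mug)"]),
    ("Pick up the mug and go back to the coffee maker.",
     ["RotateLeft", "RotateLeft", "MoveAhead", "MoveAhead"]),
    ("Put the clean mug in the coffee maker.",
     ["PutObject(mug, coffee maker)"]),
]


def baseline_example(goal: str,
                     subgoals: List[Tuple[str, List[str]]]) -> Tuple[str, List[str]]:
    """Seq2seq baseline: one long pair (all text -> all actions + STOP)."""
    text = " ".join([goal] + [instruction for instruction, _ in subgoals])
    actions = [a for _, acts in subgoals for a in acts] + ["STOP"]
    return text, actions


def granular_examples(
        subgoals: List[Tuple[str, List[str]]]) -> List[Tuple[str, List[str]]]:
    """Granular training: one short pair per subgoal (assumed split)."""
    return [(instruction, list(acts)) for instruction, acts in subgoals]


if __name__ == "__main__":
    _, all_actions = baseline_example(GOAL, SUBGOALS)
    print("baseline target length:", len(all_actions))
    for instruction, acts in granular_examples(SUBGOALS):
        print(len(acts), "actions <-", instruction)
```

Under this split, the longest target sequence the model sees shrinks from the full trajectory to the longest single subgoal.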
Project Contributions
Key Limitations
Navigation performance is a bottleneck for overall performance.
Success rate on navigation subgoals is low relative to some other subgoal types. Why?
"Turn right and walk to the sink next to the bathtub."
Augmented Navigation
Project Contributions
Additional Masks
Project Contributions
Panoramic Image Observations
Model | Val. Seen Navigation Subgoal Success Rate (%) |
ALFRED Baseline | 31.0 |
Granular Training | 30.0 |
Granular Training + Look Around in Nav. | 35.3 |
Project Contributions
Introducing Object Detection
Project Contributions
Oracle Angle Tracking
Model | Navigation Subgoal Success Rate (%) | Goal Condition Success Rate (%) |
ALFRED Baseline | 31.0 | 1.6 |
Granular Training | 30.0 | 1.3 |
Granular Training + Oracle Angle Tracking | 67.8 | 2.8 |
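The slides do not define how the oracle angle is computed. One plausible reading is the signed angle between the agent's current heading and the direction to the navigation subgoal's destination. The sketch below shows that computation under stated assumptions: positions are (x, z) coordinates on the floor plane, a yaw of 0 degrees faces +z, and positive yaw turns clockwise toward +x. The function name `relative_goal_angle` and this coordinate convention are illustrative and not taken from the authors' implementation.

```python
import math


def relative_goal_angle(agent_x: float, agent_z: float, agent_yaw_deg: float,
                        goal_x: float, goal_z: float) -> float:
    """Signed angle (degrees) from the agent's heading to the goal.

    Assumed convention: yaw 0 faces +z, positive yaw rotates toward +x.
    The result lies in (-180, 180]; positive means the goal is to the right.
    """
    # World-frame bearing of the goal, measured clockwise from the +z axis.
    bearing = math.degrees(math.atan2(goal_x - agent_x, goal_z - agent_z))
    # Difference to the current heading, wrapped into [0, 360).
    diff = (bearing - agent_yaw_deg) % 360.0
    # Shift into (-180, 180] so left turns come out negative.
    if diff > 180.0:
        diff -= 360.0
    return diff


if __name__ == "__main__":
    # Agent at the origin facing +z; goal one unit along +x.
    print(relative_goal_angle(0.0, 0.0, 0.0, 1.0, 0.0))
```

An angle like this could be recomputed after every action and given to the model as an extra input, which is one way to read "tracking"; the slides leave the exact mechanism open.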
Predicting the Angle
Projecting Bounding Boxes to 3D Space
Transformer-based Angular Prediction
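[Figure: BERT-based angle predictor. Input sequence: [CLS] and [SEP] delimiters, the word tokens of the current + next subgoal instructions (e.g., "Walk to … ."), and spatial tokens built from YOLO bounding box coords. (in panoramic space) + class labels (e.g., “bread”, “knife”, “sink”). Model: BERT. Output: dt.]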
BERT-based Angular Prediction
Model | Training Avg. Absolute Error | Val. Seen Avg. Absolute Error |
Feedforward | [0.27, 0.32] | [0.25, 0.30] |
Feedforward + Dest. Prediction | [0.37, 0.43] | [0.34, 0.43] |
Multimodal BERT | [0.061, 0.069] | [0.32, 0.38] |
Full Structure
(all results on stack & place task type)
Model | Action F1 (%), Val. Seen | Action F1 (%), Val. Unseen | Navigation Subgoal Success Rate (%), Val. Seen | Navigation Subgoal Success Rate (%), Val. Unseen | Goal Condition Success Rate (%), Val. Seen | Goal Condition Success Rate (%), Val. Unseen |
Baseline | 84.5 | 75.6 | 31.0 | 27.5 | 1.6 | 0.0 |
Granular Training | 91.6 | 85.3 | 30.0 | 26.5 | 1.3 | 0.0 |
Granular Training + Oracle Goal Angle | 93.9 | 86.9 | 67.8 | 35.4 | 2.8 | 0.0 |
Granular Training + BERT-Based Localizer | 93.8 | 88.7 | 25.4 | 28.8 | 1.4 | 0.0 |
best non-oracle result best overall result
Summary
Questions?
Thank you!
@shanestorks