1 of 35

Are We There Yet? Learning to Localize in Embodied Instruction Following

Shane Storks*^, Qiaozi Gao*, Govind Thattai*, & Gokhan Tur*

Hybrid AI @ AAAI 2021

*Amazon Alexa AI

^University of Michigan

2 of 35

Motivation

  • Mobile robots are being widely adopted for completing various pre-programmed and demonstrated tasks
  • Embodied task learning: How can we teach a robot how to complete a new task using language?
    • Requires navigation and object manipulation in a physical space
    • Requires grounding language to visual inputs and primitive actions
    • Combines language, vision, and robotics
  • How can we best harness the rich features of the environment and the agent's capabilities to guide navigation?


3 of 35

Related Work

  • Language, vision, and robotics
    • Embodied question answering (Das et al., 2018)
    • Remote object grounding (Qi et al., 2020)
    • Robotic motion planning (Xia et al., 2020)
    • Vision-and-language navigation (Anderson et al., 2018)
    • Embodied task learning (Shridhar et al., 2020)
      • Action Learning From Realistic Environments and Directives (ALFRED)

Das, A. et al. (2018). Embodied Question Answering. In CVPR 2018.

Qi, Y. et al. (2020). REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In CVPR 2020.

Xia, F. et al. (2020). Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments. In IEEE Robotics and Automation Letters 5(2): 713-720.

Anderson, P. et al. (2018). Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In CVPR 2018.

Shridhar, M. et al. (2020). ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR 2020.


8 of 35


An instance of ALFRED consists of 3 units:

  1. Goal G
  2. Subgoals g1, g2, …, gN
    • Navigation
    • Object manipulation
      • Pick up object
      • Put down object
      • Clean object
  3. Actions a1, a2, …, aT
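As a concrete illustration, a minimal Python sketch of how one ALFRED instance might be represented (names and fields are hypothetical, not ALFRED's actual data format):

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Action:
    name: str                          # e.g. "MoveAhead", "PickupObject"
    mask: Optional[np.ndarray] = None  # pixel mask for interaction actions only

@dataclass
class Subgoal:
    kind: str          # "navigation" or a manipulation type, e.g. "CleanObject"
    instruction: str   # step-by-step language instruction for this subgoal

@dataclass
class AlfredInstance:
    goal: str                  # high-level goal G
    subgoals: List[Subgoal]    # g1, ..., gN
    actions: List[Action]      # expert action sequence a1, ..., aT
```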

9 of 35

Seq2Seq Baseline

  • The baseline model uses task inputs at each timestep to predict a primitive action
    • And (if applicable) a mask over the current visual observation to indicate the object to interact with
    • (Not pictured) Language instructions are reweighted by an attention mechanism at every timestep


[Figure: Seq2Seq baseline architecture. At each timestep t, an LSTM consumes the concatenated language instructions for the goal G and all subgoals g1–gN (LG + Lg1 + … + LgN), the previous predicted action at-1, and the visual observation vt; a Linear head predicts the action at and a DeConv head predicts the interaction mask.]
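A minimal PyTorch-style sketch of one decoder step of such a baseline; module names, dimensions, and the mask head are illustrative rather than the exact published implementation:

```python
import torch
import torch.nn as nn

class Seq2SeqStep(nn.Module):
    def __init__(self, d_lang=512, d_vis=512, d_act=64, d_hid=512, n_actions=13):
        super().__init__()
        self.attn = nn.Linear(d_hid, d_lang)            # scores instruction tokens against hidden state
        self.lstm = nn.LSTMCell(d_lang + d_vis + d_act, d_hid)
        self.action_head = nn.Linear(d_hid, n_actions)  # next primitive action
        self.mask_head = nn.Sequential(                 # small deconv head; the real model
            nn.Linear(d_hid, 64 * 7 * 7), nn.ReLU(),    # upsamples to the full frame resolution
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=4),
        )

    def forward(self, lang_enc, vis_feat, prev_act_emb, state):
        # lang_enc: (B, L, d_lang) encoded instruction tokens for G and g1..gN
        h, c = state
        scores = torch.bmm(lang_enc, self.attn(h).unsqueeze(2)).squeeze(2)       # (B, L)
        attn_lang = (lang_enc * scores.softmax(dim=1).unsqueeze(2)).sum(dim=1)   # reweighted instructions
        h, c = self.lstm(torch.cat([attn_lang, vis_feat, prev_act_emb], dim=1), (h, c))
        return self.action_head(h), self.mask_head(h), (h, c)
```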

10 of 35

Evaluation Details

  • Three granularities of inference and evaluation:
    • Goal-based
      • Can the agent achieve the goal G?
    • Subgoal-based
      • In isolation, can the agent achieve a single subgoal?
    • Action-based
      • How close is the predicted sequence of actions to the ground truth?
  • Can evaluate in rooms seen during training, or in rooms unseen during training
    • Validation seen and unseen partitions
  • ALFRED baseline: 3.6% goal success rate in seen rooms, 0.4% goal success rate in unseen rooms ☹
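For illustration, a minimal sketch of how success rate and a token-level action F1 can be computed (the official ALFRED metrics are more involved; this is an approximation for intuition):

```python
from collections import Counter

def success_rate(outcomes):
    """Percentage of episodes (or subgoals) judged successful."""
    return 100.0 * sum(outcomes) / max(len(outcomes), 1)

def action_f1(predicted, reference):
    """Token-level F1 between predicted and ground-truth action sequences."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 100.0 * 2 * precision * recall / (precision + recall)
```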


11 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


12 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


13 of 35

Key Limitations

The sequence of actions is long: the model must predict a long sequence of actions from a long sequence of text.


Goal: “Rinse off a mug and place it in the coffee maker.”

  • “Walk to the coffee maker on the right.” → MoveAhead, MoveAhead, MoveAhead, MoveAhead, RotateRight
  • “Pick up the dirty mug from the coffee maker.” → PickupObject(mug)
  • “Turn and walk to the sink.” → RotateLeft, RotateLeft, MoveAhead, MoveAhead
  • “Wash the mug in the sink.” → PutObject(mug, sink), ToggleObjectOn, ToggleObjectOff, PickupObject(mug)
  • “Pick up the mug and go back to the coffee maker.” → RotateLeft, RotateLeft, MoveAhead, MoveAhead
  • “Put the clean mug in the coffee maker.” → PutObject(mug, coffee maker)
  • STOP

14 of 35

Contribution 1: Granular Training

  • Solution: break the problem down into subgoal completion (a data-construction sketch follows the results table below)


[Figure: Seq2Seq baseline vs. granular training. The baseline LSTM maps the goal plus all N subgoal instructions to the full action sequence for the goal; under granular training, the LSTM maps each subgoal instruction (Subgoal 1, …, Subgoal N) to the actions for that subgoal alone.]

Model             | Val. Seen Action F1 (%) | Avg. Subgoal Success Rate (%)
ALFRED Baseline   | 84.5                    | 25.8
Granular Training | 91.6                    | 32.2
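A minimal sketch of the granular training data construction, reusing the hypothetical AlfredInstance structure from above and assuming each expert action is annotated with the index of the subgoal it belongs to (ALFRED's plan annotations provide such an alignment):

```python
def split_into_subgoal_examples(instance, subgoal_index_per_action):
    """Turn one (goal, subgoals, actions) trajectory into N smaller
    (subgoal instruction -> subgoal actions) training examples."""
    examples = []
    for i, subgoal in enumerate(instance.subgoals):
        actions = [a for a, idx in zip(instance.actions, subgoal_index_per_action)
                   if idx == i]
        examples.append({"instruction": subgoal.instruction, "actions": actions})
    return examples
```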

15 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


16 of 35

Key Limitations

Navigation performance is a bottleneck for overall performance.

Success rate on navigation subgoals is low relative to some other subgoal types. Why?

    • The agent is not explicitly trained to ground language during navigation.
    • The agent doesn’t learn to explore.


"Turn right and walk to the sink next to the bathtub."


17 of 35

Augmented Navigation

  • To resolve these issues, we generate new inputs for navigation:
    1. Extra segmentation masks
    2. Panoramic image observations
  • Use this data to enable the following behaviors during navigation:
    • Explicit object recognition
    • “Looking around” before taking a step


18 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


19 of 35

Additional Masks

  • The base dataset only includes segmentation masks for objects that the agent must manipulate
  • Collect masks for every visible object at every timestep
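A minimal sketch of pulling these masks from the simulator, assuming an AI2-THOR build that exposes per-object instance segmentation on the step event (the attribute names below follow recent AI2-THOR releases and are an assumption for the exact version used here):

```python
from ai2thor.controller import Controller

# Ask the simulator to render instance segmentation alongside the RGB frame.
controller = Controller(renderInstanceSegmentation=True)
event = controller.step(action="MoveAhead")

# Per-object masks and boxes for everything visible at this timestep.
masks_at_t = {obj_id: mask for obj_id, mask in event.instance_masks.items()}       # HxW bool masks
boxes_at_t = {obj_id: box for obj_id, box in event.instance_detections2D.items()}  # (x1, y1, x2, y2)
```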


20 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


21 of 35

Panoramic Image Observations

  • Performance gains have come in vision-and-language navigation (VLN) from using panoramic visual inputs
  • Training: we collect images at 8 view angles for every timestep of navigation
    • Built-in exploratory behavior
  • Inference: force the agent to “look around” 360 degrees before taking each step during navigation
    • At a cost of extra predicted actions
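A minimal sketch of the “look around” behavior, written against a hypothetical environment wrapper with rotate and frame-grab methods (the real agent issues the simulator's discrete rotation actions, which is where the extra predicted actions come from):

```python
def look_around(env, num_views=8):
    """Collect a panoramic observation by rotating in place.

    Rotates the agent through 360 degrees in equal increments, recording
    one frame per heading, and ends facing the original direction.
    `env.get_frame()` and `env.rotate(degrees)` are hypothetical wrappers.
    """
    step = 360 // num_views
    frames = []
    for _ in range(num_views):
        frames.append(env.get_frame())
        env.rotate(step)   # each rotation counts against the episode's action budget
    return frames          # 8 views covering the full 360 degrees
```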


22 of 35

Panoramic Image Observations

  • Train a model (only on navigation subgoals) with these panoramic images as input rather than a single POV image
    • Small performance improvement


Model                                   | Val. Seen Navigation Subgoal Success Rate (%)
ALFRED Baseline                         | 31.0
Granular Training                       | 30.0
Granular Training + Look Around in Nav. | 35.3

23 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


24 of 35

Introducing Object Detection

  • Using newly generated masks, train an object detection model
  • If we add this to the pipeline, the agent can explicitly identify any object it sees
    • (even in panoramic observations)
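A minimal sketch of converting the collected instance masks into bounding-box labels on which a detector (e.g., YOLO) can be trained; the helper names are illustrative:

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert a boolean HxW instance mask to an (x1, y1, x2, y2) box."""
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None  # object occupies no pixels in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def detection_labels(instance_masks, class_of):
    """Build (class name, box) labels from {objectId: mask}.

    `class_of` maps a simulator objectId (e.g. "Mug|+01.20|...") to its
    object class (e.g. "Mug").
    """
    labels = []
    for obj_id, mask in instance_masks.items():
        box = mask_to_bbox(mask)
        if box is not None:
            labels.append((class_of(obj_id), box))
    return labels
```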


25 of 35

Project Contributions

  1. Granular training with ALFRED subgoals
  2. Augmented navigation
    1. Full coverage of object segmentation masks
    2. Panoramic visual observations
  3. Integrated object detection
  4. Enabled spatial tracking in the model


26 of 35

Oracle Angle Tracking

  • In the granularly trained model, the agent loses the ability to look ahead in the instructions.
    • When navigating to the counter, the agent doesn't know that it will need a knife during the next subgoal
  • During navigation, we enable the agent to track the relative location of the precise navigation goal
    • Angle to the goal location (a sketch of this computation follows below)
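A minimal sketch of the oracle angle computation, assuming ground-truth agent and goal poses from the simulator (the x/z ground plane and clockwise heading convention follow AI2-THOR, but are an assumption here):

```python
import math

def angle_to_goal(agent_x, agent_z, agent_heading_deg, goal_x, goal_z):
    """Relative angle from the agent's current heading to the navigation goal.

    Returned as (sin, cos) so the target is continuous, with no jump
    at +/-180 degrees.
    """
    bearing = math.degrees(math.atan2(goal_x - agent_x, goal_z - agent_z))
    relative = math.radians(bearing - agent_heading_deg)
    return math.sin(relative), math.cos(relative)
```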


27 of 35


Model                                     | Navigation Subgoal Success Rate (%) | Goal Condition Success Rate (%)
ALFRED Baseline                           | 31.0                                | 1.6
Granular Training                         | 30.0                                | 1.3
Granular Training + Oracle Angle Tracking | 67.8                                | 2.8

28 of 35

Predicting the Angle

  • How can we predict this angle fairly, without oracle information, while still achieving such high performance?
  • We combine all the work so far into a localizer module:
    • Inputs at each timestep:
      • Panoramic bounding box information (coordinates and labels)
      • Current and next subgoal language instructions
    • Output:
      • Angle dt to goal (sine and cosine)


29 of 35

Projecting Bounding Boxes to 3D Space
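A minimal sketch of one way such a projection can be done: back-project a detected box center to a 3D point using the depth map and a pinhole camera model, then rotate it into the agent's frame by the heading of the view it was detected in (field of view, image size, and axis conventions below are assumptions):

```python
import math

def box_center_to_3d(box, depth, view_heading_deg, fov_deg=90.0, width=300, height=300):
    """Project a 2D box center into agent-centric 3D coordinates.

    box: (x1, y1, x2, y2) in pixels; depth: HxW depth map in meters;
    view_heading_deg: heading of the view the box was detected in,
    relative to the agent's forward direction.
    """
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)  # focal length in pixels
    d = float(depth[int(v), int(u)])
    # Camera frame: x right, y down, z forward.
    x_cam = (u - width / 2.0) * d / f
    y_cam = (v - height / 2.0) * d / f
    # Rotate about the vertical axis by the view heading into the agent's frame.
    theta = math.radians(view_heading_deg)
    x = math.cos(theta) * x_cam + math.sin(theta) * d
    z = -math.sin(theta) * x_cam + math.cos(theta) * d
    return x, y_cam, z
```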


30 of 35

Transformer-based Angular Prediction


[Figure: Localizer input and output. BERT receives [CLS], a spatial token per detected object (built from YOLO bounding-box coordinates in panoramic space plus class labels such as “bread”, “knife”, “sink”), [SEP], the word tokens of the current and next subgoal instructions (“Walk to …”), and a final [SEP]; the model outputs the goal angle dt.]
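A minimal sketch of this kind of localizer, simplified so that each detection is serialized into text before being fed to BERT together with the current and next instructions; the actual model builds dedicated spatial tokens from the raw box coordinates and class labels, so treat the serialization and head below as assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class Localizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, 2)  # (sin, cos) of d_t

    def forward(self, detections, instructions):
        # Serialize each detection as "<label> x1 y1 x2 y2" in panoramic coordinates.
        spatial = " ".join(f"{label} {x1} {y1} {x2} {y2}"
                           for label, (x1, y1, x2, y2) in detections)
        enc = self.tokenizer(spatial, instructions, return_tensors="pt", truncation=True)
        out = self.bert(**enc)
        sin_cos = self.head(out.last_hidden_state[:, 0])        # [CLS] representation
        return sin_cos / sin_cos.norm(dim=-1, keepdim=True)     # keep (sin, cos) on the unit circle

# Example: two detections plus the current and next subgoal instructions.
model = Localizer()
pred = model([("mug", (12, 40, 55, 90)), ("sink", (130, 35, 210, 120))],
             "Turn and walk to the sink. Wash the mug in the sink.")
```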

31 of 35

BERT-based Angular Prediction


Model                          | Training Avg. Absolute Error | Val. Seen Avg. Absolute Error
Feedforward                    | [0.27, 0.32]                 | [0.25, 0.30]
Feedforward + Dest. Prediction | [0.37, 0.43]                 | [0.34, 0.43]
Multimodal BERT                | [0.061, 0.069]               | [0.32, 0.38]

32 of 35

Full Structure


33 of 35

(all results on stack & place task type)


Model                                    | Action F1 (%)             | Navigation Subgoal Success Rate (%) | Goal Condition Success Rate (%)
                                         | Val. Seen / Val. Unseen   | Val. Seen / Val. Unseen             | Val. Seen / Val. Unseen
Baseline                                 | 84.5 / 75.6               | 31.0 / 27.5                         | 1.6 / 0.0
Granular Training                        | 91.6 / 85.3               | 30.0 / 26.5                         | 1.3 / 0.0
Granular Training + Oracle Goal Angle    | 93.9 / 86.9               | 67.8 / 35.4                         | 2.8 / 0.0
Granular Training + BERT-Based Localizer | 93.8 / 88.7               | 25.4 / 28.8                         | 1.4 / 0.0


34 of 35

Summary

  1. Granular training with subgoals improved performance of action prediction
  2. Augmented inputs combined with object detection gave the agent new capabilities during navigation
    1. “Looking around”
    2. Identifying objects explicitly
  3. Used these capabilities to enable spatial tracking in the model, improving action prediction and navigation performance


35 of 35

Questions?

Thank you!


@shanestorks