1 of 34

Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks

Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
Best Paper ICRA 2019, Finalist for Best Paper in Cognitive Robotics ICRA 2019

Journal publication in IEEE Transactions on Robotics 2020

Presented by Louis Montaut
PhD student at CTU-FEL / ENS-PSL


2 of 34

Introduction: Context & Goal


3 of 34

Introduction: Task solved in this paper

Keywords:
- Contact-rich manipulation
- Representation learning & self-supervision
- Sensor fusion

Broader context: fusing vision and touch


4 of 34

Introduction: Paper’s contributions

  • System paper (from sensor data to policy, from simulation to the real robot)

  • Learning a representation with VAEs and self-supervision

  • Time-aligned multi-sensor data

  • Real-robot experiments


5 of 34

Summary

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


6 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


7 of 34

Related works

  • Optimal control in robotics: hard to incorporate raw sensor data…

  • … Let’s do deep RL?


Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.

→ Elements of optimal control: Guided Policy Search

8 of 34

Related works


Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.

9 of 34

Related works


Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramár, J., Hadsell, R., de Freitas, N., & Heess, N. (2018). Reinforcement and Imitation Learning for Diverse Visuomotor Skills. Robotics: Science and Systems (RSS) 2018. https://doi.org/10.15607/rss.2018.xiv.009

→ Guided policy search is replaced by imitation learning

10 of 34

Related works


Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39

→ Self-supervision to predict time-aligned audio-vision data

11 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


12 of 34

Method: Overview of pipeline


[Pipeline figure] Three blocks:

  1 - Robot sensors

  2 - (Focus of the paper) Sensor fusion with learned parameters; motion planning / low-frequency “control”

  3 - Robot with high-frequency control

13 of 34

Method: Overview of pipeline


[Pipeline figure, annotated]

  • Representation module: a VAE trained first (self-supervised), then frozen; takes raw sensor data and outputs the latent variable z

  • Policy module: trained second; takes the latent variable z and outputs actions
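Below is a minimal PyTorch-style sketch of this two-stage scheme; the module interfaces (`vae.self_supervised_loss`, `train_policy_trpo`) are hypothetical placeholders, not the authors' code.

```python
import torch

def train_two_stage(vae, policy, sensor_loader, env, vae_epochs=20):
    # Stage 1: self-supervised training of the multimodal VAE on logged
    # (RGB, depth, F/T, proprioception) data, then freeze it.
    opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
    for _ in range(vae_epochs):
        for batch in sensor_loader:
            loss = vae.self_supervised_loss(batch)   # hypothetical method
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in vae.parameters():
        p.requires_grad_(False)                      # representation stays frozen

    # Stage 2: the policy is trained (e.g. with TRPO) on the latent z only.
    train_policy_trpo(policy, vae, env)              # hypothetical RL routine
```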

14 of 34

Method: Overview of pipeline


→ What is the architecture of the representation?

→ How is the data fused?

→ How is the representation trained?

→ How is the policy trained?

15 of 34

I - Related works
    A - Contact-rich manipulation
    B - Representation learning
    C - Sensor fusion

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


16 of 34

Method: Variational Auto-Encoders and representation learning


Goal: Data compression of high-dimensional input x into low-dimensional latent vector z

  • Assume the data lies on a low-dimensional latent manifold ⇒ choose a prior p(z)
  • The goal is then to maximize the data likelihood p(D)
  • Two neural networks (encoder / decoder): q_φ(z | x) and p_θ(x | z)

17 of 34

Method: Variational Auto-Encoders and representation learning


Example of VAEs training:

  • x is an image
  • The encoder q_φ is applied to x → parameters of a probability distribution in latent space
  • Sample z from that distribution
  • The decoder p_θ is applied to z → parameters of a probability distribution in image space
  • Loss: ELBO (reconstruction + KL)
  • Backpropagation through the sampling step: reparametrization trick
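For completeness, the standard ELBO objective and reparametrization trick referred to above (standard VAE notation, not taken verbatim from the paper):

```latex
% Evidence lower bound (ELBO), maximized during training:
\mathcal{L}(\theta, \phi; x) =
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization}}

% Reparametrization trick: make the sampling step differentiable in (\mu, \sigma):
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```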

18 of 34

Method: Variational Auto-Encoders and representation learning


4 types of sensors / 4 encoders

  • RGB
  • Depth
  • F/T
  • Proprioception (end-effector position)

19 of 34

Method: Multimodal fusion


TL;DR: Product of experts

Each encoder outputs a 2×d vector: the parameters of a Gaussian distribution over the d-dimensional latent space.

Wu, M., & Goodman, N. (2018). Multimodal generative models for scalable weakly-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 5575–5585.

[Figure: one Gaussian expert per sensor along the latent z dimensions]
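A minimal sketch of Gaussian product-of-experts fusion in the spirit of Wu & Goodman (2018): each modality encoder produces a mean and (log-)variance, and the fused posterior is their precision-weighted combination. Variable names and shapes are my assumptions, not the paper's code.

```python
import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse per-modality Gaussians q_i(z | x_i) = N(mu_i, var_i) by multiplying
    their densities. mus, logvars: (num_modalities, batch, latent_dim).
    A unit-Gaussian prior expert N(0, I) can be stacked in by the caller."""
    var = torch.exp(logvars)
    precision = 1.0 / (var + eps)                 # T_i = 1 / sigma_i^2
    fused_var = 1.0 / precision.sum(dim=0)        # (sum_i T_i)^(-1)
    fused_mu = fused_var * (precision * mus).sum(dim=0)
    return fused_mu, torch.log(fused_var)
```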

20 of 34

Method: Variational Auto-Encoders and representation learning


The authors swap the reconstruction loss for self-supervised objectives:

log p(D | z) → log p(f(D) | z)

Example losses (on top of the KL term of the ELBO):

  • They construct an optical flow / flow mask:
Loss = reconstruction loss on the optical flow

  • Next end-effector pose:
Loss = MSE(ground truth, predicted)

  • Contact classifier / pairing predictor:
Loss = cross-entropy (Bernoulli)

⇒ Total of 5 losses + the KL (ELBO) loss of the fused representation; a rough sketch of the combined objective is given below.
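As an illustration of how these heads plus the KL term could be combined into a single objective; the head names, equal weighting, and dict layout are my assumptions (the flow-mask head would follow the same pattern as the contact/pairing ones).

```python
import torch
import torch.nn.functional as F

def representation_loss(preds, targets, mu, logvar, beta=1.0):
    """Sum of self-supervised objectives plus the KL term of the ELBO.
    preds / targets: dicts of tensors from hypothetical decoder heads."""
    flow = F.mse_loss(preds["flow"], targets["flow"])                     # optical-flow reconstruction
    ee_pose = F.mse_loss(preds["next_ee_pose"], targets["next_ee_pose"])  # next end-effector pose
    contact = F.binary_cross_entropy_with_logits(
        preds["contact"], targets["contact"])                             # contact classifier (Bernoulli)
    paired = F.binary_cross_entropy_with_logits(
        preds["paired"], targets["paired"])                               # time-alignment / pairing predictor
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())         # KL(q(z|x) || N(0, I))
    return flow + ee_pose + contact + paired + beta * kl
```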

21 of 34

Method: Variational Auto-Encoders and representation learning


Supplement: Pairing predictor

F/T sensor: for each state (RGB-D image + proprioception), take the mean of the last 32 F/T readings

Pairing predictor:

Do the F/T signal and the visuals correspond to the same situation?

Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39
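A sketch of how positive and negative pairs for such a predictor could be built by shuffling within a batch (my construction, following the idea in Owens & Efros, not the paper's exact procedure):

```python
import torch

def make_pairing_batch(ft_features, visual_features):
    """Positives keep each F/T window aligned with its own visual frame;
    negatives pair it with another (misaligned) frame from the batch.
    ft_features, visual_features: (batch, feature_dim) tensors."""
    batch = ft_features.shape[0]
    perm = torch.randperm(batch)                   # may rarely map i -> i; acceptable for a sketch
    positives = (ft_features, visual_features, torch.ones(batch))
    negatives = (ft_features, visual_features[perm], torch.zeros(batch))
    return positives, negatives
```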

22 of 34

Method: Variational Auto-Encoders and representation learning


Supplement: Pairing predictor

Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39

23 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs and representation learning
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


24 of 34

Method: Variational Auto-Encoders and representation learning


Training this representation:

1 - Random policy + heuristic policy

2 - Gather RGB-D, F/T, proprioception data

3 - Fit losses previously discussed

Data:

150k state samples on real robot

600k state samples in simulation

(about 90 min per 100k data samples)

“Full model” of the representation

The authors also trained a “deterministic model” and a “representation model” variant for comparison
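A hypothetical sketch of that data-collection loop (the policy mixing, field names, and storage format are my assumptions):

```python
import random

def collect_states(env, random_policy, heuristic_policy, n_samples):
    """Roll out a mix of a random and a heuristic policy and log time-aligned
    multimodal states for self-supervised representation training."""
    dataset, obs = [], env.reset()
    for _ in range(n_samples):
        policy = random.choice([random_policy, heuristic_policy])
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        dataset.append({
            "rgb": obs["rgb"], "depth": obs["depth"],
            "ft": obs["ft"], "proprio": obs["proprio"],
            "action": action,
        })
        obs = env.reset() if done else next_obs
    return dataset
```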

25 of 34

Method: Training the policy


[Policy-training diagram: the representation is frozen during policy training; only the policy parameters are trained; the reward is used during training]

Training:

  • Simulation: collect 1,000,000 (observation, action, reward) tuples

  • Robot: collect 500,000 (observation, action, reward) tuples (~7 hours of training)

RL algorithm used: TRPO

Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust region policy optimization. 32nd International Conference on Machine Learning, ICML 2015, 3, 1889–1897.
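A sketch of how the frozen representation could sit between the environment and the RL algorithm: raw observations are encoded into z, and TRPO (Schulman et al., 2015) then operates on (z, action, reward) tuples. The wrapper and the use of the posterior mean as z are my assumptions.

```python
import torch

class LatentObservationEnv:
    """Wrap the environment so the policy only ever sees the latent z
    produced by the frozen multimodal encoder."""
    def __init__(self, env, encoder):
        self.env = env
        self.encoder = encoder.eval()

    def _encode(self, raw_obs):
        with torch.no_grad():                      # the encoder stays frozen
            mu, logvar = self.encoder(raw_obs)
        return mu                                  # posterior mean used as z (assumption)

    def reset(self):
        return self._encode(self.env.reset())

    def step(self, action):
        raw_obs, reward, done, info = self.env.step(action)
        return self._encode(raw_obs), reward, done, info
```

Any off-the-shelf TRPO implementation could then be plugged in on top of this wrapped environment.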

26 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the policy

III - Experiments
    A - Simulation
    B - Real robot


27 of 34

Experiments


Questions asked and answered in the experiments:

1 - Is multi-sensor fusion beneficial to policy learning?

2 - Is the proposed self-supervised method a good way to do representation learning?

3 - Does it work on a real robot?

28 of 34

Experiments: Simulation


Depth info is the most important...

… but haptics help.

Temporal coherence is essential.

29 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the policy

III - Experiments
    A - Simulation
    B - Real robot


30 of 34

Experiments: On the real robot


Policy on different peg geometries

31 of 34

Experiments: On the real robot


Feedback and compliant control

Robust to occlusion

32 of 34

Experiments: On the real robot


33 of 34

Conclusion

  • Representation learning with VAEs and self-supervision

  • Time-aligned multi-sensor fusion

  • Experiments on the real robot


34 of 34

Models

Encoders:

  • RGB: 6-layer CNN similar to FlowNet
  • Depth: 18-layer CNN similar to VGG-16
  • F/T sensor: the last 32 readings go into a 5-layer causal convolution network
  • Proprioception (end-effector pose): 4-layer MLP

Decoders:

  • Action encoder (then concatenated with the multimodal representation): 2-layer MLP
  • Flow predictor: 4-layer deconvolutional net
  • Contact predictor: 1-layer MLP
  • End-effector predictor (next pose): 4-layer MLP
  • Time-alignment predictor: 1-layer MLP

Policy:

  • 2-layer MLP: takes the latent z, outputs a pose p (x, y, z and roll)
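A sketch of that 2-layer policy head in PyTorch; the hidden width and latent dimension are my assumptions, not values from the paper.

```python
import torch.nn as nn

class PolicyMLP(nn.Module):
    """2-layer MLP: latent z in, pose command (x, y, z, roll) out."""
    def __init__(self, latent_dim=128, hidden_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, z):
        return self.net(z)
```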
