1 of 34

Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks

Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
Best Paper ICRA 2019, Finalist for Best Paper in Cognitive Robotics ICRA 2019

Journal publication in IEEE Transactions on Robotics 2020

Presented by Louis Montaut
PhD student at CTU-FEL / ENS-PSL


2 of 34

Introduction: Context & Goal


3 of 34

Introduction: Task solved in this paper

Keywords:
- Contact-rich manipulation
- Representation learning & self-supervision
- Sensor fusion

Broader context: fusing vision and touch


4 of 34

Introduction: Paper’s contributions

  • System paper (from sensor data to policy, from simulation to the real robot)

  • Learning a representation with VAEs and self-supervision

  • Time-aligned multi-sensor data

  • Real-robot experiments


5 of 34

Summary

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


6 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


7 of 34

Related works

  • Optimal control in robotics: hard to incorporate raw sensor data…

  • … Let’s do deep RL?


Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.

→ Elements of optimal control: Guided Policy Search

8 of 34

Related works


Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.

9 of 34

Related works


Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramár, J., Hadsell, R., de Freitas, N., & Heess, N. (2018). Reinforcement and Imitation Learning for Diverse Visuomotor Skills. Robotics: Science and Systems (RSS) 2018. https://doi.org/10.15607/rss.2018.xiv.009

→ Guided policy search is replaced by imitation learning

10 of 34

Related works


Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39

→ Self-supervision to predict time-aligned audio-vision data

11 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


12 of 34

Method: Overview of pipeline


[Pipeline figure] Three blocks:

  1 - Robot sensors

  2 - (Focus of the paper) Sensor fusion with learned parameters; motion planning / low-frequency “control”

  3 - Robot with high-frequency control

13 of 34

Method: Overview of pipeline


[Pipeline figure, annotated]

  • Representation module: a VAE trained first (self-supervised), then frozen; takes raw sensor data and outputs the latent variable z

  • Policy module: trained second; takes the latent variable z and outputs actions
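Below is a minimal PyTorch-style sketch of this two-stage scheme; the module interfaces (`vae.self_supervised_loss`, `train_policy_trpo`) are hypothetical placeholders, not the authors' code.

```python
import torch

def train_two_stage(vae, policy, sensor_loader, env, vae_epochs=20):
    # Stage 1: self-supervised training of the multimodal VAE on logged
    # (RGB, depth, F/T, proprioception) data, then freeze it.
    opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
    for _ in range(vae_epochs):
        for batch in sensor_loader:
            loss = vae.self_supervised_loss(batch)   # hypothetical method
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in vae.parameters():
        p.requires_grad_(False)                      # representation stays frozen

    # Stage 2: the policy is trained (e.g. with TRPO) on the latent z only.
    train_policy_trpo(policy, vae, env)              # hypothetical RL routine
```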

14 of 34

Method: Overview of pipeline


→ What is the architecture of the representation?

→ How is the data fused?

→ How is the representation trained?

→ How is the policy trained?

15 of 34

I - Related works
    A - Contact-rich manipulation
    B - Representation learning
    C - Sensor fusion

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


16 of 34

Method: Variational Auto-Encoders and representation learning


Goal: Data compression of high-dimensional input x into low-dimensional latent vector z

  • Assume the data lies on a low-dimensional latent manifold ⇒ choose a prior p(z)
  • The goal is then to maximize the data likelihood p(D)
  • Two neural networks (encoder / decoder): q_φ(z | x) and p_θ(x | z)

17 of 34

Method: Variational Auto-Encoders and representation learning


Example of VAEs training:

  • x is an image
  • The encoder q_φ is applied to x → parameters of a probability distribution in latent space
  • Sample z from that distribution
  • The decoder p_θ is applied to z → parameters of a probability distribution in image space
  • Loss: ELBO (reconstruction + KL)
  • Backpropagation through the sampling step: reparametrization trick
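For completeness, the standard ELBO objective and reparametrization trick referred to above (standard VAE notation, not taken verbatim from the paper):

```latex
% Evidence lower bound (ELBO), maximized during training:
\mathcal{L}(\theta, \phi; x) =
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization}}

% Reparametrization trick: make the sampling step differentiable in (\mu, \sigma):
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```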

18 of 34

Method: Variational Auto-Encoders and representation learning


4 types of sensors / 4 encoders

  • RGB
  • Depth
  • F/T
  • Proprioception (end-effector position)

19 of 34

Method: Multimodal fusion


TL;DR: Product of experts

Each encoder outputs a 2×d vector: the parameters of a Gaussian distribution over the d-dimensional latent space.

Wu, M., & Goodman, N. (2018). Multimodal generative models for scalable weakly-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 5575–5585.

[Figure: one Gaussian expert per sensor along the latent z dimensions]
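A minimal sketch of Gaussian product-of-experts fusion in the spirit of Wu & Goodman (2018): each modality encoder produces a mean and (log-)variance, and the fused posterior is their precision-weighted combination. Variable names and shapes are my assumptions, not the paper's code.

```python
import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse per-modality Gaussians q_i(z | x_i) = N(mu_i, var_i) by multiplying
    their densities. mus, logvars: (num_modalities, batch, latent_dim).
    A unit-Gaussian prior expert N(0, I) can be stacked in by the caller."""
    var = torch.exp(logvars)
    precision = 1.0 / (var + eps)                 # T_i = 1 / sigma_i^2
    fused_var = 1.0 / precision.sum(dim=0)        # (sum_i T_i)^(-1)
    fused_mu = fused_var * (precision * mus).sum(dim=0)
    return fused_mu, torch.log(fused_var)
```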

20 of 34

Method: Variational Auto-Encoders and representation learning


The authors swap the reconstruction loss for self-supervised objectives:

log p(D | z) → log p(f(D) | z)

Example losses (on top of the KL term of the ELBO):

  • They construct an optical flow / flow mask:
Loss = reconstruction loss on the optical flow

  • Next end-effector pose:
Loss = MSE(ground truth, predicted)

  • Contact classifier / pairing predictor:
Loss = cross-entropy (Bernoulli)

⇒ Total of 5 losses + the KL (ELBO) loss of the fused representation; a rough sketch of the combined objective is given below.
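As an illustration of how these heads plus the KL term could be combined into a single objective; the head names, equal weighting, and dict layout are my assumptions (the flow-mask head would follow the same pattern as the contact/pairing ones).

```python
import torch
import torch.nn.functional as F

def representation_loss(preds, targets, mu, logvar, beta=1.0):
    """Sum of self-supervised objectives plus the KL term of the ELBO.
    preds / targets: dicts of tensors from hypothetical decoder heads."""
    flow = F.mse_loss(preds["flow"], targets["flow"])                     # optical-flow reconstruction
    ee_pose = F.mse_loss(preds["next_ee_pose"], targets["next_ee_pose"])  # next end-effector pose
    contact = F.binary_cross_entropy_with_logits(
        preds["contact"], targets["contact"])                             # contact classifier (Bernoulli)
    paired = F.binary_cross_entropy_with_logits(
        preds["paired"], targets["paired"])                               # time-alignment / pairing predictor
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())         # KL(q(z|x) || N(0, I))
    return flow + ee_pose + contact + paired + beta * kl
```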

21 of 34

Method: Variational Auto-Encoders and representation learning


Supplement: Pairing predictor

F/T sensor: for each state (RGB-D image + proprioception), take the mean of the last 32 F/T readings

Pairing predictor:

Do the F/T signal and the visuals correspond to the same situation?

Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39
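A sketch of how positive and negative pairs for such a predictor could be built by shuffling within a batch (my construction, following the idea in Owens & Efros, not the paper's exact procedure):

```python
import torch

def make_pairing_batch(ft_features, visual_features):
    """Positives keep each F/T window aligned with its own visual frame;
    negatives pair it with another (misaligned) frame from the batch.
    ft_features, visual_features: (batch, feature_dim) tensors."""
    batch = ft_features.shape[0]
    perm = torch.randperm(batch)                   # may rarely map i -> i; acceptable for a sketch
    positives = (ft_features, visual_features, torch.ones(batch))
    negatives = (ft_features, visual_features[perm], torch.zeros(batch))
    return positives, negatives
```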

22 of 34

Method: Variational Auto-Encoders and representation learning


Supplement: Pairing predictor

Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018 (LNCS 11210), 639–658. https://doi.org/10.1007/978-3-030-01231-1_39

23 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs and representation learning
    C - Training the representation and the policy

III - Experiments
    A - Simulation
    B - Real robot


24 of 34

Method: Variational Auto-Encoders and representation learning


Training this representation:

1 - Random policy + heuristic policy

2 - Gather RGB-D, F/T, proprioception data

3 - Fit losses previously discussed

Data:

150k state samples on real robot

600k state samples in simulation

(about 90 min per 100k data samples)

“Full model” of the representation

The authors also trained a “deterministic model” and a “representation model” variant for comparison
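A hypothetical sketch of that data-collection loop (the policy mixing, field names, and storage format are my assumptions):

```python
import random

def collect_states(env, random_policy, heuristic_policy, n_samples):
    """Roll out a mix of a random and a heuristic policy and log time-aligned
    multimodal states for self-supervised representation training."""
    dataset, obs = [], env.reset()
    for _ in range(n_samples):
        policy = random.choice([random_policy, heuristic_policy])
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        dataset.append({
            "rgb": obs["rgb"], "depth": obs["depth"],
            "ft": obs["ft"], "proprio": obs["proprio"],
            "action": action,
        })
        obs = env.reset() if done else next_obs
    return dataset
```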

25 of 34

Method: Training the policy


[Policy-training diagram: the representation is frozen during policy training; only the policy parameters are trained; the reward is used during training]

Training:

  • Simulation: collect 1,000,000 (observation, action, reward) tuples

  • Robot: collect 500,000 (observation, action, reward) tuples (~7 hours of training)

RL algorithm used: TRPO

Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust region policy optimization. 32nd International Conference on Machine Learning, ICML 2015, 3, 1889–1897.
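A sketch of how the frozen representation could sit between the environment and the RL algorithm: raw observations are encoded into z, and TRPO (Schulman et al., 2015) then operates on (z, action, reward) tuples. The wrapper and the use of the posterior mean as z are my assumptions.

```python
import torch

class LatentObservationEnv:
    """Wrap the environment so the policy only ever sees the latent z
    produced by the frozen multimodal encoder."""
    def __init__(self, env, encoder):
        self.env = env
        self.encoder = encoder.eval()

    def _encode(self, raw_obs):
        with torch.no_grad():                      # the encoder stays frozen
            mu, logvar = self.encoder(raw_obs)
        return mu                                  # posterior mean used as z (assumption)

    def reset(self):
        return self._encode(self.env.reset())

    def step(self, action):
        raw_obs, reward, done, info = self.env.step(action)
        return self._encode(raw_obs), reward, done, info
```

Any off-the-shelf TRPO implementation could then be plugged in on top of this wrapped environment.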

26 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the policy

III - Experiments
    A - Simulation
    B - Real robot


27 of 34

Experiments


Questions asked and answered in the experiments:

1 - Is multi-sensor fusion beneficial to policy learning?

2 - Is the proposed self-supervised method a good way to do representation learning?

3 - Does it work on a real robot?

28 of 34

Experiments: Simulation


Depth info is the most important...

… but haptics help.

Temporal coherence is essential.

29 of 34

I - Related works

II - Method
    A - Overview of pipeline
    B - VAEs - representation learning - sensor fusion
    C - Training the policy

III - Experiments
    A - Simulation
    B - Real robot


30 of 34

Experiments: On the real robot


Policy on different peg geometries

31 of 34

Experiments: On the real robot


Feedback and compliant control

Robust to occlusion

32 of 34

Experiments: On the real robot


33 of 34

Conclusion

  • Representation learning with VAEs and self-supervision

  • Time-aligned multi-sensor fusion

  • Experiments on the real robot


34 of 34

Models

Encoders:

  • RGB: 6-layer CNN similar to FlowNet
  • Depth: 18-layer CNN similar to VGG-16
  • F/T sensor: the last 32 readings go into a 5-layer causal convolution network
  • Proprioception (end-effector pose): 4-layer MLP

Decoders:

  • Action encoder (then concatenated with the multimodal representation): 2-layer MLP
  • Flow predictor: 4-layer deconvolutional net
  • Contact predictor: 1-layer MLP
  • End-effector predictor (next pose): 4-layer MLP
  • Time-alignment predictor: 1-layer MLP

Policy:

  • 2-layer MLP: takes the latent z, outputs a pose p (x, y, z and roll)
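A sketch of that 2-layer policy head in PyTorch; the hidden width and latent dimension are my assumptions, not values from the paper.

```python
import torch.nn as nn

class PolicyMLP(nn.Module):
    """2-layer MLP: latent z in, pose command (x, y, z, roll) out."""
    def __init__(self, latent_dim=128, hidden_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, z):
        return self.net(z)
```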
