Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
Best Paper Award ICRA 2019, Finalist for Best Paper in Cognitive Robotics ICRA 2019
Journal publication in IEEE Transactions on Robotics, 2020
Presented by Louis Montaut
PhD student at CTU-FEL / ENS-PSL
1
Introduction: Context & Goal
2
Introduction: Task solved in this paper
Keywords:
- Contact-rich manipulation
- Representation learning & self-supervision
- Sensor fusion
Broader context: fusing vision and touch
3
Introduction: Paper’s contributions
4
Summary
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the representation and the policy
III - Experiments
  A - Simulation
  B - Real robot
5
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the representation and the policy
III - Experiments
  A - Simulation
  B - Real robot
6
Related works
7
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.
→ Elements of optimal control: Guided Policy Search
Related works
8
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17, 1–40.
Related works
9
Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramár, J., Hadsell, R., de Freitas, N., & Heess, N. (2018). Reinforcement and Imitation Learning for Diverse Visuomotor Skills. Robotics: Science and Systems (RSS). https://doi.org/10.15607/rss.2018.xiv.009
→ Guided policy search is replaced by imitation learning
Related works
10
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018, LNCS 11210, 639–658. https://doi.org/10.1007/978-3-030-01231-1_39
→ Self-supervision to predict time-aligned audio-visual data
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the representation and the policy
III - Experiments
  A - Simulation
  B - Real robot
11
Method: Overview of pipeline
12
[Pipeline figure: three numbered blocks; block 2, the multimodal representation, is the focus of the paper]
Method: Overview of pipeline
13
Method: Overview of pipeline
14
→ What is the architecture of the representation?
→ How is the data fused?
→ How is the representation trained?
→ How is the policy trained?
I - Related works
  A - Contact-rich manipulation
  B - Representation learning
  C - Sensor fusion
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the representation and the policy
III - Experiments
  A - Simulation
  B - Real robot
15
Method: Variational Auto-Encoders and representation learning
16
Goal: compress a high-dimensional input x into a low-dimensional latent vector z (encoder), then reconstruct x from z (decoder)
Method: Variational Auto-Encoders and representation learning
17
*More in-depth explanation: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
Example of VAE training:
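To make this concrete, below is a minimal PyTorch sketch of a plain VAE (generic fully-connected encoder/decoder, illustrative dimensions). It is not the architecture used in the paper, only the mechanism: encode x into the parameters of q(z|x), sample z with the reparameterization trick, decode, and train on reconstruction + KL.

```python
# Minimal VAE sketch (PyTorch). Illustrative only; not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, z_dim)       # mean of q(z|x)
        self.logvar_head = nn.Linear(256, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    # Reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```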
Method: Variational Auto-Encoders and representation learning
18
4 sensor modalities / 4 encoders (RGB, depth, force/torque, proprioception)
Method: Multimodal fusion
19
TL;DR: product of experts. Each encoder outputs a 2×d vector: the mean and variance parameters of a d-dimensional Gaussian distribution over the latent z.
Wu, M., & Goodman, N. (2018). Multimodal generative models for scalable weakly-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 5575–5585.
[Table: latent z dimension used for each sensor modality]
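Below is a small sketch of the product-of-experts fusion of Wu & Goodman (2018) that the slide summarizes: each modality encoder contributes a Gaussian (mu_i, logvar_i) over the same latent z, and the fused posterior is their product together with a unit-Gaussian prior expert, which has a closed form as a precision-weighted average. Function and variable names are mine, not the authors' code.

```python
# Product-of-Gaussian-experts fusion (sketch after Wu & Goodman, 2018).
import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """mus, logvars: lists of (batch, d) tensors, one entry per available modality."""
    # Prepend the unit-Gaussian prior expert N(0, I).
    mus = [torch.zeros_like(mus[0])] + list(mus)
    logvars = [torch.zeros_like(logvars[0])] + list(logvars)
    precisions = [torch.exp(-lv) + eps for lv in logvars]        # 1 / sigma_i^2
    total_precision = sum(precisions)
    fused_var = 1.0 / total_precision                            # product of Gaussians: precisions add
    fused_mu = fused_var * sum(m * p for m, p in zip(mus, precisions))
    return fused_mu, torch.log(fused_var)                        # parameters of the fused q(z | modalities)
```

A convenient property of this rule is that a missing modality can simply be left out of the lists, and the fused Gaussian degrades gracefully toward the remaining experts and the prior.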
Method: Variational Auto-Encoders and representation learning
20
The authors swap the reconstruction loss for self-supervised objectives:
log p(D | z) → log p(f(D) | z)
Example losses: action-conditional optical flow, next-step contact, pairing of the sensory streams (+ the KL ELBO losses)
⇒ Total of 5 losses + the KL ELBO loss of the fused representation
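As a rough illustration of how such an objective could be assembled, the sketch below sums a few self-supervised prediction losses on top of the fused latent together with the KL/ELBO regularizer. The heads shown (optical flow, next-step contact, pairing) and the weights are placeholders standing in for the five losses counted on the slide, not the authors' implementation.

```python
# Sketch of a combined self-supervised objective on top of the fused latent.
import torch
import torch.nn.functional as F

def representation_loss(mu, logvar, preds, targets, w):
    loss = w["flow"] * F.mse_loss(preds["flow"], targets["flow"])          # optical-flow regression
    loss = loss + w["contact"] * F.binary_cross_entropy_with_logits(
        preds["contact"], targets["contact"])                              # next-step contact (binary)
    loss = loss + w["pairing"] * F.binary_cross_entropy_with_logits(
        preds["pairing"], targets["pairing"])                              # time-alignment / pairing (binary)
    # KL between the fused posterior q(z|x) and the unit-Gaussian prior (ELBO regularizer).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return loss + w["kl"] * kl
```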
Method: Variational Auto-Encoders and representation learning
21
Supplement: Pairing predictor
F/T sensor:
For each state (RGB-D image + proprioception), take the mean of the last 32 F/T sensor readings
Pairing predictor:
Do the F/T readings and the visuals correspond to the same situation?
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018, LNCS 11210, 639–658. https://doi.org/10.1007/978-3-030-01231-1_39
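A sketch of how positive (time-aligned) and negative (misaligned) training pairs for such a pairing predictor could be generated, using the per-state F/T preprocessing described above (mean of the last 32 readings). The array layout and the function name are assumptions.

```python
# Sketch: build (image, F/T feature, label) pairs for a pairing/alignment classifier.
import numpy as np

def make_pairing_examples(images, ft_readings, window=32):
    """images: (T, H, W, C) array; ft_readings: (T, 6) force/torque stream."""
    T = len(images)
    examples = []
    for t in range(window, T):
        ft_pos = ft_readings[t - window:t].mean(axis=0)       # mean of the last 32 F/T samples
        examples.append((images[t], ft_pos, 1))                # time-aligned pair: label 1
        t_neg = np.random.randint(window, T)
        while t_neg == t:                                      # avoid picking the same time step
            t_neg = np.random.randint(window, T)
        ft_neg = ft_readings[t_neg - window:t_neg].mean(axis=0)
        examples.append((images[t], ft_neg, 0))                # misaligned pair: label 0
    return examples
```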
Method: Variational Auto-Encoders and representation learning
22
Supplement: Pairing predictor
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018, LNCS 11210, 639–658. https://doi.org/10.1007/978-3-030-01231-1_39
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs and representation learning
  C - Training the representation and the policy
III - Experiments
  A - Simulation
  B - Real robot
23
Method: Variational Auto-Encoders and representation learning
24
Training this representation:
1 - Random policy + heuristic policy
2 - Gather RGB-D, F/T, proprioception data
3 - Fit losses previously discussed
Data:
150k state samples on real robot
600k state samples in simulation
(about 90 min / 100k data samples)
“Full model” of representation
The authors also trained a “deterministic model” and a “representation model” for comparison
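A sketch of what this data-collection loop could look like with a gym-style environment interface; the environment, observation keys, and heuristic policy below are placeholders, not the authors' code.

```python
# Sketch: gather synchronized (RGB-D, F/T, proprioception, action) samples with a
# mix of random and heuristic actions. Gym-style API assumed.
import random

def collect_states(env, heuristic_action, n_samples, p_random=0.5):
    dataset = []
    obs = env.reset()
    for _ in range(n_samples):
        if random.random() < p_random:
            action = env.action_space.sample()    # random exploration
        else:
            action = heuristic_action(obs)        # scripted heuristic (e.g. move toward the hole)
        next_obs, _, done, _ = env.step(action)
        dataset.append({"rgbd": obs["rgbd"], "ft": obs["ft"],
                        "proprio": obs["proprio"], "action": action})
        obs = env.reset() if done else next_obs
    return dataset
```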
Method: Training the policy
25
[Figure: the multimodal representation is frozen during policy training; only the policy network parameters are trained. The reward used during training is also shown.]
Training:
- collect 1,000,000 (observation, action, reward) tuples
- collect 500,000 (observation, action, reward) tuples (~ 7 hours of training)
RL algorithm used: TRPO
Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust region policy optimization. 32nd International Conference on Machine Learning, ICML 2015, 3, 1889–1897.
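A sketch of the frozen-representation setup: the pretrained multimodal encoder is frozen and only a small policy head is updated by the RL algorithm (TRPO in the paper; the TRPO update itself is not shown). The encoder interface and layer sizes below are assumptions.

```python
# Sketch (PyTorch): policy head on top of a frozen multimodal representation.
import torch
import torch.nn as nn

class PolicyOnFrozenRepr(nn.Module):
    def __init__(self, encoder, z_dim, action_dim):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                  # representation stays frozen during policy training
        self.policy_head = nn.Sequential(            # only these parameters are trained (by TRPO)
            nn.Linear(z_dim, 128), nn.Tanh(),
            nn.Linear(128, action_dim))              # e.g. mean of a Gaussian action distribution

    def forward(self, rgbd, ft, proprio):
        with torch.no_grad():
            mu, _ = self.encoder(rgbd, ft, proprio)  # assumed interface: returns fused (mu, logvar)
        return self.policy_head(mu)
```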
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the policy
III - Experiments
  A - Simulation
  B - Real robot
26
Experiments
27
Questions asked / answered during the experiments:
1 - Is multi-sensor fusion beneficial to policy learning?
2 - Is the self-supervised method presented a good way to do representation learning?
3 - Does it work on a real robot?
Experiments: Simulation
28
Depth info is the most important...
… but haptics help.
Temporal coherence is essential.
I - Related works
II - Method
  A - Overview of pipeline
  B - VAEs - representation learning - sensor fusion
  C - Training the policy
III - Experiments
  A - Simulation
  B - Real robot
29
Experiments: On the real robot
30
Policy on different peg geometries
Experiments: On the real robot
31
Feedback and compliant control
Robust to occlusion
Experiments: On the real robot
32
Conclusion
33
Models
[Appendix figures: encoder, decoder, and policy network architectures]
34