Pixels-Based Bimanual Peg Insertion by Distilling a State-based RL Policy
Technical Report
Andrew JY Luo
[Title slide: video stills of six hardware runs.]
Summary of Approach

Simulation-Only Training Pipeline

Teacher (state-based):
- Simulation-only
- Knows object locations
- Trained with RL + curriculum (Curriculum A: to pre-insertion; Curriculum B: insertion)

Student (distilled from the teacher via BC):
- Pixels-based
- Deployed on hardware
RGBD Peg Insertion - Architecture

Student network (diagram, summarised):
- The RGBD image is resized to 32x32 and passed through a pretrained, frozen Atari CNN (64-channel conv layers), giving an 8x8 feature map.
- The flattened features are concatenated with a 30-D [qpos, previous action] vector.
- An MLP (256, 256, 256) and a final dense layer produce the action logits (action size * 2).
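As a shape walkthrough, here is a minimal NumPy sketch of the student forward pass. The frozen-CNN projection, the random weight initialisation, and the 14-D action size are stand-in assumptions; the `action size * 2` head is read here as presumably one mean and one (log-)std per action dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_SIZE = 14  # assumed: 14-D bimanual action space (per the pick-block slide)

def frozen_cnn_features(img):
    """Stand-in for the pretrained, frozen Atari CNN: projects the
    resized 32x32 RGBD image to a flattened 8x8x64 feature map."""
    w = rng.standard_normal((img.size, 8 * 8 * 64)) * 0.01
    return img.reshape(-1) @ w  # -> (4096,)

def mlp(x, sizes):
    """Tanh MLP with randomly initialised weights (shapes only)."""
    for n in sizes:
        w = rng.standard_normal((x.shape[-1], n)) * 0.01
        x = np.tanh(x @ w)
    return x

img = rng.random((32, 32, 4))        # resized RGBD observation
proprio = rng.random(30)             # 30-D [qpos, previous action]

feat = frozen_cnn_features(img)      # flattened visual features
x = np.concatenate([feat, proprio])  # fuse vision + proprioception
x = mlp(x, (256, 256, 256))          # trunk MLP from the slide
head = rng.standard_normal((256, ACTION_SIZE * 2)) * 0.01
logits = x @ head                    # action logits, size = action size * 2
print(logits.shape)                  # (28,)
```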
Online DAgger for BC (~2.5e6 samples)

At each step t, the simulator state s_t is rendered (render()) to pixels pix_t, giving the student observation o_t. The teacher π_teacher maps s_t to the label action, the student π_student maps o_t to its own action, and the simulation advances to s_{t+1} via mjx.step(). The behaviour-cloning loss is

    L_t = || a_t^teacher - a_t^student ||_2^2
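A toy sketch of this DAgger loop, with stand-ins for the teacher, the renderer, and mjx.step(); only the wiring is the point. As is usual in DAgger, the student's own action drives the rollout while the teacher supplies labels on the visited states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real components (state-based teacher,
# Madrona renderer, mjx.step); only the DAgger wiring matters here.
def teacher_policy(state):
    return np.tanh(state[:4])

def render(state):
    return np.tile(state[:4], (32, 32, 1))   # "pixels" carrying the state

def env_step(state, action):
    return 0.9 * state + 0.1 * np.pad(action, (0, state.size - action.size))

def student_policy(params, pixels):
    feat = pixels.mean(axis=(0, 1))          # crude stand-in "CNN" feature
    return np.tanh(feat @ params), feat

params = rng.standard_normal((4, 4)) * 0.1
state = rng.standard_normal(8)
lr, losses = 1e-2, []

for t in range(200):
    pixels = render(state)
    a_student, feat = student_policy(params, pixels)
    a_teacher = teacher_policy(state)        # expert label on visited states
    err = a_student - a_teacher
    losses.append(float((err ** 2).sum()))   # L_t = ||a_teacher - a_student||^2
    params -= lr * np.outer(feat, 2 * err * (1 - a_student ** 2))  # SGD on L_t
    state = env_step(state, a_student)       # student's action drives rollout
```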
Robot Setup

Hardware:
- Arms: ViperX 300S
- Cameras: RealSense D405
Huge thanks to Kevin Zakka + Levine Lab!

Software:
- Server/client split between the hardware and the policy
- JAX policy -> ONNX: ~1 day
Summary of Training Pipeline

| Phase                                  | Time (jit) | Time (run) | Algorithm                  | Inputs |
|----------------------------------------|------------|------------|----------------------------|--------|
| Teacher Curriculum A: to pre-insertion | 1.5 min    | 8.5 min    | PPO                        | State  |
| Teacher Curriculum B: insertion        | 1.5 min    | 8.5 min    | PPO                        | State  |
| Student distillation                   | 24 s       | 2 min 55 s | Behaviour cloning (DAgger) | Pixels |

Total (jit + run, all phases): 23.5 min
What helped?

Madrona-MJX Rendering
- Madrona-MJX vastly outperforms other SOTA batch renderers on Cartpole, across multiple render resolutions.
- High-resolution support means Madrona-MJX enables fast iteration.
Visual Domain Randomisation

[Figure: a real image beside randomized sim renders; the slide lists the randomized properties.]

[Figure: 32x32 images, real vs. sim (9 of 1024 envs shown).]
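A minimal sketch of per-environment visual randomization as it might be sampled for a batch of 1024 envs. The specific properties below (colors, lighting, camera jitter) and their ranges are illustrative assumptions, not the talk's exact list.

```python
import numpy as np

def sample_visual_randomization(rng):
    """Sample one environment's visual parameters.
    The parameter set and ranges here are illustrative assumptions."""
    return {
        "object_rgb": rng.uniform(0.0, 1.0, size=3),
        "table_rgb": rng.uniform(0.0, 1.0, size=3),
        "light_direction": rng.normal(size=3),
        "light_intensity": rng.uniform(0.5, 1.5),
        "camera_pos_jitter": rng.normal(scale=0.01, size=3),
    }

rng = np.random.default_rng(0)
# One independent draw per parallel environment (e.g. 1024 batched envs).
batch = [sample_visual_randomization(rng) for _ in range(1024)]
```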
Pixel-based Sim2Real Tricks

Training with random shifts (image from [2]):
- Easily improves transfer by >30% [1]

Pre-trained encoder:
- The Atari CNN scores >70% on CIFAR-10!

[1] Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., and Martín-Martín, R. "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation." arXiv, 2021. https://doi.org/10.48550/arXiv.2108.03298
[2] Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. "Reinforcement Learning with Augmented Data." arXiv, 2020. https://doi.org/10.48550/arXiv.2004.14990
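Random-shift augmentation in the style of RAD [2] is typically implemented as pad-then-random-crop. A minimal sketch, where the 4-pixel pad is an assumed hyperparameter:

```python
import numpy as np

def random_shift(img, rng, pad=4):
    """RAD-style random shift: replicate-pad the image, then crop
    back to the original size at a uniformly random offset."""
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 4))   # one 32x32 RGBD observation
aug = random_shift(img, rng)    # same shape, shifted content
```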
Difficulties

Depth-only Exteroception

[Figure: real vs. sim depth streams over time for the left and right hands.]
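Real stereo depth (e.g. from the D405) carries noise and dropout "holes" that clean simulated depth lacks, which is one way this gap shows up. The talk does not state a mitigation; the sketch below only illustrates corrupting sim depth toward real-sensor statistics, with made-up noise parameters.

```python
import numpy as np

def corrupt_depth(depth, rng, hole_prob=0.05, noise_scale=0.01):
    """Apply real-sensor-style artifacts to a clean simulated depth map:
    multiplicative noise plus random zero-valued 'holes'. Parameters
    are illustrative assumptions, not measured D405 statistics."""
    noisy = depth * (1.0 + noise_scale * rng.standard_normal(depth.shape))
    holes = rng.random(depth.shape) < hole_prob
    noisy[holes] = 0.0
    return noisy

rng = np.random.default_rng(0)
clean = np.full((32, 32), 0.5)          # flat simulated depth, in metres
real_like = corrupt_depth(clean, rng)   # noisy depth with dropout holes
```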
Training a teacher policy: inter-arm reward-signal interference?
PPO on two arms - symmetric deployment
A) Single arm with randomized blocks
B) Symmetric deployment
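Symmetric deployment of a single-arm policy can be sketched as mirroring: the second arm's observation is reflected across the workspace's symmetry plane, the same policy is applied, and the resulting action is reflected back. The sign masks and 7-D layout below are illustrative assumptions; the true masks depend on the robot's joint conventions.

```python
import numpy as np

# Illustrative sign masks: which observation/action dims flip when
# mirroring across the symmetry plane (e.g. y-positions, yaw-like joints).
OBS_MIRROR = np.array([1, -1, 1, 1, -1, 1, 1])
ACT_MIRROR = np.array([1, -1, 1, 1, -1, 1, 1])

def symmetric_bimanual_action(policy, obs_left, obs_right):
    """Run one single-arm policy on both arms: the right arm sees a
    mirrored observation and its action is mirrored back."""
    a_left = policy(obs_left)
    a_right = ACT_MIRROR * policy(OBS_MIRROR * obs_right)
    return a_left, a_right

policy = lambda obs: np.tanh(obs)   # stand-in single-arm policy
obs_l = np.linspace(-1, 1, 7)
obs_r = OBS_MIRROR * obs_l          # a perfectly mirrored scene
a_l, a_r = symmetric_bimanual_action(policy, obs_l, obs_r)
# In a mirrored scene, the two arms' actions are mirror images.
```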
Wrapping Up
- What Helped?
- Difficulties
- Giving it a Shot
- Discussion
Appendix
A1: A spatial credit assignment problem

Reward Contamination

A1: Pick-Block experiment

|              | Single Pick Block | Double Pick Block |
|--------------|-------------------|-------------------|
| Reward terms | 9                 | 18                |
| Action space | 7-D               | 14-D              |
| Observation  | 53-D              | 106-D             |
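A toy illustration of reward contamination under these assumptions: if each arm's reward depends only on its own behaviour but both arms train from one summed reward, the other arm's terms act as noise, and the correlation between one arm's behaviour and the reward it observes drops from 1 to about 1/sqrt(2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Each arm's reward depends only on its own action quality (toy model).
q_left = rng.standard_normal(n)
q_right = rng.standard_normal(n)

# Double pick-block: both arms see one summed reward signal.
shared_reward = q_left + q_right

rho_separate = np.corrcoef(q_left, q_left)[0, 1]        # per-arm reward
rho_shared = np.corrcoef(q_left, shared_reward)[0, 1]   # summed reward
print(rho_separate, rho_shared)   # ~1.0 vs ~0.707
```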
A2: Training the insertion policy with height-based switching
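One plausible reading of height-based switching, sketched below: once the peg is lifted above a threshold height (i.e. at its pre-insertion pose), control hands over from the Curriculum A policy to the Curriculum B insertion policy. Both the threshold value and the exact rule are assumptions; the slide names the mechanism but not its parameters.

```python
# Assumed switching rule; threshold is illustrative, not from the talk.
SWITCH_HEIGHT = 0.15  # metres

def select_curriculum(peg_height: float) -> str:
    """Below the threshold, run the Curriculum A policy (reach
    pre-insertion); above it, hand over to Curriculum B (insert)."""
    return "A: to pre-insertion" if peg_height < SWITCH_HEIGHT else "B: insertion"
```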
A3: Architecture for depth-only attempt

Depth-only student (diagram, summarised; ~400k parameters):
- Depth image through an Atari CNN (random init)
- CNN features concatenated with a 30-D [qpos, previous action] vector
- MLP (256, 256, 256) and a dense layer producing the action logits (action size * 2)

Note *: small details omitted for clearer explanation
A4: RGBD Peg Insertion - Hold-Out
Issue: grasping at the air
Partial Solution: Randomly hold out object (doesn’t transfer perfectly)
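The hold-out idea can be sketched as masking the object out of a random fraction of the batched training envs, so the policy cannot succeed by blindly grasping at the object's usual location. The 10% hold-out probability and the dict-based env state are illustrative assumptions.

```python
import numpy as np

def apply_object_holdout(env_states, rng, holdout_prob=0.1):
    """In a random fraction of batched envs, hold out (remove) the object.
    Hold-out probability and state representation are illustrative."""
    mask = rng.random(len(env_states)) < holdout_prob
    for state, held_out in zip(env_states, mask):
        if held_out:
            state["object_present"] = False
    return mask

rng = np.random.default_rng(0)
envs = [{"object_present": True} for _ in range(1024)]
mask = apply_object_holdout(envs, rng)   # ~10% of envs have no object
```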
A5: Emergent Retry Behaviour 🦆
A6: Peg Insertion Policy; 6 sequential runs; uncut