1 of 33

Pixels-Based Bimanual Peg Insertion by Distilling a State-based RL Policy

Technical Report

Andrew JY Luo

2 of 33

Six hardware rollouts: Run 1 – Run 6

3 of 33

Summary of Approach

4 of 33

Simulation-Only Training Pipeline

Teacher (state-based):
  • Simulation-only
  • Knows object locations
  • Trained with RL + curriculum (Curriculum A: to pre-insertion, then Curriculum B: insertion)

Student, distilled from the teacher via BC:
  • Pixels-based
  • Deployed on hardware

Summary of Approach

5 of 33

RGBD Peg Insertion - Architecture

  • RGBD images (64x64) -> Resize to 32x32 -> Atari CNN (pretrained, frozen) -> 8x8 feature maps -> Flatten
  • Proprioception: [qpos, previous action] (30-D)
  • Concatenate image features with proprioception -> MLP (256, 256, 256) -> Dense Layer -> Action Logits (Action size * 2)
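A minimal sketch of this student network in Flax, assuming a Nature-DQN style Atari CNN and a two-parameter (e.g. mean and log-std) action head; module and layer details are illustrative, not the report's code:

    import jax
    import jax.numpy as jnp
    import flax.linen as nn

    class AtariCNN(nn.Module):
        """Nature-DQN style encoder; in the report its weights are pretrained and frozen."""
        @nn.compact
        def __call__(self, x):                        # x: (B, 32, 32, C), values in [0, 1]
            x = nn.relu(nn.Conv(32, kernel_size=(8, 8), strides=(4, 4))(x))
            x = nn.relu(nn.Conv(64, kernel_size=(4, 4), strides=(2, 2))(x))
            x = nn.relu(nn.Conv(64, kernel_size=(3, 3), strides=(1, 1))(x))
            return x.reshape((x.shape[0], -1))        # flatten the spatial feature map

    class StudentPolicy(nn.Module):
        action_size: int

        @nn.compact
        def __call__(self, image, proprio):           # proprio: (B, 30) = [qpos, previous action]
            image = jax.image.resize(
                image, (image.shape[0], 32, 32, image.shape[-1]), "bilinear")
            x = jnp.concatenate([AtariCNN()(image), proprio], axis=-1)
            for width in (256, 256, 256):
                x = nn.relu(nn.Dense(width)(x))
            return nn.Dense(self.action_size * 2)(x)  # action logits: action size * 2

    # params = StudentPolicy(action_size=14).init(jax.random.PRNGKey(0), images, proprio)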

Summary of Approach


6 of 33

Online DAgger for BC (~2.5e6 samples)

  • The simulator state s_t is rendered to pixels pix_t via render(); the student π_student acts from the pixel observation o_t, while the teacher π_teacher acts from the privileged state s_t.
  • The student's action is executed with mjx.step() to reach s_{t+1}, and the student is trained with the imitation loss L_t = || a_t^student - a_t^teacher ||_2^2.

Summary of Approach
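A minimal sketch of one step of this distillation loop; the callables (render, make_obs, student_apply, teacher_apply, mjx_step) and the choice to step the simulator with the student's action are assumptions about the setup, not code from the report:

    import jax
    import jax.numpy as jnp
    import optax

    def dagger_step(student_apply, teacher_apply, render, make_obs, mjx_step,
                    optimizer, student_params, teacher_params, opt_state, env_state):
        """One online-DAgger update: act with the student, label with the teacher."""
        pix = render(env_state)                               # batched pixels (e.g. Madrona-MJX)
        obs = make_obs(pix, env_state)                        # pixels + [qpos, previous action]
        a_teacher = teacher_apply(teacher_params, env_state)  # privileged, state-based label

        def loss_fn(p):
            a_student = student_apply(p, obs)
            # L_t = || a_t^student - a_t^teacher ||_2^2
            return jnp.mean(jnp.sum((a_student - a_teacher) ** 2, axis=-1))

        loss, grads = jax.value_and_grad(loss_fn)(student_params)
        updates, opt_state = optimizer.update(grads, opt_state, student_params)
        student_params = optax.apply_updates(student_params, updates)

        # Step the simulator with the student's action so the collected states stay on-policy.
        env_state = mjx_step(env_state, student_apply(student_params, obs))
        return student_params, opt_state, env_state, loss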

7 of 33

Robot Setup

Hardware
  • Arms: ViperX 300S
  • Cameras: Realsense D405

Huge thanks to Kevin Zakka + Levine Lab!

Software
  • Client/server split between the robot hardware and the policy

JAX policy -> ONNX (~1 day of work):

  1. Re-implement the policy in TensorFlow
  2. Manually transfer the weights
  3. Convert with tf2onnx
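For step 3, a minimal conversion sketch with tf2onnx, assuming the policy has already been re-implemented in TensorFlow with the weights copied across; the Keras model below is a simplified stand-in (proprioception-only) rather than the full pixels + proprioception network:

    import tensorflow as tf
    import tf2onnx

    # Simplified stand-in for the re-implemented policy (weights transferred manually).
    policy = tf.keras.Sequential([
        tf.keras.Input(shape=(30,), name="proprio"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(2 * 14),                # action size * 2 outputs
    ])

    # Export to ONNX for deployment.
    spec = (tf.TensorSpec((1, 30), tf.float32, name="proprio"),)
    tf2onnx.convert.from_keras(policy, input_signature=spec, opset=13,
                               output_path="policy.onnx")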

Summary of Approach

8 of 33

Summary of Training Pipeline

Phase                                   | Time (jit) | Time (run) | Algorithm                  | Inputs
Teacher Curriculum A (to pre-insertion) | 1.5 min    | 8.5 min    | PPO                        | State
Teacher Curriculum B (insertion)        | 1.5 min    | 8.5 min    | PPO                        | State
Student Distillation                    | 24 s       | 2 min 55 s | Behaviour Cloning (DAgger) | Pixels

Total: 23.5 min

Summary of Approach

9 of 33

What helped?

10 of 33

Madrona

11 of 33

Madrona-MJX Rendering

  • Introduced in the MuJoCo Playground paper
  • SoTA batch renderer
  • Compatible with MJX/Brax RL training

Madrona-MJX vastly outperforms other SoTA batch renderers on Cartpole, across multiple render resolutions

What helped?

12 of 33

Madrona-MJX Rendering - High Resolution Support

13 of 33

Madrona-MJX enables fast iteration

What helped?

14 of 33

Visual Domain Randomisation

Example camera views: Real vs. two randomized Sim renders

Randomized (see the sketch below):

  • Shadows on/off
  • Light poses
  • Color
  • Camera position
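A minimal sketch of how such per-environment visual randomization can be expressed over MJX model fields; the field names (light_pos, geom_rgba, cam_pos, light_castshadow) assume a recent MJX model that exposes them, and the ranges are illustrative, not the report's values:

    import jax
    import jax.numpy as jnp

    def randomize_visuals(model, rng):
        """Perturb visual-only fields of one env's model; vmap over envs for a batch."""
        k_shadow, k_light, k_rgba, k_cam = jax.random.split(rng, 4)

        # Shadows on/off per light.
        castshadow = jax.random.bernoulli(k_shadow, 0.5, model.light_castshadow.shape)

        # Jitter light poses, geom colours, and camera positions (ranges are made up).
        light_pos = model.light_pos + jax.random.uniform(
            k_light, model.light_pos.shape, minval=-0.3, maxval=0.3)
        geom_rgba = jnp.clip(model.geom_rgba + jax.random.uniform(
            k_rgba, model.geom_rgba.shape, minval=-0.1, maxval=0.1), 0.0, 1.0)
        cam_pos = model.cam_pos + jax.random.uniform(
            k_cam, model.cam_pos.shape, minval=-0.02, maxval=0.02)

        return model.replace(light_castshadow=castshadow.astype(model.light_castshadow.dtype),
                             light_pos=light_pos, geom_rgba=geom_rgba, cam_pos=cam_pos)

    # One randomized model per parallel env:
    # batched = jax.vmap(randomize_visuals, in_axes=(None, 0))(mjx_model, jax.random.split(key, 1024))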

What helped?

15 of 33

Visual Domain Randomisation

32x32 images: Real vs. Sim (9 of 1024 randomized envs shown)

What helped?

16 of 33

Pixel-based Sim2Real Tricks

Training with Random Shifts
  • Easily improves transfer by >30% [1] (see the sketch below)
  • Augmentation illustration from [2]

Pre-trained Encoder
  • The Atari CNN encoder scores >70% on CIFAR-10!

[1] Mandlekar, Ajay, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation." arXiv, September 25, 2021. https://doi.org/10.48550/arXiv.2108.03298.

[2] Laskin, Michael, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. "Reinforcement Learning with Augmented Data." arXiv, November 5, 2020. https://doi.org/10.48550/arXiv.2004.14990.
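A minimal sketch of the random-shift augmentation (pad-and-crop over the image batch, in the spirit of RAD [2]); the pad size and function name are illustrative:

    import jax
    import jax.numpy as jnp

    def random_shift(rng, images, pad=4):
        """Pad each image by `pad` pixels (edge-replicate) and randomly crop back."""
        n, h, w, c = images.shape
        padded = jnp.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
        # Independent (dy, dx) offset per image in the batch.
        ky, kx = jax.random.split(rng)
        dy = jax.random.randint(ky, (n,), 0, 2 * pad + 1)
        dx = jax.random.randint(kx, (n,), 0, 2 * pad + 1)

        def crop_one(img, y, x):
            return jax.lax.dynamic_slice(img, (y, x, 0), (h, w, c))

        return jax.vmap(crop_one)(padded, dy, dx)

    # e.g. pixels = random_shift(aug_key, pixels)   # pixels: (B, 32, 32, C)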

What helped?

17 of 33

Difficulties

18 of 33

Depth-only Exteroception

Depth images over time: Left Hand (Real vs. Sim) and Right Hand (Real vs. Sim)

Difficulties

19 of 33

Depth-only Exteroception

Difficulties

20 of 33

Training a teacher policy - inter-arm reward signal interference?

Difficulties

21 of 33

PPO on two arms - symmetric deployment

A) Single-Arm with randomized blocks

B) Symmetric deployment

Difficulties

22 of 33

Wrapping Up

What Helped?

  • Fast rendering = fast iteration
  • Visual domain randomization
  • Pre-trained encoder
  • Random pixel shifts

Difficulties

  • Depth-only exteroception
  • RL on two arms

Giving it a Shot

  • Madrona MJX 🔪
  • Increase visual DR
  • Ablations

23 of 33

Discussion

24 of 33

Appendix

25 of 33

A1: A spatial credit assignment problem

26 of 33

A1: A spatial credit assignment problem

Reward Contamination

27 of 33

A1: Pick-Block experiment

               | Single Pick Block | Double Pick Block
Reward terms   | 9                 | 18
Action space   | 7-D               | 14-D
Observation    | 53-D              | 106-D

28 of 33

A1: Pick-Block experiment

29 of 33

A2: Training the insertion policy with height-based switching

30 of 33

A3: Architecture for depth-only attempt

  • Depth images (64x64) -> Atari CNN (random init)
  • Proprioception: [qpos, previous action] (30-D)
  • Concatenate -> MLP (256, 256, 256) -> Dense Layer -> Action Logits (Action size * 2)
  • ~400k parameters

Note *: small details omitted for clearer explanation

31 of 33

A4: RGBD Peg Insertion - Hold-Out

Issue: grasping at the air

Partial Solution: Randomly hold out the object (doesn't transfer perfectly; see the sketch below)
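One way the hold-out could be implemented at reset time, sketched under the assumption that "holding out" means parking the object outside the workspace for a random fraction of episodes; the probability, parking position, and qpos indices below are illustrative, not the report's values:

    import jax
    import jax.numpy as jnp

    HOLDOUT_PROB = 0.1                             # assumed fraction of held-out episodes
    HOLDOUT_POS = jnp.array([0.0, 0.0, -1.0])      # park the object below the table / out of view

    def maybe_hold_out(rng, qpos, obj_pos_idx):
        """With probability HOLDOUT_PROB, reset with the object removed from the scene."""
        held_out = jax.random.bernoulli(rng, HOLDOUT_PROB)
        new_obj_pos = jnp.where(held_out, HOLDOUT_POS, qpos[obj_pos_idx])
        return qpos.at[obj_pos_idx].set(new_obj_pos)

    # Usage at episode reset (obj_pos_idx = indices of the object's free-joint position in qpos):
    # qpos = maybe_hold_out(reset_key, qpos, jnp.array([16, 17, 18]))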

32 of 33

A5: Emergent Retry Behaviour 🦆

33 of 33

A6: Peg Insertion Policy; 6 sequential runs; uncut