1 of 33

Pixels-Based Bimanual Peg Insertion by Distilling a State-based RL Policy

Technical Report

Andrew JY Luo

2 of 33

Six hardware rollouts: Run 1 – Run 6

3 of 33

Summary of Approach

4 of 33

Simulation-Only Training Pipeline

Teacher (state-based):
  • Simulation-only
  • Knows object locations
  • Trained with RL + curriculum (Curriculum A: to pre-insertion, then Curriculum B: insertion)

Student, distilled from the teacher via BC:
  • Pixels-based
  • Deployed on hardware

Summary of Approach

5 of 33

RGBD Peg Insertion - Architecture

  • RGBD images (64x64) -> Resize to 32x32 -> Atari CNN (pretrained, frozen) -> 8x8 feature maps -> Flatten
  • Proprioception: [qpos, previous action] (30-D)
  • Concatenate image features with proprioception -> MLP (256, 256, 256) -> Dense Layer -> Action Logits (Action size * 2)
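A minimal sketch of this student network in Flax, assuming a Nature-DQN style Atari CNN and a two-parameter (e.g. mean and log-std) action head; module and layer details are illustrative, not the report's code:

    import jax
    import jax.numpy as jnp
    import flax.linen as nn

    class AtariCNN(nn.Module):
        """Nature-DQN style encoder; in the report its weights are pretrained and frozen."""
        @nn.compact
        def __call__(self, x):                        # x: (B, 32, 32, C), values in [0, 1]
            x = nn.relu(nn.Conv(32, kernel_size=(8, 8), strides=(4, 4))(x))
            x = nn.relu(nn.Conv(64, kernel_size=(4, 4), strides=(2, 2))(x))
            x = nn.relu(nn.Conv(64, kernel_size=(3, 3), strides=(1, 1))(x))
            return x.reshape((x.shape[0], -1))        # flatten the spatial feature map

    class StudentPolicy(nn.Module):
        action_size: int

        @nn.compact
        def __call__(self, image, proprio):           # proprio: (B, 30) = [qpos, previous action]
            image = jax.image.resize(
                image, (image.shape[0], 32, 32, image.shape[-1]), "bilinear")
            x = jnp.concatenate([AtariCNN()(image), proprio], axis=-1)
            for width in (256, 256, 256):
                x = nn.relu(nn.Dense(width)(x))
            return nn.Dense(self.action_size * 2)(x)  # action logits: action size * 2

    # params = StudentPolicy(action_size=14).init(jax.random.PRNGKey(0), images, proprio)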

Summary of Approach


6 of 33

Online DAgger for BC (~2.5e6 samples)

  • The simulator state s_t is rendered to pixels pix_t via render(); the student π_student acts from the pixel observation o_t, while the teacher π_teacher acts from the privileged state s_t.
  • The student's action is executed with mjx.step() to reach s_{t+1}, and the student is trained with the imitation loss L_t = || a_t^student - a_t^teacher ||_2^2.

Summary of Approach
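A minimal sketch of one step of this distillation loop; the callables (render, make_obs, student_apply, teacher_apply, mjx_step) and the choice to step the simulator with the student's action are assumptions about the setup, not code from the report:

    import jax
    import jax.numpy as jnp
    import optax

    def dagger_step(student_apply, teacher_apply, render, make_obs, mjx_step,
                    optimizer, student_params, teacher_params, opt_state, env_state):
        """One online-DAgger update: act with the student, label with the teacher."""
        pix = render(env_state)                               # batched pixels (e.g. Madrona-MJX)
        obs = make_obs(pix, env_state)                        # pixels + [qpos, previous action]
        a_teacher = teacher_apply(teacher_params, env_state)  # privileged, state-based label

        def loss_fn(p):
            a_student = student_apply(p, obs)
            # L_t = || a_t^student - a_t^teacher ||_2^2
            return jnp.mean(jnp.sum((a_student - a_teacher) ** 2, axis=-1))

        loss, grads = jax.value_and_grad(loss_fn)(student_params)
        updates, opt_state = optimizer.update(grads, opt_state, student_params)
        student_params = optax.apply_updates(student_params, updates)

        # Step the simulator with the student's action so the collected states stay on-policy.
        env_state = mjx_step(env_state, student_apply(student_params, obs))
        return student_params, opt_state, env_state, loss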

7 of 33

Robot Setup

Hardware
  • Arms: ViperX 300S
  • Cameras: Realsense D405

Huge thanks to Kevin Zakka + Levine Lab!

Software
  • Client/server split between the robot hardware and the policy

JAX policy -> ONNX (~1 day of work):

  1. Re-implement the policy in TensorFlow
  2. Manually transfer the weights
  3. Convert with tf2onnx
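For step 3, a minimal conversion sketch with tf2onnx, assuming the policy has already been re-implemented in TensorFlow with the weights copied across; the Keras model below is a simplified stand-in (proprioception-only) rather than the full pixels + proprioception network:

    import tensorflow as tf
    import tf2onnx

    # Simplified stand-in for the re-implemented policy (weights transferred manually).
    policy = tf.keras.Sequential([
        tf.keras.Input(shape=(30,), name="proprio"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(2 * 14),                # action size * 2 outputs
    ])

    # Export to ONNX for deployment.
    spec = (tf.TensorSpec((1, 30), tf.float32, name="proprio"),)
    tf2onnx.convert.from_keras(policy, input_signature=spec, opset=13,
                               output_path="policy.onnx")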

Summary of Approach

8 of 33

Summary of Training Pipeline

Phase                                   | Time (jit) | Time (run) | Algorithm                  | Inputs
Teacher Curriculum A (to pre-insertion) | 1.5 min    | 8.5 min    | PPO                        | State
Teacher Curriculum B (insertion)        | 1.5 min    | 8.5 min    | PPO                        | State
Student Distillation                    | 24 s       | 2 min 55 s | Behaviour Cloning (DAgger) | Pixels

Total: 23.5 min

Summary of Approach

9 of 33

What helped?

10 of 33

Madrona

11 of 33

Madrona-MJX Rendering

  • Introduced in the MuJoCo Playground paper
  • SoTA batch renderer
  • Compatible with MJX/Brax RL training

Madrona-MJX vastly outperforms other SoTA batch renderers on Cartpole, across multiple render resolutions

What helped?

12 of 33

Madrona-MJX Rendering - High Resolution Support

13 of 33

Madrona-MJX enables fast iteration

What helped?

14 of 33

Visual Domain Randomisation

Example camera views: Real vs. two randomized Sim renders

Randomized (see the sketch below):

  • Shadows on/off
  • Light poses
  • Color
  • Camera position
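A minimal sketch of how such per-environment visual randomization can be expressed over MJX model fields; the field names (light_pos, geom_rgba, cam_pos, light_castshadow) assume a recent MJX model that exposes them, and the ranges are illustrative, not the report's values:

    import jax
    import jax.numpy as jnp

    def randomize_visuals(model, rng):
        """Perturb visual-only fields of one env's model; vmap over envs for a batch."""
        k_shadow, k_light, k_rgba, k_cam = jax.random.split(rng, 4)

        # Shadows on/off per light.
        castshadow = jax.random.bernoulli(k_shadow, 0.5, model.light_castshadow.shape)

        # Jitter light poses, geom colours, and camera positions (ranges are made up).
        light_pos = model.light_pos + jax.random.uniform(
            k_light, model.light_pos.shape, minval=-0.3, maxval=0.3)
        geom_rgba = jnp.clip(model.geom_rgba + jax.random.uniform(
            k_rgba, model.geom_rgba.shape, minval=-0.1, maxval=0.1), 0.0, 1.0)
        cam_pos = model.cam_pos + jax.random.uniform(
            k_cam, model.cam_pos.shape, minval=-0.02, maxval=0.02)

        return model.replace(light_castshadow=castshadow.astype(model.light_castshadow.dtype),
                             light_pos=light_pos, geom_rgba=geom_rgba, cam_pos=cam_pos)

    # One randomized model per parallel env:
    # batched = jax.vmap(randomize_visuals, in_axes=(None, 0))(mjx_model, jax.random.split(key, 1024))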

What helped?

15 of 33

Visual Domain Randomisation

32x32 images: Real vs. Sim (9 of 1024 randomized envs shown)

What helped?

16 of 33

Pixel-based Sim2Real Tricks

Training with Random Shifts
  • Easily improves transfer by >30% [1] (see the sketch below)
  • Augmentation illustration from [2]

Pre-trained Encoder
  • The Atari CNN encoder scores >70% on CIFAR-10!

[1] Mandlekar, Ajay, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation." arXiv, September 25, 2021. https://doi.org/10.48550/arXiv.2108.03298.

[2] Laskin, Michael, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. "Reinforcement Learning with Augmented Data." arXiv, November 5, 2020. https://doi.org/10.48550/arXiv.2004.14990.
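A minimal sketch of the random-shift augmentation (pad-and-crop over the image batch, in the spirit of RAD [2]); the pad size and function name are illustrative:

    import jax
    import jax.numpy as jnp

    def random_shift(rng, images, pad=4):
        """Pad each image by `pad` pixels (edge-replicate) and randomly crop back."""
        n, h, w, c = images.shape
        padded = jnp.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
        # Independent (dy, dx) offset per image in the batch.
        ky, kx = jax.random.split(rng)
        dy = jax.random.randint(ky, (n,), 0, 2 * pad + 1)
        dx = jax.random.randint(kx, (n,), 0, 2 * pad + 1)

        def crop_one(img, y, x):
            return jax.lax.dynamic_slice(img, (y, x, 0), (h, w, c))

        return jax.vmap(crop_one)(padded, dy, dx)

    # e.g. pixels = random_shift(aug_key, pixels)   # pixels: (B, 32, 32, C)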

What helped?

17 of 33

Difficulties

18 of 33

Depth-only Exteroception

Depth images over time: Left Hand (Real vs. Sim) and Right Hand (Real vs. Sim)

Difficulties

19 of 33

Depth-only Exteroception

Difficulties

20 of 33

Training a teacher policy - inter-arm reward signal interference?

Difficulties

21 of 33

PPO on two arms - symmetric deployment

A) Single-Arm with randomized blocks

B) Symmetric deployment

Difficulties

22 of 33

Wrapping Up

What Helped?

  • Fast rendering = fast iteration
  • Visual domain randomization
  • Pre-trained encoder
  • Random pixel shifts

Difficulties

  • Depth-only exteroception
  • RL on two arms

Giving it a Shot

  • Madrona MJX 🔪
  • Increase visual DR
  • Ablations

23 of 33

Discussion

24 of 33

Appendix

25 of 33

A1: A spatial credit assignment problem

26 of 33

A1: A spatial credit assignment problem

Reward Contamination

27 of 33

A1: Pick-Block experiment

               | Single Pick Block | Double Pick Block
Reward terms   | 9                 | 18
Action space   | 7-D               | 14-D
Observation    | 53-D              | 106-D

28 of 33

A1: Pick-Block experiment

29 of 33

A2: Training the insertion policy with height-based switching

30 of 33

A3: Architecture for depth-only attempt

  • Depth images (64x64) -> Atari CNN (random init)
  • Proprioception: [qpos, previous action] (30-D)
  • Concatenate -> MLP (256, 256, 256) -> Dense Layer -> Action Logits (Action size * 2)
  • ~400k parameters

Note *: small details omitted for clearer explanation

31 of 33

A4: RGBD Peg Insertion - Hold-Out

Issue: grasping at the air

Partial Solution: Randomly hold out the object (doesn't transfer perfectly; see the sketch below)
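One way the hold-out could be implemented at reset time, sketched under the assumption that "holding out" means parking the object outside the workspace for a random fraction of episodes; the probability, parking position, and qpos indices below are illustrative, not the report's values:

    import jax
    import jax.numpy as jnp

    HOLDOUT_PROB = 0.1                             # assumed fraction of held-out episodes
    HOLDOUT_POS = jnp.array([0.0, 0.0, -1.0])      # park the object below the table / out of view

    def maybe_hold_out(rng, qpos, obj_pos_idx):
        """With probability HOLDOUT_PROB, reset with the object removed from the scene."""
        held_out = jax.random.bernoulli(rng, HOLDOUT_PROB)
        new_obj_pos = jnp.where(held_out, HOLDOUT_POS, qpos[obj_pos_idx])
        return qpos.at[obj_pos_idx].set(new_obj_pos)

    # Usage at episode reset (obj_pos_idx = indices of the object's free-joint position in qpos):
    # qpos = maybe_hold_out(reset_key, qpos, jnp.array([16, 17, 18]))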

32 of 33

A5: Emergent Retry Behaviour 🦆

33 of 33

A6: Peg Insertion Policy; 6 sequential runs; uncut