1 of 24

Bootstrapping from Sub-Optimal Demonstrations: A Case-Study in Deformable Insertion

LfD in High-Dimensional Feature Spaces

RSS 2017

Jonathan Scholz

2 of 24

Robot Programming in Industry

Cumbersome trajectory scripting

High integration costs

Dangerous high-gain open-loop control

3 of 24

Collaborative Robotics: Cutting Edge

UIs for easy trajectory scripting

Visual perception limited to AR-tags

Tasks must be explicitly parameterized w.r.t. tags

“Pendant programming”

Figure courtesy of Rethink Robotics

4 of 24

Where Kinematic Trajectories Break Down

  • Easy part: screwing the cap
    • ABB solved with carefully tuned splines using only encoder feedback

  • Hard part: inserting the clip
    • Intrinsically contact-rich
    • Difficult to model nonlinear dynamics
    • Multi-modal feedback (encoders, torques, etc.)

“Last-inch learning”

5 of 24

The “Clippy Task”

[Figure: the Clippy task setup — a 3D-printed socket and a 3D-printed curriculum of deformable clips, varying in difficulty by material and prong angle (90° vs. 85°)]

6 of 24

Challenges for RL-Based Approach

  • Difficult to define a cost function
    • Requires state information for object deformation
    • Hardware hacks, e.g. Hall sensors, are undesirable

  • Difficult exploration problem
    • Continuous high-dimensional action space
    • Reward shaping prone to local minima

  • Our vision:
    • Given a few demonstrations of a task being achieved, safely learn to recognize the goal and how to achieve it autonomously

7 of 24

Bootstrapped Learning from Demonstration

Deterministic Policy Gradient from Demonstrations (DPGfD)

  • Key observation: Off-policy experience-replay algorithms (e.g. DQN, DPG) are ideal vessels for injecting demonstration data
    • Simply add the traces to the replay buffer
    • Bootstrap a value function as usual
      • Prioritized-replay sampling mechanism based on TD-Error
  • Core result: Works even when task rewards are sparse

[Diagram: demonstration traces and agent experience (S, A, R at steps t and t+1) feed a shared replay buffer; expert transitions are persistent; minibatches are sampled from the buffer for training]
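As a rough illustration of the replay mechanism described above, here is a minimal Python sketch (not the authors' implementation; names and constants are illustrative) that seeds a buffer with persistent demonstration transitions and samples minibatches with probability proportional to (|TD-error| + ε)^α, using a larger ε bonus for demonstrations:

```python
import random
from collections import deque

class DemoSeededReplay:
    """Sketch: a replay buffer seeded with demonstration transitions that are
    never evicted, plus a FIFO buffer of agent transitions. Sampling probability
    is proportional to (|TD-error| + eps)^alpha, with a larger eps for
    demonstrations so expert data keeps being revisited."""

    def __init__(self, capacity, alpha=0.6, eps_agent=1e-3, eps_demo=0.1):
        self.alpha, self.eps_agent, self.eps_demo = alpha, eps_agent, eps_demo
        self.demos = []                      # persistent expert transitions
        self.agent = deque(maxlen=capacity)  # agent transitions, overwritten FIFO

    def add_demo(self, s, a, r, s_next, done):
        self.demos.append([(s, a, r, s_next, done), 1.0])  # start at max priority

    def add(self, s, a, r, s_next, done):
        self.agent.append([(s, a, r, s_next, done), 1.0])

    def sample(self, batch_size):
        pool = self.demos + list(self.agent)
        weights = [
            (p + (self.eps_demo if i < len(self.demos) else self.eps_agent)) ** self.alpha
            for i, (_, p) in enumerate(pool)
        ]
        idx = random.choices(range(len(pool)), weights=weights, k=batch_size)
        return idx, [pool[i][0] for i in idx]

    def update_priorities(self, idx, td_errors):
        pool = self.demos + list(self.agent)  # same underlying entries
        for i, err in zip(idx, td_errors):
            pool[i][1] = abs(err)
```

Demonstration traces would be added once before training via add_demo; agent transitions stream in during rollouts, and update_priorities is called after each learner step with the fresh TD-errors.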

8 of 24

Simulation Experiments

Implemented a suite of insertion tasks in the MuJoCo simulator

Defined shaped and sparse rewards. The shaped reward has two terms: a mating-site distance term and a goal-site distance term.

Tasks: Peg-Insertion, Harddrive-Insertion, Clip-Insertion, Cable-Insertion
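To make the two reward variants concrete, here is a hedged Python sketch of what the shaped (two-term) and sparse rewards could look like; the site names, weights, and tolerance are assumptions, not the values used in the experiments:

```python
import numpy as np

def shaped_reward(clip_site, mating_site, goal_site, w_mating=1.0, w_goal=1.0):
    """Two-term shaped reward: penalize distance of the clip from its mating
    site and from the final goal site (3-D positions read from the simulator)."""
    d_mating = np.linalg.norm(clip_site - mating_site)
    d_goal = np.linalg.norm(clip_site - goal_site)
    return -(w_mating * d_mating + w_goal * d_goal)

def sparse_reward(clip_site, goal_site, tol=5e-3):
    """Sparse variant: non-zero reward only once the clip is within `tol`
    metres of the goal site (tolerance is an illustrative guess)."""
    return 1.0 if np.linalg.norm(clip_site - goal_site) < tol else 0.0
```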

9 of 24

Simulation Results

10 of 24

Varying the Number of Demonstrations

DPGfD on Sparse-Reward Clipping Task vs. Number of Demonstrations

  • Even a single trace is enough to find a policy

  • Baseline without demonstrations fails to find a solution

11 of 24

Real-Robot Experiment

  • Clip-insertion with 90° prong angle
  • Observations:
    • Joint angles
    • Joint velocities
    • Joint torques
  • Actions:
    • Joint velocities
  • Rewards:
    • Distance from goal pose, obtained from forward kinematics (socket position known)
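A minimal sketch of such a reward, assuming a forward_kinematics function that maps joint angles to a gripper pose (position and unit quaternion) and a goal pose calibrated from the known socket position; the weights and the orientation term are assumptions:

```python
import numpy as np

def pose_distance_reward(joint_angles, goal_pos, goal_quat,
                         forward_kinematics, w_pos=1.0, w_rot=0.1):
    """Negative distance between the gripper pose (from forward kinematics) and
    a goal pose calibrated at the known socket position.
    `forward_kinematics` maps joint angles -> (position, unit quaternion)."""
    pos, quat = forward_kinematics(joint_angles)
    pos_err = np.linalg.norm(pos - goal_pos)
    # geodesic angle between orientations; abs() handles the double cover q ~ -q
    rot_err = 2.0 * np.arccos(np.clip(abs(np.dot(quat, goal_quat)), -1.0, 1.0))
    return -(w_pos * pos_err + w_rot * rot_err)
```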

12 of 24

Learned Policy with Recovery Behavior

13 of 24

Limitations

  • Required careful calibration of the socket position
  • Failure modes:
    • Erroneous reward function
      • Partial observability of the state of the deformable object
    • Erroneous episode resets (begin_episode)
      • Non-stationary dynamics due to permanent deformation of the object during training

14 of 24

Image-Based Task Predicate

Motivation

  • Specify task by what the goal state looks like, instead of hand-coding a reward function from ground-truth state features
    • Removes dependence on analytical scene state and geometry
  • Challenge: DPG is very sensitive to false positives

[Architecture diagram: convolutional goal detector with channel widths 3 → 16 → 32 → 64 → 32 → 32, a spatial max over the final feature map, and a True/False output]

  • Conv only, no pooling, similar to VGG
  • Dilated (atrous) convolutions with power-of-two dilation rates, similar to WaveNet
  • Soft targets (0.9)

Gθ(o)→True/False
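A hedged PyTorch sketch of such a detector, matching the channel widths above (3 → 16 → 32 → 64 → 32 → 32) with WaveNet-style power-of-two dilations, no pooling, a spatial max at the output, and soft targets of 0.9; the kernel sizes, dilation schedule, and loss form are assumptions:

```python
import torch
import torch.nn as nn

class GoalDetector(nn.Module):
    """Sketch of the image-based task predicate G_theta(o) -> True/False.
    Channel widths follow the slide; dilations grow in powers of two
    (WaveNet-style); no pooling; a spatial max reduces the final feature map
    to a single logit per image."""

    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 32, 32]
        dilations = [1, 2, 4, 8, 16]
        layers = []
        for c_in, c_out, d in zip(chans[:-1], chans[1:], dilations):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Conv2d(chans[-1], 1, kernel_size=1)  # per-pixel logit

    def forward(self, image):                      # image: (B, 3, H, W)
        logits = self.head(self.features(image))   # (B, 1, H, W)
        return logits.amax(dim=(2, 3)).squeeze(1)  # spatial max -> one logit per image

def detector_loss(logits, labels, soft=0.9):
    """Soft targets: label 'True' as 0.9 rather than 1.0, discouraging the
    overconfident false positives that DPG is very sensitive to."""
    targets = labels.float() * soft
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)
```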

15 of 24

Gathering Training Data

Is faster than it looks

  • Added gripper camera

  • Iteratively gathered 20k labeled examples using the cuff buttons @ 20Hz

[Example gripper-camera frames labeled False / True]
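The collection loop itself can be very simple; a sketch under assumed interfaces (the camera, cuff-button, and dataset objects are hypothetical):

```python
import time

def collect_labeled_frames(camera, cuff_button, dataset, hz=20):
    """Sketch of the labeling loop: grab a gripper-camera frame at ~20 Hz and
    store it with the label currently indicated by the cuff button
    (pressed = goal state / True). All interfaces here are hypothetical."""
    period = 1.0 / hz
    while True:
        frame = camera.read()             # hypothetical camera interface
        label = cuff_button.is_pressed()  # hypothetical button interface
        dataset.append((frame, label))
        time.sleep(period)
```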

16 of 24

DPGfD with Insertion-Detector

17 of 24

DPGfD with Image-Based Insertion Detector

Lights Out (10 pm)

Return to Lab (10 am)

18 of 24

DPGfD from Pixels & Proprioception

  • Still not solving the acute-angle insertion task
    • Requires state information about clip deformation
  • Simple idea: add image to observation
  • Conventional wisdom: deep RL from pixels requires millions of frames
    • Initialize vision channel using encoder from goal detector

[Diagram: actor-critic network — the observation (joint position, joint velocity, joint torque, gripper camera) feeds the policy π and critic Q; the TD-error is computed from the reward R and successive critic estimates Q and Qt-1]
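A hedged PyTorch sketch of how the pretrained vision channel could be reused: the actor runs the gripper-camera image through the goal detector's convolutional trunk (see the detector sketch on the earlier slide) and concatenates the result with proprioception before the policy MLP; layer sizes and the proprioception layout are assumptions:

```python
import torch
import torch.nn as nn

class PixelProprioActor(nn.Module):
    """Sketch: policy over gripper-camera pixels + proprioception. The conv
    trunk is copied from the trained goal detector, so the vision channel
    starts from features that already distinguish inserted from not-inserted."""

    def __init__(self, goal_detector, proprio_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = goal_detector.features         # pretrained conv trunk
        self.mlp = nn.Sequential(
            nn.Linear(32 + proprio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # joint-velocity command in [-1, 1]
        )

    def forward(self, image, proprio):
        feat = self.encoder(image).amax(dim=(2, 3))    # (B, 32) after spatial max
        return self.mlp(torch.cat([feat, proprio], dim=1))
```

The critic could share the same encoder; for a 7-DoF arm one might instantiate the actor as PixelProprioActor(detector, proprio_dim=21, action_dim=7) (positions, velocities, torques for 7 joints) — purely illustrative numbers.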

19 of 24

DPGfD from Images for Acute-Angle Clipping

20 of 24

Conclusions

Good:

  • First end-to-end deep RL from pixels to joints on a real robot! (AFAIK)
  • Fully data-driven LfD pipeline, with no reliance on simulation, state estimation, or reward shaping
  • Can solve a real-world deformable insertion problem in < 1 day
  • A (small) step towards symbolic task specification

Bad:

  • Still doing model-free RL: weak prior, naive exploration, many DPG hyperparameters to tune
  • Requires lots of data to train the goal predicate
  • No transfer to other tasks (e.g. peg, wire, reach, stack, etc.)

Ugly:

  • Hand-coded compliance controller

[Photo: the “Clippy Graveyard”]

21 of 24

Perspectives

A twist on the classical deep-learning story of “just give it enough data”

  • Now: “just give it the right data”

Solves a simpler problem than Inverse-RL by making a simplifying assumption:

  • The cost function is sparse (boolean in this case)
  • Human input is now used in two capacities: labels for the goal region, and trajectories for exploration

22 of 24

Next Steps

  • Learned variable impedance (sketched below)
    • The hand-coded impedance controller is subject to calibration errors
    • Future tasks may require greater effort during certain phases of the task
  • Hierarchical control with contextual input for transfer
  • Stronger priors and regularizers for predicate and policy learning
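For the variable-impedance item above, a minimal sketch of a joint-space impedance law in which the stiffness gains could be produced by the policy rather than hand-tuned (the critically damped gain rule and the per-joint parameterization are assumptions):

```python
import numpy as np

def impedance_torque(q, qdot, q_des, kp, damping_ratio=1.0):
    """Joint-space impedance control: PD law around a desired configuration.
    If the policy outputs `kp` per joint alongside `q_des`, the arm can be
    stiff while seating the clip and compliant while searching for contact."""
    kd = 2.0 * damping_ratio * np.sqrt(kp)  # critically damped by default
    return kp * (q_des - q) - kd * qdot
```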

23 of 24

Thanks

Kevin Luck

Fumin Wang

Matej Vecerik

Todd Hester

Tom Rothorl

24 of 24

We’re Hiring!