1 of 24

Bootstrapping from Sub-Optimal Demonstrations: A Case-Study in Deformable Insertion

LfD in High-Dimensional Feature Spaces

RSS 2017

Jonathan Scholz

2 of 24

Robot Programming in Industry

Cumbersome trajectory scripting

High integration costs

Dangerous high-gain open-loop control

3 of 24

Collaborative Robotics: Cutting Edge

UIs for easy trajectory scripting

Visual perception limited to AR-tags

Tasks must be explicitly parameterized w.r.t. tags

“Pendant programming”

Figure courtesy of Rethink Robotics

4 of 24

Where Kinematic Trajectories Break Down

  • Easy part: screwing the cap
    • ABB solved with carefully tuned splines using only encoder feedback

  • Hard part: inserting the clip
    • Intrinsically contact-rich
    • Difficult to model nonlinear dynamics
    • Multi-modal feedback (encoders, torques, etc.)

“Last-inch learning”

5 of 24

The “Clippy Task”

[Figure: the Clippy task setup — a 3D-printed socket and a 3D-printed curriculum of deformable clips, varying in difficulty by material and prong angle (90° vs. 85°)]

6 of 24

Challenges for RL-Based Approach

  • Difficult to define a cost function
    • Requires state information for object deformation
    • Hardware hacks, e.g. Hall sensors, are undesirable

  • Difficult exploration problem
    • Continuous high-dimensional action space
    • Reward shaping prone to local minima

  • Our vision:
    • Given a few demonstrations of a task being achieved, safely learn to recognize the goal and how to achieve it autonomously

7 of 24

Bootstrapped Learning from Demonstration

Deterministic Policy Gradient from Demonstrations (DPGfD)

  • Key observation: Off-policy experience-replay algorithms (e.g. DQN, DPG) are ideal vessels for injecting demonstration data
    • Simply add the traces to the replay buffer
    • Bootstrap a value function as usual
      • Prioritized-replay sampling mechanism based on TD-Error
  • Core result: Works even when task rewards are sparse

[Diagram: demonstration traces and agent experience (S, A, R at steps t and t+1) feed a shared replay buffer; expert transitions are persistent; minibatches are sampled from the buffer for training]
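As a rough illustration of the replay mechanism described above, here is a minimal Python sketch (not the authors' implementation; names and constants are illustrative) that seeds a buffer with persistent demonstration transitions and samples minibatches with probability proportional to (|TD-error| + ε)^α, using a larger ε bonus for demonstrations:

```python
import random
from collections import deque

class DemoSeededReplay:
    """Sketch: a replay buffer seeded with demonstration transitions that are
    never evicted, plus a FIFO buffer of agent transitions. Sampling probability
    is proportional to (|TD-error| + eps)^alpha, with a larger eps for
    demonstrations so expert data keeps being revisited."""

    def __init__(self, capacity, alpha=0.6, eps_agent=1e-3, eps_demo=0.1):
        self.alpha, self.eps_agent, self.eps_demo = alpha, eps_agent, eps_demo
        self.demos = []                      # persistent expert transitions
        self.agent = deque(maxlen=capacity)  # agent transitions, overwritten FIFO

    def add_demo(self, s, a, r, s_next, done):
        self.demos.append([(s, a, r, s_next, done), 1.0])  # start at max priority

    def add(self, s, a, r, s_next, done):
        self.agent.append([(s, a, r, s_next, done), 1.0])

    def sample(self, batch_size):
        pool = self.demos + list(self.agent)
        weights = [
            (p + (self.eps_demo if i < len(self.demos) else self.eps_agent)) ** self.alpha
            for i, (_, p) in enumerate(pool)
        ]
        idx = random.choices(range(len(pool)), weights=weights, k=batch_size)
        return idx, [pool[i][0] for i in idx]

    def update_priorities(self, idx, td_errors):
        pool = self.demos + list(self.agent)  # same underlying entries
        for i, err in zip(idx, td_errors):
            pool[i][1] = abs(err)
```

Demonstration traces would be added once before training via add_demo; agent transitions stream in during rollouts, and update_priorities is called after each learner step with the fresh TD-errors.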

8 of 24

Simulation Experiments

Implemented a suite of insertion tasks in the MuJoCo simulator

Defined shaped and sparse rewards. The shaped reward has two terms: a mating-site distance term and a goal-site distance term.

Tasks: Peg-Insertion, Harddrive-Insertion, Clip-Insertion, Cable-Insertion
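To make the two reward variants concrete, here is a hedged Python sketch of what the shaped (two-term) and sparse rewards could look like; the site names, weights, and tolerance are assumptions, not the values used in the experiments:

```python
import numpy as np

def shaped_reward(clip_site, mating_site, goal_site, w_mating=1.0, w_goal=1.0):
    """Two-term shaped reward: penalize distance of the clip from its mating
    site and from the final goal site (3-D positions read from the simulator)."""
    d_mating = np.linalg.norm(clip_site - mating_site)
    d_goal = np.linalg.norm(clip_site - goal_site)
    return -(w_mating * d_mating + w_goal * d_goal)

def sparse_reward(clip_site, goal_site, tol=5e-3):
    """Sparse variant: non-zero reward only once the clip is within `tol`
    metres of the goal site (tolerance is an illustrative guess)."""
    return 1.0 if np.linalg.norm(clip_site - goal_site) < tol else 0.0
```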

9 of 24

Simulation Results

10 of 24

Varying the Number of Demonstrations

DPGfD on Sparse-Reward Clipping Task vs. Number of Demonstrations

  • Even a single trace is enough to find a policy

  • Baseline without demonstrations fails to find a solution

11 of 24

Real-Robot Experiment

  • Clip-insertion with 90° prong angle
  • Observations:
    • Joint angles
    • Joint velocities
    • Joint torques
  • Actions:
    • Joint velocities
  • Rewards:
    • Distance from goal pose, obtained from forward kinematics (socket position known)
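A minimal sketch of such a reward, assuming a forward_kinematics function that maps joint angles to a gripper pose (position and unit quaternion) and a goal pose calibrated from the known socket position; the weights and the orientation term are assumptions:

```python
import numpy as np

def pose_distance_reward(joint_angles, goal_pos, goal_quat,
                         forward_kinematics, w_pos=1.0, w_rot=0.1):
    """Negative distance between the gripper pose (from forward kinematics) and
    a goal pose calibrated at the known socket position.
    `forward_kinematics` maps joint angles -> (position, unit quaternion)."""
    pos, quat = forward_kinematics(joint_angles)
    pos_err = np.linalg.norm(pos - goal_pos)
    # geodesic angle between orientations; abs() handles the double cover q ~ -q
    rot_err = 2.0 * np.arccos(np.clip(abs(np.dot(quat, goal_quat)), -1.0, 1.0))
    return -(w_pos * pos_err + w_rot * rot_err)
```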

12 of 24

Learned Policy with Recovery Behavior

13 of 24

Limitations

  • Required careful calibration of the socket position
  • Failure modes:
    • Erroneous reward function
      • Partial observability of the state of the deformable object
    • Erroneous episode resets (begin_episode)
      • Non-stationary dynamics due to permanent deformation of the object during training

14 of 24

Image-Based Task Predicate

Motivation

  • Specify task by what the goal state looks like, instead of hand-coding a reward function from ground-truth state features
    • Removes dependence on analytical scene state and geometry
  • Challenge: DPG is very sensitive to false positives

[Architecture diagram: convolutional goal detector with channel widths 3 → 16 → 32 → 64 → 32 → 32, a spatial max over the final feature map, and a True/False output]

  • Conv only, no pooling, similar to VGG
  • Dilated (atrous) convolutions with power-of-two dilation rates, similar to WaveNet
  • Soft targets (0.9)

Gθ(o)→True/False
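A hedged PyTorch sketch of such a detector, matching the channel widths above (3 → 16 → 32 → 64 → 32 → 32) with WaveNet-style power-of-two dilations, no pooling, a spatial max at the output, and soft targets of 0.9; the kernel sizes, dilation schedule, and loss form are assumptions:

```python
import torch
import torch.nn as nn

class GoalDetector(nn.Module):
    """Sketch of the image-based task predicate G_theta(o) -> True/False.
    Channel widths follow the slide; dilations grow in powers of two
    (WaveNet-style); no pooling; a spatial max reduces the final feature map
    to a single logit per image."""

    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 32, 32]
        dilations = [1, 2, 4, 8, 16]
        layers = []
        for c_in, c_out, d in zip(chans[:-1], chans[1:], dilations):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Conv2d(chans[-1], 1, kernel_size=1)  # per-pixel logit

    def forward(self, image):                      # image: (B, 3, H, W)
        logits = self.head(self.features(image))   # (B, 1, H, W)
        return logits.amax(dim=(2, 3)).squeeze(1)  # spatial max -> one logit per image

def detector_loss(logits, labels, soft=0.9):
    """Soft targets: label 'True' as 0.9 rather than 1.0, discouraging the
    overconfident false positives that DPG is very sensitive to."""
    targets = labels.float() * soft
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)
```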

15 of 24

Gathering Training Data

Is faster than it looks

  • Added gripper camera

  • Iteratively gathered 20k labeled examples using the cuff buttons @ 20Hz

[Example gripper-camera frames labeled False / True]
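The collection loop itself can be very simple; a sketch under assumed interfaces (the camera, cuff-button, and dataset objects are hypothetical):

```python
import time

def collect_labeled_frames(camera, cuff_button, dataset, hz=20):
    """Sketch of the labeling loop: grab a gripper-camera frame at ~20 Hz and
    store it with the label currently indicated by the cuff button
    (pressed = goal state / True). All interfaces here are hypothetical."""
    period = 1.0 / hz
    while True:
        frame = camera.read()             # hypothetical camera interface
        label = cuff_button.is_pressed()  # hypothetical button interface
        dataset.append((frame, label))
        time.sleep(period)
```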

16 of 24

DPGfD with Insertion-Detector

17 of 24

DPGfD with Image-Based Insertion Detector

Lights Out (10 pm)

Return to Lab (10 am)

18 of 24

DPGfD from Pixels & Proprioception

  • Still not solving the acute-angle insertion task
    • Requires state information about clip deformation
  • Simple idea: add image to observation
  • Conventional wisdom: deep RL from pixels requires millions of frames
    • Initialize vision channel using encoder from goal detector

[Diagram: actor-critic network — the observation (joint position, joint velocity, joint torque, gripper camera) feeds the policy π and critic Q; the TD-error is computed from the reward R and successive critic estimates Q and Qt-1]
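A hedged PyTorch sketch of how the pretrained vision channel could be reused: the actor runs the gripper-camera image through the goal detector's convolutional trunk (see the detector sketch on the earlier slide) and concatenates the result with proprioception before the policy MLP; layer sizes and the proprioception layout are assumptions:

```python
import torch
import torch.nn as nn

class PixelProprioActor(nn.Module):
    """Sketch: policy over gripper-camera pixels + proprioception. The conv
    trunk is copied from the trained goal detector, so the vision channel
    starts from features that already distinguish inserted from not-inserted."""

    def __init__(self, goal_detector, proprio_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = goal_detector.features         # pretrained conv trunk
        self.mlp = nn.Sequential(
            nn.Linear(32 + proprio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # joint-velocity command in [-1, 1]
        )

    def forward(self, image, proprio):
        feat = self.encoder(image).amax(dim=(2, 3))    # (B, 32) after spatial max
        return self.mlp(torch.cat([feat, proprio], dim=1))
```

The critic could share the same encoder; for a 7-DoF arm one might instantiate the actor as PixelProprioActor(detector, proprio_dim=21, action_dim=7) (positions, velocities, torques for 7 joints) — purely illustrative numbers.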

19 of 24

DPGfD from Images for Acute-Angle Clipping

20 of 24

Conclusions

Good:

  • First end-to-end deep RL from pixels to joints on a real robot! (AFAIK)
  • Fully data-driven LfD pipeline, with no reliance on simulation, state estimation, or reward shaping
  • Can solve a real-world deformable insertion problem in < 1 day
  • A (small) step towards symbolic task specification

Bad:

  • Still doing model-free RL: weak prior, naive exploration, many DPG hyperparameters to tune
  • Requires lots of data to train the goal predicate
  • No transfer to other tasks (e.g. peg, wire, reach, stack, etc.)

Ugly:

  • Hand-coded compliance controller

[Photo: the “Clippy Graveyard”]

21 of 24

Perspectives

A twist on the classical deep-learning story of “just give it enough data”

  • Now: “just give it the right data”

Solves a simpler problem than Inverse-RL by making a simplifying assumption:

  • The cost function is sparse (boolean in this case)
  • Human input is now used in two capacities: labels for the goal region, and trajectories for exploration

22 of 24

Next Steps

  • Learned variable impedance (sketched below)
    • The hand-coded impedance controller is subject to calibration errors
    • Future tasks may require greater effort during certain phases of the task
  • Hierarchical control with contextual input for transfer
  • Stronger priors and regularizers for predicate and policy learning
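For the variable-impedance item above, a minimal sketch of a joint-space impedance law in which the stiffness gains could be produced by the policy rather than hand-tuned (the critically damped gain rule and the per-joint parameterization are assumptions):

```python
import numpy as np

def impedance_torque(q, qdot, q_des, kp, damping_ratio=1.0):
    """Joint-space impedance control: PD law around a desired configuration.
    If the policy outputs `kp` per joint alongside `q_des`, the arm can be
    stiff while seating the clip and compliant while searching for contact."""
    kd = 2.0 * damping_ratio * np.sqrt(kp)  # critically damped by default
    return kp * (q_des - q) - kd * qdot
```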

23 of 24

Thanks

Kevin Luck

Fumin Wang

Matej Vecerik

Todd Hester

Tom Rothorl

24 of 24

We’re Hiring!