Bootstrapping from Sub-Optimal Demonstrations: A Case-Study in Deformable Insertion
LfD in High-Dimensional Feature Spaces
RSS 2017
Jonathan Scholz
Robot Programming in Industry
Cumbersome trajectory scripting
High integration costs
Dangerous high-gain open-loop control
Collaborative Robotics: Cutting Edge
UIs for easy trajectory scripting
Visual perception limited to AR-tags
Tasks must be explicitly parameterized w.r.t. tags
“Pendant programming”
Figure courtesy of Rethink Robotics
Where Kinematic Trajectories Break Down
“Last-inch learning”
The “Clippy Task”
Figure labels: Task Difficulty, Material, Prong Angle (90, 85)
3D-printed socket
3D-printed curriculum of deformable clips
Challenges for RL-Based Approach
Bootstrapped Learning from Demonstration
Deterministic Policy Gradient from Demonstrations (DPGfD)
Diagram labels: Demonstrations, Replay Buffer, Minibatch; transition fields S, A, R at t and t+1; Expert (persistent)
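The replay-buffer diagram above can be sketched roughly as follows. This is a minimal illustration, not the implementation behind the talk: the class name `DemoReplayBuffer`, the `agent_capacity` parameter, and the uniform sampling rule are all assumptions made for the example. The only idea taken from the slide is that expert transitions stay in the buffer permanently while agent transitions are replaced over time.

```python
import random
from collections import deque


class DemoReplayBuffer:
    """Hypothetical replay buffer that never evicts demonstration transitions."""

    def __init__(self, demonstrations, agent_capacity):
        # Expert transitions are stored once and kept for the whole run.
        self.demos = list(demonstrations)
        # Agent transitions live in a FIFO queue; the oldest are dropped first.
        self.agent = deque(maxlen=agent_capacity)

    def add(self, state, action, reward, next_state):
        """Store one agent transition (S_t, A_t, R_t, S_t+1)."""
        self.agent.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Assumption: uniform sampling over the union of expert and agent data.
        pool = self.demos + list(self.agent)
        return rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self.demos) + len(self.agent)
```

Because the demonstrations sit outside the bounded queue, every minibatch can still contain expert transitions no matter how long the agent has been collecting its own data.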
Simulation Experiments
Implemented a suite of insertion tasks in the MuJoCo simulator
Defined shaped and sparse rewards
First term: mating-site distance
Second term: goal-site distance
Peg-Insertion
Harddrive-Insertion
Clip-Insertion
Cable-Insertion
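To make the two reward variants concrete, here is a small sketch. The slides name only the two distance terms; the negative weighted sum, the weights `w_mating` and `w_goal`, the `tolerance` value, and all function names are assumptions chosen for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sparse_reward(plug_pos, goal_pos, tolerance=0.005):
    # Assumption: reward 1 only when the plug is within `tolerance` of the goal site.
    return 1.0 if distance(plug_pos, goal_pos) < tolerance else 0.0


def shaped_reward(plug_pos, mating_pos, goal_pos, w_mating=1.0, w_goal=1.0):
    # First term: distance to the mating site.
    # Second term: distance to the goal site.
    # Assumption: the terms are combined as a negative weighted sum.
    first = w_mating * distance(plug_pos, mating_pos)
    second = w_goal * distance(plug_pos, goal_pos)
    return -(first + second)
```

The shaped variant gives a learning signal everywhere in the workspace, while the sparse variant is zero until the insertion succeeds.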
Simulation Results
Varying the Number of Demonstrations
DPGfD on Sparse-Reward Clipping Task vs. Number of Demonstrations
Real-Robot Experiment
Obtained from forward kinematics (socket position known)
Learned Policy with Recovery Behavior
Limitations
Image-Based Task Predicate
Motivation
Gθ(o) → True/False
Figure: network diagram (labels: 3, 16, 32, 64, 32, 32, max) with True/False output
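A rough sketch of how a predicate of the form Gθ(o) → True/False could be thresholded and turned into a sparse reward. The network itself is omitted: the function takes a raw classifier score (`logit`) as input, and the sigmoid, the `threshold` value, and the function names are assumptions for illustration rather than details from the talk.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def goal_predicate(logit, threshold=0.5):
    # Assumption: the classifier outputs one logit per observation,
    # and the task counts as done when the probability reaches `threshold`.
    return sigmoid(logit) >= threshold


def predicate_reward(logit):
    # Sparse reward derived from the predicate: 1 on True, 0 on False.
    return 1.0 if goal_predicate(logit) else 0.0
```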
Gathering Training Data
Is faster than it looks
Figure: example images labeled False and True
DPGfD with Insertion-Detector
DPGfD with Image-Based Insertion Detector
Lights Out (10 pm)
Return to Lab (10 am)
DPGfD from Pixels & Proprioception
Diagram labels: Observation, Action, π, Q, Qt-1, R, TD-Error
Observation inputs: Joint Position, Joint Velocity, Joint Torque, Gripper Camera
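The TD-error in the diagram can be illustrated with the standard one-step form used in deterministic policy gradient methods, where the next action comes from the policy π. This is a generic sketch, not the architecture from the talk: target networks are left out, and `q`, `policy`, and `discount` are hypothetical stand-ins.

```python
def td_error(q, policy, transition, discount=0.99):
    """One-step TD error: r + discount * Q(s', pi(s')) - Q(s, a).

    `q` and `policy` are any callables; real systems would typically use
    separate target networks for the bootstrapped term (omitted here).
    """
    state, action, reward, next_state = transition
    bootstrap = q(next_state, policy(next_state))
    return reward + discount * bootstrap - q(state, action)


def critic_loss(q, policy, batch, discount=0.99):
    # Mean squared TD error over a minibatch of transitions.
    errors = [td_error(q, policy, t, discount) for t in batch]
    return sum(e * e for e in errors) / len(errors)
```

With toy scalar functions such as `q = lambda s, a: s + a` and `policy = lambda s: 2 * s`, the functions can be called directly on `(state, action, reward, next_state)` tuples to check the arithmetic by hand.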
DPGfD from Images for Acute-Angle Clipping
Conclusions
First end-to-end deep RL from pixels to joints on a real robot! (AFAIK)
Fully data-driven LfD pipeline, with no reliance on simulation, state estimation, or reward shaping
Still doing model-free RL...
Requires lots of data to train goal predicate
No transfer to other tasks (e.g. peg, wire, reach, stack, etc.)
Hand-coded compliance controller
Good
Bad
Ugly
Clippy Graveyard
Perspectives
A twist on the classic deep-learning story of “just give it enough data”
Solves a simpler problem than Inverse-RL by making a simplifying assumption:
Next Steps
Thanks
Kevin Luck
Fumin Wang
Matej Vecerik
Todd Hester
Tom Rothorl
We’re Hiring!