1 of 29

Helping robots to help humans

Luca Minciullo, Lifehikes

2 of 29

DB-GAN: Boosting Object Recognition Under Strong Lighting Conditions

Luca Minciullo*, Fabian Manhardt*, Kei Yoshikawa, Sven Meier, Federico Tombari and Norimasa Kobori

Toyota Motor Europe, Technical University of Munich, Woven Core

*equal contribution

3 of 29

Motivation

  • Lighting conditions
    • vary over time
    • affect recognition accuracy
  • Acquiring a dataset with the required lighting variation is impractical.
  • Indoor performance is affected more than we realize

4 of 29

Existing works

  • Difference of Gaussians (DoG) filters
    • Lighting is removed together with most texture information
  • EnlightenGAN [1]
    • Dark -> bright, or vice versa
  • RetinexNet [2]
    • Color constancy
  • DeepUPE [3]
    • GAN-based image enhancement

None of these methods generates images that explicitly maximise detection accuracy.

5 of 29

Training Data

Background images are patches from the PHOS dataset* (15 static scenes under 15 different lighting conditions).

The object is rendered with a random pose, and synthetic lighting is applied to the rendered object.

*https://sites.google.com/site/vonikakis/datasets

[Figure: example input and output training images]
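A minimal sketch of how such an (input, target) pair could be assembled, assuming a pre-rendered object crop with an alpha channel and a PHOS background patch; the random gain/gamma perturbation stands in for the synthetic lighting, and all names here are illustrative rather than the authors' exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_training_pair(background, object_rgba):
    """Composite a rendered object onto a PHOS background patch and return
    (input, target): the input gets a random synthetic lighting change,
    the target keeps the neutral lighting.

    background:  HxWx3 float array in [0, 1] (PHOS patch)
    object_rgba: HxWx4 float array in [0, 1] (rendered object + alpha)
    """
    rgb, alpha = object_rgba[..., :3], object_rgba[..., 3:]
    # Neutral composite = the lighting-normalized image the generator should reproduce.
    target = alpha * rgb + (1.0 - alpha) * background

    # Synthetic lighting on the rendered object: random gain and gamma
    # (a simple stand-in for the lighting augmentation described on the slide).
    gain = rng.uniform(0.4, 1.8)
    gamma = rng.uniform(0.5, 2.0)
    lit_rgb = np.clip(gain * rgb ** gamma, 0.0, 1.0)
    inp = alpha * lit_rgb + (1.0 - alpha) * background
    return inp, target

# Usage with dummy data (stand-ins for a real PHOS patch and render):
bg = rng.uniform(size=(128, 128, 3))
obj = rng.uniform(size=(128, 128, 4))
x, y = compose_training_pair(bg, obj)
```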

6 of 29

Our approach

  • Based on the Pix2Pix encoder-decoder architecture
  • Joint learning of image lighting normalization and object detection
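A minimal PyTorch-style sketch of how the joint objective could be assembled, assuming a pix2pix-style generator G, a discriminator D, a frozen feature network for the perceptual term, and a detection loss coming from an SSD head; the module stubs, weights, and exact loss terms are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the real networks (illustrative only).
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))            # pix2pix-style generator
D = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))  # global discriminator
feat = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1))         # feature net for the perceptual loss

def generator_loss(x_lit, y_neutral, det_loss, w=(1.0, 1.0, 0.1, 1.0)):
    """Joint objective: L1 + perceptual + adversarial + SSD detection loss.

    x_lit:     input image under strong lighting (N, 3, H, W)
    y_neutral: lighting-normalized target image  (N, 3, H, W)
    det_loss:  detection loss computed by the SSD head on G(x_lit)
    """
    y_hat = G(x_lit)
    l1 = F.l1_loss(y_hat, y_neutral)
    perc = F.l1_loss(feat(y_hat), feat(y_neutral))
    d_out = D(y_hat)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))  # fool D
    return w[0] * l1 + w[1] * perc + w[2] * adv + w[3] * det_loss

# Dummy forward pass to show the shapes involved.
x = torch.rand(2, 3, 64, 64)
y = torch.rand(2, 3, 64, 64)
loss = generator_loss(x, y, det_loss=torch.tensor(0.5))
loss.backward()
```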

7 of 29

Quantitative Results: Experiments on the test sets from two BOP datasets*

2D Object Detection Results

SSD with          | Toyota Light mAP | TUD Light mAP
DoG               | 0.20             | 0.36
EnlightenGAN [1]  | 0.29             | 0.43
RetinexNet [2]    | 0.28             | 0.62
DeepUPE [3]       | 0.29             | 0.47
baseline          | 0.27             | 0.18
DB-GAN            | 0.72             | 0.66

6D Object Pose Estimation Results

SSD with          | Toyota Light mAP w/o ICP | Toyota Light mAP w/ ICP | TUD Light mAP w/o ICP | TUD Light mAP w/ ICP
DoG               | 0.35                     | 0.37                    | 0.14                  | 0.19
EnlightenGAN [1]  | 0.30                     | 0.34                    | 0.157                 | 0.21
RetinexNet [2]    | 0.32                     | 0.36                    | 0.13                  | 0.19
DeepUPE [3]       | 0.34                     | 0.38                    | 0.12                  | 0.18
baseline          | 0.23                     | 0.32                    | 0.159                 | 0.155
DB-GAN            | 0.42                     | 0.44                    | 0.164                 | 0.25

*https://bop.felk.cvut.cz/home/

Loss Ablation Study: 2D detection on Toyota Light

Losses Used             | mAP
L1                      | 0.55
+ Perceptual            | 0.67
+ Global Discriminator  | 0.66
+ Local Discriminator   | 0.60
+ SSD Loss              | 0.72

8 of 29

Qualitative Results (2D)

9 of 29

Qualitative Results (6D)

[Figure: qualitative 6D pose comparison of EnlightenGAN, RetinexNet, DeepUPE and DB-GAN]

10 of 29

Toyota TrueBlue dataset

  • 11 scenes
  • 11 different color temperatures per scene
  • 5 objects
    • 2D bounding box annotations available
    • 3D object models also available
  • Checkerboard in the scene

11 of 29

Results: Toyota TrueBlue

Method                        | mAP
Baseline                      | 0.39
Baseline + color augmentation | 0.54
DB-GAN                        | 0.73

[Figure: qualitative comparison of Baseline vs. DB-GAN detections]

12 of 29

DB-GAN LIVE CAMERA DEMO

[Diagram: live camera feed, comparing the SSD output on raw frames with the SSD output on DB-GAN-processed frames]

13 of 29

DemoGrasp: Few-Shot Learning for Robotic Grasping with Human Demonstration

Pengyuan Wang*, Fabian Manhardt*, Luca Minciullo, Lorenzo Garattoni, Sven Meier, Nassir Navab and Benjamin Busam

Toyota Motor Europe, Technical University of Munich, Woven Core

*equal contribution

14 of 29

The robotic grasping problem

  • Layman definition
    • A robot can grasp an object if it can pick it up from a surface, hold it for some time and release the object safely in another location.
  • Motivation
    • Pick & place
    • Manufacturing/automation
    • Assistive robotics
  • Problem
    • How/where should the robot place its gripper for a given object?
    • How can it acquire this ability for new objects?

[1] https://www.toyota-global.com/innovation/partner_robot

15 of 29

Existing solutions: Model-based Grasping

  • 3D object models (CAD)
  • Grasping points are pre-assigned to the object by a technical person
  • The problem is solved by a 6D pose estimation solution: once the model is fitted to the scene, the grasping points can be retrieved for free

[Figures: YCB dataset [2]; DenseFusion [3]]

[2] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, A. M. Dollar. Yale-CMU-Berkeley dataset for robotic manipulation research, IJRR, 2017.

[3] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, S. Savarese. DenseFusion: 6D object pose estimation by iterative dense fusion, CVPR, 2019.

16 of 29

Model-based Grasping

New objects outside the dataset?

  • Collect a CAD model (3D scanner)
  • Assign grasping points manually
  • Training data needs to be collected for each object or generated by simulation
  • The pose estimation model needs to be re-trained for any new object

17 of 29

Existing solutions: Model-free Grasping

  • Dex-Net [4], GraspNet [5]
  • Grasp point proposals from neural networks
  • A large amount of real training data is needed
  • As we will see, the accuracy of this type of solution is far from ideal

[Figure: Dex-Net [4]]

[4] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. Aparicio Ojea, K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics, arXiv preprint, 2017.

[5] A. Mousavian, C. Eppner, D. Fox. 6-DOF GraspNet: Variational grasp generation for object manipulation, ICCV, 2019.

18 of 29


Idea

  • Human demonstration
    • a person shows the robot how to grasp a new object
  • Robot learns to transfer the demonstration to a behaviour it can perform and reproduce
  • Pros:
    • No expert knowledge needed
    • No object scanning
    • No re-training
  • Cons:
    • Inference was not real-time

19 of 29


Method overview

  • Learning Phase
    • Collect an RGB-D sequence
    • Segment both hand and object
    • Fuse the sequence into a hand-object mesh
  • Hand-Object Interaction
    • Separate the hand mesh from the object mesh
    • Perform shape completion on the object mesh
    • Fit a hand model to the hand mesh
    • Grasping points are defined based on the fingers' locations
  • Grasping
    • Load an RGB-D test scene
    • Match the object mesh to the scene
    • The hand mesh is matched as a result
    • Retrieve the grasping points

20 of 29

Method: Learning phase

Segmentation of Hand and Object

  • Mask R-CNN [7]
  • Binary cross-entropy for each class
  • Prevents inter-class competition (see the loss sketch below)

Hand-Object Interaction

  • Simultaneously track the hand and object together
  • The hand and object are reconstructed in a TSDF volume following KinectFusion [8]

[Figures: segmentation masks of hand and object; point cloud extracted from the fused TSDF volume]

[7] K. He, G. Gkioxari, P. Dollár, R. Girshick. Mask R-CNN, ICCV, 2017.

[8] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking, ISMAR, 2011.
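A minimal PyTorch sketch of the per-class binary cross-entropy idea (an independent sigmoid per class instead of a softmax across classes, so the hand and object masks do not compete); the tensor shapes and names are illustrative, not the exact Mask R-CNN head.

```python
import torch
import torch.nn.functional as F

# Per-pixel mask logits for 2 classes (hand, object): (N, C, H, W).
logits = torch.randn(1, 2, 28, 28)
# Binary ground-truth masks, one channel per class; a pixel may belong
# to both the hand and the (occluded) object without competing.
targets = torch.randint(0, 2, (1, 2, 28, 28)).float()

# Independent sigmoid + BCE per class: no softmax across classes,
# hence no inter-class competition.
loss = F.binary_cross_entropy_with_logits(logits, targets)

# Contrast: softmax cross-entropy would force each pixel to pick a
# single winning class.
# ce_loss = F.cross_entropy(logits, targets.argmax(dim=1))
```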

21 of 29

Method: Hand-Object Interaction (I)

Hand Pose Alignment

  • Hand mesh predicted from the RGB image leveraging [11]
  • Point cloud extracted from the partial hand reconstruction
  • ICP to tightly align both together (see the sketch below)

[Figure: hand mesh aligned with the fused point cloud]

[11] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. Black, I. Laptev, C. Schmid. Learning joint reconstruction of hands and manipulated objects, CVPR, 2019.
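A minimal sketch of this ICP alignment step using Open3D (the library choice and parameter values are assumptions; the paper does not specify them), aligning the predicted hand mesh, sampled to a point cloud, against the partial hand reconstruction.

```python
import numpy as np
import open3d as o3d

def align_hand_mesh(hand_mesh, partial_hand_points, voxel=0.005):
    """Rigidly align a predicted hand mesh to the partially reconstructed
    hand point cloud with point-to-point ICP.

    hand_mesh:           o3d.geometry.TriangleMesh (e.g. a MANO-style hand mesh)
    partial_hand_points: (N, 3) numpy array from the fused TSDF volume
    """
    # Sample the mesh surface so both sides are point clouds.
    source = hand_mesh.sample_points_uniformly(number_of_points=2048)
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(partial_hand_points)

    source = source.voxel_down_sample(voxel)
    target = target.voxel_down_sample(voxel)

    result = o3d.pipelines.registration.registration_icp(
        source, target,
        0.02,                # max correspondence distance: 2 cm search radius
        np.eye(4),           # initial transform: assume rough pre-alignment
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation   # 4x4 rigid transform
```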

22 of 29

Method: Hand-Object Interaction (II)

Additional Object Shape Completion

  • 3D U-Net [9] with skip connections
  • Fused TSDF volume as input
  • Predicts 64x64x64 voxels with binary classification scores (object vs. no object)
  • Focal loss [10] as the loss function, where Pos and Neg represent occupied and empty voxels and γ is set to 2 (see the formula below)

[9] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation, MICCAI, 2015.

[10] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár. Focal loss for dense object detection, ICCV, 2017.
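The focal-loss formula itself did not survive the slide export; the standard form from [10], specialised to occupied (Pos) and empty (Neg) voxels with γ = 2, reads roughly as follows (a reconstruction, so the weighting may differ slightly from the paper's exact variant):

```latex
L = -\sum_{i \in \mathrm{Pos}} (1 - p_i)^{\gamma} \log p_i
    \;-\; \sum_{j \in \mathrm{Neg}} p_j^{\gamma} \log (1 - p_j),
\qquad \gamma = 2,
```

where p_i is the predicted occupancy probability of voxel i.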

23 of 29

Method: Match Object Mesh to the Scene

Object Point Cloud Registration

  • PPF-FoldNet [12]
    • Collect point pair features in local patches (see the sketch below)
    • Generate encoded descriptors for the surrounding point pair features
  • Registration of the point cloud to the scene using RANSAC

[Figure: point pair feature between selected points m1 and m2, with difference vector d and normals n1, n2 [12]]

[12] H. Deng, T. Birdal, S. Ilic. PPF-FoldNet: Unsupervised learning of rotation invariant 3D local descriptors, ECCV, 2018.
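A minimal numpy sketch of the 4D point pair feature illustrated in the figure above, built from the two points m1, m2, the difference vector d, and the normals n1, n2; this is the classic PPF formulation used by PPF-based descriptors such as [12], written here for illustration only.

```python
import numpy as np

def angle(u, v):
    """Unsigned angle between two 3D vectors, in radians."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def point_pair_feature(m1, n1, m2, n2):
    """4D feature F(m1, m2) = (||d||, ang(n1, d), ang(n2, d), ang(n1, n2)),
    with d = m2 - m1."""
    d = m2 - m1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

# Example: two surface points with their normals.
f = point_pair_feature(
    np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
    np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
print(f)   # [distance, angle(n1, d), angle(n2, d), angle(n1, n2)]
```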

24 of 29

Method: Grasping Instruction Retrieval

  • 6D robotic gripper pose from the hand mesh
  • Grasp point from the midpoint between the index finger and thumb locations
  • Grasp direction from the wrist and the grasp point (a sketch follows below)

[Figure: gripper grasp instruction derived from the hand pose]
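A minimal numpy sketch of how the grasp point and approach direction could be derived from three hand keypoints (thumb tip, index tip, wrist); the exact keypoints and the gripper orientation convention are assumptions, not the paper's precise formulation.

```python
import numpy as np

def grasp_from_hand(thumb_tip, index_tip, wrist):
    """Derive a grasp point and approach direction from hand keypoints.

    Returns (grasp_point, approach_dir): the grasp point is the midpoint
    between thumb and index tip; the approach direction points from the
    wrist towards the grasp point.
    """
    grasp_point = 0.5 * (thumb_tip + index_tip)
    approach = grasp_point - wrist
    approach_dir = approach / np.linalg.norm(approach)
    return grasp_point, approach_dir

# Example with made-up keypoints (metres).
p, d = grasp_from_hand(np.array([ 0.02, 0.00, 0.30]),
                       np.array([-0.02, 0.00, 0.30]),
                       np.array([ 0.00, -0.10, 0.20]))
```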

25 of 29

Evaluation

Simulation Setup

  • Human Support Robot (HSR)
  • Four test objects: shampoo, drill, hole punch, cookie box
  • Simulation scene in Gazebo
  • Objects placed on a table with random position and rotation

26 of 29

Evaluation

Evaluation Metric

  • 15 trials for each object
  • 6DoF GraspNet [5] as the baseline
  • Success if the robot grasps the object and holds it for 5 seconds without dropping it

[Figure: evaluation results]

27 of 29

Evaluation

Real World Evaluation

  • Similar setup to the synthetic evaluation

[Figures: scene setup; test objects]

28 of 29

Evaluation

Evaluation Metric

  • For each object, a learning sequence is recorded from the side and from the top
  • 9 trials per learning sequence, 18 trials per object
  • 72 trials overall

[Figure: evaluation results]

29 of 29

Q&A