1 of 57


Origami Sensei: Mixed reality AI-assistant for creative tasks using hands


2 of 57

Team

Advisors

  • School of Design: Dr Dina El-Zanfaly
  • Robotics Institute: Dr Kris Kitani

MSCV students

  • Richa Mishra
  • Qiyu Chen


3 of 57

Motivation


4 of 57

Challenges/Design decisions

  • How to decide the key states in the origami-making process?
  • How to handle transition states?
  • How to establish that state “n” of the process is completed and give instructions for making state “n+1”?
  • What is the best way to guide the user if they make a mistake?


5 of 57

[Slide figure: progress overview with levels 0, 100, and 30]

  • Data collected and annotated: collected and annotated videos
  • Data augmentation: augmented dataset
  • Image classification: accuracy on test set (Internet video) ~70%

https://apps.apple.com/us/app/how-to-make-origami/id472936700

6 of 57

Current setup

[Slide diagram: Classify → Give feedback]
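To make the "classify, then give feedback" loop concrete, here is a minimal sketch of how such a setup could be wired up. It assumes a hypothetical trained image classifier over origami step classes; the `STEPS` names, the ResNet-18 backbone, and the feedback strings are illustrative, not the project's actual implementation:

```python
import cv2
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Hypothetical step labels for one origami model; not the project's real classes.
STEPS = ["flat_square", "diagonal_fold", "petal_fold", "finished_crane"]

model = resnet18(num_classes=len(STEPS))   # placeholder weights; load trained ones in practice
model.eval()
preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classify_and_give_feedback(frame_bgr, expected_step):
    """Classify the current origami state and return a simple feedback message."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        probs = model(preprocess(rgb).unsqueeze(0)).softmax(dim=1)[0]
    predicted = STEPS[int(probs.argmax())]
    if predicted == expected_step:
        return f"Step '{predicted}' looks done - move on to the next fold."
    return f"Detected '{predicted}' but expected '{expected_step}' - please redo the last fold."
```

In the full system this would run per camera frame, with expected_step driven by the instruction sequence.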

7 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback


9 of 57

Paper 1: Anomaly Detection of Folding Operations for Origami Instruction with Single Camera (IEICE 2020)


Anomaly Detection of Folding Operations for Origami Instruction with Single Camera. Hiroshi Shimanuki, Toyohide Watanabe, Koichi Asakura, Hideki Sato, Taketoshi Ushiama. IEICE 2020.

10 of 57

History of this group

11 of 57

Pipeline


Input

Output

12 of 57

Pipeline


Input

Output

13 of 57

Manual instruction model construction


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

Predefined types of folds

Generate silhouette

14 of 57

Pipeline


Input

Output

15 of 57

Pipeline


Input

Output

16 of 57

Segmentation

  1. Input needs a specific set-up:
    1. Black table
    2. Yellow hand skin
    3. White-blue paper
  2. Clustering + color thresholding
    • → Extract background, hands, and paper regions (see the sketch below)

Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167
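As a rough illustration of what the clustering + color-threshold step does under the fixed set-up, here is a minimal OpenCV sketch. The HSV thresholds are illustrative guesses, not the values used by Watanabe and Kinoshita:

```python
import cv2
import numpy as np

def segment_frame(frame_bgr):
    """Split a frame into background / hands / paper masks, assuming the fixed
    set-up (black table, bare hands, white paper). Thresholds are illustrative."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)

    background = v < 50                                    # black table: low brightness
    paper = (v > 180) & (s < 60)                           # white paper: bright, unsaturated
    hands = (cv2.inRange(hsv, (0, 40, 60), (25, 255, 255)) > 0) & ~paper   # skin-hue band

    return {name: mask.astype(np.uint8) * 255
            for name, mask in [("background", background), ("hands", hands), ("paper", paper)]}

# Usage: masks = segment_frame(cv2.imread("frame.png"))
```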

17 of 57

Pipeline


Input

Output

18 of 57

Pipeline


Input

Output

19 of 57

Position & rotation estimation

  • Extract feature points

  • Filtering based on IoU

  • Match with silhouette vertices


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167
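A minimal sketch of the IoU-based matching idea from the bullets above; the silhouette dictionary and the filtering threshold are hypothetical stand-ins for the authors' procedure:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def match_state(paper_mask, silhouettes, iou_threshold=0.5):
    """Compare the observed paper region with pre-generated state silhouettes
    (rendered at candidate positions/rotations) and keep the best match."""
    scores = {name: mask_iou(paper_mask, sil) for name, sil in silhouettes.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= iou_threshold else (None, scores[best])
```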

20 of 57

Pipeline


Input

Output

21 of 57

Pipeline


Input

Output

22 of 57

State recognition + binary SVM mistake detection


https://appliedmachinelearning.wordpress.com/tag/hyperplane-svm/
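For concreteness, a minimal scikit-learn sketch of a binary SVM mistake detector over hand-crafted per-state features; the synthetic features below only stand in for the paper's descriptors:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for per-state fold features: label 1 = correct fold, 0 = mistake.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```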

23 of 57

Pipeline


Input

Output

24 of 57

Pipeline


Input

Output

25 of 57

Provide Instruction


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

26 of 57

Pipeline


Input

Output

27 of 57

State recognition performance (good)


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

28 of 57

SVM Performance (not very good)


29 of 57

Limitations

  1. Predefined table/paper set-up
    1. Fixed black table + yellow hands (no sleeves) + white paper


30 of 57

Limitations

  • Predefined table/paper set-up
    • Fixed black table + yellow hands (no sleeves) + white paper
  • Complexity of origami instruction:
    • No complex folds: “they are too complex for origami beginners to perform”
    • All manipulation happens in 2D (required for silhouette matching)
    • No tiny folds
  • Heavy preprocessing: feature extraction, filtering, and grouping before the ML algorithms

31 of 57

Pipeline Comparison

|                                       | Shimanuki et al. (2020)                                   | Our project                                                        |
| Fixed set-up (e.g. paper/table color) | Yes                                                       | No                                                                 |
| Automatic step recognition            | Yes                                                       | Yes                                                                |
| Uses neural networks                  | No                                                        | Yes                                                                |
| Origami step recognition method       | Silhouette IoU + argmax (limits the possible views/steps) | Multi-class classification network (see Paper 2)                   |
| Manual instruction construction       | Yes (built a software tool)                               | Yes                                                                |
| How to give instructions?             | Visual overlay                                            | Visual overlay (later via projector) + hand guidance (see Paper 3) |

32 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

33 of 57

Paper 2: Temporal Action Segmentation from Timestamp Supervision (CVPR 2021)


Temporal Action Segmentation from Timestamp Supervision. Zhe Li, Yazan Abu Farha, Juergen Gall. CVPR 2021.

34 of 57

Temporal Action Segmentation from Timestamp Supervision (CVPR 2021)


Temporal Action Segmentation from Timestamp Supervision. Zhe Li, Yazan Abu Farha, Juergen Gall. CVPR 2021.

35 of 57

Definition

  • Temporal Action Segmentation: predict frame-wise action labels for videos


36 of 57

Definition

  • Temporal Action Segmentation: predict frame-wise action labels for videos

  • Timestamp Supervision: only one frame is annotated from each action segment
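In symbols (notation chosen here for illustration): given video frames $x_1, \dots, x_T$, temporal action segmentation predicts a label $c_t \in \{1, \dots, C\}$ for every frame, while timestamp supervision provides only

\[
\{(t_1, c_{t_1}), \dots, (t_N, c_{t_N})\}, \qquad \text{one annotated frame per action segment, } N \ll T .
\]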


37 of 57

Motivation

“annotators need 6 times longer to annotate the start and end frame compared to annotating a single timestamp”


Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. SF-Net: Single-frame supervision for temporal action localization. In European Conference on Computer Vision (ECCV), 2020

38 of 57

Novelty

  • Train a temporal action segmentation model from timestamp supervision
    • Performance similar to the fully supervised approaches.

  • A novel confidence loss


39 of 57

Method


Input

Video

Timestamp Annotation

40 of 57

Method


Input

41 of 57

Method


42 of 57

Loss Definition

  • Classification Loss: frame-wise cross-entropy → classification accuracy

  • Smoothing Loss: mean squared difference between the predictions of adjacent frames → temporal smoothness
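For reference, the two standard terms can be written as follows; this follows the MS-TCN-style formulation that the paper builds on, with the truncation of large differences omitted for brevity:

\[
\mathcal{L}_{\mathrm{cls}} = \frac{1}{T} \sum_{t=1}^{T} -\log y_{t, c_t},
\qquad
\mathcal{L}_{\mathrm{smooth}} = \frac{1}{TC} \sum_{t=2}^{T} \sum_{c=1}^{C} \big(\log y_{t,c} - \log y_{t-1,c}\big)^2 ,
\]

where $y_{t,c}$ is the predicted probability of class $c$ at frame $t$ and $c_t$ is the (pseudo-)label of frame $t$.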

43 of 57

Loss Definition

  • (New) Confidence Loss: the model's confidence should decrease monotonically as the distance from the annotated timestamp increases
    • Boosts low-confidence regions near the timestamp
    • Suppresses outlier frames far from it
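A hedged sketch of the monotonicity constraint (the paper restricts the sums to frames between neighbouring timestamps; the normalization $Z$ and boundary handling are simplified here): for the $n$-th annotated timestamp $t_n$ with label $c_n$,

\[
\mathcal{L}_{\mathrm{conf}} = \frac{1}{Z} \sum_{n} \Big[ \sum_{t < t_n} \max\big(0,\, y_{t, c_n} - y_{t+1, c_n}\big) + \sum_{t \ge t_n} \max\big(0,\, y_{t+1, c_n} - y_{t, c_n}\big) \Big],
\]

so any increase in confidence while moving away from $t_n$ is penalized.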

44 of 57

Performance


45 of 57

Performance: comparable to fully supervised models

45

46 of 57

Performance: agnostic to the segmentation model


47 of 57

Relation to our project

  • Use timestamp supervision to reduce the time spent annotating our origami videos
  • Borrow ideas from its training method for our real-time step recognition model

48 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

49 of 57


Romero, Javier, Dimitrios Tzionas, and Michael J. Black. 2017. “Embodied Hands.” ACM Transactions on Graphics 36 (6): 1–17.

MANO: Parametric model for hands

Artist-defined hand mesh with joints and blend weights

Disentangles the pose and shape spaces
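For reference, a hedged sketch of the MANO mesh function in the SMPL-style notation the paper follows (dimensionalities and the exact pose-corrective formulation omitted):

\[
M(\boldsymbol{\beta}, \boldsymbol{\theta}) = W\big(\bar{T} + B_S(\boldsymbol{\beta}) + B_P(\boldsymbol{\theta}),\; J(\boldsymbol{\beta}),\; \boldsymbol{\theta},\; \mathcal{W}\big),
\]

where $\boldsymbol{\beta}$ are shape parameters, $\boldsymbol{\theta}$ pose parameters, $\bar{T}$ the artist-defined template mesh, $B_S$ and $B_P$ the shape and pose blend shapes, $J$ the joint regressor, and $W(\cdot)$ linear blend skinning with weights $\mathcal{W}$.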

50 of 57

Paper 3: RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video (SIGGRAPH Asia 2020)


Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

51 of 57

Motivation

Goal: tracking two interacting hands in real time from monocular RGB video

Challenges:
  • Self-occlusion
  • Self-similarity of hand parts
  • Depth-scale ambiguity

[Figure labels: key-point detection, dense 2D fitting, inter-hand and intra-hand distance]

52 of 57

Method

[Slide figure: RGB image → two-hand tracking → MANO pose and shape parameters]

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

53 of 57


Image fitting loss

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (12 2020).
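The slide shows the fitting energy only as a figure. As a rough, hedged schematic (term names chosen here, not the paper's notation), the MANO pose and shape parameters are obtained by minimizing a weighted sum of data terms derived from the predicted image cues plus regularizers:

\[
E(\boldsymbol{\theta}, \boldsymbol{\beta}) = \lambda_{\mathrm{kp}} E_{\mathrm{keypoint}} + \lambda_{\mathrm{dense}} E_{\mathrm{dense}} + \lambda_{\mathrm{depth}} E_{\mathrm{depth}} + E_{\mathrm{reg}},
\]

where $E_{\mathrm{reg}}$ collects pose/shape priors and hand-interaction constraints; the exact terms and weights are given in the paper.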

54 of 57

Results

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

55 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

56 of 57

Q & A


57 of 57

Thank you! Happy spring break!
