1 of 57


Origami Sensei: Mixed reality AI-assistant for creative tasks using hands


2 of 57

Team

Advisors

  • School of Design: Dr Dina El-Zanfaly
  • Robotics Institute: Dr Kris Kitani

MSCV students

  • Richa Mishra
  • Qiyu Chen


3 of 57

Motivation


4 of 57

Challenges/Design decisions

  • How to decide the key states in the origami-making process?
  • How to handle transition states?
  • How to establish that state “n” of the process is completed and give instructions for making state “n+1”?
  • What is the best way to guide the user if they make a mistake?


5 of 57

[Slide figure: progress overview with levels 0, 100, and 30]

  • Data collected and annotated: collected and annotated videos
  • Data augmentation: augmented dataset
  • Image classification: accuracy on test set (Internet video) ~70%

https://apps.apple.com/us/app/how-to-make-origami/id472936700

6 of 57

Current setup

[Slide diagram: Classify → Give feedback]
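To make the "classify, then give feedback" loop concrete, here is a minimal sketch of how such a setup could be wired up. It assumes a hypothetical trained image classifier over origami step classes; the `STEPS` names, the ResNet-18 backbone, and the feedback strings are illustrative, not the project's actual implementation:

```python
import cv2
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Hypothetical step labels for one origami model; not the project's real classes.
STEPS = ["flat_square", "diagonal_fold", "petal_fold", "finished_crane"]

model = resnet18(num_classes=len(STEPS))   # placeholder weights; load trained ones in practice
model.eval()
preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classify_and_give_feedback(frame_bgr, expected_step):
    """Classify the current origami state and return a simple feedback message."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        probs = model(preprocess(rgb).unsqueeze(0)).softmax(dim=1)[0]
    predicted = STEPS[int(probs.argmax())]
    if predicted == expected_step:
        return f"Step '{predicted}' looks done - move on to the next fold."
    return f"Detected '{predicted}' but expected '{expected_step}' - please redo the last fold."
```

In the full system this would run per camera frame, with expected_step driven by the instruction sequence.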

7 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback


9 of 57

Paper 1: Anomaly Detection of Folding Operations for Origami Instruction with Single Camera (IEICE 2020)


Anomaly Detection of Folding Operations for Origami Instruction with Single Camera. Hiroshi Shimanuki, Toyohide Watanabe, Koichi Asakura, Hideki Sato, Taketoshi Ushiama. IEICE 2020.

10 of 57

History of this group

11 of 57

Pipeline


Input

Output

12 of 57

Pipeline


Input

Output

13 of 57

Manual instruction model construction


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

Predefined types of folds

Generate silhouette

14 of 57

Pipeline


Input

Output

15 of 57

Pipeline


Input

Output

16 of 57

Segmentation

  1. Input needs a specific set-up:
    1. Black table
    2. Yellow hand skin
    3. White-blue paper
  2. Clustering + color thresholding
    • → Extract background, hands, and paper regions (see the sketch below)

Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167
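As a rough illustration of what the clustering + color-threshold step does under the fixed set-up, here is a minimal OpenCV sketch. The HSV thresholds are illustrative guesses, not the values used by Watanabe and Kinoshita:

```python
import cv2
import numpy as np

def segment_frame(frame_bgr):
    """Split a frame into background / hands / paper masks, assuming the fixed
    set-up (black table, bare hands, white paper). Thresholds are illustrative."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)

    background = v < 50                                    # black table: low brightness
    paper = (v > 180) & (s < 60)                           # white paper: bright, unsaturated
    hands = (cv2.inRange(hsv, (0, 40, 60), (25, 255, 255)) > 0) & ~paper   # skin-hue band

    return {name: mask.astype(np.uint8) * 255
            for name, mask in [("background", background), ("hands", hands), ("paper", paper)]}

# Usage: masks = segment_frame(cv2.imread("frame.png"))
```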

17 of 57

Pipeline


Input

Output

18 of 57

Pipeline


Input

Output

19 of 57

Position & rotation estimation

  • Extract feature points

  • Filtering based on IoU

  • Match with silhouette vertices


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167
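A minimal sketch of the IoU-based matching idea from the bullets above; the silhouette dictionary and the filtering threshold are hypothetical stand-ins for the authors' procedure:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def match_state(paper_mask, silhouettes, iou_threshold=0.5):
    """Compare the observed paper region with pre-generated state silhouettes
    (rendered at candidate positions/rotations) and keep the best match."""
    scores = {name: mask_iou(paper_mask, sil) for name, sil in silhouettes.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= iou_threshold else (None, scores[best])
```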

20 of 57

Pipeline


Input

Output

21 of 57

Pipeline


Input

Output

22 of 57

State recognition + binary SVM mistake detection


https://appliedmachinelearning.wordpress.com/tag/hyperplane-svm/
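For concreteness, a minimal scikit-learn sketch of a binary SVM mistake detector over hand-crafted per-state features; the synthetic features below only stand in for the paper's descriptors:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for per-state fold features: label 1 = correct fold, 0 = mistake.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```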

23 of 57

Pipeline


Input

Output

24 of 57

Pipeline


Input

Output

25 of 57

Provide Instruction


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

26 of 57

Pipeline


Input

Output

27 of 57

State recognition performance (good)


Folding support for beginners based on state estimation of Origami. Toyohide Watanabe and Yasuhiro Kinoshita. TENCON 2012. https://ieeexplore.ieee.org/document/6412167

28 of 57

SVM Performance (not very good)


29 of 57

Limitations

  1. Predefined table/paper set-up
    1. Fixed black table + yellow hands (no sleeves) + white paper


30 of 57

Limitations

  • Predefined table/paper set-up
    • Fixed black table + yellow hands (no sleeves) + white paper
  • Complexity of origami instruction:
    • No complex folds: “they are too complex for origami beginners to perform”
    • All manipulation happens in 2D (required for silhouette matching)
    • No tiny folds
  • Heavy preprocessing: feature extraction, filtering, and grouping before the ML algorithms

31 of 57

Pipeline Comparison

|                                       | Shimanuki et al. (2020)                                   | Our project                                                        |
| Fixed set-up (e.g. paper/table color) | Yes                                                       | No                                                                 |
| Automatic step recognition            | Yes                                                       | Yes                                                                |
| Uses neural networks                  | No                                                        | Yes                                                                |
| Origami step recognition method       | Silhouette IoU + argmax (limits the possible views/steps) | Multi-class classification network (see Paper 2)                   |
| Manual instruction construction       | Yes (built a software tool)                               | Yes                                                                |
| How to give instructions?             | Visual overlay                                            | Visual overlay (later via projector) + hand guidance (see Paper 3) |

32 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

33 of 57

Paper 2: Temporal Action Segmentation from Timestamp Supervision (CVPR 2021)


Temporal Action Segmentation from Timestamp Supervision. Zhe Li, Yazan Abu Farha, Juergen Gall. CVPR 2021.

34 of 57

Temporal Action Segmentation from Timestamp Supervision (CVPR 2021)


Temporal Action Segmentation from Timestamp Supervision. Zhe Li, Yazan Abu Farha, Juergen Gall. CVPR 2021.

35 of 57

Definition

  • Temporal Action Segmentation: predict frame-wise action labels for videos


36 of 57

Definition

  • Temporal Action Segmentation: predict frame-wise action labels for videos

  • Timestamp Supervision: only one frame is annotated from each action segment
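In symbols (notation chosen here for illustration): given video frames $x_1, \dots, x_T$, temporal action segmentation predicts a label $c_t \in \{1, \dots, C\}$ for every frame, while timestamp supervision provides only

\[
\{(t_1, c_{t_1}), \dots, (t_N, c_{t_N})\}, \qquad \text{one annotated frame per action segment, } N \ll T .
\]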


37 of 57

Motivation

“annotators need 6 times longer to annotate the start and end frame compared to annotating a single timestamp”


Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. SF-Net: Single-frame supervision for temporal action localization. In European Conference on Computer Vision (ECCV), 2020

38 of 57

Novelty

  • Train a temporal action segmentation model from timestamp supervision
    • Performance similar to the fully supervised approaches.

  • A novel confidence loss


39 of 57

Method


Input

Video

Timestamp Annotation

40 of 57

Method


Input

41 of 57

Method


42 of 57

Loss Definition

  • Classification Loss: frame-wise cross-entropy → classification accuracy

  • Smoothing Loss: mean squared difference between the predictions of adjacent frames → temporal smoothness
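For reference, the two standard terms can be written as follows; this follows the MS-TCN-style formulation that the paper builds on, with the truncation of large differences omitted for brevity:

\[
\mathcal{L}_{\mathrm{cls}} = \frac{1}{T} \sum_{t=1}^{T} -\log y_{t, c_t},
\qquad
\mathcal{L}_{\mathrm{smooth}} = \frac{1}{TC} \sum_{t=2}^{T} \sum_{c=1}^{C} \big(\log y_{t,c} - \log y_{t-1,c}\big)^2 ,
\]

where $y_{t,c}$ is the predicted probability of class $c$ at frame $t$ and $c_t$ is the (pseudo-)label of frame $t$.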

43 of 57

Loss Definition

  • (New) Confidence Loss: the model's confidence should decrease monotonically as the distance from the annotated timestamp increases
    • Boosts low-confidence regions near the timestamp
    • Suppresses outlier frames far from it
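A hedged sketch of the monotonicity constraint (the paper restricts the sums to frames between neighbouring timestamps; the normalization $Z$ and boundary handling are simplified here): for the $n$-th annotated timestamp $t_n$ with label $c_n$,

\[
\mathcal{L}_{\mathrm{conf}} = \frac{1}{Z} \sum_{n} \Big[ \sum_{t < t_n} \max\big(0,\, y_{t, c_n} - y_{t+1, c_n}\big) + \sum_{t \ge t_n} \max\big(0,\, y_{t+1, c_n} - y_{t, c_n}\big) \Big],
\]

so any increase in confidence while moving away from $t_n$ is penalized.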

44 of 57

Performance


45 of 57

Performance: comparable to fully supervised models

45

46 of 57

Performance: agnostic to the segmentation model


47 of 57

Relation to our project

  • Use timestamp supervision to reduce the time spent annotating our origami videos
  • Borrow ideas from its training method for our real-time step recognition model

48 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

49 of 57


Romero, Javier, Dimitrios Tzionas, and Michael J. Black. 2017. “Embodied Hands.” ACM Transactions on Graphics 36 (6): 1–17.

MANO: Parametric model for hands

Artist-defined hand mesh with joints and blend weights

Disentangles the pose and shape spaces
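For reference, a hedged sketch of the MANO mesh function in the SMPL-style notation the paper follows (dimensionalities and the exact pose-corrective formulation omitted):

\[
M(\boldsymbol{\beta}, \boldsymbol{\theta}) = W\big(\bar{T} + B_S(\boldsymbol{\beta}) + B_P(\boldsymbol{\theta}),\; J(\boldsymbol{\beta}),\; \boldsymbol{\theta},\; \mathcal{W}\big),
\]

where $\boldsymbol{\beta}$ are shape parameters, $\boldsymbol{\theta}$ pose parameters, $\bar{T}$ the artist-defined template mesh, $B_S$ and $B_P$ the shape and pose blend shapes, $J$ the joint regressor, and $W(\cdot)$ linear blend skinning with weights $\mathcal{W}$.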

50 of 57

Paper 3: RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video (SIGGRAPH Asia 2020)


Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

51 of 57

Motivation

Goal: tracking two interacting hands in real time from monocular RGB video

Challenges:
  • Self-occlusion
  • Self-similarity of hand parts
  • Depth-scale ambiguity

[Figure labels: key-point detection, dense 2D fitting, inter-hand and intra-hand distance]

52 of 57

Method

[Slide figure: RGB image → two-hand tracking → MANO pose and shape parameters]

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

53 of 57


Image fitting loss

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (12 2020).
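The slide shows the fitting energy only as a figure. As a rough, hedged schematic (term names chosen here, not the paper's notation), the MANO pose and shape parameters are obtained by minimizing a weighted sum of data terms derived from the predicted image cues plus regularizers:

\[
E(\boldsymbol{\theta}, \boldsymbol{\beta}) = \lambda_{\mathrm{kp}} E_{\mathrm{keypoint}} + \lambda_{\mathrm{dense}} E_{\mathrm{dense}} + \lambda_{\mathrm{depth}} E_{\mathrm{depth}} + E_{\mathrm{reg}},
\]

where $E_{\mathrm{reg}}$ collects pose/shape priors and hand-interaction constraints; the exact terms and weights are given in the paper.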

54 of 57

Results

Wang, Jiayi, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. ‘RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video’. ACM Transactions on Graphics (TOG) 39, no. 6 (December 2020).

55 of 57

Summary

  • Paper 1 (existing pipeline): inspiration for our pipeline
  • Paper 2 (step recognition): improve step recognition
  • Paper 3 (hand tracking): improve feedback

56 of 57

Q & A


57 of 57

Thank you! Happy spring break!
