1 of 34

Gamifying Facial Emotion Recognition for Both Human Training and Machine Learning Data Collection

Yeonsun Yang¹

Ahyeon Shin¹

Nayoung Kim¹

Huidam Woo¹

John Joon Young Chung²

Jean Y. Song¹

Midjourney

2 of 34

Facial Emotion Recognition In the Real-world

Motivation

Spontaneous facial expressions are diverse, subjective, and ambiguous.

😀

Happy

😭

Sad

An example of in-the-wild FER datasets

3 of 34

The Impact of FER on Interactions

FER is important for both human-human interactions and human-machine interactions.

Motivation

# Family

# Workplace

# Law Enforcement

4 of 34

Who Needs FER Training?

Motivation

Clinical Populations

Professionals

General Populations

5 of 34

Training Interfaces to Enhance Human FER

Related Work

Learner

[Interface mock-up: Image 1/20; Emotion options: Happy, Sad, Disgust, Neutral; Feedback]

The inner corners of the eyebrows are angled downward

Micro Expressions Training Tool (METT)

Ekman et al., 2003

6 of 34

Training Interfaces to Enhance Human FER

Related Work

[Interface mock-up: Image 1/20; Emotion options: Happy, Sad, Disgust, Neutral; Feedback]

Micro Expressions Training Tool (METT)

Ekman et al., 2003

The inner corners of the eyebrows are angled downward

Limited to sign-based training
Limited to self-administered training
Tedious and repetitive sessions

Limitations

Sign-based explanation of emotion with action units

Facial expression images taken in controlled environments with action units

7 of 34

[Interface mock-up: Image 1/20; Emotion options: Happy, Sad, Disgust, Neutral]

Labeling Interfaces to Enhance Machine FER

Related Work

AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild

Mollahosseini et al., IEEE Transactions on Affective Computing (2017)

“Judgement-based approach”

Interprets facial expression based on how it is universally and heuristically perceived by a large common population

8 of 34

Large labeling error and bias
Limited to single-choice format
Tedious and repetitive sessions

[Interface mock-up: Image 1/20; Emotion options: Happy, Sad, Disgust, Neutral]

Labeling Interfaces to Enhance Machine FER

Related Work

Limitations

AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild

Mollahosseini et al., IEEE Transactions on Affective Computing (2017)

9 of 34

Research Goal: Building an Integrated Interface

Approach

Simultaneously addressing limitations

FER Training

Data Collection

10 of 34

To effectively address the challenges identified in current interfaces within a single application,

Emotion Categories

# of Responses

(1) Engaging and motivating groups of general populations

(2) Group learning through interaction with each other

(3) Aggregating all socially-agreed emotional judgments collected from user groups

Research Goal: Building an Integrated Interface

Approach

11 of 34

Emotion Categories

# of Responses

“Gamification”

Research Goal: Building an Integrated Interface

Approach

12 of 34

Design Probing

Approach

Based on iterative design probes (N=9):

[DG1] Enable diverse layers of interactions to support learning socially agreed-upon interpretations of emotions
Observational learning, real-time personalized feedback, and reflection

[DG2] Minimize the difficulty and effort required for labeling actions during the game
Breaking labeling actions into smaller units of work (i.e., binary labeling)

[DG3] Provide game rules and elements that are easy to learn
Using a mainstream game plot
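
The binary-labeling idea in DG2 can be illustrated with a minimal sketch. It is not the game's actual implementation: the emotion list, the function names, and the vote format (pairs of emotion and yes/no answer) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical emotion categories; the set used in the game may differ.
EMOTIONS = ["happy", "sad", "disgust", "neutral"]


def binary_questions(image_id, emotions=EMOTIONS):
    """Split one multi-class labeling task into one yes/no question per emotion."""
    return [(image_id, emotion) for emotion in emotions]


def aggregate_votes(votes):
    """Count the 'yes' answers per emotion from (emotion, answer) pairs."""
    counts = Counter()
    for emotion, answer in votes:
        if answer:
            counts[emotion] += 1
    return dict(counts)
```

Each player then answers a small yes/no unit rather than picking one of many categories, and the per-emotion counts form a label distribution for the image.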

13 of 34

Design Probing

Approach

“Mafia Game Plot”

Active interactions among users, such as observation, debating, and voting
Useful for redesigning to incorporate suggestions and guidelines from literature
Requiring little time to adapt to the system due to its familiar rules

Voting

Last defense

Advice

Game stages

25 of 34

Ground-truth measures^* (N=275)

Assessing FER scores (0~64 points)
Cutoff for classifying participants into the low FER group (40 points)
Providing a basis for judgment-based scoring

^*Japanese and Caucasian Facial Expressions of Emotion (JACFEE) and Neutral Faces (JACNeuF)

Matsumoto et al. (1988)

User study (N=59)

Classifying 22 participants as low FER group and 37 as ordinary group based on their pre-survey FER scores
Randomly dividing the low FER group into learner group (n=11) and control group (n=11)

Experiment

Evaluation Setup
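
The group assignment described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: the slide gives only the 40-point cutoff, so treating scores strictly below 40 as "low FER", the function name, and the fixed random seed are all assumptions.

```python
import random

LOW_FER_CUTOFF = 40  # from the slide; whether exactly 40 counts as "low" is an assumption


def split_groups(scores, cutoff=LOW_FER_CUTOFF, seed=0):
    """Assign participants to learner, control, and ordinary groups.

    scores: dict mapping participant id to pre-survey FER score (0 to 64).
    """
    low = sorted(p for p, s in scores.items() if s < cutoff)
    ordinary = sorted(p for p, s in scores.items() if s >= cutoff)
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    rng.shuffle(low)
    half = len(low) // 2
    return {"learner": low[:half], "control": low[half:], "ordinary": ordinary}
```

With 22 low-scoring participants, the random halves would contain 11 participants each, matching the learner and control group sizes reported above.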

27 of 34

Playing Find the Bot!

Plain labeling task

Learner group

(N=11)

Player group

(N=37)

Control group

(N=11)

Two 90-minute lab sessions

One 90-minute lab session

Pre-test to evaluate their FER scores
Playing Find the Bot! with a pre-matched team of one learner and three players
Post-test to assess improvements in their FER scores
Survey on user experience (GEQ, SUS, customized questions)

Pre-test to evaluate their FER scores
Labeling 200 in-the-wild facial expression images
Post-test to assess improvements in their FER scores

Experiment

Evaluation Setup

29 of 34

Game log data showed rich game interactions and a progression in emotional assumptions
Post-survey responses (GEQ, SUS, customized questions) indicated a motivating and engaging game design

Experiment

Evaluation of Game Design (for All Players)

30 of 34

Experiment

Improvement in Judgment-based FER (Learner vs. Control)

A larger increase in inter-rater reliability from pre-test to post-test responses in the learner group (k=0.078) than in the control group (k=0.02)
A trend in which only the learner group’s responses shifted toward judgment-based answers
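
The slide does not say which reliability statistic "k" denotes. As one possibility, the sketch below computes Fleiss' kappa, a common choice when several raters label the same items. The choice of statistic, the function name, and the input layout are assumptions.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a ratings table.

    table[i][j] is the number of raters who assigned item i to category j.
    Every item must have the same number of ratings (at least two).
    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    total = n_items * n_raters

    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]

    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Computing the statistic separately on pre-test and post-test responses and taking the difference would give an increase of the kind reported above.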

31 of 34

Experiment

Increase in Social Agreement of Collected Labels

The Gini coefficients of the label distributions are highly skewed toward 1 (i.e., responses concentrate on a few emotion categories)
Gini coefficients of responses are more skewed in the post-test than in the pre-test
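
For reference, a Gini coefficient over the vote counts of one image can be computed as in the sketch below. The exact formulation used in the study is not stated, so this standard discrete formula and the function name are assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 means perfectly even).

    With n categories, the maximum of this discrete form is (n - 1) / n,
    reached when every response falls into a single category.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Weight each count by its rank in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A value near the maximum means most players chose the same emotion for an image, which is how a skewed coefficient can be read as social agreement.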

32 of 34

Generalizability

The task can be answered in binary form (e.g., yes or no) and does not require open-ended responses.
The task can be broken down into the smallest units of work.
The task is simple enough to allow users to make judgments within a few seconds without a second thought.
The task embraces subjective responses, but the expected responses should have a high agreement rate.

Discussion

Findings from the Study

33 of 34

Guidelines for Games with a Purpose

Discussion

Findings from the Study

Identification of two important considerations in designing game interfaces to successfully engage and motivate participants

Benefits from consistent and attractive UI design
Usefulness of utilizing mainstream games

34 of 34

Yeonsun Yang¹

Ahyeon Shin¹

Nayoung Kim¹

Huidam Woo¹

John Joon Young Chung²

Jean Y. Song¹

Midjourney

Find the Bot!

Project page: https://github.com/diag-dgist/FindtheBot

Gamifying Facial Emotion Recognition for Both Human Training and Machine Learning Data Collection