Virtual Data Generation for Complex Industrial Activity Recognition
12 December 2024
Outline
Background of the Challenge
Dataset Overview
Challenge Overview
Sample Notebook Walkthrough
Sample Submission File
Evaluation Criteria
Questions and Answers
Background of the Challenge
Emergence of Virtual Data:
Virtual data generation is especially important for factory activity recognition with wearable sensors, where real-world data is often limited and the activities are complex. By generating additional virtual data, we can reduce the effort of collecting real data across many activities and scenarios.
Key Technologies Enabling Virtual Data Generation:
Data Augmentation (a minimal sketch follows this list)
Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs)
Diffusion Models
Cross-domain generation (e.g., IMUTube, IMUGPT, …)
etc.
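As a minimal illustration of the first technique, classical data augmentation, the sketch below jitters and rescales a window of 3-axis accelerometer samples with NumPy. The window shape and noise parameters are arbitrary assumptions for the example, not values prescribed by the challenge.

```python
import numpy as np

def augment_accel_window(window, jitter_std=0.05, scale_range=(0.9, 1.1), seed=None):
    """Create a virtual copy of an accelerometer window (shape: [time, 3])
    by adding Gaussian jitter and a random per-axis scaling factor."""
    rng = np.random.default_rng(seed)
    jitter = rng.normal(0.0, jitter_std, size=window.shape)
    scale = rng.uniform(*scale_range, size=(1, window.shape[1]))
    return window * scale + jitter

# Example: turn one (placeholder) real window into three virtual variants.
real_window = np.random.randn(128, 3)
virtual_windows = [augment_accel_window(real_window, seed=i) for i in range(3)]
```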
Challenges in Generating High-Quality Virtual Data:
Poor source data quality
Generated data distributions that differ from the real data distribution
Dataset Overview
In this challenge, we use the acceleration data recorded at both wrists of the subjects in Scenario 1 of the OpenPack dataset.
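As a rough sketch of what loading this data might look like, the snippet below reads the two wrist streams with pandas and aligns them on a shared timestamp. The file paths and column names are illustrative assumptions, not the official OpenPack file layout; refer to the sample notebook for the actual loader.

```python
import pandas as pd

# Hypothetical paths and column names; replace with the actual OpenPack files for Scenario 1.
left = pd.read_csv("data/U0101-S0100/left_wrist_acc.csv")    # assumed: timestamp, acc_x, acc_y, acc_z
right = pd.read_csv("data/U0101-S0100/right_wrist_acc.csv")

# Align the two wrists on the shared timestamp so each row carries 6 acceleration channels.
both = pd.merge(left, right, on="timestamp", suffixes=("_left", "_right"))
print(both.head())
```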
Challenge Overview
Key Objective: Develop virtual data generation methods to improve Human Activity Recognition (HAR) using the OpenPack dataset.
Dataset Features: Acceleration data from both wrists of the subjects in Scenario 1.
Evaluation Metric: F1 score calculated on unseen test data using trained HAR models.
[Pipeline diagram: Raw data → Virtual data generation algorithm (the only part participants need to implement) → Virtual data → HAR model. A pseudocode sketch of this flow follows.]
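In pseudocode, that pipeline looks roughly like the function below. All three callables are hypothetical placeholders for the challenge components; only the first, the virtual data generator, is written by participants, while HAR training and F1 evaluation are fixed by the organizers.

```python
def run_challenge_pipeline(raw_data, test_data,
                           generate_virtual_data, train_har_model, evaluate_f1):
    """Hypothetical sketch of the challenge flow; the three callables are placeholders."""
    virtual_data = generate_virtual_data(raw_data)   # participants' contribution
    model = train_har_model(virtual_data)            # fixed HAR model training
    return evaluate_f1(model, test_data)             # F1 score on unseen test data
```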
Sample Notebook Walkthrough
Code Availability: Pre-configured Jupyter notebook on Google Colab for quick setup and execution.
Functionality Demonstrated:
Preparation
Use real data to generate virtual data
Use the generated data to improve HAR model performance
Submission code
Ease of Use: Intuitive notebook design allows participants to modify code and test their ideas.
[Notebook annotations: "Design your code here" marks the cell participants modify, and "Check the size of generated data" verifies the output size; a sketch of that check follows.]
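The size check can be reproduced with a few lines of standard-library Python. The `virtual` directory name and the 500 MB limit follow the submission rules described below; treat this as a rough sketch rather than the notebook's exact code.

```python
from pathlib import Path

def total_size_mb(directory="virtual"):
    """Sum the sizes of all files under the given directory, in megabytes."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()) / 1e6

size_mb = total_size_mb("virtual")
print(f"Generated data: {size_mb:.1f} MB")
assert size_mb <= 500, "Generated data exceeds the 500 MB submission limit."
```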
Sample Submission File
Submission Format: Participants must submit (1) a `.py` file containing their virtual data generation code, including the `custom_virtual_data_generation` function, and (2) the generated virtual data.
Required Details:
Keep the input and output of the `custom_virtual_data_generation` function unchanged (see the skeleton sketch below).
Save the virtual data in the correct format (see the format example below).
The total size of the generated data in the `virtual` directory must not exceed 500 MB.
Compatibility: The code must be executable in Google Colab, with output saved to the designated paths; participants may also develop and run their code on their own computers.
Don’t change the input
Example of the virtual data format (.csv file): each row holds the left-wrist acceleration, the right-wrist acceleration, and the activity label.
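A hedged sketch of writing one such CSV with pandas. The column names below (three acceleration axes per wrist plus an integer label) are assumptions inferred from the slide; match them to the sample files shipped with the starter notebook before submitting.

```python
from pathlib import Path

import numpy as np
import pandas as pd

n = 1000  # number of generated samples (arbitrary for this example)
virtual = pd.DataFrame({
    # Assumed 3-axis acceleration columns for each wrist.
    "acc_x_left": np.random.randn(n),
    "acc_y_left": np.random.randn(n),
    "acc_z_left": np.random.randn(n),
    "acc_x_right": np.random.randn(n),
    "acc_y_right": np.random.randn(n),
    "acc_z_right": np.random.randn(n),
    # Assumed integer operation label per sample.
    "label": np.random.randint(0, 10, size=n),
})

Path("virtual").mkdir(exist_ok=True)
virtual.to_csv("virtual/example_virtual_data.csv", index=False)
```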
Evaluation and Judging Criteria
F1 Score as Core Metric: Virtual data quality evaluated by improvements in HAR model performance on test data.
Testing Setup: The HAR model is trained on the generated data and tested with different random seeds on different test splits of the OpenPack dataset.
Fairness Measures: All algorithms evaluated under the same conditions to ensure comparability.
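For reference, an averaged F1 evaluation over several seeds can be sketched with scikit-learn as below. `train_har_model` and the test splits are hypothetical placeholders standing in for the organizers' fixed evaluation code, and the macro averaging is an assumption; the official scoring script defines the exact protocol.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate_over_seeds(train_har_model, virtual_data, test_splits, seeds=(0, 1, 2)):
    """Train on the generated data once per seed and average the macro F1 score
    over the given test splits. `train_har_model(data, seed)` must return a fitted
    model exposing `.predict(X)`; this mirrors, but is not, the official protocol."""
    scores = []
    for seed, (X_test, y_test) in zip(seeds, test_splits):
        model = train_har_model(virtual_data, seed=seed)
        y_pred = model.predict(X_test)
        scores.append(f1_score(y_test, y_pred, average="macro"))
    return float(np.mean(scores))
```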