CS60055: Ubiquitous Computing

Location, Gesture and Activity Sensing

Part IV: Inferring Activities from IMU

INDIAN INSTITUTE OF TECHNOLOGY

KHARAGPUR

Sandip Chakraborty

sandipc@cse.iitkgp.ac.in

Department of Computer Science and Engineering

What Have We Learnt So Far?

IMU sensors are noisy

The accelerometer measures total acceleration (gravity plus body motion), so the gravity component needs to be separated out; however, the gravity estimate gets corrupted when the object is in motion
Integration makes the errors cumulative: orientation and location estimates become noisy
Magnetic north can be used as an anchor, but it is also not free from noise

Combine measurement logic with the application-enforced physics for gesture estimation

Statistical estimators/filters can help
Tradeoff between accuracy and real-time performance

Indian Institute of Technology Kharagpur

The Next Steps – Gesture to Activities

Combine the gestures to infer the activities

Cooking

Preparing poached egg

However, the activity label can be more precise depending on the downstream application

Human Activity Recognition (HAR)

Can use handcrafted statistical features from the gesture signatures

Mean, mode, median, kurtosis, etc.

However, these features need to be extracted and engineered properly
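As a rough illustration of the handcrafted features mentioned above, the sketch below computes a few per-axis statistics for one window of IMU data. The window shape `(T, C)`, the function name `window_features`, and the synthetic input are assumptions made for the example; mode is left out because it is rarely meaningful for continuous sensor values.

```python
import numpy as np

def window_features(window):
    """Per-axis statistical features for one IMU window of shape (T, C)."""
    mean = window.mean(axis=0)
    median = np.median(window, axis=0)
    std = window.std(axis=0)
    centered = window - mean
    # Kurtosis (non-excess): fourth central moment over squared variance.
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    kurtosis = m4 / np.maximum(m2 ** 2, 1e-12)
    return np.concatenate([mean, median, std, kurtosis])

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 3))   # synthetic stand-in for a tri-axial accelerometer window
features = window_features(window)   # 4 statistics x 3 axes = 12 values
```

The resulting vector would then be fed to a conventional classifier; which statistics to keep is exactly the engineering question raised on this slide.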

Note that there is a fundamental difference between image/video processing and physical sensor data processing
IMU data is not visually recognizable like image or video data, so data annotation and model development are tricky
How do you determine which statistical features are most discriminative for a specific type of activity pattern?
Also, activity signatures vary considerably across demographic factors such as age and gender

Human Activity Recognition (HAR)

Can we use vision models for HAR?

Take a matrix representation of the input: build an image-like representation of the accelerometer data and pass it to a vision-based model for activity classification?
May lose a lot of contextual information depending on the resolution and granularity of the data – what would be the optimal resolution for the input?
Also, data from multiple modalities (accelerometer, gyroscope, compass) need to be combined to define the motion model properly

A Possible Solution: Use supervised models on the preprocessed IMU data for HAR classification

Deep Learning for HAR

Represent the raw time-series data (after the necessary preprocessing and filtering) from the IMU as your input

No hand-crafted feature engineering is necessary

UCI-HAR dataset, six activity classes:

Walking
Walking upstairs
Walking downstairs
Sitting
Standing
Laying

https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones

Deep Learning for HAR

1D-CNN to predict the activity classes

Can learn directly from the raw time-series data (accelerometer and gyroscope)
Does not require domain expertise for manual feature engineering

Input: three tri-axial time-series signals (3-DoF each, nine channels in total)

Total acceleration (gravitational acceleration + body acceleration)
Body acceleration
Body gyroscope

Window the input data – one sample is one window

128 time-steps
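The windowing step can be sketched as follows. The window length of 128 comes from the slide; the step of 64 samples (50% overlap), the nine-channel layout, and the function name `sliding_windows` are assumptions for illustration.

```python
import numpy as np

def sliding_windows(signal, window_len=128, step=64):
    """Split a (N, C) recording into overlapping windows of shape (W, window_len, C)."""
    n = signal.shape[0]
    if n < window_len:
        return np.empty((0, window_len, signal.shape[1]), dtype=signal.dtype)
    starts = np.arange(0, n - window_len + 1, step)
    return np.stack([signal[s:s + window_len] for s in starts])

# Hypothetical 9-channel recording with 320 time-steps.
recording = np.arange(320 * 9, dtype=float).reshape(320, 9)
windows = sliding_windows(recording)   # each row of `windows` is one training sample
```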

Multi-class classification

Adam optimizer, categorical cross entropy loss
Tune hyperparameters (kernel size, number of filters) to best fit the model
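To make the model structure concrete, here is a minimal NumPy sketch of the forward pass of a tiny 1D-CNN followed by the categorical cross-entropy loss. It is only an illustration under stated assumptions: one convolution layer, global max pooling, random weights, nine input channels, and no training loop (in practice a deep learning framework would supply the Adam optimizer and backpropagation). The names `conv1d`, `forward`, and `params` are hypothetical.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution. x: (T, C_in), kernels: (K, C_in, F), bias: (F,)."""
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        # Contract the (K, C_in) patch against every filter.
        out[t] = np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
    return out

def forward(x, params):
    h = np.maximum(conv1d(x, params["w_conv"], params["b_conv"]), 0.0)  # ReLU
    pooled = h.max(axis=0)                          # global max pooling over time
    logits = pooled @ params["w_out"] + params["b_out"]
    logits = logits - logits.max()                  # numerical stability
    return np.exp(logits) / np.exp(logits).sum()    # softmax over the classes

def categorical_cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

rng = np.random.default_rng(0)
kernel_size, n_filters, n_classes = 3, 16, 6        # tunable hyperparameters
params = {
    "w_conv": rng.normal(scale=0.1, size=(kernel_size, 9, n_filters)),
    "b_conv": np.zeros(n_filters),
    "w_out": rng.normal(scale=0.1, size=(n_filters, n_classes)),
    "b_out": np.zeros(n_classes),
}
window = rng.normal(size=(128, 9))                  # one synthetic input window
probs = forward(window, params)
loss = categorical_cross_entropy(probs, label=2)
```

Kernel size and number of filters appear here as plain variables, which is where the hyperparameter tuning mentioned above would act.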

Cons: You still need a significant amount of labelled data to train your model

Self-Supervised Learning (SSL) for HAR

Self-supervised Learning (SSL)

Does not rely on the labelled dataset for supervisory signals
Generate implicit labels from the unstructured data

[Figure: SSL pipeline. Labels: Pretext task, Representation, Classifier; example prediction: Cooking]

Contrastive Learning

Pick an anchor – the target that you want to learn

Positive (X+): Samples that are similar to the anchor (similar semantic meaning)
Negative (X-): Samples that are different from the anchor

Pull the anchor and positive embeddings closer together, and push the anchor and negative embeddings further apart

Contrastive Learning

Loss Function:

Triplet loss:

Other loss functions are also available:
Lifted Structured Loss (all pairs)
N-pair Loss (multiple negatives)
NCE (Noise Contrastive Estimation)
And so on …
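The formula for the triplet loss named above did not survive extraction. In its commonly used form it is L = max(0, ||f(x) − f(x⁺)||² − ||f(x) − f(x⁻)||² + m), where m is a margin hyperparameter. The sketch below evaluates this standard form on embeddings that are assumed to be already computed; the margin value and the example vectors are arbitrary.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss on embedding vectors (squared Euclidean distances)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])        # anchor embedding
p = np.array([0.0, 1.0])        # positive embedding, squared distance 1 from the anchor
n_far = np.array([3.0, 0.0])    # negative already far away: contributes no loss
n_near = np.array([1.0, 0.0])   # negative as close as the positive: contributes loss

loss_far = triplet_loss(a, p, n_far)
loss_near = triplet_loss(a, p, n_near)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only "hard" triplets produce gradients.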

Contrastive Learning

How do you get the positive samples without labeling?

Positive: Pick a few anchors and apply augmentations to them
Negative: Everything else

Contrastive learning is popular in the image domain

Augmentation is easy
Well-defined semantic augmentations that do not change the meaning of the subjects

Flip
Rotate
Color distortion
Random crop
Zooming
Blur, and so on …

Slide credit: Prasenjit Karmakar

MNIST Representations

Learn an M-dimensional embedding vector for which the following properties hold

Two similar images are embedded closer together in the M-dimensional space than two dissimilar images, forming clusters.
Any two such clusters should have minimal to no overlap (good for downstream classification)

Two-dimensional MNIST representations

Slide credit: Prasenjit Karmakar

SimCLR

One of the simplest yet most effective contrastive learning frameworks for vision-based classification

Gif source: https://medium.com/analytics-vidhya/eli5-a-simple-framework-for-contrastive-learning-of-visual-representations-20d9509d0a12

How do we apply data augmentation for physical sensing data?

Apply the laws of physics

Contrastive Learning for HAR

Adapting the SimCLR framework for HAR

What transformations can we apply for the HAR data?
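As one possible answer, the sketch below shows three manual transformations that are often tried on IMU windows: additive jitter, per-axis scaling, and a random 3D rotation. Treat it as an assumption-laden illustration: the function names and noise levels are hypothetical, and whether a given transformation preserves the activity label depends on the sensor placement and the activity.

```python
import numpy as np

def jitter(window, rng, sigma=0.05):
    """Add small Gaussian noise to every sample."""
    return window + rng.normal(scale=sigma, size=window.shape)

def scale(window, rng, sigma=0.1):
    """Multiply each axis by a random factor close to 1."""
    factors = rng.normal(loc=1.0, scale=sigma, size=(1, window.shape[1]))
    return window * factors

def random_rotation(window, rng):
    """Rotate a tri-axial (T, 3) window by a random rotation (Rodrigues' formula)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-np.pi, np.pi)
    kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    rot = np.eye(3) + np.sin(angle) * kx + (1.0 - np.cos(angle)) * (kx @ kx)
    return window @ rot.T

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 3))       # synthetic tri-axial window
view_a = random_rotation(window, rng)    # two augmented "views" of the same window
view_b = scale(jitter(window, rng), rng)
```

In a SimCLR-style setup, `view_a` and `view_b` would form a positive pair for the same anchor window.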

Contrastive Learning for HAR

IMWUT 2022

https://github.com/iantangc/ContrastiveLearningHAR

Time Synchronous Multi Device Systems (TSMDS)

Multiple computing/sensor devices are worn on, affixed to or implanted in a person’s body

All devices observe a physical phenomenon (e.g., a user’s locomotion activity) simultaneously and record sensor data in a time-aligned manner

Assumptions

All sensor devices in the multi-device system share the same sensor sampling rate (or their data can be re-sampled to the same rate)
Multiple devices collect sensor data in a time-aligned manner (minor misalignments do not impact the performance significantly)
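The first assumption can be met in software when devices sample at different rates. The sketch below resamples one device stream onto a uniform grid by linear interpolation; the choice of linear interpolation, the 100 Hz and 50 Hz rates, and the function name `resample_to_rate` are assumptions for the example. When downsampling real data, a low-pass filter would normally be applied first to limit aliasing.

```python
import numpy as np

def resample_to_rate(timestamps, values, target_hz):
    """Linearly interpolate (N,) timestamps and (N, C) values onto a uniform grid."""
    n_out = int(np.floor((timestamps[-1] - timestamps[0]) * target_hz)) + 1
    grid = timestamps[0] + np.arange(n_out) / target_hz
    resampled = np.column_stack(
        [np.interp(grid, timestamps, values[:, c]) for c in range(values.shape[1])]
    )
    return grid, resampled

t_wrist = np.arange(0, 2.0, 1 / 100)     # hypothetical 100 Hz wrist device, 2 seconds
x_wrist = np.column_stack([np.sin(t_wrist), np.cos(t_wrist), t_wrist])
grid, x_50hz = resample_to_rate(t_wrist, x_wrist, target_hz=50)
```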

Transformations for Data Augmentation

Rather than applying manual transformations (like rotation), can we utilize natural transformations that are already available in the sensor datasets?

Multiple sensors in a TSMDS capture the same activity from different views

Both indicate the same activity "Running"

These are natural transformations of each other!

Accelerometers and gyroscopes at different body parts capture the translational and the rotational motions of those body parts

For an activity, the motions at different body parts are interlinked
Think about the motion patterns of your hands and legs when you are running!

Use such natural transformations to define a pretext task and perform contrastive learning

Challenge 1: Selecting Positive and Negative Samples

The data distributions of some devices will be very different from that of the chest device; these can serve as the "negative" samples!

How do we know which devices have different distributions?

[Figure: candidate devices labelled Positive and Negative]

Challenge 2: Which Samples Can be Used in CL?

Pattern gets very different

Collaborative Self-supervised Learning

We have D devices with time-aligned but unlabeled sensor datasets {X_i}_{i=1}^{D}
Each dataset is pre-segmented into T windows

Each dataset X_i contains T windows {x_i^1, x_i^2, …, x_i^T}
Each sensor sample x_i^j is one window of 6-DoF IMU data (3-axis accelerometer and 3-axis gyroscope)

Let D* ∈ {1, …, D} be the anchor device for which we want to train HAR prediction
Let L* = {(x_*^1, y_*^1), …, (x_*^m, y_*^m)} be a small pre-segmented labeled dataset from the anchor device with m windows (m << T)

x_* are the sensor samples, y_* are the labels

Objectives:

Use the unlabeled dataset {X_i}_{i=1}^{D} to train a feature extractor f(.)
Use the pretrained f(.) to obtain the feature embeddings for the labeled anchor samples, and subsequently train a HAR classifier g(.) that maps the feature embeddings to the labels

Solution Steps

Initialize the feature extractor f(.) with random weights
Sample a batch 𝐵 of time-aligned unlabeled data from all the D devices
Device Selection: Decide which of the remaining devices will provide positive samples and negative samples for CL
Contrastive Sampling: For each anchor sample, decide which samples from the batches of the positive and negative devices should be used for CL

Multi-view Contrastive Learning: The anchor, positive and negative sample(s) are fed to the feature extractor f(.) to generate feature embeddings

Use 1D-CNN as the feature extractor
Compute the Multi-view Contrastive Loss to pull the positive embeddings towards the anchor and push the negative embeddings away from the anchor

Repeat the above steps until the contrastive loss converges (processing all the anchor samples in each batch)

Supervised Fine Tuning: Train a HAR classifier using f(.) and the labeled dataset from the anchor device

Device Selection

'Good' positive samples (𝑥⁺):

𝑥⁺ should belong to the same label/class as the anchor sample 𝑥
𝑥⁺ should come from a device whose data distribution has a small divergence from that of the anchor device.

We do not have ground truth data, so how do we ensure this condition?

The Catch: Devices are time-aligned in TSMDS. So, all the devices observe the same class label at every time window

'Good' negative samples (𝑥⁻):

𝑥⁻ should be a true negative (belongs to a different class)
The most informative negative samples are those whose embeddings are initially close to the anchor embeddings, so that 𝑓 needs to push them far apart.

𝑓 gets a strong supervisory signal from the data and more useful gradients during training if the embeddings are initially close to the anchor
Otherwise (if the embeddings are already far apart), 𝑓 will receive a weaker supervisory signal, which may affect its convergence

Strictly enforcing this may not be possible

Device Selection

Increase the likelihood of selecting 'good' positive and negative samples

Compute pairwise Maximum Mean Discrepancy (MMD) between the data samples from the anchor and other devices
MMD: Distance between the mean feature embeddings of two distributions; a higher MMD implies a larger distance between the two distributions

Device Selection Policy:

Closest Positive: The device with the smallest MMD from the anchor is chosen as the positive device (empirically, a single positive device gives the best performance)
Weighted Negatives: All devices in the candidate set are used as negatives, but their contributions are weighted by the inverse of their MMD from the anchor. This satisfies the second requirement, since negatives closer to the anchor get more weight. It also helps with the first requirement: when one negative device happens to contribute a sample of the same class as the anchor, the other devices can still dominate the training signal
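The device selection policy above can be sketched as follows. Several details are assumptions rather than something stated on the slide: the RBF kernel and its bandwidth `gamma`, the biased MMD estimator, normalizing the inverse-MMD weights to sum to one, and keeping the positive device inside the negative candidate set. The device names and synthetic features are purely illustrative.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x (n, d) and y (m, d)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def select_devices(anchor, candidates, gamma=1.0):
    """candidates: dict of device name -> (n, d) samples. Returns (positive, negative weights)."""
    dists = {name: mmd2(anchor, feats, gamma) for name, feats in candidates.items()}
    positive = min(dists, key=dists.get)                  # closest positive
    inverse = {name: 1.0 / (d + 1e-8) for name, d in dists.items()}
    total = sum(inverse.values())
    weights = {name: v / total for name, v in inverse.items()}  # weighted negatives
    return positive, weights

rng = np.random.default_rng(0)
anchor = rng.normal(loc=0.0, size=(200, 4))               # e.g. the chest device
candidates = {
    "upper_arm": rng.normal(loc=0.1, size=(200, 4)),      # similar distribution
    "thigh": rng.normal(loc=2.0, size=(200, 4)),          # very different distribution
}
positive, weights = select_devices(anchor, candidates)
```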

Contrastive Sampling

Decides which data samples should be picked from each device for contrastive training
Sampling policy:

Synchronous positive samples: Time-aligned counterparts of the anchor samples, taken from the positive device
Asynchronous negative samples: We need samples from a different class; samples that are not time-synchronized with the anchor (and its positive) are "more likely" to come from a different class
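A minimal sketch of this sampling policy, assuming a batch stored as a dictionary from device name to an array of shape `(B, T, C)` whose first axis is the shared time index. Using every other time index in the batch as a negative is an assumption of the sketch, and the device names are hypothetical.

```python
import numpy as np

def contrastive_samples(batch, anchor_dev, positive_dev, negative_devs, t):
    """Pick samples for the anchor window at batch index t."""
    anchor = batch[anchor_dev][t]
    positive = batch[positive_dev][t]                    # synchronous: same time index
    mask = np.arange(batch[anchor_dev].shape[0]) != t    # asynchronous: all other indices
    negatives = {dev: batch[dev][mask] for dev in negative_devs}
    return anchor, positive, negatives

B = 4
time_index = np.arange(B).reshape(B, 1, 1) * 10.0
# Toy batch: every value encodes (time index * 10 + device id) so picks are easy to check.
batch = {dev: np.full((B, 2, 1), float(k)) + time_index
         for k, dev in enumerate(["chest", "upper_arm", "thigh"])}

anchor, positive, negatives = contrastive_samples(
    batch, anchor_dev="chest", positive_dev="upper_arm", negative_devs=["thigh"], t=1
)
```

Asynchronous negatives are only likely, not guaranteed, to belong to a different class, which is why the device weighting matters.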

Multi-view Contrastive Loss

Extension of standard contrastive loss for multiple positive and negative samples:

sim(.) indicates cosine similarity

The loss function is minimized for each batch of data using stochastic gradient descent

Minimizing the loss increases the cosine similarity between the anchor and the positive samples (pulling them closer in the feature space)
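The loss equation itself did not survive extraction, so the sketch below should not be read as the exact formula from the slide. It shows one plausible InfoNCE-style form under stated assumptions: cosine similarity scaled by a temperature `tau`, positives in the numerator, and per-device weighted negatives added in the denominator. The function names, `tau`, and the weights are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def multiview_contrastive_loss(z_anchor, z_pos, z_negs, neg_weights, tau=0.1):
    """z_pos: list of positive embeddings; z_negs: dict device -> (K, d) negative embeddings."""
    pos = sum(np.exp(cosine_sim(z_anchor, z) / tau) for z in z_pos)
    neg = sum(
        neg_weights[dev] * sum(np.exp(cosine_sim(z_anchor, z) / tau) for z in embs)
        for dev, embs in z_negs.items()
    )
    return -np.log(pos / (pos + neg))

z_anchor = np.array([1.0, 0.0])
z_negs = {"thigh": np.array([[0.0, 1.0], [-1.0, 0.0]])}
neg_weights = {"thigh": 1.0}

# A positive aligned with the anchor gives a much smaller loss than an orthogonal one.
loss_aligned = multiview_contrastive_loss(z_anchor, [np.array([1.0, 0.1])], z_negs, neg_weights)
loss_orthogonal = multiview_contrastive_loss(z_anchor, [np.array([0.0, 1.0])], z_negs, neg_weights)
```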

ColloSSL Performance (Macro F1 Score)

The number in brackets indicates the amount of data used

Takeaways

IMU can provide significant insights about human activities

Motion models can capture various gestures of different body parts well
One of the fundamental sensing modalities, used in almost every smart device

Statistical estimators/filters help denoise IMU data and predict gestures

Obtaining labeled data is costly for ubicomp applications

Supervised methods are good if sufficient labels are available (but they are scarce)
Unsupervised or semi-supervised/self-supervised methods may give reasonable performance with significantly less labeled data

Happy Learning!
