CS60055: Ubiquitous Computing

Location, Gesture and Activity Sensing

Part IV: Inferring Activities from IMU

INDIAN INSTITUTE OF TECHNOLOGY

KHARAGPUR

Sandip Chakraborty

sandipc@cse.iitkgp.ac.in

Department of Computer Science and Engineering

What Have We Learnt So Far?

  • IMU sensors are noisy
    • The accelerometer measures total acceleration (motion plus gravity), so the gravity component must be separated out; however, the gravity estimate gets corrupted when the object is in motion
    • Integration makes the errors cumulative: orientation and location estimates become increasingly noisy
    • The magnetic north pole can be used as an anchor, but the compass reading is also not free from noise

  • Combine measurement logic with the application-enforced physics for gesture estimation
    • Statistical estimators/filters can help
    • There is a tradeoff between accuracy and real-time performance
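
The gravity-separation step mentioned above can be illustrated with a simple exponential low-pass filter. This is a minimal sketch, not the estimator used in the course: the function name `split_gravity` and the smoothing factor `alpha` are assumptions chosen for illustration.

```python
import numpy as np

def split_gravity(acc, alpha=0.9):
    """Split total acceleration (N x 3) into gravity and body components.

    Gravity is estimated with an exponential low-pass filter. alpha is an
    assumed smoothing factor, not a value taken from the lecture.
    """
    acc = np.asarray(acc, dtype=float)
    gravity = np.empty_like(acc)
    gravity[0] = acc[0]                       # initialise with the first reading
    for t in range(1, len(acc)):
        gravity[t] = alpha * gravity[t - 1] + (1.0 - alpha) * acc[t]
    body = acc - gravity                      # what remains is attributed to motion
    return gravity, body

# A device at rest: only gravity along z (m/s^2)
rest = np.tile([0.0, 0.0, 9.81], (50, 1))
gravity, body = split_gravity(rest)
```

When the device moves, part of the motion leaks into the low-pass output, which is the "polluted gravity" problem noted in the first bullet.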

Indian Institute of Technology Kharagpur

The Next Steps – Gesture to Activities

  • Combine the gestures to infer the activities

Cooking

Preparing poached egg

However, the activity label can be more precise depending on the downstream application

Human Activity Recognition (HAR)

  • Can use handcrafted statistical features from the gesture signatures
    • Mean, mode, median, kurtosis, etc.

  • However, these features need to be extracted and engineered properly
    • Note that there is a fundamental difference between image/video processing and physical sensor data processing
    • IMU data is not visually recognizable like image or video data, so data annotation and model development are tricky
    • How do you know which statistical features are most discriminative for a specific type of activity pattern?
    • Also, activity signatures vary a lot across demographics (age, gender, etc.)
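
As an illustration of the handcrafted features listed above, the following sketch computes a few per-axis statistics over one window of 3-axis data. The function name `window_features`, the choice of statistics and the synthetic input are assumptions for illustration only.

```python
import numpy as np

def window_features(window):
    """Per-axis statistics for one window of shape (time_steps, axes)."""
    window = np.asarray(window, dtype=float)
    mean = window.mean(axis=0)
    median = np.median(window, axis=0)
    std = window.std(axis=0)
    centred = window - mean
    # Kurtosis (non-excess): fourth central moment over squared variance
    kurt = (centred ** 4).mean(axis=0) / np.maximum(std ** 4, 1e-12)
    return np.concatenate([mean, median, std, kurt])

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 3))     # synthetic stand-in for accelerometer data
features = window_features(window)     # 4 statistics x 3 axes = 12 values
```

Which of these values actually separates two activities is exactly the feature-engineering question raised in the bullets above.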

  • Can we use vision models for HAR?
    • Take a matrix representation of the input: render the accelerometer data as an image and pass it to a vision-based model for activity classification
    • This may lose a lot of contextual information depending on the resolution and granularity of the data. What would be the optimal resolution for the input?
    • Also, data from multiple modalities (accelerometer, gyroscope, compass) needs to be combined to define the motion model properly

A Possible Solution: Use supervised models on the preprocessed IMU data for HAR classification

Deep Learning for HAR

  • Use the raw time-series data from the IMU (after the necessary preprocessing and filtering) as the input
    • No hand-crafted feature engineering is necessary

UCI-HAR dataset, six activity classes:

  • Walking
  • Walking upstairs
  • Walking downstairs
  • Sitting
  • Standing
  • Laying

  • 1D-CNN to predict the activity classes
    • Can learn from the raw time-series data directly (accelerometer and gyroscope)
    • Does not require domain expertise for manual feature engineering
  • Input: three time-series signals, each with three axes (3-DoF representation)
    • Total acceleration (gravitational acceleration + body acceleration)
    • Body acceleration
    • Body gyroscope
  • Window the input data: one sample is one window
    • 128 time-steps
  • Multi-class classification
    • Adam optimizer, categorical cross-entropy loss
    • Tune hyperparameters (kernel size, number of filters) to best fit the model
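
The windowing step and the forward pass of a small 1D-CNN can be sketched in plain NumPy. This is only an illustration under stated assumptions: the 50% window overlap (`step=64`), the 16 filters, the kernel size of 3 and the random, untrained weights are all made up here, and training with Adam and cross-entropy is not shown.

```python
import numpy as np

def make_windows(signal, size=128, step=64):
    """Cut a (time, channels) stream into overlapping windows.

    step=64 (50% overlap) is an assumption for illustration.
    """
    starts = range(0, len(signal) - size + 1, step)
    return np.stack([signal[s:s + size] for s in starts])

def conv1d_relu(x, kernels):
    """Valid 1D convolution over time followed by ReLU.

    x: (time, channels); kernels: (n_filters, kernel_size, channels).
    """
    n_filters, k, _ = kernels.shape
    out = np.empty((x.shape[0] - k + 1, n_filters))
    for t in range(out.shape[0]):
        # Each filter sees k consecutive time steps across all channels
        out[t] = np.tensordot(kernels, x[t:t + k], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

def predict_proba(x, kernels, dense):
    """Conv -> global max pooling -> softmax over the activity classes."""
    pooled = conv1d_relu(x, kernels).max(axis=0)
    logits = pooled @ dense
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 9))            # 3 signals x 3 axes, synthetic
windows = make_windows(stream)                 # shape: (n_windows, 128, 9)
kernels = rng.normal(size=(16, 3, 9)) * 0.1    # untrained, random weights
dense = rng.normal(size=(16, 6)) * 0.1         # 6 activity classes
probs = predict_proba(windows[0], kernels, dense)
```

In practice a deep learning framework would replace the hand-written convolution; the point here is only the data shapes: one 128-step window in, one probability per activity class out.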

Cons: You still need a significant amount of labelled data to train your model

Self-Supervised Learning (SSL) for HAR

  • Self-supervised Learning (SSL)
    • Does not rely on the labelled dataset for supervisory signals
    • Generates implicit labels from the unstructured data

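[Figure: SSL pipeline. Stage 1: a representation is learned through a pretext task. Stage 2: the representation feeds a classifier that outputs the activity label, e.g., Cooking.]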

Contrastive Learning

  • Pick an anchor: the target that you want to learn
    • Positive (X+): Samples that are similar to the anchor (similar semantic meaning)
    • Negative (X-): Samples that are different from the anchor

  • Pull the anchor and positive embeddings closer together, and push the anchor and negative embeddings farther apart

Loss Function:

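  • Triplet loss (standard form): $L = \max\left(0,\ \|f(x) - f(x^+)\|^2 - \|f(x) - f(x^-)\|^2 + \alpha\right)$, where $\alpha$ is the margin

  • Other loss functions are also available
    • Lifted Structured Loss (all pairs)
    • N-pair Loss (multiple negatives)
    • NCE (Noise Contrastive Estimation)
    • And so on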
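
A minimal sketch of the triplet loss in its commonly used form, assuming squared Euclidean distance between embeddings and a margin of 1.0 (both are illustrative choices, not values from the slides):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss on embedding vectors with squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
far_negative = np.array([3.0, 0.0])
near_negative = np.array([0.5, 0.0])

easy = triplet_loss(anchor, positive, far_negative)    # margin already satisfied
hard = triplet_loss(anchor, positive, near_negative)   # margin violated
```

The far negative contributes zero loss, while the near negative still produces a positive loss and therefore a gradient.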

  • How do you get the positive samples without labeling?
    • Positive: Pick a few anchors and apply augmentations to them
    • Negative: Everything else

  • Contrastive learning is popular in the image domain
    • Augmentation is easy
    • Well-defined semantic augmentations that do not change the meaning of the subjects
      • Flip
      • Rotate
      • Color distortion
      • Random crop
      • Zooming
      • Blur, and so on …

Slide credit: Prasenjit Karmakar

MNIST Representations

  • Learn an M-dimensional embedding vector for which the following properties hold
    • Two similar images are embedded closer together in the M-dimensional space than two dissimilar images, forming clusters
    • Any two such clusters should have minimal to no overlap (good for downstream classification)

Two dimensional MNIST representations

Slide credit: Prasenjit Karmakar

SimCLR

  • One of the simplest yet effective contrastive learning frameworks for vision-based classification

How do we apply data augmentation for physical sensing data?

Apply the laws of physics
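
One physically motivated augmentation for 3-axis IMU data is a rigid rotation of the sensor frame, which mimics wearing the device in a different orientation while preserving the magnitude of every sample. The sketch below is an illustrative example of such a manual transformation, not a method prescribed by the slides; the function names are hypothetical.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Random 3D rotation from a unit axis and an angle (Rodrigues' formula)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rotate_window(window, rotation):
    """Apply one rotation to every 3-axis sample in a (time, 3) window."""
    return window @ rotation.T

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 3))      # synthetic accelerometer window
R = random_rotation_matrix(rng)
augmented = rotate_window(window, R)
```

Because the rotation is rigid, the augmented window describes the same motion seen from a different sensor orientation, so it can serve as a positive for the original window.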

Contrastive Learning for HAR

Adapting the SimCLR framework for HAR

What transformations can we apply for the HAR data?

IMWUT 2022

Time Synchronous Multi Device Systems (TSMDS)

  • Multiple computing/sensor devices are worn on, affixed to or implanted in a person’s body
    • All devices observe a physical phenomenon (e.g., a user’s locomotion activity) simultaneously and record sensor data in a time-aligned manner

  • Assumptions
    • All sensor devices in the multi-device system share the same sensor sampling rate (or their data can be re-sampled to the same rate)
    • Multiple devices collect sensor data in a time-aligned manner (minor variations do not impact the performance significantly)
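
The first assumption (a shared sampling rate, possibly after re-sampling) can be sketched with linear interpolation onto a common time grid. The device names, rates and signals below are hypothetical, and linear interpolation is just one simple choice of re-sampling method.

```python
import numpy as np

def resample(timestamps, values, rate_hz, n_samples):
    """Linearly interpolate a (time, axes) signal onto a uniform grid."""
    grid = np.arange(n_samples) / rate_hz
    columns = [np.interp(grid, timestamps, values[:, k])
               for k in range(values.shape[1])]
    return grid, np.stack(columns, axis=1)

# Two hypothetical devices observing the same 2 s of motion at 50 Hz and 100 Hz
t_wrist = np.arange(100) / 50.0
t_chest = np.arange(200) / 100.0
wrist = np.stack([np.sin(t_wrist), np.cos(t_wrist), t_wrist], axis=1)
chest = np.stack([np.sin(t_chest), np.cos(t_chest), t_chest], axis=1)

# Bring both onto a shared 50 Hz grid so that row j is the same instant on each
grid, wrist_50 = resample(t_wrist, wrist, rate_hz=50, n_samples=90)
_, chest_50 = resample(t_chest, chest, rate_hz=50, n_samples=90)
```

After this step, window j on one device and window j on another cover the same time span, which is what the time-alignment assumption requires.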

Transformations for Data Augmentation

  • Rather than applying manual transformations (like rotation), can we utilize natural transformations that are already available in the sensor datasets?
    • Multiple sensors in a TSMDS capture the same activity from different views

Both indicate the same activity "Running"

These are natural transformations of each other!

  • Accelerometers and gyroscopes at different body parts capture the translational and rotational motions of those body parts
    • For an activity, the motions at different body parts are interlinked
    • Think about the motion patterns of your hands and legs when you are running!

Use such natural transformations to define a pretext task and perform contrastive learning

Challenge 1: Selecting Positive and Negative Samples

The data distributions of some devices will be very different from that of the chest device; these can serve as the "negative" samples!

How do we know which devices have different distributions?

Challenge 2: Which Samples Can be Used in CL?

Pattern gets very different

Collaborative Self-supervised Learning

  • We have D devices with time-aligned but unlabeled sensor datasets $\{X_i\}_{i=1}^{D}$
  • The dataset is pre-segmented into T windows
    • Each dataset $X_i$ contains T windows $\{x_i^1, x_i^2, \dots, x_i^T\}$
    • Each sensor sample $x_i^j$ is 6-DoF IMU data
  • Let $D^*$, one of the D devices, be the anchor device for which we want to train HAR prediction
  • Let $L^* = \{(x^*_1, y^*_1), \dots, (x^*_m, y^*_m)\}$ be a small pre-segmented labeled dataset from the anchor device with m windows ($m \ll T$)
    • $x^*$ are the sensor samples, $y^*$ are the labels
  • Objectives:
    • Use the unlabeled dataset $\{X_i\}_{i=1}^{D}$ to train a feature extractor f(.)
    • Use the pretrained f(.) to obtain the feature embeddings for the labeled anchor samples and subsequently train a HAR classifier g(.) that maps feature embeddings to the labels

Solution Steps

  • Initialize the feature extractor f(.) with random weights
  • Sample a batch 𝐵 of time-aligned unlabeled data from all the D devices
  • Device Selection: Decide which of the remaining devices will provide positive samples and negative samples for CL
  • Contrastive Sampling: For each anchor sample, decide which samples from the batches of positive and negative devices should be used for CL

  • Multi-view Contrastive Learning: The anchor, positive and negative sample(s) are fed to the feature extractor f(.) to generate feature embeddings
    • Use a 1D-CNN as the feature extractor
    • Compute the Multi-view Contrastive Loss to pull the positive embeddings towards the anchor and push the negative embeddings away from it

  • Repeat the above steps until the contrastive loss converges (and all the anchor samples in the batch have been processed)

  • Supervised Fine Tuning: Train a HAR classifier using f(.) and the labeled dataset from the anchor device
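
The final step can be sketched as fitting a linear softmax classifier g(.) on embeddings produced by a frozen f(.). Everything concrete here is an assumption for illustration: the embeddings are random stand-ins for the output of the pretrained extractor, the labels are synthetic, and the learning rate and epoch count are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifier(embeddings, labels, n_classes, lr=0.1, epochs=200):
    """Fit a linear softmax classifier g(.) on frozen embeddings."""
    n, dim = embeddings.shape
    W = np.zeros((dim, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(embeddings @ W)
        W -= lr * embeddings.T @ (probs - onehot) / n   # cross-entropy gradient
    return W

def cross_entropy(embeddings, labels, W):
    probs = softmax(embeddings @ W)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(0)
# Stand-in for f(x*): in practice these come from the pretrained extractor
embeddings = np.tanh(rng.normal(size=(60, 8)))
labels = rng.integers(0, 6, size=60)             # m = 60 labeled windows
W = train_classifier(embeddings, labels, n_classes=6)
loss_before = cross_entropy(embeddings, labels, np.zeros((8, 6)))
loss_after = cross_entropy(embeddings, labels, W)
```

Only the small classifier is trained on the labeled anchor data, which is why a few labeled windows (m much smaller than T) can be enough.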

Device Selection

  • 'Good' positive samples (𝑥+):
    • 𝑥+ should belong to the same label/class as the anchor sample 𝑥 
    • 𝑥+ should come from a device whose data distribution has a small divergence from that of the anchor device.

We do not have ground truth data, so how do we ensure this condition?

The catch: devices are time-aligned in a TSMDS, so all the devices observe the same class label in every time window

  • 'Good' negative samples (𝑥-):
    • 𝑥- should be a true negative (it belongs to a different class)
    • The most informative negative samples are those whose embeddings are initially close to the anchor embedding, so that 𝑓 needs to push them far apart
      • 𝑓 gets a strong supervisory signal from the data and more useful gradients during training if the embeddings are initially close to the anchor
      • Otherwise (if the embeddings are already far apart), 𝑓 receives a weaker supervisory signal, which may affect its convergence

Strictly enforcing this may not be possible

  • Increase the likelihood of selecting 'good' positive and negative samples
    • Compute the pairwise Maximum Mean Discrepancy (MMD) between the data samples from the anchor and each of the other devices
    • MMD: distance between the feature means -> a higher MMD implies a larger distance between the two distributions

  • Device Selection Policy:
    • Closest Positive: The device with the least MMD from the anchor is chosen as the positive (it was empirically observed that one positive device gives the best performance)
    • Weighted Negatives: All devices from the candidate set are used, but their feature contributions are weighted by the inverse of their MMD from the anchor
      • This satisfies the second requirement: negatives closer to the anchor get more weight
      • It also helps with the first requirement: other devices can supersede the training when a sample from one negative device belongs to the same class as the anchor
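
The device selection policy can be sketched as follows. This is a simplified illustration under stated assumptions: MMD is computed with a linear kernel (the plain distance between feature means), the negative weights are normalised to sum to one, the devices other than the chosen positive are treated as the negative candidates, and the device names and synthetic data are made up.

```python
import numpy as np

def mmd_linear(x, y):
    """MMD with a linear kernel: distance between the two feature means."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

def select_devices(anchor, others):
    """Pick the closest device as positive; weight the rest by 1 / MMD."""
    mmd = {name: mmd_linear(anchor, data) for name, data in others.items()}
    positive = min(mmd, key=mmd.get)
    # Assumption: the remaining devices form the negative candidate set
    inverse = {name: 1.0 / mmd[name] for name in mmd if name != positive}
    total = sum(inverse.values())
    weights = {name: w / total for name, w in inverse.items()}  # normalised
    return positive, weights

rng = np.random.default_rng(0)
# Synthetic per-device features; device names and offsets are made up
anchor = rng.normal(0.0, 1.0, size=(500, 6))
others = {
    "upper_arm": rng.normal(0.2, 1.0, size=(500, 6)),
    "thigh": rng.normal(1.0, 1.0, size=(500, 6)),
    "shoe": rng.normal(3.0, 1.0, size=(500, 6)),
}
positive, weights = select_devices(anchor, others)
```

With these synthetic offsets the upper-arm device is closest to the anchor and becomes the positive, while the thigh receives a larger negative weight than the shoe because its distribution is nearer to the anchor's.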

Contrastive Sampling

  • Decides which data samples should be picked from each device for contrastive training
  • Sampling policy
    • Synchronous positive samples: time-aligned counterparts from the positive device, corresponding to the anchors
    • Asynchronous negative samples: we need samples from a different class, and samples that are not time-synchronized with the anchor are "more likely" to come from a different class
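
The sampling policy can be sketched on a batch of time-aligned windows: the positive is the window at the anchor's own time index on the positive device, and the negatives are all windows at other time indices on the negative devices. The function name and the toy data are assumptions for illustration.

```python
import numpy as np

def contrastive_samples(t, positive_batch, negative_batches):
    """For the anchor window at index t, return its positive and negatives.

    positive_batch: (batch, ...) windows from the positive device.
    negative_batches: dict of device name -> (batch, ...) windows.
    """
    positive = positive_batch[t]                     # synchronous: same index
    negatives = {
        name: np.delete(batch, t, axis=0)            # asynchronous: all others
        for name, batch in negative_batches.items()
    }
    return positive, negatives

batch_size = 8
# Toy windows whose values encode their time index, to make the picks visible
positive_batch = np.arange(batch_size).reshape(-1, 1) * np.ones((1, 4))
negative_batches = {"thigh": positive_batch + 100.0}
positive, negatives = contrastive_samples(3, positive_batch, negative_batches)
```

The asynchronous negatives are only "more likely" to come from another class: if the user keeps doing the same activity across the batch, some of them are false negatives.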

Multi-view Contrastive Loss

  • Extension of the standard contrastive loss to multiple positive and negative samples:

    • sim(.) indicates cosine similarity
  • The loss function is minimized for each batch of data using stochastic gradient descent
    • Increases the cosine similarity between the anchor and positive samples (pulls them closer in the feature space)
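
A contrastive loss with several positives and weighted negative devices can be sketched as below. The InfoNCE-style form, the temperature `tau` and the per-device weights are assumptions made for this illustration and may differ from the exact formulation used in ColloSSL.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between vector a and each row of b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def multiview_contrastive_loss(anchor, positives, negatives, weights, tau=0.1):
    """InfoNCE-style loss with several positives and weighted negative devices.

    positives: (P, dim); negatives: list of (N_j, dim) arrays, one per device;
    weights: one weight per negative device. tau is an assumed temperature.
    """
    pos = np.exp(cosine_sim(anchor, positives) / tau).sum()
    neg = sum(w * np.exp(cosine_sim(anchor, n) / tau).sum()
              for w, n in zip(weights, negatives))
    return -np.log(pos / (pos + neg))

anchor = np.array([1.0, 0.0])
positives = np.array([[0.9, 0.1]])
near = [np.array([[0.8, 0.6]])]      # negative close to the anchor
far = [np.array([[-1.0, 0.0]])]      # negative far from the anchor
loss_near = multiview_contrastive_loss(anchor, positives, near, [1.0])
loss_far = multiview_contrastive_loss(anchor, positives, far, [1.0])
```

The negative that lies close to the anchor yields the larger loss, which matches the earlier observation that nearby negatives provide the stronger supervisory signal.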

ColloSSL Performance (Macro F1 Score)

The number in brackets indicates the amount of data used

Takeaways

  • IMUs can provide significant insights about human activities
    • Motion models can capture various gestures of different body parts well
    • One of the fundamental sensing modalities, used in almost every smart device

  • Statistical estimators/filters help in denoising IMU data and predicting gestures

  • Obtaining labeled data is costly for ubicomp applications
    • Supervised methods are good if sufficient labels are available (but they are scarce)
    • Unsupervised or semi-supervised/self-supervised methods may give reasonable performance with significantly less labeled data

Happy Learning!
