1 of 44

Lilin Xu*, Kaiyuan Hou*, Xiaofan Jiang

Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

FMSys 2025

*Co-first author

2 of 44

LLMs in Sensor-based Applications

  • LLMs show exceptional generalization and reasoning capabilities

LLM

Healthcare

Assistive Systems

Monitoring

What day did I spend the longest eating last week?

You spent the most time exercising last week.

Where are my keys?

According to the cameras and detection results, your keys are in the bedroom.

My chest hurts when I take a deep breath after COVID-19…

Have you noticed any other symptoms, such as a cough or headache?

3 of 44

IMU-based Human Activity Recognition

  • IMU sensors are widely available and cost-effective, making them ideal for human activity recognition

Wearable Device with IMU Sensor

Accelerometer

Gyroscope

Portability and Ubiquity

Low Power Consumption

Privacy Protection

4 of 44

IMU-based Human Activity Recognition

  • IMU sensors are widely available and cost-effective, making them ideal for human activity recognition

Wearable Device with IMU Sensor

Accelerometer

Gyroscope

Portability and Ubiquity

Low Power Consumption

Privacy Protection

How to take advantage of LLMs in IMU-based HAR?

Contextual understanding

5 of 44

Current Solutions

  • Knowledge-Driven HAR with Pre-Trained LLMs
  • [FMSys’24] HARGPT
  • [SenSys-ML’24] LLMSense

Text

Textual descriptions of activities

Pretrained LLM

6 of 44

Current Solutions

  • Knowledge-Driven HAR with Pre-Trained LLMs
  • Modality Alignment between IMU and Text
  • [FMSys’24] HARGPT
  • [SenSys-ML’24] LLMSense

Text

Textual descriptions of activities

Semantic Gap

IMU Encoder

C moves the pen on the table

Text Encoder

Alignment

  • LLaSA
  • SensorLLM

Pretrained LLM

Pretrained LLM
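For intuition, here is a minimal sketch of the kind of contrastive alignment objective such systems use to close the semantic gap between IMU and text embeddings (a CLIP-style InfoNCE loss); the shapes, temperature, and loss form are generic assumptions, not details taken from LLaSA or SensorLLM.

```python
# Minimal sketch of CLIP-style contrastive alignment between paired IMU
# and text embeddings. Shapes, temperature, and loss form are generic
# assumptions, not details from LLaSA or SensorLLM.
import torch
import torch.nn.functional as F

def alignment_loss(imu_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (IMU, text) pairs."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = imu_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal matches
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```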

7 of 44

Current Solutions

  • Focus on coarse-grained activities (e.g., walking, sit-stand, jogging)

Coarse-grained Activity

Fine-grained Activity

8 of 44

  • Treat IMU data as text so that pretrained LLMs can be used (a post-training process)

Current Solutions

  • Focus on coarse-grained activities (e.g., walking, sit-stand, jogging)

Coarse-grained Activity

Fine-grained Activity

Pretrained LLM

Post-training Process

Pre-training Process

9 of 44

Preliminary

  • Data collection of handwritten letters
      • 26 letters (from ‘A’ to ‘Z’), two settings (flat-surface [2D] and mid-air [3D]), 10 repetitions
      • Data from one participant is used as the training set, while data from the other participant serves as the test set

Data Collection Setup and Process

Data Visualization

10 of 44

Preliminary - Microbenchmark

  • The recognition performance of LLMs with in-context learning
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Zero-shot: prompt LLMs using domain knowledge and Chain-of-Thought (CoT) reasoning
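A hypothetical zero-shot prompt in the spirit of this slide is sketched below; the wording is illustrative, not the paper's exact prompt.

```python
# Hypothetical zero-shot prompt: domain knowledge about the sensing setup
# plus a Chain-of-Thought instruction. Wording is illustrative, not the
# paper's exact prompt.
def build_zero_shot_prompt(imu_readings: str) -> str:
    return (
        "You are an expert in IMU-based human activity recognition.\n"
        "The data below comes from a wearable device with a 3-axis "
        "accelerometer and a 3-axis gyroscope, recorded while a user "
        "wrote one uppercase English letter (A-Z).\n"
        f"IMU readings (ax, ay, az, gx, gy, gz per row):\n{imu_readings}\n"
        "Think step by step about the stroke trajectory implied by the "
        "signal, then answer with exactly one letter."
    )
```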

11 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Zero-shot: prompt LLMs using domain knowledge and Chain-of-Thought (CoT) reasoning

12 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Zero-shot: prompt LLMs using domain knowledge and Chain-of-Thought (CoT) reasoning

13 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Few-shot: include ‘label-data’ pairs as examples in the prompt
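A hypothetical few-shot prompt builder along these lines follows; function names and wording are illustrative assumptions.

```python
# Hypothetical few-shot prompt: 'label-data' example pairs are prepended
# to the same instructions as the zero-shot prompt. Names and wording are
# illustrative assumptions.
def build_few_shot_prompt(examples: list[tuple[str, str]],
                          query_readings: str) -> str:
    shots = "\n\n".join(
        f"Example IMU readings:\n{data}\nLetter: {label}"
        for label, data in examples
    )
    return (f"{shots}\n\n"
            "Now classify the following recording. Think step by step, "
            "then answer with exactly one letter (A-Z).\n"
            f"IMU readings:\n{query_readings}")
```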

14 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Few-shot: include ‘label-data’ pairs as examples in the prompt

Same as the zero-shot prompt

15 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models

Few-shot: include ‘label-data’ pairs as examples in the prompt

16 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


17 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


  • All LLMs perform poorly, with accuracies falling below random guessing

Zero-shot

18 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


  • GPT-4o and DeepSeek-R1 can benefit from provided examples

Few-shot (2D Case)

19 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


  • GPT-4o and DeepSeek-R1 can benefit from provided examples

  • The small LLM (LLaMA-3-8B) fails to interpret this time-series classification task

Few-shot (2D Case)

20 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


  • All LLMs perform poorly on 3D letter recognition since mid-air gestures present additional complexities

Few-shot (3D Case)

21 of 44

Preliminary - Microbenchmark

  • The recognition performance of pretrained LLMs
      • In-context Learning: zero-shot and few-shot settings for prompts
      • Compare with traditional small classification models


  • All LLMs perform poorly on 3D letter recognition, since mid-air gestures present additional complexities

Few-shot (3D Case)

Pretrained LLMs cannot directly handle fine-grained HAR tasks

Expert Knowledge

Pretrained LLMs

22 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • Step 1: Generate Correct Reasoning
      • Step 2: Rephrase for “Discovery” Mode
      • Step 3: Prompt Versatility

23 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • Step 1: Generate Correct Reasoning
      • Step 2: Rephrase for “Discovery” Mode
      • Step 3: Prompt Versatility

Step 1

24 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • Step 1: Generate Correct Reasoning
      • Step 2: Rephrase for “Discovery” Mode
      • Step 3: Prompt Versatility

Reconstruct the reasoning answer

Step 1

Step 2

25 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • Step 1: Generate Correct Reasoning
      • Step 2: Rephrase for “Discovery” Mode
      • Step 3: Prompt Versatility

Reconstruct the reasoning answer

Convert the phrasing style

Step 1

Step 2

Step 3

26 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • Step 1: Generate Correct Reasoning
      • Step 2: Rephrase for “Discovery” Mode
      • Step 3: Prompt Versatility

Reconstruct the reasoning answer

Convert the phrasing style

Step 1

Step 2

Step 3
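A compact sketch of how these three steps could be chained is shown below; `llm` is a hypothetical text-completion helper and all prompt wording is assumed, so only the three-step structure mirrors the slides.

```python
# Sketch of the three-step instruction-response generation pipeline.
# `llm` is a hypothetical text-completion callable; the prompt wording is
# assumed. Only the three-step structure mirrors the slides.
def generate_pair(imu_text: str, label: str, llm) -> dict:
    # Step 1: generate correct reasoning, conditioning on the true label.
    reasoning = llm(
        f"Explain step by step why this IMU trace corresponds to the "
        f"letter '{label}':\n{imu_text}"
    )
    # Step 2: rephrase into "discovery" mode, so the response reads as if
    # the model inferred the letter instead of being told it.
    discovery = llm(
        "Rewrite this explanation so it derives the letter from the "
        f"signal rather than assuming it:\n{reasoning}"
    )
    # Step 3: vary the instruction phrasing for prompt versatility.
    instruction = llm(
        "Paraphrase: 'Which letter was written, given this IMU recording?'"
    )
    return {"instruction": f"{instruction}\n{imu_text}", "response": discovery}
```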

27 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • 1,560 instruction-response pairs

IMU data

28 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • 1,560 instruction-response pairs

IMU data

Reasoning

29 of 44

Experiment - Single Letter Prediction

  • We first generate an instruction-response dataset to fine-tune LLMs
      • 1,560 instruction-response pairs

IMU data

Classification Result
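For illustration, one instruction-response pair might look like the following; the field names and all values are assumptions, not samples from the actual dataset.

```python
# Illustrative shape of one instruction-response pair: IMU data in the
# instruction, reasoning plus the final letter in the response. All
# values are invented for illustration.
example_pair = {
    "instruction": "Which letter was written? IMU readings "
                   "(ax, ay, az, gx, gy, gz per row): ...",
    "response": "The trace shows two downward strokes joined by a "
                "horizontal segment, which matches the letter 'A'. "
                "Answer: A",
}
```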

30 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

[Bar charts: accuracies before fine-tuning are essentially zero (0%, with one setting at 0.38%)]
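A minimal sketch of the LLaMA-3-8B side of this setup using Hugging Face peft follows; the rank, alpha, and target modules are common defaults, not the paper's reported settings (GPT-4o fine-tuning would go through a hosted service and is not shown).

```python
# Minimal LoRA setup for LLaMA-3-8B with Hugging Face peft. Rank, alpha,
# and target modules are common defaults, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # freezes base weights
model.print_trainable_parameters()          # only low-rank adapters train
```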

31 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

Fine-tuning improves performance across both models

[Bar charts: accuracies before fine-tuning are essentially zero (0%, with one setting at 0.38%)]

32 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

Fine-tuning improves performance across both models

[Bar charts: ‘Before Fine-tuning’ bars are essentially zero (0%, with one setting at 0.38%)]

33 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

Fine-tuning improves performance across both models

[Bar charts: ‘Before Fine-tuning’ bars are essentially zero (0%, with one setting at 0.38%)]

34 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

Few-shot learning substantially improves accuracy

Fine-tuning improves performance across both models

[Bar charts: baselines before fine-tuning at 0% to 0.38%]

35 of 44

Experiment - Single Letter Prediction

  • We fine-tune LLaMA-3-8B and GPT-4o with LoRA

Recognition accuracy (2D)

Recognition accuracy (3D)

Performance on 3D data remains poor

Few-shot learning substantially improves accuracy

Fine-tuning improves performance across both models

[Bar charts: baselines before fine-tuning at 0% to 0.38%]

36 of 44

Experiment - Mid-Air Contextual Letter Series

  • An end-to-end mid-air gesture understanding pipeline based on LLMs

Mid-air gestures instead of flat-surface gestures

Contextual letter series instead of single letters

37 of 44

Experiment - Mid-Air Contextual Letter Series

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Includes a mapping stage and a classification stage

Mapping Stage

Classification Stage

Maps 3D IMU data to 2D representations that can be interpreted by LLMs

Uses the fine-tuned LLMs to recognize contextual letter series instead of single letters

38 of 44

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Mapping stage: map mid-air gestures to flat-surface gestures through deep metric learning

Framework of Similarity Estimator

Mapping Accuracy: 93.08%

Experiment - Mid-Air Contextual Letter Series
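A sketch of the deep-metric-learning idea behind the similarity estimator follows; the slide states only that deep metric learning is used, so the triplet formulation, shared encoder, and margin here are assumptions. At inference, a mid-air gesture would then be mapped to the flat-surface template with the highest embedding similarity.

```python
# Sketch of deep metric learning for the mapping stage: a shared encoder
# is trained so a mid-air (3D) gesture embeds close to the flat-surface
# (2D) recording of the same letter and far from other letters. The
# triplet formulation and margin are assumptions.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

def metric_step(encoder: nn.Module,
                anchor_3d: torch.Tensor,    # mid-air gesture
                positive_2d: torch.Tensor,  # same letter, flat surface
                negative_2d: torch.Tensor   # different letter, flat surface
                ) -> torch.Tensor:
    return triplet_loss(encoder(anchor_3d),
                        encoder(positive_2d),
                        encoder(negative_2d))
```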

39 of 44

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Classification stage: identify contextual letter series (words) instead of single letters
      • Experiment with fine-tuned LLaMA-3-8B

Experiment - Mid-Air Contextual Letter Series

‘s’

‘d’

‘v’

‘e’

‘save’

Single Letter

Contextual Letter Series

40 of 44

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Classification stage: identify contextual letter series (words) instead of single letters
      • Experiment with fine-tuned LLaMA-3-8B

Experiment - Mid-Air Contextual Letter Series

‘s’

‘d’

‘v’

‘e’

‘save’

Single Letter

Contextual Letter Series

1,500 common English nouns

Each letter is drawn 𝑘 times, where 𝑘 ∈ [2, 5]

Word length: 3–6 letters
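For intuition, one simple illustrative way word-level context can disambiguate noisy per-letter predictions is to intersect per-position letter candidates with the noun vocabulary, as sketched below; the pipeline itself relies on the fine-tuned LLM rather than this lookup.

```python
# Illustrative baseline for exploiting word-level context: intersect
# per-position letter candidates with a known vocabulary. The actual
# pipeline lets the fine-tuned LLM resolve the series directly.
from itertools import product

def decode_word(candidates: list[list[str]], vocab: set[str]) -> str | None:
    """candidates[i] holds the top letter predictions for position i."""
    for letters in product(*candidates):
        word = "".join(letters)
        if word in vocab:
            return word
    return None

# e.g. decode_word([["s", "g"], ["a", "o"], ["v", "u"], ["e", "c"]],
#                  {"save", "gate"}) returns "save"
```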

41 of 44

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Classification stage: identify contextual letter series (words) instead of single letters
      • Experiment with fine-tuned LLaMA-3-8B
        • 1,500 common English nouns
        • Each letter is drawn 𝑘 times, where 𝑘 ∈ [2, 5]

Recognition Example

Experiment - Mid-Air Contextual Letter Series

42 of 44

  • An end-to-end mid-air gesture understanding pipeline based on LLMs
      • Classification stage: identify contextual letter series (words) instead of single letters
      • Experiment with fine-tuned LLaMA-3-8B
        • 1,500 common English nouns
        • Each letter is drawn 𝑘 times, where 𝑘 ∈ [2, 5]

Recognition Example

Experiment - Mid-Air Contextual Letter Series

Accuracy on Contextual Letter Series

43 of 44

Conclusion

  • We explore the capabilities of LLMs for IMU-based fine-grained HAR through handwritten letter recognition

      • Measure LLMs’ capabilities in fine-grained HAR under different settings

      • Provide insights into how LLMs understand IMU data

      • Propose an end-to-end pipeline for mid-air gesture understanding

Fine-tuning helps LLMs understand the specific task

In-context examples improve LLM performance

Their contextual capabilities make LLMs promising for practical applications

44 of 44

Thank you!

lx2331@columbia.edu

Lilin Xu

Columbia University