1 of 12

EldCare: Real-Time Monitoring System for Safe and Independent Elderly Living using a Novel Vision-Based Framework with Spatio-Temporal Action Localization

Aryaman Khanna, Ryan Park, Shubhrangshu Debsarkar

Thomas Jefferson High School for Science and Technology

2 of 12

Introduction

  • Problem
    • Elderly population projected to double by 2050, reaching 1.5 billion globally (Figure 1)
      • Their population will soon exceed the number of people aged 10 to 24
    • Over 37 million severe falls occur each year
      • Falls are the leading cause of injury-related death among seniors
      • 87% of falls result in broken bones
      • 60% of those who do not receive care within one hour suffer lifelong issues
    • 3 million adults are treated in emergency departments for fall-related injuries yearly
    • COVID-19 limited the availability of elderly care centers and nursing homes

Figure 1: Predicted population by age group from 1950-2050

Figure 2: Picture of elderly person falling

3 of 12

Existing Work

Marker-based systems

  • Based on GPS and sensors
  • Large number of false alarms
  • Always-on-body devices are inconvenient for seniors

Figure 3: Marker-based fall detection bracelet

Current markerless systems

  • Capture motion in videos
  • Video streaming is computationally expensive
  • Current systems lack edge processing
  • Inaccurate in real-time scenarios

Figure 4: Markerless motion capture with pose estimation

4 of 12

Goal and Constraints

  • Goal: Develop a cost-effective markerless system using machine learning and computer vision to perform real-time action recognition for elderly monitoring with a camera and an Nvidia Jetson Nano.

This includes:

  • Pose estimation (foreground segmentation)
  • Action classification (spatio-temporal action localization)
  • Novel machine learning architecture (EfficientNet-GRU hybrid)
  • Constraints:
    • Cost under $80
    • Recognition accuracy above 90%
    • Low latency
    • Preservation of privacy

Figure 5: EldCare V1

Figure 6: EldCare V2

5 of 12

Methods - Summary of System

Figure 7: Flowchart demonstrating EldCare pose estimation and spatio-temporal action classification

Figure 8: EldCare Prototype using Nvidia Jetson Nano and webcam

Figure 9: Action Summary Interface and Live Action Classification Performed by iOS App

6 of 12

Methods - Data

  • Dataset consisted of 6 self-recorded video action classes: falling, standing, sitting, lying down, jumping jacks, and walking
    • 210 total videos, split into 146 training and 64 testing
    • 540x900 pixels
    • 3-4 seconds long
  • Online data
    • Videos contained multiple actions in one video
    • Poorly trimmed
    • Same person performed the actions in each video
  • Our dataset
    • Limited to one action per video
    • Realistic and varied angles that simulated real environments for seniors
    • Variation in the humans in each video
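
The 146/64 split above can be sketched as a simple shuffled hold-out. This is a minimal sketch, assuming one clip per file; the filenames, per-class clip count (35 per class, giving 210 total), and seed are illustrative, not the actual pipeline.

```python
import random

ACTIONS = ["falling", "standing", "sitting", "lying_down", "jumping_jacks", "walking"]

def split_dataset(video_ids, test_fraction=64 / 210, seed=0):
    """Shuffle clip identifiers and carve off a held-out test set
    (~30% of the data, matching the 64/146 split above)."""
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train, test)

# Example: 35 hypothetical clips per action class, 210 total.
clips = [f"{a}_{i:03d}.mp4" for a in ACTIONS for i in range(35)]
train, test = split_dataset(clips)
```

A fixed seed keeps the split reproducible across runs, so accuracy comparisons (e.g. with and without pose estimation) are made on the same held-out clips.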

7 of 12

Methods - Pose Estimation

  • Pose estimation allows precise tracking of human movement by detecting keypoints representing major joints
  • Pipeline consists of a two-step detector-tracker
    • Locates the person/pose in a region of interest (ROI)
    • Predicts pose landmarks and segmentation masks
  • Pose estimation uses a regression approach supervised by a combined heat-map/offset prediction of all keypoints (Fig. 10)
    • A combined heat-map and offset loss trains the center and left towers of the network
    • The heat map is then removed and the regression encoder is trained
  • The algorithm transfers only the “skeleton” of the person through background subtraction
    • Limits noise and variables for the action classifier model to process
    • Maintains privacy through foreground segmentation
  • The “skeleton” is created with an adaptation of a BlazePose detector, which predicts two virtual keypoints describing the human body’s center, rotation, and scale as a circle
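
The two virtual keypoints (body center plus a point fixing rotation and scale) imply a normalization step before the skeleton reaches the classifier. A minimal sketch under that assumption — the function name and coordinate convention are ours, not the BlazePose implementation:

```python
import math

def normalize_skeleton(keypoints, center, scale_point):
    """Re-express pose keypoints relative to the body-center circle
    predicted by the detector: translate to the center, then divide by
    the center-to-scale-point distance so the skeleton is invariant to
    the person's position and distance from the camera."""
    cx, cy = center
    sx, sy = scale_point
    radius = math.hypot(sx - cx, sy - cy)
    return [((x - cx) / radius, (y - cy) / radius) for x, y in keypoints]
```

Normalizing this way means the downstream classifier sees the same skeleton whether the person is near or far from the camera, which is one reason pose input reduces the noise the action model must handle.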

Figure 10: Tracking network architecture: regression with heatmap supervision

Figure 11: Data flow for real-time video capture and rendered output

8 of 12

Methods - Spatio-Temporal Action Localization

  • EfficientNet is a CNN architecture that uses compound scaling to jointly scale network depth, width, and input resolution
    • Outperforms state-of-the-art algorithms on ImageNet (Fig. 12)
  • EldCare takes a live video feed and localizes actions within space-time using a novel EfficientNet-GRU architecture (Fig. 13)
  • Detects human actions at the frame level and creates a frame linkage system over the last 5 frames
    • Inputs the pose estimation “skeletal structure” for each frame
    • GRU layers use the past 5 frames to recognize movement actions
    • GRU outperforms LSTM in speed
  • Supervised regression outperforms spatial region proposal (SRP) models
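
The 5-frame linkage system described above can be sketched as a sliding window that buffers pose skeletons until enough frames are available for the recurrent layers; the class name and interface are illustrative, not the deck's actual code:

```python
from collections import deque

class FrameLinker:
    """Maintain the last 5 pose-skeleton frames so a recurrent
    classifier (the GRU layers above) can see short-term motion."""

    def __init__(self, window=5):
        self.buffer = deque(maxlen=window)

    def push(self, skeleton):
        """Add one frame's skeleton; return True once the window is full
        and the classifier has enough temporal context to run."""
        self.buffer.append(skeleton)
        return len(self.buffer) == self.buffer.maxlen

    def window_tensor(self):
        """Oldest-to-newest list of skeletons, shape (window, keypoints, 2)."""
        return list(self.buffer)
```

Because `deque(maxlen=5)` discards the oldest frame automatically, the classifier always sees a contiguous 5-frame window, which is what lets the GRU distinguish, say, falling from lying down.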

Figure 12: Performance of EfficientNet and other state-of-the-art networks on the ImageNet database.

Figure 13: EfficientNet-GRU hybrid architecture.

9 of 12

Results

  • Algorithm performed better on our newly formed dataset when using pose estimation data (64 testing videos, 30% of the data)
    • 97.9% average accuracy with pose estimation
    • 92.3% average accuracy without pose estimation
  • Met the >90% accuracy engineering goal
  • 0.3 ms of real-time latency
  • Our model outperformed current markerless action classification techniques (Table 1)
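
Per-class accuracies of the kind a confusion matrix summarizes reduce to diagonal-over-row-sum; a minimal sketch, with the rows-are-true-class convention as an assumption:

```python
def per_class_accuracy(confusion):
    """Given a confusion matrix (rows = true class, columns = predicted
    class), return each class's accuracy: correct predictions on the
    diagonal divided by that class's total examples (the row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Total correct (trace) over total examples."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)
```

Reporting per-class numbers alongside the average guards against a strong overall score hiding a weak class, e.g. falling being confused with lying down.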

Figure 14: Confusion Matrix Representing our Model Classification Accuracy.

Figure 15: Accuracy of training and testing data plotted over iterations

Figure 16: Training and testing loss plotted over iterations

Table 1: Table comparison of accuracies of previous elderly care pose-detection models. EldCare displays the greatest accuracy.

10 of 12

Data Analysis

  • Data Interpretation
    • Utilizing the EfficientNet-GRU architecture, EldCare outperforms other algorithms on the HMDB51 and UCF101 datasets (Table 2).
    • Performs efficiently in real-time due to pose estimation and lightweight network
    • Markerless detection allows for the greatest accessibility and ease of use.
  • Problems
    • Limited dataset of 210 videos
    • Model tested only on videos containing a single person
    • Due to real-time detection demands, the Jetson Nano did not achieve a high FPS

Table 2: Table comparison of accuracies of various machine learning architectures on common publicly available action recognition datasets.

11 of 12

Conclusion

  • Summary
    • Action recognition system for elderly monitoring
    • Easy to manufacture and cost-effective at only $70
    • Low set-up time and ease of access through an app version
  • Impact
    • Can be used in homes for automated monitoring of seniors
    • Improves elderly care injury response times and limits the human error associated with marker-based systems
  • Future Work
    • Create a compact integrated system with a built-in camera for easier use
    • Add an alert system that notifies close family members or friends when a senior suffers an accident
    • Can be extended to track sleeping patterns and support healthcare monitoring

EldCare V1

EldCare V2

Future Goal

12 of 12

References

Buzzelli, M., Albé, A., & Ciocca, G. (2020). A Vision-based system for monitoring elderly people at home. Applied Sciences, 10(1). https://doi.org/10.3390/app10010374

Fan, K., Wang, P., Hu, Y., & Dou, B. (2017). Fall detection via human posture representation and support vector machine. International Journal of Distributed Sensor Networks, 13(5), 155014771770741. https://doi.org/10.1177/1550147717707418

Harrou, F., Zerrouki, N., Sun, Y., & Houacine, A. (2019). An Integrated Vision-Based Approach for Efficient Human Fall Detection in a Home Environment. IEEE Access, 7, 114966–114974. https://doi.org/10.1109/access.2019.2936320

Hellsten, T., Karlsson, J., Shamsuzzaman, M., & Pulkkis, G. (2021). The Potential of Computer Vision-Based Marker-Less Human Motion Analysis for Rehabilitation. Rehabilitation Process and Outcome. https://doi.org/10.1177/11795727211022330

Martinez, G. H., Kitani, K., & Bansal, A. (2019). OpenPose: Whole-Body Pose Estimation (Y. Sheikh, Ed.). Carnegie Mellon University. Retrieved January 18, 2022, from https://www.ri.cmu.edu/wp-content/uploads/2019/05/MS_Thesis___Gines_Hidalgo___latest_compressed.pdf

Oudah, M., Al-Naji, A., & Chahl, J. (2021). Computer Vision for Elderly Care Based on Deep Learning CNN and SVM. IOP Conference Series: Materials Science and Engineering, 1105(1), 012070. https://doi.org/10.1088/1757-899x/1105/1/012070

Pham, H.-H., Khoudour, L., Crouzil, A., Zegers, P., & Velastin, S. A. (2015). Video-based human action recognition using deep learning: A review. https://e-archivo.uc3m.es/bitstream/handle/10016/26542/videobased_2015.pdf

Song, L., Yu, G., Yuan, J., & Liu, Z. (2021). Human pose estimation and its application to action recognition: A survey. Journal of Visual Communication and Image Representation, 76, 103055. https://doi.org/10.1016/j.jvcir.2021.103055

Tang, J. (2020). MediaPipe Pose. MediaPipe. Retrieved January 6, 2022, from https://google.github.io/mediapipe/solutions/pose.html