1 of 12

EldCare: Real-Time Monitoring System for Safe and Independent Elderly Living using a Novel Vision-Based Framework with Spatio-Temporal Action Localization

Aryaman Khanna, Ryan Park, Shubhrangshu Debsarkar

Thomas Jefferson High School for Science and Technology

2 of 12

Introduction

  • Problem
    • Elderly population projected to double by 2050, reaching 1.5 billion globally (Figure 1)
      • Their population will soon exceed the number of people aged 10 to 24
    • Over 37 million severe falls occur each year
      • Falls are the leading cause of injury-related death among seniors
      • 87% of falls result in broken bones
      • 60% of those who do not receive care within one hour suffer lifelong issues
    • 3 million adults are treated in emergency departments for fall-related injuries yearly
    • COVID-19 limited the availability of elderly care centers and nursing homes

Figure 1: Predicted population by age group from 1950-2050

Figure 2: Picture of elderly person falling

3 of 12

Existing Work

Marker-based systems

  • Based on GPS and sensors
  • Large number of false alarms
  • Always-on-body devices are inconvenient for seniors

Figure 3: Marker-based fall detection bracelet

Current markerless systems

  • Capture motion in videos
  • Video streaming is computationally expensive
  • Current systems lack edge processing
  • Inaccurate in real-time scenarios

Figure 4: Markerless motion capture with pose estimation

4 of 12

Goal and Constraints

  • Goal: Develop a cost-effective markerless system using machine learning and computer vision to perform real-time action recognition for elderly monitoring with a camera and an Nvidia Jetson Nano.

This includes:

  • Pose estimation (foreground segmentation)
  • Action classification (spatio-temporal action localization)
  • Novel machine learning architecture (EfficientNet-GRU hybrid)
  • Constraints:
    • Cost under $80
    • Recognition accuracy above 90%
    • Low latency
    • Preservation of privacy

Figure 5: EldCare V1

Figure 6: EldCare V2

5 of 12

Methods - Summary of System

Figure 7: Flowchart demonstrating EldCare pose estimation and spatio-temporal action classification

Figure 8: EldCare Prototype using Nvidia Jetson Nano and webcam

Figure 9: Action Summary Interface and Live Action Classification Performed by iOS App

6 of 12

Methods - Data

  • Dataset consisted of 6 self-recorded video action classes: falling, standing, sitting, lying down, jumping jacks, and walking
    • 210 total videos, split into 146 training and 64 testing
    • 540x900 pixels
    • 3-4 seconds long
  • Online data
    • Videos contained multiple actions in one video
    • Poorly trimmed
    • Same person performed the actions in each video
  • Our dataset
    • Limited to one action per video
    • Realistic and varied angles that simulated real environments for seniors
    • Variation in the humans in each video
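
The 146/64 split above can be sketched as a simple shuffled hold-out. This is a minimal sketch, assuming one clip per file; the filenames, per-class clip count (35 per class, giving 210 total), and seed are illustrative, not the actual pipeline.

```python
import random

ACTIONS = ["falling", "standing", "sitting", "lying_down", "jumping_jacks", "walking"]

def split_dataset(video_ids, test_fraction=64 / 210, seed=0):
    """Shuffle clip identifiers and carve off a held-out test set
    (~30% of the data, matching the 64/146 split above)."""
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train, test)

# Example: 35 hypothetical clips per action class, 210 total.
clips = [f"{a}_{i:03d}.mp4" for a in ACTIONS for i in range(35)]
train, test = split_dataset(clips)
```

A fixed seed keeps the split reproducible across runs, so accuracy comparisons (e.g. with and without pose estimation) are made on the same held-out clips.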

7 of 12

Methods - Pose Estimation

  • Pose estimation allows precise tracking of human movement by detecting keypoints representing major joints
  • Pipeline consists of a two-step detector-tracker
    • Locates the person/pose in a region of interest (ROI)
    • Predicts pose landmarks and segmentation masks
  • Pose estimation uses a regression approach supervised by a combined heat-map/offset prediction of all keypoints (Fig. 10)
    • A combined heat-map and offset loss trains the center and left towers of the network
    • The heat map is then removed and the regression encoder is trained
  • The algorithm transfers only the “skeleton” of the person through background subtraction
    • Limits noise and variables for the action classifier model to process
    • Maintains privacy through foreground segmentation
  • The “skeleton” is created with an adaptation of a BlazePose detector, which predicts two virtual keypoints describing the human body’s center, rotation, and scale as a circle
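
The two virtual keypoints (body center plus a point fixing rotation and scale) imply a normalization step before the skeleton reaches the classifier. A minimal sketch under that assumption — the function name and coordinate convention are ours, not the BlazePose implementation:

```python
import math

def normalize_skeleton(keypoints, center, scale_point):
    """Re-express pose keypoints relative to the body-center circle
    predicted by the detector: translate to the center, then divide by
    the center-to-scale-point distance so the skeleton is invariant to
    the person's position and distance from the camera."""
    cx, cy = center
    sx, sy = scale_point
    radius = math.hypot(sx - cx, sy - cy)
    return [((x - cx) / radius, (y - cy) / radius) for x, y in keypoints]
```

Normalizing this way means the downstream classifier sees the same skeleton whether the person is near or far from the camera, which is one reason pose input reduces the noise the action model must handle.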

Figure 10: Tracking network architecture: regression with heatmap supervision

Figure 11: Data flow for real-time video capture and rendered output

8 of 12

Methods - Spatio-Temporal Action Localization

  • EfficientNet is a CNN architecture that uses compound scaling to jointly scale network depth, width, and input resolution
    • Outperforms state-of-the-art algorithms on ImageNet (Fig. 12)
  • EldCare takes a live video feed and localizes actions within space-time using a novel EfficientNet-GRU architecture (Fig. 13)
  • Detects human actions at the frame level and creates a frame linkage system over the last 5 frames
    • Inputs the pose estimation “skeletal structure” for each frame
    • GRU layers use the past 5 frames to recognize movement actions
    • GRU outperforms LSTM in speed
  • Supervised regression outperforms spatial region proposal (SRP) models
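
The 5-frame linkage system described above can be sketched as a sliding window that buffers pose skeletons until enough frames are available for the recurrent layers; the class name and interface are illustrative, not the deck's actual code:

```python
from collections import deque

class FrameLinker:
    """Maintain the last 5 pose-skeleton frames so a recurrent
    classifier (the GRU layers above) can see short-term motion."""

    def __init__(self, window=5):
        self.buffer = deque(maxlen=window)

    def push(self, skeleton):
        """Add one frame's skeleton; return True once the window is full
        and the classifier has enough temporal context to run."""
        self.buffer.append(skeleton)
        return len(self.buffer) == self.buffer.maxlen

    def window_tensor(self):
        """Oldest-to-newest list of skeletons, shape (window, keypoints, 2)."""
        return list(self.buffer)
```

Because `deque(maxlen=5)` discards the oldest frame automatically, the classifier always sees a contiguous 5-frame window, which is what lets the GRU distinguish, say, falling from lying down.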

Figure 12: Performance of EfficientNet and other state-of-the-art networks on the ImageNet database.

Figure 13: EfficientNet-GRU hybrid architecture.

9 of 12

Results

  • Algorithm performed better on our newly formed dataset when using pose estimation data (64 testing videos, 30% of the data)
    • 97.9% average accuracy with pose estimation
    • 92.3% average accuracy without pose estimation
  • Met the >90% accuracy engineering goal
  • 0.3 ms of real-time latency
  • Our model outperformed current markerless action classification techniques (Table 1)
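
Per-class accuracies of the kind a confusion matrix summarizes reduce to diagonal-over-row-sum; a minimal sketch, with the rows-are-true-class convention as an assumption:

```python
def per_class_accuracy(confusion):
    """Given a confusion matrix (rows = true class, columns = predicted
    class), return each class's accuracy: correct predictions on the
    diagonal divided by that class's total examples (the row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Total correct (trace) over total examples."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)
```

Reporting per-class numbers alongside the average guards against a strong overall score hiding a weak class, e.g. falling being confused with lying down.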

Figure 14: Confusion Matrix Representing our Model Classification Accuracy.

Figure 15: Accuracy of training and testing data plotted over iterations

Figure 16: Training and testing loss plotted over iterations

Table 1: Table comparison of accuracies of previous elderly care pose-detection models. EldCare displays the greatest accuracy.

10 of 12

Data Analysis

  • Data Interpretation
    • Utilizing the EfficientNet-GRU architecture, EldCare outperforms other algorithms on the HMDB51 and UCF101 datasets (Table 2).
    • Performs efficiently in real-time due to pose estimation and lightweight network
    • Markerless detection allows for the greatest accessibility and ease of use.
  • Problems
    • Limited dataset of 210 videos
    • Model tested only on videos containing a single person
    • Due to real-time detection demands, the Jetson Nano did not achieve a high FPS

Table 2: Table comparison of accuracies of various machine learning architectures on common publicly available action recognition datasets.

11 of 12

Conclusion

  • Summary
    • Action recognition system for elderly monitoring
    • Easy to manufacture and cost-effective at only $70
    • Low set-up time and ease of access through an app version
  • Impact
    • Can be used in homes for automated monitoring of seniors
    • Improves elderly care injury response times and limits the human error associated with marker-based systems
  • Future Work
    • Create a compact integrated system with a built-in camera for easier use
    • Add an alert system that notifies close family members or friends when a senior suffers an accident
    • Can be extended to track sleeping patterns and support healthcare monitoring

EldCare V1

EldCare V2

Future Goal

12 of 12

References

Buzzelli, M., Albé, A., & Ciocca, G. (2020). A Vision-based system for monitoring elderly people at home. Applied Sciences, 10(1). https://doi.org/10.3390/app10010374

Fan, K., Wang, P., Hu, Y., & Dou, B. (2017). Fall detection via human posture representation and support vector machine. International Journal of Distributed Sensor Networks, 13(5), 155014771770741. https://doi.org/10.1177/1550147717707418

Harrou, F., Zerrouki, N., Sun, Y., & Houacine, A. (2019). An Integrated Vision-Based Approach for Efficient Human Fall Detection in a Home Environment. IEEE Access, 7, 114966–114974. https://doi.org/10.1109/access.2019.2936320

Hellsten, T., Karlsson, J., Shamsuzzaman, M., & Pulkkis, G. (2021). The Potential of Computer Vision-Based Marker-Less Human Motion Analysis for Rehabilitation. Rehabilitation Process and Outcome. https://doi.org/10.1177/11795727211022330

Martinez, G. H., Kitani, K., & Bansal, A. (2019). OpenPose: Whole-Body Pose Estimation (Y. Sheikh, Ed.). Carnegie Mellon University. Retrieved January 18, 2022, from https://www.ri.cmu.edu/wp-content/uploads/2019/05/MS_Thesis___Gines_Hidalgo___latest_compressed.pdf

Oudah, M., Al-Naji, A., & Chahl, J. (2021). Computer Vision for Elderly Care Based on Deep Learning CNN and SVM. IOP Conference Series: Materials Science and Engineering, 1105(1), 012070. https://doi.org/10.1088/1757-899x/1105/1/012070

Pham, H.-H., Khoudour, L., Crouzil, A., Zegers, P., & Velastin, S. A. (2015). Video-based human action recognition using deep learning: A review. https://e-archivo.uc3m.es/bitstream/handle/10016/26542/videobased_2015.pdf

Song, L., Yu, G., Yuan, J., & Liu, Z. (2021). Human pose estimation and its application to action recognition: A survey. Journal of Visual Communication and Image Representation, 76, 103055. https://doi.org/10.1016/j.jvcir.2021.103055

Tang, J. (2020). MediaPipe Pose. MediaPipe. Retrieved January 6, 2022, from https://google.github.io/mediapipe/solutions/pose.html