Ohoud Nafea, Wadood Abdul, Ghulam Muhammad, Mansour Alsulaiman.
Abstract
Human activity recognition (HAR) remains a challenging yet crucial problem in computer vision. HAR is primarily intended to be used with other technologies, such as the Internet of Things, to assist in healthcare and eldercare. With the development of deep learning, automatic high-level feature extraction has become possible and has been used to optimize HAR performance; deep-learning techniques have also been applied to sensor-based HAR in various fields. This study introduces a new methodology that uses convolutional neural networks (CNN) with varying kernel dimensions, along with bi-directional long short-term memory (BiLSTM), to capture features at various resolutions. The novelty of this research lies in the effective selection of the optimal video representation and in the effective extraction of spatial and temporal features from sensor data using a traditional CNN and BiLSTM. The proposed methodology is evaluated on the wireless sensor data mining (WISDM) and UCI datasets, in which data are collected from diverse sensors, including accelerometers and gyroscopes. The results indicate that the proposed scheme is efficient in improving HAR: unlike other available methods, the proposed method improved accuracy, attaining a higher score on the WISDM dataset than on the UCI dataset (98.53% vs. 97.05%).
Keywords: Bi-directional LSTM; convolution neural networks; deep learning; human activity recognition; local spatio-temporal features
Year: 2021 PMID: 33803891 PMCID: PMC8003187 DOI: 10.3390/s21062141
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
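The abstract describes parallel convolutional branches with different kernel sizes combined with a BiLSTM to capture features at multiple temporal resolutions. A minimal Keras sketch of such an architecture is shown below; the window length and channel count follow the UCI-HAR convention, and all kernel sizes, filter counts, and layer widths are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a multi-kernel CNN + BiLSTM model for sensor windows.
# Window length (128) and channel count (9) follow the UCI-HAR convention;
# kernel sizes, filter counts, and dense sizes are illustrative assumptions,
# not the authors' exact configuration.
from tensorflow.keras import layers, Model

def build_cnn_bilstm(n_timesteps=128, n_channels=9, n_classes=6):
    inputs = layers.Input(shape=(n_timesteps, n_channels))

    # Parallel 1D convolution branches with different kernel sizes capture
    # local patterns at several temporal resolutions.
    conv_branches = []
    for kernel_size in (3, 7, 11):
        x = layers.Conv1D(64, kernel_size, padding="same", activation="relu")(inputs)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.GlobalAveragePooling1D()(x)
        conv_branches.append(x)

    # BiLSTM branch models longer-range temporal dependencies.
    lstm = layers.Bidirectional(layers.LSTM(64))(inputs)

    merged = layers.concatenate(conv_branches + [lstm])
    merged = layers.Dense(128, activation="relu")(merged)
    outputs = layers.Dense(n_classes, activation="softmax")(merged)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```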
Figure 1. A block diagram of sensor-based human activity recognition using deep learning.
Summary of the studies reviewed in the related work.
| Ref. | Year | Model | Domain | Proposed Study | Comments |
|---|---|---|---|---|---|
| [ | 2015 | CNN | video-based | Two-stream ConvNets (VGGNet, GoogLeNet); 10-frame stacking of optical flow for the temporal network and a single frame for the spatial network; data augmentation to increase the size of the dataset. | * Requires feature pre-processing. |
| [ | 2016 | CNN | video-based | Proposes dynamic maps and rank pooling to encode the frames of a video into a single RGB image per video. | |
| [ | 2016 | CNN | sensor-based | Proposes a multi-layer CNN with alternating convolution and pooling layers to extract the features. | |
| [ | 2016 | CNN | video-based | Proposes a two-stream network in which fusion is done at the level of the convolution layer. | |
| [ | 2016 | CNN | video-based | Proposes a temporal segment network (TSN); instead of working on each frame individually, it works on short snippets sparsely sampled from the video, which are fed to a two-stream CNN. | |
| [ | 2020 | CNN | sensor-based | Proposes Lego filters to obtain lightweight deep CNNs. | |
| [ | 2020 | CNN | video-based | Proposes using a CNN as a feature extractor over different transformed domains. | |
| [ | 2015 | CNNs with | video-based | A new video representation, called trajectory-pooled deep-convolutional descriptors (TDD); two-stream ConvNets; uses improved dense trajectory (iDT) features. | * Requires feature pre-processing. |
| [ | 2018 | CNNs with | sensor-based | Proposes a new CNN design combined with statistical features. | * A new network design, but it does not effectively represent the spatio-temporal features in HAR. |
| [ | 2015 | 3D-CNN | video-based | Proposes C3D (Convolutional 3D) with a simple linear classifier. | * Can capture the spatio-temporal features. |
| [ | 2018 | 3D-CNN | video-based | Two-stream inflated 3D ConvNet (I3D). | |
| [ | 2018 | CNN-LSTM | video-based | A CNN extracts the spatial features, which are then fed into two different LSTM streams (FC-LSTM, ConvLSTM) to extract the temporal features. | * CNN-LSTM model was found to be |
| [ | 2020 | CNN-LSTM | sensor-based | Improves several models, such as a 1D CNN, a multichannel CNN, a CNN-LSTM, and a multichannel CNN-LSTM. | |
Figure 2. Overall architecture of the proposed approach.
Figure 3. Architecture of a long short-term memory (LSTM) unit.
Figure 4. Architecture of a bi-directional long short-term memory (BiLSTM) unit.
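For reference, a standard formulation of the LSTM gates depicted in Figure 3, together with the forward/backward concatenation used by the BiLSTM in Figure 4 (notation may differ from the figures):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
\text{BiLSTM: } h_t^{\mathrm{bi}} &= \left[\overrightarrow{h_t}\,;\,\overleftarrow{h_t}\right]
\end{aligned}
```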
Figure 5. A detailed description of the proposed approach.
Activities of the wireless sensor data mining (WISDM) dataset [1].
| Activity | Walk | Jog | Up | Down | Sit | Std |
|---|---|---|---|---|---|---|
| Samples | 424,400 | 342,177 | 122,869 | 100,427 | 59,939 | 48,397 |
| Percentage (%) | 38.6 | 31.2 | 11.2 | 9.1 | 5.5 | 4.4 |
Activities of the UCI-HAR dataset [1].
| Activity | Walk | Up | Down | Sit | Std | Lay |
|---|---|---|---|---|---|---|
| Samples | 122,091 | 116,707 | 107,961 | 126,677 | 138,105 | 136,865 |
| Percentage (%) | 16.3 | 15.6 | 14.4 | 16.9 | 18.5 | 18.3 |
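The two tables above report raw per-activity sample counts; models of this kind typically segment the continuous sensor streams into fixed-length, partially overlapping windows before training. A minimal sketch of that preprocessing step follows, assuming a 128-sample window with 50% overlap (common defaults for WISDM and UCI-HAR, not necessarily the authors' exact settings).

```python
import numpy as np

def sliding_windows(signal, labels, window=128, overlap=0.5):
    """Segment a (timesteps, channels) signal into fixed-length windows.

    Each window gets the most frequent sample label it contains. The window
    length and overlap are illustrative defaults, not the paper's exact
    preprocessing parameters.
    """
    step = int(window * (1 - overlap))
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        segment = signal[start:start + window]
        segment_labels = labels[start:start + window]
        values, counts = np.unique(segment_labels, return_counts=True)
        X.append(segment)
        y.append(values[np.argmax(counts)])  # majority label for the window
    return np.asarray(X), np.asarray(y)
```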
Figure 6. Training accuracy vs. validation accuracy.
Figure 7. Training loss vs. validation loss.
Figure 8. The impact of the number of filters on recognition accuracy in the proposed convolutional neural network (CNN).
Confusion matrix for classification on the WISDM dataset.
| Activities | Down | Jog | Sit | Std | Up | Walk | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| Down | 739 | 2 | 0 | 0 | 34 | 5 | 0.94 | 0.96 |
| Jog | 0 | 2576 | 0 | 0 | 8 | 0 | | |
| Sit | 0 | 0 | 395 | 20 | 1 | 2 | 0.94 | 0.97 |
| Std | 0 | 0 | 2 | 353 | 1 | 0 | 0.99 | 0.97 |
| Up | 20 | 4 | 0 | 0 | 855 | 0 | 0.97 | 0.95 |
| Walk | 6 | 2 | 0 | 0 | 14 | 3197 | | |
| Recall | 0.96 | | | 0.94 | 0.93 | | | |
| Accuracy | 98.53% | | | | | | | |
| Kappa | 0.98 | | | | | | | |
The result marked in bold refers to the results that achieve the best classification of activities using different metrics.
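The per-class precision, recall, F1, overall accuracy, and Cohen's kappa reported in these tables can be reproduced from model predictions with scikit-learn; the helper below is a generic sketch, not the authors' evaluation code.

```python
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support,
                             accuracy_score, cohen_kappa_score)

def evaluate(y_true, y_pred, class_names):
    """Compute the metrics reported in the confusion-matrix tables."""
    cm = confusion_matrix(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred)
    for name, p, r, f in zip(class_names, precision, recall, f1):
        print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"Cohen's kappa: {cohen_kappa_score(y_true, y_pred):.2f}")
    return cm
```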
Confusion matrix for classification on the WISDM dataset using the proposed CNN-BiLSTM model as a feature extractor and a support vector machine (SVM) as the classifier.
| Activities | Down | Jog | Sit | Std | Up | Walk | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| Down | 646 | 0 | 2 | 0 | 30 | 2 | 0.95 | 0.96 |
| Jog | 1 | 2574 | 0 | 0 | 7 | 2 | | |
| Sit | 1 | 0 | 395 | 20 | 1 | 1 | 0.94 | 0.97 |
| Std | 2 | 0 | 0 | 354 | 0 | 0 | | 0.97 |
| Up | 22 | 3 | 0 | 0 | 854 | 0 | 0.97 | 0.96 |
| Walk | 10 | 1 | 1 | 0 | 13 | 3194 | | |
| Recall | 0.94 | | | 0.94 | 0.94 | | | |
| Accuracy | 98.53% | | | | | | | |
| Kappa | 0.98 | | | | | | | |
The result marked in bold refers to the results that achieve the best classification of activities using different metrics.
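The table above evaluates the CNN-BiLSTM network as a feature extractor whose outputs are classified by an SVM. A minimal sketch of that two-stage setup follows, assuming a trained Keras model such as the one sketched earlier; taking the penultimate layer as the feature representation and using an RBF kernel are illustrative assumptions, not the paper's settings.

```python
from sklearn.svm import SVC
from tensorflow.keras import Model

def svm_on_deep_features(trained_model, X_train, y_train, X_test,
                         feature_layer_index=-2):
    """Use a trained CNN-BiLSTM as a feature extractor and classify with an SVM.

    The penultimate layer is taken as the feature representation; the SVM
    kernel and regularisation are illustrative defaults.
    """
    feature_extractor = Model(
        inputs=trained_model.input,
        outputs=trained_model.layers[feature_layer_index].output)
    train_features = feature_extractor.predict(X_train)
    test_features = feature_extractor.predict(X_test)

    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(train_features, y_train)
    return svm.predict(test_features)
```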
Confusion matrix for classification on the UCI-HAR dataset.
| Activities | Walk | Up | Down | Sit | Std | Lay | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| Walk | 494 | 2 | 0 | 0 | 0 | 0 | 0.99 | 0.99 |
| Up | 0 | 470 | 0 | 1 | 0 | 0 | 0.99 | 0.99 |
| Down | 2 | 10 | 407 | 0 | 1 | 0 | 0.96 | 0.99 |
| Sit | 0 | 3 | 0 | 449 | 36 | 3 | 0.94 | 0.91 |
| Std | 0 | 0 | 0 | 29 | 503 | 0 | 0.93 | 0.94 |
| Lay | 0 | 0 | 0 | 0 | 0 | 537 | 0.99 | |
| Recall | 0.99 | 0.96 | | 0.93 | 0.93 | 0.99 | | |
| Accuracy | 97.04% | | | | | | | |
| Kappa | 0.96 | | | | | | | |
The result marked in bold refers to the results that achieve the best classification of activities using different metrics.
Confusion matrix for classification on the UCI-HAR dataset using the proposed CNN-BiLSTM model as a feature extractor and an SVM as the classifier.
| Activities | Walk | Up | Down | Sit | Std | Lay | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| Walk | 494 | 2 | 0 | 0 | 0 | 0 | 0.99 | 0.99 |
| Up | 0 | 465 | 5 | 1 | 0 | 0 | 0.98 | 0.99 |
| Down | 0 | 4 | 415 | 0 | 1 | 0 | 0.98 | 0.99 |
| Sit | 0 | 0 | 0 | 448 | 43 | 0 | 0.91 | 0.93 |
| Std | 0 | 0 | 0 | 24 | 508 | 0 | 0.95 | 0.94 |
| Lay | 0 | 0 | 0 | 0 | 0 | 537 | | |
| Recall | | 0.98 | 0.98 | 0.94 | 0.92 | | | |
| Accuracy | 97.28% | | | | | | | |
| Kappa | 0.96 | | | | | | | |
The result marked in bold refers to the results that achieve the best classification of activities using different metrics.
Comparison with other studies conducted on different human activity recognition (HAR) datasets.
| Database | Ref. | Year | Used Technique | Accuracy (%) |
|---|---|---|---|---|
| UCF101 | [ | 2015 | GoogLeNet & VGG-16 | 91.40 |
| | [ | 2015 | TDD | 91.50 |
| | [ | 2016 | Two-stream CNN (VGG-16) | 93.50 |
| | [ | 2016 | TSN | 94.20 |
| | [ | 2016 | CNN | 89.10 |
| | [ | 2017 | Two-stream 3D ConvNet | 93.40 |
| | [ | 2018 | CNN-LSTM | 84.10 |
| HMDB51 | [ | 2015 | TDD | 65.90 |
| | [ | 2016 | Two-stream CNN + iDT | 69.20 |
| | [ | 2016 | TSN | 69.40 |
| | [ | 2016 | CNN | 65.20 |
| | [ | 2017 | Two-stream 3D ConvNet | 88.40 |
| ASLAN | [ | 2015 | CONV3D-SVM | 78.30 |
| Sports 1M | [ | 2015 | CONV3D-SVM | 85.20 |
| UCF-ARG | [ | 2020 | Pre-trained CNN | 87.60 |
| Sound dataset | [ | 2020 | CNN | 87.20 |
| HHAR | [ | 2020 | Fusion ResNet | 96.63 |
| MHEALTH | [ | 2020 | Fusion ResNet | 98.50 |
| WISDM | [ | 2020 | CNN-LSTM | 95.75 |
| | [ | 2020 | CNN | 97.51 |
| | Proposed | 2020 | CNN-BiLSTM | 98.53 |
| UCI | [ | 2016 | CNN | 93.75 |
| | [ | 2018 | CNN | 95.31 |
| | [ | 2018 | CNN with statistical features | 97.63 |
| | [ | 2020 | CNN-LSTM | 95.80 |
| | [ | 2020 | Lightweight CNN | 96.27 |
| | Proposed | 2020 | CNN-BiLSTM | |
The result marked in bold refers to the results that are achieved by the proposed approach.
Ablation study.
| Experiment | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) | Cohen's Kappa |
|---|---|---|---|---|---|
| Six conv. layers at each level, working in parallel with two BiLSTM layers; their outputs are then concatenated | 95.55 | 95.52 | 95.55 | 95.52 | 0.9465 |
| Two conv. layers at each level, working in parallel with three BiLSTM layers; their outputs are then concatenated | 75.97 | 78.93 | 75.97 | 75.27 | 0.7107 |
| Two conv. layers, followed by a BatchNormalization layer, working in parallel with two BiLSTM layers; their outputs are then concatenated | 87.13 | 89.98 | 87.13 | 87.23 | 0.8456 |
| One conv. layer, working in parallel with three BiLSTM layers; their outputs are then concatenated | 88.63 | 88.99 | 88.63 | 88.51 | 0.8633 |