Yajurv Bhatia, Asm Hossain Bari, Gee-Sern Jison Hsu, Marina Gavrilova.
Abstract
Motion capture sensor-based gait emotion recognition is an emerging sub-domain of human emotion recognition. Its applications span a variety of fields including smart home design, border security, robotics, virtual reality, and gaming. In recent years, several deep learning-based approaches have been successful in solving the Gait Emotion Recognition (GER) problem. However, a vast majority of such methods rely on Deep Neural Networks (DNNs) with a significant number of model parameters, which lead to model overfitting as well as increased inference time. This paper contributes to the domain of knowledge by proposing a new lightweight bi-modular architecture with handcrafted features that is trained using a RMSprop optimizer and stratified data shuffling. The method is highly effective in correctly inferring human emotions from gait, achieving a micro-mean average precision of 0.97 on the Edinburgh Locomotive Mocap Dataset. It outperforms all recent deep-learning methods, while having the lowest inference time of 16.3 milliseconds per gait sample. This research study is beneficial to applications spanning various fields, such as emotionally aware assistive robotics, adaptive therapy and rehabilitation, and surveillance.
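The handcrafted features the abstract refers to include joint relative distances (JRD) and joint relative angles (JRA), which the experiment tables below mention by name. A minimal numpy sketch of how such descriptors can be computed from 3D joint coordinates follows; the skeleton, joint indices, and chosen pair/triplet are illustrative assumptions, not the authors' exact feature set:

```python
import numpy as np

def joint_relative_distance(joints, pair):
    """Euclidean distance between two joints; joints is (n_joints, 3)."""
    a, b = pair
    return float(np.linalg.norm(joints[a] - joints[b]))

def joint_relative_angle(joints, triplet):
    """Angle (radians) at the middle joint of a triplet, e.g. the elbow
    angle formed by shoulder-elbow-wrist."""
    a, b, c = triplet
    v1 = joints[a] - joints[b]
    v2 = joints[c] - joints[b]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical 4-joint skeleton: 0=shoulder, 1=elbow, 2=wrist, 3=hip
skeleton = np.array([[0.0, 1.5, 0.0],
                     [0.0, 1.2, 0.1],
                     [0.0, 0.9, 0.3],
                     [0.0, 1.0, 0.0]])
jrd = joint_relative_distance(skeleton, (0, 2))   # shoulder-wrist distance
jra = joint_relative_angle(skeleton, (0, 1, 2))   # elbow angle
```

In practice such per-frame scalars would be stacked over the frames of a gait sequence to form the temporal input fed to the network.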
Keywords: deep learning; emotion recognition; gait; handcrafted features; human motion; long short-term memory; motion capture sensor; remote visual technology
Year: 2022 PMID: 35009944 PMCID: PMC8749847 DOI: 10.3390/s22010403
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Summary of gait emotion recognition literature.
| Article | Year | Methodology | Training Dataset | Pros | Cons |
|---|---|---|---|---|---|
| Karg et al. | 2010 | PCA, Fourier transform, and dimension reduction on handcrafted features, followed by SVM, NB, and NN classifiers | Dataset collected at Technische Universität München (TU München) | Dimensionally reduced feature set. | Limited feature set consisting of only basic handcrafted features and classical machine learning algorithms. Low accuracy. |
| Yan et al. | 2018 | Spatial Temporal Graph Convolutional Network (STGCN) | DeepMind Kinetics video dataset | Structured graph representation of the gait skeleton and three different partitionings for graph convolutions. | Dependencies between bones and joints not exploited. |
| Ahmed et al. | 2019 | ANOVA and MANOVA for feature refinement, and GA for feature selection, followed by a score- and rank-level fusion of four classifiers | Proprietary dataset | Two-layered feature selection from 10 pools of features. | Classical machine learning algorithms. |
| Bhattacharya et al. | 2020 | Concatenation of affective features and features extracted from an STGCN | E-Gait dataset | CNN for processing the STGCN output and a hybrid feature set. | Dependencies between bones and joints not exploited. |
| Randhavane et al. | 2020 | Concatenation of affective and LSTM-extracted deep features | EWalk dataset | Hybrid feature set and a dedicated classifier. | Inefficient LSTM module. |
| Bhattacharya et al. | 2020 | Hierarchical attention pooling and affective mapping using GRUs | Emotion–gait dataset | Hierarchical network and hybrid features. | GRUs used instead of LSTM units. |
Figure 1. Architecture of the proposed bi-modular sequential neural network.
Figure 2. Modified body skeleton from the Edinburgh Locomotive MOCAP Dataset.
Figure 3. Model training graphs with: (a) Adam optimizer, (b) RMSprop optimizer, and (c) SGD optimizer.
Performance of the proposed bi-modular sequential neural network with different optimization methods.
Optimizer Selection Experiment

| Optimizer | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP |
|---|---|---|---|---|---|
| Adam | 0.991 | 0.802 | 0.694 | 0.360 | 0.915 |
| SGD | 0.876 | 0.436 | 0.217 | 0.265 | 0.679 |
RMSprop achieves the best mean and class average precision scores.
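RMSprop, the optimizer selected above, scales each parameter's step by a running root-mean-square of its gradients. A minimal numpy sketch of the update rule; the hyperparameter values are common defaults, not taken from the paper:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update: divide the step by a running RMS of gradients,
    so parameters with consistently large gradients take smaller steps."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)        # running average of squared gradients
grad = np.array([0.5, -0.5])    # illustrative gradient
w, cache = rmsprop_step(w, grad, cache)
```

In a training loop, `cache` persists across steps so the denominator reflects recent gradient magnitudes rather than a single batch.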
Figure 4Model training graphs with: (a) TanH activation functions in all layers of the network and (b) TanH activations in the LSTM sub-network and ReLU activation functions in the MLP sub-network.
Performance of the proposed method with different configurations of Tanh and ReLU activations.
Activation Function Selection Experiment

| Activation Configuration | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP |
|---|---|---|---|---|---|
|  | 0.987 | 0.698 | 0.665 | 0.283 | 0.904 |
The configuration with TanH activations in all layers of the network results in the best mean and class average precision scores.
Performance evaluation of the proposed method with different values and positions of the dropout layer.
Dropout Position and Value Selection Experiment

| Dropout Position | Dropout Rate | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP |
|---|---|---|---|---|---|---|
| No Dropout | - | 0.990 | 0.827 | 0.542 | 0.509 | 0.920 |
| 1st LSTM Layer | 0.1 | 0.991 | 0.841 | 0.659 | 0.462 | 0.923 |
| 1st LSTM Layer | 0.2 | 0.991 | 0.746 | 0.490 | 0.436 | 0.902 |
| 1st LSTM Layer | 0.4 | 0.990 | 0.853 | 0.765 | 0.304 | 0.928 |
| 2nd LSTM Layer | 0.1 | 0.988 | 0.780 | 0.547 | 0.350 | 0.906 |
| 2nd LSTM Layer | 0.2 | 0.994 | 0.880 | 0.739 | 0.488 | 0.925 |
| 2nd LSTM Layer | 0.4 | 0.985 | 0.795 | 0.699 | 0.363 | 0.906 |
| 3rd LSTM Layer | 0.1 | 0.986 | 0.834 | 0.430 | 0.465 | 0.896 |
| 3rd LSTM Layer | 0.2 | 0.994 | 0.764 | 0.398 | 0.314 | 0.906 |
| 3rd LSTM Layer | 0.4 | 0.978 | 0.762 | 0.505 | 0.504 | 0.877 |
| 1st MLP Layer | 0.1 | 0.994 | 0.828 | 0.636 | 0.495 | 0.931 |
| 1st MLP Layer | 0.2 | 0.990 | 0.813 | 0.597 | 0.514 | 0.923 |
| 1st MLP Layer | 0.4 | 0.985 | 0.732 | 0.544 | 0.374 | 0.903 |
| 2nd MLP Layer | 0.1 | 0.984 | 0.737 | 0.450 | 0.370 | 0.896 |
| 2nd MLP Layer | 0.4 | 0.988 | 0.739 | 0.599 | 0.380 | 0.912 |
| 3rd MLP Layer | 0.1 | 0.984 | 0.840 | 0.661 | 0.367 | 0.923 |
| 3rd MLP Layer | 0.2 | 0.980 | 0.733 | 0.536 | 0.241 | 0.895 |
| 3rd MLP Layer | 0.4 | 0.985 | 0.757 | 0.651 | 0.319 | 0.897 |
A dropout of 0.2 in the second layer of the MLP sub-network results in the best mean and class average precision scores.
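Dropout at a single chosen layer, as tuned above, is typically implemented as inverted dropout so that inference needs no rescaling. A small illustrative numpy sketch; the rate and tensor shape are arbitrary, not the tuned values applied to the authors' network:

```python
import numpy as np

def dropout(x, rate=0.2, training=True, rng=None):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and scale the survivors by 1/(1-rate), so the expected
    activation is unchanged and inference applies no scaling at all."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate   # keep each unit with prob 1-rate
    return x * mask / (1.0 - rate)

x = np.ones((4, 8))          # illustrative activation tensor
y = dropout(x, rate=0.2)     # surviving entries become 1/(1-0.2) = 1.25
```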
Figure 5. Model training graphs with (a) random and (b) stratified data shuffling. Stratified data shuffling results in less overfitting and a lower model loss; hence, stratified data shuffling was used.
Comparison between random and stratified data selection methods for LSTM and MLP with JRA and JRD.
Data Shuffling Selection Experiment

| Data Shuffling | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP |
|---|---|---|---|---|---|
| Random | 0.992 | 0.865 | 0.690 | 0.389 | 0.940 |
Stratified data shuffling results in the best mean and class average precision scores.
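Stratified shuffling, the better-performing option above, splits each class separately so that every emotion class keeps the same proportion in the training and validation sets. A self-contained sketch; the labels and split ratio are illustrative:

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return shuffled train/val index arrays that preserve the class
    proportions of `labels` in both splits."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # all samples of this class
        rng.shuffle(idx)                     # shuffle within the class
        n_val = int(round(val_fraction * len(idx)))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# 40 samples from 4 hypothetical emotion classes, 10 per class
labels = np.repeat([0, 1, 2, 3], 10)
train, val = stratified_split(labels)
```

With a plain random split, a rare class can end up under-represented in one split; the per-class loop above rules that out by construction.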
Performance of the proposed architecture with different batch sizes.
Batch Size Selection Experiment

| Batch Size | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP |
|---|---|---|---|---|---|
| 16 | 0.987 | 0.863 | 0.776 | 0.544 | 0.938 |
| 32 | 0.993 | 0.815 | 0.785 | 0.487 | 0.928 |
| 128 | 0.988 | 0.829 | 0.677 | 0.400 | 0.913 |
| 256 | 0.986 | 0.810 | 0.589 | 0.417 | 0.915 |
A batch size of 64 achieves the optimal mean and class average precision scores.
Figure 6. Model training graphs for different batch sizes. Smaller batch sizes result in unstable learning, depicted by the oscillating precision and loss curves. In contrast, larger batch sizes ensure that batches have a good representation of input samples from each class. However, larger batch sizes also regularize the learning too much and worsen the performance of the network. A batch size of 64 provides a balance of smooth learning with a low loss value at the end of training and was chosen as a model parameter.
Figure 7. Model precision and loss graphs for the (a) training and (b) validation sets with respect to the number of epochs. The validation loss of the model started to stabilize around epoch 75, after which the model began to overfit. Additionally, the precision values of the model started dipping shortly after epoch 75. Hence, the number of training epochs was set to 75.
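Choosing the epoch where validation loss bottoms out, as done above, is what early stopping with a patience window automates. A minimal sketch over a synthetic loss curve shaped like the one described (falling, flattening, then rising); the curve and patience value are illustrative, not the paper's data:

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the epoch with the best validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break   # overfitting: stop scanning further epochs
    return best_epoch

# Synthetic curve: loss falls for 75 epochs, then overfitting drives it up
curve = [1.0 - 0.01 * e for e in range(75)] + [0.27 + 0.005 * e for e in range(25)]
best = early_stop_epoch(curve)   # best epoch is the last of the descent
```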
Performance comparison of the recent methods with the proposed bi-modular sequential neural network.
Performance Comparison Experiments

| Method | AP (Class 1) | AP (Class 2) | AP (Class 3) | AP (Class 4) | Mean AP | Micro-Mean AP |
|---|---|---|---|---|---|---|
| STEP (2020) | 0.22 | 0.52 | 0.30 | 0.12 | 0.29 | 0.27 |
| ADF (2019) | 0.22 | 0.59 | 0.30 | 0.12 | 0.31 | 0.27 |
| STGCN (2018) | 0.06 | 0.97 | 0.20 | 0.01 | 0.34 | 0.41 |
| HAPAM (2020) | 0.97 | 0.66 | 0.40 | 0.18 | 0.60 | 0.88 |
The proposed bi-modular networks outperform the previous state-of-the-art methods in mean average precision scores.
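Micro-mean average precision, the headline metric reported in the abstract, pools every (sample, class) score pair before ranking, rather than averaging per-class APs. A minimal numpy sketch with toy one-hot labels and scores, not the paper's data:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = mean of precision@k taken over the ranks k of the true
    positives, with predictions sorted by descending score."""
    order = np.argsort(-scores)
    y = y_true[order]
    hits = np.cumsum(y)                            # true positives so far
    precision_at_k = hits / (np.arange(len(y)) + 1)
    return float(np.sum(precision_at_k * y) / np.sum(y))

# Micro-averaging: flatten one-hot labels and class scores, rank jointly
y_onehot = np.array([[1, 0], [0, 1], [1, 0]])
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
micro_map = average_precision(y_onehot.ravel(), scores.ravel())
```

Macro averaging would instead compute one AP per class column and average the results, which weights rare classes equally with common ones.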
Comparison of the inference time of the recent methods with the proposed bi-modular sequential neural network.
Inference Time Comparisons

| Method | Parameters | Inference Time per Sample (s) |
|---|---|---|
| STEP | 717,987 | 4.82 × 10⁻² |
| HAPAM | 40,444,854 | 4.66 × 10⁻² |
| ADF | 310,978 | 3.91 × 10⁻² |
| STGCN | 2,628,290 | 2.17 × 10⁻² |
| Proposed LSTM with batch normalization (RGS + JRA + JRD) |  |  |
The proposed bi-modular networks exhibit the fastest inference times.
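Per-sample inference time, as compared above, is usually measured by averaging many timed forward passes after a warm-up call. A sketch with a stand-in model; the matrix multiply and input size are placeholders, not the authors' network:

```python
import time
import numpy as np

def mean_inference_ms(model_fn, sample, n_runs=50):
    """Average wall-clock milliseconds per forward pass; the warm-up call
    avoids counting one-time setup cost in the measurement."""
    model_fn(sample)  # warm-up
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(sample)
    return (time.perf_counter() - start) / n_runs * 1e3

weights = np.random.default_rng(0).standard_normal((240, 4))
gait_sample = np.ones(240)   # placeholder flattened gait descriptor
ms = mean_inference_ms(lambda x: x @ weights, gait_sample)
```

Hardware, batch size, and framework overhead all affect such timings, so cross-paper comparisons are only meaningful when measured on the same machine.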
Figure 8. High regularization in the model training graphs due to batch normalization after the first MLP layer.