Huogen Wang, Zhanjie Song, Wanqing Li, Pichao Wang.
Abstract
The paper presents a novel hybrid network for large-scale action recognition from multiple modalities. The network is built upon the proposed weighted dynamic images. It leverages the strengths of the emerging Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches to address challenges in large-scale action recognition that are not fully dealt with by state-of-the-art methods. Specifically, the proposed hybrid network consists of a CNN based component and an RNN based component. Features extracted by the two components are fused through canonical correlation analysis and then fed to a linear Support Vector Machine (SVM) for classification. The proposed network achieved state-of-the-art results on the ChaLearn LAP IsoGD, NTU RGB+D and Multi-modal & Multi-view & Interactive (M2I) datasets and outperformed existing methods by a large margin (over 10 percentage points in some cases).
Keywords: 3D convolutional LSTM network; action recognition; canonical correlation analysis; weighted dynamic image; weighted rank pooling
Year: 2020 PMID: 32532007 PMCID: PMC7308905 DOI: 10.3390/s20113305
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
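As a rough illustration of the pipeline described in the abstract (features from the CNN component and the 3D ConvLSTM component are fused through canonical correlation analysis, and the fused features are classified with a linear SVM), the following is a minimal sketch using scikit-learn. The array shapes, the number of CCA components and the SVM regularisation constant are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of CCA feature fusion followed by a linear SVM, as outlined
# in the abstract. Feature dimensions and hyperparameters are assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

def cca_fuse_and_classify(feat_cnn, feat_rnn, labels, n_components=128):
    """feat_cnn: (n, d1) CNN features; feat_rnn: (n, d2) 3D ConvLSTM features."""
    cca = CCA(n_components=n_components)
    cca.fit(feat_cnn, feat_rnn)                       # learn paired projections
    z_cnn, z_rnn = cca.transform(feat_cnn, feat_rnn)  # correlated subspaces
    fused = np.concatenate([z_cnn, z_rnn], axis=1)    # feature-level fusion
    clf = LinearSVC(C=1.0).fit(fused, labels)         # linear SVM classifier
    return cca, clf
```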
Performance evaluation of the two-stream, 3D CNN, CNN+RNN (ConvLSTM) and DI+CNN approaches on the NTU RGB+D action dataset using the depth modality and the cross-subject protocol.
| Category | Two-Stream | 3D CNN | ConvLSTM | DI + CNN |
|---|---|---|---|---|
|  | fair (72.5%) | fair (71.8%) | good (85.5%) | good (85.0%) |
|  | good (84.7%) | good (84.2%) | good (88.1%) | fair (77.0%) |
|  | fair (74.1%) | poor (68.1%) | good (84.6%) | fair (71.4%) |
|  | fair (73.4%) | poor (67.8%) | poor (61.8%) | fair (71.4%) |
Figure 1. An overview of the proposed hybrid network for multimodal action recognition. The network is built upon the proposed weighted dynamic images, CNNs and 3D ConvLSTM to extract highly complementary information from the depth and RGB video sequences. Canonical correlation analysis is adopted for feature-level fusion and a linear SVM for classification.
Figure 2. Dynamic images of actions from the NTU RGB+D dataset. (a) A conventional dynamic image of “eating meal/snack”; (b) a weighted dynamic image of “eating meal/snack”; (c) a conventional dynamic image of “put something inside pocket/take out something from pocket”; (d) a weighted dynamic image of “put something inside pocket/take out something from pocket”.
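For context on Figure 2, a conventional dynamic image can be computed by approximate rank pooling, i.e. a weighted sum of the frames with coefficients alpha_t = 2t − T − 1 (the common linear approximation). The sketch below also accepts per-frame weights as a hedged reading of the paper's weighted dynamic image; the exact weighting scheme used in the paper may differ.

```python
# Hedged sketch of a (weighted) dynamic image via approximate rank pooling.
# The linear coefficients are the standard approximation; the per-frame
# weighting is an illustrative assumption, not the paper's exact scheme.
import numpy as np

def dynamic_image(frames, weights=None):
    """frames: (T, H, W) or (T, H, W, C) array; weights: optional (T,) array."""
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1        # rank-pooling coefficients
    if weights is not None:
        alphas = alphas * np.asarray(weights, dtype=np.float64)
    di = np.tensordot(alphas, frames.astype(np.float64), axes=(0, 0))
    di = (di - di.min()) / max(di.max() - di.min(), 1e-8)  # rescale for display
    return (255.0 * di).astype(np.uint8)
```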
Comparison of recognition accuracy using Weighted Dynamic Images and Dynamic Images on the validation set of the ChaLearn LAP IsoGD dataset.
| Methods | Accuracy |
|---|---|
| DDI | 45.11% |
| DRI | 39.23% |
| Fusion of DDI and DRI | 49.14% |
| WDDI (Proposed) | 50.50% |
| WDRI (Proposed) | 48.60% |
| Fusion of WDDI and WDRI (Proposed) | 55.64% |
Evaluation of different spatial/temporal weight estimation methods on the validation set of the ChaLearn LAP IsoGD dataset (depth modality only).
| Methods | Accuracy |
|---|---|
| DDI | 45.11% |
| Background-foreground segmentation | 45.75% |
| Salient region detection | 46.18% |
| Flow-guided aggregation | 49.13% |
| Key frame selection | 48.75% |
| Flow-guided frame weight | 48.87% |
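One plausible reading of the “flow-guided frame weight” entry above is to weight each frame by its mean optical-flow magnitude, so that frames containing more motion contribute more strongly. The sketch below implements that reading with OpenCV's Farnebäck optical flow; it is an assumption, not the paper's exact estimator.

```python
# Hedged sketch of flow-guided per-frame weights: the mean optical-flow
# magnitude of each frame, normalised to sum to one. Assumed interpretation.
import cv2
import numpy as np

def flow_guided_weights(gray_frames):
    """gray_frames: list of (H, W) uint8 grayscale frames; returns (T,) weights."""
    mags = [0.0]  # the first frame has no preceding frame, hence no flow
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=2).mean()))
    w = np.asarray(mags)
    return w / max(w.sum(), 1e-8)  # normalise so the weights sum to one
```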
Performance on the ChaLearn LAP IsoGD dataset using features extracted by the CNN and the 3D ConvLSTM components, where “+” indicates average score fusion.
| Modality | Feature | Accuracy |
|---|---|---|
| Depth | CNN | 50.50% |
| Depth | 3D ConvLSTM | 44.76% |
| Depth | CNN + 3D ConvLSTM | 55.67% |
| RGB | CNN | 48.60% |
| RGB | 3D ConvLSTM | 44.23% |
| RGB | CNN + 3D ConvLSTM | 55.52% |
| RGB + Depth | CNN + 3D ConvLSTM | 60.15% |
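The “+” rows in the table above denote average score fusion. A minimal sketch of that step follows; applying a softmax before averaging is an assumption, since raw or otherwise-normalised scores could equally be averaged.

```python
# Hedged sketch of average score fusion across branches or modalities.
import numpy as np

def average_score_fusion(*scores):
    """Each argument: (n_samples, n_classes) raw scores from one branch."""
    probs = [np.exp(s - s.max(axis=1, keepdims=True)) for s in scores]
    probs = [p / p.sum(axis=1, keepdims=True) for p in probs]  # softmax
    return np.mean(probs, axis=0).argmax(axis=1)               # fused labels
```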
Performance comparison on IsoGD with different fusion methods.
| Fusion Method | Accuracy (IsoGD) |
|---|---|
| Score Fusion (Depth) | 55.67% |
| Score Fusion (RGB) | 55.52% |
| Score Fusion (Depth+RGB) | 60.15% |
| BoW+SVM (Depth) | 54.91% |
| BoW+SVM (RGB) | 55.21% |
| BoW+SVM (Depth+RGB) | 58.93% |
| FV+SVM (Depth) | 55.61% |
| FV+SVM (RGB) | 55.72% |
| FV+SVM (Depth+RGB) | 60.23% |
| CCA+SVM (Depth) | 55.89% |
| CCA+SVM (RGB) | 56.23% |
| CCA+SVM (Depth+RGB) | 61.14% |
Comparison of the proposed method with other methods on the NTU RGB+D dataset. We report the accuracies using both the cross-subject and cross-view protocols.
| Methods | Modality | Cross Subject | Cross View |
|---|---|---|---|
| Lie Group | Skeleton | 50.08% | 52.76% |
| Dynamic Skeletons | Skeleton | 60.23% | 65.22% |
| HBRNN | Skeleton | 59.07% | 63.97% |
| Deep RNN | Skeleton | 56.29% | 64.09% |
| Part-aware LSTM | Skeleton | 62.93% | 70.27% |
| ST-LSTM + Trust Gate | Skeleton | 69.20% | 77.70% |
| JTM | Skeleton | 73.40% | 75.20% |
| JDM | Skeleton | 76.20% | 82.30% |
| Geometric Features | Skeleton | 70.26% | 82.39% |
| Clips+CNN+MTLN | Skeleton | 79.57% | 84.83% |
| View invariant | Skeleton | 80.03% | 87.21% |
| IndRNN | Skeleton | 81.80% | 87.97% |
| Pose Estimation Maps | RGB | 78.80% | 84.21% |
| Pose-based Attention | RGB+Skeleton | 82.50% | 88.60% |
| SI-MM | RGB+Skeleton | 85.12% | 92.82% |
| SSSCA-SSLM | RGB+Depth | 74.86% | - |
| Aggregation Networks | RGB+Depth | 86.42% | 89.08% |
| Proposed method | RGB | 86.46% | 88.54% |
| Proposed method | Depth | 87.73% | 87.37% |
| Proposed method | RGB+Depth | 89.51% | 91.68% |
Performance comparison of the two-stream, 3D CNN, CNN+RNN (ConvLSTM) and DI+CNN methods with the proposed method on the NTU RGB+D action dataset using the depth modality and the cross-subject protocol.
| Category | Two-Stream | 3D CNN | ConvLSTM | DI + CNN | Proposed Method |
|---|---|---|---|---|---|
|  | 72.5% | 71.8% | 85.5% | 85.0% | 90.03% |
|  | 84.7% | 84.2% | 88.1% | 77.2% | 91.76% |
|  | 74.1% | 68.1% | 84.6% | 71.4% | 85.73% |
|  | 73.4% | 67.8% | 61.8% | 71.4% | 83.32% |
Statistics of the ChaLearn LAP IsoGD dataset.
| Sets | Gestures | RGB Videos | Depth Videos | Subjects |
|---|---|---|---|---|
| Training | 35,878 | 35,878 | 35,878 | 17 |
| Validation | 5784 | 5784 | 5784 | 2 |
| Testing | 6271 | 6271 | 6271 | 2 |
| All | 47,933 | 47,933 | 47,933 | 21 |
Figure 3. Examples of image frames at the body level and the hand level. From top to bottom: body-level RGB images, hand-level RGB images, body-level depth images and hand-level depth images.
Comparison of the proposed method with other methods on the validation set and the test set of the ChaLearn LAP IsoGD dataset.
| Methods | Modality | Accuracy (Validation) | Accuracy (Testing) |
|---|---|---|---|
| MFSK | RGB+Depth | 18.65% | 24.19% |
| MFSK+DeepID | RGB+Depth | 18.23% | 23.67% |
| Scene Flow | RGB+Depth | 36.27% | - |
| Pyramidal C3D | RGB+Depth | 45.02% | 50.93% |
| 2SCVN+3DDSN | RGB+Depth | 49.17% | 67.26% |
| 32-frame C3D | RGB+Depth | 49.2% | 56.9% |
| C3D+ConvLSTM | RGB+Depth | 51.02% | - |
| C3D+ConvLSTM+Temporal Pooling | RGB+Depth | 58.00% | 62.14% |
| CNN+3D ConvLSTM | RGB+Depth | 60.81% | 65.59% |
| ResC3D | RGB+Depth | 64.40% | 67.71% |
| Proposed method (body level) | RGB+Depth | 61.14% | 66.43% |
| Proposed method (hand level) | RGB+Depth | 62.78% | 66.23% |
| Proposed method (score fusion: body level + hand level) | RGB+Depth | 64.61% | 68.13% |
Figure 4. The confusion matrix of the proposed method at the hand level on the ChaLearn LAP IsoGD dataset. Zoom in to see the details.
Figure 5. The confusion matrix of the proposed method at the body level on the ChaLearn LAP IsoGD dataset. Zoom in to see the details.
Figure 6. The confusion matrix of the proposed method for the fusion of the hand level and the body level on the ChaLearn LAP IsoGD dataset. Zoom in to see the details.
Comparison of the proposed method with other methods on the M2I dataset for the single-task scenario (training and testing in the same view; SV: side view, FV: front view).
| Methods | SV | FV |
|---|---|---|
| iDT-Tra (BoW) | 69.8% | 65.8% |
| iDT-COM (BoW) | 76.9% | 75.3% |
| iDT-COM (FV) | 80.7% | 79.5% |
| iDT-MBH (BoW) | 77.2% | 79.6% |
| SFAM | 89.4% | 91.2% |
| STSDDI | 90.1% | 92.1% |
| Proposed method | 100% | 100% |
Comparison of the proposed method with other methods on the M2I dataset for the cross-view scenario (SV–FV: training in the side view and testing in the front view; FV–SV: training in the front view and testing in the side view).
| Methods | SV–FV | FV–SV |
|---|---|---|
| iDT-Tra | 43.3% | 39.2% |
| iDT-COM | 70.2% | 67.7% |
| iDT-HOG + MBH | 75.8% | 72.8% |
| iDT-HOG + HOF | 78.2% | 72.1% |
| SFAM | 87.6% | 76.5% |
| STSDDI | 86.4% | 82.6% |
| Proposed method | 93.8% | 90.6% |