Chao Tang, Anyang Tong, Aihua Zheng, Hua Peng, Wei Li.
Abstract
Traditional human action recognition (HAR) methods are based on RGB video. With the introduction of Microsoft Kinect and other consumer-grade depth cameras, HAR based on RGB-D (RGB-Depth) data has drawn increasing attention from researchers and industry. Compared with traditional methods, RGB-D-based HAR offers higher accuracy and stronger robustness. This paper proposes a selective ensemble support vector machine that fuses multimodal features for human action recognition. The algorithm combines improved HOG features from the RGB modality, depth motion map-based local binary pattern features (DMM-LBP) from the depth modality, and hybrid joint features (HJF) from the joint modality. In addition, a frame-based selective ensemble support vector machine classification model (SESVM) is proposed, which integrates the selective ensemble strategy with the selection of SVM base classifiers and thereby increases the diversity among the base classifiers. Experimental results on public datasets show that the proposed method is simple, fast, and efficient in comparison with other action recognition algorithms.
Year: 2022 PMID: 35047028 PMCID: PMC8763533 DOI: 10.1155/2022/1877464
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. The proposed system configuration.
Figure 2. The algorithm flow of the HOG algorithm.
Figure 3. DMM-LBP feature extraction algorithm flow.
SESVM.
| Step |
|---|
| Input: training set … |
| Output: selected base classifier set {SVM1, … |
| (1) Initialize the base classifier set Θ = ∅ |
| (2) … |
| (3) Based on the training set … |
| (4) The base classifier SVM … |
| (5) … |
| (6) Selecting process: |
| (7) Each base classifier SVM … |
| (8) The selected base classifier set {SVM1, … is obtained by using CCCSA |
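The surviving steps describe building a pool of SVM base classifiers and keeping a selected subset. As a rough illustration of how such a subset could be combined at prediction time, the sketch below applies plain majority voting to per-classifier label predictions. Majority voting, the function name `majority_vote`, and the toy labels are assumptions made for illustration; the text here does not state which combination rule the authors use.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several selected base classifiers.

    predictions: list of lists, one inner list per classifier, holding that
    classifier's predicted label for every sample (hypothetical input format).
    Returns one label per sample. On a tie, the label that appears first
    among the votes wins.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every classifier must predict every sample")

    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Toy predictions from three hypothetical base classifiers on three samples
votes = [
    ["walk", "run", "jump"],
    ["walk", "walk", "jump"],
    ["run", "run", "kick"],
]
ensemble_labels = majority_vote(votes)
```

Here each sample receives the label chosen by at least two of the three classifiers.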
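Classifiers' relational table of classification of the samples.
| Relations | Second SVM correct (B = 1) | Second SVM incorrect (B = 0) |
|---|---|---|
| First SVM correct (A = 1) | N^{11} | N^{10} |
| First SVM incorrect (A = 0) | N^{01} | N^{00} |

Note. N^{AB} is the number of samples in the dataset that are classified correctly (A = 1) or incorrectly (A = 0) by the first SVM, and correctly (B = 1) or incorrectly (B = 0) by the second SVM.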
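The note describes four counts, one for each combination of correct or incorrect classification by two SVMs. A minimal sketch of how those counts could be tallied from per-sample correctness flags follows. The function names, the 0/1 input format, and the toy data are assumptions. The disagreement ratio computed at the end is a commonly used pairwise diversity statistic and is shown only as an example of what such counts enable; it is not necessarily the criterion used by CCCSA.

```python
def pairwise_counts(correct_a, correct_b):
    """Tally the four (A, B) counts for two classifiers.

    correct_a, correct_b: sequences of 0/1 flags, 1 when the classifier
    labelled that sample correctly. Returns a dict keyed by (A, B).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same samples")
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, b in zip(correct_a, correct_b):
        counts[(int(bool(a)), int(bool(b)))] += 1
    return counts


def disagreement(counts):
    """Fraction of samples on which exactly one classifier is correct."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    return (counts[(1, 0)] + counts[(0, 1)]) / total


# Toy correctness flags for two hypothetical SVMs on six samples
svm_a = [1, 1, 0, 1, 0, 1]
svm_b = [1, 0, 0, 1, 1, 1]
counts = pairwise_counts(svm_a, svm_b)
dis = disagreement(counts)
```

A higher disagreement value indicates that the two classifiers make their errors on different samples, which is the kind of difference a selective ensemble tries to exploit.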
CCCSA.
| Step |
|---|
| (1)-(5) … |
| (6) The error rates of … Err(0) ← ERR( … Err(2) ← ERR( … |
| (7) Min-err ← MIN(Err(0), Err(1), Err(2), Err(3)) |
| (8)-(13) … |
Figure 4. Sample images from the G3D dataset. (a) RGB image. (b) Depth image. (c) Skeleton joint image.
Figure 5. Sample images from the Cornell Activity Dataset 60. (a) Depth image. (b) RGB image. (c) Skeleton joint image.
Figure 6. The confusion matrix based on RGB-HOG features on the G3D dataset.
Figure 7. The confusion matrix based on DMM-LBP features on the G3D dataset.
Figure 8. The confusion matrix based on HJF on the G3D dataset.
Figure 9. The confusion matrix based on this paper's method on the G3D dataset.
Figure 10. The confusion matrix based on RGB-HOG features on the CAD60.
Figure 11. The confusion matrix based on DMM-LBP features on the CAD60.
Figure 12. The confusion matrix based on HJF on the CAD60.
Figure 13. The confusion matrix based on the method of this paper on the CAD60.
Recognition rate using the single modal feature and multimodal features.
| Dataset | Descriptor | Precision (%) |
|---|---|---|
| G3D | RGB-HOG | 83.7 |
| G3D | DMM-LBP | 83.2 |
| G3D | HJF | 83.7 |
| G3D | Mixed features | 91.7 |
| CAD60 | RGB-HOG | 85.3 |
| CAD60 | DMM-LBP | 86.0 |
| CAD60 | HJF | 88.6 |
| CAD60 | Mixed features | 91.8 |
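To make the benefit of feature fusion concrete, the short sketch below computes, for each dataset, the gain of the mixed features over the best single-modality descriptor. The numbers are copied from the table; the dictionary layout and the helper name `fusion_gain` are assumptions for illustration.

```python
# Precision (%) per dataset and descriptor, copied from the table
precision = {
    "G3D": {"RGB-HOG": 83.7, "DMM-LBP": 83.2, "HJF": 83.7, "Mixed features": 91.7},
    "CAD60": {"RGB-HOG": 85.3, "DMM-LBP": 86.0, "HJF": 88.6, "Mixed features": 91.8},
}


def fusion_gain(scores, fused_key="Mixed features"):
    """Return (best single descriptor, gain of fused features in points)."""
    single = {name: value for name, value in scores.items() if name != fused_key}
    # max() keeps the first descriptor when two are tied
    best_name = max(single, key=single.get)
    gain = round(scores[fused_key] - single[best_name], 1)
    return best_name, gain


gains = {dataset: fusion_gain(scores) for dataset, scores in precision.items()}
for dataset, (best_name, gain) in gains.items():
    print(f"{dataset}: +{gain} points over {best_name}")
```

On G3D the mixed features improve on the best single descriptor by 8.0 percentage points, and on CAD60 by 3.2 points over HJF.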
The comparison results between the proposed method and other machine learning methods.
| Dataset | Descriptor | SESVM (%) | Boosting (%) | Bagging (%) | SVM (%) | ANNs (%) |
|---|---|---|---|---|---|---|
| G3D | RGB-HOG | 83.7 | 83.2 | 75.2 | 80.2 | 74.2 |
| G3D | DMM-LBP | 83.2 | 82.4 | 79.4 | 83.4 | 80.2 |
| G3D | HJF | 83.7 | 84.0 | 80.5 | 83.6 | 74.4 |
| G3D | Mixed features | 91.7 | 89.5 | 83.2 | 87.7 | 82.4 |
| CAD60 | RGB-HOG | 85.3 | 87.3 | 80.0 | 82.0 | 76.2 |
| CAD60 | DMM-LBP | 86.0 | 87.4 | 79.2 | 84.3 | 80.5 |
| CAD60 | HJF | 88.6 | 89.2 | 80.2 | 84.4 | 74.2 |
| CAD60 | Mixed features | 91.8 | 89.1 | 83.3 | 87.6 | 82.2 |
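The comparison is easier to read once the top-scoring method in each row is identified. The sketch below does that with the values copied from the table; the data layout and the helper name `best_method` are assumptions for illustration.

```python
methods = ["SESVM", "Boosting", "Bagging", "SVM", "ANNs"]

# Scores (%) per (dataset, descriptor), in the same column order as `methods`
results = {
    ("G3D", "RGB-HOG"): [83.7, 83.2, 75.2, 80.2, 74.2],
    ("G3D", "DMM-LBP"): [83.2, 82.4, 79.4, 83.4, 80.2],
    ("G3D", "HJF"): [83.7, 84.0, 80.5, 83.6, 74.4],
    ("G3D", "Mixed features"): [91.7, 89.5, 83.2, 87.7, 82.4],
    ("CAD60", "RGB-HOG"): [85.3, 87.3, 80.0, 82.0, 76.2],
    ("CAD60", "DMM-LBP"): [86.0, 87.4, 79.2, 84.3, 80.5],
    ("CAD60", "HJF"): [88.6, 89.2, 80.2, 84.4, 74.2],
    ("CAD60", "Mixed features"): [91.8, 89.1, 83.3, 87.6, 82.2],
}


def best_method(scores):
    """Return the name of the method with the highest score in one row."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return methods[best_index]


winners = {key: best_method(scores) for key, scores in results.items()}
for (dataset, descriptor), name in winners.items():
    print(f"{dataset} / {descriptor}: {name}")
```

Read this way, SESVM leads on the mixed features for both datasets, while Boosting or a single SVM scores slightly higher on several single-descriptor rows.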
The comparison results between our approach and other methods.
| Researchers | Descriptor | Recognition methods | Accuracy on G3D (%) | Accuracy on CAD60 (%) |
|---|---|---|---|---|
| Dollár et al. [ | Sparse spatiotemporal features | SVM | 78 | 83 |
| Liu et al. [ | PMI spatiotemporal features | SVM | 82 | 86 |
| Laptev et al. [ | Spatiotemporal corner | SVM | 87 | 84 |
| Rapantzikos et al. [ | Dense saliency spatiotemporal features | KNN | 88 | 89 |
| Rodriguez et al. [ | Spatiotemporal template | Template matching | 88 | 89 |
| Our approach | Mixed features | SESVM | 91.7 | 91.8 |