Ming-Xing Lu, Guo-Zhen Du, Zhan-Fang Li.
Abstract
Gesture recognition with deep learning network models can automatically extract deep features from data, whereas traditional machine learning algorithms rely on manual feature extraction and suffer from poor model generalization. In this paper, a multimodal gesture recognition algorithm based on a convolutional long short-term memory network is proposed. First, a convolutional neural network (CNN) is employed to automatically extract the deeply hidden features of multimodal gesture data. Then, a time series model is constructed using a long short-term memory (LSTM) network to learn the long-term dependence of multimodal gesture features over the time series. On this basis, the classification of multimodal gestures is realized by a softmax classifier. Finally, the method is evaluated on two dynamic gesture datasets, VIVA and NVGesture. Experimental results indicate that the accuracy rates of the proposed method on the VIVA and NVGesture datasets are 92.55% and 87.38%, respectively, and that its recognition accuracy and convergence performance are better than those of the comparison algorithms.
Year: 2022 PMID: 35281195 PMCID: PMC8906951 DOI: 10.1155/2022/4068414
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. The structure of LSTM.
Figure 2. LSTM cell internal structure.
Figure 3. CLT-Net model structure.
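Below is a minimal PyTorch sketch of the CNN-to-LSTM-to-softmax pipeline the abstract describes. The class name CLTNetSketch, the layer sizes, and the channel counts are illustrative assumptions, not the paper's exact CLT-Net configuration.

```python
import torch
import torch.nn as nn

class CLTNetSketch(nn.Module):
    """Hypothetical CNN + LSTM + softmax classifier, per the abstract."""

    def __init__(self, in_channels=3, num_classes=25, hidden_size=256):
        super().__init__()
        # CNN: extracts deep spatial features from each frame independently.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1),  # -> (N, 64, 1, 1)
        )
        # LSTM: learns long-term temporal dependence across frame features.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size,
                            batch_first=True)
        # Linear head; softmax is applied at inference time, while training
        # uses cross entropy directly on the logits.
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.lstm(feats)      # hidden state at every timestep
        return self.fc(out[:, -1])     # classify from the last timestep

# Quick shape check: 4 clips of 8 frames at 32x32 resolution.
logits = CLTNetSketch()(torch.randn(4, 8, 3, 32, 32))
probs = torch.softmax(logits, dim=1)  # (4, 25) class probabilities
```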
Experimental parameter settings.
| Parameter | Setting |
|---|---|
| CNN layer weight initialization | Kaiming method |
| LSTM layer weight initialization | Orthogonal method |
| Fully connected layer weight initialization | Kaiming method |
| Optimizer | Adam |
| Loss function | Cross entropy |
| Initial learning rate | 0.001 |
| Sample sequence size | 24 × 410 |
| Number of training set samples | 20088 |
| Number of test set samples | 2232 |
| Number of training epochs | 20 |
| Batch size | 500 |
| Leaky ReLU negative slope | 0.1 |
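The table above fully specifies the training setup; the following sketch wires those settings together in PyTorch, assuming the CLTNetSketch class from the earlier sketch. The random placeholder data stands in for the real VIVA/NVGesture loaders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def init_weights(model: nn.Module):
    """Apply the initializations from the parameter table."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            # Kaiming init for CNN and fully connected layers
            # (a=0.1 matches the Leaky ReLU negative slope).
            nn.init.kaiming_normal_(m.weight, a=0.1,
                                    nonlinearity='leaky_relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.LSTM):
            # Orthogonal init for the LSTM weight matrices.
            for name, param in m.named_parameters():
                if 'weight' in name:
                    nn.init.orthogonal_(param)
                else:
                    nn.init.zeros_(param)

model = CLTNetSketch()  # hypothetical model from the earlier sketch
init_weights(model)
criterion = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial LR

# Placeholder data: random clips shaped (N, T, C, H, W); swap in the real
# VIVA/NVGesture loaders here.
dataset = TensorDataset(torch.randn(32, 8, 3, 32, 32),
                        torch.randint(0, 25, (32,)))
train_loader = DataLoader(dataset, batch_size=500, shuffle=True)

for epoch in range(20):  # 20 training epochs, per the table
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
```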
Figure 4. The comparison of convergence speed.
Accuracy results on the VIVA dataset.

| Method | Accuracy (%) |
|---|---|
| HOG + HOG2 | 65.51 |
| CNN : LRN | 75.41 |
| CNN : LRN : HRN | 78.51 |
| C3D | 78.41 |
| I3D | 84.11 |
| MTUT | 87.09 |
| 3D-Dense (a) | 89.22 |
| Res3D + TCNs (b) | 86.98 |
| T3D-Dense + TCNs (c) | 91.74 |
| Proposed | 92.55 |
Figure 5. The recognition confusion matrix of the VIVA dataset.
Accuracy results on the NVGesture dataset.

| Method | Fusion modalities | Accuracy (%) |
|---|---|---|
| HOG + HOG2 | RGB + depth | 37.91 |
| I3D | RGB + depth | 84.83 |
| MTUT | RGB + depth | 87.11 |
| Proposed | RGB + depth | 85.88 |
| 2S-CNNs | RGB + Opt. flow | 66.61 |
| iDT | RGB + Opt. flow | 74.41 |
| I3D | RGB + Opt. flow | 85.44 |
| MTUT | RGB + Opt. flow | 86.49 |
| Proposed | RGB + Opt. flow | 87.22 |
| R3DCNN | RGB + depth + Opt. flow | 84.81 |
| I3D | RGB + depth + Opt. flow | 86.69 |
| Proposed | RGB + depth + Opt. flow | 87.38 |
Figure 6. The recognition confusion matrix of the NVGesture dataset.
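Figures 5 and 6 report per-class confusion matrices. As a brief illustration, such a matrix can be accumulated from integer ground-truth and predicted labels as follows; this is a generic sketch, not the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Overall accuracy is the trace (correct predictions on the diagonal)
# divided by the total sample count.
cm = confusion_matrix([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)
accuracy = np.trace(cm) / cm.sum()  # -> 0.75
```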