Maochang Zhu, Sheng Bin, Gengxin Sun.
Abstract
The three-dimensional convolutional neural network (3DCNN) is an essential tool in motion recognition research. This paper optimizes the traditional three-dimensional convolutional network, introduces a self-attention mechanism, and proposes a new network model for analyzing and processing complex human motion videos. Average frame-skipping sampling with scaling, together with one-hot encoding, is used for data pre-processing to retain more features from limited data. The proposed design is a lightweight three-dimensional convolutional network combined with an attention mechanism, and it reduces the number of model parameters by more than 90%, to only about 1.7 million. A comparison of different models across different numbers of classes shows that the proposed model performs well in complex human motion video classification: its recognition rate is 1%-8% higher than that of the C3D model.
Year: 2022 PMID: 36120684 PMCID: PMC9481321 DOI: 10.1155/2022/4816549
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. 2D convolution (a) and 3D convolution (b) diagram.
Figure 2. The overall process of this method.
Figure 3. Data pre-processing.
Figure 4. One-hot encoding.
Figure 5. C3D network structure.
Figure 6. Activation function comparison. (a) ReLU(x) and (b) PReLU(x).
Figure 7. Action recognition architecture of this study.
Figure 8. Self-attention mechanism unit.
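The pre-processing named in the abstract (average frame-skipping sampling and one-hot encoding) can be illustrated with a minimal sketch. The source does not give the exact sampling rule, so evenly spaced frame indices are an assumption here, as is the target of 20 frames (taken from the third dimension of the input shape 32, 32, 20, 3). The function names and the three class names are hypothetical, and the spatial scaling to 32 × 32 is omitted because it needs an image library.

```python
def sample_frame_indices(num_frames, depth=20):
    """Pick `depth` frame indices spread evenly over a clip of `num_frames` frames.

    Assumption: "average frame skipping" means a constant average stride of
    num_frames / depth between sampled frames. Clips shorter than `depth`
    repeat some frames.
    """
    if num_frames <= 0 or depth <= 0:
        raise ValueError("num_frames and depth must be positive")
    step = num_frames / depth  # average skip between sampled frames
    return [min(int(i * step), num_frames - 1) for i in range(depth)]


def one_hot(label, classes):
    """Encode `label` as a 0/1 vector over the ordered list `classes`."""
    index = classes.index(label)  # raises ValueError for an unknown label
    return [1 if i == index else 0 for i in range(len(classes))]


# Hypothetical usage: a 100-frame clip and a three-class label set.
indices = sample_frame_indices(num_frames=100, depth=20)
vector = one_hot("Biking", ["Archery", "Biking", "Diving"])
```

Sampling by index before decoding keeps memory use low, since only the selected frames need to be read and resized.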
The network structure and parameters of this framework (20 class).
| Layers | Output shape | Parameters |
|---|---|---|
| Input layer | 32, 32, 20, 3 | 0 |
| conv3d | 32, 32, 20, 32 | 2624 |
| activation | 32, 32, 20, 32 | 655360 |
| conv3d_1 | 32, 32, 20, 32 | 27680 |
| activation_1 | 32, 32, 20, 32 | 0 |
| max_pooling3d | 10, 10, 6, 32 | 0 |
| Dropout | 10, 10, 6, 32 | 0 |
| conv3d_2 | 10, 10, 6, 64 | 55360 |
| activation_2 | 10, 10, 6, 64 | 0 |
| conv3d_3 | 10, 10, 6, 64 | 110656 |
| activation_3 | 10, 10, 6, 64 | 0 |
| max_pooling3d_1 | 3, 3, 2, 64 | 0 |
| dropout_1 | 3, 3, 2, 64 | 0 |
| time_distributed (flatten) | 3, 384 | 0 |
| self__attention | 3, 512 | 589824 |
| Dense | 3, 512 | 262656 |
| batch_normalization | 3, 512 | 2048 |
| dropout_2 | 3, 512 | 0 |
| global_average_pooling1d | 512 | 0 |
| dense_1 | 20 | 10260 |
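The parameter counts in the table can be checked with a short calculation. The sketch below rests on assumptions that are not confirmed by the source but do reproduce the listed values: 3 × 3 × 3 convolution kernels with bias, a first activation with one learnable slope per output element (as a PReLU layer would have), a self-attention layer made of three bias-free 384-to-512 projections, and four parameters per channel in batch normalization. The dictionary keys loosely follow the layer names in the table.

```python
def conv3d_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a 3D convolution."""
    kd, kh, kw = kernel
    return kd * kh * kw * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_units * out_units + out_units


KERNEL = (3, 3, 3)  # assumed kernel size; it matches the table's counts

layer_params = {
    "conv3d": conv3d_params(KERNEL, 3, 32),
    # Assumed: one learnable slope per element of the 32x32x20x32 output.
    "activation": 32 * 32 * 20 * 32,
    "conv3d_1": conv3d_params(KERNEL, 32, 32),
    "conv3d_2": conv3d_params(KERNEL, 32, 64),
    "conv3d_3": conv3d_params(KERNEL, 64, 64),
    # Assumed: three bias-free projections (query, key, value), 384 -> 512.
    "self_attention": 3 * 384 * 512,
    "dense": dense_params(512, 512),
    # Scale, shift, moving mean and moving variance for each of 512 channels.
    "batch_normalization": 4 * 512,
    "dense_1": dense_params(512, 20),
}

total_params = sum(layer_params.values())
print(f"total parameters: {total_params:,}")
```

Under these assumptions the layers sum to roughly 1.7 million parameters, in line with the figure quoted in the abstract for the 20-class model.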
Figure 9. Sports video dataset.
Figure 10. C3D model accuracy.
Figure 11. Proposed method accuracy.
Validation accuracy of the proposed method and the compared models on complex human motion classes of the UCF-101 dataset.
| Model | 10 class (%) | 20 class (%) | 30 class (%) |
|---|---|---|---|
| C3D | 82.2 | 84.7 | 83.5 |
| Lite-3DCNN | 85.3 | 80.2 | 70.6 |
| Lite-3DCNN-LSTM | 81.1 | 83.5 | 75.2 |
| Lite-3DCNN-BiLSTM | 84.5 | 85.3 | 79.5 |
| Proposed method | | | |
The trainable parameters (in millions) of the proposed method and other methods for the UCF-101 dataset.
| Model type | 10 class (M) | 20 class (M) | 30 class (M) |
|---|---|---|---|
| C3D | 52.87 | 61.30 | 61.34 |
| Lite-3DCNN | 1.609 | 1.616 | 1.621 |
| Lite-3DCNN-LSTM (512) | 3.120 | 3.122 | 3.135 |
| Lite-3DCNN-BiLSTM (512) | 5.219 | 5.224 | 5.229 |
| Proposed method | 1.712 | 1.716 | 1.884 |
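The abstract's claim that the parameter count falls by more than 90% can be checked against this table. The sketch below copies the C3D and proposed-method rows and computes the relative reduction for each class count; the variable and function names are illustrative only.

```python
# Trainable parameters in millions, copied from the table above.
trainable_params_m = {
    "C3D": {10: 52.87, 20: 61.30, 30: 61.34},
    "Proposed method": {10: 1.712, 20: 1.716, 30: 1.884},
}


def reduction_percent(baseline, model):
    """Relative reduction of `model` against `baseline`, in percent."""
    return 100.0 * (1.0 - model / baseline)


reductions = {
    n_classes: reduction_percent(
        trainable_params_m["C3D"][n_classes],
        trainable_params_m["Proposed method"][n_classes],
    )
    for n_classes in (10, 20, 30)
}

for n_classes, percent in reductions.items():
    print(f"{n_classes} classes: {percent:.1f}% fewer parameters than C3D")
```

For all three class counts the reduction relative to C3D is above 90%, which is consistent with the abstract.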