| Literature DB >> 35733557 |
Xin Xiong, Weidong Min, Qing Han, Qi Wang, Cheng Zha.
Abstract
Effective extraction and representation of action information are critical in action recognition. Most existing methods fail to recognize actions accurately because the proportion of high-activity action areas is not reinforced, leaving them vulnerable to interference from background changes, and because they rely on RGB flow alone or combined with optical flow. A novel recognition method using action sequences optimization and a two-stream fusion network with different modalities is proposed to solve these problems. The method is based on shot segmentation and dynamic weighted sampling: it reconstructs the video by reinforcing the proportion of high-activity action areas, eliminating redundant intervals, and extracting long-range temporal information. A two-stream 3D dilated neural network that integrates RGB and human skeleton features is also proposed. The skeleton information strengthens the deep representation of humans for robust processing, alleviating the interference of background changes, and the dilated CNN enlarges the receptive field of feature extraction. Compared with existing approaches, the proposed method achieves superior or comparable classification accuracies on the benchmark datasets UCF101 and HMDB51.
Year: 2022 PMID: 35733557 PMCID: PMC9208928 DOI: 10.1155/2022/6608448
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Overview of the proposed method. The optimized action sequences module reconstructs the input video to increase the ratio of action features. The network fuses the advantages of two modalities and enlarges the receptive field of action features.
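Figure 1 itself is not reproduced in this record. As a reading aid, the sketch below outlines the two-stage pipeline the caption describes; every name in it (`reconstruct_video`, `segment_shots`, `sample_shot`, `extract_skeletons`, and the averaging fusion) is a hypothetical stand-in rather than the authors' code.

```python
from typing import Callable, List, Sequence

def reconstruct_video(frames: Sequence,
                      segment_shots: Callable[[Sequence], List[Sequence]],
                      sample_shot: Callable[[Sequence], List]) -> List:
    """Stage 1: shot segmentation + dynamic weighted sampling.

    High-activity segments keep more frames and low-activity segments
    fewer, so the reconstructed sequence reinforces action areas and
    drops redundant intervals (see Algorithm 1 below).
    """
    reconstructed = []
    for shot in segment_shots(frames):
        reconstructed.extend(sample_shot(shot))
    return reconstructed

def classify(frames, rgb_stream, skeleton_stream, extract_skeletons):
    """Stage 2: two-stream 3D dilated network with score fusion.

    `rgb_stream` and `skeleton_stream` map a clip to per-class scores;
    `extract_skeletons` is a pose estimator producing the skeleton flow.
    """
    rgb_scores = rgb_stream(frames)
    skel_scores = skeleton_stream(extract_skeletons(frames))
    # Late fusion of the two modalities (the fusion rule is assumed here,
    # not specified in this record).
    return [(a + b) / 2 for a, b in zip(rgb_scores, skel_scores)]
```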
Figure 2. Dynamic weighted sampling. Within one shot, different sampling strategies yield different reconstructed videos.
Accuracy comparison of different sampling rates in Situation 1 (%).
| Sampling rate | UCF101 | HMDB51 |
|---|---|---|
| 1/8 | 69.65 | 51.13 |
| 1/4 | 89.29 | 66.88 |
| 1/2 | 93.93 | 72.05 |
| 1 | 69.65 | 51.13 |
Accuracy comparison of different sampling rates in Situation 2 (%).
| Sampling rate (Seg1, Seg2) | UCF101 | HMDB51 |
|---|---|---|
| 1/8, 1/2 | 92.13 | 70.79 |
| 1/4, 1/2 | 95.17 | 75.36 |
| 1/2, 1/2 | 89.45 | 68.52 |
| 1, 1/2 | 92.99 | 72.86 |
Accuracy comparison of different sampling rates in Situation 3 (%).
| Sampling rate (Seg1, Seg2, Seg3) | UCF101 | HMDB51 |
|---|---|---|
| 1/4, 1/2, 1/8 | 93.84 | 73.88 |
| 1/4, 1/2, 1/4 | 95.85 | 75.93 |
| 1/4, 1/2, 1/2 | 92.60 | 69.27 |
| 1/4, 1/2, 1 | 92.38 | 68.57 |
Algorithm 1. Proposed action sequences optimization algorithm.
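The listing of Algorithm 1 is not included in this record. A minimal Python sketch of per-segment weighted sampling, consistent with the Situation tables above, might look as follows; the equal-width segment splitting, the round-based step size, and the example rates are assumptions, not the paper's exact procedure.

```python
from typing import List, Sequence

def dynamic_weighted_sample(shot: Sequence, rates: List[float]) -> List:
    """Resample one shot: split it into len(rates) equal segments and
    keep every (1/rate)-th frame of each segment.

    With rates like (1/4, 1/2, 1/4) -- the best-performing setting in
    the Situation 3 table above -- the middle, high-activity segment
    keeps a larger share of frames than the boundary segments.
    """
    n_seg = len(rates)
    seg_len = len(shot) // n_seg
    kept = []
    for i, rate in enumerate(rates):
        start = i * seg_len
        end = start + seg_len if i < n_seg - 1 else len(shot)
        step = max(1, round(1 / rate))  # rate 1/4 -> keep every 4th frame
        kept.extend(shot[start:end:step])
    return kept

# Example: a 24-frame shot sampled at (1/4, 1/2, 1/4)
frames = list(range(24))
print(dynamic_weighted_sample(frames, [0.25, 0.5, 0.25]))
# -> [0, 4, 8, 10, 12, 14, 16, 20]  (middle segment kept denser)
```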
Figure 3. Comparison of different modalities. RGB and optical flow mix background-change information with action information. The skeleton flow contains only the human action information, which strengthens the deep representation of humans for robust processing.
Figure 4. 3D dilated convolution operation.
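The operation in Figure 4 corresponds to a standard dilated 3D convolution. A short PyTorch illustration (the framework choice is an assumption; this record does not state the authors' tooling) shows how dilation enlarges the receptive field at no parameter cost:

```python
import torch
import torch.nn as nn

# A 3x3x3 kernel with dilation 2 covers a 5x5x5 neighborhood:
# effective kernel size = k + (k - 1) * (d - 1) = 3 + 2 * 1 = 5,
# enlarging the receptive field without adding parameters.
conv_plain   = nn.Conv3d(3, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv3d(3, 64, kernel_size=3, padding=2, dilation=2)

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
print(conv_plain(clip).shape)    # torch.Size([1, 64, 16, 112, 112])
print(conv_dilated(clip).shape)  # same shape, larger receptive field
```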
Figure 5. Structure of the two-stream 3D dilated network.
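Figure 5 is described only by its caption here. The following PyTorch sketch wires up one plausible two-stream arrangement with score-level fusion; the channel counts, single-layer depth, skeleton-map input format, and averaging fusion rule are all illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class Stream3D(nn.Module):
    """One branch: a toy stand-in for the paper's 3D dilated CNN."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),   # global pooling over time and space
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class TwoStreamFusion(nn.Module):
    """RGB stream + skeleton stream, fused at the score level."""
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.rgb = Stream3D(3, num_classes)       # RGB clips
        self.skeleton = Stream3D(1, num_classes)  # rendered skeleton maps

    def forward(self, rgb_clip, skeleton_clip):
        # Averaging the two score vectors is one simple fusion rule;
        # the paper's exact fusion strategy is not given in this record.
        return (self.rgb(rgb_clip) + self.skeleton(skeleton_clip)) / 2

model = TwoStreamFusion()
rgb = torch.randn(2, 3, 16, 112, 112)
skel = torch.randn(2, 1, 16, 112, 112)
print(model(rgb, skel).shape)   # torch.Size([2, 101])
```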
Accuracy evaluation of the action sequences optimization method (%).
| Input | UCF101 | HMDB51 |
|---|---|---|
| The original video | 91.13 | 66.48 |
| Reconstructed action video | | |
Comparison of training time for the proposed method (hours).
| Input | UCF101 | HMDB51 |
|---|---|---|
| The original video | 20.5 | 18 |
| Reconstructed action video | | |
Evaluation of the performance of different modalities (%).
| Modality | UCF101 | HMDB51 |
|---|---|---|
| RGB flow + 3D dilated only | 89.15 | 66.09 |
| Skeleton flow + 3D dilated only | 68.84 | 43.62 |
| Two-stream fusion network | | |
Accuracy comparison of different methods (%).
| Method | UCF101 | HMDB51 |
|---|---|---|
| Peng et al. [ ] | 87.9 | 61.1 |
| Zhao et al. [ ] | 89.1 | 65.1 |
| Tran et al. [ ] | 85.3 | 62.3 |
| Tu et al. [ ] | 94.5 | 69.8 |
| Tu et al. [ ] | 94.8 | 70.4 |
| Zhao et al. [ ] | 92.5 | — |
| Wang et al. [ ] | 92.4 | 62.0 |
| Feichtenhofer et al. [ ] | 92.5 | 65.4 |
| Qiu et al. [ ] | 93.7 | 66.3 |
| Wang et al. [ ] | 92.4 | 70.5 |
| Lu et al. [ ] | 90.4 | 65.0 |
| Hara et al. [ ] | 90.7 | 63.8 |
| Cong et al. [ ] | 91.8 | 68.8 |
| Wang et al. [ ] | 84.0 | 55.1 |
| Sun et al. [ ] | 91.9 | 70.0 |
| Huang et al. [ ] | 92.6 | 69.1 |
| Yao et al. [ ] | 92.1 | 65.9 |
| Liu et al. [ ] | 92.5 | 62.4 |
| Hao et al. [ ] | 93.7 | 66.7 |
| Tong et al. [ ] | 94.6 | 69.4 |
| Li et al. [ ] | 91.5 | 63.0 |
| Peng et al. [ ] | 94.0 | 68.7 |
| Long et al. [ ] | 94.6 | 69.2 |
| Wang et al. [ ] | 94.9 | 70.2 |
| Wu et al. [ ] | 94.3 | 70.9 |
| Li et al. [ ] | 94.5 | 70.2 |
| Cai and Hu [ ] | 91.0 | 64.7 |
| Cai and Hu [ ] | 92.5 | 66.5 |
| Li et al. [ ] | 86.7 | — |
| Xu et al. [ ] | | |
| Jiang et al. [ ] | 94.6 | 70.7 |
| Yang and Zou [ ] | 92.7 | — |
| Chang et al. [ ] | 93.8 | — |
| Deng et al. [ ] | 95.3 | 71.3 |
| Wang et al. [ ] | 94.5 | 74.1 |
| Proposed method | 95.6 | 75.3 |
Accuracy comparison of different methods on the Kinetics dataset (%).
| Method | Top-1 | Top-5 |
|---|---|---|
| Tran et al. [ ] | 56.1 | 79.5 |
| Feichtenhofer et al. [ ] | 56.0 | 77.3 |
| Donahue et al. [ ] | 57.0 | 79.0 |
| Wang et al. [ ] | 69.1 | 83.7 |
| Zolfaghari et al. [ ] | 68.0 | 80.9 |
| Jiang et al. [ ] | 73.1 | 90.6 |
| Proposed method | 69.6 | 87.1 |
Accuracy comparison of different networks (%).
| Network | UCF101 | HMDB51 |
|---|---|---|
| Tran et al. [ ] | 85.3 | 62.3 |
| Tran et al. [ ] | 90.2 | 68.5 |
| Proposed method | 95.6 | 75.3 |