| Literature DB >> 32492842 |
Jianyu Chen1, Jun Kong2,3, Hui Sun2, Hui Xu1, Xiaoli Liu4, Yinghua Lu2, Caixia Zheng1,3.
Abstract
Action recognition is a significant and challenging topic in the fields of sensors and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are two mainstream deep learning architectures for video action recognition. To combine them into one framework and further improve performance, we propose a novel deep network, named the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP possesses three advantages. First, the STINP consists of two branches constructed on residual networks (ResNets) to simultaneously learn the spatial and temporal information of the video. Second, the STINP integrates the pseudo3D block into the residual units of the spatial branch, which ensures that the spatial branch not only learns the appearance features of the objects and scenes in the video but also captures the potential interaction information among consecutive frames. Finally, the STINP adopts a simple but effective multiplication operation to fuse the spatial and temporal branches, which guarantees that the learned spatial and temporal representations interact with each other throughout the training of the STINP. Experiments were conducted on two classic action recognition datasets, UCF101 and HMDB51. The experimental results show that our proposed STINP provides better performance for video action recognition than other state-of-the-art algorithms.
Keywords: pseudo3D architecture; spatiotemporal representation learning; two-branches network; video action recognition
Year: 2020 PMID: 32492842 PMCID: PMC7308980 DOI: 10.3390/s20113126
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
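The abstract describes fusing the two branches by elementwise multiplication, so that each branch's gradients depend on the other's activations. A minimal sketch of that fusion step follows; the feature-map shapes and random values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical same-shaped feature maps from the two branches:
# (channels, height, width). Shapes are illustrative only.
rng = np.random.default_rng(0)
spatial_feat = rng.standard_normal((64, 7, 7))
temporal_feat = rng.standard_normal((64, 7, 7))

def multiplicative_fusion(s, t):
    """Fuse two same-shaped feature maps by elementwise multiplication,
    so the gradient of each branch is scaled by the other's activations."""
    assert s.shape == t.shape
    return s * t

fused = multiplicative_fusion(spatial_feat, temporal_feat)
print(fused.shape)  # (64, 7, 7)
```

The key property of multiplicative (rather than additive) fusion is that backpropagation through the product couples the two branches: the spatial gradient is weighted by the temporal activations and vice versa.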
Figure 1. The structure of the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP consists of two branches, the spatial branch and the temporal branch. The spatial branch aims to obtain the features of the scenes and objects in the individual frames of the video; the green arrows indicate where the pseudo3D structure is introduced to extract the interactive relationships among consecutive frames. The temporal branch employs optical flow frames as input to obtain the dynamic information of the video.
Figure 2. The different structures of the spatial branch developed for the STINP: (a) the spatial branch in STINP-1, and (b) the spatial branch in STINP-2. The yellow blocks represent the 2D convolutional filter, and the blue blocks represent the 1D convolutional filter.
Figure 3. The structure of the temporal branch of the STINP. The yellow block denotes the 2D spatial convolutional filter, and the blue block represents the 1D temporal convolutional filter.
Figure 4. The structure of the proposed STINP: (a) STINP-1 and (b) STINP-2.
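Figures 2 and 3 depict the pseudo3D idea: a full 3D filter is factorized into a 2D spatial filter followed by a 1D temporal filter. The parameter saving of this factorization can be sketched with simple arithmetic; the channel counts below are illustrative assumptions, not values from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    # Parameter count of a 3D convolution (bias terms omitted for clarity).
    return c_in * c_out * kt * kh * kw

c_in = c_out = 64  # hypothetical channel counts for illustration

full_3d     = conv3d_params(c_in, c_out, 3, 3, 3)   # one 3 x 3 x 3 filter bank
spatial_2d  = conv3d_params(c_in, c_out, 1, 3, 3)   # 3 x 3 x 1 spatial filter
temporal_1d = conv3d_params(c_out, c_out, 3, 1, 1)  # 1 x 1 x 3 temporal filter
pseudo_3d   = spatial_2d + temporal_1d

print(full_3d, pseudo_3d)  # 110592 49152
```

Under these assumptions the factorized pseudo3D pair uses fewer than half the parameters of the full 3D filter (9 + 3 = 12 weights per channel pair versus 27), which is the usual motivation for this decomposition.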
Figure 5. Examples of videos from the UCF101 dataset.
Figure 6. Examples of videos from the HMDB51 dataset.
The detailed architecture of convolutional blocks in our proposed STINP.
| Layer Name | Blocks |
|---|---|
| conv1 | 7 × 7 × 1, 64 |
| pool1 | 3 × 3 × 1 max, stride 2 |
| conv2_i | |
| conv3_i | |
| conv4_i | |
| conv5_i | |
| pool5 | 7 × 7 × 1 average |
Comparison results (Top-1) of STINP-1 and STINP-2 using ResNet-50 for the spatial branch and ResNet-50 for the temporal branch.
| Model | Branch | UCF101 | HMDB51 |
|---|---|---|---|
| STINP-1 | Spatial branch-1 | 84.00% | 53.20% |
| STINP-1 | Temporal branch | 86.00% | 62.10% |
| STINP-1 | Fusion | 93.40% | 66.70% |
| STINP-2 | Spatial branch-2 | 83.20% | 53.00% |
| STINP-2 | Temporal branch | 86.00% | 62.10% |
| STINP-2 | Fusion | 93.00% | 67.10% |
Comparison results (Top-1) of STINP-1 and STINP-2 using ResNet-50 for the spatial branch and ResNet-152 for the temporal branch.
| Model | Branch | UCF101 | HMDB51 |
|---|---|---|---|
| STINP-1 | Spatial branch-1 | 89.80% | 61.60% |
| STINP-1 | Temporal branch | 86.40% | 60.80% |
| STINP-1 | Fusion | 94.40% | 69.60% |
| STINP-2 | Spatial branch-2 | 87.50% | 59.00% |
| STINP-2 | Temporal branch | 86.60% | 60.20% |
| STINP-2 | Fusion | 94.00% | 69.00% |
Comparison results (Top-1) of STINP-1 and STINP-2 using ResNet-152 for the spatial branch and ResNet-50 for the temporal branch.
| Model | Branch | UCF101 | HMDB51 |
|---|---|---|---|
| STINP-1 | Spatial branch-1 | 86.30% | 54.10% |
| STINP-1 | Temporal branch | 85.00% | 61.80% |
| STINP-1 | Fusion | 93.60% | 68.70% |
| STINP-2 | Spatial branch-2 | 85.80% | 53.80% |
| STINP-2 | Temporal branch | 85.00% | 61.20% |
| STINP-2 | Fusion | 93.50% | 68.50% |
Comparison results (Top-1) of STINP-1 and STINP-2 using ResNet-152 for the spatial branch and ResNet-152 for the temporal branch.
| Model | Branch | UCF101 | HMDB51 |
|---|---|---|---|
| STINP-1 | Spatial branch-1 | 85.80% | 56.60% |
| STINP-1 | Temporal branch | 86.10% | 60.00% |
| STINP-1 | Fusion | 93.70% | 67.80% |
| STINP-2 | Spatial branch-2 | 86.20% | 55.80% |
| STINP-2 | Temporal branch | 84.50% | 58.80% |
| STINP-2 | Fusion | 93.30% | 68.00% |
Comparison results (Top-5) of STINP-1 and STINP-2 using ResNet-50 for the spatial branch and ResNet-152 for the temporal branch.
| Model | UCF101 | HMDB51 |
|---|---|---|
| STINP-1 | 99.50% | 91.60% |
| STINP-2 | 98.80% | 91.00% |
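The tables above report Top-1 and Top-5 accuracy. As a reminder of what these metrics measure, here is a minimal sketch of Top-k accuracy; the class scores and labels below are toy values for illustration, not data from the experiments.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# Toy scores over 4 classes for 3 video clips (illustrative only).
scores = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.1, 0.3, 0.1],
          [0.2, 0.2, 0.5, 0.1]]
labels = [1, 2, 2]

print(topk_accuracy(scores, labels, 1))  # Top-1 on this toy data: 2/3
print(topk_accuracy(scores, labels, 5))  # k >= number of classes, so 1.0
```

Top-5 accuracy is necessarily at least as high as Top-1, which is why the Top-5 numbers in the table approach 100%.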
Comparison of the proposed STINP and the other methods.
| Methods | UCF101 | HMDB51 |
|---|---|---|
| IDT | 86.40% | 61.70% |
| Spatiotemporal ConvNet | 65.40% | — |
| Long-term Recurrent ConvNet | 82.90% | — |
| Composite LSTM Model | 84.30% | 44.00% |
| Two-Stream ConvNet | 88.00% | 59.40% |
| P3D ResNets (without IDT) | 88.60% | — |
| Two-Stream + LSTM | 88.60% | — |
| C3D | 85.20% | — |
| Res3D | 85.80% | 54.90% |
| Dynamic Image Networks | 76.90% | 42.80% |
| Dynamic Image Networks + IDT | 89.10% | 65.20% |
| Asymmetric 3D-CNN (RGB + RGBF + IDT) | 92.60% | 65.40% |
| T3D | 93.20% | 63.50% |
| TDD + IDT | 91.50% | 65.90% |
| Conv Fusion (without IDT) | 92.50% | 65.40% |
| Transformations | 92.40% | 62.00% |
| VideoLSTM + IDT | 92.20% | 64.90% |
| Hierarchical Attention Networks | 92.70% | 64.30% |
| Spatiotemporal Multiplier ConvNet | 94.20% | 68.90% |
| Sequential Learning Framework | 90.90% | 65.70% |
| T-ResNets (without IDT) | 93.90% | 67.20% |
| TSN (2 modalities) | 94.00% | 68.50% |
| Spatiotemporal Heterogeneous Two-Stream Network | 94.40% | 67.20% |
| Our proposed STINP | 94.40% | 69.60% |
Abbreviations: IDT, Improved Dense Trajectory; ConvNet, Convolutional Network; LSTM, Long Short-Term Memory; P3D ResNets, Pseudo-3D Residual Networks; C3D, Convolutional 3D; Res3D, 3D Residual Convolutional Network; 3D-CNN, 3D Convolutional Neural Network; T3D, Temporal 3D Convolutional Network; TDD, Trajectory-pooled Deep-convolutional Descriptors; Conv Fusion, Convolutional Two-Stream Network Fusion; T-ResNets, Temporal Residual Networks; TSN, Temporal Segment Networks.