Viet-Tuan Le, Kiet Tran-Trung, Vinh Truong Hoang
Abstract
Human action recognition is an important field in computer vision that has attracted remarkable attention from researchers. This survey provides a comprehensive overview of recent deep learning approaches to human action recognition from RGB video data. We divide recent deep learning-based methods into five categories to give researchers interested in this field a structured overview. Moreover, pure-transformer (convolution-free) architectures have recently outperformed their convolutional counterparts in many areas of computer vision, so we also review recent convolution-free methods, which replace convolutional networks with transformer networks and have achieved state-of-the-art results on many human action recognition datasets. First, we discuss methods based on 2D convolutional neural networks. Then, we discuss methods based on recurrent neural networks, which are used to capture motion information. Next, we cover methods based on 3D convolutional neural networks, used in many recent approaches to capture both spatial and temporal information in videos, as well as multistream approaches, whose separate streams encode different features for long action videos. We also compare the performance of recently proposed methods on four popular benchmark datasets and review 26 benchmark datasets for human action recognition. Finally, we discuss some potential research directions to conclude this survey.
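To make the survey's core architectural distinction concrete, here is a minimal PyTorch sketch (ours, not from the paper; the clip size and channel counts are illustrative assumptions). It contrasts a 2D convolution, which processes each frame independently, with a 3D convolution, whose kernel spans neighbouring frames and therefore captures motion directly:

```python
# Minimal sketch (not from the survey): how 2D vs. 3D convolutions
# consume an RGB video clip. Shapes and channel counts are illustrative.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)

# 2D CNN route: fold frames into the batch, convolve each frame
# independently, then average over time (spatial features only).
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
b, c, t, h, w = clip.shape
frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
feat2d = conv2d(frames).reshape(b, t, 64, h, w).mean(dim=1)  # frame order discarded

# 3D CNN route: one kernel spans space *and* time, so motion across
# neighbouring frames is encoded directly in the features.
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
feat3d = conv3d(clip)  # (1, 64, 16, 112, 112): spatio-temporal features

print(feat2d.shape, feat3d.shape)
```

The 2D route discards frame order after pooling, which is why the surveyed 2D CNN approaches are often paired with recurrent networks to capture motion information.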
Year: 2022 PMID: 35498187 PMCID: PMC9045967 DOI: 10.1155/2022/8323962
Source DB: PubMed Journal: Comput Intell Neurosci
Summary of the related survey articles.
| Survey | Year | Handcrafted | 2D CNN | RNN | 3D single-stream | 3D multistream | Convolution-free | Datasets | Contributions |
|---|---|---|---|---|---|---|---|---|---|
| [ ] | 2010 | ✓ |  |  |  |  |  | ✓ |  |
| [ ] | 2019 | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ |  |
| [ ] | 2019 | ✓ |  |  |  |  |  |  |  |
| [ ] | 2019 | ✓ | ✓ | ✓ | ✓ |  |  |  |  |
| [ ] | 2020 | ✓ | ✓ | ✓ | ✓ | ✓ |  |  |  |
| [ ] | 2020 | ✓ |  |  |  |  |  |  |  |
| [ ] | 2020 | ✓ | ✓ |  |  |  |  |  |  |
| [ ] | 2021 | ✓ |  |  |  |  |  |  |  |
| [ ] | 2021 | ✓ | ✓ | ✓ | ✓ |  |  |  |  |
| Our | 2021 |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | (i) We categorise five aspects of deep learning methods for human action recognition |
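The "Convolution-free" column above refers to pure-transformer video models. As a rough illustration (our sketch, not any specific surveyed architecture; the patch sizes, embedding width, and layer count are assumptions), a clip can be cut into non-overlapping spatio-temporal "tubelets" in the style of models such as ViViT, linearly projected, and fed to a standard transformer encoder with no convolution anywhere:

```python
# Hedged sketch of convolution-free video modelling: tubelet tokenisation
# plus a plain transformer encoder. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 3, 16, 224, 224
clip = torch.randn(B, C, T, H, W)

# Split the clip into non-overlapping 2x16x16 spatio-temporal patches
# ("tubelets") and project each one linearly -- no convolutions involved.
pt, ph, pw = 2, 16, 16
patches = clip.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
patches = patches.permute(0, 2, 4, 6, 1, 3, 5, 7).flatten(4)  # (B, 8, 14, 14, 1536)
tokens = patches.flatten(1, 3)                                # (B, 1568, 1536)

embed = nn.Linear(C * pt * ph * pw, 768)  # linear patch projection
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
out = encoder(embed(tokens))  # self-attention jointly over space and time
print(out.shape)              # (1, 1568, 768)
```

Because every token attends to every other token across both space and time, such models trade the local inductive bias of convolution for global context, which is one reason they benefit from large-scale pretraining.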
Figure 1. The structure of the survey.
Figure 2. 3D single-stream network.
Figure 3. 3D two-stream network.
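The two-stream design of Figure 3 can be summarised with a toy sketch (ours; the surveyed models use far deeper backbones, and num_classes=101 is chosen only to match UCF101): an RGB stream encodes appearance, an optical-flow stream encodes motion, and class scores are fused late by averaging:

```python
# Toy two-stream 3D CNN with late fusion (illustrative, not a surveyed model).
import torch
import torch.nn as nn

def stream(in_channels: int, num_classes: int = 101) -> nn.Sequential:
    """One shallow 3D-CNN stream ending in class logits."""
    return nn.Sequential(
        nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(32, num_classes),
    )

rgb_stream = stream(in_channels=3)   # appearance stream
flow_stream = stream(in_channels=2)  # motion stream (x/y flow components)

rgb = torch.randn(1, 3, 16, 112, 112)
flow = torch.randn(1, 2, 16, 112, 112)
logits = (rgb_stream(rgb) + flow_stream(flow)) / 2  # late fusion of scores
print(logits.shape)  # (1, 101)
```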
Accuracy (%) of different methods on the UCF101 and HMDB51 datasets.
| Reference | Year | Method | UCF101 | HMDB51 |
|---|---|---|---|---|
| Carreira and Zisserman [ ] | 2017 | Two-stream I3D | 98.00 | 80.90 |
| Zhou et al. [ ] | 2018 | Mixed 3D CNNs, 2D CNNs | 94.70 | 70.50 |
| Zhang et al. [ ] | 2019 | SoSR + ToSR (TSN [ ]) | 92.13 | 68.30 |
| Ge et al. [ ] | 2019 | Attention + ConvLSTM | 92.39 | 66.37 |
| Pan et al. [ ] | 2019 | TR-LSTM (Inception-V3 [ ]) | 93.80 | 63.80 |
| Liu et al. [ ] | 2019 | R-STAN (ResNet101 [ ]) | 94.50 | 68.70 |
| Wang et al. [ ] | 2019 | I3D, LSTM | 95.10 | — |
| Lin et al. [ ] | 2019 | TSM (TSN [ ]) | 95.90 | 73.50 |
| Jiang et al. [ ] | 2019 | STM (CSTM, CMM, TSN [ ]) | 96.20 | 72.20 |
| Chi et al. [ ] | 2019 | CMA (attention) | 96.50 | — |
| Zhang et al. [ ] | 2019 | CSN (TSN [ ]) | 97.40 |  |
| Hong et al. [ ] | 2019 | I3D [ ] | 98.02 | 80.92 |
| Crasto et al. [ ] | 2019 | MARS + RGB + Flow | 98.10 | 80.90 |
| Qiu et al. [ ] | 2019 | LGD-3D two-stream | 98.20 | 80.50 |
| Piergiovanni and Ryoo [ ] | 2019 | Fully differentiable convolutional layer | — | 81.10 |
| Kwon et al. [ ] | 2020 | MSNet (ResNet50 [ ]) | — | 77.40 |
| Liu et al. [ ] | 2020 | STS + attention LSTM | 92.70 | 64.40 |
| Majd and Safabakhsh [ ] | 2020 | C²LSTM | 92.80 | 61.30 |
| Huang and Bors [ ] | 2020 | TSN (squeeze-and-excitation operation) | 95.20 | 71.50 |
| Li et al. [ ] | 2020 | Attention (ResNeXt-101 [ ]) | 95.90 | 72.20 |
| Zhou et al. [ ] | 2020 | Probability space | 96.50 | — |
| Zhu et al. [ ] | 2020 | FAST-GRU | 96.90 | 75.70 |
| Diba et al. [ ] | 2020 | HATNet (2D ResNet50, 3D ResNet18) | 97.80 | 76.50 |
| Zhang et al. [ ] | 2020 | PANet (ResNet101 [ ]) | 97.20 | 77.30 |
| Duan et al. [ ] | 2020 | 2D network (ResNet50 [ ]) | 97.52 | 79.02 |
| Stroud et al. [ ] | 2020 | D3D (S3D-G [ ]) | 97.60 | 80.50 |
| Li et al. [ ] | 2020 | CIDC (ResNet50 [ ]) | 97.90 | 75.20 |
| Li et al. [ ] | 2020 | PoseNet, ResNet50 (3D), multiteacher network | 98.20 | 82.00 |
| Gowda et al. [ ] | 2020 | MobileNet, MLP, LSTM | 98.60 | 84.30 |
| Kalfaoglu et al. [ ] | 2020 | BERT, 3D convolution architecture | **98.69** | **85.10** |
| Akbari et al. [ ] | 2021 | VATT | 89.60 | 65.20 |
| Xu et al. [ ] | 2021 | MotionNet [ ] | 91.50 | 67.90 |
| Li et al. [ ] | 2021 | VidTr (MSA, topK-based pooling) | 96.70 | 74.40 |
| Sharir et al. [ ] | 2021 | STAM (spatial and temporal attention) | 97.00 | — |
| He et al. [ ] | 2021 | DB-LSTM (I3D [ ]) | 97.30 | 81.20 |
| Huang and Bors [ ] | 2021 | FineCoarse (TSM R50 [ ]) | 97.60 | 77.60 |
| Hua et al. [ ] | 2021 | SCN (Mask R-CNN [ ]) | 98.30 |  |
| Sheth [ ] | 2021 | Three-stream network + LSTM/Attention |  | — |
Bold represents the best performance.
Accuracy (%) of different methods on the Something-Something-V1 and Something-Something-V2 datasets.
| Reference | Year | Method | Something-V1 Top-1 | Something-V1 Top-5 | Something-V2 Top-1 | Something-V2 Top-5 |
|---|---|---|---|---|---|---|
| Zhou et al. [ ] | 2018 | TRN (2-stream TRN) | 42.01 | — | 55.52 | 83.06 |
| Jiang et al. [ ] | 2019 | STM (CSTM, CMM, TSN [ ]) | 50.70 | 80.40 | 64.20 | 89.80 |
| Lin et al. [ ] | 2019 | TSM (TSN [ ]) | 52.60 | 81.90 | 66.00 | 90.50 |
| Tran et al. [ ] | 2019 | CSN (ResNet3D [ ]) | 53.30 | — | — | — |
| Martinez et al. [ ] | 2019 | 2D TSN [ ] |  | 81.80 | — | — |
| Li et al. [ ] | 2020 | CIDC (ResNet50 [ ]) | — | — | 56.30 | 83.70 |
| Zhou et al. [ ] | 2020 | Probability space | — | — | 62.90 | 88.00 |
| Perez-Rua et al. [ ] | 2020 | W3 (ResNet50-TSM [ ]) | 52.60 | 81.30 | 66.50 | 90.40 |
| Lee et al. [ ] | 2020 | VOV3D-L (T-OSA) | 54.70 | 82.00 | 67.40 | 90.50 |
| Kwon et al. [ ] | 2020 | MSNet (ResNet50 [ ]) | 55.10 | 84.00 | 67.10 | 91.00 |
| Sudhakaran et al. [ ] | 2020 | GSM (InceptionV3 [ ]) | 55.16 | — | — | — |
| Zhang et al. [ ] | 2020 | PANet (ResNet101 [ ]) | 55.30 | 82.80 | 66.50 | 90.60 |
| Wang et al. [ ] | 2020 | TDN (short- and long-term TDM) | **56.80** | **84.10** | **68.20** | **91.60** |
| Huang and Bors [ ] | 2021 | RNL (ResNet50 [ ]) | 54.10 | 82.20 | — | — |
| Huang and Bors [ ] | 2021 | FineCoarse network (ResNet [ ]) |  | 83.70 | — | — |
Bold represents the best performance.
Some benchmark datasets for human action recognition.
| Category | Dataset | Year | Samples | Mean length | Actions | Resolution |
|---|---|---|---|---|---|---|
| Simple | KTH [ ] | 2004 | 2,391 | 4 sec | 6 | 160 × 120 |
|  | Weizmann [ ] | 2005 | 90 | — | 10 | 180 × 144 |
|  | Hollywood [ ] | 2008 | 430 | — | 8 | — |
|  | Hollywood2 [ ] | 2009 | 3,669 | — | 12 | — |
| Clip-level | UCF101 [ ] | 2012 | 13,320 | 7.21 sec | 101 | 320 × 240 |
|  | HMDB51 [ ] | 2011 | 6,766 | — | 51 | — × 240 |
|  | J-HMDB [ ] | 2013 | 31,838 | 1.4 sec | 21 | 320 × 240 |
|  | MPII Cooking [ ] | 2012 | 881,755 | — | 65 | 1624 × 1224 |
|  | Charades [ ] | 2016 | 9,848 | 30 sec | 157 | 671 × 857 |
|  | Something-Something-V1 [ ] | 2017 | 108,499 | 4.03 sec | 174 | — × 100 |
|  | Something-Something-V2 [ ] | 2018 | 220,847 | 4.03 sec | 174 | — × 240 |
|  | Kinetics-400 [ ] | 2017 | 306,245 | 10 sec | 400 | Variable |
|  | Kinetics-600 [ ] | 2018 | 495,547 | 10 sec | 600 | Variable |
|  | Kinetics-700 [ ] | 2019 | 650,317 | 10 sec | 700 | Variable |
|  | Diving48 [ ] | 2018 | 18,404 | — | 48 | — |
|  | Moments in Time [ ] | 2019 | 1,000,000 | 3 sec | 339 | 340 × 256 |
|  | HACS [ ] | 2019 | 1.55M | 2 sec | 200 | — |
|  | HVU [ ] | 2020 | 572K | 10 sec | 739 | — |
|  | AViD [ ] | 2020 | 450K | 3–15 sec | 887 | — |
| Video-level | Sports-1M [ ] | 2014 | 1,133,158 | 5 min 36 sec | 487 | — |
|  | ActivityNet [ ] | 2015 | 28,108 | 5–10 min | 200 | 1280 × 720 |
|  | DALY [ ] | 2016 | 8,133 | 3 min 45 sec | 10 | 1290 × 790 |
|  | YouTube-8M [ ] | 2016 | 1.9B | 226.6 sec | 4,800 | — |
|  | EPIC-Kitchens [ ] | 2018 | 11.5M | 1.7 hrs | 149 | 1920 × 1080 |
|  | AVA [ ] | 2018 | 392,426 | 15 min | 60 | 451 × 808 |
|  | AVA-Kinetics [ ] | 2020 | 624,430 | — | 60 | — |
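The clip-level/video-level split above mainly affects how training clips are drawn. A common baseline, shown here as a sketch under our own assumptions (the helper sample_clip is hypothetical, not a function from the survey), is to sample a fixed number of evenly spaced frames regardless of video length:

```python
# Uniform temporal sampling, as typically used with clip-level benchmarks
# such as UCF101 or Kinetics: pick num_segments evenly spaced frame indices.
import torch

def sample_clip(num_frames: int, num_segments: int = 16) -> torch.Tensor:
    """Return indices of `num_segments` frames spread evenly over the video."""
    ticks = torch.linspace(0, num_frames - 1, num_segments)
    return ticks.round().long()

print(sample_clip(250))  # e.g. a ~10 s Kinetics clip at 25 fps
```

For video-level datasets with minutes-long recordings, a single uniform clip rarely covers the whole action, which is why the surveyed multistream and long-term modelling approaches sample multiple clips or segments instead.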