Shahela Saif, Samabia Tehseen, Sumaira Kausar.
Abstract
Recognition of human actions from videos has been an active area of research because it has applications in various domains. The results of work in this field are used in video surveillance, automatic video labeling and human-computer interaction, among others. Any advancement in this field is tied to advances in the interrelated fields of object recognition, spatio-temporal video analysis and semantic segmentation. Activity recognition is a challenging task, since it faces many problems such as occlusion, viewpoint variation, background differences, clutter and illumination variations. Scientific achievements in the field have been numerous and rapid, as the applications are far-reaching. In this survey, we cover the growth of the field from the earliest solutions, where handcrafted features were used, to later deep learning approaches that use millions of images and videos to learn features automatically. Through this discussion, we intend to highlight the major breakthroughs and the directions future research might take while benefiting from state-of-the-art methods.
Keywords: action recognition; computer vision; deep learning; visual action recognition
Year: 2018 PMID: 30445801 PMCID: PMC6263411 DOI: 10.3390/s18113979
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Surveys and studies on action and motion analysis.
| Survey | Scope |
|---|---|
| Poppe [ | Handcrafted action features and classification models |
| Aggarwal and Ryoo [ | Individual and group activity analysis |
| Turaga et al. [ | Human actions, complex activities |
| Moeslund et al. [ | Human action analysis |
| Poppe [ | Human action recognition |
| Cheng et al. [ | Handcrafted models |
| Aggarwal and Cai [ | Human action analysis |
| Gavrila [ | Human body and hands tracking-based motion analysis |
| Yilmaz et al. [ | Object detection and tracking |
| Zhan et al. [ | Surveillance and crowd analysis |
| Weinland et al. [ | Action recognition |
| Aggarwal [ | Motion analysis fundamentals |
| Chaaraoui et al. [ | Human behavior analysis and understanding |
| Metaxas and Zhang [ | Human gestures to group activities |
| Vishwakarma and Agrawal [ | Activity recognition and monitoring |
| Cedras and Shah [ | Motion-based recognition approaches |
Figure 1. Classification of action recognition based on techniques employed for identification and classification of actions.
Figure 2. Research publications per year as discussed in the current study.
Figure 3. Moving light displays used for action recognition in [30].
Figure 4. Human model created in 3D using 2D information in [31].
Figure 5. Top row: a walking sequence of a person; middle row: a Motion Energy Image (MEI) template; bottom row: a Motion History Image (MHI) template [41].
Figure 6. Spatio-temporal interest point detection for a walking person. Reprinted with permission from [62].
Figure 7. Fusion strategies for incorporating the temporal dimension in neural networks. Reprinted with permission from [96].
Figure 8. Two-stream architecture with the spatial stream using images and the temporal stream using optical flow. Source: [102].
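The MEI/MHI templates of Figure 5 are built by simple frame differencing: each pixel of the Motion History Image jumps to a maximum value when motion is detected there and decays linearly otherwise, while the Motion Energy Image is just the binarized MHI. A minimal NumPy sketch (function names, the threshold and decay values are illustrative assumptions, not taken from the cited paper):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=255.0, decay=1.0):
    # Pixels where motion is detected jump to tau; all others decay toward 0.
    return np.where(motion_mask, tau, np.maximum(mhi - decay, 0.0))

def motion_history(frames, diff_thresh=30, tau=255.0):
    # Build an MHI over a sequence of 2-D grayscale frames using
    # thresholded frame differencing as the motion detector.
    mhi = np.zeros_like(frames[0], dtype=float)
    for prev, curr in zip(frames, frames[1:]):
        motion = np.abs(curr.astype(int) - prev.astype(int)) > diff_thresh
        mhi = update_mhi(mhi, motion, tau=tau)
    return mhi

def motion_energy(mhi):
    # The MEI is the binary union of all locations where motion occurred.
    return mhi > 0
```

Because recent motion has larger MHI values than older motion, a single MHI encodes both where and, implicitly, when movement happened, which is what makes it usable as a template for matching actions.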
Datasets used for action recognition in increasing order of complexity.
| Dataset | Type | No. of Videos | No. of Classes | No. of Subjects |
|---|---|---|---|---|
| KTH [ | Indoor/Outdoor | 600 | 6 | 25 |
| Weizmann [ | Outdoor | 90 | 10 | 9 |
| CAVIAR [ | Indoor/Outdoor | 80 | 9 | numerous |
| UCF Sports [ | Television sports | 150 | 10 | numerous |
| UCF-50 [ | YouTube videos | - | 50 | numerous |
| UCF-101 [ | YouTube videos | 13,320 | 101 | numerous |
| Sports-1M [ | YouTube sports | 1,133,158 | 487 | numerous |
| Hollywood2 [ | Clips from Hollywood movies | 1707 | 12 | numerous |
| HMDB-51 [ | YouTube, movies | 7000 | 51 | numerous |
Comparison of various action recognition techniques.
| Paper | Year | Technique | UCF-101 | HMDB-51 | Others |
|---|---|---|---|---|---|
| *Handcrafted Features* | | | | | |
| Wang et al. [ | 2011 | Dense Trajectory | | | UCF Sports 88.2 |
| Wang et al. [ | 2013 | Dense Trajectory | | | UCF-50 91.2 |
| *Learned Models* | | | | | |
| Ji et al. [ | 2013 | 3D Convolution | | | KTH 90.2 |
| Tran et al. [ | 2015 | C3D generic descriptor | 90.4 | ||
| Karpathy et al. [ | 2014 | Slow fusion | | | Sports-1M 80.2 |
| Sun et al. [ | 2015 | Factorized spatio-temporal ConvNets | 88.1 | 59.1 | |
| Wang et al. [ | 2015 | Two-stream | 89.3 | ||
| Ng et al. [ | 2015 | Conv Pooling | 88.2 | | Sports-1M 73.1 |
| Ng et al. [ | 2015 | LSTM | 88.6 | ||
| Donahue et al. [ | 2015 | LRCN | 82 | ||
| Jiang et al. [ | 2012 | Trajectories | 78.5 | 48.4 | |
| Varol et al. [ | 2017 | Long-term temporal convolutions | 91.7 | 64.8 | |
| Li et al. [ | 2016 | VLAD | 92.2 | ||
| *Hybrid Models* | | | | | |
| Simonyan and Zisserman [ | 2014 | Two-stream CNN | 88.0 | 59.4 | |
| Feichtenhofer et al. [ | 2016 | ResNet | 93.5 | 69.2 | |
| Wang et al. [ | 2015 | Trajectory pooling + Fisher vector | 91.5 | 65.9 | |
| Lev et al. [ | 2016 | RNN Fisher vector | 94.08 | 67.71 | |
| Bilen et al. [ | 2016 | Dynamic Image network | 89.1 | 65.2 | |
| Wu et al. [ | 2015 | Adaptive multi-stream fusion | 92.6 | ||
| *Deep Generative Models* | | | | | |
| Srivastava et al. [ | 2015 | LSTM autoencoder | 75.8 | 44.1 | |
| Mathieu [ | 2015 | Adversarial network | ≈90 |
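Several of the two-stream entries above (e.g., Simonyan and Zisserman) combine an appearance (RGB) stream and a motion (optical-flow) stream; the simplest fusion variant averages the per-class probabilities of the two streams at the end. A sketch of such late fusion, with made-up score vectors and an assumed equal weighting (the original works also explore SVM-based fusion):

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D vector of class scores.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def two_stream_predict(spatial_scores, temporal_scores, w_spatial=0.5):
    # Late fusion: weighted average of class probabilities from the
    # appearance (RGB) stream and the motion (optical-flow) stream,
    # then arg-max over classes.
    probs = (w_spatial * softmax(spatial_scores)
             + (1.0 - w_spatial) * softmax(temporal_scores))
    return int(np.argmax(probs)), probs
```

Fusing probabilities rather than raw scores keeps the two streams on a comparable scale, so one confident stream can outvote an uncertain one regardless of the magnitude of its logits.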