Itsaso Rodríguez-Moreno, José María Martínez-Otzeta, Basilio Sierra, Igor Rodriguez, Ekaitz Jauregi.
Abstract
Video activity recognition, although an emerging task, has been the subject of important research efforts due to the importance of its everyday applications. Surveillance by video cameras could benefit greatly from advances in this field. In the area of robotics, the tasks of autonomous navigation or social interaction could also take advantage of the knowledge extracted from live video recordings. The aim of this paper is to survey the state-of-the-art techniques for video activity recognition, while also mentioning other techniques used for the same task that the research community has known for several years. For each of the analyzed methods, its contribution over previous works and the performance of the proposed approach are discussed.
Keywords: activity recognition; computer vision; deep learning; optical flow
Year: 2019 PMID: 31323804 PMCID: PMC6679256 DOI: 10.3390/s19143160
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Summary diagram.
Summary of methods using hand-crafted motion features.
| METHOD | YEAR | SUMMARY | DATASET |
|---|---|---|---|
| Bobick et al. | 2001 | Use of motion-energy images (MEI) and motion-history images (MHI). | - |
| Schuldt et al. | 2004 | Use of local space-time features to recognize complex motion patterns. | KTH Action |
| Niebles et al. | 2007 | Use of a hybrid hierarchical model, combining static and dynamic features. | Weizmann |
| Laptev et al. | 2008 | Use of spatio-temporal features, extending spatial pyramids to spatio-temporal pyramids. | KTH Action |
| Chen et al. | 2009 | Use of HOG for human pose representations and HOOF to characterize human motion. | Weizmann |
| Chaudhry et al. | 2009 | Use of HOOF features, computing optical flow at every frame and binning the vectors according to primary angles. | Weizmann |
| Lertniphonphan et al. | 2011 | Use of a motion descriptor based on the direction of optical flow. | Weizmann |
| Wang et al. | 2013 | Use of camera motion to correct dense trajectories. | HMDB51 |
| Akpinar et al. | 2014 | Use of a generic temporal video segment representation, introducing a new velocity concept: Weighted Frame Velocity. | Weizmann |
| Kumar et al. | 2016 | Use of a local descriptor built from optical flow vectors along the edges of the action performers. | Weizmann |
| Sehgal, S. | 2018 | Use of background subtraction, HOG features and a BPNN classifier. | Weizmann |
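Several of the hand-crafted descriptors above (Chen et al., Chaudhry et al., Lertniphonphan et al.) are built on histograms of optical flow. The sketch below is a minimal, simplified numpy illustration of the HOOF idea: each per-pixel flow vector votes into an angle bin weighted by its magnitude, and the histogram is normalised. The function name `hoof` and the bin count are illustrative choices, and Chaudhry et al.'s original formulation additionally folds directions symmetrically about the vertical axis for mirror invariance, which is omitted here.

```python
import numpy as np

def hoof(flow, n_bins=8):
    """Simplified Histogram of Oriented Optical Flow for one frame.

    flow: (H, W, 2) array of per-pixel displacements (dx, dy).
    Each vector votes into an angle bin weighted by its magnitude;
    the histogram is L1-normalised so it is invariant to flow scale.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                      # vector magnitudes
    ang = np.arctan2(dy, dx)                    # angles in [-pi, pi)
    # map each angle to one of n_bins primary-angle ranges
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# toy flow field: every pixel moves right, so one bin gets all the mass
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
h = hoof(flow)
```

A per-frame sequence of such histograms can then be fed to a classifier or, as in Chaudhry et al., modelled as the output of a non-linear dynamical system.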
Summary of depth information based methods.
| METHOD | YEAR | SUMMARY | DATASET |
|---|---|---|---|
| Yang et al. | 2012 | Use of Depth Motion Maps (DMM), combining them with HOG descriptors. | MSRAction3D |
| Oreifej et al. | 2013 | Use of the histogram of oriented 4D surface normals (HON4D) descriptor. | MSRAction3D |
| Liu et al. | 2018 | Use of a two-layer BoVW model, using motion-based and shape-based STIPs to distinguish the action. | MSRAction3D |
| Satyamurthi et al. | 2018 | Use of multi-directional projected depth motion maps (MPDMM). | MSRAction3D |
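The Depth Motion Map used by Yang et al. (and extended by the MPDMM work) accumulates frame-to-frame depth differences so that regions where motion occurred build up energy. The following is a minimal numpy sketch under simplifying assumptions: it computes a single-view map from raw depth frames, whereas the original method projects each depth frame onto three orthogonal planes (front, side, top), builds one map per view, and describes each with HOG. The function name and noise threshold are illustrative.

```python
import numpy as np

def depth_motion_map(depth_frames, eps=1e-3):
    """Single-view Depth Motion Map sketch.

    depth_frames: (T, H, W) stack of depth images.
    Accumulates absolute differences between consecutive frames,
    so pixels where the depth changed (i.e. motion occurred)
    accumulate energy; small differences are treated as sensor
    noise and suppressed.
    """
    frames = np.asarray(depth_frames, dtype=float)
    diffs = np.abs(np.diff(frames, axis=0))   # (T-1, H, W)
    diffs[diffs < eps] = 0.0                  # suppress sensor noise
    return diffs.sum(axis=0)                  # (H, W) motion-energy map

# toy sequence: only pixel (0, 0) changes depth over time
frames = np.zeros((3, 2, 2))
frames[1, 0, 0] = 1.0
frames[2, 0, 0] = 2.0
dmm = depth_motion_map(frames)
```

In the full pipeline, a HOG descriptor computed over each per-view map yields the final feature vector for classification.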
Summary of deep learning based methods.
| METHOD | YEAR | SUMMARY | DATASET |
|---|---|---|---|
| Karpathy et al. | 2014 | Use of different connectivity patterns for CNNs: early fusion, late fusion and slow fusion. | Sports-1M |
| Simonyan et al. | 2014 | Use of a two-stream CNN architecture, incorporating spatial and temporal networks. | UCF101 |
| Donahue et al. | 2015 | Use of a Long-term Recurrent Convolutional Network (LRCN) to learn compositional representations in space and time. | UCF101 |
| Wang et al. | 2015 | Use of very deep two-stream convNets, using stacked optical flow for the temporal network and a single frame image for the spatial network. | UCF101 |
| Wang et al. | 2015 | Use of trajectory-pooled deep-convolutional descriptors (TDD). | UCF101 |
| Tran et al. | 2015 | Use of deep 3D convolutional networks, which are better suited for spatio-temporal feature learning. | UCF101 |
| Feichtenhofer et al. | 2016 | Use of a two-stream architecture associating spatial feature maps of a particular area with temporal feature maps of that region, fusing the networks at an early level. | UCF101 |
| Wang et al. | 2016 | Use of a Temporal Segment Network (TSN) to incorporate long-range temporal structure while avoiding overfitting. | UCF101 |
| Bilen et al. | 2016 | Use of image-classification CNNs after summarizing the videos into dynamic images. | UCF101 |
| Carreira et al. | 2017 | Use of a two-stream Inflated 3D ConvNet (I3D), using two different 3D networks for the two streams of a two-stream architecture. | UCF101 |
| Varol et al. | 2018 | Use of space-time CNNs and architectures with long-term temporal convolutions (LTC), using lower spatial resolution and longer clips. | UCF101 |
| Ullah et al. | 2018 | Use of CNNs to reduce complexity and redundancy, and a deep bidirectional LSTM (DB-LSTM) to learn sequential information among frame features. | UCF101 |
| Wang et al. | 2018 | Use of discriminative pooling, exploiting the fact that only a few frames provide characteristic information about the action. | HMDB51 |
| Wang et al. | 2018 | Use of convNets that admit videos of arbitrary size and length, applying first an STPP and then an LSTM (or CNN-E). | UCF101 |
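Many of the deep-learning entries above build on Simonyan et al.'s two-stream design: a spatial CNN sees a single RGB frame, a temporal CNN sees stacked optical flow, and their class scores are fused, either by averaging or with an SVM on the stacked softmax outputs (the "avg" and "SVM" variants in the accuracy table below). The following numpy sketch illustrates only the score-level averaging step; the logits are stand-ins for real network outputs, and the function names are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(spatial_logits, temporal_logits, w_spatial=0.5):
    """Fuse two-stream predictions at the score level.

    Each stream's logits are turned into class probabilities and
    combined as a weighted average; the fused prediction is the
    arg-max class. w_spatial controls the relative weight of the
    appearance (RGB) stream versus the motion (flow) stream.
    """
    s = softmax(np.asarray(spatial_logits, dtype=float))
    t = softmax(np.asarray(temporal_logits, dtype=float))
    fused = w_spatial * s + (1.0 - w_spatial) * t
    return int(np.argmax(fused))

# spatial stream weakly prefers class 0, temporal stream strongly
# prefers class 1; the fused decision follows the confident stream
pred = late_fusion([1.0, 0.9, 0.0], [0.0, 3.0, 0.0])
```

Later architectures in the table (TwoStreamFusion, I3D) replace this score-level step with fusion inside the network, at the feature-map level.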
Advantages and disadvantages of presented techniques.
| | Advantages | Disadvantages |
|---|---|---|
| Hand-crafted motion features | No need for a large amount of training data. | These features are usually not robust. |
| Depth information | The 3D structure information provided by depth sensors is used to recover postures and recognize the activity. | Depth maps lack texture, making it difficult to apply local differential operators. |
| Deep Learning | No expert knowledge is needed to design suitable features, reducing the feature-extraction effort. | Massive amounts of data must be collected; consequently, there is a lack of datasets. |
Summary of the presented datasets.
| DATASET | # Classes | # Videos | # Actors | Resolution | Year |
|---|---|---|---|---|---|
| Weizmann | 10 | 90 | 9 | 180 × 144 | 2005 |
| MSRAction3D | 20 | 420 | 7 | 640 × 480 | 2010 |
| HMDB51 | 51 | 6849 | - | 320 × 240 | 2011 |
| UCF50 | 50 | 6676 | - | - | 2012 |
| UCF101 | 101 | 13,320 | - | 320 × 240 | 2012 |
| Sports-1M | 487 | 1,133,158 | - | - | 2014 |
| ActivityNet | 203 | 27,801 | - | 1280 × 720 | 2015 |
| Something Something | 174 | 220,847 | - | variable width × 240 | 2017 |
| AVA | 80 | 430 | - | - | 2018 |
Obtained accuracies for the benchmark dataset with depth information based methods.
| METHOD | MSRAction3D |
|---|---|
| DMM-HOG | 85.52% |
| HON4D | 88.89% |
| M3DLSK+STV | |
| MPDMM | 94.8% |
Obtained accuracies for the benchmark datasets with hand-crafted methods and deep learning methods.
| | METHOD | UCF101 | HMDB51 | Weizmann |
|---|---|---|---|---|
| Hand-crafted | Hierarchical | - | - | 72.8% |
| | Far Field of View | - | - | |
| | HOOF NLDS | - | - | 94.4% |
| | Direction HOF | - | - | 79.17% |
| | iDT | - | 57.2% | - |
| | iDT+FV | 85.9% | 57.2% | - |
| | OF Based | - | - | 90.32% |
| | Edges OF | - | - | 95.69% |
| | HOG features | - | - | 99.7% |
| Deep learning | Slow Fusion CNN | 65.4% | - | - |
| | Two stream (avg) | 86.9% | 58.0% | - |
| | Two stream (SVM) | 88.0% | 59.4% | - |
| | iDT+MIFS | 89.1% | 65.1% | - |
| | LRCN (RGB) | 68.2% | - | - |
| | LRCN (FLOW) | 77.28% | - | - |
| | LRCN (avg, 1/2-1/2) | 80.9% | - | - |
| | LRCN (avg, 1/3-2/3) | 82.34% | - | - |
| | Very deep two-stream (VGGNet-16) | 91.4% | - | - |
| | TDD | 90.3% | 63.2% | - |
| | TDD + iDT | 91.5% | 65.9% | - |
| | C3D | 85.2% | - | - |
| | C3D + iDT | 90.4% | - | - |
| | TwoStreamFusion | 92.5% | 65.4% | - |
| | TwoStreamFusion + iDT | 93.5% | 69.2% | - |
| | TSN (RGB+FLOW) | 94.0% | 68.5% | - |
| | TSN (RGB+FLOW+WF) | 94.2% | 69.4% | - |
| | Dynamic images + iDT | 89.1% | 65.2% | - |
| | Two-Stream I3D | 93.4% | 66.4% | - |
| | Two-Stream I3D, pre-trained | | 80.2% | - |
| | LTC (RGB) | 82.4% | - | - |
| | LTC (FLOW) | 85.2% | 59.0% | - |
| | LTC (FLOW+RGB) | 91.7% | 64.8% | - |
| | LTC (FLOW+RGB) + iDT | 92.7% | 67.2% | - |
| | DB-LSTM | 91.21% | | - |
| | Two-Stream SVMP (VGGNet) | - | 66.1% | - |
| | Two-Stream SVMP (ResNet) | - | 71.0% | - |
| | Two-Stream SVMP (+ iDT) | - | 72.6% | - |
| | Two-Stream SVMP (I3D conf) | - | 83.1% | - |
| | STPP + CNN-E (RGB) | 85.6% | 62.1% | - |
| | STPP + LSTM (RGB) | 85.0% | 62.5% | - |
| | STPP + CNN-E (FLOW) | 83.2% | 55.4% | - |
| | STPP + LSTM (FLOW) | 83.8% | 54.7% | - |
| | STPP + CNN-E (RGB+FLOW) | 92.4% | 70.5% | - |
| | STPP + LSTM (RGB+FLOW) | 92.6% | 70.3% | - |
Available code for presented methods.
| METHOD | YEAR | PAPER | CODE |
|---|---|---|---|
| Deep Learning | 2018 | Video representation learning using discriminative pooling | SVMP |
| Deep Learning | 2018 | Action recognition in video sequences using deep bi-directional LSTM with CNN features | Bi-directional LSTM |
| Deep Learning | 2018 | Long-term temporal convolutions for action recognition | LTC |
| Deep Learning | 2017 | Quo vadis, action recognition? A new model and the Kinetics dataset | Two-Stream I3D |
| Deep Learning | 2016 | Dynamic image networks for action recognition | Dynamic images |
| Deep Learning | 2016 | Temporal segment networks: Towards good practices for deep action recognition | TSN |
| Deep Learning | 2016 | Convolutional two-stream network fusion for video action recognition | Two-Stream Fusion |
| Deep Learning | 2015 | Learning spatiotemporal features with 3D convolutional networks | C3D |
| Deep Learning | 2015 | Action recognition with trajectory-pooled deep-convolutional descriptors | TDD |
| Deep Learning | 2015 | Towards good practices for very deep two-stream convNets | Very deep Two-Stream convNets |
| Depth information | 2013 | HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences | HON4D |
| Hand-crafted motion features | 2013 | Action recognition with improved trajectories | Improved Trajectories |