Shugang Zhang, Zhiqiang Wei, Jie Nie, Lei Huang, Shuang Wang, Zhen Li.
Abstract
Human activity recognition (HAR) aims to recognize activities from a series of observations of subjects' actions and the environmental conditions. Vision-based HAR research underpins many applications, including video surveillance, health care, and human-computer interaction (HCI). This review highlights advances in state-of-the-art activity recognition approaches, focusing on activity representation and classification methods. For representation methods, we trace a chronological research trajectory from global representations to local representations and, more recently, depth-based representations. For classification methods, we follow the categorization into template-based methods, discriminative models, and generative models, and review several prevalent methods in each category. Next, representative and publicly available datasets are introduced. To provide an overview of these methods and a convenient way of comparing them, we classify the existing literature with a detailed taxonomy covering representation and classification methods as well as the datasets used. Finally, we discuss directions for future research.
Year: 2017 PMID: 29065585 PMCID: PMC5541824 DOI: 10.1155/2017/3090343
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1. Research trajectory of activity representation approaches.
Feature encoding methods.
| Method | Proposed in | Reference |
|---|---|---|
| Vector quantization (VQ)/hard assignment (HA) | Sivic et al. (2003) | [ |
| Kernel codebook coding (KCB)/soft assignment (SA) | van Gemert et al. (2008) | [ |
| Sparse coding (SPC) | Yang et al. (2009) | [ |
| Local coordinate coding (LCC) | Yu et al. (2009) | [ |
| Locality-constrained linear coding (LLC) | Wang et al. (2010) | [ |
| Improved Fisher kernel (iFK)/Fisher vector (FV) | Perronnin et al. (2010) | [ |
| Triangle assignment coding (TAC) | Coates et al. (2010) | [ |
| Vector of locally aggregated descriptors (VLAD) | Jégou et al. (2010) | [ |
| Super vector coding (SVC) | Zhou et al. (2010) | [ |
| Local tangent-based coding (LTC) | Yu et al. (2010) | [ |
| Localized soft assignment coding (LSC/SA-k) | Liu et al. (2011) | [ |
| Salient coding (SC) | Huang et al. (2011) | [ |
| Group salient coding (GSC) | Wu et al. (2012) | [ |
| Stacked Fisher vectors (SFV) | Peng et al. (2014) | [ |
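The simplest of these encoders, vector quantization with hard assignment, maps each local descriptor to its nearest codeword and pools the assignments into a histogram (the bag-of-visual-words representation). A minimal NumPy sketch; the codebook and descriptor values are purely illustrative:

```python
import numpy as np

def vq_encode(descriptors, codebook):
    """Hard-assignment (VQ) encoding: each local descriptor votes for its
    nearest codeword; the clip is represented by the normalized histogram
    of votes (bag of visual words)."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest codeword
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()     # L1-normalized histogram

# Toy example: four 2-D descriptors against a 3-word codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [1.2, 0.8], [4.9, 5.2]])
code = vq_encode(descriptors, codebook)  # -> [0.25, 0.5, 0.25]
```

Soft assignment (KCB/LSC) replaces the argmin with kernel-weighted votes over several nearby codewords, which is the main axis along which the later encoders in the table differ.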
Figure 2. Long-distance videos under real-world settings. (a) HAR in long-distance broadcasts. (b) Abnormal behaviors in surveillance.
Taxonomy of activity recognition literatures.
| References | Year | Representation (global/local/depth) | Classification | Modality | Level | Dataset | Performance result |
|---|---|---|---|---|---|---|---|
| Yamato et al. [ | 1992 | Symbols converted from mesh feature vector and encoded by vector quantization (G) | HMM | RGB | Action/activity | Collected dataset: | 96% accuracy |
| Darrell and Pentland [ | 1993 | View model sets (G) | Dynamic time warping | RGB | Action primitive | Collected instances of 4 gestures. | 96% accuracy (“Hello” gesture) |
| Brand et al. [ | 1997 | 2D blob feature (G) | Coupled HMM (CHMM) | RGB | Action primitive | Collected dataset: 52 instances. | 94.2% accuracy |
| Oliver et al. [ | 2000 | 2D blob feature (G) | (i) CHMM; | RGB | Interaction | Collected dataset: 11–75 training sequences +20 testing sequences. | (i) 84.68% accuracy (average); |
| Bobick and Davis [ | 2001 | Motion energy image & motion history image (G) | Template matching by measuring Mahalanobis distance | RGB | Action/activity | Collected dataset: | (a) 12/18 (single view); |
| Efros et al. [ | 2003 | Optical flow (G) | K-nearest neighbor | RGB | Action/activity | (a) Ballet dataset; (b) tennis dataset; (c) football dataset | (a) 87.4% accuracy; |
| Park and Aggarwal [ | 2004 | Body model by combining an ellipse representation and a convex hull-based polygonal representation (G) | Dynamic Bayesian network | RGB | Interaction | Collected dataset: 56 instances. | 78% accuracy |
| Schüldt et al. [ | 2004 | Space-time interest points (L) | SVM | RGB | Action/activity | KTH dataset | 71.7% accuracy |
| Blank et al. [ | 2005 | Space-time shape (G) | Spectral clustering algorithm | RGB | Action/activity | Weizmann dataset | 99.63% accuracy |
| Oikonomopoulos et al. [ | 2005 | Spatiotemporal salient points (L) | RVM | RGB | Action/activity | Collected dataset: 152 instances. | 77.63% recall |
| Dollar et al. [ | 2005 | Space-time interest points (L) | (i) 1-nearest neighbor (1NN); | RGB | Action/activity | KTH dataset | (i) 78.5% accuracy (1NN); |
| Ke et al. [ | 2005 | Integral videos (L) | Adaboost | RGB | Action/activity | KTH dataset | 62.97% accuracy |
| Veeraraghavan et al. [ | 2005 | Space-time shape (G) | Nonparametric methods by extending DTW | RGB | Action/activity | (a) USF dataset [ | No accuracy data presented. |
| Duong et al. [ | 2005 | High-level activities are represented as sequences of atomic activities; atomic activities are represented using durations only (−). | Switching hidden semi-Markov model (S-HSMM) | RGB | Interaction | Collected dataset: 80 video sequences. | 97.5% accuracy (average; Coxian model) |
| Weinland et al. [ | 2006 | Motion history volumes (G) | Principal component analysis (PCA) + Mahalanobis distance | RGB | Action/activity | IXMAS dataset [ | 93.33% accuracy |
| Lu et al. [ | 2006 | PCA-HOG (L) | HMM | RGB | Action/activity | (a) Soccer sequences dataset [ | The implemented system can track subjects in videos and recognize their activities robustly. No accuracy data presented. |
| Ikizler and Duygulu [ | 2007 | Histogram of oriented rectangles encoded with BoVW (G) | (i) Frame by frame voting; | RGB | Action/activity | Weizmann dataset | 100% accuracy (DTW) |
| Huang and Xu [ | 2007 | Envelop shape acquired from silhouettes (G) | HMM | RGB | Action/activity; | Collected dataset: | Subject dependent + view independent: 97.3% accuracy; |
| Scovanner et al. [ | 2007 | 3D SIFT (L) | SVM | RGB | Action/activity | Weizmann dataset | 82.6% accuracy |
| Vail et al. [ | 2007 | — | (i) HMM | — | Interaction | Data from the hourglass and the unconstrained tag domains generated by robot simulator. | 98.1% accuracy (CRF, hourglass); |
| Cherla et al. [ | 2008 | Width feature of normalized silhouette box (G) | Dynamic time warping | RGB | Action/activity | IXMAS dataset [ | 80.05% accuracy; |
| Tran and Sorokin [ | 2008 | Silhouette and optical flow (G) | (i) Naïve Bayes (NB); | RGB | Interaction; | (a) Weizmann dataset; | (a) 100% accuracy; |
| Achard et al. [ | 2008 | Semi-global features extracted from space-time micro volumes (L) | HMM | RGB | Action/activity | Collected dataset: 1614 instances. | 87.39% accuracy (average) |
| Rodriguez et al. [ | 2008 | Action MACH-maximum average correlation height (G) | Maximum average correlation height filter | RGB | Interaction; | (a) KTH dataset; | (a) 80.9% accuracy; |
| Kläser et al. [ | 2008 | Histograms of oriented 3D spatiotemporal gradients (L) | SVM | RGB | Interaction; | (a) KTH dataset; | (a) 91.4% (±0.4) accuracy; |
| Willems et al. [ | 2008 | Hessian-based STIP detector & SURF3D (L) | SVM | RGB | Action/activity | KTH dataset | 84.26% accuracy |
| Laptev et al. [ | 2008 | STIP with HOG, HOF encoded with BoVW (L) | SVM | RGB | Interaction; | (a) KTH dataset; | (a) 91.8% accuracy; |
| Natarajan and Nevatia [ | 2008 | 23 degrees body model (G) | Hierarchical variable transition HMM (HVT-HMM) | RGB | Action/activity; | (a) Weizmann dataset; | (a) 100% accuracy; |
| Natarajan and Nevatia [ | 2008 | 2-layer graphical model: top layer corresponds to actions in a particular viewpoint; lower layer corresponds to individual poses (G) | Shape, flow, duration-conditional random field (SFD-CRF) | RGB | Action/activity | Collected dataset: 400 instances. | 78.9% accuracy |
| Ning et al. [ | 2008 | Appearance and position context (APC) descriptor encoded by BoVW (L) | Latent pose conditional random fields (LPCRF) | RGB | Action/activity; | HumanEva dataset | 95.0% accuracy (LPCRF |
| Marszalek et al. [ | 2009 | SIFT, HOG, HOF encoded by BoVW (L) | SVM | RGB | Interaction | Hollywood2 dataset | 35.5% accuracy |
| Li et al. [ | 2010 | Action graph of salient postures (D) | Non-Euclidean relational fuzzy (NERF) C-means & Hausdorff distance-based dissimilarity measure | Depth | Action/activity | MSR Action3D dataset | 91.6% accuracy (train/test = 1/2); |
| Suk et al. [ | 2010 | YIQ color model for skin pixels; histogram-based color model for face region; optical flow for tracking of hand motion (L) | Dynamic Bayesian network | RGB | Action primitive | Collected dataset: 498 instances. | (a) 99.59% accuracy; |
| Baccouche et al. [ | 2010 | SIFT descriptor encoded by BoVW (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Interaction | MICC-Soccer-Actions-4 dataset [ | 92% accuracy |
| Kumari and Mitra [ | 2011 | Discrete Fourier transform on silhouettes (G) | K-nearest neighbor | RGB | Action/activity | (a) MuHaVi dataset; | (a) 96% accuracy; |
| Wang et al. [ | 2011 | Dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction; | (a) KTH dataset; | (a) 94.2% accuracy; |
| Wang et al. [ | 2012 | STIP with HOG, HOF encoded with various encoding methods (L) | SVM | RGB | Interaction; | (a) KTH dataset; | (a) 92.13% accuracy (Fisher vector); |
| Zhao et al. [ | 2012 | Combined representations: | SVM | RGB-D | Interaction | RGBD-HuDaAct dataset | 89.1% accuracy |
| Yang et al. [ | 2012 | DMM-HOG (D) | SVM | Depth | Action/activity | MSR Action3D dataset | 95.83% accuracy |
| Xia et al. [ | 2012 | Histograms of 3D joint locations (D) | HMM | Depth | Action/activity | (a) collected dataset: 6220 frames, 200 samples. | (a) 90.92% accuracy; |
| Yang and Tian [ | 2012 | EigenJoints (D) | Naïve-Bayes-Nearest-Neighbor (NBNN) | Depth | Action/activity | MSR Action3D dataset | 96.8% accuracy; |
| Wang et al. [ | 2012 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; | (a) MSR Action3D dataset; | (a) 88.2% accuracy; |
| Wang et al. [ | 2013 | Improved dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction | (a) Hollywood2 dataset; | (a) 64.3% accuracy; |
| Oreifej and Liu [ | 2013 | Histogram of oriented 4D surface normals (D) | SVM | Depth | Action/activity; | (a) MSR Action3D dataset; | (a) 88.89% accuracy; |
| Chaaraoui [ | 2013 | Combined representations: | Dynamic time warping | RGB-D | Action/activity | MSR Action3D dataset | 91.80% accuracy |
| Ren et al. [ | 2013 | Time-series curve of hand shape (G) | Dissimilarity measure based on Finger-Earth Mover's Distance (FEMD) | RGB | Action primitive | Collected dataset: 1000 instances. | 93.9% accuracy |
| Ni et al. [ | 2013 | Depth-Layered Multi-Channel STIPs (L) | SVM | RGB-D | Interaction | RGBD-HuDaAct database | 81.48% accuracy (codebook size = 512 & SPM kernel) |
| Grushin et al. [ | 2013 | STIP with HOF (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Action/activity | KTH dataset | 90.7% accuracy |
| Peng et al. [ | 2014 | (i) STIP with HOG, HOF and encoded by various encoding methods; (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; | Hybrid representation: |
| Peng et al. [ | 2014 | Improved dense trajectory encoded with stacked Fisher kernel (L) | SVM | RGB | Interaction; | (a) YouTube dataset; | (a) 93.38% accuracy; |
| Wang et al. [ | 2014 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; | (a) MSR Action3D dataset; | (a) 88.2% accuracy; |
| Simonyan and Zisserman [ | 2014 | Spatial stream ConvNets & optical flow based temporal stream ConvNets (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; | (a) 59.4% accuracy; |
| Lan et al. [ | 2015 | Improved dense trajectory with HOG, HOF, MBHx, MBHy enhanced with multiskip feature tracking (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; | (a) 65.1% accuracy (L = 3); |
| Shahroudy et al. [ | 2015 | Combined representations: | SVM | RGB-D | Interaction | MSR DailyActivity3D | 81.9% accuracy |
| Wang et al. [ | 2015 | Weighted hierarchical depth motion maps (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; | (a) MSR Action3D dataset; | (a) 100% accuracy; |
| Wang et al. [ | 2015 | Pseudo-color images converted from DMMs (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; | (a) MSR Action3D dataset; | (a) 100% accuracy; |
| Wang et al. [ | 2015 | Trajectory-pooled deep-convolutional descriptor encoded by Fisher kernel (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; | (a) 65.9% accuracy; |
| Veeriah et al. [ | 2015 | (i) HOG3D in KTH 2D action dataset; (L) | Differential recurrent neural network (dRNN) | RGB-D | Action/activity | (a) KTH dataset; | (a) 93.96% accuracy (KTH-1); |
| Du et al. [ | 2015 | Representations of skeleton data extracted by subnets (D) | Hierarchical bidirectional recurrent neural network (HBRNN) | RGB-D | Action/activity | (a) MSR Action3D dataset; | (a) 94.49% accuracy; |
| Zhen et al. [ | 2016 | STIP with HOG3D and encoded with various encoding methods (L) | SVM | RGB | Interaction; | (a) KTH dataset; | (a) 94.1% (Local NBNN); |
| Chen et al. [ | 2016 | Action graph of skeleton-based features (D) | Maximum likelihood estimation | Depth | Action/activity | (a) MSR Action3D dataset; | (a) 95.56% accuracy (cross subject); |
| Zhu et al. [ | 2016 | Co-occurrence features of skeleton joints (D) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | Depth | Interaction; | (a) SBU Kinect interaction dataset [ | (a) 90.41% accuracy; |
| Li et al. [ | 2016 | VLAD for deep dynamics (G) | Deep convolutional neural networks (ConvNets) | RGB | Interaction; | (a) UCF101 dataset; | (a) 84.65% accuracy; |
| Berlin & John [ | 2016 | Harris corner-based interest points and histogram-based features (L) | Deep neural networks (DNNs) | RGB | Interaction | UT Interaction dataset [ | 95% accuracy on set1; |
| Huang et al. [ | 2016 | Lie group features (L) | Lie Group Network (LieNet) | Depth | Interaction; | (a) G3D-Gaming dataset [ | (a) 89.10% accuracy; |
| Mo et al. [ | 2016 | Automatically extracted features from skeletons data (D) | Convolutional neural networks (ConvNets) + multilayer perceptron | Depth | Interaction | CAD-60 dataset | 81.8% accuracy |
| Shi et al. [ | 2016 | Three stream sequential deep trajectory descriptor (L) | Recurrent neural networks (RNN) and deep convolutional neural networks (ConvNets) | RGB | Interaction; | (a) KTH dataset; | (a) 96.8% accuracy; |
| Yang et al. [ | 2017 | Low-level polynormals assembled from local neighboring hypersurface normals and then aggregated by the Super Normal Vector (D) | Linear classifier | Depth | Interaction; | (a) MSR Action3D dataset; | (a) 93.45% accuracy; |
| Jalal et al. [ | 2017 | Multifeatures extracted from human body silhouettes and joints information (D) | HMM | Depth | Interaction; | (a) Online self-annotated dataset [ | (a) 71.6% accuracy; |
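Several template-based entries in the table above (e.g., Darrell and Pentland; Cherla et al.; Chaaraoui) classify a query sequence by its dynamic time warping distance to stored templates. A minimal sketch of the classic DTW recurrence for 1-D feature sequences, using absolute difference as the local cost; the function and template names are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences,
    using |a_i - b_j| as the local cost (the classic O(n*m) recurrence)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, and deletion moves
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A template matcher labels a query with its nearest template under DTW.
templates = {"wave": [0, 1, 2, 1, 0], "walk": [0, 0, 1, 1, 0]}
query = [0, 1, 1, 2, 1, 0]  # a time-stretched "wave"
label = min(templates, key=lambda k: dtw_distance(query, templates[k]))
```

Because warping absorbs differences in execution speed, the stretched query still matches the "wave" template exactly (distance 0), which is precisely why DTW suits activity templates performed at varying tempos.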
Figure 3. Pipeline of the Fisher vector and the stacked Fisher vector. (a) Fisher vector. (b) Stacked Fisher vector.
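As the Fisher vector pipeline in Figure 3 suggests, local descriptors are aggregated through gradients of a GMM log-likelihood. A simplified sketch of the first-order (mean-deviation) block only, assuming a diagonal-covariance GMM has already been fitted; the full improved Fisher kernel also stacks weight- and variance-gradient blocks and applies power and L2 normalization. The helper name is hypothetical:

```python
import numpy as np

def fv_mean_part(X, weights, means, sigmas):
    """First-order (mean-deviation) block of the Fisher vector for local
    descriptors X (T x D) under a fitted diagonal-covariance GMM with K
    components: G_k = (1 / (T * sqrt(w_k))) * sum_t gamma_t(k) * (x_t - mu_k) / sigma_k."""
    T, K = len(X), len(weights)
    # log of weighted component densities, up to a shared additive constant
    log_p = np.stack(
        [np.log(weights[k]) - np.log(sigmas[k]).sum()
         - 0.5 * (((X - means[k]) / sigmas[k]) ** 2).sum(axis=1)
         for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)   # posterior responsibilities
    return np.concatenate(
        [(gamma[:, [k]] * (X - means[k]) / sigmas[k]).sum(axis=0)
         / (T * np.sqrt(weights[k]))
         for k in range(K)])

# Toy check with a single 1-D component (weight 1, mean 0, sigma 1):
fv = fv_mean_part(np.array([[2.0], [0.0]]),
                  np.array([1.0]), np.array([[0.0]]), np.array([[1.0]]))
```

The stacked Fisher vector of panel (b) simply applies this encoding a second time to compressed first-level Fisher vectors computed over dense subvolumes.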
Figure 4. Kinect RGB-D cameras and their color images, depth maps, and skeletal information. (a) Kinect v1 (2011). (b) Kinect v2 (2014). (c) Color image. (d) Depth map. (e) Skeleton captured by Kinect v1. (f) Skeleton captured by Kinect v2.
Figure 5. Noisy skeleton problem caused by self-occlusion.
Figure 6. A typical dynamic Bayesian network [101].
Overview of representative datasets.
| Dataset | Modality | Level | Year | References | Web pages | Activity category |
|---|---|---|---|---|---|---|
| RGBD-HuDaAct | RGB-D | Interaction level | 2013 | [ | | 12 classes: eat meal, drink water, mop floor, and so forth |
| Hollywood | RGB | Interaction level | 2008 | [ | | 8 classes: answer phone, hug person, kiss, and so forth |
| Hollywood-2 | RGB | Interaction level | 2009 | [ | | 12 classes: answer phone, driving a car, fight, and so forth |
| UCF sports | RGB | Interaction level | 2008 | [ | | 10 classes: golf swing, diving, lifting, and so forth |
| KTH | RGB | Activity/action level | 2004 | [ | | 6 classes: walking, jogging, running, and so forth |
| Weizmann | RGB | Activity/action level | 2005 | [ | | 10 classes: run, walk, bend, jumping-jack, and so forth |
| NTU-MSR | RGB-D | Action primitive level | 2013 | [ | | 10 classes: 10 different gestures |
| MSRC-Gesture | RGB-D | Action primitive level | 2012 | [ | | 12 classes: 12 different gestures |
| MSR DailyActivity3D | RGB-D | Interaction level | 2012 | [ | | 16 classes: call cellphone, use laptop, walk, and so forth |
| MSR Action3D | Depth | Activity/action level | 2010 | [ | | 20 classes: high arm wave, hand clap, jogging, and so forth |