Hong-Bo Zhang, Qing Lei, Bi-Neng Zhong, Ji-Xiang Du, Jialin Peng, Tsung-Chih Hsiao, Duan-Sheng Chen.
Abstract
Most methods for recognizing human actions are based on single-view video or multi-camera data. In this paper, we propose a novel multi-surface video analysis strategy. A video is represented by a three-surface motion feature (3SMF) and a spatio-temporal interest feature. The 3SMF is extracted from the motion history images of three different video surfaces: the horizontal-vertical, horizontal-time, and vertical-time surfaces. In contrast to several previous studies, the prior probability is estimated from the 3SMF rather than assumed to be uniform. Finally, we model the relationship score between each video and each action as a probabilistic inference that bridges the feature descriptors and the action categories. We evaluate our method by comparing it with several state-of-the-art methods on action recognition benchmarks.
Keywords: Human action recognition; Multi-view video analysis; Probability inference; Three surfaces motion feature
Year: 2016 PMID: 27536510 PMCID: PMC4971009 DOI: 10.1186/s40064-016-2876-z
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Fig. 1 Sketch of the multi-surface transfer of video. X indicates the horizontal direction, Y the vertical direction, and T the time direction. The image sequence in the top right corner is the result of the XT surface transfer, and the image sequence in the lower right corner is the result of the YT surface transfer
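
The surface transfer sketched in Fig. 1 can be pictured as re-slicing the video volume along a different axis. The snippet below is a minimal sketch, assuming the video is stored as a NumPy array indexed `(T, Y, X)`; the function name `surface_sequences` and the orientation of each slice (time along the rows) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def surface_sequences(video):
    """Re-slice a (T, Y, X) grayscale video into three image sequences.

    Returns a dict with:
      "XY": T images of shape (Y, X), the ordinary frames
      "XT": Y images of shape (T, X), one horizontal-time slice per image row
      "YT": X images of shape (T, Y), one vertical-time slice per image column
    """
    video = np.asarray(video)
    if video.ndim != 3:
        raise ValueError("expected a (T, Y, X) array")
    return {
        "XY": video,
        "XT": video.transpose(1, 0, 2),  # new index order: (y, t, x)
        "YT": video.transpose(2, 0, 1),  # new index order: (x, t, y)
    }

# Tiny synthetic volume: T=4 frames, Y=3 rows, X=5 columns.
video = np.arange(4 * 3 * 5).reshape(4, 3, 5)
seqs = surface_sequences(video)
print({name: seq.shape for name, seq in seqs.items()})
```

No pixel values change in this step; each sequence is only a different view of the same volume, so the XT sequence has as many images as the frame has rows and the YT sequence as many as it has columns.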
Fig. 2 Outline of the workflow of the proposed approach. The 3SMF and STIP features are extracted from both the training data and the testing data. In the training process, the prior probability inference model is trained with an SVM, and the posterior probability is estimated by the NBNN algorithm
Fig. 3 Framework of the 3SMF. First, the video is regarded as three different surface image sequences. The MHI is then calculated by frame differencing, and the HOG feature is extracted from the MHI
Fig. 4 Framework of the HOG feature. The MHI is divided into an M × N grid. The gradient at each pixel casts a weighted vote into an orientation-based histogram
Fig. 5 Examples of action datasets. a KTH dataset, b UCF sports dataset
Fig. 6 Motion history images of six actions (handclapping, boxing, hand-waving, running, jogging and walking) from three different surfaces. The MHI of each action has its own appearance, as shown in each row. Likewise, the MHI of a given action has a different appearance on each surface
Fig. 7 MHI of the XT surface image sequence (XTMHI)
Comparison of the proposed method with existing methods for the KTH dataset
| Method | Accuracy (%) |
|---|---|
| 3SMF + STIP + NBNN | |
| STIP + NBNN algorithm (uniform distribution) (Zhang et al. | 92.83 |
| Yuan et al. ( | 94.00 |
| Yan and Luo ( | 93.98 |
| Chakraborty et al. ( | 96.35 |
| Wang et al. ( | 94.4 |
| Weinland et al. ( | 92.4 |
Italic values indicate the best results
Confusion matrix of the proposed method on the KTH dataset
| | Walking | Running | Jogging | Handwaving | Handclapping | Boxing |
|---|---|---|---|---|---|---|
| Walking | .98 | .01 | .01 | .00 | .00 | .00 |
| Running | .00 | .86 | .14 | .00 | .00 | .00 |
| Jogging | .00 | .02 | .98 | .00 | .00 | .00 |
| Handwaving | .00 | .00 | .00 | .99 | .00 | .01 |
| Handclapping | .00 | .00 | .00 | .02 | .98 | .00 |
| Boxing | .00 | .00 | .00 | .00 | .00 | 1.0 |
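
As a reading aid, the per-class recognition rates and the strongest confusion can be pulled directly from the matrix above. This is a small sketch, assuming rows are the true class and columns the predicted class (each row sums to 1). The mean of the diagonal equals overall accuracy only when every class has the same number of test videos, so it is not necessarily the figure reported in the comparison table.

```python
import numpy as np

labels = ["Walking", "Running", "Jogging", "Handwaving", "Handclapping", "Boxing"]

# Values copied from the KTH confusion matrix (rows: true class, assumed).
confusion = np.array([
    [0.98, 0.01, 0.01, 0.00, 0.00, 0.00],
    [0.00, 0.86, 0.14, 0.00, 0.00, 0.00],
    [0.00, 0.02, 0.98, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.99, 0.00, 0.01],
    [0.00, 0.00, 0.00, 0.02, 0.98, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
])

per_class = np.diag(confusion)        # recognition rate of each action
mean_per_class = per_class.mean()     # unweighted mean over the six actions

# Largest off-diagonal entry: the most frequent mistake.
off_diagonal = confusion - np.diag(per_class)
i, j = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)

print(f"mean per-class rate: {mean_per_class:.3f}")
print(f"most confused: {labels[i]} -> {labels[j]} ({off_diagonal[i, j]:.2f})")
```

The mean per-class rate works out to about 0.965, and the largest off-diagonal entry is Running classified as Jogging (0.14).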
Comparison of the proposed method with existing methods for the UCF sports dataset
| Methods | Accuracy (%) |
|---|---|
| 3SMF + STIP + NBNN | |
| Wang et al. ( | 85.60 |
| Yan and Luo ( | 90.67 |
| Le et al. ( | 86.50 |
| Shao et al. ( | 93.4 |
| Zhang et al. ( | 88.0 |
Italic values indicate the best results
Confusion matrix of the proposed method on the UCF sports dataset
| | Diving | Golf | High-swinging | Kicking | Lifting | Riding | Running | Skating | Swing | Walking |
|---|---|---|---|---|---|---|---|---|---|---|
| Diving | 1.0 | .00 | .00 | .00 | .00 | .00 | .00 | .00 | .00 | .00 |
| Golf | .00 | .90 | .00 | .00 | .00 | .04 | .00 | .00 | .00 | .06 |
| High-swinging | .00 | .00 | .89 | .00 | .00 | .02 | .00 | .00 | .09 | .00 |
| Kicking | .00 | .00 | .00 | 1.0 | .00 | .00 | .00 | .00 | .00 | .00 |
| Lifting | .00 | .00 | .00 | .00 | 1.0 | .00 | .00 | .00 | .00 | .00 |
| Riding | .00 | .00 | .00 | .00 | .00 | 1.0 | .00 | .00 | .00 | .00 |
| Running | .00 | .00 | .00 | .01 | .00 | .00 | .93 | .00 | .00 | .06 |
| Skating | .00 | .00 | .00 | .00 | .00 | .00 | .09 | .86 | .05 | .00 |
| Swing | .00 | .00 | .00 | .00 | .00 | .00 | .05 | .00 | .95 | .00 |
| Walking | .00 | .04 | .00 | .00 | .00 | .05 | .00 | .00 | .00 | .91 |
Algorithm 1: Three-surface motion feature (3SMF) detection algorithm
| Input: Video or Image sequence |
| Output: Feature Vector |
| 1. Image sequence transfer using Eq. ( |
| 2. For each image sequence, calculate the motion history image (MHI) using the frame difference method: |
| 3. For each MHI image |
| (a) Divide into |
| (b) Calculate the gradient of every pixel in |
| (c) Each pixel within a block casts a weighted vote for an orientation-based histogram: |
| (d) Concatenate the block histograms to represent the MHI: |
| 4. Concatenate the MHI features to build the 3SMF feature: |
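
The steps of Algorithm 1 can be sketched end to end as follows. The equations are not reproduced in this listing, so the details here are assumptions: the MHI uses the common decay rule (a moving pixel is set to `tau`, otherwise the previous value is reduced by 1 and floored at 0), and the HOG step is a simplified grid histogram with unsigned orientations and per-cell L2 normalisation. The names `motion_history_image`, `grid_hog`, `three_surface_motion_feature`, and the parameters `threshold`, `tau`, `grid`, and `n_bins` are all hypothetical.

```python
import numpy as np

def motion_history_image(frames, threshold=30.0, tau=None):
    """MHI by frame differencing over a (N, H, W) image sequence (assumed decay rule)."""
    frames = np.asarray(frames, dtype=np.float64)
    if tau is None:
        tau = len(frames) - 1
    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > threshold
        # Moving pixels are reset to tau; all others decay by one step.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

def grid_hog(image, grid=(4, 4), n_bins=9):
    """Simplified HOG: one magnitude-weighted orientation histogram per grid cell."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    rows = np.array_split(np.arange(image.shape[0]), grid[0])
    cols = np.array_split(np.arange(image.shape[1]), grid[1])
    cells = []
    for r in rows:
        for c in cols:
            hist = np.bincount(bins[np.ix_(r, c)].ravel(),
                               weights=magnitude[np.ix_(r, c)].ravel(),
                               minlength=n_bins)
            norm = np.linalg.norm(hist)
            cells.append(hist / norm if norm > 0 else hist)
    return np.concatenate(cells)

def three_surface_motion_feature(video, **hog_kwargs):
    """Concatenate the HOG of the MHI from the XY, XT and YT surface sequences."""
    video = np.asarray(video, dtype=np.float64)  # indexed (T, Y, X)
    surfaces = (video, video.transpose(1, 0, 2), video.transpose(2, 0, 1))
    return np.concatenate(
        [grid_hog(motion_history_image(s), **hog_kwargs) for s in surfaces]
    )

# Synthetic stand-in for a video: 16 frames of 32 x 24 pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(16, 32, 24))
feature = three_surface_motion_feature(video)
print(feature.shape)
```

With the default 4 × 4 grid and 9 orientation bins, each surface contributes 144 values, so the concatenated feature has 432 entries. A production pipeline would more likely rely on an established HOG implementation with block normalisation.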
Algorithm 2: Action recognition through multi-surface analysis
| Input: Video or Image sequence |
| Output: Action category |
| Training: |
| 1. Detect STIPs in the training data: |
| 2. Extract the 3SMF from the training data using Algorithm 1 |
| 3. Train the SVM model on the 3SMF features |
| Testing: |
| 1. Detect STIPs in the testing data |
| 2. Extract the 3SMF feature from the testing data |
| 3. For each feature |
| 4. Use the SVM model to calculate the prior probability |
| 5. Infer the action by Eq. ( |
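
The testing stage can be illustrated with a small NBNN scorer that accepts a non-uniform class prior. The inference equation is not reproduced in this listing, so the score used here is an assumption: the log prior minus the summed squared nearest-neighbour distances scaled by `2 * sigma**2`. With a uniform prior it reduces to the standard NBNN decision rule. The function name `nbnn_with_prior`, the parameter `sigma`, and the toy descriptors are hypothetical; in a real pipeline the prior would come from the SVM trained on 3SMF features (for example, class probabilities from scikit-learn's `SVC(probability=True)`).

```python
import numpy as np

def nbnn_with_prior(query_descriptors, class_descriptors, prior, sigma=1.0):
    """Score one video against every class and return (best_class, scores).

    query_descriptors: (n, d) local descriptors of the test video (e.g. STIP)
    class_descriptors: dict mapping class name -> (m_c, d) training descriptors
    prior:             dict mapping class name -> prior probability of the class
    """
    q = np.asarray(query_descriptors, dtype=np.float64)
    scores = {}
    for name, train in class_descriptors.items():
        t = np.asarray(train, dtype=np.float64)
        # Squared distance from each query descriptor to each class descriptor.
        d2 = ((q[:, None, :] - t[None, :, :]) ** 2).sum(axis=2)
        nn_cost = d2.min(axis=1).sum()  # nearest neighbour per query descriptor
        log_prior = np.log(max(prior[name], 1e-12))  # guard against log(0)
        scores[name] = log_prior - nn_cost / (2.0 * sigma ** 2)
    best = max(scores, key=scores.get)
    return best, scores

# Toy data: one training descriptor per class, one query descriptor.
train = {"A": [[0.0, 0.0]], "B": [[1.0, 0.0]]}
query = [[0.45, 0.0]]  # slightly closer to class A

uniform_choice, _ = nbnn_with_prior(query, train, {"A": 0.5, "B": 0.5})
informed_choice, _ = nbnn_with_prior(query, train, {"A": 0.2, "B": 0.8})
print(uniform_choice, informed_choice)
```

In this toy case the uniform prior picks class A on distance alone, while a prior that favours B changes the decision to B, which is the effect a 3SMF-based prior is meant to have on ambiguous videos.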