
The approach for action recognition based on the reconstructed phase spaces.

Hong-bin Tu1, Li-min Xia1.   

Abstract

This paper presents a novel method of human action recognition based on the reconstructed phase space. Firstly, the human body is represented by 15 key points, whose trajectories describe the human body behavior, and a modified particle filter is used to track these key points under self-occlusion. Secondly, we reconstruct the phase spaces to extract more useful information from the human action trajectories. Finally, we apply a semisupervised probability model and the Bayes classification method for classification. Experiments are performed on the Weizmann, KTH, and UCF sports datasets and on our own action dataset to test and evaluate the proposed method. The comparative experimental results show that the proposed method is more effective than the compared methods.


Year:  2014        PMID: 25436224      PMCID: PMC4241687          DOI: 10.1155/2014/495071

Source DB:  PubMed          Journal:  ScientificWorldJournal        ISSN: 1537-744X


1. Introduction

Automatic recognition of human actions from image sequences is a challenging problem that has attracted the attention of researchers in the past decades. This has been motivated by applications in entertainment, virtual reality, motion capture, sport training [1-3], medical biomechanical analysis, and so on. In a simple case, where a video is segmented to contain only one execution of a human activity, the objective of the system is to correctly classify the video into its activity category. More generally, the continuous recognition of human activities must be performed by detecting the starting and ending times of all occurring activities from an input video. Aggarwal and Ryoo [4] summarized the general methods as single-layered approaches, hierarchical approaches, and so forth. Single-layered approaches represent and recognize human activities directly from sequences of images, so they are suitable for the recognition of gestures and actions with sequential characteristics. Single-layered approaches are again classified into two types: space-time approaches and sequential approaches. Space-time approaches are further divided into three categories: space-time volume, trajectories, and space-time features. Hierarchical approaches represent high-level human activities by describing them in terms of simpler activities. They can usually be divided into 3 classes: statistical, syntactic, and description-based. Recognition systems composed of multiple layers are constructed, which are suitable for the analysis of complex activities. Among all these methods, the space-time approaches are the most widely used to recognize simple periodic actions such as “walking” and “waving,” because periodic actions generate feature patterns repeatedly and the local features are scale-, rotation-, and translation-invariant in most cases.
However, the space-time volume approach has difficulty recognizing actions when multiple persons are present in the scene, and it requires a large amount of computation for the accurate localization of actions. Besides, it has difficulty recognizing actions that cannot be spatially segmented. The major disadvantage of the space-time feature approach is that it is not suitable for modeling more complex activities. In contrast, the trajectory-based approaches are able to analyze human movements at a detailed level. Furthermore, most of these methods are view-invariant. Therefore, the trajectory-based approaches have been the most extensively studied. Several approaches used the trajectories themselves to represent and recognize actions directly. Sheikh et al. [5] applied a set of 13 joint trajectories in a 4D XYZT space to describe the human action. Yilmaz and Shah [6] presented a methodology to compare action videos by the set of 4D XYZT joint trajectories. Anjum and Cavallaro [7] proposed an algorithm based on the extraction of a set of representative trajectory features. Jung et al. [8] designed a method to detect events by trajectory clustering of objects and 4D histograms. Hervieu et al. [9] used hidden Markov models to capture the temporal causality of object trajectories for unexpected event detection. Wang et al. [10] proposed a nonparametric Bayesian model to analyze trajectories and model semantic regions in surveillance. Wang et al. [11] presented a video representation based on dense trajectories and motion boundary descriptors for recognizing human actions. Yu et al. [12] used an approach based on weighted feature trajectories and concatenated bag-of-features (BOF) to recognize actions. Pao et al. [13] proposed a general user verification approach based on user trajectories, which include on-line game traces, mouse traces, and handwritten characters. Yi and Lin [14] introduced salient trajectories to recognize actions. Du et al. [15] proposed an intuitive approach on videos based on feature trajectories. Psarrou et al. [16] designed a statistical dynamic model to recognize human actions by learning prior and continuous propagation of trajectory models.
These approaches approximated the true motion state by setting constraints on the type of the dynamical model [1]. Above all, they required detailed mathematical and statistical modeling. To solve these problems, we present an approach for action recognition based on the reconstructed phase spaces. The remainder of this paper is organized as follows. Section 2 presents the modified particle filter that is used to track human key points. In Section 3, we reconstruct the phase space of the total data. Section 4 explains the probability generation model. Section 5 explains action classification. Section 6 presents the results and analysis of the proposed approach. Finally, we conclude the paper in Section 7.

2. Human Key Joints Track

The human body [2] is divided into 15 key points, named the 15 key joint points, which represent the human body structure (torso, pelvis, left upper leg, left lower leg, left foot, right upper leg, right lower leg, right foot, left upper arm, left lower arm, left hand, right upper arm, right lower arm, right hand, and head) [17]. The trajectories of these 15 joints represent the human body behavior (the blue dot represents the pelvis, which is the origin of the coordinate system). Another consideration was that these joints are relatively easy to detect and track automatically in real videos, as opposed to the inner body joints, which are more difficult to track. Each key joint traces a trajectory over time, and the 15 trajectories are used to represent different actions. Therefore, we must accurately track the 15 key points of the human body to describe the human behavior. These are illustrated in Figure 1.
Figure 1

The human joints model. The original photo stems from Weizmann dataset [21].

However, it is difficult to track some key points because of occlusion. In this paper, we use a modified particle filter to track these key points. Particle filters are very efficient methods for tracking multiple objects because they can cope with the nonlinearity and multimodality induced by occlusions and background clutter. However, it has been proved that the number of samples required increases exponentially with the size of the state vector to be explored. The reason is that one sample dominates the weight distribution and the rest of the samples are not in statistically significant regions. To solve this problem, we adopt an integrated algorithm based on both particle filters and Markov chain Monte Carlo (MCMC) models [18, 19]. It is based on drift homotopy for stochastic differential equations and on the existing particle filter methodology for multitarget tracking, and it appends an MCMC step after the particle filter resampling step. The MCMC step brings the samples closer to the observation while respecting the target dynamics.

Following [18], we use the notation: $Z_1, \ldots, Z_N$: the noisy observations; $K_1, \ldots, K_N$: the status of the system at particular times; $Z_n = G(X_n, \eta_n)$, $n = 1, \ldots, N$: the observation functions; $g(X_n, Z_n)$: the distribution of the observations; and $E[f(X_n) \mid \{Z_j\}_{j=1}^{n}]$: the conditional expectation. Given a video sequence and labeled samples of object or background pixels on the first frame [20], we have access to noisy observations of the status of the system at particular times. The filtering problem consists of computing estimates of the conditional expectation. Therefore, we compute the conditional density of the state of the system, $p(X_n \mid \{Z_j\}_{j=1}^{n})$, and define a reference density $q(X_n \mid \{Z_j\}_{j=1}^{n})$.

At last, with $N$ samples $X_n^i$ drawn from the reference density, we obtain the weighted sample [18]:
$$E[f(X_n) \mid \{Z_j\}_{j=1}^{n}] \approx \frac{1}{N} \sum_{i=1}^{N} f(X_n^i)\, \frac{p(X_n^i \mid \{Z_j\}_{j=1}^{n})}{q(X_n^i \mid \{Z_j\}_{j=1}^{n})}. \quad (1)$$
By Bayes' rule,
$$p(X_n \mid \{Z_j\}_{j=1}^{n}) \propto g(X_n, Z_n)\, p(X_n \mid \{Z_j\}_{j=1}^{n-1}). \quad (2)$$
We assume that $q(X_n \mid \{Z_j\}_{j=1}^{n}) = p(X_n \mid \{Z_j\}_{j=1}^{n-1})$ and, from (2), we obtain
$$\frac{p(X_n \mid \{Z_j\}_{j=1}^{n})}{q(X_n \mid \{Z_j\}_{j=1}^{n})} \propto g(X_n, Z_n).$$
The approximation in expression (1) becomes
$$E[f(X_n) \mid \{Z_j\}_{j=1}^{n}] \approx \frac{\sum_{i=1}^{N} f(X_n^i)\, g(X_n^i, Z_n)}{\sum_{i=1}^{N} g(X_n^i, Z_n)}.$$
Thus, we can define the (normalized) weights
$$w_n^i = \frac{g(X_n^i, Z_n)}{\sum_{j=1}^{N} g(X_n^j, Z_n)}.$$

The tracking algorithm is described as follows.
Step 1. Sampling: begin with $N$ unweighted samples $X_{n-1}^i$, drawn according to the normalized weights, which approximate $p(X_{n-1} \mid \{Z_j\}_{j=1}^{n-1})$.
Step 2. Prediction: predict by sampling $X_n^i$ from $p(X_n \mid X_{n-1}^i)$.
Step 3. Target-observation association.
Step 4. Update: evaluate the weights $w_n^i$ defined above.
Step 5. Resampling: generate independent uniform random variables $\{\theta^i\}$ ($0 < \theta^i < 1$) and set the $i$th resampled particle to $X_n^j$, where $\sum_{l=1}^{j-1} w_n^l \le \theta^i \le \sum_{l=1}^{j} w_n^l$.
Step 6. Markov chain Monte Carlo tracking: choose a modified drift for n = 1,…, N and k = 1,…, K. Construct a Markov chain [18-20] for $Y_n^i$ with initial value $X_n^i$ (the global state of the system is defined by $X_n$) and stationary distribution $g(Y_n^i, Z_n)\, p(Y_n^i \mid X_{n-1}^i)$.
Step 7. Set $X_n^i = Y_n^i$.
Step 8. Set n = n + 1 and go to Step 1.

Using the tracking algorithm, we obtain the key point trajectories, which are used to recognize human behavior. Figure 2 depicts the results of human target tracking.
Figure 2

The target tracking results. The original photo stems from Weizmann datasets [21].
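
As a rough illustration of the prediction, update, resampling, and MCMC steps of the tracking algorithm in this section, the following Python sketch tracks a single scalar state. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the 1-D constant-drift motion model, the Gaussian observation likelihood standing in for g, the plain random-walk Metropolis move (used here in place of the drift-homotopy construction of [18]), and all names (`track_step`, `metropolis_move`, `drift`, `q_sigma`, `r_sigma`) are hypothetical choices made for the example.

```python
import bisect
import math
import random


def gaussian_pdf(value, mean, sigma):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalized_weights(particles, observation, likelihood):
    """Weights proportional to likelihood(x, z), normalized to sum to 1."""
    raw = [likelihood(x, observation) for x in particles]
    total = sum(raw)
    if total == 0.0:  # degenerate case: fall back to uniform weights
        return [1.0 / len(particles)] * len(particles)
    return [r / total for r in raw]


def resample(items, weights, rng):
    """Multinomial resampling: draw theta ~ U(0, 1) and pick the item whose
    cumulative-weight interval contains theta."""
    cumulative, running = [], 0.0
    for w in weights:
        running += w
        cumulative.append(running)
    chosen = []
    for _ in items:
        theta = rng.random()
        index = min(bisect.bisect_left(cumulative, theta), len(items) - 1)
        chosen.append(items[index])
    return chosen


def metropolis_move(x, target, rng, step=0.3, n_steps=5):
    """A few random-walk Metropolis steps that leave `target` invariant."""
    current, current_p = x, target(x)
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_p = target(proposal)
        if current_p == 0.0 or rng.random() < proposal_p / current_p:
            current, current_p = proposal, proposal_p
    return current


def track_step(particles, observation, rng, drift=1.0, q_sigma=0.5, r_sigma=0.5):
    """One predict / update / resample / MCMC-move cycle (assumed 1-D model)."""
    # Prediction: sample from the assumed dynamics x_n = x_{n-1} + drift + noise.
    predicted = [x + drift + rng.gauss(0.0, q_sigma) for x in particles]
    # Update: weights proportional to the observation likelihood.
    weights = normalized_weights(
        predicted, observation, lambda x, z: gaussian_pdf(z, x, r_sigma))
    # Resample (parent, prediction) pairs so the move step can use the dynamics.
    pairs = resample(list(zip(particles, predicted)), weights, rng)
    moved = []
    for parent, x in pairs:
        def target(y, parent=parent):
            # Likelihood times transition density given the parent particle.
            return (gaussian_pdf(observation, y, r_sigma)
                    * gaussian_pdf(y, parent + drift, q_sigma))
        moved.append(metropolis_move(x, target, rng))
    return moved


# Usage on synthetic data: a state that advances by 1.0 per frame.
rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(200)]
true_state = 0.0
for _ in range(10):
    true_state += 1.0
    z = true_state + rng.gauss(0.0, 0.5)
    particles = track_step(particles, z, rng)
estimate = sum(particles) / len(particles)
```

Resampling (parent, prediction) pairs rather than predictions alone is a design choice of this sketch: it lets the move step evaluate the transition density, so that the samples are pulled toward the observation without ignoring the assumed dynamics.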

3. Phase Space Reconstruction

At present, phase space reconstruction is used in many research fields. de Martino et al. [22] constructed the trajectory space with reference to the phase space of a dynamic system. Paladin and Vulpiani [23] presented the embedding trajectory dimension, which is similar to reconstructing the embedding dimension of the phase space of a dynamic system. Fang and Chan [24, 25] presented an unsupervised ECG-based identification method based on phase space reconstruction in order to avoid picking up characteristic points. Nejadgholi et al. [26] used phase space reconstruction to recognize heartbeat types. In this paper, we use phase space reconstruction for human action recognition. We use linear dynamic systems instead of the traditional gradient and optical flow features of interest points to recognize actions. The linear dynamic system [27] is suitable for temporally ordered data and has been used in several computer vision applications, such as tracking, human recognition from gait, and dynamic texture. The temporal evolution of a measurement vector can be modeled by a dynamic system; in this case, we use the linear dynamic system as the spatiotemporal model. For such time series, it is sometimes necessary to search for patterns not only in the time series itself but also in a higher-dimensional transformation of the time series. We estimate the delay time τ and the embedding dimension d of the reconstructed phase space in order to extract more useful information from human action trajectories. These parameters can be computed as follows.

The phase portrait of a dynamic system [28] described by a one-dimensional time series of measured scalar values x(t) can be reconstructed in a k-dimensional state space. From the time-series signal, we can construct an m-dimensional signal x(t). We define [28] a dynamical system as a possibly nonlinear map that represents the temporal evolution of the state variables. de Martino et al. [22] pointed out that, under suitable hypotheses, the phase space reconstructed according to Takens' theorem is equivalent to the original attractor if m is large enough. Each point in the phase space is calculated according to [26]:
$$X_n = [x_n, x_{n+\tau}, \ldots, x_{n+(d-1)\tau}], \quad n = 1, \ldots, N-(d-1)\tau,$$
where $x_n$ is the nth point in the time series, τ is the time lag (delay time), N is the number of points in the time series, and d is the dimension of the phase space. From now on, η is used to denote this set of body model variables describing human motion.

López-Méndez and Casas and Takens [28, 29] showed that, for large enough m (the embedding dimension), the reconstructed phase space is homeomorphic to the true dynamical system that generated the time series. We use Takens' theorem to reconstruct state spaces by time-delay embedding. In our case, the quantities [26, 28] are defined as follows: η is the temporal evolution, a scalar time series that we want to characterize; Y is a point in the reconstructed phase space; m is the embedding dimension; and τ is the embedding delay. Therefore, the phase space can be reconstructed by stacking sets of m temporally spaced samples, for large enough m. The embedding delay τ determines the properties of the reconstructed phase space. First, the embedding delay is determined using the mutual information method [26], and the estimated delay is then used to obtain the appropriate embedding dimension [30]. Once both the embedding delay and the embedding dimension have been estimated, the reconstruction is performed [26] by the time-delay embedding given above. We use the phase spaces as signatures, where each one of the model variables constitutes a time series with its own reconstructed phase space. Modeling each variable as an independent scalar time series [28] provides better performance for recognizing actions. Therefore, we obtain the phase space corresponding to each point trajectory, including both occluded and nonoccluded joint points.

Besides, we choose the Kolmogorov-Sinai entropy [31, 32] as another feature for analyzing the dynamics of human action. The K-S entropy ($H_{KS}$) is the average entropy per unit time; we define it as in [32]. Therefore, each trajectory of a human action is described by a 3-dimensional feature vector, each key joint by a 9-dimensional feature vector, and each action by a 90-dimensional feature vector. Figure 3 shows the reconstructed phase space of the total joint point.
Figure 3

Examples of the reconstructed phase space of the missing data. Hip rotations for walking, jogging, and running actions in the KTH dataset [33]. (a) shows the original images. (b) shows the reconstructed phase space of the missing data: (b1) shows the phase space reconstruction of right foot motion, (b2) shows the phase space reconstruction of right elbow motion, and (b3) shows the phase space reconstruction of right elbow motion. (c) shows the reconstructed phase space of the total occluded joint points: (c1) shows the phase space reconstruction of walking, (c2) shows the phase space reconstruction of jogging, and (c3) shows the phase space reconstruction of running.
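
The time-delay embedding described in this section, which stacks m samples spaced τ apart into one phase-space point, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `delay_embed`, the forward-delay convention, and the synthetic sine-wave "joint trajectory" are assumptions made for the example, and the choice m = 3, τ = 8 is arbitrary rather than estimated by mutual information as in the text.

```python
import math


def delay_embed(series, m, tau):
    """Return the delay vectors [x[n], x[n + tau], ..., x[n + (m - 1) * tau]]
    for every n at which the full vector fits inside the series."""
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    n_vectors = len(series) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series is too short for this m and tau")
    return [[series[n + k * tau] for k in range(m)] for n in range(n_vectors)]


# Usage: embed a synthetic periodic coordinate of one key point.
trajectory = [math.sin(0.2 * t) for t in range(100)]
points = delay_embed(trajectory, m=3, tau=8)
```

A series of N samples yields N − (m − 1)τ phase-space points, so short or heavily occluded trajectories limit how large m and τ can be.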

4. Probability Generation Model

There are only a few labeled actions, whereas a large number of unlabeled actions need to be recognized. Therefore, we use a semisupervised probability model. It is assumed [34] that an action is generated by a mixture generative model with component distribution functions $p(x \mid \theta_i)$, so that the generative model [34] is
$$p(x \mid \theta) = \sum_{i} p(\theta_i)\, p(x \mid \theta_i).$$
For human action images, it is generally assumed that the distribution of the feature space is approximately Gaussian or multinomial. Here x is the feature vector of a training sample, $p(\theta_i)$ is the probability of the sample belonging to the ith class, and $\theta_i$ represents the object class and the covariance matrix of the pixels. The likelihood function [34] has two parts: the first, over the labeled samples, is the supervised classification part, and the second, over the unlabeled samples, is the unsupervised part. Together they give the log-likelihood function, which builds the relationship between the unlabeled samples and the learning samples.

The parameters are estimated with EM, an iterative algorithm with two main steps: expectation and maximization.
E-step: this step predicts the label of each unlabeled sample using the parameters from the previous iteration in (21), where p is the current prediction of the model for the ith unlabeled sample conditioned on the current distribution parameters, M−1 is the previous state value, and M is the current state value.
M-step: we calculate the current parameters by maximizing the likelihood function, where $p(\theta_k)$ is the posterior distribution of the kth category, COV(•) is the covariance matrix, u is the number of unlabeled samples, j is the number of labeled samples, l is the number of labeled samples within the jth class, and x′ is the kth labeled sample within the jth class.
When the change of the likelihood function between two iterations falls below the threshold, we stop the iteration and export the parameters. The threshold is determined empirically as 0.06.
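
To make the E-step and M-step above concrete, here is a minimal Python sketch of semisupervised EM for a Gaussian mixture. It rests on simplifying assumptions that are not taken from this paper: features are 1-D scalars, each class is modeled by a single Gaussian, and the function name `semi_supervised_em` and the synthetic data are hypothetical. Only the stopping threshold of 0.06 on the change in log-likelihood comes from the text.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, n_classes, tol=0.06, max_iter=100):
    """labeled: list of (x, class_index); unlabeled: list of x.
    Returns (priors, means, variances), one entry per class."""
    # Initialize every class from its labeled samples only.
    priors, means, variances = [], [], []
    for k in range(n_classes):
        xs = [x for x, y in labeled if y == k]
        mean = sum(xs) / len(xs)
        priors.append(len(xs) / len(labeled))
        means.append(mean)
        variances.append(sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-6)

    def log_likelihood():
        # Supervised part: labeled samples with their known class.
        ll = sum(math.log(priors[y] * gaussian_pdf(x, means[y], variances[y]))
                 for x, y in labeled)
        # Unsupervised part: unlabeled samples under the full mixture.
        ll += sum(math.log(sum(priors[k] * gaussian_pdf(x, means[k], variances[k])
                               for k in range(n_classes)))
                  for x in unlabeled)
        return ll

    previous = log_likelihood()
    for _ in range(max_iter):
        # E-step: class posteriors of each unlabeled sample.
        resp = []
        for x in unlabeled:
            joint = [priors[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(n_classes)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: labeled samples count with weight 1, unlabeled with resp.
        for k in range(n_classes):
            lab = [x for x, y in labeled if y == k]
            weight = len(lab) + sum(r[k] for r in resp)
            mean = (sum(lab) + sum(r[k] * x for r, x in zip(resp, unlabeled))) / weight
            var = (sum((x - mean) ** 2 for x in lab)
                   + sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, unlabeled))) / weight
            priors[k] = weight / (len(labeled) + len(unlabeled))
            means[k] = mean
            variances[k] = var + 1e-6
        current = log_likelihood()
        if abs(current - previous) < tol:
            break
        previous = current
    return priors, means, variances


# Usage: six labeled samples and one hundred unlabeled ones from two classes.
rng = random.Random(1)
labeled = [(-1.0, 0), (0.0, 0), (1.2, 0), (4.0, 1), (5.0, 1), (6.1, 1)]
unlabeled = ([rng.gauss(0.0, 1.0) for _ in range(50)]
             + [rng.gauss(5.0, 1.0) for _ in range(50)])
priors, means, variances = semi_supervised_em(labeled, unlabeled, n_classes=2)
```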

5. Action Classification

We recognize human actions with the Bayes classification method [35, 36], using the trained classified samples. Because our generative model is based on the assumption of a Gaussian mixture distribution, the class-conditional density is Gaussian, where μ is the mean vector and Σ is the covariance matrix. The operation of the classifier is shown in Algorithm 1.
Algorithm 1
The result of human action recognition is therefore the class with the largest posterior probability.
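
A Bayes classifier with Gaussian class models can be sketched as follows. This is an illustrative sketch under assumptions that are not from the paper: the covariance matrix of each class is taken to be diagonal so that no matrix inversion is needed, and the function names, class labels, and parameter values are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with a diagonal covariance matrix."""
    return sum(-0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
               for xi, m, v in zip(x, mean, var))


def classify(x, classes):
    """classes maps a label to (prior, mean vector, variance vector).
    Returns the label with the largest log-posterior score."""
    scores = {label: math.log(prior) + log_gaussian_diag(x, mean, var)
              for label, (prior, mean, var) in classes.items()}
    return max(scores, key=scores.get)


# Usage with two made-up action classes in a 2-D feature space.
classes = {
    "walking": (0.5, [0.0, 1.0], [1.0, 1.0]),
    "running": (0.5, [3.0, 4.0], [1.0, 1.0]),
}
label_a = classify([0.2, 0.9], classes)
label_b = classify([2.8, 4.2], classes)
```

Working with log-probabilities avoids numerical underflow when the feature vector has many dimensions, such as the 90-dimensional action descriptor described in Section 3.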

6. Experimental Result

In this section, four action datasets are first used to evaluate the proposed approach: the Weizmann human motion dataset [21], the KTH human action dataset [33], the UCF sports action dataset [37], and our action dataset (Table 8). Secondly, we compare our method with some other popular methods on these action datasets. The experiments are run on a Pentium 4 machine with 2 GB of RAM, with the implementation in MATLAB, similar to [3]. Representative frames of these datasets are shown in Figure 4.
Table 8

Comparison with other approaches on our action dataset.

Method | Average recognition rate (%)
The proposed method | 90.30
Martínez-Contreras et al. [38] | 88.80
Chaaraoui et al. [39] | 89.60
Zhang and Gong [40] | 87.10
Figure 4

Sample frames from our datasets. The action labels in each dataset are as follows: (a) KTH dataset [33]: walking (a1), jogging (a2), running (a3), boxing (a4), and hand clapping (a5); (b) Weizmann dataset [21]: bending (a1), jumping jack (a2), jumping forward on two legs (a3), jumping in place on two legs (a4), running (a5), galloping sideways (a6), walking (a7), waving one hand (a8), and waving two hands (a9); (c)UCF sports action dataset [37]: diving(a1), golf swinging (a2), kicking (a3), lifting (a4), horseback riding (a5), running (a6), skating (a7), swinging (a8), and walking (a9); (d) our action dataset: walking (a1), jogging (a2), running (a3), boxing (a4), and handclapping (a5).

6.1. Evaluation on KTH Dataset

The KTH dataset, provided by Schuldt et al. [33], contains 2391 video sequences with 25 actors showing six actions. Each action is performed in 4 different scenarios. The dataset includes the following human actions: walking (a1), jogging (a2), running (a3), boxing (a4), and hand waving (a5). Representative frames of this dataset are shown in Figure 4(a). The classification results are shown in Table 1.
Table 1

Confusion matrix for KTH dataset.

   | a1   | a2   | a3   | a4   | a5
a1 | 0.95 | 0.01 | 0.02 | 0.00 | 0.01
a2 | 0.01 | 0.93 | 0.02 | 0.10 | 0.00
a3 | 0.00 | 0.02 | 0.90 | 0.00 | 0.01
a4 | 0.01 | 0.00 | 0.00 | 0.92 | 0.30
a5 | 0.03 | 0.00 | 0.02 | 0.00 | 0.82

6.2. Evaluation on Weizmann Dataset

The Weizmann dataset, established by Blank et al. [21], contains 83 video sequences showing nine different people, each performing nine different actions: bending (a1), jumping jack (a2), jumping forward on two legs (a3), jumping in place on two legs (a4), running (a5), galloping sideways (a6), walking (a7), waving one hand (a8), and waving two hands (a9). Representative frames of this dataset are shown in Figure 4(b). The classification results are shown in Table 2.
Table 2

Confusion matrix for the Weizmann dataset.

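   | a1   | a2   | a3   | a4   | a5   | a6   | a7   | a8   | a9
a1 | 1.00 | 0.01 | 0.02 | 0.00 | 0.20 | 0.00 | 0.10 | 0.05 | 0.02
a2 | 0.01 | 0.96 | 0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00
a3 | 0.00 | 0.00 | 0.80 | 0.10 | 0.13 | 0.00 | 0.02 | 0.01 | 0.00
a4 | 0.00 | 0.01 | 0.00 | 0.95 | 0.00 | 0.20 | 0.04 | 0.00 | 0.00
a5 | 0.00 | 0.01 | 0.00 | 0.00 | 0.85 | 0.00 | 0.00 | 0.30 | 0.02
a6 | 0.01 | 0.00 | 0.03 | 0.00 | 0.05 | 0.91 | 0.02 | 0.00 | 0.01
a7 | 0.00 | 0.03 | 0.00 | 0.00 | 0.01 | 0.00 | 0.94 | 0.00 | 0.02
a8 | 0.00 | 0.03 | 0.04 | 0.10 | 0.00 | 0.00 | 0.00 | 0.98 | 0.00
a9 | 0.00 | 0.00 | 0.20 | 0.00 | 0.10 | 0.00 | 0.00 | 0.03 | 1.00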

6.3. Evaluation on UCF Sports Action Dataset

The UCF sports action dataset consists of several actions from various sporting events, collected from broadcast television channels. The actions in this dataset include diving (a1), golf swinging (a2), kicking (a3), lifting (a4), horseback riding (a5), running (a6), skating (a7), swinging (a8), and walking (a9). Representative frames of this dataset are shown in Figure 4(c). The classification results are shown in Table 3.
Table 3

Confusion matrix for the UCF sports dataset.

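   | a1   | a2   | a3   | a4   | a5   | a6   | a7   | a8   | a9
a1 | 0.97 | 0.02 | 0.01 | 0.00 | 0.15 | 0.00 | 0.10 | 0.05 | 0.02
a2 | 0.00 | 0.95 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.03 | 0.00
a3 | 0.01 | 0.00 | 0.82 | 0.15 | 0.10 | 0.00 | 0.02 | 0.02 | 0.00
a4 | 0.00 | 0.00 | 0.00 | 0.92 | 0.10 | 0.10 | 0.00 | 0.00 | 0.00
a5 | 0.00 | 0.01 | 0.20 | 0.00 | 0.88 | 0.00 | 0.00 | 0.10 | 0.02
a6 | 0.01 | 0.00 | 0.02 | 0.00 | 0.05 | 0.93 | 0.05 | 0.01 | 0.02
a7 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.92 | 0.00 | 0.02
a8 | 0.00 | 0.02 | 0.03 | 0.10 | 0.00 | 0.00 | 0.00 | 0.97 | 0.00
a9 | 0.00 | 0.10 | 0.30 | 0.04 | 0.10 | 0.00 | 0.00 | 0.00 | 1.00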

6.4. Evaluation on Our Action Dataset

Our action dataset consists of behavior videos that we captured in the laboratory. It contains five types of human actions: walking (a1), jogging (a2), running (a3), boxing (a4), and handclapping (a5). Some sample frames are shown in Figure 4(d). The classification results achieved by this approach are shown in Table 4.
Table 4

Confusion matrix for our dataset.

   | a1   | a2   | a3   | a4   | a5
a1 | 0.98 | 0.00 | 0.00 | 0.01 | 0.02
a2 | 0.00 | 0.96 | 0.01 | 0.00 | 0.00
a3 | 0.00 | 0.02 | 0.87 | 0.01 | 0.00
a4 | 0.00 | 0.20 | 0.00 | 0.88 | 0.02
a5 | 0.02 | 0.10 | 0.00 | 0.00 | 0.86

6.5. Algorithm Comparison

In this section, we compare the proposed method with three methods, those of Martínez-Contreras et al. [38], Chaaraoui et al. [39], and Zhang and Gong [40], on the four datasets. Tables 5, 6, and 7 show that these methods have lower recognition accuracy under complex occlusion and for complex beating, motion, and other group actions. The average accuracy of our method is higher than that of the comparative methods.
Table 5

Comparison with other approaches on KTH action dataset.

Method | Average recognition rate (%)
The proposed method | 92.30
Martínez-Contreras et al. [38] | 89.20
Chaaraoui et al. [39] | 91.20
Zhang and Gong [40] | 90.60
Table 6

Comparison with other approaches on the Weizmann action dataset.

Method | Average recognition rate (%)
The proposed method | 89.10
Martínez-Contreras et al. [38] | 85.10
Chaaraoui et al. [39] | 87.20
Zhang and Gong [40] | 85.40
Table 7

Comparison with other approaches on UCF sports action dataset.

Method | Average recognition rate (%)
The proposed method | 91.10
Martínez-Contreras et al. [38] | 85.20
Chaaraoui et al. [39] | 87.30
Zhang and Gong [40] | 88.60
The experimental results show that the proposed approach achieves satisfactory results and overcomes these problems, as seen by comparing its average accuracy with that of [38-40].

7. Conclusions and Future Work

In this paper, we present a novel method of human action recognition based on the reconstructed phase space. Firstly, the human body is represented by 15 key points, whose trajectories describe the human body behavior, and a modified particle filter is used to track these key points under self-occlusion. Secondly, we reconstruct the phase space to extract more useful information from the human action trajectories. Finally, we use a semisupervised probability model and the Bayes classification method for classification. Experiments were performed on the Weizmann, KTH, and UCF sports datasets and on our own action dataset to test and evaluate the proposed method. The comparative experimental results showed that the proposed method was more effective than the compared methods. Our future work will add complex event detection based on the phase space action representation, together with action learning and a theoretical analysis of their relationship. This involves more complex problems, such as dealing with more variable motion and interpersonal occlusions.
  3 in total

1.  Using phase space reconstruction for patient independent heartbeat classification in comparison with some benchmark methods.

Authors:  Isar Nejadgholi; Mohammad Hasan Moradi; Fatemeh Abdolali
Journal:  Comput Biol Med       Date:  2011-05-04       Impact factor: 4.589

2.  Actions as space-time shapes.

Authors:  Lena Gorelick; Moshe Blank; Eli Shechtman; Michal Irani; Ronen Basri
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2007-12       Impact factor: 6.226

3.  The complex action recognition via the correlated topic model.

Authors:  Hong-bin Tu; Li-min Xia; Zheng-wu Wang
Journal:  ScientificWorldJournal       Date:  2014-01-16