BACKGROUND: Human gait has received much attention in recent years as an effective behavioral biometric identifier. However, several challenges reduce its recognition performance. In this work, we aim to improve the performance of gait recognition systems under variations in view angle, which are one of the major challenges for gait algorithms.
METHODS: We propose a view transformation model based on sparse and redundant (SR) representation. More specifically, the proposed method trains a set of corresponding dictionaries, one for each viewing angle, which are then used in the identification of a probe. The view transformation is performed by first obtaining the SR representation of the input image using the dictionary of the source angle, and then multiplying the dictionary of the destination angle by this representation to obtain the corresponding image in the intended angle.
RESULTS: Experiments on the CASIA Gait Database, Dataset B, support the satisfactory performance of our method. In most tests, the proposed method outperforms the other methods in the comparison. This is especially the case for large changes in the view angle, and it also holds for the average recognition rate.
CONCLUSION: A comparison with state-of-the-art methods in the literature showcases the superior performance of the proposed method, especially in the case of large variations in view angle.
Human gait has distinctive advantages as a biometric, such as recognition at a distance and unobtrusiveness. Moreover, a typical gait recognition system does not require high-quality video, which makes it inexpensive and easy to set up, since existing security cameras can be used. These characteristics have made human gait a popular behavioral biometric identifier in the research community in recent years. Gait is particularly applicable in criminal cases; it has already been used in a number of them, for example in identifying a bank robber[1] and a burglar.[2] However, human gait recognition suffers from challenges such as variations in viewpoint,[3] walking speed,[4] clothing,[5] and carrying conditions.[6] Since the gait of the same person varies drastically across viewpoints, and the camera view is generally unconstrained in real scenarios, view variation is one of the most critical challenges in human gait recognition: recognition performance drops dramatically when the viewing angle changes.[7]
This work proposes a robust scheme for gait recognition that is shown to be tolerant to variations in the viewing angle. The proposed scheme takes a cross-view approach, in which the probe and gallery gaits are captured from two distinct viewpoints. Other approaches to this problem include fixed-view, where probe and gallery sequences are captured from the same viewpoint, and multi-view, where the probe sequence is recorded from a single view and is matched against gallery gaits from multiple views. Of these, the cross-view approach is the most common in the literature.
The proposed scheme employs a view transformation model (VTM) based on sparse and redundant (SR) representation. The VTM learns an association between gait features from different viewing angles so that gait features can be mapped from one view to another, which in turn facilitates comparison of a probe gait from one viewpoint with gallery gaits from another viewpoint.
This paper is structured as follows: Section 2 gives a brief review of existing schemes related to the present work. Section 3 provides the necessary background. Section 4 describes the proposed scheme. Section 5 presents empirical results showcasing the performance of the proposed scheme, as well as a comparison with existing work. Section 6 contains some concluding remarks.
Related Work
Human gait recognition schemes are typically categorized as model-based or appearance-based. Model-based approaches[8,9,10,11,12,13,14,15,16] aim at fitting each frame of the input gait sequence to a specific model of the human body. This is achieved by determining the parameters (e.g., joint angles) of the model at hand. The obtained parameters are then used as features for identifying a probe sequence against stored gallery data. In contrast, appearance-based approaches[17,18,19,20,21,22,23,24,25,26,27,28,29,30,31] focus on the shape of the subject's silhouettes in the input frames and use these to compute the desired feature. This usually leads to a single representation for a gait cycle. The most common such representation is the gait energy image (GEI),[17] which is simply the mean, after alignment and normalization, of all silhouettes of a gait cycle.
The model-based approach seems more appropriate for cross-view gait recognition since, given certain geometric assumptions, the calculated model is view-invariant in nature. However, these approaches generally suffer from model fitting errors due to the typically low spatial resolution of the input frames.[32] The appearance-based approach, on the other hand, can recognize a subject even from images of relatively low spatial resolution, but its performance suffers drastically under variations in the viewing angle. While the proposed scheme is appearance-based, it aims at mitigating the challenge of viewpoint variations.
Appearance-based cross-view gait recognition schemes typically fall into one of the following categories:[33] (1) those that use multiple cameras or camera calibration to construct a three-dimensional (3-D) model; (2) those that aim at extracting view-invariant gait features; and (3) those that aim at learning cross-view mapping relationships between gait features.
Schemes following the first approach[14,16,34,35,36,37,38,39] construct a 3-D model using multiple cooperative cameras or camera calibration and then project the obtained 3-D gallery model into a 2-D silhouette. In theory, 2-D gaits for any desired view can be obtained from the 3-D model, yet there are practical limitations to this.[3] This approach is suitable for a fully controlled and cooperative multi-camera setting, for example a biometric tunnel,[40] which is expensive and complicated. In addition, the processes of 3-D reconstruction and 2-D rendering are computationally demanding.
The second approach employs view-invariant features to facilitate cross-view gait recognition. Some examples follow. The method of BenAbdelkader et al.[41] uses a self-similarity plot and achieves good performance over a limited range of view changes. Kale et al.[42] use a perspective projection model to synthesize side-view gaits from an arbitrary view. Jean et al.[43,44] normalize all input data (from any view) onto a fixed plane, thus allowing direct comparison in that plane. Han et al.[45] propose to select only the parts of GEIs that overlap between views to build a representation for cross-view comparison. Finally, Liu et al.[46] propose a joint subspace learning method to mitigate the view variation challenge. This category works well when the angle between the sagittal plane of the person and the image plane is small; otherwise it fails.[33] Furthermore, these methods are sensitive to noise, which negatively affects recognition rates.[3]
The third category of appearance-based cross-view gait recognition schemes maps gait features from one view to another by first learning the projection rule between the two views. Makihara et al.[28] propose a VTM based on singular value decomposition (SVD). In addition to the SVD-based VTM, Kusakunniran et al.[47] optimize the GEI feature through linear discriminant analysis (LDA). Zheng et al.[48] propose a method that obtains the VTM using robust principal component analysis. Other approaches to constructing appropriate VTMs employ tools such as support vector regression,[49] multilayer perceptron,[50] and sparse regression.[3] The method presented by Chen et al.[51] constructs a VTM based on the projection of the gravity center trajectory, and Kusakunniran et al.[29] further improve the performance of this method. The method of Liu and Tan[52] trains LDA subspaces for constructing a VTM, and Bashir et al.[53] use canonical correlation analysis. The method of Kusakunniran et al.[49] reformulates VTM construction as a regression problem. Hu et al.[54] propose a projection named view-invariant discriminative projection. Hu[55] proposes enhanced Gabor gait, a gait feature that includes a nonlinear mapping of statistical and structural characteristics of gait. Muramatsu et al.[56,57] propose to use 3-D training gait models to create a VTM.
Compared with the first category, methods in the third category are more feasible and less expensive since they do not use complicated multi-camera systems. They are also more efficient and stable than those in the second category because they are less sensitive to noise.[3] Moreover, these methods can be applied in scenarios with no explicit interaction with the subjects, and can be directly applied to views that differ significantly from the side view.[58] One limitation of the third category is that these methods rely on supervised learning, which makes it difficult to recognize gait under untrained viewing angles.[3] Some works address this challenge; for instance, Tian et al.[59] propose a view-adaptive mapping approach. However, as mentioned by Yu et al.,[60] small changes in view angle do not affect the recognition rates significantly, so if a sufficient number of cameras is employed, this challenge becomes negligible. Another challenge in the third category is that most of the above-mentioned methods train multiple mapping matrices, one for each pair of viewpoints.[58] In addition, the performance of cross-view gait recognition drops when the change in viewing angle is large.[3] The proposed method tries to mitigate these limitations: it creates one dictionary for each view angle and, as the results show, it performs well for large changes in view angle.
Background
VTM-based human gait recognition
The flowchart in Figure 1 presents a general outline of a VTM-based human gait recognition scheme. Such a scheme generally consists of three phases:[33] training, transformation, and matching. In the training phase, the gait features of multiple training subjects are used to construct the appropriate VTM. In the transformation phase, this VTM is used to compute, from an input gait feature in the source view, the corresponding gait feature in a destination view. Finally, the matching phase calculates a recognition score between the gait features of the probe and those of every gallery subject.
Figure 1
A general framework for VTM-based human gait recognition
Note that prior to these phases, a preprocessing module is needed to remove the background and extract the silhouette from each frame. A simple yet effective method is to record the background in advance and subtract this recorded image from each input frame. The silhouette of the person is then obtained using morphological operations. This method is readily applicable to security cameras in real applications. Other methods for background detection and removal are described by Piccardi.[61] The extracted silhouettes are then passed to a feature extraction module, which computes the desired features and passes them on to the training phase. Another important preprocessing operation is detecting the view angle of the input sequence. In this work, as in most VTM methods, we assume that this angle is known; however, methods such as that of Chtourou et al.[62] can be used to estimate the walking direction.
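The background-subtraction preprocessing described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as NumPy arrays, a hand-picked difference threshold, and a 3x3 morphological opening as the "morphological operations"; the function names, the threshold value, and the choice of opening are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def _neighbourhood(mask):
    """Stack the nine 3x3-neighbourhood shifts of a boolean mask (zero padded)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(mask):
    """Binary erosion with a 3x3 square structuring element."""
    return _neighbourhood(mask).all(axis=0)

def dilate(mask):
    """Binary dilation with a 3x3 square structuring element."""
    return _neighbourhood(mask).any(axis=0)

def extract_silhouette(frame, background, threshold=30):
    """Subtract a pre-recorded background and clean the mask by opening.

    `threshold` is a hypothetical value; it would need tuning per camera.
    """
    # Cast to a signed type so that uint8 subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    mask = diff > threshold
    # Opening (erosion then dilation) removes isolated noise pixels.
    return dilate(erode(mask))

# Toy example: a 10x4 "person" plus one noisy pixel on a flat background.
background = np.full((20, 20), 10, dtype=np.uint8)
frame = background.copy()
frame[5:15, 8:12] = 200   # foreground region
frame[1, 1] = 200         # isolated noise pixel
silhouette = extract_silhouette(frame, background)
```

In this toy case the opening keeps the rectangular foreground and discards the single noisy pixel. Real footage would typically need a more robust background model, such as those surveyed by Piccardi.[61]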
Sparse and redundant representation
The SR representation model has attracted great attention in past decades. The model is used to represent signals and images and performs well in many applications such as noise removal, image separation, and image compression. The main idea behind this model is that each signal can be obtained as a weighted sum of some basic atoms.[63] More specifically,
$$x = D\alpha \tag{1}$$
where $x$ is the signal, $D$ is a full-rank matrix called the dictionary, in which each column is an atom, and $\alpha$ is the SR representation of the signal $x$. In other words, $\alpha$ is a vector containing the weights of the atoms, so we can write
$$\alpha = D^{+}x \tag{2}$$
where $D^{+}$ is a pseudoinverse of $D$. The number of atoms $m$ is typically larger than the signal length $n$ ($m > n$), so the representation is called redundant. Furthermore, an important property of this model is sparsity, meaning that most entries of $\alpha$ are zero. Accordingly, $\alpha$ is defined as the sparsest vector that models $x$ with an error of at most $\varepsilon$:
$$\alpha = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad \|D\alpha - x\|_2 < \varepsilon \tag{3}$$
where $\|\cdot\|_0$ is the $\ell_0$ norm, which counts the nonzero entries of a vector. Solving this problem is NP-hard, but there are approximation approaches that compute $\alpha$ with good precision, such as orthogonal matching pursuit (OMP), which is used in this work.[64]
Another challenge in this model is the dictionary. The dictionary must be rich enough to describe the input signal adequately in a sparse manner. Methods such as K-SVD[63] are used to train and obtain the desired dictionary.
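As an illustration of how the sparse coding problem above can be approximated greedily, the following is a minimal NumPy sketch of orthogonal matching pursuit. It stops after a fixed number of atoms rather than at an error bound, and the function name `omp`, its parameters, and the toy dictionary (identity atoms joined with normalized Hadamard atoms) are assumptions made for the example, not the implementation used in the paper.

```python
import numpy as np

def omp(D, x, sparsity, tol=1e-10):
    """Greedy sparse code of x over D using at most `sparsity` atoms.

    Assumes the columns (atoms) of D have unit norm.
    """
    alpha = np.zeros(D.shape[1])
    support = []                      # indices of the selected atoms
    coeffs = np.zeros(0)
    residual = x.astype(float).copy()
    for _ in range(sparsity):
        if np.linalg.norm(residual) <= tol:
            break
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        correlations[support] = 0.0   # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    alpha[support] = coeffs
    return alpha

# Toy redundant dictionary: 16-dimensional signals, 32 unit-norm atoms.
H = np.array([[1.0]])
for _ in range(4):
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
D = np.hstack([np.eye(16), H / 4.0])

alpha_true = np.zeros(32)
alpha_true[3], alpha_true[20] = 2.0, -1.5   # a 2-sparse representation
x = D @ alpha_true
alpha_hat = omp(D, x, sparsity=2)
```

For this low-coherence toy dictionary a 2-sparse code is recovered exactly; for a learned dictionary, recovery is only approximate in general.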
Proposed Method
We propose a VTM based on SR representation. Given an input from a source view angle $v_s$, this model generates the corresponding output in another (target) view angle $v_t$.
Main idea
Assume that $D_{v_s}$ and $D_{v_t}$ are two dictionaries containing $m$ atoms each. We call these dictionaries a transform pair if and only if, for each $1 \le i \le m$, their $i$-th atoms form a transform pair. Two atoms are called a transform pair if and only if they correspond to the same region in different view angles.
Assume that $D_{v_t}$ and $D_{v_s}$ are two dictionaries for view angles $v_t$ and $v_s$, respectively. If they are a transform pair, then
$$x_{v_t} = D_{v_t}\alpha = D_{v_t}D_{v_s}^{+}x_{v_s} \tag{4}$$
where $x_{v_s}$ and $x_{v_t}$ represent the image in the source and target view angles, and $D_{v_s}$ and $D_{v_t}$ denote the dictionaries corresponding to these angles. Loosely speaking, the input image ($x_{v_s}$) is coded using the corresponding dictionary ($D_{v_s}$) to obtain $\alpha$, which is then multiplied by the dictionary of the target view ($D_{v_t}$) to produce the image in the target view ($x_{v_t}$). The next section describes how to obtain a transform pair of dictionaries.
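Read literally, Equation 4 is a single linear map from the source view to the target view. The sketch below illustrates only that algebra with NumPy. The random matrices merely stand in for a trained transform pair (they are not one), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_t, m = 6, 6, 10               # signal lengths and number of atoms (m > n)

# Stand-ins for the source and target dictionaries (random, NOT trained).
D_s = rng.standard_normal((n_s, m))
D_t = rng.standard_normal((n_t, m))
x_s = rng.standard_normal(n_s)       # source-view signal

alpha = np.linalg.pinv(D_s) @ x_s    # Equation 2: representation of x_s
x_t = D_t @ alpha                    # Equation 4: signal in the target view

# The two steps collapse into one view-transformation matrix.
T = D_t @ np.linalg.pinv(D_s)
```

Note that the pseudoinverse yields the minimum-norm representation, which is generally not sparse; this is consistent with the paper's use of OMP, rather than a pseudoinverse, to compute $\alpha$ in practice.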
Training the VTM
As mentioned before, the K-SVD algorithm is used to train a dictionary from training data. Hence, we can use the samples of each view to train a dictionary for that view. However, in our method, the corresponding atoms of the dictionaries must be correlated. For example, if the patch covering the head area is composed of atoms 1, 3, 6, and 7 of the source-view dictionary $D_{v_s}$, then composing exactly these atoms of the target-view dictionary $D_{v_t}$ should produce the head area in $v_t$. Training the dictionaries independently loses this constraint. To address this issue we propose the scheme presented in Figure 2. In this scheme, the corresponding patches from training samples in both view angles are first extracted and linearized, and then concatenated to form a larger training sample. The samples obtained in this way are used to train a single dictionary ($D$). $D$ is then split horizontally to obtain the desired dictionaries. In this way, corresponding atoms in the two dictionaries are correlated.
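The concatenate, train, and split scheme described above might be sketched as follows. The compact K-SVD and OMP routines are simplified illustrations written for this example, not the authors' implementation; the function names, iteration count, and synthetic training data are assumptions.

```python
import numpy as np

def omp(D, x, sparsity):
    """Greedy sparse code of x over unit-norm dictionary D."""
    support, coeffs = [], np.zeros(0)
    residual = x.copy()
    for _ in range(sparsity):
        scores = np.abs(D.T @ residual)
        scores[support] = -1.0                       # never reselect an atom
        support.append(int(np.argmax(scores)))
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    alpha = np.zeros(D.shape[1])
    alpha[support] = coeffs
    return alpha

def ksvd(Y, n_atoms, sparsity, n_iter=5, seed=0):
    """Minimal K-SVD: alternate sparse coding and per-atom SVD updates."""
    rng = np.random.default_rng(seed)
    # Initialise the atoms with randomly chosen, normalised training samples.
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        A = np.column_stack([omp(D, y, sparsity) for y in Y.T])
        for k in range(n_atoms):
            users = np.flatnonzero(A[k])             # samples that use atom k
            if users.size == 0:
                continue
            A[k, users] = 0.0
            E = Y[:, users] - D @ A[:, users]        # error without atom k
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                        # best rank-1 fit
            A[k, users] = s[0] * Vt[0]
    return D

def train_transform_pair(patches_s, patches_t, n_atoms, sparsity):
    """Columns of patches_s / patches_t are corresponding linearised patches."""
    joint = np.vstack([patches_s, patches_t])        # concatenate the two views
    D = ksvd(joint, n_atoms, sparsity)
    n_s = patches_s.shape[0]
    return D[:n_s], D[n_s:]                          # split D horizontally

# Synthetic stand-in for corresponding training patches from two views.
rng = np.random.default_rng(1)
n_s, n_t, n_atoms, n_samples, sparsity = 16, 16, 24, 120, 3
true_D = rng.standard_normal((n_s + n_t, n_atoms))
codes = np.zeros((n_atoms, n_samples))
for j in range(n_samples):
    idx = rng.choice(n_atoms, sparsity, replace=False)
    codes[idx, j] = rng.standard_normal(sparsity)
joint_samples = true_D @ codes
D_s, D_t = train_transform_pair(joint_samples[:n_s], joint_samples[n_s:],
                                n_atoms, sparsity)
```

Because each atom is learned on concatenated patches, atom i of `D_s` and atom i of `D_t` come from the same joint atom. The joint atoms have unit norm, so the two halves generally do not, which may matter when a half is used on its own for sparse coding.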
Figure 2
The scheme of training view transformation model
Transformation
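Let $p^{v_s}$ be a probe sample captured at the source view angle $v_s$. We first split $p^{v_s}$ into $n$ patches $p_1^{v_s}, p_2^{v_s}, \ldots, p_n^{v_s}$. Each patch $p_i^{v_s}$ is encoded using $D_i^{v_s}$, the dictionary trained using patches from the same location of the training samples in $v_s$. The result is the SR representation of the patch ($\alpha_i^{v_s}$), which is then decoded using the corresponding target-view dictionary $D_i^{v_t}$ to produce $p_i^{v_t}$, the transformed patch in the target view $v_t$. Finally, the transformed patches are merged to form the transformed probe ($p^{v_t}$). More formally, for $i = 1, \ldots, n$:
$$\alpha_i^{v_s} = e\left(p_i^{v_s}, D_i^{v_s}\right), \qquad p_i^{v_t} = d\left(\alpha_i^{v_s}, D_i^{v_t}\right) \tag{5}$$
where $e(\cdot)$ and $d(\cdot)$ denote encoding and decoding, respectively. The transformation process is presented in Figure 3.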
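The split, encode, decode, and merge steps of the transformation can be sketched in NumPy as below. The sketch assumes non-overlapping square patches that tile the image exactly, source and target patches of equal size, and OMP as the encoder; the helper names and the toy dictionaries (which are not trained transform pairs) are hypothetical.

```python
import numpy as np

def omp(D, x, sparsity):
    """Greedy sparse code of x over D (atom scores normalised by atom norm)."""
    norms = np.linalg.norm(D, axis=0) + 1e-12
    support, coeffs = [], np.zeros(0)
    residual = x.copy()
    for _ in range(sparsity):
        scores = np.abs(D.T @ residual) / norms
        scores[support] = -1.0                       # never reselect an atom
        support.append(int(np.argmax(scores)))
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    alpha = np.zeros(D.shape[1])
    alpha[support] = coeffs
    return alpha

def split_patches(image, patch):
    """Row-major list of linearised, non-overlapping patch x patch blocks."""
    h, w = image.shape
    return [image[r:r + patch, c:c + patch].ravel()
            for r in range(0, h, patch) for c in range(0, w, patch)]

def merge_patches(patches, shape, patch):
    """Inverse of split_patches."""
    out = np.zeros(shape)
    it = iter(patches)
    for r in range(0, shape[0], patch):
        for c in range(0, shape[1], patch):
            out[r:r + patch, c:c + patch] = next(it).reshape(patch, patch)
    return out

def transform_view(probe, dicts_s, dicts_t, patch, sparsity):
    """Encode each patch with its source dictionary, decode with the target one."""
    transformed = []
    for p, D_s, D_t in zip(split_patches(probe, patch), dicts_s, dicts_t):
        alpha = omp(D_s, p, sparsity)                # encoding e(.)
        transformed.append(D_t @ alpha)              # decoding d(.)
    return merge_patches(transformed, probe.shape, patch)

# Toy setup: 8x8 image, four 4x4 patches, 32 atoms per patch location.
rng = np.random.default_rng(0)
H = np.array([[1.0]])
for _ in range(4):
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
dicts_s = [np.hstack([np.eye(16), H / 4.0]) for _ in range(4)]
dicts_t = [rng.standard_normal((16, 32)) for _ in range(4)]

src_patches, tgt_patches = [], []
for D_s, D_t in zip(dicts_s, dicts_t):
    alpha = np.zeros(32)
    idx = rng.choice(32, 2, replace=False)
    alpha[idx] = rng.uniform(1.0, 2.0, 2) * rng.choice([-1.0, 1.0], 2)
    src_patches.append(D_s @ alpha)
    tgt_patches.append(D_t @ alpha)
probe = merge_patches(src_patches, (8, 8), 4)
expected = merge_patches(tgt_patches, (8, 8), 4)
transformed = transform_view(probe, dicts_s, dicts_t, patch=4, sparsity=2)
```

In this constructed example the sparse codes are recovered exactly, so the transformed image matches the expected target-view image. With learned dictionaries the result is an approximation.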
Figure 3
The view transformation process
Matching
After obtaining the transformed probe in the same view angle as the gallery, we can compare them and measure the similarity of the probe to each gallery sample using any criterion, such as the Euclidean distance. A classifier such as nearest neighbor is then used to find the most similar sample. However, the transformation described above introduces artifacts, such as a chessboard effect, which affect the recognition process. The chessboard effect could be eliminated by using overlapping patches, but this increases the computational cost significantly. A less costly alternative is to use the SR representation $\alpha_i^{v_s}$ itself as a feature instead of decoding it to obtain the patch. We refer to this feature as the SR feature in the following.
Instead of comparing two images, we may compare their SR representations with respect to the same dictionary $D$. More formally, let $x_1$ and $x_2$ be two patches. Then
$$x_1 = D\alpha_1 \tag{6}$$
$$x_2 = D\alpha_2 \tag{7}$$
Hence, the similarity of $x_1$ and $x_2$ can be estimated by the similarity of $\alpha_1$ and $\alpha_2$. In this way, the computational cost is reduced and the effect of artifacts is mitigated. In addition, using the SR representation suppresses noise and less important data. Evaluation results show that using this representation leads to acceptable recognition rates.
Figure 4 illustrates the transformation and matching processes. The probe sample is split and each patch $p_i^{v_s}$ is encoded using the source-view dictionary $D_i^{v_s}$. A similar process is performed for each patch at location $i$ of the $j$-th gallery sample using the target-view dictionary $D_i^{v_t}$. The dissimilarity of the probe and gallery samples is then calculated from their SR representations.
Figure 4
The transformation and matching processes
Here the dissimilarity function $f(\cdot)$ is the Euclidean distance. Finally, the gallery sample with the minimum distance identifies the subject.
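A minimal sketch of nearest-neighbor matching on SR features follows. It assumes that the per-patch SR vectors of a sample are concatenated into one feature vector before the Euclidean distance is taken; how the patches are aggregated is an assumption of this example, as are the function name and the toy data.

```python
import numpy as np

def identify(probe_codes, gallery_codes, gallery_ids):
    """Return the identity of the nearest gallery sample and its distance.

    probe_codes:   list of per-patch SR vectors for the probe.
    gallery_codes: one such list per gallery sample.
    gallery_ids:   subject identity of each gallery sample.
    """
    # ASSUMPTION: per-patch codes are concatenated into one feature vector.
    probe_vec = np.concatenate(probe_codes)
    dists = [np.linalg.norm(probe_vec - np.concatenate(g))
             for g in gallery_codes]
    best = int(np.argmin(dists))                     # nearest neighbour
    return gallery_ids[best], dists[best]

# Toy example with two patches per sample and two gallery subjects.
probe_codes = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
gallery_codes = [
    [np.array([0.0, 1.0]), np.array([2.0, 0.0])],
    [np.array([1.0, 0.1]), np.array([0.0, 1.9])],
]
gallery_ids = ["subject_a", "subject_b"]
match, distance = identify(probe_codes, gallery_codes, gallery_ids)
```

Here the probe is assigned to the second gallery sample, whose codes are closest to the probe's.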
Computational complexity
In the proposed method, the modules that are critical for computational complexity are training the VTM, which is performed once, and the transformation, which is executed for each probe. According to Rubinstein et al.,[65] the computational complexity of the K-SVD algorithm used for training the VTM is
$$T \approx n\left(K^{2}L + 2NL\right) \tag{9}$$
where $n$ is the number of training signals, $K$ is the target sparsity, $L$ denotes the number of atoms in the dictionary, and $N$ is the signal length.
As mentioned before, we use OMP for the transformation module. Its computational complexity is
$$T = K^{3} + 2KNL \tag{10}$$
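To give a feel for the magnitudes involved, the two expressions can be evaluated for sample values. In the sketch below only the dictionary size L = 40 is taken from the experiments in this paper; the values of n, K, and N are hypothetical and chosen purely for illustration.

```python
def ksvd_ops(n, K, L, N):
    """Operation count of Equation 9: n * (K^2 * L + 2 * N * L)."""
    return n * (K ** 2 * L + 2 * N * L)

def omp_ops(K, L, N):
    """Operation count of Equation 10: K^3 + 2 * K * N * L."""
    return K ** 3 + 2 * K * N * L

# L = 40 follows the selected dictionary size; n, K, N are HYPOTHETICAL.
n, K, L, N = 1000, 5, 40, 80
training_cost = ksvd_ops(n, K, L, N)   # 1000 * (1000 + 6400) = 7,400,000
coding_cost = omp_ops(K, L, N)         # 125 + 32,000 = 32,125
```

Under these assumed values, coding one patch costs on the order of tens of thousands of operations, while training cost grows linearly with the number of training signals.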
Experimental Validation
Dataset
The CASIA gait database, Dataset B,[48,60] which contains sequences from 124 subjects, is used to assess the proposed method. Eleven view angles (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, and 180°) are considered, and for each angle six sequences are recorded with normal clothing. Four sequences are used as gallery samples and the others as probe samples.
Selection of parameter values
Two main parameters affect the performance of the proposed approach: the dictionary size and the patch size. To find appropriate values, we tested multiple values of each parameter. The dictionary size varies from 10 to 80, and seven different patch sizes are considered. The gallery view is 90° and all angles are used as probe views.
To choose the dictionary size, we use the recognition rate averaged over all angles and patch sizes for each dictionary size, as depicted in Figure 5. A dictionary size of 40 gives the best performance.
Figure 5
Average rank-1 recognition rates against dictionary size
To find the appropriate patch size, the dictionary size is fixed at 40 and the recognition rate averaged over all angles is obtained for each patch size, as depicted in Figure 6. A patch size of 80 is the best choice.
Figure 6
Average rank-1 recognition rates against patch size with 40 as dictionary size
Sparse and redundant feature
The performance of the SR representation as a gait feature is investigated in this section. To this end, we compare the SR feature with the GEI feature using Cumulative Matching Score (CMS) curves.[66] The CMS, or Cumulative Matching Characteristic, is a rank-based way of presenting the measured accuracy of a biometric system. The horizontal axis of the CMS graph is the rank and the vertical axis is the recognition rate; the value at rank r is the rate at which the correct subject appears within the first r ranks.
Figure 7 shows the CMS curves of the average recognition rates of the GEI and SR features when the gallery and probe are from the same angle. The SR feature performs almost as well as GEI, and both features perform well when there is no view change. Therefore, using the SR feature instead of GEI is acceptable.
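For reference, a CMS curve can be computed from a probe-by-gallery distance matrix as in the sketch below. The function name and the toy distance values are assumptions made for illustration and are not taken from the experiments.

```python
import numpy as np

def cms_curve(dist, probe_ids, gallery_ids, max_rank):
    """Cumulative matching scores for ranks 1..max_rank.

    dist[i, j] is the distance between probe i and gallery sample j.
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)                 # nearest gallery first
    ranked_ids = gallery_ids[order]                  # identities in ranked order
    hits = ranked_ids == probe_ids[:, None]
    found = hits.any(axis=1)                         # subject is in the gallery
    first_hit = hits.argmax(axis=1)                  # 0-based rank of first match
    return np.array([np.mean(found & (first_hit < r))
                     for r in range(1, max_rank + 1)])

# Toy example: three probes against three gallery subjects.
dist = [[0.1, 0.5, 0.9],
        [0.4, 0.6, 0.2],
        [0.7, 0.3, 0.5]]
cms = cms_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
```

In this toy case the three probes are matched correctly at ranks 1, 3, and 2, so the curve rises from one third at rank 1 to 1.0 at rank 3.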
Figure 7
Average rank-1 recognition rates when probe and gallery are from the same angle using SR and GEI as feature
Recognition rates
In the study by Kusakunniran et al.,[29] the angles 54°, 90°, and 126° of the dataset are used as probe views, while the remaining ten viewing angles are taken as gallery views for training the VTM. We follow these choices in testing the performance of the proposed method. We then use the obtained results to compare the performance of the proposed scheme with the algorithm of Zhaoxiang Zhang et al.[39] and all the algorithms compared therein. It is worth mentioning that the baseline method, as explained by Yu et al.,[60] is a simple method that takes no action to mitigate the view angle challenge. It simply extracts the GEI as the feature, measures distances using the Euclidean distance, and uses nearest neighbor as the classifier. This method is included in the comparison to highlight the effectiveness of the other methods.
Comparisons of the rank-1 recognition rates for the probe views 54°, 90°, and 126° are presented in Figures 8-10. In most tests, the proposed method outperforms the other methods in the comparison. This is especially the case for large changes in the view angle, and it also holds for the average recognition rate. For example, when the probe view is 54°, the recognition rate of the proposed method is 46%, 40%, and 47% higher than that of the second best method with 0°, 162°, and 180° as gallery views, respectively. In addition, the proposed scheme performs approximately 16% better than GII-DPLCR, which is the second best method. Similar results can be observed when the probe view is 90° or 126°.
Figure 8
Performance comparison between rank-1 recognition rates of different methods for probe view 54° and various gallery views in the range from 0° to 180°
Figure 9
Performance comparison between rank-1 recognition rates of different methods for probe view 90° and various gallery views in the range from 0° to 180°
Figure 10
Performance comparison between rank-1 recognition rates of different methods for probe view 126° and various gallery views in the range from 0° to 180°
Figures 11-13 report CMS curves for a gallery view of 90° and various probe views in the range from 36° to 144° (except 90°). These curves show that for the 36°, 54°, and 144° view angles the proposed method outperforms the others, while it has competitive performance for the other view angles.
Figure 11
Plot of CMS curves of the methods in comparison for gallery view 90° and probe view (a) 36° and (b) 54°
Figure 12
Plot of CMS curves of the methods in comparison for gallery view 90° and probe view (a) 72° and (b) 108°
Figure 13
Plot of CMS curves of the methods in comparison for gallery view 90° and probe view (a) 126° and (b) 144°
Concluding Remarks
In this work, we propose a VTM for cross-view gait recognition based on SR representation. We verify the satisfactory performance of the proposed scheme using the CASIA Gait Database, Dataset B. The test results illustrate the superiority of the proposed method in comparison with several state-of-the-art methods, especially in the case of large changes in the view angle. The average recognition rates over all angles are also higher than those of the existing methods.
Financial support and sponsorship
None.
Conflicts of interest
There are no conflicts of interest.
BIOGRAPHIES
Abbas Ghebleh is a PhD student of Software Engineering at Shahid Beheshti University, Tehran, Iran. He received his B.Sc. and M.Sc. in Software Engineering from Shahid Beheshti University, Tehran, Iran (2006 and 2010). His research interests are digital image processing and computer vision, including human gait recognition and action recognition.
Email: a_ghebleh@sbu.ac.ir
Mohsen Ebrahimi Moghaddam has been a professor in the Computer Engineering and Science Department, Shahid Beheshti University, Iran, since 2006. He received his Ph.D., M.Sc., and B.Sc. from Sharif University in Iran. His research interests are image processing and pattern recognition, especially using artificial intelligence techniques, in areas such as image security, watermarking, deblurring, and biometrics. He is the head of the image processing lab in his department.
Email: m_moghadam@sbu.ac.ir