
An Interactive Image Segmentation Method in Hand Gesture Recognition.

Disi Chen¹, Gongfa Li², Ying Sun³, Jianyi Kong⁴, Guozhang Jiang⁵, Heng Tang⁶, Zhaojie Ju⁷, Hui Yu⁸, Honghai Liu⁹.

Abstract

To improve the recognition rate of hand gestures, a new interactive image segmentation method for hand gesture recognition is presented, and popular methods, e.g., graph cut, random walker, and interactive image segmentation using geodesic star convexity, are studied in this article. A Gaussian Mixture Model is employed for image modelling, and its parameters are learned by iterating the Expectation-Maximization algorithm. We apply a Gibbs random field to the image segmentation and minimize the Gibbs energy using the min-cut theorem to find the optimal segmentation. The segmentation result of our method is tested on an image dataset and compared with other methods by estimating the region accuracy and boundary accuracy. Finally, five kinds of hand gestures in different backgrounds are tested on our experimental platform using the sparse representation algorithm, showing that segmenting hand gesture images helps to improve the recognition accuracy.


Keywords: Gibbs Energy; image segmentation; min-cut/max-flow algorithm; sparse representation


Year: 2017 PMID: 28134818 PMCID: PMC5336094 DOI: 10.3390/s17020253

Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576

1. Introduction

Hand gesture recognition, used as a visual input for controlling computers, is one of the most important aspects of human-computer interaction [1]. Compared with traditional input methods, such as mice, keyboards and data gloves [2,3], using hand gestures to control computers greatly reduces the user’s learning curve and further expands the application scenarios. To achieve hand gesture control [4], much research has been carried out by pioneers in the field. Sophisticated data gloves can capture every single movement of the finger joints with highly sensitive sensors [5,6] and store the hand gesture data. The hand gesture recognition process based on computer vision is illustrated in Figure 1. However, some essential problems have yet to be solved. Firstly, vision-driven hand gesture recognition is highly dependent on the sensitivity of image sensors, so relatively poor image quality hinders its development. Secondly, the image processing algorithms are not as robust as they are supposed to be: some cannot complete the segmentation correctly, while others meet the accuracy demands but require too many human interactions [7], which is inefficient in real applications.

Figure 1

Process of hand gesture recognition.

To address the above problems, the image sensor industry has grown rapidly on the back of cutting-edge technologies. On the one hand, new kinds of image sensors, such as the Microsoft Kinect 2.0 and the Asus Xtion, have come onto the commercial market [8], and the innovative infrared camera [9] makes it possible to obtain depth information from image sensors. On the other hand, innovations in image processing algorithms have made them capable of segmenting hand gestures accurately, which in turn improves the accuracy of the classifiers that assign gestures to different patterns. Image segmentation is an important stage in the whole hand gesture recognition process, and several well-known segmentation methods have been proposed to meet different image segmentation demands. For example, in the graph cut method [10], proposed by Boykov and Jolly, the main idea is to divide one image into “object” and “background”. A gray scale histogram is established to describe the distribution of gray levels, and then a cut is drawn to divide the object and background. The max-flow/min-cut algorithm is applied to minimize the energy function of the cut, and the segmentation is given by this minimized cut. These algorithms not only focus on the whole image, but also take every morphological detail into account. Random walker [11,12] is another supervised image segmentation method, in which the image is viewed as an electric circuit. The edges are replaced by passive linear resistors, and the weight of each edge equals the electrical conductance. It was reported to segment better than the graph cut method. Gulshan et al. [13] proposed an interactive image segmentation method that regards shape as a powerful cue for object recognition, making the problem well posed. The use of geodesic star convexity gives it a much lower error rate than Euclidean star convexity. In the process of hand gesture recognition [13], feature extraction is also very important. Image feature methods such as HOG [14], Hu invariant moments [15] and Haar [16] are used. In this paper, as for classifier and template matching algorithms, sparse representation is applied, since it requires far fewer training samples. With the intention of recognising five different hand gestures, a dictionary is built from the dataset of hand gesture images. The K-SVD [17] algorithm is then used for sample training, and the algorithm is evaluated and compared with other methods.

2. Modelling of Hand Gesture Images

In order to optimize the segmentation, the human visual system was carefully studied. Our eyes usually get a fuzzy picture of the whole scene at first, and then saccadic eye movements [18] help us to obtain the details of regions of interest. Inspired by the human visual system, we used the Gaussian Mixture Model (GMM) [19] to get an overall view of the color distribution of the image. Since color images are mainly represented in digital formats, with tens of thousands of pixels in one image, each made up of red, green and blue sub-pixels, as shown in Figure 2, an M × N × 3 array is used to store the color information of one image, where M is the horizontal resolution and N is the vertical resolution.

Figure 2

The RGB format hand gesture image.

2.1. Single Gaussian Model

The single Gaussian distribution, also known as the normal distribution [20], was proposed by the French scientist de Moivre in 1733. The probability density function of a single Gaussian distribution is given by:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{1}$$

where μ is the mathematical expectation or mean, σ is the standard deviation (σ² is the variance) of the Gaussian distribution, and exp denotes the exponential function. For convenience, the single Gaussian distribution is usually denoted as:

$$x \sim N(\mu, \sigma^2) \tag{2}$$

The single Gaussian distribution is capable of dealing with gray scale pictures, because the variable x has only one dimension. A color image is an M × N × 3 array, so any element in the dataset should be at least 3-dimensional. To address this problem, the multi-dimensional Gaussian distribution is introduced. The d-dimensional Gaussian distribution is defined as:

$$N(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\right) \tag{3}$$

where x is a d-dimensional vector and, for the RGB model, each component of μ represents the average red, green and blue color density value. Σ is the covariance matrix, |Σ| is its determinant and Σ⁻¹ is its inverse matrix. (x − μ)ᵀ is the transpose of (x − μ). To simplify Equation (3) above, θ is introduced to represent the parameters μ and Σ, so that the probability density function of the d-dimensional Gaussian distribution can be written as:

$$N(\mathbf{x}; \theta), \qquad \theta = \{\boldsymbol{\mu}, \Sigma\} \tag{4}$$

According to the law of large numbers, every pixel is one sample of the real scene. When the resolution is high enough, the average color density can be estimated.
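As an illustration of the d-dimensional density in Equation (3), the following is a minimal NumPy sketch, not the authors' implementation. The function name `gaussian_pdf` and its argument layout are assumptions made for this example; pixels are passed as rows of an array.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the d-dimensional Gaussian density at each row of x.

    x: (n, d) array of samples (or a single d-vector), mu: (d,) mean,
    sigma: (d, d) covariance matrix. All names are illustrative.
    """
    x = np.atleast_2d(x).astype(float)          # shape (n, d)
    d = x.shape[1]
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # instead of forming the inverse explicitly.
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

# Density of a standard 3-D Gaussian at its mean, e.g. for an RGB pixel.
peak = gaussian_pdf(np.zeros(3), np.zeros(3), np.eye(3))[0]
```

With d = 1 the same function reduces to the single Gaussian density, which is a convenient sanity check.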

2.2. Gaussian Mixture Model of RGB Image

In reality, the color distributions of the gesture image in Figure 2 can be represented by three histograms [21], shown in Figure 3. With independent red, green and blue distributions shown in Figure 3, we can notice that the gesture image cannot be exactly described by one single Gaussian model. But there are about five peaks in each histogram, so five single Gaussian models should be applied in gesture image modelling.

Figure 3

Color distributions of the gesture image. (a) Red distribution; (b) green distribution; (c) blue distribution.

The GMM is introduced to approximate a continuous probability distribution by increasing the number of single Gaussian models. The probability density function of a GMM with k mixed Gaussian models becomes:

$$p(\mathbf{x}; \Theta) = \sum_{i=1}^{k} \pi_i\, N_i(\mathbf{x}; \theta_i) \tag{5}$$

where i indicates which single Gaussian model the component belongs to. π_i is the mixing coefficient of the i-th mixed component [22], i.e., the prior probability of x belonging to the i-th single Gaussian model, and $\sum_{i=1}^{k} \pi_i = 1$. N_i(x; θ_i) is the probability density function of the i-th single Gaussian model, parameterized by μ_i and Σ_i in θ_i. Θ is introduced as a parameter set [23], {π_1, …, π_k, θ_1, …, θ_k}, to denote all π_i and θ_i. As mentioned above, one RGB hand gesture image can be described by the dataset X = {x_1, …, x_n}, where n is the number of pixels, and if we regard X as a sample, its probability density is:

$$p(X; \Theta) = \prod_{j=1}^{n} p(\mathbf{x}_j; \Theta) = L(\Theta; X)$$

where L(Θ; X) is called the likelihood function of the parameters Θ given the sample X. We then hope to find a parameter set Θ that completes the modelling. According to the maximum likelihood method [24], our next task is to find Θ̂, where:

$$\hat{\Theta} = \arg\max_{\Theta} L(\Theta; X)$$

The functions L(Θ; X) and p(X; Θ) have the same form, but since we are now going to use X to estimate Θ, Θ becomes the variable and X the fixed parameter, so the second notation is used. The value of L(Θ; X) is usually too small to be represented by a computer, so we replace it with the log-likelihood function [25]:

$$\ln L(\Theta; X) = \sum_{j=1}^{n} \ln \sum_{i=1}^{k} \pi_i\, N_i(\mathbf{x}_j; \theta_i) \tag{10}$$
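The log-likelihood of a Gaussian mixture can be evaluated without ever forming the tiny product of per-pixel densities. The sketch below is one way to do this under stated assumptions: the function name `gmm_log_likelihood` and its arguments are hypothetical, and the log-sum-exp trick used for numerical stability is a standard technique rather than something the text specifies.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, covs):
    """Sum over pixels of ln(sum_i pi_i * N_i(x_j)), computed in log space.

    X: (n, d) pixels; weights: k mixing coefficients; means: k vectors of
    length d; covs: k (d, d) covariance matrices. Names are illustrative.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    log_terms = np.empty((len(weights), n))
    for i, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(cov)
        # ln(pi_i) + ln N_i(x_j) for every pixel j
        log_terms[i] = np.log(w) - 0.5 * (maha + logdet + d * np.log(2.0 * np.pi))
    # log-sum-exp over the k components, then sum over pixels
    m = log_terms.max(axis=0)
    return float(np.sum(m + np.log(np.exp(log_terms - m).sum(axis=0))))

# Two pixels at the mean of a single standard 3-D Gaussian.
ll = gmm_log_likelihood(np.zeros((2, 3)), [1.0], [np.zeros(3)], [np.eye(3)])
```

Splitting one component into two identical halves with weights 0.5 each leaves the value unchanged, which is a quick consistency check on the mixture sum.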

2.3. Expectation-Maximization Algorithm

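After establishing the Gaussian mixture model of an RGB hand gesture image, several parameters still need to be estimated. The expectation-maximization (EM) algorithm [26] is introduced for the subsequent calculations. The EM algorithm is a method of obtaining the parameter set Θ in the maximum likelihood method. It has two steps, called the E-step and the M-step, respectively. To start the E-step we introduce another probability Q_i(x_j). It is the posterior probability of π_i, in other words, the posterior probability of each x_j belonging to the i-th single Gaussian model, given the dataset X:

$$Q_i(\mathbf{x}_j) = \frac{\pi_i\, N_i(\mathbf{x}_j; \theta_i)}{\sum_{l=1}^{k} \pi_l\, N_l(\mathbf{x}_j; \theta_l)} \tag{11}$$

where the definition of Q_i(x_j) is given according to Bayes’ theorem, and $\sum_{i=1}^{k} Q_i(\mathbf{x}_j) = 1$. Then we use Equation (11) to modify the log-likelihood function in (10):

$$\ln L(\Theta; X) = \sum_{j=1}^{n} \ln \sum_{i=1}^{k} Q_i(\mathbf{x}_j)\, \frac{\pi_i\, N_i(\mathbf{x}_j; \theta_i)}{Q_i(\mathbf{x}_j)} \tag{12}$$

$$\geq \sum_{j=1}^{n} \sum_{i=1}^{k} Q_i(\mathbf{x}_j) \ln \frac{\pi_i\, N_i(\mathbf{x}_j; \theta_i)}{Q_i(\mathbf{x}_j)} \tag{13}$$

From (12) to (13), Jensen’s inequality has been applied: since (ln x)″ = −1/x² < 0, ln x is concave on its domain. Equation (13) is therefore a lower bound of ln L(Θ; X) that is tight at the current parameters when Q is given by Equation (11), so maximizing Equation (13) with respect to Θ cannot decrease ln L(Θ; X). The iteration of the EM algorithm, which estimates the new parameters in terms of the old parameters, is given as follows:

Initialization: Initialize μ_i with random numbers [27], and use unit matrices as the covariance matrices Σ_i to start the first iteration. The mixing coefficients, or prior probabilities, are assumed to be π_i = 1/k.

E-step: Compute the posterior probability Q_i(x_j) using the current parameters in Equation (11).

M-step: Renew the parameters:

$$\pi_i = \frac{1}{n} \sum_{j=1}^{n} Q_i(\mathbf{x}_j), \qquad \boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} Q_i(\mathbf{x}_j)\, \mathbf{x}_j}{\sum_{j=1}^{n} Q_i(\mathbf{x}_j)}, \qquad \Sigma_i = \frac{\sum_{j=1}^{n} Q_i(\mathbf{x}_j)\,(\mathbf{x}_j - \boldsymbol{\mu}_i)(\mathbf{x}_j - \boldsymbol{\mu}_i)^{T}}{\sum_{j=1}^{n} Q_i(\mathbf{x}_j)}$$

For most hand gesture images, the number of iterations is fixed in advance. To balance segmentation quality against efficiency, the number of iterations is set to 8 [28].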
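The E-step and M-step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the names `em_gmm` and `log_gaussian` are hypothetical, the initial means are passed in explicitly (rather than drawn at random) so that the run is reproducible, and a small `ridge` term is added to each covariance as an assumed safeguard against singular matrices.

```python
import numpy as np

def log_gaussian(X, mu, cov):
    """Log-density of a d-dimensional Gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

def em_gmm(X, init_means, n_iter=8, ridge=1e-6):
    """Fit a GMM with unit initial covariances and equal initial weights."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    means = np.array(init_means, dtype=float)
    k = len(means)
    covs = np.stack([np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior Q_i(x_j), normalised in log space for stability.
        log_q = np.stack([np.log(weights[i]) + log_gaussian(X, means[i], covs[i])
                          for i in range(k)], axis=1)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances from Q.
        mass = q.sum(axis=0) + 1e-12          # avoids division by zero
        weights = mass / mass.sum()
        means = (q.T @ X) / mass[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (q[:, i, None] * diff).T @ diff / mass[i] + ridge * np.eye(d)
    return weights, means, covs

# Toy data: two symmetric clusters of "pixels" centred at 0 and at 10.
A = np.array([[0, 0, 0], [2, 0, 0], [-2, 0, 0], [0, 2, 0],
              [0, -2, 0], [0, 0, 2], [0, 0, -2]], dtype=float)
weights, means, covs = em_gmm(np.vstack([A, A + 10.0]), [[1, 1, 1], [9, 9, 9]])
```

The default of eight iterations mirrors the iteration count quoted in the text; on real images the initial means would come from random numbers or from the seed regions.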

3. Interactive Image Segmentation

The modelling method discussed previously provides a universal way of dealing with hand gesture images. To segment the digital images, a mask is introduced as shown in Figure 4, which is a binary bitmap denoted as α. By introducing it, we turn the segmentation problem into a pixel labelling problem. Each entry α_j ∈ {0, 1}: the value 0 labels background pixels and 1 labels foreground pixels.

Figure 4

The mask.

To deal with the GMM tractably, we introduce two independent k-component GMMs, one for foreground modelling and one for background modelling. Each pixel x_j, coming either from the background or the foreground model, is marked as α_j = 0 or 1, respectively. The parameters of each component become: θ = {π(α, i), μ(α, i), Σ(α, i); α = 0, 1, i = 1, …, k}.

3.1. Gibbs Random Field

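The overall color modelling completes the first step of the human visual system analogy. To take every detail of the image into account, the Gibbs random field (GRF) [29] is introduced. The GRF is defined as:

$$P(A) = \frac{1}{Z(T)} \exp\left(-\frac{E(A)}{T}\right)$$

Here, P(A) gives the probability of the system being in the state A. T is a constant parameter, whose unit is temperature in physics, and its value is usually 1. Z(T) is the partition function:

$$Z(T) = \sum_{A} \exp\left(-\frac{E(A)}{T}\right)$$

where E(A) is interpreted as the energy function of the state A. To apply the GRF to image segmentation, the Gibbs energy [30] can be defined as follows:

$$E(\alpha, i, \theta, X) = U(\alpha, i, \theta, X) + V(\alpha, X)$$

The term U(α, i, θ, X), also called the regional term, is defined in terms of the GMM. It indicates the penalty of each pixel x_j being classified as background or foreground:

$$U(\alpha, i, \theta, X) = \sum_{j} \left[ -\ln \pi(\alpha_j, i_j) - \ln p(\mathbf{x}_j \mid \alpha_j, i_j, \theta) \right]$$

where p(x_j | α_j, i_j, θ) is the Gaussian density of component i_j of the model selected by α_j. V(α, X) is the boundary term, which describes the smoothness between a pixel x_u and its neighbouring pixels x_v in the pixel set N:

$$V(\alpha, X) = \gamma \sum_{\{x_u, x_v\} \in N} [\alpha_u \neq \alpha_v]\, \exp\left(-\beta \|x_u - x_v\|^2\right)$$

where the constant γ was set to 50 by optimizing the performance over training. [·] is an indicator function taking the value 1 when the condition inside holds and 0 otherwise. β is a constant that represents the contrast of the pixel set N and adjusts the exponential term. ⟨·⟩ in the equation below denotes the expectation:

$$\beta = \left( 2 \left\langle \|x_u - x_v\|^2 \right\rangle \right)^{-1}$$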
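To make the boundary term concrete, the sketch below computes the contrast constant β and the smoothness penalty for a given binary mask. It is an illustration under stated assumptions: a 4-connected neighbourhood is used (the text does not state the connectivity), β is taken as the reciprocal of twice the mean squared colour difference between neighbours, and the function names are hypothetical. The image is assumed not to be perfectly uniform, otherwise the mean difference is zero.

```python
import numpy as np

def neighbour_sq_diffs(img):
    """Squared colour differences for horizontal and vertical neighbour pairs."""
    img = img.astype(float)
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=2)
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=2)
    return dh, dv

def boundary_term(img, alpha, gamma=50.0):
    """Return (beta, V) for an H x W x 3 image and an H x W binary mask."""
    dh, dv = neighbour_sq_diffs(img)
    mean_sq = (dh.sum() + dv.sum()) / (dh.size + dv.size)
    beta = 1.0 / (2.0 * mean_sq)
    # Only neighbour pairs with different labels contribute to the penalty.
    cut_h = alpha[:, 1:] != alpha[:, :-1]
    cut_v = alpha[1:, :] != alpha[:-1, :]
    v = gamma * (np.exp(-beta * dh)[cut_h].sum() + np.exp(-beta * dv)[cut_v].sum())
    return beta, float(v)

# 2 x 2 image: black left column, white right column.
img = np.zeros((2, 2, 3))
img[:, 1] = 255.0
_, v_along_edge = boundary_term(img, np.array([[0, 1], [0, 1]]))
_, v_across_flat = boundary_term(img, np.array([[0, 0], [1, 1]]))
```

In this toy case a label boundary that follows the colour edge costs about 36.8, while one that cuts through uniform regions costs 100, which is the behaviour the boundary term is meant to encourage.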

3.2. Automatic Seed Selection

Until now all the constants have been defined. To begin with, all the pixels in the picture are automatically marked as undefined and labeled U [31]. B is the background seed pixel set and O is the foreground seed set. After training over the set U, the segmentation result is obtained as a foreground pixel set. The three pixel sets are shown in Figure 5.

Figure 5

The relationships between three pixel sets.

To achieve the segmentation automatically, we propose an initial seed selection method for hand gesture images. Considering that human skin color has an elliptical distribution in the YCbCr color space [32], the image is transformed from the RGB color space to YCbCr, using the equation below: where, Y indicates the luminance. By setting , the interference of highlights would be overcome. Then the Cb, Cr values of human skin color are located by the elliptical equations given below: where, x and y are the intermediate variables. All the pixels satisfying the equations above are marked as foreground seeds, which belong to set O. We also define the pixels on the image edges as background seeds, which belong to set B, because the gestures are usually located far away from the edges of the images. The results of seed selection are displayed in Figure 6 below.

Figure 6

The result of automatic seed selection.
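The seed selection idea (skin pixels inside an ellipse in the Cb-Cr plane become foreground seeds, image-border pixels become background seeds) can be sketched as follows. The constants here are assumptions, not values taken from this article: the conversion uses the common ITU-R BT.601 coefficients, and the ellipse parameters are the ones commonly quoted for the elliptical skin model in the face detection literature. The luminance constraint mentioned in the text is omitted, and the function names are hypothetical.

```python
import numpy as np

# Assumed ellipse parameters (centre, rotation in radians, offsets, semi-axes).
CX, CY, THETA = 109.38, 152.02, 2.53
ECX, ECY, A, B = 1.60, 2.41, 25.39, 14.03

def rgb_to_cbcr(img):
    """Assumed BT.601 conversion for 8-bit RGB values; returns (Cb, Cr)."""
    r, g, b = [img[..., c].astype(float) for c in range(3)]
    cb = 128.0 - 0.148 * r - 0.291 * g + 0.439 * b
    cr = 128.0 + 0.439 * r - 0.368 * g - 0.071 * b
    return cb, cr

def select_seeds(img, border=1):
    """Return boolean masks (foreground seeds O, background seeds B)."""
    cb, cr = rgb_to_cbcr(img)
    # Rotate (Cb, Cr) into the ellipse frame: intermediate variables x and y.
    x = np.cos(THETA) * (cb - CX) + np.sin(THETA) * (cr - CY)
    y = -np.sin(THETA) * (cb - CX) + np.cos(THETA) * (cr - CY)
    skin = ((x - ECX) / A) ** 2 + ((y - ECY) / B) ** 2 <= 1.0
    background = np.zeros(skin.shape, dtype=bool)
    background[:border, :] = background[-border:, :] = True
    background[:, :border] = background[:, -border:] = True
    return skin & ~background, background

# 5 x 5 blue image with a 3 x 3 skin-toned patch in the centre.
img = np.zeros((5, 5, 3), dtype=np.uint8)
img[..., 2] = 255
img[1:4, 1:4] = (200, 150, 120)
fg_seeds, bg_seeds = select_seeds(img)
```

Pixels that are neither foreground nor background seeds stay in the undefined set and are labelled later by the energy minimization.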

3.3. Min-Cut/Max-Flow Algorithm

According to the Gibbs random field, the image segmentation or pixel labelling problem is equivalent to minimizing the Gibbs energy function:

$$\hat{\alpha} = \arg\min_{\alpha} E(\alpha, i, \theta, X)$$

The min-cut/max-flow algorithm [33] is used to complete the segmentation more accurately. The idea of this algorithm is to regard an image as a network of nodes, in which each node takes the place of a corresponding pixel. In addition, two extra nodes, S and T, are introduced, which represent the “source” and the “sink”, respectively. Node S is linked to the pixels belonging to O, while T is linked to the pixels in B, as shown in Figure 7.

Figure 7

Nodes and net model.

There are three kinds of links in the neighbourhood N: from pixel to pixel, from pixel to S and from pixel to T, denoted as $\overline{x_u x_v}$, $\overline{x_u S}$ and $\overline{x_u T}$. Each link is assigned a certain weight, i.e., the cost of cutting it, as detailed in Table 1.

Table 1

The weight of each link.

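Link Type	Weight	Precondition
$\overline{x_u x_v}$	$\exp(-\beta \|x_u - x_v\|^2)$	$x_u, x_v \in N$
$\overline{x_u S}$	$U(\alpha = 0, i, \theta, X)$	$x_u \in U$
$\overline{x_u S}$	$K$	$x_u \in O$
$\overline{x_u S}$	$0$	$x_u \in B$
$\overline{x_u T}$	$U(\alpha = 1, i, \theta, X)$	$x_u \in U$
$\overline{x_u T}$	$0$	$x_u \in O$
$\overline{x_u T}$	$K$	$x_u \in B$

where $K = 1 + \max_{x_u \in X} \sum_{\{x_u, x_v\} \in N} \exp(-\beta \|x_u - x_v\|^2)$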

According to the max-flow/min-cut theorem, an optimal segmentation is defined by the minimum cut C, as seen in Figure 7c. C is a set of links whose removal separates S from T, and its cost is the sum of the weights of the links it contains:

$$|C| = \sum_{e \in C} w_e$$

The Gibbs energy can then be minimized by finding the min-cut defined above. The whole segmentation process is as follows: firstly, assign a GMM component i to each pixel x_j according to the selected seed regions. Secondly, learn the parameter set θ from the whole pixel set X. Thirdly, use the min-cut to minimize the Gibbs energy of the whole image. Then return to the first step to start another round; after eight rounds, the optimal segmentation is obtained.
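A toy example helps to show how the link weights and the minimum cut interact. The sketch below builds a four-pixel graph with the three link types and solves it with a plain Edmonds-Karp max-flow, chosen here only for brevity; it is not the specialised algorithm of [33], and all names and numeric weights are made up for illustration.

```python
from collections import deque

def add_link(cap, u, v, w):
    """Add an undirected link of weight w as two directed edges."""
    cap.setdefault(u, {})[v] = w
    cap.setdefault(v, {})[u] = w

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (cut cost, nodes on the source side)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent, queue = {source: None}, deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)   # reachable nodes form the source side
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push

# Pixel-to-pixel links: a weak link between p1 and p2 marks a colour edge.
n_links = {("p0", "p1"): 0.9, ("p1", "p2"): 0.1, ("p2", "p3"): 0.9}
per_pixel = {}
for (u, v), w in n_links.items():
    per_pixel[u] = per_pixel.get(u, 0.0) + w
    per_pixel[v] = per_pixel.get(v, 0.0) + w
K = 1.0 + max(per_pixel.values())      # hard-constraint weight

cap = {}
for (u, v), w in n_links.items():
    add_link(cap, u, v, w)
add_link(cap, "S", "p0", K); add_link(cap, "p0", "T", 0.0)   # p0 in O
add_link(cap, "S", "p3", 0.0); add_link(cap, "p3", "T", K)   # p3 in B
for p in ("p1", "p2"):                                       # undefined pixels
    add_link(cap, "S", p, 0.5); add_link(cap, p, "T", 0.5)

cost, source_side = min_cut(cap, "S", "T")
```

Here the cheapest cut severs the weak p1-p2 link plus one terminal link for each undefined pixel, so p0 and p1 end up as foreground and p2 and p3 as background.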

4. Experimental Comparison

To evaluate interactive segmentation quantitatively, we chose the image dataset proposed by Gulshan et al. [13], which contains 49 images from the GrabCut dataset [35], 99 images from the PASCAL VOC’09 segmentation challenge [36] and 3 images from the alpha-matting dataset [37]. These images cover a wide variety of shapes, textures and backgrounds. The corresponding ground-truth images, together with the initial seeds, are also included in this dataset. Each initial seed map is made up of 4 manually generated brush-strokes, each 8 pixels wide: one for the foreground and 3 for the background, as shown in Figure 8.

Figure 8

The evaluation samples from dataset.

To simulate human interactions, after the first segmentation with the initial seed map, one more seed is generated automatically in the largest connected segmentation error area (LEA). As shown in Figure 9a, the blue area is the segmentation result of the algorithm, the white area is the ground-truth segmentation and the LEA is marked in yellow. In Figure 9b, the seed is a round dot (8 pixels in diameter), generated according to the LEA. Then we update the segmentation with all the seeds. This step is repeated 20 times, yielding a sequence of segmentations.

Figure 9

Evaluation on the dataset.
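The simulated interaction (find the largest connected error area, then drop a new seed in it) can be sketched as below. This is an assumption-laden illustration: regions are grown with 4-connectivity, the seed is placed at the region pixel nearest to the region centroid, and it is returned as a single pixel rather than the 8-pixel dot used in the evaluation. The function names are hypothetical.

```python
from collections import deque
import numpy as np

def largest_error_area(segmentation, ground_truth):
    """Pixels (row, col) of the largest 4-connected misclassified region."""
    error = segmentation.astype(bool) != ground_truth.astype(bool)
    visited = np.zeros(error.shape, dtype=bool)
    rows, cols = error.shape
    best = []
    for r0, c0 in zip(*np.nonzero(error)):
        if visited[r0, c0]:
            continue
        # Breadth-first flood fill over connected error pixels.
        region, queue = [], deque([(int(r0), int(c0))])
        visited[r0, c0] = True
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and error[rr, cc] and not visited[rr, cc]):
                    visited[rr, cc] = True
                    queue.append((rr, cc))
        if len(region) > len(best):
            best = region
    return best

def next_seed(segmentation, ground_truth):
    """Return ((row, col), is_foreground) for the next simulated seed, or None."""
    region = largest_error_area(segmentation, ground_truth)
    if not region:
        return None
    pts = np.array(region, dtype=float)
    idx = int(np.argmin(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))
    r, c = region[idx]
    return (r, c), bool(ground_truth[r, c])

# Ground truth: 3 x 3 foreground square; the segmentation misses a 2 x 2 block
# of it and adds one stray foreground pixel in a corner.
gt = np.zeros((5, 5), dtype=bool)
gt[1:4, 1:4] = True
seg = gt.copy()
seg[1:3, 1:3] = False
seg[0, 4] = True
lea = largest_error_area(seg, gt)
seed, seed_is_foreground = next_seed(seg, gt)
```

The label of the new seed comes from the ground truth at the seed position, so a missed foreground block yields a foreground seed.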

To evaluate the quality of the segmentation results, we used two different measures: the region accuracy (RA) and the boundary accuracy (BA). Each evaluation is applied to a single segmentation, and all the images in Gulshan’s dataset are tested to verify that our proposed method is suitable for interactive image segmentation.

4.1. Region Accuracy

The RA of the segmentation results is evaluated by a weighted F-measure [38]. Compared with the normal F-measure, the two terms Precision and Recall become:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + NP}$$

where TP denotes the overlap of the ground-truth and segmented foreground pixels, FP is the pixels wrongly segmented as foreground compared with the ground-truth image, and NP represents the pixels wrongly segmented as background. F_β is defined as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$

where β signifies the effectiveness of detection with respect to a user who attaches β times as much importance to Recall as to Precision; normally β = 1. Then, we apply F_β to calculate the RA of the different segmentation results. The higher the RA, the better the segmentation.
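For reference, the plain (unweighted) F-measure on binary masks can be computed as in the sketch below. It is only an approximation of the region accuracy used here, since the weighted variant of [38] additionally weights errors by their location; the function name `f_measure` is hypothetical.

```python
import numpy as np

def f_measure(segmentation, ground_truth, beta=1.0):
    """Unweighted F_beta of a binary segmentation against a binary ground truth."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    tp = np.sum(seg & gt)      # foreground pixels found correctly
    fp = np.sum(seg & ~gt)     # background pixels labelled as foreground
    fn = np.sum(~seg & gt)     # foreground pixels labelled as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return float((1.0 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))

# Three foreground pixels in the ground truth; two found, one false alarm.
score = f_measure([1, 1, 0, 1], [1, 1, 1, 0])
```

With two of three foreground pixels found and one false alarm, Precision and Recall are both 2/3, so the score is 2/3 for β = 1.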

4.2. Boundary Accuracy

The BA [39] is defined according to the Hausdorff distance. The boundary pixels of the ground-truth image and of the segmented image are defined as B_g and B_s, as shown in Figure 10.

Figure 10

Boundary extraction.

The formula is as follows: where, g ∈ B_g and s ∈ B_s, dist(·) denotes the Euclidean distance, and N(·) is the number of pixels in the set. The value of BA shows the segmentation accuracy of the boundaries.
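Since the boundary accuracy is built on the Hausdorff distance, a sketch of the classical Hausdorff distance between two sets of boundary pixels may be useful. This is the textbook definition, not the BA formula of this article, and the function name `hausdorff` is an assumption. The brute-force distance matrix is fine for small boundaries but grows quadratically with the number of pixels.

```python
import numpy as np

def hausdorff(a, b):
    """Classical symmetric Hausdorff distance between two point sets."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    # Largest distance from any point to its nearest point in the other set.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Two short "boundaries" given as (row, col) coordinates.
dist = hausdorff([(0, 0), (0, 1)], [(0, 0), (0, 4)])
```

In the example the point (0, 4) lies 3 pixels from the nearest point of the other set, which dominates the result.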

4.3. Results Analysis

We segmented the images from the dataset by graph cut and random walker as shown in Figure 11. The segmentation test of our method has been made on Gulshan’s dataset as well as our hand gesture images, and some of the results using our method on hand gesture image segmentation are shown here in Figure 12.

Figure 11

The evaluation on different algorithms.

Figure 12

Segmentation results of our method on hand images.

For a more rigorous test, we tested 151 images from Gulshan’s dataset and used the human interaction simulator to perform the interactions, which generated the seeds 20 times to further refine the segmentation results. The result of each simulation step has been tested on the experiment platform. The RA and BA scores are the mean values of 151 segmentations, shown in Figure 13 and Figure 14.

Figure 13

Region accuracy comparison.

Figure 14

Boundary accuracy comparison.

From the figures above, the segmentation quality increases with the simulated human interactions. When the number of seeds becomes high, a satisfactory segmentation is achieved. Our method obtains the best segmentation quality when only a few human interactions are available. Since the seeds are generated once, automatically, in hand image segmentation, our method is suitable for hand image segmentation.

5. Hand Gesture Recognition

We defined five hand gestures: hand closed (HC), hand open (HO), wrist extension (WE), wrist flexion (WF), and fine pitch (FP), as shown in Figure 15.

Figure 15

Five hand gestures for recognition.

One hundred images of each hand gesture were captured and segmented by the proposed method. We used the recognition framework in Figure 16. For each gesture, 50 images are used for training and 50 for testing. To achieve a better classification, we extract HOG features along with Hu invariant moments, combined with equal weights. The K-SVD dictionary training method [40] is used to choose atoms representing [41] all features and to reduce the computation costs.

Figure 16

Hand gesture recognition framework.

We tested the recognition rates on both unsegmented hand images and segmented hand images. The recognition rates on unsegmented hand images are shown in Table 2, and the recognition rates on segmented hand images are shown in Table 3.

Table 2

Recognition rates on unsegmented hand images.

Gestures	Recognition Rates
Hand closed	86.7%
Hand open	73.3%
Wrist extension	100%
Wrist flexion	100%
Fine pitch	66.7%
Overall rate	85.3%

Table 3

Recognition rates on segmented hand images.

Gestures	Recognition Rates
Hand closed	93.3%
Hand open	100%
Wrist extension	100%
Wrist flexion	100%
Fine pitch	100%
Overall rate	98.7%

By segmenting the images before feature extraction, the recognition rates on those five hand gestures are increased compared with unsegmented images, according to the results in the tables above.
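The overall rates in Tables 2 and 3 can be reproduced as the unweighted mean of the per-gesture rates, assuming each gesture contributes the same number of test images. The short check below uses only the figures from the two tables; the variable and function names are illustrative.

```python
# Per-gesture recognition rates (%) copied from Table 2 and Table 3.
unsegmented = {"HC": 86.7, "HO": 73.3, "WE": 100.0, "WF": 100.0, "FP": 66.7}
segmented = {"HC": 93.3, "HO": 100.0, "WE": 100.0, "WF": 100.0, "FP": 100.0}

def overall_rate(rates):
    """Unweighted mean of the per-gesture rates, rounded to one decimal."""
    return round(sum(rates.values()) / len(rates), 1)

overall_unsegmented = overall_rate(unsegmented)
overall_segmented = overall_rate(segmented)
improvement = round(overall_segmented - overall_unsegmented, 1)
```

Under this assumption the means match the overall rates reported in the tables, and the gain from segmentation is 13.4 percentage points.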

6. Conclusions and Future Work

In conclusion, the interactive hand gesture image segmentation method meets the segmentation demands of the hand gesture images in our experiments without human interaction. The mechanism behind this method is carefully explored and derived with the assistance of modern mathematical theories. Compared with other popular image segmentation methods, our method obtains better segmentation accuracy and higher quality when the number of seeds is limited. Automatic seed selection also helps to reduce human interactions. The segmentation in turn improves the recognition rate. In future work, we could adapt this method to higher resolution pictures, which requires simplifying the calculation process. In seed selection, the automatic selection method could be improved to overcome various kinds of interference, such as highlights, shadows and image distortion. Other future work will focus on improving the recognition rate by integrating the segmentation algorithm with more advanced recognition methods.
