
Scene-Level Geographic Image Classification Based on a Covariance Descriptor Using Supervised Collaborative Kernel Coding.

Chunwei Yang1,2, Huaping Liu3, Shicheng Wang4, Shouyi Liao5.   

Abstract

Scene-level geographic image classification is a challenging problem that has become a research focus in recent years. This paper develops a supervised collaborative kernel coding method based on a covariance descriptor (covd) for scene-level geographic image classification. First, covd is introduced in the feature extraction process and then transformed into a Euclidean feature by a supervised collaborative kernel coding model. Furthermore, we develop an iterative optimization framework to solve this model. Comprehensive evaluations on a public high-resolution aerial image dataset and comparisons with state-of-the-art methods show the superiority and effectiveness of our approach.

Keywords:  collaborative kernel coding; covariance descriptor; scene-level geographic image classification

Year:  2016        PMID: 26999150      PMCID: PMC4813967          DOI: 10.3390/s16030392

Source DB:  PubMed          Journal:  Sensors (Basel)        ISSN: 1424-8220            Impact factor:   3.576


1. Introduction

Nowadays, high spatial resolution remote sensing images are easily acquired thanks to the rapid development of satellite and remote sensing technology, which gives us the opportunity to interpret, analyze and understand such images. As a fundamental research area of remote sensing image analysis, scene-level geographic image classification is of great importance for land use and land cover (LULC) classification [1,2,3], semantic interpretation of images [4], geographic image retrieval [5,6,7] and forest type mapping [8], and it has drawn increasing scholarly attention [1,2,3,5,9,10,11,12,13]. Figure 1 shows geographic images with spatial resolutions of 30 m, 1 m and 0.3 m, respectively.
Figure 1

Images with a resolution of: (a) 30 m; (b) 1 m; (c) 0.3 m.

However, finding an efficient representation of the scene-level image is a challenging problem. The bag of visual words (BOVW) model [14] is one of the most successful models; the works in [2,5] detail its application to the scene-level image classification task. As illustrated in [2,5], BOVW represents an image compactly through a histogram of visual word counts and provides a degree of invariance to image transformations. However, the tradeoff between invariance and discriminability is controlled by the visual dictionary size. Moreover, BOVW disregards the spatial layout of the features, which is of great importance to scene-level image classification [2,15,16]. To overcome this shortcoming, one successful extension of BOVW is spatial pyramid matching (SPM) [16], which partitions the image into increasingly finer sub-regions and computes histograms of local features from each sub-region. Although SPM is a computationally efficient extension of BOVW and shows superior performance, it only characterizes the absolute locations of the visual words in an image and does not consider their relative spatial arrangement. From this point of view, SPM also limits the descriptive ability of the scene-level geographic image representation. Hence, two new image representation models, termed the spatial co-occurrence kernel (SCK) [1] and the spatial pyramid co-occurrence kernel (SPCK) [2], were proposed by Yang and Newsam. Furthermore, to capture both the absolute and relative spatial relationships of BOVW, a pyramid of spatial relations (PSR) model was developed by Chen and Tian. The work in [17] points out that the computational complexities of SCK and SPCK are high because of the need for nonlinear Mercer kernels, and develops a linear form of the SCK.
In addition, [10] proposed an unsupervised feature learning method in which new sparse representations are generated from the low-level feature descriptors. On the other hand, the covariance descriptor (covd) proposed by Tuzel [18] can be used for feature representation of an image and has been extensively adopted in many computer vision tasks, e.g., texture discrimination [18], visual saliency estimation [19], object detection [18,20] and object tracking [21]. Covd is a covariance matrix of different features, e.g., color, gradient and spatial location, and it exhibits a degree of rotation and scale invariance. However, how to model and compute covd remains a key problem. Covd lies on a Riemannian manifold, which is a non-Euclidean space; as a result, traditional mathematical modeling and computation in Euclidean space cannot be directly utilized, which poses a great challenge. In [22], a discriminative learning method is developed to formulate the classification problem on the Riemannian space of covds, presenting a kernel function and a log-Euclidean distance metric to handle the Riemannian-to-Euclidean transformation. In [23], a coding strategy is introduced to transform the descriptor into a new feature, after which an extreme learning machine (ELM) is used for dynamic texture video classification. However, that method optimizes the reconstruction error of the coding and the classification error of the ELM separately, so the design of the coding and of the classifier are totally independent. To solve this problem, we develop a supervised collaborative kernel coding approach that incorporates a linear classifier supervision term and optimizes both the reconstruction error and the linear classifier simultaneously. Our contributions are threefold. First, a supervised collaborative kernel coding model, illustrated in Figure 2, is proposed.
This model not only transforms the covd into a discriminative feature representation, but also yields the corresponding linear classifier.
Figure 2

Illustration of the supervised collaborative kernel coding model.

Second, an iterative optimization framework is introduced to solve the supervised collaborative kernel coding model. Third, experiments on a public high-resolution aerial image dataset validate that the proposed supervised collaborative kernel coding model achieves satisfying performance on scene-level geographic image classification. The paper is organized as follows: after an overview of our proposed methodology in Section 2, Section 3 presents the iterative optimization approach; Section 4 and Section 5 give the experiments and conclusions.

2. Overview of the Methodology

Figure 3 shows an overview of the proposed method, which consists of three stages: pre-processing, coding and classification. In the pre-processing stage, covd is extracted as the initial feature representation of the scene-level geographic image. In the coding stage, the supervised collaborative kernel coding strategy, involving the dictionary coefficients, the coding representation phase and the linear classification phase, is presented. Finally, in the classification stage, based on the dictionary coefficients and the learned linear classifier, a label vector is derived; the index corresponding to its largest value is the label of the testing scene-level geographic image.
Figure 3

The overview of the proposed method. covd, covariance descriptor.

2.1. Covariance Descriptor

Covd was first proposed by Tuzel et al. [18] as a compact descriptor. Formally, let f_i (i = 1, ..., d) be a p-dimensional feature vector at the i-th pixel, whose entries are features such as color, gradient filter responses, etc. Then, the p x p covd of an image can be described as:

C = (1 / (d - 1)) * sum_{i=1}^{d} (f_i - mu)(f_i - mu)^T,

where d and mu denote the pixel number and the mean feature vector, respectively. The feature vector is established using the image intensity of each channel and the norms of the first and second derivatives of the intensity in the x and y directions. For a geographic image, a 15-dimensional feature vector is thus computed at each pixel: the intensity I_C of each color channel C, together with the norms of its first and second derivatives in the x and y directions. The work in [18] points out that covd has at least three characteristics: (1) it is expressive enough to describe an image under different poses and views; (2) multiple features can be fused in a natural way through covd, whose diagonal and off-diagonal elements describe the variances and correlations of the different features, respectively; (3) compared with other descriptors, such as raw values and histograms, covd is low-dimensional, having only (p^2 + p)/2 distinct values due to symmetry. Note that covd is a symmetric positive definite matrix, and the key issue for such a matrix is how to model and compute it. As is illustrated in Figure 4, covd lies on a Riemannian manifold [24], which is not a Euclidean space.
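As a minimal sketch of this construction (assuming NumPy; the 15-dimensional per-pixel feature stack is simplified here to a generic feature matrix, and the function name is illustrative, not the authors' implementation):

```python
import numpy as np

def covariance_descriptor(features):
    """Compute a covariance descriptor (covd) from per-pixel features.

    features: (d, p) array -- d pixels, each carrying a p-dimensional
    feature vector (e.g. intensities and derivative norms).
    Returns the (p, p) matrix C = 1/(d-1) * sum (f_i - mu)(f_i - mu)^T.
    """
    d = features.shape[0]
    mu = features.mean(axis=0)          # mean feature vector
    centered = features - mu
    return centered.T @ centered / (d - 1)

# toy usage: 100 pixels with 5-dimensional features
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 5))
C = covariance_descriptor(F)
```

By construction C is symmetric positive semi-definite, so only the upper triangle carries distinct values, matching the (p^2 + p)/2 count above.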
Figure 4

Sample geographic images and corresponding covariance descriptor (covd) features.

Accordingly, the mathematical modeling of covd is not the same as what we usually do in Euclidean space. Here, we adopt the idea of Ruiping Wang [22] and compute the distance between two covds C_1 and C_2 using the log-Euclidean distance [25,26]:

d(C_1, C_2) = || log(C_1) - log(C_2) ||_F,

where log(.) is the matrix logarithm and ||.||_F denotes the Frobenius norm. Moreover, there remains the problem of how to use covd in geographic image classification: since covd lies in a non-Euclidean space, a traditional linear classifier defined in Euclidean space cannot be applied directly. Solving this problem is the theme of the following sections.
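The log-Euclidean distance can be sketched as follows (NumPy assumed; for symmetric positive definite matrices the matrix logarithm reduces to taking logs of the eigenvalues, so no external matrix-function library is needed):

```python
import numpy as np

def spd_logm(C):
    """Matrix logarithm of a symmetric positive definite matrix
    via eigendecomposition: log(C) = V diag(log w) V^T."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(C1, C2):
    """d(C1, C2) = || log(C1) - log(C2) ||_F  (log-Euclidean metric)."""
    return np.linalg.norm(spd_logm(C1) - spd_logm(C2), ord="fro")

# toy usage: two simple SPD matrices
A = np.diag([1.0, 2.0, 3.0])
B = np.eye(3)
d_AB = log_euclidean_distance(A, B)
```

Because log(I) = 0, the distance of any SPD matrix to the identity is simply the Frobenius norm of its own matrix logarithm.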

2.2. Supervised Collaborative Kernel Coding Model

As shown in Figure 5, we propose a supervised collaborative kernel coding model that consists of two jointly working components: (1) the dictionary learning and feature representation phase; and (2) the linear classification phase. First, the linear classifier is incorporated into the dictionary learning and feature representation phase, making the resulting coding vectors more discriminative. Then, based on the coding vectors, the linear classifier is derived. In this way, the objective functions of the two phases are combined into a unified optimization framework, through which the collaborative coding vectors and the corresponding linear classifier can be obtained simultaneously. Finally, based on the dictionary coefficients, a testing signal is transformed into a feature vector, which is used directly for linear classification.
Figure 5

The illustration of the proposed model.

Denote S = {S_1, S_2, ..., S_n} as the training samples, drawn from a Riemannian manifold M. Through a proper mapping function, the samples are mapped into a higher dimensional space. Namely, let phi : M -> H be the nonlinear mapping from the original space into a high- (possibly infinite-) dimensional space H. The mapping function is associated with a kernel kappa(S_i, S_j) = phi(S_i)^T phi(S_j). For covd computation, the Gaussian kernel is chosen as the mapping function for its superior performance in many computer vision tasks [27]:

kappa(S_i, S_j) = exp(-beta * d^2(S_i, S_j)),

where the decay parameter beta is empirically set to 0.02 and d(S_i, S_j) is the log-Euclidean distance between the two samples. The aim of dictionary learning is to empirically learn a dictionary adapted to the training sample set; therefore, we need to determine K atoms to represent the training samples, where K is the dictionary size. Let Phi(S) = [phi(S_1), ..., phi(S_n)] and let D denote the dictionary; the kernel dictionary learning process can then be formulated as:

min_{D, X} || Phi(S) - D X ||_F^2 + lambda || X ||_F^2,    (4)

where X is the coding matrix and lambda is the penalty parameter. Thanks to the kernel trick [28,29], through the mapping function phi, the problem on the Riemannian manifold can be transformed into a collaborative coding problem in Euclidean space. Nevertheless, since the dictionary atoms may live in an infinite-dimensional space, such a formulation poses a new challenge to the dictionary learning process. Fortunately, [30,31] prove that the dictionary can be represented as D = Phi(S) A, where A is a coefficient matrix. This indicates that the training samples can linearly represent the dictionary in the feature space. As a result, Equation (4) can be reformulated as:

min_{A, X} || Phi(S) - Phi(S) A X ||_F^2 + lambda || X ||_F^2.    (5)

Such a formulation provides two significant advantages: (1) the dictionary learning process becomes a search for the matrix A; (2) for any kernel function, this formulation reduces the dictionary learning process to a linear problem.
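A sketch of this kernelized collaborative coding (NumPy assumed; `kernel_code` and the toy inputs are illustrative, not the authors' implementation — under the l2-regularized formulation, coding a sample against the dictionary D = Phi(S)A has a ridge-style closed form that needs only kernel evaluations):

```python
import numpy as np

def spd_logm(C):
    """log of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def gaussian_kernel(C1, C2, beta=0.02):
    """kappa(C1, C2) = exp(-beta * ||log C1 - log C2||_F^2)."""
    diff = spd_logm(C1) - spd_logm(C2)
    return np.exp(-beta * np.linalg.norm(diff, ord="fro") ** 2)

def kernel_code(train_covds, A, test_covd, lam=0.1, beta=0.02):
    """Collaborative code of a test covd w.r.t. dictionary D = Phi(S)A.

    Minimizing ||phi(y) - Phi(S) A x||^2 + lam ||x||^2 gives
        x = (A^T K A + lam I)^{-1} A^T k,
    where K is the train-train kernel matrix and k the train-test
    kernel vector, so phi is never evaluated explicitly.
    """
    n, dict_size = A.shape
    K = np.array([[gaussian_kernel(Ci, Cj, beta) for Cj in train_covds]
                  for Ci in train_covds])
    k = np.array([gaussian_kernel(Ci, test_covd, beta) for Ci in train_covds])
    G = A.T @ K @ A + lam * np.eye(dict_size)
    return np.linalg.solve(G, A.T @ k)

# toy usage: four random 3x3 SPD "training" covds, a 2-atom dictionary
rng = np.random.default_rng(1)
train = [(lambda M: M @ M.T + 3 * np.eye(3))(rng.normal(size=(3, 3)))
         for _ in range(4)]
A = rng.normal(size=(4, 2))
x = kernel_code(train, A, train[0])
```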
Now, we propose a novel objective function combining both the collaborative kernel coding phase and the classification phase:

min_{A, X, W} || Phi(S) - Phi(S) A X ||_F^2 + eta || L - W X ||_F^2 + lambda || X ||_F^2 + rho || W ||_F^2,    (6)

where || Phi(S) - Phi(S) A X ||_F^2 and || L - W X ||_F^2 denote the reconstruction error and the linear classification error, respectively, L is the label matrix of the training samples and W represents the classifier parameters. eta, lambda and rho are all penalty parameters. The dictionary derived through this formulation can generate more discriminative codes X, which is of great importance to the performance of the classifier, while remaining adaptive to the underlying structure of the training samples. The resulting codes are then directly used for classification. For a testing sample S_t, the feature representation code x_t is first computed with the dictionary coefficients A through Equation (7). Then, the label vector is derived as l_t = W x_t, and the index corresponding to the largest value of l_t is the label of S_t.

3. Optimization Algorithm

There are three variables, A, X and W, in the objective function Equation (6). Here, an iterative optimization algorithm that solves for each variable while fixing the other two is introduced. (Equation (6) is denoted as J, and the variables obtained at the k-th and (k+1)-th iterations are denoted by the subscripts k and k+1, respectively.)

Step 1: Initialization. We randomly set the coefficient matrix A. Next, we compute the corresponding coding coefficients by setting the derivative of Equation (6) with respect to X to zero:

X = (A^T K A + lambda I)^{-1} A^T K,

where K is the n x n square matrix whose (i, j)-th element is kappa(S_i, S_j).

Step 2: Fixing X, setting the derivative of J with respect to A to zero yields the corresponding solution:

A = X^T (X X^T)^{-1}.

Step 3: Fixing A and W, taking the derivative of J with respect to X, we can derive the optimal solution of X:

X = (A^T K A + lambda I + eta W^T W)^{-1} (A^T K + eta W^T L).

Step 4: Fixing A and X, taking the derivative of J with respect to W, the optimal solution of W is:

W = eta L X^T (eta X X^T + rho I)^{-1}.

Step 5: Iterate from Step 2 to Step 4 until convergence.

A summary of the whole algorithm, which includes the above optimization procedures, is given in Algorithm 1, and the representative reconstruction error of the objective function is shown in Figure 6. Given the optimal A, we can derive the optimal code of a testing sample S_t based on Equation (7) as:

x_t = (A^T K A + lambda I)^{-1} A^T k_t,    (7)

where k_t = [kappa(S_1, S_t), ..., kappa(S_n, S_t)]^T.
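The alternating scheme above can be sketched as follows (NumPy assumed; the exact update formulas were lost in extraction, so this uses the standard closed-form ridge updates implied by an objective with reconstruction, classification and penalty terms — all names hypothetical):

```python
import numpy as np

def supervised_kernel_coding(K, L, dict_size, lam=0.1, eta=1.0, rho=1.0,
                             iters=20, seed=0):
    """Alternating optimization sketch for
        min_{A,X,W} ||Phi(S)(I - A X)||_F^2 + eta ||L - W X||_F^2
                    + lam ||X||_F^2 + rho ||W||_F^2.
    K: (n, n) kernel Gram matrix of the training covds.
    L: (c, n) one-hot label matrix.  Returns (A, X, W).
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    A = rng.normal(size=(n, dict_size))   # dictionary coefficients
    W = None                              # classifier, set after 1st X-step
    for _ in range(iters):
        # X-step: ridge closed form; the classifier term joins once W exists
        G = A.T @ K @ A + lam * np.eye(dict_size)
        R = A.T @ K
        if W is not None:
            G = G + eta * W.T @ W
            R = R + eta * W.T @ L
        X = np.linalg.solve(G, R)
        # W-step: ridge regression of labels on the codes
        W = L @ X.T @ np.linalg.inv(X @ X.T + (rho / eta) * np.eye(dict_size))
        # A-step: least-squares fit of the reconstruction term
        A = X.T @ np.linalg.inv(X @ X.T + 1e-8 * np.eye(dict_size))
    return A, X, W

# toy usage: 10 samples, 3 classes, dictionary size 4
rng = np.random.default_rng(1)
B = rng.normal(size=(10, 10))
K = B @ B.T + 10 * np.eye(10)            # a valid (PSD) Gram matrix
labels = rng.integers(0, 3, size=10)
L = np.eye(3)[labels].T                  # (3, 10) one-hot label matrix
A, X, W = supervised_kernel_coding(K, L, dict_size=4)
```

A test sample would then be coded against A with the same ridge formula and labeled by the argmax of W @ x.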
Figure 6

The representative reconstruction error of the objective function.

4. Experiments

4.1. Dataset and Experiment Setup

In this section, we demonstrate the application of our method in classification experiments using a publicly available dataset [1], which includes twenty-one scene categories with one hundred images per class. This dataset corresponds to various LULC types, as shown in Figure 7.
Figure 7

Samples from UCMERCED. Example geographic images associated with 21 categories are shown here.

Each category is randomly partitioned into five subsets of twenty samples each. During the experiments, one subset is used for testing and the remaining four subsets are used for training. Finally, we report the average classification accuracy over the five rotations.
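This per-class five-fold rotation can be sketched as (NumPy assumed; function name and flat index layout are illustrative):

```python
import numpy as np

def five_fold_splits(n_classes=21, per_class=100, seed=0):
    """Yield (train_idx, test_idx) for the five rotations described above:
    each class is shuffled and cut into five subsets of per_class/5 samples;
    one subset tests, the other four train."""
    rng = np.random.default_rng(seed)
    # per-class shuffled indices into a flat [class * per_class + i] layout
    order = np.stack([rng.permutation(per_class) + c * per_class
                      for c in range(n_classes)])          # (21, 100)
    folds = order.reshape(n_classes, 5, per_class // 5)    # (21, 5, 20)
    for k in range(5):
        test = folds[:, k, :].ravel()
        train = np.concatenate([folds[:, j, :].ravel()
                                for j in range(5) if j != k])
        yield train, test

splits = list(five_fold_splits())
```

Averaging the per-rotation accuracies over the five splits gives the reported figure.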

4.2. Parameter Analysis

Equation (6) has four parameters that need to be tuned: lambda, eta, rho and the dictionary size K. To determine their values, n-fold cross-validation is adopted, investigating each parameter while fixing the others; K is initialized to 210. Figure 8 shows the classification accuracy for each tuned parameter, from which the best-performing values of lambda, eta and rho can be read.
Figure 8

Evaluation of the effect on the classification accuracy for parameters: (a) λ; (b) η; and (c) ρ.

4.3. Experiment Results and Comparison

The following three baseline methods are designed for comparison. Baseline Method 1 isolates the feature representation and classification processes: the codes are used as the feature representation, and a linear classifier is trained on them separately. Baseline Method 2 is the same as the proposed method, except that the covd is established from the image intensities and the magnitudes of the first and second gradients, yielding a 9 x 9 covd. Baseline Method 3 is the same as baseline Method 1, except that it uses the same 9 x 9 covd as baseline Method 2. Figure 9 shows the classification accuracy versus the dictionary size K. From this figure, we can find some interesting results:
Figure 9

Comparison of different methods.

Our approach is always better than the three baseline methods, obtaining its best performance at the dictionary size indicated in Figure 9. Compared with baseline Method 1, our proposed method obtains a higher classification accuracy, which indicates the effectiveness of the optimization algorithm. Comparing the proposed method with baseline Method 2, the only difference is the covd: the former uses a 15 x 15 matrix, the covariance of the intensity of each channel and the norms of the first and second gradients of the intensities, while the latter uses a 9 x 9 covariance of the intensity of each channel and the magnitudes of the first and second gradients. Neither covd is strictly rotation invariant, and the former in particular is not direction invariant. Nevertheless, the proposed method obtains a higher classification accuracy, which may indicate that the covariance format offsets rotations to some extent. Figure 10 shows the confusion matrices of the baseline methods and our approach, respectively. The classification accuracy exceeds 80% for fifteen categories and 90% for eleven categories. Nevertheless, the classification accuracy of three categories, buildings, dense residential and intersection, is less than 70%.
Figure 10

The average confusion matrices of: (a) baseline Method 1; (b) baseline Method 2; (c) baseline Method 3; and (d) the proposed method.

To analyze the proposed method further, Figure 11 lists some representative misclassified samples. Some confused pairs shown in Figure 11, such as intersection/overpass, overpass/runway and river/forest, are hard to distinguish even by eye.
Figure 11

The representative misclassification samples.

In addition, we report the classification accuracies of both the baseline methods and our method over all five subset rotations in Table 1. The comparison with the classical approaches [1,2], namely BOVW, SPM, SCK, BOVW + SCK, color histograms (RGB, HLS and CIE Lab), texture, SPCK, BOVW + SPCK and SPCK + SPM, is shown in Figure 12.
Table 1

Classification accuracies of the baseline methods and our method over all five subsets.

Subset Number     | 1      | 2      | 3      | 4      | 5      | Average
Baseline Method 1 | 78.10% | 79.29% | 76.90% | 78.81% | 71.90% | 77.00%
Baseline Method 2 | 78.33% | 75.48% | 74.29% | 75.71% | 74.29% | 75.62%
Baseline Method 3 | 73.10% | 71.43% | 70.00% | 73.33% | 69.05% | 71.38%
Proposed method   | 87.14% | 84.52% | 88.10% | 87.14% | 84.52% | 86.28%
Figure 12

The comparison of our method with the state-of-the-art performance reported in the literature on the dataset UCMERCED. BOVW, bag of visual words; SPM, spatial pyramid matching; SCK, spatial co-occurrence kernel; SPCK, spatial pyramid co-occurrence kernel.

5. Conclusions

This paper proposes a novel supervised collaborative kernel coding model based on covd for scene-level geographic image classification. Since covd lies in a non-Euclidean space, a linear classifier based on the Euclidean distance cannot be utilized directly. Our main contribution is explicitly integrating discriminative feature coding and a linear classifier into a single objective function, whose solution is efficiently obtained by the proposed iterative optimization algorithm. Experiments on the UCMERCED dataset show the effectiveness of our approach.
References (9 in total)

1.  Log-Euclidean metrics for fast and simple calculus on diffusion tensors.

Authors:  Vincent Arsigny; Pierre Fillard; Xavier Pennec; Nicholas Ayache
Journal:  Magn Reson Med       Date:  2006-08       Impact factor: 4.668

2.  GeoIRIS: Geospatial Information Retrieval and Indexing System-Content Mining, Semantics Modeling, and Complex Queries.

Authors:  Chi-Ren Shyu; Matt Klaric; Grant J Scott; Adrian S Barb; Curt H Davis; Kannappan Palaniappan
Journal:  IEEE Trans Geosci Remote Sens       Date:  2007-04       Impact factor: 5.600

3.  Pedestrian detection via classification on Riemannian manifolds.

Authors:  Oncel Tuzel; Fatih Porikli; Peter Meer
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2008-10       Impact factor: 6.226

4.  Design of non-linear kernel dictionaries for object recognition.

Authors:  Hien Van Nguyen; Vishal M Patel; Nasser M Nasrabadi; Rama Chellappa
Journal:  IEEE Trans Image Process       Date:  2013-09-16       Impact factor: 10.856

5.  Sparse representation with kernels.

Authors:  Shenghua Gao; Ivor Wai-Hung Tsang; Liang-Tien Chia
Journal:  IEEE Trans Image Process       Date:  2012-09-21       Impact factor: 10.856

6.  Efficient kernel sparse coding via first-order smooth optimization.

Authors:  Minyoung Kim
Journal:  IEEE Trans Neural Netw Learn Syst       Date:  2014-08       Impact factor: 10.451

7.  Visual saliency estimation by nonlinearly integrating features using region covariances.

Authors:  Erkut Erdem; Aykut Erdem
Journal:  J Vis       Date:  2013-03-18       Impact factor: 2.240

8.  Remote-sensing image classification based on an improved probabilistic neural network.

Authors:  Yudong Zhang; Lenan Wu; Nabil Neggaz; Shuihua Wang; Geng Wei
Journal:  Sensors (Basel)       Date:  2009-09-23       Impact factor: 3.576

9.  Multiple classifier system for remote sensing image classification: a review.

Authors:  Peijun Du; Junshi Xia; Wei Zhang; Kun Tan; Yi Liu; Sicong Liu
Journal:  Sensors (Basel)       Date:  2012-04-12       Impact factor: 3.576

