Haijin Ji1,2, Song Huang1. 1. Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China. 2. School of Computer Science and Technology, Huaiyin Normal University, Huaian 223300, China.
Abstract
Kernel entropy component analysis (KECA) is a recently proposed dimensionality reduction (DR) method that has shown superiority in many pattern analysis problems previously solved by principal component analysis (PCA). The optimized KECA (OKECA) is a state-of-the-art variant of KECA that returns projections retaining more expressive power than KECA. However, OKECA is sensitive to outliers and has high computational complexity because of the inherent properties of the L2-norm. To handle these two problems, we develop a new extension to KECA, namely, KECA-L1, for DR or feature extraction. KECA-L1 aims to find a more robust kernel decomposition matrix such that the extracted features retain as much information potential as possible, measured by the L1-norm. Accordingly, we design a nongreedy iterative algorithm that converges much faster than OKECA's. Moreover, a general semisupervised classifier is developed for KECA-based methods and applied to data classification. Extensive experiments on data classification and software defect prediction demonstrate that the new method is superior to most existing KECA- and PCA-based approaches. The code has also been made publicly available.
1. Introduction
The curse of dimensionality is one of the major issues in machine learning and pattern recognition [1]. It has motivated many scholars from different areas to implement dimensionality reduction (DR) to simplify the input space without degrading the performance of learning algorithms. Various efficient DR methods have been developed, such as independent component analysis (ICA) [2], linear discriminant analysis [3], principal component analysis (PCA) [4], and projection pursuit [5], to name a few. Among these algorithms, PCA has been one of the most widely used techniques for feature extraction (or DR). PCA implements a linear data transformation according to the projection matrix, which aims to maximize the second-order statistics of the input datasets [6]. To extend PCA to nonlinear spaces, Schölkopf et al. [7] proposed kernel PCA (KPCA). The key of KPCA is to find the nonlinear relation between the input data and the kernel feature space (KFS) using the kernel matrix, which is derived from a positive semidefinite kernel function that computes inner products. Both PCA and KPCA perform data transformation by selecting the eigenvectors corresponding to the top eigenvalues of the projection matrix and the kernel matrix, respectively. Both methods (including their variants) have experienced great success in different areas [8-12], such as image reconstruction [13], face recognition [14-17], and image processing [18, 19]. However, as suggested by Zhang and Hancock [20], DR should be performed from the perspective of information theory to obtain more acceptable results.
To improve the performance of the aforementioned approaches to DR, Jenssen [6] developed a new and completely different data transformation algorithm, namely, kernel entropy component analysis (KECA).
The main difference between KECA and PCA or KPCA is that the optimal eigenvectors (also called entropic components) derived from KECA capture the most Renyi entropy of the input data instead of being associated with the top eigenvalues. The procedure of selecting the eigenvectors related to the Renyi entropy of the input space starts with a Parzen window kernel-based estimator [21]. Then, only the eigenvectors contributing the most entropy of the input datasets are selected to perform DR. This distinguishing characteristic helps KECA achieve better performance than the classical PCA and KPCA in face recognition and clustering [6]. More recently, Izquierdo-Verdiguier et al. [21] employed the rotation matrix from ICA [2] to optimize KECA and proposed the optimized KECA (OKECA). OKECA not only shows superiority in the classification of both synthetic and real datasets but can also obtain an acceptable kernel density estimation (KDE) using far fewer entropic components (just one or two) than KECA [21]. However, OKECA is sensitive to outliers because of the inherent properties of the L2-norm. In other words, if the input space follows a normal distribution and is contaminated by nonnormally distributed outliers, the DR performance of OKECA may degrade. Additionally, OKECA is very time-consuming when handling large-scale input datasets (Section 4).
Therefore, the main purpose of this paper is to propose a new variant of KECA that reduces the sensitivity to outliers and improves on the efficiency of OKECA. The L1-norm is well known for its robustness to outliers [22]. Additionally, Nie et al. [23] established a fast iteration process to handle the general L1-norm maximization problem with a nongreedy algorithm. Hence, we take advantage of OKECA and propose a new L1-norm version of KECA (denoted as KECA-L1).
KECA-L1 uses an efficient convergence procedure, motivated by Nie et al.'s method [23], to search for the entropic components that contribute the most Renyi entropy of the input data. To evaluate the efficiency and effectiveness of KECA-L1, we design and conduct a series of experiments in which the data vary from single class to multiattribute and from small to large size. The classical KECA and OKECA are also included for comparison.
The remainder of this paper is organized as follows: Section 2 reviews the general L1-norm maximization problem, KECA, and OKECA. Section 3 presents KECA with nongreedy L1-norm maximization and the semisupervised-learning-based classifier. Section 4 validates the performance of the new method on different datasets. Section 5 concludes the paper.
2. Preliminaries
2.1. An Efficient Algorithm for Solving the General L1-Norm Maximization Problem
The general L1-norm maximization problem was first raised by Nie et al. [23]. Under the hypothesis that the objective function has an upper bound, the problem can be formulated as [23]
$$\max_{\nu \in \mathcal{C}} \; f(\nu) + \sum_{i} \left| g_i(\nu) \right|, \tag{1}$$
where $f(\nu)$ and $g_i(\nu)$ for each $i$ denote arbitrary functions, and $\nu \in \mathcal{C}$ represents an arbitrary constraint.
Then a sign function $\operatorname{sign}(\cdot)$ is defined as
$$\operatorname{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0, \end{cases} \tag{2}$$
and employed to transform the maximization problem (1) as follows:
$$\max_{\nu \in \mathcal{C}} \; f(\nu) + \sum_{i} \alpha_i \, g_i(\nu), \tag{3}$$
where $\alpha_i = \operatorname{sign}(g_i(\nu))$. Nie et al. [23] proposed a fast iteration process to solve problem (3), which is shown in Algorithm 1. It can be seen from Algorithm 1 that $\alpha_i^{t}$ is determined by the current solution $\nu^{t}$, and the next solution $\nu^{t+1}$ is updated according to the current $\alpha_i^{t}$. The iterative process is repeated until the procedure converges [23, 24]. The convergence of Algorithm 1 has been demonstrated, and the details can be found in [23].
Algorithm 1
Fast iteration approach to solving the general L1-Norm maximization problem (3).
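The alternating update described for Algorithm 1 can be illustrated on a toy instance. The sketch below is only an assumption-laden example, not the listing from [23]: it takes $f(\nu) \equiv 0$, $g_i(\nu) = x_i^T \nu$, and the constraint $\|\nu\|_2 = 1$, for which the inner maximization has the closed form $\nu^{t+1} = m / \|m\|_2$ with $m = \sum_i \alpha_i^t x_i$. The function names and the stopping tolerance are hypothetical.

```python
import math


def sign(x):
    """Sign function of Equation (2): 1, 0 or -1."""
    return (x > 0) - (x < 0)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def l1_objective(points, v):
    """Sum of |x_i^T v| over all points (problem (1) with f = 0)."""
    return sum(abs(dot(x, v)) for x in points)


def fast_iteration(points, v0, max_iter=100, tol=1e-12):
    """Alternate between fixing the signs and re-solving for v (toy instance)."""
    v = list(v0)
    history = [l1_objective(points, v)]
    for _ in range(max_iter):
        # Step 1: signs determined by the current solution.
        alphas = [sign(dot(x, v)) for x in points]
        # Step 2: maximise sum_i alpha_i x_i^T v subject to ||v||_2 = 1.
        m = [sum(a * x[k] for a, x in zip(alphas, points)) for k in range(len(v))]
        norm = math.sqrt(dot(m, m))
        if norm == 0.0:
            break
        v = [c / norm for c in m]
        history.append(l1_objective(points, v))
        if history[-1] - history[-2] <= tol:
            break
    return v, history


points = [(2.0, 0.0), (1.0, 1.0), (0.0, 0.5), (-3.0, 0.2)]
v, history = fast_iteration(points, (0.0, 1.0))
```

The list `history` records the objective after every update, so the nondecreasing behaviour claimed for the iteration can be inspected directly.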
2.2. Kernel Entropy Component Analysis
KECA is characterized by its entropic components instead of the principal or variance-based components in PCA or KPCA, respectively. Hence, we first describe the concept of the Renyi quadratic entropy. Given the input dataset $X = [x_1, \ldots, x_N]$ ($x_i \in \mathbb{R}^{DIM}$), the Renyi entropy of $X$ is defined as [6]
$$H(p) = -\log \int p^2(x)\,dx, \tag{4}$$
where $p(x)$ is a probability density function. Based on the monotonic property of the logarithmic function, Equation (4) can be rewritten as
$$V(p) = \int p^2(x)\,dx. \tag{5}$$
We can estimate Equation (5) using the kernel $k_\sigma(x, x_i)$ of the Parzen window density estimator determined by the bandwidth coefficient $\sigma$ [6] such that
$$\hat{V}(p) = \frac{1}{N^2}\, \mathbf{1}^T K \mathbf{1}, \tag{6}$$
where $K_{ij} = k_\sigma(x_i, x_j)$ constitutes the kernel matrix $K$ and $\mathbf{1}$ represents an $N$-dimensional vector containing all ones. With the help of the kernel decomposition [6],
$$K = E D E^T, \tag{7}$$
Equation (6) is transformed as follows:
$$\hat{V}(p) = \frac{1}{N^2} \sum_{i=1}^{N} \left( \sqrt{\lambda_i}\, e_i^T \mathbf{1} \right)^2, \tag{8}$$
where the diagonal matrix $D$ and the matrix $E$ consist of the eigenvalues $\lambda_1, \ldots, \lambda_N$ and the corresponding eigenvectors $e_1, \ldots, e_N$, respectively. It can be observed from Equation (8) that the entropy estimator consists of projections onto all the KFS axes because
$$k_\sigma(x_i, x_j) = \phi(x_i)^T \phi(x_j), \tag{9}$$
where the function $\phi(\cdot)$ maps the two samples $x_i$ and $x_j$ into the KFS. Additionally, only an entropic component $e_i$ meeting the criteria $\lambda_i \neq 0$ and $\mathbf{1}^T e_i \neq 0$ can contribute to the entropy estimate [21]. In short, KECA implements DR by projecting $\phi(X)$ onto a subspace spanned not by the eigenvectors associated with the top eigenvalues but by the entropic components contributing most to the Renyi entropy estimator [25].
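The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: samples are stored one per row, the kernel is an unnormalized Gaussian, and the helper names `gaussian_kernel_matrix` and `keca` are hypothetical. Each eigenpair is ranked by its entropy contribution $\lambda_i (e_i^T \mathbf{1})^2 / N^2$ instead of by $\lambda_i$ alone.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma):
    """Unnormalised Gaussian kernel matrix; X has shape (N, DIM), one sample per row."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def keca(K, m):
    """Keep the m eigenpairs of K that contribute most to the entropy estimate."""
    N = K.shape[0]
    lam, E = np.linalg.eigh(K)            # K = E D E^T, eigenvalues in ascending order
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative round-off
    contrib = lam * (E.T @ np.ones(N)) ** 2 / N ** 2
    order = np.argsort(contrib)[::-1][:m]
    scores = E[:, order] * np.sqrt(lam[order])   # projected samples, shape (N, m)
    return scores, contrib, order


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gaussian_kernel_matrix(X, sigma=1.0)
scores, contrib, order = keca(K, m=2)
```

Summing `contrib` over all components recovers $\mathbf{1}^T K \mathbf{1} / N^2$, which gives a quick sanity check on the decomposition.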
2.3. Optimized Kernel Entropy Component Analysis
Because KECA is sensitive to the bandwidth coefficient $\sigma$ [21], OKECA was proposed to fill this gap and improve the performance of KECA on DR. Motivated by the fast ICA method [2], an extra rotation matrix $W$ is applied to the kernel decomposition (Equation (7)) in KECA to maximize the information potential (the entropy values in Equation (8)) [21]:
$$\max_{w} \; J(w) = \left\| \mathbf{1}^T E D^{1/2} w \right\|_2^2 \quad \text{s.t.} \quad W W^T = I, \tag{10}$$
where $\|\cdot\|_2$ is the L2-norm and $w$ denotes a column vector ($N \times 1$) in $W$. Izquierdo-Verdiguier et al. [21] utilized a gradient-ascent approach to handle the maximization problem (10):
$$w(t+1) = w(t) + \tau \frac{\partial J}{\partial w(t)}, \tag{11}$$
where $\tau$ is the step size. $\partial J / \partial w(t)$ can be obtained by the Lagrangian multiplier:
$$\frac{\partial J}{\partial w(t)} = 2 \left( \mathbf{1}^T E D^{1/2} w(t) \right) D^{1/2} E^T \mathbf{1}. \tag{12}$$
The entropic components multiplied by the rotation matrix can obtain more (or equal) information potential than those of KECA, even when fewer components are used [21]. Moreover, OKECA is robust to the bandwidth coefficient. However, OKECA has two main limitations. First, the new entropic components derived from OKECA are sensitive to outliers because of the inherent properties of the L2-norm (Equation (10)). Second, although a very simple stopping criterion is designed to avoid additional iterations, OKECA still has high computational complexity: its computational cost is $O(N^3 + 4tN^2)$ [21], where $t$ is the number of iterations for finding the optimal rotation matrix, compared with $O(N^3)$ for KECA [21].
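A simplified view of the gradient-ascent idea may help. The sketch below is not the OKECA code of [21]; it assumes a single rotation vector $w$, an objective of the form $J(w) = (b^T w)^2$ with $b = D^{1/2} E^T \mathbf{1}$, and it enforces unit length by renormalizing after every step instead of using a Lagrangian term. The vector `b`, the step size, and the function names are hypothetical.

```python
import math


def normalise(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def ascend_rotation_vector(b, w0, tau=0.1, iters=200):
    """Gradient ascent on J(w) = (b^T w)^2 with w kept on the unit sphere."""
    w = normalise(w0)
    for _ in range(iters):
        proj = sum(bi * wi for bi, wi in zip(b, w))
        grad = [2.0 * proj * bi for bi in b]          # dJ/dw = 2 (b^T w) b
        w = normalise([wi + tau * gi for wi, gi in zip(w, grad)])
    proj = sum(bi * wi for bi, wi in zip(b, w))
    return w, proj ** 2


# b stands in for D^{1/2} E^T 1; the values are made up for illustration.
w, J = ascend_rotation_vector([3.0, 4.0, 0.0], [1.0, 1.0, 1.0])
```

In this one-vector setting the iterate aligns with `b`, so a single rotated component collects the whole of $\|b\|_2^2$, which is consistent with the remark that rotated components can retain more information potential with fewer components.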
3. KECA with Nongreedy L1-Norm Maximization
3.1. Algorithm
In order to alleviate the problems existing in OKECA, this section presents how to extend KECA to its nongreedy L1-norm version. For ease of understanding, the definition of the L1-norm is first introduced as follows:
Definition 1. Given an arbitrary vector $x \in \mathbb{R}^{N}$, the L1-norm of the vector $x$ is
$$\|x\|_1 = \sum_{j=1}^{N} |x_j|, \tag{13}$$
where $\|\cdot\|_1$ is the L1-norm and $x_j$ denotes the $j$th element of $x$.
Then, motivated by OKECA, we attempt to develop a new objective function to maximize the information potential (Equations (8) and (10)) based on the L1-norm:
$$\max_{W} \; \sum_{i=1}^{N} \left\| W^T a_i \right\|_1 \quad \text{s.t.} \quad W^T W = I, \tag{14}$$
where $(a_1, \ldots, a_N) = A = E D^{1/2}$ and $N$ is the number of samples. The rotation matrix is denoted as $W \in \mathbb{R}^{DIM \times m}$, where $DIM$ and $m$ are the dimension of the input data and the dimension of the selected entropic components (or number of projections), respectively. It is difficult to solve problem (14) directly, but we may regard it as a special case of problem (1) when $f(\nu) \equiv 0$. Therefore, Algorithm 1 can be employed to solve (14). Next, we show how to find the optimal solution of problem (14) based on the proposals in References [23, 24]. Let
$$M = \sum_{i=1}^{N} a_i \alpha_i, \qquad \alpha_i = \operatorname{sign}(a_i^T W). \tag{15}$$
Thus, problem (14) can be simplified as
$$\max_{W^T W = I} \; \operatorname{Tr}(W^T M). \tag{16}$$
By singular value decomposition (SVD),
$$M = U \Lambda V^T, \tag{17}$$
where $U \in \mathbb{R}^{DIM \times DIM}$, $\Lambda \in \mathbb{R}^{DIM \times m}$, and $V \in \mathbb{R}^{m \times m}$. Then we obtain
$$\operatorname{Tr}(W^T M) = \operatorname{Tr}(W^T U \Lambda V^T) = \operatorname{Tr}(\Lambda V^T W^T U) = \operatorname{Tr}(\Lambda Z) = \sum_{i} \lambda_{ii} z_{ii}, \tag{18}$$
where $Z = V^T W^T U \in \mathbb{R}^{m \times DIM}$, and $\lambda_{ii}$ and $z_{ii}$ denote the $(i, i)$th elements of the matrices $\Lambda$ and $Z$, respectively. Due to the property of SVD, we have $\lambda_{ii} \geq 0$. Additionally, $Z$ is an orthonormal matrix [23] such that $z_{ii} \leq 1$. Therefore, $\operatorname{Tr}(W^T M)$ can reach the maximum only if $Z = [I_m, 0_{m \times (DIM-m)}]$, where $I_m$ denotes the $m \times m$ identity matrix and $0_{m \times (DIM-m)}$ is an $m \times (DIM - m)$ matrix of zeros. Considering that $Z = V^T W^T U$, the solution to problem (16) is
$$W = U Z^T V^T = U \begin{bmatrix} I_m \\ 0_{(DIM-m) \times m} \end{bmatrix} V^T. \tag{19}$$
Algorithm 2 (a MATLAB implementation of the algorithm is available in the Supporting Document for interested readers) shows how to utilize the nongreedy L1-norm maximization described in Algorithm 1 to compute Equation (19). Since problem (16) is a special case of problem (1), the optimal solution $W$ to Equation (19) is a local maximum point for $\|W^T E D^{1/2}\|_1$ based on Theorem 2 in Reference [23]. Moreover, Phase 1 of Algorithm 2 spends $O(N^3)$ on the eigendecomposition. Thus, the total computational cost of KECA-L1 is $O(N^3 + Nt)$, where $t$ is the number of iterations for convergence. Considering that the computational complexity of OKECA is $O(N^3 + 4tN^2)$, we can conclude that KECA-L1 converges much faster than OKECA.
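To make the two cost expressions concrete, the following sketch simply evaluates them for illustrative values of $N$ and $t$. The numbers are made up, constant factors hidden by the $O(\cdot)$ notation are ignored, and the same $t$ is assumed for both methods, which need not hold in practice.

```python
def okeca_cost(n, t):
    """Operation count suggested by O(N^3 + 4tN^2)."""
    return n ** 3 + 4 * t * n ** 2


def keca_l1_cost(n, t):
    """Operation count suggested by O(N^3 + Nt)."""
    return n ** 3 + n * t


n, t = 1000, 50                      # hypothetical sample size and iteration count
okeca_iter = okeca_cost(n, t) - n ** 3
keca_l1_iter = keca_l1_cost(n, t) - n ** 3
ratio = okeca_iter / keca_l1_iter    # equals 4 * n when t is shared
```

Both expressions share the $N^3$ eigendecomposition term; the difference lies in the iterative part, where the ratio of the two terms is $4N$ under these assumptions.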
Algorithm 2
KECA-L1.
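Since the listing of Algorithm 2 is not reproduced in the text, the following NumPy sketch follows the derivation of Section 3.1 only: eigendecompose $K$, form $A = E D^{1/2}$, and alternate between the sign step, the matrix $M$, and the SVD-based update of $W$. The random orthonormal initialization, the stopping tolerance, and the function name `keca_l1` are assumptions, and $W$ is given as many rows as $A$ so that $W^T A$ is defined.

```python
import numpy as np


def keca_l1(K, m, max_iter=100, tol=1e-10, seed=0):
    """Nongreedy L1-norm maximisation of sum_i ||W^T a_i||_1 with W^T W = I."""
    lam, E = np.linalg.eigh(K)                       # Phase 1: K = E D E^T
    A = E * np.sqrt(np.clip(lam, 0.0, None))         # A = E D^{1/2}, columns a_i
    d = A.shape[0]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((d, m)))  # orthonormal starting point
    history = [np.abs(W.T @ A).sum()]
    for _ in range(max_iter):
        alpha = np.sign(W.T @ A)                     # column i holds the signs for a_i
        M = A @ alpha.T                              # sum_i a_i alpha_i, shape (d, m)
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U @ Vt                                   # thin-SVD form of U [I; 0] V^T
        history.append(np.abs(W.T @ A).sum())
        if history[-1] - history[-2] <= tol:
            break
    return W, history


rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-d2 / 2.0)                                # Gaussian kernel with sigma = 1
W, history = keca_l1(K, m=2)
```

Using the thin SVD avoids building the zero block explicitly: `U @ Vt` with `full_matrices=False` gives the same product as the full-size factors padded with zeros.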
3.2. The Convergence Analysis
This subsection demonstrates the convergence of Algorithm 2 in the following theorem.
Theorem 1.
The above KECA-L1 procedure converges.
Proof
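Motivated by References [23, 24], we first show that the objective function (14) of KECA-L1 monotonically increases in each iteration $t$. Let $g_i(u^t) = (W^t)^T a_i$ and $\alpha_i^t = \operatorname{sign}(a_i^T W^t)$; then (14) can be simplified to
$$\max_{W^T W = I} \; \sum_{i=1}^{N} \left| g_i(u^t) \right|. \tag{20}$$
Obviously, $\alpha_i^{t+1}$ is parallel to $g_i(u^{t+1})$, whereas $\alpha_i^{t}$ is not necessarily so. Therefore,
$$\left| g_i(u^{t+1}) \right| = \alpha_i^{t+1} g_i(u^{t+1}) \geq \alpha_i^{t} g_i(u^{t+1}) \;\Rightarrow\; \left| g_i(u^{t+1}) \right| - \alpha_i^{t} g_i(u^{t+1}) \geq 0. \tag{21}$$
Considering that $|g_i(u^t)| = \alpha_i^t g_i(u^t)$, thus
$$\left| g_i(u^{t}) \right| - \alpha_i^{t} g_i(u^{t}) = 0. \tag{22}$$
Substituting (22) in (21), it can be obtained that
$$\left| g_i(u^{t+1}) \right| - \alpha_i^{t} g_i(u^{t+1}) \geq \left| g_i(u^{t}) \right| - \alpha_i^{t} g_i(u^{t}). \tag{23}$$
According to Step 3 in Algorithm 2 and the theory of SVD, for each iteration $t$, we have
$$\sum_{i=1}^{N} \alpha_i^{t} g_i(u^{t+1}) \geq \sum_{i=1}^{N} \alpha_i^{t} g_i(u^{t}). \tag{24}$$
Combining (23) and (24) for every $i$, we have
$$\sum_{i=1}^{N} \left| g_i(u^{t+1}) \right| \geq \sum_{i=1}^{N} \left| g_i(u^{t}) \right|, \tag{25}$$
which means that the objective value of Algorithm 2 is monotonically increasing. Additionally, considering that the objective function (14) of KECA-L1 has an upper bound within the limited iterations, the KECA-L1 procedure will converge.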
3.3. The Semisupervised Classifier
Jenssen [26] established a semisupervised learning (SSL) algorithm for classification using KECA. This SSL-based classifier is trained on both labeled and unlabeled data to build the kernel matrix so that it can map the data to the KFS appropriately [26]. Additionally, it is based on a general modelling scheme and is applicable to other variants of KECA, such as OKECA and KECA-L1.
More specifically, we are given $N$ pairs of training data $\{x_i, y_i\}_{i=1}^{N}$ with samples $x_i \in \mathbb{R}^{DIM}$ and the associated labels $y_i$. In addition, there are $M$ unlabeled data points for testing. Let $X_u = [x_u^1, \ldots, x_u^M]$ and $X_l = [x_l^1, \ldots, x_l^N]$ denote the testing data and the training data without labels, respectively; thus, we can obtain an overall matrix $X = [X_u \; X_l]$. Then we construct the kernel matrix $K$ derived from $X$ using (6), $K \in \mathbb{R}^{(N+M) \times (N+M)}$, which serves as the input of Algorithm 2. After the iteration procedure of nongreedy L1-norm maximization, we obtain a projection $\tilde{X} = [\tilde{X}_u \; \tilde{X}_l]$ of $X$ onto $m$ orthogonal axes, where $\tilde{X}_u = [\tilde{x}_u^1, \ldots, \tilde{x}_u^M]$ and $\tilde{X}_l = [\tilde{x}_l^1, \ldots, \tilde{x}_l^N]$. In other words, $\tilde{x}_u^i$ and $\tilde{x}_l^j$ are the low-dimensional representations of each testing data point $x_u^i$ and the training one $x_l^j$, respectively. Assume that $x_u^{*}$ is an arbitrary data point to be tested. If it satisfies
$$\left\| \tilde{x}_u^{*} - \tilde{x}_l^{j} \right\| = \min_{h = 1, \ldots, N} \left\| \tilde{x}_u^{*} - \tilde{x}_l^{h} \right\|, \tag{26}$$
then $x_u^{*}$ is assigned to the same class as the $j$th data point of $X_l$.
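The assignment step can be sketched as a nearest-neighbour search in the projected space. This is a minimal illustration that assumes the rule compares Euclidean distances between low-dimensional representations; the distance measure, the function name `assign_labels`, and the toy coordinates are assumptions, and the projection itself is taken as given.

```python
import math


def assign_labels(test_proj, train_proj, train_labels):
    """Give each projected test point the label of its closest projected training point."""
    labels = []
    for e_u in test_proj:
        j = min(range(len(train_proj)), key=lambda h: math.dist(e_u, train_proj[h]))
        labels.append(train_labels[j])
    return labels


# Hypothetical two-dimensional projections of three labeled and two test points.
train_proj = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_labels = ["a", "a", "b"]
test_proj = [(0.2, 0.1), (4.0, 4.5)]
predicted = assign_labels(test_proj, train_proj, train_labels)
```

Because the kernel matrix is built from labeled and unlabeled points together, the unlabeled data influence the projection even though only the labeled points vote in this step.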
4. Experiments
This section compares the proposed KECA-L1 with the classical KECA [6] and OKECA [21] on real-world data classification using the SSL-based classifier described in Section 3.3. Several recent techniques such as PCA-L1 [27] and KPCA-L1 [28] are also included for comparison. These methods were selected because previous DR studies found that they produce impressive results [27-29]. We run the experiments on a wide range of real-world datasets: (1) six datasets from the University of California, Irvine (UCI) Machine Learning Repository (available at http://archive.ics.uci.edu/ml/datasets.html) and (2) 9 software projects with 34 releases from the PROMISE data repository (available at http://openscience.us/repo). The MATLAB source code for running KECA and OKECA, uploaded by Izquierdo-Verdiguier et al. [21], is available at http://isp.uv.es/soft_feature.html. The coefficients for PCA-L1 and KPCA-L1 are set as in [27, 28]. All experiments are performed in MATLAB R2012a on a PC with an Intel Core i5 CPU, 4 GB of memory, and the Windows 7 operating system.
4.1. Experiments on UCI Datasets
The experiments are conducted on six datasets from the UCI repository: the Ionosphere dataset is a binary classification problem of whether a radar signal can describe the structure of free electrons in the ionosphere; the Letter dataset assigns each black-and-white rectangular pixel display to one of the 26 capital letters in the English alphabet; the Pendigits dataset handles the recognition of pen-based handwritten digits; the Pima-Indians dataset constitutes a clinical problem of diagnosing diabetes in patients from clinical variables; the WDBC dataset is another clinical problem, the diagnosis of breast cancer as malignant or benign; and the Wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. Table 1 shows their details. In the subsequent experiments, we use only the simplest linear classifier [30]. The maximum likelihood (ML) rule [31] is used to select the bandwidth coefficient, as suggested in [21].
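The text does not spell out the ML criterion of [31]. One common reading, assumed here, is to pick the bandwidth that maximizes the leave-one-out log-likelihood of a Gaussian Parzen estimate over a grid of candidates. The sketch below uses one-dimensional samples for brevity; the function names, the candidate grid, and the data are hypothetical.

```python
import math


def loo_log_likelihood(samples, sigma):
    """Leave-one-out log-likelihood of a 1-D Gaussian Parzen estimate."""
    n = len(samples)
    norm = 1.0 / ((n - 1) * sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(samples):
        dens = norm * sum(
            math.exp(-((xi - xj) ** 2) / (2.0 * sigma ** 2))
            for j, xj in enumerate(samples)
            if j != i
        )
        total += math.log(max(dens, 1e-300))   # floor avoids log(0)
    return total


def select_bandwidth(samples, candidates):
    """Return the candidate sigma with the largest leave-one-out log-likelihood."""
    return max(candidates, key=lambda s: loo_log_likelihood(samples, s))


samples = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]        # two tight clusters
sigma = select_bandwidth(samples, [0.01, 0.1, 1.0, 10.0])
```

Leaving each point out of its own density estimate prevents the trivial solution in which the likelihood grows without bound as the bandwidth shrinks to zero.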
Table 1
UCI datasets description.
| Database | N | DIM | Nc | Ntrain | Ntest |
|---|---|---|---|---|---|
| Ionosphere | 351 | 33 | 2 | 30 × 2 | 175 |
| Letter | 20000 | 16 | 26 | 35 × 26 | 3870 |
| Pendigits | 10992 | 16 | 9 | 60 × 9 | 3500 |
| Pima-Indians | 768 | 8 | 2 | 100 × 2 | 325 |
| WDBC | 569 | 30 | 2 | 35 × 2 | 345 |
| Wine | 178 | 12 | 3 | 30 × 3 | 80 |

N: number of samples, DIM: number of dimensions, Nc: number of classes, Ntrain: number of training data, and Ntest: number of testing data.
KECA-L1 and the other methods are each run 10 times on all the selected datasets with different numbers of components. We use the overall classification accuracy (OA) to evaluate the classification performance of the different algorithms. OA is defined as the proportion of samples correctly assigned, which lies within [0, 1]; larger values indicate better quality. Figure 1 presents the average OA curves obtained by the aforementioned algorithms on the six real datasets. It can be observed from Figure 1 that OKECA is superior to KECA, PCA-L1, and KPCA-L1 except on the Letter dataset. This is probably because DR performed by OKECA not only reveals the structure related to the most Renyi entropy of the original data but also considers the rotational invariance property [21]. In addition, KECA-L1 outperforms the other methods except OKECA. This may be attributed to the robustness of the L1-norm to outliers compared with that of the L2-norm. In Figure 1, OKECA seems to obtain nearly the same results as KECA-L1. However, the average running time (in hours) of OKECA on Pendigits is 37.384, far more than the 1.339 of KECA-L1.
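As a small illustration of the metric, OA and its average over repeated runs can be computed as follows. The function names and the label lists are hypothetical and the data are made up; this is a sketch of the definition, not the evaluation script used in the experiments.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_overall_accuracy(runs):
    """Average OA over repeated runs; each run is a (y_true, y_pred) pair."""
    return sum(overall_accuracy(t, p) for t, p in runs) / len(runs)


oa = overall_accuracy([1, 2, 2, 3], [1, 2, 3, 3])
avg = mean_overall_accuracy([([1, 2, 2, 3], [1, 2, 3, 3]), ([1, 1], [1, 1])])
```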
Figure 1
Overall accuracy obtained by the PCA-L1, KPCA-L1, KECA, OKECA, and KECA-L1 using different UCI databases with different numbers of extracted features. (a) Ionosphere, (b) Letter, (c) Pendigits, (d) Pima-Indians, (e) WDBC, and (f) Wine.
4.2. Experiments on Software Projects
In software engineering, it is usually difficult to test a software project completely and thoroughly with the limited resources [32]. Software defect prediction (SDP) may provide a relatively acceptable solution to this problem. It can allocate the limited test resources effectively by categorizing the software modules into two classes: nonfault-prone (NFP) or fault-prone (FP) according to 21 software metrics (Table 2).
Table 2
Descriptions of data attributes.
| Attribute | Description |
|---|---|
| WMC | Weighted methods per class |
| AMC | Average method complexity |
| AVG_CC | Mean cyclomatic complexity of methods in the same class |
| CA | Afferent couplings |
| CAM | Cohesion among methods of class |
| CBM | Coupling between methods |
| CBO | Coupling between object classes |
| CE | Efferent couplings |
| DAM | Data access metric |
| DIT | Depth of inheritance tree |
| IC | Inheritance coupling |
| LCOM | Lack of cohesion in methods |
| LCOM3 | Normalized version of LCOM |
| LOC | Lines of code |
| MAX_CC | Maximum cyclomatic complexity of methods in the same class |
| MFA | Measure of function abstraction |
| MOA | Measure of aggregation |
| NOC | Number of children |
| NPM | Number of public methods |
| RFC | Response for a class |
| Bug | Number of bugs detected in the class |
This section employs KECA-based methods to reduce the dimensions of the selected software data (Table 3) and then uses the SSL-based classifier combined with the support vector machine [33] to classify each software module as NFP or FP. The bandwidth coefficient is again set by the ML rule. PCA-L1 and KPCA-L1 are included as benchmarks. There are 34 groups of tests, one for each release in Table 3. The most suitable releases [34] from different software projects are selected as training data. We evaluate the performance of the selected methods on SDP in terms of recall (R), precision (P), and F-measure (F) [35, 36]. The F-measure is defined as
$$F = \frac{2 \times P \times R}{P + R}, \tag{27}$$
where
$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}. \tag{28}$$
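The three measures can be computed directly from true and predicted labels. The sketch below assumes the standard definitions $R = TP/(TP+FN)$, $P = TP/(TP+FP)$, and $F = 2PR/(P+R)$, with modules coded 1 for fault-prone and 0 for non-fault-prone; the function name, the coding, and the example labels are hypothetical, and a zero denominator is mapped to 0.0 by convention.

```python
def recall_precision_f(y_true, y_pred):
    """Recall, precision and F-measure for the fault-prone class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
r, p, f = recall_precision_f(y_true, y_pred)
```

In this toy example there are three true positives, one false negative, and one false positive, so all three measures equal 0.75.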
Table 3
Descriptions of software data.
| Releases | #Classes | #FP | FP ratio |
|---|---|---|---|
| Ant-1.3 | 125 | 20 | 0.160 |
| Ant-1.4 | 178 | 40 | 0.225 |
| Ant-1.5 | 293 | 32 | 0.109 |
| Ant-1.6 | 351 | 92 | 0.262 |
| Ant-1.7 | 745 | 166 | 0.223 |
| Camel-1.0 | 339 | 13 | 0.038 |
| Camel-1.2 | 608 | 216 | 0.355 |
| Camel-1.4 | 872 | 145 | 0.166 |
| Camel-1.6 | 965 | 188 | 0.195 |
| Ivy-1.1 | 111 | 63 | 0.568 |
| Ivy-1.4 | 241 | 16 | 0.066 |
| Ivy-2.0 | 352 | 40 | 0.114 |
| Jedit-3.2 | 272 | 90 | 0.331 |
| Jedit-4.0 | 306 | 75 | 0.245 |
| Lucene-2.0 | 195 | 91 | 0.467 |
| Lucene-2.2 | 247 | 144 | 0.583 |
| Lucene-2.4 | 340 | 203 | 0.597 |
| Poi-1.5 | 237 | 141 | 0.595 |
| Poi-2.0 | 314 | 37 | 0.118 |
| Poi-2.5 | 385 | 248 | 0.644 |
| Poi-3.0 | 442 | 281 | 0.636 |
| Synapse-1.0 | 157 | 16 | 0.102 |
| Synapse-1.1 | 222 | 60 | 0.270 |
| Synapse-1.2 | 256 | 86 | 0.336 |
| Synapse-1.4 | 196 | 147 | 0.750 |
| Synapse-1.5 | 214 | 142 | 0.664 |
| Synapse-1.6 | 229 | 78 | 0.341 |
| Xalan-2.4 | 723 | 110 | 0.152 |
| Xalan-2.5 | 803 | 387 | 0.482 |
| Xalan-2.6 | 885 | 411 | 0.464 |
| Xerces-init | 162 | 77 | 0.475 |
| Xerces-1.2 | 440 | 71 | 0.161 |
| Xerces-1.3 | 453 | 69 | 0.152 |
| Xerces-1.4 | 588 | 437 | 0.743 |
In (28), FN (i.e., false negative) means that buggy classes are wrongly classified as nonfaulty, while FP (i.e., false positive) means that nonbuggy classes are wrongly classified as faulty. TP (i.e., true positive) refers to correctly classified buggy classes [34]. The values of recall, precision, and F-measure range from 0 to 1, and higher values indicate better classification results.
Figure 2 shows the results using box-plot analysis. From Figure 2, considering the minimum, maximum, median, first quartile, and third quartile of the boxes, we find that KECA-L1 performs better than the other methods in general. Specifically, KECA-L1 obtains acceptable results in the SDP experiments compared with the benchmarks proposed in Reference [34], since the median values of the boxes with respect to R and F are close to 0.7 and more than 0.5, respectively. In contrast, KECA, OKECA, PCA-L1, and KPCA-L1 all fail to meet these criteria. Therefore, the results validate the robustness of KECA-L1.
Figure 2
The standardized boxplots of the performance achieved by PCA-L1, KPCA-L1, KECA, OKECA, and KECA-L1, respectively. From the bottom to the top of a standardized box plot: minimum, first quartile, median, third quartile, and maximum.
5. Conclusions
This paper proposes a new extension to the OKECA approach for dimensionality reduction. The new method (i.e., KECA-L1) employs the L1-norm and a rotation matrix to maximize the information potential of the input data. In order to find the optimal entropic kernel components, motivated by Nie et al.'s algorithm [23], we design a nongreedy iterative process that converges much faster than OKECA's. Moreover, a general semisupervised learning algorithm has been established for classification using KECA-L1. Compared with several recently proposed KECA- and PCA-based approaches, this SSL-based classifier remarkably improves performance on real-world dataset classification and software defect prediction.
Although KECA-L1 has achieved impressive success on real examples, several problems remain to be considered and solved in future research. The efficiency of KECA-L1 has to be optimized because it is relatively time-consuming compared with most existing PCA-based methods. Additionally, KECA-L1 is expected to be applied in pattern analysis algorithms previously based on PCA approaches.