
Sparse representation for tumor classification based on feature extraction using latent low-rank representation.

Bin Gan¹, Chun-Hou Zheng², Jun Zhang³, Hong-Qiang Wang⁴.

Abstract

Accurate tumor classification is crucial to the proper treatment of cancer. To date, sparse representation (SR) has shown strong performance for tumor classification. This paper proposes a new SR-based method for tumor classification using gene expression data. In the proposed method, we first use latent low-rank representation to extract salient features and remove noise from the original sample data. Then we use the sparse representation classifier (SRC) to build the tumor classification model. The experimental results on several real-world data sets show that our method is more efficient and more effective than previous classification methods including SVM, SRC, and LASSO.

Substances:
DNA, Neoplasm

Year: 2014 PMID: 24678505 PMCID: PMC3942202 DOI: 10.1155/2014/420856

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

A tumor is a solid lesion caused by the abnormal growth of cells. Timely and accurate treatment is clinically very important, and because cancer is heterogeneous, accurate treatment presupposes an exact diagnosis. That is, tumors need to be classified accurately before they are treated. Current methods for classifying cancer malignancies mostly rely on a variety of morphological, clinical, or molecular variables. Despite recent progress, there are still many uncertainties in diagnosis. The advent of DNA microarrays and RNA-seq [1] makes it possible to analyze tumor samples and classify them based on gene expression profiles. Moreover, the expression data of tens of thousands of genes can be obtained simultaneously through DNA microarrays or RNA-seq. Many methods for classifying or clustering molecular data based on gene expression have appeared in this area [2-14]. Huang and Zheng used independent component analysis [5] to extract features; Gao and Church introduced sparse nonnegative matrix factorization for feature extraction [4]; Zheng et al. proposed metasample-based sparse representation [7]; and Furey et al. used support vector machines [8] to classify gene expression data. All these methods have achieved impressive classification performance. The recently published sparse representation classification (SRC) is also a powerful tool for processing gene expression data. The SRC method was inspired by theories such as basis pursuit [15], compressive sensing for signal reconstruction [16], and least absolute shrinkage. It has already been widely used in face recognition [17] and texture classification [18]. In the SRC method, a test sample is represented as a sparse linear combination of the training samples, ideally only those from its own class. Furthermore, an l1-regularized least squares optimization is used to calculate an SR coefficient vector with only a few significant coefficients. 
In theory, a test sample can be well represented by using only the training samples from the same class. However, gene expression data are very noisy, which obscures the discriminative features, so that a test sample may also be represented by training samples from other classes. This decreases the classification accuracy. To reduce noise [19-21] and obtain salient features [20] for tumor classification, in this paper we introduce latent low-rank representation to preprocess gene expression data. By combining it with the SRC algorithm, we propose a new method for tumor classification. Latent low-rank representation (LatLRR) is a technique for extracting principal and salient features from the original data, and it is an improved version of low-rank representation (LRR). Both problems can be solved by the inexact augmented Lagrange multiplier (ALM) method. In [19-22], LRR has been successfully used for the recovery of subspace structure, subspace segmentation, feature extraction, outlier detection, and so forth. In [23], the authors introduced LRR theory for face recognition in order to remove noise and achieved impressive results. Based on these successful applications, in this paper we introduce LatLRR into the sparse representation classifier for tumor classification. First, we use LatLRR to remove noise from the original data and extract salient features. Then, based on the extracted salient features, we design a sparse representation classifier to classify new test samples. We refer to the proposed method as SRC-based latent low-rank representation (SRC-LatLRR). The rest of the paper is organized as follows. Section 2 describes our proposed SRC-LatLRR method in detail. We first review SRC and latent low-rank representation in Sections 2.1 and 2.2, respectively. Then we present our method in Section 2.3. Section 2.4 specifies our experimental setting. In Section 3, we evaluate our method using several publicly available gene expression data sets. Section 4 concludes the paper and outlines our future work. The abbreviations used in this paper are summarized in the Abbreviations section.

2. Methods

2.1. Sparse Representation Classification

Sparse representation classification is a supervised classification method. Let $W \in \mathbb{R}^{m \times n}$ denote a training sample matrix with $n$ samples and $m$ genes. Each DNA microarray chip usually contains thousands of genes, so the number of genes is much larger than the number of tumor samples; that is, $m \gg n$. Let $c_l$ be the $l$th sample of $W$, and let the $n$ samples be divided into $k$ object classes. Assuming that $n_i$ samples belong to the $i$th class and make up $W_i = [c_{i,1}, c_{i,2}, \ldots, c_{i,n_i}]$, the whole data set can be re-expressed as $W = [W_1, W_2, \ldots, W_k]$. Suppose that a new test sample $y \in \mathbb{R}^m$ belongs to the $i$th class. Based on the theory of sparse representation, $y$ lies in the linear span of the training samples $W_i$; that is,
$$y = \alpha_{i,1} c_{i,1} + \alpha_{i,2} c_{i,2} + \cdots + \alpha_{i,n_i} c_{i,n_i}, \tag{1}$$
where $\alpha_{i,j} \in \mathbb{R}$ is a scalar and $j = 1, 2, \ldots, n_i$. With a linear representation coefficient vector $x_0 \in \mathbb{R}^n$, $y$ can also be rewritten as
$$y = W x_0. \tag{2}$$
Ideally, if the training samples are sufficient and the training sample sets belonging to different classes are disjoint from each other, then we have
$$x_0 = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T; \tag{3}$$
that is, in $x_0$, only the entries corresponding to the same class as $y$ are nonzero. It follows that we can classify the test sample $y$ according to $x_0$, so the key problem is how to calculate $x_0$ in (2). As in [7], $x_0$ is sparse if the number of object classes $k$ is large; this is what sparse representation implies. According to the theory of compressive sensing [16, 24–26] and SR, $x_0$ can be obtained by solving the following l1-minimization problem:
$$\hat{x} = \arg\min_x \|x\|_1 \quad \text{subject to} \quad W x = y. \tag{4}$$
This problem can be solved by standard linear programming methods [15]. However, (4) generally has no exact solution, since $m \gg n$ makes the system $Wx = y$ overdetermined. A generalized version of (4) is therefore used:
$$\hat{x} = \arg\min_x \left\{ \|W x - y\|_2^2 + \lambda \|x\|_1 \right\}, \tag{5}$$
where $\lambda$ is a scalar regularization parameter that balances the reconstruction error against the sparsity of $x$ and thereby controls the tolerated noise level. In this study, we solve (5) by the truncated Newton interior-point method [27].
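To make the λ-regularized l1 problem described above concrete, the following is a minimal sketch that computes a sparse code for a synthetic test sample. It is not the authors' solver: plain ISTA (iterative soft-thresholding) is swapped in for the truncated Newton interior-point method, the squared-error form of the objective is assumed, and the function names, the synthetic data, and the value of `lam` are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code_ista(W, y, lam, n_iter=2000):
    """Approximately minimize ||W x - y||_2^2 + lam * ||x||_1 with plain ISTA.

    ISTA is a stand-in here (assumption); the paper reports using a
    truncated Newton interior-point solver instead.
    """
    # 2 * sigma_max(W)^2 is the Lipschitz constant of the gradient 2 W^T (W x - y).
    step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    x = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W.T @ (W @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Synthetic stand-in for expression data: m genes, two classes of 5 samples each.
rng = np.random.default_rng(0)
m, n_per_class = 200, 5
proto = rng.normal(size=(m, 2))  # one hypothetical "prototype" profile per class
W = np.hstack([proto[:, [c]] + 0.3 * rng.normal(size=(m, n_per_class))
               for c in range(2)])
W /= np.linalg.norm(W, axis=0)   # unit-norm columns
labels = np.repeat([0, 1], n_per_class)

# Test sample built only from class-1 training samples.
y = W[:, labels == 1] @ np.array([0.5, 0.0, 0.3, 0.0, 0.2])

x_hat = sparse_code_ista(W, y, lam=0.01)
# Total absolute coefficient mass assigned to each class.
mass = [np.abs(x_hat[labels == c]).sum() for c in range(2)]
print(mass)
```

In this toy setting the coefficient mass should concentrate on the class the test sample was built from, which is the property the classifier relies on.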

2.2. Latent Low-Rank Representation

Latent low-rank representation is an extension of low-rank representation (LRR). Consider an observed data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, where each column vector $x_i$ is a sample, and a dictionary $A = [a_1, a_2, \ldots, a_p] \in \mathbb{R}^{m \times p}$, where each $a_i$ is also a sample. $X$ can be linearly represented by the dictionary; that is,
$$X = AZ, \tag{6}$$
where $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{p \times n}$ is a coefficient matrix and each $z_i$ is the representation of $x_i$. Equation (6) means that each column vector of $X$ can be represented by a linear combination of the bases in $A$. In (6), the dictionary $A$ should be overcomplete enough to represent any observed data matrix $X$, but this also means that (6) has multiple feasible solutions $Z$. To single out an optimal solution, a low-rank criterion is added to (6):
$$\min_Z \operatorname{rank}(Z) \quad \text{s.t.} \quad X = AZ. \tag{7}$$
Here, the optimal solution $Z^*$ is the so-called lowest-rank representation of the data $X$ with respect to the dictionary $A$. Unfortunately, problem (7) is hard to solve because of the discrete nature of the rank function. Following matrix completion methods [28-30], we replace the rank function with the nuclear norm [31], so that problem (7) becomes
$$\min_Z \|Z\|_* \quad \text{s.t.} \quad X = AZ, \tag{8}$$
where $\|Z\|_*$ is the nuclear norm of the matrix $Z$, that is, the sum of its singular values. Strictly speaking, the dictionary $A$ should be overcomplete and noiseless, but such a dictionary is difficult to obtain. In practice, the observed data matrix $X$ itself is usually used as the dictionary [19, 21, 32]. This gives the following convex optimization problem:
$$\min_Z \|Z\|_* \quad \text{s.t.} \quad X = XZ. \tag{9}$$
For (9) to work well, two conditions need to be met. First, the data sampling $X$ should be sufficient. Second, $X$ should contain enough noiseless data to make the solution robust. In fact, the first condition can easily be met but the second cannot: because gene expression data are usually noisy, problem (9) may in reality be invalid and not robust. 
To solve the problem in (9), we introduce the following LRR problem [20]:
$$\min_Z \|Z\|_* \quad \text{s.t.} \quad X_O = [X_O, X_H] Z, \tag{10}$$
where $X_O$ is the observed data matrix and $X_H$ is the unobserved, that is, hidden data. We use the concatenation of $X_O$ and $X_H$ as the dictionary. The optimal solution of (10) is $Z^*_{O,H} = [Z^*_{O|H}; Z^*_{H|O}]$, where $Z^*_{O|H}$ and $Z^*_{H|O}$ correspond to $X_O$ and $X_H$, respectively. Solving (10) addresses both of the issues above. The next task is to recover the affinity matrix $Z^*_{O|H}$ by using only $X_O$, in the absence of the hidden data $X_H$. This method is called latent low-rank representation (LatLRR), which is an improvement of LRR. Given the two matrices $X_O$ and $X_H$, solving (10) gives
$$Z^*_{O|H} = V_O V_O^T, \qquad Z^*_{H|O} = V_H V_O^T, \tag{11}$$
where $V_O$ and $V_H$ are obtained from the skinny singular value decomposition $[X_O, X_H] = U \Sigma V^T$ with $V = [V_O; V_H]$. Namely, $X_O = U \Sigma V_O^T$ and $X_H = U \Sigma V_H^T$. From (11), we have
$$X_O = [X_O, X_H] Z^*_{O,H} = X_O Z^*_{O|H} + X_H Z^*_{H|O} = X_O Z^*_{O|H} + U \Sigma V_H^T V_H V_O^T = X_O Z^*_{O|H} + U \Sigma V_H^T V_H \Sigma^{-1} U^T X_O. \tag{12}$$
Let $L^*_{H|O} = U \Sigma V_H^T V_H \Sigma^{-1} U^T$; then we have the simple expression
$$X_O = X_O Z^*_{O|H} + L^*_{H|O} X_O. \tag{13}$$
If $X_O$ and $X_H$ come from the same collection of low-rank subspaces, then both $Z^*_{O|H}$ and $L^*_{H|O}$ should be of low rank, so we obtain
$$\min_{Z_{O|H}, L_{H|O}} \operatorname{rank}(Z_{O|H}) + \operatorname{rank}(L_{H|O}) \quad \text{s.t.} \quad X_O = X_O Z_{O|H} + L_{H|O} X_O. \tag{14}$$
As in [28-30], we replace the rank minimization problem with nuclear norm minimization, which gives the following convex optimization problem:
$$\min_{Z, L} \|Z\|_* + \|L\|_* \quad \text{s.t.} \quad X = XZ + LX. \tag{15}$$
Here, we write $X$, $Z$, and $L$ for $X_O$, $Z_{O|H}$, and $L_{H|O}$, respectively, for ease of notation. In (15), $X$ is assumed to be noiseless observed data. Because $X$ may contain corrupted data or noise, we extend (15) with a denoising model:
$$\min_{Z, L, E} \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E, \tag{16}$$
where $\lambda > 0$ is a scalar and $\|E\|_1$ is the l1-norm of the sparse noise matrix $E$. If $\lambda \to +\infty$, problem (16) is equivalent to (15), that is, there is no noise in the observed data $X$. In (16), the optimal solutions $XZ^*$, $L^*X$, and $E^*$ represent the principal features, salient features, and noise, respectively. 
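The decomposition of the observed data into a column-space part and a row-space part can be checked numerically. The sketch below, on synthetic rank-r data (an assumption, as are all variable names), builds the two matrices from the skinny SVD of the concatenated observed and hidden data in the way the passage above describes, and verifies that the observed data equal the observed data times the affinity matrix plus the latent projection times the observed data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_obs, n_hid, r = 30, 12, 8, 4

# Rank-r data: observed and hidden samples lie in the same r-dimensional subspace.
basis = rng.normal(size=(m, r))
X_all = basis @ rng.normal(size=(r, n_obs + n_hid))
X_O, X_H = X_all[:, :n_obs], X_all[:, n_obs:]

# Skinny SVD of [X_O, X_H]: keep only the r nonzero singular values.
U, s, Vt = np.linalg.svd(X_all, full_matrices=False)
U, s, V = U[:, :r], s[:r], Vt[:r].T
V_O, V_H = V[:n_obs], V[n_obs:]          # row blocks of V matching X_O and X_H

Z_OH = V_O @ V_O.T                        # affinity part: V_O V_O^T
L_HO = U @ np.diag(s) @ V_H.T @ V_H @ np.diag(1.0 / s) @ U.T  # latent projection

# The identity X_O = X_O Z + L X_O should hold up to rounding error.
recon = X_O @ Z_OH + L_HO @ X_O
print(np.allclose(recon, X_O))
```

The check works because the row blocks of V satisfy V_O^T V_O + V_H^T V_H = I, which is exactly what makes the two terms add back up to the observed data.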
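To solve the LatLRR problem in (16), we use the augmented Lagrange multiplier (ALM) method [33] and rewrite (16) as follows to meet the requirements of the ALM algorithm:
$$\min_{Z, L, J, S, E} \|J\|_* + \|S\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E, \quad Z = J, \quad L = S. \tag{17}$$
This problem can be solved by the ALM method, which minimizes the following augmented Lagrange function:
$$\|J\|_* + \|S\|_* + \lambda \|E\|_1 + \operatorname{tr}\left(Y_1^T (X - XZ - LX - E)\right) + \operatorname{tr}\left(Y_2^T (Z - J)\right) + \operatorname{tr}\left(Y_3^T (L - S)\right) + \frac{\mu}{2}\left(\|X - XZ - LX - E\|_F^2 + \|Z - J\|_F^2 + \|L - S\|_F^2\right), \tag{18}$$
where $\operatorname{tr}(\cdot)$ and $\|\cdot\|_F$ denote the trace and Frobenius norm of a matrix, respectively, $Y_1$, $Y_2$, and $Y_3$ are Lagrange multipliers, and $\mu > 0$ is a penalty parameter. More details about (18) can be found in [33].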
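As a rough illustration of how an inexact ALM scheme for the LatLRR problem can look, the sketch below alternates closed-form updates of the variables and then updates the multipliers. It assumes the usual splitting with auxiliary variables J and S standing in for Z and L; the function name `latlrr`, the update order, the default values of `mu`, `rho`, `tol`, and `max_iter`, and the synthetic data are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, tau):
    """Elementwise shrinkage: proximal operator of tau * l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def latlrr(X, lam, mu=1e-6, mu_max=1e6, rho=1.1, tol=1e-6, max_iter=500):
    """Inexact ALM sketch for min ||Z||_* + ||L||_* + lam*||E||_1, X = XZ + LX + E."""
    m, n = X.shape
    Z, J, Y2 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    L, S, Y3 = np.zeros((m, m)), np.zeros((m, m)), np.zeros((m, m))
    E, Y1 = np.zeros((m, n)), np.zeros((m, n))
    inv_z = np.linalg.inv(np.eye(n) + X.T @ X)   # reused in every Z update
    inv_l = np.linalg.inv(np.eye(m) + X @ X.T)   # reused in every L update
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)
        S = svt(L + Y3 / mu, 1.0 / mu)
        Z = inv_z @ (X.T @ (X - L @ X - E) + J + (X.T @ Y1 - Y2) / mu)
        L = ((X - X @ Z - E) @ X.T + S + (Y1 @ X.T - Y3) / mu) @ inv_l
        E = soft_threshold(X - X @ Z - L @ X + Y1 / mu, lam / mu)
        # Constraint residuals drive the multiplier updates and the stopping test.
        R1, R2, R3 = X - X @ Z - L @ X - E, Z - J, L - S
        Y1 += mu * R1
        Y2 += mu * R2
        Y3 += mu * R3
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max(), np.abs(R3).max()) < tol:
            break
    return Z, L, E

# Synthetic low-rank data (20 features x 30 samples) with sparse corruption.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 30))
X[rng.random(X.shape) < 0.05] += 5.0
X /= np.linalg.norm(X, axis=0)

Z, L, E = latlrr(X, lam=0.5)
D = L @ X   # salient-feature part of the decomposition
```

Each update is the exact minimizer of the augmented Lagrange function over one variable with the others held fixed; the two matrix inverses are computed once because they do not change across iterations.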

2.3. Sparse Representation Classification Based on LatLRR

Since LatLRR can extract the salient features and remove noise from the original data sets, in this study we first use LatLRR to suppress noise and obtain the salient features before using the observed data for classification. We then use the denoised data for tumor classification; that is, we factorize the observed data $X$ into
$$X = XZ + LX + E. \tag{19}$$
Here, we only use $D = LX$ for data classification. For a test sample $y$, we calculate its SR by solving
$$\hat{x} = \arg\min_x \left\{ \|Ly - Dx\|_2^2 + \lambda \|x\|_1 \right\}, \tag{20}$$
where the parameter $\lambda > 0$ can be determined experimentally and $x$ is a coefficient vector. We assume that the test sample $y$ belongs to one of the target classes and that the training data set is sufficient. When classifying $y$, we use $Ly$, where $L$ is the square matrix obtained by the LatLRR method when extracting the salient features. Ideally, $Ly$ can be linearly represented by the samples from the same class in $D$. Namely, the representation vector $x$ should be sparse, with its nonzero entries associated with the columns of $D$ from the same class, which allows the test sample to be classified. However, noise and modeling errors also introduce some nonzero entries in $x$ that correspond to columns of $D$ from multiple classes [17]. To handle this, we classify $Ly$ based on how well it can be reconstructed by using the coefficients from each class, as in [17]. Using the result of (20), we construct $\delta_i(x)$ as the characteristic function that selects the coefficients associated with the $i$th class in the coefficient vector $x$. By using only the $i$th class coefficients to reconstruct the test sample $Ly$ as $\hat{y}_i = D \delta_i(x)$, we can assign $Ly$ to the class with the minimum residual between $Ly$ and $\hat{y}_i$; that is,
$$\operatorname{identity}(y) = \arg\min_i r_i(y), \qquad r_i(y) = \|Ly - D \delta_i(x)\|_2. \tag{21}$$
Our classification algorithm can be summarized as follows.
Input. Observed data $X \in \mathbb{R}^{m \times n}$ for $k$ classes; test sample $y$.
Step 1. Normalize the columns of $X$.
Step 2. Extract the salient features of $X$ and remove noise to some extent to get the data $D$ defined in (19).
Step 3. Solve the optimization problem defined in (20).
Step 4. Compute the residuals $r_i(y) = \|Ly - D \delta_i(x)\|_2$ for $i = 1, 2, \ldots, k$.
Output. $\operatorname{identity}(y) = \arg\min_i r_i(y)$.
Our method can be seen as the combination of SRC [17] and latent low-rank representation for feature extraction [20], so we name it SRC-LatLRR. In SRC, the test sample is represented as a sparse linear combination of the training samples from the same class. In LatLRR, noise is removed to some extent and salient features are simultaneously extracted from the training samples. The introduction of LatLRR can therefore improve the classification accuracy of SRC to some extent.
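The residual rule in Step 4 and the Output step can be sketched as follows. The helper `classify_by_residual`, the toy matrix `D`, and the hand-written coefficient vector `x` are hypothetical and only illustrate the rule; in the actual pipeline `D`, `Ly`, and `x` would come from the LatLRR decomposition and the l1 solver.

```python
import numpy as np

def classify_by_residual(D, labels, x, Ly):
    """Return the class whose coefficients alone best reconstruct Ly.

    D      : features x training samples (salient-feature dictionary)
    labels : class label of each column of D
    x      : sparse coefficient vector for the test sample
    Ly     : transformed test sample
    """
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        # delta_i(x): keep the coefficients of class c, zero out the rest.
        delta_x = np.where(labels == c, x, 0.0)
        residuals.append(np.linalg.norm(Ly - D @ delta_x))
    residuals = np.array(residuals)
    return classes[int(np.argmin(residuals))], residuals

# Toy example: 3 features, two classes with two training columns each.
D = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 0, 1, 1])
Ly = np.array([0.1, 0.95, 0.0])
x = np.array([0.05, 0.0, 0.5, 0.5])   # hypothetical code with a small off-class entry

pred, res = classify_by_residual(D, labels, x, Ly)
print(pred, res)
```

Even though `x` has a nonzero entry for class 0, the class-1 coefficients reconstruct the sample far better, so the residual rule is robust to a few stray coefficients.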

2.4. Evaluation of the Performance

To evaluate our proposed method, we compare it with SRC [17, 34], LASSO [35], and SVM [8, 36, 37]. SVM has been shown to be one of the best classifiers for data with high dimensionality and small sample size [36, 37]. We carry out binary and multiclass classification experiments in Sections 3.1 and 3.2, respectively. For SRC, LASSO, and SVM, we report the best results, obtained by choosing appropriate parameters experimentally, and compare them with those of our method. Because the number of tumor samples is small, we use stratified 10-fold cross validation in all our experiments. In the multiclass classification experiments, we do not use the LASSO method because it is designed only for binary classification problems [35]. Since dimensionality reduction can improve both classification performance and computing speed, we also reduce the data dimensionality in our experiments by selecting genes according to the ratio of between-category to within-category sums of squares (BW).
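For reference, a common form of the between-category to within-category sums of squares criterion scores each gene by the spread of its class means around its overall mean, divided by its spread within classes, and keeps the top-scoring genes. The sketch below assumes that form; the function names and the toy data are illustrative, and the paper does not spell out its exact implementation.

```python
import numpy as np

def bw_ratio(X, labels):
    """Per-gene BW ratio. X is genes x samples, labels has one entry per sample."""
    overall = X.mean(axis=1, keepdims=True)
    between = np.zeros(X.shape[0])
    within = np.zeros(X.shape[0])
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        # Each sample of class c contributes (class mean - overall mean)^2.
        between += Xc.shape[1] * (mean_c - overall).ravel() ** 2
        within += ((Xc - mean_c) ** 2).sum(axis=1)
    # Guard against genes that are constant within every class.
    return between / np.maximum(within, 1e-12)

def select_top_genes(X, labels, n_genes):
    """Indices of the n_genes genes with the largest BW ratio."""
    return np.argsort(bw_ratio(X, labels))[::-1][:n_genes]

# Toy data: gene 0 separates the classes, gene 1 barely does.
X = np.array([[0.0, 0.1, 1.0, 1.1],
              [0.5, 0.4, 0.5, 0.6]])
labels = np.array([0, 0, 1, 1])

scores = bw_ratio(X, labels)
top = select_top_genes(X, labels, 1)
print(scores, top)
```

In a cross-validation setting, such scores are typically computed on the training folds only, so that the held-out samples do not influence which genes are kept.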

3. Experimental Results

3.1. Two-Class Classification Problem

In this subsection, three two-class microarray data sets are used to evaluate our method: colon cancer [38], prostate cancer [39], and diffuse large B-cell lymphoma (DLBCL) [40]. The colon data set contains 62 samples, consisting of 40 tumor and 22 normal samples. The prostate data set contains prostate tumor and normal prostate samples, each described by the expression levels of 12600 genes. For the DLBCL data set, the gene expression values were measured by high-density oligonucleotide microarrays. An overview of the three data sets is given in Table 1.

Table 1

Three binary data sets used in the experiments.

Dataset	Class 1 samples	Class 2 samples	Genes
Colon cancer	40	22	2000
Prostate cancer	77	59	12600
DLBCL	58	19	5469

The classification results obtained with SVM, LASSO, SRC, and the proposed SRC-LatLRR are listed in Table 2. From Table 2, we can see that SRC-LatLRR performs well on all three data sets. Although SRC-LatLRR does not outperform SRC on the prostate cancer data set, it is still better than SVM and LASSO there. In summary, SRC is the best classifier on the prostate cancer data set, SRC-LatLRR is the best on the colon cancer data set, and the two tie for the best accuracy on the DLBCL data set.

Table 2

Classification accuracies (%) of different methods on the three binary data sets.

Datasets	SVM	LASSO	SRC	SRC-LatLRR
Colon cancer	85.48	85.48	85.48	90.32
Prostate cancer	91.18	91.91	94.85	94.12
DLBCL	96.10	96.10	97.40	97.40
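As a reading aid, the percentages in Table 2 can be converted back into counts of correctly classified samples. The sketch below assumes that each accuracy is the number of correct predictions pooled over the cross-validation folds divided by the total number of samples in Table 1; this interpretation is an assumption, not something stated in the paper.

```python
# Total samples per data set, taken from Table 1 (class 1 + class 2).
totals = {"Colon cancer": 40 + 22, "Prostate cancer": 77 + 59, "DLBCL": 58 + 19}
# SRC-LatLRR accuracies (%) from Table 2.
src_latlrr_acc = {"Colon cancer": 90.32, "Prostate cancer": 94.12, "DLBCL": 97.40}

def implied_correct(acc_percent, n_samples):
    """Nearest whole number of correct predictions for a given accuracy."""
    return round(acc_percent / 100 * n_samples)

counts = {name: implied_correct(src_latlrr_acc[name], n) for name, n in totals.items()}
# Recompute the percentages from the counts to confirm they round-trip.
roundtrip = {name: round(100 * counts[name] / totals[name], 2) for name in totals}
print(counts, roundtrip)
```

Under this reading, the gap between 85.48% and 90.32% on the colon data set corresponds to three additional correctly classified samples out of 62.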

To further evaluate our method, we also applied BW feature selection before classifying these three data sets. The results are listed in Table 3, where the number of selected genes is given in parentheses after each data set name. From Table 3, we can see that after feature selection, our proposed classification method outperforms the other three classification methods, and it even achieves an accuracy of 100% on the DLBCL data set.

Table 3

Classification accuracies (%) of different methods with gene selection on the three binary data sets.

Datasets	SVM	LASSO	SRC	SRC-LatLRR
Colon cancer (1000)	87.10	87.10	87.10	91.94
Prostate cancer (1500)	94.85	91.18	95.59	96.32
DLBCL (800)	97.40	93.51	97.40	100.00

3.2. Multiclass Classification Problem

In this subsection, we use four multiclass data sets to further check the classification performance of SRC-LatLRR: lung cancer [41], leukemia [42], 11_tumors [43], and 9_tumors [44]. The lung cancer data set contains 203 samples from four classes of lung cancer and one normal class. In the leukemia data set, all samples are classified as acute myelogenous leukemia, acute lymphoblastic leukemia, or mixed-lineage leukemia; it includes 72 samples with 11225 genes. The 11_tumors data set includes 174 samples from 11 classes: ovary, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, prostate, pancreas, adeno lung, and squamous lung. The 9_tumors data set contains 60 samples with 5726 genes from 9 tumor types: non-small-cell lung, colon, breast, ovarian, leukemia, renal, melanoma, prostate, and central nervous system. Detailed descriptions of the four data sets are listed in Table 4. All four data sets were produced with oligonucleotide microarrays and the Affymetrix GENECHIP analysis tool [36].

Table 4

Descriptions of the four multiclass data sets used in DNA classification experiments.

Dataset	Classes	Samples	Genes
Lung cancer	5	203	12600
Leukemia	3	72	11225
11_tumors	11	174	12533
9_tumors	9	60	5726

The experimental results are listed in Table 5. From these results, we can see that the proposed method SRC-LatLRR does not have a clear advantage over SVM and SRC. The reason may be that in these data sets, the training samples of each class are very few so that the sample space is not complete.

Table 5

Classification accuracies (%) of different methods on the multiclass data sets.

Dataset	SVM	SRC	SRC-LatLRR
Lung cancer	96.05	95.07	95.07
Leukemia	96.60	95.83	98.61
11_tumors	94.68	94.83	94.83
9_tumors	65.10	66.67	66.67

We then applied BW feature selection before applying our method. The results are listed in Table 6. They show that the proposed method classified leukemia well, while for the other data sets it has no clear advantage over SVM. It did, however, perform at least as well as SRC on all four data sets (the two tie on 11_tumors).

Table 6

Classification accuracies (%) of different methods with gene selection on the multiclass data sets.

Dataset	SVM	SRC	SRC-LatLRR
Lung cancer (2000)	96.62	95.07	95.57
Leukemia (3000)	96.90	95.83	98.61
11_tumors (1000)	96.07	95.40	95.40
9_tumors (2000)	85.84	71.67	80.00

3.3. The Choice of the Balanced Parameter

In this section, we use the data sets described in Section 3.1 to check how λ in (16) affects the classification performance. Figures 1, 2, and 3 show the accuracy and the removed noise level of our method at different values of λ for the colon, prostate, and DLBCL data sets, respectively. From (16), we know that the smaller λ is, the more noise is removed. In these three figures we use ||E||1 to represent the level of the removed noise. The figures show that the amount of noise removed from the original data cannot be too large, or the accuracy will drop. The reason is that if λ is set too small, useful information may be removed along with the noise. Conversely, if λ is too large, too little noise is removed, and we still cannot get a good classification result. The experiments suggest that λ = 0.011 is the best choice for the colon data set, and that λ = 0.096 and λ = 0.1 are the best choices for the prostate and DLBCL data sets, respectively.

Figure 1

The changing curves of classification accuracy and removed noise level with λ on the colon data set.

Figure 2

The changing curves of classification accuracy and removed noise level with λ on the prostate data set.

Figure 3

The changing curves of classification accuracy and removed noise level with λ on the DLBCL data set.

4. Conclusions

Cancer diagnosis is one of the most important clinical applications of gene expression data. In this paper, we have proposed a new SR-based method for tumor classification, which uses the denoised salient features extracted from the original samples to classify a test sample. We compared our method with several state-of-the-art methods, including SVM, LASSO, and SRC, on seven data sets. The experimental results show that the proposed method is competitive with SVM, LASSO, and SRC and gives the best accuracy on all three binary data sets after gene selection, while its advantage on the multiclass data sets is less clear. These results indicate that SRC-LatLRR is effective and efficient for tumor classification. We also introduced gene selection into our method, and the results show that gene selection can improve the classification accuracy to some extent. During the study we also found that, in the optimal LatLRR solution for the observed samples, Z* represents the affinity matrix of the samples [21]. In theory, the affinity matrix can be used to cluster samples. In the future, we will extend the method to investigate the properties of sample clusters.

25 in total

1. Support vector machine classification and validation of cancer tissue samples using microarray expression data.

Authors: T S Furey; N Cristianini; N Duffy; D W Bednarski; M Schummer; D Haussler
Journal: Bioinformatics Date: 2000-10 Impact factor: 6.937

2. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning.

Authors: Margaret A Shipp; Ken N Ross; Pablo Tamayo; Andrew P Weng; Jeffery L Kutok; Ricardo C T Aguiar; Michelle Gaasenbeek; Michael Angelo; Michael Reich; Geraldine S Pinkus; Tane S Ray; Margaret A Koval; Kim W Last; Andrew Norton; T Andrew Lister; Jill Mesirov; Donna S Neuberg; Eric S Lander; Jon C Aster; Todd R Golub
Journal: Nat Med Date: 2002-01 Impact factor: 53.440

3. Chemosensitivity prediction by transcriptional profiling.

Authors: J E Staunton; D K Slonim; H A Coller; P Tamayo; M J Angelo; J Park; U Scherf; J K Lee; W O Reinhold; J N Weinstein; J P Mesirov; E S Lander; T R Golub
Journal: Proc Natl Acad Sci U S A Date: 2001-09-11 Impact factor: 11.205

4. Metagenes and molecular pattern discovery using matrix factorization.

Authors: Jean-Philippe Brunet; Pablo Tamayo; Todd R Golub; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2004-03-11 Impact factor: 11.205

5. Improving molecular cancer class discovery through sparse non-negative matrix factorization.

Authors: Yuan Gao; George Church
Journal: Bioinformatics Date: 2005-11-01 Impact factor: 6.937

6. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

Authors: U Alon; N Barkai; D A Notterman; K Gish; S Ybarra; D Mack; A J Levine
Journal: Proc Natl Acad Sci U S A Date: 1999-06-08 Impact factor: 11.205

7. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Authors: A Bhattacharjee; W G Richards; J Staunton; C Li; S Monti; P Vasa; C Ladd; J Beheshti; R Bueno; M Gillette; M Loda; G Weber; E J Mark; E S Lander; W Wong; B E Johnson; T R Golub; D J Sugarbaker; M Meyerson
Journal: Proc Natl Acad Sci U S A Date: 2001-11-13 Impact factor: 11.205

8. Molecular classification of human carcinomas by use of gene expression signatures.

Authors: A I Su; J B Welsh; L M Sapinoso; S G Kern; P Dimitrov; H Lapp; P G Schultz; S M Powell; C A Moskaluk; H F Frierson; G M Hampton
Journal: Cancer Res Date: 2001-10-15 Impact factor: 12.701

9. Gene expression correlates of clinical prostate cancer behavior.

Authors: Dinesh Singh; Phillip G Febbo; Kenneth Ross; Donald G Jackson; Judith Manola; Christine Ladd; Pablo Tamayo; Andrew A Renshaw; Anthony V D'Amico; Jerome P Richie; Eric S Lander; Massimo Loda; Philip W Kantoff; Todd R Golub; William R Sellers
Journal: Cancer Cell Date: 2002-03 Impact factor: 31.743

10. Classification and selection of biomarkers in genomic data using LASSO.

Authors: Debashis Ghosh; Arul M Chinnaiyan
Journal: J Biomed Biotechnol Date: 2005-06-30
