Literature DB >> 29162964

Robust Feature Selection Approach for Patient Classification using Gene Expression Data.

Md Shahjaman1,2, Nishith Kumar1,3, Md Shakil Ahmed1, AnjumanAra Begum1, S M Shahinul Islam4, Md Nurul Haque Mollah1.   

Abstract

Patient classification through feature selection (FS) based on gene expression data (GED) has already become popular to the research communities. T-test is the well-known statistical FS method in GED analysis. However, it produces higher false positives and lower accuracies for small sample sizes or in presence of outliers. To get rid from the shortcomings of t-test with small sample sizes, SAM has been applied in GED. But, it is highly sensitive to outliers. Recently, robust SAM using the minimum β-divergence estimators has overcome all the problems of classical t-test & SAM and it has been successfully applied for identification of differentially expressed (DE) genes. But, it was not applied in classification. Therefore, in this paper, we employ robust SAM as a feature selection approach along with classifiers for patient classification. We demonstrate the performance of the robust SAM in a comparison of classical t-test and SAM along with four popular classifiers (LDA, KNN, SVM and naive Bayes) using both simulated and real gene expression datasets. The results obtained from simulation and real data analysis confirm that the performance of the four classifiers improve with robust SAM than the classical t-test and SAM. From a real Colon cancer dataset we identified 21 additional DE genes using robust SAM that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we perform KEGG pathway enrichment analysis and found that these genes are involved in some important pathways related to cancer disease.

Entities:  

Keywords:  Feature selection; classification; robust SAM; β-divergence estimators

Year:  2017        PMID: 29162964      PMCID: PMC5680713          DOI: 10.6026/97320630013327

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

Nowadays the big biological data is one of the hottest topics for the researchers. Gene expression datasets is the high-dimensional big datasets because it contains ten thousands of genes/features with very few patients/samples [1]. This behavior of gene expression data often refers to the curse of dimensionality [2-3]. Thus analyzing of these types of datasets has become complicated and challenging for the researchers. The goal of classification is to allocate/classify the new objects into one of two or more population of the training dataset whose categories are known in advance. Cancer classification based on gene expression dataset is important for subsequent diagnosis and treatment. Without correct classification of different cancer types of the patient, it is very difficult to provide proper treatment and therapies [4]. The conventional classification methods are largely dependent on different morphological parameters to classify cancer. Thus their applications become limited with low prediction accuracies. To get rid from the curse of dimensionality of GED, classification through informative gene identification or feature selection (FS) has already attracted to the research communities [5]. FS can boost the performance of the classifiers by selecting smaller number of features. It also reduces the computational time and provides more reliable estimates to train the classifiers. There are three types of FS methods for GED analysis; (a) wrapper method, (b) embedded method and (b) filter based method [6-7]. Wrapper method searches the features until a certain accuracy of the classifier was achieved. Embedded methods embed feature selection within classifier construction. Filter based method first select few informative features (DE genes) using the labeled samples of training dataset and based on these pre selected features, researchers perform the further classification task. Filter based methods are easily understandable and computationally faster than the wrapper and embedded methods, thus they are better suited to high dimensional datasets [8]. Among the filter-based methods, t-test is one of the popular and widely used methods in gene expression data analysis [9]. However, the major drawback of this classical t-test is that it produces higher false discoveries and lower accuracies with small-sample sizes or outlying gene expressions. Significance Analysis of Microarrays (SAM) has overcome the shortcomings of t-test for small-sample case by controlling false discoveries [10]. However, SAM is very sensitive to outliers and produces misleading results in presence of outlying gene expressions. Consequently, the popular classifiers produce misleading results in presence of outliers when feature selection is performed using classical t-test or SAM. Recently, we have robustified the SAM approach by minimum β-divergence estimators to solve the allaforesaid problems of classical t-test and SAM [11]. Therefore, in this paper, we employ robust SAM as a feature selection method along with classifiers. To investigate the performance of the robust SAM in a comparison with classical t-test and SAM, we pick up four popular classifiers: linear discriminant analysis (LDA) [12], K-nearest neighborhood (KNN) [13], support vector machine (SVM) [14] and naive Bayes classifier [15]. From a real Colon cancer dataset we identified additional 21 DE genes using robust SAM approach that were not identified by the classical ttest or SAM approach. Using the functional annotation and KEGG pathway enrichment analysis we revealed that 15 genes out of 21 genes, are involved in some important pathways related to cancer disease.

Methodology

Performance Evaluation

In order to evaluate the performance of different classifiers for binary classification test such as normal or cancer, we used different statistical measures. For binary class prediction, the outcomes are always divided into four categories: (a) normal samples are correctly predicted as normal (true positives: TP), (b) normal samples are incorrectly predicted as cancer (false negative: FN), (c) cancer samples are correctly predicted as cancer (true negative: TN) and (d) cancer samples are incorrectly predicted as normal (false positive: FP). Then we calculate the following performance measures based on these performance measures: True positive rate (TPR) = TP / TP + FN, False positive rate (FPR) = FP / (FP + TN), True negative rate (TNR) = TN / (TN + FP) and area under the receiving operating characteristics (ROC) curve, AUC = (TPR + TNR) / 2.

Patient Classification through Robust SAM Approach

Gene expression datasets are often contaminated by outliers due to several steps involve in the data generating process from hybridization of DNA samples to image analysis [16]. If outliers are present in the dataset then the results of the downstream analysis might be changed. Despite the popularity of the statistical FS methods (t-test or SAM), they are sensitive to outliers. Therefore, in this paper, we used robust SAM [11] as a feature selection method to select the smaller number of informative features to train the classifiers Figure 4. The detail procedure of patient classification is as follows:
Figure 4

Computational pipeline for patient classification through robust SAM

1) Apply the robust SAM approach in the GED to select the informative features or DE genes using the p-values. 2) Adjust the p-values for multiple testing corrections using Benjamini-Hochberg method. Then arrange the adjusted p-values in ascending order. 3) Select first T < max (n1, n2) genes out of G genes as top DE genes from the training dataset. Here, n1 and n2 are the number of patient in the normal and cancer group, respectively. G is the total number of gene in the dataset. 4) Estimate the parameters of the classifiers using the expressions of these top T DE genes based on training dataset. 5) Select the expressions of top T DE genes from test dataset to obtain the reduced test dataset. 6) Finally, classify the patients of the test dataset into one of two groups (normal/cancer).

Dataset

Simulated Gene Expression Dataset

We generate the simulated gene expression dataset from the following model as described in Table 1. In this table g1 and g2 represents the up-regulated and down-regulated DE gene group, respectively and g3 represents the EE gene group. We generated gene expression profiles of G=10,000 genes, with k=2 groups (normal/cancer). We considered 100 datasets for both small (N1=N2=10) and large (N1=N2=40) sample cases, respectively. Each dataset for each case represents the gene expression profiles of G=10,000 genes with N= (N1+N2) samples. We set the values of the parameter d as 2 and σ2 = 0.1. Among the expression of 10,000 genes for each datasets we divided these expressions in to two groups (expressions of important features or DE genes, 200 and expressions of the unimportant features or EE genes, 9800). We randomly divided each of the 100 datasets into two independent datasets to construct the training and test dataset such that training and test datasets consist of n1 = N1 / 2 samples in normal and n2 = N2 / 2 samples in cancer group.
Table 1

Simulated gene expression data generating model for k=2 groups

Gene GroupPatients
Normal (N1)Cancer(N2)
g1-d + N(0, σ 2)d + N(0, σ 2)
g2d + N(0, σ 2)-d + N(0, σ 2)
g3d + N(0, σ 2)d + N(0, σ 2)

Real Gene Expression Dataset

This dataset consists gene expression profiles of 6,500 human genes collected from 40 tumor and 22 normal colon tissue samples were analyzed with an Affymetrix technology [17]. Among the 6,500 genes, the highest minimal intensity across the samples with 2000 genes was selected for the further analysis. This dataset can also be downloaded from the R-package ''plsgenomics''.

Result and Discussion

To demonstrate the performance of the robust SAM in a comparison of classical t-test and SAM for both simulated and real gene expression datasets, we pick up four popular classifiers: linear discriminant analysis (LDA) [12], K-nearest neighborhood (KNN) [13], support vector machine (SVM) [14] and naive Bayes classifier [15]. We used three R packages for the four classifiers: MASS for LDA, kknn for KNN, e1071 for SAM and naive Bayes. The performance measure AUC was computed for each of the classifiers using ROC R package. All R packages are available in the comprehensive R archive network (cran) or bioconductor.

Performance Evaluation Based on Simulated Gene Expression Dataset

To investigate the performance of the robust SAM in a comparison of classical t-test and SAM, we employed these three FS methods to identify the 200 informative features (DE genes) from each of the 100 simulated training datasets as described above. We select the top T < max (n1, n2) DE genes obtained from the three methods by ranking the adjusted p-values in ascending order. The adjusted p-values were obtained using Benjamini- Hochberg method. The expressions of these top T (10 or 40) detected DE genes are then used to train the four popular classifiers (LDA, KNN, SVM and naive Bayes) to predict the patients/samples class. We computed four performance measures such as TPR, TNR, FPR and AUC based on reduced training and test datasets using the four classifiers. A method is said to be good performer if it produces larger values of TPR, TNR, AUC and smaller values of FPR. To show the effect of outliers in the three FS methods, we randomly corrupted 5%, 20% and 35% genes by a single outlier in the training datasets. Here, values of outliers are considered as larger than the maximum value of the expressions. The Figure 1 represents the test ROC curve produced by the four classifiers using the average values of 100 estimated FPR and TPR based on 100 training and test datasets for small-sample case (n1=n2=5). From this figure we observe that, in absence of outliers, all the four classifiers (LDA, KNN, SVM and naive Bayes) performed well when feature selection is carried out from SAM and robust SAM. In this case classical t-test performed slightly worse than the SAM and robust SAM. But in presence of outliers (5%, 20% and 35%), the performances of all the four classifiers deteriorate by producing lower values of AUC (< 0.80) with classical t test and SAM. Four classifiers performed well with robust SAM (AUC > 0.85) for same conditions. On the other hand, for large-sample (n1=n2=20) case in absence and presence of 5% and 20% outliers, all the four classifiers produces almost similar values of AUC with classical ttest, SAM and robust SAM. But in presence of 35% outliers, in this case, these four classifiers performed well only when FS is carried out from robust SAM (see Table 4 in supplementary file).
Figure 1

Performance evaluation using tests ROC curve produced by four classifiers for simulated dataset with sample size (n1=n2=5). (a) In absence of outliers. (b) In presence of 5% outliers. (c) In presence of 20% outliers. (d) In presence of 35% outliers.

Table 4

Performance evaluation using test AUC values estimated by the four classifiers for simulated dataset

Feature Selection (FS)For large-sample case (n1= n2=20)
In absence of outliersIn presence of 5% outliers
LDAKNNSVMnaive BayesLDAKNNSVMnaive Bayes
t-test0.9830.960.9920.9850.9640.9520.9820.961
SAM0.9850.9720.9930.9920.950.9620.9840.973
robust SAM0.980.9630.9910.9930.9820.9630.990.992
FSIn presence of 20% outliersIn presence of 35% outliers
LDAKNNSVMnaive BayesLDAKNNSVMnaive Bayes
t-test0.9350.9280.9520.9470.60.660.530.62
SAM0.930.9150.9490.9330.6210.6320.5620.633
robust SAM0.980.9620.990.9910.9740.9520.9820.987
In this table performance measure test AUC values were estimated by the four classifiers (LDA, KNN, SVM and naive Bayes) based on top 200 DE genes for large (n1= n2=20) sample cases.

Performance Evaluation Based on Real Colon cancer Gene Expression Dataset

To demonstrate the performance of robust SAM in a comparison of classical t-test and SAM, we employed these methods in the real Colon cancer dataset to detect the DE genes. We select top 200 DE genes by ranking the adjusted p-values. The adjusted pvalues were obtained using Benjamini-Hochberg method. The detecting performance of top 200 DE genes using classical t-test, SAM and robust SAM is shown in a Venn diagram of Figure 2a. From this figure we notice that there are additional 13, 8 and 21 DE genes identified by classical t-test, SAM and robust SAM, respectively. The Figure 2b shows heatmap using the 21 genes detected by the robust SAM. We can clearly observe from this heatmap that these 21 genes have classified the samples into two groups (normal and cancer). To investigate the classification performance of the four classifiers (LDA, KNN, SVM and naive Bayes) using the expressions of additional 13, 8 and 21 genes detected by the three methods, we randomly divided this Colon cancer dataset into two datasets (training dataset and test dataset) such that each dataset contains same number of samples. Then we computed the four performance measures (TPR, TNR, FPR and AUC) using the four classifiers. The average values of AUC are summarized in Table 2. From this table we clearly notice that the performance of all the classifiers is improved using robust SAM. We also observe that SVM and naive Bayes classifiers performed better than the LDA and KNN. The test ROC curve shown in Figure 2c also supports the results of Table 2. The Figure 2d shows the boxplot of test AUC values estimated by the four classifiers for Colon cancer dataset using classical t-test, SAM and robust SAM. This plot also supports the results of Table 2. To elucidate the molecular functions and KEGG pathways of these additional 21 genes, we used WebGestalt software package [18]. The Figure 3 shows the bar chart of the biological process, cellular component and molecular function categories. In this figure there are 15 genes out of 21 genes involved in the three categories. The top ten KEGG pathways for additional 21 genes detected by the robust SAM is summarized in Table 3. We found that DNA replication pathway is the highest enriched pathway.
Figure 2

Comparison of the DE genes detected by t-test, SAM and robust SAM for the Colon cancer dataset. (a) Venn diagram of DE genes detected by t-test, SAM and robust SAM. (b) Heatmap of 21 DE genes identified by the robust SAM. (c) Test ROC curve produced by four classifiers using the expression values of 13, 8 and 21 DE genes identified by t-test, SAM and robust SAM, respectively. (d) Boxplot of AUC values estimated by the four classifiers using t-test, SAM and robust SAM. 1000 trials were performed to obtain this result.

Table 2

Performance evaluation using test AUC values for Colon cancer dataset

ClassifiersFeature Selection Methods
t-testSAMRobust SAM
LDA0.7880.8280.834
KNN0.7450.7660.787
SVM0.8390.8620.914
Naive Bayes0.8170.8250.873
The performance measure AUC values were estimated using four classifiers (LDA, KNN, SVM and naive Bayes), based on 13, 8 and 21 DE genes identified by classical t-test, SAM and robust SAM approach, respectively.
Figure 3

Functional annotation of 21 DE genes identified by the robust SAM. Frequency distribution of biological process, cellular component and molecular function categories for 15 DE genes identified by robust SAM. KEGG identified 15 genes out of 21 DE genes using in WebGestalt software.

Table 3

KEGG pathways for 21 DE genes detected using robust SAM for Colon cancer dataset

KEGG IDName of PathwaysNo of GeneAdjusted p-values
hsa03030DNA replication 22.29E-01
hsa00230Purine metabolism 32.29E-01
hsa00511Other glycan degradation 19.70E-01
hsa03430Mismatch repair 19.70E-01
hsa00062Fatty acid elongation 19.70E-01
hsa03410Base excision repair 19.70E-01
hsa05166HTLV-I infection 29.70E-01
hsa03440Homologous recombination 19.70E-01
hsa00071Fatty acid degradation 19.70E-01
hsa03420Nucleotide excision repair 19.70E-01
KEGG terms that are significantly enriched in the 15 Colon cancer related genes detected by the robust SAM. The p-values were calculated using hypergeometric test and then adjusted by Benjamini-Hochberg method for multiple testing corrections. 15 genes out of 21 genes were mapped using the KEGG map in WebGestalt sortware.

Conclusion

Patient classification into various sources of population of training dataset is very popular in GED. t-test and SAM are the popular FS methods for patient classification using GED. However, both of them suffer from outliers. To prevail over the problems of classical t-test and SAM, robust SAM using the minimum β-divergence estimators was proposed [11]. In this paper, we employed robust SAM as a FS method along with classifiers. From a real Colon cancer dataset we identified additional 21 DE genes by robust SAM and we found that these genes are involved in some important pathways related to cancer disease. Then we apply the expressions of these 21 genes in the classification and reveal that the classification performances improve using these genes. Moreover, we notice that SVM and naive Bayes classifiers performed better compare to the LDA and KNN.
  13 in total

1.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.

Authors:  Tao Li; Chengliang Zhang; Mitsunori Ogihara
Journal:  Bioinformatics       Date:  2004-04-15       Impact factor: 6.937

Review 2.  Penalized feature selection and classification in bioinformatics.

Authors:  Shuangge Ma; Jian Huang
Journal:  Brief Bioinform       Date:  2008-06-18       Impact factor: 11.622

3.  Gene expression profiling predicts clinical outcome of breast cancer.

Authors:  Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal:  Nature       Date:  2002-01-31       Impact factor: 49.962

4.  Gene-expression-based cancer subtypes prediction through feature selection and transductive SVM.

Authors:  Ujjwal Maulik; Anirban Mukhopadhyay; Debasis Chakraborty
Journal:  IEEE Trans Biomed Eng       Date:  2012-10-18       Impact factor: 4.538

5.  Big biological data: challenges and opportunities.

Authors:  Yixue Li; Luonan Chen
Journal:  Genomics Proteomics Bioinformatics       Date:  2014-10-14       Impact factor: 7.691

6.  Semi-Supervised Projective Non-Negative Matrix Factorization for Cancer Classification.

Authors:  Xiang Zhang; Naiyang Guan; Zhilong Jia; Xiaogang Qiu; Zhigang Luo
Journal:  PLoS One       Date:  2015-09-22       Impact factor: 3.240

7.  Robust Significance Analysis of Microarrays by Minimum β-Divergence Method.

Authors:  Md Shahjaman; Nishith Kumar; Md Manir Hossain Mollah; Md Shakil Ahmed; Anjuman Ara Begum; S M Shahinul Islam; Md Nurul Haque Mollah
Journal:  Biomed Res Int       Date:  2017-07-27       Impact factor: 3.411

8.  Robustification of Naïve Bayes Classifier and Its Application for Microarray Gene Expression Data Analysis.

Authors:  Md Shakil Ahmed; Md Shahjaman; Md Masud Rana; Md Nurul Haque Mollah
Journal:  Biomed Res Int       Date:  2017-08-07       Impact factor: 3.411

9.  Clustering cancer gene expression data by projective clustering ensemble.

Authors:  Xianxue Yu; Guoxian Yu; Jun Wang
Journal:  PLoS One       Date:  2017-02-24       Impact factor: 3.240

10.  WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013.

Authors:  Jing Wang; Dexter Duncan; Zhiao Shi; Bing Zhang
Journal:  Nucleic Acids Res       Date:  2013-05-23       Impact factor: 16.971

View more
  1 in total

1.  Identification of NEO1 as a prognostic biomarker and its effects on the progression of colorectal cancer.

Authors:  Meng Zhang; Zhou Zhou; Xue-Kai Pan; Yun-Jiao Zhou; Hai-Ou Li; Pei-Shan Qiu; Meng-Na Zhang; Ru-Yi Peng; Hai-Zhou Wang; Lan Liu; Jing Liu; Qiu Zhao
Journal:  Cancer Cell Int       Date:  2020-10-17       Impact factor: 5.722

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.