Md Shahjaman1,2, Nishith Kumar1,3, Md Shakil Ahmed1, Anjuman Ara Begum1, S M Shahinul Islam4, Md Nurul Haque Mollah1.
Abstract
Patient classification through feature selection (FS) based on gene expression data (GED) has become popular in the research community. The t-test is a well-known statistical FS method in GED analysis. However, it produces more false positives and lower accuracy when sample sizes are small or outliers are present. To overcome the shortcomings of the t-test with small sample sizes, significance analysis of microarrays (SAM) has been applied to GED, but SAM is highly sensitive to outliers. Recently, a robust SAM based on minimum β-divergence estimators has overcome the problems of the classical t-test and SAM, and it has been successfully applied to the identification of differentially expressed (DE) genes. However, it has not yet been applied to classification. In this paper, we therefore employ robust SAM as a feature selection approach together with classifiers for patient classification. We compare the performance of robust SAM with that of the classical t-test and SAM, in combination with four popular classifiers (LDA, KNN, SVM and naive Bayes), using both simulated and real gene expression datasets. The results from the simulation and real data analyses show that the four classifiers perform better with robust SAM than with the classical t-test or SAM. From a real colon cancer dataset, we identified 21 additional DE genes using robust SAM that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we performed KEGG pathway enrichment analysis and found that these genes are involved in several important cancer-related pathways.
Keywords: Feature selection; classification; robust SAM; β-divergence estimators
Year: 2017 PMID: 29162964 PMCID: PMC5680713 DOI: 10.6026/97320630013327
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
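The abstract attributes the outlier resistance of robust SAM to minimum β-divergence estimators. As a rough illustration only, the sketch below uses the fixed-point iteration commonly given for the β-weighted mean and variance of a univariate Gaussian. The function name `beta_gaussian_fit`, the value β = 0.2, the median/MAD initialization, and the toy data are all assumptions made here; the record does not show the authors' implementation or how the estimates enter the SAM statistic.

```python
import numpy as np

def beta_gaussian_fit(x, beta=0.2, n_iter=100, tol=1e-8):
    """Robust mean and variance of a 1-D sample via beta-weighted updates.

    Assumes the sample is not degenerate (its MAD must be non-zero).
    """
    x = np.asarray(x, dtype=float)
    # Robust starting values: median and scaled median absolute deviation.
    mu = np.median(x)
    var = (1.4826 * np.median(np.abs(x - mu))) ** 2
    for _ in range(n_iter):
        # Points far from the current mean receive weights close to zero.
        w = np.exp(-0.5 * beta * (x - mu) ** 2 / var)
        new_mu = np.sum(w * x) / np.sum(w)
        # The factor (1 + beta) offsets the shrinkage caused by the weights.
        new_var = (1.0 + beta) * np.sum(w * (x - new_mu) ** 2) / np.sum(w)
        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    weights = np.exp(-0.5 * beta * (x - mu) ** 2 / var)
    return mu, var, weights

# Toy expression values for one gene, with a single gross outlier.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 30.0])
mu, var, weights = beta_gaussian_fit(x)
print("sample mean:", x.mean(), "robust mean:", mu, "robust variance:", var)
```

With β = 0 every weight equals 1 and the updates reduce to the ordinary sample mean and maximum-likelihood variance, which is why β is usually described as trading efficiency for robustness.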
Figure 4. Computational pipeline for patient classification through robust SAM
Simulated gene expression data generating model for k=2 groups
| Gene group | Normal patients (N1) | Cancer patients (N2) |
| g1 | -d + N(0, σ²) | d + N(0, σ²) |
| g2 | d + N(0, σ²) | -d + N(0, σ²) |
| g3 | d + N(0, σ²) | d + N(0, σ²) |
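The data-generating model in the table above can be sketched in a few lines of NumPy. The function name `simulate_ged`, the gene counts, and the default values of `d` and `sigma` are assumptions chosen for illustration; only the mean shifts per gene group come from the table.

```python
import numpy as np

def simulate_ged(n_genes=1000, n_de=100, n1=20, n2=20, d=1.0, sigma=1.0, seed=0):
    """Return (X, y, group) with X of shape genes x patients.

    y is 0 for normal and 1 for cancer patients; group holds 1, 2 or 3
    for the gene groups g1, g2 and g3 of the table.
    """
    rng = np.random.default_rng(seed)
    half = n_de // 2
    group = np.full(n_genes, 3)
    group[:half] = 1
    group[half:2 * half] = 2
    # Mean shifts per gene group, taken from the table: (normal, cancer).
    shift = {1: (-d, d), 2: (d, -d), 3: (d, d)}
    normal_mean = np.array([shift[g][0] for g in group])
    cancer_mean = np.array([shift[g][1] for g in group])
    # Gaussian noise N(0, sigma^2) plus the group-specific shift.
    X = rng.normal(0.0, sigma, size=(n_genes, n1 + n2))
    X[:, :n1] += normal_mean[:, None]
    X[:, n1:] += cancer_mean[:, None]
    y = np.concatenate([np.zeros(n1, dtype=int), np.ones(n2, dtype=int)])
    return X, y, group

X, y, group = simulate_ged()
print(X.shape, np.bincount(y), np.bincount(group)[1:])
```

Genes in g1 and g2 differ between the two patient groups by 2d in opposite directions, while g3 genes have the same mean in both groups and therefore act as non-DE genes.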
Figure 1. Performance evaluation using test ROC curves produced by the four classifiers for the simulated dataset with sample sizes n1 = n2 = 5. (a) In the absence of outliers. (b) In the presence of 5% outliers. (c) In the presence of 20% outliers. (d) In the presence of 35% outliers.
Performance evaluation using test AUC values estimated by the four classifiers for simulated dataset
Large-sample case (n1 = n2 = 20)
| Feature selection (FS) | No outliers: LDA | No outliers: KNN | No outliers: SVM | No outliers: naive Bayes | 5% outliers: LDA | 5% outliers: KNN | 5% outliers: SVM | 5% outliers: naive Bayes |
| t-test | 0.983 | 0.96 | 0.992 | 0.985 | 0.964 | 0.952 | 0.982 | 0.961 |
| SAM | 0.985 | 0.972 | 0.993 | 0.992 | 0.95 | 0.962 | 0.984 | 0.973 |
| robust SAM | 0.98 | 0.963 | 0.991 | 0.993 | 0.982 | 0.963 | 0.99 | 0.992 |
| Feature selection (FS) | 20% outliers: LDA | 20% outliers: KNN | 20% outliers: SVM | 20% outliers: naive Bayes | 35% outliers: LDA | 35% outliers: KNN | 35% outliers: SVM | 35% outliers: naive Bayes |
| t-test | 0.935 | 0.928 | 0.952 | 0.947 | 0.6 | 0.66 | 0.53 | 0.62 |
| SAM | 0.93 | 0.915 | 0.949 | 0.933 | 0.621 | 0.632 | 0.562 | 0.633 |
| robust SAM | 0.98 | 0.962 | 0.99 | 0.991 | 0.974 | 0.952 | 0.982 | 0.987 |
In this table, the test AUC values were estimated by the four classifiers (LDA, KNN, SVM and naive Bayes) based on the top 200 DE genes for the large-sample case (n1 = n2 = 20).
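The evaluation described in the footnote (rank genes on training data, keep the top 200, classify test patients, report the test AUC) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses a Welch t-statistic as the feature selector, a plain 3-nearest-neighbour vote as the classifier, and hypothetical simulation settings (`N_GENES`, `N_DE`, `D`, and the helper names are all invented here).

```python
import numpy as np

rng = np.random.default_rng(1)
N_GENES, N_DE, D = 1000, 100, 1.0  # hypothetical sizes and effect size

def draw(n_per_class):
    """Simulate genes x patients data; first half normal, second half cancer."""
    X = rng.normal(size=(N_GENES, 2 * n_per_class))
    half = N_DE // 2
    X[:half, :n_per_class] -= D        # g1: lower in normal
    X[:half, n_per_class:] += D        # g1: higher in cancer
    X[half:N_DE, :n_per_class] += D    # g2: higher in normal
    X[half:N_DE, n_per_class:] -= D    # g2: lower in cancer
    X[N_DE:] += D                      # g3: same shift in both groups
    return X, np.repeat([0, 1], n_per_class)

def welch_t(x_a, x_b):
    """Absolute Welch t-statistic per gene (rows are genes)."""
    se = np.sqrt(x_a.var(axis=1, ddof=1) / x_a.shape[1]
                 + x_b.var(axis=1, ddof=1) / x_b.shape[1])
    return np.abs(x_a.mean(axis=1) - x_b.mean(axis=1)) / se

def knn_scores(train, y_train, test, k=3):
    """Fraction of cancer labels among the k nearest training patients."""
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def auc(scores, y):
    """AUC as the share of (cancer, normal) pairs ranked correctly; ties count 0.5."""
    diff = scores[y == 1][:, None] - scores[y == 0][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

X_tr, y_tr = draw(20)
X_te, y_te = draw(20)
# Feature selection uses the training patients only.
t = welch_t(X_tr[:, y_tr == 0], X_tr[:, y_tr == 1])
top = np.argsort(t)[::-1][:200]
scores = knn_scores(X_tr[top].T, y_tr, X_te[top].T)
test_auc = auc(scores, y_te)
print("test AUC:", test_auc)
```

Selecting genes on the training patients only keeps the test AUC free of selection bias; swapping `welch_t` for another ranking statistic leaves the rest of the sketch unchanged.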
Figure 2. Comparison of the DE genes detected by the t-test, SAM and robust SAM for the colon cancer dataset. (a) Venn diagram of the DE genes detected by the t-test, SAM and robust SAM. (b) Heatmap of the 21 DE genes identified by robust SAM. (c) Test ROC curves produced by the four classifiers using the expression values of the 13, 8 and 21 DE genes identified by the t-test, SAM and robust SAM, respectively. (d) Boxplot of the AUC values estimated by the four classifiers using the t-test, SAM and robust SAM. 1000 trials were performed to obtain this result.
Performance evaluation using test AUC values for Colon cancer dataset
| Classifier | t-test | SAM | Robust SAM |
| LDA | 0.788 | 0.828 | 0.834 |
| KNN | 0.745 | 0.766 | 0.787 |
| SVM | 0.839 | 0.862 | 0.914 |
| Naive Bayes | 0.817 | 0.825 | 0.873 |
The AUC values were estimated using four classifiers (LDA, KNN, SVM and naive Bayes), based on the 13, 8 and 21 DE genes identified by the classical t-test, SAM and robust SAM, respectively.
Figure 3. Functional annotation of the 21 DE genes identified by robust SAM. Frequency distribution of biological process, cellular component and molecular function categories for 15 DE genes identified by robust SAM. Of the 21 DE genes, 15 were mapped to KEGG using the WebGestalt software.
KEGG pathways for 21 DE genes detected using robust SAM for Colon cancer dataset
| KEGG ID | Pathway name | No. of genes | Adjusted p-value |
| hsa03030 | DNA replication | 2 | 2.29E-01 |
| hsa00230 | Purine metabolism | 3 | 2.29E-01 |
| hsa00511 | Other glycan degradation | 1 | 9.70E-01 |
| hsa03430 | Mismatch repair | 1 | 9.70E-01 |
| hsa00062 | Fatty acid elongation | 1 | 9.70E-01 |
| hsa03410 | Base excision repair | 1 | 9.70E-01 |
| hsa05166 | HTLV-I infection | 2 | 9.70E-01 |
| hsa03440 | Homologous recombination | 1 | 9.70E-01 |
| hsa00071 | Fatty acid degradation | 1 | 9.70E-01 |
| hsa03420 | Nucleotide excision repair | 1 | 9.70E-01 |
KEGG terms reported as enriched among the 15 colon-cancer-related genes detected by robust SAM. The p-values were calculated using the hypergeometric test and then adjusted by the Benjamini-Hochberg method for multiple testing. Of the 21 genes, 15 were mapped using the KEGG map in the WebGestalt software.
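The two steps named in the footnote, a hypergeometric enrichment test followed by Benjamini-Hochberg adjustment, can be sketched with the standard library alone. The function names are hypothetical, and the background size and pathway sizes in the example are invented for illustration; the record does not report them, and WebGestalt's own implementation is not reproduced here.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k pathway genes among n selected genes,
    when K of the N background genes belong to the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical inputs: 15 mapped genes, a background of 6000 genes,
# and made-up pathway sizes paired with observed overlap counts.
pathways = {"pathway A": (2, 36), "pathway B": (3, 170), "pathway C": (1, 18)}
raw = [hypergeom_pvalue(k, 15, K, 6000) for k, K in pathways.values()]
for name, p, q in zip(pathways, raw, benjamini_hochberg(raw)):
    print(name, p, q)
```

Because the adjustment multiplies each p-value by the number of tests divided by its rank, several pathways can share the same adjusted value, as seen in the repeated entries of the table.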