Guodong Zhao, Yan Wu.
Abstract
Microarrays have recently become an important tool for profiling the global gene expression patterns of tissues. Gene selection is a popular technique for cancer classification: it aims to identify, from thousands of genes that may contribute to the occurrence of cancers, a small number of informative genes that yield high predictive accuracy. This technique has been extensively studied in recent years. This study develops a novel feature selection (FS) method for gene subset selection, called WLMGS, that uses the Weight Local Modularity (WLM) of a complex network. In the proposed method, the discriminative power of a gene subset is evaluated by the weight local modularity of a weighted sample graph built on that subset, in which the intra-class distance is small and the inter-class distance is large. A higher local modularity corresponds to greater discriminative power of the gene subset. With a forward search strategy, a more informative gene subset can be selected as a group for the classification process. Computational experiments show that the proposed algorithm can select a small subset of predictive genes as a group while preserving classification accuracy.
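The forward search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exact WLM formula is not given in this record, so `separation_score` is a simplified stand-in (per class, the share of edge weight that stays inside the class), and the edge weight `1 / (1 + distance)`, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def knn_graph(X, genes, k):
    """Weighted k-NN sample graph built from the given gene columns only.
    Edge weight 1 / (1 + Euclidean distance) is an assumption of this sketch."""
    def dist(i, j):
        return math.sqrt(sum((X[i][g] - X[j][g]) ** 2 for g in genes))
    edges = {}
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        for j in sorted(others, key=lambda j: dist(i, j))[:k]:
            edges[(min(i, j), max(i, j))] = 1.0 / (1.0 + dist(i, j))
    return edges

def separation_score(edges, y):
    """Stand-in for weight local modularity (NOT the paper's formula):
    for each class, intra-class edge weight divided by the weight of all
    edges touching that class, summed over classes."""
    score = 0.0
    for c in set(y):
        inside = sum(w for (i, j), w in edges.items() if y[i] == c and y[j] == c)
        touching = sum(w for (i, j), w in edges.items() if y[i] == c or y[j] == c)
        if touching > 0:
            score += inside / touching
    return score

def forward_select(X, y, k=2, max_genes=5):
    """Greedy forward search: add the gene that raises the score most,
    and stop as soon as no candidate improves it."""
    selected, best = [], float("-inf")
    while len(selected) < max_genes:
        candidates = [g for g in range(len(X[0])) if g not in selected]
        if not candidates:
            break
        top_score, top_gene = max(
            (separation_score(knn_graph(X, selected + [g], k), y), g)
            for g in candidates
        )
        if top_score <= best:
            break
        selected.append(top_gene)
        best = top_score
    return selected, best

# Hypothetical toy data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [[0.1, 5.0, 2.0], [0.2, 1.0, 9.0], [0.0, 3.0, 4.0],
     [5.1, 4.8, 2.2], [5.0, 1.2, 8.5], [5.2, 3.1, 4.4]]
y = [0, 0, 0, 1, 1, 1]
selected, best = forward_select(X, y)
print(selected, best)
```

In this toy example gene 0 alone already puts every nearest-neighbour edge inside a class, so the search selects it and then stops because no second gene can raise the score further.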
Year: 2016 PMID: 27703256 PMCID: PMC5050509 DOI: 10.1038/srep34759
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Description of the datasets.
| Number | Data set | Genes | Samples | Classes |
|---|---|---|---|---|
| 1 | ALL-AML-3C | 7129 | 72 | 3 |
| 2 | DLBCL_A | 661 | 141 | 3 |
| 3 | SRBCT | 2308 | 83 | 4 |
| 4 | MLL | 12582 | 72 | 3 |
| 5 | CNS | 7129 | 60 | 2 |
| 6 | Lymphoma | 4026 | 66 | 3 |
| 7 | Colon | 2000 | 62 | 2 |
| 8 | Lung | 12600 | 203 | 5 |
Comparisons of the best results between WLMGS and others with 1NN classifier.
| Classifiers | |||||||
|---|---|---|---|---|---|---|---|
| Dataset | |||||||
| ALL-AML-3C | |#G| | 15 | 28 | ||||
| 94.46 | 98.61 | ||||||
| DLBCL_A | |#G| | 22 | 14 | 3 | 22 | 29.35 | |
| 95.71 | 90.91 | 85.76 | 93 | 93.91 | |||
| SRBCT | |#G| | 12 | 29 | 16 | 28 | 25 | 4 |
| 98.75 | 100 | 100 | |||||
| MLL | |#G| | 27 | 16 | 2 | 7 | 27 | |
| 98.57 | 95.89 | 94.46 | 100 | ||||
| CNS | |#G| | 9 | 3 | 30 | 3 | 12 | |
| 73.66 | 78 | 85.01 | 84.12 | 84.5 | |||
| Lymphoma | |#G| | 5 | 5 | 8 | 8 | 21 | |
| 98.33 | |||||||
| Colon | |#G| | 12 | 19 | 17 | 4 | 13 | 17 |
| 90.23 | 90.23 | 90.71 | 75.95 | 90.76 | |||
| Lung | |#G| | 28 | 13 | 26 | 30 | 27 | 9 |
| 93.61 | 94.59 | 95.11 | 93.04 | 95.15 | |||
Note: |#G|: average number of genes; ACC: average classification accuracy (%); T: average time (s) in selected 15 genes.
Comparisons of the best results between WLMGS and others with SVM classifier.
| Classifiers | SVM | ||||||
|---|---|---|---|---|---|---|---|
| Dataset | mRMR | MIFS_U | CMIM | Relief | CMQFS | WLMGS | |
| ALL-AML-3C | |#G| | 3 | 11 | 33 | 27 | 4 | |
| k = 7 | ACC | 98.57 | 98.57 | 97.32 | 96.07 | 98.51 | |
| DLBCL_A | |#G| | 16 | 16 | 31 | 30 | 29 | |
| k = 5 | ACC | 98.66 | 91.47 | 97.23 | 93.67 | 95.81 | |
| SRBCT | |#G| | 12 | 13 | 16 | 28 | 7 | |
| k = 9 | ACC | 95.27 | 100 | 100 | |||
| MLL | |#G| | 16 | 16 | 17 | 20 | 26 | |
| k = 7 | ACC | 98.75 | 85.17 | 98.61 | |||
| CNS | |#G| | 8 | 25 | 7 | 30 | 26 | 5 |
| k = 11 | ACC | 75 | 76.67 | 85 | 75.12 | 82.16 | 90 |
| Lymphoma | |#G| | 5 | 5 | 17 | 28 | 22 | |
| k = 7 | ACC | 92.38 | 99.85 | ||||
| Colon | |#G| | 11 | 22 | 16 | 25 | 23 | 19 |
| k = 9 | ACC | 89.28 | 91.67 | 81.74 | 80.71 | 87.95 | 93.33 |
| Lung | |#G| | 11 | 14 | 15 | 15 | 15 | 10 |
| k = 5 | ACC | 94.54 | 94.59 | 95.52 | 81.33 | 92.71 | |
Note: |#G|: average number of genes; ACC: average classification accuracy (%); T: average time (s) in selected 15 genes.
Figure 1. The average classification accuracy using the 1NN classifier with respect to the subset of s features selected by different filter methods.
Panels show the classification accuracy of the different methods on (a) MLL, (b) Lymphoma, (c) ALL-AML-3C, (d) DLBCL_A, (e) SRBCT, (f) CNS, (g) Lung, and (h) Colon.
Figure 2. The average classification accuracy using the SVM classifier with respect to the subset of s features selected by different filter methods.
Panels show the classification accuracy of the different methods on (a) ALL-AML-3C, (b) MLL, (c) Lymphoma, (d) Lung, (e) DLBCL_A, (f) Colon, (g) CNS, and (h) SRBCT.
p-Values comparing WLMGS with the other methods on ACC and |#G| with the 1NN classifier.
| methods | ACC-p | |#G|-p |
|---|---|---|
| WLMGS vs. mRMR | 0.028 | 0.021 |
| WLMGS vs. MIFS_U | 0.013 | 0.019 |
| WLMGS vs. CMIM | 0.004 | 0.013 |
| WLMGS vs. Relief | 0.000 | 0.000 |
| WLMGS vs. CMQFS | 0.000 | 0.000 |
p-Values comparing WLMGS with the other methods on ACC and |#G| with the SVM classifier.
| methods | ACC-p | |#G|-p |
|---|---|---|
| WLMGS vs. mRMR | 0.044 | 0.046 |
| WLMGS vs. MIFS_U | 0.029 | 0.014 |
| WLMGS vs. CMIM | 0.037 | 0.004 |
| WLMGS vs. Relief | 0.002 | 0.000 |
| WLMGS vs. CMQFS | 0.000 | 0.000 |
Figure 3. The average classification accuracy using the 1NN classifier with respect to the subset of s features selected by different wrapper methods.
Panels show the classification accuracy of the different methods on (a) ALL-AML-3C, (b) CNS, (c) Colon, (d) DLBCL_A, (e) Lung, (f) Lymphoma, (g) MLL, and (h) SRBCT.
Figure 4. The average classification accuracy using the SVM classifier with respect to the subset of s features selected by different wrapper methods.
Panels show the classification accuracy of the different methods on (a) CNS, (b) Colon, (c) DLBCL_A, (d) Lung, (e) Lymphoma, (f) MLL, (g) ALL-AML-3C, and (h) SRBCT.
Figure 5. The average time cost for the top 20 genes selected by our method and by the wrapper methods.
Enrichment analysis results (DAVID annotation clusters) for the top ten genes selected by WLMGS.
| Dataset | Annotation Cluster | Enrichment Score |
|---|---|---|
| ALL-AML_3c | GO:0002521~leukocyte differentiation, GO:0030097~hemopoiesis, GO:0048534~hemopoietic or lymphoid organ development, GO:0002520~immune system development | 2.07 |
| CNS | GO:0005261~cation channel activity, GO:0046873~metal ion transmembrane transporter activity, GO:0005216~ion channel activity, GO:0022838~substrate specific channel activity, GO:0015267~channel activity, GO:0022803~passive transmembrane transporter activity, GO:0030001~metal ion transport, GO:0006812~cation transport, GO:0006811~ion transport, SP_PIR_KEYWORDS~disease mutation, UP_SEQ_FEATURE~sequence variant, SP_PIR_KEYWORDS~polymorphism | 2.31 |
| MLL | GO:0030528~transcription regulator activity, GO:0006350~transcription, GO:0045449~regulation of transcription, SP_PIR_KEYWORDS~Transcription | 0.59 |
| Lung | GO:0005615~extracellular-space, GO:0044421~extracellular region part, GO:0005576~extracellular region | 1.78 |
DAVID enrichment scores for the top ten genes selected by the different methods.
| ALL-AML_3c | CNS | MLL | Lung | |
|---|---|---|---|---|
| 2.19 | 1.91 | 0.59 | 0.65 | |
| 0.33 | 0.65 | 0.31 | ||
| 0.36 | 0.41 | 0.2 | ||
| 0.15 | 0.21 | 0.23 | 0.15 | |
| 1.79 | 2.05 | 0.55 | 1.54 | |
| 2.07 | 0.59 |
The larger the enrichment score, the more enriched the gene subset.
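For orientation, DAVID's functional annotation clustering reports the enrichment score of a cluster as the geometric mean of its member terms' p-values on a −log10 scale. The sketch below shows that arithmetic only; the p-values are hypothetical and are not taken from the analysis reported here.

```python
import math

def cluster_enrichment_score(p_values):
    """Geometric mean of the member p-values, expressed as -log10.
    Equivalent to the arithmetic mean of the individual -log10(p) values."""
    logs = [-math.log10(p) for p in p_values]
    return sum(logs) / len(logs)

# Hypothetical p-values for the terms of one annotation cluster.
hypothetical_p = [0.01, 0.005, 0.02]
score = cluster_enrichment_score(hypothetical_p)
print(round(score, 2))
```

On this scale a score of about 1.3 corresponds to a geometric-mean p-value of 0.05, which gives a rough reference point when reading the scores in the tables.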
Figure 6. The average 1NN accuracy of our method for different values of k on all datasets.
Panels show the classification accuracy for different k on (a) ALL-AML-3C, (b) SRBCT, (c) Lymphoma, (d) DLBCL_A, (e) CNS, (f) Colon, (g) MLL, and (h) Lung.
The increment of WLM with selected genes.
| Dataset | Increment of WLM | AN_B |
|---|---|---|
| ALL-AML-3C | 0.72, 0.32, 0.24, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 | 4.35 |
| DLBCL_A | 0.52, 0.17, 0.27, 0.16, 0.14, 0.11, 0.07, 0.09, 0.02, 0.02, 0.00, 0.00, 0.00, 0.00 | 9.73 |
| SRBCT | 0.83, 0.82, 0.40, 0.14, 0.12, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 | 7.56 |
| MLL | 0.45, 0.30, 0.11, 0.03, 0.00, −0.00, −0.00, 0.01, 0.04, 0.01, 0.00, 0.00, 0.00, 0.00 | 4.72 |
| CNS | 0.13, 0.17, 0.12, 0.14, 0.07, 0.08, 0.05, 0.09, 0.05, 0.01, 0.02, 0.02, 0.01, 0.00 | 7.23 |
| Lymphoma | 0.58, 0.14, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 | 3.00 |
| Colon | 0.13, 0.06, 0.15, 0.13, 0.13, 0.07, 0.05, 0.05, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00 | 12.73 |
| Lung | 1.03, 0.63, 0.34, 0.22, 0.30, 0.20, 0.06, 0.08, 0.04, 0.03, 0.02, 0.00, 0.00, 0.00 | 13.29 |
Note: AN_B: the average number of selected genes while the best result is obtained.
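One way to read this table is to count how many genes are added before the increment of WLM falls to zero and to compare that count with AN_B. The sketch below does this for three rows copied from the table; the threshold `eps` (half of the last reported decimal place) and the helper name are assumptions, and the record does not state that the method itself uses this stopping rule.

```python
# Increments of WLM copied from three rows of the table above.
increments = {
    "ALL-AML-3C": [0.72, 0.32, 0.24, 0.05] + [0.00] * 10,
    "SRBCT": [0.83, 0.82, 0.40, 0.14, 0.12, 0.02] + [0.00] * 8,
    "Lymphoma": [0.58, 0.14, 0.01] + [0.00] * 11,
}
# AN_B values from the same rows.
an_b = {"ALL-AML-3C": 4.35, "SRBCT": 7.56, "Lymphoma": 3.00}

def genes_before_plateau(incs, eps=0.005):
    """Count the leading increments that are still distinguishable from zero
    at the two-decimal precision used in the table (eps is an assumption)."""
    count = 0
    for delta in incs:
        if delta <= eps:
            break
        count += 1
    return count

plateau = {name: genes_before_plateau(incs) for name, incs in increments.items()}
for name in increments:
    print(name, plateau[name], an_b[name])
```

For these rows the count of non-zero increments is close to AN_B, which is consistent with the best result being reached roughly when additional genes stop increasing WLM.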
Figure 7. A simple graph with three local communities, enclosed by the dashed circles.
Reprinted figure with permission from ref. 43.