Monalisa Mandal, Anirban Mukhopadhyay.
Abstract
The purpose of feature selection is to identify the relevant and non-redundant features in a dataset. In this article, the feature selection problem is formulated as a graph-theoretic problem in which a feature-dissimilarity graph is built from the data matrix. The nodes represent features and the edges represent their dissimilarity. Nodes are weighted by feature relevance and edges by the dissimilarity between the features they connect. The problem of finding relevant and non-redundant features is then mapped to the densest-subgraph finding problem. We propose a multiobjective particle swarm optimization (PSO)-based algorithm that simultaneously optimizes the average node weight and the average edge weight of the candidate subgraph. The proposed algorithm is applied to identify relevant and non-redundant disease-related genes from microarray gene expression data. Its performance is compared with that of several existing feature selection techniques on real-life microarray gene expression datasets.
Year: 2014 PMID: 24625895 PMCID: PMC3953335 DOI: 10.1371/journal.pone.0090949
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Construction of the Feature-dissimilarity Graph.
From the data matrix, the relevance vector and the dissimilarity matrix are computed first; a weighted complete feature-dissimilarity graph is then constructed from them. An example graph with 5 features is depicted.
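The construction described in the Figure 1 caption, together with the two objectives named in the abstract, can be sketched as follows. This is only an illustration under stated assumptions: this record does not give the relevance or dissimilarity formulas, so the sketch assumes relevance is the absolute signal-to-noise ratio between two classes and dissimilarity is one minus the absolute Pearson correlation. The function names `build_feature_graph` and `subgraph_objectives` are hypothetical.

```python
import numpy as np

def build_feature_graph(X, y):
    """Return (relevance vector, dissimilarity matrix) for a samples x features matrix.

    ASSUMPTIONS (not taken from the source record):
      - relevance     = absolute signal-to-noise ratio between classes 0 and 1
      - dissimilarity = 1 - |Pearson correlation| between feature columns
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    eps = 1e-12  # guards against division by zero for constant features
    relevance = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (
        a.std(axis=0) + b.std(axis=0) + eps
    )
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation
    dissimilarity = 1.0 - np.abs(corr)
    np.fill_diagonal(dissimilarity, 0.0)  # complete graph, no self-loops
    return relevance, dissimilarity

def subgraph_objectives(relevance, dissimilarity, nodes):
    """Average node weight and average edge weight of the induced subgraph."""
    nodes = np.asarray(nodes)
    avg_node = float(relevance[nodes].mean())
    if len(nodes) < 2:
        return avg_node, 0.0
    sub = dissimilarity[np.ix_(nodes, nodes)]
    upper = np.triu_indices(len(nodes), k=1)  # count each edge once
    return avg_node, float(sub[upper].mean())
```

A candidate feature subset that scores high on both values is relevant (heavy nodes) and non-redundant (heavy edges), which is the sense in which the abstract speaks of a dense subgraph.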
Algorithm 1: Graph-based MObPSO (Minimization Problem).
| Steps 1 to 34: statements missing from this record |
| 35: CrowdingSort(A) |
| 36: Steps 6 to 33 are repeated for the specified number of iterations |
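Step 35 calls CrowdingSort(A), which is not defined in this record. The sketch below assumes that A is an archive of non-dominated objective vectors and that the sort follows the crowding-distance ordering commonly used in multiobjective optimizers; it should not be read as the authors' exact procedure. All function names are hypothetical, and every objective is treated as minimized, in line with the algorithm title (a maximized objective can be negated first).

```python
import numpy as np

def dominates(p, q):
    """True if p dominates q when every objective is minimized."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return bool(np.all(p <= q) and np.any(p < q))

def non_dominated(points):
    """Indices of points that no other point dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

def crowding_distance(points):
    """Crowding distance of each point; boundary points get infinity."""
    pts = np.asarray(points, dtype=float)
    n, m = pts.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(pts[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf
        span = pts[order[-1], k] - pts[order[0], k]
        if n < 3 or span == 0:
            continue
        # gap between the two neighbours of each interior point, normalized
        dist[order[1:-1]] += (pts[order[2:], k] - pts[order[:-2], k]) / span
    return dist

def crowding_sort(points):
    """Indices ordered from least crowded to most crowded."""
    d = crowding_distance(points)
    return [int(i) for i in np.argsort(-d, kind="stable")]
```

Sorting an archive this way puts solutions from sparse regions of the front first, so that a leader drawn from the top of the list tends to spread the swarm along the front.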
Performance Analysis for Three Real-life Data Sets.
| Dataset | Algorithm | Sensitivity (sd) | Specificity (sd) | Accuracy (sd) | Fscore (sd) | AUC | Average correlation | Time (sec) |
| Prostate Cancer dataset | Proposed | 0.8962 (0.0575) | 0.9 (0.0909) | 0.898 (0.047) | 0.9002 (0.0434) | 0.964 | 0.4714 | 81.176 |
| | Singleobjective (correlation) | 0.8221 (0.0768) | 0.855 (0.1024) | 0.8382 (0.06) | 0.8382 (0.0582) | 0.913 | 0.6343 | 45.39 |
| | Singleobjective (SNR) | 0.8701 (0.0352) | 0.865 (0.0563) | 0.8676 (0.0174) | 0.8704 (0.0142) | 0.9219 | 0.6646 | 53.72 |
| | T-test | 0.7778 (0.1462) | 0.8244 (0.0615) | 0.8497 (0.074) | 0.8336 (0.1052) | 0.7817 | 0.4434 | 27.199 |
| | Ranksum test | 0.8547 (0.0976) | 0.8375 (0.0825) | 0.8768 (0.0399) | 0.8522 (0.0317) | 0.8311 | 0.5177 | 23.267 |
| | SFS | 0.7393 (0.1272) | 0.7864 (0.2313) | 0.7950 (0.1047) | 0.7126 (0.0847) | 0.7308 | 0.5514 | 21.382 |
| | SBE | 0.78 (0.1882) | 0.7116 (0.2839) | 0.763 (0.179) | 0.7233 (0.0787) | 0.7701 | 0.612 | 46.113 |
| | CFS | 0.9131 (0.061) | 0.9001 (0.0672) | 0.9112 (0.0561) | 0.9211 (0.0373) | 0.9215 | 0.4993 | 73.9 |
| | mRMR(miq) | 0.9176 (0.0783) | 0.8686 (0.0552) | 0.8936 (0.0564) | 0.8970 (0.0574) | 0.9484 | 0.579 | 64.83 |
| | Graph-based | 0.8646 (0.061) | 0.96 (0.197) | 0.9216 (0.048) | 0.92 (0.528) | 0.9362 | 0.4773 | 39.172 |
| | Cluster-based | 0.8077 (0.0593) | 0.9211 (0.093) | 0.8431 (0.0514) | 0.84 (0.0512) | 0.92 | 0.4815 | 23.18 |
| DLBCL dataset | Proposed | 0.9111 (0.1021) | 0.9207 (0.0564) | 0.9184 (0.0315) | 0.8428 (0.0513) | 0.9644 | 0.6128 | 113.81 |
| | Singleobjective (correlation) | 0.6389 (0.24) | 0.8966 (0.0922) | 0.8355 (0.07) | 0.639 (0.148) | 0.8167 | – | 63.11 |
| | Singleobjective (SNR) | 0.8333 (0.0197) | 0.8707 (0.1219) | 0.8618 (0.0767) | 0.7434 (0.1082) | 0.9214 | 0.6369 | 79.443 |
| | T-test | 0.7284 (0.3148) | 0.9119 (0.0881) | 0.8486 (0.1036) | 0.7052 (0.268) | 0.6672 | 0.6667 | 43.78 |
| | Ranksum test | 0.7654 (0.2747) | 0.8945 (0.0622) | 0.8621 (0.0668) | 0.7327 (0.1991) | 0.4252 | 0.778 | 51.9 |
| | SFS | 0.5714 (0.2437) | 0.7783 (0.1318) | 0.7293 (0.0921) | 0.5997 (0.1558) | 0.4760 | 0.6767 | 87.76 |
| | SBE | 0.6814 (0.2773) | 0.7000 (0.218) | 0.6744 (0.201) | 0.6119 (0.2056) | 0.60 | 0.7011 | 96.221 |
| | CFS | 0.5556 (0.0233) | 0.9355 (0.0354) | 0.8684 (0.0424) | 0.6667 (0.021) | 0.9308 | 0.5101 | 51.482 |
| | mRMR(miq) | 0.8889 (0.0641) | 0.9163 (0.0337) | 0.9098 (0.0298) | 0.8244 (0.0544) | 0.9568 | 0.639 | 80.11 |
| | Graph-based | 0.7889 (0.0367) | 1 (0.0493) | 0.9337 (0.0291) | 0.8402 (0.0552) | 0.8462 | 0.4413 | 43.91 |
| | Cluster-based | 0.8779 (0.0339) | 0.8966 (0.0601) | 0.8947 (0.0441) | 0.8 (0.052) | 0.9637 | 0.5815 | 56.225 |
| Child-ALL dataset | Proposed | 0.752 (0.0648) | 0.8233 (0.1055) | 0.7909 (0.053) | 0.7671 (0.0501) | 0.8639 | 0.7324 | 81.3 |
| | Singleobjective (correlation) | 0.71 (0.0763) | 0.8042 (0.0844) | 0.7614 (0.0382) | 0.7295 (0.0414) | 0.743 | – | 66.96 |
| | Singleobjective (SNR) | 0.64 (0.0676) | 0.8442 (0.1296) | 0.7568 (0.0701) | 0.7079 (0.0692) | 0.8073 | 0.7854 | 71.014 |
| | T-test | 0.4960 (0.1943) | 0.68 (0.3719) | 0.5964 (0.1596) | 0.5184 (0.1728) | 0.8253 | 0.9014 | 21.681 |
| | Ranksum test | 0.4640 (0.2296) | 0.87 (0.2755) | 0.6855 (0.1108) | 0.5506 (0.1901) | 0.8114 | 0.9223 | 23.575 |
| | SFS | 0.46 (0.1908) | 0.8556 (0.1089) | 0.6758 (0.0656) | 0.5402 (0.1766) | 0.84 | 0.7416 | 63.014 |
| | SBE | 0.6878 (0.2108) | 0.6173 (0.29) | 0.5889 (0.0197) | 0.6202 (0.2689) | 0.84 | 0.7655 | 71.224 |
| | CFS | 0.6400 (0.191) | 0.9133 (0.1189) | 0.789 (0.06114) | 0.7442 (0.0677) | 0.8427 | 0.6313 | 76.44 |
| | mRMR(miq) | 0.7486 (0.0380) | 0.8762 (0.0600) | 0.7782 (0.0315) | 0.7896 (0.0313) | 0.8802 | 0.741 | 69.886 |
| | Graph-based | 0.44 (0.055) | 1 (0.21) | 0.7455 (0.0671) | 0.6111 (0.0519) | 0.9267 | 0.7813 | 52.13 |
| | Cluster-based | 0.749 (0.061) | 0.8164 (0.093) | 0.7818 (0.0409) | 0.7917 (0.0572) | 0.8133 | 0.7399 | 59.45 |
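For reference, the sensitivity, specificity, accuracy and F-score columns follow the standard confusion-matrix definitions, sketched below. The function name `classification_metrics` and the example counts are hypothetical; which class the paper treats as positive for each dataset is not stated in this record.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    precision = ratio(tp, tp + fp)
    fscore = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "fscore": fscore,
    }

# Hypothetical counts, for illustration only
print(classification_metrics(tp=8, fp=1, tn=9, fn=2))
```

Reading sensitivity and specificity together matters for tables like the one above: a method can reach a specificity of 1 while its sensitivity stays low, and accuracy alone hides that imbalance.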
10-fold Cross-validation Result Analysis for Three Real-life Data Sets.
| Dataset | Algorithm | Sensitivity | Specificity | Accuracy | Fscore | AUC | Average correlation | Time (sec) |
| Prostate Cancer dataset | Proposed | 0.9423 | 0.9515 | 0.9412 | 0.9423 | 0.9624 | 0.4136 | 4.1278 |
| | Singleobjective | 0.9038 | 0.82 | 0.8627 | 0.8704 | 0.9505 | 0.4770 | 1.6487 |
| | Singleobjective | 0.8846 | 0.94 | 0.9118 | 0.9109 | 0.9415 | 0.4743 | 1.643 |
| | T-test | 0.9142 | 0.9419 | 0.9336 | 0.9314 | 0.9554 | 0.5367 | 1.184 |
| | Ranksum test | 0.8846 | 0.9344 | 0.9216 | 0.92 | 0.9485 | 0.5203 | 1.094 |
| | SFS | 0.819 | 0.8741 | 0.91 | 0.8901 | 0.8531 | 0.476 | 1.031 |
| | SBE | 0.7951 | 0.8146 | 0.8863 | 0.8359 | 0.8322 | 0.4993 | 1.8351 |
| | CFS | 0.9231 | 0.96 | 0.951 | 0.9401 | 0.9835 | 0.4211 | 1.8989 |
| | mRMR(miq) | 0.9387 | 0.93 | 0.9351 | 0.9412 | 0.9665 | 0.3929 | 1.362 |
| | Graph-based | 0.9231 | 0.94 | 0.9314 | 0.932 | 0.967 | 0.4743 | 2.189 |
| | Cluster-based | 0.9038 | 0.94 | 0.9216 | 0.9216 | 0.96 | 0.5245 | 3.0718 |
| DLBCL | Proposed | 1 | 0.9483 | 0.961 | 0.9268 | 0.9955 | 0.4658 | 3.6832 |
| | Singleobjective | 0.8421 | 0.9377 | 0.9221 | 0.8421 | 0.9737 | 0.4926 | 2.5195 |
| | Singleobjective | 0.6263 | 0.9655 | 0.9551 | 0.6452 | 0.961 | 0.5692 | 2.4434 |
| | T-test | 0.8153 | 0.965 | 0.9366 | 0.8649 | 0.9837 | 0.5412 | 1.233 |
| | Ranksum test | 0.8944 | 0.951 | 0.949 | 0.9231 | 0.9946 | 0.5152 | 1.391 |
| | SFS | 0.7222 | 0.898 | 0.8318 | 0.834 | 0.868 | 0.5016 | 1.05 |
| | SBE | 0.6319 | 0.9133 | 0.8001 | 0.7822 | 0.823 | 0.4772 | 1.7312 |
| | CFS | 0.9474 | 0.9518 | 0.949 | 0.9191 | 0.9894 | 0.4408 | 2.47 |
| | mRMR(miq) | 0.9474 | 0.9432 | 0.9487 | 0.9231 | 0.9809 | 0.4545 | 1.898 |
| | Graph-based | 0.8947 | 0.9601 | 0.9481 | 0.8947 | 0.9827 | 0.5538 | 2.43 |
| | Cluster-based | 0.7105 | 0.9383 | 0.7662 | 0.8077 | 0.8818 | 0.5517 | 3.3227 |
| Child-ALL | Proposed | 0.8813 | 0.719 | 0.8733 | 0.7352 | 0.8801 | 0.6764 | 3.6605 |
| | Singleobjective | 1 | 0.0167 | 0.4638 | 0.6289 | 0.796 | 0.7576 | 3.22 |
| | Singleobjective | 0.989 | 0.1701 | 0.4545 | 0.625 | 0.8133 | 0.6926 | 2.92 |
| | T-test | 0.82 | 0.4333 | 0.6091 | 0.656 | 0.8427 | 0.73 | 2.73 |
| | Ranksum test | 0.80 | 0.45 | 0.6091 | 0.6504 | 0.8060 | 0.7136 | 2.899 |
| | SFS | 0.9667 | 0.2961 | 0.4358 | 0.7111 | 0.828 | 0.6936 | 2.39 |
| | SBE | 1 | 0.017 | 0.4913 | 0.6535 | 0.783 | 0.725 | 2.873 |
| | CFS | 0.9411 | 0.0677 | 0.4833 | 0.6144 | 0.8403 | 0.5989 | 3.3486 |
| | mRMR(miq) | 0.7 | 0.4167 | 0.5455 | 0.5833 | 0.692 | 0.6774 | 2.968 |
| | Graph-based | 0.6909 | 0.7333 | 0.7455 | 0.7143 | 0.7297 | 0.6786 | 3.43 |
| | Cluster-based | 0.94 | 0.1851 | 0.4818 | 0.6395 | 0.7927 | 0.6817 | 3.6357 |
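The 10-fold protocol behind these results can be sketched as a plain index split. This is a generic illustration, not the authors' code: the record does not say whether folds were stratified or how samples were shuffled, so the sketch assumes a single random permutation and unstratified folds. The name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    ASSUMPTION: samples are shuffled once and split into k nearly equal,
    unstratified folds; each fold serves as the test set exactly once.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)  # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Usage: count the folds and check that every sample is tested once
splits = list(k_fold_indices(25, k=10))
tested = np.sort(np.concatenate([test for _, test in splits]))
print(len(splits), bool(np.array_equal(tested, np.arange(25))))
```

To keep the reported scores unbiased, feature selection would normally be rerun inside each training fold rather than once on the full data; whether the paper did so is not stated here.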
Gene Markers Identified by the Proposed Method for Various Datasets.
| Data set | Gene ID | Symbol | Description | Up or Down |
| Prostate Cancer | | HPN | hepsin | up |
| | | CRYAB | crystallin, alpha B | up |
| | | CLDN3 | claudin 3 | up |
| | | MAF | v-maf musculoaponeurotic fibrosarcoma oncogene homolog | up |
| | | SLC25A6 | solute carrier family 25, member 6 | down |
| | | RPL18A, RPL18AP3 | ribosomal protein L18a, L18a pseudogene 3 | down |
| DLBCL | | LDHA | lactate dehydrogenase | down |
| | | ENO1 | enolase 1 (alpha) | down |
| | | FH | fumarate hydratase, mitochondrial precursor | down |
| Child-ALL | | SLC9A3R2 | solute carrier family 9, isoform 3 regulator 2 | down |
| | | BNIP1 | BCL2/adenovirus E1B 19 kDa interacting protein 1 | down |
| | | UGT2B15 | UDP glucuronosyltransferase 2 family, polypeptide B15 | down |
| | | PARP2 | poly (ADP-ribose) polymerase 2 | down |
| | | EIF5AL1, EIF5A | eukaryotic translation initiation factor 5A-like 1 and 5A | down |
Figure 2. Heatmap of the gene markers for the Prostate Cancer data.
The heatmap shows the expression levels of the four up-regulated and two down-regulated gene markers for normal and cancerous samples in the Prostate Cancer data.
Figure 3. Heatmap of the gene markers for the DLBCL data.
The heatmap shows the expression levels of the three down-regulated gene markers for DLBCL and FL samples in the DLBCL data.
Figure 4. Heatmap of the gene markers for the Child-ALL data.
The heatmap shows the expression levels of the five down-regulated gene markers for samples taken before and after therapy in the Child-ALL data.