Zejun Li1,2, Bo Liao3, Lijun Cai1, Min Chen1,2, Wenhua Liu2.
Abstract
In the present study, we introduce a novel semi-supervised method called the semi-supervised maximum discriminative local margin (semiMM) for gene selection in expression data. The semiMM is a "filter" approach that exploits local structure, variance, and mutual information. We first construct a local nearest neighbour graph and divide it into within-class and between-class local nearest neighbour graphs by weighting the edges between pairs of data points. The semiMM aims to discover the most discriminative features for classification by maximizing the local margin between the within-class and between-class data, the variance of all data, and the mutual information of features with class labels. Experiments on five publicly available gene expression datasets demonstrate the effectiveness of the proposed method compared with three state-of-the-art feature selection algorithms.
Year: 2018 PMID: 29872069 PMCID: PMC5988834 DOI: 10.1038/s41598-018-26806-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
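The abstract names three ingredients of the filter score: a local margin between between-class and within-class neighbour graphs, the variance of each feature, and the mutual information between each feature and the class labels. The following is a minimal sketch of how such a score could be assembled, not the authors' implementation. The exact objective, edge weights, and trade-off parameters are not given in this record, so the function names, the weights `alpha` and `beta`, the neighbourhood size `k`, the quantile binning used for mutual information, the choice to count edges touching unlabelled samples as within-class, and the convention that unlabelled samples carry the label `-1` are all assumptions made for illustration.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric boolean k-nearest-neighbour graph (Euclidean distance)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros(d2.shape, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nearest.ravel()] = True
    return A | A.T

def mean_edge_spread(X, W):
    """Per feature: mean of (x_i - x_j)^2 over the edges of W (0 if no edges).

    Uses sum_ij W_ij (f_i - f_j)^2 = 2 f^T L f with L the graph Laplacian.
    """
    W = W.astype(float)
    total = W.sum()
    if total == 0:
        return np.zeros(X.shape[1])
    L = np.diag(W.sum(axis=1)) - W
    return 2.0 * np.einsum("ij,ij->j", X, L @ X) / total

def binned_mutual_info(f, y, bins=3):
    """Mutual information (nats) between a quantile-binned feature and labels."""
    edges = np.quantile(f, np.linspace(0, 1, bins + 1)[1:-1])
    b = np.digitize(f, edges)
    classes, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, len(classes)))
    np.add.at(joint, (b, yi), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def score_features(X, y, k=5, alpha=1.0, beta=1.0, bins=3):
    """Hypothetical semiMM-style filter score; higher means more useful.

    X: (n_samples, n_features); y: integer labels, -1 for unlabelled samples.
    """
    A = knn_adjacency(X, k)
    labeled = y >= 0
    both = labeled[:, None] & labeled[None, :]
    same = both & (y[:, None] == y[None, :])
    within = A & (same | ~both)   # assumption: unlabelled edges count as within
    between = A & both & ~same
    margin = mean_edge_spread(X, between) - mean_edge_spread(X, within)
    variance = X.var(axis=0)
    mi = np.array([binned_mutual_info(X[labeled, j], y[labeled], bins)
                   for j in range(X.shape[1])])
    return margin + alpha * variance + beta * mi
```

Ranking genes then amounts to sorting the returned scores in descending order and keeping the top entries, for example `np.argsort(-score_features(X, y))[:150]`.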
Dataset descriptions, including the sample number, gene dimension and class number.
| Dataset | Samples | Genes (dimensions) | Classes |
|---|---|---|---|
| DLBCL | 77 | 5469 | 2 |
| Prostate_Tumor | 102 | 10509 | 2 |
| Leukemia2 | 72 | 11225 | 3 |
| SRBCT | 83 | 2308 | 4 |
| Lung_Cancer | 203 | 12600 | 5 |
Figure 1. Performance comparison of average prediction accuracy on the binary-classification gene expression dataset Prostate.
Figure 2. Performance comparison of average prediction accuracy on the binary-classification gene expression dataset DLBCL.
Figure 3. Performance comparison of average prediction accuracy on the multi-class gene expression dataset Leukemia2.
Figure 5. Performance comparison of average prediction accuracy on the multi-class gene expression dataset Lung.
Comparison of mean evaluation metrics of the DLBCL dataset with the top 150 selected genes by varying the value of L.
| Method | Acc (labelNum = 2/4/6) | Precision (labelNum = 2/4/6) | Recall (labelNum = 2/4/6) | F-score (labelNum = 2/4/6) | AUC (labelNum = 2/4/6) |
|---|---|---|---|---|---|
| semiMM | 0.9281 | 0.8269/0.8989/0.8865 | 0.8630/ | 0.9818/0.9844/0.9833 | |
| LSDF | 0.9219/0.9344/0.9406 | 0.8436/0.8903/0.8856 | 0.8625/0.8500/0.8875 | 0.8487/0.8634/0.8797 | 0.9568/0.9802/0.9813 |
| Fisher | 0.9094/ | 0.8340/ | 0.8250/0.9000/0.9250 | 0.8193/ | 0.9573/ |
| Laplacian |
Comparison of mean evaluation metrics of the Lung dataset with the top 150 selected genes by varying the value of L.
| Method | Acc | Precision (labelNum = 2/4/6) | Recall (labelNum = 2/4/6) | F-score (labelNum = 2/4/6) | AUC |
|---|---|---|---|---|---|
| semiMM | 0.9660 | 0.9326 | 0.8561 | 0.8811 | 0.9878/0.9898/0.9898 |
| LSDF | 0.9157 | 0.6887 | 0.5888/0.8027/0.8044 | 0.6221/0.8463/0.8454 | 0.8898/0.9607/0.9623 |
| Fisher | | | | | |
| Laplacian | 0.9670 | 0.9260 | 0.8303/0.8449/0.8761 | 0.8611/0.8722/0.8977 | 0.9832/0.9833/0.9848 |
Comparison of the mean evaluation metrics of the Prostate dataset with the top 150 selected genes by varying the value of L.
| Method | Acc | Precision (labelNum = 2/4/6) | Recall (labelNum = 2/4/6) | F-score (labelNum = 2/4/6) | AUC |
|---|---|---|---|---|---|
| semiMM | 0.8512 | 0.8394/0.8849 | 0.8600 | 0.8486/ | 0.9052 |
| LSDF | 0.7780/0.7780 | 0.7671/0.7671 | 0.8000/0.8000 | 0.7799/0.7799 | 0.8407/0.8407 |
| Fisher | | | | | |
| Laplacian | 0.8415/0.8244 | 0.8426/0.8313 | 0.8300/0.8050 | 0.8358/0.8155 | 0.8926/0.8981 |
Comparison of mean evaluation metrics in the Leukemia2 dataset with the top 150 selected genes by varying the value of L.
| Method | Acc | Precision (labelNum = 2/4/6) | Recall (labelNum = 2/4/6) | F-score (labelNum = 2/4/6) | AUC |
|---|---|---|---|---|---|
| semiMM | 0.9511/0.9556 | 0.9467/0.9440 | 0.9028/0.9169 | 0.9201/0.9271 | 0.9774/0.9885 |
| LSDF |
| 0.9974/0.9989 | |||
| Fisher | 0.9478/0.9733 | 0.9514/0.9670 | 0.8894/0.9414 | 0.9164/0.9527 | 0.9795/0.9939 |
| Laplacian | 0.8533/0.8644 | 0.7544/0.7731 | 0.7728/0.7950 | 0.7570/0.7760 | 0.9125/0.9159 |
Comparison of mean evaluation metrics of the SRBCT dataset with the top 15 selected genes by varying the value of L.
| Method | Acc | Precision (labelNum = 2/4/6) | Recall (labelNum = 2/4/6) | F-score (labelNum = 2/4/6) | AUC |
|---|---|---|---|---|---|
| semiMM | 0.9879 | 0.9939 | 0.9654/ | 0.9785 | 0.9995/ |
| LSDF | 0.9593/0.9850/0.9921 | 0.9657/0.9864/0.9914 | 0.8854/0.9654/0.9833 | 0.9208/0.9746/0.9865 | 0.9863/0.9980/0.9998 |
| Fisher | | | | | |
| Laplacian | 0.9886/0.9850/0.9786 | 0.9942/0.9952/0.9816 | 0.9706/0.9556/0.9435 | 0.9808/0.9736/0.9600 | 0.9993/0.9978/0.9971 |
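The tables above report accuracy, precision, recall and F-score for both binary and multi-class datasets. This record does not state how the per-class values were averaged, so the sketch below assumes macro-averaging (an unweighted mean over classes) purely for illustration; the function name `macro_metrics` is hypothetical, and AUC is left out because it requires classifier scores rather than hard predictions.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F-score.

    Macro-averaging is an assumption of this sketch, not taken from the source.
    Labels are expected to be integers in the range 0..n_classes-1.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)
    actual = cm.sum(axis=1)
    # Classes with no predictions (or no true samples) contribute 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(tp), where=denom > 0)
    return {
        "acc": float(tp.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "fscore": float(fscore.mean()),
    }
```

Averaging such a dictionary over repeated random splits would give mean values in the same form as the table entries.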