Abstract
Identifying discriminative features in information-rich data for clinical diagnosis is crucial in biomedical science. Many machine-learning techniques have been widely applied to this task and have achieved remarkable results. However, disease, especially cancer, is often caused by a group of features with complex interactions. Unlike traditional feature selection methods, which focus only on finding single discriminative features, the multilayer feature subset selection method (MLFSSM) proposed herein employs randomized search and a multilayer structure to select a discriminative subset. In each layer of this method, many feature subsets are generated to ensure the diversity of the combinations, and the weights of features are evaluated from the performance of the subsets. The weight of a feature increases if, compared with other features on the current layer, it is selected into more subsets with better performance. In this manner, the feature weights are revised layer by layer, their precision is steadily improved, and better subsets are repeatedly constructed from the features with higher weights. Finally, the topmost feature subset of the last layer is returned. Experimental results on five public gene datasets showed that the subsets selected by MLFSSM were more discriminative than those of traditional feature selection methods, including LVW (a feature subset selection method that uses the Las Vegas method as its randomized search strategy), GAANN (a feature subset selection method based on a genetic algorithm (GA)), and support vector machine recursive feature elimination (SVM-RFE). Furthermore, MLFSSM showed higher classification performance than some state-of-the-art methods that select feature pairs or groups, including top scoring pair (TSP), k-top scoring pairs (K-TSP), and relative simplicity-based direct classifier (RS-DC).
Year: 2019 PMID: 31828154 PMCID: PMC6885241 DOI: 10.1155/2019/9864213
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Algorithm 1: Description of the MLFSSM algorithm.
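
The abstract describes the procedure only in words, so the following is a minimal Python sketch of that layer-by-layer loop, not the authors' implementation. Several details are assumptions made for illustration: the `evaluate` callback (for example, a cross-validated accuracy), the layer and subset counts, the initial weight of 0.5, scoring a feature by the mean accuracy of the subsets that contain it, using the power number as a sampling exponent, and using the weight ratio to blend old and new weights. All function and parameter names are hypothetical.

```python
import heapq
import math
import random
from typing import Callable, List, Sequence


def weighted_sample(scores: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Draw k distinct indices, favouring larger scores.

    Each index gets an exponential "arrival time" with rate equal to its
    score; the k earliest arrivals form a weighted sample without replacement.
    """
    keys = [(-math.log(1.0 - rng.random()) / s, i) for i, s in enumerate(scores)]
    return [i for _, i in heapq.nsmallest(k, keys)]


def mlfssm_sketch(
    n_features: int,
    evaluate: Callable[[Sequence[int]], float],
    n_layers: int = 5,          # assumption: not stated in the abstract
    n_subsets: int = 50,        # assumption: not stated in the abstract
    subset_len: int = 21,
    weight_ratio: float = 0.2,
    power: int = 32,
    seed: int = 0,
) -> List[int]:
    rng = random.Random(seed)
    weights = [0.5] * n_features  # assumption: neutral starting weight
    best: List[int] = []
    for _ in range(n_layers):
        # Assumption: the power number sharpens the sampling distribution.
        top = max(weights)
        scores = [max((w / top) ** power, 1e-12) for w in weights]
        subsets = [weighted_sample(scores, subset_len, rng) for _ in range(n_subsets)]
        accs = [evaluate(s) for s in subsets]

        # Score each sampled feature by the mean accuracy of its subsets.
        total = [0.0] * n_features
        count = [0] * n_features
        for subset, acc in zip(subsets, accs):
            for i in subset:
                total[i] += acc
                count[i] += 1
        means = {i: total[i] / count[i] for i in range(n_features) if count[i]}
        lo, hi = min(means.values()), max(means.values())
        if hi > lo:
            for i, m in means.items():
                layer_score = (m - lo) / (hi - lo)  # rescale to [0, 1]
                # Assumption: weight_ratio is the share kept from the old weight.
                weights[i] = weight_ratio * weights[i] + (1 - weight_ratio) * layer_score
        # Keep the top-scoring subset of the current (eventually last) layer.
        best = subsets[max(range(n_subsets), key=accs.__getitem__)]
    return best


def toy_evaluate(subset: Sequence[int]) -> float:
    """Toy stand-in for a classifier: features 0, 1, 2 are the informative ones."""
    return 0.5 + sum(1 for i in subset if i < 3) / 6


selected = mlfssm_sketch(30, toy_evaluate, n_layers=4, n_subsets=40, subset_len=5, seed=1)
```

Features that are not sampled in a layer keep their previous weight in this sketch, since the layer offers no evidence about them; other choices are equally plausible.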
Details of five public datasets for comparison.
| No. | Dataset | Feature number | Sample number |
|---|---|---|---|
| 1 | Breast [ | 7129 | 49 |
| 2 | ColonCancer [ | 2000 | 60 |
| 3 | CNS [ | 7129 | 60 |
| 4 | Hepato [ | 7129 | 60 |
| 5 | Leukemia [ | 7129 | 72 |
Summary of the parameter setting.
| Parameters | Default values | Range |
|---|---|---|
| Weight ratio | 0.2 | 0.1, 0.2, 0.4, 0.6, 0.8, 0.9 |
| Power number | 32 | 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 |
| Feature subset length | 21 | 1, 5, 11, 15, 21, 25, 31, 35, 41, 45, 51 |
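
The settings in this table can be written down as a small configuration. The sketch below enumerates them while varying one parameter at a time and holding the others at their defaults; that sweep design is an assumption suggested by the per-parameter figures, and the dictionary keys are hypothetical names, not identifiers from the paper.

```python
# Default values and ranges copied from the parameter table.
DEFAULTS = {"weight_ratio": 0.2, "power": 32, "subset_len": 21}
RANGES = {
    "weight_ratio": [0.1, 0.2, 0.4, 0.6, 0.8, 0.9],
    "power": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    "subset_len": [1, 5, 11, 15, 21, 25, 31, 35, 41, 45, 51],
}


def one_at_a_time(defaults, ranges):
    """Yield (name, settings) with one parameter varied and the rest at default."""
    for name, values in ranges.items():
        for value in values:
            yield name, {**defaults, name: value}


configs = list(one_at_a_time(DEFAULTS, RANGES))
```

Each yielded `settings` dictionary could be passed to whatever training routine is under study; the sweep covers 6 + 10 + 11 = 27 settings in total.
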
Figure 1: Effects of weight ratio α.
Figure 2: Effects of power number p.
Figure 3: Effects of feature subset length ls.
Figure 4: Comparison of LVW, imp-LVW, and MLFSSM.
Comparison of the accuracies of GAANN_RP and MLFSSM.
| Method | Average accuracy |
|---|---|
| Fuzzy_GA | 0.9736 |
| GAANN_RP | 0.9829 |
| MLFSSM | |
Comparison of the average accuracy rates of MLFSSM with four feature selection methods.
| Method | Breast | Leukemia | Colon | Hepato | CNS |
|---|---|---|---|---|---|
| SVM-RFE | 0.877 | 0.967 | 0.835 | 0.658 | 0.693 |
| LS-bound | 0.778 | 0.935 | 0.817 | 0.618 | 0.61 |
| Bayes + KNN | 0.821 | 0.92 | 0.828 | 0.628 | 0.628 |
| EN-LR | 0.854 | 0.962 | 0.837 | 0.683 | 0.664 |
| GRRF | 0.846 | 0.939 | 0.834 | 0.67 | 0.618 |
| T-SS | 0.893 | | 0.871 | 0.693 | 0.655 |
| MLFSSM | | 0.969 | | | |
Comparison of the average accuracy rates of MLFSSM with four feature selection methods based on groups.
| Method | Breast | Leukemia | Colon | Hepato | CNS |
|---|---|---|---|---|---|
| TSP | 0.783 | 0.900 | 0.891 | 0.602 | 0.496 |
| K-TSP ( | 0.870 | | | 0.657 | 0.517 |
| RS-DC | 0.868 | 0.944 | 0.896 | 0.604 | 0.597 |
| MLFSSM | | | 0.961 | | |
Details of the ten most frequently selected genes of the CNS dataset.
| No. | Gene accession number | Gene description | Official symbol | Gene ID | Biological pathway |
|---|---|---|---|---|---|
| 1 | M13149_at | HRG histidine-rich glycoprotein | HRG | 3273 | Dissolution of fibrin clot |
| 2 | S75989_at | Gamma-aminobutyric acid transporter type 3 (human, fetal brain, mRNA, 1991 nt) | — | — | — |
| 3 | HG2987-HT3136_s_at | Vasoactive intestinal peptide | VIP | 7432 | Glucagon-type ligand receptors |
| 4 | M63959_at | LRPAP1 low density lipoprotein-related protein-associated protein 1 (alpha-2-macroglobulin receptor-associated protein 1) | LRPAP1 | 4043 | Reelin signaling pathway |
| 5 | Z73677_at | Gene encoding plakophilin 1b | PKP1 | 5317 | Apoptotic cleavage of cell adhesion proteins |
| 6 | D79986_at | KIAA0164 gene | — | — | — |
| 7 | M23575_f_at | PSG11 pregnancy-specific beta-1 glycoprotein 11 | PSG11 | 5680 | — |
| 8 | U37139_at | Beta 3-endonexin mRNA, long form and short form | — | — | — |
| 9 | D28235_s_at | Cyclooxygenase-2 (hCox-2) gene | PTGS2 | 5743 | COX reactions |
| 10 | HG2271-HT2367_at | Profilaggrin | — | — | — |
Figure 5: Gene interaction diagram. Note: the size of each node is related to its degree in the graph.
Figure 6: Weight values of the top and bottom 5-ranked genes on CNS with increasing layer.
Figure 7: Frequencies of the accuracy rates of features on CNS with increasing layer.