Yi Zhang, Chris Ding, Tao Li.
Abstract
BACKGROUND: Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates biological samples of different types. In this paper, we present a two-stage selection algorithm that combines ReliefF and mRMR: in the first stage, ReliefF is applied to find a candidate gene set; in the second stage, the mRMR method is applied to directly and explicitly reduce redundancy, selecting a compact yet effective gene subset from the candidate set.
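The two-stage procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: expression values are assumed to be already discretized, the ReliefF weights are assumed to be precomputed (the `relief_weights` values below are made up), mRMR is taken in its mutual-information-difference form (relevance minus mean redundancy), and the names `mutual_information` and `two_stage_select` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[u] * pb[v])) for (u, v), c in pab.items())


def two_stage_select(X, y, relief_weights, n_candidates, n_select):
    """Stage 1: keep the n_candidates genes with the largest ReliefF weights.
    Stage 2: greedy mRMR (relevance minus mean redundancy) on those candidates.

    X is a list of samples; each sample is a list of discretized expression levels.
    """
    n_genes = len(X[0])
    columns = [[row[g] for row in X] for g in range(n_genes)]

    # Stage 1: candidate set taken from the (precomputed) ReliefF ranking.
    candidates = sorted(range(n_genes), key=lambda g: -relief_weights[g])[:n_candidates]

    # Stage 2: repeatedly add the candidate with the best mRMR score.
    relevance = {g: mutual_information(columns[g], y) for g in candidates}
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < n_select:
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = sum(
                mutual_information(columns[g], columns[s]) for s in selected
            ) / len(selected)
            return relevance[g] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): 8 samples, 4 genes, 2 classes.
y = [0, 0, 0, 0, 1, 1, 1, 1]
gene0 = [0, 0, 0, 1, 1, 1, 1, 1]  # informative
gene1 = list(gene0)               # exact duplicate of gene0 (redundant)
gene2 = [0, 1, 0, 0, 1, 1, 1, 0]  # weaker, but adds different information
gene3 = [0, 1, 0, 1, 0, 1, 0, 1]  # uninformative
X = [list(row) for row in zip(gene0, gene1, gene2, gene3)]
relief_weights = [0.9, 0.85, 0.5, 0.1]  # hypothetical stage-1 output

selected = two_stage_select(X, y, relief_weights, n_candidates=3, n_select=2)
print(selected)  # [0, 2]
```

In this toy data, gene 1 duplicates gene 0, so ranking by ReliefF weight alone would return genes 0 and 1, whereas the mRMR stage replaces the duplicate with gene 2.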
Year: 2008 PMID: 18831793 PMCID: PMC2559892 DOI: 10.1186/1471-2164-9-S2-S27
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
The dataset description.
| Dataset | # Samples | # Genes | # Classes |
| ALL | 248 | 12558 | 6 |
| ARR | 420 | 278 | 2 |
| GCM | 198 | 16063 | 14 |
| HBC | 22 | 3226 | 3 |
| LYM | 62 | 4026 | 3 |
| MLL | 72 | 12582 | 3 |
| NCI60 | 60 | 1123 | 9 |
Figure 1. Comparison of ReliefF and mRMR-ReliefF algorithms I. This figure shows the classification results of two classifiers (SVM and NB) using 3 to 60 selected genes on the HBC, Lymphoma, MLL, and NCI60 datasets. For the same number of selected genes, mRMR-ReliefF generally performs better than ReliefF.
Figure 2. Comparison of ReliefF and mRMR-ReliefF algorithms II. This figure shows the classification results of two classifiers (SVM and NB) using 3 to 60 selected genes on the GCM, ALL, and ARR datasets. For the same number of selected genes, mRMR-ReliefF generally performs better than ReliefF.
Comparison of the ReliefF, mRMR and mRMR-ReliefF algorithms (gene number = 30).
| Method | Classifier | ALL | ARR | LYM | HBC | NCI60 | MLL | GCM |
| ReliefF | SVM | 96.37% | 79.29% | 100% | 95.45% | 58.33% | 94.44% | 55.25% |
| ReliefF | Naive Bayes | 92.34% | 75% | 95.16% | 90.91% | 53.33% | 91.67% | 55.56% |
| mRMR | SVM | - | 75.35% | 100% | 95.45% | 53.33% | - | - |
| mRMR | Naive Bayes | - | 73.21% | 97.33% | 87.51% | 51.20% | - | - |
| mRMR-ReliefF | SVM | 96.77% | 81.43% | 100% | 95.45% | 68.33% | 98.61% | 64.65% |
| mRMR-ReliefF | Naive Bayes | 95.97% | 79.05% | 100% | 95.45% | 61.67% | 98.61% | 61.11% |
Comparison of seven gene selection methods (gene number = 30).
| Method | Classifier | ALL | ARR | LYM | HBC | NCI60 | MLL | GCM |
| No feature selection | SVM | 91.94% | 51.04% | 95.16% | 77.27% | 63.33% | 97.22% | 51.52% |
| No feature selection | Naive Bayes | 85.23% | 49.57% | 95.04% | 70.11% | 45.22% | 93.13% | 40.33% |
| mRMR-ReliefF | SVM | 96.77% | 81.43% | 100% | 95.45% | 68.33% | 98.61% | 64.65% |
| mRMR-ReliefF | Naive Bayes | 95.97% | 79.05% | 100% | 95.45% | 61.67% | 98.61% | 61.11% |
| Maxrel | SVM | 89.11% | 74.53% | 100% | 72.73% | 51.67% | 77.78% | 60.61% |
| Maxrel | Naive Bayes | 88.71% | 73.49% | 100% | 63.64% | 48.33% | 80.56% | 46.97% |
| Information Gain | SVM | 97.58% | 80.13% | 98.39% | 100% | 61.67% | 98.67% | 46.67% |
| Information Gain | Naive Bayes | 92.74% | 77.21% | 93.55% | 86.38% | 60% | 97.22% | 47.47% |
| Sum Minority | SVM | 93.95% | 76.42% | 98.39% | 95.45% | 55% | 90.28% | 55.05% |
| Sum Minority | Naive Bayes | 91.13% | 74.32% | 95.16% | 81.82% | 46.67% | 91.67% | 49.49% |
| Twoing Rule | SVM | 96.77% | 79.37% | 98.39% | 90.91% | 61.67% | 97.22% | 45.96% |
| Twoing Rule | Naive Bayes | 90.32% | 72.19% | 93.55% | 86.36% | 45% | 95.83% | 46.46% |
| F-statistic | SVM | 97.17% | 67.12% | 96.77% | 90.91% | 63.33% | 77.22% | 39.10% |
| F-statistic | Naive Bayes | 80.27% | 71.55% | 98.52% | 85.41% | 60.15% | 80.13% | 39.81% |
| GSNR | SVM | 93.18% | 77.24% | 100% | 95.45% | 63.37% | 90.25% | 40.74% |
| GSNR | Naive Bayes | 90.11% | 70.43% | 100% | 85.65% | 58.25% | 87.22% | 39.81% |
This table shows the classification results based on 30 genes selected from each of the 7 datasets by seven feature selection methods (mRMR-ReliefF, Maxrel, information gain, sum minority, twoing rule, F-statistic, and GSNR), together with a baseline that uses no feature selection.
Figure 3. Description of the ReliefF algorithm.
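For readers without access to the figure, the general shape of ReliefF can be sketched in Python. This is a simplified illustration, not a transcription of the figure: it assumes numeric features, Manhattan distance on range-scaled values, and uses every sample as a reference instance instead of drawing a random subset. The function name `relieff` and the toy data are hypothetical.

```python
def relieff(X, y, k=1):
    """Simplified ReliefF: returns one weight per feature (larger = more relevant)."""
    n, d = len(X), len(X[0])

    # Per-feature range, used to scale differences into [0, 1].
    span = []
    for f in range(d):
        col = [row[f] for row in X]
        span.append((max(col) - min(col)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(d))

    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}
    weights = [0.0] * d

    for i in range(n):  # every sample serves as the reference instance
        # k nearest neighbours of sample i within each class (excluding i itself).
        nearest = {}
        for c in classes:
            idx = [j for j in range(n) if j != i and y[j] == c]
            idx.sort(key=lambda j: distance(X[i], X[j]))
            nearest[c] = idx[:k]

        for f in range(d):
            # Nearest hits: differing from same-class neighbours lowers the weight.
            hits = nearest[y[i]]
            if hits:
                weights[f] -= sum(diff(f, X[i], X[j]) for j in hits) / (len(hits) * n)
            # Nearest misses: differing from other-class neighbours raises it,
            # with each class weighted by its renormalized prior.
            for c in classes:
                misses = nearest[c]
                if c == y[i] or not misses:
                    continue
                scale = prior[c] / (1.0 - prior[y[i]])
                total = sum(diff(f, X[i], X[j]) for j in misses)
                weights[f] += scale * total / (len(misses) * n)
    return weights


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9], [0.90, 0.4], [0.80, 0.8], [0.95, 0.2]]
y = [0, 0, 0, 1, 1, 1]
weights = relieff(X, y, k=1)
print(weights)
```

On this toy input the class-separating feature receives the larger weight, which is the ranking signal the first stage of the method relies on.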
MATLAB Command List For Gene Selection.
| ReliefF | |
| F-statistic | |
| GSNR | |
| ReliefF-mRMR | |
| Rankgene | |
This table lists the MATLAB commands for the feature selection algorithms: ReliefF, F-statistic, GSNR, ReliefF-mRMR, and all algorithms included in Rankgene.
Figure 4. The data structure description for the software package. X is the gene expression array with 62 samples and 4026 genes (expression variables). y is the class label for each sample.
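The layout can be mirrored in a few lines of Python. This is only a sketch: it assumes one row per sample and one column per gene, which matches the LYM row of the dataset table (62 samples, 4026 genes, 3 classes), and it fills the arrays with placeholder values, not real measurements. The helper `check_layout` is hypothetical.

```python
N_SAMPLES, N_GENES = 62, 4026  # LYM dimensions from the dataset table

# Placeholder values only: zeros stand in for real expression measurements.
X = [[0.0] * N_GENES for _ in range(N_SAMPLES)]
# Placeholder labels: class ids 0..2 assigned cyclically, one per sample.
y = [i % 3 for i in range(N_SAMPLES)]


def check_layout(X, y):
    """Return (n_samples, n_genes) after checking X is rectangular and aligned with y."""
    n_genes = len(X[0])
    if any(len(row) != n_genes for row in X):
        raise ValueError("every sample must have the same number of genes")
    if len(X) != len(y):
        raise ValueError("need exactly one label per sample")
    return len(X), n_genes


shape = check_layout(X, y)
print(shape)  # (62, 4026)
```

Checking the shape up front catches the most common mistake with expression matrices, which is passing the array transposed (genes as rows).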