Pengyi Yang, Bing B Zhou, Zili Zhang, Albert Y Zomaya.
Abstract
BACKGROUND: Feature selection techniques are critical to the analysis of high-dimensional datasets. This is especially true for gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. Beyond the essential objectives of reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists focus on the selected genes to further validate their biological hypotheses.
Year: 2010 PMID: 20122224 PMCID: PMC3009522 DOI: 10.1186/1471-2105-11-S1-S5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Different types of feature selection algorithms. (a) Filter approach. (b) Wrapper approach. (c) Embedded approach.
Figure 2. The flow chart of the MF-GE hybrid system for gene selection and classification of microarrays.
Figure 3. An example of the multiple filter score mapping strategy for evaluation information fusion.
Microarray datasets for evaluation
| Name | Leukemia | Colon | Liver | MLL |
|---|---|---|---|---|
| # Sample | 72 | 62 | 157 | 72 |
| # Gene | 7129 | 2000 | 20983 | 12582 |
| # Class | 2 | 2 | 2 | 3 |
| C1 | ALL: 47 | TUM: 40 | HCC: 82 | ALL: 24 |
| C2 | AML: 25 | NOR: 22 | NON: 75 | MLL: 20 |
| C3 | | | | AML: 28 |
Genetic ensemble settings
| Parameter | Value |
|---|---|
| Fitness Function | Multi-Objective |
| Iteration | 100 |
| Population Size | 100 |
| Niche | 2 |
| Chromosome Size | 15 |
| Termination | Multiple Conditions |
| Selection | Tournament Selection (3) |
| Crossover | Single Point (0.7) |
| Mutation | Multi-Point (0.1 & 0.25) |
| Contribution Weight |
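The selection and crossover rows of the settings table can be illustrated with a minimal sketch. It assumes that "Tournament Selection (3)" means a tournament of three contenders and that "Single Point (0.7)" means a crossover probability of 0.7. The chromosome encoding (gene indices drawn from a 200-gene pool), the random placeholder fitness, and all names such as `GASettings` are hypothetical and not taken from the paper. The multi-point mutation and the multi-objective fitness are left out because the table does not say how they are applied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GASettings:
    # Values copied from the settings table; field names are illustrative.
    iterations: int = 100
    population_size: int = 100
    chromosome_size: int = 15
    tournament_size: int = 3
    crossover_rate: float = 0.7


def tournament_select(population, fitness, k, rng):
    """Return the fittest of k chromosomes drawn at random without replacement."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def single_point_crossover(parent_a, parent_b, rate, rng):
    """With probability `rate`, swap the tails of two parents at one cut point."""
    if rng.random() >= rate:
        return list(parent_a), list(parent_b)
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    child_a = list(parent_a[:cut]) + list(parent_b[cut:])
    child_b = list(parent_b[:cut]) + list(parent_a[cut:])
    return child_a, child_b


settings = GASettings()
rng = random.Random(0)
# Assumed encoding: each chromosome is a list of gene indices from a 200-gene pool.
population = [
    [rng.randrange(200) for _ in range(settings.chromosome_size)]
    for _ in range(settings.population_size)
]
# Placeholder fitness values; the paper's multi-objective fitness is not reproduced.
fitness = [rng.random() for _ in population]

mother = tournament_select(population, fitness, settings.tournament_size, rng)
father = tournament_select(population, fitness, settings.tournament_size, rng)
child_a, child_b = single_point_crossover(mother, father, settings.crossover_rate, rng)
```

A larger tournament size raises selection pressure, so a small value such as 3 keeps weaker chromosomes in play for longer.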
Classification comparison of different gene ranking algorithms using Leukemia dataset
| Dataset | Classifier | Gain Ratio | GA/KNN | GE | MF-GE |
|---|---|---|---|---|---|
| Leukemia | C4.5 | 87.41 | 78.55 ± 2.96 | 83.04 ± 1.56 | 84.51 ± 2.53 |
| | Random Forests | 92.59 | 91.75 ± 0.99 | 90.82 ± 1.87 | 92.35 ± 0.70 |
| | 3-Nearest Neighbor | 91.16 | 93.74 ± 1.27 | 94.30 ± 1.73 | 95.48 ± 0.95 |
| | 7-Nearest Neighbor | 83.10 | 89.43 ± 1.10 | 90.45 ± 2.04 | 90.86 ± 1.26 |
| | Naive Bayes | 92.78 | 90.28 ± 1.33 | 96.20 ± 0.93 | 96.27 ± 1.65 |
| | Mean | 89.41 | 88.75 | 90.69 | 91.89 |
| | Majority Voting | 92.45 | 93.29 ± 1.29 | 95.33 ± 0.96 | 96.23 ± 1.26 |
Classification comparison of different gene ranking algorithms using Colon dataset
| Dataset | Classifier | Gain Ratio | GA/KNN | GE | MF-GE |
|---|---|---|---|---|---|
| Colon | C4.5 | 71.49 | 62.43 ± 2.78 | 73.08 ± 2.77 | 76.64 ± 1.53 |
| | Random Forests | 63.66 | 73.48 ± 2.09 | 71.86 ± 2.02 | 74.35 ± 2.01 |
| | 3-Nearest Neighbor | 68.02 | 73.83 ± 1.57 | 75.43 ± 0.92 | 77.01 ± 2.09 |
| | 7-Nearest Neighbor | 65.43 | 67.62 ± 1.45 | 68.39 ± 1.76 | 68.78 ± 2.32 |
| | Naive Bayes | 70.61 | 72.12 ± 1.68 | 76.46 ± 2.14 | 75.07 ± 2.38 |
| | Mean | 68.84 | 69.90 | 73.04 | 74.37 |
| | Majority Voting | 70.56 | 73.37 ± 1.84 | 75.81 ± 2.00 | 76.98 ± 1.06 |
Classification comparison of different gene ranking algorithms using Liver dataset
| Dataset | Classifier | Gain Ratio | GA/KNN | GE | MF-GE |
|---|---|---|---|---|---|
| Liver | C4.5 | 84.88 | 88.33 ± 0.94 | 87.09 ± 0.79 | 88.19 ± 0.56 |
| | Random Forests | 89.65 | 90.31 ± 1.11 | 91.87 ± 0.94 | 93.13 ± 1.18 |
| | 3-Nearest Neighbor | 87.76 | 90.46 ± 0.65 | 93.57 ± 0.57 | 93.39 ± 0.79 |
| | 7-Nearest Neighbor | 87.65 | 89.53 ± 0.56 | 91.91 ± 0.69 | 92.54 ± 0.57 |
| | Naive Bayes | 89.05 | 90.85 ± 0.51 | 92.70 ± 0.67 | 93.63 ± 0.64 |
| | Mean | 87.80 | 89.90 | 91.43 | 92.18 |
| | Majority Voting | 89.02 | 91.60 ± 0.36 | 93.37 ± 0.46 | 93.80 ± 0.47 |
Classification comparison of different gene ranking algorithms using MLL dataset
| Dataset | Classifier | Gain Ratio | GA/KNN | GE | MF-GE |
|---|---|---|---|---|---|
| MLL | C4.5 | 81.87 | 72.89 ± 2.08 | 78.27 ± 3.10 | 81.54 ± 1.67 |
| | Random Forests | 83.02 | 88.07 ± 1.05 | 88.20 ± 1.41 | 89.74 ± 0.60 |
| | 3-Nearest Neighbor | 79.63 | 88.22 ± 1.30 | 86.18 ± 1.39 | 88.14 ± 1.09 |
| | 7-Nearest Neighbor | 79.63 | 86.72 ± 1.03 | 85.02 ± 1.49 | 86.69 ± 1.98 |
| | Naive Bayes | 83.95 | 89.62 ± 0.67 | 90.68 ± 1.28 | 91.50 ± 0.67 |
| | Mean | 81.62 | 85.10 | 85.67 | 87.52 |
| | Majority Voting | 83.88 | 88.38 ± 0.97 | 89.02 ± 1.71 | 91.08 ± 0.96 |
Figure 4. Sample classification. Comparison of the average classification and the majority voting classification of the five classifiers with different gene selection methods in each microarray dataset.
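The difference between the "Mean" and "Majority Voting" rows of the classification tables can be sketched as follows. This is a minimal illustration on toy labels, not the paper's evaluation code. The tie-breaking rule (the first label among the tied votes wins) is an assumption, and it only matters when votes can split evenly, as in the three-class MLL data.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list per classifier, each giving that classifier's
    label for every sample in the same order. Ties go to the label that
    appears first among the tied votes (an assumed rule).
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(predicted, actual):
    """Percentage of samples whose predicted label matches the true label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


# Toy labels for four samples and five classifiers; not data from the paper.
truth = ["ALL", "AML", "ALL", "AML"]
votes_per_classifier = [
    ["ALL", "AML", "AML", "AML"],
    ["ALL", "ALL", "ALL", "AML"],
    ["ALL", "AML", "ALL", "ALL"],
    ["AML", "AML", "ALL", "AML"],
    ["ALL", "AML", "AML", "AML"],
]
ensemble = majority_vote(votes_per_classifier)
mean_individual = sum(accuracy(p, truth) for p in votes_per_classifier) / 5
```

In this toy case each classifier errs on a different sample, so the vote corrects every individual mistake while the mean of the individual accuracies stays lower.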
Figure 5. Multi-filter scores of the 200 genes pre-filtered by BSS/WSS.
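The BSS/WSS pre-filter can be sketched with the commonly used definition of the ratio: for each gene, the between-class sum of squares divided by the within-class sum of squares. Whether the paper applies exactly this form is not stated in this excerpt, so treat the code as an assumption. The function names and the gene-by-sample list layout are hypothetical, and the `keep=200` default only mirrors the 200 genes mentioned in the caption.

```python
def bss_wss(values, labels):
    """BSS/WSS ratio for one gene.

    values: expression levels of the gene, one per sample.
    labels: class label of each sample, in the same order.
    """
    overall_mean = sum(values) / len(values)
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)

    bss = 0.0  # between-class sum of squares
    wss = 0.0  # within-class sum of squares
    for members in by_class.values():
        class_mean = sum(members) / len(members)
        bss += len(members) * (class_mean - overall_mean) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)

    if wss == 0.0:
        # No within-class spread: perfectly separating if the class means differ.
        return float("inf") if bss > 0.0 else 0.0
    return bss / wss


def top_genes(matrix, labels, keep=200):
    """Indices of the `keep` genes with the largest BSS/WSS ratio.

    matrix: list of per-gene rows, each row holding one value per sample.
    """
    scores = [bss_wss(row, labels) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda g: scores[g], reverse=True)
    return ranked[:keep]


# Toy data: gene 1 separates the two classes, gene 0 does not.
toy_labels = ["A", "A", "B", "B"]
toy_matrix = [
    [5.0, 6.0, 5.0, 6.0],
    [1.0, 2.0, 9.0, 10.0],
]
selected = top_genes(toy_matrix, toy_labels, keep=1)
```

A high ratio means the class means are far apart relative to the spread inside each class, which is why such genes are kept as candidates.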
Generation of convergence and subset size for each dataset using MF-GE and GE
| Dataset | Comparison Criterion | MF-GE | GE | P-value* |
|---|---|---|---|---|
| Leukemia | Average Generation of Convergence | 21.2 | 23.4 | 1 × 10⁻² |
| | Average Subset Size | 4.7 | 5.4 | 4 × 10⁻³ |
| Colon | Average Generation of Convergence | 25.5 | 27.1 | 5 × 10⁻² |
| | Average Subset Size | 6.0 | 6.6 | 3 × 10⁻³ |
| Liver | Average Generation of Convergence | 27.1 | 27.4 | 1 × 10⁻¹ |
| | Average Subset Size | 7.2 | 7.7 | 1 × 10⁻³ |
| MLL | Average Generation of Convergence | 25.0 | 26.1 | 8 × 10⁻² |
| | Average Subset Size | 6.8 | 7.2 | 3 × 10⁻² |

*P-values are calculated using a one-tailed Student's t-test.
Figure 6. Average gene subset size selected by GE and MF-GE with each microarray dataset.
Figure 7. Average generation of convergence of GE and MF-GE with each microarray dataset.
Top 5 genes with the highest selection frequency of each microarray data
| Dataset | Accession Number | Gene Description |
|---|---|---|
| Leukemia | X95735_at | Zyxin |
| | M31523_at | TCF3 Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47) |
| | Y07604_at | Nucleoside-diphosphate kinase |
| | M92287_at | CCND3 Cyclin D3 |
| | M27891_at | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) |
| Colon | Hsa.549 | P03001 TRANSCRIPTION FACTOR IIIA |
| | Hsa.3016 | S-100P PROTEIN (HUMAN) |
| | Hsa.8147 | Human desmin gene, complete cds |
| | Hsa.36689 | H. sapiens mRNA for GCAP-II/uroguanylin precursor |
| | Hsa.6814 | COLLAGEN ALPHA 2(XI) CHAIN (Homo sapiens) |
| Liver | AA232837 | Plasmalemma vesicle associated protein (PLVAP) |
| | AA464192 | PDZ domain containing 11 (PDZD11) |
| | AA486817 | Shisa homolog 5 (Xenopus laevis) (SHISA5) |
| | R43576 | Basic leucine zipper nuclear factor 1 (BLZF1) |
| | H62781 | Ficolin (collagen/fibrinogen domain containing lectin) 2 (hucolin) (FCN2) |
| MLL | 33412_at | vicpro2.D07.r Homo sapiens cDNA, 5' end |
| | 1389_at | Human common acute lymphoblastic leukemia antigen (CALLA) mRNA, complete cds |
| | 32847_at | Homo sapiens myosin light chain kinase (MLCK) mRNA, complete cds |
| | 39318_at | H. sapiens mRNA for Tcell leukemia |
| | 40763_at | Human leukemogenic homolog protein (MEIS1) mRNA, complete cds |