Yanxiong Peng, Wenyuan Li, Ying Liu.
Abstract
Microarrays allow researchers to monitor the expression patterns of tens of thousands of genes across a wide range of cellular responses, phenotypes, and conditions. Selecting a small subset of discriminative genes from these thousands is important for accurate classification of diseases and phenotypes. Many methods have been proposed to find subsets of genes with maximum relevance and minimum redundancy that can accurately distinguish between samples with different labels. Finding the minimum subset of relevant genes is often referred to as biomarker discovery. Two main approaches, filter and wrapper techniques, have been applied to biomarker discovery. In this paper, we conducted a comparative study of different biomarker discovery methods, including six filter methods and three wrapper methods. We then proposed a hybrid approach, FR-Wrapper, for biomarker discovery. The aim of this approach is to find an optimum balance between the precision of biomarker discovery and the computation cost by taking advantage of both the efficiency of filter methods and the high accuracy of wrapper methods. Our hybrid approach first applies Fisher's ratio, a simple method that is easy to understand and implement, to filter out most of the irrelevant genes; a wrapper method is then employed to reduce the redundancy. The performance of the FR-Wrapper approach is evaluated on four widely used microarray datasets. Analysis of the experimental results reveals that the hybrid approach can achieve the goal of maximum relevance with minimum redundancy.
Keywords: Biomarker discovery; Cancer classification; Gene expression; Gene selection; Microarray
Year: 2007 PMID: 19458773 PMCID: PMC2675487
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
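The filter stage described in the abstract ranks genes by Fisher's ratio and keeps only the top-scoring ones. The following is a minimal sketch of that idea, assuming the common two-class form (squared difference of class means divided by the sum of class variances); the exact variant used by the authors is not shown in this record, and the function names, the use of population variance, and the toy data are all illustrative assumptions.

```python
from statistics import mean, pvariance


def fisher_ratio(pos, neg):
    """Fisher's ratio of one gene (assumed form):
    squared gap between class means over the sum of class variances."""
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread > 0:
        return gap / spread
    # Gene is constant within both classes: perfectly separating or useless.
    return float("inf") if gap > 0 else 0.0


def top_k_genes(samples, labels, k):
    """Return the indices of the k genes with the largest Fisher's ratio.

    samples: list of expression vectors, one per sample
    labels:  list of 0/1 class labels aligned with samples
    """
    scores = []
    for g in range(len(samples[0])):
        pos = [s[g] for s, y in zip(samples, labels) if y == 1]
        neg = [s[g] for s, y in zip(samples, labels) if y == 0]
        scores.append((fisher_ratio(pos, neg), g))
    # Highest score first; ties broken by lower gene index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scores[:k]]


# Made-up toy data: gene 0 separates the classes well, gene 1 is noise,
# gene 2 separates them weakly.
samples = [[5.0, 1.0, 2.0], [5.2, 3.0, 2.4], [1.0, 1.1, 1.8], [1.2, 2.9, 2.0]]
labels = [1, 1, 0, 0]
print(top_k_genes(samples, labels, 2))
```

In the hybrid setting, `k` would play the role of the reduced search space (for example 50, 100, or 200 genes) that is then handed to the wrapper stage.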
The four microarray datasets* used in this paper.
| Dataset | # of genes | # of positive samples | # of negative samples |
|---|---|---|---|
| Leukemia | 7129 | 47 (ALL) | 25 (AML) |
| Lung Cancer | 12533 | 31 (MPM) | 150 (ADCA) |
| Breast Cancer | 24481 | 46 | 51 |
| Colon Cancer | 2000 | 22 | 40 |
*Data were obtained from http://sdmc.lit.org.sg/GEDatasets/Datasets.html
Figure 1. Leave-one-out cross-validation accuracies on the leukemia dataset. The genes were ranked by different filter methods, and the top k ranked genes were selected for a classifier to classify the samples. (A): Support Vector Machine (SVM); (B): Decision tree J4.8; (C): Naïve Bayes (NB).
Figure 4. Leave-one-out cross-validation accuracies on the colon cancer dataset. The genes were ranked by different filter methods, and the top k ranked genes were selected for a classifier to classify the samples. (A): Support Vector Machine (SVM); (B): Decision tree J4.8; (C): Naïve Bayes (NB).
Classification accuracies of different microarray datasets by RBF.
| Data sets | Classifier | # of genes selected | Accuracy (%) |
|---|---|---|---|
| Leukemia | NB | 4 | 94.44 |
| | J4.8 | 4 | 87.50 |
| | SVM | 4 | 93.06 |
| Lung cancer | NB | 6 | 98.90 |
| | J4.8 | 6 | 98.34 |
| | SVM | 6 | 96.13 |
| Breast cancer | NB | 67 | 61.85 |
| | J4.8 | 67 | 79.38 |
| | SVM | 67 | 75.26 |
| Colon cancer | NB | 4 | 77.42 |
| | J4.8 | 4 | 93.55 |
| | SVM | 4 | 80.65 |
Classification accuracies of different microarray datasets without gene selection.
| Data sets | # of genes | Classifier | Accuracy (%) |
|---|---|---|---|
| Leukemia | 7129 | NB | 100 |
| | | J4.8 | 73.61 |
| | | SVM | 98.61 |
| Lung cancer | 12533 | NB | 97.79 |
| | | J4.8 | 96.13 |
| | | SVM | 99.45 |
| Breast cancer | 24481 | NB | 52.57 |
| | | J4.8 | 52.58 |
| | | SVM | 69.07 |
| Colon cancer | 2000 | NB | 58.64 |
| | | J4.8 | 80.65 |
| | | SVM | 82.26 |
Figure 2. Leave-one-out cross-validation accuracies on the lung cancer dataset. The genes were ranked by different filter methods, and the top k ranked genes were selected for a classifier to classify the samples. (A): Support Vector Machine (SVM); (B): Decision tree J4.8; (C): Naïve Bayes (NB).
Figure 3. Leave-one-out cross-validation accuracies on the breast cancer dataset. The genes were ranked by different filter methods, and the top k ranked genes were selected for a classifier to classify the samples. (A): Support Vector Machine (SVM); (B): Decision tree J4.8; (C): Naïve Bayes (NB).
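The evaluation protocol named in the figure captions, leave-one-out cross-validation on a chosen set of genes, can be sketched as follows. This is an illustration only: a nearest-centroid rule stands in for the SVM, J4.8, and Naïve Bayes classifiers used in the paper, and the function names and toy data are assumptions.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (not one of the paper's): assign x to the class
    whose mean expression vector is closest in squared Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(samples, labels, gene_idx):
    """Leave-one-out accuracy (%) using only the genes listed in gene_idx."""
    reduced = [[s[g] for g in gene_idx] for s in samples]
    correct = 0
    for i in range(len(reduced)):
        # Hold out sample i, train on all the others.
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, reduced[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(reduced)


# Made-up toy data: gene 0 is informative, gene 1 is not.
samples = [[5.0, 0.3], [5.2, 0.9], [4.8, 0.5], [1.0, 0.4], [1.2, 0.8], [0.9, 0.6]]
labels = [1, 1, 1, 0, 0, 0]
print(loocv_accuracy(samples, labels, [0]))
```

Because each sample is held out exactly once, the accuracy on n samples is always a multiple of 100/n. The tabulated values are consistent with this: 94.44% on the 72 leukemia samples corresponds to 68 correct predictions.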
Classification accuracies of different microarray datasets by three wrapper methods (SVM-forward selection, NB-forward selection, and decision tree J4.8-forward selection).
| Data sets | Classifier | Time (seconds) | # of genes selected | Accuracy (%) |
|---|---|---|---|---|
| Leukemia | NB | 360 | 3 | 98.61 |
| | J4.8 | 360 | 2 | 95.83 |
| | SVM | 55980 | 5 | 98.61 |
| Lung cancer | NB | 1080 | 3 | 100 |
| | J4.8 | 1560 | 2 | 99.45 |
| | SVM | 59760 | 4 | 100 |
| Breast cancer | NB | 5280 | 3 | 88.66 |
| | J4.8 | 13920 | 2 | 93.81 |
| | SVM | 447060 | 4 | 89.69 |
| Colon cancer | NB | 300 | 8 | 93.55 |
| | J4.8 | 300 | 3 | 96.77 |
| | SVM | 12060 | 5 | 91.94 |
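The forward-selection wrappers compared above can be illustrated with a short greedy search. This is a sketch under stated assumptions rather than the authors' implementation: the stopping rule (stop when no candidate improves the score), the tie-breaking, and the toy scoring function are hypothetical, and in practice `score` would be the cross-validated accuracy of the chosen classifier.

```python
def forward_selection(candidates, score, max_genes=None):
    """Greedy forward selection over gene indices.

    candidates: gene indices to search over
    score:      callable mapping a list of gene indices to a number,
                e.g. cross-validated accuracy of a classifier
    Returns (selected genes, best score reached).
    """
    remaining = list(candidates)
    selected, best = [], float("-inf")
    while remaining and (max_genes is None or len(selected) < max_genes):
        # One classifier evaluation per remaining candidate in every round.
        scored = [(score(selected + [g]), g) for g in remaining]
        top_score = max(s for s, _ in scored)
        if top_score <= best:
            break  # no candidate improves the current subset
        gene = next(g for s, g in scored if s == top_score)  # first wins ties
        selected.append(gene)
        remaining.remove(gene)
        best = top_score
    return selected, best


# Hypothetical scoring function: genes 3 and 5 carry the same signal "a",
# gene 7 carries an independent signal "b", gene 9 carries none.
SIGNAL = {3: "a", 5: "a", 7: "b", 9: None}
WEIGHT = {"a": 60.0, "b": 30.0}


def toy_score(subset):
    signals = {SIGNAL[g] for g in subset} - {None}
    return sum(WEIGHT[s] for s in signals)


print(forward_selection([3, 5, 7, 9], toy_score))
```

In this toy run the redundant gene 5 is never added, because it does not improve on gene 3 alone. Each round costs one evaluation per remaining candidate, which is why running the wrapper over all genes is slow and why shrinking the candidate list with a filter first reduces the running time.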
Classification accuracies of different microarray datasets by the hybrid approach.
| Datasets | Search space | Classifier | Time (seconds) | # of genes selected | Accuracy (%) |
|---|---|---|---|---|---|
| Leukemia | 200 | NB | 13 | 4 | 100 |
| | 200 | J4.8 | 13 | 2 | 95.83 |
| | 200 | SVM | 896 | 3 | 98.61 |
| | 100 | NB | 7 | 4 | 100 |
| | 100 | J4.8 | 7 | 2 | 95.83 |
| | 100 | SVM | 566 | 4 | 98.61 |
| | 50 | NB | 4 | 4 | 100 |
| | 50 | J4.8 | 4 | 2 | 95.83 |
| | 50 | SVM | 273 | 4 | 98.61 |
| Lung cancer | 200 | NB | 16 | 3 | 100 |
| | 200 | J4.8 | 30 | 2 | 99.45 |
| | 200 | SVM | 652 | 3 | 100 |
| | 100 | NB | 8 | 3 | 100 |
| | 100 | J4.8 | 16 | 2 | 99.45 |
| | 100 | SVM | 327 | 3 | 100 |
| | 50 | NB | 4 | 3 | 100 |
| | 50 | J4.8 | 8 | 2 | 99.45 |
| | 50 | SVM | 157 | 3 | 100 |
| Breast cancer | 200 | NB | 34 | 6 | 84.54 |
| | 200 | J4.8 | 84 | 4 | 86.60 |
| | 200 | SVM | 1032 | 3 | 82.47 |
| | 100 | NB | 13 | 5 | 86.60 |
| | 100 | J4.8 | 42 | 4 | 85.57 |
| | 100 | SVM | 886 | 6 | 88.66 |
| | 50 | NB | 7 | 5 | 86.60 |
| | 50 | J4.8 | 15 | 3 | 85.57 |
| | 50 | SVM | 427 | 6 | 88.66 |
| Colon cancer | 200 | NB | 12 | 5 | 91.94 |
| | 200 | J4.8 | 36 | 3 | 96.77 |
| | 200 | SVM | 1079 | 5 | 90.32 |
| | 100 | NB | 6 | 5 | 91.94 |
| | 100 | J4.8 | 16 | 3 | 90.32 |
| | 100 | SVM | 539 | 5 | 90.32 |
| | 50 | NB | 4 | 5 | 90.32 |
| | 50 | J4.8 | 8 | 3 | 90.32 |
| | 50 | SVM | 246 | 4 | 87.09 |
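The computational saving of the hybrid approach can be read directly off the reported running times. The short calculation below uses only the SVM times from the wrapper table and from the hybrid table at a search space of 200 genes; the dictionary and function names are illustrative.

```python
# Reported SVM running times in seconds, copied from the two tables.
wrapper_svm = {
    "Leukemia": 55980,
    "Lung cancer": 59760,
    "Breast cancer": 447060,
    "Colon cancer": 12060,
}
hybrid_svm_200 = {
    "Leukemia": 896,
    "Lung cancer": 652,
    "Breast cancer": 1032,
    "Colon cancer": 1079,
}


def speedups(slow, fast):
    """Ratio of slow to fast running time for every dataset in both tables."""
    return {name: slow[name] / fast[name] for name in slow}


ratios = speedups(wrapper_svm, hybrid_svm_200)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x faster")
```

For the SVM wrapper, the reduction in running time ranges from roughly 11-fold on the colon cancer dataset to more than 400-fold on the breast cancer dataset. The accompanying accuracies should be compared alongside these ratios, since they are unchanged on leukemia and lung cancer but lower on breast and colon cancer at this search space.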