Edmundo Bonilla Huerta, Béatrice Duval, Jin-Kao Hao.
Abstract
Gene subset selection is essential for classification and analysis of microarray data. However, gene selection is known to be a very difficult task, since gene expression data not only have high dimensionality but also contain redundant information and noise. To cope with these difficulties, this paper introduces a fuzzy logic based pre-processing approach composed of two main steps. First, we use fuzzy inference rules to transform the gene expression levels of a given dataset into fuzzy values. Then we apply a similarity relation to these fuzzy values to define fuzzy equivalence groups, each group containing strongly similar genes. Dimension reduction is achieved by keeping, for each group of similar genes, a single representative chosen on the basis of mutual information. To assess the usefulness of this approach, extensive experiments were carried out on three well-known public datasets with a combined classification model using three statistical filters and three classifiers.
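The grouping and representative-selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the similarity relation, the threshold, or how mutual information is estimated. The sketch assumes each gene is summarised by one membership degree per sample, uses one minus the mean absolute difference as a hypothetical similarity measure, forms groups as connected components of the thresholded relation (which corresponds to an alpha-cut of its max-min transitive closure), and estimates mutual information on rounded membership values. All function names, the toy data, and the value `alpha=0.9` are illustrative assumptions.

```python
from collections import Counter
from math import log2


def similarity(u, v):
    """Assumed measure: 1 minus the mean absolute difference of two fuzzy vectors."""
    return 1.0 - sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def equivalence_groups(fuzzy, alpha):
    """Connected components of the graph linking genes with similarity >= alpha."""
    names = list(fuzzy)
    parent = {g: g for g in names}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, g in enumerate(names):
        for h in names[i + 1:]:
            if similarity(fuzzy[g], fuzzy[h]) >= alpha:
                parent[find(g)] = find(h)

    groups = {}
    for g in names:
        groups.setdefault(find(g), []).append(g)
    return list(groups.values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def representatives(fuzzy, labels, alpha=0.9):
    """Keep, per group, the gene whose rounded profile is most informative about labels."""
    def score(g):
        return mutual_information([round(v) for v in fuzzy[g]], labels)

    return [max(group, key=score) for group in equivalence_groups(fuzzy, alpha)]


# Toy data (hypothetical): three genes, four samples, two classes.
fuzzy = {
    "g1": [0.9, 0.8, 0.1, 0.2],
    "g2": [0.85, 0.8, 0.15, 0.2],  # near-duplicate of g1
    "g3": [0.1, 0.9, 0.2, 0.8],
}
labels = [1, 1, 0, 0]
selected = representatives(fuzzy, labels)
print(selected)
```

In this toy example `g1` and `g2` fall into the same group, so only one of them is kept, while `g3` forms its own group. The pairwise comparison is quadratic in the number of genes, which is acceptable for a sketch but would need care on datasets with thousands of genes.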
Year: 2008 PMID: 18973862 PMCID: PMC5054105 DOI: 10.1016/S1672-0229(08)60021-2
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Fig. 1 A. Simple filter-based model used as our comparison reference. B. Combined model using our fuzzy processing followed by the classical filter approach.
Reduced dataset obtained by the fuzzy approach
| Dataset | Original number of genes | Reduced number of genes | Percentage (%) of informative genes |
|---|---|---|---|
| Leukemia | 7,129 | 1,360 | 19.07 |
| Colon | 2,000 | 943 | 47.15 |
| Lymphoma | 4,026 | 435 | 10.80 |
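The percentage column in the table above is the reduced gene count divided by the original gene count. The short check below recomputes it from the counts in the table; the variable and function names are illustrative. The leukemia ratio comes out at roughly 19.077%, which the table lists as 19.07, so that entry appears to be truncated rather than rounded.

```python
# Gene counts copied from the table above: (original, reduced).
GENE_COUNTS = {
    "Leukemia": (7129, 1360),
    "Colon": (2000, 943),
    "Lymphoma": (4026, 435),
}


def percent_retained(original: int, reduced: int) -> float:
    """Share of the original genes kept after the fuzzy reduction, in percent."""
    return 100.0 * reduced / original


retained = {name: percent_retained(o, r) for name, (o, r) in GENE_COUNTS.items()}
for name, pct in retained.items():
    print(f"{name}: {pct:.3f}% of genes retained")
```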
Fig. 2 Classification rate (accuracy) (%) with p genes on the leukemia dataset. A. BW: 95.83 (peak classification rate) and 90.23 (average classification rate) vs CM1: 100 and 87.77. B. TT: 95.83 and 83.70 vs CM2: 97.22 and 93.37. C. WT: 87.5 and 75.64 vs CM3: 98.61 and 94.72.
Fig. 3 Classification rate (accuracy) (%) with p genes on the colon dataset. A. BW: 88.70 (peak classification rate) and 86.45 (average classification rate) vs CM1: 90.32 and 82.04. B. TT: 80.64 and 75.43 vs CM2: 85.48 and 84.78. C. WT: 82.25 and 77.79 vs CM3: 85.48 and 83.60.
Fig. 4 Classification rate (accuracy) (%) with p genes on the lymphoma dataset. A. BW: 92.70 (peak classification rate) and 89.30 (average classification rate) vs CM1: 92.70 and 83.47. B. TT: 87.50 and 84.37 vs CM2: 88.54 and 79.30. C. WT: 93.75 and 91.04 vs CM3: 86.45 and 78.81.
Best and average classification rates for leukemia, colon, and lymphoma datasets using the first 100 top-ranked genes
| Best classification rate (%) | ||||||
| Dataset | Method | |||||
| BW | CM1 | TT | CM2 | WT | CM3 | |
| Leukemia | 98.6 | 95.8 | ||||
| Colon | 88.7 | 80.6 | 82.2 | 85.4 | ||
| Lymphoma | 92.7 | 87.5 | 90.6 | |||
| Average classification rate (%) | ||||||
| Dataset | Method | |||||
| BW | CM1 | TT | CM2 | WT | CM3 | |
| Leukemia | 94.5 | 91.0 | 86.9 | |||
| Colon | 87.2 | 70.3 | 72.5 | |||
| Lymphoma | 87.7 | 83.6 | 84.2 | |||
The 30 genes selected for the leukemia dataset
| Rank | ID | Gene code | Description | References |
|---|---|---|---|---|
| 1 | | X95735 | Zyxin | |
| 2 | | X17042 | PRG1 proteoglycan 1 | |
| 3 | | M23197 | CD33 antigen | |
| 4 | | L09209 | APLP2 | |
| 5 | | U46499 | Glutathione S-transferase | |
| 6 | | M27891 | CST3 cystatin C | |
| 7 | | M16038 | LYN V-yes-1 | |
| 8 | | M22960 | PPGB (galactosialidosis) | |
| 9 | | M63138 | CTSD cathepsin D | |
| 10 | | M55150 | FAH fumarylacetoacetate | |
| 11 | | M62762 | ATP6C vacuolar H+ | |
| 12 | | U50136 | Leukotriene C4 synthase | |
| 13 | | X61587 | ARHG Ras (rho G) | |
| 14 | 6005 | M32304 | TIMP2 tissue inhibitor | |
| 15 | 4229 | X52056 | SPI1 (SFFV) | |
| 16 | | D49950 | Liver mRNA (IGIF) | |
| 17 | | X59417 | Proteasome iota chain | |
| 18 | 6281 | M31211 | MYL1 myosin (alkali) | |
| 19 | | M92287 | CCND3 cyclin D3 | |
| 20 | 6185 | X64072 | SELL | |
| 21 | 1260 | L09717 | LAMP2 | |
| 22 | | M31523 | TCF3 | |
| 23 | | Y07604 | NDP kinase | |
| 24 | | U05259 | MB-1 gene | |
| 25 | 1615 | L42379 | Quiescin (Q6) | |
| 26 | | M84371 | CD19 gene | |
| 27 | | D14664 | KIAA0022 | |
| 28 | | J05243 | SPTAN1 | |
| 29 | 2363 | M93053 | Leukocyte elastase inhibitor | |
| 30 | | HG612-HT1612 | Macmarcks | |
Best classification rate (%) with different relevance criteria and filter methods combined with different classifiers*
| Combined methods | Leukemia kNN | Leukemia LVQ | Leukemia SVM | Colon kNN | Colon LVQ | Colon SVM | Lymphoma kNN | Lymphoma LVQ | Lymphoma SVM |
|---|---|---|---|---|---|---|---|---|---|
| F | 97.5 | 91.0 | 99.4 | 90.8 | 87.0 | 91.7 | 93.7 | 97.9 | 99.3 |
| F | 98.3 | 91.0 | 99.8 | 89.3 | 87.0 | 92.0 | 94.7 | 97.9 | 99.4 |
| F | 97.2 | 91.0 | 98.1 | 89.5 | 87.0 | 88.5 | 94.7 | 95.8 | 98.4 |
| F | 96.2 | 91.0 | 98.4 | 89.5 | 87.0 | 91.9 | 91.6 | 72.9 | 97.9 |
| F | 97.7 | 91.0 | 98.4 | 89.8 | 87.0 | 91.4 | 94.7 | 98.7 | |
| F | 98.0 | 91.0 | 98.4 | 90.1 | 87.0 | 88.7 | 93.7 | 97.9 | 99.1 |
| F | 99.4 | 91.0 | 99.8 | 89.5 | 87.0 | 91.4 | 96.8 | 93.7 | 99.4 |
| F | 98.3 | 97.0 | 89.5 | 83.8 | 94.7 | 95.8 | 99.6 | ||
| F | 98.8 | 91.0 | 98.3 | 89.5 | 87.0 | 89.0 | 93.7 | 89.5 | 97.7 |
*We report the best classification rate obtained with p selected genes (p ≤ 100).
Comparison of classification rates on the three datasets
| Work/Method | Leukemia, best rate (%) | Colon, best rate (%) | Lymphoma, best rate (%) |
|---|---|---|---|
| Ben-dor | 91.6–95.8 | 72.6–80.6 | – |
| Furey | 94.1 | 90.3 | – |
| Li | – | 94.1 | 84.6 |
| Li and Yang | 94.1 | – | – |
| Dudoit | 95.0 | – | 90.0 |
| Nguyen and Rocke | 94.2–96.4 | 87.1–93.5 | 96.9–98.1 |
| Marohnic | – | – | |
| Ding and Peng | 93.5 | 98.9 | |
| Tang | – | – | |
| Marchiori and Sebag | 94.0 | 93.0 | |
| Hu | 94.1 | 83.8 | 95.8 |
| Cho and Won | 95.9 | 87.7 | 93.0 |
| Yang | 76.7 | 86.1 | |
| Peng | 98.6 | 96.7 | – |
| Wang | 95.8 | 95.6 | |
| Kim | 90.32 | – | |
| Mundra and Rajapakse | 97.2 | 89.3 | – |
| Tang | 96.7 | 95.4 | |
| Li | 97.1 | 83.5 | 93.0 |
| Zhang | 90.3 | 92.2 | |
| CM1 using kNN | 90.3 | 93.7 | |
| CM1 using LVQ | 87.1 | ||
| CM1 using SVM (RBF) | 91.4 | ||
| F | 98.3 | 89.5 | 94.7 |
| F | 97.0 | 83.8 | 95.8 |
| F | 92.4 | 99.6 | |
Fig. 5 Fuzzy discretization of gene expression levels using a fuzzy inference system.
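To make the idea of fuzzy discretization concrete, the sketch below maps an expression level to membership degrees in three linguistic terms. It is only an illustration under stated assumptions: this record does not give the membership functions or inference rules used in the paper, so the three terms ("low", "medium", "high"), the triangular shapes, and the min-max normalisation to [0, 1] are all hypothetical choices.

```python
def min_max_normalise(values):
    """Rescale raw expression levels to [0, 1]; constant input maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuzzify(x):
    """Membership degrees of a normalised level x in three assumed fuzzy sets.

    'low' peaks at 0, 'medium' at 0.5 and 'high' at 1; each falls linearly
    to 0 at the neighbouring peak, so the three degrees always sum to 1.
    """
    x = min(1.0, max(0.0, x))  # clamp to the unit interval
    return {
        "low": max(0.0, 1.0 - 2.0 * x),
        "medium": max(0.0, 1.0 - abs(2.0 * x - 1.0)),
        "high": max(0.0, 2.0 * x - 1.0),
    }


# Hypothetical raw expression levels of one gene across five samples.
raw = [120.0, 480.0, 300.0, 840.0, 1560.0]
levels = min_max_normalise(raw)
memberships = [fuzzify(x) for x in levels]
for x, m in zip(levels, memberships):
    print(f"{x:.3f} -> {m}")
```

A level halfway between two peaks, such as 0.25, belongs equally to both neighbouring sets, which is the property that distinguishes this from a crisp three-bin discretization.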