Abstract
One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes from thousands or tens of thousands of genes to construct accurate prediction models. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Moreover, our models are interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well in molecular cancer prediction and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.
Keywords: cancer prediction; decision rules; feature selection; gene expression profiles; rough set theory; soft computing
Year: 2009 PMID: 19718448 PMCID: PMC2730177 DOI: 10.4137/cin.s2655
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
6 genes with high prediction accuracy in the CNS tumor dataset.
| Probe ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| U28963_at | 47 (78%) | 0.75 |
| X99050_rna1_at | 45 (75%) | 0.75 |
| D83542_at | 46 (77%) | 0.7 |
| S71824_at | 50 (83%) | 0.7 |
| U37673_at | 40 (67%) | 0.7 |
| D86974_at | 45 (75%) | 0.7 |
11 gene pairs with high prediction accuracy in the CNS tumor dataset.
| 1st – 2nd Probe ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| D83542_at–S71824_at | 54 (90%) | 0.85 |
| D31763_at–U08998_at | 54 (90%) | 0.8 |
| D83542_at–X99050_rna1_at | 49 (82%) | 0.8 |
| D83542_at–D86974_at | 52 (87%) | 0.8 |
| L33243_at–U36448_at | 52 (87%) | 0.8 |
| M73547_at–U74324_at | 51 (85%) | 0.8 |
| M96739_at–U36448_at | 54 (90%) | 0.8 |
| S71824_at–D86974_at | 51 (85%) | 0.8 |
| U37143_at–D43682_s_at | 48 (80%) | 0.8 |
| U79277_at–D43682_s_at | 47 (78%) | 0.8 |
| X99050_rna1_at–D86974_at | 49 (82%) | 0.8 |
21 genes with high prediction accuracy in the colon tumor dataset.
| GenBank accession no. | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| M63391 | 52 (84%) | 0.8 |
| M76378 | 50 (81%) | 0.8 |
| J02854 | 50 (81%) | 0.8 |
| M26383 | 52 (84%) | 0.8 |
| M76378 | 50 (81%) | 0.75 |
| T60155 | 48 (77%) | 0.75 |
| M22382 | 50 (81%) | 0.75 |
| X12671 | 49 (79%) | 0.75 |
| M76378 | 50 (81%) | 0.75 |
| T96873 | 47 (76%) | 0.75 |
| X86693 | 47 (76%) | 0.75 |
| J05032 | 50 (81%) | 0.75 |
| U25138 | 48 (77%) | 0.75 |
| T60778 | 47 (76%) | 0.75 |
| M91463 | 48 (77%) | 0.75 |
| R87126 | 51 (82%) | 0.7 |
| T51571 | 46 (74%) | 0.7 |
| T92451 | 48 (77%) | 0.7 |
| U09564 | 48 (77%) | 0.7 |
| R97912 | 45 (73%) | 0.7 |
| L41559 | 45 (73%) | 0.7 |
16 gene pairs with high prediction accuracy in the colon tumor dataset.
| 1st – 2nd GenBank accession no. | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| T51571–J02854 | 56 (90%) | 0.9 |
| J02854–L41559 | 56 (90%) | 0.9 |
| M76378–M63391 | 52 (84%) | 0.85 |
| M63391–M76378 | 52 (84%) | 0.85 |
| M63391–Z49269 | 45 (73%) | 0.85 |
| M63391–X86693 | 53 (85%) | 0.85 |
| Z50753–H40095 | 55 (89%) | 0.85 |
| R87126–H81068 | 55 (89%) | 0.85 |
| X12671–J02854 | 56 (90%) | 0.85 |
| X12671–M26383 | 54 (87%) | 0.85 |
| M76378–M26383 | 55 (89%) | 0.85 |
| H40095–M36634 | 54 (87%) | 0.85 |
| R97912–J02854 | 55 (89%) | 0.85 |
| R97912–M26383 | 54 (87%) | 0.85 |
| R06601–X63629 | 54 (87%) | 0.85 |
| M36634–H08393 | 56 (90%) | 0.85 |
8 genes with high prediction accuracy in the lung cancer dataset.
| Unigene ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| 505266 | 32 (82%) | 0.75 |
| Hs.95243 | 32 (82%) | 0.75 |
| Hs.25882 | 32 (82%) | 0.75 |
| Hs.275198 | 32 (82%) | 0.75 |
| 36491 | 32 (82%) | 0.75 |
| Hs.170225 | 33 (85%) | 0.75 |
| Hs.17258 | 29 (74%) | 0.75 |
| Hs.11556 | 31 (79%) | 0.75 |
For entries listed without an "Hs." prefix (505266 and 36491), the Unigene ID is not available.
8 gene pairs with high prediction accuracy in the lung cancer dataset.
| 1st – 2nd Unigene ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| Hs.169611–Hs.285701 | 31 (79%) | 0.75 |
| Hs.285701–Hs.132415 | 29 (74%) | 0.75 |
| Hs.285701–Hs.57655 | 30 (77%) | 0.75 |
| Hs.57655–Hs.8595 | 31 (79%) | 0.75 |
| Hs.184542–Hs.58323 | 31 (79%) | 0.75 |
| Hs.262823–Hs.8595 | 31 (79%) | 0.75 |
| Hs.262480–Hs.772 | 32 (82%) | 0.75 |
| Hs.112193–505266 | 31 (79%) | 0.75 |
4 genes with high prediction accuracy in the DLBCL dataset.
| Probe ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| U70663_at | 44 (76%) | 0.7 |
| M17863_s_at | 44 (76%) | 0.7 |
| U48865_s_at | 43 (74%) | 0.7 |
| U90543_at | 45 (78%) | 0.7 |
20 gene pairs with high prediction accuracy in the DLBCL dataset.
| 1st – 2nd Probe ID | Correctly-classified sample number (accuracy) | α |
|---|---|---|
| AFFX-BioC-3_at–M95925_at | 46 (79%) | 0.8 |
| AFFX-BioC-3_at–U70663_at | 48 (83%) | 0.8 |
| AFFX-M27830_5_at – X70811_at | 49 (84%) | 0.8 |
| AFFX-M27830_5_at – U46744_at | 49 (84%) | 0.8 |
| AC002450_at–M95925_at | 47 (81%) | 0.8 |
| AC002450_at–U48213_at | 47 (81%) | 0.8 |
| AC002450_at–HG4020-HT4290_s_at | 48 (83%) | 0.8 |
| M95925_at–X70811_at | 46 (79%) | 0.8 |
| U23028_at–U70663_at | 47 (81%) | 0.8 |
| U23028_at–X70811_at | 48 (83%) | 0.8 |
| U51903_at–U70663_at | 48 (83%) | 0.8 |
| U51903_at–X70811_at | 47 (81%) | 0.8 |
| U66702_at–U70663_at | 47 (81%) | 0.8 |
| U66702_at–HG4020-HT4290_s_at | 48 (83%) | 0.8 |
| U66702_at–U90543_at | 52 (90%) | 0.8 |
| U70663_at–U83908_at | 47 (81%) | 0.8 |
| U70663_at–X83412_at | 46 (79%) | 0.8 |
| U70663_at–X77777_s_at | 47 (81%) | 0.8 |
| U70663_at–X16660_cds1_s_at | 46 (79%) | 0.8 |
| U70663_at–U46744_at | 47 (81%) | 0.8 |
Comparison of best prediction accuracy for the CNS tumor dataset.
| Methods (feature selection + classification) | # Selected genes | # Correctly-classified samples (accuracy) |
|---|---|---|
| α depended degree + decision rules [this work] | 1 | 50 (83%) |
| α depended degree + decision rules [this work] | 2 | 54 (90%) |
| Signal to noise ratios + k-NNs | 8 | 47 (78%) |
| Signal to noise ratios + Weighted voting | 1–200 | 46 (77%) |
| Signal to noise ratios + SVMs | 150 | 45 (75%) |
| Signal to noise ratios + SPLASH | 1–200 | 45 (75%) |
| Signal to noise ratios + TrkC | 1 | 40 (67%) |
| Signal to noise ratios + Staging | 1–200 | 41 (68%) |
| Signal to noise ratios + staging, k-NNs and TrkC | 1–200 | 48 (80%) |
| Signal to noise ratios + SVM, k-NNs and TrkC | 1–200 | 48 (80%) |
| HFW + C4.5 | 20 | 45 (75%) |
| HFW + NaiveBayes | 29 | 52 (86.67%) |
| Discretization + Single C4.5 | 74 | 51 (85%) |
| Discretization + Bagging C4.5 | 74 | 53 (88%) |
| Discretization + AdaBoost C4.5 | 74 | 53 (88%) |
Each method comprises two parts: a feature selection method and a classification method. The decision tree classification methods also perform feature selection.
74 is the number of genes retained for the actual learning process rather than the number of genes contained in the decision trees, which is not provided in ref. 12.
Tenfold cross-validation accuracy is provided.
Comparison of best prediction accuracy for the colon tumor dataset.
| Methods (feature selection + classification) | # Selected genes | # Correctly-classified samples (accuracy) |
|---|---|---|
| α depended degree + decision rules [this work] | 1 | 52 (84%) |
| α depended degree + decision rules [this work] | 2 | 56 (90%) |
| HykGene + k-NNs, SVMs, C4.5, NB | 3 | 57 (92%) |
| MAVE + logistic discrimination | 50 | 52 (84%) |
| Clustering and rough sets attribute reduction + k-NNs | 6 | 49 (79%) |
| Clustering and rough sets attribute reduction + NB | 6 | 51 (82%) |
| Clustering and rough sets attribute reduction + C5.0 | 6 | 56 (90%) |
| MRMR + NB | 9 | 58 (94%) |
| RBF + C4.5 | 4 | 58 (94%) |
| ReliefF + C4.5 | 4 | 53 (85%) |
| CFS-SF + C4.5 | 26 | 55 (89%) |
Comparison of best prediction accuracy for the lung cancer dataset.
| Methods (feature selection + classification) | # Selected genes | # Correctly-classified samples (accuracy) |
|---|---|---|
| α depended degree + decision rules [this work] | 1 | 33 (85%) |
| α depended degree + decision rules [this work] | 2 | 32 (82%) |
| HFW + C4.5 | 12 | 35 (90%) |
| HFW + NaiveBayes | 18 | 35 (90%) |
| FCBF + C4.5 | 12 | 31 (79%) |
| FCBF + NaiveBayes | 12 | 24 (62%) |
| CFS-SF + C4.5 | 13 | 26 (67%) |
| CFS-SF + NaiveBayes | 13 | 24 (62%) |
| ReliefF + C4.5 | 12 | 24 (62%) |
| ReliefF + NaiveBayes | 18 | 25 (64%) |
Comparison of best prediction accuracy for the DLBCL dataset.
| Methods (feature selection + classification) | # Selected genes | # Correctly-classified samples (accuracy) |
|---|---|---|
| α depended degree + decision rules [this work] | 1 | 45 (78%) |
| α depended degree + decision rules [this work] | 2 | 52 (90%) |
| Signal to noise ratios + Weighted voting | 13 | 44 (76%) |
| Signal to noise ratios + k-NNs | 9 | 41 (71%) |
| Gradient descent algorithm + SVMs | unknown | 45 (78%) |
| HFW + C4.5 | 22 | 44 (76%) |
| HFW + NaiveBayes | 19 | 50 (86%) |
| FCBF + C4.5 | 27 | 27 (47%) |
| FCBF + NaiveBayes | 27 | 31 (53%) |
| ReliefF + C4.5 | 22 | 25 (43%) |
| ReliefF + NaiveBayes | 19 | 31 (53%) |
"unknown" indicates that no related data is provided.
Microarray data decision table. The gene columns are the condition attributes; the class label is the decision attribute.
| Samples | Gene 1 | Gene 2 | … | Gene | Class label |
|---|---|---|---|---|---|
| 1 | | | | | |
| 2 | | | | | |
| … | | | | | |
| … | | | | | |
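The decision table layout can be illustrated with a small sketch. The table contents, the value names `low` and `high`, and the helper `equivalence_classes` are all hypothetical and are not taken from the paper. The sketch only shows the standard rough set idea of grouping samples that are indiscernible on a chosen set of condition attributes.

```python
from collections import defaultdict

# Hypothetical toy decision table: one row per sample, made-up values.
table = [
    {"Gene 1": "low", "Gene 2": "high", "Class label": "Class 1"},
    {"Gene 1": "low", "Gene 2": "low", "Class label": "Class 1"},
    {"Gene 1": "high", "Gene 2": "low", "Class label": "Class 0"},
    {"Gene 1": "high", "Gene 2": "high", "Class label": "Class 0"},
]


def equivalence_classes(rows, genes):
    """Group 1-based sample numbers by their values on the chosen genes."""
    groups = defaultdict(list)
    for number, row in enumerate(rows, start=1):
        key = tuple(row[g] for g in genes)
        groups[key].append(number)
    return dict(groups)


single = equivalence_classes(table, ["Gene 1"])
pair = equivalence_classes(table, ["Gene 1", "Gene 2"])
print(single)
print(pair)
```

In this toy table a single gene already separates the two classes, whereas the gene pair splits the samples into four singleton groups.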
Summary of the four gene expression datasets.
| Dataset | # Original genes | Class | # Samples |
|---|---|---|---|
| CNS Tumor | 7129 | Class 1/Class 0 | 60 (21/39) |
| Colon Tumor | 2000 | negative/positive | 62 (40/22) |
| Lung Cancer | 2880 | relapse/non-relapse | 39 (24/15) |
| DLBCL | 6817 | cured/fatal | 58 (32/26) |
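The accuracy percentages reported in the result tables can be cross-checked against these sample counts. The following is a minimal sketch that assumes each accuracy is the number of correctly classified samples divided by the total number of samples, rounded to the nearest percent. The function name `accuracy_percent` is hypothetical.

```python
# Total sample counts taken from the dataset summary table.
DATASET_SIZES = {
    "CNS Tumor": 60,
    "Colon Tumor": 62,
    "Lung Cancer": 39,
    "DLBCL": 58,
}


def accuracy_percent(correct, dataset):
    """Return correctly classified samples as a whole-number percentage."""
    total = DATASET_SIZES[dataset]
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and the dataset size")
    return round(100 * correct / total)


# Counts taken from the result tables.
for correct, dataset in [(50, "CNS Tumor"), (56, "Colon Tumor"),
                         (33, "Lung Cancer"), (52, "DLBCL")]:
    print(dataset, correct, f"{accuracy_percent(correct, dataset)}%")
```

A check of this kind makes it easy to spot a count and a percentage that do not belong together.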
Discretized CNS tumor decision table with the first sample left out. The gene columns are the condition attributes; the class label is the decision attribute.
| Samples | Gene 1 | … | Gene 11 | … | Gene 18 | … | Gene 7129 | Class label |
|---|---|---|---|---|---|---|---|---|
| 1 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 1 |
| 2 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 1 |
| … | … | … | … | … | … | … | … | … |
| 20 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 1 |
| 21 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 0 |
| 22 | ‘All’ | … | ‘(187-inf]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 0 |
| … | … | … | … | … | … | … | … | … |
| 58 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(-inf−330]’ | … | ‘All’ | Class 0 |
| 59 | ‘All’ | … | ‘(-inf-187]’ | … | ‘(−330-inf]’ | … | ‘All’ | Class 0 |
‘All’ indicates that a gene has the same value in all samples; ‘(-inf-x]’ represents ‘≤x’; ‘(x-inf]’ represents ‘>x’.
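The interval notation and the role of α can be sketched in a few lines. This record only names the "α depended degree", so the definition used here is an assumption borrowed from the usual variable precision rough set formulation: the fraction of samples that fall in groups whose majority class share is at least α. The expression values and class labels are made up, and only the cut point 187 is borrowed from the Gene 11 column. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def interval_label(value, cut):
    """Label a value against one cut point, using the table's notation."""
    return f"(-inf-{cut}]" if value <= cut else f"({cut}-inf]"


def alpha_depended_degree(values, classes, alpha):
    """Assumed definition: share of samples in groups whose majority
    class makes up at least `alpha` of the group."""
    groups = defaultdict(list)
    for value, label in zip(values, classes):
        groups[value].append(label)
    covered = 0
    for labels in groups.values():
        majority = Counter(labels).most_common(1)[0][1]
        if majority / len(labels) >= alpha:
            covered += len(labels)
    return covered / len(values)


# Made-up expression values for one gene and made-up class labels.
expression = [120, 150, 187, 200, 300, 90]
classes = ["Class 1", "Class 1", "Class 1", "Class 0", "Class 0", "Class 0"]

discretized = [interval_label(v, 187) for v in expression]
loose = alpha_depended_degree(discretized, classes, alpha=0.7)
strict = alpha_depended_degree(discretized, classes, alpha=0.8)
print(discretized)
print(loose, strict)
```

Under this assumed definition, lowering α lets groups with a few inconsistent samples still count as covered, which is why a gene can score well at α = 0.7 but poorly at α = 0.8.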