Md Hassan, Ramamohanarao Kotagiri.
Abstract
BACKGROUND: Gene expression data classification is a challenging task due to the large dimensionality and very small number of samples. Decision trees are a popular machine learning approach to such classification problems. However, existing decision tree algorithms use a single gene feature at each node to split the data into child nodes, and hence may perform poorly, especially when classifying gene expression datasets.
Year: 2013 PMID: 24564916 PMCID: PMC4044984 DOI: 10.1186/1753-6561-7-S7-S3
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
| Output in actual ordering: | -1 | -0.4 | -0.7 | -0.9 | 0.01 | 0.5 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| Output (sorted): | -1 | -0.9 | -0.7 | -0.4 | 0.01 | 0.5 | 0.9 | 1 |
| Rank: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Desired Output: | -1 | -1 | +1 | -1 | -1 | -1 | +1 | +1 |

| Output (sorted): | -1 | -0.9 | -0.7 | -0.4 | 0.01 | 0.5 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| Rank: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Desired Output: | -1 | -1 | +1 | -1 | -1 | -1 | +1 | +1 |
| Predicted Class Label: | -1 | -1 | -1 | -1 | +1 | +1 | +1 | +1 |
| ✓ | ✓ | ✓ | ||||||
| ✓ | ||||||||
| ✓ | ✓ | |||||||
| ✓ | ✓ | |||||||
| 0 | 0 | 2 | 0 | 1 | 2 | 0 | 0 | |
| ... | |||||
| Rank: | 1 | 2 | 3 | ... | m |
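
The eight-sample ranking example in the tables above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the "Desired Output" row is aligned with the sorted outputs, that there are no tied scores, and that the predicted class labels come from thresholding the output at 0 (any threshold between -0.4 and 0.01 reproduces the same labels). The function names `auc_from_ranks` and `accuracy_at_threshold` are hypothetical.

```python
from typing import Sequence


def auc_from_ranks(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via the rank-sum (Mann-Whitney) formula; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Rank 1 goes to the smallest output, as in the "Rank" row of the table.
    pos_ranks = [rank for rank, i in enumerate(order, start=1) if labels[i] == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def accuracy_at_threshold(scores: Sequence[float], labels: Sequence[int],
                          threshold: float = 0.0) -> float:
    """Fraction of samples whose thresholded output matches the desired label."""
    # Threshold 0 is an assumption; the source does not state the cut-off.
    predicted = [1 if s > threshold else -1 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Values copied from the "Output (sorted)" and "Desired Output" rows.
outputs = [-1, -0.9, -0.7, -0.4, 0.01, 0.5, 0.9, 1]
desired = [-1, -1, 1, -1, -1, -1, 1, 1]

auc = auc_from_ranks(outputs, desired)
acc = accuracy_at_threshold(outputs, desired)
print(f"AUC = {auc:.3f}, accuracy = {acc:.3f}")
```

Under these assumptions the positive samples occupy ranks 3, 7 and 8, so the AUC is (18 - 6) / (3 × 5) = 0.8, while the thresholded labels match the desired outputs on 5 of 8 samples (accuracy 0.625). The gap between the two numbers shows that a ranking can be fairly good even when a fixed threshold misclassifies several samples.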
Performance in accuracy.
| Method | GE1 | GE2 | GE3 | GE4 | GE5 | GE6 | GE7 |
|---|---|---|---|---|---|---|---|
| BVROC-Tree | 84.16 ± 0.02 | ||||||
| ROC-Tree | 64.13 ± 4.53 | 86.26 ± 0.05 | 98.34 ± 0.89 | 38.10 ± 5.95 | 88.24 ± 2.33 | 94.44 ± 2.96 | 52.63 ± 0.07 |
| AUCsplit | 56.96 ± 0.09 | 81.93 ± 0.02 | 96.14 ± 1.36 | 34.01 ± 2.87 | 82.47 ± 3.96 | 81.61 ± 3.28 | 50.53 ± 0.07 |
| C4.5 | 53.48 ± 5.67 | 78.04 ± 1.83 | 93.21 ± 1.07 | 41.7 ± 4.74 | 79.42 ± 5.45 | 84.39 ± 2.01 | 39.00 ± 5.48 |
| ADTree | 55.22 ± 5.87 | | 95.14 ± 2.17 | 43.10 ± 4.80 | 86.76 ± 2.63 | 88.82 ± 5.06 | 49.00 ± 4.18 |
| REPTree | 58.26 ± 2.83 | 78.64 ± 2.99 | 95.01 ± 1.79 | 44.23 ± 5.18 | 80.88 ± 3.33 | 87.64 ± 4.49 | 57.00 ± 13.51 |
| Random Tree | 51.74 ± 1.82 | 65.53 ± 3.24 | 92.03 ± 5.62 | 46.40 ± 6.74 | 62.50 ± 5.23 | 81.64 ± 11.47 | 47.00 ± 16.43 |
| Random Forest | 48.6 ± 4.85 | 81.45 ± 4.62 | 92.98 ± 5.36 | 47.52 ± 7.19 | 80.88 ± 2.56 | 82.13 ± 10.33 | 43.00 ± 10.37 |
| Naïve Bayes | 50.60 ± 5.82 | 88.60 ± 2.26 | 93.85 ± 5.27 | 46.15 ± 7.44 | 55.88 ± 4.76 | 84.85 ± 11.26 | 62.00 ± 4.47 |
| | 47.10 ± 5.31 | 86.80 ± 2.29 | 93.73 ± 4.88 | 48.23 ± 8.61 | 78.68 ± 4.78 | 84.68 ± 10.42 | 44.00 ± 4.18 |
Overall accuracy (%) on the gene expression datasets using a 5 × 10 fold cross-validation scheme
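
To make the evaluation protocol concrete, here is a minimal sketch of how the train/test splits for such a scheme could be generated. It assumes that "5 × 10 fold" means 10-fold cross-validation repeated 5 times with a fresh shuffle each time; the source does not say whether the folds were stratified by class, and this sketch does not stratify. The function `repeated_kfold` and its parameters are hypothetical names, and 46 is the GE1 sample count from the dataset table.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_samples: int, n_folds: int = 10, n_repeats: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repetition
        for fold in range(n_folds):
            # Every n_folds-th shuffled index goes to this fold's test set.
            test_idx = indices[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


# 46 samples as in GE1; 5 repetitions x 10 folds gives 50 train/test splits.
splits = list(repeated_kfold(n_samples=46))
print(len(splits))
```

Each split would yield one accuracy or AUC value for a classifier trained on `train_idx` and evaluated on `test_idx`. How the reported "±" values were aggregated over these splits is not stated in this record.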
Performance in AUC.
| Method | GE1 | GE2 | GE3 | GE4 | GE5 | GE6 | GE7 |
|---|---|---|---|---|---|---|---|
| BVROC-Tree | 0.82 ± 0.04 | ||||||
| ROC-Tree | 0.64 ± 0.09 | 0.79 ± 0.05 | 0.93 ± 0.04 | 0.29 ± 0.05 | 0.89 ± 0.33 | 0.95 ± 0.01 | 0.54 ± 0.08 |
| AUCsplit | 0.57 ± 0.10 | 0.78 ± 0.02 | 0.92 ± 0.02 | 0.30 ± 0.06 | 0.81 ± 0.04 | 0.82 ± 0.08 | 0.49 ± 0.11 |
| C4.5 | 0.56 ± 0.05 | 0.78 ± 0.03 | 0.87 ± 0.03 | 0.39 ± 0.04 | 0.78 ± 0.06 | 0.83 ± 0.02 | 0.45 ± 0.05 |
| ADTree | 0.57 ± 0.04 | | 0.92 ± 0.06 | 0.36 ± 0.05 | 0.84 ± 0.03 | 0.90 ± 0.08 | 0.50 ± 0.06 |
| REPTree | 0.59 ± 0.06 | 0.80 ± 0.02 | 0.91 ± 0.05 | 0.40 ± 0.07 | 0.79 ± 0.04 | 0.88 ± 0.07 | 0.61 ± 0.08 |
| Random Tree | 0.55 ± 0.03 | 0.64 ± 0.04 | 0.85 ± 0.12 | 0.43 ± 0.09 | 0.63 ± 0.05 | 0.81 ± 0.14 | 0.53 ± 0.15 |
| Random Forest | 0.54 ± 0.05 | 0.89 ± 0.04 | 0.88 ± 0.12 | 0.43 ± 0.09 | 0.79 ± 0.03 | 0.83 ± 0.13 | 0.47 ± 0.21 |
| Naïve Bayes | 0.55 ± 0.05 | 0.93 ± 0.02 | 0.89 ± 0.12 | 0.42 ± 0.09 | 0.53 ± 0.05 | 0.86 ± 0.14 | 0.65 ± 0.11 |
| | 0.53 ± 0.03 | 0.93 ± 0.02 | 0.91 ± 0.11 | 0.42 ± 0.09 | 0.79 ± 0.05 | 0.87 ± 0.13 | 0.51 ± 0.09 |
AUC results on the gene expression datasets using a 5 × 10 fold cross-validation scheme
Datasets.
| Dataset | Data collected from | No. of genes | Total Samples | Classification of: |
|---|---|---|---|---|
| GE1 | Critchley-Thorne | 20,845 | 46 | Metastatic Melanoma |
| GE2 | Zizhen | 4,133 | 101 | Marfan Syndrome |
| GE3 | Gordon | 12,533 | 181 | Lung Cancer |
| GE4 | Singh | 12,600 | 21 | Prostate Cancer |
| GE5 | Singh | 12,600 | 136 | Prostate Cancer |
| GE6 | Golub | 7,129 | 72 | Leukemia |
| GE7 | Notterman | 22,278 | 19 | Colorectal Adenoma |
Properties of the datasets used in this study
Tree size.
| Dataset | | | C4.5 | ADTree | REPTree | Random Tree |
|---|---|---|---|---|---|---|
| GE1 | 3 | 5 | 7 | 28 | 4 | 52 |
| GE2 | 2 | 6 | 5 | 22 | 3 | 18 |
| GE3 | 3 | 7 | 7 | 26 | 4 | 61 |
| GE4 | 2 | 7 | 5 | 30 | 3 | 27 |
| GE5 | 3 | 16 | 10 | 32 | 5 | 86 |
| GE6 | 2 | 6 | 4 | 31 | 3 | 42 |
| GE7 | 2 | 5 | 3 | 28 | 1 | 21 |
Comparison of tree sizes when all data instances are used as training data