Thanyaluk Jirapech-Umpai, Stuart Aitken.
Abstract
BACKGROUND: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems where three tumour types and nine cell lines, respectively, must be identified. We apply an evolutionary algorithm to identify the near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step, whereby the most informative genes are selected from the genes assayed.
Year: 2005 PMID: 15958165 PMCID: PMC1181625 DOI: 10.1186/1471-2105-6-148
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Evolutionary algorithm for multiclass classification.
Figure 2. The average score in each iteration for several trials of the evolutionary algorithm on the leukemia dataset.
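The evolutionary search summarised in the abstract and in Figure 1 can be illustrated with a minimal Python sketch. It assumes a k-nearest-neighbour classifier scored by leave-one-out accuracy, truncation selection (the better half of the population survives), and a single-gene swap as the mutation. These choices, the function names `knn_loocv_accuracy` and `evolve_gene_subset`, and all default values are illustrative assumptions and may differ from the authors' procedure; only `population_size` and `feature_size` correspond to parameters reported in the tables.

```python
import numpy as np

def knn_loocv_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour majority vote."""
    # Pairwise Euclidean distances between samples (rows of X).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not vote for itself
    correct = 0
    for i in range(len(y)):
        nearest = np.argsort(d[i])[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        correct += int(labels[np.argmax(counts)] == y[i])
    return correct / len(y)

def evolve_gene_subset(X, y, feature_size=30, population_size=10,
                       generations=20, k=3, seed=0):
    """Search for a gene subset with high leave-one-out k-NN accuracy.

    X is a (samples x genes) array, y an array of class labels.
    Selection and mutation operators are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    # Each individual is a set of `feature_size` distinct gene indices.
    population = [rng.choice(n_genes, feature_size, replace=False)
                  for _ in range(population_size)]
    best_genes, best_score = None, -1.0
    for _ in range(generations):
        scores = [knn_loocv_accuracy(X[:, genes], y, k) for genes in population]
        order = np.argsort(scores)[::-1]  # best individual first
        if scores[order[0]] > best_score:
            best_score = scores[order[0]]
            best_genes = population[order[0]].copy()
        # Truncation selection: keep the better half unchanged.
        survivors = [population[i] for i in order[:max(1, population_size // 2)]]
        children = []
        while len(survivors) + len(children) < population_size:
            child = survivors[rng.integers(len(survivors))].copy()
            # Mutation: swap one gene for a gene not already in the subset.
            unused = np.setdiff1d(np.arange(n_genes), child)
            child[rng.integers(feature_size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    return np.sort(best_genes), best_score
```

In this sketch the score is computed on the data passed in, so an out-of-sample estimate would require calling it on training samples only and evaluating the returned subset on held-out samples.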
The accuracy of the baseline system built by randomly selecting genes from 7,070 genes in the leukemia dataset.
| Population size | Feature size | Training data, max [%] | Training data, average [%] | Test data, max [%] | Test data, average [%] |
| 10 | 30 | 89.47 | 85.79 | 79.41 | 64.41 |
| 10 | 50 | 89.47 | 84.21 | 76.47 | 63.53 |
| 30 | 30 | 94.74 | 98.42 | 85.29 | 68.82 |
| 30 | 50 | 97.37 | 96.05 | 73.53 | 67.35 |
| 50 | 30 | 100.00 | 98.42 | 85.29 | 70.29 |
| 50 | 50 | 100.00 | 98.42 | 85.29 | 72.64 |
The average accuracy using out-of-sample prediction on the 34 leukemia test samples. An asterisk (*) indicates that the algorithm found one or more perfect predictors. The highest accuracy is written in bold.
Accuracy of each ranking method on the test data (out-of-sample) [%]:
| Population size | Feature size | R1 | R2 | R3 | R4 | R5 | R6 |
| 10 | 30 | 97.35* | 95.29 | 93.82 | 92.94* | 93.53 | 94.12 |
| 10 | 50 | 95.59 | 94.71* | 93.82 | 93.53 | 95.00 | |
| 30 | 30 | 96.74* | 92.65 | 94.41 | 95.00* | 94.71 | 93.82 |
| 30 | 50 | 97.06* | 95.00 | 95.00 | 93.82 | 95.30 | 93.82 |
| 50 | 30 | 97.35* | 93.82 | 94.71* | 92.06* | 94.12* | 93.82* |
| 50 | 50 | 96.17 | 93.82 | 92.65 | 92.35 | 94.12 | 94.71 |
Abbreviations: R1. Information gain; R2. Twoing rule; R3. Gini index; R4. Sum minority; R5. Max minority; R6. Sum of variances.
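Two of the ranking criteria named above, information gain (R1) and the Gini index (R3), can be illustrated with a short Python sketch that scores a single gene by the best threshold split of its expression values. The sketch uses the textbook definitions of entropy and Gini impurity; the function names and the choice of midpoints between distinct expression values as candidate thresholds are assumptions for illustration and may differ from the software used in the paper.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples in each class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    """Shannon entropy (bits) of the class distribution."""
    p = class_proportions(labels)
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity of the class distribution."""
    p = class_proportions(labels)
    return float(1.0 - (p ** 2).sum())

def split_score(expression, labels, threshold, impurity):
    """Reduction in impurity when samples are split at `threshold`."""
    left = labels[expression <= threshold]
    right = labels[expression > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    w_left = len(left) / len(labels)
    children = w_left * impurity(left) + (1 - w_left) * impurity(right)
    return impurity(labels) - children

def best_split_score(expression, labels, impurity):
    """Best impurity reduction over midpoints between distinct values."""
    values = np.unique(expression)
    thresholds = (values[:-1] + values[1:]) / 2
    return max((split_score(expression, labels, t, impurity)
                for t in thresholds), default=0.0)

# Hypothetical gene whose expression separates two classes perfectly.
expression = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])
info_gain = best_split_score(expression, labels, entropy)  # 1.0 bit
gini_gain = best_split_score(expression, labels, gini)     # 0.5
```

Genes would then be ranked by sorting these scores in descending order, with one score per gene and per criterion.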
The average accuracy of the GA/KNN classifier using out-of-sample prediction on the 34 leukemia test samples.
Accuracy of each ranking method on the test data (out-of-sample) [%]:
| Baseline | R1 | R2 | R3 | R4 | R5 | R6 |
| 67.06 | 98.24 | 95.29 | 93.82 | 95.00 | 94.71 | 95.00* |
The average accuracy for five sets of parameters and six ranking methods on the NCI60 data. No perfect predictors were found by the algorithm. The highest accuracy is written in bold.
Accuracy of each ranking method on the whole dataset (LOOCV) [%]:
| Population size | Feature size | R1 | R2 | R3 | R4 | R5 | R6 |
| 10 | 30 | 66.72 | 63.77 | 60.00 | 54.43 | 62.78 | 69.34 |
| 10 | 50 | 67.86 | 62.78 | 61.63 | 52.62 | 62.62 | 65.90 |
| 30 | 30 | 72.29 | 72.02 | 65.90 | 74.26 | 75.41 | |
| 30 | 50 | 73.44 | 72.46 | 71.15 | 63.11 | 73.44 | 73.93 |
| 50 | 50 | 75.08 | 72.29 | 71.96 | 71.97 | 73.77 | 74.16 |
Figure 3. A plot of Z-scores for 100 ranked genes on the leukemia dataset.
Figure 4. A plot of Z-scores for 100 ranked genes on the NCI60 dataset.
Top-ranking genes by Z-score (top 24 genes, this paper), by TNoM score (top 30 genes [13]¹), and by differential expression in AML or in ALL (top 25 genes [14]²). Because the Z-score can give genes equal scores, a rank of 1 can be assigned to several genes: in our study 12 genes are ranked 1, and in that of [13] 3 are ranked 1. The lowest place in a total ordering of the genes is given in parentheses, e.g. (≤12). Genes ranked in column 5 cannot appear in column 6, and vice versa, as indicated by (-). Otherwise, a blank entry indicates a ranking outside the top 24, 30 or 25 genes respectively.
| Gene | Description | Rank by Z-score | Rank by TNoM¹ | Rank in AML² | Rank in ALL² |
| M23197 | Myeloid cell surface antigen CD33 | 1 (≤12) | 1 (≤3) | 22 (≤22) | - |
| X04145 | T-cell surface glycoprotein CD3 | 1 (≤12) | | | |
| M31211 | Myosin light chain | 1 (≤12) | 6 (≤20) | - | 4 (≤4) |
| M31303 | Leukemia-associated phosphoprotein | 2 (≤13) | 7 (≤32) | - | 11 (≤13) |
| U50136 | Leukotriene C4 | 3 (≤24) | 7 (≤32) | 3 (≤3) | - |
| M28170 | B-lymphocyte surface antigen CD19 | 3 (≤24) | | | |
| J04132 | T-cell surface glycoprotein CD3 | 3 (≤24) | | | |
| X95735 | Zyxin | | 1 (≤3) | 6 (≤6) | - |
| M55150 | Fumarylacetoacetate | | 6 (≤20) | 1 (≤1) | - |
| X59417 | Proteasome iota chain | | 6 (≤20) | - | 3 (≤3) |
| U22376 | C-myb | | | - | 1 (≤1) |
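The "rank (≤place)" notation used in this table can be reproduced with a short Python sketch. It assumes that tied genes share a rank, that ranks increase by one per distinct score, and that the figure in parentheses is the number of genes placed at or above that rank. This reading is inferred from the table entries (12 genes at rank 1, then 2 (≤13), then 3 (≤24)); the function name and gene labels are hypothetical.

```python
from itertools import groupby

def rank_with_ties(scores):
    """Map each gene to (rank, lowest_place) given a dict of gene -> score.

    Genes with equal scores share a rank; lowest_place is the worst
    position the gene could take in a total ordering of all genes.
    Ties are detected by exact equality of scores (an assumption).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    result = {}
    place = 0
    groups = groupby(ordered, key=lambda item: item[1])
    for rank, (_, group) in enumerate(groups, start=1):
        genes = [gene for gene, _ in group]
        place += len(genes)  # cumulative count of genes ranked so far
        for gene in genes:
            result[gene] = (rank, place)
    return result

# Hypothetical scores: two genes tie for first, two tie for third.
example = rank_with_ties(
    {"gene_a": 5.0, "gene_b": 5.0, "gene_c": 4.0, "gene_d": 3.0, "gene_e": 3.0}
)
for gene, (rank, place) in example.items():
    print(f"{gene}: {rank} (≤{place})")
# gene_a: 1 (≤2)
# gene_b: 1 (≤2)
# gene_c: 2 (≤3)
# gene_d: 3 (≤5)
# gene_e: 3 (≤5)
```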