Stuart G Baker, Barnett S Kramer.
Abstract
BACKGROUND: The goal of most microarray studies is either the identification of genes that are most differentially expressed or the creation of a good classification rule. The disadvantage of the former is that it ignores the importance of gene interactions; the disadvantage of the latter is that it often does not provide a sufficient focus for further investigation because many genes may be included by chance. Our strategy is to search for classification rules that perform well with few genes and, if they are found, identify genes that occur relatively frequently under multiple random validation (random splits into training and test samples).
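The strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the selection criterion (single-gene AUC in the training sample), the classification rule (a signed sum of the selected genes), the function names, and the synthetic data are all assumptions made for illustration. The sketch repeatedly splits the data at random, selects a few genes in each training sample, scores the rule in the test sample, and counts how often each gene is selected.

```python
import random
from collections import Counter

def auc(pos, neg):
    """Probability that a class-1 score exceeds a class-0 score (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_genes(X, y, idx, k):
    """Assumed criterion: keep the k genes with the best single-gene AUC in training.

    Returns (gene, sign) pairs; sign is -1 when lower expression indicates class 1.
    """
    ranked = []
    for g in range(len(X[0])):
        a = auc([X[i][g] for i in idx if y[i] == 1],
                [X[i][g] for i in idx if y[i] == 0])
        ranked.append((abs(a - 0.5), g, 1 if a >= 0.5 else -1))
    ranked.sort(reverse=True)
    return [(g, s) for _, g, s in ranked[:k]]

def rule_score(row, genes):
    """Assumed rule: signed sum of the selected genes (no rescaling of genes)."""
    return sum(s * row[g] for g, s in genes)

def multiple_random_validation(X, y, k, n_splits, train_frac, seed=0):
    """Return (selection frequency per gene, list of test-sample AUCs)."""
    rng = random.Random(seed)
    freq, aucs = Counter(), []
    for _ in range(n_splits):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        cut = int(round(train_frac * len(y)))
        train, test = idx[:cut], idx[cut:]
        # Skip splits in which either sample is missing one of the two classes.
        if len({y[i] for i in train}) < 2 or len({y[i] for i in test}) < 2:
            continue
        genes = select_genes(X, y, train, k)
        freq.update(g for g, _ in genes)
        aucs.append(auc([rule_score(X[i], genes) for i in test if y[i] == 1],
                        [rule_score(X[i], genes) for i in test if y[i] == 0]))
    return freq, aucs

# Synthetic data only: 40 samples, 30 genes, genes 0 and 1 carry the signal.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(3.0 * y[i] if g < 2 else 0.0, 1.0) for g in range(30)]
     for i in range(40)]

freq, aucs = multiple_random_validation(X, y, k=2, n_splits=200, train_frac=0.5)
print("most frequently selected genes:", freq.most_common(5))
print("mean test AUC:", sum(aucs) / len(aucs))
```

In this toy setting the two informative genes dominate the frequency count, which is the pattern the strategy looks for. With real expression data the genes would likely need to be put on a common scale before being summed.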
Year: 2006 PMID: 16959042 PMCID: PMC1574352 DOI: 10.1186/1471-2105-7-407
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of studies
| Data set | Number of genes | Number per class |
| Colon cancer [11] | 2000 | 22 normal |
| Leukemia [12] | 7219 | 47 acute lymphoblastic leukemia |
| Medulloblastoma [13] | 7129 | 39 survivors over two years |
| Breast cancer [14] | 7129 | 25 estrogen receptor positive |
Figure 1. Smoothed ROC curves in the test sample, derived from multiple splitting into training and test samples. Graphs depict 40 randomly selected ROC curves out of 1000 splits. AUC is the mean area under the ROC curve from 1000 splits (95% confidence interval). FPR is false positive rate (one minus specificity) and TPR is true positive rate (sensitivity).
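The quantities in this caption can be sketched in a few lines. This is a hedged illustration, not the authors' procedure: it uses an empirical (unsmoothed) ROC curve, and it assumes the 95% interval is taken from the 2.5th and 97.5th percentiles of the AUCs across splits, which this excerpt does not state. All function names and the toy numbers are hypothetical.

```python
import statistics

def roc_points(scores_pos, scores_neg):
    """Empirical ROC: (FPR, TPR) pairs from sweeping a threshold over all scores."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores_pos) | set(scores_neg), reverse=True):
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)  # sensitivity
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)  # 1 - specificity
        points.append((fpr, tpr))
    return points

def trapezoid_auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def summarize_aucs(aucs):
    """Mean AUC plus an assumed percentile-based 95% interval across splits."""
    cuts = statistics.quantiles(aucs, n=40, method="inclusive")
    return statistics.mean(aucs), cuts[0], cuts[-1]

# Toy scores for a single test sample (made-up values).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
one_auc = trapezoid_auc(roc_points(pos, neg))

# Toy AUCs standing in for the results of repeated random splits (made-up values).
mean_auc, lower, upper = summarize_aucs([0.80, 0.85, 0.90, 0.95, 1.00])
print(one_auc, mean_auc, lower, upper)
```

The trapezoidal area of the empirical curve equals the fraction of (positive, negative) pairs that are ranked correctly, so either form can be used to check the other.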
Estimates of area under ROC curve (AUC) and 95% confidence intervals
| | | Number of genes in classification rule | | | |
| Data set | Percent in training sample | 1 gene | 5 genes | 20 genes | 50 genes |
| Colon cancer | 50% | .77 (.55 to .92) | .82 (.62 to .95) | .84 (.66 to .95) | .85 (.69 to .95) |
| | 80% | .86 (.59 to 1) | .90 (.69 to 1) | .89 (.69 to 1) | .90 (.70 to 1) |
| Leukemia | 50% | .90 (.72 to 1) | .95 (.84 to .99) | .97 (.90 to .99) | .98 (.93 to 1) |
| | 80% | .95 (.76 to 1) | .97 (.84 to 1) | .99 (.91 to 1) | .99 (.93 to 1) |
| Medulloblastoma | 50% | .60 (.50 to .77) | .62 (.50 to .78) | .65 (.50 to .79) | .67 (.51 to .82) |
| | 80% | .65 (.50 to .88) | .69 (.50 to .88) | .75 (.50 to .94) | .78 (.50 to .97) |
| Breast cancer | 50% | .81 (.58 to .94) | .88 (.71 to .97) | .91 (.75 to .99) | .91 (.76 to .99) |
| | 80% | .85 (.62 to 1) | .93 (.78 to 1) | .96 (.80 to 1) | .95 (.78 to 1) |
Comparison of related methods.
| Authors | Training sample | Test sample | Random aspect | Results |
| Michiels et al, 2005 [2] | (1) Selected genes most correlated with prognosis, | Used | Test and training sample splits in entire data set. | (1) Misclassification rate for test samples |
| Ma et al, 2006 [7] | (1) Split into training-training sample and training-test sample, | Used | Training-training and training-test samples (i.e. the cross-validation and evaluation is repeated) | (1) Area under ROC curve for test samples, |
| Li et al, 2004 [8] | (1) Split into training-training sample and training-test sample, | Not used | Resampling for training-training samples and training-test samples. | (1) Relevancy intensity, which equals frequencies of genes selected in training sample when weights equal 1. |
| Proposed method | (1) Selected genes with highest individual classification performance | Used | Test and training sample splits in entire data set. | (1) ROC curve and area under ROC curve for test samples with emphasis on comparing many versus few genes, |