Weida Tong, Qian Xie, Huixiao Hong, Leming Shi, Hong Fang, Roger Perkins, Emanuel F Petricoin.
Abstract
Class prediction using "omics" data is playing an increasing role in toxicogenomics, diagnosis/prognosis, and risk assessment. These data are usually noisy and are represented by relatively few samples and a very large number of predictor variables (e.g., genes in DNA microarray data or m/z peaks in mass spectrometry data). These characteristics make it important to assess potential random correlation and overfitting of noise in any classification model based on omics data. We present a novel classification method, decision forest (DF), for class prediction using omics data. DF combines the results of multiple heterogeneous but comparable decision tree (DT) models to produce a consensus prediction. The method is less prone to overfitting of noise and chance correlation. A DF model was developed to predict the presence of prostate cancer using a proteomic data set generated from surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The degree of chance correlation and the prediction confidence of the model were rigorously assessed by extensive cross-validation and randomization testing. Comparison of model prediction with imposed random correlation demonstrated the biologic relevance of the model and the reduction of overfitting in DF. Furthermore, one of two confidence levels (high or low) was assigned to each prediction, and most misclassifications fell in the low-confidence region. For the high-confidence predictions, the model achieved 99.2% sensitivity and 98.2% specificity. The model also identified a list of significant peaks that could be useful for biomarker identification. DF should be equally applicable to other omics data such as gene expression data or metabolomic data. The DF algorithm is available upon request.
Year: 2004 PMID: 15598613 PMCID: PMC1247659 DOI: 10.1289/txg.7109
Source DB: PubMed Journal: Environ Health Perspect ISSN: 0091-6765 Impact factor: 9.031
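The consensus step described in the abstract (several DT models combined into one prediction, with a high or low confidence level attached) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the tree outputs are combined or where the confidence boundary lies, so averaging per-tree probabilities, the `cutoff` of 0.5, and the `hc_margin` of 0.2 are all assumptions chosen for illustration, and the function name is hypothetical.

```python
from statistics import mean

def consensus_predict(tree_probs, cutoff=0.5, hc_margin=0.2):
    """Combine per-tree class probabilities for one sample.

    tree_probs: probability of the positive class from each decision tree.
    cutoff and hc_margin are illustrative assumptions, not values from the paper.
    """
    p = mean(tree_probs)                      # consensus probability (assumed: plain average)
    label = 1 if p >= cutoff else 0           # consensus class call
    # Predictions far from the cutoff are treated as high confidence.
    confidence = "high" if abs(p - cutoff) >= hc_margin else "low"
    return p, label, confidence

# Four trees that agree strongly, then four trees that disagree.
strong = consensus_predict([0.90, 0.80, 0.95, 0.85])
split = consensus_predict([0.60, 0.40, 0.55, 0.50])
```

In this sketch a sample on which the trees disagree lands near the cutoff and is flagged as low confidence, which is the kind of region where the abstract reports most misclassifications.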
Summary of the four DT models combined for developing the DF model (n = number of misclassifications).
| | DT model 1 ( | DT model 2 ( | DT model 3 ( | DT model 4 ( |
|---|---|---|---|---|
| Variables ( | 9,656 | 8,067 | 6,542 | 7,692 |
| | 8,446 | 8,356 | 7,934 | 6,756 |
| | 5,074 | 5,457 | 7,195 | 9,593 |
| | 6,797 | 2,144 | 4,497 | 9,456 |
| | 8,291 | 7,885 | 4,080 | 5,978 |
| | 9,720 | 7,024 | 6,199 | 3,780 |
| | 3,486 | 7,771 | 7,481 | 2,794 |
| | 4,191 | 3,897 | 5,586 | 7,844 |
| | 4,653 | 4,757 | 6,099 | 5,113 |
| | 6,890 | 7,070 | 28,143 | |
| | 2,014 | 24,400 | 2,982 | |
| | 9,149 | 2,887 | 6,443 | |
| | 7,054 | 7,820 | | |
| | 4,475 | 4,580 | | |
| | 4,537 | | | |
| | 7,409 | | | |
| | 7,054 | | | |
Figure 1. Plot of misclassifications versus the number of DT models to be combined in DF.
Figure 2. Prediction distribution in the 2,000-L10O process: real data set (A) and 2,000 pseudo-data sets (B) generated from a randomization test.
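The randomization test behind Figure 2 compares results on the real data with results on pseudo-data sets. A common way to build such pseudo-data sets is to shuffle the class labels, and the Python sketch below assumes that approach; the source does not spell out the procedure. The names `randomization_test`, `accuracy`, and `score_fn` are hypothetical. In a real analysis `score_fn` would retrain and cross-validate the model on each label assignment, whereas the toy usage here substitutes fixed predictions so the example stays self-contained.

```python
import random

def accuracy(predicted, labels):
    """Fraction of samples whose predicted class matches the label."""
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

def randomization_test(labels, score_fn, n_perm=2000, seed=0):
    """Score the real labels, then score n_perm label-shuffled pseudo-data sets."""
    rng = random.Random(seed)
    real_score = score_fn(labels)
    null_scores = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks any real link between samples and class
        null_scores.append(score_fn(shuffled))
    # Empirical p-value with a +1 correction so it is never exactly zero.
    exceed = sum(s >= real_score for s in null_scores)
    p_value = (exceed + 1) / (n_perm + 1)
    return real_score, null_scores, p_value

# Toy usage: fixed predictions stand in for a trained model's output.
predicted = [0] * 10 + [1] * 10
labels = [0] * 10 + [1] * 10
real, null, p = randomization_test(labels, lambda y: accuracy(predicted, y), n_perm=200)
```

If the real score sits well outside the distribution of `null_scores`, chance correlation is an unlikely explanation for the model's performance.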
Figure 3. Distribution of true/false predictions for the left-out samples over 10 equal-probability bins in the 2,000-L10O process.
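The binning summarized in Figure 3 can be sketched in Python as follows. This sketch assumes that "equal-probability bins" means ten equal-width intervals of the predicted probability between 0 and 1, and that a prediction counts as true when the class called at a 0.5 cutoff matches the known label; both points are assumptions, and `bin_predictions` is a hypothetical name.

```python
def bin_predictions(probs, labels, n_bins=10, cutoff=0.5):
    """Tally true and false predictions per probability bin.

    probs: predicted probability of the positive class for each left-out sample.
    labels: known class (0 or 1) for the same samples.
    """
    counts = [{"true": 0, "false": 0} for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        predicted = 1 if p >= cutoff else 0
        counts[idx]["true" if predicted == y else "false"] += 1
    return counts

# Two confident correct calls and two borderline calls, one of them wrong.
counts = bin_predictions([0.05, 0.95, 0.55, 0.45], [0, 1, 0, 0])
```

Under these assumptions, errors that cluster in the middle bins correspond to the low-confidence region.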
Comparison of statistics between DF and DT models in prediction of the left-out samples in the 2,000 L10O runs.
| Prediction accuracy | DF (%) | DT (%) |
|---|---|---|
| Overall accuracy | 94.7 | 89.4 |
| Accuracy in HC region | 98.7 | 90.7 |
| Accuracy in LC region | 78.9 | 63.8 |
List of m/z peaks used more than 10,000 times in the 2,000-L10O process, where 23 peaks are used in fitting with p < 0.001.
| m/z peak | Frequency | p-Value |
|---|---|---|
| 7,934 | 30,203 | < 0.001 |
| 9,149 | 26,482 | < 0.001 |
| 7,984 | 25,171 | < 0.001 |
| 8,296 | 24,793 | < 0.001 |
| 3,897 | 23,754 | < 0.001 |
| 9,720<sup>a,c</sup> | 22,630 | < 0.001 |
| 7,776 | 21,723 | 0.003 |
| 7,024<sup>a,c</sup> | 21,718 | < 0.001 |
| 5,074 | 20,800 | < 0.001 |
| 8,446 | 20,620 | < 0.001 |
| 9,656<sup>a,c</sup> | 20,479 | < 0.001 |
| 6,542<sup>a,c</sup> | 20,219 | < 0.001 |
| 8,067<sup>a,c</sup> | 20,058 | < 0.001 |
| 7,692 | 19,982 | 0.004 |
| 6,797<sup>a,c</sup> | 19,587 | < 0.001 |
| 8,356<sup>a,c</sup> | 19,429 | < 0.001 |
| 7,054 | 19,333 | 0.010 |
| 6,099 | 19,265 | 0.004 |
| 5,586 | 18,103 | < 0.001 |
| 7,820<sup>a,c</sup> | 17,918 | 0.359 |
| 6,756 | 17,668 | < 0.001 |
| 9,593 | 17,615 | < 0.001 |
| 7,844 | 17,611 | 0.089 |
| 4,191 | 17,387 | < 0.001 |
| 3,486 | 17,290 | < 0.001 |
| 4,451 | 17,041 | 0.459 |
| 4,079<sup>a,c</sup> | 16,790 | 0.020 |
| 9,456 | 16,767 | < 0.001 |
| 4,653 | 16,674 | 0.002 |
| 7,195 | 15,832 | < 0.001 |
| 7,885<sup>a,c</sup> | 15,388 | < 0.001 |
| 8,277 | 15,388 | < 0.001 |
| 6,072 | 15,093 | < 0.001 |
| 3,963<sup>b,c</sup> | 14,434 | < 0.001 |
| 3,780 | 14,139 | 0.014 |
| 4,291 | 13,540 | < 0.001 |
| 4,102 | 13,294 | 0.001 |
| 4,858 | 13,076 | 0.003 |
| 6,949<sup>b,c</sup> | 12,555 | < 0.001 |
| 3,280 | 11,808 | < 0.001 |
| 6,991<sup>b,c</sup> | 11,281 | 0.122 |
| 2,144 | 11,110 | < 0.001 |
| 9,100 | 10,578 | < 0.001 |
| 7,652 | 10,159 | 0.005 |
| 5,457 | 10,139 | < 0.001 |
| 6,914 | 10,073 | < 0.001 |

<sup>a</sup>Used in fitting.
<sup>b</sup>Not used in fitting.
<sup>c</sup>Reported by Qu et al. (2002).
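A usage-frequency list like the one above can be produced by tallying which peaks the trees select across repeated cross-validation runs. The Python sketch below is one hedged way to do that. It assumes each run yields a forest of trees and each tree is represented by the list of peaks it uses, and it counts every listed occurrence; the source does not describe the exact counting rule, and `peak_usage` is a hypothetical name.

```python
from collections import Counter

def peak_usage(runs, min_count=1):
    """Count how often each m/z peak is used across all trees in all runs.

    runs: list of runs; each run is a list of trees; each tree is a list of peaks.
    Returns (peak, count) pairs with count >= min_count, most frequent first.
    """
    counts = Counter()
    for trees in runs:        # one entry per cross-validation run
        for peaks in trees:   # one entry per tree in that run's forest
            counts.update(peaks)
    return [(peak, n) for peak, n in counts.most_common() if n >= min_count]

# Toy input: two runs with made-up tree contents.
runs = [
    [[7934, 9149], [7934]],
    [[7934, 3897]],
]
frequent = peak_usage(runs, min_count=2)
```

Raising `min_count` plays the role of the 10,000-use threshold in the table caption.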