Michael Wagner, Dayanand N Naik, Alex Pothen, Srinivas Kasukurti, Raghu Ram Devineni, Bao-Ling Adam, O John Semmes, George L Wright.
Abstract
BACKGROUND: Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates.
Year: 2004 PMID: 15113409 PMCID: PMC406491 DOI: 10.1186/1471-2105-5-26
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Cross-validation classification accuracy (in percent) of various classification methods on the full four-class prostate cancer dataset using various numbers of peaks. Numbers are average observed accuracies over 100 runs with randomized 90/10 splits into training and test sets, respectively. The numbers in parentheses are the corresponding standard deviations.
| Method | 10 peaks | 15 peaks | 20 peaks | 25 peaks | 30 peaks | 35 peaks | 50 peaks | 70 peaks |
| Quadr. Discr. | 74.7 (7.4) | 74.7 (9.6) | 74.1 (8.4) | 74.7 (7.1) | 78.2 (6.8) | 77.8 (7.3) | 78.7 (6.6) | 76.8 (7.1) |
| Nonpar. (Kernel) | 76.7 (7.1) | 77.4 (8.4) | 77.7 (6.9) | 78.6 (6.6) | 80.0 (6.3) | 79.9 (7.3) | 78.1 (6.5) | 76.1 (7.6) |
| kNN | 73.4 (7.4) | 76.4 (6.9) | 76.9 (6.0) | 76.6 (6.1) | 75.8 (6.7) | 77.2 (6.9) | 73.9 (7.5) | 69.8 (6.7) |
| Fisher Linear | 72.4 (7.3) | 77.3 (6.9) | 80.8 (6.5) | 80.1 (5.8) | 81.8 (6.0) | 84.6 (5.2) | 85.5 (6.1) | 84.3 (5.1) |
| Linear SVM | 75.4 (6.4) | 79.3 (7.4) | 81.7 (7.2) | 81.3 (5.7) | 83.7 (6.8) | 83.1 (6.6) | 83.5 (6.1) | 84.0 (6.2) |
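The evaluation protocol described in the caption (100 runs, each with a fresh random 90/10 split, reporting the mean and standard deviation of the observed accuracies) can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier is a stand-in rather than any of the methods in the table, the data are synthetic, and the names `repeated_split_accuracy`, `fit_centroids` and `predict` are hypothetical.

```python
import random
import statistics


def fit_centroids(X, y):
    """Return a dict mapping each class label to its feature-wise mean."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Pick the class whose centroid is nearest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def repeated_split_accuracy(X, y, n_runs=100, test_fraction=0.1, seed=0):
    """Mean and standard deviation (in percent) over randomized train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(len(X) * test_fraction))
    accuracies = []
    for _ in range(n_runs):
        idx = list(range(len(X)))
        rng.shuffle(idx)  # a new random split on every run
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(100.0 * correct / n_test)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Synthetic four-class data (assumption: 50 samples per class, 5 features).
data_rng = random.Random(1)
X = [[data_rng.gauss(3.0 * c, 1.0) for _ in range(5)] for c in range(4) for _ in range(50)]
y = [c for c in range(4) for _ in range(50)]

mean_acc, sd_acc = repeated_split_accuracy(X, y)
print(f"{mean_acc:.1f} ({sd_acc:.1f})")
```

The printed pair has the same "mean (standard deviation)" layout as the table cells; the values themselves depend entirely on the synthetic data and carry no meaning for the study.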
Details of classification results obtained with Fisher's Linear Discriminator and 20 peaks on the full four-class problem. The overall average classification accuracy (100 runs) is 81%.
| Clinical diagnosis | Predicted: BPH | Predicted: Late Cancer | Predicted: Early Cancer | Predicted: Control |
| BPH | 745 (93.1%) | 55 (6.9%) | 0 (0%) | 0 (0%) |
| Late Cancer | 156 (19.5%) | 531 (66.3%) | 91 (16.0%) | 22 (1.6%) |
| Early Cancer | 99 (12.3%) | 54 (6.8%) | 616 (82.0%) | 31 (1.8%) |
| Control | 92 (11.5%) | 11 (1.4%) | 5 (0.6%) | 692 (86.5%) |
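As a worked check, the overall accuracy quoted in the caption can be recomputed from the counts in this table: it is the sum of the diagonal entries (correct predictions) divided by the total number of predictions. The sketch below uses only the counts shown above; the variable and function names are illustrative.

```python
# Rows: clinical diagnosis; columns: computational prediction.
# Order in both directions: BPH, Late Cancer, Early Cancer, Control.
confusion = [
    [745, 55, 0, 0],
    [156, 531, 91, 22],
    [99, 54, 616, 31],
    [92, 11, 5, 692],
]


def overall_accuracy(matrix):
    """Fraction of all predictions that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


row_totals = [sum(row) for row in confusion]
print(row_totals)  # [800, 800, 800, 800]
print(f"{100 * overall_accuracy(confusion):.2f}%")  # 80.75%
```

The result, 2584 correct out of 3200 predictions, rounds to the 81% stated in the caption.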
Average classification accuracy over 100 runs on data obtained by grouping all control and BPH samples into one class, and all cancer samples into another. Class sizes thus remain approximately balanced. Numbers in parentheses are standard deviations.
| Method (malignant vs. other) | 5 peaks | 8 peaks | 10 peaks | 12 peaks | 15 peaks |
| Quadr. Discr. | 84.1 (5.3) | 85.1 (5.4) | 85.0 (6.1) | 86.1 (6.7) | 86.0 (6.1) |
| Nonpar. (Kernel) | 84.6 (5.2) | 87.1 (5.3) | 88.3 (5.8) | 88.9 (6.1) | 88.1 (6.0) |
| kNN | 89.9 (4.6) | 87.4 (5.6) | 87.5 (5.7) | 88.9 (5.2) | 88.5 (4.6) |
| Fisher Linear | 88.6 (5.9) | 88.4 (5.6) | 87.9 (4.9) | 89.1 (5.4) | 88.0 (5.0) |
| Linear SVM | 89.5 (5.5) | 91.0 (4.8) | 91.9 (4.6) | 91.7 (4.9) | 91.9 (4.7) |
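The two-class problem in this table is obtained by relabeling the four diagnostic groups. A minimal sketch of that regrouping, assuming the class names used in the tables here and a hypothetical helper `to_binary`:

```python
from collections import Counter

# Mapping taken from the caption: control and BPH form one class,
# early and late cancer form the other.
GROUPS = {
    "Control": "other",
    "BPH": "other",
    "Early Cancer": "malignant",
    "Late Cancer": "malignant",
}


def to_binary(labels):
    """Collapse four-class labels into malignant vs. other."""
    return [GROUPS[label] for label in labels]


# Toy label list (not the study data) with equal four-class sizes.
four_class = ["Control", "BPH", "Early Cancer", "Late Cancer"] * 3
print(Counter(to_binary(four_class)))
```

When the four original classes are about equal in size, the two merged classes are as well, which is the balance the caption refers to.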
Linear SVM classification average accuracy results for other pairwise distinctions using varying numbers of peaks.
| Comparison | 5 peaks | 8 peaks | 10 peaks | 12 peaks | 15 peaks |
| BPH vs Control | 96.4 | 96.2 | 96.6 | 96.4 | 97.4 |
| BPH vs E. Cancer | 91.8 | 94.6 | 93.6 | 94.7 | 95.4 |
| BPH vs L. Cancer | 89.1 | 88.1 | 88.9 | 89.7 | 91.7 |
| Control vs E. Cancer | 89.1 | 91.5 | 94.4 | 95.5 | 96.2 |
| Control vs L. Cancer | 88.0 | 88.7 | 88.5 | 90.4 | 90.0 |
Statistics on classification accuracy for the linear SVM averaged over 1000 randomized datasets. 10 cross-validation runs using 15 peaks were performed on each dataset.
| Comparison | Max. accuracy (%) | Median accuracy (%) | 95th percentile (%) |
| BPH vs Control | 70.0 | 51.6 | 59.7 |
| BPH vs E. Cancer | 68.1 | 50.0 | 59.4 |
| BPH vs L. Cancer | 68.1 | 50.0 | 59.7 |
| Control vs E. Cancer | 66.9 | 50.0 | 59.7 |
| Control vs L. Cancer | 65.0 | 51.6 | 59.3 |
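A reference distribution of this kind can be sketched as below. The sketch makes several assumptions that the caption does not confirm: that "randomized" means the class labels are shuffled while the spectra are left unchanged, that each cross-validation run is a random 90/10 split, and that a simple nearest-mean rule may stand in for the linear SVM. All function names and the synthetic data are hypothetical.

```python
import random
import statistics


def holdout_accuracy(X, y, rng, test_fraction=0.1):
    """Accuracy (percent) of a nearest-mean rule on one random train/test split."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = max(1, round(len(X) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    means = {}
    for c in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, means[c])))

    return 100.0 * sum(predict(X[i]) == y[i] for i in test) / n_test


def null_accuracy_summary(X, y, n_datasets=1000, cv_runs=10, seed=0):
    """Summarize accuracies obtained after shuffling the class labels."""
    rng = random.Random(seed)
    null_acc = []
    for _ in range(n_datasets):
        shuffled = list(y)
        rng.shuffle(shuffled)  # breaks any link between features and labels
        runs = [holdout_accuracy(X, shuffled, rng) for _ in range(cv_runs)]
        null_acc.append(statistics.mean(runs))
    return {
        "max": max(null_acc),
        "median": statistics.median(null_acc),
        "p95": statistics.quantiles(null_acc, n=100)[94],  # 95th percentile
    }


# Synthetic two-class data (assumption: 30 samples per class, 15 features).
data_rng = random.Random(2)
X = [[data_rng.gauss(float(c), 1.0) for _ in range(15)] for c in range(2) for _ in range(30)]
y = [c for c in range(2) for _ in range(30)]

print(null_accuracy_summary(X, y, n_datasets=200))
```

An accuracy measured on the true labels can then be compared against the maximum or the 95th percentile of this distribution, which is how the table above relates to the pairwise results reported earlier.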
Figure 1. Accuracy and standard deviation estimates as a function of the number of cross-validation runs (shown, as an example, for the Fisher method with 15 peaks). Significant variability can be observed at the beginning, which motivates the need for a large number of runs in order to arrive at reasonable estimates.
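The curves in the figure correspond to running estimates: the mean and standard deviation of the first k observed accuracies, plotted against k. A minimal sketch, using simulated per-run accuracies rather than the study's results (the function name `running_estimates` and the simulation parameters are assumptions):

```python
import random
import statistics


def running_estimates(accuracies):
    """Return (k, mean, sample std) of the first k accuracies for k = 2..n."""
    estimates = []
    for k in range(2, len(accuracies) + 1):
        head = accuracies[:k]
        estimates.append((k, statistics.mean(head), statistics.stdev(head)))
    return estimates


# Simulated per-run accuracies in percent (illustrative values only).
rng = random.Random(0)
simulated = [rng.gauss(80.0, 6.0) for _ in range(100)]

for k, mean_k, sd_k in running_estimates(simulated)[::20]:
    print(f"runs={k:3d}  mean={mean_k:5.1f}  sd={sd_k:4.1f}")
```

With few runs the estimates move noticeably from one k to the next and settle as k grows, which is the behaviour the caption describes.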