Abstract
BACKGROUND: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.
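As a rough illustration of the filter-style feature selection described above, the sketch below ranks features by a two-sample t-statistic and keeps the top k before any classifier is trained. The toy spectra, the function names, and the choice of the Welch variant of the t-statistic are assumptions made for illustration; this is not the paper's implementation.

```python
from math import sqrt
from statistics import mean, variance


def t_scores(spectra, labels):
    """Absolute Welch t-statistic for every feature (intensity bin)."""
    pos = [s for s, y in zip(spectra, labels) if y == 1]
    neg = [s for s, y in zip(spectra, labels) if y == 0]
    scores = []
    for j in range(len(spectra[0])):
        a = [s[j] for s in pos]
        b = [s[j] for s in neg]
        # Standard error of the difference in class means (unequal variances).
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores


def select_top_k(spectra, labels, k):
    """Indices of the k features with the largest |t|, best first."""
    scores = t_scores(spectra, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 6 spectra x 4 bins; only bin 2 separates the classes.
spectra = [
    [1.0, 5.0, 0.1, 3.0],
    [1.2, 4.8, 0.2, 3.1],
    [0.9, 5.1, 0.1, 2.9],
    [1.1, 5.0, 2.0, 3.0],
    [1.0, 4.9, 2.1, 3.2],
    [1.1, 5.2, 1.9, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]

selected = select_top_k(spectra, labels, k=1)
# Reduced input space: each spectrum keeps only the selected bins.
reduced = [[s[j] for j in selected] for s in spectra]
print(selected, reduced[0])
```

In practice the ranking would be computed on the training folds only, so that the held-out fold does not influence which features are kept.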
Year: 2005 PMID: 15788095 PMCID: PMC1274262 DOI: 10.1186/1471-2105-6-68
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Performance of feature extraction algorithms on five cancer data sets. Both graphs show the balanced accuracy (BACC) score. Top: Results grouped by data set. Bottom: Results grouped by feature extraction algorithm.
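Balanced accuracy is conventionally the mean of sensitivity and specificity. The sketch below computes it, together with sensitivity, specificity, and positive predictive value (PPV), from binary predictions. The label encoding (1 for the positive class, 0 for the negative class), the function name, and the example predictions are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and balanced accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sens = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    spec = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    return {"sens": sens, "spec": spec, "ppv": ppv, "bacc": (sens + spec) / 2}


# Hypothetical predictions: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Unlike plain accuracy, BACC weights both classes equally, which matters when the cancer and control groups differ in size.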
Detailed performance statistics for ovarian cancer data sets. Bold columns represent the mean of the respective performance measure, while columns labeled as (std) correspond to the standard deviation across the three cross-validation folds.
| Algorithm | Corr (std) | BACC (std) | Spec (std) | Sens (std) | PPV (std) |
| No FE | 0.05 | 0.05 | 0.16 | 0.11 | 0.13 |
| PCA | 0.07 | 0.07 | 0.25 | 0.12 | 0.20 |
| PCA/LDA | 0.07 | 0.07 | 0.31 | 0.18 | 0.19 |
| SFS | 0.22 | 0.22 | 0.02 | 0.42 | 0.06 |
| SBS | 0.08 | 0.08 | 0.13 | 0.08 | 0.12 |
| P-test | 0.20 | 0.20 | 0.05 | 0.38 | 0.09 |
| T-test | 0.19 | 0.19 | 0.02 | 0.38 | 0.08 |
| KS-test | 0.22 | 0.22 | 0.09 | 0.35 | 0.28 |
| NSC(20) | 0.19 | 0.19 | 0.06 | 0.32 | 0.29 |
| Boosted | 0.06 | 0.06 | 0.02 | 0.11 | 0.03 |
| Boosted FE | 0.13 | 0.13 | 0.00 | 0.26 | 0.00 |
| Algorithm | Corr (std) | BACC (std) | Spec (std) | Sens (std) | PPV (std) |
| No FE | 0.09 | 0.09 | 0.02 | 0.18 | 0.05 |
| PCA | 0.18 | 0.18 | 0.14 | 0.25 | 0.18 |
| PCA/LDA | 0.02 | 0.02 | 0.10 | 0.06 | 0.09 |
| SFS | 0.03 | 0.03 | 0.03 | 0.05 | 0.04 |
| SBS | 0.15 | 0.15 | 0.08 | 0.23 | 0.12 |
| P-test | 0.03 | 0.03 | 0.03 | 0.06 | 0.03 |
| T-test | 0.02 | 0.02 | 0.05 | 0.02 | 0.04 |
| KS-test | 0.02 | 0.02 | 0.03 | 0.05 | 0.03 |
| NSC(20) | 0.04 | 0.04 | 0.02 | 0.08 | 0.02 |
| Boosted | 0.06 | 0.06 | 0.00 | 0.12 | 0.00 |
| Boosted FE | 0.01 | 0.01 | 0.00 | 0.02 | 0.00 |
| Algorithm | Corr (std) | BACC (std) | Spec (std) | Sens (std) | PPV (std) |
| No FE | 0.14 | 0.12 | 0.07 | 0.20 | 0.05 |
| PCA | 0.05 | 0.03 | 0.03 | 0.07 | 0.02 |
| PCA/LDA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SFS | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
| SBS | 0.14 | 0.13 | 0.10 | 0.17 | 0.07 |
| P-test | 0.02 | 0.03 | 0.05 | 0.01 | 0.03 |
| T-test | 0.07 | 0.04 | 0.05 | 0.13 | 0.01 |
| KS-test | 0.02 | 0.02 | 0.04 | 0.01 | 0.02 |
| NSC(20) | 0.02 | 0.03 | 0.04 | 0.03 | 0.02 |
| Boosted | 0.01 | 0.00 | 0.02 | 0.02 | 0.01 |
| Boosted FE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Detailed performance statistics for prostate cancer data sets. Bold columns represent the mean of the respective performance measure, while columns labeled as (std) correspond to the standard deviation across the three cross-validation folds.
| Algorithm | Corr (std) | BACC (std) | Spec (std) | Sens (std) | PPV (std) |
| No FE | 0.05 | 0.06 | 0.05 | 0.09 | 0.06 |
| PCA | 0.20 | 0.18 | 0.24 | 0.21 | 0.11 |
| PCA/LDA | 0.15 | 0.14 | 0.22 | 0.33 | 0.17 |
| SFS | 0.05 | 0.17 | 0.03 | 0.36 | 0.03 |
| SBS | 0.03 | 0.09 | 0.11 | 0.27 | 0.07 |
| P-test | 0.02 | 0.11 | 0.08 | 0.28 | 0.07 |
| T-test | 0.04 | 0.14 | 0.05 | 0.31 | 0.07 |
| KS-test | 0.04 | 0.14 | 0.08 | 0.35 | 0.05 |
| NSC(20) | 0.04 | 0.10 | 0.12 | 0.31 | 0.07 |
| Boosted | 0.06 | 0.11 | 0.04 | 0.22 | 0.10 |
| Boosted FE | 0.01 | 0.03 | 0.00 | 0.07 | 0.00 |
| Algorithm | Corr (std) | BACC (std) | Spec (std) | Sens (std) | PPV (std) |
| No FE | 0.13 | 0.13 | 0.14 | 0.12 | 0.15 |
| PCA | 0.07 | 0.07 | 0.21 | 0.20 | 0.08 |
| PCA/LDA | 0.03 | 0.03 | 0.07 | 0.04 | 0.06 |
| SFS | 0.03 | 0.03 | 0.14 | 0.07 | 0.13 |
| SBS | 0.15 | 0.15 | 0.15 | 0.15 | 0.17 |
| P-test | 0.06 | 0.06 | 0.15 | 0.05 | 0.12 |
| T-test | 0.06 | 0.06 | 0.10 | 0.09 | 0.08 |
| KS-test | 0.03 | 0.03 | 0.09 | 0.05 | 0.07 |
| NSC(20) | 0.09 | 0.09 | 0.12 | 0.17 | 0.10 |
| Boosted | 0.05 | 0.05 | 0.09 | 0.11 | 0.09 |
| Boosted FE | 0.02 | 0.02 | 0.00 | 0.03 | 0.00 |
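The (std) columns in the tables above summarize how much each measure varies across the three cross-validation folds. A minimal sketch of that summary follows, assuming the sample standard deviation (n - 1 denominator) is intended; the per-fold values and the function name are hypothetical.

```python
from statistics import mean, stdev


def summarize_folds(values):
    """Mean and sample standard deviation of a per-fold performance measure."""
    return mean(values), stdev(values)


# Hypothetical balanced accuracy from three cross-validation folds.
fold_bacc = [0.80, 0.90, 0.85]
avg, std = summarize_folds(fold_bacc)
print(f"{avg:.2f} ({std:.2f})")
```

With only three folds the standard deviation is itself a noisy estimate, so small differences between (std) entries should be read with caution.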
Overall performance comparison. Performance of each feature extraction algorithm, averaged across data sets. Balanced accuracy (BACC) is reported in increasing order.
| Algorithm | Average BACC | OC-H4 | OC-WCX2a | OC-WCX2b | PC-H4 | PC-IMAC-Cu |
| | | 0.712 | 0.682 | 0.893 | 0.516 | 0.619 |
| | | 0.763 | 0.773 | 0.834 | 0.777 | 0.711 |
| | | 0.747 | 0.965 | 0.834 | 0.709 | 0.766 |
| | | 0.621 | 0.944 | 0.973 | 0.736 | 0.764 |
| | | 0.727 | 0.899 | 1.000 | 0.667 | 0.748 |
| | | 0.823 | 0.854 | 0.903 | 0.729 | 0.760 |
| | | 0.763 | 0.944 | 0.975 | 0.728 | 0.773 |
| | | 0.702 | 0.929 | 0.983 | 0.784 | 0.791 |
| | | 0.747 | 0.949 | 0.991 | 0.827 | 0.798 |
| | | 0.884 | 0.914 | 0.982 | 0.810 | 0.826 |
| | | 0.854 | 0.965 | 1.000 | 0.906 | 0.911 |
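The averaging across data sets described in the caption can be sketched as follows. This assumes that the five values in each row of the table above are the per-data-set BACC scores for OC-H4, OC-WCX2a, OC-WCX2b, PC-H4, and PC-IMAC-Cu, in that order; the variable names are illustrative.

```python
from statistics import mean

DATASETS = ["OC-H4", "OC-WCX2a", "OC-WCX2b", "PC-H4", "PC-IMAC-Cu"]

# Rows copied from the table above, one per feature extraction algorithm.
bacc_rows = [
    [0.712, 0.682, 0.893, 0.516, 0.619],
    [0.763, 0.773, 0.834, 0.777, 0.711],
    [0.747, 0.965, 0.834, 0.709, 0.766],
    [0.621, 0.944, 0.973, 0.736, 0.764],
    [0.727, 0.899, 1.000, 0.667, 0.748],
    [0.823, 0.854, 0.903, 0.729, 0.760],
    [0.763, 0.944, 0.975, 0.728, 0.773],
    [0.702, 0.929, 0.983, 0.784, 0.791],
    [0.747, 0.949, 0.991, 0.827, 0.798],
    [0.884, 0.914, 0.982, 0.810, 0.826],
    [0.854, 0.965, 1.000, 0.906, 0.911],
]

# Average BACC per algorithm, taken over the five data sets.
averages = [mean(row) for row in bacc_rows]

# The caption states the rows are sorted by increasing average BACC.
is_increasing = all(a <= b for a, b in zip(averages, averages[1:]))

for avg in averages:
    print(f"{avg:.3f}")
print("increasing order:", is_increasing)
```

Under that column assumption, the recomputed averages come out in increasing order, which is consistent with the ordering stated in the caption.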
Feature set size comparison
| Algorithm | OC-H4 | (+/-) | OC-WCX2a | (+/-) | OC-WCX2b | (+/-) | PC-H4 | (+/-) | PC-IMAC | (+/-) |
| | | 1.15 | | 1.15 | | 1.15 | | 1.00 | | 1.53 |
| | | 195.19 | | 120.75 | | 769.62 | | 162.69 | | 144.24 |
| | | 1.53 | | 3.06 | | 66.97 | | 2.08 | | 0.58 |
| | | 2.08 | | 0.58 | | 0.58 | | 0.58 | | 0.58 |
| | | 1.00 | | 4.36 | | 106.52 | | 1.53 | | 0.58 |
| | | 3.51 | | 0.58 | | 0.58 | | 1.15 | | 5.03 |
Computational cost comparison. Results are presented in CPU seconds, in increasing order. All experiments were conducted using Matlab code on a dual-CPU Athlon 1400+ running Linux.
| Algorithm | Ave. CPU Time | OC-H4 | OC-WCX2a | OC-WCX2b | PC-H4 | PC-IMAC-Cu |
| | | 0.84 | 0.87 | 1.11 | 1.31 | 1.82 |
| | | 1.41 | 1.58 | 4.12 | 2.08 | 2.90 |
| | | 12.69 | 12.08 | 17.41 | 25.57 | 29.21 |
| | | 13.53 | 12.95 | 18.52 | 26.88 | 31.03 |
| | | 25.56 | 24.55 | 29.96 | 25.40 | 31.34 |
| | | 371.73 | 134.56 | 507.62 | 688.22 | 1014.97 |
| | | 622.97 | 623.37 | 645.66 | 639.14 | 718.04 |
| | | 2178.24 | 2175.75 | 2516.42 | 3269.37 | 5683.70 |
| | | 2679.97 | 1336.25 | 1841.57 | 3997.65 | 6928.99 |
| | | 3717.30 | 1345.60 | 5076.20 | 6882.20 | 10149.70 |
| | | 17032.94 | 17244.80 | 29913.61 | 24574.69 | 30908.07 |
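CPU seconds measure processor time consumed by the computation rather than elapsed wall-clock time. The experiments above were run in Matlab; the Python sketch below only illustrates the idea of timing a feature extraction step in CPU seconds. The helper `cpu_seconds` and the toy workload are hypothetical stand-ins.

```python
import time


def cpu_seconds(fn, *args, **kwargs):
    """Run fn and return (result, CPU seconds used by this process)."""
    start = time.process_time()  # process CPU time; excludes time spent sleeping
    result = fn(*args, **kwargs)
    return result, time.process_time() - start


def toy_workload(n):
    """Stand-in for a feature extraction step."""
    return sum(i * i for i in range(n))


result, elapsed = cpu_seconds(toy_workload, 100_000)
print(f"{elapsed:.4f} CPU seconds")
```

Absolute timings depend heavily on hardware and implementation, so the relative ordering of methods is more informative than the raw numbers.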