Abstract
BACKGROUND: Pancreatic cancer is the fourth leading cause of cancer death in the United States. Consequently, identification of clinically relevant biomarkers for the early detection of this cancer type is urgently needed. In recent years, proteomics profiling techniques combined with various data analysis methods have been successfully used to gain critical insights into processes and mechanisms underlying pathologic conditions, particularly as they relate to cancer. However, the high dimensionality of proteomics data, combined with their relatively small sample sizes, poses a significant challenge to current data mining methodology, in which many of the standard methods cannot be applied directly. Here, we propose a novel machine learning framework in which decision tree based classifier ensembles, coupled with feature selection methods, are applied to proteomics data generated from premalignant pancreatic cancer.
Year: 2008 PMID: 18547427 PMCID: PMC2440392 DOI: 10.1186/1471-2105-9-275
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Computational procedure used in this study. In each round of 10-fold cross-validation, the whole dataset was randomly separated into a training set and a test set. Features that significantly differentiate the control class from the disease class are selected using the training set only. The test sets are then classified by decision trees and ensembles using these features. Mass spec: mass spectrometry.
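The per-round procedure in Figure 1 — split, select discriminative features on the training portion only, then classify the held-out test set — can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn, not the authors' original implementation; the dataset dimensions and classifier settings are assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
# Synthetic stand-in for the spectra: 180 samples x 1000 m/z bins.
X = rng.normal(size=(180, 1000))
y = np.repeat([0, 1], 90)
X[y == 1, :5] += 1.0                 # a few informative "peaks"

accs = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=0).split(X, y):
    # Select features on the training fold ONLY, to avoid leakage.
    _, p = ttest_ind(X[tr][y[tr] == 0], X[tr][y[tr] == 1], axis=0)
    top = np.argsort(p)[:10]         # ten smallest p-values
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[tr][:, top], y[tr])
    accs.append(clf.score(X[te][:, top], y[te]))
```

The key point the figure makes is that feature selection happens inside each fold; selecting features on the whole dataset before cross-validation would leak test information into the model.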
Figure 2. Data preprocessing results. Spectrogram ID 2 is used as an example of the data preprocessing procedure. (A) Original spectrogram without any processing; the maximum m/z ratio is 11922.91 and the minimum is 800. (B) Original spectrogram and adjusted baseline. (C) Noise reduction using Gaussian kernel smoothing. (D) Normalization using the area under the curve (AUC).
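The preprocessing steps in Figure 2 can be sketched on a synthetic spectrum. The baseline estimator used here (a rolling minimum) is an assumption, since the paper's exact estimator is not given in this excerpt; the smoothing and AUC normalization follow the caption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, minimum_filter1d
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
mz = np.linspace(800, 11922.91, 2000)        # m/z range shown in Figure 2
raw = np.exp(-((mz - 5800.0) / 40.0) ** 2)   # one synthetic peak near 5800
raw += 0.2 + mz / 40000.0                    # slowly drifting baseline
raw += rng.normal(scale=0.02, size=mz.size)  # measurement noise

# (B) Baseline adjustment: subtract a rolling-minimum baseline (assumed here).
baseline = minimum_filter1d(raw, size=201)
adjusted = raw - baseline
# (C) Noise reduction via Gaussian kernel smoothing.
smoothed = gaussian_filter1d(adjusted, sigma=3)
# (D) Normalization so the area under the curve (AUC) equals 1.
normalized = smoothed / trapezoid(smoothed, mz)
```

AUC normalization makes spectra from different runs comparable by removing differences in total ion intensity.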
Top ten features (m/z ratios) selected by the Student's t-test in our 10-fold cross-validation.
| Rank | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | Round 8 | Round 9 | Round 10 | Most Frequent |
| 1 | 5798.9 | 5798.9 | 5819.8 | 5798.9 | 5819.8 | 5819.8 | 5798.9 | 5798.9 | 11477 | 5798.9 | 5798.9 |
| 2 | 5801.2 | 5819.8 | 5822.1 | 5801.2 | 5822.1 | 5822.1 | 5801.2 | 11541 | 11774 | 5801.2 | 5801.2 |
| 3 | 5819.8 | 5801.2 | 5798.9 | 5819.8 | 5798.9 | 5798.9 | 5819.8 | 11592 | 11472 | 11592 | 5819.8 |
| 4 | 5796.5 | 5822.1 | 5801.2 | 5822.1 | 5801.2 | 11592 | 5822.1 | 5801.2 | 5798.9 | 11597 | 11541 |
| 5 | 5822.1 | 11541 | 11592 | 5829.1 | 11770 | 11597 | 11541 | 11537 | 11481 | 11587 | 11592 |
| 6 | 11422 | 11592 | 11597 | 5831.4 | 11541 | 11587 | 5831.4 | 11546 | 5819.8 | 11541 | 5822.1 |
| 7 | 5817.4 | 11546 | 11541 | 11592 | 11597 | 5801.2 | 11592 | 11597 | 11770 | 11601 | 11597 |
| 8 | 11774 | 11587 | 11601 | 5803.5 | 11592 | 11541 | 11546 | 11774 | 11514 | 5819.8 | 11546 |
| 9 | 11541 | 11537 | 11546 | 11541 | 11601 | 11643 | 5829.1 | 11587 | 11509 | 11546 | 11601 |
| 10 | 11426 | 11569 | 11639 | 5796.5 | 11606 | 11601 | 11597 | 11601 | 5822.1 | 11606 | 11587 |
Rank is determined by the probability that the means of the disease and control groups in the training set differ significantly; m/z ratios with smaller p-values rank higher. The most frequent features are determined by how often each feature appears in the top-ten list across the ten rounds and are ranked by that frequency.
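The ranking described above — a per-feature two-sample t-test, sorted by p-value — can be sketched as follows on synthetic data (the m/z grid and the index of the differential feature are invented for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
mz = np.round(np.linspace(800, 12000, 500), 1)   # hypothetical m/z grid
control = rng.normal(size=(90, 500))             # 90 control spectra
disease = rng.normal(size=(90, 500))             # 90 disease spectra
disease[:, 42] += 1.5                            # one differential peak

# Per-feature two-sample t-test; rank by p-value, smallest first.
_, pvals = ttest_ind(control, disease, axis=0)
order = np.argsort(pvals)
top10 = mz[order[:10]]                           # top ten m/z ratios
```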
Top ten features (m/z ratios) selected by the Wilcoxon rank-sum test in our 10-fold cross-validation.
| Rank | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | Round 8 | Round 9 | Round 10 | Most Frequent |
| 1 | 5798.9 | 5798.9 | 4941.6 | 5801.2 | 5822.1 | 5819.8 | 5801.2 | 5798.9 | 4941.6 | 5801.2 | 5798.9 |
| 2 | 5801.2 | 5801.2 | 5819.8 | 5798.9 | 5819.8 | 5822.1 | 5798.9 | 5801.2 | 11774 | 5798.9 | 5801.2 |
| 3 | 4941.6 | 11472 | 5822.1 | 4941.6 | 5798.9 | 5798.9 | 4941.6 | 11472 | 5798.9 | 5796.5 | 11472 |
| 4 | 5796.5 | 5819.8 | 5801.2 | 5819.8 | 5801.2 | 5801.2 | 5803.5 | 11477 | 11770 | 11472 | 5819.8 |
| 5 | 5819.8 | 11477 | 5798.9 | 9706.1 | 11472 | 11592 | 5822.1 | 11774 | 11477 | 5803.5 | 5822.1 |
| 6 | 5822.1 | 5822.1 | 4943.6 | 5822.1 | 11477 | 11587 | 11472 | 11770 | 11472 | 11592 | 11477 |
| 7 | 4943.6 | 11468 | 11592 | 5803.5 | 11468 | 11472 | 5819.8 | 11541 | 5819.8 | 11541 | 11541 |
| 8 | 11472 | 5796.5 | 11541 | 5796.5 | 11770 | 11541 | 11477 | 11537 | 5822.1 | 5819.8 | 4941.6 |
| 9 | 11774 | 11541 | 11472 | 11472 | 4941.6 | 11774 | 5829.1 | 11468 | 5801.2 | 11477 | 11774 |
| 10 | 11477 | 11481 | 11597 | 9710 | 11774 | 5796.5 | 11541 | 11481 | 11481 | 11468 | 5796.5 |
Rank is determined by the probability that the disease and control groups in the training set differ significantly; m/z ratios with smaller p-values rank higher.
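The Wilcoxon ranking differs from the t-test ranking only in the statistic: a nonparametric rank-sum test is applied per feature. A sketch on synthetic data (scipy's `mannwhitneyu` is equivalent to the two-sample Wilcoxon rank-sum test; the differential feature index is invented):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
control = rng.normal(size=(90, 200))
disease = rng.normal(size=(90, 200))
disease[:, 7] += 1.0              # one differential feature

# Rank features by the rank-sum test p-value, smallest first.
pvals = np.array([mannwhitneyu(control[:, j], disease[:, j],
                               alternative="two-sided").pvalue
                  for j in range(control.shape[1])])
order = np.argsort(pvals)
```

Because it uses ranks rather than means, this test is robust to the heavy-tailed intensity distributions common in mass-spectrometry data, which is why its top features overlap with but do not duplicate the t-test list.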
Ten features (m/z ratios) selected by a genetic algorithm coupled with linear discriminant analysis (LDA) in our 10-fold cross-validation.
| Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | Round 8 | Round 9 | Round 10 |
| 3385 | 4943.6 | 11555 | 1489.9 | 5939.7 | 1859.5 | 5098.2 | 5916 | 9835.1 | 5775.7 |
| 3304.8 | 5775.7 | 1125.2 | 11541 | 5209.5 | 2009.5 | 9578.9 | 2016.7 | 1857.2 | 3833.4 |
| 3186.7 | 4013.8 | 4943.6 | 1644 | 5822.1 | 2951.2 | 3760.5 | 3787.6 | 11940 | 3510.5 |
| 1858.7 | 3915.5 | 3383.7 | 4941.6 | 1063.5 | 11662 | 7553.5 | 5857.1 | 2756.1 | 1857.2 |
| 4256.8 | 1858 | 1528.6 | 5409.1 | 1644.6 | 11546 | 3540.1 | 3727.5 | 1808.1 | 11031 |
| 3790.7 | 1063.1 | 3959.6 | 1936.1 | 1859.5 | 9415.6 | 1860.2 | 1064 | 4532.7 | 5801.2 |
| 4941.6 | 1476.3 | 3726 | 2368.5 | 11463 | 7406.9 | 7966.1 | 5819.8 | 7931.2 | 4318.6 |
| 11027 | 3727.5 | 5829.1 | 3188 | 7592.8 | 6569.5 | 11394 | 9640.4 | 11477 | 11821 |
| 11426 | 11560 | 3188 | 1859.5 | 11472 | 1645.9 | 10183 | 1702.2 | 6511.9 | 5794.2 |
| 7085.3 | 2579.1 | 5949.2 | 3836.4 | 6509.3 | 1411.1 | 9575 | 6506.7 | 4941.6 | 9640.4 |
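A GA/LDA wrapper of this kind can be sketched as a toy example on synthetic data. The population size, operators, and fitness function below are assumptions for illustration; the paper's exact GA settings are not given in this excerpt.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
# Synthetic stand-in: 120 spectra x 60 m/z features, 3 informative.
X = rng.normal(size=(120, 60))
y = np.repeat([0, 1], 60)
X[y == 1, :3] += 1.0

N_SEL, POP, GENS = 10, 20, 12

def fitness(ind):
    # Fitness = cross-validated LDA accuracy on the candidate subset.
    return cross_val_score(LinearDiscriminantAnalysis(),
                           X[:, ind], y, cv=3).mean()

def repair(ind):
    # Keep individuals as exactly N_SEL distinct feature indices.
    ind = set(int(i) for i in ind)
    while len(ind) < N_SEL:
        ind.add(int(rng.integers(X.shape[1])))
    return np.array(sorted(ind)[:N_SEL])

pop = [rng.choice(X.shape[1], N_SEL, replace=False) for _ in range(POP)]
for _ in range(GENS):
    scores = [fitness(ind) for ind in pop]
    order = np.argsort(scores)[::-1]
    survivors = [pop[i] for i in order[:POP // 2]]          # selection
    children = []
    while len(survivors) + len(children) < POP:
        a, b = rng.choice(len(survivors), 2, replace=False)
        pool = np.union1d(survivors[a], survivors[b])       # crossover
        child = rng.choice(pool, N_SEL, replace=False)
        if rng.random() < 0.3:                              # mutation
            child[rng.integers(N_SEL)] = rng.integers(X.shape[1])
        children.append(repair(child))
    pop = survivors + children

best = max(pop, key=fitness)
```

Unlike the filter methods above, this wrapper scores whole feature subsets jointly, which explains why the GA-selected features in the table are far more variable across rounds than the t-test or Wilcoxon selections.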
Classification results using features selected by the Student's t-test.
| Algorithm | Accuracy | TP rate | FP rate | TN rate | FN rate | Sensitivity | Specificity | Precision | F-measure | RMSE |
| C4.5 | 0.6444 | 0.99 | 0.79 | 0.21 | 0.01 | 0.99 | 0.21 | 0.61 | 0.76 | 0.4687 |
| Random Forest | 0.6500 | 0.79 | 0.53 | 0.48 | 0.21 | 0.79 | 0.48 | 0.65 | 0.71 | 0.4569 |
| Bagging | 0.6833 | 0.78 | 0.44 | 0.56 | 0.22 | 0.78 | 0.56 | 0.69 | 0.73 | 0.4285 |
| Logitboost | 0.6889 | 0.83 | 0.49 | 0.51 | 0.17 | 0.83 | 0.51 | 0.69 | 0.75 | 0.4402 |
| Stacking | 0.6444 | 0.99 | 0.79 | 0.21 | 0.01 | 0.99 | 0.21 | 0.61 | 0.76 | 0.4761 |
| Adaboost | 0.6444 | 0.77 | 0.51 | 0.49 | 0.23 | 0.77 | 0.49 | 0.69 | 0.69 | 0.4412 |
| Multiboost | 0.6889 | 0.81 | 0.46 | 0.54 | 0.19 | 0.81 | 0.54 | 0.70 | 0.74 | 0.5175 |
| Logistic | 0.7500 | 0.79 | 0.30 | 0.70 | 0.21 | 0.79 | 0.70 | 0.78 | 0.78 | 0.4224 |
| Naivebayes | 0.6833 | 0.64 | 0.26 | 0.74 | 0.36 | 0.64 | 0.74 | 0.76 | 0.68 | 0.5289 |
| Bayesnet | 0.6722 | 0.63 | 0.28 | 0.73 | 0.37 | 0.63 | 0.73 | 0.74 | 0.67 | 0.5308 |
| Neural Network | 0.7000 | 0.70 | 0.30 | 0.70 | 0.30 | 0.70 | 0.70 | 0.75 | 0.72 | 0.4517 |
| RBFnet | 0.6722 | 0.76 | 0.44 | 0.56 | 0.24 | 0.76 | 0.56 | 0.69 | 0.71 | 0.4632 |
| SVM | 0.6944 | 0.71 | 0.33 | 0.68 | 0.29 | 0.71 | 0.68 | 0.74 | 0.71 | 0.5489 |
TP rate: true positive rate; FP rate: false positive rate; TN rate: true negative rate; FN rate: false negative rate; RMSE: root mean squared error; RBFnet: radial basis function network; SVM: support vector machine.
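The columns in these tables are all derived from the four confusion-matrix counts, which is why sensitivity duplicates the TP rate and specificity duplicates the TN rate. A sketch with hypothetical counts (not the study's actual confusion matrix, which is not given in this excerpt):

```python
def metrics(tp, fp, tn, fn):
    # Relations among the table columns above.
    sens = tp / (tp + fn)                    # sensitivity = TP rate
    spec = tn / (tn + fp)                    # specificity = TN rate
    fp_rate = fp / (fp + tn)                 # = 1 - specificity
    fn_rate = fn / (fn + tp)                 # = 1 - sensitivity
    prec = tp / (tp + fp)                    # precision
    f_measure = 2 * prec * sens / (prec + sens)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return acc, sens, spec, fp_rate, fn_rate, prec, f_measure

# Hypothetical counts for illustration only.
acc, sens, spec, fp_rate, fn_rate, prec, f_measure = metrics(79, 30, 70, 21)
```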
Classification results using features selected by the Wilcoxon rank-sum test.
| Algorithm | Accuracy | TP rate | FP rate | TN rate | FN rate | Sensitivity | Specificity | Precision | F-measure | RMSE |
| C4.5 | 0.6667 | 0.90 | 0.63 | 0.38 | 0.10 | 0.90 | 0.38 | 0.65 | 0.75 | 0.4683 |
| Random Forest | 0.7000 | 0.79 | 0.41 | 0.59 | 0.21 | 0.79 | 0.59 | 0.71 | 0.74 | 0.4401 |
| Bagging | 0.6667 | 0.68 | 0.35 | 0.65 | 0.32 | 0.68 | 0.65 | 0.72 | 0.69 | 0.4484 |
| Logitboost | 0.6833 | 0.76 | 0.41 | 0.59 | 0.24 | 0.76 | 0.59 | 0.70 | 0.73 | 0.4499 |
| Stacking | 0.6667 | 0.93 | 0.66 | 0.34 | 0.07 | 0.93 | 0.34 | 0.64 | 0.76 | 0.4639 |
| Adaboost | 0.6611 | 0.76 | 0.46 | 0.54 | 0.24 | 0.76 | 0.54 | 0.68 | 0.71 | 0.4805 |
| Multiboost | 0.7000 | 0.73 | 0.34 | 0.66 | 0.27 | 0.73 | 0.66 | 0.74 | 0.73 | 0.5187 |
| Logistic | 0.6556 | 0.77 | 0.49 | 0.51 | 0.23 | 0.77 | 0.51 | 0.67 | 0.71 | 0.4362 |
| Naivebayes | 0.6944 | 0.70 | 0.31 | 0.69 | 0.30 | 0.70 | 0.69 | 0.77 | 0.72 | 0.4969 |
| Bayesnet | 0.6778 | 0.73 | 0.39 | 0.61 | 0.27 | 0.73 | 0.61 | 0.71 | 0.71 | 0.5232 |
| Neural Network | 0.6778 | 0.66 | 0.30 | 0.70 | 0.34 | 0.66 | 0.70 | 0.73 | 0.68 | 0.4606 |
| RBFnet | 0.5944 | 0.74 | 0.59 | 0.41 | 0.26 | 0.74 | 0.41 | 0.62 | 0.67 | 0.4556 |
| SVM | 0.6611 | 0.71 | 0.40 | 0.60 | 0.29 | 0.71 | 0.60 | 0.71 | 0.70 | 0.5760 |
Classification results using features selected by the genetic algorithm.
| Algorithm | Accuracy | TP rate | FP rate | TN rate | FN rate | Sensitivity | Specificity | Precision | F-measure | RMSE |
| C4.5 | 0.5944 | 0.61 | 0.43 | 0.58 | 0.39 | 0.61 | 0.58 | 0.64 | 0.62 | 0.5718 |
| Random Forest | 0.6000 | 0.71 | 0.54 | 0.46 | 0.29 | 0.71 | 0.46 | 0.63 | 0.66 | 0.5047 |
| Bagging | 0.6111 | 0.64 | 0.43 | 0.58 | 0.36 | 0.64 | 0.58 | 0.66 | 0.65 | 0.4965 |
| Logitboost | 0.6167 | 0.68 | 0.46 | 0.54 | 0.32 | 0.68 | 0.54 | 0.65 | 0.66 | 0.5153 |
| Stacking | 0.6056 | 0.66 | 0.46 | 0.54 | 0.34 | 0.66 | 0.54 | 0.65 | 0.65 | 0.4892 |
| Adaboost | 0.6167 | 0.67 | 0.45 | 0.55 | 0.33 | 0.67 | 0.55 | 0.65 | 0.65 | 0.5960 |
| Multiboost | 0.6111 | 0.68 | 0.48 | 0.53 | 0.32 | 0.68 | 0.53 | 0.65 | 0.66 | 0.6147 |
| Logistic | 0.6056 | 0.67 | 0.48 | 0.53 | 0.33 | 0.67 | 0.53 | 0.63 | 0.65 | 0.5122 |
| Naivebayes | 0.6000 | 0.76 | 0.60 | 0.40 | 0.24 | 0.76 | 0.40 | 0.62 | 0.67 | 0.5251 |
| Bayesnet | 0.5611 | 0.73 | 0.65 | 0.35 | 0.27 | 0.73 | 0.35 | 0.59 | 0.65 | 0.5110 |
| Neural Network | 0.5944 | 0.61 | 0.43 | 0.58 | 0.39 | 0.61 | 0.58 | 0.65 | 0.62 | 0.5814 |
| RBFnet | 0.6000 | 0.69 | 0.51 | 0.49 | 0.31 | 0.69 | 0.49 | 0.63 | 0.65 | 0.5038 |
| SVM | 0.6333 | 0.72 | 0.48 | 0.53 | 0.28 | 0.72 | 0.53 | 0.66 | 0.68 | 0.5985 |
AUC results of classifiers.
| Algorithm | AUC | Algorithm | AUC | Algorithm | AUC | Algorithm | AUC |
| C4.5 | 0.5625 | Logitboost | 0.8438 | Bayes Net | 0.8563 | RBFnet | 0.9 |
| Random Forest | 0.9375 | Stacking | 0.5625 | Logistic | 0.925 | SVM | 0.7 |
| Random Tree | 0.825 | Adaboost | 0.85 | Neural Network | 0.85 | | |
| Bagging | 0.85 | Multiboost | 0.875 | Naïve Bayes | 0.8875 | | |
AUC: area under the ROC curve.
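AUC summarizes a classifier's ranking quality independently of any decision threshold: it equals the probability that a randomly chosen diseased sample receives a higher score than a randomly chosen control. A sketch with hypothetical classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y = np.repeat([0, 1], 50)
scores = rng.normal(size=100) + 1.5 * y   # hypothetical classifier scores

auc = roc_auc_score(y, scores)
# AUC equals the probability that a random positive outscores a random
# negative (the Mann-Whitney U statistic divided by n_pos * n_neg).
pos, neg = scores[y == 1], scores[y == 0]
auc_rank = (pos[:, None] > neg[None, :]).mean()
```

This rank interpretation explains why an AUC near 0.56 (C4.5, Stacking) is barely better than chance even when the corresponding accuracy exceeds 60%.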