Yaping Fang, Shan Gao, David Tai, C Russell Middaugh, Jianwen Fang.
Abstract
BACKGROUND: Protein aggregation is a significant problem in the biopharmaceutical industry (protein drug stability) and is associated with over 40 human diseases. Although a number of computational models have been developed for predicting aggregation propensity and identifying aggregation-prone regions in proteins, little systematic research has been done to determine which physicochemical properties are relevant to aggregation and how important each of them is. Such studies may not only improve the prediction of peptide aggregation propensities and the identification of aggregation-prone regions in proteins, but also aid in discovering additional mechanisms governing this process.
Year: 2013 PMID: 24165390 PMCID: PMC3819749 DOI: 10.1186/1471-2105-14-314
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of 9 classifiers before and after feature selection
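| Classifier | Ac (all features) | MCC (all features) | Ac (7 features) | MCC (7 features) | Ac (10 features) | MCC (10 features) |
|---|---|---|---|---|---|---|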
| SVM-linear | 0.759 | 0.518 | 0.771 | 0.542 | 0.788 | 0.576 |
| RF | 0.748 | 0.497 | 0.737 | 0.474 | 0.782 | 0.564 |
| GBM | 0.754 | 0.509 | 0.718 | 0.436 | 0.797 | 0.593 |
| RPART | 0.717 | 0.435 | 0.751 | 0.502 | 0.729 | 0.457 |
| NNet | 0.754 | 0.507 | 0.734 | 0.468 | 0.780 | 0.558 |
| PLS | 0.740 | 0.479 | 0.788 | 0.578 | 0.782 | 0.565 |
| KNN | 0.762 | 0.530 | 0.763 | 0.524 | 0.763 | 0.528 |
| NB | 0.731 | 0.465 | 0.743 | 0.488 | 0.790 | 0.581 |
| Ada | 0.754 | 0.509 | 0.740 | 0.479 | 0.779 | 0.558 |
The 7 features were selected by SVM-RFE and the 10 features by RF-IS. Each classifier was evaluated by 10-fold cross-validation on dataset AP1.
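A minimal sketch of how samples can be partitioned for 10-fold cross-validation, using only the Python standard library. The function name `k_fold_indices`, the sample count of 100, and the random seed are illustrative assumptions; the record does not describe how the authors built their folds.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i, test_idx in enumerate(folds):
        # Training set is the union of all folds except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical dataset size, for illustration only.
for train_idx, test_idx in k_fold_indices(100, k=10):
    print(len(train_idx), len(test_idx))
```

Each classifier would be trained on `train_idx` and scored on `test_idx`, and the ten scores would then be averaged.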
Figure 1. Dependency of classification performance on the number of selected features. A) Classification error plotted against the number of features selected by SVM-RFE. B) OOB error plotted against the number of features selected by RF-IS. “Class error” equals 1 minus classification accuracy, and “OOB error” is the out-of-bag (OOB) error rate, which represents the classification error rate.
Selected top features using SVM-RFE and RF-IS
| Method | AAIndex entry* | Description |
|---|---|---|
| SVM-RFE | ROSM880105 | Hydrophilicity of Polar Amino Acid Side-chains |
| | RICJ880117 | Relative preference in alpha helices |
| | VENT840101 | Hydrophobicity, the bitter taste of L-amino acids |
| | ROBB760110 | Conformational properties of middle turn |
| | PONP800105 | Surrounding hydrophobicity in beta-sheet |
| | ZIMJ680101 | Hydrophobicity by statistical methods |
| | PRAM820103 | Shape and surface features of globular proteins |
| RF-IS | GUYH850101 | Partition energy |
| | VHEG790101 | Transfer free energy |
| | ROSM880105 | Hydrophilicity of Polar Amino Acid Side-chains |
| | CASG920101 | Hydrophobicity scale from native protein structures |
| | PONP800107 | Accessibility reduction ratio |
| | WILM950102 | Hydrophobicity coefficient in RP-HPLC |
| | X15925383 | Pagg |
| | LEVM780102 | Normalized frequency of beta-sheet |
| | PALJ810111 | Normalized frequency of beta-sheet |
| | PRAM900103 | Relative frequency in beta-sheet |
*AAIndex database entry numbers.
Results of LOPO cross-validation
| Method | TP | FN | FP | TN | Ac | MCC |
|---|---|---|---|---|---|---|
| ProA-SVM (7 features) | 148 | 36 | 38 | 132 | 0.791 | 0.5811 |
| ProA-RF (10 features) | 146 | 38 | 41 | 129 | 0.7768 | 0.5527 |
TP: the number of true positive samples; FN: the number of false negative samples; FP: the number of false positive samples; TN: the number of true negative samples. Ac: accuracy. MCC: Matthews correlation coefficient.
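The accuracy and MCC columns can be recomputed from the four confusion-matrix counts using the standard definitions. The sketch below applies them to the ProA-SVM row of the LOPO table; the function names are illustrative, and treating the four count columns as TP, FN, FP and TN (the order of the legend) is an assumption.

```python
import math

def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Counts from the ProA-SVM row of the LOPO table, read as (TP, FN, FP, TN).
svm_counts = (148, 36, 38, 132)
print(round(accuracy(*svm_counts), 4), round(mcc(*svm_counts), 4))
```

With these counts the sketch gives an accuracy of about 0.791 and an MCC of about 0.5811, matching the tabulated values. MCC is unchanged if FN and FP are swapped, so the check does not depend on the order of those two columns.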
Figure 2. Receiver operating characteristic (ROC) curves for the different methods. Area under the ROC curve (AUC): ProA-RF: 0.8929; ProA-SVM: 0.8680; ZYGGREGATOR: 0.8395; AAGRESCAN: 0.8336; FoldAmyloid: 0.7946; PAGE: 0.7303; TANGO: 0.7121.
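An AUC value can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way by direct pairwise comparison. The function name `roc_auc` and the toy scores and labels are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical propensity scores and labels (1 = aggregating), for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the number of samples, which is adequate for datasets of a few hundred peptides; a rank-based computation would be preferable for much larger sets.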
Comparison results with other methods
| ProA-RF | 136 | 30 | 154 | 34 | 0.8193 | 0.8191 | 0.8192 |
| ProA-SVM | 136 | 40 | 144 | 34 | 0.7727 | 0.8090 | 0.7910 |
| ZYGGREGATOR | 136 | 45 | 139 | 34 | 0.7514 | 0.8035 | 0.7768 |
| AAGRESCAN | 136 | 61 | 123 | 34 | 0.6904 | 0.7834 | 0.7316 |
| FoldAmyloid | 136 | 68 | 116 | 34 | 0.6667 | 0.7733 | 0.7119 |
| TANGO | 136 | 86 | 98 | 34 | 0.6126 | 0.7424 | 0.6610 |
| PAGE | 136 | 88 | 96 | 34 | 0.6071 | 0.7385 | 0.6554 |
TP: the number of true positive samples; FN: the number of false negative samples; FP: the number of false positive samples; TN: the number of true negative samples. Ac: accuracy.
Figure 3. The predicted aggregation regions of tau protein (region 244–368) by different methods. A. The predicted aggregation propensity profiles of ProA-SVM (dashed line) and ProA-RF (solid line). B. The predicted aggregation propensity profiles of ZYGGREGATOR (blue), AAGRESCAN (red), FoldAmyloid (black), PAGE (purple), and TANGO (green). “A” and “N” indicate experimentally confirmed aggregation and non-aggregation regions, respectively.
Results of testing ProA-SVM and ProA-RF on the SP dataset
| Method | Dataset | TP | FN | FP | TN | Ac | MCC |
|---|---|---|---|---|---|---|---|
| ProA-SVM | EUKSIG.reduc | 1022 | 103 | 189 | 936 | 0.8702 | 0.7426 |
| | EUKANC.reduc | 55 | 12 | 10 | 56 | 0.8346 | 0.6695 |
| | GRAM+SIG.reduc | 118 | 51 | 33 | 136 | 0.7515 | 0.5058 |
| | GRAM-SIG.reduc | 286 | 64 | 73 | 277 | 0.8043 | 0.6088 |
| ProA-RF | EUKSIG.reduc | 896 | 229 | 297 | 828 | 0.7662 | 0.5334 |
| | EUKANC.reduc | 60 | 7 | 9 | 57 | 0.8797 | 0.7597 |
| | GRAM+SIG.reduc | 127 | 42 | 38 | 131 | 0.7633 | 0.5268 |
| | GRAM-SIG.reduc | 290 | 60 | 104 | 246 | 0.7657 | 0.5357 |
TP: the number of true positive samples; FN: the number of false negative samples; FP: the number of false positive samples; TN: the number of true negative samples. Ac: accuracy; MCC: Matthews correlation coefficient.