Mengting Niu1, Yanjuan Li2, Chunyu Wang3, Ke Han4.
Abstract
Amyloid is an insoluble fibrous protein, and its misaggregation can lead to diseases such as Alzheimer's disease and Creutzfeldt-Jakob disease. Identifying amyloid is therefore essential for discovering and understanding these diseases. We established a novel predictor, RFAmy, based on random forest to identify amyloid. It employs the SVMProt 188-D feature extraction method, based on protein composition and physicochemical properties, and the Pse-in-One feature extraction method, based on amino acid composition, autocorrelation, pseudo amino acid composition, profile-based features, and predicted structure features. In the ten-fold cross-validation test, RFAmy's overall accuracy was 89.19% and its F-measure was 0.891. Comparison experiments with other features, classifiers, and existing methods show the effectiveness of RFAmy in predicting amyloid proteins. RFAmy can be accessed at http://server.malab.cn/RFAmyloid/.
Keywords: RFAmy; amyloid protein; machine learning; protein classification; random forest
Year: 2018 PMID: 30013015 PMCID: PMC6073578 DOI: 10.3390/ijms19072071
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
The result of using different feature representation methods on cross-validation.
| Method | ACC (%) | MCC | SE | SP | F-Measure |
|---|---|---|---|---|---|
| 188-D+Pse-in-One | 89.1941 | 0.739 | 0.781 | 0.927 | 0.891 |
| 188-D | 84.8482 | 0.626 | 0.655 | 0.932 | 0.626 |
| Pse-in-One | 81.31 | 0.5626 | 0.6374 | 0.8989 | 0.792 |
| 400-D | 84.1105 | 0.634 | 0.691 | 0.917 | 0.838 |
| | 81.3187 | 0.522 | 0.534 | 0.930 | 0.802 |
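The columns reported in these tables (ACC, MCC, SE, SP, F-measure) are confusion-matrix metrics. The following is a minimal sketch assuming the usual binary definitions; the source does not spell out its formulas, and its F-measure may be a class-weighted average rather than the positive-class F1 computed here. The function name `binary_metrics` and the example counts are illustrative and do not come from the paper.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute ACC, MCC, SE, SP and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    se = tp / (tp + fn) if (tp + fn) else 0.0         # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0         # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * se / (precision + se)) if (precision + se) else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SE": se, "SP": sp, "F1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=8, fp=2, tn=18, fn=2)
print({name: round(value, 4) for name, value in metrics.items()})
```

Reporting MCC alongside accuracy is useful when the classes are imbalanced, because a classifier that labels nearly everything negative can still score a high ACC and SP while its SE and MCC stay close to zero.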
The result of using different feature representation methods on external validation.
| Method | ACC (%) | MCC | SE | SP | F-Measure |
|---|---|---|---|---|---|
| 188-D+Pse-in-One | 89.7196 | 0.757 | 0.818 | 0.932 | 0.897 |
| 188-D | 73.1841 | 0.524 | 0.512 | 0.960 | 0.678 |
| Pse-in-One | 78.7037 | 0.679 | 0.676 | 0.880 | 0.782 |
| 400-D | 71.2963 | 0.543 | 0.503 | 0.893 | 0.684 |
| | 69.4444 | 0.522 | 0.534 | 0.893 | 0.657 |
The result of using different classifiers based on 188-D feature.
| Classifier | ACC (%) | MCC | SE | SP | F-Measure |
|---|---|---|---|---|---|
| Random Forest | 89.19 | 0.739 | 0.781 | 0.927 | 0.891 |
| Naive Bayes | 75.50 | 0.3791 | 0.4606 | 0.8822 | 0.8721 |
| SGD | 77.51 | 0.4451 | 0.5515 | 0.8717 | 0.6533 |
| Nearest Neighbors | 77.70 | 0.4293 | 0.2970 | 0.9843 | 0.8818 |
| Decision Tree | 67.28 | 0.2567 | 0.5333 | 0.7330 | 0.7461 |
| LinearSVC | 77.51 | 0.4654 | 0.6242 | 0.8403 | 0.8658 |
| Logistic Regression | 79.52 | 0.5123 | 0.6545 | 0.8560 | 0.8694 |
| LibSVM | 70.02 | 0.0651 | 0.0061 | 1.0000 | 0.8239 |
| ExtraTrees | 74.95 | 0.4128 | 0.6061 | 0.8115 | 0.8087 |
| Bagging | 74.95 | 0.4128 | 0.6061 | 0.8115 | 0.7727 |
| AdaBoost | 76.78 | 0.4700 | 0.6788 | 0.8063 | 0.8763 |
| GradientBoosting | 80.26 | 0.5298 | 0.6667 | 0.8613 | 0.8668 |
| LibD3C | 86.99 | 0.683 | 0.732 | 0.929 | 0.868 |
Figure 1. Receiver operating characteristic (ROC) curve for RFAmy and other methods.
The result of using different methods.
| Method | ACC (%) | MCC | SE | SP |
|---|---|---|---|---|
| RFAmy | 89.1941 | 0.739 | 0.781 | 0.927 |
| BioSeq-SVM | 76.86 | 0.4419 | 0.4953 | 0.9006 |
| BioSeq-RF | 81.31 | 0.5626 | 0.6374 | 0.8989 |
Figure 2. Receiver operating characteristic (ROC) curve for RFAmy and two other methods.
The result of using different feature representation methods.
| Method | ACC (%) | MCC | SE | SP | F-Measure |
|---|---|---|---|---|---|
| unbalanced | 89.1941 | 0.739 | 0.781 | 0.927 | 0.891 |
| balanced | 83.4962 | 0.757 | 0.847 | 0.823 | 0.865 |
Figure 3. Overview of the framework for the amyloid classifier. First, the original protein sequences were obtained from the UniProt and AmyPro datasets and then subjected to redundancy removal to produce the final protein sequence dataset, called Amy. Second, features are extracted from the protein sequences. Third, RF is used to classify the protein sequences.
Structure of 188-D Feature.
| Physical-Chemical Property | Dimensions |
|---|---|
| Amino acid composition | 20 |
| Hydrophobicity | 21 |
| Normalized van der Waals volume | 21 |
| Polarity | 21 |
| Polarizability | 21 |
| Charge | 21 |
| Surface tension | 21 |
| Secondary structure | 21 |
| Solvent accessibility | 21 |
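The table sums to 20 + 8 × 21 = 188 dimensions. One common way to obtain 21 values per physicochemical property is a composition, transition, distribution (CTD) encoding over three residue classes (3 + 3 + 15 values); the sketch below assumes that this is the scheme behind each 21-dimensional block. The three-class hydrophobicity grouping and the function names are illustrative assumptions, not taken from the source.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed three-class hydrophobicity partition (polar, neutral, hydrophobic);
# illustrative only, the source does not list the groupings it uses.
HYDROPHOBICITY_GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")


def amino_acid_composition(seq: str) -> list:
    """20 values: frequency of each standard amino acid."""
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def ctd_features(seq: str, groups: tuple) -> list:
    """21 values for one property: 3 composition + 3 transition + 15 distribution."""
    class_of = {aa: k for k, group in enumerate(groups) for aa in group}
    labels = [class_of[aa] for aa in seq if aa in class_of]
    n = len(labels)
    if n == 0:
        raise ValueError("no standard residues in sequence")

    # Composition: fraction of residues falling in each class.
    composition = [labels.count(k) / n for k in range(3)]

    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(labels, labels[1:]))
    transition = [
        sum(1 for x, y in pairs if {x, y} == {a, b}) / max(len(pairs), 1)
        for a, b in ((0, 1), (0, 2), (1, 2))
    ]

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class along the sequence.
    distribution = []
    for k in range(3):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == k]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if positions:
                rank = max(math.ceil(frac * len(positions)), 1)
                distribution.append(positions[rank - 1] / n)
            else:
                distribution.append(0.0)
    return composition + transition + distribution


seq = "MKVLAAGIVGLLLAQ"  # hypothetical sequence
vector = amino_acid_composition(seq) + ctd_features(seq, HYDROPHOBICITY_GROUPS)
print(len(vector))  # 41 here; all eight properties would give 20 + 8 * 21 = 188
```

Repeating `ctd_features` with a three-class grouping for each of the eight properties in the table and concatenating the results with the composition block would yield a 188-dimensional vector under these assumptions.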
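Pse-in-One feature extraction methods for protein sequences.
| Category | Method |
|---|---|
| Amino acid composition | K-mer |
| Amino acid composition | DR |
| Amino acid composition | Distance Pair |
| Autocorrelation | AC |
| Autocorrelation | CC |
| Autocorrelation | ACC |
| Autocorrelation | PDT |
| Pseudo amino acid composition | PC-PseAAC |
| Pseudo amino acid composition | SC-PseAAC |
| Pseudo amino acid composition | PC-PseAAC-General |
| Pseudo amino acid composition | SC-PseAAC-General |
| Profile-based features | Top- |
| Profile-based features | PDT-Profile |
| Profile-based features | DT |
| Profile-based features | AC-PSSM |
| Profile-based features | CC-PSSM |
| Profile-based features | ACC-PSSM |
| Profile-based features | PSSM-DT |
| Profile-based features | PSSM-RT |
| Profile-based features | CS |
| Predicted structure features | SS |
| Predicted structure features | SASA |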
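As an illustration of the simplest entry in this table, K-mer composition counts every overlapping window of k residues and normalizes the counts into frequencies. The sketch below is a plain-Python approximation, not the Pse-in-One implementation; the function name and the example sequence are hypothetical. With k = 2 over the 20 standard amino acids the vector has 20² = 400 entries, the same size as the "400-D" feature in the tables above, although whether the two are the same feature is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq: str, k: int = 2, alphabet: str = AMINO_ACIDS) -> list:
    """Return normalized frequencies of all len(alphabet)**k possible k-mers."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k over the sequence, one residue at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-standard residues
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


features = kmer_composition("MKVLAAGIVGLLLAQ", k=2)  # hypothetical sequence
print(len(features))  # 400
```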
Figure 4. Ten-fold cross-validation diagram. The dataset is divided into ten parts; in turn, nine parts serve as training data and the remaining part as test data. The average E of the ten test results is taken as the estimate of model accuracy and used as the performance indicator for the current K-fold cross-validation model, where E_i denotes the cross-validation error of the i-th group.
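The procedure in the Figure 4 caption amounts to E = (E_1 + ... + E_10) / 10, where E_i is the error measured on the i-th held-out part. The following is a minimal standard-library sketch of that loop; the helper names, the shuffling seed, and the toy majority-class scorer are assumptions made for illustration and stand in for the random forest classifier used in the paper.

```python
import random
from statistics import mean


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> list:
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_score, k: int = 10, seed: int = 0):
    """Return (E, [E_1..E_k]): the mean error and the per-fold errors."""
    folds = k_fold_indices(len(samples), k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # The other k - 1 folds together form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(train_and_score(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx],
            [samples[j] for j in test_idx], [labels[j] for j in test_idx],
        ))
    return mean(errors), errors


def majority_class_error(x_train, y_train, x_test, y_test) -> float:
    """Toy stand-in for a classifier: always predict the majority training label."""
    majority = max(set(y_train), key=y_train.count)
    return sum(1 for y in y_test if y != majority) / len(y_test)


# Hypothetical data: 20 samples, 5 of them positive.
samples = list(range(20))
labels = [0] * 15 + [1] * 5
E, per_fold = cross_validate(samples, labels, majority_class_error, k=10)
print(len(per_fold), round(E, 4))
```

Any function with the same four-argument signature that trains on the first two arguments and returns an error rate on the last two can replace `majority_class_error`.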