Runtao Yang, Chengjin Zhang, Lina Zhang, Rui Gao.
Abstract
Cancerlectins have an inhibitory effect on the growth of cancer cells and are currently being employed as therapeutic agents. Accurate identification of cancerlectins should provide insight into the molecular mechanisms of cancers. In this study, a new computational method based on the RF (Random Forest) algorithm is proposed to further improve the identification of cancerlectins. Before feature selection, a hybrid feature space is built by combining four individual feature spaces: CTD (Composition, Transition, and Distribution), PseAAC (Pseudo Amino Acid Composition), PSSM (Position-Specific Scoring Matrix), and disorder. SMOTE (Synthetic Minority Oversampling Technique) is applied to address the imbalanced data problem. To reduce feature redundancy and computational complexity, we propose a two-step feature selection process to select informative features. A 5-fold cross-validation technique is used to evaluate the various prediction strategies. The proposed method achieves a sensitivity of 0.779, a specificity of 0.717, an accuracy of 0.748, and an MCC (Matthews Correlation Coefficient) of 0.497. The prediction results are also compared with other existing methods on the same dataset using 5-fold cross-validation. The comparison results demonstrate that our method is highly effective for predicting cancerlectins.
Year: 2018 PMID: 29568772 PMCID: PMC5820548 DOI: 10.1155/2018/9364182
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Division of the 20 natural amino acids according to different physicochemical properties.
| Physicochemical properties | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| Hydrophobicity | DEKNQR | AGHPSTY | CFILMVW |
| Normalized van der Waals volume | ACDGPST | EILNQV | FHKMRWY |
| Polarity | CFILMVWY | AGPST | DEHKNQR |
| Polarizability | ADGST | CEILNPQV | FHKMRWY |
| Charge | KR | DE | ACFGHILMNPQSTVWY |
| Secondary structures | AEHKLMQR | CFITVWY | DGNPS |
| Solvent accessibility | ACFGILVW | DEKNQR | HMPSTY |
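
The grouping table above is the input to the CTD descriptors: each residue is replaced by its group number, and the resulting string is summarized by composition, transition, and distribution. The following is a minimal sketch of that encoding for the hydrophobicity row only. The function names are illustrative, and the percentile rule in `distribution` is one common convention that is assumed here, not taken from the source.

```python
# Three-group partition for hydrophobicity, copied from the table above.
HYDROPHOBICITY_GROUPS = ("DEKNQR", "AGHPSTY", "CFILMVW")


def encode(sequence, groups):
    """Map each residue to its group number (1, 2, or 3); skip unknown residues."""
    lookup = {aa: str(i) for i, grp in enumerate(groups, start=1) for aa in grp}
    return "".join(lookup[aa] for aa in sequence.upper() if aa in lookup)


def composition(encoded):
    """Fraction of residues that fall in each of the three groups."""
    return [encoded.count(g) / len(encoded) for g in "123"]


def transition(encoded):
    """Fraction of adjacent residue pairs that switch between two given groups."""
    pairs = list(zip(encoded, encoded[1:]))
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        switches = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result.append(switches / len(pairs))
    return result


def distribution(encoded):
    """Relative positions of the first, 25%, 50%, 75%, and last residue per group."""
    result = []
    for g in "123":
        positions = [i + 1 for i, c in enumerate(encoded) if c == g]
        if not positions:
            result.extend([0.0] * 5)  # group absent from this sequence
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            # Assumed convention: floor(frac * count), clamped to the first residue.
            idx = max(int(frac * len(positions)) - 1, 0)
            result.append(positions[idx] / len(encoded))
    return result


def ctd(sequence, groups):
    """Concatenate the 3 + 3 + 15 = 21 CTD values for one property."""
    encoded = encode(sequence, groups)
    if len(encoded) < 2:
        raise ValueError("need at least two recognised residues")
    return composition(encoded) + transition(encoded) + distribution(encoded)
```

For example, `encode("DEAGCF", HYDROPHOBICITY_GROUPS)` gives `"112233"`, and `ctd` returns 21 values for that property. Passing another row of the table as `groups` yields the descriptors for that property.
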
Performance comparisons of different machine learning methods on the full features using 5-fold cross-validation.
| Machine learning method | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| AdaBoost | 0.690 | 0.540 | 0.615 | 0.233 |
| Decision Table | 0.681 | 0.540 | 0.611 | 0.223 |
| Nearest Neighbor Analysis | 0.757 | 0.584 | 0.670 | 0.346 |
| Logistic Regression | 0.531 | 0.558 | 0.544 | 0.089 |
| Naïve Bayes | 0.500 | 0.699 | 0.600 | 0.203 |
| RBFNetwork | 0.615 | 0.491 | 0.553 | 0.107 |
| Random Forest | 0.704 | 0.695 | 0.699 | 0.398 |
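
The four columns in the table above follow the standard confusion-matrix definitions of sensitivity, specificity, accuracy, and MCC. The sketch below computes them from raw counts. The source reports only the rates, so the counts in the usage example are hypothetical: they were chosen so that the rounded outputs match the Random Forest row.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (not from the source), balanced at 226 samples per class.
metrics = binary_metrics(tp=159, fn=67, tn=157, fp=69)
print({name: round(value, 3) for name, value in metrics.items()})
# {'sensitivity': 0.704, 'specificity': 0.695, 'accuracy': 0.699, 'mcc': 0.398}
```

When the two classes are the same size, accuracy is the mean of sensitivity and specificity, which is the pattern seen in the rows of this table.
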
Figure 1. ROC curves of different machine learning classifiers. DT: Decision Table, NNA: Nearest Neighbor Analysis, LR: Logistic Regression, NB: Naïve Bayes, RF: Random Forest, and AUC: Area under the ROC curve.
Prediction results with and without SMOTE on the full features using 5-fold cross-validation.
| Method | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Without SMOTE | 0.461 | 0.717 | 0.604 | 0.085 |
| With SMOTE | 0.704 | 0.695 | 0.699 | 0.398 |
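
SMOTE balances the classes by creating synthetic minority-class samples, each placed on the line segment between a minority sample and one of its nearest minority neighbours. The following is a minimal standard-library sketch of that interpolation step. The function name, the default `k=5`, and the seed are assumptions for illustration; the source does not state the SMOTE settings that were used.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy 2-D minority class; real inputs would be the protein feature vectors.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(points, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region already spanned by the minority class.
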
Figure 2. ROC curves with and without SMOTE on the full features using 5-fold cross-validation.
Figure 3. The prediction accuracy against the dimension of top features by performing the SFS (Sequential Forward Selection) scheme.
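
The curve in Figure 3 plots accuracy as top-ranked features are added one at a time. A minimal sketch of such a loop is given here, assuming the features have already been ranked and that `score` is a caller-supplied function (for example, cross-validated accuracy of a classifier on the given subset). Both the ranking criterion and `score` are hypothetical placeholders, since this section of the source does not describe them.

```python
def forward_prefix_selection(ranked_features, score):
    """Add ranked features one at a time and keep the best-scoring prefix.

    Returns the selected features and the score at every subset size,
    which is the kind of curve shown in an accuracy-versus-dimension plot.
    """
    best_score, best_size, curve = float("-inf"), 0, []
    for size in range(1, len(ranked_features) + 1):
        current = score(ranked_features[:size])
        curve.append(current)
        if current > best_score:  # ties keep the smaller subset
            best_score, best_size = current, size
    return ranked_features[:best_size], curve


# Toy scorer (hypothetical): rewards features "a" and "b", penalises subset size.
def toy_score(features):
    return len(set(features) & {"a", "b"}) - 0.01 * len(features)


selected, curve = forward_prefix_selection(["a", "x", "b", "y"], toy_score)
print(selected)  # ['a', 'x', 'b']
```

Keeping the smallest subset on ties favours a compact feature set, in the spirit of the 13-feature result reported in this article.
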
Figure 4. ROC curves for the classifiers using all the features and the 13 optimal features.
Prediction performance of the classifier trained with the 13 optimal features compared with the classifier trained with 13 features randomly selected from the original features.
| Method | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| Randomly selected features | 0.631 | 0.609 | 0.620 | 0.240 | 0.659 |
| Optimal features | 0.779 | 0.717 | 0.748 | 0.497 | 0.787 |
Performance comparisons with the existing methods using 5-fold cross-validation.
| Method | Sensitivity | Specificity | Accuracy | MCC | Feature number |
|---|---|---|---|---|---|
| Amino Acid Composition | 0.680 | 0.642 | 0.658 | 0.32 | 20 |
| Dipeptide Composition | 0.673 | 0.628 | 0.648 | 0.30 | 400 |
| Split based Composition (2-part) | 0.663 | 0.642 | 0.651 | 0.31 | 40 |
| Split based Composition (4-part) | 0.651 | 0.669 | 0.661 | 0.32 | 80 |
| Position-Specific Scoring Matrix | 0.679 | 0.686 | 0.683 | 0.36 | 400 |
| PSSM with 14 PROSITE domains | 0.680 | 0.699 | 0.691 | 0.38 | 414 |
|  | 0.691 | 0.801 | 0.752 | 0.495 | 68 |
| Our method | 0.779 | 0.717 | 0.748 | 0.497 | 13 |
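
All results in the tables above are reported under 5-fold cross-validation: the data are split into five parts, and each part serves once as the test set while the other four are used for training. The sketch below builds such splits with the standard library only. Stratifying by class label is an assumption made here to keep class proportions equal across folds; the source does not say how the folds were formed.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels (hypothetical): 10 positives and 15 negatives.
labels = [1] * 10 + [0] * 15
for train, test in train_test_splits(labels):
    print(len(train), len(test))  # prints "20 5" five times
```

Per-fold predictions can then be pooled, or per-fold metrics averaged, to obtain a single sensitivity, specificity, accuracy, and MCC.
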