Haozheng Li, Yihe Pang, Bin Liu, Liang Yu.
Abstract
Intrinsically disordered regions (IDRs), which lack a stable structure, are important for protein structure and function. Some IDRs undergo a transition from disordered to ordered when they bind to molecular partners; these regions are called molecular recognition features (MoRFs). MoRFs have five main functions: molecular recognition assembler (MoR_assembler), molecular recognition chaperone (MoR_chaperone), molecular recognition display sites (MoR_display_sites), molecular recognition effector (MoR_effector), and molecular recognition scavenger (MoR_scavenger). Research on the functions of MoRFs is important for pharmaceutical development and for understanding disease pathogenesis. However, existing computational methods can only predict the MoRFs in proteins and cannot distinguish their different functions. In this paper, we treat MoRF function prediction as a multi-label learning task and solve it with the Binary Relevance (BR) strategy. We use Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF) as base models and combine them through ensemble learning to construct MoRF-FUNCpred. Experimental results show that MoRF-FUNCpred performs well for MoRF function prediction. To the best of our knowledge, MoRF-FUNCpred is the first predictor of MoRF functions. Availability and Implementation: The stand-alone package of MoRF-FUNCpred can be accessed from https://github.com/LiangYu-Xidian/MoRF-FUNCpred.
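The Binary Relevance strategy named in the abstract can be illustrated with a minimal sketch: one independent binary classifier is trained per function, and the five yes/no answers are stacked into a multi-label prediction. The `CentroidClassifier` below is a hypothetical toy learner standing in for SVM, LR, DT, or RF, and the two-dimensional features are invented for illustration; neither is taken from MoRF-FUNCpred.

```python
from statistics import mean

FUNCTIONS = ["MoR_assembler", "MoR_chaperone", "MoR_display_sites",
             "MoR_effector", "MoR_scavenger"]

class CentroidClassifier:
    """Toy binary learner (nearest class centroid); a stand-in for SVM/LR/DT/RF."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, label in zip(X, y) if label == cls]
            # Column-wise mean of all rows that belong to this class.
            self.centroids[cls] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
                for x in X]

class BinaryRelevance:
    """Train one independent binary learner per label."""

    def __init__(self, make_learner, n_labels):
        self.learners = [make_learner() for _ in range(n_labels)]

    def fit(self, X, Y):
        for j, learner in enumerate(self.learners):
            # Column j of Y is the binary target for function j.
            learner.fit(X, [row[j] for row in Y])
        return self

    def predict(self, X):
        per_label = [learner.predict(X) for learner in self.learners]
        # Transpose so that each row is one residue's label vector.
        return [list(row) for row in zip(*per_label)]

# Invented toy data: 6 residues, 2 features, 5 binary labels each.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.0, 1.0], [1.0, 0.0]]
Y = [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 1, 0],
     [1, 1, 0, 1, 0], [0, 1, 0, 0, 1], [1, 0, 0, 0, 0]]

model = BinaryRelevance(CentroidClassifier, len(FUNCTIONS)).fit(X, Y)
pred = model.predict([[0.95, 1.0]])
print(dict(zip(FUNCTIONS, pred[0])))
```

Because each label is learned separately, a residue can receive several functions at once or none at all, which is what a multi-label formulation requires.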
Keywords: binary relevance; ensemble learning; intrinsically disordered regions; molecular recognition features; multi-label learning
Year: 2022 PMID: 35350759 PMCID: PMC8957949 DOI: 10.3389/fphar.2022.856417
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Residues of each functional type in the training set and testing set, and the corresponding number of protein sequences.
| Types | Training set | Testing set | Sequences |
|---|---|---|---|
| Negative | 6158 | 8108 | 167 |
| MoR_assembler | 8821 | 8537 | 160 |
| MoR_chaperone | 2006 | 1052 | 24 |
| MoR_display_sites | 1992 | 1503 | 58 |
| MoR_effector | 8431 | 7576 | 149 |
| MoR_scavenger | 1617 | 1500 | 20 |
| MoR_assembler, MoR_display_sites | 301 | 294 | 9 |
| MoR_assembler, MoR_effector | 1134 | 434 | 17 |
| MoR_display_sites, MoR_effector | 1128 | 562 | 18 |
| MoR_assembler, MoR_display_sites, MoR_effector | 113 | 78 | 4 |
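Rows such as "MoR_assembler, MoR_effector" show that a residue may carry several functions. A minimal sketch of how such annotations could be turned into the 0/1 label vectors used in multi-label learning is given below; the `encode` helper and the ordering of `FUNCTIONS` are illustrative assumptions, not part of the published package.

```python
FUNCTIONS = ["MoR_assembler", "MoR_chaperone", "MoR_display_sites",
             "MoR_effector", "MoR_scavenger"]

def encode(annotation):
    """Map an annotation such as 'MoR_assembler, MoR_effector' to a 0/1 vector.

    'Negative' maps to the all-zero vector (no MoRF function).
    """
    names = {part.strip() for part in annotation.split(",")}
    if names == {"Negative"}:
        return [0] * len(FUNCTIONS)
    unknown = names - set(FUNCTIONS)
    if unknown:
        raise ValueError(f"unknown function name(s): {sorted(unknown)}")
    return [1 if name in names else 0 for name in FUNCTIONS]

for annotation in ["Negative",
                   "MoR_chaperone",
                   "MoR_assembler, MoR_effector",
                   "MoR_assembler, MoR_display_sites, MoR_effector"]:
    print(f"{annotation:50s} -> {encode(annotation)}")
```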
FIGURE 1. The network architecture of MoRF-FUNCpred. MoRF-FUNCpred uses PSFM to represent each protein. Features and labels are extracted for each residue, and the residues are divided into a training set and a testing set. The SVM, LR, DT, and RF models are trained on the training set, and the four models are integrated through ensemble learning to obtain better performance. The ensemble weights are obtained with a genetic algorithm.
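The caption does not specify how the genetic algorithm encodes or evolves the ensemble weights, so the following is only a sketch under stated assumptions: the ensemble score is a weighted average of the four model scores, fitness is accuracy at a 0.5 threshold, crossover averages two parents, and mutation adds Gaussian noise to one weight. The scores and labels are invented toy values.

```python
import random

def ensemble_scores(model_scores, weights):
    """Weighted average of the per-model scores for each residue."""
    return [sum(w * s for w, s in zip(weights, row))
            for row in zip(*model_scores)]

def accuracy(weights, model_scores, labels, threshold=0.5):
    combined = ensemble_scores(model_scores, weights)
    hits = sum((score >= threshold) == bool(y)
               for score, y in zip(combined, labels))
    return hits / len(labels)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def genetic_search(model_scores, labels, pop_size=20, generations=30, seed=0):
    """Assumed GA: elitism, averaging crossover, single-weight Gaussian mutation."""
    rng = random.Random(seed)
    n_models = len(model_scores)

    def fitness(weights):
        return accuracy(weights, model_scores, labels)

    # Start from uniform weights plus random candidates.
    population = [[1.0 / n_models] * n_models]
    while len(population) < pop_size:
        population.append(normalize([rng.random() + 1e-9 for _ in range(n_models)]))

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]
            i = rng.randrange(n_models)
            child[i] = max(child[i] + rng.gauss(0, 0.1), 1e-9)
            children.append(normalize(child))
        population = parents + children
    return max(population, key=fitness)

# Invented toy scores from four models (SVM, LR, DT, RF order) on 8 residues.
MODEL_SCORES = [
    [0.9, 0.4, 0.6, 0.2, 0.1, 0.3, 0.45, 0.55],
    [0.6, 0.7, 0.4, 0.5, 0.3, 0.6, 0.55, 0.2],
    [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [0.8, 0.6, 0.7, 0.3, 0.2, 0.4, 0.6, 0.4],
]
LABELS = [1, 1, 1, 0, 0, 0, 1, 0]

best_weights = genetic_search(MODEL_SCORES, LABELS)
print("weights:", [round(w, 3) for w in best_weights])
print("accuracy:", accuracy(best_weights, MODEL_SCORES, LABELS))
```

Keeping the fitter half of each generation means the best fitness never decreases, so the returned weights are at least as accurate on this data as the uniform starting point.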
Hyperparameter ranges for each model.
| Model | Hyperparameter | Range |
|---|---|---|
| SVM | C | {2^-5, 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3, 2^4, 2^5} |
| | gamma | {2^-5, 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3, 2^4, 2^5} |
| | kernel | {linear, polynomial, rbf} |
| LR | penalty | {l1, l2} |
| | c | {2^-5, 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3, 2^4, 2^5} |
| DT | criterion | {gini, entropy} |
| | splitter | {best, random} |
| RF | n_estimators | {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} |
| | max_features | {sqrt, log2} |
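The size of each search space follows directly from the ranges in the table. The sketch below enumerates every combination with the standard library; the hyperparameter names are copied from the table, and how they map onto a particular machine-learning library is left open (the `expand` helper is illustrative).

```python
from itertools import product

# 2^-5 ... 2^5, eleven values in total.
POWERS_OF_TWO = [2.0 ** k for k in range(-5, 6)]

GRIDS = {
    "SVM": {"C": POWERS_OF_TWO,
            "gamma": POWERS_OF_TWO,
            "kernel": ["linear", "polynomial", "rbf"]},
    "LR": {"penalty": ["l1", "l2"],
           "c": POWERS_OF_TWO},
    "DT": {"criterion": ["gini", "entropy"],
           "splitter": ["best", "random"]},
    "RF": {"n_estimators": list(range(10, 101, 10)),
           "max_features": ["sqrt", "log2"]},
}

def expand(grid):
    """Return every hyperparameter combination in a grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

for model, grid in GRIDS.items():
    print(model, len(expand(grid)), "combinations")
```

An exhaustive search over these ranges would evaluate 363 SVM, 22 LR, 4 DT, and 20 RF configurations per binary task.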
Macro accuracy and per-function accuracy of the different basic models.
| Metrics | SVM | LR | DT | RF |
|---|---|---|---|---|
| Macro_Accuracy | **0.828** | 0.564 | 0.743 | 0.813 |
| MoR_assembler_Accuracy | 0.639 | 0.480 | 0.549 | **0.640** |
| MoR_chaperone_Accuracy | | 0.620 | 0.877 | 0.930 |
| MoR_display_sites_Accuracy | | 0.531 | 0.800 | 0.866 |
| MoR_effector_Accuracy | 0.684 | 0.580 | 0.598 | **0.686** |
| MoR_scavenger_Accuracy | 0.937 | 0.606 | 0.891 | |
Bold values represent the best results for each metric.
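The reported Macro_Accuracy values appear consistent, up to rounding, with an unweighted mean of the five per-function accuracies. The sketch below checks this for the DT column (0.549, 0.877, 0.800, 0.598, 0.891); treating the macro metric as a plain average is an inference from the numbers, not a definition quoted from the paper.

```python
from statistics import mean

# Per-function accuracies of the DT model, as reported in the table.
DT_ACCURACY = {
    "MoR_assembler": 0.549,
    "MoR_chaperone": 0.877,
    "MoR_display_sites": 0.800,
    "MoR_effector": 0.598,
    "MoR_scavenger": 0.891,
}

def label_accuracy(y_true, y_pred):
    """Fraction of residues whose binary label is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_accuracy(per_function):
    """Unweighted mean of the per-function accuracies (assumed definition)."""
    return mean(per_function.values())

print(round(macro_accuracy(DT_ACCURACY), 3))  # compare with the DT Macro_Accuracy entry
```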
Sensitivity (sn) and specificity (sp) of the different basic models for each function.
| Metrics | SVM | LR | DT | RF |
|---|---|---|---|---|
| sn_MoR_assembler | 0.237 | | 0.391 | 0.227 |
| sp_MoR_assembler | 0.824 | 0.474 | 0.622 | |
| sn_MoR_chaperone | 0.003 | | 0.099 | 0.029 |
| sp_MoR_chaperone | | 0.629 | 0.906 | 0.963 |
| sn_MoR_display_sites | 0.000 | | 0.287 | 0.213 |
| sp_MoR_display_sites | | 0.544 | 0.846 | 0.925 |
| sn_MoR_effector | 0.150 | | 0.291 | 0.109 |
| sp_MoR_effector | 0.902 | 0.587 | 0.725 | |
| sn_MoR_scavenger | 0.047 | | 0.109 | 0.022 |
| sp_MoR_scavenger | 0.984 | 0.609 | 0.932 | |
Bold values represent the best results for each metric.
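Assuming the sn_ and sp_ prefixes denote sensitivity and specificity, both can be computed from the confusion counts of a single binary task. The sketch below uses invented labels to show how a model can reach a moderate accuracy and high specificity while detecting few true positives, the pattern visible in several rows above.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against labels with no positives or no negatives.
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return sn, sp

# Invented example: 3 positive residues, 5 negative residues.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

sn, sp = sensitivity_specificity(y_true, y_pred)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"sn={sn:.3f} sp={sp:.3f} accuracy={acc:.3f}")
```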
FIGURE 2. Distance between the two models in the training set under the five MoRF functions.
Comparison of the ensemble model and best metric in the single model.
| Metrics | MoRF-FUNCpred | SVM | LR | DT | RF |
|---|---|---|---|---|---|
| Macro_Accuracy | | 0.828 | 0.564 | 0.743 | 0.813 |
| MoR_assembler_Accuracy | | 0.639 | 0.480 | 0.549 | 0.640 |
| MoR_chaperone_Accuracy | 0.960 | | 0.620 | 0.877 | 0.930 |
| MoR_display_sites_Accuracy | 0.910 | | 0.531 | 0.800 | 0.866 |
| MoR_effector_Accuracy | | 0.684 | 0.580 | 0.598 | 0.686 |
| MoR_scavenger_Accuracy | | 0.937 | 0.606 | 0.891 | |
Bold values represent the best results for each metric.
FIGURE 3. MoRF-FUNCpred prediction of the MoR_assembler function of DP01087.