Xinyi Liu, Yueyue Shen, Youhua Zhang, Fei Liu, Zhiyu Ma, Zhenyu Yue, Yi Yue.
Abstract
BACKGROUND: A moonlighting protein is a protein that can perform two or more functions. Current moonlighting protein prediction tools focus mainly on proteins from animals and microorganisms. Because the cells and proteins of animals and plants differ, these tools may predict plant moonlighting proteins inaccurately. Hence, a benchmark data set and a prediction tool specific to plant moonlighting proteins are needed.
Keywords: eXtreme gradient boosting; Benchmark data set; Plant moonlighting protein; Prediction tool
Year: 2021 PMID: 34434652 PMCID: PMC8351581 DOI: 10.7717/peerj.11900
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. The flowchart of IdentPMP development.
(A) Data preparation. The composition and source of the training data set and the independent test data set. (B) Feature engineering. We used iLearn to generate feature classes and perform pre-processing, then built a classifier from each feature class to select the best feature. (C) Model training. Five algorithms were used: Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Decision Tree (DT) and K-Nearest Neighbour (KNN).
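The selection step in panel (B) can be illustrated with a minimal sketch: score one classifier per feature class and keep the class with the highest score. Everything here is an assumption for illustration. The feature class names, the toy vectors, and the scoring function (leave-one-out accuracy of a 1-nearest-neighbour classifier) are hypothetical stand-ins, not the iLearn descriptors or the evaluation actually used for IdentPMP.

```python
from math import dist


def loo_1nn_accuracy(vectors, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (stand-in score)."""
    correct = 0
    for i, query in enumerate(vectors):
        # Find the closest *other* sample by Euclidean distance.
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: dist(query, vectors[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


def select_best_feature_class(feature_classes, labels, score=loo_1nn_accuracy):
    """Score one classifier per feature class; return the best name and all scores."""
    scores = {name: score(vectors, labels) for name, vectors in feature_classes.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: four proteins (1 = moonlighting, 0 = not) and two hypothetical feature classes.
labels = [1, 1, 0, 0]
feature_classes = {
    "class_a": [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]],  # separates the labels
    "class_b": [[0.0], [1.0], [0.1], [0.9]],                      # mixes the labels
}
best, scores = select_best_feature_class(feature_classes, labels)
print(best, scores)
```

Passing the scoring function as a parameter keeps the loop independent of the classifier and metric, so a cross-validated AUPRC from any of the five algorithms could be swapped in.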
The performance of five algorithms on the training set.
AUPRC, area under the precision-recall curve. AUC, area under the receiver operating characteristic curve. AUPRC is the main metric. Sen, sensitivity. Spe, specificity. MCC, Matthews correlation coefficient. F1, F1-score. The selected algorithm and the maximum value of each metric are marked in bold.
| Algorithm | AUPRC | AUC | Sen | Spe | MCC | F1 |
| 0.90 | ||||||
| SVM | 0.84 | 0.86 | 0.68 | 0.85 | 0.55 | 0.70 |
| RF | 0.82 | 0.86 | 0.70 | 0.85 | 0.56 | 0.72 |
| DT | 0.80 | 0.82 | 0.68 | 0.84 | 0.53 | 0.70 |
| KNN | 0.78 | 0.79 | 0.53 | 0.51 | 0.62 |
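For readers less familiar with the threshold-based metrics in the table, the following sketch computes Sen, Spe, MCC and F1 from the four confusion-matrix counts using their standard definitions. The counts in the example are made up for illustration and are not taken from the study; the zero-denominator guards are a convention chosen here.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, MCC and F1 from confusion-matrix counts."""
    sen = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (recall)
    spe = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    # F1 is the harmonic mean of precision and recall, written in count form.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is set to 0 when any marginal total is zero (a common convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "MCC": mcc, "F1": f1}


# Hypothetical counts: 60 positives (40 found), 100 negatives (90 rejected).
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=20)
print({name: round(value, 2) for name, value in metrics.items()})
```

Unlike AUPRC and AUC, these four metrics depend on the decision threshold used to turn scores into class labels.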
Figure 2. Performance comparison of the five algorithms on the training set.
(A) Precision-recall curves (AUPRC) of the five algorithms. (B) Receiver operating characteristic curves (AUC) of the five algorithms.
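The two areas summarised by these curves can be computed directly from labels and predicted scores. The sketch below is a plain-Python illustration under stated assumptions: AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), and AUPRC is estimated by average precision, which is one common estimator and not necessarily the one used in the study. It assumes both classes are present and, for average precision, that scores are not tied. The labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k that hold a positive."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: five proteins, two of them positive.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(round(auc, 3), round(ap, 3))
```

A random classifier scores about 0.5 on AUC, whereas the baseline for AUPRC is roughly the fraction of positives in the data, which is why the two are usually reported together on imbalanced sets.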
Figure 3. Performance comparison of IdentPMP and MPFit on the independent test set.
(A) Precision-recall curves (AUPRC) of IdentPMP and MPFit (Phylo+DOR+NET+GE+GI). (B) Receiver operating characteristic curves (AUC) of IdentPMP and MPFit (Phylo+DOR+NET+GE+GI).
Detailed results of IdentPMP and MPFit on the independent test set.
AUPRC, area under the precision-recall curve. AUC, area under the receiver operating characteristic curve. Sen, sensitivity. Spe, specificity. MCC, Matthews correlation coefficient. F1, F1-score. The maximum value of each metric is marked in bold.
| Method | AUPRC | AUC | Sen | Spe | MCC | F1 |
| IdentPMP | 0.43 | 0.46 | ||||
| MPFit(Phylo+GE+GI+DOR+NET) | 0.36 | 0.60 | 0.54 | 0.64 | 0.17 | 0.44 |
| MPFit(PPI+Phylo+GE) | 0.50 | 0.00 | 0.00 | 0.43 |