Shihu Jiao, Quan Zou.
Abstract
Plant vacuoles are the most important organelles for plant growth, development, and defense, and they play an important role in many types of stress responses. An important function of vacuole proteins is the transport of amino acids, ions, sugars, and other molecules. Accurate identification of vacuole proteins is crucial for revealing their biological functions. Several rapid, automatic computational tools have been proposed for predicting the subcellular localization of proteins, but they are not specific to plant vacuole proteins. To the best of our knowledge, only one computational tool has been trained specifically for plant vacuolar proteins. Although its accuracy is acceptable, its prediction performance and stability in practical applications can still be improved. Hence, in this study, a new predictor named iPVP-DRLF was developed to identify plant vacuole proteins specifically and effectively. The predictor is built on the light gradient boosting machine (LGBM) algorithm and hybrid features that combine classic sequence features with deep representation learning features. iPVP-DRLF achieved fivefold cross-validation and independent test accuracies of 88.25% and 87.16%, respectively, both outperforming previous state-of-the-art predictors. Moreover, results on a blind dataset showed that iPVP-DRLF performed significantly better than existing tools. Comparative experiments confirmed that deep representation learning features have an advantage over classic sequence features in the identification of plant vacuole proteins. We believe that iPVP-DRLF will serve as an effective computational technique for plant vacuole protein prediction and facilitate related future research. The online server is freely accessible at https://lab.malab.cn/~acy/iPVP-DRLF. The source code and datasets are available at https://github.com/jiaoshihu/iPVP-DRLF.
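The fivefold cross-validation accuracy quoted in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names, the round-robin stratified split, and the `majority_baseline` stand-in (used here in place of a real model such as an LGBM classifier) are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with a roughly equal class ratio per fold."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_val_accuracy(X: Sequence, y: Sequence[int], fit_predict: Callable,
                       k: int = 5, seed: int = 0) -> float:
    """Pooled accuracy over k held-out folds.

    fit_predict(train_X, train_y, test_X) must return one predicted label per test row.
    """
    correct = 0
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out])
        correct += sum(int(p == y[i]) for p, i in zip(preds, held_out))
    return correct / len(y)


def majority_baseline(train_X, train_y, test_X):
    """Hypothetical stand-in model: always predicts the most frequent training label."""
    majority = max(set(train_y), key=list(train_y).count)
    return [majority] * len(test_X)
```

Any model can be plugged in through `fit_predict`; every sample is predicted exactly once while held out, so the returned value is an out-of-fold accuracy.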
Keywords: Deep representation learning; Feature selection; Light gradient boosting machine; Machine learning; Vacuole proteins
Year: 2022 PMID: 35765653 PMCID: PMC9207291 DOI: 10.1016/j.csbj.2022.06.002
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. The workflow of the development and evaluation process for iPVP-DRLF.
Fivefold cross-validation results of different classic sequence descriptors.
| Feature | Acc (%) | AUC | SE (%) | SP (%) | MCC |
|---|---|---|---|---|---|
| DPC (34D) | 79.00 | 0.840 | 77.00 | 0.580 | |
| CTD (51D) | 78.50 | 0.845 | 78.00 | 79.00 | 0.570 |
| ASDC (57D) | 78.25 | 0.829 | 77.50 | 79.00 | 0.565 |
| DDE (38D) | 79.00 | ||||
| PAAC (22D) | 72.00 | 0.775 | 67.50 | 76.50 | 0.442 |
| AAC (20D) | 68.25 | 0.742 | 67.00 | 69.50 | 0.365 |
| QSO (44D) | 71.00 | 0.774 | 70.00 | 72.00 | 0.420 |
Note: The best performance value of each column is highlighted in bold for clarification. Numbers in parentheses represent feature dimensions after feature selection.
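Two of the descriptors in the table above have simple standard definitions: amino acid composition (AAC) is the frequency of each of the 20 standard residues, and dipeptide composition (DPC) is the frequency of each of the 400 ordered residue pairs. The sketch below uses those textbook definitions. The function names and the handling of non-standard characters are assumptions, and the feature-selection step that reduces DPC to the 34 dimensions listed in the table is not reproduced.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def _clean(seq: str) -> str:
    # Simplification (assumption): drop anything that is not a standard residue.
    return "".join(c for c in seq.upper() if c in AMINO_ACIDS)


def aac(seq: str) -> List[float]:
    """Amino acid composition: 20 residue frequencies that sum to 1."""
    s = _clean(seq)
    if not s:
        return [0.0] * len(AMINO_ACIDS)
    return [s.count(a) / len(s) for a in AMINO_ACIDS]


def dpc(seq: str) -> List[float]:
    """Dipeptide composition: 400 frequencies of adjacent residue pairs."""
    s = _clean(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(s) - 1):
        counts[s[i:i + 2]] += 1
    total = max(len(s) - 1, 1)  # number of adjacent pairs; avoids division by zero
    return [counts[d] / total for d in DIPEPTIDES]
```

For example, `aac("ACCA")` assigns 0.5 to both A and C, and `dpc("ACCA")` assigns one third each to AC, CC, and CA.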
Fivefold cross-validation results of different deep representation learning features.
| Features | Acc (%) | AUC | SE (%) | SP (%) | MCC |
|---|---|---|---|---|---|
| BiLSTM (44D) | 0.908 | 86.50 | |||
| LM (40D) | 82.00 | 0.888 | 81.00 | 83.00 | 0.640 |
| SSA (28D) | 75.25 | 0.815 | 74.50 | 76.00 | 0.505 |
| TAPE (55D) | 83.00 | 0.893 | 82.00 | 0.660 | |
| UniRep (60D) | 85.00 | 82.50 | 0.701 |
Note: The best performance value of each column is highlighted in bold for clarification. Numbers in parentheses represent feature dimensions after feature selection.
Performance comparison of eight machine learning models, each based on its corresponding optimal feature subset of F176, under fivefold cross-validation (CV) and independent testing (Test).
| Classifier | CV Acc (%) | CV AUC | CV SE (%) | CV SP (%) | CV MCC | Test Acc (%) | Test AUC | Test SE (%) | Test SP (%) | Test MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| AB | | | | | | 82.43 | 0.885 | 85.14 | 79.73 | 0.650 |
| Bagging | 84.25 | 0.895 | 82.50 | 86.00 | 0.685 | 83.11 | 0.898 | 85.14 | 81.08 | 0.663 |
| ERT | 83.50 | 0.888 | 80.00 | 87.00 | 0.672 | 85.81 | 81.08 | 0.719 | ||
| GBM | 86.25 | 0.925 | 87.50 | 85.00 | 0.725 | 83.11 | 0.899 | 85.14 | 81.08 | 0.663 |
| LGBM | 88.25 | 0.933 | 87.50 | 0.765 | 0.916 | 89.19 | ||||
| RF | 84.25 | 0.899 | 83.00 | 85.50 | 0.685 | 84.46 | 0.926 | 87.84 | 81.08 | 0.691 |
| SVM | 86.75 | 0.922 | 85.50 | 88.00 | 0.735 | 80.41 | 0.871 | 70.27 | 0.621 | |
| XGBT | 88.25 | 0.926 | 88.00 | 88.50 | 0.765 | 83.78 | 0.900 | 85.14 | 82.43 | 0.676 |
Note: The best performance value of each column is highlighted in bold for clarification.
Fig. 2. UMAP distribution of PVPs and non-PVPs using the 63-dimensional vector F63 and four compared individual descriptors. The orange dots represent PVPs and the blue dots represent non-PVPs. (A-E) show the distributions of DDE, DPC, BiLSTM, UniRep, and F63, respectively. (F) presents the ROC curves for iPVP-DRLF on the training and independent test datasets.
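The AUC values reported in the tables summarize ROC curves such as those in the figure. As a minimal sketch, the AUC can be computed directly from prediction scores as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function name and the toy scores are illustrative assumptions, not values from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for datasets of a few hundred proteins; a rank-based formulation gives the same value more efficiently for larger sets.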
Performance comparison of the proposed iPVP-DRLF and the state-of-the-art (SOTA) predictors on the training, testing, and blind datasets.
| Classifier | Train Acc (%) | Train AUC | Train SE (%) | Train SP (%) | Train MCC | Test Acc (%) | Test AUC | Test SE (%) | Test SP (%) | Test MCC | Blind Acc (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| iPVP-DRLF | 0.916 | 89.19 | ||||||||||
| VacPred-DPC | 75.50 | 0.800 | 70.00 | 81.00 | 0.510 | 80.41 | 0.840 | 82.43 | 78.38 | 0.610 | 59.91 |
| VacPred-PSSM | 81.75 | 0.860 | 76.50 | 87.00 | 0.640 | 86.49 | 82.43 | 0.730 | 62.99 | ||
Note: The best performance value of each column is highlighted in bold for clarification.
Fig. 3. Performance comparison of different PVP prediction software. (A) presents the comparison of iPVP-DRLF with the SOTA predictor VacPred-PSSM on the training, test, and blind datasets. (B) shows the benchmark results of iPVP-DRLF and other published software on the blind dataset.