Saeed Ahmad, Phasit Charoenkwan, Julian M W Quinn, Mohammad Ali Moni, Md Mehedi Hasan, Pietro Lio', Watshara Shoombuatong.
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several machine learning (ML) methods have been developed for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored 13 different feature descriptors covering different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) together with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs), which were treated as a new feature vector. Finally, we used a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate that SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source code and datasets for this work are available for download from the GitHub repository (https://github.com/saeed344/SCORPION).
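The stacking idea in the abstract can be illustrated with a minimal, standard-library sketch. Everything named here is hypothetical: `CentroidScorer` is a toy stand-in for the real baseline learners, the two toy "views" stand in for the 13 descriptors, and the two model factories stand in for the 10 ML algorithms. The use of out-of-fold predictions is a common stacking practice assumed here, not something the abstract states, and the final centroid-based meta-learner is only a placeholder for the stacked model.

```python
import math
import random

class CentroidScorer:
    """Toy baseline model (hypothetical): scores by distance to class centroids."""

    def __init__(self, sharpness=1.0):
        self.sharpness = sharpness

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        # Closer to the class-1 centroid -> probability nearer 1.
        gap = math.dist(x, self.centroids[0]) - math.dist(x, self.centroids[1])
        return 1.0 / (1.0 + math.exp(-self.sharpness * gap))

def kfold(n, k, seed=0):
    """Split range(n) into k shuffled, disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def probabilistic_features(views, y, make_models, k=5):
    """Return an n x (len(views) * len(make_models)) matrix of out-of-fold probabilities."""
    n = len(y)
    pf = [[] for _ in range(n)]
    folds = kfold(n, k)
    for X in views:                  # one feature matrix per descriptor
        for make in make_models:     # one factory per ML algorithm
            column = [None] * n
            for test in folds:
                held_out = set(test)
                train = [i for i in range(n) if i not in held_out]
                model = make().fit([X[i] for i in train], [y[i] for i in train])
                for i in test:
                    column[i] = model.predict_proba(X[i])
            for i in range(n):
                pf[i].append(column[i])
    return pf

# Toy data: 20 negatives and 20 positives, two synthetic descriptor views.
rng = random.Random(1)
y = [0] * 20 + [1] * 20

def toy_view(dim):
    return [[rng.gauss(2.0 * t - 1.0, 0.5) for _ in range(dim)] for t in y]

views = [toy_view(3), toy_view(5)]
make_models = [lambda: CentroidScorer(1.0), lambda: CentroidScorer(4.0)]

pf = probabilistic_features(views, y, make_models)
meta = CentroidScorer().fit(pf, y)   # placeholder meta-learner on the PF vector
stacked = [meta.predict_proba(row) for row in pf]
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(stacked, y)) / len(y)
print(len(pf), len(pf[0]), accuracy)
```

With 13 descriptors and 10 algorithms, the same construction would yield one probability column per descriptor-algorithm pair, i.e. a 130-dimensional PF vector before feature selection.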
Year: 2022 PMID: 35260777 PMCID: PMC8904530 DOI: 10.1038/s41598-022-08173-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Characteristics of the existing methods for PVP prediction.
| Predictors/tools | Year | Algorithm | Feature descriptors | Type | Evaluation strategy |
|---|---|---|---|---|---|
| iVIREONS | 2012 | ANN | AAC, PIP | Single | 10CV |
| Feng et al.’s method | 2013 | NB | AAC, DPC | Single | 10CV |
| PVPred | 2014 | SVM | GGAP | Single | LOOCV, IND |
| Zhang et al.’s method | 2015 | SVM | CTD, bi-profile Bayes, PAAC, PSSM | Ensemble | 10CV, IND |
| PVP-SVM | 2018 | SVM | AAC, ATC, CTD, DPC, PCP | Single | 10CV, IND |
| PhagePred | 2018 | NB | GGAP | Single | 10CV, LOOCV |
| Tan et al.’s method | 2018 | SVM | GGAP | Single | 10CV, IND |
| Ru et al.’s method | 2019 | RF | CCPA, AKSNG, Seq-Str | Single | 10CV |
| Pred-BVP-Unb | 2019 | SVM | CT, Bi-PSSM, SAAC | Single | LOOCV, IND |
| PVPred-SCM | 2020 | SCM | DPC | Single | 10CV, IND |
| Meta-iPVP | 2020 | SVM | AAC, APAAC, DPC, CTDC, CTDD, CTDT and PAAC | Ensemble | 10CV, IND |
| iPVP-MCV | 2021 | SVM | PSSM-AAC, PSSM-composition and DP-PSSM | Ensemble | LOOCV, 10CV, IND |
| VirionFinder | 2021 | CNN | AAI | Deep learning | 10CV, IND |
| SCORPION | This study | RF | AAC, AAI, APAAC, CTDC, CTDD, CTDT, DDE, DPC, EAAC, PAAC, PSSM_AAC, PSSM_COM and PSSM_DP | Ensemble | 10CV, IND |
ANN, artificial neural network; CNN, convolutional neural network; LR, logistic regression; NB, naive Bayes; RF, random forest; SCM, scoring card matrix; SVM, support vector machine; AAC, amino acid composition; AACPCP, amino acid composition and physicochemical properties; AKSNG, adaptive k-skip-n-gram algorithm; APAAC, amphiphilic pseudo-amino acid composition; ATC, atomic composition; Bi-PSSM, bigram position-specific scoring matrix; CTD, composition, transition and distribution; DPC, dipeptide composition; PSSM_DP, position-specific scoring matrix based on dipeptides; GGAP, g-gap dipeptide composition; GGAPTree, g-gap feature tree; PAAC, pseudo-amino acid composition; PCP, physicochemical properties; PF, probabilistic features; PIP, protein isoelectric points; PSSM, position-specific scoring matrix; PSSM_AAC, position-specific scoring matrix based on amino acid composition; PSSM_COM, position-specific scoring matrix based on composition; PSSM Profiles, position-specific scoring matrix based on profiles; SAAC, split amino acid composition; Seq-Str, sequence-structure; 10CV, tenfold cross-validation; IND, independent test; LOOCV, leave-one-out cross-validation.
Figure 1. Schematic flowchart of the development of SCORPION. It consists of dataset construction, baseline model construction, new feature representations and stacked model development.
Summary of the 13 sequence-based feature descriptors along with their descriptions and dimensions.
| Order | Descriptor | Description | Dimension |
|---|---|---|---|
| 1 | AAC | Frequency of 20 amino acids | 20 |
| 2 | AAI | Different biochemical and biophysical properties extracted from the AAindex database | 11 |
| 3 | APAAC | Amphiphilic pseudo-amino acid composition | 22 |
| 4 | CTDC | Percentage of particular amino acid property groups | 39 |
| 5 | CTDD | Distribution of amino acid properties in sequences | 195 |
| 6 | CTDT | Percentage of mutual conversion in amino acid properties | 39 |
| 7 | DDE | Dipeptide deviation from expected mean | 400 |
| 8 | DPC | Frequency of 400 dipeptides | 400 |
| 9 | EAAC | Enhanced amino acid composition | 20 |
| 10 | PAAC | Pseudo-amino acid composition | 21 |
| 11 | PSSM_AAC | Traditional AAC extended from the primary sequence to the PSI-BLAST profile | 20 |
| 12 | PSSM_DP | Traditional DPC extended from the primary sequence to the PSI-BLAST profile | 400 |
| 13 | PSSM_COM | Position-specific scoring matrix composition | 400 |
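As an illustration of the two simplest descriptors in the table, the following sketch computes AAC (20 values) and DPC (400 values) from a protein sequence. It assumes the sequence contains only the 20 standard residues and has at least two of them; the function names and the example sequence are hypothetical and are not taken from the SCORPION code base.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    seq = seq.upper()
    return [seq.count(r) / len(seq) for r in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 adjacent residue pairs."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]

example = "MKVLAAGIV"  # hypothetical sequence, for illustration only
print(len(aac(example)), len(dpc(example)))
```

Both vectors sum to 1 for a sequence of standard residues, which is a convenient sanity check before feeding them to a classifier.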
Figure 2. Performance evaluations of the top 30 baseline models. (A,B) Cross-validation ACC and MCC of the top 30 baseline models. (C,D) Independent test ACC and MCC of the top 30 baseline models.
Cross-validation results for different feature representations using class and probabilistic information (PF, probabilistic features; CF, class features; PCF, combined probabilistic and class features).
| Features | Dimension | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|
| PF | 130 | 0.858 | 0.840 | 0.876 | 0.722 | 0.914 |
| CF | 130 | 0.838 | 0.848 | 0.828 | 0.684 | 0.895 |
| PCF | 260 | 0.864 | 0.880 | 0.848 | 0.733 | 0.920 |
| Optimal PF | 50 | 0.868 | 0.884 | 0.852 | 0.743 | 0.920 |
| Optimal CF | 5 | 0.868 | 0.880 | 0.856 | 0.743 | 0.902 |
| Optimal PCF | 5 | 0.868 | 0.884 | 0.852 | 0.741 | 0.907 |
Independent test results for different feature representations using class and probabilistic information.
| Features | Dimension | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|
| PF | 130 | 0.857 | 0.937 | 0.778 | 0.723 | 0.924 |
| CF | 130 | 0.817 | 0.746 | 0.889 | 0.642 | 0.892 |
| PCF | 260 | 0.857 | 0.778 | 0.937 | 0.723 | 0.925 |
| Optimal PF | 50 | 0.881 | 0.810 | 0.952 | 0.770 | 0.922 |
| Optimal CF | 5 | 0.802 | 0.794 | 0.810 | 0.603 | 0.859 |
| Optimal PCF | 5 | 0.873 | 0.841 | 0.905 | 0.748 | 0.891 |
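The four threshold-based metrics reported in these tables follow directly from the confusion matrix. The sketch below uses hypothetical counts (51 true positives, 12 false negatives, 60 true negatives, 3 false positives); they assume a balanced test set of 63 PVPs and 63 non-PVPs, which is not stated in this section, and were chosen only because they reproduce the Optimal PF row of the independent test table.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)   # sensitivity: recall on the positive class
    sp = tn / (tn + fp)   # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Hypothetical counts, assumed for illustration (see the note above).
m = binary_metrics(tp=51, fn=12, tn=60, fp=3)
print({name: round(value, 3) for name, value in m.items()})
# → {'ACC': 0.881, 'Sn': 0.81, 'Sp': 0.952, 'MCC': 0.77}
```

On a balanced dataset ACC is simply the mean of Sn and Sp, which is why a model can gain specificity at the cost of sensitivity without any change in accuracy.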
Figure 3. Performance comparison of SCORPION with the models without the optimal PF feature vector, as assessed by tenfold cross-validation (A) and independent test (B).
Figure 4. Performance comparison of the optimal PFs with the top five commonly used feature descriptors on the training (A,B) and independent test (C,D) datasets. Prediction results of the optimal PFs and the top five commonly used feature descriptors in terms of ACC, Sn, Sp and MCC (A,C). ROC curves and AUC values of the optimal PFs and the top five commonly used feature descriptors (B,D).
Figure 5. t-distributed stochastic neighbor embedding (t-SNE) distribution of positive and negative samples on the training dataset.
Figure 6. Performance comparison of SCORPION with the top five baseline models on the training (A,B) and independent test (C,D) datasets. Prediction results of SCORPION and the top five baseline models in terms of ACC, Sn, Sp and MCC (A,C). ROC curves and AUC values of SCORPION and the top five baseline models (B,D).
Figure 7. Feature importance from SCORPION and three selected baseline models. SHAP values indicate the directionality of the top features: negative and positive SHAP values push the predictions toward PVPs and non-PVPs, respectively. SCORPION (A), RF-AAC (B), XGB-DPC (C) and LR-DPC (D).
Cross-validation results of SCORPION and existing methods on Charoenkwan’s dataset.
| Methods (a) | ACC | Sn | Sp | MCC |
|---|---|---|---|---|
| Meta-iPVP | 0.846 | 0.832 | 0.698 | 0.846 |
| iPVP-MCV | 0.864 | 0.876 | 0.728 | 0.864 |
| SCORPION | 0.868 | 0.852 | 0.743 | 0.868 |
(a) The performance of the existing methods was obtained from the iPVP-MCV work[19].
Independent test results of SCORPION and existing methods on Charoenkwan’s dataset.
| Methods (a) | ACC | Sn | Sp | MCC |
|---|---|---|---|---|
| PVPred | 0.730 | 0.892 | 0.663 | 0.505 |
| PVP-SVM | 0.746 | 0.816 | 0.701 | 0.505 |
| PVPred-SCM | 0.714 | 0.745 | 0.690 | 0.432 |
| Meta-iPVP | 0.817 | 0.889 | 0.746 | 0.642 |
| iPVP-MCV | 0.833 | 0.889 | 0.778 | 0.671 |
| SCORPION | 0.881 | 0.810 | 0.952 | 0.770 |
(a) The performance of the existing methods was obtained from the iPVP-MCV work[19].