Phasit Charoenkwan, Sakawrat Kanthawong, Nalini Schaduangrat, Janchai Yana, Watshara Shoombuatong.
Abstract
Although existing methods have been successful in predicting phage (or bacteriophage) virion proteins (PVPs) using various types of protein features and complex classifiers, such as support vector machine and naïve Bayes, these two classifiers do not allow interpretability. However, the characterization and analysis of PVPs might be of great significance for understanding the molecular mechanisms of bacteriophage genetics and for the development of antibacterial drugs. Hence, we herein propose a novel method (PVPred-SCM) based on the scoring card method (SCM) in conjunction with dipeptide composition to identify and characterize PVPs. In PVPred-SCM, the propensity scores of 400 dipeptides were calculated using the statistical discrimination approach. A rigorous independent validation test showed that PVPred-SCM utilizing only dipeptide composition yielded an accuracy of 77.66%, indicating that PVPred-SCM performed well relative to the state-of-the-art method utilizing a number of protein features. Furthermore, the propensity scores of dipeptides were used to provide insights into the biochemical and biophysical properties of PVPs. Upon comparison, PVPred-SCM was found to be superior to the existing methods in terms of simplicity, interpretability, and implementation. Finally, to facilitate high-throughput prediction of PVPs, we provide a user-friendly web server that estimates the likelihood that query sequences are PVPs. It is anticipated that PVPred-SCM will become a useful tool, or at least a complement to existing methods, for predicting and analyzing PVPs.
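As a minimal sketch of how a scoring-card classifier of this kind can be applied, the Python below computes the dipeptide composition of a sequence, forms a composition-weighted sum of dipeptide propensity scores, and compares that sum with a threshold. The weighted-sum form of the score, the function names, the toy propensity table, and the example threshold are illustrative assumptions; they are not the trained PVPred-SCM parameters.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: c / total for dp, c in counts.items()}


def scm_score(seq, propensity):
    """Composition-weighted sum of dipeptide propensity scores (assumed form)."""
    comp = dipeptide_composition(seq)
    return sum(comp[dp] * propensity.get(dp, 0.0) for dp in DIPEPTIDES)


def predict(seq, propensity, threshold):
    """Label a sequence as PVP when its score exceeds the threshold."""
    return "PVP" if scm_score(seq, propensity) > threshold else "non-PVP"


# Toy propensity table (hypothetical values, not the published scores).
toy_propensity = dict.fromkeys(DIPEPTIDES, 400.0)
toy_propensity["AA"] = 900.0

score = scm_score("AAAA", toy_propensity)       # only the dipeptide "AA" occurs
label = predict("AAAA", toy_propensity, 450.0)  # 450.0 is an arbitrary threshold
```

Because the score is a plain weighted sum, each dipeptide's contribution to a prediction can be read directly from the propensity table, which is the sense in which such a model is interpretable.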
Keywords: interpretable model; machine learning; phage virion protein; physicochemical properties; propensity score; scoring card method
Year: 2020 PMID: 32028709 PMCID: PMC7072630 DOI: 10.3390/cells9020353
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Summary of some existing methods for predicting phage virion proteins.
| Method | Classifier a | Sequence Feature b | Independent Test | Webserver |
|---|---|---|---|---|
| Seguritan et al.’s method | ANN | AAC, PIP | - | - |
| Feng et al.’s method | NB | AAC, DPC | - | - |
| PVPred | SVM | g-gap DPC | ✓ | ✓ |
| PVP-SVM | SVM | AAC, DPC, ATC, CTD, PCP | ✓ | ✓ |
| PhagePred | Multinomial NB | g-gap DPC feature tree | - | ✓ c |
| Tan et al.’s method | SVM | GDC | ✓ | - |
| Pred-BVP-Unb | SVM | CT, SAAC, bi-PSSM | ✓ | - |
| PVPred-SCM (This study) | SCM | DPC | ✓ | ✓ |
a ANN: artificial neural network, NB: Naïve Bayes, SCM: scoring card method, SVM: support vector machine. b AAC: amino acid composition, ATC: atomic composition, bi-PSSM: bi-profile position specific scoring matrix, CTD: chain-transition-distribution, CT: composition and translation, DPC: dipeptide composition, g-gap DPC: g-gap dipeptide composition, PCP: physicochemical properties, PIP: protein isoelectric points, SAAC: split amino acid composition. c The webserver was not functional during our manuscript preparation.
Figure 1. Schematic framework of PVPred-SCM for prediction and analysis of phage virion proteins (PVPs).
Comparison of ten SCM models with ten different optimized dipeptide propensity scores (opti-DPS) over 10-fold cross-validation.
| #Exp. | Fitness Score | Threshold | ACC (%) | SN (%) | SP (%) | MCC | auROC |
|---|---|---|---|---|---|---|---|
| 1 | 0.955 | 443.96 | 92.50 | 99.00 | 89.41 | 0.849 | 0.952 |
| 2 | 0.955 | 459.91 | 93.15 | 91.89 | 93.76 | 0.851 | 0.954 |
| 3 | 0.968 | 471.80 | 95.11 | 94.00 | 95.64 | 0.894 | 0.966 |
| 4 | 0.946 | 476.36 | 92.15 | 95.89 | 90.31 | 0.840 | 0.942 |
| 5 | 0.960 | 455.34 | 94.44 | 96.00 | 93.74 | 0.882 | 0.960 |
| 6 | 0.956 | 458.08 | 93.16 | 96.00 | 91.81 | 0.856 | 0.953 |
| 7 | 0.950 | 446.56 | 92.50 | 96.00 | 90.81 | 0.846 | 0.947 |
| 8 | 0.954 | 446.24 | 92.51 | 98.00 | 89.88 | 0.849 | 0.953 |
| 9 | 0.950 | 461.75 | 92.52 | 95.89 | 90.86 | 0.846 | 0.948 |
| 10 | 0.960 | 463.10 | 93.82 | 94.78 | 93.29 | 0.866 | 0.960 |
| Mean | 0.955 | 458.31 | 93.18 | 95.74 | 91.95 | 0.858 | 0.954 |
| STD. | 0.006 | 10.77 | 0.98 | 1.97 | 2.05 | 0.018 | 0.007 |
The threshold is the optimal score for discriminating PVPs from non-PVPs. ACC, SN, SP, MCC, and auROC are accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the receiver operating characteristic (ROC) curve, respectively.
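The Mean and STD. rows can be checked from the ten per-experiment values. The short Python sketch below does this for the ACC column; treating STD. as the sample standard deviation (n − 1 denominator) is an assumption, and the result should agree with the tabulated entries only to within rounding.

```python
from statistics import mean, stdev

# ACC (%) column of the ten cross-validation experiments, copied from the table.
cv_acc = [92.50, 93.15, 95.11, 92.15, 94.44, 93.16, 92.50, 92.51, 92.52, 93.82]

acc_mean = mean(cv_acc)
acc_std = stdev(cv_acc)  # sample standard deviation (assumed convention)

print(f"mean = {acc_mean:.2f}, std = {acc_std:.2f}")
```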
Comparison of ten SCM models with ten different optimized dipeptide propensity scores (opti-DPS) over independent validation test.
| #Exp. | Fitness Score | Threshold | ACC (%) | SN (%) | SP (%) | MCC | auROC |
|---|---|---|---|---|---|---|---|
| 1 | 0.955 | 443.96 | 74.47 | 80.00 | 71.88 | 0.486 | 0.782 |
| 2 | 0.955 | 459.91 | 75.53 | 73.33 | 76.56 | 0.476 | 0.743 |
| 3 | 0.968 | 471.80 | 76.60 | 70.00 | 79.69 | 0.482 | 0.781 |
| 4 | 0.946 | 476.36 | 76.60 | 63.33 | 82.81 | 0.461 | 0.775 |
| 5 | 0.960 | 455.34 | 76.60 | 63.33 | 82.81 | 0.461 | 0.793 |
| 6 | 0.956 | 458.08 | 71.28 | 73.33 | 70.31 | 0.410 | 0.749 |
| 7 | 0.950 | 446.56 | 72.34 | 76.67 | 70.31 | 0.440 | 0.749 |
| 8 | 0.954 | 446.24 | 70.21 | 73.33 | 68.75 | 0.395 | 0.742 |
| 9 | 0.950 | 461.75 | 77.66 | 76.67 | 78.13 | 0.523 | 0.781 |
| 10 | 0.960 | 463.10 | 73.40 | 66.67 | 76.56 | 0.417 | 0.787 |
| Mean | 0.955 | 458.31 | 74.47 | 71.67 | 75.78 | 0.455 | 0.768 |
| STD. | 0.006 | 10.77 | 2.56 | 5.72 | 5.22 | 0.040 | 0.020 |
The threshold is the optimal score for discriminating PVPs from non-PVPs. ACC, SN, SP, MCC, and auROC are accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the ROC curve, respectively.
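For readers who want to see how ACC, SN, SP, and MCC relate to one another, the Python sketch below derives them from confusion-matrix counts. The counts used in the example (30 PVPs and 64 non-PVPs, with 23 and 50 correctly classified) are hypothetical: they are not stated in this section and were chosen only because they are consistent with the percentages reported for experiment 9.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return ACC, SN, SP, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: fraction of PVPs recovered
    sp = tn / (tn + fp)  # specificity: fraction of non-PVPs recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc


# Hypothetical counts consistent with the experiment 9 row (assumption).
acc, sn, sp, mcc = binary_metrics(tp=23, fn=7, tn=50, fp=14)
print(f"ACC={acc:.4f} SN={sn:.4f} SP={sp:.4f} MCC={mcc:.3f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it is less flattering when the two classes differ in size, as they would under these assumed counts.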
Figure 2. Heatmap of dipeptide propensity scores obtained from the PVPred-SCM method.
Comparison between PVPred-SCM based on optimized (opti-DPS) and initial (init-DPS) dipeptide propensity scores assessed by 10-fold cross-validation and independent validation test.
| Method | 10-fold CV ACC (%) | 10-fold CV MCC | Independent Test ACC (%) | Independent Test SN (%) | Independent Test SP (%) | Independent Test MCC |
|---|---|---|---|---|---|---|
| init-DPS | 85.99 | 0.705 | 75.53 | 53.33 | 85.94 | 0.414 |
| opti-DPS | 92.52 | 0.846 | 77.66 | 76.67 | 78.13 | 0.523 |
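One common way to obtain initial propensity scores by statistical discrimination is to subtract each dipeptide's frequency in the negative class from its frequency in the positive class and rescale the differences to a fixed range. The Python sketch below follows that recipe. The pooling of counts per class, the [0, 1000] range, the function names, and the toy sequences are all assumptions made for illustration; the optimization step that turns init-DPS into opti-DPS is not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def class_composition(seqs):
    """Pooled dipeptide frequencies over all sequences of one class."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[dp] for dp in DIPEPTIDES)
    return {dp: (counts[dp] / total if total else 0.0) for dp in DIPEPTIDES}


def initial_propensity(pos_seqs, neg_seqs, low=0.0, high=1000.0):
    """Difference of class compositions, min-max scaled to [low, high]."""
    pos = class_composition(pos_seqs)
    neg = class_composition(neg_seqs)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    d_min, d_max = min(diff.values()), max(diff.values())
    if d_max == d_min:  # no discrimination at all: return the midpoint
        return {dp: (low + high) / 2 for dp in DIPEPTIDES}
    scale = (high - low) / (d_max - d_min)
    return {dp: low + (d - d_min) * scale for dp, d in diff.items()}


# Toy sequences (hypothetical), used only to exercise the functions.
scores = initial_propensity(["AAAT", "AATV"], ["KKKE", "KEKK"])
```

In this toy case the dipeptide most enriched in the positive set receives the top of the range, the one most enriched in the negative set receives the bottom, and dipeptides absent from both sets land in the middle.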
Figure 3. Histograms of the scores of PVPs and non-PVPs derived from PVPred-SCM on the benchmark dataset using initial (a) and optimized (b) dipeptide propensity scores.
Performance comparisons between PVPred-SCM and existing methods as assessed by the 10-fold cross-validation and independent validation test.
| Method | 10-fold CV ACC (%) | 10-fold CV MCC | Independent Test ACC (%) | Independent Test SN (%) | Independent Test SP (%) | Independent Test MCC |
|---|---|---|---|---|---|---|
| Feng et al.’s method a | 79.15 | - | - | - | - | - |
| PVPred a | 85.02 | - | 71.30 | 60.00 | 76.50 | 0.357 |
| PVP-SVM a | 87.00 | 0.695 | 79.80 | 66.70 | 85.90 | 0.531 |
| PhagePred a | 98.05 | 0.963 | - | - | - | - |
| Tan et al.’s method a | 87.95 | 0.761 | 75.53 | 70.00 | 78.13 | 0.464 |
| PVPred-SCM | 92.52 | 0.846 | 77.66 | 76.67 | 78.13 | 0.523 |
a Results are taken from the work of Tan et al. [14].
Top ten potential phage virion proteins with the highest PVP scores.
| Name (Uniprot) | PVP Score | PDBID | UniProtID | Source |
|---|---|---|---|---|
| Capsid protein G8P | 605.69 | 1HH0 | P82889 | |
| Capsid protein G8P | 581.12 | 2IFO | P03622 | |
| HIS6-pVII fusion protein | 541.13 | | ADR00487 | VCSM13 HIS6-pVII modified interference-resistant helper phage |
| G VIII capsid protein Precursor | 538.64 | 6A7F | NP_040575 | Enterobacteria phage Ike |
| P34 | 538.52 | | YP_009639974 | Enterobacteria phage PRD1 |
| Major coat protein | 534.58 | 1IFP | NP_040652 | |
| Structural protein P7 | 532.24 | | NP_049902 | |
| Transglycosylase | 529.49 | | YP_009639979 | Enterobacteria phage PRD1 |
| MULTISPECIES: major coat protein | 529.45 | | WP_015979773 | Enterobacteriaceae |
| Hypothetical protein | 519.23 | | WP_015975197 | |
Figure 4. Structures of selected PVPs elucidated via the fiber diffraction method. Each structure is labeled with a common name followed, on the next line, by the Protein Data Bank identification number (PDB ID) in parentheses.
Propensity scores of the twenty amino acids for being phage virion proteins (Score), along with the amino acid compositions (%) of PVPs and non-PVPs. Values in parentheses are ranks.
| Amino Acid | PVP (%) | Non-PVP (%) | Difference | Score | p-value |
|---|---|---|---|---|---|
| A-Ala | 9.98 | 8.09 | 1.89(1) | 529.50(1) | <0.05 |
| T-Thr | 6.90 | 5.49 | 1.41(4) | 511.43(2) | <0.05 |
| V-Val | 8.09 | 6.39 | 1.71(3) | 506.88(3) | <0.05 |
| G-Gly | 8.20 | 6.42 | 1.78(2) | 506.68(4) | <0.05 |
| S-Ser | 7.29 | 6.10 | 1.19(5) | 504.63(5) | <0.05 |
| Y-Tyr | 3.37 | 3.50 | −0.12(13) | 479.13(6) | 0.571 |
| N-Asn | 4.69 | 4.64 | 0.05(9) | 471.50(7) | 0.866 |
| P-Pro | 4.13 | 3.76 | 0.38(7) | 462.15(8) | 0.178 |
| Q-Gln | 4.18 | 3.72 | 0.46(6) | 452.83(9) | 0.106 |
| I-Ile | 6.23 | 6.06 | 0.17(8) | 443.18(10) | 0.629 |
| W-Trp | 1.38 | 1.50 | −0.12(12) | 442.33(11) | 0.408 |
| D-Asp | 5.23 | 5.87 | −0.64(16) | 435.45(12) | <0.05 |
| C-Cys | 0.66 | 1.05 | −0.39(14) | 426.58(13) | <0.05 |
| M-Met | 2.64 | 2.73 | −0.09(11) | 426.15(14) | 0.626 |
| F-Phe | 3.91 | 3.91 | 0.00(10) | 423.45(15) | 0.989 |
| L-Leu | 7.80 | 8.34 | −0.54(15) | 395.85(16) | 0.160 |
| R-Arg | 4.28 | 5.48 | −1.21(18) | 383.15(17) | <0.05 |
| H-His | 1.04 | 1.80 | −0.76(17) | 378.45(18) | <0.05 |
| E-Glu | 4.85 | 7.36 | −2.51(19) | 358.93(19) | <0.05 |
| K-Lys | 5.14 | 7.81 | −2.67(20) | 310.90(20) | <0.05 |
The three important physicochemical properties (PCPs) derived from PVPred-SCM.
| Amino Acid | PS | KOEP990101 | Side-Chain | WOLR790101 |
|---|---|---|---|---|
| A-Ala | 529.50(1) | −0.04(12) | 15(19) | 1.12(5) |
| T-Thr | 511.43(2) | 0.39(3) | 45(15) | −0.02(10) |
| V-Val | 506.88(3) | −0.06(13) | 43(16) | 1.13(4) |
| G-Gly | 506.68(4) | 1.24(1) | 1(20) | 1.20(1) |
| S-Ser | 504.63(5) | 0.15(7) | 31(18) | −0.05(11) |
| Y-Tyr | 479.13(6) | 0.05(8) | 107(2) | −0.23(13) |
| N-Asn | 471.50(7) | 0.25(5) | 58(11) | −0.83(16) |
| P-Pro | 462.15(8) | 0.00(9) | 42(17) | 0.54(9) |
| Q-Gln | 452.83(9) | −0.02(11) | 72(9) | −0.78(14) |
| I-Ile | 443.18(10) | −0.26(17) | 57(12) | 1.16(3) |
| W-Trp | 442.33(11) | 0.21(6) | 130(1) | −0.19(12) |
| D-Asp | 435.45(12) | 0.27(4) | 59(10) | −0.83(17) |
| C-Cys | 426.58(13) | 0.57(2) | 47(14) | 0.59(7) |
| M-Met | 426.15(14) | −0.09(14) | 75(6) | 0.55(8) |
| F-Phe | 423.45(15) | −0.01(10) | 91(4) | 0.67(6) |
| L-Leu | 395.85(16) | −0.38(20) | 57(13) | 1.18(2) |
| R-Arg | 383.15(17) | −0.30(18) | 101(3) | −2.55(20) |
| H-His | 378.45(18) | −0.11(15) | 82(5) | −0.93(19) |
| E-Glu | 358.93(19) | −0.33(19) | 73(7) | −0.92(18) |
| K-Lys | 310.90(20) | −0.18(16) | 73(8) | −0.80(15) |
| Correlation R | 1.000 | 0.502 | −0.516 | 0.484 |
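The Correlation R row can be reproduced from the tabulated values. The Python sketch below computes a correlation between the propensity scores (PS) and the side-chain column; reading R as Pearson's correlation coefficient is an assumption, and under that assumption the result should come out close to the tabulated −0.516.

```python
import math

# PS and side-chain values copied from the table, in row order:
# A, T, V, G, S, Y, N, P, Q, I, W, D, C, M, F, L, R, H, E, K
ps = [529.50, 511.43, 506.88, 506.68, 504.63, 479.13, 471.50, 462.15,
      452.83, 443.18, 442.33, 435.45, 426.58, 426.15, 423.45, 395.85,
      383.15, 378.45, 358.93, 310.90]
side_chain = [15, 45, 43, 1, 31, 107, 58, 42, 72, 57,
              130, 59, 47, 75, 91, 57, 101, 82, 73, 73]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


r = pearson_r(ps, side_chain)
print(f"R = {r:.3f}")
```

A negative R here means that amino acids with larger side-chain values tend to have lower propensity scores in this table.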
Figure 5. Screenshots of the PVPred-SCM web server before (a) and after (b) submission of a query protein. Prediction results are reported as the PVP scores derived from the scoring function (S(P)) together with the predicted classes.