Phasit Charoenkwan1, Nalini Schaduangrat2, Chanin Nantasenamat2, Theeraphon Piacham3, Watshara Shoombuatong2.
Abstract
Understanding the functional mechanisms of quorum-sensing peptides (QSPs) plays an essential role in finding new opportunities to combat bacterial infections through drug design. With the avalanche of newly available peptide sequences in the post-genomic age, it is highly desirable to develop a computational model for efficient, rapid and high-throughput QSP identification based on peptide sequence information alone. Although a few methods have been developed for predicting QSPs, their prediction accuracy and interpretability still require further improvement. Thus, in this work, we proposed an accurate sequence-based predictor (called iQSP) and a set of interpretable rules (called IR-QSP) for predicting and analyzing QSPs. In iQSP, we utilized a support vector machine (SVM) in combination with 18 informative features derived from physicochemical properties (PCPs). A rigorous independent validation test showed that iQSP achieved a maximum accuracy and MCC of 93.00% and 0.86, respectively. Furthermore, a set of interpretable rules, IR-QSP, was extracted using a random forest model and the 18 informative PCPs. Finally, for the convenience of experimental scientists, the iQSP web server was established and made freely available online. It is anticipated that iQSP will become a useful tool, or at least a complement to existing methods, for predicting and analyzing QSPs.
Keywords: classification; machine learning; physicochemical properties; quorum sensing peptides; random forest; support vector machine
Year: 2019 PMID: 31861928 PMCID: PMC6981611 DOI: 10.3390/ijms21010075
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Structures of selected quorum sensing peptides that have been experimentally elucidated, where red and yellow colors represent alpha-helix and loop structures, respectively. Each structure is labelled by a common name followed by the Protein Data Bank identification number (PDB ID) in parentheses on the subsequent line.
Summary of existing methods for predicting quorum sensing peptides.
| Method | Classifier a | Sequence Feature b | No. of Features | Independent Test |
|---|---|---|---|---|
| QSPpred | SVM | AAC, DPC, N5C5Bin, PCP | 630 | Yes |
| QSPpred-FL | RF | GDP, OVP, CTD, ASDC | 913 | No |
| iQSP (this study) | SVM | PCP | 18 | Yes |
a RF: random forest, SVM: support vector machine. b AAC: amino acid composition, ASDC: adaptive skip dipeptide composition, DPC: dipeptide composition, CTD: composition–transition–distribution, GDP: g-gap dipeptide composition, NCBin: binary pattern of N- and C-terminal residues, OVP: overlapping property features, PCP: physicochemical properties.
Figure 2. Schematic framework of iQSP. Arrows in the figure represent the direction in which data flows from one process to the next.
Amino acid compositions (%) of quorum sensing (QSP) and non-quorum sensing (Non-QSP) peptides along with their mean decrease of Gini index (MDGI) values.
| Amino Acid | QSP (%) | Non-QSP (%) | Difference | | MDGI |
|---|---|---|---|---|---|
| | 0.109 | 0.049 | 0.059 (1) | 0.000 | 41.94 (1) |
| | 0.047 | 0.093 | −0.045 (20) | 0.000 | 19.37 (2) |
| | 0.076 | 0.106 | −0.030 (18) | 0.002 | 17.57 (3) |
| | 0.043 | 0.054 | −0.011 (14) | 0.114 | 15.56 (4) |
| | 0.034 | 0.015 | 0.019 (4) | 0.000 | 10.99 (5) |
| | 0.063 | 0.067 | −0.004 (11) | 0.629 | 10.15 (6) |
| | 0.053 | 0.084 | −0.032 (19) | 0.000 | 9.15 (7) |
| | 0.042 | 0.020 | 0.022 (3) | 0.000 | 8.69 (8) |
| | 0.039 | 0.050 | −0.011 (15) | 0.128 | 7.99 (9) |
| | 0.049 | 0.061 | −0.013 (16) | 0.091 | 7.82 (10) |
| | 0.079 | 0.062 | 0.017 (5) | 0.023 | 6.70 (11) |
| | 0.078 | 0.094 | −0.016 (17) | 0.032 | 6.56 (12) |
| | 0.041 | 0.043 | −0.001 (10) | 0.846 | 6.25 (13) |
| | 0.070 | 0.043 | 0.026 (2) | 0.009 | 5.24 (14) |
| | 0.051 | 0.041 | 0.010 (7) | 0.098 | 4.38 (15) |
| | 0.033 | 0.029 | 0.004 (8) | 0.411 | 4.37 (16) |
| | 0.026 | 0.030 | −0.004 (12) | 0.423 | 3.64 (17) |
| | 0.028 | 0.016 | 0.012 (6) | 0.013 | 2.84 (18) |
| | 0.010 | 0.017 | −0.007 (13) | 0.048 | 2.71 (19) |
| | 0.029 | 0.026 | 0.003 (9) | 0.484 | 2.49 (20) |
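The composition columns in the table above are per-residue fractions for each peptide class. A minimal sketch of how such fractions could be computed is given here; it assumes that per-peptide fractions are averaged over the set (the source does not state the exact procedure), and the function names and the two toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(peptide):
    """Return the fraction of each standard amino acid in one peptide."""
    counts = Counter(peptide.upper())
    length = sum(counts[a] for a in AMINO_ACIDS)
    if length == 0:
        raise ValueError("peptide contains no standard amino acids")
    return {a: counts[a] / length for a in AMINO_ACIDS}

def mean_composition(peptides):
    """Average the per-peptide fractions over a set of peptides (assumed procedure)."""
    per_peptide = [composition(p) for p in peptides]
    n = len(per_peptide)
    return {a: sum(c[a] for c in per_peptide) / n for a in AMINO_ACIDS}

# Hypothetical toy sequences, not peptides from the study.
toy_set = ["AAKK", "ACDE"]
avg = mean_composition(toy_set)
print({a: round(v, 3) for a, v in avg.items() if v > 0})
```

Running the same function on the QSP and Non-QSP sets separately and subtracting the results would give a difference column of the kind shown in the table.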
Figure 3. Sequence logo representations of the first 10 residues at the N-terminal region and the last 10 residues at the C-terminal region of QSPs (a,b) and Non-QSPs (c,d). The x-axis shows the residue position, and the y-axis is proportional to the propensities of the amino acids. Colors are: red for hydrophobic (A, I, L, M, F, V, C, G), green for charged (R, D, E, K), orange for polar (Q, H, S, T), and black for the remaining amino acids (P, Y, W, N).
Ten different feature subsets and their corresponding physicochemical properties in the AAindex database.
| Subset | # Feature | AAindex |
|---|---|---|
| 1 | 14 | FASG760103, ISOY800102, KYTJ820101, LIFS790103, MEIH800102, OOBM850101, PALJ810113, QIAN880137, ROSG850101, TANS770101, AURR980116, VINM940104, NADH010107, FUKS010106 |
| 2 | 17 | ANDN920101, GRAR740102, JANJ780103, LIFS790101, NAGK730102, PALJ810102, QIAN880137, RACS820102, ROBB760109, ROBB760111, SUEM840102, TANS770104, TANS770108, AURR980115, VINM940103, BLAM930101, PUNT030102 |
| 3 | 16 | EISD860101, GEIM800105, HOPA770101, LIFS790101, MEIH800103, OOBM770101, PALJ810111, QIAN880128, RACS770101, ROBB760107, ROBB790101, AURR980110, VINM940101, KIMC930101, FUKS010106, TSAJ990102 |
| 4 | 17 | DAYM780101, ISOY800108, MEIH800102, NAKH900104, PALJ810111, PALJ810113, PRAM820102, RACS770101, ROBB760111, SNEP660104, AURR980118, VINM940101, PARS000101, NADH010106, FUKS010106, TSAJ990101, GEOR030107 |
| 5 | 17 | ANDN920101, FINA910102, GEIM800102, GRAR740102, ISOY800101, LIFS790102, NOZY710101, PALJ810106, QIAN880138, ROBB760107, ROBB760112, ZIMJ680101, AURR980113, MUNV940102, MUNV940105, MONM990201, FUKS010109 |
| 6 | 18 | DAYM780101, GEIM800101, GRAR740101, ISOY800107, MANP780101, PALJ810111, PONP800102, PRAM820101, PRAM900102, QIAN880137, ROBB760104, ROBB760113, AURR980102, MUNV940103, WIMW960101, KUMS000103, NADH010104, FUKS010106 |
| 7 | 15 | GRAR740101, ISOY800108, LIFS790103, MIYS850101, OOBM850105, PALJ810114, QIAN880137, RACS820109, ROBB760108, ROSG850101, TANS770107, AURR980116, VINM940104, PARS000102, GEOR030108 |
| 8 | 17 | ANDN920101, ARGP820103, BIGC670101, BULH740101, BUNA790102, CHOC760104, CHOP780211, CHOP780212, DESM900101, FASG760101, GRAR740101, NAKH900101, RADA880107, ROBB790101, TANS770109, GUOD860101, PONJ960101 |
| 9 | 18 | FAUJ880112, FINA910103, GRAR740101, JOND750102, NISK860101, PALJ810108, PONP800104, PRAM820101, QIAN880125, QIAN880134, ROBB760108, WERD780104, AURR980113, VINM940101, VINM940102, PARS000102, FUKS010108, WOLR790101 |
| 10 | 17 | ANDN920101, BROC820102, BUNA790101, CHAM830104, CHOP780207, CIDH920101, CIDH920104, DESM900102, FASG760101, FINA910102, JANJ780101, LEWP710101, PALJ810110, QIAN880125, VASM830101, AURR980108, COSI940101 |
# Feature represents the number of features used for constructing a model.
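Each AAindex entry assigns one numerical value to each of the 20 amino acids. A common way to turn such an entry into a peptide-level feature is to average the values over the residues of the peptide. The sketch below assumes that averaging scheme (the source does not spell it out) and illustrates it with KYTJ820101, the Kyte-Doolittle hydropathy index listed in subset 1; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index (AAindex entry KYTJ820101).
KYTJ820101 = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def pcp_feature(peptide, scale):
    """Average one physicochemical scale over the residues of a peptide."""
    values = [scale[res] for res in peptide.upper() if res in scale]
    if not values:
        raise ValueError("peptide contains no residues covered by the scale")
    return sum(values) / len(values)

def pcp_vector(peptide, scales):
    """Build a feature vector with one averaged value per scale."""
    return [pcp_feature(peptide, scale) for scale in scales]

# Hypothetical toy peptide: (1.8 + 4.5 + 3.8 + 4.2) / 4
hydropathy = pcp_feature("AILV", KYTJ820101)
print(round(hydropathy, 3))
```

Under this assumption, a subset with 18 AAindex entries yields an 18-dimensional vector per peptide via `pcp_vector`.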
Performance comparisons of SVM models built with various subsets of physicochemical properties evaluated by means of ten-fold cross-validation subjected to ten rounds of random splits.
| Subset | # Feature | Ac (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| 1 | 14 | 91.23 ± 1.31 | 91.17 ± 2.65 | 88.23 ± 3.03 | 0.82 ± 0.03 | 0.95 ± 0.05 |
| 2 | 17 | 90.69 ± 1.26 | 92.08 ± 2.82 | 87.04 ± 2.89 | 0.82 ± 0.03 | 0.95 ± 0.05 |
| 3 | 16 | 91.58 ± 1.77 | 92.79 ± 2.73 | 88.45 ± 2.97 | 0.83 ± 0.04 | 0.94 ± 0.05 |
| 4 | 17 | 91.63 ± 1.75 | 91.49 ± 2.73 | 89.55 ± 3.71 | 0.84 ± 0.04 | 0.92 ± 0.05 |
| 5 | 17 | 92.01 ± 1.40 | 91.43 ± 3.64 | 90.27 ± 2.39 | 0.84 ± 0.03 | 0.92 ± 0.08 |
| 6 | 18 | 91.07 ± 1.77 | 90.06 ± 2.69 | 88.79 ± 3.40 | 0.82 ± 0.04 | 0.91 ± 0.10 |
| 7 | 15 | 89.28 ± 1.99 | 88.78 ± 3.88 | 86.44 ± 2.67 | 0.79 ± 0.04 | 0.93 ± 0.07 |
| 8 | 17 | 88.24 ± 2.15 | 85.04 ± 3.61 | 86.81 ± 3.86 | 0.76 ± 0.05 | 0.92 ± 0.07 |
| 9 | 18 | 90.54 ± 1.26 | 90.60 ± 4.12 | 87.57 ± 2.68 | 0.81 ± 0.03 | 0.93 ± 0.09 |
| 10 | 17 | 92.19 ± 1.09 | 90.12 ± 2.98 | 92.16 ± 1.96 | 0.84 ± 0.02 | 0.93 ± 0.07 |
# Feature represents the number of features used for constructing a model.
Performance comparisons of SVM models built with various subsets of physicochemical properties evaluated by means of independent validation test subjected to ten rounds of random splits.
| Subset | # Feature | Ac (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| 1 | 14 | 92.50 ± 1.67 | 95.50 ± 3.69 | 89.50 ± 4.97 | 0.85 ± 0.03 | 0.95 ± 0.03 |
| 2 | 17 | 91.50 ± 3.58 | 98.00 ± 3.50 | 85.00 ± 7.07 | 0.84 ± 0.07 | 0.96 ± 0.03 |
| 3 | 16 | 92.00 ± 2.58 | 92.50 ± 5.40 | 91.50 ± 5.80 | 0.84 ± 0.05 | 0.95 ± 0.02 |
| 4 | 17 | 92.25 ± 2.19 | 93.00 ± 4.83 | 91.50 ± 4.12 | 0.85 ± 0.04 | 0.97 ± 0.01 |
| 5 | 17 | 92.50 ± 2.36 | 94.00 ± 5.68 | 91.00 ± 6.15 | 0.86 ± 0.05 | 0.96 ± 0.02 |
| 6 | 18 | 93.00 ± 1.97 | 92.50 ± 5.40 | 93.50 ± 4.12 | 0.86 ± 0.04 | 0.96 ± 0.02 |
| 7 | 15 | 92.00 ± 1.97 | 94.00 ± 5.16 | 90.00 ± 7.45 | 0.85 ± 0.04 | 0.96 ± 0.02 |
| 8 | 17 | 91.75 ± 2.90 | 94.00 ± 5.16 | 89.50 ± 6.43 | 0.84 ± 0.06 | 0.95 ± 0.04 |
| 9 | 18 | 91.50 ± 3.38 | 90.50 ± 7.62 | 92.50 ± 7.91 | 0.84 ± 0.06 | 0.97 ± 0.04 |
| 10 | 17 | 92.50 ± 1.67 | 95.00 ± 3.33 | 90.00 ± 4.08 | 0.85 ± 0.03 | 0.95 ± 0.04 |
# Feature represents the number of features used for constructing a model.
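The Ac, Sn, Sp and MCC columns follow from the confusion-matrix counts of a binary classifier, and the "±" entries summarize ten rounds. The sketch below uses the standard definitions of these metrics; the counts are hypothetical, and the use of the sample standard deviation for the spread is an assumption, since the source does not say which estimator was used.

```python
import math
import statistics

def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity (in %) and MCC from confusion counts."""
    ac = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": 100 * ac, "Sn": 100 * sn, "Sp": 100 * sp, "MCC": mcc}

def summarize(values):
    """Mean and sample standard deviation over repeated rounds (assumed estimator)."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical counts for a single round on a 40-peptide test set.
m = binary_metrics(tp=19, fn=1, tn=18, fp=2)
print({k: round(v, 2) for k, v in m.items()})

# Hypothetical accuracies from three rounds.
mean_ac, sd_ac = summarize([90.0, 92.5, 95.0])
print(f"{mean_ac:.2f} ± {sd_ac:.2f}")
```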
Figure 4. Performance comparisons of SVM models in conjunction with 531 PCPs and the eighteen informative PCPs assessed by 10-fold cross-validation (a,b) and independent validation test (c,d).
Performance comparisons between iQSP and existing methods assessed by 10-fold cross-validation and independent validation tests.
| Method | # Feature | 10-Fold CV Ac (%) | 10-Fold CV MCC | 10-Fold CV auROC | Independent Test Ac (%) | Independent Test MCC | Independent Test auROC |
|---|---|---|---|---|---|---|---|
| QSPpred a | 430 | 91.25 | 0.830 | 0.960 | 90.00 | 0.800 | 0.950 |
| QSPpred-FL b | 913 | 94.30 | 0.885 | N/A | 92.50 | 0.860 | 0.962 |
| iQSP | 18 | 91.07 ± 1.77 | 0.82 ± 0.04 | 0.91 ± 0.10 | 93.00 ± 1.97 | 0.86 ± 0.04 | 0.96 ± 0.02 |
a Results were taken from the original QSPpred publication. b Results were obtained by feeding the peptide sequences in the independent set to the web server of QSPpred-FL. # Feature represents the number of features used for constructing a model. N/A indicates that the authors did not report the result.
Performance comparison of iQSP with other conventional classifiers by using the optimal feature subset. Models were evaluated by means of 10-fold cross-validation and independent validation test subjected to ten rounds of random splits.
| Classifier | 10-Fold CV Ac (%) | 10-Fold CV MCC | 10-Fold CV auROC | Independent Test Ac (%) | Independent Test MCC | Independent Test auROC |
|---|---|---|---|---|---|---|
| | 85.13 ± 0.27 | 0.72 ± 0.01 | 0.86 ± 0.00 | 85.75 ± 1.21 | 0.73 ± 0.02 | 0.91 ± 0.03 |
| DT | 83.57 ± 2.74 | 0.67 ± 0.06 | 0.87 ± 0.03 | 83.75 ± 3.39 | 0.68 ± 0.07 | 0.86 ± 0.05 |
| RF | 87.93 ± 0.48 | 0.76 ± 0.01 | 0.95 ± 0.01 | 91.00 ± 3.16 | 0.82 ± 0.06 | 0.94 ± 0.02 |
| iQSP | 91.07 ± 1.77 | 0.82 ± 0.04 | 0.91 ± 0.10 | 93.00 ± 1.97 | 0.86 ± 0.04 | 0.96 ± 0.02 |
Figure 5. ROC curves of the proposed model iQSP and the conventional classifiers evaluated by 10-fold cross-validation (a) and the independent validation test (b) over ten independent rounds, where the bar represents the standard deviation of the prediction results from the ten independent rounds.
The eighteen informative physicochemical properties [58] derived from the genetic algorithm utilizing self-assessment-report (GA-SAR) and their MDGI values.
| Rank | AAindex ID | MDGI | Description |
|---|---|---|---|
| 1 | QIAN880137 | 32.55 | Weights for coil at the window position of 4 (Qian-Sejnowski, 1988) |
| 2 | AURR980102 | 16.62 | Normalized positional residue frequency at helix termini N’ (Aurora-Rose, 1998) |
| 3 | ROBB760113 | 13.56 | Information measure for loop (Robson-Suzuki, 1976) |
| 4 | PRAM820101 | 12.62 | Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982) |
| 5 | GRAR740101 | 12.26 | Composition (Grantham, 1974) |
| 6 | PALJ810111 | 11.71 | Normalized frequency of beta-sheet in alpha + beta class (Palau et al., 1981) |
| 7 | PONP800102 | 10.96 | Average gain in surrounding hydrophobicity (Ponnuswamy et al., 1980) |
| 8 | MUNV940103 | 9.07 | Free energy in beta-strand conformation (Munoz-Serrano, 1994) |
| 9 | DAYM780101 | 8.64 | Amino acid composition (Dayhoff et al., 1978a) |
| 10 | MANP780101 | 8.36 | Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978) |
| 11 | KUMS000103 | 8.23 | Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000) |
| 12 | ROBB760104 | 8.18 | Information measure for C-terminal helix (Robson-Suzuki, 1976) |
| 13 | ISOY800107 | 8.09 | Normalized relative frequency of double bend (Isogai et al., 1980) |
| 14 | GEIM800101 | 7.80 | Alpha-helix indices (Geisow-Roberts, 1980) |
| 15 | PRAM900102 | 7.59 | Relative frequency in alpha-helix (Prabhakaran, 1990) |
| 16 | NADH010104 | 7.20 | Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001) |
| 17 | FUKS010106 | 6.47 | Interior composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001) |
| 18 | WIMW960101 | 5.54 | Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996) |
Fourteen if–then rules for the prediction of quorum sensing peptides using random forest and the 18 informative physicochemical properties.
| No. | Rule | Cover Samples | Misclassified Samples | Ac (%) |
|---|---|---|---|---|
| 1 | GRAR740101 ≤ 0.9055 & MANP780101 > 0.7495 & PRAM900102 > 0.848 & QIAN880137 ≤ 0.237 | 10 | 0 | 100.00 |
| 2 | PONP800102 > −0.751 & PONP800102 ≤ 1.0025 & QIAN880137 ≤ −0.104 & ROBB760104 ≤ 0.3645 & ROBB760104 > −0.5205 | 61 | 1 | 98.36 |
| 3 | PALJ810111 ≤ 1.369 & QIAN880137 > −0.104 & QIAN880137 ≤ 0.417 & ROBB760113 ≤ 0.5975 & AURR980102 ≤ 0.6955 | 21 | 2 | 90.48 |
| 4 | GEIM800101 > −0.3135 & GRAR740101 > −0.176 & ISOY800107 ≤ 1.367 & MANP780101 > −0.3325 & PALJ810111 ≤ 1.0905 & QIAN880137 ≤ −0.0985 | 94 | 6 | 93.62 |
| 5 | PALJ810111 > −0.786 & QIAN880137 > 0.237 & QIAN880137 > 0.403 & ROBB760113 > 0.5975 & AURR980102 ≤ 0.811 & KUMS000103 ≤ 0.793 | 45 | 7 | 84.44 |
| 6 | GRAR740101 ≤ 0.341 & ISOY800107 ≤ −0.089 & PALJ810111 ≤ 1.2455 & QIAN880137 > −0.009 & ROBB760113 ≤ −0.0285 | 17 | 3 | 82.35 |
| 7 | GRAR740101 > −0.708 & PRAM900102 ≤ 1.2985 & QIAN880137 > −0.104 & QIAN880137 ≤ 1.105 & AURR980102 ≤ 0.6585 & KUMS000103 ≤ 0.974 | 94 | 28 | 70.21 |
| 8 | PONP800102 ≤ 1.1095 & PRAM900102 ≤ 0.8295 & QIAN880137 ≤ 0.2625 & QIAN880137 > −0.9055 & ROBB760113 > −0.5875 & ROBB760113 ≤ 1.031 | 121 | 15 | 87.60 |
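The Ac column in the table above is consistent with the share of covered samples that a rule classifies correctly, that is, (covered − misclassified) / covered. The short check below recomputes it from the tabulated counts; the helper name is hypothetical.

```python
# (rule number, covered samples, misclassified samples, reported Ac in %)
ROWS = [
    (1, 10, 0, 100.00),
    (2, 61, 1, 98.36),
    (3, 21, 2, 90.48),
    (4, 94, 6, 93.62),
    (5, 45, 7, 84.44),
    (6, 17, 3, 82.35),
    (7, 94, 28, 70.21),
    (8, 121, 15, 87.60),
]

def rule_accuracy(covered, misclassified):
    """Percentage of covered samples that the rule classifies correctly."""
    return 100.0 * (covered - misclassified) / covered

for number, covered, wrong, reported in ROWS:
    computed = rule_accuracy(covered, wrong)
    print(f"rule {number}: computed {computed:.2f}, reported {reported:.2f}")
```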
Twenty if–then rules for discriminating QSPs from Non-QSPs using random forest model in conjunction with the eighteen informative physicochemical properties.
| % Covered Samples | Rule | Prediction Result |
|---|---|---|
| 3.57 | GRAR740101 ≤ 0.9055 & MANP780101 > 0.7495 & PRAM900102 > 0.848 & QIAN880137 ≤ 0.237 | QSP |
| 14.03 | PONP800102 > −0.751 & PONP800102 ≤ 1.0025 & QIAN880137 ≤ −0.104 & ROBB760104 ≤ 0.3645 & ROBB760104 > −0.5205 | QSP |
| 4.08 | PALJ810111 ≤ 1.369 & QIAN880137 > −0.104 & QIAN880137 ≤ 0.417 & ROBB760113 ≤ 0.5975 & AURR980102 ≤ 0.6955 | QSP |
| 9.95 | GEIM800101 > −0.3135 & GRAR740101 > −0.176 & ISOY800107 ≤ 1.367 & MANP780101 > −0.3325 & PALJ810111 ≤ 1.0905 & QIAN880137 ≤ −0.0985 | QSP |
| 9.95 | PALJ810111 > −0.786 & QIAN880137 > 0.237 & QIAN880137 > 0.403 & ROBB760113 > 0.5975 & AURR980102 ≤ 0.811 & KUMS000103 ≤ 0.793 | QSP |
| 2.81 | GRAR740101 ≤ 0.341 & ISOY800107 ≤ −0.089 & PALJ810111 ≤ 1.2455 & QIAN880137 > −0.009 & ROBB760113 ≤ −0.0285 | QSP |
| 2.81 | GRAR740101 > −0.708 & PRAM900102 ≤ 1.2985 & QIAN880137 > −0.104 & QIAN880137 ≤ 1.105 & AURR980102 ≤ 0.6585 & KUMS000103 ≤ 0.974 | QSP |
| 1.53 | PONP800102 ≤ 1.1095 & PRAM900102 ≤ 0.8295 & QIAN880137 ≤ 0.2625 & QIAN880137 > −0.9055 & ROBB760113 > −0.5875 & ROBB760113 ≤ 1.031 | QSP |
| 11.73 | GEIM800101 > −0.547 & PONP800102 > 0.9625 & PRAM900102 ≤ 1.026 | Non-QSP |
| 3.57 | QIAN880137 > −0.0985 & QIAN880137 ≤ 0.482 & ROBB760113 > 0.5975 & AURR980102 > −0.1535 | Non-QSP |
| 2.30 | QIAN880137 > 0.257 & ROBB760113 > 0.613 & ROBB760113 > 0.7775 & AURR980102 > 0.584 | Non-QSP |
| 1.79 | DAYM780101 ≤ 0.1035 & MANP780101 > 0.5635 & QIAN880137 ≤ 0.2625 & ROBB760113 > 0.5215 | Non-QSP |
| 1.53 | GRAR740101 > −0.045 & PALJ810111 > 1.0905 & QIAN880137 ≤ −0.0985 | Non-QSP |
| 6.12 | MANP780101 > −0.5985 & ROBB760104 > −0.3545 & AURR980102 ≤ 0.5935 & AURR980102 ≤ 0.5125 & KUMS000103 > 0.663 | Non-QSP |
| 3.57 | PONP800102 > 0.5045 & QIAN880137 > −0.039 & ROBB760104 > −0.1345 & ROBB760113 ≤ 0.5975 & AURR980102 ≤ 1.22 & FUKS010106 ≤ 1.1985 | Non-QSP |
| 1.28 | QIAN880137 > 0.2625 & QIAN880137 ≤ 0.7795 & ROBB760104 > −0.3885 & AURR980102 ≤ 0.6585 & AURR980102 ≤ 0.515 & NADH010104 > −0.023 | Non-QSP |
| 3.06 | GEIM800101 ≤ 1.636 & PONP800102 ≤ 0.6385 & PRAM900102 > −0.017 & PRAM900102 > 0.627 & QIAN880137 > −0.009 | Non-QSP |
| 2.55 | PALJ810111 ≤ 0.436 & PRAM900102 > 0.1605 & QIAN880137 > −0.009 & AURR980102 ≤ 0.6525 | Non-QSP |
| 1.53 | GEIM800101 ≤ −0.3305 & GRAR740101 > −0.2425 & PONP800102 ≤ 0.98 & QIAN880137 ≤ 0.237 | Non-QSP |
| 12.24 | Else | Non-QSP |
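The rule table ends with an "Else" row, which suggests an ordered rule list in which the first matching rule decides the class. The sketch below applies that reading, which is an assumption, to two rules copied from the table (the first QSP rule and the first Non-QSP rule); the remaining rules are omitted for brevity, and the feature values and function names are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Each rule is (conditions, label); only two of the tabulated rules are encoded here.
RULES = [
    ([("GRAR740101", "<=", 0.9055), ("MANP780101", ">", 0.7495),
      ("PRAM900102", ">", 0.848), ("QIAN880137", "<=", 0.237)], "QSP"),
    ([("GEIM800101", ">", -0.547), ("PONP800102", ">", 0.9625),
      ("PRAM900102", "<=", 1.026)], "Non-QSP"),
]
DEFAULT_LABEL = "Non-QSP"  # the "Else" row

def classify(features, rules=RULES, default=DEFAULT_LABEL):
    """Return (label, index of the first matching rule), or (default, None)."""
    for index, (conditions, label) in enumerate(rules):
        if all(OPS[op](features[name], threshold) for name, op, threshold in conditions):
            return label, index
    return default, None

# Hypothetical feature values on the same scale as the thresholds in the table.
peptide_a = {"GRAR740101": 0.5, "MANP780101": 0.9, "PRAM900102": 1.0,
             "QIAN880137": 0.1, "GEIM800101": 0.0, "PONP800102": 0.0}
peptide_b = dict(peptide_a, MANP780101=0.0, PONP800102=1.0)
peptide_c = dict(peptide_a, MANP780101=0.0)

for name, feats in [("a", peptide_a), ("b", peptide_b), ("c", peptide_c)]:
    print(name, classify(feats))
```

Because the rules are evaluated in order, encoding all twenty rows in table order would be needed to reproduce the full rule set; the two-rule subset here only illustrates the mechanism.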
Figure 6. Screenshot of the iQSP web server before (a) and after (b) submission of the input peptide sequence.