Zhila Esna Ashari, Kelly A Brayton, Shira L Broschat.
Abstract
Type IV secretion systems exist in a number of bacterial pathogens and are used to secrete effector proteins directly into host cells, altering the host environment to make it hospitable for the bacteria. In recent years, several machine learning algorithms have been developed to predict effector proteins, potentially facilitating experimental verification. However, their results are inconsistent with one another. Previously we analysed the disparate sets of predictive features used in these algorithms to determine an optimal set of 370 features for effector prediction. This study focuses on the best way to use these optimal features by designing three machine learning classifiers, comparing our results with those of others, and obtaining de novo results. We chose the pathogen Legionella pneumophila strain Philadelphia-1, a cause of Legionnaires' disease, because it has many validated effector proteins and others have developed machine learning prediction tools for it. All of our models give good results, indicating that our optimal features are quite robust, but Model 1, which uses all 370 features with a support vector machine, has slightly better accuracy. Moreover, Model 1 predicted 472 proteins that are deemed highly probable to be effectors, and these include 94% of known effectors. Although the results of our three models agree well with those of other researchers, their models predicted only 126 and 311 candidate effectors.
Year: 2019 PMID: 30682021 PMCID: PMC6347213 DOI: 10.1371/journal.pone.0202312
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Workflow.
Accuracy measures for 10-fold cross-validation of Model 1 using the entire feature set for prediction.
| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 95.13 | 93.80 | 93.75 | 92.47 | 93.75 | 93.36 | 95.08 | 95.13 | 95.11 | 92.92 |
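A minimal sketch of how per-fold accuracies like those above are typically produced in 10-fold cross-validation. The splitting scheme, the seed, and the toy threshold classifier are assumptions for illustration only; the source states that Model 1 uses a support vector machine, and nothing here reproduces that model or its data.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, return accuracy (%) per fold.

    Assumes n_samples >= k so that no fold is empty.
    """
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i]
                      for i in held_out)
        accuracies.append(100.0 * correct / len(held_out))
    return accuracies


# Hypothetical stand-in classifier (NOT an SVM): a threshold placed midway
# between the class means of a single numeric feature.
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic one-feature data, made up for the example.
xs = [i / 100 for i in range(100)]
ys = [1 if x >= 0.5 else 0 for x in xs]
fold_acc = cross_validate(xs, ys, train_threshold, predict_threshold)
print(fold_acc)  # ten accuracies, one per held-out fold
```

Each sample is held out exactly once, so the ten numbers together cover the whole data set; a stratified split that preserves the effector to non-effector ratio in every fold would be a natural refinement.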
Accuracy measures for 10-fold cross-validation of Model 3 using three feature subsets: i) PSSM-related features, ii) compositional features, and iii) chemical and structural features.
| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 90.70 | 91.59 | 92.41 | 91.59 | 94.64 | 92.92 | 93.30 | 90.13 | 93.30 | 93.80 |
Average accuracy, recall, precision, MCC, and AUC measures over 10 folds for the three effector prediction models.
| Measure | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Accuracy | 94.05% | 93.64% | 92.44% |
| Recall | 92.00% | 93.06% | 92.83% |
| Precision | 92.49% | 90.91% | 87.33% |
| MCC | 0.87 | 0.86 | 0.84 |
| AUC | 0.983 | 0.979 | 0.970 |
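For reference, accuracy, recall, precision, and MCC (Matthews correlation coefficient) all follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the counts in the usage example are invented, since the source does not list per-fold confusion matrices.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: effectors predicted as effectors / as non-effectors
    tn/fp: non-effectors predicted as non-effectors / as effectors
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)        # fraction of true effectors recovered
    precision = tp / (tp + fp)     # fraction of predicted effectors that are true
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "mcc": mcc}


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=90, fp=10, tn=180, fn=20)
print({name: round(value, 3) for name, value in example.items()})
```

Unlike accuracy, MCC uses all four counts symmetrically, which makes it more informative when effectors and non-effectors are unevenly represented.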
Fig 2. ROC curves for the three designed classifiers from 10-fold cross-validation.
(a) Model 1, (b) Model 2, and (c) Model 3.
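As a reminder of what a ROC curve and its AUC measure, the sketch below builds ROC points by sweeping a decision threshold over classifier scores and computes the area two ways: by the trapezoidal rule and as the fraction of (positive, negative) pairs ranked correctly. The labels and scores are invented; this is not the authors' evaluation code.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) at every distinct threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = effector) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(auc_trapezoid(curve), auc_pairwise(labels, scores))
```

The two estimates agree, which is why AUC can be read as the probability that the classifier ranks a true effector above a non-effector.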
Comparison of results for the three effector prediction models for L. pneumophila strain Philadelphia-1.
| Model | Number of predicted effector proteins | Correctly predicted effectors | Correctly predicted non-effectors | Effectors also predicted by S4TE | Effectors also predicted by Burstein et al. |
|---|---|---|---|---|---|
| Model 1 | 760 | 315 (99.7%) | 514 (97.7%) | 273 (90.4%) | 101 (80.2%) |
| Model 2 | 717 | 300 (94.9%) | 518 (98.5%) | 253 (83.8%) | 100 (79.4%) |
| Model 3 | 568 | 306 (96.8%) | 521 (99.0%) | 258 (85.4%) | 97 (77.0%) |
Fig 3. Venn diagram comparing predicted effector proteins for three methods.
The pink circle shows the results for Model 1, the yellow circle for the S4TE method, and the blue circle for the method by Burstein et al.
Comparison of results for the most probable group of candidate effectors predicted by Model 1 for L. pneumophila strain Philadelphia-1.
| Model | Number of predicted effector proteins | Correctly predicted effectors | Correctly predicted non-effectors | Effectors also predicted by S4TE | Effectors also predicted by Burstein et al. |
|---|---|---|---|---|---|
| Model 1 (most probable group) | 472 | 297 (93.7%) | 525 (99.8%) | 243 (80.5%) | 101 (72.2%) |
Accuracy measures for 10-fold cross-validation of Model 2 using three feature subsets: i) PSSM composition features, ii) PSSM auto-covariance correlation features, and iii) chemical, structural, and compositional features.
| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 93.36 | 93.36 | 95.53 | 92.47 | 93.74 | 92.44 | 93.30 | 95.13 | 93.30 | 93.80 |
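As a quick consistency check, the per-fold accuracies listed for the three models can be averaged and compared with the average accuracies reported for them (94.05%, 93.64%, and 92.44%). All numbers below are copied from the tables in this record; only the variable names are invented.

```python
from statistics import mean

# Per-fold accuracies (%) copied from the three cross-validation tables.
fold_accuracy = {
    "Model 1": [95.13, 93.80, 93.75, 92.47, 93.75, 93.36, 95.08, 95.13, 95.11, 92.92],
    "Model 2": [93.36, 93.36, 95.53, 92.47, 93.74, 92.44, 93.30, 95.13, 93.30, 93.80],
    "Model 3": [90.70, 91.59, 92.41, 91.59, 94.64, 92.92, 93.30, 90.13, 93.30, 93.80],
}

# Average accuracies (%) as reported for the three models.
reported_average = {"Model 1": 94.05, "Model 2": 93.64, "Model 3": 92.44}

fold_means = {model: mean(acc) for model, acc in fold_accuracy.items()}

for model, value in fold_means.items():
    # Agreement to within rounding at two decimal places.
    matches = abs(value - reported_average[model]) < 0.005 + 1e-9
    print(f"{model}: mean of folds = {value:.3f}, "
          f"reported = {reported_average[model]:.2f}, match = {matches}")
```

The fold means agree with the reported averages to two decimal places for all three models.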