Andi Dhroso, Samantha Eidson, Dmitry Korkin.
Abstract
Gram-negative bacteria are responsible for hundreds of millions of infections worldwide, including emerging hospital-acquired infections and neglected tropical diseases in third-world countries. Finding a fast and cheap way to understand the molecular mechanisms behind bacterial infections is critical for efficient diagnostics and treatment. An important step towards understanding these mechanisms is the discovery of bacterial effectors, the proteins secreted into the host through one of the six common secretion system types. Unfortunately, current prediction methods are designed to specifically target one of three secretion systems, and no accurate "secretion system-agnostic" method is available. Here, we present PREFFECTOR, a computational feature-based approach to discover effector candidates in Gram-negative bacteria without prior knowledge of the bacterial secretion system(s) or cryptic secretion signals. Our approach was first evaluated using several assessment protocols on a manually curated, balanced dataset of experimentally determined effectors across all six secretion systems, as well as non-effector proteins. The evaluation revealed high accuracy of the top-performing classifiers in PREFFECTOR, with a small false positive discovery rate across all six secretion systems. Our method was also applied to six bacteria for which knowledge of virulence factors or secreted effectors is limited. The PREFFECTOR web server is freely available at http://korkinlab.org/preffector.
Year: 2018 PMID: 30464223 PMCID: PMC6249201 DOI: 10.1038/s41598-018-33874-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The current state-of-the-art methods and their reported performance.
| Name | Method | Secretion System | Signal Based | Reported Accuracy |
|---|---|---|---|---|
| Luo | RF | I | Yes | 88.6% |
| Luo | RF | I | No | 98.1% |
| SSE-AAC | SVM | III | Yes | 98.3% |
| BPBAac | SVM | III | Yes | 95.3% |
| T3_MM | Markov Model | III | Yes | 88.2% |
| BEAN | HMM/SVM | III | No | 78.0% |
| EffectiveT3 | Naïve Bayes | III | No | 78.0% (AUC) |
| Sieve | SVM | III | No | 95.0% (AUC) |
| Yang X | RF | III | No | 94.3% |
| T4SEpre family | SVM | IV | Yes | 79.1%-94.6% |
| T4EffPred | SVM | IV | No | 95.9% |
Each method is designed to predict effector candidates of a specific type of secretion system. The employed methods include Random Forest (RF), Artificial Neural Network (ANN), Hidden Markov Model (HMM), and Support Vector Machine (SVM).
Figure 1. PREFFECTOR and its comparative performance. (A) Basic stages of the approach. Three classifier models are designed and independently trained for secretion signals located primarily in the N-terminal region, in the C-terminal region, or anywhere in the effector protein sequence. Models that rely on the full protein sequence, its N-terminus, or its C-terminus are referred to in the figure as ModelF, ModelN, and ModelC, respectively. (B) The finalized positive set of 168 effectors is manually curated and covers all six secretion systems. (C) The most accurate full model (Random Forest classifier) performs evenly across all four assessment measures; it is more accurate than the best-performing C-terminal and N-terminal models (both are SVM classifiers with RBF kernels). Abbreviations: Accuracy (Acc), Precision (Pre), and Recall (Rec). (D) The assessment, on our data set, of the state-of-the-art methods designed to detect effector candidates of a single secretion system showed that none of these methods can be used universally across all six secretion systems. One asterisk (*) marks the methods specialized in predicting T3SS effectors, while two asterisks (**) mark the methods specialized in predicting T4SS effectors.
Three categories of features.
| Category | Feature position in vector | Feature description | Dimensions |
|---|---|---|---|
| Structure/sequence information | 1 | Length | 1 |
| Structure/sequence information | 432–436 | H, C, E, Exposed, Buried | 5 |
| Residue composition | 4–23 | Residue occurrence frequency | 20 |
| Residue composition | 32–431 | Dipeptide occurrence frequency | 400 |
| Physico-chemical properties | 2, 3, 24–31 | AvgCharge, Iept, Tiny, Small, Aliphatic, Non-Polar, Polar, Charged, Basic, Acidic | 10 |
The features are calculated independently for each of the three models.
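The residue-composition part of the feature vector can be illustrated with a minimal Python sketch. It computes the sequence length, the 20 residue occurrence frequencies, and the 400 dipeptide occurrence frequencies for one sequence. The function name `composition_features`, the residue ordering, the normalization by sequence length and by number of overlapping pairs, and the output layout are all assumptions made for illustration; the source does not specify them, and the layout here does not follow the vector positions listed in the table.

```python
from itertools import product

# 20 standard residues; this particular ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def composition_features(sequence):
    """Return [length] + 20 residue frequencies + 400 dipeptide frequencies."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)

    # Residue occurrence frequency: count of each residue divided by length.
    residue_freq = [seq.count(aa) / n for aa in AMINO_ACIDS]

    # Dipeptide occurrence frequency over overlapping windows of two residues.
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    n_pairs = max(n - 1, 1)  # avoids division by zero for one-residue input
    dipeptide_freq = [counts[dp] / n_pairs for dp in DIPEPTIDES]

    return [n] + residue_freq + dipeptide_freq


features = composition_features("MKKLA")  # toy sequence, not from the data set
print(len(features))  # 1 + 20 + 400 values
```

To mimic the three models, the same function could be applied to the full sequence and to N-terminal and C-terminal slices of it; the slice lengths are not given in this section.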
Leave-one-out (LOO) assessment for three SVM and Random Forest (RF) models.
| Model | SVM kernel or RF | Acc | Pre | Rec | F-measure |
|---|---|---|---|---|---|
| N | Radial | 0.80 | 0.80 | 0.80 | 0.80 |
| N | Linear | 0.76 | 0.77 | 0.76 | 0.76 |
| N | Polynomial | 0.68 | 0.73 | 0.68 | 0.67 |
| N | RF | 0.71 | 0.71 | 0.71 | 0.71 |
| C | Radial | 0.73 | 0.73 | 0.73 | 0.73 |
| C | Linear | 0.70 | 0.70 | 0.70 | 0.70 |
| C | Polynomial | 0.67 | 0.73 | 0.67 | 0.64 |
| C | RF | 0.68 | 0.68 | 0.68 | 0.68 |
| F | Radial | 0.87 | 0.87 | 0.87 | 0.87 |
| F | Linear | 0.83 | 0.83 | 0.83 | 0.83 |
| F | Polynomial | 0.81 | 0.84 | 0.81 | 0.81 |
| F | RF | 0.89 | 0.89 | 0.89 | 0.89 |
The performance of the RF classifier and of each of the three SVM classifiers, which use the Radial Basis Function (Radial), Linear, and Polynomial kernels, is measured by Accuracy (Acc), Precision (Pre), Recall (Rec), and F-measure. The default probability threshold of θ = 0.5 is used for each SVM model.
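In every row of the table the Rec value equals the Acc value. That pattern is what class-weighted averaging produces, because the support-weighted mean of the per-class recalls is algebraically equal to accuracy. The source does not state how the measures were averaged, so the following Python sketch only illustrates that convention as an assumption. The function name `weighted_metrics` and the confusion-matrix counts are hypothetical; the counts are chosen to sum to 168 per class but are not results from the paper.

```python
def weighted_metrics(tp, tn, fp, fn):
    """Accuracy plus class-weighted precision, recall, and F-measure.

    Assumes every class is predicted at least once, so no denominator is zero.
    """
    total = tp + tn + fp + fn
    # (precision, recall, support) for the positive and the negative class.
    per_class = [
        (tp / (tp + fp), tp / (tp + fn), tp + fn),
        (tn / (tn + fn), tn / (tn + fp), tn + fp),
    ]
    precision = sum(p * n for p, _, n in per_class) / total
    recall = sum(r * n for _, r, n in per_class) / total
    f_measure = sum(2 * p * r / (p + r) * n for p, r, n in per_class) / total
    accuracy = (tp + tn) / total
    return accuracy, precision, recall, f_measure


# Hypothetical counts for a balanced set of 168 positives and 168 negatives.
acc, pre, rec, f = weighted_metrics(tp=150, tn=148, fp=20, fn=18)
print(f"Acc={acc:.2f} Pre={pre:.2f} Rec={rec:.2f} F={f:.2f}")
```

With this averaging, precision and F-measure can still differ from accuracy, which would be consistent with the Polynomial rows of the table.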
Grid search of the probability threshold for a full SVM model with RBF kernel and 10-fold cross validation assessment.
| θ | TP | TN | FP | FN | FPR | TPR | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.1 | 15.8 | 8.9 | 8.1 | 1.2 | 0.48 | 0.93 | 0.73 |
| 0.2 | 15.5 | 12.0 | 5.0 | 1.5 | 0.29 | 0.91 | 0.81 |
| 0.3 | 15.1 | 14.1 | 2.9 | 1.9 | 0.17 | 0.89 | 0.86 |
| 0.4 | 14.6 | 14.7 | 2.3 | 2.4 | 0.14 | 0.86 | 0.86 |
| 0.5 | 13.9 | 15.0 | 2.0 | 3.1 | 0.12 | 0.82 | 0.85 |
| 0.6 | 13.3 | 15.1 | 1.9 | 3.7 | 0.11 | 0.78 | 0.84 |
| 0.7 | 12.3 | 15.8 | 1.2 | 4.7 | 0.07 | 0.72 | 0.83 |
| 0.8 | 10.8 | 16.2 | 0.8 | 6.2 | 0.05 | 0.64 | 0.79 |
| 0.9 | 8.7 | 16.3 | 0.7 | 8.3 | 0.04 | 0.51 | 0.74 |
While the original prediction uses the probability threshold θ = 0.5, the thresholds that are optimal with respect to accuracy, θ = 0.4 and θ = 0.3, increase the false positive rate and are not considered. In turn, the most stringent threshold of θ = 0.9 drastically decreases the false positive rate, while decreasing accuracy by 0.11. Abbreviations: TP – true positives, TN – true negatives, FP – false positives, FN – false negatives, FPR – false positive rate, TPR – true positive rate.
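The derived columns of the grid-search table follow from the four confusion counts through the standard definitions FPR = FP / (FP + TN), TPR = TP / (TP + FN), and Accuracy = (TP + TN) / (TP + TN + FP + FN). The short Python sketch below recomputes them from the counts in the table. The names `GRID` and `rates` are hypothetical, and reading the fractional counts as per-fold averages of the 10-fold cross-validation is an assumption.

```python
# (TP, TN, FP, FN) per threshold, copied from the grid-search table.
GRID = {
    0.1: (15.8, 8.9, 8.1, 1.2),
    0.2: (15.5, 12.0, 5.0, 1.5),
    0.3: (15.1, 14.1, 2.9, 1.9),
    0.4: (14.6, 14.7, 2.3, 2.4),
    0.5: (13.9, 15.0, 2.0, 3.1),
    0.6: (13.3, 15.1, 1.9, 3.7),
    0.7: (12.3, 15.8, 1.2, 4.7),
    0.8: (10.8, 16.2, 0.8, 6.2),
    0.9: (8.7, 16.3, 0.7, 8.3),
}


def rates(tp, tn, fp, fn):
    """Return (FPR, TPR, accuracy) for one set of confusion counts."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return fpr, tpr, accuracy


for theta, counts in GRID.items():
    fpr, tpr, acc = rates(*counts)
    print(f"theta={theta:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}  Acc={acc:.2f}")
```

Rounded to two decimals, the recomputed values agree with the FPR, TPR, and Accuracy columns, including the drop in accuracy from 0.85 at θ = 0.5 to 0.74 at θ = 0.9.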
Figure 2. Whole-genome application of PREFFECTOR. (A) Predicted effector candidates of Chlamydia trachomatis using the optimal probability threshold of θ0 = 0.6 (red) and a stringent probability threshold of θ0 = 0.99 (cyan) are mapped according to their corresponding positions on the circular bacterial genome. Many predicted effector candidates for both thresholds form distinct compact clusters. All known genes (blue) are mapped on the corresponding DNA strand of the genome. GC content is shown in black; GC skew+ and GC skew− are shown in green and purple, respectively. Regions of tightly clustered effector candidates can be clearly identified. The image was generated using the CGView Comparison Tool. (B) The analysis of PREFFECTOR's performance on the four whole genomes (Acinetobacter baumannii, Chlamydia trachomatis, Helicobacter pylori, and Legionella pneumophila). (C) Examples of enriched (E) and depleted (D) GO functions occurring in the three genomes; both enriched and depleted GO functional terms are unique to each genome.
Significantly enriched (E) and depleted (D) GO functions for the three bacterial genomes.
| GO term | p-value | E/D | GO term | p-value | E/D | GO term | p-value | E/D |
|---|---|---|---|---|---|---|---|---|
| GO:0065007 | 1.7E-03 | D | GO:0051179 | 1.3E-02 | D | GO:0030234 | 4.4E-02 | D |
| GO:0005488 | 5.4E-03 | D | GO:0065007 | 2.0E-02 | D | | | |
| GO:0005623 | 6.0E-03 | D | GO:0004872 | 5.3E-02 | D | | | |
| GO:0003824 | 2.2E-02 | D | | | | | | |
| GO:0002376 | 4.1E-02 | E | | | | | | |
| GO:0060089 | 5.5E-02 | E | | | | | | |
No significantly enriched or depleted GO functions were found in Chlamydia. Effector candidates were predicted using the Random Forest classifier.