Darcy A B Jones1, Lina Rozano1,2, Johannes W Debler1, Ricardo L Mancera2,3,4, Paula M Moolhuijzen1, James K Hane5,6.
Abstract
Fungal plant-pathogens promote infection of their hosts through the release of 'effectors', a broad class of cytotoxic or virulence-promoting molecules. Effectors may be recognised by resistance or sensitivity receptors in the host, which can determine disease outcomes. Accurate prediction of effectors remains a major challenge in plant pathology, but if achieved will facilitate rapid improvements to host disease resistance. This study presents Predector, a novel tool and pipeline for the ranking of predicted effector candidates, which interfaces with multiple software tools and methods, aggregates disparate features that are relevant to fungal effector proteins, and applies a pairwise learning-to-rank approach. Predector outperformed a typical combination of secretion and effector prediction methods in terms of ranking performance when applied to a curated set of confirmed effectors derived from multiple species. We present Predector ( https://github.com/ccdmb/predector ) as a useful tool for the ranking of predicted effector candidates, which also aggregates and reports additional supporting information relevant to effector and secretome prediction in a simple, efficient, and reproducible manner.
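As a rough illustration of what a pairwise learning-to-rank approach means in general, the sketch below trains a linear scorer so that each labelled positive outscores each unlabelled example, using a logistic loss on score differences. This is a generic toy example and not Predector's actual model: the two features, the toy data, the learning rate, and the function names `train_pairwise_ranker` and `score` are all assumptions made for illustration.

```python
import math

def train_pairwise_ranker(features, labels, lr=0.1, epochs=200):
    """Fit linear weights so positives outscore negatives (pairwise logistic loss)."""
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    # One feature-difference vector per (positive, negative) pair.
    diffs = [[p - n for p, n in zip(px, nx)] for px in pos for nx in neg]
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            # Derivative of log(1 + exp(-margin)) with respect to the margin.
            g = -1.0 / (1.0 + math.exp(margin))
            for j, dj in enumerate(d):
                grad[j] += g * dj / len(diffs)
        # Plain gradient descent step on the mean pairwise loss.
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w

def score(w, x):
    """Ranking score of one example under the linear model."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical features: [has_signal_peptide, effector_like_probability].
features = [[1, 0.9], [1, 0.8], [0, 0.2], [1, 0.1], [0, 0.7]]
labels = [1, 1, 0, 0, 0]  # 1 = known effector, 0 = unlabelled

w = train_pairwise_ranker(features, labels)
scores = [score(w, x) for x in features]
ranking = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
```

The point of the pairwise formulation is that only the relative order of scores matters, which suits a task where the goal is to put likely effectors near the top of a candidate list rather than to assign calibrated probabilities.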
Year: 2021 PMID: 34611252 PMCID: PMC8492765 DOI: 10.1038/s41598-021-99363-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Bioinformatics tools and methods integrated into the Predector pipeline.
| Software | Description |
|---|---|
| SignalP v3.0, 4.1g, 5.0b | Extracellular secretion via signal peptide. Both NN and HMM methods are run for v3.0. Eukaryotic types specified |
| Deepsig commit 69e01cb | Extracellular secretion. *-k euk |
| Phobius 1.01 | Extracellular secretion |
| LOCALIZER v1.0.4 | Host sub-cellular localisation. Using predicted mature proteins from SignalP 5.0b. *-e -M |
| ApoplastP v1.0.1 | Apoplast-specific localisation |
| DeepLoc v1.0 | Sub-cellular localisation |
| TargetP v2.0 | Sub-cellular localisation. *-org non-pl |
| TMHMM v2.0c | Membrane localisation via transmembrane domains. *-d |
| EffectorP v1.0, 2.0 | Probabilistic prediction of effector likelihood |
| EMBOSS: pepstats v6.5.7 | Amino acid properties and frequencies |
| HMMER (vs dbCAN v8) v3.2.1 | Used to search dbCAN |
| MMSeqs2 v10-6d92c (vs PHIBase v4.9) | Used to search PHIBase. *--max-seqs 300 -e 0.01 -s 7 --num-iterations 3 -a |
| MMSeqs2 v10-6d92c (vs known effectors in Supplementary Table | *--max-seqs 300 -e 0.01 -s 7 --num-iterations 3 -a |
| PfamScan (vs Pfam v33.1) | With active site prediction. *-as |
*Non-default parameters are indicated where applicable.
Figure 1. UpSet plot showing predictions of signal peptides, transmembrane domains, and effector-like properties for all known effectors in the training dataset (N = 125). Rows indicate sets of proteins predicted to have a property related to effector prediction (e.g. a signal peptide), with the horizontal bar chart indicating set size. Columns indicate where the horizontal sets intersect with each other, and the vertical bar chart indicates the number of proteins in each intersection. For clarity, intersections with only 1 member have been excluded; the full plot is presented in Supplementary Data S1:1.
Figure 2. A violin plot showing the distributions of Predector effector ranking scores for each class in the test and training datasets. The effectors consist of experimentally validated fungal effector sequences. “Secreted” and “non-secreted” proteins are manually annotated proteins from the SwissProt database. Proteomes consist of the complete predicted proteomes from 10 well studied fungi (Supplementary Table S2). The number of proteins represented by each violin is indicated on the x-axis.
Figure 3. Comparing the scores of Predector with EffectorP versions 1 and 2 for proteins in the testing dataset. Scatter plots in the lower-left corner indicate comparisons of predictive scores between methods, with predicted secreted proteins (any signal peptide and fewer than two TM domains predicted) indicated in yellow, and non-secreted proteins indicated in blue. Density plots along the diagonal indicate distributions of the full test dataset versus predictive scores for each method (indicated along the x-axis), also coloured by secretion prediction as before (Note: there are far more non-secreted than secreted proteins in the dataset). Scatter plots in the top-right corner indicate score comparisons between methods for confirmed effectors, coloured by whether they have been predicted as secreted (criteria as above), or additionally predicted by EffectorP versions 1 or 2. Two proteins that are misclassified by a Predector score > 0 are labelled in the top-right subplot.
Effector prediction and ranking statistics for Predector and a combined classifier based on EffectorP and secretion prediction on the test dataset.
| Metric | Full test dataset: EP1 and Sec | Full test dataset: EP2 and Sec | Full test dataset: Predector | Secreted test subset: EP1 | Secreted test subset: EP2 | Secreted test subset: Predector |
|---|---|---|---|---|---|---|
| Coverage error | – | – | 8054 | 2275 | 1593 | 1115 |
| NDCG@50 | – | – | 0.640 | 0.615 | 0.629 | 0.652 |
| NDCG@500 | – | – | 0.928 | 0.916 | 0.926 | 0.933 |
| NDCG | – | – | 0.447 | 0.365 | 0.402 | 0.448 |
| TP@50 | – | – | 4 | 2 | 2 | 4 |
| TP@500 | – | – | 20 | 13 | 18 | 20 |
| TP | 20 | 20 | 26 | 20 | 20 | 25 |
| TN | 14,450 | 14,609 | 14,317 | 1410 | 1569 | 1323 |
| FP | 839 | 680 | 972 | 839 | 680 | 926 |
| FN | 8 | 8 | 2 | 6 | 6 | 1 |
| Precision | 0.023 | 0.028 | 0.026 | 0.023 | 0.028 | 0.026 |
| Recall | 0.714 | 0.714 | 0.928 | 0.769 | 0.769 | 0.961 |
| FPR | 0.055 | 0.044 | 0.064 | 0.373 | 0.302 | 0.412 |
| Accuracy | 0.944 | 0.955 | 0.936 | 0.628 | 0.698 | 0.592 |
| Balanced accuracy | 0.829 | 0.834 | 0.932 | 0.698 | 0.733 | 0.774 |
| MCC | 0.122 | 0.137 | 0.149 | 0.086 | 0.107 | 0.118 |
Test datasets here do not contain any effector homologue sequences. Note that EffectorP is not optimised for ranking tasks and Predector is not optimised for classification. These scores are shown merely for comparison and not necessarily as an endorsement of how they should be used. Coverage error is the index of the last known effector in the test dataset when sorted by score. NDCG is a measure of how often effectors are placed ahead of unlabelled samples in the list sorted by score, penalising incorrect orderings more highly near the top of the list. NDCG@N is the same statistic but only for the top N items in the sorted list. TP, TN, FP, FN are the number of true positives, true negatives, false positives, and false negatives for the classification task, respectively. TP@N indicates the number of known effectors in the top ranked N proteins. Precision indicates the proportion of predicted effectors that are known effectors (higher being better; the remaining predictions are unlabelled in this case, so these could be real effectors), recall indicates the proportion of known effectors that are correctly predicted as effectors (higher being better), and FPR (false positive rate) indicates the proportion of the unlabelled set that were incorrectly predicted as effectors (lower being better). Balanced accuracy and MCC are better indicators of model predictive performance than precision for unbalanced data. The secreted test subset consists only of known effector proteins and proteins with a signal peptide (by any method) and fewer than two predicted TM domains (by either TMHMM or Phobius). Correct classification for EffectorP in the full dataset is conditional on secretion prediction by the same criteria as the secreted dataset (SP and < 2 TM). For the same reason, Predector and EffectorP cannot be fairly compared by ranking statistics in the full dataset.
EP1: EffectorP v1; EP2: EffectorP v2; Sec: secreted.
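The statistics defined in the table notes can be sketched in a few lines. The classification part below uses the standard confusion-matrix formulas and is checked against the Predector column of the full test dataset (TP 26, TN 14,317, FP 972, FN 2). The ranking part is a simplified illustration on made-up scores: it assumes binary relevance and the common log2 discount for NDCG, which may differ in detail from the implementation used in the study. The function names are hypothetical.

```python
import math

def classification_stats(tp, tn, fp, fn):
    """Standard statistics derived from confusion-matrix counts."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return {
        "precision": tp / (tp + fp),
        "recall": recall,
        "fpr": fpr,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Mean of the true positive rate and true negative rate.
        "balanced_accuracy": (recall + (1 - fpr)) / 2,
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

def ranking_stats(scores, labels, n):
    """Coverage error, TP@N and NDCG@N for binary labels (1 = known effector)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    # 1-based position of the lowest-ranked known effector.
    coverage_error = max(r for r, lab in enumerate(ranked, start=1) if lab == 1)
    tp_at_n = sum(ranked[:n])
    dcg = sum(lab / math.log2(r + 1) for r, lab in enumerate(ranked[:n], start=1))
    ideal = sorted(labels, reverse=True)[:n]
    idcg = sum(lab / math.log2(r + 1) for r, lab in enumerate(ideal, start=1))
    return coverage_error, tp_at_n, dcg / idcg

# Counts taken from the Predector column of the full test dataset.
stats = classification_stats(tp=26, tn=14317, fp=972, fn=2)

# Made-up scores and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.1, -0.5]
toy_labels = [1, 0, 1, 0, 0]
coverage_error, tp_at_2, ndcg_at_2 = ranking_stats(toy_scores, toy_labels, n=2)
```

With these counts the computed values agree with the table to within rounding in the third decimal place, which also shows why accuracy alone is uninformative here: the unlabelled class is several hundred times larger than the effector class.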
Predector results on pathogen and saprobe proteomes held out of the training set.
| Organism | Class^a | # proteins | # secreted | Predector | EP1 and Sec | EP2 and Sec | # homologs in top 50 | # Pfam domains in top 50 |
|---|---|---|---|---|---|---|---|---|
| | B | 35,196 | 3606 (10%) | 1271 (4%) | 1272 (4%) | 1115 (3%) | 2 | 0 |
| | B | 8347 | 1612 (19%) | 696 (8%) | 694 (8%) | 540 (6%) | 20 | 0 |
| | B | 16,372 | 2366 (14%) | 1282 (8%) | 914 (6%) | 924 (6%) | 1 | 0 |
| | B | 13,233 | 2212 (16%) | 1326 (10%) | 740 (6%) | 711 (5%) | 1 | 1 |
| | H | 14,026 | 2249 (16%) | 868 (6%) | 750 (5%) | 559 (4%) | 9 | 9 |
| | H | 11,991 | 1705 (14%) | 971 (8%) | 505 (4%) | 475 (4%) | 4 | 1 |
| | N | 10,688 | 1444 (14%) | 707 (7%) | 308 (3%) | 305 (3%) | 6 | 11 |
| | N | 13,795 | 1561 (11%) | 850 (6%) | 347 (3%) | 368 (3%) | 6 | 8 |
| | W | 26,719 | 3323 (12%) | 1464 (5%) | 763 (3%) | 710 (3%) | 7 | 8 |
| | S | 5040 | 389 (8%) | 76 (2%) | 67 (1%) | 65 (1%) | 0 | 4 |
| | S | 5134 | 349 (7%) | 97 (2%) | 58 (1%) | 44 (1%) | 1 | 3 |
| | S | 14,495 | 1507 (10%) | 487 (3%) | 449 (3%) | 297 (2%) | 3 | 9 |
| | S | 9115 | 1134 (12%) | 529 (6%) | 207 (3%) | 176 (2%) | 2 | 5 |
| | S | 7798 | 766 (10%) | 289 (4%) | 161 (2%) | 149 (2%) | 2 | 5 |
| | S | 6448 | 704 (11%) | 257 (4%) | 128 (2%) | 122 (2%) | 1 | 2 |
Class indicates the lifestyle of the fungus. Proteins were considered to be secreted if they have a secretion signal predicted by any method and fewer than two predicted transmembrane domains. Predector indicates the number of proteins with a Predector ranking score > 0. EffectorP 1 (EP1) and 2 (EP2) predictions were conditional on secretion and used the default 0.5 decision threshold. The number of protein sequence similarity matches to known effectors and matches to Pfam domains with putative virulence functions are noted for the top 50 candidates ranked by Predector scores.
^a Main lifestyle classes of each fungus. B: Biotroph; H: Hemibiotroph; N: Necrotroph; W: Wilt; S: Saprotroph.
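The decision rules described in the notes above can be written out as simple predicates. This is a minimal sketch under stated assumptions: the function names and input structures are hypothetical, the requirement that every transmembrane predictor reports fewer than two domains is one reading of the text, and whether a probability of exactly 0.5 counts as a positive EffectorP call is also an assumption.

```python
def is_secreted(signal_peptide_calls, tm_domain_counts):
    """Secreted: a signal peptide by any method and fewer than two TM domains.

    signal_peptide_calls: dict of method name -> bool (hypothetical structure).
    tm_domain_counts: dict of method name -> predicted TM domain count.
    Assumption: every TM predictor must report fewer than two domains.
    """
    has_signal = any(signal_peptide_calls.values())
    few_tm = max(tm_domain_counts.values()) < 2
    return has_signal and few_tm

def effectorp_and_sec(effectorp_probability, secreted, threshold=0.5):
    """EffectorP call conditional on secretion (assumes >= threshold is positive)."""
    return secreted and effectorp_probability >= threshold

def predector_candidate(predector_score):
    """Predector call used for the counts in the table: ranking score > 0."""
    return predector_score > 0

# Made-up example protein, for illustration only.
sp_calls = {"signalp5": False, "deepsig": True, "phobius": False}
tm_counts = {"tmhmm": 1, "phobius": 0}

secreted = is_secreted(sp_calls, tm_counts)
ep_call = effectorp_and_sec(0.62, secreted)
predector_call = predector_candidate(-0.3)
```

Allowing one predicted transmembrane domain in the secretion rule is a common allowance because signal peptides are sometimes mistaken for a single N-terminal transmembrane helix.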