Salvatore Cosentino, Mette Voldby Larsen, Frank Møller Aarestrup, Ole Lund.
Abstract
Although the majority of bacteria are harmless or even beneficial to their host, others are highly virulent and can cause serious disease and even death. Owing to the steadily decreasing cost of high-throughput sequencing, many completely sequenced genomes are now available from both human pathogenic and innocuous strains. These data can be used to identify gene families that correlate with pathogenicity and to develop tools that predict the pathogenicity of newly sequenced strains, investigations that previously relied mainly on more expensive and time-consuming experimental approaches. We describe PathogenFinder (http://cge.cbs.dtu.dk/services/PathogenFinder/), a web-server that predicts bacterial pathogenicity by analysing the proteome, genome, or raw reads provided by the user. The method relies on groups of proteins created without regard to their annotated function or known involvement in pathogenicity. It was built to work with all taxonomic groups of bacteria and, using the entire training-set, achieved an accuracy of 88.6% on an independent test-set by correctly classifying 398 out of 449 completely sequenced bacteria. Because the approach is not biased towards sets of genes already known to be associated with pathogenicity, it could aid the discovery of novel pathogenicity factors. Furthermore, the web-server could be used to isolate the potential pathogenic features of both known and unknown strains.
Year: 2013 PMID: 24204795 PMCID: PMC3810466 DOI: 10.1371/journal.pone.0077302
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Training, test data and model parameters.
| Model Name | Training Set: Pathogenic | Training Set: Non-pathogenic | Training Set: Total | Test Set: Pathogenic | Test Set: Non-pathogenic | Test Set: Total | MinORG | LT | HT | Zthr |
| TM-Alphaproteobacteria | 29 | 60 | 89 | 11 | 28 | 39 | 2 | 0.15 | 0.6 | 10.43 |
| TM-Betaproteobacteria | 26 | 26 | 52 | 10 | 22 | 32 | 2 | 0.3 | 0.9 | 0.55 |
| TM-Epsilonproteobacteria | 17 | 5 | 22 | 16 | 2 | 18 | 2 | 0.4 | 1.0 | −9.31 |
| TM-Gammaproteobacteria | 122 | 97 | 219 | 33 | 50 | 83 | 2 | 0.2 | 0.85 | 25.37 |
| TM-Actinobacteria | 27 | 44 | 71 | 24 | 36 | 60 | 2 | 0.0 | 1.0 | −3.22 |
| TM-Bacteroidetes | 7 | 12 | 19 | 5 | 24 | 29 | 2 | 0.35 | 0.6 | 1.68 |
| TM-Firmicutes | 98 | 87 | 185 | 34 | 83 | 117 | 3 | 0.0 | 1.0 | −2.85 |
| TM-Tenericutes | 6 | 8 | 14 | 5 | 9 | 14 | 2 | 0.0 | 1.0 | −1.59 |
| COMPL | 40 | 174 | 214 | 17 | 40 | 57 | 2 | 0.0 | 1.0 | −1.78 |
| WDM | 372 | 513 | 885 | 155 | 294 | 449 | 2 | 0.0 | 1.0 | 3.0 |
The MinORG, LT and HT columns show the parameters used to create the pathogenicity families and build each of the 10 models. Zthr is a threshold value, calculated for each model during cross-validation, that is applied to the final prediction score to decide whether the input organism is predicted as pathogenic or non-pathogenic. The parameters for each model were chosen after 5-fold cross-validation tests.
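The role of Zthr can be illustrated with a minimal sketch. The threshold values are copied from the table above (a subset only); the direction of the comparison (scores strictly above Zthr are called pathogenic) is an assumption, since this record does not state how the final prediction score is computed or how ties are handled. The names `ZTHR` and `classify` are hypothetical.

```python
# Zthr values copied from the table above (subset of the 10 models).
ZTHR = {
    "TM-Betaproteobacteria": 0.55,
    "TM-Gammaproteobacteria": 25.37,
    "TM-Firmicutes": -2.85,
    "WDM": 3.0,
}


def classify(model_name: str, score: float) -> str:
    """Label an organism from its final prediction score.

    Assumption: a score strictly above the model's Zthr means
    'pathogenic'; the source does not spell out this rule.
    """
    threshold = ZTHR[model_name]
    return "pathogenic" if score > threshold else "non-pathogenic"


if __name__ == "__main__":
    # The same score can lead to different calls under different models,
    # because each model has its own threshold.
    for model in ("WDM", "TM-Gammaproteobacteria"):
        print(model, classify(model, 4.2))
```

Keeping the thresholds in a per-model lookup mirrors the table: the threshold is a property of the model, not of the input organism.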
MCC on cross validation and independent test-set.
| Organism subset | 5-fold CV | TM or COMPL | WDM |
| All Bacteria | 0.847 | 0.736 | 0.758 |
| Alphaproteobacteria | 0.949 | 0.886 | 0.873 |
| Betaproteobacteria | 0.923 | 0.855 | 0.79 |
| Epsilonproteobacteria | 0.741 | 0.686 | 1.0 |
| Gammaproteobacteria | 0.825 | 0.666 | 0.661 |
| Actinobacteria | 0.681 | 0.816 | 0.826 |
| Bacteroidetes | 0.889 | 0.535 | 0.383 |
| Firmicutes | 0.915 | 0.756 | 0.785 |
| Tenericutes | 0.866 | −0.344 | 0.0 |
| Remaining Organisms | 0.940 | 0.793 | 0.877 |
Column 2 gives the MCC obtained in the 5-fold cross-validation (CV) by each of the 10 models. Column 3 gives the MCC of the individual TM models and of the COMPL model (last row) when tested on independent test data from the corresponding phyla/classes. Column 4 gives the MCC of the WDM model when tested on independent test data from the specific phyla/classes.
Organisms from phyla/classes for which no TM model is available were tested using the COMPL model. COMPL was trained on all organisms from classes or phyla for which only pathogenic or only non-pathogenic strains were available.
MCC for WDM on the same test-set used for COMPL.
Overall MCC for all the TM models and the COMPL model.
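For readers unfamiliar with the metric, the sketch below computes the Matthews correlation coefficient (MCC) from a confusion matrix using its standard definition. The counts in the example are hypothetical and are not taken from the paper. Returning 0.0 when the denominator is zero is a common convention, assumed here; this record does not say how such cases were handled.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any marginal total is zero (for example, every
    # organism is predicted as the same class), report 0.0.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(mcc(tp=90, tn=80, fp=20, fn=10), 3))
```

Unlike accuracy, MCC stays informative when the two classes are unbalanced, which is the case for several of the test-sets listed earlier.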
Top 10 ranking pathogenic protein families and annotated functions of their proteins for TM-Gammaproteobacteria model.
| RANK | Z-score | P | N | Function |
| 1 | 9.134 | 77 | 8 | N-acetylmannosamine kinase (TCS) |
| 2 | 8.500 | 49 | 0 | Fimbrial proteins |
| 3 | 8.170 | 62 | 6 | Sialic Acid Transporter |
| 4 | 8.158 | 53 | 3 | Transposition helper protein |
| 5 | 8.023 | 62 | 7 | Acetyltransferase, type III secretion proteins |
| 6 | 8.023 | 62 | 7 | Macrolide-specific efflux, membrane protein |
| 7 | 8.023 | 62 | 7 | Type II secretion proteins |
| 8 | 7.922 | 69 | 10 | Unknown function, possible membrane proteins |
| 9 | 7.906 | 60 | 7 | Unknown function |
| 10 | 7.855 | 53 | 4 | Cytochrome b562 |
P and N columns contain the number of pathogenic and non-pathogenic organisms in the protein family respectively.
Top 10 ranking non-pathogenic protein families and annotated functions of their proteins for TM-Gammaproteobacteria model.
| RANK | Z-score | P | N | Function |
| 1 | −6.52 | 3 | 34 | Protein-L-isoaspartate |
| 2 | −6.44 | 2 | 31 | ThiJ/PfpI domain protein |
| 3 | −6.43 | 6 | 40 | Anthranilate synthase component I |
| 4 | −5.98 | 6 | 36 | 8-amino-7-oxononanoate synthase |
| 5 | −5.92 | 5 | 34 | Unknown function, putative transcriptional regulator |
| 6 | −5.82 | 0 | 21 | Adenosylmethionine decarboxylase |
| 7 | −5.81 | 8 | 39 | Unknown function |
| 8 | −5.80 | 2 | 26 | Unknown function, probable condensation protein |
| 9 | −5.68 | 0 | 20 | Nitrite transporter |
| 10 | −5.62 | 1 | 22 | Glucose-galactose transporter |
P and N columns contain the number of pathogenic and non-pathogenic organisms in the protein family respectively.
Top 10 ranking pathogenic protein families and annotated functions of their proteins for the WDM model.
| RANK | Z-score | P | N | Function |
| 1 | 10.18 | 38 | 0 | Borrelia Plasmid partition proteins |
| 2 | 9.49 | 33 | 0 | TCS associated genes, unknown functions |
| 3 | 9.19 | 31 | 0 | Lipoate-protein ligase, lipoate metabolism associated proteins |
| 4 | 9.19 | 31 | 0 | Unknown functions, flavin oxidoreductase |
| 5 | 9.04 | 30 | 0 | Exfoliative toxin A |
| 6 | 8.89 | 29 | 0 | Pili assembly proteins, Motility, Secretion Systems |
| 7 | 8.89 | 30 | 0 | Unknown function, shikimate kinase |
| 8 | 8.89 | 29 | 0 | Pili assembly proteins, Motility, Secretion Systems |
| 9 | 8.74 | 28 | 0 | Multiple antibiotic resistance (MarR) family proteins |
| 10 | 8.74 | 28 | 0 | Mutarotase Yjht (sialic acid mutarotation), unknown functions |
P and N columns contain the number of pathogenic and non-pathogenic organisms in the protein family respectively.
Top 10 ranking non-pathogenic protein families and annotated functions of their proteins for the WDM model.
| RANK | Z-score | P | N | Function |
| 1 | −6.68 | 0 | 63 | tRNA proteins |
| 2 | −6.62 | 0 | 62 | ABC transporter related proteins (for |
| 3 | −6.18 | 0 | 54 | Rubrerythrin |
| 4 | −6.07 | 0 | 52 | Rubrerythrin |
| 5 | −6.01 | 0 | 51 | Iron-sulfur binding domain proteins |
| 6 | −6.01 | 0 | 51 | Hydroxymethylglutaryl-CoA synthase |
| 7 | −5.95 | 0 | 50 | Unknown function |
| 8 | −5.89 | 0 | 49 | Unknown function |
| 9 | −5.83 | 0 | 48 | Unknown function |
| 10 | −5.70 | 0 | 46 | Sulfite reductase subunit |
P and N columns contain the number of pathogenic and non-pathogenic organisms in the protein family respectively.
Figure 1. P and Z-score histograms for TM-Betaproteobacteria model.
The model was built setting MinORG = 2, HT = 0.9 and LT = 0.3. (A) and (B) show, respectively, the P and Z-score histograms for the clusters i such that ORG ≥ MinORG. This step reduces the original 69,744 clusters to 26,706. In (A) the bars at the extremes count the clusters containing only genes from pathogenic organisms (right bar) or only genes from non-pathogenic ones (left bar), while the small peak in the middle corresponds to clusters containing the same number of pathogenic and non-pathogenic organisms, which are not used because they provide no discriminative information about pathogenicity. (C) and (D) show the same histograms for the PFs obtained by removing all the significant clusters with a P value between LT and HT. The non-pathogenic PFs outnumber the pathogenic ones (C). HT and LT can be used to adjust the number of both pathogenic and non-pathogenic PFs, which can be useful in models whose training-set has an unbalanced number of pathogenic and non-pathogenic organisms. In (D) the negative Z-scores are associated with non-pathogenic families, while the others belong to pathogenic PFs.
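The two filtering steps described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: P is taken to be the fraction of pathogenic organisms in a cluster (inferred from the description of panel A), ORG is taken to be the number of organisms in the cluster, and clusters with P exactly equal to LT or HT are kept. The names `Cluster` and `select_families` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    pathogenic: int      # number of pathogenic organisms in the cluster
    non_pathogenic: int  # number of non-pathogenic organisms in the cluster

    @property
    def org(self) -> int:
        # Assumed meaning of ORG: total organisms represented in the cluster.
        return self.pathogenic + self.non_pathogenic

    @property
    def p(self) -> float:
        # Assumed meaning of P: fraction of pathogenic organisms.
        return self.pathogenic / self.org


def select_families(clusters: List[Cluster], min_org: int,
                    lt: float, ht: float) -> List[Cluster]:
    """Keep clusters with ORG >= MinORG whose P is not strictly between LT and HT."""
    kept = []
    for c in clusters:
        if c.org < max(min_org, 1):  # also guards against empty clusters
            continue
        if c.p <= lt or c.p >= ht:   # boundary handling is an assumption
            kept.append(c)
    return kept


if __name__ == "__main__":
    # Toy clusters, filtered with the TM-Betaproteobacteria parameters.
    toy = [Cluster(5, 0), Cluster(3, 3), Cluster(1, 0), Cluster(1, 4), Cluster(4, 3)]
    for c in select_families(toy, min_org=2, lt=0.3, ht=0.9):
        print(c, round(c.p, 2))
```

With LT = 0.0 and HT = 1.0, as used by the WDM model, this rule keeps only clusters made up exclusively of pathogenic or exclusively of non-pathogenic organisms, which agrees with the P and N counts in the WDM top-10 tables.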
Figure 2. PFDB, training and test-set for each model.
Each bar-plot shows the percentage of pathogenic (orange) and non-pathogenic (light-blue) organisms in the training and test-set, and the percentage of pathogenic and non-pathogenic protein families in the PFDB of the model named in the bar-plot title (e.g. WDM). Below each horizontal bar-plot are shown the number of protein families composing the PFDB of the corresponding model, its size in megabytes and its number of sequences.