Frank Po-Yen Lin, Ruiting Lan, Vitali Sintchenko, Gwendolyn L Gilbert, Fanrong Kong, Enrico Coiera.
Abstract
The phylogenetic profile of a gene reflects its evolutionary history and can be defined as the differential presence or absence of the gene in a set of reference genomes. It has been employed to facilitate the prediction of gene functions. However, the hypothesis that this concept can also facilitate the discovery of bacterial virulence factors has not been fully examined. In this paper, we test this hypothesis and report a computational pipeline designed to identify previously unknown bacterial virulence genes, using group B streptococcus (GBS) as an example. Phylogenetic profiles of all GBS genes across 467 bacterial reference genomes were determined by candidate-against-all BLAST searches and were then used to identify candidate virulence genes with machine learning models. Evaluation experiments with known GBS virulence genes suggested good functional and model consistency in cross-validation analyses (areas under ROC curve, 0.80 and 0.98 respectively). Inspection of the top-10 genes in each of the 15 virulence functional groups revealed at least 15 (of 119) homologous genes implicated in virulence in other human pathogens but previously unrecognized as potential virulence genes in GBS. Among these highly ranked genes, many encode hypothetical proteins with possible roles in GBS virulence. Thus, our approach has identified a set of genes that may affect the virulence of GBS and are candidates for further in vitro and in vivo investigation. This computational pipeline can also be extended to in silico analysis of virulence determinants of other bacterial pathogens.
Year: 2011 PMID: 21483735 PMCID: PMC3070697 DOI: 10.1371/journal.pone.0017964
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Determination of phylogenetic profiles.
For each gene, a candidate-against-all BLAST search was performed to determine whether at least one homolog of the candidate gene is present in a given reference genome. The binary values of presence (1) or absence (0) were stored in a vector, which was used for the subsequent rediscovery analyses and virulence gene predictions.
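The profile construction described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (tuples of gene, genome, E-value), the gene and genome names, and the E-value cutoff `max_evalue` are all assumptions made for the example, since the source does not state how a homolog was called.

```python
from collections import defaultdict


def build_profiles(blast_hits, genes, reference_genomes, max_evalue=1e-5):
    """Return {gene: [0/1, ...]} with one entry per reference genome.

    blast_hits: iterable of (gene, genome, evalue) tuples (hypothetical format).
    max_evalue: assumed homolog cutoff; the source does not give the threshold.
    """
    present = defaultdict(set)
    for gene, genome, evalue in blast_hits:
        if evalue <= max_evalue:
            # At least one homolog of this gene was found in this genome.
            present[gene].add(genome)
    return {
        gene: [1 if genome in present[gene] else 0 for genome in reference_genomes]
        for gene in genes
    }


# Toy data: geneX has homologs in two genomes, geneY only a weak hit.
hits = [
    ("geneX", "genome_A", 1e-30),
    ("geneX", "genome_C", 1e-8),
    ("geneY", "genome_B", 0.5),
]
profiles = build_profiles(hits, ["geneX", "geneY"], ["genome_A", "genome_B", "genome_C"])
print(profiles)
```

Each resulting vector has one position per reference genome, so in the study's setting a profile would have 467 entries.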
List of known GBS virulence genes with systematic gene names in three published reference genomes.
| Category | Gene | Function/annotation | NEM316 (III) locus | A909 (Ia) locus | 2603 (V) locus | Ref. |
|---|---|---|---|---|---|---|
| Adhesins | | fibrinogen-binding protein FbsA | GBS1087 | SAK1142 | SAG1052 | [S1-4] |
| | | fibrinogen-binding protein FbsB | GBS0850 | SAK0955 | SAG0832 | [S4,5] |
| | | fibronectin-binding protein | GBS1263 | SAK1277 | SAG1190 | [S6] |
| | | C5a peptidase | GBS1308 | SAK1320 | SAG1236 | [S7,8] |
| | | laminin-binding protein | GBS1307 | SAK1319 | SAG1234 | [S9-11] |
| | GBS pilus cluster | streptococcal pilus cluster | GBS0628-32 | SAK0776-80 | SAG0645-49 | [S12-14] |
| Invasins | | | GBS0644-55 | SAK0790-0801 | SAG0662-73 | [S20-26] |
| | | CAMP factor | GBS2000 | SAK1983 | SAG2043 | [S27] |
| | | hemolysin III | GBS1477 | SAK1440 | SAG1407 | [S27,S31] |
| | | hyaluronate lyase | GBS1270 | SAK1284 | SAG1197 | [S28-30] |
| | | surface protein rib | GBS0470 | | SAG0433 | [S15-19] |
| | | C- | | SAK0517 | | [S15-19] |
| Immune evasins | | C- | - | SAK0186 | | [S32-34] |
| | | | GBS1237-47 | SAK1251-62 | SAG1162-75 | [S35-37] |
| | | | GBS1233-36 | SAK1247-50 | SAG1158-61 | [S38-41] |
| | | C5a peptidase | (see above) | | | [S7,8] |
| | | serine protease cspA | GBS2008 | SAK1991 | SAG2053 | [S42] |
| | | penicillin-binding protein 1A | GBS0288 | SAK0370 | SAG0298 | [S43-45] |
a. IS1548 is embedded upstream of scpB gene in 2603 V/R.
b. although primarily an invasin, cyl is capable of damaging phagocytes and hence also has a role in immune system evasion.
c. dual roles of both an invasin and an immune system evading gene.
d. dual roles of both an adhesin and an immune system evading gene.
Please refer to Text S3 for the reference entries.
Performance of algorithms (area under ROC curve, AUC) in the rediscovery experiment using only NEM316 genome.
| Virulence gene category | n | ADTree (AUC) | IBk (AUC) | RBF (AUC) | Poly (AUC) |
|---|---|---|---|---|---|
| All virulence genes | 43 | 0.721 | 0.722 | | 0.791 |
| Adhesins | 10 | 0.716 | 0.776 | | 0.767 |
| minor pilin cluster | 5 | 0.970 | 0.763 | | 0.881 |
| Invasins | 17 | 0.864 | 0.679 | 0.857 | |
| | 12 | 0.824 | 0.648* | | 0.820 |
| Immune evasins | 17 | 0.825 | 0.770 | | 0.860 |
| | 11 | 0.808 | 0.797 | | 0.849 |
| | 4 | | 0.836 | | |
| | 15 | 0.864 | 0.773 | | 0.914 |
This analysis evaluated the relative performance of each algorithm in rediscovering virulence genes by applying stratified n-fold cross-validation, with 1/n of the entire set of S. agalactiae NEM316 genes serving as the test set in each fold. Each fold of the training set comprised positive and negative examples. n: number of virulence genes in the category. Singleton virulence gene categories were excluded from this analysis, as it is not possible to perform cross-validation on training sets with n = 1. All but one (labeled *) of the AUCs reached statistical significance at α = 0.05 (two-tailed Mann-Whitney U-test). At least 3 out of 4 algorithms were still significant after adjustment for multiple testing (across the family of 4 algorithms) by the Bonferroni method. Abbreviations: ADTree: alternating decision tree; IBk: nearest neighbor classifier; SVM: support vector machine; RBF: SVM with radial basis function; Poly: SVM with polynomial kernel. Refer to the methods section for the parameters used to train the machine learning algorithms. The numbers in bold face indicate the best performing algorithm for a given category.
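The link between the AUC values and the Mann-Whitney U-test mentioned in this note can be made concrete: the AUC equals the U statistic of the positive class divided by the number of positive-negative pairs. The sketch below shows that relation and a Bonferroni adjustment across a family of tests. The scores and p-values are made-up numbers, and the function names are hypothetical; the original analysis was not necessarily computed this way.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    This equals U / (n_pos * n_neg), where U is the Mann-Whitney U
    statistic of the positive class; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bonferroni(p_values):
    """Multiply each p-value by the family size, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up classifier scores for known virulence genes and other genes.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
# Made-up p-values for a family of 4 algorithms.
adjusted = bonferroni([0.01, 0.02, 0.03, 0.2])
print(round(auc, 3), adjusted)
```

With a family of 4 algorithms, a raw p-value must fall below 0.05 / 4 = 0.0125 to remain significant at α = 0.05 after adjustment.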
The performance of inductive CGP algorithms in the rediscovery of known virulence genes in all 3 GBS reference genomes.
| Virulence gene category | n | ADTree (AUC) | IBk (AUC) | SVM/RBF (AUC) | SVM/Poly (AUC) |
|---|---|---|---|---|---|
| | 134 | 0.848 | | 0.951 | 0.960 |
| | 30 | | 0.961 | 0.960 | 0.965 |
| | 3 | 0.888 | 0.677 | 0.754 | |
| | 3 | 0.874 | | 0.959 | 0.957 |
| | 3 | | | | |
| | 3 | | | | |
| | 3 | | | | |
| minor pilin cluster | 15 | | | | |
| | 51 | 0.929 | 0.974 | 0.954 | |
| | 36 | 0.950 | | 0.962 | 0.980 |
| | 3 | | | | |
| | 3 | | | | |
| | 3 | | | | |
| C- | 3 | 0.933 | 0.967 | | 0.978 |
| | 60 | 0.929 | 0.974 | 0.954 | |
| | 1 | - | - | - | - |
| | 37 | 0.960 | 0.966 | 0.948 | |
| | 12 | | | | |
| | 49 | 0.970 | 0.974 | 0.960 | |
| | 3 | | | | |
| | 3 | | | | |
This rediscovery analysis used all known GBS virulence genes in stratified n-fold cross-validation, with 1/n of the entire set of S. agalactiae genes in the A909, NEM316, and 2603V/R genomes serving as the test set in each fold. n: number of genes in the category.
a. scpB was also included as an immune evasion gene.
b. Including both bca and rib; also included as immune evasion genes.
c. bac was represented by fewer than two genes in the three reference genomes studied. No rediscovery experiment was performed.
d. Including all genes from the cps-neu operon.
e. cspA was also included as an invasin.
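The stratified n-fold cross-validation used in both rediscovery tables can be sketched as below. This is an assumption-laden illustration: genes are dealt to folds round-robin without shuffling, the gene names are placeholders, and the function name `stratified_folds` is hypothetical. It only shows the property that each fold's test set receives its share of both known virulence genes (label 1) and remaining genes (label 0).

```python
def stratified_folds(positives, negatives, n_folds):
    """Return a list of (train, test) splits of (gene, label) tuples.

    Positives and negatives are split separately so that every test
    fold keeps roughly the same class proportions as the whole set.
    """
    pos_parts = [positives[i::n_folds] for i in range(n_folds)]
    neg_parts = [negatives[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = [(g, 1) for g in pos_parts[k]] + [(g, 0) for g in neg_parts[k]]
        train = [
            (g, 1) for j in range(n_folds) if j != k for g in pos_parts[j]
        ] + [
            (g, 0) for j in range(n_folds) if j != k for g in neg_parts[j]
        ]
        splits.append((train, test))
    return splits


# Toy example: 3 known virulence genes, 9 other genes, n = 3 folds.
known = ["v1", "v2", "v3"]
others = ["g%d" % i for i in range(1, 10)]
splits = stratified_folds(known, others, 3)
for train, test in splits:
    print(len(train), len(test))
```

When the number of folds equals the number of known genes in a category, each test fold holds exactly one known gene, which is why a category with a single gene cannot be evaluated this way.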
Figure 2. Proposed candidate GBS virulence genes.
The figure illustrates the putative S. agalactiae virulence genes identified in this paper whose biological functions are known in other pathogens or were inferred by sequence similarity to known protein motifs. The cluster IDs (Cnumber) identify the homolog clusters defined in Table S2.
List of genes encoding hypothetical proteins and their putative biological significance.
| Cluster | Gene* | In rank(s) | Ortholog annotations in other genomes; Pfam motifs† | Predicted function |
|---|---|---|---|---|
| C0036 | GBS0036 | | DUF386 | |
| C0255 | GBS0253 | | quinone-reactive Ni/Fe hydrogenase, cytochrome b subunit | |
| C0257 | GBS0255 | | lipoprotein | |
| C0348 | GBS0344 | | intercellular adhesion protein C | ? adhesin |
| C0429 | GBS0488 | | superfamily II helicase | |
| C0442 | GBS0502 | minor pilin | ATP-dependent endopeptidase | |
| C0560 | (absent) | | phage protein; DUF1642 | |
| C0613 | GBS0616 | C- | DUF1706 | |
| C0753 | GBS0806 | | Methyltransferase (Methyltransf_11 domain) | ? methyltransferase |
| C1080 | GBS1195 | | | ? staphylokinase analog |
| C1172 | GBS1295 | | DUF208 | |
| C1271 | GBS1415 | | DUF2127 | |
| C1332 | GBS1482 | | putative O-antigen transporter; polysaccharide biosynthesis protein (Polysacc_synt) | ? synthesis of unknown antigens |
| C1377 | GBS1529 | | streptococcal hemagglutinin; fibrinogen-binding adhesin (SdrG_C_C) | ? adhesin |
| C1412 | GBS1559 | | | |
| C1716 | GBS1861 | | putative DNA-binding protein; YheO-like PAS domain (PAS_6) | |
| C1856 | GBS1961 | | RNA-binding protein | |
| C1860 | GBS1992 | | ABC-type transport system, permease | |
| C1977 | (absent) | | filamentation induced by cAMP protein Fic (Fic family domain) | |
| C2042 | GBS0486 | | Methyltransferase (Methyltransf_11 domain) | ? methyltransferase |
This table lists the genes encoding hypothetical proteins from the top-10 genes of all 15 functional categories listed in Table S1. Cluster refers to the homolog clusters listed in Table S2. In rank(s): within the top-10 of functional categories (ranks). Each hypothetical protein was searched against the KEGG [47] and Pfam [48] databases to identify potential homologous sequence motifs. Note: *) Systematic gene names in the NEM316 (serotype III) genome. †) Pfam motifs with E-value are not presented in the table.
Figure 3. Number of other gene categories discoverable at a certain rank position.
This analysis evaluated how many virulence gene categories are discoverable at a given position in a prioritized ranking. A category is considered discoverable by another if at least one of its virulence genes lies above the given position in the ranking being analyzed. Gene positions were measured by rank fraction (between 0 and 1), with 0 at the top of the ranking and 1 at the bottom. Candidate genes were ranked by the SVM/RBF algorithm (the best algorithm evaluated in Table 2).
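The rank-fraction measure and the discoverability criterion in this caption can be sketched as follows. The exact formula is an assumption: here a gene at zero-based index i in a ranking of N genes gets rank fraction i / (N - 1), and a category counts as discoverable when any of its genes has a rank fraction at or below the cutoff. Gene and category names are placeholders, and `discoverable_categories` is a hypothetical helper.

```python
def rank_fractions(ranked_genes):
    """Map each gene to a rank fraction: 0 for the top gene, 1 for the last.

    Assumes a ranking of at least two genes.
    """
    n = len(ranked_genes)
    return {gene: i / (n - 1) for i, gene in enumerate(ranked_genes)}


def discoverable_categories(ranked_genes, categories, cutoff):
    """Return the categories with at least one gene at or above the cutoff.

    categories: {category name: list of known virulence genes}.
    cutoff: rank fraction between 0 and 1 (for example 0.01 for the top 1%).
    """
    fractions = rank_fractions(ranked_genes)
    return sorted(
        name
        for name, genes in categories.items()
        if any(fractions.get(g, 1.0) <= cutoff for g in genes)
    )


# Toy ranking produced by a model trained on some other category.
ranking = ["g1", "g2", "g3", "g4", "g5"]
known = {"adhesins": ["g2"], "invasins": ["g5"]}
found = discoverable_categories(ranking, known, cutoff=0.25)
print(found)
```

Sweeping the cutoff from 0 to 1 and counting the returned categories at each step gives the kind of curve the figure describes.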
Figure 4. Inter-discovery between virulence gene categories.
These figures provide two cross-sectional views of Figure 3 at the positions of the top 1% (A) and top 5% (B) respectively. The arrowheads indicate which other categories of virulence genes were discoverable by the category at the tail of the arrow.
Figure 5. Positions of the training set (in red) and top-10 genes (in blue) in each of the 15 virulence gene categories in S. agalactiae NEM316 genome (serotype III).
The highly ranked genes are scattered across the entire GBS genome rather than aggregated in close physical proximity. Physical linkage between the known and the prioritized genes is therefore unlikely. This illustrates the novelty of the phylogenetic profile (PP) approach for virulence gene discovery compared with the traditional paradigm of physical linkage and gene clusters. The blue boxes refer to the known genomic islands and are discussed in the results section. (*) Predicted by homology to other reference genomes, as islands (1) and (2) were not listed in PAI-DB or IslandViewer for NEM316.