David Burstein, Tal Zusman, Elena Degtyar, Ram Viner, Gil Segal, Tal Pupko.
Abstract
A large number of highly pathogenic bacteria utilize secretion systems to translocate effector proteins into host cells. Using these effectors, the bacteria subvert host cell processes during infection. Legionella pneumophila translocates effectors via the Icm/Dot type-IV secretion system and, to date, approximately 100 effectors have been identified by various experimental and computational techniques. Effector identification is a critical first step towards the understanding of the pathogenesis system in L. pneumophila as well as in other bacterial pathogens. Here, we formulate the task of effector identification as a classification problem: each L. pneumophila open reading frame (ORF) was classified as either effector or not. We computationally defined a set of features that best distinguish effectors from non-effectors. These features cover a wide range of characteristics including taxonomical dispersion, regulatory data, genomic organization, similarity to eukaryotic proteomes and more. Machine learning algorithms utilizing these features were then applied to classify all the ORFs within the L. pneumophila genome. Using this approach we were able to predict and experimentally validate 40 new effectors, reaching a success rate of above 90%. Increasing the number of validated effectors to around 140, we were able to gain novel insights into their characteristics. Effectors were found to have low G+C content, supporting the hypothesis that a large number of effectors originate via horizontal gene transfer, probably from their protozoan host. In addition, effectors were found to cluster in specific genomic regions. Finally, we were able to provide a novel description of the C-terminal translocation signal required for effector translocation by the Icm/Dot secretion system. To conclude, we have discovered 40 novel L. pneumophila effectors, predicted over a hundred additional highly probable effectors, and shown the applicability of machine learning algorithms for the identification and characterization of bacterial pathogenesis determinants.
Year: 2009 PMID: 19593377 PMCID: PMC2701608 DOI: 10.1371/journal.ppat.1000508
Source DB: PubMed Journal: PLoS Pathog ISSN: 1553-7366 Impact factor: 6.823
Figure 1. Schematic representation of the computational and experimental steps used for the discovery of novel effectors.
A machine learning approach was utilized, in which validated effectors and non-effectors were used as input. Various features expected to separate these two groups were extracted, filtered, and fed into various classifiers. Ten-fold cross validation was used to train the classifiers. The trained classifiers were used to classify the remaining ORFs as either putative effectors or not. High ranking predictions were experimentally validated and the newly validated effectors were used, iteratively, to refine the learning scheme. NN stands for Neural networks and Bayesian Net stands for Bayesian networks.
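The ten-fold cross-validation step in this scheme can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the nearest-centroid classifier stands in for the classifiers named in the caption, and the helper names (`k_fold_splits`, `train_centroids`, `predict`, `cross_validate`) and the toy feature vectors are hypothetical.

```python
import random
from statistics import mean

def k_fold_splits(n_items, k=10, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_centroids(X, y):
    """Hypothetical stand-in classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def cross_validate(X, y, k=10, seed=0):
    """Fraction of held-out items classified correctly across k folds."""
    correct = 0
    for held_out in k_fold_splits(len(X), k, seed):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [lab for i, lab in enumerate(y) if i not in held]
        model = train_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Made-up data: two features per ORF (e.g. a G+C fraction and a binary flag).
X = [[0.30 + 0.001 * i, 1.0] for i in range(20)] + \
    [[0.40 + 0.001 * i, 0.0] for i in range(20)]
y = ["effector"] * 20 + ["non-effector"] * 20
accuracy = cross_validate(X, y)
```

Each ORF is held out exactly once, so the reported accuracy is always measured on items the model did not see during training.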
Features used in the machine learning algorithms.
| Features | Rationale | References |
| Sequence similarity to known effector proteins | Effectors were shown to share local sequence similarity | |
| Sequence similarity to eukaryotic proteomes | A high number of effectors were shown to contain eukaryotic-like domains | |
| Taxonomic distribution among Bacteria and Metazoa | Effectors are unlikely to be housekeeping genes, which have homologs in numerous other bacteria | |
| Genome organization | Effector genes cluster in specific genomic regions, possibly as a result of horizontal gene transfer (HGT) events | |
| G+C content | Effectors were reported to have atypical G+C content, possibly as a result of HGT events | |
| C-terminal signal | Effectors have a C-terminal secretion signal. Two putative signals were previously suggested | |
| Regulatory elements | The PmrA and CpxR response regulators regulate numerous effectors | |
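Of the features listed above, G+C content is the simplest to compute directly from an ORF sequence. The sketch below is a minimal illustration; the function name `gc_content`, the choice to ignore ambiguous bases, and the example sequence are assumptions, not details taken from the paper.

```python
def gc_content(sequence):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total

# Made-up ORF fragment, used only to show the call.
example_orf = "ATGAAATTAGCTTTAGATAAACCTTAA"
example_gc = gc_content(example_orf)
```

A value computed this way for each ORF could then be compared with the genome-wide coding average to flag atypical genes.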
L. pneumophila putative effectors that were experimentally examined.
| Phase | Lpg # | Classification cutoff | Summary |
| | | 0.903 (15) | 11/11/12 |
| | | 0.998 (50) | 21/23/25 |
| | | 1 (50) | 8/8/8 |
| Total | | 103 | 40/42/45 |
In bold, genes that were validated to encode for effectors; in italics, genes that we failed to clone or express; in plain text, genes that encode proteins that failed to translocate.
The machine learning cutoff of the classification score used for determining the list of highly confident effectors in each learning phase. In brackets is the number of predicted effectors with equal or higher score than the cutoff value. In total 103 putative effectors had cutoff values similar to those experimentally tested (some of the putative effectors overlap among phases).
Validated effectors/successfully expressed genes/genes tested.
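The per-phase summary triplets can be totalled to check the overall figures. The short calculation below is a sketch; reading the "above 90%" success rate as validated effectors over successfully expressed genes is an assumption, and the variable names are illustrative.

```python
# (validated, successfully expressed, tested) for each learning phase, from the table.
phases = [(11, 11, 12), (21, 23, 25), (8, 8, 8)]

validated, expressed, tested = (sum(column) for column in zip(*phases))

# Success rate relative to genes that could be cloned and expressed.
rate_of_expressed = validated / expressed
# More conservative rate relative to every gene that was tested.
rate_of_tested = validated / tested
```

The totals come to 40/42/45, so 40 of 42 expressed genes (about 95%) and 40 of 45 tested genes (about 89%) were validated.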
Figure 2. Icm/Dot-dependent translocation of top ranking putative effector proteins.
Wild-type strain JR32 (gray bars) and icmT mutant GS3011 (white bars) harboring the CyaA fusion proteins (indicated on the left side of the bars) were used to infect HL-60-derived human macrophages, and the cAMP levels of the infected cells were determined (as described in Materials and Methods). The previously validated effector protein LegA10 was used as a positive control [19],[27]. The data are the means for the amount of cAMP per well and the error bars indicate standard deviations of at least 3 independent experiments.
Information summary regarding the novel effectors discovered.
| ORF | Paralog / most proximate effector | PmrA/CpxR | Homologs in Paris (lpp) / Corby (lpc) / Lens (lpl) | % G+C | Motif |
| lpg0080 | lpg0081 | P | 0094 | 38.3 | |
| lpg0090 | | | 0104, 0106, 0089 | 37.0 | |
| lpg0096 | | P | 0110, 0115, 0096 | 41.0 | |
| lpg0191 | | P | 0251 | 35.0 | |
| lpg0240 | | P | 0310, 0316, 0294 | 35.4 | |
| lpg0285 | lpg0284 | | 0361, 0362, 0337 | 35.2 | |
| lpg0437 | lpg0436 | P+C | 0504, 2905, 0480 | 36.5 | |
| lpg0519 | lpg0518 | P | | 36.4 | |
| lpg0696 | lpg0695 | C | 0751, 2598, 0733 | 36.7 | |
| lpg1101 | | | 1101, 2154, 1100 | 34.3 | |
| lpg1110 | | | 1111, 2142, 1114 | 35.2 | |
| lpg1120 | lpg2433, lpg1121 | | 2043 | 31.9 | |
| lpg1121 | lpg1120 | P | 1121, 0578, 1126 | 38.0 | |
| lpg1145 | lpg1144 | | 1147, 0608, 1151 | 36.1 | |
| lpg1290 | | | 1253 | 36.1 | |
| lpg1426 | lpg2410 | | 1381, 0842, 1377 | 35.0 | PL |
| lpg1491 | lpg1488 | | 1447 | 32.7 | |
| lpg1496 | lpg1491 | | 1453, 0915, 1530 | 36.4 | |
| lpg1598 | lpg1602 | | 1556, 1025, 1427 | 31.2 | |
| lpg1625 | lpg1621 | | 1595, 1052, 1398 | 33.8 | |
| lpg1702 | lpg1701 | | 1667, 1131, 1661 | 37.5 | CC |
| lpg1851 | | | 1818, 1296, 1817 | 36.2 | |
| lpg1933 | lpg2400 | | 1914, 1406, 1903 | 35.9 | |
| lpg1947 | lpg1948 | | 1930 | 32.1 | CC |
| lpg1949 | lpg1948 | | 1931, 1422, 1918 | 36.1 | |
| lpg1969 | lpg1966 | | 1952, 1452, 1941 | 37.2 | CC |
| lpg2166 | | | 2104, 1626, 2093 | 35.5 | CC |
| lpg2216 | lpg2215 | P | 2167, 1681, 2141 | 35.1 | CC |
| lpg2248 | | | 2202, 1717, 2174 | 38.8 | |
| lpg2328 | lpg2327 | | 2276, 1795, 2248 | 38.0 | |
| lpg2406 | lpg2407 | | 2472, 2070, 2329 | 37.6 | |
| lpg2411 | lpg2410 | | 2480, 2064, 2335 | 32.9 | |
| lpg2422 | | | 2487, 2055, 2345 | 37.9 | CC |
| lpg2433 | lpg0126 | P | 2500, 2043, 2353 | 37.9 | |
| lpg2504 | lpg2508 | P | 2572, 1967, 2426 | 34.6 | |
| lpg2523 | lpg2527 | | | 36.6 | |
| lpg2529 | lpg2527 | | 2594, 1942, 2449 | 38.0 | |
| lpg2603 | | | 2656, 0539, 2526 | 35.3 | |
| lpg2804 | | P+C | 2850, 3090, 2719 | 38.2 | |
| lpg2826 | lpg2829 | P | 3113, 2741 | 34.3 | ANK |
P: contains PmrA regulatory element; C: contains CpxR regulatory element.
PL: Phospholipase; CC: Coiled-coil; ANK: Ankyrin-repeat.
Classification performance of each feature group separately.
| Feature group | Correct rate | AUC |
| Taxonomic distribution among Bacteria and Metazoa | 89.9% | 0.96 |
| Sequence similarity to known effector proteins | 80.2% | 0.78 |
| Sequence similarity to eukaryotic proteomes | 78.7% | 0.78 |
| G+C content | 78.4% | 0.78 |
| Genome organization | 74.3% | 0.77 |
| Regulatory elements | 70.9% | 0.71 |
| C-terminal signal | 60.1% | 0.6 |
| All features combined | 95.9% | 0.98 |
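The two metrics reported in this table can be computed from classifier scores as sketched below. This is a generic illustration, not the authors' evaluation code: it assumes "correct rate" means overall accuracy at a fixed score threshold, and the function names, the 0.5 threshold, and the example scores are all hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve, computed as the probability that a random
    positive outscores a random negative (ties count as half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def correct_rate(scores_pos, scores_neg, threshold=0.5):
    """Fraction of items on the correct side of the score threshold."""
    true_pos = sum(s >= threshold for s in scores_pos)
    true_neg = sum(s < threshold for s in scores_neg)
    return (true_pos + true_neg) / (len(scores_pos) + len(scores_neg))

# Made-up classifier scores for three effectors and three non-effectors.
effector_scores = [0.9, 0.8, 0.4]
non_effector_scores = [0.7, 0.3, 0.2]
example_auc = auc(effector_scores, non_effector_scores)             # 8 of 9 pairs ordered correctly
example_rate = correct_rate(effector_scores, non_effector_scores)   # 4 of 6 classified correctly
```

Unlike the correct rate, the AUC does not depend on a particular threshold, which is one reason the two columns can rank feature groups slightly differently.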
Figure 3. Schematic representation of the distribution of effectors and putative effectors in the L. pneumophila genome.
Validated effectors are in red and putative effectors are in yellow. Roman numerals indicate genomic regions enriched with effector-encoding genes (as described in Table 5). Numbers represent the lpg (L. pneumophila Philadelphia-1 gene) identifier. Notably, the units used for this schematic presentation are ORFs rather than base pairs.
Genomic regions enriched with effector-encoding genes.
| Region | Validated effectors | Predicted effectors | G+C |
| | lpg1933 ( | lpg1952, lpg1957, lpg1959, lpg1961, lpg1968, lpg1972, lpg1975 | 36.8% |
| | lpg2137 ( | lpg2143, lpg2147, lpg2148, lpg2149, lpg2150, lpg2159, lpg2160, lpg2170 | 38.1% |
| | lpg2391 ( | lpg2395, lpg2403, lpg2408, lpg2413, lpg2414, lpg2416, lpg2424, lpg2425 | 38.4% |
| | lpg2504 ( | lpg2505, lpg2518, lpg2519, lpg2520, lpg2522, lpg2525 | 37.6% |
Number of ORFs in region/validated effectors/predicted effectors.
G+C content of coding regions.
Figure 4. The putative secretion signal at the C-terminus of effectors.
The enrichment and depletion pattern of groups of amino acids within the 20 C-terminal residues of effectors is shown. Amino acids with aliphatic side-chains bearing a hydroxyl group (S/T) are in red, hydrophobic amino acids (I/L/V/F) in green, and negatively charged amino acids (E/D) in blue. Statistically significant enrichments or depletions (G-test; p-value<0.01 after Bonferroni correction) are marked with asterisks.
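The statistical test named in this caption can be sketched as below. How the authors set up the comparison is not stated here, so the layout is an assumption: one 2x2 table per C-terminal position and amino-acid group (group present or absent, in effectors versus non-effectors), with Bonferroni correction over all such tests. The function names and any counts passed to them are hypothetical.

```python
import math

def g_test_2x2(a, b, c, d):
    """G statistic and p-value (1 degree of freedom) for the 2x2 table
    [[a, b], [c, d]], e.g. rows = effectors / non-effectors and
    columns = residue group present / absent at one position."""
    total = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / total,
        (a + b) * (b + d) / total,
        (c + d) * (a + c) / total,
        (c + d) * (b + d) / total,
    ]
    # Cells with zero observations contribute nothing to the sum.
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up counts: a group seen at one position in 30 of 40 effectors
# versus 10 of 40 non-effectors.
g_stat, p_raw = g_test_2x2(30, 10, 10, 30)
```

Under this assumed layout, 20 positions and three amino-acid groups would give 60 tests, so a raw p-value would need to fall below roughly 0.01/60 to earn an asterisk.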