Damien F Meyer, Christophe Noroy, Amal Moumène, Sylvain Raffaele, Emmanuel Albina, Nathalie Vachiéry.
Abstract
Type IV effectors (T4Es) are proteins produced by pathogenic bacteria to manipulate host cell gene expression and processes, divert the cell machinery for their own profit and circumvent the immune responses. T4Es have been characterized for some bacteria but many remain to be discovered. To help biologists identify putative T4Es from the complete genome of α- and γ-proteobacteria, we developed a Perl-based command line bioinformatics tool called S4TE (searching algorithm for type-IV secretion system effectors). The tool predicts and ranks T4E candidates by using a combination of 13 sequence characteristics, including homology to known effectors, homology to eukaryotic domains, and the presence of subcellular localization signals or secretion signals. S4TE software is modular: specific motif searches are run independently, and their outputs are then combined to generate a score and rank the strongest T4E candidates. The user can adjust various search parameters such as the weight of each module, the selection threshold or the input databases. The algorithm also provides a GC% and local gene density analysis, which strengthens the selection of T4E candidates. S4TE is a unique prediction tool for T4Es, with its main utility upstream of experimental biology.
Year: 2013 PMID: 23945940 PMCID: PMC3814349 DOI: 10.1093/nar/gkt718
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flowchart of the bioinformatics search by S4TE to identify putative effector proteins (PEs). This bioinformatics pipeline is composed of 15 steps delimited by colored boxes. Steps 1–13 look for T4 effector features in a given bacterial genome. Step 14 (Files compilation/Ranking) ranks and classifies the predicted T4 effectors based on their number of features to provide the best candidates for experimental validation (Supplementary Figures S2 and S3). Step 15 analyses the genome architecture and G + C content and shows the distribution of predicted effectors. Programs used are indicated in italics. Euk-like, Eukaryotic-like; Prok-like, Prokaryotic-like; NLS, nuclear localization signal; MLS, mitochondrial localization signal; C-ter, C-terminal.
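The ranking idea in step 14, together with the adjustable module weights and selection threshold mentioned in the abstract, can be illustrated with a minimal Python sketch. This is not the S4TE implementation (which is written in Perl); the function name `rank_candidates`, the default weight of 1.0 per feature and the data layout are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def rank_candidates(
    hits: Dict[str, Set[int]],
    weights: Optional[Dict[int, float]] = None,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Rank proteins by the summed weight of the features that flagged them.

    hits      maps a protein identifier to the set of feature numbers (1-13)
              that detected it (hypothetical layout).
    weights   maps a feature number to its weight; features without an entry
              count as 1.0 (assumed default, not taken from S4TE).
    threshold is the minimum score a protein needs to be kept.
    """
    weights = weights or {}
    scored = []
    for protein, features in hits.items():
        score = sum(weights.get(feature, 1.0) for feature in features)
        if score >= threshold:
            scored.append((protein, score))
    # Highest score first; ties are broken by identifier for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

With equal weights the score reduces to the number of features that detected a protein, which is the criterion the caption describes; raising the weight of a single feature shows how a user-tuned module weight could change the ranking.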
Description of the 13 features used in S4TE to screen a bacterial genome
| Feature number | Feature name | Description | References |
|---|---|---|---|
| 1 | | RRRSNTTTTY motif in the −300 bp ( | This work, |
| 2 | Homology | Sequence identity to a known effector molecule; Blastp against effector database (e-value = 10⁻²) | ( |
| 3 | Euk-like domains | Presence of eukaryotic domain: 58 eukaryotic domains | This work, |
| 4 | Prok-like domains | 3617 Domain of Unknown Function (DUF) domains | ( |
| 5 | NLS (nuclear localization signal) | Monopartite NLS; [KR]-[KR]-[KR]-[KR]-[KR] and bipartite NLS; K-[KR]-X(6,20)-[KR]-[KR]-X-[KR] | ( |
| 6 | MLS (mitochondrial localization signal) | Probability of a sequence containing a mitochondrial targeting peptide ( | ( |
| 7 | Prenylation domain | CaaX at the C-terminal; ‘C’ represents a cysteine residue, ‘a’ denotes an aliphatic amino acid and ‘X’ is one of four amino acids | ( |
| 8 | Secondary structure: coiled coils | Probability of a coiled-coil structure for windows of 28 residues through a protein sequence ( | ( |
| 9 | C-ter basicity | ≤3 [HRK] in the C-terminal 25 amino acids | ( |
| 10 | C-ter charges | Charge of C-terminal 25 amino acids ≥2; C-ter charge = number of [HRK]-number of [ED]-1 (COO-) | ( |
| 11 | C-ter hydrophobicity | Hydropathy of C-terminal 25 amino acids; Hydrophobic residue at the −3rd or −4th position | ( |
| 12 | Global hydrophilicity | Hydropathy of total protein <−200 | ( |
| 13 | E-block | EEXXE in the C-terminal 30 amino acids | ( |
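Several of the sequence-based features in the table translate directly into pattern searches. The sketch below, in Python, covers features 5, 10 and 13 only. It is an illustration rather than the S4TE code: the function names are hypothetical, `X` is assumed to mean any residue, and sequences are assumed to be upper-case one-letter amino acid strings.

```python
import re

# Feature 5: monopartite and bipartite nuclear localization signals.
MONOPARTITE_NLS = re.compile(r"[KR]{5}")
BIPARTITE_NLS = re.compile(r"K[KR].{6,20}[KR][KR].[KR]")

# Feature 13: E-block motif EEXXE.
E_BLOCK = re.compile(r"EE..E")


def has_nls(seq: str) -> bool:
    """True if the sequence carries a monopartite or bipartite NLS pattern."""
    return bool(MONOPARTITE_NLS.search(seq) or BIPARTITE_NLS.search(seq))


def has_e_block(seq: str) -> bool:
    """True if EEXXE occurs within the C-terminal 30 amino acids."""
    return bool(E_BLOCK.search(seq[-30:]))


def cter_charge(seq: str) -> int:
    """Feature 10: number of [HRK] minus number of [ED] minus 1 (COO-),
    computed over the C-terminal 25 amino acids."""
    tail = seq[-25:]
    basic = sum(tail.count(residue) for residue in "HRK")
    acidic = sum(tail.count(residue) for residue in "ED")
    return basic - acidic - 1
```

Under the table's criterion, a protein would be flagged by feature 10 when `cter_charge(seq) >= 2`.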
Figure 3. Schematic representation of putative T4 effectors in the L. pneumophila genome according to G + C content. This representation is an output file automatically generated by S4TE. The mean GC% is indicated by the blue line. Putative effectors in genomic regions with low G + C content are in green and those in regions with high G + C content are in red.
Enumeration of L. pneumophila effectors predicted by individual features implemented in S4TE
| S4TE feature | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True positives | 38 | 223 | 30 | 5 | 21 | 5 | 1 | 96 | 33 | 185 | 107 | 152 | 13 |
| False positives | 19 | 48 | 12 | 11 | 23 | 3 | 0 | 31 | 40 | 81 | 28 | 69 | 5 |
| PPV (%) | 67 | 82 | 71 | 31 | 48 | 63 | 100 | 76 | 45 | 70 | 79 | 69 | 72 |
The numbers of true positives (TP) and false positives (FP) and the positive predictive value (PPV, expressed in %) are indicated.
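The PPV row follows from the standard definition PPV = TP / (TP + FP). The short Python check below recomputes it from the TP and FP rows of the table; rounding halves upward (an assumption about how the percentages were rounded) reproduces every published value.

```python
# TP and FP counts for features 1-13, copied from the table above.
TRUE_POSITIVES = [38, 223, 30, 5, 21, 5, 1, 96, 33, 185, 107, 152, 13]
FALSE_POSITIVES = [19, 48, 12, 11, 23, 3, 0, 31, 40, 81, 28, 69, 5]


def ppv_percent(tp: int, fp: int) -> int:
    """Positive predictive value in percent, rounded half up.

    Integer arithmetic avoids Python's round(), which rounds halves to the
    nearest even number (62.5 would become 62 instead of 63).
    """
    total = tp + fp
    return (200 * tp + total) // (2 * total)


ppv = [ppv_percent(tp, fp) for tp, fp in zip(TRUE_POSITIVES, FALSE_POSITIVES)]
print(ppv)
```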
Figure 2. Distribution of the number of features that detected effector candidates in L. pneumophila. (A) Cumulative numbers of effectors correctly detected (true positives, TP) and called in error (false positives, FP) by S4TE in the L. pneumophila genome. (B) Accuracy, sensitivity and specificity of the S4TE analysis of the L. pneumophila genome with combinations of 3, 4, 5, 6 and 7 features.
Figure 4. Distribution of L. pneumophila genes and predicted T4 effectors according to local gene density (measured as length of flanking intergenic regions, FIRs). (A) Distribution of L. pneumophila genes according to their FIRs. Genes were sorted in two-dimensional bins according to the lengths of their 5′ (y-axis) and 3′ (x-axis) FIRs. The number of genes in bins is represented by a color-coded density graph. Genes with both FIRs longer than the median FIR length were considered gene-sparse region (GSR) genes. Genes with both FIRs below the median value were considered gene-dense region (GDR) genes. 'In between' genes have a long 5′ FIR and a short 3′ FIR, or vice versa. For L. pneumophila, this median value is 109 bp for 5′ FIRs and 65 bp for 3′ FIRs. The dotted line at the median FIR length delimits the genes in GSRs, GDRs and in between. (B) Distribution of predicted T4 effectors according to their FIRs. The number of hits per T4 effector in bins is indicated by a color scale. (C) Distribution of predicted T4 effectors in the GSRs and GDRs of L. pneumophila. The proportion of T4 effectors in GSRs, in between and GDRs is shown in red, yellow and blue, respectively, with percentage indicated.
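The GSR/GDR assignment described in panel (A) amounts to comparing each gene's two FIR lengths with the genome-wide medians. A minimal Python sketch follows, using the L. pneumophila medians quoted in the caption as defaults. The function name is hypothetical, and placing a gene whose FIR equals the median in the 'in between' class is an assumption, since the caption does not cover that case.

```python
def classify_gene(
    fir_5prime: int,
    fir_3prime: int,
    median_5prime: int = 109,  # bp, L. pneumophila value from the caption
    median_3prime: int = 65,   # bp, L. pneumophila value from the caption
) -> str:
    """Assign a gene to GSR, GDR or 'in between' from its FIR lengths (bp)."""
    if fir_5prime > median_5prime and fir_3prime > median_3prime:
        return "GSR"  # both flanking intergenic regions are long
    if fir_5prime < median_5prime and fir_3prime < median_3prime:
        return "GDR"  # both flanking intergenic regions are short
    # One long and one short FIR, or a FIR exactly at the median (assumed).
    return "in between"
```

For another genome, the two medians would be recomputed from that genome's own FIR lengths and passed in place of the defaults.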
Predicted T4 effectors in various genomes of α- and γ-proteobacteria
| Genome | ORFᵃ | Known T4Esᵇ (%) | Predicted T4Esᶜ (%) | Predicted TPᵈ (%) | Mean GCᵉ (%) | High GCᶠ (%) | Low GCᵍ (%) | GDRsʰ (%) | IBⁱ (%) | GSRsʲ (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 950 | NA | 22 (2.32) | NA | 28 | 68 | 32 | 23 | 36 | 41 |
| | 963 | 4 (0.42) | 26 (2.70) | 100 | 50 | 62 | 38 | 15 | 46 | 38 |
| | 2000 | 3 (0.15) | 53 (2.65) | 100 | 57 | 62 | 38 | 25 | 40 | 34 |
| | 1034 | 1 (0.10) | 17 (1.64) | 100 | 57 | 41 | 59 | 6 | 65 | 29 |
| | 2085 | 43 (2.06) | 126 (6.04) | 77 | 43 | 50 | 50 | 15 | 41 | 44 |
| | 36 | 1 (2.78) | 4 (11.11) | 100 | 39 | 50 | 50 | 25 | 25 | 50 |
| | 2943 | 275 (9.34) | 311 (10.57) | 81 | 38 | 40 | 60 | 8 | 40 | 53 |
ᵃ Number of ORFs in the genome.
ᵇ Number and proportion of known T4 effectors in the genome.
ᶜ Number and proportion of predicted T4 effectors.
ᵈ Proportion of true positives in S4TE prediction.
ᵉ Mean G + C content of the genome.
ᶠ Proportion of predicted T4 effectors in genomic regions with high G + C content.
ᵍ Proportion of predicted T4 effectors in genomic regions with low G + C content.
ʰ Proportion of predicted T4 effectors in gene-dense regions.
ⁱ Proportion of predicted T4 effectors in 'in between' regions.
ʲ Proportion of predicted T4 effectors in gene-sparse regions.
NA, not applicable; TP, true positives; GDRs, gene-dense regions; IB, in between; GSRs, gene-sparse regions; T4Es, type IV effectors.