Nagarajan Paramasivam, Dirk Linke.
Abstract
The subcellular localization (SCL) of proteins provides important clues to their function in a cell. In our efforts to predict useful vaccine targets against Gram-negative bacteria, we noticed that misannotated start codons frequently lead to wrongly assigned SCLs. This and other problems in SCL prediction, such as the relatively high false-positive and false-negative rates of some tools, can be avoided by applying multiple prediction tools to groups of homologous proteins. Here we present ClubSub-P, an online database that combines existing SCL prediction tools into a consensus pipeline from more than 600 proteomes of fully sequenced microorganisms. On top of the consensus prediction at the level of single sequences, the tool uses clusters of homologous proteins from Gram-negative bacteria and from Archaea to eliminate false-positive and false-negative predictions. ClubSub-P can assign the SCL of proteins from Gram-negative bacteria and Archaea with high precision. The database is searchable, and can easily be expanded using either new bacterial genomes or new prediction tools as they become available. This will further improve the performance of the SCL prediction, as well as the detection of misannotated start codons and other annotation errors. ClubSub-P is available online at http://toolkit.tuebingen.mpg.de/clubsubp/
Keywords: clustering; protein homology; signal peptide; start codon prediction; subcellular localization prediction
Year: 2011 PMID: 22073040 PMCID: PMC3210502 DOI: 10.3389/fmicb.2011.00218
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
List of SCL and feature specific tools used in the prediction pipeline.
| Tools | Features of SCL** | Used for | Signal peptide prediction modes | Prediction threshold (default threshold from the predictors) | References |
|---|---|---|---|---|---|
| LipoP 1.0 | SPII | Archaea and Gram− | Gram-negative bacteria | Best prediction: SpII | Juncker et al. |
| TatP 1.0 | TAT | Archaea and Gram− | Bacteria | Twin-arginine motif and MaxDscore >0.36 | Bendtsen et al. |
| TatFind 1.4 | TAT | Archaea and Gram− | Prokaryote | Rules 3a, 3b, or 4* | Rose et al. |
| SignalP 3.0-NN | GSP | Archaea and Gram− | Gram-positive and Gram-negative bacteria | MaxDscore >0.44 | Bendtsen et al. |
| SignalP 3.0-HMM | GSP | Archaea and Gram− | Gram-positive and Gram-negative bacteria | SP probability >0.5 | Bendtsen et al. |
| PrediSi | GSP | Gram− | Gram-negative bacteria | Prediction score >0.5 | Hiller et al. |
| RPSP | GSP | Gram− | Prokaryote | Positive SP prediction | Plewczynski et al. |
| Phobius | GSP, IMP | Archaea and Gram− | – | Positive SP prediction and TMH prediction | Käll et al. |
| TMHMM 2.0.0 | IMP | Archaea and Gram− | – | Positive TMH prediction | Krogh et al. |
| HMMTOP 2.0 | IMP | Archaea and Gram− | – | Positive TMH prediction | Tusnady and Simon |
| EffectiveT3 | T3SS | Gram− | Gram-negative bacteria | Prediction score ≥0.8 | Arnold et al. |
| T3SS_prediction | T3SS | Gram− | Gram-negative bacteria | Prediction score ≥0.8 | Löwer and Schneider |
| PSORTb v3.0.2 | OMP, LPP***, EXT, CW | Archaea and Gram− | – | Final prediction – outer membrane or extracellular or cell wall | Yu et al. |
| CELLO v.2.5 | OMP, LPP*** | Gram− | – | Final prediction – outer membrane | Yu et al. |
| BOMP | OMBB | Gram− | – | Positive prediction (category 1–5) | Berven et al. |
| HHomp | OMBB | Gram− | – | OMP probability ≥90 | Remmert et al. |
| PRED-SIGNAL | GSP | Archaea | Archaea | Positive “signal” prediction | Bagos et al. |
| FlaFind | Prepilin SP | Archaea | Archaea | Positive prepilin signal detection | Szabó et al. |
| PilFind | Type IV pilin SP | Gram− | – | Positive pilin signal peptide | Imam et al. (submitted) |
*Twin-arginine motif followed by a single charged residue (Rule: 3a, 3b) or basic residue following the twin-arginine and hydrophobic stretch (Rule 4).
**SPII, lipoprotein signal peptide; TAT, TAT signal peptide; GSP, general signal peptide; CMP, cytoplasmic membrane protein; T3SS, type 3 secretory signal peptide; OMP, outer membrane protein; EXT, extracellular protein; LPP, leaderless periplasmic protein; OMBB, outer membrane β-barrel; Prepilin SP, prepilin signal peptide.
‡Gram−, Gram-negative bacteria.
***Periplasmic prediction used only when there is no consensus signal peptide prediction.
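The score-based thresholds in the table above can be read as simple per-tool predicates. The following is a minimal sketch in Python covering only the tools with a single numeric cut-off; the dictionary name, its keys, and the `is_positive` helper are hypothetical names introduced for illustration, and the scores are assumed to have been parsed from each tool's output beforehand.

```python
import operator

# Cut-offs copied from the table above. The keys are illustrative labels,
# not identifiers used by the tools themselves or by ClubSub-P.
THRESHOLDS = {
    "SignalP 3.0-NN": (operator.gt, 0.44),   # MaxDscore > 0.44
    "SignalP 3.0-HMM": (operator.gt, 0.5),   # SP probability > 0.5
    "EffectiveT3": (operator.ge, 0.8),       # prediction score >= 0.8
    "T3SS_prediction": (operator.ge, 0.8),   # prediction score >= 0.8
    "HHomp": (operator.ge, 90),              # OMP probability >= 90
}


def is_positive(tool: str, score: float) -> bool:
    """Return True if `score` passes the table's default cut-off for `tool`."""
    compare, cutoff = THRESHOLDS[tool]
    return compare(score, cutoff)
```

Keeping the comparison operator next to each cut-off preserves the difference between strict (`>`) and inclusive (`≥`) thresholds listed in the table.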
Logic for SCL prediction at the protein level.
| Features | Lipoprotein SP | Consensus TAT SP | Consensus general SP | Consensus TMH | Consensus TMBB | Consensus T3SS SP or T4SS SP or extracellular |
|---|---|---|---|---|---|---|
| Cytoplasm | No | No | No | No | No | No |
| Cytoplasmic membrane | No | No | No | 1 or more | No | No |
| Periplasm | No | Any one of the SP | Any one of the SP | No | No | No |
| Lipoprotein | Yes | No | No | No | No | No |
| Outer membrane | Any one of the SP | Any one of the SP | Any one of the SP | No | Yes | No |
| Extracellular | No | Yes or no | Yes or no | 0 or more | No | Yes |
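The decision table above can be sketched as a small rule function. This is an illustrative reading, not the authors' implementation: it assumes that "Any one of the SP" spans the adjacent signal peptide columns (TAT and general SP for the periplasm; all three for the outer membrane) and that "Yes or no" in the extracellular row covers the TAT and general SP columns. The `Features` class, its field names, and the fallback label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Consensus feature calls for one protein (field names are illustrative)."""
    lipoprotein_sp: bool = False
    tat_sp: bool = False
    general_sp: bool = False
    tmh_count: int = 0       # number of consensus transmembrane helices
    tmbb: bool = False       # consensus transmembrane beta-barrel
    secreted: bool = False   # consensus T3SS SP, T4SS SP, or extracellular


def predict_scl(f: Features) -> str:
    """Map feature calls to a localization, one branch per table row."""
    other_sp = f.tat_sp or f.general_sp
    any_sp = f.lipoprotein_sp or other_sp
    if f.secreted and not f.lipoprotein_sp and not f.tmbb:
        return "Extracellular"
    if f.tmbb and any_sp and f.tmh_count == 0 and not f.secreted:
        return "Outer membrane"
    if f.lipoprotein_sp and not (other_sp or f.tmh_count or f.tmbb or f.secreted):
        return "Lipoprotein"
    if other_sp and not (f.lipoprotein_sp or f.tmh_count or f.tmbb or f.secreted):
        return "Periplasm"
    if f.tmh_count >= 1 and not (any_sp or f.tmbb or f.secreted):
        return "Cytoplasmic membrane"
    if not (any_sp or f.tmh_count or f.tmbb or f.secreted):
        return "Cytoplasm"
    # Assumption: combinations not covered by any row are left unassigned.
    return "No rule matched"
```

For example, `predict_scl(Features(general_sp=True))` follows the periplasm row, while a feature combination that no row covers, such as a general signal peptide together with a transmembrane helix, falls through to the fallback label.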
Figure 1. Flowchart for SCL prediction at the level of single proteins.
Figure 2. Consensus transmembrane helix prediction module. A consensus TMH should be predicted by at least two tools. Signal peptides frequently result in false-positive TMH predictions and are removed with consensus TMH predictions.
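The two-tool rule in the Figure 2 caption can be illustrated with a short sketch. It is one possible reading only: helices that start inside a predicted signal peptide are masked, and a consensus is declared when at least two tools still report a helix. How the original pipeline matches helices between tools is not described in this section, so the function name, its parameters, and the masking rule are assumptions.

```python
from typing import Dict, List, Tuple

Helix = Tuple[int, int]  # 1-based (start, end) residue positions


def consensus_tmh_tools(
    predictions: Dict[str, List[Helix]],
    signal_peptide_end: int = 0,
    min_tools: int = 2,
) -> List[str]:
    """Return the tools that still predict a helix after signal peptide masking.

    An empty list means fewer than `min_tools` tools agree (no consensus).
    """
    supporting = []
    for tool, helices in predictions.items():
        # Assumption: a helix starting within the signal peptide is a false positive.
        kept = [h for h in helices if h[0] > signal_peptide_end]
        if kept:
            supporting.append(tool)
    return supporting if len(supporting) >= min_tools else []
```

With `signal_peptide_end=0` (no signal peptide predicted) nothing is masked, so the function reduces to a plain vote count across tools.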
Figure 3. Threshold determination for the assignment of subcellular localizations to whole clusters. (A) For Gram-negative bacteria; (B) for Archaea. In both cases, the number of clusters annotated as “uncertain” (i.e., with no SCL prediction above the threshold) increases at 0.7. To minimize the number of “uncertain” clusters, a cluster SCL was assigned when a fraction of 0.7 or above (70%) of the proteins in the cluster were predicted to have a given SCL.
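The 0.7 cluster threshold described in the Figure 3 caption amounts to a majority vote with a minimum fraction. A minimal sketch, assuming each protein in a cluster already carries a single SCL label; the function name is hypothetical, and treating an empty cluster as "Uncertain" is an assumption made here for completeness.

```python
from collections import Counter
from typing import Iterable


def assign_cluster_scl(protein_scls: Iterable[str], threshold: float = 0.7) -> str:
    """Assign a cluster-level SCL if one label reaches the given fraction."""
    labels = list(protein_scls)
    if not labels:
        return "Uncertain"  # assumption: nothing to vote on
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else "Uncertain"
```

A cluster with seven periplasmic and three cytoplasmic predictions is assigned "Periplasm" (fraction 0.7), whereas a 6:4 split stays "Uncertain".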
Logical rules used for archaeal SCL predictions.
| Features | Lipoprotein SP | TAT SP | General SP | Prepilin SP | Consensus TMH | PSORTb cell wall | PSORTb extracellular |
|---|---|---|---|---|---|---|---|
| Cytoplasm | No | No | No | No | No | No | No |
| Cytoplasmic membrane | Yes or no | Yes or no | Yes or no | Yes or no | One or more | Yes or no | Yes or no |
| Cell wall | No | Yes or no | Yes or no | Yes or no | 0 or more | Yes | No |
| Secreted/extracellular | No | Any one of the SP | Any one of the SP | Any one of the SP | 0 or more | No | Yes or no |
Figure 4. Cluster-based comparison of signal peptide prediction tools. Shown are the comparisons for TAT (A), T3SS (B), general signal peptide (C), and transmembrane (D) prediction tools. The x-axis describes how many sequences (in %) in a cluster are predicted to have a signal peptide, where 10% means 0.1–10%, 20% means 10.1–20%, etc. The majority of clusters contain only sequences where no signal sequence is predicted (0% positive results); for clarity, these are ignored in the graph.
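The x-axis binning described in the Figure 4 caption can be written down explicitly. The sketch below is an illustration under the caption's stated convention (a bin label is the upper edge of a 10%-wide interval, with 0% kept separate); the function name is hypothetical, and integer arithmetic is used to avoid floating-point edge effects at bin boundaries.

```python
def percent_bin(positive: int, total: int) -> int:
    """Return the bin label (upper edge in %) for a cluster.

    `positive` is the number of sequences with a predicted signal peptide,
    `total` the cluster size. 0 is returned only when no sequence is positive.
    """
    if total <= 0:
        raise ValueError("cluster must contain at least one sequence")
    # Ceiling of (10 * positive / total), computed with integers only.
    return (10 * positive + total - 1) // total * 10
```

Under this convention a cluster with exactly 10% positive sequences lands in the 10% bin, and one with 10.1% lands in the 20% bin.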
Statistics of the ClubSub-P database.
| ClubSub-P subcellular localizations | No. of clusters | No. of proteins |
|---|---|---|
| Cytoplasmic | 95,191 | 1,023,339 |
| Cytoplasmic membrane | 33,814 | 304,996 |
| Periplasmic | 15,261 | 107,602 |
| Inner/outer membrane lipoprotein | 4,471 | 27,711 |
| Outer membrane beta-barrel | 3,011 | 20,976 |
| Extracellular | 1,319 | 8,250 |
| Extracellular AND transmembrane helix | 733 | 3,582 |
| Extracellular AND signal peptide | 540 | 2,930 |
| Outer membrane beta-barrel AND lipid anchor | 124 | 1,572 |
| Uncertain | 18,388 | 113,286 |
| Unknown | 1,356 | 5,969 |
Performance measurement for different Gram-negative bacterial subcellular localization prediction tools.
| Location | Precision | Recall | Accuracy | MCC |
|---|---|---|---|---|
| Cytoplasm | 66.67 | 74.42 | 83.93 | 0.6 |
| Inner membrane | 90 | 58.06 | 90.68 | 0.68 |
| Periplasm | 60 | 15.79 | 89.41 | 0.27 |
| Outer membrane | 55.56 | 62.5 | 95.88 | 0.57 |
| Extracellular | 100 | 50.67 | 78.24 | 0.6 |
| Total | 80 | 54.55 | 87.6 | 0.59 |
| Cytoplasm | 62.32 | 100 | 84.34 | 0.7 |
| Inner membrane | 94.12 | 61.54 | 92.95 | 0.73 |
| Periplasm | 58.62 | 89.47 | 91.3 | 0.68 |
| Outer membrane | 28.57 | 75 | 89.7 | 0.42 |
| Extracellular | 86.36 | 50.67 | 74.25 | 0.5 |
| Total | 66.67 | 70.18 | 86.38 | 0.6 |
| Cytoplasm | 72.22 | 88.64 | 88.17 | 0.72 |
| Inner membrane | 100 | 53.57 | 91.77 | 0.7 |
| Periplasm | 73.68 | 73.68 | 94.12 | 0.7 |
| Outer membrane | 87.5 | 87.5 | 98.82 | 0.87 |
| Extracellular | 100 | 45.33 | 75.88 | 0.56 |
| Total | 83.85 | 62.64 | 89.73 | 0.67 |
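For readers who want to relate the four columns to one another, the sketch below computes them from confusion counts using the standard definitions of precision, recall, accuracy, and the Matthews correlation coefficient (MCC). This section does not state which formulas or counts were used for the table, so both the definitions and the example counts in the usage note are assumptions; the function name is hypothetical.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, and accuracy in percent, plus MCC, from confusion counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is returned by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "mcc": mcc}
```

With made-up counts such as `classification_metrics(8, 2, 85, 5)`, precision is 80% and accuracy 93%, while MCC is noticeably lower because it also penalizes the five missed positives.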
Figure 5. Cluster alignment and start codon mispredictions (sequences are labeled with gene identifiers). The alignment shows extended and shortened ends of orthologous sequences at the DNA level. Wrong extensions are colored in red, and shortened sequences are highlighted in yellow. Corrected sequences with alternative start codons are shown in bold. In most cases, these corrections lead to corrected signal peptide predictions. For clarity, sequences are truncated.
Genomes with multiple signal peptide/start codon errors in secretory clusters.
| Replicon name | Number of alternative start codons* | Replicon ID |
|---|---|---|
| | 104 | NC_009085 |
| | 62 | NC_013282 |
| | 27 | NC_015733 |
| | 27 | NC_014012 |
| | 25 | NC_002696 |
| | 21 | NC_009648 |
| | 20 | NC_011566 |
*Found in protein clusters with signal peptide annotation where single sequences lacked the signal peptide. Only genomes with more than 20 erroneous proteins are shown.
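The criterion in the footnote above (a cluster annotated with a signal peptide in which individual sequences lack one) can be sketched as a simple filter. This is an illustration, not the published procedure: reusing the 0.7 cluster fraction as the cut-off for "annotated with a signal peptide" is an assumption, and the function name and gene identifiers are hypothetical.

```python
from typing import Dict, List


def flag_start_codon_candidates(
    cluster: Dict[str, bool], min_fraction: float = 0.7
) -> List[str]:
    """Return gene IDs lacking a signal peptide in a cluster where most have one.

    `cluster` maps a gene identifier to whether a signal peptide was predicted.
    """
    if not cluster:
        return []
    positive = sum(cluster.values())
    if positive / len(cluster) < min_fraction:
        return []  # the cluster itself is not treated as secretory
    return [gene for gene, has_sp in cluster.items() if not has_sp]
```

The flagged sequences are only candidates: each would still need its upstream or downstream region inspected for an alternative start codon, as illustrated in Figure 5.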
ClubSub-P archaeal SCL prediction statistics.
| Cluster’s subcellular localizations | No. of clusters | No. of sequences |
|---|---|---|
| Cytoplasmic | 15,592 | 84,978 |
| Cytoplasmic membrane | 4,535 | 17,158 |
| Secreted/extracellular | 399 | 1,157 |
| Secreted/extracellular with membrane anchor | 244 | 804 |
| Lipoprotein | 181 | 572 |
| Cell wall | 57 | 189 |
| Cell wall with membrane anchor | 14 | 38 |
| Uncertain | 1,139 | 3,921 |
| Unknown | 23 | 55 |
Performance measurement of ClubSub-P archaeal predictions.
| Tool | Precision | Recall | Accuracy | MCC |
|---|---|---|---|---|
| PSORTb v3.0.2 | 98.8 | 98.02 | 99.2 | 0.98 |
| ClubSub-P | 99.55 | 86.77 | 96.46 | 0.91 |
Figure 6. ClubSub-P database screenshots.