Miaomiao Zhou, Jos Boekhorst, Christof Francke, Roland J Siezen.
Abstract
BACKGROUND: In the past decades, various protein subcellular-location (SCL) predictors have been developed. Most of these predictors, like TMHMM 2.0, SignalP 3.0, PrediSi and Phobius, aim at the identification of one or a few SCLs, whereas others such as CELLO and Psortb.v.2.0 aim at a broader classification. Although these tools and pipelines predict signal peptides and transmembrane helices with high precision, they are much less accurate when other sequence characteristics are concerned. For instance, it has proved notoriously difficult to predict the fate of proteins carrying a putative type I signal peptidase (SPIase) cleavage site, as many of those proteins are retained in the cell membrane as N-terminally anchored membrane proteins. Moreover, most SCL classifiers are based on the classification in the Swiss-Prot database and have consequently inherited the inconsistencies of that SCL classification. As accurate and detailed SCL prediction on a genome scale is highly desired by experimental researchers, we decided to construct a new SCL prediction pipeline: LocateP.
Year: 2008 PMID: 18371216 PMCID: PMC2375117 DOI: 10.1186/1471-2105-9-173
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. (A) Classification of protein SCLs in Gram-positive bacteria. The secreted proteins can be divided into the following subgroups: (i) N-terminal hydrophobic tail anchored (N-anchored), (ii) C-terminal hydrophobic tail anchored (C-anchored), (iii) covalent lipid-anchored, (iv) covalently/non-covalently cell-wall anchored, (v) secreted/released (defined as proteins that are Sec-/Tat-secreted and cleaved by signal peptidase I), and (vi) non-classically secreted/released proteins via minor pathways [120, 163]. Based on the Swiss-Prot classification system, the SCLs can be categorized into: Cytoplasmic, Membrane (multi-transmembrane, N-/C-anchored), Cell wall (LPxTG-anchored) and Extracellular (lipid-anchored, secreted, bacteriocin-like) proteins. (B) The structure of known signal peptides. The overall structure of Tat- and Sec-dependent signal peptides is commonly conserved as distinct consecutive N, H and C regions. The N region is the start of the protein and contains positively charged residues. The H region follows the N region and is a string of consecutive hydrophobic residues that can form an α-helix in the membrane. The C region contains the signal peptidase cleavage signals. Known cleavage/retention signals include the AxAA type I SPase cleavage site [163, 172], the L-x-x-C (so-called lipobox) type II SPase cleavage site [157] and the AxA Tat-substrate cleavage site [88, 90, 173]. The LPxTG-type motif is a C-terminal sorting signal that is involved in the covalent attachment of proteins to the peptidoglycan of the cell wall. The signal peptides of proteins targeted to minor secretion pathways do not follow the N-H-C structure [2, 125, 163].
Recent methods for protein SCL prediction
| Category | Use | Tool | Reference |
| Membrane protein predictor | a | TMHMM | [12] |
| Both transmembrane helices and signal peptide predictor | a | Phobius | [14] |
| Signal peptide predictor | a | SignalP | [18] |
| | a | PrediSi | [98] |
| | | Signal peptidase type I cleavage site motif | [41] |
| Lipoprotein predictor | b | LipoP | [151] |
| | a | Signal peptidase type II cleavage site motif | [41, 157] |
| Tat-secreted protein predictor | b | TatP | [86] |
| | a | Tat-find.v.1.2 | [174] |
| Protein subcellular location classifier | b | Psortb.v.2.0 | [17] |
| | b | CELLO | [20] |
| | b | Gpos-PLoc | [28] |
| | | Augur | [27] |
| Minor pathway secreted protein predictor | a | Bagel | [149] |
| | | SecretomeP 2.0 | [128] |
| Mycobacteria protein SCL predictor | b | TBpred | [95] |
a, Tools included in the LocateP pipeline
b, Tools used for comparison and validation of LocateP
Figure 2. Flowchart of the LocateP pipeline. First, the likelihood of secretion via the Tat pathway was calculated by combining Tat-find v1.2 [91] and our Tat-specific HMMs (RR-HMM, CS-HMM), and bacteriocin-like proteins were identified using Bagel [149]. Second, Phobius [14], PrediSi [98], SignalP 3.0 [18] and TMHMM 2.0 [12] were combined to identify transmembrane regions. Proteins without any predicted TM segments were considered intracellular, whereas those with TM segments were divided into multi-TM membrane proteins, N-anchored membrane proteins or secreted/released proteins (single N-terminal TM segment, possibly a signal peptide), and C-anchored membrane proteins (signal peptide and single C-terminal TM segment). Third, a sortase-substrate HMM [165] was used to distinguish LPxTG-type peptidoglycan-anchored proteins from C-anchored membrane proteins. Subsequently, signal peptidase type II (SPII) substrates were predicted by combining existing lipoprotein motif models [41, 157] and new lipoprotein HMMs. The remaining proteins were classified as either secreted/released or N-anchored membrane proteins. See Methods and additional file 1 for more details. Abbreviations: A-S = Anchored-Secreted; TMS = TransMembrane Segment; SP = Signal Peptide; C/N-TM = C/N-terminally transmembrane anchored; LPxTG = LPxTG cell-wall anchored.
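The decision order in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the LocateP implementation: the `Evidence` record, its field names and the `classify` function are hypothetical, and each boolean stands in for the combined output of the tools named in the caption. The Tat step is left out because the caption does not state how the Tat score feeds into the final label, and proteins with TM segments that match none of the described patterns are treated as multi-transmembrane as a simplification.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-protein summary of upstream predictions."""
    bacteriocin_like: bool = False  # stand-in for a Bagel hit
    signal_peptide: bool = False    # single N-terminal TMS, possibly a signal peptide
    other_tms: int = 0              # TM segments besides the N-terminal one
    c_terminal_tms: bool = False    # the only other TMS lies at the C-terminus
    lpxtg_hit: bool = False         # stand-in for the sortase-substrate HMM
    lipobox_hit: bool = False       # stand-in for the SPII / lipoprotein models
    spi_cleaved: bool = False       # stand-in for the secreted vs N-anchored decision


def classify(e: Evidence) -> str:
    """Apply the steps of the caption in order (illustrative only)."""
    if e.bacteriocin_like:
        return "Secreted via minor pathways"
    if not e.signal_peptide and e.other_tms == 0:
        return "Intracellular"  # no predicted TM segment at all
    if e.signal_peptide and e.other_tms == 1 and e.c_terminal_tms:
        # Signal peptide plus a single C-terminal TMS: the sortase-substrate
        # model separates LPxTG proteins from C-anchored membrane proteins.
        return "LPxTG cell-wall anchored" if e.lpxtg_hit else "C-anchored"
    if e.other_tms >= 1:
        return "Multi-transmembrane"  # simplification, see lead-in
    # Only a single N-terminal TMS remains at this point.
    if e.lipobox_hit:
        return "Lipid anchored"
    return "Secreted" if e.spi_cleaved else "N-anchored"


if __name__ == "__main__":
    print(classify(Evidence(signal_peptide=True, spi_cleaved=True)))
```

Keeping the checks in a fixed order mirrors the flowchart: each later test only sees proteins that earlier tests did not claim.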
Figure 3. Distinguishing between secreted and N-anchored proteins. Tjalsma et al. [41] identified 33 N-anchored and 36 secreted proteins from Bacillus subtilis (by 2D gel electrophoresis) that have a putative SPI-cleavage site motif in the C-region following the transmembrane helix H-region (see Fig. 1B). (A) A sequence composition chart, made using WebLogo [47], based on a multiple-sequence alignment of the H- and C-regions (see Fig. 1B) of the N-anchored and secreted protein sets. The red arrow indicates the cleavage position of true SPI-site motifs (see Fig. 1B), and the green dashed arrow represents the corresponding position in N-anchored proteins, which is not cleaved. (B) The specificity of HMMs of different lengths containing the putative cleavage site. A* = the alanine after which cleavage takes place. Mod1: residues -9 to A*; Mod2: residues -11 to A*; Mod3: residues -14 to A*; Mod4: residues -8 to +3 of A*; Mod5: residues -13 to +10 of A*; Mod6: residues -8 to +17 of A*; Mod7: residues -3 to +10 of A*; Mod8: residues -3 to +17 of A*; Mod9: residues +1 to +25.
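The nine model windows listed for panel B can be written down as offsets relative to A*. The sketch below assumes that A* sits at offset 0 and that both bounds are inclusive; this reading of the caption is an assumption, and `MODEL_WINDOWS` and `extract_window` are hypothetical names used only for illustration. It shows how the subsequence for each model would be cut from a protein once the index of A* is known, not how the HMMs themselves were built.

```python
from typing import Optional

# Inclusive (start, end) offsets relative to A*, taken from the caption.
# Treating A* as offset 0 is an assumption of this sketch.
MODEL_WINDOWS = {
    "Mod1": (-9, 0),
    "Mod2": (-11, 0),
    "Mod3": (-14, 0),
    "Mod4": (-8, 3),
    "Mod5": (-13, 10),
    "Mod6": (-8, 17),
    "Mod7": (-3, 10),
    "Mod8": (-3, 17),
    "Mod9": (1, 25),
}


def extract_window(seq: str, a_star: int, model: str) -> Optional[str]:
    """Return the residues covered by `model`, or None if the window
    would run past either end of the sequence.

    `a_star` is the 0-based index of the A* residue in `seq`.
    """
    start, end = MODEL_WINDOWS[model]
    lo, hi = a_star + start, a_star + end
    if lo < 0 or hi >= len(seq):
        return None
    return seq[lo:hi + 1]


if __name__ == "__main__":
    toy = "ACDEFGHIKLMNPQRSTVWY" * 3  # synthetic sequence, not a real protein
    for name in MODEL_WINDOWS:
        print(name, extract_window(toy, 20, name))
```

Returning `None` for truncated windows keeps short sequences from silently producing shorter inputs than a given model expects.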
Comparison of the performance of LocateP with other SCL prediction tools. The entry in each cell indicates the recall of the method with respect to the data in the test-set (TS). * indicates that the test data were extracted from experimental studies. N/A indicates that a certain tool was not applied to the test sets because that set could not be treated appropriately by the tool. The size of the test sets (TS) is indicated in brackets and the relevant literature is mentioned in the Table legend.
| LocateP | 98.8% | 99.4% | 97.5% | 97.2% | 91.0% | 95.7% | 97.9% | 98.1% |
| LipoP | N/A | 96.8% | N/A | N/A | 89.4% | 95.7% | N/A | |
| SignalP 3.0-NN | 99.3% | 98.3% | 97.2% | 25.6% | N/A | N/A | N/A | |
| SignalP 3.0-HMM | 99.4% | 96.6% | 97.2% | 20.5% | N/A | N/A | N/A | |
| Phobius | 98.8% | 96.6% | 97.2% | 42.3% | N/A | N/A | 96.1% | |
| Predisi | 99.4% | 93.2% | 94.4% | 37.2% | N/A | N/A | N/A | |
| TMHMM | N/A | 99.3% | N/A | N/A | N/A | N/A | N/A | 97.1% |
| Psortb v.2.0 | N/A | N/A | 49.2% | 36.1% | 10.3% (M) | 18.6% (M) | 10.6% (M) | N/A |
| Cello | N/A | N/A | 82.6% | 80.6% | 75.6% (M) | 61.7% (M) | 68.1% (M) | N/A |
| TatP | 92.8% | 99.6% | 96.5% | |||||
| Tat-find v1.2 | 94.9% | 98.6% | 93% | |||||
| LocateP | 93.6% | 99.9% | 98.4% | |||||
| Tool | TS10a | TS10b | TS10c | TS10d | TS11a | TS11b | TS11c | TS11d |
| LocateP | 98% | 97% | 80.6% (e) | 84% (f) | 97.4% | 96.7% | 86.1% | 86% (g) |
| Psortb v.2.0 | 93.9% | 91.7% | 79.6% | 50% | 89.1% | 6.7% | 81.1% | 80% |
| CELLO (d) | 97% | 99.2% | 97.2% | 57.1% | 94.1% | 56.7% (E) | 87.6% | 94% |
| TBPred (h) | N/A | N/A | N/A | N/A | 94.71% | 68.33% | 87.81% | 50% |
The test sets are: TS1 [175], TS2 [98] (NGP) = Cytoplasmic; TS3 [98] (PGP), TS4 [41] (*) = Secreted; TS5 [41] (*, a) = N-anchored; TS6 [157] (*), TS7 [151] (c) and [41] (b, *) = Lipid-anchored; TS8 [175] = Membrane; TS9a [86] (Test, RR) = Cytoplasmic; TS9b [86] (Test, RR) = Membrane; TS10a [28] (Test, Training) = Cytoplasmic; TS10b [28] (Test, Training) = Membrane; TS10c [28] (Test, Training) = Extracellular; TS10d [28] (Test, Training) = Cell wall; TS11a [95] (Training) = Cytoplasmic; TS11b [95] (Training) = Lipid-anchored; TS11c [95] (Training) = Membrane; TS11d [95] (Training) = Secreted.
Abbreviations: TS: test set; M: Membrane; E: Extracellular; Test: test set of this article; Training: training set of this article; NGP: negative training set containing only Gram-positive bacterial proteins; PGP: positive training set containing only Gram-positive bacterial proteins; RR: the proteins contain twin-arginine residues in the initial 35 residues.
a: 30 proteins in this set contained a putative SPI-cleavage site and were included in the LocateP training process.
b: After removing redundancy, 47 proteins were left in this set.
c: The set contains both Gram-positive and Gram-negative bacterial proteins.
d: Only the predictions with the highest score were taken.
e: 17 proteins in this test set were either proven to be secreted or found to be secreted via minor secretion pathways. LocateP focuses on the prediction of major secretion systems; these proteins were therefore predicted as "intracellular", meaning that no classical signal peptides were found in them.
f: Most of the proteins in this set are associated with the cell wall via non-covalent interactions such as protein-protein interactions.
g: 23 out of 50 proteins in this set were predicted as "N-anchored" proteins by LocateP, indicating that these proteins could be secreted via the Sec pathway but remain attached to the cytoplasmic membrane of the cell.
h: Among the support-vector machines involved in TBPred, only the best performance with the appropriate protein class was taken.
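Each cell of the comparison table reports recall on a test set whose members all share one expected location. A minimal sketch of that calculation follows, assuming recall is simply the fraction of test-set proteins assigned the expected label; the `recall` function and the example labels are hypothetical, and the broader matching marked (M) or (E) in the table is not modelled here.

```python
from typing import Sequence


def recall(expected: str, predictions: Sequence[str]) -> float:
    """Fraction of test-set proteins predicted with the expected label."""
    if not predictions:
        raise ValueError("test set is empty")
    hits = sum(1 for label in predictions if label == expected)
    return hits / len(predictions)


if __name__ == "__main__":
    # Invented predictions for a four-protein test set of secreted proteins.
    toy_predictions = ["Secreted", "N-anchored", "Secreted", "Secreted"]
    print(f"{recall('Secreted', toy_predictions):.1%}")
```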
Validation of LocateP predictions of transporter systems using the annotation in TransportDB
| 426 | 98.2% | |
| 571 | 97.5% | |
| 373 | 98.8% |
LocateP-predicted average distribution (%/(STDEV)) of proteins over different SCLs for Gram-positive bacteria
| Species | Actinobacteria | Bacillales | Clostridia | Lactobacillales | Mollicutes | |||
| Average genome size | 4098 | 3573 | 2969 | 2048 | 724 | |||
| N-anchored (Membrane) | 5.0/(1.1) | 5.7/(0.6) | 6.8/(1.0) | 5.8/(0.7) | 8.7/(3.1) | |||
| C-anchored (Membrane) | 0.3/(0.2) | 0.1/(0.1) | 0.2/(0.1) | 0.2/(0.1) | 0.3/(0.3) | |||
| Multi-transmembrane (Membrane) | 16.5/(2.6) | 20.3/(1.4) | 16.9/(2.8) | 17.9/(2.1) | 17.1/(2.3) | |||
| Intracellular (Cytoplasmic) | 74.3/(2.8) | 69.8/(2.2) | 73.2/(3.6) | 72.9/(2.0) | 71.4/(3.8) | |||
| Lipid anchored (Extracellular) | 2.2/(0.5) | 2.3/(0.4) | 1.6/(0.6) | 1.6/(0.5) | 1.9/(1.6) | |||
| Secreted (Extracellular) | 3.0/(0.9) | 2.1/(0.5) | 2.1/(0.5) | 1.8/(0.6) | 2.3/(1.3) | |||
| Secreted via minor pathways (Extracellular) | 0.1/(0.1) | 0.1/(0.1) | 0.1/(0.1) | 0.28/(0.2) | 0.04/(0.1) | |||
| LPxTG Cell-wall anchored (Cell wall) | 0.1/(0.2) | 0.4/(0.4) | 0.1/(0.2) | 0.6/(0.4) | 0.03/(0.1) | |||
| Membrane | 21.4/(2.7) | 26.2/(1.7) | 23.8/(3.4) | 23.8/(1.9) | 26.1/(3.9) | |||
| Cytoplasmic | 74.3/(2.8) | 69.8/(2.2) | 73.2/(3.6) | 72.9/(2.0) | 71.4/(3.7) | |||
| Extracellular | 5.4/(1.1) | 4.5/(0.7) | 3.8/(0.8) | 3.7/(0.8) | 4.2/(1.9) | |||
| Cell wall | 0.1/(0.2) | 0.4/(0.4) | 0.1/(0.2) | 0.6/(0.4) | 0.03/(0.1) | |||
| Organism | Spn | Lla | Sau | Lmo | Lpl | Cac | Bsu | STDEV |
| Total proteins | 2105 | 2321 | 2656 | 2846 | 3009 | 3672 | 4105 | |
| N-anchored (Membrane) | 4.5 | 5.9 | 6.0 | 4.9 | 5.2 | 6.9 | 6.2 | 0.8 |
| C-anchored (Membrane) | 0.1 | 0.1 | 0.1 | 0.4 | 0.2 | 0.2 | 0.1 | 0.1 |
| Multi-transmembrane (Membrane) | 17.9 | 18.4 | 19.5 | 19.1 | 20.5 | 18.1 | 20.7 | 1.1 |
| Intracellular (Cytoplasmic) | 74.7 | 72.8 | 70.5 | 71.1 | 70.2 | 71.3 | 69.1 | 1.9 |
| Lipid anchored (Extracellular) | 1.7 | 1.4 | 2.2 | 2.0 | 1.6 | 1.7 | 2.0 | 0.3 |
| Secreted (Extracellular) | 1.2 | 1.9 | 2.1 | 1.7 | 1.9 | 2.3 | 2.6 | 0.4 |
| Secreted via minor pathways (Extracellular) | 0.5 | 0.0 | 0.1 | 0.2 | 0.3 | 0.1 | 0.2 | 0.2 |
| LPxTG cell-wall anchored (Cell wall) | 0.5 | 0.5 | 0.5 | 1.5 | 1.1 | 0.1 | 0.1 | 0.5 |
| Membrane | 22.4 | 24.4 | 25.5 | 24.4 | 25.9 | 25.2 | 27.0 | 1.4 |
| Cytoplasmic | 74.7 | 72.8 | 70.5 | 71.1 | 70.2 | 71.3 | 69.1 | 1.9 |
| Extracellular | 3.4 | 3.3 | 4.4 | 4.0 | 3.8 | 4.1 | 4.8 | 0.5 |
| Cell wall | 0.5 | 0.5 | 0.5 | 1.5 | 1.1 | 0.1 | 0.1 | 0.5 |
Abbreviations: Spn: S. pneumoniae; Lla: L. lactis; Sau: S. aureus; Lmo: L. monocytogenes; Lpl: L. plantarum; Cac: C. acetobutylicum; Bsu: Bacillus subtilis
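The four summary rows (Membrane, Cytoplasmic, Extracellular, Cell wall) follow from the detailed rows via the category given in parentheses after each row label. The sketch below makes that roll-up explicit; `FINE_TO_BROAD` and `aggregate` are hypothetical names, and the input values are copied from the per-organism column with 4105 total proteins. Because the published percentages are rounded, a recomputed sum for other columns can differ from the printed summary row by about 0.1.

```python
# Mapping from detailed location to broad category, read from the
# parenthesised category in each table row label.
FINE_TO_BROAD = {
    "N-anchored": "Membrane",
    "C-anchored": "Membrane",
    "Multi-transmembrane": "Membrane",
    "Intracellular": "Cytoplasmic",
    "Lipid anchored": "Extracellular",
    "Secreted": "Extracellular",
    "Secreted via minor pathways": "Extracellular",
    "LPxTG cell-wall anchored": "Cell wall",
}


def aggregate(fine_percentages: dict) -> dict:
    """Sum detailed percentages into broad categories, rounded to 0.1."""
    broad = {}
    for label, pct in fine_percentages.items():
        key = FINE_TO_BROAD[label]
        broad[key] = broad.get(key, 0.0) + pct
    return {key: round(total, 1) for key, total in broad.items()}


if __name__ == "__main__":
    # Values from the table column with 4105 total proteins.
    column = {
        "N-anchored": 6.2,
        "C-anchored": 0.1,
        "Multi-transmembrane": 20.7,
        "Intracellular": 69.1,
        "Lipid anchored": 2.0,
        "Secreted": 2.6,
        "Secreted via minor pathways": 0.2,
        "LPxTG cell-wall anchored": 0.1,
    }
    print(aggregate(column))
```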