Jhih-Siang Lai, Cheng-Wei Cheng, Ting-Yi Sung, Wen-Lian Hsu.
Abstract
Secretome analysis is important in pathogen studies. A fundamental and convenient way to identify secreted proteins is to first predict signal peptides, which are essential for protein secretion. However, signal peptides are highly complex functional sequences that are easily confused with transmembrane domains, and such confusion hampers the discovery of secreted proteins. Transmembrane proteins are important drug targets, but very few transmembrane protein structures have been determined experimentally; hence, prediction of their structures is essential. In structure prediction, researchers do not make assumptions about the source organism, so there is a need for a general signal peptide predictor.
To improve signal peptide prediction without prior knowledge of the associated organisms, we present a machine-learning method called SVMSignal. It uses biochemical properties as features, together with features acquired from a novel encoding, to capture biochemical profile patterns and learn the structures of signal peptides directly.
We tested SVMSignal and five popular methods on two benchmark datasets, from the SPdb and UniProt/Swiss-Prot databases, respectively. Although SVMSignal was trained on an old dataset, it performed well, and the results demonstrate that learning the structures of signal peptides directly is a promising approach. We also used SVMSignal to analyze the proteomes in the entire HAMAP microbial database. Finally, we conducted a comparative secretome analysis of seven tuberculosis-related strains selected from the HAMAP database. We identified ten potential secreted proteins, two of which are drug resistant and four of which are potential transmembrane proteins.
SVMSignal is publicly available at http://bio-cluster.iis.sinica.edu.tw/SVMSignal. It provides user-friendly interfaces and visualizations, and the prediction results are available for download.
Year: 2012 PMID: 22496884 PMCID: PMC3322152 DOI: 10.1371/journal.pone.0035018
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
MCCs (%) of various predictors on classification of proteins with and without signal peptides.
| Species | Data set | Cleavable | SVMSignal | Phobius | Philius | RPSP | SignalP NN+Euk | SignalP HMM+Euk | SignalP NN,G+ | SignalP HMM,G+ | SignalP NN,G− | SignalP HMM,G− | PrediSi Euk | PrediSi G+ | PrediSi G− |
| Mixed | SP/TM | 12.83% | | 65.70% | 64.36% | 33.23% | 42.22% | 57.12% | 42.88% | 28.42% | 46.54% | 37.98% | 20.96% | 15.53% | 18.73% |
| Mixed | SP/(TM+G) | 32.73% | | 82.94% | 76.93% | 82.21% | 81.03% | 78.65% | 63.08% | 71.21% | 64.90% | 75.75% | 77.07% | 54.51% | 58.35% |
| Euk | SP/TM | 12.68% | 58.11% | 54.91% | 46.60% | 39.09% | | 48.79% | 22.97% | 19.58% | 29.38% | 26.71% | 28.30% | 6.24% | 6.38% |
| Euk | SP/(TM+G) | 35.48% | | 83.13% | 75.34% | 86.53% | 84.54% | 78.15% | 56.47% | 63.87% | 58.80% | 70.72% | 80.31% | 45.17% | 50.18% |
| Bac | SP/TM | 9.81% | 72.14% | 71.80% | | 27.79% | 26.11% | 59.73% | 65.18% | 59.78% | 64.14% | 66.21% | 11.51% | 40.06% | 47.02% |
| Bac | SP/(TM+G) | 29.57% | 84.72% | 83.92% | 80.76% | 74.45% | 75.34% | 80.57% | 79.17% | | 79.95% | 85.89% | 71.13% | 72.97% | 73.81% |
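The table above reports Matthews correlation coefficients (MCCs). As a reminder of how this metric is computed from a binary confusion matrix, here is a minimal Python sketch. The function name `mcc` and all counts are hypothetical and are not taken from the benchmark datasets.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: signal peptide proteins are the positive class,
# transmembrane proteins the negative class.
perfect = mcc(tp=90, fp=0, tn=10, fn=0)        # every protein classified correctly
all_positive = mcc(tp=90, fp=10, tn=0, fn=0)   # everything called "signal peptide"
balanced = mcc(tp=40, fp=10, tn=40, fn=10)     # 80% correct in each class

print(perfect, all_positive, balanced)
```

In this toy setting, a predictor that labels every protein as a signal peptide protein is 90% accurate but has an MCC of 0, which is one reason accuracy and MCC can rank predictors differently on imbalanced sets.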
Accuracies (%) of various predictors on classification of proteins with and without signal peptides.
| Species | Data set | Cleavable | SVMSignal | Phobius | Philius | RPSP | SignalP NN+Euk | SignalP HMM+Euk | SignalP NN,G+ | SignalP HMM,G+ | SignalP NN,G− | SignalP HMM,G− | PrediSi Euk | PrediSi G+ | PrediSi G− |
| Mixed | SP/TM | 88.56% | 91.14% | | 90.85% | 78.61% | 87.56% | 89.85% | 82.79% | 68.36% | 85.07% | 77.11% | 81.29% | 56.82% | 63.98% |
| Mixed | SP/(TM+G) | 56.43% | | 95.71% | 93.80% | 96.04% | 95.18% | 94.38% | 89.28% | 93.77% | 89.64% | 94.51% | 94.30% | 90.12% | 90.40% |
| Euk | SP/TM | 94.05% | 92.02% | 92.33% | 90.77% | 82.94% | | 92.18% | 79.81% | 58.69% | 83.10% | 70.89% | 86.85% | 46.64% | 55.40% |
| Euk | SP/(TM+G) | 57.71% | | 95.26% | 92.48% | 96.64% | 95.61% | 93.41% | 85.46% | 91.53% | 85.99% | 92.78% | 94.61% | 87.14% | 87.54% |
| Bac | SP/TM | 79.82% | 90.21% | 90.50% | | 71.51% | 76.56% | 86.05% | 88.43% | 86.94% | 88.43% | 89.32% | 72.11% | 75.67% | 80.42% |
| Bac | SP/(TM+G) | 53.66% | 96.38% | 96.10% | 95.20% | 94.39% | 94.01% | 95.34% | 94.72% | | 94.91% | 96.62% | 92.86% | 93.86% | 93.67% |
Sensitivities (%) of various predictors on the signal peptide protein benchmark datasets.
| Species | Dataset | Cleavable | SVMSignal | Phobius | Philius | RPSP | SignalP NN+Euk | SignalP HMM+Euk | SignalP NN,G+ | SignalP HMM,G+ | SignalP NN,G− | SignalP HMM,G− | PrediSi Euk | PrediSi G+ | PrediSi G− |
| Mixed | SPDB | 99.24% | 95.27% | 93.90% | 94.05% | 85.82% | | 96.04% | 80.34% | 58.38% | 84.76% | 67.53% | 91.46% | 49.54% | 58.38% |
| Mixed | SP | 97.67% | 91.12% | | 91.12% | 79.91% | 91.12% | 91.34% | 83.68% | 67.26% | 86.24% | 76.80% | 86.13% | 55.27% | 63.71% |
| Euk | SPDB | 100.00% | 95.97% | 94.43% | 94.43% | 85.99% | | 97.50% | 76.97% | 50.48% | 82.15% | 61.61% | 92.71% | 42.80% | 52.98% |
| Euk | SP | 98.52% | 91.76% | 92.59% | 91.43% | 82.54% | | 93.08% | 80.72% | 57.17% | 83.86% | 70.02% | 88.47% | 45.47% | 55.19% |
| Bac | SPDB | 96.55% | 93.97% | 92.24% | 93.10% | 84.48% | 86.21% | 90.52% | 95.69% | 93.10% | | 94.83% | 87.07% | 80.17% | 82.76% |
| Bac | SP | 95.99% | 90.51% | 91.61% | 90.88% | 74.45% | 84.31% | 87.96% | 90.51% | 90.15% | 91.24% | | 81.75% | 76.64% | 82.48% |
Specificities (%) of various predictors on the non-signal peptide protein benchmark datasets.
| Species | Data set | Cleavable | SVMSignal | Phobius | Philius | RPSP | SignalP NN+Euk | SignalP HMM+Euk | SignalP NN,G+ | SignalP HMM,G+ | SignalP NN,G− | SignalP HMM,G− | PrediSi Euk | PrediSi G+ | PrediSi G− |
| Mixed | TM | 9.62% | | 87.50% | 88.46% | 67.31% | 56.73% | 76.92% | 75.00% | 77.88% | 75.00% | 79.81% | 39.42% | 70.19% | 66.35% |
| Mixed | TM+G | 50.09% | 97.58% | 96.28% | 94.21% | | 95.80% | 94.85% | 90.13% | 97.85% | 90.17% | 97.24% | 95.56% | 95.48% | 94.50% |
| Euk | TM | 9.38% | | 87.50% | 78.13% | 90.63% | 87.50% | 75.00% | 62.50% | 87.50% | 68.75% | 87.50% | 56.25% | 68.75% | 59.38% |
| Euk | TM+G | 50.38% | 97.69% | 95.74% | 92.67% | | 95.83% | 93.47% | 86.31% | 97.69% | 86.37% | 96.87% | 95.71% | 94.62% | 93.35% |
| Bac | TM | 9.52% | 88.89% | 85.71% | | 58.73% | 42.86% | 77.78% | 79.37% | 73.02% | 76.19% | 76.19% | 30.16% | 71.43% | 71.43% |
| Bac | TM+G | 47.32% | 97.26% | 96.77% | 95.84% | 97.37% | 95.46% | 96.44% | 95.35% | | 95.46% | 97.26% | 94.53% | 96.44% | 95.35% |
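Sensitivity is measured on proteins that carry a signal peptide, and specificity on proteins that do not. When an accuracy figure is computed on the union of one positive and one negative dataset, it equals the size-weighted mean of the two. The Python sketch below illustrates that relationship. All counts and function names are hypothetical and do not correspond to the benchmark datasets.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true signal peptide proteins predicted as such."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-signal-peptide proteins correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all proteins classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical benchmark: 200 signal peptide proteins, 50 transmembrane proteins.
tp, fn = 180, 20
tn, fp = 40, 10

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, fp, tn, fn)

# Accuracy on the combined set is the size-weighted mean of the two rates.
n_pos, n_neg = tp + fn, tn + fp
weighted = (sens * n_pos + spec * n_neg) / (n_pos + n_neg)

print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Because the positive set is larger here, accuracy sits closer to sensitivity than to specificity, so a weak specificity on a small negative set can be masked in the accuracy figure.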
Basic statistics in the selected seven proteomes of Mycobacterium tuberculosis (*) and Mycobacterium bovis (**).
| code | name | # proteins | # SP proteins (%) | SP mean length | # SPTM proteins (%) | # SPTM single | # SPTM multi |
| MYCTU | *Mycobacterium tuberculosis | 3950 | 345 (8.73%) | 28.4 | 71 (1.80%) | 23 | 48 |
| MYCTA | *strain ATCC 25177/H37Ra | 3990 | 345 (8.65%) | 28.2 | 69 (1.73%) | 22 | 47 |
| MYCTF | *strain F11 | 3905 | 334 (8.55%) | 29.0 | 66 (1.69%) | 22 | 44 |
| MYCTK | *strain KZN 1435/MDR | 4024 | 335 (8.33%) | 28.7 | 64 (1.59%) | 22 | 42 |
| MYCBO | **Mycobacterium bovis | 3910 | 334 (8.54%) | 28.2 | 72 (1.84%) | 22 | 50 |
| MYCBP | **strain BCG/Pasteur 1173P2 | 3891 | 336 (8.64%) | 28.1 | 73 (1.88%) | 25 | 48 |
| MYCBT | **strain BCG/Tokyo 172/ATCC 35737/TMC 1019 | 3906 | 340 (8.70%) | 28.2 | 73 (1.87%) | 25 | 48 |
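The percentages in this table appear to be each count divided by the proteome size, and the single-pass and multi-pass counts appear to sum to the SPTM total. The short Python check below recomputes both from the values printed in the table. The variable and function names are illustrative only.

```python
# Columns copied from the table above:
# code, proteins, SP proteins, SPTM proteins, SPTM single, SPTM multi.
ROWS = [
    ("MYCTU", 3950, 345, 71, 23, 48),
    ("MYCTA", 3990, 345, 69, 22, 47),
    ("MYCTF", 3905, 334, 66, 22, 44),
    ("MYCTK", 4024, 335, 64, 22, 42),
    ("MYCBO", 3910, 334, 72, 22, 50),
    ("MYCBP", 3891, 336, 73, 25, 48),
    ("MYCBT", 3906, 340, 73, 25, 48),
]


def percent(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100.0 * part / total, 2)


recomputed = {
    code: {
        "sp_pct": percent(sp, total),
        "sptm_pct": percent(sptm, total),
        "sptm_split_ok": single + multi == sptm,
    }
    for code, total, sp, sptm, single, multi in ROWS
}

for code, stats in recomputed.items():
    print(code, stats)
```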
List of unique proteins and their similar non-signal peptide proteins.
| Unique proteins | | | | | Similar proteins | | | | | |
| Species | ID | gene | TMHMM results | Seq. length (after cleavage) | Species | ID | gene | SVMSignal results | Seq. length | type (similar protein) |
| MYCTU | O06239 | | multi-pass TM | 282 (261) | MYCTK | C6DPS4 | TBMG_01845 | non SP | 276 | residues missing |
| MYCTA | A5U3R8 | MRA_1910 | non TM | 343 (283) | MYCTU | O07733 | | non SP | 359 | residues addition |
| MYCTF | A5WU15 | TBFG_13829 | multi-pass TM | 1082 (1051) | MYCTU | P72030 | | non SP | 1098 | residues addition |
| MYCTK | C6DL40 | TBMG_03351 | non TM | 481 (457) | MYCTU | O53355 | | non SP | 493 | residues missing |
| MYCTK | C6DUE9 | TBMG_02759 | non TM | 362 (329) | MYCTU | O06291 | | non SP | 528 | residues addition |
| MYCTK | C6DWG6 | TBMG_03065 | multi-pass TM | 262 (239) | MYCTU | O05916 | | non SP | 428 | residues addition |
| MYCTK | C6DSD8 | TBMG_00349 | non TM | 138 (120) | MYCTU | O06296 | Rv0345 | non SP | 136 | residues missing |
| MYCTK | C6DTY5 | TBMG_00617 | non TM | 111 (89) | MYCTA | A5TZZ2 | MRA_0618 | non SP | 139 | residues addition |
| MYCTK | C6DQS5 | TBMG_03974 | non TM | 53 (28) | - | - | - | | | |
| MYCBO | Q7U0W0 | Mb1023 | multi-pass TM | 358 (330) | MYCBT | C1ALY5 | JTY_1023 | non SP | 358 | residue replacement |
| MYCBP | - | - | - | - | - | - | - | | | |
| MYCBT | - | - | - | - | - | - | - | - | | |
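If the value in parentheses in the unique-protein length column is the sequence length after the predicted signal peptide is cleaved, which is an assumption about that column, then the difference between the two numbers gives the predicted signal peptide length. A small Python sketch that parses those cells follows. The function name is hypothetical.

```python
import re

# Length cells for the unique proteins, copied from the table above,
# keyed by protein ID.
LENGTH_CELLS = {
    "O06239": "282 (261)",
    "A5U3R8": "343 (283)",
    "A5WU15": "1082 (1051)",
    "C6DL40": "481 (457)",
    "C6DUE9": "362 (329)",
    "C6DWG6": "262 (239)",
    "C6DSD8": "138 (120)",
    "C6DTY5": "111 (89)",
    "C6DQS5": "53 (28)",
    "Q7U0W0": "358 (330)",
}

CELL_PATTERN = re.compile(r"\s*(\d+)\s*\((\d+)\)\s*")


def signal_peptide_length(cell: str) -> int:
    """Full length minus post-cleavage length, for a cell like '282 (261)'."""
    match = CELL_PATTERN.fullmatch(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    full, cleaved = (int(g) for g in match.groups())
    return full - cleaved


sp_lengths = {pid: signal_peptide_length(c) for pid, c in LENGTH_CELLS.items()}
print(sp_lengths)
```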
Figure 1. The hierarchical architecture of SVMSignal.
Figure 2. An example, O95994 (AGR2_HUMAN), of biochemical feature profiles in a 100-residue N-terminal subsequence predicted to contain a signal peptide.
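One generic way to build a biochemical feature profile over an N-terminal subsequence is a sliding-window average of a per-residue property. The Python sketch below uses the Kyte-Doolittle hydropathy scale as an example property. The window size, the function name, and the toy sequence are assumptions made for illustration, and this is not the encoding used by SVMSignal.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, n_terminal: int = 100,
                       window: int = 7) -> List[float]:
    """Sliding-window mean hydropathy over the first `n_terminal` residues.

    Raises KeyError if the region contains a non-standard residue code.
    """
    region = sequence[:n_terminal].upper()
    values = [KYTE_DOOLITTLE[aa] for aa in region]
    if len(values) < window:
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up sequence: a hydrophobic stretch followed by charged residues.
toy_sequence = "MKKLLLLLLLAAAAASADEKRDEKR"
profile = hydropathy_profile(toy_sequence, window=5)
print([round(v, 2) for v in profile])
```

Signal peptides typically contain a hydrophobic core, so in a profile like this the core would show up as a run of high window means near the N-terminus. Transmembrane helices are hydrophobic as well, which is consistent with the abstract's point that the two are easily confused.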