Emily Chia-Yu Su, Hua-Sheng Chiu, Allan Lo, Jenn-Kang Hwang, Ting-Yi Sung, Wen-Lian Hsu.
Abstract
BACKGROUND: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determining subcellular localization experimentally is time-consuming, so computational approaches are highly desirable. Extensive studies of localization prediction have led to several methods, including composition-based and homology-based methods. However, their performance can degrade significantly when no homologous sequences are detected. Moreover, methods that integrate various features can suffer from low coverage in high-throughput proteomic analyses because there is too little information to characterize unknown proteins.
Year: 2007 PMID: 17825110 PMCID: PMC2040162 DOI: 10.1186/1471-2105-8-330
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of proteins distributed in different localization sites in the data sets.
| Localization | Benchmark: PS1302 | Benchmark: PS1444 | Non-redundant: NR755 | Non-redundant: NR828 | Evaluation: EV90_high | Evaluation: EV153_low | Evaluation: EV243_all |
| Cytoplasm (CP) | 248 | 278 | 206 | 229 | 28 | 96 | 124 |
| Inner membrane (IM) | 268 | 309 | 182 | 205 | 26 | 26 | 52 |
| Periplasm (PP) | 244 | 276 | 147 | 161 | 13 | 11 | 24 |
| Outer membrane (OM) | 352 | 391 | 134 | 148 | 19 | 9 | 28 |
| Extracellular space (EC) | 190 | 190 | 86 | 85 | 4 | 11 | 15 |
| Total | 1302 | 1444 | 755 | 828 | 90 | 153 | 243 |
Comparison of different hybrid approaches using cross-validation for the benchmark data sets.
| PS1302 |
| Localization | PSL101 Acc. (%) | PSL101 MCC | PSLseq+PSL101 Acc. (%) | PSLseq+PSL101 MCC | PSLsse+PSL101 Acc. (%) | PSLsse+PSL101 MCC |
| CP | 97.2 (94.8) | 0.91 (0.89) | 96.4 (94.4) | 0.90 (0.89) | 95.6 (94.4) | 0.90 (0.90) |
| IM | 94.4 (92.9) | 0.95 (0.94) | 93.3 (91.8) | 0.95 (0.93) | 93.3 (91.8) | 0.94 (0.93) |
| PP | 87.7 (88.1) | 0.86 (0.84) | 88.9 (88.9) | 0.86 (0.85) | 91.4 (91.0) | 0.88 (0.88) |
| OM | 94.3 (93.8) | 0.94 (0.91) | 95.5 (95.7) | 0.96 (0.93) | 96.3 (96.9) | 0.96 (0.95) |
| EC | 87.9 (83.2) | 0.87 (0.84) | 89.5 (85.8) | 0.89 (0.87) | 90.0 (87.9) | 0.89 (0.89) |
| Overall | 92.7 (91.2) | - | 93.1 (91.9) | - | 93.7 (92.9) | - |
| PS1444 |
| Localization | PSL101 Acc. (%) | PSL101 MCC | PSLseq+PSL101 Acc. (%) | PSLseq+PSL101 MCC | PSLsse+PSL101 Acc. (%) | PSLsse+PSL101 MCC |
| CP | 96.0 (94.2) | 0.91 (0.90) | 94.6 (92.8) | 0.89 (0.88) | 95.0 (93.5) | 0.91 (0.90) |
| IM | 94.5 (92.6) | 0.95 (0.94) | 93.5 (91.6) | 0.94 (0.93) | 93.5 (91.6) | 0.94 (0.93) |
| PP | 85.1 (88.0) | 0.82 (0.83) | 87.0 (88.4) | 0.84 (0.83) | 90.2 (91.7) | 0.86 (0.87) |
| OM | 94.9 (93.9) | 0.93 (0.91) | 95.9 (95.7) | 0.95 (0.93) | 96.7 (96.4) | 0.96 (0.95) |
| EC | 82.6 (83.2) | 0.83 (0.85) | 87.9 (86.3) | 0.87 (0.88) | 87.4 (87.9) | 0.87 (0.89) |
| Overall | 91.6 (91.1) | - | 92.4 (91.6) | - | 93.2 (92.8) | - |
§ Values in parentheses give the performance obtained with a three-way data split procedure.
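Each method in these tables is reported as a pair of numbers per localization site. The sketch below shows how such a pair could be computed from one-vs-rest counts, assuming the first number is the percentage of a site's proteins that are predicted correctly and the second is the Matthews correlation coefficient (MCC). The tables do not name the metrics, so this reading is an assumption, and the function name and the counts in the usage lines are illustrative rather than taken from the paper.

```python
import math


def per_site_metrics(tp, fn, fp, tn):
    """Return (accuracy in %, MCC) for one localization site.

    tp: proteins of this site predicted as this site
    fn: proteins of this site predicted as another site
    fp: proteins of other sites predicted as this site
    tn: proteins of other sites predicted as other sites
    """
    # Assumed definition: share of this site's proteins that are recovered.
    accuracy = 100.0 * tp / (tp + fn)

    # Standard MCC; defined as 0 when any marginal total is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Illustrative counts only (hypothetical, not from the paper).
acc, mcc = per_site_metrics(tp=88, fn=8, fp=15, tn=42)
print(f"Acc = {acc:.1f}%, MCC = {mcc:.2f}")
```

Because MCC is computed per site from a two-by-two table, it has no natural single value for the whole data set, which would explain the dash in the Overall rows.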
Figure 1. Feature combinations derived from the PS1302 data set using cross-validation. Selected general and compartment-specific features are represented by filled circles and triangles, respectively.
Figure 2. Distribution of prediction accuracy as a function of secondary structure similarity. The blue and red lines show the distributions for PSL101 and PSLsse, respectively, using cross-validation on the PS1444 data set.
Performance comparison of different approaches using cross-validation for the benchmark data sets.
| PS1302 |
| Localization | HYBRID Acc. (%) | HYBRID MCC | CELLO Acc. (%) | CELLO MCC | PSORTb v.1.1 Acc. (%) | PSORTb v.1.1 MCC | PSLpred Acc. (%) | PSLpred MCC | P-CLASSIFIER Acc. (%) | P-CLASSIFIER MCC |
| CP |  |  | 90.7 | 0.85 | 69.4 | 0.79 | 90.7 | 0.86 | 94.6 | 0.85 |
| IM |  |  | 88.4 | 0.92 | 78.7 | 0.85 | 86.8 | 0.88 | 87.1 | 0.92 |
| PP |  | 0.88 | 86.9 | 0.80 | 57.6 | 0.69 | 90.3 |  | 85.9 | 0.81 |
| OM |  |  | 94.6 | 0.90 | 90.3 | 0.93 | 95.2 | 0.95 | 93.6 | 0.90 |
| EC | 90.0 |  | 78.9 | 0.82 | 70.0 | 0.79 |  | 0.84 | 86.0 | 0.89 |
| Overall |  | - | 88.9 | - | 74.8 | - | 91.2 | - | 89.8 | - |
| PS1444 |
| Localization | HYBRID Acc. (%) | HYBRID MCC | CELLO II Acc. (%) | CELLO II MCC | PSORTb v.2.0 Acc. (%) | PSORTb v.2.0 MCC | PSLpred Acc. (%) | PSLpred MCC | P-CLASSIFIER Acc. (%) | P-CLASSIFIER MCC |
| CP |  |  | 95.0 | 0.89 | 70.1 | 0.77 | - | - | - | - |
| IM |  |  | 90.0 | 0.91 | 92.6 | 0.92 | - | - | - | - |
| PP |  |  | 87.7 | 0.82 | 69.2 | 0.78 | - | - | - | - |
| OM |  |  | 92.8 | 0.90 | 94.9 | 0.95 | - | - | - | - |
| EC |  |  | 79.5 | 0.82 | 78.9 | 0.86 | - | - | - | - |
| Overall |  | - | 90.0 | - | 82.6 | - | - | - | - | - |
§ The best performance of overall and individual localization sites is underlined.
Predictive performance of different prediction methods for the evaluation data sets.
| EV153_low |
| Localization | HYBRID Acc. (%) | HYBRID MCC | CELLO II Acc. (%) | CELLO II MCC | PSORTb v.2.0 Acc. (%) | PSORTb v.2.0 MCC | PSLpred Acc. (%) | PSLpred MCC | P-CLASSIFIER Acc. (%) | P-CLASSIFIER MCC |
| CP |  | 0.67 |  |  | 63.5 | 0.61 | 89.6 | 0.59 |  | 0.66 |
| IM |  |  | 46.2 | 0.64 | 46.2 | 0.58 | 38.5 | 0.41 | 30.8 | 0.48 |
| PP | 45.5 | 0.25 |  |  | 0.0 | -0.03 | 54.5 | 0.34 |  |  |
| OM |  |  | 33.3 | 0.34 | 22.2 | 0.46 |  |  | 22.2 | 0.17 |
| EC | 27.3 | 0.43 |  | 0.50 | 9.1 | 0.29 |  |  | 27.3 | 0.33 |
| Overall |  | - |  | - | 49.7 | - | 72.5 | - | 71.9 | - |
| EV90_high |
| Localization | HYBRID Acc. (%) | HYBRID MCC | CELLO II Acc. (%) | CELLO II MCC | PSORTb v.2.0 Acc. (%) | PSORTb v.2.0 MCC | PSLpred Acc. (%) | PSLpred MCC | P-CLASSIFIER Acc. (%) | P-CLASSIFIER MCC |
| CP |  | 0.95 | 92.9 | 0.83 |  |  | 96.4 | 0.88 | 92.9 | 0.78 |
| IM | 96.2 | 0.97 | 73.1 | 0.75 |  |  | 92.3 | 0.92 | 80.8 | 0.84 |
| PP |  | 0.96 | 61.5 | 0.58 |  |  | 92.3 | 0.83 | 46.2 | 0.46 |
| OM |  |  | 73.7 | 0.67 |  |  | 68.4 | 0.79 | 73.7 | 0.69 |
| EC | 75.0 | 0.86 | 75.0 | 0.54 |  |  |  | 0.81 | 75.0 | 0.54 |
| Overall | 96.7 | - | 77.8 | - |  | - | 88.9 | - | 77.8 | - |
| EV243_all |
| Localization | HYBRID Acc. (%) | HYBRID MCC | CELLO II Acc. (%) | CELLO II MCC | PSORTb v.2.0 Acc. (%) | PSORTb v.2.0 MCC | PSLpred Acc. (%) | PSLpred MCC | P-CLASSIFIER Acc. (%) | P-CLASSIFIER MCC |
| CP |  |  | 91.9 | 0.77 | 71.8 | 0.73 | 91.1 | 0.72 | 91.9 | 0.73 |
| IM |  |  | 59.6 | 0.70 | 73.1 | 0.80 | 65.4 | 0.68 | 55.8 | 0.67 |
| PP |  | 0.56 | 70.8 | 0.51 | 54.2 | 0.57 |  |  | 62.5 | 0.45 |
| OM |  |  | 60.7 | 0.58 | 71.4 | 0.83 | 60.7 | 0.73 | 57.1 | 0.53 |
| EC | 40.0 | 0.57 | 53.3 | 0.50 | 33.3 | 0.57 |  |  | 40.0 | 0.39 |
| Overall |  | - | 77.0 | - | 67.9 | - | 78.6 | - | 74.1 | - |
§ The best performance of overall and individual localization sites is underlined. HYBRID is trained on the PS1444 data set.
Performance of non-redundant data sets.
| Localization | NR755 Acc. (%) | NR755 MCC | NR828 Acc. (%) | NR828 MCC |
| CP | 95.6 | 0.86 | 97.8 | 0.87 |
| IM | 88.5 | 0.88 | 88.8 | 0.90 |
| PP | 81.0 | 0.76 | 80.7 | 0.76 |
| OM | 85.1 | 0.84 | 83.8 | 0.82 |
| EC | 64.0 | 0.65 | 57.6 | 0.61 |
| Overall | 85.6 | - | 85.6 | - |
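As a consistency check, the overall accuracy in this table can be recovered from the per-site accuracies together with the NR755 and NR828 site counts given in the data set table. The sketch below assumes each per-site value is the percentage of that site's proteins predicted correctly, so the overall value is the size-weighted mean of the per-site values; the helper name is hypothetical.

```python
def overall_accuracy(counts, accuracy_pct):
    """Size-weighted mean of per-site accuracies, in percent."""
    total = sum(counts.values())
    # Expected number of correctly predicted proteins at each site.
    correct = sum(counts[site] * accuracy_pct[site] / 100.0 for site in counts)
    return 100.0 * correct / total


# Site counts from the data set table, accuracies from the table above.
nr755_counts = {"CP": 206, "IM": 182, "PP": 147, "OM": 134, "EC": 86}
nr755_acc = {"CP": 95.6, "IM": 88.5, "PP": 81.0, "OM": 85.1, "EC": 64.0}

nr828_counts = {"CP": 229, "IM": 205, "PP": 161, "OM": 148, "EC": 85}
nr828_acc = {"CP": 97.8, "IM": 88.8, "PP": 80.7, "OM": 83.8, "EC": 57.6}

# Both values round to 85.6, matching the Overall row.
print(round(overall_accuracy(nr755_counts, nr755_acc), 1))
print(round(overall_accuracy(nr828_counts, nr828_acc), 1))
```

The same check applied to the other tables is a quick way to confirm which column belongs to which method when a row looks misaligned.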
Figure 3. Diversity of translocation pathways in Gram-negative bacteria. Pathways 1, 2, 3, 4, 5, and 6 run from CP to IM, CP to PP, CP to EC, IM to PP, PP to OM, and PP to EC, respectively. Abbreviations: SRP, signal recognition particle; SecB, export-specific cytoplasmic chaperone; SecA, preprotein translocase SecA subunit; SecYEG, preprotein translocase complex; lep, leader peptidase; TAT, twin-arginine translocase; Gsp complex, general secretion pathway complex; Omp85, outer membrane protein assembly factor; ABC transporter, ATP-binding cassette transporter; TolC, type I secretion outer membrane protein. [Modified from Wickner and Schekman (2005) with permission]
Figure 4. System architecture of PSL101.