Evangelia I Petsalaki, Pantelis G Bagos, Zoi I Litou, Stavros J Hamodrakas.
Abstract
The ability to predict the subcellular localization of a protein from its sequence is of great importance, as it provides information about the protein's function. We present a computational tool, PredSL, which utilizes neural networks, Markov chains, profile hidden Markov models, and scoring matrices to predict the subcellular localization of proteins in eukaryotic cells from the N-terminal amino acid sequence. It aims to classify proteins into five groups: chloroplast, thylakoid, mitochondrion, secretory pathway, and "other". When tested in a five-fold cross-validation procedure, PredSL demonstrates 86.7% and 87.1% overall accuracy for the plant and non-plant datasets, respectively. Compared with TargetP, which is the most widely used method to date, and LumenP, the results of PredSL are comparable in most cases. When tested on the experimentally verified proteins of the Saccharomyces cerevisiae genome, PredSL performs comparably to, if not better than, any available algorithm for the same task. Furthermore, PredSL is the only method capable of predicting these subcellular localizations that is available as a stand-alone application, through the URL: http://bioinformatics.biol.uoa.gr/PredSL/.
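The five-fold cross-validation procedure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the folds were constructed, so the random partitioning, the function name `five_fold_splits`, and the seed are assumptions made purely for illustration.

```python
import random


def five_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: items are shuffled at random and dealt into k folds;
    the source does not describe the actual partitioning scheme.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy dataset of 10 sequences.
splits = list(five_fold_splits(10))
```

Each sequence is held out exactly once, so every prediction counted in the cross-validated accuracy comes from a model that did not see that sequence during training.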
Year: 2006 PMID: 16689702 PMCID: PMC5054032 DOI: 10.1016/S1672-0229(06)60016-8
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Comparison of the Localization Performance of PredSL and TargetP Tested by Five-fold Cross-validation and Self-consistency*
| Predictor set | Overall accuracy (%) | Category | Sensitivity | Specificity | MCC |
|---|---|---|---|---|---|
| A. PredSL (cross-validation/self-consistency) | | | | | |
| Plant | 86.7/88.3 | cTP | 0.90/0.90 | 0.80/0.91 | 0.82/0.88 |
| | | mTP | 0.89/0.96 | 0.87/0.81 | 0.84/0.85 |
| | | SP | 0.96/0.95 | 0.92/0.89 | 0.91/0.90 |
| | | other | 0.70/0.72 | 0.86/0.95 | 0.74/0.79 |
| Non-plant | 87.1/92.5 | mTP | 0.88/0.91 | 0.84/0.96 | 0.80/90.5 |
| | | SP | 0.94/0.95 | 0.91/0.91 | 0.89/0.90 |
| | | other | 0.80/0.92 | 0.86/0.91 | 0.77/0.88 |
| B. TargetP (cross-validation/self-consistency) | | | | | |
| Plant | 85.3/90.4 | cTP | 0.85/0.96 | 0.69/0.78 | 0.72/0.84 |
| | | mTP | 0.82/0.88 | 0.90/0.95 | 0.77/0.88 |
| | | SP | 0.91/0.94 | 0.95/0.94 | 0.90/0.92 |
| | | other | 0.85/0.85 | 0.78/0.87 | 0.77/0.84 |
| Non-plant | 90.0/92.2 | mTP | 0.89/0.92 | 0.67/0.72 | 0.73/0.79 |
| | | SP | 0.96/0.97 | 0.92/0.95 | 0.92/0.95 |
| | | other | 0.88/0.90 | 0.97/0.97 | 0.82/0.86 |
The PredSL datasets for plant proteins consist of 249 chloroplast sequences, 250 mitochondrial sequences, and 253 secreted protein sequences, whereas for non-plant proteins the datasets consist of 366 mitochondrial sequences and 370 secreted protein sequences. The TargetP datasets for plant proteins consist of 141 chloroplast sequences, 368 mitochondrial sequences, and 269 secreted protein sequences, whereas for non-plant proteins the datasets consist of 371 mitochondrial sequences and 715 secreted protein sequences.
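The per-category columns of the table above (sensitivity, specificity, MCC) can be derived from confusion counts, as in the hedged sketch below. The table does not define its columns, so the definitions here are assumptions: sensitivity is taken as TP/(TP+FN), and specificity as TP/(TP+FP), a convention used in some targeting-peptide prediction papers (elsewhere specificity means TN/(TN+FP)). The function name and the counts are made up for illustration and are not taken from the study.

```python
from math import sqrt


def class_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) for one category.

    Assumed definitions (not stated in the source):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
      MCC         = Matthews correlation coefficient
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical confusion counts for a single category.
sens, spec, mcc = class_metrics(tp=90, fp=10, tn=180, fn=20)
```

Because MCC combines all four confusion counts, it lies between -1 and 1, which makes it a useful single-number summary when category sizes differ.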
Comparison of PredSL and LumenP on the Prediction Accuracy of the lTP and Its Cleavage Site*
| Dataset | lTP prediction (%), PredSL | lTP prediction (%), LumenP | Cleavage site prediction (±2 residues) (%), PredSL | Cleavage site prediction (±2 residues) (%), LumenP |
|---|---|---|---|---|
| Complete set (259 sequences) | 91.9 | 88.8 | 88.7 | 75.1 |
| Reduced set (40% similarity) | 85.3 | 82.4 | 82.4 | 70.1 |
| Cross-validation (259 sequences) | 87.3 | 87.0 | 66.1 | 54.8 |
Tested by five-fold cross-validation on the complete dataset (259 sequences) and on a dataset redundancy-reduced at 40% similarity with cd-hit (109 sequences).
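The "±2 residues" criterion for cleavage site prediction can be read as a tolerance window around the annotated site. The following minimal sketch assumes that reading; the function name, the tolerance parameter, and the positions are hypothetical and are not data from the study.

```python
def cleavage_site_accuracy(predicted, observed, tolerance=2):
    """Percentage of sequences whose predicted cleavage site lies within
    `tolerance` residues of the annotated site.

    Assumption: a prediction counts as correct when
    |predicted - observed| <= tolerance.
    """
    if not observed or len(predicted) != len(observed):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(abs(p - o) <= tolerance for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)


# Hypothetical cleavage positions for four sequences.
predicted = [61, 75, 80, 52]
observed = [60, 78, 80, 50]
accuracy = cleavage_site_accuracy(predicted, observed)
```

In this toy case the second prediction misses by three residues, so three of the four sequences count as correct.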
Comparison of PredSL with Three Other Prediction Tools on the Subcellular Localization Prediction of the S. cerevisiae Proteins
| Subcellular localization | PredSL | iPSORT | TargetP | Predotar |
|---|---|---|---|---|
| Total (unknown=2,164) | 2,621/3,554 (73.7%) | 2,404/3,554 (67.6%) | 2,616/3,554 (71.6%) | 2,475/3,554 (69.6%) |
| Mitochondrion | 301/499 (60.3%) | 304/499 (60.9%) | 306/499 (61.3%) | 315/499 (63.1%) |
| Secretory pathway | 224/850 (26.4%) | 206/850 (24.2%) | 257/850 (26.4%) | 204/850 (24.0%) |
| Other | 2,096/2,305 (90.9%) | 1,894/2,305 (82.2%) | 2,053/2,305 (89.1%) | 1,956/2,305 (84.9%) |
Prediction Performance of PredSL on Various Completely Sequenced Genomes from Different Taxonomic Groups
| Group | Organism | cTP | lTP | mTP | SP | other | Total |
|---|---|---|---|---|---|---|---|
| Plants | | 4,596 (13.8%) | 184 (5.5%) | 5,326 (16.0%) | 8,191 (24.6%) | 15,160 (45.6%) | 33,273 |
| | | 813 (7.1%) | 21 (0.2%) | 1,406 (12.3%) | 2,493 (21.9%) | 6,686 (58.7%) | 11,397 |
| Fungi | | – | – | 586 (11.8%) | 511 (10.3%) | 3,890 (78.0%) | 4,987 |
| | | – | – | 566 (13.0%) | 635 (14.5%) | 3,167 (72.5%) | 4,368 |
| | | – | – | 1,314 (11.8%) | 2,364 (21.3%) | 7,431 (66.9%) | 11,109 |
| Mammals | | – | – | 2,727 (9.4%) | 7,221 (24.8%) | 19,159 (65.8%) | 29,107 |
| | | – | – | 3,353 (9.4%) | 9,099 (25.5%) | 23,274 (65.2%) | 35,726 |
| Protozoa | | – | – | 314 (6.2%) | 706 (14.0%) | 4,029 (79.8%) | 5,049 |
| | | – | – | 644 (4.7%) | 2,158 (15.8%) | 10,878 (79.5%) | 13,680 |
| Arthropoda | | – | – | 1,949 (10.5%) | 3,973 (21.5%) | 12,576 (68.0%) | 18,498 |
| | | – | – | 1,627 (7.6%) | 2,648 (12.4%) | 17,027 (79.9%) | 21,302 |
| Fishes | | – | – | 1,383 (8.7%) | 2,370 (15.0%) | 12,099 (76.3%) | 15,852 |
| | | – | – | 1,617 (4.3%) | 4,478 (12.0%) | 31,344 (83.7%) | 37,439 |
We list the total number of sequences classified in each subcellular location and their percentage in the whole genome.
Fig. 1 A schematic overview of the PredSL algorithm.