Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu.
Abstract
BACKGROUND: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive, so computational approaches are highly desirable. Most PSL prediction systems are designed for single-localized proteins, yet a significant number of eukaryotic proteins are known to localize to multiple subcellular organelles. Many studies have shown that proteins may simultaneously reside in, or move between, different cellular compartments and play different roles in different biological processes.
Year: 2009 PMID: 19958518 PMCID: PMC2788359 DOI: 10.1186/1471-2105-10-S15-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Two different transitivity relationships. (a) Protein A and protein B share sequence identity of 34%, and protein B and protein C share sequence identity of 27%, whereas protein A and protein C share sequence identity of only 12%. We infer the homologous relationship between protein A and protein C through protein B. (b) Protein A and protein C are aligned with protein B1 and protein B2, respectively. Because the peptide fragments of B1 and B2 enclosed by the rectangles are identical, the two corresponding peptide fragments of A and C are considered to be similar.
Figure 2. A real example of an HSP found by PSI-BLAST. Assuming a window size of 7, we define MYSKILL as a similar peptide of MYKKILY and treat it as an extended sequence feature of the query protein. The similarity level of MYSKILL and MYKKILY is 5 because there are five interchangeable residue pairs within that window. Multiple similar peptides can be generated from protein gi|2622094 (Sbjct) for the query protein.
A similar peptide example. Similar peptide: MYSKILL.
| Localization site | Native peptide | Similarity level | Frequency |
|---|---|---|---|
| Cytoplasm | MYKKILY | 5 | 21 |
| Nuclear | MYSSIIL | 4 | 12 |
| Cytoplasm, Extracellular | MYSSILY | 5 | 17 |
Three protein sources with known localization sites contain peptides that are aligned with, and similar to, the peptide MYSKILL in their HSPs. The similarity level is the number of amino acid pairs that are interchangeable between the native peptide sequence and the similar peptide sequence. The frequency is the number of times the two peptides are aligned in HSPs.
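The similarity level described above can be illustrated with a minimal sketch. It assumes the HSP is available as three equal-length strings in the usual BLAST pairwise layout (query line, midline, subject line), where the midline shows a letter for an identity, `+` for a positive-scoring substitution, and a space otherwise. The function name `similar_peptides`, the gap-skipping rule, and the hand-written midline are illustrative assumptions, not part of the published method.

```python
from typing import List, Tuple


def similar_peptides(query: str, midline: str, sbjct: str,
                     w: int = 7) -> List[Tuple[str, str, int]]:
    """Slide a window of size w along one HSP.

    Returns (query_peptide, similar_peptide, similarity_level) triples,
    where the similarity level counts midline positions that are not blank
    (identities or '+' substitutions).
    """
    if not (len(query) == len(midline) == len(sbjct)):
        raise ValueError("the three HSP lines must have equal length")
    results = []
    for i in range(len(query) - w + 1):
        q, m, s = query[i:i + w], midline[i:i + w], sbjct[i:i + w]
        if "-" in q or "-" in s:
            continue  # assumption: windows that contain gaps are skipped
        level = sum(1 for c in m if c != " ")
        results.append((q, s, level))
    return results


# Midline written by hand so that it matches the similarity level of 5
# quoted for MYKKILY / MYSKILL in the text.
print(similar_peptides("MYKKILY", "MY KIL ", "MYSKILL", w=7))
```

A longer HSP would yield one triple per window position, which is how several similar peptides can come from a single subject protein.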
Figure 3. The main algorithm of KnowPred.
Figure 4. The overall accuracies of KnowPred.
The overall accuracies (%) using different similarity level thresholds for window sizes 7 and 8.
| Similarity level threshold | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Window size 7 | 91.2 | 91.2 | 91.3 | 91.4 | 91.5 | 91.8 | 92.0 | 91.6 | - |
| Window size 8 | 91.4 | 91.4 | 91.4 | 91.4 | 91.4 | 91.5 | 91.6 | 91.7 | 90.9 |
The combination of window size w = 7 and similarity level threshold k = 6 provides the best accuracy. Some results appear identical because the overall accuracies are rounded to one decimal place.
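As a quick check of that choice, the accuracy grid can be searched for its maximum. This is only a sketch: it assumes the first row of values corresponds to w = 7 and the second to w = 8, and that the columns are thresholds k = 0, 1, 2, and so on. The dictionary name `accuracy` is illustrative.

```python
# Overall accuracy (%) indexed by window size w, then by threshold k.
# Row-to-window assignment is an assumption based on the table caption.
accuracy = {
    7: [91.2, 91.2, 91.3, 91.4, 91.5, 91.8, 92.0, 91.6],
    8: [91.4, 91.4, 91.4, 91.4, 91.4, 91.5, 91.6, 91.7, 90.9],
}

# Pick the (w, k, accuracy) triple with the highest accuracy.
best = max(
    ((w, k, acc) for w, row in accuracy.items() for k, acc in enumerate(row)),
    key=lambda t: t[2],
)
print(best)  # (7, 6, 92.0)
```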
Prediction performance of KnowPredsite, ngLOC, and Blast-hit
| | Methods | Top 1 | Top 2 | Top 3 | Top 4 |
|---|---|---|---|---|---|
| Single-localized | *KnowPredsite | 92.0 | 95.7 | 96.8 | 98.1 |
| | #KnowPredsite | 91.7 | 95.4 | 96.6 | 97.9 |
| | ngLOC | 88.8 | 92.2 | 94.5 | 96.3 |
| | Blast-hit | 86.0 | - | - | - |
| Multi-localized | *KnowPredsite | 90.8 | 96.4 | 98.2 | 98.9 |
| | #KnowPredsite | 90.1 | 96.1 | 98.1 | 98.9 |
| | ngLOC | 81.9 | 92.0 | 96.1 | 97.4 |
| | Blast-hit | 78.8 | - | - | - |
| Multi-localized | *KnowPredsite | 74.3 | 83.3 | 88.7 | |
| | #KnowPredsite | 72.1 | 82.2 | 87.5 | |
| | ngLOC | 59.7 | 73.8 | 83.2 | |
| | Blast-hit | 45.7 | - | - | |
*KnowPredsite represents the experiment result using leave-one-out cross validation; #KnowPredsite represents the experiment result using 10-fold cross validation.
Prediction performance of KnowPredsite for each site using precision, accuracy, and MCC.
| Site | Occurrence in the dataset (%) | Precision (%) | Accuracy (%) | MCC |
|---|---|---|---|---|
| CYT | 11.1 | 75.7 | 84.4 | 0.774 |
| CSK | 1.0 | 81.1 | 52.0 | 0.645 |
| END | 3.6 | 92.9 | 84.1 | 0.88 |
| EXC | 29.1 | 98.5 | 93.9 | 0.946 |
| GOL | 1.1 | 79.1 | 70.9 | 0.746 |
| LYS | 0.6 | 87.2 | 81.9 | 0.844 |
| MIT | 9.4 | 96.7 | 86.9 | 0.907 |
| NUC | 18.0 | 87.3 | 93.8 | 0.884 |
| PLA | 25.2 | 94.4 | 96.4 | 0.938 |
| POX | 0.8 | 87.3 | 85.1 | 0.861 |
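For reference, precision and the Matthews correlation coefficient for a single site can be computed from one-vs-rest confusion counts using their standard definitions. The sketch below is not taken from the paper; the function names and the example counts are hypothetical and do not reproduce any row of the table.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Fraction of proteins predicted at a site that truly localize there."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for one site (one-vs-rest)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a single site, for illustration only.
print(precision(tp=90, fp=10))
print(round(mcc(tp=90, fp=10, tn=880, fn=20), 3))
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive than precision alone to the large differences in site occurrence shown in the table.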
Figure 5. Matthews correlation coefficient (MCC).
Figure 6. MLCS analysis. A true positive is a multi-localized protein whose MLCS is above the threshold, and a true negative is a single-localized protein whose MLCS is below the threshold. We compare the ratio of true positives to the total number of multi-localized proteins and the ratio of true negatives to the total number of single-localized proteins.
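The two ratios in this analysis can be sketched as follows. The function name `mlcs_rates` and the threshold of 20 are hypothetical; the three MLCS values are those reported for EF1A2_RABIT, RASH_HUMAN, and MCA3_MOUSE in the prediction tables, with single or multi status read from the sites marked as correct. How MLCS itself is computed is not covered here.

```python
from typing import Sequence, Tuple


def mlcs_rates(scores: Sequence[float], is_multi: Sequence[bool],
               threshold: float) -> Tuple[float, float]:
    """Return (true positive ratio, true negative ratio) at a threshold.

    True positive: multi-localized protein with MLCS above the threshold.
    True negative: single-localized protein with MLCS below the threshold.
    A score exactly equal to the threshold counts as neither.
    """
    tp = sum(1 for s, m in zip(scores, is_multi) if m and s > threshold)
    tn = sum(1 for s, m in zip(scores, is_multi) if not m and s < threshold)
    n_multi = sum(1 for m in is_multi if m)
    n_single = len(is_multi) - n_multi
    tp_ratio = tp / n_multi if n_multi else 0.0
    tn_ratio = tn / n_single if n_single else 0.0
    return tp_ratio, tn_ratio


# EF1A2_RABIT (single), RASH_HUMAN (multi), MCA3_MOUSE (multi).
scores = [7.40, 36.24, 100.0]
is_multi = [False, True, True]
print(mlcs_rates(scores, is_multi, threshold=20.0))  # (1.0, 1.0)
```

Sweeping the threshold over a range of values and plotting both ratios gives the kind of trade-off curve the figure caption describes.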
Prediction result of EF1A2_RABIT.
| Query | CYT | CSK | END | EXC | GOL | LYS | MIT | NUC* | PLA | POX | MLCS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EF1A2_RABIT | 95.45 | 0 | 0 | 1.45 | 0 | 0 | 0.04 | 2.97 | 0.05 | 0 | 7.40 |
| Template | CYT | CSK | END | EXC | GOL | LYS | MIT | NUC | PLA | POX | SI |
| EF1A2_RAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.94 | 0 | 0 | 99.78 |
| EF1A_CHICK | 2.77 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92.22 |
| EF1A1_HUMAN | 2.75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92.22 |
| EF1A1_RAT | 2.75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92.22 |
| EF1A0_XENLA | 2.69 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.06 |
| EF1A_BRARE | 2.64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.06 |
| EF1A2_XENLA | 2.64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 88.79 |
| EF1A3_XENLA | 2.60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 88.55 |
*: correct answer; SI: sequence identity.
Prediction result of RASH_HUMAN.
| Query | CYT* | CSK | END | EXC | GOL* | LYS | MIT | NUC | PLA | POX | MLCS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RASH_HUMAN | 18.95 | 0.06 | 0.09 | 0.09 | 13.74 | 0.04 | 0.24 | 0.25 | 83.61 | 0 | 36.24 |
| Template | CYT | CSK | END | EXC | GOL | LYS | MIT | NUC | PLA | POX | SI |
| RASK_HUMAN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.88 | 0 | 86.32 |
| RASK_MOUSE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.81 | 0 | 86.32 |
| RASN_HUMAN | 13.19 | 0 | 0 | 0 | 13.19 | 0 | 0 | 0 | 0 | 0 | 85.19 |
| LET60_CAEEL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.55 | 0 | 74.07 |
| RAS3_RHIRA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.05 | 0 | 57.07 |
| RAS1_RHIRA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.88 | 0 | 58.62 |
| RAS2_RHIRA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.33 | 0 | 35.20 |
| RAS_LIMLI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.15 | 0 | 46.03 |
*: correct answer; SI: sequence identity.
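The RASH_HUMAN row also shows how a ranked list of sites relates to Top-k evaluation. The sketch below ranks the ten sites by their confidence scores and checks, for each k, whether at least one or all of the annotated sites (CYT and GOL) appear among the top k. The helper `top_k` and the two criteria are illustrative assumptions; the exact evaluation rule is not stated in this section.

```python
SITES = ["CYT", "CSK", "END", "EXC", "GOL", "LYS", "MIT", "NUC", "PLA", "POX"]

# Confidence scores of RASH_HUMAN, copied from the table above.
rash_scores = [18.95, 0.06, 0.09, 0.09, 13.74, 0.04, 0.24, 0.25, 83.61, 0.0]
true_sites = {"CYT", "GOL"}  # sites marked with '*' in the table


def top_k(scores, k):
    """Return the k sites with the highest confidence scores, best first."""
    ranked = sorted(zip(SITES, scores), key=lambda pair: pair[1], reverse=True)
    return [site for site, _ in ranked[:k]]


for k in (1, 2, 3):
    predicted = top_k(rash_scores, k)
    any_correct = bool(set(predicted) & true_sites)
    all_correct = true_sites <= set(predicted)
    print(k, predicted, any_correct, all_correct)
```

For this protein the top-ranked site is PLA, so neither annotated site is recovered at k = 1, one is recovered at k = 2, and both at k = 3.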
Prediction result of MCA3_MOUSE. Templates marked with '+' have the same localization annotation as the query protein.
| Query | CYT* | CSK | END | EXC | GOL | LYS | MIT | NUC* | PLA | POX | MLCS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MCA3_MOUSE | 95.46 | 0.3 | 0.27 | 0.36 | 0.2 | 0.01 | 1.13 | 93.59 | 1.82 | 0.22 | 100 |
| Template | CYT | CSK | END | EXC | GOL | LYS | MIT | NUC | PLA | POX | SI |
| MCA3_HUMAN+ | 89.16 | 0 | 0 | 0 | 0 | 0 | 0 | 89.16 | 0 | 0 | 88.51 |
| EF1G1_YEAST+ | 2.74 | 0 | 0 | 0 | 0 | 0 | 0 | 2.47 | 0 | 0 | 8.67 |
| EF1G2_YEAST | 0.49 | 0 | 0 | 0 | 0 | 0 | 0.49 | 0 | 0 | 0 | 8.50 |
| GSTA_PLEPL | 0.35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.86 |
| SYEC_YEAST | 0.16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.86 |
| CCNA1_MOUSE | 0 | 0.15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.36 |
| NU155_RAT+ | 0.14 | 0 | 0 | 0 | 0 | 0 | 0 | 0.14 | 0 | 0 | 3.17 |
| GCYB2_HUMAN | 0.14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.86 |
*: correct answer; SI: sequence identity.