Takeyuki Tamura, Tatsuya Akutsu.
Abstract
BACKGROUND: Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics: given the amino acid sequence of a protein as input, predict which part of the cell the protein is transported to. The problem is becoming more important because information on subcellular location helps in annotating proteins and genes, and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracy.
Year: 2007 PMID: 18047679 PMCID: PMC2220007 DOI: 10.1186/1471-2105-8-466
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Subcellular location prediction of proteins. The task is to predict which part of a cell a given protein is transported to, taking the amino acid sequence of the protein as input.
Figure 2. Overview of our method. The elements of our proposed kernel matrix are alignment scores between the block sequences of proteins. In the alignment, blocks are treated as if they were residues, and the score between two blocks is calculated from their feature vectors, which are based on amino acid frequency.
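As a rough illustration of the kernel matrix described in the caption, the sketch below fills a symmetric matrix with pairwise scores. The function names `kernel_matrix` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the block-sequence alignment score, which is not reproduced here.

```python
from typing import Callable, List, Sequence


def kernel_matrix(items: Sequence[str],
                  score: Callable[[str, str], float]) -> List[List[float]]:
    """Fill K[i][j] with score(items[i], items[j]).

    Assumes the score is symmetric, so only the upper triangle is computed.
    """
    n = len(items)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = score(items[i], items[j])
    return K


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in score: number of residue types shared by a and b."""
    return float(len(set(a) & set(b)))


K = kernel_matrix(["MKV", "MKL", "AAA"], toy_score)
```

In a real pipeline the stand-in would be replaced by the alignment score between block sequences; whether the resulting matrix is a valid (positive semidefinite) kernel is a separate question that this sketch does not address.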
Comparison of predictive accuracy for plant proteins in the TargetP data set. "cTP", "mTP", "SP", and "other" indicate proteins destined for chloroplast, mitochondria, secretory pathway, and other locations (nucleus and cytosol), respectively.
| Predictor | Location | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| Our method | cTP | 0.8440 | 0.9015 | 0.8507 | 0.8655 | 0.9096 |
| | mTP | 0.9348 | 0.9125 | 0.8735 | | |
| | SP | 0.9665 | 0.9319 | 0.9282 | | |
| | other | 0.8148 | 0.8684 | 0.8095 | | |
| Matsuda et al. (2005) | cTP | 0.7591 | 0.8474 | 0.7694 | 0.8244 | 0.8809 |
| | mTP | 0.9240 | 0.8652 | 0.8227 | | |
| | SP | 0.9219 | 0.9326 | 0.8983 | | |
| | other | 0.8210 | 0.8586 | 0.8070 | | |
| Kim et al. (2004) | cTP | 0.6874 | 0.8435 | 0.7222 | 0.7791 | 0.8479 |
| | mTP | 0.8970 | 0.8392 | 0.7773 | | |
| | SP | 0.8592 | 0.9428 | 0.8872 | | |
| | other | 0.8027 | 0.7549 | 0.7296 | | |
| Emanuelsson et al. (2000) | cTP | 0.85 | 0.69 | 0.72 | 0.79 | 0.853 |
| | mTP | 0.82 | 0.90 | 0.77 | | |
| | SP | 0.91 | 0.95 | 0.90 | | |
| | other | 0.85 | 0.78 | 0.77 | | |
Note: Parameters of Table 2 are used in our method.
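The columns of the accuracy tables can be computed from true and predicted labels roughly as follows. This is a sketch under stated assumptions, since the record does not define the measures: sensitivity is taken as TP/(TP+FN), specificity as TP/(TP+FP) (the convention commonly used in the TargetP literature), MCC as the standard Matthews correlation coefficient, and Average MCC as the unweighted mean over locations, which is consistent with the reported values (for example, the four MCC values for our method average to 0.8655). The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def per_class_metrics(true: List[str], pred: List[str], label: str) -> Dict[str, float]:
    """One-vs-rest metrics for a single location label."""
    pairs = list(zip(true, pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Assumption: "specificity" means TP / (TP + FP), as in the TargetP papers.
    specificity = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


def summarize(true: List[str], pred: List[str]) -> Tuple[Dict[str, Dict[str, float]], float, float]:
    """Return per-location metrics, average MCC, and overall accuracy."""
    labels = sorted(set(true))
    per = {lab: per_class_metrics(true, pred, lab) for lab in labels}
    average_mcc = sum(m["mcc"] for m in per.values()) / len(labels)
    accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
    return per, average_mcc, accuracy


per, average_mcc, accuracy = summarize(["cTP", "cTP", "SP", "SP"],
                                       ["cTP", "SP", "SP", "SP"])
```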
Parameters used in the computational experiment. "Posconstraint" and "negconstraint" are parameters used in GIST.
| Location | Gap penalty | γ | Posconstraint | Negconstraint | c | w |
| cTP | 0.4 | 2.5 | 1.000000 | 0.009853 | 10 | 20 |
| mTP | 0.4 | 2.5 | 0.151072 | 0.099261 | 10 | 20 |
| SP | 0.4 | 2.5 | 0.032387 | 0.008566 | 10 | 20 |
| Other | 0.4 | 2.5 | 0.039955 | 0.008566 | 10 | 20 |
Prediction accuracy for WoLF PSORT plant data sets. "chlo", "cyto", "cysk", "E.R.", "extr", "golg", "mito", "nucl", "pero", "plas", and "vacu" indicate chloroplast, cytosol, cytoskeleton, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondria, nuclear, peroxisome, plasma membrane, and vacuolar membrane, respectively.
| Predictor | Location | No. of sequences | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| Our method (2391 sequences) | chlo | 750 | 0.9733 | 0.8795 | 0.8893 | 0.8343 | 0.8703 |
| | cyto | 432 | 0.9329 | 0.8258 | 0.8491 | | |
| | cysk | 41 | 0.7561 | 1.0000 | 0.8677 | | |
| | E.R. | 69 | 0.7391 | 0.9623 | 0.8395 | | |
| | extr | 114 | 0.8246 | 0.7581 | 0.7797 | | |
| | golg | 29 | 0.6207 | 0.9474 | 0.7645 | | |
| | mito | 210 | 0.7857 | 0.9066 | 0.8303 | | |
| | nucl | 456 | 0.8509 | 0.8788 | 0.8329 | | |
| | pero | 52 | 0.7500 | 0.9512 | 0.8417 | | |
| | plas | 165 | 0.7212 | 0.9675 | 0.8255 | | |
| | vacu | 73 | 0.7671 | 0.9655 | 0.8569 | | |
| Horton et al. (2006) (2426 sequences) | | | | | | | 0.86 |
Note: In our method, gap penalty = 0.4, γ = 2.5, c = 10, and w = 20 are used; posconstraint and negconstraint are not specified.
Comparison of predictive accuracy for non-plant proteins in the TargetP data set. "mTP", "SP", and "other" indicate proteins destined for mitochondria, secretory pathway, and other locations (nucleus and cytosol), respectively.
| Predictor | Location | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| Our method | mTP | 0.7647 | 0.8990 | 0.8005 | 0.8452 | 0.9204 |
| | SP | 0.9157 | 0.9056 | 0.8790 | | |
| | other | 0.9576 | 0.9324 | 0.8560 | | |
| Matsuda et al. (2005) | mTP | 0.8303 | 0.8635 | 0.8228 | 0.8542 | 0.9229 |
| | SP | 0.9091 | 0.9118 | 0.8788 | | |
| | other | 0.9498 | 0.9409 | 0.8609 | | |
| Kim et al. (2004) | mTP | 0.6483 | 0.8569 | 0.7121 | 0.7635 | 0.8762 |
| | SP | 0.8530 | 0.8736 | 0.8158 | | |
| | other | 0.9389 | 0.8819 | 0.7626 | | |
| Emanuelsson et al. (2000) | mTP | 0.89 | 0.67 | 0.73 | 0.8233 | 0.900 |
| | SP | 0.96 | 0.92 | 0.92 | | |
| | other | 0.88 | 0.97 | 0.82 | | |
| Reczko et al. (2004) | mTP | 0.78 | 0.82 | 0.77 | 0.8333 | 0.913 |
| | SP | 0.93 | 0.91 | 0.89 | | |
| | other | 0.93 | 0.94 | 0.84 | | |
Note: In our method, gap penalty = 0.4, γ = 2.5, c = 10, and w = 20 are used; posconstraint and negconstraint are not specified.
Comparison of predictive accuracy for plant proteins in the TargetP data set with three types of feature vectors
| Predictor | Location | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| 20 feature vectors (amino acid composition) | cTP | 0.8440 | 0.9015 | 0.8507 | 0.8655 | 0.9096 |
| | mTP | 0.9348 | 0.9125 | 0.8735 | | |
| | SP | 0.9665 | 0.9319 | 0.9282 | | |
| | other | 0.8148 | 0.8684 | 0.8095 | | |
| 400 feature vectors (all adjacent amino acid composition) | cTP | 0.8227 | 0.4128 | 0.4806 | 0.6126 | 0.7372 |
| | mTP | 0.8342 | 0.8797 | 0.7686 | | |
| | SP | 0.8253 | 0.8880 | 0.8015 | | |
| | other | 0.2963 | 0.8000 | 0.4339 | | |
| 40 feature vectors | cTP | 0.8014 | 0.9040 | 0.8270 | 0.8594 | 0.9064 |
| | mTP | 0.9402 | 0.9081 | 0.8739 | | |
| | SP | 0.9628 | 0.9317 | 0.9255 | | |
| | other | 0.8272 | 0.8590 | 0.8110 | | |
Note: Parameters of Table 2 are used for all methods.
Comparison of predictive accuracy of our methods for plant proteins in the TargetP data set. "cTP", "mTP", "SP", and "other" indicate proteins destined for chloroplast, mitochondria, secretory pathway, and other locations (nucleus and cytosol), respectively.
| Predictor | Location | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| Our method (SVM with specifying posconstraint and negconstraint) | cTP | 0.8440 | 0.9015 | 0.8507 | 0.8655 | 0.9096 |
| | mTP | 0.9348 | 0.9125 | 0.8735 | | |
| | SP | 0.9665 | 0.9319 | 0.9282 | | |
| | other | 0.8148 | 0.8684 | 0.8095 | | |
| Our method (SVM without specifying posconstraint and negconstraint) | cTP | 0.7518 | 0.9550 | 0.8249 | 0.8525 | 0.8989 |
| | mTP | 0.9592 | 0.8506 | 0.8363 | | |
| | SP | 0.9554 | 0.9554 | 0.9375 | | |
| | other | 0.7963 | 0.8897 | 0.8111 | | |
| Nearest Neighbor Classifier (with our similarity measure) | cTP | 0.7447 | 0.7664 | 0.7131 | 0.7255 | 0.8128 |
| | mTP | 0.8750 | 0.7854 | 0.7098 | | |
| | SP | 0.8848 | 0.9015 | 0.8508 | | |
| | other | 0.6111 | 0.7674 | 0.6284 | | |
Note: In the first method, all parameters of Table 2 are used. In the second and third methods, gap penalty = 0.4, γ = 2.5, c = 10, and w = 20 are used, but posconstraint and negconstraint are not specified.
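The nearest neighbor baseline in the table above can be pictured as follows: a query is assigned the location of the training sequence that is most similar to it. This is a minimal sketch, not the authors' implementation; `nearest_neighbor_predict` and `shared_prefix` are hypothetical names, and `shared_prefix` is a toy stand-in for the alignment-based similarity measure.

```python
from typing import Callable, List, Optional, Tuple


def nearest_neighbor_predict(query: str,
                             train: List[Tuple[str, str]],
                             similarity: Callable[[str, str], float]) -> Optional[str]:
    """Return the label of the training sequence most similar to the query."""
    best_label: Optional[str] = None
    best_score = float("-inf")
    for sequence, label in train:
        s = similarity(query, sequence)
        if s > best_score:
            best_label, best_score = label, s
    return best_label


def shared_prefix(a: str, b: str) -> float:
    """Toy similarity (assumption for illustration): length of the common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return float(n)


train = [("MKTLL", "SP"), ("MASRR", "mTP")]
predicted = nearest_neighbor_predict("MKTAA", train, shared_prefix)
```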
Comparison of predictive accuracy of our method and Mitopred for plant proteins in the TargetP data set
| Predictor | Location | Sensitivity | Specificity | MCC |
| Our method | mTP | 0.9348 | 0.8958 | 0.8587 |
| Mitopred (60%) | mTP | 0.9348 | 0.8132 | 0.7816 |
| Mitopred (85%) | mTP | 0.5815 | 0.8807 | 0.5918 |
| Mitopred (99%) | mTP | 0.0598 | 0.8148 | 0.1492 |
Note: Parameters of Table 2 are used for our method.
Comparison of predictive accuracy of our method and MitPred for both plant and nonplant proteins in the TargetP data set
| Predictor | Location | Sensitivity | Specificity | MCC |
| Our method | plant mTP | 0.9348 | 0.8958 | 0.8587 |
| MitPred (SVM) | plant mTP | 0.8234 | 0.8584 | 0.7418 |
| MitPred (BLAST+SVM) | plant mTP | 0.9429 | 0.8422 | 0.8158 |
| MitPred (HMM+SVM) | plant mTP | 0.8668 | 0.8484 | 0.7644 |
| Our method | nonplant mTP | 0.7547 | 0.8333 | 0.7626 |
| MitPred (SVM) | nonplant mTP | 0.8194 | 0.8863 | 0.8302 |
| MitPred (BLAST+SVM) | nonplant mTP | 0.9380 | 0.9355 | 0.9268 |
| MitPred (HMM+SVM) | nonplant mTP | 0.8571 | 0.9408 | 0.8830 |
Note: For our method, the parameters shown in Table 2 of the manuscript were used, but "posconstraint" and "negconstraint" were not specified for nonplant mTP. For MitPred, we used the following default parameters: a threshold of 0.5 for the SVM method; an E-value cutoff of 1e-4 for the BLAST+SVM method; and an E-value of 1e-5 and an SVM threshold of 0.5 for the HMM-based Pfam search+SVM method.
Numbers of sequences in the non-redundant data sets at 25% and 70% obtained from the TargetP plant data sets by using BLASTclust
| Subcellular location | No. of sequences (plant) (non-redundant at 25%) | No. of sequences (plant) (non-redundant at 70%) |
| Chloroplast(cTP) | 95 | 116 |
| Mitochondrial(mTP) | 200 | 314 |
| Secretory(SP) | 119 | 232 |
| Nuclear+cytosolic(other) | 125 | 138 |
| Total | 539 | 800 |
Predictive accuracies of our method for datasets of Table 9
| Predictor | Location | Sensitivity | Specificity | MCC | Average MCC | Overall Accuracy |
| Our method (25%) | cTP | 0.7684 | 0.6577 | 0.6434 | 0.7068 | 0.7829 |
| | mTP | 0.7900 | 0.8404 | 0.7111 | | |
| | SP | 0.7311 | 0.9158 | 0.7751 | | |
| | other | 0.8320 | 0.7172 | 0.6976 | | |
| Our method (70%) | cTP | 0.6638 | 0.8750 | 0.7289 | 0.7937 | 0.8625 |
| | mTP | 0.9618 | 0.8118 | 0.8006 | | |
| | SP | 0.9095 | 0.9505 | 0.9020 | | |
| | other | 0.7246 | 0.8475 | 0.7431 | | |
Note: Parameters of Table 2 are used for all methods.
Figure 3. Global and local alignment. Global alignment is applied to the left ends of block sequences and local alignment is applied to the right ends. (A) The alignment detects the signal sequence, which is in the left part of the sequences. (B) The alignment does not detect the signal sequence, since the alignment score decreases if the block of the signal sequence is included.
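One way to realize "global at the left end, local at the right end" is a dynamic program in which every alignment must start at the first element of both sequences but may stop anywhere. The sketch below makes several assumptions that are not taken from the record: a linear gap penalty, a caller-supplied score function, and the hypothetical names `anchored_alignment_score` and `toy_score`.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def anchored_alignment_score(x: Sequence[T], y: Sequence[T],
                             score: Callable[[T, T], float],
                             gap: float = 0.4) -> float:
    """Best score of an alignment anchored at the left end with a free right end."""
    n, m = len(x), len(y)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Global at the left: leading gaps are penalized (assumed linear penalty).
    for i in range(1, n + 1):
        H[i][0] = -gap * i
    for j in range(1, m + 1):
        H[0][j] = -gap * j
    best = 0.0  # the empty alignment, H[0][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            # Local at the right: the alignment may end at any cell.
            best = max(best, H[i][j])
    return best


def toy_score(a: str, b: str) -> float:
    """Toy element score (assumption): +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


left = anchored_alignment_score("ABCXX", "ABCYY", toy_score)   # shared part at the left
right = anchored_alignment_score("XXABC", "YYABC", toy_score)  # shared part at the right
```

With these toy scores, a shared region at the left end is aligned at no cost, whereas a shared region further to the right can only be reached by paying for the elements before it, so `left` is larger than `right`. In the method itself the elements would be blocks and the score would come from their feature vectors.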
Figure 4. Web-based system. Our proposed method is implemented as a web system and is available at [29]. (A) Amino acid sequences in FASTA format are pasted into the web system as input. (B) Two candidate locations are shown for each input sequence, along with their scores.
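Reporting two candidate locations amounts to ranking the per-location scores and keeping the top two. The sketch below assumes one score per location is already available (for example, one classifier output per location); the function name `top_candidates` and the score values are hypothetical and are not output of the web system.

```python
from typing import Dict, List, Tuple


def top_candidates(scores: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k locations with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical scores for one input sequence.
example_scores = {"cTP": -0.3, "mTP": 1.2, "SP": 0.1, "other": -1.0}
candidates = top_candidates(example_scores)
```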
Figure 5. Alignment of block sequences. (A)(B)(C) Transformation of the input sequences into block sequences, where w = 10 is the length of the blocks and c = 5 is the distance between neighboring blocks. (D) Alignment of block sequences. Global alignment is applied to the left ends of block sequences, while local alignment is applied to the right ends.
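The transformation into block sequences in panels (A) to (C) can be sketched as a sliding window. The code assumes that c is the distance between the start positions of neighboring blocks, that trailing residues that do not fill a whole block are dropped, and that each block is summarized by its 20-dimensional amino acid composition; the names `to_blocks` and `composition` are hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def to_blocks(sequence: str, w: int = 10, c: int = 5) -> List[str]:
    """Cut a sequence into blocks of length w whose start positions are c apart.

    A sequence shorter than w yields a single, shorter block (an assumption).
    """
    last_start = max(len(sequence) - w, 0)
    return [sequence[i:i + w] for i in range(0, last_start + 1, c)]


def composition(block: str) -> List[float]:
    """Frequency of each of the 20 standard amino acids in a block."""
    n = len(block) or 1
    return [block.count(a) / n for a in AMINO_ACIDS]


blocks = to_blocks("ACDEFGHIKLMNPQRSTVWY", w=10, c=5)
# blocks == ["ACDEFGHIKL", "GHIKLMNPQR", "MNPQRSTVWY"]
features = [composition(b) for b in blocks]
```

With w = 10 and c = 5, neighboring blocks overlap by five residues, so a 20-residue sequence gives three blocks.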
Number of sequences in each subcellular location of TargetP plant and non-plant data sets
| Subcellular location | No. of sequences (plant) | No. of sequences (non-plant) |
| Chloroplast(cTP) | 141 | - |
| Mitochondrial(mTP) | 368 | 371 |
| Secretory(SP) | 269 | 715 |
| Nuclear+cytosolic(other) | 162 | 1652 |
| Total | 940 | 2738 |