KiYoung Lee, Dae-Won Kim, DoKyun Na, Kwang H Lee, Doheon Lee.
Abstract
Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient method for predicting protein subcellular localization is needed for large-scale genome analysis. From a machine learning point of view, a protein localization dataset has several characteristics: it has many classes (there are more than 10 localizations in a cell), it is multi-label (a protein may occur in several different subcellular locations), and it is highly imbalanced (the number of proteins in each localization differs remarkably). Although many previous works have addressed the prediction of protein subcellular localization, none of them tackles all of these characteristics effectively at the same time. Thus, a new computational method for protein localization is needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD), which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for more precise evaluation of a protein localization predictor. In experiments on various datasets derived from the experiments of Huh et al. (2003), the proposed PLPD method represents a different approach that might play a complementary role to existing methods, such as the Nearest Neighbor method and the covariant discriminant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted the localizations of 138 proteins whose subcellular localizations could not be clearly observed in the experiments of Huh et al. (2003).
Year: 2006 PMID: 16966337 PMCID: PMC1636404 DOI: 10.1093/nar/gkl638
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The number of proteins in the original Huh et al. (2003) dataset and the three training datasets
| Subcellular localization | Huh | Dataset-I | Dataset-II | Dataset-III |
|---|---|---|---|---|
| 1. Actin | 32 | 32 | 27 | 27 |
| 2. Bud | 25 | 25 | 19 | 19 |
| 3. Bud neck | 61 | 61 | 48 | 48 |
| 4. Cell periphery | 130 | 130 | 98 | 98 |
| 5. Cytoplasm | 1782 | 1782 | 1472 | 1472 |
| 6. Early golgi | 54 | 54 | 39 | 39 |
| 7. Endosome | 46 | 46 | 37 | 37 |
| 8. ER | 292 | 292 | 207 | 207 |
| 9. ER to golgi | 6 | 6 | 5 | 5 |
| 10. Golgi | 41 | 41 | 30 | 30 |
| 11. Late golgi | 44 | 44 | 38 | 38 |
| 12. Lipid particle | 23 | 23 | 15 | 15 |
| 13. Microtubule | 20 | 20 | 17 | 17 |
| 14. Mitochondrion | 522 | 522 | 389 | 389 |
| 15. Nuclear periphery | 60 | 60 | 38 | 38 |
| 16. Nucleolus | 164 | 164 | 122 | 122 |
| 17. Nucleus | 1446 | 1446 | 1126 | 1126 |
| 18. Peroxisome | 21 | 21 | 16 | 16 |
| 19. Punctate composite | 137 | 137 | 91 | 91 |
| 20. Spindle pole | 61 | 61 | 27 | 27 |
| 21. Vacuolar membrane | 58 | 58 | 47 | 47 |
| 22. Vacuole | 159 | 159 | 124 | 124 |
| Total number of classified proteins | 5184 | 5184 | 4032 | 4032 |
| Total number of different proteins | 3914 | 3914 | 3017 | 3017 |
| Dimension of features | | 9620D | 2372D | 11992D |
| Coverage | | 100% | 77.08% | 77.08% |
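The multi-label and imbalanced character of the data can be read directly off the table above. The sketch below recomputes three summary figures from it: the average number of localizations per protein, the ratio between the largest and smallest class, and the coverage of Dataset-II relative to the original dataset. The variable names are illustrative and not from the paper; the counts are copied from the table.

```python
# Per-localization protein counts, copied from the "Huh" column (rows 1..22).
huh_counts = [32, 25, 61, 130, 1782, 54, 46, 292, 6, 41, 44,
              23, 20, 522, 60, 164, 1446, 21, 137, 61, 58, 159]

# A protein is counted once per localization it occupies, so this sum
# counts label assignments rather than distinct proteins.
classified = sum(huh_counts)

distinct_huh = 3914   # distinct proteins in Huh et al. / Dataset-I
distinct_ds2 = 3017   # distinct proteins in Dataset-II and Dataset-III

labels_per_protein = classified / distinct_huh          # multi-label degree
imbalance_ratio = max(huh_counts) / min(huh_counts)     # Cytoplasm vs. ER to golgi
coverage_ds2 = 100 * distinct_ds2 / distinct_huh        # percent of proteins retained

print(classified)                     # 5184
print(round(labels_per_protein, 2))   # 1.32
print(round(imbalance_ratio))         # 297
print(round(coverage_ds2, 2))         # 77.08
```

On these numbers, each protein carries about 1.32 localization labels on average, and the largest class is roughly 297 times the size of the smallest.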
Previous research on the prediction of protein subcellular localization
| Author(s) | Algorithm | Feature | # of Classes | Multi-label | Imbalance |
|---|---|---|---|---|---|
| Nakashima and Nishikawa | Scoring System | AA | 2 | x | x |
| Cedano | LD | AA | 5 | x | x |
| Reinhardt and Hubbard | ANN | AA | 3, 4 | x | x |
| Chou and Elrod | CD | AA | 12 | x | x |
| Yuan | Markov Model | AA | 3, 4 | x | x |
| Nakai and Horton | k-NN | Signal Motif | 11 | x | x |
| Emanuelsson | Neural network | Signal Motif | 4 | x | x |
| Drawid | CD | Gene Expression Pattern | 8 | x | x |
| Drawid and Gerstein | BN | Signal Motif, HDEL motif | 5, 6 | x | x |
| Cai | SVMs | AA | 12 | x | x |
| Chou | Augmented CD | AA | 5, 7, 12 | x | x |
| Hua and Sun | SVMs | AA | 4 | x | x |
| Chou and Cai | SVMs | SBASE-FunD | 12 | x | x |
| Nair and Rost | NN | functional annotation | 10 | x | x |
| Cai | SVMs | SBASE-FunD | 5 | x | x |
| Chou and Cai | NN | GO | 3, 4 | x | x |
| Chou and Cai | LD | PseAA | 14 | x | x |
| Pan | Augmented CD | PseAA | 12 | x | x |
| Park and Kanehisa | SVMs | AA | 12 | x | x |
| Zhou and Doctor | CD | AA | 4 | x | x |
| Gardy | SVMs | AA | 5 | x | x |
| Huang and Li | fuzzy k-NN | PairAA | 4, 11 | x | x |
| Guo | p-ANN | AA | 8 | x | x |
| Bhasin and Raghava | SVMs | AA, PairAA | 4 | x | x |
| Chou and Cai | NN | GO | 22 | Considering | x |
LD: Least Distance algorithm.
BN: Bayesian Network.
ANN: Artificial Neural Network.
CD: Covariant Discriminant algorithm.
NN: Nearest Neighbor.
HMM: Hidden Markov Model.
SVMs: Support Vector Machines.
p-ANN: probabilistic Artificial Neural Network.
AA: amino acid composition.
PairAA: amino acid pair composition.
PseAA: pseudo amino acid composition.
SOC: sequence-order correlation.
SBASE-FunD: functional domain composition using SBASE.
GO: gene ontology.
InterPro-FunD: InterPro functional domain composition.
FunDC: functional domain composition.
Here, ‘x’ means ‘Not Considering’.
Figure 1. A typical solution of C-SVDD when outliers are permitted. C-SVDD finds the minimum-volume hypersphere that includes most of the target data. The data points that lie on or outside the boundary are called support vectors, and they fully determine the compact boundary. Thus, the data points marked with solid circles are the support vectors on the boundary, and those marked with dotted circles are the support vectors that are outliers.
Figure 2. A typical solution of C-SVDD when negative data are available. C-SVDD finds the minimum-volume hypersphere that includes most of the target data and, at the same time, excludes most of the negative data.
Figure 3. A typical solution of C-SVDD when a kernel function is used. By using a kernel function, C-SVDD finds a more flexible solution in a high-dimensional feature space without explicitly mapping the data into that space; the resulting boundary is flexible in the original input space, as shown in the left figure.
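For reference, the three captions above correspond to three standard forms of support vector data description as formulated by Tax and Duin. The block below is a sketch that assumes C-SVDD denotes this conventional formulation. The symbols are notation chosen here rather than taken from the captions: a is the center, R the radius, the xi terms are slack variables, C, C_1 and C_2 are trade-off constants, and K is the kernel function.

```latex
% (1) Outliers permitted: smallest hypersphere around the target data x_i
\min_{R,\,a,\,\xi}\; R^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad \lVert x_i - a \rVert^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0

% (2) Negative data x_l available: targets inside, negatives outside
\min_{R,\,a,\,\xi}\; R^2 + C_1 \sum_i \xi_i + C_2 \sum_l \xi_l
\quad \text{s.t.} \quad \lVert x_i - a \rVert^2 \le R^2 + \xi_i,\;\;
\lVert x_l - a \rVert^2 \ge R^2 - \xi_l,\;\; \xi_i,\, \xi_l \ge 0

% (3) Kernel form: dual of (1) with inner products replaced by K
\max_{\alpha}\; \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i = 1
```

In the dual, the points with nonzero alpha_i are the support vectors: those with 0 < alpha_i < C lie on the boundary and those with alpha_i = C lie outside it, which matches the two kinds of support vectors described for Figure 1.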
Prediction performance (%) of ISort and PLPD on Dataset-I
| Measure | Subcellular localization | ISort method (%) | PLPD method (%) |
|---|---|---|---|
| Measure-I | | 65.14 | 73.89 |
| Measure-II | | 35.91 | 53.09 |
| Measure-III | 1. Actin | 0.00 | 0.00 |
| | 2. Bud | 0.00 | 0.00 |
| | 3. Bud neck | 0.00 | 0.00 |
| | 4. Cell periphery | 0.00 | 0.00 |
| | 5. Cytoplasm | 77.55 | 99.89 |
| | 6. Early golgi | 5.56 | 5.56 |
| | 7. Endosome | 6.52 | 6.52 |
| | 8. ER | 0.00 | 0.00 |
| | 9. ER to golgi | 16.67 | 16.67 |
| | 10. Golgi | 0.00 | 4.88 |
| | 11. Late golgi | 2.27 | 6.82 |
| | 12. Lipid particle | 8.70 | 21.74 |
| | 13. Microtubule | 10.00 | 10.00 |
| | 14. Mitochondrion | 0.77 | 1.15 |
| | 15. Nuclear periphery | 0.00 | 0.00 |
| | 16. Nucleolus | 6.10 | 1.83 |
| | 17. Nucleus | 83.82 | 65.08 |
| | 18. Peroxisome | 0.00 | 9.52 |
| | 19. Punctate composite | 0.73 | 0.00 |
| | 20. Spindle pole | 0.00 | 0.00 |
| | 21. Vacuolar membrane | 1.72 | 1.72 |
| | 22. Vacuole | 0.00 | 0.00 |
| | Average | 10.02 | 11.43 |
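The "Average" row of the Measure-III values is consistent with an unweighted mean over the 22 localizations, so a small class such as "ER to golgi" weighs as much as "Cytoplasm". The following sketch checks this for the Dataset-I values. The helper name `macro_average` is illustrative, and reading "Average" as an unweighted mean is an inference from the numbers, not something stated in the table.

```python
# Per-localization Measure-III values (%) on Dataset-I, in table order 1..22.
isort = [0.00, 0.00, 0.00, 0.00, 77.55, 5.56, 6.52, 0.00, 16.67, 0.00, 2.27,
         8.70, 10.00, 0.77, 0.00, 6.10, 83.82, 0.00, 0.73, 0.00, 1.72, 0.00]
plpd = [0.00, 0.00, 0.00, 0.00, 99.89, 5.56, 6.52, 0.00, 16.67, 4.88, 6.82,
        21.74, 10.00, 1.15, 0.00, 1.83, 65.08, 9.52, 0.00, 0.00, 1.72, 0.00]


def macro_average(per_class):
    """Unweighted mean: every localization counts equally, whatever its size."""
    return sum(per_class) / len(per_class)


print(round(macro_average(isort), 2))  # 10.02, matches the ISort "Average" cell
print(round(macro_average(plpd), 2))   # 11.43, matches the PLPD "Average" cell
```

Because every class has equal weight, this average rewards a predictor for recovering rare localizations and is not dominated by the two largest classes.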
Prediction performance (%) of ISort and PLPD on Dataset-III
| Measure | Subcellular localization | ISort method (%) | PLPD method (%) |
|---|---|---|---|
| Measure-I | | 75.90 | 83.49 |
| Measure-II | | 49.16 | 57.24 |
| Measure-III | 1. Actin | 3.70 | 18.52 |
| | 2. Bud | 5.26 | 57.89 |
| | 3. Bud neck | 4.17 | 33.33 |
| | 4. Cell periphery | 30.61 | 33.67 |
| | 5. Cytoplasm | 73.30 | 77.04 |
| | 6. Early golgi | 12.82 | 25.64 |
| | 7. Endosome | 24.32 | 35.14 |
| | 8. ER | 22.71 | 21.26 |
| | 9. ER to golgi | 0.00 | 60.00 |
| | 10. Golgi | 13.33 | 43.33 |
| | 11. Late golgi | 10.53 | 26.32 |
| | 12. Lipid particle | 0.00 | 53.33 |
| | 13. Microtubule | 29.41 | 52.94 |
| | 14. Mitochondrion | 27.51 | 33.16 |
| | 15. Nuclear periphery | 15.79 | 23.68 |
| | 16. Nucleolus | 28.69 | 31.97 |
| | 17. Nucleus | 51.07 | 66.96 |
| | 18. Peroxisome | 6.25 | 68.75 |
| | 19. Punctate composite | 10.99 | 12.09 |
| | 20. Spindle pole | 29.63 | 40.74 |
| | 21. Vacuolar membrane | 4.26 | 14.89 |
| | 22. Vacuole | 41.13 | 22.58 |
| | Average | 20.25 | 38.78 |
Performance (%) of the proposed PLPD on Dataset-I, Dataset-II, and Dataset-III with regard to Measure-III only
| Measure | Dataset-I | Dataset-II | Dataset-III |
|---|---|---|---|
| Measure-III Average | 19.10% | 44.61% | 46.50% |
Prediction performance (%) of ISort and PLPD on Dataset-II
| Measure | Subcellular localization | ISort method (%) | PLPD method (%) |
|---|---|---|---|
| Measure-I | | 69.94 | 82.40 |
| Measure-II | | 44.27 | 56.32 |
| Measure-III | 1. Actin | 0.00 | 22.22 |
| | 2. Bud | 5.26 | 42.11 |
| | 3. Bud neck | 10.42 | 31.25 |
| | 4. Cell periphery | 41.84 | 26.53 |
| | 5. Cytoplasm | 71.13 | 84.58 |
| | 6. Early golgi | 7.69 | 25.64 |
| | 7. Endosome | 10.81 | 21.62 |
| | 8. ER | 16.43 | 12.56 |
| | 9. ER to golgi | 0.00 | 60.00 |
| | 10. Golgi | 6.67 | 33.33 |
| | 11. Late golgi | 5.26 | 21.05 |
| | 12. Lipid particle | 0.00 | 46.67 |
| | 13. Microtubule | 29.41 | 41.18 |
| | 14. Mitochondrion | 18.77 | 16.20 |
| | 15. Nuclear periphery | 0.00 | 13.16 |
| | 16. Nucleolus | 17.21 | 26.23 |
| | 17. Nucleus | 44.32 | 65.36 |
| | 18. Peroxisome | 0.00 | 50.00 |
| | 19. Punctate composite | 6.59 | 7.69 |
| | 20. Spindle pole | 14.81 | 33.33 |
| | 21. Vacuolar membrane | 0.00 | 10.64 |
| | 22. Vacuole | 30.65 | 21.77 |
| | Average | 15.33 | 32.41 |
Prediction results for the first 35 proteins whose localizations were not clearly observed in the experiments.
| Protein | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YAL029C | I | II | III | |||||||||||||||||||
| YAL053W | III | I | I | III | II | II | II | II | II | |||||||||||||
| YAR019C | III | I | II | |||||||||||||||||||
| YAR027W | II | I | II | II | III | II | ||||||||||||||||
| YAR028W | II | I | II | II | III | I | ||||||||||||||||
| YBL034C | I | III | II | III | II | III | ||||||||||||||||
| YBL067C | I | III | III | I | II | |||||||||||||||||
| YBL105C | III | II | II | I | III | |||||||||||||||||
| YBR007C | I | II | II | III | ||||||||||||||||||
| YBR072W | III | I | II | I | III | |||||||||||||||||
| YBR168W | III | I | III | II | ||||||||||||||||||
| YBR200W | II | III | I | II | ||||||||||||||||||
| YBR235W | I | III | III | II | ||||||||||||||||||
| YBR260C | I | II | II | III | ||||||||||||||||||
| YCL024W | III | II | III | I | ||||||||||||||||||
| YCR021C | I | II | III | III | III | |||||||||||||||||
| YCR023C | I | III | III | III | II | |||||||||||||||||
| YCR037C | I | III | III | II | III | III | III | |||||||||||||||
| YDL025C | II | I | III | I | III | |||||||||||||||||
| YDL171C | III | II | I | II | III | I | III | III | ||||||||||||||
| YDL203C | I | III | II | II | III | |||||||||||||||||
| YDL238C | I | III | II | I | III | |||||||||||||||||
| YDL248W | II | I | II | II | III | I | ||||||||||||||||
| YDR069C | I | III | II | III | ||||||||||||||||||
| YDR072C | III | I | II | II | III | I | ||||||||||||||||
| YDR089W | I | I | III | I | II | II | II | |||||||||||||||
| YDR093W | II | I | II | III | III | III | ||||||||||||||||
| YDR164C | II | I | II | II | III | II | I | |||||||||||||||
| YDR181C | I | II | I | III | ||||||||||||||||||
| YDR182W | I | III | II | I | I | |||||||||||||||||
| YDR251W | II | III | I | II | III | III | ||||||||||||||||
| YDR261C | I | III | III | III | III | II | II | |||||||||||||||
| YDR276C | III | II | II | II | III | II | I | I | II | |||||||||||||
| YDR309C | I | III | I | I | II | III | ||||||||||||||||
| YDR313C | III | II | II | III | I |
The numbers in the first row indicate the specific localizations listed in Table 2.