Shibiao Wan, Man-Wai Mak, Sun-Yuan Kung.
Abstract
BACKGROUND: Although many computational methods have been developed to predict protein subcellular localization, most are limited to single-location proteins. Multi-location proteins are either not considered or assumed not to exist. However, proteins with multiple locations are particularly interesting because they may have special biological functions that are essential to both basic research and drug discovery.
Year: 2012 PMID: 23130999 PMCID: PMC3582598 DOI: 10.1186/1471-2105-13-290
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Explicit GO terms for the virus dataset
| Class | Essential GO term | Child terms (relationship) | No. of Terms |
|---|---|---|---|
| Viral capsid | GO:0019028 | GO:0046727 (Part of), GO:0046798 (Part of), GO:0046806 (Part of), GO:0019013 (Part of), GO:0019029 (Is a), GO:0019030 (Is a) | 7 |
| Host cell membrane | GO:0033644 | GO:0044155 (Part of), GO:0044084 (Part of), GO:0044385 (Part of), GO:0044160 (Is a), GO:0044162 (Is a), GO:0085037 (Is a), GO:0085042 (Is a), GO:0085039 (Is a), GO:0020002 (Is a), GO:0044167 (Is a), GO:0044173 (Is a), GO:0044175 (Is a), GO:0044178 (Is a), GO:0044384 (Is a), GO:0033645 (Is a), GO:0044231 (Is a), GO:0044188 (Is a), GO:0044191 (Is a), GO:0044200 (Is a) | 20 |
| Host ER* | GO:0044165 | GO:0044166 (Part of), GO:0044167 (Part of), GO:0044168 (Is a), GO:0044170 (Is a) | 5 |
| Host cytoplasm | GO:0030430 | GO:0033655 (Part of) | 2 |
| Host nucleus | GO:0042025 | GO:0044094 (Part of) | 2 |
| Secreted | GO:0005576 | GO:0048046 (Is a), GO:0044421 (Part of) | 3 |
Explicit GO terms include essential GO terms and their child terms. The definition of essential GO terms can be found in [32]. Only the ‘is a’ and ‘part of’ relationships appear here because only cellular component GO terms are analyzed. Relationship: the relationship between a child term and its parent essential GO term.
No. of Terms: the total number of explicit GO terms in a particular class.
*: Host endoplasmic reticulum.
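The counting rule in the footnote (one essential term plus its child terms) can be checked against the table. The following is a minimal sketch, not the authors' code: the dictionary layout and the names `EXPLICIT_TERMS` and `count_explicit_terms` are illustrative assumptions, and only three of the six classes are transcribed.

```python
# Essential GO term and child terms for three classes, transcribed from the table.
# The data layout and all names here are illustrative assumptions.
EXPLICIT_TERMS = {
    "Host cytoplasm": ("GO:0030430", {"GO:0033655": "part_of"}),
    "Host nucleus": ("GO:0042025", {"GO:0044094": "part_of"}),
    "Secreted": ("GO:0005576", {"GO:0048046": "is_a", "GO:0044421": "part_of"}),
}


def count_explicit_terms(location):
    """Return the number of explicit GO terms: the essential term plus its children."""
    essential, children = EXPLICIT_TERMS[location]
    return 1 + len(children)


def explicit_term_set(location):
    """Return the set of all explicit GO identifiers for one class."""
    essential, children = EXPLICIT_TERMS[location]
    return {essential} | set(children)
```

Under this layout, `count_explicit_terms("Secreted")` gives 3, which matches the "No. of Terms" column for that class.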
Breakdown of the (a) virus protein dataset and (b) plant protein dataset
(a) Virus dataset
| Label | Subcellular location | Number of proteins |
|---|---|---|
| 1 | Viral capsid | 8 |
| 2 | Host cell membrane | 33 |
| 3 | Host endoplasmic reticulum | 20 |
| 4 | Host cytoplasm | 87 |
| 5 | Host nucleus | 84 |
| 6 | Secreted | 20 |
| Total number of locative proteins | 252 | |
| Total number of actual proteins | 207 | |

(b) Plant dataset
| Label | Subcellular location | Number of proteins |
|---|---|---|
| 1 | Cell membrane | 56 |
| 2 | Cell wall | 32 |
| 3 | Chloroplast | 286 |
| 4 | Cytoplasm | 182 |
| 5 | Endoplasmic reticulum | 42 |
| 6 | Extracellular | 22 |
| 7 | Golgi apparatus | 21 |
| 8 | Mitochondrion | 150 |
| 9 | Nucleus | 152 |
| 10 | Peroxisome | 21 |
| 11 | Plastid | 39 |
| 12 | Vacuole | 52 |
| Total number of locative proteins | 1055 | |
| Total number of actual proteins | 978 | |
Comparing mGOASVM with state-of-the-art multi-label predictors based on leave-one-out cross validation (LOOCV) using (a) the virus dataset and (b) the plant dataset
(a) Virus dataset
| 1 | Viral capsid | 8/8 = 100.0% | 8/8 = 100.0% | 8/8 = 100.0% | 8/8 = 100.0% |
| 2 | Host cell membrane | 19/33 = 57.6% | 27/33 = 81.8% | 25/33 = 75.8% | 32/33 = 97.0% |
| 3 | Host ER | 13/20 = 65.0% | 15/20 = 75.0% | 15/20 = 75.0% | 17/20 = 85.0% |
| 4 | Host cytoplasm | 52/87 = 59.8% | 86/87 = 98.8% | 64/87 = 73.6% | 85/87 = 97.7% |
| 5 | Host nucleus | 51/84 = 60.7% | 54/84 = 65.1% | 70/84 = 83.3% | 82/84 = 97.6% |
| 6 | Secreted | 9/20 = 45.0% | 13/20 = 65.0% | 15/20 = 75.0% | 20/20 = 100.0% |
| Overall Locative Accuracy | 152/252 = 60.3% | 203/252 = 80.7% | 197/252 = 78.2% | 244/252 = 96.8% | |
| Overall Actual Accuracy | – | – | 155/207 = 74.8% | 184/207 = 88.9% | |

(b) Plant dataset
| 1 | Cell membrane | 24/56 = 42.9% | 39/56 = 69.6% | 53/56 = 94.6% | |
| 2 | Cell wall | 8/32 = 25.0% | 19/32 = 59.4% | 27/32 = 84.4% | |
| 3 | Chloroplast | 248/286 = 86.7% | 252/286 = 88.1% | 272/286 = 95.1% | |
| 4 | Cytoplasm | 72/182 = 39.6% | 114/182 = 62.6% | 174/182 = 95.6% | |
| 5 | Endoplasmic reticulum | 17/42 = 40.5% | 21/42 = 50.0% | 38/42 = 90.5% | |
| 6 | Extracellular | 3/22 = 13.6% | 2/22 = 9.1% | 22/22 = 100.0% | |
| 7 | Golgi apparatus | 6/21 = 28.6% | 16/21 = 76.2% | 19/21 = 90.5% | |
| 8 | Mitochondrion | 114/150 = 76.0% | 112/150 = 74.7% | 150/150 = 100.0% | |
| 9 | Nucleus | 136/152 = 89.5% | 140/152 = 92.1% | 151/152 = 99.3% | |
| 10 | Peroxisome | 14/21 = 66.7% | 6/21 = 28.6% | 21/21 = 100.0% | |
| 11 | Plastid | 4/39 = 10.3% | 7/39 = 17.9% | 39/39 = 100.0% | |
| 12 | Vacuole | 26/52 = 50.0% | 28/52 = 53.8% | 49/52 = 94.2% | |
| Overall Locative Accuracy | 672/1055 = 63.7% | 756/1055 = 71.7% | 1015/1055 = 96.2% | | |
| Overall Actual Accuracy | – | 666/978 = 68.1% | 855/978 = 87.4% | | |
“–” means the corresponding references do not provide the overall actual accuracy. KNN-SVM: the KNN-SVM ensemble classifier proposed in [39]. Host ER: Host endoplasmic reticulum.
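The two overall measures in this table use different denominators (locative proteins versus actual proteins). The sketch below shows one way to compute them, assuming the usual multi-label definitions: locative accuracy counts every correctly predicted (protein, location) pair, while actual accuracy counts a protein as correct only when its predicted label set matches its true label set exactly. These definitions and the function names are assumptions for illustration; the toy labels are made up and are not taken from the datasets.

```python
def locative_accuracy(predicted, true):
    """Correctly predicted labels divided by the total number of true labels."""
    hits = sum(len(p & t) for p, t in zip(predicted, true))
    total = sum(len(t) for t in true)
    return hits / total


def actual_accuracy(predicted, true):
    """Fraction of proteins whose predicted label set equals the true label set."""
    exact = sum(1 for p, t in zip(predicted, true) if p == t)
    return exact / len(true)


# Toy example (hypothetical labels): 3 actual proteins, 4 locative proteins.
true_labels = [{"cytoplasm"}, {"nucleus", "cytoplasm"}, {"secreted"}]
pred_labels = [{"cytoplasm"}, {"nucleus"}, {"secreted", "nucleus"}]

loc_acc = locative_accuracy(pred_labels, true_labels)  # 3 hits out of 4 true labels
act_acc = actual_accuracy(pred_labels, true_labels)    # 1 exact match out of 3 proteins
```

Because an under- or over-predicted protein can still contribute locative hits, actual accuracy is the stricter of the two measures, which is consistent with the gap between the two rows in the table.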
Performance of mGOASVM using different kernels with different parameters based on leave-one-out cross validation (LOOCV) using the virus dataset
| Kernel | Parameter | Locative accuracy | Actual accuracy |
|---|---|---|---|
| Linear SVM | – | 244/252 = 96.8% | 184/207 = 88.9% |
| RBF SVM | | 182/252 = 72.2% | 53/207 = 25.6% |
| RBF SVM | | 118/252 = 46.8% | 87/207 = 42.0% |
| RBF SVM | | 148/252 = 58.7% | 116/207 = 56.0% |
| RBF SVM | | 189/252 = 75.0% | 142/207 = 68.6% |
| RBF SVM | | 223/252 = 88.5% | 154/207 = 74.4% |
| RBF SVM | | 231/252 = 91.7% | 150/207 = 72.5% |
| RBF SVM | | 233/252 = 92.5% | 115/207 = 55.6% |
| RBF SVM | | 136/252 = 54.0% | 5/207 = 2.4% |
| Polynomial SVM | | 231/252 = 91.7% | 180/207 = 87.0% |
| Polynomial SVM | | 230/252 = 91.3% | 178/207 = 86.0% |
The penalty parameter (C) was set to 0.1 for all cases. σ is the kernel parameter for the RBF SVM; d is the polynomial degree in the Polynomial SVM.
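To make the roles of σ and d concrete, here is a minimal sketch of the three kernel families in plain Python. The exact parameterizations (the 2σ² scaling in the RBF kernel and the +1 offset in the polynomial kernel) are common textbook forms assumed for illustration; the paper's own kernel definitions may differ.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """Linear kernel: the plain inner product (no kernel parameter)."""
    return dot(x, y)


def rbf_kernel(x, y, sigma):
    """RBF kernel with width sigma (assumed form: exp(-||x - y||^2 / (2 sigma^2)))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def polynomial_kernel(x, y, d):
    """Polynomial kernel of degree d (assumed form: (x . y + 1)^d)."""
    return (dot(x, y) + 1) ** d
```

A small σ makes the RBF kernel decay quickly with distance, so nearly every pair of GO vectors looks dissimilar. This sensitivity is one plausible reason the RBF rows in the table vary so widely.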
Performance of different GO-vector construction methods based on leave-one-out cross validation (LOOCV) for (a) the virus dataset and (b) the plant dataset
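(a) Virus dataset
| GO-vector construction method | Locative accuracy | Actual accuracy |
|---|---|---|
| 1-0 value | 244/252 = 96.8% | 179/207 = 86.5% |
| Term-frequency (TF) | 244/252 = 96.8% | 184/207 = 88.9% |

(b) Plant dataset
| GO-vector construction method | Locative accuracy | Actual accuracy |
|---|---|---|
| 1-0 value | 1014/1055 = 96.1% | 788/978 = 80.6% |
| Term-frequency (TF) | 1015/1055 = 96.2% | 855/978 = 87.4% |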
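The two construction methods differ only in how each GO term is weighted. The sketch below assumes that "1-0 value" records whether a term occurs at all and that "term-frequency" records how many times it occurs among the retrieved annotations. The function name, the three-term vocabulary, and the example hits are hypothetical.

```python
from collections import Counter


def go_vector(hits, vocabulary, term_frequency=True):
    """Build a GO vector over a fixed, ordered vocabulary of GO terms.

    hits: list of GO identifiers retrieved for one protein (repeats allowed).
    term_frequency: if True, use occurrence counts; otherwise use 1-0 values.
    """
    counts = Counter(hits)
    if term_frequency:
        return [counts[term] for term in vocabulary]
    return [1 if counts[term] > 0 else 0 for term in vocabulary]


# Hypothetical example: one term is retrieved twice.
vocabulary = ["GO:0019028", "GO:0030430", "GO:0042025"]
hits = ["GO:0030430", "GO:0042025", "GO:0030430"]

tf_vector = go_vector(hits, vocabulary, term_frequency=True)       # [0, 2, 1]
binary_vector = go_vector(hits, vocabulary, term_frequency=False)  # [0, 1, 1]
```

The TF vector keeps information that the 1-0 vector discards (how often a term recurs), which may explain the higher actual accuracy reported for TF on both datasets.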
Distribution of the number of labels predicted by mGOASVM for proteins in the virus and plant datasets
| Virus | Over-prediction | 0 | 18 | 0 | 0 | 18/207 = 8.7% |
| | Equal-prediction | 187 | 0 | 0 | 0 | 187/207 = 90.3% |
| | Under-prediction | 0 | 2 | 0 | 0 | 2/207 = 1.0% |
| Plant | Over-prediction | 0 | 83 | 2 | 0 | 85/978 = 8.7% |
| | Equal-prediction | 879 | 0 | 0 | 0 | 879/978 = 89.9% |
| | Under-prediction | 0 | 14 | 0 | 0 | 14/978 = 1.4% |
For the i-th protein (i = 1, …, Nact), the number of predicted labels is compared with the number of true labels. Over-prediction: the number of predicted labels is larger than the number of true labels; Equal-prediction: the number of predicted labels is equal to the number of true labels; Under-prediction: the number of predicted labels is smaller than the number of true labels. The per-k counts give the number of proteins that are over-, equal-, or under-predicted by k labels (k = 0, …, 5 for the virus dataset and k = 0, …, 11 for the plant dataset); No, Ne or Nu: the total number of proteins that are over-, equal-, or under-predicted, respectively.
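The three cases defined above depend only on the sizes of the predicted and true label sets. The following is a minimal sketch of that bookkeeping; the function names and the toy label sets are illustrative assumptions.

```python
from collections import Counter


def prediction_case(predicted, true):
    """Classify one protein as over-, equal-, or under-predicted, and by how many labels."""
    difference = len(predicted) - len(true)
    if difference > 0:
        return ("over", difference)
    if difference < 0:
        return ("under", -difference)
    return ("equal", 0)


def tally_cases(predicted_sets, true_sets):
    """Count proteins per (case, k) pair, where k is the size of the mismatch."""
    return Counter(prediction_case(p, t) for p, t in zip(predicted_sets, true_sets))


# Toy example with hypothetical label sets.
true_sets = [{"nucleus"}, {"nucleus", "cytoplasm"}, {"secreted"}]
pred_sets = [{"nucleus", "cytoplasm"}, {"nucleus"}, {"secreted"}]
tally = tally_cases(pred_sets, true_sets)
```

Equal-prediction compares only the number of labels, so a protein with the right number of wrong labels still lands in that row. This is why the equal-prediction share in the table can exceed the overall actual accuracy.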
Performance of mGOASVM with different inputs and different numbers of homologous proteins for (a) the virus dataset and (b) the plant dataset
(a) Virus dataset
| Input | #homo | N(GO) | Locative accuracy | Actual accuracy |
|---|---|---|---|---|
| AC | 0 | 331 | 244/252 = 96.8% | 191/207 = 92.3% |
| S | 1 | 310 | 244/252 = 96.8% | 184/207 = 88.9% |
| S | 2 | 455 | 235/252 = 93.3% | 178/207 = 86.0% |
| S | 4 | 664 | 221/252 = 87.7% | 160/207 = 77.3% |
| S | 8 | 1134 | 202/252 = 80.2% | 130/207 = 62.8% |
| S + AC | 1 | 334 | 242/252 = 96.0% | 188/207 = 90.8% |
| S + AC | 2 | 460 | 238/252 = 94.4% | 179/207 = 86.5% |
| S + AC | 4 | 664 | 230/252 = 91.3% | 169/207 = 81.6% |
| S + AC | 8 | 1134 | 216/252 = 85.7% | 145/207 = 70.1% |

(b) Plant dataset
| Input | #homo | N(GO) | Locative accuracy | Actual accuracy |
|---|---|---|---|---|
| AC | 0 | 1532 | 1023/1055 = 97.0% | 863/978 = 88.2% |
| S | 1 | 1541 | 1015/1055 = 96.2% | 855/978 = 87.4% |
| S | 2 | 1906 | 907/1055 = 85.8% | 617/978 = 63.1% |
| S + AC | 1 | 1541 | 1010/1055 = 95.7% | 859/978 = 87.8% |
| S + AC | 2 | 1906 | 949/1055 = 90.0% | 684/978 = 70.0% |
S: Sequence; AC: Accession Number; #homo: Number of homologs used in the experiments; N(GO): Number of Distinct GO Terms. #homo=0 means only the true accession number is used.
Performance of mGOASVM on (a) the virus dataset and (b) the plant dataset
(a) Virus dataset
| Input | #homo | l = 1 | l = 2 | l = 3 | Overall actual accuracy |
|---|---|---|---|---|---|
| AC | 0 | 154/165 = 93.3% | 34/39 = 87.2% | 3/3 = 100% | 191/207 = 92.3% |
| S | 1 | 148/165 = 89.7% | 33/39 = 84.6% | 3/3 = 100% | 184/207 = 88.9% |
| S + AC | 1 | 151/165 = 91.5% | 34/39 = 87.2% | 3/3 = 100% | 188/207 = 90.8% |

(b) Plant dataset
| Input | #homo | l = 1 | l = 2 | l = 3 | Overall actual accuracy |
|---|---|---|---|---|---|
| AC | 0 | 813/904 = 89.9% | 49/71 = 69.0% | 1/3 = 33.3% | 863/978 = 88.2% |
| S | 1 | 802/904 = 88.7% | 52/71 = 73.2% | 1/3 = 33.3% | 855/978 = 87.4% |
| S + AC | 1 | 811/904 = 89.7% | 47/71 = 66.2% | 1/3 = 33.3% | 859/978 = 87.8% |
S: Sequence; AC: Accession Number; #homo: Number of homologs used in the experiments; l (l=1,…,3): Number of co-locations. #homo=0 means only the true accession number is used.
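The per-co-location denominators in this table tie the actual and locative protein counts together. The sketch below assumes that a protein with l co-locations is counted l times among the locative proteins. That counting rule is an assumption here, although it reproduces the totals reported in the dataset breakdown (252 and 1055). The names are illustrative.

```python
# Number of actual proteins with l = 1, 2, 3 co-locations (denominators from the table).
VIRUS_BY_COLOCATIONS = {1: 165, 2: 39, 3: 3}
PLANT_BY_COLOCATIONS = {1: 904, 2: 71, 3: 3}


def actual_total(by_colocations):
    """Each protein is counted once, regardless of how many locations it has."""
    return sum(by_colocations.values())


def locative_total(by_colocations):
    """Each protein is counted once per location (assumed counting rule)."""
    return sum(l * n for l, n in by_colocations.items())


virus_actual = actual_total(VIRUS_BY_COLOCATIONS)      # 165 + 39 + 3
virus_locative = locative_total(VIRUS_BY_COLOCATIONS)  # 165 + 2*39 + 3*3
plant_actual = actual_total(PLANT_BY_COLOCATIONS)      # 904 + 71 + 3
plant_locative = locative_total(PLANT_BY_COLOCATIONS)  # 904 + 2*71 + 3*3
```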
Breakdown of the new plant dataset
| Label | Subcellular location | Number of proteins |
|---|---|---|
| 1 | Cell membrane | 16 |
| 2 | Cell wall | 1 |
| 3 | Chloroplast | 54 |
| 4 | Cytoplasm | 38 |
| 5 | Endoplasmic reticulum | 9 |
| 6 | Extracellular | 3 |
| 7 | Golgi apparatus | 7 |
| 8 | Mitochondrion | 16 |
| 9 | Nucleus | 46 |
| 10 | Peroxisome | 6 |
| 11 | Plastid | 1 |
| 12 | Vacuole | 7 |
| Total number of locative proteins | 204 | |
| Total number of actual proteins | 175 | |
The dataset was constructed from Swiss-Prot entries created between 08-Mar-2011 and 18-Apr-2012. The sequence identity of the dataset is below 25%.
Figure 1. Flowchart of mGOASVM for three cases: (1) using accession numbers only; (2) using sequences only; (3) using both accession numbers and sequences. AC: Accession Number; S: Sequence. Part II does not exist for Case 1, and Part I does not exist for Case 2. Case 3 requires using both Part I and Part II. The score fusion implements Eq. 12.
Figure 2. Distribution of the closeness between the novel testing proteins and the training proteins. The closeness is defined as the BLAST E-values of the training proteins using the test proteins as the query proteins in the BLAST searches. Number of Proteins: The number of training proteins whose E-values fall into the interval specified under the bar. Small E-values suggest that the corresponding novel proteins are close homologs of the training proteins.
Comparing mGOASVM with a state-of-the-art multi-label plant predictor based on independent tests using the new plant dataset
| 1 | Cell membrane | 8/16 = 50.0% | 7/16 = 43.8% |
| 2 | Cell wall | 0/1 = 0% | 0/1 = 0% |
| 3 | Chloroplast | 27/54 = 50.0% | 39/54 = 72.2% |
| 4 | Cytoplasm | 5/38 = 13.2% | 19/38 = 50.0% |
| 5 | Endoplasmic reticulum | 1/9 = 11.1% | 3/9 = 33.3% |
| 6 | Extracellular | 0/3 = 0% | 1/3 = 33.3% |
| 7 | Golgi apparatus | 3/7 = 42.9% | 3/7 = 42.9% |
| 8 | Mitochondrion | 6/16 = 37.5% | 11/16 = 68.8% |
| 9 | Nucleus | 31/46 = 67.4% | 33/46 = 71.7% |
| 10 | Peroxisome | 4/6 = 66.7% | 3/6 = 50.0% |
| 11 | Plastid | 0/1 = 0% | 0/1 = 0% |
| 12 | Vacuole | 2/7 = 28.6% | 4/7 = 57.1% |
| Overall locative accuracy | 87/204 = 42.7% | 123/204 = 60.3% | |
| Overall actual accuracy | 60/175 = 34.3% | 97/175 = 55.4% | |