Shibiao Wan, Man-Wai Mak, Sun-Yuan Kung.
Abstract
BACKGROUND: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have focused on predicting not only single-location proteins but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved.
Year: 2016 PMID: 26911432 PMCID: PMC4765148 DOI: 10.1186/s12859-016-0940-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Information of the human dataset. (a) Dataset breakdown; (b) dataset analysis. The number of proteins shown in each subcellular location represents the number of ‘locative proteins’ [44, 65]. The dataset comprises 3106 actual proteins and 3681 locative proteins distributed in 14 subcellular locations. In (b), for each bar, the numbers m(n) on top denote that there are m actual proteins and n locative proteins having the number of co-location(s) indicated at the bottom of the bar. For example, there are 43 actual and 129 locative proteins that have three subcellular locations. The small pie charts show the distribution of locative proteins having the number of co-location(s) shown at the bottom of the bar chart.
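The relationship between actual and locative proteins in the caption above can be illustrated with a short sketch. It assumes, consistent with the 43 actual versus 129 locative example, that a protein annotated with k locations counts as k locative proteins. The function name `dataset_counts` and the toy protein IDs are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def dataset_counts(annotations):
    """Summarize a multi-label dataset.

    annotations maps a protein ID to the set of its subcellular locations.
    A protein with k locations is counted once as an actual protein and
    k times as a locative protein (assumption based on the caption).
    """
    n_actual = len(annotations)
    n_locative = sum(len(locs) for locs in annotations.values())
    # Locative proteins per subcellular location (panel (a) of the figure).
    per_location = Counter(loc for locs in annotations.values() for loc in locs)
    # Actual proteins grouped by their number of co-locations (panel (b)).
    actual_by_k = Counter(len(locs) for locs in annotations.values())
    # Each of the m actual proteins with k locations yields k locative proteins.
    locative_by_k = {k: k * m for k, m in actual_by_k.items()}
    return n_actual, n_locative, per_location, actual_by_k, locative_by_k


# Hypothetical toy annotations, for illustration only.
toy = {
    "P_A": {"NUC"},
    "P_B": {"CYT", "NUC"},
    "P_C": {"CYT", "MIT", "PM"},
}
n_actual, n_locative, per_location, actual_by_k, locative_by_k = dataset_counts(toy)
print(n_actual, n_locative)  # 3 6
```

Under this counting rule, the 43 actual proteins with three locations in the caption correspond to 43 × 3 = 129 locative proteins.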
Fig. 2 The categorical breakdown of the essential GO terms in each subcellular location for (a) mLASSO and (b) mEN. CEN: centrosome; CYT: cytoplasm; CYK: cytoskeleton; ER: endoplasmic reticulum; END: endosome; EXT: extracellular; GOL: Golgi apparatus; LYS: lysosome; MIC: microsome; MIT: mitochondrion; NUC: nucleus; PER: peroxisome; PM: plasma membrane; SYN: synapse.
Fig. 3 Overlap between the essential GO terms found by mLASSO (yellow) and mEN (pink). (a) All subcellular locations, (b) centrosome, (c) cytoplasm and (d) mitochondrion.
Fig. 4 Analysis of location-specific weights β of the essential GO terms for mLASSO and mEN. (a) Number of non-zero weights for mLASSO and mEN; (b) number of positive and negative weights; (c) distribution of non-zero weights for mLASSO; (d) distribution of non-zero weights for mEN. CEN: centrosome; CYT: cytoplasm; CYK: cytoskeleton; ER: endoplasmic reticulum; END: endosome; EXT: extracellular; GOL: Golgi apparatus; LYS: lysosome; MIC: microsome; MIT: mitochondrion; NUC: nucleus; PER: peroxisome; PM: plasma membrane; SYN: synapse.
Fig. 5 The distribution of non-zero weights in each subcellular location for mLASSO and mEN. CEN: centrosome; CYT: cytoplasm; CYK: cytoskeleton; ER: endoplasmic reticulum; END: endosome; EXT: extracellular; GOL: Golgi apparatus; LYS: lysosome; MIC: microsome; MIT: mitochondrion; NUC: nucleus; PER: peroxisome; PM: plasma membrane; SYN: synapse.
Fig. 6 Networks showing the relationships between the essential GO terms and each subcellular location for (a) mLASSO and (b) mEN, and between the significantly essential GO terms and each subcellular location for (c) mLASSO and (d) mEN. In all panels, small green dots represent the GO terms and the large dots in different colors represent the 14 subcellular locations. A line connecting an essential GO term and a subcellular location denotes that the GO term contributes to the prediction of that subcellular location. Conversely, if no line connects an essential GO term with a particular subcellular location, then this GO term does not provide any information about the presence or absence of a protein in that location. CEN: centrosome; CYT: cytoplasm; CYK: cytoskeleton; ER: endoplasmic reticulum; END: endosome; EXT: extracellular; GOL: Golgi apparatus; LYS: lysosome; MIC: microsome; MIT: mitochondrion; NUC: nucleus; PER: peroxisome; PM: plasma membrane; SYN: synapse.
Comparing mLASSO and mEN with state-of-the-art multi-label predictors based on leave-one-out cross-validation on the human dataset
| Label | Subcellular location | LOOCV Locative Accuracy (LA) | | | |
|---|---|---|---|---|---|
| | | iLoc-Hum | mGOASVM | mLASSO | mEN |
| 1 | Centrosome | 56/77 = 0.727 | 64/77 = 0.831 | 42/77 = 0.546 | 60/77 = 0.779 |
| 2 | Cytoplasm | 561/817 = 0.687 | 683/817 = 0.836 | 699/817 = 0.856 | 683/817 = 0.836 |
| 3 | Cytoskeleton | 27/79 = 0.342 | 44/79 = 0.557 | 29/79 = 0.367 | 32/79 = 0.405 |
| 4 | Endoplasmic reticulum | 166/229 = 0.725 | 193/229 = 0.843 | 194/229 = 0.847 | 190/229 = 0.830 |
| 5 | Endosome | 1/24 = 0.042 | 9/24 = 0.375 | 1/24 = 0.042 | 5/24 = 0.208 |
| 6 | Extracellular | 325/385 = 0.844 | 344/385 = 0.894 | 311/385 = 0.808 | 314/385 = 0.816 |
| 7 | Golgi apparatus | 99/161 = 0.615 | 131/161 = 0.814 | 118/161 = 0.733 | 128/161 = 0.795 |
| 8 | Lysosome | 56/77 = 0.727 | 71/77 = 0.922 | 62/77 = 0.805 | 74/77 = 0.961 |
| 9 | Microsome | 7/24 = 0.292 | 18/24 = 0.750 | 1/24 = 0.042 | 14/24 = 0.583 |
| 10 | Mitochondrion | 284/364 = 0.780 | 339/364 = 0.931 | 336/364 = 0.923 | 336/364 = 0.923 |
| 11 | Nucleus | 918/1021 = 0.899 | 931/1021 = 0.912 | 922/1021 = 0.903 | 923/1021 = 0.904 |
| 12 | Peroxisome | 20/47 = 0.426 | 43/47 = 0.915 | 34/47 = 0.723 | 39/47 = 0.830 |
| 13 | Plasma membrane | 277/354 = 0.783 | 288/354 = 0.814 | 267/354 = 0.754 | 266/354 = 0.751 |
| 14 | Synapse | 12/22 = 0.546 | 12/22 = 0.546 | 3/22 = 0.136 | 13/22 = 0.591 |
| | Overall Actual Accuracy (OAA) | 2118/3106 = 0.682 | 2251/3106 = 0.725 | 2265/3106 = 0.729 | 2307/3106 = 0.743 |
| | Overall Locative Accuracy (OLA) | 2809/3681 = 0.763 | 3170/3681 = 0.861 | 3019/3681 = 0.820 | 3077/3681 = 0.836 |
| | | – | 0.821 | 0.814 | |
| | | – | 0.851 | 0.859 | |
| | | – | | 0.857 | 0.870 |
| | | – | 0.853 | 0.843 | |
| | | – | 0.835 | 0.826 | |
| | | – | 0.740 | 0.638 | |
| | | – | 0.029 | 0.029 | |
“–” means the corresponding references do not provide the related metrics. Note that OAA is the most stringent and objective among all the metrics. Data in bold represent the best result of the corresponding measures among all predictors
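The two overall measures in the table can be sketched as follows. This assumes the usual multi-label definitions: OAA counts a protein as correct only when its predicted set of locations matches the true set exactly, while OLA counts every correctly predicted location and divides by the number of locative proteins. The function names and the toy labels are hypothetical; the definitions themselves are not stated in this section.

```python
def overall_actual_accuracy(predicted, actual):
    """Fraction of proteins whose predicted location set equals the true set."""
    exact = sum(1 for p, a in zip(predicted, actual) if set(p) == set(a))
    return exact / len(actual)


def overall_locative_accuracy(predicted, actual):
    """Correctly predicted locations divided by the total number of true locations."""
    correct = sum(len(set(p) & set(a)) for p, a in zip(predicted, actual))
    total = sum(len(set(a)) for a in actual)
    return correct / total


# Hypothetical toy labels: three actual proteins, four locative proteins.
actual = [{"NUC"}, {"CYT", "NUC"}, {"MIT"}]
predicted = [{"NUC"}, {"CYT"}, {"MIT", "CYT"}]

oaa = overall_actual_accuracy(predicted, actual)    # only the first protein matches exactly
ola = overall_locative_accuracy(predicted, actual)  # 3 of 4 true locations are recovered
print(round(oaa, 3), round(ola, 3))  # 0.333 0.75

# The ratios reported for mEN in the table follow the same arithmetic.
print(round(2307 / 3106, 3), round(3077 / 3681, 3))  # 0.743 0.836
```

In the toy example the over-prediction for the third protein lowers OAA but not OLA, which illustrates why OAA is described as the more stringent measure.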
Prediction results of 7 novel proteins by mEN
| AC | Date of creation | Ground-truth location(s) | Prediction results | GO total number | Essential GO terms |
|---|---|---|---|---|---|
| D3DTV9 | 26-Nov-2014 | Nucleus | Nucleus | 13 | GO:0000166, GO:0016787, GO:0003676, GO:0003723, GO:0004386, GO:0046872, GO:0051607, GO:0005524 |
| E9PAV3 | 19-Feb-2014 | Cytoplasm, Nucleus | Cytoplasm, Nucleus | 9 | GO:0015031, GO:0003677, GO:0005634, GO:0005737, GO:0006351, GO:0006355, GO:0006810 |
| B7ZW38 | 26-Nov-2014 | Nucleus | Nucleus | 5 | GO:0000166, GO:0030529, GO:0003676, GO:0003723, GO:0005634 |
| P0DMR3 | 07-Jan-2015 | Cytoplasm | Cytoplasm | 22 | GO:0000166, GO:0016740, GO:0016874, GO:0003824, GO:0046872, GO:0005524, GO:0005575, GO:0008152 |
| P0DML3 | 09-Jul-2014 | Extracellular | Extracellular | 6 | GO:0046872, GO:0005179, GO:0005576, GO:0007165 |
| P0DMN0 | 03-Sep-2014 | Cytoplasm | Cytoplasm | 16 | GO:0016740, GO:0030968, GO:0044267, GO:0044281, GO:0005737, GO:0005829, GO:0006629, GO:0006805 |
| C9JSJ3 | 29-Oct-2014 | Nucleus | Nucleus | 4 | GO:0003677, GO:0005634, GO:0006351, GO:0006355 |
AC: UniProtKB accession number; Ground-truth location(s): the experimentally-validated actual subcellular location(s); GO Total Number: the total number of GO terms retrieved for a given query protein
Fig. 7 Examples showing how mEN interprets subcellular localization of (a) a single-location protein (D3DTV9) and (b) a multi-location protein (E9PAV3). SCL: subcellular location; Score: the score determined in Eq. 14; Feature Score: the score that each essential GO term contributes to the final prediction; Term-freq: the frequency of occurrence of an essential GO term; C: cellular component; F: molecular function; P: biological process; CEN: centrosome; CYT: cytoplasm; CYK: cytoskeleton; ER: endoplasmic reticulum; END: endosome; EXT: extracellular; GOL: Golgi apparatus; LYS: lysosome; MIC: microsome; MIT: mitochondrion; NUC: nucleus; PER: peroxisome; PM: plasma membrane; SYN: synapse.
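One way to read the Feature Score and Term-freq columns of this figure is as a linear model, sketched below. It assumes that each essential GO term contributes its location-specific weight β multiplied by its term frequency, and that a location's score is the sum of these contributions. Eq. 14 is not reproduced in this section, so the exact scoring and decision rule may differ; the function name `location_scores` and all weight values are hypothetical.

```python
def location_scores(term_freq, weights):
    """Per-location score and per-term contributions under a linear model.

    term_freq: dict mapping a GO term to its frequency for the query protein.
    weights:   dict mapping a location to {GO term: beta}; terms absent from
               a location's dict are treated as having zero weight.
    """
    result = {}
    for location, betas in weights.items():
        # Contribution of each term = weight * term frequency (assumed form).
        contributions = {
            go: betas[go] * tf
            for go, tf in term_freq.items()
            if betas.get(go, 0.0) != 0.0
        }
        result[location] = (sum(contributions.values()), contributions)
    return result


# Term frequencies and weights below are invented for illustration.
term_freq = {"GO:0003676": 2, "GO:0005524": 1}
weights = {
    "NUC": {"GO:0003676": 0.5, "GO:0005524": 0.25},
    "CYT": {"GO:0005524": -0.5},
}

scores = location_scores(term_freq, weights)
for location, (score, contributions) in scores.items():
    print(location, score, contributions)
```

Because sparse regression sets most weights to zero, only a few GO terms appear in each location's contribution list, which is what makes the per-term breakdown readable.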
Impacts of the hierarchical-information based (HIB) technique on mLASSO and mEN based on leave-one-out cross-validation (LOOCV) on the human dataset
| Measures | mLASSO | | mEN | |
|---|---|---|---|---|
| | without HIB | with HIB | without HIB | with HIB |
| OAA | 0.729 | | 0.743 | 0.742 |
| OLA | 0.820 | | 0.836 | 0.825 |
| | 0.814 | | 0.827 | 0.821 |
| | 0.859 | | 0.869 | 0.866 |
| | 0.857 | | 0.870 | 0.860 |
| | 0.843 | | 0.855 | 0.849 |
| | 0.826 | | 0.837 | 0.831 |
| | 0.638 | 0.676 | | 0.667 |
| | 0.029 | | 0.028 | 0.029 |
Note that OAA is the most stringent and objective among all the metrics. Data in bold represent the best result of the corresponding measures among all predictors
Significance of GO terms from different categories on the performance of mLASSO and mEN based on leave-one-out cross-validation (LOOCV) on the human dataset
| Measures | mLASSO | | | | mEN | | | |
|---|---|---|---|---|---|---|---|---|
| | All | CC + MF | CC + BP | MF + BP | All | CC + MF | CC + BP | MF + BP |
| OAA | 0.729 | 0.662 | 0.654 | 0.385 | | 0.621 | 0.640 | 0.440 |
| OLA | 0.820 | 0.726 | 0.715 | 0.436 | | 0.686 | 0.701 | 0.492 |
| | 0.814 | 0.733 | 0.724 | 0.446 | | 0.690 | 0.709 | 0.506 |
| | 0.859 | 0.782 | 0.773 | 0.500 | | 0.739 | 0.760 | 0.560 |
| | 0.857 | 0.759 | 0.747 | 0.457 | | 0.712 | 0.730 | 0.521 |
| | 0.843 | 0.758 | 0.748 | 0.469 | | 0.713 | 0.733 | 0.528 |
| | 0.826 | 0.750 | 0.741 | 0.462 | | 0.711 | 0.728 | 0.516 |
| | 0.638 | 0.435 | 0.426 | 0.212 | | 0.410 | 0.427 | 0.346 |
| | 0.029 | 0.041 | 0.042 | 0.086 | | 0.047 | 0.044 | 0.078 |
Note that OAA is the most stringent and objective among all the metrics. CC: cellular components; MF: molecular functions; BP: biological processes. Data in bold represent the best result of the corresponding measures among all predictors