Chi-Hua Tung, Chi-Wei Chen, Han-Hao Sun, Yen-Wei Chu.
Abstract
Drug development and investigation of protein function both require an understanding of protein subcellular localization. We developed a system, REALoc, that can predict the subcellular localization of singleplex and multiplex proteins in humans. This system, based on a comprehensive strategy, consists of two heterogeneous systematic frameworks that integrate one-to-one and many-to-many machine learning methods and use sequence-based features, including amino acid composition, surface accessibility, weighted sign aa index, and sequence similarity profile, as well as gene ontology function-based features. REALoc can be used to predict localization to six subcellular compartments (cell membrane, cytoplasm, endoplasmic reticulum/Golgi, mitochondrion, nucleus, and extracellular). REALoc yielded a 75.3% absolute true success rate during five-fold cross-validation and a 57.1% absolute true success rate in an independent database test, which was >10% higher than six other prediction systems. Lastly, we analyzed the effects of the Vote and GANN models on singleplex and multiplex localization prediction efficacy. REALoc is freely available at http://predictor.nchu.edu.tw/REALoc.
Year: 2017 PMID: 28658305 PMCID: PMC5489166 DOI: 10.1371/journal.pone.0178832
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
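The abstract lists amino acid composition among REALoc's sequence-based features. As a rough illustration of what such a feature vector might look like, the sketch below computes the fraction of each of the 20 standard residues in a sequence. The function name, the handling of non-standard symbols, and the toy sequence are assumptions made for this example; the source does not describe REALoc's actual feature extraction.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in `sequence`.

    Non-standard symbols (e.g. X, B, Z) are ignored. That choice is an
    assumption of this sketch, not a documented REALoc behavior.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to illustrate the call.
features = amino_acid_composition("MKKLLAAX")
```

The result is a fixed-length vector of 20 fractions that sum to 1 whenever the sequence contains at least one standard residue, which makes it easy to feed to a classifier regardless of sequence length.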
Number of proteins in the different subcellular locations in the training and testing datasets.
| Subcellular location | Training dataset | Testing dataset |
|---|---|---|
| Cell membrane | 1453 | 221 |
| Cytoplasm | 1542 | 197 |
| ER/Golgi | 562 | 136 |
| Mitochondrion | 462 | 133 |
| Nucleus | 2064 | 156 |
| Extracellular | 795 | 82 |
| Total | 6878 | 925 |
a: Total number of proteins in all locations
b: Total number of different proteins
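The Total row can be checked by summing the per-location counts. The short sketch below does this with the numbers copied from the table; the variable names are illustrative. Because the total is taken over locations, a protein annotated with more than one location would presumably be counted once per location, which is consistent with the separate footnote for the number of different proteins.

```python
# Per-location protein counts copied from the table above.
training = {
    "Cell membrane": 1453,
    "Cytoplasm": 1542,
    "ER/Golgi": 562,
    "Mitochondrion": 462,
    "Nucleus": 2064,
    "Extracellular": 795,
}
testing = {
    "Cell membrane": 221,
    "Cytoplasm": 197,
    "ER/Golgi": 136,
    "Mitochondrion": 133,
    "Nucleus": 156,
    "Extracellular": 82,
}

# Sum over locations; these match the Total row (6878 and 925).
training_total = sum(training.values())
testing_total = sum(testing.values())
```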
Fig 1. The flowchart shows the implementation of REALoc.
Fig 2. Two-layer architecture of REALoc.
Performance of REALoc with five-fold cross-validation and other predictors for the 5939p dataset.
| Subcellular location | LocTree2 | CELLO | Hum-mPLoc 2.0 | iLoc-hum | REALoc_ | REALoc_ |
|---|---|---|---|---|---|---|
| Cell membrane | 57.5 | 58.9 | 61.8 | 69.7 | 79.6 | |
| Cytoplasm | 30.0 | 25.2 | 45.3 | 63.5 | 65.6 | |
| ER/Golgi | 38.1 | 2.1 | 53.9 | 32.4 | 74.4 | |
| Mitochondrion | 51.8 | 57.1 | 66.7 | 76.0 | 75.3 | |
| Nucleus | 57.3 | 58.8 | 71.1 | 71.6 | 72.7 | |
| Extracellular | 78.7 | 62.1 | 74.8 | 80.1 | 76.5 | |
| Overall | 60.0 | 54.3 | 67.2 | 72.7 | 72.8 | |
All results are absolute true success rates (%); bold text indicates the highest value in each row.
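The REALoc figures in this table come from five-fold cross-validation. A minimal sketch of how such a split might be generated is shown below, using only the standard library. The function name, the fixed seed, and the toy dataset size are assumptions for illustration; the source does not say how the folds were actually drawn.

```python
import random


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds
    whose sizes differ by at most one item.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy dataset of 10 items, split into 5 folds.
splits = list(k_fold_indices(10, k=5))
```

Each item appears in exactly one test fold, so every protein is predicted once by a model that did not see it during training.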
Testing dataset 868pt predicted by REALoc and other predictors.
| Subcellular location | LocTree2 | CELLO | Hum-mPLoc 2.0 | iLoc-hum | GOASVM | mGOF-loc | REALoc_ | REALoc_ |
|---|---|---|---|---|---|---|---|---|
| Cell membrane | 23.3 | 30.8 | 29.4 | 43.9 | 6.8 | 28.6 | 47.5 | |
| Cytoplasm | 8.0 | 8.1 | 22.3 | 35.5 | 4.1 | 7.6 | 50.8 | |
| ER/Golgi | 18.9 | 0.7 | 28.9 | 22.1 | 3.7 | 0.0 | 52.9 | |
| Mitochondrion | 38.8 | 30.1 | 45.9 | 11.3 | 25.2 | 53.4 | 54.1 | |
| Nucleus | 58.5 | 64.7 | 63.3 | 57.7 | 89.9 | 58.3 | 55.8 | |
| Extracellular | 50.0 | 40.0 | 52.0 | 40.0 | 76.0 | 40.0 | | |
| Overall | 28.4 | 27.4 | 36.9 | 44.0 | 24.2 | 30.4 | 52.5 | |
All results are absolute true success rates (%), calculated as the number of proteins correctly predicted / the number of total accepted proteins; bold text indicates the highest value in each row.
Performance of REALoc and other approaches for predicting multiplex proteins.
| Dataset | Hum-mPLoc 2.0 | iLoc-hum | mGOF-loc | REALoc_ | REALoc_ |
|---|---|---|---|---|---|
| 5939p | 30.5 | 44.3 | 6.4 | 74.2 | |
| 868pt | 20.0 | 17.3 | 2.7 | 36.3 | |
| 868pt | n/a | n/a | n/a | 22.7 | |
n/a: Not available
All results are absolute true success rates (%), calculated as the number of proteins correctly predicted / the number of total accepted proteins; bold text indicates the highest value in each row.
Performance comparison of REALoc_GANN and REALoc_Vote.
| Dataset | REALoc_Vote | REALoc_GANN |
|---|---|---|
| Training dataset (5939p) | 72.5 | |
| Training dataset (5939p) | 74.2 | |
| Training dataset (5939p) | 72.8 | |
| Testing dataset (868pt) | 63.3 | |
| Testing dataset (868pt) | 36.3 | |
| Testing dataset (868pt) | 52.5 |
All results are given as the absolute true success rate (%), calculated as the number of proteins correctly predicted / the number of total accepted proteins.
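The ratio in the note above can be sketched in a few lines. The example assumes that a protein counts as "correctly predicted" only when its predicted set of locations exactly matches its annotated set, which is the usual reading of an absolute true measure for multiplex proteins; the source itself gives only the ratio. The function name and the toy labels are hypothetical.

```python
def absolute_true_rate(true_sets, predicted_sets) -> float:
    """Return the percentage of proteins whose predicted location set
    is identical to the annotated location set.

    Exact set equality as the criterion is an assumption of this sketch.
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have equal length")
    if not true_sets:
        raise ValueError("at least one protein is required")
    correct = sum(
        1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p)
    )
    return 100.0 * correct / len(true_sets)


# Hypothetical toy annotations: two singleplex and two multiplex proteins.
true_sets = [
    {"Nucleus"},
    {"Cytoplasm"},
    {"Cytoplasm", "Nucleus"},
    {"Cell membrane", "Extracellular"},
]
predicted_sets = [
    {"Nucleus"},                  # exact match
    {"Cytoplasm"},                # exact match
    {"Nucleus", "Cytoplasm"},     # exact match, order is irrelevant
    {"Cell membrane"},            # partial match, counted as wrong
]

rate = absolute_true_rate(true_sets, predicted_sets)
```

Under this criterion a partially correct multiplex prediction earns no credit, which may help explain why rates for multiplex proteins tend to be lower than for singleplex proteins.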