Emily Chia-Yu Su, Jia-Ming Chang, Cheng-Wei Cheng, Ting-Yi Sung, Wen-Lian Hsu.
Abstract
BACKGROUND: Identifying the subcellular localization of proteins is crucial for elucidating cellular processes and molecular functions in a cell. However, given the tremendous amount of sequence data generated in the post-genomic era, determining protein localization by biological experiments can be expensive and time-consuming. Prediction systems that analyze uncharacterized proteins efficiently have therefore played an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between the nucleus and the cytoplasm based on recognition of nuclear translocation signals, including nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization using sequence features, such as putative NLSs. However, it has been shown that prediction coverage based on NLSs is very low. In addition, most existing approaches attained prediction accuracy of only about 54% to 70% and a Matthews correlation coefficient (MCC) of only about 0.250 to 0.380 on an independent test set. Moreover, no predictor can generate sequence motifs to characterize features of potential NESs, whose biological properties are not well understood from existing experimental studies.
Year: 2012 PMID: 23282098 PMCID: PMC3521467 DOI: 10.1186/1471-2105-13-S17-S13
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data sets
| | Nuclear | Non-nuclear |
|---|---|---|
| Training | 2,842 | 2,606 |
| Testing | 564 | 398 |
| Total | 3,406 | 3,004 |
Numbers of nuclear and non-nuclear proteins for training and testing.
Performance comparison
| Training Data Set (by five-fold cross-validation) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Method | tp | tn | fp | fn | Sens | Spec | Acc | MCC |
| PSLNuc | 2,317 | 2,030 | 576 | 525 | | | | |
| NUCLEO | 2,157 | 1,924 | 682 | 685 | 0.759 | 0.760 | 0.749 | 0.497 |
| Independent Test Data Set | | | | | | | | |
| Method | tp | tn | fp | fn | Sens | Spec | Acc | MCC |
| PSLNuc | 452 | 258 | 140 | 112 | | | | |
| NUCLEO | 430 | 246 | 152 | 134 | 0.760 | 0.620 | 0.700 | 0.380 |
| PredictNLS | 153 | 369 | 29 | 411 | 0.270 | 0.930 | 0.540 | 0.250 |
| NucPred | 376 | 233 | 165 | 188 | 0.670 | 0.590 | 0.630 | 0.250 |
Performance comparison of different nuclear localization predictors.
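The record does not spell out how the Sens, Spec, Acc and MCC columns are derived from the tp, tn, fp and fn counts. The sketch below assumes the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient; the function name `confusion_metrics` is hypothetical and is not taken from the paper. Values recomputed this way may differ from some of the tabulated figures.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    Standard textbook definitions are assumed; the source does not state its formulas.
    """
    sens = tp / (tp + fn)                      # fraction of nuclear proteins recovered
    spec = tn / (tn + fp)                      # fraction of non-nuclear proteins recovered
    acc = (tp + tn) / (tp + tn + fp + fn)      # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "Spec": spec, "Acc": acc, "MCC": mcc}

# PredictNLS counts on the independent test set, copied from the table above.
predictnls = confusion_metrics(tp=153, tn=369, fp=29, fn=411)
for name, value in predictnls.items():
    print(f"{name}: {value:.3f}")
```

For the PredictNLS row, the recomputed values agree with the reported ones to within rounding, which supports reading the unlabeled last column as MCC.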
Gapped-dipeptide signatures for nuclear proteins
| Gapped-dipeptide signatures for nuclear proteins | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| G2P | G11P | G8P | P11P | P5P | P3Q | P5Q | |||
| E4E | E1E | E8E | E0E | P1T | |||||
| P0T | P6T | P10T | P1S | S6P | P3G | G4P | |||
| P0G | Q11Q | Q10Q | Q5Q | ||||||
| C8S | S2C | S6C | N6P | P12N | S9D | E12D | D12D | ||
| D4D | D13D | S12T | S8T | S10T | |||||
| M3P | P12M | Q8R | Q2H | ||||||
| H0R | S7Q | Q12S | S13H | S10H | S2H | H6S | S12H | P7Y | P4Y |
| Y5P | N3N | N6N | N0N | D2S | D5S | ||||
| K8S | D0S | N8N | H8N | H5N | T4E | E13S | T6E | ||
| N0E | N0K | N4K | K1N | D2T | |||||
| G6L | G8L | E8C | K1C | K5C | K7C | C2C | C0C | K3C | C4K |
| H8C | H12C | H13C | H10C | D6I | D10I | D12I | A13Q | A6Q | Q9A |
| Q6A | M3S | Q13T | K9T | Q1T | Q9T | N1S | S7N | N8S | |
| H1I | I0H | E12I | S2E | E11I | D6A | S3D | D8A | M13Q | M3Q |
| L13S | L12S | V6S | S9N | E9S | N0S | S3N | H1Q | ||
| H10Q | H3Q | H5Q | T7E | T8E | T12E | T9E | W13K | W11K | N11I |
| I12N | M13D | M2D | F13D | L1H | H1L | A3H | L5H | A9H | |
| H10A | A10H | ||||||||
Our method proposes 183 gapped-dipeptide signatures for nuclear proteins. Signatures that have been reported as motifs for nuclear localization signals (NLSs) are shown in bold face.
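The signature notation is not defined in this record. The sketch below assumes that a signature such as `G2P` means residue G followed by residue P with two arbitrary residues in between, so the two residues sit `gap + 1` positions apart. The helper names `parse_signature` and `find_signature` and the toy sequence are hypothetical, used only to illustrate how such a signature could be located in a protein sequence.

```python
import re

def parse_signature(sig):
    """Split a signature such as 'G11P' into (first residue, gap, second residue)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sig)
    if m is None:
        raise ValueError(f"not a gapped-dipeptide signature: {sig!r}")
    return m.group(1), int(m.group(2)), m.group(3)

def find_signature(sequence, sig):
    """Return 0-based start positions where the signature occurs in the sequence.

    Assumption: 'XdZ' means X and Z are separated by d intervening residues.
    """
    first, gap, second = parse_signature(sig)
    offset = gap + 1  # distance between the two residues
    return [i for i in range(len(sequence) - offset)
            if sequence[i] == first and sequence[i + offset] == second]

toy_sequence = "MGAAPKKEE"  # made-up sequence, not from the data set
hits_g2p = find_signature(toy_sequence, "G2P")
hits_e0e = find_signature(toy_sequence, "E0E")
print(hits_g2p, hits_e0e)
```

Counting such hits for every signature in the table would give one possible fixed-length feature representation of a sequence, although the record does not say that the authors encode features this way.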
Gapped-dipeptide signatures for non-nuclear proteins
| Gapped-dipeptide signatures for non-nuclear proteins | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| A3V | A8V | A12I | A6I | A4I | I2A | E11L | L11Q | Q11L | L11E |
| Q5W | Q13W | S7W | Q12W | I13I | I12I | I10I | Y6Y | Y0Y | |
| Y5Y | Y12Y | V0N | L2N | I2Q | I2S | R11A | R8A | R1A | K1L |
| L1K | L0K | F10A | F11A | F11S | F10S | L8A | L4A | L5A | A6L |
| F3F | F4F | F10F | F13F | V11D | V13D | ||||
| V7D | V4D | I5E | I4D | I12E | E5I | W8P | W9P | W10P | W3P |
| V5V | V13V | G2V | G11V | S2S | S4S | A5S | S9S | W1H | H10W |
| W8H | Y10W | M3M | M2M | M13M | M0M | T0T | T13T | T7T | T1T |
| H1Y | H3Y | H13Y | H7Y | F12C | F9C | F0C | F10C | A7K | A3K |
| A10A | A4A | A5A | A9A | Y2G | Y0G | G0Y | G0F | F2K | K0F |
| K3F | Q13L | L13Q | L1T | L10Q | C12I | I12C | C10I | C3I | W4E |
| W0L | W5L | W1L | Y7T | Y7V | Y6V | V5Y | E1M | L9K | R11M |
| I10K | I13K | F13K | T1K | K5I | K2V | H10L | L12H | H8L | E13L |
| E13I | E10V | N13L | K13N | K12N | I9N | V10N | V12N | V1N | R10Q |
| D3R | R2D | V7G | M11G | V13G | L3G | I6N | I5N | V9I | I11N |
| K7L | V4H | H12V | V8H | H4V | F9E | D13L | D12L | L3D | D9L |
| H11M | L10M | M7H | H12I | E9L | E11A | E12A | A5E | E10A | I4N |
| N0I | N13I | N9I | |||||||
Our method proposes 183 gapped-dipeptide signatures for non-nuclear proteins. Signatures that have been reported as motifs for nuclear export signals (NESs) are shown in bold face.
Figure 1. Physicochemical analyses of gapped-dipeptide signatures. (A) Amino acid compositions and (B) grouped amino acid compositions of gapped-dipeptide signatures for nuclear and non-nuclear proteins.
Performance comparison using gapped-dipeptide signatures
| Training Set (by five-fold cross-validation) | ||||||||
|---|---|---|---|---|---|---|---|---|
| Method | tp | tn | fp | fn | Sens | Spec | Acc | MCC |
| PSLNTS | 2,429 | 1,978 | 628 | 413 | ||||
| PSLNuc | 2,413 | 1,964 | 642 | 429 | 0.815 | 0.779 | 0.797 | 0.595 |
| NUCLEO | 2,157 | 1,924 | 682 | 685 | 0.759 | 0.760 | 0.749 | 0.497 |
Performance comparison of PSLNuc, PSLNTS, and NUCLEO.
Figure 2. Construction of a smoothed PSSM profile. Transformation of (A) a standard PSSM profile into (B) a smoothed PSSM profile using w as 7.
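The caption gives only the window size w = 7, not the smoothing rule. One common way to smooth a PSSM, assumed here and not confirmed by this record, is to replace each row with the sum of the rows inside a window of width w centered on that position, treating positions beyond either end of the sequence as zero vectors. The function name `smooth_pssm` and the toy matrix are hypothetical; a real PSSM has 20 columns.

```python
def smooth_pssm(pssm, w=7):
    """Sum each row with its neighbours inside a centred window of odd width w.

    Assumed scheme: rows outside the sequence count as all-zero rows.
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = w // 2
    n = len(pssm)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = pssm[lo:hi]  # out-of-range rows are skipped, i.e. contribute zero
        smoothed.append([sum(column) for column in zip(*window)])
    return smoothed

# Toy 4 x 2 profile (two columns only, for readability).
toy_pssm = [[1, 0], [2, 1], [3, 2], [4, 3]]
toy_smoothed = smooth_pssm(toy_pssm, w=3)
print(toy_smoothed)
```

With w = 1 the profile is returned unchanged, and larger windows blend in more of each residue's sequence neighbourhood.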
Figure 3. System architecture of PSLNuc, based on support vector machines using reduced/transformed feature vectors.