Guo Sheng Han, Zu Guo Yu, Vo Anh, Anaththa P D Krishnajith, Yu-Chu Tian.
Abstract
BACKGROUND: Predicting protein subnuclear localization is a challenging problem. Previous approaches based on non-sequence information, including Gene Ontology annotations and kernel fusion, each have their own limitations. The aim of this work is twofold: first, to propose a novel individual feature extraction method; second, to develop an ensemble method that improves prediction performance by using comprehensive information, represented as a high-dimensional feature vector obtained from 11 feature extraction methods.
Year: 2013 PMID: 23460833 PMCID: PMC3584121 DOI: 10.1371/journal.pone.0057225
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Amino acid classifications.
| Method | Number | Amino acid classification | Reference |
| HP | 2 | (ALIMFPWV) (DENCQGSTYRHK) | |
| DHP | 4 | (ALVIFWMP) (STYCNGQ) (KRH) (DE) | |
| 7-Cat | 7 | (AGV) (ILFP) (YMTS) (HNQW) (RK) (DE) C | |
| 20-Cat | 20 | A G V I L F P Y M T S H N Q W R K D E C | - |
| ms | 6 | (AVLIMC) (WYHF) (TQSN) (RK) (ED) (GP) | |
| lesk | 6 | (AST) (CVILWYMPF) (HQN) (RK) (ED) G | |
| F-Ic4 | 7 | (AWM) (GST) (HPY) (CVIFL) (DNQ) (ER) K | |
| F-Ic2 | 9 | (AWM) (GS) (HPY) (CVI) (FL) (DNQ) (ER) K T | |
| F-IIIc4 | 9 | (ACV) (HPL) (DQ) S (ERGN) F (IMT) (KW) Y | |
| F-Vc4 | 8 | (AWHC) G (LEPV) (KYMT) (IN) Q D S | |
| Murphy8 | 8 | (LVMIC) (AG) (ST) P (FYW) (DENQ) (KR) H | |
| Murphy15 | 15 | (LVIM) C A G S T P (FY) W E D N Q (KR) H | |
| Letter12 | 12 | (LVIM) C (AG) (ST) P (FY) W (ED) N Q (KR) H | |
| Hydrophobicity | 3 | (RKEDQN) (GASTPHY) (CLVIMFW) | |
| Normalized van der Waals | 3 | (GASTPD) (NVEQIL) (MHKFRYW) | |
| Polarity | 3 | (LIFWCMVY) (PATGS) (HQRKNED) | |
| Polarizability | 3 | (GASDT) (CPNVEQIL) (KMHFRYW) | |
| Charge | 3 | (KR) (ANCQGHILMFPSTWYV) (DE) | |
| Secondary structure | 3 | (EALMQKRH) (VIYCWFT) (GNPSD) | |
| Solvent accessibility | 3 | (ALFCGIVW) (RKQEND) (MPSTHY) | |
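Each classification in the table above partitions the amino acids into a small number of groups, so a protein sequence can be rewritten over a reduced alphabet. The following is a minimal sketch of that mapping, not the authors' implementation: the groupings are copied from three rows of the table, while the names `CLASSIFICATIONS`, `build_lookup`, `reduce_sequence` and `group_composition` are hypothetical, and the composition feature is only an illustrative choice.

```python
# Reduced-alphabet encoding of a protein sequence.
# Groupings are copied from the classification table; all names are illustrative.
CLASSIFICATIONS = {
    "HP": ["ALIMFPWV", "DENCQGSTYRHK"],
    "DHP": ["ALVIFWMP", "STYCNGQ", "KRH", "DE"],
    "7-Cat": ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"],
}


def build_lookup(groups):
    """Map each amino acid letter to the index of the group containing it."""
    return {aa: idx for idx, group in enumerate(groups) for aa in group}


def reduce_sequence(sequence, groups):
    """Return the group index of every residue; unknown letters raise KeyError."""
    lookup = build_lookup(groups)
    return [lookup[aa] for aa in sequence.upper()]


def group_composition(sequence, groups):
    """Fraction of residues in each group (an assumed, simple feature)."""
    reduced = reduce_sequence(sequence, groups)
    return [reduced.count(i) / len(reduced) for i in range(len(groups))]
```

For example, `reduce_sequence("AKDE", CLASSIFICATIONS["DHP"])` places A in the first group, K in the third, and D and E in the fourth.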
30 physicochemical properties of amino acids selected from AAindex database.
| AAindex | Physicochemical property | Range of property |
| BULH740101 | Transfer free energy to surface | [−2.46 0.16] |
| BULH740102 | Apparent partial specific volume | [0.558 0.842] |
| PONP800106 | Surrounding hydrophobicity in turn | [10.53 13.86] |
| PONP800104 | Surrounding hydrophobicity in alpha-helix | [10.98 14.08] |
| PONP800105 | Surrounding hydrophobicity in beta-sheet | [11.79 16.49] |
| PONP800106 | Surrounding hydrophobicity in turn | [9.93 15.00] |
| MANP780101 | Average surrounding hydrophobicity | [11.36 15.71] |
| EISD840101 | Consensus normalized hydrophobicity scale | [−1.76 0.73] |
| JOND750101 | Hydrophobicity | [0.00 3.15] |
| HOPT810101 | Hydrophilicity value | [−3.4 3.00] |
| PARJ860101 | HPLC parameter | [−10.00 10.00] |
| JANJ780101 | Average accessible surface area | [22.8 103.0] |
| PONP800107 | Accessibility reduction ratio | [2.12 7.69] |
| CHOC760102 | Residue accessible surface area in folded protein | [18 97] |
| ROSG850101 | Mean area buried on transfer | [62.9 224.6] |
| ROSG850102 | Mean fractional area loss | [0.52 0.91] |
| BHAR880101 | Average flexibility indices | [0.295 0.544] |
| KARP850101 | Flexibility parameter for no rigid neighbors | [0.925 1.169] |
| KARP850102 | Flexibility parameter for one rigid neighbor | [0.862 1.085] |
| KARP850103 | Flexibility parameter for two rigid neighbors | [0.803 1.057] |
| JANJ780102 | Percentage of buried residues | [3 74] |
| JANJ780103 | Percentage of exposed residues | [5 85] |
| LEVM780101 | Normalized frequency of alpha-helix, with weights | [0.90 1.47] |
| LEVM780102 | Normalized frequency of beta-sheet, with weights | [0.72 1.49] |
| LEVM780103 | Normalized frequency of reverse turn, with weights | [0.41 1.91] |
| GRAR740102 | Polarity | [4.9 13.0] |
| GRAR740103 | Volume | [3 170] |
| MCMT640101 | Refractivity | [0.00 42.35] |
| PONP800108 | Average number of surrounding residues | [4.88 7.86] |
| KYTJ820101 | Hydropathy index | [−4.5 4.5] |
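A physicochemical property turns a protein sequence into a numeric series, one value per residue. The sketch below illustrates this with the Kyte-Doolittle hydropathy index (KYTJ820101), whose values span the range [−4.5 4.5] listed in the table. Using the listed range for min-max scaling is an assumption made here for illustration; this section does not state how the authors use the ranges, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index (AAindex entry KYTJ820101).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_numeric_series(sequence, scale=KYTE_DOOLITTLE):
    """Replace every residue by its property value."""
    return [scale[aa] for aa in sequence.upper()]


def min_max_normalize(values, low, high):
    """Rescale values linearly from [low, high] to [0, 1] (assumed normalization)."""
    return [(v - low) / (high - low) for v in values]


# Usage: hydropathy profile of a short peptide, scaled with the range from the table.
profile = min_max_normalize(to_numeric_series("IRAG"), -4.5, 4.5)
```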
Figure 1. The architecture of our method.
Comparison of the overall prediction accuracy between different feature extraction methods.
| Feature extraction method | Grid search | | GFO | |
| | Overall accuracy | CPU time (hr) | Overall accuracy | CPU time (hr) |
| Combination1 | 69.84 (59.76) | 2.704 | 70.83 (62.50) | 0.406 |
| RQA | 53.57 (45.12) | 2.174 | 52.98 (44.82) | 0.413 |
| HHT | 63.49 (60.37) | 2.336 | 65.87 (64.63) | 0.427 |
| PPDD | 56.55 (53.35) | 2.213 | 58.53 (59.15) | 0.414 |
| DWT | 57.54 (52.74) | 2.035 | 56.15 (50.91) | 0.402 |
| Combination2 | 77.78 (70.12) | 11.056 | 75.20 (69.82) | 2.303 |
Note: values on the new independent dataset are shown in parentheses.
Figure 2. The ROC curves for all binary classifications.
The uppercase letters B, L, S, C, P and N correspond to the six subnuclear locations: PML body, nuclear lamina, nuclear speckles, chromatin, nucleoplasm and nucleolus, respectively.
Performance comparison on Lei’s benchmark data set.
| Subnuclear localization | Size | SVM ensemble | | Go-AA | | Spectrum kernel | | | Our method | | |
| | | | MCC | | MCC | | | MCC | | | MCC |
| PML Body | 38 | 29.0 | 0.172 | 34.2 | 0.253 | 11.1 | 10.5 | 0.046 | 86.1 (85.3) | | |
| Nuclear Lamina | 55 | 43.6 | 0.338 | 63.6 | 0.578 | 51.9 | 50.9 | 0.461 | 91.0 (91.9) | 69.1 ( | 0.534 ( |
| Nuclear Speckles | 56 | 35.7 | 0.363 | 62.5 | 0.607 | 86.7 | | | 91.8 (91.1) | 62.5 (53.6) | 0.503 (0.460) |
| Chromatin | 61 | 19.7 | 0.260 | 60.7 | 0.518 | 64.3 | 59.0 | 0.570 | 93.1 (93.1) | | |
| Nucleoplasm | 75 | 22.7 | 0.206 | 56.0 | 0.504 | 52.6 | 54.7 | 0.465 | 90.8 (89.2) | 64.0 ( | |
| Nucleolus | 219 | 76.7 | 0.367 | 79.0 | 0.656 | 89.8 | | | 78.6 (75.9) | 93.6 (91.3) | 0.726 (0.570) |
| OA for single-localization | | 50.0 | | 66.5 | | 71.2 | | | 77.8 (75.2) | | |
| OA for multi-localization | | 65.2 | | 65.2 | | - | | | | | |
Note: values for the models using GFO are shown in parentheses.
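The table reports a Matthews correlation coefficient (MCC) per location and an overall accuracy (OA) per method. As a reminder of how these two measures are commonly computed, here is a minimal sketch using the standard definitions for a one-versus-rest binary confusion matrix. The function names are hypothetical, and returning 0 when the MCC denominator vanishes is a convention assumed here, not something stated in the source.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0


def overall_accuracy(true_labels, predicted_labels):
    """Percentage of samples whose predicted location matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```

A perfect binary classifier gives `mcc(10, 10, 0, 0) == 1.0`, and a classifier that is always wrong gives `mcc(0, 0, 5, 5) == -1.0`.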
Performance comparison on SNL9 benchmark data set.
| Subnuclear localization | Size | MCC | |
| | | Nuc-Ploc | Our method |
| Chromatin | 99 | 0.60 | 0.64 |
| Heterochromatin | 22 | 0.52 | 0.27 |
| Nuclear envelope | 61 | 0.53 | 0.58 |
| Nuclear matrix | 29 | 0.52 | 0.56 |
| Nuclear pore complex | 79 | 0.70 | 0.70 |
| Nuclear speckle | 67 | 0.43 | 0.62 |
| Nucleolus | 307 | 0.57 | 0.69 |
| Nucleoplasm | 37 | 0.31 | 0.55 |
| Nuclear PML body | 13 | 0.32 | 0.43 |
| | | 67.4% | 72.1% |
Note: the values for Nuc-Ploc are taken directly from the original paper [26].
Comparisons of Combination2 with the individual methods on the new independent dataset.
| Methods | Grid search | GFO |
| | P-values | P-values |
| Combination1 | 0.022 | 0.028 |
| RQA | 4.461e−4 | 3.494e−4 |
| HHT | 0.037 | 0.025 |
| PPDD | 0.005 | 0.004 |
| DWT | 0.003 | 0.001 |
Comparisons with other popular classifiers on the new independent dataset.
| Methods | Traditional SVM | | Random Forest | |
| | weight | without weight | weight | without weight |
| Combination1 | 59.45 | 57.62 | 58.54 | 57.32 |
| RQA | 45.73 | 45.73 | 45.73 | 44.82 |
| HHT | 59.76 | 56.10 | 57.93 | 56.10 |
| PPDD | 58.54 | 57.93 | 55.49 | 55.18 |
| DWT | 57.62 | 55.49 | 52.74 | 51.52 |
| Combination2 | 66.16 | 64.63 | 64.02 | 63.11 |
Figure 3. Comparisons of error rate (percentage of misclassified samples) over 50 runs of randomization analysis.
Random 1: feature subsets selected at random from the original features, with size equal to one-fourth of the number of optimal features; Random 2: size equal to one-half of the number of optimal features; Random 3: size equal to the number of optimal features.
Figure 4. Comparisons of error rate (percentage of misclassified samples) over 50 runs of permutation analysis.
The original class memberships of all samples are randomly shuffled 50 times and then used together with the original optimal features for classification, with the same cross-validation as applied to the original dataset.
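The permutation analysis described in this caption can be sketched as follows: shuffle the class labels, keep the features fixed, re-run the same evaluation, and collect the error rate of each run. This is only an illustration under stated assumptions. The leave-one-out 1-nearest-neighbour evaluator and the toy data stand in for the authors' classifier, cross-validation scheme and dataset, and the names `loo_1nn_error` and `permutation_errors` are hypothetical.

```python
import random


def loo_1nn_error(features, labels):
    """Leave-one-out error rate (%) of a 1-nearest-neighbour classifier.

    Stand-in evaluator: the source uses its own classifier and cross-validation.
    """
    errors = 0
    for i, x in enumerate(features):
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, features[j])),
        )
        errors += labels[nearest] != labels[i]
    return 100.0 * errors / len(features)


def permutation_errors(features, labels, evaluate=loo_1nn_error, n_runs=50, seed=0):
    """Error rates obtained after shuffling the class labels n_runs times."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        shuffled = list(labels)  # copy, so the original labels stay untouched
        rng.shuffle(shuffled)
        results.append(evaluate(features, shuffled))
    return results


# Toy data (made up for illustration): two well-separated clusters.
features = [(float(i), 0.0) for i in range(10)] + [(100.0 + i, 0.0) for i in range(10)]
labels = ["A"] * 10 + ["B"] * 10

original_error = loo_1nn_error(features, labels)
permuted_errors = permutation_errors(features, labels)
```

If the features carry real class information, the error rate on the original labels should sit well below the error rates obtained with shuffled labels.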