Sitanshu S Sahu, Cristian D Loaiza, Rakesh Kaundal.
Abstract
The subcellular localization of a protein is central to characterizing its function in the cell. Accurate computational prediction of subcellular locations has been an active area of research, but most work has focused on single-localization prediction. Only a few studies have addressed multi-target localization, and they have not yet achieved good accuracy; in plant sciences, very little work has been done. Here we report the development of a novel tool, Plant-mSubP, which is based on integrated machine learning approaches to efficiently predict subcellular localizations in plant proteomes. The proposed approach predicts 11 single localizations and three dual localizations of the plant cell with high accuracy. Several features based on the composition and physicochemical properties of a protein, such as amino acid composition, pseudo amino acid composition, auto-correlation descriptors, quasi-sequence-order descriptors and hybrids of these, are used to represent the protein. The performance of the proposed method has been assessed on a training set as well as an independent test set. Using the hybrid feature of the pseudo amino acid composition, N-Center-C terminal amino acid composition and the dipeptide composition (PseAAC-NCC-DIPEP), an overall accuracy of 81.97 %, 84.75 % and 87.88 % is achieved on the training data sets containing single-label proteins, single- and dual-label proteins combined, and dual-label proteins, respectively. When tested on the independent data, an accuracy of 64.36 %, 64.84 % and 81.08 % is achieved on the single-label, combined single- and dual-label, and dual-label proteins, respectively. The prediction models have been implemented on a web server available at http://bioinfo.usu.edu/Plant-mSubP/. The results indicate that the proposed approach is comparable to the existing methods in single-localization prediction and outperforms all other existing tools on dual-label proteins. The prediction tool will be a useful resource for better annotation of various plant proteomes.
Keywords: Artificial intelligence; machine learning; multi-location; prediction tool; protein science; subcellular localization; web server
Year: 2019 PMID: 32528639 PMCID: PMC7274489 DOI: 10.1093/aobpla/plz068
Source DB: PubMed Journal: AoB Plants Impact factor: 3.276
Distribution of subcellular localization classes (single- and dual-located) for all plant data from UniProt database release 2018_02 in the training data set and independent testing data set. *About 10 % of sequences from the original training data set were kept separate for independent testing. In total, 16 494 plant protein sequences were found after applying the filters [viridiplantae AND annotation:(type: location confidence: experimental)].
| Type | Subcellular location | # sequences retrieved | # sequences after redundancy check (30 % cut-off) | *Training data set | Training data set (sequence length > 50) | Independent data set (sequence length > 50) |
|---|---|---|---|---|---|---|
| Single label | Plastid | 11 302 | 2979 | 2678 | 2468 | 248 |
| | Cytoplasm | 739 | 403 | 361 | 351 | 40 |
| | Extracellular | 237 | 186 | 166 | 140 | 14 |
| | Nucleus | 734 | 636 | 571 | 568 | 63 |
| | Mitochondrion | 759 | 537 | 481 | 447 | 52 |
| | Cell membrane | 1256 | 927 | 830 | 829 | 92 |
| | Golgi apparatus | 277 | 229 | 204 | 204 | 23 |
| | Endoplasmic reticulum | 393 | 320 | 285 | 280 | 29 |
| | Vacuole | 260 | 198 | 176 | 176 | 20 |
| | Peroxisome | 80 | 63 | 57 | 57 | 6 |
| | Cell wall | 52 | 47 | 42 | 37 | 5 |
| Dual label | Mito-plastid | 141 | 133 | 118 | 118 | 13 |
| | Cyto-nucleus | 210 | 196 | 175 | 170 | 20 |
| | Cyto-Golgi | 54 | 38 | 34 | 34 | 4 |
| Total | | 16 494 | 6892 | 6178 | 5879 | 629 |
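The caption states that about 10 % of the sequences were set aside for independent testing but does not say how they were chosen. The sketch below shows one plausible way to do this, a per-class (stratified) random hold-out; the stratification, the seed, and the function name `stratified_holdout` are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Split sample indices into (train, test), holding out roughly
    `test_fraction` of every class.

    ASSUMPTION: per-class sampling and the fixed seed are illustrative;
    the source only says that about 10 % of sequences were kept separate.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Hold out at least one sample so small classes are still tested.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, not the real data set.
labels = ["Plastid"] * 50 + ["Nucleus"] * 20 + ["Cyto-Golgi"] * 10
train_idx, test_idx = stratified_holdout(labels)
```

Holding out per class keeps rare locations such as cell wall or Cyto-Golgi represented in the test set, which a plain random 10 % draw would not guarantee.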
Group attributes and classification of various amino acids in a protein, as defined in Dubchak.
| Attribute | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| Hydrophobicity | Polar | Neutral | Hydrophobic |
| | R, K, E, D, Q, N | G, A, S, T, P, H, Y | C, L, V, I, M, F, W |
| Normalized van der Waals volume | 0–2.78 | 2.95–4.0 | 4.03–8.08 |
| | G, A, S, T, P, D, C | N, V, E, Q, I, L | M, H, K, F, R, Y, W |
| Polarity | 4.9–6.2 | 8.0–9.2 | 10.4–13.0 |
| | L, I, F, W, C, M, V, Y | P, A, T, G, S | H, Q, R, K, N, E, D |
| Polarizability | 0–0.108 | 0.128–0.186 | 0.219–0.409 |
| | G, A, S, D, T | C, P, N, V, E, Q, I, L | K, M, H, F, R, Y, W |
| Charge | Positive | Neutral | Negative |
| | K, R | A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V | D, E |
| Secondary structure | Helix | Strand | Coil |
| | E, A, L, M, Q, K, R, H | V, I, Y, C, W, F, T | G, N, P, S, D |
| Solvent accessibility | Buried | Exposed | Intermediate |
| | A, L, F, C, G, I, V, W | R, K, Q, E, N, D | M, S, P, T, H, Y |
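To make the grouping concrete, the sketch below computes a composition descriptor (the fraction of residues in each group) and a transition descriptor (the fraction of adjacent residue pairs that switch between two groups) for the hydrophobicity attribute. The function names and the choice to skip residues outside the 20 standard amino acids are assumptions for illustration; the tool's own implementation is not shown in the source.

```python
from itertools import combinations

# Hydrophobicity grouping copied from the table (Dubchak's scheme).
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def encode(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Map each residue to its group name.

    ASSUMPTION: residues that belong to no group (e.g. X) are skipped.
    """
    encoded = []
    for aa in sequence.upper():
        for name, members in groups.items():
            if aa in members:
                encoded.append(name)
                break
    return encoded


def ctd_composition(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Fraction of residues that fall in each group."""
    encoded = encode(sequence, groups)
    total = len(encoded)
    return {name: (encoded.count(name) / total if total else 0.0)
            for name in groups}


def ctd_transition(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Fraction of adjacent pairs that change between two given groups."""
    encoded = encode(sequence, groups)
    pairs = list(zip(encoded, encoded[1:]))
    result = {}
    for a, b in combinations(groups, 2):
        hits = sum(1 for x, y in pairs if {x, y} == {a, b})
        result[f"{a}-{b}"] = hits / len(pairs) if pairs else 0.0
    return result


composition = ctd_composition("RKGACL")
transition = ctd_transition("RKGACL")
```

The same two functions apply to any other attribute in the table by passing a different `groups` dictionary.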
Figure 1. Andrews plot of amino acid composition (AAC) feature for all the single- and dual-label localizations.
Figure 2. Andrews plot of PseAAC-NCC-DIPEP feature for all the single- and dual-label localizations.
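As a rough illustration of the sequence features named in the figures, the sketch below computes amino acid composition (AAC), dipeptide composition (DIPEP) and an N-Center-C terminal composition (NCC), then concatenates NCC and DIPEP. The terminal segment lengths (`n_len`, `c_len`) are assumptions, since the record does not state how the sequence is split, and PseAAC is left out because it needs physicochemical scales that are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """20-dimensional amino acid composition (fractions summing to 1)."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def dipep(sequence):
    """400-dimensional dipeptide composition over adjacent residue pairs."""
    seq = "".join(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return [pairs.count(a + b) / n if n else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]


def ncc(sequence, n_len=25, c_len=25):
    """AAC of the N-terminal, center and C-terminal parts (60 values).

    ASSUMPTION: 25-residue terminal segments are a placeholder; the
    split used by Plant-mSubP is not stated in this record.
    """
    seq = sequence.upper()
    n_part = seq[:n_len]
    c_part = seq[-c_len:] if c_len else ""
    center = seq[n_len:len(seq) - c_len]
    return aac(n_part) + aac(center) + aac(c_part)


def ncc_dipep(sequence):
    """Hybrid feature: NCC followed by DIPEP (460 values)."""
    return ncc(sequence) + dipep(sequence)


# Arbitrary 60-residue toy sequence, not a real protein.
toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK"
features = ncc_dipep(toy)
```

Concatenation is the simplest way to build a hybrid feature vector; each block keeps its own normalization, so no block dominates purely because of its length.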
(a) Performance comparison by 5-fold cross-validation on the training data set of single-label proteins using SVMs; (b) performance comparison by 5-fold cross-validation on the combined training data set (single- + dual-label) using SVMs; (c) performance comparison by 5-fold cross-validation on the dual-localized training data set using SVMs. Bold values represent the best performance. RBF = radial basis function kernel of the SVM; C = regularization parameter.
(a)

| Feature representation methods | Overall accuracy (%) (single-label data) |
|---|---|
| AAC (σ = 2, C = 10) | 73.65 |
| DIPEP (σ = 50, C = 500) | 77.56 |
| PseAAC (σ = 10, C = 500) | 75.49 |
| NCC (σ = 10, C = 50) | 74.36 |
| **PseAAC-NCC-DIPEP** | **81.97** |
| NCC-DIPEP (σ = 50, C = 500) | 81.18 |
| QSO (σ = 10, C = 500) | 73.25 |
| NCC-DIPEP-CTDC-CTDT-QSO (σ = 5, C = 30) | 80.42 |

(b)

| Feature representation methods | Overall accuracy (%) (single- + dual-label data) |
|---|---|
| AAC (σ = 2, C = 10) | 68.48 |
| DIPEP (σ = 50, C = 500) | 74.59 |
| PseAAC (σ = 10, C = 500) | 71.87 |
| NCC (σ = 10, C = 50) | 70.74 |
| **PseAAC-NCC-DIPEP** | **84.75** |
| NCC-DIPEP (σ = 50, C = 500) | 83.96 |
| Physicochem [atomi + hydrophobicity, basic] | 73.21 |
| NCC-DIPEP-physicochem | 83.79 |
| Quasi-sequence-order descriptors | 54.38 |
| NCC-DIPEP-CTDC-CTDT-QSO | 60.02 |

(c)

| Model | Kernel | C | Gamma | Overall accuracy (%) (dual-label data) |
|---|---|---|---|---|
| AAC | RBF | 10 | 0.001 | 76.64 |
| DIPEP | RBF | 10 | 0.001 | 82.29 |
| PseAAC | RBF | 10 | 0.001 | 77.63 |
| NCC | RBF | 10 | 0.001 | 86.02 |
| NCC-DIPEP | RBF | 10 | 0.001 | 87.57 |
| **PseAAC-NCC-DIPEP** | RBF | 10 | 0.001 | **87.88** |
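The tables report SVM settings as a kernel width (σ or gamma) and a regularization parameter C, evaluated by 5-fold cross-validation. The sketch below shows the RBF kernel itself and a simple 5-fold index split in plain Python. The conversion gamma = 1/(2σ²) is a common convention and an assumption here, because the record does not define σ, and the interleaved fold assignment is likewise only illustrative.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gamma_from_sigma(sigma):
    """Convert a kernel width sigma to gamma.

    ASSUMPTION: uses K = exp(-||x - y||^2 / (2 * sigma^2)); the source
    does not say which parameterization its sigma values follow.
    """
    return 1.0 / (2.0 * sigma ** 2)


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    ASSUMPTION: samples are dealt to folds in round-robin order; a real
    run would normally shuffle and stratify by class first.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(i for f, fold in enumerate(folds)
                       if f != held_out for i in fold)
        yield train, test


splits = list(k_fold_indices(10, k=5))
similarity = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.001)
```

In practice a library SVM would replace the hand-written kernel; scikit-learn's `SVC`, for example, takes `kernel="rbf"`, `C` and `gamma` directly, matching the columns of panel (c).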
(a) Comparison of prediction results on the independent data set for models trained on single-label proteins using SVMs; (b) comparison of prediction results on the independent data set for models trained on the combined data set (single- + dual-label); (c) comparison of prediction results on the independent data set for models trained on the dual-label proteins data set. Bold values represent the best performance.
(a)

| Feature representation methods | Accuracy (%) |
|---|---|
| AAC (σ = 2, C = 10) | 59.11 |
| DIPEP (σ = 50, C = 500) | 59.11 |
| PseAAC (σ = 10, C = 500) | 59.12 |
| NCC (σ = 10, C = 50) | 50.34 |
| **PseAAC-NCC-DIPEP** | **64.36** |
| NCC-DIPEP (σ = 50, C = 500) | 64.05 |
| QSO (σ = 10, C = 500) | 57.05 |
| NCC-DIPEP-CTDC-CTDT-QSO (σ = 5, C = 300) | 61.46 |

(b)

| Feature representation methods | Accuracy (%) |
|---|---|
| AAC (σ = 2, C = 10) | 57.71 |
| DIPEP (σ = 50, C = 500) | 58.95 |
| PseAAC (σ = 10, C = 500) | 56.60 |
| NCC (σ = 10, C = 50) | 52.88 |
| **PseAAC-NCC-DIPEP** | **64.84** |
| NCC-DIPEP (σ = 50, C = 500) | 64.42 |
| Quasi-sequence-order descriptors | 58.94 |
| NCC-DIPEP-CTDC-CTDT-QSO | 38.49 |

(c)

| Model | Kernel | C | Gamma | Accuracy (%) |
|---|---|---|---|---|
| AAC | RBF | 10 | 0.001 | 72.56 |
| DIPEP | RBF | 10 | 0.001 | 72.97 |
| PseAAC | RBF | 10 | 0.001 | 75.67 |
| NCC | RBF | 10 | 0.001 | 78.37 |
| NCC-DIPEP | RBF | 10 | 0.001 | 75.67 |
| **PseAAC-NCC-DIPEP** | RBF | 10 | 0.001 | **81.08** |
Comparison of the actual prediction accuracy of Plant-mSubP on the independent data set with that of existing web tools that support multi-label localizations. Actual accuracy (in percent) is the number of correctly predicted localization samples divided by the total number of samples in the independent data set.
| Web tools | Prediction accuracy (%) (single- + dual-label data) | Prediction accuracy (%) (dual-label data) |
|---|---|---|
| YLoc | 34.35 | 35.89 |
| Euk-mPloc 2.0 | 53.5 | 44.86 |
| iLoc-Plant | 37.42 | 34.42 |
| Our method [Plant-mSubP] | 64.84 | 81.08 |
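The accuracy defined in the caption can be sketched as below. Treating a dual-label protein as correct only when both locations are predicted (an exact match of the label set) is an assumption; the record does not say how partially correct dual-label predictions were scored. The function name and the toy labels are hypothetical.

```python
def actual_accuracy(true_labels, predicted_labels):
    """Percentage of samples whose predicted label set equals the truth.

    ASSUMPTION: exact-match scoring, so a dual-label protein with only
    one of its two locations predicted counts as wrong.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(set(t) == set(p)
                  for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Toy example, not data from the paper.
truth = [{"plastid"}, {"cytoplasm", "nucleus"},
         {"mitochondrion", "plastid"}, {"nucleus"}]
predicted = [{"plastid"}, {"cytoplasm", "nucleus"},
             {"plastid"}, {"nucleus"}]
score = actual_accuracy(truth, predicted)
```

Here three of the four toy samples match exactly; the third is counted as wrong because only one of its two locations was predicted.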