Warin Wattanapornprom, Chinae Thammarongtham, Apiradee Hongsthong, Supatcha Lertampaiporn.
Abstract
The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy and reliability. The prediction of plant protein subcellular localization is challenging because the underlying problem is not only a multiclass problem but also a multilabel one. Generally, plant proteins can be found in 10-14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in other compartments (vacuole, peroxisome, Golgi, and cell wall), so the data are usually imbalanced. We therefore propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization, giving a feature space of 479 features in total. Feature selection methods were then used to reduce the dimensionality to smaller, informative feature subsets. The reduced feature subset was used to train three different individual models, and the outputs of these three classifiers were combined by average voting to return the final probability prediction. Based on the voting probability, the method can predict both single- and multilabel locations. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset.
Keywords: average voting; consensus voting; ensemble machine learning; feature extraction; feature selection; go term; plant protein; subcellular localization prediction
Year: 2021 PMID: 33808227 PMCID: PMC8066735 DOI: 10.3390/life11040293
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
Figure 1. Workflow of the program.
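The average-voting step described in the abstract can be illustrated with a minimal sketch. It assumes that each of the three classifiers returns one probability per location. The function names, the example probabilities, and the 0.3 cut-off used to report more than one location are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def average_vote(prob_sets: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-location probabilities returned by several classifiers."""
    n = len(prob_sets)
    return {loc: sum(p[loc] for p in prob_sets) / n for loc in prob_sets[0]}


def call_locations(avg: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Report every location whose averaged probability reaches the threshold.

    The threshold is an assumed value. If no location reaches it, fall back
    to the single top-scoring location.
    """
    hits = [loc for loc, p in avg.items() if p >= threshold]
    hits.sort(key=lambda loc: avg[loc], reverse=True)
    return hits or [max(avg, key=avg.get)]


# Hypothetical outputs of three heterogeneous classifiers for one protein.
model_a = {"nucleus": 0.50, "cytoplasm": 0.40, "plastid": 0.10}
model_b = {"nucleus": 0.45, "cytoplasm": 0.45, "plastid": 0.10}
model_c = {"nucleus": 0.55, "cytoplasm": 0.35, "plastid": 0.10}

avg = average_vote([model_a, model_b, model_c])
print(call_locations(avg))  # ['nucleus', 'cytoplasm']
```

With these made-up inputs two locations clear the cut-off, so the protein would be reported as a multilocation (Cyto-Nucleus) case; a protein with one dominant probability would yield a single label.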
Number of proteins from each location in the training and testing datasets.
| Type | Subcellular Location | Training Data | Training Data | Testing Data |
|---|---|---|---|---|
| Single location | Plastid | 2468 | 533 | 248 |
| | Cytoplasm | 351 | 351 | 40 |
| | Extracellular | 140 | 140 | 14 |
| | Nucleus | 568 | 568 | 63 |
| | Mitochondrion | 447 | 447 | 52 |
| | Cell membrane | 829 | 438 | 92 |
| | Golgi Apparatus | 204 | 204 | 23 |
| | Endoplasmic reticulum | 280 | 280 | 29 |
| | Vacuole | 176 | 176 | 20 |
| | Peroxisome | 57 | 57 | 6 |
| | Cell wall | 37 | 37 | 5 |
| Multilocation | Mito-Plastid | 118 | 118 | 13 |
| | Cyto-Nucleus | 170 | 170 | 20 |
| | Cyto-Golgi | 34 | 34 | 4 |
Summary of features or descriptors that were used in this research.
| Features (Total = 479 Features) | Abbreviation |
|---|---|
| Amino acid Composition | AAC1-AAC20 |
| Amphiphilic PseAAC | APAAC1-APAAC30 |
| BLOSUM matrix-derived | Blosum1-Blosum8 |
| Composition descriptor of the CTD | CTDC1-CTDC21 |
| Distribution descriptor of the CTD | CTDD1-CTDD105 |
| Transition descriptor of the CTD | CTDT1-CTDT21 |
| Geary autocorrelation | Geary1-Geary40 |
| Pseudo amino acid composition | PAAC1-PAAC30 |
| Parallel pseudo amino acid composition | PsePC1-PsePC22 |
| Serial pseudo amino acid composition | PseSC1-PseSC26 |
| Net charge | Charge |
| Potential protein interaction index | Boman |
| Aliphatic index of protein | aIndex |
| Autocovariance index | autocov |
| Crosscovariance1 | Crosscov1 |
| Crosscovariance2 | Crosscov2 |
| Cruciani covariance index | Crucian1-Crucian3 |
| Factor analysis scales of generalized amino acid information | fasgai1-fasgai6 |
| Hmoment alpha helix | Hmoment1 |
| Hmoment beta sheet | Hmoment2 |
| Hydrophobicity index | hydrophobicity |
| Instability index | Instaindex |
| MS-WHIM scores derived from 36 electrostatic potential properties | mswhimscore1-mswhimscore3 |
| Isoelectric point (pI) | pI |
| Average of protFP | protFP1-protFP8 |
| ST-scale based on physicochemical properties | stscales1-stscales8 |
| T-scale based on physicochemical properties | tscales1-tscales5 |
| VHSE-scale based on physicochemical properties | vhsescales1-vhsescales8 |
| Z-scale based on physicochemical properties | zscales1-zscales5 |
| Quasi-sequence-order descriptor | QSO1-QSO60 |
| Sequence-order-coupling numbers | SOCN1-SOCN20 |
| Chloroplast transit peptide | cTP |
| Mitochondrial transit peptide | mTP |
| Signal peptide cleavage site score | SP |
| Number of predicted transmembrane segments | TM |
| Other location score from targetP | other |
| Nuclear localization signal | NLS |
| SVM score from Erpred | erpred |
| SubmitoPred (SVM_score_mito) | SVM_mito |
| SubmitoPred (SVM_inner_mem) | SVM_mem |
| SubmitoPred (SVM_inter_mem) | SVM_inter |
| SubmitoPred (SVM_score_matrix) | SVM_matrix |
| SubmitoPred (SVM_score_outer_mem) | SVM_outer |
| Aggregation (tango1) | Tango1 |
| Amyloid (tango2) | Tango2 |
| Turn-turns (tango3) | Tango3 |
| Alpha-helices (tango4) | Tango4 |
| Helical aggregation (tango5) | Tango5 |
| Beta-strands (tango6) | Tango6 |
| Homology based feature (GO term) | Homology |
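As an illustration of the simplest descriptor in the table, the sketch below computes an amino acid composition vector (the AAC1-AAC20 block) as the fraction of each of the 20 standard residues in a sequence. The function name, the alphabetical ordering of residues, and the choice to ignore non-standard characters are assumptions made for this example; the paper's own extraction tools may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, assumed ordering


def amino_acid_composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence.

    Characters outside the 20 standard residues (e.g. X, B, Z) are ignored,
    which is an assumption of this sketch.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: residues.count(aa) / total for aa in AMINO_ACIDS}


# Toy sequence, not a real protein.
composition = amino_acid_composition("MKKA")
print(composition["K"])  # 0.5
```

The resulting 20 values would form one block of the feature vector, alongside the other descriptor groups listed above.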
Summary of cellular component GO terms used in this work.
| GO term; ‘Cellular component’ |
|---|
| GO:0005737; cytoplasm |
| GO:0005783; endoplasmic reticulum |
| GO:0005788; endoplasmic reticulum lumen |
| GO:0005789; endoplasmic reticulum membrane |
| GO:0005793; endoplasmic reticulum-Golgi intermediate compartment |
| GO:0005615; extracellular space |
| GO:0005794; Golgi apparatus |
| GO:0005796; Golgi lumen |
| GO:0000139; Golgi membrane |
| GO:0005739; mitochondrion |
| GO:0005740; mitochondrial envelope |
| GO:0005743; mitochondrial inner membrane |
| GO:0005758; mitochondrial intermembrane space |
| GO:0005759; mitochondrial matrix |
| GO:0031966; mitochondrial membrane |
| GO:0005741; mitochondrial outer membrane |
| GO:0005886; plasma membrane |
| GO:0005618; cell wall |
| GO:0005634; nucleus |
| GO:0009536; plastid |
| GO:0009528; plastid inner membrane |
| GO:0005777; peroxisome |
| GO:0005778; peroxisomal membrane |
| GO:0005773; vacuole |
| GO:0005774; vacuolar membrane |
| GO:0016020; membrane |
| GO:0009507; chloroplast |
Figure 2. Top 20 features that are highly correlated with each localization target.
Classification training performances for different feature subsets.
| Metric | | | | |
|---|---|---|---|---|
| ACC | 82.72% | 82.62% | 85.97% | 91.00% |
| MCC | 0.795 | 0.798 | 0.845 | 0.896 |
| AUC | 0.977 | 0.897 | 0.975 | 0.993 |
| | | | | |
| ACC | 92.02% | 91.51% | 93.87% | 93.76% |
| MCC | 0.907 | 0.902 | 0.932 | 0.929 |
| AUC | 0.995 | 0.991 | 0.993 | 0.996 |
| | | | | |
| ACC | 94.27% | 89.57% | 93.46% | 93.97% |
| MCC | 0.935 | 0.879 | 0.928 | 0.932 |
| AUC | 0.996 | 0.991 | 0.992 | 0.996 |
| | | | | |
| ACC | 94.48% | 93.05% | 95.30% | 94.68% |
| MCC | 0.938 | 0.921 | 0.948 | 0.940 |
| AUC | 0.996 | 0.994 | 0.996 | 0.997 |
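The MCC rows report the Matthews correlation coefficient. A minimal sketch of the standard binary (one-vs-rest) formula is given below, assuming it is applied per location from true/false positive and negative counts. The function name, the example counts, and the convention of returning 0 when the denominator is zero are assumptions of this sketch, not details confirmed by the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from one-vs-rest confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention (assumed here): undefined MCC is reported as 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: a class predicted perfectly gives MCC = 1,
# a class that is never predicted gives MCC = 0 under the convention above.
print(mcc(tp=5, tn=624, fp=0, fn=0))  # 1.0
print(mcc(tp=0, tn=625, fp=0, fn=4))  # 0.0
```

Unlike plain accuracy, this score stays informative for small classes, which matters for the imbalanced compartments described in the abstract.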
Classification performance of the heterogeneous ensemble for the independent testing dataset.
| Type | Subcellular Location | Testing Data | Correctly Predicted | Percent | MCC |
|---|---|---|---|---|---|
| Single location | Plastid | 248 | 238 | 95.97% | 0.756 |
| | Cytoplasm | 40 | 34 | 85% | 0.829 |
| | Extracellular | 14 | 9 | 64.28% | 0.756 |
| | Nucleus | 63 | 61 | 96.82% | 0.854 |
| | Mitochondrion | 52 | 31 | 59.61% | 0.708 |
| | Cell membrane | 92 | 81 | 88.04% | 0.792 |
| | Golgi Apparatus | 23 | 14 | 60.86% | 0.747 |
| | Endoplasmic reticulum | 29 | 25 | 86.21% | 0.710 |
| | Vacuole | 20 | 5 | 25% | 0.359 |
| | Peroxisome | 6 | 3 | 50% | 0.705 |
| | Cell wall | 5 | 5 | 100% | 1 |
| Multilocation | Mito-Plastid | 13 | 8 | 61.54% | 0.607 |
| | Cyto-Nucleus | 20 | 18 | 90% | 0.897 |
| | Cyto-Golgi | 4 | 0 | 0% | 0 |
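The per-location counts in this table can be pooled to check the headline figure. The short sketch below recomputes accuracy as correctly predicted sequences divided by total sequences, using only the counts listed above; the variable and function names are illustrative. Pooling all 14 rows gives 84.58%, which matches the overall accuracy quoted in the abstract, and the three multilocation rows alone give 70.27%.

```python
# (testing sequences, correctly predicted) per location, copied from the table.
single = {
    "Plastid": (248, 238), "Cytoplasm": (40, 34), "Extracellular": (14, 9),
    "Nucleus": (63, 61), "Mitochondrion": (52, 31), "Cell membrane": (92, 81),
    "Golgi Apparatus": (23, 14), "Endoplasmic reticulum": (29, 25),
    "Vacuole": (20, 5), "Peroxisome": (6, 3), "Cell wall": (5, 5),
}
multi = {"Mito-Plastid": (13, 8), "Cyto-Nucleus": (20, 18), "Cyto-Golgi": (4, 0)}


def accuracy(counts: dict) -> float:
    """Percentage of correctly predicted sequences over all sequences."""
    total = sum(n for n, _ in counts.values())
    correct = sum(c for _, c in counts.values())
    return 100.0 * correct / total


print(round(accuracy(single), 2))               # 85.47  (506 / 592)
print(round(accuracy(multi), 2))                # 70.27  (26 / 37)
print(round(accuracy({**single, **multi}), 2))  # 84.58  (532 / 629)
```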
Comparison of prediction accuracy on an independent dataset with that of existing tools that support multilabel localization. Accuracy is the number of correctly predicted sequences divided by the total number of sequences in the independent dataset, expressed as a percentage.
| Method | Machine Learning Technique | Accuracy (%) | Accuracy (%) |
|---|---|---|---|
| YLoc | Naïve Bayes | 34.35 | 35.89 |
| Euk-mPloc 2.0 | OET-KNN 1 | 53.5 | 44.86 |
| iLoc-Plant | ML-KNN 2 | 37.42 | 34.42 |
| Plant-mSubP | SVM 3 | 64.84 | 81.08 |
| Our model | Ensemble | 84.58 | 70.27 |
1 OET-KNN = Optimized Evidence-Theoretic K-Nearest Neighbor; 2 ML-KNN = Multi-labeled K-Nearest Neighbor; 3 SVM = Support Vector Machine.