Abstract
A wide variety of methods have been proposed to improve prediction accuracy in protein subnuclear localization. However, one prominent trend among these methods is to fuse multiple feature representations, and the fusion process is time-consuming. In view of this, this paper proposes a method that combines a new single feature representation with a new algorithm to obtain a good recognition rate. Specifically, based on the position-specific scoring matrix (PSSM), we propose a new expression, the correlation position-specific scoring matrix (CoPSSM), as the protein feature representation. Based on the classic nonlinear dimension reduction algorithm, kernel linear discriminant analysis (KLDA), we add a new discriminant criterion and propose a dichotomous greedy genetic algorithm (DGGA) to intelligently select its kernel bandwidth parameter. Two public datasets were used for the numerical experiments, with the Jackknife test and a KNN classifier. The results show that the overall success rate (OSR) with the single representation CoPSSM is higher than that with many related representations. The OSR of the proposed method reaches 87.444% and 90.3361% on these two datasets, respectively, outperforming many current methods. To show the generalization of the proposed algorithm, two additional standard protein subcellular datasets were chosen for an extended experiment, and the prediction accuracy under the Jackknife test and the Independent test remains considerable.
Year: 2018 PMID: 29649330 PMCID: PMC5896989 DOI: 10.1371/journal.pone.0195636
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Abbreviations for the corresponding terms.
| Number | Full Name of Term | Abbreviation |
|---|---|---|
| 1 | Pseudo-amino acid composition | PseAAC |
| 2 | Position-specific scoring matrix | PSSM |
| 3 | Correlation position-specific scoring matrix | CoPSSM |
| 4 | Kernel linear discriminant analysis | KLDA |
| 5 | K-nearest-neighbor | KNN |
| 6 | Genetic algorithm | GA |
| 7 | Dichotomous greedy genetic algorithm | DGGA |
| 8 | Overall success rate | OSR |
| 9 | True positive | TP |
| 10 | True negative | TN |
| 11 | False positive | FP |
| 12 | False negative | FN |
| 13 | Sensitivity | SE |
| 14 | Specificity | SP |
| 15 | Accuracy | ACC |
| 16 | Matthews correlation coefficient | MCC |
Composition of the protein benchmark datasets (Dataset 1: ten subnuclear locations; Dataset 2: nine subnuclear locations).
| Dataset 1 class | Dataset 1 subnuclear location | Dataset 1 number | Dataset 2 class | Dataset 2 subnuclear location | Dataset 2 number |
|---|---|---|---|---|---|
| 1 | Centromere | 86 | 1 | Chromatin | 99 |
| 2 | Chromosome | 113 | 2 | Heterochromatin | 22 |
| 3 | Nuclear envelope | 17 | 3 | Nuclear envelope | 61 |
| 4 | Nuclear matrix | 18 | 4 | Nuclear matrix | 29 |
| 5 | Nuclear pore complex | 12 | 5 | Nuclear pore complex | 79 |
| 6 | Nuclear speckle | 50 | 6 | Nuclear speckle | 67 |
| 7 | Nucleolus | 294 | 7 | Nucleolus | 307 |
| 8 | Nucleoplasm | 30 | 8 | Nucleoplasm | 37 |
| 9 | Telomere | 37 | 9 | Nuclear PML body | 13 |
| 10 | Nuclear PML body | 12 | |||
Composition of the protein benchmark datasets for the extended experiment.
| Class | Subcellular Location | Number of Dataset 3 | Number of Dataset 4 |
|---|---|---|---|
| 1 | Cytoplasm | 152 | 210 |
| 2 | Extracell | 76 | 20 |
| 3 | Fimbrium | 12 | 4 |
| 4 | Flagellum | 6 | 1 |
| 5 | Inner membrane | 186 | 345 |
| 6 | Nucleoid | 6 | 1 |
| 7 | Outer membrane | 103 | 13 |
| 8 | Periplasm | 112 | 49 |
Fig 1. The DGGA algorithm for searching the kernel parameter.
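The kernel bandwidth that DGGA searches for enters KLDA through the kernel function. The following is a minimal sketch, assuming a Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). The source names only a "kernel bandwidth parameter", so the kernel form, the function name `gaussian_kernel_matrix`, and the toy inputs are all assumptions made for illustration. The sketch shows how the bandwidth shapes the kernel matrix; it does not implement DGGA itself.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix for the rows of X (assumed kernel form)."""
    X = np.asarray(X, dtype=float)
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Squared Euclidean distances via the Gram-matrix identity.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))


# Hypothetical toy samples, 5 units apart.
X_toy = np.array([[0.0, 0.0], [3.0, 4.0]])
K_narrow = gaussian_kernel_matrix(X_toy, sigma=1.0)  # samples look unrelated
K_wide = gaussian_kernel_matrix(X_toy, sigma=5.0)    # samples look similar
```

A small bandwidth drives off-diagonal entries toward zero and a large one drives them toward one, which is why the choice of this parameter matters for any kernel-based discriminant analysis.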
Fig 2. The flowchart for predicting protein subnuclear location.
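The evaluation step of this pipeline, a KNN classifier scored by the Jackknife (leave-one-out) test, can be sketched as follows. This is an illustrative sketch only: it assumes Euclidean distance and majority voting, and the function name `jackknife_knn_osr` and the toy data are hypothetical. It operates on whatever feature vectors are supplied and does not include the CoPSSM or KLDA stages.

```python
from collections import Counter

import numpy as np


def jackknife_knn_osr(X, y, k=1):
    """Overall success rate of KNN under a leave-one-out (Jackknife) test."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all samples.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # A held-out sample must never count as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    correct = 0
    for i in range(len(X)):
        neighbors = np.argsort(d2[i], kind="stable")[:k]
        votes = Counter(y[neighbors].tolist())
        # On a tied vote, the label met first (the nearest one) wins.
        predicted = votes.most_common(1)[0][0]
        correct += int(predicted == y[i])
    return correct / len(X)


# Hypothetical toy data: two well-separated clusters.
X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y_toy = [0, 0, 0, 1, 1, 1]
osr_toy = jackknife_knn_osr(X_toy, y_toy, k=1)
```

Each sample is classified using all the others as the training set, so the result depends only on the data and on K, not on a random split.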
Prediction results for protein subnuclear localization on dataset 1, by feature representation.
| Class | PseAAC (40-D) | PSSM (400-D) | The proposed CoPSSM (210-D) |
|---|---|---|---|
| Class 1 | 30.2326% | 55.8140% | 70.9302% |
| Class 2 | 36.2832% | 41.5929% | 53.0973% |
| Class 3 | 29.4118% | | 23.5294% |
| Class 4 | 16.6667% | 22.2222% | 50% |
| Class 5 | 41.6667% | 58.3333% | 66.6667% |
| Class 6 | 32% | 30% | 26% |
| Class 7 | 86.0544% | 81.6327% | 78.5714% |
| Class 8 | | 10% | 23.3333% |
| Class 9 | 5.4054% | 10.8108% | 43.2432% |
| Class 10 | | | 25% |
* Class 1 ~ Class 10 denote Centromere, Chromosome, Nuclear envelope, Nuclear matrix, Nuclear pore complex, Nuclear speckle, Nucleolus, Nucleoplasm, Telomere and Nuclear PML body respectively.
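The dimensions in the table follow from the 20 amino-acid columns of a PSSM: a full 20 x 20 matrix has 400 entries, while a symmetric 20 x 20 matrix has only 20 * 21 / 2 = 210 distinct entries. The sketch below illustrates one way a 210-D vector can arise from an L x 20 PSSM, by keeping the upper triangle of a symmetric matrix of column inner products. This construction is an assumption made for illustration and is not the CoPSSM definition from the paper; the function name `symmetric_pair_features` and the random input are hypothetical.

```python
import numpy as np


def symmetric_pair_features(pssm):
    """Map an L x 20 matrix to a 210-D vector (illustrative construction)."""
    P = np.asarray(pssm, dtype=float)
    if P.ndim != 2 or P.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix")
    # 20 x 20 symmetric matrix of averaged column inner products.
    G = (P.T @ P) / P.shape[0]
    # Keep the upper triangle including the diagonal: 20 * 21 / 2 entries.
    rows, cols = np.triu_indices(20)
    return G[rows, cols]


# Hypothetical stand-in for a PSSM of a 30-residue protein.
rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(30, 20))
features = symmetric_pair_features(toy_pssm)
```

Because the symmetric matrix repeats every off-diagonal entry twice, dropping the lower triangle roughly halves the feature length without losing information.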
Prediction results for protein subnuclear localization on dataset 2, by feature representation.
| Class | PseAAC (40-D) | PSSM (400-D) | The proposed CoPSSM (210-D) |
|---|---|---|---|
| Class 1 | 48.4848% | 69.6970% | 59.5960% |
| Class 2 | 22.7272% | 36.3636% | 36.3636% |
| Class 3 | 24.5902% | 29.5082% | 49.1803% |
| Class 4 | 13.7931% | 31.0345% | 44.8276% |
| Class 5 | 55.6962% | 67.0886% | 78.4810% |
| Class 6 | 29.8507% | 40.2985% | 32.8358% |
| Class 7 | 81.4332% | 71.3355% | 79.1531% |
| Class 8 | | 16.2162% | 18.9189% |
| Class 9 | | 7.6923% | 15.3846% |
* Class 1 ~ Class 9 denote Chromatin, Heterochromatin, Nuclear envelope, Nuclear matrix, Nuclear pore complex, Nuclear speckle, Nucleolus, Nucleoplasm and Nuclear PML body respectively.
SE, SP, ACC and MCC for PseAAC, PSSM and the proposed CoPSSM on dataset 1.
| Class | Representation | SE | SP | ACC | MCC |
|---|---|---|---|---|---|
| Class 1 | PseAAC | 0.3023 | 0.9286 | 0.8050 | 0.2859 |
| | PSSM | 0.5581 | 0.8964 | 0.8307 | 0.4565 |
| | The proposed CoPSSM | 0.7093 | 0.9117 | 0.8747 | 0.5979 |
| Class 2 | PseAAC | 0.3628 | 0.7908 | 0.6950 | 0.1492 |
| | PSSM | 0.4159 | 0.7716 | 0.6957 | 0.1735 |
| | The proposed CoPSSM | 0.5310 | 0.8756 | 0.8 | 0.4106 |
| Class 3 | PseAAC | 0.2941 | 0.9774 | 0.9461 | 0.3088 |
| | PSSM | | 0.9866 | 0.9436 | -0.0243 |
| | The proposed CoPSSM | 0.2353 | 0.9831 | 0.9537 | 0.2696 |
| Class 4 | PseAAC | 0.1667 | 0.9943 | 0.9538 | 0.2999 |
| | PSSM | 0.2222 | 0.9918 | 0.9558 | 0.3382 |
| | The proposed CoPSSM | 0.5 | 0.9805 | 0.9604 | 0.4939 |
| Class 5 | PseAAC | 0.4167 | 0.9914 | 0.9723 | 0.4969 |
| | PSSM | 0.5833 | 0.9863 | 0.9735 | 0.5697 |
| | The proposed CoPSSM | 0.6667 | 0.9902 | 0.9810 | 0.6569 |
| Class 6 | PseAAC | 0.32 | 0.9571 | 0.8775 | 0.3428 |
| | PSSM | 0.3 | 0.9671 | 0.8867 | 0.3526 |
| | The proposed CoPSSM | 0.26 | 0.9236 | 0.8548 | 0.1905 |
| Class 7 | PseAAC | 0.8605 | 0.3590 | 0.6190 | 0.2550 |
| | PSSM | 0.8163 | 0.4830 | 0.6583 | 0.3190 |
| | The proposed CoPSSM | 0.7857 | 0.7269 | 0.7587 | 0.5135 |
| Class 8 | PseAAC | | 1 | 0.9213 | |
| | PSSM | 0.1 | 0.9892 | 0.9223 | 0.1791 |
| | The proposed CoPSSM | 0.2333 | 0.9485 | 0.9015 | 0.1847 |
| Class 9 | PseAAC | 0.0541 | 0.9831 | 0.8954 | 0.0768 |
| | PSSM | 0.1081 | 0.9945 | 0.9132 | 0.2447 |
| | The proposed CoPSSM | 0.4324 | 0.9451 | 0.9035 | 0.3686 |
| Class 10 | PseAAC | | 0.9943 | 0.9616 | -0.0137 |
| | PSSM | | 0.9973 | 0.9659 | -0.0093 |
| | The proposed CoPSSM | 0.25 | 0.9808 | 0.9604 | 0.2408 |
SE, SP, ACC and MCC for PseAAC, PSSM and the proposed CoPSSM on dataset 2.
| Class | Representation | SE | SP | ACC | MCC |
|---|---|---|---|---|---|
| Class 1 | PseAAC | 0.4848 | 0.7897 | 0.7324 | 0.2439 |
| | PSSM | 0.6970 | 0.7445 | 0.7361 | 0.3579 |
| | The proposed CoPSSM | 0.5960 | 0.9021 | 0.8447 | 0.4943 |
| Class 2 | PseAAC | 0.2273 | 0.9896 | 0.9484 | 0.3335 |
| | PSSM | 0.3636 | 0.9617 | 0.9318 | 0.3123 |
| | The proposed CoPSSM | 0.3636 | 0.9648 | 0.9370 | 0.3151 |
| Class 3 | PseAAC | 0.2459 | 0.9789 | 0.8773 | 0.3490 |
| | PSSM | 0.2951 | 0.9561 | 0.8705 | 0.3174 |
| | The proposed CoPSSM | 0.4918 | 0.9455 | 0.8902 | 0.4611 |
| Class 4 | PseAAC | 0.1379 | 0.9922 | 0.9324 | 0.2576 |
| | PSSM | 0.3103 | 0.9733 | 0.9297 | 0.3379 |
| | The proposed CoPSSM | 0.4483 | 0.9752 | 0.9429 | 0.4629 |
| Class 5 | PseAAC | 0.5570 | 0.8930 | 0.8355 | 0.4372 |
| | PSSM | 0.6709 | 0.952 | 0.9031 | 0.6501 |
| | The proposed CoPSSM | 0.7848 | 0.9576 | 0.9292 | 0.7424 |
| Class 6 | PseAAC | 0.2985 | 0.9632 | 0.8635 | 0.3523 |
| | PSSM | 0.4030 | 0.9341 | 0.8595 | 0.3697 |
| | The proposed CoPSSM | 0.3284 | 0.9319 | 0.8544 | 0.2882 |
| Class 7 | PseAAC | 0.8143 | 0.4533 | 0.6359 | 0.2874 |
| | PSSM | 0.7134 | 0.6945 | 0.7045 | 0.4076 |
| | The proposed CoPSSM | 0.7915 | 0.6767 | 0.7348 | 0.4716 |
| Class 8 | PseAAC | | 0.9897 | 0.9040 | -0.03 |
| | PSSM | 0.1622 | 0.9735 | 0.9071 | 0.1955 |
| | The proposed CoPSSM | 0.1892 | 0.9543 | 0.8974 | 0.1634 |
| Class 9 | PseAAC | | 1 | 0.9674 | |
| | PSSM | 0.0769 | 0.9951 | 0.9670 | 0.1482 |
| | The proposed CoPSSM | 0.1538 | 0.9801 | 0.9571 | 0.1453 |
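For reference, the four evaluation indices can be computed from the TP, TN, FP and FN counts of one class treated against all others. The sketch below uses the standard textbook definitions, which may differ in detail from the exact computation behind the tables above; the function name `one_vs_rest_metrics` and the example counts are hypothetical.

```python
import math


def one_vs_rest_metrics(tp, tn, fp, fn):
    """Return (SE, SP, ACC, MCC) from binary confusion counts."""
    nan = float("nan")
    se = tp / (tp + fn) if (tp + fn) else nan   # sensitivity
    sp = tn / (tn + fp) if (tn + fp) else nan   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)       # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else nan
    return se, sp, acc, mcc


# Hypothetical counts for one class among 100 samples.
se, sp, acc, mcc = one_vs_rest_metrics(tp=5, tn=85, fp=5, fn=5)
```

When no sample at all is assigned to a class (TP = FP = 0), SE is 0 and SP is 1, while the MCC denominator vanishes, so the sketch returns NaN for MCC in that case.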
Fig 3. The OSR of PseAAC, PSSM and the proposed CoPSSM for different K values on dataset 1.
Fig 4. The OSR of PseAAC, PSSM and the proposed CoPSSM for different K values on dataset 2.
Prediction results of the proposed method.
| Class | Dataset 1 (reduced 10-D CoPSSM) | Dataset 2 (reduced 9-D CoPSSM) |
|---|---|---|
| Class 1 | 72.0930% | 84.8485% |
| Class 2 | 80.5310% | 90.9091% |
| Class 3 | 100% | 86.8852% |
| Class 4 | 94.4444% | 89.6552% |
| Class 5 | 83.3333% | 88.6076% |
| Class 6 | 98% | 89.5522% |
| Class 7 | 92.1769% | 93.1596% |
| Class 8 | 93.3333% | 91.8919% |
| Class 9 | 75.6757% | 92.3077% |
| Class 10 | 100% | |
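The per-class rates above can be tied back to the overall success rates quoted in the abstract by weighting each class by its size. The sketch below does this arithmetic, taking the class sizes from the benchmark dataset table and the rates from the table above; the helper name `overall_success_rate` is hypothetical, and rounding each class to a whole number of proteins is an assumption about how the percentages were produced.

```python
def overall_success_rate(sizes, rates_percent):
    """Recover (correct, total, OSR in percent) from per-class rates."""
    # Each rate times its class size should be a whole number of proteins.
    correct = [round(n * r / 100) for n, r in zip(sizes, rates_percent)]
    total = sum(sizes)
    return sum(correct), total, 100 * sum(correct) / total


# Dataset 1: ten subnuclear locations.
sizes_1 = [86, 113, 17, 18, 12, 50, 294, 30, 37, 12]
rates_1 = [72.0930, 80.5310, 100, 94.4444, 83.3333,
           98, 92.1769, 93.3333, 75.6757, 100]

# Dataset 2: nine subnuclear locations.
sizes_2 = [99, 22, 61, 29, 79, 67, 307, 37, 13]
rates_2 = [84.8485, 90.9091, 86.8852, 89.6552, 88.6076,
           89.5522, 93.1596, 91.8919, 92.3077]

result_1 = overall_success_rate(sizes_1, rates_1)  # 585 of 669, about 87.444%
result_2 = overall_success_rate(sizes_2, rates_2)  # 645 of 714, about 90.3361%
```

The weighting matters because the classes are strongly imbalanced: the Nucleolus class alone holds 294 of the 669 proteins in dataset 1, so it contributes far more to the OSR than a small class such as Nuclear PML body.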
Prediction results for protein subcellular location.
| Class | Dataset 3, Jackknife test: PseAAC (40-D) | Dataset 3, Jackknife test: PSSM (400-D) | Dataset 3, Jackknife test: CoPSSM (210-D) | Dataset 4, Independent test: PseAAC (40-D) | Dataset 4, Independent test: PSSM (400-D) | Dataset 4, Independent test: CoPSSM (210-D) |
|---|---|---|---|---|---|---|
| Class 1 | 90.7895% | 86.8421% | 94.0790% | 90.4762% | 89.0476% | 88.5714% |
| Class 2 | 35.5263% | 53.9474% | 55.2632% | 50% | 65% | 60% |
| Class 3 | 8.3333% | 50% | 75 | |||
| Class 4 | 83.3333% | 83.3333% | 100 | 100 | 100 | |
| Class 5 | 84.9462% | 86.0215% | 86.0215% | 77.9710% | 79.1304% | 78.5507% |
| Class 6 | ||||||
| Class 7 | 60.1942% | 54.3689% | 61.1651% | 46.1539% | 76.9231% | 84.6154% |
| Class 8 | 48.2143% | 65.1786% | 53.5714% | 61.2245% | 65.3061% | 57.1429% |
* Class 1 ~ Class 8 denote Cytoplasm, Extracell, Fimbrium, Flagellum, Inner membrane, Nucleoid, Outer membrane, Periplasm respectively.
Prediction results of the proposed method for protein subcellular location.
| Class | Jackknife test for dataset 3 (reduced 8-D CoPSSM) | Independent test for dataset 4 (reduced 8-D CoPSSM) |
|---|---|---|
| Class 1 | 136/152 = 89.4737% | 186/210 = 88.5714% |
| Class 2 | 67/76 = 88.1579% | 18/20 = 90% |
| Class 3 | 10/12 = 83.3333% | 4/4 = 100% |
| Class 4 | 6/6 = 100% | 1/1 = 100% |
| Class 5 | 181/186 = 97.3118% | 339/345 = 98.2609% |
| Class 6 | 6/6 = 100% | 1/1 = 100% |
| Class 7 | 97/103 = 94.1748% | 12/13 = 92.3077% |
| Class 8 | 100/112 = 89.2857% | 48/49 = 97.9592% |
| Overall | 603/653 = 92.3430% | 609/643 = 94.7123% |
Comparison of overall success rate on the four benchmark datasets.
| Dataset | Algorithm | Representation and method | OSR (%) |
|---|---|---|---|
| Dataset 1 | SubNucPred [ | SSLD and AAC based on SVM by Jackknife test | 81.46 |
| | Effective Fusion Representations [ | DipPSSM with LDA based on KNN by 10-fold cross-validation | ≈97 |
| | | PseAAPSSM with LDA based on KNN by 10-fold cross-validation | ≈84 |
| | The proposed method | CoPSSM with KLDA based on KNN and Jackknife test | 87.444 |
| Dataset 2 | Nuc-PLoc [ | Fusion of PsePSSM and PseAAC based on Ensemble classifier by Jackknife test | 67.4 |
| | Effective Fusion Representations [ | DipPSSM with LDA based on KNN by 10-fold cross-validation | 95.94 |
| | | PseAAPSSM with LDA based on KNN by 10-fold cross-validation | 88.1 |
| | The proposed method | CoPSSM with KLDA based on KNN and Jackknife test | 90.3361 |
| Dataset 3 | Gneg-PLoc [ | Fusion of GO approach and PseAAC based on Ensemble classifier by Jackknife test | 87.3 |
| | Nonlinear dimensionality reduction method [ | Fusion of PSSM and PseAAC with KLDA based on KNN by Jackknife test | 98.77 |
| | The proposed method | CoPSSM with KLDA based on KNN and Jackknife test | 92.3430 |
| Dataset 4 | Gneg-PLoc [ | Fusion of GO approach and PseAAC based on Ensemble classifier by Independent test | 89.3 |
| | The proposed method | CoPSSM with KLDA based on KNN and Independent test | 94.7123 |