Md Mehedi Hasan1, Hiroyuki Kurata1,2.
Abstract
Lysine succinylation is one of the dominant post-translational modifications of proteins and contributes to many biological processes, including the cell cycle, growth and signal transduction pathways. Identification of succinylation sites is an important step toward understanding protein function. The complicated sequence patterns of protein succinylation revealed by proteomic studies highlight the need for effective species-specific in silico strategies for global prediction of succinylation sites. Here we developed one generic and nine species-specific succinylation site classifiers by aggregating multiple complementary features. We optimized the consecutive features using the Wilcoxon-rank feature selection scheme. The final feature vectors were used to train a random forest (RF) classifier. By integrating the RF scores via logistic regression, the resulting predictor, termed GPSuc, achieved better performance than other existing generic and species-specific succinylation site predictors. Our predictor serves as a valuable resource for revealing the mechanism of succinylation and assisting hypothesis-driven experimental design. To support prediction on large-scale datasets, a web application was developed at http://kurata14.bio.kyutech.ac.jp/GPSuc/.
Year: 2018 PMID: 30312302 PMCID: PMC6193575 DOI: 10.1371/journal.pone.0200283
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The computational framework of GPSuc.
Fig 2. Sequence logos illustrating the amino acid appearance in the sequences surrounding the succinylation sites (http://www.twosamplelogo.org/).
Nine species: H. sapiens, H. capsulatum, M. musculus, E. coli, M. tuberculosis, T. gondii, S. cerevisiae, S. lycopersicum, and T. aestivum were used.
Fig 3. Distribution of AAF in the sequences surrounding succinylation sites (gray) and non-succinylation sites (red) for nine species.
The columns represent AAF, while the rows show the individual amino acid residues.
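As a rough illustration of how per-residue values like those in Fig 3 could be tabulated, the sketch below counts how often each of the 20 standard amino acids occurs across a set of sequence windows. It assumes that AAF denotes amino acid frequency; the function name `amino_acid_frequency`, the toy windows, and the choice to ignore non-standard characters are all hypothetical and are not taken from the GPSuc implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequency(windows):
    """Return the fraction of each standard residue over all windows.

    Characters outside the 20 standard residues (e.g. padding symbols)
    are ignored; this is an assumption made for the sketch.
    """
    counts = Counter(ch for w in windows for ch in w if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical windows centred on a lysine (K); not real data.
toy_windows = ["AAKGG", "GLKEE"]
freqs = amino_acid_frequency(toy_windows)
```

Running the same function separately on succinylated and non-succinylated windows would give the two distributions that a figure of this kind compares.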
AUC values of different combinations of feature scores for the training and test datasets in the generic predictor.
| Datasets | Predictors | AUC |
|---|---|---|
| Training | pCKSAAP + AAindex | 0.827 |
| Test | pCKSAAP + AAindex | 0.752 |
For combining the features, different LR parameters were added.
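The AUC values reported here summarize how well prediction scores rank succinylated sites above non-succinylated ones. A minimal sketch of that quantity, assuming the usual pairwise-ranking definition of AUC, is given below; the function name `auc_from_scores` and the toy scores are hypothetical and do not come from the study.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative.

    Ties contribute one half. This is the quadratic-time pairwise form,
    which is fine for small illustrative inputs.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must contain at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical prediction scores; not real data.
toy_auc = auc_from_scores([0.9, 0.7, 0.4], [0.8, 0.3, 0.2])
```

For larger datasets a sort-based implementation, or a library routine, would be the more practical choice.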
Performance of generic and species-specific succinylation site prediction on the training dataset.
| Predictor | Sp | Sn | Ac | MCC |
|---|---|---|---|---|
| Generic | 0.903 | 0.537 | 0.781 | 0.498 |
| | 0.903 | 0.545 | 0.784 | 0.524 |
| | 0.901 | 0.411 | 0.738 | 0.39 |
| | 0.890 | 0.512 | 0.764 | 0.429 |
| | 0.890 | 0.422 | 0.734 | 0.408 |
| | 0.890 | 0.289 | 0.700 | 0.201 |
| | 0.896 | 0.655 | 0.816 | 0.536 |
| | 0.896 | 0.535 | 0.776 | 0.519 |
| | 0.897 | 0.478 | 0.757 | 0.447 |
| | 0.887 | 0.418 | 0.731 | 0.406 |
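The four columns can be read as specificity (Sp), sensitivity (Sn), accuracy (Ac) and Matthews correlation coefficient (MCC). The sketch below computes them from confusion-matrix counts, assuming the standard definitions of these measures; the function name `classification_metrics` and the toy counts are hypothetical and are not values from the study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute Sp, Sn, Ac and MCC from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sn = tp / (tp + fn)                   # sensitivity: recall on positives
    sp = tn / (tn + fp)                   # specificity: recall on negatives
    ac = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sp": sp, "Sn": sn, "Ac": ac, "MCC": mcc}


# Hypothetical counts; not taken from the paper.
toy = classification_metrics(tp=50, fn=50, tn=90, fp=10)
```

Because Sp is held near 0.9 in every row, the rows differ mainly in Sn and MCC, which is where a comparison between predictors is most informative.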
Performance of existing generic tools on the test dataset.
| Prediction scheme | Sp | Sn | Ac | MCC |
|---|---|---|---|---|
| iSuc-PseAAC | 0.887 | 0.122 | 0.827 | 0.013 |
| iSuc-PseOpt | 0.758 | 0.303 | 0.722 | 0.038 |
| pSuc-Lys | 0.826 | 0.224 | 0.779 | 0.036 |
| SuccinSite | 0.882 | 0.371 | 0.842 | 0.199 |
| SuccinSite2.0 | 0.882 | 0.454 | 0.848 | 0.261 |
| GPSuc | 0.883 | 0.499 | 0.853 | 0.296 |
Fig 4. Performance evaluation using each of the five single features and the ‘combined model’ for predicting succinylation sites in nine species.
Gray indicates the AUC values on the training dataset, while red indicates those on the test dataset. ‘Combined’ indicates the performance of the model that combines the five encoding features. The final H. sapiens model was given as a linear combination of the five features AAC, AAindex, binary, PSSM, and pCKSAAP, with LR coefficients of 0.142, 1.566, 0.665, 0.342 and 0.667, respectively. In the same way, the combined models for H. capsulatum, M. musculus, E. coli, M. tuberculosis, S. cerevisiae, T. gondii, S. lycopersicum and T. aestivum were given by the coefficients (0.102, 0.466, 0.462, 0.242 and 1.367), (0.155, 1.077, 0.575 and 0.761), (0.121, 0.473, 0.763, 0.230 and 1.214), (0.127, 0.358, 0.404, 0.109 and 1.066), (0.320, 0.391, 0.553, 0.182 and 1.122), (0.117, 0.331, 0.734, 0.139 and 1.014), (0.113, 0.417, 0.818, 0.103 and 1.172), and (0.112, 0.462, 0.723, 0.164 and 1.299), respectively. The LR constant term for each species was set to zero.
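To make the combination step concrete, the sketch below applies the H. sapiens coefficients quoted in this caption to a set of per-feature RF scores, with the constant term set to zero as stated. Passing the weighted sum through a logistic function is an assumption based on the use of logistic regression; the function name `combine_scores` and the toy RF scores are hypothetical.

```python
import math

# LR coefficients for the H. sapiens model, as quoted in the caption.
H_SAPIENS_COEFS = {
    "AAC": 0.142,
    "AAindex": 1.566,
    "binary": 0.665,
    "PSSM": 0.342,
    "pCKSAAP": 0.667,
}


def combine_scores(rf_scores, coefs, intercept=0.0):
    """Return the weighted sum of RF scores and its logistic transform.

    The intercept defaults to zero, matching the caption. Applying the
    logistic function is an assumption of this sketch.
    """
    z = intercept + sum(coefs[name] * rf_scores[name] for name in coefs)
    return z, 1.0 / (1.0 + math.exp(-z))


# Hypothetical RF scores for one candidate lysine; not real data.
toy_scores = {name: 0.5 for name in H_SAPIENS_COEFS}
linear_score, probability = combine_scores(toy_scores, H_SAPIENS_COEFS)
```

With these coefficients the AAindex score carries the largest weight for H. sapiens, whereas the last coefficient in each tuple is the largest for most of the other species listed.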
Fig 5. ROC curves of the nine species-specific predictors of GPSuc.
(A) Training data performances over a 10-fold cross-validation test. (B) Test dataset performances.
Performance comparison of a species-specific predictor using the test dataset.
| Species | SuccinSite2.0 Sp | SuccinSite2.0 Sn | SuccinSite2.0 Ac | SuccinSite2.0 MCC | GPSuc Sp | GPSuc Sn | GPSuc Ac | GPSuc MCC |
|---|---|---|---|---|---|---|---|---|
| | 0.872 | 0.632 | 0.866 | 0.241 | 0.877 | 0.693 | 0.872 | 0.279 |
| | 0.780 | 0.461 | 0.769 | 0.101 | 0.788 | 0.523 | 0.779 | 0.146 |
| | 0.733 | 0.456 | 0.685 | 0.192 | 0.740 | 0.562 | 0.710 | 0.246 |
| | 0.720 | 0.440 | 0.664 | 0.139 | 0.719 | 0.501 | 0.675 | 0.188 |
| | 0.826 | 0.512 | 0.807 | 0.216 | 0.822 | 0.596 | 0.809 | 0.249 |
| | 0.824 | 0.452 | 0.790 | 0.191 | 0.822 | 0.593 | 0.801 | 0.296 |
| | 0.815 | 0.401 | 0.771 | 0.172 | 0.817 | 0.471 | 0.800 | 0.220 |