Qiao Ning, Xiaosa Zhao, Lingling Bao, Zhiqiang Ma, Xiaowei Zhao.
Abstract
BACKGROUND: Lysine succinylation is a newly identified post-translational modification that plays a key role in regulating protein conformation and controlling cellular function. Understanding the mechanism of succinylation in depth requires accurate identification of succinylation sites in proteins. However, traditional experimental approaches are labor-intensive and time-consuming. Computational prediction methods have been proposed in recent years and are popular because of their convenience and speed. In this study, we developed a new method to predict succinylation sites in proteins by combining multiple features, including amino acid composition, binary encoding, physicochemical properties and grey pseudo amino acid composition, with a feature selection scheme (information gain). The predictor was then trained using a support vector machine (SVM) and an ensemble learning algorithm.
Keywords: Ensemble learning algorithm; Grey pseudo amino acid composition; Information gain; Multiple features; Predict succinylation sites; SVM
Year: 2018 PMID: 29940836 PMCID: PMC6016146 DOI: 10.1186/s12859-018-2249-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
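Two of the feature groups named in the abstract, amino acid composition and binary encoding, can be illustrated with a minimal sketch. The window string, its length, and the function names below are hypothetical and chosen only for illustration; the record does not state the window size or the exact encoding dimensions used by PSuccE.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(window: str) -> list[float]:
    """Frequency of each standard amino acid in the window (20 values)."""
    counts = Counter(window)
    # Count only standard residues so padding characters do not dilute frequencies.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def binary_encoding(window: str) -> list[int]:
    """One-hot encode each position; non-standard characters become all zeros."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Hypothetical 7-residue window centered on a lysine (K); not taken from the paper.
window = "AAGKLLE"
features = amino_acid_composition(window) + binary_encoding(window)
```

Concatenating the groups gives one fixed-length vector per candidate site, which is the form a feature selection step and an SVM would consume.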
Fig. 1 The flow chart of PSuccE
Fig. 2 The information entropy value of positions around the central residue
Fig. 3 Two-sample logos of the compositional biases around succinylation sites compared to non-succinylation sites. Statistically significant symbols are plotted with a size proportional to the difference between the two samples. Residues are separated into two groups: (1) enriched in the positive samples; and (2) depleted in the positive samples. Symbols are colored according to the polarity of the residue side chain
The number of features in every training dataset
| Training dataset | Number of features | AAC | BE | PCP | GPAAC |
|---|---|---|---|---|---|
| Subset1 | 186 | 14 | 38 | 123 | 11 |
| Subset2 | 191 | 14 | 34 | 133 | 10 |
| Subset3 | 158 | 13 | 27 | 107 | 11 |
| Subset4 | 112 | 12 | 14 | 77 | 9 |
| Subset5 | 194 | 14 | 38 | 131 | 11 |
| Subset6 | 177 | 14 | 30 | 122 | 11 |
| Subset7 | 194 | 14 | 37 | 134 | 9 |
| Subset8 | 88 | 9 | 6 | 64 | 9 |
| Subset9 | 66 | 7 | 5 | 45 | 9 |
| Subset10 | 45 | 4 | 5 | 30 | 6 |
| Common features | 34 | 4 | 5 | 19 | 6 |
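The feature counts above are the result of information gain selection. As a rough sketch of how information gain scores a single feature, the following assumes discrete feature values; continuous features would first need discretization, and how the authors handled that is not stated in this record. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: 1 = succinylated site, 0 = non-succinylated site.
labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # perfectly separates the classes
uninformative = [1, 0, 1, 0]  # carries no class information
```

Features would then be ranked by this score and the top-ranked ones retained for each training subset.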
10-fold cross-validation performance of the 10 subsets and the ensemble classifier on the training dataset
| Training dataset | Sn (%) | Sp (%) | Acc | MCC |
|---|---|---|---|---|
| Subset1 | 72.29 | 66.91 | 0.6961 | 0.3926 |
| Subset2 | 72.15 | 66.39 | 0.6927 | 0.3861 |
| Subset3 | 72.21 | 66.33 | 0.6927 | 0.3861 |
| Subset4 | 72.83 | 65.73 | 0.6929 | 0.3867 |
| Subset5 | 71.69 | 67.24 | 0.6948 | 0.3898 |
| Subset6 | 72.12 | 66.46 | 0.6930 | 0.3865 |
| Subset7 | 71.94 | 65.64 | 0.6881 | 0.3767 |
| Subset8 | 72.07 | 65.53 | 0.6880 | 0.3768 |
| Subset9 | 72.97 | 63.52 | 0.6824 | 0.3665 |
| Subset10 | 72.36 | 62.48 | 0.6742 | 0.3502 |
| Ensemble | | | | |
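The four measures in the table can be computed from confusion-matrix counts. The sketch below assumes the standard definitions of sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC); the counts passed in are hypothetical and not taken from the paper.

```python
import math


def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sn, Sp, Acc and MCC from true/false positive and negative counts."""
    sn = tp / (tp + fn)                    # fraction of true sites recovered
    sp = tn / (tn + fp)                    # fraction of non-sites rejected
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
metrics = evaluate(tp=72, tn=67, fp=33, fn=28)
```

Unlike Acc, MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative samples are imbalanced.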
Fig. 4 ROC curves (AUC) of predictions based on 10-fold cross-validation
A comparison of PSuccE with existing predictors using an independent test set
| Measurement* | SucPred | iSuc-PseAAC | SuccFind | SuccinSite | Success | PSuccE |
|---|---|---|---|---|---|---|
| Sp(%) | 67.3 | 88.7 | 79.2 | 88.2 | 86.8 | |
| Sn(%) | 27.2 | 12.2 | 25.2 | 37.1 | 14.2 | |
| Acc | 0.643 | 0.827 | 0.750 | 0.842 | 0.811 | |
| MCC | −0.030 | 0.013 | 0.029 | 0.199 | 0.007 | |
* The threshold value was set to 0.9 for these predictors