Xiaowei Zhao, Xiangtao Li, Zhiqiang Ma, Minghao Yin.
Abstract
Ubiquitylation is an important post-translational modification. Correctly identifying protein lysine ubiquitylation sites is fundamental to understanding the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a computational method that identifies lysine ubiquitylation sites using an ensemble approach. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors using four kinds of protein sequence information. A feature selection method was then applied to extract informative feature subsets. Different feature subsets, obtained by setting different starting points in the search procedure, were used to train multiple random forest classifiers, which were then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests, respectively, the predictor reached an accuracy of 76.82% on the training dataset and 79.16% on the test dataset, indicating that it is a useful tool for predicting lysine ubiquitylation sites. Furthermore, site-specific feature analysis showed that ubiquitylation is closely correlated with the features of surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.
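The aggregation step described in the abstract, in which several base classifiers are combined by majority voting, can be illustrated with a minimal sketch. It assumes binary labels (1 = ubiquitylated, 0 = not ubiquitylated) and breaks ties toward the smaller label; the function name `majority_vote`, the tie-breaking rule, and the toy predictions are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return one consensus label per sample.

    predictions: list of per-classifier label lists, all of equal length.
    Ties are broken in favour of the smallest label (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one base classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")

    consensus = []
    # zip(*predictions) yields, for each sample, the votes of all classifiers.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        consensus.append(min(label for label, c in counts.items() if c == top))
    return consensus


# Three hypothetical base classifiers voting on four candidate lysines.
base_outputs = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(base_outputs))
```

Using an odd number of base classifiers avoids ties for binary labels, so the tie-breaking rule only matters when the ensemble size is even.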
Keywords: ensemble classifier; lysine ubiquitylation sites; support vector machine; ubiquitylation
Year: 2011 PMID: 22272076 PMCID: PMC3257073 DOI: 10.3390/ijms12128347
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
The number of ubiquitylation and non-ubiquitylation sites in each dataset.
| Dataset | No. of ubiquitylation sites | No. of non-ubiquitylation sites |
|---|---|---|
| Training dataset | 298 | 563 |
| Test dataset | 170 | 357 |
| Independent dataset | 14 | 267 |
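As a quick arithmetic check on the table above, the following sketch computes the size and the fraction of ubiquitylation sites in each dataset. The counts are copied from the table; the helper name `positive_fraction` is an illustrative assumption.

```python
# (ubiquitylation sites, non-ubiquitylation sites), copied from the table.
datasets = {
    "Training dataset": (298, 563),
    "Test dataset": (170, 357),
    "Independent dataset": (14, 267),
}


def positive_fraction(n_pos, n_neg):
    """Fraction of sites in a dataset that are ubiquitylation sites."""
    return n_pos / (n_pos + n_neg)


for name, (n_pos, n_neg) in datasets.items():
    frac = positive_fraction(n_pos, n_neg)
    print(f"{name}: {n_pos + n_neg} sites, {frac:.1%} ubiquitylated")

# Training and test positives together give the 468 sites cited in the abstract.
total_pos = datasets["Training dataset"][0] + datasets["Test dataset"][0]
print(total_pos)
```

The independent dataset is far more imbalanced than the other two, which is worth keeping in mind when comparing accuracy figures across datasets.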
Figure 1. Schematic representation of the transformation of each protein sequence into an L × 20 dimensional position-specific scoring matrix (PSSM); the rows represent the protein residues and the columns represent the 20 amino acids.
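One common way to turn an L × 20 PSSM into a fixed-length feature vector for a candidate lysine is to flatten the rows in a window centred on that residue. The sketch below assumes this windowing scheme; the `flank` parameter, the zero padding at the sequence ends, and the function name are hypothetical and are not specified in this record.

```python
def pssm_window_features(pssm, center, flank):
    """Flatten the PSSM rows in a window around `center` into one vector.

    pssm:   list of L rows, each holding 20 scores (one per amino acid).
    center: 0-based index of the candidate lysine.
    flank:  number of residues taken on each side (hypothetical parameter).
    Window positions beyond the sequence ends are zero-padded (assumption).
    """
    n_cols = 20
    features = []
    for pos in range(center - flank, center + flank + 1):
        if 0 <= pos < len(pssm):
            row = pssm[pos]
            if len(row) != n_cols:
                raise ValueError("each PSSM row must have 20 columns")
            features.extend(row)
        else:
            features.extend([0] * n_cols)
    return features


# Toy 5-residue PSSM in which row i is filled with the value i.
toy_pssm = [[i] * 20 for i in range(5)]
vec = pssm_window_features(toy_pssm, center=0, flank=1)
print(len(vec))
```

With a window of `2 * flank + 1` residues, every site yields a vector of `(2 * flank + 1) * 20` values regardless of the protein length.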
Figure 2. The framework of the ensemble model.
Figure 3. The relationship between the prediction performance and the quantity of base classifiers.
The performance comparison of two feature selection methods on the training dataset.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC |
|---|---|---|---|---|
| mRMR [ | 64.76 ± 2.12 | 68.21 ± 3.52 | 67.42 ± 1.37 | 0.282 ± 0.13 |
| This paper | 76.85 ± 1.84 | 76.91 ± 2.09 | 76.82 ± 1.03 | 0.519 ± 0.08 |
The performance comparison of the two feature selection methods on the test dataset.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC |
|---|---|---|---|---|
| mRMR [ | 51.68 ± 1.35 | 74.22 ± 0.92 | 69.20 ± 1.06 | 0.229 ± 0.09 |
| This paper | 72.61 ± 2.34 | 81.27 ± 0.76 | 79.16 ± 0.98 | 0.503 ± 0.07 |
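Performance tables of this kind are typically derived from confusion-matrix counts. The sketch below computes four standard measures (sensitivity, specificity, accuracy, and the Matthews correlation coefficient), assuming these are the measures reported here. The example counts are hypothetical: they only match the test dataset's class sizes (170 and 357) and are not results from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sn = tp / (tp + fn)                      # fraction of true sites recovered
    sp = tn / (tn + fp)                      # fraction of non-sites rejected
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Hypothetical counts for a dataset with 170 positives and 357 negatives.
sn, sp, acc, mcc = binary_metrics(tp=120, fn=50, tn=290, fp=67)
print(f"Sn={sn:.2%} Sp={sp:.2%} Acc={acc:.2%} MCC={mcc:.3f}")
```

Because accuracy is a prevalence-weighted average of sensitivity and specificity, MCC is often the more informative single number when the classes are imbalanced.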
The performance comparison of different predictors on the independent dataset.
| Predictor | Sn (%) | Sp (%) | Acc (%) | MCC |
|---|---|---|---|---|
| mRMRPred [ | 34.34 | 79.67 | 68.34 | 0.139 |
| UbiPred [ | NA | NA | NA | 0.135 |
| UbPred [ | NA | NA | NA | 0.117 |
| This paper | 57.14 ± 1.39 | 74.15 ± 0.95 | 71.32 ± 1.26 | 0.153 ± 0.06 |
Figure 4. The number of each type of feature in the 10 selected subsets.
Figure 5. The number of all features on each site in the 10 selected subsets.
Figure 6. The number of PSSM features on each site in the 10 selected subsets.
| (1) | Initialize related parameters, |
| (2) | |
| (3) | calculate its mutual information with the target classes |
| (4) | Sort them in a descending order; |
| (5) | |
| (6) | |
| (7) | |
| (8) | Calculate its criterion |
| (9) | If |
| (10) | Select the gene with the largest |
| (11) | |
| (12) | |
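Only steps (3) and (4) of the procedure above survive clearly enough to illustrate: computing each feature's mutual information with the target classes and sorting the features in descending order. The sketch below covers just those two steps and assumes discrete feature values (continuous features would first need to be discretized). The function names and toy data are hypothetical, and the remaining steps are not reconstructed.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(feature)
    if n == 0 or n != len(labels):
        raise ValueError("feature and labels must be non-empty and equal length")
    count_x = Counter(feature)
    count_y = Counter(labels)
    count_xy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels):
    """Score each named feature column and sort by descending relevance."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: one feature that tracks the class exactly and one that is independent.
labels = [1, 1, 0, 0]
columns = {
    "f_noise": [1, 0, 1, 0],
    "f_informative": [1, 1, 0, 0],
}
ranking = rank_features(columns, labels)
print(ranking)
```

Starting the subsequent search from different positions in such a ranking is one way to obtain the distinct feature subsets mentioned in the abstract, although the exact criterion used in the later steps is not recoverable from this record.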