Ahmet Sinan Yavuz, Osman Ugur Sezerman.
Abstract
BACKGROUND: Sumoylation, a reversible and dynamic post-translational modification, is one of the vital processes in a cell. Before a protein matures to perform its function, sumoylation may alter its localization, interactions, and possibly structural conformation. Aberrations in protein sumoylation have been linked to a variety of disorders and developmental anomalies. Experimental approaches to identifying sumoylation sites may not be effective because of the dynamic nature of sumoylation and the laborious, costly experiments involved. Computational approaches may therefore guide experimental identification of sumoylation sites and provide insights into the sumoylation mechanism.
Year: 2014 PMID: 25521314 PMCID: PMC4290605 DOI: 10.1186/1471-2164-15-S9-S18
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The reversible sumoylation mechanism. The sumoylation pathway starts with an immature SUMO protein that must be proteolytically processed by SENPs to reveal its target binding site, an invariant Gly-Gly motif. The mature SUMO protein is then activated by the E1 heterodimer in an ATP-dependent reaction. SUMO is then transferred to an E2 enzyme, UBC9, which is responsible for recognizing target binding sites. After recognition, the SUMO protein is transferred to a lysine residue in the target binding site. This process is generally assisted by an E3 ligase. Sumoylated sites can also act as substrates for SENPs, so sumoylation can be reversed. This ensures the dynamic and reversible nature of sumoylation.
Dataset distribution.
| Site Category | Training Set, n (%) | Test Set, n (%) | Total |
|---|---|---|---|
| Positive Sites (Consensus) | 267 (3.23) | 17 (3.18) | 284 |
| Positive Sites (Non-consensus) | 90 (1.09) | 7 (1.31) | 97 |
| Negative Sites (Consensus) | 280 (3.39) | 22 (4.11) | 302 |
| Negative Sites (Non-consensus) | 7629 (92.29) | 488 (91.39) | 8117 |
| Total | 8266 (100) | 534 (100) | 8800 |
A total of 381 experimentally verified sumoylation sites were divided into 4 categories. Of the positive sites, 357 formed the training set, of which 267 conformed to the consensus motif and 90 did not. The remaining 24 positive sites formed the independent test set, of which 17 conformed to the consensus motif and 7 did not.
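The class imbalance implied by the table can be checked with a few lines of arithmetic. The sketch below only re-adds the counts reported above; the dictionary layout and the helper names `split_total` and `positive_total` are illustrative choices, not part of the original study.

```python
# Site counts copied from the dataset distribution table: (training, test).
counts = {
    ("positive", "consensus"): (267, 17),
    ("positive", "non-consensus"): (90, 7),
    ("negative", "consensus"): (280, 22),
    ("negative", "non-consensus"): (7629, 488),
}


def split_total(split):
    """Total number of sites in a split (0 = training, 1 = test)."""
    return sum(pair[split] for pair in counts.values())


def positive_total(split):
    """Number of experimentally verified (positive) sites in a split."""
    return sum(
        pair[split] for (label, _), pair in counts.items() if label == "positive"
    )


for name, split in (("training", 0), ("test", 1)):
    share = 100 * positive_total(split) / split_total(split)
    print(f"{name}: {positive_total(split)} of {split_total(split)} "
          f"sites are positive ({share:.2f}%)")
```

Positive sites make up less than 5% of either split, so accuracy alone can look high even for a weak predictor.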
Figure 2. Window position representation and comparison of sequence logos between positive and negative sites. a) Position nomenclature used throughout the study. Sequence windows are divided into negative (-) and positive (+) sub-windows. In each sub-window, amino acids are numbered in increasing order. b) Sequence logo of positive consensus sites, indicating amino acid preferences at each window position. c) Sequence logo of positive non-consensus sites. d) Sequence logo of negative consensus sites. e) Sequence logo of negative non-consensus sites.
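The position nomenclature in panel a can be illustrated with a short sketch. Several details here are assumptions rather than facts from the paper: that numbering runs outward from the central lysine, the default half-width of 3, the padding character `X` for positions past the sequence ends, and the function name `window_positions`.

```python
def window_positions(sequence, lysine_index, half_width=3, pad="X"):
    """Map signed position labels (e.g. '-2', '+1') to residues around a lysine.

    half_width and pad are hypothetical defaults chosen for illustration.
    """
    if sequence[lysine_index] != "K":
        raise ValueError("lysine_index must point at a lysine (K)")
    labelled = {}
    for offset in range(1, half_width + 1):
        left, right = lysine_index - offset, lysine_index + offset
        # Negative sub-window: residues before the lysine.
        labelled[f"-{offset}"] = sequence[left] if left >= 0 else pad
        # Positive sub-window: residues after the lysine.
        labelled[f"+{offset}"] = sequence[right] if right < len(sequence) else pad
    return labelled


# Made-up peptide with a lysine at index 3.
print(window_positions("MAVKQEG", 3))
```

Under this reading, a feature name such as `w+E_2` would plausibly refer to a glutamate at position +2, but the exact feature definitions are given in the Methods section, not here.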
Top 25 features selected using RELIEFF algorithm.
| Rank | Feature | Merit Score | P-value | Adj. P-value | Significance |
|---|---|---|---|---|---|
| 1 | w+E_2 | 0.355474 | 0.00E+00 | 0.00E+00 | * |
| 2 | Consensus | 0.261813 | 0.00E+00 | 0.00E+00 | * |
| 3 | wDE | 0.164459 | 1.66E-41 | 3.23E-40 | * |
| 4 | w+2_Hydro | 0.160149 | 3.90E-83 | 1.33E-81 | * |
| 5 | w-I_3 | 0.105916 | 1.18E-107 | 5.33E-106 | * |
| 6 | w-3_Hydro | 0.104835 | 8.12E-58 | 1.84E-56 | * |
| 7 | wK | 0.078651 | 1.33E-02 | 4.03E-02 | * |
| 8 | w-V_3 | 0.075073 | 1.92E-58 | 5.22E-57 | * |
| 9 | w-2_Hydro | 0.057669 | 1.48E-02 | 4.37E-02 | * |
| 10 | w+3_Hydro | 0.056496 | 1.49E-01 | 2.93E-01 | |
| 11 | w+1_Hydro | 0.05232 | 7.13E-02 | 1.62E-01 | |
| 12 | w-1_Hydro | 0.051279 | 4.22E-02 | 9.89E-02 | |
| 13 | w-L_3 | 0.051001 | 1.96E-03 | 7.00E-03 | * |
| 14 | w+K_2 | 0.050248 | 1.70E-08 | 1.78E-07 | * |
| 15 | w+P_2 | 0.045911 | 3.39E-02 | 8.23E-02 | |
| 16 | w+P_3 | 0.043573 | 2.11E-25 | 3.58E-24 | * |
| 17 | w-K_3 | 0.043208 | 4.33E-04 | 1.96E-03 | * |
| 18 | Flexibility | 0.042334 | 7.52E-07 | 7.31E-06 | * |
| 19 | w+D_2 | 0.041784 | 7.14E-01 | 7.77E-01 | |
| 20 | w-S_2 | 0.041097 | 2.95E-01 | 4.57E-01 | |
| 21 | DisorderBinary | 0.040666 | 7.27E-14 | 9.88E-13 | * |
| 22 | w-E_3 | 0.039548 | 6.14E-05 | 3.79E-04 | * |
| 23 | w-A_3 | 0.03804 | 3.04E-02 | 7.95E-02 | |
| 24 | w-P_1 | 0.037893 | 4.46E-06 | 3.80E-05 | * |
| 25 | w-E_2 | 0.035935 | 4.16E-01 | 5.74E-01 | |
93 out of 137 features were selected based on 10-fold classification performance. Features are ranked using the RELIEFF [26] algorithm, as implemented in Weka [39]. Details of the statistical testing can be found in the Methods section. Significance was assessed at an adjusted p-value cutoff of 0.05. Feature names are explained in the Methods section and Figure 2a.
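The Significance column follows directly from the adjusted p-values and the 0.05 cutoff. The sketch below applies that cutoff to four rows copied from the table; it does not reproduce the p-value adjustment itself, which is described only in the Methods section. The names `rows` and `is_significant` are illustrative.

```python
ADJUSTED_P_CUTOFF = 0.05  # cutoff stated in the table note

# (feature, adjusted p-value) pairs copied from rows 7, 10, 12 and 13.
rows = [
    ("wK", 4.03e-02),
    ("w+3_Hydro", 2.93e-01),
    ("w-1_Hydro", 9.89e-02),
    ("w-L_3", 7.00e-03),
]


def is_significant(adj_p, cutoff=ADJUSTED_P_CUTOFF):
    """True when the adjusted p-value falls below the cutoff."""
    return adj_p < cutoff


significant = [name for name, adj_p in rows if is_significant(adj_p)]
print(significant)
```

Note that `w-1_Hydro` has a raw p-value below 0.05 (4.22E-02) but an adjusted p-value above it (9.89E-02), which is why it carries no asterisk in the table.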
Prediction performance of self-consistency, 5-fold cross validation and 10-fold cross-validation tests on the training set.
| Evaluation Method | Accuracy | Specificity | Sensitivity | MCC |
|---|---|---|---|---|
| Self Consistency | 0.97 | 0.98 | 0.76 | 0.68 |
| 5-fold Cross-validation | 0.97 | 0.98 | 0.73 | 0.66 |
| 10-fold Cross-validation | 0.97 | 0.98 | 0.73 | 0.66 |
| Regular Expressions | 0.96 | 0.97 | 0.72 | 0.58 |
Averages over 25 repeats are reported for the 5-fold and 10-fold cross-validation tests. Standard deviations were less than 0.01 and are therefore not reported. The regular expression scan searches for the [IVLMAP]K.[DE] pattern in the sequence window centered on the lysine residue.
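The regular expression baseline can be sketched with Python's `re` module. The pattern is the one quoted in the note; scanning a whole sequence rather than a fixed window, the lookahead used to catch overlapping motifs, and the function name `consensus_lysines` are assumptions made for illustration.

```python
import re

# Pattern quoted in the table note, wrapped in a lookahead so that
# overlapping motifs are all reported.
CONSENSUS = re.compile(r"(?=[IVLMAP]K.[DE])")


def consensus_lysines(sequence):
    """Return 0-based indices of lysines that sit inside the consensus pattern."""
    # Each match starts at the residue before the lysine, hence the +1.
    return [match.start() + 1 for match in CONSENSUS.finditer(sequence.upper())]


# Made-up peptide: K at index 3 (VKQE) and index 9 (LKTD) match, K at 12 does not.
print(consensus_lysines("MAVKQEGGLKTDK"))
```

A scanner like this labels every matching lysine as a predicted site, so it has no tunable threshold and appears as a single point rather than a curve in an ROC plot.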
Figure 3. Receiver operating characteristic (ROC) curves. Average ROC curves over 25 repeats of 5-fold and 10-fold cross-validation. AUC is calculated from the average ROC curve. The dashed line represents a random classifier, and the star indicates the performance of the regular expression scanner.
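One common way to obtain an average ROC curve and its AUC is vertical averaging: interpolate each repeat's true positive rate on a shared false positive rate grid, take the mean, and integrate with the trapezoidal rule. The caption does not say which averaging scheme was used, so the sketch below is an assumption, and the function names and the grid size are hypothetical.

```python
def interpolate(x, xs, ys):
    """Linearly interpolate y at x along a curve with non-decreasing xs."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            if x1 == x0:  # vertical segment: take the upper point
                return max(y0, y1)
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def average_roc(curves, grid_size=101):
    """Average several (fpr, tpr) curves on a common FPR grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    mean_tpr = [
        sum(interpolate(f, fpr, tpr) for fpr, tpr in curves) / len(curves)
        for f in grid
    ]
    return grid, mean_tpr


def auc(fpr, tpr):
    """Area under a curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(fpr, fpr[1:], tpr, tpr[1:])
    )


# Two toy curves: a random classifier and a perfect classifier.
toy_curves = [([0.0, 1.0], [0.0, 1.0]), ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])]
mean_fpr, mean_tpr = average_roc(toy_curves)
print(round(auc(mean_fpr, mean_tpr), 3))
```

In practice each input curve would come from one cross-validation repeat; the two toy curves here only make the arithmetic easy to follow.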
Effect of conformational flexibility and disorder on prediction performance.
| Feature Set | Self-consistency Acc | Self-consistency Sp | Self-consistency Sn | Self-consistency MCC | 5-fold CV Acc | 5-fold CV Sp | 5-fold CV Sn | 5-fold CV MCC |
|---|---|---|---|---|---|---|---|---|
| All Features | 0.97 | 0.98 | 0.76 | 0.68 | 0.97 | 0.98 | 0.73 | 0.66 |
| without Flexibility | 0.97 | 0.98 | 0.75 | 0.67 | 0.97 | 0.98 | 0.72 | 0.66 |
| without Disorder | 0.97 | 0.98 | 0.75 | 0.67 | 0.97 | 0.98 | 0.72 | 0.66 |
| without Flexibility & Disorder | 0.97 | 0.98 | 0.74 | 0.67 | 0.97 | 0.98 | 0.72 | 0.66 |
Conformational flexibility and disorder features ('DisorderBinary', 'DisorderReal') were removed from the dataset individually and together. Self-consistency and 5-fold cross-validation tests were then performed. Averages over 25 repeats are reported for the 5-fold cross-validation tests. Standard deviations were less than 0.01 and are therefore not reported.
Comparison of SUMOsu with other predictors.
| Threshold | Method | Acc | Sp | Sn | MCC |
|---|---|---|---|---|---|
| Low | SUMOsp2.0 | 0.83 | 0.83 | | 0.30 |
| Low | SUMOhydro | 0.91 | 0.91 | 0.71 | 0.41 |
| Low | seeSUMO-RF | 0.82 | 0.83 | | 0.30 |
| Low | seeSUMO-SVM | 0.90 | 0.91 | 0.67 | 0.37 |
| Low | SUMOsu | | | 0.67 | |
| Medium | SUMOsp2.0 | 0.91 | 0.93 | 0.63 | 0.38 |
| Medium | SUMOhydro | 0.92 | 0.94 | 0.67 | 0.43 |
| Medium | seeSUMO-RF | 0.88 | 0.88 | | 0.35 |
| Medium | seeSUMO-SVM | 0.93 | 0.95 | 0.54 | 0.40 |
| Medium | SUMOsu | | | 0.58 | |
| High | SUMOsp2.0 | 0.95 | 0.96 | 0.58 | 0.47 |
| High | SUMOhydro | 0.93 | 0.95 | 0.58 | 0.42 |
| High | seeSUMO-RF | 0.89 | 0.90 | | 0.36 |
| High | seeSUMO-SVM | 0.95 | 0.98 | 0.38 | 0.39 |
| High | SUMOsu | | | 0.58 | |
| | Regular Expressions | 0.95 | 0.96 | 0.71 | 0.56 |
The values of accuracy (Acc), specificity (Sp), sensitivity (Sn), and Matthews correlation coefficient (MCC) were obtained from Chen et al. [23], as the same independent dataset was used in this study. Thresholds for SUMOsu were set to -0.5, 0, and 0.5 for low, medium, and high, respectively.
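For reference, the four reported measures can be computed from a confusion matrix using their standard definitions. The sketch below is a minimal illustration: the counts are made up and do not come from the paper, and the function name `metrics` is hypothetical.

```python
import math


def metrics(tp, fn, fp, tn):
    """Accuracy, specificity, sensitivity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    sp = tn / (tn + fp)  # fraction of negative sites correctly rejected
    sn = tp / (tp + fn)  # fraction of positive sites correctly recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sp": sp, "Sn": sn, "MCC": mcc}


# Hypothetical counts chosen only to illustrate the arithmetic.
example = metrics(tp=15, fn=5, fp=10, tn=170)
print({name: round(value, 2) for name, value in example.items()})
```

Because negative sites vastly outnumber positive ones in this kind of dataset, Acc and Sp are dominated by the negative class, while Sn and MCC respond more strongly to how many true sites are recovered.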