| Literature DB >> 36078053 |
Yafei Zhu1, Yuhai Liu2, Yu Chen1, Lei Li1,3.
Abstract
Lysine SUMOylation plays an essential role in various biological functions. Several approaches integrating various algorithms have been developed for predicting SUMOylation sites based on a limited dataset. Recently, the number of identified SUMOylation sites has increased significantly due to investigation at the proteomics scale. We collected modification data and found that the reported approaches performed poorly on these data. Therefore, it is essential to explore the characteristics of this modification and construct prediction models with improved performance based on an enlarged dataset. In this study, we constructed and compared 16 classifiers by integrating four different algorithms with four encoding features selected from 11 sequence-based or physicochemical features. We found that the convolutional neural network (CNN) model integrated with the residual structure, dubbed ResSUMO, performed favorably compared with the traditional machine learning and plain CNN models in both cross-validation and independent tests. The area under the receiver operating characteristic (ROC) curve for ResSUMO was around 0.80, superior to that of the reported predictors. We also found that increasing the depth of the neural networks in the CNN models did not improve prediction performance, due to the degradation problem, but the residual structure could be included to optimize the neural networks and improve performance. This indicates that residual neural networks have the potential to be broadly applied to the prediction of other types of modification sites with great effectiveness and robustness. Furthermore, the online ResSUMO service is freely accessible.
Keywords: SUMOylation; deep learning; machine learning; modification site prediction; posttranslational modification; residual structure
Year: 2022 PMID: 36078053 PMCID: PMC9454673 DOI: 10.3390/cells11172646
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 7.666
Figure 1. Schematic diagram of data collection and preprocessing for human SUMOylation datasets.
Figure 2. The architecture of the RSCNN model. The BLOSUM62 feature was used as the characteristic matrix of the input layer.
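The residual structure that distinguishes the RSCNN model from a plain CNN adds the block input back onto the convolution output, so a deeper stack can fall back to an identity mapping instead of suffering the degradation problem noted in the abstract. A minimal single-channel 1-D sketch in NumPy (the two-convolution layout and kernel sizes are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel) for i in range(len(x))])

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, k1, k2):
    """Two convolutions with a skip connection: out = relu(conv2(relu(conv1(x))) + x)."""
    h = relu(conv1d_same(x, k1))
    h = conv1d_same(h, k2)
    return relu(h + x)  # the skip connection lets the block reduce to identity

# With all-zero kernels the block passes a non-negative input through unchanged,
# which is why extra residual blocks can at least match a shallower network.
x = np.array([0.5, 1.0, 0.2, 0.8])
out = residual_block(x, np.zeros(3), np.zeros(3))
```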
Figure 3. Sequence pattern surrounding the SUMOylation sites, including the significantly enriched and depleted residues based on SUMOylation-containing peptides and nonmodified peptides in our dataset (p < 0.05, t-test with Bonferroni correction). The pattern was generated using the two-sample logo method (Vacic et al. 2006).
Figure 4. AUC values of the different classifiers in terms of five-fold cross-validation. The classifiers are displayed in ascending order of AUC values. The statistical differences between neighboring classifiers were calculated using a paired Student's t-test. The name of each model is a combination of the names of the involved algorithm and encoding feature.
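The AUC used throughout these comparisons equals the probability that a randomly chosen positive (SUMOylation) site receives a higher score than a randomly chosen negative site. A small sketch of this rank-based (Mann-Whitney) computation; the example labels and scores are invented for illustration:

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of positive-negative
    pairs in which the positive example receives the higher score (ties count half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; a random scorer hovers near 0.5.
perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
partial = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```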
Comparison of performance of the different models for prediction of SUMOylation sites based on five-fold cross-validation *.
| Model | Sn | Sp | MCC | ACC | AUC |
|---|---|---|---|---|---|
| RF_AAindex | 0.697 ± 0.004 | 0.650 ± 0.000 | 0.348 ± 0.004 | 0.674 ± 0.002 | 0.742 ± 0.003 |
| RF_BLOSUM62 | 0.664 ± 0.004 | 0.650 ± 0.000 | 0.314 ± 0.004 | 0.657 ± 0.002 | 0.722 ± 0.003 |
| RF_EAAC | 0.693 ± 0.008 | 0.650 ± 0.000 | 0.344 ± 0.008 | 0.672 ± 0.004 | 0.738 ± 0.005 |
| RF_ZScale | 0.661 ± 0.006 | 0.650 ± 0.000 | 0.311 ± 0.006 | 0.655 ± 0.003 | 0.721 ± 0.004 |
| LGBM_AAindex | 0.709 ± 0.011 | 0.650 ± 0.000 | 0.360 ± 0.011 | 0.680 ± 0.005 | 0.752 ± 0.003 |
| LGBM_BLOSUM62 | 0.708 ± 0.013 | 0.650 ± 0.000 | 0.358 ± 0.013 | 0.679 ± 0.007 | 0.748 ± 0.005 |
| LGBM_EAAC | 0.734 ± 0.008 | 0.650 ± 0.000 | 0.385 ± 0.008 | 0.692 ± 0.004 | 0.762 ± 0.004 |
| LGBM_ZScale | 0.694 ± 0.012 | 0.650 ± 0.000 | 0.344 ± 0.012 | 0.672 ± 0.006 | 0.741 ± 0.005 |
| CNN_AAindex | 0.784 ± 0.007 | 0.650 ± 0.000 | 0.438 ± 0.008 | 0.717 ± 0.004 | 0.788 ± 0.005 |
| CNN_BLOSUM62 | 0.784 ± 0.006 | 0.650 ± 0.000 | 0.438 ± 0.007 | 0.717 ± 0.003 | 0.788 ± 0.004 |
| CNN_EAAC | 0.771 ± 0.009 | 0.650 ± 0.000 | 0.424 ± 0.010 | 0.711 ± 0.004 | 0.782 ± 0.004 |
| CNN_ZScale | 0.779 ± 0.008 | 0.650 ± 0.000 | 0.433 ± 0.009 | 0.714 ± 0.004 | 0.785 ± 0.005 |
| RSCNN_AAindex | 0.803 ± 0.008 | 0.650 ± 0.000 | 0.458 ± 0.009 | 0.726 ± 0.004 | 0.799 ± 0.004 |
| RSCNN_BLOSUM62 | 0.803 ± 0.007 | 0.650 ± 0.000 | 0.458 ± 0.008 | 0.726 ± 0.004 | 0.799 ± 0.003 |
| RSCNN_EAAC | 0.763 ± 0.007 | 0.650 ± 0.000 | 0.416 ± 0.008 | 0.706 ± 0.004 | 0.774 ± 0.004 |
| RSCNN_ZScale | 0.802 ± 0.007 | 0.650 ± 0.000 | 0.457 ± 0.007 | 0.726 ± 0.003 | 0.800 ± 0.004 |
* The model name is a combination of the algorithm and feature names. For example, RF_AAindex combines the RF algorithm and the AAindex feature. The abbreviations of the algorithms and features are described in "Materials and Methods". Each measure (i.e., Sn, Sp, MCC, ACC, and AUC) is shown as the average ± standard deviation of the corresponding values of the models trained and evaluated based on five-fold cross-validation. The Sp values were fixed to allow fair comparison of the Sn, MCC, and ACC measures across different models.
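For reference, the measures in the table above can be reproduced from confusion-matrix counts. A minimal sketch with made-up counts (not the paper's data), chosen so that Sp lands on the fixed 0.650 used for fair comparison:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews
    correlation coefficient (MCC) from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}

# Illustrative counts only: Sp is pinned at 0.650 so Sn, ACC, and MCC
# are comparable across models, mirroring the table's convention.
m = classification_metrics(tp=80, fp=35, tn=65, fn=20)
```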
Figure 5. The AUC values of ResSUMO and the reproduced SUMO-Forest and iSUMOK-PseAAC models in five-fold cross-validation (A), and a performance comparison (B).
Comparison of performance between ResSUMO and the reported models based on five-fold cross-validation and independent tests.
| Model | Sn | Sp | MCC | ACC | AUC |
|---|---|---|---|---|---|
| Five-fold cross-validation | | | | | |
| SUMO-Forest | 0.729 ± 0.006 | 0.650 ± 0.000 | 0.380 ± 0.006 | 0.689 ± 0.003 | 0.760 ± 0.003 |
| iSUMOK-PseAAC | 0.506 ± 0.023 | 0.650 ± 0.000 | 0.158 ± 0.023 | 0.578 ± 0.012 | 0.620 ± 0.016 |
| ResSUMO | 0.802 ± 0.007 | 0.650 ± 0.000 | 0.457 ± 0.007 | 0.726 ± 0.003 | 0.800 ± 0.004 |
| Independent test | | | | | |
| SUMO-Forest | 0.745 ± 0.002 | 0.650 ± 0.000 | 0.397 ± 0.002 | 0.698 ± 0.001 | 0.769 ± 0.002 |
| iSUMOK-PseAAC | 0.524 ± 0.006 | 0.650 ± 0.000 | 0.176 ± 0.006 | 0.587 ± 0.003 | 0.628 ± 0.006 |
| ResSUMO | 0.795 ± 0.007 | 0.650 ± 0.000 | 0.450 ± 0.008 | 0.722 ± 0.003 | 0.801 ± 0.003 |
Figure 6. The online ResSUMO interface for the prediction of lysine SUMOylation sites (A) and its application in prediction (B).
Performance of ResSUMO and the CNN models with different numbers of convolution layers in five-fold cross-validation.
| Model * | Sn | Sp | MCC | ACC | AUC |
|---|---|---|---|---|---|
| CNN-1 | 0.716 ± 0.015 | 0.650 ± 0.000 | 0.367 ± 0.015 | 0.683 ± 0.007 | 0.747 ± 0.009 |
| CNN-2 | 0.779 ± 0.008 | 0.650 ± 0.000 | 0.433 ± 0.009 | 0.714 ± 0.004 | 0.785 ± 0.005 |
| CNN-4 | 0.773 ± 0.005 | 0.650 ± 0.000 | 0.426 ± 0.006 | 0.711 ± 0.003 | 0.783 ± 0.002 |
| CNN-6 | 0.765 ± 0.009 | 0.650 ± 0.000 | 0.418 ± 0.009 | 0.708 ± 0.004 | 0.779 ± 0.007 |
| CNN-8 | 0.764 ± 0.013 | 0.650 ± 0.000 | 0.417 ± 0.014 | 0.707 ± 0.006 | 0.778 ± 0.005 |
| ResSUMO | 0.802 ± 0.007 | 0.650 ± 0.000 | 0.457 ± 0.007 | 0.726 ± 0.003 | 0.800 ± 0.004 |
* The number in each CNN model name represents the number of convolutional layers.