Suresh Pokharel, Pawel Pratyush, Michael Heinzinger, Robert H Newman, Dukka B Kc.
Abstract
Protein succinylation is an important post-translational modification (PTM) responsible for many vital metabolic activities in cells, including cellular respiration, regulation, and repair. Here, we present a novel approach that combines features from supervised word embedding with embeddings from a protein language model called ProtT5-XL-UniRef50 (hereafter, ProtT5) in a deep learning framework to predict protein succinylation sites. To our knowledge, this is one of the first attempts to employ embeddings from a pre-trained protein language model to predict protein succinylation sites. The proposed model, dubbed LMSuccSite, achieves state-of-the-art results compared to existing methods, with performance scores of 0.36, 0.79, and 0.79 for MCC, sensitivity, and specificity, respectively. LMSuccSite is likely to serve as a valuable resource for exploration of succinylation and its role in cellular physiology and disease.
Year: 2022 PMID: 36209286 PMCID: PMC9547369 DOI: 10.1038/s41598-022-21366-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Description of the training and independent test datasets.
| Dataset type | Positive (succinylated) | Negative (non-succinylated) |
|---|---|---|
| Training data | 4750 | 50,565 |
| Training data (after balancing) | 4750 | 4750 |
| Benchmark independent test data | 254 | 2977 |
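The table reports the training set before and after balancing but does not say how the balancing was done. The following is a minimal sketch that assumes simple random undersampling of the negative class; the function name `undersample` and the toy records are hypothetical, and only the class sizes come from the table.

```python
import random


def undersample(positives, negatives, seed=0):
    """Randomly drop negatives until both classes have the same size.

    Assumption: plain random undersampling; the source table does not
    state which balancing method was actually used.
    """
    rng = random.Random(seed)
    if len(negatives) <= len(positives):
        return list(positives), list(negatives)
    kept = rng.sample(negatives, k=len(positives))
    return list(positives), kept


# Toy stand-ins for sequence windows, sized as in the training-data row.
pos = [("pos", i) for i in range(4750)]
neg = [("neg", i) for i in range(50565)]

bal_pos, bal_neg = undersample(pos, neg)
print(len(bal_pos), len(bal_neg))  # 4750 4750
```

Fixing the seed keeps the selected negatives reproducible across runs, which matters when cross-validation folds are later compared between models.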
Figure 1. Architecture of the supervised word embedding based model using a convolutional neural network.
Figure 2. Architecture of the ProtT5 based model using a two-layer artificial neural network.
Figure 3. Model architecture of the ensemble mechanism and meta-classifier.
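The ensemble in Figure 3 passes the outputs of the two base models to a meta-classifier. The sketch below illustrates that stacking idea under stated assumptions: each base model is assumed to emit one probability per candidate site, and a small logistic regression trained by gradient descent stands in for the meta-classifier (the source names an ANN as the final LMSuccSite meta-classifier). The function names and the toy probabilities are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_lr(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on stacked base-model probabilities."""
    n_feat = len(features[0])
    n = len(features)
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        # Batch gradient-descent step on the mean log-loss gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_meta(w, b, x):
    """Probability that a site is succinylated, given stacked inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical base-model outputs per site: (p_embedding, p_prott5).
stacked = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6),
           (0.2, 0.3), (0.4, 0.1), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

w, b = fit_meta_lr(stacked, labels)
print(predict_meta(w, b, (0.85, 0.75)) > 0.5)
```

In practice the stacked inputs would come from base-model predictions on held-out data, so that the meta-classifier does not learn from outputs the base models produced on their own training examples.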
10-fold cross-validation results on the training set for the Embedding and ProtT5 modules with different ML and DL models.
| Encoding approach | Architecture | ACC | MCC | Sn | Sp |
|---|---|---|---|---|---|
| Embedding | CNN2D | 0.73 ± 0.02 | | 0.76 ± 0.01 | 0.70 ± 0.01 |
| Embedding | LSTM | 0.71 ± 0.01 | | 0.77 ± 0.04 | 0.66 ± 0.03 |
| ProtT5 | RF | 0.62 ± 0.01 | 0.25 ± 0.01 | 0.59 ± 0.01 | 0.65 ± 0.01 |
| ProtT5 | SVM | 0.73 ± 0.01 | 0.46 ± 0.01 | 0.75 ± 0.02 | |
| ProtT5 | XGBoost | 0.70 ± 0.01 | 0.41 ± 0.01 | 0.76 ± 0.01 | 0.65 ± 0.01 |
| ProtT5 | CNN1D | 0.69 ± 0.01 | 0.38 ± 0.03 | 0.59 ± 0.09 | |
| ProtT5 | ANN | 0.76 ± 0.02 | 0.71 ± 0.02 | | |
The highest values in each category are bolded.
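The cross-validation entries are reported as a mean ± spread over 10 folds. A minimal sketch of how such a summary can be produced follows; it assumes the spread is the sample standard deviation (the source does not say), and the helper names and per-fold accuracies are hypothetical. The sample count of 9500 is the balanced training set (4750 + 4750).

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def summarize(scores):
    """Format per-fold scores as 'mean ± sample standard deviation'."""
    return f"{statistics.mean(scores):.2f} ± {statistics.stdev(scores):.2f}"


folds = k_fold_indices(9500, k=10)
for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be fit on train_idx and scored on test_idx here.

# Hypothetical per-fold accuracies, for illustration only.
fold_acc = [0.72, 0.74, 0.73, 0.75, 0.71, 0.73, 0.74, 0.72, 0.73, 0.73]
print(summarize(fold_acc))  # 0.73 ± 0.01
```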
Performance of different ML/DL models as the meta-classifier.
| Encoding | Model | ACC | MCC | Sn | Sp |
|---|---|---|---|---|---|
| Embedding + ProtT5 | SVM | 0.76 ± 0.01 | 0.52 ± 0.02 | 0.80 ± 0.02 | 0.71 ± 0.02 |
| Embedding + ProtT5 | RF | 0.75 ± 0.01 | 0.51 ± 0.02 | 0.79 ± 0.01 | 0.71 ± 0.02 |
| Embedding + ProtT5 | LR | 0.74 ± 0.01 | 0.50 ± 0.03 | 0.78 ± 0.02 | 0.71 ± 0.02 |
| Embedding + ProtT5 | XGBoost | 0.73 ± 0.02 | 0.46 ± 0.04 | 0.75 ± 0.03 | 0.71 ± 0.02 |
| Embedding + ProtT5 | ANN (LMSuccSite) | | | | |
The highest values in each category are in bold.
Figure 4. ROC (receiver operating characteristic) and PR (precision-recall) curves with AUC (area under the curve) for the models in Tables 3 and 4. (a) ROC curves for supervised embedding based models; (b) PR curves for supervised embedding based models; (c) ROC curves for ProtT5 based models; (d) PR curves for ProtT5 based models; (e) ROC curves of different meta-classifiers for combined models; (f) PR curves of different meta-classifiers for combined models.
Comparison of our model with existing succinylation prediction tools using an independent test set.
| Tool | ACC | MCC | Sn | Sp | g-mean |
|---|---|---|---|---|---|
| iSuc-PseAAC | 0.83 | 0.01 | 0.12 | 0.89 | 0.33 |
| iSuc-PseOpt | 0.72 | 0.04 | 0.30 | 0.76 | 0.48 |
| pSuc-Lys | 0.78 | 0.04 | 0.22 | 0.83 | 0.43 |
| SuccineSite | 0.84 | 0.19 | 0.37 | 0.88 | 0.57 |
| SuccineSite2.0 | 0.85 | 0.26 | 0.40 | | 0.63 |
| GPSuc | | 0.30 | 0.50 | 0.88 | 0.66 |
| PSuccE | 0.85 | 0.20 | 0.38 | 0.89 | 0.58 |
| DeepSuccinylSite | 0.70 | 0.27 | | 0.69 | 0.74 |
| LMSuccSite | | 0.36 | 0.79 | 0.79 | |
The numbers are rounded to two decimal places. The highest value in each category is shown in bold. ACC, accuracy; MCC, Matthews correlation coefficient; Sn, sensitivity; Sp, specificity; g-mean, geometric mean of sensitivity and specificity.
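For reference, the five metrics abbreviated above can all be computed from a binary confusion matrix using their standard definitions. The sketch below is illustrative only: the function name `binary_metrics` is hypothetical, and the example counts are invented to match the independent test set size (254 positives, 2977 negatives), not taken from the authors' results.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard ACC, MCC, Sn, Sp and g-mean from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    g_mean = math.sqrt(sn * sp)
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp, "g-mean": g_mean}


# Hypothetical counts: 254 positives and 2977 negatives in total.
m = binary_metrics(tp=200, fn=54, tn=2352, fp=625)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts, sensitivity and specificity both round to 0.79 while MCC rounds to only 0.36, which shows how strongly the roughly 1:12 class imbalance of the test set depresses MCC relative to the balanced cross-validation setting.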
Figure 5. t-SNE visualization of the high-dimensional embeddings learned from different features. (a) Before training the CNN2D Embedding module; (b) after training CNN2D with Embedding; (c) before training the ANN model using ProtT5 features; (d) after training the ANN model using ProtT5 features; (e) after training the combined model. Blue dots represent non-succinylated sites and orange dots represent succinylated sites.
Figure 6. Sensitivity analysis of the final model when varying the size of the available training data.