Fatma Indriani, Kunti Robiatul Mahmudah, Bedy Purnama, Kenji Satou.
Abstract
Lysine glutarylation is a post-translational modification (PTM) that plays a regulatory role in various physiological and biological processes. Identifying glutarylated peptides using proteomic techniques is expensive and time-consuming. Computational models and predictors can therefore be useful for the rapid identification of glutarylation. In this study, we propose a model called ProtTrans-Glutar that classifies a protein sequence as a positive or negative glutarylation site by combining traditional sequence-based features with features derived from a pre-trained transformer-based protein model. The features of the model were constructed by combining several feature sets, namely the distribution feature (from composition/transition/distribution encoding), enhanced amino acid composition (EAAC), and features derived from the ProtT5-XL-UniRef50 model. Combined with random under-sampling and the XGBoost classification method, our model obtained recall, specificity, and AUC scores of 0.7864, 0.6286, and 0.7075, respectively, on an independent test set. The recall and AUC scores were notably higher than those of previous glutarylation prediction models using the same dataset. This high recall score suggests that our method has the potential to identify new glutarylation sites and facilitate further research on the glutarylation process.
Keywords: binary classification; imbalanced data classification; lysine glutarylation; machine learning; post-translation modification; protein embedding; protein sequence; transformer-based models
Year: 2022 PMID: 35711929 PMCID: PMC9194472 DOI: 10.3389/fgene.2022.885929
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Workflow strategy for the development of the ProtTrans-Glutar model.
Number of positive and negative sites in the training and test sets.
| | Training set | Test set | Total |
|---|---|---|---|
| Positive sites | 400 | 44 | 444 |
| Negative sites | 1703 | 203 | 1906 |
| Total | 2103 | 247 | 2350 |
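
The training set is imbalanced (400 positive versus 1703 negative sites), and the abstract states that random under-sampling was used. The sketch below illustrates the general idea only: it assumes a 1:1 target ratio and a fixed seed, neither of which is specified in this record, and it uses placeholder identifiers instead of real peptide data.

```python
import random


def random_undersample(positives, negatives, seed=0):
    """Return a balanced, shuffled list of (sample, label) pairs.

    Assumption: the larger class is sampled down to the size of the
    smaller class (1:1 ratio); the source does not state the ratio.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n)  # sampling without replacement
    neg = rng.sample(negatives, n)
    data = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    rng.shuffle(data)
    return data


# Placeholder identifiers standing in for the training sites in the table.
positives = [f"pos_{i}" for i in range(400)]
negatives = [f"neg_{i}" for i in range(1703)]
balanced = random_undersample(positives, negatives)
```

With these counts the balanced set holds 800 samples, 400 per class; only the training data would normally be resampled this way.
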
Physicochemical attributes and the corresponding division of the amino acids.
| Attribute | Division 1 | Division 2 | Division 3 |
|---|---|---|---|
| Hydrophobicity_PRAM900101 | Polar: RKEDQN | Neutral: GASTPHY | Hydrophobicity: CLVIMFW |
| Hydrophobicity_ARGP820101 | Polar: QSTNGDE | Neutral: RAHCKMV | Hydrophobicity: LYPFIW |
| Hydrophobicity_ZIMJ680101 | Polar: QNGSWTDERA | Neutral: HMCKV | Hydrophobicity: LPFYI |
| Hydrophobicity_PONP930101 | Polar: KPDESNQT | Neutral: GRHA | Hydrophobicity: YMFWLCVI |
| Hydrophobicity_CASG920101 | Polar: KDEQPSRNTG | Neutral: AHYMLV | Hydrophobicity: FIWC |
| Hydrophobicity_ENGD860101 | Polar: RDKENQHYP | Neutral: SGTAW | Hydrophobicity: CVLIMF |
| Hydrophobicity_FASG890101 | Polar: KERSQD | Neutral: NTPG | Hydrophobicity: AYHWVMFLIC |
| Normalized van der Waals volume | Volume range: 0–2.78 (GASTPD) | Volume range: 2.95–4.0 (NVEQIL) | Volume range: 4.03–8.08 (MHKFRYW) |
| Polarity | Polarity value: 4.9–6.2 (LIFWCMVY) | Polarity value: 8.0–9.2 (PATGS) | Polarity value: 10.4–13.0 (HQRKNED) |
| Polarizability | Polarizability value: 0–0.108 (GASDT) | Polarizability value: 0.128–0.186 (GPNVEQIL) | Polarizability value: 0.219–0.409 (KMHFRYW) |
| Charge | Positive: KR | Neutral: ANCQGHILMFPSTWYV | Negative: DE |
| Secondary structure | Helix: EALMQKRH | Strand: VIYCWFT | Coil: GNPSD |
| Solvent accessibility | Buried: ALFCGIVW | Exposed: PKQEND | Intermediate: MPSTHY |
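
To make the distribution (CTDD) descriptor concrete, the sketch below computes it for a single attribute, using the Hydrophobicity_PRAM900101 groups from the first row of the table. For each group it reports where the first, 25%, 50%, 75%, and last member occurs, as a percentage of the sequence length, giving 3 × 5 = 15 values per attribute; 13 attributes would then give 195 values. The rank convention (flooring, clamping to the first occurrence) follows a commonly used CTD implementation and is an assumption here, as are the function names.

```python
import math

# Groups copied from the Hydrophobicity_PRAM900101 row of the table.
GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def distribution(sequence, group):
    """Relative positions (%) of the first, 25%, 50%, 75% and last residue of `group`."""
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in group]
    if not positions:
        return [0.0] * 5  # group absent from the sequence
    n = len(positions)
    # Occurrence ranks to report; ranks below 1 are clamped to the first hit.
    ranks = [1, math.floor(0.25 * n), math.floor(0.50 * n), math.floor(0.75 * n), n]
    ranks = [max(1, r) for r in ranks]
    return [positions[r - 1] / len(sequence) * 100 for r in ranks]


def ctdd_one_attribute(sequence, groups=GROUPS):
    """Concatenate the five distribution values of every group (15 values)."""
    features = []
    for members in groups.values():
        features.extend(distribution(sequence, members))
    return features


features = ctdd_one_attribute("RGGC")
```

For the toy sequence `RGGC`, the single polar residue sits at 25% of the length, the two neutral residues at 50% and 75%, and the hydrophobic residue at 100%.
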
Features investigated for method development.
| Group | Feature set | Length of features |
|---|---|---|
| Amino acid composition | AAC | 20 |
| | EAAC | 380 |
| C/T/D | CTDC | 39 |
| | CTDT | 39 |
| | CTDD | 195 |
| Pseudo amino acid composition | PAAC | 35 |
| | APAAC | 50 |
| Embeddings from pretrained transformer-based model | ProtBERT | 1,024 |
| | ProtBert-BFD | 1,024 |
| | ProtAlbert | 4,096 |
| | ProtT5-XL-UniRef50 | 1,024 |
| | ProtT5-XL-BFD | 1,024 |
| | ProtXLNet | 1,024 |
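
As an illustration of how the EAAC length arises, the sketch below computes amino acid frequencies in each sliding window and concatenates them. The window size of 5 and the 23-residue peptide are assumptions (they are not stated in this record) chosen because 19 windows × 20 amino acids reproduces the length of 380 in the table; the peptide itself is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def eaac(peptide, window=5):
    """Amino acid frequencies per sliding window, concatenated left to right."""
    features = []
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        # Gap or padding characters (e.g. "-") are ignored when normalising.
        residues = [aa for aa in segment if aa in AMINO_ACIDS]
        total = len(residues) or 1  # avoid division by zero for all-gap windows
        features.extend(residues.count(aa) / total for aa in AMINO_ACIDS)
    return features


# Hypothetical 23-residue peptide with a lysine (K) at the centre position.
peptide = "ACDEFGHIKLMKNPQRSTVWYAC"
eaac_features = eaac(peptide)

# Concatenating CTDD (195), EAAC (380) and a ProtT5-XL-UniRef50 embedding (1,024)
# gives the combined feature length.
combined_length = 195 + len(eaac_features) + 1024
```

Under these assumptions the combined length is 1,599, which matches the length listed for ProtTrans-Glutar in the comparison of best models.
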
Cross validation result of models from sequence-based features.
| Feature groups | Classifier | Rec | Spe | Pre | Acc | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| AAC | Adaboost | 0.6120 | 0.6013 | 0.2654 | 0.6033 | 0.1690 | 0.3700 | 0.6433 |
| | MLP | 0.6520 | 0.6192 | 0.2864 | 0.6255 | 0.2150 | 0.3977 | 0.6864 |
| | Random Forest | 0.6190 | 0.5809 | 0.2575 | 0.5881 | 0.1576 | 0.3635 | 0.6378 |
| | SVM | 0.6395 | 0.5969 | 0.2714 | 0.6050 | 0.1868 | 0.3808 | 0.6651 |
| | XGBoost | 0.5917 | 0.5482 | 0.2353 | 0.5565 | 0.1102 | 0.3362 | 0.6101 |
| EAAC | Adaboost | 0.5983 | 0.6015 | 0.2608 | 0.6009 | 0.1584 | 0.3629 | 0.6384 |
| | MLP | 0.5850 | 0.5946 | 0.2530 | 0.5928 | 0.1422 | 0.3529 | 0.6323 |
| | Random Forest | 0.6450 | 0.6598 | 0.3089 | 0.6570 | 0.2450 | 0.4171 | 0.6999 |
| | SVM | 0.5967 | 0.6434 | 0.2821 | 0.6345 | 0.1923 | 0.3827 | 0.6571 |
| | XGBoost | 0.6408 | 0.6385 | 0.2945 | 0.6389 | 0.2230 | 0.4030 | 0.6834 |
| CTDC | Adaboost | 0.7050 | 0.5518 | 0.2699 | 0.5809 | 0.2019 | 0.3901 | 0.6641 |
| | MLP | 0.6867 | 0.6034 | 0.2905 | 0.6193 | 0.2300 | 0.4073 | 0.6912 |
| | Random Forest | 0.6408 | 0.5676 | 0.2579 | 0.5815 | 0.1639 | 0.3676 | 0.6556 |
| | SVM | 0.6842 | 0.5657 | 0.2705 | 0.5882 | 0.1966 | 0.3874 | 0.6765 |
| | XGBoost | 0.6367 | 0.5754 | 0.2605 | 0.5871 | 0.1672 | 0.3693 | 0.6450 |
| CTDT | Adaboost | 0.6208 | 0.5762 | 0.2566 | 0.5847 | 0.1556 | 0.3627 | 0.6261 |
| | MLP | 0.6408 | 0.5756 | 0.2622 | 0.5880 | 0.1708 | 0.3717 | 0.6439 |
| | Random Forest | 0.6025 | 0.5982 | 0.2603 | 0.5990 | 0.1588 | 0.3633 | 0.6241 |
| | SVM | 0.6425 | 0.5841 | 0.2661 | 0.5952 | 0.1787 | 0.3760 | 0.6493 |
| | XGBoost | 0.5783 | 0.5668 | 0.2390 | 0.5690 | 0.1147 | 0.3378 | 0.6015 |
| CTDD | Adaboost | 0.6358 | 0.6046 | 0.2744 | 0.6106 | 0.1904 | 0.3831 | 0.6531 |
| | MLP | 0.5942 | 0.5365 | 0.2434 | 0.5475 | 0.1120 | 0.3297 | 0.6065 |
| | Random Forest | 0.6967 | 0.6164 | 0.2994 | 0.6316 | 0.2476 | 0.4185 | 0.6987 |
| | SVM | 0.6675 | 0.6111 | 0.2877 | 0.6218 | 0.2206 | 0.4017 | 0.6794 |
| | XGBoost | 0.6675 | 0.6201 | 0.2927 | 0.6291 | 0.2282 | 0.4064 | 0.6847 |
| PAAC | Adaboost | 0.5942 | 0.6052 | 0.2611 | 0.6031 | 0.1581 | 0.3626 | 0.6253 |
| | MLP | 0.5958 | 0.5717 | 0.2462 | 0.5763 | 0.1321 | 0.3482 | 0.6261 |
| | Random Forest | 0.6375 | 0.5809 | 0.2633 | 0.5917 | 0.1723 | 0.3723 | 0.6413 |
| | SVM | 0.6617 | 0.5905 | 0.2752 | 0.6041 | 0.1990 | 0.3885 | 0.6745 |
| | XGBoost | 0.6217 | 0.5731 | 0.2554 | 0.5823 | 0.1537 | 0.3615 | 0.6375 |
| APAAC | Adaboost | 0.6125 | 0.5976 | 0.2634 | 0.6004 | 0.1662 | 0.3682 | 0.6367 |
| | MLP | 0.5658 | 0.5904 | 0.2450 | 0.5857 | 0.1237 | 0.3416 | 0.6162 |
| | Random Forest | 0.6458 | 0.5831 | 0.2671 | 0.5950 | 0.1805 | 0.3776 | 0.6464 |
| | SVM | 0.6650 | 0.5970 | 0.2794 | 0.6099 | 0.2069 | 0.3932 | 0.6777 |
| | XGBoost | 0.6425 | 0.5694 | 0.2596 | 0.5833 | 0.1668 | 0.3695 | 0.6375 |
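
The column headers Rec, Spe, Pre, Acc, MCC, and F1 are the standard confusion-matrix metrics (recall, specificity, precision, accuracy, Matthews correlation coefficient, and F1 score). The sketch below shows their usual definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    No guard against empty classes: every denominator is assumed non-zero.
    """
    rec = tp / (tp + fn)                   # recall (sensitivity)
    spe = tn / (tn + fp)                   # specificity
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    f1 = 2 * pre * rec / (pre + rec)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Rec": rec, "Spe": spe, "Pre": pre, "Acc": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts for illustration only.
m = binary_metrics(tp=30, fn=10, tn=120, fp=50)
```

On imbalanced data such as this, precision and F1 stay low even when recall and specificity are both moderate, because false positives from the large negative class outnumber the true positives.
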
Cross validation result of models from pre-trained transformer models.
| Feature groups | Classifier | Rec | Spe | Pre | Acc | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| ProtBERT | Adaboost | 0.5767 | 0.5680 | 0.2389 | 0.5697 | 0.1142 | 0.3374 | 0.5996 |
| | MLP | 0.5892 | 0.5608 | 0.2395 | 0.5662 | 0.1187 | 0.3396 | 0.6128 |
| | Random Forest | 0.5567 | 0.6426 | 0.2681 | 0.6262 | 0.1602 | 0.3616 | 0.6415 |
| | SVM | 0.7042 | 0.4775 | 0.2420 | 0.5207 | 0.1475 | 0.3578 | 0.6275 |
| | XGBoost | 0.6033 | 0.6007 | 0.2619 | 0.6012 | 0.1616 | 0.3649 | 0.6398 |
| ProtBert-BFD | Adaboost | 0.5433 | 0.5547 | 0.2231 | 0.5525 | 0.0773 | 0.3162 | 0.5776 |
| | MLP | 0.5900 | 0.5645 | 0.2420 | 0.5694 | 0.1218 | 0.3430 | 0.6076 |
| | Random Forest | 0.5383 | 0.6230 | 0.2510 | 0.6069 | 0.1289 | 0.3421 | 0.6122 |
| | SVM | 0.6242 | 0.5819 | 0.2595 | 0.5899 | 0.1626 | 0.3662 | 0.6420 |
| | XGBoost | 0.5908 | 0.5733 | 0.2453 | 0.5766 | 0.1295 | 0.3464 | 0.6142 |
| ProtAlbert | Adaboost | 0.5875 | 0.5753 | 0.2450 | 0.5776 | 0.1284 | 0.3456 | 0.6193 |
| | MLP | 0.5858 | 0.6189 | 0.2657 | 0.6126 | 0.1646 | 0.3615 | 0.6407 |
| | Random Forest | 0.5808 | 0.6316 | 0.2703 | 0.6220 | 0.1697 | 0.3687 | 0.6535 |
| | SVM | 0.6283 | 0.6136 | 0.2767 | 0.6164 | 0.1919 | 0.3840 | 0.6744 |
| | XGBoost | 0.6092 | 0.5927 | 0.2604 | 0.5958 | 0.1597 | 0.3646 | 0.6477 |
| ProtT5-XL-UniRef50 | Adaboost | 0.5533 | 0.5655 | 0.2306 | 0.5632 | 0.0938 | 0.3254 | 0.5897 |
| | MLP | 0.6192 | 0.5633 | 0.2501 | 0.5739 | 0.1439 | 0.3558 | 0.6296 |
| | Random Forest | 0.5608 | 0.6171 | 0.2562 | 0.6064 | 0.1419 | 0.3515 | 0.6237 |
| | SVM | 0.6583 | 0.5710 | 0.2653 | 0.5876 | 0.1807 | 0.3777 | 0.6600 |
| | XGBoost | 0.5933 | 0.5807 | 0.2497 | 0.5831 | 0.1377 | 0.3509 | 0.6183 |
| ProtT5-XL-BFD | Adaboost | 0.5892 | 0.5600 | 0.2395 | 0.5656 | 0.1175 | 0.3405 | 0.5959 |
| | MLP | 0.6000 | 0.5768 | 0.2502 | 0.5812 | 0.1396 | 0.3529 | 0.6188 |
| | Random Forest | 0.5392 | 0.6163 | 0.2485 | 0.6017 | 0.1242 | 0.3399 | 0.6145 |
| | SVM | 0.6550 | 0.5625 | 0.2604 | 0.5801 | 0.1711 | 0.3724 | 0.6548 |
| | XGBoost | 0.5858 | 0.5862 | 0.2490 | 0.5862 | 0.1361 | 0.3489 | 0.6224 |
| ProtXLNet | Adaboost | 0.5125 | 0.5343 | 0.2057 | 0.5302 | 0.0369 | 0.2934 | 0.5421 |
| | MLP | 0.5325 | 0.5248 | 0.2081 | 0.5262 | 0.0450 | 0.2991 | 0.5463 |
| | Random Forest | 0.5050 | 0.5668 | 0.2152 | 0.5551 | 0.0568 | 0.3015 | 0.5511 |
| | SVM | 0.4742 | 0.5770 | 0.2103 | 0.5575 | 0.0408 | 0.2900 | 0.5460 |
| | XGBoost | 0.5642 | 0.5504 | 0.2274 | 0.5530 | 0.0902 | 0.3238 | 0.5652 |
Performance comparison of the best models in each group.
| Evaluation | Models | Length | Rec | Spe | Pre | Acc | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|
| 10-fold CV on Training Data | ProtTrans-Glutar | 1,599 | 0.6783 | 0.6277 | 0.3004 | 0.6374 | 0.2433 | 0.4158 | 0.7093 |
| | ProtAlbert + SVM | 4,096 | 0.6283 | 0.6136 | 0.2767 | 0.6164 | 0.1919 | 0.3840 | 0.6744 |
| | EAAC + RF | 380 | 0.6450 | 0.6598 | 0.3089 | 0.6570 | 0.2450 | 0.4171 | 0.6999 |
| Independent Test Set | ProtTrans-Glutar | 1,599 | 0.7864 | 0.6286 | 0.3147 | 0.6567 | 0.3196 | 0.4494 | 0.7075 |
| | ProtAlbert + SVM | 4,096 | 0.6500 | 0.6286 | 0.2753 | 0.6324 | 0.2161 | 0.3866 | 0.6393 |
| | EAAC + RF | 380 | 0.6409 | 0.6739 | 0.2989 | 0.6680 | 0.2479 | 0.4076 | 0.6574 |
The ProtTrans-Glutar model uses the combined features CTDD, EAAC, and ProtT5-XL-UniRef50 with an XGBoost classifier.
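
The reported figures can be cross-checked with simple arithmetic: accuracy is the class-size-weighted average of recall and specificity. The sketch below applies this to the ProtTrans-Glutar row for the independent test set, using the 44 positive and 203 negative test sites from the dataset table. The function name is illustrative.

```python
def accuracy_from_rates(recall, specificity, n_pos, n_neg):
    """Accuracy as the class-size-weighted average of recall and specificity."""
    return (recall * n_pos + specificity * n_neg) / (n_pos + n_neg)


# ProtTrans-Glutar on the independent test set (44 positive, 203 negative sites).
acc = accuracy_from_rates(0.7864, 0.6286, n_pos=44, n_neg=203)
print(round(acc, 4))  # 0.6567
```

The result agrees with the Acc value of 0.6567 in the table, which indicates that the recall, specificity, and accuracy columns are mutually consistent.
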
FIGURE 2. Independent test evaluation of the best models from each group.
FIGURE 3. ROC curve plot of the best models in each group.
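
The AUC summarised by an ROC curve equals the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made up for illustration, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.
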
Performance comparison of existing models.
| Models | Resources | Rec | Spe | Pre | Acc | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| GlutPred | PLMD | 0.5179 | 0.7850 | 0.2397 | 0.7541 | 0.2238 | n/a | 0.7663 |
| iGlu-Lys | PLMD | 0.5143 | 0.9531 | n/a | 0.8853 | 0.52 | n/a | 0.8842 |
| MDDGlutar | PLMD | 0.652 | 0.739 | n/a | 0.71 | 0.38 | n/a | n/a |
| iGlu_AdaBoost | PLMD, NCBI, Swiss-Prot | 0.7273 | 0.7192 | 0.3596 | 0.7207 | 0.36 | 0.48 | 0.6300 |
| ProtTrans-Glutar | PLMD, NCBI, Swiss-Prot | 0.7864 | 0.6286 | 0.3147 | 0.6567 | 0.3196 | 0.4494 | 0.7075 |
Performance comparison with RF-GlutarySite using balanced train and test data.
| Models | Resources | Rec | Spe | Pre | Acc | MCC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| RF-GlutarySite | PLMD, NCBI, Swiss-Prot | 0.741 | 0.685 | 0.72 | 0.713 | 0.43 | 0.72 | 0.72 |
| ProtTrans-Glutar (balanced) | PLMD, NCBI, Swiss-Prot | 0.7864 | 0.6455 | 0.6955 | 0.7159 | 0.4388 | 0.7358 | 0.7159 |
The RF-GlutarySite model balanced both the training and test datasets using undersampling.