
Mini-review: Recent advances in post-translational modification site prediction based on deep learning.

Lingkuan Meng1,2, Wai-Sum Chan1, Lei Huang1, Linjing Liu1, Xingjian Chen1, Weitong Zhang1, Fuzhou Wang1, Ke Cheng2, Hongyan Sun2, Ka-Chun Wong1.   

Abstract

Post-translational modifications (PTMs) are closely linked to numerous diseases, playing a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, deep learning approaches for PTM prediction provide accurate and rapid screening, guiding downstream wet experiments to leverage the screened information for focused studies. In this paper, we review recent work on deep learning for identifying phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarize PTM databases and discuss future directions with critical insights.
© 2022 The Authors.


Keywords:  AAindex, Amino acid index; ATP, Adenosine triphosphate; AUC, Area under curve; Ac, Acetylation; BE, Binary encoding; BLOSUM, Blocks substitution matrix; Bi-LSTM, Bidirectional LSTM; CKSAAP, Composition of k-spaced amino acid Pairs; CNN, Convolutional neural network; CNNOH, CNN with the one-hot encoding; CNNWE, CNN with the word-embedding encoding; CNNrgb, CNN red green blue; CV, Cross-validation; DC-CNN, Densely connected convolutional neural network; DL, Deep learning; DNNs, Deep neural networks; Deep learning; E. coli, Escherichia coli; EBGW, Encoding based on grouped weight; EGAAC, Enhanced grouped amino acids content; IG, Information gain; K, Lysine; KNN, k nearest neighbor; LASSO, Least absolute shrinkage and selection operator; LSTM, Long short-term memory; LSTMWE, LSTM with the word-embedding encoding; M.musculus, Mus musculus; MDC, Modular densely connected convolutional networks; MDCAN, Multilane dense convolutional attention network; ML, Machine learning; MLP, Multilayer perceptron; MMI, Multivariate mutual information; Machine learning; Mass spectrometry; NMBroto, Normalized Moreau-Broto autocorrelation; P, Proline; PSP, PhosphoSitePlus; PSSM, Position-specific scoring matrix; PTM, Post-translational modifications; Ph, Phosphorylation; Post-translational modification; Prediction; PseAAC, Pseudo-amino acid composition; R, Arginine; RF, Random forest; RNN, Recurrent neural network; ROC, Receiver operating characteristic; S, Serine; S. typhimurium, Salmonella typhimurium; S.cerevisiae, Saccharomyces cerevisiae; SE, Squeeze and excitation; SEV, Split to Equal Validation; ST, Source and target; SUMO, Small ubiquitin-like modifier; SVM, Support vector machines; T, Threonine; Ub, Ubiquitination; Y, Tyrosine; ZSL, Zero-shot learning

Year:  2022        PMID: 35860402      PMCID: PMC9284371          DOI: 10.1016/j.csbj.2022.06.045

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   6.155


Introduction

Post-translational modifications (PTMs) generally refer to the addition of functional groups (e.g., phosphates, acetates, small proteins, lipids, and carbohydrates) to amino acids during or after translation [1]. A PTM changes the chemical properties or structure of the modified amino acid, leading to functional changes. To date, over 600 different types of PTMs have been discovered in different proteins [2], [3]. Phosphorylation, acetylation, and ubiquitination are the most extensively studied PTMs, as quantified in the dbPTM database [4]. PTMs are critical in maintaining protein structures [5], functions [6], metabolic regulation [7], cellular signaling [8], and proteomic diversity [9], so understanding PTMs is essential for understanding downstream consequences such as diseases. For example, S-nitrosylation is a promising therapeutic target for cancers and neurodegenerative diseases [10], [11], [12]; methyl glutamine is associated with the host defence mechanism against microorganisms [13], [14]. Different experimental techniques have been developed to reveal the mechanisms underlying PTMs, including chromatin immunoprecipitation (ChIP) [15], western blotting (WB) [16], mass spectrometry (MS) [17], [18], and isotope labeling [19]. Over the past decade, MS-based proteomic techniques [20] have played a major role in PTM identification, yielding solid data backed by experimental evidence [21]. In addition, computational methods can explore and predict new modification sites by building a model from those data. In the last few years, machine learning has grown to be a cost-effective and labor-efficient method for the prediction of various PTM sites [22], [23], [24], [25], [26], [27], [28]. Specifically, deep learning is an advanced machine learning method that is capable of automatically exploring PTM patterns and capturing high-level abstraction (Fig. 1 [29]).
It is therefore a suitable way to improve the efficiency of PTM site prediction, and it has attracted growing interest in recent years (Fig. 2). Many published works have focused on adopting deep learning to predict PTM sites for phosphorylation [30], acetylation [31], ubiquitination [32], and many other types of modifications [33], [34]. One of the best-known tools is MusiteDeep [30], developed by Wang and Zeng, which leverages a convolutional neural network (CNN) and a 2D attention mechanism for phosphorylation site prediction. DeepPhos [35], created by Luo et al., is an efficient phosphorylation site predictor that identifies not only general but also kinase-specific sites. Moreover, Wu et al. [36] and Fu et al. [37] developed deep learning-based methods to predict acetylation and putative ubiquitination sites with promising results.
Fig. 1

Overview of deep learning approaches for PTM prediction [29].

Fig. 2

The statistics of published literature on machine/deep learning-based PTM prediction. (a) Number of articles published in different peer-reviewed journals. Note that the year 2022 only includes publications up to January 2022. Abbreviations: DL = deep learning, ML = machine learning, PTM = post-translational modification. (b) Word cloud based on the collective concordance ranking with the size of terms proportional to their frequency in the above articles.

In this mini-review, we summarize and discuss the most recent (2020–2022) progress made in the prediction of PTMs using deep learning-based methods, with a particular emphasis on protein phosphorylation, acetylation, and ubiquitination sites. Moreover, we present frequently used databases for deep learning-based PTM prediction, along with future directions in the computational identification of PTMs.

PTM databases

Available PTM datasets can mainly be retrieved from two sources: databases with various types of data and scientific literature data. The obtained data can be used to train a model for PTM prediction. Table 1 summarizes the leading databases with different data types based on recent literature [38], [39], [40], [41], [42], [43].
Table 1

Summary of PTM databases.

Database | Development year | Number of PTM sites deposited | Database link | Annotation | Reference
UniProt | 2005 | Varies according to the keyword search | https://www.uniprot.org | Multiple-type PTM sites for multi-species | [38]
PLMD | 2017 | 284,780 | https://plmd.biocuckoo.org/ | Protein lysine modification sites for multi-species | [43]
PhosphoSitePlus | 2012 | 598,976 | https://www.phosphosite.org/ | Multiple-type PTM sites for multi-species | [40]
Phospho.ELM | 2010 | 42,914 | https://phospho.elm.eu.org/ | Phosphorylation sites for eukaryotes | [39]
mUbiSiDa | 2014 | 110,976 | https://reprod.njmu.edu.cn/mUbiSiDa | Ubiquitination sites mainly for human and mouse | [41]
DEPOD | 2015 | 1,215 | https://www.depod.org | Dephosphorylation interactions | [42]

UniProt

UniProt [38] is one of the most comprehensive databases with PTM annotations; it contains annotations for a wide variety of PTMs. UniProt data is of high quality and was recognized as an ELIXIR Core Data Resource in 2017 [44]. The database received the CoreTrustSeal certification in 2020. It has four components customized for different uses: UniParc, UniProtKB, UniRef, and UniMES. Notably, the UniProtKB database has become the gateway to protein functional information. Over the last two years, the number of sequences in UniProtKB has grown to about 190 million [45], despite efforts to remove sequence redundancy at the proteome level. In our survey, we found that most studies collect their benchmark datasets from UniProtKB. The latest version of the UniProt database can be accessed at https://www.uniprot.org/.

PLMD

PLMD [43] covers 20 types of protein lysine modifications across 176 species. The database was constructed from the CPLA and CPLM databases with manual curation. It contains 284,780 protein lysine modification sites in 53,501 proteins, including 111,253 acetylation sites and 121,742 ubiquitination sites. To the best of our knowledge, it is the largest available database of protein acetylation sites and also the largest database of protein ubiquitination sites, which has not been reported in any other ubiquitination site prediction research. A free and open-source version of PLMD 3.0, implemented in PHP and MySQL, is available at https://plmd.biocuckoo.org.

PhosphoSitePlus

PhosphoSitePlus (PSP) [40] offers comprehensive data for studying PTMs such as phosphorylation, SUMOylation, ubiquitination, and others. The database is built from manually collected and curated data and primarily contains human and mouse protein data. At the time of writing, it holds 598,976 nonredundant modified sites, including 294,425 phosphorylation sites. The PSP database is versatile, offering a variety of information about the modification sites. PSP is a free database that can be accessed through https://www.phosphosite.org.

Phosphorylation site prediction

Phosphorylation is one of the most frequently investigated PTMs. It refers to the transfer of phosphate groups (PO4) from adenosine triphosphate (ATP) to amino acid side chains, catalyzed by various kinases [46]. Typically, phosphorylation of proteins occurs at serine (S), threonine (T), or tyrosine (Y) [47]. Approximately 13,000 human proteins can be phosphorylated, and 230,000 phosphorylation sites in the human proteome have been reported [48]. In the past decades, phosphorylation studies have gained widespread popularity due to their significance in characterizing signaling pathways [49], [50] and cellular processes, such as cell growth [51], cell division [52], and apoptosis [53]. With the development of high-throughput MS-based technology, a single proteomic experiment can detect phosphorylation on a large scale. Therefore, various databases have been built to collect annotated phosphorylation sites [38], [39], [40]. In recent years, these databases have been put to use through the extensive development of computational methods for phosphorylation site identification [22], [54], [55], [56], [57], [58]. In machine learning, the phosphorylation site prediction problem can be formulated as two classification tasks. The first task is general site prediction, which aims to determine whether a given site can be modified. The second task is kinase-specific prediction, which determines whether a site can be modified by a particular kinase [29]. In particular, the recent development of deep learning could speed up the progress of phosphorylation site prediction. A well-known deep learning-based predictor, MusiteDeep [30], incorporates one-hot encoding and a CNN with attention layers and performs better than previous feature-based models. Another phosphorylation site prediction method, DeepPhos [35], exploits densely connected convolutional neural network (DC-CNN) blocks for predictions.
DeepPhos outperforms MusiteDeep in both general and kinase-specific site prediction. Recently, a single unified multi-label classification model, EMBER [58], was released. Unlike the previous deep learning methods MusiteDeep and DeepPhos, which perform single-label classification, EMBER was designed to predict phosphorylation events for multiple kinases. In this tool, the input sequence is fifteen amino acids in length, of which the eighth site is the one to be predicted. The sequence is encoded using both one-hot encoding and an embedding generated from a siamese neural network. After encoding, both representations are fed into their corresponding identical CNNs. In the top layer, the two feature vectors are concatenated, followed by fully connected layers. Finally, the output is a vector of length eight, where each value represents the probability that a family of kinases will phosphorylate the input site. In addition, tools have also been proposed to predict protein-specific phosphorylation sites. In 2020, Chen et al. developed PROSPECT [56], a deep learning method for predicting phosphorylation sites that occur on histidine. Three specific classifiers are set up in PROSPECT for histidine phosphorylation site prediction, based on one-of-K, EGAAC, and CKSAAGP encodings [35], [59]. The classifier for the one-of-K encoding is built with a multi-layer attention-based CNN, and the classifier for the EGAAC encoding employs a multi-layer CNN. In the case of the CKSAAGP encoding, the random forest (RF) algorithm is used to train the classifier. An online web server for PROSPECT was subsequently developed. In the same year, Wang et al. also presented a web server named MusiteDeep based on their deep learning models implemented in 2017. The server is capable of providing real-time prediction and batch submission for large-scale protein sequences, as listed in Table 4.
Finally, we compare the performance of recent deep learning-based phosphorylation predictors in Table 2.
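The window-based input described above (for example, a fifteen-residue window whose eighth position is the candidate site, encoded one-hot) can be sketched as follows. This is a minimal illustration and not the code of any tool reviewed here; the function names, the `-` padding symbol, and the choice to leave padded positions as all-zero rows are assumptions made for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def extract_windows(sequence, targets="STY", window=15):
    """Return (position, window_string) pairs centered on candidate residues."""
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    windows = []
    for pos, residue in enumerate(sequence):
        if residue in targets:
            # In the padded string the residue at `pos` sits at index pos + half,
            # so the slice below places it in the middle of the window.
            windows.append((pos, padded[pos:pos + window]))
    return windows


def one_hot(window_string):
    """Encode a window as a (length, 20) matrix; padding or unknown rows stay zero."""
    matrix = np.zeros((len(window_string), len(AMINO_ACIDS)), dtype=np.float32)
    for i, residue in enumerate(window_string):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[i, idx] = 1.0
    return matrix


# Toy sequence (made up for illustration): candidates are T, Y and S.
for position, win in extract_windows("MKTAYSPLR"):
    print(position, win, one_hot(win).shape)
```

Each encoded window is then a fixed-size matrix that a CNN or recurrent model can take as input; a kinase-specific or multi-label model would differ only in the output layer.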
Table 4

Summary of recent deep learning tools for PTM site prediction.

Tool name | PTM type | Species | Core network model | Evaluation strategy | Benchmark dataset size (modification sites) | Web server / source code | Published year | Reference
MusiteDeep | Multiple | Human | CNN | 5-fold CV | 997,687 | https://www.musite.net | 2017/2020 | [30]
PROSPECT | Phosphorylation | Escherichia coli | CNN | 10-fold CV and independent test | 1,664 | *prospect.erc.monash.edu/ | 2020 | [56]
DeepKinZero | Phosphorylation | Human | ZSL | holdout | 12,901 | *https://github.com/Tastanlab/DeepKinZero | 2020 | [60]
PhosTransfer | Phosphorylation | - | CNN | holdout | 43,785 | https://github.com/yxu132/PhosTransfer | 2020 | [61]
GPS-PBS | Phosphorylation | Multiple | seven-layer DNNs | 10-fold CV | 4,458 | - | 2020 | [62]
DeepPPSite | Phosphorylation | Mammals and Arabidopsis thaliana | LSTM | 10-fold CV | 41,436 | github.com/saeed344/DeepPPSite | 2021 | [57]
DeepIPs | Phosphorylation | Human | CNN + LSTM | 5-fold CV | 10,978 | https://lin-group.cn/server/DeepIPs ; https://github.com/linDing-group/DeepIPs | 2021 | [63]
PhosIDN | Phosphorylation | Human | Multi-layer DNNs | holdout | more than 160,000 | https://github.com/ustchangyuanyang/PhosIDN | 2021 | [64]
EMBER | Phosphorylation | Multiple | CNN + RNN | 5-fold CV | 8,389 | https://github.com/gomezlab/EMBER | 2022 | [58]
DNNAce | Acetylation | Multiple | DNN | 10-fold CV and independent test | 96,372 | https://github.com/QUSTAIBBDRC/DNNAce/ | 2020 | [78]
Deep-PLA | Acetylation | Human and Nonhuman | DNN | 5- and 10-fold CV | 1,331 | https://deeppla.cancerbio.info | 2020 | [79]
MDC-Kace | Acetylation | Multiple | MDC | 10-fold CV and independent test | 11,583 | https://github.com/lianglianggg/MDC-Kace | 2020 | [80]
DeepTL-Ubi | Ubiquitination | Multiple | CNN | holdout | 94,518 | github.com/USTC-HIlab/DeepTL-Ubi | 2020 | [106]
Wang et al.’s work | Ubiquitination | Multiple | CNN | 10-fold CV | 121,742 | *https://github.com/wang-hong-fei/DL-plantubsites-prediction | 2020 | [105]
UbiComb | Ubiquitination | Multiple | LSTM | 10-fold CV | 121,742 | https://nsclbio.jbnu.ac.kr/tools/UbiComb | 2021 | [107]
SSMFN | Methylation | Human and Mouse | CNN + LSTM | holdout | 6,754 | *https://github.com/bharuno/SSMFNMethylation-Analysis | 2021 | [110]
Malebary et al.’s work | Methylation | Human | CNN | 10-fold CV and jackknife | 2000 | https://github.com/s2018https://doi.org/1080001/WebServer.git | 2022 | [14]
RecSNO | S-Nitrosylation | - | BiLSTM | 5-fold CV | 4,762 | https://nsclbio.jbnu.ac.kr/tools/RecSNO/ | 2021 | [111]
MDCAN-Lys | Succinylation | Human | MDCAN | 10-fold CV and independent test | 77,418 | - | 2021 | [112]
LSTMCNNsucc | Succinylation | Multiple | LSTM + CNN | holdout | 18,593 | https://8.129.111.5/ | 2021 | [113]
DeepMal | Malonylation | Multiple | CNN + DNN | 10-fold CV and independent test | 17,288 | https://github.com/QUST-AIBBDRC/DeepMal/ | 2020 | [114]
K_net | Malonylation | Human and Mice | CNN | 10-fold CV and SEV | 85,204 | - | 2020 | [115]
DeepCSO | S-Sulphenylation | Homo sapiens and Arabidopsis thaliana | LSTMWE | 10-fold CV | 10,354 | *https://www.bioinfogo.org/DeepCSO | 2020 | [116]
DeepSSPred | S-Sulphenylation | Homo sapiens | 2D-CNN | jackknife | 7,756 | *https://github.com/zaheerkhancs/DeepSSPred | 2021 | [117]
pKcr | Crotonylation | Papaya | CNN | 10-fold CV and independent test | 58,769 | *https://www.bioinfogo.org/pkcr | 2020 | [119]
Deep-Kcr | Crotonylation | Human | CNN | 10-fold CV | 19,928 | https://lin-group.cn/server/Deep-Kcr | 2020 | [120]
DeepKcrot | Crotonylation | Multiple | CNNWE | 10-fold CV and independent test | 10,702/1,265/2,044/5,995 | *https://www.bioinfogo.org/deepkcrot | 2021 | [121]
nhKcr | Crotonylation | Human | CNNrgb | 10-fold CV and independent test | 180,312 | https://nhKcr.erc.monash.edu/ | 2021 | [118]
DeepKhib | 2-Hydroxyisobutyrylation | Multiple | CNNOH | 10-fold CV and independent test | 18,946/15,444/12,756/19,330/2,098 | *https://www.bioinfogo.org/DeepKhib | 2020 | [122]
DeepGlut | Glutarylation | Prokaryotes and Eukaryotes | CNN | 10-fold CV | 4,572 | *https://github.com/urmisen/DeepGlut | 2020 | [123]
NPalmitoylDeep-PseAAC | N-Palmitoylation | Human | DNN | holdout | 4,364 | https://mega.nz/#F!s9cSiQIa!1jXO0NmgrhxUqOexmYuouA | 2021 | [124]
DTL-DephosSite | Dephosphorylation | Human | Bi-LSTM | 5-fold CV and independent test | 4,956 | https://github.com/dukkakc/DTLDephos | 2021 | [127]
PreCar_Deep | Carbonylation | Human and other Mammals | CNN + BiLSTM | 10-fold CV and independent test | 5,003 | https://github.com/QUST-SHULI/PreCar_Deep/ | 2021 | [125]
He et al.'s work | SUMOylation, Ubiquitylation | - | CNN + DNN | 10-fold CV | 280,731 | https://github.com/lijingyimm/MultiUbiSUMO | 2021 | [126]

Note: *, Link is not working at the time of writing. Multiple, more than three species or PTM types. -, data not available.

Table 2

Comparison of deep learning-based phosphorylation site predictors.

Tool name | Framework | Encoding strategy | Window size | Average AUC | Reference
MusiteDeep | Keras/TensorFlow | One-hot | 33 | 0.880 | [30]
PROSPECT | PyTorch | One-hot, EGAAC, CKSAAGP | 27 | 0.770 | [56]
DeepKinZero | TensorFlow | Word embedding | 15 | - | [60]
PhosTransfer | TensorFlow | Word embedding | - | 0.898 | [61]
GPS-PBS | Keras/TensorFlow | BLOSUM62 | 21 | 0.832 | [62]
DeepPPSite | Keras/TensorFlow | BE, EBGW, CKSAAP, PSPM, IPCP | 21 | 0.872 | [57]
DeepIPs | Keras/TensorFlow | Word embedding | 15 | 0.909 | [63]
PhosIDN | Keras/TensorFlow | One-hot, PPI embedding | 21 | 0.939 | [64]
EMBER | PyTorch | One-hot | 15 | 0.928 | [58]

Note: -, data not available. AUC: Area under the Curve of ROC.
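For readers less familiar with the AUC metric reported in Table 2, the following is a minimal sketch of how it can be computed from labels and predicted scores. It uses the rank interpretation of the AUC (the probability that a randomly chosen positive site scores higher than a randomly chosen negative site, with ties counted as one half). The function name and the toy values are illustrative assumptions, not data from any of the tools above.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 values (1 = modified site).
    scores: predicted scores, higher meaning more likely modified.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

The pairwise loop is quadratic and only meant for clarity; an "average AUC" as in Table 2 would then be the mean of such values over kinases, residues, or datasets, depending on how each study defines it.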


Acetylation site prediction

Acetylation is a very common PTM in which an acetyl group is added to amino acid residues. About 63% of mitochondrial proteins can be acetylated at their lysine residues [65]. During the protein acetylation process, the positive charge on lysine residues is neutralized, leading to the regulation of cell lifespan [66], DNA binding [67], the interactions between proteins [68], and the interactions between proteins and membranes [69]. In contrast, dysregulation of lysine acetylation is associated with several diseases, including cancers [70], cardiovascular diseases [71], Parkinson's disease [72], and neurodegenerative disorders [73]. Thus, the identification of acetylation sites may benefit the understanding of its molecular mechanism and further experimental design. Proteomic and high-throughput MS-based techniques have identified large numbers of acetylation sites. For example, Choudhary et al. detected 3,600 lysine acetylation sites on 1,750 proteins from a human cell line [74]; Lundby et al. quantified 15,474 lysine acetylation sites on 4,541 proteins from 16 rat tissues [75]. Several public databases have been developed to facilitate the collection and maintenance of acetylation site information [38], [43]. Building on these data, many computational methods have been proposed to predict acetylation sites [76], [77], [36]. Among them, deep learning methods are increasingly popular in bioinformatics and show encouraging results for acetylation site identification [78], [79], [80]. For example, Wu et al. [36] presented an MLP architecture, DeepAcet, as an acetylation site prediction model. Feature encoding was performed with six methods (One-hot, IG, CKSAAP, PSSM, AAindex, and BLOSUM62); a multilayer perceptron (MLP) is then applied to extract features. Under 10-fold cross-validation [81] and on a separate test set, the reported accuracies were 0.8495 and 0.8487, respectively.
Yu et al. also developed a deep neural network (DNN) based model called DNNAce for acetylation site prediction [78]. First, they applied eight different encoding methods to extract information from multiple amino acid residues and then fused the encoded feature vectors to create a high-level feature representation. These encoding methods are BE, PseAAC, AAindex, NMBroto, EBGW, MMI, BLOSUM62, and KNN. Next, they employed LASSO to select the optimal feature subsets and improve model performance. As a final stage, nine prokaryotic acetylation site datasets were used to evaluate the performance, which was compared with that of models such as AdaBoost, Naive Bayes, XGBoost, KNN, RF, SVM, CNN, and LSTM. DNNAce was also evaluated by comparing its results with ProAcePred [82]. Except for S. typhimurium, the performance of DNNAce on the remaining eight species was significantly lower than that of ProAcePred. However, DNNAce outperforms ProAcePred for the other seven species during independent evaluation. The advantages of DNNAce are therefore limited, because its performance differs between training and independent testing. In contrast to DeepAcet and DNNAce, which only consider the amino acid sequences and their physicochemical properties, MDC-Kace [80] uses both sequence information and protein structural properties to predict acetylation sites. In MDC-Kace, modular densely connected convolutional networks (MDC), which consist of three independent modules (sequence, physicochemical, and structure), are employed to extract features of lysine acetylation sites. In the next step, a squeeze and excitation (SE) layer [83] is used to weight the importance of features and build a more accurate representation. Finally, the fused high-level features are fed into a softmax layer for classification to predict acetylation sites.
The authors compared MDC-Kace with state-of-the-art models (MusiteDeep [30], CapsNet [34], DeepAcet [36], PSKAcePred [84], EnsemblePail [85], GPS-PAIL2.0 [86], and ProAcePred [82]) to evaluate its performance. Datasets from three species (human, M. musculus, E. coli) were evaluated by 10-fold cross-validation and independent testing. The results indicate that MDC-Kace performs similarly to existing acetylation site predictors.
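To make the squeeze and excitation step more concrete, the sketch below shows the generic SE operation on a small feature map in NumPy: channel-wise averaging (squeeze), a two-layer bottleneck ending in a sigmoid (excitation), and rescaling of each channel by its gate. It illustrates the general SE idea only; the layer sizes, the random weights, and the function name are assumptions, and it is not the MDC-Kace implementation.

```python
import numpy as np


def squeeze_excite(features, w_reduce, w_expand):
    """Reweight the channels of a (positions, channels) feature map.

    w_reduce has shape (channels, hidden); w_expand has shape (hidden, channels).
    Returns the rescaled feature map and the per-channel gates.
    """
    squeezed = features.mean(axis=0)                      # squeeze: one average per channel
    hidden = np.maximum(0.0, squeezed @ w_reduce)         # excitation, step 1: ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w_expand)))    # excitation, step 2: sigmoid gates in (0, 1)
    return features * gates, gates                        # gates broadcast over positions


# Illustrative shapes: 15 sequence positions, 8 channels, bottleneck of 2 units.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(15, 8))
w_reduce = rng.normal(size=(8, 2))   # untrained, random weights for demonstration only
w_expand = rng.normal(size=(2, 8))
reweighted, gates = squeeze_excite(feature_map, w_reduce, w_expand)
print(reweighted.shape, gates.round(2))
```

In a trained network the two weight matrices are learned, so channels that carry more useful information for the site in question receive gates closer to one.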

Ubiquitination site prediction

Ubiquitination is an enzymatic PTM in which ubiquitin is conjugated to cellular proteins [87]. Multiple important cellular processes are related to ubiquitination, including protein degradation [88], cell division [89], and protein stability [90], [91]. Ubiquitination serves as a fundamental component of the ubiquitin–proteasome system, mediating more than 80% of protein degradation in eukaryotes [92]. Moreover, aberrant ubiquitination is highly related to the progression of aging [93] and many diseases; for example, the dysregulation of the ubiquitin–proteasome system may contribute to the occurrence of neurodegenerative conditions [94] and inflammatory bowel diseases [95]. Therefore, the identification of ubiquitination sites is an essential step in exploring various ubiquitination-involved mechanisms. To identify the ubiquitination sites in proteins, a myriad of experimental [96], [97], [98] and computational methods [99], [100], [101] have been developed. In recent years, with the continuous growth in high-throughput experimental data [102], [103], [104], deep learning [105], [106], [107] has been increasingly applied to the prediction of ubiquitination. Fu et al. proposed a deep learning predictor, DeepUbi [37], based on CNN. In this tool, four feature encoding schemes are used for feature construction. Under 10-fold cross-validation, DeepUbi achieves an AUC of 0.90, with accuracy, sensitivity, and specificity all over 0.85. Compared with DeepUbi, which is trained for general ubiquitination site prediction, DeepTL-Ubi [106] is a species-specific site predictor that consists of three connected modules: a deep feature extractor, a source label classifier, and a target label classifier. Firstly, a densely connected convolutional neural network (DCCNN), which is composed of six layers, is applied as the deep feature extractor.
Features of both source species and target species are extracted simultaneously by the deep feature extractor, mapping samples into a joint feature space. Secondly, the two parallel classifiers are employed to classify source species and target species at the same time. Thirdly, ST (source and target) loss assists the extractor in transferring knowledge from source species to target species by learning relevant features. Finally, as the performance optimization step, the classification loss is minimized to train the two classifiers. DeepTL-Ubi outperforms several existing tools, including Ubisite [108], Ubiprober [24], and MUscADEL [109], as shown in Table 3.
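The interplay of the two classification losses and the transfer term can be sketched as a single joint objective. This review does not give the formula of the ST loss, so the sketch below substitutes a simple, clearly hypothetical alignment penalty (the squared distance between the mean source and mean target feature vectors) and combines everything in one weighted sum, which is also an assumption; all function names and the `weight` parameter are invented for illustration.

```python
import numpy as np


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))


def alignment_penalty(source_features, target_features):
    """Hypothetical stand-in for the ST loss: squared distance between the
    mean feature vectors of the source and target batches."""
    diff = source_features.mean(axis=0) - target_features.mean(axis=0)
    return float(diff @ diff)


def joint_loss(src_y, src_p, tgt_y, tgt_p, src_f, tgt_f, weight=1.0):
    """Source loss + target loss + weighted alignment term (illustrative only)."""
    return (binary_cross_entropy(src_y, src_p)
            + binary_cross_entropy(tgt_y, tgt_p)
            + weight * alignment_penalty(src_f, tgt_f))


# Toy batch: labels, predicted probabilities and 3-dimensional shared features.
src_y, src_p = np.array([1.0, 0.0]), np.array([0.8, 0.3])
tgt_y, tgt_p = np.array([1.0, 0.0]), np.array([0.6, 0.4])
src_f = np.array([[0.2, 0.1, 0.5], [0.4, 0.3, 0.1]])
tgt_f = np.array([[0.1, 0.2, 0.4], [0.3, 0.2, 0.2]])
print(joint_loss(src_y, src_p, tgt_y, tgt_p, src_f, tgt_f, weight=0.5))
```

The design intent is the same as described above: the alignment term pushes the shared extractor toward features that look alike for both species, while the two classification terms keep those features predictive for each species.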
Table 3

AUC values of different ubiquitination prediction tools, by species [106].

Tool | H. sapiens | M. musculus | R. norvegicus | S. cerevisiae | T. gondii | A. nidulans
DeepTL-Ubi | 0.753 | 0.789 | 0.720 | 0.772 | 0.824 | 0.814
Ubisite | 0.598 | 0.625 | 0.561 | 0.548 | 0.607 | 0.611
Ubiprober | 0.624 | 0.661 | 0.644 | 0.600 | 0.630 | 0.638
MUscADEL | 0.656 | 0.693 | 0.659 | 0.664 | 0.715 | 0.681

Other PTMs

In addition to those discussed, deep learning can also be applied to the prediction of other PTMs, including methylation [110], S-nitrosylation [111], succinylation [112], [113], malonylation [114], [115], S-sulphenylation [116], [117], crotonylation [118], [119], [120], [121], 2-hydroxyisobutyrylation [122], glutarylation [123], N-palmitoylation [124], carbonylation [125], and SUMOylation [126]. In particular, crotonylation prediction has demonstrated highly accurate results based on deep learning methods. Moreover, 2-hydroxyisobutyrylation, a novel type of PTM, was predicted by a deep learning method for the first time in 2020. Along with predicting conventional PTMs associated with functional group addition, deep learning-based methods have also been applied to predict niche-type PTMs; for instance, Chaudhari et al. developed a transfer learning-based predictor (DTL-DephosSite) for dephosphorylation site prediction [127]. To collect datasets of S, T, and Y dephosphorylation sites, they integrated experimentally verified datasets from the literature with datasets from the DEPOD database. They then employed bidirectional long short-term memory (Bi-LSTM), which can predict the modification of the target amino acid from the context of residues in both directions. To the best of our knowledge, it is the first tool that can predict general dephosphorylation sites for protein S/T residues and Y residues. In addition, a novel prediction model focusing on carbonylation, PreCar_Deep [125], was recently reported. Carbonylation is an irreversible covalent PTM and is a measure of protein oxidative damage. In this model, CNN and Bi-LSTM are combined under a deep learning framework. The AUC values on the four datasets (K, T, P, and R) reach 0.981, 0.982, 0.987, and 0.976, respectively. The AUC values on the independent test set reach 0.945, 0.978, 0.965, and 0.983, respectively.
In addition, a novel deep learning-based predictor for small protein-addition PTM sites was published in 2021. He et al. built an ensemble learning model that adopts CNN and DNN and outputs results covering four types of sites [126]. This is the first deep learning-based tool that predicts both ubiquitylation and SUMOylation sites at the same time. The PTM prediction tools mentioned in this section, as well as the predictors of phosphorylation, acetylation, and ubiquitination, are tabulated in Table 4.
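The idea of using residue context from both directions can be illustrated with a small bidirectional recurrence. To keep the example short, the sketch below uses a plain tanh recurrent cell rather than an LSTM cell, so it shows the bidirectional wiring only and is not the architecture of DTL-DephosSite or PreCar_Deep; the weights are random and all names are assumptions.

```python
import numpy as np


def run_rnn(inputs, w_in, w_rec):
    """Simple tanh recurrence over a (length, features) input.

    w_in has shape (features, hidden); w_rec has shape (hidden, hidden).
    Returns one hidden state per position, shape (length, hidden).
    """
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        hidden = np.tanh(x @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)


def bidirectional_states(inputs, fwd_params, bwd_params):
    """Concatenate left-to-right and right-to-left states for every position."""
    forward = run_rnn(inputs, *fwd_params)
    # Run on the reversed sequence, then flip back so rows align with positions.
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)


# Illustrative sizes: 15 positions, 20 input features (e.g. one-hot), 4 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=(15, 20))
fwd = (rng.normal(scale=0.3, size=(20, 4)), rng.normal(scale=0.3, size=(4, 4)))
bwd = (rng.normal(scale=0.3, size=(20, 4)), rng.normal(scale=0.3, size=(4, 4)))
states = bidirectional_states(x, fwd, bwd)
print(states.shape)  # (15, 8): the central row summarizes context from both sides
```

The state at the candidate site (the central row) combines what the forward pass has seen upstream with what the backward pass has seen downstream, which is the property the text attributes to Bi-LSTM models.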

Summary and outlook

PTM identification is critical to a better understanding of molecular functions and diseases. Advanced MS-based technology has yielded an extensive list of identified PTMs, providing abundant data to support the development of downstream computational identification methods. Although traditional machine learning methods can predict modified sites precisely, deep learning can derive and tune features automatically, without the need to design feature encodings ahead of time [29]. Thus, deep learning is highly effective in scientific fields with large and complex datasets. Researchers have recently been shifting their attention from traditional machine learning to deep learning for PTM site prediction (Fig. 2). Furthermore, with the growing number of PTM profiling datasets, deep learning models have been developed not only for phosphorylation, acetylation, and ubiquitination, but also for many other PTM types. In this review, we summarized the recently (2020–2022) released deep learning tools and online web servers for protein PTM site prediction (Table 4). Among all these, CNN and cross-validation are the most widely used network model and evaluation strategy, respectively (Fig. 3).
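Since k-fold cross-validation is the evaluation strategy used most often by the tools in Table 4, a minimal sketch of how such splits can be produced is given here. It deals sample indices into k folds class by class, so that the ratio of modified to unmodified sites stays similar in every fold. The function names and the toy label counts are assumptions; the individual tools may build their folds differently.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[j % k].append(idx)  # deal indices round-robin across folds
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), holding each fold out exactly once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(train), sorted(folds[held_out])


# Toy dataset: 20 modified sites (label 1) and 80 unmodified sites (label 0).
toy_labels = [1] * 20 + [0] * 80
for train_idx, test_idx in cross_validation_splits(toy_labels, k=10):
    positives_in_test = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives_in_test)
```

A model would be trained on each training split and scored on the matching held-out fold, with the k scores averaged to give the cross-validated result.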
Fig. 3

Sankey diagram depicting the distribution of PTM types, core network models, evaluation strategies, and published years.

Although several deep learning methods have been built with high performance to predict PTM sites, there is still room for improvement. Most of the existing deep learning algorithms employ CNN, DNN, and LSTM classifiers. However, each classifier has its own advantages and disadvantages. Therefore, further research is required to evaluate more state-of-the-art frameworks such as attention and transformer-based models. On top of that, although many developed tools predict PTM sites based on certain characteristics, such as sequence information, physical properties, chemical properties, and protein structure properties, there are still other approaches that need to be explored, such as reduced amino acid compositions [128], [129], [130]. Additionally, most web server links are not working, and few methods provide stand-alone versions. After testing all web servers, we found that they were difficult to operate. With deep learning-based methods, PTM identification can be implemented in a non-invasive, efficient, and low-cost way. However, a caveat remains before deep learning algorithms can directly diagnose diseases. Typical PTM prediction models lack sufficient interpretability due to the black-box nature of deep learning algorithms. Insufficient interpretability may not be an issue in many areas, but within healthcare, every misdiagnosis can pose a danger to a patient's health. Therefore, transparent and explainable models [131], [132], [133] will be needed before the technique can be applied in clinical practice.

CRediT authorship contribution statement

Lingkuan Meng: Writing, Conceptualization, Methodology, Visualization. Wai-Sum Chan: Methodology. Lei Huang: Methodology. Linjing Liu: Methodology. Xingjian Chen: Methodology. Weitong Zhang: Methodology. Fuzhou Wang: Methodology. Ke Cheng: Methodology. Hongyan Sun: Writing – review & editing, Supervision. Ka-Chun Wong: Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.