Mini-review: Recent advances in post-translational modification site prediction based on deep learning.

Lingkuan Meng¹,², Wai-Sum Chan¹, Lei Huang¹, Linjing Liu¹, Xingjian Chen¹, Weitong Zhang¹, Fuzhou Wang¹, Ke Cheng², Hongyan Sun², Ka-Chun Wong¹.

Abstract

Post-translational modifications (PTMs) are closely linked to numerous diseases and play a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, deep learning approaches for PTM prediction provide accurate and rapid screening, guiding downstream wet-lab experiments to use the screened information for focused studies. In this paper, we review recent work that uses deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarize PTM databases and discuss future directions with critical insights.

Keywords: AAindex, Amino acid index; ATP, Adenosine triphosphate; AUC, Area under curve; Ac, Acetylation; BE, Binary encoding; BLOSUM, Blocks substitution matrix; Bi-LSTM, Bidirectional LSTM; CKSAAP, Composition of k-spaced amino acid Pairs; CNN, Convolutional neural network; CNNOH, CNN with the one-hot encoding; CNNWE, CNN with the word-embedding encoding; CNNrgb, CNN red green blue; CV, Cross-validation; DC-CNN, Densely connected convolutional neural network; DL, Deep learning; DNNs, Deep neural networks; Deep learning; E. coli, Escherichia coli; EBGW, Encoding based on grouped weight; EGAAC, Enhanced grouped amino acids content; IG, Information gain; K, Lysine; KNN, k nearest neighbor; LASSO, Least absolute shrinkage and selection operator; LSTM, Long short-term memory; LSTMWE, LSTM with the word-embedding encoding; M.musculus, Mus musculus; MDC, Modular densely connected convolutional networks; MDCAN, Multilane dense convolutional attention network; ML, Machine learning; MLP, Multilayer perceptron; MMI, Multivariate mutual information; Machine learning; Mass spectrometry; NMBroto, Normalized Moreau-Broto autocorrelation; P, Proline; PSP, PhosphoSitePlus; PSSM, Position-specific scoring matrix; PTM, Post-translational modifications; Ph, Phosphorylation; Post-translational modification; Prediction; PseAAC, Pseudo-amino acid composition; R, Arginine; RF, Random forest; RNN, Recurrent neural network; ROC, Receiver operating characteristic; S, Serine; S. typhimurium, Salmonella typhimurium; S.cerevisiae, Saccharomyces cerevisiae; SE, Squeeze and excitation; SEV, Split to Equal Validation; ST, Source and target; SUMO, Small ubiquitin-like modifier; SVM, Support vector machines; T, Threonine; Ub, Ubiquitination; Y, Tyrosine; ZSL, Zero-shot learning

Year: 2022 PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Post-translational modifications (PTMs) generally refer to the addition of functional groups (e.g., phosphates, acetates, small proteins, lipids, and carbohydrates) to amino acids during or after translation [1]. A PTM changes the chemical properties or structure of the modified amino acid, which in turn leads to functional changes. To date, over 600 different types of PTMs have been discovered in different proteins [2], [3]. Phosphorylation, acetylation, and ubiquitination are the most extensively studied PTMs, as quantified in the dbPTM database [4]. PTMs are critical in maintaining protein structures [5], functions [6], metabolic regulation [7], cellular signaling [8], and proteomic diversity [9], so understanding PTMs is essential to understanding downstream consequences such as diseases. For example, S-nitrosylation is a promising therapeutic target for cancers and neurodegenerative diseases [10], [11], [12]; methyl glutamine is associated with the host defence mechanism against microorganisms [13], [14]. Different experimental techniques have been developed to reveal the mechanisms underlying PTMs, including chromatin immunoprecipitation (ChIP) [15], western blotting (WB) [16], mass spectrometry (MS) [17], [18], and isotope labeling [19]. In the past decade, MS-based proteomic techniques [20] have played a major role in PTM identification, yielding solid data backed by experimental evidence [21]. In addition, computational methods can explore and predict new modification sites by building models from those data. In the last few years, machine learning has grown to be a cost-effective and labor-efficient method for the prediction of various PTM sites [22], [23], [24], [25], [26], [27], [28]. Specifically, deep learning is an advanced machine learning method that is capable of automatically exploring PTM patterns and capturing high-level abstractions (Fig. 1 [29]). 
Therefore, it is an appropriate solution for improving the efficiency of PTM site prediction and has attracted growing interest in recent years (Fig. 2). Many published works have adopted deep learning to predict PTM sites for phosphorylation [30], acetylation [31], ubiquitination [32], and many other types of modifications [33], [34]. One of the best-known tools is MusiteDeep [30], developed by Wang and Zeng, which leverages a convolutional neural network (CNN) and a 2D attention mechanism for phosphorylation site prediction. DeepPhos [35], created by Luo et al., is an efficient phosphorylation site predictor that identifies not only general but also kinase-specific sites. Moreover, Wu et al. [36] and Fu et al. [37] developed deep learning-based methods to predict acetylation and putative ubiquitination sites, respectively, with promising results.

Fig. 1

Overview of deep learning approaches for PTM prediction [29].

Fig. 2

The statistics of published literature on machine/deep learning-based PTM prediction. (a) Number of articles published in different peer-reviewed journals. Note that the year 2022 only includes publications up to January 2022. Abbreviations: DL = deep learning, ML = machine learning, PTM = post-translational modification. (b) Word cloud based on the collective concordance ranking with the size of terms proportional to their frequency in the above articles.

In this mini-review, we summarized and discussed the most recent (2020–2022) progress made in the prediction of PTMs using deep learning-based methods with a particular emphasis on protein phosphorylation, acetylation, and ubiquitination sites. Moreover, we presented frequently used databases for deep learning-based PTM prediction, along with future directions in the computational identification of PTMs.

PTM databases

Available PTM datasets can mainly be retrieved from two sources: databases with various types of data and scientific literature data. The obtained data can be used to train a model for PTM prediction. Table 1 summarizes the leading databases with different data types based on recent literature [38], [39], [40], [41], [42], [43].
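How annotated sites from such databases typically become training samples can be sketched as follows. This is a minimal illustration, not the pipeline of any reviewed tool: the function name `extract_windows`, the padding symbol, the flank size, and the toy sequence are all assumptions made for the example. Each candidate residue is cut out with a fixed-length window around it and labelled 1 if the database annotates that position as modified, otherwise 0.

```python
from typing import List, Set, Tuple

PAD = "-"  # assumed padding symbol for positions beyond the sequence ends


def extract_windows(
    sequence: str,
    modified_positions: Set[int],
    residues: str = "STY",
    flank: int = 7,
) -> List[Tuple[str, int, int]]:
    """Return (window, position, label) for every candidate residue.

    Positions are 1-based, as in most database annotations.
    The label is 1 if the position is annotated as modified, else 0.
    """
    padded = PAD * flank + sequence + PAD * flank
    samples = []
    for i, aa in enumerate(sequence):
        if aa not in residues:
            continue
        # Slice of length 2 * flank + 1 centred on residue i
        window = padded[i : i + 2 * flank + 1]
        label = 1 if (i + 1) in modified_positions else 0
        samples.append((window, i + 1, label))
    return samples


# Toy input: a made-up sequence with one pretend annotated site at position 6
toy_samples = extract_windows("MKTAYSAKQRSLT", {6}, flank=3)
for window, position, label in toy_samples:
    print(position, window, label)
```

Unannotated candidate residues are treated as negatives here, which is a common but imperfect convention, since a missing annotation does not prove that a site is never modified.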

Table 1

Summary of PTM databases.

Database	Development Year	Number of PTM Sites Deposited	Database Link	Annotation	Reference
UniProt	2005	Varies according to the keyword search	https://www.uniprot.org	Multiple-type PTM sites for multi-species	[38]
PLMD	2017	284,780	https://plmd.biocuckoo.org/	Protein lysine modification sites for multi-species	[43]
PhosphoSitePlus	2012	598,976	https://www.phosphosite.org/	Multiple-type PTM sites for multi-species	[40]
Phospho.ELM	2010	42,914	https://phospho.elm.eu.org/	Phosphorylation sites for eukaryotes	[39]
mUbiSiDa	2014	110,976	https://reprod.njmu.edu.cn/mUbiSiDa	Ubiquitination sites mainly for Human and Mouse	[41]
DEPOD	2015	1,215	https://www.depod.org	Dephosphorylation interactions	[42]

UniProt

UniProt [38] is one of the most comprehensive databases, with annotations for a wide variety of PTMs. UniProt data is of high quality and was recognized as an ELIXIR Core Data Resource in 2017 [44]. The database received the CoreTrustSeal certification in 2020. It has four components customized for different uses: UniParc, UniProtKB, UniRef, and UniMES. Notably, the UniProtKB database has become the gateway to protein functional information. Over the last two years, UniProtKB's sequences have grown to about 190 million [45], despite efforts in sequence redundancy removal at the proteome level. In our survey, we found that most studies collect their benchmark datasets from UniProtKB. The latest version of the UniProt database can be accessed at https://www.uniprot.org/.

PLMD

PLMD [43] covers 20 types of protein lysine modifications across 176 species. The database was constructed from the CPLA and CPLM databases with manual curation. It contains 284,780 protein lysine modification sites in 53,501 proteins, including 111,253 acetylation sites and 121,742 ubiquitination sites. To the best of our knowledge, it is the largest available database of protein acetylation sites and of protein ubiquitination sites, and one of a size not previously reported in ubiquitination site prediction research. A free and open-source version, PLMD 3.0, is available at https://plmd.biocuckoo.org and is implemented in PHP and MySQL.

PhosphoSitePlus

PhosphoSitePlus (PSP) [40] offers comprehensive information for studying PTMs such as phosphorylation, SUMOylation, ubiquitination, and others. The database is built from manually collected and curated data and primarily contains human and mouse proteins. At the time of writing, it harbors 598,976 nonredundant modified sites, including 294,425 phosphorylation sites. PSP is versatile, offering a variety of information about each modification site, and is freely accessible at https://www.phosphosite.org.

Phosphorylation site prediction

Phosphorylation is one of the most frequently investigated PTMs, referring to the transfer of phosphate groups (PO4) from adenosine triphosphate (ATP) to amino acid side chains via the catalysis of various kinases [46]. Typically, phosphorylation of proteins occurs at serine (S), threonine (T), or tyrosine (Y) [47]. Approximately 13,000 human proteins can be phosphorylated, and 230,000 phosphorylation sites in the human proteome have been reported [48]. In the past decades, phosphorylation studies have gained widespread popularity due to their significance in characterizing signaling pathways [49], [50] and cellular processes such as cell growth [51], cell division [52], and apoptosis [53]. With the development of high-throughput MS-based technology, a single proteomic experiment can detect phosphorylation on a large scale. Therefore, various databases have been built to collect annotated phosphorylation sites [38], [39], [40]. In recent years, the extensive development of computational methods for phosphorylation site identification has put these databases to use [22], [54], [55], [56], [57], [58]. In machine learning, the phosphorylation site prediction problem can be formulated as two classification tasks. The first is general site prediction, which aims to determine whether a given site can be modified. The second is kinase-specific prediction, which determines whether a site can be modified by a particular kinase [29]. The recent development of deep learning, in particular, could speed up progress in phosphorylation site prediction. A well-known deep learning-based predictor, MusiteDeep [30], combines one-hot encoding and a CNN with attention layers and performs better than previous feature-based models. Another phosphorylation site prediction method, DeepPhos [35], exploits densely connected convolutional neural network (DC-CNN) blocks for prediction. 
DeepPhos outperforms MusiteDeep in both general and kinase-specific site prediction. Recently, a single unified multi-label classification model, EMBER [58], was released. Unlike the previous deep learning methods MusiteDeep and DeepPhos, which perform single-label classification, EMBER was designed to predict phosphorylation events for multiple kinases. In this tool, the input is a sequence of fifteen amino acids whose eighth (central) residue is the site to be predicted. The sequence is encoded using both one-hot encoding and an embedding generated by a siamese neural network. After encoding, the two representations are fed into their own, identical CNNs. In the top layer, the two feature vectors are concatenated, followed by fully connected layers. Finally, the output is a vector of length eight, where each value represents the probability that a family of kinases will phosphorylate the input site. In addition, different tools have been proposed to predict protein-specific phosphorylation sites. In 2020, Chen et al. developed PROSPECT [56], a deep learning method for predicting phosphorylation sites on histidine. PROSPECT sets up three specific classifiers for histidine phosphorylation site prediction, based on one-of-K, EGAAC, and CKSAAGP encodings [35], [59]. The classifier for one-of-K encoding is built with a multi-layer attention-based CNN, and the classifier for EGAAC encoding employs a multi-layer CNN. For CKSAAGP encoding, the random forest (RF) algorithm is used to train the classifier. An online web server for PROSPECT was also developed. In the same year, Wang et al. presented a web server named MusiteDeep based on their deep learning models implemented in 2017. The server is capable of providing real-time prediction and batch submission for large-scale protein sequences, as listed in Table 4. 
Finally, we compare the performance of recent deep learning-based phosphorylation predictors in Table 2.
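The input and output shapes described for EMBER above (a 15-residue window with the candidate site at position eight, and a length-eight vector of per-family probabilities) can be illustrated with a small sketch. This is not EMBER's code: the peptide is made up, the logits are hypothetical stand-ins for network outputs, mapping unknown symbols to all-zero rows is an assumption, and the use of one sigmoid per label is the standard multi-label convention rather than a detail confirmed by the text.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_window(window: str) -> List[List[int]]:
    """Encode a peptide window as a len(window) x 20 binary matrix.

    Unknown or padding symbols become all-zero rows (an assumption).
    """
    matrix = []
    for aa in window:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        matrix.append(row)
    return matrix


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_probabilities(logits: List[float]) -> List[float]:
    """Apply an independent sigmoid to each label's logit.

    Unlike softmax, the outputs need not sum to 1, so several
    kinase families can score high for the same site.
    """
    return [sigmoid(z) for z in logits]


window = "GSRRRLSSLRASTSK"           # made-up 15-mer; residue 8 is the candidate S
encoded = one_hot_window(window)      # 15 rows x 20 columns
hypothetical_logits = [2.0, -1.0, 0.0, -3.0, 0.5, -0.5, 1.0, -2.0]
probabilities = multilabel_probabilities(hypothetical_logits)
print(len(encoded), len(encoded[0]), [round(p, 3) for p in probabilities])
```

In single-label predictors such as MusiteDeep and DeepPhos, the final layer would instead produce one score per site (or per kinase-specific model), which is the distinction the paragraph above draws.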

Table 4

Summary of recent deep learning tools associated with PTM site prediction.

Tool name	PTM type	Species	Core network model	Evaluation strategy	Benchmark dataset size (modification sites)	Web server/source code	Published year	Reference
MusiteDeep	Multiple	Human	CNN	5-fold CV	997,687	https://www.musite.net	2017/2020	[30]
PROSPECT	Phosphorylation	Escherichia coli	CNN	10-fold CV and independent test	1,664	*prospect.erc.monash.edu/	2020	[56]
DeepKinZero	Phosphorylation	Human	ZSL	holdout	12,901	*https://github.com/Tastanlab/DeepKinZero	2020	[60]
PhosTransfer	Phosphorylation	–	CNN	holdout	43,785	https://github.com/yxu132/PhosTransfer	2020	[61]
GPS-PBS	Phosphorylation	Multiple	seven-layer DNNs	10-fold CV	4,458	–	2020	[62]
DeepPPSite	Phosphorylation	Mammals and Arabidopsis thaliana	LSTM	10-fold CV	41,436	github.com/saeed344/DeepPPSite	2021	[57]
DeepIPs	Phosphorylation	Human	CNN + LSTM	5-fold CV	10,978	https://lin-group.cn/server/DeepIPs; https://github.com/linDing-group/DeepIPs	2021	[63]
PhosIDN	Phosphorylation	Human	Multi-layer DNNs	holdout	more than 160,000	https://github.com/ustchangyuanyang/PhosIDN	2021	[64]
EMBER	Phosphorylation	Multiple	CNN + RNN	5-fold CV	8,389	https://github.com/gomezlab/EMBER	2022	[58]
DNNAce	Acetylation	Multiple	DNN	10-fold CV and independent test	96,372	https://github.com/QUSTAIBBDRC/DNNAce/	2020	[78]
Deep-PLA	Acetylation	Human and Nonhuman	DNN	5- and 10-fold CV	1,331	https://deeppla.cancerbio.info	2020	[79]
MDC-Kace	Acetylation	Multiple	MDC	10-fold CV and independent test	11,583	https://github.com/lianglianggg/MDC-Kace	2020	[80]
DeepTL-Ubi	Ubiquitination	Multiple	CNN	holdout	94,518	github.com/USTC-HIlab/DeepTL-Ubi	2020	[106]
Wang et al.’s work	Ubiquitination	Multiple	CNN	10-fold CV	121,742	*https://github.com/wang-hong-fei/DL-plantubsites-prediction	2020	[105]
UbiComb	Ubiquitination	Multiple	LSTM	10-fold CV	121,742	https://nsclbio.jbnu.ac.kr/tools/UbiComb	2021	[107]
SSMFN	Methylation	Human and Mouse	CNN + LSTM	holdout	6,754	*https://github.com/bharuno/SSMFNMethylation-Analysis	2021	[110]
Malebary et al.’s work	Methylation	Human	CNN	10-fold CV and jackknife	2000	https://github.com/s2018https://doi.org/1080001/WebServer.git	2022	[14]
RecSNO	S-Nitrosylation	–	BiLSTM	5-fold CV	4,762	https://nsclbio.jbnu.ac.kr/tools/RecSNO/	2021	[111]
MDCAN-Lys	Succinylation	Human	MDCAN	10-fold CV and independent test	77,418	–	2021	[112]
LSTMCNNsucc	Succinylation	Multiple	LSTM + CNN	holdout	18,593	https://8.129.111.5/	2021	[113]
DeepMal	Malonylation	Multiple	CNN + DNN	10-fold CV and independent test	17,288	https://github.com/QUST-AIBBDRC/DeepMal/	2020	[114]
K_net	Malonylation	Human and Mice	CNN	10-fold CV and SEV	85,204	–	2020	[115]
DeepCSO	S-Sulphenylation	Homo sapiens and Arabidopsis thaliana	LSTM_WE	10-fold CV	10,354	*https://www.bioinfogo.org/DeepCSO	2020	[116]
DeepSSPred	S-Sulphenylation	Homo sapiens	2D-CNN	jackknife	7,756	*https://github.com/zaheerkhancs/DeepSSPred	2021	[117]
pKcr	Crotonylation	Papaya	CNN	10-fold CV and independent test	58,769	*https://www.bioinfogo.org/pkcr	2020	[119]
Deep-Kcr	Crotonylation	Human	CNN	10-fold CV	19,928	https://lin-group.cn/server/Deep-Kcr	2020	[120]
DeepKcrot	Crotonylation	Multiple	CNN_WE	10-fold CV and independent test	10,702/1,265/2,044/5,995	*https://www.bioinfogo.org/deepkcrot	2021	[121]
nhKcr	Crotonylation	Human	CNNrgb	10-fold CV and independent test	180,312	https://nhKcr.erc.monash.edu/	2021	[118]
DeepKhib	2-Hydroxyisobutyrylation	Multiple	CNN_OH	10-fold CV and independent test	18,946/15,444/12,756/19,330/2,098	*https://www.bioinfogo.org/DeepKhib	2020	[122]
DeepGlut	Glutarylation	Prokaryotes and Eukaryotes	CNN	10-fold CV	4,572	*https://github.com/urmisen/DeepGlut	2020	[123]
NPalmitoylDeep-PseAAC	N-Palmitoylation	Human	DNN	holdout	4,364	https://mega.nz/#F!s9cSiQIa!1jXO0NmgrhxUqOexmYuouA	2021	[124]
DTL-DephosSite	Dephosphorylation	Human	Bi-LSTM	5-fold CV and independent test	4,956	https://github.com/dukkakc/DTLDephos	2021	[127]
PreCar_Deep	Carbonylation	Human and other Mammals	CNN + BiLSTM	10-fold CV and independent test	5,003	https://github.com/QUST-SHULI/PreCar_Deep/	2021	[125]
He et al.'s work	SUMOylation Ubiquitylation	–	CNN + DNN	10-fold CV	280,731	https://github.com/lijingyimm/MultiUbiSUMO	2021	[126]

Note: *, Link is not working at the time of writing. Multiple, more than three species or PTM types. -, data not available.

Table 2

Comparison of deep learning-based phosphorylation sites predictors.

Tool name	Framework	Encoding strategy	Window size	Average AUC	Reference
MusiteDeep	Keras/TensorFlow	One-hot	33	0.880	[30]
PROSPECT	PyTorch	One-hot, EGAAC, CKSAAGP	27	0.770	[56]
DeepKinZero	TensorFlow	Word embedding	15	–	[60]
PhosTransfer	TensorFlow	Word embedding	–	0.898	[61]
GPS-PBS	Keras/TensorFlow	BLOSUM62	21	0.832	[62]
DeepPPSite	Keras/TensorFlow	BE, EBGW, CKSAAP, PSPM, IPCP	21	0.872	[57]
DeepIPs	Keras/TensorFlow	Word embedding	15	0.909	[63]
PhosIDN	Keras/TensorFlow	One-hot, PPI embedding	21	0.939	[64]
EMBER	PyTorch	One-hot	15	0.928	[58]

Note: -, data not available. AUC: Area under the Curve of ROC.
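For readers less familiar with the AUC values compared in Table 2, the metric can be computed directly from labels and prediction scores. The sketch below uses the pairwise (Mann-Whitney) formulation: AUC is the fraction of positive/negative pairs in which the positive site receives the higher score, with ties counted as half. The function name `roc_auc` and the toy data are illustrative assumptions; the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties = 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Because the AUC values in Table 2 were obtained on different benchmark datasets and window sizes, they are not directly comparable in the way that scores computed on a single shared test set would be.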

Acetylation site prediction

Acetylation is a very common PTM in which an acetyl group is added to amino acid residues. About 63% of mitochondrial proteins can be acetylated at their lysine residues [65]. During protein acetylation, the positive charge of the lysine residue is neutralized, which affects the regulation of cell lifespan [66], DNA binding [67], the interactions between proteins [68], and the interactions between proteins and membranes [69]. Conversely, dysregulation of lysine acetylation is associated with several diseases, including cancers [70], cardiovascular diseases [71], Parkinson's disease [72], and neurodegenerative disorders [73]. Thus, the identification of acetylation sites may benefit the understanding of its molecular mechanism and further experimental design. Proteomic and high-throughput MS-based techniques have identified a large number of acetylation sites. For example, Choudhary et al. detected 3,600 lysine acetylation sites on 1,750 proteins from a human cell line [74]; Lundby et al. quantified 15,474 lysine acetylation sites on 4,541 proteins from 16 rat tissues [75]. Several public databases have been developed to facilitate the collection and maintenance of acetylation site information [38], [43]. Many computational methods have therefore been proposed to predict acetylation sites [76], [77], [36]. Among them, deep learning methods, which are increasingly popular in bioinformatics, also show encouraging results in acetylation site identification [78], [79], [80]. For example, Wu et al. [36] presented an MLP architecture, DeepAcet, as an acetylation site prediction model. Features were encoded with six methods (one-hot, IG, CKSAAP, PSSM, AAindex, and BLOSUM62), and a multilayer perceptron (MLP) was then applied to extract features. Under 10-fold cross-validation [81] and on a separate test set, the reported accuracies were 0.8495 and 0.8487, respectively. Yu et al. 
also developed a deep neural network (DNN)-based model called DNNAce for acetylation site prediction [78]. First, they applied eight different encoding methods to extract information from multiple amino acid residues and then fused the encoded feature vectors to create a high-level feature representation. These encoding methods are BE, PseAAC, AAindex, NMBroto, EBGW, MMI, BLOSUM62, and KNN. Next, they employed LASSO to select the optimal feature subset and improve model performance. Finally, nine prokaryotic acetylation site datasets were used to evaluate performance against other models such as AdaBoost, Naive Bayes, XGBoost, KNN, RF, SVM, CNN, and LSTM. DNNAce was also compared with ProAcePred [82]. Except for S. typhimurium, the performance of DNNAce on the remaining eight species was significantly lower than that of ProAcePred. In independent evaluation, however, DNNAce outperformed ProAcePred for the other seven species. The advantages of DNNAce are therefore marginal, given this discrepancy between training and independent-test performance. In contrast to DeepAcet and DNNAce, which only consider amino acid sequences and their physicochemical properties, MDC-Kace [80] uses both sequence information and protein structural properties to predict acetylation sites. In MDC-Kace, modular densely connected convolutional networks (MDC), which consist of three independent modules (sequence, physicochemical, and structure), are employed to extract features of lysine acetylation sites. Next, a squeeze-and-excitation (SE) layer [83] weights the importance of the features to build a more accurate representation. Finally, the fused high-level features are fed into a softmax layer for classification. 
The authors compared MDC-Kace with state-of-the-art models (MusiteDeep [30], CapsNet [34], DeepAcet [36], PSKAcePred [84], EnsemblePail [85], GPS-PAIL2.0 [86] and ProAcePred [82]) to evaluate its performance. Datasets from three species (human, M. musculus, E. coli) were evaluated by 10-fold cross-validation and independent testing. The results indicate that MDC-Kace performs similarly to existing acetylation site predictors.
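Among the handcrafted encodings named above, CKSAAP (composition of k-spaced amino acid pairs) is simple enough to sketch in a few lines. The version below is a minimal illustration rather than the implementation used by DeepAcet or any other reviewed tool: the function name `cksaap`, the default `k_max`, and the choice to normalise by the number of pair positions (window length minus k minus 1) are assumptions, and some toolkits normalise slightly differently.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(window: str, k_max: int = 2) -> List[float]:
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    For each gap k, count residue pairs (window[i], window[i + k + 1])
    and divide by the number of such positions in the window.
    The result has 400 * (k_max + 1) features.
    """
    features: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skip pairs containing padding or unknown symbols
                counts[pair] += 1
        total = max(n_positions, 1)  # avoid division by zero for very short windows
        features.extend(counts[p] / total for p in PAIRS)
    return features


toy = cksaap("AKAKA", k_max=1)  # made-up peptide
print(len(toy), toy[PAIRS.index("AK")], toy[400 + PAIRS.index("AA")])
```

Encodings like this produce fixed-length vectors regardless of which residues occur, which is why they pair naturally with MLP or DNN classifiers, whereas one-hot matrices preserve position and suit CNNs and LSTMs.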

Ubiquitination site prediction

Ubiquitination represents an enzymatic PTM on cellular protein by ubiquitin conjugation [87]. Multiple important cellular processes are related to ubiquitination, including protein degradation [88], cell division [89], and protein stability [90], [91]. Ubiquitination serves as a fundamental component of the ubiquitin–proteasome system, mediating more than 80% of protein degradation in eukaryotes [92]. Moreover, aberrant ubiquitination is highly related to the progression of aging [93] and many diseases; for example, the dysregulation of ubiquitin–proteasome system may contribute to the occurrence of neurodegenerative conditions [94] and inflammatory bowel diseases [95]. Therefore, the identification of ubiquitination sites is an essential step in exploring various ubiquitination-involved mechanisms. In order to identify the ubiquitination sites in proteins, a myriad of experimental [96], [97], [98] and computational methods [99], [100], [101] have been developed. In recent years, with the continuous growth in high-throughput experimental data [102], [103], [104], deep learning [105], [106], [107] has been increasingly applied to the prediction of ubiquitination. Fu et al. proposed a deep learning predictor, DeepUbi [37], based on CNN. In this tool, four feature encoding schemes are utilized for feature construction. Under 10-fold cross-validation, DeepUbi is able to achieve an AUC of 0.90, with the accuracy, sensitivity, and specificity being all over 0.85. Compared with DeepUbi, which is trained for general ubiquitination site prediction, DeepTL-Ubi [106] is a species-specific sites predictor which consists of three connected modules: a deep feature extractor, a source label classifier, and a target label classifier. Firstly, a densely connected convolutional neural network (DCCNN) is applied as the deep feature extractor, which is composed of six layers. 
Features of both source species and target species are extracted simultaneously by the deep feature extractor, mapping samples into a joint feature space. Secondly, the two parallel classifiers are employed to classify source species and target species at the same time. Thirdly, ST (source and target) loss assists the extractor in transferring knowledge from source species to target species by learning relevant features. Finally, as the performance optimization step, the classification loss is minimized to train the two classifiers. DeepTL-Ubi outperforms several existing tools, including Ubisite [108], Ubiprober [24], and MUscADEL [109], as shown in Table 3.
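The accuracy, sensitivity, and specificity figures quoted for DeepUbi above are all derived from the same confusion matrix. As a reminder of how they relate, the following sketch computes them from binary labels and thresholded predictions; the function name `binary_metrics` and the toy data are illustrative assumptions, not taken from any reviewed tool.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is absent
    return numerator / denominator if denominator else float("nan")


def binary_metrics(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity (recall on modified sites) and specificity
    (recall on unmodified sites) from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy example: 3 modified and 5 unmodified sites
print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1]))
```

Because unmodified sites usually far outnumber modified ones, accuracy alone can be misleading, which is one reason the reviewed papers also report AUC.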

Table 3

AUC values of different ubiquitination prediction tools [106].

Tool	H.sapiens	M.musculus	R.norvegicus	S.cerevisiae	T.gondii	A.nidulans
DeepTL-Ubi	0.753	0.789	0.720	0.772	0.824	0.814
Ubisite	0.598	0.625	0.561	0.548	0.607	0.611
Ubiprober	0.624	0.661	0.644	0.600	0.630	0.638
MUscADEL	0.656	0.693	0.659	0.664	0.715	0.681

Other PTMs

In addition to those discussed, deep learning can also be applied to the prediction of other PTMs, including methylation [110], S-nitrosylation [111], succinylation [112], [113], malonylation [114], [115], S-sulphenylation [116], [117], crotonylation [118], [119], [120], [121], 2-hydroxyisobutyrylation [122], glutarylation [123], N-palmitoylation [124], carbonylation [125], and SUMOylation [126]. In particular, crotonylation prediction has demonstrated highly accurate results based on deep learning methods. Moreover, 2-hydroxyisobutyrylation, a novel type of PTM, was predicted by a deep learning method for the first time in 2020. Along with predicting conventional PTMs associated with functional group addition, deep learning-based methods have also been applied to predict niche-type PTMs; for instance, Chaudhari et al. developed a transfer learning-based predictor (DTL-DephosSite) for dephosphorylation site prediction [127]. To collect datasets of S, T, and Y dephosphorylation sites, they integrated experimentally verified datasets from the literature with datasets from the DEPOD database. They then employed bidirectional long short-term memory (Bi-LSTM), which predicts the modification of the target amino acid using knowledge of the residues in both directions. To the best of our knowledge, it is the first tool that can predict general dephosphorylation sites for protein S/T residues and Y residues. In addition, a novel prediction model focusing on carbonylation, PreCar_Deep [125], was recently reported. Carbonylation is an irreversible covalent PTM and is a measure of protein oxidative damage. In this model, CNN and Bi-LSTM are combined under a deep learning framework. The AUC values of the four datasets (K, T, P, and R) reach 0.981, 0.982, 0.987, and 0.976, respectively. The AUC values of the independent test set reach 0.945, 0.978, 0.965, and 0.983, respectively. 
In addition, a novel deep learning-based predictor for small protein-addition PTM sites was reported in 2021. He et al. built an ensemble learning model that adopts CNN and DNN and outputs four types of sites [126]. This is the first deep learning-based tool that predicts both ubiquitylation and SUMOylation sites at the same time. The PTM prediction tools mentioned in this section, as well as the predictors of phosphorylation, acetylation, and ubiquitination, are tabulated in Table 4.

Summary and outlook

PTM identification is critical to a better understanding of molecular functions and diseases. Advanced MS-based technology has yielded an extensive list of identified PTMs, providing abundant data to support the development of downstream computational identification methods. Although traditional machine learning methods can predict modified sites precisely, deep learning can deduce and optimally tune features automatically, without encoding them ahead of time [29]. Thus, deep learning is highly effective in scientific fields with large and complex datasets. Researchers have recently been shifting their attention from traditional machine learning to deep learning for PTM site prediction (Fig. 2). Furthermore, with the growing number of PTM profiling datasets, deep learning models have been developed not only for phosphorylation, acetylation, and ubiquitination, but also for many other PTM types. In this review, we summarized the recently (2020–2022) released deep learning tools and online web servers for protein PTM site prediction (Table 4). Among all these, CNN and cross-validation are the most widely used network model and evaluation strategy, respectively (Fig. 3).
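Since k-fold cross-validation is the evaluation strategy most of the reviewed tools share, a minimal sketch of how the folds are formed may be useful. The function name `k_fold_indices`, the fixed seed, and the round-robin fold assignment are assumptions made for illustration; the reviewed tools may additionally stratify folds by class or group them by protein, which this sketch does not do.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Samples are shuffled once, then dealt round-robin into k folds so
    that fold sizes differ by at most one. Each sample is used for
    testing exactly once and for training k - 1 times.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold_number, (train, test) in enumerate(k_fold_indices(23, k=5), start=1):
    print(fold_number, len(train), len(test))
```

When windows from the same protein, or from highly similar proteins, land in both the training and test folds, cross-validated scores can look better than performance on an independent test set, which is one reason several tools in Table 4 report both.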

Fig. 3

Sankey diagram depicting the distribution of PTM types, core network models, evaluation strategies, and published years.

Although several deep learning methods have been built with high performance to predict PTM sites, there is still room for improvement. Most of the existing deep learning algorithms employ CNN, DNN, and LSTM classifiers, and each classifier has its own advantages and disadvantages. Therefore, further research is required to evaluate more state-of-the-art frameworks such as attention and transformer-based models. In addition, although many tools predict PTM sites from characteristics such as sequence information, physical properties, chemical properties, and protein structure properties, other approaches remain to be explored, such as reduced amino acid compositions [128], [129], [130]. Additionally, most of the web server links are not working, and few methods provide stand-alone versions. After testing all web servers, we found that they were difficult to operate. With deep learning-based methods, PTM identification can be implemented in a non-invasive, efficient, and low-cost way. However, there is still a caveat before deep learning algorithms can directly diagnose diseases: typical PTM prediction models lack sufficient interpretability due to the black-box nature of deep learning algorithms. Insufficient interpretability may not be an issue in many areas, but within healthcare, every misdiagnosis can pose a danger to a patient's health. Therefore, transparent and explainable models [131], [132], [133] will be needed so that the technique can be applied in clinical practice.

CRediT authorship contribution statement

Lingkuan Meng: Writing, Conceptualization, Methodology, Visualization. Wai-Sum Chan: Methodology. Lei Huang: Methodology. Linjing Liu: Methodology. Xingjian Chen: Methodology. Weitong Zhang: Methodology. Fuzhou Wang: Methodology. Ke Cheng: Methodology. Hongyan Sun: Writing – review & editing, Supervision. Ka-Chun Wong: Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
