Identify Lysine Neddylation Sites Using Bi-profile Bayes Feature Extraction via the Chou's 5-steps Rule and General Pseudo Components.

Abstract

INTRODUCTION: Neddylation is a highly dynamic and reversible post-translational modification. The abnormality of neddylation has previously been shown to be closely related to some human diseases. The detection of neddylation sites is essential for elucidating the regulation mechanisms of protein neddylation.
OBJECTIVE: As the detection of the lysine neddylation sites by the traditional experimental method is often expensive and time-consuming, it is imperative to design computational methods to identify neddylation sites.
METHODS: In this study, a bioinformatics tool named NeddPred is developed to identify underlying protein neddylation sites. A bi-profile bayes feature extraction is used to encode neddylation sites and a fuzzy support vector machine model is utilized to overcome the problem of noise and class imbalance in the prediction.
RESULTS: In 10-fold cross-validation, NeddPred achieved a Matthew's correlation coefficient of 0.7082 and an area under the receiver operating characteristic curve of 0.9769. Independent tests show that NeddPred significantly outperforms NeddyPreddy, the existing predictor of lysine neddylation sites.
CONCLUSION: Therefore, NeddPred can be a complement to the existing tools for the prediction of neddylation sites. A user-friendly webserver for NeddPred is accessible at 123.206.31.171/NeddPred/.

Keywords: Post-translational modification; Chou’s 5-steps rule; feature extraction; fuzzy support vector machine; neddylation; pseudo components

Year: 2019 PMID: 32581647 PMCID: PMC7290059 DOI: 10.2174/1389202921666191223154629

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

NEDD8 is an 81-amino-acid polypeptide that is 60% identical and 80% similar to ubiquitin. The process by which the ubiquitin-like protein NEDD8 is attached to substrate lysines via isopeptide bonds is known as neddylation [1]. Neddylation is a highly dynamic and reversible protein post-translational modification (PTM) that proceeds similarly to ubiquitination and requires an enzyme cascade involving E1, E2 and E3 [2]. Although neddylation relies on its own E1 and E2 enzymes, no NEDD8-specific E3 has yet been identified, and it is possible that the neddylation system relies on E3 ligases with dual specificity [3]. Neddylation has been demonstrated to be essential for maintaining the ubiquitin ligase activity of Cullin-Roc based E3 ligases, and it indirectly affects cell-cycle regulation, transcriptional regulation and signal transduction [4]. Previous studies have shown that abnormal neddylation is strongly linked to some human diseases, such as cancer, Parkinson’s disease and Alzheimer’s disease [5-7]. Therefore, exploring the biological functions of neddylation will help to reveal the pathogenesis of these diseases. However, compared with ubiquitination, which has been widely studied in the past two decades, the molecular mechanism and physiological functions of neddylation are still not well characterized. Accurate detection of neddylation sites is the biggest challenge in deciphering the molecular mechanisms of neddylation. Because experimental approaches are often time-consuming and expensive, it is crucial to develop computational methods to identify neddylation sites. The computational identification and analysis of PTM sites have gained increasing attention in recent years [8-12]. Yavuz et al. [13] developed a predictor named NeddyPreddy to predict neddylation sites using a support vector machine based on various sequence properties, position-specific scoring matrices, and disorder. However, the prediction sensitivity of NeddyPreddy (75%) is not satisfactory.
In order to develop an accurate predictor for the identification of neddylation sites, the key is to find an efficient feature extraction method to encode neddylation sites. Based on several assessments, we found that bi-profile bayes (BPB) was more suitable for distinguishing between neddylation sites and non-neddylation sites than amino acid composition (AAC), split amino acid composition (SplitAAC), amino acid factors (AAF) and binary encoding (BE), which are widely used feature extraction techniques in PTM site prediction. Therefore, BPB was used to encode neddylation sites. Moreover, a fuzzy SVM algorithm was used to handle the class imbalance and noise problems in the neddylation site training dataset. A novel predictor named NeddPred was constructed by combining BPB with the fuzzy SVM. Feature analysis indicated that the residues at some positions around neddylation sites play a key role in predicting neddylation sites.

MATERIALS AND METHODS

As demonstrated by a series of recent publications [14-25] and summarized in a comprehensive review [26], to develop a useful predictor for PTM sites, one needs to follow Chou’s 5-step rule: (a) collect valid PTM sites to train the predictor; (b) encode the PTM sites by effective feature extraction that can reflect their sequential pattern; (c) develop a robust algorithm to conduct the prediction; (d) properly perform cross-validation tests to objectively evaluate the effectiveness of the predictor; (e) establish a user-friendly and publicly accessible web-server for the predictor. Below, we elaborate on how each of these five steps was addressed.

Dataset

Yavuz’s training dataset, validation dataset and independent test dataset [13] were used to train and assess NeddPred. The training dataset consisted of 34 experimentally verified neddyllysine sites and 687 non-neddyllysine sites; the validation dataset consisted of 6 neddyllysine sites and 115 non-neddyllysine sites; and the independent test dataset consisted of 11 neddyllysine sites and 229 non-neddyllysine sites. According to Yavuz’s work and our trials (section 3.1), the window size was selected as 21. The neddylated peptides were used as positive samples, while the non-neddylated peptides were used as negative samples. The training dataset and the independent test dataset are provided in Supplementary material S1.
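The window construction described above can be illustrated with a minimal Python sketch. It cuts a lysine-centred peptide of the chosen window size (21 in this study) around every K in a protein sequence. The function name `lysine_windows` and the use of "X" to pad windows that run past a terminus are assumptions made for illustration; the text does not state how terminal lysines were handled.

```python
def lysine_windows(sequence, window_size=21, pad="X"):
    """Return (1-based position, peptide) for every lysine in `sequence`.

    Each peptide has `window_size` residues with the lysine in the centre.
    Padding with `pad` at the termini is an assumption of this sketch.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that K sits in the centre")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # sequence[i] sits at padded[i + half], so this slice is centred on it
            windows.append((i + 1, padded[i:i + window_size]))
    return windows


if __name__ == "__main__":
    for position, peptide in lysine_windows("MKTAYIAKQR", window_size=5):
        print(position, peptide)
```

Peptides whose central lysine is experimentally verified as neddylated would then form the positive set, and the remaining lysine-centred peptides the negative set.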

Feature Extraction

It is well known that expressing a biological sequence as a discrete model or a vector is one of the most difficult problems in computational biology. This is because machine learning algorithms (such as the “Optimization” algorithm [27], “Covariance Discriminant” algorithm [28, 29], “Nearest Neighbor” algorithm [30], and “Support Vector Machine” algorithm [31]) can only handle vectors [32]. To avoid completely losing the sequence-pattern information of proteins, the pseudo amino acid composition (PseAAC) [26, 33] was proposed. PseAAC has been widely used in bioinformatics [34-44]. As its use has grown, four open-access software packages, called ‘PseAAC’ [45], ‘PseAAC-Builder’ [46], ‘Propy’ [47], and ‘PseAAC-General’ [48], were established to generate pseudo amino acid composition features. The former three generate various modes of Chou’s special PseAAC [49], while PseAAC-General generates those of Chou’s general PseAAC, such as the “Functional Domain” mode, “Gene Ontology” mode, and “Sequential Evolution” or “PSSM” mode [26]. Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [50] was developed for generating various feature vectors for DNA/RNA sequences [51-53]. More recently, a web-server called ‘Pse-in-One’ [54] and its updated version ‘Pse-in-One2.0’ [55] have been established that can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to users’ needs. Bi-profile bayes (BPB) is an effective feature coding method that can be covered by the general PseAAC. BPB coding has been applied to many bioinformatics problems [56-60]. Here, BPB was used to encode neddylation sites.

Consider a sequence fragment $S = s_1 s_2 \cdots s_n$, where $s_j$ stands for one amino acid and $n$ denotes the length of the sequence fragment. $S$ belongs to category $C_1$ or $C_2$, where $C_1$ and $C_2$ represent neddylation sites and non-neddylation sites, respectively. Based on Bayes’ rule, and assuming that the $s_j$ ($j = 1, 2, \ldots, n$) are mutually independent, the posterior probabilities of $S$ for $C_1$ and $C_2$ are given by:

$$P(C_1 \mid S) = \frac{P(C_1) \prod_{j=1}^{n} P(s_j \mid C_1)}{P(S)} \tag{1}$$

$$P(C_2 \mid S) = \frac{P(C_2) \prod_{j=1}^{n} P(s_j \mid C_2)}{P(S)} \tag{2}$$

Therefore, (1) and (2) can be rewritten as:

$$\log P(C_1 \mid S) = \sum_{j=1}^{n} \log P(s_j \mid C_1) + \log \frac{P(C_1)}{P(S)} \tag{3}$$

$$\log P(C_2 \mid S) = \sum_{j=1}^{n} \log P(s_j \mid C_2) + \log \frac{P(C_2)}{P(S)} \tag{4}$$

Assuming that $P(C_1) = P(C_2)$, the decision function can be written as formula (5):

$$f(S) = \operatorname{sgn}\left( \sum_{j=1}^{n} \log P(s_j \mid C_1) - \sum_{j=1}^{n} \log P(s_j \mid C_2) \right) \tag{5}$$

Based on the results of the literature [56], formula (5) can be rewritten as:

$$f(S) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{p}) \tag{6}$$

where $\mathbf{w} = (w_1, w_2, \ldots, w_{2n})$ is the weight vector; $\mathbf{p} = (p_1, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$ is the posterior probability vector; and $p_1, \ldots, p_n$ ($p_{n+1}, \ldots, p_{2n}$) represent the posterior probability of each amino acid at each position in the category $C_1$ ($C_2$). In this study, the posterior probability $p$ was given by the frequency of each amino acid at each position in the training peptides. Therefore, with a window size of $n = 21$, every training peptide was encoded by BPB as a 42-dimensional vector.
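To make the BPB encoding concrete, the following Python sketch estimates position-specific amino acid frequencies from positive and negative training peptides and uses them to encode a peptide as a vector of length 2n (42 for a window of 21). The function names `position_profile` and `bpb_encode`, and the choice to assign 0.0 to a residue never seen at a position, are assumptions of this sketch rather than details given in the paper.

```python
def position_profile(peptides):
    """Frequency of each residue at each position of equal-length peptides."""
    n = len(peptides[0])
    profile = []
    for j in range(n):
        counts = {}
        for peptide in peptides:
            counts[peptide[j]] = counts.get(peptide[j], 0) + 1
        profile.append({aa: c / len(peptides) for aa, c in counts.items()})
    return profile


def bpb_encode(peptide, pos_profile, neg_profile):
    """Encode a peptide of length n as (p_1..p_n, p_{n+1}..p_{2n}).

    The first n entries come from the positive profile, the last n from the
    negative profile. Unseen residues get 0.0 (an assumption of this sketch).
    """
    pos_part = [pos_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    neg_part = [neg_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    return pos_part + neg_part


if __name__ == "__main__":
    positives = ["AKC", "AKD"]   # toy lysine-centred peptides, not real data
    negatives = ["GKC", "AKC"]
    pos_prof = position_profile(positives)
    neg_prof = position_profile(negatives)
    print(bpb_encode("AKC", pos_prof, neg_prof))
```

Because the central residue is always lysine, the two centre entries of every encoded vector are constant, which is consistent with those two features carrying no discriminative information.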

Fuzzy Support Vector Machine

As an effective machine learning algorithm, SVM has been used in the detection of protein PTM sites, such as succinylation sites [61], glycation sites [62], crotonylation sites [63], propionylation sites [38] and citrullination sites [9]. In the standard SVM model, every training sample is assigned the same weight. However, there may be some noisy samples in the training dataset. It is therefore more reasonable to assign different weights to different samples, based on their importance and on the class imbalance, than to assign the same weight to all of them. Here, the fuzzy SVM model was used to construct the classifier. To facilitate the description, the training set is denoted as $\{(x_i, y_i)\}_{i=1}^{l}$. Assume that the first $p$ training examples are positive (i.e., $y_i = +1$, $i = 1, \ldots, p$), and the remaining $l - p$ training examples are negative (i.e., $y_i = -1$, $i = p+1, \ldots, l$). The fuzzy SVM can be formulated as follows:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2} \|w\|^2 + C^{+} \sum_{i=1}^{p} m_i^{+} \xi_i + C^{-} \sum_{i=p+1}^{l} m_i^{-} \xi_i$$

$$\text{s.t.} \quad y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l$$

where $C^{+}$ and $C^{-}$ are the penalty factors; $\xi_i$ are slack variables; $\phi$ is the non-linear mapping; and $m_i^{+}$ and $m_i^{-}$ are the fuzzy memberships. As described in the literature [64], in this study, the fuzzy memberships are defined as follows:

$$m_i^{\pm} = 1 - \frac{d_i^{\pm}}{\max_k d_k^{\pm} + \delta}$$

where $d_i^{\pm}$ is the distance between $x_i$ and the center of its own class, and $\delta$ is a very small positive value guaranteeing that the fuzzy membership is always greater than zero. An observation was treated as more important and assigned a higher fuzzy membership when it was closer to its class center, whereas an observation was treated as less important (such as noise or an outlier) and assigned a lower fuzzy membership when it was farther away from its class center. Based on the results reported by Batuwita and Palade [65], to handle the problem of class imbalance in the prediction, the penalty factors $C^{+}$ and $C^{-}$ were set to $C\,(l-p)/p$ and $C$, respectively. The Gaussian kernel function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ is used in the fuzzy SVM [38, 63]. The kernel parameter $\gamma$ was selected from $\{2^{-10}, 2^{-9}, \ldots, 2^{0}\}$; the penalty parameter $C$ was selected from $\{2^{0}, 2^{1}, \ldots, 2^{12}\}$; and the Libsvm-weights-3.20 package [66] was used to implement the fuzzy SVM models.
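The weighting idea can be sketched in a few lines of Python. The sketch computes a class-centre fuzzy membership for each sample and multiplies it by a class-specific penalty, giving one effective penalty per training sample. Two points are assumptions here and are not confirmed by the text: the membership form m = 1 - d / (max d + delta) with d the Euclidean distance to the class centre in input space, and the choice C+ = C * (number of negatives / number of positives). The function names are hypothetical.

```python
import math


def class_centre(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def fuzzy_memberships(points, delta=1e-6):
    """Assumed form: m_i = 1 - d_i / (max(d) + delta), d_i = distance to centre."""
    centre = class_centre(points)
    dists = [math.dist(p, centre) for p in points]
    d_max = max(dists)
    return [1.0 - d / (d_max + delta) for d in dists]


def sample_penalties(positives, negatives, C=32.0, delta=1e-6):
    """Per-sample penalties C+ * m_i+ and C- * m_i-.

    C+ = C * N- / N+ and C- = C are assumptions used to offset class imbalance.
    """
    c_plus = C * len(negatives) / len(positives)
    c_minus = C
    pos_w = [c_plus * m for m in fuzzy_memberships(positives, delta)]
    neg_w = [c_minus * m for m in fuzzy_memberships(negatives, delta)]
    return pos_w, neg_w


if __name__ == "__main__":
    pos = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]   # last point is an outlier
    neg = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.2, 0.9]]
    print(sample_penalties(pos, neg, C=1.0))
```

These per-sample penalties are the quantities an instance-weighted SVM solver would consume; samples far from their class centre contribute little to the objective, which is how noisy samples are down-weighted.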

Cross-validation and Performance Assessment

K-fold cross-validation, the jackknife test and the independent dataset test are often adopted to evaluate the performance of a predictor. As the jackknife test always yields a unique result for a given training dataset, it is the most objective and least arbitrary of the three test methods [26]. However, to reduce computational time, 10-fold cross-validation was adopted to evaluate our model, and it was repeated 10 times. Five widely accepted measurements, namely sensitivity (Sn), specificity (Sp), accuracy (ACC), Matthew’s correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC), were used to evaluate the prediction performance of NeddPred. In accordance with Eq. (14) [18], the first four are defined as:

$$\mathrm{Sn} = 1 - \frac{N^{+}_{-}}{N^{+}}$$

$$\mathrm{Sp} = 1 - \frac{N^{-}_{+}}{N^{-}}$$

$$\mathrm{ACC} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}$$

$$\mathrm{MCC} = \frac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}$$

where $N^{+}$ is the total number of neddylation sites investigated, $N^{+}_{-}$ is the number of neddylation sites incorrectly predicted as non-neddylation sites, $N^{-}$ is the total number of non-neddylation sites investigated, and $N^{-}_{+}$ is the number of non-neddylation sites incorrectly predicted as neddylation sites. The AUC measures the overall performance of a given prediction system: the closer the AUC is to 1, the better the prediction system. Both the set of conventional metrics copied from math books and the intuitive metrics derived from Chou’s symbols [67-69] are valid only for single-label systems. For multi-label systems, which have become more frequent in system biology [70-75], system medicine [76, 77], and biomedicine [17], a completely different set of metrics, as previously defined [78], is needed.
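As a worked check, the four threshold-based metrics can be computed directly from the four counts in Chou's notation. The Python sketch below uses the standard form of those definitions; the function name `chou_metrics` is hypothetical. The example counts (7 of 34 neddylation sites missed, 14 of 687 non-neddylation sites falsely predicted) are not stated in the paper; they are inferred from the reported 10-fold values Sn = 79.41% and Sp = 97.96% on the training set.

```python
import math


def chou_metrics(n_pos, n_neg, pos_missed, neg_false):
    """Sn, Sp, ACC and MCC from Chou's symbols.

    n_pos      : N+,   total positive (neddylation) sites
    n_neg      : N-,   total negative (non-neddylation) sites
    pos_missed : N+_-, positives predicted as negatives
    neg_false  : N-_+, negatives predicted as positives
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_false / n_neg
    acc = 1 - (pos_missed + neg_false) / (n_pos + n_neg)
    denom = math.sqrt((1 + (neg_false - pos_missed) / n_pos)
                      * (1 + (pos_missed - neg_false) / n_neg))
    # If every sample is predicted as one class the denominator is zero;
    # returning 0.0 in that case is a convention chosen for this sketch.
    mcc = (1 - (pos_missed / n_pos + neg_false / n_neg)) / denom if denom else 0.0
    return sn, sp, acc, mcc


if __name__ == "__main__":
    sn, sp, acc, mcc = chou_metrics(34, 687, 7, 14)
    print(f"Sn={sn:.2%} Sp={sp:.2%} ACC={acc:.2%} MCC={mcc:.4f}")
```

With these inferred counts the function returns values that round to the reported Sn of 79.41%, Sp of 97.96%, ACC of 97.09% and MCC of 0.7082.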

RESULTS AND DISCUSSION

Performance of NeddPred

The optimal parameters (window size, penalty factor C and kernel parameter γ) of the proposed model were determined by the highest AUC value in 10-fold cross-validation. The proposed model achieved the highest AUC value of 0.9769 when using a window size of 21, C = 2^5 and γ = 0.5. Therefore, the optimal window size was selected as 21. As shown in Table 1, the predicted Sn, Sp, ACC and MCC values were 79.41%, 97.96%, 97.09% and 0.7082, respectively. Moreover, NeddPred was also evaluated by the jackknife test with the optimal parameters obtained in the 10-fold cross-validation. NeddPred again achieved a satisfactory performance, with an Sn of 79.41%, an Sp of 97.09%, an ACC of 96.26%, an MCC of 0.6569 and an AUC of 0.9789. To assess the performance of the fuzzy SVM, it was compared with the standard SVM and the biased SVM [65]. The comparison results are shown in Table 2. The fuzzy SVM reached the highest Sn, ACC and MCC values of 79.41%, 97.09% and 0.7082, respectively. Although the Sp value of the standard SVM (99.56%) was slightly higher than that of the fuzzy SVM (97.96%), the Sn value of the standard SVM (44.12%) was much lower than that of the fuzzy SVM (79.41%). In short, the fuzzy SVM showed better results than the standard SVM and the biased SVM.
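Since the parameters were chosen by the highest cross-validated AUC, a small Python sketch of the two ingredients may help: the AUC computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative, and the search grid over C and γ described in the methods. The function names and the pairwise (quadratic-time) AUC computation are choices made for this sketch, not details taken from the paper.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is O(N+ * N-),
    which is fine for datasets of the size used here.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def parameter_grid():
    """All (C, gamma) pairs with C in 2^0..2^12 and gamma in 2^-10..2^0."""
    cs = [2 ** k for k in range(0, 13)]
    gammas = [2.0 ** k for k in range(-10, 1)]
    return [(c, g) for c in cs for g in gammas]


if __name__ == "__main__":
    print(roc_auc([0.9, 0.8], [0.1, 0.8]))
    print(len(parameter_grid()), (2 ** 5, 0.5) in parameter_grid())
```

In a full model-selection loop, each grid point would be scored by the mean AUC over the cross-validation folds and the pair with the highest value retained.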

Comparison of BPB with Other Feature Extraction Technologies

To demonstrate the effectiveness of BPB, it was compared with the most widely used feature extraction technologies in computational biology, including amino acid composition (AAC) [79], split amino acid composition (SplitAAC) [80], amino acid factors (AAF) [81], binary encoding (BE) [82] and composition of k-space amino acid pairs (CKSAAP) [83]. For comparison, CKSAAP was computed with k = 0, 1, 2, 3 and 4, and the peptide in SplitAAC was divided into three parts: the 7 amino acids at the N-terminus, the 7 amino acids at the C-terminus, and the region between these two termini. The 10-fold cross-validation performance obtained with the various features is shown in Table 3. The model with BPB reached the highest AUC value. The results indicate that BPB encoding is more effective for extracting the sequence information around neddylation sites than the other encoding schemes.
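For readers unfamiliar with the baseline encodings, two of them are simple enough to sketch in Python. AAC reduces a peptide to 20 residue frequencies and discards position, while binary encoding keeps position through a one-hot block per residue. The 20-letter alphabet and the treatment of non-standard characters (ignored by both functions) are assumptions of this sketch; the exact variants used in the comparison are not specified in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_encode(peptide):
    """Amino acid composition: frequency of each of the 20 residues."""
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def binary_encode(peptide):
    """Binary (one-hot) encoding: one 20-dimensional block per position."""
    vector = []
    for residue in peptide:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


if __name__ == "__main__":
    peptide = "AKC"
    print(len(aac_encode(peptide)), len(binary_encode(peptide)))
```

Under these assumptions a 21-residue window yields 20 AAC features or 420 binary features, compared with the 42 features of BPB.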

Comparison of NeddPred with Existing Predictor

At present, only one predictor, NeddyPreddy [13], has been proposed for the prediction of neddylation sites. Because NeddPred and NeddyPreddy were both trained on Yavuz’s dataset [13], which contains 34 neddylation sites and 687 non-neddylation sites, the two predictors can be compared directly. As shown in Table 4, NeddPred outperforms NeddyPreddy on the training dataset, the validation set and the independent test set alike. For example, in 10-fold cross-validation NeddPred achieved an MCC about 0.26 higher than that of NeddyPreddy (0.7082 versus 0.45). These results show that NeddPred can predict neddylation sites from protein sequences more reliably than NeddyPreddy. The ROC curves for NeddPred by 10-fold cross-validation, jackknife test, validation set test and independent test are shown in Fig. (). The results indicate that NeddPred can be an effective predictor of neddylation sites. Two factors account for the improvement of NeddPred. One is the fuzzy SVM, which can effectively handle the problem of noise in the prediction of neddylation sites. The other is that the BPB feature used in NeddPred outperforms the sequence properties, position-specific scoring matrices, and disorder features used in NeddyPreddy.

Prediction Server of NeddPred

As pointed out previously [84], user-friendly and publicly accessible web-servers are the future direction for developing useful bioinformatics tools [85-89]. For the convenience of experimental scientists, NeddPred has been implemented as a web-server, which was trained on all available data (training data, validation data and independent testing data, i.e., 34+6+11=51 neddylation sites and 687+115+229=1031 non-neddylation sites) using the optimal parameters (window size 21, C = 2^5 and γ = 0.5). The web-server for NeddPred is now available at http://123.206.31.171/NeddPred/. As shown in Fig. (), users can enter query protein sequences (FASTA) or batch-upload the query protein sequences (FASTA) as a file for the prediction. The NeddPred server will output a CSV-formatted file with the prediction results.
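The input and output handling of such a server can be illustrated with a short Python sketch that reads FASTA text, lists every candidate lysine, and writes the result as CSV. This is not the server's code: the function names and the CSV column names ("protein", "position", "residue") are hypothetical, and the actual NeddPred output would also carry a prediction score that this sketch does not compute.

```python
import csv
import io


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def candidate_sites_csv(fasta_text):
    """Return CSV text with one row per lysine found in the input sequences."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["protein", "position", "residue"])  # hypothetical columns
    for name, seq in read_fasta(fasta_text):
        for i, aa in enumerate(seq, start=1):
            if aa == "K":
                writer.writerow([name, i, aa])
    return out.getvalue()


if __name__ == "__main__":
    print(candidate_sites_csv(">demo toy sequence\nMKTAYIAKQR\n"))
```

Each listed lysine would then be cut into a 21-residue window, BPB-encoded and scored by the trained classifier before the CSV is returned to the user.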

The Significant Features

As previously described, every lysine site in neddylated proteins was encoded as a 42-dimensional vector through BPB encoding. To clarify the contribution of different features to the prediction of neddylation sites, we used the F-score feature selection method [90] to rank the 42 BPB features (Table 5). The higher the F-score of a feature, the more important the feature is. Moreover, the position-specific residue composition of lysine-centric peptides was characterized by Two-Sample-Logo [91] in Fig. (). As shown in Table 5 and Fig. (), the ‘Pos_8’ feature was ranked at the top of the 42 BPB features, which implies that the asparagine residue at position 8 around neddylation sites may play a key role in the identification of neddylation sites. The residues at positions -4, -3, 1 and -2 around neddylation sites may also play a relatively important role. The 42 BPB features ranked by the F-score may provide clues for deciphering the molecular mechanisms of neddylation.
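The F-score ranking can be sketched in Python using the commonly used definition of the F-score: the squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances. Whether the authors used exactly this definition is an assumption. Returning -1.0 for a feature with zero variance in both classes is also an assumption of this sketch, chosen because it would explain the -1.0000 entries for the constant centre features Pos_0 and Neg_0 in the ranking table.

```python
def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the two classes.

    Each class needs at least two values. Returns -1.0 when both
    within-class variances are zero (sentinel assumed for this sketch).
    """
    all_values = pos_values + neg_values
    mean_all = sum(all_values) / len(all_values)
    mean_pos = sum(pos_values) / len(pos_values)
    mean_neg = sum(neg_values) / len(neg_values)
    var_pos = sum((v - mean_pos) ** 2 for v in pos_values) / (len(pos_values) - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg_values) / (len(neg_values) - 1)
    denom = var_pos + var_neg
    if denom == 0:
        return -1.0
    return ((mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2) / denom


def rank_features(pos_matrix, neg_matrix):
    """Return (feature indices sorted by decreasing F-score, list of scores)."""
    n_features = len(pos_matrix[0])
    scores = [f_score([row[k] for row in pos_matrix],
                      [row[k] for row in neg_matrix])
              for k in range(n_features)]
    order = sorted(range(n_features), key=lambda k: scores[k], reverse=True)
    return order, scores


if __name__ == "__main__":
    pos = [[1.0, 1.0], [3.0, 1.0]]   # toy feature matrix, not real data
    neg = [[0.0, 1.0], [0.0, 1.0]]
    print(rank_features(pos, neg))
```

Applied to the 42-column BPB matrix of the training peptides, `rank_features` would produce an ordering of the same kind as the one reported for the Pos_i and Neg_j features.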

CONCLUSION

In this paper, a bioinformatics tool named NeddPred was developed to identify neddylation sites using BPB encoding and fuzzy SVM. Experimental results showed that NeddPred yielded better performance than the existing neddylation sites predictor. Therefore, NeddPred will be a useful predictor for the accurate identification of neddylation sites. To provide convenience for researchers to study neddylation, a web-server for NeddPred was established. Feature analysis shows that BPB features at some positions may play a key role in the prediction of neddylation sites.

Table 1

The 10-fold cross-validation results of NeddPred with different window sizes.

Window Size	Sn(%)	Sp(%)	ACC(%)	MCC	AUC
11	76.47	93.01	92.23	0.4853	0.9331
13	73.53	91.27	90.43	0.4259	0.9096
15	67.65	93.89	92.65	0.4554	0.9329
17	76.47	97.96	96.95	0.6893	0.9592
19	73.53	96.80	95.70	0.6039	0.9592
21	79.41	97.96	97.09	0.7082	0.9769
23	76.47	97.82	96.81	0.6800	0.9756
25	79.41	97.09	96.26	0.6569	0.9723
27	82.35	95.78	95.15	0.6138	0.9721

Table 2

Comparison of fuzzy SVM with standard SVM and biased SVM.

Method	Sn(%)	Sp(%)	ACC(%)	MCC	AUC
Standard SVM	44.12	99.56	96.95	0.5936	0.9747
Biased SVM	79.41	97.38	96.53	0.6729	0.9716
Fuzzy SVM	79.41	97.96	97.09	0.7082	0.9769

Table 3

The predictive performance of 10-fold cross-validation using various training features.

Feature	Sn(%)	Sp(%)	ACC(%)	MCC	AUC
AAC	38.24	77.73	75.87	0.0804	0.6319
SplitAAC	44.12	88.21	86.13	0.2017	0.7096
AAF	23.53	99.71	96.12	0.4212	0.6649
BE	26.47	99.27	95.84	0.3955	0.6912
CKSAAP	32.35	96.94	93.90	0.3015	0.7717
BPB	79.41	97.96	97.09	0.7082	0.9769

Table 4

Comparison of NeddPred with NeddyPreddy under different evaluation strategies.

Method	Evaluation Strategy	Sn	Sp	ACC	MCC	AUC
NeddyPreddy¹	10-fold cross-validation	0.76	0.91	0.91	0.45	0.95
NeddPred	10-fold cross-validation	0.7941	0.9796	0.9709	0.7082	0.9769
NeddyPreddy¹	Validation set	0.67	0.91	0.90	0.39	0.83
NeddPred	Validation set	1.00	0.9913	0.9917	0.9218	1.00
NeddyPreddy¹	Independent testing set	0.64	0.91	0.90	0.35	0.80
NeddPred	Independent testing set	1.00	0.9520	0.9542	0.6899	1.00

¹ The corresponding results were obtained from the literature (Yavuz et al., 2015) [13].

Table 5

The 42 BPB features ranked by the F-score method.

Order	Amino Acid Position	F-score	Order	Amino Acid Position	F-score
1	Pos_8¹	0.5889	22	Neg_-1	0.0995
2	Pos_-4	0.3916	23	Pos_-9	0.0885
3	Pos_-3	0.3843	24	Neg_1	0.0832
4	Pos_1	0.3752	25	Pos_6	0.0744
5	Pos_-2	0.3665	26	Neg_10	0.0521
6	Pos_-7	0.3549	27	Neg_-5	0.0229
7	Pos_-5	0.3474	28	Neg_3	0.0195
8	Pos_-1	0.3353	29	Neg_-10	0.0124
9	Pos_7	0.3276	30	Neg_-4	0.0092
10	Pos_-10	0.2673	31	Neg_9	0.0063
11	Pos_5	0.2653	32	Neg_2	0.0056
12	Pos_10	0.2608	33	Neg_6	0.0023
13	Pos_-6	0.2416	34	Neg_-6	0.0012
14	Pos_4	0.2403	35	Neg_-8	0.0009
15	Pos_2	0.2290	36	Neg_-2	0.0005
16	Pos_3	0.2105	37	Neg_4	0.0005
17	Pos_-8	0.2005	38	Neg_-7	0.0004
18	Pos_9	0.1940	39	Neg_-9	0.0003
19	Neg_-3	0.1856	40	Neg_5	0.0000
20	Neg_7	0.1383	41	Pos_0	-1.0000
21	Neg_8	0.1133	42	Neg_0	-1.0000

¹ Pos_i and Neg_j mean position i in neddylated peptides and position j in non-neddylated peptides, respectively.
