Literature DB >> 25961025

Predict and Analyze Protein Glycation Sites with the mRMR and IFS Methods.

Yan Liu¹, Wenxiang Gu², Wenyi Zhang³, Jianan Wang³.

Abstract

Glycation is a nonenzymatic process in which proteins react with reducing sugar molecules. The identification of glycation sites in protein may provide guidelines to understand the biological function of protein glycation. In this study, we developed a computational method to predict protein glycation sites by using the support vector machine classifier. The experimental results showed that the prediction accuracy was 85.51% and an overall MCC was 0.70. Feature analysis indicated that the composition of k-spaced amino acid pairs feature contributed the most for glycation sites prediction.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Amino Acids
Proteins

Year: 2015 PMID： 25961025 PMCID： PMC4413511 DOI： 10.1155/2015/561547

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Glycation is one of the most important posttranslation modifications (PTMs) of proteins. Glycation is a two-step nonenzymatic reaction. First, generate the stable Amadori product based on the unstable Schiff base. Secondly, the advanced glycation end products (AGEs) are generated at the second step. According to the clinical researches [1], the advanced glycation end products are involved in a variety of human diseases, such as diabetes, Alzheimer's disease, and Parkinson's disease. The glycation mechanism might be a key to the treatment of the above diseases. Identification of the glycation sites in protein may provide guidelines to understand the biological function of proteins glycation. It is important to note that glycation and glycosylation are different. Glycation is the result of typically covalent bonding of a protein or lipid molecule with a sugar molecule, such as fructose or glucose, without the controlling action of an enzyme. As opposed to the nonenzymatic chemical reaction of glycation, glycosylation is an enzyme-directed site-specific process. Five types of glycosylation are produced, including N-glycosylation, O-glycosylation, C-mannosylation, glypiation, and phosphoglycans linked through the phosphate of a phosphoserine [2]. Although some high-throughput proteomics experimental methods [3] have been developed to find posttranslational modification (PTM) sites [4-6], it is still difficult to confirm glycation sites by these methods. Several computational approaches to predict glycosylation sites have been reported. Li et al. [7] trained SVM based on physicochemical properties of amino acids and a 0/1 system which was only focusing on the O-glycosylation in mammalian proteins; Caragea et al. [8] used ensemble method to identify N-linked, O-linked, and C-linked glycosylation. Chen et al. [9] predicted mucin-type O-glycosylation serine/threonine (S/T) sites in mammalian proteins with the assistance of SVM based on the composition of k-spaced amino acid pairs (CKSAAP) encoding scheme. Compared with the glycosylation, the determination of protein glycation sites was more difficult. Therefore, to the best of our knowledge, such computer methods for prediction of glycation sites were rarely mentioned in literatures except the GlyNN [10]. GlyNN was built by combining 60 artificial neural networks with a balloting procedure and obtained the maximal Matthews correlation coefficient (MCC) of 0.58 with the sequence size 23. Here, we used support vector machine to develop a predictor for glycation sites of lysine. Amino acid occurrence frequency, amino acid factors, and the composition of k-spaced amino acid pairs (CKSAAP) were used to encode glycation site peptides. We reduced the support vector machine classifier input features dimension by utilizing the maximum relevance minimum redundancy (mRMR) method followed by the incremental feature selection (IFS) procedure. The experimental result showed that our predictor achieved an overall MCC of 0.7063. Feature analysis prompted that the CKSAAP encoding was efficient to capture a glycation site's characters. The detailed analysis results in this work may provide useful insights to detect glycation sites. A web server (PreGly) that implemented the proposed method was freely available: http://202.198.129.220:8080/GlycationPre/.

2. Materials and Methods

2.1. Dataset

In this study, we took the GlyNN [10] dataset as the benchmark dataset. It was convenient to compare the performance of PreGly and GlyNN, since they were built on the same dataset. Johansen et al. [10] collected experimentally validated glycation sites by searching hundreds of research papers manually. The whole dataset contained 89 glycation sites (positive samples) and 126 nonglycation sites (negative samples) from 20 proteins. Subsequently, we extracted each glycation site peptide with the window size 23, with 11 residues upstream and 11 residues downstream of the glycation site. To make sure that each sequence window size was fixed to 23, we complemented a nonexisting residue “O” to the peptide, less than 23 amino acid residues. The window size fixed to 23 is rational and has been confirmed experimentally [10]. However, for the amino acid site near the end of the peptide, 23 may be a too much window size. We removed peptides which extended too many “O” residues to test the influence of “O” residues on the predictor result. The experimental result yielded a prediction accuracy of 82.74% (Sn = 71.84%, Sp = 0.9136, and MCC = 0.6527) and the prediction performance was also better than GlyNN [10]. To ensure the data integrity, we still applied the window size 23 on the entire data. The whole dataset was provided in the supporting information available online at http://dx.doi.org/10.1155/2015/561547 (Supporting Information S1).

2.2. Feature Space

2.2.1. Amino Acid Occurrence Frequency Feature

We calculated the occurrence frequencies of the 20 native amino acids in given proteins [11]. Given a protein R, L is the length of the peptide in protein R, and the number of ith amino acid in the peptide is n . We use p as the amino acid occurrence frequency feature:

2.2.2. k-Spaced Amino Acid Pairs Feature

The composition of k-spaced amino acid pairs (CKSAAP) encode scheme has been employed to predict various PTMs [6, 9]. Given 20 native amino acids and one complementary residue “O,” there are 441 basic amino acid pair types: AA, AC⁡,…, AW, AY, OO. The basic amino acid pair types are enlarged to the k-spaced amino acid pair types. For example, the space number k of “A∧∧A” is equal to 2. We examined three predictors built with the parameter k = 3,4, and 5 and obtained that the maximum accuracy was 85.51% with k = 4.

2.2.3. Amino Acid Factors Feature

AA Index database [12] collected the various physicochemical and biochemical properties of amino acids. Atchley et al. [13] performed multivariate statistical analyses on AA Index to produce five multidimensional patterns of attribute covariation which reflected polarity (AA Factor 1), secondary structure (AA Factor 2), molecular volume (AA Factor 3), codon diversity (AA Factor 4), and electrostatic charge (AA Factor 5). These five factors have been successfully used to solve several different biology problems, such as [7, 11]. As mentioned above, we encoded each glycation site peptide by 21 features of amino acid occurrence frequency, 5 features of amino acid factors, and 441 × 4 features of k-spaced amino acid pairs. Therefore, for each peptide consisting of 23 amino acid residues, there were a total of 21 + 5 × 23 + 441 × 4 = 1900 features (Supporting Information S5).

2.3. The mRMR Method

The maximum relevancy minimum redundancy (mRMR) method [14] is used to rank features based on the criterion of maximum relevance to the target and the minimum redundancy between features. On the rank list, features with the small index are considered the “good” features, and these “good” features may provide more information for glycation site prediction. The maximum relevancy criterion can be expressed aswhere I(f , c) calculates the relevance between the feature f in Ω and the target c. And the minimum redundancy (MR) can be represented aswhere Ω is the already-selected features set with the set size m and Ω is the to-be-selected features set with the set size n. The MR calculates the redundancy between the feature f in Ω and all the other features in Ω . Based on formulas (2) and (3), the mRMR method is

2.4. Incremental Feature Selection

Incremental feature selection (IFS) method [15] is used to select the optimal features. Features in the mRMR feature rank list are added one by one during the IFS procedure. Then, we construct N feature sets when there are N features on the list and the ith feature set is composed of i features.

2.5. Support Vector Machine

Vapnik [16] first proposed the support vector machine (SVM) algorithm. In principle, SVM is a two-class classifier. Given training vectors x ∈ R and their class labels y ∈ (−1,1), i = 1,…, N, SVM solves the problem:where ω is a normal vector perpendicular to the hyperplane and ξ are slake variables for allowing misclassifications. The support vector machine has been widely used in bioinformation [9, 17]. In this work, we implement LIBSVM package [18] with RBF function. The two parameters penalty parameter C and kernel parameter γ are found by using a grid search strategy based on 10-fold cross-validation.

2.6. Performance Evaluation

In statistical prediction, three cross-validation methods are often used to examine a predictor for its anticipated accuracy: K-fold cross-validation test, independent dataset test, and jackknife test [3]. We chose the 10-fold cross-validation test to examine the quality of our predictor. During the 10-fold cross-validation test process, the peptide samples were divided into ten parts. Each part of them was in turn as test samples, and the remaining nine parts were as the train samples. Four parameters, sensitivity (Sn), specificity (Sp), accuracy (Ac), and Mathew correlation coefficients (MCC), are used to evaluate the predictor performance:where TP, TN, FP, and FN represent the true positive, the true negative, the false positive, and the false negative, respectively. MCC (Matthew correlation coefficient) reflects both the sensitivity and the specificity of a predictor.

3. Results and Discussion

3.1. The Predictors' Performance with Different k

In order to find the optimal k value of the CKSAAP feature encoding which can detect the glycation sites with high accuracy, we investigated the predictor performance of k = 3,4, and 5. The highest accuracy was 85.51% when k = 4 (Supporting Information S2).

3.2. The mRMR and IFS Results

First of all, by using the mRMR method, a total of 1900 features were ranked. Then we implemented the IFS procedure based on the mRMR rank list and generated 1900 feature sets. Subsequently, we built 1900 predictors and tested those predictors (Supporting Information S3) and plotted the IFS curve in Figure 1. It can be seen from Figure 1 that the maximum MCC was 0.7063 by using the top 167 features, and these 167 features (Supporting Information S4) were considered the optimal features to train our final predictor.

Figure 1

The MCC value against feature number.

3.3. Optimal Features Analysis

In the 167 optimal features, over 90% features (153 CKSAAP features in 167 optimal features) were CKSAAP features. Thus, we inferred that the CKSAAP feature was the most importance feature for the prediction of glycation sites. We also listed the top 10 features of the optimal features in Table 1. It can be seen from Table 1 that there were 7 CKSAAP features in top 10 features. So we suggested that the CKSAAP feature was the most suitable feature for glycation sites prediction.

Table 1

Top 10 features of the optimal features.

Order	Feature number	Feature name	Explain
1	1793	S^∧∧∧∧W	Pair of serine and tryptophan spaced with 4 residues
2	33	SS	The secondary structure of site 3
3	1494	C^∧∧∧∧Q	Pair of cystine and glutamine spaced by 4 residues
4	1752	Q^∧∧∧∧Y	Pair of glutamine and tyrosine spaced by 4 residues
5	91	EC	The electrostatic charge of site 14
6	710	H^∧∧H	Pair of histidine and histidine spaced by 2 residues
7	129	MV	The molecular volume of site 22
8	341	L^∧S	Pair of leucine and serine spaced by one residue
9	629	D^∧∧L	Pair of aspartic acid and leucine spaced by 2 residues
10	210	E^∧M	Pair of glutamic acid and methionine spaced by one residue

The feature selection methods could find out the most important CKSAAP pairs. For example, the first feature in Table 1 was the CKSAAP pair “S∧∧∧∧W,” which suggested that a potential glycation residue existed if the CKSAAP pair “S∧∧∧∧W” surrounding the residue have high abundance. We further investigated the classifications of the 20 native amino acids of the optimal features (Figure 2). From Figure 2, we can see that the polar amino acid and the nonpolar amino acid were vital characteristics for the glycation sites prediction.

Figure 2

Amino acids classification distribution.

We also noted that the AA Factor features played some roles in glycation sites' prediction. There were one polarity factor, two secondary structure factors, five molecular volume factors, one codon diversity factor, and five electrostatic charge factors in the 14 AA Factor features. And the sites distributions of AA Factor were shown in Figure 3. Figure 3 indicated that site 23 was the most important site for glycation sites' prediction and site 3 was the secondary important site for glycation sites prediction. Those results are consistent with the literature [10], which reported that the N-terminal parts of the human proteins have a higher predicted glycation potential than the other parts of the proteins.

Figure 3

Number of corresponding specific sites.

3.4. Compare with Previous Method

In this section, we compared the proposed method with a previous method GlyNN [10]. As can be seen from Table 2, the performance of PreGly was better than that of GlyNN. The prediction accuracy of PreGly was 85.51% and the MCC was 0.70 with 167 optimal features. And from these better prediction results, we speculated that the CKSAAP encoding may provide more information than the consensus sequence motif.

Table 2

Comparison of PreGly with GlyNN.

Method	Sensitivity (%)	Specificity (%)	Accuracy (%)	MCC
PreGly	71.06	95.85	85.51	0.70
GlyNN	78.65	80.15	79.50	0.58

3.5. The Performance of PreGly on Independent Dataset

We have reported that PreGly could achieve a better prediction performance. To objectively assess our predictor, we further tested our method on an independent dataset. 17 protein sequences containing experimentally validated glycation sites were retrieved from Uniport as the independent dataset (Supporting Information S6). Glycation sites labeled by “potential” and “probable” were removed. Finally, a total of 82 glycation sites and 117 not glycated sites in 17 proteins were retrieved. For each glycation site or not glycated site, a 23-residue peptide containing central glycation/not glycated site and 11 residues upstream and 11 residues downstream of the glycation/not glycated site was extracted. The peptides with length less than 23 were extended by “O” residue. After tenfold cross-validation, the average prediction accuracy (Ac), sensitivity (Sn), specificity (Sp), and MCC of PreGly on the independent dataset were 79.92%, 64.2%, 91.57%, and 0.5849.

3.6. Discussion

In the 167 optimal features, there were 153 CKSAAP features, 14 AA Factor features, and none of the amino acid occurrence frequency features. The feature distribution of optimal features set was shown in Figure 4. Considering the better performance of PreGly, it was possible that the CKSAAP [19] encoding was particularly suitable for the prediction of glycation site, which was to say that the short linear motifs may be more important than position-specific patterns in glycation sites' recognization. These analysis results reinforced the viewpoint that there may be no significant difference of mutations to other amino acids for each glycation site [20]. Figure 4 also implied that the amino acid occurrence frequency feature was faintly contributed to glycation sites prediction.

Figure 4

The distribution of each feature type in the optimal features set.

The amino acids classification distribution (Figure 2) revealed that the polar amino acid and the nonpolar amino acid were effective for the glycation sites prediction. Apart from this, we further investigated the 20 native amino acids quantity in the optimal features (Figure 5). As shown in Figure 5, the numbers of Leucine and Valine were more than the other amino acids, whereas Asparagine and Tryptophan were the two kinds of amino acids in the least number. It was worthwhile to point out that the Asparagine and Tryptophan may provide useful clues to validate new glycation sites in protein sequences.

Figure 5

The distribution of the amino acid types in the optimal features.

4. Conclusion

In this work, we built a predict model for protein glycation site prediction based on support vector machine. Our predictor reached an overall MCC of 0.706324, and sensitivity, specificity, and accuracy were 71.06%, 95.85%, and 85.51%, respectively. Glycation is involved in several diseases such as Alzheimer. It was advisable to identify glycation site for the associated diseases treatment. Detailed analysis conducted in this study may provide insights into understanding the mechanism of glycation and provide clues for the treatments of glycation related disease [21-28]. The supplementary material includes the training dataset and the independent dataset of our manuscript.

21 in total

1. Solving the protein sequence metric problem.

Authors: William R Atchley; Jieping Zhao; Andrew D Fernandes; Tanja Drüke
Journal: Proc Natl Acad Sci U S A Date: 2005-04-25 Impact factor: 11.205

2. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.

Authors: Hanchuan Peng; Fuhui Long; Chris Ding
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2005-08 Impact factor: 6.226

3. Predicting O-glycosylation sites in mammalian proteins by using SVMs.

Authors: Sujun Li; Boshu Liu; Rong Zeng; Yudong Cai; Yixue Li
Journal: Comput Biol Chem Date: 2006-06 Impact factor: 2.877

4. A classification-based prediction model of messenger RNA polyadenylation sites.

Authors: Guoli Ji; Xiaohui Wu; Yingjia Shen; Jiangyin Huang; Qingshun Quinn Li
Journal: J Theor Biol Date: 2010-05-26 Impact factor: 2.691

5. Analysis and prediction of mammalian protein glycation.

Authors: Morten Bo Johansen; Lars Kiemer; Søren Brunak
Journal: Glycobiology Date: 2006-06-08 Impact factor: 4.313

6. Systematic discovery of new recognition peptides mediating protein interaction networks.

Authors: Victor Neduva; Rune Linding; Isabelle Su-Angrand; Alexander Stark; Federico de Masi; Toby J Gibson; Joe Lewis; Luis Serrano; Robert B Russell
Journal: PLoS Biol Date: 2005-11-15 Impact factor: 8.029

7. AAindex: amino acid index database, progress report 2008.

Authors: Shuichi Kawashima; Piotr Pokarowski; Maria Pokarowska; Andrzej Kolinski; Toshiaki Katayama; Minoru Kanehisa
Journal: Nucleic Acids Res Date: 2007-11-12 Impact factor: 16.971

8. The Universal Protein Resource (UniProt) in 2010.

Authors:
Journal: Nucleic Acids Res Date: 2009-10-20 Impact factor: 16.971

9. Glycosylation site prediction using ensembles of Support Vector Machine classifiers.

Authors: Cornelia Caragea; Jivko Sinapov; Adrian Silvescu; Drena Dobbs; Vasant Honavar
Journal: BMC Bioinformatics Date: 2007-11-09 Impact factor: 3.169

10. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs.

Authors: Yong-Zi Chen; Yu-Rong Tang; Zhi-Ya Sheng; Ziding Zhang
Journal: BMC Bioinformatics Date: 2008-02-18 Impact factor: 3.169

9 in total

1. XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites.

Authors: Kewei Liu; Wei Chen; Hao Lin
Journal: Mol Genet Genomics Date: 2019-08-07 Impact factor: 3.291

2. Systematic Characterization of Lysine Post-translational Modification Sites Using MUscADEL.

Authors: Zhen Chen; Xuhan Liu; Fuyi Li; Chen Li; Tatiana Marquez-Lago; André Leier; Geoffrey I Webb; Dakang Xu; Tatsuya Akutsu; Jiangning Song
Journal: Methods Mol Biol Date: 2022

3. iProtGly-SS: A Tool to Accurately Predict Protein Glycation Site Using Structural-Based Features.

Authors: Iman Dehzangi; Alok Sharma; Swakkhar Shatabda
Journal: Methods Mol Biol Date: 2022

4. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites.

Authors: Zhen Chen; Xuhan Liu; Fuyi Li; Chen Li; Tatiana Marquez-Lago; André Leier; Tatsuya Akutsu; Geoffrey I Webb; Dakang Xu; Alexander Ian Smith; Lei Li; Kuo-Chen Chou; Jiangning Song
Journal: Brief Bioinform Date: 2019-11-27 Impact factor: 11.622

5. GlyStruct: glycation prediction using structural properties of amino acid residues.

Authors: Hamendra Manhar Reddy; Alok Sharma; Abdollah Dehzangi; Daichi Shigemizu; Abel Avitesh Chandra; Tatushiko Tsunoda
Journal: BMC Bioinformatics Date: 2019-02-04 Impact factor: 3.169

6. Glypre: In Silico Prediction of Protein Glycation Sites by Fusing Multiple Features and Support Vector Machine.

Authors: Xiaowei Zhao; Xiaosa Zhao; Lingling Bao; Yonggang Zhang; Jiangyan Dai; Minghao Yin
Journal: Molecules Date: 2017-11-03 Impact factor: 4.411

7. Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks.

Authors: Yingxi Yang; Hui Wang; Wen Li; Xiaobo Wang; Shizhao Wei; Yulong Liu; Yan Xu
Journal: BMC Bioinformatics Date: 2021-03-31 Impact factor: 3.169

8. A Feature Fusion Predictor for RNA Pseudouridine Sites with Particle Swarm Optimizer Based Feature Selection and Ensemble Learning Approach.

Authors: Xiao Wang; Xi Lin; Rong Wang; Nijia Han; Kaiqi Fan; Lijun Han; Zhaoyuan Ding
Journal: Curr Issues Mol Biol Date: 2021-11-01 Impact factor: 2.976

9. On the Prediction of In Vitro Arginine Glycation of Short Peptides Using Artificial Neural Networks.

Authors: Ulices Que-Salinas; Dulce Martinez-Peon; Angel D Reyes-Figueroa; Ivonne Ibarra; Christian Quintus Scheckhuber
Journal: Sensors (Basel) Date: 2022-07-13 Impact factor: 3.847

9 in total