iHyd-LysSite (EPSV): Identifying Hydroxylysine Sites in Protein Using Statistical Formulation by Extracting Enhanced Position and Sequence Variant Feature Technique.

Muhammad Khalid Mahmood¹, Asma Ehsan¹, Yaser Daanial Khan¹, Kuo-Chen Chou¹.

Abstract

INTRODUCTION: Hydroxylation is one of the most important post-translational modifications (PTMs) in cellular functions and is linked to various diseases. Adding a hydroxyl group (OH) to a lysine residue through chemical modification produces hydroxylysine.
METHODS: The method used in this study for identifying hydroxylysine sites is based on a mathematical and statistical methodology that incorporates the sequence-order effect and the composition of each residue within protein sequences. The predictor is called "iHyd-LysSite (EPSV)" (identifying hydroxylysine sites by extracting enhanced position and sequence variant technique). Identifying hydroxylysine sites by experimental methods is difficult, laborious and expensive; in silico techniques offer an alternative approach to identifying hydroxylysine sites in proteins.
RESULTS: A useful predictive model should have high sensitivity and specificity and be accurate. Self-consistency, independent, 10-fold cross-validation and jackknife tests were performed for validation, using three well-known classifiers: Neural Networks (NN), Random Forest (RF) and Support Vector Machine (SVM). The overall predictive outcomes compare favourably with the results obtained by previous predictors. In the jackknife test, the sensitivity was 96.08%, 94.99% and 98.16%, and the specificity 97.52%, 98.52% and 80.95%, for NN, RF and SVM, respectively.
CONCLUSION: The results obtained by the proposed tool show that this method may meet future demand for hydroxylysine site identification, with a better prediction rate than existing methods.

Keywords: ANN; Hydroxylysine; PTMs; cross-validation; post-translational modifications; predictive model

Year: 2020 PMID: 33214770 PMCID: PMC7604750 DOI: 10.2174/1389202921999200831142629

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

Introduction

Many proteins undergo a wide range of post-translational modifications. These modifications are of two types: reversible and non-reversible. Reversible modifications are related to physiological processes and are significant for the functioning of organisms, whereas non-reversible ones are related to pathological causes and diseases [1]. Hydroxylation is one of the essential reversible post-translational modifications in proteins. In this modification, at least one hydroxyl group is attached to an amino acid [2]. Proline and lysine are the main hydroxylated residues in the protein chain, and both occur largely in collagen [3]. Proline is hydroxylated at the gamma-carbon atom, forming hydroxyproline, a vital constituent of collagen. Hydroxyproline maintains the triple-helix structure of collagen and is also useful in hypoxia through hypoxia-inducible factors [4]. A lack of ascorbate produces hydroxyproline deficiency and reduces the stability of collagen, which causes metabolic disorders or disease [5]. The other kind of protein hydroxylation is that of a lysine residue, which is also produced exclusively in collagen [6]. This hydroxylation happens at the delta-carbon atom to form hydroxylysine (Fig. ) and is associated with secretion as well as function in the extracellular matrix [7]. Thus, the identification of hydroxyproline and hydroxylysine gives significant information for biomedical research and drug development [8]. Mass spectrometry is an experimental method for identifying hydroxylysine sites in proteins. Experimental identification of hydroxylysine sites is difficult, tedious and expensive [7, 9]. In contrast, in silico methods are much more convenient for predicting hydroxylysine sites, giving the desired results quickly and at low cost.
This is a fundamental approach in bioinformatics for predicting the residues modified during post-translational modification. Furthermore, many computational algorithms have been developed to understand complex molecular structure and to predict hydroxylation sites [10]. Similar methods for other post-translational modifications, including the prediction of threonine phosphorylation, tyrosine nitration and tyrosine phosphorylation sites, are described in a series of recent articles [11-16]. The predictor “iHyd-PseAAC” was developed for identifying hydroxylysine and hydroxyproline sites in the proteome by incorporating the dipeptide position-specific propensity into the general form of pseudo amino acid composition [8]. Another scheme, “iHyd-PseCp”, was developed by Qiu, Wang-Ren et al. [17] for identifying hydroxyproline and hydroxylysine sites in proteins by incorporating sequence-coupled information into the general pseudo amino acid composition. A number of encoding schemes, such as the composition of k-spaced amino acid pairs (CKSAAP), Amino Acid Composition (AAC) and Binary Encoding (BE), are used for the prediction of phosphorylation, S-sulfenylation and lysine succinylation sites [18-20]. CKSAAP is an effective feature extraction technique that was proposed for identifying lysine formylation sites; combined with general pseudo components and Chou's 5-step rule, it is used to encode formylation sites [21]. The CKSAAP encoding scheme is also used in predicting antifreeze proteins [22] and protein phosphorylation sites [23]. Nanni, L. et al. [24] developed a technique based on wavelet images and Chou's pseudo amino acid composition for protein classification.
The technique used in this study is taken from recent work [25, 26], and the resulting predictor, called “iHyd-LysSite (EPSV)”, identifies hydroxylysine sites in proteins.

Methods

Benchmark Dataset

The dataset of hydroxylated proteins was built from the UniProt database. The term “hydroxylysine” was searched in the “modified residue” field under the PTM/Processing annotation. To construct a stringent benchmark dataset, entries annotated with the terms probable, potential or by similarity were excluded. This query returned 281 protein sequences that contain hydroxylysine sites. For convenience, these records are treated as the positive samples and denoted $S^{+}$. The converse query was then run for sequences not containing hydroxylysine sites and returned 500 sequences; these form the negative dataset, denoted $S^{-}$. The overall dataset, the union of the positive and negative datasets, comprises 781 sequences:

$$S = S^{+} \cup S^{-} \qquad (1)$$

After removing duplicate sequences and those with homology greater than 60%, the positive dataset is reduced to 185 samples and the negative dataset to 497, giving 682 samples in total. Details of the positive and negative datasets can be found in the Supplementary Tables.

Construction of Algorithm

To formulate the protein sequences and classify them according to their attributes, we adopted the scheme employed by Ehsan et al. [25, 26]. The algorithm for peptide classification encodes each sample in 220 features that incorporate three attributes, namely the hydrophobicity, hydrophilicity and side-chain mass of the amino acids. The method focuses on the composition and order of each monomer and yields a fixed-length vector for every polypeptide sample [25, 26]. In this work, we use this methodology to identify hydroxylysine sites in an uncharacterized protein sequence. According to this scheme, a peptide sample P is formulated as

$$P = R_1 R_2 R_3 \cdots R_n \qquad (2)$$

where $R_1$, $R_2$, $R_3$, ..., $R_n$ represent the amino acid monomers linked by peptide bonds and n is the number of amino acid residues in the polypeptide sequence. Fig. (2) shows the proposed scheme for the hydroxylysine site; the feature vector corresponding to the lysine modification site in an uncharacterized protein sequence can be obtained by Eq. (3).

Fig. (2)

Adopted formulation scheme of the proposed method [16]. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Where represents a lysine residue and denotes the number of repetitions of in the sequence, while j varies over all other amino acid residues except . represents the pairing of and residues in every possible way by consolidating the difference of each position for residue. In addition, r, s, and t show the corresponding positions of “” in (2). Now let, Equation (3) can be written in compact form in terms of Eqs. (4) to (6) as: The term in (7) can be evaluated by the following constraints

The ordinal numbers in (3) represent the amino acids in alphabetical order of their one-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. Each can repeat itself more than once to form a peptide sample of length n. As represents the feature vector corresponding to the ith residue for “K”, feature vectors can be defined similarly for all residues, which gives a set of twenty feature vectors, one per amino acid:

The above set of vectors consists of sixty components with respect to the three choices of each pair , evaluated by using Eqs. (10) to (12) for the hydrophobicity property. Hydrophilicity and side-chain mass likewise give sixty components each, for 180 components in total. The remaining forty components comprise the number of occurrences of each of the twenty amino acid residues and the sum of the positions at which each occurs. Where are the normalized values of the naturally occurring hydrophobicity, hydrophilicity and side-chain mass, respectively. The values are taken from the same source used by Ehsan et al. [16]. The mean of the normalized values of all twenty amino acid (aa) residues with respect to the properties listed above is denoted by . The normalized values are obtained by using Eq. (13) within the normalization range N.
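The forty composition-related components described above (twenty occurrence counts and twenty position sums) can be illustrated with a short sketch. This is not the authors' implementation: the function name, the 1-based position convention and the ordering of the output are assumptions made here for illustration, and the 180 property-based components are not covered.

```python
# One-letter amino acid codes in alphabetical order, as listed in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_and_position_features(sequence):
    """Return 40 values: 20 occurrence counts followed by 20 position sums.

    Positions are counted from 1, which is an assumption of this sketch.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    position_sums = {aa: 0 for aa in AMINO_ACIDS}
    for position, residue in enumerate(sequence.upper(), start=1):
        if residue in counts:  # ignore non-standard symbols such as X
            counts[residue] += 1
            position_sums[residue] += position
    return ([counts[aa] for aa in AMINO_ACIDS]
            + [position_sums[aa] for aa in AMINO_ACIDS])


# Hypothetical toy peptide: K occurs at positions 2, 5 and 7.
features = composition_and_position_features("MKVLKAK")
```

Because the output length is fixed at forty regardless of the peptide length, these components fit the fixed-length requirement stated for the overall feature vector.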
The classifiers trained on the extracted feature vectors are neural networks (NN), random forest (RF) and support vector machine (SVM). An NN works as a system of neurons in which the output of each neuron is used as input to the next. Neural networks play a key role in decision-making problems. A Multi-Layer Perceptron (MLP) is a good tool for capturing obscure structure and vague information in a large collection of data. An MLP suits classification problems well because it can be finely adjusted by changing the number of hidden-layer neurons, the training parameters and the training algorithm. To train the extracted feature set, an MLP is used (Fig. 3). The basic strength of neural networks is their flexibility: numerous parameters can be fine-tuned to provide the best results. After extensive probing and testing, a neural network was set up with 50 neurons in the hidden layer. An adaptive gradient descent algorithm, which uses a variable learning rate for optimal convergence, was used for training. The feature vectors of all samples are assembled into a large array in which each row is the feature vector of a single sequence and each column is one extracted feature component. Since there are 220 features per sample, each row has 220 columns. The array has 682 rows in total, of which 185 are positive samples. The weights of each layer were randomly adjusted, and 75 neurons were utilized. The back-propagation algorithm was employed to adjust the weights in each epoch, and the outcomes were obtained after 2693 iterations of gradient descent.
The results were simulated in MATLAB R2017 and reproduced on Python 3.6 with Scikit-Learn 0.20 for neural network training and simulation, with identical results. Random forest is an ensemble learning technique for classification. It builds many decision trees on the data samples, collects the prediction of each tree and decides the final class by voting. The support vector machine is mostly used for classification problems. The feature data are plotted in n-dimensional space, and the two classes are separated by finding a hyperplane between them.
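The voting step of a random forest can be sketched in a few lines. The function below is a hypothetical helper written for illustration only; it assumes the per-tree predictions are already available and does not build any trees.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one candidate lysine site
# (1 = hydroxylysine site, 0 = non-hydroxylysine site).
final_label = majority_vote([1, 0, 1, 1, 0])
```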

Results and Discussion

Metrics Evaluation

To evaluate the predictive quality of the proposed predictor, we adopted one of the most important and simplest methods, which was also used by Chou [27]. The following set of four metrics based on this formulation has been employed in a number of publications [25-27]. In the current study, four measures, sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC), were employed to assess the performance of the “iHyd-LysSite (EPSV)” predictor, as expressed in Refs. [8, 17]:

$$\mathrm{Sn} = 1 - \frac{N^{+}_{-}}{N^{+}}, \quad \mathrm{Sp} = 1 - \frac{N^{-}_{+}}{N^{-}}, \quad \mathrm{Acc} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \quad \mathrm{MCC} = \frac{1 - \left(\frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}} \qquad (14)$$

where $N^{+}$ is the total number of true hydroxylysine sites investigated, $N^{+}_{-}$ is the number of true hydroxylysine sites wrongly predicted as non-hydroxylysine sites, $N^{-}$ is the total number of non-hydroxylysine sites investigated, and $N^{-}_{+}$ is the number of non-hydroxylysine sites wrongly predicted as hydroxylysine sites. When $N^{+}_{-} = 0$, no true hydroxylysine site is wrongly predicted as a non-hydroxylysine site, which gives $\mathrm{Sn} = 1$. When $N^{+}_{-} = N^{+}$, all true hydroxylysine sites are wrongly predicted as non-hydroxylysine sites, and the sensitivity is $\mathrm{Sn} = 0$. Similarly, $N^{-}_{+} = 0$ means that no non-hydroxylysine site is wrongly predicted as a hydroxylysine site, which gives $\mathrm{Sp} = 1$, while $N^{-}_{+} = N^{-}$ means that all non-hydroxylysine sites are predicted as hydroxylysine sites, and the specificity is $\mathrm{Sp} = 0$. When $N^{+}_{-} = N^{-}_{+} = 0$, no site in either the positive or the negative dataset is wrongly predicted, and we have $\mathrm{Acc} = 1$ and $\mathrm{MCC} = 1$.
When $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, all true hydroxylysine sites in the positive dataset are wrongly predicted as non-hydroxylysine sites and all non-hydroxylysine sites in the negative dataset are wrongly predicted as hydroxylysine sites, which gives $\mathrm{Acc} = 0$ and $\mathrm{MCC} = -1$; for $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$ we obtain $\mathrm{Acc} = 0.5$ and $\mathrm{MCC} = 0$, which is no better than a random estimate. Moreover, the metrics in Eq. (14) are applicable only to single-label systems; multi-label systems, which are useful in systems biology, systems medicine and biomedicine [28], require a more complicated set of metrics, as given in Ref. [29].
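The four metrics and their limiting cases can be checked numerically with a minimal sketch. It assumes that Eq. (14) takes the form commonly used in the cited works [8, 17]; the function and argument names are hypothetical, and returning 0.0 for a zero denominator is a convention chosen here, not something stated in the paper.

```python
import math


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_false):
    """Return (Sn, Sp, Acc, MCC) from the four counts.

    n_pos        -- true hydroxylysine sites investigated
    n_pos_missed -- true sites wrongly predicted as non-hydroxylysine
    n_neg        -- non-hydroxylysine sites investigated
    n_neg_false  -- non-sites wrongly predicted as hydroxylysine
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_false / n_neg
    acc = 1 - (n_pos_missed + n_neg_false) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_false / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_false - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_false) / n_neg)
    )
    # Assumption of this sketch: report MCC = 0 when the denominator vanishes.
    mcc = numerator / denominator if denominator else 0.0
    return sn, sp, acc, mcc


# Limiting cases on a dataset of 185 positive and 497 negative samples.
perfect = chou_metrics(185, 0, 497, 0)
all_wrong = chou_metrics(185, 185, 497, 497)
```

With no errors the function returns the best value of every metric, and with every sample misclassified it returns the worst, matching the limiting cases described in the text.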

Test Method

To score the metrics given in Eq. (14) and evaluate the performance of the predictor, four validation methods are frequently used: the self-consistency test, the independent dataset test, the 10-fold cross-validation test and the jackknife test. In the jackknife test, each sample is removed from the dataset in turn and used for testing, while the remaining samples are used to train the predictive model. The jackknife test has been used to evaluate several predictors in a series of publications [30-32]. In 10-fold cross-validation, the positive and negative classes are each split so that the dataset is partitioned into 10 disjoint subsets, and the outcome is the mean over all partitions; each partition serves in turn as an independent test set. These tests were scored with three classifiers: Neural Network (NN), Random Forest (RF) and Support Vector Machine (SVM). The results obtained with Eq. (14) for all four metrics are given in Table 1. Table 1 shows that the accuracy and recall obtained with the proposed predictor “iHyd-LysSite (EPSV)” are high throughout all classifiers, indicating that hydroxylated sites are correctly identified. The accuracy for each fold of the 10-fold cross-validation test is shown in Fig. (4). Precision, or positive predictive value (PPV), is the proportion of true positive predictions among all positive predictions. It is analogous to a screening test: it gives the probability that a protein sequence with a positive result indeed has the hydroxylated site. The precision values for all classifiers are given in the Precision Table.
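The jackknife and 10-fold procedures described above can be sketched as index generators. This is an illustrative outline rather than the authors' code: all function names are hypothetical, the round-robin assignment of samples to folds is an assumption, and `fit_predict` stands for any classifier wrapped as a callable that takes training features, training labels and test features and returns predicted labels.

```python
def jackknife_splits(n_samples):
    """Yield (train_indices, test_indices), holding out one sample at a time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


def stratified_k_fold_splits(labels, k=10):
    """Yield (train_indices, test_indices) for k disjoint folds.

    Each class is dealt out round-robin, so every fold keeps roughly the
    same positive-to-negative ratio as the whole dataset.
    """
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for rank, index in enumerate(members):
            folds[rank % k].append(index)
    for test in folds:
        if not test:  # can only happen when there are fewer samples than folds
            continue
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


def mean_accuracy(features, labels, fit_predict, splits):
    """Average the per-split accuracy of fit_predict over all splits."""
    scores = []
    for train, test in splits:
        predicted = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        hits = sum(p == labels[i] for p, i in zip(predicted, test))
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

For a dataset of 682 samples the jackknife generator produces 682 splits, each training on 681 samples, which is why it is far more expensive than the 10 splits of cross-validation.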
The Receiver Operating Characteristic (ROC) curve is a two-dimensional plot used to describe the performance of a predictive model; it is summarized by the area under the ROC curve (AUC). The AUC ranges from 0 to 1: a value of 0.0 corresponds to 100% wrong predictions and a value of 1.0 to 100% correct predictions. The larger the area under the curve, the more accurate the model. The ROC curves of all three classifiers for the 10-fold cross-validation test are shown in Fig. (5), and the comparison of all classifiers for the self-consistency test is presented in Fig. (6).
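The AUC can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation; the function name and the toy scores are hypothetical, and labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores: positives ranked above all negatives.
auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# The same scores with every label flipped: every ranking is wrong.
auc_inverted = roc_auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

The two toy cases reproduce the extremes described in the text: a perfect ranking and a completely inverted one.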

Fig. (4)

The graph shows the accuracy of each fold in the 10-fold cross-validation performed on the overall dataset; the results were generated with the neural network (NN) classifier.

Fig. (5)

The ROC curves obtained from the classifiers, NN, RF, SVM for the 10-fold cross-validation test. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (6)

The comparison of NN, RF, SVM ROC curves for self-consistency test. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Comparison with Previous Methods

In this study, the proposed model “iHyd-LysSite (EPSV)” is compared with earlier prediction methods using the rigorous jackknife test. The comparison covers all three classifiers: neural network (NN), random forest (RF) and support vector machine (SVM). The jackknife results for all classifiers obtained with the proposed model for the metrics in Eq. (14) are given in (Table ). The comparison is made with two existing predictors, “iHyd-PseAAC” [8] and “iHyd-PseCp” [17], whose metric scores were also obtained with the jackknife test. Table 2 shows that, with the NN and RF classifiers, the accuracy (Acc), stability (MCC) and sensitivity (Sn) of the proposed predictor exceed those of both existing predictors, while iHyd-PseCp retains the highest specificity (Sp, 99.08%). It can also be noticed that all classifiers achieve good scores on the metrics of Eq. (14). In the cross-validation test, the overall prediction accuracies for NN, RF and SVM are 96.77%, 97.31% and 84.38%, respectively. With the NN and RF classifiers, the jackknife accuracy of “iHyd-LysSite (EPSV)” is higher than that of the existing predictors. The proposed predictor is more reliable and robust in prediction for the following reasons. The first is the formulation of the sequence, which handles sequences of diverse length without skipping any sequence information and forms pairwise couplings of amino acids in every possible combination. The second is the fixed-length vector: every sample yields a feature vector of the same size, which separates the proteins according to their attributes, so that each sample can be rigorously classified and conveniently recognized. The third is the correlation expression, which scores the feature vector by incorporating each attribute group.
Each expression deals with a specific metric or statistical quantity. For convenience, the property values of every amino acid were standardized so that each lies within a suitable range. Moreover, in comparison with previous methods, the proposed predictor gives a better prediction rate.

User-Friendly Web-Server

User-friendly and publicly accessible web-servers represent the current trend in developing computational methods [33], as reflected by a series of recent publications (see, e.g., [15, 34-37]). They have significantly enhanced the impact of computational biology on medical science and are driving medicinal chemistry into an unprecedented revolution [38]. We shall do our best to provide a web-server for the predictor presented in this paper as soon as possible.

Conclusion

The post-translational modification (PTM) of proteins is of vital importance in cellular functions. A PTM results from the covalent addition of a functional group to a protein. Hydroxylation is a PTM that occurs mostly on three residues: proline, lysine and asparagine. Hydroxyproline and hydroxylysine are significant in the maturation of collagen fibers, while hydroxyasparagine is important for antifungal and anti-toxin drugs. Hydroxylysine is the hydroxylated form of the lysine residue and plays a central role in both biomedical research and drug development against cancer and many other diseases. A computational approach was adopted for identifying potential hydroxylysine sites in proteins. In the current work, we show that “iHyd-LysSite (EPSV)” compares well with earlier techniques in identifying hydroxylysine sites. The methodology is taken from a recently published article [25]. To validate the proposed model, the exhaustive jackknife test was performed with three classifiers: Neural Network (NN), Random Forest (RF) and Support Vector Machine (SVM). In the jackknife test, NN, RF and SVM gave sensitivities of 96.08%, 94.99% and 98.16% and specificities of 97.52%, 98.52% and 80.95%, respectively. We conclude that the predictor has room for further improvement, because the number of combinations of lysine residues in a continuous sequence increases rapidly.

Table 1

The values of all four metrics for three classifiers obtained by using the proposed predictor “iHyd-LysSite (EPSV)”.

Tests	NN Sn(%)	NN Sp(%)	NN Acc(%)	NN MCC	RF Sn(%)	RF Sp(%)	RF Acc(%)	RF MCC	SVM Sn(%)	SVM Sp(%)	SVM Acc(%)	SVM MCC
Self-consistency	95.15	99.39	98.00	0.95	100.00	99.60	99.74	0.99	86.01	92.93	90.40	0.88
Independent	93.00	100.00	97.40	0.95	95.10	97.12	96.32	0.93	81.32	92.36	88.09	0.87
Cross-validation	96.14	97.57	96.77	0.92	95.04	98.60	97.31	0.94	98.22	81.00	84.38	0.89
Jackknife	96.08	97.52	97.14	0.93	94.99	98.52	97.24	0.90	98.16	80.95	84.33	0.84

Precision Table

Tests	NN PPV	RF PPV	SVM PPV
Self-consistency	0.89	0.99	0.88
Independent	0.88	0.93	0.87
Cross-validation	0.92	0.97	0.78
Jackknife	0.88	0.92	0.74

Table 2

A comparison of the proposed model “iHyd-LysSite (EPSV)” with the previous methods using the jackknife test validated by NN, RF, and SVM classifiers.

Methods	Sn^a (%)	Sp^a (%)	Acc^a (%)	MCC^a
iHyd-PseAAC	87.85	83.01	83.56	0.50
iHyd-PseCp	78.77	99.08	97.08	0.86
iHyd-LysSite (EPSV)-NN^b	96.08	97.52	97.14	0.93
iHyd-LysSite (EPSV)-RF^b	94.99	98.52	97.24	0.90
iHyd-LysSite (EPSV)-SVM^b	98.16	80.95	84.33	0.84

^a Definition of metrics in Eq. (14). ^b Proposed predictor “iHyd-LysSite (EPSV)”.

33 in total

1. Prediction of protein signal sequences and their cleavage sites.

Authors: K C Chou
Journal: Proteins Date: 2001-01-01

2. Wavelet images and Chou's pseudo amino acid composition for protein classification.

Authors: Loris Nanni; Sheryl Brahnam; Alessandra Lumini
Journal: Amino Acids Date: 2011-10-13 Impact factor: 3.520

Review 3. Some remarks on predicting multi-label attributes in molecular biosystems.

Authors: Kuo-Chen Chou
Journal: Mol Biosyst Date: 2013-03-28

4. Conformational implications of enzymatic proline hydroxylation in collagen.

Authors: R K Chopra; V S Ananthanarayanan
Journal: Proc Natl Acad Sci U S A Date: 1982-12 Impact factor: 11.205

5. A protein interaction network analysis for yeast integral membrane protein.

Authors: Ming-Guang Shi; De-Shuang Huang; Xue-Ling Li
Journal: Protein Pept Lett Date: 2008 Impact factor: 1.890

Review 6. Ascorbate depletion: a critical step in nickel carcinogenesis?

Authors: Konstantin Salnikow; Kazimierz S Kasprzak
Journal: Environ Health Perspect Date: 2005-05 Impact factor: 9.031

7. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition.

Authors: Yan Xu; Xin Wen; Xiao-Jian Shao; Nai-Yang Deng; Kuo-Chen Chou
Journal: Int J Mol Sci Date: 2014-05-05 Impact factor: 5.923

8. PTM-ssMP: A Web Server for Predicting Different Types of Post-translational Modification Sites Using Novel Site-specific Modification Profile.

Authors: Yu Liu; Minghui Wang; Jianing Xi; Fenglin Luo; Ao Li
Journal: Int J Biol Sci Date: 2018-05-22 Impact factor: 6.580

9. ORION: a web server for protein fold recognition and structure prediction using evolutionary hybrid profiles.

Authors: Yassine Ghouzam; Guillaume Postic; Pierre-Edouard Guerin; Alexandre G de Brevern; Jean-Christophe Gelly
Journal: Sci Rep Date: 2016-06-20 Impact factor: 4.379

10. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou's 5-step rule.

Authors: Sharaf Jameel Malebary; Muhammad Safi Ur Rehman; Yaser Daanial Khan
Journal: PLoS One Date: 2019-11-21 Impact factor: 3.240
