
iHyd-PseAAC (EPSV): Identifying Hydroxylation Sites in Proteins by Extracting Enhanced Position and Sequence Variant Feature via Chou's 5-Step Rule and General Pseudo Amino Acid Composition.

Asma Ehsan¹, Muhammad K Mahmood¹, Yaser D Khan¹, Omar M Barukab¹, Sher A Khan¹, Kuo-Chen Chou¹.

Abstract

BACKGROUND: In various biological processes and cell functions, Post Translational Modifications (PTMs) bear critical significance. Hydroxylation of proline residue is one kind of PTM, which occurs following protein synthesis. The experimental determination of hydroxyproline sites in an uncharacterized protein sequence requires extensive, time-consuming and expensive tests.
METHODS: With the flood of protein sequences produced in the post-genomic age, effective computational strategies are needed to overcome this problem. Taking into account the composition and sequence-order effects within polypeptide chains, a novel in-silico predictor based on a mathematical model is proposed.
RESULTS: The predictor was stringently verified using self-consistency, cross-validation and jackknife tests on benchmark datasets. A rigorous jackknife test established that the scores of the new predictor are superior to those of previous methodologies.
CONCLUSION: Compared with the existing models, this new mathematical technique is more suitable and promising.

Keywords: Hydroxylation of proline; Hydroxyproline; Mammalian proteins; Post Translational Modifications (PTMs); PseAAC; Sequence-coupling model

Year: 2019 PMID: 31555063 PMCID: PMC6728902 DOI: 10.2174/1389202920666190325162307

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

Collagens are highly abundant mammalian proteins that are rich in hydroxyproline [1], which plays a key role in their stability. Collagen is long and fibrous in structure, and it makes up a quarter or more of the total protein content in mammals [2]. In medical applications, collagens are a major constituent of treatments for wound healing [3], burn surgery [4] and cosmetic surgery [5]. Irregularities in their behavior may contribute to stomach disease [6] and lung cancer [7]. The ability to predict hydroxyproline (HyP) sites arising from post-translational modifications in proteins provides valuable information for both biomedical research and drug development [8]. Hydroxyproline is a non-essential amino acid, meaning that it is mostly synthesized from other amino acids in the liver and need not be obtained directly from the diet. Proline is hydroxylated by the conversion of a CH group in the proline residue into a COH group, i.e. a hydroxyl group [8], as shown in the accompanying figure. Owing to its significance for an in-depth understanding of cellular biological processes and for discovering drugs against cancers and other major diseases, many efforts have been made by other scientists in this regard [9-17]. Although experimental techniques based on mass spectrometry can determine the hydroxylation sites of a given protein [18], they are laborious, tedious and expensive. As a multitude of proteomic sequences are deposited in databanks each day, it is highly desirable to devise an integrated and robust computational technique that incorporates the composition and sequence-order effects to determine potential hydroxylation sites with greater accuracy. Researchers have proposed a few methodologies for this purpose. However, the existing predictors miss some of the most pertinent features hidden within the primary sequences, which are crucial for reaching an accurate decision.
The hydroxylation process has been of great interest to many researchers. Colgrave et al. [1] quantified hydroxyproline using multiple-reaction-monitoring mass spectrometry. Mathematical modeling has been developed to understand the behavior of microbes and their communities [19]. It was shown that the hydroxyproline and hydroxylysine in collagen are synthesized by an unusual pathway in which proline and lysine are hydroxylated after they have been incorporated into a large polypeptide precursor of collagen. Berg et al. [20] described a system set up to examine the collagen deficiency in connective tissues that arises, to some extent, from a lack of ascorbate. Halme et al. [21] and Kivirikko et al. [22] described the isolation and partial characterization of highly purified protocollagen proline hydroxylase and the hydroxylation of proline in synthetic polypeptides with purified procollagen hydroxylase. Morgan et al. [23] investigated the distribution, frequency, positioning and common functional roles of proline and polyproline sequences in the human proteome. The hydroxylation of lysine and the crosslinking of collagens are discussed in "Posttranslational Modifications of Proteins" [24]. Shi et al. [25] presented a method named PredHydroxy to automate the prediction of proline and lysine hydroxylation sites based on the position weights of 8 high-quality amino acid indices and support vector machines. The metabolism of proline and hydroxyproline, and the activity of proline under changing environments, have also been studied [26, 27]. Z.R. Yang [28] employed a support vector machine to develop a tool for predicting hydroxyproline sites. Hu et al. [29] developed a sequence-based methodology for predicting hydroxyproline and hydroxylysine. Xu et al. [8] predicted hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Qiu et al. [30] improved on this approach by integrating a sequence-coupled effect into general PseAAC.

RESULTS

To develop a useful predictor for a biological phenomenon, one should follow Chou's 5-step rule [31], as many researchers have done in recently published papers [9, 32-38]. In the first step, a benchmark dataset is assembled for training and testing the predictor. In the second step, a mathematical model is formulated that extracts the most significant features of the polypeptide sequence. In the third step, the feature vector is fed into a prediction algorithm for training. In the fourth step, the trained model is thoroughly tested and validated. In the last step, a web-server is developed to make the prediction model publicly available. In this study, the first four steps were carried out meticulously; the last step has been left for future work.

ACCURACY METRICS

To measure the quality of the predictor, the following metrics are commonly used: Acc quantifies the overall accuracy of the predictor, MCC is a stable measure of the overall accuracy of the model, Sn estimates sensitivity, and Sp estimates specificity [39]. The proposed model was evaluated with this set of metrics, which was also employed by Ehsan et al. [40]. The formulation for the prediction of hydroxylated and non-hydroxylated proline sites is given in equations (1) and (2), where $N^{+}$ and $N^{+}_{-}$ represent the total number of hydroxylated proline peptides and the number of hydroxylated peptides incorrectly predicted as non-hydroxylated proline sites, respectively. Likewise, $N^{-}$ and $N^{-}_{+}$ represent the total number of non-hydroxylated peptides and the number of non-hydroxylated peptides incorrectly predicted as hydroxylated, respectively, as used in equation (3). When there are no incorrectly predicted hydroxylated or non-hydroxylated proline peptides, i.e. $N^{+}_{-} = N^{-}_{+} = 0$, equations (1) to (3) reach their maximum value of 1, signifying the highest possible accuracy rate. Conversely, when either count is non-zero, the prediction rate is less than 1. A number of statistical measures are used to assess the performance of the predictor; these are given in equation (4), where TP, TN, FP and FN represent the true positive, true negative, false positive and false negative values, respectively. Equations (5) and (6) express these quantities in terms of the symbols of equations (1) to (3). It is also advantageous to use the intuitive metrics of equations (5)-(6) in place of the traditional equation (4). Both the traditional metrics taken from mathematics textbooks and the intuitive metrics derived from Chou's symbols [41-43] are valid only for single-label systems, in which each sample belongs to exactly one class.
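For reference, the four metrics can be written out explicitly. The block below is a sketch that assumes the conventional confusion-matrix definitions and the usual mapping onto Chou's symbols ($N^{+}$: total hydroxylated peptides; $N^{+}_{-}$: hydroxylated peptides predicted as non-hydroxylated; $N^{-}$: total non-hydroxylated peptides; $N^{-}_{+}$: non-hydroxylated peptides predicted as hydroxylated). It is not copied from the original typeset equations.

```latex
% Conventional confusion-matrix form (assumed standard definitions)
\begin{aligned}
\mathrm{Sn}  &= \frac{TP}{TP+FN}, \qquad
\mathrm{Sp}   = \frac{TN}{TN+FP}, \qquad
\mathrm{Acc}  = \frac{TP+TN}{TP+TN+FP+FN},\\[4pt]
\mathrm{MCC} &= \frac{TP\cdot TN - FP\cdot FN}
                     {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{aligned}

% Assumed mapping onto Chou's symbols
TP = N^{+} - N^{+}_{-}, \quad FN = N^{+}_{-}, \quad
TN = N^{-} - N^{-}_{+}, \quad FP = N^{-}_{+}

% The same metrics after substituting the mapping
\begin{aligned}
\mathrm{Sn}  &= 1 - \frac{N^{+}_{-}}{N^{+}}, \qquad
\mathrm{Sp}   = 1 - \frac{N^{-}_{+}}{N^{-}}, \qquad
\mathrm{Acc}  = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}},\\[4pt]
\mathrm{MCC} &= \frac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}
                     {\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)
                            \left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{aligned}
```

Substituting the mapping into the first set reproduces the second set term by term. For MCC, the numerator $TP\cdot TN - FP\cdot FN$ becomes $N^{+}N^{-} - N^{+}N^{-}_{+} - N^{-}N^{+}_{-}$, and dividing numerator and denominator by $N^{+}N^{-}$ gives the form shown.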
For multi-label systems, in which a sample may belong to several classes simultaneously and which have become increasingly common in systems biology [32, 33, 36, 44], systems medicine [45] and biomedicine [46], a completely different set of metrics, as defined in reference [47], is needed. The following cases of equation (6) are worth discussing. Let $N^{+}$ and $N^{-}$ denote the total numbers of hydroxylated and non-hydroxylated proline peptides, $N^{+}_{-}$ the number of hydroxylated peptides incorrectly predicted as non-hydroxylated, and $N^{-}_{+}$ the number of non-hydroxylated peptides incorrectly predicted as hydroxylated. If $N^{+}_{-} = 0$, no hydroxylated proline peptide was incorrectly predicted as non-hydroxylated, so that $\mathrm{Sn} = 1$. Similarly, $N^{+}_{-} = N^{+}$ indicates that all hydroxylated proline peptides were incorrectly predicted as non-hydroxylated, giving $\mathrm{Sn} = 0$. Furthermore, $N^{-}_{+} = 0$ yields $\mathrm{Sp} = 1$, meaning that not a single non-hydroxylated proline peptide was incorrectly predicted as hydroxylated. Likewise, $N^{-}_{+} = N^{-}$ yields $\mathrm{Sp} = 0$, meaning that all non-hydroxylated proline peptides were incorrectly predicted as hydroxylated. Also, $N^{+}_{-} = N^{-}_{+} = 0$ implies that all hydroxylated and non-hydroxylated proline peptides were predicted correctly, so that $\mathrm{Acc} = 1$. The performance of binary classification is often measured by the Matthews correlation coefficient (MCC), for which there are three cases. First, $N^{+}_{-} = N^{-}_{+} = 0$ indicates that no sequence was predicted incorrectly among either the hydroxylated or the non-hydroxylated peptides, yielding $\mathrm{MCC} = 1$. Second, $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$ give $\mathrm{MCC} = 0$, indicating that the prediction is no better than random. Lastly, $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$ give $\mathrm{MCC} = -1$, signifying a totally wrong binary classification and complete disagreement between the observed and predicted values.
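The limiting cases above can be checked numerically. The following is a minimal sketch, assuming the conventional form of the four metrics in Chou's notation; the function name `chou_metrics` and its argument names are illustrative and do not come from the original work.

```python
from math import sqrt


def chou_metrics(n_pos, n_neg, n_pos_missed, n_neg_missed):
    """Return (Sn, Sp, Acc, MCC) from counts in Chou's notation.

    n_pos        -- N+   : total hydroxylated peptides
    n_neg        -- N-   : total non-hydroxylated peptides
    n_pos_missed -- N+_- : hydroxylated peptides predicted as non-hydroxylated
    n_neg_missed -- N-_+ : non-hydroxylated peptides predicted as hydroxylated
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)

    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    # MCC is undefined when every sample is assigned to a single class;
    # returning 0.0 in that case is a convention chosen for this sketch.
    mcc = numerator / denominator if denominator else 0.0
    return sn, sp, acc, mcc


# The three MCC cases discussed in the text, with arbitrary class sizes.
perfect = chou_metrics(100, 200, 0, 0)
random_like = chou_metrics(100, 200, 50, 100)
all_wrong = chou_metrics(100, 200, 100, 200)
```

With these inputs, `perfect` corresponds to the case of no misclassified peptides, `random_like` to half of each class being misclassified, and `all_wrong` to every peptide being misclassified.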

VALIDATION METHOD

The metrics given in equation (6) are used with three frequently used test methods: the independent dataset test, the K-fold cross-validation test, and the jackknife test. These tests are used to validate the quality of a predictor. The jackknife test is considered the least arbitrary because it always yields a unique result for a given benchmark dataset, as explained in an earlier study [31]. In contrast, the K-fold cross-validation test relies on sub-sampling; since many different partitions of the data are possible, it cannot avoid ambiguity [8]. In this study, all three validation tests were employed to evaluate the quality of the proposed methodology, and the jackknife test was used to compare the new predictor with previous methodologies [8, 30].
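The difference between the two resampling schemes can be illustrated with a short sketch. It uses only the Python standard library; the function names `k_fold_splits` and `jackknife_splits` and the `seed` parameter are illustrative assumptions, not part of the original implementation.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for K-fold cross-validation.

    The partition depends on how the samples are shuffled (here, on `seed`),
    which is why different runs can report different scores.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        yield train, test


def jackknife_splits(n_samples):
    """Yield (train_indices, test_indices) for the jackknife test.

    Each sample is held out exactly once, so the set of splits is fully
    determined by the dataset and involves no random partitioning.
    """
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


ten_fold = list(k_fold_splits(n_samples=20, k=10))
leave_one_out = list(jackknife_splits(n_samples=20))
```

The jackknife test requires one training run per sample, which is the price paid for the unique result; K-fold cross-validation needs only K runs.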

COMPARISON WITH PREVIOUS METHODS

Table 1 lists the scores of the four metrics attained by the proposed predictor using the independent dataset test, the 10-fold cross-validation test, and the jackknife test on the dbPTM benchmark dataset, while Table 2 lists the corresponding scores on the most recent dataset obtained from UniProt. Table 3 compares the proposed predictor with two existing predictors for identifying hydroxyproline sites, namely "iHyd-PseAAC" [8] and "iHyd-PseCp" [30], whose metric scores were also obtained using the jackknife test. Table 3 shows that, on the dbPTM benchmark, the accuracy (Acc), stability (MCC) and sensitivity (Sn) scores of the newly proposed predictor are higher than those reported for the existing predictors, whereas iHyd-PseCp reports the highest specificity (Sp). The comparison with previous methods was made using two benchmark datasets extracted from (a) dbPTM and (b) the UniProt database. Graphical representations provide valuable insight into complex biological systems, as shown in a series of earlier articles [48-50]. Accordingly, the comparison is also presented graphically as Receiver Operating Characteristic (ROC) curves [51] for the proposed predictor and the existing predictors. In the ROC figure, the red curve represents iHyd-PseAAC and the green curve iHyd-PseCp, while the solid and dotted blue curves represent the proposed predictor on the dbPTM and UniProt benchmark datasets. The area under the solid and dotted blue curves is considerably larger than that under the red and green curves, indicating that the proposed predictor improves on the existing predictors. The superior performance of the proposed system can be rationalized on several scientific and theoretical grounds, some of which are discussed here.
Firstly, the proposed model is formulated on the composition and sequence order of the primary structure; it handles sequences of diverse lengths without discarding any hidden information and forms pairwise couplings over every possible permutation of amino acid residues. Secondly, it generates a fixed-length feature vector of constant size that separates proteins according to their attributes. This enables the predictor to classify and recognize each sample rigorously. Thirdly, the correlation expression is the main mechanism in the computation of the feature vector. It is configured by incorporating each attribute group, and each expression deals with specific metric and statistical expressions. For convenience, every property of the amino acids was standardized numerically within a suitable range. Finally, the outcomes of the predictor are better than the prediction rates of the previously proposed methods.

WEB-SERVER

User-friendly and publicly accessible web-servers represent the current trend in developing computational methods [52], as reflected by a series of recent publications [32, 33, 35, 36, 44]. They have significantly enhanced the impact of computational biology on medical science [53], driving medicinal chemistry into an unprecedented revolution [54]. We shall therefore make every effort to provide a web-server for the predictor presented in this paper as soon as possible.

DISCUSSION

The proposed model is a new predictor for identifying the hydroxylation of proline. Table 3 shows that the jackknife accuracy of the proposed model is 96.80% on the dbPTM benchmark and 96.01% on the UniProt benchmark, compared with 80.57% for iHyd-PseAAC and 96.58% for iHyd-PseCp. The corresponding MCC values are 0.90 and 0.88, compared with 0.51 for iHyd-PseAAC and 0.89 for iHyd-PseCp. On the dbPTM benchmark the proposed model therefore exceeds both previous predictors on these two metrics, while on the UniProt benchmark it exceeds iHyd-PseAAC and is close to iHyd-PseCp. The proposed model was validated using benchmark datasets extracted from dbPTM as well as from the UniProt database.

METHODS

Benchmark Dataset

According to Chou's 5-step rule [31], the extraction of the benchmark dataset is a crucial step that should yield a robust, diverse and up-to-date dataset. In this study, a stringent benchmark dataset was drawn from two sources. One dataset was obtained from http://www.uniprot.org/, and the other was taken from the post-translational modification database dbPTM 3.0 [55], which was also used by Xu et al. [8]. The following two steps were used to select a stringent benchmark dataset. Step-1: The data extracted from the UniProt database consist of positive and negative samples that represent the hydroxylated and non-hydroxylated polypeptide sequences at proline sites. A query was constructed to select protein sequences annotated with hydroxyproline in the PTM/processing field. Only entries annotated with an experimental assertion in the Feature Table (FT) were selected. Step-2: After rigorously applying the above step, a high-quality benchmark dataset of hydroxyproline was collected. In total, 816 positive and 24,980 negative samples were extracted. After removing duplicates, these were reduced to 782 and 24,971 unique samples, respectively. For convenience, the positive and negative sets of polypeptides are treated separately, and the benchmark dataset is the union of the two. Negative peptides are far more numerous than positive peptides in nature, so the negative set is much larger than the positive set. Similarly, to extract another stringent benchmark dataset, dbPTM 3.0 [55] was employed. The hydroxylation data (positive and negative) were readily available in FASTA format and were downloaded directly; 226 positive and 3,865 negative samples were found. A flowchart of the above steps is given in the accompanying figure. The primary structures of the hydroxylated and non-hydroxylated proline sites can be found in Supplementary Tables S1, S2, S3 and S4.
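The duplicate-removal step and the resulting class imbalance can be sketched in a few lines. This is only an illustration under simple assumptions: sequences are compared as exact strings, the function names `deduplicate` and `imbalance_ratio` are hypothetical, and the toy peptides below are invented rather than taken from the benchmark.

```python
def deduplicate(sequences):
    """Drop exact duplicate sequences while preserving first-seen order."""
    return list(dict.fromkeys(sequences))


def imbalance_ratio(n_positive, n_negative):
    """Number of negative samples per positive sample."""
    return n_negative / n_positive


# Toy peptides for illustration only (not from UniProt or dbPTM).
toy_positive = ["AGPPGA", "GEPGPK", "AGPPGA"]
toy_negative = ["MKPLLA", "TTPSVE", "MKPLLA", "QRPWYD"]

unique_positive = deduplicate(toy_positive)
unique_negative = deduplicate(toy_negative)

# Ratio for the UniProt benchmark sizes reported after duplicate removal.
uniprot_ratio = imbalance_ratio(782, 24971)
```

For the reported UniProt counts the ratio is roughly 32 negatives per positive, which quantifies how much larger the negative set is.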

SAMPLE FORMULATION AND ALGORITHM DEVELOPMENT

According to the second and third steps of Chou's 5-step rule [31], a mathematical formulation is proposed that reflects the essential correlations within a sample so that it can be arranged effectively; this formulation was also used by Ehsan et al. [40]. Consider a protein sample P consisting of L amino acid residues, $P = R_1 R_2 R_3 \cdots R_L$ (7), where $R_1$ is the first amino acid residue, $R_2$ is the second amino acid residue, and so on up to the last residue $R_L$ of the protein sequence P, and L is the length of the sequence (7). To identify the post-translational modification at the proline site, a computational methodology has been pursued. This method preserves the sequence order effect and uses the whole sequence data together with the occurrence of each amino acid residue of type (any one of the residues among twenty amino acid residues). Expressions (8) to (11) describe the whole formulation strategy. The number of occurrences of residue and the possible number of correlated factors of with itself, such that is linked to expression (8). While, mean factors , and are connected with the deviation factors of at their respective positions and are represented by the expression (9), followed by condition (10). Whereas, runs over deviation factors and these factors are linked by a local mean. This deviation is denoted by , provided the positions of are labeled by p and q in , the polypeptide chain. While the subscript denotes the frequency of occurrence of deviation factors for similar amino acid residues discarding the occurrence at the first and last position residue , based on n total occurrences of ; similarly, is labeled for the difference, , and r represent the exact position of the residue appearing at of its occurrence in (7) while and denote the amino acid residues in the corresponding positions. (8) (9) (10) Combining expressions (8) and (9) and using constraint (10) yields the template for manipulating the feature component related to , given in (11).
(11) While and denote the number of occurrences of before and after, with the remaining residues, respectively. These are given in eq (12) and (13). (12) Where (13) Where shows the occurrence of binary function related to residue with any of the remaining nineteen residues and stands for none of its occurrence with others. , is defined as the pair function for all combinations of all residues, whereas the pair function in terms of is defined as , elaborated in a matrix (14). Equation (15) assigns all the possible pair factors concerning and together with (16). If a pair is found, then is labeled as 1; otherwise it is assigned the value 0. Additionally, (15) admits to (14) with entries , and specifying lower triangular matrix for . Accordingly, the diagonal entries signify the combination among analogous residues and upper triangular matrix for . (14) (15) Where (16) The manipulation of feature components in a matrix form, incorporating all the amino acid residues given in (17), can be viewed as an extension of (11). (17) Expression (11) together with equation (13) yields the component of the feature vector, that is , elaborated in eq (18) and (19). (18) Or (19) The structural scheme of the proposed formulation can be understood by considering term of a sequence (7), say, , which mirrors the amino acid residue, say "A". It must be noted that makes a pair with its adjacent residues before and after the residue in terms of and , exemplified by blue and pink curvy lines, and pairs with itself, which is denoted by muddy green loops, as shown in the accompanying figure. The procedure must be followed till appears in place such that . Correspondingly, a similar procedure will be adopted for . The feature component agreeing to residue "A" is substituted in equation (20). (20) Where are the amino acid residues in ascending order. For simplicity, taking as the 20 amino acids in an alphabetical order for further generalization and onwards the 20 residues that periodically replicate themselves.
Their associated feature components are given in equation (21). (21) Three main characteristics of amino acids, namely hydrophobicity, hydrophilicity and side-chain mass, enter the above set of twenty feature components. Each characteristic contributes 60 coordinates, giving 180 coordinates in total, as defined by equations (22) to (24). (22) (23) (24) These equations use the normalized hydrophobicity, hydrophilicity and side-chain mass, respectively, together with the mean of the normalized values over the 20 amino acids for each attribute. The values used in (22) to (24) are normalized by using (25), and standardized in a range (-T, T), where T is the count for amino acids to be standardized. Entries for hydrophobicity are taken from Tanford C. [56] and entries for hydrophilicity from Hopp T.P., Woods K.R. [57], while the values of side-chain mass can be found in most of the books given in the bibliography. (25) The feature set is organized into a vector with 220 components: the first sixty are constructed from the hydrophobic nature of the amino acids, the next sixty depict their hydrophilic nature, the subsequent sixty are related to side-chain mass, and the last forty reflect the position and composition of each amino acid residue. The feature vectors obtained for the training data are fed to a neural network for training. Once the training is completed, the trained network is able to categorize arbitrary input with appreciable precision. As training proceeds, the network adjusts its weights so as to minimize the error. The Multilayer Perceptron (MLP) is a model that can uncover and identify hidden patterns in diverse data sets.
An MLP is well suited to classification problems because it can be fine-tuned by changing the number of hidden-layer neurons, the training parameters and the training algorithm to provide the best outcome. A Multi-Layer Perceptron (MLP) was trained using the extracted feature set for this purpose (see the accompanying figure). The feature vectors for the samples were assembled into a large array. Each row of the array represents the feature vector for a single sequence, while each column represents one extracted feature. Since 220 features were extracted for each sample, each row had 220 columns, and there were 25,796 rows in total, of which 816 were positive samples. The weights of each layer were initialized randomly, and a hidden layer with 75 neurons was used. The back-propagation algorithm was used to adjust the weights after each epoch. Convergence was achieved after 2693 iterations using the gradient descent method for learning. The results were simulated in MATLAB R2017 and were reproduced on Python 3.6 with scikit-learn 0.20 for neural network training and simulation, yielding identical results. The algorithm developed by the above method is called iHyd-PseAAC (EPSV), where "i" is the first letter of "identify", "Hyd" stands for "hydroxylation" and "PseAAC" is the general term used for pseudo amino acid composition. The term "EPSV" stands for the "enhanced position and sequence variant" technique, which is used to construct the algorithm for polypeptide sequences.
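A network of the shape described above can be set up with scikit-learn as follows. This is a minimal sketch, not the authors' script: the feature array is random synthetic data standing in for the real 220-component vectors, the labels are a toy rule, and the learning rate and iteration cap are assumed values that the text does not report. Only the feature count (220), the hidden-layer size (75) and the use of gradient descent with back-propagation come from the description.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FEATURES = 220  # feature-vector length stated in the text
N_HIDDEN = 75     # hidden-layer size stated in the text

# Synthetic stand-in for the real array: one row per sample, one column
# per feature. The real benchmark has far more rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, N_FEATURES))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels: 1 = hydroxylated

clf = MLPClassifier(
    hidden_layer_sizes=(N_HIDDEN,),  # a single hidden layer of 75 neurons
    solver="sgd",                    # gradient descent with back-propagation
    learning_rate_init=0.01,         # assumed value, not given in the text
    max_iter=500,                    # assumed cap on training iterations
    random_state=0,                  # fixes the random weight initialization
)
clf.fit(X, y)
pred = clf.predict(X)
```

After fitting, `clf.coefs_[0]` holds the input-to-hidden weights (220 by 75) and `clf.coefs_[1]` the hidden-to-output weights, which is a quick way to confirm that the architecture matches the description.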

Table 1

Results of the three tests on the four metrics using the proposed model on the dbPTM benchmark.

Tests	Sn (%)	Sp (%)	Acc (%)	MCC
Independent dataset test	98.30	98.02	98.77	0.96
Cross-validation	98.73	94.87	96.85	0.93
Jackknife test	98.68	94.82	96.80	0.90

Table 2

Results of the three tests on the four metrics using the proposed model on the recent UniProt benchmark.

Tests	Sn (%)	Sp (%)	Acc (%)	MCC
Independent dataset test	98.38	99.54	98.80	0.95
Cross-validation	97.07	94.62	96.06	0.91
Jackknife test	97.02	94.57	96.01	0.88

Table 3

A comparison of the proposed model with previous methods for identifying the hydroxylation of proline, using the jackknife test on benchmark datasets extracted from (a) dbPTM and (b) UniProt.

Predictors	Sn (%)	Sp (%)	Acc (%)	MCC
iHyd-PseAAC	80.66	80.54	80.57	0.51
iHyd-PseCp	86.35	99.12	96.58	0.89
iHyd-PseAAC (EPSV)^a	98.68	94.82	96.80	0.90
iHyd-PseAAC (EPSV)^b	97.02	94.57	96.01	0.88
