
iSulfoTyr-PseAAC: Identify Tyrosine Sulfation Sites by Incorporating Statistical Moments via Chou's 5-steps Rule and Pseudo Components.

Omar Barukab, Yaser Daanial Khan, Sher Afzal Khan, Kuo-Chen Chou.

Abstract

BACKGROUND: The amino acid residues in a protein undergo post-translational modification (PTM) during protein synthesis, a process of chemical and physical change in an amino acid that in turn alters the behavioral properties of proteins. Tyrosine sulfation is a ubiquitous post-translational modification known to be associated with the regulation of various biological functions and pathological processes. Its identification is therefore necessary to understand its mechanism. Experimental determination through site-directed mutagenesis and high-throughput mass spectrometry is costly and time-consuming; thus, a reliable computational model is required for the identification of sulfotyrosine sites.
METHODOLOGY: In this paper, we present a computational model for the prediction of sulfotyrosine sites, named iSulfoTyr-PseAAC, in which feature vectors are constructed using statistical moments of protein amino acid sequences and various position/composition relative features. These features are incorporated into PseAAC. The model is validated by jackknife, cross-validation, self-consistency and independent testing.
RESULTS: Accuracy determined through validation was 93.93% for the jackknife test, 95.16% for cross-validation, 94.3% for self-consistency and 94.3% for independent testing.
CONCLUSION: The proposed model performs better than the existing predictors; however, its accuracy can be improved further in future as the number of known sulfotyrosine sites in proteins increases.

Entities: Chemical

Keywords: 5-step rule; PseAAC; Sulfation; pseudo components; statistical moments; sulfotyrosine

Year: 2019 PMID: 32030089 PMCID: PMC6983959 DOI: 10.2174/1389202920666190819091609

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

Proteins are diverse macromolecules in living organisms and play an important role in all biological development of organisms [1]. As enzymes, proteins accelerate chemical reactions within a cell; they also produce movement, transmit nerve impulses and support muscle growth. Proteins are composed of amino acid residues joined by peptide bonds to form a polypeptide chain. The amino acid residues in a protein undergo post-translational modification (PTM) during protein synthesis, a process of chemical and physical change in an amino acid that in turn alters the behavioural properties of proteins [2, 3]. Protein synthesis starts in the nucleus, where ribonucleic acid (RNA) copies the code for a specific protein from deoxyribonucleic acid (DNA); messenger ribonucleic acid (mRNA) then takes the copy to the protein-making factory, the ribosome, in the cytoplasm. The ribosome, together with transfer ribonucleic acid (tRNA), continues to add amino acids in the correct sequence until it reaches a stop codon on the mRNA, at which point the protein is ready to function. PTM can occur during or after protein synthesis to extend the range of the proteome, control cell activity and allow the same proteins to be used for various cell functions [4, 5]. Tyrosine sulfation is a ubiquitous post-translational modification known to be associated with the regulation of various biological functions, including protein-protein interactions, transportation modulation and proteolysis [6, 7]. In addition, tyrosine sulfation is linked with various pathological processes, including HIV infection, atherosclerosis and numerous lung diseases [4, 5, 8]. This highlights the need to identify the mechanism of tyrosine sulfation, which cannot be understood without identifying tyrosine sulfation sites [6, 9, 10]. Thus, the identification of tyrosine sulfation sites is of great importance.
Although the sites can be identified through various experimental techniques, including site-directed mutagenesis and high-throughput mass spectrometry, all these techniques are laborious, time-consuming and costly. Therefore, identifying sulfotyrosine sites through computational predictors is one of the most practical approaches, and various researchers have previously proposed different methods for this purpose. Computational predictors can also help process large-scale proteomic data. Computational predictors using neural networks and statistical moments for feature extraction have been developed and used previously. In the last few years, many studies have been reported in the field of bioinformatics and computational biology that help in identifying the function and characteristics of proteins [3, 10-25]. Besides these, various papers have been reported targeting the prediction of PTM [3, 11-16, 18-62]. Yu and coworkers [63] used a position-specific scoring matrix (PSSM) to predict sulfotyrosine sites in proteins. Later, a predictor named Sulfinator [64] was proposed by Monigatti and coworkers to identify sulfotyrosine sites using hidden Markov models and sequence alignment information. A further improvement in sulfotyrosine predictors came with the development of SulfoSite [65], which incorporated accessible surface area and a positional weighted matrix for the prediction of tyrosine sulfation sites. Niu and coworkers [66] proposed another predictor for sulfotyrosine sites based on sequence and amino acid level information. In 2012, PredSulSite [67] was proposed by Huang et al.; it incorporated various features such as secondary structure information, physicochemical characteristics and residue position information. Various models were trained, and SVM outperformed its counterparts.
Later, in 2014, another SVM-based method named SulfoTyrP [68] was proposed by Jia et al.; it is regarded as the most accurate method for the prediction of sulfotyrosine sites to date. Although these predictors have been proposed for sulfotyrosine sites, there are still limitations in the accuracy of prediction. Herein, we propose a computational model named iSulfoTyr-PseAAC for the prediction of sulfotyrosine sites in proteins. The dataset used in this model is experimentally verified and updated. The feature vectors are constructed using statistical moments of protein amino acid sequences and various position/composition relative features. These features are incorporated into PseAAC [69]. The whole process is carried out with the aid of Chou’s 5-step rule [70], which has been followed in recent studies [12, 27, 28, 52, 71-77]. As demonstrated by a series of recent publications [12, 14, 17, 18, 20, 23, 49, 57, 59, 73, 78-94] and summarized in two comprehensive review papers [69, 95], to develop a really useful predictor for a biological system, one needs to follow Chou’s 5-step rule and go through the following five steps: (1) select or construct a valid benchmark dataset to train and test the predictor; (2) represent the samples with an effective formulation that can truly reflect their intrinsic correlation with the target to be predicted; (3) introduce or develop a powerful algorithm to conduct the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated prediction accuracy; (5) establish a user-friendly web-server for the predictor that is accessible to the public.
Papers that present a new sequence-analyzing method or statistical predictor by observing the guidelines of Chou’s 5-step rule have the following notable merits: (1) crystal-clear logic development, (2) complete transparency in operation, (3) results that are easy for other investigators to repeat, (4) high potential for stimulating other sequence-analyzing methods, and (5) great convenience for the majority of experimental scientists.

MATERIAL AND METHODS

This section elaborates the first three phases of Chou’s 5-step rule. Fig. (1) shows that, in the first stage, raw data in the standard format are collected from the online protein database UniProt. In the second stage, the raw data undergo filtration, which removes duplicated data and extracts the sequences most suitable for sulfotyrosine prediction. After filtration, features are extracted from the selected sequences. In the last stage, the filtered data are used to train the neural network, which is then tested with a different dataset.

Fig. (1)

Detailed steps of the proposed methodology.

Dataset Collection

The data used for the prediction of sulfotyrosine sites were taken from the UniProt protein database. The UniProt database is verified and contains complete features of all proteins. The dataset was downloaded in XML format and processed to extract the sequences along with their accession numbers. For the proposed technique, data of two types, positive and negative, were gathered from UniProt. Preprocessing was performed on both sets of data to remove any duplication. The data contain only alphabetic sequences. The positive dataset contained all sequences with experimental evidence of sulfotyrosine sites, that is, those protein sequences annotated under the PTM/Processing field. Dataset quality was enhanced by removing proteins that were not reviewed. On both sides of each tyrosine (Y), 20 amino acid residues were selected. Following Chou’s scheme [70], a protein sample containing a tyrosine site can be expressed as:

$$P_{\xi}(\mathrm{Y}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \, \mathrm{Y} \, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi}$$

In this equation, the amino acid code Y is the targeted tyrosine residue, the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the centre, and $R_{+\xi}$ represents the $\xi$-th downstream amino acid residue from the centre. Such a tuple can be of two types:

$$P_{\xi}(\mathrm{Y}) \in \begin{cases} P_{\xi}^{+}(\mathrm{Y}), & \text{if its centre is a sulfotyrosine site} \\ P_{\xi}^{-}(\mathrm{Y}), & \text{otherwise} \end{cases}$$

where $\in$ is the set-theory symbol meaning “a member of”. Training and testing datasets are developed for a statistical prediction model: the model is trained using the training dataset and then tested using the testing dataset. However, as extensively illustrated in [32], there is no compelling reason to split a benchmark dataset into two subsets if jackknife and cross-validation tests are used to test the prediction model, because the result acquired in this way is a combination of many different independent dataset results. In this study, the ideal value of $\xi$ is 20, so each sample has $(2\xi + 1) = 41$ residues.
Considering all of the above, the benchmark dataset was reduced to

$$T = T^{+} \cup T^{-}$$

where $T^{+}$ holds 200 positive samples, $T^{-}$ holds 420 negative samples, and $\cup$ represents the “union of two sets”. In total, 200 + 420 = 620 samples are included in the benchmark dataset (Supplementary information).
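The window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tyrosine_windows` and the padding character `X` (used when a tyrosine lies within 20 residues of either terminus) are assumptions that do not appear in the paper.

```python
def tyrosine_windows(sequence, xi=20, pad="X"):
    """Return (1-based position, window) pairs for every tyrosine in `sequence`.

    Each window has 2*xi + 1 residues centred on the tyrosine. `pad` is a
    hypothetical filler for positions that fall outside the sequence.
    """
    padded = pad * xi + sequence + pad * xi
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "Y":
            # Residue i of `sequence` sits at index i + xi of `padded`,
            # so the window starts at index i and spans 2*xi + 1 characters.
            windows.append((i + 1, padded[i:i + 2 * xi + 1]))
    return windows
```

With the default `xi=20`, every returned window has 41 characters and the targeted tyrosine at index 20, matching the (2ξ + 1) = 41 sample length used in this study.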

Feature Vector Construction

With the explosive growth of biological sequences in the post-genomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information or key pattern characteristics. This is because all the existing machine-learning algorithms (such as the “Optimization” algorithm [96], “Covariance Discriminant” or “CD” algorithm [97, 98], “Nearest Neighbor” or “NN” algorithm [99], and “Support Vector Machine” or “SVM” algorithm [99, 100]) can only handle vectors, as elaborated in a comprehensive review [56]. However, a vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid completely losing the sequence-pattern information for proteins, the pseudo amino acid composition [97] or PseAAC [101] was proposed. Ever since the concept of Chou’s PseAAC was proposed, it has been widely used in nearly all the areas of computational proteomics [102-112], as well as in a long list of references cited in [113]. Because it has been widely and increasingly used, four powerful open access software packages, called ‘PseAAC’ [114], ‘PseAAC-Builder’ [115], ‘propy’ [116], and ‘PseAAC-General’ [117], were established: the former three are for generating various modes of Chou’s special PseAAC [118], while the fourth is for those of Chou’s general PseAAC, including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the “Functional Domain” mode (see Eqs. 9, 10 of [69]), “Gene Ontology” mode (see Eqs. 11, 12 of [69]), and “Sequential Evolution” or “PSSM” mode (see Eqs. 13, 14 of [69]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [119] was developed for generating various feature vectors for DNA/RNA sequences [120-122], which have proved very useful as well.
In particular, a very powerful web-server called ‘Pse-in-One’ [123] and its updated version ‘Pse-in-One2.0’ [124] have recently been established; they can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users’ studies. To help in feature vector construction, Chou’s computational model of sample formulation was implemented. A feature is a numerical and computable property of a protein, and a set of features is represented as an n-dimensional vector. A feature vector thus represents multiple properties relevant to the protein sequence. Construction of the feature vector is of primary importance for studying the properties of a protein. An array of the amino acids is used to develop a feature vector that increases the probability of correct site prediction in a protein. A protein’s behaviour is determined by the location of its amino acids, and a small change in location modifies the protein’s properties. Sequences represented by feature vectors are broadly used in predicting different structural characteristics [27, 28, 49, 53, 57, 85, 87, 125-127].

Site Vicinity Vector

Many factors make some sites in a protein sensitive to post-translational modification. Most factors are environmental, while the neighbouring residues in the peptide chain make a site more sensitive to modification [35]. Supposing a given residue to be a PTM site, its neighbouring residues are represented as the window of residues on either side of it. The substructure of the primary sequence that contains the possible site and its neighbours forms the site vicinity vector; the number of neighbours taken on each side is an integer chosen through testing and experiments. In the feature vector, the site vicinity vector forms a section in which every residue position is replaced by a numerical value. Only 20 amino acids are involved in protein synthesis and, for calculating the feature vector, every amino acid is given a unique integer value. As long as the values are allocated consistently, it does not make a difference which number is assigned to which amino acid.
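The numerical encoding of a site vicinity window can be sketched as below. The particular assignment (alphabetical one-letter codes mapped to 1..20, with 0 for any other character) is an assumption made for illustration; as noted above, any consistent assignment of distinct integers would serve.

```python
# Hypothetical assignment: the 20 standard residues in alphabetical
# one-letter order receive the integers 1..20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def site_vicinity_vector(window):
    """Replace every residue in `window` by its integer code (0 if unknown)."""
    return [CODE.get(residue, 0) for residue in window]
```

Applied to a 41-residue window, this yields a 41-element integer vector with the code for tyrosine (20 under this assignment) at the centre.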

Statistical Moments

A statistical moment is a numerical quantity that describes a characteristic of the distribution of data. Moments describe the shape of the data's histogram and provide enough information to construct a frequency distribution function. They help to quantify the symmetry of the data in a set through variance and skewness. Mathematicians and analysts have formulated different moments based on certain well-known polynomials. Raw, central and Hahn moments are used here. The raw moment of order k is the mean of all numbers in a set after each number has been raised to the k-th power. The first raw moment is thus the mean of the numbers, the second is the average of the squared numbers, and the third is the average of the cubed numbers. The variation and asymmetry of the composed dataset are calculated from these moments [126-129]. Central moments are used for the same purpose. A central moment is based on the average of the differences between the numbers and their mean: the second central moment is obtained by squaring the differences before averaging, while the third central moment is obtained by cubing the differences before averaging. Hahn moments are widely used for feature extraction and are based on Hahn polynomials [129]. They are used as an input to a neural network for feature selection as well as classification. Since only 20 amino acids are present, every amino acid is allocated a unique numerical value for calculating the moments. As long as the values are distinct and allocated consistently, any value can be allocated to any amino acid. The 1-dimensional sequence of amino acids is then changed into a 2-dimensional form.
Suppose S stands for a protein sequence with m residues in its primary sequence, in which the i-th element is the i-th amino acid residue. All amino acid components of S are held by a square matrix S′, so that the 2-dimensional matrix S′ corresponds to the 1-dimensional sequence S. S is converted to S′ by a mapping function V that populates S′ in row-major order. Moments up to order 3 are calculated from the 2D matrix S′. Raw moments are calculated as:

$$M_{ij} = \sum_{p} \sum_{q} p^{i} q^{j} \beta_{pq}$$

where $\beta_{pq}$ is the element of S′ in row p and column q, the sums run over all rows and columns of S′, and i + j denotes the order of the moment. The raw moments up to order three are $M_{00}$, $M_{10}$, $M_{01}$, $M_{11}$, $M_{20}$, $M_{02}$, $M_{21}$, $M_{12}$, $M_{30}$ and $M_{03}$. The centroid of the data is similar to a centre of gravity: it is the point about which the data are evenly distributed with respect to their average weight. It is calculated after the raw moments, as the point $(\bar{x}, \bar{y})$ where

$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}$$

Central moments are calculated about the centroid, which acts as the data’s centre of gravity:

$$\eta_{ij} = \sum_{p} \sum_{q} (p - \bar{x})^{i} (q - \bar{y})^{j} \beta_{pq}$$

Two-dimensional Hahn moments require two-dimensional input data, so to calculate Hahn moments the 1-dimensional sequence S is likewise converted to the square matrix S′. The Hahn polynomial of order n [129] is expressed using the Pochhammer symbol, $(a)_{k} = a(a+1)\cdots(a+k-1)$, which is simplified using the Gamma function as $(a)_{k} = \Gamma(a+k)/\Gamma(a)$. The raw values of Hahn moments are usually scaled using a weighting function and a square norm, and the orthogonal normalized Hahn moments for the two-dimensional discrete data are computed from the scaled polynomials. The central moments and the Hahn moments are computed up to order 3.
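A minimal sketch of the raw and central moment computation follows. It assumes the standard definitions of two-dimensional raw and central moments, a side length of ⌈√m⌉ for the square matrix, zero padding for unused cells, 1-based row and column indices, and an alphabetical integer coding of residues; none of these details is confirmed by the text above, and Hahn moments are not covered.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # assumed coding


def to_square_matrix(sequence):
    """Fill a square matrix row by row with residue codes (zero-padded)."""
    n = math.ceil(math.sqrt(len(sequence)))
    values = [CODE.get(residue, 0) for residue in sequence]
    values += [0] * (n * n - len(values))
    return [values[r * n:(r + 1) * n] for r in range(n)]


def raw_moment(matrix, i, j):
    """Raw moment of order i + j, using 1-based row and column indices."""
    return sum((p + 1) ** i * (q + 1) ** j * value
               for p, row in enumerate(matrix)
               for q, value in enumerate(row))


def central_moment(matrix, i, j):
    """Central moment of order i + j, taken about the centroid."""
    m00 = raw_moment(matrix, 0, 0)
    x_bar = raw_moment(matrix, 1, 0) / m00
    y_bar = raw_moment(matrix, 0, 1) / m00
    return sum((p + 1 - x_bar) ** i * (q + 1 - y_bar) ** j * value
               for p, row in enumerate(matrix)
               for q, value in enumerate(row))
```

Collecting `raw_moment` and `central_moment` for all orders with i + j ≤ 3 gives ten values of each kind per matrix, which is how a large matrix can be reduced to a short, fixed-length feature block.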

Position Relative Incidence Matrix

Sequence information is the basis of a mathematical model that predicts the role of proteins. The location of amino acids plays a key role in determining the physical properties of a protein, so it is important to quantify the placement of amino acids in the polypeptide chain. The position relative incidence matrix (PRIM) extracts the location information of amino acids in the polypeptide chain. PRIM is formed as a 20 × 20 matrix in which an element holds the total of the positions of residue b relative to the first occurrence of residue d. PRIM yields 400 coefficients, which is a large number. To reduce the number of coefficients, raw, central and Hahn moments are calculated.

Reverse Position Relative Incidence Matrix

The accuracy of a machine learning algorithm depends mostly on the quality of feature extraction from the data, and the algorithm adapts itself to capture obscure patterns in the data. The relative positioning of amino acids in the polypeptide chain is extracted by the PRIM matrix. The Reverse Position Relative Incidence Matrix (RPRIM) follows the same workflow on the reversed primary sequence. Adding RPRIM reveals more hidden patterns and reduces ambiguities among proteins in the polypeptide sequence. Like PRIM, RPRIM has 400 elements in a 20 × 20 matrix. The dimension of the RPRIM matrix is reduced by calculating raw, central and Hahn moments.
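The PRIM and RPRIM computations can be sketched as below. The exact element definition is not recoverable from the text, so this sketch assumes that the entry for a reference residue and a target residue accumulates the offsets of every occurrence of the target that lies after the first occurrence of the reference. The function names and the row/column orientation are likewise illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def prim(sequence):
    """20 x 20 matrix: row = reference residue, column = target residue.

    Assumed definition: entry [ref][target] is the sum of
    (position of target - first position of ref) over all occurrences
    of target located after the first occurrence of ref.
    """
    matrix = [[0] * 20 for _ in range(20)]
    first = {}
    for pos, residue in enumerate(sequence):
        if residue in INDEX and residue not in first:
            first[residue] = pos
    for pos, residue in enumerate(sequence):
        if residue not in INDEX:
            continue  # skip non-standard characters
        for ref, ref_pos in first.items():
            if pos > ref_pos:
                matrix[INDEX[ref]][INDEX[residue]] += pos - ref_pos
    return matrix


def rprim(sequence):
    """Same computation on the reversed primary sequence."""
    return prim(sequence[::-1])
```

Both functions return 400 coefficients; in the approach described here, these would then be condensed by computing moments over the matrix rather than being used directly.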

Frequency Matrix

The amino acid sequence determines the native shape of the protein, and the number of occurrences of each amino acid is captured by the frequency matrix. The frequency matrix has a vital role in protein alignment. Sequence-order information is retrieved by PRIM; the frequency matrix does not hold it. The frequency matrix is a vector of 20 elements in which the i-th element represents the frequency of the i-th native amino acid.
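As a minimal sketch, the frequency vector can be computed by counting residues. The alphabetical ordering of the 20 elements is an assumption; the paper does not state the ordering it uses.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed element order


def frequency_vector(sequence):
    """Count how often each of the 20 native amino acids occurs."""
    counts = Counter(sequence)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]
```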

Accumulative Absolute Position Incidence Vector

The number of occurrences of each amino acid residue in the polypeptide chain is represented by the frequency matrix, which also gives information relevant to protein composition. The frequency matrix lacks information about the position of amino acid residues in the polypeptide chain, and this deficit is addressed by the Accumulative Absolute Position Incidence Vector (AAPIV). AAPIV represents the positioning of amino acid residues in the polypeptide chain. It is a vector of 20 elements in which every element is a numerical value associated with the corresponding residue in the primary sequence. Suppose a specific residue occurs in the primary sequence at a set of ordered positions. The i-th element of AAPIV is then calculated as the sum of all positions at which the i-th amino acid occurs.

Reverse Accumulative Absolute Position Incidence Vector

As discussed previously, effective feature extraction helps in detecting ambiguous patterns. The Reverse Accumulative Absolute Position Incidence Vector (RAAPIV) serves the same purpose; it is computed like AAPIV but on the reversed sequence. RAAPIV also contains 20 elements. Suppose a specific residue occurs in the reversed sequence at a set of ordered positions. The value of the corresponding element of RAAPIV is then calculated as the sum of those positions.
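AAPIV and RAAPIV can be sketched as follows, under the assumption that each element is the sum of the 1-based positions at which the corresponding residue occurs. The function names and the alphabetical element order are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def aapiv(sequence):
    """Sum the 1-based positions of each residue type in `sequence`."""
    vector = [0] * 20
    for position, residue in enumerate(sequence, start=1):
        if residue in INDEX:
            vector[INDEX[residue]] += position
    return vector


def raapiv(sequence):
    """AAPIV of the reversed sequence."""
    return aapiv(sequence[::-1])
```

For the sequence `AAC`, for instance, `aapiv` accumulates 1 + 2 = 3 for A and 3 for C, whereas `raapiv` sees `CAA` and accumulates 2 + 3 = 5 for A and 1 for C, so the two vectors capture complementary positional information.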

Neural Network

The neural network is one of the most important tools for solving the problem discussed in this paper; it simulates information processing as shown in Fig. (2). The neural network explains the basic shape of each residue in a given protein. To train the network, negative and positive samples are prepared and used to calculate the feature vectors, which represent the 2-dimensional protein structures using raw, central and Hahn moments.

Fig. (2)

Architecture of the artificial neural network for the iSulfoTyr-PseAAC.

Gradient Descent and Adaptive Learning

Different algorithms with different characteristics and performance are available to train a neural network. Among them, the gradient descent algorithm performs the best. It is an iterative minimization method that finds the best set of weights to be used for making predictions. The main objective of the algorithm is to find weights that reduce the error of the model on the training dataset. Training starts from a randomly guessed set of weights; the gradient of the loss function is calculated with respect to all parameters, and the weights are moved in the direction in which the loss function decreases most steeply. The process is repeated, following the negative gradient, until a satisfactory minimum is found. A gradient is a multidimensional vector containing the slope of the loss function along every axis [126, 127]. The weight W is updated with the help of the learning rate R, the objective function F(W) and its gradient ∇F(W). The central goal of the algorithm is to find the ideal weight W by minimizing F(W). In this algorithm, the parameters are iteratively computed at every stage by the following equation:

$$W_{i+1} = W_{i} - R \, \nabla F(W_{i})$$

The execution of the algorithm depends on the learning rate R, which is mostly kept constant. It determines the time needed for function minimization: a small learning rate requires more time to reach an optimal point, whereas a high learning rate may cause the function never to reach the optimal point; thus, the learning rate should be chosen carefully. Mostly, training starts with a higher learning rate that slowly decreases as training proceeds. The learning rate may differ at each layer, which reduces the chance of vanishing gradients, in which the weights of the first layer stop changing. Consider $W_{i}$ and $W_{i+1}$ as sequentially calculated parameters. Using these weights, the output and the expected error are calculated.
Compared with the previous iteration, if the error is greater then the learning rate is decreased, and if the error is smaller then the learning rate is increased; the weights are discarded and a new weight $W_{i+1}$ is calculated. The weights calculated at successive iterations are represented as (W1, W2, W3, W4, …). The following equation is used to calculate the weight for the successive epoch:

$$W_{t+1} = W_{t} - R_{t} \, \nabla F(W_{t})$$

In this equation, $R_{t}$ is the learning rate used for the t-th epoch. The adaptive algorithm guarantees normalization of the learning rate while minimizing the function at each epoch. The following condition is fulfilled before choosing the learning rate.
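The adaptive scheme described above can be illustrated with a short sketch on a generic objective. The growth and shrink factors (`up`, `down`), the default rate, and the rule of discarding a step whenever the error increases are assumptions made for this example; the paper does not give the values it uses.

```python
def adaptive_gradient_descent(f, grad, w, rate=0.1, epochs=100,
                              up=1.05, down=0.5):
    """Gradient descent whose learning rate adapts to the error trend.

    f    : objective function F(W), taking a list of weights
    grad : function returning the gradient of F at W as a list
    up / down : hypothetical factors for raising / lowering the rate
    """
    error = f(w)
    for _ in range(epochs):
        gradient = grad(w)
        candidate = [wi - rate * gi for wi, gi in zip(w, gradient)]
        new_error = f(candidate)
        if new_error > error:
            # Error grew: discard the candidate weights, lower the rate.
            rate *= down
        else:
            # Error fell (or stayed equal): accept the step, raise the rate.
            w, error = candidate, new_error
            rate *= up
    return w, error
```

On a simple convex objective such as F(W) = (w₀ − 3)² + (w₁ + 1)², this loop starts with small steps, speeds up while the error keeps falling, and backs off as soon as a step overshoots.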

RESULT AND DISCUSSION

Accuracy Estimation

The objective evaluation of a newly developed predictor is a very important aspect, which helps to assess the success rate of that model [69]. However, for such objective evaluation, one needs to consider two important factors which are (i) selection of accuracy metrics and (ii) the testing method employed to validate the model. Herein, firstly we will formulate the metrics for objective evaluation, then we will employ various validation methods.

Formulation of Metrics

For objective evaluation, one needs to consider both the metrics and the method of evaluation. The most common practice for the objective evaluation of a predictor is to use four metrics: (1) Accuracy (Acc), which estimates the overall accuracy of the prediction model; (2) Sensitivity (Sn), which estimates its capability to predict positive samples; (3) Specificity (Sp), which estimates its capability to predict negative samples; and (4) Matthews Correlation Coefficient (MCC), which estimates the stability of the prediction model. Both the set of traditional metrics copied from math books and the intuitive metrics derived from Chou’s symbols [70, 130, 131] are valid only for single-label systems (where each sample belongs to only one class). These measures were initially introduced in [132], and a set of four intuitive equations was derived in [133, 134] for them, referred to here as Eq. (30). The equations are expressed in terms of four counts: the total number of non-sulfotyrosine sites correctly predicted as non-sulfotyrosine sites by iSulfoTyr-PseAAC; the total number of non-sulfotyrosine sites incorrectly predicted as sulfotyrosine sites; the total number of sulfotyrosine sites correctly predicted as sulfotyrosine sites; and the total number of sulfotyrosine sites incorrectly predicted as non-sulfotyrosine sites. Thus, Eq. (30) makes the meaning of specificity, sensitivity, overall accuracy and stability easier to understand and more intuitive, particularly for MCC [135-137]. This set of intuitive metrics has been used in a number of recent publications [14-16, 20-23, 30, 80, 89, 133, 138-158], but only for binary labelled data.
Multi-label prediction is a completely different problem, which has been more popular in computational biology [159-161] and biomedicine [162]. Thus, it requires a different kind of metrics [163]. For the multi-label systems (where a sample may simultaneously belong to several classes), whose existence has become more frequent in system biology [73, 164-170], system medicine [171, 172] and biomedicine [35], a completely different set of metrics as defined in [173] is absolutely needed.
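For the single-label (binary) case used in this study, the four metrics can be computed from the four counts as sketched below. The sketch uses the standard confusion-matrix definitions with the conventional names TP, TN, FP and FN, which are not the symbols of the original equations; returning 0 for MCC when its denominator vanishes is a convention chosen here.

```python
import math


def evaluation_metrics(tp, tn, fp, fn):
    """Sn, Sp, Acc and MCC from true/false positive/negative counts."""
    sn = tp / (tp + fn)                      # sensitivity
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

As a made-up example, 90 true positives, 80 true negatives, 20 false positives and 10 false negatives give Sn = 0.90, Sp = 0.80, Acc = 0.85 and MCC ≈ 0.70.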

Self-consistency Testing

To test the accuracy of the proposed prediction model, self-consistency testing was performed, in which the same dataset used to build the model was used for both training and testing. The reason for performing the self-consistency test is that the actual true positives of the benchmark dataset are already known. The results of self-consistency are shown in Table ; it can be observed that the proposed model has the 99.23% Acc, 99.10% Sp, 99.75% Sp, and 0.99 MCC.

Validation of Model

In general, prediction models are trained on experimentally proven datasets, but sometimes no experimentally proven dataset is available for testing the model. Even when an experimentally proven dataset is available, the data may not be suitable or sufficient for testing the prediction accuracy of the model. What kind of testing method should then be used to score the four metrics of Eq. (30) and check the reliability of the prediction model? Normally, a prediction model can be tested using the leave-one-out (jackknife) test, the k-fold (subsampling) test and the independent test [174].

Jackknife Testing

In jackknife testing, the model is trained each time on N - 1 instances, where N is the total number of instances in the benchmark dataset, and tested on the one remaining instance. Each sample is singled out in turn as the test instance, and the model is trained and tested accordingly. In jackknife validation, both the training and testing datasets are open, and every sample of the benchmark dataset is used for both training and testing; the test is exhaustive because every sample moves in and out of the training set, and it excludes the memory effect. It always yields a unique outcome for a given benchmark dataset, so the arbitrariness problem of the independent test and subsampling is completely avoided. Using jackknife, validation of the prediction model gives 97.07% accuracy as shown in Table . It has been widely used by investigators to validate prediction models [78, 145, 175-184].
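The jackknife procedure can be sketched as below. To keep the example self-contained, a 1-nearest-neighbour rule stands in for the neural network used in this study; that substitution, and all function names, are assumptions made for illustration only.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training sample."""
    best = min(range(len(train_x)),
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(train_x[k], x)))
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbour_predict):
    """Leave each sample out once, train on the rest, test on it."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Because every sample is left out exactly once and no random split is involved, repeated runs on the same benchmark dataset return the same accuracy.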

K-fold Cross-Validation

Cross-validation is one of the best available methods to validate that the proposed model predicts true sulfotyrosine sites. In cross-validation, the benchmark dataset is divided into k disjoint folds; here, k = 10. In each round of validation, a different randomly selected fold is held out for validation and the model is trained on the rest of the data, so that each part of the dataset is used for both training and testing. After the last round, the cumulative accuracy for k = 10 is calculated by adding the accuracy of each validation round and dividing by 10; it is 94.26% in this study, as shown in Table . This accuracy is higher than that of previously proposed methods for sulfotyrosine site prediction, as shown in Fig. (3).

Fig. (3)

10-fold cross validation of iSulfoTyr-PseAAC.
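The 10-fold procedure can be sketched as follows. As in any such illustration, the classifier is a placeholder: a 1-nearest-neighbour rule is used here instead of the neural network, and the shuffling seed and function names are assumptions. The sketch requires at least k samples so that no fold is empty.

```python
import random


def nearest_neighbour_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training sample."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(train_x[i], x)))
    return train_y[best]


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_accuracy(features, labels,
                              predict=nearest_neighbour_predict,
                              k=10, seed=0):
    """Mean of the per-fold accuracies over k rounds."""
    accuracies = []
    for fold in k_fold_indices(len(features), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        correct = sum(predict(train_x, train_y, features[i]) == labels[i]
                      for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k
```

The returned value is the sum of the per-round accuracies divided by k, which mirrors the way the cumulative accuracy is described above.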

Using graphic approaches to study biological and medical systems can provide an intuitive vision and useful insights for analyzing the complicated relations therein, as indicated by many previous studies on a series of important biological topics [185-198], particularly enzyme kinetics, protein folding rates [192, 199-201], and low-frequency internal motion [199-204] (Table ).

Comparative Analysis

In the comparative analysis, the results of iSulfoTyr-PseAAC for the metrics of Eq. (30) are compared with those of existing methods. For this purpose, an independent dataset of 80 positive and 80 negative samples was used. Several important features distinguish the proposed approach from previous methods. First, a standard and balanced dataset has been used, which is experimentally verified and discrete in nature. Second, the data are non-repetitive, precise and complete in scope. Moreover, the performance of the proposed model is evaluated with 10-fold cross-validation. The proposed model uses artificial neural networks, which carefully handle dependence. iSulfoTyr-PseAAC applies a novel approach and uses the compositional and positional features of the primary sequences of proteins to predict sulfotyrosine sites. It first uses PseAAC and cuts the sequence 20 residues downstream and upstream of the modified residue, and then calculates AAPIV, RAAPIV, PRIM, RPRIM and the statistical moments. Using these compositional and positional features of the primary sequences of proteins, iSulfoTyr-PseAAC outperforms its counterparts.

WEB SERVER

The final step of Chou’s 5-steps rule is the development of a user-friendly, publicly available web-server for the convenience of users and biologists, as explained in recent publications by various authors [33, 136, 143, 146, 165, 166, 169, 171]. As pointed out in [203] and demonstrated in a series of recent publications [18, 31, 33, 59, 71-74, 78, 81, 82, 87, 88, 93, 153, 158, 164-170, 205], user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and computational tools. Indeed, many practically useful web-servers have significantly increased the impact of bioinformatics on medical science [56], driving medicinal chemistry into an unprecedented revolution [206]. Accordingly, in our future work, we shall strive to establish a web-server for the new method presented in this paper.

CONCLUSION

In this study, using Chou's 5-step rule, we have developed an ANN-based model for the prediction of sulfotyrosine sites. Because of their strong biological importance, locating sulfotyrosine sites is a primary and essential task. The aim of the study was to develop an efficient and more accurate sulfotyrosine site predictor and to improve its usability. By implementing PseAAC, we used many positional and compositional features of the protein samples. After development, the prediction model was tested and validated with several exhaustive validation methods, i.e. self-consistency, cross-validation and jackknife. Self-consistency validation gives 99.23% accuracy, cross-validation gives 94.26% and jackknife gives 97.07%. Overall, the prediction model gives 97.07% accuracy, 96.96% sensitivity and 97.39% specificity. These values indicate that the proposed model, iSulfoTyr-PseAAC, is well able to predict sulfotyrosine sites in given proteins. Computationally, the proposed model can still be improved, as the number of available protein sequences is growing rapidly from day to day.

Table 1

Results for self-consistency testing for iSulfoTyr-PseAAC.

Predictor	Accuracy Metrics
Predictor	Acc (%)	Sp (%)	Sn (%)	MCC
iSulfoTyr-PseAAC	99.23	99.10	99.75	0.99

Table 2

Results for jackknife testing of iSulfoTyr-PseAAC (Average of n-iterations).

Predictor	Accuracy Metrics
Predictor	Acc (%)	Sp (%)	Sn (%)	MCC
iSulfoTyr-PseAAC	97.07	97.39	96.96	0.92

Table 3

Results for 10-fold cross-validation of iSulfoTyr-PseAAC (Average of 10-folds).

Predictor	Accuracy Metrics
Predictor	Acc (%)	Sp (%)	Sn (%)	MCC
iSulfoTyr-PseAAC	94.26	94.55	94.16	0.86

Table 4

Comparison with existing models.

Predictor	Accuracy Metrics				Number of Proteases
Predictor	Acc (%)	Sp (%)	Sn (%)	MCC	TN	FP	FN	TP
iSulfoTyr-PseAAC	85.63	88.75	82.50	0.71	71	9	14	66
Sulfinator [64]	68.13	71.25	65.00	0.36	57	23	28	52
SulfoSite [65]	73.13	76.25	70.00	0.46	61	19	24	56
PredSulSite [67]	76.88	81.25	72.50	0.54	65	15	22	58
SulfoTyrP [68]	80.00	83.75	76.25	0.60	67	13	19	61
