Prediction of relative solvent accessibility by support vector regression and best-first method.

Abstract

Because the native structure of most proteins is believed to be defined by their sequences, data mining methods are indispensable for extracting hidden knowledge from protein sequences. A major difficulty in mining bioinformatics data is the size of the datasets, which frequently contain large numbers of variables. In this study, a two-step procedure for predicting the relative solvent accessibility of proteins is presented. In the first, "feature selection" step, a small subset of evolutionary information is identified on the basis of selected physicochemical properties. In the second step, support vector regression is used for real-value prediction of protein solvent accessibility from these custom-selected features of evolutionary information. The experimental results show that the proposed method improves average prediction accuracy and training time.

Keywords: PSI-BLAST; feature selection method; physicochemical properties of amino acids; support vector regression

Year: 2010 PMID: 29255385 PMCID: PMC5698889

Source DB: PubMed Journal: EXCLI J ISSN: 1611-2156 Impact factor: 4.068

Abbreviations

RSA: Relative Solvent Accessibility; SVR: Support Vector Regression; PSSM: Position Specific Scoring Matrix

Introduction

A protein's native structure strongly influences its biological function, so knowledge of the tertiary structure, and hence of solvent accessibility, is relevant to the study of protein function. Conversely, knowledge of a protein's solvent accessibility plays a vital role in predicting its tertiary structure. Accessible Surface Area (ASA) is the surface area of a given residue that is accessible to the solvent. Relative Solvent Accessibility (RSA) is computed as the ASA of a residue normalized by the ASA of the same residue in its extended tripeptide (Ala-X-Ala) conformation, and is expressed as a percentage. This paper investigates whether an improved sequence representation, based on custom-selected features harvested from evolutionary information, can improve the accuracy of RSA prediction. When protein solvent accessibility is predicted from evolutionary information, the feature dimension is high, namely N*20, where N is the size of the window. The paper rests on the hypothesis that using data mining feature selection methods to select a subset of best-performing features improves prediction accuracy and training time. The expected result is a simplified prediction model, reduced computational time, and better prediction quality. These goals are pursued by designing a two-step procedure for predicting the relative solvent accessibility of proteins. In the first, "feature selection" step, a relatively small subset of evolutionary information is identified on the basis of selected physicochemical properties in each position of the given window. In the second step, support vector regression is used for real-value prediction of protein solvent accessibility from these custom-selected features of evolutionary information.
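The normalization that turns ASA into RSA can be sketched in a few lines. This is a minimal illustration only: the reference ASA values in `MAX_ASA` are hypothetical placeholders (the text does not list the Ala-X-Ala reference values it used), and clipping at 100 % is an assumption rather than something stated in the paper.

```python
# Hypothetical reference ASA values (in square angstroms) for residue X in an
# extended Ala-X-Ala tripeptide. Illustrative numbers, not taken from the paper.
MAX_ASA = {"A": 110.0, "G": 85.0, "L": 180.0}


def relative_solvent_accessibility(residue: str, asa: float) -> float:
    """Return RSA in percent: observed ASA divided by the reference ASA."""
    reference = MAX_ASA[residue]
    rsa = 100.0 * asa / reference
    # Assumption: clip values that exceed the reference to 100 %.
    return min(rsa, 100.0)
```

For example, an alanine with an observed ASA of 55 would receive an RSA of 50 % under these placeholder reference values.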

Previous Related Works

The existing solvent accessibility prediction methods can be divided into two main groups:
- Real-valued predictors, which predict the real value of solvent accessibility. The representative existing methods are based on linear regression (Wagner et al., 2005[22]), neural network based regression (Adamczak et al., 2004[1]), neural networks (Shandar et al., 2003[18]; Faraggi et al. 2009[5]; Petersen et al. 2009[16]; Dor and Zhou, 2007[4]), support vector regression (Yuan and Huang, 2004[27]; Xu et al., 2005[25]), pace regression (Meshkin et al., 2009[13]) and look-up tables (Wang et al., 2004[23]). In the study of Shandar et al. (2003[18]), binary coding of the sequence was taken as the input features, while all other studies use evolutionary information (Wagner et al., 2005[22]; Adamczak et al., 2004[1]; Yuan and Huang, 2004[27]; Xu et al., 2005[25]; Wang et al., 2004[23]).
- Discrete-valued predictors, which classify each residue into one of a predefined set of classes. The classes are usually defined by thresholds and include buried, intermediate, and exposed classes (in most cases the predictions concern only two classes, i.e. buried vs. exposed). The corresponding prediction methods apply fuzzy nearest neighbor (Sim et al., 2005[20]), neural networks (Cuff and Barton, 2000[3]; Shandar and Gromiha, 2002[17]; Gianese and Pascarella, 2006[8]), support vector machines (Kim and Park, 2004[10]; Yuan et al., 2002[26]), two-stage support vector machines (Nguyen and Rajapakse, 2005[15]), information theory (Naderi-Manesh et al., 2001[14]), and probability profiles (Gianese et al., 2003[7]). Early studies used only the sequence to generate features (Shandar and Gromiha, 2002[17]; Naderi-Manesh et al., 2001[14]), while more recent studies have used evolutionary information (Kim and Park, 2004[10]; Nguyen and Rajapakse, 2005[15]).

Some conformational structures are mainly determined by local interactions between nearby residues, whereas others are due to distant interactions within the same protein. By reducing the number of features in each position of the window, we can therefore enlarge the window and take the effects of more neighbors into account when predicting RSA values. In addition, reducing dimensionality and removing irrelevant data has further advantages, such as lower data acquisition costs, a better understanding of the prediction model, and a shorter training time. These advantages motivate the investigation in this paper. Because the number of PSSM profile features in a window of size N is very high, the main practical aim of this work is to find an optimal subset among the N*20 features that enables efficient prediction of the relative solvent accessibility of proteins.

Materials

In this section, the dataset is introduced, then qualitative and quantitative features are described.

Dataset

In this study, the Manesh dataset (Naderi-Manesh et al., 2001[14]) is used. It consists of 215 proteins with low sequence similarity, i.e. < 25 %. The sequences are available online at http://gibk21.bse.kyutech.ac.jp/rvp-net/all-data.tar.gz. The Manesh dataset has been widely used by researchers to benchmark prediction methods (Adamczak et al., 2004[1]; Meshkin et al., 2009[13]; Shandar and Gromiha, 2002[17]; Garg et al., 2005[6]; Gianese et al., 2003[7]), which motivated us to use it to design and validate our method.

Qualitative features

As shown in Table 1, 48 qualitative properties are used to encode each of the 20 amino acids. Qualitative features for a window surrounding the given amino acid are represented by a bipolar vector. Instead of using physicochemical values, the amino acids are grouped by a binary classification for each property: 1 is assigned to residues that have or strongly show the property and -1 to those that do not. Under this grouping scheme, each amino acid is encoded as a 48-dimensional vector.

Table 1

48 physicochemical properties of amino acids

The bipolar vector was produced for a window 13 residues wide, centered on the target residue. There are 13*48+1 features in the bipolar vector for each residue in a sequence. The pattern of the input vector is shown in (1). For instance, physicochemical features for a window surrounding the given amino acid are encoded as (2). After qualitative input vectors have been created for all residues of the proteins in the Manesh dataset (Naderi-Manesh et al., 2001[14]), the feature selection method selects the subset of physicochemical properties that correlate strongly with the relative solvent accessibility of proteins.
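The bipolar window encoding described above can be sketched as follows. The three property groups in `PROPERTIES` are hypothetical stand-ins for the 48 properties of Table 1, and their membership sets are illustrative. Padding out-of-sequence positions with zeros is also an assumption, since the text does not say how windows near the termini are handled, and the extra "+1" feature is not modeled here.

```python
# Hypothetical property groups: only 3 of the 48 properties are shown, and the
# membership sets are illustrative, not copied from Table 1.
PROPERTIES = {
    "hydrophobic": set("AVLIMFWC"),
    "charged": set("DEKR"),
    "tiny": set("AGS"),
}
PROPERTY_NAMES = list(PROPERTIES)


def encode_residue(aa):
    """Bipolar code: +1 if the residue has the property, -1 otherwise."""
    return [1 if aa in PROPERTIES[name] else -1 for name in PROPERTY_NAMES]


def encode_window(sequence, center, half_width=6):
    """Concatenate bipolar codes over a window of 2*half_width+1 residues.

    Positions outside the sequence are encoded as zeros (an assumption).
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            features.extend(encode_residue(sequence[pos]))
        else:
            features.extend([0] * len(PROPERTY_NAMES))
    return features
```

With the full set of 48 properties and `half_width=6`, the same routine would produce the 13*48 window features mentioned in the text.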

Quantitative features

For a protein sequence, the position specific scoring matrix (PSSM) describes the likelihood of a particular residue substitution at a specific position based on evolutionary information. PSI-BLAST is used to compare protein sequences, find similar sequences, and discover evolutionary relationships (Altschul et al., 1997[2]). PSI-BLAST generates a profile representing a set of similar protein sequences in the form of a 20 × N PSSM, where N is the length of the sequence and each amino acid in the sequence is described by 20 features. Because the profile features are created by sequence alignment and quantitative criteria, we call them quantitative features. We used PSI-BLAST with the default parameters and the BLOSUM62 substitution matrix in this study.
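To make the N*20 dimensionality concrete, the sketch below flattens a sliding window of PSSM rows into one feature vector. It assumes the PSSM is already available as a list of 20-score rows (one row per residue) and that positions beyond either terminus are zero-padded. Both the data layout and the padding are assumptions for illustration.

```python
def pssm_window_features(pssm, center, half_width=6):
    """Flatten the PSSM rows of a window centered on `center`.

    pssm: list of rows, one per residue, each holding 20 scores.
    Returns (2*half_width + 1) * 20 values; rows outside the sequence
    are replaced by zeros (an assumption, not stated in the source).
    """
    n_scores = 20
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0] * n_scores)
    return features
```

For a 13-residue window (`half_width=6`) this yields 260 values per residue, which is the feature count the selection step later reduces.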

Methods

Figure 1 shows a detailed overview of the prediction procedure, which consists of two steps: the first creates the input vector by selecting a subset of evolutionary features, and the second builds the model.

Figure 1

A detailed overview of the proposed method

The proposed two-step method works as follows. The first step is divided into two subtasks: “Physicochemical Feature Selection” and “Evolutionary Information Selection”. In the “Physicochemical Feature Selection” subtask, we select, for each position of the window, the subset of physicochemical properties of amino acids that correlate strongly with the relative solvent accessibility of proteins. Once this subset has been selected, the “Evolutionary Information Selection” subtask chooses, in each position of the window, the amino acids that have the selected physicochemical properties. The result is a subset of best-performing features from the PSI-BLAST profile, which is used in the next step to train the model. The second step builds the model: it learns from the training data the unknown relationships between the selected PSSM features and RSA, and produces a model for RSA prediction of protein sequences. Support vector regression with an RBF kernel is applied in this step.

Feature selection

Feature selection, as a preprocessing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. It is performed in the first step of the proposed method, where it is used to find the subset of physicochemical properties of amino acids that correlate strongly with the relative solvent accessibility of proteins. The best-first method searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. We applied the best-first method (Korf, 1993[11]) in the forward direction and used the CfsSubsetEval method (Hall, 1998[9]) to evaluate the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of physicochemical features that are highly correlated with the RSA values while having low intercorrelation are preferred. The best-first method filters out the redundancy among the physicochemical features and determines the final number of selected features, which was 31 in our case. Table 2 shows the selected physicochemical features that have a strong relationship with the RSA value of the residue Ai located at the center of the window of size 13.

Table 2

Results of subset selection of physicochemical features
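The combination of a correlation-based subset merit with a forward best-first search can be sketched as below. This is not Weka's implementation: it uses the well-known CFS merit heuristic, k·r_cf / sqrt(k + k(k-1)·r_ff), with absolute Pearson correlation as the correlation measure (an assumption). The stopping rule of `max_stale` consecutive non-improving expansions is an illustrative choice, and all function names are hypothetical.

```python
import heapq
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (0 if constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cfs_merit(subset, columns, target):
    """CFS merit: high feature-target correlation, low feature-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[f], columns[g])) for f, g in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def best_first_forward(columns, target, max_stale=5):
    """Forward best-first search over feature subsets.

    Keeps every evaluated subset in a priority queue, so the search can
    backtrack to an earlier subset when the current path stops improving.
    """
    best, best_merit = (), 0.0
    open_heap = [(0.0, ())]          # entries are (-merit, subset)
    visited = {()}
    stale = 0
    while open_heap and stale < max_stale:
        _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(len(columns)):
            if f in subset:
                continue
            child = tuple(sorted(subset + (f,)))
            if child in visited:
                continue
            visited.add(child)
            merit = cfs_merit(child, columns, target)
            heapq.heappush(open_heap, (-merit, child))
            if merit > best_merit + 1e-12:
                best, best_merit = child, merit
                improved = True
        stale = 0 if improved else stale + 1
    return list(best), best_merit
```

A redundant copy of a good feature does not raise the merit, so the search keeps only one of the two, which is the behavior the text describes as filtering redundancy.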

Once the subset of qualitative features has been produced, the set of amino acids that have the selected properties is chosen in each position of the window. For example, if the inflexibility or very hydrophobic properties are selected in position Ai+3, we select only the amino acids that have at least one of those properties in that position. The result is a subset of PSI-BLAST profile features, which is used for training a model in the second step; see Table 3.

Table 3

Results of subset selection of evolutionary information
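The mapping from selected properties to PSSM columns can be illustrated with a short sketch. The property membership sets are hypothetical (the real groupings are those of Table 1), and the amino acid column order is assumed to follow the usual PSI-BLAST PSSM layout. The function returns (window offset, amino acid) pairs that identify which PSSM scores to keep.

```python
# Assumed column order of a PSI-BLAST PSSM.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical membership sets for two of the properties named in the text.
PROPERTIES = {
    "inflexible": set("PW"),
    "very_hydrophobic": set("ILFW"),
}


def select_pssm_columns(selected_props_by_offset):
    """Map {window offset: [property names]} to (offset, amino acid) columns.

    An amino acid is kept at an offset if it has at least one of the
    properties selected for that offset.
    """
    columns = []
    for offset in sorted(selected_props_by_offset):
        chosen = set()
        for prop in selected_props_by_offset[offset]:
            chosen |= PROPERTIES[prop]
        columns.extend((offset, aa) for aa in AMINO_ACIDS if aa in chosen)
    return columns
```

Calling `select_pssm_columns({3: ["inflexible", "very_hydrophobic"]})` keeps, at position Ai+3, only the PSSM scores of amino acids with at least one of the two properties.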

The selected features include 76 features from the PSSM profile and one binary value that indicates whether the residue is located close to either terminus of the sequence. We add this binary feature because amino acids located at the two termini of the sequence have a higher probability of being exposed to the solvent; see Table 4.

Table 4

The total count of selected features

Among the 13*48 qualitative features, only 31 physicochemical features were deemed significant for the prediction of RSA in a given window. The first step of our method revealed which qualitative features matter most for the prediction of RSA:
- The physicochemical features of the central residue, i.e. Ai, have the strongest influence on the prediction. Interestingly, features of other residues have a relatively small influence on the prediction.
- The residues located in positions Ai-6, Ai-5, Ai-2, Ai+2, and Ai+5 have a very low impact on the RSA prediction of the central amino acid.
- No features of the Ai-2 amino acid were selected, i.e. this residue has no impact on the RSA prediction of the central amino acid.
- The hydrophilicity, hydrophobicity, long, flexibility, and inflexibility features of amino acids correlate strongly with RSA values, because these features appear in many positions of the window in Table 1.
- Among the 48 physicochemical features of amino acids, only 20 distinct physicochemical features correlate strongly with protein solvent accessibility.

Support vector regression

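Given a training set of n data point pairs (x_i, y_i), i = 1, 2, ..., n, where x_i denotes the feature vector representing the i-th sample from the protein sequences and y_i denotes its RSA value, the optimal SVR is found, in the standard ε-insensitive formulation, by solving:

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right)$$

such that

$$y_i - (w \cdot z_i - b) \le \varepsilon + \xi_i, \qquad (w \cdot z_i - b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0,$$

where w is a vector perpendicular to the hyperplane w·x - b = 0, C is a user-defined complexity constant, ξ_i and ξ_i^* are slack variables that measure the degree of prediction error of x_i for a given hyperplane, ε is the width of the insensitive zone around the regression function, and z_i = Φ(x_i), where k(x, x') = Φ(x)·Φ(x') is a user-defined kernel function. The SVR was trained using the sequential minimal optimization algorithm (Smola and Scholkopf, 1998[21]), which was further optimized by Shevade and colleagues (1999[19]). The proposed SVR uses the RBF kernel (6):

$$k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right) \qquad (6)$$

where γ is the kernel width parameter.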
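Two ingredients of the SVR described above are easy to show in isolation: the RBF kernel and the ε-insensitive error that the slack variables measure. The sketch below is illustrative only. Here ε is the tolerance of the standard ε-insensitive SVR loss, and the default values of `gamma` and `epsilon` are placeholders, since the paper's settings are not given in this text.

```python
import math


def rbf_kernel(x, x_prime, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Error beyond the epsilon tube; zero when the prediction is inside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)
```

Identical feature vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart. Predictions within ε of the target contribute no loss, which is what makes the regression tolerant of small errors.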

Results and Discussion

The SVR and best-first methods were implemented in Weka, a comprehensive open-source library of machine learning methods (Witten and Frank, 2005[24]). The evaluation was performed using 10-fold cross-validation to allow a comprehensive comparison with previous studies. Residues were classified into two states (buried/exposed) using different thresholds. The prediction accuracy was evaluated as the number of correctly predicted residues divided by the total number of residues in the test dataset, expressed as a percentage. For example, for the two states we have

$$Q = \frac{N_B + N_E}{N} \times 100$$

where Q is the percentage of correctly predicted residues, N_B and N_E are the numbers of residues correctly predicted as buried and exposed, respectively, and N is the total number of residues in the test dataset.
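The 10-fold protocol can be sketched as a simple index split. This is not Weka's cross-validation routine. The function names are hypothetical, and the sketch assumes that the items being split are the 215 proteins, whereas the text does not say whether folds were formed over proteins or over residues.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Split item indices 0..n_items-1 into k shuffled, near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = k_fold_indices(n_items, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test
```

Each item appears in exactly one test fold, so the reported accuracy covers every item once while the model is always evaluated on data it was not trained on.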

Comparison with other prediction methods

Figure 2 shows the experimental and predicted values for each residue in thioredoxin. We selected this protein as an example because its residues fall within different ranges of RSA values, which illustrates the accuracy of the prediction across a wide range of RSAs and amino acid residues. The figure shows a good linear relationship between the experimental and predicted values.

Figure 2

Example of predicted RSA values for a protein (PDB code 1ABA)

Since model training in our method is done in one stage, it should be compared with methods whose training is also done in one stage. Table 5 shows the comparison between this paper and one-stage methods for RSA prediction, which include neural network and SVR models (Adamczak et al., 2004[1]; Meshkin et al., 2009[13]; Shandar and Gromiha, 2002[17]; Garg et al., 2005[6]; Gianese et al., 2003[7]). Since some of these methods predict discrete-valued classes (exposed vs. buried), we examined the performance of our method by converting the real-value prediction into a two-state prediction. We followed the standard approach, in which the state is defined by the predicted RSA value and a predefined threshold. For instance, a 5 % threshold means that residues with an RSA value (%) greater than or equal to 5 are defined as exposed, and all others are classified as buried. The threshold is usually set between 5 % and 50 %. We note that for most thresholds our method provides more accurate two-state predictions; see Table 5 (References in Table 5: Shandar and Gromiha, 2002[17]; Gianese et al., 2003[7]; Garg et al., 2005[6]; Adamczak et al., 2004[1]; Meshkin et al., 2009[13]).

Table 5

Comparison between our method and other reported methods; unreported results are denoted by “-”
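The conversion from real-valued RSA to two states and the resulting accuracy can be sketched as follows. The function names are hypothetical and the numbers in the usage note are made up for illustration. The only rule taken from the text is that a residue is exposed when its RSA (%) is greater than or equal to the threshold.

```python
def to_state(rsa_percent, threshold):
    """Return 'E' (exposed) if RSA >= threshold, otherwise 'B' (buried)."""
    return "E" if rsa_percent >= threshold else "B"


def two_state_accuracy(observed, predicted, threshold):
    """Percentage of residues whose predicted state matches the observed state."""
    correct = sum(
        to_state(o, threshold) == to_state(p, threshold)
        for o, p in zip(observed, predicted)
    )
    return 100.0 * correct / len(observed)
```

For example, with observed values `[2, 10, 30, 4]` and predicted values `[6, 12, 20, 1]`, three of the four residues receive the same state at a 5 % threshold, giving an accuracy of 75 %. Repeating the call for thresholds between 5 % and 50 % gives one accuracy per threshold.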

The two main remarks from the experimental evaluation are: the proposed method obtains favorable error rates when compared with five competing methods; and the reduced number of features (i.e. 76+1 attributes instead of 13*20+1 attributes) results in a simplified prediction model, reduced computational time, and optimized prediction quality.

Conclusion

In this paper, an approach for predicting protein relative solvent accessibility has been presented. It relies on a two-step procedure consisting of subset selection of evolutionary information followed by a real-value predictor of relative solvent accessibility. As shown in our study, feature selection is effective in reducing dimensionality, removing irrelevant features, and increasing accuracy in the prediction of the relative solvent accessibility of proteins. We have recently proposed an approach for the prediction of RSA based on the scatter search technique (Meshkin et al., submitted). Compared with that work, the present approach achieves a greater improvement in training time because its feature set is smaller. We can conclude from this research that most of the features in the evolutionary information profile do not have any significant impact on the prediction of RSA for the central residue of a given window. Although only a subset of features is used, prediction accuracy has not decreased, and at some thresholds it has improved in comparison with methods whose training is done in one stage. In future work we will widen our scope to consider more feature selection and classification algorithms, such as boosting, genetic algorithms, evolutionary algorithms, and neural networks, in order to find an optimal approach to determining discriminatory features. Finding the features common to different feature selection methods is another interesting task.

References

1. Application of multiple sequence alignment profiles to improve protein secondary structure prediction.

Authors: J A Cuff; G J Barton
Journal: Proteins Date: 2000-08-15

2. Prediction of protein surface accessibility with information theory.

Authors: H Naderi-Manesh; M Sadeghi; S Arab; A A Moosavi Movahedi
Journal: Proteins Date: 2001-03-01

3. Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor.

Authors: Hyunsoo Kim; Haesun Park
Journal: Proteins Date: 2004-02-15

4. Accurate prediction of solvent accessibility using neural networks-based regression.

Authors: Rafał Adamczak; Aleksey Porollo; Jarosław Meller
Journal: Proteins Date: 2004-09-01

5. Look-up tables for protein solvent accessibility prediction and nearest neighbor effect analysis.

Authors: Jung-Ying Wang; Shandar Ahmad; M Michael Gromiha; Akinori Sarai
Journal: Biopolymers Date: 2004-10-15

6. Prediction of protein relative solvent accessibility with a two-stage SVM approach.

Authors: Minh N Nguyen; Jagath C Rajapakse
Journal: Proteins Date: 2005-04-01

7. Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method.

Authors: Jaehyun Sim; Seung-Yeon Kim; Julian Lee
Journal: Bioinformatics Date: 2005-04-06

8. Improving Prediction of Residue Solvent Accessibility with SVR and Multiple Sequence Alignment Profile.

Authors: Ao Li; Xian Wang; Zhaohui Jiang; Huanqing Feng
Journal: Conf Proc IEEE Eng Med Biol Soc Date: 2005

9. Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network.

Authors: Eshel Faraggi; Bin Xue; Yaoqi Zhou
Journal: Proteins Date: 2009-03

10. A generic method for assignment of reliability scores applied to solvent accessibility predictions.

Authors: Bent Petersen; Thomas Nordahl Petersen; Pernille Andersen; Morten Nielsen; Claus Lundegaard
Journal: BMC Struct Biol Date: 2009-07-31
