
Predicting and improving the protein sequence alignment quality by support vector regression.

Minho Lee, Chan-seok Jeong, Dongsup Kim.

Abstract

BACKGROUND: For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment.
RESULTS: In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as input to the trained SVR model to predict the alignment quality. Performance is evaluated by various measures, including the Pearson correlation coefficient between the observed and predicted MaxSub scores, which reaches 0.945. For each query-template pair, 48 alignments are generated by changing the alignment options. The trained SVR models are then applied to predict the MaxSub scores of those alignments and to select the best alignment option specifically for that query-template pair. This adaptive selection procedure results in a 7.4% improvement in MaxSub scores compared with using the single best parameter option for all query-template pairs.
CONCLUSION: The present work demonstrates that alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are not suitable for structure prediction because of poor alignment accuracy. The method is implemented as part of FORECAST, a server for fold recognition, and is freely available on the web at http://pbil.kaist.ac.kr/forecast.

Year:  2007        PMID: 18053160      PMCID: PMC2222655          DOI: 10.1186/1471-2105-8-471

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

As the number of protein sequences grows exponentially, knowledge of their structures and functions lags far behind, because the experiments needed to determine structures and functions are difficult and time-consuming. Computational methods such as structure and function prediction are one way to address this problem. Computational methods for protein structure prediction fall into two categories: ab initio folding and comparative modeling. Ab initio folding is based on physical principles and does not require prior knowledge of protein structures, but comparative modeling [1] has shown superior performance in recent experiments that assess structure prediction methods, such as CASP (Critical Assessment of Structure Prediction) [2]. The first step in comparative modeling is fold recognition, in which one searches for homologous proteins of known structure and chooses the best one to use as a template. The alignment between the selected template and the query protein is then generated. Finally, the alignment is used to build three-dimensional structure models with model-building tools such as MODELLER [1,3]. High-quality query-template alignments are therefore essential for successful homology modeling. Thus, two factors essentially determine the quality of predicted protein structures: good templates and high-quality query-template alignments. There have been many approaches to improving fold recognition. Progress in fold recognition has made it possible to increase the structural coverage of newly sequenced genomes [4] and to improve our ability to predict protein structures, as demonstrated in recent CASP experiments. The importance of alignment accuracy for comparative modeling has already been addressed [5].
Among the many sequence alignment methods, the simplest are sequence-sequence alignments such as the Smith-Waterman [6] or BLAST [7] algorithms. Other methods exploit evolutionary information: profile-sequence alignments such as PSI-BLAST [8] and sequence-profile alignments such as IMPALA [9]. Many studies have shown that using profiles of both the query and the template, known as profile-profile alignment, is superior to sequence-profile and profile-sequence methods [10]. Even so, profile-profile alignments do not always provide the optimal alignment [11]. They can be carried out in many different ways [12-14], and the results change as the alignment options vary. There is no single best profile-profile method, nor a universal alignment option that always generates the optimal alignment. To overcome this problem, methods such as Consensus [15], ESyPred3D [16], the Multiple Mapping Method (MMM) [17], and methods based on genetic algorithms [18,19] have used populations of suboptimal alignments. ESyPred3D fixes the redundant results from suboptimal alignments and finds optimal alignments by moving anchor points. Consensus builds an alignment from the consensus of several alignments, based on the consensus strength, and discards the residues where the alternative alignments differ. These two methods use a limited number of alternative alignments. On the other hand, the genetic-algorithm methods generate as many suboptimal alignments as possible. After sets of model structures are constructed from the alignments, each model is scored by a fitness function such as an atom-atom potential [20] or a Z-score [21]. However, these approaches take longer, and alignments made by crossover are likely to be biologically meaningless.
MMM, a recent method, focuses on minimizing alignment errors, based on its own scoring function, by combining differently aligned segments from alternative alignments. MMM outperformed other methods and showed significant improvements. Here we introduce a novel method that uses support vector regression (SVR) [22] not only to predict the alignment quality but also to improve it. Machine learning techniques such as the artificial neural network (ANN) and the support vector machine (SVM) [23] have been popular tools for fold recognition, but they are applicable only to feature vectors of fixed length. A method in which the templates in the template library have feature vectors of different lengths, built from profile-profile alignment scores, was recently developed [24]. A modified version is used in our work. Among the many measures of alignment quality, we use MaxSub [25], which has served as a measure in structure prediction assessment experiments such as CASP [26], CAFASP [27], and LiveBench [28]. MaxSub is a good measure of alignment quality because it is a single normalized number and reflects structure-level quality. Our attempt to predict alignment quality is not entirely new. A related work [29] has been published, but alignment quality prediction was not its final research goal; rather, in the work by Xu [29], the predicted alignment quality was used to improve the performance of fold recognition. In the present work, we develop a highly accurate method to predict alignment quality, and we use it not only to maximize the alignment quality but also to choose good templates. An alignment of a query protein against a template of length n is converted into a feature vector of length n + 1 composed of profile-profile alignment scores and the length of the query protein.
The predicted MaxSub score is calculated by the SVR model built specifically for that template. The test results show highly accurate regression performance. For each query-template pair, various alignments are generated using many different combinations of alignment parameters. The SVR model for the template is then used to find the optimal alignment parameters specific to that pair. We call this the 'adaptive selection' method. On a large-scale testing set, the adaptive selection method outperforms the use of a single universal alignment option.

Results and discussion

Performance measures of SVRs

Alignments are converted into (n + 1)-dimensional feature vectors, where n is the length of the template, and these vectors are the input of the SVRs (Figure 1). To evaluate the method, the trained SVR models are applied to the testing set. The correlation between observed and predicted MaxSub values is presented as a density map (Figure 2a). Each column in Figure 2a is normalized independently by dividing the number of alignments within a specific range of MaxSub scores by the total number of alignments in that column. The number of alignments in each column is plotted in Figure 2b. The highest density is represented by black squares and the lowest by white squares. The Pearson correlation coefficient calculated from the pairs of predicted and observed MaxSub scores is 0.945. A previous related work [29] reported a correlation coefficient of 0.71, which is lower than that of the present method. However, the testing set and the measure of alignment quality in the previous work differ from those used here (there, alignment quality was calculated by comparing the sequence alignment with structural alignments generated by SARF [30], which were assumed to be the gold standard), so a direct comparison between the two methods may not be very meaningful, although the much higher correlation coefficient suggests that the present method is better at predicting alignment quality. The high correlation coefficient and the clearly diagonal density diagram imply that MaxSub scores, as a measure of alignment quality, can be accurately predicted. Moreover, the results suggest that for each query-template pair it is possible to find its own optimal alignment parameters that maximize the alignment quality.
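The conversion of an alignment into an (n + 1)-dimensional vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: it follows the Figure 1 caption in appending the total alignment score as the last component, and both the function name and the handling of unaligned template positions (scored as 0.0) are assumptions.

```python
from typing import List, Optional


def alignment_to_feature_vector(position_scores: List[Optional[float]]) -> List[float]:
    """Build an (n + 1)-dimensional vector for a template of length n.

    position_scores[i] is the alignment score at template position i.
    None marks a position with no aligned query residue; scoring it as
    0.0 is an assumption made for this sketch only.
    """
    per_position = [0.0 if s is None else float(s) for s in position_scores]
    total_score = sum(per_position)  # the (n + 1)-th component
    return per_position + [total_score]


# Hypothetical template of length n = 5 with one unaligned position.
vector = alignment_to_feature_vector([1.2, 0.4, None, -0.3, 2.0])
print(len(vector))  # 6
```

Because the vector length depends only on the template, every alignment against the same template yields a vector of the same size, which is what allows one SVR model per template.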
Figure 1

Generation of the input feature vectors from alignments. (a) The sequence of a template of length n is aligned to the sequences of examples by profile-profile alignment method. (b) Each alignment is transformed to (n + 1)-dimensional feature vector composed of the alignment scores at n positions and the total alignment score. (c) These feature vectors are used to train SVR model for the target template.

Figure 2

Performance of SVR models. (a) Correlations between observed and predicted MaxSub scores with correlation coefficient of 0.9453. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution.

In addition to the Pearson correlation coefficient, three measures of error are calculated. The first is the mean absolute error (MAE), given by

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - o_i \right|$$

where $y_i$ is the predicted value, $o_i$ is the observed value, and N is the total number of predictions. The normalized MAE (NMAE) is defined as follows: [equation missing in the source]. The last is the root-mean-square error (RMSE), given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - o_i \right)^2}$$

MAE, NMAE, and RMSE are 0.0775, 1.877, and 0.0969, respectively (Table 1); the distributions of MAE and NMAE are shown in Figure 2c and Figure 2d, respectively. MAE is always lower than 0.2 over the whole range of observed MaxSub scores when the window size is set to 0.5.
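As an illustration of the measures above, the sketch below computes the Pearson correlation coefficient, MAE, and RMSE from paired predicted and observed scores using their standard definitions. NMAE is left out because its definition is not available in this text. The three score pairs are made-up values, not data from the study.

```python
import math
from typing import Sequence


def mae(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute error between predicted and observed scores."""
    return sum(abs(y - o) for y, o in zip(predicted, observed)) / len(observed)


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between predicted and observed scores."""
    return math.sqrt(
        sum((y - o) ** 2 for y, o in zip(predicted, observed)) / len(observed)
    )


def pearson(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation coefficient of the paired scores."""
    n = len(observed)
    mean_y = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((y - mean_y) * (o - mean_o) for y, o in zip(predicted, observed))
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_y * var_o)


# Made-up MaxSub scores for three alignments.
predicted = [0.1, 0.4, 0.8]
observed = [0.0, 0.5, 0.7]
print(mae(predicted, observed), rmse(predicted, observed), pearson(predicted, observed))
```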
Table 1

Performance of SVR models for overall test set and at three levels of SCOP hierarchy. Pearson stands for Pearson correlation coefficient. MAE, NMAE, and RMSE are types of error

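| Measure | All | Family | Superfamily | Fold |
|---|---|---|---|---|
| Pearson | 0.9453 | 0.9185 | 0.8318 | 0.6106 |
| MAE | 0.0775 | 0.0630 | 0.0773 | 0.0848 |
| NMAE | 1.8771 | 0.3112 | 1.8344 | 2.6738 |
| RMSE | 0.0969 | 0.0936 | 0.0962 | 0.0988 |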

Adaptive selection of the alignment options having the best MaxSub score

The ultimate objective of predicting alignment quality is to find the best alignment. One straightforward, although not the best, way to do this is to choose a single set of alignment parameters (gap opening penalty, gap extension penalty, baseline score, and the weight of the secondary structure term) that yields the best alignments overall. However, Table 2, which lists the average MaxSub scores of alignments generated with different combinations of the alignment parameters, shows that no single set of parameters is optimal for all query-template pairs. For example, for query-template pairs related at the family level, the optimal values are 9, 1, 1, and 0.5 for gap opening penalty, gap extension penalty, baseline score, and secondary structure weight, respectively, whereas they change to 13, 2, 0, and 2 for pairs related at the fold level. Overall, the maximum average MaxSub score is 0.2386, obtained with parameters 9, 1, 1, and 1, which interestingly are not the optimal parameters at any individual level of similarity.
Table 2

Average MaxSub scores of the alignments generated by using various combinations of alignment parameters for the protein pairs related at the three SCOP levels. Open, Extension, and Baseline column shows gap open penalty, gap extension penalty and baseline value, respectively. '2nd' stands for the weight of predicted secondary structure. The best option showing highest MaxSub at each level is bolded.

Average MaxSub

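| Open | Extension | Baseline | 2nd | All | Family | Superfamily | Fold |
|---|---|---|---|---|---|---|---|
| 5 | 1 | 0 | 0 | 0.2104 | 0.5930 | 0.1636 | 0.0447 |
| 5 | 1 | 1 | 0 | 0.2172 | 0.6073 | 0.1679 | 0.0492 |
| 5 | 2 | 0 | 0 | 0.2130 | 0.6062 | 0.1566 | 0.0477 |
| 5 | 2 | 1 | 0 | 0.2105 | 0.6060 | 0.1494 | 0.0470 |
| 9 | 1 | 0 | 0 | 0.2200 | 0.6080 | 0.1774 | 0.0488 |
| 9 | 1 | 1 | 0 | 0.2208 | 0.6133 | 0.1716 | 0.0514 |
| 9 | 2 | 0 | 0 | 0.2171 | 0.6115 | 0.1621 | 0.0505 |
| 9 | 2 | 1 | 0 | 0.2131 | 0.6104 | 0.1508 | 0.0494 |
| 13 | 1 | 0 | 0 | 0.2176 | 0.6096 | 0.1696 | 0.0479 |
| 13 | 1 | 1 | 0 | 0.2158 | 0.6109 | 0.1609 | 0.0487 |
| 13 | 2 | 0 | 0 | 0.2131 | 0.6102 | 0.1530 | 0.0481 |
| 13 | 2 | 1 | 0 | 0.2088 | 0.6076 | 0.1429 | 0.0467 |
| 5 | 1 | 0 | 1 | 0.2210 | 0.5950 | 0.1705 | 0.0619 |
| 5 | 1 | 1 | 1 | 0.2298 | 0.6070 | 0.1784 | 0.0696 |
| 5 | 2 | 0 | 1 | 0.2283 | 0.6066 | 0.1738 | 0.0696 |
| 5 | 2 | 1 | 1 | 0.2286 | 0.6089 | 0.1713 | 0.0706 |
| 9 | 1 | 0 | 1 | 0.2342 | 0.6129 | 0.1853 | 0.0718 |
| 9 | 1 | 1 | 1 | 0.2386 | 0.6176 | 0.1877 | 0.0771 |
| 9 | 2 | 0 | 1 | 0.2373 | 0.6175 | 0.1837 | 0.0770 |
| 9 | 2 | 1 | 1 | 0.2345 | 0.6169 | 0.1770 | 0.0755 |
| 13 | 1 | 0 | 1 | 0.2355 | 0.6139 | 0.1851 | 0.0741 |
| 13 | 1 | 1 | 1 | 0.2374 | 0.6165 | 0.1852 | 0.0767 |
| 13 | 2 | 0 | 1 | 0.2356 | 0.6163 | 0.1808 | 0.0759 |
| 13 | 2 | 1 | 1 | 0.2319 | 0.6143 | 0.1730 | 0.0737 |
| 5 | 1 | 0 | 2 | 0.2111 | 0.5765 | 0.1572 | 0.0586 |
| 5 | 1 | 1 | 2 | 0.2208 | 0.5935 | 0.1629 | 0.0669 |
| 5 | 2 | 0 | 2 | 0.2216 | 0.5950 | 0.1609 | 0.0691 |
| 5 | 2 | 1 | 2 | 0.2248 | 0.6017 | 0.1611 | 0.0725 |
| 9 | 1 | 0 | 2 | 0.2247 | 0.6014 | 0.1676 | 0.0684 |
| 9 | 1 | 1 | 2 | 0.2311 | 0.6090 | 0.1719 | 0.0754 |
| 9 | 2 | 0 | 2 | 0.2324 | 0.6101 | 0.1717 | 0.0777 |
| 9 | 2 | 1 | 2 | 0.2327 | 0.6106 | 0.1706 | 0.0788 |
| 13 | 1 | 0 | 2 | 0.2290 | 0.6073 | 0.1713 | 0.0723 |
| 13 | 1 | 1 | 2 | 0.2337 | 0.6110 | 0.1750 | 0.0780 |
| 13 | 2 | 0 | 2 | 0.2343 | 0.6122 | 0.1741 | 0.0793 |
| 13 | 2 | 1 | 2 | 0.2332 | 0.6120 | 0.1707 | 0.0792 |
| 5 | 1 | 0 | 0.5 | 0.2214 | 0.5979 | 0.1738 | 0.0593 |
| 5 | 1 | 1 | 0.5 | 0.2288 | 0.6094 | 0.1801 | 0.0652 |
| 5 | 2 | 0 | 0.5 | 0.2260 | 0.6095 | 0.1729 | 0.0638 |
| 5 | 2 | 1 | 0.5 | 0.2247 | 0.6104 | 0.1680 | 0.0635 |
| 9 | 1 | 0 | 0.5 | 0.2337 | 0.6141 | 0.1888 | 0.0678 |
| 9 | 1 | 1 | 0.5 | 0.2359 | 0.6183 | 0.1881 | 0.0709 |
| 9 | 2 | 0 | 0.5 | 0.2328 | 0.6174 | 0.1807 | 0.0693 |
| 9 | 2 | 1 | 0.5 | 0.2291 | 0.6170 | 0.1717 | 0.0672 |
| 13 | 1 | 0 | 0.5 | 0.2329 | 0.6150 | 0.1856 | 0.0678 |
| 13 | 1 | 1 | 0.5 | 0.2323 | 0.6162 | 0.1809 | 0.0687 |
| 13 | 2 | 0 | 0.5 | 0.2295 | 0.6160 | 0.1739 | 0.0672 |
| 13 | 2 | 1 | 0.5 | 0.2251 | 0.6139 | 0.1645 | 0.0647 |
| Mean | | | | 0.2261 | 0.6090 | 0.1713 | 0.0651 |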
The results suggest the following alignment strategy. Instead of using a single universal set of alignment parameters for all query-template pairs, the alignment can be improved simply by picking the set of alignment parameters that is optimal for each individual query-template pair. If we do so, as seen in Table 3, the average overall MaxSub score improves from 0.2386 to 0.2887 (a 0.0501-point gain, or roughly 21%).
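The 48 parameter combinations listed in Table 2 form a full grid over four settings (three gap opening penalties, two gap extension penalties, two baseline values, and four secondary structure weights). A minimal sketch of enumerating that grid is shown below; the dictionary keys are illustrative names, not identifiers from the authors' software.

```python
from itertools import product

# Parameter values as they appear in Table 2.
GAP_OPEN = (5, 9, 13)
GAP_EXTENSION = (1, 2)
BASELINE = (0, 1)
SECONDARY_WEIGHT = (0, 0.5, 1, 2)

# One dictionary per alignment option; key names are hypothetical.
options = [
    {"gap_open": go, "gap_extension": ge, "baseline": bl, "ss_weight": ss}
    for go, ge, bl, ss in product(GAP_OPEN, GAP_EXTENSION, BASELINE, SECONDARY_WEIGHT)
]
print(len(options))  # 48
```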
Table 3

Comparison of average MaxSub scores. The values in the first row "Overall best option" are retrieved from Table 2.

Average MaxSub

| Method | All | Family | Superfamily | Fold |
|---|---|---|---|---|
| Overall Best Option | 0.2386 | 0.6176 | 0.1877 | 0.0771 |
| Always Best (Upper Limit) | 0.2887 | 0.6414 | 0.2505 | 0.1396 |
| Adaptive Selection (Observed) | 0.2563 | 0.6255 | 0.2128 | 0.0953 |
| Adaptive Selection (Predicted) | 0.3039 | 0.6385 | 0.2501 | 0.1669 |
Obviously, we do not know a priori which set of alignment parameters is optimal for a given query-template pair, because the structure of the query protein is not known. We therefore propose the 'adaptive selection' method, which is carried out as follows. (1) Generate alignments using many different combinations of alignment parameters. (2) Predict the MaxSub scores of the alignments using the trained SVR models. (3) Select the alignment with the highest predicted MaxSub score. With this procedure, the average actual MaxSub score of the selected alignments improves to 0.2563 (Table 3), a gain of 0.0177 points, or 7.42%, over the single best option procedure. This improvement is statistically significant (p-value < 10^-300 by the Wilcoxon signed rank test [31]). It also indicates that the adaptive selection method captures roughly 35.3% (0.0177 vs. 0.0501) of the maximum improvement achievable by selecting the optimal alignment parameters unique to each query-template pair. Moreover, it implies that the alignment quality could be improved further by developing a more accurate alignment quality prediction method.
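The three steps of adaptive selection amount to an argmax over predicted scores. The sketch below illustrates this under stated assumptions: `toy_predict` is a stand-in for a trained SVR model, and the three feature vectors are invented. The last lines reproduce the "35.3% of the maximum improvement" figure from the averages in Table 3.

```python
from typing import Callable, Dict, List, Tuple

Option = Tuple[float, float, float, float]  # (open, extension, baseline, 2nd)


def adaptive_select(
    alignments: Dict[Option, List[float]],
    predict: Callable[[List[float]], float],
) -> Tuple[Option, float]:
    """Return the option whose alignment has the highest predicted score."""
    scored = {option: predict(features) for option, features in alignments.items()}
    best_option = max(scored, key=scored.get)
    return best_option, scored[best_option]


def toy_predict(features: List[float]) -> float:
    """Placeholder for a trained SVR model (an assumption for this sketch)."""
    return min(1.0, features[-1] / 2.0)


# Invented feature vectors for three of the 48 options.
alignments = {
    (9, 1, 1, 1): [0.5, 0.7, 1.2],
    (5, 1, 0, 0): [0.1, 0.2, 0.3],
    (13, 2, 0, 2): [0.4, 0.4, 0.8],
}
best_option, best_score = adaptive_select(alignments, toy_predict)

# Share of the achievable gain captured, from the "All" column of Table 3.
single_best, upper_limit, adaptive = 0.2386, 0.2887, 0.2563
captured = (adaptive - single_best) / (upper_limit - single_best)
print(best_option, round(captured * 100, 1))
```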

Performance at three levels of SCOP hierarchy

In this section, we describe performance at three levels of the SCOP hierarchy (family, superfamily, and fold) to examine closely where the improvement is achieved. All the experiments of the previous section are repeated for testing sets at the three levels. The density diagram in Additional file 1 shows the correlation at the family level. It looks similar to Figure 2a except that the correlation is weak in the low MaxSub score region. The likely reason is that alignments of pairs at the family level tend to have high MaxSub scores, so the SVR models have not seen enough alignments with low MaxSub scores during training. The correlation coefficient, MAE, NMAE, and RMSE are 0.9185, 0.0630, 0.3112, and 0.0936, respectively (Table 1). Additional file 1 also shows the number of alignments in different regions of observed MaxSub score. Additional file 2 shows the correlation at the superfamily level, which is rather weak in the high MaxSub score region. The correlation coefficient, MAE, NMAE, and RMSE are 0.8318, 0.0773, 1.8344, and 0.0962, respectively (Table 1). In contrast to the family level, there are few examples in the high observed MaxSub region, which explains the weak correlation there. The density map in Additional file 3 represents the correlation at the fold level. The correlation coefficient, MAE, NMAE, and RMSE are 0.6106, 0.0848, 2.6738, and 0.0988, respectively (Table 1). As at the superfamily level, the correlation appears weak in the high score region. In Table 2, the MaxSub scores are presented at the three levels. The averages are 0.6090, 0.1713, and 0.0651, and the values for the best options are 0.6183, 0.1888, and 0.0793 at the family, superfamily, and fold levels, respectively. These values are also compared with the corresponding scores achieved by the adaptive selection method (Table 3).
The adaptive selection method also shows higher performance at every SCOP level, as it does for the overall testing set, with improvements of 1.16%, 12.7%, and 20.2% at the family, superfamily, and fold levels, respectively (measured against the best single option at each level in Table 2). To check the diversity of the test set, the sequence identities of the query-template pairs are presented in Figure 3 for each SCOP level: family (Figure 3a), superfamily (Figure 3b), and fold (Figure 3c). The average sequence identities are 30.95%, 13.03%, and 11.51%, respectively. Except for some pairs in the family-level test set, the sequence identities of almost all pairs are under 35%, in the "twilight zone" [32]. This distribution indicates that our results do not depend on high sequence identity.
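The per-level percentages can be reproduced from Tables 2 and 3 if the baseline at each level is taken to be the best single option for that level (0.6183, 0.1888, and 0.0793). That choice of baseline is an inference from the numbers, not something the text states explicitly. A short check:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# (best single option at that level from Table 2,
#  adaptive selection, observed, from Table 3)
levels = {
    "family": (0.6183, 0.6255),
    "superfamily": (0.1888, 0.2128),
    "fold": (0.0793, 0.0953),
}
gains = {name: relative_gain(base, adaptive) for name, (base, adaptive) in levels.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```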
Figure 3

Distribution of sequence identities on the test set. Distribution of sequence identities of the query-template pairs on the test set at (a) family (b) superfamily (c) fold level.


Alignments of pairs that are not related

All protein pairs in the testing set used in the experiments share a similar structure at least at the fold level (see Methods). It is therefore necessary to check whether the trained SVR models perform reliably for proteins that do not share the same fold. To check this, 10 unrelated proteins per template are randomly selected, aligned against the template, and transformed into feature vectors. The vectors are then passed to the SVR model of the template to predict MaxSub scores. All the observed MaxSub scores are zero without exception, so ideally all the predicted values should be zero. A histogram of the predicted values is shown in Figure 4. Unfortunately, most predicted values are not zero: the mean is 0.1979 and the standard deviation is 0.1257. We can infer that the SVR models overestimate MaxSub scores in the low MaxSub score region. The histogram shows that alignments predicted to have a MaxSub score near 0.1 might have a true MaxSub score of zero. Thus, a predicted MaxSub score that is low but not zero should be examined carefully.
Figure 4

Histogram of predicted MaxSub scores of the alignments of the pairs that are not related at the fold level.


Alignments of the pairs whose MaxSub scores are zero despite being in the same family

Two proteins in the same SCOP family are expected to have similar 3D structures. There are, however, many alignments of pairs in the same family for which the observed MaxSub score is zero (Additional file 1). When the MaxSub score is zero, the alignment is completely incorrect by definition [25]. For these pairs, we check how much improvement can be achieved by the adaptive selection method. Figure 5 shows a histogram of the MaxSub scores obtained by the adaptive selection method for the alignments of those pairs. For about 37.3% of the pairs there is no improvement, while about 62.7% achieve some improvement. In other words, around 63% of the completely incorrect alignments between proteins related at the family level become partially correct alignments when the alignment options are changed by the adaptive selection method.
Figure 5

Histogram of MaxSub scores by adaptive selection method for the alignments of the pairs sharing the same family whose MaxSub score is zero when single best alignment option method is used.

What, then, are the reasons that the remaining 37.3% of pairs gain no improvement? The most obvious one is regression error: the adaptive selection method might select a wrong option because of regression error even though another option would give an improved MaxSub score. When we examine the data, 17.9% appear to be of this type. Second, it may result from the limitations of profile-profile alignment. It is well known that a profile-profile alignment is not always optimal when compared with the structure alignment, and the method may fail to align a query against a particular template with any alignment option because of a problem in the alignment method itself. The third reason may be the lack of alignment options in our method. Although 48 options are used in our work, they may not be sufficient because they do not cover all possible cases; for example, an abnormally large gap opening penalty might be necessary to align a particular pair of proteins. The fourth reason may be the limitation of the MaxSub score as a measure of alignment quality. A number of assessment methods for alignment quality exist, and which evaluation method is best remains controversial. Alternative measures include GDT_TS [33,34], LGscore [35], and MAMMOTH [36]. In addition, the MaxSub score is basically a sequence-dependent assessment, in which only corresponding residues in the alignment are compared. It is stricter than sequence-independent assessment [37,38] for alignments that are slightly shifted from the optimal alignment, which might make the MaxSub scores of some alignments zero. Our method might be improved by combining sequence-dependent and sequence-independent methods.
Finally, some template structures may not be good for predicting the structure of a query protein even though they are in the same family as the query. One example is the alignment of the query protein d1tsk__ against the template d1chl__, both of which belong to the same family (g.3.7.2). The MaxSub scores of the alignments generated with all 48 options are zero. To check whether this is caused by a problem of the profile-profile method, we perform a structural alignment with the CE algorithm [39] and find that the MaxSub score of this structural alignment is also zero. Figure 6 shows a superposition of these two proteins. It can be inferred that some templates are bad for structure prediction even though they are members of the same family as the query protein. This might be caused by the strict definition of MaxSub. However, from the viewpoint of MaxSub, the template d1chl__ is apparently a bad one for the query.
Figure 6

Superposition of 2 SCOP domains. Superposition of SCOP domain d1tsk__ (bright) onto d1chl__ (dark), both of which belong to the same family (g.3.7.2).

Such alignments are tested with the fold recognition method developed in the previous study [24] to see their fold recognition scores. The raw SVM outputs are converted into posterior probabilities [40], ranging from zero to one, and the distribution of these probabilities is shown in Figure 7. The distribution exhibits two peaks, near zero and near one. If we choose a decision threshold of 0.5, roughly 15% of the pairs are classified as protein pairs sharing the same family. Consider a situation in which one tries to predict a protein structure and chooses templates by fold-recognition score only. In some cases, if a template is selected simply because it is predicted to be homologous at the family level, the structure prediction may fail because of the wrong choice of template. The adaptive selection method may help to filter out such templates and prevent one from selecting them.
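One common way to turn a raw SVM output into a posterior probability is to pass it through a fitted sigmoid. Whether reference [40] uses exactly this form is an assumption here, and the coefficients `a` and `b` below are placeholders rather than fitted values. The sketch only illustrates how a 0.5 decision threshold would then be applied.

```python
import math


def sigmoid_posterior(svm_output: float, a: float = -2.0, b: float = 0.0) -> float:
    """Map a raw SVM decision value to a probability in (0, 1).

    a and b are placeholder values; in practice they would be fitted on
    held-out data. A negative a makes larger outputs more probable.
    """
    return 1.0 / (1.0 + math.exp(a * svm_output + b))


def predicted_same_family(svm_output: float, threshold: float = 0.5) -> bool:
    """Apply the decision threshold to the posterior probability."""
    return sigmoid_posterior(svm_output) >= threshold


for value in (-1.0, 0.0, 1.0):
    print(value, round(sigmoid_posterior(value), 3), predicted_same_family(value))
```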
Figure 7

Distribution of posterior probabilities of outputs of SVM for fold-recognition.


Benchmark test

The benchmark test of the adaptive selection method is carried out on 62 targets of CASP7. We use ESyPred3D and the Multiple Mapping Method (MMM) for comparison. Both are publicly available web servers that provide alignments and 3D models, and we use their default options. Of the 88 CASP7 targets, 77 have a significantly close template in SCOP 1.69 according to the fold search by Proteinsilico [41]. For 62 of those targets, the templates are among those trained in our dataset, and these target-template pairs are used in the benchmark. Table 4 shows the performance of MMM, ESyPred3D, and the adaptive selection method. The greatest values of MaxSub, Mammoth Z-score, TM-score [42], and GDT_TS for each pair are bolded. On average, our method gives better alignments, with a larger MaxSub score than the other two methods (0.264 vs. 0.203 and 0.182). The adaptive method also shows the best performance on the other measures. In addition, the improvements of our method are statistically significant according to p-values calculated by the Wilcoxon signed rank test [31] at a significance level of 0.05.
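For readers unfamiliar with the Wilcoxon signed rank test, the sketch below shows the paired computation on the MaxSub scores of the first ten target-template pairs listed in Table 4 (Adaptive Method vs. MMM). It is a simplified illustration: zero differences are dropped, tied ranks are averaged, and the two-sided p-value uses the normal approximation without a tie correction, which is rough for so few pairs. The result on this subset is not the p-value reported for the full benchmark.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, two-sided p) for paired samples, normal approximation."""
    # Round to suppress floating-point noise, then drop zero differences.
    diffs = [d for d in (round(a - b, 6) for a, b in zip(x, y)) if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2))


# MaxSub scores of the first ten pairs in Table 4.
adaptive = [0.212, 0.756, 0.000, 0.484, 0.556, 0.240, 0.288, 0.055, 0.466, 0.000]
mmm = [0.000, 0.743, 0.000, 0.515, 0.565, 0.221, 0.000, 0.048, 0.199, 0.000]
w_plus, p_value = wilcoxon_signed_rank(adaptive, mmm)
print(w_plus, p_value)
```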
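Table 4

Alignment performances on CASP7 using MMM, ESyPred3D, and the adaptive method. The highest value for each pair is bolded. P-values are calculated by the Wilcoxon signed rank test.

| Target | Template | MaxSub MMM | MaxSub ESyPred3D | MaxSub Adaptive | Mammoth Z MMM | Mammoth Z ESyPred3D | Mammoth Z Adaptive | TM-score MMM | TM-score ESyPred3D | TM-score Adaptive | GDT_TS MMM | GDT_TS ESyPred3D | GDT_TS Adaptive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T0283 | 1pq1a_ | 0.000 | 0.000 | 0.212 | 2.30 | 1.18 | 4.69 | 0.234 | 0.216 | 0.315 | 0.210 | 0.208 | 0.301 |
| T0288 | 1r6ja_ | 0.743 | 0.668 | 0.756 | 12.42 | 12.81 | 12.56 | 0.775 | 0.720 | 0.767 | 0.769 | 0.712 | 0.769 |
| T0289 | 1boub_ | 0.000 | 0.000 | 0.000 | 1.36 | 1.13 | 0.48 | 0.186 | 0.160 | 0.179 | 0.073 | 0.070 | 0.075 |
| T0291 | 1rdqe_ | 0.515 | 0.483 | 0.484 | 28.76 | 28.03 | 27.82 | 0.735 | 0.691 | 0.700 | 0.568 | 0.542 | 0.538 |
| T0292 | 1rdqe_ | 0.565 | 0.579 | 0.556 | 27.90 | 28.48 | 29.66 | 0.768 | 0.786 | 0.774 | 0.617 | 0.639 | 0.631 |
| T0293 | 1jg1a_ | 0.221 | 0.000 | 0.240 | 12.65 | 11.89 | 12.20 | 0.405 | 0.175 | 0.385 | 0.290 | 0.082 | 0.300 |
| T0295 | 1jg1a_ | 0.000 | 0.212 | 0.288 | 0.23 | 8.82 | 11.94 | 0.171 | 0.252 | 0.367 | 0.093 | 0.208 | 0.299 |
| T0296 | 1p1xa_ | 0.048 | 0.000 | 0.055 | 2.40 | 3.68 | 5.78 | 0.216 | 0.175 | 0.194 | 0.077 | 0.068 | 0.091 |
| T0297 | 1k7ca_ | 0.199 | 0.216 | 0.466 | 17.17 | 18.46 | 16.71 | 0.421 | 0.425 | 0.596 | 0.289 | 0.299 | 0.493 |
| T0299 | 1p5fa_ | 0.000 | 0.000 | 0.000 | 0.09 | 0.36 | 0.40 | 0.192 | 0.159 | 0.170 | 0.132 | 0.107 | 0.113 |
| T0300 | 1rh5b_ | 0.256 | 0.000 | 0.234 | 3.83 | 2.19 | 4.32 | 0.276 | 0.239 | 0.249 | 0.346 | 0.258 | 0.289 |
| T0302 | 1orja_ | 0.000 | 0.000 | 0.149 | 1.72 | 1.92 | 1.72 | 0.257 | 0.224 | 0.265 | 0.210 | 0.189 | 0.239 |
| T0303 | 1o08a_ | 0.520 | 0.517 | 0.425 | 25.55 | 25.54 | 25.48 | 0.743 | 0.718 | 0.666 | 0.607 | 0.588 | 0.554 |
| T0304 | 1j3wa_ | 0.192 | 0.000 | 0.000 | 0.51 | 1.91 | 2.79 | 0.301 | 0.140 | 0.200 | 0.280 | 0.126 | 0.220 |
| T0305 | 1lyva_ | 0.440 | 0.410 | 0.451 | 26.72 | 22.98 | 26.97 | 0.705 | 0.568 | 0.668 | 0.532 | 0.444 | 0.522 |
| T0308 | 1f4pa_ | 0.181 | 0.000 | 0.156 | 8.29 | 1.92 | 8.78 | 0.396 | 0.139 | 0.348 | 0.289 | 0.094 | 0.247 |
| T0310 | 1us6a_ | 0.000 | 0.000 | 0.000 | 1.03 | 0.79 | 2.83 | 0.085 | 0.055 | 0.060 | 0.049 | 0.041 | 0.032 |
| T0315 | 1i0da_ | 0.388 | 0.296 | 0.457 | 25.66 | 18.39 | 26.55 | 0.667 | 0.582 | 0.720 | 0.476 | 0.394 | 0.541 |
| T0316 | 1kqpa_ | 0.140 | 0.115 | 0.169 | 10.72 | 15.53 | 13.13 | 0.316 | 0.227 | 0.277 | 0.170 | 0.146 | 0.190 |
| T0317 | 1byi__ | 0.174 | 0.000 | 0.169 | 0.85 | 8.44 | 6.16 | 0.314 | 0.246 | 0.269 | 0.237 | 0.169 | 0.212 |
| T0318 | 1rtqa_ | 0.048 | 0.098 | 0.097 | 6.47 | 5.48 | 16.72 | 0.191 | 0.256 | 0.251 | 0.070 | 0.117 | 0.130 |
| T0321 | 1jbea_ | 0.000 | 0.000 | 0.000 | 2.77 | 1.82 | 1.81 | 0.166 | 0.155 | 0.146 | 0.090 | 0.081 | 0.090 |
| T0322 | 1vh5a_ | 0.614 | 0.596 | 0.603 | 16.48 | 16.95 | 17.75 | 0.707 | 0.673 | 0.698 | 0.622 | 0.599 | 0.629 |
| T0323 | 1c20a_ | 0.000 | 0.000 | 0.000 | 1.01 | 0.58 | 1.62 | 0.173 | 0.118 | 0.114 | 0.103 | 0.080 | 0.083 |
| T0324 | 1o08a_ | 0.562 | 0.562 | 0.582 | 25.93 | 24.76 | 24.84 | 0.748 | 0.760 | 0.750 | 0.604 | 0.626 | 0.629 |
| T0325 | 1i0da_ | 0.101 | 0.000 | 0.103 | 13.96 | 4.04 | 5.39 | 0.322 | 0.221 | 0.395 | 0.183 | 0.120 | 0.217 |
| T0326 | 1p5fa_ | 0.065 | 0.131 | 0.190 | 8.81 | 7.48 | 14.35 | 0.220 | 0.241 | 0.349 | 0.112 | 0.148 | 0.238 |
| T0328 | 1mwqa_ | 0.000 | 0.000 | 0.089 | 0.42 | 3.73 | 9.59 | 0.126 | 0.095 | 0.163 | 0.060 | 0.049 | 0.110 |
| T0329 | 1o08a_ | 0.464 | 0.471 | 0.471 | 25.57 | 25.29 | 24.67 | 0.683 | 0.655 | 0.661 | 0.545 | 0.529 | 0.529 |
| T0330 | 1o08a_ | 0.424 | 0.365 | 0.376 | 27.13 | 21.07 | 22.94 | 0.694 | 0.649 | 0.604 | 0.552 | 0.519 | 0.496 |
| T0332 | 1io0a_ | 0.000 | 0.000 | 0.117 | 3.31 | 1.66 | 3.78 | 0.235 | 0.180 | 0.254 | 0.178 | 0.137 | 0.193 |
| T0335 | 1hz4a_ | 0.000 | 0.000 | 0.458 | 1.76 | 3.50 | 3.80 | 0.267 | 0.312 | 0.377 | 0.464 | 0.476 | 0.542 |
| T0338 | 1tqga_ | 0.000 | 0.000 | 0.000 | 0.71 | 1.92 | 2.02 | 0.148 | 0.121 | 0.101 | 0.093 | 0.089 | 0.090 |
| T0339 | 1lc5a_ | 0.177 | 0.119 | 0.237 | 20.18 | 20.52 | 26.16 | 0.476 | 0.417 | 0.531 | 0.247 | 0.207 | 0.316 |
| T0340 | 1r6ja_ | 0.697 | 0.740 | 0.746 | 13.45 | 11.95 | 12.35 | 0.743 | 0.755 | 0.762 | 0.742 | 0.742 | 0.758 |
| T0341 | 1qcza_ | 0.088 | 0.000 | 0.079 | 1.72 | 1.35 | 1.38 | 0.203 | 0.141 | 0.163 | 0.110 | 0.067 | 0.107 |
| T0353 | 2igd__ | 0.255 | 0.260 | 0.280 | 5.93 | 5.68 | 6.84 | 0.315 | 0.318 | 0.339 | 0.338 | 0.347 | 0.365 |
| T0354 | 1fm0e_ | 0.000 | 0.000 | 0.000 | 1.11 | 4.69 | 1.25 | 0.230 | 0.191 | 0.220 | 0.211 | 0.164 | 0.209 |
| T0356 | 1j27a_ | 0.000 | 0.000 | 0.000 | -0.24 | -0.87 | 0.46 | 0.083 | 0.087 | 0.045 | 0.034 | 0.040 | 0.031 |
| T0357 | 1nxja_ | 0.000 | 0.000 | 0.220 | 10.80 | 5.39 | 6.19 | 0.221 | 0.186 | 0.294 | 0.169 | 0.129 | 0.278 |
| T0359 | 1r6ja_ | 0.677 | 0.629 | 0.683 | 13.48 | 11.68 | 12.35 | 0.718 | 0.663 | 0.701 | 0.707 | 0.634 | 0.699 |
| T0361 | 1t7ra_ | 0.126 | 0.112 | 0.144 | 4.05 | 1.21 | 1.98 | 0.221 | 0.225 | 0.192 | 0.185 | 0.173 | 0.173 |
| T0362 | 1vh5a_ | 0.000 | 0.000 | 0.449 | 10.45 | 2.05 | 13.78 | 0.167 | 0.187 | 0.538 | 0.115 | 0.162 | 0.477 |
| T0363 | 2igd__ | 0.000 | 0.000 | 0.321 | 4.66 | 4.02 | 4.02 | 0.176 | 0.156 | 0.343 | 0.178 | 0.166 | 0.372 |
| T0364 | 1vh5a_ | 0.220 | 0.253 | 0.489 | 12.88 | 7.84 | 14.39 | 0.346 | 0.312 | 0.557 | 0.276 | 0.271 | 0.508 |
| T0365 | 1g73a_ | 0.146 | 0.000 | 0.109 | 5.23 | 4.47 | 4.83 | 0.247 | 0.162 | 0.190 | 0.188 | 0.115 | 0.140 |
| T0366 | 1r6ja_ | 0.685 | 0.754 | 0.781 | 12.59 | 11.78 | 11.37 | 0.722 | 0.774 | 0.777 | 0.702 | 0.780 | 0.765 |
| T0367 | 1ug7a_ | 0.000 | 0.000 | 0.220 | 7.65 | 4.20 | 7.16 | 0.274 | 0.196 | 0.304 | 0.242 | 0.172 | 0.264 |
| T0368 | 1hz4a_ | 0.211 | 0.159 | 0.208 | 5.65 | 4.49 | 4.16 | 0.327 | 0.295 | 0.308 | 0.261 | 0.239 | 0.266 |
| T0369 | 1orja_ | 0.166 | 0.000 | 0.000 | 4.12 | 2.40 | 5.12 | 0.234 | 0.148 | 0.167 | 0.225 | 0.126 | 0.157 |
| T0371 | 1f4pa_ | 0.000 | 0.000 | 0.000 | 0.34 | 2.51 | 5.31 | 0.158 | 0.154 | 0.144 | 0.078 | 0.089 | 0.086 |
| T0372 | 1m44a_ | 0.000 | 0.000 | 0.133 | 0.04 | 0.14 | 0.55 | 0.143 | 0.131 | 0.246 | 0.064 | 0.057 | 0.168 |
| T0373 | 1rh5b_ | 0.000 | 0.000 | 0.000 | -0.06 | 3.79 | 3.02 | 0.132 | 0.147 | 0.153 | 0.125 | 0.155 | 0.154 |
| T0374 | 1m44a_ | 0.293 | 0.386 | 0.384 | 13.33 | 15.66 | 14.94 | 0.445 | 0.566 | 0.545 | 0.361 | 0.463 | 0.459 |
| T0375 | 1bx4a_ | 0.482 | 0.457 | 0.508 | 32.34 | 29.72 | 29.86 | 0.741 | 0.696 | 0.730 | 0.542 | 0.510 | 0.541 |
| T0376 | 1twda_ | 0.125 | 0.108 | 0.191 | 14.05 | 15.93 | 15.39 | 0.303 | 0.346 | 0.382 | 0.167 | 0.189 | 0.261 |
| T0378 | 1sdsa_ | 0.089 | 0.000 | 0.159 | 2.02 | 0.35 | 8.18 | 0.148 | 0.122 | 0.227 | 0.095 | 0.071 | 0.177 |
| T0379 | 1o08a_ | 0.299 | 0.360 | 0.403 | 15.24 | 15.96 | 15.46 | 0.456 | 0.536 | 0.569 | 0.353 | 0.415 | 0.455 |
| T0381 | 1tf1a_ | 0.534 | 0.586 | 0.582 | 21.04 | 23.79 | 23.68 | 0.629 | 0.655 | 0.651 | 0.537 | 0.565 | 0.560 |
| T0382 | 1kpsb_ | 0.000 | 0.171 | 0.000 | 1.65 | 1.30 | 3.08 | 0.263 | 0.280 | 0.240 | 0.229 | 0.254 | 0.238 |
| T0384 | 1jg1a_ | 0.000 | 0.000 | 0.000 | 0.16 | 1.08 | 1.67 | 0.180 | 0.140 | 0.127 | 0.068 | 0.084 | 0.083 |
| T0385 | 1lb3a_ | 0.462 | 0.489 | 0.659 | 14.60 | 17.50 | 18.96 | 0.633 | 0.618 | 0.759 | 0.546 | 0.570 | 0.659 |
| mean | | 0.203 | 0.182 | 0.264 | 9.564 | 9.086 | 10.712 | 0.367 | 0.338 | 0.391 | 0.292 | 0.273 | 0.328 |
| p-value | | 1.042E-04 | 2.163E-07 | - | 3.8E-03 | 5.761E-05 | - | 0.2133 | 1.898E-06 | - | 9.996E-04 | 2.209E-08 | - |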
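The paired comparison behind the p-values can be illustrated with a small sketch of the Wilcoxon signed rank test. This is not the authors' code: it is a plain-Python version that assumes the two-sided normal approximation without tie or continuity correction, and the function name `wilcoxon_signed_rank` is our own. The sample values are the MaxSub scores of the first six targets in Table 4 (adaptive method vs. MMM). With so few pairs the approximation is crude, so the sketch only shows the mechanics.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences share their average rank. No tie or
    continuity correction is applied (a simplifying assumption).
    """
    # Round to suppress floating-point noise before testing for zeros and ties.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, p_value

# MaxSub scores for targets T0283 to T0293 (first six rows of Table 4).
adaptive = [0.212, 0.756, 0.000, 0.484, 0.556, 0.240]
mmm = [0.000, 0.743, 0.000, 0.515, 0.565, 0.221]
w_plus, w_minus, p_value = wilcoxon_signed_rank(adaptive, mmm)
```

Applying the same function to all 62 pairs of a column would give a statistic comparable in spirit to the reported p-values, although an exact or tie-corrected implementation may differ numerically.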

Conclusion

In protein sequence alignment, a single set of alignment parameters is generally used for all protein pairs, regardless of their evolutionary relationship. In some cases, many alignments are generated with different combinations of alignment parameters, and the potentially optimal alignment is then chosen purely on the basis of experience or intuition. In this work, by contrast, we select the alignment parameters that are predicted to give the highest MaxSub score for a specific query-template pair. Our work differs from other efforts to improve the quality of protein sequence alignments in that we directly predict alignment quality, with a Pearson correlation coefficient of 0.945 between observed and predicted MaxSub scores. By predicting the alignment quality and then choosing the alignment parameters on the basis of that prediction, we show that the alignment quality can be improved significantly. Our method can be used to select not only the optimal alignment parameters for a chosen template but also good templates with which the structure of a query protein can be best predicted.

In summary, we develop a method to predict the MaxSub score, as a measure of alignment quality, for a given profile-profile alignment between a query and a template. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector. These feature vectors are used to train the SVR models for the templates. We test the performance of the method using several evaluation measures: Pearson correlation coefficient, MAE, NMAE, and RMSE. The results show a high correlation coefficient of 0.945 and low prediction errors. The trained SVR models are then applied to select the best alignment option for each query-template pair. This adaptive selection procedure yields a 7.4% improvement in MaxSub scores compared with using the single best option for all query-template pairs.
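The adaptive selection step itself is simple once predicted scores are available. The sketch below assumes the predictions for one query-template pair are held in a dictionary keyed by a parameter tuple. The function name, the tuple layout, and the three example entries are hypothetical and stand in for the 48 real combinations.

```python
def select_best_alignment(predicted_scores):
    """Return the (parameters, score) pair with the highest predicted MaxSub."""
    if not predicted_scores:
        raise ValueError("no candidate alignments")
    return max(predicted_scores.items(), key=lambda item: item[1])

# Hypothetical predictions for three parameter combinations, keyed by
# (gap_open, gap_extension, baseline, secondary_structure_weight).
predicted = {
    (5, 1, 0, 0.5): 0.31,
    (9, 2, 1, 0.5): 0.42,
    (13, 1, 0, 1.0): 0.27,
}
best_params, best_score = select_best_alignment(predicted)
```

The alignment generated with `best_params` would then be the one passed on to model building for that pair.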

Methods

Data

To build the template library, we use the SCOP version 1.69 classification [43]. First, a fold library of ~11,130 domains is constructed from the domain subsets with less than 90% sequence identity to each other, as prepared by the ASTRAL Compendium [44]. We choose the folds containing at least 20 members for training and testing the SVR models, which gives a total of 7509 domains in 122 folds. To estimate the performance, we employ a three-fold cross-validation procedure: in each round, two thirds of the domains are used for training and the remaining third for testing.
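A three-fold split of this kind can be sketched as follows. The text does not say how domains were assigned to folds, so the random shuffle, the fixed seed, and the function name `three_fold_splits` are assumptions made for illustration only.

```python
import random

def three_fold_splits(items, seed=0):
    """Yield (train, test) lists so that each third is the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into three roughly equal folds.
    folds = [shuffled[i::3] for i in range(3)]
    for k in range(3):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test

# Hypothetical domain identifiers standing in for the members of one SCOP fold.
domains = [f"domain_{i}" for i in range(9)]
splits = list(three_fold_splits(domains))
```

Each domain appears in exactly one test set, and every training set holds the other two thirds.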

MaxSub score as alignment quality (target of each SVR)

Conventionally, alignment quality is calculated by comparing the sequence alignment with structure alignments generated by structure alignment programs such as SARF [30], CE, and MAMMOTH, on the assumption that the structure alignments are the gold standard. A problem with this approach is that the structure alignments can vary significantly depending on the choice of structure alignment program, especially for distant homolog pairs. In a different approach, a structure model of the query protein is first generated quickly by directly copying the Cα positions of all aligned residues of the template protein according to the sequence alignment. A protein structure model quality measure such as MaxSub [25] or TM-score [42] is then calculated and used as the alignment quality score. The second approach is more relevant to the present study, because the main focus of this work is how to generate good sequence alignments that will eventually lead to better structure models. Specifically, we use MaxSub [25], a popular model quality measure that finds the largest subset of Cα atoms of a model that superimpose well on the experimental structure. At the training stage, each alignment is converted into a structure model of the query protein. The MaxSub score is then calculated from the model derived from the alignment and the correct structure, with the d parameter set to 3.5 Å, which has been found to be a good choice for the evaluation of fold-recognition models [25]. We also considered using TM-score [42], another popular model quality measure, as the alignment quality measure. However, the correlation between MaxSub scores and TM-scores turned out to be as high as 0.95. We therefore expect that our choice of the MaxSub score as the alignment quality measure does not affect the performance of our method or the main conclusion of this work.
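The model-building step, copying template Cα coordinates onto aligned query residues, can be sketched in a few lines. This is an illustration under our own assumptions: the alignment is given as two equal-length gapped strings with '-' as the gap character, coordinates are plain (x, y, z) tuples, and the function name `build_ca_model` is hypothetical. Computing MaxSub on the resulting model is left to the MaxSub program and is not shown.

```python
def build_ca_model(query_aln, template_aln, template_ca):
    """Map query residue index -> C-alpha coordinate copied from the template.

    query_aln and template_aln are equal-length gapped strings ('-' = gap).
    template_ca holds one (x, y, z) tuple per template residue.
    Query residues aligned to a gap get no coordinate.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    model = {}
    q_idx = 0  # index of the next query residue
    t_idx = 0  # index of the next template residue
    for q_char, t_char in zip(query_aln, template_aln):
        if q_char != "-" and t_char != "-":
            model[q_idx] = template_ca[t_idx]
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1
    return model

# Toy example: four template residues spaced 3.8 Angstrom apart along x.
template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = build_ca_model("AC-DE", "A-GDE", template_ca)
```

In the toy alignment, query residue 1 faces a gap and receives no coordinate, while template residue 1 is skipped because it has no aligned query residue.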

Profile-profile alignments and SVR feature vectors

To train SVR models for all templates in the training set, we adopt the feature vector scheme developed in previous work [24] with a slight modification. We first generate all-against-all alignments within each set of domains sharing the same fold, using a profile-profile alignment scheme with 48 different combinations of alignment parameters (gap opening penalty, gap extension penalty, baseline score, and weight of predicted secondary structure). The profile-profile alignment score for aligning position i of a query q with position j of a template t is given by

$$m_{ij} = \sum_{k=1}^{20} \left( f^{q}_{ik} S^{t}_{jk} + S^{q}_{ik} f^{t}_{jk} \right) + s_{ij} + b$$

where $f^{q}_{ik}$, $f^{t}_{jk}$, $S^{q}_{ik}$, and $S^{t}_{jk}$ are the frequencies and the position-specific score matrix (PSSM) scores of amino acid k at position i of the query q and position j of the template t, respectively. For the secondary structure score ($s_{ij}$), a positive score is added if the predicted secondary structure of the query protein at position i is of the same type as the secondary structure of the template protein at position j, and subtracted if it is of a different type. Finally, the constant baseline score (b) is added to the alignment score. The frequency matrices and PSSMs are generated by running PSI-BLAST [8] with default parameters except for the number of iterations (j = 11) and the E-value cutoff (h = 0.001). For each template of length n in the training set, alignments with the other templates in the training set are generated. Each of these alignments is then transformed into an (n + 1)-dimensional feature vector, ($s_1$, $s_2$, ..., $s_n$, query_length), where $s_i$ is the profile-profile alignment score at position i of the given template [45] and query_length is the length of the query protein (Figure 1). If gaps occur, fixed negative scores are arbitrarily assigned. This is a modified version of the scheme in [24]: we use query_length instead of the total alignment score. Because the size of the vector depends on the length n of the template protein, we build a separate SVR model for each template, so the number of SVR models equals the number of templates.
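The (n + 1)-dimensional feature vector can be assembled as in the following sketch. The gap value of -1.0 is a placeholder, since the text only states that a fixed negative score is assigned. The dictionary input format and the function name `alignment_to_feature_vector` are likewise assumptions for illustration.

```python
GAP_SCORE = -1.0  # hypothetical value; the text only says "fixed negative scores"

def alignment_to_feature_vector(template_length, aligned_scores, query_length):
    """Build (s_1, ..., s_n, query_length) for a template of length n.

    aligned_scores maps 0-based template positions to the profile-profile
    score of the aligned column. Positions absent from the mapping are
    treated as gaps and receive GAP_SCORE.
    """
    vector = [aligned_scores.get(i, GAP_SCORE) for i in range(template_length)]
    vector.append(float(query_length))
    return vector

# Toy template of length 4 in which positions 1 and 3 are aligned to gaps.
features = alignment_to_feature_vector(4, {0: 2.5, 2: 0.7}, query_length=120)
```

Because the vector length is fixed by the template, every alignment against the same template yields a vector of the same dimension, which is what allows one regression model per template.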

SVR training

Each SVR model is trained only on alignments with templates that share at least the same fold as the target template. To learn from as many alignment examples as possible, 48 alignments are made for each query-template pair (Table 2). The gap opening penalty ranges from 5 to 13, the gap extension penalty is one or two, and the baseline score is zero or one. The weight of the predicted secondary structure information is also varied. The input and the target of each SVR are the feature vector and the MaxSub score described in the previous two sections. We would like to emphasize that no alignment is treated as the correct example: regression simply predicts a real value, here the MaxSub score, for every alignment. In the training step, the SVR models are trained on the input-target pairs of the training samples using SVMlight version 6.01 [46] with a radial basis function (RBF) kernel and the gamma parameter set to 0.001, without attempting serious performance optimization.
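For reference, the RBF kernel with the gamma value quoted above can be written out directly. The sketch assumes the common form K(x, y) = exp(-gamma * ||x - y||^2) and is only meant to show how two feature vectors of the same template are compared. It does not reproduce SVMlight's training procedure.

```python
import math

def rbf_kernel(x, y, gamma=0.001):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two hypothetical feature vectors for the same template (n = 3, plus query length).
v1 = [2.5, -1.0, 0.7, 120.0]
v2 = [2.0, -1.0, 1.1, 110.0]
similarity = rbf_kernel(v1, v2)
```

With a small gamma such as 0.001 the kernel decays slowly, so vectors that differ mainly in the query length component still receive a noticeable similarity.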

Availability and requirements

The method is implemented as a part of FORECAST, a platform-independent web server. It is freely available without any restriction at

Authors' contributions

ML wrote the code for the analysis, carried out the training and testing of the SVRs, and drafted the manuscript. CSJ wrote the code for profile-profile alignment and implemented the code that generates the input feature vectors for the SVRs. DK participated in the design of the work and collaborated in writing the manuscript. All authors have read and approved the manuscript.

Additional file 1

Performance of SVR models at the family level. (a) Correlations between observed and predicted MaxSub scores at the family level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution.

Additional file 2

Performance of SVR models at the superfamily level. (a) Correlations between observed and predicted MaxSub scores at the superfamily level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution.

Additional file 3

Performance of SVR models at the fold level. (a) Correlations between observed and predicted MaxSub scores at the fold level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution.