
Application of amino acid occurrence for discriminating different folding types of globular proteins.

Y-h Taguchi, M Michael Gromiha.

Abstract

BACKGROUND: Predicting the three-dimensional structure of a protein from its amino acid sequence is a long-standing goal in computational/molecular biology. The discrimination of different structural classes and folding types is an intermediate step in protein structure prediction.
RESULTS: In this work, we have proposed a method based on linear discriminant analysis (LDA) for discriminating 30 different folding types of globular proteins using amino acid occurrence. Our method was tested with a non-redundant set of 1612 proteins and discriminated them with an accuracy of 38%, which is comparable to or better than other methods in the literature. A web server has been developed for discriminating the folding type of a query protein from its amino acid sequence and is available at http://granular.com/PROLDA/.
CONCLUSION: Amino acid occurrence has been successfully used to discriminate different folding types of globular proteins. The discrimination accuracy obtained with amino acid occurrence is better than that obtained with amino acid composition and/or amino acid properties. In addition, the method is very fast.

Year:  2007        PMID: 17953741      PMCID: PMC2174517          DOI: 10.1186/1471-2105-8-404

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Deciphering the native conformation of a protein from its amino acid sequence, known as the protein folding problem, is a challenging task. The recognition of proteins belonging to the same fold or structural class is an intermediate step toward protein structure prediction. Over the past few decades, several methods have been proposed for predicting protein structural classes, including discriminant analysis [1], correlation coefficients [2], hydrophobicity profiles [3], amino acid indices [4], the Bayes decision rule [5], amino acid distributions [6], functional domain occurrences [7], a supervised fuzzy clustering approach [8] and amino acid principal component analysis [9]. These methods discriminated protein structural classes with sensitivities of 70–100%, depending mainly on the data set. Wang and Yuan [5] developed a data set of 674 globular protein domains belonging to four different structural classes and reported that methods claiming 100% sensitivity for structural class prediction could achieve only 60% sensitivity on this data set. Alignment profiles, in turn, have been widely used for recognizing protein folds [10,11]. Recently, Cheng and Baldi [12] proposed a machine learning algorithm for fold recognition using secondary structure, solvent accessibility, contact maps and β-strand pairing, which showed a pairwise sensitivity of 27%. Amino acid properties have also been used for discriminating membrane proteins [13], identifying membrane-spanning regions [14], and predicting protein structural classes [15], protein folding rates [16], protein stability [17], etc. In this direction, Ding and Dubchak [18] proposed a method based on neural networks and support vector machines for fold recognition using amino acid composition and five other properties, and reported a cross-validated sensitivity of 45%.
Further, Ofran and Margalit [19] showed that proteins of the same fold share significant similarity in amino acid composition. In this work, we have used amino acid occurrence (not composition) for discriminating 30 different folding types of globular proteins. We have developed a method based on linear discriminant analysis (LDA) that discriminated a set of 1612 proteins with an accuracy of 38%, which is comparable to other methods in the literature, despite the simplicity of the method and the large dataset.

Results and discussion

Role of re-weighting for fold discrimination

We have computed the occurrence of all 20 amino acid residues in each protein, which forms a 20-dimensional vector for each protein, and applied LDA to these vectors for discrimination. We have employed two kinds of LDA, with and without re-weighting. In LDA with re-weighting, i.e. W_k = 1 in eq. (1), all folds contribute equally to the discrimination performance irrespective of the number of proteins in each fold; if one fold has hundreds of proteins and another has only a few, LDA is optimized to achieve the highest performance equally in all folds. This re-weighting is important especially when the number of proteins per fold varies widely. On the other hand, LDA without re-weighting, i.e. W_k = N_k in eq. (1), tends to maximize the performance over the whole dataset. We have used four measures, accuracy, sensitivity, precision and F1, to examine the performance of the method. In general, accuracy tends to show high values without re-weighting, since it is computed over all data. Sensitivity tends to increase with re-weighting, which gives equal weight to each fold. In contrast, precision tends to decrease with re-weighting, since re-weighting increases the false positives for folds with fewer proteins. F1, being the harmonic mean of sensitivity and precision, is largely independent of re-weighting. In Table 1, we present the discrimination results obtained with the different measures and the two kinds of LDA (with and without re-weighting). As expected, re-weighting significantly changed all performances other than F1: it increased the sensitivity, whereas the opposite trend was observed for precision and accuracy. This is due to the divergence in the number of proteins per fold (min. 25, max. 173, mean 54; see Table 2). F1 did not change significantly with re-weighting.
Table 1

Role of re-weighting. Leave-one-out cross validation results [%] obtained with different measures and two types of LDA

                  with re-weighting                      without re-weighting
              sensitivity  precision  F1  accuracy   sensitivity  precision  F1  accuracy
Occurrence        33          29      29     33          28          35      30     38
Composition       27          23      23     26          24          27      27     33
Table 2

Performances of fold recognition. Leave-one-out cross validation performances [%] in each fold. wo: without re-weighting, w: with re-weighting

ID  Fold   Fold Description                                       Number  Ratio  Sensitivity   Precision      F1
                                                                          [%]     wo    w      wo    w     wo    w

all-α
1   a.3    Cytochrome C                                              25     2     24   48      50   27     32   35
2   a.4    DNA/RNA binding 3-helical bundle                         103     6     73   49      43   51     54   50
3   a.24   Four helical up and down bundle                           26     2     23   38      35   20     28   26
4   a.39   EF hand-like fold                                         25     2     40   44      45   26     43   33
5   a.60   SAM domain-like                                           26     2      8   27      29   12     12   16
6   a.118  α-α superhelix                                            47     3     47   45      50   50     48   47
all-β
7   b.1    Immunoglobulin-like β-sandwich                           173    11     76   38      41   69     54   49
8   b.2    Common fold of diphtheria toxin/transcription
           factors/cytochrome f                                      28     2      4   29      11   21      5   24
9   b.6    Cupredoxin-like                                           30     2     27   37      42   22     33   27
10  b.18   Galactose-binding domain-like                             25     2     20   36      50   26     29   30
11  b.29   Concanavalin A-like lectins/glucanases                    26     2     23   27      24   18     24   22
12  b.34   SH3-like barrel                                           42     3      0   29       0   20      -   24
13  b.40   OB-fold                                                   78     5     22   24      24   24     23   24
14  b.82   Double-stranded α-helix                                   34     2     12   18      19   17     15   17
15  b.121  Nucleoplasmin-like                                        42     3     52   52      51   47     52   49
α/β
16  c.1    TIM barrel                                               145     9     44   27      57   65     50   38
17  c.2    NAD(P)-binding Rossmann-fold domains                      77     5     34   31      30   32     32   32
18  c.3    FAD/NAD(P)-binding domain                                 31     2     10   16      13   11     11   13
19  c.23   Flavodoxin-like                                           55     3     11    5      17    8     13    7
20  c.26   Adenine nucleotide α hydrolase-like                       34     2     12   29      14   22     13   25
21  c.37   P-loop containing nucleoside triphosphate hydrolases      95     6     43   34      42   53     43   41
22  c.47   Thioredoxin fold                                          32     2      9   19      38   10     15   13
23  c.55   Ribonuclease H-like motif                                 49     3      4    6      11    8      6    7
24  c.66   S-adenosyl-L-methionine-dependent methyltransferases      34     2     29   29      31   21     30   24
25  c.69   α/β-Hydrolases                                            37     2     35   41      39   34     37   37
α + β
26  d.15   β-Grasp, ubiquitin-like                                   42     3      5   21      40   18      9   19
27  d.17   Cystatin-like                                             25     2      0    8       -    4      -    5
28  d.58   Ferredoxin-like                                          118     7     32    7      17   25     22   11
small
29  g.3    Knottins                                                  80     5     98   89      72   82     83   85
30  g.41   Rubredoxin-like                                           28     2     11   71      75   32     19   44
Remarkably, we achieved an accuracy of 38% (without re-weighting), which is, to our knowledge, the best performance for such a large number of folds (30) and proteins (1612). Further, the method is extremely simple, which indicates that the amino acid occurrence of proteins carries sufficient information to discriminate protein folds.

Discrimination of proteins belonging to different folding types

We have examined the ability of the present method to predict proteins belonging to the 30 major folds. In Table 2, we show the performance of discriminating the 30 different folds. We observed that folds with fewer proteins have sensitivities below 10% without re-weighting. For example, the SAM domain-like fold, which has only 26 proteins, has a sensitivity of 8%. A similar tendency is observed for the folds b.2, b.34, c.3, c.47, c.55, d.15 and d.17. The sensitivity of these folds increased significantly with re-weighting. On the other hand, many folds with fewer than 30 proteins have sensitivities above 20% without re-weighting (e.g., a.3, a.24, a.39, etc.). As there are 30 folds, the expected sensitivity is only 3.3% if classification were random. In Table 2, we also show the ratio between the number of proteins in each fold and the total number of proteins, which ranges from 2 to 11%. Hence the sensitivity of 20% obtained for several folds is significantly higher than random for fold discrimination. Interestingly, most of the folds discriminated with more than 20% sensitivity belong to either the all-α or the all-β class. This might be because these proteins have distinct secondary structural patterns and hence are easy to discriminate. In addition, folds within each of these classes lie close to each other in amino acid occurrence vector space, which leads to high sensitivity. On the other hand, an opposite tendency was observed for precision. Re-weighting decreased the precision for several folds, including a.3, a.24, a.39, a.60, b.6, b.18, b.29, c.23, c.47, d.15, and g.41. Most of these folds have few proteins. The re-weighting procedure thus causes two opposite effects: it increases the sensitivity and decreases the precision. Hence, F1 may be used to balance these effects.
Only two folds, c.23 and d.58, decreased in F1 with re-weighting, while several folds increased significantly (e.g., b.2, b.34, c.26, d.17, and g.41). The comparison between experimental and predicted folds is shown in Fig. 1, where a dark block indicates the presence of many proteins. The data are normalized such that the total percentage for each true fold is 100%. Fig. 1a shows that, without re-weighting, misclassified proteins are mainly assigned to the folds with more proteins (e.g., a.4, b.1, c.1 and d.58). The trend changes after re-weighting: the misclassified proteins tend to remain within the same structural class. In the α + β class especially, the block diagonal region is distributed almost uniformly, which is partially caused by re-weighting. Since each fold is equally weighted, the α + β class, having only three folds, receives less total weight than the other classes; this causes inter-class misclassification between α + β and the other classes. In contrast, the two folds in the small structural class can be discriminated with high accuracy/sensitivity/precision/F1, whereas the α + β folds are difficult to discriminate with our method.
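The per-true-fold normalization used for Figure 1 can be sketched as follows; the 3 × 3 confusion matrix below is hypothetical, not data from the paper.

```python
# Normalize a confusion matrix so each row (true fold) sums to 100%,
# as done when preparing Figure 1.
def normalize_rows(confusion):
    out = []
    for row in confusion:
        total = sum(row)
        out.append([100.0 * c / total if total else 0.0 for c in row])
    return out

conf = [[8, 1, 1],   # rows: true fold, columns: predicted fold (toy counts)
        [2, 6, 2],
        [0, 5, 5]]
for row in normalize_rows(conf):
    print(row)
```

Each normalized row then maps directly to one row of gray-scale blocks in the figure.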
Figure 1

Prediction versus experiment. Comparison between predicted and experimental folds in 1612 proteins. The diagonal elements show the correctly predicted proteins. A dark block indicates the presence of many proteins, and solid lines indicate the boundaries between the five classes shown in Table 2, i.e., all-α, all-β, α/β, α + β and small proteins. (a) without re-weighting. (b) with re-weighting.


Comparison among different re-weighting procedures

The results presented in Tables 1 and 2 show that the sensitivity of discriminating protein folds differs significantly between the two methods (with and without re-weighting). Hence, it is difficult to single out the best method for fold recognition; it may instead be selected based on the interest of the user, i.e., whether the prediction should be optimized for each fold or over the whole dataset. Usually, training and test sets are obtained from sequence and structure databases and culled by sequence identity. However, these data sets do not always properly represent all proteins in the different folds, e.g., the protein population in each fold. Further, the proteins available in databases such as PDB are biased toward proteins whose structures can be solved experimentally, which may differ from the proportions of real proteins. Considering these aspects would help to develop better methods for protein fold recognition in the future. In essence, based on the methods and data sets used in the present work, we suggest that the performance with re-weighting is better than that without re-weighting.

Influence of amino acid occurrence in recognizing protein folds

The importance of amino acid occurrence is illustrated in Figure 2(a), which shows the occurrence of the 20 types of amino acid residues in the DNA/RNA binding 3-helical bundle (a.4) and Immunoglobulin-like β-sandwich (b.1) folds. The average numbers of amino acid residues in these folds are 88 and 110, respectively. We observed that the residues Gly, Pro, Ser, Thr and Val are dominant in fold b.1, whereas the opposite trend is observed for Leu and Arg. In Figure 2(b), we show the distribution of these proteins in "amino acid occurrence" space. The two folds are clearly more or less separated in this space. We observed similar variation of amino acid occurrence among the other folds in our data set.
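The 20-dimensional occurrence vectors underlying this analysis are raw residue counts. A minimal sketch of the counting step (the short sequence is a toy example, not from the paper's dataset):

```python
# Build a 20-dimensional amino acid occurrence vector (raw counts, not
# normalized composition) from a one-letter protein sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def occurrence_vector(sequence):
    """Count each residue type; unknown letters (e.g. X) are ignored."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    return [counts[aa] for aa in AMINO_ACIDS]

seq = "MKTAYIAKQR"  # toy sequence
vec = occurrence_vector(seq)
print(sum(vec))  # total residue count equals the sequence length here
```

One such vector per protein is the entire input LDA sees.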
Figure 2

Amino acid occurrence. (a) Comparison between the mean amino acid occurrences of two typical folds, DNA/RNA binding 3-helical bundle (a.4, black) and Immunoglobulin-like β-sandwich (b.1, red). (b) Distribution of these two folds over the first two discriminant functions with re-weighting. a.4: filled black circles, b.1: red crosses.

In addition, we have tested the performance of the method using amino acid composition (i.e., amino acid occurrence divided by the total number of residues) in each protein. The accuracy without re-weighting decreased to 33%, indicating the importance of amino acid occurrence (un-normalized composition) in each fold (Table 1). A similar tendency has also been observed for discriminating β-barrel membrane proteins [16]. Hence, we suggest that amino acid occurrence is better than composition for discriminating protein folds. In fact, the normalization to amino acid composition introduced a co-linearity problem, i.e., the diversity of the vectors is not sufficient compared with the number of proteins. The reason F1 depends on the type of LDA (with or without re-weighting) for composition is that four folds have no positive proteins without re-weighting, whereas with amino acid occurrence only two folds have no positive proteins (without re-weighting), as seen in Table 2.
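The occurrence-versus-composition distinction comes down to normalization discarding chain length. A toy illustration (4-letter alphabet for brevity; values are made up):

```python
# Composition = occurrence / total residues. Two chains of different length
# can share a composition vector while their occurrence vectors differ;
# that length information is what the comparison above credits for the
# accuracy gain of occurrence over composition.
def composition(occurrence):
    total = sum(occurrence)
    return [n / total for n in occurrence]

short = [4, 2, 2, 2]   # toy occurrence vector
long_ = [8, 4, 4, 4]   # same proportions, twice the length
print(composition(short) == composition(long_))  # True: length is lost
print(short == long_)                            # False: occurrence keeps it
```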

Probability measure of discrimination

To enable combining the results of our method with other methods, we provide the probability that a protein belongs to a specific fold along with the predicted folding type. In Figure 3, we show the probabilities for proteins of fold a.4 (DNA/RNA binding 3-helical bundle). Clearly, fold a.4 has the highest average probability. However, some other folds (e.g., a.60, d.15, d.17 and d.58) also have relatively high probabilities. This may result in wrong discrimination, which may be remedied by combining the results with other methods.
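A minimal sketch of how such per-fold probabilities can arise from the Gaussian model described in Methods, reduced to a single discriminant axis and two folds for clarity; the means, standard deviations and equal priors below are hypothetical, not fitted values from the paper.

```python
import math

# Bayesian assignment sketch: each fold is modeled as a 1-D Gaussian along
# a discriminant axis z (the paper uses 20 such axes); the normalized
# likelihoods are the per-fold probabilities reported with the prediction.
def fold_probabilities(z, params):
    """params: {fold: (mean, std)}; equal priors are assumed for simplicity."""
    likelihoods = {
        k: math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for k, (m, s) in params.items()
    }
    total = sum(likelihoods.values())
    return {k: v / total for k, v in likelihoods.items()}

params = {"a.4": (0.0, 1.0), "b.1": (2.0, 1.0)}  # toy fold models
probs = fold_probabilities(0.2, params)
best = max(probs, key=probs.get)
print(best, probs)
```

The runner-up probabilities are what would be handed to a downstream method when combining predictors.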
Figure 3

Probability measure of discrimination. Rows: 103 proteins in fold a.4. Columns: 30 folds, ordered from left to right by the IDs in Table 2. The darkest squares correspond to a probability of 0.5, the lightest to zero.


Comparison with other methods

We have compared the performance of our method with related work in the literature. Ding and Dubchak [18] introduced a combined method for predicting the folding type of a protein. They used six parameters (amino acid composition, secondary structure, hydrophobicity, van der Waals volume, polarity and polarizability) as attributes, with neural networks and support vector machines for recognition; the predictions from these features were combined by voting. They reported a sensitivity of 56% on a test set of 384 proteins and a 10-fold cross-validation sensitivity of 45% on a training set of 311 proteins from 27 folding types. We have used the same dataset of 311 proteins to assess the performance of our method and observed a leave-one-out cross-validation accuracy of 42% (with LDA without re-weighting), which is close to the 45% reported by Ding and Dubchak [18]. In addition, we selected the proteins from the folds common to both studies and tested the performance of our method (trained on our dataset of 1612 proteins) in predicting the folding types of the proteins used in Ding and Dubchak [18]. The results are presented in Table 3. Interestingly, our method with re-weighting could correctly identify the folding types with an F1 value of more than 30% in 11 of the 19 considered folds. These performances are similar to or better than those obtained with the dataset of 1612 proteins. Although our method is optimized on a different dataset, it can predict the folding type of an independent dataset of proteins with similar sensitivity.
Table 3

Performances with independent dataset. Predictive ability [%] of our method on the independent dataset of proteins used in Ding and Dubchak [18]. wo: without re-weighting, w: with re-weighting

Fold Description                                       Number  Ratio  Sensitivity   Precision      F1
                                                               [%]     wo    w      wo    w     wo    w

Cytochrome C                                              16     3     56   94      64   47     60   63
DNA/RNA binding 3-helical bundle                          32     6     75   56      41   47     53   51
Four helical up and down bundle                           15     3     33   33      71   42     45   37
EF hand-like fold                                         15     3     53   53      57   42     55   47
Immunoglobulin-like β-sandwich                            74    14     66   31      44   68     53   43
Cupredoxin-like                                           21     4     29   38      50   33     36   36
Concanavalin A-like lectins/glucanases                    13     2     38   38      42   33     40   36
SH3-like barrel                                           16     3      0   50       -   44      -   47
OB-fold                                                   32     6     16   28      26   31     20   30
TIM barrel                                                77    14     40   25      66   70     50   37
FAD/NAD(P)-binding domain                                 23     4     22   30      11   45     15   38
Flavodoxin-like                                           24     5      8   13      28   35     13   18
NAD(P)-binding Rossmann-fold domains                      40     8     40   35       5    8      8   13
P-loop containing nucleoside triphosphate hydrolases      22     4     23   18      38   50     29   27
Thioredoxin fold                                          17     3     18   35      33   25     23   29
Ribonuclease H-like motif                                 22     4      5   18      14   22      7   20
α/β-Hydrolases                                            18     3     33   39      43   41     38   40
β-Grasp, ubiquitin-like                                   15     3      0   33       0   20      -   25
Ferredoxin-like                                           40     8     23    3      11   10     15    4

Total/Mean                                               532           31   35      42   38     34   34

Accuracy: without re-weighting 36; with re-weighting 32
Further, our method has several advantages: (i) only one feature, amino acid occurrence, is sufficient for prediction rather than six; a comparison of results obtained with a single feature shows that the performance of our method (42%) is significantly better than that reported by Ding and Dubchak [18] with amino acid composition (20–49%); (ii) no voting procedure is necessary, since our method can be used directly for multi-fold classification; (iii) our method uses LDA, which requires significantly less computational power than SVM: an SVM must diagonalize a matrix of size (number of proteins) × (number of proteins), whereas LDA only requires diagonalizing a 20 × 20 matrix (20 being the number of amino acid types), independent of the number of proteins; and (iv) although they reported that fold-specific sensitivities depend on the number of proteins in each fold, compensating for this effect is difficult without modifying their complicated voting system, whereas our method can compensate for it as discussed in the previous sections. Recently, Shen and Chou [20] reported better sensitivity on the same data set of Ding and Dubchak [18]; however, those results are biased by the training data. We evaluated their sensitivity for identifying proteins belonging to the folds four helical up and down bundle (a.24) and EF hand-like (a.39) and observed sensitivities of 30.5% and 24%, respectively. Our predicted sensitivities (38% and 44% with re-weighting; see Table 2) are better than those of Shen and Chou [20].

Possible reasons for obtaining good performance with amino acid occurrence

We have analyzed the possible reasons for obtaining good performance with amino acid occurrence. In Table 4, we summarize the performance as a function of different features. When we use more than one feature to discriminate folds, we simply concatenate the feature vectors before applying LDA: a vector with n components and a vector with m components are merged into a single (n + m)-dimensional vector to which LDA is applied.
Table 4

Performances with other features. Mean performances [%] obtained with different features for the data set used in Ding and Dubchak [18]. The re-weighting scheme is employed

Features                             Sensitivity  Precision   F1   Accuracy

secondary structure                      35          32       40      36
polarity                                 19          18       26      21
polarizability                           18          18       26      19
hydrophobicity                           23          22       28      24
volume                                   21          20       25      22

Composition
composition                              34          33       34      35
composition + length                     36          35       38      38
composition + other five features        35          39       39      39

Occurrence
occurrence                               40          40       39      42
occurrence + other five features         40          46       42      44
The use of five features, i.e., predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity and polarizability [18], along with amino acid composition yielded an accuracy of 45% using sophisticated and time-consuming methods. Our simple method employing amino acid occurrence and the same five features showed almost the same value (44%). An in-depth analysis of the results presented in Table 4 reveals several interesting features. For example, amino acid composition alone gave an accuracy of 35%, which is 7% less than that obtained with occurrence (42%). On the other hand, composition plus length (i.e., the first 20 components of the feature vector are the composition and the 21st component is the sequence length) increased the accuracy from 35% to 38%. Composition plus the five features gave an accuracy of 39%, similar to the 38% obtained with composition plus length. Hence, the length of the protein plays a role as important as the five features for discriminating protein folds. This analysis demonstrates the importance of sequence length in obtaining good performance with amino acid occurrence. As an individual feature, amino acid occurrence showed the best performance of all features, including secondary structure. The combination of amino acid occurrence with the other features did not increase the sensitivity, and the increase in the other measures is only marginal. This result indicates that amino acid occurrence contains most of the information reflected in the other physical features. Generally, such physical features can be expressed in terms of amino acid occurrence; hence, linear combinations of amino acid occurrence may express many physical properties of proteins.
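The feature-merging step described above is plain concatenation. A minimal sketch (the 4-component "composition" and the length value are illustrative, not from the paper):

```python
# Merging feature vectors by concatenation, as done when combining
# composition with length or with the five other properties: a vector
# with n components and one with m components become a single
# (n + m)-component vector handed to LDA.
def merge(*feature_vectors):
    merged = []
    for v in feature_vectors:
        merged.extend(v)
    return merged

composition = [0.2, 0.1, 0.3, 0.4]  # toy composition (4-letter alphabet)
length = [250]                      # sequence length as a 1-component feature
print(merge(composition, length))   # 5-component vector for LDA
```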
To verify this concept, we computed the correlation coefficients between 49 amino acid properties [21-23] and the first discriminant function. Each property is a 20-dimensional vector p_k = (p_k1, ..., p_k20), where p_kj is the value of the kth physical property for the jth amino acid. Since the discriminant function is also a 20-dimensional vector, each component of which describes the contribution of one amino acid, one can compute the correlation coefficient between the two. As seen in Table 5, 23 of the 49 properties have high correlation coefficients and q-values (i.e., FDR-corrected p-values) below 5%. This analysis shows that the linear discriminant function can, at least partly, express many physical properties. Hence, even without considering physical properties directly, amino acid occurrence can discriminate folds well.
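The correlation computed for Table 5 is an ordinary Pearson coefficient between two 20-component vectors. A sketch with shortened, made-up 5-component vectors standing in for a property and the discriminant coefficients:

```python
import math

# Pearson correlation between an amino acid property vector and the
# coefficients of the first discriminant function, mirroring the Table 5
# analysis. Both vectors below are illustrative, not real property values.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

prop = [0.10, 0.40, 0.35, 0.50, 0.20]   # toy "property" vector
coef = [0.12, 0.38, 0.30, 0.55, 0.18]   # toy discriminant coefficients
print(round(pearson(prop, coef), 2))
```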
Table 5

Correlation between physical properties and the first discriminant function. Brief descriptions of the 49 selected physico-chemical, energetic and conformational properties, their correlation coefficients with the first discriminant function, and q-values. An asterisk in the last column indicates a q-value below 5%

No.  Description                                                    Corr. Coef.  q-value [%]  q ≤ 5%
1.   Compressibility                                                   0.04        38.6
2.   Thermodynamic transfer hydrophobicity                             0.54         1.9         *
3.   Surrounding hydrophobicity                                        0.74         0.4         *
4.   Polarity                                                          0.36         9.2
5.   Isoelectric point                                                 0.02        41.2
6.   Equilibrium constant with reference to the ionization property    0.01        41.7
7.   Molecular weight                                                  0.06        38.4
8.   Bulkiness                                                         0.49         3.0         *
9.   Chromatographic index                                             0.51         2.7         *
10.  Refractive index                                                  0.36         9.2
11.  Normalized consensus hydrophobicity                               0.48         3.4         *
12.  Short and medium range non-bonded energy                          0.11        32.7
13.  Long-range non-bonded energy                                      0.65         0.7         *
14.  Total non-bonded energy                                           0.57         1.5         *
15.  Alpha-helical tendency                                            0.29        14.1
16.  Beta-helical tendency                                             0.63         0.8         *
17.  Turn tendency                                                     0.61         0.9         *
18.  Coil tendency                                                     0.60         1.1         *
19.  Helical contact area                                              0.20        23.0
20.  Mean rms fluctuational displacement                               0.57         1.5         *
21.  Buriedness                                                        0.63         0.8         *
22.  Solvent accessible reduction ratio                                0.70         0.4         *
23.  Average number of surrounding residues                            0.72         0.4         *
24.  Power to be at the N-terminal of alpha helix                      0.57         1.5         *
25.  Power to be at the C-terminal of alpha helix                      0.18        26.4
26.  Power to be at the middle of alpha helix                          0.05        38.6
27.  Partial-specific volume                                           0.25        18.8
28.  Average medium-range contacts                                     0.11        32.7
29.  Average long-range contacts                                       0.65         0.7         *
30.  Combined surrounding hydrophobicity (globular and membrane)       0.69         0.4         *
31.  Solvent accessible surface area for denatured protein             0.12        32.7
32.  Solvent accessible surface area for native protein                0.52         2.5         *
33.  Solvent accessible surface area for protein unfolding             0.47         3.7         *
34.  Gibbs free energy change of hydration for unfolding               0.30        14.1
35.  Gibbs free energy change of hydration for denatured protein       0.40         7.3
36.  Gibbs free energy change of hydration for native protein          0.46         4.1         *
37.  Unfolding enthalpy change of hydration                            0.05        38.6
38.  Unfolding entropy change of hydration                             0.37         8.9
39.  Unfolding hydration heat capacity change                          0.54         1.9         *
40.  Unfolding Gibbs free energy change of chain                       0.16        27.6
41.  Unfolding enthalpy change of chain                                0.22        21.7
42.  Unfolding entropy change of chain                                 0.44         4.7         *
43.  Unfolding Gibbs free energy change                                0.33        11.0
44.  Unfolding enthalpy change                                         0.35        10.2
45.  Unfolding entropy change                                          0.34        10.3
46.  Volume (number of non-hydrogen side chain atoms)                  0.11        32.7
47.  Shape (position of branch point in a side-chain)                  0.10        32.8
48.  Flexibility (number of side-chain dihedral angles)                0.24        19.5
49.  Backbone dihedral probability                                     0.51         2.5         *

Fold recognition on the web

We have developed a web server for discriminating protein folds from amino acid sequence [24]. It takes an amino acid sequence as input and outputs the folding type along with its probability. Further, the server allows the user to select the method (with or without re-weighting) and display options showing the probability details for each fold.

Advantages and limitations of the method

The main advantage of the present method is the discrimination of 30 different folding types of globular proteins with high accuracy/sensitivity/precision/F1. Further, it provides the probability that a protein belongs to a specific fold. The discrimination results, along with the probabilities, may help in selecting templates for building models of new proteins, and the method can be combined with other methods for better performance. The limitation of the method is that only 30 specific folds are used for discrimination.

Conclusion

In this paper, we have proposed a simple method for discriminating 30 folding types of globular proteins. Interestingly, the simplest method proves to be among the best for this genuinely complicated problem: although complicated methods offer many possibilities for tuning, they are prone to overfitting the data set. The method proposed in this work is better than or comparable to more complicated methods in the literature, such as neural networks and support vector machines, for discriminating folding types. In addition, our method has several advantages, including low computational cost and classifying all folds in a single run rather than by pairwise comparisons. We have developed a web server [24] that takes an amino acid sequence as input and outputs the folding type. The main limitation of the method is that its application is restricted to the 30 folds considered in this work; however, the approach can be extended to other folds when enough representatives become available.

Methods

Dataset

We have used a dataset of 1612 globular proteins belonging to 30 major folding types, obtained from the SCOP database [25], for recognizing protein folds. This dataset was constructed with the following criteria: (i) there should be at least 25 proteins in each fold, and (ii) the sequence identity between any two proteins should not exceed 25%. The amino acid sequences of all the proteins are available at [24].
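The first criterion is a simple size filter over fold groups. A sketch, assuming fold labels are already assigned; the pairwise 25%-identity culling requires sequence alignment and is deliberately omitted here, and the protein IDs and fold labels are made up.

```python
# Dataset-construction sketch: keep only folds with at least 25 members.
# The identity-based culling step of the paper is not reproduced here.
def filter_folds(proteins, min_size=25):
    """proteins: list of (protein_id, fold) pairs."""
    counts = {}
    for _, fold in proteins:
        counts[fold] = counts.get(fold, 0) + 1
    return [(pid, fold) for pid, fold in proteins if counts[fold] >= min_size]

toy = ([("p%d" % i, "a.4") for i in range(30)]     # 30-member fold: kept
       + [("q%d" % i, "d.17") for i in range(10)])  # 10-member fold: dropped
kept = filter_folds(toy)
print(len(kept))  # 30
```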

Linear discriminant analysis

We have employed LDA in this work; a brief description is given below. First, we compute the amino acid occurrence vector of each protein, n_i = (n_i1, n_i2, ..., n_i20), where i indexes proteins and n_i1, n_i2, etc. represent the numbers of amino acids of each type (Ala, Arg, etc.) in the ith protein. LDA then maximizes, along an axis z, the ratio between the weighted sum of squared distances of the fold centers from the overall center of mass and the total sum of squared distances of all proteins from the overall center of mass:

    η² = Σ_k W_k (z̄_k − z̄)² / Σ_i (z_i − z̄)²,   (1)

where k = 1, ..., K runs over the K folds, z̄ is the center of mass of all proteins along the axis z, and z̄_k = (1/N_k) Σ_{i'} z_{i'} is that of the N_k proteins i' belonging to the kth fold. Here z_i is a linear combination of the occurrences n_ij with a set of coefficients a ≡ (a_0, a_1, ..., a_j, ..., a_20), i.e., z_i = a_0 + Σ_j a_j n_ij. Hence, LDA tries to find the a that maximizes η². In total, we can obtain 20 kinds of z, orthogonal to each other, and discrimination is done based on these z's. The weights W_k for each fold implement the two schemes: W_k = N_k gives LDA without re-weighting, and W_k = 1 gives LDA with re-weighting. The discrimination is done with a Bayesian scheme employing a Gaussian kernel: proteins in each fold are assumed to be distributed in amino acid occurrence space according to a Gaussian whose center is the mean occurrence within the fold and whose variance is computed along the 20 z coordinates. The fold with the maximum probability is then assigned as the folding type of the protein; the probabilities of the other folds may also be used to find alternative probable folds for a specific sequence. We have used the lda function in the MASS library of R [26], and the computational time is less than a few seconds on an Intel Pentium M processor (1.10 GHz) with 1 GB of memory.
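The objective of eq. (1) can be evaluated directly for a candidate axis. A sketch computing η² along one given z for both weighting schemes, with toy projected values and fold labels (the paper instead optimizes over the coefficients a via R's lda):

```python
# Correlation-ratio objective of eq. (1) along one candidate axis z:
# eta^2 = sum_k W_k (mean_k - mean)^2 / sum_i (z_i - mean)^2,
# with W_k = N_k (no re-weighting) or W_k = 1 (re-weighting).
def eta_squared(z, labels, reweight=False):
    grand = sum(z) / len(z)
    between = 0.0
    for k in set(labels):
        zk = [zi for zi, lab in zip(z, labels) if lab == k]
        w = 1.0 if reweight else len(zk)        # W_k = 1 vs W_k = N_k
        between += w * (sum(zk) / len(zk) - grand) ** 2
    total = sum((zi - grand) ** 2 for zi in z)
    return between / total

z      = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8, 2.9]   # toy projections
labels = ["a", "a", "a", "b", "b", "b", "b"]   # toy fold labels
print(eta_squared(z, labels), eta_squared(z, labels, reweight=True))
```

With W_k = N_k this is the standard ANOVA between-to-total ratio; setting W_k = 1 removes the influence of fold size on the objective.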

Scoring

In this paper, we employed four performance measures to validate the results. We computed TP_k, the number of proteins correctly discriminated as being in the kth category (e.g., fold). We also computed FP_k (FN_k), the number of proteins incorrectly discriminated as being (not being) in the kth category. We then defined sensitivity (or recall), precision and F1 as

sensitivity_k = TP_k / (TP_k + FN_k) ,   precision_k = TP_k / (TP_k + FP_k) ,   F1_k = 2 · precision_k · sensitivity_k / (precision_k + sensitivity_k) .

For validating the whole data set we have taken the category average,

<Performance> = (1/K) Σ_{k=1..K} (Performance)_k ,

where (Performance)_k is the sensitivity, precision or F1 of the kth category. In some cases the denominator of precision and/or F1 is zero; such categories were excluded when computing the average. The accuracy is defined as

accuracy = (Σ_{k=1..K} TP_k) / N ,

where N is the total number of proteins.

Availability and requirements

Project name: PROLDA
Project home page: http://www.granular.com/PROLDA/
Operating systems: Platform independent
Programming languages: R [26]
License: GNU GPL
Any restrictions to use by non-academics: none

Authors' contributions

YhT coded the program, carried out most of the calculations and constructed the prediction server. MMG directly supervised the work and provided the dataset of amino acid sequences. All authors contributed to the preparation of the manuscript, and read and approved it.
References (24 in total)

1.  Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins.

Authors:  M M Gromiha; M Oobatake; A Sarai
Journal:  Biophys Chem       Date:  1999-11-15       Impact factor: 2.352

2.  Prediction of protein (domain) structural classes based on amino-acid index.

Authors:  W S Bu; Z P Feng; Z Zhang; C T Zhang
Journal:  Eur J Biochem       Date:  1999-12

3.  Structural class prediction: an application of residue distribution along the sequence.

Authors:  T S Kumarevel; M M Gromiha; M N Ponnuswamy
Journal:  Biophys Chem       Date:  2000-12-15       Impact factor: 2.352

4.  Multi-class protein fold recognition using support vector machines and neural networks.

Authors:  C H Ding; I Dubchak
Journal:  Bioinformatics       Date:  2001-04       Impact factor: 6.937

5.  How good is prediction of protein structural class by the component-coupled method?

Authors:  Z X Wang; Z Yuan
Journal:  Proteins       Date:  2000-02-01

6.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties.

Authors:  J Shi; T L Blundell; K Mizuguchi
Journal:  J Mol Biol       Date:  2001-06-29       Impact factor: 5.469

7.  Amino Acid Principal Component Analysis (AAPCA) and its applications in protein structural class prediction.

Authors:  Qi-Shi Du; Zhi-Qin Jiang; Wen-Zhang He; Da-Peng Li; Kou-Chen Chou
Journal:  J Biomol Struct Dyn       Date:  2006-06

8.  Ensemble classifier for protein fold pattern recognition.

Authors:  Hong-Bin Shen; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2006-05-03       Impact factor: 6.937

9.  Proteins of the same fold and unrelated sequences have similar amino acid composition.

Authors:  Yanay Ofran; Hanah Margalit
Journal:  Proteins       Date:  2006-07-01

10.  A machine learning information retrieval approach to protein fold recognition.

Authors:  Jianlin Cheng; Pierre Baldi
Journal:  Bioinformatics       Date:  2006-03-17       Impact factor: 6.937

