Literature DB >> 24278175

Bi-factor analysis based on noise-reduction (BIFANR): a new algorithm for detecting coevolving amino acid sites in proteins.

Juntao Liu1, Xiaoyun Duan, Jianyang Sun, Yanbin Yin, Guojun Li, Lushan Wang, Bingqiang Liu.   

Abstract

Previous statistical analyses have shown that amino acid sites in a protein evolve in a correlated way instead of independently. Even though located distantly in the linear sequence, the coevolved amino acids could be spatially adjacent in the tertiary structure, and constitute specific protein sectors. Moreover, these protein sectors are independent of one another in structure, function, and even evolution. Thus, systematic studies on protein sectors inside a protein will contribute to the clarification of protein function. In this paper, we propose a new algorithm BIFANR (Bi-factor Analysis Based on Noise-reduction) for detecting protein sectors in amino acid sequences. After applying BIFANR on S1A family and PDZ family, we carried out internal correlation test, statistical independence test, evolutionary rate analysis, evolutionary independence analysis, and function analysis to assess the prediction. The results showed that the amino acids in certain predicted protein sector are closely correlated in structure, function, and evolution, while protein sectors are nearly statistically independent. The results also indicated that the protein sectors have distinct evolutionary directions. In addition, compared with other algorithms, BIFANR has higher accuracy and robustness under the influence of noise sites.

Entities:  

Mesh:

Substances:

Year:  2013        PMID: 24278175      PMCID: PMC3835919          DOI: 10.1371/journal.pone.0079764

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The amino acids coevolution is very common in various protein families [1], [2], [3]. Highly conserved amino acid sites are often located in the core or on the functional surface of protein tertiary structure [4], [5], [6]. These sites usually are under strong evolutionary constraint, thus are critical for maintaining the protein’s function. The amino acid sites that are highly correlated in evolution often form protein sectors [7], [8]. Protein sectors decompose proteins into quasi-independent groups, which are distinct from the traditional hierarchy of protein structure. The statistical characteristic analysis of the cooperative action of conserved amino acids could contribute to the inference of protein function and evolution [1], [9]. Since functionally important amino acid regions in a protein are usually conserved in evolution, researchers have been identifying these regions by performing directed mutagenesis experiments [10], [11], [12], [13]. However, such experimental approaches are time and labor intensive. In order to overcome this problem, researchers have developed statistical methods to detect functionally dependent (or correlated) amino acids in proteins using coevolution analysis [14], [15]. For example, some parametric and non-parametric methods were employed to detect important amino acid sites [16], [17], [18], [19], [20], [21], [22], which usually focus on amino acids important for maintaining the protein structure and function. These methods rely on multiple sequence alignment (MSA), so the quality and size of MSA and the background coevolution noise became the main obstacles [15], [18], [23]. In addition, some other typical probabilistic models have also been implemented, e. g. Maximum likelihood approximation [24], [25], [26], Bayesian probabilities [27], phylogenetic approaches [28] and sequence divergence based approximation [15], [29]. Lastly, several new ideas were introduced to reduce the influence of noise [7], [8]. However, these methods can only reach high accuracy in some specific protein families, thus cannot be widely used. Therefore, there is a need of more effective method to be developed. In this paper, we propose a new algorithm, named BIFANR (Bi-factor analysis based on noise-reduction), to reveal the coevolving pattern of amino acid sites. The algorithm originates from the Factor Analysis in psychological researches [30], which is widely used to analyze psychological factors, such as human personality and sensibility. Like previous studies, our algorithm follows the following principals: 1) the coevolved amino acid sites in a protein constitute a protein sector, which are closely combined in the tertiary structure to account for certain biological characteristics; 2) different protein sectors are independent of each other in terms of the tertiary structure and function. However, different from other methods, BIFANR first conducts noise reduction before factor analysis, which improves efficiency and accuracy. After that, a bi-factor analysis is employed to determine the corresponding eigenvectors of non-random eigenvalues with a stochastic simulation and then to extract protein sectors. In linear combination of eigenvectors, this algorithm employs varimax orthogonal rotation to ensure independence between protein sectors. Furthermore, we applied BIFANR to a PDB structure 3TGI of the S1A serine protease family and 1BE9 of the PSD95/Dig1/ZO1 (PDZ) family. As a result, we found 3 protein sectors in 3TGI and 2 in 1BE9. Further analysis showed that BIFANR has higher accuracy and robustness compared with other algorithms. The flowchart of the complete analyses is presented in Figure 1. The source code of the BIFANR program is available in the file Program S1.
Figure 1

Methods flowchart.

Results

1. The Algorithm Design of BIFANR

The BIFANR algorithm consists of two major components, detailed as follows:

1.1 Correlation coefficient matrix and noise reduction

This algorithm applies the idea of Factor Analysis to amino acid site analysis to extract protein sectors. Specifically, starting from a given MSA, we first calculated the correlation coefficients between amino acid sites and constructed a covariance matrix (non-weighted correlation coefficient matrix, see Methods). Considering the biological significance, we then gave weights to the covariance matrix like previous studies [7], [31]. Finally, we calculated the weighted correlation coefficient matrix based on the background frequency of the 20 amino acids and the conservation of amino acid sites. As a result, we have measured the pair-wise correlation of amino acid sites with this matrix, based on which we further conducted noise reduction. The noise sites are amino acid sites that are weakly correlated with almost all the other sites. These noise sites usually reduce the efficiency and accuracy of the algorithm to identify protein sectors. This is the main reason causing the failure of some covariance amino acid sites detecting methods [32]. In order to overcome this problem, we developed a noise reduction method in the pre-processing step. Specifically, the amino acid sites with high randomness in evolution are removed before the detection step. Taking S1A family and PDZ family as examples: in the S1A family, there were 223 sites in the multiple sequence alignment (MSA) of 3TGI [33] and its homologous protein sequences and 104 sites were removed by the noise reduction step; in the PDZ family, 49 sites of the total 94 sites were removed after the noise reduction step. These removed sites are weakly correlated with other sites, and have higher evolutionary rates than the remaining sites. Calculated by Rate4Site, in S1A family [34], the average evolutionary rates of removed and remained sites are 0.7692 and −0.6723 respectively. In PDZ family they are 0.6717 and −0.7314 respectively.

1.2 Bi-factor analysis

In the bi-factor analysis, we obtained protein sectors according to the eigenvectors of the weighted correlation coefficient matrix. In order to guarantee the non-randomness of the predicted protein sectors, we simulated the data by randomly shuffling the multiple sequence alignment for 100 times, and then chose the non-random eigenvectors of the correlation coefficient matrix based on the stochastic simulation result. Using this method, we can find the non-random protein sectors hidden in the protein sequence. The original factor coefficient for each amino acid site can be considered as the correlation between the site and the factor (i.e. selected eigenvectors of the correlation coefficient matrix, see Methods). BIFANR assigns amino acid sites to factors according to the correlation. However, we cannot obtain protein sectors based on the original factor coefficients directly because one site may have similar coefficients with different factors. Thus we have further conducted varimax orthogonal rotation for these factors. Our ideal expectation is that each site will have a large coefficient value with just one factor, which could be sufficient to distinguish this factor from the remaining factors. Consequently, the protein sectors detected by BIFANR will have significant statistical independence. After varimax orthogonal rotation, amino acid sites were assigned to factors according to the factor coefficients calculated above. As the coefficient is within the range [−1, 1], there are both positive and negative correlation and the larger the absolute value is, the more significant the correlation is. Those sites might also form a protein sector, if they have significant negative correlation with one factor. Therefore, BIFANR conducted bidirectional selection of factor coefficients on the basis of factor analysis, which could prevent the loss of protein sectors due to solely selection of positive factor coefficients. However, bidirectional selection may cause the occurrence of two overlapping protein sectors. In order to merge overlapping protein sectors, we retain the overlap and then add those sites having higher correlation with current sites.

2. Statistical and Biological Tests for Protein Sectors

BIFANR detected three protein sectors in S1A family and two protein sectors in PDZ family. To verify these protein sectors and evaluate the performance of our algorithm, we conducted statistical tests and did biological analysis with these protein sectors. The statistical tests include internal correlation test and statistical independence test. Besides, we conducted evolutionary rate test. To demonstrate the correlation between the amino acid sites within a protein sector, we took all of the amino acid sites in each protein sector and calculated the mean correlation coefficients between each pair of the amino acid sites. In addition, we randomly simulated the data set with the same number of sites for 1000 times and similarly calculated the mean correlation coefficients to be the random expectation. The results showed that in each protein sector, the average of correlation coefficients is much higher than the random expectation (Figure 2).
Figure 2

Average correlation coefficients of protein sectors.

A: Average correlation coefficient of each protein sector in S1A family. B: Average correlation coefficient of each protein sector in PDZ family. Red column represents average correlation coefficient of protein sector 1. Blue column represents average correlation coefficient of protein sector 2. Green column represents average correlation coefficient of protein sector 3. Black column represents stochastic expected average correlation coefficient of each protein sector.

Average correlation coefficients of protein sectors.

A: Average correlation coefficient of each protein sector in S1A family. B: Average correlation coefficient of each protein sector in PDZ family. Red column represents average correlation coefficient of protein sector 1. Blue column represents average correlation coefficient of protein sector 2. Green column represents average correlation coefficient of protein sector 3. Black column represents stochastic expected average correlation coefficient of each protein sector. To illustrate the statistical independence between protein sectors, we calculated the MDI entropy of S1A family and PDZ family, respectively. The MDI entropy was originally used to quantify the degree to which a selected group of amino acid sites are statistically coupled to each other in an MSA. If two protein sectors are independent, the MDI entropy of them taken together should be equal to the sum of their MDI entropies taken individually in theory. The results supported this conjecture by showing that the MDI entropy of each two predicted protein sectors was much higher than the random expectation, which means that the protein sectors are statistically independent of each other (Figure 3).
Figure 3

Statistical independence of protein sectors.

A: Statistical independence of protein sectors in S1A family. B: Statistical independence of protein sectors in PDZ family. (Red column represents MDI entropy of protein sector 1. Blue column represents MDI entropy of protein sector 2. Green column represents MDI entropy of protein sector 3. Black column represents MDI entropy of two protein sectors as a whole. Yellow column represents stochastic expected MDI entropy after disrupting the amino acid sites within two protein sectors 100 times.).

Statistical independence of protein sectors.

A: Statistical independence of protein sectors in S1A family. B: Statistical independence of protein sectors in PDZ family. (Red column represents MDI entropy of protein sector 1. Blue column represents MDI entropy of protein sector 2. Green column represents MDI entropy of protein sector 3. Black column represents MDI entropy of two protein sectors as a whole. Yellow column represents stochastic expected MDI entropy after disrupting the amino acid sites within two protein sectors 100 times.). To study the evolutionary feature of amino acid sites within a protein sector, we calculated the average evolutionary rates of the amino acid sites in the entire protein and the amino acid sites in the protein sectors, respectively. The results showed that the latter was much lower than the former (Table 1). Figure 4 shows the result for both the S1A family and the PDZ family, where for both families, the evolutionary rates of over 90% sites in protein sectors are negative, suggesting that these sites have lower evolutionary rates and thus are selectively constrained to maintain the protein structure and function.
Table 1

Average evolutionary rate of each protein sector and average evolutionary rate of all sites in S1A and PDZ.

sector 1sector 2sector 3all sites
S1A −0.7288−0.4520−1.21430
PDZ −0.8683−0.91630
Figure 4

Distribution of amino acid site evolutionary rates.

A: Distribution of amino acid site evolutionary rates in S1A family (rat trypsin: 3TGI). B: Distribution of amino acid site evolutionary rates in PDZ family (rat PSD-95∶1BE9). Red column represents the evolutionary rate distribution of amino acid sites in protein sectors.

Distribution of amino acid site evolutionary rates.

A: Distribution of amino acid site evolutionary rates in S1A family (rat trypsin: 3TGI). B: Distribution of amino acid site evolutionary rates in PDZ family (rat PSD-95∶1BE9). Red column represents the evolutionary rate distribution of amino acid sites in protein sectors.

3. Comparison to Buck’s Method

In order to evaluate the performance of BIFANR, we compared our method with Buck’s method [8]. In comparison, we chose 3TGI of the S1A family and 1BE9 [35] of the PDZ family as template sequences, since protein sectors in these two template sequences have been experimentally verified [7]. Then we evaluated the predicted results of the two methods by comparing experimentally confirmed sectors with our predicted sectors (i.e. factors in [8] ). If all or most sites of an experimentally confirmed sector are found in just one predicted sector, it means that the prediction is reliable. Otherwise, if sites are found in several predicted sectors, it indicates that the prediction is unreliable. We then calculated the percentage of experimentally confirmed sectors that are found in our predicted sectors, i.e. sensitivity and the percentage of our predicted sectors to be true positives (i.e. experimentally confirmed sectors), i.e. positive predictive value (PPV). For the result of Buck’s method, the sites in experimentally confirmed sectors distributed almost uniformly in different predicted sectors. But for our result, the sites in any experimentally confirmed sector distributed on just one predicted sector (Figure 5, 6). The results show that the sensitivities of Buck’s method in S1A family and PDZ family were 85.07% and 82.35%, respectively, while those of our method were 91.04% and 94.11%. In addition, the PPVs of Buck’s method in S1A family and PDZ family were 43.84% and 29.16%, respectively, while those of our method were 90.77% and 94.11%. The results clearly demonstrate that BIFANR performs much better than Buck’s method.
Figure 5

Comparison between BIFANR and Buck’s in S1A family.

A: S1,S2, and S3 represent 3 experimental confirmed protein sectors in S1A family and the height of color bar represents the number of sites in corresponding predicted protein sector by Buck’s method. B: S1,S2, and S3 represent 3 protein sectors in S1A family and the height of each color bar represents the number of sites in corresponding predicted sector by BIFANR. And the height of the brown bar represents the number of lost sites by algorithms in each experimental confirmed protein sector.

Figure 6

Comparison between BIFANR and Buck’s in PDZ family.

A: S1 and S2 represent 2 experimental confirmed protein sectors in PDZ family and the height of each color bar represents the number of sites in corresponding predicted protein sector by Buck’s method. B: S1 and S2 represent 2 protein sectors in PDZ family and the height of each color bar represents the number of sites in corresponding predicted sector by BIFANR. And the height of the brown bar represents the number of lost sites by algorithms in each experimental confirmed protein sector.

Comparison between BIFANR and Buck’s in S1A family.

A: S1,S2, and S3 represent 3 experimental confirmed protein sectors in S1A family and the height of color bar represents the number of sites in corresponding predicted protein sector by Buck’s method. B: S1,S2, and S3 represent 3 protein sectors in S1A family and the height of each color bar represents the number of sites in corresponding predicted sector by BIFANR. And the height of the brown bar represents the number of lost sites by algorithms in each experimental confirmed protein sector.

Comparison between BIFANR and Buck’s in PDZ family.

A: S1 and S2 represent 2 experimental confirmed protein sectors in PDZ family and the height of each color bar represents the number of sites in corresponding predicted protein sector by Buck’s method. B: S1 and S2 represent 2 protein sectors in PDZ family and the height of each color bar represents the number of sites in corresponding predicted sector by BIFANR. And the height of the brown bar represents the number of lost sites by algorithms in each experimental confirmed protein sector.

4. Function Analysis of Protein Structure

BIFANR detected three and two protein sectors in S1A family and PDZ family respectively (see Table S1 for the amino acids in each protein sector). Strikingly, the amino acid sites in the three protein sectors of S1A family are not linearly close to each other in the sequence, but apparently are correlated in the tertiary structure (Figure 7A–C). In addition, protein sectors tend to independent of each other in protein function.
Figure 7

Protein sectors of S1A family (Rat trypsin PDB: 3TGI).

A: Red balls represent protein sector 1, which mainly comprises residues located near the catalytic cleave. B: Blue balls represent protein sector 2, which comprises residues within the core of the two β barrels. C: Green balls represent protein sector 3, comprising residues within the catalytic cleave. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.

Protein sectors of S1A family (Rat trypsin PDB: 3TGI).

A: Red balls represent protein sector 1, which mainly comprises residues located near the catalytic cleave. B: Blue balls represent protein sector 2, which comprises residues within the core of the two β barrels. C: Green balls represent protein sector 3, comprising residues within the catalytic cleave. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface. In S1A family, protein sector 1 mainly contains amino acids surrounding the pocket of S1 enzyme. Amino acid mutations within this sector may affect the substrate specificity of some enzymes in the family as residues in this sector is involved in transferring chymotryptic specificity into trypsin [36], [37], [38], [39], [40]. Protein sector 2 mainly contains amino acids of the two β-sheets in the protein core. The double alanine mutation in this sector could affect the thermal stability of the enzyme, but hardly affect the catalytic ability. Moreover, the mutations in this sector are synergistic. The effect of sector 1 to substrate specificity is independent on that of sector 2 to structure stability [7], [41], [42]. Protein sector 3, which is mainly responsible for catalytic ability, contains the catalytic triad (57H, 102D and 195S) and neighboring amino acids that are related to catalytic ability or accounting for allosteric regulation [36], [43], [44], [45]. This sector also includes one disulfide bond pair (42C–58C) and the substitution of this bond would cooperatively interact with mutation of S195. In addition, triple mutation of C42A, C58A/V, and S195T will convert trypsin from a serine protease to a threonine protease. So, this sector represents the catalytic core of this protease family. In PSD95/Dig1/ZO1 (PDZ) protein domain family, protein sector 1 contains amino acids in α2–β2 groove and α1-helix, which affect the substrate binding affinity [31], [46] and the regulation of α2–β2 groove affinity [47]. The sites in this sector are either in relation to each other directly or are connected through interactions with the substrate peptides. In protein sector 2, residues 36 and 75 co-mutate to cysteine may be responsible for the redox-dependent equilibrium [48], [49] between two conformations in INAD PDZ5 [50] (Figure 8).
Figure 8

Protein sectors of PDZ family (Rat PSD-95 DPB: 1BE9).

A: Red balls represent residues within protein sector 1. B: Blue balls represent residues within protein sector 2. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.

Protein sectors of PDZ family (Rat PSD-95 DPB: 1BE9).

A: Red balls represent residues within protein sector 1. B: Blue balls represent residues within protein sector 2. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface. To further investigate the function and independent evolution of protein sectors, we analyzed the evolutionary independence of the three protein sectors in S1A family. Evolutionary independence test is to construct a similarity matrix M with the sequence similarities of amino acid sites in a protein sector and then conduct principal component analysis. In principal component analysis, only one principal component was selected and all sequences were separated into two parts according to factor coefficients. Taking S1A family as an example, the results of the principal component analysis of protein sectors 1, 2 and 3 were displayed in Figure 9A, 9B and 9C, respectively. According to the sequence similarity of sites in each protein sector, casein (red, top) and chymotrypsin (blue, below) proteins are separated by protein sector 1 (Figure 9A); vertebrates and non-vertebrates are separated by protein sector 2 (Figure 9B); and enzymes and non-enzymes are separated by protein sector 3 (Figure 9C). These results indicate that protein sector 1 may be responsible for the specificity of substrate recognition in the catalytic process; protein sector 2 may be involved in the protein backbone evolution, while protein sector 3 may account for the protein catalytic activity.
Figure 9

Evolutionary independence of protein sectors in S1A family.

A: Evolutionary independence of protein sector 1. B: Evolutionary independence of protein sector 2. C: Evolutionary independence of protein sector 3.

Evolutionary independence of protein sectors in S1A family.

A: Evolutionary independence of protein sector 1. B: Evolutionary independence of protein sector 2. C: Evolutionary independence of protein sector 3.

5. Application to Hsp70/110 Family and G Protein Family

To demonstrate the generality of the algorithm BIFANR, we carried out protein sector prediction for another 2 protein families: Hsp70/110 family and G protein family. For the Hsp70/110 family, the MSA consists of 926 sequences and 605 positions, and for the G protein family the MSA consists of 678 sequences and 160 positions. BIFANR detected 2 significant protein sectors for each of the two families (see Figure 10 and Table S2 ). The internal correlation test and the statistical independence test for both datasets showed that the conclusions drawn from the experiments on Hsp70/110 family and G protein family were also well supported (Fig S1 and S2).
Figure 10

Protein sectors of G protein family (A and B) and Hsp70/110 family (C and D).

Red balls and Green balls represent different protein sectors in protein 3D structure. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.

Protein sectors of G protein family (A and B) and Hsp70/110 family (C and D).

Red balls and Green balls represent different protein sectors in protein 3D structure. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.

Discussion

Exploring the coevolved protein sectors among homologous proteins is currently a hot issue, especially for the studies of the biological features and evolutionary direction of proteins. There have been a few methods developed for detecting coevolved sites in a protein family but they all suffer from low accuracy and low robustness. In this paper, we proposed a new algorithm BIFANR aiming to address these issues. BIFANR is unique in the following aspects. First, BIFANR has a noise reduction step for the sites in MSA. This step can reduce the complexity of the calculation and improve the accuracy. Second, motivated by factor analysis, a stochastic simulation step is adopted to choose non-random eigenvectors. This step ensures that the protein sectors detected are non-random and thus of high credibility. Third, BIFANR uses varimax orthogonal rotation to calculate the linear combination of selected eigenvectors, which leads to the significant statistical independence between protein sectors. Fourth, the algorithm avoids manual curation, such as visual inspecting and screening, thus is more practical to use for high throughput analysis. Besides, BIFANR is robust for various data scales. When the data is randomly reduced to half in size, the result remains almost the same. We did this on both S1A and PDZ family and compared the new results with the old ones. As shown in Table 2 and Table 3, the accuracy remained high especially for PDZ family, which indicates that BIFANR is robust for data scales.
Table 2

The performance of BIFANR on original data and half data in S1A family.

Original dataHalf data
sector1 95%90%
sector2 92%88%
sector3 90.91%90.91%
Table 3

The performance of BIFANR on original data and half data in PDZ family.

Original dataHalf data
sector1 100%100%
sector2 90.91%90.91%
In the future, we will consider using the common amino acid substitution matrix (e.g. PAM or BLOSUM) to incorporate the relationships among amino acids, as currently BIFANR assumes that all the 20 amino acids are independent. In addition, we will work with biologists to use our predicted sectors to guide site-specific mutagenesis experiments on some selected genes of interests.

Materials and Methods

1. Obtaining Materials

In this study, we chose the classic S1A serine protease family and PDZ family for protein sector analysis (see Data S1.zip) as R. Ranganathan did in previous study [7]. The members of S1A family have the same peptide bond hydrolysis mechanism and possess broad substrate spectrum. PDZ family is a common domain in signal protein, widely existing in bacteria, fungi, plant, animal, and virus [51], [52], [53], [54], which mediates the protein-protein interaction between α2–β2 groove and the C-terminal ligand of target protein. The dataset was obtained through PSI-BLAST [55] from NCBI (release 2.2.14, May-07-2006) non-redundant database with 3TGI and 1BE9 as the template sequences, and the multiple sequence alignment was provided by Clustal X [7], [56].

2. Construction of Weighted Covariance Matrix and Reduction of Noise Sites

2.1 Construction of covariance matrix

Proteins consist of 20 common amino acids. For the purpose of calculation, we replace 20 common amino acids with number 1–20 and gap with 0. BIFANR constructs the covariance matrix by the formula:in which is the observed frequency of the amino acid a at position i and represents the joint frequency of having a at position i and b at position j. BIFANR constructs the weighted covariance matrix by the formula:. where represents the background frequency of amino acid a. is the weighted covariance matrix. The more relevant between sites i and j, the higher probability of synergistic reaction and the larger correlation coefficient; On the contrary, the correlation coefficient is small.

2.2 Reduction of noise sites

In this study, we consider that every two sites in one protein sector have significant correlation, thus have large correlation coefficient. For each site i, we take top 5% sites that have the largest correlation coefficients with i. And we calculate the average of them and represent this average with Rmax(i),. Then, we sum Rmax(i) for all sites i and calculate its average, represented with plus. If site i belongs to one protein sector, its Rmax(i) should be larger than 0.8plus. Thus, the site with Rmax no larger than 0.8plus is considered as noise site. Finally, we remove the rows and columns of noise sites in and represent the new matrix with .

3. Bi-factor Analysis

3.1. Selection of eigenvectors

In this study, we selected eigenvectors by the following steps. (1) Calculate eigenvalues of the matrix in descending order, where n is the number of amino acid sites. (2) Scramble each column of the original alignment randomly and independently, and represent the new random alignment with Al. Then we calculate the eigenvalues of Al. After randomly combining for 100 times, we obtain 100n eigenvalues and put them into set E. (3) Count the number of eigenvalues in E which are bigger than in and represent this number with N(i), Obviously, N(i) is in increasing order for i = 1, …, n. (4) Let N(t) be the last one less than 100 in {N(i); i = 1, …, n } and take the corresponding eigenvectors of .

3.2 Rotation, the linear combination of eigenvectors

We conduct varimax orthogonal rotation which can maximum [57]:where is the i-th element of the k-th eigenvector after rotation.

3.3 Bidirectional selection of protein sectors

We construct protein sectors for each factor (eigenvector after rotation) as following. (1) Put factor coefficients of this factor in descending order, select top 50% and calculate the average of them, represented with w. Then we calculate the ratio , where p(i) is the coefficient between the factor and amino acid site i, i.e. the i-th sector of the factor. (2) If the ratio r(i) is not smaller than the given threshold and p(i) is bigger than the given threshold , amino acid site i belongs to the protein sector. (3) Besides, if p(i) is large enough, say no smaller than the given threshold , then this amino acid belongs to the protein sector, too.

3.4 Merging of protein sectors

Bidirectional selection of protein sectors may lead to the occurrence of overlap protein sectors and we merge them into one as following. (1) For two overlap protein sectors, we use a vector named Ssame to record the overlap sites and another vector Sdiff to record the symmetric set difference. (2) Select such site from Sdiff satisfying that the sum of the correlation coefficients between the site and sites in Ssame is the biggest, and put it into Ssame. (3) Repeat step (2) until Ssame reaches the size of the smaller protein sector before merging. Then we output Ssame as protein sector.

4. Examination of Protein Sectors

4.1 Correlations between amino acids in a protein sector

We calculate the average correlation coefficients and random average correlation coefficients of each protein sector. We use these two parameters to measure the significant correlation between amino acids in a protein sector.

4.2 Statistical independence

We use MDI entropy to measure the degree to which a selected group of residues are statistically coupled to each other in the multiple sequence alignment. In this study, the definition of statistical independence is that, if two protein sectors are independent, then the MDI entropy of two taken together must be the sum of their MDI entropies taken individually [7], [58]. We adapt generalized iterative scaling algorithm to calculate MDI entropy [59].

4.3 Calculation of the evolutionary rate

In this study, the evolutionary rate is estimated by Rate4site. Rate4Site adapts maximum likelihood criterion to estimate the normalized rate of evolution at each site, taking into consideration the topology and branch lengths of the phylogenetic tree. The sites with positive values evolve faster than average, and sites with negative values evolve slower than average for that protein.

4.4 Evolutionary independence

To evaluate the evolutionary independence of protein sectors, we use Principle Component Analysis to separate proteins based on the sequence similarities of sites in protein sector. Programs of algorithm BIFANR can be obtained from Program S1 i. Average correlation coefficients of protein sectors. A: Average correlation coefficient of each protein sector in G protein family. B: Average correlation coefficient of each protein sector in Hsp70/110 family. Blue and green columns represent average correlation coefficient of protein sector 1 and protein sector 2 in G protein family. Red and green columns represent average correlation coefficient of protein sector 1 and protein sector 2 in Hsp70/110 family. Black column represents stochastic expected average correlation coefficient. (TIF) Click here for additional data file. Statistical independence of protein sectors. A: Statistical independence of protein sectors in G protein family. B: Statistical independence of protein sectors in Hsp70/110 family. Red column represents MDI entropy of protein sector 1. Blue column represents MDI entropy of protein sector 2. Black column represents MDI entropy of two protein sectors as a whole. Yellow column represents stochastic expected MDI entropy after disrupting the amino acid sites within two protein sectors 100 times. (TIF) Click here for additional data file. The amino acid sites in each protein sector of the two protein families S1A and PDZ. (DOC) Click here for additional data file. The amino acid sites in each protein sector of G protein family and Hsp70/110 family. (DOC) Click here for additional data file. Programs of algorithm BIFANR. (ZIP) Click here for additional data file. Data of S1A serine protease family, PDZ family, HSP family and G family. (ZIP) Click here for additional data file.
  54 in total

1.  Persistently conserved positions in structurally similar, sequence dissimilar proteins: roles in preserving protein fold and function.

Authors:  Iddo Friedberg; Hanah Margalit
Journal:  Protein Sci       Date:  2002-02       Impact factor: 6.725

2.  Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.

Authors:  Tal Pupko; Rachel E Bell; Itay Mayrose; Fabian Glaser; Nir Ben-Tal
Journal:  Bioinformatics       Date:  2002       Impact factor: 6.937

3.  Networks of coevolving sites in structural and functional domains of serpin proteins.

Authors:  Michael J Buck; William R Atchley
Journal:  Mol Biol Evol       Date:  2005-04-27       Impact factor: 16.240

4.  A novel method for detecting intramolecular coevolution: adding a further dimension to selective constraints analyses.

Authors:  Mario A Fares; Simon A A Travers
Journal:  Genetics       Date:  2006-03-17       Impact factor: 4.562

5.  Evidence for PDZ domains in bacteria, yeast, and plants.

Authors:  C P Ponting
Journal:  Protein Sci       Date:  1997-02       Impact factor: 6.725

6.  Correlated mutations contain information about protein-protein interaction.

Authors:  F Pazos; M Helmer-Citterich; G Ausiello; A Valencia
Journal:  J Mol Biol       Date:  1997-08-29       Impact factor: 5.469

7.  Evolution of proteins formed by beta-sheets. II. The core of the immunoglobulin domains.

Authors:  A M Lesk; C Chothia
Journal:  J Mol Biol       Date:  1982-09-15       Impact factor: 5.469

8.  The molecular basis of thrombin allostery revealed by a 1.8 A structure of the "slow" form.

Authors:  James A Huntington; Charles T Esmon
Journal:  Structure       Date:  2003-04       Impact factor: 5.006

9.  Correlation analysis for protein evolutionary family based on amino acid position mutations and application in PDZ domain.

Authors:  Qi-Shi Du; Cheng-Hua Wang; Si-Ming Liao; Ri-Bo Huang
Journal:  PLoS One       Date:  2010-10-06       Impact factor: 3.240

Review 10.  The HTLV-1 Tax interactome.

Authors:  Mathieu Boxus; Jean-Claude Twizere; Sébastien Legros; Jean-François Dewulf; Richard Kettmann; Luc Willems
Journal:  Retrovirology       Date:  2008-08-14       Impact factor: 4.602

View more
  1 in total

1.  Block-based characterization of protease specificity from substrate sequence profile.

Authors:  Enfeng Qi; Dongyu Wang; Bo Gao; Yang Li; Guojun Li
Journal:  BMC Bioinformatics       Date:  2017-10-03       Impact factor: 3.169

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.