Literature DB >> 20949088

Correlation analysis for protein evolutionary family based on amino acid position mutations and application in PDZ domain.

Qi-Shi Du1, Cheng-Hua Wang, Si-Ming Liao, Ri-Bo Huang.   

Abstract

BACKGROUND: It has been widely recognized that the mutations at specific directions are caused by the functional constraints in protein family and the directional mutations at certain positions control the evolutionary direction of the protein family. The mutations at different positions, even distantly separated, are mutually coupled and form an evolutionary network. Finding the controlling mutative positions and the mutative network among residues are firstly important for protein rational design and enzyme engineering.
METHODOLOGY: A computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. The amino acid position mutative function, which is the foundational equation of CMCA measuring the mutation of a residue at a position, is derived from the MSA (multiple structure alignment) database of protein evolutionary family. Then the position conservation correlation matrix and position mutation correlation matrix is constructed from the amino acid position mutative equation. Unlike traditional SCA (statistical coupling analysis) approach, which is based on the statistical analysis of position conservations, the CMCA focuses on the correlation analysis of position mutations.
CONCLUSIONS: As an example the CMCA approach is used to study the PDZ domain of protein family, and the results well illustrate the distantly allosteric mechanism in PDZ protein family, and find the functional mutative network among residues. We expect that the CMCA approach may find applications in protein engineering study, and suggest new strategy to improve bioactivities and physicochemical properties of enzymes.

Entities:  

Mesh:

Substances:

Year:  2010        PMID: 20949088      PMCID: PMC2950854          DOI: 10.1371/journal.pone.0013207

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Coevolution is a well known phenomenon in biological world. However, coevolution in families of proteins and genes is still an open topic [1]. Conservation and mutation are two opposite aspects of functional evolution of protein family. It is commonly accepted that the evolution of a protein family is the result of large-scale random mutagenesis, with selection constraints imposed by their biological functions. In the studies of statistical analysis for protein evolutionary family the following two basic hypotheses were recognized widely, which were derived from the empirical observation of sequence evolution [2]. (i) The lack of evolutionary constraint at one position should cause the distribution of observed amino acids at that position in the MSA (multiple structure alignment) to approach their mean abundance in all proteins, and deviances from the mean values should quantitatively represent conservation. (ii) The functional coupling of two positions, even if distantly positioned in the structure, should mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions [3], [4], [5]. The protein functions are not only determined by the interactions between local residues, but also depend on nonlocal, long-range communication between amino acids [6]. For example, information transmission between distant functional surfaces on signaling proteins [7], the distributed dynamics of amino acids involved in enzyme catalysis [8], [9], [10], and allosteric regulation in various proteins [3], [11] all represent manifestations of nonlocal interactions between residues. To the extent that these features contribute to defining biological properties of protein lineages, it is expected that the underlying mechanisms represent a long-range mutative network, consisting of local and nonlocal residues. Understanding the fundamental basis of long-range communication represents a major challenge in structural biology, which is significantly important for enzyme engineering and rational protein design. Over the last ten years, a considerable methodological effort has been made to detect coevolution in protein and gene families at molecular level. In the method developed by Dunn et al. [12], [13] information theory was used to reduce the random noise in the identification of coevolving positions. Dutheil and Galtier reviewed the literatures about molecular coevolution between or within residues in gene and protein families [1], [14]. A successful approach, namely statistical coupling analysis (SCA), was developed by Ranganathan's group [2], [15], [16], [17], which focused on the conservations between coupling positions (sectors). However, in the enzyme engineering [18] what we are more interested is that how the amino acid mutations at certain positions modify (or improve) the biological functions (including bioactivity, thermostability, pH tolerance, and other properties) of enzymes. In the rational protein design we want to know the dominative positions for functional evolution and the mutually mutative network among positions in three dimensional structures of proteins. In this study a computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. Unlike traditional SCA (statistical coupling analysis) approach [2], [15], [16], [17], which is based on the statistical analysis of position conservations, the CMCA focuses on the correlations of position mutations in a protein evolutionary family. We expect that the CMCA approach may find applications to rational protein design and enzyme engineering to improve bioactivities and physicochemical properties of enzymes.

Results

In this study the PDZ domain family is selected as a model system to demonstrate the CMCA approach. The PDZ domain is a common structural domain found in the signaling proteins [19] of bacteria, yeast, plants, viruses [20], animals [21], [22], and human [23], [24]. PDZ domains consist of 90–100 amino acid modules that adopt a six-stranded β sandwich configuration with two flanking α helices (Fig. 1). Target C-terminal ligands bind in a surface groove formed between the β2 strand and α2 helix and make a number of interactions that determine both general and sequence specific recognition [25], [26], [27]. Both the overall three-dimensional structure and most details of ligand recognition are highly conserved in the PDZ family despite considerable sequence divergence [28]. PDZ domains well represent protein binding motifs for which four high-resolution structures of distantly related members exist [25], [29], [30]. These domains help anchor transmembrane proteins to the cytoskeleton and hold together signaling complexes [31], [32].
Figure 1

Structure alignment of four PDZ proteins and results of position conservation-mutation correlation analysis.

(A) The 3D structural alignment of 2QKT, 2F5Y, 1G9O, and 1BE9. The 27 residues, at the positions having higher mutation correlation coefficients (see Table 1), are shown in stick render. (B) The surface of α2-β2 groove of 1BE9 and the peptide ligand in 1BE9. Blue is for hydrophilic surface and green is for hydrophobic surface. (C) The residues at the controlling positions for ligand affinity. The sizes of Tyr79 and Leu81 of 2QKT (blue) are much bigger than the Ala76 and Ala78 of 1BE9 (green). (D) The disulfide bond between Cys37 and Cys78 of 2QKT. (E) Sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having high position mutation correlation coefficients (see Table 1), are indicated by green frames.

Structure alignment of four PDZ proteins and results of position conservation-mutation correlation analysis.

(A) The 3D structural alignment of 2QKT, 2F5Y, 1G9O, and 1BE9. The 27 residues, at the positions having higher mutation correlation coefficients (see Table 1), are shown in stick render. (B) The surface of α2-β2 groove of 1BE9 and the peptide ligand in 1BE9. Blue is for hydrophilic surface and green is for hydrophobic surface. (C) The residues at the controlling positions for ligand affinity. The sizes of Tyr79 and Leu81 of 2QKT (blue) are much bigger than the Ala76 and Ala78 of 1BE9 (green). (D) The disulfide bond between Cys37 and Cys78 of 2QKT. (E) Sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having high position mutation correlation coefficients (see Table 1), are indicated by green frames.
Table 1

Mutation couple pairs with higher correlation coefficients in position mutation correlation matrix of PDZ database.

Mutation regiona Couple pairb Correlation coefficient1BE9 numbering
ijRi,j ij
R141440.8267083639
4238c 0.84711537c
R246140.8521984116
4890.8278864311
R349290.8528674430
56290.8509575130
R453520.8573844847
R561560.9742215651
67430.8686646238
69490.9343576444
69560.8133156451
74470.8201226942
R664260.8135975927
6534c 0.89678660c
69290.9265726430
71250.8387726626
R781670.8710947662
R883140.8922207816
R983460.8657997841
R108634c 0.80998181c
9234c 0.95990486c
R1186650.8009128160
92650.9344988660
R1292860.8446108681

The regions (R1 to R12) are shown in (the contour map of position mutation correlation matrix).

The position numbers are from PDZ database of ref [33].

The positions indicated by superscript ‘c’ are the gaps in the 1BE9, which are inserted in multiple alignment.

In this study we use a multiple structure aligned PDZ database [33] consisting of 240 PDZ proteins. After sequence alignment there are 129 positions in the PDZ database, and after deletion of the unnecessary gaps, 27 positions are deleted and the reduced database contains 102 positions.

Conservation correlation analysis of PDZ

Following the procedure described in Method section, we first perform the conservation position correlation analysis to the PDZ protein family. The position conservation correlation matrix R(con) L×L of PDZ is graphically shown in Fig. 2. The matrix R(con) L×L is symmetric to the diagonal line. For a clear view the matrix elements r (con) less than 0.5 are filtered. The matrix elements on the diagonal line, whose values are 1 (r (con)  = 1), are not shown, because the diagonal elements are self correlation coefficients, having no statistical meaning. Fig. 2 A is the relief map of position conservation correlation matrix R(con) L×L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 2 B is the contour map of position conservation correlation matrix R(con) L×L of PDZ database. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange.
Figure 2

Graphical representation of position conservation correlation matrix R(con) L×L of PDZ.

(A) The relief map of position conservation correlation matrix R(con) L×L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R(con) L×L is complementary to the relief map of position mutation correlation matrix R(mut) L×L (Fig. 3). (B) The contour map of position conservation correlation matrix R(con) L×L of PDZ database. For a clear view the matrix elements r (con) less than 0.5 are filtered, and the elements (r (con)  = 1) on diagonal line are not shown. In the primer PDZ database the sequence length is 129, including gaps, which are inserted in the multiple alignment. After deletion of the unnecessary gaps, the length is reduced to L = 102. The position conservation correlation matrix R(con) L×L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange.

Graphical representation of position conservation correlation matrix R(con) L×L of PDZ.

(A) The relief map of position conservation correlation matrix R(con) L×L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R(con) L×L is complementary to the relief map of position mutation correlation matrix R(mut) L×L (Fig. 3). (B) The contour map of position conservation correlation matrix R(con) L×L of PDZ database. For a clear view the matrix elements r (con) less than 0.5 are filtered, and the elements (r (con)  = 1) on diagonal line are not shown. In the primer PDZ database the sequence length is 129, including gaps, which are inserted in the multiple alignment. After deletion of the unnecessary gaps, the length is reduced to L = 102. The position conservation correlation matrix R(con) L×L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange.
Figure 3

Graphical representation of position mutation correlation matrix R(mut) L×L of PDZ.

(A) The relief map of position mutation correlation matrix R(mut) L×L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R(mut) L×L is complementary to the relief map of position conservation correlation matrix R(con) L×L (Fig. 2). (B) The contour map of position mutation correlation matrix R(mut) L×L of PDZ database. For a clear view the matrix elements r (mut) less than 0.5 are filtered, and the elements (r (mut)  = 1) on diagonal line are not shown. The position mutation correlation matrix R(mut) L×L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange. In the map there are 12 regions (R1 to R12), where the correlation coefficients r (mut) are higher than 0.80.

In Fig. 2 A most places are high peaks and in Fig. 2 B most areas are in red color, meaning that in many sequence positions the residues are highly conserved, consistent to the high conservation of PDZ family. It is difficult to dig out detailed information from the position conservation correlation matrix R(con) L×L of PDZ database, because too many conservative positions complicate the analysis.

Mutation correlation analysis of PDZ

Then we perform the mutation position correlation analysis to the PDZ protein family. The position mutation correlation matrix R(mut) L×L of PDZ is graphically shown in Fig. 3.

Graphical representation of position mutation correlation matrix R(mut) L×L of PDZ.

(A) The relief map of position mutation correlation matrix R(mut) L×L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R(mut) L×L is complementary to the relief map of position conservation correlation matrix R(con) L×L (Fig. 2). (B) The contour map of position mutation correlation matrix R(mut) L×L of PDZ database. For a clear view the matrix elements r (mut) less than 0.5 are filtered, and the elements (r (mut)  = 1) on diagonal line are not shown. The position mutation correlation matrix R(mut) L×L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange. In the map there are 12 regions (R1 to R12), where the correlation coefficients r (mut) are higher than 0.80. The matrix R(mut) L×L is symmetric to the diagonal line. For a clear view the matrix elements r (mut) less than 0.5 are filtered, and the elements on the diagonal line (r (mut)  = 1) are not shown. Fig. 3 A is the relief map of position mutation correlation matrix R(mut) L×L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 3 B is the contour map of position mutation correlation matrix R(mut) L×L. The map is colored in the same manner as in Fig. 2. After careful observation and comparison we find partially complementary relationship between the two correlation matrices R(con) L×L and R(mut) L×L: the peaks and valleys are located in alternate places in Fig. 2 A and Fig. 3 A. And in Fig. 3 B there are less areas in red color than in Fig. 2 B. Only 12 separated red regions (R1 to R12) with higher correlation coefficients (r>0.80) are found in Fig. 3 B. It is easier to find useful information from the position mutation correlation matrix R(mut) L×L than that from the position conservation correlation matrix R(con) L×L.

Information from CMCA of PDZ

Several studies have highlighted the presence of interaction networks within single-domain proteins, which are crucial for allostery, stability, and folding [2], [17], [34], [35], [36]. Based on the statistical coupling analysis (SCA) [2] PDZ domains were proposed to contain energetically coupled positions between residues located in the binding site and elsewhere, forming a long-range interaction network. Table 1 lists 24 position pairs with higher correlation coefficients (r (mut) >0.80) in the mutation correlation matrix R(mut) L×L, which distribute in 12 red regions in Fig. 3 B. In Table 1 the 24 position pairs are numbered according to the PDZ database [33]. The corresponding position numbers in the PDZ protein 1BE9 [25] are also listed in Table 1, which is a well investigated PDZ protein. Total 30 different positions are in the 24 position pairs. Among the 30 positions, 3 positions are gaps in the protein 1BE9. The correlations of four position pairs are shown in Fig. 4, which possess higher mutative correlation coefficients (R71−25 = 0.8388, R46−14 = 0.8522, R83−14 = 0.892, and R61−56 = 0.9742). The position 14 is highly correlated to both position 46 and position 83. Therefore, the three positions form a mutually mutative group (14, 46, 83) in the PDZ family.
Figure 4

Amino acid mutation correlation relationships of 4 couple position pairs.

The 4 position pairs have high mutation correlation coefficients (R71−25 = 0.8388, R46−14 = 0.8522, R83−14 = 0.892, and R61−56 = 0.9742), and the position mutation factors (t) of 20 amino acid types at the couple position pairs show higher correlation relationship.

Amino acid mutation correlation relationships of 4 couple position pairs.

The 4 position pairs have high mutation correlation coefficients (R71−25 = 0.8388, R46−14 = 0.8522, R83−14 = 0.892, and R61−56 = 0.9742), and the position mutation factors (t) of 20 amino acid types at the couple position pairs show higher correlation relationship. The regions (R1 to R12) are shown in (the contour map of position mutation correlation matrix). The position numbers are from PDZ database of ref [33]. The positions indicated by superscript ‘c’ are the gaps in the 1BE9, which are inserted in multiple alignment. The structure alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9) and results of position mutation correlation analysis are shown Fig. 1. The 27 residues, shown in stick render in Fig. 1 A, are located at the positions having higher mutative correlation coefficients (see Table 1). The surface of α2-β2 groove of 1BE9 and the peptide ligand is shown in Fig. 1 B. Blue is for hydrophilic surface and green is for hydrophobic surface. Fig. 1 C shows the residues of PDZ proteins 1BE9 and 2QKT at the positions controlling the ligand affinity. The size of Tyr79 and Leu81 of 2QKT (blue) are much larger than the corresponding residues Ala76 and Ala78 of 1BE9 (green). The PDZ protein 2QKT has a disulfide bond between Cys37 and Cys78 shown in Fig. 1 D, which makes it different from other PDZ proteins [26], [37]. The position 37 and 78 are easily mutative positions based on CMCA results in Table 1. Fig. 1 E shows the sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having higher position mutation correlation coefficients (see Table 1), are indicated by green frames. The 27 mutative positions with higher mutation correlation coefficients distribute in two α-helices, all 6 β-strands, and some loops, including the easily mutative positions and some very conservative positions [33]. In Fig. 1 E among the 27 positions there are two larger sectors, each of them consists of four adjacent positions: 36–39 (in β3) and 41–44. Some interesting findings are summarized as follows.

Controlling positions for ligand affinity

The groove between α2 helix and β2 strand is the binding location for ligand peptide [2], [38] and the residues at these positions are highly conservative. However, mutations at these highly conservative positions may have more important significance to biological functions. Four easily mutative positions are found in the α2-β2 groove: Asn26 and Ile27 in β2, Ala76 and Ala78 in α2 (1BE9 numbering), which determine the ligand binding affinity and control the peptide shape and specificity. In Fig. 1 C the small residues Ala76 and Ala78 (in green) of 1BE9 are replaced by Tyr79 and Leu81 (in blue) of 2QKT. The size of Tyr79 and Leu81 of 2QKT are much larger than the Ala76 and Ala78 of 1BE9. Therefore 1BE9 and 2QKT must have very different preferences of peptide ligand. The biological relevance of long-range allosteric effects in PDZ domains has attracted considerable attention [7], [39], [40]. The PDZ domain of the cell polarity protein Par6 was shown to be allosterically regulated by its adjacent Crib domain in response to binding of CdC42 [39]. Structural analysis showed the β1-α1 interface of the Par6 PDZ domain to be in direct contact with the Crib domain and responsible for transmission to the structurally distinct peptide binding pocket [41]. The results of CMCA fully support above observations. Two easily mutative positions (Ala47 and Asp48, in 1BE9 numbering) are found in α1 helix, and two positions Pro11 and Ile16 (in 1BE9 numbering) are found in β1 strand. These easily mutative positions are connected through peptide ligand and define an allosteric mechanism for regulating binding affinity at the α2-β2 groove through molecular interactions at a distant surface site on the α1 helix [39].

Disulfide bond in INAD PDZ5

In the alignment of four PDZ proteins in Fig. 1 E the 2QKT [37] is an INAD PDZ [42] domain and belongs to type 5 PDZ. The INAD PDZ domain (PDZ5) exists in a redox-dependent equilibrium [43], [44] between two conformations—a reduced form that is similar to the structure of other PDZ domains, and an oxidized form. In INAD PDZ an intramolecular disulfide bond covalently links a pair of buried cysteine residues located underneath the floor of the ligand-binding pocket [26], [37]. In 2QKT the disulfide bond is formed between Cys37 in β3 and Cys78 in α2 (in Fig. 1 E numbering). The positions of Cys37 and Cys78 are corresponding to the positions of residues Ile36 and Ala75 of 1BE9, respectively. The position 36 (in 1BE9 numbering) is an easily mutative position according to results of CMCA calculations (see Table 1), and the position 75 (in 1BE9 numbering) is adjacent to the easily mutative position 76 (see Table 1) falling into the mutative region R7 (see Fig. 3 B). The strong intramolecular disulfide bond connects the β3 strand with the α2 helix, suggesting that this interaction may be responsible for the equilibrium between the reduced conformation and the oxidized conformation in INAD PDZ5.

Distantly allosteric network in PDZ proteins

The functional coupling of two positions, even if distantly positioned in the structure, could mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions [2], [4], [23]. In some cases the functional coupling is not limited only between two positions, but could be among several distant positions, which form a mutually evolutionary network. Long-range allosteric effects that cause the preference change of ligand peptide in the PDZ binding groove happen in several distant positions. The results of CMCA study reveal the mutually multi coupling positions in the PDZ family. Table 1 lists the couple pairs of easily mutative positions. Actually, these positions can be reorganized into three groups according to the mutual couple pairs (in 1BE9 numbering): (30, 44, 51, 56, 64), (16, 41, 78), and (60, 81, 86). In the group 2 the three positions are at distantly separated β1 (position 16), β3 (position 41), and α2 (position 78) that may form a mutually mutative network and may affect the binding sites in α2-β2 groove. These findings could provide an explanation to distant allosteric interaction network in PDZ proteins.

Discussion

Conservation and mutation are two opposite aspects of functional evolution of protein family. In the studies for protein evolutionary family the conversation statistical analysis can provide useful information, and several successful tools are developed based on the position conversations, such as SCA (statistical coupling analysis) [2], [15] and MI (mutual information) [12], [13]. In this study we prove that correlation analysis based on position mutations of amino acids also can reveal very useful information for study of functional evolution of protein family. The position mutations are equally important to the position conservations for study of functional evolution of protein family. In the conservation-based statistical methods the “phylogenetic relationship” in a protein family causes the “coherent correlation” of all positions [33], which is raised by sharing common ancestry. Great efforts have been made to solve the “coherent correlation” problem [12], [13]. In this study the foundational equation is the amino acid position mutative function (Eq.7), based on which the amino acid position mutation matrix TM×L is constructed, and the CMCA approach is developed. Unlike the conservation-based methods, the “coherent correlation” problem may be avoided in the mutation-based method. The theoretical implications of Eq.7 and CMCA approach are summarized as follows. (i) The actual mutations at specific directions are caused by the functional constraints in the protein evolutionary family. The directional mutations at some key positions control the functional evolution of the protein family. (ii) The functional coupling of two or more positions, even if distantly positioned in the structure, mutually constrains mutations at these positions, which form a communicative and evolutionary network in the protein family. The computation results of CMCA application to the PDZ protein database show that generalizing the principle of mutations to account for correlations between positions reveals a novel structural organization for PDZ proteins that is distinct from traditional structural descriptions and the SCA approach. The CMCA approach and the SCA approach describe the distant allostery and mutative network in protein evolutionary family from different aspects (mutations and conservations), therefore both methods can provide useful information complementally. Because the conservation-mutation correlation analysis is based on the correlation analysis of amino acid mutations, the CMCA approach may find applications in rational protein design and enzyme engineering by means of artificial residue mutations, and provide suggestion to improve the bioactivities and physicochemical properties of enzymes.

Materials and Methods

Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structures. It has been shown that three-dimensional protein structure is evolutionarily more conserved than expected due to sequence conservation [45], [46]. However, the evolution of protein family mainly depends on the mutations happening on the key positions in the 3D structures. Statistical analysis for a protein evolutionary family starts from a multiple 3D structural alignment (MSA) of a homologous protein group.

Multiple structure alignment of protein family

In this study, the multiple structure alignment procedure is used. Chains that possess coordinates for all their alpha carbons can be realigned taking into account their structure. From an initial estimate of the alignment, a new similarity matrix is generated using the relative alpha carbon coordinates that result from a multi-body superposition. This matrix is used to realign just these alpha carbon populated chains. This procedure is then repeated until the Root Mean Square Distance (RMSD) of the superposition fails to improve. The multiple structural alignment of a protein family has to reveal the structural features: all key functional residues are aligned in same sequence columns, and all key secondary structures (α-helices, β-sheets, and loops) are positioned in the same sectors. After multiple sequence alignment the protein family is represented by a three dimensional primer data matrix X(0) N×M×L. N is the number of protein structures in the database, M is the types of amino acids (M = 21, including 20 natural amino acids and the gap, which are inserted during the multiple alignment), and L is the length of amino acid sequences (including gaps). The database matrix X(0) N×M×L is a binary matrix, in which the element x(0) of sequence i at position l is 1 when the amino acid is a, otherwise, it is 0,

Amino acid position frequency matrix

From the primer data matrix X(0) N×M×L the primer amino acid position frequency matrix F(0) M×L is constructed as follows,The value of f (0) is an integer in region [0,N], equals the times of amino acid a appearing at position l in all N protein sequences. The higher value of f (0) means the higher frequency of amino acid a at position l. In this study the gaps are treated as a special amino acid type numbered by 0, and the 20 natural amino acids are numbered from 1 to 20. The summation of f (0) from k = 0 to M is N. The F(0) M×L is integer frequency matrix of amino acids. It can be transformed to decimal frequency after dividing by N.

Reducing unnecessary gaps

The position correlation analysis is complicated by the presence of alignment gaps, commonly called indels, indicating the structural region present in some proteins but not in others. The gaps (space positions) in the primer data matrix X(0) N×M×L may interfere with the results of statistical analysis badly. Before performing the correlation analysis we have to reduce the unnecessary gaps. To do so, the total amino acid frequencies of 20 natural amino acids at each position l are computed as follows.In Eq.3 the index k for amino acid types is from 1 to M = 20, not including the gap. If the total amino acid position frequencies of 20 natural amino acids q (0) is less than 20%, the position l is deleted from the primer sequences. Because it means that at the position l more than 80% ‘amino acids’ are gaps, and this position is less important for the biological function of the protein family. After unnecessary gaps are deleted, we get the reduced data matrix XN×M×L and amino acid position frequency matrix FM×L, in which the sequence length L is smaller than in the primer data matrix. For simplicity, we still use L for the reduced sequence length.

Position conservation correlation matrix

The position conservation correlation matrix can be derived from the reduced position frequency matrix FM×L. Because the conservation is directly correlated to the amino acid position frequency, the higher frequency f of amino acid k at position l, the more conservation of amino acid k at this position. For position conservation correlation analysis the position frequency covariance matrix C(con) L×L is constructed firstly from the reduced position frequency matrix FM×L,where and are the average frequencies at position i and j, respectively,Hereby we get the position conservation correlation matrix R(con) L×L from the position covariance matrix C(con) L×L as follows.where the superscript ‘con’ indicates the ‘conservation’, and (con) is the position conservation correlation coefficient between position i and j.

Position mutation correlation matrix

Before computing the amino acid position mutation correlation matrix, we have to build the amino acid position mutative equation, which measures the mutation of amino acid k at position l in protein family. For this purpose the amino acid types n (including gap) at each position l in the protein family is very useful, which describes the diversification of amino acids at position l. The larger value of n, the more mutations at position l. The amino acid position mutation matrix TM×L is constructed as follows.The value t , is the measurement of mutation of amino acid k at position l in protein family, which is directly proportional to the amino acid types n at position l and inversely proportional to the integer frequency f , of amino acid k at position l. The term N/n in denominator is the average frequency at position l. The ‘1’ in denominator is added to avoid the infinite value when the frequency of amino acid k at position l equals to the average frequency f , = N/n. The ‘1’ in numerator makes the value of t , is 0 for all amino acid types when n is 1 (no mutation at position l). Fig. 5 shows the curve shapes of amino acid position mutative equation (Eq.7). When the frequency of amino acid k takes the average value at position l (f , = N/n) all curves have the maximum values, and when the amino acid type at position l has the largest value (n = 20), the mutative factor gets the largest value.
Figure 5

The mutative factor function (Eq.7) used in conservation-mutation correlation analysis (CMCA).

The curves are computed according to Eq.7. The horizontal coordinates are the integer position frequencies f of amino acids in protein database, and the vertical coordinates are the mutative factors. The n is the amino acid types at position l. In calculations the protein sample number N is 100. When the frequency of amino acid k takes the average value f = N/n, all curves have the maximum values, and when the amino acid types at position l has the largest value (n = 20), the mutative factor gets the largest value 19.

The mutative factor function (Eq.7) used in conservation-mutation correlation analysis (CMCA).

The curves are computed according to Eq.7. The horizontal coordinates are the integer position frequencies f of amino acids in protein database, and the vertical coordinates are the mutative factors. The n is the amino acid types at position l. In calculations the protein sample number N is 100. When the frequency of amino acid k takes the average value f = N/n, all curves have the maximum values, and when the amino acid types at position l has the largest value (n = 20), the mutative factor gets the largest value 19. Following the same procedure described in sector 2.4, we can construct the position mutation covariance matrix C(mut) L×L from amino acid position mutation matrix TM×L,where and are the average mutations at position i and j, respectively,Hereby we get the position mutation correlation matrix R(mut) L×L from the position mutation covariance matrix C(mut) L×L as follows.where (mut) is the position mutation correlation coefficient between position i and j. The computational procedure is graphically illustrated in Fig. 6, the flowchart of conservation-mutation correlation analysis (CMCA).
Figure 6

The flowchart of conservation-mutation correlation analysis (CMCA).

The binary matrix X(0) N×M×L is the primer database of protein evolutionary family, and F(0) M×L is the primer amino acid position frequency matrix. Integer N is the number of protein samples, M = 21 is the types of amino acids (including the gaps), and L is the length of protein sequences. After the unnecessary gaps are deleted, the above two matrices are denoted as XN×M×L and FM×L. From the frequency matrix FM×L the amino acid position conservation correlation matrix R(con) L×L is constructed, and from the amino acid position mutation matrix TM×L the amino acid position mutation correlation matrix R(mut) L×L is constructed.

The flowchart of conservation-mutation correlation analysis (CMCA).

The binary matrix X(0) N×M×L is the primer database of protein evolutionary family, and F(0) M×L is the primer amino acid position frequency matrix. Integer N is the number of protein samples, M = 21 is the types of amino acids (including the gaps), and L is the length of protein sequences. After the unnecessary gaps are deleted, the above two matrices are denoted as XN×M×L and FM×L. From the frequency matrix FM×L the amino acid position conservation correlation matrix R(con) L×L is constructed, and from the amino acid position mutation matrix TM×L the amino acid position mutation correlation matrix R(mut) L×L is constructed.
  44 in total

Review 1.  The changing landscape of protein allostery.

Authors:  Joanna F Swain; Lila M Gierasch
Journal:  Curr Opin Struct Biol       Date:  2006-01-19       Impact factor: 6.809

2.  Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction.

Authors:  S D Dunn; L M Wahl; G B Gloor
Journal:  Bioinformatics       Date:  2007-12-05       Impact factor: 6.937

Review 3.  Dynamic personalities of proteins.

Authors:  Katherine Henzler-Wildman; Dorothee Kern
Journal:  Nature       Date:  2007-12-13       Impact factor: 49.962

Review 4.  Coevolution within and between genes.

Authors:  N Galtier; J Dutheil
Journal:  Genome Dyn       Date:  2007

5.  Dynamic scaffolding in a G protein-coupled signaling system.

Authors:  Prashant Mishra; Michael Socolich; Mark A Wall; Jennifer Graves; ZiFen Wang; Rama Ranganathan
Journal:  Cell       Date:  2007-10-05       Impact factor: 41.582

6.  Cdc42 acts downstream of Bazooka to regulate neuroblast polarity through Par-6 aPKC.

Authors:  Scott X Atwood; Chiswili Chabu; Rhiannon R Penkert; Chris Q Doe; Kenneth E Prehoda
Journal:  J Cell Sci       Date:  2007-08-28       Impact factor: 5.285

7.  The PDZ protein Canoe/AF-6 links Ras-MAPK, Notch and Wingless/Wnt signaling pathways by directly interacting with Ras, Notch and Dishevelled.

Authors:  Ana Carmena; Stephan Speicher; Mary Baylies
Journal:  PLoS One       Date:  2006-12-20       Impact factor: 3.240

8.  The PDZ domain as a complex adaptive system.

Authors:  Alexei Kurakin; Andrzej Swistowski; Susan C Wu; Dale E Bredesen
Journal:  PLoS One       Date:  2007-09-26       Impact factor: 3.240

9.  Insights into the molecular evolution of the PDZ/LIM family and identification of a novel conserved protein motif.

Authors:  Aartjan J W Te Velthuis; Tadamoto Isogai; Lieke Gerrits; Christoph P Bagowski
Journal:  PLoS One       Date:  2007-02-07       Impact factor: 3.240

10.  Detecting groups of coevolving positions in a molecule: a clustering approach.

Authors:  Julien Dutheil; Nicolas Galtier
Journal:  BMC Evol Biol       Date:  2007-11-30       Impact factor: 3.260

View more
  6 in total

1.  Structural position correlation analysis (SPCA) for protein family.

Authors:  Qi-Shi Du; Jian-Zong Meng; Cheng-Hua Wang; Si-Yu Long; Ri-Bo Huang
Journal:  PLoS One       Date:  2011-12-05       Impact factor: 3.240

Review 2.  Seeking allosteric networks in PDZ domains.

Authors:  Candice Gautier; Louise Laursen; Per Jemth; Stefano Gianni
Journal:  Protein Eng Des Sel       Date:  2018-10-01       Impact factor: 1.650

Review 3.  Allosterism in the PDZ Family.

Authors:  Amy O Stevens; Yi He
Journal:  Int J Mol Sci       Date:  2022-01-27       Impact factor: 5.923

4.  Three Binding Conformations of BIO124 in the Pocket of the PICK1 PDZ Domain.

Authors:  Amy O Stevens; Samuel Luo; Yi He
Journal:  Cells       Date:  2022-08-07       Impact factor: 7.666

5.  Bi-factor analysis based on noise-reduction (BIFANR): a new algorithm for detecting coevolving amino acid sites in proteins.

Authors:  Juntao Liu; Xiaoyun Duan; Jianyang Sun; Yanbin Yin; Guojun Li; Lushan Wang; Bingqiang Liu
Journal:  PLoS One       Date:  2013-11-20       Impact factor: 3.240

6.  Ensemble-Based Analysis of the Dynamic Allostery in the PSD-95 PDZ3 Domain in Relation to the General Variability of PDZ Structures.

Authors:  Dániel Dudola; Anett Hinsenkamp; Zoltán Gáspári
Journal:  Int J Mol Sci       Date:  2020-11-06       Impact factor: 5.923

  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.