Literature DB >> 26072474

Using kernelized partial canonical correlation analysis to study directly coupled side chains and allostery in small G proteins.

Laleh Soltan Ghoraie¹, Forbes Burkowski¹, Mu Zhu¹.

Abstract

MOTIVATION: Inferring structural dependencies among a protein's side chains helps us understand their coupled motions. It is known that coupled fluctuations can reveal pathways of communication used for information propagation in a molecule. Side-chain conformations are commonly represented by multivariate angular variables, but existing partial correlation methods that can be applied to this inference task are not capable of handling multivariate angular data. We propose a novel method to infer direct couplings from this type of data, and show that this method is useful for identifying functional regions and their interactions in allosteric proteins.
RESULTS: We developed a novel extension of canonical correlation analysis (CCA), which we call 'kernelized partial CCA' (or simply KPCCA), and used it to infer direct couplings between side chains, while disentangling these couplings from indirect ones. Using the conformational information and fluctuations of the inactive structure alone for allosteric proteins in the Ras and other Ras-like families, our method identified allosterically important residues not only as strongly coupled ones but also in densely connected regions of the interaction graph formed by the inferred couplings. Our results were in good agreement with other empirical findings. By studying distinct members of the Ras, Rho and Rab sub-families, we show further that KPCCA was capable of inferring common allosteric characteristics in the small G protein super-family.
AVAILABILITY AND IMPLEMENTATION: https://github.com/lsgh/ismb15

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2015 PMID： 26072474 PMCID： PMC4765857 DOI： 10.1093/bioinformatics/btv241

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Predicting allosteric regions in proteins and understanding their interaction mechanisms are challenging problems in bioinformatics. It is common to mainly identify backbone motions responsible for the allosteric behaviour of proteins. However, recent studies have not only highlighted the commonly neglected role of side-chain fluctuations in information transmission within a molecule (DuBay ), but also emphasized the presence of allostery in proteins with minimal backbone motions (Tsai ). Moreover, recent discoveries by X-ray crystallography reveal that alternate side-chain conformations are more prevalent than previously thought (Lang ; van den Bedem ). These findings further accentuate the importance of a thorough study of the role played by side-chains in allostery (DuBay ; van den Bedem ). Intrinsic networks of correlated residues are known to play an important role in the propagation of information during an allosteric event. Identifying networks of directly coupled allosteric residues is thus of crucial importance for understanding the allosteric mechanism and interaction paths in a molecule. Common correlation-based analyses, however, cannot disentangle causal (or direct) correlations from transitive (or indirect) ones (Morcos ). To this effect, the statistical concept of partial correlation, a conditional dependence measure between two variables given all other variables, is more appropriate for inferring direct couplings. Existing methods based on the partial correlation such as the graphical LASSO (GLASSO; Friedman ), on the other hand, often assume that the data are generated from a multivariate normal distribution—although Jones also applied it to binary variables. This is a restrictive assumption for many applications such as the one discussed in this paper. In particular, side-chain conformations are commonly modelled by multidimensional angular variables, for which the normal distribution is not a good fit. Another challenge is how to quantify correlations between two multidimensional random variables. Here, a useful statistical tool is canonical correlation analysis (CCA; Hotelling, 1936). We developed a novel extension of CCA by incorporating partial correlation and by using a multivariate von-Mises kernel function (Mardia ) to capture similarities between two multidimensional angular variables. We tested our method on a number of well-studied allosteric proteins from the Ras, Rho and Rab sub-families of the small G protein super-family. While the sequence similarity within a sub-family may be relatively high (50–55%), members of two different sub-families tend to share low (∼30%) sequence identity (Takai ). Despite the low similarity and having distinct functions, 3D structural analysis of these proteins has revealed common characteristics. For example, they cycle between two inter-convertible forms (Raimondi ; Takai ; Wennerberg )—the inactive form [bound to guanosine diphosphate (GDP)] and the active form [bound to guanosine triphosphate (GTP)]. Furthermore, during this cycle, all of the small G proteins undergo major conformational changes in two common regions, referred to as Switch I and Switch II (see Fig. 1) in the literature (Grizot ; Milburn ; Scheffzek ).

Fig. 1.

Superimposed 3D structures of active and inactive H-Ras. Major conformational changes are known to occur in the Switch I and Switch II regions during an allosteric event

Superimposed 3D structures of active and inactive H-Ras. Major conformational changes are known to occur in the Switch I and Switch II regions during an allosteric event Our method successfully identified the aforementioned allosteric regions in these test cases. In each case, residues belonging to these regions are specifically involved in the strongest couplings and are among the highest-degree nodes in the interaction graph formed by the inferred couplings. Furthermore, allosteric and binding sites in these test cases are connected in the interaction graphs as well. This means that, by studying side-chain fluctuations, our method can infer pathways between these sites and shed light on how information propagates between these functionally important residues. It is worthwhile to note that we obtained our results by using information from only the inactive (GDP-bound) structure of each allosteric protein. Most methods for studying allostery use both the active and the inactive structures. However, in many situations, not both structures are readily available. We think our method can provide especially valuable information in these types of situations.

2 Methods

Our method for inferring direct couplings comprises a few fundamental components. First, we rely on the mathematical notion of the partial correlation to measure direct couplings (Section 2.1). Second, we use CCA to quantify the notion of correlation (more specifically, partial correlation) for multivariate data (Section 2.2). Third, we use a specific kernel function—the von-Mises kernel—to measure similarity between two sets of conformational variables expressed in terms of dihedral angles (Sections 2.4, 2.5.2).

2.1 Direct coupling and partial correlation

If variable x is correlated with a set of variables and so is y, a transitive correlation requires that x be also correlated with y. For direct couplings between residues, we are interested in the direct correlation between x and y, not the kind of transitive correlations between them. In many applications, computing direct couplings between residues is crucial (Jones ; Morcos ). The ‘partial correlation’ between x and y is a measure of their dependence after having removed the effect of . It can be computed as follows. First, we respectively regress both x and y onto , that is, we fit the following models to x and y: Let and denote the estimated regression coefficients, for . Let r and r denote the residuals from these regression models, i.e. The partial correlation between x and y is the usual Pearson correlation between r and r.

2.2 Canonical correlation analysis (CCA)

As indicated above, if both x and y are univariate random variables, we can use their usual Pearson correlation to measure their marginal association, or their partial correlation to measure their direct association. But what if both of them are multivariate random variables? Moreover, what if they have different dimensions, e.g. and for some ? One way to come up with a single numeric measure of the association between two multivariate variables and is to compute the quantity, sometimes referred to as the canonical correlation coefficient between and . More specifically, since the maximization problem in Equation (1) is equivalent to subject to the constraints Given a dataset, , where and , let be the usual data matrices, respectively stacking n samples of and as row vectors. If both and are centered so that each column has mean zero, then the sample estimates of , and are simply Hence, the empirical estimate of the canonical correlation coefficient given in Equation (1) can be obtained by solving the maximization problem (2)–(3) using the three sample estimates above, i.e. The maximization problem in Equation (4) is well-known to be a generalized eigenvalue problem (see, e.g. Shawe-Taylor ), and can be solved in many scientific computing platforms, including MATLAB.

2.3 Partial CCA

If, instead, we are interested in a single numeric measure of the direct association between and , we can use the same idea as that of the partial correlation (Section 2.1). That is, we can first remove the effect of from both of them, before computing their canonical correlation coefficient. More specifically, let We simply compute (4) using instead of the original data matrices and . We refer to the resulting estimate, as the partial canonical correlation coefficient between and .

2.4 Kernelization of CCA and partial CCA

It is easy to see that, if we reparameterize and for some , the sample canonical correlation coefficient [Equation (4)] can be computed as Let be an n × n matrix whose (i, j)-th entry is equal to , the inner-product between observations and , and likewise for . Then, Equation (6) can be written as This shows that CCA can easily be ‘kernelized’ (see, e.g. Shawe-Taylor )—simply replace the inner-products, and , with and , for some kernel function . When a different kernel function is used in Equation (7) other than the usual inner-product, we will use the notation, , to refer to the quantity in Equation (7). Clearly, the same argument applies to sample estimate of the partial canonical correlation coefficient [Equation (5)] as well, that is, the quantity can be obtained from Equation (7), too, by simply letting and . Similarly, when a different kernel function is used other than the usual inner-product, we will use the notation, , to distinguish it from . Table 1 summarizes our notations.

Table 1.

Summary of notations

Notation	Meaning	Section
ρ^(x,y)	CCA	2.2
ρ^(x,y\|z)	Partial CCA	2.3
ρ^K(x,y)	Kernelized CCA	2.4
ρ^K(x,y\|z)	Kernelized Partial CCA	2.4

Summary of notations A technical detail, which we largely have suppressed in our exposition here, is that, in Equation (7), it is necessary to add a regularization term such as to both and in the constraints to avoid an otherwise degenerate solution.

2.5 Application of KPCCA to the study of allostery

In this article, we use Kernelized Partial CCA (or simply KPCCA) to quantify the direct coupling between pairs of residues and study the allosteric behaviour of proteins. Let m denote the number of residues in a given protein. For any given pair of residues , we let be the vector of p dihedral angles describing the side-chain conformation of residue a; be the vector of q dihedral angles describing the side-chain conformation of residue b; and be the vector of d dihedral angles describing the side-chain conformations of all other residues. In general, , depending on the type of amino acids for the two residues, whereas d is much larger. When we say that we use KPCCA, we mean that we use the quantity, (see Table 1), to quantify how strongly the two residues a and b are directly coupled. We do this for all pairs of residues. In order to compute , we need multiple observations for , and , that is, different conformations of the same protein; and an appropriate kernel function for measuring the similarity of two different conformations (expressed in terms of dihedral angles). In the next three subsections, we explain in more detail how we addressed these specific issues.

2.5.1 Generation of a conformational ensemble

As mentioned earlier (Section 1), a recent study has highlighted the role of side-chain fluctuations alone in information propagation within ‘natively-folded’ proteins (DuBay ). To conduct such a study, one requires a heterogeneous dataset (or ensemble) of protein structures, in which the source of diversity among the different structures comes from alternate side-chain conformations alone, while the backbone is held fixed. Commonly used methods for generating such datasets include Monte Carlo (MC) and molecular dynamics (MD). The common practice is to introduce a fluctuation or structural change that in a real environment may be caused by heat or other environmental factors. The introduced change can be an amino-acid mutation or a small change in dihedral angles. The fluctuation stimulates a response from the system (molecule) accordingly. The simulation techniques approximate the final stabilized structure that would be a candidate member for the ensemble of conformations. We applied a different (and much more efficient) approach to generate the required protein ensembles for our study, by using two state-of-the-art and fast side-chain prediction (SCP) algorithms, namely, SCWRL (Krivov ) and TreePack (Xu, 2005). Although this approach is different from the commonly used simulation methods, it follows the same principles. For each given protein of m residues, we produced an ensemble consisting of structures, as follows. First, we randomly selected 20% of the side chains and set each of their conformations to a randomly chosen rotamer from the backbone-dependent rotamer library provided by Dunbrack . This step introduced fluctuations to the protein’s conformation. Then, the rest of the side chains were packed by SCWRL or TreePack, and the final structure was added to the ensemble. Essentially, this amounted to solving the side-chain packing problem (a complex optimization problem known to have many local solutions) with many different initial values in order to create a diverse ensemble. This step simulated the response of the system to the fluctuations introduced in the first step. We believe using SCP methods is a reasonable and efficient alternative for data (ensemble) generation, as long as we focus on allosteric effects caused mainly by side-chain fluctuations alone (as opposed to backbone movements).

2.5.2 Weighted Von-Mises kernel function

For , we used the following kernel function to perform KPCCA (see Section 2.4), and likewise for . This is based on the multivariate von-Mises distribution (Mardia ) and treating the dihedral angles as if they were independent. An angular random variable is said to follow the multivariate von-Mises distribution if it has density function, where for , and is a normalizing constant. The parameter describes the location, i.e. the mean (or center), and the parameter () describes the scale, i.e. the spread (or concentration). The parameter, is a matrix whose diagonal elements are zero () and whose off-diagonal elements Λ capture the correlation between x and x. Setting ignores the correlation between x and x. The multivariate von-Mises distribution frequently has been used (e.g. Mardia , 2012) as an appropriate tool for modelling angular variables that describe residue conformations in proteins. We also introduced weights w, w in our kernel function [Equation (8)]. These weights were set to be inversely proportional to the energies of the two structures, i and j, in our ensemble (Section 2.5.1). This allows structures with lower energies—i.e. the ones that are more stable in our ensemble—to contribute more information to our overall procedure.

2.5.3 Choice of κt

The kernel function (8) contains p concentration parameters, , one for each dihedral angle. An advantage of the von Mises kernel is that these concentration parameters can be set to reflect the intrinsic nature of side-chains dihedral angles. For example, the first two dihedral angles are known to undergo more restricted motions, while the third and fourth have more freedom of movement. Hence, we assigned higher concentration parameters to the first two angles () to allow less freedom in motion, and lower concentration parameters to the 3rd and 4th angles () to allow more freedom of movement. The von-Mises kernel can be thought of as the Gaussian kernel (or radial basis kernel) for angular data. To see this, notice that, using the Taylor approximation, , we can write On the other hand, the corresponding Gaussian (or radial basis) kernel is given by Since is a constant not depending on either input to the kernel function, we can see that Equation (9) is equivalent to a Gaussian (or radial basis) kernel with ‘standard deviation unit’ Therefore, our choices of and roughly correspond to using ‘standard deviation units’ of and in the corresponding Gaussian kernel (see Table 2). A side-chain prediction is often deemed successful if the predicted dihedral angle is within of the true angle (Krivov ). Thus, our choice of κ4 agreed with this convention, and we used larger values for to permit less movement for the lower dihedral angles.

Table 2.

Conversion Between κ And σ [Equation (10)]

Dihedral angle		σ_t (degrees)
(t)	κ_t	One decimal	Nearest 10th
1st	8	20.3°	20°
2nd	8	20.3°	20°
3rd	4	28.6°	30°
4th	2	40.5°	40°

Conversion Between κ And σ [Equation (10)]

3 Results and discussion

We tested our method on a number of well-studied Ras and Ras-like proteins (see Table 3). They have been of special interest due to their diverse range of functions. The inactive and active structures of many family members have been crystallized and are known.

Table 3.

Allosteric proteins from three sub-families of the small G protein super-family with PDB (Berman ) IDs of active and inactive structures*

Sub		PDB ID		AUC against CRN
Family	Protein	Inactive	Active	KPCCA	GLASSO
Ras	H-Ras	4Q21	6Q21	0.796	0.776
	Rap2A	1KAO	2RAP	0.693	0.677
	Rheb	1XTQ	1XTS	0.699	0.711
Rho	RhoA	1FTN	1A2B	0.750	0.719
	Rac1	1HH4(A)	1MH1	0.672	0.594
	Cdc42	1AN0	1NF3	0.681	0.675
Rab	Sec4	1G16	1G17	0.676	0.683
	Ypt7p	1KY3	1KY2	0.717	0.666

*Active structure: bound to GTP. Inactive structure: bound to GDP.

Allosteric proteins from three sub-families of the small G protein super-family with PDB (Berman ) IDs of active and inactive structures* *Active structure: bound to GTP. Inactive structure: bound to GDP. We performed both quantitative and qualitative comparisons of our results with those obtained by the Contact Rearrangement Network (CRN; Daily ) and by the GLASSO (Soltan Ghoraie ). For an allosteric protein, the CRN method generates networks of allosteric pathways by calculating significant differences in the residue-residue contact network derived from the inactive structure and that derived from the active structure. Therefore, it provides a direct and model-free analysis of both structures (Daily ). The GLASSO is a relatively new statistical method, which we have used in an earlier study to extract direct couplings between residues, but its application required that we work with discrete conformation variables rather than angular variables that describe the conformations more directly (more details in Section 3.2). We implemented the KPCCA in MATLAB using the Kernel Methods Toolbox (Vaerenbergh, 2010). All three methods’ outputs consisted of a list of coupled residues, each ranked by a score indicating their coupling strength. The quantitative comparison was performed using the receiver-operating characteristic (ROC) curve. Treating the list of CRN results as ‘ground truths’, the Area Under the ROC Curve (AUC) is a numeric summary of how well the ranked list produced by the KPCCA or by the GLASSO matched against the CRN findings (see Table 3). These AUC values show quite conclusively that KPCCA’s detection of allosteric couplings is significantly better than random, and that there is a good deal of agreement between our results and those from the CRN. This is a significant finding considering that the CRN relies on structural information of both the inactive and the active structure of an allosteric protein, while we have analysed the dynamics of the side chains in the inactive structure alone. Furthermore, we evaluated our results qualitatively (see Section 3.1 below) by visualizing them as interaction graphs, and comparing them to the interaction graphs generated by the CRNs and by the GLASSO. The inferred couplings for each test case were visualized as a 3D network graph superimposed onto the 3D structure of the protein itself. All 3D molecular visualizations and graphs were produced using the StructBio package (Burkowski, 2014) for the software, Chimera (Pettersen ). For each coupling, nodes were placed at the α-carbon for each of the involved residues and edges were drawn between them. We used two different cut-offs to threshold the top-ranked couplings when generating the interaction graphs, and studied a small subset of these couplings in more detail. The first threshold was equal to the number of couplings identified by the CRN (Daily ) for each individual test case, so that we could make a fair comparison. The second threshold was 100 for all test cases, and used for generating 2D graphs (Fig. 3), so that connections between residues in important regions could be shown more clearly. It should be noted that both types of cut-offs allowed only a small subset of all the couplings ( 0.6–1%) to be shown. From these graphs, we noticed that the top-ranked couplings often involved allosterically crucial residues (more details in Section 3.1). Moreover, these allosterically important residues often appeared as high-degree nodes in the graphs; sometimes, they could be seen to act as hubs connecting the allosteric region to other functionally important parts of the protein, such as the binding site.

Fig. 3.

2D interaction graphs of H-Ras (top), Rac1 (middle) and Sec4 (bottom), all showing the top 100 couplings. For nodes shaded in grey, darkness is proportional to node degree. For edges, thickness is proportional to coupling strength. Residues in Switch I are coloured yellow and those in Switch II, blue. The binding site residues (which do not overlap with Switch I) are highlighted in pink and the phosphate-binding loop, orange. The KPCCA (left) produced more connected networks whereas, for the GLASSO (right), the inferred couplings are more ‘spread out’ within the molecule

Both our quantitative and qualitative results indicated that the KPCCA outperformed the GLASSO in that it was able to capture couplings that correspond to more significant connections in the crucial regions of the test cases. This confirms that, to infer direct residue-residue couplings from the same conformational data, the KPCCA—which facilitates data modelling by continuous, multivariate angular variables—is more accurate than the GLASSO. In some cases, such as Rheb (Table 3), although we noticed a smaller AUC value for the KPCCA (indicating that the GLASSO had slightly better agreement with the CRN), the interaction graphs still showed that the KPCCA identified the crucial residues more effectively (see, e.g. Figs 4, 5 and more discussions in Section 3.2).

Fig. 4.

2D interaction graphs for Rheb, using the top 100 couplings. Residues in Switch I (II) are coloured yellow (blue). (a) KPCCA: The two Switch regions are directly connected (residue 37 from Switch I with residues 73 and 77 from Switch II); moreover, Switch I is indirectly connected to Switch II by residue 15 in the phosphate-binding loop (Yu ). (b) GLASSO: No connection between the Switches is identified

Fig. 5.

2D interaction graphs for Rheb, using the top 47 couplings (the same number as identified by the CRN). Residues in Switch I (33–41), Switch II (63–79) and the phosphate-binding loop (p-loop) are coloured yellow, blue and orange, respectively. The p-loop residues are connected to the Switches in the CRN. Crucial couplings emerge at higher positions in the ranked list of the KPCCA (a) than in that of the GLASSO (b)

3.1 Small G proteins

The members of this super-family are structurally categorized into five sub-families: Ras (Section 3.1.1), Rho (Section 3.1.2), Rab (Section 3.1.3), Sar1/Arf and Ran. Both NMR and crystallographic analyses have shown that members of different sub-families act as molecular switches that cycle between on (active) and off (inactive) states (Raimondi ), and that they share a common topology in the GDP/GTP binding domain (Takai ). In this section, we highlight our findings for a few representative and well-studied members of these sub-families.

3.1.1 Ras sub-family

Members from the Ras subfamily are the primary members of the super-family; they play a critical role in human oncogenesis (Wennerberg ). When activated, they regulate cell proliferation and survival through gene expression (Takai ; Wennerberg ). We experimented with three members of this sub-family: H-Ras, Rap2A and Rheb (see Table 3). We used the software, Blastp (protein–protein BLAST; http://blast.ncbi.nlm.nih.gov/Blast.cgi), to perform pairwise sequence alignment between the test cases. The sequences of Rap2A and Rheb respectively shared 49 and 36% amino-acid identity with H-Ras. The regions that undergo major conformational changes in H-Ras have been identified (Milburn ) as Switch I (residues 30–38) and Switch II (residues 60–73); see Figure 1. Switch II is known to be directly involved in switching the protein from inactive to active status (Kidd ). Residues residing in the binding site are residues 28–35, 12–19, 145–147 and 116–120. Using the Rosetta software for structural prediction (Rohl ), Kidd obtained strong correlations in Switch II and the hydrophobic core, which is conserved in the Ras family, though they did not report connectivity between the two Switches. The CRN for H-Ras generated using both active and inactive structures contained 75 couplings (Fig. 2a). The interaction graph based on the top 75 couplings inferred by the KPCCA is shown in Figure 2(b); it clearly shows that strong couplings connected the two Switches to each other and to the binding site. In addition, the residues in these two regions are among the highest-degree nodes in the interaction graph—e.g. the node with maximum degree of 21 (see Table 4) in the 2D interaction graph [Fig. 3(a), based on the top 100 couplings] is associated with residue 34 in the Switch I region.

Fig. 2.

Table 4.

Statistical features of interaction graphs*

			No. of
		No. of	connected	Max. Node	Avg. Node
Protein	Method	nodes	components	(Degree)	(Degree)
H-Ras	GLASSO	91	12	5	1.176
	KPCCA	70	7	21	2.829
Rac1	GLASSO	97	12	7	2.062
	KPCCA	15	1	14	13.333
Sec4	GLASSO	100	14	5	2.000
	KPCCA	64	9	17	3.125

*Based on graphs formed by the top 100 inferred couplings.

3D interaction graphs of H-Ras (top), Rac1 (middle) and Sec4 (bottom), showing the top 75, 63 and 112 couplings, respectively. These cut-offs are chosen to be equal to the number of couplings identified by the CRN. The couplings are mostly seen in the Switch I and Switch II regions (blue) and the β-sheets close by Left: CRN; Right: KPCCA 2D interaction graphs of H-Ras (top), Rac1 (middle) and Sec4 (bottom), all showing the top 100 couplings. For nodes shaded in grey, darkness is proportional to node degree. For edges, thickness is proportional to coupling strength. Residues in Switch I are coloured yellow and those in Switch II, blue. The binding site residues (which do not overlap with Switch I) are highlighted in pink and the phosphate-binding loop, orange. The KPCCA (left) produced more connected networks whereas, for the GLASSO (right), the inferred couplings are more ‘spread out’ within the molecule Statistical features of interaction graphs* *Based on graphs formed by the top 100 inferred couplings.

3.1.2 Rho sub-family

The best-studied members of this sub-family are RhoA, Rac1 (Grizot ) and Cdc42 (Table 3). Sequence alignment results from Blastp showed that RhoA, Rac1 and Cdc42 shared 30, 29 and 32% amino-acid identity with H-Ras, respectively, whereas the amino-acid identity is higher within the sub-family, e.g. 58% between RhoA and Rac1 and 69% between Rac1 and Cdc42. Like the Ras family, the Rho proteins also are involved as regulators in cell cycle progression (cell polarity, movement, shape, and so on) and gene expression (Wennerberg ). The three aforementioned members are known to be involved in very diverse cellular processes (Takai ; Wennerberg ). The CRN for Rac1 consisted of 63 couplings (Fig. 2c). The edges were mostly concentrated in the classic Switch regions of this super-family. Figure 2(d) shows a 3D interaction graph based on the top 63 couplings inferred from the KPCCA; this network included residues 29, 32 and 34 from Switch I as well as residues 69 and 73 from Switch II. The set of top-ranked couplings also included residues in the C-terminus (residues 179, 180 and 182) and those from a loop segment (residues 106–110). These regions were also identified by the CRN. Figure 3(c) shows a 2D interaction graph formed by the top 100 couplings from the KPCCA; we can see that it is a highly connected and concentrated network consisting of a single connected component with only 15 residues (see also Table 4).

3.1.3 Rab sub-family

The Rab proteins constitute the largest sub-family in the small G protein super-family (Garcia-Saez ). They are involved in the regulation of intracellular vesicular trafficking, vesicle formation, budding and fusion (McCray ; Stein ; Wennerberg ). For our experiments, we selected the inactive structures of a few well-known members, such as Rab7, Sec4, Ypt7p, Rab11b, Rab11a and Rab6b (PDB IDs: 1VG1, 1G16, 1KY3, 2F9L, 4LWZ and 2FE4). The PDB structures for many members of this family are incomplete in the critical and functional regions, i.e. the two Switches. To perform our experiments, we used the software, MODELLER (http://toolkit.tuebingen.mpg.de/modeller), to complete their structures (Sali ). By comparing the active and inactive structures, we noticed that, except for Sec4 and Ypt7p, the others underwent a secondary structural change (from loop to helix, or vice versa) in the Switch II region during the transition from inactive to active form. We excluded them from the current study, which focuses on proteins with minor backbone motions (Tsai ). The common amino acids shared between H-Ras and (Sec4, Ypt7p) are about (35, 32%), respectively. The CRN for Sec4 consisted of 112 couplings (Fig. 2e). The interaction graph based on top-ranked couplings by the KPCCA (Figs 2e and 3e) showed connections between the two Switches through the edges, (47,87) and (83,56).

3.1.4 Ran and Arf/Sar1 sub-family

The Ran proteins (Partridge ; Scheffzek ; Stewart ) are best known for their involvement in nucleocytoplasmic transport of macromolecules (e.g. RNAs, proteins), whereas members of the Arf family function as regulators of vesicular transport (Takai ; Wennerberg ), like the Rab proteins. Comparing the active and inactive structures of the best studied members from these families, we noticed that they underwent drastic conformational changes during the activation procedure. Calculated RMSDs between the GDP- and GTP-bound pairs for Ran (PDB IDs: 1BYU-1RRP, 1BYU-1IBR, 3GJ0-1WA5), Arf1 (PDB ID: 1HUR-1O3Y) and Arf6 (PDB ID: 1E0S-1HFV) were in the range of approximately 4–14 Å. Hence, these proteins do not belong to the category characterized by ‘minor backbone motions’ (Tsai ) and we excluded them from the current study.

3.2 KPCCA and GLASSO

We have recently applied the GLASSO to infer direct couplings between side chains (Soltan Ghoraie ). The GLASSO is incapable of handling multivariate angular variables. Thus, for each structure in the ensemble (which we generated with the same procedure as what we explained in Section 2.5.1), the conformation of each side chain (i) is matched against a set of candidate rotamers (R) from the standard rotamer library (Dunbrack ), and encoded using a set of binary random variables, b. More specifically, for each structure in the ensemble, for . Using these binary variables, a sample covariance matrix S can be computed as follows. For each residue pair, (i, j), we compute an sub-covariance matrix, , whose entries are where each expectation is estimated empirically by a weighted sample average, using weights inversely proportional to the energy of each structure (see also Section 2.5.2). Using S as the input, the GLASSO estimates a sparse inverse covariance matrix, Θ. After removing the entropic bias (more on this below), entries in each sub-matrix are aggregated to form a coupling score for the residue pair, (i, j).

3.2.1 Comparison

Based on our observations, the KPCCA has the following advantages over the GLASSO: It models side-chain conformations more appropriately with continuous rather than discrete variables. As we stated above, one main disadvantage of the GLASSO algorithm was the need to encode side-chain conformation information using a discrete rotamer library. By contrast, the KPCCA algorithm facilitates a more realistic modelling approach, consistent with the intrinsic nature of conformational data, by allowing us to use continuous, multidimensional angular variables to characterize side-chain conformations. It identifies allosteric regions more effectively. Quantitatively, Table 3 showed that the KPCCA results agreed well with the CRN ones, and that, in this respect, it either compared favourably to the GLASSO or obtained similar results. The superiority of the KPCCA becomes more evident from the qualitative comparisons based on interaction graphs. In particular, the strongest couplings inferred by the KPCCA are concentrated in the allosteric regions, whereas couplings inferred by the GLASSO are more ‘spread out’ within the entire molecule. The GLASSO often identified couplings between residues that may undergo concerted motions in the inactive structure but do not necessarily reside in allosteric regions. Some of these residues are located in semi-rigid secondary structures such as helices; they may reside in or close to the binding site but do not necessarily participate directly in allosteric events. This can be seen more clearly in Figure 3, which contains 2D interaction graphs formed by the top 100 couplings identified by each method for a few representative test cases. For Rheb and Sec4, although it is the GLASSO that appeared to be in better agreement with the CRN (see Table 3, the AUC values), their respective interaction graphs lead to the same qualitative conclusion as that in other cases. Even for these two cases, the KPCCA can be seen to have identified more dependencies specific to coupled motions during allosteric events (see Fig. 4). In addition, couplings in important functional regions also tend to emerge earlier (i.e. at higher positions) in the ranked list of the KPCCA than in that of the GLASSO, another indication that the KPCCA is better at identifying regions crucial to function. For example, Figure 5 contains 2D interaction graphs for Rheb using a cut-off threshold < 100, and shows that the KPCCA has identified more residues in the functionally crucial regions at the top of its ranked list than has the GLASSO. Its interaction graphs tend to show better connectivity among functionally important regions. Another important observation was that the GLASSO obtained significantly sparser clusters of residues (see Fig. 3). The increased connectivity among couplings inferred by the KPCCA was a notable advantage; these connections can potentially explain the mechanism of information propagation within the molecule. If interaction pathways between the allosteric and/or binding sites can be discerned using an interaction graph based on K top-ranked couplings from the GLASSO, using top-ranked couplings from the KPCCA it often can be done with much fewer than K couplings. Table 4 contains various statistical features showing the overall connectivity of the interaction graphs produced by the GLASSO and by the KPCCA for H-Ras, Rac1 and Sec4. It is less prone to entropic bias. One drawback of using discrete rotamers to encode conformations is that the results produced by the GLASSO were biased towards larger amino acids that naturally have more diverse rotamer conformations. This is referred to as the ‘entropic bias’ in the literature (e.g. Jones ; Dunn , who also suggested techniques for its correction). By contrast, the KPCCA does not appear to suffer from such biases. Figure 6 shows the average number of available rotamer conformations for residues involved in the top 1–5, 2–6, … , up to the top 300–305 paired couplings, as computed by the KPCCA and by the GLASSO for H-Ras, Rac1 and Sec4. For the GLASSO, the rankings were based on scores after bias correction. Although no correction was introduced for the KPCCA, its results do not show significant bias.

Fig. 6

X-axis: Rank order of the inferred couplings (). Y-axis: Average number of available rotamers for residues involved in couplings ranked at positions x, x + 1, … , x + 4 for (a) H-Ras (b) Rac1 (c) Sec4. Red: GLASSO (with bias correction). Blue: KPCCA (without bias correction). The KPCCA does not show significant bias towards residues with more rotamer alternatives; in fact, the average number of rotamers is lower for the KPCCA than for the GLASSO in general

4 Conclusion

We have proposed a novel extension of CCA, namely KPCCA, to quantify direct correlations between multidimensional angular data. Existing methods for inferring direct correlations do not handle data of this type, which are common in structural bioinformatics, where side-chain conformations of proteins are characterized by a number of dihedral angles. Using information about side-chain fluctuations in the inactive structure alone, we are able to identify common, allosterically crucial regions (e.g. Switch I and Switch II) in the Ras, Rho and Rab sub-families of small G proteins. Residues in these allosteric regions appear in the strongest couplings inferred by our method and in the densest regions of the corresponding interaction graph. Furthermore, allosteric sites and binding sites are connected in these graphs, which may explain the mechanism with which allostery occurs in these proteins. 2D interaction graphs for Rheb, using the top 100 couplings. Residues in Switch I (II) are coloured yellow (blue). (a) KPCCA: The two Switch regions are directly connected (residue 37 from Switch I with residues 73 and 77 from Switch II); moreover, Switch I is indirectly connected to Switch II by residue 15 in the phosphate-binding loop (Yu ). (b) GLASSO: No connection between the Switches is identified 2D interaction graphs for Rheb, using the top 47 couplings (the same number as identified by the CRN). Residues in Switch I (33–41), Switch II (63–79) and the phosphate-binding loop (p-loop) are coloured yellow, blue and orange, respectively. The p-loop residues are connected to the Switches in the CRN. Crucial couplings emerge at higher positions in the ranked list of the KPCCA (a) than in that of the GLASSO (b) X-axis: Rank order of the inferred couplings (). Y-axis: Average number of available rotamers for residues involved in couplings ranked at positions x, x + 1, … , x + 4 for (a) H-Ras (b) Rac1 (c) Sec4. Red: GLASSO (with bias correction). Blue: KPCCA (without bias correction). The KPCCA does not show significant bias towards residues with more rotamer alternatives; in fact, the average number of rotamers is lower for the KPCCA than for the GLASSO in general Our analytic framework is modular. In principle, ensembles generated by other techniques such as MC and/or MD can be used as well. But currently they are much less efficient. For instance, in one of our test cases (Rap2A; PDB ID: 1KAO, 167 residues), SCWRL took about 1 second to generate a structure whereas an MC method in Rosetta, like that described by Kaufman , took as much as 40 seconds. Hence, for an ensemble of size , our current method took about 4 hours but an MC method would have taken 160 hours, almost a full week, for a single protein! In future studies, our proposed analytic framework can be extended to include backbone dihedral angles as well. This will allow us to study allosteric behaviours of all protein types, even those that may undergo drastic backbone motions. The method also can be applied to other problems in the bioinformatics, e.g. for revealing the ‘hot spot’ residues in protein–protein interactions by using only the fluctuation information of the ‘unbound’ protein (Ozbek ).

Funding

This research is partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Conflict of Interest: none declared.

34 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

3. Automated electron-density sampling reveals widespread conformational polymorphism in proteins.

Authors: P Therese Lang; Ho-Leung Ng; James S Fraser; Jacob E Corn; Nathaniel Echols; Mark Sales; James M Holton; Tom Alber
Journal: Protein Sci Date: 2010-07 Impact factor: 6.725

Review 4. The Ras superfamily at a glance.

Authors: Krister Wennerberg; Kent L Rossman; Channing J Der
Journal: J Cell Sci Date: 2005-03-01 Impact factor: 5.285

Review 5. Allostery: absence of a change in shape does not imply that allostery is not at play.

Authors: Chung-Jung Tsai; Antonio del Sol; Ruth Nussinov
Journal: J Mol Biol Date: 2008-02-29 Impact factor: 5.469

6. Molecular switch for signal transduction: structural differences between active and inactive forms of protooncogenic ras proteins.

Authors: M V Milburn; L Tong; A M deVos; A Brünger; Z Yamaizumi; S Nishimura; S H Kim
Journal: Science Date: 1990-02-23 Impact factor: 47.728

10. The structure of the Q69L mutant of GDP-Ran shows a major conformational change in the switch II loop that accounts for its failure to bind nuclear transport factor 2 (NTF2).

Authors: M Stewart; H M Kent; A J McCoy
Journal: J Mol Biol Date: 1998-12-18 Impact factor: 5.469