Dheeraj Dube1, Navjeet Ahalawat1, Himanshu Khandelia2, Jagannath Mondal1, Surajit Sengupta1. 1. Tata Institute of Fundamental Research, Center for Interdisciplinary Sciences, Hyderabad, India. 2. MEMPHYS, Center for Biomembrane Physics, Department of Physics, Chemistry and Pharmacy, University of Southern Denmark, Odense, Denmark.
Abstract
Binding of small molecules to proteins often involves large conformational changes in the latter, which open up pathways to the binding site. Observing and pinpointing these rare events in large scale, all-atom, computations of specific protein-ligand complexes, is expensive and to a great extent serendipitous. Further, relevant collective variables which characterise specific binding or un-binding scenarios are still difficult to identify despite the large body of work on the subject. Here, we show that possible primary and secondary binding pathways can be discovered from short simulations of the apo-protein without waiting for an actual binding event to occur. We use a projection formalism, introduced earlier to study deformation in solids, to analyse local atomic displacements into two mutually orthogonal subspaces-those which are "affine" i.e. expressible as a homogeneous deformation of the native structure, and those which are not. The susceptibility to non-affine displacements among the various residues in the apo- protein is then shown to correlate with typical binding pathways and sites crucial for allosteric modifications. We validate our observation with all-atom computations of three proteins, T4-Lysozyme, Src kinase and Cytochrome P450.
Binding of small molecules to proteins often involves large conformational changes in the latter, which open up pathways to the binding site. Observing and pinpointing these rare events in large scale, all-atom, computations of specific protein-ligand complexes, is expensive and to a great extent serendipitous. Further, relevant collective variables which characterise specific binding or un-binding scenarios are still difficult to identify despite the large body of work on the subject. Here, we show that possible primary and secondary binding pathways can be discovered from short simulations of the apo-protein without waiting for an actual binding event to occur. We use a projection formalism, introduced earlier to study deformation in solids, to analyse local atomic displacements into two mutually orthogonal subspaces-those which are "affine" i.e. expressible as a homogeneous deformation of the native structure, and those which are not. The susceptibility to non-affine displacements among the various residues in the apo- protein is then shown to correlate with typical binding pathways and sites crucial for allosteric modifications. We validate our observation with all-atom computations of three proteins, T4-Lysozyme, Src kinase and Cytochrome P450.
Modern drug discovery operates around the principle of molecular recognition in the context of protein-ligand binding. In the preliminary stages, drug discovery programs often invest in quantifying the binding affinity of suitable drug candidates towards a well-established protein target. However, such efforts often encounter scenarios where identification of the ligand-binding cavity on the protein target becomes challenging. In addition, screening across a large pool of protein receptor candidates as potential drug targets for a specific drug proves tedious and time consuming.A popular and inexpensive approach for zeroing on a suitable combination of protein receptor and ligand has been drug design based on in silico techniques. Towards this end, macromolecular docking [1-5] based techniques are routinely employed for screening protein-ligand recognition partners. One of the major drawbacks of docking based approach has been the heuristic choice of the scoring function implemented in the docking programs to quantify and rank-order the protein-ligand affinity. Further, the other major criticism has been that in most of the cases the receptor cavity’s inherent flexibility is not taken into account for docking purposes. Rather, docking softwares [6-17] consider the protein cavity to be rigid and perform the search for a ligand’s ability to dock only in a limited area of protein, often pre-determined by the user’s intuition-based inputs. As a result, docking based drug discovery protocols are fraught often with failures in identifying the correct set of protein ligand combinations.Molecular Dynamics simulations, with the recent emergence of Graphics-Processing-Unit (GPU) based computation and special purpose machine like Anton, has slowly emerged as an accurate approach to simulate the complete process of protein ligand binding in its atomistic details. The approach, majorly pioneered by Shaw and coworkers [18-21] has been able to decipher the kinetic pathways leading to spontaneous recognition of the ligand by the receptor. This process is frequently being supplemented by judicious application of Markov State Model [22-24] based techniques to infer long time recognition processes by shorter time simulations. These techniques being unbiased and devoid of user intervention, currently remain one of the best approaches to explore the protein ligand recognition at their highest resolution. But the process comes at the expense of high computational cost and extended wall clock time. As a result, a plan for practical usage of the Molecular dynamics simulation for the purpose of serious drug discovery is still in its infancy.The current work offers a practical middle ground between expensive Molecular Dynamics simulation of protein-ligand recognition processes and inaccurate docking based approach by employing an idea from materials physics, which correctly accounts for deformations in protein structure relevant for binding. We use a formalism, introduced earlier to analyse atomic displacements in solids [25-27], to predict the protein locations which are potentially susceptible towards ligand recognition. We build on our work under the basic premises that all external and internal stresses result in two types of mutually orthogonal deformations namely those which are “affine” i.e. expressible as a homogeneous deformation of the native structure, and those which are not i.e. “non-affine”. Further details are provided in figure A and section 1 of
S1 text. As further elaborated in a later section, the current study characterizes the local arrangements of atoms inside protein by probing an important metric termed as “Non Affine parameter” (NAP) [25]. The approach is first convincingly validated by reproducing key conformational changes en route to ligand entry to a mutant T4 Lysozyme. This formalism is then used to propose potential regions which are implicated in “gate-opening” and capture the conformational dynamics during the binding process in the protein receptor Src kinase and Cytochrome P450. Finally, we show how spatial correlations of NAP can elucidate sites for allosteric control in these proteins.For discussion on the effect of different coarse-graining volume on the eigen-spectrum refer section 6 of SI—Notes on the effect of choice of neighbourhood on NAP eigen-spectrum. c. Local structure of the T4Lys protein showing the neighbourhood analysed in b. The helices and β−sheets are in purple while the rest of the protein is in cyan. The dominant eigenvector is also shown in transparent purple (see also SI movies S1 & S2, description provided in section 4 of SI—Description of Videos).
Results
We explore three independent protein-ligand systems. Most of the validations of our approach has been done on binding of benzene to the solvent-inaccessible cavity of the L99A mutant of T4 Lysozyme(T4Lys), a system which has served long as the prototypical protein ligand system for biophysical studies [28-37]. Apart from the binding process of simulation, in the current work, we have also simulated the apo or ligand-unbound form of T4Lys (PDB id: 3DMV) and Src kinase (pdb id: 1Y57). Some data for Cytochrome P450 (pdb id: 2CPP) has also been shown.
Non affine parameter is a novel collective coordinate for protein structure
Atoms of an isolated protein molecule in aqueous solution and at non-zero temperatures undergo continuous random motion. For most proteins, these atomic motions may be analysed in terms of displacement fluctuations about some fixed native structure. Some of these fluctuations represent local affine distortions of the native state while others correspond to non-trivial conformation changes. It is the latter which are associated with functional aspects of the protein. Determining functionally relevant conformational changes is a challenge, especially if such distortions are subtle and have to be extracted from the random background noise. We show now how a collective variable may be defined, which does exactly that. This collective variable, which we call the non-affine parameter, was first defined for crystalline solids. In solids, non-affine displacements are obtained by systematically projecting out [25] trivial homogeneous deformations of atoms, capturing only those displacements responsible for irreversible plastic events [26, 27]. The NAP is the squared sum of the mean amplitudes of these displacements. In proteins NAP behaves as a collective coordinate responsible for revealing binding pathways and also for determining possible allosteric couplings between spatially distant residues.Consider for example (see Fig 1a) an atom i and its neighbourhood Ω which contains j = 1, nΩ neighbours. The size of Ω is fixed. Too small a Ω results in large fluctuations while making Ω too large may average out the relevant signal and make it disappear. Typically, we use a neighbourhood which contains 50–100 atoms excluding H-atoms. The instantaneous positions of the atoms r at any particular time step of MD simulation are compared with a set of fixed reference positions R taken, for example, from the (ligand-free) native state of the protein. NAP is then defined as the value of the quantity where the minimisation is over choices of the deformation matrix . NAP is therefore the least square error incurred in trying to represent an arbitrary displacement as an affine or homogeneous deformation of the reference configuration.
Fig 1
a. Schematic diagram used for defining the NAP, χ for particle i (see text). The neighbourhood Ω centred around i is shown as a sphere. The reference positions of the neighbours R are shown in light grey while the instantaneous positions r are shown in dark grey. We have subtracted off the displacement of particle i (R = r) for simplicity. b. The distribution of the eigenvalues λ of PCP for three locations in the protein T4Lys, location I corresponds to a carboxyl C atom in helix 7, II corresponds to a carboxyl O atom in helix 5 and III corresponds to a C − α atom in a relatively inflexible portion of a β sheet. In each of the locations I and II, which are active during ligand binding, the largest eigenvalue is separated from the rest by a large gap Γ. The Γ is markedly smaller in location III which is relatively inert. Here, we consider the mode corresponding to highest eigen value to be the key mode. Here each set of eigenvalues are normalized by the maximum eigen-value to make it 1. That way Γ (the difference between the eigen-values) can be compared across the three different locations of the protein on equal footing.
a. Schematic diagram used for defining the NAP, χ for particle i (see text). The neighbourhood Ω centred around i is shown as a sphere. The reference positions of the neighbours R are shown in light grey while the instantaneous positions r are shown in dark grey. We have subtracted off the displacement of particle i (R = r) for simplicity. b. The distribution of the eigenvalues λ of PCP for three locations in the protein T4Lys, location I corresponds to a carboxyl C atom in helix 7, II corresponds to a carboxyl O atom in helix 5 and III corresponds to a C − α atom in a relatively inflexible portion of a β sheet. In each of the locations I and II, which are active during ligand binding, the largest eigenvalue is separated from the rest by a large gap Γ. The Γ is markedly smaller in location III which is relatively inert. Here, we consider the mode corresponding to highest eigen value to be the key mode. Here each set of eigenvalues are normalized by the maximum eigen-value to make it 1. That way Γ (the difference between the eigen-values) can be compared across the three different locations of the protein on equal footing.The minimisation process is actually equivalent [25] to a projection. The instantaneous displacement of atom i is u = r − R. Within the region Ω, displacements of neighbouring particles relative to that of i are given by Δ = u − u. Next we define a projection of Δ onto the non-affine subspace using a projection operator P where P = I − R(R, and the nΩ
d × d2 elements of R, taking i to be at the origin. Finally, one may show χ(R) = ΔT
P
Δ where we have used the definition Δ, as an nΩ
d dimensional column vector constructed by rearranging Δ. This projection formalism can be carried out for any given set of {R}. This identification also allows us to compute statistical averages, probability distributions and spatio-temporal correlation functions using standard methods of statistical mechanics. For example, one derives [25] the equilibrium ensemble average 〈χ〉 = ∑ λ where λ are the eigenvalues of the matrix PCP with C = 〈ΔT
Δ〉 the local displacement correlator, and we have suppressed the particle index i for simplicity. Associated with NAP, we can also define the local NAP susceptibility 〈(χ − 〈χ〉)2〉 and the spatial correlation function 〈(χ − 〈χ〉)(χ − 〈χ〉)〉, in the same way as it is done for a crystalline solid [26]. Section 1 of SI, along with figure A, provides a summary of useful terms related to our analysis for the readers.One of our important results concerns the distribution of the eigenvalues λ. For crystalline solids we had observed [26] that (1) this distribution is non uniform and the largest eigenvalue is separated from the rest by a large gap and (2) the eigenvector corresponding to the largest eigenvalue represented the dominant plastic mode eg. the creation of a defect dipole. Quite interestingly, the situation is similar here. For details of NAP formalism refer section 1 of SI.In Fig 1b we have plotted the relative magnitudes of the eigenvalues of the local PCP matrix for three different locations on the protein T4Lys. The first one (I) corresponds to a Carbon atom on a Carboxyl group on residue 132 (amino acid Asn) in a region of helix 7, the second one (II) on an Oxygen atom of a Carboxyl group on helix 5 corresponding to residue 105 (amino acid Gln), and the third one (III) for an α− Carbon atom within β sheet 1, belonging to residue 13, (amino acid Leu). We shall see later that regions I and II are both quite active during binding events while region III is relatively inert. The eigenvalue spectrum surprisingly bears an imprint of this behaviour. The active regions appear to be dominated by a single non-affine mode which is particularly soft. The eigenvalue for this mode is larger than the rest by a substantial amount producing a gap Γ in the non-affine spectrum. The inert regions of the proteins have more uniformly distributed eigenvalues. By looking at the structure alone, (see Fig 1C) it would have been impossible to guess this fact.In Fig 2a & 2b we show that the distribution of the eigenvalues of the PCP matrix are not random. The soft modes that prove out to be useful in our study correspond to a set of eigenvalues which are quite well separated from the spectrum of the rest of the eigenvalues. In order to demonstrate this, in Fig 2a, we plot the distribution of the various eigenvalues of the PCP matrix for region (I) of T4Lys (Fig 1c), which is involved in the gate opening resulting from the movements of helices 7 and 9. This is fitted to the Marchenko Pastur (MP) distribution from the prediction from Random Matrix Theory [38-40]. It is clear that the fit is not perfect and there are several eigenvalues which are separated from the rest by gaps. On the other hand, if the matrix PCP was constructed out of Δ which were completely random then we should have obtained the distribution enclosed under the red curve representing the MP distribution for our data. This is shown explicitly for a random correlation matrix in Fig 2b. The fact that we have a sparsely distributed tail of few relatively very large eigenvalues widely separated from the rest of the spectrum and these large eigenmodes contribute most towards the major conformational changes, clearly demonstrates that our analysis is not a straightforward strain-analysis [41]. Instead, indeed, there is an underlying physics that can be revealed by focusing on the non-affine subspace PΔ of the total displacement Δ. Non-affinity may play an important role in guiding a majority of the routine functions performed by the proteins.
Fig 2
a. Distribution of the eigenvalues of the PCP matrix at a specific location in the protein T4Lys (blue histogram) compared with the fitted Marchenko-Pastur distribution (red curve). b. A similar fit for a truly random correlation matrix constructed out of vectors with the same number of components as in the first case but now taken from a Gaussian random distribution of zero mean and unit standard deviation.c. Three dimensional plot of the local NAP, χ, the gap Γ and the NAP susceptibility for all locations analysed for three different proteins of widely diverse structure and function viz. T4Lys (purple symbols), Src kinase (cyan symbols) and P450 (green symbols). The relation between all three variables for the three proteins is same within the scatter of the data. All the neighbourhoods analysed contained 50–100 atoms. Refer to section 5 of SI and figure C for detailed discussion on sensitivity of choice of neighbourhood-volume for detailed description on neighbourhood.
a. Distribution of the eigenvalues of the PCP matrix at a specific location in the protein T4Lys (blue histogram) compared with the fitted Marchenko-Pastur distribution (red curve). b. A similar fit for a truly random correlation matrix constructed out of vectors with the same number of components as in the first case but now taken from a Gaussian random distribution of zero mean and unit standard deviation.c. Three dimensional plot of the local NAP, χ, the gap Γ and the NAP susceptibility for all locations analysed for three different proteins of widely diverse structure and function viz. T4Lys (purple symbols), Src kinase (cyan symbols) and P450 (green symbols). The relation between all three variables for the three proteins is same within the scatter of the data. All the neighbourhoods analysed contained 50–100 atoms. Refer to section 5 of SI and figure C for detailed discussion on sensitivity of choice of neighbourhood-volume for detailed description on neighbourhood.The relation between activity i.e. a large NAP susceptibility and gaps, Γ, appears to be an universal feature of all the three proteins, with very diverse structures and functions, studied by us. In Fig 2c we have plotted the NAP susceptibility Γ and NAP for several regions within three proteins T4Lys, P450 and Src kinase. In all of these proteins the three quantities are positively correlated with each other and the nature of the correlation (the slope of the curves) appears to be the same within the error bars for all the proteins.We demonstrate next that regions of the protein which are involved in a binding event and show large NAP values during the process of ligand binding also have large NAP susceptibility in the apo form of the proteins without the ligand. The motions of the protein during binding events are also captured by the dominant non-affine eigenmode. Finally we show that large NAP correlations between spatially distant parts of the proteins point out domains of possible allosteric control.
Non affine parameter successfully tracks the ligand binding event in T4 Lysozyme
We first validate the ability of NAP to successfully trace the ligand binding pathway for a recently studied protein-ligand system by Mondal et al. [28], namely benzene binding to T4Lys. Briefly, Mondal et al. has recently captured the full details of the benzene binding pathways to the buried cavity of T4Lys using multi-microsecond long atomistic MD simulations. The resulting binding trajectories had identified more than one binding pathways. A key observation from those simulated trajectories was that the benzene binding to the solvent-inaccessible cavity of T4Lys did not require large-scale protein conformational change. Rather, the simulations discovered that it is the subtle displacement of helices of the protein which opens up productive pathways leading to ligand binding in the buried cavity.Specifically, as plotted in the right hand axis of Fig 3a, a subset of the trajectories pinpointed that prior to benzene binding, the distance between helix-4 and helix-6 increases transiently by around 2–3 Å from its equilibrium value, facilitating benzene-gating in the cavity, before it reverts back to its original equilibrium distance after the binding event is complete. On the other hand, another subset of trajectories, spawned from same starting configuration, revealed an alternate pathway of ligand binding, where the transient increase in the distance between a different pair of helices namely, helix-7 and helix-9 triggers the ligand entry to the cavity, as depicted in Fig 3b (right hand axis). The snapshots shown at the bottom panel of the Fig 3a and 3b captures the location of the benzene relative to the helix gateway during the period of binding event. The three different snapshots labelled as 1, 2 and 3 respectively represent the helix conformations before, during and after the ligand binding event. The creation of a helix gateway prior to ligand binding (snapshot 2 in each case) is very evident from the increased inter-helix distance.
Fig 3
NAP as a good descriptor of ligand binding event: Correlation between time profile of inter-helix distance (blue curves right-hand axis) and corresponding NAP of the helices (red curve, left hand axis) during the process of ligand (benzene) recognition to the buried cavity of L99A T4 Lysozyme.
a. The trajectory involving helix4-helix6 gateway leading to ligand binding, b. the trajectory involving helix7-helix9 gateway leading to ligand recognition. Below each of these curves we also show the ligand (pink) and the protein (cyan) at three time slices with positions corresponding to (1) outside the gate, (2) within the gate and (3) after the binding event. The three time slices are also labelled (black dots) in the NAP vs time plots a. and b. Here, all the possible distances between helix 7 particles (excluding hydrogen) and helix 9 particles (excluding hydrogen) are calculated and then an average of all these values are taken to arrive at a singe point on the blue curve in figure 3b. Same goes for the blue curve in figure 3a. Also refer section 7 of SI and figure E for detailed discussion on sensitivity of main and side chains atoms on NAP Analysis.
NAP as a good descriptor of ligand binding event: Correlation between time profile of inter-helix distance (blue curves right-hand axis) and corresponding NAP of the helices (red curve, left hand axis) during the process of ligand (benzene) recognition to the buried cavity of L99A T4 Lysozyme.
a. The trajectory involving helix4-helix6 gateway leading to ligand binding, b. the trajectory involving helix7-helix9 gateway leading to ligand recognition. Below each of these curves we also show the ligand (pink) and the protein (cyan) at three time slices with positions corresponding to (1) outside the gate, (2) within the gate and (3) after the binding event. The three time slices are also labelled (black dots) in the NAP vs time plots a. and b. Here, all the possible distances between helix 7 particles (excluding hydrogen) and helix 9 particles (excluding hydrogen) are calculated and then an average of all these values are taken to arrive at a singe point on the blue curve in figure 3b. Same goes for the blue curve in figure 3a. Also refer section 7 of SI and figure E for detailed discussion on sensitivity of main and side chains atoms on NAP Analysis.We calculate the time profile of NAP, averaged over all heavy atoms involving these helices and compare its correspondence with corresponding inter-helix distance. As presented in Fig 3a, we plot the profile of average NAP involving helix-4 and helix-6 on the left hand axis and that of average distance between these two helices on the right hand axis, for the part of time period spanning the benzene entry to the cavity. We find that, both NAP profile and distance profile involving helix-4 and helix-6 follow almost identical trend during the course of ligand entry. Similarly, as shown in Fig 3b, for the other trajectory where ligand binding occurred via helix-7 and helix-9 gateway, time profile of NAP involving helix-7 and helix-9 reproduces the trend of distance fluctuation involving helix-7 and helix-9 quite accurately. The remarkable consistency between the NAP profile and displacement profile of key-profiles responsible for ligand binding suggests that these subtle helix fluctuations are results of non-affine displacements of local environment for each particle. Overall, these results validate the potential of NAP as a suitable collective variable or descriptor for tracking ligand binding kinetics.
NAP susceptibility of apo protein highlights potential ligand hotspots
The results discussed in previous section have validated the role of NAP as a good descriptor of a ligand binding event and its ability to track the ligand binding pathways. But the validation needed the extensive simulation of actual ligand binding events. As an important finding of the current work, in this section we show that by analysing the NAP susceptibility on a ligand-free (apo) protein itself, one can predict potential protein sites for facile ligand approach, otherwise known as “ligand hotspots”. Since MD simulations of the apo protein do not need to track rare binding events, these cost a small fraction of the time needed for simulating the full protein-ligand systems. Fig 4a and 4c illustrate the spatial distribution of computed NAP susceptibility of two protein systems in their ligand-free (apo) form, namely, T4Lys and Src kinase. For the purpose of comparison, we have also plotted the locations on protein surface which ligand frequently visits in previously carried out multi-microsecond long unbiased ligand-binding simulations (Fig 4b and 4d).
Fig 4
NAP susceptibility of apo proteins (a,c) and ligand binding hotspots (b,d).
a The spatial distribution of the NAP susceptibility shown as a colour map red (scaled to 100) to blue (0) for L99A T4 Lysozyme. b. In the adjacent panel, we show the same protein now together with snapshots of the ligand. Ligand snapshots in red represents configurations obtained from 4–5 μs and blue from 5–6 μs of a long MD trajectory. After 6 μs the ligand stays close to the binding gateway. Panels c and d show the corresponding plots for Src kinase. In c the colours have meanings same as in a. Panel d is reprinted with permission from Ref. [18], (copyright (2011) American Chemical Society) red represents configurations earlier than 15 μs while blue from 15-20 μs. Locations within the protein which are important for ligand binding are numbered and are described in detail in the text. Note that in both cases the regions which have large NAP susceptibility are also those where ligands spend most of their time during binding events.
NAP susceptibility of apo proteins (a,c) and ligand binding hotspots (b,d).
a The spatial distribution of the NAP susceptibility shown as a colour map red (scaled to 100) to blue (0) for L99A T4 Lysozyme. b. In the adjacent panel, we show the same protein now together with snapshots of the ligand. Ligand snapshots in red represents configurations obtained from 4–5 μs and blue from 5–6 μs of a long MD trajectory. After 6 μs the ligand stays close to the binding gateway. Panels c and d show the corresponding plots for Src kinase. In c the colours have meanings same as in a. Panel d is reprinted with permission from Ref. [18], (copyright (2011) American Chemical Society) red represents configurations earlier than 15 μs while blue from 15-20 μs. Locations within the protein which are important for ligand binding are numbered and are described in detail in the text. Note that in both cases the regions which have large NAP susceptibility are also those where ligands spend most of their time during binding events.Fig 4a represents the three-dimensional map of NAP susceptibility of T4Lys. Location 2 of T4Lys includes residues with very high NAP susceptibility (shown in red spheres). A comparison with corresponding location 2 of the time-trace of the ligand in Fig 4b, as obtained from unbiased ligand-binding trajectories, identifies this location as the native ligand binding site at the buried cavity near the 102 residue (a part of helix 5). Further, apart from the final ligand binding location at the 102 residue, the computed NAP susceptibilities around ligand-free T4Lys were also found to be quite high near locations 1 and 3, which are respectively the 105 (a part of helix 5) and 137 residues (gateway between helix-7 and helix-9). The residues 102 and 105 (which are parts of helix-5) act as hinge-centers or pivots about which the neighboring residues undergo relative re-arrangement opening up the gateway between helices 4 and 6. Similarly, the residue 137 (which is a part of helix-8) acts as a pivot for the gateway between helix 7 and 9. The residue 164 corresponds to location 4 in T4Lys which also has a very high NAP-susceptibility value as it is a terminal residue-part of random coil, mildly contributing towards re-arrangement in helix-9. Interestingly, as shown in Fig 4b, previously simulated independent binding trajectories have also revealed that the pathways leading to ligand entry to the buried cavity of T4Lys involve the helices near locations 1, 2 and 3. More over, the NAP susceptibility analysis recognises multiple locations with moderate NAP susceptibility (shown by green sphered residues), which are also regularly visited by ligands. Although away from major binding sites, these sites serve as possible precursors to intermediates leading to final molecular recognition. Taken together, the ability to correctly predict eventual ligand binding site and other accessory potential ligand hotspots at T4Lys in its apo form by measuring NAP susceptibility, without the need for prior knowledge of protein-ligand binding interactions, rates NAP susceptibility as a very promising metric.The display of the hinge action which opens the gateway to the hydrophobic binding cavity is demonstrated by the supporting movie S1 (opening-up of helix-7 and helix-9) and S2 (opening-up of helix-4 and helix-6), see (S1 Text for further details). Note that these are not MD simulation snapshots but represent the dominant eigen-displacements arising from a single non-affine mode with the largest eigenvalue. The collective motion involves a “twisting” of the helices and is a very special linear combination of hundreds of local normal modes from which all affine (strain-like) distortions have been systematically projected out. We return to this issue later in the paper.The ability of NAP susceptibility to predict the potential ligand hotspots on a protein surface is also validated in another popular receptor protein, namely the Src kinase. Fig 4c shows the NAP susceptibility map around Src kinase in its ligand-free form while the corresponding simulated ligand binding trajectories, as previously simulated by Shaw and coworkers, is shown in Fig 4d. In this case as well, the ATP-binding site (location 2) is correctly predicted to have one of the highest NAP susceptibilities in the apo-form and independent simulated ligand-binding trajectories also confirm this. Moreover, other locations of kinase rated as high to intermediate NAP susceptibility for ligand approach are also independently found to be locations regularly traced by ligand in the unbiased ligand binding trajectories by Shaw and coworkers [18]. Specifically, so-called PIF site, MYR site and G-loop site of kinase (PIF-site:region1, MYR-site:region4,), which were found to attract regular ligand visit in actual ligand binding trajectories, are also predicted to have high to moderate NAP susceptibility for the ligand in apo form of the kinase itself. However, interestingly, our NAP susceptibility analysis also predicts a new location 5 deemed highly susceptible to ligand, which is otherwise not known as a ligand hotspot. The discovery of this new site prompted further analysis to justify the observed high susceptibility for the ligand. We have also checked the effect of simulation lengths on the NAP susceptibility. Figure B plots the NAP susceptibility of all particles of src kinase computed over two different simulation lengths. We find that the results remain invariant over simulation time. We also note that these simulations on ligand-free protein, which are used for NAP analysis, are orders of magnitude shorter than the simulation time required for seeing the direct ligand binding events. In the next section, we show that the newly obtained regions of Src kinase which have high NAP-susceptibility also play an important role in establishing an allosteric communication across the protein structure. All the important sites with moderate and high NAP susceptibility values are reported in section 3 of SI.
NAP correlations and allostery
In the previous section, we have explored the NAP susceptibilities for different locations of the protein and using this as a metric we were able to extract the important ligand hotspots, in accordance with the simulated ligand binding trajectories. Similarly spatial NAP correlation function or covariance, calculated at the same time, as defined previously, measures the correlation between non-affine displacement fluctuations at two different, possibly distant, locations. We link this quantity to a measure of allostery in the protein systems i.e. the ability of spatially distal sites in proteins to influence catalytic or binding activity.In Fig 5, we plot residue-residue NAP covariance matrix representing spatial correlation functions for our two protein systems, namely T4Lys (Fig 5a) and Src kinase (Fig 5b) respectively. On the two axes we have the residue ids and the value of the spatial correlation function between two residues is given by the colour of the point. High values above a suitable cut-off are in yellow and the rest are in black. We are interested in those off-diagonal residues that are spatially far away from each other, since spatially adjacent residues will be trivially correlated due to direct interactions. As shown in Fig 5a, we could get five prominent locations in the matrix for T4Lys which have high NAP correlation at a spatially far-off distance (see S1 Text for further details of these regions). For example, we found that a very high NAP correlation exists between residue 137 in region 3 and residue 18 in region 5 (labelled (3,5)) of T4Lys. This allosteric connection can be seen to be contributing towards regulation of the hydrogen bond between the residue 22 Glu and residue 137 Arg. Using a newly introduced method ExProSE, It was confirmed by Greener et al. [42] that, the breaking of a hydrogen bond between Arg137 and Glu22 causes the opening motion taking the protein from a closed active site cleft conformation to an open active site cleft conformation. Also, earlier, Mchaourab et al. [43] have confirmed that in solution there is a conformeric equilibrium between substrate-bound and substrate-unbound T4 Lysozyme which is in accord with the active site cleft opening upon substrate removal due to Hinge-Bending motion. Mchaourab et al. used the site-directed Spin Labeling to detect these motions. From our NAP formalisms too, [26] we can see that the softest non-affine modes corresponding to the residues 18 and 137 act as the nucleating-precursors to the opening of the active site cleft. These modes contribute by initially breaking the hydrogen bond between residues 22 and 137 and therefore we can directly see the increasing distance between residue 22 and 137 in S3 Video (description provided in section 4 of SI—Description of Videos) in the supplementary info.
Fig 5
NAP correlation function plotted as a colour map spanning residue ids where correlation values larger than a threshold (= 4 percent of the maximum value) are shown in yellow while the rest is kept black; a T4Lys, b Src kinase.
Highly correlated, spatially distant regions are sites for possible allosteric modification. Such pairs of sites are numbered in both cases with the font sizes corresponding to the relative value of the correlation (see section 3 of
S1 Text). The numbers correspond to the same regions as in Fig 4. It is observed for T4Lys that the residue 18 from region 5 (yellow) has the highest correlation with the residue 137 in region 3 (black) as demonstrated by the largest font. In case of Src kinase, the residues 416 and 328, from regions 5(yellow) and 1(blue) respectively, have the highest correlation.
NAP correlation function plotted as a colour map spanning residue ids where correlation values larger than a threshold (= 4 percent of the maximum value) are shown in yellow while the rest is kept black; a T4Lys, b Src kinase.
Highly correlated, spatially distant regions are sites for possible allosteric modification. Such pairs of sites are numbered in both cases with the font sizes corresponding to the relative value of the correlation (see section 3 of
S1 Text). The numbers correspond to the same regions as in Fig 4. It is observed for T4Lys that the residue 18 from region 5 (yellow) has the highest correlation with the residue 137 in region 3 (black) as demonstrated by the largest font. In case of Src kinase, the residues 416 and 328, from regions 5(yellow) and 1(blue) respectively, have the highest correlation.For Src kinase too, as shown in Fig 5b, we could identify five regions of high NAP correlations at distant locations. For example, residue 416 (Tyr) of region 5 of src kinase has a very high NAP correlation with residue 328 (Val) (an element of Helix-αc) of region 1 (labelled (1,5). This is supported by a previous report [44] of coupling between the activation loop and Helix-αc. A hallmark of active conformation of kinase [45] is the autophosphorylation of the activation loop (404–432) at the residue 416 (Tyr) and “αC-in” conformation of Helix-αc. Analysis of non-affinity shows that the softest non-affine modes (S4 Video in supporting info, description provided in section 4 of SI—Description of Videos) at residue 416 and residue 328 in our simulations act as incipients for residue 416 (Tyr) protruding outward from the activation loop region, thereby becoming more exposed for the phosphorylation at this site (this site being tyrosine which is favourable for phosphorylation), and at the same time inward rotation of Helix-αc. In summary, our analysis based upon NAP-susceptibility and NAP spatial correlation function points towards allosteric contact between Helix-αc (helix undergoing inward rotation) and activation loop of kinase (phosphorylation site Tyr416 getting exposed). Apart from this, we also note that there is a slight distortion of the salt-bridge between 295LYS and 310GLU. The two residues 295LYS and 310GLU are shown in silver small spheres and stick form. This is also supported by the previous works [45-47]. The details of regions 1-5 are mentioned in section 3 of SI.
Multiple active sites in Cytochrome P450
The family of Cytochrome P450 proteins are quite versatile and are present in many different human tissues performing many tasks such as oxygen metabolism by binding to Heme, breakdown of toxins and hormones, synthesis of cholesterol [48] etc. They are often associated with intracellular membranes such as in mitochondria and the endoplasmic reticulum. Extensive studies using sequence-based co-evolutionary analysis and anisotropic thermal diffusion MD simulations have identified four principal active regions in P450 related to membrane association, catalytic activity, Heme binding and dimerization [49].Our results for the NAP susceptibility are shown in Fig 6. Remarkably, we can also identify the same regions as obtained in the earlier study [49], although the variant of P450 used by us is somewhat different. A complete analysis of all the different binding and allosteric processes of P450 analysed using the NAP values, eigenvectors and correlation functions is much beyond the scope of the present work and will be published elsewhere.
Fig 6
Relative NAP susceptibility plotted with the same color scale as in Fig 4 for Cytochrome P450.
We have identified four regions of high activity. Region 1 and 2 are associated with membrane binding and catalytic activity respectively while 3 and 4 may correspond to Heme binding and dimerisation.
Relative NAP susceptibility plotted with the same color scale as in Fig 4 for Cytochrome P450.
We have identified four regions of high activity. Region 1 and 2 are associated with membrane binding and catalytic activity respectively while 3 and 4 may correspond to Heme binding and dimerisation.
Discussion
Pinpointing sites of catalytic activity, binding of ligands or other functionally important conformational changes in proteins is a challenge when such displacements in protein residues are subtle and easily masked by background noise. Accurate “order parameters” or collective variables need to be designed to study such activity. It is often unclear how such collective variable may be defined. Drawing from ideas which were first demonstrated in crystalline solids [25-27] we have defined a new collective variable, NAP, which quantifies non-affine thermally excited displacements of proteins. Firstly, we show that the susceptibility of different locations of a protein to non-affine displacements can be used as an indicator to determine regions which are important for binding events. Secondly, from the analysis of the NAP susceptibility values we conclude that the regions with higher values of NAP susceptibilities are also those which contain a dominant non-affine mode whose eigenvalue is separated from those of the the rest of the modes by a large gap. The eigenvalue spectrum of the non-affine modes is also highly non-random. The eigenvector corresponding to this dominant mode involves atomic displacements that are the ones prone to act as binding gate-ways and the time spent by a ligand during a binding event at a particular residue is determined by the NAP susceptibility of this region. We also show that a positive correlation exists between the NAP value, its susceptibility and the magnitude of the gap in all the proteins studied by us.Recent times have seen the emergence of many novel techniques to tackle the challenging issue of coming up with optimum sets of collective variable. Tiwary and Berne’s techniques of spectral gap optimisation of order parameters (SGOOP), [50] Time-structure based independent component analysis (TICA)-based approach for optimising collective variable [51-53] for protein conformational analysis and molecular recognitions certainly have contributed significantly in this line. The approach described in this work is quite relevant, in light of these recent efforts in optimising collective variable for describing a biological processes. It is quite remarkable that our projection formalism yields useful data even from short simulation runs and results are robust over different simulation lengths (refer section 2 of SI and figure B). Our analysis is fundamentally different from principal component analysis (PCA) [54-56] or other PCA based methods [57] which are used extensively for analysing protein configurations obtained from simulations. The restriction to the coarse graining volumes Ω makes the analysis local and the projection to the non-affine sub space guarantees that all trivial motions are filtered out. Nonetheless, we have analysed the sensitivity of our results over choice of coarse-graining volumes in Section 5 of SI and figure C provides justification of our choice of Ω. The NAP eigen spectrums are also analysed over varying coarse-grains radius in figure D and section 6 of SI. Our NAP analysis are also robust against choice of main-chain or side-chain atoms of protein for the computation (see section 7 of SI and figure E.) Finally, the large gap in the non-affine excitation spectrum between the dominant eigenmode and the rest ensures that there is very little mixing of the modes. As an example, recall the (un-)twisting motion of the helices observed in T4Lys as shown in the supplementary S1 Video (see S1 Text). Such a motion requires a special linear superposition of a large number of normal modes. The dominant PCA mode on the other hand cannot capture these motions (refer to figure F and section 8 of SI for detailed discussion on comparison with PCA) in detail although atomic displacements in the relevant regions are large. Unfortunately, these displacements are also contaminated by unimportant, affine motons which mask the gate opening feature.The ability to compute NAP susceptibility by our approach helps us to unify two different proposed hypotheses of protein-ligand recognition, namely “conformational selection” [58-63] (in which protein’s intrinsic ability to shift to a ligand-preferred conformation guides the molecular recognition) and “induced fit” (in which ligand induces the protein to change its conformation). Our calculations show that these ideas are not in conflict but follows from standard fluctuation-response behaviour connecting local fluctuations of non-affine part of atomic displacements to the response of the protein to the conjugate local non-affine field [26]. This is a simple consequence of a fluctuation response relation viz. 〈χ〉 = 〈χ〉0 + h〈(χ − 〈χ〉)2〉0 if we identify ligands with local fields h which generate NAP in proportion to their susceptibilities at h = 0. The ligand, in this case acts as such an “external” field and the response of the protein to the field is proportional to the fluctuations of local NAP in its apo state without the field (ligand).Additionally, we see that the spatial correlation of NAP among various residues can be used to discover sites of possible allosteric control. NAP susceptibilities and correlations may be obtained readily from simulations of the apo protein without the ligand and come at a fraction of the cost needed for detailed simulations of ligand binding events. This implies that our proposed method is extremely well suited to be adapted for high throughput searches of possible protein ligand pairs and allosteric control.It is to be noted that in our work we have focussed on the projection of the displacements to the non-affine subspace which is orthogonal to local strain [41]. Since binding hotspots feature large displacements, both strain and strain fluctuations may be large together with large NAP values. NAP however, quantifies the error made in trying to fit the local displacements to an affine model and therefore large NAP values also signify that a description in terms of strain becomes invalid in those regions.
Materials and methods
Protein models and simulation details
We have worked on three systems including binding of benzene to the solvent-inaccessible cavity of the L99A mutant of T4 Lysozyme (T4Lys). Also we had previously performed multi-microsecond molecular dynamics simulation where we have simulated the complete kinetic process of ligand recognition [28]. The details of atomistic simulation process and models for this system have been discussed therein. The apo or ligand-unbound form of T4Lys (PDB id: 3DMV) and and Src kinase (pdb id: 1Y57) have also been explored in our work along with with some data for Cytochrome P450 (pdb id: 2CPP). All the apo forms have been modelled using same simulation protocols. We have used the CHARMM36 force field [64]. The protein is solvated by 11613 TIP3P water molecules in a cubical box of dimension 7.18 nm. Sufficient number of ions were added to maintain 150 mM salt concentration. All MD simulations were performed with the Gromacs 5.0.6 simulation package [65]. During the simulation, the average temperature was maintained at 303K using the Nose-Hoover thermostat [66, 67] with a relaxation time of 1.0 ps and an average isotropic pressure of 1 bar was maintained with the Parrinello-Rahman barostat [68]. The Verlet cutoff scheme was employed throughout the simulation with the truncated and shifted Lenard Jones interaction extending to 1.2 nm and long-range electrostatic interactions [69] treated by Particle Mesh Ewald (PME) summation [70]. All bond lengths involving hydrogen atoms of the protein and the ligand benzene were constrained using the LINCS algorithm [71] and waterhydrogen bonds were fixed using the SETTLE [72] approach. Simulations were performed using the leapfrog integrator with a time step of 2 fs and initiated by randomly assigning the velocities of all particles. The total time for our simulations of the apo-proteins amounted to (T4Lys = 493.54ns, Src kinase = 460.00ns, P450 = 500ns) 1453.54 ns which was sufficient to obtain equilibrated data.
Supporting information including text and figures.
(PDF)Click here for additional data file.
Supporting video demonstrating the opening of the helices 4 and 6.
(MP4)Click here for additional data file.
Supporting video demonstrating the opening of the helices 7 and 9.
(MP4)Click here for additional data file.
Supporting video demonstrating the non-affine modes for the residues having high spatial NAP correlation shown in Fig 4(a).
(MP4)Click here for additional data file.
Supporting video demonstrating the non-affine modes for the residues having high spatial NAP correlation shown in Fig 4(b).
Authors: Cintia A Menéndez; Fabian Byléhn; Gustavo R Perez-Lemus; Walter Alvarado; Juan J de Pablo Journal: Sci Adv Date: 2020-09-11 Impact factor: 14.136