Literature DB >> 19247429

Rapid sampling of molecular motions with prior information constraints.

Barak Raveh¹, Angela Enosh, Ora Schueler-Furman, Dan Halperin.

Abstract

Proteins are active, flexible machines that perform a range of different functions. Innovative experimental approaches may now provide limited partial information about conformational changes along motion pathways of proteins. There is therefore a need for computational approaches that can efficiently incorporate prior information into motion prediction schemes. In this paper, we present PathRover, a general setup designed for the integration of prior information into the motion planning algorithm of rapidly exploring random trees (RRT). Each suggested motion pathway comprises a sequence of low-energy clash-free conformations that satisfy an arbitrary number of prior information constraints. These constraints can be derived from experimental data or from expert intuition about the motion. The incorporation of prior information is very straightforward and significantly narrows down the vast search in the typically high-dimensional conformational space, leading to dramatic reduction in running time. To allow the use of state-of-the-art energy functions and conformational sampling, we have integrated this framework into Rosetta, an accurate protocol for diverse types of structural modeling. The suggested framework can serve as an effective complementary tool for molecular dynamics, Normal Mode Analysis, and other prevalent techniques for predicting motion in proteins. We applied our framework to three different model systems. We show that a limited set of experimentally motivated constraints may effectively bias the simulations toward diverse predicates in an outright fashion, from distance constraints to enforcement of loop closure. In particular, our analysis sheds light on mechanisms of protein domain swapping and on the role of different residues in the motion.

Entities: CellLine Chemical Disease Species

Mesh：

Substances：
Proteins

Year: 2009 PMID： 19247429 PMCID： PMC2637990 DOI： 10.1371/journal.pcbi.1000295

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Mechanistic understanding of protein motions intrigued structural biologists, bio-informaticians and physicists to explore molecular motions for the last five decades. In two seminal breakthroughs in 1960 [1],[2], the structures of Haemoglobin and Myoglobin were solved and consequently, for the first time, mechanistic structural insights into the motion of a protein were deduced from its snap-shot image. This finding paved the way to a by-now classical model for cooperativity in binding of allosteric proteins [3]. Nowadays, hundreds of proteins with known multiple conformations, together with their suggested molecular motion, are recorded in databases such as MolMovDB [4]. This number increases with the influx of solved structures from the Protein Data Bank [5]. An inherent flexibility is characteristic of fundamental protein functions such as catalysis, signal transduction and allosteric regulation. Elucidating motion of protein structures is essential for understanding their function, and in particular, for understanding control mechanisms that prevent or allow protein motions. Understanding the relation between protein sequence and protein motion can allow de-novo design of dynamic proteins, enhance our knowledge about transition states and provide putative conformations for targeting drugs. Accurate prediction of protein motion can also help address other computational challenges. For instance, Normal Mode Analysis (NMA) motion predictions [6] can be used for efficient introduction of localized flexibility into docking procedures [7],[8].

Experimental Limitations and Progress

Experimental knowledge of macro-molecular motions has been discouragingly limited to this day by the fact that high-resolution structures solved by X-ray crystallography are merely the outmost stable conformations of proteins, in a sense a snap shot of a dynamic entity. While high resolution experimental data of molecular motion are still beyond reach, innovative breakthroughs in time-resolved optical spectroscopy, single molecule Förster resonance energy transfer (FRET), small-angle X-ray scattering (SAXS) [9], as well as advances in NMR spectroscopy such as residual dipolar coupling methods and paramagnetic relaxation enhancements [10]–[13] now provide increasingly detailed experimental data on molecular motion, e.g., distance and angle constraints or measurements of rotational motion [14].

Computational Simulation of Motion

In spite, and perhaps due to the limited amount of experimental information, computational techniques like molecular dynamics (MD) simulations [15],[16] have been used extensively for the last three decades to simulate macro-molecular motion. Unfortunately, standard MD simulations are computationally intensive, and moreover, they often remain trapped in repetitive cycles of Brownian motion throughout the simulation, without being able to cross significant energy barriers. Therefore, they are often limited to pico-to-nano second timescales of motion [17], whereas events like enzyme catalysis [18], protein folding [19] and protein recognition [20] may require more time. As researchers often possess intuition and explicit partial knowledge about the nature of a motion or target conformations, biasing techniques were devised in steered MD simulations [21]. Such methods incorporate prior knowledge or expert intuition about the system and compromise the intended purity of MD simulations as a physical simulation. Nonetheless, they still rely to the most part on an approximation of physical forces, and guarantee that some plausible assumptions are satisfied. Subsequent motion trajectories were shown useful for designing experiments and deriving mechanistic insights into protein motion. Complementary coarse-grained methods such as Normal-Mode Analysis and Gö models [6],[22],[23] (reviewed in [10]) provide quick impressions about protein conformational changes when given a native conformation, but do not aim at the very fine details of the motion.

Sampling-Based Approaches

In recent years, a novel approach for sampling motion pathways, rooted in algorithmic robotics motion planning, has been applied to large-scale molecular motion prediction. This approach suggests an efficient alternative to slow step-by-step simulations of Newton equations. Instead, a sequence of clash-free conformations is generated by sampling the topology of the conformational space. This sequence is a fine discretization of continuous motion. In their original context, motion planning techniques like probabilistic road-maps (PRM) [24], Rapidly-exploring Random Trees (RRT) [25],[26] and similar methods [27]–[29] (all reviewed in [26],[30]) have been used to plan the motion of objects with many degrees of freedom (dofs) among obstacles in a constrained environment [31]. (Usually, these objects are referred to as “robots”, but can be any moving object, such as digital avatars, manufactured parts, or molecules in the context of this study). For simplicity, we collectively refer to this family of techniques as motion planning sampling techniques. In molecular biology, motion planning techniques were used to predict motion pathways for molecules while considering a large numbers of dofs [32]–[39], and contributed to our understanding of molecular kinetics in applications such as energy landscape exploration, protein and nucleic acids folding pathways and ligand binding [32], [34]–[36],[40]. Their performance has been compared to molecular dynamics [36] and integrated with Normal Mode Analysis [38]. In several cases, they were shown to capture known conformational intermediates and other experimental indicators [33], [37]–[39]. Motion planning techniques are optimized for finding complete motion pathways. They record the history and approximate the topology of the sampled search space in a tree or a graph data structure, the “road-map”. Molecular motions are extracted from paths or “roads” in the graph, where nodes stand for feasible (low-energy) conformations and edges connect close-by conformations. Therefore, paths in this graph are sequences of clash-free conformations. This also adds a whole new dimension of memory to the sampling process and the resulting search in conformation space is shown to be less prone to futile repetitive sampling [25]. Motion planning techniques are very fast – it takes between minutes and hours to generate a full motion pathway of relatively large time-scales with dozens of dofs and hundreds of amino-acids [38],[39], compared to weeks to months in MD simulations of motions with shorter time-scales. Hence, in contrast to MD simulations, sampling based methods are fast enough to generate a very large number of alternative pathways, whereas in an MD simulation it is often hard to decide if the pathway is representative or just the outcome of specific random start conditions. As the application of motion planning techniques to molecular motion is relatively new, further research is required in order to validate and calibrate its use. The external incorporation of experimental measurements into sampling-based simulations can increase the credibility of predictions, and turn them into a fair complement to ab-initio simulations. In addition, as the dimensionality of the search space increases, it is advantageous to exploit prior information about the nature of the motion to direct the search. A common practice in sampling methods of single conformations like MC is to bias the energy function itself towards known constraints [41]. In the context of sampling-based motion planning, it is common to explicitly bias the sampling to include the target conformation (e.g. [42],[43]). Another common bias is towards narrow passages in the space of configurations [26],[44]. In order to avoid getting stuck due to over-bias, biased sampling is often restricted to a fraction of the tree growth iterations. Kalisiak and Panne [45] terminated RRT branches that lead to immediate collisions, by sensing the local environment on-the-fly in order to save running time. Zucker et al. [46] used various features of the workspace environment (the Cartesian representation of the world) to bias the sampling of motion planning algorithms, by introducing ad-hoc relations between robotic dofs and workspace features, and using a grid discretization of the workspace.

Our Contribution

Here, we present PathRover, a comprehensive and generalized framework for efficiently sampling and generating motion pathways that satisfy constraints of prior information with the RRT algorithm [25]. PathRover generates low-energy, clash-free motion pathways that are biased towards external constraints. This is in analogy to similar approaches for finding a single optimal structure (but not a motion pathway) under a set of experimental constraints [47],[48]. Our approach follows the notion that the combination of a number of partial constraints can significantly limit the number of feasible solutions. We rely on a generalized RRT formalism that allows efficient, flexible and straightforward integration of prior information into the basic RRT algorithm. Partial information is incorporated through a branch-termination scheme where the growth of undesired pathways from the RRT tree is terminated (see Figure 1 for a toy example that illustrates the effect of constraints on RRT motion sampling). To our knowledge, this is the first thorough generalized attempt to incorporate diverse types of prior biological information into the RRT algorithm in biological context.

Figure 1

Comparison of pathway motion predictions in a 2D toy example.

Here, we aim to find collision-free paths for a point robot in 2D-space starting from a source configuration. (A) The basic Single-RRT algorithm provides fast but rough coverage of unexplored regions, and the target is often missed (red star, top left). During the run, the tree grows in feasible space (white) among obstacles (orange rectangles). In biological examples, these obstacles are high-energy conformations. Each point stands for a two dimensional conformation, and the tree grows from a source conformation (violet star, middle of figure), towards random directions (see Methods). (B) In the Partial-RRT variant, we use partial information to truncate branches that do not grow towards the target (like the truncated branch in the grey ellipse, compare to the branch in the magenta ellipse). The search is more confined to relevant regions, at the expense of overall coverage of the search space. (C, D) Comparison between the basic Single-RRT algorithm and RRT with partial information (Partial-RRT), for the toy example in a and b. The partial information used here is the distance to the target. In SingleRRT-t50 and PartialRRT-t50 the target is also used as an explicit direction of growth once in 50 iterations, in case the tree reaches the proximity of the target but not its exact location. This test follows a common assumption that RRT running time is dominated by the number of collision tests. We compare the Euclidean distance of the RRT node that is closest to the target (y-axis) as a function of the number of collision tests (x-axis) throughout the run. Results are the average distance (in c) or the minimum distance (in d) over 50 independent runs. PartialRRT performed better than SingleRRT, especially for a lower number of collision checks. Better performance is achieved in less running time. As the number of collision checks grows. All methods converge. Note that this is only a toy example for illustrative purposes; in many biological examples, the target conformation might not be given explicitly, and the number of dofs is in general much higher.

Comparison of pathway motion predictions in a 2D toy example.

Table 1

Examples for predicates of partial information in PathRover, motivated by experimental techniques and comparative methods.

Name of Predicate	Formal Definition of Predicate	Motivating Examples of Relevant Partial Information
Pair Distance	The distance between a pair of residues	Experimental distance constraints for transient and non-transient interactions (Spin-Label NMR, Single-Molecule FRET, Cross-Linking)
RMSD Minimize	RMSD between Cα atoms of two conformations or subdomains	Structure of an alternative native or homologue structure; the conformation of an active site region
Line-Fit Distance/Angle	The distance, or angle, between a set of Cα atoms, fitted by a least mean square line (LMSL)	Pairing of two beta strands ; relative orientation between the main axes of a helix and a sheet
Cent-Mass Distance	The distance between the center of mass of two subsets of Cα atoms	Cryo-electron Microscopy images that indicate the coarse distance between centers of mass of internal domains
H-Bond Formation	The formation of hydrogen bonds in unspecified locations	Circular Dichroism (CD) spectroscopy indications for helix-sheet formation, without indication of their specific location within the protein sequence

PathRover is integrated into the Rosetta molecular modeling framework [49], an accurate protocol for a range of different structural modeling tasks (e.g., [50]–[53]). Thus, PathRover is equipped with state-of-the-art energy functions, sampling and optimization protocols. All generated motion pathways are guaranteed to form a sequence of clash-free low-energy conformations, and to satisfy the input constraints.

Model Systems

We mainly focus on domain swapping of two molecular model systems, the CesT and the Cyanovirin-N proteins. Domain swapping occurs in multi-domain proteins, when a domain from one chain packs against the complementary domain in an identical chain [54], forming a pseudo-monomer (Figure 2 and Figure S1). The pseudo-monomer resembles the native structure, and the interface between the swapped domains is native-like. Domain swapping can lead to undesired effects of aggregation, such as the formation of amyloidal fibrils [55]. Investigation of domain unpacking and repacking may improve our understanding of the general mechanism of oligomerization [56].

Figure 2

Domain swapping in CesT type III secretion chaperone.

Domain swapping in CesT type III secretion chaperone.

(A) Crystal structure of the swapped dimer, pdb-id 1k3e [66]. The domains of each chain are packed against the complementary domains in the other chain. The presumed hinge region between the two domains of each chain is marked in space-fill representation. (B) The pseudo-monomer consists of the C-terminal domain from chain A (blue), packed against the N-terminal domain from chain B (red). Note the β-sheet in the interface between the two domains. Domain swapping is an interesting target for motion simulations [57]–[60]. It requires the unpacking of domains in the original chain, and the subsequent repacking to another chain. The main structural changes between swapped conformations are usually restricted to a few hinge residues that connect the two domains [54]. This may allow for some simplifying assumptions about the degrees of freedom that are involved in the motion. Since the structure of each domain is identical in different conformations, we may assume they remain rigid during the motion. We examine the validity of this simplifying assumption and experiment with various choices of dofs. A clever choice of dofs may reduce the running time, but may introduce additional bias to the motion. We compare restricted runs where only a subset of torsion angles is allowed to rotate, to free runs where all degrees of freedom are mobile. We note that the implications of the domain swapping examples are far reaching, since a large set of conformational motions is presumed to involve hinge motions with similar characteristics [61]. As an example, we consider the substrate binding motion of the Ribose Binding Protein. PathRover supports full-atom simulations in which the output conformations contain the coordinates of all side-chains and hydrogen atoms. These conformations can be used to formulate precise, lab-testable hypotheses (e.g., suggest mutations that may interfere with the motion), which are of substantial interest to both experimentalists and theoreticians. In the following sections, we provide a detailed analysis of these models.

Methods

Conformational Space

The conformation space is described in terms of internal coordinates. Backbone torsion angles uniquely define the conformation of a protein, since the side-chain torsion angles are optimized on the fly for each given backbone conformation. Bond lengths and angles are fixed, assuming that changes in torsion angles can in general compensate milder changes observed in bond angles and length.

Partial Information Predicates

We are interested in finding a collision-free, low-energy motion pathway that starts from a given initial conformation, and is consistent with partial information about the motion or the target conformation. We have formulated diverse types of predicates to constrain the sampling of motion pathways according to prior information. Here, we focus on partial information motivated by experiments, comparative analysis and expert intuition. For example, comparative analysis of biological databases can provide partial information from homologue structures, or from alternative conformations of the native protein, and distance constrains can be extracted from time-resolved spectroscopy. Table 1 includes a list of examples for predicates that are motivated by existing or yet to improve experimental methods for assessing transient conformations. Importantly, different types of partial information can be combined into a joint predicate. We note that distance constraints and additional constraints have been previously used in Rosetta to direct Monte-Carlo with Minimization sampling, although in a different algorithmic and biological context (see Discussion). As the combinatorial search space grows exponentially with the number of dofs, it is also beneficial to restrict the choice of flexible torsion angles. An automated, accurate choice of mobile dofs is a challenging aspect of motion prediction, and in this step, prior information can be most useful (see [38] for an attempt in this direction). In this work, we have combined information from several sources for restricting the number of dofs, such as: (1) careful inspection of structures, (2) relevant literature, (3) computational tools for detecting hinge regions like Normal Mode Analysis (NMA) [22], and (4) comparison of structural changes in alternative (native or homologue) conformations. When both a source conformation and a target conformation are available, we used the FlexProt flexible alignment tool [62] to extract fixed regions of the protein, and defined the dofs by the regions in-between. These were used to manually restrict the allowed dofs in the examined model systems (Table 2). The effect of the choice of mobile degrees of freedom is examined in detail in the Results section.

Table 2

Backbone degrees of freedom (dofs) that were free to move in simulations. For each model system we include the main evidence that was used for choosing a specific set of degrees of freedom.

Name of Simulation	Mobile Residues	Evidence Used for Selection of Residue Degrees of Freedom
CesT [M]1 ^, 2	34–37 (8 dofs)	• N-terminal domain (residues 1–33) and C-terminal domain (residues 38–134) can be independently aligned to homologue counterparts (e.g., SigE) by a rigid transformation. The pseudo-monomer is obtained by the packing of domains A and B.
		• Two slowest modes of GNM3 analysis predict hinges at residues 34 and 37.
		• Manual inspection shows that residues 34–37 side-chains are unpacked, and are flanked by two well-packed domains with regular secondary structures.
Ribose Binding Protein (RBP)4	101–104; 234–236; 261–262 (18 dofs)	• Extract loop residues that connect the two structured domains based on manual inspection (Figure 4a).
		• Each structured domain is structurally conserved between conformations 1urp and 2dri.
		• Slowest mode for GNM3 of pdb-id 2dri predicts hinges at residues 103–104, 235–236, 262–265, in the vicinity of the selected degrees of freedom.
Cyanovirin-N 5: Central-Hinge	48–55 (16 dofs)	• N-terminal domain (residues 1–50) and C-terminal domain (residues 51–101) are repeat domains at the sequence and the structure level (<1 Å RMSD deviation). The structure of each domain is highly conserved between alternative conformations, but not that of the connecting residues 48–55.
		• Large differences in φ/ψ values between alternative structures 2ezm, 1l5b and 1l5e around this approximate region.
		• The literature about Cyanovirin-N structure marks this region as the hinge region [72],[73].
		• Mutations of P50 and S51 significantly affect the equilibrium between the monomeric and dimeric forms [73].
		• Slowest mode of GNM3 analysis for pdb-id 1l5e predicts hinges at residues 50–52.
Cyanovirin-N 5 : Secondary-Hinges [M]2	48–55; 36–40; 87–91 (32 dofs)	• Secondary hinge residues 36–40 and 87–91 connect separate secondary structures within the N-terminal and C-terminal domain, respectively.
		• GNM3 analysis: slowest/second-slowest modes for pdb-id 2ezm, and second slowest mode for pdb-id 1l5e, both predict hinges around residue 34–36 and 86–87, in the vicinity of the selected degrees of freedom.
Cyanovirin-N 5 : Partially-Restricted [M]2	48–55; 2–47; 56–100 (198 dofs)	Here we allow for “breathing motion” of ±30° in torsions 2–45, 56–100, in addition to full motion in the central hinge, like in the central-hinge simulation above (residues 48–55).
Cyanovirin-N 5 :Free [M]2	2–100 (198 DOFS)	All torsion degrees of freedom (except extreme tail residues) are free to move by ±180°.

pdb-id used: 1k3e [66] ; protein total length is 146 residues.

Simulations with local energy minimization are denoted by [M], see Methods.

Gaussian Network Models analysis on iGNM server http://ignm.ccbb.pitt.edu/ [22], using default cutoff parameter of 10 Å for building the harmonic potential.

pdb-ids used: 1urp [83], 2dri [84] ; protein total length is 271 residues.

pdb-ids used: 2ezm [74], 1l5e [73]; protein total length is 101 residues.

pdb-id used: 1k3e [66] ; protein total length is 146 residues. Simulations with local energy minimization are denoted by [M], see Methods. Gaussian Network Models analysis on iGNM server http://ignm.ccbb.pitt.edu/ [22], using default cutoff parameter of 10 Å for building the harmonic potential. pdb-ids used: 1urp [83], 2dri [84] ; protein total length is 271 residues. pdb-ids used: 2ezm [74], 1l5e [73]; protein total length is 101 residues.

RRT Motion Planning with Partial Information

The Rapidly-exploring Random Tree (RRT) algorithm is a general framework for rapid exploration of a conformation space (referred to as “configuration space” in robotics) in a highly constrained environment. It was first presented in algorithmic robotics, where it was used to plan the motion of moving objects among obstacles [25]. RRT produces a tree of conformations and records the topology of the search space. Nodes stand for feasible (low-energy) conformations, edges connect close-by conformations, and paths are sequences of feasible conformations. It was shown that the RRT tends to grow towards unexplored regions at progressively increasing resolution [25].

Forbidden space and feasible space

We define the conformation space by the internal dofs of the protein, namely the torsion angles that are allowed to change throughout the motion pathway (see below). The conformation space is divided into forbidden and feasible regions (referred here as C-forbid and C-feasible, respectively; for illustration see Text S1). The forbidden regions correspond to all the conformations that involve high energy values, namely energy score above a threshold parameter, whereas the feasible regions comprise the low-energy conformations.

RRT algorithm with branch termination

In algorithmic robotics literature, RRT is often biased by manipulation of node sampling, e.g. sampling in certain regions of interest. Here we take a different approach and rely instead on terminating branches that do not improve a certain predicate (note that branch termination has been previously used in a rather different context only, namely for avoiding imminent collisions under kinematic-dynamic constraints by sensing the immediate environment [45], and to deal with moving obstacles [42]). The input to the algorithm is an initial conformation and a set of partial information predicates (detailed pseudo-code and a full list of parameters are provided in Text S1). The tree is grown iteratively in small incremental moves to guarantee the smoothness of the motion. At each iteration, a new conformation, qrand, is randomly sampled from the feasible space C-feasible. The nearest neighbor in the tree is then expanded towards q , by linear interpolation of the degrees of freedom from qnear to qrand. Each path in the RRT tree can be considered a fine discretization of a continuous motion pathway in the feasible conformation space. The simulation terminates when: (i) the number of nodes in the tree is larger than N, a parameter for maximal tree size, or (ii) the tree could not be expanded for k consecutive iterations. The partial information predicates are used to choose a motion path that leads from the initial conformation to the conformation with the best predicate score. The partial information bias is introduced by a filtering step. The filtering step is applied only in every other iteration, to allow an escape from local minima traps. In the filtering step, the branch that grows towards qrand is terminated if it does not improve the partial information predicate after m consecutive interpolation steps (typically m = 2, again to allow an escape from local minima traps). The branch is terminated even if it leads to energetically feasible conformations (Figure 1b). The aim of this filtering step is to avoid expensive energy calculations in undesired directions. We note that existing branches are not pruned, only the growth of the current branch is terminated.

The effect of avoiding local minima with respect to partial information

Figure 1 shows a toy example where branch termination is applied in all iterations. As desired, this narrows down the search to relevant regions of the conformational space (Figure 1b, grey circle). If we compare to Figure 1a, where branch termination is not employed, we see that the overall coverage of unexplored regions is compromised, but the target is reached faster (Figure 1c and 1d). By applying the global filtering step in every other iteration, we gain a bias towards partial information predicates, but still benefit from the rapid sampling of the unbiased RRT algorithm.

Side-chain optimization and local minimization

In full atom-mode (see below), side-chains of generated conformations along the pathways are locally refined by the Rosetta Rotamer-Trial procedure to alleviate local steric clashes and optimize the interaction of side-chains with neighboring residues [63]. In addition it is advisable for full-atom runs to include short gradient-descent minimization with respect to all torsion degrees of freedom: very slight rotations of torsion angles (∼0.1°–0.2°) can alleviate local steric clashes and reduce the energy score substantially. To restrict the change in φ/ψ values, we added a heavy Gaussian penalty for deviating from the initial backbone torsions (Rosetta energy constraint CST_PHI_PSI with weight = 250.0). Note however that minimization is time consuming (3–8 times slowdown). Local minimization is optional and can be invoked by turning on a run-time flag.

Rosetta Infrastructure

Energy function: Full-atom vs. centroid mode

We experiment with both the Rosetta full-atom energy function (Rosetta score12 [50]), which was shown useful for discerning native structures at atomic detail, and the coarser Rosetta centroid-mode energy function, a united-atom representation where side-chains are represented as centroid spheres (Rosetta score4 [49]). The latter allows rapid calculations at the expense of atomic detail, and has been used in a wide range of applications in Rosetta to speed up the conformational search by optimizing coarse features prior to atomic level optimization. We assume here that the Rosetta energy function is relevant for our current task (see Discussion). The Rosetta energy function was optimized for native structures, but it includes physical van der Waals terms and solvation models, as well as a statistical hydrogen bonding term that was shown to correlate with quantum mechanical calculations [64]. However, we do note that PathRover is in principle not restricted to any specific energy function, and can be used in conjunction with other energy potentials as well.

Rosetta optimizations

In addition to the full-atom and centroid-mode energy functions, the presented framework takes advantage of the elaborate infrastructure of Rosetta, including manipulation of molecular dofs, rapid side-chain optimization for fixed backbones (using the “rotamer trial” procedure described in [63]), energy minimization and caching of energy calculations. The framework can also use the gamut of other Rosetta features, such as closed loop sampling and sophisticated manipulation of backbone torsions, e.g. backbone fragment libraries or backrub motions [65]. These features are out of the scope of the current study, and will be explored in future work.

Analyzing Hinge Residues in Simulations

In order to characterize the predicted motion in our simulations, we have examined what portions of proteins remained rigid during the motion and what residues served as hinge residues. We note that inspection of φ/ψ values is not necessarily suitable for this purpose, since small backbone perturbations can result in large scale motions and vice versa. In Figure S2, we describe our protocol for detecting hinge residues in simulated motion. In brief, we rely on structural comparison of different pairs of conformation in the simulated motion. Rigid portions of the protein are detected by the FlexProt [62] flexible structural alignment algorithm (Figure S2), and the hinge residues are defined as the regions that connect the rigid parts. We score each residue for how often it serves as a hinge in different alignments throughout the simulation (Figure S2). The structural alignment is performed at different resolutions of RMSD, using a resolution parameter ρ. Low resolution hinges are involved in strong hinge motions, and high-resolution hinges are involved in milder ones.

Running times

All runs in this study were conducted on an AMD Opteron 275 2.2 Ghz/1 MB processor. Unless otherwise specified, in the Results section, centroid mode runs take the order of a few seconds to minutes each and Full-atom runs take 2–8 hours without energy minimization, and roughly 10–60 hours when the energy minimization flag was employed (see above), for growing trees of 30,000–100,000 conformations each. The number of dofs in different runs was between 8 and 198 backbone torsions (and all side-chains, see Table 2).

Results

In the first part of this section, we examine the usage of various geometric constraints and a combination of constraints to bias the motion during simulations. We also show how the energy function prevents over-bias by the input constraints. In the second part, we deal with another form of partial information – the choice of degrees of freedom that are allowed to move during the simulations. We examine the robustness of simulations to different choices of degrees of freedom, and analyze in full-atom detail the domain-swapping motion of inspected model systems.

Partial-Information Predicates to Bias the Motion

CesT domain swapping

CesT is a type III secretion chaperone in Enteropathogenic E. coli that binds numerous effector proteins. In CesT the neighboring chains within the crystal lattice are domain swapped [66] (Figure 2a). The N-terminal domain (residues 1–33) and C-terminal domain (residues 38–134) from neighboring chains pack to form a monomer-like globular unit, the “pseudo-monomer” (Figure 2b). The pseudo-monomer can be well aligned to monomers in the other homologues (Figure S1), suggesting that there is a monomeric form of CesT that resembles the pseudo-monomer. Packing of non-swapped monomers against each other is mostly identical to their packing in the pseudo-monomers, as the sequence of domains is identical [56]. It is however not known whether the swapped conformation of CesT is a crystallographic artifact or whether it is the physiologically active peptide-binding form [66], and it is interesting to examine the possibility for domain-swapping motion of this protein.

Using the pseudo-monomer as a partial information predicate

We first examine whether we can model a hinge motion in which a chain of CesT moves to the pseudo-monomer conformation where its two domains are interacting. We start from the swapped conformation (where the domains are farther apart), using chain A in pdb-id 1k3e [66]. We allowed backbone mobility in residues 34–37, a loop region that separates the two domains and is also predicted to be a hinge region by normal mode analysis (Figure 2a, spheres; see also Table 2 for additional evidence that these residues form a hinge). The algorithm was biased to minimize the RMSD between the initial conformation and the pseudo-monomer conformation, using the residues in the interface between the N-terminal domain and the C-terminal domain (Table 3). We applied RRT with branch termination, as described in Methods, and simulated the motion in both centroid mode and full-atom mode (where all side chain atoms are included, see Methods). Each centroid mode run was repeated 50 times, taking a few seconds only to complete. Full-atom mode simulations were performed with the energy minimization flag turned on, to relieve local steric clashes. Each such simulation took a few hours on a single processor, and was repeated 15 times.

Table 3

Predicates used for guiding the domain swapping motion of CesT.

Name of Predicate	Description of Predicate	CesT Residues	SigE Residues
Pseudo-monomer	Minimize RMSD between CesT and the pseudo-monomer of the CesT crystal structure	S5:C29 Y38:N62	not relevant
Atom distance A	Compare the distance between a pair of atoms in CesT and in SigE. The distance in SigE is used as a reference for CesT during the simulation. A, B and C are three different choices of atom pairs.	F12:E110	L8:L95
Atom distance B	see Atom distance A	L8:A104	L4:S89
Atom distance C	see Atom distance A	D34:Y37	D27:I29
Atom distance A+B	Weighted combination of above Atom-Distance predicates: 1A+1B	F12:E110 L8:A104	L8:L95 L4:S89
Atom distance A+B+C	Weighted combination of above Atom-Distance predicates: 1A+1B+1*C	F12:E110 L8:A104 D34:Y37	L8:L95 L4:S89 D27:I29
Helix line-fit	Fit a least-mean square line (LMSL) to both helix H1 and H3. The predicate is a weighted sum of the three terms: 1Line_angle+1Line_dist+1*CMass_dist Line_angle = the angle between fitted lines Line_dist = the distance between fitted lines CMass_dist = the center of mass distance from helix H1 to H3	Helix H1: L8..K15 Helix H3: P106..L125	Helix H1 L4..A11 Helix H3: E91..E110
Helix RMSD	Minimize RMSD between helices H1 and H3	Identical to Helix Line-Fit
Strand RMSD	Minimize RMSD between β-strands B0 and B1	Sheet B0/1: A32..D34 I36..L41	Sheet B0/1: 23..25 29..34
Helix+strand RMSD	Minimize RMSD between both helices H1/H3, and sheets B0/B1 (with equal weights)	Identical to Helix RMSD+Sheet RMSD

We analyzed the runs that best minimized the predicate. Both in centroid mode and full-atom mode, a collision-free path towards the pseudo-monomer conformation was found. The initial conformation deviates by 16 Å RMSD from the pseudo-monomer, and the final conformation deviates by 0.8 Å RMSD in centroid mode, and 1.34 Å in full-atom (0.76 Å for a partial alignment without residues N24-I33). These runs provide a proof of concept that the biased RRT algorithm successfully employs bias for guiding the motion. The suggested motion is shown in Video S1 (full-atom mode) and Video S2 (centroid side-chains mode).

Biasing the motion with SigE, a homologue of CesT

SigE (pdb-id 1k3s [66]) is one of a few distant homologues of CesT. The N-terminal and C-terminal domains of SigE are similar to those of CesT (RMSD of 1.7 Å and 2.5 Å respectively), and the pseudo-monomer of CesT can be aligned to the SigE monomer (Figure S1). However, the two structures share a very low sequence identity (18%), and the SigE monomer deviates by 3.96 Å from the pseudo-monomer (using the sequence alignment from [66]). In order to investigate whether this distant homologue can indeed guide the motion towards the correct conformation, we devised a set of geometric predicates that use SigE as a reference for guiding the motion of CesT, such as the distance between specific atoms or the orientation between specific secondary structures (see Table 3 and Figure 3). For each predicate, we conducted 50 independent runs and analyzed the run that best minimized the predicate. We worked in centroid (united-atom) mode, as this example mainly serves to illustrate the effect of various predicates on the simulations. Running time was a few seconds for each simulation.

Figure 3

Use of partial information in simulations of CesT domain swapping (in centroid mode representation).

Use of partial information in simulations of CesT domain swapping (in centroid mode representation).

(A) The final conformation along the motion pathway of CesT (cyan) is shown for five different examples of predicates (see Table 3). We show the best scoring run with respect to the specified predicate (out of 50 independent runs). The orientation of the N-terminal domain of the native pseudo-monomer is shown in red for comparison. (B) The RMS distance of the final structure, for simulations with different predicates, is plotted relative to SigE (the homologue that was used to guide the motion, blue) and the native pseudo-monomer of CesT (black). Even though the homologue was the reference for biasing the motion, the simulations reached the correct conformation with a better level of accuracy for several predicates. (C) Biasing the motion by combining several distance constraints (see Table 3 for details about the constraints): the results are shown as the fraction among 50 independent simulations that reached given RMSD thresholds (to the native pseudo-monomer). In Figure 3a and 3b we examine, for 5 of the predicates in Table 3, the RMSD distance of the final conformation from (1) SigE – the reference homologue protein, and (2) the native CesT pseudo-monomer (Figure 2b). Remarkably, in four cases, the final structure was more similar to the pseudo-monomer than to SigE, even though the reference for guiding the motion was SigE (the SigE monomer deviates by 3.96 Å from the pseudo-monomer). This suggests that the energy function prevented over-bias of CesT motion towards the structural features of SigE. This fact is particularly surprising since the simulations were conducted in centroid mode, without the atomic details of the side-chains, and demonstrates the effectiveness of a simplified and rapid energy function in this case. At such level of predictive precision (<2 Å), many side-chains can be already modeled quite accurately. Although our aim is motion prediction and not homology modeling, it is promising that the near-native conformation is recovered using very simple predicates.

Relative orientation of secondary structures

In Sig-E and in the CesT pseudo-monomer (but not in the initial conformation), α-helices H1 and H3 each lie in a different domain of the protein (Figure 2b). In addition, β-strand B0 of the N-terminal domain is paired to β-strand B1 of the C-terminal domain (Figure 2b). In Table 3, we list the set of predicates that we formulated over the relative orientation of these secondary structures in the two domains. An interesting predicate is the Helix Line-Fit predicate, which combines three measures. We used least mean square line fitting (LMSLs, Table 1) to approximate the main axes of α-helices H1 and H3. We then measured the distance and angle between the fitted lines, as well as the distances between the centers of mass of each helix. This is a useful measure when relying on a homologue protein, since it is much less sensitive to alignment shifts that are characteristic for α-helices. In Figure 3b we observe that the final conformation was very close to the native pseudo-monomer (1.87 Å) but not to SigE, which is the reference for the biasing predicate. Looking closely at this example (Table S1), we saw that the line angle and line distance predicates were perfectly matched to their values in SigE, whereas the center of mass distance between the helices did not reach its value in SigE (15 Å) but rather reached its value in the CesT pseudo-monomer. Hence, the simulation was not over-biased by partial information constraints, and took into account the specific features of the simulated molecule.

β-sheet formation

The combination of helix and sheet predicates (Table 3 and Figure 3) was sufficient to direct the motion to the native conformation. How does β-sheet formation affect α-helix orientation and vice versa? We observed that when the motion was guided by partial information on the orientation of the α-helices alone, the β-strands B0 and B1 still came close together and the final conformation exhibited similar structure to the native pseudo-monomer. In contrast, the helices did not move to the correct orientation when the only partial information provided was β-sheet formation. We note however, that this may also be an artifact of the smaller number of atoms involved in the strand-pairing predicate.

Combinations of atomic distance constraints

Not surprisingly, a constraint on the distance between a single pair of atoms is not deterministic enough for guiding the motion (Figure 3a and 3b–left bar). The structural alignment between the final conformation of CesT and the pseudo-monomer is rather poor. It is clear that the distance between a single pair of atoms should be combined with other partial information or atomic detail constraints, in order to derive a more reliable target conformation and motion pathway. Therefore, we have examined what combination of distance constraints suffices for biasing the motion. Combinations of two or three distance constraints (Table 3) were used to guide the motion. In Figure 3c, we plot the percentage of 50 independent simulations that reached the native pseudo-monomer conformation up to various degrees of similarity (in RMSD). We observe that 2 or 3 constraints are still not enough to guide the motion in all simulations, but they lead to a much higher percentage of runs that reach the native conformation. This suggests that the combination of just a few distance constraints is an effective way of constraining motion-planning simulations.

Ribose binding protein (RBP): Ligand-binding-induced hinge movement: Incorporating loop closure constraints with simple predicates

The problem of fixing remote structural segments that are connected by a flexible loop is known in the literature as the protein loop closure problem [67]. It might require complex loop closure calculations or interpolation of internal coordinates motion from Normal-Mode Analysis [38],[68]. Previous attempts have been made for ad-hoc solutions to this problem during RRT simulations [68],[69], as well as in the broader context of structural modeling [67],[70],[71]. The Ribose Binding Protein (RBP) belongs to a family of ligand-binding proteins that comprise two domains, connected by a hinge. Upon binding of the ligand in a cleft between the two domains, the domains approach each other to close the cleft (Figure 4a). However, unlike CesT, in RBP each domain is discontinuous with respect to the sequence. The hinge that connects the two domains is made of three separate stretches of sequence (Figure 4b and Table 2). Consequently, the hinge torsion angles must change in a coordinated way, to prevent the two domains to disintegrate. Although violating the integrity of domains is energetically unfavorable, a lot of running time may be consumed on sampling non-favored conformations that disintegrate the domains. Indeed, when we simulated the motion of RBP without any external constraints, the domains wobbled and partially disintegrated during the motion, with high energy fluctuations (results not shown). Although this type of motion cannot be negated completely, domain disintegration during relatively fast substrate binding motion contrasts basic biological intuition.

Figure 4

Ribose Binding Protein (RBP) architecture.

Ribose Binding Protein (RBP) architecture.

Domain A (cyan) is connected to Domain B (blue) by a hinge (red). (A) The RBP structure in its open and closed form, pdb-ids 1urp [83] and 2dri [84] respectively. (B) The architecture of RBP–each domain consists of discontinuous segments of residues. The two domains are connected by three hinges that must move in a coordinated way to maintain domain integrity. Domain boundaries are rough estimates. (C) The sequence of RBP showing the discontinuous domains and the secondary structures. This illustration was taken from 1urp Protein Data Bank entry at http://www.rcsb.org/pdb/ [5]; and domain assignments are from [85]. In order to enforce loop closure, we have simply added a partial information predicate that penalizes disintegration of the domains (in terms of RMSD to the native domain). In the resulting pathway (Video S3), the two domains are kept in one piece throughout the motion, and only a small β-strand at the C-terminus of the protein (residues 266–269) deforms during the motion. The simulation of the motion in centroid mode is performed within a few minutes time. This demonstrates the flexibility of the partial information framework to efficiently address diverse settings, without the need for explicit ad-hoc calculations.

The Degrees of Freedom that are Involved in the Motion of Cyanovirin-N

We now examine in detail the importance of different degrees of freedom for another model system of domain swapping motion: Cyanovirin-N is an anti-viral fusion inhibitor protein that binds to viral sugars, and is trialed for preventing sexual transmission of HIV. It comprises two repeat domains of 30% sequence identity. The domain swapped dimer has higher anti-viral affinity than the monomer [72], and it was shown that the two forms can exist in solution, with a high energy transition barrier between them. In addition, it has been reported that certain mutations can affect the energy barrier and stabilize alternative conformations [73]. We examined here how two repeat domains of a single chain can unpack from the tightly-intertwined monomeric conformation to an extended domain-swapped conformation. The conformational transition during swapping is substantial, as the swapped conformations deviate by 14 Å RMSD. In all simulations, we started from the monomer conformation (pdb-id 2ezm [74]), and for biasing the motion towards the swapped conformation (pdb-id 1l5e [73]), we minimized the RMSD distance towards it. The difference between the following simulations is the sets of degrees of freedom that are allowed to rotate during the motion (Table 2).

The central hinge: Allowing rotation in residues 48–55

It has been suggested that residues 48–55 between the two repeat domains of Cyanovirin-N form a hinge region for domain swapping [72],[73]. Additional support comes from structural conservation patterns, difference in torsion angle values between alternative structures and Gaussian Network Models for detecting hinge regions (Table 2). We refer to this region as the central hinge of Cyanovirin-N. Using the central hinge set of dofs (Table 2), we first used the simplified centroid mode representation to generate a low-energy motion pathway within minutes. Considering the experimentally determined high energy barrier for this motion, it is rather surprising that such a pathway could indeed be easily created. We thought this might be an artifact of the simplified representation of the structure in centroid mode: the barrier might be apparent only at a higher resolution level. We therefore proceeded to a full-atom representation:, When all side-chains atoms and hydrogen atoms were modeled explicitly, it was impossible to unlock the intertwined monomer, unless the energy threshold was substantially raised to 105 Rosetta Score-12 units (which allows for extreme steric clashes). The domains did not unpack even in a long run of RRT, consisting of 300,000 conformations and taking a few days to run. The protein moved by no more than 1.5 Å from the initial conformation, over 13 Å away from the swapped target conformation. Video S4 demonstrates how the side-chains of one domain are tightly locked within the other domain, and the motion is confined within a steric “cage”.

The effect of local energy minimization

Could very slight “breathing” motions of other degrees of freedom allow the domain-swapping of the protein? Local energy minimization involves slight changes (∼0.1–0.2) in all backbone and side-chain degrees of freedom and as such might suffice to alleviate local steric clashes of the sensitive full atom energy function (see Methods section for details). Indeed, in a new simulation with freedom of motion in the central hinge together with local energy minimization (“Central-Hinge (M)” in Table 2) we were able to generate a clash-free motion pathway in full-atom. Local minimization slowed down the rate of generating new nodes (3–8 fold), but allowed the initial unpacking of the domains after less than an hour run. It is striking to observe how minute structural breathing in all degrees of freedom can alleviate steric clashes and allow motion that was not possible when only the central hinge is mobile.

The effect of adding secondary hinges

We now examine if the introduction of several additional dofs may provide the simulation with sufficient “breathing” flexibility to allow the large scale motion of the hinge, even without energy minimization. We inspected the structure and located two symmetry-related loops at residues 36–40 and at residues 87–91 that connect the two β-sheets in each domain. These loops appear as “weak links” in the protein chain between the two sheets (see Table 2 for more reasoning behind this choice). The addition of flexibility in the two loop regions (“Secondary-Hinges” in Table 2) allowed our simulations to find low-energy clash-free motions in full-atom mode (Video S5), without energy minimization in all dofs.

Analyzing hinge residues by restricted sampling of all DOFs

We saw that restricted local energy minimization of all dofs, as well as the introduction of secondary hinges both enabled the domain swapping motion. In order to analyze the motion of all residues, we conducted another simulation where the central hinge (residues 48–55) is free to move, and all other dofs can also rotate by up to 30° from their initial value, allowing for more extensive “breathing” motion in all dofs (“Partially-Restricted (M)” in Table 2 ). The total number of dofs in this simulation is 198, with 16 dofs that are completely free to move. Using this large number of dofs (and including local energy minimization), the simulations took 3–4 days. A movie of such a simulation (see Video S6) shows the breathing motion of all dofs, and in addition suggests that side-chains L1 and W49 (marked in red) act as “gate-keepers” that interfere in the unpacking of the two domains. It would be interesting to examine the role of these residues experimentally and in-silico, although this is out of the scope of this work. We should note that the S–S bonds between two adjacent β-strands, from C8 to C22, and from C58 to C73, were not modeled in the simulation due to technical limitations. However, we note that both of these bonds connect adjacent β-strands, and atomic distances between these pairs of residues are close to constant during all the simulation, suggesting that S–S bonds will not play a critical role.

Consistency of the hinge regions between runs

In order to examine the residues involved in the motion, we have scored each residue for how often it serves as a hinge during the motion, at different resolutions of motion (see Methods and Figure S2). Low resolution hinges are involved in strong hinge motions, and high-resolution hinges are involved in milder ones. We conducted three independent simulations, and compared the consistency of the detected hinges at different resolutions of hinge motion (parameter ρ). We used Pearson's linear correlation (with values ranging between 1 for full-linear correlation, and 0 for no correlation). The correlation over all residues is very high and rises with decreasing resolution, such that the most prominent hinge motions are consistent between runs (Figure 5b, blue line; correlations range from 0.77 at ρ = 0.5 Å, to 0.96 for ρ = 4 Å). Since the central hinge is inherently biased by the run parameters, we also analyzed the correlation when excluding the central hinge region (green line). In this case the correlation is lower but still significant. As expected, the correlation is low at the highest resolution (ρ = 0.5 Å), where small, flickering, movements are measured. For a detailed inspection, we plot in Figure 5c the hinge scores in the three simulations at different resolutions (1.5 Å and 3.5 Å), and also show a Cyanovirin-N structure in cartoon representation colored based on the residue hinge score. Finally, a plot of the weighted average of hinge scores at different resolutions is shown in Figure 5d. Low-resolution hinges are involved in larger hinge motions and are assigned a higher weight, so that pronounced hinge regions are inspected:

Figure 5

Hinge regions in independent simulations of Cyanovirin-N domain swapping.

All runs use the Partially-Restricted (M) set of dofs (Table 2), where the central hinge is allowed free motion, and all other residues can rotate by ±30°. (A) In each run, each residue is scored by how often it tends to be in hinge regions. Hinges are extracted by structural comparison between conformations along the motion sequence. They connect sub-domains that remain rigid during the simulation. Resolution parameter ρ states the RMSD threshold used for rigid alignments. Mild hinges appear only at higher resolutions (low value of ρ), and salient hinges appear at low resolutions (see Figure S2 for a detailed protocol). (B) Pearson's Correlation between the hinge scores of three independent simulations, for different values of ρ. In blue, the correlation over all residues, including the central hinge (residues 48–55). In green, the correlation when excluding the central hinge. (C) Hinge scores for each residue in three independent simulations, for ρ = 1.5 Å and ρ = 3.5 Å. The y-axis denotes how often each residue appears in hinge regions (see Figure S2 for more details). Secondary structures (according to DSSP [86]) are plotted along the x-axis. Observe that milder hinges disappear at lower resolution (3.5 Å). R and R* are the average Pearson's correlations between runs with and without the central hinge region, respectively. For each plot, the crystal structure of Cyanovirin-N is colored according to the corresponding hinge score, with warm colors indicating higher scores. (D) The weighted average of the hinge scores for different values of ρ (see Results). Since higher resolutions contain milder hinges, they were assigned a lower weight.

Hinge regions in independent simulations of Cyanovirin-N domain swapping.

Energy analysis

What is the contribution of different energy terms to the motion? For analyzing the estimated energy landscape of the motion, we used the Rosetta energy score with dampened van der Waals potential [75], to reduce the dominant effect of fluctuative contributions of slight steric clashes (in the simulations themselves we use the classic van der Waals potential, so any steric clashes are heavily penalized). In Figure 6a we observe the energetic barrier that results from unpacking of the two domains, both disrupting favorable attractive forces (vdW–blue, hydrogen bonds - green) and causing increased repulsion due to the motion in a cluttered environment (repulsive force, red line). The solvation term decreases as polar residues are exposed (the hydrophobic effect is evident in the much higher increase in the attractive vdW force).

Figure 6

The contribution of different energy terms in the domain-swapping simulation of Cyanovirin-N.

We used the Rosetta soft-repulsive energy score for this analysis, in order to dampen repulsive fluctuations that are due to mild sterical clashes (see text, note that the simulation itself was conducted with the Rosetta score-12 energy function, with a classical vdW potential). Note that the y-axis shows the total energy score, and the specific energy terms are shifted by a constant number of units, for convenient comparison with the other terms. (A) Energy plot for simulation with restricted dofs (central hinge is free to move, and all other dofs can fluctuate by ±30°). (B) Energy comparison for the aligned sections of the restricted (left) and free (right) simulations. In the free simulation, all dofs are free to move. The reaction coordinates of the two simulation were aligned by a string matching algorithm [39] based on structural similarity between conformations (see text, Figure S3 and Video S8).

The contribution of different energy terms in the domain-swapping simulation of Cyanovirin-N.

Comparison of restricted and free simulations

Is the biasing for the central hinge indeed justified? In order to answer this question, we performed a simulation where all residues (except for the first and the last residue) were completely flexible and no bias was introduced (“Free” in Table 2). For 198 free dofs, the simulation took 5–6 days to generate a tree of 100,000 conformations. For the free simulation, the domains unpacked from each other substantially, but did not manage to reach the target conformation within the limitations of the running time (Video S7), probably due to the large number of non-restricted dofs. We compared the free simulations to the “Partially-restricted” simulations described above and in Table 2. Each simulation results in a motion pathway that comprises a sequence of conformations. We aligned the simulated motion pathways based on RMSD between corresponding conformations (using a path alignment scheme where corresponding frames in the two simulations are aligned by a string matching algorithm, similar to sequence alignment methods, see [39]). The movie of the aligned motion of the restricted and free simulations (Video S8) demonstrates that the two simulations are very similar, and Cα RMSD between aligned conformations stays within 2–3 Å throughout most of the motion, growing to 3.5–4 Å only towards the end (Figure S3). Remarkably, comparison of the hinge scores in the restricted and free simulation (Figure 7), shows that the central hinge is the most prominent hinge at low resolution (left panel, ρ = 4 Å), which means it is involved in the largest scale motion in the free simulation. Milder hinge motions at resolution ρ = 1.5 Å are less correlated, although the hinges at residues 27–30 are still markedly observed in both simulations. Note that since the free simulation spanned a part of the domain swapping motion, the alignment is partial, comprising half of the partially-restricted simulation, and the entire free simulation.

Figure 7

Robustness to restriction of dofs in Cyanovirin-N simulations.

Robustness to restriction of dofs in Cyanovirin-N simulations.

The hinge score for each residue is plotted for (A) restricted simulations, where only the central hinge (residues 48–55) is free to fully rotate, and the other residues are restricted to ±30° deviation from the initial conformation, and (B) a free simulation (magenta) where all backbone degrees of freedom are free to rotate (see Table 2). Hinge scores are plotted for resolution parameter values ρ = 4 Å and ρ = 1.5 Å. The central hinge is the most salient feature in the free simulation, and therefore it appears even in low-resolution plots. Milder hinges are less robust to the restriction of dofs (see text for more details). In Figure 6b we also compare the energy of the aligned simulations. In both simulations we observe the energetic barrier that results from unpacking of the two domains, both disrupting favorable attractive forces (blue) and causing increased repulsion due to the motion in a cluttered environment (repulsive force, red line). In both simulations the repulsive forces subside as the domains continue to separate and the attractive forces increase. It is worth noting that motion-planning with such a large number of dofs (198) is not a trivial task, and that both simulations converged only when we used local energy minimization. Although energy minimization may often increase the running time, it may allow to deal with a larger number of flexible dofs, since unlikely random perturbations that lead to clashes can be countered by local minimization.

Discussion

In this work we showed how different types of partial information can be incorporated into the Rapidly-exploring Random Tree (RRT) algorithm. We present PathRover as a comprehensive framework, implemented within the Rosetta modeling infrastructure. In structural biology, partial information constraints are widely used in predictions of static minimal-energy conformations [47],[48],[76] and in MD simulations. The novelty in this work is the systematic introduction and the integration of partial information to sampling-based motion planning of molecules. In this sense, sampling based methods like RRT pose a natural framework for integrating prior biological information. From the perspective of algorithmic robotics, partial information is employed through a branch-termination scheme which is somewhat different from explicitly biasing the sampling of new conformations, used in previous works [25],[42],[43],[46]. This allows for the use of very general features, whereas biased sampling may require ad-hoc computations of a biased distribution functions that differ between various types of information. We incorporated partial information into simulations of three different systems: CesT type III secretion chaperone, Ribose Binding Protein (RBP), and Cyanovirin-N anti-viral protein. Our analysis demonstrates how partial information constraints limit the search in the vast space of possible motion pathways. These constraints are motivated by existing and novel experimental methods for measuring constraints over transient conformations, or by expert intuition. In turn, computational observations allow for further subsequent validation by introducing detailed predictions of the motion that can be validated by experimental methods. We showed that the energy function prevents an over-bias by the partial information constraints, in case our prior information is inexact. PathRover simulations allowed us to assess the contribution of different residues to motion. Apparently, modest motions in specific regions may facilitate large-scale motions. The results from different simulations produced consistent patterns, and may therefore justify partial restriction of motion to improve running times. In particular, restricted and free simulations resulted in similar patterns of motion. An important aspect of PathRover is its full embedding into the Rosetta modeling framework. Rosetta has repeatedly demonstrated an exceptional ability to produce high-quality results for a variety of different modeling tasks in the field of protein modeling, docking, protein design and other modeling challenges at atomic-level detail (e.g., [50]–[53]). The incorporation into Rosetta provides well-calibrated energy functions (both for centroid and full-atom simulations), efficient energy calculations, and a battery of established conformational sampling protocols. It also allows extension to additional predicates of partial information that were previously implemented in Rosetta, such as NMR coupling measurements and docking interface constraints. These have been used to guide and filter Rosetta Monte-Carlo searches, and will here be incorporated into RRT-based motion prediction.

A Knowledge-Based Energy Function

Previous applications of the RRT algorithm have mainly been based on geometric considerations of clash avoidance or Van der Waals terms of established force fields. In some cases, more sophisticated terms were employed [32],[34]. Here we introduce the established Rosetta full-atom energy function into sampling based methods. Hence, we are able to generate motion pathways for complex movements that are at the same time energetically favorable and that abide by possibly known constraints about the motion. The full-atom energy function of Rosetta (we used here score12 [50]) includes physical terms such as van der Waals potential and solvation terms, as well as statistical knowledge based terms like the Ramachandran score, rotamer likelihood, statistical hydrogen bonding term and a simplified electrostatic score [49]. In some cases we observed that the repulsive energy term dominates the motion pathway: in a cluttered environment, clash avoidance is indeed probably the main contribution. Naturally, however, additional energy terms will affect the details of the motion pathways, such as solvation effects and electrostatic interactions [77]. We note that the statistical terms in Rosetta have straightforward interpretation in terms of physical properties. For instance, the Ramachandran score and rotamer likelihoods reflect steric hindrance in disallowed regions. The hydrogen bonding term was also shown to correlate remarkably with quantum-mechanical calculation [64]. While the original Rosetta energy function was optimized for native conformations, we postulate that it can also be used for the generation of clash-free, reasonable motion paths, which also account for other physical principles. Comparison to other common force-fields like CHARMM [78] and Amber [79] will provide additional credibility to PathRover simulations. In principle, PathRover is not restricted to any energy scoring function, as the energy scoring is a “black-box” in the implementation of the algorithm. As molecular mechanics energy functions are currently being added to the Rosetta modeling framework, we intend to compare different energy scoring functions in future work.

Future Applications

Experimental validation and analysis of simulations

One of the big challenges of computational biology is the interface between computational and experimental observations. While full-atom experimental motion pathways of high resolution are still not in sight, significant progress has been recently made in experimental research of transient conformations. Distance constraints from FRET experiments, Paramagnetic-Resonance Enhancement, Residual Dipolar Coupling and other spectroscopic methods for assessing molecular dynamics can be used for (1) constraining simulations of molecular motion using measurable constraints, and for (2) validating motion pathways of suggested simulations, by comparing the measured distance constraints to simulated predictions. Our vision is that innovative experimental measurements of limited scope can focus and enhance computational techniques, effectively allowing researchers to generate realistic motion pathways that incorporate as much external information as possible within the currently suggested framework of PathRover. Particularly, this can allow for the design of experiments that target specific states within a motion pathway based on in silico predictions of large-scale motions. The predicted motions can be also used to suggest mutations, such as our suggested mutations in L1 and W49 for Cyanovirin-N (see residues marked in red in Videos S6 and S7). Computational observations are most meaningful when they are well-defined in a way that poses them as clear, lab-testable hypotheses. To that end, it is not sufficient to rely on raw simulations results. In our work, we have therefore devoted significant effort for developing analysis and visualization tools for extracting physical features from simulated motion, including the protocol for analyzing hinge regions in simulated motion, as well as the visualization and space-time alignment of multiple motion paths. We believe that developing novel analysis and visualization tools is an important direction of future research, which is just as important as the simulations themselves, as it can provide the missing link between experimental and computational observations.

Applications to other types of molecular motion

In this work PathRover was applied to motions of domain swapping and substrate binding. However, different types of molecular motions might have different characteristics with respect to the number of torsion angles that are involved in the motion, the scale of the motion, the role of side-chains, etc. One challenging class of molecular motions involves allosteric protein motions [80]. In this case, a large number of torsion angles are often involved in the motion, but each of them changes by rather small increments, and partial information might constrain the overall nature of the motion. Another interesting type of motion involves more than one molecule, such as docking of a protein or a flexible peptide onto another protein. Motion-planning techniques have been used for small-molecule docking [40], but to the best of our knowledge not for docking of two globular proteins or for protein-peptide docking. Of particular interest within this framework are cases where partial information can provide details about the approximate location of the interface, and conformational backbone flexibility of the monomer needs to be modeled efficiently [71],[81].

Analyzing multiple motion pathways

One of the advantages of RRT-based techniques is their relative speed. A large body of motion pathways can be created at atomic level that includes side-chain atom positions. A large number of pathways provide further insights about the connectivity of the conformational space under a wide range of settings. In contrast, it is difficult to generate a large number of pathways using, e.g., MD simulations, due to slower running times. In Video S8 we showed an alignment between two motion pathways. We recently proposed a method to compare, cluster and merge multiple motion pathways from independent runs of the RRT algorithm. The merged pathways have lower energy or shorter length than all input pathways [39]. It would be interesting to examine the clusters of pathways that are generated with different types of partial information.

Conclusions

This study proposes PathRover as a general and flexible setup where molecular systems can be explored, and constraints can be incorporated in a general and straightforward manner. Partial information can improve the performance of sampling based algorithms, by narrowing down the search in the vast conformational space of proteins. This is demonstrated in the present study on a number of molecular motions of specific interest. Future work will concentrate on refining protocols for additional systems and types of motions. Beneficial crosstalk between experimental procedures and in silico simulations will ultimately optimize the wide integration of partial information into fast sampling-based algorithms–and forward our general understanding of protein motion and function. Structural alignment between the pseudo-monomer of CesT (cyan) and its distant homologue SigE (red). (0.93 MB TIF) Click here for additional data file. Protocol for hinge analysis of a motion path by structural comparison between conformations. (0.75 MB TIF) Click here for additional data file. RMSD for alignment between restricted and free simulations throughout the simulation. The first half of the restricted simulation is aligned against the entire free simulation. (0.14 MB TIF) Click here for additional data file. Results of biasing the motion of CesT towards its distant homologue SigE with five different types of partial information (0.05 MB DOC) Click here for additional data file. PathRover Parameters List and Pseudo-Code for the RRT_PARTIAL_INFO Algorithm with Branch Termination (0.50 MB DOC) Click here for additional data file. Full-atom simulation of CesT starting from a domain swapped conformation. The pseudo-monomer is used as the partial information predicate to guide the motion. (1.74 MB MPG) Click here for additional data file. Simulation of CesT starting from a domain swapped conformation in centroid mode. The pseudo-monomer is used as the partial information predicate to guide the motion. (1.21 MB MPG) Click here for additional data file. Simulation of Ribose-Binding Protein motion. (0.35 MB MPG) Click here for additional data file. Simulation of Cyanovirin-N motion where only the central hinge is allowed to rotate and local minimization is not used. (0.40 MB MPG) Click here for additional data file. Simulation of Cyanovirin-N motion where both the central hinge and secondary hinges are allowed to rotate (0.87 MB MPG) Click here for additional data file. Partially restricted simulation of Cyanovirin-N motion where the central hinge is free to move, and all other residues can fluctuate by ±30° (2.58 MB MPG) Click here for additional data file. Free simulation of Cyanovirin-N motion where all residues are free to move. (1.87 MB MPG) Click here for additional data file. An alignment of the free and the partially-restricted simulations of Cyanovirin-N (1.48 MB MPG) Click here for additional data file.

62 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The domain-swapped dimer of cyanovirin-N is in a metastable folded state: reconciliation of X-ray and NMR structures.

Authors: Laura G Barrientos; John M Louis; Istvan Botos; Toshiyuki Mori; Zhaozhong Han; Barry R O'Keefe; Michael R Boyd; Alexander Wlodawer; Angela M Gronenborn
Journal: Structure Date: 2002-05 Impact factor: 5.006

3. Cyclic coordinate descent: A robotics algorithm for protein loop closure.

Authors: Adrian A Canutescu; Roland L Dunbrack
Journal: Protein Sci Date: 2003-05 Impact factor: 6.725

4. MolMovDB: analysis and visualization of conformational change and structural flexibility.

Authors: Nathaniel Echols; Duncan Milburn; Mark Gerstein
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Protein folding by motion planning.

Authors: Shawna Thomas; Guang Song; Nancy M Amato
Journal: Phys Biol Date: 2005-11-09 Impact factor: 2.583

6. Intermediates and transition states in protein folding.

Authors: D Thirumalai; Dmitri K Klimov
Journal: Methods Mol Biol Date: 2007

7. Protein-protein docking with backbone flexibility.

Authors: Chu Wang; Philip Bradley; David Baker
Journal: J Mol Biol Date: 2007-08-02 Impact factor: 5.469

Review 8. NMR studies of RNA dynamics and structural plasticity using NMR residual dipolar couplings.

Authors: Melissa Getz; Xiaoyan Sun; Anette Casiano-Negroni; Qi Zhang; Hashim M Al-Hashimi
Journal: Biopolymers Date: 2007 Aug 5-15 Impact factor: 2.505

9. Generation, comparison, and merging of pathways between protein conformations: gating in K-channels.

Authors: Angela Enosh; Barak Raveh; Ora Furman-Schueler; Dan Halperin; Nir Ben-Tal
Journal: Biophys J Date: 2008-07-11 Impact factor: 4.033

10. Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations.

Authors: Alexandre V Morozov; Tanja Kortemme; Kiril Tsemekhman; David Baker
Journal: Proc Natl Acad Sci U S A Date: 2004-04-26 Impact factor: 11.205

20 in total

1. ART-RRT: As-Rigid-As-Possible search for protein conformational transition paths.

Authors: Minh Khoa Nguyen; Léonard Jaillet; Stéphane Redon
Journal: J Comput Aided Mol Des Date: 2019-08-21 Impact factor: 3.686

2. As-Rigid-As-Possible molecular interpolation paths.

Authors: Minh Khoa Nguyen; Léonard Jaillet; Stéphane Redon
Journal: J Comput Aided Mol Des Date: 2017-03-20 Impact factor: 3.686

3. Maintaining and Enhancing Diversity of Sampled Protein Conformations in Robotics-Inspired Methods.

Authors: Jayvee R Abella; Mark Moll; Lydia E Kavraki
Journal: J Comput Biol Date: 2017-10-16 Impact factor: 1.479

4. Prion Protein-Antibody Complexes Characterized by Chromatography-Coupled Small-Angle X-Ray Scattering.

Authors: Lester Carter; Seung Joong Kim; Dina Schneidman-Duhovny; Jan Stöhr; Guillaume Poncet-Montange; Thomas M Weiss; Hiro Tsuruta; Stanley B Prusiner; Andrej Sali
Journal: Biophys J Date: 2015-08-18 Impact factor: 4.033

5. Tracing conformational changes in proteins.

Authors: Nurit Haspel; Mark Moll; Matthew L Baker; Wah Chiu; Lydia E Kavraki
Journal: BMC Struct Biol Date: 2010-05-17

6. Markov dynamic models for long-timescale protein motion.

Authors: Tsung-Han Chiang; David Hsu; Jean-Claude Latombe
Journal: Bioinformatics Date: 2010-06-15 Impact factor: 6.937

7. Structural Insight into the 14-3-3 Protein-dependent Inhibition of Protein Kinase ASK1 (Apoptosis Signal-regulating kinase 1).

Authors: Olivia Petrvalska; Dalibor Kosek; Zdenek Kukacka; Zdenek Tosner; Petr Man; Jaroslav Vecer; Petr Herman; Veronika Obsilova; Tomas Obsil
Journal: J Biol Chem Date: 2016-08-11 Impact factor: 5.157

8. Detection and alignment of 3D domain swapping proteins using angle-distance image-based secondary structural matching techniques.

Authors: Chia-Han Chu; Wei-Cheng Lo; Hsin-Wei Wang; Yen-Chu Hsu; Jenn-Kang Hwang; Ping-Chiang Lyu; Tun-Wen Pai; Chuan Yi Tang
Journal: PLoS One Date: 2010-10-14 Impact factor: 3.240

9. Structural intermediates in a model of the substrate translocation path of the bacterial glutamate transporter homologue GltPh.

Authors: Sebastian Stolzenberg; George Khelashvili; Harel Weinstein
Journal: J Phys Chem B Date: 2012-05-02 Impact factor: 2.991

10. MORPH-PRO: a novel algorithm and web server for protein morphing.

Authors: Natalie E Castellana; Andrey Lushnikov; Piotr Rotkiewicz; Natasha Sefcovic; Pavel A Pevzner; Adam Godzik; Kira Vyatkina
Journal: Algorithms Mol Biol Date: 2013-07-11 Impact factor: 1.405