Literature DB >> 31088963

Unsupervised determination of protein crystal structures.

Abstract

We present a method for automatic solution of protein crystal structures. The method proceeds with a single initial model obtained, for instance, by molecular replacement (MR). If a good-quality search model is not available, as often is the case with MR of distant homologs, our method first can automatically screen a large pool of poorly placed models and single out promising candidates for further processing if there are any. We demonstrate its utility by solving a set of synthetic cases in the 2.9- to 3.45-Å resolution.

Entities: Chemical Disease Gene Species

Keywords: Bcl-xL; massively parallel; protein crystal structure; unsupervised method

Year: 2019 PMID： 31088963 PMCID： PMC6561213 DOI： 10.1073/pnas.1821512116

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 11.205

At present, most protein crystal structures are determined in the absence of experimental phase information [the phase problem (1–3)], principally by molecular replacement (MR) methods (4–8). This is especially relevant for the recently introduced serial femtosecond crystallography (SFX) method made possible by the availability of free electron laser (XFEL) light sources (9–11), wherein de novo phasing is difficult due to inherent inaccuracies of the scattering data (12–14). Success of MR methods strongly depends on the quality of the search model, which often is expressed as Cα-rmsd distance from the solution. This distance correlates with the sequence identity between the protein and the homolog used to build the model (15). Typically 25–30% or better sequence identity is needed to build a promising search model (7, 16). At lower sequence identity the quality of the model deteriorates quickly, and MR typically is unable to reliably place the model, instead producing many (tens, hundreds) similarly poor solutions as evaluated by log-likelihood and translation z scores (7, 17). Choosing the initial model for building and refinement in this case is a nontrivial problem and subject to a degree of chance. If one model fails, one has to start with a different model, and the process can take many months, and sometimes the structure is not solved. Modern packages for automatic model building and refinement (18–25) use smart algorithms for density map interpretation, yet still rely to a large extent on human input, especially at the beginning of the solution process when phases of sufficient quality are not available. Here we build on these previous approaches and present a method for solving protein crystal structures from low-quality initial models that generally converges to the solution with little or no human supervision. Contrary to the modern paradigm in crystallography software development, we do not try to develop an algorithm that can interpret electron densities on par with human crystallographers. Instead, we use a statistical approach wherein thousands of automatically built models of reasonable (but far from the best) quality are combined together and lead to the solution. This is possible since fast and reasonably accurate algorithms for automatic model building are readily available (19). As demonstrated below, our method has a large “radius of convergence” expressed as the Cα-rmsd of the initial model from the converged solution. Finally, our method can readily take advantage of computational clusters to quickly screen pools of initial models and find good candidates for further processing. Here we test the method on several small proteins serving as synthetic examples (Fig. 1) showing that we can generally solve structures when the initial model is within 3 Å Cα-rmsd. In the best-case scenario the method will produce a high-quality solution (better than what a human could attain) or an improved model that can be corrected manually and then run through another cycle of unsupervised solution process. In Ufimtsev et al. (26), the method is used to solve the crystal structure of the human lethal giant larvae (LGL2) protein that resisted years of human efforts due to very low 10% sequence identity to the closest solved homolog (27).

Fig. 1.

Protein models in the 2.9- to 3.45-Å resolution range used in the synthetic tests. (Top row) Multiple initial decoys were generated by random deformation of the deposited structures along lowest-frequency normal modes (33). All of the side chains were removed and renamed UNK. (Middle row) Solved structures. (Bottom row) Same structures as in the Middle row displayed with side chains. Only solved models and the corresponding initial decoys are displayed for clarity.

Results

Description of the Method.

We begin with an initial model obtained by MR or some other method, the sequence of the molecule we are building, and the experimental structure factor amplitudes with standard deviations (SDs) (F σ). The model can be a polyalanine chain (recommended at early macrocycles) or can have partial or full sequence and consist of one or many chains. This model is referred to as parent model M0. At this point we enter the macrocycle loop (Fig. 2).

Fig. 2.

Detailed flowchart of our unsupervised determination of protein crystal structures. The parent model M is updated at each macrocycle based on information derived from the 20 best auto-built trial models (of 50). The trial models are built by Buccaneer (19) and refined by Refmac (29), and their quality is assessed by their free R factors (31). The models are built from density maps generated by our in-house–developed density modification code. The density modification solver is seeded randomly, thereby generating every time a different map. Here, M is the structure of the parent model at macrocycle m, F, σ are experimental structure factor amplitudes and SDs, ρ is the density map of model M, {F, φ} are computed structure factors of model M, ρ is the density map used to build the trial models, {F, φ} are computed structure factors of ρ, and φ are the averaged phases. FFT is a fast Fourier transform of the density map and FFT−1 is the inverse transform. The resolution of the data is measured by 1/s with s determining the degree of phase projection in each microcycle. At the end of each macrocycle we anticipate the model to be deformed toward the solution (if one chooses to refine M with F and φ restraints to produce M, i.e., the “refinement” mode) or fully rebuilt (if M is built in ρ = {F, φ} density, i.e., the “full” mode). The refinement mode is good in the beginning of the procedure when maps are poorly interpretable and the chances of breaking the parent model in the auto-building step are high. The full mode is good at late stages, when the parent model has phases of good quality able to produce interpretable maps to guarantee rebuilding does not break the model. One also can combine refinement and full modes by refining and rebuilding the parent model at every macrocycle and passing the rebuilt model to the next macrocycle if the model is of higher quality than the refined model. When run in refinement mode, one macrocycle is akin to one iteration of a standard refinement program like Refmac or phenix.refine, yet it is more robust with respect to the choice of the optimization direction and the ability to escape local minima. It is also ∼100 times more computationally expensive and typically is executed in parallel due to the embarrassing parallelism of the macrocycle loop (Fig. 2). At every macroiteration m, the algorithm proceeds through a pipeline composed of standard (density modification → auto build → refinement) steps, which is repeated 50 times (the inner microcycle in Fig. 2). The density modification tool is our in-house developed code. The density modification code generates an electron density map based on the parent model M, sequence (to estimate the solvent content), and [F, σ] subject to a set of standard restraints: (i) density histogram and bulk solvent restraints (solvent flattening), (ii) [F, σ], and (iii) M’s low-resolution phase restraints up to some resolution threshold s that is adjusted dynamically at the end of each macrocycle. All these restraints are enforced through a set of real ↔ real and real ↔ reciprocal space projections: (i) density histogram projection, (ii) 2mF-DF projection computed by the program Sigmaa (28), and (iii) phase projection computed by Fourier transform of the density map followed by the inverse Fourier transform of the computed amplitudes and target phases. This runs for a fixed number of 30 iterations. The procedure starts with a map combining random amplitudes r exp(−B0 s2) with M’s phases, where r is a random number uniformly distributed in the [0,1] range, B0 is the overall B factor, and s is the length of the reciprocal space vector. Repeating the procedure 50 times produces 50 different density maps, which are quite diverse at early macroiterations and overlap strongly at late iterations. For each of the maps ρ we compute its correlation with the M density map ρ and average all of the 50 correlation coefficients. If the average is greater than 0.6, s is decreased by 10% (fewer phases constrained to φ in the next macrocycle, i.e., more relaxed phase constraints) and is increased by 10% otherwise. Parameter s is initialized at the beginning of the solution process to the resolution of the N/2th structure factor after sorting all of the N structure factors by resolution. Next, for each map ρ we build a trial model by Buccaneer 1.6.1 (19) and refine it by Refmac 5.8.0135 (29) with default settings against F subject to secondary structure restraints generated by Prosmart (30). The restraints do not include so-called h-bond terms to avoid any unnecessary bias. Likewise, disulfide bond restraints are not applied. The 50 newly built trial structures are ranked by their R-free values (31), and the 20 best structures are used to derive phase restraints used to refine (in refine mode) or rebuild (in full mode) M to produce the next parent model M. We combine the 20 models together by averaging their figure-of-merit–weighted (FOM-weighted) density maps aswhere Fexp(iφ) is the computed kth structure factor of trial structure n and m is its figure of merit as computed by Sigmaa. We also experimented with an alternative averaging scheme with all F set equal to 1.0 and found that it performed similarly well. A more accurate approach to combine the phases and filter out outliers would be based on cluster analysis (32); however, it is not employed in the current version of the code. In refine mode, M is generated by refining M against the complex value structure factors {F,φ}. In full mode, M is built in the {F,φ} density map starting with M’s alpha carbons. Finally, this step completes one macrocycle, and M is passed to the next macrocycle if needed.

Validation with Synthetic Data.

To test our method we selected four small proteins solved at different fairly low resolution: 3ADJ at 3.0 Å, 4EIX at 2.9 Å, 2NSB at 3.2 Å, and 2B48 at 3.45 Å. For each protein we generated several hundred near-native decoys by computing the protein’s normal modes and randomly exciting the 10 lowest-frequency normal modes (33). After removing structures with clashes, for each protein we obtained a set of several hundred decoys with Cα-rmsd from the corresponding deposited structure in the 0.1- to 5.0-Å range. Next, we removed all side chains, ligands, and waters and renamed all residues as UNK. All of the decoys are represented in Fig. 3 as gray shaded circles. The x and y coordinates were obtained from multidimensional scaling (MDS) analysis (34) of the decoy all-to-all pairwise rmsd matrix. Here, MDS seeks for n points on a 2D plain, with pairwise Euclidian distances approximating the n-by-n rmsd matrix in the least-squares way, and provides a 2D representation of the rmsd data. Coordinates {x,y} of all decoys were shifted by the same amount to place the deposited structure {x0,y0} at the origin of the plot, and then each {x,y} pair was scaled so that its Euclidean distance from the origin, (x2 + y2)1/2, would be exactly equal to the decoy’s rmsd from the deposited structure. The deposited structure is represented by the magenta circle in Fig. 3, and the concentric circles, therefore, define regions of constant rmsd.

Fig. 3.

Decoy statistics for the four proteins. For every structure we generated several hundred decoys at different levels of deformation (gray shaded circles). The green decoys were selected for refinement and produced converged (solved to within 2-Å rmsd from the deposited structure, large red circles inside the 2-Å circle) and unconverged (small red circles outside the circle) structures, correspondingly. The plots were built by multidimensional scaling representation of the decoys’ pairwise rmsd matrix. The cyan circle for 2B48 shows the rerefined structure. Next, for each protein we handpicked 20–30 decoys uniformly distributed in the 2.3- to 3.5-Å rmsd zone (green circles in Fig. 3). This level of deviation is much larger than the theoretical convergence radius (35). We then processed each decoy for 120 macrocycles in refine mode (i.e., the parent model was only refined and not rebuilt at each macrocycle). If a refined model was inside the 2-Å rmsd zone outlined by the thick circle in Fig. 3, we define it as a converged model and the corresponding initial decoy is shown as a large green circle; otherwise it is shown as a small green circle. The refined models are shown as large red circles for those that converged and small red circles for those that did not converge. As anticipated, the converged models (large green circles in Fig. 3 for 43 decoys in total) tend to localize closer to the origin than unconverged models (small green circles for 47 decoys), indicating a strong dependency of the quality of our solution on the degree of deformation of the initial decoy. The statistics are summarized in Fig. 4, where the relative fractions of converged and unconverged decoys in various rmsd zones are shown by green and red bars, correspondingly. The relative fraction of solved decoys (green bars in Fig. 4) decreases almost linearly with the magnitude of the initial deformation: At 2.3-Å rmsd all structures are solved, while at 3.5-Å rmsd only 10% of structures are solved. Approximately one-half of all decoys that start at 2.9-Å rmsd deformation are solved, allowing this value to be considered as the radius of convergence of our method.

Fig. 4.

The fraction of solved (green) and unsolved (red) decoys depends monotonically on the magnitude of their initial deformation with ∼50% of cases solved at 2.9-Å Cα-rmsd deformation.

The fraction of solved (green) and unsolved (red) decoys depends monotonically on the magnitude of their initial deformation with ∼50% of cases solved at 2.9-Å Cα-rmsd deformation. One can see in panel 2B48 in Fig. 3 that the converged solutions (red circles inside the 2-Å zone) tend to cluster in the 1.5-Å rmsd zone. To find the origin of such clustering, we rerefined the structure, starting with the best trial structure in terms of R-free and chain integrity that was generated in one of the later macrocycles. Compared with the original deposited structure, our solution has a larger number of protein atoms (1,230 vs. 1,145) and lower R factors (work/free) (0.243/0.250 vs. 0.261/0.305). The major structural difference is between residues R100 and T118 as shown in Fig. 5.

Fig. 5.

Deposited (A) and rerefined (B) structure of Bcl-XL at 3.45-Å resolution. The 2mF-DF electron density map is contoured at the 2σ level. The major structural difference is between R100 and T118 residues. Note the density peaks on R100 and R103 side chains in B, which are missing in A. The rerefined structure is depicted by the cyan circle in Fig. 3. Because a similar systematic shift is observed in panel 4EIX in Fig. 3, we rerefined the structure starting from one of the best trial structures. However, unlike the 2B48 case we did not find any relevant structural differences from the deposited structure. Our method produced the systematic bias because 4EIX has an unstructured C-terminus loop that is supported by two disulfide bonds at C115 and C121 residues. Ignoring these bond restraints gave rise to unbalanced model bias and produced structures with displaced loops and poorer density maps. To visualize a possible solution trajectory, we chose one 3ADJ decoy that was refined from an initial 2.7-Å rmsd down to 0.65-Å rmsd in a single unsupervised run. In Fig. 6 we plot the highest (red) and lowest (blue) R-free factor of the 50 trial structures generated at every macrocycle. In addition, the black line in Fig. 6 represents rmsd from the deposited structure of the parent model plotted on another scale. The structure was solved in 80 macrocycles, with the 0.65-Å residual rmsd being likely due to the 3.0-Å resolution of the data. One important thing to note is that at final macrocycles Buccaneer and Refmac consistently produce structures that are better than the deposited structure (dashed line in Fig. 6), which was also refined by Refmac (36). In addition, the high-quality trial structures generated at late-stage macrocycles form an ensemble of possible solutions of the phase problem and thus provide insights into the structural heterogeneity of different parts of the protein and the lower bound of the atomic coordinate errors (37). This is strikingly different from the amount of information contained in the single-structure solutions typically built by human crystallographers, where the structural heterogeneity is modeled by temperature factors, which sometimes does not provide accurate interpretation of the data (38).

Fig. 6.

Solution trajectory of one of the 3ADJ decoys. The red and blue lines, correspondingly, denote the highest and lowest R-free factor of the 50 trial structures generated at every macrocycle. The rmsd from the deposited structure of the parent model is shown as a solid black line. The dashed black line is the R-free value of the deposited structure.

Discussion

Unlike standard refinement protocols using maximum-likelihood estimator target functions to fit the experimental data (39, 40), our method deforms a model in the direction that improves interpretability of density maps produced by combining experimental amplitudes with phases derived from the ensembles of trial models. We define interpretability as the ability of a computer program, in our case Buccaneer, to automatically build a good model whose quality is quantified by the model’s R-free value. The particular method or program used to build and refine the models should not matter as long as it is applied consistently everywhere during the solution process. In fact, we observed that it was better to trade accuracy for speed and build as many as possible trial models rather than rely on a smaller number of higher-quality models. This is partially due to the fact that electron density maps in principle cannot be fully interpreted at early stages of structure solution. We observed that unlike the R-factor signal which degrades quickly with rmsd, the interpretability signal is more robust with respect to model deformations and allows us to explore and navigate through “flat” R-free surfaces. For instance, in the 30- to 60-macrocycle range in Fig. 6, the correlation coefficient between the best R-free value in a cycle (blue line in Fig. 6) and rmsd of the parent model (black line) is zero (−0.01), although rmsd still exhibits steady progress toward the solution. In addition, Fig. 7 shows that essentially all decoys are improved in terms of rmsd and R-free metrics. Even the models that did not converge to the deposited structures to within 2-Å rmsd demonstrated systematic improvements. To our surprise, we discovered a few polyalanine chains (for instance, 2B48 decoy 15) that fitted the experimental data quite accurately yet deviated substantially from the corresponding deposited structures. Building full sequence models from such backbones never succeeded.

Fig. 7.

Initial and final R-free values and Cα-rmsd of all of the parent structures (polyalanine backbones) used in the benchmark. Most models are consistently improved. There are a few false-positive structures with low R-free and large rmsd. None of these led to acceptable all-atom models. Fig. 8 shows the 20 overlaid best trial structures generated at the first and the last macrocycle iteration in the 3ADJ run shown in Fig. 6. More interpretable parts of the electron density map can be traced well enough to be visible in this ensemble representation and indicate the high-resolution part of the structure. Less interpretable parts of the density are represented by more random atom distribution that forms low-resolution structures, such as cylinders in place of alpha helices. Furthermore, regions of structural heterogeneity in the solved structure are visible in Fig. 8, Right.

Fig. 8.

(Left) Deposited 3ADJ structure (orange) and the initial decoy (cyan). (Center) Trial structures built at the first macrocycle by Buccaneer based on the density maps computed by our density modification code. One beta strand was interpreted quite accurately but the other strand was less clear. Attempts to build the beta turn at the right place are visible. Alpha helices are interpreted as cylinders due to the large displacement of the helices in the initial decoy. (Right) Trial structures built in the last macrocycle. Regions of structural heterogeneity are circled. All trial structures were superimposed to minimize their mutual rmsd and all side chains were removed for clarity.

Methods

In the density modification protocol, the spacing of the real space grid was set to one-quarter of the dataset resolution, and all Fourier transforms were performed by the Nvidia CUDA FFT library. The calculations were carried out inside the unit cell with all space group symmetry operations handled in the real space explicitly. The Protein Data Bank (PDB) ID 5DTE structure was used to compute the reference density histogram for the density histogram projection. The binary protein mask includes all grid points located within 1.3-Å distance from any atom of the parent model. If the size of this distance-based protein mask is smaller than that estimated from the protein content of the unit cell cV, we compute the Gaussian-weighted density map fluctuations where c is the estimated protein content, V is the unit cell volume, d is the Euclidian distance between points i and n, and the summation is performed over the entire unit cell. Then we add as many points with the largest σ to the distance-based protein mask as needed to make its size will be equal to cV.

32 in total

1. Pushing the boundaries of molecular replacement with maximum likelihood.

Authors: R J Read
Journal: Acta Crystallogr D Biol Crystallogr Date: 2001-09-21

Review 2. The phase problem.

Authors: Garry Taylor
Journal: Acta Crystallogr D Biol Crystallogr Date: 2003-10-23

3. Structure of the yeast polarity protein Sro7 reveals a SNARE regulatory mechanism.

Authors: Douglas A Hattendorf; Anna Andreeva; Akanksha Gangar; Patrick J Brennwald; William I Weis
Journal: Nature Date: 2007-03-29 Impact factor: 49.962

4. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures.

Authors: A T Brünger
Journal: Nature Date: 1992-01-30 Impact factor: 49.962

5. The Buccaneer software for automated model building. 1. Tracing protein chains.

Authors: Kevin Cowtan
Journal: Acta Crystallogr D Biol Crystallogr Date: 2006-08-19

6. Automated main-chain model building by template matching and iterative fragment extension.

Authors: Thomas C Terwilliger
Journal: Acta Crystallogr D Biol Crystallogr Date: 2002-12-19

7. Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard.

Authors: Thomas C Terwilliger; Ralf W Grosse-Kunstleve; Pavel V Afonine; Nigel W Moriarty; Peter H Zwart; Li Wei Hung; Randy J Read; Paul D Adams
Journal: Acta Crystallogr D Biol Crystallogr Date: 2007-12-05

8. Interpretation of ensembles created by multiple iterative rebuilding of macromolecular models.

Authors: Thomas C Terwilliger; Ralf W Grosse-Kunstleve; Pavel V Afonine; Paul D Adams; Nigel W Moriarty; Peter Zwart; Randy J Read; Dusan Turk; Li Wei Hung
Journal: Acta Crystallogr D Biol Crystallogr Date: 2007-04-21

9. Fitting molecular fragments into electron density.

Authors: Kevin Cowtan
Journal: Acta Crystallogr D Biol Crystallogr Date: 2007-12-05

10. An introduction to molecular replacement.

Authors: Philip Evans; Airlie McCoy
Journal: Acta Crystallogr D Biol Crystallogr Date: 2007-12-05

2 in total

1. Instrumentation and experimental procedures for robust collection of X-ray diffraction data from protein crystals across physiological temperatures.

Authors: Tzanko Doukov; Daniel Herschlag; Filip Yabukarski
Journal: J Appl Crystallogr Date: 2020-11-05 Impact factor: 3.304

2. Solving the structure of Lgl2, a difficult blind test of unsupervised structure determination.

Authors: Ivan S Ufimtsev; Lior Almagor; William I Weis; Michael Levitt
Journal: Proc Natl Acad Sci U S A Date: 2019-05-14 Impact factor: 11.205

2 in total