
How far are we from automatic crystal structure solution via molecular-replacement techniques?

Maria Cristina Burla, Benedetta Carrozzini, Giovanni Luca Cascarano, Carmelo Giacovazzo, Giampiero Polidori

Abstract

Although the success of molecular-replacement techniques requires the solution of a six-dimensional problem, this is often subdivided into two three-dimensional problems. REMO09 is one of the programs which have adopted this approach. It has been revisited in the light of a new probabilistic approach which is able to directly derive conditional distribution functions without passing through a previous calculation of the joint probability distributions. The conditional distributions take into account various types of prior information: in the rotation step the prior information may concern a non-oriented model molecule alone or together with one or more located model molecules. The formulae thus obtained are used to derive figures of merit for recognizing the correct orientation in the rotation step and the correct location in the translation step. The phases obtained by this new version of REMO09 are used as a starting point for a pipeline which in its first step extends and refines the molecular-replacement phases, and in its second step creates the final electron-density map which is automatically interpreted by CAB, an automatic model-building program for proteins and DNA/RNA structures.

Keywords:  automated model building; molecular replacement; nucleic acids; proteins; structure refinement

Year:  2020        PMID: 31909739      PMCID: PMC6939436          DOI: 10.1107/S2059798319015468

Source DB:  PubMed          Journal:  Acta Crystallogr D Struct Biol        ISSN: 2059-7983            Impact factor:   7.652


Symbols and abbreviations

EDM: electron-density modification.
AMB: automated model building.
SI: the sequence identity between model and target molecules.
$C_s \equiv (R_s, T_s)$, with $s = 1, \ldots, m$: the symmetry operators of the target structure. $R_s$ is the rotational part, $T_s$ is the translational part and $m$ is the number of symmetry operators.
$t$, $t_p$: the numbers of atoms in the asymmetric units of the target and model structure, respectively.
$N = mt$, $N_p = mt_p$: the numbers of atoms in the unit cells of the target structure and model structure, respectively. It is supposed, for the sake of simplicity, that all of the atoms are in general positions. Usually $N_p \le N$, but it may also be the case that $N_p > N$.
$f_j$: the atomic scattering factor of the $j$th atom (thermal factor included).
$F_p = \sum_{j=1}^{N_p} f_j \exp(2\pi i\,\mathbf{h}\cdot\mathbf{r}'_j) = |F_p|\exp(i\varphi_p)$: structure factor of the model structure. $\mathbf{r}'_j$ are the atomic positions of the model structure when it has been well oriented and located.
$F = \sum_{j=1}^{N} f_j \exp(2\pi i\,\mathbf{h}\cdot\mathbf{r}_j) = |F|\exp(i\varphi)$: structure factor of the target structure. $\mathbf{r}_j$ are the true atomic positions. It is supposed that the target and model molecules are isomorphous, so that $\mathbf{r}_j = \mathbf{r}'_j + \Delta\mathbf{r}_j$. $\Delta\mathbf{r}_j$ is the misfit between the atomic position $\mathbf{r}_j$ in the target and the corresponding $\mathbf{r}'_j$ in the model structure.
$E = A + iB = R\exp(i\varphi)$, $E_p = A_p + iB_p = R_p\exp(i\varphi_p)$: normalized structure factors of $F$ and $F_p$, respectively.
$\Sigma_N = \sum_{j=1}^{N} f_j^2$, $\Sigma_p = \sum_{j=1}^{N_p} f_j^2$: the scattering power at a given $\sin\theta/\lambda$ for the target and model structure, respectively.
$D = \langle\cos(2\pi\,\mathbf{h}\cdot\Delta\mathbf{r}_j)\rangle$. The average is calculated per resolution shell.
$\sigma_A = D(\Sigma_p/\Sigma_N)^{1/2}$. $\sigma_A$ is a statistical estimate of the correlation between the model and target structures (Srinivasan, 1966). Ideally $\sigma_A = 0$ for uncorrelated models and $\sigma_A = 1$ for identical model and target structures.
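As a small numerical illustration of the last two definitions, the sketch below estimates D for one resolution shell from a set of coordinate misfits and then forms σA in its usual form, D(Σp/ΣN)^{1/2}. The function names, the toy inputs and the choice of averaging over every reflection/atom pair in the shell are assumptions made for the example; they are not taken from REMO09.

```python
import math

def mean_cos_misfit(hkl_in_shell, misfits):
    """D = <cos(2*pi*h.dr)>, averaged over the reflections of one shell and all atoms.

    hkl_in_shell: list of (h, k, l) index triples belonging to the shell.
    misfits: list of fractional coordinate shifts (dx, dy, dz), one per atom.
    """
    total, count = 0.0, 0
    for h in hkl_in_shell:
        for dr in misfits:
            dot = sum(hi * di for hi, di in zip(h, dr))
            total += math.cos(2.0 * math.pi * dot)
            count += 1
    return total / count

def sigma_a(d_value, sigma_p, sigma_n):
    """sigma_A = D * sqrt(Sigma_p / Sigma_N) for one resolution shell."""
    return d_value * math.sqrt(sigma_p / sigma_n)

if __name__ == "__main__":
    # A perfect but incomplete model: no misfit, half of the scattering power.
    d_perfect = mean_cos_misfit([(1, 0, 0), (0, 2, 1)], [(0.0, 0.0, 0.0)])
    print(sigma_a(d_perfect, 50.0, 100.0))
```

With zero misfit D equals 1, so σA is limited only by the fraction of scattering power that the model accounts for; coordinate errors lower D and therefore σA, most strongly at high resolution.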

Introduction

Molecular-replacement (MR) techniques (Rossmann & Blow, 1962; Rossmann, 1972, 1990) aim at phasing an unknown target structure using a known search molecule. The problem is six-dimensional because it requires both the correct orientation and the correct location of the search molecule. Some MR programs tackle it in six-dimensional space [for example EPMR (Kissinger et al., 1999), SOMoRe (Jamrog et al., 2003) and Queen Of Spades (Glykos & Kokkinidis, 2000); see also Fujinaga & Read (1987)], even if an exhaustive six-dimensional search is generally avoided. Such programs are, in general, very time-consuming. More frequent is the practice of splitting the MR process into two three-dimensional steps: a rotation step and a translation step. The most popular related programs are X-PLOR/CNS (Brünger, 1992), AMoRe (Navaza, 1994), BEAST (Read, 1999), MOLREP (Vagin & Teplyakov, 2010) and Phaser (McCoy et al., 2007). In BEAST and Phaser, maximum-likelihood-based conditional distributions are applied (see Read & McCoy, 2016, 2018; McCoy et al., 2018). Comprehensive reviews of the various techniques (updated up to 2007) have been collected in the January 2008 issue of Acta Crystallographica Section D. In recent years, more effort has been dedicated to cases in which the available experimental structures used as search models are only distantly homologous to the target; see, for example, Simpkin et al. (2018), Rigden et al. (2018), Pröpper et al. (2014), Millán et al. (2015) and Caballero et al. (2018).
In 2009, an MR program (REMO09; Caliandro et al., 2009) was proposed in which a probabilistic approach based on the joint probability distribution method was described. Joint distributions were derived in the absence of or under various prior conditions. For example, in the rotation step the correct rotation of a monomer is found via a figure of merit calculated when other monomers were previously oriented or located, or also when such information is not available. Joint distributions were also derived for the translation step: a monomer is located given its own orientation or the orientations and/or locations of other monomers. Burla et al. (2017), starting from REMO09 phases, checked the efficiency of a phase-refinement pipeline which synergistically combines mainstream refinement techniques (specifically DM; Cowtan, 2001) with out-of-mainstream techniques [specifically, free lunch (Caliandro et al., 2005a,b), low-density Fourier transform (Giacovazzo & Siliqi, 1997), vive la difference (Burla, Caliandro et al., 2010; Burla, Giacovazzo et al., 2010), Phantom derivative (Giacovazzo, 2015b; Carrozzini et al., 2016) and phase-driven model refinement (Giacovazzo, 2015a)]. For simplicity, we will refer to this module as SYNERGY. Burla et al. (2017) automatically submitted the protein data obtained by SYNERGY to the AMB procedure CAB (Burla et al., 2017), which applies Buccaneer (Cowtan, 2006) in a cyclic way.
In a recent paper (Giacovazzo, 2019), the standard method of joint probability distribution functions has been revised and updated. In particular, two-phase, three-phase and four-phase invariants are estimated directly via conditional distributions without passing through a previous calculation of the related joint probability distributions. The probabilistic formulae thus obtained do not coincide, in general, with the corresponding formulae established through the standard study of the joint probability distribution functions. Some of them are immediately applicable to MR, and some others, also suitable for MR, are derived here via this new approach. The formulae thus obtained form the basis for the modified version of REMO09 used in this paper.
In this paper, in accordance with the talk given by one of us at the 2019 CCP4 Study Weekend in Nottingham, England, we show the default results obtained on applying the modified REMO09 → SYNERGY → CAB pipeline to a large set of protein and nucleic acid structures. To obtain these results, we extended CAB to nucleic acid structures (unpublished work) by making the use of Nautilus (Cowtan, 2014) cyclical. The purposes are twofold: to check the efficiency of the new probabilistic formulae used in the modified version of REMO09 and to check how far a modern crystallographic pipeline based on MR phases is from the automatic crystal structure solution of macromolecules.

General features of REMO09

Various directives allow REMO09 users to choose proper approaches for solving macromolecular structures. In this section, we will summarize the default approach used in all of our applications.
(i) The observed and calculated data are scaled by Wilson techniques, which are also used to calculate the normalized structure factors (the observed and calculated $\langle R^2\rangle$ are scaled to unity shell by shell). The isotropic thermal factors of the model atoms are automatically modified to make them compatible with the overall temperature factor of the target structure.
(ii) The target and model sequences are read.
(iii) The orientation space is sampled in terms of Lattman angles (Lattman, 1972) with an angular step depending on the resolution of the active reflections (the maximum angular step is 5°). The extent of the orientation space is limited to the asymmetric region of the rotation group (Hirshfeld, 1968). For the first monomer to be located, only the Cheshire cell is explored in the translation step.
(iv) The map grid used in the translation search along each axis is 1/3 of the data resolution for proteins and 1/4 for nucleic acids.
(v) The active reflections for calculating figures of merit used in the rotation and translation searches are automatically selected. Low-resolution reflections (up to 7 Å) are eliminated from the calculations unless the SI is less than 0.5. The highest accepted resolution is 2.5 Å. This limit is extended a little for the translation step owing to the increased prior information gained during the rotation step. The SI is usually less critical for nucleic acids, mostly because nucleic acid helices can adopt similar conformations even when their sequences are drastically different.
(vi) The rotations are ordered according to the rotation figure of merit (RFOM; see Section 4). The good solutions are usually dispersed at the top of the list of ordered solutions: therefore, to speed up calculations only a subset are submitted to the translation step, in which the new figure of merit TFOM is used (see Section 5).
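Default items (i), (iv) and (v) lend themselves to a compact illustration. The sketch below is only one possible reading of those rules: the equal-population shells, the treatment of reflections lying exactly at 7 Å or 2.5 Å, and all function names are assumptions made here, and the small extension of the high-resolution limit in the translation step is not modelled.

```python
def normalize_by_shell(d_spacings, intensities, n_shells=3):
    """Return R^2 values scaled so that their mean is 1 in each resolution shell."""
    # Sort reflection indices from low resolution (large d) to high resolution.
    order = sorted(range(len(d_spacings)), key=lambda i: d_spacings[i], reverse=True)
    if not order:
        return []
    size = -(-len(order) // n_shells)  # ceiling division: reflections per shell
    r2 = [0.0] * len(order)
    for start in range(0, len(order), size):
        shell = order[start:start + size]
        mean_intensity = sum(intensities[i] for i in shell) / len(shell)
        for i in shell:
            r2[i] = intensities[i] / mean_intensity
    return r2

def is_active(d_spacing, sequence_identity, d_low=7.0, d_high=2.5):
    """Rule (v): drop data beyond 2.5 A, and drop d > 7 A unless SI < 0.5."""
    if d_spacing < d_high:
        return False
    if d_spacing > d_low and sequence_identity >= 0.5:
        return False
    return True

def translation_grid_step(data_resolution, nucleic_acid):
    """Rule (iv): grid step is 1/3 of the resolution (proteins) or 1/4 (nucleic acids)."""
    return data_resolution / (4.0 if nucleic_acid else 3.0)

if __name__ == "__main__":
    d = [10.0, 8.0, 5.0, 4.0, 3.0, 2.6]
    intensity = [100.0, 300.0, 40.0, 60.0, 5.0, 15.0]
    print(normalize_by_shell(d, intensity))
    print([is_active(x, 0.8) for x in d])
    print(translation_grid_step(3.0, nucleic_acid=False))
```

Keeping the low-resolution terms only for weak models (SI < 0.5) reflects the rule as stated in item (v); the thresholds are exposed as parameters because they are defaults rather than fixed constants.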

Rotational search when only one monomer lies in the asymmetric unit of the target structure

The rotational search is performed by locating the model molecule in a P1 cubic unit cell. According to Rabinovich et al. (1998), the structure factors of the model are calculated only once: fitting to the observed data is obtained by rotating the observed reciprocal lattice with respect to the model lattice. The figure of merit designed for picking up the correct orientation of the model molecule is RFOM, the correlation factor between the observed $R^2$ and its expected value $\langle R^2\rangle$ as calculated by the probabilistic approach described by Giacovazzo (2019). RFOM is expected to be maximum for the correct model orientation and $\langle R^2\rangle$ is the expected value of $R^2$ given the prior information on the model stereochemistry:
[equation (1) is not preserved in the source text]
where F is the contribution to the calculated model structure factor arising from the asymmetric unit of the model structure, and E is its normalized (with respect to the scattering power of the model structure, symmetry-equivalent molecules included) form. The E are calculated and stored for each reflection via FFT of the electron density of the model structure in the enlarged cubic cell. Equation (1) has appropriate asymptotic behaviours: i.e. when $\sigma_A = 0$ then $\langle R^2\rangle = 1$, as it should be in the absence of prior information, and when $\sigma_A = 1$ then $\langle R^2\rangle$ equals [expression not preserved in the source text]. The identity $\langle R^2\rangle = R^2$ may only occur in P1 when the asymmetric unit contains only one monomer showing a high similarity index to the target molecule.
Despite its good asymptotic properties, the use of (1) did not lead to a very efficient RFOM. The reason may lie in the mathematical definition of $\sigma_A^2$: according to Carrozzini et al. (2013) it coincides with the correlation factor between $|F|^2$ and the calculated squared structure factor. In the rotation step the experimental values of $\sigma_A^2$ are generally small, mostly because [expression not preserved in the source text] is not the dominant component of the calculated squared structure factor. Thus, in some resolution shells $\sigma_A < 0$ (anticorrelation situation), while the $\sigma_A^2$ parameter to be used in (1) remains positive. This suggested that we eliminate the calculation of $\sigma_A$ from (1) and simplify it as
[equation (2) is not preserved in the source text]
The 200 orientations corresponding to the highest values of RFOM are selected for the translation step: this number is increased to 300 if more than one monomer is in the target molecule and to 400 if SI < 0.4.
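The ranking logic of this section (a correlation-based RFOM followed by the selection of the top orientations) can be sketched as follows. The expected values ⟨R²⟩ are taken here as given inputs, since their formula is not reproduced; the function names, the toy numbers and the choice that SI < 0.4 takes precedence over the monomer count are assumptions of this example.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def n_orientations_to_keep(n_monomers, sequence_identity):
    """200 by default, 300 for more than one monomer, 400 if SI < 0.4."""
    if sequence_identity < 0.4:
        return 400
    if n_monomers > 1:
        return 300
    return 200

def rank_orientations(observed_r2, expected_r2_by_orientation, n_keep):
    """Return the n_keep best (RFOM, orientation) pairs, highest RFOM first."""
    scored = [
        (correlation(observed_r2, expected), orientation)
        for orientation, expected in expected_r2_by_orientation.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n_keep]

if __name__ == "__main__":
    observed = [0.2, 1.5, 0.9, 2.1, 0.3]
    candidates = {  # hypothetical expected <R^2> values for three trial orientations
        "A": [0.3, 1.4, 1.0, 1.9, 0.4],
        "B": [1.2, 0.8, 1.1, 0.9, 1.0],
        "C": [2.0, 0.3, 1.0, 0.2, 1.5],
    }
    for rfom, name in rank_orientations(observed, candidates, 3):
        print(name, round(rfom, 3))
    print(n_orientations_to_keep(n_monomers=1, sequence_identity=0.6))
```

Because RFOM is a correlation coefficient it is insensitive to the overall scale of the expected values, which is one reason a simplified expression for ⟨R²⟩ can still rank orientations usefully.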

Translation search when only one monomer lies in the asymmetric unit of the target structure

The orientations selected according to Section 4 are submitted to the translation search one by one. This is performed by using the T2 function of Crowther & Blow (1967) in the form modified by Harada et al. (1981) and by Navaza (1994). T2 is implemented via FFT, as suggested by Vagin & Teplyakov (1997). Only peaks falling inside the Cheshire unit cell are considered. For the same orientation, more peaks can be found: to save computing time, only the five largest translations per orientation are saved. The selection of the best translations is made via the figure of merit TFOM, coinciding with the correlation factor between the observed amplitude $|F|$ and the structure-factor amplitude as calculated for each translation. Some further controls modify the simple approach above.
(i) The translations with the largest TFOM values are submitted to the SIMPLEX method (Rowan, 1990), an unconstrained optimization technique related to the downhill method (Nelder & Mead, 1965), which is here applied to a six-dimensional parameter space (three for rotation and three for translation). The method is applied twice to the selected five (or ten for nucleic acids or if SI < 0.4) roto-translations with the largest values of TFOM: they are then submitted to REFMAC optimization cycles. The purpose is to optimize the model and better recognize the best solution. The final figure of merit is
[equation and the definition of its terms are not preserved in the source text]
(ii) The clash test (among symmetry-equivalent molecules) is applied, which damps the TFOM value calculated above when a nonvanishing clash is found. The damping factor is set to
[equation is not preserved in the source text]
where cl is the percentage of Cα atoms in the clash condition. The damping factor cannot be <0.2.
The roto-translation with the highest figure of merit is automatically submitted to the SYNERGY step and to the CAB procedure.
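The bookkeeping described in this section can be illustrated with a few helper functions. This is a hedged sketch: the expression that turns the clash percentage into a damping factor is not available here, so the raw factor is simply passed in and only the stated lower bound of 0.2 is enforced; applying the factor as a multiplier of TFOM, and all names, are assumptions of the example.

```python
def keep_top_translations(peaks, k=5):
    """Keep the k translation peaks with the largest TFOM for one orientation.

    peaks: list of (tfom, translation) pairs.
    """
    return sorted(peaks, key=lambda peak: peak[0], reverse=True)[:k]

def n_solutions_to_refine(is_nucleic_acid, sequence_identity):
    """Five roto-translations by default; ten for nucleic acids or if SI < 0.4."""
    return 10 if (is_nucleic_acid or sequence_identity < 0.4) else 5

def apply_clash_damping(tfom, raw_damping_factor, floor=0.2):
    """Damp TFOM by a clash-dependent factor that is never allowed below `floor`.

    raw_damping_factor is assumed to come from the clash test; its formula is
    not reproduced in this sketch.
    """
    return tfom * max(floor, raw_damping_factor)

if __name__ == "__main__":
    peaks = [(0.31, (0.10, 0.25, 0.40)), (0.52, (0.60, 0.05, 0.15)), (0.44, (0.35, 0.70, 0.90))]
    print(keep_top_translations(peaks, k=2))
    print(n_solutions_to_refine(is_nucleic_acid=False, sequence_identity=0.35))
    print(apply_clash_damping(0.52, raw_damping_factor=0.05))
```

The floor keeps a heavily clashing but otherwise well-scoring solution in the ranking, at a reduced figure of merit, rather than eliminating it outright.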

Rotational search when more than one monomer lies in the asymmetric unit of the target molecule

In the standard REMO09 program, when several monomers with the same stereochemistry are present in the asymmetric unit, the following three-step approach is used.
(i) A number of orientations are selected when the orientation of the first monomer is searched.
(ii) Once the first monomer has been located, the orientation of the second monomer is searched among the most probable orientations selected in step (i).
(iii) After the location of the second monomer, steps (i) and (ii) are repeated until all monomers are located.
This simple procedure may not work when the number of monomers in the asymmetric unit is large (more than three) or when the target is constituted of a number of components with different stereochemistry, each contributing a fraction of the scattering power in the asymmetric unit. This is the case for PDB entries 1lat and 2iff. The first test structure shows two chains of 71 and 74 residues, respectively, and two identical nucleic acid chains, each with 19 nucleotides. The structure with PDB code 2iff is composed of three protein chains: two with 212 and 214 residues and a third chain with only 129 residues. The model coincides with the third target protein chain.
We then decided to modify the REMO09 approach as follows: when the first molecule has been located, the rotations of the second and the others must be searched for using an ex novo rotation step and, where appropriate, by using a different model. In both of the approaches the figures of merit to be used for recognizing the correct rotation must be designed to take into account that one or more monomers have been previously oriented and located. This increases the signal-to-noise ratio in the search for the new monomer. Let us consider the simplest case: the first monomer has been located and we want to orient the second monomer (no other monomers are supposed to lie in the asymmetric unit).
Appendix A suggests that RFOM may still be the correlation factor between the observed $R^2$ and its expected value $\langle R^2\rangle$, but now
[equation (4) is not preserved in the source text]
where the first squared amplitude on the right-hand side is that of the normalized model structure factor corresponding to the already located first model monomer (normalized with respect to the scattering power of the structure containing the first monomer and its symmetry equivalents) and $\sigma_{A1}$ is the $\sigma_A$ value relating the observed $R$ to the normalized amplitude of this first monomer. The last term on the right-hand side of (4) corresponds to the contribution of the second model monomer (the correct orientation of which we are searching for). $\sigma_{A2}$ is the $\sigma_A$ value corresponding to the pairs $(R, \langle R_2^2\rangle^{1/2})$, where
[equation (5) is not preserved in the source text]
Let us briefly discuss the expected behaviour of (4). The probabilistic approach used to derive (4) excludes the existence of a mixed nonzero term relating the monomer already positioned to the monomer for which the orientation is searched. Thus, the two contributions are simply additive. When the first monomer is badly oriented and/or located $\sigma_{A1}^2$ is expected to be close to zero. Since $\sigma_{A2}^2$ is always expected to be a small value (at least for non-P1 space groups; see Section 4), RFOM is expected to be small. When the first monomer is well located and the second is well oriented then RFOM is expected to be larger. However, values of $\sigma_{A1}^2$ and $\sigma_{A2}^2$ that are both close to unity are not expected, because the fractions of the total scattering power contributed by the first and by the second monomer cannot both be close to unity.
Sections 4 and 5 suggest avoiding the use of $\sigma_A$ values so that $\langle R^2\rangle$ reduces to
[equation (6) is not preserved in the source text]
The final RFOM is the correlation coefficient between the observed $R^2$ and its expected value $\langle R^2\rangle$. Let us now generalize (6) to the case in which three monomers are contained in the asymmetric unit under the condition that the first and second monomers have already been oriented and located. The expression (6) is still valid; we only have to change the meaning of the symbols. The term for the located part will represent the normalized amplitude of the model structure corresponding to the first and second monomers (symmetry equivalents included), and the remaining term will represent the contribution arising from the monomer for which the correct orientation is searched. The procedure is now cyclic: the same equation may be applied to any number of monomers.

Translational search when more than one monomer lies in the asymmetric unit of the target molecule

Let us first suppose that one monomer has already been oriented and located ($F_1$ is its generic structure factor) and that a second monomer has been oriented. If we use the Crowther T2 function to locate the second monomer in the translation step then the expected squared structure factor of the structure constituted by the two monomers and their symmetry equivalents in correct positions is
[equation is not preserved in the source text]
This is a weak relation owing to the fact that $\langle|F|^2\rangle$ does not include the mixed term $F_1F_2$. A better approach is to use a translation function involving $F$ rather than its square. Let $\mathbf{r}_j$ be the current positional vector of the $j$th atom of the second model monomer: the structure factor of the structure constituted by the second monomer and its symmetry equivalents in correct positions is then
[equation (7) is not preserved in the source text]
where $\Delta\mathbf{r}$ is a suitable unknown positional shift, and [expression not preserved in the source text] is the component of the current model structure factor.
The algorithm is very simple. $F_2$ is calculated for each active reflection only once, in the initial position of the second monomer. The second monomer is then moved by the shift $\Delta\mathbf{r}$ on all of the grid points of the asymmetric unit, where $F_2$ is calculated via (7) and summed with $F_1$ to obtain the calculated structure factor of the two-monomer structure. The correct grid position is expected to be that for which TFOM, the correlation factor between the observed amplitude $|F|$ and the calculated structure-factor amplitude, is a maximum. The method is simply generalized to locate an $n$th well oriented monomer when the first $n - 1$ monomers have been well oriented and located.
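The grid-search algorithm outlined in this section can be tried on a toy problem. The sketch below is deliberately restricted to space group P1 with equal point atoms, so that moving the second monomer by Δr only multiplies its structure factor by exp(2πi h·Δr); the symmetry-dependent form of equation (7) is not reproduced. Coordinates, grid size and function names are invented for the example.

```python
import cmath
import itertools
import math

def structure_factor(h, atoms):
    """F(h) for equal point atoms in P1: sum over atoms of exp(2*pi*i*h.r)."""
    return sum(
        cmath.exp(2j * math.pi * sum(hi * xi for hi, xi in zip(h, r)))
        for r in atoms
    )

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def translation_search(hkl, f_obs_amp, f1, f2_initial, n_grid):
    """Scan shifts on an n_grid^3 grid; return (best TFOM, grid index of best shift)."""
    best_tfom, best_index = -2.0, None
    for index in itertools.product(range(n_grid), repeat=3):
        shift = [i / n_grid for i in index]
        calc_amp = []
        for h, fa, fb in zip(hkl, f1, f2_initial):
            # P1 shift theorem: F2 is computed once and only re-phased per grid point.
            phase = cmath.exp(2j * math.pi * sum(hi * si for hi, si in zip(h, shift)))
            calc_amp.append(abs(fa + fb * phase))
        tfom = correlation(f_obs_amp, calc_amp)
        if tfom > best_tfom:
            best_tfom, best_index = tfom, index
    return best_tfom, best_index

def run_demo():
    """Locate a second monomer whose true shift is (3/8, 1/8, 5/8) of the cell."""
    monomer1 = [(0.10, 0.20, 0.30), (0.15, 0.27, 0.33), (0.22, 0.18, 0.41), (0.05, 0.31, 0.26)]
    monomer2_start = [(0.60, 0.10, 0.05), (0.66, 0.13, 0.12), (0.58, 0.21, 0.09), (0.71, 0.07, 0.17)]
    true_index, n_grid = (3, 1, 5), 8
    true_shift = [i / n_grid for i in true_index]
    monomer2_true = [tuple(x + s for x, s in zip(r, true_shift)) for r in monomer2_start]
    hkl = [h for h in itertools.product(range(-2, 3), repeat=3) if any(h)]
    f_obs_amp = [abs(structure_factor(h, monomer1 + monomer2_true)) for h in hkl]
    f1 = [structure_factor(h, monomer1) for h in hkl]
    f2_initial = [structure_factor(h, monomer2_start) for h in hkl]
    return translation_search(hkl, f_obs_amp, f1, f2_initial, n_grid)

if __name__ == "__main__":
    print(run_demo())
```

Because the already located monomer contributes $F_1$ with its phase, the mixed term between the two monomers enters the calculated amplitude, which is the advantage over a function based on squared amplitudes alone.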

Applications

We applied the automatic modified pipeline REMO09 → SYNERGY → CAB to an extended set of test structures, proteins and nucleic acids. We used 80 protein and 38 nucleic acid test structures, the PDB codes of which are reported in Tables 1 and 2. The first 34 protein test structures had previously been used by Burla et al. (2017) to check the SYNERGY refinement process on standard REMO09 phases. Proteins 25–34 belong to the set of 13 structures studied by DiMaio et al. (2011) and characterized by an SI between the model and target structures of lower than 0.30. The experimental data and models for the remaining 46 protein test structures had been deposited in the PDB by the Joint Centre for Structural Genomics, Wilson Laboratory, Scripps Institute: they were used to verify the efficiency of our pipeline on a larger number of test structures (most of them were not originally solved by MR).
Table 1

The 80 protein test structures are identified by their PDB codes

Their experimental data were submitted to the REMO09 + SYNERGY + CAB pipeline. For each test structure we show MRP°, the average phase error/weighted average phase error in degrees at the end of REMO09; SYN°, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio ‘number of Cα atoms within 0.6 Å distance from the published positions/number of Cα atoms in the asymmetric unit’. Dashes indicate that useful roto-translations were not found by the MR program.

PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA
1dy5  55/42  15    99     2f53  58/43  30    95     3nr6  79/67  58    90
1bxo  74/60  28    97     2ayv  54/40  33    89     3zyt  88/89  90    1
2fc3  57/43  32    98     2pby  77/64  36    96     3q6o  80/66  56    99
1tgx  58/44  35    94     2f8m  62/47  41    96     3on5  73/62  43    73
2a46  75/58  31    96     1yxa  74/60  37    95     4fqd  76/61  60    90
1lys  45/36  28    96     2f84  56/42  35    92     3tx8  75/58  47    5
1cgo  78/66  46    100    1cgn  74/64  39    98     3o8s  90/90  89    1
2otb  55/43  34    99     1xyg  64/50  39    98     3npg  79/67  76    3
1kqw  59/46  33    99     2a4k  59/47  32    91     4e2t  74/60  27    96
2sar  54/42  39    96     2b5o  52/40  33    88     3nng  76/61  66    9
1lat  68/55  53    46     1ycn  55/43  31    89
1e8a  69/54  39    98     2iff  62/53  70    4

1vkf  90/89  -     -      3mcq  72/57  47    94     4mru  76/67  73    23
1vki  73/56  37    100    3mdo  56/41  31    96     4ogz  68/54  47    96
1vl2  90/90  -     -      3mz2  89/90  -     -      4ouq  49/36  29    98
1vl7  71/57  42    95     3nyy  77/68  50    96     4q1v  72/60  44    98
1vlc  69/55  31    95     3obi  89/90  -     -      4q34  70/53  36    99
2wu6  55/43  38    97     3oz2  74/62  37    93     4q53  62/49  32    95
2x7h  67/59  51    98     3p94  61/46  38    97     4q6k  64/48  34    99
3e49  75/61  52    97     3ufi  77/65  38    94     4q9a  81/76  89    1
3gp0  75/61  40    96     3us5  66/52  37    98     4qjr  66/51  35    88
3h9e  56/43  34    97     4e2e  54/40  39    89     4qni  74/63  42    82
3h9r  63/48  50    87     4ef2  69/52  38    96     4r0k  53/39  33    99
3khu  90/90  -     -      4ezg  68/50  28    98     4rvo  74/61  69    8
3l23  73/56  41    94     4fvs  89/88  -     -      4rwv  69/54  39    94
3llx  69/55  33    99     4gbs  55/38  36    85     4yod  71/56  68    99
3m7a  76/61  41    98     4gcm  65/50  32    98
3mbj  75/59  43    97     4ler  69/50  30    98
Table 2

The 38 nucleic acid test structures are identified by their PDB codes

Their experimental data were submitted to the REMO09 + SYNERGY + CAB pipeline. For each test structure we give MRP°, the average phase error/weighted average phase error in degrees at the end of REMO09; SYN°, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio ‘number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit’. Dashes indicate that useful roto-translations were not found by the MR program.

PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA
1iha  38/27  41    88     4enc  37/27  28    87     5l4o  68/54  41    83
1q96  90/90  -     -      4gsg  54/41  51    25     5lj4  42/30  32    77
1z7f  39/27  35    100    4ms5  70/57  60    78     5mvt  65/55  29    95
2a0p  34/24  32    100    4wo3  88/88  -     -      5nt5  28/18  26    100
2b1d  83/81  82    2      4xqz  53/37  48    2      5nz6  42/29  27    93
2fd0  49/36  33    95     4zym  77/68  80    0      5t4w  46/33  27    100
2pn4  47/34  42    61     5cv2  89/90  -     -      5tgp  72/56  34    86
3ce5  60/51  48    57     5dwx  75/63  63    59     5ua3  84/80  85    0
3d2v  77/69  60    27     5fj0  89/88  -     -      5ux3  90/90  -     -
3eil  62/47  47    79     5i4s  51/40  38    64     5uz6  73/64  34    99
3fs0  68/51  34    100    5ihd  70/51  39    13     5zeg  88/89  -     -
3n4o  36/26  35    73     5ju4  50/33  27    95     6az4  56/42  43    90
3tok  60/45  49    14     5kvj  65/50  51    94
The 38 nucleic acid structures were selected from the PDB: we downloaded the observed diffraction data, information on the unit cell, space-group symmetry, published sequences and MR models. Twenty of them are DNA and the remaining 18 are RNA fragments. Additional information on all of the test structures is given in Supplementary Tables S1 and S2. For all of the test structures the same small set of directives was used (coinciding with our default set), such as those shown in Table 3 for PDB entry 1xyg.
Table 3

Directives for the default use of the REMO09/SYNERGY/CAB pipeline

The example refers to the protein with PDB code 1xyg.

%cab buccaneer
%structure 1xyg
%job Molecular Replacement Test on 1xyg
%data
mtz 1xyg.mtz
label H K L F SIGF
sequence 1xyg.seq
%remo
fragment 1vkn.pdb
%end
The experimental results are reported in Tables 1 and 2. For each test structure PDB is the PDB code, MRP° is the average phase error in degrees at the end of the REMO09 step and SYN° is the average phase error in degrees at the end of the SYNERGY step. For proteins, MA is the ratio ‘number of Cα atoms within 0.6 Å distance from the published positions/number of Cα atoms in the asymmetric unit’ as obtained by CAB. For nucleic acids, MA is the ratio ‘number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit’ in accordance with CAB interpretation. We will assume that good models are obtained by CAB when MA is sufficiently large: as a rough rule of thumb, we will assume that a good solution has been automatically found when MA > 0.5 (MA is listed as a percentage in the tables, so this corresponds to values above 50).
For proteins we observe the following.
(i) Good solutions were found for 64 of the 80 test proteins. The 16 failures are essentially owing to the limited efficiency of REMO09. Indeed, for 14 of the 16 failures MRP° was ≥74°: in these conditions SYNERGY is often unable to substantially reduce the average phase error so as to allow CAB to succeed. REMO09 failures are frequent for DiMaio structures because, owing to the extremely low value of SI, the MR step often ends with a large model bias which SYNERGY is unable to correct.
(ii) When MRP° is not extremely large, SYNERGY dramatically reduces the average phase error. In 15 cases MRP° values in the interval 73–80° are reduced to values of less than 43°, thus allowing CAB to succeed.
(iii) CAB for proteins is extremely efficient. The MA value is very often close to 100 (a clear signal of successful map interpretation), even in nine of the cases for which SYNERGY ended with SYN° > 50°.
The picture is different for nucleic acids. Such behaviour is in part expected because of the special stereochemistry of DNA/RNA structures.
They have a large number of rotatable bonds in the main chain (six, while there are two for proteins); consequently, the conformation at low resolution is often ambiguous (Keating & Pyle, 2012; Murray et al., 2003). Our experimental results may be summarized as follows: of the 38 nucleic acid structures only 24 are routinely solved. Ten of the 14 failures may be ascribed to REMO09 (i.e. for these MRP° ≥ 77°). The remaining four failures are owing to CAB, which is unable to interpret the electron-density maps of PDB entries 3tok, 4gsg, 4xqz and 5ihd, for which SYN° ≤ 51°. SYNERGY is again efficient (MRP° values of >70° are reduced to values smaller than 40°).
The above experimental tests indicate that the applications of REMO09 and of CAB to DNA/RNA are the weakest points of the pipeline. In contrast, SYNERGY, applied to both nucleic acids and proteins, and the application of CAB to proteins are particularly efficient. The existence of weak points in the pipeline does not allow us to positively answer the question in the title of this paper. There are three simple ways to improve the present situation.
(i) Modify REMO09 to give a more modern and efficient version.
(ii) Replace REMO09 with a more efficient program.
(iii) Modify the CAB algorithms for DNA/RNA structures.
Modifications (i) and (iii) would require supplementary and probably lengthy work which is beyond the purpose of the present paper. For suggestion (ii) the easiest choice would be to replace REMO09 by a popular and documented MR tool to check whether the conclusions suggested by the results obtained via our pipeline are confirmed by the inclusion of a better updated MR program. MOLREP (Vagin & Teplyakov, 2010) was our choice: it is also preferred amongst others because of its simple use and its possible automation. Our default MOLREP procedure corresponds to a small set of directives, given in the original paper for PDB entry 1xyg [the directives are not preserved in the source text]. A better default can probably be provided by expert users; therefore, the potential of MOLREP is certainly much greater than that corresponding to the naïve default we chose. However, the experimental results obtained by the pipeline MOLREP → SYNERGY → CAB, shown in Tables 4 and 5, help to better answer the general question regarding automatic crystal structure solution via MR.
Table 4

The 80 protein test structures are identified by their PDB codes

Their experimental data were submitted to the MOLREP + SYNERGY + CAB pipeline. For each test structure we give MRP°, the average phase error/weighted average phase error in degrees at the end of MOLREP; SYN°, the average phase error at the end of the SYNERGY step; and MA, the ratio ‘number of Cα atoms within 0.6 Å distance from the published positions/number of Cα atoms in the asymmetric unit’. Dashes indicate that useful roto-translations were not found by the MR program.

PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA
1dy5  90/90  -     -      2f53  66/58  71    8      3nr6  86/83  83    1
1bxo  76/68  29    98     2ayv  56/46  31    94     3zyt  90/91  -     -
2fc3  57/44  32    98     2pby  70/62  33    97     3q6o  83/79  78    9
1tgx  61/49  35    94     2f8m  65/55  37    99     3on5  89/89  89    1
2a46  69/59  29    98     1yxa  76/69  36    95     4fqd  83/79  81    3
1lys  68/62  50    96     2f84  58/47  32    94     3tx8  -      -     -
1cgo  -      -     -      1cgn  77/69  35    100    3o8s  90/90  -     -
2otb  -      -     -      1xyg  63/53  35    94     3npg  89/89  -     -
1kqw  62/52  32    98     2a4k  62/53  30    93     4e2t  79/72  31    96
2sar  53/41  39    95     2b5o  52/41  31    88     3nng  78/70  66    19
1lat  89/89  -     -      1ycn  58/47  30    90
1e8a  71/62  35    98     2iff  67/60  69    3

1vkf  84/76  51    96     3mcq  82/73  49    93     4mru  69/60  45    98
1vki  81/73  35    100    3mdo  52/40  31    97     4ogz  68/58  47    96
1vl2  77/68  42    97     3mz2  90/90  -     -      4ouq  52/42  29    99
1vl7  77/69  63    92     3nyy  83/79  76    14     4q1v  72/64  44    97
1vlc  67/56  47    71     3obi  80/74  44    97     4q34  77/67  37    99
2wu6  59/50  38    97     3oz2  79/72  37    94     4q53  64/55  33    96
2x7h  49/40  38    98     3p94  58/48  37    97     4q6k  53/42  35    99
3e49  63/51  45    96     3ufi  78/71  39    91     4q9a  71/61  45    97
3gp0  74/67  42    97     3us5  67/56  37    98     4qjr  67/55  36    82
3h9e  59/47  32    98     4e2e  55/45  39    94     4qni  78/70  42    81
3h9r  73/65  68    2      4ef2  73/63  38    98     4r0k  44/34  30    99
3khu  77/69  56    93     4ezg  79/67  27    98     4rvo  78/70  67    32
3l23  75/65  41    96     4fvs  74/65  60    86     4rwv  70/59  39    93
3llx  74/64  34    99     4gbs  57/43  37    89     4yod  71/61  70    89
3m7a  76/68  41    99     4gcm  65/52  32    98
3mbj  77/69  43    95     4ler  78/70  63    65
Table 5

The 38 nucleic acid test structures are identified by their PDB codes

Their experimental data were submitted to the MOLREP + SYNERGY + CAB pipeline. For each test structure we give MRP°, the average phase error/weighted average phase error in degrees at the end of MOLREP; SYN°, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio ‘number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit’. Dashes indicate that useful roto-translations were not found by the MR program.

PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA     PDB   MRP°   SYN°  MA
1iha  71/61  28    94     4enc  52/41  28    88     5l4o  74/64  37    86
1q96  90/89  -     -      4gsg  59/52  55    6      5lj4  67/55  30    95
1z7f  49/36  27    100    4ms5  88/87  -     -      5mvt  68/55  24    100
2a0p  40/31  32    100    4wo3  87/87  -     -      5nt5  51/37  25    100
2b1d  87/86  -     -      4xqz  88/89  -     -      5nz6  44/34  25    88
2fd0  61/52  25    100    4zym  87/87  -     -      5t4w  61/47  27    91
2pn4  49/37  39    64     5cv2  88/90  -     -      5tgp  77/71  49    86
3ce5  72/68  58    57     5dwx  87/86  -     -      5ua3  86/83  -     -
3d2v  90/90  -     -      5fj0  -      -     -      5ux3  89/87  -     -
3eil  85/82  83    23     5i4s  67/63  38    82     5uz6  72/62  65    93
3fs0  74/66  33    100    5ihd  88/89  -     -      5zeg  88/89  -     -
3n4o  43/26  30    85     5ju4  88/89  -     -      6az4  57/45  43    95
3tok  67/54  47    17     5kvj  59/52  54    91
The results in Table 4 for proteins may be summarized as follows. (i) Solutions are found for 61 of the 80 test structures; most of the failures are owing to our non-optimal choice of MOLREP defaults. (ii) The efficiency of SYNERGY and CAB is similar to that described for the REMO09 → SYNERGY → CAB pipeline. (iii) REMO09 and MOLREP show complementary behaviour: only nine of the 80 protein test structures remained unsolved by both pipelines.

The experimental results in Table 5 for nucleic acid structures may be summarized as follows. (i) Of the 38 nucleic acid structures only 20 are automatically solved: 16 of the 18 failures may be ascribed to the limited effectiveness of our default MOLREP procedure (for these MRP° ≥ 86°) and two to CAB (PDB entries 3tok, for which SYN° = 47°, and 4gsg, for which SYN° = 55°). (ii) 14 of the 38 nucleic acid structures remained unsolved by both pipelines.
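The two kinds of quantity tabulated in Table 5 can be illustrated with a minimal Python sketch. It is not the code used by MOLREP, SYNERGY or CAB. The function names, the per-reflection `weights` argument (the weighting scheme behind the weighted phase error is not specified here), the matching of residues by identifier, the neglect of symmetry equivalents and origin shifts, and the reporting of MA as a percentage are all assumptions made for illustration only.

```python
import math


def mean_phase_error(phi_est, phi_ref, weights=None):
    """Average absolute phase difference in degrees, wrapped into [0, 180].

    phi_est, phi_ref: phases in degrees for the same list of reflections.
    weights: optional per-reflection weights (hypothetical; the actual
    weighting scheme is an assumption of this sketch).
    """
    if len(phi_est) != len(phi_ref):
        raise ValueError("phase lists must have the same length")
    if weights is None:
        weights = [1.0] * len(phi_est)
    total = 0.0
    weight_sum = 0.0
    for est, ref, w in zip(phi_est, phi_ref, weights):
        diff = abs(est - ref) % 360.0
        if diff > 180.0:
            diff = 360.0 - diff  # phases are periodic: 350 and 10 differ by 20
        total += w * diff
        weight_sum += w
    return total / weight_sum


def ma_percent(built_p, published_p, cutoff=1.3):
    """Percentage of residues whose built P atom lies within `cutoff` angstroms
    of the published P atom.

    built_p, published_p: dicts mapping a residue identifier to (x, y, z).
    published_p is assumed to list every residue in the asymmetric unit.
    """
    matched = 0
    for res_id, ref_xyz in published_p.items():
        xyz = built_p.get(res_id)
        if xyz is not None and math.dist(xyz, ref_xyz) <= cutoff:
            matched += 1
    return 100.0 * matched / len(published_p)
```

A call such as `mean_phase_error(phases_after_mr, published_phases)` would correspond to the unweighted part of an MRP° entry, and `ma_percent(built, published)` to an MA entry, under the assumptions stated above.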

Conclusions

The phase problem for small molecules is considered to be universally solved in practice. The main purpose of this paper is to check whether a similar situation is, or will soon be, available for macromolecules if MR techniques are used. We applied the two pipelines REMO09 → SYNERGY → CAB and MOLREP → SYNERGY → CAB to 80 protein structures and 38 nucleic acid structures. Only nine of the 80 protein structures remained unsolved by both of the pipelines; most of the failures occurred when the SI was extremely low (below 0.30). The increasing availability of better models, the selection of improved default procedures for REMO09 and MOLREP, and the possible use of more efficient MR programs (e.g. SYNERGY and CAB may use Phaser) suggest that automatic crystal structure solution is close for proteins. The situation for nucleic acid structures is different: 14 of the 38 nucleic acid structures remained unsolved by both of the pipelines. Further efforts are therefore necessary to obtain their automatic crystal structure solution: the necessary improvements involve the MR programs (in particular the treatment of ligands, which may be a non-negligible part of the structure) and the AMB section.

Supplementary Tables. DOI: 10.1107/S2059798319015468/ip5004sup1.pdf