
Efficient Flexible Fitting Refinement with Automatic Error Fixing for De Novo Structure Modeling from Cryo-EM Density Maps.

Takaharu Mori^1, Genki Terashi^2, Daisuke Matsuoka^1, Daisuke Kihara^2,3, Yuji Sugita^1,4,5.

Abstract

Structural modeling of proteins from cryo-electron microscopy (cryo-EM) density maps is one of the challenging issues in structural biology. De novo modeling combined with flexible fitting refinement (FFR) has been widely used to build a structure of new proteins. In de novo prediction, artificial conformations containing local structural errors such as chirality errors, cis peptide bonds, and ring penetrations are frequently generated and cannot be easily removed in the subsequent FFR. Moreover, refinement can be significantly suppressed due to the low mobility of atoms inside the protein. To overcome these problems, we propose an efficient scheme for FFR, in which the local structural errors are fixed first, followed by FFR using an iterative simulated annealing (SA) molecular dynamics protocol with the united atom (UA) model in an implicit solvent model; we call this scheme "SAUA-FFR". The best model is selected from multiple flexible fitting runs with various biasing force constants to reduce overfitting. We apply our scheme to the decoys obtained from MAINMAST and demonstrate an improvement of the best model of eight selected proteins in terms of the root-mean-square deviation, MolProbity score, and RWplus score compared to the original scheme of MAINMAST. Fixing the local structural errors can enhance the formation of secondary structures, and the UA model enables progressive refinement compared to the all-atom model owing to its high mobility in the implicit solvent. The SAUA-FFR scheme realizes efficient and accurate protein structure modeling from medium-resolution maps with less overfitting.

Year: 2021; PMID: 34142833; PMCID: PMC9282639; DOI: 10.1021/acs.jcim.1c00230

Source DB: PubMed; Journal: J Chem Inf Model; ISSN: 1549-9596; Impact factor: 6.162

Introduction

Single-particle cryo-electron microscopy (cryo-EM) is a powerful tool to determine the three-dimensional (3D) structures of biomolecules at near-atomic resolution.[1] In the method, a 3D density map of the target molecule is reconstructed from a large number of 2D images of the molecule. Owing to the development of various technologies, such as efficient sample preparation, direct electron detection, and software for image processing,[2] high-resolution analyses have been realized for large protein complexes and membrane proteins.[3−5] The method also enables us to understand protein dynamics by capturing snapshots of the structures in their biological processes, such as gene transcription[6] and substrate transport.[7] Although atomic resolution has been recently achieved in some cases,[8,9] typical resolution is still 3–5 Å due to the intrinsic flexibility of proteins in solution. Thus, reliable structure modeling from low- or medium-resolution maps is one of the essential issues in structural molecular biology. Structure modeling from cryo-EM density maps is usually conducted with computational techniques such as rigid-body docking, flexible fitting, and de novo modeling.[10−13] In rigid-body docking, the entire protein structure is treated as an assembly of component segments, and the positions and orientations of each component are optimized with rigid-body translations and rotations (6D search) to fit the density map.[14−16] Flexible fitting uses a complete model of the target biomolecule. The initial structure, which is typically determined with other methods such as X-ray crystallography, nuclear magnetic resonance (NMR), or homology modeling, is deformed using normal mode analysis (NMA),[17,18] molecular dynamics (MD) simulations,[19−25] or their hybrid approach.[26] De novo modeling predicts the structure from the density map and amino acid sequence information. 
To date, various methods, including Rosetta,[27] EM-Fold,[28] Pathwalking,[29] and MAINMAST,[30] have been developed. Rosetta constructs a full-atom model based on the fragment assembly algorithm, in which the predicted short fragments are assembled to fit the density map using a 6D search. EM-Fold builds α-helical proteins by placing α-helices on rod-shaped densities in the map based on a Monte Carlo search. Pathwalking traces the Cα atoms in the density map using a traveling salesman problem solver, while MAINMAST employs a minimum spanning tree algorithm and tabu search.
The model obtained from de novo modeling is usually refined with MD-based flexible fitting (flexible fitting refinement; FFR) to remove steric clashes or repack the side chains. In this method, a biasing potential that guides the structure toward the target density is added to the molecular mechanics force field. One popular method is cross-correlation coefficient (c.c.)-based flexible fitting, which uses the c.c. between the experimental and simulated density maps in the biasing potential.[19,21] On the other hand, Molecular Dynamics Flexible Fitting (MDFF) introduces a biasing potential that is proportional to the Coulomb potential derived from the experimental density map and also the secondary structure (SS) restraint potential.[20] MAINMAST predicts the Cα model, which is further converted to an all-atom (AA) model with PULCHRA,[31] followed by refinement with MDFF.[30] The iterative MD-Rosetta protocol, which iteratively performs Rosetta loop prediction and MDFF, has been proposed to improve the quality of the model predicted from EM-Fold.[32,33] Enhanced sampling algorithms such as the replica-exchange method[34] have been widely employed to search for the global-energy-minimum structure in flexible fitting.[16,35,36] The CryoFold algorithm[37] first performs the model building using targeted MD[38] combined with Bayesian-inference-based restraints (MELD),[39] which utilizes the Cα atom positions and Cα–Cα distance information obtained from MAINMAST, and then refines the model with resolution-exchange MDFF (ReMDFF).[35] Although replica-exchange schemes with the AA model can provide a more accurate structure model than other conventional schemes, they must employ multiple replicas of the target system, resulting in a large computational cost. Simple protocols with low computational cost that maintain high accuracy are needed for efficient structure modeling.
In structure refinement, another important task besides the global-energy-minimum search is the removal of local structural errors such as chirality errors and cis peptide bonds. These errors are frequently generated in de novo structure modeling, especially when a full-atom model is constructed after main-chain modeling with tools such as MAINMAST and Pathwalking.
The errors can be removed using energy minimization or MD simulation with error-fixing restraints such as the dihedral angle restraint at ω = 180° for the cis peptide bond. Many model building tools provide functions for automatic or manual correction of the errors, including geometry restraints or inversion of the corresponding atoms.[40−42] In particular, ISOLDE performs on-the-fly flexible fitting, where the errors visualized on the monitor can be manually removed with a mouse or haptics tool.[43] Another troublesome error is ring penetration, where a covalent bond penetrates an aromatic ring. This situation can accidentally occur, especially when a coarse-grained (CG) model is converted to an AA model.[44] In fact, PULCHRA can generate penetrated rings, even though the algorithm tries to minimize the possibility of ring penetration.[31] Because ring penetration is difficult to resolve, an effective algorithm that automatically detects and fixes such errors is needed.
To solve these problems, we propose an efficient flexible fitting scheme for refining the decoys obtained from de novo modeling, where MD-based flexible fitting with an iterative simulated annealing (SA) protocol is conducted, during which the united atom (UA) model and implicit solvent model are employed; this scheme is, therefore, called “SAUA-FFR”. The UA model, which incorporates hydrogen atoms of CH3, CH2, and CH groups into the carbon atoms, can maintain the atomic resolution, and the implicit solvent model considers solvent effects with low computational cost. We use the decoys obtained from MAINMAST.[30] The local structural errors in the decoys are automatically fixed using new functions implemented in the MD software GENESIS,[45,46] which can address ring penetrations, cis peptide bonds, and chirality errors.
Our refinement scheme is compared with the original scheme of MAINMAST using eight selected proteins: F420-reducing hydrogenase α subunit (FrhA), 20S proteasome core subunit (PCS), Sputnik virophage (SV), Bordetella phage (BPP-1), transient receptor potential vanilloid 1 (TRPV1), CARD domain of mitochondrial antiviral signaling protein (MAVS), Bombyx mori cypovirus 1 (BmCPV-1), and porcine circovirus 2 (PCV2). The models refined with our scheme are also compared with those refined with the Phenix real_space_refine tool. The results demonstrate that our scheme can achieve progressive formation of SSs due to the high mobility of atoms in the UA model, realizing efficient FFR.

Methods

MAINMAST

MAINMAST is a powerful method for de novo modeling.[30] In the method, a tree structure is first constructed by connecting local dense points in the density map (minimum spanning tree). The structure is further refined with a tabu search to find the longest pathway, which corresponds to the main chain of the protein. The amino acid sequence is aligned to the pathway by evaluating the matching between local densities in the experimental map and those predicted from each amino acid on the path, and then the Cα model is constructed. Finally, the AA model is generated from the Cα model, and it is subjected to FFR. Previous work demonstrated good performance [average root-mean-square deviation (rmsd) = 2.68 Å] for the selected 30 proteins with 2.6–4.8 Å resolution maps.[30] Figure 1A illustrates the flowchart of the original scheme in MAINMAST, focusing on the protocol after generating the Cα model. First, 500 possible models (decoys) are generated, and then each Cα model is converted to an AA model using PULCHRA.[31] Each model is subjected to energy minimization, followed by refinement with MDFF.[10,11] In the refinement, only a single run is carried out at 300 K with g-scale 0.5. Restraints that maintain the trans peptide bond and proper chirality are employed, while SS restraints are not applied. The best model is selected from the 500 models according to the MDFF energy, which is composed of the EM biasing potential and the restraint energy for fixing chirality errors and cis peptide bonds.
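
The tree-building idea described above can be illustrated with a minimal sketch. This is not MAINMAST's implementation: it assumes the local dense points are already available as 3D coordinates, builds a minimum spanning tree with Prim's algorithm on Euclidean distances, and extracts the longest path in that tree with two depth-first passes. The tabu-search refinement and the sequence alignment are omitted, and the function names `minimum_spanning_tree` and `longest_path` are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns an adjacency dict {node: [(neighbor, edge_length), ...]}.
    """
    n = len(points)
    adj = {i: [] for i in range(n)}
    if n == 0:
        return adj
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    for _ in range(n):
        # pick the node outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append((parent[u], best[u]))
            adj[parent[u]].append((u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return adj


def _farthest(adj, start):
    """Return the node farthest from `start` along tree edges, plus back-pointers."""
    dist = {start: 0.0}
    prev = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                prev[v] = u
                stack.append(v)
    return max(dist, key=dist.get), prev


def longest_path(adj):
    """Longest path in a tree: farthest node from any start, then farthest from that."""
    if not adj:
        return []
    a, _ = _farthest(adj, 0)
    b, prev = _farthest(adj, a)
    path = []
    node = b
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Toy example: four points on a line plus one short side branch at node 1.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (1, 0.5, 0)]
tree = minimum_spanning_tree(points)
main_chain = longest_path(tree)
```

In the toy example the side-branch point (index 4) is left off the longest path, which is the sense in which the longest pathway is taken as a main-chain candidate.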

Figure 1

Flowchart of the FFR. (A) Original scheme in MAINMAST, (B) improved scheme proposed in this study (SAUA-FFR and SAAA-FFR), and (C) protocol to fix chirality errors, cis peptide bonds, and ring penetrations in step 2 of the SAUA-FFR or SAAA-FFR.

The SAUA-FFR Scheme

In this study, we propose an efficient scheme for this FFR method (namely, SAUA-FFR) using the decoys obtained from MAINMAST, which is mainly composed of four steps after generating the AA model (Figure 1B, left scheme). In step 1, energy minimization is carried out to remove steric clashes in the initial decoy. In step 2, energy minimization and restrained MD simulation are further carried out to fix local structural errors such as chirality errors, cis peptide bonds, and ring penetrations, where the specific treatment and restraints are applied to the errors (Figure 1C; for details, see the next paragraph). In step 3, FFR is carried out using the UA model in an implicit solvent model, where the SA MD is iterated five times. In step 4, the UA model is converted to the AA model by generating hydrogen atoms, and the FFR with the same EM biasing potential is carried out once in the implicit solvent. In steps 3 and 4, c.c.-based flexible fitting is employed, which introduces the biasing potential E_EM = k(1 – c.c.) with the force constant k. Because the optimal value of the force constant is unknown, various force constants ranging from low to high values (N force constants) are examined to generate a “pool” containing both strongly and weakly fitted structures. Thus, the pool contains 500 × N decoys in total. No restraints other than the EM biasing potential are applied in the system. The best model is selected from the 500 × N decoys based on three validation scores: the c.c. between the experimental and simulated density maps, the RWplus score,[47] and the MolProbity score[48] (for details, see the Results section). For comparison, we also examine the AA model in step 3 (SAAA-FFR; Figure 1B, right scheme).
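
To make the biasing term concrete, the following is a minimal sketch of E_EM = k(1 – c.c.). The source does not spell out the formula for c.c.; the normalized cross-correlation used here (sum of voxel products divided by the product of the two map norms) is an assumption, as are the function names and the representation of a density map as a flat list of voxel values.

```python
import math


def cross_correlation(rho_exp, rho_sim):
    """Normalized cross-correlation of two maps given as equal-length voxel lists.

    Assumed form: sum(e*s) / sqrt(sum(e^2) * sum(s^2)).
    """
    if len(rho_exp) != len(rho_sim):
        raise ValueError("maps must have the same number of voxels")
    num = sum(e * s for e, s in zip(rho_exp, rho_sim))
    den = math.sqrt(sum(e * e for e in rho_exp) * sum(s * s for s in rho_sim))
    if den == 0.0:
        raise ValueError("c.c. is undefined for an all-zero map")
    return num / den


def em_biasing_energy(k, cc):
    """E_EM = k * (1 - c.c.), with k in kcal/mol."""
    return k * (1.0 - cc)


# A simulated map proportional to the experimental one fits perfectly,
# so the biasing energy vanishes for any force constant.
rho_exp = [0.0, 1.0, 2.0, 1.0]
rho_sim = [0.0, 2.0, 4.0, 2.0]
cc_perfect = cross_correlation(rho_exp, rho_sim)
e_perfect = em_biasing_energy(10000.0, cc_perfect)
```

Because E_EM scales linearly with k, the same c.c. carries a different energetic penalty under each force constant, which is one reason the scheme scans several k values instead of fixing one.
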
Local structural errors are frequently observed in the energy-minimized structure at step 1. The chirality error can occur in the Cα atoms of the amino acids except for Gly or in the Cγ atoms of Thr and Ile. To fix the error, the corresponding hydrogen atom attached to the chiral center is inverted, and energy minimization is carried out (Figure 1C, top).[49] The errors can also be easily fixed through the MD simulation using the UA model (step 3 in the SAUA-FFR), because the geometry around the chiral center, which involves the hydrogen atom, is regulated with the improper torsion angle potential. In the cis peptide bond, the backbone dihedral angle ω is close to 0°, which is energetically unstable. To invert the cis peptide bond to a trans peptide bond, the dihedral angle restraint at ω = 180° is applied to the corresponding part during the MD simulation (Figure 1C, middle). Note that a cis peptide bond is not always an error, and it can often be found even in high-resolution X-ray crystal structures.[49] In this study, we applied restraints to all backbone peptide bonds except for those in proline. Another typical error is ring penetration, in which one covalent bond accidentally penetrates the ring of Phe, Tyr, Trp, His, or Pro (Figure 1C, bottom). This error can be detected based on the bond lengths of the ring because the penetration makes the ring larger. To fix the error, we first geometrically reduce the ring size and then carry out energy minimization, which allows the penetrating bond to escape from the ring owing to the quick recovery of the natural ring size (see Video S1). The functions for automatically detecting and fixing errors are available in the MD software GENESIS ver 1.6 or later.[45,46]
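
Two of the checks described above can be sketched in a few lines. This is an illustration only and not the GENESIS implementation: the cis/trans tolerance of 30° and the ring-bond cutoff of 1.8 Å are hypothetical thresholds chosen for the example, and the function names are made up. The ω angle is taken over the atoms Cα(i), C(i), N(i+1), Cα(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (atan2 form)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def classify_peptide_bond(omega_deg, tol_deg=30.0):
    """Label omega as cis (near 0), trans (near 180), or twisted.

    tol_deg is a hypothetical tolerance, not a value taken from the paper.
    """
    a = abs(omega_deg)
    if a <= tol_deg:
        return "cis"
    if a >= 180.0 - tol_deg:
        return "trans"
    return "twisted"


def ring_looks_penetrated(ring_coords, max_bond=1.8):
    """Flag a ring whose consecutive-atom distances exceed max_bond (in Å).

    ring_coords lists the ring atoms in bonded order; max_bond is a
    hypothetical cutoff standing in for the bond-length criterion.
    """
    n = len(ring_coords)
    return any(math.dist(ring_coords[i], ring_coords[(i + 1) % n]) > max_bond
               for i in range(n))


def hexagon(radius):
    """Planar regular hexagon; its edge length equals its radius."""
    return [(radius * math.cos(k * math.pi / 3),
             radius * math.sin(k * math.pi / 3), 0.0) for k in range(6)]


# Planar toy geometries: the last atom on the same side (cis) or opposite side (trans).
omega_cis = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
omega_trans = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
```

A bond flagged as cis by such a check is only a candidate error, since genuine cis peptide bonds exist, which is consistent with the decision above to leave proline unrestrained.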

Test Systems

To examine the efficiency and reliability of our scheme, we selected eight proteins: FrhA, PCS, SV, BPP-1, TRPV1, CARD domain of MAVS, BmCPV-1, and PCV2 (see Table 1). In this study, we used the same initial decoys and target density maps as those used in the previous work.[30] We regard the PDB coordinates as the correct answer for the prediction; these coordinates were determined by fitting the reference X-ray crystal structure or homology model (MAVS and PCS),[50,51] manual building (SV, BPP-1, BmCPV-1, and PCV2),[52−55] or their combinations (FrhA and TRPV1).[56,57] The proteins are composed of α-helices, β-sheets, or their mixtures (5th column in Table 1). Note that each system is a part of a large complex (6th column in Table 1). Thus, to make the target density map, the corresponding region was clipped out of the experimental map. The resolution of the original map is approximately 3 Å. Previous work demonstrated that these test sets show different qualities in the predicted best model in terms of rmsd with respect to the native conformation.[30] Specifically, the prediction was successful for BmCPV-1 (rmsd = 1.67 Å) but not for PCS (15.00 Å), SV (9.46 Å), or BPP-1 (33.10 Å).

Table 1

Summary of the Target Systems

protein	EMD ID	res. (Å)	PDB ID	α-helix/β-sheet	chain/residues	rmsd (Å)^a
FrhA	2513	3.36	4CI0	181/58	A/2–386	3.80
PCS	3231	3.6	5FMG	56/41	K/2–195	15.00
SV	5495	3.5	3J26	29/151	A/1–508	9.46
BPP-1	5764	3.5	3J4U	65/49	A/5–331	33.10
TRPV1	5778	3.275	3J5P	220/0	A/381–719	6.04
MAVS	5925	3.64	3J6J	68/0	A/1–97	3.34
BmCPV-1	6374	2.90	3JB0	103/37	D/1–292	1.67
PCV2	6555	2.90	3JCI	0/86	A/42–231	2.32

^a The Cα rmsd with respect to the native structure calculated with the MMTSB toolset (rms.pl).[58]

Computational Details of SAUA-FFR and SAAA-FFR

We employed the CHARMM C19[59] and C36m force fields[60] for the UA and AA models, respectively. In step 1 of the improved scheme (Figure 1B), a 1000-step energy minimization was carried out in vacuum. In step 2, energy minimization and restrained MD simulations were performed. To fix the cis peptide bond, we conducted MD simulations at 300 K in vacuum using dihedral angle restraints with a force constant of 10 kcal/mol/rad², where the positional restraint was also applied to all Cα atoms (k = 0.5 kcal/mol/Å²). In step 3 of the SAUA-FFR, c.c.-based FFR[19] was carried out using the UA model with the SA MD protocol (100 ps × 5 cycles = 500 ps in total), where the temperature was decreased from 600 to 300 K in each cycle. We used the effective energy function (EEF1) model for the implicit solvent model.[61] In step 3 of the SAAA-FFR, the AA model was employed with the generalized Born/solvent-accessible surface area (GB/SA) implicit solvent model (OBC2 model) using the same simulation conditions.[62] Here, we examined five different force constants (k = 2000, 4000, 6000, 8000, and 10,000 kcal/mol) in the EM biasing potential for each system. In the MD simulations, we used a cutoff distance of 18 Å. The Langevin thermostat was employed for temperature control, and the equations of motion were integrated with the leapfrog algorithm. All MD simulations were performed using GENESIS.[45,46]
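
The iterative SA protocol (100 ps × 5 cycles, 600 to 300 K in each cycle) can be pictured as a sawtooth temperature schedule. The sketch below assumes a linear ramp within each cycle, which the text does not state; the function name `sa_temperature` and its parameters are hypothetical and do not correspond to GENESIS input keywords.

```python
def sa_temperature(t_ps, cycle_ps=100.0, n_cycles=5, t_high=600.0, t_low=300.0):
    """Target temperature (K) at simulation time t_ps for iterated annealing.

    Each cycle restarts at t_high and cools to t_low; a linear ramp is an
    assumption made for this illustration.
    """
    total = cycle_ps * n_cycles
    if not 0.0 <= t_ps <= total:
        raise ValueError("time is outside the annealing protocol")
    if t_ps == total:
        return t_low  # end of the final cycle
    frac = (t_ps % cycle_ps) / cycle_ps
    return t_high - (t_high - t_low) * frac


# Temperatures sampled every 50 ps across the 500 ps protocol.
schedule = [(t, sa_temperature(float(t))) for t in range(0, 501, 50)]
```

Each return to 600 K gives the structure a fresh chance to cross barriers while the EM bias stays on, which matches the picture of iterating the annealing instead of running one long 300 K trajectory.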

Results

Removal of Local Structural Errors from Initial Decoys

First, we analyzed the number of local structural errors in all decoys to investigate the efficiency of our error-fixing algorithms. Table 2 lists the average number of chirality errors, cis peptide bonds, and ring penetrations in the 500 decoys obtained at step 2 of the original scheme, step 1 of the improved scheme, and step 2 of the improved scheme. Here, the cis peptide bonds were counted using the VMD cispeptide plugin,[40] and the chirality errors and ring penetrations were counted with the GENESIS check_structure function. Note that the numbers in the table are based on a “warning” message for the suspicious moiety in the molecule. In the original scheme, some errors remain even after the FFR with error-fixing restraints. On the other hand, in the improved scheme, chirality errors, cis peptide bonds, and ring penetrations are resolved or significantly reduced. Specifically, in TRPV1, ring penetrations were found in 309 of 500 decoys at step 1 of the improved scheme, but they were completely removed at step 2. These results suggest that the simple protocol used in the original scheme cannot fully resolve local structural errors, and careful removal of the errors is necessary before the FFR.

Table 2

Average Number of Local Structural Errors (Chirality Errors, cis Peptide Bonds, and Ring Penetrations) in One Decoy Obtained at Step 1 or Step 2 of the Original and Improved Schemes^a

system	error	original scheme step 2	improved scheme step 1	improved scheme step 2
FrhA	chirality error	1.216 (1.703)	0.006 (0.077)	0.000 (0.000)
	cis peptide bond	10.708 (3.698)	11.546 (3.578)	3.716 (1.733)
	ring penetration	0.226 (0.489)	0.180 (0.428)	0.002 (0.045)
PCS	chirality error	1.210 (2.011)	0.000 (0.000)	0.000 (0.000)
	cis peptide bond	9.348 (3.393)	9.870 (3.816)	0.192 (0.437)
	ring penetration	0.280 (0.605)	0.244 (0.541)	0.000 (0.000)
SV	chirality error	2.204 (2.467)	0.002 (0.045)	0.000 (0.000)
	cis peptide bond	16.698 (6.353)	19.138 (6.485)	4.080 (2.051)
	ring penetration	0.542 (0.867)	0.400 (0.724)	0.006 (0.077)
BPP-1	chirality error	1.260 (2.205)	0.000 (0.000)	0.000 (0.000)
	cis peptide bond	12.166 (4.110)	13.504 (4.200)	1.700 (1.302)
	ring penetration	0.218 (0.463)	0.206 (0.442)	0.002 (0.045)
TRPV1	chirality error	2.064 (2.467)	0.006 (0.077)	0.000 (0.000)
	cis peptide bond	17.014 (5.317)	19.378 (6.296)	1.654 (1.248)
	ring penetration	0.668 (1.100)	0.488 (0.700)	0.000 (0.000)
MAVS	chirality error	0.574 (1.098)	0.000 (0.000)	0.000 (0.000)
	cis peptide bond	4.338 (2.427)	4.648 (2.591)	1.198 (0.942)
	ring penetration	0.074 (0.277)	0.058 (0.234)	0.002 (0.045)
BmCPV-1	chirality error	0.234 (0.632)	0.000 (0.000)	0.000 (0.000)
	cis peptide bond	4.282 (1.958)	4.990 (2.072)	1.582 (1.050)
	ring penetration	0.170 (0.503)	0.158 (0.448)	0.004 (0.063)
PCV2	chirality error	0.788 (1.805)	0.000 (0.000)	0.000 (0.000)
	cis peptide bond	3.426 (2.372)	4.406 (2.355)	1.310 (1.118)
	ring penetration	0.228 (0.522)	0.216 (0.491)	0.002 (0.045)

^a The values in parentheses represent the standard deviation.

Structural Change during the FFR

To monitor the progress of the refinement in each scheme, we analyzed the c.c. between the experimental and simulated density maps for all 500 models. Here, we focus on the five highest c.c. models obtained at 500 ps of step 3. Figure 2A,B illustrates the time evolution of the averaged c.c. of the five highest c.c. models for PCV2 in the SAUA-FFR and SAAA-FFR, respectively. The average was calculated at the last step of each SA cycle. Note that cycle = 0 corresponds to the initial model generated from PULCHRA. For comparison, we also plot the results of the original scheme (black). In all cases in the SAUA-FFR and SAAA-FFR using different biasing force constant k values, the averaged c.c. value increased from the initial value, demonstrating that most models were successfully fitted to the target density map during the iterative FFR. The original scheme also showed an increase in the averaged c.c. value, and the result was similar to the averaged c.c. value of SAUA-FFR and SAAA-FFR.

Figure 2

Time evolution of the averaged c.c. and Cα rmsd values for the five highest c.c. models of PCV2. (A) c.c. in SAUA-FFR, (B) c.c. in SAAA-FFR, (C) Cα rmsd in SAUA-FFR, and (D) Cα rmsd in SAAA-FFR. Brown, orange, blue, green, and purple lines are the results obtained from the FFR using k = 2000, 4000, 6000, 8000, and 10,000 kcal/mol, respectively, and the black line corresponds to the original scheme (previous work).[30]

The averaged c.c. value also increased as the force constant increased (from brown to purple lines in Figure 2A,B). If the same force constant was used in SAUA-FFR and SAAA-FFR, a higher c.c. value was obtained in SAUA-FFR than in SAAA-FFR. This is mainly because the structural energy [i.e., molecular-mechanics potential energy (E_MM) + solvation free energy (ΔG_solv)] in the UA model is lower than that in the AA model, and thus, the EM biasing energy (E_EM) in the SAUA-FFR had a relatively larger contribution to the fitting than that in the SAAA-FFR. Similar tendencies were observed in the other seven systems (see Figure S1). We also found that the c.c. value could decrease from the initial value, especially when weak force constants (e.g., k = 2000 kcal/mol) are used for large systems such as BmCPV-1. In such cases, E_EM is still small relative to the structural energy, resulting in less fitting to the density map. These results indicate that the optimal force constant for the FFR depends on the system size as well as force fields or molecular models, although it is unknown a priori.
We analyzed the averaged rmsd of the Cα atoms with respect to the native structure for the five highest c.c. models (Figure 2C,D). In most cases, except for SAAA-FFR, the averaged rmsd value decreased from the initial value by 0.2–0.4 Å, and it almost converged at cycle = 2 or 3. In SAAA-FFR, the biasing force might be too weak to guide the initial model toward the native structure.
A smaller rmsd was mostly obtained with a strong force constant (e.g., k = 6000, 8000, and 10,000 kcal/mol). However, the smallest rmsd was not always obtained with the strongest force constant, as in SAAA-FFR (blue line in Figure 2D). One of the reasons might be that a moderate force constant is required in some cases to prevent the structure from becoming trapped in local energy minima. Another reason might be overfitting, where the obtained structure is distorted due to fitting to the noisy density map. This issue is further discussed in the next subsection. A comparison between the three schemes suggests that the SAUA-FFR and SAAA-FFR schemes are more effective than the original scheme. In fact, SAUA-FFR and SAAA-FFR yielded a smaller rmsd than the original scheme, even though these three schemes showed almost identical c.c. values (compare the black, green, and purple lines in Figure 2A,B). For the other systems, we observed similar results, where the stronger force constant yielded a smaller rmsd (see Figure S1). In MAVS and BmCPV-1, the averaged rmsd successfully decreased by 0.2–0.6 Å using k = 6000–10,000 kcal/mol in both SAUA-FFR and SAAA-FFR. On the other hand, in FrhA, PCS, SV, BPP-1, and TRPV1, the rmsd did not change even with the strongest force constant. This is presumably because the structure of the initial model deviated largely from the native structure (e.g., averaged initial rmsd = 6.5–8.8 and 3.5–4.6 Å in TRPV1 and FrhA, respectively), and the conformational search was not conducted sufficiently. These results suggest that the FFR works effectively if the initial rmsd is less than 3.5 Å.

Evaluation of the Decoys

One of the difficult issues in protein structure prediction is the selection of the best model from a large number of decoys. In typical flexible fitting approaches using an X-ray crystal structure as the initial model, we may simply choose a model according to a score that represents goodness of fitting, such as c.c., because the model would already have a protein-like structure. In the FFR for de novo models, however, the decoys might contain many structures that show a high c.c. value but have a nonprotein-like conformation, and these should be discriminated from near-native structures. In addition, we should address overfitting. Thus, the best model must be carefully selected based on not only the goodness of fitting but also scores that validate the protein structure. To examine this, we first analyzed the distribution of rmsd as a function of c.c. Figure 3A shows the c.c.–rmsd plot obtained at the last step of SAUA-FFR (step 4 in Figure 1B) for PCV2, where the brown, orange, blue, green, and purple points were obtained with k = 2000, 4000, 6000, 8000, and 10,000 kcal/mol, respectively. Hereafter, we define the 500 models obtained with each force constant as a “decoy set”. We see that the rmsd decreases as the c.c. increases in each decoy set, and the distribution shifts toward a higher c.c. as the force constant increases. Each decoy set exhibits a funnel-like distribution, where the bottom decoy is close to the native structure (black point). Similar distributions were observed in the other systems except for PCS and BPP-1 (see Figure S2). In these two cases, improvement of the initial main-chain modeling might be required to reproduce the funnel-like distribution. We suggest that if higher c.c. models are selected, smaller rmsd models can be obtained.

Figure 3

Distribution of the decoys obtained from SAUA-FFR for PCV2. (A) Distribution of the Cα rmsd with respect to the native structure. Note that decoys with a large rmsd (>10 Å) were excluded. (B) Heat map of the rmsd projected onto the c.c.–RWplus plot obtained from SAUA-FFR. (C) Heat map of the MolProbity score projected onto the c.c.–RWplus plot obtained from SAUA-FFR.

Here, the model with the highest c.c. was usually obtained with the strongest force constant, and it corresponded to the model with the smallest rmsd in some cases (e.g., SAUA-FFR for FrhA and SAAA-FFR for TRPV1). However, in most cases, the smallest rmsd was obtained with a moderate force constant (see Table S1). Particularly, in SAUA-FFR for PCV2 (Figure 3A), the model with the highest c.c. was found in the decoy set with k = 10,000 kcal/mol (c.c. = 0.816 and rmsd = 1.75 Å), while the model with the smallest rmsd was in the decoy set with k = 6000 kcal/mol (c.c. = 0.778 and rmsd = 1.37 Å). These results suggest that to select a model with a smaller rmsd, we should search all decoys generated with various force constants ranging from low to high values. Then, we can eliminate the dependency on the force constant or simultaneously determine the optimal force constant that gives the best model.
The best model should have a protein-like conformation. To investigate the accuracy of the protein structure in the decoys, we first examined the RWplus score, which is a statistical energy function that evaluates the protein structure based on the orientation of the side chains.[47] Figure 3B shows a heat map of the rmsd projected onto the c.c.–RWplus plot obtained from SAUA-FFR for PCV2. The decoy set exhibits a funnel-like distribution, where the RWplus decreases as the c.c. increases. The rmsd also decreases as the c.c. increases and RWplus decreases, and thus, the bottom of the funnel is close to the native structure (black point). Similar tendencies were observed in the other decoy sets (data not shown).
We also found that the model with the best RWplus does not always correspond to the model with the smallest rmsd (rmsd = 5.6 Å) if it has a low c.c., as indicated by the arrow in Figure 3B. Therefore, to find near-native and protein-like structures in the decoys, we should choose decoys that have good scores for both c.c. and RWplus. Another useful method to validate the protein structure is MolProbity, which assesses protein geometry using the clash score and conformational outliers in the main chain and side chains.[48] Figure 3C shows a heat map of the MolProbity score projected onto the c.c.–RWplus plot. We can see that the MolProbity score decreases as the c.c. increases and the RWplus decreases, suggesting that the decoys at the bottom of the funnel are again likely to have a protein-like structure. Interestingly, as indicated by the arrow, the model with a high c.c. showed worse MolProbity and RWplus scores than the other nearby decoys. For such decoys, we should suspect overfitting, in which the structure is distorted to some extent due to fitting to the noisy density. Thus, we exclude such decoys from the candidates of the best model.

Selection of the Best Model

Based on the above observations, we propose a scheme for the best model selection, which consists of three steps (Figure 4). After obtaining N decoy sets (N × 500 decoys in total) using various force constants, we first determine the Top5N models, where the 5 highest c.c. models are selected from each decoy set. This step can filter out large rmsd models or less-fitted models. Then, we select Top5 models from the Top5N models based on the RWplus score to filter out nonprotein-like structures and eliminate the dependency on the force constant. Finally, we select the best model from the Top5 models based on the MolProbity score to further filter out nonprotein-like structures and eliminate the possibility of overfitting.
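
The three-step selection above maps directly onto a short filtering routine. The sketch below assumes that lower RWplus and lower MolProbity scores are better, consistent with the funnel plots discussed earlier; the `Decoy` record, the `select_best` function, and the toy scores are hypothetical and are not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoy:
    name: str
    k: float           # force constant of the decoy set (kcal/mol)
    cc: float          # c.c. to the experimental map (higher is better)
    rwplus: float      # RWplus score (lower is better)
    molprobity: float  # MolProbity score (lower is better)


def select_best(decoys, top_n=5):
    """Top5N by c.c. per decoy set, then Top5 by RWplus, then best MolProbity."""
    by_k = {}
    for d in decoys:
        by_k.setdefault(d.k, []).append(d)
    # Step 1: the top_n highest-c.c. models from each decoy set (Top5N).
    top5n = []
    for group in by_k.values():
        top5n.extend(sorted(group, key=lambda d: d.cc, reverse=True)[:top_n])
    # Step 2: the top_n best RWplus models across all force constants (Top5).
    top5 = sorted(top5n, key=lambda d: d.rwplus)[:top_n]
    # Step 3: the single model with the best MolProbity score.
    return min(top5, key=lambda d: d.molprobity)


# Toy pool with three decoy sets; top_n=2 keeps the example small.
decoys = [
    Decoy("a", 2000, 0.70, -100.0, 2.0),
    Decoy("b", 2000, 0.72, -120.0, 1.8),
    Decoy("c", 2000, 0.60, -200.0, 1.0),   # good scores but poorly fitted
    Decoy("d", 6000, 0.78, -150.0, 1.5),
    Decoy("e", 6000, 0.77, -140.0, 1.9),
    Decoy("f", 6000, 0.50, -300.0, 0.9),   # good scores but poorly fitted
    Decoy("g", 10000, 0.82, -90.0, 3.0),   # highest c.c. but poor geometry
    Decoy("h", 10000, 0.80, -130.0, 1.7),
]
best = select_best(decoys, top_n=2)
```

In this toy pool the decoy with the highest c.c. ("g") is dropped at the RWplus step and the poorly fitted decoys ("c", "f") are dropped at the c.c. step, leaving a moderately biased model as the final pick.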

Figure 4

Proposed scheme for the selection of the best model from the decoys obtained from either SAUA-FFR or SAAA-FFR.

Figure 5A shows the best model of each system obtained from the original scheme (blue), SAUA-FFR (purple), and SAAA-FFR (green) compared to the native structure (gold). Note that the best model in the original scheme was selected using the previous protocol (see the Methods section).[30] Clearly, all best models obtained from the SAUA-FFR and SAAA-FFR contain much more SS than those from the original scheme. In the case of FrhA, both α-helices and β-strands are successfully formed in the SAUA-FFR and SAAA-FFR but not in the original scheme. The rmsd in the original, SAUA-FFR, and SAAA-FFR schemes was 3.80, 3.30, and 4.21 Å, respectively, demonstrating the good performance of SAUA-FFR. Similar tendencies were observed in PCS, SV, BPP-1, MAVS, BmCPV-1, and PCV2. In the case of TRPV1, SAUA-FFR showed a larger rmsd (9.45 Å) than that of the other two schemes. However, the latter two schemes also showed a large rmsd (6.04 or 5.95 Å). These large rmsds mainly originate from a specific region of the protein. Figure 5B illustrates the deviations of the predicted model from the native structure, where the red and blue regions have large and small deviations in the Cα atom position, respectively. In the case of TRPV1, the prediction for most regions was successful, but one α-helix (indicated by the arrow) shifted by three turns, resulting in a large rmsd. This shift in the α-helix might be difficult to discriminate from the native conformation using the current protocol. This point will be discussed later.

Figure 5

Best models obtained from the original, SAUA-FFR, and SAAA-FFR schemes. (A) Comparison of the native structure and the predicted best models. (B) Deviations of the predicted best model from the native structure (blue: small deviation and red: large deviation). The PyMOL colorbyrmsd.py tool was used to make a color map.[63]

In Table 3, we summarize the structural properties of each of the best models. In most cases, the rmsd in SAUA-FFR is smaller than that in the original scheme or SAAA-FFR. Some structural errors were generated in the original scheme, while they were significantly reduced in SAUA-FFR and SAAA-FFR. The reproducibility of the SS in SAUA-FFR is better than that in the original scheme or SAAA-FFR. One of the most remarkable results was an SS reproducibility of 85.7% in SAUA-FFR for BmCPV-1, in which both α-helices (93.2%) and β-strands (64.9%) were well formed in the predicted model. SAUA-FFR also showed better MolProbity and RWplus scores. Similar tendencies were observed in the averaged structural properties of the Top5 models (see Table S2).

In Tables 3 and S2, we also show the results of the refinement using Phenix real_space_refine.[42] We carried out a basic scheme consisting of minimization global, local grid search, morphing, and SA (simulated_annealing = every_macro_cycle and the other options are default) for the AA model generated from PULCHRA. The best model was selected based on the c.c. value. The results are similar to those in the original scheme, and the reproducibility of the SS is still low. If the initial model deviates largely from the ideal structure, the refinement might not work well in the basic protocol of Phenix. Overall, SAUA-FFR showed better performance than the other schemes in most cases.

Table 3

Summary of the Structural Properties of the Best Model Predicted from Each Scheme^a

system	scheme	c.c.	rmsd (Å)	rank	chiral errors/cis peptide bonds/ring penetrations	SS (%)	MolProbity	RWplus (kcal/mol)
FrhA	original	0.795	3.80	9	0/6/0	3.8	2.55	–67,852
	SAUA-FFR	0.793	3.30	3	0/1/0	53.1	2.42	–74,796
	SAAA-FFR	0.739	4.21	100	0/0/0	48.5	2.37	–72,563
	Phenix	0.781	3.82	4	0/4/0	16.7	2.46	–67,886
PCS	original	0.725	15.00	102	1/7/0	0.0	2.43	–33,168
	SAUA-FFR	0.737	8.87	4	0/1/0	26.8	2.57	–35,001
	SAAA-FFR	0.640	14.62	273	0/0/0	9.3	2.49	–33,713
	Phenix	0.698	14.50	14	0/5/0	4.1	2.64	–33,728
SV	original	0.758	9.46	1	0/8/0	15.6	2.52	–89,093
	SAUA-FFR	0.732	9.36	3	0/0/0	44.4	2.15	–88,184
	SAAA-FFR	0.528	10.05	10	0/0/0	21.7	2.19	–90,578
	Phenix	0.757	9.43	1	0/8/0	12.8	2.49	–88,216
BPP-1	original	0.799	33.10	417	0/4/0	3.5	2.94	–42,912
	SAUA-FFR	0.667	27.58	274	0/0/0	7.0	2.25	–44,807
	SAAA-FFR	0.709	33.04	2201	0/0/0	7.9	2.35	–45,125
	Phenix	0.781	33.18	411	0/1/0	0.0	2.93	–42,729
TRPV1	original	0.755	6.04	6	0/17/0	9.1	2.60	–55,334
	SAUA-FFR	0.705	9.45	396	0/0/0	54.1	2.03	–62,981
	SAAA-FFR	0.710	5.95	25	0/1/0	42.7	2.18	–61,773
	Phenix	0.732	7.99	40	0/6/0	9.5	2.42	–50,946
MAVS	original	0.830	3.34	31	1/4/0	14.7	2.69	–14,178
	SAUA-FFR	0.803	2.91	110	0/0/0	64.7	1.62	–18,265
	SAAA-FFR	0.820	4.01	1406	0/0/0	57.4	2.23	–17,147
	Phenix	0.805	3.43	22	0/4/0	17.6	2.47	–15,059
BmCPV-1	original	0.837	1.67	2	0/4/0	43.6	2.22	–58,214
	SAUA-FFR	0.856	1.51	3	0/0/0	85.7	2.15	–62,604
	SAAA-FFR	0.834	1.73	16	0/1/0	70.7	1.90	–61,319
	Phenix	0.802	1.86	2	0/2/0	50.7	2.18	–58,035
PCV2	original	0.777	2.32	30	0/0/0	23.3	2.21	–29,185
	SAUA-FFR	0.798	2.08	124	0/1/0	69.8	2.09	–33,403
	SAAA-FFR	0.798	2.50	168	0/0/0	69.8	1.82	–33,314
	Phenix	0.772	1.72	1	0/1/0	38.4	2.30	–31,378

^a rmsd: Cα rmsd with respect to the native structure using the MMTSB toolset (rms.pl).[58] c.c.: cross-correlation coefficient between the experimental and simulated density maps using the VMD mdffi tool.[20,40] Rank: rank of the rmsd over all decoys (500 decoys in the original scheme and 500 × 5 decoys in the improved schemes). Chirality errors: number of chirality errors using the GENESIS check_structure function. Cis peptide bonds: number of cis peptide bonds using the VMD cispeptide plugin. Ring penetrations: number of ring penetrations using the GENESIS check_structure function. SS: reproducibility of the α-helix and β-strand residues using the DSSP program (symbols H and E were counted).[64]

In the 5th column of Table 3, we show the rank of the best models in terms of the rmsd. Some best models have a high rank. In particular, the rank of the best model obtained from SAUA-FFR for FrhA, SV, and BmCPV-1 was 3/2500. On the other hand, the rank was 124/2500 in the case of PCV2. In this case, the smallest rmsd decoy (1.37 Å) was filtered out because it had worse validation scores (MolProbity = 2.56 and RWplus = −32,867 kcal/mol) and a lower reproducibility of SS (66.3%) than the best model. In essence, our scheme aims to reduce overfitting as much as possible in the candidate decoys. We also emphasize that the Top5 models actually contained the smallest rmsd model in some cases (e.g., SAUA-FFR for TRPV1, SAAA-FFR for MAVS, and SAAA-FFR for BmCPV-1; see Table S2), demonstrating the good performance of our scheme.

Why Was the Formation of SS Enhanced?

One of the remarkable features of our improved schemes is the progressive formation of SS (7th column in Table 3). To investigate how SS were formed during the refinement, we analyzed the time evolution of the average number of residues that formed α-helices or β-strands. Figure 6A shows the results of the DSSP analysis[64] for the five highest c.c. models of BmCPV-1 in SAUA-FFR (left panel) and SAAA-FFR (right panel). In the native structure, there are 140 residues that form SS (103 α-helix and 37 β-strand residues). We found that in SAUA-FFR, SS were quickly generated at cycle 1, and the number of SS residues gradually increased up to ∼135 at cycles 2–5. On the other hand, fewer SS were generated in SAAA-FFR (∼120) and the original scheme (∼70). Similar results were obtained in the other seven systems (see Figure S3).
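The counting behind this analysis can be illustrated with a short Python sketch. It assumes that the DSSP assignment of each structure is available as a one-letter-per-residue string (a hypothetical input format), and the per-residue matching used in `ss_reproducibility` is an assumed definition; the text only states that the symbols H and E were counted.

```python
def count_ss(dssp: str) -> int:
    """Number of residues assigned H (alpha-helix) or E (beta-strand)."""
    return sum(1 for symbol in dssp if symbol in "HE")


def ss_reproducibility(native: str, model: str) -> float:
    """Percentage of native H/E residues with the same symbol in the model.

    Assumed definition: a native SS residue counts as reproduced only if the
    model carries the identical symbol (H or E) at the same position.
    """
    if len(native) != len(model):
        raise ValueError("DSSP strings must cover the same residues")
    pairs = [(n, m) for n, m in zip(native, model) if n in "HE"]
    if not pairs:
        return 0.0
    matched = sum(1 for n, m in pairs if n == m)
    return 100.0 * matched / len(pairs)
```

Applying `count_ss` to the DSSP string of each snapshot and averaging over the five highest c.c. models would give a time series of the kind described above.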

Figure 6

Formation of SS during the FFR. (A) Time evolution of the average number of SS (α-helix + β-strand) in the five highest c.c. models of BmCPV-1 during SAUA-FFR (left) and SAAA-FFR (right). In the analysis, symbols H (α-helix) and E (β-strand) obtained from DSSP were counted. (B) Average displacement of the Cα, Cβ, Cγ, and Cδ atoms in the surface-exposed residues (left) and buried residues (right) before and after the FFR. Here, exposed and buried residues were defined according to the relative solvent accessibility of each residue (cutoff = 50%), which was computed with the Naccess program.[65] (C) Examples of the local structural errors that prevented the formation of α-helices (top) and β-sheets (bottom). Right panels show the correct formation of SS after fixing the errors using our algorithms.

One of the reasons for the progressive formation of SS is presumably the high mobility of atoms in the UA model. Figure 6B shows the averaged displacement of the Cα, Cβ, Cγ, and Cδ atoms in the surface-exposed residues (left panel) and buried residues (right panel) of PCV2. Here, the purple and green bars were obtained from SAUA-FFR runs with different force constants, and the blue bars were obtained from SAAA-FFR. We see that the displacement is suppressed if the stronger force constant is used (compare purple and green bars) or if the residues are buried inside the protein (compare left and right panels). If we compare SAUA-FFR (purple) and SAAA-FFR (blue), which showed identical results in terms of c.c. and rmsd (see Figure ), the displacement of the UA model is larger than that of the AA model. In the UA model, hydrogen atoms of CH3, CH2, and CH groups are incorporated into the carbon atoms. Therefore, the molecule represented with the UA model is less dense than that with the AA model, resulting in a higher mobility of atoms in the UA model.
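The displacement analysis for exposed and buried residues can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' script: the function name and the input layout (one coordinate per atom plus the relative solvent accessibility of its residue) are hypothetical, both structures are assumed to be in the same reference frame so that no superposition is applied, and residues at or above the 50% cutoff are assumed to count as exposed.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def mean_displacement_by_burial(
    before: Sequence[Coord],
    after: Sequence[Coord],
    rsa_percent: Sequence[float],
    cutoff: float = 50.0,
) -> Dict[str, float]:
    """Average atom displacement for exposed and buried residues.

    before/after: atom positions before and after the FFR (same frame assumed).
    rsa_percent: relative solvent accessibility (%) of the residue of each atom.
    """
    if not (len(before) == len(after) == len(rsa_percent)):
        raise ValueError("inputs must have the same length")
    groups: Dict[str, List[float]] = {"exposed": [], "buried": []}
    for start, end, rsa in zip(before, after, rsa_percent):
        # Assumed boundary convention: rsa >= cutoff means exposed.
        key = "exposed" if rsa >= cutoff else "buried"
        groups[key].append(math.dist(start, end))
    # An empty group yields NaN rather than raising a division error.
    return {k: (sum(v) / len(v) if v else float("nan")) for k, v in groups.items()}
```

Running the function separately on the Cα, Cβ, Cγ, and Cδ atoms would give one pair of averages per atom type, matching the layout of the bar chart described above.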
The implicit solvent model can also contribute to the high mobility of atoms or quick relaxation of the system because there are no explicit waters around the protein. Treatment of local structural errors such as chirality errors, cis peptide bonds, and ring penetrations is important for the formation of SS because these errors can prevent polypeptides from forming a proper hydrogen bond network. In Figure 6C, we show two examples, where SS were correctly formed by fixing cis peptide bonds (top panels) and ring penetrations (bottom panels). In particular, fixing ring penetrations is important to stabilize the MD simulation because these errors can easily cause SHAKE errors around the aromatic ring or penetrating covalent bonds. Chirality errors in the Cα atom can also disrupt the α-helix due to steric clashes between side chains.[66] Overall, we suggest that fixing these errors before the FFR and using the combination of the UA model and the implicit solvent model are useful for the efficient structure refinement of the decoys obtained from de novo modeling.

Discussion

Computational de novo modeling is usually useful for density maps at 3.5–5 Å. In fact, if the resolution of the density map is high enough to recognize the type of side chains (e.g., higher than 3.0 Å), manual de novo modeling should be possible by tracing the high-density points in the map. If the resolution is not so high (e.g., lower than 4.0 Å), computational de novo modeling can give us a good hint for reliable structure modeling. MAINMAST is useful for 4–5 Å or higher resolution maps because the backbone is recognized in such maps.[30] In this study, we employed the PDB coordinates determined from density maps at a slightly higher resolution (2.9–3.6 Å) because the native structure, which is needed to evaluate the protocols, is more reliable.

Although the obtained model can be refined with the flexible fitting, local structural errors such as chirality errors, cis peptide bonds, and ring penetrations should be fixed in advance to obtain a more realistic model. Fixing the errors can enhance the formation of SS during the refinement. We note that the errors are mainly generated when the full-atom model is constructed from the main-chain model. In fact, we found that the full-atom model constructed from the Cα-model predicted with Pathwalking contained some local structural errors in the tested systems, as in MAINMAST (see Table S3). On the other hand, there were no chirality errors, no ring penetrations, and a small number of cis peptide bonds in the Rosetta model, which directly constructs a full-atom model. Because more than 99.5% of peptide bonds in native proteins have the trans conformation,[67] we applied the dihedral angle restraint (ω = 180°) to all peptide bonds for simplicity. Spontaneous transition from cis to trans is difficult due to a high energy barrier.
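The ω dihedral angle that distinguishes the two peptide bond conformations (near 180° for trans, near 0° for cis) can be computed from the Cα(i), C(i), N(i+1), and Cα(i+1) positions. The following Python sketch is illustrative only: the function names are hypothetical, and the 90° cutoff on |ω| is an assumed classification rule, not necessarily the criterion used by the VMD cispeptide plugin or by GENESIS.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed dihedral angle (degrees) defined by four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    # atan2 form: stable for angles near 0 and 180 degrees.
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def is_cis_peptide(ca_i: Vec, c_i: Vec, n_next: Vec, ca_next: Vec,
                   cutoff_deg: float = 90.0) -> bool:
    """Classify a peptide bond as cis if |omega| is below the assumed cutoff."""
    return abs(dihedral_deg(ca_i, c_i, n_next, ca_next)) < cutoff_deg
```

A restraint with ω = 180° penalizes exactly those bonds for which `is_cis_peptide` returns `True` in this sketch.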
To examine the possibility of the transition in the decoys, we performed SAUA-FFR without the dihedral angle restraints starting from a structure in which almost all peptide bonds have the cis conformation. We observed that 50% of the peptide bonds still had the cis conformation after the refinement in the case of BmCPV-1, suggesting that the dihedral angle restraints are essential to fix the cis peptide bonds.

In the proposed SAUA-FFR scheme, iterative SA MD simulation is carried out using the UA model (CHARMM19 force field) in combination with the implicit solvent model (EEF1). One of the advantages of the UA model is its low computational cost compared to that of the AA model. The EEF1 model is also faster than the GB/SA model.[61] In fact, the benchmark performance of SAUA-FFR was 46.5 ns/day for FrhA using a typical Linux machine (Intel Xeon Gold 6130 2.10 GHz; 32 CPU cores), while it was 6.0 ns/day in SAAA-FFR. In the scheme, a short MD simulation (typically 10–100 ps) seems to be enough to obtain a converged structure because the implicit solvent model enables quick equilibration of the system. These features allow us to perform many parallel runs using a supercomputer. Another advantage of the UA model is that the model is already close to the atomic resolution, enabling an easy conversion to the AA model by just adding hydrogen atoms to the heavy atoms.

To date, various CG models, such as the Go-model,[68] PRIMO,[69] and SICHO,[70] have been utilized for structure modeling, including not only cryo-EM flexible fitting[24,36,71] but also general de novo protein structure prediction.[72] Although low-resolution molecular models are usually used to reduce computational cost or to enhance conformational sampling, they should eventually be converted to the AA model, which in turn may cause structural errors such as ring penetration. A multiscale protocol combining the CG and AA models can avoid such issues.
For example, targeted MD simulation is carried out starting from a certain conformation with the AA model using the Cα atom positions as the reference, as in the CryoFold algorithm[37] or our multiscale flexible fitting protocol proposed recently.[73] However, such a conversion scheme requires extra computation on top of the structure refinement.

In this study, we also proposed a new scheme for the best model selection, where the decoys obtained from multiple FFR runs with various force constants are screened using the combination of the c.c., RWplus score, and MolProbity score. This idea is based on the fact that we do not know the optimal force constant that can minimize overfitting. In the first screening, we use the c.c. to select the models that are fitted to the density map. In the second and third screenings, we try to find a model that has a protein-like conformation and minimal overfitting using the RWplus and MolProbity scores without considering the density map. For validation of the map-to-model quality, various algorithms have been proposed.[74−76] EMRinger is useful to evaluate the side chain modeling based on the consistency between the dihedral angle in the rotamer and the local density of the map.[77] The solvation free energy is also useful for scoring because it can evaluate the exposure of hydrophobic and hydrophilic residues on the protein surface.[78] We suggest that screening decoys through multiple steps and scores with and without the density map is essential in de novo modeling from cryo-EM density maps.

Finally, we discuss further improvements of our scheme. Among the eight proteins employed in this study, TRPV1 is the only membrane protein. For such cases, using the implicit membrane model[79] is more reasonable than the implicit water model. We applied the implicit micelle model (IMIC)[80] to TRPV1 in SAUA-FFR but found that the obtained results were similar to those in the implicit water model.
The structural properties of the best model were rmsd = 6.79 Å, c.c. = 0.716, reproducibility of SS = 55.5%, MolProbity score = 2.29, and RWplus score = −59,940 kcal/mol. This is presumably because the membrane environment does not significantly affect the movement of atoms in the flexible fitting. The fitting force still seems to dominate over the effect of the membrane or solvent. However, we expect that the solvation free energy calculated in the membrane environment is useful to select the best model or to filter out the decoys that have abnormal conformations, such as the shift of the transmembrane α-helices observed in our calculations. Yuzlenko and Lazaridis[81] and Dutagaci et al.[82] suggested using a scoring function that includes the solvation free energy calculated in the implicit membrane model for the discrimination of the native conformation from decoys of membrane proteins. The early stage of MAINMAST should also be improved by considering the effect of the solvent and/or membrane environment.

Conclusions

In this study, we propose the SAUA-FFR scheme for efficient FFR, in which c.c.-based flexible fitting with the iterative SA MD protocol is carried out using the UA model in combination with the implicit solvent model. To obtain a model with less overfitting, we carried out multiple FFR runs with various force constants ranging from weak to strong values and screened the decoys using a combination of the c.c., RWplus score, and MolProbity score. Our scheme showed progressive formation of SS owing to the high mobility of atoms in the UA model. Our new algorithm for fixing local structural errors also contributed to the correct formation of the hydrogen bond network in SS. We expect that our scheme is useful for reliable de novo structure modeling from cryo-EM density maps with low computational cost.

73 in total

1. iMODFIT: efficient and robust flexible fitting based on vibrational analysis in internal coordinates.

Authors: José Ramón Lopéz-Blanco; Pablo Chacón
Journal: J Struct Biol Date: 2013-08-30 Impact factor: 2.867

2. VMD: visual molecular dynamics.

Authors: W Humphrey; A Dalke; K Schulten
Journal: J Mol Graph Date: 1996-02

3. Implicit Micelle Model for Membrane Proteins Using Superellipsoid Approximation.

Authors: Takaharu Mori; Yuji Sugita
Journal: J Chem Theory Comput Date: 2019-12-09 Impact factor: 6.006

4. Discrimination of Native-like States of Membrane Proteins with Implicit Membrane-based Scoring Functions.

Authors: Bercem Dutagaci; Kitiyaporn Wittayanarakul; Takaharu Mori; Michael Feig
Journal: J Chem Theory Comput Date: 2017-05-11 Impact factor: 6.006

5. Molecular imprinting as a signal-activation mechanism of the viral RNA sensor RIG-I.

Authors: Bin Wu; Alys Peisley; David Tetrault; Zongli Li; Edward H Egelman; Katharine E Magor; Thomas Walz; Pawel A Penczek; Sun Hur
Journal: Mol Cell Date: 2014-07-10 Impact factor: 17.970

6. Exploring protein native states and large-scale conformational changes with a modified generalized born model.

Authors: Alexey Onufriev; Donald Bashford; David A Case
Journal: Proteins Date: 2004-05-01

7. GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations.

Authors: Jaewoon Jung; Takaharu Mori; Chigusa Kobayashi; Yasuhiro Matsunaga; Takao Yoda; Michael Feig; Yuji Sugita
Journal: Wiley Interdiscip Rev Comput Mol Sci Date: 2015-05-07