
Computational reconstruction of atomistic protein structures from coarse-grained models.

Aleksandra E Badaczewska-Dawid¹, Andrzej Kolinski¹, Sebastian Kmiecik¹.

Abstract

Three-dimensional protein structures, whether determined experimentally or theoretically, often have too low a resolution. In this mini-review, we outline the computational methods for protein structure reconstruction from incomplete coarse-grained to all-atom models. Typical reconstruction schemes can be divided into four major steps. Usually, the first step is reconstruction of the protein backbone chain starting from the C-alpha trace. This is followed by rebuilding of the side chains based on the protein backbone geometry. Subsequently, hydrogen atoms can be reconstructed. Finally, the resulting all-atom models may require structure optimization. Many methods are available to perform each of these tasks. We discuss the available tools and their potential applications in integrative modeling pipelines that can transfer coarse-grained information from computational predictions, or experiments, to all-atom structures.


Keywords: Coarse-grained modeling; Protein modeling; Protein reconstruction; Structure prediction; Structure refinement

Year: 2019 PMID: 31969975 PMCID: PMC6961067 DOI: 10.1016/j.csbj.2019.12.007

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Coarse-grained protein models (with some missing atomic details) are the outcome of many experimental or computational methods for the investigation of protein structures and their dynamics. For example, structures obtained via difficult comparative modeling and de novo simulation strategies often need further improvement. The complexity of protein systems demands a multiscale approach, which requires easy and fast conversion between models of various resolutions and accurate reconstruction of atomic details. Coarse-grained modeling tools offer high efficiency and make it possible to overcome the limitations of all-atom tools on accessible system sizes and simulation time scales [1]. All-atom resolution of protein structures is required for many practical structure-based studies, including drug design and protein design [2], [3], [4]. Therefore, the practical use of coarse-grained protein models and elastic network models requires integration with efficient tools for rebuilding atomic details [1], [157]. Ideally, the reconstruction procedure should be effective not only for regularly packed folded protein structures, but also for models of disordered or partially unfolded proteins [157], [158]. In this mini-review, we provide an overview of the available computational tools for reconstruction of all-atom protein structures from various levels of incomplete representation. The review is organized as follows. First, we present the typical reconstruction pipeline and visualize example coarse-grained protein models of various resolutions (Section 2.1). Then, we review the computational methods for consecutive reconstruction steps from low to high resolution levels: reconstruction from low-resolution representations and contact maps (Section 2.2), backbone reconstruction from the C-alpha trace (Section 2.3), side chain reconstruction from the backbone (Section 2.4), hydrogen atom reconstruction (Section 2.5) and final optimization/refinement of all-atom structures (Section 2.6).
The reconstruction methods are described and reviewed in Table 1.
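
The consecutive reconstruction steps outlined above can be read as a small graph problem: each tool converts one representation level into another, and a pipeline is a chain of such conversions that ends at an all-atom model. The sketch below is only an illustration under stated assumptions. The tool names and level pairs are a small, arbitrarily chosen subset copied from Table 1; the level codes follow the table's shorthand (CM, CA, BB, SC, AA), with SC read here as a heavy-atom model with side chains, which is an interpretation because the table footnotes are not reproduced. The function name `find_route` is hypothetical, and no tool is actually executed.

```python
from collections import deque

# (tool, input level, output level): a subset of Table 1, chosen for illustration.
# CM = contact map, CA = C-alpha trace, BB = backbone,
# SC = model with side chains (assumed reading), AA = all-atom model.
TOOLS = [
    ("CONFOLD2", "CM", "CA"),
    ("PconsFold", "CM", "AA"),
    ("BBQ", "CA", "BB"),
    ("SABBAC", "CA", "BB"),
    ("RASP", "BB", "SC"),
    ("OPUS_Rota2", "BB", "SC"),
    ("PULCHRA", "CA", "AA"),
    ("REDUCE", "SC", "AA"),
    ("HAAD", "SC", "AA"),
]


def find_route(start, goal="AA", tools=TOOLS, exclude=()):
    """Breadth-first search for the shortest chain of tools from start to goal.

    Returns a list of tool names, an empty list if start == goal,
    or None when no chain of the listed tools connects the two levels.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        level, route = queue.popleft()
        if level == goal:
            return route
        for name, src, dst in tools:
            # Follow each conversion once per target level, skipping excluded tools.
            if src == level and dst not in seen and name not in exclude:
                seen.add(dst)
                queue.append((dst, route + [name]))
    return None


# One-step route from a C-alpha trace, then a stepwise route without PULCHRA.
print(find_route("CA"))                        # ['PULCHRA']
print(find_route("CA", exclude=("PULCHRA",)))  # ['BBQ', 'RASP', 'REDUCE']
```

Breadth-first search returns the route with the fewest tools; when several tools cover the same step, the first one listed wins, so the order of `TOOLS` is a modeling choice rather than a recommendation.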

Table 1

Method, reference and year of the last publication	Software availability*	Reconstruction** task	Description***	Benchmark sets and comments***
Reconstruction from deeply coarse-grained representation or contact maps
CONFOLD[31], 2015; CONFOLD2[32], 2018	server (confold) + standalone (confold2): http://protein.rnet.missouri.edu/confold/ and https://github.com/multicom-toolbox/CONFOLD2/	CM → CA	The method translates contact maps into distance restraints and uses them as the input to a distance geometry algorithm, which builds tertiary structure models. CONFOLD2 predicts 200 models using various subsets of input contacts and selects the five top models by clustering them.	CONFOLD2 is an improved version of the CONFOLD method. Structure predictions for 150 proteins from the PSICOV dataset and for CASP12 targets showed that, for most protein sequences, CONFOLD2 was able to capture the structural fold of the protein.
FT-COMAR[30], 2008	standalone: http://bioinformatics.cs.unibo.it/FT-COMAR/	CM → CA	A heuristic procedure for building tertiary structure models from possibly erroneous and incomplete contact maps.	Tested on 100 non-redundant single-domain protein chains (α, β, α+β, α/β; size from 55 to 786 residues) from SCOPE release 1.67. FT-COMAR is much more tolerant to underprediction than to overprediction of contacts. It can ignore up to 75% of the contact map and still compute a protein structure whose RMSD_CA < 4 Å (assuming that the remaining 25% contains no errors).
GDFuzz3D[33], 2015	server + standalone: http://iimcb.genesilico.pl/gdserver/GDFuzz3D/	CM → AA + optimization	The method transforms contact maps into distance restraints and uses them as the input to the MODELLER method [44], which generates protein models, and the REFINER method [138] for structure refinement.	Tested on 45 single-domain targets analyzed in the CASP10 experiment and 150 proteins of the PSICOV dataset. The tests showed that GDFuzz3D is slightly more accurate (based on TM-score and RMSD) than FT-COMAR and slightly inferior to PconsFold but more computationally efficient.
PconsFold[34], 2014	standalone: https://github.com/ElofssonLab/pcons-fold	CM → AA	Merges the PconsC contact prediction tool [139] and the ROSETTA protein modeling tool [140]. The method has no intermediate stages of reconstruction.	Tested on 150 proteins (from 52 to 266 residues) of the PSICOV dataset. The input sequence can come from a PDB header (instead of an ATOM section) to avoid internal gaps in the chain. This approach enables protein structure prediction of single-domain targets. PconsFold performance was also compared to that of GDFuzz3D [33].
SICHO[36], 2000	standalone: http://blue11.bch.msu.edu/mmtsb/rebuild.pl	SICHO → AA	Method for reconstruction from the SICHO coarse-grained model (see Section 2.2 and Fig. 2). Uses a library of fragments and a side chain center-based coordinate system to rebuild Cα positions and a complete backbone. Chooses side chain conformations from a rotamer library.	Tested on 13 high-resolution X-ray structures. Reconstruction quality RMSD_CA: < 0.6 Å on experimental structures.
SUReLib, 2019	standalone: http://biocomp.chem.uw.edu.pl/tools/surpass	SURPASS → CA	Method for reconstruction from the SURPASS coarse-grained model (see Section 2.2 and Fig. 2). Uses a knowledge-based library of 6-residue fragments and structural regularities observed in known protein structures.	Tested on PISCES_4600, BAKER_62 and other various proteins (α, β, α+β, α/β; size from 56 to 1016 residues). Reconstruction quality RMSD_CA: < 0.5 Å on experimental structures and 1–2 Å on distorted models.
Backbone reconstruction from CA-trace
BBQ[38], 2007	standalone: http://biocomp.chem.uw.edu.pl/tools/bbq	CA → BB	Uses a library of 5148 backbone 4-residue fragments (quadrilaterals) and the algorithm described by Milik et al. [141] with some modifications. All quadrilaterals are pre-computed (as Cα distances and a local coordinate system) and stored in a table. The algorithm is sequence-independent.	Tested on 81 non-redundant experimental protein structures and near-native decoys. Reconstruction quality RMSD_BB < 0.7 Å on experimental structures. Available as part of the Bioshell package. The algorithm is implemented in the Java programming language. BBQ performance was also compared to that of PD2 and other tools [39] and can be improved by additional minimization [52].
BriX[42], [142], 2010	standalone	CA → BB	Uses a library of high-resolution structural fragments between 4 and 14 residues long and a local fit approximation algorithm. The newer version [142] uses the additional Loop BriX database of non-regular structure elements (loops) and fragments from over 7000 non-homologous proteins from the Astral set. User-provided structures can be covered on the fly with BriX fragments; in particular, gaps or low-confidence regions in these structures can be bridged.	Tested on all known human structures from the PDB (935, Park & Levitt protein set), with a global 0.48 Å RMSD [42] (improving existing results using smaller libraries [48], [143]) and over 300 protein-peptide complexes from the PepX database within 1 Å RMSD [144]. Irregular loop regions can be reconstructed from smaller (4–8 residues long) building blocks.
PD2 ca2main[39], 2013	server + standalone: http://www.sbg.bio.ic.ac.uk/~phyre2/PD2_ca2main/	CA → BB + optimization	Uses a library of 528 short backbone fragments obtained using Gaussian mixture models (GMMs). The accuracy of reconstruction can be improved by additional (optional) energy gradient minimization.	Tested on 15 low-resolution and 28 high-resolution protein structures. Reconstruction quality RMSD_BB < 0.4 Å on experimental structures. When combined with Rosetta, the PD2 method produced significantly lower energy all-atom models than other tested tools. Besides the built-in minimization, another minimization scheme has been successfully tested [52]. The algorithm is implemented in the C++ programming language.
SABBAC[40], 2006	server: http://bioserv.rpbs.jussieu.fr/SABBAC.html	CA → BB	Uses a 27-letter hidden Markov model-derived structural alphabet described by 155 backbone fragments from known protein structures and a greedy algorithm (based on the OPEP force field) to obtain an optimal combination of fragments. Cα-trace coordinates remain unaffected and only the missing backbone atoms are added. No further refinement is performed.	Tested on the Adcock subset of 14 proteins from 58 to 437 residues and a 7 PDB newcomers subset up to 666 residues. Reconstruction quality RMSD_BB is near 0.4 Å for experimental structures. The algorithm is robust to CA deviations (for Cα-traces randomly perturbed by over 1 Å, SABBAC results were only marginally affected). SABBAC enables reconstructing single polypeptide chains.
Side chains reconstruction from backbone
CIS-RR[67], 2011	server	BB → SC	Uses the Dunbrack backbone-dependent rotamer library, a SCWRL3-based scoring function and clash-reduction guided iterative search (CIS) with conjugate gradients optimization of rotamers (rotamer relaxation, RR). CIS-RR detects cysteine pairs that form disulfide bonds.	Tested on 180 proteins (SCWRL3 test set) and 65 high-resolution crystal structures of proteins. Compared to other tools (SCWRL4, IRECS and SCAP), its reconstruction accuracy is similar, but it removes atomic clashes much more effectively. Also evaluated and compared with other tools in work [95].
IRECS[76], 2007	standalone: https://irecs.bioinf.mpi-inf.mpg.de/index.php	BB → SC	Uses a coarse‐grained backbone‐dependent rotamer library, heuristic greedy iteration scheme and effective score (based on knowledge‐based scoring term ROTA 10 Å) for ranking all SC rotamers according to the probability of rotamer conformation.	Tested on 641 high resolution X-ray structures (194 with single conformation for all SCs and 447 with at least one SC of multiple conformations). Reconstruction accuracy similar to SCWRL3 and SCAP, RMSD_SC ~1.5 Å. Allows the use of additional template of side-chain conformations.
NCN[92], 2004	standalone: available on request from the authors, https://www.med.upenn.edu/wandlab/research.html	BB → SC	Uses optimized OPLS parameters for long-range and multi-body terms (van der Waals and electrostatic terms), hydrogen-bonding potential and frequency of rotameric states from PDB. The library contains 49,042 discrete rotamers.	Tested on 65 high resolution X-ray structures. Highly accurate tool for SC reconstruction (RMSD_SC: ~1 Å).
OPUS_Rota2[81], 2019; OPUS_Rota[96], 2008	standalone: http://ma-lab.rice.edu/soft.php	BB → SC	Uses rotamer frequency and van der Waals potentials and two additional unique pairwise energy terms: short-range orientation-dependent (OPUS-PSP) for side chain packing interactions and explicit solvation effects. In the newer OPUS_Rota2 version, OPUS-PSP has been replaced by the OPUS-DASF term, which describes relative positions of atoms on the side chains.	Tested on 65 high resolution X-ray structures and a 379-protein PISCES subset (sequence identity 30%, 1.8 Å) [77], [81]. In the native test sets, OPUS_Rota2 was more accurate than other methods (OPUS_Rota, SCWRL4, OSCAR-star variants) but slightly less accurate than OSCAR-o. In non-native test sets (with added random noise to the main-chain torsional angles) OPUS_Rota2 was more accurate than any other tested method and also several times faster (except Upside).
OSCAR[97], 2011	standalone: https://sysimm.ifrec.osaka-u.ac.jp/OSCAR/	BB → SC	Uses a flexible (-o, slow modeling) or rigid (-star, fast modeling) rotamer model. The energy terms include distance and orientation-dependent potentials and side chain dihedral angle potential energy function. The library of sub-rotamers was derived by perturbation of dihedral angles of rotamers from Dunbrack and Cohen [145].	Tested on 218 proteins and a RAPPER decoy set. OSCAR had similar accuracy in SC reconstruction for experimental structures as other available software and good accuracy in selecting near-native conformations from loop decoys. Also evaluated and compared with other tools in work [73].
PEARS[82], 2018	server: http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/pears	BB → SC	Uses a position-dependent, antibody-specific rotamer library based on the dependency of the side chain χ₁ angle on its immunogenetic position. The method is robust to uncertainties in the model backbone and detects disulphide bridges. SC clashes are reduced during 200 rounds of Gaussian relaxation.	Tested on a set of 639 non-redundant and a blind set of 95 antibody structures. The approach is comparable to SIDEpro, RASP and SCWRL in reconstructing the side chains of crystal structures, while on computationally designed models PEARS achieves the highest average accuracy and the smallest number of clashes.
RASP[83], 2011	standalone	BB → SC + rotamer optimization	Uses a backbone-dependent rotamer library, optimized energy terms and a clash elimination strategy to guide the optimization of side chain conformations. Combinatorial search includes dead-end elimination, graph theory-based, branch-and-terminate, backtrack and Monte Carlo algorithms.	Tested on 2412 high-resolution (≤1.8 Å) structures with complete side chains obtained from the PISCES server. RASP had comparable prediction accuracy (% chi1, % chi1+2, RMSD) and returned much fewer clashes than SCWRL4, OPUS-Rota or IRECS. It was also much faster than these methods, but an order of magnitude slower than Upside. RASP performance was also evaluated and compared with other tools in works [73], [95].
SCAP[98], 2001	standalone: http://honig.c2b2.columbia.edu/jackal	BB → SC	Heuristic approach using optimized CHARMM parameters for van der Waals torsion-angle terms in an iterative repacking protocol. The library contains 7562 discrete rotamers in terms of 1) Cartesian coordinates, 2) dihedral angles.	Tested on 33 high resolution protein structures (66–328 residues) not included in the creation of rotamer library. For multi-chain proteins, only the first chain was used. Reconstruction quality RMSD_SC < 2 Å.
SCATD (TreePack)[79], 2005	standalone: https://ttic.uchicago.edu/~jinbo/TreePack.htm	BB → SC + rotamer optimization	Uses a backbone-dependent rotamer library (the same as SCWRL3), interaction scores by dead end elimination and energy minimization by tree decomposition. This tool does not attempt to regularize the backbone geometry or solve punched rings.	Tested on 180 experimental structures from the SCWRL3 benchmark set of proteins. This approach was several times faster than SCWRL3, especially on larger proteins or cases with heavy atomic clashes. SCATD is freely available and was only tested on a Debian Linux machine.
SCWRL4[77], 2009	standalone: http://dunbrack.fccc.edu/scwrl4/	BB → AA + rotamer optimization	Uses a backbone-dependent rotamer library based on kernel density estimates to provide rotamer frequencies and torsional angles, a tree decomposition algorithm to solve the side chain packing problem, specific potentials (anisotropic hydrogen-bonding, soft pairwise van der Waals), and fast collision detection. Allows consideration of the crystal symmetry in the side-chain conformation prediction. SCWRL4 is perhaps the most widely used SC reconstruction method, as shown by its high citation count.	Optimized on a set of 100 protein structures and tested on 379 X-ray structures with electron densities available from UEDS [146]. SCWRL4 performance was evaluated and compared with other tools in works [73], [95]. SCWRL4 is also available as a dynamic-linked library for incorporation into other software. In comparison to its earlier version SCWRL3, SCWRL4 can be slower but converged in all cases tested, while SCWRL3 sometimes did not converge [77]. The software is freely available for academic research on request.
SIDEpro[99], 2012	server + standalone: http://sidepro.proteomics.ics.uci.edu/ and http://scratch.proteomics.ics.uci.edu/	BB → SC + rotamer optimization	Uses a machine learning approach based on 156 neural networks that are trained to compute an energy function based on pairwise contact distances and a backbone-dependent rotamer library (the same as OPUS-Rota [96]). The neural networks set the side-chains to the highest probability rotamers. The final optimizing procedure removes steric clashes.	Tested on the SCWRL4 benchmark set (379 proteins), 94 proteins from CASP9, 7 large protein complexes and a ribosome with and without RNA. SIDEpro can use non-standard amino acids, post-translational modifications and external ligands. It was several times faster and slightly better in accuracy than SCWRL4 and its RMSD_SC remained ~1.0 Å also for complexes. SIDEpro performance was also evaluated and compared with other tools in work [95].
Upside[100], 2018	standalone: https://github.com/sosnicklab/upside-md	BB → SC + rotamer optimization	Uses side chain free energy in a molecular dynamics simulation scheme. During the optimization of side chain packing, each rotamer state is represented by a single oriented CG bead (3 spatial and 2 orientation coordinates). Uses a combination of isotropic (excluded volume) and directional interactions (chemical character, e.g. polar, aromatic) for each pair of interacting side chains or backbones. The side chain model is trained by the maximum-likelihood scheme. The NDRD rotamer library [70] is used to define the atomic positions of side chains.	Tested on a large, non-redundant set of crystal structures of globular proteins from the PDB with 50–500 residues and resolution < 2.2 Å (6255 chains). The method gave similar accuracy of chi1 angle as SCWRL4 and RASP, but is several (1–3) orders of magnitude faster.
All-atom reconstruction from CA-trace
ca_to_allatom (ROSETTA)[43], 2008	standalone: https://www.rosettacommons.org/	CA → AA + AA optimization	The Rosetta protocol ca_to_allatom reconstructs AA structure and performs structure refinement. Uses the initial Cα-trace (with a user-defined parameter specifying how far Cα atoms are allowed to deviate from the initial model) and rigid-body perturbation of secondary structure fragments from known protein structures. The protocol includes optional loop remodeling (centroid mode) and all torsion angle minimization (all-atom).	Tested on 8 proteins (from 101 to 310 residues) from cryoEM maps at 5 and 10 Å resolution. Original Cα positions are slightly changed during the reconstruction process by harmonic oscillation [147]. Successfully used in protein reconstruction from experimental data [43]. Available as an executable in the bin directory of ROSETTA package (bin/ca_to_allatom.version).
CG2AA[37], 2016	standalone	CA → AA + SC optimization	Uses a strictly geometric approach based on Cα triplets and parameters from the Amber03 force field for rebuilding the protein backbone and Cβ. The side chain is rebuilt based on the definition of the united atom for the side group.	Tested on 5 experimental protein structures with reconstruction quality RMSD_BB: <0.8 Å; the stability of reconstructed models has been tested in MD simulations. The algorithm is implemented in Python.
Modeller[44], 2016	standalone: https://salilab.org/modeller; Modeller-based reconstruction script: https://bitbucket.org/lcbio/ca2all	CA → AA + AA optimization	Uses protein template(s) in CG representation (which can be a Cα-trace) to create a set of distance restraints that guide the reconstruction. Stereochemical restraints (bond lengths and angles) are obtained from the CHARMM force field and statistical analysis of known structures. MODELLER employs various structure optimization techniques.	Available as part of the Modeller package. The algorithm is implemented in Python. The Modeller-based script ca2all [148] is used by the CABS-flex and CABS-dock multiscale modeling tools [149], [150] for reconstruction of protein or protein-peptide models.
ModRefiner[53], 2011	server + standalone: https://zhanglab.ccmb.med.umich.edu/ModRefiner/	CA → AA + AA optimization	Reconstructs and refines protein structures, first the BB only and, after adding SC, the entire structure. Both side-chain and backbone atoms are flexible during refinement simulations, while conformational search is driven by a physics- and knowledge-based force field. It can optionally use secondary structure assignment/prediction to drive the refinement. The method can start from the CA, BB or SC model.	Tested on 261 proteins up to 150 residues (148 hard targets for I-TASSER and 113 with good templates). Compared to other tools, ModRefiner was better in side chain packing and improving hydrogen-bonding networks. Input CA coordinates can have unphysical distortions. A standalone tool enables reconstruction of dimeric proteins, while the server handles only single-chain proteins.
PULCHRA[54], 2008	standalone: http://cssb.biology.gatech.edu/skolnick/files/PULCHRA/index.html	CA → AA + AA optimization	Uses a backbone fragment library, a rotamer library and the backbone reconstruction algorithm described by Milik et al. [141] with some modifications. The initial Cα-trace and reconstructed backbone are minimized to improve hydrogen-bonding networks. Positions of SC united atoms (center of mass) can be used to improve the accuracy of full-atomic reconstruction.	Tested on 30 high-quality X-ray structures (reconstruction quality RMSD_AA 1.0–1.5 Å) and on a set of 500 low-resolution protein models. Initial Cα coordinates can be distorted. This approach enables reconstruction of multi-chain models or a chain with breaks and solves punched rings. The algorithm is implemented in the C programming language.
RACOGS[55], 2007	server available on request from the authors http://www.kavrakilab.org/software.html	CA → AA + AA optimization	Uses a geometric approach to place the backbone atoms at the average positions derived from known protein structures (based on the algorithm by Milik et al. [141] and Feig et al. [36]) with SC reconstruction using backbone dependent, coordinate rotamer libraries (algorithm described by Xiang and Honig [98]). The final stage of the procedure includes the addition of all hydrogen atoms and short all-atom minimization.	Tested on CG trajectories of SH3, S6 systems and a subset of 2945 non-redundant experimental structures from PDB. This approach enables reconstruction of all-atom details from large regions of the protein folding landscape as folded, partially folded or random protein structures.
REMO[41], 2009	server + standalone: https://zhanglab.ccmb.med.umich.edu/REMO/	CA → AA + AA optimization	Uses a backbone isomer library (528,798 fragments) and a backbone-dependent rotamer library (SCWRL) for atomic details reconstruction. Backbone rebuilding stage includes removing steric clashes and optimizing the hydrogen-bonding network based on a consensus of PSIPRED preferred secondary structure distribution.	Tested on 230 non-redundant proteins up to 300 residues (experimental and CG decoys generated by I-TASSER in the CASP8). This approach can remove steric clashes, retain correct topology and improve the backbone hydrogen-bonding network.
Hydrogen atom reconstruction
CHIMERA (AddH)[104], 2004	standalone: http://www.cgl.ucsf.edu/chimera/index.html	SC → AA	Adds missing hydrogen and OXT atoms. Uses the atom types and steric-only or H-bonds (default, slower) criterion to determine the number and positions of added hydrogens. Bond lengths are taken from Amber parm99 parameters.	The positions of pre-existing atoms are not changed. Protonation states of certain ionizable side chains can be specified at specific pH (default: physiological). This software is also a molecular visualization tool.
CNS[113], 1998	standalone: http://cns-online.org/v1.3/	SC → AA	The algorithm starts from random positions of hydrogen atoms and optimizes them using an iterative procedure of molecular dynamics simulations and Powell energy minimization steps. The energy function includes bonded terms and van der Waals.	The method is able to compute also electrostatic interactions if the required parameters are provided. It is flexible hierarchical software for macromolecular structure determination, especially crystallographic refinement or NMR structure calculations using NOEs, J-coupling or chemical shifts.
Computational Titration[106], 2009	server	SC → AA + AA optimization	Uses a force field with the concept of hydropathic interactions (HINT) as its noncovalent force field and exhaustive enumeration for optimization. The method uses coordinate data for the protein, ligand and bridging water molecules (if available) and predicts the best combination of protonation states for each ionizable residue and/or ligand functional group as well as the Gibbs free energy of binding for the ionization-optimized protein-ligand complex.	Tested successfully in modeling binding affinities of protein-ligand complexes: β secretase (2va7), mutant HIV-1 reverse transcriptase (2opq) and human sialidase NEU2 complexed with an isobutyl ether mimetic inhibitor (2f11). The method improves optimization of protonated amines and phosphines and supports the use of additional functional groups such as phosphates, sulfates, nucleotide backbone phosphates and sugars.
GROMACS (pdb2gmx)[114], 2001	standalone: http://www.gromacs.org/Downloads	SC → AA + AA optimization	Uses a geometry-based approach and performs molecular dynamics simulations. Uses different bond lengths and angles according to selected force field parameters. The energy function includes bonded terms, van der Waals and electrostatics.	The method enables optimization of histidine protonation states by attempting to satisfy neighboring hydrogen bonds. Water hydrogen atoms are also predicted. GROMACS is very fast at calculating non-bonded interactions.
HAAD[107], 2009	server + standalone: https://zhanglab.ccmb.med.umich.edu/HAAD/	SC → AA + AA optimization	Combines local geometry restraints and a conformational search that minimizes atomic overlap, encourages hydrogen bonding and optimizes electrostatic interactions. Local geometries of the initial positions of H-atoms are taken from the CHARMM22 force field.	Tested on three sets of experimental data: high-resolution X-ray crystallography, structures from neutron diffraction, and NOE proton-proton distance restraints. Compared with other methods (CHARMM and REDUCE) HAAD was faster and had significantly higher accuracy and better compatibility with NOE restraints. The algorithm is implemented in the FORTRAN90 programming language.
Hbuild (CHARMM) (X-PLOR)[115], [117], [151], 2005	standalone: http://charmm.chemistry.harvard.edu/ and https://nmr.cit.nih.gov/xplor-nih/doc/current/xplor/	SC → AA	Searches hydrogen atom positions at intervals of 10° (ϕ = 10) or 3° (ϕ = 3) around the axis of a cone with a side equal to the bond length or places hydrogens using geometric criteria. Uses different bond lengths and angles according to the selected version of the CHARMM force field. The energy function includes torsion angle, van der Waals and electrostatics.	All hydrogen atoms (including non-polar) are described explicitly. Water hydrogen atoms are also predicted. This approach is quite fast and can be used before running molecular dynamics calculations or during large-scale homology modeling. The Hbuild algorithm is used in CHARMM and X-PLOR software packages.
MCCE2[152], 2009; MCCE[108], 2002	standalone	SC → AA	Uses a geometry-based and molecular mechanics approach to place all non-hydroxyl hydrogen atoms. For hydroxyl and water hydrogens it uses systematic search of torsion angles. The energy function includes torsion angle (from CHARMM), van der Waals, solvation and continuum electrostatics.	Cysteine residues cannot be treated as disulfide bridged. This is a slower but more accurate approach that can be used for studies involving a specific protein, especially when the protonation states of ionizable residues and orientations of buried hydroxyls are relevant.
PyMOL, DeepView (SPV)[105], 1997	standalone: https://pymol.org/2/ and https://spdbv.vital-it.ch/disclaim.html	SC → AA	Molecular visualization tools that use only geometric criteria, without minimization.
Protonate3D[109], 2009	standalone available on request from the authors http://www.chemcomp.com	SC → AA + AA optimization	Predicts hydrogen geometry, ionization, and tautomer states for macromolecular structures based on 3D coordinates. The energy model includes van der Waals, electrostatics, solvation, rotamer, tautomer, and titration effects. Optimal states are chosen according to a chemical model derived from the MMFF94 force field.	Tested on ultra-high resolution X-ray structures. The method considers side-chain flip, rotamer, tautomer, and ionization states of all chemical groups, ligands, and solvent based templates are available in a parameter file. Close contacts and other poor geometry may cause structure distortions. The tool is not available for free.
Protoss[153], [154], [110], 2014	server: https://proteins.plus/	SC → AA + AA optimization	Adds hydrogen atom positions based on optimal hydrogen bond networks in the protein-ligand interface. Networks are modeled as graphs. Uses an efficient dynamic programming approach that stores partial solutions and combines them into globally optimal solutions. The algorithm is split into two phases: initialization (performed only once) and optimization.	Can be used to model the protein-ligand interface. Predicted hydrogen positions were compared with those in high-resolution protein structures (the test set consisted of 34 hydrogen atoms from seven protein structures). This approach does not work well on strongly interconnected graphs (1ps3). Samples 60 orientations for a water molecule. The tool is faster than Protonate3D.
REDUCE (MolProbity)[111], [155], 2010	standalone: http://kinemage.biochem.duke.edu/software/reduce.php; a part of the MolProbity server: http://molprobity.biochem.duke.edu/	SC → AA + AA optimization	Adds hydrogens based on expected atomic geometry lengths and angles. Places hydrogens to optimize local H-bonding networks, avoid steric overlaps and detect the correct orientations of side chains for NQH residues, as well as imidazole ring, OH, SH, NH3+, Met methyls, HET groups. The protonation state of histidine is adjusted based on the local environment.	Both proteins and nucleic acids can be processed. This approach is also efficient when a more intensive approach is desired. MolProbity evaluates X-ray and NMR structures (ensemble structures of up to 80 models, accepts an mmCIF file and automatically converts it to the PDB hybrid36 format) and rebuilds the model by removing outliers as part of the refinement cycle.
WHAT IF[112], 1990	server: https://swift.cmbi.umcn.nl/servers/html/index.html (select option: Hydrogen, then Add Protons)	SC → AA + AA optimization	Adds all missing hydrogens to the structure. It contains several servers which additionally compute all possible hydrogen bonds, but by default they do not determine which bonds would be most favorable.	Uses the Optimal Hydrogen Bonds server for computing the best possible hydrogen bond network. The program works much slower when the system contains many water molecules. Intended for Linux systems.
Reconstruction from coarse-grained protein complexes with other biomolecules
BACKWARD[132], 2014	standalone: http://cgmartini.nl/index.php/back	Protein-lipid MARTINI → AA	Method for reconstruction from the MARTINI coarse-grained representation of protein-lipid systems. Uses a strictly geometric approach based on Cα triplets for rebuilding the protein backbone from coarse-grained beads. It is possible to map from MARTINI CG to united-aliphatic atom (GROMOS) or all-atom (CHARMM, AMBER) representation of single and multimeric proteins.	Tested on 6 systems including lipid bilayers, proteins in solution (YvoA), membrane proteins (ASIC) and peptides (WALP). Reconstruction quality RMSD_BB: <0.6 Å. The approach enables integral backmapping and reconstructing complete systems, including the solvent.
Stansfeld & Sansom[133], 2011	Standalone available on request from the authors MemProtMD database: http://memprotmd.bioch.ox.ac.uk/	Protein-lipid CG → AA optimization	Method for reconstruction from the MARTINI coarse-grained representation of protein-lipid systems. Uses fragment-based libraries for reconstructing CG complex protein-lipid bilayer systems. The protocol starts from the MARTINI CG model and uses all-atom force fields such as CHARMM36, GROMOS and OPLS for final energy minimization in MD simulations. Atomic details of protein structure are obtained by using MODELLER or PULCHRA. Higher resolution of lipids is provided by a library of atomistic lipid fragments.	Tested on 10 membrane protein-lipid bilayer systems of different size and complexity, generated by self-assembly CGMD simulations (leuT, aquaporin, ELIC, ASIC, Cyt Ox, KcsA, SERCA, β₂AdR/lysozyme, OmpC, OSC). This approach does not attempt to convert united water particles. The algorithm is implemented in the Perl programming language.
Shimizu & Takada[134], 2018	Standalone available on request from the authors	Protein-DNA CG → AA optimization	Method for reconstruction from coarse-grained representation of protein-DNA complexes. Uses a DNA fragment library to reconstruct all-atomic details of DNA and optimize side chain orientations of the protein-DNA interface. Other fragments of protein structure are modeled with PD2 and SCWRL4. The final stage of the procedure includes the addition of all hydrogen atoms by gmx (pdb2gmx) [156]. The method reconstructs atomic details from a CG protein-DNA complex (CafeMol representation), where an amino acid is replaced by a single bead at the Cα position and a deoxyribonucleotide by three beads for the sugar, phosphate and base.	A library of 22,347 DNA fragments is derived from high-resolution X-ray structures from PDB. Tested on 180 complex protein-DNA experimental structures with single or multiple DNA chains and CG models obtained from CGMD simulations. This approach provides the tilt of a base plane well and proper Watson-Crick base pairing of hydrogen bonds and maintains the initial protein-DNA interface. It should also be applicable to other complexes such as protein-ligand or multi-protein systems.

* links to web servers or standalone methods have been provided only if working at the time of writing this publication.

** reconstruction tasks realized by outlined methods are summarized in the third column using the following shortcuts: contact map (CM), alpha carbon atoms (CA), backbone atoms (BB), backbone and side chain atoms (SC), all-atom representation that includes backbone, side chain and hydrogen atoms (AA), coarse-grained representation (CG).

*** some major or unique features are bolded for readers' convenience.

Overview of protein reconstruction methods. The accuracy of some methods is evaluated using RMSD values between reconstructed and reference structures measured on: alpha carbons (RMSD_CA) or backbone (RMSD_BB) or side chain (RMSD_SC) heavy atoms. The accuracy of side chain reconstruction is also evaluated using chi angles, the first (χ1) and the second (χ2, if applicable).

Protein structure reconstruction methods

Stages of protein reconstruction

Fig. 1 shows a typical reconstruction pipeline used in multiscale modeling methods that merge coarse-grained protein modeling tools with all-atom modeling. Coarse-grained protein models can present different levels of resolution [1]. In low-resolution models (such as SICHO [5], [6] or SURPASS [7], [8]), the coarse-graining can be so deep that even the explicit positions of alpha carbons are not represented (see Fig. 2). In such cases, structure reconstruction requires an additional stage that determines the C-alpha trace from the united atoms that encode deeply averaged fragments of protein structure. This is not a trivial task, because there is no unambiguous mathematical formula or simple geometric rule for it. Nevertheless, determining the C-alpha trace as accurately as possible is crucial for the subsequent reconstruction of the all-atom structure. C-alpha atoms are explicitly present in the majority of medium-resolution coarse-grained models (such as CABS [9], UNRES [10], [11], AWSEM [12] or MARTINI [13], see Fig. 2) and in C-alpha based elastic network models [157]. In these cases, the reconstruction procedure starts from the C-alpha trace level. Higher-resolution coarse-grained models, such as ROSETTA-centroid [14] (see Fig. 2), OPEP [15], PRIMO [16] or PaLaCe [17], require side-chain reconstruction from protein backbone coordinates.
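The dependence of the required stages on the starting resolution can be summarized in a few lines of code. The following is only an illustrative sketch: the level labels CA, BB, SC and AA follow the shortcuts used in Table 1, whereas the label `LOW` (a deeply coarse-grained model without explicit C-alpha atoms), the stage names and the function `remaining_stages` are hypothetical names introduced here.

```python
# Resolution levels ordered from coarsest to finest. "LOW" is a hypothetical
# label for models without explicit C-alpha atoms (e.g. SICHO, SURPASS).
LEVELS = ["LOW", "CA", "BB", "SC", "AA"]

# Stage that lifts a model from one level to the next (illustrative names).
STAGES = {
    ("LOW", "CA"): "C-alpha trace reconstruction",
    ("CA", "BB"): "backbone reconstruction",
    ("BB", "SC"): "side-chain reconstruction",
    ("SC", "AA"): "hydrogen atom reconstruction",
}


def remaining_stages(start_level, optimize=True):
    """Return the ordered stages needed to go from start_level to all-atom."""
    if start_level not in LEVELS:
        raise ValueError(f"unknown resolution level: {start_level}")
    first = LEVELS.index(start_level)
    stages = [
        STAGES[(LEVELS[k], LEVELS[k + 1])]
        for k in range(first, len(LEVELS) - 1)
    ]
    if optimize:
        # Optional final refinement of the all-atom structure.
        stages.append("all-atom optimization")
    return stages


for level in ("LOW", "CA", "BB"):
    print(level, "->", remaining_stages(level))
```

A C-alpha based model such as CABS would enter this scheme at `CA`, while a model with an explicit backbone would enter at `BB`.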

Fig. 1

Typical stages of protein structure reconstruction. The required range of reconstruction stages depends on the resolution of the initial models. For some deeply coarse-grained (CG) models, the first step is to reconstruct positions of C-alpha (CA) atoms. For most medium resolution CG models, recovering atomistic details starts with backbone (BB) reconstruction from the CA atoms that is followed by side-chain (SC) reconstruction and, subsequently, adding hydrogen atoms. The geometry of the final all-atom structure can be further improved using various refinement techniques.

Fig. 2

Example tripeptide presented in all-atom and corresponding coarse-grained resolutions. Various coarse-grained modeling tools are shown: Rosetta-centroid, MARTINI, CABS, UNRES, SICHO and SURPASS. Note that most coarse-grained models use explicit positions of (pseudo) atoms while ROSETTA uses a set of torsional angles φ, ψ, ω to describe backbone geometry. The legend explaining the colors of atoms and pseudoatoms is presented in the top right.

Reconstruction from low-resolution models and contact maps

Reconstruction from low-resolution coarse-grained protein models is a significant challenge, and the approach depends on how the specific model simplifies the structure. For example, the SICHO [5], [6] coarse-grained protein model (see Fig. 2) is based on the assumption that the protein spatial structure is determined and maintained by interactions between packed side chains. A single united atom per residue is located at the center of mass of the side group. Based on the side-chain center positions, the C-alpha trace and backbone heavy atoms can be reconstructed using a set of geometric criteria (for more details see the SICHO method in Table 1). Another low-resolution model, SURPASS [7], [8], averages short, 4-residue fragments of secondary structure into a single united atom located at their center of mass. As a result, the representation of regular secondary structure elements (α-helices and β-strands) in this model is almost linear. The procedure for recovering the C-alpha trace from the SURPASS representation uses the SUReLib library (see Table 1), which consists of short fragments differentiated by the type of secondary structure. The positions of rebuilt C-alpha atoms maintain correct geometry and spatial orientation. Therefore, the reconstructed C-alpha trace can be used as a source of restraints (distances, angles or contacts) for higher-resolution models or directly reconstructed to atomic resolution using the available tools. Protein contact maps, generated by contact prediction methods, are another kind of low-resolution protein model [18]. Contact maps are usually defined as binary entries or distance maps between Cα or Cβ atoms [18]. Distance restraints can also be derived from the analysis of low-resolution experimental data (SAXS [19], [20], [21], NMR [22], cryo-EM [23], XL-MS [24], HDX-MS [25], [26]).
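The effect of SURPASS-style averaging described earlier in this section can be illustrated with a toy calculation. This is a sketch, not the SURPASS implementation: it assumes that the averaging runs over C-alpha positions with equal weights in a sliding window of four residues, and it uses an idealized α-helix (radius 2.3 Å, rise 1.5 Å and 100° rotation per residue) as test data. The function names are hypothetical.

```python
import math


def ideal_helix_ca(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha coordinates of an idealized alpha-helix along the z axis."""
    coords = []
    for i in range(n):
        angle = math.radians(i * turn_deg)
        coords.append(
            (radius * math.cos(angle), radius * math.sin(angle), i * rise)
        )
    return coords


def window_centers(coords, window=4):
    """Geometric centers of every run of `window` consecutive points."""
    centers = []
    for i in range(len(coords) - window + 1):
        chunk = coords[i:i + window]
        centers.append(
            tuple(sum(p[k] for p in chunk) / window for k in range(3))
        )
    return centers


helix = ideal_helix_ca(12)
pseudo = window_centers(helix)

# Distance of each united atom from the helix axis (the z axis here).
off_axis = [math.hypot(x, y) for x, y, _ in pseudo]
print(len(pseudo), "united atoms, max distance from axis:",
      round(max(off_axis), 2))
```

For this idealized helix the united atoms lie roughly 0.26 Å from the helix axis, compared with the 2.3 Å radius of the C-alpha trace, which is why the averaged representation of a helix is almost linear.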
Prediction of contact maps (and their application in protein structure modeling) has become more accurate and effective through evolutionary coupling analysis (e.g. direct coupling analysis, DCA) of multiple sequence alignments (MSAs) and through deep neural networks that detect high-order correlations [27], [28]. The reconstruction of a three-dimensional protein structure from a given contact map is an NP-hard problem. Using the preferred contacts as restraints in de novo modeling can lead to more accurate structure predictions than template-based modeling, especially for proteins without close homologs [29]. Predicted contact maps often contain a fraction of false contacts. Some methods for reconstruction from contact maps are robust to inaccurate or incomplete sets of preferred contacts (e.g. FT-COMAR [30], CONFOLD [31], [32], GDFuzz3D [33], see Table 1). Contact maps are typically used as distance restraints between pairs of alpha carbons or as part of the force field in de novo structure modeling (e.g. CONFOLD, PconsFold [34]). Initial, partially random atomic positions are then optimized in an iterative procedure to satisfy the specified distance restraints.
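The two ideas in this section, a contact map computed from coordinates and an iterative procedure that moves partially random positions until distance restraints are satisfied, can be sketched as follows. This is a toy illustration and not the algorithm of any tool named above: the 8 Å cutoff, the pairwise relaxation scheme, the step size and all function names are assumptions made for the example, and the restraints are taken from the true distances of contacting pairs in an idealized helix.

```python
import math
import random


def helix_ca(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized alpha-helical C-alpha trace, used here only as test data."""
    return [
        (radius * math.cos(math.radians(i * turn_deg)),
         radius * math.sin(math.radians(i * turn_deg)),
         i * rise)
        for i in range(n)
    ]


def contact_map(coords, cutoff=8.0):
    """Binary map: 1 if two different residues are closer than cutoff."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) < cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def rms_violation(coords, restraints):
    """Root-mean-square deviation of model distances from their targets."""
    sq = sum((math.dist(coords[i], coords[j]) - target) ** 2
             for i, j, target in restraints)
    return math.sqrt(sq / len(restraints))


def relax(coords, restraints, n_sweeps=200, step=0.1):
    """Repeatedly nudge restrained pairs toward their target distances."""
    pts = [list(p) for p in coords]
    for _ in range(n_sweeps):
        for i, j, target in restraints:
            delta = [pts[j][k] - pts[i][k] for k in range(3)]
            d = math.sqrt(sum(c * c for c in delta)) or 1e-9
            # Positive when the pair is too far apart, negative when too close.
            shift = step * 0.5 * (d - target) / d
            for k in range(3):
                pts[i][k] += shift * delta[k]
                pts[j][k] -= shift * delta[k]
    return [tuple(p) for p in pts]


reference = helix_ca(10)
cmap = contact_map(reference)
restraints = [
    (i, j, math.dist(reference[i], reference[j]))
    for i in range(len(reference))
    for j in range(i + 1, len(reference))
    if cmap[i][j]
]

random.seed(0)
start = [tuple(c + random.uniform(-1.5, 1.5) for c in p) for p in reference]
model = relax(start, restraints)
print("RMS restraint violation before:",
      round(rms_violation(start, restraints), 2))
print("RMS restraint violation after: ",
      round(rms_violation(model, restraints), 2))
```

Note that distances alone cannot distinguish a structure from its mirror image, and that real predicted maps contain false contacts, so practical methods add further terms and filtering on top of a scheme like this.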

Backbone reconstruction from C-alpha positions

The arrangement of alpha carbons in the polypeptide chain is locally very regular, with an average distance of 3.8 Å between neighboring Cα atoms. Many methods are dedicated to the reconstruction of protein backbone coordinates; they provide models of protein backbone geometry (or a complete all-atom structure) based on the C-alpha trace (see Table 1, section “Backbone reconstruction from CA-trace” and section “All-atom reconstruction from CA-trace”). Heavy atoms (N, C, O) in the main chain are usually added according to simple geometric criteria based on bond lengths and angles in the peptide plane (proline residues need separate treatment) [35], [36], [37]. The optimal roto-translation of the peptide plane is usually provided by a sequence-dependent statistical potential that assumes ideal bond lengths and phi-psi angles. Instead of inserting individual atoms, the other commonly employed approach is to use a library of peptide backbone fragments [38], [39], [40], [41]. The fragments, typically from 4 to 15 residues long, are derived from a non-redundant set of known protein structures and collected in a library. Library sizes vary widely, depending on the clustering strategy and the adopted criteria. Libraries range from several hundred (e.g. 528 in the PD2 method [39]) to several thousand structural components (e.g. 5148 4-residue fragments in the BBQ method [38]), with fixed or multiple overlapping fragment lengths [42]. The strategy of using protein fragments of various lengths is also successfully used by the Rosetta [43], Modeller [44], and I-TASSER [136] packages for protein structure prediction. The large size and diversity of backbone libraries is likely to ensure high accuracy of reconstructed structures, but it increases the cost of calculations [45]. 
Therefore, much smaller libraries (from a dozen to several tens of fragments) are offered by methods based on structural alphabets such as Protein Blocks [46], SA-HMM [47], SABBAC [40] and other methods [48], [49], [50]. Structural alphabets are libraries of short (3 to 7 residues long), usually fixed-length backbone fragments that can be used as building blocks in protein reconstruction [42], [46], [51]. During the reconstruction procedure, the overlapping fragments that best fit the C-alpha trace are selected from the library. The selection of preferred fragments is based on energy scores, structural similarity, secondary structure assignment or geometric matching criteria [35]. Typically, the accuracy of backbone reconstruction is evaluated using RMSD values to a reference structure, calculated on main-chain heavy atoms (averaged, or separately for the C, N and O atoms), and Ramachandran dihedral angles (φ, ψ). For example, a comparison of selected methods for protein backbone reconstruction from the C-alpha trace [39] showed that PD2 (especially with the minimization step) and BBQ remain the most accurate according to RMSD and dihedral-angle deviation criteria. The accuracy of those tools can be further improved by additional refinement of the protein backbone [52]. When reconstructing protein structures from coarse-grained models, a very important aspect to bear in mind is the ability of the method to handle unphysical distortions of the initial C-alpha trace. Backbone reconstruction methods differ in their tolerance of the small unphysical local distortions in the Cα chain that are often present in coarse-grained models [1]. For some approaches, fragments with incorrect C-alpha trace geometry can result in missing parts of the rebuilt backbone or unphysical backbone distortions. 
These may have a significant impact on the quality of subsequent side-chain reconstruction, all-atom energy minimization and scoring [39]. Some backbone reconstruction methods (such as PD2 [39], SABBAC [40], ModRefiner [53], PULCHRA [54], RACOGS [55]) have been designed to be robust to small (~1 Å) distortions in the initial Cα chain of coarse-grained models. Methods such as PULCHRA and ModRefiner offer additional optimization of the reconstructed main chain, including the Cα positions (see Table 1). Finally, it should be noted that methods based on fragment libraries, while usually effective in the reconstruction of folded proteins, do not always cope with unstructured or disordered fragments of the protein chain.
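Two of the ideas discussed above, geometric matching of library fragments to the C-alpha trace and detection of local distortions in the trace, can be sketched in a few lines. This is a simplified illustration rather than the procedure of any listed tool. As the matching criterion it uses a distance-matrix RMSD (dRMSD), chosen here because it needs no superposition; the two-entry library, the idealized helix and strand geometries, the 0.3 Å tolerance and all function names are assumptions made for the example.

```python
import math


def drmsd(a, b):
    """Distance-matrix RMSD between two equal-length coordinate lists."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    sq, n_pairs = 0.0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
            sq += diff * diff
            n_pairs += 1
    return math.sqrt(sq / n_pairs)


def best_fragment(window, library):
    """Name of the library fragment whose geometry best fits the window."""
    return min(library, key=lambda name: drmsd(window, library[name]))


def distorted_bonds(trace, ideal=3.8, tol=0.3):
    """Indices i where the CA(i)-CA(i+1) distance deviates from ideal."""
    return [
        i for i in range(len(trace) - 1)
        if abs(math.dist(trace[i], trace[i + 1]) - ideal) > tol
    ]


def helix(n):
    """Idealized alpha-helical C-alpha trace (illustrative geometry)."""
    return [(2.3 * math.cos(math.radians(100 * i)),
             2.3 * math.sin(math.radians(100 * i)),
             1.5 * i) for i in range(n)]


def strand(n):
    """Idealized extended zigzag C-alpha trace (illustrative geometry)."""
    return [(0.94 if i % 2 else -0.94, 0.0, 3.3 * i) for i in range(n)]


library = {"helix": helix(4), "strand": strand(4)}

# A helical window placed elsewhere in space still matches the helix entry,
# because internal distances do not depend on position or orientation.
window = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in helix(4)]
print("best fragment:", best_fragment(window, library))

# Displacing one C-alpha along the axis stretches the preceding virtual bond.
bad = helix(8)
bad[3] = (bad[3][0], bad[3][1], bad[3][2] + 1.0)
print("distorted bonds:", distorted_bonds(helix(8)), distorted_bonds(bad))
```

A caveat of this design choice is that dRMSD does not distinguish a fragment from its mirror image, so a superposition-based RMSD or an additional chirality check would be needed in practice.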

Side-chain reconstruction from backbone

Side-group interactions (hydrogen bonds, ionization, solvation, contacts) play a major role in the stabilization of three-dimensional protein structure [56], [57], [58] and in binding interactions in protein complexes [59], [60], [61], [62]. Therefore, accurate side-chain packing is important in structure prediction of proteins and their complexes, and in protein design [59], [63], [64], [65]. Except for a few methods [66], [67], most of the available side-chain reconstruction methods are based on the positions of backbone atoms and use rotamer/conformer libraries [68], [69], [70], [71], [72] with various strategies for the optimization of side-chain packing [73], [74], [75], [76], [67], [77], [78], [79], [80]. Such backbone-dependent rotamer libraries define the probability of a given rotamer as a function of the main-chain dihedral angles. Thus, backbone distortions (for example, errors in backbone reconstruction) may have a significant influence on the accuracy of reconstructed side chains. However, minor backbone distortions are tolerated by some reconstruction methods [53], [54], [55], [81], [82]. The prediction of side-chain conformations and packing usually involves three crucial modules: (i) an all-atom or coarse-grained rotamer library of discrete side-chain conformations (conformer library) or of the frequency distribution of rotational states (statistical rotamer library); rotamer models differ in flexibility (rigid or flexible), number of available rotameric states, packing conditions (e.g. force field, score function) and backbone dependencies; (ii) a set of energy functions to distinguish rotamer states (various combinations of van der Waals and electrostatic potentials, solvation effects, hydrogen bonds and orientation-dependent terms); and (iii) a search algorithm for efficient sampling of the conformational space of rotameric states: Monte Carlo dynamics or molecular dynamics, simulated annealing, neural networks, dead-end elimination, graph-theory-based methods, self-consistent mean field, branch-and-terminate, backtracking and various combinations of these approaches [73], [83]. The side-chain reconstruction methods try to strike a balance between these modules by enhancing the sampling scheme [86], [74], [87], [76], optimizing the terms of the energy function [78], [88], [89], [90], [91] or improving the rotamer library [92], [93], [94]. Reconstructing side-chain geometry with proper spatial packing is a much more challenging task than reconstructing the protein backbone. This is due to the high flexibility of side groups, especially for larger amino acids, which defines a vast conformational space that needs to be considered [57]. The side-chain reconstruction problem can be simplified by using a finite number of spatial arrangements of side-chain rotamers. Rotational states are stored in a library, which can be searched efficiently even for large proteins or their complexes [68], [85]. Rotamers are selected to avoid steric clashes and to provide favorable local interactions. Many software tools are dedicated solely to side-chain reconstruction, and they are available mainly as standalone programs [76], [92], [96], [97], [98], [79], [77], [99] (see Table 1, section “Side chains reconstruction from backbone”). 
Side-chain reconstruction methods are also available within integrated software for the reconstruction of atomic details (including optimization of side-chain packing) from the initial C-alpha trace [37], [44], [54], [55], [41], [53] (see Table 1, section “All-atom reconstruction from CA-trace”). A comparison of the best performing methods in various residue environments (buried, surface, interaction interface, membrane-spanning) and protein types (membrane, mono- and multimeric) can be found in a comprehensive benchmark [73]. For all of the OSCAR (-o [78], -star [97]), OPUS (-Rota [96], -Rota2 [81]), Upside [100], SCWRL4 [77] and RASP [83] methods, the overall accuracy exceeded 85% for the χ1 angle and 75% for the χ1 + χ2 angles, with an average all-atom RMSD below 1.5 Å between the predicted and native side-chain conformations. Interestingly, another evaluation of some of the best performing algorithms suggested that, for buried residues, the algorithms are close to the best possible accuracy [95]. For exposed residues, there is considerable room for improvement, and the scoring functions seem to be the main obstacle to correct side-chain packing [95]. There is also room for improvement in the design and specialization of rotamer libraries. This has recently been demonstrated in the work on the PEARS tool [82], a family-specific side-chain predictor for antibodies, in which rotamers are binned according to their immunogenetics position rather than their local backbone geometry. The concept of PEARS is potentially generalizable to other protein families, provided that enough structural data are available. The computational efficiency of these methods differs significantly. For example, the Upside method is extremely fast (0.006 s per 100 residues). The RASP, OPUS-Rota2 and SCWRL4 methods are approximately 15, 150 and 300 times slower, respectively. 
OPUS-Rota and OSCAR-star are almost as fast as SCWRL4, whereas OSCAR-o is two orders of magnitude slower [83], [81], [100]. Considering accuracy, efficiency and other features, different methods may be better suited to different applications (see Table 1). For example, the Upside and OPUS-Rota2 methods have been tested in the modeling of non-native conformers and can be very efficient as components of multiscale modeling protocols for the simulation of protein dynamics. The SCWRL4 and OPUS-Rota methods are easy to use and well tested in homology modeling applications. SCWRL4 can also improve the interactions of side chains within crystal conformations, which can be useful in molecular replacement, structure refinement or prediction of protein-protein interfaces [77]. Variants of both the OPUS and OSCAR tools are sensitive to side-chain orientations and are used to select near-native conformations from decoys [101], [102].
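The χ-angle accuracy quoted in this section is computed from dihedral angles, so a short sketch of that evaluation may be helpful. The χ1 angle is the dihedral defined by the N, Cα, Cβ and Cγ atoms (or the equivalent γ atom). The code below is an illustration only: the 40° tolerance is a commonly used convention that is assumed here rather than taken from the text, and the function names are hypothetical.

```python
import math


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points (e.g. N, CA, CB, CG)."""
    b0 = [p0[k] - p1[k] for k in range(3)]
    b1 = [p2[k] - p1[k] for k in range(3)]
    b2 = [p3[k] - p2[k] for k in range(3)]
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Components of the outer bonds perpendicular to the central bond.
    v = [b0[k] - _dot(b0, b1) * b1[k] for k in range(3)]
    w = [b2[k] - _dot(b2, b1) * b1[k] for k in range(3)]
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(predicted, native, tol=40.0):
    """Fraction of residues whose predicted angle is within tol of native."""
    hits = sum(1 for p, n in zip(predicted, native) if angle_diff(p, n) <= tol)
    return hits / len(native)


# Hypothetical chi1 values (degrees) for four residues.
predicted_chi1 = [-60.0, 180.0, 60.0, -65.0]
native_chi1 = [-65.0, -175.0, 180.0, -60.0]
print("chi1 accuracy:", chi_accuracy(predicted_chi1, native_chi1))
```

The periodic treatment in `angle_diff` matters: 180° and -175° differ by only 5°, so the second residue counts as correct, and three of the four hypothetical residues fall within the tolerance.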

Hydrogen atom reconstruction

Hydrogen atoms account for nearly half of the atoms in protein structure. Omitting them in coarse-grained modeling significantly simplifies the conformational space and accelerates calculations by an order of magnitude. However, a more detailed analysis of system energy (e.g. ligand binding to a protein) requires an accurate physicochemical force field in which hydrogens are treated explicitly and their locations contribute significantly to the system energy (hydrogen bonds, ionization, solvation, contacts and structure stabilization). Many tools place hydrogen atoms according to geometric criteria, and these tools may also account for specific effects, such as tautomeric or protonation states. Experimental structures or reconstructed models may have local strain or clashes that require additional energy optimization. To minimize energy, some methods also refine the final structure using molecular dynamics simulations (see Table 1) or even quantum-mechanical calculations [103]. For most Protein Data Bank entries, the experimental structures contain incomplete information about the location of hydrogen atoms. The main limiting factor for experimental techniques in detecting hydrogen positions is the high mobility of hydrogens. However, hydrogens in different functional groups differ in rotational flexibility. Tautomeric states occur mainly in histidine and carboxyl groups. Torsional changes, in which the hydrogen rotates around the bond with the heavy atom, involve mainly hydroxyl, thiol and amine groups. Protonation states differ in the number of hydrogens in the functional group, due to losing a proton (negative charge for carboxyl or thiol) or gaining one (positive charge for amine or imidazole). Side-chain flips occur in amide and imidazole groups and are particularly frequent for glutamine and asparagine residues. 
Several tools that address the location of hydrogen atoms in protein structures have been developed (see Table 1, section “Hydrogen atom reconstruction”). Some of them add hydrogens according to simple geometric criteria (CHIMERA [104], PyMOL, DeepView [105]), while others take into account more subtle interactions and perform additional optimization (Computational Titration [106], HAAD [107], MCCE [108], Protonate3D [109], Protoss [110], REDUCE [111], WHAT IF [112]) or employ molecular dynamics (CNS [113], GROMACS [114], Hbuild [115]). Adding hydrogen atoms is a necessary step in crystallographic structure refinement, theoretical structure prediction, or calculation of the associated binding energies [107], [116]. A typical hydrogen reconstruction scheme involves initial placement of atoms according to geometric criteria, followed by optimization through a conformational search guided by empirical or physicochemical energy terms [113], [114], [115], [116], [117] or by heuristic approaches [111], [112]. Most methods are very effective in predicting the position of a hydrogen atom bonded to an atom with tetrahedral geometry (both C and N), especially when the positions of the other three atoms are known. Quite good agreement was also obtained for planar hydrogens and CH2-type groups. It is slightly more difficult to predict the orientation of CH3 and NH3 groups, due to their high rotational flexibility, and of planar amine groups in asparagine, glutamine and arginine. In this case, geometry-based methods (MCCE, WHAT IF) provide the highest accuracy [116]. The CHARMM software seems to be an efficient tool for predicting hydroxyl and water hydrogens [116]. The HAAD [107] method is very effective in avoiding steric clashes in the densely packed hydrophobic protein core. REDUCE [111] and several recently developed tools, such as Protoss [110] or Protonate3D [109], effectively take into account the effects of rotamers, tautomers and ionization states, as well as side-chain flips.
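The simplest case mentioned above, a hydrogen bonded to a tetrahedral atom whose other three neighbors are known (for example the hydrogen on Cα, with neighbors N, C and Cβ), has a purely geometric solution: the hydrogen points opposite to the sum of the three unit bond vectors. The sketch below illustrates this rule only; it is not taken from any of the listed tools, the function name is hypothetical, and the 1.09 Å bond length is a typical C-H value assumed for the example.

```python
import math


def place_tetrahedral_h(center, n1, n2, n3, bond_length=1.09):
    """Place one hydrogen on a tetrahedral center with three known neighbors."""
    units = []
    for neighbor in (n1, n2, n3):
        v = [neighbor[k] - center[k] for k in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        units.append([c / norm for c in v])
    # The hydrogen points opposite to the sum of the three unit bond vectors.
    direction = [-(units[0][k] + units[1][k] + units[2][k]) for k in range(3)]
    norm = math.sqrt(sum(c * c for c in direction))
    return tuple(center[k] + bond_length * direction[k] / norm for k in range(3))


def angle_deg(center, a, b):
    """Angle a-center-b in degrees."""
    va = [a[k] - center[k] for k in range(3)]
    vb = [b[k] - center[k] for k in range(3)]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return math.degrees(math.acos(dot / (na * nb)))


# Ideal tetrahedral test case: three neighbors on corners of a cube.
center = (0.0, 0.0, 0.0)
neighbors = [(1.0, 1.0, 1.0), (1.0, -1.0, -1.0), (-1.0, 1.0, -1.0)]
h = place_tetrahedral_h(center, *neighbors)
print("H position:", tuple(round(c, 3) for c in h))
print("H-center-neighbor angle:", round(angle_deg(center, h, neighbors[0]), 2))
```

For an ideal tetrahedral center the resulting angle to each neighbor is the tetrahedral angle of about 109.5°. Groups with rotational freedom, such as hydroxyl or methyl hydrogens, cannot be placed by this rule alone, which is where the optimization-based tools discussed above come in.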

Optimization of all-atom structure

The accuracy of all-atom protein models, obtained using protein reconstruction methods and/or experimental techniques, can be further improved using physics-based energy minimization and simulation techniques [84], [6]. Most commonly, optimization is the last step of the reconstruction procedure. However, energy minimization can also be combined with other reconstruction steps. This is the case for the ModRefiner method [53], which uses two-step atomic-level minimization: the first step refines the backbone only, and the second refines the all-atom model. Optimization of protein models can be short-timescale and aimed at local improvement [118], [119], i.e. side-chain repacking, loop remodeling or optimization of hydrogen bonding in secondary structure elements. Much more challenging is deeper, long-timescale optimization aimed at large conformational changes toward a more accurate model [118], [124], [125], [120], [121], [122], [123]. The most common approach to optimizing protein models is all-atom molecular dynamics (MD) [120], [121], [122], [123]. Long-timescale MD simulations require enormous computational resources, but they can usually be accelerated significantly by proper sampling strategies [126], [127], [128], [129] and by the use of spatial restraints and knowledge-based information [120], [121], [122], [123], [159], [160]. The recent evaluation of protein refinement techniques in the CASP12 experiment showed that the best performing approaches used restrained MD simulations alone or in combination with other tools [122].
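The idea of restrained optimization can be illustrated with a deliberately small toy model. The sketch below is not a force field and not the protocol of any method cited above: it assumes an energy made of a soft repulsion between atoms closer than `d_min` plus harmonic restraints that tie each atom to its starting position, and it minimizes this energy by steepest descent. The parameter values and function names are hypothetical.

```python
import math


def energy_and_gradient(coords, start, d_min=3.0, k_rep=10.0, k_res=1.0):
    """Toy energy: soft clash repulsion plus harmonic positional restraints."""
    n = len(coords)
    energy = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in range(n)]
    # Harmonic restraints toward the starting model.
    for i in range(n):
        for k in range(3):
            dx = coords[i][k] - start[i][k]
            energy += k_res * dx * dx
            grad[i][k] += 2.0 * k_res * dx
    # Soft repulsion for pairs closer than d_min.
    for i in range(n):
        for j in range(i + 1, n):
            delta = [coords[j][k] - coords[i][k] for k in range(3)]
            d = math.sqrt(sum(c * c for c in delta))
            if 0.0 < d < d_min:
                overlap = d_min - d
                energy += k_rep * overlap * overlap
                for k in range(3):
                    g = 2.0 * k_rep * overlap * delta[k] / d
                    grad[i][k] += g
                    grad[j][k] -= g
    return energy, grad


def minimize(start, n_steps=500, step=0.01):
    """Steepest-descent minimization of the toy energy."""
    coords = [list(p) for p in start]
    for _ in range(n_steps):
        _, grad = energy_and_gradient(coords, start)
        for i in range(len(coords)):
            for k in range(3):
                coords[i][k] -= step * grad[i][k]
    return [tuple(p) for p in coords]


# Two atoms that clash: 2.0 A apart with a 3.0 A minimum distance.
start = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
final = minimize(start)
e_start, _ = energy_and_gradient(start, start)
e_final, _ = energy_and_gradient(final, start)
print("distance:", round(math.dist(start[0], start[1]), 2), "->",
      round(math.dist(final[0], final[1]), 2))
print("energy:", round(e_start, 2), "->", round(e_final, 2))
```

In this toy case the clash is almost fully relieved while each atom moves less than 0.5 Å from its starting position, which is the trade-off that restraints are meant to control: the stronger the restraint constant, the closer the refined model stays to the input.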

Summary

For successful reconstruction of all-atom protein models, computational methods most commonly use a set of geometric rules, libraries of protein fragments, various simulation techniques or their combinations. The most effective strategy for backbone reconstruction of folded proteins seems to be assembly from known protein fragments. This is because of the well-defined character of the protein backbone, which is structurally conserved among homologous proteins and maintains major structural regularities in protein fragments of similar sequence. Importantly, the accuracy of backbone reconstruction has a significant impact on the accuracy of subsequent side-chain reconstruction and energy-based scoring of the obtained models [39], [81]. Reconstruction of side-chain positions is a challenging problem, and here too statistical regularities extracted from known protein structures can be useful [82]. The problem is NP-hard, and only suboptimal solutions are available. Nevertheless, for many reconstruction tasks such suboptimal solutions are satisfactory. Ultimately, the performance of the backbone and side-chain reconstruction stages can be improved by combining them with physics-based optimization techniques. Methods of protein structure reconstruction from incomplete models are already commonly used and will be valuable components of modeling strategies that integrate data from various sources. Those sources include experiment (such as SAXS, NMR, X-ray, cryo-EM [19], [20], [21], [22], [23], [84] or measurements of the activity of mutant protein variants [130], [131]) and theoretical predictions (such as residue-residue contact predictions from evolutionary information [27], [28] or simulation trajectories in coarse-grained resolution [1], [157], [158]). 
Since all-atom MD is the most widely employed simulation method, the local quality and stability of reconstructed structures should be tested by using them as starting points for all-atom MD. The growing amount of experimental data or coarse-grained predictions on the structure of protein complexes also calls for reconstruction methods designed for refining structural models of different biomolecules (examples of methods for the reconstruction of protein-lipid [132], [133] and protein-DNA [134] systems are presented in Table 1). This short review focuses on reconstruction tools that use various kinds of coarse-grained protein representations as input. Note that there are also a number of tools, not discussed in this review, that fill gaps of missing residues in protein structures [135], [136], [137]. Finally, we hope this short review can be a useful reference to existing protein reconstruction resources. These resources may be useful for the design and development of new efficient molecular modeling tools, and also for the much larger community of bioscientists who may use reconstruction methods as supporting tools for deeper analysis and illustration of experimental data in structural biology, biomedicine and other branches of molecular biology. The tools available as web servers (see the availability column in Table 1) are probably the easiest to access and use.

Acknowledgments

AEB-D, AK, SK received funding from NCN Poland, Grant MAESTRO2014/14/A/ST6/00088.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
