Literature DB >> 26918164

Sequence co-evolutionary information is a natural partner to minimally-frustrated models of biomolecular dynamics.

Jeffrey K Noel1, Faruck Morcos2, Jose N Onuchic3.   

Abstract

Experimentally derived structural constraints have been crucial to the implementation of computational models of biomolecular dynamics. For example, not only does crystallography provide essential starting points for molecular simulations but also high-resolution structures permit for parameterization of simplified models. Since the energy landscapes for proteins and other biomolecules have been shown to be minimally frustrated and therefore funneled, these structure-based models have played a major role in understanding the mechanisms governing folding and many functions of these systems. Structural information, however, may be limited in many interesting cases. Recently, the statistical analysis of residue co-evolution in families of protein sequences has provided a complementary method of discovering residue-residue contact interactions involved in functional configurations. These functional configurations are often transient and difficult to capture experimentally. Thus, co-evolutionary information can be merged with that available for experimentally characterized low free-energy structures, in order to more fully capture the true underlying biomolecular energy landscape.

Entities:  

Keywords:  Direct coupling analysis; biomolecular dynamics; frustrated protein models; minimally frustrated models; nuclear magnetic resonance; protein structure model; x-ray crystallography

Year:  2016        PMID: 26918164      PMCID: PMC4755392          DOI: 10.12688/f1000research.7186.1

Source DB:  PubMed          Journal:  F1000Res        ISSN: 2046-1402


Introduction

High-resolution structural techniques (e.g., X-ray crystallography and nuclear magnetic resonance) have provided the data necessary to develop and refine a multitude of potential energy functions used in the simulation of biomolecules. In particular, these structures provide the parameterization for simplified models that are based on the energy landscape theory of protein folding. These models construct an energetically unfrustrated (ideal) funneled landscape by including stabilizing interactions between native contacts (i.e., amino acid pairs that are nearby in the three-dimensional native structure of a protein). In cases in which experimental structures are lacking or insufficient, it becomes necessary to supplement these models with other sources of contact information. An emerging technique for contact estimation is via the statistical analysis of residue co-evolution in families of protein sequences. Combinations of high-resolution structural data and predictions from residue co-evolution are proving to be invaluable tools for building models to study protein structure and dynamics. Understanding the fundamental process of how a heterogeneous polypeptide can reversibly fold into a distinct native three-dimensional structure on biological timescales led to the development of the energy landscape theory of biomolecular folding. This theory is based on the principle of minimal frustration [1] and the folding funnel concept [2, 3]. These physical principles describe an energy landscape that has been molded by evolution such that the native interactions (i.e., the molecular interactions present in low free-energy configurations of folded proteins and RNAs) are, on average, more stabilizing than non-native interactions. The consequence of proteins having sufficiently reduced energetic frustration is that geometry dominates energetic roughness in determining folding mechanisms. Thus, a description of the effective energetics of the folding phenomenon can be attained by including a set of native stabilizing interactions consistent with the native basin of attraction. Potential energy functions of this type, which use experimental information to determine such native interactions, are known as “structure-based models” (SBMs) [4– 6] and, when employed in dynamical models, are powerful tools for understanding the connection between structure, folding, and function. Although these SBMs have been successfully applied to different biomolecules, we will be focusing on proteins for clarity in this review. Structural information, however, may be limited for many interesting systems. This is particularly true for functional configurations that are transient or partially disordered or both. The recent explosion in genomic information has enabled complementary methods for discovering functionally important amino acid interactions. The minimal frustration principle applies equally to any sequence of amino acids that can robustly fold to a particular native structure. Thus, in a family of sequences where most of them fold to a common structure, residue positions that are in contact will display a correlated mutational record because of the global evolutionary constraint that the native structure imposes for foldability. Of course, additional constraints beyond folding affect sequence evolution, including maintenance of molecular assemblies, enzymatic activity, and allosteric motions. Signals of these functionally relevant contacts are necessarily mixed with those providing robust folding. To identify such relevant interactions involved in folding and function, a number of methodologies have been developed in recent years that have been successful in uncovering such molecular couplings from sequence data. One of them is direct coupling analysis (DCA) [7, 8], which is designed to infer a global statistical model from a multiple sequence alignment (MSA) of a single protein family. Using a maximum entropy approach, DCA infers the parameters of an effective energy function consisting of single-site fields and pairwise couplings that is able to approximately reproduce the empirically observed single-site and pairwise amino acid frequencies from the input sequence alignment. The DCA energy function is known as a Potts model, a generalized Ising model that includes non-nearest neighbor interactions and non-constant spin-spin interactions. In practice, couplings of varying strength are computed between all possible pairs of sequence positions. In the past, accurate and tractable approximations of such global models were elusive and detection of direct correlations, as opposed to an aggregate of direct and indirect correlations, was challenging. Other methods are derived from similar theoretical perspectives but have varying computational demands and accuracies [9– 12]. Using an inferred effective energy function, one can estimate pairwise direct probabilities at a particular pair of residue sites. Calculating the Kullback–Leibler divergence between these joint probabilities and single marginal frequencies gives the direct information (DI) score for that residue pair. DI is a proxy of how “directly correlated” two sites are in an MSA. When compared with crystal structures, high DI scores correlate highly with native contacts, and more than 80% overlap, on average, for the top residue pairs in many protein families [7, 13]. The full set of highly scoring contacts amounts to a superset of minimally frustrated and functionally important residue pairs that are spatially localized in the functional configurations of the members of a protein family. Here, we will review the current progress in using residue co-evolution for modeling the structure and dynamics of proteins with a focus on its combination with SBMs.

Residue co-evolutionary constraints are natural input for minimally frustrated protein models

In their simplest form, SBMs idealize minimally frustrated protein energy landscapes by including only native interactions. This model removes any residual non-native energetic roughness and clarifies analysis of the geometrical and topological aspects of protein dynamics and folding. These models faithfully represent the local geometry through bond, angle, dihedral, and excluded volume terms at either single-bead-per-residue or all-atom resolutions. Non-local interactions consist of stabilizing pairwise potentials applied between residue (or atom) pairs that are nearby in the native structure. These pairwise interactions are called native contacts, and the entire set is known as a native contact map. All of the interactions, local and non-local, are set to have an explicit minimum at the native structure, hence the name “structure-based”. The simplified construction of the potential energy function permits for reduced computational requirements, and the explicitly encoded native interactions provide a baseline model that can be used for molecular modeling or studying physical perturbations. For a detailed discussion of the theoretical foundation and construction of SBMs, we refer you to the following reviews, [14– 16], and the references therein. The quality of contact maps derived from DCA and similar methods have been benchmarked against contact maps calculated from crystallographic structures, and their accuracy is promising. In general, the larger and more diverse the family of sequences, the higher the quality of contact prediction. The high level of DCA accuracy provided sufficient tertiary constraints to allow folding single domain proteins to within 3 Å from the crystal structure when given knowledge of the secondary structure [17– 21]. A rule of thumb is that the number of sequences should be larger than 1000 with less than 80% identity; however, others propose an even lower requirement of a minimum number of sequences close to the length L of the protein polypeptide chain, provided that they are diverse [18]. The notoriously difficult problem of predicting membrane protein structures has also been aided by considering evolutionarily coupled pairs [22, 23]. A native contact map derived from a single native structure is often not sufficient to encode all the functionally relevant, minimally frustrated interactions. This led to the development of a variety of “multi-basin” models, where multiple experimental structures or structural constraints are included in a single SBM [24– 26]. As described above, residue pairs with the highest DI scores, the high DI pairs (HDPs), are consistent with the native contact maps. Thus, in an analogous fashion, predicted contact constraints from co-evolution can be merged with contact maps computed from experimental structures in order to more fully capture the true underlying biomolecular energy landscape, including functional transitions and conformations, and therefore to be consistent with multiple structures [27– 29].

Recent advances

Interactions between proteins are fundamental to cellular processes. Where these interactions involve direct contact, multimeric structures, both long-lived and transient, leave correlated mutational patterns between interacting surface residues. A pioneering study used the HDPs between a histidine kinase and its response regulator to make a prediction of the transient protein complex enabling phosphotransfer [30]. This allowed a prediction for the histidine kinase TM0853 and its response regulator TM0468 that was later confirmed experimentally to be within 3.3 Å [31]. These predictions are made by minimizing a contact-based energy function consisting of dimeric HDPs. Where dimerization only weakly perturbs the monomer structure, refined rigid-body modeling in combination with co-evolutionary constrains can be employed to estimate protein complexes. When combined with experimental observations, directly coupled amino acids can unveil protein interfaces relevant for the study of disease [32]. Larger monomer distortions can be readily sampled with SBMs coupled with simulated annealing [33]. Current protocols involving HDPs have allowed the large-scale prediction of both homodimers [34] and heterodimers [35, 36]. The HDP contact map for a protein family that forms homodimers is a prime example of how ambiguity can arise in co-evolutionary information. The co-evolving dimeric interfacial contacts are mixed with HDPs selected for monomeric folding, but the dimeric contacts can in general be sorted from the monomeric contacts when there is a known monomer structure [34]. But rarely are there true dichotomies in biology; the existence of domain swapping [37, 38] and structural symmetry [26, 39] highlights some difficulties in assigning particular roles to each HDP. Also, some protein-protein interactions are mediated by disordered regions that order upon binding. The utility of DCA in these cases remains to be tested. In addition to homo-multimerization, the set of conformations encoded in HDP contact maps can include functional motions. Multi-domain proteins can undergo conformational changes, for example, to accommodate ligands [40] or in response to phosphorylation [41]. In periplasmic ligand binding proteins, there exists an open, ligand-free configuration and a closed, ligand-bound configuration. Molecular dynamics simulations can be performed by using an SBM specific to an open configuration but overlaid with an additional potential term consisting of a set of attractive, short-range interactions for each HDP [27]. Figure 1 illustrates an example of this for the leucine-binding protein. The native contact maps for two crystal structures of leucine-binding protein are shown in Figure 1A: “open” without ligand and “closed” with ligand. The closed contact map has additional contacts not present in the open structure. The DCA contact map, shown as the lower triangular map in Figure 1B, contains a superset of both the open and closed configuration contacts. An SBM is constructed that is specific to the open structure ( Figure 1D) and additionally contains contact potentials stabilizing all the “non-native” DCA contacts (i.e., any DCA contacts that are not already in the open structure). These additional contacts are each given a stabilizing potential with a minimum at 8 Å. Molecular dynamics simulations of this hybrid SBM+DCA Hamiltonian show two clusters, each within 2 Å of either the open or closed state. Overlaying the DCA contacts does not disrupt the stability of the open structure, and additionally reveals the closed state without including any information from the closed crystal structure. This shows that co-evolutionary information can be used to uncover intermediary, hidden, and functionally relevant conformational states present in many protein families [27].
Figure 1.

Direct coupling analysis (DCA) contact maps derived from protein family sequence co-evolution are consistent not only with single native structures but also with multiple functional configurations.

( A) Leucine-binding protein (LBP) contact maps derived from crystal structures: “open” without ligand and “closed” bound to a ligand. Each triangular region in the map shows a mark if residue pairs are less than 8 Å apart in the experimental structure. The closed contact map (upper triangle) has additional contacts not present in the open structure. ( B) The DCA contact map inferred from residue co-evolution (lower triangle) contains a superset of contacts from both open and closed conformations. ( C) A cartoon representation of the aligned open (apo) and closed (holo) LBP structures shows a large conformational change upon ligand binding. ( D) A structure-based model (SBM) is defined from the apo structure plus contact potentials stabilizing DCA contacts that are not already in the open structure. A two-dimensional root mean square deviation (RMSD) distribution of the states explored by molecular dynamics simulations of this hybrid Hamiltonian shows two peaks within 2 Å of the open and closed states. This shows the ability to uncover functional states via co-evolutionary couplings.

Direct coupling analysis (DCA) contact maps derived from protein family sequence co-evolution are consistent not only with single native structures but also with multiple functional configurations.

( A) Leucine-binding protein (LBP) contact maps derived from crystal structures: “open” without ligand and “closed” bound to a ligand. Each triangular region in the map shows a mark if residue pairs are less than 8 Å apart in the experimental structure. The closed contact map (upper triangle) has additional contacts not present in the open structure. ( B) The DCA contact map inferred from residue co-evolution (lower triangle) contains a superset of contacts from both open and closed conformations. ( C) A cartoon representation of the aligned open (apo) and closed (holo) LBP structures shows a large conformational change upon ligand binding. ( D) A structure-based model (SBM) is defined from the apo structure plus contact potentials stabilizing DCA contacts that are not already in the open structure. A two-dimensional root mean square deviation (RMSD) distribution of the states explored by molecular dynamics simulations of this hybrid Hamiltonian shows two peaks within 2 Å of the open and closed states. This shows the ability to uncover functional states via co-evolutionary couplings. So far, we have discussed how HDP contact maps can be used for structural modeling. However, the fundamental output of the DCA algorithm is not direct information about co-evolving pairs but rather a Potts model Hamiltonian describing the effective energies of interaction for all pairs of residues in a protein family. This Hamiltonian, though not transferable to any sequences outside the family, should, in principle, be able to provide a quantitative window into the stabilities provided by each amino acid in a protein. Strong evidence of the utility of the effective energies comes from their ability to predict the stability changes of single-site mutants [42, 43] and significant correlations to folding rates [44]. Including the so-called single-site fields in addition to the pairwise energies provides even better predictive power [45]. These results suggested that the pairwise energies calculated from co-evolution could be used to inform thermodynamic models of protein folding. Indeed, folding simulations using SBMs with DCA-weighted native contact potentials can better capture transition state ensembles [46]. DCA energies have also been shown to correlate with physical potentials when summed over the entire sequence [47]. Confidence in the ability to estimate energies at both the single-mutant and full-sequence levels is allowing novel methods for investigating the effective energy landscape of evolution, and bridging the gap between biophysics and sequence evolution [47, 48]. These developments are important for integrating the energetics of protein folding and function with protein evolution and selection, which will be crucial to understanding drug resistance and cancer development going forward.

Future directions

The marriage between co-evolutionary information and physical models of biomolecules has been shown to be a fertile research field, where the most important results are yet to come. This field has been focused on rigorously validating the connection and usefulness between evolutionary information with structural modeling and experimental information. However, the true utility of co-evolutionary information is that it allows us to go places that are hard to access by current experimental technologies; important examples are those of membrane protein structure [22, 23] and dynamics, systems with transient conformational states, as well as investigation of large molecular assemblies that resist crystallographic characterization. Although crystal structures exist for FtsH AAA peptidase and the 30S ribosome, recent studies on these two systems [28, 49] show the promise of co-evolutionary information for discovering structural constraints in molecular assemblies. The ability to detect relevant evolutionary interactions has repercussions to our understanding of biomolecular assembly and function. Hopefully, these new tools can be used to alter protein conformation and rewire their interfaces. This has potential applications in the field of protein engineering, as well as systems biology. There is no conceptual hurdle to resisting the application of these ideas to RNA structure and function as well as protein-RNA interactions. Ultimately, we would hope to use all this knowledge to tackle biomedical problems that would help advance human health.

Abbreviations

DCA, direct coupling analysis; DI, direct information; HDP, high direct information pair; MSA, multiple sequence alignment; SBM, structure-based model.
  47 in total

Review 1.  The Venus flytrap of periplasmic binding proteins: an ancient protein module present in multiple drug receptors.

Authors:  C B Felder; R C Graul; A Y Lee; H P Merkle; W Sadee
Journal:  AAPS PharmSci       Date:  1999

Review 2.  3D domain swapping: as domains continue to swap.

Authors:  Yanshun Liu; David Eisenberg
Journal:  Protein Sci       Date:  2002-06       Impact factor: 6.725

3.  Genomics-aided structure prediction.

Authors:  Joanna I Sułkowska; Faruck Morcos; Martin Weigt; Terence Hwa; José N Onuchic
Journal:  Proc Natl Acad Sci U S A       Date:  2012-06-12       Impact factor: 11.205

4.  Co-evolutionary constraints of globular proteins correlate with their folding rates.

Authors:  Saurav Mallik; Sudip Kundu
Journal:  FEBS Lett       Date:  2015-07-08       Impact factor: 4.124

5.  Sequence co-evolution gives 3D contacts and structures of protein complexes.

Authors:  Thomas A Hopf; Charlotta P I Schärfe; João P G L M Rodrigues; Anna G Green; Oliver Kohlbacher; Chris Sander; Alexandre M J J Bonvin; Debora S Marks
Journal:  Elife       Date:  2014-09-25       Impact factor: 8.140

6.  Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection.

Authors:  Faruck Morcos; Nicholas P Schafer; Ryan R Cheng; José N Onuchic; Peter G Wolynes
Journal:  Proc Natl Acad Sci U S A       Date:  2014-08-11       Impact factor: 11.205

7.  Spin glasses and the statistical mechanics of protein folding.

Authors:  J D Bryngelson; P G Wolynes
Journal:  Proc Natl Acad Sci U S A       Date:  1987-11       Impact factor: 11.205

8.  Biomolecular dynamics: order-disorder transitions and energy landscapes.

Authors:  Paul C Whitford; Karissa Y Sanbonmatsu; José N Onuchic
Journal:  Rep Prog Phys       Date:  2012-06-28

9.  Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information.

Authors:  Sergey Ovchinnikov; Hetunandan Kamisetty; David Baker
Journal:  Elife       Date:  2014-05-01       Impact factor: 8.140

10.  Dimeric interactions and complex formation using direct coevolutionary couplings.

Authors:  Ricardo N dos Santos; Faruck Morcos; Biman Jana; Adriano D Andricopulo; José N Onuchic
Journal:  Sci Rep       Date:  2015-09-04       Impact factor: 4.379

View more
  9 in total

1.  Elucidating the druggable interface of protein-protein interactions using fragment docking and coevolutionary analysis.

Authors:  Fang Bai; Faruck Morcos; Ryan R Cheng; Hualiang Jiang; José N Onuchic
Journal:  Proc Natl Acad Sci U S A       Date:  2016-11-29       Impact factor: 11.205

2.  Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis.

Authors:  Guido Uguzzoni; Shalini John Lovis; Francesco Oteri; Alexander Schug; Hendrik Szurmant; Martin Weigt
Journal:  Proc Natl Acad Sci U S A       Date:  2017-03-13       Impact factor: 11.205

3.  Frustration and Direct-Coupling Analyses to Predict Formation and Function of Adeno-Associated Virus.

Authors:  Nicole N Thadani; Qin Zhou; Kiara Reyes Gamas; Susan Butler; Carlos Bueno; Nicholas P Schafer; Faruck Morcos; Peter G Wolynes; Junghae Suh
Journal:  Biophys J       Date:  2020-12-25       Impact factor: 4.033

4.  Divergence in Dimerization and Activity of Primate APOBEC3C.

Authors:  Amit Gaba; Mark A Hix; Sana Suhail; Ben Flath; Brock Boysan; Danielle R Williams; Tomas Pelletier; Michael Emerman; Faruck Morcos; G Andrés Cisneros; Linda Chelico
Journal:  J Mol Biol       Date:  2021-10-16       Impact factor: 6.151

5.  SMOG 2: A Versatile Software Package for Generating Structure-Based Models.

Authors:  Jeffrey K Noel; Mariana Levi; Mohit Raghunathan; Heiko Lammert; Ryan L Hayes; José N Onuchic; Paul C Whitford
Journal:  PLoS Comput Biol       Date:  2016-03-10       Impact factor: 4.475

6.  Modular Organization of Residue-Level Contacts Shapes the Selection Pressure on Individual Amino Acid Sites of Ribosomal Proteins.

Authors:  Saurav Mallik; Sudip Kundu
Journal:  Genome Biol Evol       Date:  2017-04-01       Impact factor: 3.416

7.  Glutantβase: a database for improving the rational design of glucose-tolerant β-glucosidases.

Authors:  Diego Mariano; Naiara Pantuza; Lucianna H Santos; Rafael E O Rocha; Leonardo H F de Lima; Lucas Bleicher; Raquel Cardoso de Melo-Minardi
Journal:  BMC Mol Cell Biol       Date:  2020-07-01

8.  Connecting the Sequence-Space of Bacterial Signaling Proteins to Phenotypes Using Coevolutionary Landscapes.

Authors:  R R Cheng; O Nordesjö; R L Hayes; H Levine; S C Flores; J N Onuchic; F Morcos
Journal:  Mol Biol Evol       Date:  2016-09-07       Impact factor: 16.240

9.  The generative capacity of probabilistic protein sequence models.

Authors:  Francisco McGee; Sandro Hauri; Quentin Novinger; Slobodan Vucetic; Ronald M Levy; Vincenzo Carnevale; Allan Haldane
Journal:  Nat Commun       Date:  2021-11-02       Impact factor: 14.919

  9 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.