Literature DB >> 27074285

Emerging Computational Methods for the Rational Discovery of Allosteric Drugs.

Jeffrey R Wagner¹, Christopher T Lee¹, Jacob D Durrant¹, Robert D Malmstrom¹, Victoria A Feher¹, Rommie E Amaro¹.

Abstract

Allosteric drug development holds promise for delivering medicines that are more selective and less toxic than those that target orthosteric sites. To date, the discovery of allosteric binding sites and lead compounds has been mostly serendipitous, achieved through high-throughput screening. Over the past decade, structural data has become more readily available for larger protein systems and more membrane protein classes (e.g., GPCRs and ion channels), which are common allosteric drug targets. In parallel, improved simulation methods now provide better atomistic understanding of the protein dynamics and cooperative motions that are critical to allosteric mechanisms. As a result of these advances, the field of predictive allosteric drug development is now on the cusp of a new era of rational structure-based computational methods. Here, we review algorithms that predict allosteric sites based on sequence data and molecular dynamics simulations, describe tools that assess the druggability of these pockets, and discuss how Markov state models and topology analyses provide insight into the relationship between protein dynamics and allosteric drug binding. In each section, we first provide an overview of the various method classes before describing relevant algorithms and software packages.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2016 PMID： 27074285 PMCID： PMC4901368 DOI： 10.1021/acs.chemrev.5b00631

Source DB: PubMed Journal: Chem Rev ISSN： 0009-2665 Impact factor: 60.622

Review Motivation and Organization

To date, most allosteric drugs have been discovered through high-throughput screening. But growing databases of biomolecular structure and sequence data, in conjunction with increases in computing power and improvements in predictive algorithms, are enabling the rational de novo design of allosteric drugs. Given the large number of published algorithms for predicting allosteric mechanisms, it can be difficult to select the most appropriate method for a given target. This review serves as an introduction for those who wish to use computational techniques to develop allosteric drugs. After a broad overview of allosteric drug discovery, this review is divided into three sections. First, we discuss bioinformatics and molecular-dynamics methods to identify allosterically important sequence positions. Second, we summarize the computational methods to predict druggable pockets at these functionally relevant sites. Finally, we describe how Markov state models and topological analyses can tie these single sequence sites to global protein function and dynamics.

Introduction

Allosteric drugs offer a number of advantages that make them desirable as drug candidates. Allosteric effectors, by definition, alter protein activity by binding to a site distinct from the orthosteric pocket. One example of an allosteric system is fructose 1,6-bisphosphatase (shown in Figure ). Because allosteric sites are typically less evolutionarily conserved, allosteric drugs can be highly selective, even among other members of the same protein family.[1−8] In some cases, allosteric sites are so unique among proteins that an effector is said to have “absolute subtype specificity.”[2,3,9,10]

Figure 2

Allosteric protein fructose 1,6-bisphosphatase, shown for illustration. Orthosteric and allosteric pockets (yellow and red, respectively) are bound to an endogenous ligand and an allosteric effector, respectively. Note that the allosteric site is distant from the orthosteric site such that there is no overlap between the bound poses of the allosteric and orthosteric ligands. Despite the distance between them, the allosteric effector measurably modifies the enzymatic activity at the orthosteric site. Illustration derived from PDB IDs 2Y5K(53) and 3IFC.[54]

Allosteric modulators may have spatiotemporal specificity. For example, they can be active only in the presence of the endogenous ligand, thus restricting their effect to certain tissues at certain times,[2−4,9,11] which may slow desensitization.[10,12] Allosteric effectors are generally saturable, meaning that they have a maximal effect that does not necessarily correspond to complete inhibition or activation.[2,4−6,8,10,11] This saturability enables safer dosing. For example, if the maximal effect is an 80% reduction in signaling, overdosing will not fully eliminate an essential signal.[1,2,4] Other advantages can include noncompetitive inhibition (i.e., drug activity cannot be “overwhelmed” by high concentrations of the endogenous ligand) and pathway- or substrate-specific modulation, which reduces unwanted activity by specifically targeting a single protein function.[2,4,10] For example, if a protein is involved in multiple pathways, an allosteric effector may impact the activity of each pathway differently depending on the systems-biology context. If a protein acts on multiple substrates, the impact on activity may depend on the biological context. Despite many potential advantages of allosteric therapeutics, it has been challenging to identify predictive approaches to discovering allosteric drugs. In recent decades, the pharmaceutical industry has favored more traditional targets for three primary reasons: the relative ease of assay development around orthosteric sites; access to high-throughput, high-resolution X-ray crystallography; and advances in ligand- and receptor-based computational methods to optimize ligand-binding affinity at a substrate-competitive site. This structure-based approach is thought to significantly reduce the time and cost of hit-to-lead and lead-to-drug development by reducing the number of compounds that need be synthesized.[13,14] Work by Doman et al. comparing computer-aided drug discovery (CADD) and high-throughput screening (HTS) reported that the two methods had hit rates of 35 and 0.021%, respectively.[15] In contrast, allosteric drugs are uniquely challenging from a rational drug-design perspective. Because experimental assays typically measure orthosteric function rather than ligand binding at the allosteric site, efficient development of allosteric drugs requires that the complex structure–activity relationships (SARs) governing both binding affinity and allosteric activity be considered simultaneously.[5,10,16] Further, allosteric sites are less likely to be evolutionarily conserved. While this enables increased subtype specificity, it also increases the chances of evolved resistance[5,9,10] and can complicate testing in evolutionarily distant animal models.[10] Additionally, allosteric effectors are particularly susceptible to “mode switching,” where relatively minor chemical changes can drastically affect ligand efficacy.[10,16] Structurally similar drug metabolites, therefore, may have varying and unpredictable distributions and allosteric effects.[10,16] Optimizing allosteric modes of action requires methods that are very different from those used in orthosteric drug discovery.[5] Multifunctional allosteric proteins are particularly challenging. While drug designers may desire to target a single protein function, an allosteric effector may also alter other functions, hindering a full mechanistic understanding of the pharmacology.[10] Also, the benefits of spatiotemporal specificity are lost if the distribution of the endogenous ligand changes with progression of the disease state.[10] Finally, assessment of the limited number of known allosteric pockets indicates that they are generally shallow[5] and present flat SARs.[10] These structural features similarly challenge existing rational drug-discovery paradigms and the general practice of developing selective compounds by optimizing affinity. Despite these challenges, allosteric drug discovery has gained momentum recently due to a number of developments.[10] First, several allosteric drugs across a broad range of pharmacological target classes have been rationally designed,[17−23] encouraging pursuit of others, as evidenced by the number of allosteric drugs currently in clinical trials.[24] The recent elucidation of new membrane protein crystal structures for GPCRs[25,26] and ion channels[27,28] have assisted in the structure-based design approach to these successes. Finally, advances in our understanding of allosteric mechanisms have supported development of additional rational design strategies (see below). Our understanding of allosteric mechanisms has advanced considerably since the initial conception of Monod, Wyman, and Changeux.[29] Modern models of allostery consider conformational ensembles.[5,8,9,30−43] This revised view supports newly established and emerging computational advances that comprehensively map conformational landscapes and predict communication between allosteric and orthosteric sites. For example, the physical mechanisms of allostery generally alter the entropic and enthalpic factors that define the conformational landscape and, therefore, govern protein function.[5,8,9] The observed correlation between allosteric modulation and protein structural dynamics is varied: Major conformational rearrangements occur in some cases, as compared with subtle shifts in conformational populations in others.[4−6,9,10,35,37,44] An excellent metaphor for these phenomena is Kornev and Taylor’s classification of “domino” versus “violin” models of allosteric signal transduction[45] (Figure ). Further, allosteric signals are transmitted through a range of structural motifs, from rigid core regions to flexible linkers.[46] Allostery may occur through essential residues along a single allosteric path[47] or through many weak pathways connecting one site to another, acting in concert.[48] As such, it is not surprising that many of the allostery-prediction methods discussed in this review (some of which do not use structure/geometry information at all) in practice may identify noncontiguous groups of residues as being allosterically linked. Such predictions should not immediately be assumed to be wrong but rather may indicate a non-“domino” model of allostery.

Figure 1

When a small molecule binds to the allosteric site of a protein, information is transferred through the protein molecule to its active site. Two different methods of transmission can be defined. The first mechanism, here defined as the “domino model”, is a sequential set of events propagating linearly from the allosteric site to the active site. Binding of the effector triggers local structural changes that sequentially propagate via a single pathway to the active site. It was suggested that this mechanism is applicable for the PDZ domain family,[49] G protein-coupled receptors, the chymotrypsin class of serine proteases, and hemoglobin.[50] The second mechanism, defined here conceptually as a “violin model”, is based on vibration pattern changes inside the protein. In a violin its pitch can be changed by a slight movement of the violin player’s finger on the fingerboard. Information about the finger movement is, thus, transferred throughout the whole body of the violin with no specific pathway for the signal transduction. By analogy, protein allosteric site is a fingerboard of the protein and a small signaling molecule is the player’s finger. If a protein is in a particular vibration mode, it is possible to suggest that binding a small effector molecule to a specific site can change this mode. The signal, thus, will be spread throughout the whole protein including its active site. The “domino model” is a reliable way to transfer information in a macro world, but on a molecular level, with significant thermal motions of the protein, this mechanism will be prone to random triggering of the domino chain reaction, creating noise in the signaling system. Thermal motions in the case of the “violin model” do not hinder the transduction. In fact, the permanent motion of the molecule is a prerequisite for this mechanism. Reproduced with permission from ref (45). Copyright 2015 Cell Trends in Biological Sciences. Recent work has also revealed that protein allostery is not merely a transition between two discrete protein conformations, as initially thought, but rather a shift in the equilibrium populations of many conformations, induced by effector binding.[31,35,38,41,42,51,52] It is becoming increasingly clear that the kinetics of these transitions define the mechanisms of allostery.[38,51] Empowered with these new understandings and advances in molecular simulation (in terms of speed of calculation and improved methodologies), the era of allosteric drug discovery is now on the cusp of radical advancement. I’ Allosteric protein fructose 1,6-bisphosphatase, shown for illustration. Orthosteric and allosteric pockets (yellow and red, respectively) are bound to an endogenous ligand and an allosteric effector, respectively. Note that the allosteric site is distant from the orthosteric site such that there is no overlap between the bound poses of the allosteric and orthosteric ligands. Despite the distance between them, the allosteric effector measurably modifies the enzymatic activity at the orthosteric site. Illustration derived from PDB IDs 2Y5K(53) and 3IFC.[54]

Emerging Rational Design Principles

So-called tried-and-true “design principles” are still being developed. However, a few general principles have begun to emerge. For example, many argue that it is insufficient to design a ligand that merely binds to an allosteric site; rather, the effector must make contact with certain key binding-pocket atoms to have the desired effect.[5,10] These key atoms can often be identified through mutagenesis experiments and crystallographic studies of other allosteric ligands. A promising set of design principles is encapsulated in the “allo-network” strategy, a rational approach that adopts two simultaneous but orthogonal approaches to ligand design.[6] On the protein-structure level, the primary focus is to target a single protein function or an interaction with a single partner. On the signaling-pathway level, the “allo-network” strategy suggests targeting less-connected upstream proteins instead of the more direct, though potentially highly connected, signaling proteins themselves. When applied to early stage design, the allo-network method is predicted to increase the likelihood that a given allosteric effector will proceed through the drug-approval process.[6,12,55] Several published examples for recently approved allosteric drugs serve to illustrate the current state of the art for emerging allosteric drug design principles. They also represent the significant advances that have been made to utilize structure-based methods for challenging druggable sites such as protein–protein interfaces and for membrane proteins such as ion channels. The available details of the discovery and optimization of these compounds do not include the methods discussed in this review; however, they highlight where these predictive techniques could contribute to the allosteric design process. In 2011, Gilmartin et al. of GlaxoSmithKline reported the discovery of a pharmacokinetically optimized allosteric MEK inhibitor, GSK1120212.[56] The first generation inhibitor was discovered by high throughput screening,[57] and the subsequent ternary crystal structure showed the allosteric pocket to be adjacent to the orthosteric ATP-bound site.[58] By 2012 there were 14 allosteric MEK1/2 inhibitors in clinical trials,[59] because it was recognized that an inhibitor developed for this allosteric pocket afforded two very unique opportunities to avoid adverse clinical effects: the high doses required to compete against 1 mM cellular ATP concentrations and inhibition of closely related ATP-binding sites in other kinases. The unique efficacy properties of GSK1120212 highlight both the opportunity and challenge of allosteric drug design. Gilmartin et al. report that, although some other MEK allosteric drugs demonstrate inhibition of the ERK1/2 pathway in vitro, this has not translated into efficacy in patients.[56] GSK1120212 has since been approved in the United States under the name Trametinib for treatment of metastatic melanoma caused by the V600E mutation. In 2012, Saalau-Bethell et al. of Astex reported the discovery of allosteric inhibitors for the HCV NS3 protein.[17] These inhibitors produce an allosteric effect by binding at an interdomain interface and stabilizing a preexisting autoinhibited state of the protein. The original discovery of the allosteric site was accomplished using a fragment-based HTS technique, followed by optimization using X-ray crystallography and structure-based SAR. The authors discuss their experience with a few confounding factors in the allosteric design process, namely the need to use the full-length protein in their screening construct to observe the exerted allosteric effect, and the subsequent directed evolution study of resistance mutations that could occur at the allosteric site. Hackos et al. of Genentech published on the discovery of positive allosteric modulators (PAMs) for GluN2A-containing NMDA receptors in 2016.[60] PAMs are allosteric ligands which increase the effect of the endogenous signaling molecule and do not cause a change in its absence. The allosteric site in this case was at a protein–protein interface, and was discovered using HTS. Subsequent medicinal chemistry efforts then optimized the early hit molecule. The authors note that the validation of this allosteric site was reinforced by its similarity to an analogous allosteric site in AMPA receptors, but that the NMDA receptor site has elements of asymmetry that the AMPA receptor site did not. In comparing two similar compounds, GNE-6901 and GNE-8324, the authors make comments that indicate evidence of mode switching or a shallow SAR landscape, and they further characterize the details of the allosteric mechanism using mutagenesis experiments. In summary these examples demonstrate allosteric drug discovery can be successful at protein sites often considered to be undruggable. It is apparent that these successes can be further built upon through computational methods that allow for rational rather than serendipitous HTS discovery of new allosteric binding sites and a deeper understanding of allosteric mechanisms that overcome design challenges such as mode switching.

Computational Methods for Studying Allostery

Protein-Sequence Analysis Methods

Introduction

Protein-sequence analysis is a useful tool to detect and characterize allosteric pathways and pockets. Here, we classify sequence-based methods into two groups: (1) “single site” methods, which produce a list of individual functional sequence positions; (2) “coupled site” methods, which produce a list of groups comprised of two or more sequence positions that appear to be functionally linked based on their coevolution. All sequence-based analysis methods share some challenges. These challenges include how to select and aggregate clean, relevant sequences as input; interpret the output; and integrate sequence-analysis results with other forms of data. Determining the biological meaning of a strong signal is also problematic. While many analysis methods identify evolutionarily important residues, the specific biological role of these residues cannot be inferred without additional knowledge. For example, it is difficult to determine, based on sequence alone, whether an evolutionarily significant residue plays an allosteric role, or whether its role is related to another essential process (e.g., substrate binding, maintaining protein structure, etc.).[61,62] Indeed, it is likely that a given residue serves multiple purposes simultaneously. Input sequence selection and alignment also present challenges. Most techniques require many sequences to establish statistical significance. To obtain the required number of sequences, researchers often lower the stringency of their search parameters, resulting in alignments that contain sequences with lower similarity or incomplete coverage of the original query. While some analysis methods manage to detect meaningful coevolution over a wide range of multiple sequence alignment (MSA) conservation and noise levels, others are more susceptible to messy data.[63,64] For a more complete discussion of these topics and how they affect coevolution analysis methods, readers are directed to an excellent recent review by de Juan et al.[65]

Single-Site Evolutionary Analysis Methods

By our definition, single-site evolutionary analysis methods return a list of predicted functional sequence positions but do not suggest specific linkages between sites. Once a researcher has constructed an MSA, the conservation or phylogenetic relevance of each column can be used to infer the evolutionary importance of each sequence position. This importance is sometimes a hallmark of thermodynamically critical residues that participate in allostery. Though single-site methods only return a list of single high-scoring sequence positions, the inner workings of some single-site methods are based on the aggregate or correlated behaviors of multiple sequence positions (e.g., to determine baseline residue probabilities within a multiple-sequence alignment or construct a phylogenetic tree). Single-site methods for detecting allostery are advantageous because they lack much of the noise often associated with correlation analysis. These analyses are also appealing because of their simplicity: There are usually fewer parameters to set, and the results can be visualized directly by highlighting key residues on a three-dimensional (3D) protein structure.

Single-Position Entropy

Shannon entropy, one of the simplest nontrivial sequence-analysis metrics,[66] was used widely in early works to identify conserved sequence positions for drug-design or mutagenesis experiments.[67] Similar in form to thermodynamic entropy from statistical mechanics, Shannon entropy measures the population diversity of residues at a single MSA position. It is also central to mutual information (MI), a popular coupled-pair metric. The MI of two sequence positions is defined as the sum of the individual position entropies, minus the entropy of the positions considered jointly. While we do not cover the mathematical details of these methods here, interested readers are directed to previous articles on these topics.[64,68] Shannon entropy does not consider amino acid similarity (e.g., in the Shannon entropy framework, a leucine-to-isoleucine mutation is considered mathematically equivalent to a leucine-to-arginine mutation). Other entropy measures, such as the relative Shannon entropy (also called the Kullback–Leibler divergence (KLD))[69] and the von Neumann entropy,[70,71] attempt to overcome this limitation and, as a result, may be more useful in the search for allosteric sites. Relative Shannon entropy/KLD accounts for some measure of the protein’s chemical environment by considering each mutation with respect to the background amino acid frequencies calculated from the MSA. This analysis may be particularly useful when searching for allosteric sites in proteins that reside in membranes or other noncytosolic compartments, where background residue probabilities or mutational preferences may be biased due to different biochemical contexts. In contrast, von Neumann entropy, a concept borrowed from quantum statistical mechanics, is calculated using amino acid similarity matrices. Identifying an optimal amino acid similarity metric is nontrivial and may well depend on the nature of the system (e.g., in a well-packed protein, residue size may be a sensitive metric, whereas surface-site comparison may require the user to prioritize charge). In a recent publication describing these types of entropy, Johansson and Toh explored how the two metrics can be mixed to detect enzyme active sites with maximum sensitivity.[64] Zhang et al. constructed a variety of new analysis methods in 2008 by combining Shannon or von Neumann entropy, phylogenetic tree structure, and a novel gap-treatment approach.[70] In benchmarking their method, they compared their results to Evolutionary Trace and ConSurf (discussed in greater detail below). Two of their hybrid approaches outperformed all other techniques in detecting significant residues across a variety of proteins: the Improved Zoom method, which incorporates a tree breakdown of subalignments, and the Physiochemical Similarity Zoom method, which extends the Improved Zoom method with von Neumann entropy and tree-branch-size normalization.

Evolutionary Trace

Lichtarge, Bourne, and Cohen pioneered the evolutionary trace (ET) method. The approach has become quite popular, largely because the algorithm is intuitive and its results are readily visualizable.[72] ET aligns a number of sequences and constructs a phylogenetic tree, and then monitors the conservation of sequence positions at major tree branching points. By slicing the tree at different similarity cutoffs, the algorithm extracts the cluster-defining sequence positions. The evolutionary significance of these sequence positions is implied by their conservation in the sequences beyond the next branch. In their first paper,[72] the authors demonstrated that ET can detect functionally important sites in SH2, SH3, and DNA-binding domains. Work has since been published on ET validation, parameter optimization, and best-use practices.[73] In a method often referred to as “difference-ET,” the user runs ET on two related proteins and considers differences in the high-ranking residues and their scores. The sequence positions with strongly varying scores may suggest specificity determinants or differences in allosteric and/or orthosteric mechanisms. Notably, difference-ET has been used in the study of GPCR specificity.[73−75] To better account for varying rates of evolution in different subtrees and correlated mutations, in 2004 Mihalek et al. developed real-valued ET.[77] This method incorporates entropy information into the standard ET framework. This work also introduces the zoom ET method (not related to improved zoom, above), which adds higher weight to sequences that are more similar to the protein of interest. In the introductory work, they used real-valued and zoom ET to detect the functional residues in a kinase domain, and then compared the performance of both methods to regular (integer-valued) ET and entropy. Given unpruned sequence data sets, the real-valued ET and zoom ET methods outperformed the others by a significant margin. In contrast, integer-valued ET prevailed in most respects when pruned data was available. An automated web server is available to perform real-valued ET calculations, generate reports, and visualize results (http://mammoth.bcm.tmc.edu/ETserver.html).[78]

H2r(s)

In 2008, Merkl et al. introduced a method called H2r that serves as a segue between single-site and coupled-site approaches.[79] H2r generates a mutual-information matrix for an MSA, and then discards all but the strongest detected coupled pairs. For each sequence position k, the method returns conn(k), the number of top-ranked pairs that include k. Initial work proved that H2r can successfully detect functionally significant residues across a range of proteins. More recently, H2rs, an improved version of H2r, has been released.[80] This method modifies the original by using von Neumann instead of Shannon entropy and performing more detailed checks for statistical significance. H2rs is available as a web server and a stand-alone application at http://www-bioinf.uni-regensburg.de/.

Coupled-Site Evolutionary Analysis Methods

Second-order sequence analysis detects residue pairs that mutate in concert more frequently than would be expected given random genetic events. Coevolving residue pairs are assumed to be functionally linked, often because they serve essential roles in allostery or structural integrity. The immediate output of second-order allostery analysis is a list of residue pairs with associated correlation strengths. Combining these individual pairwise correlations into a single picture of the entire protein is a separate task. On the most basic level, the strongest correlations that include a residue or site of interest can suggest thermodynamic coupling to other sites, possibly related to allostery. More complex analyses use hierarchical clustering or principal component analysis to analyze these linkages and uncover strongly linked networks of coevolving residues.

Basic Coupled-Site Analyses

Several simple yet reliable residue-coupling analyses have maintained a presence in the literature over the past decades. These basic approaches are advantageous because they are easier to understand and have been shown to score consistently well in a wide range of tests. However, they may fail to detect correlations in more complex cases.[81,82] Though more complex methods exist, many of these basic methods still appear as analysis options in coevolution-detection software packages and web servers. In this review, we focus on a few that are still widely used.

Mutual Information

Mutual information (MI) is one of the most straightforward and long-lived coupling metrics. The MI between two sequence positions is defined as the sum of the Shannon entropies of both positions, minus their joint entropy. Due to its simplicity and favorable mathematical properties, MI analysis is the basis for a number of more complex coevolution methods. However, MI does present certain shortcomings. For example, uncorrelated pairs of high-entropy sequence positions are likely to have a higher MI than uncorrelated pairs of low-entropy positions.[83,84] To compensate for this and other shortcomings, various software packages have implemented a number of mathematical corrections to MI.[85−89] Further, methods to estimate baseline values for correlation (e.g., resampling or sequence shuffling) can improve MI analysis.[84,90,91] Another relatively direct coevolution metric, the McLachlan-based substitution correlation (McBASC),[92] looks for similar patterns of variation in the columns of an MSA, weighting for residue similarity using the McLachlan scoring matrix.[93] Analogous methods can be constructed using different substitution matrices, but McBASC continues to be a popular choice in the literature.[63,84,94] In 2002, Kass and Horovitz[95] analyzed the GroEL complex using a chi-squared test to detect significant residue coevolution in an MSA. The analysis suggested intra- and interchain contact pairs and has continued to appear in the literature under the name “observed minus expected squared” (OMES).[63,81,82,84,96]

Statistical Coupling Analysis

The statistical coupling analysis (SCA) method developed by Lockless and Ranganathan is perhaps the most widely used sequence-based method for allostery prediction.[49] SCA draws an analogy to statistical physics by calculating a “coupling energy” between each sequence-position pair. The original SCA method computes a conservation value for each sequence position i in an MSA, applies one of several types of perturbation to another position j (depending on the SCA version[97]), and finally recalculates the conservation at position i for the sequences that satisfy the perturbation. By calculating the change in individual and joint conservation over a variety of perturbations, SCA establishes a “coupling energy” that indicates the evolutionary coupling of positions i and j. The output of the SCA method is an N × N matrix of coupling energies, where N is the number of sequence positions in the alignment. In early work, the researchers manually identified strongly coupled residue pairs that included one functional member (per experiment). More recent versions of SCA have grouped this matrix into meaningful clusters of coevolving residues using hierarchical or spectral clustering.[50] Refinements of SCA have achieved improved statistical properties by resampling the original distribution.[98] In a 2011 paper, SCA was effectively used to engineer a light-sensitive LOV2 domain onto the surface of DHFR at a location that SCA had identified as energetically linked to the enzyme active site. Some variants of the resulting protein chimera were found to have acquired light-dependent activity.[99] Further work has used SCA to design artificial WW domains.[100,101] In 2011, an SCA analysis of antigen 85C from Mycobacterium tuberculosis suggested new sites that potentially could be exploited in drug design.[102] A number of projects have also demonstrated how SCA can be used to target mutations that affect protein function.[103−109] Inspired by earlier work on the sequence correlation entropy (SCE) method,[110] Dima and Thirumalai published an SCA variant in 2006.[111] This variant controls for specific protein composition by calculating the background probability that a given amino acid will be present at a random sequence position. This probability is determined by considering only the sequences being analyzed, as opposed to all sequences in the SWISS-PROT database.[112,113] Further, they borrowed a coupled two-way clustering procedure from gene-sequence analysis to define the sectors.[114] In validating this method, the authors analyzed the PDZ, GPCR, and lectin families of proteins and were able to quantitatively predict functional residues, which were in agreement with experimental findings.

Explicit Likelihood of Subset Covariation

As mentioned above, SCA is a “perturbation-based” method in which correlation is established by excluding certain sequences from an MSA and monitoring how entropies change. Another popular perturbation-based method was published in 2004 by Dekker et al.[115] This method, explicit likelihood of subset covariation (ELSC), relies on similar principles but returns correlation scores in the form of probabilities rather than statistical energies. ELSC was shown to be superior to SCA in contact prediction when tested on a range of protein families. It has since been implemented on web servers[116,117] and has been a popular benchmark method in the literature.[81,82,94,96]

Direct Coupling Analysis

In 2009, Weigt et al. proposed a mutual-information-based method called direct coupling analysis (DCA) that disentangles directly interacting residues from large networks of indirectly coupled sequence positions.[118] While this method is typically used in structure prediction to identify spatially adjacent sequence positions, it may find application in the study of short-range allosteric interactions. A more efficient implementation of the DCA method, known as “mean field” (as opposed to the original “message-passing” implementation) was published in 2011.[119] Both introductory papers show that DCA is a robust predictor of both intra- and interprotein contacts and that it can hint at the existence of unobserved protein conformations. Related work has shown that DCA can be used in conjunction with structural models to generate predictive models of protein complexes,[120−122] determine the sequence positions that contribute to protein-interaction specificity,[123] and describe the conformational ensembles of proteins in crystallographic or near-crystallographic states.[124] A web server and software package are available to perform DCA analysis at http://dca.rice.edu/portal/dca/home.

PSICOV

PSICOV is another popular contact-detection method that may find productive use in the study of allostery.[125] Mathematically, PSICOV relies on an estimated inverse of the MSA covariance matrix, which acts as a matrix of correlations between all sequence-position pairs that inherently controls for the variations in all other positions. PSICOV was successful at predicting contacting protein residues based on MSA data. The code has been published online at http://bioinfadmin.cs.ucl.ac.uk/downloads/PSICOV/.

Recurrence Quantification Analysis

Recurrence quantification analysis (RQA), another second-order sequence-analysis technique, is best used when much is already known about the mechanism under investigation (e.g., physiochemical amino acid properties such as charge or hydrophobicity are known to drive the allostery). RQA itself is a general method in nonlinear dynamics:[126] In the context of protein sequences, it considers a scalar-value vector that represents some property of a given sequence. In introductory work by Zbilut et al.,[127] the method was used to properly classify 56 TEM-1/β-lactamase mutants with impaired function based on their hydrophobicity profiles. Further RQA work used hydrophobicity scores to classify proteins as allosteric or nonallosteric,[128] study p53 mutants,[129] and reveal interaction partners in viral-envelope proteins.[130] In 2005, Colafranceschi et al. investigated the effect of choosing different physicochemical amino acid descriptors and changing the numerical parameters of the RQA algorithm.[131] More recently, a comparison method based on RQA measurements, known as cross-RQA, effectively detected protein allostery.[132] Interested readers are directed to a review by Zbilut-Webber, which provides examples of RQA applied to a range of computational biology problems.[133]

Comparative Analyses

Some work has been done to competitively benchmark the performance of these methods. In 2004, Fodor and Aldrich compared OMES, MI, SCA, and McBASC in a variety of tests. In short, the study found that performance is largely dependent on the way that different methods determine background residue probabilities and handle positional conservation.[63] A follow-up study investigated how effectively coevolution analysis finds thermodynamically linked residue pairs.[62] In general, spatially contiguous linked pairs were detected, but long-range couplings did not agree with experiments. In 2010, Brown and Brown introduced a new pair-scoring method, called Z-scored-product Normalized Mutual Information (ZNMI), and compared it to the accuracy and reproducibility of MI, two versions of SCA, OMES, and ELSC.[82] The authors presented a thorough meta analysis of method performance and the impact of input-parameter selection. Though none of the tools tested was particularly powerful, ZNMI was the most robust prediction tool. Brown and Brown also found that the use of multiple subalignments produced more accurate and reproducible results. A comparative analysis of SCA and DCA revealed that the top 35 “sectons” found via spectral clustering of the DCA matrix corresponded to pairs, triplets, and quadruplets of spatially contiguous residues.[134] In contrast, a similar analysis of the SCA matrix produced spatially adjacent clusters of many residues each. These different results validate the stated goals of each method: DCA aims to find contacting pairs, whereas SCA aims to find potentially distant groups that are thermodynamically linked in a certain function. In 2014, Pelé et al. investigated seven coevolution analysis methods to find the hallmark covarying pairs in GPCR alignments.[81] They considered three variants of MI, McLachlan based substitution correlation, SCA, ELSC, and OMES. OMES and ELSC were the most robust methods for finding the residues responsible for subfamily divergence. Their article also included an insightful discussion of the methods. Mao et al. published a comparative analysis in 2015.[135] Their study tests OMES, two variants of MI, SCA, PSICOV, and DI, and finds that PSICOV and DI are best at identifying contacting residues. OMES and MIp excel at removing false positives from the lists of predicted contacts. While the authors focused on detecting inter- and intramolecular contacts, their analysis also provided useful insights to guide the productive use of each method. For example, all methods benefit from repeatedly shuffling the MSA and rerunning the analyses in order to provide a baseline and remove false positives. Finally, the authors found that the consensus of DI and PSICOV provides a more robust prediction of contacting residues than any single prediction method alone. The software used to perform this analysis is available through the ProDy Evol program (http://prody.csb.pitt.edu/evol/).[136] In the course of introducing new types of MI analysis (dbZPX2, dgbZPX2, and nbZPX2) and evaluating the effectiveness of MSA simulation (a topic beyond the scope of this paper), Ackerman et al. in 2012 compared many different coevolution analyses in their ability to predict contacting residue pairs.[96] These comparisons found that the “new” methods (the ZPX2 family, DCA, and log(R) (not discussed here)) were significantly superior to the “old” methods (OMES, McBASC, ELSC, and SCA).

Web Servers

Several web servers perform and visualize sequence analyses. Given a PDB code, Contact Map WebViewer (CMWeb)[117] automatically constructs an MSA and visualizes a variety of coevolution analyses: mutual information, SCA, ELSC, OMES, and an early method presented by Göbel et al.[137] The same server can also compare the results of these methods to user-uploaded data (e.g., results the user obtained using some other type of analysis). The CMWeb server can be accessed at http://cmweb.enzim.hu/. The Coevolution Analysis of Protein Residues server hosted by the Gerstein Lab[116] (http://coevolution.gersteinlab.org/coevolution/) can perform a large number of the coupled-site analyses presented in this review, including SCA, ELSC, MI, and McBASC-type methods employing different scoring matrices. The server can also validate the results of these methods in predicting residue distances in a crystal structure. MISTIC (Mutual Information Server to Infer Coevolution) is an automated web server that accepts user-submitted MSAs or collects them from PFAM.[83] MISTIC uses a corrected form of MI to infer coevolving pairs and offers several analysis methods that combine structure and coevolution.[88] It can be accessed at http://mistic.leloir.org.ar/. CAPS (Coevolution Analysis using Protein Sequences) is a unique algorithm that combines phylogenetic, 3D, and MSA data to predict coupled sequence positions.[138,139] Versions 1 and 2 are hosted on web servers at http://bioinf.gen.tcd.ie/caps/ and http://caps.tcd.ie/, respectively. The Interprotein-COrrelated Mutations Server (I-COMS, http://i-coms.leloir.org.ar/index.php) focuses on detecting contacts at protein–protein interfaces, though it can also return intrachain correlations.[140] The server automatically builds alignments; performs MI, DCA, or PSICOV analysis; generates visualizations of the results; and allows users to download data taken from various points in the data-collection and analysis workflow. In 2012, Jeong and Kim published a study describing a close MI variant.[141] They employed an automated workflow to control for various types of noise in sequence alignments, using the MSA sequence profile to establish prior knowledge about the protein. While the authors only studied a few MI variants, they stressed that their profile-based method could be extended to more complex analysis techniques. Their approach, Correlated Mutation Analysis Tool (CMAT), is available on a web server at http://binfolab12.kaist.ac.kr/cmat/. Access to ConSurf, a single-site detection method similar to ET, is available at http://consurf.tau.ac.il/.[142−147]

Software

The ProDy Python package, referred to above, can compute a variety of coevolution metrics.[135,148,149] In 2014, Skjærven et al.[150] released the latest version of the powerful Bio3D R package for protein structure and sequence analysis.[151] While it focuses on structural analysis, the package can compute Shannon entropy and offers useful functions to create and manipulate sequence alignments. Also in 2014, Li et al.[152] published the CorMut software package for R, which computes MI, a metric called the “Jaccard index,”[153] and the conditional selection pressure metric Ka/Ks.[154]Table summarizes the selected coevolution web servers/software packages and their capabilities discussed here.

Table 1

Table of Selected Coevolution Web Servers/Software Packages and Their Capabilities

	web server	downloadable	generates MSA	Shannon entropy	relative Shannon/KLD	MI	SCA	DCA	PSICOV	OMES	ELSC	notes	URL
CMWeb[137]	×	×	×			×	×			×	×	also computes an early method from Gobel et al.	http://cmweb.enzim.hu/
MISTIC[83]	×				×	×						calculates a corrected version of MI	http://mistic.leloir.org.ar/
CMAT[141]	×	×	×			×						many noise-filtering parameters to customize; returns MI, MIp, and MIc	http://binfolab12.kaist.ac.kr/cmat/
I-COMS[140]	×		×			×		×	×			offers two variants of DCA	http://i-coms.leloir.org.ar/
ET[78]	×	×										runs evolutionary trace	http://mammoth.bcm.tmc.edu/ETserver.html
coevolution analysis of protein residues[116]	×	×				×	×			×	×	also computes similarity matrix-based methods (including McBASC), chi square, and quartets coevolution metrics	http://coevolution.gersteinlab.org/
ConSurf[147]	×		×									performs a single-site type of analysis similar to ET	http://consurf.tau.ac.il/
DCA[118,119]	×	×										returns pairwise DI	http://dca.rice.edu/portal/dca/home
CAPS[138,139]	×	×										runs a nonstandard coevolution analysis technique	http://caps.tcd.ie/
H2r[79]	×											runs a nonstandard single-site coevolution analysis technique	http://www-bioinf.uni-regensburg.de/
H2rs[80]		×										runs a nonstandard single-site coevolution analysis technique	http://www-bioinf.uni-regensburg.de/
Bio3D[150]		×		×								R package; useful for creating and modifying sequence alignments	http://thegrantlab.org/bio3d/
CorMut[152]		×				×						R package; also computes K_a/K_s and Jaccard index	https://www.bioconductor.org/packages/release/bioc/html/CorMut.html
ProDy[148,149]		×		×		×	×			×		Python package; can compute DI and mutual information correction/normalization	http://prody.csb.pitt.edu/evol/

Simulation Methods/Correlated Motions

Molecular Dynamics and Monte Carlo Methods

Protein allostery is intimately tied to protein dynamics. Allosteric effectors communicate with orthosteric pockets by altering protein dynamics, either through large-scale structural changes or through more subtle changes in correlated residue motions. A proper understanding of the allosteric mechanism is impossible unless one fully appreciates the underlying dynamic conformational flexibility and energetic landscape. Computational methods to characterize receptor flexibility include molecular dynamics (MD) and Monte Carlo simulations. Briefly, MD simulations represent molecular systems as beads (e.g., atoms) connected by springs (e.g., bonds). Newton’s equations of motion are numerically integrated to propagate the dynamics of this classically formulated system.[155−158] Simulation time is typically discretized into steps of 1 or 2 fs, as required to accurately simulate the fastest atomic motions (i.e., hydrogen bond stretching). At each step, the potential energy of the system is calculated by considering the energies associated with bonded atoms (e.g., due to bond stiffness, angle bending, torsion rotation, etc.) and nonbonded atom pairs (e.g., due to electrostatic and van der Waals forces). The forces on each atom, calculated from the negative gradient of the potential energy, are then used to update the atomic positions at each step. In a related method, called “accelerated MD” (aMD),[159−164] the underlying potential energy landscape is modified by raising the energy wells. These modifications facilitate transitions between states that do not normally occur on the time scales of traditional MD simulations. Interested readers are directed to recent reviews that describe MD simulations in the context of drug discovery.[165−168] In contrast, Monte Carlo based methods take a stochastic approach to conformational sampling.[169,170] At each Monte Carlo step, a Markov-chain procedure is used to generate a slightly modified system conformation. This new conformation is then accepted or rejected at random, biased by the relative potential energy of the new conformation as compared to that of the previous step (the so-called “Metropolis criterion”[169]). As these simulations are stochastic, they do not typically explore conformational space via the same time-dependent path traced in an MD simulation. Indeed, a key limitation of the Monte Carlo method is that the time information connecting the states is lost. Nevertheless, Monte Carlo simulations are widely used because they tend to more efficiently sample system conformations that are statistically representative of the equilibrium state. A recently published book by Landau and Binder describes the technique in detail.[171]

Network Representations and Protein Allostery

As mentioned above, allosteric signals can be transmitted through large-scale conformational changes or subtle shifts in the correlated motions/interactions of individual residues.[31,35] Accessing the time scales required to observe many large-scale conformational changes is presently computationally intractable using unbiased simulation-based methods,[172,173] but these simulations are well-suited to the study of allosteric signals that are transmitted via fast, local fluctuations (i.e., nanosecond-scale changes in the coordinated motions of adjacent residues). Distilling the dynamics sampled by these simulation-based methods into a simple network representation of protein motions often facilitates allosteric analysis.[174−180] Rather than tracking the position of every atom, each protein constituent (e.g., amino acid) is represented as a single node (located, for example, at the residue center of mass). Each pair of nodes is connected by an edge whose length is inversely proportional to the degree of interdependence between their motions, such that nodes connected by short edges are highly interdependent. Different metrics of “interdependence” have been used, including metrics based on correlated motions (e.g., mutual information,[181,182] fluctuations in atomic positions,[183] etc.) or the number of specific noncovalent interactions.[183] Once a network representation of protein dynamics has been constructed, various network-analysis techniques (described below) can identify important pathways of communication between distant residues that may contribute to allostery. Interested readers are directed to a recent review by Atilgan et al. for more detailed information.[184]

Dynamical Network Analysis

A number of simulation-analysis techniques consider the interdependence of individual residues along an allosteric path or paths. In 2008, Bradley et al. analyzed molecular dynamics simulations of Escherichia coli NikR, a transcriptional repressor of the NikABCDE nickel permease, using a novel network-space clustering approach.[183] Two types of residue–residue interdependence were considered, based on fluctuations in residue motions and the number of polar contacts between residue-pair members. Bradley et al. first identified sets of residues whose motions and/or contact counts were correlated with residues in both allosteric and orthosteric pockets. These sets were then clustered using the “unweighted pair group method with arithmetic mean” (UPGMA), a hierarchical agglomerative approach that groups residue sets according to the degree of similarity in their correlation patterns. The authors ultimately identified residue sets likely to mediate communication between the NikR allosteric Ni2+-binding pocket and the orthosteric DNA-binding pocket. Others have pursued orthogonal methods to represent and analyze protein dynamics in silico. In 2009, McClendon et al. introduced MutInf, a novel MD-analysis technique for identifying statistically significant correlated motions.[181] Unlike other techniques, MutInf uses an internal coordinate scheme. Rather than considering the positional coordinates of residue atoms directly, it quantifies correlated motions using a mutual-information metric derived from residue torsion angles. This approach makes MutInf particularly well-suited to study allosteric signals transmitted through the gearlike correlated motions of side-chain rotamers. Using multiple short simulations, the authors were able to identify statistically significant correlations. Applying the technique to interleukin-2 provided a viable theory of allosteric transmission involving a population-shift-mediated signal transmitted between the allosteric and orthosteric pockets via a hydrophobic core. More recent work has incorporated the Kullback–Leibler divergence metric to improve the sensitivity of MutInf analysis.[182] A paper by Van Wart et al., published in 2012, studied imidazole glycerol phosphate synthase (IGPS), an important enzyme in the histidine biosynthetic pathway of plants, fungi, and microbes.[178] Following a molecular dynamics simulation of the enzyme, they built a representative network of residue interactions as inferred by correlated motions. The edges connecting physically distant protein residues were ignored to emphasize correlations resulting from immediate physical interactions. The single shortest path (in network space) between the allosteric and orthosteric sites was then identified. Two of the residues along this path had been shown previously by experiment to be involved in allostery. Van Wart et al. later expanded on their work by creating the freely available Weighted Implementation of Suboptimal Paths (WISP) program.[185] WISP can identify not only the single optimal path connecting allosteric and orthosteric sites, but also multiple other near-optimal paths. While allosteric signaling may occur through a single optimal path in some cases, for many proteins it is likely the summed (or perhaps synergistic) effect of signaling through multiple near-optimal pathways. While several efficient algorithms exist to calculate the single optimal path between two nodes in a network space (e.g., the Floyd–Warshall algorithm, Dijkstra’s algorithm, etc.), calculating near-optimal pathways is far more computation intensive. WISP uses several methods to first simplify the network representation of system dynamics. First, the edges between physically distant nodes are removed regardless of the interdependence of their motions, as in Van Wart’s original paper. Second, all nodes that cannot possibly participate in optimal or near-optimal pathways of allosteric communication are deleted. For each node, Dijkstra’s algorithm is used to calculate the shortest path connecting the source node, the node being considered (n), and the sink node. If this shortest path is still too long (in network space) to be an optimal or near-optimal pathway, no other path passing through n can be optimal or near-optimal. Therefore, n is deleted from the network. After thus greatly simplifying the network, it becomes computationally tractable to calculate notable pathways of allosteric communication between source and sink residues using a recursive, bidirectional approach. In 2014, LeVine et al. introduced a novel analysis framework, called N-body Information Theory (NbIT), that uses multivariate mutual information to measure residue interdependence. This technique is unique in that it considers not only the interdependence (mutual information) between two residues, but also the mutual information between groups of up to N residues. Specifically, a metric called “coordination information” is used to measure the degree to which the correlated motions of a set of residues are shared with another residue not in the set (i.e., the “contribution of a site to all possible n-body correlations with another site”). The authors use this technique to analyze MD simulations of the bacterial transporter LeuT and identify many of the same residues previously found through experiment.[186]

Community Analysis

A number of simulation-analysis techniques expand on calculations of residue–residue interdependence to consider how highly interconnected (and often rigid) communities of residues interact with one another. In this view, allostery is achieved when largely isolated communities of highly interconnected residue nodes (“small-world networks”) engage in long-range interactions. For example, in 2012, Rivalta et al. studied the allosteric mechanism of PRFAR binding to the IGPS heterodimer.[187] After simulating the system in both the absence and presence of the allosteric modulator, they calculated the mutual information between the positional vectors of all residue–residue pairs. Ignoring the correlations between nodes that were not frequently physically adjacent (≤5.0 Å for at least 75% of the simulation), the authors next used the Girvan–Newman algorithm to identify communities of highly interconnected nodes. This analysis provides a powerful illustration of how an allosteric modulator might manipulate entire protein “modules” to transmit a signal. Supported by NMR experiments, the authors ultimately concluded that PRFAR binding decreases the correlated motions at sideL of the heterodimer and induces conformational changes in sideR that ultimately impact the hydrogen-bond network near the active site.[187] Sethi et al. used a similar technique in 2008 to study allostery in tRNA:protein complexes.[174] The use of accelerated MD (aMD) simulations in this same mutual-information/community-analysis workflow has also proved fruitful. aMD simulations are particularly helpful when the allosteric mechanism being studied occurs on time scales that are longer than those accessible through traditional MD. For example, in 2012, Gasper et al. used a mutual-information/community-analysis technique to study α-thrombin.[175] Ultimately, their results suggested that α-thrombin may have two allosteric pathways. One occurs on slower time scales that might not have been accessible were it not for the aMD. Others have used aMD to similarly study allosteric effects in the M2 muscarinic receptor.[188] In 2014, Blacklock and Verkhivker used a similar technique to analyze MD trajectories of the Hsp90 chaperone in various functional states.[189] To verify the existence of communities in this system, they first calculated force constants for each residue, corresponding to the energy cost of residue displacement over the course of the equilibrium simulation. As residues with high force constants are typically rigid, community boundaries (e.g., the boundary between a rigid module and a flexible interdomain hinge site) often correspond to locations where these constants change abruptly. In fact, the “hot spot” regions did correspond to residues known experimentally to be important hinge sites or sites of dimerization. To further characterize this community-based organization, the authors next used protein-structure network analysis to build a network representation of Hsp90 dynamics. They calculated cross-correlation matrices from the MD simulations and identified residues with high degrees of “centrality” (the number of interacting residues) and “betweenness” (the frequency with which a given node lies along the shortest path in network space between two other nodes). Based on this analysis, the authors concluded that Hsp allosteric communication is mediated by highly stable central nodes, surrounded by residues with high degrees of betweenness that may “shield” the more critical residues from the random Brownian/thermal motions common to all proteins.

Pocket Detection

Simple pocket-finding techniques can be useful to identify potential allosteric sites. One straightforward approach is to consider any identified pocket other than the orthosteric to be allosteric. Pocket-detecting algorithms can be used to analyze static crystal structures. However, coupling these algorithms with techniques that account for pocket dynamics is often useful. Some allosteric pocket conformations are extremely rare in the absence of the allosteric effector itself. These “cryptic pockets” are not readily evident in static apo- or ligand-bound crystal structures, but they may be predictively sampled with molecular dynamics simulations.[190−194] Computational methods for pocket detection can be classified as geometry, knowledge, and energy based. We describe these three types of methods in sections , 3.3.3, and 3.3.4.

Geometry-Based Pocket Detection

Geometry-based methods identify pockets by considering only the positions of the receptor atoms themselves. These methods are well-suited to situations where speed of computation is critical (e.g., when pocket detection is applied to conformations extracted from entire MD trajectories) or where pockets are particularly well-defined. On the other hand, accuracy in pocket detection suffers when pockets are shallow, partially collapsed, or highly flexible. And pocket ranking by simple geometric metrics (e.g., volume) does not always correlate strongly with ligand binding affinities or druggability because geometry-based methods do not consider the chemical properties of a given pocket.[195−205] Many geometry-based algorithms have been described in the literature. They can be classified into three subcategories: sphere based, α-shape based,[206] and grid based. Sphere-based methods, such as PASS,[198] PHECOM,[207] POCASA,[208] and SURFNET,[209] first cover protein surfaces with spheres. Each sphere is evaluated and eliminated if judged unlikely to occupy a surface cavity. The locations of the remaining spheres identify likely binding pockets. In contrast, methods such as APROPOS,[210] CAST,[211] CASTp,[212] and Fpocket[199] use α-shape schemes to identify protein pockets. In brief, an α-shape is like a convex hull, except that it potentially follows the surface of the source points more closely than a convex hull. The degree to which it deviates from the convex hull is controlled by the α value (for more details, see Edelsbrunner et al.[213]). APROPOS compares multiple α-shape representations of a protein at different values of α.[210] Pockets are identified by subtracting “higher-resolution” α-shapes from “lower-resolution” shapes. In contrast, CAST generates a Voronoi diagram based on protein-atom locations, removes any Voronoi edges and vertices that fall completely outside the receptor, and identifies pockets as collections of empty triangles (for details, see ref (211)). Fpocket uses a Voronoi tessellation to calculate receptor α-spheres (i.e., spheres that include four protein atoms on their surface but no protein atoms in their interior).[199] When clustered, these spheres tend to congregate at potential pocket sites. Grid-based geometric methods for pocket detection include GHECOM,[214] LIGSITE,[215] LIGSITECSC,[216] POCKET,[217] Pocket-Picker,[218] PocketFinder,[219] Q-SiteFinder,[220] and POVME/POVME Pocket ID.[195,221] These techniques consider points spaced along a grid that encompasses the protein receptor. Each point is evaluated for the likelihood of pocket occupancy, and high-scoring points are clustered to identify binding pockets, whether orthosteric or allosteric. Various metrics are used to evaluate the points, depending on the algorithm. They include the presence of steric clashes with the protein, detected protein–solvent–protein events,[215,217] buriedness indices,[218] predicted methyl–probe interaction energies,[219,220] etc. As a representative example of the geometry-based class, consider POVME/POVME Pocket ID, a grid-based pocket-finding algorithm.[195,221] POVME Pocket ID identifies surface and internal cavities by first creating a low-resolution 3D grid of equidistant points that encompasses the entire protein. Points are removed if they are physically near any protein heavy atom or fall outside the convex hull defined by the protein α-carbons. The volumes corresponding to the remaining low-resolution points, which tend to encompass protein pockets, are then replaced with smaller, higher-resolution 3D grids, and the same point-eliminating process is repeated. Finally, any remaining high-resolution points that have fewer than a user-defined number of neighbors are eliminated iteratively until no such points remain, and stretches of contiguous points are grouped together into distinct pockets. The locations of these identified pockets can be fed into the POVME program, which calculates pocket metrics such as volume and shape. The latest version of POVME can efficiently analyze entire MD trajectories, allowing the user to monitor pocket changes over time and identify sampled conformations that are geometrically distinct. This approach is useful for identifying cryptic pockets that only occasionally manifest themselves over the course of an MD simulation.

Knowledge-Based Pocket Detection

Knowledge-based algorithms draw on large structural and genomic databases to determine binding-pocket locations based on sequence and overall structure rather than pocket shape. Algorithms such as 3DLigandSite,[222] FINDSITE,[223] and Pocketome[224] infer these locations by querying existing databases for ligand-bound proteins that are structurally similar to the target protein. These algorithms work particularly well when the target protein has a highly conserved binding site. Other methods, such as FRpred, use sequence conservation, in conjunction with supporting methods like surface-residue-prediction algorithms, to identify surface residues that are likely biologically important.[225] Finally, some hybrid methods, such as the evolutionary-trace method,[226] ConSeq,[142] and ConCavity,[227] combine sequence and structural homology to identify likely binding pockets. The section Single-Site Evolutionary Analysis Methods describes methods like these in further detail.

Energy-Based Pocket Detection

Energy-based methods identify binding sites by evaluating whether or not a given protein region interacts favorably with docked small organic probes. These methods are particularly useful to detect a given pocket and assess its druggability. This additional information does come at a cost; energy-based methods are more computationally demanding and so do not scale well, limiting their applicability when analyzing large-scale structural data sets or MD trajectories. Methods such as SiteMap[202] and SITEHOUND[228] use single probes (typically methane or water molecules) to assess the likelihood of small-molecule/protein binding. In contrast, methods such as FTMap,[203,229,230] GRID,[231] and MCSS[232] consider multiple chemically diverse probes and so identify not only pockets, but also druggable hot spots that are more likely to function as orthosteric or allosteric binding sites. FTMap[203,229,230,235] in particular warrants further discussion. FTMap is the computational equivalent of the multiple solvent crystal structures (MSCS) technique.[233,234] Rather than superimposing multiple crystal structures obtained by soaking a protein in six to eight different organic solvents, FTMap computationally docks 16 distinct organic probes into a user-provided protein model. The docked positions of these probes are then clustered, and the clusters are ranked by their Boltzmann-averaged energies. In an alternate implementation called FTSite,[ref100] adjacent, mutually accessible clusters are combined into “sites” that are then ranked by the number of nonbonded contacts between the protein and the corresponding probes. Both these techniques can be applied to multiple structures, whether derived from X-ray crystallography or simulation, to potentially identify cryptic sites that are not always evident when single structures are considered. FTMap can be used to identify possible allosteric sites. For example, Ivetac et al. used it to detect five potentially druggable sites on β1AR and β2AR.[236,237] It has also been used to evaluate the druggability of p53 cryptic pockets, which were subsequently used to prospectively discover novel reactivation compounds.[192]

Markov State Models

Over the past 10 years, Markov state models (MSMs) have been used increasingly to analyze molecular dynamics simulations, suggesting that they are already promising tools to study allostery in a drug-discovery context.[194,238−242] In brief, a Markov state model is a stochastic kinetic model that describes the probability of transitioning between discrete states at a fixed time interval.[238,243,244] These models are required to have the Markovian property (i.e., that the probability of transitioning between discrete states is independent of previous transitions). By clustering protein structures extracted from an MD trajectory, it is possible to identify discrete conformational states for use in MSMs.[238,243,244] Transitions between conformational clusters observed over the course of an MD trajectory are tallied, and the MSM is then built from the transition probabilities between these distinct clusters. As they are built solely on these transition probabilities, MSMs can draw upon multiple trajectories, enabling efficient sampling of the entire conformational space through many independent simulations carried out in parallel. MSMs can be built with the software package MSMBuilder, available at http://msmbuilder.org/latest/.[243] EMMA,[245] available at http://www.emma-project.org/latest/, is another useful building tool. Because independent trajectories are aggregated as a postprocessing step, the simulations can be carried out on a variety of computational resources, including independent desktop machines, local clusters, or high-performance computing resources. A recent example carried out by Kohlhoff et al. used tens of thousands of trajectories carried out on Google’s exascale cloud computing resources. By subsequently “knitting together” the simulation data within a MSM framework, the researchers elucidated the activation pathway of GPCR β2AR.[246] MSM/MD analysis provides access to the thermodynamic, kinetic, and structural characteristics of the protein conformational ensemble (i.e., a robust description of the free-energy landscape of the protein).[238,243−245,247,248] The thermodynamics of the various conformational states can be calculated from the equilibrium distribution. It is also possible to resolve the transition kinetics between individual states, the concerted or principal protein motions, metastable states, and the transition pathways between discrete states.[243,245,247] Lastly, the MD trajectories also provide representative receptor snapshots for use in structure-based drug design.[248]

MSMs and Drug Discovery

MD-based MSMs were originally developed to study protein folding,[239,249−254] but they can also be used to study molecular phenomena that are directly applicable to allostery and drug discovery, including cryptic site identification,[191,194,246] protein–ligand interactions,[255,256] and protein function.[51,246,249,257] MSMs have been used to study binding-site conformational heterogeneity, revealing conformations that are not readily evident in any crystal structure. In 2012, Bowman and co-workers used MSMs to identify and validate cryptic allosteric sites in TEM-1 β-lactamase, interleukin-2, and RNase H.[191,194] Similarly, in a recent study of GPCR conformational transitions, Kohlhoff et al. identified intermediate conformations that they used successfully in a subsequent virtual-screening campaign.[246] Shukla and co-workers also identified a key conformational intermediate in the Src kinase activation pathway that could be used as a target for drug discovery.[258] While traditional molecular dynamics simulations have been used to identify cryptic allosteric sites and elucidate binding-site conformational diversity, Markov state models can characterize the probability and dynamics of these conformations, permitting rational design against dynamic pockets. Researchers have also employed MSMs to gain insights into ligand–protein binding that are critical to both allosteric and orthosteric structure-based drug discovery. Buch et al. used MSMs derived from multiple independent MD simulations to accurately determine the kon and binding free energy of benzamidine to trypsin.[255] Building on that work, Tiwary et al. used MSMs in conjunction with metadynamics, a variant of molecular dynamics, to determine the koff for the same system.[256] While this approach is currently too computationally demanding for virtual screening, it has identified transitory sites important for binding and shed new light on the interactions that govern ligand-binding kinetics. MSMs also provide useful insights into how a protein conformational ensemble begets function and transmits allosteric signals. Most studies have focused on the transition between two functional states. In this context, MSMs have been used to study Src kinase,[258] the β2 adrenergic receptor (β2AR),[246] and bacterial response regulator NtrC.[52,257] Malmstrom et al. recently used this same technique to study allostery in the cyclic-nucleotide binding domain (CBD) of protein kinase A.[51] By building and simulating models of the CBD, with and without bound cAMP, the study provided insight into how ligand binding modulates the protein conformational ensemble. As it is challenging to determine the impact of an effector molecule on the conformational ensemble, most MSM studies to date have not examined allostery directly. Nevertheless, two relevant themes emerge from existing studies. First, kinetics play an important role in the transition between functional states.[51,52,257] This is best shown in our study of the PKA CBD, where the effector molecule cAMP significantly modulates the kinetics of the transition from the active to the inactive states of the protein, while simultaneously leaving the reverse transition between the inactive and active states unmodified.[51] This modulation seems to be regulated by transient interactions between the effector molecule and the protein. Second, recent MSM studies suggest that there is not one but many pathways between functional states. Studies of β2AR,[246] NtrC,[52,257] and the PKA CBD[51] each revealed multiple transition pathways that arose from transient interactions within the system, suggesting a probabilistic model of allostery in addition to one that describes allostery in terms of concerted pathways.[257] Studies like these can provide useful insights applicable to structure-based drug discovery targeting transitional conformational states. MSMs can provide significant insights into protein–ligand interactions and allostery, but building these models is computationally demanding, limiting their use as screening tools. This weakness aside, MSMs do provide a strong foundation for rational drug discovery and enhanced understanding of allosteric mechanisms.

Energy Landscape and Topological Analyses

Introduction

Like MSMs, energy-landscape theories were originally developed to study protein folding. But they too can be applied to the study of allosteric mechanisms and protein dynamical modes. Per statistical mechanics, the underlying free-energy landscape of a system determines the probability of observing a particular configuration. Any topological analysis aimed at understanding allostery must be based on a thermodynamic understanding of the impact that effector binding has on this energy landscape.

Protein Frustration

It is helpful to view allosteric binding as an event that perturbs the conformational exchange, or folding pathways, of a protein. The principles of statistical mechanics applied to polymers in solution have provided much insight into the protein-folding process. As proteins fold, the number of native contacts increases, and the entropy of the polymer decreases. Folding is impossible when the required conformational transitions are occluded by large free-energy barriers or wells. For example, a large local minimum along the folding pathway could force a protein to undergo a glass transition, leading to a kinetically trapped state. Typical folding must be governed by the “principle of minimum frustration” (i.e., the folding pathway must be funnel shaped and relatively smooth).[259−264] In terms of allostery and energy-landscape theory, a positive allosteric modulator can be seen as a compound that increases the likelihood of the binding state by deepening the binding-state well or decreasing the depth of competing wells. On the other hand, a negative allosteric modulator may shift the location of the binding-site well, shallow out the well, or increase the depth of competing wells. Using these guiding principles, heuristics can be designed to identify each of the above scenarios. Ferreiro et al. have developed a heuristic to pick out locally frustrated residues.[265,266] While the overall folding process obeys the principle of minimum frustration, this does not preclude the possibility of having small local barriers that must be overcome to fold. To detect these local frustrations, the authors systematically perturbed the amino acid sequence and compared the perturbed total energy with that of the native state. This comparison provides the effective stabilization energy associated with each amino acid pair. If a given native-state interaction is highly stabilizing (i.e., the interaction energy is favorable compared to mutants), it is said to be “minimally frustrated.” On the other hand, if a pair of residues is sufficiently destabilizing (i.e., the interaction energy is weak compared with that of the mutants), it is “frustrated.” Frustrated residues can be thought of as weak points in the global structure, contributing to barriers that may introduce slow degrees of freedom. As such, they may be involved in ligand binding or allostery. Jenik et al. have implemented this heuristic as a web server http://www.frustratometer.tk/.[267] Weinkam et al. used a dual-topology Go̅ model[268] to survey the free-energy landscape of the calmodulin-GFP Ca2+ sensor protein, the maltose binding protein, and the CSL transcription factor. They introduced a truncated Gaussian distance term to the soft-sphere atom overlap term implemented in MODELLER,[269−272] resulting in either one basin corresponding to the bound/unbound distances or two corresponding to both, depending on the pairwise distance and cutoff. Subsequent MD simulations of the resulting model were analyzed using their pseudocorrelation map algorithm.[273] They found that their model reproduces allosteric motions and sheds insight into the macroscopic mechanism.

Normal-Mode Analysis

Protein dynamics and subsequent perturbations from binding events can also be studied using normal-mode analysis (NMA).[274−276] NMA assumes that the potential energy landscape in the vicinity of a minimized atomic structure is approximately harmonic. This simplifying assumption allows for diagonalization of the Hessian matrix. Solving the eigenvalue problem produces the eigenvectors (movement direction) and eigenvalues (vibrational frequencies). NMA is known to reproduce the slow degrees of freedom of protein motion well.[274,277] Applying additional constraints to simulate the impact of an effector on binding-site residues can yield new fundamental modes that may provide insight into the allosteric mechanism. Ming and Wall applied normal-mode analysis to investigate the effect of tri-N-acetyl-d-glucosamine upon lysozyme.[278] They assumed that greater perturbation to the conformational distribution corresponds to increased likelihood that a binding event will occur at a regulatory or active site. The degree of perturbation is monitored by comparing the distributions predicted by NMA of the protein with and without the ligand bound, via the Kullback–Leibler divergence. In this work, Ming and Wall generated many arbitrary poses and investigated the allosteric potential, defined by the KLD. They showed that sites with large KLD correspond with identified functional lysozyme residues. The use of normal-mode analysis removes the need to integrate the equations of motion. However, for the harmonic approximations to be accurate, the starting structure needs to be in a local minima. An initial molecular mechanics minimization can ensure that proteins of interest are near what might be seen in their native environments (e.g., a GPCR embedded in a lipid bilayer with heterogeneous composition). However, alternate options include simplified interaction Hamiltonians such as those offered by elastic network models (ENMs).

Elastic Network Models

The complexity and computational cost of simulating detailed atomic potentials inspired development of simpler ENMs in which uniform harmonic potentials are used to model all interactions.[279−281] A popular ENM is the Gaussian network model (GNM), wherein neighboring residues are connected by virtual springs to create a network/graph of interactions. Here the interatomic potential energy, U, of the system can be expressed aswhere γ is the uniform force constant of the harmonic springs, δR is the N-dimensional column vector of the instantaneous fluctuations of specific atoms, and Γ is the N × N Kirchoff (connectivity) matrix of residue interactions given bywhere Rc is a residue cutoff distance. Ming and Wall have developed a backbone-enhanced elastic network model (BENM) in which the interactions between connected residues are scaled by an additional factor. Using an objective function defined by the KLD of the marginal distribution of all-atom and Cα atomic coordinates, the authors showed that BENM can reproduce the atomistic mean-squared displacement for bovine pancreatic trypsin inhibitor (BPTI).[282] To consider the effects of allostery, the authors again used the KLD of local conformational distributions before and after ligand binding.[282] To locate allosteric binding sites, Su et al. recently identified allosterically coupled regions in a GNM using a thermodynamic method.[283] All neighboring residues (represented by their corresponding Cα atoms) were connected by virtual harmonic springs of equivalent strength. A spherical probe then sampled the protein surface to locate the potential effector interaction points. Additional springs were attached between the effector point and the local residues in the protein–effector model. The free-energy difference (ΔΔG) between the protein–ligand and protein–effector–ligand systems was calculated. Regions with large ΔΔG values were assumed to be important for the allosteric mechanism. Using this method, Su et al. studied the Hsp70 and the GluA2 AMPA receptors. These allosteric predictions corresponded to experimental results.[283] Berezovsky et al. used a similar coarse-grained network-interaction model to predict the location of allosteric sites.[284] To identify potential pockets for effector/substrate binding, the authors generated a “residue interaction graph” (RIG), with residues as nodes and interatomic contacts as edges. They then evaluated the “local closeness,” defined as the number of total neighboring residues weighted by 1/r, where r is the inter-residue distance.[285] Regions with high local closeness were likely to be in pockets and, thus, were putative binding sites. The extent to which each binding site was coupled to the intrinsic protein motions was then explored by estimating the strain (so-called “binding leverage”) on ligand–protein contacts under deformations described by NMA. A ligand that bound to a site with high binding leverage could restrict protein dynamics.[286] The binding leverage describes the extent to which changes in binding sites and protein conformational degrees of freedom are coupled; it does not specifically predict allosterically linked sites. To predict allostery, Mitternacht and Berezovsky introduced “leverage coupling,” which selects for sites that are strongly coupled to the same degree of freedom.[286] The mode associated with the most binding leverage is considered most relevant, based on the assumption that there is one dominant motion promoting allostery. These methods have been implemented in the online SPACER server http://allostery.bii.a-star.edu.sg/.[287]

Conclusions

Over the past decade, advances in computing power and predictive algorithms coupled with the increased availability of structural and biochemical data have revealed new opportunities for rational design of allosteric drugs. The emergence of novel computational approaches to describe and predict allosteric phenomena across a range of scales, from the coordinated atomic movements in a single receptor molecule to complex allosteric signaling networks, is ushering in a new era wherein computational methods can be used to prospectively predict, discover, and characterize allosteric sites and effector molecules. Within the context of a drug-discovery program, such approaches hold the potential for developing drugs with increased specificity and selectivity, as well as the ability to gain new and more comprehensive understanding of old targets. The convergence of advances in (i) theoretical MSM-based frameworks and MSM building software, (ii) community MD codes that can achieve >100 ns/day sampling for realistic sized systems on single gaming/commodity GPU processors, and (iii) pocket and druggable site-detection algorithms now make it possible for researchers even in industrial settings, with fast-paced timelines and stringent quality standards, to apply these approaches to drug targets already in their arsenals. The application of these methods to kinases and GPCRs seems particularly worthwhile, given the existence of assays and structural data, and the challenges faced by existing drug candidates in the clinic.

273 in total

1. Evolutionarily conserved pathways of energetic connectivity in protein families.

Authors: S W Lockless; R Ranganathan
Journal: Science Date: 1999-10-08 Impact factor: 47.728

Review 2. Folding and binding cascades: dynamic landscapes and population shifts.

Authors: S Kumar; B Ma; C J Tsai; N Sinha; R Nussinov
Journal: Protein Sci Date: 2000-01 Impact factor: 6.725

3. Fast prediction and visualization of protein binding pockets with PASS.

Authors: G P Brady; P F Stouten
Journal: J Comput Aided Mol Des Date: 2000-05 Impact factor: 3.686

4. ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information.

Authors: A Armon; D Graur; N Ben-Tal
Journal: J Mol Biol Date: 2001-03-16 Impact factor: 5.469

Review Motivation and Organization

Introduction

Emerging Rational Design Principles

Computational Methods for Studying Allostery

Protein-Sequence Analysis Methods

Introduction

Single-Site Evolutionary Analysis Methods

Single-Position Entropy

Evolutionary Trace

H2r(s)

Coupled-Site Evolutionary Analysis Methods

Basic Coupled-Site Analyses

Mutual Information

Statistical Coupling Analysis

Explicit Likelihood of Subset Covariation

Direct Coupling Analysis

PSICOV

Recurrence Quantification Analysis

Comparative Analyses

Web Servers

Software

Simulation Methods/Correlated Motions

Molecular Dynamics and Monte Carlo Methods

Network Representations and Protein Allostery

Dynamical Network Analysis

Community Analysis

Pocket Detection

Geometry-Based Pocket Detection

Knowledge-Based Pocket Detection

Energy-Based Pocket Detection

Markov State Models

MSMs and Drug Discovery

Energy Landscape and Topological Analyses

Introduction

Protein Frustration

Normal-Mode Analysis

Elastic Network Models

Conclusions

Review 2. Folding and binding cascades: dynamic landscapes and population shifts.

Review 8. Comparative protein structure modeling of genes and genomes.

Review 7. NMR and computational methods for molecular resolution of allosteric pathways in enzyme complexes.

Review 10. Engineered control of enzyme structural dynamics and function.