Literature DB >> 32489528

Coevolutionary data-based interaction networks approach highlighting key residues across protein families: The case of the G-protein coupled receptors.

Filippo Baldessari¹, Riccardo Capelli², Paolo Carloni², Alejandro Giorgetti^1,2.

Abstract

We present an approach that, by integrating structural data with Direct Coupling Analysis, is able to pinpoint most of the interaction hotspots (i.e. key residues for the biological activity) across very sparse protein families in a single run. An application to the Class A G-protein coupled receptors (GPCRs), both in their active and inactive states, demonstrates the predictive power of our approach. The latter can be easily extended to any other kind of protein family, where it is expected to highlight most key sites involved in their functional activity.

Entities: CellLine Chemical Disease Gene Species

Keywords: Coevolution; Conformational states; Functionally relevant residues; GPCRs; Interaction network

Year: 2020 PMID： 32489528 PMCID： PMC7260681 DOI： 10.1016/j.csbj.2020.05.003

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

The discovery of correlated mutations in protein families across different organisms has shown to provide valuable information on the functional role of residues [1]. These mutations arise from evolutionary pressure, that drives the changes to enhance stability and/or biological function. Starting from the correlation between mutations in sequence alignments (i.e., coevolutionary analysis, CA), one can infer that the destabilization induced by a single-point mutation can be attenuated or even counterbalanced by a corresponding mutation in a different portion of the sequence. Different CA approaches include DCA [2], [3], [4], plmDCA [5], GREMLIN [6], PSICOV [7], and many others [8], [9]. These methods have been used for different goals [10]: from the definition of coarse-grained force fields for molecular simulation [11], to the prediction of mutation energetics [12], [13], the direct inference of 3D structures [3], refinement of structure prediction [14], [15], and the investigation of protein–protein interactions [16], [17]. Here we introduce a structure-based CA that identifies in a single run highly structurally and/or functionally relevant residues across a protein family. This is achieved integrating structural information on a modified version of Lui and Tiana’s [12] approach. The latter uses internal interaction networks to uncover frustrated interactions and mutation free energy differences for a specific protein. Our protocol has several advantages. Unlike previous approaches, mainly based on single protein domains (usually obtained from PFAM [18]), our protocol includes entire proteins in the calculations. In addition, it provides insight on different conformational states of proteins (unlike “classical” DCA analysis [2], which is based on pure sequence information), such as receptor active/inactive states or ion channel close/open states. Finally, and most importantly, it can be applied to large sparse families, i.e. with very large sequence variability due to evolution. Applications to the sparse and fairly well-structurally characterized [19] subfamily, “class A” of the human G-Protein Coupled Receptor (hGPCRs) superfamily, shows the predictive power of the approach: in a single run, it identifies coevolutionary related hotspots, previously pinpointed by techniques other than CA [20], [21], integrating also structural information to highlight differences in distinct conformational states. These are fundamental structural/functional residues or correlated with diseases. The protocol is totally general and can easily be extended to other subfamilies of GPCRs, from organisms other than Homo Sapiens, as well of other large receptor families with large intrinsic variability, like the pentameric ligand-gated ion channels (pLGICs) or the voltage-gated ion channels.

Results

The approach

In the following sections, we briefly describe our multistep strategy (Fig. 1). Our approach uses structural information in two different stages of the protocol, i.e. during the multiple sequence alignment (MSA) generation and in the construction of the interaction matrix, in contrast to most of the previously used coevolutionary approaches, that usually do not consider this information.

Fig. 1

Schematic representation of the workflow. We employ available sequence alignments and structures to build a coevolution-based interaction matrix that we refine using a contact map, building a network that contains spatial and interaction information about the protein of interest. Hotspots are finally identified by means of the analysis of the betweenness centrality of every node, that are subsequently labeled based on the data available in the literature.

Experimental data

Let us consider the general case for which sequences across a given family are highly diverse. In this case several bottlenecks, starting from the MSA generation, can be found. Thus, the use of state of the art methodologies can be applied to overcome these difficulties. We consider here an alignment formed by M sequences composed by L residues (where i is the kind of amino acid) obtained form curated databases (Uniprot [22], Pfam [18], or GPCRdb [23]). The MSA can be generated using an algorithm, Promals3D [24], that utilizes structural data, and thus can alleviate the difficulty in aligning families that displays sparse sequences.

Interaction analysis – DCA and Sequence-specific interaction matrix

Our coevolutionary analysis protocol is based on Direct Coupling Analysis [2], [3]. The basic assumption of such technique is that the interaction between different residues can be written aswhere is the amino acid at i-th position of the sequence, is the two-bodies term (analogous to the interaction term of a Potts’ model) that contains the interaction energy between residues and at position i and j of the alignment, respectively, and is a one-body potential that can act on the residues (analogous to the field term of a Potts’ model). The key quantity here is the two-bodies term , that contains the information needed to build the interaction graph, leading, eventually, to the identification of the hotspots (see below). The frequency counts of pairs and single residues of a sequence with n amino acids can be seen as marginals of a probability distributionwhere the probability p is defined as Starting from our alignment with n residues and M sequences of length L, we can compute the empirical frequencies and . For , the empirical frequencies will match the theoretical distributions . For a realistic situation, using the empirical distribution we will have finite-size effects, given by the finite number of sequences available. To overcome this issue, we reweight the empirical frequencies with the appropriate pseudocounts [25]. These pseudocounts are weighted on the distribution of residue types (via the weight x), on the distribution of residue types in the considered alignment (via the weight y), and on the distribution of residue types in a specific pair of positions in the alignment (via the weight z), namelywhere q is the number of residue types (here we have 21 types: the 20 amino acids and the gap), is an effective number of sequences and is the number of sequences with similarity larger than 70%. In this work we fixed , and following Contini and Tiana [13], that tested these parameters for both cytosolic and membrane proteins. After the reweighting, if we apply a mean-field approximation we can obtain the associated correlation matrix for the frequencies defined asand finally obtain the two-bodies interaction energy in Table 2

Table 2

Details of the hotspots. For every hotspot identified, we highlight the state (active or inactive) of the hGPCR where the residue was identified, the presence of a documented function, interaction with a ligand, the existence of a mutant or variant in the GPCRdb, and the amino acid consensus. Hy indicates general hydrophobic residues; Ha, Hydrophobic aliphatic; Hb, hydrogen bonding; Sm, small. For references of the experimental data, see [11] to [100] of SI.

As shown in Refs. [3], [12]. This two-bodies term (that have the form of a 4-dimensional tensor) contains the two-bodies interaction of all the possible residue pairs. Choosing our sequence of interest, we can extract from this tensor the interaction matrix that describes the interaction between all the possible amino acid pairs in the system.

Network analysis

contains interaction information that is complementary to 3D structural data, because can discriminate the most important energetic contribution of residue that are located in spatial proximity. To integrate spatial and energetic information in a single object, we computed a residue-residue contact map for every protein structure: Where is the distance between the Cs of the i-th and the j-th residues (in the case of glycine, the hydrogen atom in the analogue position). To insert structural information in , we proceed following Scarabelli et al.[26]. There, the authors perform an Hadamard product (a element-by-element multiplication of the two matrices) between the contact map and the interaction matrix, obtaining the local interaction matrix , namely If the experimental structure considered contains non-resolved residues, one can remove the part of involving such parts. Now, let us consider each residue of the protein family as a node of a weighted graph . In this graph, we connect two nodes if the modulus of the interaction obtained from DCA between the respective residues is larger than a threshold value defined iteratively: we start building a network considering the maximum energy of the matrix (in modulus), obtaining an unconnected graph (i.e., a network of isolated nodes except for the two residues with the strongest interaction).1 At this point, we iteratively lower value until we obtain a connected graph. The maximum value of that still returns a connected graph defines the final interaction network and the connected graph () itself.

Hotspots

The betweenness centrality of a node in reads [27]:where is the number of the shortest paths in the graph that connect i and j passing through k, and is the number of the shortest paths in the graph that connect i and j. If the considered node is “central” in the network (i.e., the information flow passes through it to connect different portion of the protein), its betweenness centrality will be larger. We considered residues as hotspots if the betweenness centrality of their associated node is larger than half of the maximum betweenness centrality in all the nodes.

Application to class A hGPCRs

hGPCRs, with more than 800 members [19], is the largest family of cell-surface receptors. External signals are translated by this family into cell stimuli. A widely used classification system of hGPCRs is the A-F system that is mainly based on their amino acid sequences and functional similarities. This classification identifies six classes, labeled A-F. Class A, also known as the “rhodopsin-like family”, is the largest group of hGPCRs [28], which includes hormones, neurotransmitters, and light receptors and accounts for around 80% of hGPCRs. These proteins share a common topological signature, namely seven -helical transmembrane (TM) domains [29]. The members of the family share also positions of residues directly involved in ligand binding and receptors activation. These include for example positions 3.28, 3.32, 3.33, 3.36, 3.37, 5.39, 6.44, 6.48, 6.55, 7.35 and 7.39 [30], [31], [32], functionally conserved along the entire class A [32], [20].2 Such common structural organization contrasts strongly with the agonists’ structural diversity, from subatomic particles (a photon), to small molecules, up to peptides and even proteins [29]. Agonist binding to class A hGPCRs triggers receptor activation. The residues involved in the activation extend from the binding cavity, [20], [34], [35], [36], [37] to the intracellular side of the receptor. Activation lead to binding of a cognate proteins, e.g. G-protein and -arrestins, and finally downstream signaling pathways. However, these proteins do not simply switch between alternative agonist-bound and inactive forms in this process. They rather adopt a series of intermediate states -likely represented by an ensemble of conformations [38]- influenced not only by agonist binding, but also by other receptors, signaling and regulatory proteins, by post-translational modifications, and by environmental cues [39]. The input of the workflow (Fig. 1) consists of sequence alignments and of experimentally solved (PDB) structures of vertebrate class A GPCRs.3 We considered the vertebrates GPCRs for building up the evolutionary history of the family because, out of vertebrate species the classification in subfamilies is more difficult and not always accurate [19]. The subclass sequences were downloaded from the Uniprot database [22]. The reviewed sequences were firstly chosen (2514 sequences). New sequences from the unreviewed data set were then manually added. All the resulting curated sequences (5,000) were aligned using the Promals3D web-server [24]. We used the default parameters of the server. The MSA obtained by using this program satisfies all the class A hGPCR features, a set of highly conserved residues in each of the transmembrane helices [30], that gives rise to the Ballesteros-Weinstein nomenclature [33] (see footNote 3). The alignment was aided by 50 experimental or predicted structures belonging to 28 different human class A GPCRs, both in the active and inactive states (see Table 1). In all the cases, the structure was not resolved in its entirety: parts of the sequence (typically the intracellular loop and the C- and N-termini of the chain) was missing in the experimental structure. To match structural/sequence information (and matrices dimension), we removed the parts of the that involved unresolved residues. Next, we built the local interaction network and computed the betweenness centrality of every residue. As mentioned in the description of the method, in this phase we used again the structural information of Table 1. Several hotspot positions underwent site-directed mutagenesis experiments (see Tables 2 and 1 SI and references within). Many mutants have a lower ligand activity or prevent activation ( https://gpcrdb.org) [23]) or are linked to disease. Residues identified as hotspots across 30% or more hGPCRs have a documented biological function (Table 2), such as: belonging to the ligand binding site (i), or to the micro-switch network of activation (ii) or being located within the allosteric Na+ binding cavity (iii). Not all the hotspots listed in Table 2 are present in all the structures in Table 1 (the number of hotspots per single structure ranging from 2 to 26, see Table 2 SI). In particular, for some identified hotspots we can infer their functional role via a direct comparison with other members within the same family. Some examples are briefly discussed below,

Table 1

Name	Species	UNIPROT	Active	Inactive
Rhodopsin	Human	OPSD_HUMAN	6CMO	(98%)
Cannabinoid-1	Human	CNR1_HUMAN	6N4B	5U09
Cannabinoid-2	Human	CNR2_HUMAN	(66%)	5ZTY
Muscarinic M1	Human	ACM1_HUMAN	6OIJ	5CXV
Muscarinic M2	Human	ACM2_HUMAN	4MQS	3UON
Muscarinic M4	Human	ACM4_HUMAN	(91%)	5DSG
β2-Adrenoreceptor	Human	ADRB2_HUMAN	4LDE	2RH1
Adenosine A1	Human	AA1R_HUMAN	6D9H	5UEN
Adenosine A2A	Human	AA2AR_HUMAN	5G53	5NM4
δ-Opioid	Human	OPRD_HUMAN	(82%)	4N6H
μ-Opioid	Human	OPRM_HUMAN	5C1M	4DKL
κ-Opioid	Human	OPRK_HUMAN	6B73	4DJK
NOP Receptor	Human	OPRX_HUMAN	(77%)	5DHH
Serotonin 1B	Human	5HT1B_HUMAN	6G79	(60%)
Serotonin 2A	Human	5HT2A_HUMAN	(83%)	6A94
Serotonin 2B	Human	5HT2B_HUMAN	5TUD	(78%)
Serotonin 2C	Human	5HT2C_HUMAN	6BQG	6BQH
Dopamine 2 Receptor	Human	DRD2_HUMAN	(60%)	6CM4
Dopamine 3 Receptor	Human	DRD3_HUMAN	(57%)	3PBL
Dopamine 4 Receptor	Human	DRD4_HUMAN	(57%)	5WIU
Angiotensin 1	Human	AGTR1_HUMAN	6DO1	4YAY
Apelin Receptor	Human	APJ_HUMAN	(54%)	5VBL
C–C Chemokine 2	Human	CCR2_HUMAN	–	6GPX
C–C Chemokine 5	Human	CCR5_HUMAN	–	5UIX
C–C Chemokine 9	Human	CCR9_HUMAN	–	5LWE
C–C Chemokine 1	Human	CCR1_HUMAN	–	(78%)
C–C Chemokine 3	Human	CCR3_HUMAN	–	(77%)
C–C Chemokine-Like 2	Human	CCRL2_HUMAN	–	(62%)

Human Class A GPCRs experimental and predicted structures. The homology models were obtained from GPCRdb [23] if the sequence identity between target and template was 50%. Some of the human chemokine receptors structures are only in the inactive state and we also analyzed models for inactive conformations that did not have an experimental structure. Details of the hotspots. For every hotspot identified, we highlight the state (active or inactive) of the hGPCR where the residue was identified, the presence of a documented function, interaction with a ligand, the existence of a mutant or variant in the GPCRdb, and the amino acid consensus. Hy indicates general hydrophobic residues; Ha, Hydrophobic aliphatic; Hb, hydrogen bonding; Sm, small. For references of the experimental data, see [11] to [100] of SI.

Ligand binding

Our protocol is able to capture residues with a fundamental role in selectivity, ligand binding affinity [30] and in dynamical events underlying hGPCR activation (see Table 2; [31], [32], [20]). For example, it identifies the conserved hotspots involved in ligand binding 3.32, 3.37 and 6.48 [40], [30], [32], [20] (Tables 2 and 1 SI for references). The first residue plays a role for ligand charge detection [32]. It is an aspartic acid in 22% of the class A hGPCRs, interacting with amines or with other positively charged groups [32]. The second residue is involved not only in ligand recognition but also in receptor activation [41]. The last one, is a tryptophan residue in more than 77% of the class A subfamily. This position is well known in literature, since it is a hub involved either in ligand detection and as forming the ‘toggle switch’ involved in receptor activation (see below) [42].

Micro-switches

The so-called “micro-switches” are small groups of residues that undergo conformational changes during receptor activation and are mechanistically involved in the activation of GPCRs [43], [42], [44]. Those include: (i) D[E]R3.50Y in helix III which is a common motif that forms the ’ionic lock’, during inactivation, (ii) NP7.50xxY in helix VII which plays an important role in proteins’ conformational changes upon activation [42]. The link between this region and the binding cavity is the ’toggle switch’ formed by positions 6.44, 6.48, 3.40, 5.50: upon ligand binding, position 3.40 rotates and locates between 6.48 and 6.44. This, induce a conformational change of the “hydrophobic barrier”, located below the “toggle switch” that includes positions 2.43, 2.46, 3.43, 3.46, 6.36, 6.37 and 6.40 [45], [42]. The conformational change is important for receptor activation [30], [31]. All the residues involved in this complex mechanism were detected as hotspots in one single run of our protocol (Table 2).

Na+ binding cavity

An allosteric binding cavity for a partially hydrated Na+ ion is conserved across class A hGPCRs, excluding the opsins [46]. The hydrated Na+ is bound in the middle of 7TM helices bundle, it stabilizes the inactive state and reduces basal G-protein activity [46]. D2.50 (90% conserved as Asp) directly coordinates the Na+ ion, N1.50 (97% Asn), S3.39 (75% Ser), N7.45 (70% Asn), S7.46 (66% Ser), N7.49 (75% Asn), and finally Y7.53 (89% Tyr) [46] complete the coordination of the ion. The protocol identifies, as hotspots, D2.50, S3.39 and 7.49 across class A subfamily members, except for the opsins, consistently with experiments [46] (Table 2 SI).

Conclusions

We have presented an approach able to capture most of the coevolution-related events relevant for the function of a very sparse protein family or subfamily (Fig. 1). The protocol provides information not only on residues spanning along the full-length, but also on different activation states of the receptors. In contrast to previous approaches, we make use of structural information. Application to human class A hGPCRs structures shows that, from a sparse family multiple sequence alignment, we were capable of extracting all the residues known to be involved in the different aspects of the receptor activity that were previously identified [20], [21]. These include i) all the position within the binding cavity with a conserved functional role; ii) residues forming the activation microswitches and iii) residues forming the Na+ allosteric binding cavity and those that were found to be mutated in correlation with disease. Importantly, the method was able to capture the functional role of all the residues in one single shot. Our approach is totally general and can easily be extended to other subfamilies of GPCRs for which experimental structures are available.4 As an example, we cite here the pentameric ligand-gated ion channels (pLGICs) protein family, that mediate fast neurotransmission in the nervous system [47], [48], [49]. These are evolutionary correlated, they share a common architecture that consists of three distinct domains, and they exist in at least three distinct functional states [47], [48], [50]. By exploiting the available structural information [47], [48], the approach might be able to identify all hotspots across the family in a single run. As soon as a statistical significant number of structure will be available, a more specific analysis on single subfamilies of class A hGPCRs can be readily be performed. One of the present limitations of the protocol regards the study of the oligomers, because the integration of the structural data removes the interaction between residues that are far away in the single monomer. In the future, we plan to remove this restriction integrating oligomers data coming from experiments.

CRediT authorship contribution statement

Filippo Baldessari: Investigation. Riccardo Capelli: Conceptualization, Investigation. Paolo Carloni: Conceptualization. Alejandro Giorgetti: Conceptualization.

44 in total

Review 1. Constant electric field simulations of the membrane potential illustrated with simple systems.

Authors: James Gumbart; Fatemeh Khalili-Araghi; Marcos Sotomayor; Benoît Roux
Journal: Biochim Biophys Acta Date: 2011-10-05

2. The network of stabilizing contacts in proteins studied by coevolutionary data.

Authors: Sara Lui; Guido Tiana
Journal: J Chem Phys Date: 2013-10-21 Impact factor: 3.488

Review 3. Binding, activation and modulation of Cys-loop receptors.

Authors: Paul S Miller; Trevor G Smart
Journal: Trends Pharmacol Sci Date: 2010-01-25 Impact factor: 14.819

4. PconsC: combination of direct information methods and alignments improves contact prediction.

Authors: Marcin J Skwark; Abbi Abdel-Rehim; Arne Elofsson
Journal: Bioinformatics Date: 2013-05-08 Impact factor: 6.937

5. Activation mechanism of the β2-adrenergic receptor.

Authors: Ron O Dror; Daniel H Arlow; Paul Maragakis; Thomas J Mildorf; Albert C Pan; Huafeng Xu; David W Borhani; David E Shaw
Journal: Proc Natl Acad Sci U S A Date: 2011-10-26 Impact factor: 11.205

6. Fingerprinting G-protein-coupled receptors.

Authors: T K Attwood; J B Findlay
Journal: Protein Eng Date: 1994-02

Review 7. Action of molecular switches in GPCRs--theoretical and experimental studies.

Authors: B Trzaskowski; D Latek; S Yuan; U Ghoshdastider; A Debinski; S Filipek
Journal: Curr Med Chem Date: 2012 Impact factor: 4.530

8. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information.

Authors: Sergey Ovchinnikov; Hetunandan Kamisetty; David Baker
Journal: Elife Date: 2014-05-01 Impact factor: 8.140

9. Evolutionary action and structural basis of the allosteric switch controlling β₂AR functional selectivity.

Authors: Anne-Marie Schönegge; Jonathan Gallion; Louis-Philippe Picard; Angela D Wilkins; Christian Le Gouill; Martin Audet; Wayne Stallaert; Martin J Lohse; Marek Kimmel; Olivier Lichtarge; Michel Bouvier
Journal: Nat Commun Date: 2017-12-18 Impact factor: 14.919

10. Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta.

Authors: Sergey Ovchinnikov; David E Kim; Ray Yu-Ruei Wang; Yuan Liu; Frank DiMaio; David Baker
Journal: Proteins Date: 2016-02-24

4 in total

1. Delineating the conformational landscape and intrinsic properties of the angiotensin II type 2 receptor using a computational study.

Authors: Xiaoliang Cong; Xiaogang Zhang; Xin Liang; Xinheng He; Yehua Tang; Xing Zheng; Shaoyong Lu; Jiayou Zhang; Ting Chen
Journal: Comput Struct Biotechnol J Date: 2022-05-10 Impact factor: 6.155

2. Delineating the activation mechanism and conformational landscape of a class B G protein-coupled receptor glucagon receptor.

Authors: Ying Wang; Mingyu Li; Wenqi Liang; Xinchao Shi; Jigang Fan; Ren Kong; Yaqin Liu; Jian Zhang; Ting Chen; Shaoyong Lu
Journal: Comput Struct Biotechnol J Date: 2022-01-20 Impact factor: 7.271

3. Leri: A web-server for identifying protein functional networks from evolutionary couplings.

Authors: Ngaam J Cheung; Arun T John Peter; Benoit Kornmann
Journal: Comput Struct Biotechnol J Date: 2021-06-06 Impact factor: 7.271

4. Evolution of frustrated and stabilising contacts in reconstructed ancient proteins.

Authors: Martina Crippa; Damiano Andreghetti; Riccardo Capelli; Guido Tiana
Journal: Eur Biophys J Date: 2021-02-11 Impact factor: 1.733

4 in total