Literature DB >> 31303974

Interfaces Between Alpha-helical Integral Membrane Proteins: Characterization, Prediction, and Docking.

Bian Li¹, Jeffrey Mendenhall¹, Jens Meiler¹.

Abstract

Protein-protein interaction (PPI) is an essential mechanism by which proteins perform their biological functions. For globular proteins, the molecular characteristics of such interactions have been well analyzed, and many computational tools are available for predicting PPI sites and constructing structural models of the complex. In contrast, little is known about the molecular features of the interaction between integral membrane proteins (IMPs) and few methods exist for constructing structural models of their complexes. Here, we analyze the interfaces from a non-redundant set of complexes of α-helical IMPs whose structures have been determined to a high resolution. We find that the interface is not significantly different from the rest of the surface in terms of average hydrophobicity. However, the interface is significantly better conserved and, on average, inter-subunit contacting residue pairs correlate more strongly than non-contacting pairs, especially in obligate complexes. We also develop a neural network-based method, with an area under the receiver operating characteristic curve of 0.75 and a Pearson correlation coefficient of 0.70, for predicting interface residues and their weighted contact numbers (WCNs). We further show that predicted interface residues and their WCNs can be used as restraints to reconstruct the structure α-helical IMP dimers through docking for fourteen out of a benchmark set of sixteen complexes. The RMSD100 values of the best-docked ligand subunit to its native structure are <2.5 Å for these fourteen cases. The structural analysis conducted in this work provides molecular details about the interface between α-helical IMPs and the WCN restraints represent an efficient means to score α-helical IMP docking candidates.

Entities: CellLine Chemical Disease Gene Species

Keywords: AUC, Area under the ROC curve; IMP, Integral membrane protein; MAE, Mean absolute error; MSA, Multiple sequence alignment; Membrane protein docking; Membrane protein interfaces; Neural networks; OPM, Orientations of proteins in membranes; PCC, Pearson correlation coefficient; PDB, Protein data bank; PPI, Protein-protein interaction; PPM, Positioning of proteins in membrane.; PPV, Positive predictive value; PSSM, Position-specific scoring matrix; RMSD, Root-mean-square distance; ROC, Receiver operating characteristic curve; RSA, Relative solvent accessibility; TNR, True negative rate; TPR, True positive rate; WCN, Weighted contact number; Weighted contact numbers

Year: 2019 PMID： 31303974 PMCID： PMC6603304 DOI： 10.1016/j.csbj.2019.05.005

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Alpha-helical integral membrane proteins (IMPs) play essential roles in biochemical and electrical signal transduction, molecular transport, and energy propagation among other critical cellular processes across the membrane. It was estimated that about one quarter of the human genome encodes α-helical IMPs [1]. Frequently, these IMPs function as either homo-oligomers or hetero-oligomers with other IMPs [[2], [3], [4]]. Proteins can either form stable, obligate complexes via high-affinity protein-protein interactions (PPIs) or non-obligate, transient complexes via low-affinity PPIs. Obligate PPIs are typically both structurally and functionally obligate: the participating IMPs are typically not found in isolation in vivo. One example is the KCNQ1 potassium channel, which forms homo-tetramers to create the channel pore in the center. It is believed to always bind one member of the KCNE protein family [5]. In contrast, IMPs that participate in transient PPIs can exist independently and complexes of this kind are usually involved in processes such as cellular signaling and receptor-ligand binding [6]. The characteristics of protein interfaces between globular proteins have been extensively studied in terms of size, amino acid composition, physicochemical texture, conservation, as well as coevolution of inter-subunit residue pairs. These properties usually differ for those PPIs that are transient versus those that are obligate [7]. In general, the interfaces in obligate complexes of soluble proteins are larger and more hydrophobic than those of non-obligate associations, for which, amino acid compositions are usually not drastically different from the rest of the protein surface [[8], [9], [10], [11], [12], [13], [14]]. It was also found that the interface is usually more conserved than the rest of the protein surface [[15], [16], [17]], although residues at the interfaces of obligate PPIs tend to be better conserved and exhibit much stronger coevolution with their interacting partners than those at the interfaces of transient PPIs [16]. These observations have been essential to our understanding of the biochemistry and biophysics of PPIs between globular, soluble proteins. In contrast, little is known about the characteristics of the interfaces between α-helical IMPs. Here, we analyzed a non-redundant set of α-helical IMP complexes whose structures have been experimentally determined to a high resolution to answer the following questions about IMP interfaces: 1) Is the IMP interface physico-chemically distinguishable from the rest of the protein surface? 2) Are residues in the interface more conserved than the rest of the protein surface? 3) Are there detectable coevolutionary signals across the interface? 4) How do the interfaces in obligate complexes differ from those in transient ones? Our work consists of three parts that build on one another: The analysis of IMP PPIs: We found that in α-helical IMP complexes, while the aqueous region of the interface exhibits similar characteristics to interfaces between globular proteins, the interface in the membrane region is not significantly different from the rest of the surface in terms of average hydrophobicity. However, the interface is significantly better conserved than the rest of the surface both in the aqueous region and the membrane region, and residue pairs that are in physical contact at the interface of homo-oligomeric IMPs correlate more strongly than pairs not in contact. We restricted our correlation analysis on homo-oligomers because for hetero-oligomers, it requires construction of paired multiple sequence alignment (MSA) of interacting partners, and computationally, this problem becomes difficult by itself due to the existence of paralogs [[18], [19], [20]]. Sequence-based prediction of interface residues involved in IMP PPIs: Based on our findings, we also developed a neural network-based method that predicts weighted contact numbers (WCNs) of surface residues from evolutionary information. WCN is defined as the number of neighboring residues weighted by their proximity to the focal residue, it measures the local packing degree of residues within the protein tertiary structure [21,22]. We find that interface residues can be accurately identified based on the deviation of predicted WCN from the WCN computed from the protomer structure alone. Structure-based docking of IMPs: Based on our previous findings in which we demonstrated that residue WCNs can be effectively used as restraints to improve de novo tertiary structure prediction for α-helical IMPs [23], we implemented an algorithm which leverages the high discriminatory power of a WCN-based penalty score for accurate docking of α-helical IMPs.

Methods

Data Set

A set of multi-pass α-helical IMPs whose structures have been determined to a resolution of 2.5 Å or better and an R-free value of 0.3 or better were extracted from the Orientations of Proteins in Membranes (OPM) database [24] in March 2016. The data set was further refined by using the PISCES server [101] to reduce redundancy such that pairwise sequence identity between protein subunits is <25%. Proteins whose structures were not determined by X-ray crystallography or artificial chimeras were excluded from consideration. Classification of a complex as obligate or transient and assignment of biologically relevant oligomeric state (dimer, trimer, etc.) were carried out based on evidence found in the literature where the structure of the complex was reported. In summary, the data set consists of 36 obligate and nine transient complexes (Table 1). The bias toward more obligate complexes is not unexpected as the higher affinity and rigidity should facilitate crystallization and increase quality of the resulting structural model. The data set consists of 15 homodimers, twelve homotrimers, four homotetramers, two homopentamers, two homodecamers, one heterodimer, four heterotrimers, four heterotetramers, and one heteropentamer. It's worth mentioning that there is a spread in transmembrane helix counts and each category of helix count is well represented except that no subunit with nine, fourteen, or fifteen helices satisfied our aforementioned data set curation criteria (Supporting Information Fig. S1).

Table 1

α-helical IMP chains that form the oligomers in the data set.

Protein subunit	TMH count	Resolution (Å)	Oligomeric state	Obligate	Reference
1m0lA	7	1.5	Homotrimer	No	[25]
1m56A	12	2.3	Heterotetramer	Yes	[26]
1m56C	7	2.3	Heterotetramer	Yes	[26]
1q16C	5	1.9	Homodimer	No	[27]
1u7gA	11	1.4	Homotrimer	Yes	[28]
1yq3C	3	2.2	Heterotetramer	No	[29]
1yq3D	3	2.2	Heterotetramer	No	[29]
2a65A	12	1.7	Homodimer	Yes	[30]
2bl2A	4	2.1	Homodecamer	Yes	[31]
2bs2C	5	1.8	Homodimer	No	[32]
2j8cM	5	1.9	Heterotrimer	Yes	[33]
2nq2A	10	2.4	Homodimer	Yes	[34]
2qtsA	2	1.9	Homotrimer	Yes	[35]
2uuhA	4	2.2	Homotrimer	Yes	[36]
2vpzC	8	2.4	Homodimer	No	[37]
2w2eA	6	1.2	Homotetramer	Yes	[38]
2wswA	12	2.3	Homotrimer	Yes	[39]
2yevC	7	2.4	Heterotrimer	Yes	[40]
2z73A	7	2.5	Homodimer	No	[41]
2zxeA	10	2.4	Heterotrimer	Yes	[42]
3b9yA	11	1.9	Homotrimer	Yes	[43]
3c02A	6	2.1	Homotetramer	Yes	[44]
3cx5C	8	1.9	Heteropentamer	Yes	[45]
3k3fA	10	2.3	Homotrimer	Yes	[46]
3klyA	6	2.1	Homopentamer	Yes	[47]
3m73A	10	1.2	Homotrimer	Yes	[48]
3oduA	7	2.5	Homodimer	No	[49,50]
3oufA	2	1.6	Homotetramer	Yes	[49]
3puwF	8	2.3	Heterodimer	Yes	[51]
3s8gA	13	1.8	Heterotrimer	Yes	[52]
3spcA	2	2.5	Homotetramer	Yes	[53]
3tijA	8	2.4	Homotrimer	Yes	[54]
4a01A	16	2.4	Homodimer	Yes	[55]
4bpmA	4	2.1	Homotrimer	Yes	[56]
4d2eA	3	2.3	Homotrimer	Yes	[57]
4dx5A	12	1.9	Homotrimer	Yes	[58]
4f4sA	2	1.9	Homodecamer	Yes	[59]
4jkvA	7	2.5	Homodimer	No	[60]
4mrsA	6	2.4	Homodimer	Yes	[61]
4o6mA	6	1.9	Homodimer	Yes	[62]
4o6yA	6	1.7	Homodimer	Yes	[63]
4qndA	3	1.7	Homodimer	Yes	[64]
4rngC	3	2.4	Homodimer	Yes	[65]
4u9nA	5	2.2	Homodimer	Yes	[66]
4wd8A	4	2.3	Homopentamer	Yes	[67]

α-helical IMP chains that form the oligomers in the data set.

Defining Interface Residues

Residue categorization was based on residues' weighted contact numbers (WCN). The WCN of a residue is defined as the number of its neighboring residues weighted by spatial proximity. The WCN of each residue in the data set was computed according to a procedure previously described [68]. A residue was then categorized as surface residue if its WCN < 8.38 and as core residue otherwise. This cutoff was determined based on a linear fit of WCN to relative solvent accessibility (RSA) (WCN =  − 9.78 × RSA + 8.48) and residues with an RSA > 0.01 are considered solvent exposed as in a previous study [16]. Interface residues were defined as those surface residues whose WCN increases by at least 1.0 upon the formation of a complex (see Fig. 1 for an example). Although some previous work has classified whether a residue is on the surface or in the core according to its solvent accessibility [16,[69], [70], [71], [72]], we classified residues based on their WCN for two practical reasons. On the one hand, computing residue WCN is computationally less demanding than computing residue solvent accessibility. This makes it feasible to compute residue WCNs on the fly and use them directly for scoring in protein structure prediction and protein-protein docking predictions where an enormous amount of conformational space must be sampled [23]. On the other hand, we had previously developed a machine-learning framework that allowed accurate prediction of residue WCNs for α-helical IMPs [68].

Fig. 1

Defining interface residues according to residue weighted contact number (WCN).

A surface residue is defined as an interface residue if its WCN increases by at least 1.0 (∆WCN ≥ 1.0) upon the formation of a protein complex. This figure shows interface residues, represented as spheres, identified according to the above criterion for the trimeric bacteriorhodopsin (PDB ID: 1m0l).

Defining interface residues according to residue weighted contact number (WCN). A surface residue is defined as an interface residue if its WCN increases by at least 1.0 (∆WCN ≥ 1.0) upon the formation of a protein complex. This figure shows interface residues, represented as spheres, identified according to the above criterion for the trimeric bacteriorhodopsin (PDB ID: 1m0l).

Site-specific Rate of Evolution

Site-specific rate of evolution was estimated using the Rate4Site program [73]. The input multiple sequence alignment (MSA) of homologs to each monomer was obtained by running HHblits on the Uniprot20 sequence database [74] with minimum coverage of query sequence (sequence of the monomer) being set to 75%, maximum sequence identity to query sequence being set to 90%, maximum pairwise sequence identity being set to 90%, and e-value cutoff for inclusion in result alignment being set to 10−5.

Mutual Information

We model a sequence site as a discrete random variable X, which takes on one of an alphabet of 20 possible letters A = (A, C, D, ⋯, W, Y) with probabilities (p, p, p, ⋯, p, p), with and . The alphabet A contains one letter for each amino acid. Alignment gap was not considered because it introduces spuriously high conservation for alignment columns containing a high percentage of gaps. p is estimated by the relative frequency f of amino acid residue x at the column of an MSA:where M indicates alignment depth (the number of sequences aligned at position i), δ is the Kronecker delta function such that it evaluates to 1 if x = X and 0 otherwise. f is adjusted by a pseudocount parameter λ = 1. Given two MSA columns X and Y, the mutual information I(X, Y) between them is defined asI(X, Y) is a general measure of association between two random variables X and Y (int this case, two alignment columns in an MSA), it equals zero if and only if X and Y are independent, and it is positive otherwise. Intuitively, I(X, Y) can be thought of as the average reduction in uncertainty about X that results from knowing the value of Y, and vice versa. To correct for bias in I(X, Y) that may arise from phylogeny [75], entropy of the MSA columns [76], or background noise [77], and test for significance, we implemented a permutation test in which, for each pair of columns, one column was selected and permuted 200 times. This procedure was described in [78] in detail. Briefly, a mutual information was calculated for the pair after each permutation. If the number of permutations for which the mutual information is greater than that of the original column pair is ≥2 (1% of the total number of permutations), we reject the hypothesis that the column pair is correlated, and its mutual information is set to 0. Otherwise, the bias-corrected mutual information for the pair was computed by subtracting the average mutual information of the 200 permutations from that of the original column pair.

Training a Neural Network for Predicting WCN

We trained a neural network for predicting residue WCN from amino acid sequence. The target WCNs used in the training were computed from the structure of oligomers in the data set, using the procedure previously described [68]. Each residue was numerically described by rate of evolution of the sequence site, entries in a window of size 15 from the position-specific scoring matrix (PSSM) computed from an MSA of homologs of the protein family as previously described [79]. The output layer of the neural network consists of a single node for the residue-specific WCN. The hidden layer consists of 64 neurons. The neural network was regularized using the “dropout” technique [80] where 5% of units in the input layer and 50% of neurons in the hidden layer were randomly silenced during the presentation of each training case. Connection weights were iteratively tuned with back-propagation of errors [81] with the learning rate η being set to 0.1 and momentum factor α being set to 0.1. The accuracy of WCN prediction was assessed by the Pearson correlation coefficient (PCC) and the mean absolute error (MAE) of predicted WCNs from target WCNs, computed as followswhere x and y denote the target and predicted WCNs of residue i, respectively, and n denotes the number of residues in the data set.

Predicting Interface Residues

Note that the WCN of a residue may be different depending on whether it is computed from the structure of the individual protomers or the complex. To make the distinction straightforward, we refer to WCNs computed from individual subunits as protomeric WCNs and those computed from complexes as oligomeric WCNs. For predicting interface residues, it is reasonable to assume that an experimental structure of each of the individual subunits is available, and as such, true protomeric WCNs can be computed from the structures of individual subunits. A surface residue is then predicted to be an interface residue if its neural network-predicted oligomeric WCN is higher than its true protomeric WCN by at least 1.0. The performance of the neural network on predicting interface residues was assessed by the area under the receiver operating characteristic curve (ROC) [82], or AUC. The AUC was estimated through cross-validation where the data set was partitioned into subsets such that proteins from the same SCOP superfamily [83] were placed in the same subset. Each subset was used exactly once as the test set for evaluating the performance of the neural network trained on the remaining of the data set. Effectively, a value of AUC was computed from each test subset. The final estimate of the AUC was computed as the mean of all the AUCs. The true positive rate (TPR), true negative rate (TNR), and positive predictive value (PPV) [84] were calculated using 1.0 as the threshold, where a true positive is defined as a correctly predicted interface residue and a true negative is a correctly predicted non-interface residue.

Integral Membrane Protein Docking

One of the advantages of modeling IMP complexes is that the membrane reduces the search space by imposing a restraint on the position of the subunits in the membrane. We designed a docking-based algorithm called BCL::MP-Dock for predicting structure of complexes where both binding partners are α-helical IMPs. The algorithm assumes that both docking partners are IMPs and requires that their tertiary structures be available. The input structures to BCL::MP-Dock are a structure of the “receptor” and a structure of the “ligand”, both oriented in the membrane using the Positioning of Proteins in Membrane (PPM) server [24]. Generation of docking candidates begins with a random rotation of the ligand about the z-axis (e.g. the membrane normal) and translation of the ligand in the membrane to create a glancing contact with the receptor. The ligand is then randomly moved with respect to the receptor using a Monte Carlo search [85]. Translation along the z-axis is limited to <5.4 Å (one winding of an alpha-helix) and the step size of the tilt angle from the z-axis is limited to 0.05 rad (or ~2.8 degree). The baseline scoring function of BCL::MP-Dock consists of a clashing term that represents van der Waals repulsion, a residue-pair contact potential term for interface interaction, and a radius of gyration term that favors compact packing between the two docking partners, which were described in detail previously [86]. While the analysis of the characteristics of interface residues compared to the rest of surface residues was conducted on a diverse set of non-redundant IMP oligomers, the BCL::MP-Dock algorithm is designed to work primarily with dimers. In the current study, we tested the idea of using predicted interface residues and their predicted WCNs as restraints for scoring docking solutions on a set of 16 α-helical IMP dimers (Table 2). This benchmark set consists of all α-helical IMP dimers satisfying our data set curation criteria (see Methods). The constituting subunits have as few as three and as many as sixteen transmembrane helices, and most of the subunits have moderate sizes with five, six, or seven transmembrane helices.

Table 2

Summary of the benchmark set of IMP complexes for testing the docking algorithm.

Protein ID	Resolution (Å)	Oligomeric state	Obligate	Protein name	TMH count in subunit
1q16_CF	1.9	Homodimer	No	Nitrate reductase A	5
2a65_AB	1.7	Homodimer	Yes	LEUTAA	12
2bs2_CF	1.8	Homodimer	No	Quinol fumarate reductase	5
2nq2_AB	2.4	Homodimer	Yes	Putative metal-chelate type ABC transporter	10
2vpz_CG	2.4	Homodimer	No	Polysulfide reductase from Thermus thermophilus	8
2z73_AB	2.5	Homodimer	No	Squid rhodopsin	7
3odu_AB	2.5	Homodimer	No	CXCR4 chemokine receptor	7
3puw_FG	2.3	Heterodimer	Yes	MBP-Maltose transporter transmembrane subunits	8/6
4a01_AB	2.4	Homodimer	Yes	H1-Translocating Pyrophosphatase	16
4jkv_AB	2.5	Homodimer	No	Human smoothened receptor	7
4mrs_AB	2.4	Homodimer	Yes	Bacterial Atm1-family ABC transporter	6
4o6m_AB	1.9	Homodimer	Yes	CDP-alcohol phosphotransferase	6
4o6y_AB	1.7	Homodimer	Yes	Cytochrome b561	6
4qnd_AB	1.7	Homodimer	Yes	Bacterial homologue of SWEET transporters	3
4rng_AC	2.4	Homodimer	Yes	Bacterial homologue of SWEET transporters (sequence identity with 4qnd_AB <25%)	3
4u9n_AB	2.2	Homodimer	Yes	Mg(2+) channel MgtE	5

Protein IDs are denoted as four-letter PDB ID followed by the chain IDs of the two subunits comprising the complex. Note that this benchmark set consists of all the dimers (fifteen homodimers + one heterodimer) from the data set used in our analysis of membrane protein interfaces.

Summary of the benchmark set of IMP complexes for testing the docking algorithm. Protein IDs are denoted as four-letter PDB ID followed by the chain IDs of the two subunits comprising the complex. Note that this benchmark set consists of all the dimers (fifteen homodimers + one heterodimer) from the data set used in our analysis of membrane protein interfaces.

Computation of Enrichment

The enrichment was used to measure how capable a scoring function is to select the most accurate docking solutions from a pool of solutions. To calculate enrichment, a given set S of docking solutions are sorted by their RMSD100 values. The top 10% of the solutions with the lowest RMSD100 [102] values are put into the set T (true) and the rest of the solutions are put into the set F (false). The solutions in S are then sorted by their evaluated score. The top 10% of solutions with the lowest score are put into the set P (positive) and the rest are put into the set N (negative). The intersection of sets T and P are solutions that are correctly identified by the scoring function and referred to as TP (true positives). The intersection of sets F and P are solutions that are incorrectly identified by the scoring function and are referred to as FP (false positives). The enrichment value is then computed using the following formula: Intuitively, represents that probability of obtaining a native-like model when choosing a model from S at random, whereas represents the probability of obtaining a native-like model when choosing from a set of models below an energy cutoff. By our experimental design, has a constant value of 0.1, so the enrichment values cannot exceed 10.

Results

Each protomer is divided into three disjoint regions: protein core, interface, and non-interface surface with no overlapping residues; and each region is further divided into two parts: the aqueous region and the membrane region based on the atomic coordinates and the calculated hydrophobic thickness of the membrane in which each complex resides, both were obtained from the OPM database [24]. The hydrophobic thicknesses provided by OPM were calculated by the PPM method, whose accuracy was thoroughly tested for a large set of IMPs whose orientations with respect to the lipid bilayer or membrane binding affinities have been experimentally studied [87].

Amino Acid Composition and Interface Propensities

Fig. 2 compares the amino acid compositions of protein core, interface, and non-interface surface in the aqueous region and in the membrane region. Fig. 2A shows that in the aqueous region, protein cores have the highest frequencies of hydrophobic residues (e.g. Cys, Phe, Ile, Leu, Met, and Val), whereas surfaces have the highest frequencies of hydrophilic residues (e.g. Asp, Glu, His, Lys, Asn, Pro, Ser, and Trp), similar to the amino acid compositions of soluble protein complexes [17]. In the membrane, it is expected that hydrophobic residues are well represented on non-interface surfaces due to the hydrophobic nature of lipid tails. Interfaces and the IMP core should still be dominated by hydrophobic amino acids. However, some fraction of hydrophilic residues is expected for function, i.e. to build the polar pore in a channel or to lend selectivity to a PPI. Fig. 2B shows that non-interface surfaces in the membrane region have the highest frequencies of some hydrophobic residues (e.g. Ile, Leu, Val) as well as some hydrophilic residues (e.g. His, Lys, and Arg), whereas some hydrophilic residues have the highest frequencies in protein cores (e.g. Glu, Asn, Pro, Gln, Ser, and Thr). It is also interesting that Gly and Ala, which have the smallest side-chain size, have the highest frequencies in protein cores both in the aqueous part and in the membrane. This is likely because Gly and Ala, which usually occur in GxxxG and AxxxA types of motifs, may facilitate more efficient helix packing [88,89].

Fig. 2

Amino acid composition of the core, interface, and the rest of the surface.

(A) Amino acid composition in the aqueous region; just as amino acid composition in globular proteins and protein complexes, the relative frequencies of residues of hydrophobic nature increases in going from surface, to interface, and then to protein core, whereas those of residues of hydrophilic nature follows the opposite trend. (B) amino acid composition in the membrane region; due to its hydrophobic nature, the residue relative frequencies follow trends opposite to those in aqueous portion.

Amino acid composition of the core, interface, and the rest of the surface. (A) Amino acid composition in the aqueous region; just as amino acid composition in globular proteins and protein complexes, the relative frequencies of residues of hydrophobic nature increases in going from surface, to interface, and then to protein core, whereas those of residues of hydrophilic nature follows the opposite trend. (B) amino acid composition in the membrane region; due to its hydrophobic nature, the residue relative frequencies follow trends opposite to those in aqueous portion. The amino acid type distribution at the interface is significantly different from that on the rest of the surface for both the aqueous region (p = 1.7 × 10−11, χ2 test) and the membrane region (p = 0.011, χ2 test), albeit the difference is more pronounced in the aqueous region. Fig. 3 shows the interface propensity of each residue type. In the aqueous region, residue types that prefer interface over the rest of the surface are Phe, Ile, Leu, Met, Val, and Tyr. In the membrane, residue types that prefer interface over the rest of the surface are Ala, Asp, Met, Asn, Pro, Gln, Thr, and Tyr.

Fig. 3

Interface versus surface propensity in the aqueous region and the membrane region for each amino acid type.

The interface propensity of an amino acid type A, IP, was calculated according to , where f is its fraction at the interface and f is its fraction on the surface as a whole. A negative propensity value indicates that the amino acid is preferentially located at the interface.

Interface versus surface propensity in the aqueous region and the membrane region for each amino acid type. The interface propensity of an amino acid type A, IP, was calculated according to , where f is its fraction at the interface and f is its fraction on the surface as a whole. A negative propensity value indicates that the amino acid is preferentially located at the interface.

Hydrophobicity

Since IMPs experience both lipid bilayer and aqueous environments at the same time, we analyzed the hydrophobicity of the interface, non-interface surface region, and the protein core separately for these two environments. In the aqueous region, the interface is significantly more hydrophobic than non-interface surface region (p = 0.0020, paired t-test), and the protein core is significantly more hydrophobic than the interface (p = 1.2 × 10−5, paired t-test) (Fig. 4A), consistent with the observation made on soluble proteins [70]. In the membrane region, the protein core is significantly less hydrophobic than the noninterface surface region (p = 9.7 × 10−6, paired t-test) (Fig. 4B), due to the nonpolar nature of lipid tails. The protein core is also less hydrophobic than the interface (p = 0.00077, paired t-test). No significant difference in average hydrophobicity is found between the interface and the rest of the surface (p = 0.054, paired t-test). This may be partly because protomers of transient membrane protein complexes, when they are not part of a complex, need their interface hydrophobic enough to make favorable contact with lipid tails. In fact, in obligate complexes the interface is marginally less hydrophobic than the rest of the surface (p = 0.014, paired t-test) (Fig. 4C).

Fig. 4

Distribution of average hydrophobicity of core, interface, and noninterface surface of α-helical IMPs.

The average hydrophobicities were computed according to the Kyte & Doolittle hydrophobicity scale [90]. In this scale, arginine is assigned a hydrophobicity of −4.50 and is the least hydrophobic; isoleucine is assigned 4.50 and is the most hydrophobic. (A) Distribution of average hydrophobicity in the aqueous region; the protein core is significantly more hydrophobic than the interface, which is in turn, significantly more hydrophobic than the surface. (B) Distribution of average hydrophobicity in the membrane region; the protein core is significantly less hydrophobic than the noninterface surface region. The protein core is also less hydrophobic than the interface, however, no significant difference in average hydrophobicity is found between the interface and the rest of the surface. (C) Distribution of average hydrophobicity for obligate oligomers; the interface is less hydrophobic than the rest of the surface, although it is only marginally significant.

*: p < 0.05, **: p < 0.01, ***: p < 0.001, n.s.: not significant.

Distribution of average hydrophobicity of core, interface, and noninterface surface of α-helical IMPs. The average hydrophobicities were computed according to the Kyte & Doolittle hydrophobicity scale [90]. In this scale, arginine is assigned a hydrophobicity of −4.50 and is the least hydrophobic; isoleucine is assigned 4.50 and is the most hydrophobic. (A) Distribution of average hydrophobicity in the aqueous region; the protein core is significantly more hydrophobic than the interface, which is in turn, significantly more hydrophobic than the surface. (B) Distribution of average hydrophobicity in the membrane region; the protein core is significantly less hydrophobic than the noninterface surface region. The protein core is also less hydrophobic than the interface, however, no significant difference in average hydrophobicity is found between the interface and the rest of the surface. (C) Distribution of average hydrophobicity for obligate oligomers; the interface is less hydrophobic than the rest of the surface, although it is only marginally significant. *: p < 0.05, **: p < 0.01, ***: p < 0.001, n.s.: not significant.

The Interface is Better Conserved than the Rest of the Surface in Obligate Oligomers

The selective pressure, and likewise conservation, is expected to be higher at the interface than the rest of the surface because molecular recognition is often achieved by making specific contacts at the interface. To test this hypothesis, we computed the average rate of evolution of the interface and that of the rest of the surface for each complex in the data set. A lower average rate of evolution signifies a stronger conservation. While the average rate of evolution of the interface is significantly lower than that of the rest of the surface in both the aqueous region (p = 0.0043, paired t-test) (Fig. 5A) and the membrane region (p = 0.00088, paired t-test) (Fig. 5D), we found that such difference stems from the selection pressure on obligate complexes. In obligate complexes, the interface is significantly better conserved than the rest of the surface in both the aqueous region (p = 0.00015, paired t-test) (Fig. 5B) and the membrane region (p = 3.3 × 10−5, paired t-test) (Fig. 5E). In contrast, in transient complexes, the interface is not more conserved than the surface in either region (p = 0.87 and 0.81, respectively, paired t-test) (Fig. 5C and F). This confirms that, in general, the selection pressure on obligate complexes is stronger than on transient complexes [6,16].

Fig. 5

Distribution of average rate of evolution of core, interface, and noninterface surface of α-helical IMPs.

Distribution of average rate of evolution for the aqueous region taking (A) obligate and transient oligomers together, (B) obligate oligomers only, (C) transient oligomers only; and for the membrane region taking (D) obligate and transient oligomers together, (E) obligate oligomers only, (F) transient oligomers only.

**: p < 0.01, ***: p < 0.001, n.s.: not significant.

Distribution of average rate of evolution of core, interface, and noninterface surface of α-helical IMPs. Distribution of average rate of evolution for the aqueous region taking (A) obligate and transient oligomers together, (B) obligate oligomers only, (C) transient oligomers only; and for the membrane region taking (D) obligate and transient oligomers together, (E) obligate oligomers only, (F) transient oligomers only. **: p < 0.01, ***: p < 0.001, n.s.: not significant. We also found that while the average rate of evolution of interface residues in the membrane region is significantly lower than that of interface residues in the aqueous region (p = 0.00016, paired t-test), it is worth noting that for many interfaces (12 out of 45 cases in our data set), the average rate of evolution is lower in the aqueous region than that in the hydrophobic region. This suggests that, for some membrane proteins, the aqueous region may be under stronger selection pressure than the membrane region for protein-protein interactions.

Contacting Interface Residue Pairs Show Stronger Correlation than Non-Contacting Pairs

Previously, studies have shown that residue pairs in inter-domain interfaces in globular proteins tend to be correlated [91] and that inter-subunit contacting residues in single-pass homo-oligomers co-evolve strongly [92]. However, it remains unclear whether these observations hold true for multi-pass membrane protein complexes. We reasoned that inter-subunit contacting residue pairs in multi-pass membrane protein complexes are also correlated more strongly than non-contacting pairs. To test this hypothesis, we analyzed the strength of correlation, quantified by mutual information, of contacting and non-contacting residue pairs for all homo-oligomers in our data set. A residue i in one subunit forming the interface is considered in physical contact with another residue j in the other subunit if the separation between any heavy atom of i and j is ≤4.0 Å. Fig. 6 compares the distribution of mutual information between pairs of residues in contact with that of the mutual information between pairs of residues not in contact. We observe a shift toward higher mutual information (p = 2.0 × 10−10, Mann-Whitney U test) for residue pairs that are in contact. This shift suggests that on average, at the interface of multi-pass membrane protein complexes, residue pairs that are in physical contact tend to correlate more strongly than those that are not in contact.

Fig. 6

Comparison on the distribution of the mutual information between pairs of residues in contact with that of the mutual information between pairs of residues not in contact.

There is a shift of the distribution of mutual information between contacting residues toward higher values.

Comparison on the distribution of the mutual information between pairs of residues in contact with that of the mutual information between pairs of residues not in contact. There is a shift of the distribution of mutual information between contacting residues toward higher values. A recent study suggests that there is a subtle trade-off between evolutionary and energetic constraints in protein-protein interactions [93]. The authors showed that for some classes of protein complexes, evolutionary constraints play the key role in defining the interaction surface, whereas in others, energetic constraints emerge as more important. While their analysis was conducted exclusively on soluble protein complexes, their conclusion might as well apply to membrane protein oligomers. In fact, from the 32 homo-oligomers for which we were able compute inter-subunit residue-pair mutual information, a randomly selected contacting residue pair will have a higher mutual information than that of a randomly selected non-contacting residue pair in twenty cases (using Mann-Whitney U test and a cutoff p-value of 0.05, Supporting Information Table S1). Whereas in the remaining twelve homo-oligomers, other factors such as energetic constraint may play a more important role than evolutionary constraint.

Predicting Interface Residues in the Membrane

We first evaluate the accuracy of the trained neural networks in predicting WCNs. As shown in the scatter plot in Fig. 7, the Pearson correlation coefficient between true and predicted WCNs is 0.70 and the mean absolute error of predicted WCNs from true WCNs is 1.58. We then evaluate the performance of the trained neural networks in distinguishing residues at the interface from those on the rest of the surface given experimental structures for the protomers. A score for a residue to be at the interface is computed by subtracting its protomeric WCN from its predicted WCN, this score is termed ΔWCN. Surface residues whose ΔWCN ≥ 1.0, a threshold chosen for consistency with the threshold used in the definition of interface residues, are predicted to be interface residues, resulting in a TPR of 0.88, a TNR of 0.54, and a PPV of 0.39.

Fig. 7

Performance of the neural network-based interface residue prediction model trained using interface and surface residues in the membrane region only.

(A) Scatter plot of true WCN versus WCN predicted by the neural network. The Pearson correlation coefficient between true WCN and predicted WCN is 0.70 and the mean absolute error of predicted WCN from true WCN is 1.58. (B) The ROC curve for interface residue classification with an AUC of 0.75. The ROC curve is colored according to classification thresholds and the scale on the right vertical axis indicates the mapping between colors and classification thresholds ΔWCN.

Performance of the neural network-based interface residue prediction model trained using interface and surface residues in the membrane region only. (A) Scatter plot of true WCN versus WCN predicted by the neural network. The Pearson correlation coefficient between true WCN and predicted WCN is 0.70 and the mean absolute error of predicted WCN from true WCN is 1.58. (B) The ROC curve for interface residue classification with an AUC of 0.75. The ROC curve is colored according to classification thresholds and the scale on the right vertical axis indicates the mapping between colors and classification thresholds ΔWCN. Fig. 7 shows the ROC curve plotted using cross-validated predictions from neural networks trained to distinguish interface residues from residues on the rest of the surface. While the AUC of our neural network is 0.75, comparable to a previous random forest-based protein-protein binding site predictor for membrane proteins [69], we'd like to highlight that our work addressed several limitations of the method introduced in [69] (Supplementary discussion).

Docking Membrane Proteins Using Predicted WCNs as Restraints

The BCL::MP-Dock algorithm was evaluated on a set consisting of 1 heterodimer and 15 homodimer structures. The results for global searches from fully randomized starting positions and orientations are detailed in Table 3. Of the 15 homodimer cases, BCL::MP-Dock is able to reconstruct the structure of the complex for 14 cases where the best RMSD100 of the docked ligand subunit to its native structure is <2.5 Å (2.5 Å is the threshold of resolution used in creating the data set). The two cases where the best RMSD100 of the docked ligand is >2.5 Å are the bacterial Atm1-family ABC transporter (PDB ID: 4mrs_AB) and the heterodimeric transmembrane subunits of the maltose ABC transporter (PDB ID: 3puw_FG). To confirm the effectiveness of the sampling scheme of BCL::MP-Dock, we also computed the means of RMSD100 of the docked ligand of the top 1% models ranked by RMSD100. Again, except for 3puw_FG and 4mrs_AB, the mean RMSD100 is either less than or only slightly >3.0 Å. 3puw_FG is a heterodimer where helices from the two subunits are domain-swapped and 4mrs_AB is a homodimer with bulky cytosolic domains. Both pose a significant challenge to the sampling scheme of BCL::MP-Dock.

Table 3

Summary of the global docking of IMPs using predicted interface residue WCN as restraints.

TPR: true positive rate; PPV: positive predictive value; MAE: mean absolute error; ∅WCN indicates docking using the baseline scoring function, which lacks use of predicted ΔWCN; PWCN indicates adding predicted ΔWCN to the baseline scoring function. Cases for which the enrichment of the penalty score is <2.0 are colored in red and those for which the enrichment is higher than 2.0 are colored in green.

Summary of the global docking of IMPs using predicted interface residue WCN as restraints. TPR: true positive rate; PPV: positive predictive value; MAE: mean absolute error; ∅WCN indicates docking using the baseline scoring function, which lacks use of predicted ΔWCN; PWCN indicates adding predicted ΔWCN to the baseline scoring function. Cases for which the enrichment of the penalty score is <2.0 are colored in red and those for which the enrichment is higher than 2.0 are colored in green. To assess the baseline scoring function, we computed the means of RMSD100 of the docked ligand of the 1% models ranked by score v.s. ranked by RMSD100. Compared to the means of RMSD100 of the top 1% models ranked by RMSD100, the baseline scoring function is remotely effective in only three cases (2a65_AB, 2vpz_CG, and 4a01_AB). While the enrichment of the baseline scoring function is >1.0 for eleven cases, it is >2.0 for only five cases (Table 3). These results indicate that the baseline scoring function has difficulty in identifying docked ligands that are top-ranked by RMSD100 in most cases. Previously we demonstrated that residue WCNs [23] and other experimental or simulated restraints [94] can be effectively incorporated into scoring functions to improve de novo tertiary structure prediction for α-helical membrane proteins. Likewise, we hypothesized that a scoring term that assigns a penalty score to docked models according the magnitude of the deviation the WCNs of predicted interface residues from their predicted WCNs will also be highly effective:where WCN is the WCN of interface residue i computed based on the docked model, WCN is the WCN of interface residue i predicted by the neural network. The predicted WCNs of the resulting interface residues of each docking partners were used as restraints. Note that, to avoid potential overestimation of the effectiveness of this approach, the WCNs used as restraints in docking the partners of each complex are predicted by a neural network trained with a data set containing no members from the SCOP superfamily [83] of the complex being tested for docking. We defined a quantity called enrichment to measure how capable a scoring function is to select the most accurate docked models from a pool of candidates. A scoring function with an enrichment of r means that models top-ranked by the scoring function are r times more likely than a randomly selected set of models to contain native-like models. As shown in Table 3, the enrichment of the WCN-based penalty score is improved in 13 cases, compared to the enrichment of the baseline scoring function. The number of cases where the enrichment is >2.0 is increased from five to ten cases. The cases where the enrichment of the penalty score is worse than that of the baseline scoring function or <2.0 is likely due to poor classification of interface residues or poor accuracy of WCN prediction. For example, cases for which the enrichment of the penalty score is <2.0 (colored in red in Table 3) have an average TPR, PPV, and MAE of 0.71, 0.20, and 2.09, whereas the average TPR, PPV, and MAE for cases for which the enrichment is >2.0 (colored in green in Table 3) are 0.80, 0.44, and 1.75, respectively. To demonstrate the effectiveness of the penalty score, we plot the “score v.s. rmsd” relationship and compare the best docked model in the 1% models ranked by this score to the native complex structure for cases where the enrichment is greater 2.0 (Fig. 8).

Fig. 8

Examples of oligomers for which WCNs helped identify the correct docking solutions.

Plots on WCN restraint score v.s. RMSD100 relationship, and comparison of the best docked model in the 1% models ranked by WCN restraint score to the native oligomer structure (pale green: native receptor subunit, pale blue: native ligand subunit, light orange: docked ligand subunit) for cases where the enrichment of the WCN restraint score is higher than 2 using ΔWCN = 1 as the threshold for classifying interface residue. The values in parentheses are the RMSD100 of the models shown.

Examples of oligomers for which WCNs helped identify the correct docking solutions. Plots on WCN restraint score v.s. RMSD100 relationship, and comparison of the best docked model in the 1% models ranked by WCN restraint score to the native oligomer structure (pale green: native receptor subunit, pale blue: native ligand subunit, light orange: docked ligand subunit) for cases where the enrichment of the WCN restraint score is higher than 2 using ΔWCN = 1 as the threshold for classifying interface residue. The values in parentheses are the RMSD100 of the models shown.

Discussion

Herein, we compiled a non-redundant set of oligomers of α-helical IMPs whose structure have been determined to high resolution and studied the properties of the interfaces in terms of hydrophobicity, amino acid composition, interface propensity of amino acids, evolutionary conservation, and correlation of amino acid pairs at the interface. We found that the aqueous portion of the interface in oligomers of IMPs have similar characteristics to the interfaces between oligomers of globular proteins [11,14,17,70,95]. Within the membrane, while the average hydrophobicity of the interface is similar to the rest of the lipid-contacting surface, however, the interface is significantly more conserved. We also observed that contacting residue pairs across the interface tend to correlate more strongly than non-contacting residue pairs. While our results suggest that interfaces within the membrane are not significantly different from the rest of the surface for the global physicochemical compositions, two important questions remain: are specific motifs of the GxxxG type enriched in the interface of multi-pass IMP oligomers? And if so, how do such motifs contribute to multi-pass IMP oligomerization? It is known that the GxxxG motif plays an important role in stabilizing single-pass homodimers [88,89]. Within multi-pass IMPs, previous studies have indicated that GxxxG motifs can be important for both folding and oligomerization and it has been suggested that the presence of a GxxxG type of motif alone is only a weak predictor of protein dimerization in the membrane [89]. However, we are not aware of any study that has systematically examined whether the GxxxG motifs have a higher propensity for inter-subunit than intra-subunit interactions. Given the added complexities in the process of multi-pass IMP folding and oligomerization, separate in-depth studies are needed to answer these questions. Based on the observation that the interface is significantly more conserved than the rest of the surface in the membrane, we adapted our previously-developed neural network-based method [68] to distinguish interface residues from residues on the rest of the surface. This classification was based on the WCNs of surface residues predicted by the neural network. While the performance of this method is comparable to the random forest-based binary classifier developed by Bordner [69], its strength lies in that it predicts residues' real-valued WCN. Residues' WCN is an effective restraint for improving the fraction of native contacts in predicted structural models for de novo prediction of tertiary structures of α-helical IMPs [23], and we have herein shown that it can be further used to derive an effective score for selecting native-like docking candidates of IMP complexes. The sampling problem in docking IMPs is inherently more tractable than that in docking globular proteins, due to the reduced translational and rotational degree of freedom with respect to the membrane normal. Despite this simplification in sampling, docking IMPs is still a challenging problem in terms of scoring native-like poses. Intuitively, we expected that matching relatively polar patches on the IMPs transmembrane surface, so as to shield polar regions from the membrane, would be beneficial, however, we unexpectedly found that intra-membrane interface regions do not differ from non-interface surface regions in terms of hydrophobicity. Shape complementarity, or packing volume is similarly of limited use due to IMPs having relatively cylindrical shapes. Using restraints derived from evolutionary analysis may represent an effective approach to narrowing down the set of viable docking candidates for IMPs. In fact, the effectiveness of using predicted coevolving residue pairs to identify native-like docking solutions has been demonstrated previously for globular proteins [91,[96], [97], [98], [99], [100]] and some isolated cases of IMPs [97,98]. While coevolutionary analysis has the benefit of pinpointing specific interacting residue pairs, it requires construction of paired MSA of interacting partners, and computationally, this problem becomes difficult by itself due to the existence of paralogs [[18], [19], [20]]. We have taken a novel approach by first predicting WCNs of surface residues from sequence for monomeric vs. multimeric proteins and using the difference between these predictions to identify likely interface residues. Next, we used the predicted interface residues as “sticky” points and their predicted WCNs as restraints for ranking docking solutions. This approach is computationally efficient and easy to implement as it eliminates the necessity of creating paired MSAs. We have also demonstrated the effectiveness of this approach using a benchmark set of 16 α-helical IMP oligomers. The effectiveness of our approach depends on the accuracy of the prediction of WCNs and the identification of interface residues. Structural models ranked top 1% by the WCN restraint score are closer to the native structure (lower average RMSD100) than models ranked 1% by the baseline scoring function are when either the PPV for classifying interface residues is above 0.28 or the MAE for WCN prediction is below 2.0 (Table 3). When the PPV drops below 0.28 and the MAE goes above 2.0, the WCN score starts to act against identifying correct docking solutions. We realize that such criteria are arbitrary due to the limited size of the benchmark set and that in a real-life scenario a user of our method wouldn't know the PPV or MAE for their protein of interest. Thus, we'd like to point out that non-obligate oligomers where structural or functional constraint on the interface may be too weak to leave detectable evolutionary information on interface residues or incidental oligomers where the interface is neither structurally nor functionally relevant [6] are particularly challenging to our method. In fact, five out of the six non-obligate dimers in the benchmark set have a PPV <0.28 or a MAE >2.0. This also highlights the need for both computational and experimental studies to better understand non-obligate interactions between membrane proteins and the development of more accurate methods for predicting interaction sites. Nevertheless, the current study provides important insights into the interface of α-helical IMP oligomers and presents an effective and robust approach for docking α-helical IMPs. The statistics obtained may give insights into the development of methods that are more accurate in predicting interface residues, while the docking approach is valuable for constructing structural models for α-helical IMP oligomers.

Software and Data Availability

BCL::MP-Dock has been integrated into the Biochemical Library (BCL) software suite that is being actively developed. It is available at http://www.meilerlab.org/bclcommons under academic and business site licenses. The BCL source code is published under the BCL license and is available at http://www.meilerlab.org/bclcommons. Weighted contact numbers of amino acid residues in helical membrane proteins can be predicted using BCL::TMH-Expo via its webserver: http://www.meilerlab.org/servers/tmh_expo. The raw data, processed data, and protocol capture for this work are available at http://dx.doi.org/10.17632/cbk98yrszd.1.

Declarations of Interest

None.

Sources of Funding

This project was supported by National Institutes of Health (NIH) Grant R01 HL122010 and R01 GM080403. B. Li was also supported by American Heart Association Pre-Doctoral Fellowship Award 16PRE27260211.

99 in total

1. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.

Authors: Tal Pupko; Rachel E Bell; Itay Mayrose; Fabian Glaser; Nir Ben-Tal
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

2. Transport of drugs by the multidrug transporter AcrB involves an access and a deep binding pocket that are separated by a switch-loop.

Authors: Thomas Eicher; Hi-jea Cha; Markus A Seeger; Lorenz Brandstätter; Jasmin El-Delik; Jürgen A Bohnert; Winfried V Kern; François Verrey; Markus G Grütter; Kay Diederichs; Klaas M Pos
Journal: Proc Natl Acad Sci U S A Date: 2012-03-26 Impact factor: 11.205

Review 3. The assembly of membrane proteins into complexes.

Authors: Daniel O Daley
Journal: Curr Opin Struct Biol Date: 2008-06-05 Impact factor: 6.809

4. Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis.

Authors: Thomas Gueudré; Carlo Baldassi; Marco Zamparo; Martin Weigt; Andrea Pagnani
Journal: Proc Natl Acad Sci U S A Date: 2016-10-11 Impact factor: 11.205