Coevolutive, evolutive and stochastic information in protein-protein interactions.

Miguel Andrade1, Camila Pontes1, Werner Treptow1.   

Abstract

Here, we investigate the contributions of coevolutive, evolutive and stochastic information in determining protein-protein interactions (PPIs) based on primary sequences of two interacting protein families A and B. Specifically, under the assumption that coevolutive information is imprinted on the interacting amino acids of two proteins in contrast to other (evolutive and stochastic) sources spread over their sequences, we dissect those contributions in terms of compensatory mutations at physically-coupled and uncoupled amino acids of A and B. We find that physically-coupled amino-acids at short range distances store the largest per-contact mutual information content, with a significant fraction of that content resulting from coevolutive sources alone. The information stored in coupled amino acids is shown further to discriminate multi-sequence alignments (MSAs) with the largest expectation fraction of PPI matches - a conclusion that holds against various definitions of intermolecular contacts and binding modes. When compared to the informational content resulting from evolution at long-range interactions, the mutual information in physically-coupled amino-acids is the strongest signal to distinguish PPIs derived from cospeciation and likely, the unique indication in case of molecular coevolution in independent genomes as the evolutive information must vanish for uncorrelated proteins.
© 2019 The Authors.

Keywords:  Coevolution; Evolution; Mutual information; Protein network; Protein-protein interaction

Year:  2019        PMID: 31871588      PMCID: PMC6906720          DOI: 10.1016/j.csbj.2019.10.005

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   7.271


Introduction

While being selected to be thermodynamically stable and kinetically accessible in a particular fold [1], [2], interacting proteins A and B coevolve to maintain their bound free-energy stability against a vast repertoire of non-specific partners and interaction modes. Protein coevolution, in the form of a time-dependent molecular process, then translates itself into a series of primary-sequence variants of A and B encoding coordinated compensatory mutations [3] and, therefore, specific protein-protein interactions (PPIs) derived from this stability-driven process [4]. As a ubiquitous process in molecular biology, coevolution thus applies to protein interologs, either paralogous or orthologous, under cospeciation or in independent genomes. Thanks to extensive past investigations based on the correlation of phylogenetic trees [5], [6], [7] and profiles [8], gene colocalization [9] and fusions [10], maximum coevolutionary interdependencies [11] and correlated mutations [12], [13], the problem of predicting PPIs from multi-sequence alignments (MSAs) now appears resolvable, at least for small sets of paralogous sequences. Recent improvements [14], [15], [16], [17], [18] result from PPI prediction allied to modern coevolutionary approaches [19], [20], [21], [22], [23] that identify interacting amino acids across protein interfaces. In these previous studies, however, the information was taken into account as a whole and, as discussed in recent reviews [4], [24], the isolated contributions of coevolutive, evolutive and stochastic information in resolving the problem were not clarified. Differentiating functional coevolution from stochastic and phylogenetic sources remains an open goal in the field and may help introduce models capable of accurately detecting protein-protein interactions and interfaces, especially when the number of sequences or the amount of biological information is limited [25].
Here, by benefiting from much larger data sets made available in the sequence- and structure-rich era, we revisit the field by quantifying the amount of information that protein A stores about protein B stemming from each of these sources and, more importantly, their effective contributions in discriminating PPIs based on MSAs (Scheme 1). Specifically, under the assumption that the coevolutive information is imprinted on the interacting amino acids of protein interologs, in contrast to other (evolutive and stochastic) sources spread over their sequences, we dissect the information in terms of compensatory mutations at physically-coupled and uncoupled amino acids of A and B. Given a known set of protein three-dimensional amino-acid contacts and their underlying primary sequences, we therefore seek to differentiate functional coevolution from stochastic and phylogenetic signals and then evaluate their contributions to PPI recognition from primary sequences. It is worth emphasizing that our study is not aimed at providing a method for prediction of protein-protein interactions or protein-protein interfaces; hence it differs from previous studies in which sequence covariance is used to predict three-dimensional amino-acid contacts across interfaces and assemble models of protein complexes [26] or protein docking [27]. Anticipating our findings, we show that physically-coupled amino acids store the largest per-contact mutual information (MI) content to discriminate concatenated MSAs with the largest expectation fraction of PPI matches, a conclusion that holds against various definitions of intermolecular protein contacts and binding modes, including native and non-native decoy structures. A significant fraction of that information results from coevolutive sources alone.
Although our analysis involved protein interologs under cospeciation, that is, proteins evolving in the same genome, the derived conclusions are likely general to cases of non-cospeciating interologs, given that the underlying thermodynamical principles must be the same for all cases.
Scheme 1

Structural contacts mapped into M-long multi-sequence alignment (MSA) of protein interologs A and B. A set of pairwise protein-protein interactions is defined by associating each sequence l in MSA B to a sequence k in MSA A in one unique arrangement, {l(k)|z}, determined by the coevolution process z to which these protein families were subjected. Shown is a “scrambled” concatenated MSA of A and B associated to a given process z (red dashes). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Theory and methods

Decomposition of mutual information

In detail, consider two proteins A and B that interact via formation of $i = 1,\dots,N$ amino-acid contacts at the molecular level. Proteins A and B are assumed to coevolve throughout $M!$ distinct processes $z$ described by the stochastic variable $Z$ with a uniform probability mass function $\rho(z)$, $\forall z \in \{1,\dots,M!\}$. Given any specific process $z$, their interacting amino-acid sequences are respectively described by two $N$-length blocks of discrete stochastic variables $X^N \equiv (X_1,\dots,X_N)$ and $Y^N \equiv (Y_1,\dots,Y_N)$ with probability mass functions $\{\rho(x^N), \rho(y^N), \rho(x^N, y^N|z)\}$ such that

$$\rho(x^N) = \sum_{y^N} \rho(x^N, y^N|z), \qquad \rho(y^N) = \sum_{x^N} \rho(x^N, y^N|z) \tag{1}$$

and

$$\sum_{x^N, y^N} \rho(x^N, y^N|z) = 1 \tag{2}$$

for every joint sequence $\{x^N, y^N\}$ defined in the alphabet $\chi$ of size $|\chi|$. Under these considerations, the amount of information that protein A stores about protein B is given by the mutual information $I(X^N; Y^N|z)$ between $X^N$ and $Y^N$ conditional to process $z$ [28]. As made explicit in Eq. (1), we are particularly interested in quantifying $I(X^N; Y^N|z)$ for the situation in which the marginals of the $N$-block variables $\{\rho(x^N), \rho(y^N)\}$ are assumed to be independent of process $z$, meaning that, for a fixed sequence composition of proteins A and B, only their joint distribution depends on the process. Furthermore, by assuming $N$ independent contacts, we want that information to be quantified for the least-constrained model $\rho^*(x^N, y^N|z)$ that maximizes the conditional joint entropy between A and B; that condition ensures that the mutual information can be written exactly in terms of the individual contributions of contacts $i$. For the least-constrained distribution $\{\rho^*(x^N, y^N|z)\}$, the conditional mutual information

$$I(X^N; Y^N|z) = H(X^N) + H(Y^N) - H(X^N, Y^N|z) \tag{3}$$

is written in terms of the Shannon information entropies

$$H(X^N, Y^N|z) = -\sum_{x^N, y^N} \rho^*(x^N, y^N|z) \ln \rho^*(x^N, y^N|z), \quad H(X^N) = -\sum_{x^N} \rho^*(x^N) \ln \rho^*(x^N), \quad H(Y^N) = -\sum_{y^N} \rho^*(y^N) \ln \rho^*(y^N) \tag{4}$$

associated with the conditional joint distribution $\{\rho^*(x^N, y^N|z)\}$ and the derived marginals $\{\rho^*(x^N), \rho^*(y^N)\}$ of the $N$-block variables. From its entropy-maximization property, the critical distribution $\{\rho^*(x^N, y^N|z)\}$ factorizes into the conditional two-site marginal of every contact $i$,

$$\rho^*(x^N, y^N|z) = \prod_{i=1}^{N} \rho^*(x_i, y_i|z) \tag{5}$$

then allowing Eq. (4) to be written extensively, in terms of the individual entropic contributions

$$H(X^N, Y^N|z) = \sum_{i=1}^{N} H(X_i, Y_i|z), \quad H(X^N) = \sum_{i=1}^{N} H(X_i), \quad H(Y^N) = \sum_{i=1}^{N} H(Y_i) \tag{6}$$

such that

$$I(X^N; Y^N|z) = \sum_{i=1}^{N} I(X_i; Y_i|z) \tag{7}$$

(cf. SI for details). In Eq. (7), the conditional mutual information achieves its lower bound of zero if $X^N$ and $Y^N$ are conditionally independent given $z$, i.e., $\rho^*(x_i, y_i|z) = \rho^*(x_i) \times \rho^*(y_i)$. For the case of perfectly correlated variables, $\rho^*(x_i, y_i|z) = \rho^*(x_i) = \rho^*(y_i)$, the conditional mutual information is bound to a maximum that cannot exceed the entropy of either block variable, $H(X^N)$ or $H(Y^N)$. Given a known set of protein amino-acid contacts and their underlying primary-sequence distributions defining the stochastic variables $X^N$ and $Y^N$, Eq. (7) thus establishes the formal dependence of their mutual information on any given process $z$. Because “contacts” can be defined for a variety of cutoff distances $r$, Eq. (7) is particularly useful to dissect mutual information in terms of physically-coupled and uncoupled protein amino acids. In the following, we explore Eq. (7) for that purpose by obtaining the two-site probabilities in Eq. (5) from the observed frequencies in the multiple-sequence alignment, where the $N$-length amino-acid block $l$ of protein B is joined to block $k$ of protein A in one unique arrangement $\{l(k)|z\}$ for $1 \le k \le M$ (cf. Scheme 1 and Computational Methods).
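The extensive form of Eq. (7) can be illustrated with a minimal Python sketch. It assumes raw (unregularized, unweighted) column frequencies and toy sequences, and the function names `contact_mutual_information` and `block_mutual_information` are hypothetical, not taken from the study. It only shows how a per-contact mutual information in nats is summed over a list of contacts, and how re-pairing the rows of B changes the result while leaving the single-site marginals untouched.

```python
import math
from collections import Counter

def contact_mutual_information(col_a, col_b):
    """Mutual information (nats) between two paired alignment columns,
    estimated from raw frequencies (no pseudocount, no sequence weights)."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    m = len(col_a)
    joint = Counter(zip(col_a, col_b))
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / m
        # p_ab / (p_a * p_b) written with integer counts
        mi += p_ab * math.log(p_ab * m * m / (count_a[a] * count_b[b]))
    return mi

def block_mutual_information(msa_a, msa_b, contacts):
    """Sum of per-contact mutual information over contacts (i, j),
    i.e. the extensive form obtained under independent contacts."""
    total = 0.0
    for i, j in contacts:
        col_a = [seq[i] for seq in msa_a]
        col_b = [seq[j] for seq in msa_b]
        total += contact_mutual_information(col_a, col_b)
    return total

# Toy example: row k of msa_b is paired with row k of msa_a.
msa_a = ["AK", "AK", "GE", "GE"]
msa_b_native = ["CD", "CD", "TR", "TR"]
msa_b_scrambled = ["CD", "TR", "CD", "TR"]  # same rows, different pairing
contacts = [(0, 0), (1, 1)]
native_mi = block_mutual_information(msa_a, msa_b_native, contacts)
scrambled_mi = block_mutual_information(msa_a, msa_b_scrambled, contacts)
```

In this toy case the native pairing is perfectly correlated at both contacts, while the scrambled pairing is statistically independent.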

Computational methods

Table 1 details the interacting protein systems considered in the study. For each system under investigation, the amino-acid contacts defining the discrete stochastic variables $X^N$ and $Y^N$, including physically coupled amino acids at short-range cutoff distances (r ≤ 8.0 Å) and physically uncoupled amino acids at long-range cutoff distances (r > 8.0 Å), were identified from the x-ray crystal structure of the bound state of proteins A and B. The reference (native) multi-sequence alignment $\{x^N, y^N|z^*\}$ of the joint amino-acid blocks associated with $X^N$ and $Y^N$ was reconstructed from annotated primary-sequence alignments published by Baker and coworkers [22], containing M paired sequences with known protein-protein interactions and defined in the alphabet of 20 amino acids plus the gap symbol (|χ| = 21). “Scrambled” MSA models were generated by randomizing the pattern $\{l(k)|z^*\}$ in which block l is joined to block k in the reference alignment.
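As an illustration of the scrambling step, the following Python sketch builds a pairing pattern in which a chosen number of rows are re-paired and all others keep their native partner. It is only a sketch under assumptions: the function name `scramble_pairing` is hypothetical, and the use of rejection sampling to guarantee that no selected row keeps its native partner is a choice made here, not a procedure stated in the text.

```python
import random

def scramble_pairing(m, n_scrambled, rng=None):
    """Return a list `pairing` where pairing[k] is the row of MSA B joined
    to row k of MSA A. Exactly `n_scrambled` rows lose their native partner;
    the remaining m - n_scrambled rows stay natively paired."""
    rng = rng or random.Random()
    if not 0 <= n_scrambled <= m or n_scrambled == 1:
        raise ValueError("n_scrambled must be 0 or between 2 and m")
    pairing = list(range(m))                  # native pattern: l(k) = k
    chosen = rng.sample(range(m), n_scrambled)
    while True:
        shuffled = chosen[:]
        rng.shuffle(shuffled)
        # accept only derangements of the chosen rows
        if all(a != b for a, b in zip(chosen, shuffled)):
            break
    for k, l in zip(chosen, shuffled):
        pairing[k] = l
    return pairing

pairing = scramble_pairing(10, 4, random.Random(0))
n_native = sum(1 for k, l in enumerate(pairing) if k == l)
```

The returned list can be used to reorder the rows of MSA B before concatenation with MSA A.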
Table 1

Protein system A and B considered in the study.

| Complex description | PDB ID | Protein A | Protein B | M | MSA length |
|---|---|---|---|---|---|
| Obligate dimers | | | | | |
| Carbamoyl Phosphate Synthetase | 1BXR | Chain A: Carbamoyl-Phosphate Synthetase large subunit | Chain B: Carbamoyl-Phosphate Synthetase small subunit | 1004 | 1452 |
| Lactococcus lactis Dihydroorotate Dehydrogenase B | 1EP3 | Chain A: Dihydroorotate Dehydrogenase B (PYRD subunit) | Chain B: Dihydroorotate Dehydrogenase B (PYRK subunit) | 552 | 572 |
| Polysulfide reductase native structure | 2VPZ | Chain A: Thiosulfate Reductase | Chain B: NRFC Protein | 676 | 927 |
| Heterohexameric TusBCD proteins | 2D1P | Chain B: Hypothetical UPF0116 protein yheM | Chain C: Hypothetical protein yheL | 216 | 214 |
| 3-oxoadipate CoA-transferase | 3RRL | Chain A: Succinyl-CoA:3-ketoacid-coenzyme A transferase subunit A | Chain B: Succinyl-CoA:3-ketoacid-coenzyme A transferase subunit B | 1330 | 437 |
| Bovine heart cytochrome c oxidase | 2Y69 | Chain A: Cytochrome C Oxidase Subunit 1 | Chain B: Cytochrome C Oxidase Subunit 2 | 1484 | 740 |
| Non-obligate dimer | | | | | |
| Toxin-antitoxin complex RelBE2 from Mycobacterium tuberculosis | 3G5O | Chain A: Protein Rv2865 | Chain B: Protein Rv2866 | 904 | 173 |
For any given MSA model, two-site probabilities $\rho^*(x_i, y_i|z)$ were defined from the observable frequencies $f(x_i, y_i|z)$, regularized by a pseudocount effective fraction $\lambda^*$ in case of insufficient data availability, as devised by Morcos and coauthors [19]. More specifically, two-site frequencies were calculated according to

$$f(x_i, y_i|z) = \frac{\lambda^*}{|\chi|^2} + (1 - \lambda^*)\,\frac{1}{M_{\mathrm{eff}}} \sum_{m=1}^{M} \frac{1}{n_m}\, \delta_{x_i, x_i^m}\, \delta_{y_i, y_i^m} \tag{9}$$

where $x_i^m$ and $y_i^m$ are the amino acids found at contact $i$ in row $m$ of the MSA model, $n_m = |\{m' \mid 1 \le m' \le M, \text{Hamming distance}(m, m') \le \delta_h\}|$ is the number of similar sequences $m'$ within a certain Hamming distance $\delta_h$ of sequence $m$, and

$$M_{\mathrm{eff}} = \sum_{m=1}^{M} \frac{1}{n_m} \tag{10}$$

is the effective number of distinguishable primary sequences at that distance threshold; the Kronecker delta ensures counting of $(x_i, y_i)$ occurrences only. In Eq. (9), two-site frequencies converge to raw occurrences in the sequence alignment for $\lambda^* = 0$ and approach the uniform distribution for $\lambda^* = 1$; Eq. (9) is identical to the equation devised by Morcos and coauthors [19] upon rewriting $\lambda^* = \lambda/(\lambda + M_{\mathrm{eff}})$. Here, two-site probabilities were computed from Eq. (9) after unbiasing the reference MSA by down-weighting primary sequences with amino-acid identity equal to 100%. An effective number of primary sequences $M = M_{\mathrm{eff}}$ (cf. Table S1) was retained for analysis, and a pseudocount fraction of $\lambda^* = 0.001$ was used to regularize data without largely impacting observable frequencies. Single-site probabilities $\{\rho(x_i), \rho(y_i)\}$ were derived from $\rho^*(x_i, y_i|z)$ by marginalization via Eq. (1). The conditional mutual information in Eq. (7) was computed from single- and joint-entropies according to Eq. (3). Given that the maximum value of $I(X_i; Y_i|z)$ is bound to the conditional joint entropy $H(X_i, Y_i|z)$, Eq. (7) was computed in practice as a per-contact entropy-weighted conditional mutual information [29], $H(X_i, Y_i|z)^{-1} I(X_i; Y_i|z)$, to avoid overestimating the contributions of contacts between highly variable sites. Because $H(X_i, Y_i|z)$ and $I(X_i; Y_i|z)$ both have units of nats, Eq. (7) is dimensionless in the present form.
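A minimal Python sketch of the pseudocount-regularized, sequence-weighted two-site frequencies is given below. It assumes the form in which the uniform term carries weight λ* and the weighted counts carry weight 1 − λ*, which matches the stated limits (raw occurrences at λ* = 0, uniform distribution at λ* = 1). The function names, the toy four-letter alphabet, and the use of a Hamming-distance threshold of zero to down-weight identical sequences are illustrative assumptions.

```python
from itertools import product

def sequence_weights(rows, max_hamming=0):
    """Weight 1/n_m for each row, where n_m counts rows (including itself)
    within `max_hamming` mismatches. max_hamming=0 down-weights only
    rows that are 100% identical."""
    weights = []
    for s in rows:
        n_similar = sum(
            1 for t in rows
            if sum(a != b for a, b in zip(s, t)) <= max_hamming
        )
        weights.append(1.0 / n_similar)
    return weights

def two_site_frequencies(col_a, col_b, weights, alphabet, lam=0.001):
    """Regularized joint frequencies for one contact: a uniform term scaled
    by `lam` plus weighted occurrences scaled by (1 - lam)."""
    m_eff = sum(weights)
    q = len(alphabet)
    freq = {pair: lam / q**2 for pair in product(alphabet, repeat=2)}
    for a, b, w in zip(col_a, col_b, weights):
        freq[(a, b)] += (1.0 - lam) * w / m_eff
    return freq

# Toy concatenated rows: the first two are identical and share one unit of weight.
rows = ["AC", "AC", "GT"]
weights = sequence_weights(rows)
col_a = [r[0] for r in rows]
col_b = [r[1] for r in rows]
freq = two_site_frequencies(col_a, col_b, weights, "ACGT", lam=0.1)
uniform = two_site_frequencies(col_a, col_b, weights, "ACGT", lam=1.0)
```

Single-site frequencies follow by summing `freq` over one of the two symbols, and the entropies needed for the entropy-weighted mutual information can then be computed from these dictionaries.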

Results and discussion

Details of all protein systems under investigation are presented in Table 1. Each system involves two families of protein interologs A and B with known PPIs derived from cospeciation in the same genome [26]. We denote by $\{x^N, y^N|z^*\}$ their reference concatenated MSA associated with the native process z*. For convenience, in the following we present and discuss results obtained for a representative system A and B, the protein complex TusBCD (chains B and C of 2D1P), which is crucial for tRNA modification in Escherichia coli. Similar results and conclusions hold for all other systems in Table 1, as presented in supplementary Figs. S1 through S4 (cf. SI). Fig. 1A shows the three-dimensional representation of the stochastic variables embodying all possible amino-acid pairs along proteins A and B and their decomposition in terms of physically coupled amino acids at short-range cutoff distances (r ≤ 8.0 Å) and physically uncoupled amino acids at long-range cutoff distances (r > 8.0 Å). In Fig. 1B, the total mutual information (coupled + uncoupled) across all possible amino-acid pairs of A and B amounts to 987.88 in the reference (native) MSA. As estimated from a generated ensemble of “scrambled” MSA models, expectation values for the mutual information
Fig. 1

Informational analysis of protein complex TusBCD, chains B and C. (A) Three-dimensional representation of stochastic variables $X^N$ and $Y^N$ as defined from physically coupled amino acids at short-range cutoff distances r ≤ 8.0 Å (turquoise) and physically uncoupled amino acids at long-range cutoff distances r > 8.0 Å (gray). Calculation of r involved Cβ-Cβ atomic separation distances. (B) Conditional mutual information as a function of the number M − n of randomly paired proteins in the reference (native) MSA, for 0 ≤ n ≤ M. $\langle I(X^N; Y^N|z) \rangle$ are expectation values estimated from a generated ensemble of 500 MSA models. Mutual information of fully “scrambled” models featuring M unpaired sequences is similar to that calculated from randomized sequence alignments generated by random swapping of lines within columns. (C) Mutual information gap $\Delta I_M$ between the reference and 100 fully “scrambled” models featuring M unpaired sequences. (D) Per-contact mutual information gap $N^{-1}\Delta I$. (E) Mutual information decomposition according to Eq. (11) and comparison with functional mutual information ($MI_{p,\,r_c \le 8}$) and direct information ($DI_{\le 8}$). In B, C, D and E error bars correspond to standard deviations.

As a measure of correlation, it is not surprising that mutual information in the reference MSA is larger than that of scrambled alignments. Unexpected, however, is that the correlation does not vanish in “scrambled” models, meaning that part of the calculated mutual information arises at random. Supporting that notion, the mutual information of fully “scrambled” models is found here to be very similar to the same estimate from randomized sequence alignments generated by random swapping of lines within columns.
Subtraction of that stochastic source from the native mutual information, computed as an information gap between the reference MSA and “scrambled” models full of sequence mismatches, then reveals the isolated nonstochastic contributions to the total correlation between proteins A and B. Here, the information gap amounts to ~440 for all possible amino-acid pairs of A and B. Fig. 1C shows the individual contributions of physically coupled and uncoupled amino acids to the total mutual information gap, $\Delta I = \Delta I_{\le 8.0\,Å} + \Delta I_{> 8.0\,Å}$. As a direct consequence of the extensive property of Eq. (7), individual contributions to the total mutual information gap increase with the cutoff distance defining amino-acid contacts (r) and, consequently, with the block length (N) of the corresponding stochastic variables. As such, the information imprinted at physically uncoupled amino acids accounts for most of the total mutual information gap (438.8132 ± 4.5159). When normalized by the block length, that is, the number of amino-acid contacts (Fig. 1D), the mutual-information contribution $N^{-1}\Delta I$ reveals a distinct dependence, being larger for physically coupled amino acids than for uncoupled ones (0.0653 ± 0.0015 versus 0.039 ± 0.0004). The information-gap profile as a function of amino-acid pair distances shown in Fig. S2 explains the result by showing a few larger information-gap values at short distances in contrast to many smaller ones at long distances. Under the assumption that the coevolutive information is imprinted on the interacting amino acids of interologs, in contrast to other (evolutive and stochastic) sources spread over their primary sequences, the difference between short- and long-range contributions provides per-contact estimates for the information content resulting from coevolution alone, that is,

$$N^{-1}\Delta\Delta I = N^{-1}_{\le 8.0\,Å}\,\Delta I_{\le 8.0\,Å} - N^{-1}_{> 8.0\,Å}\,\Delta I_{> 8.0\,Å} \tag{11}$$

where $N^{-1}_{> 8.0\,Å}\,\Delta I_{> 8.0\,Å}$ represents the per-contact mutual information resulting from evolution. As shown in Fig. 1E, the information content resulting from coevolution alone amounts to 0.0264 ± 0.0014, which compares well to independent measures of coevolutionary information, i.e., functional mutual information ($MI_p$) [29] and direct information ($DI$) [19], 0.0340 ± 0.0037 and 0.0202 ± 0.0019, respectively. More specifically, $MI_p$ is a metric formulated by Dunn and coworkers [29] in which mutual information is subtracted from structural or functional relationships, whereas $DI$ is based on the direct coupling analysis that removes all kinds of indirect correlations by following a global statistical approach [19]. According to the definition in Eq. (11), we then conclude that ~40% of the information content stored in physically coupled amino acids of the protein complex TusBCD results from coevolutive sources alone.
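The arithmetic behind the ~40% estimate can be checked with a short Python sketch that uses only the rounded per-contact values quoted in this section. The variable names are illustrative, and the small difference from the reported 0.0264 is assumed to come from rounding of the quoted inputs.

```python
# Per-contact mutual information gaps quoted in the text (rounded values).
coupled_gap = 0.0653     # physically coupled pairs, r <= 8.0 Angstrom
uncoupled_gap = 0.039    # physically uncoupled pairs, r > 8.0 Angstrom

# Coevolutive content: short-range minus long-range per-contact gap.
coevolutive_gap = coupled_gap - uncoupled_gap

# Fraction of the coupled-pair information attributed to coevolution alone.
coevolutive_fraction = coevolutive_gap / coupled_gap

print(f"coevolutive per-contact gap: {coevolutive_gap:.4f}")
print(f"fraction of coupled information: {coevolutive_fraction:.1%}")
```

The difference is about 0.026 and the ratio is about 0.40, consistent with the values reported for TusBCD.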

Degeneracy and error analysis of short and long-range correlations

The present analysis reveals quantitative differences between short- and long-range correlations of proteins A and B. Because the per-contact mutual-information component $N^{-1}\Delta I$ provides an unbiased (intensive) estimate for proper comparison of the information content between coupled and uncoupled amino acids, in the following we focus on $N^{-1}\Delta I$ to dissect their effective contributions in determining PPIs based on sequence alignments. Accordingly, let us define the total number $\omega_S$ of native-like MSA models generated by scrambling $M - n$ sequence pairs in the reference alignment,

$$\omega_S = \sum_{n \in S(r)} \omega_{M-n} \tag{12}$$

in terms of rencontres numbers $\omega_{M-n}$, that is, the number of permutations of the reference sequence set $\{l(k)|z^*\}$ with $n$ fixed positions,

$$\omega_{M-n} = \binom{M}{n}\; !(M-n) \tag{13}$$

satisfying $\sum_{n=0}^{M} \omega_{M-n} = M!$, where $!(M-n)$ is the subfactorial of $M - n$ in combinatorics language. Here, $S(r)$ denotes the set of fixed positions $n$ for which the mutual information gap is smaller than a certain resolution $\delta I$,

$$N^{-1}\Delta I \le \delta I \tag{14}$$

independently of the corresponding block length $N$, that is, the number of amino-acid contacts. In simple terms, $\omega_S$ in Eq. (12) informs us of the degeneracy, that is, the number of “scrambled” MSA models with an amount of mutual information similar to that of the reference (native) alignment. As shown in Table S1, the rencontres number $\omega_{M-n}$ is an astronomically increasing function of $M - n$, identical for any definition of the stochastic variables $X^N$ and $Y^N$ derived from the same number $M$ of aligned sequences. For instance, there are 164548102752 alignments for the protein complex TusBCD with $M - n = 5$ scrambled sequence pairs. In contrast, the total number $\omega_S$ of native-like MSA models depends on the stochastic variables at various resolutions $\delta I$ (Fig. 2A). That number is substantially smaller for definitions of $X^N$ and $Y^N$ embodying physically-coupled amino acids, as a consequence of the smaller number $M - n$ of unpaired sequences required to perturb $N^{-1}\Delta I$ by a fixed change $\delta I$, such that $\omega_S$ accumulates over fewer MSA models satisfying the condition in Eq. (14) (Fig. 2B).
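The rencontres numbers mentioned above can be computed exactly with integer arithmetic. The Python sketch below uses the standard combinatorial identity (number of ways to choose the scrambled rows times the number of derangements of those rows); the function names are illustrative, and M = 216 for TusBCD is taken from Table 1.

```python
from math import comb, factorial

def subfactorial(k):
    """Number of derangements of k items, via !k = (k - 1) * (!(k-1) + !(k-2))."""
    a, b = 1, 0  # !0 and !1
    for i in range(2, k + 1):
        a, b = b, (i - 1) * (a + b)
    return a if k == 0 else b

def rencontres(m, n_scrambled):
    """Number of pairings of m rows in which exactly `n_scrambled` rows have
    lost their native partner (m - n_scrambled rows remain fixed)."""
    return comb(m, n_scrambled) * subfactorial(n_scrambled)

# TusBCD: M = 216 paired sequences, five scrambled sequence pairs.
tusbcd_five_scrambled = rencontres(216, 5)
print(tusbcd_five_scrambled)

# Summing over every possible number of scrambled rows recovers M!.
total_small = sum(rencontres(6, k) for k in range(7))
```

For M = 216 and five scrambled pairs this reproduces the count of alignments quoted in the text.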
Fig. 2

Degeneracy and error analysis for stochastic variables X and Y involving interacting amino acids at short-range distances r ≤ 8.0 Å (turquoise) and long-range distances r > 8.0 Å (gray). (A) Total number ω of native-like MSA models at various mutual-information resolutions δI. (B) Per-contact gaps of mutual information N−1ΔI as a function of the number M − n of “scrambled” sequence pairs in the reference native alignment. (C) Expectation values <ε> (Eq. (15)) for the fraction of sequence matches across native-like MSA models at various mutual-information resolutions δI. Dashed lines highlight differences at δI values of 0.01 and 0.02.

The degeneracy of native-like MSA models at a given resolution depends on the cutoff distance defining the stochastic variables (Fig. 2A). That condition imposes distinct boundaries on the number of PPIs amenable to resolution across definitions of the stochastic variables in terms of coupled and uncoupled amino acids. Indeed, the expectation value ⟨ε⟩ for the fraction $M^{-1}n$ of primary-sequence matches among native-like MSA models decreases substantially with the degeneracy of such models, meaning that ⟨ε⟩ is systematically larger for physically-coupled amino acids at various mutual-information resolutions δI (Fig. 2C). For instance, the fraction of matches at δI = 0.02 is ~20% larger for coupled amino acids than the same estimate for amino acids at long-range distances (0.8333 versus 0.6991). Linear extrapolation in Fig. 2C towards larger mutual-information resolutions suggests even larger differences in the expectation fraction of PPI matches between short- and long-range correlations of A and B.
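How degeneracy lowers the expected fraction of matches can be sketched in Python. The explicit form of the expectation is not reproduced in this text, so the sketch assumes that every native-like model counts equally, which weights each number of scrambled pairs by its rencontres number. The function names and the toy gap profile are hypothetical.

```python
from math import comb

def subfactorial(k):
    """Number of derangements of k items."""
    a, b = 1, 0  # !0 and !1
    for i in range(2, k + 1):
        a, b = b, (i - 1) * (a + b)
    return a if k == 0 else b

def expected_match_fraction(m, gap_by_scrambled, delta_i):
    """Average fraction of native pairs over all models whose per-contact
    mutual information gap does not exceed `delta_i`.

    gap_by_scrambled maps k (number of scrambled pairs) to the per-contact
    gap of models with k scrambled pairs (assumed input, e.g. ensemble means).
    """
    n_models = 0
    n_matches = 0
    for k, gap in gap_by_scrambled.items():
        if gap <= delta_i:
            degeneracy = comb(m, k) * subfactorial(k)
            n_models += degeneracy
            n_matches += degeneracy * (m - k)
    return n_matches / (n_models * m)

# Toy profile for m = 4 rows: the gap grows with the number of scrambled pairs.
toy_gaps = {0: 0.0, 2: 0.01, 3: 0.02, 4: 0.05}
strict = expected_match_fraction(4, toy_gaps, 0.0)    # only the native model
loose = expected_match_fraction(4, toy_gaps, 0.01)    # native + 2-scrambled
every = expected_match_fraction(4, toy_gaps, 1.0)     # all 4! pairings
```

Under this assumption, a variable set whose gap rises faster with the number of scrambled pairs admits fewer models at a given resolution and therefore keeps a larger expected fraction of matches.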

Dependence with contact definition and docking decoys

So far, a “contact” is any pair of residues i in protein A and j in protein B within a given distance r*, which can be redefined for a variety of cutoff distances. Specifically, our results were determined by defining physically coupled amino acids at short-range cutoff distances (r ≤ r*) and physically uncoupled amino acids at long-range cutoff distances (r > r*) for a typical geometrical “contact” definition involving Cβ-Cβ atomic separation distances of 8.0 Å (that is, r* = 8.0 Å). In the following, amino-acid “contacts” are loosely redefined for a variety of cutoff distances to study how the information encoded in short- and long-range protein interactions depends on r*. Further analysis shows a clear dependence of the per-contact mutual information gap ($N^{-1}\Delta I$) of coupled amino acids on r*, which is not the case for uncoupled ones. As shown in Fig. 3A, that distinction is due to the coevolutive information stored at short-range distances, which reaches a maximum at r* ≈ 8.0 Å, in contrast to evolutive sources uniformly spread over the entire range of r* values. Of particular interest, the result strongly supports the assumption that coevolutive information is imprinted preferentially on physically-coupled amino acids of interologs, in contrast to other (evolutive and stochastic) sources spread over their primary sequences, a conclusion further supported by calculations of the mutual information subtracted from structural-functional relationships ($MI_p$) as a function of r*.
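The geometric contact definition can be sketched in Python as a simple distance threshold on inter-protein residue pairs. This is an illustration under assumptions: the function name `split_pairs_by_cutoff` is hypothetical, the coordinates are taken to be one representative atom per residue (for example Cβ) in ångströms, and parsing of the crystal structure is left out.

```python
import math

def split_pairs_by_cutoff(coords_a, coords_b, cutoff=8.0):
    """Split all residue pairs (i in A, j in B) into physically coupled
    (distance <= cutoff) and uncoupled (distance > cutoff) sets."""
    coupled, uncoupled = [], []
    for i, pos_a in enumerate(coords_a):
        for j, pos_b in enumerate(coords_b):
            if math.dist(pos_a, pos_b) <= cutoff:
                coupled.append((i, j))
            else:
                uncoupled.append((i, j))
    return coupled, uncoupled

# Toy coordinates (Angstrom) for two residues in each protein.
coords_a = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
coords_b = [(0.0, 0.0, 5.0), (0.0, 0.0, 30.0)]
coupled, uncoupled = split_pairs_by_cutoff(coords_a, coords_b, cutoff=8.0)

# Re-running with other cutoffs mimics the scan over r*.
coupled_wide, _ = split_pairs_by_cutoff(coords_a, coords_b, cutoff=25.0)
```

Each returned pair list can serve as the set of contacts over which per-contact mutual information is accumulated.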
Fig. 3

Dependence on contact definition r* and docking decoys. (A) Per-contact mutual information gap $N^{-1}\Delta I$ and mutual information subtracted from structural-functional relationships $MI_p$ at various r*. (B) Per-contact mutual information gap $N^{-1}\Delta I$ (turquoise), information content resulting from coevolution alone $N^{-1}\Delta\Delta I$ (green) and mutual information subtracted from structural or functional relationships $MI_p$ (blue) at alternative interfaces generated by docking; only physically coupled amino acids as defined for r ≤ 8.0 Å were included in the calculations. Black bars represent the root-mean-square deviation (RMSD in Å units) between the native bound structure and docking decoys as generated by GRAMM-X [30]. Docking solutions were selected following a binding-energy stability criterion according to the scoring function of GRAMM; all docking decoys considered in the study are low-energy configurations despite large RMSD values relative to the native structure. (C) Illustration of four docking decoys of chain B in the protein complex TusBCD (chain C is shown in gray). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

So far, the information encoded in short- and long-range amino-acid interactions was analyzed across the native binding interface between proteins, as revealed by x-ray crystallography experiments. The dependence of the per-contact mutual information gap on non-native binding modes, or docking decoys, of proteins A and B was then analyzed further at the typical definition of amino-acid contacts (r* = 8.0 Å). As shown in Fig. 3B, there is a clear dependence of the information gap on binding modes: the per-contact mutual information gap reaches a maximum at the experimentally determined native bound configuration of A and B (RMSD = 0.0 Å), meaning that $N^{-1}\Delta I$ embodies coevolutive pressures in the native amino-acid contacts beyond their accessibility at the molecular surface of proteins.
The conclusion is further supported in Fig. 3B by noticing that the isolated coevolutive content for the bound configuration of A and B or the associated mutual information subtracted from structural-functional relationships are larger than the very same estimates for any docking decoys.

Concluding remarks

Overall, molecular coevolution as the maintenance of the binding free-energy of interacting proteins leads their physically coupled amino-acids to store the largest per-contact mutual information at r* ≈ 8.0 Å, with a significant fraction of the information resulting from coevolutive sources alone. In the present formulation, coupled amino acids are related to the smallest degeneracy of native-like MSA models and, therefore, to the largest expectation fraction of PPI matches across such models. These findings hold against any other definition of protein contacts, either across a variety of limitrophe distances discriminating coupled and uncoupled amino acids or alternative binding interfaces in docking decoys. Although presented for the protein complex TusBCD, results and discussion also extent to other protein systems, including obligate and non-obligate dimers, as shown in supplementary Figs. S1 through S4 (cf. SI). Advances in PPI prediction [14], [15], [16], [17], [18] are highly welcome in the contexts of paralog matching, host-pathogen PPI network prediction and interacting protein families prediction. Recent studies suggest strategies like maximizing the interfamily coevolutionary signal [14], iterative paralog matching based on sequence “energies” [15] and expectation–maximization [18], which have been capable of accurately matching paralogs for some study cases. Despite these advances, the problem of PPI prediction remains unsolved for sequence ensembles in general, especially for proteins that coevolve in independent genomes though likely resulting from the same free-energy constraints – examples are phage proteins and bacterial receptors, pathogen and host-cell protein, neurotoxins and ion channels, to mention a few. 
Accordingly, to contribute to these efforts, we have addressed the following questions in our study: given the three-dimensional amino-acid contacts known from X-ray crystal structures, what information do they encode in terms of stochastic, evolutive and coevolutive sources, and how useful is that information in resolving PPIs from “scrambled” multi-sequence alignments? Since the Direct Information derived from modern coevolutionary approaches [19], [22] already filters out most of the information sources, the decomposition proposed here only makes sense for the Mutual Information, which embodies unfiltered information. In this regard, it is worth emphasizing that our goals are neither to resolve pairs of residues highly correlated via direct physical coupling [19], [22] nor to provide a method for predicting protein-protein interactions and interfaces [26], [27]. Although our study is not aimed at providing an approach for PPI prediction, the largest amount of non-stochastic information available in primary sequences, which as found here helps differentiate MSA models with the largest expectation fraction of sequence matches, might be of practical relevance in the search for more effective heuristics to resolve protein-protein interactions from “scrambled” multi-sequence alignments. When compared to evolutive sources, that information is the strongest signal to characterize protein interactions derived from cospeciation and likely the only indication in the case of coevolution without cospeciation, as the non-stochastic information of uncoupled amino acids must vanish in independent proteins; indeed, low information between amino-acid positions of multiple sequence alignments is typically indicative of independently evolved proteins.
More effective heuristics based on that signal could be applied to the more general problem of resolving PPIs under coevolution in independent genomes, which would be a highly welcome advance in the field. We believe the results are of broad interest, as the stability principles of protein systems under coevolution must be universal, whether under cospeciation or in independent genomes. We therefore anticipate that decomposing the evolutive and coevolutive information imprinted in physically coupled and uncoupled amino acids, and evaluating its potential utility in resolving MSA models in terms of degeneracy and fraction of PPI matches, should guide new developments in the field aimed at characterizing protein interactions in general.
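As an illustration of the quantity discussed in these remarks, the mutual information between one amino-acid position of A and one of B can be sketched as follows. This is a minimal sketch and not the pipeline used in the study: it assumes a plug-in frequency estimate with no pseudocounts or sequence reweighting, it does not reproduce the decomposition into coevolutive, evolutive and stochastic sources, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Plug-in Shannon mutual information (in nats) between two alignment columns.

    col_a[i] and col_b[i] are the residues found at the chosen positions of
    the i-th paired sequences of families A and B (hypothetical inputs).
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must pair the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Toy example (assumed data): a charge-compensating pair of positions.
col_a = list("DDKKDDKK")
col_b = list("KKDDKKDD")        # native pairing: K always faces D
scrambled_b = list("KDKDKDKD")  # same residues of B, partners mismatched

mi_native = mutual_information(col_a, col_b)
mi_scrambled = mutual_information(col_a, scrambled_b)
```

In this toy case the native pairing carries log 2 nats of mutual information, whereas the scrambled pairing carries none, which mirrors the statement that low information between alignment positions is indicative of independent sequences.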

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
  29 in total

1.  Similarity of phylogenetic trees as indicator of protein-protein interaction.

Authors:  F Pazos; A Valencia
Journal:  Protein Eng       Date:  2001-09

2.  Folding simulations of a three-dimensional protein model with a nonspecific hydrophobic energy function.

Authors:  L G Garcia; W L Treptow; A F de Araújo
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2001-06-21

3.  Inferring protein interactions from phylogenetic distance matrices.

Authors:  Jason Gertz; Georgiy Elfond; Anna Shustrova; Matt Weisinger; Matteo Pellegrini; Shawn Cokus; Bruce Rothschild
Journal:  Bioinformatics       Date:  2003-11-01       Impact factor: 6.937

4.  Predicting functional linkages from gene fusions with confidence.

Authors:  Cynthia J Verjovsky Marcotte; Edward M Marcotte
Journal:  Appl Bioinformatics       Date:  2002

Review 5.  Emerging methods in protein co-evolution.

Authors:  David de Juan; Florencio Pazos; Alfonso Valencia
Journal:  Nat Rev Genet       Date:  2013-03-05       Impact factor: 53.242

6.  Why should we care about molecular coevolution?

Authors:  Francisco M Codoñer; Mario A Fares
Journal:  Evol Bioinform Online       Date:  2008-02-14       Impact factor: 1.625

7.  GRAMM-X public web server for protein-protein docking.

Authors:  Andrey Tovchigrechko; Ilya A Vakser
Journal:  Nucleic Acids Res       Date:  2006-07-01       Impact factor: 16.971

8.  Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information.

Authors:  Sergey Ovchinnikov; Hetunandan Kamisetty; David Baker
Journal:  Elife       Date:  2014-05-01       Impact factor: 8.140

9.  Improving protein-protein interaction prediction using evolutionary information from low-quality MSAs.

Authors:  Csilla Várnai; Nikolas S Burkoff; David L Wild
Journal:  PLoS One       Date:  2017-02-06       Impact factor: 3.240

10.  Disentangling direct from indirect co-evolution of residues in protein alignments.

Authors:  Lukas Burger; Erik van Nimwegen
Journal:  PLoS Comput Biol       Date:  2010-01-01       Impact factor: 4.475

  1 in total

1.  Trivial and nontrivial error sources account for misidentification of protein partners in mutual information approaches.

Authors:  Camila Pontes; Miguel Andrade; José Fiorote; Werner Treptow
Journal:  Sci Rep       Date:  2021-03-25       Impact factor: 4.379
