The ability to interact with different partners is one of the most important features in proteins. Proteins that bind a large number of partners (hubs) have been often associated with intrinsic disorder. However, many examples exist of hubs with an ordered structure, and evidence of a general mechanism promoting promiscuity in ordered proteins is still elusive. An intriguing hypothesis is that promiscuous binding sites have specific dynamical properties, distinct from the rest of the interface and pre-existing in the protein isolated state. Here, we present the first comprehensive study of the intrinsic dynamics of promiscuous residues in a large protein data set. Different computational methods, from coarse-grained elastic models to geometry-based sampling methods and to full-atom Molecular Dynamics simulations, were used to generate conformational ensembles for the isolated proteins. The flexibility and dynamic correlations of interface residues with a different degree of binding promiscuity were calculated and compared considering side chain and backbone motions, the latter both on a local and on a global scale. The study revealed that (a) promiscuous residues tend to be more flexible than nonpromiscuous ones, (b) this additional flexibility has a higher degree of organization, and (c) evolutionary conservation and binding promiscuity have opposite effects on intrinsic dynamics. Findings on simulated ensembles were also validated on ensembles of experimental structures extracted from the Protein Data Bank (PDB). Additionally, the low occurrence of single nucleotide polymorphisms observed for promiscuous residues indicated a tendency to preserve binding diversity at these positions. A case study on two ubiquitin-like proteins exemplifies how binding promiscuity in evolutionary related proteins can be modulated by the fine-tuning of the interface dynamics. The interplay between promiscuity and flexibility highlighted here can inspire new directions in protein-protein interaction prediction and design methods.
The ability to interact with different partners is one of the most important features in proteins. Proteins that bind a large number of partners (hubs) have been often associated with intrinsic disorder. However, many examples exist of hubs with an ordered structure, and evidence of a general mechanism promoting promiscuity in ordered proteins is still elusive. An intriguing hypothesis is that promiscuous binding sites have specific dynamical properties, distinct from the rest of the interface and pre-existing in the protein isolated state. Here, we present the first comprehensive study of the intrinsic dynamics of promiscuous residues in a large protein data set. Different computational methods, from coarse-grained elastic models to geometry-based sampling methods and to full-atom Molecular Dynamics simulations, were used to generate conformational ensembles for the isolated proteins. The flexibility and dynamic correlations of interface residues with a different degree of binding promiscuity were calculated and compared considering side chain and backbone motions, the latter both on a local and on a global scale. The study revealed that (a) promiscuous residues tend to be more flexible than nonpromiscuous ones, (b) this additional flexibility has a higher degree of organization, and (c) evolutionary conservation and binding promiscuity have opposite effects on intrinsic dynamics. Findings on simulated ensembles were also validated on ensembles of experimental structures extracted from the Protein Data Bank (PDB). Additionally, the low occurrence of single nucleotide polymorphisms observed for promiscuous residues indicated a tendency to preserve binding diversity at these positions. A case study on two ubiquitin-like proteins exemplifies how binding promiscuity in evolutionary related proteins can be modulated by the fine-tuning of the interface dynamics. The interplay between promiscuity and flexibility highlighted here can inspire new directions in protein-protein interaction prediction and design methods.
The ability of proteins
to interact with
different partners is
essential to the life of the cell. Indeed, the withdrawal or damage
of proteins with a high number of connections in protein–protein
interaction (PPI) networks (hubs) is often lethal[1] or associated with disease.[2]Binding to multiple partners can be promoted by different
factors,[3−9] including multiple domains, domain repeats, multiple interaction
sites, intrinsically disordered regions, conformational changes, highly
charged surfaces, post-translational modifications, and alternative
splicing. Even if using different definitions,[4,10−12] hub proteins have been traditionally classified in
two groups, according to their tendency to use the same (“date”,
“singlish-interface,” or “sociable” hubs)
or distinct (“party”, “multiple interface,”
or “stable” hubs) interfaces to interact with different
partners. Promiscuity in binding sites that are reused in different
complexes has been often found to rely on the ability of the protein
to adopt different conformations according to the partner,[12−16] with changes ranging from local side chain reorientations to global
rearrangements and to disorder-to-order transitions.The role
of intrinsic disorder in hub proteins has been thoroughly
investigated in the past.[3,4,9,15,17] However, there are many examples of multipartner proteins with an
ordered structure, such as calmodulin, ubiquitin, or Ras proteins.
In these cases, a possibility is that, in the absence of intrinsic
disorder, promiscuity is promoted by specific dynamical properties
pre-existing in the protein isolated state. Indeed, in the last decades
an increasing number of experimental and theoretical studies[18−23] have shown that the conformational changes related with a change
in the protein state (binding to ligands, post-translational modifications,
interaction with light) can be sampled by the protein also in the
initial unperturbed state. Some recent works highlighted that in specific
cases the formation of multiple arrangements at a protein–protein
interface can be assisted by conformational fluctuations in the unbound
state in one of the partners.[16,24−27] However, only a few large-scale studies[28] of structure-based PPI networks have explicitly considered the inherent
flexibility of proteins and evidence of a general relationship between
intrinsic dynamics and binding promiscuity in ordered proteins is
still missing.A first problem in the determination of a link
between intrinsic
dynamics and binding promiscuity is that, while classifications in
PPI networks are usually made at the protein level (e.g., hub and
nonhub), the regions in a protein that are involved in interactions
with partners (semi-interfaces or simply interfaces in the following)
can have a largely heterogeneous composition in terms of different
properties, including mobility,[29−31] evolutionary conservation,[32] binding promiscuity,[13,33] and binding affinity.[34−36] Thus, a promiscuous protein can
be composed of both nonpromiscuous and promiscuous residues, each
with a different degree of intrinsic flexibility. Moreover, the flexibility
of a protein can be measured in terms of different space and time
scales, with previous works usually focusing on either backbone motions,[24,28,31,37] slow and highly collective, or side chain motions,[30] faster and more localized.Here, we present the first
large scale study of the intrinsic dynamical
properties of promiscuous residues, where we mapped residue-based
measures of conformational flexibility and binding promiscuity on
interfaces of a large data set of proteins. Both backbone and side
chain intrinsic motions were described with in silico generated conformational ensembles of the isolated proteins. The
analysis revealed higher flexibility levels for promiscuous residues
compared to nonpromiscuous ones. This additional flexibility was found
to be highly organized in correlated motions both on a local and on
a global scale, indicating that, when intrinsic disorder is not a
major factor, an ordered form of flexibility can take over to promote
binding promiscuity. Remarkably, the magnitude of intrinsic motions
in promiscuous residues showed a reduced dependence from evolutionary
conservation when compared with nonpromiscuous ones, providing an
unprecedented indication that binding affinity and promiscuity can
have opposite effects on residue dynamics. The functional importance
of promiscuous residues was also confirmed with an investigation of
the distribution of single nucleotide polymorphisms (SNPs) across
interface regions, which suggests that residues in promiscuous positions
have a reduced tolerance to genetic variations, related to the necessity
to preserve their binding polyvalence.
Methods
Data Set Preparation
A data set of proteins was generated
using the PiSite database[38] (Figure 1). A nonredundant list of proteins from PiSite was
used, composed of 7739 proteins with sequence identity <30% identified
by clustering the sequences of 110325 proteins from 51482 PDB entries.[38] For a given protein, PiSite collects all the
PDB entries containing the protein itself or very close homologues
(>90% sequence identity). The original protein from the nonredundant
PiSite list and its homologues define a sequence family. In the following,
we will refer to sequence families as simply families and to the original
proteins in the nonredundant PiSite list as Family Representatives
(FRs). In PiSite, all the PDB complexes containing a member of the
family are used to define the family binding properties (number of
partners and binding states), which are then mapped onto the FR. The
basic assumption is that, because of their high sequence similarity,
members of the same family have a very similar behavior in terms of
binding interactions, so that structural information on partners can
be transferred from the family members to the FR. In this way, the
partner annotation of FR is enriched compared to when only FR occurrences
in the PDB are considered, and the bias due to the incompleteness
of the PDB is reduced.[12]
Figure 1
Generation of SFull and SMD data sets. The
flowchart shows the steps followed in the generation of the SFull and SMD data sets from the nonredundant PiSite
database. The operations indicated in blue on the right were performed
on all the members of each family. The applied filters are indicated
in orange. The inset shows how the components of each family are defined
and used. See Methods in the main text for
a full description of the procedures involved.
Generation of SFull and SMD data sets. The
flowchart shows the steps followed in the generation of the SFull and SMD data sets from the nonredundant PiSite
database. The operations indicated in blue on the right were performed
on all the members of each family. The applied filters are indicated
in orange. The inset shows how the components of each family are defined
and used. See Methods in the main text for
a full description of the procedures involved.For each family in nonredundant PiSite (Figure 1), we selected the members with known UniProtKB and SCOP IDs
using the PDBSWS PDB/UniProt mapping[39] and
the SCOP IDs using the Astral SCOP database[40] (v 1.75). Only the families with unambiguous SCOP domain classification
across all the members and with at least one member with known UniProtKB
ID were retained. We then selected 251 families that satisfied the
following requirements:Their members have only one SCOP domain.The sequences of the resolved structures
in the family cover at least 75% of the corresponding UniProtKB sequences.They have at least one
partner with
known structure.There
is at least one structure in
the family with no gaps in the resolved main chain. The ungapped X-ray
structure with the best resolution in the family was selected as the
structural representative (SR) to be used in the simulations and structural
analyses. When no crystallographic structure with a complete main
chain was found, an ungapped NMR structure was selected as SR if available.The 251 families define our full data set
SFull (Supporting Information (SI)
Table S1). Each family
includes on average ∼20 members, for a total of 4917 PDB chains.The PDB and chain IDs of partners from structural complexes (or
‘structural partners’) were extracted from PiSite for
each family in SFull and mapped to UniProtKB with the PDBSWS
mapping. For consistency with the PiSite database, redundancies were
removed by clustering the partner sequences with BLASTCLUST[41] (sequence identity ≤30%, sequence coverage
≥50%) to define the total number of structural partners (npS) of the family. This was used to classify the families of
SFull as monopartner (npS = 1, for a total of 151 families) and
multipartner (npS ≥ 2, for a total of 100 families). Each of the
main SCOP classes (α, β, α/β, α+β)
was represented in both groups with at least 14 entries. The proteins
in the families have an average size of 217 (monopartner) and 167
(multipartner) residues, as measured from the SRs.Family partners
identified by nonstructural detection methods were
extracted from the IntAct[42] database using
the UniprotKB IDs of each family member. All the binary interactions
were retained. The total number of unique partners (np) of each family
was obtained using BLASTCLUST (see above) on the sequences of all
the partners extracted from either IntAct or PiSite.In the
following, we will often refer to a family as simply a protein,
with the FR as its sequence representative. Unless otherwise stated,
structural properties and simulated conformational ensembles were
calculated from the SR of the family, while binding multiplicity profiles
and ensembles of experimental structures were built-up using all the
family members (see the following and inset in Figure 1).MD simulations (see below) were performed on a subset
(SMD) of SFull composed of 6 monopartner and
6 multipartner
proteins. The SMD proteins were selected using stricter
criteria to increase the confidence of their classification, so that
multipartner proteins have at least 4 structural partners, the monopartner
proteins have both np and npS = 1, SCOP classes are equally
sampled by the two groups and the average size of the proteins in
the two groups is comparable (SI Tables S1 and
S2).A second data set (SSoc) was considered
to test and
compare the findings observed on SFull. SSoc is composed of sociable proteins[12] (SI Table S3), defined as having at least 3 structural
partners and 3 different binding states.[38] Starting from the original list of 102 sociable proteins,[12] 69 were selected so that the FR has a complete
main chain structure. The FR was used as the structural representative
SR for the simulations and structural analyses. Since the original
data set of sociable proteins was not filtered for monodomain proteins,
SSoc includes entries with more than one SCOP domain (SI Table S3).Only a limited number of
proteins in the data sets (4 in SFull and 5 in SSoc) contained post-translational
modifications (PTMs) in the PDB structure, amounting to less than
3% of the total. None of these was found in the SMD subset.
Thus, PTMs were not explicitly taken into account and residues with
PTMs were modeled using the corresponding unmodified residue. A systematic
study of the effects of PTMs on interface dynamics would require ad
hoc data sets and a standardized framework including parameters for
a large range of PTMs. A complete set of tools for the inclusion of
PTMs in MD simulations has been recently proposed.[43,44]
Interface Definition
For each SFull family f, interface residues were mapped on the SR by analyzing
all the PDB binary complexes A:B (where
A is a member of the family and B is
one of its partners) listed in the PiSite database for the selected
family members. For consistency with PiSite, the binary complexes
were extracted from the PDB biological units. The POPSComp method[45] was used to determine the profiles of normalized
solvent accessible surface area (SASA) that is buried upon complex formation for each chain A ΔSASA(i, A:B) = (SASA(i, A) – SASA(i, A:B))/ SASA(i, A), where SASA(i, A) and SASA(i, A:B)
are the SASAs of residue i in the isolated and bound
A molecule, respectively. All the ΔSASA profiles of the family were then mapped
onto the SR sequence using A–SR
sequence alignments given by the program ProFit.[46] The resulting ΔSASA(i, SR:B) values were used to define the SR interfaces. A
SR residue i was considered to be part of a SR:B
interface if upon complex formation it buries at least 10% of its
SASA, that is, if ΔSASA(i, SR:B) > 0.1. If i is found in interfaces
involving n different partners, its binding multiplicity b(i) is set to n. The
binding multiplicity was used to partition the interface residues
into three different binding classes. Residues with b ≥ 2 were classified as multipartner (cmulti), while residues with b = 1 were grouped
in the cmono class if belonging to monopartner
SRs and in the cmono_in_multi class if
belonging to multipartner SRs (SI Table S4). Because of differences in the definition of the interface (distance-based
in PiSite and SASA-based here) and of the selection process applied
in the data set generation (see Data Set Preparation), the binding multiplicity used for the SFull data set
may differ in isolated positions from the one recorded in PiSite.
To ensure full consistency in the analysis of the SSoc data
set, original b profiles from PiSite were used for
sociable proteins.There was some redundancy in the PDB biological
units of each family, so that very similar A:B complexes were observed. Interface redundancy was eliminated
by clustering the SR-mapped ΔSASA profiles
according to their overlap matrix O = (ΔSASA(i∩j) + ΔSASA(i∩j))/(ΔSASA + ΔSASA), where ΔSASA is the buried SASA
of SR in interface i and ΔSASA(i∩j) is
the surface area of the SR residues involved both in interface i and j, calculated using the coordinates
of interface i. The clustering was performed with
the complete linkage method and the optimal number of clusters was
chosen by maximizing the average silhouette[47] value s. An average smax of 0.84 (±0.19) was obtained for the SFull families,
indicating a good partitioning of the interfaces into the clusters.
The interfaces with the largest SASA within a given cluster were chosen
as representatives to generate the final nonredundant set of 695 interfaces
(SI Table S4).The interfaces were
partitioned into three classes according to
the maximum binding multiplicity of their residues and the binding
class of the corresponding proteins (SI Table
S4). Consistently with the residue classification, interfaces
containing at least one multipartner residue where defined as multipartner
(cmulti), while the remaining interfaces
were classified as cmono and cmono_in_multi if belonging to mono- or multipartner proteins,
respectively. The distributions of the total and relative hydrophobic
interface sizes for cmulti (average ΔSASA
= 1583 Å2, average ΔSASAphobr = 62%)
and cmono (average ΔSASA = 1346
Å2, average ΔSASAphobr = 61%) were
not statistically different, while the cmono_in_multi interfaces were significantly smaller (average ΔSASA = 803
Å2, Wilcoxon p-value <2 × 10–6) and
less hydrophobic (average ΔSASAphobr = 55%, Wilcoxon p-value <2 × 10–4). The values of the relative hydrophobic interface size (ΔSASAphobr) were calculated as the relative contribution
to the total ΔSASA of the interface residues classified as hydrophobic
in POPS.[48]
Interface Analysis
The physicochemical properties of
interface residues in the different binding classes were analyzed
in terms of propensities relative to all solvent exposed residues
in the corresponding protein class.[33,49] The propensity P(c) of property x in the binding class c was calculated as P(c) = p(c)/p(surf), where p(c) and p(surf) are the fraction of residues with property x in c and at the surface,
respectively. A P(c) value >1 indicates a
higher
abundance of residues with a given property in c than in the rest of the surface. Solvent-exposed
or surface residues were defined as having a relative accessibility
SASAr ≥ 15%[50] in the
SR of the family. The SASAr values were calculated normalizing
the SASA of each residue by that of the corresponding amino acid type
X in the AXA tripeptide, where X is in an extended side chain conformation[51] selected from the Dunbrack and Cohen rotamer
library[52] as implemented in PyMOL.[53] Propensities were calculated for the amino acid
identity, the evolutionary conservation grade as determined by ConSurf,[54] the DSSP[55] secondary
structure definition, the relative accessibility of the isolated protein
SASAr, and the extent of surface burial in the complex
ΔSASA. To provide
a measure of the uncertainty associated with this calculation and
to rule out the possibility that the observed propensities were biased
by a few cases, confidence intervals (CI) at 95% were calculated by
bootstrap resampling with 1000 replicates. The statistical significance
of differences between the propensities was estimated with a Student’s t test on the CI.[56,57] Resampling and statistical
analyses were performed with R.[58]Interaction hot spots were predicted using different methods, namely
ANCHOR,[59] Robetta,[60] PISA,[61] HotPoint,[62] and KFC2a/b.[63] The prediction
was performed on the nonredundant interfaces described above. The
criteria used for the hot spot definition are summarized in SI Table S5. Robetta, HotPoint, and the KFC2a/b
methods have been specifically parametrized for hot spot prediction
against experimental alanine scanning data and the default criteria
for residue classification defined in the original publications were
used. The ANCHOR server[59] has instead been
designed to scan for residues that, besides contributing significantly
to the binding energy, are also ‘anchors’ (i.e., they
have a large exposed surface in the monomeric state). In addition,
we used PISA, an established method to determine the thermodynamic
stability of protein assemblies and to distinguish biological interfaces
from artifacts of crystal packing.[61] Here,
the single contributions of the residues to the interface stability
were used to estimate their importance in the complex formation. Thresholds
on ANCHOR and PISA binding free energies were selected to obtain an
overall hot spot fraction comparable to the other methods.Hydration
scores Shyd were evaluated
for each protein in the SMD data set from the spatial distribution
of water molecules observed in MD trajectories (see below). Water
density maps g(r)[64−66] were calculated at discrete points r defined by a 0.5-Å spaced rectangular grid around the
solute. Frames saved every 0.1 ps from the last 10 ns of the trajectory
were superimposed to a reference using Cα atom positions
to remove the overall roto-translational motion of the protein. The
number density of the water oxygen atoms was then averaged at each
grid point over the MD frames and normalized by the bulk density evaluated
in the 6–8 Å shell around the solute. The hydration sites
were then identified as local maxima of the density map with g(r) > 1 and used to define the hydration score Shyd as previously described.[66] Residues with a high Shyd score were
surrounded either by a large number of maxima or by maxima with a
high density. The Shyd value of each residue i of type aa was standardized by calculating
the ratio (Shyd – μhyd)/σhyd, where μhyd and σhyd indicate
the average and standard deviation of Shyd calculated over residues of type aa.
Generation
of Conformational Ensembles
Ensembles of
conformations were generated for each SR in the SFull and
SSoc data sets with the tCONCOORD[67] method. Given a starting structure, tCONCOORD samples alternative
conformations by fulfilling a set of geometrical constraints as determined
from the initial coordinates and interaction types (e.g., covalent
bonds, hydrogen bonds, salt bridges, or hydrophobic interactions).
Under-wrapped hydrogen bonds[68] are detected
and modeled as unstable. Each SR in the data sets was first energy-minimized
using the OPLS-AA[69] force field with 250
steps of the steepest descent algorithm. Ensembles of 500 structures
were then generated using standard tCONCOORD parameters.[67,70]Molecular Dynamics (MD) simulations of the 12 SRs of the SMD data set (SI Table S2) were performed
with GROMACS 4.0.[71] The initial structures
were solvated with a cubic box of SPC water molecules, setting the
minimal distance between the protein and the box boundaries to 12
Å. Ionizable residues were modeled in their standard protonation
state at pH 7. The systems were then neutralized adding the appropriate
number of counterions. The final composition of the systems is detailed
in SI Table S6. The GROMOS-96 force field
was used with the 43a1 parameter set.[72] Periodic boundary conditions were imposed. All the bonds were frozen
and a 2-fs time step was used. The Berendsen[73] algorithm was employed for temperature and pressure regulation,
with coupling constants of 0.2 and 1 ps, respectively. The electrostatic
interactions were calculated with the particle mesh Ewald method,[74] with a 14-Å cutoff for the direct space
sums, a 1.2-Å FFT grid spacing, and a 4-order interpolation polynomial
for the reciprocal space sums. For van der Waals interactions, a 14-Å
cutoff was used. The neighbor list for noncovalent interactions was
updated every 5 steps. The systems were first minimized with 1000
steps of steepest descent. Harmonic positional restraints with a force
constant of 4.8 kcal·mol·Å–2 were imposed onto
the protein heavy atoms and gradually reduced to 1.2 kcal·mol·Å–2 in 80 ps, while the temperature was increased from 200 to
300 K at constant volume. The system was then simulated at constant
temperature (300 K) and pressure (1 bar) for 100 ps. After the removal
of harmonic restraints, 2 ns of equilibration were run in NPT conditions.
NPT production simulations were then run for 40 ns for each system.
System coordinates were saved every 0.1 ps.Gaussian Network
Model (GNM) calculations were performed by representing
each protein in SFull as a network of Cα atoms linked by harmonic springs. Each node was assumed to fluctuate
according to a Gaussian distribution around its equilibrium position,
defined by the coordinates of the starting structure. Root mean square
fluctuation (RMSF) values were derived for each Cα atom by inverting the Kirchhoff matrix[75] built-up using a unitary force constant for the springs and a 7
Å cutoff on the distance between Cα atoms.The conformational variability within each family of SFull was evaluated on the ensemble of PDB structures composed by all
the family members as defined in Data Set Preparation. The ensembles were built-up by collecting either X-ray or NMR occurrences
as listed in PiSite. Ensembles with more than one structure were found
for 241 proteins in SFull and 11 proteins in SMD. In the generation of the ensembles, the PDB structures were first
aligned with ProFit. Only the residues occurring in all the structures
were retained, so that the resulting ensembles could be analyzed as
pseudotrajectories with standard MD analysis tools. Each structure
was also labeled as bound or unbound according to its binding state
as reported in PiSite.
Analysis of Flexibility
Cα and side
chain root-mean-square fluctuations (RMSF) of the residues from their
average position were calculated for the generated tCONCOORD, MD,
and PDB ensembles. Cα RMSF profiles were determined
after removal of the overall roto-translational motion by best-fit
superposition of the structures to all the Cα atoms
of the starting experimental structure (SR) used as a reference. For
side chains, the RMSF values were calculated after removal of the
main chain roto-translation of the single residues,[30] using the experimental structure as a reference for the
best-fit superposition.The comparison of flexibilities between
different proteins and types of ensemble was performed using standardized
Cα and side chain RMSF profiles Z-score(i) = (RMSF(i) – μ)/σ, where RMSF(i) is the RMSF of residue i in
protein j, while μ and σ are the average and standard
deviation of the RMSF distribution.Relevant rearrangements on protein interfaces could also arise
from subtle conformational changes of local structures.[23,76] These changes are often difficult to detect by traditional flexibility
analysis and require the isolation of local dynamics from the global
motion.[23] To this end, the dynamics of
local structures in the tCONCOORD and MD ensembles was analyzed with
a fragment-based approach. It was previously shown that local conformational
changes and their correlation are effectively described by means of
a Structural Alphabet (SA) including prototypical geometries of backbone
fragments.[23,77] In the present study, the M32K25
SA[77] was used. This SA comprises 25 representative
fragments of 4 consecutive Cα atoms, and it was specifically
designed to include the most typical local structures, as well as
to correctly describe conformational transitions sampled by molecular
simulations. Each fragment in the SA is labeled with a letter representing
a prototypical conformational state. Therefore, any 4-residue-long
segment in a protein structure can be annotated with a letter. The
labeling is performed by identification of the most similar SA fragment
in terms of root-mean-square deviation (RMSD)[23,77,78] between the protein segment and the letter.
The conformation of a protein of n residues can be
encoded into a structural string of length n –
3.[79] Following this procedure an ensemble
of conformations is condensed to an alignment of structural strings.
A column of this alignment summarizes the conformational states sampled
by a protein segment within the simulation.The correlation
of conformational changes in a pair of protein
segments (i,j) can be calculated
as normalized Mutual Information (MI) between the associated columns
in the alignment (eq 1):where C and C are the
relevant columns in the structural string alignment, I(C;C) is the MI, H(C,C) is the joint entropy,[80] and ε(C;C) is the expected finite size
error.[81] It was previously demonstrated
that local correlated motions are instrumental for allosteric signal
transmission and may be involved in conformational changes of interacting
interfaces.[23] To this end, the distribution
of statistically significant local correlations was compared for surface
and interface fragments in the three binding classes. Statistically
significant correlations were identified as previously reported.[23]To analyze the flexibility of single 4-residue
fragments, a fragment
RMSF was calculated by defining n–3 sliding
windows or fragments of 4 adjacent Cα atoms. For
all the structures in the ensembles, each fragment was superimposed
onto the reference experimental structure SR independently from the
rest of the protein, to remove local roto-translational motions. The
fragment RMSF was then calculated as the quadratic mean of the RMSF
values of each Cα within the window.[77]The correlation between single interfaces and global
motions was
investigated with the Functional Mode Analysis (FMA).[82] Given a functionally relevant property, FMA can be used
to identify the protein collective motion that is maximally correlated
with it (maximally correlated motion or MCM). This is usually expressed
as a linear combination of principal components (PCs) derived from
a principal component analysis (PCA)[83] of
the system trajectory. We selected the FMA approach where the coefficients
of the linear expansion β are determined
by maximizing the Pearson correlation coefficient between the time
evolution of the functional property F(t) and the projection of the trajectory onto the MCM. In this case,
the variance of F during the dynamics var(F) can be approximated as[82] var(F) ≈ Σ β var(pc), where pc is the projection of the trajectory onto the ith PC and l is the number of PCs considered
in the MCM expansion. The relative contribution of PC to var(F) was thus evaluated as
pvar = β2 var(pc)/var(F).The interface radius of gyration R was considered
in
this study as a functional quantity related to the overall shape of
the interface. R was calculated over the Cα atoms of the interface residues for each nonredundant interface
in SFull. Using an alternative property to describe the
interface geometry, namely the distance RMSD calculated over all the
possible pairs of Cα atoms in the interface,[84] did not affect the conclusions described in Results. The MCM was expanded in the essential space
of all Cα atoms, composed of the first l PCs accounting for the 90% of the total Cα fluctuation.
An average essential space size of 15 (±7) PCs was observed in
the whole data set. The first 450 structures of the tCONCOORD ensemble
were used for the PCA analysis and as training set for the construction
of the linear model. The remaining 50 structures were used for cross-validation.[82] On average, the optimized Pearson correlation
coefficient between the MCM and the R was 0.81 (±0.20)
and 0.79 (±0.23) for the training and cross-correlation sets,
respectively. This supports the validity of the linear model for the
MCM and rules out the possibility of overfitting in the determination
of the MCM. Since the motion along the MCM can be restricted in the
underlying energy landscape,[82] while the
PCs in the essential space represent the directions along which the
protein motion has the largest amplitude, we also analyzed the single
PC contributions pvar (see above) to
the overall R variance. The number nPC20 of distinct PCs with pvar ≥ 20% (deviating from the average
value by ∼1.6 σ) was determined. Different choices of
the pvar threshold produced qualitatively similar results.The
comparison of the conformational spaces sampled by the MD,
tCONCOORD and PDB ensembles was performed by calculating the normalized
overlap[85] of the Cα covariance
matrices aswhere
A and B are the matrices to compare,
tr is the trace operator and d(A,B) is the matrix difference: d(A,B)
= (tr[(A1/2 – B1/2)2])1/2. The overlap ranges from 0 (no overlap) to 1 (identical
matrices). The cumulative overlap between two sets of PCs was calculated
as the average of the squared inner products between all the possible
pairs of PCs from the two sets.[86] The cumulative
overlap is 1 if the space spanned by the two sets is identical.Estimates of per–residue configurational entropy values
in the MD ensembles were obtained using the heuristic Schlitter’s
formula[87,88]where kB is the
Boltzmann’s constant, T is the temperature, e is the Euler’s number, ℏ is the Plank’s
constant divided by 2π, M is a diagonal matrix with the atomic
masses, and σ is the covariance matrix. Entropy contributions
from the overall protein roto-translation were removed by superimposing
each frame in the MD trajectories to the reference experimental structure
using all the Cα atoms. For each residue, the covariance
matrix σ was then calculated considering all the atoms in the
residue or only the main chain atoms. In the all-atom case, to take
into account differences in the side chain length, entropy values
were normalized by the total number of atoms in the residue and multiplied
by 10 (average number of atoms in a residue considering heavy atoms
and polar hydrogen atoms). It is to be noted that the Schlitter’s
formula is an approximation to the upper bound of the entropy. Moreover,
the decomposition into residue contributions is not exact. Indeed,
when applying eq 3 to single residues the correlation
between the residue and the rest of the protein is not explicitly
taken into account.[88]Flexibility
analyses were performed on all frames of the tCONCOORD
and PDB ensembles and on MD frames saved every 1 ps.
Comparison
of Conformational Ensembles
Most of the
findings presented in Results are based on
the analysis of tCONCOORD conformational ensembles. The tCONCOORD
method can be considered as a relatively fast method to explore the
conformational landscape of a protein. As explained above, tCONCOORD
generates alternative conformations of a protein by satisfying the
constraints derived from a starting structure. An all atom representation
of the protein is used, and even if energetic terms are not directly
involved in the generation of structures, the parameters defining
the upper and lower bounds of the constraints depend on the type of
interaction represented by each constraint. These parameters were
originally determined on the basis of MD simulations.[89] Subsequently, they were modified and combined with an estimate
of hydrogen bond stability to increase the portion of conformational
space explored by the ensemble.[67] Compared
to MD simulations, tCONCOORD lacks an explicit representation of solvent
and long-range interactions, and it does not provide direct time or
energetic information on the simulated processes. However, it is not
affected by convergence issues, and it can rapidly cover a large part
of the accessible conformational space. Indeed, tCONCOORD ensembles
have been shown to be in general representative of the structural
variability of a protein on the basis both of MD simulations and experimental
data.[67,89−91] In particular, they
can be effectively used as predictors of the relative flexibility
of protein residues, to distinguish rigid regions from flexible ones.
Moreover, large conformational transitions between different states
of a protein (e.g., from the open unbound to a close bound-like conformation
of calmodulin[67]) can be successfully reproduced
starting from a single state.To test the dependence of the
main results of the paper on the specific method used for the generation
of the ensembles, we considered three more sources of information
on the conformational variability of the proteins. First, we performed
short (40 ns) equilibrium MD simulations in explicit water for 12
proteins (SMD data set) selected from SFull (see Data Set Preparation). These proteins were selected
so that the main conclusions can
be tested on them; thus, (1) they are representative of the variability
of the original data set in terms of binding properties and fold,
(2) their partner annotation is the most reliable among the SFull proteins, and (3) they do not have a bias in terms of
size or secondary structure composition. With respect to tCONCOORD,
ensembles from short MD simulations can be expected to provide a more
accurate representation of the thermal fluctuations around the starting
state, but they are less likely to sample the conformational space
farther from it, particularly when high-energy barriers are involved.As a third method, we used the GNM method[75,92] to calculate equilibrium fluctuations around the SR structures from
the complete SFull data set. Elastic Network Models are
often used when large data sets are involved because of their reduced
computational cost.[28,93,94] Differently from tCONCOORD and MD, the GNM adopts a reduced, coarse-grained
representation of the protein, where usually only Cα atoms are considered. Their interactions are described by harmonic
potentials, so that a Gaussian distribution of the fluctuations around
the equilibrium positions is assumed. Despite its simplicity, the
GNM has been found to accurately describe the protein global motions
and in particular the equilibrium fluctuations in the neighborhood
of the starting structure.[95−97]At last, the conformational
variability actually observed in experimental
structures was analyzed by collecting all the PDB occurrences of a
given protein or of close homologues (see Generation
of Conformational Ensembles). These collections of PDB structures
(PDB ensembles) can contain useful information on the protein conformational
variability.[98] Indeed, flexibility indices
from PDB ensembles have been compared in the past with RMSF values
from simulations.[67,98] It is to be noted that PDB ensembles
can contain structures solved with different techniques (X-ray and
NMR), in different experimental conditions and in different binding
states. Thus, in addition to conformational changes related to the
protein intrinsic flexibility, PDB ensembles can include changes induced
by external factors (e.g., binding to other proteins or ligands) or
due to the different experimental conditions. Moreover, they can be biased by experimental
errors and by the fact that the PDB covers only a fraction of the
known states of a protein.For each of the described ensembles,
it is possible to calculate
indices measuring the extent of the structural change sampled by each
residue within the ensemble (Analysis of Flexibility). As explained above, flexibility indices from different ensembles
include different types of contributions and derive from sampling
of conformations on different space and time scales. In particular,
the absolute magnitude of the structural changes observed in the selected
ensembles can be expected to be different. For example, the comparison
of the pairwise RMSD distributions from tCONCOORD and MD ensembles
of SMD proteins (SI Table S7) suggests that on average larger portions of the conformational
space are sampled by tCONCOORD than by MD simulations.The relationship
between binding promiscuity and flexibility indices
derived from the different ensembles will be discussed in detail in
the Results. However, the findings presented
in this paper depend mostly on the relative order of flexibility of
the residues in a protein and on the shape of the collective modes
of motion along which the protein is most free to move. Hence, it
is useful to briefly report the direct comparison of the order of
flexibility predicted by the different ensembles and the calculation
of the overlap between the PCs extracted from them.The calculation
of the correlation coefficients between the RMSF
profiles shows a generally good agreement among the tCONCOORD, MD,
and GNM ensembles in the SMD database (SI Table S8), with average correlations ranging from 0.69
(MD/GNM) to 0.75 (Cα tCON/MD). These values are in
line with previous comparisons performed on different proteins.[89,97,99,100] A good agreement between tCONCOORD and MD simulations was also found
for side chains fluctuations, with an average correlation coefficient
of 0.71. Moreover, when the comparison between GNM and tCONCOORD Cα RMSF profiles is extended from the SMD to
the SFull data set, similar values of average correlation
coefficients are obtained (0.70 for SMD and 0.75 for SFull), indicating that SMD is not biased toward
cases with high correlation values. As expected considering the higher
heterogeneity of PDB ensembles, smaller values were obtained for the
average correlation values between simulated and PDB ensembles, ranging
from 0.47 for GNM/PDB in SFull to 0.63 for MD/PDB side
chain profiles in SMD.The conformational spaces
sampled by the tCONCOORD and MD ensembles
of SMD were also compared (SI Table
S9) by calculating:The normalized overlap between the
overall covariance matrices of Cα atoms (CovMatOver,
see eq 2). This value depends on the similarity
both of the shape of the collective modes of motion (eigenvectors
of the covariance matrix) and of the amplitude of the fluctuations
along these motions (eigenvalues).The cumulative overlap between the
essential subspaces spanned by the first 15 PCs (CumOver15). A value
of 15 was chosen for CumOver15 since it is the average size of the
essential space used for the FMA analysis of tCONCOORD ensembles (Analysis of Flexibility section). This overlap
index depends uniquely on the covariance matrix eigenvectors.The maximum inner product
between
selected pairs of PCs from the compared ensembles (MaxInpr). MaxInpr
was calculated to measure how well a PC from the tCONCOORD essential
space can reproduce any of the first 3 PCs from MD. Similarly to CumOver15,
this index does not depend on the covariance matrix eigenvalues.The comparison of tCONCOORD and MD covariance
matrices (CovMatOver)
indicates a partial overlap between the two types of ensembles, with
an average value of 0.339 (SI Table S9).
The observation of higher values for CumOver15 (average = 0.474) and
MaxInpr (average = 0.566) indicates a better agreement in the shape
of the collective motions (defined by the relative amplitude and direction
of Cα motions) than in the absolute amplitude of
the fluctuations along them. These results are in line with previous
observations.[89]The MaxInpr index
was used also to compare the first 3 PCs from
the PDB ensembles with the PCs from the essential space of tCONCOORD
and MD ensembles. Smaller values were obtained for both the PDB/tCON
(average = 0.421) and the PDB/MD (average = 0.373) pairs compared
to the MD/PDB ones (average = 0.566), indicating that collective motions
in the simulated ensembles are more similar to each other than to
the principal modes of structural change observed in the PDB ensembles.
The larger values found in general for PDB/tCON pairs of PCs than
for PDB/MD ones indicate that PDB PCs are better reproduced by tCONCOORD
ensembles for the set of proteins studied here. This might be related
to a more complete coverage of large conformational changes by tCONCOORD
than by MD simulations (SI Table S7).
Analysis of SNPs
Human homologues of proteins in the
SFull and SSoc data set were identified running
NCBI-BLAST v.2.2.26+[101] (cutoff E-value
= 1 × 10–2) against a self-compiled library
composed of 2583 human proteins with solved or homologous structures,
annotated partner interactions and SNP mapping. Information on nonsynonymous
SNPs was retrieved from the dbSNP database (build 315, ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/b135_archive/organism_data/b135_SNPContigLocusId_37_3.bcp),[102] while disease-related SNPs (DisSNPs)
were extracted from the Online Mendelian Inheritance in Man (OMIM)
database (www.omim.org/downloads);[103] therefore, only genetically inherited diseases variations
are considered here. SNPs and DisSNPs were mapped back on the original
sequences of the SFull and SSoc proteins using
the BLAST sequence alignment. This procedure produced SNP annotation
for 38 proteins of SFull and 25 proteins of SSoc.Per-protein propensity values were calculated for each binding
class and for each protein with the formulas described in Interface Analysis, using as reference the SNPs
and DisSNPs abundance found at the surface of the single proteins.
Results
A data set of 251 monodomain proteins (SFull) was extracted
from the PDB and partitioned into 151 monopartner and 100 multipartner
proteins using the PiSite database[38] (Methods). The composition of the data set in terms
of 7 general protein function categories (SI Figure
S1) was obtained using a functional annotation of SCOP superfamilies.[104,105] As expected,[2,12,33] monopartner proteins (cyan) showed an enrichment in the metabolism
and general categories, while multipartner proteins (magenta) were
particularly rich in the categories related to extra- and intracellular
processes, information, and regulation.Interface residues for
each protein in SFull were extracted
from its PDB complexes and partitioned into binding classes according
to their binding multiplicity b (Methods). In particular, two classes cmono and cmono_in_multi were defined
for monopartner residues (b = 1) belonging to monopartner
and multipartner proteins, respectively. Examples of both types of
residues are given in Figure 2A, left (cmono) and middle (cmono_in_multi) panels. Residues with b ≥ 2 were assigned
to the cmulti class (Figure 2A, right panel). A total of 12 622 interfaces residues
were found, with cmono, cmono_in_multi, and cmulti accounting
for 54, 36, and 10%, respectively, of the overall population (SI Table S4). The results obtained on the SFull data set were compared with those from a second smaller
data set SSoc, a collection of highly promiscuous proteins
(Methods). As for SFull, the 4690
interface residues of SSoc were partitioned into cmono_in_multi (59%) and cmulti (41%) residues (SI Table S4).
Figure 2
Physicochemical properties of interface residues in the cmono, cmono_in_multi, and cmulti binding classes. (A) Examples
of cmono (left), cmono_in_multi (middle), and cmulti (right) interfaces. The residues involved in each interface are
color-mapped onto the surface in cyan (cmono), blue (cmono_in_multi), and orange
(cmulti). Selected partners are represented
as cartoon. Left: PDB ID 1x8d, chains D/C. Middle: PDB ID 1pp9, chains S/B. Right:
PDB ID 1b6c,
chains E/F. (B–E) Different types of calculated propensities
relative to the surface of cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues of SFull. Error bars
and significance levels were estimated with bootstrap resampling.
Stars are drawn above cmono_in_multi and cmulti distributions, indicating the significance
levels (*** p-value <0.001, ** 0.001 ≤ p-value <0.01,
* 0.01 ≤ p-value <0.05) of the comparison with cmono (cyan) and cmono_in_multi (blue). (B) Amino acid propensity Paa. (C) Conservation propensity Pcons.
The conservation is expressed as ConSurf conservation grade (Gcons, ranging from 1, less conserved, to 9,
most conserved). (D) DSSP secondary structure (SS) propensity PSS. The three SS groups collect positions annotated
as helix (H = “H” + “G” + “I”
in the DSSP dictionary), strand (E = “E” + “B”)
and loop (L = blank + “S” + “T”). (E)
Relative solvent accessibility (SASAr) propensity PSASAr. Three levels of SASAr are considered, with SASAr < 20%, 20% ≤ SASAr < 50%, and SASAr ≥ 50%. (F) Fraction
of SFull interface residues with a small (<20%), medium
(20–50%) or large (≥50%) normalized buried SASA (ΔSASA). The maximum ΔSASA value is used for residues
involved in more than one complex. (G) Standardized hydration score Shyd in the SMD binding classes.
Physicochemical properties of interface residues in the cmono, cmono_in_multi, and cmulti binding classes. (A) Examples
of cmono (left), cmono_in_multi (middle), and cmulti (right) interfaces. The residues involved in each interface are
color-mapped onto the surface in cyan (cmono), blue (cmono_in_multi), and orange
(cmulti). Selected partners are represented
as cartoon. Left: PDB ID 1x8d, chains D/C. Middle: PDB ID 1pp9, chains S/B. Right:
PDB ID 1b6c,
chains E/F. (B–E) Different types of calculated propensities
relative to the surface of cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues of SFull. Error bars
and significance levels were estimated with bootstrap resampling.
Stars are drawn above cmono_in_multi and cmulti distributions, indicating the significance
levels (*** p-value <0.001, ** 0.001 ≤ p-value <0.01,
* 0.01 ≤ p-value <0.05) of the comparison with cmono (cyan) and cmono_in_multi (blue). (B) Amino acid propensity Paa. (C) Conservation propensity Pcons.
The conservation is expressed as ConSurf conservation grade (Gcons, ranging from 1, less conserved, to 9,
most conserved). (D) DSSP secondary structure (SS) propensity PSS. The three SS groups collect positions annotated
as helix (H = “H” + “G” + “I”
in the DSSP dictionary), strand (E = “E” + “B”)
and loop (L = blank + “S” + “T”). (E)
Relative solvent accessibility (SASAr) propensity PSASAr. Three levels of SASAr are considered, with SASAr < 20%, 20% ≤ SASAr < 50%, and SASAr ≥ 50%. (F) Fraction
of SFull interface residues with a small (<20%), medium
(20–50%) or large (≥50%) normalized buried SASA (ΔSASA). The maximum ΔSASA value is used for residues
involved in more than one complex. (G) Standardized hydration score Shyd in the SMD binding classes.Simulated ensembles of accessible
conformations were generated
for each isolated protein in SFull and SSoc using
the tCONCOORD method (Methods). To test the
dependence of results from the method used for the generation of the
ensembles, we performed also GNM calculations on all SFull proteins and Molecular Dynamics (MD) simulations on selected cases
(SMD data set, Methods). The intrinsic
flexibility of the residues was described measuring the RMSF from
average positions either of the Cα or of the side
chain atoms. Correlations in residue motions were calculated using
either Cα Cartesian coordinates, providing collective
motions,[83] or the fragment encoding from
a Structural Alphabet (SA), providing local correlated motions[23] (Methods). The same
analyses were performed on the ensemble of experimental structures
generated by collecting all the occurrences of a given protein in
the PDB (Methods).The results section
is organized as follows. A first characterization
of the physicochemical properties of interface residues will be followed
by the presentation of the central results on intrinsic flexibility.
In particular, the distribution of flexibility among the different
binding classes, its dependence from evolutionary conservation, and
the relationship between correlated motions and interface shape modulation
will be discussed. The results obtained from the simulated ensembles
will then be compared with the conformational variability found in
the experimental structures. The findings from the investigation of
the relationship between binding promiscuity and SNPs occurrence will
be then introduced. The section will be closed by a case study exemplifying
the general conclusions.
Multipartner Residues Have Distinctive Physicochemical
Properties
In this section, we will characterize the residues
in the three
binding classes according to different physicochemical properties.
In particular, amino acid identity, evolutionary conservation, solvent
accessibility, and secondary structure were determined and compared
with results from different data sets.[2,6,8,12,13,33]The overall interface composition
showed a previously documented[33,51,106] enrichment with respect to the surface in large aromatic (F, W,
Y), hydrophobic (M, A, I, L, V) and specific polar/charged (C, Q,
R) amino acids (Figure 2B). When compared to
the cmono and cmono_in_multi classes (cyan and blue), cmulti residues
(orange) were found to be richer in the polar/charged Q, D, and R
and, less significantly, in the aromatic F and Y amino acids. On the
other side, cmulti turned out to be particularly
poor in hydrophobic amino acids (M, A, I, L, V). These findings are
consistent with the more pronounced polar character generally found
in hub interfaces.[2,6,12] Moreover,
the polyvalence of the Q amino acid (which can be either an hydrogen
bond donor or acceptor) and the long or bulky side chains of R, Y,
and F are particularly suitable to adapt to different interfaces with
different local arrangements.[4,6,8] The analysis on SSoc (SI Figure S2A) confirmed the cmulti enrichment in
Q, R, F, and Y, together with a preference, specific to the sociable cmulti residues, for the I and M amino acids.As expected,[7] the conservation propensity Pcons indicated that interface regions have a
higher abundance of conserved residues than the overall surface (Figure 2C). Indeed, Pcons was
>0 for ConSurf[54] conservation grades
(Gcons) > 5 in all the binding classes. Interestingly,
the cmulti group, while poorer in residues
with intermediate
conservation (Gcons = 7, 8) was found
to be richer in highly conserved residues (Gcons = 9) than the cmono and cmono_in_multi groups, both for the SFull (Figure 2C) and the SSoc (SI Figure S2B) data set. No large deviations
from the average surface values were observed for the abundance of
secondary structure elements in the three interface residue classes
of SFull (Pss ∼ 1 in
Figure 2D). However, when compared with the
two cmono (cyan) and cmono_in_multi (blue) classes, cmulti residues showed a tendency to be richer in loops and poorer in strands.
A similar preference for loops has been found for ‘overlapping
regions’ in date hub proteins.[13] These regions are conceptually close, even if differently defined,
to our multipartner residues. No significant differences in the secondary
structure composition were found between cmono_in_multi (green) and cmulti (red) residues in
the SSoc data set (SI Figure S2C), which both showed a general enrichment in α-helices with
respect to the surface.PSASA (Figure 2E) gives the propensity of an interface residue
to expose a small
(<20%), medium (20–50%), or large (>50%) fraction of
its
solvent accessible surface area (SASA, see Methods), as measured on the isolated protein. Interestingly, very exposed
residues (>50%) were found to be particularly frequent in the cmulti group, as compared with both the cmono and cmono_in_multi groups, either for the SFull (Figure 2E) and the SSoc (SI Figure
S2D) data set. Similarly, if the fraction of SASA buried upon
complexation is considered (Figure 2F), a higher
proportion of residues with a large relative ΔSASA (>50%) was found for cmulti residues
than cmono and cmono_in_multi ones. An analogous propensity for burying large
portions of SASA has been found in overlapping regions of date hub
proteins.[13]For the proteins in the
SMD subset of SFull, it was possible to evaluate
the hydrophilicity (Shyd, Methods) of the residues
from the distribution of water molecules around the solute in the
MD simulations. To eliminate the dependence from the amino acid identity,
each Shyd value was standardized against
the Shyd distributions of the corresponding
amino acid type. Even if a reduced statistical significance was observed,
probably due to the smaller size of the SMD data set, the
comparison of the Shyd Z-scores showed
that monopartner residues in multipartner proteins tend to be less
hydrated (Figure 2G) than the other two classes.
Indeed, the presence of water molecules has been observed in PDB complexes
involving multipartner interfaces.[107] Our
findings would indicate that the tendency to strongly coordinate water
molecules is mainly due to the promiscuous part of the interface.To summarize these results, we found that multipartner residues
have characteristic physicochemical features that distinguish them
from monopartner residues belonging either to mono- or multipartner
proteins. In particular, they tend to be richer in specific charged/polar
(R, Q, D) and aromatic (Y, F) amino acids and in loops, and to be
more conserved, more solvent exposed in the isolated protein, and
more buried in the complex. Moreover, within multipartner proteins,
multipartner residues seem to be preferentially hydrated compared
to monopartner ones.
Multipartner Residues Have a Higher Intrinsic
Conformational
Flexibility
In the following sections, we will analyze the
correlation between intrinsic conformational flexibility and binding
multiplicity. We measured the intrinsic flexibility of SFull residues in terms of RMSF Z-score values of either Cα or side chain atoms. The distribution of flexibility values observed
for cmulti residues in tCONCOORD ensembles
was compared with that of cmono and cmono_in_multi residues (averages are reported
in Figure 3A and B, light colors). All the
pairwise comparisons indicated a significant difference between the
distributions (p-values <0.01), with an average flexibility order cmulti > cmono > cmono_in_multi for both Cα (Figure 3A) and side chain (Figure 3B) RMSF values. A similar analysis on the SSoc data set
(SI Figure S3A and B) confirmed that multipartner
residues (red) in sociable proteins are on average more flexible than
the monopartner ones (green).
Figure 3
Conformational flexibility of interface residues
in SFull and SMD. (A/B) Average Cα (A) and side
chain (B) RMSF Z-scores (tCONCOORD) calculated over cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues
in the SFull data set. The standard error of the mean is
represented with an error bar. The significance levels from pairwise
Wilcoxon comparison tests are reported with a star code (see Figure 2 legend). Stars are drawn above cmono_in_multi and cmulti bars
indicating the significance levels of the comparison with cmono (cyan) and cmono_in_multi (blue). Averages and significance levels calculated considering
only the residues with the highest ConSurf conservation grade (Gcons = 9) are also reported in dark colors.
(C) Dependence of tCONCOORD average Cα RMSF Z-scores
(dots) from evolutionary conservation for cmono (cyan), cmono_in_multi (blue), cmulti (orange) residues. Residues are partitioned
into 9 groups according to their ConSurf conservation grade (Gcons). A best-fit linear regression is also
reported for each binding class. (D/E) Average Cα RMSF Z-scores calculated from tCONCOORD (D) and MD (E) ensembles
over highly conserved (Gcons = 9) cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues
in the SMD data set. The standard error of the mean is
represented with an error bar. The significance levels from pairwise
Wilcoxon comparison tests are reported with a star code (see Figure 2 legend).
Conformational flexibility of interface residues
in SFull and SMD. (A/B) Average Cα (A) and side
chain (B) RMSF Z-scores (tCONCOORD) calculated over cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues
in the SFull data set. The standard error of the mean is
represented with an error bar. The significance levels from pairwise
Wilcoxon comparison tests are reported with a star code (see Figure 2 legend). Stars are drawn above cmono_in_multi and cmulti bars
indicating the significance levels of the comparison with cmono (cyan) and cmono_in_multi (blue). Averages and significance levels calculated considering
only the residues with the highest ConSurf conservation grade (Gcons = 9) are also reported in dark colors.
(C) Dependence of tCONCOORD average Cα RMSF Z-scores
(dots) from evolutionary conservation for cmono (cyan), cmono_in_multi (blue), cmulti (orange) residues. Residues are partitioned
into 9 groups according to their ConSurf conservation grade (Gcons). A best-fit linear regression is also
reported for each binding class. (D/E) Average Cα RMSF Z-scores calculated from tCONCOORD (D) and MD (E) ensembles
over highly conserved (Gcons = 9) cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues
in the SMD data set. The standard error of the mean is
represented with an error bar. The significance levels from pairwise
Wilcoxon comparison tests are reported with a star code (see Figure 2 legend).The intrinsic flexibility of interface residues in the different
classes was compared also using alternative conformational ensembles
(Methods). In particular, equilibrium fluctuations
around the native structure were evaluated using full-atom MD simulations
and coarse-grained GNM calculations. Remarkably, multipartner residues
showed the largest average mobility among the binding classes for
all the considered ensembles and flexibility indices (Cα, side chain and fragment RMSF), in both the complete SFull data set and the SMD subset (SI Figure
S4). Also, the relative flexibility order of the cmono and cmono_in_multi residue
groups was generally preserved, except for Cα flexibility
in SMD. In many cases, similar significance levels in the
distribution comparison were obtained from the different ensembles.
This indicates that, despite the differences in the methods used for
the generation of the ensembles, in all of them, multipartner residues
have flexibility properties distinct from the monopartner ones.To assess the possible impact of these flexibility differences
on binding energetics, the configurational entropy of single residues
was estimated with the Schlitter’s formula (Methods) applied
to the MD ensembles of multipartner proteins in SMD (SI Table S10). The per-residue entropy term TS
(where T = 300 K and S is calculated according to
eq 3) of multipartner residues was on average
higher than monopartner ones by 1.04 kcal/mol for the whole residue
(considering an average residue size of 10 atoms) and ∼0.44
kcal/mol for the residue main chain, with maximum differences of 1.52
and 0.65 kcal/mol, respectively (SI Table S10). It has to be noted that the Schlitter’s formula gives only
an approximate estimate of the entropic term (Methods). Moreover, accurate measures of the impact of flexibility on the
binding free energy would require that protein partners are explicitly
taken into account in the calculation.A further measure of
the relationship between flexibility and binding
was obtained by calculating the correlation coefficient between profiles
of tCONCOORD RMSF and binding multiplicity b for
each protein in SFull (SI Figure S5). The observation of large correlation coefficients in this analysis
would require not only a difference in flexibility between mono- and
multipartner residues but also that, within multipartner residues,
higher b values correspond to higher RMSF values.
This makes this test more stringent than the previous one. Indeed,
small average correlations were found for both mono- and multipartner
proteins, using either Cα (SI
Figure S5A) or side chain (SI Figure S5B) RMSF values. However, the distribution of multipartner proteins
(magenta) was shifted toward significantly higher values than monopartner
ones (cyan) in both cases, with some multipartner proteins showing
correlation coefficients as high as 0.6.The presented results
show that, compared to monopartner residues,
multipartner residues have an average ‘excess’ of intrinsic
flexibility, which could be used by the residues to adapt to the different
environments provided by the different partners when binding occurs.
In the subsequent section, we will further refine this analysis by
focusing on the interface residues that are most important for the
interaction with the partner.
Opposite Effects of Evolutionary
Conservation and Binding Promiscuity
on Intrinsic Flexibility
Previous studies highlighted an
increased rigidity for hot spots[30,31,108] and in general for evolutionary conserved interface
residues, which have been found to prefer preorganized bound-like
conformation even in the unbound state.[29] Thus, we analyzed the relationship between evolutionary conservation
and conformational flexibility in our data set, to highlight possible
differences in the behavior of the three residue binding classes.A small but significant anticorrelation between flexibility from
tCONCOORD ensembles and conservation was found by plotting the average
Cα RMSF Z-scores of interface residues in SFull against the ConSurf conservation grade value Gcons (Figure 3C). A best-fit linear
regression showed that this dependence is slightly more pronounced
for the cmono class (cyan), producing
an increased difference between cmono and cmulti Cα RMSF distributions
when only the most conserved residues are compared (dark cyan and
dark orange bars in Figure 3A). The same slope
was found for the cmono_in_multi (blue)
and cmulti (orange) linear models (Figure 3C), so that cmono_in_multi average Cα flexibility was consistently smaller
than cmulti for all the Gcons values. Similar findings, but with reduced differences
between the residue binding classes, were observed when considering
side chain RMSF values (Figure 3B and SI S3F).To check if these results are
affected by the method used to generate
the conformational ensembles, we compared tCONCOORD and MD RMSF distributions
for the most conserved residues in the three binding classes of the
SMD subset (Figure 3D and E). A
fully consistent picture was obtained from the two methods, confirming
the higher average flexibility of the most conserved multipartner
residues (orange) with respect to the monopartner ones (cyan and blue).
Moreover, the generality of the results was checked on the SSoc data set (SI Figure S3C/E), where cmulti residues (red) showed an even smaller
dependence of Cα RMSF from the conservation grade
than all the other classes.We further restricted our analysis
to hot spot interface residues,
that is, the residues that contribute most to the binding energy upon
complex formation.[7,34] Owing to the limited availability
of experimental information on hot spots, especially for multipartner
proteins,[109] we identified candidate hot
spots using different prediction methods on the nonredundant set of
interfaces of SFull. For each protein and each method,
a given residue was classified as hot spot if it satisfies the method
criteria (Methods and SI Table S5) in at least one of the interfaces in which it
is involved. In agreement with previous findings,[35,109] the majority of multipartner hot spots (from 79 to 88% depending
on the specific method) was predicted as such only in one interface
(SI Table S11), indicating that hot spots
are partner-specific.[35] Hot spot populations
generated by different predictors had only a partial overlap (SI Figure S6), as expected from the differences
in the strategies adopted by the different methods (Methods). In spite of this, the comparison of the conformational
flexibility of mono- and multipartner hotspots produced surprisingly
consistent results (Figure 4). Indeed, the
highest average tCONCOORD flexibility was observed for cmulti hot spots (orange) in almost all the cases for both
Cα atoms (Figure 4A) and side
chains (Figure 4B).
Figure 4
Conformational flexibility
of hot spots. Average Cα (A) and side chain (B) RMSF
Z-scores (tCONCOORD) of predicted hotspots
in the three binding classes cmono (cyan), cmono_in_multi (blue), and cmulti (orange) of the SFull data set. The standard
error of the mean is represented with an error bar. The significance
levels from pairwise Wilcoxon comparison tests between the RMSF distributions
are reported with a star code (see Figure 2 legend). For a given hotspot prediction method, stars are drawn
above cmono_in_multi and cmulti bars indicating the significance levels of the comparison
with cmono (cyan) and cmono_in_multi (blue).
Conformational flexibility
of hot spots. Average Cα (A) and side chain (B) RMSF
Z-scores (tCONCOORD) of predicted hotspots
in the three binding classes cmono (cyan), cmono_in_multi (blue), and cmulti (orange) of the SFull data set. The standard
error of the mean is represented with an error bar. The significance
levels from pairwise Wilcoxon comparison tests between the RMSF distributions
are reported with a star code (see Figure 2 legend). For a given hotspot prediction method, stars are drawn
above cmono_in_multi and cmulti bars indicating the significance levels of the comparison
with cmono (cyan) and cmono_in_multi (blue).The results presented in this section show that binding promiscuity
has a counteracting effect on the loss of flexibility expected for
more conserved positions. As a consequence, the preference of promiscuous
residues for a higher mobility is even more pronounced when considering
only the residues that are more likely to be determinant for the binding
to the partner.
Multipartner Proteins Use Different Global
Motions to Modulate
Different Interfaces
In the previous sections, we analyzed
the flexibility of residues in terms of their equilibrium fluctuations
from the average position in the simulated ensembles. Here, we extend
this investigation to the correlation of these motions, to highlight
possible differences between monopartner and multipartner interfaces.
In particular, we aimed at detecting to which extent global motions
modulate the interface dynamics in the different classes.For
each protein in SFull, a set of nonredundant interfaces
was extracted from all its PDB complexes. Each interface was then
assigned to a binding class cmono, cmono_in_multi, or cmulti according to the maximum binding multiplicity of its residues (Methods and SI Table S4). The collective motions of each SFull protein were extracted
by a principal component analysis (PCA) of its tCONCOORD conformational
ensemble. The first l principal components (PCs)
accounting for the 90% of the overall fluctuation of Cα atoms were selected. While other choices would be possible to represent
collective motions, using PCs ensured that the motions considered
in the analysis were allowed by the underlying energy landscape. The
identification of the PCs mostly correlated with each interface was
performed through a functional mode analysis (FMA),[82] using as functional property the radius of gyration of
the interface Cα atoms (R). The linear combination
of PCs that best correlates with the R of each interface
(maximally correlated motion or MCM) was then calculated. The contribution
of single PCs to the MCM, together with the value of the MCM-R correlation, determines the fraction of the R variance that
is due to the motion along a given PC (pvar). A PC was considered
to be correlated with an interface if pvar ≥ 20%.The
number of R-correlated PCs (nPC20) was calculated for each interface
(Figure 5A). All the interfaces turned out
to be correlated on average with no more than 1 PC. The behavior of cmono interfaces (cyan) was similar to that of
the multipartner ones (orange), while the cmono_in_multi interfaces (blue) seemed to be slightly less affected by the protein
global motions. If the number of unique R-correlated PCs is
summed over all the interfaces of a given protein (nPC20 in Figure 5B), a significant difference is found between monopartner
(cyan) and multipartner (magenta) proteins, indicating that the latter
use a larger number of independent global motions to modulate the
shape of their interfaces. Since for multipartner proteins each interface
is on average correlated with ∼0.9 PC and each protein has,
on average, four nonredundant interfaces, the peak at 3 for multipartner nPC20 implies that
different interfaces tend to be correlated with different PCs. Qualitatively
similar results were obtained when performing the FMA analysis on
MD ensembles of the SMD data set (SI
Figure S7).
Figure 5
Analysis of correlated motions in SFull. (A/B)
Correlation
between global and interface motions by functional mode analysis (FMA)
on tCONCOORD ensembles from the SFull data set. (A) Number
of PCs accounting for at least 20% of the R variance of a given
interface (nPC20). Average values calculated over interfaces of the three binding
classes cmono (cyan), cmono_in_multi (blue), and cmulti (orange) are reported. (B) Distributions of nPC20 per protein (nPC20), calculated as the number of unique PCs accounting
for at least 20% of the R variance of any of the protein interfaces.
Values for monopartner (cyan) and multipartner (magenta) proteins
are reported. (C) Distributions of normalized MI (I) calculated over pairs of surface residues
in monopartner (light gray, smono) and
multipartner (dark gray, smulti) proteins,
and over pairs of interface cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues. In A and C, the standard error
of the mean is represented with an error bar. The significance levels
from pairwise Wilcoxon comparison tests are reported with a star code
(see Figure 2 legend).
Analysis of correlated motions in SFull. (A/B)
Correlation
between global and interface motions by functional mode analysis (FMA)
on tCONCOORD ensembles from the SFull data set. (A) Number
of PCs accounting for at least 20% of the R variance of a given
interface (nPC20). Average values calculated over interfaces of the three binding
classes cmono (cyan), cmono_in_multi (blue), and cmulti (orange) are reported. (B) Distributions of nPC20 per protein (nPC20), calculated as the number of unique PCs accounting
for at least 20% of the R variance of any of the protein interfaces.
Values for monopartner (cyan) and multipartner (magenta) proteins
are reported. (C) Distributions of normalized MI (I) calculated over pairs of surface residues
in monopartner (light gray, smono) and
multipartner (dark gray, smulti) proteins,
and over pairs of interface cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues. In A and C, the standard error
of the mean is represented with an error bar. The significance levels
from pairwise Wilcoxon comparison tests are reported with a star code
(see Figure 2 legend).The PCs used in the FMA represent the main global
or
collective motions of a protein. Correlations between residues can
also be analyzed by considering their local dynamics, that is, by
removing the overall roto-translation of the protein from their motion.
Indeed, it has been shown that analyzing local dynamics can highlight
communication pathways within the protein that are difficult to identify
solely from the collective motions involving the entire structure.[23] To this end, local motions were analyzed with
a fragment-based approach[23,110] and the extent of
correlation was estimated by normalized mutual information I (eq 1 in Methods). Correlations within interface
residues of each binding class (cyan, blue and orange in Figure 5C) calculated from the tCONCOORD ensembles were
found to be significantly higher than those within mono- and multipartner
surface residues (light and dark gray), suggesting a higher level
of communication between interface residues. Interestingly, the cmulti distribution (orange) was found to be
significantly higher than the cmono_in_multi one (blue), indicating that, within multipartner proteins, multipartner
residues are on average more correlated than monopartner ones. No
significant differences were instead found between cmulti (orange) and cmono (cyan)
distributions.
Multipartner Residues Have a Higher Conformational
Variability
within Experimental Structures
In this section, we will investigate
if the higher intrinsic flexibility found for multipartner residues
correlates with the variability observed in the experimental structures.
Indeed, many proteins in the SFull data set have a relatively
large number of occurrences in the PDB (>10 for ∼50% of
the
proteins). These PDB ensembles contain information on the protein
conformational variability[98] that can be
extracted and compared with that derived from the simulated ensembles.The overall variability within the PDB ensemble of each protein
was decomposed into three different contributions according to the
binding state of the structures that are compared: unbound–unbound
(U–U), bound–bound (B–B), and unbound–bound
(U–B). The highest structural variability (as measured by the
maximum Cα RMSD calculated over all the structure
pairs of a given protein) was found on average for the U–B
pairs (2.66 ± 0.36 Å), followed by U–U (1.81 ±
0.28 Å) and B–B (1.34 ± 0.15 Å). The U–U
RMSD values can be considered as related to the intrinsic plasticity
of the isolated protein, which can have different accessible states
(‘pre-existing equilibrium’), while the U–B values
include also the structural changes caused by the binding of the partners
(‘induced fit’). The smaller variability observed within
bound structures (B–B) reflects the higher number of constraints
in the complexes.For each of these different contributions,
multipartner proteins
showed on average a higher conformational variability (U–Umulti, B–Bmulti, and U–Bmulti, dark colors in Figure 6A) than monopartner
ones (U–Umono, B–Bmono, and U–Bmono, light colors). This suggests that the higher intrinsic
flexibility observed for multipartner residues in isolated proteins
could be used to enhance pre-existing equilibrium or to assist induced-fit
changes. Indeed, the single contributions of the residues to the overall
Cα fluctuation within the PDB ensemble (Figure 6B–C) showed a similar picture as that observed
for the simulated RMSF (see also SI Figure S4 for a direct comparison). A higher conformational variability (Figure 6B, light colors) and a weaker dependence from conservation
(Figure 6C) were found for multipartner residues,
resulting in an increased RMSF difference between highly conserved
mono and multipartner residues (Figure 6B,
dark colors). The cmono and cmono_in_multi groups were instead more similar to each
other than in the simulated case, indicating a comparable degree of
variation within the PDB structures. Consistently with the simulated
ensembles, reduced differences were found between the different binding
classes when the side chain variability was considered (SI Figure S8 and S4).
Figure 6
Analysis of the conformational
variability within PDB ensembles
in SFull. (A) Averages of maximum Cα RMSD
values calculated over all the possible pairs of PDB structures within
the unbound (pink) and bound (green) ensembles, and between bound
and unbound structures (magenta) of each mono- (light color) and multi-
(dark color) partner protein. The standard error of the mean is represented
with an error bar. The significance levels from pairwise Wilcoxon
comparison tests between (mono, multi) pairs of distributions are
reported with a star code (see Figure 2 legend).
(B) Average Cα RMSF Z-scores (PDB ensembles) calculated
over cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues in the SFull data set. The standard
error of the mean is represented with an error bar. The significance
levels from pairwise Wilcoxon comparison tests are reported with a
star code (see Figure 2 legend). Averages and
significance levels calculated considering only the residues with
the highest ConSurf conservation grade (Gcons = 9) are also reported in dark colors. (C) Dependence of PDB average
Cα RMSF Z-scores (dots) from evolutionary conservation
for cmono (cyan), cmono_in_multi (blue), cmulti (orange)
residues. Residues are partitioned into 9 groups according to their
ConSurf conservation grade (Gcons). A
best-fit linear regression is also reported for each binding class.
Analysis of the conformational
variability within PDB ensembles
in SFull. (A) Averages of maximum Cα RMSD
values calculated over all the possible pairs of PDB structures within
the unbound (pink) and bound (green) ensembles, and between bound
and unbound structures (magenta) of each mono- (light color) and multi-
(dark color) partner protein. The standard error of the mean is represented
with an error bar. The significance levels from pairwise Wilcoxon
comparison tests between (mono, multi) pairs of distributions are
reported with a star code (see Figure 2 legend).
(B) Average Cα RMSF Z-scores (PDB ensembles) calculated
over cmono (cyan), cmono_in_multi (blue), and cmulti (orange) residues in the SFull data set. The standard
error of the mean is represented with an error bar. The significance
levels from pairwise Wilcoxon comparison tests are reported with a
star code (see Figure 2 legend). Averages and
significance levels calculated considering only the residues with
the highest ConSurf conservation grade (Gcons = 9) are also reported in dark colors. (C) Dependence of PDB average
Cα RMSF Z-scores (dots) from evolutionary conservation
for cmono (cyan), cmono_in_multi (blue), cmulti (orange)
residues. Residues are partitioned into 9 groups according to their
ConSurf conservation grade (Gcons). A
best-fit linear regression is also reported for each binding class.
Multipartner Residues Have
a Lower Propensity for SNPs
In this section, we provide further
insight on the functional relevance
of promiscuous residues by analyzing the distribution of Single Nucleotide
Polymorphisms (SNPs) across the different binding classes considered
in the previous sections.Recent large-scale studies of the
human genome such as the International HapMap Project[111] and the 1000 Genomes Project[112] have produced a large number of data that can be used to
accurately assess the relationship between genotype and phenotype.
In particular, SNPs are single nucleotide variations observed at a
specific location of the genome in at least 1% of the population.[113] The mapping of SNPs, and in particular nonsynonymous
SNPs, to protein regions is currently exploited in a wide range of
applications, from disease association studies to pharmacogenomics.[114] In this study, only missense nonsynonymous
SNPs are considered.In order to study the relationship between
binding promiscuity
and human SNPs, human homologues of SFull and SSoc proteins were identified. We recorded the occurrence of nonsynonymous
SNPs from the dbSNP database[102] and of
SNPs with a known association with disease (DisSNPs) from the OMIM
database[103] (Methods). The SNP and DisSNP positions were then mapped back to the original
SFull and SSoc proteins.The comparison
of the SNP propensities of the different binding
classes with respect to the surface (Figure 7A, light colors) showed that promiscuous positions in SFull (orange) are less rich in SNPs than both classes of monopartner
residues (cyan and blue). Even if with reduced statistical significance,
this trend was confirmed by SSoc proteins (green and red).
As an example, the survivin protein, already in the human form in
the SFull data set, is shown in Figure 7B. While promiscuous (red shades surface) and nonpromiscuous
(light blue surface) residues are present in survivin in almost equal
proportion, 5 out of 6 interface SNPs (spheres) were found in nonpromiscuous
regions, mainly involved in the interaction with the survivin partner
borealin (green cartoon).
Figure 7
SNP and DisSNP propensity in the SFull and SSoc data sets. (A) Propensities of SNPs (light colors)
and DisSNPs (dark
colors) relative to the surface for cmono (cyan), cmono_in_multi (blue), cmulti (orange), cmono_in_multi(Soc) (green) and cmulti(Soc) (red) interface
residues. The propensity is calculated per protein. The reported values
are averages over SFull monopartner proteins for cmono, SFull multipartner proteins
for cmono_in_multi and cmulti, and SSoc proteins for cmono_in_multi(Soc) and cmulti(Soc). The error bars represent the standard error of the mean. The significance
levels from pairwise Student’s t tests are
reported with a star code (see Figure 2 legend).
(B) SNPs in the human survivin protein. SNPs found in the interface
region of survivin are labeled and represented as van der Waals spheres.
The protein surface (transparent) is colored according to the binding
multiplicity b from blue (b = 0,
non-interface) to red (b = 3). A survivin binding
partner (borealin) is also represented as green cartoon (PDB ID: 2RAW).
SNP and DisSNP propensity in the SFull and SSoc data sets. (A) Propensities of SNPs (light colors)
and DisSNPs (dark
colors) relative to the surface for cmono (cyan), cmono_in_multi (blue), cmulti (orange), cmono_in_multi(Soc) (green) and cmulti(Soc) (red) interface
residues. The propensity is calculated per protein. The reported values
are averages over SFull monopartner proteins for cmono, SFull multipartner proteins
for cmono_in_multi and cmulti, and SSoc proteins for cmono_in_multi(Soc) and cmulti(Soc). The error bars represent the standard error of the mean. The significance
levels from pairwise Student’s t tests are
reported with a star code (see Figure 2 legend).
(B) SNPs in the human survivin protein. SNPs found in the interface
region of survivin are labeled and represented as van der Waals spheres.
The protein surface (transparent) is colored according to the binding
multiplicity b from blue (b = 0,
non-interface) to red (b = 3). A survivin binding
partner (borealin) is also represented as green cartoon (PDB ID: 2RAW).A possible explanation of these findings is that
the human equivalent
of the promiscuous positions considered here tend to be less tolerant
to variation, probably due to the higher number of constraints that
they are experiencing to preserve effective binding. Mutations at
these sites might be more prone to yield a lethal phenotype and are
thus less likely to be viable.[115]The analysis of DisSNPs (Figure 7A, dark
colors) was strongly affected by the small number of observations
(SI Table S12). The large uncertainty associated
with the average propensities, especially for the SSoc proteins,
prevented the observation of statistically significant differences
between the binding classes. It is thus evident that these results,
while indicating an interesting and unexplored relationship between
binding diversity and variability in the human genome, will need to
be confirmed on a larger data set of human proteins.
Case Study:
Two Ubiquitin-like Proteins
In this section,
we will exemplify the relationship observed between binding and flexibility
analyzing two related multipartner proteins: Neddylin and the small
ubiquitin-related modifier 2 (SUMO-2).Neddylin is a ubiquitin
homologue (56% sequence identity, Figure 8A)
with the characteristic ubiquitin-like fold (Figure 8B/D). The analysis of its complexes highlighted that, as for
ubiquitin,[116] the Neddylin main interface
is centered at an hydrophobic patch at the C-terminus of the β5
strand (Figure 8B, green circle), involving
different hydrophobic residues from the β1-β2 loop (L8)
and the β3–5 strands (I44, V70, L71). This region is
highly promiscuous and rich of hot spots (sticks in Figure 8B). A secondary interaction site, previously observed
also in ubiquitin,[116] was found at the
α1 C-terminus (Figure 8B, yellow circle).
This does not show any hot spot and is mainly composed of monopartner
residues.
Figure 8
Comparison of two ubiquitin-like proteins. (A) Sequence alignment
of Ubiquitin (UniProt/KB ID: P0CG48), Neddylin (Q15843), and SUMO-2
(P61956). The sequences were aligned using T-COFFEE v 9.02[130] with default parameters. The DSSP secondary
structure of Ubiquitin (PDB ID: 1ubi) is shown (top). Conserved residues are
highlighted in red. Stars below Neddylin and SUMO-2 sequences indicate cmono_in_multi (blue) and cmulti (orange) residues identified as hotspots by at least
two prediction methods. (B/C) Cartoon representation of Neddylin (B,
PDB ID: 1ndd) and SUMO-2 (C, PDB ID: 1wm3) structures. Residues are colored according to the
binding multiplicity b from blue (b = 0, non-interface) to red (b = 4). Residues identified
as hotspots by at least two prediction methods are represented as
sticks. The approximate boundaries of two representative interfaces
of Neddylin are indicated in B with a green (main interface) and yellow
(secondary interaction site) dashed line. The insets show the structures
of the representative complexes (PDB ID 1xt9, chains B/A, and PDB ID 1r4n, chains J/C in the
green and yellow insets, respectively). (D/E) Mapping of tCONCOORD
Cα RMSF values onto Neddylin (D) and SUMO-2 (E) structures.
The structures are colored according to RMSF values from blue (0 Å)
to red (2 Å). Selected residues (highlighted with yellow arrows
in panel A) are represented as yellow sticks, while hydrogen bonds
in D are indicated as dotted blue lines. (F/G) Profiles of binding
multiplicity (b, blue) and Cα RMSF
from tCONCOORD (orange) for Neddylin (F) and SUMO-2 (G). DSSP annotation
of the secondary structure is reported as magenta (helices) and yellow
(strands) blocks.
Comparison of two ubiquitin-like proteins. (A) Sequence alignment
of Ubiquitin (UniProt/KB ID: P0CG48), Neddylin (Q15843), and SUMO-2
(P61956). The sequences were aligned using T-COFFEE v 9.02[130] with default parameters. The DSSP secondary
structure of Ubiquitin (PDB ID: 1ubi) is shown (top). Conserved residues are
highlighted in red. Stars below Neddylin and SUMO-2 sequences indicate cmono_in_multi (blue) and cmulti (orange) residues identified as hotspots by at least
two prediction methods. (B/C) Cartoon representation of Neddylin (B,
PDB ID: 1ndd) and SUMO-2 (C, PDB ID: 1wm3) structures. Residues are colored according to the
binding multiplicity b from blue (b = 0, non-interface) to red (b = 4). Residues identified
as hotspots by at least two prediction methods are represented as
sticks. The approximate boundaries of two representative interfaces
of Neddylin are indicated in B with a green (main interface) and yellow
(secondary interaction site) dashed line. The insets show the structures
of the representative complexes (PDB ID 1xt9, chains B/A, and PDB ID 1r4n, chains J/C in the
green and yellow insets, respectively). (D/E) Mapping of tCONCOORD
Cα RMSF values onto Neddylin (D) and SUMO-2 (E) structures.
The structures are colored according to RMSF values from blue (0 Å)
to red (2 Å). Selected residues (highlighted with yellow arrows
in panel A) are represented as yellow sticks, while hydrogen bonds
in D are indicated as dotted blue lines. (F/G) Profiles of binding
multiplicity (b, blue) and Cα RMSF
from tCONCOORD (orange) for Neddylin (F) and SUMO-2 (G). DSSP annotation
of the secondary structure is reported as magenta (helices) and yellow
(strands) blocks.A positive correlation
was observed between the profiles of binding
multiplicity (orange) and Cα RMSF from the tCONCOORD
ensemble (blue) of Neddylin (Figure 8F), with
a correlation coefficient calculated over the exposed residues of
0.36. Correspondingly, a higher average Cα RMSF Z-score
(SI Table S13) was found for multipartner
residues (0.79 ± 1.48) than monopartner ones (0.03 ± 0.72).
Indeed, many multipartner residues (Figure 8B) are located at or in close proximity of high-flexibility regions
(Figure 8D), namely the β1-β2,
α1-β3, and β3-β4 loops, and the C-terminus.
The intrinsic backbone mobility of promiscuous locations in the simulated
ensemble is paralleled by a high conformational variability at these
same positions in the different PDB complexes where Neddylin interacts
with different partners (SI Table S13).
Additionally, the FMA analysis (Figure 9A)
showed that the backbone flexibility of each of the two representative
interfaces mainly correlates with one specific collective motion.
In particular, PC4 (Figure 9B) accounts for
54% of the R variance of the main interface (green bars in Figure 9A). A similar collective motion, involving a ‘pincer-like’
movement of the β1-β2 and α1-β3 loops, has
been observed in NMR ensembles of ubiquitin representing solution
dynamics up to the μs time scale.[24]
Figure 9
FMA
analysis of Neddylin. (A) Decomposition of the R variance
of the two representative interfaces of Neddylin into percentage contributions
(pvar) from the first 17 PCs of the tCONCOORD ensemble. Green and
yellow bars represent the pvar values for the interfaces shown in
the green and yellow insets in Figure 8B. (B)
Porcupine representation of the 4th PC of the Neddylin tCONCOORD ensemble.
Direction and relative amplitude of the motion of each Cα atom along the PC is represented by orange spikes. The residues
involved in the main Neddylin interface (Figure 8B, green inset) are color-mapped according to their binding multiplicity b onto the tube representation of the average tCONCOORD
structure.
FMA
analysis of Neddylin. (A) Decomposition of the R variance
of the two representative interfaces of Neddylin into percentage contributions
(pvar) from the first 17 PCs of the tCONCOORD ensemble. Green and
yellow bars represent the pvar values for the interfaces shown in
the green and yellow insets in Figure 8B. (B)
Porcupine representation of the 4th PC of the Neddylin tCONCOORD ensemble.
Direction and relative amplitude of the motion of each Cα atom along the PC is represented by orange spikes. The residues
involved in the main Neddylin interface (Figure 8B, green inset) are color-mapped according to their binding multiplicity b onto the tube representation of the average tCONCOORD
structure.Multipartner residues of Neddylin
seemed to rely less on side chain
flexibility than backbone mobility to adapt to different environments,
since the average side chain RMSF was comparable to that of monopartner
residues for both the tCONCOORD and PDB ensembles (SI Table S13). Indeed, while a few multipartner residues adopted
different rotamers in different complexes (e.g., R42 in SI Figure S9B), others relied either on the backbone
flexibility (L8 in SI Figure S9A) or on
their capacity to support different interaction geometries without
changing their conformation (e.g., H68 in SI Figure
S9C, where it interacts with an aromatic ring in either a T-shaped
or parallel stacking interaction).The SUMO-2 protein, which
is a homologue of ubiquitin in spite
of its low sequence identity (14%, Figure 8A), showed important differences with respect to Neddylin in binding
modes and dynamics. This is consistent with this protein belonging
to a separate sequence subgroup[117] of ubiquitin
homologues, characterized by the replacement of the key ubiquitin
residues Q41 and Y59 (yellow arrows and sticks in Figure 8A and D, respectively) with two hydrophobic residues
that are no longer able to form hydrogen-bonds with the nearby loops
(sticks in Figure 8E). This is reflected in
both the protein structure and dynamics. In particular, compared to
Neddylin the tCONCOORD SUMO-2 ensemble presented a higher Cα flexibility at the α1 C-terminus and the α1/β3
loop (Figure 8E).The increment in flexibility
was paralleled by an increment in
binding promiscuity in this region. A multipartner interaction site,
typical of SUMO proteins, was found at the ‘groove’
defined by the α1 C-terminus and the β2 strand, which
is correspondingly enriched in multipartner hot spots with respect
to Neddylin (Figure 8C and G). The mapping
of tCONCOORD Cα and side chain RMSF onto the SUMO-2
surface (Figure 10C/D), shows that the α/β
groove is flanked by residues with high flexibility either at Cα (β1-β2 loop) or side chain atoms (β2)
or both (α1 C-terminus). Moreover, the water density map from
the MD simulation showed that the region surrounding the groove is
richer in high-density hydration sites than the nonpromiscuous regions
(SI Figure S10).
Figure 10
Binding properties and
conformational flexibility of the α/β
groove in SUMO-2. (A/B) Cartoon representation of two different complexes
of SUMO-2 (light green). Selected multipartner hot spots of SUMO-2
(blue) and interacting residues from partners (yellow) are shown as
sticks. (A) SUMO-2 homotrimer (PDB ID: 1wm2). The other two SUMO-2 proteins are shown
in dark green. (B) Complex with Thymine-DNA glycosylase (orange, PDB
ID: 2d07). (C/D)
Cα (C) and side chain (D) RMSF values mapped onto
SUMO-2 surface (PDB ID: 1wm3). Residues are colored according to their RMSF values
from blue (0 Å) to red (2 Å).
Binding properties and
conformational flexibility of the α/β
groove in SUMO-2. (A/B) Cartoon representation of two different complexes
of SUMO-2 (light green). Selected multipartner hot spots of SUMO-2
(blue) and interacting residues from partners (yellow) are shown as
sticks. (A) SUMO-2 homotrimer (PDB ID: 1wm2). The other two SUMO-2 proteins are shown
in dark green. (B) Complex with Thymine-DNA glycosylase (orange, PDB
ID: 2d07). (C/D)
Cα (C) and side chain (D) RMSF values mapped onto
SUMO-2 surface (PDB ID: 1wm3). Residues are colored according to their RMSF values
from blue (0 Å) to red (2 Å).The analysis of the SUMO-2 PDB structures (SI Table S14) highlighted that, differently from Neddylin,
side chain conformational changes are more important for multipartner
residues (average side chain RMSF Z-score = 0.56 ± 1.13) than
for monopartner ones (−0.06 ± 0.98). Particularly relevant
is the contribution of the R34 and Q35 hot spots at the α1 C-terminus,
with a side chain RMSF Z-score of 2.63 and 1.32, respectively. Correspondingly,
these residues adopt significantly different rotamers in the different
PDB complexes (Figure 10A/B).This example
suggests that a fine-tuning of the intrinsic dynamical
properties of the interface can be central in modulating the binding
specificity in evolutionary related proteins.
Discussion
In this work, we studied the intrinsic conformational flexibility
of multipartner proteins to assess its role in promoting diversity
of binding. Through the generation of simulated conformational ensembles
from a starting experimental structure, we measured the tendency of
the isolated proteins to sample different conformations independently
from the interactions with their partners. The conformational flexibility
that we considered here is a different concept from the intrinsic
disorder analyzed in other works on hub proteins,[9,12,15,17,118] characterized by the absence of a definite structure
in all or part of the protein. It is also different from the conformational
plasticity[12−14] as measured by the conformational changes observed
when multiple experimental structures are available. Indeed, if these
structures correspond to different states, such as bound and unbound
conformations, the conformational variability is likely to include
also induced fit and allosteric effects in addition to intrinsic
flexibility.The choice to consider the isolated protein is justified by many
studies indicating that the intrinsic dynamics of a protein is correlated
with its function.[19,20,26,119−122] Indeed, even when isolated,
proteins have been shown to sample functionally relevant states, which
can then be stabilized or selected by interactions with the environment
such as post-translational modifications or interactions with ligands.[18] Many of these works used elastic network models
(ENM) to characterize the equilibrium fluctuations within the native
structure basin.[28,95,120,123] In spite of their simplicity,
these methods have been shown to provide results in good agreement
with both experiments and more sophisticated approaches such as MD.
For this study, we chose a method of intermediate complexity and computational
cost, tCONCOORD. While allowing for anharmonicities and providing
a full atom description of the protein where both backbone and side
chains are included, it is still faster than MD simulations.[67]The interaction data on each protein of
our SFull data
set were derived from a structural PPI database, PiSite,[38] where partners from all the PDB complexes involving
homologues within a high (>90%) sequence identity family are mapped
onto a family representative. Structural PPI databases generally contain
higher-confidence interaction data than interactome networks derived
from a range of different experimental methods, resulting in degree
distributions with shorter tails.[11] In
our SFull data set, this is reduced also by the fact that
only monodomain proteins were considered, so that a maximum number
of 12 nonredundant partners per protein was found. The investigation
was limited to monodomain proteins to exclude cases where multiple-partner
binding is simply achieved by using different domains or a different
relative arrangement of the domains.[6]The relationship between conformational flexibility and ability
to bind multiple partners was analyzed primarily at the residue level.
Indeed, the composition of either proteins or interfaces in terms
of binding multiplicity turned out to be highly heterogeneous, with
40% of multipartner proteins in SFull presenting both mono-
and multipartner interfaces and 91% of multipartner interfaces including
both mono- and multipartner residues. Thus, we aimed mostly at identifying
the dynamical features specific of multipartner residues and not at
characterizing multipartner proteins as a whole.Even if we
did not attempt a classification of the SFull multipartner
proteins, more than 60% of them have at least 10% of
their interface residues involved in interactions with different partners,
while only 8% multipartner proteins have no overlapping interfaces.
On the other hand, as found in different data sets,[124] very large overlaps between interfaces seem to be avoided
(only 5% of the multipartner proteins have more than 60% of their
interface interacting with multiple partners). The alternate data
set SSoc used to validate our results, is composed of proteins
previously classified as ‘sociable’ or ‘transient’
hubs,[12] that can be related, even if with
a different definition, to date hubs.We found a significant
difference in the intrinsic conformational
flexibility of mono- (cmono and cmono_in_multi) and multi- (cmulti) partner residues. In particular, the comparison
between cmono_in_multi and cmulti residues suggests a nonuniform distribution of the
flexibility across the interface of multipartner proteins, with multipartner
residues being on average more flexible than monopartner ones. This
holds when considering either backbone or side chain motions, suggesting
that an enhanced ability to sample different conformations, either
globally or locally, can indeed be exploited to support diversity
of interactions. Remarkably, a more pronounced tendency of promiscuous
residues to sample different conformations is observed in all the
different types of conformational ensembles analyzed in this study
and in two different protein data sets. This confirms the generality
and robustness of the present findings. Correlated motions were also
considered and their functional importance assessed. The flexibility
of multipartner proteins was found to modulate the shape of single
interfaces in a highly specific way, by using different collective
motions for different interfaces within the same protein. Moreover,
within multipartner proteins a stronger correlation was found between
the local motions of multipartner residues, suggesting that promiscuous
regions are connected by preferential communication pathways.Similarly to disorder,[125] a higher flexibility
at the interface of isolated molecules is associated with a higher
entropic penalty upon binding. This entropy increment has been suggested
to be exploited by disordered multipartner proteins to decouple binding
affinity from binding specificity.[126] On
the other side, residues considered to be important in the interaction
with the partner, either because of their high evolutionary conservation
or for being predicted as hot spots, have been shown to be more rigid
than the average.[24,29−31] Here, we found
that (a) there is a clear anticorrelation between evolutionary conservation
and intrinsic flexibility and (b) this trend is reduced in the case
of multipartner residues, suggesting that the higher flexibility of
conserved multipartner residues is the result of a balance between
the counteracting effects of preserving binding strength and allowing
for binding diversity. In this context, the higher propensity for
conservation found for multipartner residues in ordered proteins could
reflect the higher number of evolutionary constraints deriving from
the necessity to optimize this balance.The findings on intrinsic
flexibility are confirmed by the analysis
on the conformational variability observed in the experimental structures.
Previous works using PDB ensembles of different data sets have shown
that overlapping regions in date hub interfaces tend to visit more
different side chain rotamers than nonoverlapping ones[13] and that sociable/hub proteins sample more different
overall backbone conformations than nonsociable/nonhub ones.[12] Here, we unified these results by performing
a systematic analysis on both side chain and backbone changes at the
residue level. Moreover, we decomposed the overall variability observed
in the PDB into contributions from pairs of structures with the same
and with different binding states. Interestingly, we found a higher
plasticity for multipartner proteins not only when considering changes
between unbound and bound structures but also when analyzing ensembles
of unbound and bound structures separately. This is highly consistent
with the larger intrinsic flexibility found for the multipartner residues in isolated proteins, since it implies a more pronounced tendency to explore different
conformations independently from their binding state.The higher
propensity to conservation and surface burial upon complexation
of multipartner residues suggests that they have an important role
in defining the binding energetics. The comparison of the intrinsic
flexibility of predicted hot spot residues from different binding
classes confirmed the results obtained on the whole set of interface
residues. Indeed, multipartner hot spots turned out to be on average
more flexible than monopartner ones independently from the specific
method used for hot spot prediction. This rules out the possibility
that the observed higher flexibility of multipartner residues in the
whole interface originates from residues that are only marginally
involved in the interaction with partners. Moreover, the observation
that promiscuous regions are depleted in SNPs, even if to be confirmed
on larger data sets of human proteins, provides further evidence to
the essentiality of these regions.The findings presented in
this work have potential applications
to methods for the prediction of PPIs,[127] whose accuracy has been recently brought to levels comparable to
high-throughput experiments.[128] In particular,
the introduction of per-residue flexibility indices could be used
for the identification of promiscuous regions in protein interfaces.
This could also be relevant for the detection of druggable regions,
which could be targeted by small molecules or other proteins to inhibit
interactions with specific partners.[129] A further possible application is in protein design, where the specificity
or promiscuity of proteins could be enhanced by modifying the distribution
of flexibility across the interfaces.[6,8]
Authors: David Van Der Spoel; Erik Lindahl; Berk Hess; Gerrit Groenhof; Alan E Mark; Herman J C Berendsen Journal: J Comput Chem Date: 2005-12 Impact factor: 3.376
Authors: Jamie A Macpherson; Alina Theisen; Laura Masino; Louise Fets; Paul C Driscoll; Vesela Encheva; Ambrosius P Snijders; Stephen R Martin; Jens Kleinjung; Perdita E Barran; Franca Fraternali; Dimitrios Anastasiou Journal: Elife Date: 2019-07-02 Impact factor: 8.713
Authors: Morena Pappalardo; Francesca Collu; James Macpherson; Martin Michaelis; Franca Fraternali; Mark N Wass Journal: BMC Genomics Date: 2017-08-11 Impact factor: 3.969