Alexandre Borrel1,2, Anne-Claude Camproux1, Henri Xhaard2. 1. Molécules Thérapeutiques in silico (MTi), INSERM UMRS-973, University Paris Diderot, Sorbonne Paris Cité, 75205 Paris Cedex 13, France. 2. Faculty of Pharmacy, Division of Pharmaceutical Chemistry and Technology, University of Helsinki, Viikinkaari 5E, P.O. Box 56, FI-00014 Helsinki, Finland.
Abstract
We conduct a statistical analysis of the molecular environment of common ionizable functional groups in both protein-ligand complexes and inside proteins from the Protein Data Bank (PDB). In particular, we characterize the frequency, type, and density of the interacting atoms as well as the presence of a potential counterion. We found that for ligands, most guanidinium groups, half of primary and secondary amines, and one-fourth of imidazoles neighbor a carboxylate group. Tertiary amines bind more rarely near carboxylate groups, which may be explained by a crowded neighborhood and hydrophobic character. Inside proteins, compared with the environment seen by the ligands, the environment is enriched in main-chain atoms, and the prevalence of direct charge neutralization by carboxylate groups is different. When the ionizable character of water molecules and phenolic or hydroxyl groups is accounted for, considering a high-resolution dataset (less than 1.5 Å), charge neutralization could occur for well above 80% of the ligand functional groups considered, except for tertiary amines.
Molecular interactions
are fundamental to biochemical processes.
Ionizable, basic and acidic, functional groups can form charged interactions
mediated through a shared hydrogen atom, that is, salt bridges.[1] These hydrogen bonds are strong with energy of
interaction estimated at 28.5–48.1 kJ/mol. They are characterized
by a short distance (e.g., about 2.59–2.86 Å between the
O and N atoms of a primary amine and a carboxylate group) and a ΔpKa range of [3-11] between the acceptor
and the donor.[2] Although the basic and
acidic groups are often ionized at the binding sites, this is not
always the case, especially considering that the local pH may differ
greatly from that of the solvent.[3,4] A common way
to infer ionization of a given functional group in crystallographic
three-dimensional (3D) structures (which most often do not harbor
hydrogen atoms) is to consider its neighborhood: if a counterion is
at close range, ionization is likely.[5] If
not, it is difficult to address the issue without complex quantum
chemistry calculations.

In proteins, salt bridges involve a
basic group such as the primary
amine of a lysine side chain or the protein N-terminus, the imidazole
(IMD) group of a histidine, and the guanidinium (GAI) group of an
arginine and an acidic group such as the carboxylate group from an
aspartate or glutamate side chain or the protein C-terminus. They
play a critical role in the folding, stability, and dynamics of 3D
structures at all levels, from secondary and tertiary structures to
supramolecular assemblies, and have been studied for multiple aspects:

their energetic contribution or electrostatic strength, especially with respect to secondary, tertiary, or quaternary structure as well as stability;[6−9] a strong correlation is observed between the secondary structure and salt bridge formation.[10] Furthermore, salt bridges form complex networks,[1,7] which are suspected to have a stabilizing effect on the protein structure, following the observed relation between the increased number of salt bridges and thermal stability;[11−13]

their geometrical characteristics; for example, salt bridges between aspartate or glutamate and histidine, arginine, or lysine display extremely well defined geometric preferences;[7]

their environment and their location (within monomers or at the interface between monomers as well as their solvent accessibility);[14] salt bridges display preferential formation in an environment of 30% solvent-accessible surface area;[10]

the separation of the amino acids; intrachain salt bridges are mainly formed between amino acids separated by three or four residues;[15]

their fluctuations; nuclear magnetic resonance (NMR) conformer ensembles show that salt bridges may break and new salt bridges may form, in good correlation with crystallographic B-factors;[16]

the role of water molecules, which contribute to the stability of molecular complexes, for example, through conformational stability or the stabilization or mediation of ion pairs.[17,18]

A vast majority of these studies have
been based on structural
data extracted from the Protein Data Bank (PDB).[19] Consequently, the amount of data available to the authors
has been variable, from the early work in 1995 by Barlow and
Thornton or Musafia and co-workers, conducted using fewer than a hundred
proteins,[1] to 1500–2000 structures
10 years later[10,13] and up to 3644 monomers in the
recent study by Donald et al. in 2011.[7] Larger datasets of course increase the robustness of the findings.
The data generated for proteins in the present manuscript is the largest,
that is, more than 4500 monomers, simply because of the natural growth
of the PDB. The focus of the work is the environment of salt bridges
and their frequencies; we include in our statistics elements such
as water molecules and weakly ionizable groups that to the best of
our knowledge have not been studied together so far in the literature.

In contrast to the work conducted in proteins, the environment
of ionizable groups in protein–ligand complexes has received
little attention. This is probably due to the relative difficulty
in identifying ionizable groups in the ligands, the absence of ready-to-use
datasets, and the relative difficulty in operating cheminformatics
data mining tools in the PDB. Another challenge is that until recently
only limited data were available, especially considering the need
to analyze enough high-resolution and diverse protein–ligand
complexes. Yet, a better characterization of the interacting environment
of ionizable groups would be of key interest in molecular docking
simulations,[20] where such a knowledge would
help to better position the bridging structural water molecules, select
or optimize relevant ionization states, improve the initial placing
of the ligand, and design more efficient and accurate scoring functions.[21−24]

The aim of this study is to make a quantitative and qualitative
assessment of the protein molecular environments for the ligand and
protein ionizable groups in the PDB. We focused on atoms forming the
molecular environment in the close vicinity (3.0 and 4.0 Å) of
the queried functional groups. Statistics about the density, frequency,
and number of polar contacts were extracted and are discussed for
both protein–ligand complexes and inside protein structures.
Statistics were also extracted as to whether there is at least one
contact of a given type. The scope of the study is restricted and
currently excludes the long-range stabilization of basic groups either
through π interactions[25] or through
long-range electrostatics, although these are known to be important,
for example, to protein-folding processes or to molecular recognition
events.[26,27]
Results
The environment of six ionizable
chemical groups well-represented
in the ligands is considered: primary amine (referred to as I, pKa 7.75–10.64),[28] secondary amine (II, pKa 9.29–11.01),[28] tertiary amine (III, pKa 8.31–10.65),[28] IMD (pKa 5.1–7.75),[29] GAI (pKa 8.33–13.71),[30] and carboxylic acid (COO, pKa 1.84–4.40)[31] (Table S6). These are referred to as query groups.
The study is conducted both for ligand queries and for protein queries.
Only four of these query groups are present in proteins: I (lysine
side chain and N-terminus), IMD (histidine side chain), GAI (arginine
side chain), and COO (aspartate and glutamate side chains and C-terminus).
It is important to note that to represent the queries IMD, GAI, and
COO, which contain several atoms, we used centroids (see the Experimental Section).
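As an illustration of the centroid representation, the following minimal sketch reduces a multi-atom query group to a single point and measures a distance from it. Which atoms enter each centroid is defined in the Experimental Section; the choice of the two carboxylate oxygens and all coordinates below are assumptions made for the example only.

```python
from math import dist

def centroid(coords):
    """Return the arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

# Hypothetical carboxylate oxygens (coordinates in Å, invented for illustration).
carboxylate_oxygens = [(0.0, 0.0, 0.0), (2.2, 0.0, 0.0)]
query_point = centroid(carboxylate_oxygens)

# Distance from the centroid to a hypothetical neighbor atom.
neighbor = (1.1, 3.0, 0.0)
d = dist(query_point, neighbor)
```

Because a centroid sits farther from a neighbor than the nearest atom of the group, distance cutoffs need a correction in that case, as noted later in the text.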
PDB1.5 and PDB3.0 Datasets
The work was initiated using
the PDB3.0 dataset of ligand queries at 3.0 Å resolution. The
study was then enriched by considering only a subset of the data at
higher resolution, PDB1.5, which allowed a more accurate study of
the role of water molecules. Indeed, the main apparent difference
between the PDB1.5 and the PDB3.0 datasets is the amount of water
molecules present, that is, there are more water molecules in the
PDB1.5 dataset (Figure ). The study was then completed by collecting protein query interaction
statistics at both resolutions. The study was also run with the PDB50
release of the PDB to eliminate potential biases due to having similar
proteins in the dataset, and the results were found to be robust (see
the Discussion).
Figure 1
Mean number of water
molecules per amino acid as a function
of crystallographic resolution for all proteins in the PDB. The red
line represents the mean number of water molecules per amino acid
computed at resolution intervals of 0.1 Å.
The PDB1.5 dataset is composed of 387 complexes, and the
PDB3.0
contains 4592 complexes (Table ). From the dataset PDB1.5, we extracted for ligands 169 instances
for the query group I, 96 for II, 70 for III, 30 for IMD, 11 for GAI,
and 135 for COO. From PDB3.0, we extracted 1632 instances for the
query group I, 1230 for II, 1147 for III, 264 for IMD, 146 for GAI,
and 1390 for COO. The numbers for ligand query data for IMD (n = 30) and GAI (n = 11) in PDB1.5 are
thus too low to extract reasonable statistics. However, the results
are shown because they are highly consistent with the data extracted
from the PDB3.0 dataset and from the protein query data. For protein
queries, the PDB1.5 dataset contains 13 031 instances of I,
6227 of IMD, 11 380 of GAI, and 28 146 of COO. In the
PDB3.0 dataset, all query groups have more than 20 000 representatives.
Table 1
Content of the PDB1.5 and PDB3.0 Datasets(a)

query groups                        I          II      III     IMD       GAI        COO        any atom
PDB1.5
  number of complexes               161        91      64      26        11         96         387
  number of ligand query groups     169        96      70      30        11         135        10 314
  number of protein query groups    13 031     -       -       6227      11 380     28 146     195 913
PDB3.0
  number of complexes               1491       1113    1020    251       134        1139       4592
  number of ligand query groups     1632       1230    1147    264       146        1390       126 808
  number of protein query groups    154 979    -       -       70 474    143 529    344 848    197 306

(a) Null environments are defined from the column “any atom”.
Null Environments
A rational way to study molecular
environments is to consider them in the light of the environment of
any atom, that is, to a null model or the reference state. We built
two null environment models, one for ligand queries and one for protein
queries (Figure ).
Null environments were built by collecting the environment of
any ligand atom, which is reflective of the pockets binding the
ligands collected in this study, and the environment of a set of randomly
selected protein atoms, which is reflective of interactions in the protein
core, especially within secondary structure elements.
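The statistic reported for null and query environments, the proportion of groups with at least one atom of a given type within 4.0 Å, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the data layout (a query point plus a list of typed neighbor atoms) and the toy coordinates are assumptions; the atom type labels follow those used in the text.

```python
from math import dist

CUTOFF = 4.0  # Å, neighborhood radius used in the text

def types_within(query_xyz, atoms, cutoff=CUTOFF):
    """Return the set of atom types found within `cutoff` of a query point."""
    return {atype for atype, xyz in atoms if dist(query_xyz, xyz) <= cutoff}

def proportion_with_type(queries, atom_types):
    """Fraction of queries having at least one neighbor of each atom type."""
    counts = {t: 0 for t in atom_types}
    for query_xyz, atoms in queries:
        present = types_within(query_xyz, atoms)
        for t in atom_types:
            if t in present:
                counts[t] += 1
    return {t: counts[t] / len(queries) for t in atom_types}

# Toy data (invented): two query points, each with its surrounding atoms.
queries = [
    ((0.0, 0.0, 0.0), [("Ow", (2.8, 0.0, 0.0)), ("Oox", (5.0, 0.0, 0.0))]),
    ((0.0, 0.0, 0.0), [("Oox", (0.0, 3.5, 0.0))]),
]
props = proportion_with_type(queries, ["Oox", "Ow", "Nam"])
```

Running the same function on every ligand atom, or on a random sample of protein atoms, would give the reference proportions against which a query group is compared.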
Figure 2
Null environments around
(A) ligand atoms and (B) protein atoms.
The graph shows the proportion of query groups with at least one Oox,
Ow, Oh and Oph, Nam, NaI or Nim or Ngu, and Car atom in their neighborhood
(4.0 Å). Datasets PDB1.5 (left bars) and PDB3.0 (right bars)
are both shown. The following color code will be consistently used
in this study: Oox (red), Oh and Oph (orange), Ow (cyan), Nam (green),
Nim, Ngu, and NaI (blue), Car (purple), and Oc (black).
Environments in the PDB1.5 and PDB3.0 datasets
are very similar,
save for the number of water molecules (see previous section). About
53% of ligand atoms and of protein atoms have at least one water
molecule (Ow) within 4.0 Å in PDB1.5, whereas these numbers drop
to 32–33% in the PDB3.0 dataset.

Comparing the environments
of ligand and protein atoms uncovers
a major difference. The environment of protein atoms is significantly
enriched in amide groups (Nam) [18% (any ligand atom) against 71%
(any protein atom)] as well as in carbonyl groups [(Oc) 30% (any ligand
atom) against 77% (any protein atom)] (values are from the PDB3.0
dataset; very similar values are obtained from the PDB1.5 dataset).
This can be explained by the contacts formed between secondary structure
elements in proteins and by the lower exposure of the main-chain
atoms to the ligand-binding sites. The environment of ligand atoms
is slightly enriched in charged and polar amino acid groups: carboxylic acid (Oox;
13 vs 9%), phenolic and hydroxyl (Oh and Oph; 17 vs 9%), and positively
charged groups (NaI, Nim, and Ngu; 12 vs 7%). Car appears equally
in ligand and protein null environments (23–25%).
Neutralization at the Level of the Functional Group
We start the Results section by presenting
an overview of the neutralization of the charge at the level of a
query group (Figure ) and subsequently present details about the different environments
and in particular their composition. These different types of environments
are illustrated in Figure , taking the case of a primary amine. Classical environments
are salt bridge interaction with a carboxylate group (Figure A), interaction with a carboxylate
group mediated by a water molecule (Figure B), and environment formed by water molecules
and carbonyl groups (Figure C). Less classical environments for primary amines are, for
example, interaction with an IMD group (Figure D) or with a GAI group (Figure E,F). The interacting atoms
were analyzed by placing the ligand query fragments in the same reference frame
(Figure ; data available
in .pdb format in the Supporting Information). This was done by computing the rotation/translation matrices using
an in-house implementation of Kabsch’s algorithm.[32,33] For III and to a lesser extent II, interactions occur predominantly
in the axial position from the tetrahedron formed by nitrogen on the
top and to a lesser extent below the three connected carbons (Figure B,C). Note that the
superimposition of I functional groups is fuzzy because of the rotational
freedom around the C···N bond.
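The superimposition step can be illustrated with a generic NumPy version of Kabsch's algorithm. This is a sketch of the textbook method, not the in-house implementation used in the study; the function name `kabsch` and the test coordinates are assumptions introduced for the example.

```python
import numpy as np

def kabsch(P, Q):
    """Return (R, t) such that R @ p + t best matches q in the least-squares sense.

    P and Q are (n, 3) arrays of paired coordinates (mobile and reference).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean = P.mean(axis=0)
    q_mean = Q.mean(axis=0)
    # 3x3 covariance matrix of the centered coordinate sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Invented fragment: four non-coplanar points, rotated 90° about z and shifted.
fragment = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)
rot_z90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
shift = np.array([5.0, -2.0, 1.0])
reference = fragment @ rot_z90.T + shift

R, t = kabsch(fragment, reference)
aligned = fragment @ R.T + t  # fragment expressed in the reference frame
```

Applying the same `R` and `t` to the atoms surrounding each fragment brings all neighborhoods into a common frame, which is what allows contact densities to be accumulated in 3D.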
Figure 3
Neighborhoods of (A,C)
ligand query groups I, II, III, IMD, GAI,
and COO and (B,D) protein query groups I, IMD, GAI and COO. (A,B)
is for the PDB1.5 dataset and (C,D) is for the PDB3.0 dataset. The
presence of the following atom types in the neighborhood was searched
and exclusively assigned to the first type found (from the bottom
to the top of the bars): at least 1–4 Oox atoms within 3.0
Å; red, separators indicate the number of Oox groups from more
than five (bottom) to one (top); at least one Oox atom in the 3.0–4.0
Å range (burgundy red); at least one Ow itself interacting with
a Oox atom for basic query groups and interacting with a NaI, Ngu,
or Nim for the acidic query group (yellow); at least one Ow (cyan);
at least one Oh, Oph (orange); at least one Nam (green); at least
one Ngu, Nim, or NaI (marine blue); at least one Car (purple); at
least one aliphatic carbon or sulfur (gray). The color code is the
same for COO but (Ngu, Nim, and NaI) are used in the place of (Oox).
Note a small number of samples for IMD and GAI in panel (A).
Figure 4
Examples of six different environments for query
group I. (A) neutralization
using a counterion (human arginase I, PDB code 3MFW); (B) neutralization
using a counterion mediated by water molecules (Helicobacter
pylori 5′-methylthioadenosine/S-adenosylhomocysteine nucleosidase, PDB code 4OJT); (C) only water
molecules and main-chain carbonyl groups (Streptomyces sp. R61 DD-peptidase, PDB code 1IKI); (D) nitrogen from IMD (human GABA(B)
receptor, PDB code 4MR8), (E) nitrogen from GAI (Salmonella enterica stationary phase survival protein, PDB code 4XJ7); and (F) nitrogen
from GAI (hepatitis C virus Hcv Ns3 Protein, PDB code 4B76). Ligand carbon
atoms (blue), protein carbon atoms (green), water molecules (red spheres),
and protein cartoon trace (green) are shown.
Figure 5
3D densities of atom types around ligand queries using the dataset
PDB3.0. Color code: for query group; (A) I, (B) II, (C) III, (D) IMD,
and (E) GAI, Oox (red), Oh and Oph (orange, yellow), and Ow (cyan).
For (F) COO, Nam (green), Nim, Ngu, and NaI (blue), and Ow (cyan).
Strong contacts (short interaction
distances) were found between
the six functional groups studied and the atoms Oox, Oc, Oh, Oph,
and Ow and to a lesser extent Nim. For the five basic queries, we sequentially
and cumulatively examined the possibilities of charge neutralization not
only by carboxylate groups (Oox) but also by acidic groups that provide
opportunities for hydrogen bonds with a charge-transfer component
(Oh, Oph, and Ow). When we account for the functional groups of ionizable
character in the neighborhood, considering only the well-solvated
highest resolution dataset (PDB1.5), we assess that direct counterions
are present within 4.0 Å for ligand queries I in 93% of cases,
for II in 88%, for III in 71%, for IMD in 85%, for GAI nearly all,
and for COO in 96% of the cases; for protein queries, these numbers
are 81% for I, 97% for IMD, 98% for GAI, and 96% for COO. These numbers
are much higher than those obtained by considering only direct carboxylate
counterion neutralization.

We refined the analysis to consider
separately the cases where
water molecules mediate ionic contacts (yellow in Figure ).[34] Water molecules were defined to mediate an ionic interaction if
the water molecule itself is within 3.0 Å of a potential counterion
(Oox for I, II, III, IMD, and GAI; NaI, Nim, or Ngu for COO); a corrective
number was used to calibrate distances in the case of centroids (see
the Experimental Section). As a result, water
molecules were found to mediate ionic contacts for 7% of I, 4% of
II, 4% of III, and 14% of COO in ligand queries and 7% of I, 15% of
IMD, 12% of GAI, and 16% of COO for protein queries. For all queries,
there are slightly but consistently more intervening water molecules
detected in the PDB1.5 dataset, supporting a better refinement of
the structures.

Similarly, the fraction of carboxylate counterions
in the 3.0–4.0
Å distance range from the basic queries—that indicates
ionic interactions but not charge-reinforced hydrogen bonds—is
for all functional groups considered lower in the higher resolution
dataset (compare the burgundy red on Figure A,C and B,D): for example, 2% against 12%
for primary amines or 6% against 11% for secondary amines (ligand
queries). This phenomenon is accompanied by an increase in the close
range interaction with Oox in the higher resolution dataset. This
could reflect a nonoptimal refinement in the lower resolution crystal
structures, a suggestion well in line with the recent work about halogen
bonds.[35] It is interesting that the phenomenon
of poor refinement could be observed for classical functional groups
that are expected to be well-represented by current force fields,
as opposed to halogen atoms.
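The exclusive, ordered assignment of each basic query group to one neighborhood category, including the water-mediated case, might be sketched as below. The priority order and the 3.0 and 4.0 Å thresholds follow the description in the text and in the Figure 3 caption; the category labels, the data layout, and the omission of the centroid distance correction are simplifying assumptions of this example.

```python
from math import dist

# Categories tested after the Oox and water checks, in priority order.
LOWER_PRIORITY = (
    ("Oh/Oph", {"Oh", "Oph"}),
    ("Nam", {"Nam"}),
    ("Ngu/Nim/NaI", {"Ngu", "Nim", "NaI"}),
    ("Car", {"Car"}),
)

def classify_basic_query(query_xyz, atoms, near=3.0, far=4.0):
    """Assign a basic query group to the first matching neighborhood category.

    `atoms` is a list of (atom_type, (x, y, z)) tuples around the query.
    """
    oox = [xyz for t, xyz in atoms if t == "Oox"]
    if any(dist(query_xyz, o) <= near for o in oox):
        return "Oox within 3.0"
    if any(dist(query_xyz, o) <= far for o in oox):
        return "Oox 3.0-4.0"
    neighbors = [(t, xyz) for t, xyz in atoms if dist(query_xyz, xyz) <= far]
    waters = [xyz for t, xyz in neighbors if t == "Ow"]
    # A neighboring water that itself lies within 3.0 Å of a carboxylate oxygen.
    if any(dist(w, o) <= near for w in waters for o in oox):
        return "water-mediated Oox"
    if waters:
        return "Ow"
    present = {t for t, _ in neighbors}
    for label, types in LOWER_PRIORITY:
        if present & types:
            return label
    return "other"

origin = (0.0, 0.0, 0.0)
direct = classify_basic_query(origin, [("Oox", (2.8, 0, 0)), ("Ow", (0, 2.8, 0))])
mediated = classify_basic_query(origin, [("Ow", (2.8, 0, 0)), ("Oox", (5.6, 0, 0))])
```

For the acidic query (COO), the same scheme would apply with NaI, Ngu, and Nim taking the place of Oox as the counterion types.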
Carboxylate Contacts
Carboxylate oxygens (Oox) are
often involved in charge-reinforced hydrogen bonds (Figures A,B and 5A–E, left-hand densities).[36] The
distribution of Oox around the functional groups I, II, III, IMD,
and GAI shows a strong density peak at 2.8 Å, seen especially
for I and II (Figures , 7, and S3–S5) as well as for GAI. For III and IMD, a weak peak of density is
also found at 2.8 Å. Similarly, for COO, the peak of Ngu, NaI,
and Nim is also found at 2.8 Å. This value of 2.8 Å is typical
of salt bridges, as reported elsewhere.[2]
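A density peak such as the one at 2.8 Å can be located by binning query-to-atom distances. The sketch below is only illustrative: the 0.25 Å bin width and the normalization by the number of atoms of that type within 6.0 Å are assumptions, since the exact density estimate used for the figures is not specified in this section, and the coordinates are invented.

```python
from math import dist
from collections import Counter

def distance_histogram(query_xyz, atoms, atom_type, max_dist=6.0, bin_width=0.25):
    """Relative frequency of `atom_type` per distance bin around a query point."""
    counts = Counter()
    for t, xyz in atoms:
        d = dist(query_xyz, xyz)
        if t == atom_type and d < max_dist:
            counts[int(d / bin_width)] += 1
    total = sum(counts.values())
    n_bins = round(max_dist / bin_width)
    return [counts[i] / total if total else 0.0 for i in range(n_bins)]

# Invented neighborhood: two close carboxylate oxygens, one distant, one water.
atoms = [
    ("Oox", (2.8, 0.0, 0.0)),
    ("Oox", (0.0, 2.9, 0.0)),
    ("Oox", (0.0, 0.0, 4.1)),
    ("Ow", (2.8, 0.0, 0.0)),
]
hist = distance_histogram((0.0, 0.0, 0.0), atoms, "Oox")
peak_bin = max(range(len(hist)), key=hist.__getitem__)
peak_start = peak_bin * 0.25  # lower edge of the most populated bin, in Å
```

In practice the distances from all instances of a query group would be pooled before binning, so that the histogram reflects the whole dataset rather than a single neighborhood.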
Figure 6
Density
of presence for selected protein atoms in the neighborhood
of ligand queries. The Y axis represents the relative
density value for all atoms collected within 6.0 Å distance from
the query group. I (A), II (B), III (C), IMD (D), GAI (E), and COO
(F) using the dataset PDB3.0. Density curves are colored as follows:
Oox (red), Oh (orange), Oph (light orange), Oc (black), Ow (cyan),
Nam (green), Ngu (light blue), NaI (blue), Car (purple), and Xot (gray).
Figure 7
Density of presence for selected protein atoms
in the neighborhood
of protein queries. The Y axis represents the relative
density value for all atoms collected within 6.0 Å distance from
the query group: I (A), IMD (B), GAI (C), and COO (D) using the dataset
PDB1.5. Density curves are colored as follows: Oox (red), Oh (orange),
Oph (light orange), Oc (black), Ow (cyan), Nam (green), Ngu (light
blue), NaI (blue), Car (purple), and Xot (gray).
The high propensity of the query bases to form salt bridges
with
Oox atoms is corroborated by their frequent close contacts (Figures and 9): ligand GAI (72–89% combining both datasets), primary
and secondary amines (45–54%), and IMD (20–28%) often
neighbor a carboxylate group in their binding sites. Tertiaryamines
bind less near carboxylate groups (5–16%), which may be explained
by a more crowded neighborhood and a more hydrophobic character (see
the Discussion). In proteins, the prevalence
of direct charge neutralization by carboxylate groups is different:
GAI (54–55%), IMD (42–44%), and primary amine (28–29%).
Ligand and protein carboxylate groups are similarly neutralized (49–63%).
Figure 8
Proportion
of ligand query group I (A), II (B), III (C), IMD (D),
GAI (E), and COO (F) with at least one type of neighbor atom type
at a distance of 4.0 Å. For each atom type, proportions are represented
using the datasets PDB1.5 (left bars) and PDB3.0 (right bars). Color
code is the same as above: Oox (red), Ow (cyan), Oh and Oph (orange),
Nam (green), Nim, Ngu and NaI (blue), Car (purple), and Oc (black).
Figure 9
Proportion of protein query group I (A), IMD
(B), GAI (C), and
COO (D) with at least one type of neighbor atom type at a distance
of 4.0 Å. For each atom type, proportions are represented using
the datasets PDB1.5 (left bars) and PDB3.0 (right bars). Colors are
as follows: Oox (red), Ow (cyan), Oh and Oph (orange), Nam (green),
Car (purple), Oc (black), and Nim, Ngu, and NaI (blue).
The null environments can be used to evaluate the
significance
of the query to Oox interactions. The preference for Oox is significantly
higher for four out of five basic functional groups considered (Table ). Preference for
Oox by the ligand and protein queries is clearly seen for I, II, and
GAI that have at least one Oox in their neighborhood in 44–89%
of cases compared to 14% for null environments. The number of Oox
(or other polar atoms, water molecules excepted) interacting with
III is surprisingly low, much lower than what would be expected from
the null environment (see the Discussion).
As should be expected, the frequency of COO to Oox contacts is significantly lower than
for the null environment (Table ). Even if occurring less often, carboxyl–carboxylate
interactions, which require both carboxylic acid oxygens to be in
the neutral form, are strong, as discussed elsewhere.[37]
Table 2
p-Values and Significance
[Represented by the Number of (*)] of Tests of Comparison of the Environments,
i.e., between Contingency Tables of Atom Type Composition by Query
Group and the Null Environmentsa
The PDB3.0 is preferred over the
PDB1.5 dataset for ligand queries because of lack of data in the latter.
Three significance levels are defined: not significant if corrected p-value is more than 0.1; (*) if corrected p-value is less than 0.1; (**) if corrected p-value
is less than 0.05; and (***) if corrected p-value
is less than 0.01. Boxes are colored when the p-value
is significant: in red when query neighborhoods are enriched and in
blue when query neighborhoods are depleted.
Hydroxyl and Phenol Contacts
For the hydroxyl (Oh)
and phenol groups (Oph), the interaction distance peaks at 2.8 Å
seen for Oox are also found (Figures , 7, and S3–S5). For IMD and COO, an equivalent peak found at
a distance of 2.8 Å suggests that strong hydrogen bonds with
a charge-transfer component, comparable to salt bridges, are formed.
For GAI, the Oh and Oph interaction is shifted toward 3.0 Å for
both types of queries. This indicates weaker hydrogen bonds and may
relate to the charge of GAI groups being most often already neutralized
by a carboxylate group (in 72–89% of the cases, see Figure ).

At least
one hydroxyl or phenol group is found in the vicinity of a ligand
IMD query in 70% (PDB1.5) and 58% (PDB3.0) of the cases (Figure ); these numbers
are lower for protein queries, about 23–24% (Figure ). This could point to specific
recognition motifs at the binding sites toward the IMD query group.
Favorable Oh and Oph interaction for the IMD and COO queries may be
linked with the delocalized nature of the electrons on the IMD ring
and carboxylate. Contacts between ligand IMD and Oph in the absence
of carboxylate or water molecules in the vicinity are found for both
ligand and protein queries (orange on Figure C,A) in the PDB3.0 dataset but not the PDB1.5
dataset. This may reflect an incomplete refinement of the PDB3.0 dataset
(importantly, the protein data are of significant size), or simply
the fact that some water molecules are not seen in lower resolution
structures.

For most query groups (ligand I and IMD and all
protein queries),
hydroxyl and phenol groups interact significantly more than in the
null environment. Query III shows significantly fewer Oh or Oph contacts
than in the null environment, in accordance with its specific environment
(see the Discussion). For environments showing
interactions between hydroxyl or phenol groups and GAI, statistical
significance could not be demonstrated for ligand queries. This probably
indicates lack of data (only 134 neighborhoods considered for GAI
using the PDB3.0 dataset).
Water Molecules and Charge Neutralization
In terms
of contact density, water molecules exhibit a peak at 2.8 Å for
all considered queries, closely resembling those of Oox (Figures , 7, and S3–S5). In proteins
where there are plenty of data, this peak in the density at 2.8 Å
is visible for I and IMD (Figures and S5). For GAI, the peak
of water molecule density is shifted to longer distances, as was observed
for Oh and Oph. This may again be explained because the GAI query
groups are almost always neutralized by a salt bridge with a carboxylate.
Similar to hydroxyl and phenolic groups, water molecules can form
hydrogen bonds that have a proton-transfer component and therefore
may act as counterions (Figure C). Water molecules also have an amphoteric character and
therefore can act both as a counterion of basic groups (I, II, III,
IMD, and GAI) and the acidic group (COO).
Water molecules (Ow)
were found in the close vicinity of all query groups (I, II, III,
IMD, GAI, and COO): at least 60% of the query groups considered
have at least one water molecule within 4.0 Å in the PDB1.5 dataset
(Figures panels B,
C and F, 8, and 10).
Water molecules are over-represented in comparison to the null environment
of ligand queries IMD and COO. The large number of water molecules
interacting with III to some extent compensates for the lower number of
interacting protein atoms, as can be seen in Figure A (see also Figure D).
Figure 10
3D densities of contact atoms using the dataset
PDB3.0 for ligand
queries (A) I, (B) II, (C) III, (D) IMD, (E) GAI, and (F) COO. Color
code: Oc (black), Nam (green), Nim, Ngu, and NaI (blue), Oph and Oh
(yellow and orange), and Oox (red).
IMD to Base Close Contacts and Other Base–Base Interaction
The data collected highlight the interaction of IMD (either as
a query IMD or as a target atom Nim) with, surprisingly, bases (for
a complete composition of the neighborhoods at 4.0 Å, see Tables S1 and S2). The nature of the contact
between, for example, a primary amine and an IMD group is exemplified
for I with Nim (Figure D). This contact has not been described in the literature but may
take the form of hydrogen bonding with a proton being shared between
the uncharged IMD and the protonated amine. A strong interaction is
corroborated by a density peak at a distance of 2.8 Å for both
IMD ligand and protein queries (Figures , 7, and S3–S5). The atom types Nim, NaI, and Ngu
are within 4.0 Å of 30–34% of the IMD queries in both
datasets (Figures and 9). Altogether, there is a sufficient
number of occurrences of IMD–Nim in the PDB1.5 dataset for protein
queries (1459 occurrences, 3588 in PDB3.0) to rule out refinement
errors. These numbers are also substantial for protein queries for
IMD–NaI (296 occurrences in PDB1.5 and 1015 occurrences in
PDB3.0) and IMD–Ngu (1865 occurrences in PDB1.5 and 4809 occurrences
in PDB3.0). In terms of significance, the IMD contact with NaI, Nim, and Ngu is
not significant for ligand queries because of a lack of data (n = 185),
but it is significantly above background in protein
queries (Table ).
The case of the other basic groups I, II, and III is different (Figure E,F). These groups
carry a positive charge under physiological conditions and are likely
to repel each other, although there is evidence for cation–cation
interactions in ionic liquids.[38] That such an interaction is unlikely
is seen in the density proportions, which lack
NaI and Ngu peaks at 2.8 Å for I, II, and III. In terms of raw numbers, these groups are
very rarely positioned near (<3.0 Å) the basic queries:
six occurrences in PDB3.0 for NaI and
16 occurrences in PDB3.0 for Ngu (Tables S4 and S5). Accordingly, the environment of I, II, and III in terms of NaI,
Ngu, and Nim is significantly below the null environment (Table ). There are, however,
density peaks near 3.4 Å (Figures , 7, and S3–S5). This reflects another aspect of the interactions
formed by basic groups, that is, networks of charges and secondary
contacts (Figure D–F).
Amide and Carbonyl Contacts
Carbonyl oxygen (Oc) forms
a suitable environment for basic groups as a hydrogen bond acceptor
(Figure C). In proteins,
carbonyls belong exclusively to main-chain and side-chain amide functional
groups. The main-chain carbonyl groups carry a permanent
partial charge and very often benefit from aligned dipoles; thus,
they make strong hydrogen bonds. Oc densities show a strong peak in
the distribution for interaction with I and II at 2.9 and 3.0 Å
(Figures , 7, and S3–S5),
slightly longer than for hydrogen bonds that involve basic queries
and Oox, Oh, and Oph. This is fully in line with other work.[2,39] In terms of representation, Oc is present near the queries I, II,
IMD, GAI, and COO: for ligands, from 33 to 77% (PDB3.0 dataset, where
there are enough samples for all query groups) (Figure ) and for protein queries, from 29 to 94%,
similar in both datasets (Figure ). For query III, Oc is present in only 15–17%
of the neighborhoods. For COO in ligands, Oc is, surprisingly, significantly
more represented than in the null environment (Table ). For protein queries, in contrast, Oc is always
less represented in the neighborhood than in the null environments.
Main-chain and side-chain amide groups (Nam) are almost never found
in the 3.0 Å vicinity of I, II, or III (n =
19 for protein–ligand interactions in PDB3.0) (Tables S4 and S5). For the IMD query, Nam is
located above and below the plane of the IMD ring (Figure ). Amide (Nam) shows density
peaks close (3.0 Å) to IMD and COO. In terms of significance,
Nam is significantly less represented than in the null environment for ligand
and protein queries of II, III, and GAI (Table ). For I and IMD, the over-representation
is found in both systems. It may reflect a favorable arrangement of
atoms without a hydrogen bond, but a fraction is likely to represent IMD
to main-chain nitrogen interactions.[40,41]
Distance Threshold to Define Polar Contacts
When considering
data within a sphere of 3.0 Å radius, the number of neighboring
atoms is lower for simple groups (1.6 ± 1.3 for I; 1.0 ±
1.0 for II; and 0.3 ± 0.5 for III) compared to larger functional
groups defined using a centroid (6.5 ± 2.8 neighboring atoms
for IMD; 5.6 ± 3.4 for GAI; and 5.6 ± 3.3 for COO) (Figures and 12). This is easily explained because complex functional
groups contain several atoms. The interaction shell collected within
3.0 Å of the query groups is composed mostly of polar atoms (Figures , 12, and Tables S3–S5). Indeed,
query groups I, II, and III have 72%, 74%, and 92% of polar neighbors
(Oox, Oh, Oph, Ow, Oc, Nam, Nim, Ngu, and NaI) against 57% for any
atoms in the null environment (data from PDB1.5, Tables S2 and S4). The proportion of neighboring oxygen and
nitrogen polar atoms is in contrast lower for IMD (60%), GAI (50%),
and COO (51%), which may reflect favorable interactions with carbon
atoms, for example, COO to ring edge anion–π contacts.[42] Additionally, it could be a difference introduced
by the data collection method, either a sphere centered on a point
charge or a centroid; the latter may lead to the inclusion of contacts
farther away.
Figure 11
Influence of the distance threshold on the number of atoms (left
panels) and atom type frequency (right panels) for ligand queries
using the dataset PDB3.0. Neighborhood is defined (A) using a threshold
distance of 3.0 Å and (B) using a threshold distance of 4.0 Å. Atom
types are colored as follows: Oox (red), Oh (orange), Oph (light orange),
Oc (black), Ow (cyan), Nam (green), Ngu (light blue), NaI (blue),
Car (purple), Su (dark gray), and Xot (gray). Note the different y-axis
scales for the left-hand panels.
Figure 12
Influence of the distance threshold on the number of atoms (left
panels) and atom type frequency (right panels) for protein queries
using the dataset PDB3.0. Neighborhood is defined (A) using a threshold
distance of 3.0 Å and (B) using a threshold distance of 4.0 Å.
Atom types are colored as follows: Oox (red), Oh (orange), Oph (light
orange), Oc (black), Ow (cyan), Nam (green), Ngu (light blue), NaI
(blue), Car (purple), Su (dark gray), and Xot (gray). Note the different
y-axis scales for the left-hand panels.
When using a longer radius for selecting neighbors (Figures and 12), 4.0 Å compared to 3.0 Å, the number
of neighboring atoms
increases by 2–3 fold: 5.7 ± 3.2 for I; 4.0 ± 2.4
for II; 1.2 ± 1.2 for III; 17.2 ± 5.9 for IMD; 16.6 ±
8.9 for GAI; and 15.8 ± 7.8 for COO. Interestingly, III keeps
a small number of atoms in its neighborhood even at a distance of
4.0 Å. The relative proportion of polar interacting atoms (Oox,
Oh, Oph, Ow, Oc, Nam, Nim, Ngu, and NaI) decreases, which reflects
the inclusion in the statistics of hydrophobic contacts as well as
carbons connected to polar atoms, such as the central carbon atom
belonging to carboxylate groups.
Generally, increasing the radius
of the collection sphere brings
the distribution of neighbors toward that observed for our null environment
(tested up to 6.0 Å, data not shown). For the null environments,
the number of atoms included in the neighborhood is much lower compared
to the other query groups, that is, 0.2 ± 0.6 at 3.0 Å for
the ligand query. This is explained by the fact that “any atom”
in a ligand is usually a carbon connected to two or three atoms and
that the 3.0 Å sphere captures strong polar contacts. Similar
results were observed for protein queries.
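The effect of the collection radius described above can be illustrated with a minimal sketch. It assumes that each neighborhood is already available as a list of (atom type, distance) pairs; the data layout, the function name `neighborhood_summary`, the toy values, and the use of the population standard deviation are all assumptions made for illustration, not the scripts used in this study.

```python
from statistics import mean, pstdev

# Polar atom types as listed in the text.
POLAR_TYPES = {"Oox", "Oh", "Oph", "Ow", "Oc", "Nam", "Nim", "Ngu", "NaI"}

def neighborhood_summary(neighborhoods, radius):
    """Summarize neighborhoods at a given collection radius (angstrom).

    `neighborhoods` holds one list per query group; each list contains
    (atom_type, distance) pairs (hypothetical layout).
    Returns (mean neighbor count, population SD of the count, polar fraction).
    """
    counts = []
    n_polar = 0
    n_total = 0
    for contacts in neighborhoods:
        inside = [atype for atype, dist in contacts if dist <= radius]
        counts.append(len(inside))
        n_total += len(inside)
        n_polar += sum(1 for atype in inside if atype in POLAR_TYPES)
    polar_fraction = n_polar / n_total if n_total else 0.0
    return mean(counts), pstdev(counts), polar_fraction

# Toy data, invented for illustration (not taken from the PDB).
toy = [
    [("Oox", 2.8), ("Ow", 2.9), ("Xot", 3.7), ("Car", 3.9)],
    [("Oc", 3.0), ("Xot", 3.6)],
    [("Xot", 3.8)],
]

for r in (3.0, 4.0):
    m, sd, frac = neighborhood_summary(toy, r)
    print(f"radius {r:.1f} A: {m:.2f} +/- {sd:.2f} neighbors, polar fraction {frac:.2f}")
```

In this toy example, the longer radius adds mostly carbon atoms, so the neighbor count rises while the polar fraction drops, which is the qualitative behavior reported in the text.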
Discussion
Robustness of the Study toward a Potential Bias in the Dataset
In this
manuscript, we present diverse statistics extracted from
the PDB, which may be sensitive to biases in the dataset because of
too many close homologues. We thus decided to run the study a second
time using the PDB50 release, that is, a release that contains no
two sequences sharing over 50% identity (statistics about the number
of groups extracted are found in the Supporting Information Table S1). For protein queries, in which a subset
of query groups are randomly extracted, we already control that the
sample taken is robust over five different random extractions (see
the Experimental Section). Not surprisingly,
the derived statistics are largely unaffected by using PDB50
(Supporting Information Figures S1 and
S2). For ligand queries, we remove biases by keeping only one structure
for each unique ligand (see the Experimental Section). The data obtained from PDB50 thus follow closely the statistics
obtained from the complete PDB, especially for the groups having enough
data (100 or more queries). The positive effect of using PDB50 on
eliminating possible biases originating from the presence of several
close homologues is nonetheless counterbalanced by a severe depletion
in the data available. The resulting low number of ligand queries,
especially for IMD and GAI in PDB3.0 and for almost all query groups
in PDB1.5, leads to discrepancies between PDB50 and the complete PDB (PDB100). Altogether,
the study on the nonredundant PDB50 nonetheless confirms all trends
observed with PDB100.
Interaction Environments of III are Clearly Different than I and II
One of the surprising findings of this study is that
III forms salt bridges less frequently with carboxylate groups in
comparison to I and II (see the Results).
This is especially unexpected because the pKa of III is about the same as that of I,
in the 8−10 range.[28] As elaborated
in the Results section, water molecules can
function as counterions and are frequently found near III (64% in
the PDB1.5), especially in the absence of a carboxylate counterion.
A reason for III to favor water molecules over protein counterions
is the limited space available around the query (Figure ). This limited space is corroborated
by the low number of interacting atoms (Figure ). Furthermore, the density curves for III
are low at a close range (Figures C and S3). Taken together,
this suggests that the distinct interacting environment of III is
a consequence of its low accessible volume. Accessibility has
long been known in chemistry to relate to chemical reactivity. This
is the first instance showing that the available space affects
the ability to form molecular interactions. Query III is furthermore
stabilized by hydrophobic contacts. This is not seen in this study
because the sphere of 4.0 Å radius used for data collection around
III does not capture hydrophobic contacts made by the attached carbon
atoms. Indeed, less than 6–8% of III has at least one aromatic
carbon (Car) within 6.0 Å, in comparison to 25–26% for
the null environment (Figures and 8C).
Charged Groups without Neutralization by a Counterion
This manuscript is centered
on the neutralization of charges, but
what happens to the remaining complexes is of interest. First, the
majority species is not always the ionized one, especially for IMD
that has a pKa range of 5.1−7.75
(Table S6). In addition, long-range contacts
where charges are not directly neutralized by salt bridges are not
accounted for here. In particular, cation−π interactions
are not studied in detail. Their number is nonetheless bounded by
the number of aromatic carbons seen in the vicinity of the queries.
For the respective ligand and protein queries, using the PDB3.0 dataset,
there is at least one Car near I in 26 and 8% of the cases, for II
in 14% of the cases, for III in 5% of the cases, for IMD in 37 and
36% of the cases, for GAI in 51 and 20% of the cases, and for COO
in 44 and 30% of the cases. The cation−π or anion−π
contacts are not the focus of this study because more complex geometric
parameters as well as longer distances should be used to study them
in more detail.[25,43] More generally, for ligand queries,
we filtered out metals in the vicinity of the ligands as well as nonbonded
ligand contacts, eliminating potential unexpected counterions.
Multiple Atom Interactions from Functional Groups
The
peaks of densities collected at distances longer than about 3.5 Å
need to be carefully interpreted because they often relate to atoms
that do not directly interact with the query groups but are constrained
by the chemistry of proteins. These can be connected atoms, for example,
the carboxyl carbon and the second oxygen of a carboxylate group.
This is seen in Figure A where the peak for I is followed by a weaker peak starting at 3.4
Å that corresponds to the second carboxylate oxygen (see also Figure ). Another typical
example of secondary contacts is the carbonyl oxygen Oc or the amide Nam in proteins. Secondary structure elements explain very well the
shape of the Nam density, with marked peaks at 5.0 Å in the density proportion
(Figure ).
Figure 13
Empirical
correction of the data collection sphere radius for complex
functional groups, exemplified by the Nε–Oox distance.
(A) Actual Nε–Oox hydrogen-bonding distances and Cζ–Oox
distances presented in this manuscript. (B) Densities of Oox atom
distribution used to define the corrective factor d. The peak of strong interaction is found at 2.8 Å for Nε–Oox
and calibrated at this value for Cζ–Oox by subtracting d.
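The calibration summarized in this caption can be illustrated with a short sketch. It assumes that the corrective factor d is taken as the difference between the positions of the two density peaks; the binning scheme, the function names `density_peak` and `centroid_offset`, and the toy distances are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

def density_peak(distances, bin_width=0.1):
    """Return the center (angstrom) of the most populated distance bin.

    Each distance is assigned to the nearest multiple of bin_width.
    """
    bins = Counter(round(d / bin_width) for d in distances)
    best_bin = max(bins, key=bins.get)
    return round(best_bin * bin_width, 2)

def centroid_offset(direct_distances, centroid_distances, bin_width=0.1):
    """Offset d between the centroid-based peak and the atom-based peak."""
    return (density_peak(centroid_distances, bin_width)
            - density_peak(direct_distances, bin_width))

# Toy distances in angstrom, invented for illustration only.
n_eps_to_oox = [2.7, 2.8, 2.8, 2.8, 2.9, 3.1]   # atom-to-atom distances
c_zeta_to_oox = [3.7, 3.9, 3.9, 3.9, 4.0, 4.2]  # centroid-to-atom distances

d = centroid_offset(n_eps_to_oox, c_zeta_to_oox)
# Subtracting d brings the centroid-based peak back to the hydrogen-bond peak.
corrected = [round(x - d, 2) for x in c_zeta_to_oox]
print(f"d = {d:.1f} A; corrected peak at {density_peak(corrected):.1f} A")
```

With these toy values the centroid-based peak sits at 3.9 Å, so the offset matches the 1.1 Å reported for GAI in the Experimental Section.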
Another type of secondary
molecular contact occurs when networks
of hydrogen bonds of ionic side chains are in place (Figure E,F). Generally, arginine
serves as a branching unit and is therefore a key node in salt bridge
networks.[1] In our dataset, considering
protein-only contacts and only the salt bridges, about one-third of
GAI and half of I, IMD, and COO are part of a complex network (Table ). Very interestingly,
the numbers we obtain are similar for ligands and proteins, with the
notable exceptions of III and GAI (Table ). We found that two-thirds of the tertiary amine salt bridges are actually ionic networks, whereas for GAI in
ligands, a salt bridge network is seldom present. This is likely to
reflect the characteristics of the binding sites that accommodate
these ligands.
Table 3
Frequency and Number of Ionizable Side Chains within 4.0 Å of the Query Groups, Indicative of Ionic Networks (a)
Columns "none" to "four and more" give the number of ionizable side chains within 4 Å, first as frequency and then as raw numbers.

| query group | SB | frequency: none | one | two | three | four and more | raw numbers: none | one | two | three | four and more |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I (ligand) | 0.36 | 0.44 | 0.19 | 0.24 | 0.10 | 0.03 | 501 | 220 | 272 | 110 | 3 |
| I (protein) | 0.54 | 0.70 | 0.16 | 0.11 | 0.02 | 0.01 | 13 865 | 3332 | 2213 | 436 | 154 |
| II (ligand) | 0.36 | 0.45 | 0.20 | 0.19 | 0.07 | 0.10 | 413 | 181 | 169 | 61 | 89 |
| III (ligand) | 0.64 | 0.83 | 0.11 | 0.06 | 0 | 0 | 816 | 108 | 55 | 3 | 1 |
| IMD (ligand) | 0.50 | 0.64 | 0.18 | 0.14 | 0.02 | 0.02 | 118 | 34 | 26 | 4 | 3 |
| IMD (protein) | 0.38 | 0.45 | 0.21 | 0.18 | 0.08 | 0.08 | 8918 | 4192 | 3653 | 1670 | 1567 |
| GAI (ligand) | 0.09 | 0.23 | 0.07 | 0.36 | 0.11 | 0.21 | 29 | 9 | 48 | 14 | 26 |
| GAI (protein) | 0.28 | 0.41 | 0.17 | 0.24 | 0.09 | 0.09 | 8117 | 3373 | 4916 | 1809 | 1785 |
| COO (ligand) | 0.33 | 0.38 | 0.21 | 0.17 | 0.13 | 0.11 | 375 | 208 | 173 | 129 | 114 |
| COO (protein) | 0.45 | 0.50 | 0.22 | 0.16 | 0.07 | 0.04 | 10 138 | 4457 | 3140 | 1444 | 821 |

(a) “SB” refers to the frequency of ionic networks when only queries involved in at least one salt bridge are considered.
The numbers we obtained for intraprotein salt bridges agree well
with the study of Musafia et al., who reported one-third of all residues
participating in salt bridges to be part of complex salt bridges.[1] In a different study, Donald and co-workers reported
instead that most (over 95%) of the salt bridges are local and not
involved in complex networks,[7] in contrast
to our study and that of Musafia et al., and suggested that this was due
to a methodological difference, that is, a focus on intra-subunit
salt bridges.
Conclusions
This manuscript presents
for the first time a characterization
of the molecular environments of ionizable groups in protein–ligand
complexes, and the data are placed in the light of intra- and inter-subunit
interactions in protein structures. We include in our statistics elements
such as water molecules and weakly ionizable groups, which together
with the increased amount of data resulting from the natural growth
of the PDB, make all aspects of this work novel. The findings in this
manuscript can be summarized by a few principles. Taken together or
individually they have a broad application toward the initial placement
of docking poses, scoring the quality of protein structure or protein–ligand
complexes and positioning water molecules in binding sites.
The data
collected, for protein–ligand
interactions at both 1.5 and 3.0 Å resolution and for intraprotein
interactions at 1.5 Å resolution, show a consistent picture of
the type and frequency of the interacting atoms. A notable difference
in the environment is the over-representation of Oc and Nam in protein
structures. This means that conclusions can be inferred from proteins
about ligand–protein complexes and reciprocally, but also highlights
that caution should be taken when deriving statistical interaction
data.
A sphere of 3.0
Å radius from
point charges carries the majority of information about polar contacts.
The strong polar contacts can be selectively captured by such a method.
This avoids considering potentially noninteracting groups, as can
be seen, for example, from the densities for I and Nam or Ngu (Figures and 8). Using a longer threshold to define molecular interactions,
as is often done in the literature by considering a 4.0 Å threshold,[6,7,44,45] probably obscures the strong charge-reinforced hydrogen-bonding
data.
Acidic and basic
groups interact within
4.0 Å with a counterion in 45–89% of cases for I, II,
GAI, and COO. When functional groups of ionizable character (Oh, Oph,
and Ow) are accounted for, this number increases to above 80%, whereas for
IMD and tertiary amines, it increases to above 70%. Formation of net-neutral
pairs has indeed been demonstrated for arginine–tyrosine pairs
in aprotic environments using a combination of experimental and computational
methods.[46] A parsimonious way to have a
protonated (basic) group at a binding site or in a protein is to have
a proton-donating (acidic) group directly interacting with it. This
could be taken advantage of, for example, in enumerating protonation states in docking simulations.
This study does not characterize what happens in the remaining cases:
interactions with other acidic groups, interactions not seen, for
example, due to crystal packing, or groups that are not ionized.
In particular, phosphate groups are widely present in endogenous ligands[47] and do form charged interactions with the protein.
Tertiary amines have a
specific interaction
shell: they form far fewer salt bridges than, for example, primary
amines (5–16% against 45–54%), although they have roughly
the same pKa. They form fewer observed
polar contacts with the protein than “any atom” from
the ligand. By contrast, water molecules appear to be the most prevalent
strong polar contact made by tertiary amines. There is a strong hydrophobic
component in their binding subsite, which can be inferred from the
prevalence of carbon atom neighbors (Figures and 6), although
it has not been directly studied here. This highlights the role of
accessibility in forming molecular contacts. Contact accessibility
is not taken into account by current scoring functions and would deserve
further study.
Water
molecules play a key role in
the stabilization of polar groups, especially in the absence of salt
bridges. Water molecules are prevalent at binding sites. Their contribution
to binding is critical but difficult to measure, especially in terms
of enthalpic or entropic contributions. This study highlights an interesting
new possible role for water molecules, that is, to act as a counterion
to neutralize ionizable groups through hydrogen bonds that have a
charge-transfer character. This role may also be taken by phenolic
or hydroxyl groups. Quantum chemical calculations are necessary to study
this phenomenon in more detail.
Experimental Section
Computational Tools
All scripts developed for this
study were written in Python 2.7 and are provided as is on
GitHub (https://github.com/ABorrel/saltbridges). All plots and statistical analyses were produced using R
(version 3.2.2).[48] Proteins were
visualized using PyMOL (version 1.4.1),[49] and 3D densities were created using Chimera (version 1.10).[50]
Data Extraction
Crystallographic
complexes were extracted
from the PDB,[19] October 2015 release, 112 968
structures. Structures elucidated by NMR or including DNA or RNA were
not selected. Two global criteria of quality were used for filtering,
a resolution less than either 1.5 or 3.0 Å and an R-free[51] value less than 0.25. These values are standards
for the analysis of proteins or protein–ligand complexes.[7,10,34] Two datasets, named PDB1.5 or
PDB3.0 depending on the resolution range considered, were thus built,
where the PDB1.5 dataset is a subset of the PDB3.0 dataset. To control
the robustness of the statistics obtained for ligand queries, in particular
toward biases that may arise from the presence of close homologues
in the dataset, the complete study was run a second time on the PDB50
release of the PDB, which features no pair of structures with a percentage
of sequence identity above 50%.
All ligands present in the PDB,
about 14 000 ligands, were first queried. Query groups were
identified using in-house scripts. Briefly, to avoid errors due to
incomplete data, the connectivity matrix of each ligand was rebuilt
by defining bonds when the distance between two atoms is less than
1.42 Å. Tertiary amine groups were defined as such when not planar,
that is, when the distance between the N atom and the plane formed
by the three carbon atoms is less than 1.00 Å. These values were
empirically defined at the start of the study based on their distribution
in the PDB (Figure S6). Queries with no
protein interaction (no protein atoms within 4.0 Å) were removed.
Ligand query groups returning a nonbonded interaction (upper limit
4.0 Å) with an ion or any ligand atom were also removed. To eliminate
a source of redundancy, when a ligand (based on the PDB ligand identifier)
was present in several, not necessarily homologous, PDB structures,
the structure with the best resolution was selected. In cases where
several ligands bearing a query group were included in one structure,
the first ligand occurrence in the PDB file was selected. Note that
a single ligand may contain several query groups.
Query groups
and their environments were also retrieved from protein-only
structural data (both intrachain and interchain contacts for a given
PDB file). In that setup, protein query groups were deduced directly
from the atom names in the PDB file. Because there are plenty of data,
to limit the computational workload, protein-only contacts were limited
to 20 000 randomly extracted samples for each query group.
The extraction process for each query group was repeated five times
with different random seeds, and nearly identical results were obtained.
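The geometric test used above to identify tertiary amines relies on the distance between the nitrogen atom and the plane of its three carbon neighbors. The sketch below shows one way to compute that distance; the function name `out_of_plane_distance`, the constant name, and the toy coordinates are hypothetical, and the comparison with the 1.00 Å cutoff simply follows the wording of the text.

```python
import math

PLANE_CUTOFF = 1.00  # angstrom, cutoff value quoted in the text

def out_of_plane_distance(n, c1, c2, c3):
    """Distance (angstrom) from atom n to the plane through c1, c2, and c3."""
    u = [c2[k] - c1[k] for k in range(3)]
    v = [c3[k] - c1[k] for k in range(3)]
    # The plane normal is the cross product of two in-plane vectors.
    normal = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    length = math.sqrt(sum(x * x for x in normal))
    if length == 0.0:
        raise ValueError("the three carbon atoms are collinear")
    w = [n[k] - c1[k] for k in range(3)]
    # Project the N position onto the unit normal.
    return abs(sum(w[k] * normal[k] for k in range(3))) / length

# Toy coordinates (invented): three carbons in the z = 0 plane.
carbons = [(1.4, 0.0, 0.0), (-0.7, 1.212, 0.0), (-0.7, -1.212, 0.0)]
pyramidal_n = (0.0, 0.0, 0.45)
flat_n = (0.0, 0.0, 0.0)

for label, n_atom in (("pyramidal", pyramidal_n), ("flat", flat_n)):
    dist = out_of_plane_distance(n_atom, *carbons)
    print(f"{label}: {dist:.2f} A, below cutoff: {dist < PLANE_CUTOFF}")
```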
Definition of Molecular Environment
The molecular environment
of the query groups was defined by all atoms present within the sphere(s)
centered on either a point charge atom for queries I, II, and III
or a single point (a centroid) representing the functional group for
queries IMD, GAI, and COO. A centroid is used for these latter groups
to avoid combining the interacting environment of individual atoms.
For IMD, the centroid was defined by the center of mass of the side-chain
aromatic nitrogen atoms, for GAI by the Cζ carbon, and for COO
by the center of mass of the side-chain carboxylate oxygen atoms.
Twelve protein atom types, deduced from the PDB file annotations,
were used to describe the environments: Oox, carboxyl oxygen atoms;
Oh, hydroxyl oxygen atoms; Oph, phenol oxygen atoms; Ow, water molecule
oxygen atoms; Oc, side-chain or main-chain carbonyl oxygen atoms;
Nam, side-chain or main-chain amide nitrogen atoms; Nim, IMD nitrogen
atoms; Ngu, GAI nitrogen atoms; NaI, primary amine nitrogen atoms;
Car, aromatic carbons; Su, sulfur atoms; and Xot, remaining carbon
atoms (see Table for
a complete description).
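As an illustration of the environment definition given above, the following sketch computes the sphere center for protein query groups and collects the atoms inside the sphere. The mapping from residue names to center atoms uses standard PDB atom names and is an assumption about how the definitions translate into code; the function names and toy coordinates are hypothetical. Because each centroid is built from atoms of the same element, the center of mass reduces to the plain average of the coordinates.

```python
import math

# Atoms defining the sphere center, by residue (standard PDB atom names).
CENTER_ATOMS = {
    "LYS": ("NZ",),          # I: point charge atom
    "HIS": ("ND1", "NE2"),   # IMD: side-chain aromatic nitrogen atoms
    "ARG": ("CZ",),          # GAI: the C-zeta carbon
    "ASP": ("OD1", "OD2"),   # COO: carboxylate oxygen atoms
    "GLU": ("OE1", "OE2"),   # COO: carboxylate oxygen atoms
}

def query_center(resname, atoms):
    """Return the sphere center for a residue.

    `atoms` maps atom names to (x, y, z) coordinates (hypothetical layout).
    """
    points = [atoms[name] for name in CENTER_ATOMS[resname]]
    return tuple(sum(p[k] for p in points) / len(points) for k in range(3))

def atoms_within(center, candidates, radius):
    """Return (name, distance) for candidate atoms inside the sphere."""
    found = []
    for name, xyz in candidates:
        dist = math.dist(center, xyz)
        if dist <= radius:
            found.append((name, dist))
    return found

# Toy coordinates, invented for illustration.
asp = {"CG": (1.0, 1.0, 0.0), "OD1": (0.0, 0.0, 0.0), "OD2": (2.0, 0.0, 0.0)}
center = query_center("ASP", asp)
neighbors = atoms_within(center, [("NZ", (1.0, 0.0, 3.0)), ("CA", (1.0, 0.0, 6.5))], 4.0)
print(center, neighbors)
```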
Table 4
Protein Atom Types Used in This Study

| atoms | atoms in PDB format with the corresponding amino acid | atom type abbreviation |
|---|---|---|
| oxygen in carboxylate | glutamic acid (OE1, OE2), aspartic acid (OD1, OD2) | Oox |
| oxygen in water molecule | HOH (O) | Ow |
| oxygen in hydroxyl | threonine (OG1), serine (OG) | Oh |
| oxygen in phenol | tyrosine (OH) | Oph |
| oxygen in carbonyl | protein main chain (O), asparagine (OD1), glutamine (OE1) | Oc |
| nitrogen in amide | asparagine (ND2), glutamine (NE2), protein main chain (N) | Nam |
| carbons not included in the above-mentioned groups | | Xot |
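The polar atom types listed in Table 4 can be read as a lookup on residue and atom names. The sketch below is one possible encoding, using standard PDB atom names (asparagine OD1 and glutamine OE1 for the side-chain carbonyl oxygen atoms); the dictionary, the function name `atom_type`, and the decision to return `None` for atoms outside these rows are assumptions made for illustration.

```python
# Side-chain and water atoms, keyed by (residue name, atom name).
TYPED_ATOMS = {
    ("GLU", "OE1"): "Oox", ("GLU", "OE2"): "Oox",
    ("ASP", "OD1"): "Oox", ("ASP", "OD2"): "Oox",
    ("HOH", "O"): "Ow",
    ("THR", "OG1"): "Oh", ("SER", "OG"): "Oh",
    ("TYR", "OH"): "Oph",
    ("ASN", "OD1"): "Oc", ("GLN", "OE1"): "Oc",
    ("ASN", "ND2"): "Nam", ("GLN", "NE2"): "Nam",
}

def atom_type(resname, atom_name):
    """Return the atom type for the polar rows of the table, or None.

    Atom types not covered by this sketch (Nim, Ngu, NaI, Car, Su, Xot)
    yield None.
    """
    key = (resname, atom_name)
    if key in TYPED_ATOMS:      # checked first so that water O is not
        return TYPED_ATOMS[key]  # mistaken for a main-chain carbonyl
    if atom_name == "O":        # main-chain carbonyl oxygen
        return "Oc"
    if atom_name == "N":        # main-chain amide nitrogen
        return "Nam"
    return None

for res, atom in (("HOH", "O"), ("ALA", "O"), ("ASN", "OD1"), ("LYS", "NZ")):
    print(res, atom, atom_type(res, atom))
```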
Three types of analyses were conducted using
both PDB1.5 and PDB3.0
datasets for both ligand queries and protein queries: (i) for each
atom type, we measured if at least one representative was found near
the query groups I, II, III, IMD, and GAI as well as COO; (ii) we
collected the relative densities of the presence of a given atom type
within a sphere collection radius, up to 6.0 Å; and (iii) we
investigated the composition of the neighborhood in terms of atom
type frequency at 3.0 and 4.0 Å. These values were chosen because
it is common practice to use a 4.0 Å sphere when studying salt
bridges,[44,45] and a sphere of 3.0 Å radius allows
a focus on stronger (shorter) hydrogen-bonded interactions. For centroids,
the radii of the spheres used for data collection were corrected by
subtracting an empirically defined distance d that
cancels the offset introduced by the use of centroids (Figure ). Distance d takes the values +1.0 Å for IMD, +1.1 Å for GAI, and +0.8
Å for COO.
The so-called
“null environments”
were defined as references and used to compare the environment seen
by each ligand and protein query group with the environment seen by
(i) any ligand atom and (ii) any protein atom. For ligand queries,
the environments of all ligand atoms, a total of 126 808 atoms
at 3.0 Å of resolution and 10 314 atoms at 1.5 Å
resolution, were extracted. For proteins, the null environments were
defined using a set of 200 000 random protein atoms. The comparison
of null environments against query group environments (global counts
of occurrence by atom types, grouped together by the query group)
was conducted using statistical tests for contingency tables.
For large samples (more than one thousand data points),
a Pearson’s chi-square test was performed. For smaller
samples, the exact goodness-of-fit test was preferred. In the case
of multinomial tests from a contingency table containing more than
2 × 2 entries, a Bonferroni correction was applied to the p-value thresholds of significance. For further information
about the statistical methods used, see ref (52).
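The comparison against the null environment can be sketched as follows for the large-sample case. This is a simplified illustration, not the R analysis used in this study: it assumes one 2 × 2 table per atom type (presence or absence near the query versus near the null atoms), a Pearson chi-square statistic without continuity correction, and a Bonferroni-adjusted threshold equal to alpha divided by the number of atom types tested. The function names and counts are invented.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column of the table sums to zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, the survival function is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def significant_types(query_counts, null_counts, n_query, n_null, alpha=0.05):
    """Flag atom types whose presence differs from the null environment.

    The threshold is Bonferroni-corrected by the number of atom types tested.
    """
    threshold = alpha / len(query_counts)
    flags = {}
    for atype, q_with in query_counts.items():
        null_with = null_counts[atype]
        _, p = chi_square_2x2(q_with, n_query - q_with,
                              null_with, n_null - null_with)
        flags[atype] = p < threshold
    return flags

# Toy counts, invented for illustration (neighborhoods with at least one atom).
result = significant_types({"Oox": 450, "Ow": 600}, {"Oox": 200, "Ow": 605},
                           n_query=1000, n_null=1000)
print(result)
```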