Allosteric effects can modulate the biological activity of a protein. Thus, the discovery of new allosteric sites is very attractive for designing new modulators or inhibitors. Here, we propose an innovative way to identify allosteric sites, based on crystallization additives (CA), which are used to stabilize proteins during the crystallization process. Density and clustering analyses of CA, applied to the protein kinase and nuclear receptor families, revealed that CA are not randomly distributed around protein structures but tend to aggregate near common sites. All orthosteric and allosteric cavities described in the literature are retrieved from the analysis of CA distribution. In addition, new sites were identified, which could correspond to putative allosteric sites. We thus propose an efficient and easy way to use the structural information of CA to identify allosteric sites. This method could assist medicinal chemists in the design of new allosteric compounds targeting cavities of new drug targets.
Proteins are fundamental entities in an
organism, controlling both normal
cellular function and disease processes. These biological entities are
divided into families according to their amino acid sequences, three-dimensional
(3D) structural motifs, and primary functions. Proteins perform their
biological function through interactions with other proteins, nucleic
acids, or small ligands. The interactions between a protein and a
ligand are often located in a well-defined active site or cavity of
the protein, the so-called orthosteric site. These interactions are
directly associated with the modulation of protein function. However, small
ligands can also bind to other protein sites, called allosteric sites,
distant from the active site. This binding can induce a conformational
change in the protein structure, resulting in an increase or decrease
of its intrinsic activity.[1,2] The term “allostery”
was introduced in 1961,[3,4] although Christian Bohr had already
described the process in the early 20th century as the
“Bohr effect”, related to the conformational change of
hemoglobin. Since then, allostery has progressively evolved
into a unified concept[5] associated with its
main property: the conformational change. It is an integral part
of the protein dynamics and may be present in every protein in the
living world.[6−8] Not surprisingly, allostery has attracted great interest
in pharmaceutical research, especially in identifying allosteric sites
in proteins and/or developing allosteric drugs. The latter could present
some advantages compared to drugs targeting orthosteric sites, such
as a greater specificity, fewer side effects, and an easier up- and
down-regulation of proteins.[9] In some cases,
this interest in allosteric approaches to drug discovery has led
to successful results. Indeed, in 2004, cinacalcet was the first allosteric
drug approved by the Food and Drug Administration (FDA). This positive
allosteric modulator targets the calcium-sensing receptor belonging
to the GPCR family for the treatment of hyperparathyroidism.[10] Interest in allostery is well described in the
protein kinase (PK) family, a major therapeutic target due to its
involvement in several diseases such as cancer.[11,12] Most protein kinase inhibitors approved by the FDA or in clinical
trials target the orthosteric adenosine triphosphate (ATP)
binding site.[13] However, several allosteric
sites have been identified in PKs such as ABL, CK2α, FLT3, and
MEK.[14−19] Several allosteric kinase inhibitors have already been approved
by the FDA, such as trametinib, cobimetinib, and binimetinib in 2013,
2015, and 2018, respectively, for the treatment of patients with metastatic
melanoma involving a BRAF V600E or V600K mutation.[12,17]
Several databases[20] and
benchmarks[21] are now available to help
in the identification
of allosteric cavities through computational approaches. The approaches
developed or adapted specifically for this objective include normal mode
analysis, the Gaussian network model,[22,23] and the binding
leverage approach[24] and are based on the
calculation of protein cavity volumes. A computational mapping protocol,
the multiple copy simultaneous search (MCSS), was also published in
1996. In this methodology, thousands of ligands are minimized around
a protein structure to identify the main binding sites.[25] In this work, we propose a novel computational
approach to identify allosteric cavities in a protein family based
on the presence of experimental crystallization additives (CA). Initially,
those molecules are present together with ions, buffers, and solvent
to facilitate the crystallization process of proteins or protein–ligand
complexes.[26] Interestingly, those molecules
are not always randomly distributed around the structures but seem
to be located in protein hotspots, especially near the binding cavities.[27] While they cannot be used directly in fragment-based drug design (FBDD) projects,
their binding reveals some key features of the interactions of drug-like
ligands or fragments.[28] Some previous experimental
mapping studies (multiple solvent crystal structures; MSCS) on crystalline
proteins have demonstrated the ability of those additives to bind
into interesting regions of the protein surface.[25,29,30] Thus, we decided to evaluate the relationship
between the sites where CA are located and the known orthosteric and
allosteric sites in a protein family. Here, we focused on two protein
families that have been substantially crystallized: the PK and the
nuclear receptor (NR) families. In the first step, starting from a
dataset built from several databases, the CA distribution was evaluated
within 3D structures of PKs and NRs, aligned on a single reference
protein, to determine where the CA are located. Then, using a clustering
approach, we assessed whether those CA are colocated with known allosteric ligands
(AL) already identified in the PK and NR families.
Interestingly, we found that a strong presence of
CA in cavities of experimentally determined protein structures corresponds
to known orthosteric and allosteric sites. Therefore, this study suggests
a novel approach to identify allosteric sites.
Results and Discussion
PK Family: A Model for Allostery
PKs constitute one
of the most studied protein families, and their broad involvement
in several diseases (cancer, inflammation, Alzheimer’s disease,
etc.) has driven the search for novel therapeutic drugs targeting the
catalytic binding sites or allosteric cavities. In this study, we
focused on PKs since this protein family exhibits well-defined orthosteric
and allosteric cavities. Indeed, PKs belong to the transferase superfamily,
catalyzing the phosphate transfer to a protein substrate. Through
the catalytic mechanisms and regulation of PKs, the ATP molecule located
in the orthosteric site, bound to the hinge region (Figure , orange), will transfer the
γ phosphate group to a protein substrate. PK inhibitors designed
to target this orthosteric site and to compete with ATP are classified
as type I and bind the active conformation of the protein kinase.
In addition to this major active site, several other cavities have
been described during the past years, distributed all around the kinase
structure.[16,17] In this study, 18 PDB structures
representing various groups and subgroups of the PK family were
used as references for the definition of those allosteric sites:
Figure 1
Structural properties of protein kinase family. (a) Structural
common motifs in protein kinases, characterizing the active site.
(b) Representation of cavities described in the literature (orange,
orthosteric cavity; purple, back pocket; cyan, P-loop pocket; yellow,
PIF pocket; gray, DEF pocket; red, myristate pocket; and green, substrate
pocket).
The so-called back pocket (Figure , purple) is close
to the orthosteric site. This back pocket is targeted by type II inhibitors,
which bind to both the orthosteric site and the back pocket in the inactive conformation
of the kinase, and by type III inhibitors, which bind exclusively to the
back pocket.[31] Here, the back pocket is
represented by PDB IDs 3O96 (AKT1), 3LW0 (IGF1R), 1S9J (MEK), 4LMN (MEK), 3EQC (MEK), 4ITH (RIPK1), and 4ZJI (PAK1).
The myristate pocket (Figure , red) is located in the C-terminal
lobe of the ABL protein kinase and is targeted by type IV inhibitors.
This pocket is associated with one subfamily of PKs, ABL (PDB IDs 3MS9, 3K5V, and 3PYY).
One additional pocket of interest
is the substrate pocket (Figure , green), in which the protein to be phosphorylated
by the PK usually binds. Type V inhibitors target this site, and PDB
IDs 3JVR (CHK1)
and 3F9N (CHK1)
each contain an allosteric inhibitor bound in this pocket.
The DEF (docking site for ERK, FXF)
pocket facilitates substrate recognition[32] and is represented by PDB IDs 4E6C (MAPK14) and 3O2M (JNK1).
The last cavities are located in the
N-terminal part of PKs: the PIF (PDK1 interacting fragment) pocket,[33] illustrated by PDB IDs 3PXF (CDK2) and 3HRF (PDK1), and the P-loop
pocket, exemplified by PDB IDs 3H30 (CK2a1) and 4CFE (AMPKα1/2).
CA in Kinase Family
Generally, 3D structures of complexes
may also contain small molecules such as conventional ligands and/or
CA. The latter are involved in the crystallization process, and their
coordinates are obtained from the electron density map, as for
protein and ligand atoms. In most in silico studies such as docking,
the 3D structural information of CA is ignored during calculations.
In this study, we focused on the spatial positions of CA in several
crystallographic structures of PKs to identify allosteric sites. Eight
hundred forty-five diverse structures of protein–ligand complexes
were retrieved from the PDB, and a total of 2459 CA molecules (CA
dataset) are present in these complexes. Among these CA, we mainly
noticed polar compounds such as ethylene glycol or glycerol but also
sugars (β-octylglucoside) that present amphipathic properties
(an exhaustive list is provided in Table S1). For each CA, we considered only its geometric center (centroid)
to simplify the structural analysis. The distribution of CA centroids
was evaluated around and inside PK structures (Figure ). An analysis of the aligned protein structures
shows that CA could be placed in three different positions: (i) far
away from the protein surface, (ii) at the solvent-exposed protein
surface, or (iii) in cavities inside the protein structure (Figure a). We analyzed the
density of the CA distribution around several aligned structures. The
distribution of the CA centroids revealed only two main cavities
with more than 80% density of CA centroids, both of which are surface-exposed
(details of the methodology are given in the Supporting Information). The first one is in the C-terminal
part of the PK and does not correspond to an allosteric pocket already
identified in the literature. The second one points toward the α-E
helix (Figure b) and
is located near the “peptide” pocket. In 2004, Heo et
al. published the crystal structure of JNK1 (PDB ID 1UKI), a PK that binds
to a peptide of the JIP1 protein, and this interaction plays a role
in the JNK1 phosphorylation activity.[34] The interface formed by this protein–protein interaction
is also an allosteric site.[16,34] When the density threshold is lowered
to 50%, CA aggregate in three other sites. Two centroids are
very close together and are located in the DEF pocket. The last centroid
is found near the substrate pocket.
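The two steps described above, reducing each CA to its centroid and locating regions of high centroid density, can be illustrated with a minimal Python sketch. This is an illustration only and not the protocol used in this study, whose details are given in the Supporting Information. In particular, the cubic-voxel grid, the 2 Å voxel size, the definition of density as occupancy relative to the most populated voxel, and all coordinates are assumptions made for the example.

```python
from collections import Counter
from math import floor


def centroid(atoms):
    """Geometric center of a list of (x, y, z) atom coordinates, in Å."""
    n = len(atoms)
    return tuple(sum(a[i] for a in atoms) / n for i in range(3))


def voxel_density(centroids, voxel=2.0):
    """Count centroids per cubic voxel and scale by the busiest voxel.

    Assumption: "density" is taken here as occupancy relative to the
    most populated voxel; the study's own definition may differ.
    """
    counts = Counter(
        tuple(floor(c / voxel) for c in point) for point in centroids
    )
    top = max(counts.values())
    return {cell: n / top for cell, n in counts.items()}


def hotspots(density, threshold):
    """Voxels whose relative density is at least `threshold` (0 to 1)."""
    return sorted(cell for cell, d in density.items() if d >= threshold)


# Hypothetical data: one three-atom additive and nine made-up centroids
# that are assumed to be already aligned on a common reference frame.
additive = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
points = [(0.5, 0.5, 0.5)] * 5 + [(4.5, 0.5, 0.5)] * 3 + [(9.0, 9.0, 9.0)]

density = voxel_density(points)
print(centroid(additive))      # center of the toy additive
print(hotspots(density, 0.8))  # voxels above the 80% threshold
print(hotspots(density, 0.5))  # voxels above the 50% threshold
```

Lowering the threshold from 0.8 to 0.5 in this toy example adds a second voxel, which mirrors the way additional sites appear in the analysis when the density threshold is relaxed.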
Figure 2
Distribution of CA centroids around a
PK (PDB ID 1K5V). (a) Representation
of all CA centroids. (b) Position of centroids with densities greater
than 80% (centroids indicated by an arrow) and 50% (all purple spheres).
Thus, in the case of PKs, it appears that CA are
not always distributed
arbitrarily but seem to be attracted to some conserved areas on the
protein structure, such as the surface or deep cavities. This observation
is in agreement with a recent paper proposing that fragments
and CA often bind in the same way as drug-like ligands in four proteins
(BACE2, CLK2, TYR1, and CAH2).[27] Moreover,
in some cases, these attractive regions in PKs could be related to
allosteric sites such as the peptide, substrate, and DEF pockets.
CA: A Kinase Allosteric Identifier
In proteins, orthosteric
ligands (OL) and AL bind to well-defined orthosteric and allosteric
sites, respectively. Based on the density of centroids, our preliminary
results suggest the existence of preferential CA sites. A second analysis
was performed to determine whether there is a relation between
the position of CA molecules on the PK surface and the various sites
occupied by OL and AL. A ligand dataset was built from 864 diverse
structures of PK-ligand complexes containing exclusively OL and AL
(1049 ligands). It is important to note that we only considered
complexes that have at least one AL; in some cases, an OL was
also present in the crystal structure, as in PDB ID 4AN2, where ATP and cobimetinib
are bound to the MEK1 crystal structure. This dataset was joined to
the CA dataset, guided by an alignment of all PKs to the same reference
structure, PDB ID 1ATP. Focusing on CA, AL, and OL centroids, two clustering analyses
were performed using a density-based algorithm (DBSCAN),[35] first on the CA dataset and then on the combined
CA and ligand datasets (Figure S1). The clustering parameters
were scanned by gradually changing the minimal
number of points in a cluster (minpts from 1 to 4) and the distance
between two points in a cluster (ε from 0.5 to 3 Å).
From the results, we observed three different types of clusters:
(i) clusters containing only centroids of CA from the CA dataset,
(ii) clusters containing only centroids of ligands (AL and OL)
from the ligand dataset (these two types are called homogeneous
clusters), and (iii) clusters containing centroids from both datasets
(called heterogeneous clusters). Heterogeneous clusters indicate that
CA probably occupy the same site as ligands (AL and OL). Heterogeneous
cluster IDs depend on the clustering parameters, and three
parameter pairs [minpts - ε] yield the most populated clusters
([1 - 2], [2 - 2], and [3 - 2]) (Figure S3). The 18 reference PK complexes containing AL were used to optimize
the parameters. Each of these three parameter pairs gives at least eight heterogeneous clusters containing both reference
ligands and CA. The parameter pair [3 - 2] provides the largest common
cluster since there is a large number of CA and ligands in the same
sites.
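To make the parameter scan concrete, the following Python sketch implements a textbook version of DBSCAN on 3D centroids and loops over minpts and ε. It is a minimal illustration and not the implementation used in this study. The 0.5 Å step for ε is an assumption (only the range is stated above), and the toy coordinates are hypothetical.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 3D points; returns one label per point.

    Label -1 marks noise. A point is a core point when at least
    `min_pts` points (itself included) lie within `eps` of it.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical centroids (Å): two compact sites and one isolated point.
points = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0),
    (10.0, 0.0, 0.0), (11.0, 0.0, 0.0), (10.0, 1.5, 0.0),
    (30.0, 30.0, 30.0),
]

# Scan minpts from 1 to 4 and eps from 0.5 to 3 Å (0.5 Å step assumed).
for min_pts in range(1, 5):
    for eps in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
        labels = dbscan(points, eps, min_pts)
        n_clusters = len(set(labels) - {NOISE})
        print(min_pts, eps, n_clusters, labels.count(NOISE))
```

With minpts = 1 every point is a core point, so isolated centroids form single-member clusters instead of being flagged as noise; this is one reason why the number and identity of clusters depend on the parameter pair.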
To compare the results obtained from the density analysis
and the clustering method, an unsupervised clustering was first carried
out on the CA dataset with the parameter pair [3 - 2]. Under these
conditions, about 33% of the CA cannot be clustered. The remaining
CA are grouped into 118 clusters, ranked by their population,
and the most populated cluster is located in the substrate pocket,
a site that had already been identified from the density analysis with
a density threshold of 50%.
Adding the ligand dataset (AL and
OL) to the CA dataset in the
clustering method also provides 118 different clusters, comprising
both homogeneous and heterogeneous clusters. Eight hundred fourteen
compounds (i.e., 23%), mainly AL, are identified as singletons. We
observed that all the known pockets in PK are found in the 20 densest
clusters, among 117 clusters in total (see Figure S4). Not surprisingly, OL are in the most populated cluster
(∼31% of all compounds) and are grouped with a small number
of CA in this heterogeneous cluster (cluster 0; Figure a and Figure S4). Thus, the ATP binding site is also a cavity that can accommodate
CA, which is consistent with the study of Drwal et al.[27] The next most populated (Figure S4) heterogeneous clusters present interesting results.
Indeed, we found that almost all the heterogeneous clusters define
an allosteric site described in the literature and experimentally
identified in crystal structures. Again, the substrate pocket appears
as an important attractive region for CA since this pocket contains
centroids of cluster 1, the second most populated cluster (Figure d and Table S2). Of the two reference ligands
present in this pocket, only the centroid of the allosteric CHEK1
inhibitor (PDB ID 3F9N) is present in cluster 1. The other allosteric CHEK1 inhibitor (PDB
ID 3JVR) partially
occupies this pocket, and its centroid was not detected in the clusters.
We also observed that cluster 103, close to cluster 1, is located
in the same substrate pocket. Thus, many clusters can point toward
the same allosteric site.
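The bookkeeping behind statements such as "about 33% of the CA cannot be clustered" or "the second most populated cluster" can be sketched in a few lines of Python. This is an illustration under assumptions: the label list is hypothetical, noise is encoded as -1, and renumbering clusters so that ID 0 is the most populated is a convention assumed here for the example.

```python
from collections import Counter


def summarize(labels, noise=-1):
    """Return (fraction of unclustered points, clusters ranked by size)."""
    sizes = Counter(l for l in labels if l != noise)
    unclustered = labels.count(noise) / len(labels)
    # most_common() sorts by population, largest cluster first
    return unclustered, sizes.most_common()


def rename_by_rank(labels, noise=-1):
    """Relabel clusters so that 0 is the most populated, 1 the next, etc."""
    _, ranked = summarize(labels, noise)
    new_id = {old: rank for rank, (old, _) in enumerate(ranked)}
    return [noise if l == noise else new_id[l] for l in labels]


# Hypothetical clustering output: three clusters and three singletons.
labels = [0, 0, 1, 1, 1, 1, 2, 2, 2, -1, -1, -1]

fraction, ranked = summarize(labels)
print(fraction)                # share of points left unclustered
print(ranked)                  # (cluster ID, population), largest first
print(rename_by_rank(labels))  # IDs reordered by population
```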
Figure 3
Representation of heterogeneous clusters on
a kinase structure
(PDB ID 1K5V). (a) Orthosteric site represented by cluster 0 in black. Density
points are still present in purple. (b) Heterogeneous clusters are
displayed on the N-terminal region of the protein kinase. Reference
ligands bound to AMPK (PDB ID 4CFE, left) and PDK1 (PDB ID 3HRF, right) are represented
in sticks. (c) Heterogeneous clusters are displayed on the hinge region.
Reference ligands bound to MEK1 (PDB ID 3EQC, left) and AKT1 (PDB ID 3O96, right) are represented
in sticks. (d) Heterogeneous clusters are displayed on the C-terminal
part. Reference ligands bound to ABL1 (PDB ID 3K5V, left) and MAPK14
(PDB ID 4E6C, right) are represented in sticks. Cluster IDs are indicated, and
centroids are color-coded based on their cluster IDs.
Another example concerns the large DEF pocket,
which presents four
different cluster IDs (2, 34, 116, and 117). In this case, the reference
ligands are correctly retrieved in clusters 34 and 117 (Table S2). It is important to note that these
reference ligands are bound to kinases classified in two different
PK subgroups: MAPK14 (p38α) and JNK. There are major structural
differences in the large DEF pocket between the two crystal structures
(PDB ID 4E6C and PDB ID 3O2M), which explain the different binding modes observed for the two
allosteric inhibitors (reference ligands). The substrate and DEF
pockets were also identified above from the density analysis. Moreover,
the clustering analysis allows the detection of additional allosteric
pockets. As shown in Figure d, cluster 19 indicates the position of the myristate pocket
since this cluster is located at the same position as the reference
ligand (myristate), crystallized in the ABL subgroup (PDB ID 3K5V). However, the other
AL of ABL selected as reference ligands (PDB IDs 3MS9 and 3PYY) were not classified
and were considered as singletons. This can be explained by the fact
that the three ligands are not fully superimposed. The ligand cocrystallized
in PDB ID 3K5V has a centroid located 3.7 and 5.3 Å from the ligand centroids
of PDB IDs 3MS9 and 3PYY,
respectively.
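The role of the distance threshold in this observation can be checked with a short Python sketch. The coordinates below are hypothetical and were chosen only to reproduce the two reported centroid-to-centroid distances; they are not taken from the PDB entries.

```python
from math import dist

EPS = 2.0  # Å, the ε of the [3 - 2] parameter pair

# Hypothetical coordinates reproducing the reported distances
# (3.7 and 5.3 Å from the centroid of the 3K5V ligand).
centroids = {
    "3K5V": (0.0, 0.0, 0.0),
    "3MS9": (3.7, 0.0, 0.0),
    "3PYY": (0.0, 5.3, 0.0),
}


def within_eps(a, b, eps=EPS):
    """True when two centroids are direct DBSCAN neighbors."""
    return dist(centroids[a], centroids[b]) <= eps


for other in ("3MS9", "3PYY"):
    d = dist(centroids["3K5V"], centroids[other])
    print(other, round(d, 1), within_eps("3K5V", other))
```

Both distances exceed ε = 2 Å, so neither centroid is a direct neighbor of the 3K5V ligand centroid. Since DBSCAN can also link points through chains of intermediate neighbors, this check shows only that a direct link is excluded; the singleton status additionally implies that no such chain of CA centroids connects them.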
On the N-terminal lobe of the protein kinase (Figure b), clusters 6 and
93 are correctly
positioned on the P-loop pocket, and clusters 78 and 9 are positioned
on the PIF pocket. In fact, reference ligands are included in the
different clusters corresponding to known allosteric sites (Table S2). In addition to the PIF and P-loop
pockets, we also identified cluster 10, which did not correspond to
any reference ligand. This cluster highlighted a new site, which was
described very recently in the literature and therefore not yet included
in our database.[36] This pocket is occupied
by ligands involved in an allosteric mechanism of Aurora kinase inhibition
and could be a new site of interest for other PKs.
In the case of the back pocket, the cavity volume is large due
to structural variations present in different PKs and due to different
chemical structures of type II inhibitors. In our study, we detected
three different clusters in the back pocket (Figure c). Among our reference ligands, the AL of
AKT1 (PDB ID 3O96) is detected by the clustering method (cluster 18). This ligand
binds in a rather different position from the AL of MEK, RIPK, or
PAK, the other ligands binding to the back pocket. We also identified
cluster 49 in close proximity to cluster 18. The AL of AKT occupies
a space defined by the two clusters. Hence, those clusters seem to
correctly define the allosteric pocket of the AKT subgroup. The last
cluster, cluster 8, is also located in the back pocket and contains
an AL of the MAPK group, namely a ligand recently
discovered as an allosteric inhibitor of the ERK5 and MAPK7 proteins.[37] Unfortunately, the MEK reference ligands (MAPK
subgroup - PDB IDs 1S9J, 4LMN, and 3EQC) were not found
in this cluster, probably due to a bias induced by the use
of centroids instead of whole ligands. The back pocket is in close
vicinity of the orthosteric site. During the clustering step, the
centroids of the OL are not correctly distinguished from centroids
of the AL located in the back pocket. For this reason, AL of MEK (PDB
IDs 1S9J, 4LMN, and 3EQC), RIPK1 (4ITH), and PAK (4ZJI), used as references,
were classified in the orthosteric cluster (cluster 0). To avoid this
problem, OL were removed using a pharmacophore search defined near
the hinge region. Clustering of datasets without OL provides more
meaningful results since AL present in the back pocket were mainly
grouped in cluster 0 and the AL of AKT (PDB ID 3O96) in cluster 19 (Table S2). The other allosteric sites were not
modified, and for the DEF pocket, for example, reference AL were correctly
identified in clusters 37 and 120.
Homogeneous clusters, formed
only by CA molecules, account for the largest
number of centroids per cluster (Table S2 and Figure S4). They contain CA that cannot be clustered
with ligands (neither AL nor OL). According to the analysis of heterogeneous
clusters, the different allosteric sites were found in clusters containing
more than 25 centroids. Above this threshold, all known allosteric
pockets were identified by the clustering method, meaning that all
allosteric sites are retrieved in the top 20 clusters (Figure S4). In the analysis of homogeneous clusters,
clusters of this size (≥ 25 centroids) were also found
in allosteric sites in PKs (Figure ). The homogeneous cluster 5, located near the α-E
helix, represents the peptide pocket, identified above from density
analysis. Cluster 11 is similar to the heterogeneous cluster 10, which
defined a cavity already identified in the literature as an allosteric
site.[36] Clusters 3, 4, 7, and 12 revealed
three other sites at the C-terminal part of PKs. The first one, cluster
3, is located at the end of the α-E helix, and the second, cluster
12, is located at the C-terminal lobe of PKs. The third site, formed
by clusters 4 and 7, is also at the bottom of PKs, and interestingly,
this site was identified as an important attractive cavity for CA
in PKs using the density analysis. Those three sites are described
for the first time by our approach and were described neither in our reference
structures nor in the literature. They could potentially be considered
as novel binding sites of interest.
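The distinction between homogeneous and heterogeneous clusters, together with the size threshold of 25 centroids, can be expressed as a simple filter. The Python sketch below is an illustration under assumptions: the function names, the source tags ("CA", "AL", "OL"), and the toy input are hypothetical and do not come from the original workflow.

```python
from collections import defaultdict


def cluster_composition(labels, sources, noise=-1):
    """Group the source tag ("CA", "AL" or "OL") of each centroid by cluster."""
    members = defaultdict(list)
    for label, source in zip(labels, sources):
        if label != noise:
            members[label].append(source)
    return dict(members)


def classify(tags):
    """'heterogeneous' when CA share a cluster with ligands, else 'homogeneous'."""
    has_ca = "CA" in tags
    has_ligand = any(t in ("AL", "OL") for t in tags)
    return "heterogeneous" if has_ca and has_ligand else "homogeneous"


def candidate_sites(labels, sources, min_size=25):
    """CA-only clusters holding at least `min_size` centroids."""
    comp = cluster_composition(labels, sources)
    return sorted(
        cid for cid, tags in comp.items()
        if len(tags) >= min_size and set(tags) == {"CA"}
    )


# Hypothetical input: cluster 0 mixes OL and CA, cluster 1 is a large
# CA-only cluster, cluster 2 is a small CA-only cluster, and one AL is noise.
labels = [0] * 30 + [1] * 26 + [2] * 4 + [-1]
sources = ["OL"] * 28 + ["CA"] * 2 + ["CA"] * 26 + ["CA"] * 4 + ["AL"]

comp = cluster_composition(labels, sources)
print({cid: classify(tags) for cid, tags in comp.items()})
print(candidate_sites(labels, sources))
```

Only the large CA-only cluster passes the filter in this toy example, which corresponds to the kind of site proposed above as a potential novel binding site.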
Figure 4
Representation of homogeneous clusters
on a protein kinase (PDB
ID 1K5V). Cluster
IDs are indicated by arrows.
Finally, we evaluated the overlap of CA and OL
on the one hand and
the overlap of CA and AL on the other hand. The CA distribution,
via density analysis, revealed that CA are preferentially grouped
in close proximity to the DEF and substrate pockets. The clustering analysis
goes further and suggests that CA can be located in the orthosteric
site and in the six known allosteric sites defined by the reference
structures. Thus, considering CA positions can help to locate the orthosteric
cavity and, more importantly, allosteric sites in PKs. Moreover, both
density and clustering analyses highlighted some new sites, whose
allosteric properties have not yet been studied but which could
be targeted using de novo approaches for the design of novel allosteric
ligands. However, those sites may simply be
attractive hotspots without any allosteric function. Experimental validation
will be needed to assess the computational results and the allosteric
regulation of these identified sites. Furthermore, because of the
nature of CA (mainly hydrophilic), the detected cavities are hydrophilic,
and hydrophobic cavities are missed with our approach.
Application of the CA Approach on NR Family
Considering
the interesting results obtained for the PK family, our approach was
extended to the NRs. These proteins play an important role in biological
events such as cell growth or development and act as pathological regulators
in many diseases. NRs share 3D structural motifs: the N-terminal activation
function 1 domain (AF1, or A/B domain), the DNA binding domain including
two Zn fingers (DBD, also called the C domain), the nuclear localization
region (D domain), and the C-terminal ligand-binding domain (LBD,
referred to as the E domain).[38] In general,
NRs interact with ligands in an orthosteric pocket inside the LBD,
and this interaction results in the regulation of cofactor binding and
of gene transcription. However, some cavities, away from
the orthosteric pocket, have been identified in several studies.[39−41] Although these studies suggest that those sites are putative allosteric
spots, no conclusion has been drawn on the nature of those cavities.[42] Even if the nature of those sites is not yet
fully characterized, we applied our approach to NRs to detect the
known putative allosteric sites and to identify new ones. Using the
same protocol as for PKs, 591 crystal structures were extracted for
the CA dataset, and 450 crystal structures were extracted for the
ligand dataset (also containing AL and OL). Density analysis with
a threshold of 80% showed that CA tend to interact in
the AF2 coregulator site,[42] a putative
allosteric site near the helix H12 (Figure a).
Figure 5
Allosteric sites identification on NR. (a) Superimposition
of density
points on the reference structure (PDB ID 2PIP) with the orthosteric site represented
in green. All purple spheres represent a region with 50% density of
presence, and only two spheres have a density greater than 80%. (b)
Clustering results obtained with parameter [3 - 2]. (c) Some heterogeneous
clusters superimposed on the reference structure indicate the position
of defined allosteric ligands (PDB IDs 2PIP, top; 2PIN, middle) and orthosteric ligand (PDB
ID 2PIP, bottom)
represented in sticks. (d) Some homogeneous clusters superimposed
on the reference structure show undefined allosteric sites. Orthosteric
ligand (PDB ID 2PIP) is represented in sticks. Cluster ID and some protein helices are
indicated.
This site was also detected using the clustering
analysis. In fact,
as shown in Figure c, heterogeneous clusters 5 and 6 are located in the same pocket
as the allosteric reference ligand, which is usually associated
with a thyroid hormone receptor (PDB ID 2PIN).[43] A surface
called binding function 3 or BF3[44] has
been identified as a putative allosteric site in NRs such as the androgen
receptor.[45] This site was exclusively found
in cluster 9 using the clustering analysis. Thus, these first results
confirm the existence of these two putative allosteric sites in NRs.
Moreover, a density analysis at a 50% threshold also revealed an additional
cavity (Figure a)
filled by homogeneous clusters 1, 2, and 4, which consist only of CA
(Figure b,d). The
homogeneous cluster 7 also points to a surface-exposed cavity. These
two cavities are still untargeted since no AL reference ligand was
identified in NRs.
Regarding the NRs, the combination of both
methods (density and
clustering) showed that the CA are located in known, validated sites
described in the literature. Consequently, our approach is not limited
to PKs but can be successfully extended to other protein superfamilies
like NRs. Considering the size of those superfamilies, further studies
could be performed on each subfamily or subgroup.
Conclusions
Allostery is a fundamental concept
in protein regulation
and is of great interest for modulating the activity of a biological
target in the context of drug discovery. Here, we proposed a novel
and efficient computational way to detect allosteric sites, using
crystallization additives (CA), instead of protein cavity volume and
their overlap with orthosteric and allosteric ligands. CA are often
present in 3D structures but generally unexploited in computational methods. We put forward that CA are
not randomly distributed around protein structures but seem to be
attracted by hotspots. We demonstrate, with unsupervised classification
and density analysis, that CA tend to be attracted by orthosteric
and allosteric sites of protein kinases and nuclear receptors. Indeed,
all those cavities of interest have been detected by one or both of
those methods. This leads to the conclusion that the location of CA
in crystal structures can be used to identify new cavities of interest
and putative allosteric sites. Finally, this method has been effectively
applied to two protein families and could be applied to other therapeutic
targets for which new allosteric cavities are still unexplored. Thus,
our method could be used for designing new allosteric drugs for the
treatment of human diseases.
Computational Methods
General Process
Protein kinase (PK) and nuclear receptor
(NR) families, for which allosteric modulators have already been described,
were treated individually throughout the study following the procedure
shown in Figure S1.
Datasets
Two structural datasets were generated for
this study, one dataset containing the crystallographic additives
(CA) in complex with the proteins (Figure S2) and the second dataset containing proteins cocrystallized with
allosteric ligands (AL) and orthosteric ligands (OL). We describe
below the procedure to generate those datasets:
The CA dataset contains molecules that
are involved in the crystallographic process (polyols, sugars, and
buffer molecules) excluding water molecules and salts. Crystal structures
of protein–ligand complexes, with a resolution of less than
3 Å and with less than 10,000 residues, were retrieved from the
RCSB protein data bank (PDB).[46] The right
protein family was selected using the annotations of the PFAM protein
family database[47] (PF00069 and PF07714
for protein kinases and PF00104 for nuclear receptors). PDB files
containing multiple chains were split into individual chains, and
each chain was considered as a unique entity. AL and OL were removed
according to the presence of aromatic rings in the ligand structures.
Then, crystal structures of apo or holo proteins containing CA compounds
were retained, resulting in 845 PK and 591 NR structures. The definition
of the CA used in our approach is detailed in Table S1.
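The selection criteria above can be illustrated with a minimal sketch. It assumes each PDB entry has already been summarized into a small record; the `Entry` class, its field names, the helper `select_chains`, and the placeholder identifiers `AAAA`, `BBBB`, and `CCCC` are hypothetical and are not part of the original workflow. Only the thresholds (3 Å, 10,000 residues) and the Pfam accessions come from the text.

```python
from dataclasses import dataclass

# Pfam accessions quoted in the text.
PK_PFAM = {"PF00069", "PF07714"}   # protein kinases
NR_PFAM = {"PF00104"}              # nuclear receptors


@dataclass
class Entry:
    """Hypothetical summary of one PDB entry (illustrative field names)."""
    pdb_id: str
    resolution: float      # in angstroms
    n_residues: int
    pfam_ids: frozenset
    chains: tuple          # chain identifiers


def select_chains(entries, family_pfam, max_resolution=3.0, max_residues=10_000):
    """Apply the resolution, size, and Pfam filters, then split into chains."""
    selected = []
    for e in entries:
        if e.resolution >= max_resolution:
            continue                      # resolution not below 3 angstroms
        if e.n_residues >= max_residues:
            continue                      # structure too large
        if not (e.pfam_ids & family_pfam):
            continue                      # not in the requested family
        # Each chain is treated as a separate entity.
        selected.extend((e.pdb_id, chain) for chain in e.chains)
    return selected


# Placeholder entries, not real PDB identifiers.
entries = [
    Entry("AAAA", 2.1, 350, frozenset({"PF00069"}), ("A", "B")),
    Entry("BBBB", 3.4, 350, frozenset({"PF00069"}), ("A",)),   # fails resolution
    Entry("CCCC", 1.8, 250, frozenset({"PF00104"}), ("A",)),   # nuclear receptor
]
pk_chains = select_chains(entries, PK_PFAM)
nr_chains = select_chains(entries, NR_PFAM)
```

The removal of AL and OL based on aromatic rings and the CA definition of Table S1 are not reproduced here, since they depend on ligand chemistry that this sketch does not model.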
The ligand dataset contains AL and OL
small molecule modulators. To ensure the presence of AL in this dataset,
a search was made using different sources: the allosteric database
(ASD),[20] which inventories all protein structures
containing AL until 2015; the ASbench,[21] which provides a list of allosteric sites; and finally, a manual
search in the PDB website using “allosteric” or “allostery”
keywords. These crystal structures of protein–ligand complexes
obtained from the PDB site were then filtered based on the structure
resolution (<3 Å), on the number of residues in the protein
sequence (<10,000 residues), and on their Pfam annotations. In
this dataset, only protein–ligand complexes were kept, and
other molecules such as solvent molecules, CA, and counter-ions were
removed. This dataset contains 864 PK and 450 NR complexes. Among
them, some complexes with well-studied AL were considered as a reference
to validate the approach: PDB IDs 4E6C (p38α),[48] 3O2M (JNK1),[49] 3O96 (AKT1),[50] 3LW0 (IGF1R),[51] 1S9J (MEK),[52] 4LMN (MEK),[53] 3EQC (MEK),[54] 4ITH (RIPK1),[55] 4ZJI (PAK1),[56] 3MS9 (ABL),[57] 3K5V (ABL),[58] 3PYY (ABL),[59] 3PXF (CDK2),[60] 3HRF (PDK1),[61] 3JVR (CHK1),[62] 3F9N (CHK1),[63] 3H30 (CK2a1),[64] and 4CFE (AMPKα1/2)[65] (Table S2).
Sequence and Structure Alignment
For each dataset,
protein sequences were annotated with conserved residues, and multiple
sequence alignments were performed using the MOE software.[66] Based on these sequence alignments, Cα
atoms of each protein were superimposed on the Cα atoms of
chain E of PDB ID 1ATP[67] for all PKs and chain A of PDB
ID 1IE9[68] for all NRs.
Density Analysis
To evaluate the distribution of CA
in protein families, we calculated the geometric center (centroid)
of each CA molecule with the RDKit package.[69] Because of the large differences in molecular size and chemical
structure for the ligands in the datasets, we consider the centroid
of the ligands for the analysis.[70] Then,
using the cpptraj program available in the AMBER suite, we generated
a grid of density of centroids. A spacing of 1 Å between the bins
of the grid was used along the three axes of the box. The
dimensions of the box were set to 70 × 70 × 70 Å³ to encompass all the centroids of CA. The number of centroids
was enumerated within the box, and a histogram of population was created.
Only points that have a density greater than the 50% and 80% thresholds,
relative to the maximum value, were recorded for subsequent analysis.
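The density calculation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the cpptraj grid command used in the study: the function names (`centroid`, `density_grid`, `dense_bins`) are hypothetical, the box is assumed to be centered on the coordinate origin (the text does not state where the box is placed), and the toy centroids are invented for the example.

```python
from collections import Counter
from math import floor


def centroid(coords):
    """Geometric center of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def density_grid(centroids, origin=(-35.0, -35.0, -35.0), spacing=1.0, n_bins=70):
    """Count centroids per cubic bin of an n_bins^3 grid.

    Assumption: the 70 x 70 x 70 angstrom box is centered on (0, 0, 0).
    """
    counts = Counter()
    for p in centroids:
        idx = tuple(floor((p[i] - origin[i]) / spacing) for i in range(3))
        if all(0 <= k < n_bins for k in idx):   # ignore points outside the box
            counts[idx] += 1
    return counts


def dense_bins(counts, fraction):
    """Keep bins whose count is greater than fraction * maximum count."""
    if not counts:
        return {}
    cutoff = fraction * max(counts.values())
    return {b: c for b, c in counts.items() if c > cutoff}


# Invented centroids: four in one bin, three in a second, one isolated.
toy_centroids = [
    (0.2, 0.2, 0.2), (0.4, 0.6, 0.1), (0.9, 0.3, 0.5), (0.5, 0.5, 0.5),
    (10.2, 0.1, 0.1), (10.4, 0.3, 0.2), (10.5, 0.5, 0.5),
    (-20.5, 3.2, 1.0),
]
grid = density_grid(toy_centroids)
bins_50 = dense_bins(grid, 0.5)   # bins above 50% of the maximum
bins_80 = dense_bins(grid, 0.8)   # bins above 80% of the maximum
```

With these toy points, the 50% threshold keeps the two populated sites, while the 80% threshold keeps only the most populated bin, which mirrors how a stricter threshold narrows the analysis to the densest hotspots.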
Unsupervised Classification
Two datasets were studied
for the unsupervised classification, the CA dataset alone and the
combined CA and ligands (OL and AL) datasets to validate the overlap
of CA and AL molecules in a proximal space. Each molecule (OL, AL,
and CA) was also converted into one geometric point corresponding
to its centroid using RDKit. We carried out multiple clustering runs
based on the Cartesian coordinates of the centroids through the DBSCAN algorithm,[35] implemented in the AMBER cpptraj module. Two
parameters, the distance between two points in a cluster (ε
in Å) and the minimal number of points present in a cluster (minpts),
were modified for clustering optimization. Different pairs of
parameters [minpts - ε] were evaluated ([1 - 0.5], [1 - 1],
[1 - 2], [2 - 0.5], [2 - 1], [2 - 2], [3 - 2], [3 - 3],
[4 - 2], [4 - 3]).
Parameters [1 - 2] and [3 - 2], which provide
a maximum number of clusters containing both AL and CA (heterogeneous
clusters), were retained for subsequent analysis. For PKs, a partial
search using a pharmacophore was also applied on the combined datasets
to detect and remove the OL bound to the hinge region of the ATP binding
site using the MOE software.[66] Based on
chain E of PDB ID 1ATP,[67] we built the partial
pharmacophoric feature on the adenosine moiety of ATP constituted
by an acceptor feature (with a radius of 1.7 Å), a donor feature
(1.7 Å), and a heavy atom (3 Å). Then, a second clustering
using parameters [1 - 2] and [3 - 2] was applied without considering
OL.
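To make the notion of heterogeneous clusters concrete, the following is a minimal textbook DBSCAN sketch applied to labeled centroids. It is not the cpptraj implementation used in the study, and it assumes that minpts counts the point itself among its neighbors, which may differ from the cpptraj convention. The function names and the toy coordinates are hypothetical.

```python
from math import dist


def dbscan(points, eps, minpts):
    """Textbook DBSCAN. Returns one label per point: 0, 1, ... or -1 for noise.

    Assumption: a point counts as its own neighbor when compared to minpts.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < minpts:
            labels[i] = -1                # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise becomes a border point
            if labels[j] is not None:
                continue                  # already assigned, do not expand
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= minpts:         # core point: grow the cluster
                queue.extend(nb)
    return labels


def heterogeneous_clusters(labels, kinds):
    """Cluster IDs that contain both a CA and an AL centroid."""
    members = {}
    for label, kind in zip(labels, kinds):
        if label != -1:
            members.setdefault(label, set()).add(kind)
    return sorted(c for c, k in members.items() if {"CA", "AL"} <= k)


# Invented centroids (angstroms) with their molecule type.
points = [(0, 0, 0), (1, 0, 0), (1.5, 0.5, 0),
          (10, 10, 10), (11, 10, 10), (10.5, 11, 10),
          (30, 30, 30)]
kinds = ["CA", "CA", "AL", "CA", "CA", "CA", "AL"]

labels_3_2 = dbscan(points, eps=2.0, minpts=3)   # pair [3 - 2]
labels_1_2 = dbscan(points, eps=2.0, minpts=1)   # pair [1 - 2]
hetero = heterogeneous_clusters(labels_3_2, kinds)
```

In this toy case, the first group is heterogeneous (CA and AL together), the second is a homogeneous CA cluster, and the isolated AL is noise with minpts = 3 but forms its own single-point cluster with minpts = 1, which shows why the two parameter pairs can give different cluster counts.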