
A simple method for finding a protein's ligand-binding pockets.

Seyed Majid Saberi Fathi, Jack A Tuszynski¹.

Abstract

BACKGROUND: This paper provides a simple and rapid method for a protein-clustering strategy. The basic idea implemented here is to use computational geometry methods to predict and characterize ligand-binding pockets of a given protein structure. In addition to geometrical characteristics of the protein structure, we consider some simple biochemical properties that help recognize the best candidates for pockets in a protein's active site.
RESULTS: Our results are shown to produce good agreement with known empirical results.
CONCLUSIONS: The method presented in this paper is a low-cost rapid computational method that could be used to classify proteins and other biomolecules, and furthermore could be useful in reducing the cost and time of drug discovery.

Substances：
Proteins

Year: 2014 PMID： 25038637 PMCID： PMC4112621 DOI： 10.1186/1472-6807-14-18

Source DB: PubMed Journal: BMC Struct Biol ISSN： 1472-6807

Background

Essential information regarding protein function is generally dependent on the protein’s tertiary structure. This includes the enzymatic function of a protein, and also the binding of ligands, such as small molecule inhibitors [1]. Methods developed for predicting an enzymatic function of a protein by identifying catalytic residues include: finding local characteristics of functional residues [2,3], applying known templates of active sites [4,5] or identifying the surface shape of active sites [6-10]. In order to predict ligand binding (sites, poses and affinities), we first need to determine a 3-dimensional structure of the protein in question, which can be done using several experimental or computational methods [11,12]. Structure-based pocket prediction employs geometrical algorithms or probe mapping/docking algorithms [13]. Comparing these two kinds of methods, it can be said that the geometrical algorithms have low computational costs in contrast to the mapping/docking and scoring of molecular fragments, but the latter algorithms have a greater physical meaning. Geometrical algorithms analyze protein surfaces, and once a structure has been determined, a number of algorithms may be used to predict binding pockets on the protein surface [14-19]. One such example, SURFNET [15], fits spheres into the spaces between protein atoms and finds gap regions. The results obtained this way correspond to the cavities and keys of a given protein. An algorithm based on geometric hashing called VISGRID [20] uses the visibility of constituent atoms to identify cavities. “Active site points” are identified by PASS [19]. In this method the protein surface is coated with a layer of spherical probes, then those that clash with the protein or which are not sufficiently buried are filtered out. The active site points are identified from the final probes.
Another method is LIGSITE [14,21], which is an improvement of the POCKET algorithm [22]. This algorithm puts protein-occupied space in a grid and identifies clefts by scanning areas that are enclosed on both sides by the protein’s atoms. An alpha-shape algorithm is used by CAST [17] and APROPOS [18]. DRUGSITE [13] and POCKET-FINDER [23], in addition to the protein’s shape, also consider physicochemical properties for identification of ligand binding pockets. Further geometrical algorithms are TRAVEL DEPTH [24], VOIDOO [25], and CAVITY SEARCH [26]. QSITEFINDER [16] uses interaction energy computation between the protein and a van der Waals probe to find favorable binding sites. Some methods using mapping/docking and scoring of molecular fragment concepts are described by Dennis et al. [27], Kortvelyesi et al. [28], Ruppert et al. [29], and Verdonk et al. [30]. There are also several docking based methods that use ligands to probe the proteins for binding sites [31-34]. Computer-aided drug design often applies protein–ligand docking methods, most commonly structure-based methods. These methods provide support to the rational design and optimization of novel drug candidates [35]. Many structure-based protein–ligand docking methods have been reported in the literature [36-41]. These methods generally rely on first identifying a ligand-binding pocket in the protein structure. Finding a comprehensive, fast and automated method that can accurately predict ligand-binding pockets on protein surfaces is a major challenge in virtual screening biophysics. This goal leads us to introduce a new method for finding putative ligand-binding pockets on a protein surface, and for identifying the most important characteristics of these pockets: surface area, volume, and potential interacting atoms. 
This information could be used to cluster protein pockets into similarity classes, and could be a valuable resource leading to a significant decrease in the cost and time expended in the drug discovery process. The method we present in this paper is based on computational geometry and voxelization concepts. In this method we do not use Delaunay tessellation, the vision criterion, or fitting spheres between atoms, in contrast to some of the methods mentioned above. The CASTp method has used the Delaunay triangulation and the Voronoi concepts to find putative pockets and voids. This method triangulates the surface atoms and clusters triangles by merging small triangles to neighboring large triangles [14,17]. In our work we simply use the convex hull concept and generate a pocket by a grid box formed by the extreme points of a triangle. Then, we consider only the atoms closest to the triangle in the formed pocket. The distance to the convex hull is used for choosing the surface atoms. Thus, our method is not iterative and does not require a flow through all points, hence the computational cost is relatively low. We also take only a given number of empty voxel neighbors for each atom. Voxelization of space for finding putative pockets does not have an essential role for finding surface atoms, unlike VISGRID or grid-based methods, which are based on searching for empty voxels in different directions. We also use voxelization for finding the positions of possible ligands and also to determine physical properties of the pockets. Comparative modeling methods use fold assignment and template selection for comparing the target protein to a set of proteins with known structures and to search for homologous proteins that have approximately similar structures. Some of these methods are BLAST [42,43], PSI-BLAST [44] and HHpred [45]. I-TASSER [46] is a composite approach of comparative modeling and threading methods [47].
A summary of comparative modeling is given by [48]. In our method we also consider some biochemical properties of the protein’s atoms and residues as is explained below. Hence, the proposed method is not purely geometrical. We demonstrate that the results obtained using this method are in good agreement with empirically known results. Hence developing it further may offer even more accurate and reliable results.

Methods

We first voxelize the volume of a box defined to contain the extreme points of the protein’s atomic positions. Then, we use the convex hull concept to obtain the smallest convex polyhedron containing all of the protein’s atoms. In 3-dimensional space, a convex hull surface is formed by triangles, as shown in Figure 1. In the present context, each of these triangles can define a pocket, as illustrated in Figure 2. To define a specific pocket, we consider the volume generated by the extreme positions of the triangle vertices as follows: each triangle contains three vertex points, $(x_1, y_1, z_1)$, $(x_2, y_2, z_2)$ and $(x_3, y_3, z_3)$, which we should consider as

$$p = \big(\mathrm{extreme}\{x_1, x_2, x_3\},\ \mathrm{extreme}\{y_1, y_2, y_3\},\ \mathrm{extreme}\{z_1, z_2, z_3\}\big),$$

where “extreme” indicates either a minimum or a maximum value. Figure 2 shows a given triangle on a convex hull. We have made the grids with a spacing of 1 Å over the ranges $[x_{\min} : x_{\max}]$, $[y_{\min} : y_{\max}]$ and $[z_{\min} : z_{\max}]$ along the three axes. Normally, some parts of this rectangular box lie outside the convex hull, but we do not concern ourselves with them because they will be eliminated by another criterion, namely to keep only a given number of empty voxels near each protein atom in a pocket. Figure 3 shows, in 2 dimensions, only the part of a pocket that lies inside the convex hull; its grid is shown by points (blue represents the atoms and red represents the empty grid points).

Figure 1

The 3D polyhedron (convex hull) for the PDB:1ABT structure.

Figure 2

A given triangle on the convex hull for the PDB:1ABT structure. The three vertices are labeled as 1, 2, and 3. The point p is determined by the extreme values of x, y, and z of these three vertices. The distance of atom i to the triangle is obtained as follows: first obtain the normal vector to the triangle, $\mathbf{N} = (\mathbf{x}_2 - \mathbf{x}_1) \times (\mathbf{x}_3 - \mathbf{x}_1)$, where $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ are the vectors from the origin of the system of Cartesian coordinates to the three vertices. Then, calculate the angle between the normal vector and the line passing through atom i and one of the vertices of this triangle (here vertex 1) using the following relation: $\cos\theta = \dfrac{\mathbf{N} \cdot (\mathbf{x}_i - \mathbf{x}_1)}{|\mathbf{N}|\,|\mathbf{x}_i - \mathbf{x}_1|}$. Finally, we compute this distance by $d_i = |\mathbf{x}_i - \mathbf{x}_1| \cos\theta$, where $\mathbf{x}_i$ is a vector joining the origin and a given point in this volume.
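The distance of an atom to a hull triangle and the box spanned by the triangle's extreme coordinates can be sketched in a few lines. This is a minimal Python illustration, not the authors' Matlab code: the function names and the sample coordinates are invented, vertex 1 is arbitrarily taken as the reference vertex, and the absolute value is an added assumption so that the result does not depend on the orientation of the normal.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def norm(a):
    return math.sqrt(dot(a, a))

def distance_to_triangle_plane(atom, v1, v2, v3):
    """Distance from `atom` to the plane through the triangle (v1, v2, v3)."""
    n = cross(sub(v2, v1), sub(v3, v1))   # normal vector N
    r = sub(atom, v1)                     # atom position relative to vertex 1
    if norm(r) == 0.0:
        return 0.0
    cos_theta = dot(n, r) / (norm(n) * norm(r))
    return abs(norm(r) * cos_theta)       # |x_i - x_1| |cos(theta)|

def triangle_box(v1, v2, v3):
    """Minimum and maximum corners of the box spanned by the three vertices."""
    lo = tuple(min(c) for c in zip(v1, v2, v3))
    hi = tuple(max(c) for c in zip(v1, v2, v3))
    return lo, hi

# Hypothetical triangle in the plane z = 0 and an atom 2.5 Å below it.
d = distance_to_triangle_plane((1.0, 1.0, -2.5),
                               (0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0))

# Hypothetical tilted triangle; the box corners collect the extreme coordinates.
lo, hi = triangle_box((1.0, 5.0, 2.0), (4.0, 0.5, 3.0), (2.0, 3.0, 7.5))
```

Only the two extreme corners are needed to lay a 1 Å grid over the box; grid points falling outside the hull are discarded later by the empty-voxel criterion described in the text.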

Figure 3

The steps of the algorithm illustrated (in 2D for clarity) using the PDB:1ABT structure. The red dots represent empty voxels and the blue dots are voxels containing protein atoms. The atom positions have been averaged on the z-axis. (a) A convex hull enclosing the protein atoms is generated. (b) A line (a triangle in 3D) on the surface of the hull is selected. Inside of convex hull part of a given pocket is shown.
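The 2D picture in Figure 3 can be reproduced with any planar convex hull routine. The sketch below uses Andrew's monotone chain, which is a standard algorithm chosen here for brevity; the paper does not state which hull algorithm was used, and the point set is invented. Each hull edge plays the role that a triangle plays in 3D.

```python
def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def turn(o, a, b):
        # Positive for a counter-clockwise turn o -> a -> b.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and turn(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and turn(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# Hypothetical projected atom positions: four corners and three interior atoms.
atoms_2d = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2)]
hull = convex_hull_2d(atoms_2d)

# Each hull edge is a candidate pocket "line" in the 2D illustration.
edges = list(zip(hull, hull[1:] + hull[:1]))
```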

Then, we obtain the voxels that are contained within this generated volume, and separate them into those that contain protein atoms and those that do not. Next, we identify the nearest empty voxels with respect to these protein atoms. These empty voxels give us the possible positions of ligand atoms for this particular protein pocket. At this step, we have found a large number of “pocket” envelopes and all the atoms belonging to these pockets are the “protein’s surface atoms”. In some cases, the entire space (or part thereof) under a triangle is common with another space, so we say that these spaces overlap with each other. The overlap is defined by the number of atoms in common between the two pockets divided by the total number of atoms in a pocket, which means the overlap also depends on the size of a pocket, so the overlap between two pockets is not symmetric. Figure 4 shows the overlap between two pockets in 2-dimensional space. As we can see in this figure, the size of the common site (determined by the number of common atoms) divided by the size of the pocket (the total number of atoms in the pocket) is different for each pocket.
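The separation of voxels into filled and empty ones, and the selection of the nearest empty voxels for an atom, might look as follows. This is a hedged sketch: the 1 Å voxel size comes from the text, but the function names, the sample coordinates, and the choice of "the k empty voxels closest to the atom's voxel" as the meaning of "nearest" are assumptions made for illustration.

```python
import math
from itertools import product

def voxel_of(point, origin, size=1.0):
    """Integer voxel index of a point, relative to the box origin."""
    return tuple(int(math.floor((c - o) / size)) for c, o in zip(point, origin))

def split_voxels(atoms, size=1.0):
    """Voxelize the bounding box of `atoms`; return filled and empty voxel sets."""
    origin = tuple(min(a[i] for a in atoms) for i in range(3))
    top = tuple(max(a[i] for a in atoms) for i in range(3))
    shape = tuple(int(math.floor((t - o) / size)) + 1 for t, o in zip(top, origin))
    filled = {voxel_of(a, origin, size) for a in atoms}
    empty = set(product(*(range(n) for n in shape))) - filled
    return shape, filled, empty

def nearest_empty(atom_voxel, empty, k):
    """The k empty voxels closest to `atom_voxel` (ties broken by index)."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, atom_voxel))
    return sorted(empty, key=lambda v: (sq_dist(v), v))[:k]

# Three hypothetical atom positions (in Å).
atoms = [(0.2, 0.1, 0.0), (1.7, 0.4, 0.3), (2.9, 2.8, 1.9)]
shape, filled, empty = split_voxels(atoms)
candidates = nearest_empty((0, 0, 0), empty, k=3)
```

The empty voxels returned for each atom are the candidate ligand-atom positions of the pocket; capping them at k per atom is what discards grid points far from the protein.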

Figure 4

Schematic illustration of the overlap between two pockets.

If all atoms contained in a set of the pocket atoms exist in the other pocket, it has an overlap of 100%. However, the second pocket may have more atoms than the first one, i.e. it has all atoms of the first pocket plus other atoms. For example, the overlap between pockets #1 and #2 might be 100% while the overlap between pockets #2 and #1 is only 50%, because the number of atoms in pocket #2 is twice as large as the number of atoms in pocket #1, and all atoms belonging to pocket #1 are also contained in pocket #2, but only a half of the atoms in pocket #2 are also in pocket #1. Accumulating all pockets with a given overlap between them as new pockets is the next step. The final step is related to biochemical and physical criteria such as hydrophobicity, hydrogen bonding, ionic and van der Waals interactions, and also the depth, surface area and volume comparisons between a given pocket and a ligand. By using biochemical conditions, we can find which atoms and which corresponding residues could potentially participate in an interaction with the ligand’s atoms. Tables 1 and 2 propose a set of simple biochemical conditions. It should be noted that, to find an active site, more accurate conditions should lead to more accurate results. In this step we can also compute the size of pockets.
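The asymmetric overlap and the accumulation of overlapping pockets can be sketched as below. The overlap function follows the definition in the text; the greedy single-pass merge, the function names, and the atom labels are illustrative assumptions, since the text does not specify how pockets are accumulated.

```python
def overlap(pocket_a, pocket_b):
    """Fraction of the atoms of pocket_a that also belong to pocket_b.

    Assumes pocket_a is not empty.
    """
    a, b = set(pocket_a), set(pocket_b)
    return len(a & b) / len(a)

def merge_pockets(pockets, cutoff=0.8):
    """Greedily merge pockets whose overlap (in either direction) reaches cutoff."""
    merged = []
    for atoms in map(set, pockets):
        for group in merged:
            if max(overlap(atoms, group), overlap(group, atoms)) >= cutoff:
                group |= atoms        # absorb the pocket into the existing group
                break
        else:
            merged.append(set(atoms))  # no sufficiently overlapping group found
    return merged

# Hypothetical atom serial numbers: pocket 2 contains all of pocket 1 and more.
pocket_1 = {1, 2, 3, 4}
pocket_2 = {1, 2, 3, 4, 5, 6, 7, 8}
pocket_3 = {20, 21, 22}

o12 = overlap(pocket_1, pocket_2)   # 1.0: every atom of pocket 1 is in pocket 2
o21 = overlap(pocket_2, pocket_1)   # 0.5: half of pocket 2 is in pocket 1
new_pockets = merge_pockets([pocket_1, pocket_2, pocket_3], cutoff=0.8)
```

Using the larger of the two directional overlaps is one possible merging rule; a stricter variant would require both directions to exceed the cutoff.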

Table 1

Main biochemical interactions of atoms and residues in the proteins [49,51,52]

Residue Name	Interactions
THR	HBD: OG1 (OH)
SER	HBD: OG (OH)
GLN	HBA: NE2 (NH₂)
ASN	HBA: ND2 (NH₂)
TYR	HBA: O – HBD: N, OH – CR: CE1, CE2, CD1, CD2, CZ, CG
CYS	Sul: SG (SH)
MET	vdW: CE (CH₃) – Sul: SD (S-CH₃)
ALA	vdW: CB (CH₃)
PRO	vdW: CB (CH₂), CD (CH₂), CG (CH₂)
LEU	vdW: CD1 (CH₃), CD2 (CH₃), CG (CH)
VAL	vdW: CG1 (CH₃), CG2 (CH₃), CB (CH)
ILE	vdW: CD1 (CH₃)
ASP	HBA: OD1(C = O) – Ion(−): OD2 (OH)
GLU	HBA: OE1(C = O) – Ion(−): OE2 (OH)
LYS	Ion(+): NZ (NH₃)
ARG	Ion(+): NH1 (NH₂) trans, NH2 (NH₂) cis
HIS	Ion(+): NE1 (NH₂) trans, NE2 (NH₂) cis – CR: CD1, CE1, CD2, CE2, CG
PHE	CR: CG, CD1, CE1, CZ, CE2, CD2
TRP	HBD: NE1 (NH) – CR: CD2, CE2, CZ2, CH2, CZ3, CE3
TYR	HBD: OH – CR: CD1, CE1, CE2, CZ, CD2, CG
GLY	No participation

Abbreviations used: HBA: Hydrogen bond acceptor, HBD: Hydrogen bond donor, vdW: van der Waals interaction, Ion: Ionic interaction, Sul: Sulfur interaction.

Table 2

Ligand biochemistry

C-Ring in ligand	C or N atoms in the ligand, recognized from the connection information in the PDB file
Unprotonated atoms in ligand	1) O has a connection with N, P or Zn
Unprotonated atoms in ligand	2) O only has a connection with C
Protonated atoms in ligand	1) Ca
	2) N has only two connections with C

The bond list is given in the PDB file CONECT lines.
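Reading the bond list from CONECT records and applying two of the Table 2 conditions could be done along these lines. The fixed-width column layout is that of the PDB format; the helper names, the sample records, and the element assignments are invented, and the two predicates are only one possible reading of the table entries.

```python
def parse_conect(lines):
    """Map each atom serial number to the set of serials it is bonded to."""
    bonds = {}
    for line in lines:
        if not line.startswith("CONECT"):
            continue
        # Columns 7-11 hold the atom serial; columns 12-16, 17-21, 22-26 and
        # 27-31 hold the serials of the bonded atoms.
        fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
        serials = [int(f) for f in fields if f]
        if serials:
            bonds.setdefault(serials[0], set()).update(serials[1:])
    return bonds

def oxygen_bonded_only_to_carbon(serial, bonds, elements):
    """Assumed reading of 'O only has a connection with C'."""
    partners = bonds.get(serial, set())
    return (elements[serial] == "O" and len(partners) > 0
            and all(elements[p] == "C" for p in partners))

def nitrogen_with_two_carbons(serial, bonds, elements):
    """Assumed reading of 'N has only two connections with C'."""
    partners = bonds.get(serial, set())
    return (elements[serial] == "N" and len(partners) == 2
            and all(elements[p] == "C" for p in partners))

# Hypothetical ligand fragment O-C-N-C; serials and elements are invented.
elements = {10: "O", 11: "C", 12: "N", 13: "C"}
records = [
    "CONECT   10   11",
    "CONECT   11   10   12",
    "CONECT   12   11   13",
    "CONECT   13   12",
]
bonds = parse_conect(records)
```

In a real PDB file the element of each atom would be taken from the HETATM records of the ligand rather than from a hand-written dictionary.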

A detailed description of the algorithm is given in the following:

The algorithm

1. Input protein atom position data, and define a box by using the extreme positions of the atoms.
2. Voxelize the box by considering voxels of 1 Å in length, width and height.
3. Compute the convex hull surrounding the protein atoms and obtain the volume of the convex hull and the surface area of atoms.
4. Separate empty voxels (possible ligand atom positions) from voxels filled by the protein atoms in the convex hull.
5. Define the pockets by the volume generated by the vertices of each triangle on the convex hull.
6. Compute the overlap between two neighboring pockets and assemble the pockets with an overlap greater than a minimum value (reconstruct new pockets).
7. Find the physical properties of the pockets such as depth, surface and volume.
8. Find the residues corresponding to the pocket atoms.
9. Assess the biochemical conditions [49,50] as introduced in Table 1 (we use the IUPAC nomenclature [51] and the PDB format [52]). In this step we can find the atoms and residues that participate in the potential active site.
10. Compare physical and biochemical properties between ligand atoms (Table 2) and the atoms of a given pocket, such as: the size of pockets (depth, surface and volume) with ligand size, the number of hydrogen donor/acceptor atoms, possible rings, or van der Waals interactions, etc.

Supplementary steps to compare our results with known active sites

11. Compute the number of correct residues predicted in each pocket of the unliganded protein and divide it by the number of residues in an “active site” of the liganded protein as reported in the PDB, i.e.

$$cf = \frac{\text{number of correctly predicted residues in the pocket}}{\text{number of residues in the active site}}.$$

12. Optional step. Compute the minimum distance between the ligand atoms and the atoms of each residue in the pocket. Then, filter out the residues of a pocket whose minimum distance is greater than a given value, for example 3.50 Å.

In Figure 3, we illustrate these steps in 2-dimensional space for better clarity. Here, we need to use a line instead of a triangle to define a pocket. Figure 5 uses the example of the protein labeled 1A6U in the PDB. It shows 3-dimensional atomic positions of the protein and the atoms that belong to a pocket.
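The correct fraction cf used in the supplementary comparison step is a simple set ratio. The sketch below assumes residues are identified by a (name, number, chain) tuple; that representation, the function name, and the sample residues are illustrative choices rather than the authors' data structures.

```python
def correct_fraction(pocket_residues, active_site_residues):
    """cf: active-site residues found in the pocket / residues in the active site."""
    site = set(active_site_residues)
    return len(set(pocket_residues) & site) / len(site)

# Hypothetical active site of four residues and a predicted pocket of five.
active_site = [("TYR", 10, "A"), ("ARG", 25, "A"), ("TRP", 40, "A"), ("SER", 41, "A")]
pocket = [("TYR", 10, "A"), ("ARG", 25, "A"), ("TRP", 40, "A"),
          ("GLY", 55, "A"), ("ALA", 56, "A")]

cf = correct_fraction(pocket, active_site)   # 3 of the 4 site residues are recovered
```

Because the denominator is the size of the active site, extra residues in the pocket do not lower cf, so a very large pocket can score well without being specific.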

Figure 5

Three dimensional structural representation of 1A6U. The atoms are shown with yellow dots and the surface atoms of a given pocket are shown with red crosses.


Results and discussion

In reality, the geometrical criteria give initial information about physical properties for the possible protein-protein or protein-ligand docking, determining shapes, sizes, etc. For docking to occur, the recognized geometrical protein pocket should be a protein’s active site. Finding active sites is very complicated for both in vitro and in silico methods. There are many computer programs that find active sites [13-23], but they have a high computational cost and typically do not determine the physical properties of the active site, which means that a ligand must be sought while some important information is still lacking. Therefore, it is imperative to use mixed geometrical and biochemical methods to find possible pockets in a protein. This paper has introduced a method to find protein pockets with a higher probability of interactions than methods based exclusively on biochemical criteria. This method offers a speed-up of the drug discovery process by allowing clustering of both the protein pockets and ligands. We first demonstrate our method by describing an example, namely a pair of unliganded and liganded proteins, 1A6U and 1A6W. We have used only non-water atoms of 1A6U to find its pockets. These pockets are reported in Table 3. To verify these results, we check the SITE REMARK lines in the PDB file of its liganded pair structure, i.e. 1A6W, and we compare the residues of each active site in the PDB file of 1A6W with the residues obtained in each computed pocket. Then, we obtain the cf-value for each active site. The last two columns of Table 3 report these values.

Table 3

Pockets and their characteristics recognized by our method for 1A6U protein atoms

Pocket Num.*	Num. of atoms	Num. of empty voxels	Surface of pocket	Depth of pocket	NoA** HA^a bonds	NoA HD^b bonds	NoA vdW*** bonds	NoA ionic bonds	NoA sulfur bonds	NoA C-ring	cf of the 1st AS, HAP^c	cf of the 2nd AS, AC1^c
1	63	401	116.25	28.40	5	8	0	1	0	20	0.31	0.33
5	80	481	21.83	38.66	2	3	10	2	0	2	0	0
18	101	648	187.27	25.83	5	7	6	2	0	14	0.12	0.11
19	67	411	84.36	19.35	1	2	5	0	0	2	0	0
38	44	266	138.90	20.63	1	4	1	0	0	6	0	0
39	85	499	82.58	28.26	3	5	2	0	0	14	0.31	0.22
40	21	127	77.97	14.53	2	3	0	0	0	4	0.06	0
58	118	765	340.90	29.83	5	4	7	3	0	3	0	0
59	86	529	253.20	26.72	4	4	4	2	0	6	0.06	0
85	226	1360	370.14	36.18	7	7	26	3	1	27	0	0
89	21	141	212.35	21.47	0	1	4	1	0	4	0	0
90	92	573	293.28	28.54	4	2	15	2	0	11	0	0
112	44	241	36.33	27.39	1	2	1	0	0	6	0.06	0
117	38	215	76.66	17.42	1	3	0	0	0	8	0	0
137	15	99	127.57	17.53	2	4	0	0	0	3	0.25	0.33
143	55	354	259.10	24.24	4	8	0	1	0	20	0.43	0.55

*Pocket number indicates the number in the protein’s atomic positions convex hull surface rows, and they correspond to three vertices of triangles.

**NoA means the number of atoms.

***vdW means van der Waals.

^a HA means hydrogen bond acceptor.

^b HD means hydrogen bond donor.

^c These are the cf-values (ratio of the number of correct residues to the total number of residues in the active site). For 1A6W, two active sites (AS) are reported in the PDB as HAP and AC1.

Here, we give a summary discussion regarding the properties of the unliganded protein structure 1A6U. It has 1737 atoms and its box has 43 × 49 × 41 voxels. The surface of the convex hull consists of 148 triangles, which means the 1A6U structure can have at most 148 possible pockets. However, only 81 pockets remain with a 0.8 overlap cutoff between pockets. By using biochemical conditions, only 20 pockets remain and then by using physical conditions of depth and surface, only 16 pockets remain. These remaining pockets are listed in Table 3. Finally, only four pockets are left with a cf of at least 0.25 (25% correctly predicted residues), as shown in Table 4. The liganded protein reported in the PDB is 1A6W (1774 non-water atoms), and has the NIP ligand, which has 17 atoms with an 8.97 Å length and a 20.87 Å² surface area. Thus, the protein pockets should have values of depth and surface area greater than these. The minimum distance between the atoms of ARG 350H in 1A6U and the atoms of the active sites in 1A6W is 2.89 Å. Table 4 shows the pockets’ residues and their minimum residue distances for 1A6U to the ligand atoms of NIP reported in the heterogenic atom lines in the PDB file of 1A6W.
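The last two filtering steps for 1A6U can be replayed on the values listed in Table 3. The sketch below copies the surface, depth and cf columns from that table and the ligand dimensions from the text; the variable names and the rule "cf of at least 0.25 for either active site" are assumptions chosen to match the four pockets reported in Table 4.

```python
# (pocket number, surface, depth, cf for HAP, cf for AC1), copied from Table 3.
table_3 = [
    (1, 116.25, 28.40, 0.31, 0.33), (5, 21.83, 38.66, 0.0, 0.0),
    (18, 187.27, 25.83, 0.12, 0.11), (19, 84.36, 19.35, 0.0, 0.0),
    (38, 138.90, 20.63, 0.0, 0.0), (39, 82.58, 28.26, 0.31, 0.22),
    (40, 77.97, 14.53, 0.06, 0.0), (58, 340.90, 29.83, 0.0, 0.0),
    (59, 253.20, 26.72, 0.06, 0.0), (85, 370.14, 36.18, 0.0, 0.0),
    (89, 212.35, 21.47, 0.0, 0.0), (90, 293.28, 28.54, 0.0, 0.0),
    (112, 36.33, 27.39, 0.06, 0.0), (117, 76.66, 17.42, 0.0, 0.0),
    (137, 127.57, 17.53, 0.25, 0.33), (143, 259.10, 24.24, 0.43, 0.55),
]

LIGAND_LENGTH = 8.97    # Å, NIP ligand length quoted in the text
LIGAND_SURFACE = 20.87  # Å², NIP ligand surface area quoted in the text
CF_CUTOFF = 0.25

# Physical condition: the pocket must be deeper and larger than the ligand.
fits_ligand = [row for row in table_3
               if row[2] > LIGAND_LENGTH and row[1] > LIGAND_SURFACE]

# Comparison with the known sites: keep pockets with cf >= 0.25 for either site.
best_pockets = [row[0] for row in fits_ligand if max(row[3], row[4]) >= CF_CUTOFF]

n_voxels = 43 * 49 * 41  # size of the voxel box quoted for 1A6U
```

All 16 listed pockets pass the size condition, which is expected because Table 3 already contains only the pockets that survived it.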

Table 4

1A6U best pockets with residues in common with the 2 active sites, HAP and AC1


POCKET # 1, cf = 0.31 & 0.33
ASN 354H (11.61)	SER 331H (10.79)	TYR 34 L (4.27)
ASP 352H (7.07)	THR 328H (14.41)	TYR 332H (8.34)
ILE 351H (6.25)	THR 330H (12.29)	TYR 401H (2.92)
SER 32 L (6.81)	TRP 333H (1.734)	TYR 402H (5.75)
POCKET # 39, cf = 0.31 & 0.22
ALA 2 L (15.1365)	HIS 97 L (6.8477)	THR 26 L (15.7431)
ARG 350H (2.89)	ILE 348H (9.34)	TRP 98 L (3.24)
ASN 96 L (7.12)	LYS 359H (5.38)	TRP 347H (4.78)
ASN 361H (9.75)	LYS 365H (14.84)	TYR 94 L (7.84)
GLU 362H (12.30)	PHE 364H (13.46)	TYR 360H (8.34)
GLY 349H (6.45)	SER 366H (17.38)	VAL 99 L (9.69)
POCKET # 137, cf = 0.25 & 0.33
ASP 400H (5.44)	THR 31 L (8.29)	TYR 401H (2.92)
SER 405H (3.65)	TYR 34 L (4.27)	TYR 402H (5.75)
POCKET # 143, cf = 0.44 & 0.56
ARG 350H (2.89)	SER 95 L (5.42)	TYR 332H (8.34)
ASN 354H (11.61)	SER 331H (10.79)	TYR 401H (2.92)
ASP 352H (7.07)	TRP 93 L (3.36)	TYR 402H (5.75)
ILE 351H (6.25)	TRP 333H (1.73)
SER 32 L (6.81)	TYR 34 L (4.27)

There are four predicted pockets with more than 25% of residues in common between the pockets and the active sites. The values in parentheses are the minimum residue distances for 1A6U to the ligand atoms of NIP reported in the heterogenic atom lines in the PDB file of 1A6W.
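The optional distance filter (step 12) can be applied directly to the distances listed for pocket #143 in Table 4. In this sketch the residue labels and distances are copied from the table, while the dictionary layout and variable names are illustrative.

```python
# Minimum residue-to-ligand distances (Å) for pocket #143, copied from Table 4.
pocket_143 = {
    "ARG 350H": 2.89, "SER 95L": 5.42, "TYR 332H": 8.34,
    "ASN 354H": 11.61, "SER 331H": 10.79, "TYR 401H": 2.92,
    "ASP 352H": 7.07, "TRP 93L": 3.36, "TYR 402H": 5.75,
    "ILE 351H": 6.25, "TRP 333H": 1.73,
    "SER 32L": 6.81, "TYR 34L": 4.27,
}

MAX_DISTANCE = 3.50  # Å, the example threshold given in step 12

# Keep only residues whose closest atom lies within the threshold of the ligand,
# ordered from nearest to farthest.
close_residues = sorted(
    (name for name, d in pocket_143.items() if d <= MAX_DISTANCE),
    key=pocket_143.get,
)
```

With this threshold, 4 of the 13 residues of the pocket remain, which shows how strongly the optional step narrows a geometrically defined pocket.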

Table 3 gives all pockets of 1A6U, where only the two last columns are obtained by the comparison of the results with the binding sites HAP and AC1 of 1A6W (the corresponding liganded protein of 1A6U). In Table 3 the pockets are numbered and ordered arbitrarily. This table and all results were produced independently of the final answer. As can be seen in Figure 6, which is shown on the PDB website for the 1A6W protein, only five residues (TYR 399H, ARG 350H, TRP 93 L, TYR 401H and TRP 98 L) participate in the interaction with the NIP ligand, while in the PDB file of 1A6W two active sites with 16 and 10 residues are reported (using the SITE REMARK lines in the PDB file). This shows that a maximum of 50% of the active site residues reported in the PDB for 1A6W participate in the interaction with the NIP ligand (a cf equal to 0.5). In our computation, for example, in the unliganded protein 1A6U the best pocket has a cf equal to 0.43 and to 0.55 for the first and second active site of the liganded protein 1A6W, respectively.

Figure 6

1A6W and its ligand. From the PDB website.

For illustration purposes we have taken the sets of 48 and 86 “liganded and unliganded proteins” listed in the supplementary material of Li et al. [20] and downloaded the files from the PDB site (see Additional file 1 for a list of the PDB files). We found the pockets of the unliganded proteins, and then we compared these pockets with the known active sites reported in the PDB files of the corresponding liganded proteins. The correct fraction, cf, of residues predicted in a given pocket is computed and the histograms of maximum cf in each protein’s pockets are reported in Figures 7 and 8. These results are obtained for a 0.8 overlap cutoff between pockets, and they show that 76% of the pockets predicted by our algorithm in the 86-element data set have at least half of their residues belonging to an active site in the liganded protein; for the 48-element data set the corresponding number is 50%. By using instead a 0.5 overlap cutoff, the results are 78% and 54% for the 86-element and the 48-element data set, respectively. Note that not all residues in the active sites reported in the PDB participate in protein-ligand interactions.

Figure 7

Histogram of the 86-element data set. Due to the RAM memory limits the protein number 55 in the 86-element data set list (PDB structures 2NGR and 1KZ7) was not included. The results are reported for the 85-element data set. The horizontal axis is the percentage of correct prediction of residues. The vertical axis is the number of proteins. The number of proteins with predicted pockets including more than half of the active site residues is 66 proteins (78% of the data set). Overlap threshold between pockets is 0.8.

Figure 8

Histogram of the 48-element data set. The horizontal axis is the percentage of correct prediction of residues. The vertical axis is the number of proteins. The number of proteins with predicted pockets including more than half of the active site residues is 24 proteins (50% of the data set). Overlap threshold between pockets is 0.8.

In Table 5 we compare the performance of our method with the other methods CASTp, LIGSITE, PASS, SURFNET and VISGRID. This table shows that our method with an overlap cutoff of 0.8 has comparable performance with the other methods. We should also note that the low computational cost of our method is a major advantage. In Additional file 2, full pockets of the 48-element set with a cf (ratio of the number of correct residues to the total number of residues in the active site) of more than 25% are reported. Additional file 2 also gives the minimum distance between each residue of the protein and ligand atoms.

Table 5

Performance comparison of our results with the other methods CASTp, LIGSITE, PASS, SURFNET and VISGRID

	48 Unbound structures	86 Unbound structures
	(Top 1)	(Top 1)
CAST	31 (64.6%)	66 (76.7%)
LIGSITE	36 (75.0%)	69 (80.2%)
PASS	27 (56.3%)	54 (62.8%)
SURFNET	19 (39.6%)	63 (73.3%)
VISGRID: Top 0.8% voxels	34 (70.8%)	55 (64.0%)
Our method: Overlap 0.8	24 (50%)	66 (78%)

The other results are those reported in Table III of Li et al. [20].

We have also chosen another 130 pairs of unliganded and liganded protein structures (listed in Additional file 3). In Figure 9 the histograms of the maximum cf in each protein’s pockets are reported (with a 0.8 overlap). It shows that 73.8% of the pockets predicted by our algorithm in the 130-element data set have at least half of their residues belonging to an active site in the liganded protein, i.e. cf ≥ 0.5.

Figure 9

Histogram of the 130-element data set. The horizontal axis is the percentage of correct prediction of residues. The vertical axis is the number of proteins. Overlap threshold between pockets is 0.8.

An important step which allows a decrease of the time and effort for the drug discovery process is to find suitable ligands through in silico methods using, for example, virtual screening techniques. Our algorithm is easy to use and the cost of computation ranges from approximately 10 seconds for small proteins up to 320 seconds for large proteins. The program was implemented in Matlab. The computer used for these computations is a laptop with an Intel Core i7 CPU and 8 GB RAM. The program usually uses about 13% of the CPU time, with occasional peaks of up to 50%. During computation it usually required less than 0.5 GB of RAM, although for some proteins the usage was observed to rise to 2 GB. The execution time for the 130-pair data set is given in Additional file 3.

Conclusions

In this paper, we have introduced a new simple method for predicting putative ligand-binding protein pockets. For each pocket, we can identify possible interacting protein atoms and residues, surface atoms, and also determine the size of a pocket (volume, surface area and depth). This information can help us verify possible ligands having a shape and size that is geometrically compatible with the pocket, and thus could be docked to the protein. We have used some biochemical properties to find the possible interacting atoms and residues in the pockets. Our method is a low-cost computational method which voxelizes the protein space, and uses the convex hull concept commonly employed in computational geometry. This method could be used to classify proteins by the geometric properties of their pockets and also by their biochemical properties. An application of this method could be useful in reducing the cost and time of drug discovery.
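The combination of voxelization and a convex hull can be illustrated with a deliberately simplified two-dimensional sketch. It is not the authors' implementation. It works on a flat grid rather than on 3D voxels, treats a grid cell as empty when it lies farther than an assumed probe radius from every atom centre, and omits the buriedness and biochemical filters. All names, the radius and the U-shaped toy "protein" are hypothetical.

```python
from itertools import product


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(p, hull):
    """True if p is inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def candidate_cells(atoms, radius, spacing):
    """Grid cells inside the hull of the atom centres that no atom occupies."""
    hull = convex_hull(atoms)
    xs = [a[0] for a in atoms]
    ys = [a[1] for a in atoms]
    nx = int((max(xs) - min(xs)) / spacing) + 1
    ny = int((max(ys) - min(ys)) / spacing) + 1
    cells = []
    for i, j in product(range(nx), range(ny)):
        c = (min(xs) + i * spacing, min(ys) + j * spacing)
        if not inside_hull(c, hull):
            continue
        # Assumed occupancy rule: a cell is empty if every atom is farther
        # away than the probe radius.
        if all((c[0] - a[0]) ** 2 + (c[1] - a[1]) ** 2 > radius ** 2 for a in atoms):
            cells.append(c)
    return cells


# Hypothetical U-shaped arrangement of atom centres with an open cleft.
atoms = (
    [(0, y) for y in range(5)]
    + [(x, 0) for x in range(1, 4)]
    + [(4, y) for y in range(5)]
)
spacing = 1
cells = candidate_cells(atoms, radius=0.9, spacing=spacing)
area = len(cells) * spacing ** 2  # 2D analogue of a voxel-count volume
```

In three dimensions the same idea would count empty voxels inside the hull and multiply by the voxel volume to estimate pocket size. Grouping neighbouring empty cells into separate pockets is left out of this sketch.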

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SMSF and JAT conceived of the study, and participated in its design and coordination and helped to draft the manuscript. Both authors read and approved the final manuscript.

Additional file 1

Table with pairs (bound and unbound) of PDB files in the 48-element set and in the 86-element set. For each pair, the RMSD (in angstroms) is given.

Additional file 2

List of the full pockets for each unliganded structure in the 48-element set with a cf (ratio of the number of correct residues to the total number of residues in an active site) of more than 25%. For each pocket, the cf for each active site (“AC”) is given after the label “Res. in common with N AC:”. Residues are named in the form “ ”. For each residue in a pocket, the minimum distance between the residue and the ligand atoms of the corresponding liganded structure is given.

Additional file 3

Tab-delimited text file. Table with 130 pairs of unliganded (unlig) and liganded (lig) PDB files. For each pair, the cf and the time of execution (in seconds) are given.

