Literature DB >> 27403208

bSiteFinder, an improved protein-binding sites prediction server based on structural alignment: more accurate and less time-consuming.

Jun Gao1, Qingchen Zhang2, Min Liu3, Lixin Zhu4, Dingfeng Wu2, Zhiwei Cao2, Ruixin Zhu2.   

Abstract

MOTIVATION: Protein-binding sites prediction lays a foundation for functional annotation of protein and structure-based drug design. As the number of available protein structures increases, structural alignment based algorithm becomes the dominant approach for protein-binding sites prediction. However, the present algorithms underutilize the ever increasing numbers of three-dimensional protein-ligand complex structures (bound protein), and it could be improved on the process of alignment, selection of templates and clustering of template. Herein, we built so far the largest database of bound templates with stringent quality control. And on this basis, bSiteFinder as a protein-binding sites prediction server was developed.
RESULTS: By introducing Homology Indexing, Chain Length Indexing, Stability of Complex and Optimized Multiple-Templates Clustering into our algorithm, the efficiency of our server has been significantly improved. Further, the accuracy was approximately 2-10 % higher than that of other algorithms for the test with either bound dataset or unbound dataset. For 210 bound dataset, bSiteFinder achieved high accuracies up to 94.8 % (MCC 0.95). For another 48 bound/unbound dataset, bSiteFinder achieved high accuracies up to 93.8 % for bound proteins (MCC 0.95) and 85.4 % for unbound proteins (MCC 0.72). Our bSiteFinder server is freely available at http://binfo.shmtu.edu.cn/bsitefinder/, and the source code is provided at the methods page.
CONCLUSION: An online bSiteFinder server is freely available at http://binfo.shmtu.edu.cn/bsitefinder/. Our work lays a foundation for functional annotation of protein and structure-based drug design. With ever increasing numbers of three-dimensional protein-ligand complex structures, our server should be more accurate and less time-consuming.Graphical Abstract bSiteFinder (http://binfo.shmtu.edu.cn/bsitefinder/) as a protein-binding sites prediction server was developed based on the largest database of bound templates so far with stringent quality control. By introducing Homology Indexing, Chain Length Indexing, Stability of Complex and Optimized Multiple-Templates Clustering into our algorithm, the efficiency of our server have been significantly improved. What's more, the accuracy was approximately 2-10 % higher than that of other algorithms for the test with either bound dataset or unbound dataset.

Entities:  

Keywords:  Index; Multiple-Templates Clustering; Protein-binding sites prediction; Structural alignment; Web server

Year:  2016        PMID: 27403208      PMCID: PMC4939519          DOI: 10.1186/s13321-016-0149-z

Source DB:  PubMed          Journal:  J Cheminform        ISSN: 1758-2946            Impact factor:   5.514


Background

Most biological processes involve the interaction of ligands with proteins. Functional characterization of ligand-binding sites of proteins is a key issue in understanding those biological processes [1-4]. In addition, identifying the location of protein-binding sites is a vital first step in structure-based drug design [5-8]. However, functional characterization of proteins through experimental method is a labor intensive and time-consuming process. A computational tool to predict the functional binding sites in a protein is therefore of practical importance. To date, a variety of computational methods have been developed for protein-binding sites prediction, which can be divided into four categories: geometry based methods [9-14], energy based methods [15, 16], alignment based methods [17-20] and other miscellaneous methods [21-23]. Alignment based methods can be further divided into sequence alignment based and structural alignment based methods. Recently, increasing structural genomics projects have led to the exponential growth of the number of available protein structures. As a consequence, structural alignment based methods exceeded other methods due to its more efficient and more accurate performance. In 1996, Lichtarge et al. [17] developed the first structural alignment based algorithm for protein-binding sites prediction, entitled evolutionary trace method (ET method). It is based on the extraction of functionally important residues from sequence conservation patterns in homologous proteins, and on their mapping onto the protein surface to generate clusters identifying functional interfaces. In 2007, Brylinski and Skolnick developed a popular structural alignment method called FINDSITE [18]. For a given target sequence, FINDSITE identifies ligand-bound template structures from a set of distantly homologous proteins recognized by the PROSPECTOR_3 threading approach and superposes them onto the target’s structure using the TM-align structural alignment algorithm. Binding pockets are identified by the spatial clustering of the center of mass of template-bound ligands that are subsequently ranked by the number of binding ligands. In 2009, Oh et al. [24] developed LEE, a two-stage template-based ligand binding site prediction method, where templates are used first for protein 3D modeling and then for binding site prediction by structural clustering of ligand-containing templates to the predicted 3D model. Later in 2010, Wass et al. [25] described a new method called 3DligandSite. Structures similar to the query are identified by using MAMMOTH [26] against a library of protein structures with bound ligands. The structural based alignment of the similar structures and the query superposes ligands onto the query structures. After filtering, the top 25 ligands are retained for analysis and further clustering. In 2012, another comparative approach called COFACTOR was proposed by Zhang group [19]. COFACTOR recognizes functional sites of protein–ligand interactions using low-resolution protein structural models, based on a global-to-local sequence and structural comparison algorithm. The major advantage of COFACTOR over the existing methods is the optimal combination of global and local structural comparisons for identifying protein-binding sites. But, the global comparison can be distracted by structural variations in the regions far away from the binding pockets; meanwhile the local comparison has a high false positive rate since the number of residues involved is too small. Later in 2013, Zhang group published another structural alignment based algorithm, TM-SITE [20]. Different from COFACTOR, TM-SITE compares the structures of a subsequence from the first binding residue to the last binding residue (called SSFL) on the query and template proteins, which solve the problems of global-to-local structural comparison algorithm. These methods provide us valuable choices to predict the binding sites. However, their performance needs to be improved for lack of accuracy or time-efficiency or both since the structural information of protein–ligand complexes (bound protein) are underutilized. Herein, we built so far the largest database of bound templates with stringent quality control. And on this basis, Stability of Complex as a new criterion and Optimized Multiple-Templates Clustering algorithm are introduced to improve the accuracy. Meanwhile, Homology Indexing and Chain Length Indexing are used to accelerate the efficiency of the structural alignment. Finally, we presented a user friendly protein-binding sites prediction web server (bSiteFinder), at http://binfo.shmtu.edu.cn/bsitefinder/.

Methods

Definitions of operations

Rules of five

The protein data in PDB database are filtered through the rules below: The macromolecule type is protein, no DNA and RNA. Experiment method is set to X-ray. X-ray resolution is between 0 and 3.0. Has free ligands = yes. Sequence length is over 20.

Number of ligand atoms

In the process of building databases, which database a protein finally falls into depends on whether it contains ligands and whether these ligands have enough atoms. For this reason, ligands identification, which is judged by the rules mentioned below, plays a key role. Every HETATM residue is recognized through HET records from the header of PDB files. Notably, some of the residues are modified on normal chains, which are not counted as true ligands because of their present in the MODRES records. Hence, the selected ligands only come from HET records excluding MODRES ones. Water molecule is included in HETATM but not regarded as a ligand. Analyzing the data, we define that a ligand should possess 6 or more atoms as a basic rule to identify a ligand.

Stability of Complex

The binding site check criterion is using as the standard of judging the bound structure’s stability. Only if any one of atoms of the ligand has a distance within 4 Å from the geometry center of the calculated binding site, the structure of complex is considered to be stable.

Homology Indexing

Homology Indexing is implemented by using SCOPe, version 2.03 [27]. First, a four-digit classification number is searched based on PDB ID and CHAIN ID of the query chain. After that, all the protein chains with the same classification number are obtained and used to constitute the template database for subsequent structural alignment.

Chain Length Indexing

Only the chains, which have length difference with query chain less than 30 %, are used as candidates for subsequent structural alignment.

Structural alignment

The structural alignment between query and templates in bSiteFinder is implemented by using Combinatorial Extension (CE) algorithm, which is provided by Biojava [28]. Different from traditional dynamic programming algorithm and Monte Carlo algorithm, CE algorithm defines continuous residues in the sequence as aligned fragment pairs (AFPs), which is used in local alignment between query and template. Finally, the optimized alignment results are obtained by expanding or abandoning the local AFPs.

Optimized Multiple-Templates Clustering

After structural alignment, template will be mapped to query. Then, the templates which meet the requirement of Stability of Complex are ranked according to the similarity with query chain, and ligands of the top 20 templates at most will be picked out. After 20 times of structural alignments, all the ligands in templates will be mapped to the query. Further, these ligands are clustered into different clusters. The number of ligand geometric centers, which have a distance less than 3 Å from the certain ligand geometric center, is counted for each ligand. After that, the ligand with the largest number is defined as the center of the Top1 binding site (Fig. 1). Then, this ligand and all the other ligands within 3 Å are removed for searching the centers of the Top2 and Top3 binding site in the same way.
Fig. 1

Workflow of Optimized Multiple-Templates Clustering. Template (b) is mapped to query (a) by structural alignment to form query-template complex (c). Then, the template chain will be removed, and the ligand will be retained (d). After 20 times of structural alignments, the ligands in templates will be mapped to the query (e). The number of ligand geometric centers, which have a distance less than 3 Å from the certain ligand geometric center, is counted for each ligand (f). The ligand with the largest number is defined as the center of the Top1 binding site (g)

Workflow of Optimized Multiple-Templates Clustering. Template (b) is mapped to query (a) by structural alignment to form query-template complex (c). Then, the template chain will be removed, and the ligand will be retained (d). After 20 times of structural alignments, the ligands in templates will be mapped to the query (e). The number of ligand geometric centers, which have a distance less than 3 Å from the certain ligand geometric center, is counted for each ligand (f). The ligand with the largest number is defined as the center of the Top1 binding site (g)

Detection of binding sites

On the condition that protein chains have ligands, we define all residues within the distance of 8 Å from ligands as the components of the binding site. On the condition that binding site is detected by doing structural alignment with templates, all residues within the distance of 10 Å from mapped ligands are defined as the components of the binding site. It should be noted that if the bound proteins’ stabilities did not pass the evaluation of Stability of Complex, the bound proteins would be treated as unbound proteins with original ligands removed.

Test and evaluation methods

For comparing with other binding site prediction algorithms, two widespread adopted datasets from LIGSITEcsc [29] were used for testing our algorithm with the same criteria of evaluating the accuracy of binding site prediction. The first test set contained 210 proteins with ligands (bound dataset). At the suggestion of RCSB, protein 1B6N was replaced by 1Z1H. The second test set contained 48 proteins with/without ligands (bound/unbound dataset). Here, the accuracy and Matthews Correlation Coefficient (MCC) [30] were both used to evaluate our algorithm.

Accuracy

A widely accepted verification method [13] was used. For bound protein, if the protein–ligand’s stability has passed the evaluation of Stability of Complex, the accuracy is 100 %. If the protein–ligand’s stability did not pass the evaluation of Stability of Complex, the original ligands of bound protein will be removed and in this situation, the bound protein will be regarded as unbound protein and may have a lower accuracy. For unbound proteins, if the geometric center of a binding site has a distance within 4 Å from any one of the atoms of the predicted ligands, this binding site is regarded as a correctly predicted binding site. Otherwise, this binding site is regarded as an incorrectly predicted binding site.

MCC

Another evaluation index, MCC, was also used to evaluate the accuracy of binding site prediction. For each protein chain, all the residues were divided into four categories: TP: correctly predicted binding site residues; TN: correctly predicted nonbinding site residues; FP: incorrectly predicted as binding site residues; and FN: incorrectly predicted as nonbinding site residues. MCC scores are defined as:For bound proteins that passed the evaluation of Stability of Complex, the MCC is 1. Otherwise, the bound proteins was regarded as unbound proteins and MCC would be lower than 1. For unbound proteins, the structural alignment between query and template is implemented to map the ligands in bound proteins to the unbound proteins. Then, the mapped pseudo ligands were used to detect the binding site as describe in “Detection of Binding Sites”. To evaluate our methods, we divided the residues of query chains into residues of predicted binding site (Res-BS-Pre) and residues of predicted non-binding site (Res-NBS-Pre). At the same time, we also define residues of experimental binding site as Res-BS-Exp and residues of experimental non-binding site as Res-NBS-Exp according to the original ligands of query chains. Therefore, in formula (1), TP is the intersection of Res-BS-Pre and Res-BS-Exp, and TN is the intersection of Res-NBS-Pre and Res-NBS-Exp, and FP is the intersection of Res-BS-Pre and Res-NBS-Exp, and FN is the intersection of Res-NBS-Pre and Res-BS-Exp.

Experimental

Create template database

Our algorithm will maximize the information of bound proteins. Herein, we built so far the largest database of bound templates from PDB database with stringent quality control. Figure 2 shows the workflow of creating template database, which include four steps as follow: (1) 97,591 complex structures in PDB database (February 11, 2014) were filtered according to Rules of Five, and 62,487 complex structures were obtained. (2) Proteins were divided into chains, and then the chains which are less than 20 residues in length were removed. After that, 146,089 chains were obtained. (3) Number of Ligand Atoms was employed to ensure that there is at least one ligand in the complex structures of each chain, and 117,823 chains were obtained. (4) Stability of Complex was employed to ensure that it forms a stable bound structure of each chain with its ligand. Finally, 101,315 chains were obtained for building the database of bound templates.
Fig. 2

Workflow of creating template database

Workflow of creating template database

Workflow of binding sites detection

When a query protein is submitted by user for binding site prediction, it will be firstly divided into chains. After that, the prediction will be done for each chain. Figure 3 shows the workflow of binding sites detection. Each protein chain will be processed by following steps:
Fig. 3

Workflow of binding sites detection. Each protein chain submitted would be processed successively by following steps: 1 Binding sites prediction of high quality bound protein (Part 1), or enter the following process. 2 Binding sites prediction of unbound protein with bound templates of same Homology Indexing (Part 2), or enter the following process. 3 Binding sites prediction of unbound protein with bound templates of Chain Length Indexing (Part 3). Any protein chains submitted into our system could receive the results of binding sites via efficient computation

Workflow of binding sites detection. Each protein chain submitted would be processed successively by following steps: 1 Binding sites prediction of high quality bound protein (Part 1), or enter the following process. 2 Binding sites prediction of unbound protein with bound templates of same Homology Indexing (Part 2), or enter the following process. 3 Binding sites prediction of unbound protein with bound templates of Chain Length Indexing (Part 3). Any protein chains submitted into our system could receive the results of binding sites via efficient computation Binding sites prediction of high quality bound protein (Part 1) Detection of Binding Sites is employed for binding site detection, when the protein chains meet the requirement of Number of Ligand Atoms and Stability of Complex. Otherwise, enter the following process. Binding sites prediction of unbound protein with bound templates of same Homology Indexing (Part 2) If the query chain has a four-digit classification number in SCOPe and has bound template with the same Homology Indexing in template database, the binding site of this query chain will be detected as the following procedure. First, structural alignments between query chain and templates will be done, and the top 20 bound templates which are the most similar to the query will be selected subsequently. The locations of ligands are detected by mapping the ligands in templates to the query, and then the optimization of binding sites was following by using the new developed Optimized Multiple-Templates Clustering method. Finally, Detection of Binding Sites will be employed for binding site detection. Otherwise, enter the following process. Binding sites prediction of unbound protein with bound templates of Chain Length Indexing (Part 3) If the query chain has no satisfactory homologous bound template, the binding site of this query chain will be detected as the following procedure. Chain Length Indexing will be employed to search the bound templates, which have difference with query chain less than 30 % in length, in template database. Then enter the process as the description above (Part 2 of “Workflow of binding sites detection”) with top 20 most similar bound templates. Any protein chains submitted into our system could receive the results of binding sites via efficient computation.

Results and discussion

Performance of our algorithm and its comparison with others

Two widely adopted datasets including 210 bound and 48 bound/unbound dataset [29] were used for testing our algorithm, and the results are shown in Tables 1 and 2. The accuracy of our algorithm is approximately 2–10 % higher than that of other algorithms for the test with either bound or unbound datasets. In addition, with size of the dataset increased, our algorithm exhibited even more advantage over others regarding accuracy (The accuracy differences between our algorithm and the second highest algorithm in the Top1 increase from 2.4 % with 48 unbound dataset to 11.8 % with 210 unbound dataset).
Table 1

Comparison of the top1 and top3 success rates for various methods using 210 bound structures

MethodTop1a (%)Top3a (%)
bSiteFinder 94.8 95.7
LISEb 8394
MPK2b 8195
MPK1b 7593
Q-SiteFinderb 7090
LIGSITECSCb 75
LIGSITECSb 7086
PASSb 5180
SURFNETb 4257

aThe MCC scores of the Top1 and Top3 are 0.95 and 0.97 respectively with 210 bound structures

bThe success rates of these methods were taken from Xie and Hwang [32]

Table 2

Comparison of the top1 and top3 success rates for various methods using 48 bound/unbound structures

MethodBounda Unboundb
Top1 (%)Top3 (%)Top1 (%)Top3 (%)
bSiteFinder 93.8 98.7 85.4 95.8
LISEc 92968192
MPK2c 85968094
VICEc 85948390
MPK1c 83967590
DoGSitec 83927192
Fpocketc 83926994
LIGSITECSc 81927185
LIGSITECSCc 7971
MSPocketc 77947588
POCASAc 77907592
Q-SiteFinderc 75905275
PocketPickerc 72856985
CASTc 67835875
PASSc 63816071
SURFNETc 54785275

aThe MCC scores of the Top1 and Top3 are 0.95 and 0.97 respectively with 48 bound structures

bThe MCC scores of the Top1 and Top3 are 0.72 and 0.75 respectively with 48 unbound structures

cThe success rates of these methods were taken from Xie and Hwang [32]

Comparison of the top1 and top3 success rates for various methods using 210 bound structures aThe MCC scores of the Top1 and Top3 are 0.95 and 0.97 respectively with 210 bound structures bThe success rates of these methods were taken from Xie and Hwang [32] Comparison of the top1 and top3 success rates for various methods using 48 bound/unbound structures aThe MCC scores of the Top1 and Top3 are 0.95 and 0.97 respectively with 48 bound structures bThe MCC scores of the Top1 and Top3 are 0.72 and 0.75 respectively with 48 unbound structures cThe success rates of these methods were taken from Xie and Hwang [32] For bound chain (such as PDB ID: 5p2p, CHAIN ID: A), the binding site is composed of residues within 8 Å from the ligand (Fig. 4a). For unbound chain (such as PDB ID: 3p2p, CHAIN ID: A), unlike bound chain, the binding site is detected with the aid of templates (PDB ID: 1oxr, CHAIN ID: A). First, the ligand in template is mapped to unbound chain. Then the binding site is composed of residues within 10 Å from the ligand (Fig. 4b). See Method part for details.
Fig. 4

a Binding site of bound chain (PDB ID: 5p2p, CHAIN ID: A). The binding site is composed of residues (green) within 8 Å from the ligand (red). b Binding site of unbound chain (PDB ID: 3p2p, CHAIN ID: A, blue). The detection of binding site is based on the bound template (PDB ID: 1oxr, CHAIN ID: A, black) by mapping the ligand of template into unbound chain. And the binding site is composed of residues (green) within 10 Å from the ligand (red), which is different from 8 Å for bound chain prediction

a Binding site of bound chain (PDB ID: 5p2p, CHAIN ID: A). The binding site is composed of residues (green) within 8 Å from the ligand (red). b Binding site of unbound chain (PDB ID: 3p2p, CHAIN ID: A, blue). The detection of binding site is based on the bound template (PDB ID: 1oxr, CHAIN ID: A, black) by mapping the ligand of template into unbound chain. And the binding site is composed of residues (green) within 10 Å from the ligand (red), which is different from 8 Å for bound chain prediction

Indexed alignment

Since there are still lots of protein chains have no satisfactory bound structures, bound templates is borrowed for detecting the binding sites in this situation. Our templates database contains 101,315 bound templates. It would consume a large amount of computation for predicting the binding site if structural alignments go through all the chains in the database. Thus, to improve the efficiency of our algorithm, Homology Indexing is introduced and then the time-consuming structural alignment will be limited only among homologous proteins. After building Homology Indexing for all 101,315 chains in template database by using SCOPe [27], 4254 protein classes are obtained. It means that only about 24 (101,315/4254) bound templates are needed to do the time-consuming structural alignment with the query per prediction. This would significantly reduce the computation time. Table 3 shows the alignment frequency between templates and the query from the 48 unbound dataset after Homology Indexing is used. Without Homology Indexing, 48 unbound dataset should be aligned with each of chains in template database, which means that there are 48 × 101,315 time-consuming structural alignments needed to be done. But, with the Homology Indexing introduced, it can be reduced to 25,127 structural alignments, which only account for only 0.5 % of that without Homology Indexing. It’s worth noting that alignment frequencies, in Table 3, reach hundreds or even thousands in practical, which may be due to the uneven distribution of different protein families in template database at present.
Table 3

Frequency of structural alignment with 48 unbound chains using Homology Indexing

PDB IDCHAIN IDAlignment frequencyPDB IDCHAIN IDAlignment frequencyPDB IDCHAIN IDAlignment frequencyPDB IDCHAIN IDAlignment frequency
3tmsA5351ifbA1711cgeA2731bbsA747
8adhA4243ptnA11811hsiB12031stnA144
1hxfH13261ypiA1701a4jB4891ptsA268
2fbpA2695dfrA4631imeA1732ctbA106
1gcgA1693phvA11531nnaA4162cbaA522
1helA2032ctvA6251ahcA1881krnA6
1npcA1545cpaA1062tgaA11762silA377
1esaA12461a6uH3974ca2A5231l3fE156
1brqA3441qifA5671pdyA561chgA1160
8ratA1733appA7531phcA8736insE124
1swbA2691djbA6201psnA7443p2pA209
1ulaA7241byaA783lckA31317ratA167
Frequency of structural alignment with 48 unbound chains using Homology Indexing Although the efficiency of binding-sites prediction for unbound chains has been significantly increased benefiting from Homology Indexing, there are still some chains of no satisfactory homologous template structures, such as PDB ID: 4h12, CHAIN ID: A. For this kind of protein chains, we further introduce Chain Length Indexing to reduce the number of time-consuming structural alignments. Table 4 shows the alignment frequency between templates and the query from 20 dataset of no appropriate homologous templates after Chain Length Indexing is used. Without Chain Length Indexing, the 20 dataset of no homologous template chains should be aligned with each of chains in template database, which mean that there are 20 × 101,315 time-consuming structural alignments needed to be done. But, with the Chain Length Indexing introduced, it can be reduced to 663,739 structural alignments, which only account for 32.8 % of the number without Chain Length Indexing (Table 4).
Table 4

Frequency of structural alignment in 20 no homologous template chains with Chain Length Indexing involved

Query chainQuery lengthAlignment frequencyPercentage of sequences passed Chain Length Indexing (%)Query chainQuery lengthAlignment frequencyPercentage of sequences passed Chain Length Indexing (%)
4ggbA34839,45838.91wakA35339,01438.5
2yzvA28639,29138.84ff5A22731,36231.0
4iezA18622,88222.64fk9A31440,44639.9
3ii7A28839,61439.11ujcA15617,10916.9
3a3jA34439,78239.33ianA31940,00039.5
3chlA31540,32539.83rloA19625,45525.1
2cf5A35239,19238.73mfcA18723,36923.1
2iq1A25736,14135.73dgtA27838,64138.1
1wy0A32740,20039.72dh6A33140,19539.7
2y7bA13414,99814.81w4sA14616,26516.1
Frequency of structural alignment in 20 no homologous template chains with Chain Length Indexing involved It would be argued that the best template will be excluded by the use of Chain Length Indexing. However, the result indicates that, with or without Chain Length Indexing, there are no significant differences in the length between templates (Table 5).
Table 5

Top1 template for 20 no homologous template chains and their length obtained without Chain Length Indexing

Query chainQuery lengthTemplate chainTemplate lengthTemplate chain (length constrained)Template length
3mq1F1003mq1A1013mq1A101
4kh0B1504kgvB1454kgvB145
4fzbO2004fzbK2014fzbK201
3ujoC2503ujoD2503ujoD250
3zq6A3003zq6C2843zq6C284
3mk6B3514ehtB2604ehtB260
2q14B4002q14H3982q14H398
2yg4B4502yg3A4492yg3A449
4k3tA4984k3tB4984k3tB498
4bthB5462wybB5462wybB546
4mfdC5954jx5A5964jx5A596
3szgA6503sytC6523sytC652
3alaF7003alaE7013alaE701
3w3lA7513w3lB7513w3lB751
3lq4A8011rp7A8011rp7.A801
3zhuD8522yidD8522yidD852
2wyhB8912f7oA10142f7oA1014
2okxB9542okxA9542okxA954
2xt6B9892xt6A10552xt6A1055
4dx5A10442j8sA10442j8sA1044
Top1 template for 20 no homologous template chains and their length obtained without Chain Length Indexing

Stability of Complex

Examining the bound chain structures in PDB database, it is observed that ligands do not always have a stable binding with protein chains at binding site, such as PDB ID: 2j22, CHAIN ID: A (Fig. 5). For this kind of bound structures, binding sites could not be computed directly based on their ligands. Thus, Stability of Complex is introduced into our algorithm to avoid these situations.
Fig. 5

Unstable bound structure of ligand (GOL, red) and protein chain (PDB ID: 2j22, CHAIN ID: A, blue) at binding site

Unstable bound structure of ligand (GOL, red) and protein chain (PDB ID: 2j22, CHAIN ID: A, blue) at binding site Looking for similar templates by structural alignments is needed for unbound chains which have no ligands to compute the binding site. In the process of structural alignment and ligand mapping successively, ligand in template may not have a stable bind with unbound chain (Fig. 6a, b). Likewise, Stability of Complex is employed here to decide whether ligand from template and unbound chain can form a new stable bound structure.
Fig. 6

a Unbound chain (PDB ID: 1bbs, CHAIN ID: A, blue) and related appropriate template (PDB ID: 1hrn, CHAIN ID: B, yellow). After mapping the ligand (03D, red) in template to unbound chain, a new stable bound structure is formed with the tightly binding between the ligand and unbound chain. The top 20 templates at most ranked according to the similarity would be subsequently clustered. b Unbound chain (PDB ID: 1bbs, CHAIN ID: A, blue) and related appropriate template (PDB ID: 3g6z, CHAIN ID: A, yellow). After mapping the ligand (NAG, red) in template to unbound chain, a new stable bound structure could not be formed. The reason is that there are more residues (see the red circle) in template than unbound chain which have a close connection with the ligand

a Unbound chain (PDB ID: 1bbs, CHAIN ID: A, blue) and related appropriate template (PDB ID: 1hrn, CHAIN ID: B, yellow). After mapping the ligand (03D, red) in template to unbound chain, a new stable bound structure is formed with the tightly binding between the ligand and unbound chain. The top 20 templates at most ranked according to the similarity would be subsequently clustered. b Unbound chain (PDB ID: 1bbs, CHAIN ID: A, blue) and related appropriate template (PDB ID: 3g6z, CHAIN ID: A, yellow). After mapping the ligand (NAG, red) in template to unbound chain, a new stable bound structure could not be formed. The reason is that there are more residues (see the red circle) in template than unbound chain which have a close connection with the ligand Similarly, Stability of Complex is introduced to build a template database (see details in Fig. 2), which reduced the number of bound structures from 117,823 to 101,315 with 14 % structures removed. Not only improved the quality of template database, this operation also reduced the number of time-consuming structural alignments.

An Optimized Multiple-Templates Clustering method

Similar to FINDSITE [31], 3DLigandSite [25] and COFACTER [19], the prediction accuracy of our algorithm is improved by Optimized Multiple-Templates Clustering. However, in other works, the cluster number is required in previous algorithms, which actually could not be obtained before computing. In addition, the distances between ligands in each cluster have no reasonable physical meaning. In our algorithm, this deficiency is overcome by defining a new constraint, which restrict that the distances between geometric centers of all the ligands (for one binding site) in the same cluster should be less than a certain threshold (cluster radius). Ligands in multiple templates could be clustered automatically following the constraint with reasonable physical meaning, and there has no need to estimate cluster number before clustering. Considering the space complexity of bound structure, cluster radius to be used is optimized based on test set. For 48 unbound dataset, threshold is set from 1.0 to 8.0 Å to compute the accuracy of the Top1 and Top3. Table 6 shows the accuracy computed with different cluster radius, and the accuracies of the Top1 range from 72.3 to 85.4 %. It’s worth noting that the accuracy of our algorithm with any cluster radius is higher than that of other algorithms (Tables 2, 6).
Table 6

Comparison of prediction accuracies using Optimized Multiple-Templates Clustering with different cluster radius with 48 unbound dataset

Threshold (Å)Top1Top3Threshold (Å)Top1Top3
10.7920.95850.8540.938
20.8370.95860.7920.918
3 0.854 0.958 70.7230.867
40.8530.93880.7540.876
Comparison of prediction accuracies using Optimized Multiple-Templates Clustering with different cluster radius with 48 unbound dataset Result in Table 6 indicates that the Top1 and Top3 have highest prediction accuracies with 48 unbound dataset, when cluster radius is set to 3.0 Å. Thus, 3.0 Å is set as the default parameter by bSiteFinder in Optimized Multiple-Templates Clustering.

Conclusions

bSiteFinder as a protein-binding sites prediction server was developed based on the largest database of bound templates so far with stringent quality control. Each protein chain submitted would be processed by following steps: (1) Binding sites prediction of high quality bound protein; (2) Binding sites prediction of unbound protein with bound templates of same Homology Indexing; (3) Binding sites prediction of unbound protein with bound templates of Chain Length Indexing. Any protein chain submitted could receive the results of binding sites via efficient computation. By introducing Homology Indexing, Chain Length Indexing, Stability of Complex and Optimized Multiple-Templates Clustering into our algorithm, the efficiency of our server have been significantly improved. What’s more, the accuracy was approximately 2–10 % higher than that of other algorithms for the test with either bound dataset or unbound dataset. For 210 bound dataset, bSiteFinder achieved high accuracies up to 94.8 % (MCC 0.95). For another 48 bound/unbound dataset, bSiteFinder achieved high accuracies up to 93.8 % for bound proteins (MCC 0.95) and 85.4 % for unbound proteins (MCC 0.72). An online bSiteFinder server is freely available at http://binfo.shmtu.edu.cn/bsitefinder/, and the source code is provided at the methods page. Our work lays a foundation for functional annotation of protein and structure-based drug design. With ever increasing numbers of three-dimensional protein–ligand complex structures, our server should be more accurate and less time-consuming.
  32 in total

1.  POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids.

Authors:  D G Levitt; L J Banaszak
Journal:  J Mol Graph       Date:  1992-12

2.  FTSite: high accuracy detection of ligand binding sites on unbound protein structures.

Authors:  Chi-Ho Ngan; David R Hall; Brandon Zerbe; Laurie E Grove; Dima Kozakov; Sandor Vajda
Journal:  Bioinformatics       Date:  2011-11-22       Impact factor: 6.937

3.  Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites.

Authors:  Alasdair T R Laurie; Richard M Jackson
Journal:  Bioinformatics       Date:  2005-02-08       Impact factor: 6.937

Review 4.  Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening.

Authors:  Alasdair T R Laurie; Richard M Jackson
Journal:  Curr Protein Pept Sci       Date:  2006-10       Impact factor: 3.272

5.  A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation.

Authors:  Michal Brylinski; Jeffrey Skolnick
Journal:  Proc Natl Acad Sci U S A       Date:  2007-12-28       Impact factor: 11.205

6.  LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins.

Authors:  M Hendlich; F Rippmann; G Barnickel
Journal:  J Mol Graph Model       Date:  1997-12       Impact factor: 2.518

7.  Protein interactions and ligand binding: from protein subfamilies to functional specificity.

Authors:  Antonio Rausell; David Juan; Florencio Pazos; Alfonso Valencia
Journal:  Proc Natl Acad Sci U S A       Date:  2010-01-19       Impact factor: 11.205

8.  3DLigandSite: predicting ligand-binding sites using similar structures.

Authors:  Mark N Wass; Lawrence A Kelley; Michael J E Sternberg
Journal:  Nucleic Acids Res       Date:  2010-05-31       Impact factor: 16.971

9.  LISE: a server using ligand-interacting and site-enriched protein triangles for prediction of ligand-binding sites.

Authors:  Zhong-Ru Xie; Chuan-Kun Liu; Fang-Chih Hsiao; Adam Yao; Ming-Jing Hwang
Journal:  Nucleic Acids Res       Date:  2013-04-22       Impact factor: 16.971

10.  A new protein-ligand binding sites prediction method based on the integration of protein sequence conservation information.

Authors:  Tianli Dai; Qi Liu; Jun Gao; Zhiwei Cao; Ruixin Zhu
Journal:  BMC Bioinformatics       Date:  2011-12-14       Impact factor: 3.169

View more
  3 in total

1.  Proteins and Their Interacting Partners: An Introduction to Protein-Ligand Binding Site Prediction Methods with a Focus on FunFOLD3.

Authors:  Danielle Allison Brackenridge; Liam James McGuffin
Journal:  Methods Mol Biol       Date:  2021

2.  PrankWeb: a web server for ligand binding site prediction and visualization.

Authors:  Lukas Jendele; Radoslav Krivak; Petr Skoda; Marian Novotny; David Hoksza
Journal:  Nucleic Acids Res       Date:  2019-07-02       Impact factor: 16.971

3.  P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.

Authors:  Radoslav Krivák; David Hoksza
Journal:  J Cheminform       Date:  2018-08-14       Impact factor: 5.514

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.