Radoslav Krivák, David Hoksza.
Abstract
BACKGROUND: Protein-ligand binding site prediction from a 3D protein structure plays a pivotal role in rational drug design and can be helpful in predicting drug side effects or elucidating protein function. Embedded within the binding site detection problem is the problem of pocket ranking: how to score and sort candidate pockets so that the best-scored predictions correspond to true ligand binding sites. Although multiple pocket detection algorithms exist, they mostly employ a fairly simple ranking function, leading to sub-optimal prediction results.
Keywords: Binding site prediction; Ligand binding site; Machine learning; Molecular recognition; Pocket score; Protein pocket; Random forests
Year: 2015 PMID: 25932051 PMCID: PMC4414931 DOI: 10.1186/s13321-015-0059-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Flowchart of the PRANK pocket ranking approach.
Figure 2. Visualization of inner pocket points. (a) Displayed is protein 1AZM from the DT198 dataset bound to one ligand (magenta). Fpocket predicted 13 pockets, which are depicted as colored areas on the protein surface. To rank these pockets, the protein was first covered with evenly spaced Connolly surface points (probe radius 1.6 Å), and only the points adjacent to one of the pockets were retained. The color of the points reflects their ligandability (green = 0 … red = 0.7) as predicted by a Random Forest classifier. The PRANK algorithm rescores pockets according to the cumulative ligandability of their corresponding points. Note that there are two clusters of ligandable points in the picture, one located in the upper dark-blue pocket and the other in the light-blue pocket in the middle. The light-blue pocket, which is in fact the true binding site, contains more ligandable points and will therefore be ranked higher. (b) Detailed view of the binding site with the ligand and inner pocket points.
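The rescoring step described in this caption can be illustrated with a minimal sketch. It assumes that each retained surface point already carries a ligandability value from some classifier and that the pocket score is the plain sum of those values; the exact aggregation used by PRANK is not given here, and the function name, pocket labels, and point values below are hypothetical.

```python
from collections import defaultdict


def rescore_pockets(point_scores):
    """Rank pockets by the cumulative ligandability of their points.

    point_scores: iterable of (pocket_id, ligandability) pairs, one per
    retained surface point. Plain summation is an assumption made for
    this illustration.
    """
    totals = defaultdict(float)
    for pocket_id, ligandability in point_scores:
        totals[pocket_id] += ligandability
    # Highest cumulative ligandability first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical points loosely mirroring the two clusters in the figure.
points = [
    ("dark_blue", 0.6), ("dark_blue", 0.5),
    ("light_blue", 0.7), ("light_blue", 0.6), ("light_blue", 0.5),
    ("grey", 0.1),
]
ranking = rescore_pockets(points)
ranked_ids = [pocket_id for pocket_id, _ in ranking]
```

With these made-up values the light-blue pocket accumulates the largest score and moves to the top of the list, which is the behavior the caption describes.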
Datasets statistics
| Dataset | Proteins | Ligands | #L | #P (FP) | #P (CC) | Cov (FP) [%] | Cov (CC) [%] | LS | PS (FP) | PS (CC) |
|---|---|---|---|---|---|---|---|---|---|---|
| CHEN11 | 251 | 476 | 1.90 | 12.41 | 1.75 | 71.0 | 52.3 | 26.9 | 38.9 | 51.0 |
| ASTEX | 85 | 143 | 1.68 | 21.58 | 2.25 | 81.1 | 65.7 | 23.2 | 41.9 | 56.9 |
| DT198 | 198 | 192 | 0.97 | 18.57 | 2.19 | 80.2 | 65.6 | 20.8 | 41.2 | 53.7 |
| MP210 | 210 | 288 | 1.37 | 14.50 | 1.99 | 78.8 | 68.2 | 22.8 | 40.0 | 50.9 |
| B48 | 48 | 54 | 1.13 | 12.06 | 1.96 | 92.6 | 81.5 | 21.9 | 37.8 | 44.2 |
| U48 | 48 | 54 | 1.13 | 11.40 | 1.79 | 88.9 | 77.8 | 21.9 | 38.0 | 46.8 |
Abbreviations: FP Fpocket, CC ConCavity.
#L: average number of ligands for one protein.
#P: average number of predicted pockets for one protein.
Cov: total coverage – success rate considering all predicted pockets (measured by DCA criterion with 4 Å threshold).
LS: average number of heavy atoms in the relevant ligands (ligand size).
PS: average number of protein surface atoms that belong to a predicted pocket (pocket size).
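The DCA criterion used for the coverage numbers is commonly defined as the distance from the center of a predicted pocket to the closest ligand atom, with a prediction counted as successful when that distance does not exceed the cutoff. A minimal sketch under that definition follows; the function names and coordinates are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math


def dca(pocket_center, ligand_atoms):
    """Distance (in Å) from the pocket center to the closest ligand atom."""
    return min(math.dist(pocket_center, atom) for atom in ligand_atoms)


def is_hit(pocket_center, ligand_atoms, cutoff=4.0):
    """True if the pocket identifies the ligand under the DCA criterion.

    The inclusive comparison (<=) is an assumption of this sketch.
    """
    return dca(pocket_center, ligand_atoms) <= cutoff


# Hypothetical coordinates in Å.
ligand = [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
near_pocket = (0.0, 0.0, 0.0)   # closest ligand atom is 3 Å away
far_pocket = (0.0, 8.0, 0.0)    # closest ligand atom is about 8.5 Å away

near_ok = is_hit(near_pocket, ligand)
far_ok = is_hit(far_pocket, ligand)
```

Total coverage (Cov) would then be the share of ligands for which at least one predicted pocket, at any rank, is a hit.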
Rescoring Fpocket and ConCavity predictions with PRANK: cross-validation results on CHEN11 dataset and the results of the final prediction model (trained on CHEN11-Fpocket) for all datasets
| Dataset | Top-n [%] | Rescored [%] | All [%] | Δ | % of possible* | P | R | MCC |
|---|---|---|---|---|---|---|---|---|
| Fpocket predictions | ||||||||
| CHEN11 (CV)** | 47.9 | 58.8 | 71 | +10.6 | 47.1 | 0.60 | 0.32 | 0.41 |
| CHEN11*** | 47.9 | 67.9 | 71 | +20 | 86.4 | 0.87 | 1.0 | 0.98 |
| ASTEX | 58 | 63.6 | 81.1 | +5.6 | 24.2 | 0.56 | 0.41 | 0.46 |
| DT198 | 37.5 | 56.2 | 80.2 | +18.8 | 43.9 | 0.31 | 0.38 | 0.33 |
| MP210 | 56.6 | 67.7 | 78.8 | +11.1 | 50 | 0.58 | 0.42 | 0.47 |
| B48 | 74.1 | 81.5 | 92.6 | +7.4 | 40 | 0.58 | 0.45 | 0.49 |
| U48 | 53.7 | 77.8 | 88.9 | +24.1 | 68.4 | 0.55 | 0.36 | 0.42 |
| ConCavity predictions | ||||||||
| CHEN11 (CV)** | 47.9 | 50.7 | 52.3 | +2.8 | 63.3 | 0.44 | 0.76 | 0.40 |
| CHEN11*** | 47.9 | 52.3 | 52.3 | +4.4 | 100 | 0.80 | 0.82 | 0.75 |
| ASTEX | 55.2 | 62.9 | 65.7 | +7.7 | 73.3 | 0.60 | 0.55 | 0.46 |
| DT198 | 45.8 | 61.5 | 65.6 | +15.6 | 78.9 | 0.33 | 0.55 | 0.34 |
| MP210 | 57.4 | 66.1 | 68.2 | +8.7 | 80.6 | 0.63 | 0.53 | 0.49 |
| B48 | 66.7 | 77.8 | 81.5 | +11.1 | 75 | 0.61 | 0.53 | 0.47 |
| U48 | 64.8 | 74.1 | 77.8 | +9.3 | 71.4 | 0.58 | 0.46 | 0.43 |
Abbreviations: P precision, R recall, MCC Matthews correlation coefficient.
*percentage of improvement that was theoretically possible to obtain by reordering pockets [ Δ / (All – Top-n)].
**cross-validation results.
***results where the test set was de facto the same as the training set for the Random Forest classifier (included here only for completeness).
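The footnote formula Δ / (All – Top-n) can be checked against a few rows of the table. The sketch below recomputes the percentage from the Top-n success rate, the rescored success rate, and the total coverage; the function name is hypothetical, and the inputs are the rounded values from the Fpocket rows, so small rounding differences are possible on other rows.

```python
def possible_improvement_pct(top_n, rescored, all_pockets):
    """Share of the theoretically achievable improvement that was realized.

    Implements 100 * (rescored - top_n) / (all_pockets - top_n).
    Returns None when reordering cannot improve the result at all.
    """
    headroom = all_pockets - top_n
    if headroom <= 0:
        return None
    return 100.0 * (rescored - top_n) / headroom


# (Top-n, rescored, All) taken from the Fpocket rows of the table.
astex = possible_improvement_pct(58.0, 63.6, 81.1)   # about 24.2
mp210 = possible_improvement_pct(56.6, 67.7, 78.8)   # about 50
b48 = possible_improvement_pct(74.1, 81.5, 92.6)     # about 40
```

A value of 100 means that rescoring placed a correct pocket within the Top-n for every ligand that any predicted pocket could identify.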
Figure 3. Rescoring Fpocket predictions on the CHEN11 dataset. Success rates of Fpocket compared with results rescored by PRANK on the CHEN11 dataset, considering Top-n, Top-(n+2) and all pockets (total coverage). Identification success is measured by the DCA criterion for a range of integer cutoff distances. Displayed results for rescored pockets are averaged over ten independent 5-fold cross-validation runs.
Figure 4. Detailed results. Table and heatmap showing success rates [%] of Fpocket predictions for the original and rescored output lists of pockets, together with the nominal improvements made by the PRANK rescoring algorithm on the CHEN11 dataset (measured by the DCA and DCC criteria for different integer cutoff distances). For the DCA criterion, the biggest improvements were achieved around the meaningful 4-6 Å cutoff distances. Displayed results are averages over ten independent 5-fold cross-validation runs. The four columns in each group show success rates calculated considering progressively more predicted pockets ranked at the top (where n is the number of known ligand-binding sites of the protein that includes the evaluated binding site). For proteins with just one binding site they correspond to the Top-1, Top-3 and Top-5 cutoffs that were commonly used to report results in previous ligand-binding site prediction studies.
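The Top-n and Top-(n+2) cutoffs can be made concrete with a short sketch. It assumes that, for each ligand, the DCA distances of the predicted pockets are available in ranked order; the function names and all distances are hypothetical.

```python
def top_n_success(ranked_dca, n_sites, extra=0, cutoff=4.0):
    """True if any of the top (n_sites + extra) pockets is within the cutoff.

    ranked_dca: DCA distances (Å) from each predicted pocket to one ligand,
    ordered from the best-ranked pocket to the worst.
    n_sites: number of known ligand-binding sites of the protein.
    """
    considered = ranked_dca[: n_sites + extra]
    return any(distance <= cutoff for distance in considered)


def success_rate(cases, extra=0, cutoff=4.0):
    """Percentage of ligands identified; cases is a list of (ranked_dca, n_sites)."""
    hits = sum(top_n_success(d, n, extra, cutoff) for d, n in cases)
    return 100.0 * hits / len(cases)


# Two hypothetical ligands, each on a protein with a single binding site.
cases = [
    ([9.2, 12.0, 3.1, 7.5], 1),  # correct pocket ranked third
    ([2.4, 15.0], 1),            # correct pocket ranked first
]
top1 = success_rate(cases)            # Top-n, here Top-1
top3 = success_rate(cases, extra=2)   # Top-(n+2), here Top-3
```

Reordering the first list so that the 3.1 Å pocket comes first would raise the Top-1 rate without changing total coverage, which is the kind of gain a rescoring method aims for.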
PRANK vs. simpler rescoring methods
| Dataset | Top-n | All | PRANK | Δ | PLB | Δ | VOL | Δ |
|---|---|---|---|---|---|---|---|---|
| Fpocket predictions | ||||||||
| CHEN11 | 47.9 | 71 | 58.8** | +10.6 | 49.8 | +1.9 | 34.5 | -13.4 |
| ASTEX | 58 | 81.1 | 63.6 | +5.6 | 56.6 | -1.4 | 32.2 | -25.9 |
| DT198 | 37.5 | 80.2 | 56.2 | +18.8 | 43.2 | +5.7 | 19.3 | -18.2 |
| MP210 | 56.6 | 78.8 | 67.7 | +11.1 | 54.5 | -2.1 | 30.6 | -26 |
| B48 | 74.1 | 92.6 | 81.5 | +7.4 | 72.2 | -1.9 | 42.6 | -31.5 |
| U48 | 53.7 | 88.9 | 77.8 | +24.1 | 66.7 | +13 | 31.5 | -22.2 |
| ConCavity predictions | ||||||||
| CHEN11 | 47.9 | 52.3 | 50.7** | +2.8 | 50.4 | +2.5 | 50.2 | +2.3 |
| ASTEX | 55.2 | 65.7 | 62.9 | +7.7 | 62.9 | +7.7 | 63.6 | +8.4 |
| DT198 | 45.8 | 65.6 | 61.5 | +15.6 | 56.8 | +10.9 | 59.4 | +13.5 |
| MP210 | 57.4 | 68.2 | 66.1 | +8.7 | 64.9 | +7.3 | 64.6 | +6.9 |
| B48 | 66.7 | 81.5 | 77.8 | +11.1 | 79.6 | +13 | 75.9 | +9.3 |
| U48 | 64.8 | 77.8 | 74.1 | +9.3 | 75.9 | +11.1 | 70.4 | +5.6 |
PLB - rescoring by the Propensity for Ligand Binding index based on amino acid composition of pockets [29].
VOL - rescoring by approximate volume.
**cross-validation results.
The number presented for the rescoring methods (columns: PRANK, PLB, VOL) is the success rate considering the Top-n predicted pockets, measured by the DCA criterion with a 4 Å threshold.