Laeeq Ahmed, Valentin Georgiev, Marco Capuccini, Salman Toor, Wesley Schaal, Erwin Laure, Ola Spjuth.
Abstract
BACKGROUND: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner, by docking and scoring all available ligands. CONTRIBUTION: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring' ligands. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.
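To illustrate how a conformal predictor can yield single-label, multi-label, or empty predictions, here is a minimal sketch of the p-value step in a class-conditional (Mondrian) inductive conformal classifier. It is not the authors' implementation: the function names `p_value` and `prediction_set` are hypothetical, and the nonconformity scores are assumed to be precomputed elsewhere (for an SVM they are commonly derived from the decision value).

```python
def p_value(calibration_scores, test_score):
    """Smoothed fraction of calibration nonconformity scores >= the test score."""
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_by_class, test_score_by_class, significance):
    """Keep every label whose p-value exceeds the significance level.

    calibration_by_class: label -> nonconformity scores of calibration
        examples of that class (assumed precomputed).
    test_score_by_class: label -> nonconformity score of the test ligand
        when it is tentatively assigned that label.
    """
    labels = set()
    for label, calibration_scores in calibration_by_class.items():
        if p_value(calibration_scores, test_score_by_class[label]) > significance:
            labels.add(label)
    return labels


# Toy calibration data for the two classes: 0 = low-scoring, 1 = high-scoring.
calibration = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.1, 0.2, 0.3, 0.4]}

# The ligand looks strange as class 0 and typical as class 1.
print(prediction_set(calibration, {0: 0.9, 1: 0.15}, significance=0.2))  # {1}
```

A ligand that conforms to both classes receives the set `{0, 1}`, and one that conforms to neither receives the empty set. Treating both of these cases as 'unknown' is an assumption of this sketch.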
Keywords: Apache Spark; Cloud computing; Conformal prediction; Docking; Virtual screening
Year: 2018 PMID: 29492726 PMCID: PMC5833896 DOI: 10.1186/s13321-018-0265-z
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Workflow of CPVS. Signatures were generated for the whole dataset, with two copies named Ds and DsComplete. An initial sample of DsInit molecules was randomly taken from Ds and docked against a chosen receptor, and scores were calculated. To form a training set, docking scores were converted to class labels {0} and {1}, representing ‘low-scoring’ and ‘high-scoring’ ligands, respectively. This was done using a 10-bin histogram of the docking scores, where labels were assigned to ligands in different bins. An SVM-based conformal predictor model was trained on the training set, and predictions were made on the whole dataset, DsComplete. The molecules were classified as ‘low-scoring’ ligands {0}, ‘high-scoring’ ligands {1} and 'unknown'. The predicted ‘low-scoring’ ligands were removed from Ds in each iteration and were hence never docked. Model efficiency was computed as the ratio of single-label predictions [30], i.e., {0} and {1}, to all predictions. The process was then repeated iteratively with a smaller data sample, DsIncr, from Ds, which was docked and labeled, and the model was re-trained until it reached an acceptable efficiency. Thereafter, all remaining ‘high-scoring’ ligands were docked. The scores of all docked molecules were sorted, and accuracy for the top 30 molecules was computed against the results from an experiment where all molecules were docked [9]
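The labeling and efficiency steps in the caption above can be sketched as follows. This is a hedged illustration, not the published code: the helper names and the parameters `low_max_bin` and `high_min_bin` are hypothetical, and reading a bin setting such as 1_5 as "bins up to 1 are low-scoring, bins from 5 upward are high-scoring, the rest are left unlabeled" is an assumption that the text here does not confirm. Higher scores are taken to mean better fits, following the captions' wording.

```python
def histogram_bin(score, lo, hi, n_bins=10):
    """1-based index of the equal-width bin on [lo, hi] that contains score."""
    if hi == lo:
        return 1
    index = int((score - lo) / (hi - lo) * n_bins) + 1
    return min(index, n_bins)  # the maximum score belongs to the last bin


def label_scores(scores, low_max_bin, high_min_bin, n_bins=10):
    """Map docking scores to 0 (low-scoring), 1 (high-scoring) or None.

    Assumption: bins <= low_max_bin get label 0, bins >= high_min_bin get
    label 1, and ligands in between are left out of the training set.
    """
    lo, hi = min(scores), max(scores)
    labels = []
    for score in scores:
        b = histogram_bin(score, lo, hi, n_bins)
        if b <= low_max_bin:
            labels.append(0)
        elif b >= high_min_bin:
            labels.append(1)
        else:
            labels.append(None)
    return labels


def efficiency(prediction_sets):
    """Ratio of single-label predictions ({0} or {1}) to all predictions."""
    single = sum(1 for labels in prediction_sets if len(labels) == 1)
    return single / len(prediction_sets)


scores = [0.0, 1.5, 2.5, 4.5, 5.5, 9.9, 10.0]
print(label_scores(scores, low_max_bin=1, high_min_bin=5))
print(efficiency([{0}, {1}, {0, 1}, set()]))  # 0.5
```

In an iterative run, the efficiency value would be compared against a chosen threshold after each retraining to decide whether to stop adding DsIncr samples.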
Fig. 2 Docking score histogram for a sample of 200 K ligands, shown in log scale. The distribution is right-skewed because few molecules have high scores. This is normal for this type of dataset: only a few ligands fit the target protein well, and the majority do not bind with high affinity
Effect of DsInit size and bin combination on accuracy and efficiency for the initial trained model (repeated 10 times)
| Trial no. | DsInit size (K) | Bins | Accu. (avg) | Accu. (SD) | Eff. (avg) | Eff. (SD) |
|---|---|---|---|---|---|---|
| 1 | 50 | 1_6 | 45.33 | 47.22 | 65 | 23 |
| 2 | 50 | 1_5 | 65.33 | 43.95 | 63 | 23 |
| 3 | 50 | 1_4 | 78.34 | 41.31 | 44 | 17 |
| 4 | 50 | 2_4 | 94.34 | 4.46 | 79 | 18 |
| 5 | 100 | 1_6 | 89.67 | 6.37 | 73 | 16 |
| 6 | 100 | 1_5 | 94.67 | 5.92 | 75 | 18 |
| 7 | 100 | 1_4 | 88.34 | 29.91 | 31 | 12 |
| 8 | 100 | 2_4 | 89.67 | 7.45 | 91 | 11 |
| 9 | 200 | 1_6 | 93.00 | 3.99 | 65 | 15 |
| 10 | 200 | 1_5 | 96.34 | 1.89 | 76 | 17 |
| 11 | 200 | 1_4 | 97.67 | 2.25 | 43 | 20 |
| 12 | 200 | 2_4 | 90.34 | 9.74 | 91 | 6 |
| 13 | 300 | 1_6 | 86.67 | 8.01 | 44 | 12 |
| 14 | 300 | 1_5 | 95.34 | 4.50 | 63 | 17 |
| 15 | 300 | 1_4 | 98.34 | 1.76 | 54 | 22 |
| 16 | 300 | 2_4 | 86.00 | 7.17 | 94 | 5 |
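One plausible way to compute a top-30 accuracy figure of the kind reported in the table is the overlap between the 30 best-scoring ligands of the reduced screen and the 30 best-scoring ligands of an exhaustive screen. The sketch below assumes that definition and assumes that higher scores are better; `top_k_accuracy` is a hypothetical helper, not a function from the CPVS code.

```python
def top_k_accuracy(reference_scores, screened_scores, k=30):
    """Percentage of the reference top-k ligands also found in the screened top-k.

    Both arguments map a ligand identifier to its docking score; the
    reference comes from docking every ligand, the screened set from the
    reduced run. Higher scores are assumed to be better.
    """
    def top_k(scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])

    reference_top = top_k(reference_scores)
    screened_top = top_k(screened_scores)
    return 100.0 * len(reference_top & screened_top) / len(reference_top)


# Toy example with k=2: ligand "b" was excluded and therefore never docked.
reference = {"a": 9.0, "b": 8.0, "c": 7.0, "d": 1.0}
screened = {"a": 9.0, "c": 7.0, "d": 1.0}
print(top_k_accuracy(reference, screened, k=2))  # 50.0
```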
Selecting DsIncr size for incremental model building (repeated 20 times, mean values reported)
| DsIncr size (K) | Iterations | Accu. | Eff. | Docked mols (millions) | Total time (relative) |
|---|---|---|---|---|---|
| 50 | 3.9 | 96.5 | 0.91 | 0.77 | 1 |
| 100 | 3.35 | 96.84 | 0.91 | 0.81 | 0.96 |
| 200 | 3.15 | 97.17 | 0.91 | 0.79 | 1.12 |
Parameters: DsInit size = 200 K and Bins = 1_5 for all runs. Time was calculated relative to the run with a DsIncr size of 50 K
Fig. 3 Benchmarking CPVS against parallel VS. On average, only 37.39% of the ligands were docked to reach an accuracy level of 94%. By reducing the number of docked molecules, CPVS saved more than two-thirds of the time, with an average speedup of 3.7 compared to Parallel VS [9]
Results of the CPVS method for a set of target receptors
| Receptor | Iterations | Accu. | Docked mols (%) | Time (hours) | Speed up |
|---|---|---|---|---|---|
| HIV-1 | 3.9 | 97.33 | 37.15 | 4.03 | 2.93 |
| PTPN22 | 4.7 | 98.34 | 44.77 | 2.48 | 3.35 |
| MMP13 | 3.5 | 89.00 | 33.34 | 2.10 | 3.90 |
| CTDSP1 | 3.6 | 92.67 | 34.29 | 2.03 | 4.58 |
Results were averaged over 10 runs for each receptor
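As a quick arithmetic check, the averages quoted in the Fig. 3 caption can be reproduced from the per-receptor rows of the table above. The values are copied from the table; the variable and function names are illustrative only.

```python
# Per-receptor values copied from the results table:
# (accuracy %, docked molecules %, speedup)
results = {
    "HIV-1": (97.33, 37.15, 2.93),
    "PTPN22": (98.34, 44.77, 3.35),
    "MMP13": (89.00, 33.34, 3.90),
    "CTDSP1": (92.67, 34.29, 4.58),
}


def column_mean(table, column):
    """Arithmetic mean of one column across all receptors."""
    values = [row[column] for row in table.values()]
    return sum(values) / len(values)


mean_accuracy = column_mean(results, 0)
mean_docked = column_mean(results, 1)
mean_speedup = column_mean(results, 2)

print(f"accuracy {mean_accuracy:.1f}%, docked {mean_docked:.1f}%, "
      f"speedup {mean_speedup:.1f}x")
```

The means come to about 94.3% accuracy, 37.4% of ligands docked, and a 3.7-fold speedup, in line with the figures stated in the caption.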