Literature DB >> 32607441

Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery.

Francesco Gentile¹, Vibudh Agrawal¹, Michael Hsing¹, Anh-Tien Ton¹, Fuqiang Ban¹, Ulf Norinder^2,3, Martin E Gleave¹, Artem Cherkasov¹.

Abstract

Drug discovery is a rigorous process that requires billion dollars of investments and decades of research to bring a molecule "from bench to a bedside". While virtual docking can significantly accelerate the process of drug discovery, it ultimately lags the current rate of expansion of chemical databases that already exceed billions of molecular records. This recent surge of small molecules availability presents great drug discovery opportunities, but also demands much faster screening protocols. In order to address this challenge, we herein introduce Deep Docking (DD), a novel deep learning platform that is suitable for docking billions of molecular structures in a rapid, yet accurate fashion. The DD approach utilizes quantitative structure-activity relationship (QSAR) deep models trained on docking scores of subsets of a chemical library to approximate the docking outcome for yet unprocessed entries and, therefore, to remove unfavorable molecules in an iterative manner. The use of DD methodology in conjunction with the FRED docking program allowed rapid and accurate calculation of docking scores for 1.36 billion molecules from the ZINC15 library against 12 prominent target proteins and demonstrated up to 100-fold data reduction and 6000-fold enrichment of high scoring molecules (without notable loss of favorably docked entities). The DD protocol can readily be used in conjunction with any docking program and was made publicly available.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 32607441 PMCID： PMC7318080 DOI： 10.1021/acscentsci.0c00229

Source DB: PubMed Journal: ACS Cent Sci ISSN： 2374-7943 Impact factor: 14.553

Introduction

Drug discovery is an expensive and time-demanding process that faces many challenges, including low hit discovery rates for high-throughput screening, among many others.[1,2] Methods of computer-aided drug discovery (CADD) can significantly speed up the pace of such screening and can drastically improve hit rates.[3] Molecular docking is routinely used to process virtual libraries containing millions of molecular structures against a variety of drug targets with known three-dimensional structures. Recent advancements in automated synthesis and surge of available chemicals represent great opportunities for virtual screening (VS) approaches in general and for docking in particular, but also poses entirely novel challenges. For instance, the widely used ZINC library has grown from 700 000 entries in 2005[4] to over 1.3 billion constituent molecules in 2019,[5] representing a remarkable 1000-fold increase. There is still a global lack of experience in screening such libraries, and the advantage of docking them versus smaller collections is still matter of debate.[6] However, few recently published works seem to advocate for expanding VS to ultralarge chemical libraries. In a recent groundbreaking study by Lyu et al,[7] authors reported docking of 170 million make-on-demand molecular structures, showing that VS of such databases enables the discovery of highly potent inhibitors as well as novel chemical classes that are not present in routinely screened in-stock libraries. Later, other docking studies involving large collections of molecules led to similar conclusions.[9,10] Given the current state of docking programs and computational resources available to CADD scientists, one can stipulate that modern docking campaigns can rarely exceed 0.1 billion molecules and that the current chemical space remains largely inaccessible to structure-based drug discovery. One common approach to address this disparity is to filter large chemical collections to manageable drug-, lead-, fragment-, and hit-like subsets (among others) using precomputed physicochemical parameters and drug-like criteria, such as molecular weight, volume, octanol–water partition coefficient, polar surface area, number or rotatable bonds, number of hydrogen bond donors and acceptors, among many others.[11] While this approach can effectively reduce an ultralarge docking database to manageable subsets, many potentially useful compounds and novel or unconventional chemotypes (notably emerging from such large collections[6]) could be lost. In order to take a full advantage of available and emerging “make-on-demand” chemicals, it is essential to maximize the number of database entries tangibly evaluated against a target of interest. It is also important to note that a conventional docking workflow is remarkably neglectful of negative results. A typical docking campaign relies on completing a full docking run and selecting an extremely narrow subset of favorably docked molecules (virtual hits) for future evaluation. Thus, the vast majority of docking data (both favorable and, especially unfavorable) is not being utilized in any way or form, while it could represent a very relevant, well-formatted, and content-rich input for machine learning algorithms. Previously, the possibility of predicting docking scores through shallow quantitative structure–activity relationship (QSAR) models has been explored by us (using 3D “inductive” descriptors[12]) and others, using a support vector machine or random forest along with conformal predictors.[13,14] None of these methods, however, offer enough speed boost to deal with billions of molecules, and such studies were thus limited to a few millions of compounds at most. Deep learning (DL), on the other hand, is particularly suited for large data set processing,[15] and the method is rapidly gaining interest in drug discovery due to its superior performance compared to traditional machine learning techniques.[16−18] Thus, we anticipate that the use of DL could unlock a full potential and true synergy between docking and QSAR methodologies and will take a full advantage of ultralarge docking database data.

Results

In the current study, we have introduced the use of fast-computed and target-independent QSAR descriptors (such as 2D molecular fingerprint), the use of iterative and fast random sampling of the docking database, and, principally, the use of DL to predict docking scores of yet unprocessed database entries at each iteration step. As a result, DD achieves up to 100-fold reduction of an ultralarge docking database and up to 6000-fold enrichment for the top-ranked hits, while avoiding significant loss of favorable virtual hits, as it will be discussed below.

DD Pipeline

In its essence, the DD pipeline (Figure ) relies on the following consecutive steps:

Figure 1

Schematic of the DD pipeline. (Top) DD initialization: a small sample of molecules is randomly extracted from an ultralarge docking database and docked to a target under consideration. The generated docking scores are then used to train a QSAR deep model. The created QSAR solution is then used to predict docking outcome for the remainder of a database and to return predicted virtual hits required to start iteration 2. (Bottom) DD screening: from iteration 2 onward, the deep model gets gradually improved by augmenting the training set with randomly sampled QSAR-predicted virtual hits from the previous DD iteration (which also get selected for actual docking). The cycle is repeated for a predefined number of iterations, after which DD returns top scoring molecules from a database. This final library can be postprocessed to remove residual low scoring entities. Alternatively, steps 2–11 can be carried out until the convergence of an ultralarge docking database.

For each entry of an ultralarge docking database (such as ZINC15), the standard set of ligand-based QSAR descriptors (such as molecular fingerprints) is computed; A reasonably sized training subset is randomly sampled from the database and docked into the target of interest using conventional docking protocol(s); The generated docking scores of the training compounds are then related to their 2D molecular descriptors through a DL model; a docking score cutoff (typically negative) is then used to divide training compounds in virtual hits (scoring below the cutoff) and nonhits (scoring above the cutoff); The resulting QSAR deep model (trained on empirical docking scores) is then used to predict docking outcomes of yet unprocessed entries of the database. A predefined number of predicted virtual hits are then randomly sampled and used for the training set augmentation; Steps b–d are repeated iteratively until a predefined number of iterations is reached, and/or processed entries of an ultralarge docking database are converged. Schematic of the DD pipeline. (Top) DD initialization: a small sample of molecules is randomly extracted from an ultralarge docking database and docked to a target under consideration. The generated docking scores are then used to train a QSAR deep model. The created QSAR solution is then used to predict docking outcome for the remainder of a database and to return predicted virtual hits required to start iteration 2. (Bottom) DD screening: from iteration 2 onward, the deep model gets gradually improved by augmenting the training set with randomly sampled QSAR-predicted virtual hits from the previous DD iteration (which also get selected for actual docking). The cycle is repeated for a predefined number of iterations, after which DD returns top scoring molecules from a database. This final library can be postprocessed to remove residual low scoring entities. Alternatively, steps 2–11 can be carried out until the convergence of an ultralarge docking database. In DD, the virtual hits recall (i.e., the percentage of actual virtual hits that is retrieved from the database) is set implicitly through a probability threshold which is selected to include 90% of the actual virtual hits in the validation set. Then, the same threshold is applied to the independent test set, and the recall of virtual hits is evaluated in order to assess model generalizability. If recalls of the validation and test sets are consistent with each other, the model is applied to all entries of the database (more details can be found in the Methods). Although the recall values could be endorsed explicitly by using, for example, conformal predictors,[14,19] we did not observe significant differences in the resulting performance of DD. The scripts to run DD pipeline are publicly available in GitHub, together with instructions on how to setup runs and a few additional tools to facilitate automation on HPC clusters, at https://github.com/vibudh2209/D2.

Ultra Large Docking Database Sampling

Selection of a representative and balanced training set is a critical step of any modeling workflow. In the context of sampling a chemical space, a proper DD training set should effectively reflect database’s chemical diversity. It could be expected that enlarging the sampling size and preclustering the docking base would ultimately improve or even converge the chemical space coverage. On the other hand, it is currently not feasible to cluster billions of chemical structures in any way or form, and it has also been shown that preclustering large libraries prior to docking can significantly lower the rank of active chemotypes, thus hindering the discovery of new inhibitors or activators.[7] Moreover, biasing sampling toward molecules that are highly ranked by DD as potential virtual hits could exclude low ranked, yet true positive molecules from being selected for model training; therefore we selected random sampling for all DD iterations. Finally, the size of DD training set (e.g., the amount of actual docking) would have a pivotal impact on a computational runtime and should be carefully controlled. To establish an optimal sampling of ZINC15 base, we evaluated the relationship between the size of DD training set and the corresponding means and standard deviations of the test set recall values, reflecting the consistency of model’s performance and its generalizability. For that, we evaluated 12 protein targets from four major drug-target families,[20] including nuclear receptors represented by androgen receptor (AR), estrogen receptor-alpha (ERα), and peroxisome proliferator-activated receptor γ (PPARγ). Linases were represented by calcium/calmodulin-dependent protein kinase kinase 2 (CAMKK2), cyclin-dependent kinase 6 (CDK6), and vascular endothelial growth factor receptor 2 (VEGFR2). G protein-coupled receptors included adenosine A2A receptor (ADORA2A), thromboxane A2 receptor (TBXA2R), and angiotensin II receptor type 1 (AT1R). Ion channels were represented by Nav1.7 sodium channel (Nav1.7), Gloeobacter ligand-gated ion channel (GLIC), and gamma-aminobutyric acid receptor type A (GABAA) (more details about the selected targets are reported in Table S1). For all 12 studied targets, we investigated the relationships between the sample size and resulting mean test set recall values, which appear to converge to 0.90 when the training set size ranges between 250 000 and 1 million molecules (Figure a). We also observed that the standard deviations converge to 0 at about 1 million sample size (Figure b). Thus, we have set 1 million molecules as the standard sampling for DD workflow (more details can be found in the Methods).

Figure 2

Effect of training set sample size on model generalizability. (a) Mean values for test set recalls computed using different sample sizes. Values approach 0.90 for all targets, when the training set size is within 250 000 and 1 million molecules. (b) Variations of standard deviations (STD) approach 0, for a sample size of 1 million molecules. We ran one iteration for each target and repeated computations five times at each sampling size.

Size Reduction of ZINC15 by DD Virtual Screening

The main goal of DD methodology is to reduce an ultralarge docking database of billions of entries to a manageable few-million-molecules subset which yet encompasses the vast majority of virtual hits. This final molecular subset can then be normally docked into the target using one or several docking programs or can be postprocessed with other VS means. The DD method relies on iterative improvement of the deep neural network (DNN) training by expanding its training set with predicted hit molecules from each previous iteration, while the deciding cutoff also gradually becomes more stringent. We extensively evaluated the performance of this DD protocol by screening all 1.36 billion molecules from ZINC15 against the 12 protein targets introduced above, using docking program FRED.[21] Notably, DD itself is not a docking engine, but a DL score predictor to be used in conjunction with any docking program to rapidly eliminate a priori unfavorable, “undockable” molecular entities, and therefore drastically increase the speed of actual docking. To demonstrate the power of DD, we tested the pipeline with a fixed set of parameters, such as number of iterations, recall values, and others, in order to provide an objective comparison between the 12 investigated systems. It is foreseen that DD users may want to use different simulation parameters than ours, which best suit their time and resource allocations: for example, fewer iterations with more docking per iteration and less DL cycles may be an optimal choice for computing clusters with many CPUs and few GPUs, and vice versa. For each target, we ran a total of 11 DD iterations—one initial training step (requiring docking of 3 million entries to build training, validation and test sets of 1 million each) and 10 consecutive iterative docking steps, each involving docking of 1 million molecules. Thus, for each target we practically docked only 13 million molecular structures representing less than 1% of the 1.36 billion entries of ZINC15. Figure a illustrates the docking score cutoffs used to discern hits and nonhits at each iteration. These values decreased in accordance with the criterion used for defining virtual hits, that becomes more stringent at each iteration (see Methods for details). The majority of nonhits were removed during the first iteration for all targets, while fewer molecules were discarded in successive steps, as expected due to larger portions of unfavorable compounds being present at the beginning of the runs. We observed that the decrease rate and the number of hits identified were target-dependent (Figure b). Another notable observation from the analysis of the DD progression is that training sets effectively improve after each iteration, since docking scores of molecules added to training become more negative (favorable) after each round of modeling (Figure c). This observation marks progressively more confident performance of DD in recognizing and discarding low scoring molecules, and consequently favorably augmenting of the training set. Consequently, we anticipated that DD is likely to improve the enrichment for virtual hits after each iteration, as visible in Figure d, featuring enrichment values for the top 100 molecules ranked by the DNN models in the test sets. As the data indicate, these values increased after each iteration for all targets, also suggesting that model’s performance improves every time the training set is augmented with molecules from each previous DD iteration. Of note, the enrichment values become strongly increased at last iteration, where the models retrieved a very small portion (0.01%) of the top scoring molecules of the database. Further indications of iterative improvement of DNN models are provided by the area under the curve receiver operating characteristics (AUC ROC) values and full predicted database enrichment (FPDE) values, presented in Table S3 for all iterations of the 12 studied targets, as well as the precision of the models to identify high scoring molecules, which improved at each iteration as well, as expected (Figure S2).

Figure 3

DD performance statistics for 12 drug targets. (a) Variation of score cutoff values used for selecting virtual hits at each iteration. (b) Variation of numbers of molecules predicted as virtual hits after each iteration. (c) Iterative improvement of docking score mean values for randomly selected molecules used for training set augmentation. (d) Enrichment values calculated for 100 top ranked predicted virtual hits in test set after each iteration.

Analysis of DD Performance

As indicated earlier, the main objective of the DD methodology is to reduce an ultralarge docking database to a highly enriched library of molecules that can be processed using conventional docking programs and computational resources. While studying 12 selected targets, we observed that the sizes of final, remaining subsets ranged between 1% and 12% of the original ultralarge docking database (Figure a). It is foreseen that these remaining enriched and DNN-ranked libraries can then be postprocessed to remove residual low scoring molecules. Alternatively, DD can be carried out until the convergence of an ultralarge docking database.

Figure 4

Final data set sizes and enrichment values resulting from DD runs. (a) Total number of molecules predicted as virtual hits remaining after the 11th DD iteration. Values are also reported in terms of percentage of ZINC15 entries that were retained (right vertical axis). (b) FPDE values resulting from the last iterations of DD experiments. (c) Enrichment values for top 10, top 100, and top 1000 selected virtual hits (in the test sets). DD demonstrated its best performance for PPARγ protein, where the database was reduced to 1% of its size. Thus, considering docking required to train the model as well as to postprocess the final subset, DD screening of ZINC15 against this target requires docking of 50 times fewer molecules than conventional VS. On the other hand, DD was least effective on the GABAA target, where 12% of ZINC15 molecules were left after the last iteration due to low precision of the model. These results clearly suggest that, like any other computational tool, DD shows performances that are target-dependent. Encouragingly, the recall values were consistently transferred to the test sets in all cases (see Table S3). We also compared the number of molecules expected to be returned by DD at each iteration based on the test sets with the number that was actually returned when DD models were applied to ZINC15, observing no significant differences (Figure S3). Taken together, these results suggest that all underlying DL models were generalizable in a consistent way. To further assess the overall DD performance, we evaluated final FPDE values, which ranged from 8 to 89, indicating that DD enriched final subsets with high scoring molecules in a target-dependent way. As expected, FPDE values matched the trends observed for the number of DD predicted virtual hits (Figure b). We also evaluated enrichment values for the top 10, 100, and 1000 predicted virtual hits identified in the test sets after the final iteration, observing values ranging from 240 to 6000 as demonstrated on Figure c. Such enrichments decreased consistently in all cases when evaluating larger portions of top ranked structures, thus suggesting that true hits are highly concentrated at the top of the DD rank, and molecules at the bottom of the rank are mostly false positives. It is important to note that we have set stringent 0.90 recall values for our DD runs to preserve the vast majority of virtual hits in the final subset. However, one can lower such recall cutoff to sacrifice the retention of virtual hits in a DD workflow, but to significantly reduce the number of unprocessed molecules retained at the end, and to shorten the runtime. To investigate the effect of decreasing the recall of virtual hits, we ran DD for two systems, namely, PPARγ and GABAA (the best and worst targets in terms of number of molecules remained unprocessed after 11 DD iterations as described above), using recall values of 0.75 and 0.60. Encouragingly, the number of remaining entries significantly decreased in both cases compared to DD runs with 0.90 recall (Figure a). For the PPARγ system, the size of ZINC15 was scaled down 800 times at 0.75 recall, and 2300 times at 0.60. Similarly, we observed a 66- and 654-fold size reduction using recalls of 0.75 and 0.60 respectively, for the GABAA target. Importantly, the enrichment of the molecular subsets increased noticeably when lower recalls were used, reaching up to 1450 for FPDE values and again clearly indicating that density of virtual hits rapidly decreases as we move away from the top of the DNN rank (Figure b). Thus, the recall value can be chosen according to the needed speed boost or computational resources available to the user in order to further reduce the amount of docking and/or postprocessing required at the end of DD runs.

Figure 5

DD size reduction power depends on recall values. (a) Final number of molecule (in logarithmic scale) predicted as virtual hits by DD and (b) FPDE values obtained for PPARγ and GABAA systems by varying the virtual hit recall values. Overall, the above analysis indicates that the DD procedure can effectively discard most of unqualified molecules in a ultra large docking database, without losing more than a predefined percentage of virtual hits. In our opinion, this makes DD methodology an efficient mean for conducting large-scale VS campaigns involving billions of small molecule structures, and a valid alternative to brute force approaches demanding large amounts of computational resources.

DD Virtual Screening and Active Ligands

We also investigated how DD method deals with active ligands that are present in ultralarge docking databases. For five of the investigated systems, namely, AR, ERα, PPARγ, VEGFR2, and ADORA2A, the Database of Useful Decoys: Enhanced[22] (DUD-E) provided sets of confirmed active ligands (details about the data sets are reported in Table S2) that we docked to their respective targets together with 1 million of random compounds from ZINC15 (considered as inactive molecules). Docking performances of FRED were variable across targets, with AUC ROC values ranging from 0.52 to 0.91 (Figure a). In parallel, we evaluated how DD ranks of compounds (evaluated at 0.90 recall) correlated with their docking scores, in particular, with their distance from the score cutoff used to define virtual hits and nonhits in the last iteration, using random samples of 1 million ZINC15 molecules docked to each target (Figure b). DD appears to clearly bias final sets toward high docking scores, and since active ligands score better than inactives based on AUC values, we expected active ligands to be discarded at lower rates than inactive molecules. Thus, we plotted the rank of active ligands against their score distance from the virtual hit cutoff (Figure c). As expected, DD worked particularly well for systems with many high scoring actives and good AUC values, such as AR and ERα, for which all active, top scoring ligands were retained. Interestingly, DD retained also a significantly large amount of active ligands with nonhits scores for all targets, at about 10-fold higher rates than remaining ZINC15 nonhits.

Figure 6

(a) ROC curves for FRED docking of active ligands to five targets. Actives were mixed with 1 million of randomly sampled ZINC15 molecules. AUC value for each target is reported in brackets. (b) DD rank (logarithmic scale) versus distance of scores from cutoff values of random samples of 1 million ZINC15 compounds docked to the five targets. (c) DD rank (logarithmic scale) versus distance from score cutoff for active ligands. Kept (recalled) ligands are represented with yellow dots, discarded ligands are represented with blue dots. (d) Enrichment factor (EF, logarithmic scale) of active ligands when top 100 000, 1 000 000, 10 000 000, and all DD hits are considered. Then, we calculated the enrichment factors (EFs) for active compounds on top 100 000, 1 000 000, and 10 000 000 ranked molecules, for the fivr systems. Not surprisingly, the highest values were observed for AR and ERα, for which EFs for the top 100 000 molecules were encouragingly high, 660 for AR and 477 for ERα, and consistently decreased by evaluating larger portion of top scoring molecules (Figure d). For the other three systems, a clear trend for enrichments was not observed when different sizes were considered. Nevertheless, all the considered subsets of molecules were enriched with active ligands, showing EFs of at least six. Thus, selecting a subset of top ranked DD molecules seems to be a valid alternative to lowering the recall value in order to limit the amount of final docking, when limited computational resources are available. Furthermore, EFs calculated over all DD hits ranged between 11 and 14 for all systems. Thus, DD not only enables extraction of controlled portions of virtual hits from an ultralarge docking database, but also enriches final subsets with active hit molecules. Remarkably, this unexpected DD feature of retaining actives regardless of their scores resulted in enrichment also for targets showing near-to-random docking performance, such as the ADORA2A system.

Discussion

With the increasing automation of synthetic procedures, the focus of modern drug discovery campaigns will be shifting toward screening of increasingly larger molecular libraries consisting of billions of chemical structures. To reinforce such opportunity, docking protocols demonstrate improved performance on larger make-on-demand databases, effectively yielding novel, diverse, and nontraditional chemotypes for drug discovery endeavors.[7] It could be noted, however, that most of time and resources invested in modern docking campaigns are spent on processing unfavorable molecular structures, while the emerging “negative” data are also not being utilized in any way or form. Hence, to keep the pace with an ever-expanding chemical Big Data space and to fully utilize results generated by docking programs “on a fly”, we have developed the Deep Docking protocol DD, a DNN-based method for processing large chemical libraries with conventional computational and software resources. The method relies on iterative docking of a small portion of a parental ultralarge docking database (such as ZINC15) using a docking program of choice and utilizes generated scores (both favorable and unfavorable) to train ligand-based QSAR models. These models then enable approximation of the docking outcome for unprocessed database entries. Importantly, DL allows the use of simple 2D protein-independent descriptors such as Morgan fingerprints to capture the docking scores. We have demonstrated that such approach can yield a manageably small subset of a database, highly enriched with favorably “dockable” molecular structures. We proved the power of DD by screening all 1.36 billion entries of ZINC15 against 12 prominent drug targets, where the original ultralarge docking database was significantly reduced while retrieving a controlled, high portion of favorably docked molecules. At the same time, most of low scoring molecules were removed without investing time and resources in them, and the generated ZINC subsets were highly enriched in potential virtual hits. Notably, screening an ultralarge docking database using DD requires docking up to 50 times fewer molecules compared to conventional docking, while losing only about 10% of virtual hits. We also showed that DD can further shrink an ultralarge docking database to a few hundreds of thousands of molecules while still retrieving a significant part (60%) of top scoring hits. Moreover, DD appears to enrich final subsets with active ligands, even when only small portions of top ranked molecules are considered. This unexpected result suggests that true binders carry on certain chemical features that are complementary to the binding pocket and that the model is able to capture such features through the QSAR descriptors.

Conclusion

In the current work, we introduced the use of deep learning in structure-based drug discovery in a novel way. The developed Deep Docking approach utilizes QSAR models trained on actual docking scores of a small subset of a molecular database to predict docking scores for the rest. Such approach, being used in an iterative manner (with predefined recall parameters) allows significant savings of docking runtime, without notable loss of potentially “dockable” entities or active ligands. The use of Deep Docking circumvents computational limits of large-scale docking campaigns and makes billion-entries molecular databases accessible even with limited computational resources. Collectively, our results strongly advocate the use of deep learning for exploration of continuously expanding chemical space in search for new therapeutics.

Methods

QSAR Descriptors

SMILES of 1.36 billion molecular structures were downloaded from ZINC15.[5] Morgan fingerprints with a size of 1024 bits and a radius of 2 were generated using the RDKit package.[23]

Protein Targets

The X-ray structures of AR,[24] ERα,[25] PPARγ,[26] CAMKK2,[27] CDK6,[28] VEGFR2,[29] ADORA2A,[30] TBXA2R,[31] AT1R,[32] Nav1.7,[33] GLIC,[34] and GABAA[35] containing cocrystallized ligands were extracted from the Protein Data Bank (PDB).[36] Details about the selected target structures are summarized in Table S1.

Molecular Docking

PDB structures were optimized using the Protein Preparation Wizard module from the Schrödinger suite,[37] and docking grids were prepared using the MakeReceptor utility from OpenEye.[38] SMILES were processed using QUACPAC[8] in order to generate dominant tautomer and ionization states at pH 7.4. The OMEGA’s pose module[39,40] was used to generate multiple 3D optimal conformers for FRED docking. Docking simulations were carried out using FRED program and Chemgauss4 scoring function from OpenEye.[21]

Database Sampling

The optimal number of molecules required for the training set was determined by running one DD iteration for each target, using different sizes for training, validation, and test set (10 000, 20 000, 40 000, 80 000, 160 000, 320 000, 640 000, and 1 million molecules). For each sample size, computations were repeated five independent times. The optimal sampling size was then chosen by evaluating means and standard deviations of recall values in the test sets for all targets.

DD Workflow

Initial training, validation, and test sets used for the DL model consisted of 1 million molecules each that were randomly sampled from ZINC15 during the first DD iteration. Each set was docked to the target of interest and then divided into virtual hits and nonhits based on the generated docking scores. The score cutoff used to determine the class of molecules was determined in order to split the validation in 1% top scoring molecules (virtual hits) and 99% nonhits. Molecules of each set with docking scores equal or more favorable than the cutoff value were assigned to the virtual hit class, while remaining molecules were assigned to the nonhit class. The DNN model was trained using classes and molecular descriptors of the processed entries and then used to predict virtual hits and unqualified molecules from the whole ZINC15 based on molecular descriptors. From the second iteration onward, the training set was expanded with 1 million molecules randomly sampled from hits predicted in the previous iteration, while validation and test sets remained unchanged for all the length of the DD run. The score cutoff was gradually decreased (corresponding to higher predicted target affinity) after each iteration to keep selecting better compounds. This reduction was done by linearly lowering the percentage of top scoring molecules in the validation set assigned to the virtual hit class from 1% in the first iteration to 0.01% in the last one. A linear variation of the score cutoff was chosen in order to avoid large variations in initial iterations, which could restrict the model to comprehensively explore chemical classes. Thus, the cutoff value in iteration 2 corresponded to the highest (worst) docking score of the top 0.9% ranked compounds, in iteration 3 it corresponded to the highest docking score of top 0.8% compounds, and so on.

Evaluation Metrics

All evaluation metrics were calculated on the test sets. Precision was calculated aswhere TP (true positives) were virtual hits correctly predicted by the DNN, and FP (false positives) were actual nonhits that were incorrectly classified as virtual hits by the DNN. Recall was calculated aswhere FN (false negatives) were virtual hits incorrectly discarded by the DNN. Enrichment values were calculated aswith TP|top N as the number of TP found within the top N ranked molecules by the DNN, and TP|random as the number of TP found within N randomly sampled molecules. N was set to values equal to 10, 100, and 1000. FPDE was calculated as the ratio between precision (eq ) and random precision:

Deep Learning

The Keras Python library[41] was used for building and training feed-forward DNN models.[42] Model hyperparameters were set as the number of hidden layers and neurons, dropout frequency, as well as oversampling of the minority class and class weights, in order to deal with highly imbalanced data sets (1–0.01% of virtual hits). A lower threshold value was established for the DNN probabilities (indicating the likelihood of molecules of being virtual hits) and used as criterion to assign molecules to the virtual hit class upon prediction. The threshold was chosen each time in order to retrieve 90% of the actual virtual hits (i.e., top scoring molecules) of the validation set. Model selection was performed by running a basic grid search to identify the set of hyperparameters providing the highest FPDE value in the test set. The best model was then applied to all ZINC15 entries in order to predict virtual hits and nonhits.

Active Ligands

Active compounds were obtained from the DUD-E repository for available targets.[22] If not already present in ZINC15, SMILES and relative Morgan fingerprints of compounds were calculated and added to it. The molecules were prepared and docked to the respective targets as previously described. Enrichment factors after top N molecules,[43] ranked by DD probability of virtual hit-likeness, were calculated aswith actives| being the number of active ligands found in top N molecules, actives|database the total number of active ligands, and mol|database the database size.

Hardware

We used a 6 Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz (a total of 60 cores) for docking and 4 Nvidia Tesla V100 GPUs with 32 GB memory for DNN model training and inference.

32 in total

1. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules.

Authors: J L Sussman; D Lin; J Jiang; N O Manning; J Prilusky; O Ritter; E E Abola
Journal: Acta Crystallogr D Biol Crystallogr Date: 1998-11-01

2. Best Practices of Computer-Aided Drug Discovery: Lessons Learned from the Development of a Preclinical Candidate for Prostate Cancer with a New Mechanism of Action.

Authors: Fuqiang Ban; Kush Dalal; Huifang Li; Eric LeBlanc; Paul S Rennie; Artem Cherkasov
Journal: J Chem Inf Model Date: 2017-05-04 Impact factor: 4.956

3. Lead- and drug-like compounds: the rule-of-five revolution.

Authors: Christopher A Lipinski
Journal: Drug Discov Today Technol Date: 2004-12

4. Toxic Colors: The Use of Deep Learning for Predicting Toxicity of Compounds Merely from Their Graphic Images.

Authors: Michael Fernandez; Fuqiang Ban; Godwin Woo; Michael Hsing; Takeshi Yamazaki; Eric LeBlanc; Paul S Rennie; William J Welch; Artem Cherkasov
Journal: J Chem Inf Model Date: 2018-08-15 Impact factor: 4.956

5. Structural basis of Nav1.7 inhibition by an isoform-selective small-molecule antagonist.

Authors: Shivani Ahuja; Susmith Mukund; Lunbin Deng; Kuldip Khakh; Elaine Chang; Hoangdung Ho; Stephanie Shriver; Clint Young; Sophia Lin; J P Johnson; Ping Wu; Jun Li; Mary Coons; Christine Tam; Bobby Brillantes; Honorio Sampang; Kyle Mortara; Krista K Bowman; Kevin R Clark; Alberto Estevez; Zhiwei Xie; Henry Verschoof; Michael Grimwood; Christoph Dehnhardt; Jean-Christophe Andrez; Thilo Focken; Daniel P Sutherlin; Brian S Safina; Melissa A Starovasnik; Daniel F Ortwine; Yvonne Franke; Charles J Cohen; David H Hackos; Christopher M Koth; Jian Payandeh
Journal: Science Date: 2015-12-18 Impact factor: 47.728

6. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

Authors: Michael M Mysinger; Michael Carchia; John J Irwin; Brian K Shoichet
Journal: J Med Chem Date: 2012-07-05 Impact factor: 7.446

7. Efficient iterative virtual screening with Apache Spark and conformal prediction.

Authors: Laeeq Ahmed; Valentin Georgiev; Marco Capuccini; Salman Toor; Wesley Schaal; Erwin Laure; Ola Spjuth
Journal: J Cheminform Date: 2018-03-01 Impact factor: 5.514

8. Ultra-large library docking for discovering new chemotypes.

Authors: Jiankun Lyu; Sheng Wang; Trent E Balius; Isha Singh; Anat Levit; Yurii S Moroz; Matthew J O'Meara; Tao Che; Enkhjargal Algaa; Kateryna Tolmachova; Andrey A Tolmachev; Brian K Shoichet; Bryan L Roth; John J Irwin
Journal: Nature Date: 2019-02-06 Impact factor: 49.962

Review 9. A comprehensive map of molecular drug targets.

Authors: Rita Santos; Oleg Ursu; Anna Gaulton; A Patrícia Bento; Ramesh S Donadi; Cristian G Bologa; Anneli Karlsson; Bissan Al-Lazikani; Anne Hersey; Tudor I Oprea; John P Overington
Journal: Nat Rev Drug Discov Date: 2016-12-02 Impact factor: 84.694

10. An open-source drug discovery platform enables ultra-large virtual screens.

Authors: Andras Boeszoermenyi; Zi-Fu Wang; Christoph Gorgulla; Patrick D Fischer; Paul W Coote; Krishna M Padmanabha Das; Yehor S Malets; Dmytro S Radchenko; Yurii S Moroz; David A Scott; Konstantin Fackeldey; Moritz Hoffmann; Iryna Iavniuk; Gerhard Wagner; Haribabu Arthanari
Journal: Nature Date: 2020-03-09 Impact factor: 49.962

29 in total

Review 1. Recent progress on cheminformatics approaches to epigenetic drug discovery.

Authors: Zoe Sessions; Norberto Sánchez-Cruz; Fernando D Prieto-Martínez; Vinicius M Alves; Hudson P Santos; Eugene Muratov; Alexander Tropsha; José L Medina-Franco
Journal: Drug Discov Today Date: 2020-09-30 Impact factor: 7.851

2. Systemic evolutionary chemical space exploration for drug discovery.

Authors: Chong Lu; Shien Liu; Weihua Shi; Jun Yu; Zhou Zhou; Xiaoxiao Zhang; Xiaoli Lu; Faji Cai; Ning Xia; Yikai Wang
Journal: J Cheminform Date: 2022-04-01 Impact factor: 5.514

3. Efficient Hit-to-Lead Searching of Kinase Inhibitor Chemical Space via Computational Fragment Merging.

Authors: Grigorii V Andrianov; Wern Juin Gabriel Ong; Ilya Serebriiskii; John Karanicolas
Journal: J Chem Inf Model Date: 2021-11-11 Impact factor: 4.956

Review 4. Artificial intelligence and machine-learning approaches in structure and ligand-based discovery of drugs affecting central nervous system.

Authors: Vertika Gautam; Anand Gaurav; Neeraj Masand; Vannajan Sanghiran Lee; Vaishali M Patil
Journal: Mol Divers Date: 2022-07-11 Impact factor: 3.364

Review 5. A review of COVID-19 biomarkers and drug targets: resources and tools.

Authors: Francesca P Caruso; Giovanni Scala; Luigi Cerulo; Michele Ceccarelli
Journal: Brief Bioinform Date: 2021-03-22 Impact factor: 11.622

6. Recent trends in artificial intelligence-driven identification and development of anti-neurodegenerative therapeutic agents.

Authors: Kushagra Kashyap; Mohammad Imran Siddiqi
Journal: Mol Divers Date: 2021-07-19 Impact factor: 3.364

10. Potential drugs against COVID-19 revealed by gene expression profile, molecular docking and molecular dynamic simulation.

Authors: Claudia Cava; Gloria Bertoli; Isabella Castiglioni
Journal: Future Virol Date: 2021-07-20 Impact factor: 1.831