Fingerprinting CANDO: Increased Accuracy with Structure- and Ligand-Based Shotgun Drug Repurposing.

Abstract

We have upgraded our Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun drug repurposing by including ligand-based, data fusion, and decision tree pipelines. The goal of shotgun drug repurposing is to screen and rank every existing human use drug or compound for every disease/indication. The first version of CANDO implemented a structure-based pipeline that modeled interactions between compounds and proteins on a large scale, generating compound-proteome interaction signatures used to infer the similarity of drug behavior; the new pipelines accomplish this by incorporating molecular fingerprints and the Tanimoto coefficient. We obtain improved benchmarking performance with the new pipelines across all three evaluation metrics used: average indication accuracy, pairwise accuracy, and coverage. The best performing pipeline achieves an average indication accuracy of 19.0% at the top10 cutoff, compared to 11.7% for v1, and 2.2% for a random control. Our results demonstrate that the CANDO drug recovery accuracy is substantially improved by integrating multiple pipelines, thereby enhancing our ability to generate putative therapeutic repurposing candidates, and increasing drug discovery efficiency.

Year: 2019 PMID: 31656912 PMCID: PMC6812124 DOI: 10.1021/acsomega.9b02160

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Drug Repurposing

Bringing a new drug to the market may cost hundreds of millions of dollars and take years of work.[1] Drug repurposing is the process of discovering a new use for an existing drug.[2,3] This process may take advantage of existing data on safety and pharmacokinetic properties from previous trials and clinical use to reduce the costs and time associated with traditional drug discovery. Classic examples of drug repurposing include sildenafil and thalidomide,[2,4] which were initially developed to treat chest pain and morning sickness but were repurposed to treat erectile dysfunction and erythema nodosum leprosum or multiple myeloma, respectively.[5] Drugs that have already been repurposed once are being researched for even more novel uses. For example, raloxifene was originally indicated for prevention of osteoporosis and subsequently approved for risk reduction in the development of breast cancer.[6] More recently, raloxifene has been suggested as a possible treatment for Ebola virus disease.[7−9] These examples of putative and/or successful drug repurposing underscore the diverse mechanisms through which a single compound may treat a variety of disease types.[10,11] High-throughput, target-based, and phenotypic screening of compounds can be used to generate putative candidates for repurposing.[12] For example, potential treatments for Zika virus infection were identified using a phenotypic screen.[13]

Computational Drug Discovery and Repurposing

Finding new drugs or new uses for existing drugs computationally takes advantage of the growing amount of data generated from wet lab experiments accessible on the Internet, increased computational power, and higher fidelity of computational models to reality. Approaches to computational drug discovery and repurposing have been classified as structure- or ligand-based.[14−16] In structure-based methods, the structure of a target macromolecule, usually a protein, is used to identify small compounds that modulate its behavior. The structure may have been determined via X-ray diffraction or nuclear magnetic resonance (NMR) or modeled using template-free (de novo) or template-based (homology or comparative) modeling.[17−19] Molecular docking and/or rational drug design is then used to identify ligands that specifically fit into a protein binding or active site.[20,21] In ligand-based methods, the focus is on the compound, and similarity between compound representations is used to assess whether a compound modulates the activity of a target or treats a disease in the same way as a known drug. Examples of ligand-based drug design include 2D and 3D similarity searching,[22] pharmacophore modeling,[23] and quantitative structure–activity relationships (QSAR).[14] A virtual screening experiment is typically a large-scale analysis of molecular shape or molecular docking data to suggest possible further development of hits into leads.[24] Data fusion is a technique in the field of cheminformatics for combining intermolecular similarity data from different sources or methods.[25−27] Compounds are ranked relative to each other based on the similarity scores.
Multiple rankings of compounds produced by different methods of detecting similarity may be combined into a single ranking.[25] Ideally, disparate sources or types of data may yield orthogonality or complementarity in results, that is, different top compounds are captured and reported as putative therapeutics for different reasons.[28,29] For example, Tan et al. obtained an increased recall rate in a virtual screening experiment using ligand-based two-dimensional fingerprint data fused with structure-based molecular docking energies.[30] Ligand- and structure-based methods have been combined for use in virtual screening pipelines and platforms, with successes reported in the use of sequential, parallel, and hybrid techniques for data integration.[29] Data fusion has also been used to devise weighting schemes for correct dosing.[31] Newer computational techniques for drug discovery and repurposing that are gaining prominence go beyond the structure- and ligand-based categorization. The Connectivity Map is a “reference collection of gene-expression profiles from cultured human cells treated with bioactive small molecules”,[32] that is, a tool to identify changes in gene expression due to a compound or a disease. If a compound causes changes in gene expression level opposite to a disease (for instance, a disease causes upregulation of the expression of a set of genes, and the compound causes downregulation of the same set of genes), then that compound is considered to be therapeutically useful in the treatment of that disease.[32] Peyvandipour et al. combined an updated version of the Connectivity Map with knowledge of drug-disease gene networks, measuring the perturbation effect of drugs on whole systems.[33] Using this model, they predicted novel treatments for idiopathic pulmonary fibrosis, non-small cell lung cancer, prostate cancer, and breast cancer while simultaneously improving the recall rate of known drug-disease associations.[33] Machine learning-based approaches have also been used to cluster drugs or diseases and to predict new drug activity and usage.[34−38] Methods for finding novel uses of drugs based on analysis of biomedical literature,[39,40] electronic health records,[38,41] and biological networks[42,43] have also been reported.

Drug Similarity

Implementations of drug discovery and drug repurposing sometimes rely on the principle of similar molecules having similar properties.[44,45] In drug design, repurposing, or screening, similar compounds are generally assumed to have similar molecular targets. In structure-based drug discovery, if two potential molecular targets are identified as similar, then a compound that modulates one target is inferred to modulate the other. In ligand-based methods, similar compounds are inferred to analogously modulate the behavior of the same target(s). In our computational shotgun drug repurposing experiments, we extend the similarity property principle to examining interactions on a proteomic scale. Compounds with similar proteomic interaction signatures are hypothesized to be effective for the same indication(s).

Shotgun Drug Repurposing with CANDO

The goal of the Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun drug discovery and repurposing is to screen every human use compound/drug against every indication/disease.[46−49] The tenets of CANDO include docking with dynamics and multitargeting, which have been developed over the past decade and a half.[50−52] The first version of CANDO (v1) applied a bioinformatic docking protocol on large libraries of compound and protein structures. The multitargeting nature of drugs[53] is captured by inferring their similarity on a proteomic scale after calculating interactions between all compounds and all proteins in the corresponding libraries.[8,46,47] This is key, as indications can be multifactorial in nature, involving disparate or intertwined pathways.[16,54−57] Similar compounds, as determined by the root-mean-square deviation (RMSD) of their proteomic interaction signatures, are hypothesized to behave similarly, that is, compounds that are ranked highly (most similar compound–proteome interaction signatures) to a drug with an approved indication are hypothesized to be repurposable drugs/compounds for that indication. There exist other approaches to determine compound similarity without the need for docking calculations on a proteomic level. Different mathematical representations of molecules capture different chemical, physical, or functional aspects of a compound. 
Two- or three-dimensional molecular fingerprints are used in the field of cheminformatics to describe compounds.[58] In these models, the physical arrangement of atoms in a compound is captured as a binary vector where each entry of the vector indicates the presence or absence of a specific molecular feature.[45] The similarity (or distance) between these vectors can be measured using metrics such as the Tanimoto coefficient, which is widely used in medicinal chemistry and ligand-based virtual screening.[45,59−61] The compound–proteome interaction signatures constructed using the structure-based docking methods in CANDO are analogous to molecular fingerprints as mathematical representations of a compound. Both uniquely capture properties of a compound in a computationally tractable manner, albeit in a different descriptor space. CANDO is novel in the use of these protein structure-based data vectors as a type of mathematical representation for shotgun drug repurposing, and this work is an important step in the comparison of different mathematical representations of drug likeness or behavior for this purpose. Benchmarking across all pipelines is accomplished by examining the ranks of other approved drugs for the same indication against a gold standard set of indications with at least two approved drugs, both to identify how well the pipelines perform relative to each other and to analyze where the differences lie and what areas need improvement.[46,47] Finally, CANDO can be used to make predictions for any indication with at least one approved drug. In this study, we extend CANDO to include ligand-based drug repurposing by creating new pipelines that identify compound similarity based on molecular fingerprints, as well as data fusion pipelines that combine the protein-centric and protein-agnostic approaches.
The new ligand-based pipelines in CANDO are based on molecular fingerprint similarity calculations using the Research Development Kit (RDKit)[62] and not meant as an exhaustive exploration of all possible CANDO pipelines that can be built using all the fingerprint descriptions available from RDKit. Instead, we constructed pipelines using well-studied molecular fingerprints[63] to evaluate the feasibility and compare and contrast benchmarking performance. Using the standard CANDO benchmarking procedure (see Methods), several of the pipelines described here yielded better performance than those previously obtained using only v1. Based on an exhaustive search of the literature, this is the first instance of molecular fingerprints being applied to the shotgun drug repurposing problem. While 2D fingerprint-based methods have been shown to generally outperform 3D structure-based approaches for virtual screening,[45] our work takes it one step beyond in applying it to shotgun drug repurposing and benchmarking it to every indication. Indeed, while many methods exist for virtual screening typically based on one or a few targets, and some may be better at that task than others, the results are not always colinear in the context of drug repurposing. Previously, it was unclear which fingerprint descriptors would perform the best for drug repurposing. More importantly, we provide frameworks for other virtual screening techniques to be applied to drug repurposing and combining them while retaining the individual benefits of each technique. This application of a ligand- and ligand/structure-based combination analysis is in alignment with the traditional division of virtual screening into ligand- and structure-based methods. However, this specific application to drug repurposing combined with benchmarking to a large gold standard data set has never been performed. 
Combination of other pipelines using data fusion as well as a decision tree approach between v1 and the best performing ligand-based approach (“ECFP4”) yielded better benchmarking performance than using only either pipeline, allowing for increased accuracy while retaining the mechanistic and precision medicine opportunities afforded by the protein-centric approach of v1. Higher benchmarking accuracies are indicative of higher drug repurposing potential, increased confidence in our predictions, a decreased number of compounds that must be tested in wet lab experiments and clinical trials to obtain true hits, and thus less time and cost required to find a new use for an old drug.
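The two similarity measures named above can be illustrated with a minimal sketch. It assumes that a fingerprint is given as the set of indices of its "on" bits and that an interaction signature is a plain list of per-protein scores; the function names `tanimoto` and `signature_rmsd` are hypothetical and are not taken from the CANDO code base or from RDKit.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is passed as an iterable of the indices of its set bits.
    The value is |A intersect B| / |A union B|, ranging from 0 to 1.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


def signature_rmsd(sig_a, sig_b):
    """Root-mean-square deviation between two equal-length score vectors.

    Lower values mean more similar compound-proteome interaction signatures.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must cover the same proteins")
    squared = sum((x - y) ** 2 for x, y in zip(sig_a, sig_b))
    return math.sqrt(squared / len(sig_a))


# Two toy fingerprints sharing 2 of 4 distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
# Two toy signatures over four proteins.
print(signature_rmsd([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```

Note that the Tanimoto coefficient is a similarity (higher is more similar), whereas the RMSD is a distance (lower is more similar), so the two must be put on a common footing before they are combined.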

Results

Benchmarking Performance of the Different Pipelines

The new pipelines (Figure 4) generally outperform v1 for all three metrics used to evaluate benchmarking performance: average indication accuracy, pairwise accuracy, and coverage (Figure 1). The MUL:v1,ECFP4 data fusion pipeline, created by multiplying the compound–compound similarity scores (RMSD of interaction signatures) from v1 with the Tanimoto coefficient measured between the compounds described using the ECFP4 molecular fingerprint, yields the overall best performance relative to v1 and the pipelines based on fingerprint comparisons. Specifically, we obtained the highest top10, top25, and top50 average indication accuracies of 17.3, 23.8, and 29.6%, respectively, using this data fusion pipeline. The highest top1% (or top37) and top100 average indication accuracies of 26.8 and 36.7% were obtained using the pipeline based on the ECFP4 molecular fingerprints. Most of the molecular fingerprint pipelines outperform the original v1 pipeline, with the exception of ECFP0, a fingerprint based on simple atom count quantization (Figure 1).
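The idea behind the MUL and MIN data fusion pipelines can be sketched as follows. This is only an illustration under a stated assumption: because the v1 score is an RMSD (lower is more similar) and the Tanimoto coefficient is a similarity (higher is more similar), the sketch converts the Tanimoto coefficient to a distance, 1 − T, before combining. The exact transformation used in CANDO is described in the Methods and may differ; the function name `fuse_rankings` is hypothetical.

```python
def fuse_rankings(rmsd, tanimoto, mode="MUL"):
    """Rank candidate compounds for one query drug by a fused distance.

    rmsd:     dict mapping compound -> signature RMSD to the query drug
    tanimoto: dict mapping compound -> Tanimoto coefficient to the query drug
    mode:     "MUL" multiplies the two distances, "MIN" keeps the smaller one

    Assumption of this sketch: the Tanimoto coefficient is turned into a
    distance (1 - T) so that lower means more similar for both inputs.
    """
    fused = {}
    for compound in rmsd.keys() & tanimoto.keys():
        d_structure = rmsd[compound]
        d_ligand = 1.0 - tanimoto[compound]
        if mode == "MUL":
            fused[compound] = d_structure * d_ligand
        elif mode == "MIN":
            fused[compound] = min(d_structure, d_ligand)
        else:
            raise ValueError("mode must be 'MUL' or 'MIN'")
    # Most similar (smallest fused distance) first; names break ties.
    return sorted(fused, key=lambda c: (fused[c], c))


rmsd = {"A": 0.25, "B": 0.5, "C": 0.125}
tani = {"A": 0.75, "B": 0.5, "C": 0.0}
print(fuse_rankings(rmsd, tani, "MUL"))
print(fuse_rankings(rmsd, tani, "MIN"))
```

In this toy example, compound C has the most similar interaction signature but shares no fingerprint bits with the query, so the two fusion modes order it differently. In practice the two scores live on different scales, so some normalization would likely be needed before taking a minimum.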

Figure 4

Flow diagram of the CANDO platform pipelines used for shotgun drug repurposing. The v1 structure-based pipeline is the original protein-centric approach based on a bioinformatic docking protocol used to construct compound–proteome interaction signatures. The ligand-based pipelines are based on molecular fingerprint representations of compounds. The data fusion pipelines consist of a combination of these two types of pipelines after calculating compound–compound similarity, and the decision tree pipeline is devised based on the performance of individual structure- and ligand-based pipelines (see Methods). All pipelines, except the decision tree pipeline, generate a compound–compound similarity matrix that is sorted and ranked. These rankings are used to generate putative repurposable drug candidates and evaluate benchmarking performance. The figure illustrates the utility of implementing, as well as comparing and contrasting, multiple (types of) pipelines in the CANDO platform for shotgun drug repurposing.

Figure 1

Benchmarking performance of different CANDO platform pipelines. The average indication accuracy (top), pairwise accuracy (middle), and coverage (bottom) for each pipeline are shown at different cutoffs. The value for the top10 cutoff is denoted by dark purple, top25 by light purple, top1% (or top37) by yellow, top50 by green, and top100 by light blue. The individual pipeline with the best performance at each cutoff is denoted by a red dot. The meta decision tree pipeline was built by combining two pipelines, v1 and ECFP4, using the highest average indication accuracy from either pipeline. Therefore, it has the highest average indication accuracy, pairwise accuracy, and coverage but is excluded from the “Best at cutoff” marker and plotted on a separate axis. The pipelines in all plots are sorted according to increasing top10 average indication accuracy, the most stringent criterion used in our benchmarking. The MUL:v1,ECFP4 pipeline yields the overall best performance relative to the other individual structure- and ligand-based pipelines. The pipeline based on the ECFP4 molecular fingerprint produces the highest top1% and top100 average indication accuracies (top). When assessing pairwise accuracy (middle), ECFP4 is the best performing individual pipeline at all cutoffs. The coverage (bottom) plot is the percentage of the 1439 indications for which a pipeline produces a nonzero indication accuracy. The data fusion pipelines of MUL:v1,ECFP4 and MIN:v1,RDK6 have the highest coverage at the top50 and top25 cutoffs, respectively, ECFP4 at the top10 and top1% cutoffs, and RDK6 at the top100 cutoff. Overall, the pipelines using molecular fingerprints have promise and potential for shotgun drug repurposing by themselves, but the data fusion and decision tree pipelines that combine structure- and ligand-based approaches achieve the best performance while retaining the benefits of both types of approaches.

The decision tree meta pipeline, built by combining other pipelines based on the higher average indication accuracies, yields accuracies of 19.0, 25.7, 28.9, 31.5, and 39.1% at the five cutoffs used. In contrast, the best performing control generated from uniformly random compound–compound similarity data obtains an average indication accuracy of 2.2% at the top10 cutoff, the most stringent one used to benchmark the CANDO platform (Figure 1). In terms of pairwise accuracy (%), which is the weighted average of the per indication accuracies based on the number of compounds approved for a given indication (see Methods), ECFP4 outperforms all other individual pipelines with accuracies of 28.5, 38.9, 43.8, 47.9, and 58.8% at the five cutoffs. The meta decision tree pairwise accuracies are 30.0, 40.5, 45.2, 49.1, and 60.0%. The coverage metric evaluates the fraction (or percentage) of the 1439 indications with at least two approved drugs for which there is at least one instance of a successful recapture or recovery of the known drug within a particular cutoff. The ECFP4 pipeline has the highest top10 and top1% coverage of 45.9 and 54.2%, the MIN:v1,RDK6 pipeline yields the highest top25 coverage of 52.3%, the MUL:v1,ECFP4 pipeline has the highest coverage at the top50 cutoff of 56.9%, and RDK6 has the highest at the top100 cutoff of 62.8%. In contrast, the decision tree pipeline obtains coverage values of 45.9, 50.6, 54.2, 56.6, and 62.1%. For almost half of all the 1439 indications, we capture a drug associated with that indication within the top25 cutoff (Figure 1).
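The three benchmarking metrics can be made concrete with a small sketch. It follows one reading of the descriptions in this section: a drug counts as a hit for an indication if another drug approved for the same indication appears among its top-ranked most similar compounds. The full procedure is defined in the Methods, so the data layout and the function name `benchmark` here are assumptions made for illustration.

```python
def benchmark(rankings, indications, cutoff):
    """Compute benchmarking metrics (all in percent) at one cutoff.

    rankings:    dict mapping drug -> list of compounds, most similar first
    indications: dict mapping indication -> list of its approved drugs
                 (each indication is assumed to have at least two drugs)
    Returns (average indication accuracy, pairwise accuracy, coverage).
    """
    per_indication = {}
    total_hits = 0
    total_pairs = 0
    for indication, drugs in indications.items():
        approved = set(drugs)
        hits = 0
        for drug in drugs:
            top = rankings.get(drug, [])[:cutoff]
            # Hit: another approved drug for this indication is recovered.
            if any(c in approved and c != drug for c in top):
                hits += 1
        per_indication[indication] = 100.0 * hits / len(drugs)
        total_hits += hits
        total_pairs += len(drugs)

    n = len(per_indication)
    # Unweighted mean of the per-indication accuracies.
    average_indication_accuracy = sum(per_indication.values()) / n
    # Mean weighted by the number of approved drugs per indication,
    # which equals total hits over total drug-indication pairs.
    pairwise_accuracy = 100.0 * total_hits / total_pairs
    # Share of indications with a nonzero accuracy.
    coverage = 100.0 * sum(1 for a in per_indication.values() if a > 0) / n
    return average_indication_accuracy, pairwise_accuracy, coverage


rankings = {
    "d1": ["d2", "x", "y"],
    "d2": ["x", "y", "d1"],
    "d3": ["x", "y", "d4"],
    "d4": ["x", "y", "z"],
    "d5": ["x", "d1", "y"],
}
indications = {"I1": ["d1", "d2"], "I2": ["d3", "d4", "d5"]}
print(benchmark(rankings, indications, cutoff=2))
```

With a cutoff of 2 in this toy data set, only d1 recovers a partner drug, so one of two indications is covered; raising the cutoff to 3 lets d2 and d3 succeed as well, which raises all three metrics.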

Distribution of Indication Accuracies between the Two Types of Pipelines

To compare and contrast the behavior of the structure- and ligand-based pipelines, we calculated histograms of the average indication accuracies and counts of the highest per indication accuracies at each cutoff for two pipelines (v1 and ECFP4), excluding indications for which a 0% average indication accuracy is obtained. Figure 2 shows that the ECFP4 pipeline has more indications with higher accuracies than v1 (the yellow histogram is shifted to the right of the purple histogram). The Kolmogorov–Smirnov test p-values shown in the left-hand graphs of Figure 2 indicate that the v1 and ECFP4 accuracy distributions differ in a statistically significant manner. The Venn diagrams of the 1439 indications in CANDO with more than one approved drug show that v1 obtains a higher top10 accuracy for 150 indications, while ECFP4 obtains a higher top10 accuracy for 445, and 122 indications have the same nonzero top10 accuracy for both pipelines. As the cutoff increases, more indications have higher accuracies using the ECFP4 pipeline relative to v1, while the number of indications with the same accuracy also increases. The orthogonality in the histograms and Venn diagrams indicates that both types of pipelines appear necessary for maximum coverage and accuracy across all the indications. Figure 2 also suggests that additional pipelines and/or improvement in existing pipelines is necessary to recover drugs for ≈500 indications that are not covered by either pipeline at the highest cutoff.
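The Venn diagram counts described above amount to a per-indication comparison of two accuracy tables. A minimal sketch, assuming each pipeline's results are stored as a dictionary from indication to accuracy at one cutoff (the function name `compare_pipelines` and the toy values are hypothetical):

```python
def compare_pipelines(acc_a, acc_b):
    """Count indications by which of two pipelines has the higher accuracy.

    acc_a, acc_b: dicts mapping indication -> accuracy (%) at one cutoff.
    Indications where both pipelines score 0% are counted separately,
    mirroring their exclusion from the histograms and Venn diagrams.
    """
    counts = {"a_higher": 0, "b_higher": 0, "same_nonzero": 0, "both_zero": 0}
    for indication in acc_a.keys() & acc_b.keys():
        a, b = acc_a[indication], acc_b[indication]
        if a == 0 and b == 0:
            counts["both_zero"] += 1
        elif a > b:
            counts["a_higher"] += 1
        elif b > a:
            counts["b_higher"] += 1
        else:
            counts["same_nonzero"] += 1
    return counts


v1 = {"I1": 50.0, "I2": 0.0, "I3": 25.0, "I4": 0.0}
ecfp4 = {"I1": 25.0, "I2": 75.0, "I3": 25.0, "I4": 0.0}
print(compare_pipelines(v1, ecfp4))
```

The distributions of nonzero accuracies could then be compared with a two-sample Kolmogorov–Smirnov test, for example with `scipy.stats.ks_2samp`, although the statistical software actually used for Figure 2 is not stated in this section.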

Figure 2

Comparison and overlap of indication accuracy distributions for two CANDO platform pipelines at different cutoffs. The left-hand side shows the histograms of the counts of indications with a particular average indication accuracy (or accuracy distributions) for two pipelines, v1 (purple) and ECFP4 (yellow). Indications where both pipelines perform equally well are indicated by brown. For example, at the top10 cutoff, there are approximately 200 indications that achieve an average accuracy between 10 and 20% using the v1 pipeline but just over 100 using ECFP4. At all cutoffs, a greater number of indications with higher accuracies is observed for the ECFP4 pipeline (increase in yellow along the horizontal axis). The p-value, derived from the Kolmogorov–Smirnov test statistic applied to the two distributions at each cutoff, indicates that they are significantly different. On the right-hand side of the figure are Venn diagrams of the set of indications with higher accuracies at each cutoff (excluding indications with 0% accuracy). For example, at the top10 cutoff, there are 150 indications for which the v1 pipeline yields higher average indication accuracies, 445 for which the ECFP4 pipeline is higher, and 122 with the same performance. The ECFP4 pipeline performs better than v1 for more indications at all cutoffs, but both pipelines appear to be necessary to achieve the best performance across all indications for shotgun drug repurposing.

Putative Drug Candidate Generation and Validation

The top ranking putative drug candidates generated by the v1 pipeline for eight indications, tuberculosis, malaria, hepatitis B, hepatitis C, systemic lupus erythematosus, type 2 diabetes mellitus, and Alzheimer’s disease, are available in a figure and the Supplementary Material of a previous publication.[46] The top candidates were chosen based on a concurrence score, which is “the number of occurrences of particular compounds in each set of top 25 predictions generated for all of the drugs approved for a particular indication”.[46] Using this concurrence score, we generated the top candidate drugs to treat the same indications with the ECFP4 molecular fingerprint and the MUL:v1,ECFP4 data fusion pipelines. We then searched the biomedical literature using PubMed and Google Scholar for published studies corroborating these top candidates.
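The concurrence score quoted above can be sketched as a simple count over the top-ranked lists of all drugs approved for an indication. Whether compounds already approved for the indication are dropped from the candidate list is not stated here, so the `exclude_known` switch below is an assumption, as are the function name `concurrence_scores` and the toy data.

```python
from collections import Counter


def concurrence_scores(rankings, approved, cutoff=25, exclude_known=True):
    """Count how often each compound appears in the top lists of approved drugs.

    rankings: dict mapping drug -> list of compounds, most similar first
    approved: list of drugs approved for the indication of interest
    cutoff:   number of top predictions considered per approved drug
    exclude_known: assumption of this sketch; skip compounds that are
                   themselves approved for the indication
    Returns (compound, score) pairs sorted by decreasing score.
    """
    known = set(approved)
    counts = Counter()
    for drug in approved:
        for compound in rankings.get(drug, [])[:cutoff]:
            if exclude_known and compound in known:
                continue
            counts[compound] += 1
    # Highest score first; compound names break ties for reproducibility.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


rankings = {"d1": ["x", "y", "d2"], "d2": ["y", "z", "x"]}
print(concurrence_scores(rankings, ["d1", "d2"], cutoff=2))
```

A compound that is highly ranked against many of the drugs approved for an indication receives a high score, which is the rationale for using the score to prioritize candidates.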

Figure 3

Examples of therapeutic predictions made by CANDO bound or docked to one of their respective corroborating targets. (a) Compound acarbose (red) bound to Mycobacterium smegmatis trehalose synthase (PDB ID: 3ZOA). (b) Compound tiapride (red) docked to the human D2 receptor.

Both of the new pipelines predict acarbose as a treatment for tuberculosis. Traditionally, acarbose is an inhibitor of the α-glucosidase enzyme and used in the worldwide treatment of type 2 diabetes.[64] Lending credence to our prediction, in studying the glycobiology of Mycobacterium smegmatis (a model infectious agent for the Mycobacterium genus and thus tuberculosis), high-resolution structures of the enzyme trehalose synthase have been solved in complex with acarbose.[65] The enzyme is a key member of a carbohydrate pathway; therefore, its inhibition may be of potential therapeutic importance. Caner et al. solved the structure of trehalose synthase with bound acarbose,[65] as shown in Figure 3a (PDB ID: 3ZOA). From a non-infectious disease perspective, both of the new pipelines rank tiapride among the top25 via the concurrence score as a treatment for type 2 diabetes. Tiapride is a selective dopamine D2 receptor antagonist,[66] shown docked to the D2 receptor in Figure 3b. Dopaminergic blockade has been shown to increase insulin secretion from islet cells in cell models;[67] there are case reports of tiapride therapy causing hypoglycemia in the elderly,[68] and there is a generalized increased risk of hypoglycemia in the elderly taking an antipsychotic medication.[69] These all point toward tiapride potentially being used as an antihyperglycemic treatment of type 2 diabetes, though benefits would have to be carefully weighed against negative side effects. All three pipelines recommend known antivirals for hepatitis B. For hepatitis C, all three pipelines list didanosine in the top ranked candidates. Unfortunately, the concurrent use of didanosine and traditional hepatitis treatments may induce dangerous consequences for the patient,[70] illustrating the need for careful expert curation of top candidates generated by the CANDO platform. For Alzheimer’s disease, one of the highest scoring compounds from the MUL:v1,ECFP4 pipeline was dextromethorphan. In 2015, a study was published showing dextromethorphan hydrobromide/quinidine sulfate was well tolerated in patients with Alzheimer’s disease and had clinically relevant efficacy in treating patients, as measured via agitation.[71] These examples indicate that new putative drug candidate generation by the CANDO platform with these integrated pipelines is likely to work as well, if not better, relative to the prospective validation studies previously done using v1 or its components.[8,51,72−75] The full list of drug candidates for the above indications based on the concurrence score using the newer pipelines is given in the Supporting Information and available at http://protinfo.org/cando/results/fingerprinting_cando. Putative drug candidate predictions for all 2030 indications in the platform using the v1 pipeline are available at http://protinfo.org/cando/data/raw/matrix/.

Discussion

Interpretation of Results

We have added new pipelines based on ligand-based fingerprint comparisons to the CANDO platform (Figure 4) that increase benchmarking performance relative to the original v1 protein-centric pipeline (Figure 1). In addition, we identified where the differences lie and what areas/indications need further improvement for each pipeline. Finally, our individual and combined pipelines are capable of making predictions for every indication with at least one approved drug, and in some cases, we have found corroborating evidence supporting these predictions. Higher benchmarking accuracies are expected to result in better drug repurposing predictions. The compounds ranked most similar to the known drugs for a particular indication by the pipeline with the best benchmarking performance are expected to produce hits and leads with the highest likelihood of success when validated in downstream preclinical and clinical studies.
The decreased need to test a large number of compounds with the new pipelines, along with greater confidence in the computational models of drug-indication associations, realizes the goal of drug repurposing: making drug discovery more efficient by reducing the labor, time, and risk in finding new uses for existing therapeutics. Using the new pipelines based on molecular fingerprinting and data fusion with v1 (Figure 4), we obtain better benchmarking performance than using v1 by itself (Figure 1). Our cutoffs for calculating performance metrics are chosen based on collaborations with wet lab experimentalists willing to test the top candidates generated by our CANDO platform for particular indications. In practice, when working with preclinical and clinical collaborators, we currently employ the decision tree approach of selecting the pipeline with the highest accuracy for a specific indication and the desired cutoff. For example, if a collaborator is capable of validating 10 candidates for Precursor B-Cell Lymphoblastic Leukemia-Lymphoma (MeSH identifier D015452), which is one of the 150 indications where benchmarking performance is better using the v1 pipeline relative to ECFP4, then we would use the former pipeline to generate the top 10 putative drug candidates for this indication. The new integrated pipelines also yield a higher number of indications covered relative to v1, that is, more indications with a nonzero accuracy, demonstrating their generalized utility for shotgun drug repurposing. Indication-specific validation studies may rely on the pipeline with the highest accuracy for that indication, but CANDO platform development in shotgun drug repurposing requires that the coverage also increase in addition to the average indication and pairwise accuracy.
The best performing random control achieves a top10 average indication accuracy of 2.2%, and the random control based on random sampling from the distribution of the v1 compound–protein interaction matrix values yielded a top10 accuracy of 0.2%.[46,47] These random control accuracies are at least an order of magnitude lower than the accuracies obtained using the newer pipelines and align with expected hit rates in high-throughput screening.[76] All pipelines yield better performance when compared to the random control (Figure 1), and the differences between the performances of the different pipelines and that of the control signify the value added by our chosen approaches. The orthogonality in the histograms and Venn diagrams of Figure 2 indicates that both types of pipelines appear necessary for maximum coverage and accuracy across all the indications.
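The decision tree selection described above, choosing for each indication the pipeline with the highest benchmarking accuracy at the desired cutoff, can be sketched as follows. The tie-breaking rule (prefer the pipeline listed first), the function names, and the toy accuracies are assumptions made for illustration only.

```python
def choose_pipeline(accuracies, indication, preference):
    """Pick the pipeline with the highest accuracy for one indication.

    accuracies: dict mapping pipeline -> dict of indication -> accuracy (%)
                at a fixed cutoff
    preference: list of pipeline names; earlier names win ties
                (a tie-breaking rule assumed for this sketch)
    """
    return max(
        preference,
        key=lambda p: (accuracies[p].get(indication, 0.0), -preference.index(p)),
    )


def meta_average_accuracy(accuracies, preference):
    """Average indication accuracy when the best pipeline is used per indication."""
    indications = set()
    for table in accuracies.values():
        indications.update(table)
    best = [
        accuracies[choose_pipeline(accuracies, i, preference)].get(i, 0.0)
        for i in indications
    ]
    return sum(best) / len(best)


acc = {
    "v1": {"I1": 50.0, "I2": 0.0, "I3": 25.0},
    "ECFP4": {"I1": 25.0, "I2": 75.0, "I3": 25.0},
}
order = ["ECFP4", "v1"]
print(choose_pipeline(acc, "I1", order))
print(meta_average_accuracy(acc, order))
```

Because the maximum is taken per indication, the resulting average can never fall below that of either input pipeline, which is consistent with the decision tree pipeline reporting the highest average indication accuracies.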

Limitations and Future Work

Our results also show that ligand-based methods for drug repurposing are far from perfect and still need improvement; the average/overall metrics in Figure 1 do not show per indication differences, but these can be gleaned from Figure 2. Both types of approaches, ligand- and structure-based, fail to cover approximately half the 1439 indications in our data set (i.e., produce a benchmarking accuracy of 0% even at the top100 cutoff), and ligand-based methods do worse than structure-based methods for over 100 indications. This highlights areas of improvement and development for both types of methods, which can only be obtained from a work of this nature. The utility of combining these approaches, while producing a marginal improvement in accuracy, is also important, since target and off-target information is obtainable only from the structure-based methods. Our combined approach results in a “best of both worlds” scenario: although we do not consider the decision tree pipeline when identifying the best performing one in Figure 1, since it is not an individual pipeline, it yields the highest performance values. We are further enhancing CANDO by improving the performance of existing pipelines via parameter optimization,[77] exploration of different docking approaches to generate the compound–proteome interaction signatures, adding new orthogonal pipelines based on compound–pathway signatures,[78] implementing more sophisticated data fusion and machine learning approaches, and by continued dissection of the features responsible for pipeline performance and behavior.[47,48,78] Notwithstanding the relative benchmarking performance of the existing CANDO platform pipelines, the structure-based virtual screening or protein docking pipelines are not without their merits. The protein-centric approach enables mechanistic understanding of drug action by modeling compound–protein interactions at the atomic level.
Additionally, the protein-centric approach readily lends itself to problems in precision medicine/drug repurposing: incorporating genetic changes, and modeling amino acid mutations due to nonsynonymous nucleotide polymorphisms in protein structures, will result in altered compound–protein interaction scores, allowing us to tailor drug repurposing candidates to an individual genome/proteome. The protein-centric approach facilitates consideration of polypharmacy, where the cumulative effects of multiple drugs on protein targets can be evaluated by the analysis and integration of the corresponding drug-proteome interaction signatures, which can then be used to generate putative drug cocktails and combination therapy candidates. The protein-centric pipeline may also be used to generate putative drug candidates for indications without any approved drugs, but where the target protein or proteome is known.[8] We are continuing to enhance the virtual screening pipelines to model reality more accurately, with the goal of increasing compound–proteome signature comparison accuracy. For instance, we are exploring the use of different molecular docking programs, such as CANDOCK[79,80] and AutoDock Vina,[81] to populate the compound–proteome interaction signatures. An updated version of the v1 pipeline, v1.5, with parameters optimized for scoring compound–proteome interactions, yields benchmarking performance that is about 10% higher in relative terms at the top10 cutoff (12.8% for v1.5 vs 11.7% for v1).[77] By combining the improved protein-centric and protein-agnostic pipelines using data fusion, we obtain the best performance and retain the benefits of both types of approaches while minimizing the weaknesses of any single approach.
The higher benchmarking performance obtained by the ligand-based pipelines may in part be due to the nature of drug discovery and development, which is biased in favor of already effective compounds in an effort to break into a new market or retain market dominance by generating new intellectual property. New drugs are often derivatives of existing ones with small changes.[82,83] Repurposing based on molecular fingerprint similarity will be highly enriched for these “me too” compounds,[83] given that the approach to shotgun drug repurposing in the CANDO platform is currently based on detecting drug-compound similarities. Our benchmarking performance metrics are biased toward reporting particular pipelines as better when they capture what is already known/approved and not novel repurposing candidates that will work to treat or cure an indication in reality. Barring large-scale preclinical validation of putative drug candidates, benchmarking nonetheless remains a reproducible and meaningful measure in our studies.[46−48] Our goal in this study was to assess the value of adding fingerprinting and data fusion pipelines to the existing protein-centric pipelines in the CANDO platform and not to perform an exhaustive enumeration, comparison, and fusion of ligand- and structure-based approaches for identifying drug associations.[84] More sophisticated fingerprint representations encode the structures of compounds differently and capture unique features of particular relevance to drug discovery and repurposing. Future work will extend our analyses to include additional fingerprints that can be created using RDKit, including the Long Extended and Feature Connectivity Fingerprints (LECFP and LFCFP, respectively). Longer fingerprints have been shown to better describe a compound with less redundancy, leading to increased accuracy in virtual screening.[85] Features and categories of indications, proteins, and compounds all influence the drug repurposing accuracy of CANDO.
We are continuing to undertake thorough experiments exploring the roles of particular features responsible for benchmarking performance.[47,48,78] Incorporating machine learning to understand how compound–proteome interaction signatures influence performance will help us find the most parsimonious molecular descriptors for compounds. Drugs may have targets beyond proteins, including DNA and RNA.[86,87] To better model how a compound interacts with all potential targets, we are integrating compound–nucleic acid interaction modeling into CANDO. Finally, we are working with collaborators to validate the predictions from the various pipelines in preclinical and clinical studies, which represents the ultimate test of the CANDO platform.

Conclusions

CANDO is a computational platform for shotgun drug discovery and repurposing. We implemented new ligand-based and data fusion pipelines in the CANDO platform and obtained substantial improvement in benchmarking performance using a combination of protein-centric and protein-agnostic methods. These improved results indicate greater confidence in drug repurposing predictions made by us using CANDO and demonstrate the value of considering different, orthogonal, types of approaches for calculating compound–compound similarities. Our integrated approach moves us closer to developing an accurate, robust, and reliable computational drug repurposing platform and using it to understand how small molecules interact with each other and with larger macromolecules in their corresponding environments.

Methods

Figure illustrates the different pipelines evaluated in this study, which are described in detail below.

The CANDO Platform and the Version 1 (v1) Pipeline

A detailed description of the CANDO platform, including the v1 pipeline used for assigning drugs to indications as well as its benchmarking performance, is available elsewhere.[46−48,78] Briefly, in v1, we predicted interactions between 46,784 protein structures and 3733 small molecules that mapped to 2030 indications. We obtained the molecular structures of the 3733 small molecules in our putative drug library from the Food and Drug Administration (FDA), NCATS Chemical Genomics Center, and PubChem.[88] Solved X-ray diffraction structures of proteins were obtained from the Protein Data Bank,[89] and modeled protein structures were generated using I-TASSER.[19] Approved drug-indication associations were obtained from the Comparative Toxicogenomics Database (CTD)[90] and mapped to the CANDO drug library, resulting in 2030 indications with at least one approved/associated compound. Protein–compound interaction scores were calculated using a bio- and cheminformatic docking protocol consisting of ligand binding site identification for all proteins in our structure library followed by similarity measurement between known ligands in the identified binding sites and all 3733 compounds in our putative drug library.[47] A compound is characterized as an “interaction signature” of length 46,784, where each entry is an interaction score between 0 and 2, indicating the strength of a predicted protein interaction (zero signifying no interaction). Each compound is then compared to every other compound by calculating the root-mean-square deviation (RMSD) between the corresponding interaction signatures, generating a compound–compound (or drug–compound) similarity matrix. Each compound is ranked relative to every other compound in order of increasing similarity, and benchmarking is then performed.
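The signature comparison step can be illustrated with a minimal sketch. It is not the CANDO implementation: the compound names and four-entry signatures are made up for illustration (real signatures have 46,784 entries), and the sketch assumes that a lower RMSD means more similar signatures and lists the closest compound first.

```python
import math

def rmsd(sig_a, sig_b):
    """Root-mean-square deviation between two equal-length interaction signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(sig_a, sig_b))
    return math.sqrt(squared / len(sig_a))

def rank_by_signature(signatures):
    """For each compound, list every other compound from lowest to highest RMSD."""
    ranked = {}
    for query, query_sig in signatures.items():
        others = [(rmsd(query_sig, sig), name)
                  for name, sig in signatures.items() if name != query]
        ranked[query] = [name for _, name in sorted(others)]
    return ranked

# Hypothetical signatures over four proteins; each score lies between 0 and 2.
toy_signatures = {
    "cmpd_A": [0.0, 1.2, 0.4, 2.0],
    "cmpd_B": [0.0, 1.0, 0.5, 1.9],
    "cmpd_C": [1.8, 0.0, 0.0, 0.1],
}
toy_ranking = rank_by_signature(toy_signatures)
print(toy_ranking["cmpd_A"])
```

In this toy example cmpd_B is the nearest neighbor of cmpd_A because their signatures differ only slightly at each protein.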

Ligand-Based Pipelines

The CANDO platform for shotgun drug repurposing is not dependent on any particular method for determining compound similarity, such as the protein-centric one used in v1. Here, we consider the utility of ligand-based pipelines by constructing two-dimensional molecular fingerprints of the 3733 compounds in the CANDO putative drug library using the open-source cheminformatics software RDKit Python API[31] and performing an all-against-all comparison using the Tanimoto coefficient. Once the features of a molecule have been quantized into a vector, the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in either vector, that is, |A ∩ B| / |A ∪ B|, where A and B represent compounds in a binary vector form, and |X| is the number of bits set in any vector X. For efficiency and accuracy, we described our putative drug library using well-studied 2D molecular fingerprints.[45] Specifically, we used Morgan fingerprints,[91] otherwise known as Extended Connectivity Fingerprints (ECFP; a circular fingerprint), one Functional Class Fingerprint (FCFP; a functional class fingerprint[92]), and fingerprints from RDKit (RDK; a linear fingerprint). Circular fingerprints are bit vector representations of compounds encoding the presence of molecular substructures constructed outward from all starting positions (all atoms) in a radial fashion, functional class fingerprints are binary vectors that encode the presence of predefined “functional” features of a compound, and linear fingerprints encode the presence of molecular substructures built in a linear fashion from all possible starting points (all atoms).[63] All fingerprints are additionally described by the length of the molecular substructure (“radius” or “diameter” depending on the type and implementation) captured. For instance, ECFP4 is a fingerprint created using ECFP with diameter four.
Specific ligand-based pipelines in CANDO are identified according to the molecular fingerprint used, that is, “ECFP4” refers to the CANDO pipeline where compounds are represented using the ECFP4 molecular fingerprint. Hert et al. found that the optimal results for quantifying relationships between drug classes were achieved using ECFP4 fingerprints with similarities calculated using the Tanimoto coefficient.[60] We extended this to ligand-based drug repurposing using vectors of 2048 bits instead of the 1024 used in Hert et al.[60] We calculated the Tanimoto coefficient between the fingerprints of all possible pairs of the 3733 compounds in our library and used this to populate a compound–compound similarity matrix, just as we did with the v1 pipeline, allowing us to sort and rank all compounds relative to each other. Fingerprints could not be created for 12 of the 3733 compounds in our putative drug library, which were generally large compounds with metal chelation or long polymers. We then evaluated benchmarking performance of the ligand-based pipelines as described further below.
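As a rough illustration of the all-against-all Tanimoto comparison, the sketch below represents each fingerprint as the set of positions of its "on" bits. The compound names and bit positions are invented for the example. In the actual pipelines the bit vectors would come from RDKit, which is not used here so that the snippet runs with the standard library alone. Returning 0 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def similarity_matrix(fingerprints):
    """All-against-all Tanimoto scores as a nested dict (self-comparisons omitted)."""
    names = sorted(fingerprints)
    return {q: {o: tanimoto(fingerprints[q], fingerprints[o])
                for o in names if o != q}
            for q in names}

def rank_neighbors(matrix, query):
    """Other compounds sorted from most to least similar to the query."""
    row = matrix[query]
    return sorted(row, key=lambda name: (-row[name], name))

# Hypothetical on-bit positions within a 2048-bit fingerprint.
toy_fingerprints = {
    "cmpd_X": {1, 5, 9, 42},
    "cmpd_Y": {1, 5, 9, 77},
    "cmpd_Z": {3, 77},
}
toy_matrix = similarity_matrix(toy_fingerprints)
print(toy_matrix["cmpd_X"]["cmpd_Y"])  # 3 shared bits out of 5 set in either
print(rank_neighbors(toy_matrix, "cmpd_X"))
```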

Data Fusion Pipelines

We combined rankings from the v1 pipeline with the new molecular fingerprint rankings using one of the following criteria: lower of two rankings (MIN), higher of two rankings (MAX), sum of two rankings (SUM), and average of two rankings (AVG). This is known as “rank-based data fusion”.[93] We also combined the compound–compound similarity scores from v1 and the ligand-based pipelines using the multiplication of raw similarity scores (MUL), a type of “kernel-based data fusion”.[93] After multiplying the similarity scores from two pipelines, the compounds are sorted and ranked based on the newly calculated scores. As in v1 and the ligand-based pipelines, the compound–compound rankings from these data fusion pipelines are then subjected to benchmarking.
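The five fusion rules can be sketched as follows. The sketch rests on several assumptions not stated in the text: rank 1 denotes the most similar compound, MIN and MAX take the numerically smaller and larger rank respectively, both score sets passed to MUL are oriented so that higher means more similar, and ties are broken alphabetically. All compound names and values are hypothetical.

```python
def fuse_ranks(rank_a, rank_b, method):
    """Rank-based fusion of two {compound: rank} dicts (rank 1 = most similar)."""
    combine = {
        "MIN": min,
        "MAX": max,
        "SUM": lambda x, y: x + y,
        "AVG": lambda x, y: (x + y) / 2,
    }[method]
    fused = {c: combine(rank_a[c], rank_b[c]) for c in rank_a}
    # Smaller fused value ranks first; ties are broken by compound name.
    return sorted(fused, key=lambda c: (fused[c], c))

def fuse_scores_mul(sim_a, sim_b):
    """Score-based fusion (MUL): multiply raw similarity scores, then re-rank."""
    product = {c: sim_a[c] * sim_b[c] for c in sim_a}
    # Assumes higher score = more similar, so the largest product ranks first.
    return sorted(product, key=lambda c: (-product[c], c))

# Hypothetical neighbors of one query compound under two pipelines.
ranks_v1 = {"cmpd_B": 1, "cmpd_C": 2, "cmpd_D": 3}
ranks_fp = {"cmpd_B": 3, "cmpd_C": 1, "cmpd_D": 2}
sims_v1 = {"cmpd_B": 0.9, "cmpd_C": 0.5, "cmpd_D": 0.2}
sims_fp = {"cmpd_B": 0.1, "cmpd_C": 0.8, "cmpd_D": 0.6}

for method in ("MIN", "MAX", "SUM", "AVG"):
    print(method, fuse_ranks(ranks_v1, ranks_fp, method))
print("MUL", fuse_scores_mul(sims_v1, sims_fp))
```

Because AVG is SUM divided by two, the two rules always produce the same ordering for a pair of pipelines. They would differ only if the fused values themselves, rather than the resulting ranks, were used downstream.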

Decision Tree Pipeline

One goal of CANDO is to make predictions of which compounds are likely to be efficacious against any particular indication. The second goal is to use analytics to identify causal relationships that predict indication etiology. From the benchmarking, we can determine a priori the pipeline that has the best performance for a particular indication, which is then used to generate putative drug candidates for that indication. We constructed a new meta pipeline that makes a decision as to optimal performance on a per indication basis. We made this decision using the average indication accuracy metric (described below) from two pipelines: v1 and the best performing ligand-based pipeline, ECFP4 (see Results). We used this to create a merged set of data that was then benchmarked. For example, the v1 pipeline yields a top10 average indication accuracy of 25% for type 2 diabetes, whereas ECFP4 yields a top10 accuracy of 35%. In the combined decision tree pipeline at the top10 cutoff, we chose to use ECFP4 for the prediction of repurposing candidates for type 2 diabetes. The choice at other cutoffs is based on whichever pipeline obtains a higher average indication accuracy at that cutoff. The calculation of the other two benchmarking performance metrics is based on the data from the corresponding pipeline chosen in the previous step. We extended this method of choosing the pipeline (between v1 and ECFP4) with higher average indication accuracy to all indications. This aligns with the logic that a clinician or researcher using CANDO can choose the pipeline with the highest accuracy for a particular indication, which is reflected in the benchmarking performance of this combined pipeline.
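The per-indication selection can be sketched as below. The type 2 diabetes values are the top10 accuracies quoted above; the second indication and its values are hypothetical. The text does not say how ties are resolved, so this sketch arbitrarily gives ties to the pipeline listed first.

```python
def choose_pipelines(accuracy_by_indication, pipelines=("v1", "ECFP4")):
    """Pick, per indication, the pipeline with the higher indication accuracy."""
    chosen = {}
    for indication, acc in accuracy_by_indication.items():
        # max() returns the first maximal item, so ties go to the pipeline listed first.
        chosen[indication] = max(pipelines, key=lambda p: acc[p])
    return chosen

def merged_average_accuracy(accuracy_by_indication, chosen):
    """Average indication accuracy of the merged (decision tree) data set."""
    values = [accuracy_by_indication[ind][pipe] for ind, pipe in chosen.items()]
    return sum(values) / len(values)

# Top10 accuracies (%) per indication and pipeline.
toy_accuracy = {
    "type 2 diabetes": {"v1": 25.0, "ECFP4": 35.0},  # values quoted in the text
    "hypothetical indication X": {"v1": 40.0, "ECFP4": 10.0},  # made up
}
toy_choice = choose_pipelines(toy_accuracy)
print(toy_choice)
print(merged_average_accuracy(toy_accuracy, toy_choice))
```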

Benchmarking Pipelines in the CANDO Platform

In contrast to virtual screening experiments, our input data is human use drugs, and the performance evaluation is against known drug-indication associations. Three measures are used to perform the leave-one-out benchmarking of the CANDO platform pipelines: average indication accuracy, pairwise accuracy, and coverage. Average indication accuracy (%) evaluates the likelihood of capturing at least one drug mapped to the same indication within a particular cutoff from the list of compounds ranked in order of similarity, which is averaged over the 1439 indications with at least two approved drugs and expressed as percent (%). Mathematically, this is expressed as c/d × 100, where c is the number of times at least one other drug approved for the same indication was captured within a cutoff, and d is the total number of drugs approved for that indication. The top10, top25, top1% (top37), top50, and top100 cutoffs are used, signifying the top ranking 10–100 similar compounds. In other words, the indication accuracy represents the recovery rate of known drugs for a particular indication, which is then averaged across all 1439 indications with at least two approved drugs. Pairwise accuracy (%) is the weighted average of the per indication accuracies based on the number of compounds approved for a given indication. Coverage is the number of indications with nonzero accuracy expressed as percent (%).
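The three metrics can be sketched on a toy data set as follows. This is one reading of the definitions above, not the CANDO code: each approved drug is used in turn as the query, a hit is counted when any other drug approved for the same indication appears within the cutoff of that query's ranked list, and pairwise accuracy weights each indication by its number of approved drugs. All compound and indication names are hypothetical.

```python
def indication_accuracy(drugs, rankings, cutoff):
    """c/d x 100 for one indication, using each approved drug as the query in turn."""
    hits = 0
    for drug in drugs:
        others = set(drugs) - {drug}
        if others & set(rankings[drug][:cutoff]):
            hits += 1
    return 100.0 * hits / len(drugs)

def benchmark(indications, rankings, cutoff):
    """Return (average indication accuracy, pairwise accuracy, coverage), all in %."""
    # Only indications with at least two approved drugs can be benchmarked.
    eligible = {name: drugs for name, drugs in indications.items() if len(drugs) >= 2}
    per_ind = {name: indication_accuracy(drugs, rankings, cutoff)
               for name, drugs in eligible.items()}
    n = len(per_ind)
    average = sum(per_ind.values()) / n
    total_drugs = sum(len(drugs) for drugs in eligible.values())
    pairwise = sum(per_ind[name] * len(eligible[name]) for name in eligible) / total_drugs
    coverage = 100.0 * sum(1 for acc in per_ind.values() if acc > 0) / n
    return average, pairwise, coverage

# Hypothetical ranked neighbor lists (most similar first) and drug-indication map.
toy_rankings = {
    "A": ["B", "C", "D"], "B": ["A", "D", "C"], "C": ["D", "A", "B"],
    "D": ["E", "C", "A"], "E": ["A", "B", "C"],
    "F": ["A", "B", "G"], "G": ["B", "A", "F"],
}
toy_indications = {
    "ind1": ["A", "B"],
    "ind2": ["C", "D", "E"],
    "ind3": ["F", "G"],
    "ind4": ["D"],  # skipped: fewer than two approved drugs
}
avg_acc, pair_acc, cov = benchmark(toy_indications, toy_rankings, cutoff=1)
print(round(avg_acc, 2), round(pair_acc, 2), round(cov, 2))
```

At a cutoff of 1, the per-indication accuracies here are 100% (ind1), about 66.7% (ind2), and 0% (ind3). The unweighted average and the drug-weighted pairwise accuracy therefore differ, and coverage is two of three eligible indications.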

Controls

The performance of a given pipeline is evaluated relative to a random control, which is the result that we would expect by chance. The original random control data for v1 was generated by repeated creation of random compound–proteome interaction matrices by sampling from the distribution of values present in the v1 matrix. The benchmarking performance for these random control matrices was calculated as described above and by Sethi et al. and Mangione et al.[47,78] However, the new ligand-centric pipeline is protein-agnostic, and the data fusion ones consist of protein-agnostic components. Therefore, we constructed a compound–compound matrix of uniformly random similarity scores to use as controls in this study, that is, the similarity between any two compounds was assigned a random value between 0 and 1. We sorted and ranked every compound relative to every other compound using this random compound–compound similarity matrix and evaluated benchmarking performance as described above.
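A minimal sketch of such a random control is given below. Two details are assumptions made for illustration and are not stated in the text: the matrix is symmetric, and the generator is seeded for reproducibility. The compound names are hypothetical.

```python
import random

def random_similarity_matrix(compounds, seed=None):
    """Compound-compound matrix of uniformly random similarity scores in [0, 1)."""
    rng = random.Random(seed)
    matrix = {c: {} for c in compounds}
    for i, a in enumerate(compounds):
        for b in compounds[i + 1:]:
            score = rng.random()
            # Symmetry is an assumption of this sketch.
            matrix[a][b] = score
            matrix[b][a] = score
    return matrix

def rank_from_similarity(matrix):
    """For each compound, every other compound sorted from most to least similar."""
    return {query: sorted(row, key=lambda c: -row[c])
            for query, row in matrix.items()}

toy_compounds = ["cmpd_A", "cmpd_B", "cmpd_C", "cmpd_D"]
control_matrix = random_similarity_matrix(toy_compounds, seed=7)
control_ranking = rank_from_similarity(control_matrix)
print(control_ranking["cmpd_A"])
```

The resulting ranked lists have the same shape as those from any other pipeline, so they can be passed through the same benchmarking procedure unchanged.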
