Literature DB >> 24289468

Predicting drug-target interactions using probabilistic matrix factorization.

Murat Can Cobanoglu¹, Chang Liu, Feizhuo Hu, Zoltán N Oltvai, Ivet Bahar.

Abstract

Quantitative analysis of known drug-target interactions emerged in recent years as a useful approach for drug repurposing and assessing side effects. In the present study, we present a method that uses probabilistic matrix factorization (PMF) for this purpose, which is particularly useful for analyzing large interaction networks. DrugBank drugs clustered based on PMF latent variables show phenotypic similarity even in the absence of 3D shape similarity. Benchmarking computations show that the method outperforms those recently introduced provided that the input data set of known interactions is sufficiently large--which is the case for enzymes and ion channels, but not for G-protein coupled receptors (GPCRs) and nuclear receptors. Runs performed on DrugBank after hiding 70% of known interactions show that, on average, 88 of the top 100 predictions hit the hidden interactions. De novo predictions permit us to identify new potential interactions. Drug-target pairs implicated in neurobiological disorders are overrepresented among de novo predictions.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 24289468 PMCID： PMC3871285 DOI： 10.1021/ci400219z

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Drug discovery and development has become increasingly challenging in recent years, evidenced by the estimated cost of around $1.8 billion for the development of a novel molecular entity with suitable pharmacological properties.[1] This cost increase partly originates from the failure of many drug candidates in phase II or III clinical trials due to their toxicity or lack of efficacy.[2] The efficiency of drug discovery and development might be improved by adopting a systemic approach that takes into consideration the interaction of existing drugs and candidate compounds with the entire network of target proteins and other biomolecules in a cell.[3] Indeed, the “one gene, one drug, one disease” paradigm is widely recognized to fail in describing experimental observations.[4] Many drugs act on multiple targets, and many targets are themselves involved in multiple pathways. For example, β-lactam antibiotics and most antipsychotic drugs exert their effect through interactions with multiple proteins.[5,6] Biological networks are highly robust to single-gene knockouts, as recently shown for yeast where 80% of the gene knockouts did not affect cell survival.[7] Similarly, 81% of the 1500 genes knocked out in mice did not cause embryonic lethality, further corroborating the robustness of biological networks against single target perturbagens.[8] These results suggest that quantitative systems pharmacology strategies that take account of target (and drug) promiscuities can present attractive alternative routes to drug discovery. Recent years have seen many network-based models adopted to reduce the complexity of, and efficiently explore, drug–target interaction systems.[2,5,6,9] In particular, the development of computational methods that can efficiently assess potential new interactions became an important goal. In this regard, the important role that machine learning approaches such as active learning (AL) can play has been recently been highlighted.[10] Computational approaches used to predict unknown drug–target interactions can be divided into roughly four categories: chemical-similarity-based methods,[11−13] target-similarity-based methods,[14−16] integrative (both target- and chemical-similarity-based) methods,[17−23] and holistic approaches.[24−29] The first two posit that if two entities are chemically or structurally similar they will share interactions. The integrative approaches combine the chemical- and target-similarity methods. While the intuition behind these approaches is very reasonable, their performance has been observed to be tied to the underlying similarity computation method. We also note that the utility of different methods may depend on the size of the data set being analyzed, e.g., computing chemical–chemical and target–target similarity matrices can be problematic for large databases like STITCH[30] (that contains information on the interactions between more than 2.6 million proteins and 300 000 chemicals). To overcome these limitations, holistic methods have been introduced, which utilize a number of different data sources such as gene expression perturbation[25,26] or high-throughput screening.[28] In this study, we propose a novel approach by using a collaborative filtering algorithm to predict interactions without reliance on chemical/target similarity or external data collection. We validate the utility of probabilistic matrix factorization (PMF) for predicting unknown drug–target interactions with the help of a detailed investigation of its performance. The method is shown to group drugs according to their therapeutic effects, irrespective of their three-dimensional (3D) shape similarity. Benchmarking computations show that the method outperforms recent methods[17,20,22] when applied to large data sets of protein–drug associations, such as those of enzyme– and ion channel–drug pairs; whereas the performance falls short of these methods with decreasing size of the examined data set (e.g., GPCR- and nuclear receptor-drug data sets). The ability of the method to efficiently analyze and make inferences from large data sets of protein–drug interactions suggests that, with growing sizes of those data sets, the utility (and accuracy) of the method will further improve. Application of the same benchmarking procedure to DrugBank[31] confirms its ability to disclose hidden data: 88 out of the top 100 predictions (or 587 out of 1000) are found to hit known (but hidden) interactions, when only 30% of the entire data is used for training. Finally, when the method is trained on the entire data set of drug–target interactions compiled in DrugBank, de novo predictions for drug repurposing can be made along with the corresponding confidence levels. Top de novo predictions include many drugs indicated for neurodegenerative diseases or neurobiological disorders, including drug–target pairs apparently supported by previous experiments (not reported in DrugBank), e.g., ergotamine–serotonin receptor 1A (5HT1A),[32] amoxapine-5-HT2A,[33] and verapamil–calmodulin.[34] In conclusion, the newly introduced computational method provides an efficient approach for identifying potential drug–target association between chemicals and targets and formulating new hypotheses for repurposable drugs or side effects, thus complementing those deduced from chemical–chemical or target–target similarities.

Materials and Methods

Problem Definition

The drug–target interaction network is a bipartite graph with two types of nodes: drugs and targets (Supporting Information Figure S1). Each edge represents an interaction between a drug and a target. The drug–target interaction identification problem is to determine the missing edges that are likely to exist given all nodes and some of the edges in the network.

Data Set

We used DrugBank (version of September 20, 2011) as the database.[31] All drugs annotated therein as approved, along with their annotated targets, are included in our data set (i.e., we excluded compounds annotated as withdrawn or nutraceutical), resulting in N = 1413 drugs and M = 1050 targets with 4731 interactions among them. The interaction network displays small-world characteristics: many nodes have low degree and a few, very high degree, as illustrated in panels b and c of Supporting Information Figure S1, in line with previous studies on drug–target networks.[35] On average, there are 3.35 interactions per drug, and 4.50 interactions per target.

Probabilistic Matrix Factorization (PMF)

PMF is a member of the collaborative filtering family of machine learning algorithms that decomposes the connectivity matrix, R, of a bipartite graph of N drugs and M targets as a product of two matrices of latent variables (LVs).[36,37]R is defined asThe matrix R is modeled as the product of two matrices UT and V, that express each drug/target in terms of D LVs. Our objective is to find the best approximation for LVs, while avoiding overfitting. The predicted R̂ is then expressed aswhere UT and V are composed of N rows uT and M columns v, respectively, each being D-dimensional. The PMF adopts a probabilistic linear model with Gaussian noise to model the interaction. Therefore, the conditional probability over observed interactions is represented aswhere f(x|μ, σ2) is the Gaussianly distributed probability density function for x, with mean μ and variance σ, and I is the indicator function equal to 1 if the entry R is known and 0 otherwise. Therefore, p(R|U, V, σ2) gives us a probabilistic representation of the connectivity matrix, R.[37] Using zero-mean, spherical Gaussian priors on LVs, we can writewhich leads to the log-likelihood of U and V given byHere C is a term that does not depend on LVs; the first term on the right-hand side is the squared error function to be minimized; and the two summations over the square magnitudes of u and v are regularization terms that favor simpler solutions and penalize overfitting. The above log-likelihood directly follows from the Bayes’ rule where R stands for data, and U and V represent the model (see the Supporting Information for details). To learn an optimal model means to find the U and V matrices, or the D-dimensional LV vectors, u (1 ≤ i ≤ N) and v (1 ≤ j ≤ M), that maximize the log-likelihood function. The PMF method yields the optimal u and v vectors corresponding to each drug, d, and each target t, respectively. The basic idea is that the model is forced toward making a “no-interaction” prediction by the regularization—i.e., there is a penalty associated with any nonzero value in the LV matrices. However, there is also a penalty for failing to capture known interactions—i.e., if the dot product of the LV vectors corresponding to an interacting drug–target pair is close to zero. Therefore the learning of a model means to optimally balance out two objectives: developing a sufficiently complex model to describe the known interactions, but not overly complex to end up in overfitting. In this study, we use gradient descent for optimization. The adoption of higher D values usually yields more accurate results, although beyond a certain limit the increase in complexity and decrease in efficiency may not warrant the marginal improvement, if any, in prediction accuracy. D = 50 is adopted here as an optimal dimensionality for prediction runs. The method is highly efficient: a 50-dimensional model is trained on the entire DrugBank in approximately 2 s using a 2.00 GHz AMD Opteron processor. Moreover, the computing time to learn a PMF model scales linearly with the number of interactions, and as such, the method can be advantageously used for much larger data sets.

Active Learning (AL) Using Probabilistic Matrix Factorization

The AL strategy adopted in the present study is, in part, motivated by the success reported by Warmuth et al.[38] who demonstrated that hit maximization is a viable AL strategy applicable to predicting drug–target interactions. The AL strategy adopted here also prioritizes the discovery of unknown interactions. Our method differs in that we aim at capturing the interactions between all drugs and targets, as opposed to predicting activity against a single target. The procedure is the following: We begin with the set of N drugs and M targets, and known associations, schematically shown in Figure 1 by black connectors. The purpose is to identify new associations, indicated by red connectors. For each candidate interaction, say the possible interaction between d and t, we compute the model’s estimate, f(R|uTv, σ2) (eq 3). The dot product uTv serves as a weight ω for the edge/connector between d and t. Clearly, ω, or the likelihood of association between d and t, is high when u and v have both large values of the same sign at the same dimension(s). For example, a relatively large weight may originate from the second component of both u and v, which means that the predicted association is mainly due to latent variable 2. We evaluated the statistical weights ω(d, t) for the N × M pairs of drug targets for two purposes: (i) benchmarking the methodology via an iterative AL scheme and (ii) making de novo predictions. In the former case, the method is benchmarked by hiding 70% of known interactions and examining whether the top-ranking prediction is a “hit”, i.e., whether it corresponds to a known (but hidden) interaction. The outcome from this test is fed back to the model, to repeat the calculation for the next prediction. Therefore, the AL model is updated at each iteration using the newly acquired “hit” or “miss” data until a predetermined number (m) of predictions are made. The passive learner (PL) makes the m predictions simultaneously without updating its model.

Figure 1

Qualitative illustration of the method for identifying drug–target interactions. The known interactions between drugs and targets (indicated by the black lines) are used to learn the LV vectors (shown adjacent to each node) that describe each drug and target. The dot product uTv of the LVs for each pair of drug d and target t defines the predicted statistical weight ω of corresponding connection. Example predictions are shown in red. In the case of de novo predictions, all DrugBank data were used as input. De novo predictions also lend themselves to an AL scheme provided that the top-ranking prediction is experimentally tested and then the new hit or miss data are incorporated in the model to perform a new prediction, and so on, until the experimentation budget is exhausted.

Results and Discussion

PMF Cluster Drugs with Therapeutic Similarities, Irrespective of Their Chemical–Structural Similarities

To assess whether the LVs provide us with a pharmacologically meaningful metric, we examined the clustering of drugs in the D-dimensional space of the latent vectors. The clustering was performed for D = 30—the value that gave the lowest Akaike information criterion,[39] using as basis the drug–drug distance L1(d, d) = ∑|u – u| where u designates the kth component of u, and the summation is performed over D components. Inasmuch as our method evaluates drugs based on their interaction profiles with targets, which in turn refer to specific therapeutic or phenotypic actions, the similarity of a pair of drugs should be high when their therapeutic effects are comparable and vice versa. Thus, the method will tend to cluster drugs that exhibit similar patterns of interactions (with target proteins), which we term as functionally similar drugs. The heat map in Figure 2a displays the resulting organization of drugs in 30 clusters (indicated by different colors and indices along the axes). Supporting Information Table 1 lists the dominant therapeutic action associated with each cluster. The dark regions on the map indicate high functional similarity. The dark blocks along the diagonal show that most clusters include highly similar members, except for two (clusters 29 and 30), which apparently combine the outliers.

Figure 2

Comparison of pairwise similarities of drugs, based on their (a) therapeutic targets compiled in DrugBank and (b) 3D structure. (a) 30 clusters of drugs (color-coded along the axes; see Supporting Information Table S1 for their dominant therapeutic indication) deduced from the PMF of 1413 approved drugs and corresponding 1050 targets compiled in DrugBank. By definition, drugs belonging to a given cluster share similar interaction patterns with respect to targets. (b) 3D similarities, with the drugs being ordered as in panel a. Dark regions indicate high similarity based on LVs (panel a) or 3D similarities (panel b). Comparison of the panels shows that close proximity in LV space (which indicates functional similarity) does not necessarily imply 3D-structure similarity. LV distances were distributed in the range [0, 1]; with the distribution of values also skewed in different ways. To render the two sets comparable, we performed rank normalization on both the LV similarities and 3D similarities. Selected boxes are enlarged in Figures 3 (white) and 4 (yellow), and Supporting Information Figure S2 (green).

Figure 3

Latent variables capturing therapeutic action similarities when 3D similarity metrics cannot. Closer examination of the similarities between the members of cluster 14 in Figure 2 (enclosed in white boxes in Figure 2, enlarged in panels b and c here) shows that the cluster contains a series of antianxiety drugs. A few members of this cluster (indicated by orange boxes along the abscissa of panels b and c) are displayed in panel a, to illustrate their shared structural features, also indicated by panel c that reflects their 3D similarities. The same cluster however contains ethchlorvynol, also used as a sedative, which would have been missed if we had used exclusively used 3D similarity to identify functionally similar drugs.

Figure 4

Strong cross-correlations between different clusters of drugs, consistent with their similar therapeutic functions. Cluster 11, color-coded cyan, is essentially composed of hypnotics and sedatives. Cluster 14 (dark gray) contains antianxiety drugs. The drugs in these two clusters are located very closely on the drug–target interaction network, as shown in panel a, consistent with their similar actions. The LV-derived heat maps capture the functional similarity between these two clusters (as indicated by strong signals, or the dark region, in panel b); the maps based on 3D similarity (panel c) do not. In panel a drugs are shown in blue, protein targets in red. Most drugs and targets are part of a single connected component. Data are retrieved from DrugBank.[48] Cytoscape is used for visualization.[56]

Given that (promiscuous) proteins present more than one site for ligand-binding, different functionalities may be modulated by chemical-structurally different drugs, depending on the binding site on the target (e.g., catalysis, substrate recognition, or allosteric signaling). Furthermore, a shared phenotype may arise from the targeting of different proteins along a given pathway. In order to make a better assessment of the properties of drugs grouped in those clusters, we examined their 3D structural similarities. High similarities would suggest that they bind similar epitopes, if not similar (or identical) structural domains or proteins. If, on the contrary, they are structurally dissimilar, this might indicate a different site on the same protein, a different target on the same pathway, or other indirect effect due to drug–target network connectivity. The extent of 3D structure similarity between pairs of drugs was computed using the OpenEye Scientific software (http://www.eyesopen.com/). 3D similarity was reported to be a better predictor than 2D methods for off-target interactions and to perform equally well in on-target interactions,[40] although 3D methods may suffer from more noise due to the conformational flexibility of the small molecule. We generated for each drug all possible stereoisomers using OpenEye FLIPPER[41] and up to 200 conformers per stereoisomer using OpenEye OMEGA.[41] All combinations of conformers accessible to the examined pair of drugs were examined using OpenEye Shape[23] toolkit; and the best matching pair was adopted to assign a 3D similarity score. This computationally expensive task led to the heat map presented in panel b of Figure 2. The drugs (along the axes) are ordered as in panel a to enable visual comparison. The comparison of Figure 2 shows that some clusters of functionally similar drugs (panel a) also exhibit some 3D similarities (panel b), whereas others display little structural similarity. We examined more closely the individual clusters to see if shared therapeutic functions were captured even when 3D similarities were absent. Figure 3 illustrates the results for cluster 14. This cluster essentially consists of antianxiety drugs, the majority of which are both functionally (panel b) and structurally (panel c) similar. However, the cluster also includes a structurally dissimilar drug, ethchlorvynol (panel a), which shares the same type of phenotypic action (as a sedative) as the majority of the cluster membership (mostly targeting GABA receptors). The present approach thus detects chemically or structurally distinctive drugs that share common activities, which would have been missed by methods based on ligand fingerprint similarities. Latent variables capturing therapeutic action similarities when 3D similarity metrics cannot. Closer examination of the similarities between the members of cluster 14 in Figure 2 (enclosed in white boxes in Figure 2, enlarged in panels b and c here) shows that the cluster contains a series of antianxiety drugs. A few members of this cluster (indicated by orange boxes along the abscissa of panels b and c) are displayed in panel a, to illustrate their shared structural features, also indicated by panel c that reflects their 3D similarities. The same cluster however contains ethchlorvynol, also used as a sedative, which would have been missed if we had used exclusively used 3D similarity to identify functionally similar drugs. Another interesting observation concerns the cross-correlations between different clusters (i.e., the off-diagonal regions of the heat maps). We note for example that cluster 11 also contains a set of sedatives. LVs are able to capture the commonality between the clusters 11 and 14 as may be seen by the strong signal (dark region) at the off-diagonal region enlarged in Figure 4b. The 3D similarity, on the other hand, cannot recognize the functional similarity and potential interference/side effects between these drugs in these two clusters (Figure 4c). See Supporting Information Figure S2, which illustrates the same behavior for another cluster, whose members are mostly antineoplastic agents, albeit with various 3D structures. The LVs thus provide information on drug groups that potentially share pathways or exhibit similar activity patterns despite their distinct physicochemical properties. Strong cross-correlations between different clusters of drugs, consistent with their similar therapeutic functions. Cluster 11, color-coded cyan, is essentially composed of hypnotics and sedatives. Cluster 14 (dark gray) contains antianxiety drugs. The drugs in these two clusters are located very closely on the drug–target interaction network, as shown in panel a, consistent with their similar actions. The LV-derived heat maps capture the functional similarity between these two clusters (as indicated by strong signals, or the dark region, in panel b); the maps based on 3D similarity (panel c) do not. In panel a drugs are shown in blue, protein targets in red. Most drugs and targets are part of a single connected component. Data are retrieved from DrugBank.[48] Cytoscape is used for visualization.[56]

Benchmarking Computations Support the Utility of the Method for Analyzing Large Data Sets

To evaluate the performance of the method in comparison to previous work, we considered three important studies in this area, one recently published by Gonen[22] and two by Yamanishi et al.[17,20] Gonen used a Kernel-based matrix factorization (KBMF) with chemical and genomic similarities to predict multiple targets. Yamanishi et al., on the other hand, integrated chemical, genomic, and pharmacological data to map all drugs and targets to the same unified feature space where each protein–compound pair closer than a predefined threshold was predicted to interact. Our approach differs from both studies, in that PMF assumes an independent LV for each row and column with Gaussian priors; whereas KBMF employs LVs spanning all rows and columns with Gaussian process priors, and Yamanishi et al. project drugs and targets into a pharmacological space based on the eigenvalue decomposition of the graph-based similarity matrix. The benchmarking procedure that we adopted is a 5-fold cross-validation of drugs on four target classes: enzymes, ion channels, G-protein coupled receptors (GPCRs), and nuclear receptors. In order to achieve comparable results, we used the same protocol as that adopted earlier, i.e., we divided our data set into five subsets; each was used as a test set, and the others, as training sets. Due to the randomness involved in the selection of subsets, we repeated the cross-validation experiments 100 times with randomly selected subsets and evaluated the average AUC (area under the receiver operating curve) for each subset. The first four rows in Table 1 compare the results (columns 6–10) for the four classes, and the fifth row lists the average performances weighted by the size of the interaction space. Our method performs best when applied to large data sets (e.g., enzymes and ion channels); whereas Gonen’s performs best in the case of GPCRs, and Yamanishi et al.[20] exhibits the highest performance for nuclear receptors, where the present method yields a relatively low (0.642) AUC value. Examination of the statistical significance of our results (Supporting Information Figure S3a) indicates that the mean AUC values obtained for all four sets are highly robust. Their covariances vary from 2% (enzymes and ion channels) to 11% (nuclear receptors). Finally, the application of the same benchmarking protocol to DrugBank yielded an accuracy rate of 79.4 ± 0.01% (Table 1, last row, and Figure S3), supporting the utility of the method when applied to large data sets.

Table 1

Properties of the Examined Space of Proteins–Drugs and Performance of the Present Method in Comparison to Others (*)

						Yamanishi (pred pharmacol effects)
target type	no. of known interactions	no. of drugs (N)	no. of targets (M)	size of interaction space (N×M)	percent occupancy of the space	2008a	2010a	Gonen, 2012a	present method (D = 50)
enzymes	1515	212	478	101336	1.50%	0.821	0.845	0.832	0.861 ± 0.02
ion channels	776	99	146	14454	5.37%	0.692	0.731	0.799	0.904 ± 0.02
GPCRs	314	105	84	8820	3.56%	0.811	0.812	0.857	0.771 ± 0.04
nuclear receptors	44	27	22	594	7.41%	0.814	0.830	0.824	0.650 ± 0.11
allb	2649	443	730			0.782	0.807	0.825	0.859 ± 0.03
DrugBank	4731	1413	1050	1483650	0.32%				0.794 ± 0.01

The last four columns present the comparison with the results of Yamanishi[17,20] and Gonen[22] for the same data set.

Weighted-average mean and covariances, evaluated using the number of interactions as weights.

The last four columns present the comparison with the results of Yamanishi[17,20] and Gonen[22] for the same data set. Weighted-average mean and covariances, evaluated using the number of interactions as weights. In principle, it might be intrinsically harder to make accurate predictions for larger data sets as the size of the potential interaction space N × M grows quadratically when the number of drugs and targets grow linearly, particularly if the number of known interactions is small. The occupancy of the N × M interaction matrix is only 1.5% in the enzyme class, which could make it difficult to learn an informative model. The present PMF technique, however, successfully learned an informative model and handled the complexity of interactions in this space of interactions, apparently due to the availability of a sufficiently large (absolute) number of known interactions (Supporting Information Figure S3b). The ion channel drug class has the second largest number of known interactions among the four. Although the size of interaction space is 1 order of magnitude smaller than the enzyme class, there are 776 known interactions leading to a percent occupancy of 5.37% of all possible ion channel-drug associations. The success of our method in this case may be attributed to both the relatively large number of known interactions and the rich annotation of that class of interactions. The two other classes, GPCRs and nuclear receptors, are significantly smaller in terms of their interaction space and/or occupancy of that space. Nuclear receptors comprise only 27 drugs, 22 targets, and 44 interactions. A method that relies solely on connectivity, like ours, cannot presumably formulate an informative model when the set of “edges” to construct the network connectivity matrix is incomplete. In those cases, the data that come from other sources, e.g. chemical similarity and genomic patterns, amend this lack of information. Consequently, methods that incorporate such features[20,22] outperform ours. To further examine the effect of scarcity of known interactions on the performance of the present method, we performed additional tests by varying the fraction of hidden interactions. The results are presented in Supporting Information Figure S4. Panels a–d show the performance on ion channels, enzymes, GPCRs, and nuclear receptors, respectively. These results show that the performance depends on the fraction of known interactions. To put the results into perspective, we indicated by a vertical dashed line in each panel the fraction of data (80%) used in previous studies[20,22] for training purposes. Consistent with the above findings, ion channels yield the best result: previous AUC values[22] (of 0.799; Table 1) are matched with about only 35% of the data. In the enzyme group, we match the performance of Yamanishi et al.[20] (AUC of 0.845) with roughly 70% of the data used for training. GPCRs and nuclear receptors yield AUC values lower than those previously attained,[20,22] irrespective of the fraction of hidden interactions. In summary, the method is particularly suitable for screening and inferring repurposable drugs or potential side effects from large data sets where computational assessment of structure similarity kernels become prohibitively expensive. In cases where the data set of known interactions is too small, on the other hand, 2D or 3D similarity metrics provide more accurate assessments.

Absolute Number of Known Interactions Overrides the Scarcity of the Data in Determining the Accuracy Rate of PMF Active Learner

As a more stringent test, 3318 (70%) of the known 4731 interactions in DrugBank were randomly hidden, reducing the average number of interactions per drug from 3.35 to 1. The resulting “incomplete” interaction matrix was then used to predict the hidden interactions, one at a time (rank-ordered by statistical weights ω(d, t)) as described in Materials and Methods. The outcome was checked in a simulated experiment to assess whether the predicted interaction is a true positive (TP) or a false positive (FP). If the prediction is an existing, but hidden, interaction, the result is considered a TP (or hit), otherwise a FP (or miss). Then the model is updated in line with our AL scheme, and this loop is repeated until the completion of m = 1000 predictions. At that point, the simulation is halted and the overall performance of the model, or the hit ratio, is evaluated. Note that this method gives us a lower bound for hit ratio because the predictions are labeled as hits only if they are annotated in DrugBank, although they can be true but not yet observed experimentally or annotated in DrugBank. The results are presented in Figure 5. The figure displays the number of hits as a function of the number of predictions, obtained with three approaches: active learning (dark blue curves), passive learning (dark red curves), and random (green). The approach is able to achieve, on average, 587 hits out of 1000 predictions via AL, 407 hits, via PL; and the corresponding variances (indicated by the dashed curves) are 35 and 46, respectively. Compared to the random probability of 2.23 hits per 1000 predictions, the AL result is a 263-fold improvement over random. The improvement of AL over PL is 1.44 fold. The AL improvement over random was reported to be up to 3.19-fold in a previous SVM-based study for predicting the activity of 1316 drugs against a single target.[38] The same study also reported 1.59-fold improvement between passive and active learners. Closer examination of the results from the top 100 predictions (enlarged in the inset) further shows that hit ratios of 88.0 ± 4.7% and 82.2 ± 6.4% are obtained by the respective AL and PL protocols. The results are obtained with D = 50, which yields optimal results, as can be seen from Supporting Information Figures S5 and 6 display the dependence of the results on D, in support of the choice of D = 50.

Figure 5

Figure 6

Improvement in prediction accuracy by AL over random (a) and over PL (b), as a function of the latent space dimensionality. Fold improvement is based on hit ratios obtained at the end of 1000 predictions, using same parameters as Figure 5. The AL performance levels off at about D = 50 in panel a. The last bar in each panel refers to the work of Warmuth et al.[38]

Ability of the method to recapitulate hidden drug–target interactions. The number of drug–target interactions per drug was reduced from 3.35 (average) to 1 by hiding 70% of known interactions, selected randomly. Simulations were repeated n = 96 times for each of the 1 < m < 1000 predictions (abscissa) and the number of hits (correctly identified hidden interactions) is plotted for each run, along the ordinate. The dark blue and dark red solid curves refer to the average performance obtained by active learning and passive learning protocols, respectively, using D = 50, ε = 3, λ = 0.01, and μ = 0.9 in the adopted PMF algorithm. Dashed curves show the corresponding variances (by one standard deviation) above and below the mean value. The green curves (practically overlapping with the abscissa) refer to results from random predictions. The inset shows a close-up of the first 100 predictions. AL reaches an accuracy rate (hit ratio) of 88.0 ± 4.7% and 58.7 ± 3.5% in the respective cases of m = 100 and 1000 predictions. Improvement in prediction accuracy by AL over random (a) and over PL (b), as a function of the latent space dimensionality. Fold improvement is based on hit ratios obtained at the end of 1000 predictions, using same parameters as Figure 5. The AL performance levels off at about D = 50 in panel a. The last bar in each panel refers to the work of Warmuth et al.[38] These results permit us to draw two conclusions. First, a hit ratio of 88% is attainable in the top 100 predictions (and 59% in top 1000) upon adopting a PMF-based AL strategy for identifying hidden/unknown interactions in a sparse (0.32% occupancy) data set of about 1.5 million potential interactions. Second, the AL method outperforms random by 2 orders of magnitude and PL by a ratio of 1.5 approximately, in support of AL strategy for predicting new interactions.

De novo Predictions of Drug–Target Interactions

We used our method to predict new (potential) drug–target interactions after training our model on the latest available version of DrugBank (September 10, 2013) comprised of 5041 interactions between 1502 approved drugs and 1138 targets. The highest confidence pairs obtained for twenty distinct drugs are presented in Table 2. The pairs therein were observed to lie frequently among the top-ranking 10 pairs (in the space of N × M = 1.7 × 106 potential interactions), deduced from 104 independent runs initiated with different random numbers.

Table 2

De novo Predictions, Rank-Ordered Based on Confidence

drug	target	support from previous experiments (ref)
Ergotamine	serotonin receptor 1A (5-HT_1Aa)	direct[32]
Amoxapine	serotonin receptor 2A (5-HT_2A)	direct[33]
Minaprine	histamine receptor H₁
Trimipramine	α_2A adrenergic receptor	indirectb
Amitriptyline	serotonin receptor 2C (5-HT_2c)	indirectb
Tramadol	serotonin receptor 2C (5-HT_2c)	indirectb
Clozapine	D(1B) dopamine receptor	indirectb
Doxepin	D(1A) dopamine receptor	indirectb
Nicardipine	histamine receptor H₁
Flunitrazepam	GABAaα₁	indirectb
Paliperidone	serotonin receptor 7 (5-HT₇)	direct[42]
Iloperidone	α_2B adrenergic receptor	indirectb
Propericiazine	α _1A adrenergic receptor (ADRA1Aa)	indirectb
Asenapine	α_1B adrenergic receptor	indirectb
Verapamil	calmodulin	direct[34,47]
Meperidine	Naa channel 10, type α	direct[43]
Cinnarizine	calmodulin	direct[44]
Paroxetine (paxil)	histamine receptor H₁	direct[45,46]
Orphenadrine	DATa	indirectc
Citalopram	Naa-dependent NETa	indirect[47]

The cases listed as “indirect” refer to interactions of the drugs with different subtypes of the identified target.

Orphenadrine inhibits norepinephrine reuptake thus potentiating the effect of norepinephrine There are several drugs acting as norepinephrine-dopamine reuptake inhibitors, and ophenadrine might exhibit the same behavior.

Abbreviations: GABA γ-aminobutyric-acid receptor; 5-HT 5-hydroxytryptamine (or serotonin) receptor; NET norepinephrine (or noradrenaline) transporter; ADRA1A adrenoreceptor α1 A; DAT dopamine transporter. The cases listed as “indirect” refer to interactions of the drugs with different subtypes of the identified target. Orphenadrine inhibits norepinephrine reuptake thus potentiating the effect of norepinephrine There are several drugs acting as norepinephrine-dopamine reuptake inhibitors, and ophenadrine might exhibit the same behavior. We note that the list of de novo predictions in Table 2 is dominated by drugs used for neurological or psychiatric disorders, consistent with the known pharmacological promiscuity of this of drugs. The last column in Table 2 lists the experimental support from the literature, if any, for the possible interaction of the drug–target pair in each row. Among pairs supported by previous experiments, we note ergotamine–serotonin receptor 1A (5HT1A),[32] amoxapine-5-HT2A,[33] verapamil–calmodulin,[34] paliperidone-5-HTc,[42] meperidine–sodium channels,[43] and cinnarizine–calmodulin.[44] We also note that chronic treatment with paroxetine has been recently reported to increase the mRNA levels of histamine receptor H1, indicating an association between paroxetine and Histamine receptor H1.[45] Although paroxetine is a potent serotonin reuptake inhibitor, its weight gain side effect has been attributed to its medication action on histamine receptors.[46] Many others (indicated as “indirect”) are interactions known to occur either with subtypes of the listed targets or proteins implicated in the same phenotype (e.g., citalopram induces norepinephrine receptor hypoactivity,[47]which may relate to norepinephrine transport by NET). Yet, the validity of these predictions need to be established by experiments. Here, for exploratory purposes, we examined the potential binding pose and energetics of verapamil–calmodulin. Verapamil is known as a Ca2+ channel entry blocker. Its interaction with calmodulin is supported by the fluorescence experiments[32] and by the inhibition of calmodulin-stimulated (Ca2+ + Mg2+) ATPase activity.[48] We used as template the Protein DataBank (PDB)[49] structure of the cocrystal between calmodulin and trifluoperazine (PDB identifier 1CTR)[50] and examined the binding affinity of verapamil relative to that of trifluoperazine, using the module SMINA,[51] based on AutoDock VINA.[52] The software yielded an attractive energy of −4.8 kcal/mol for trifluoperazine binding to calmodulin, which after energy minimization, becomes −6.8 kcal/mol. Docking of verapamil to the same location and energy minimization led to −5.8 kcal/mol (Figure 7). While docking software with fixed target often fails to provide a quantitative assessment of binding affinity, and more elaborate simulation methods have been developed for druggability assessment (see, e.g., our recent work[53]), these results support, qualitatively, the shape complementarity between calmodulin and verapamil, as well as their favorable electrostatic interaction. Establishment of this and other predicted interactions, however, awaits experimental verifications by essays designed to test the specific drug–target interactions.

Figure 7

Structural model for possible binding pose of verapamil onto calmodulin. The molecular surfaces of both the drug and calmodulin are colored by electrostatic potential (a and b) and a close-up view is shown in panel c. Calmodulin surface, especially at the ligand-binding cleft, is negatively (partial) charged, favorably interacting with the verapamil surface that has partial positive charge. Note the shape between the verapamil and calmodulin. All the visualizations have been performed with OpenEye VIDA software.

Concluding Remarks

Over the last couple of years, there have been a number of computational studies performed to identify targets of existing drugs and drug candidates other than those originally known/proposed to be targeted. A pioneering study is that of Roth, Shoichet, and co-workers[11,13] based on compound chemical similarities. Dudley et al. focused on inverse correlations between gene expression profiles in the presence of a drug and in a disease state.[25] Yamanishi and his colleagues represented drugs and targets in an integrated “pharmacological space”.[17,20] Gonen used a KBMF method where chemical and genomic similarities were integrated.[22] We proposed a PMF-based AL methodology that can be advantageously used for large data sets. The applicability of the method to large data sets is worth further attention, given that we will increasingly have access to bigger data (e.g., STITCH Database[30]), which will be exploited for repurposable drug identification. The software developed here, made accessible in http://www.csb.pitt.edu/Faculty/bahar/files/, is readily scalable. For very large data sets, which typically have more known interactions, the PMF is able to construct a better model using the plethora of available data; whereas when the number of known interactions is limited, the use of chemical and genomic kernels allows KBMF to outperform PMF. The application of KBMF to large data sets may, however, become challenging, For example, STITCH contains on the order of 106 proteins and 105 compounds, implying that 1012 sequence and 1010 chemical similarity comparisons are needed to make predictions. However, the PMF method is independent of chemical, structural or other similarity metrics, and its computation time scales linearly with the number of known interactions; and it proves to perform well on large data sets. The data sets reporting drug–target interactions are constantly improving in quality and quantity and, therefore, expected to give even better results when analyzed by an efficient tool. Finally, the extension of the method to analyzing big data (with millions of nodes) is foreseeable in the near future. The recently introduced GraphChi tool[54] can be used for optimized and parallelized model learning for further performance improvements. The fact that the PMF is independent of 2D/3D shape comparison methods commonly employed in drug–target pair inferences implies that the derived LVs capture similarities based on the interaction patterns of drugs at the cellular level, even if their molecular structures are dissimilar (see Figures 2 and 3). As such, the method may be advantageously used for lead hopping, thus complementing those (e.g., SVM classification algorithms) used in conjunction with 2D or 3D pharmacophoric fingerprints (see the work of Saeh et al.[55]). Inasmuch as the currently proposed method does not require structural data for proteins but knowledge of drug–target interactions, it can be advantageously applied to membrane proteins (major drug targets) for which structural data still remain sparse. It can also be used to make predictions across major drug or target classification boundaries. One implication is that the de novo predictions are not restricted to major drug or target classification boundaries. A major utility of the developed tool is the ability to deliver testable hypotheses with regard to repurposable drugs, thus significantly reducing the search space for identifying potent applications of existing drugs (that proved to meet ADMET requirements). The number of experiments that can be efficiently conducted is usually limited, e.g. of the order of 102 if not 101 for high-confidence assays as opposed to the complete space of ∼1.5 million combinations for the data set used in this study. The fact that the top-ranking predictions exhibit a hit ratio of 59% (for the top 1000 predictions; or 88% for top 100 predictions) suggest that de novo predictions made by the presently introduced method of approach applied to increasingly large data sets are likely to provide useful guidance for experimentally testing, streamlining or prioritizing existing or investigational drugs or new compounds. Another important byproduct is the probabilistic assessments on potential side effects, a topic that will become increasingly important with advances in personalized medicine.

50 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization.

Authors: Mehmet Gönen
Journal: Bioinformatics Date: 2012-06-23 Impact factor: 6.937

Review 3. Can we rationally design promiscuous drugs?

Authors: Andrew L Hopkins; Jonathan S Mason; John P Overington
Journal: Curr Opin Struct Biol Date: 2006-01-25 Impact factor: 6.809

4. The chemical genomic portrait of yeast: uncovering a phenotype for all genes.

Authors: Maureen E Hillenmeyer; Eula Fung; Jan Wildenhain; Sarah E Pierce; Shawn Hoon; William Lee; Michael Proctor; Robert P St Onge; Mike Tyers; Daphne Koller; Russ B Altman; Ronald W Davis; Corey Nislow; Guri Giaever
Journal: Science Date: 2008-04-18 Impact factor: 47.728

5. Drug-target network.

Authors: Muhammed A Yildirim; Kwang-Il Goh; Michael E Cusick; Albert-László Barabási; Marc Vidal
Journal: Nat Biotechnol Date: 2007-10 Impact factor: 54.908

6. Combining drug and gene similarity measures for drug-target elucidation.

Authors: Liat Perlman; Assaf Gottlieb; Nir Atias; Eytan Ruppin; Roded Sharan
Journal: J Comput Biol Date: 2011-02 Impact factor: 1.479

Review 7. Glioblastoma chemotherapy adjunct via potent serotonin receptor-7 inhibition using currently marketed high-affinity antipsychotic medicines.

Authors: R E Kast
Journal: Br J Pharmacol Date: 2010-10 Impact factor: 8.739

8. Drug binding by calmodulin: crystal structure of a calmodulin-trifluoperazine complex.

Authors: W J Cook; L J Walter; M R Walter
Journal: Biochemistry Date: 1994-12-27 Impact factor: 3.162

9. Cytoscape 2.8: new features for data integration and network visualization.

Authors: Michael E Smoot; Keiichiro Ono; Johannes Ruscheinski; Peng-Liang Wang; Trey Ideker
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

10. Identifying compound-target associations by combining bioactivity profile similarity search and public databases mining.

Authors: Tiejun Cheng; Qingliang Li; Yanli Wang; Stephen H Bryant
Journal: J Chem Inf Model Date: 2011-08-18 Impact factor: 4.956

45 in total

Review 1. In silico methods for drug repurposing and pharmacology.

Authors: Rachel A Hodos; Brian A Kidd; Khader Shameer; Ben P Readhead; Joel T Dudley
Journal: Wiley Interdiscip Rev Syst Biol Med Date: 2016-04-15

Review 2. Large-Scale Prediction of Drug-Target Interaction: a Data-Centric Review.

Authors: Tiejun Cheng; Ming Hao; Takako Takeda; Stephen H Bryant; Yanli Wang
Journal: AAPS J Date: 2017-06-02 Impact factor: 4.009

3. Harnessing Human Microphysiology Systems as Key Experimental Models for Quantitative Systems Pharmacology.

Authors: D Lansing Taylor; Albert Gough; Mark E Schurdak; Lawrence Vernetti; Chakra S Chennubhotla; Daniel Lefever; Fen Pei; James R Faeder; Timothy R Lezon; Andrew M Stern; Ivet Bahar
Journal: Handb Exp Pharmacol Date: 2019

4. Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction.

Authors: Yong Liu; Min Wu; Chunyan Miao; Peilin Zhao; Xiao-Li Li
Journal: PLoS Comput Biol Date: 2016-02-12 Impact factor: 4.475

5. A Perspective on Implementing a Quantitative Systems Pharmacology Platform for Drug Discovery and the Advancement of Personalized Medicine.

Authors: Andrew M Stern; Mark E Schurdak; Ivet Bahar; Jeremy M Berg; D Lansing Taylor
Journal: J Biomol Screen Date: 2016-03-08

Review 6. Implications of the small number of distinct ligand binding pockets in proteins for drug discovery, evolution and biochemical function.

Authors: Jeffrey Skolnick; Mu Gao; Ambrish Roy; Bharath Srinivasan; Hongyi Zhou
Journal: Bioorg Med Chem Lett Date: 2015-02-03 Impact factor: 2.823

7. FINDSITE^comb2.0: A New Approach for Virtual Ligand Screening of Proteins and Virtual Target Screening of Biomolecules.

Authors: Hongyi Zhou; Hongnan Cao; Jeffrey Skolnick
Journal: J Chem Inf Model Date: 2018-10-16 Impact factor: 4.956

8. COMPUTING THERAPY FOR PRECISION MEDICINE: COLLABORATIVE FILTERING INTEGRATES AND PREDICTS MULTI-ENTITY INTERACTIONS.

Authors: Sam Regenbogen; Angela D Wilkins; Olivier Lichtarge
Journal: Pac Symp Biocomput Date: 2016

9. Realizing private and practical pharmacological collaboration.

Authors: Brian Hie; Hyunghoon Cho; Bonnie Berger
Journal: Science Date: 2018-10-19 Impact factor: 47.728

Review 10. Systems biology approaches for identifying adverse drug reactions and elucidating their underlying biological mechanisms.

Authors: Mary Regina Boland; Alexandra Jacunski; Tal Lorberbaum; Joseph D Romano; Robert Moskovitch; Nicholas P Tatonetti
Journal: Wiley Interdiscip Rev Syst Biol Med Date: 2015-11-12