
DDR: efficient computational method to predict drug-target interactions using graph mining and machine learning approaches.

Rawan S Olayan¹, Haitham Ashoor², Vladimir B Bajic¹.

Abstract

Motivation: Computational prediction of drug-target interactions (DTIs) is a convenient strategy to identify new DTIs at low cost with reasonable accuracy. However, current DTI prediction methods suffer from a high false-positive prediction rate.
Results: We developed DDR, a novel method that improves DTI prediction accuracy. DDR is based on a heterogeneous graph that contains known DTIs together with multiple similarities between drugs and multiple similarities between target proteins. DDR applies a non-linear similarity fusion method to combine different similarities. Before fusion, DDR performs a pre-processing step in which a subset of similarities is selected by a heuristic process to obtain an optimized combination of similarities. Then, DDR applies a random forest model using different graph-based features extracted from the DTI heterogeneous graph. Using 5 repeats of 10-fold cross-validation, three testing setups, and the weighted average of area under the precision-recall curve (AUPR) scores, we show that DDR significantly reduces the AUPR score error relative to the next best state-of-the-art method for predicting DTIs by 31% when the drugs are new, by 23% when targets are new and by 34% when the drugs and the targets are known but not all DTIs between them are known. Using independent sources of evidence, we verify as correct 22 out of the top 25 DDR novel predictions. This suggests that DDR can be used as an efficient method to identify correct DTIs.
Availability and implementation: The data and code are provided at https://bitbucket.org/RSO24/ddr/.
Contact: vladimir.bajic@kaust.edu.sa.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 29186331 PMCID: PMC5998943 DOI: 10.1093/bioinformatics/btx731

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Drug discovery is the process through which potential beneficial treatment effects or medical uses of a new drug candidate are identified. Distinct phases of drug discovery and development define the initial stage of target identification and validation, compound leads identification, validation and optimization, and different types of preclinical and clinical trials until the final approval by the Food and Drug Administration (FDA) (Paul ) is reached. Drugs function through interaction with various molecular targets. We call such interaction drug–target interactions (DTIs). Proteins are one useful group of such targets. Through binding, drugs can either enhance or inhibit functions carried out by proteins (Overington ; Santos ) and thus affect the disease conditions. Bringing a new drug to the market is a highly challenging and complex process in terms of time and cost. Moreover, the number of newly approved drugs by the FDA is decreasing, illustrating the productivity decline in drug discovery and development (Swinney and Anthony, 2011). However, studies showed that most of the FDA-approved drug molecules exhibit poly-pharmacological properties, i.e. drugs can have interaction with multiple targets, which are not their primary therapeutic targets (i.e. drugs have off-target molecules) (Cichonska ), and this is frequently the major cause of undesirable side-effects. One interesting and useful objective is to link the newly identified DTIs of a known drug to the treatment of diseases that are different from diseases for which the drug has been originally developed (Cichonska ; Li ; Shim and Liu, 2014). The availability of public biomedical databases along with the development of computational approaches has made it possible to provide useful frameworks to partially overcome limitations of the traditional experimental approaches (Vilar and Hripcsak, 2016) and help in finding a new association for the existing drugs with off-target effects. 
Computationally identifying highly likely DTIs for a known drug can then be used to identify potential new uses of that drug, which makes it a useful strategy in drug repurposing (Chen ; Daminelli ; Wu ). Part of such a solution is the identification of novel DTIs, which play an important role in the discovery of additional applications for known drugs, as well as in the understanding of drugs’ modes of action (Overington ; Santos ; Schenone ). This necessitates the development of accurate computational approaches that narrow the search to a smaller number of highly likely targets of a drug for follow-up experimental validation. However, predicting a correct DTI is not sufficient by itself to infer what effect such an interaction may have. Additional steps may be needed, for example, to show inhibition of target expression. One approach for computationally inferring such effects may be the use of predictive models of activity in appropriate biological assays (Soufan ; Soufan ), such as those in the PubChem resource. As summarized in recent reviews (Chen ; Santos ), a wide range of databases, web tools and computational methods have emerged with the potential to predict DTIs by learning from interaction data supplemented with information on the similarities among drugs and similarities among proteins (Daminelli ; Wu ). However, confirming whether a drug could interact with a target protein requires additional effort. This is owing to the relatively limited information about interactions between drugs and target proteins (Dobson, 2004; Kanehisa ; Menni ), as well as the poor characterization of proteins as drug targets (Santos ). Early attempts at computational prediction of DTIs can be categorized into two main groups: docking simulations and ligand-based approaches (Cheng ; Keiser ). Docking methods consider the 3D structure of target proteins. 
However, this approach is extremely time-consuming, and structural information is not available for all target proteins. Ligand-based methods compare a query ligand with a set of known ligands with target proteins. However, they may not perform well in cases where the number of known ligands with target proteins is small. Public data sources have promoted the development of various strategies for repurposing drugs, drawing on the genome, phenome, drug chemical structures, biological interactome, biomedical literature text and biological bioassays (Li ). Moreover, the accessibility of big data sources, through several databases and biomedical literature of DTI information, provides a useful way to extract different biological interaction profiles and signatures (or descriptors) of drugs and target proteins to discover novel DTIs (Ba-Alawi ; Cheng ,b; Ding ; Mitchell, 2001; Perlman ; Vilar and Hripcsak, 2016; Wu ; Yamanishi ). On the basis of the guilt-by-association principle, in which chemically similar drugs tend to interact with similar proteins, many methods have been proposed for DTI prediction based on similarity measures between drugs or between proteins. Such prediction methods are based on graph inference (Alaimo, 2013; Ba-Alawi ; Bleakley and Yamanishi, 2009; Chen et al., 2012a,b; Seal ; Wang ), machine-learning algorithms (Hao ; Lim ; Liu ; Mei ; Perlman ; Soufan ; van Laarhoven ; Yuan ), text mining (Zhu ) and semantic linked data (Chen ; Fu ; Tari and Patel, 2014; Zhu ). Recently, several methods have been developed to integrate heterogeneous information related to the drug, the target protein and their interaction data, to provide effective and efficient ways to predict new DTIs (Hao ; Mei ). These methods utilize various types of profiles for drugs and proteins constructed with different biological data. 
Such DTI prediction methods were developed based on the idea of utilizing heterogeneous networks of known DTIs, similarity between drugs and similarity between target proteins (Hao ; Nascimento ; Perlman ; Zong ). These methods demonstrate that utilizing different measures of similarity between drugs and target proteins results in improved performance compared to methods that use only a single similarity for drugs and a single similarity for target proteins. Moreover, a prediction method (Hao ) that is based on non-linear integration of similarity measures shows better performance than methods based on linearly combined similarity measures (Mei ; Nascimento ; van Laarhoven ). However, these studies indicated that the DTI prediction performance of different methods varies significantly with, and depends heavily on, the similarity measures used. This calls for the development of computational methods that optimize the combination of multiple similarity measures with the aim of improving DTI prediction accuracy (for more detailed information see Supplementary Material, Related work). In this study, aiming to further improve the accuracy of DTI prediction, we developed DDR, an efficient DTI prediction method that uses a novel heuristic to determine an optimized combination of similarity measures between drugs and between target proteins for use in the prediction model. To predict DTIs, DDR integrates information from different types of drug–drug and target–target similarity measures and then applies a random forest (RF) model using graph-based features. On different representative datasets, under various test setups and using different performance measures, we show that DDR significantly outperforms the other state-of-the-art methods by dramatically reducing the error. Using independent sources of evidence, we verified as correct 22 out of the top 25 DDR novel predictions. 
This suggests that DDR can be used as an efficient method to identify correct DTIs.

2 Materials

2.1 Datasets

2.1.1 DTI data

Five datasets were used to evaluate the performance of the proposed DDR method in DTI prediction. Each dataset contains three types of information: (i) the known DTIs for humans, (ii) multiple drug similarity measures and (iii) multiple target proteins similarity measures. A frequently considered gold standard dataset (we name it Yamanishi_08) was originally compiled by Yamanishi and was used as a reference in many studies (Ba-Alawi ; Lim ; Daminelli ; Mei ). This dataset contains known DTIs as retrieved from KEGG BRITE (Kanehisa ), BRENDA (Schomburg ), SuperTarget (Gunther ) and DrugBank databases (Wishart ). In Yamanishi_08, the information on DTI is classified according to the target proteins of drugs into the following four groups: (i) enzymes (E), (ii) ion channels (IC), (iii) G-protein-coupled receptors (GPCR) and (iv) nuclear receptors (NR). Thus, Yamanishi_08 dataset is composed of the four datasets corresponding to the classes of target proteins. The fifth dataset is DrugBank_FDA, which is extracted from 5.0.3 version of DrugBank database (Wishart ). We only extracted DTI information of drugs approved by the FDA and single human target proteins; these proteins are not part of protein complexes. Table 1 summarizes the statistics of these datasets. Note that, the ratios of known (positive) versus non-existing (not known, negative) DTIs in all datasets are variable. This reflects practical situations where the number of true DTIs is considered to be much smaller than that of non-interacting drug–targets.

Table 1.

Summary of the five datasets (Yamanishi_08 and DrugBank_FDA) used in this study

Datasets	Target classes	Number of drugs	Number of target proteins	Number of known DTIs
Yamanishi_08	NR	54	26	90
	GPCR	223	95	635
	IC	210	204	1476
	E	445	664	2926
DrugBank_FDA	Multi-class	1482	1408	9881

2.1.2 Similarity measures for drugs and for target proteins

We computed multiple similarity measures for drugs and for target proteins, respectively, where all similarity values were normalized to the range [0, 1]. For the first four benchmark datasets from Yamanishi_08, the similarities between drug pairs and between target protein pairs were calculated based on information from different sources and from Nascimento . For drugs, distinct chemical structure fingerprints, side-effects profiles and the Gaussian interaction profile (GIP) were considered as drug information sources for calculation of the drug similarities. On the other hand, the similarities of target proteins were calculated based on various amino acid sequence profiles of proteins, as well as different parameterizations of the Mismatch (MIS) and the Spectrum (SPEC) kernels, target proteins functional annotation based on Gene Ontology (GO) terms, proximity within the protein–protein interaction (PPI) network and the GIP for target proteins. For the fifth benchmark dataset, DrugBank_FDA, we computed different similarity measures between drugs based on: different types of molecular fingerprints, drug interaction profile, drug side-effects profile, drug profile of the anatomical therapeutic class (ATC) coding system, drug-induced gene expression profile, drug disease profiles, drug pathways profiles and GIP. Furthermore, different target protein similarity measures were calculated based on protein amino acid sequence, their GO annotations, proximity in the PPI network, GIP, protein domain profiles and gene expression similarity profiles of protein encoding genes. Chemical structures of drugs were extracted from DrugBank (Wishart ), while the target protein sequences were extracted from UniProt (Boutet ). Supplementary Table S1 shows the summary of multiple similarity measures calculated for drugs and target proteins in the DrugBank_FDA dataset, as well as describing their importance and tools used to calculate them. 
In summary, all similarity measures between drugs and between target proteins for the first four datasets were either recomputed or collected from Nascimento . For the DrugBank_FDA dataset, all similarity measures between drugs and between target proteins were calculated in this study, since no precomputed similarity data were available for this dataset.

3 Methods

3.1 Problem description

We define a set of DTIs, which consists of a set of drugs $D$ and a set of target proteins $T$, where $D = \{d_i, i = 1, \ldots, m\}$ and $T = \{t_j, j = 1, \ldots, n\}$, in which $m$ represents the number of drugs and $n$ represents the number of target proteins. The interactions between $D$ and $T$ are represented as a binary matrix $Y$ such that $y_{ij} = 1$ if $d_i$ interacts with $t_j$, and $y_{ij} = 0$ otherwise. We also define the set of similarity matrices between drugs in $D$ as Ds, where the similarity matrices have dimensions of $m \times m$; we define the set of similarity matrices between target proteins in $T$ as Ts, where the similarity matrices have dimensions of $n \times n$. Element values in the different similarity matrices represent how similar drugs or target proteins are to each other based on different measures. All elements in each matrix have values in the range of [0, 1]. A similarity value close to 0 indicates that two elements are not similar to each other, while a similarity value close to 1 indicates that they are highly similar. Given the matrix $Y$, and the matrices in Ds and Ts, our goal is to predict novel (i.e. unknown) interactions in $Y$.

3.2 Description of the DDR method

The heterogeneous DTI graph is a weighted graph constructed with m nodes from the drug set and n nodes from the set of target proteins. An edge between two drug nodes or two target protein nodes represents the similarity between them and is weighted by the similarity value obtained from the similarity fusion step. An edge between a drug and a target protein represents a known DTI and is weighted by 1. The path structure of a path that starts at a D node and ends at a T node describes the sequence of drug and target protein nodes that the path links. For example, a path Drug1–Drug2–Target1 connects the Drug1 node with the Target1 node through the similarity edge between Drug1 and Drug2 and via the interaction edge between Drug2 and Target1. The path structure of this path is D–D–T. All paths with more than one edge and without loops, starting at a D node and ending at a T node, and having the same path structure define a path-category on the heterogeneous DTI graph. The DDR workflow (Fig. 1) consists of several steps: (i) inferring interaction profiles for new drugs and for new target proteins, (ii) similarity measure selection, (iii) similarity fusion, (iv) path-category-based feature extraction and (v) building the DTI prediction model using RF.

Fig. 1.

Flowchart of the DDR method. DDR consists of several steps including: (i) Similarity selection, where a subset of similarity measures is selected in a heuristic process. (ii) Similarity fusion, with the goal to combine the selected similarity measures into one final composite similarity that combines information from similarities determined in (i). (iii) Path-category-based feature extraction, where the feature vector corresponds to drug and target protein pairs, i.e. for the $(d_i, t_j)$ pair, the features are determined as the vector composed of the 12 $(i, j)$ elements obtained by two graph-based scores, namely $n1(h, i, j)$ and $n2(h, i, j)$, for each specific path-category $C_h$, $h = 1, 2, \ldots, 6$. (iv) Building the DTI prediction model using RF, where both positive and negative data are provided; positive data contain known links between drugs and target proteins and represent positive labels, while negative data contain unknown DTI links that are treated as negative labels.

3.2.1 Inferring interaction profile

Inferring the DTIs profiles for new drugs and target proteins is used only with the GIP similarity calculation. A drug is called new if it does not have any known target protein to interact with, while a target protein is called new if it is not targeted by any known drug. Since the GIP similarity is constructed based on training DTI data only, the GIP similarity cannot be computed for drugs or target proteins that do not have known DTIs in the training data. So, we enhance the GIP similarity calculation by inferring interaction profiles for new drugs and for new target proteins, in cases where DTIs for drugs or for target proteins are missing from the training data. This inference is made based on the interaction profiles of such drugs or target proteins. Drugs (or target proteins) with high similarities to a new drug (or a new target protein) are said to be the neighbors of the drug (the target protein). This interaction profile inferring technique is based on Mei . For example, the inferred value of interaction for a new drug with a specific target protein is represented as the ratio of the sum of similarity values for drug neighbors interacting with this target protein relative to the total sum of all neighbors’ similarity values. For DDR, we subjectively set the number of neighbors to 5.
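The neighbor-based inference described above can be illustrated with a minimal sketch. It assumes a binary training interaction matrix `Y` (drugs in rows), a drug–drug similarity matrix `S`, and a hypothetical helper name `infer_interaction_profile`; none of these identifiers come from the DDR code base, and the handling of a zero similarity sum is our own choice.

```python
import numpy as np

def infer_interaction_profile(Y, S, new_idx, k=5):
    """Infer an interaction profile for one new drug from its k most similar drugs.

    Y: (m, n) binary drug-target matrix built from training interactions only.
    S: (m, m) drug-drug similarity matrix with values in [0, 1].
    new_idx: row index of the new drug (its row in Y is all zeros).
    """
    sims = S[new_idx].astype(float).copy()
    sims[new_idx] = -np.inf                   # a drug is not its own neighbor
    neighbors = np.argsort(sims)[::-1][:k]    # indices of the k most similar drugs
    w = S[new_idx, neighbors].astype(float)   # similarity of each neighbor
    total = w.sum()
    if total == 0:                            # no informative neighbor (our convention)
        return np.zeros(Y.shape[1])
    # For each target: similarity mass of neighbors that interact with it,
    # divided by the total similarity mass of all neighbors.
    return w @ Y[neighbors] / total
```

The same function can be applied to new target proteins by passing `Y.T` together with a target–target similarity matrix.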

3.2.2 Similarity selection: selection of an optimized set of similarities

Combining all similarity types may introduce noise in the data as some similarities have more information than others. In order to select a more robust similarity set, we implement a similarity selection procedure (Supplementary Figure S1) that selects an informative and less-redundant set of similarities for drugs and for target proteins, separately. This is done through a heuristic process, where a subset of similarity measures is selected to form an optimized (possibly the best) combination of similarities for our problem. To select the set of informative similarities, our procedure goes as follows: (i) Calculate the average entropy for each similarity matrix to determine how much information each similarity carries. For a similarity matrix $M$ (target–target similarity or drug–drug similarity) of size $k \times k$, where $k$ represents the number of drugs (or target proteins), with elements $m_{ij}$, we calculate the entropy $E_i$ for each row $i$ as:
$$E_i = -\sum_{j=1}^{k} p_{ij} \log p_{ij}, \qquad p_{ij} = \frac{m_{ij}}{\sum_{l=1}^{k} m_{il}}$$
After that, we average the entropy values of all matrix rows to get the final average entropy value that describes how informative a similarity matrix is. (ii) Rank the matrices according to their average entropy values in ascending order. The lower the average entropy value, the less random information the similarity matrix carries. Then, remove the similarity matrices that carry more random information, i.e. those with an average entropy value greater than c1 log(k), where c1 is a constant that controls the level of entropy allowed in the selected set and log(k) represents the maximum entropy value. (iii) Calculate the pairwise similarity between similarity matrices from different data sources, based on the Euclidean distance, as follows. 
To assess the information overlap between any two similarity matrices, we constructed the pairwise similarity matrix between all similarity measures based on the Euclidean distance: given the matrices of two similarity measures A and B, we reorganize each matrix into a vector ($V_A$ and $V_B$) and then compute the Euclidean distance $d(A, B)$ as
$$d(A, B) = \sqrt{\sum_{i} \left(V_{A,i} - V_{B,i}\right)^2}$$
We then converted the distance values $d(A, B)$ to pairwise similarity values. (iv) After obtaining a set of informative similarity matrices, the redundant similarity matrices are removed as follows: the procedure starts with the similarity matrix having the lowest average entropy value and eliminates all other similarity matrices whose pairwise similarity to it is larger than a threshold c2. After that, the procedure continues with the next similarity matrix in the ranked list until the whole list of similarity matrices is exhausted. At the end, the remaining list of similarity measures is reported as the selected set of informative similarity measures with small redundancy for drugs and target proteins. In this study, we subjectively set c1 to 0.7 and c2 to 0.6. We applied this procedure to select the set of informative, less-redundant similarity measures of drugs and target proteins, separately.
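A minimal sketch of the selection heuristic follows. It rests on two assumptions that are ours rather than taken from the DDR code: each row is normalized to sum to 1 before its Shannon entropy is taken (so that log(k) is the maximum), and a Euclidean distance d is converted to a similarity as 1 / (1 + d). The function names are hypothetical.

```python
import numpy as np

def average_row_entropy(M):
    """Mean Shannon entropy over rows, each row normalized to sum to 1 (assumption)."""
    M = np.asarray(M, dtype=float)
    row_sums = M.sum(axis=1, keepdims=True)
    P = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
    logP = np.log(P, out=np.zeros_like(P), where=P > 0)  # treat 0 * log(0) as 0
    return float(np.mean(-(P * logP).sum(axis=1)))

def matrix_similarity(A, B):
    """Similarity between two matrices from their Euclidean distance.

    The 1 / (1 + d) conversion is an assumed choice for this sketch.
    """
    d = np.linalg.norm(np.ravel(A) - np.ravel(B))
    return 1.0 / (1.0 + d)

def select_similarities(mats, c1=0.7, c2=0.6):
    """Return indices of informative, mutually non-redundant similarity matrices."""
    k = mats[0].shape[0]
    ent = [average_row_entropy(M) for M in mats]
    # Steps (i)-(ii): rank by average entropy and drop matrices above c1 * log(k).
    ranked = [int(i) for i in np.argsort(ent, kind="stable") if ent[i] <= c1 * np.log(k)]
    # Steps (iii)-(iv): walk the ranked list and skip any matrix that is too
    # similar to one that has already been kept.
    selected = []
    for i in ranked:
        if all(matrix_similarity(mats[i], mats[j]) <= c2 for j in selected):
            selected.append(i)
    return selected
```

Walking the ranked list and keeping a matrix only if it is not redundant with an earlier kept matrix is equivalent to repeatedly taking the lowest-entropy survivor and eliminating its near-duplicates.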

3.2.3 Similarity fusion

Given the selected subsets of similarity measures obtained previously for drugs and for target proteins, respectively, the goal of the similarity fusion step is to combine multiple similarity measures into one final composite similarity that captures the necessary information from different similarities. Thus, given a set of multiple similarity measures for drugs and for target proteins, respectively, we computed the final fused similarity measure following the similarity network fusion (SNF) method developed in Wang . We represent each similarity measure by a $k \times k$ similarity matrix $M = (m_{ij})$, where $m_{ij}$ equals the similarity value between $d_i$ and $d_j$ (or between $t_i$ and $t_j$), indicating how similar they are. The SNF combines multiple similarity measures into a single fused similarity by a nonlinear method based on message-passing theory. It iteratively updates every similarity network with information from the other networks, using K-nearest neighbors, making it more similar to the others. The SNF method can capture common as well as complementary information across different measures of similarities. We applied the SNF method to integrate multiple drug–drug similarities and target–target similarities, separately.
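To make the iterative update concrete, the following is a simplified sketch in the spirit of SNF, not the reference implementation. It assumes at least two symmetric, non-negative similarity matrices of equal size; the function names, the exclusion of self-similarity from the neighbor sets, and the default of 20 iterations are choices made for this illustration.

```python
import numpy as np

def _full_kernel(W):
    """Row-normalize off-diagonal entries to sum to 1/2 and set the diagonal to 1/2."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)
    row = W.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0
    P = W / (2.0 * row)
    np.fill_diagonal(P, 0.5)
    return P

def _local_kernel(W, K):
    """Keep the K largest off-diagonal entries per row, then row-normalize."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    idx = np.argsort(W, axis=1)[:, -K:]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    row = S.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0
    return S / row

def snf(mats, K=5, iters=20):
    """Fuse several similarity matrices into one (simplified SNF-style sketch)."""
    m = len(mats)
    if m < 2:
        raise ValueError("fusion needs at least two similarity matrices")
    P = [_full_kernel(W) for W in mats]
    S = [_local_kernel(W, K) for W in mats]
    for _ in range(iters):
        P_next = []
        for v in range(m):
            # Average status of all other networks.
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            # Diffuse that information through the local (KNN) structure of network v.
            Pv = S[v] @ others @ S[v].T
            Pv = (Pv + Pv.T) / 2.0
            P_next.append(_full_kernel(Pv))
        P = P_next
    fused = sum(P) / m
    return (fused + fused.T) / 2.0
```

The fused values are not rescaled to [0, 1] here; any such rescaling before graph construction would be an additional step.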

3.2.4 Path-category-based features

After obtaining the combined similarity for drugs and for target proteins, respectively, we augmented the combined similarities with the known DTIs to construct a heterogeneous DTI graph. Based on this heterogeneous graph, we extracted 12 path-category-based features that we used to build a DTI prediction model. In this study, we work with path-categories of lengths 2 and 3 (but not longer, because of the computational cost). When we restrict paths to start at drug nodes and end at target protein nodes, there are only two path-categories with paths of length 2, having path structures (D–D–T) and (D–T–T), and four path-categories with paths of length 3, having path structures (D–D–D–T), (D–D–T–T), (D–T–D–T) and (D–T–T–T). Thus, we consider these six path-categories through which drug nodes could connect to target protein nodes. We define matrices $S1_h$ and $S2_h$ associated with each path-category $C_h$, $h = 1, 2, \ldots, 6$, that we consider. To do this, we start with a given drug $d_i$ to reach a given target protein $t_j$ through a specific path-category $C_h$. We restrict traversing the graph to retrieve all paths passing only through the K-nearest neighbors of drugs to $d_i$ and only through the K-nearest neighbors of target proteins to $t_j$. In this study, we subjectively set the number of nearest neighbors K to 5. We denote the set of such paths as $R_{hij}$. Next, for each path $p$ from $R_{hij}$ we calculate an edge-weight product value $s_p$ obtained by multiplying the weights $w_e$ of all edges $e$ of $p$:
$$s_p = \prod_{e \in p} w_e$$
Using the $s_p$ values calculated for all paths $p$ from $R_{hij}$, we calculate two scores, $s1(h, i, j)$ and $s2(h, i, j)$. Thus, for each path-category $C_h$, we obtained a matrix $S1_h$ with elements $s1(h, i, j)$ and a matrix $S2_h$ with elements $s2(h, i, j)$. Finally, we normalized matrices $S1_h$ and $S2_h$ to adjust for the overall connectivity of the network. 
The normalized matrices are $N1_h$ with elements $n1(h, i, j)$ and $N2_h$ with elements $n2(h, i, j)$. In total, DDR defines 12 different path-category-based matrices, namely $N1_h$ and $N2_h$, where $h = 1, 2, \ldots, 6$, which contain the feature values. These matrices have the same number of rows (corresponding to drugs) and the same number of columns (corresponding to target proteins).
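As an illustration of path-category scoring, the sketch below handles only the two length-2 categories (D–D–T and D–T–T). It assumes that the two graph-based scores aggregate the edge-weight products by their sum and by their maximum; that choice, the helper names, and the omission of the normalization step are assumptions of this sketch rather than statements about the DDR implementation.

```python
import numpy as np

def knn_mask(S, K):
    """Keep, in each row, only the K largest off-diagonal similarities."""
    S = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(S, 0.0)            # no self-loops
    out = np.zeros_like(S)
    idx = np.argsort(S, axis=1)[:, -K:]
    rows = np.arange(S.shape[0])[:, None]
    out[rows, idx] = S[rows, idx]
    return out

def path_scores(A, B):
    """Sum and max of edge-weight products over all two-edge paths i -> k -> j."""
    prod = A[:, :, None] * B[None, :, :]    # prod[i, k, j] = A[i, k] * B[k, j]
    return prod.sum(axis=1), prod.max(axis=1)

def length2_features(Sd, St, Y, K=5):
    """Unnormalized scores for the D-D-T and D-T-T path-categories."""
    Y = np.asarray(Y, dtype=float)
    Sd_k = knn_mask(Sd, K)        # row i holds the K nearest drugs of drug i
    St_k = knn_mask(St, K).T      # column j holds the K nearest targets of target j
    ddt_sum, ddt_max = path_scores(Sd_k, Y)   # d_i ~ d_k, then d_k interacts with t_j
    dtt_sum, dtt_max = path_scores(Y, St_k)   # d_i interacts with t_l, then t_l ~ t_j
    return {"DDT": (ddt_sum, ddt_max), "DTT": (dtt_sum, dtt_max)}
```

The length-3 categories can be obtained in the same spirit by chaining one more factor, at the cost of a larger intermediate array.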

3.2.5 RF classification model for DTI prediction

To predict DTIs, DDR utilizes a supervised machine learning model based on the RF classifier (Ho, 1995). RF has been shown to be an effective tool in prediction, as it runs efficiently on large datasets and is less prone to over-fitting. We implemented the RF predictive model using scikit-learn (Pedregosa ). The inputs to the RF correspond to drug and target protein pairs, i.e. for the $(d_i, t_j)$ pair, the feature vector is composed of the $(i, j)$ elements of matrices $N1_h$ and $N2_h$. Since $h = 1, 2, \ldots, 6$, these feature vectors contain 12 elements each. In order to learn from highly imbalanced data, in this study we adjusted the RF class weights to be inversely proportional to the class frequencies in the training data. Two important parameters are set when building the RF model: the number of trees in the forest (n_estimators), which was set in the range of [100, 600] trees, and the function used to measure the quality of a split (criterion), for which we used the Gini index and entropy-based functions. To construct the prediction model, both positive and negative data are provided: known DTIs represent positive labels, and unknown DTIs are treated as negative labels.
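A minimal scikit-learn sketch of this classification step is given below. The helper names `build_feature_table` and `fit_rf` are hypothetical, and the use of `class_weight="balanced"` is our reading of "inversely proportional" class weighting, not a setting confirmed by the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_feature_table(N_list, Y):
    """Stack 12 (m, n) feature matrices into an (m * n, 12) table with labels.

    Row i * n + j of the table corresponds to the pair (d_i, t_j).
    """
    X = np.stack([np.asarray(N, dtype=float).ravel() for N in N_list], axis=1)
    y = np.asarray(Y).ravel()
    return X, y

def fit_rf(X, y, n_estimators=100, criterion="gini", seed=0):
    """Fit a random forest with class weights inversely proportional to class frequency."""
    model = RandomForestClassifier(
        n_estimators=n_estimators,   # the text reports values in [100, 600]
        criterion=criterion,         # "gini" or "entropy"
        class_weight="balanced",     # assumed equivalent of the reported weighting
        random_state=seed,
    )
    model.fit(X, y)
    return model
```

In use, `model.predict_proba(X_test)[:, 1]` gives one interaction score per drug–target pair, which can then be ranked or thresholded.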

3.3 Experimental setting and performance evaluation

To facilitate the comparison with other methods, we performed cross-validation (CV) and hold-out type tests. First, we evaluated the performance of the DDR method for DTI prediction using CV experiments obtained under three different settings of prediction tasks as in Pahikkala . The experiments were performed separately for each dataset used in this study (the four gold standard datasets from Yamanishi_08 and the DrugBank_FDA dataset). The three prediction settings correspond to the cases when: (a) the drugs are new, (b) the target proteins are new and (c) the drugs and their target proteins are known but all interactions between them are not necessarily known. Cases (a) and (b) correspond to the situation when there are no DTIs in the training data for such drugs or target proteins, while case (c) corresponds to the situation when there are DTI in the training data for such drugs or target proteins. We name settings (a), (b) and (c) as SD, ST and SP, respectively. For each dataset, a prediction model in each setting is built using a dataset of positive and negative labels split into the training and testing sets. This procedure is followed for each fold in 10-fold CV and the whole process is repeated 5 times, each time with a different random seed used for random selection for the split into training and testing sets. In each fold of the CV process, all interactions y in Y matrix that belong to the testing set in that fold are set to zero, i.e. they were excluded from consideration. The resulting matrix with removed testing DTIs is Ytrain. In each fold, the model learns interactions from Ytrain and then constructs the GIP similarity. Then, we select the best set of similarity measures (according to DDR’s heuristic procedure). After that, we use all selected similarities separately for drugs and separately for target proteins, to generate a fused similarity matrix for drugs and a fused similarity matrix for target proteins. 
Based on Ytrain and the two generated fused similarity matrices, we construct a heterogeneous DTI graph, from which we extract path-category-based features as explained before, and score them using two graph-based scores. Finally, we train the RF model on the training set for that fold until the best area under the precision-recall curve (AUPR) is obtained. Then, using the trained model, we predict and evaluate predictions on the testing set for that fold. Moreover, we performed the hold-out tests derived from 9881 DTIs from the DrugBank_FDA dataset under the same SD, ST and SP settings of prediction tasks. In the hold-out test, we split the data into 80% for training and 20% for testing. For each prediction model, at each fold in the case of CV, we considered the following evaluation metrics. Based on the methods' scores, we define true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN). We calculate precision, recall and specificity values as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}$$
We construct the precision-recall curve from the precision and recall values at different cut-offs. Also, we construct the receiver operating characteristic (ROC) curve at various threshold settings, from the recall values and the false positive rate (FPR) values, calculated as 1 − specificity. Then, we calculate the AUPR and the area under the ROC curve (AUC), where the values of AUPR and AUC, separately, over 5 repeats of 10-fold CV are averaged and reported as the measures of the model performance for each dataset. As the positive and negative data in the datasets used in this study are highly imbalanced, the AUPR metric provides a better quality estimate, since it punishes the existence of FPs more heavily than AUC does. Thus, in this study, we mainly used AUPR values to evaluate the performance of the methods, though we also reported the AUC values in Supplementary Material. 
In summary, for the purpose of a fair comparison with the other methods, all methods are subjected to exactly the same testing conditions and the same datasets [(i) the five trials of 10-fold CV on the Yamanishi_08 and DrugBank_FDA datasets and (ii) the same hold-out test based on DrugBank_FDA]. We point out that all methods are evaluated using the same data splits to avoid any type of unwanted bias. Also, we used only training data to develop models.
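The three prediction settings differ only in what is held out in each fold: whole drugs (SD), whole target proteins (ST) or individual drug–target pairs (SP). The sketch below shows one way to generate such folds and the corresponding Ytrain matrices; the function name `make_folds` and the exact splitting scheme are assumptions for illustration and may differ from the splits used in the study.

```python
import numpy as np

def make_folds(Y, setting, n_folds=10, seed=0):
    """Yield (Y_train, test_mask) pairs for the 'SD', 'ST' or 'SP' setting."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    if setting == "SD":        # hold out whole drugs (rows)
        units = rng.permutation(m)
    elif setting == "ST":      # hold out whole target proteins (columns)
        units = rng.permutation(n)
    elif setting == "SP":      # hold out individual drug-target pairs
        units = rng.permutation(m * n)
    else:
        raise ValueError("setting must be 'SD', 'ST' or 'SP'")
    for held_out in np.array_split(units, n_folds):
        test_mask = np.zeros((m, n), dtype=bool)
        if setting == "SD":
            test_mask[held_out, :] = True
        elif setting == "ST":
            test_mask[:, held_out] = True
        else:
            test_mask.flat[held_out] = True
        Y_train = Y.copy()
        Y_train[test_mask] = 0   # hide all test interactions from training
        yield Y_train, test_mask
```

Repeating the procedure with five different values of `seed` gives the 5 repeats of 10-fold CV; similarity measures that depend on interactions, such as GIP, would be recomputed from each `Y_train`.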

4 Results

4.1 Comparisons with the state-of-the-art algorithms

First, we compare our proposed DDR method with the following state-of-the-art methods (for more detailed information see Supplementary Material, Related work): COSINE (Lim ), DNILMF (Hao ), NRLMF (Liu ), KRONRLS-MKL (Nascimento ) and BLM-NII (Mei ), under the same conditions for all methods, i.e. under the three prediction settings (SP, SD and ST) and over five trials of 10-fold CV based on the Yamanishi_08 and DrugBank_FDA datasets. We show that DDR, using 5 repeats of 10-fold CV, achieves higher AUPR values compared with the other methods (Fig. 2). In terms of AUPR, over the five different datasets, DDR, DNILMF, NRLMF, KRONRLS-MKL, COSINE and BLM-NII achieved weighted average AUPR scores under the three different prediction task settings of (SP: 71%, SD: 53%, ST: 52%), (SP: 56%, SD: 26%, ST: 37%), (SP: 50%, SD: 29%, ST: 39%), (SP: 52%, SD: 20%, ST: 17%), (SD: 32%) and (SP: 35%, SD: 14%, ST: 25%), respectively. The weighted average of AUPR is calculated for each of the three settings as
$$\text{AUPR}_{\text{weighted}} = \sum_{i=1}^{5} \frac{NS_i}{TS}\, \text{AUPR}_i$$
where 5 is the number of datasets used in this study, $TS$ is the total number of samples in all datasets, $NS_i$ is the number of samples in the $i$-th dataset and $\text{AUPR}_i$ is the AUPR score obtained on the $i$-th dataset.
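The weighting can be expressed in a few lines. The sketch assumes that each dataset's score is weighted by its share of the total number of samples; the function name and the numbers in the usage comment are placeholders for illustration, not results from the study.

```python
def weighted_average(scores, n_samples):
    """Average of per-dataset scores, weighted by each dataset's share of all samples."""
    if len(scores) != len(n_samples):
        raise ValueError("scores and n_samples must have the same length")
    total = float(sum(n_samples))            # TS: total number of samples
    return sum(ns / total * s for s, ns in zip(scores, n_samples))

# Hypothetical usage with placeholder values (not taken from the paper):
# weighted_average([0.80, 0.75, 0.70], [1000, 5000, 20000])
```

Because the weights are proportional to dataset size, the largest dataset dominates the weighted average.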

Fig. 2.

Comparison results (in terms of AUPR scores) of DDR with the five state-of-the-art methods (DNILMF, NRLMF, KRONRLS-MKL, COSINE and BLM-NII) using 5-repeats of 10-fold CV. Results are obtained under three prediction tasks (SP, SD and ST) over all datasets (NR, GPCR, IC, E and DrugBank_FDA) used in this study. The results for DNILMF, NRLMF, KRONRLS-MKL, COSINE and BLM-NII are obtained using the best parameters reported in the respective publications.

It should be noted that the COSINE method is specifically tailored for the SD setting to find target proteins of new drugs with little to no available interaction data; thus, only its results for the SD setting are shown. Also, we show that DDR, using 5-repeats of 10-fold CV, achieves higher AUC values compared to the other methods under the three prediction tasks and over the five different datasets (Supplementary Table S2). In terms of AUC, over the five different datasets, DDR, DNILMF, NRLMF, KRONRLS-MKL, COSINE and BLM-NII achieved weighted average AUC scores under the three different prediction task settings of (SP: 96%, SD: 90%, ST: 89%), (SP: 95%, SD: 87%, ST: 85%), (SP: 94%, SD: 85%, ST: 84%), (SP: 89%, SD: 77%, ST: 83%), (SD: 86%) and (SP: 91%, SD: 73%, ST: 80%), respectively.

To show more clearly the accuracy improvement by DDR, we define the AUPR score error E as

$$E = 1 - \text{AUPR}$$

while the relative reduction of the AUPR score error of method 1 relative to method 2 is defined as

$$\frac{E_2 - E_1}{E_2} \times 100\%$$

where $E_1$ and $E_2$ are determined for method 1 and method 2, respectively. Based on individual AUPR values reported from 5-repeats of 10-fold CV experiments, we also calculated the relative reduction of the AUPR error obtained by DDR relative to the next best method across all different testing settings for each dataset. When predicting unknown DTI pairs, as in the SP setting, DDR significantly reduces AUPR error relative to the next best method by 39%, 30%, 38%, 27% and 33% for the NR, GPCR, IC, E and DrugBank_FDA datasets, respectively.
For predicting DTIs for new drugs (SD setting), DDR significantly reduces AUPR error relative to the next best method by 36%, 38%, 51%, 58% and 20% for the NR, GPCR, IC, E and DrugBank_FDA datasets, respectively. Finally, for predicting new target proteins (ST setting), DDR significantly reduces AUPR error relative to the next best method by 25%, 11%, 49%, 25% and 21% for the NR, GPCR, IC, E and DrugBank_FDA datasets, respectively. As a result, we demonstrate that DDR, based on 5-repeats of 10-fold CV experiments, achieves significantly more accurate results than the other methods by achieving a higher rank (Supplementary Table S3) on different datasets and in all three settings. We also demonstrated that DDR performs significantly better than the other existing methods when known DTIs are missing in the training data. This provides a practical assessment of the predictive power of DDR for real scenarios of DTI prediction, as in finding target proteins for new drugs (SD setting) with no available information about interactions and predicting drugs for new target proteins (ST setting) (Supplementary Table S3). Moreover, we demonstrated that on weighted average over the five datasets, based on 5-repeats of 10-fold CV experiments, DDR reduces the AUPR score error relative to the next best method by 34% in the SP setting, by 31% in the SD setting and by 23% in the ST setting. This demonstrates that DDR significantly reduces the AUPR score error compared to the other state-of-the-art methods.

In general, based on our prediction results (Fig. 2), we observe that the results with the prediction model built for each specific class of target proteins (i.e. NR, GPCR, IC, E) are better than the results obtained by building a general model for multiple different target protein classes, as in the case of the DrugBank_FDA data.
This is because each class of target proteins (NR, IC, GPCR, E) has common characteristics that distinguish it from the other classes. Thus, it is reasonable to expect that a well-designed and trained DTI predictor will capture some of these characteristics. In this way, the DTI prediction models will also be more specific and tuned to the target protein class for which they were developed and less tuned for the other target protein classes. Our results obtained for predicting DTIs using the DrugBank_FDA data confirmed that even in this case DDR significantly outperformed all other state-of-the-art methods used in the comparison. We also performed tests on hold-out data using the DrugBank_FDA dataset. These tests show that DDR achieves higher AUPR and AUC values compared with the other methods under the three prediction settings (Supplementary Table S4). Based on the AUPR values from the hold-out tests, the reduction of the AUPR error for DDR relative to the next best method on the DrugBank_FDA dataset is 44% in the SP setting, 21% in the SD setting and 29% in the ST setting.
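The relative reduction of the AUPR score error used throughout this section can be checked with a short sketch. It assumes that the error is E = 1 − AUPR and that the relative reduction of method 1 over method 2 is (E2 − E1)/E2; these definitions are an assumption that reproduces the weighted-average reductions quoted for the SP and SD settings. The inputs are the rounded weighted-average AUPR values quoted in the text for DDR and for the next best method in each setting (DNILMF for SP, COSINE for SD, NRLMF for ST).

```python
def aupr_error(aupr):
    """AUPR score error, assumed here to be E = 1 - AUPR."""
    return 1.0 - aupr


def relative_error_reduction(aupr_method1, aupr_method2):
    """Relative reduction of the error of method 1 with respect to method 2."""
    e1 = aupr_error(aupr_method1)
    e2 = aupr_error(aupr_method2)
    if e2 == 0:
        raise ValueError("method 2 has zero error; the reduction is undefined")
    return (e2 - e1) / e2


# Rounded weighted-average AUPR values quoted in the text: (DDR, next best).
quoted = {"SP": (0.71, 0.56), "SD": (0.53, 0.32), "ST": (0.52, 0.39)}
for setting, (ddr, next_best) in quoted.items():
    reduction = 100 * relative_error_reduction(ddr, next_best)
    print(f"{setting}: {reduction:.1f}% lower AUPR error than the next best method")
```

With these rounded inputs the sketch gives roughly 34% (SP), 31% (SD) and 21% (ST). The first two match the reported weighted-average reductions; the ST value differs from the reported 23%, which may reflect rounding of the quoted AUPR values or a different aggregation over datasets.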

4.2 Effect of similarity measures on the DDR performance

Similarity between drugs or target proteins plays the most crucial role when trying to predict DTIs for new drugs or new target proteins. Different similarity measures describe data instances differently. Several studies have highlighted the importance of selecting the proper similarity and integrating several similarity types to capture complementary information from several sources (Hao ; Nascimento ). The evidence is the improved accuracy of DTI predictions over a single adopted similarity (one for target proteins, one for drugs), which is why combining multiple types of similarities is important. We demonstrated that a suitable combination of a few similarity measures results in higher accuracy of DTI predictions than when many or all similarity measures are used. Thus, the improvements DDR provides compared to the current combination strategies are that: (i) it applies a non-linear similarity fusion method to combine different similarities, (ii) it can handle any number of provided similarities and (iii) it provides a systematic framework to select the most relevant non-redundant similarities. In addition, combining multiple similarity measures into one combined similarity reduces the time complexity and data dimensionality needed by the DDR method compared to the case of building a classification model with multiple features, where each feature is based on scoring a path from a drug to a target protein through each single similarity measure between drugs and each single similarity measure between target proteins. Thus, our aim is to combine multiple similarity measures into one final composite similarity that captures the necessary information from different similarities between drugs as well as from different similarities between target proteins. Regarding this, we show that DDR achieves higher AUPR values compared to the other methods (Fig. 2).
We also compared the DDR performance when combining all similarity measures we used in this study with the case of combining only the similarity measures we selected in a heuristic process. We observed that the performance of DDR when combining only selected similarities is better than when combining all similarities (Supplementary Table S5). When we examined the selected similarities over the four datasets in Yamanishi_08, we observed that DDR consistently selects a similar set of similarity measures for drugs and for target proteins (Supplementary Table S6). For drugs, we observe that the selected similarities are related to network interaction profiles and drug side-effects. It has been highlighted before that the side-effect-based similarity improves the prediction of DTIs, where the assumption is that drugs with similar target protein binding profiles tend to cause similar side-effects, implying a direct correlation between target protein binding and side-effect similarity (Campillos ; Vilar and Hripcsak, 2016). It has also been shown that interaction profiling is an effective tool for accurate prediction of DTIs (van Laarhoven ); the assumption is that two drugs that interact in a similar way with the target proteins in a known DTI network will also interact in a similar way with new target proteins. For target proteins, we observe that the selected similarities are constructed based on specific characteristics of the amino acid sequence and on closeness in the PPI network, both of which have been highlighted before in different benchmarking studies of target protein descriptors as resulting in good performance for DTI prediction (Cao, 2015; Deng ; Nascimento ). For the DrugBank_FDA dataset (Supplementary Table S6), DDR selected a set of similarity measures for drugs and for target proteins, separately.
We note that the information included in the different data sources used to calculate the similarity measures between drugs and between target proteins has strongly influenced the prediction performance for drugs interacting with multi-class target proteins (i.e. NR, GPCR, etc.). For the similarity measures of drugs and target proteins that have been selected in the sequential heuristic process, we observe that these similarities are related to network interaction profiles (van Laarhoven ) and other genome-wide global characteristics of drugs and of target proteins, such as drug-disease profiles and drug-pathway profiles between drugs, drug-induced gene expression profiles of drugs, profiles of drug ATC-code associations, profiles of GO terms of target proteins and profiles of pathways of target proteins. Numerous studies have shown that using such types of similarities in DTI prediction is effective in describing each drug and target protein in different datasets (Chen et al., 2012a,b; Dudley ; Dunkel ; Ehsani and Drablos, 2016; Iwata ; Pan ; Rodriguez-Esteban, 2016; Smith ; van Laarhoven ; Vilar and Hripcsak, 2016).
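To make the notion of an interaction-profile similarity more concrete, the sketch below computes a Gaussian interaction profile kernel in the form commonly attributed to van Laarhoven and co-workers: each drug is represented by its binary row of known interactions, and two drugs are similar when those rows are close. This is an illustrative sketch, not the DDR implementation; the function name, the default bandwidth of 1 and the toy profiles are assumptions.

```python
import math


def gip_kernel(profiles, bandwidth=1.0):
    """Gaussian interaction profile similarity between rows of a binary matrix.

    K[i][j] = exp(-gamma * ||y_i - y_j||^2), with gamma set to the bandwidth
    divided by the mean squared norm of the profiles.
    """
    n = len(profiles)
    squared_norms = [sum(v * v for v in profile) for profile in profiles]
    mean_squared_norm = sum(squared_norms) / n
    if mean_squared_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = bandwidth / mean_squared_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            squared_distance = sum(
                (a - b) ** 2 for a, b in zip(profiles[i], profiles[j])
            )
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * squared_distance)
    return kernel


# Toy drug-by-target interaction matrix (hypothetical): rows are drugs.
drug_profiles = [
    [1, 0, 1],  # drug 0
    [1, 0, 1],  # drug 1 shares both targets with drug 0
    [0, 1, 0],  # drug 2 binds a different target
]
for row in gip_kernel(drug_profiles):
    print([round(value, 3) for value in row])
```

Drugs 0 and 1 have identical profiles and therefore similarity 1, while drug 2 shares no target with them and receives a much lower similarity. The same function applied to the transposed matrix would give a target-by-target similarity.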

4.3 Prediction and validation of novel (unknown) DTIs

To evaluate the utility of DDR, we used it to predict novel DTIs (i.e. those that are not known to be true DTIs) in each of the five datasets, separately. For the prediction of novel interactions, we build a predictive model for each dataset used in this study, in which the model is trained using all known interactions (positive labels) in all data folds of CV, and the negative labels are split into train and test sets as in a CV setup. As a result, all unknown DTIs (negative labels) are predicted and the top 5 ranked interactions for each dataset are validated. To verify these novel predictions, we considered several reference databases that contain information obtained from curated/experimental/published results on small molecule–protein interactions. Thus, we searched DrugBank (Wishart ), KEGG (Kanehisa ), ChEMBL (Gaulton ), Matador (Gunther ), CTD (Davis ), T3DB (Wishart ) and the biomedical literature to find supporting evidence. In summary, we evaluated the accuracy of 25 novel DTIs predicted by our method using the four datasets of Yamanishi_08 and the DrugBank_FDA dataset, and confirmed 22 of these novel DTIs as supported by other existing evidence (Table 2).
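The novel-prediction setup described above (every known DTI is kept in training, the unknown pairs are split into folds, and each unknown pair is scored by a model that did not see it as a negative) can be sketched as follows. This is a schematic under stated assumptions, not the DDR pipeline: `fit_and_score` is a hypothetical callable standing in for feature extraction plus the trained classifier, and `shared_neighbour_scorer` is a toy replacement used only to make the example runnable.

```python
import random


def rank_unknown_pairs(positives, unknowns, fit_and_score, n_folds=10, top_n=5, seed=0):
    """Score every unknown pair once, holding it out of the training negatives."""
    rng = random.Random(seed)
    shuffled = list(unknowns)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scored = []
    for held_out in folds:
        if not held_out:
            continue
        held_out_set = set(held_out)
        train_negatives = [pair for pair in shuffled if pair not in held_out_set]
        # All positives are used for training in every fold.
        scores = fit_and_score(positives, train_negatives, held_out)
        scored.extend(zip(scores, held_out))
    scored.sort(key=lambda item: -item[0])  # highest score first
    return [pair for _, pair in scored[:top_n]]


def shared_neighbour_scorer(positives, train_negatives, held_out):
    """Toy scorer (hypothetical): count known DTIs sharing the drug or the target."""
    return [
        sum(1 for d, t in positives if d == drug or t == target)
        for drug, target in held_out
    ]


# Toy (drug, target) index pairs, for illustration only.
known = [(0, 0), (0, 1), (1, 0)]
unknown = [(1, 1), (2, 2), (2, 0), (0, 2)]
print(rank_unknown_pairs(known, unknown, shared_neighbour_scorer, n_folds=2, top_n=3))
```

Holding each unknown pair out of the negatives used for training avoids scoring a pair with a model that was explicitly taught that the pair does not interact.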

Table 2.

Top ranked 25 novel DTIs predicted by DDR

Drug ID	Drug name	Target protein ID	Target protein name	Validation source	Evidence
Dataset: NR
D00348	Isotretinoin	hsa6256	RXRA	CTD	CTD: D015474, CTD: 6256
D00585	Mifepristone	hsa2099	ESR1	C and PMID	C: 1166117, C: 206, C: 1276308, PMID: 20046055
D00962	Clomiphene citrate	hsa5241	PGR	CTD	CTD: D002996, CTD: 5241
D00182	Norethindrone	hsa2099	ESR1	T3DB and PMID	T3DB: T3D4745, PMID: 23611293
D00951	Medroxyprogesterone acetate	hsa2099	ESR1	DB	DB: DB00603
Dataset: GPCR
D00049	Niacin	hsa8843	HCAR3	DB	DB: DB00627
D02910	Amiodarone	hsa154	ADRB2	CTD	CTD: D000638, CTD: 154
D02340	Loxapine	hsa1812	DRD1	DB	DB: DB00408
D00726	Metoclopramide	hsa1129	CHRM2	M	M: PC4168
D00674	Naratriptan hydrochloride	hsa3351	HTR1B	DB	DB: DB00952
Dataset: IC
D02356	Verapamil	hsa6833	ABCC8	PMID	PMID: 21098040
D03365	Nicotine	hsa1137	CHRNA4	DB	DB: DB00184
D00538	Zonisamide	hsa6331	SCN5A	DB	DB: DB00909
D02098	Proparacaine hydrochloride	hsa8645	KCNK5	None	None
D00775	Riluzole	hsa2898	GRIK2	None	None
Dataset: E
D00139	Methoxsalen	hsa1543	CYP1A1	DB and PMID	DB: DB00553, PMID: 15670584
D00437	Nifedipine	hsa1559	CYP2C9	DB	DB: DB01115
D00410	Metyrapone	hsa1583	CYP11A1	CTD	CTD: D008797, CTD: 1583
D00574	Aminoglutethimide	hsa1589	CYP21A2	M	M: PC2145
D00542	Halothane	hsa1571	CYP2E1	M	M: PC3562
Dataset: DrugBank_FDA
DB01589	Quazepam	P47870	GABRB2	K	K: D00457
DB00825	Menthol	P35372	OPRM1	None	None
DB00147	Pyridoxal	P04798	CYP1A1	PMID	PMID: 19637937
DB01544	Flunitrazepam	P14867	GABRA1	CTD and K	CTD: D005445, K: D01230
DB02546	Vorinostat	P56524	HDAC4	CTD and C	CTD: C111237, CTD: 9759, C: 98, C: 3524

Note: Most of the top novel interactions (highest prediction score) are confirmed as supported by other existing evidence (public databases or literature), where the following annotation is used to indicate the source of confirmatory information.

C, ChEMBL; CTD, Comparative Toxicogenomics Database; DB, DrugBank; M, MATADOR; K, KEGG; PMID, PubMed; PC, PubChem Compound.

Furthermore, to demonstrate that the predictions by DDR are not random, we additionally performed label permutation tests to ensure that the top 5 DTI predictions by DDR in each dataset are not predicted by chance. To do so, we randomly shuffled the network labels (known and unknown) 100 times to produce 100 different random networks. Then, we performed the SP DTI prediction setup on each network. For each dataset and for each novel DTI in the top 5 DTIs based on that dataset, we calculated the P-value as the fraction of the 100 random networks in which that novel DTI was ranked in the top 5 DTIs. We demonstrated that all predicted novel DTIs have significant P-values <0.05 (Supplementary Table S7). Thus, in addition to having the DDR novel DTI predictions validated based on other sources, results from the label permutation tests also confirm the reliability of the DDR novel DTI predictions.

5 Discussion

This study introduces a novel DTI prediction method, DDR, which utilizes a heterogeneous drug–target graph that contains information about various similarities between drugs and similarities between proteins as drug targets. On different representative datasets, under various test setups, and using AUPR and AUC as the performance measures, we show that DDR clearly outperforms the other state-of-the-art methods we used in the comparison. For these comparisons we used CV and hold-out tests. DDR achieves notably higher AUPR values compared to other methods, thus significantly reducing the AUPR score error relative to the next best method. Moreover, on different datasets and in all three task settings we demonstrate that DDR produces significantly more accurate results than the other methods by achieving a higher rank, based on AUPR values. We also demonstrated that DDR performs significantly better than the other existing methods when known DTIs are missing in the training data. This provides a practical assessment of the predictive power of DDR for real scenarios of DTI prediction, as in finding target proteins for new drugs (SD setting) with no available information about interactions and predicting drugs for target proteins that are new (ST setting). When we compared DDR performance when combining all similarity measures we used in this study with the case of combining only the similarity measures we selected through our heuristic method, we observed that the performance of DDR with selected similarities is better than when combining all similarities. We observed that the second-best method in predicting DTIs in the SP setting, based on the weighted average of AUPR results over the five different datasets, is the DNILMF method.
This is due to the approach followed by DNILMF in employing a nonlinear combination technique for multiple similarity measures for drugs and for target proteins, as well as smoothing the predictions for new drugs and new target proteins by incorporating neighbor information, based on the assumption that similar drugs (or target proteins) may contribute to the accuracy of the predictions for their neighbors. On the other hand, we observed that the second-best methods, based on the weighted average of AUPR results over the five different datasets in the ST and SD settings, are NRLMF and COSINE, respectively. As the current implementation of DDR handles only binary DTI data with the goal of classifying a given DTI as binding (label = 1) or non-binding (label = 0), in the future we plan to extend the functionality of DDR to handle continuous DTI data (i.e. continuous values of binding affinities of drugs and target proteins, He ).

6 Conclusion

We presented our method (DDR) that is based on the use of a heterogeneous graph containing information about known DTIs, as well as similarities between drugs and similarities between target proteins obtained from different data sources. DDR utilizes graph mining and machine learning techniques. It is capable of utilizing different similarity measures between drugs, as well as between target proteins. DDR applies a non-linear similarity fusion method to combine different similarities for drugs and target proteins. Before applying the similarity fusion, DDR performs a pre-processing step where a subset of similarity types is selected in a heuristic process. This is done to select the best combination of similarities for the tasks in question, since using all similarity types introduces noise. We demonstrated that DDR achieves significantly more accurate results than the other state-of-the-art methods under different prediction task settings and using different datasets and different methods of performance evaluation. Finally, we evaluated the accuracy of 25 novel DTIs predicted by our method and confirmed 22 of these novel DTIs as supported by other existing evidence. Thus, DDR proved its practical utility by validating predictions of novel DTIs over different datasets, suggesting that DDR can be used as an efficient method to identify correct DTIs.

Funding

Research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Base Research Funds to VBB (BAS/1/1606-01-01) and by KAUST Office of Sponsored Research Grant No. URF/1/1976-02. Conflict of interest: none declared.

63 in total

1. BRENDA, the enzyme database: updates and major new developments.

Authors: Ida Schomburg; Antje Chang; Christian Ebeling; Marion Gremse; Christian Heldt; Gregor Huhn; Dietmar Schomburg
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. Gaussian interaction profile kernels for predicting drug-target interaction.

Authors: Twan van Laarhoven; Sander B Nabuurs; Elena Marchiori
Journal: Bioinformatics Date: 2011-09-04 Impact factor: 6.937

3. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions.

Authors: Dong-Sheng Cao; Nan Xiao; Qing-Song Xu; Alex F Chen
Journal: Bioinformatics Date: 2014-09-22 Impact factor: 6.937

4. Relating protein pharmacology by ligand chemistry.

Authors: Michael J Keiser; Bryan L Roth; Blaine N Armbruster; Paul Ernsberger; John J Irwin; Brian K Shoichet
Journal: Nat Biotechnol Date: 2007-02 Impact factor: 54.908

5. Drug-target interaction prediction through domain-tuned network-based inference.

Authors: Salvatore Alaimo; Alfredo Pulvirenti; Rosalba Giugno; Alfredo Ferro
Journal: Bioinformatics Date: 2013-05-29 Impact factor: 6.937

6. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. Assessing drug target association using semantic linked data.

Authors: Bin Chen; Ying Ding; David J Wild
Journal: PLoS Comput Biol Date: 2012-07-05 Impact factor: 4.475

8. DrugBank: a knowledgebase for drugs, drug actions and drug targets.

Authors: David S Wishart; Craig Knox; An Chi Guo; Dean Cheng; Savita Shrivastava; Dan Tzur; Bijaya Gautam; Murtaza Hassanali
Journal: Nucleic Acids Res Date: 2007-11-29 Impact factor: 16.971

9. Pathway analysis for drug repositioning based on public database mining.

Authors: Yongmei Pan; Tiejun Cheng; Yanli Wang; Stephen H Bryant
Journal: J Chem Inf Model Date: 2014-02-05 Impact factor: 4.956

10. The Comparative Toxicogenomics Database: update 2017.

Authors: Allan Peter Davis; Cynthia J Grondin; Robin J Johnson; Daniela Sciaky; Benjamin L King; Roy McMorran; Jolene Wiegers; Thomas C Wiegers; Carolyn J Mattingly
Journal: Nucleic Acids Res Date: 2016-09-19 Impact factor: 16.971
