Literature DB >> 31125330

Leveraging genetic interactions for adverse drug-drug interaction prediction.

Sheng Qian^1,2, Siqi Liang^1,2, Haiyuan Yu^1,2.

Abstract

In light of increased co-prescription of multiple drugs, the ability to discern and predict drug-drug interactions (DDI) has become crucial to guarantee the safety of patients undergoing treatment with multiple drugs. However, information on DDI profiles is incomplete and the experimental determination of DDIs is labor-intensive and time-consuming. Although previous studies have explored various feature spaces for in silico screening of interacting drug pairs, their use of conventional cross-validation prevents them from achieving generalizable performance on drug pairs where neither drug is seen during training. Here we demonstrate for the first time targets of adversely interacting drug pairs are significantly more likely to have synergistic genetic interactions than non-interacting drug pairs. Leveraging genetic interaction features and a novel training scheme, we construct a gradient boosting-based classifier that achieves robust DDI prediction even for drugs whose interaction profiles are completely unseen during training. We demonstrate that in addition to classification power-including the prediction of 432 novel DDIs-our genetic interaction approach offers interpretability by providing plausible mechanistic insights into the mode of action of DDIs.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 31125330 PMCID： PMC6553795 DOI： 10.1371/journal.pcbi.1007068

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Drug-drug interactions (DDIs) refer to the unexpected pharmacologic or clinical responses due to the co-administration of two or more drugs [1]. With the simultaneous use of multiple drugs becoming increasingly prevalent, DDIs have emerged as a severe patient safety concern over recent years [2]. According to The Center for Disease Control and Prevention (CDC), the percentage of Americans taking three or more prescription drugs in the past 30 days increased from 11.8% in 1988–1994 to 21.5% in 2011–2014, and the occurrence of polypharmacy, defined as the concurrent use of five or more drugs, increased from 4.0% to 10.9% within the same time period [3,4]. Polypharmacy is especially common among elderly people, affecting 42.2% of Americans aged 65 years and older, exposing them to a higher risk of adverse DDIs. Indeed, DDIs were estimated to be responsible for 4.8% of hospitalization in the elderly, a 8.4-fold increase compared to the general population [5]. Overall, DDIs contribute to up to 30% of all adverse drug events (ADEs) [6] and account for about 74,000 emergency room visits and 195,000 hospitalizations each year in the United States alone [3]. Therefore, it has become a medical imperative to identify and predict interacting drug pairs that lead to adverse effects. In order to facilitate identification of interacting drug pairs, a number of in vitro and in vivo methods have been developed. For example, drug pharmacokinetic parameters and drug metabolism information collected from in vitro pharmacology experiments and in vivo clinical trials can be used to predict interacting drug pairs [7,8]. However, these methods are labor-intensive and time-consuming, and are thus not scalable to all unannotated drug pairs [9]. In the past decade, machine learning-based in silico approaches have become a new direction for predicting DDIs by leveraging the large amount of biological and phenotypic data of drugs available. The advantage of machine learning-based approaches lies in their ability to perform large-scale DDI prediction in a short time frame. So far, various features have been explored for building DDI prediction models, including similarity-based features and network-based features, among others. Similarity-based features characterize the similarity of the two drugs at question in terms of chemical structure, side effect profile, indication, target sequence, target docking, ATC group, etc. [10-24]. Network-based features exploit the topological properties of the drug-drug interaction network or the protein-protein interaction network, which relates to DDIs through drug-target associations [16,25-27]. While these methods have yielded important information about DDIs, few methods to date have been able to provide insight into the molecular mechanisms of drug-drug interactions. To this end, in this study, we employ the genetic interaction between genes that encode the targets of two drugs as a novel feature for predicting interacting drug pairs that cause adverse drug reactions. We show that targets of adversely interacting drugs tend to have more synergistic genetic interactions than targets of non-interacting drugs. Exploiting this finding, we apply a machine learning framework () and build a gradient boosting-based classifier for adverse DDI prediction by integrating genetic interaction and three widely used features–indication similarity, side effect similarity and target similarity. We show that our model provides accurate DDI prediction even for pairs of drugs whose interaction profiles are completely unseen during training. Furthermore, we find that excluding the genetic interaction features significantly decreases the performance of our model. Through genetic interactions, our method provides insight into the mode of action of drugs that lead to adverse combinatory effects.

Results

Genetic interaction profiles provide complementary information for distinguishing interacting and non-interacting drugs

In order to explore the separating power of various features to distinguish adversely interacting drug pairs from non-interacting drug pairs, we constructed a high-confidence set of adversely interacting drug pairs from all DDIs labeled “the risk or severity of adverse effects can be increased” in DrugBank [28] (). This resulted in a set of 117,045 adversely interacting drug pairs involving 2,261 drugs. 2,195,023 non-interacting drug pairs were generated by taking all other combinations of these drugs before removing any drug pair that has been reported in DrugBank, TWOSIDES [29] or a complete dataset of DDIs compiled from a variety of sources [30]. Furthermore, we required that all features, including indication similarity, side effect similarity, target sequence similarity and genetic interaction, should be available for each drug pair. After this filtering step, 1,113 adversely interacting drug pairs and 11,313 non-interacting drug pairs involving 262 drugs remained. Interacting and non-interacting drug pairs exhibit different distributions in terms of the four groups of properties that we investigated. Indications and side effects of drugs were mapped to four levels of the MedDRA hierarchy [31] (). At every level, adversely interacting drugs are associated with significantly more similar side effects as well as indications than non-interacting drugs (). On another front, target similarity was calculated by aligning the sequences of the protein targets with the Smith-Waterman algorithm [32]. Since a drug may have multiple protein targets, aggregation was performed by taking the minimum, mean, median or maximum alignment score for each drug pair (). As expected, the maximum, mean and median target similarity between targets of adversely interacting drug pairs are significantly higher than those of non-interacting drug pairs (). Interestingly, interacting drug pairs manifest a significantly lower minimum target similarity than non-interacting drug pairs (). This could be due to the fact that interacting drugs possess a higher number of protein targets combined, thereby having a higher chance of targeting vastly different targets (). These results establish indication similarity, side effect similarity and target similarity as informative predictors of adverse DDIs.

Adversely interacting drug pairs and non-interacting drug pairs significantly differ with regard to the 11 features selected.

(a) Schematics of calculating indication similarity and side effect similarity features. (b) Indication similarity score of hierarchy level PT, HLGT and SOC between two drugs. (c) Side effect similarity score of hierarchy level HLT and HLGT between two drugs. (d) Schematics of calculating target sequence similarity and genetic interaction features. Genetic interaction scores indicate the deviation from the expected phenotype when two genes are simultaneously knocked out, and were obtained from a global genetic interaction network in yeast by mapping targets of drugs to their yeast homologs. A negative score denotes synergistic interaction while a positive score indicates buffering interaction. (e) Minimum, mean, median and maximum target sequence similarity score between targets of two drugs. (f) Minimum and maximum genetic interaction score between targets of two drugs. Statistical significance was determined by the two-sided permutation test on the sample mean. PT, preferred term; HLT, high level term; HLGT, high level group term; SOC, system organ class. * p < 0.001; ** p < 0.0001; *** p < 0.00001. Genetic interaction refers to deviation from the expected phenotype when two genes are simultaneously mutated [33]. In short, the genetic interaction score quantifies the extent to which the fitness of a double mutant carrying mutations on two genes deviates from what is expected from the fitness defects of the corresponding single mutants. A negative score indicates synergistic genetic interaction, where the double mutant exhibits a fitness defect that is more extreme than expected from single mutants, while a positive score suggests buffering genetic interaction, where the double mutant exhibits a greater fitness than expected [34]. Since binding of drugs modulates the function of their targets, the genetic interaction between protein targets of two drugs might be associated with their joint effects. On this account, we investigated whether targets of adversely interacting drugs and targets of non-interacting drugs display divergent genetic interaction profiles. For each pair of drugs, we mapped their protein targets to the corresponding yeast homologs and obtained genetic interaction scores between the yeast genes from a global yeast genetic interaction network [35]. When the minimum, mean, median or maximum genetic interaction score was taken for targets of each drug pair, adversely interacting drugs showed significantly lower scores than non-interacting drugs irrespective of the aggregation function applied (). This trend can be recapitulated using two recently published human genetic interaction datasets (). Furthermore, genetic interaction provides complementary information that is not captured by target similarity, indication similarity, or side effect similarity, as seen from their poor correlation (). Therefore, genetic interaction profiles of drug targets provide new information as a predictor of adverse DDIs.

Building a machine learning model for predicting adverse DDIs

To divide drug pairs into a training set and a test set for building a machine learning model, most previous studies randomly split their data with a specified ratio [10,16,17,19,22,23,36,37], without considering the fact that drugs appearing in both sets may carry extra information about their interaction propensity. Considering the scenario of predicting interactions of drugs without prior information about their interaction profiles, this splitting scheme becomes inappropriate. To address this problem, we draw on a method that partitions drug pairs based on drugs [14,20,21]. All drugs in our constructed dataset were randomly split into “training drugs” and “test drugs” with a ratio of 2:1. The training set consists of all drug pairs where both drugs are “training drugs” and the test set comprises all drug pairs where both drugs are “test drugs” (). As a result, 475 interacting drug pairs and 4,802 non-interacting drug pairs involving 175 drugs went into the training set; 131 interacting drug pairs and 1,322 non-interacting drug pairs involving 87 drugs went into the test set.

The train-test splitting scheme and model performance on the test set.

(a) The train-test splitting scheme. Drugs are randomly divided into “training drugs” and “test drugs” with ratio of 2:1. Training set only consists of drug pairs constituted by “training drugs” and test set only consists of drug pairs constituted by “test drugs”. Training drugs are further split into “training drugsi” and “validation drugsi” with the same splitting scheme to obtain training seti and validation seti in the training phase. For each iteration of hold-out validation, the classifier is fit with training seti and evaluated with validation seti. Purple squares represent non-interacting drug pairs in training seti. Blue squares represent non-interacting drug pairs in validation seti. Green squares represent non-interacting drug pairs in test set. Red squares represent interacting drug pairs in each set. Grey squares represent unused drug pairs. (b) Approximate receiver operating characteristic (ROC) curves on the training set. (c) Approximate precision-recall curves on the training set. (d) AUROCs and AUPRs on the training set and the test set. (e) Receiver operating characteristic (ROC) curve on the test set. (f) Precision-recall curve on the test set. To build a more interpretable model and speed up the training process, we applied a feature selection method known as group minimax concave penalty (MCP) [38] that has been previously employed on biological datasets [39]. This resulted in a final group of 11 features whose value distributions were all significantly different between adversely interacting drugs and non-interacting drugs (). An extreme gradient boosting (XGBoost) classifier [40] was then built because of its speed and outstanding performance in data science competitions. We optimized hyperparameters of the classifier using the tree-structured Parzen Estimator (TPE) approach [41], which has been shown to drastically improve the performance in a recent study predicting protein-protein interaction interfaces [42]. Notably, instead of doing cross-validation, we adopted the same drug-based splitting scheme on the training set for hold-out validation (). This enables the model to be best tuned for predicting interacting drug pairs without any prior information about the interaction profiles of the drugs involved. Indeed, a previous report by Liu et al. showed that classifier performance dropped significantly when evaluated on a test set consisted of pairs of drugs completely unseen in the training set if conventional cross-validation was performed [21], and this flaw in the generalizability of cross-validation performance has been shown to be true in general for pair-input data [43]. Our novel training strategy resulted in an average area under the receiver operating characteristic curve (AUROC) of 0.727 and an average area under the precision-recall curve (AUPR) of 0.326 over 1,000 trials of hold-out validation on the training set (). When evaluated on the test set, our classifier achieved an AUROC of 0.689 () and an AUPR of 0.280 (), demonstrating the robustness of our model. As shown in Table 1, our classifier attained a precision of 100% on the top 10 predictions, and a precision of 65% on the top 20 predictions (). Since there is no gold-standard set of non-interacting drugs, it is plausible that our non-interacting drug pairs might actually contain adverse DDIs. Not surprisingly, some non-interacting drug pairs with the high predicted probabilities can be found with evidence supporting their possible adverse interactions. For example, the drug pair with a non-interacting label with the highest predicted interacting probability in the test set, liothyronine and tretinoin, has been indicated to potentially cause intracranial pressure increase and a higher risk of pseudotumor cerebri when taken together [44]. Furthermore, diazoxide and spironolactone, predicted with an interacting probability of 0.846, have been reported to induce asthma, cardice hypertrophy and pulmonary edema according to FDA reports when co-administrated [45].

Table 1

Top 20 DDI predictions in the test set.

	ID1	ID2	Drug Name1	Drug Name2	Label	Probability
1	DB01076	DB01098	Atorvastatin	Rosuvastatin	1	0.9943
2	DB01076	DB01095	Atorvastatin	Fluvastatin	1	0.9938
3	DB00381	DB00421	Amlodipine	Spironolactone	1	0.9889
4	DB00880	DB00887	Chlorothiazide	Bumetanide	1	0.9704
5	DB00313	DB01356	Valproic Acid	Lithium cation	1	0.9614
6	DB00887	DB00999	Bumetanide	Hydrochlorothiazide	1	0.9463
7	DB00880	DB01119	Chlorothiazide	Diazoxide	1	0.8835
8	DB00421	DB01076	Spironolactone	Atorvastatin	1	0.8815
9	DB00421	DB00622	Spironolactone	Nicardipine	1	0.8753
10	DB00999	DB01119	Hydrochlorothiazide	Diazoxide	1	0.8723
11	DB00279	DB00755	Liothyronine	Tretinoin	0	0.8677
12	DB00162	DB00755	Vitamin A	Tretinoin	1	0.861
13	DB00477	DB01098	Chlorpromazine	Rosuvastatin	0	0.8601
14	DB00421	DB01119	Spironolactone	Diazoxide	0	0.8457
15	DB00313	DB00523	Valproic Acid	Alitretinoin	0	0.8419
16	DB05015	DB06176	Belinostat	Romidepsin	0	0.8378
17	DB01065	DB01069	Melatonin	Promethazine	1	0.8342
18	DB00622	DB01076	Nicardipine	Atorvastatin	1	0.7995
19	DB00755	DB00900	Tretinoin	Didanosine	0	0.7962
20	DB00162	DB01212	Vitamin A	Ceftriaxone	0	0.7858

In order to showcase the competitiveness of the XGBoost algorithm, we implemented a number of alternative classification algorithms including support vector machine (SVM), random forest and the standard gradient boosting algorithm and performed the same prediction task using exactly the same dataset and features. We found that XGBoost achieved better or comparable performance than the other algorithms (). Furthermore, XGBoost is substantially faster than its closest contenders in terms of performance, gradient boosting and random forest. These results highlight the advantage of XGBoost over other algorithms in both predictive performance and speed. To further demonstrate the efficacy of our method, we compared it against a previously published similarity-based method for DDI prediction [18] using our training and test sets. Our method exhibited a substantial advantage both in training and on the test set (). To demonstrate the utility of our method, we obtained 5,039 drug pairs involving 295 drugs that had not been used for training and testing (). After refitting our model on all 12,426 drug pairs that were used to develop our method, we predicted 432 novel DDIs (). Remarkably, out of the top 20 newly predicted adversely interacting drug pairs, 9 can be verified in the TWOSIDES database (), manifesting the reliability of our method.

Genetic interaction provides mechanistic insight into drug-drug interactions

We investigated the contribution of genetic interaction features to classifier performance by building and tuning a new model without them. Excluding genetic interaction features significantly decreases classifier performance when either AUROC or AUPR is examined (P < 10−20 for both AUROC and AUPR, two-sided Welch’s t-test). More interestingly, the performance drop is not as profound when other groups of features are excluded (). Furthermore, prediction with genetic interaction features alone rendered significantly better performance than prediction with target similarity features alone (P < 10−20 for both AUROC and AUPR, two-sided Welch’s t-test, ). These results establish genetic interaction as an important feature in our model for predicting DDIs, providing complementary information that other features cannot capture. More importantly, genetic interaction can help us generate plausible mechanistic explanations for drug-drug interactions. For example, mesalazine and dexamethasone, both of which are anti-inflammatory drugs, are a pair of drugs in the test set that have been labeled as adversely interacting. Mesalazine can target the IKBKB protein, whereas dexamethasone can target NOS2, which plays important roles in nitric oxide signaling. In yeast, double knockout of ATG1 and TAH18, the respective yeast homologs of IKBKB and NOS2, exhibits a more negative impact on cell viability than expected from single knockout phenotypes [35]. In human, IKBKB can phosphorylate the NF-κB inhibitor and activate NF-κB [46], which is a family of transcription factors involved in inflammation and immunity. Notably, the transcription of NOS2 is induced by NF-κB activity [47]. Mesalazine has been shown to inhibit IKBKB, thereby inhibiting the activation of NF-κB, while dexamethasone is a negative modulator of NOS2. A previous study has reported that dexamethasone can decrease NOS2 translation and facilitate NOS2 degradation in rat [48] (). The combined use of mesalazine and dexamethasone may largely reduce the amount of NOS2, potentially affecting neurotransmission, antimicrobial and antitumoral activities.

Genetic interaction provides possible mechanistic insights into DDIs.

(a) Mesalazine inhibits IKBKB, a positive regulator of NF-κB activity, and NF-κB is a transcription factor which induces NOS2 transcription. Dexamethasone can inhibit the transcription of NOS2 and facilitate degradation of NOS2. The combined use of dexamethasone and mesalazine could potentially reduce the amount of NOS2 in cells to a large extent, which may affect neurotransmission, antimicrobial and antitumoral activities. (b) Mexiletine targets NAv1.5, a sodium channel encoded by SCN5A, while arsenic trioxide targets AKT1. The transcription of SCN5A is repressed by the transcriptional repressor FOXO1. AKT1 can activate the transcription of SCN5A by phosphorylating FOXO1. The combined use of mexiletine and arsenic trioxide could inactivate the transcription of SCN5A and at the same time block the existing sodium channel, which may largely reduce sodium influx in cardiac cells. As another example, arsenic trioxide and mexiletine are a pair of drugs not labelled as adversely interacting in DrugBank, but predicted by our model to interact with high probability. As a chemotherapy drug for acute promyelocytic leukemia (APL), arsenic trioxide has been reported to decrease the activity of a serine/threonine-protein kinase AKT1 [49]. On the other side, mexiletine is a sodium channel blocker that has also been used as part of a prophylactic therapy to treat APL patients to reduce cardiac complications [50]. PKC1, the yeast homolog of AKT1, exhibits strong synergistic interaction with CCH1 [35,51], which is the homolog of SCN5A, the gene encoding the sodium channel NAv1.5 targeted by mexiletine. In human, the transcription of SCN5A is repressed by FOXO1, whose transcriptional repression activity is in turn inactivated by AKT1-dependent phosphorylation [52] (). Therefore, the simultaneous inhibition of AKT1 and the sodium channel by the two drugs may reduce sodium influx in cardiac cells to a greater extent, potentially causing undesired adverse effects. Indeed, this pair of drugs is reported by TWOSIDES as interacting, providing additional supporting evidence to their adverse interaction.

Discussion

In the past decade, many methods have been developed for predicting DDIs based on various types of features. In this study, we have incorporated a novel feature, namely genetic interaction, to build a gradient boosting-based model for fast and accurate adverse DDI prediction. We have shown that our classifier can robustly predict drug-drug interactions even for drugs whose interaction profiles are completely unseen during training. Furthermore, we have predicted 432 novel DDIs, with additional evidence supporting our top predictions, demonstrating the usefulness of our approach. Most previous efforts of predicting DDIs suffer from an inability to make predictions for newly developed drugs due to train-test split based on drug pairs rather than drugs [10,16,17,19,22,23,36,37]. Three studies attempted to address this problem by dividing the entire dataset based on drugs [14,20,21]. However, they failed to do so during the training phase, resulting in an inflated performance on the training set. We have followed the drug-based train-test splitting scheme and have adopted a hold-out validation approach to avoid using overlapping drug sets for fitting the model and evaluating its performance. By doing so, we have achieved robust performance on the training set and the test set, which establishes the ability of our method to predict new DDIs for drugs whose interaction profiles are completely unknown. By examining genetic interactions, our method provides mechanistic insights into how two drugs may interact in a detrimental fashion. The combined modulatory effect resulted from binding of two drugs to their respective targets might underlie adverse DDIs, and genetic interaction gives valuable information about the nature of such combined effect. Indeed, we have observed that genetic interaction features are indispensable to our classifier performance. Notably, target sequence similarity features and genetic similarity features capture conceptually different mechanisms by which DDIs can occur. While the former can capture dosage effects where two drugs target same or similar genes, as exemplified by prolonged QT interval caused by concomitant administration of terfenadine and ketoconazole, both of which are strong CYP3A4 inhibitors [53], the latter captures DDIs resulting from drug pairs targeting genes with an epistatic relationship. For example, asthma patients receiving leukotriene-modifying drugs often show attenuated response to β2-agonists, including albuterol. This drug-drug interaction has been implicated to be associated with the epistasis between ALOX5AP and LTA4H [54]. Nevertheless, our work is limited by the lack of a global human genetic interaction network. As a surrogate for human genetic interactions, genetic interactions of yeast homologs were used in this study. Fortunately, large-scale human genetic interaction studies are coming into sight. Using a recently published dataset of human genetic interactions in K562 cells encompassing 222,784 gene pairs [55], we have found that the distribution of human genetic interaction scores vary significantly between adversely interacting drugs and non-interacting drugs (). Notably, the same trends could be recapitulated with a smaller dataset of genetic interactions [56] in the HEK293T cell line, demonstrating the generalizability of genetic interactions across different cell contexts (), although certain genetic interactions can exist in a cell type-dependent manner. For example, interactions between cancer driver genes are frequently specific to the cancer type [57]. In addition to DDI prediction, a similar machine learning method leveraging genetic interaction features can potentially be developed for predicting beneficial drug combinations. Indeed, current combination therapy for cancers have typically been developed to induce synthetic lethal genetic interactions in cancer cells [58,59]. While there have been some efforts aimed at predicting synergistic drug effects [60,61] or directly predicting drug combinations for disease therapy, especially cancer treatment [62-64], incorporating cell type-specific genetic interaction data from the matching cell type can be crucial for developing combination therapies that specifically target certain cell types. With the continuous advancement of technologies for probing human genetic interactions including CRISPR interference, we anticipate that more comprehensive maps of human genetic interactions for multiple cell lineages will become available in the near future, which could illuminate predictions of adverse DDIs and beneficial drug combinations to a larger extent.

Methods

Data collection

We obtained DDI data from DrugBank (version 5.0.10) [28]. Among the 5 major interaction categories in DrugBank (), we only considered the first category as they were clearly defined as adverse DDIs. Non-interacting drug pairs were constructed by taking all other combinations using the same set of drugs, removing drug pairs also appearing in other categories in DrugBank, TWOSIDES [29], or a complete dataset of DDIs [30] compiled from a number of sources. This minimizes the chance of having actual adverse DDIs in the non-interacting set given the absence of a gold standard set of non-interacting drug pairs. From DrugBank, we also collected human protein targets of drugs and their sequences. Side effects were obtained from SIDER 4.1 [65] and OFFSIDES [29]. Both databases use UMLS concept IDs as their side effect identifiers. However, as reported by Zhang et al. [20], some side effect terms are similar, and synonyms could cause biases when calculating side effect similarity. To solve this problem, we obtained mapping from UMLS concept IDs to MedDRA concept IDs from the 2017AB release of UMLS [66]. Furthermore, we obtained the full MedDRA hierarchy from MedDRA (version 21.0) [31]. This allowed us to map UMLS concept IDs to different levels (PT, HLT, HLGT and SOC) of the MedDRA hierarchy. Similar to side effect data, indications of drugs were acquired from SIDER 4.1 [65] and mapped to the same 4 levels of the MedDRA hierarchy. For genetic interactions, we obtained yeast genetic interactions from Costanzo et al. [35]. We first filtered all genetic interactions by a p-value cutoff of 0.05 and aggregated the scores of all combinations of alleles of each yeast gene pair by applying the arithmetic mean. Drug targets in the form of UniProt IDs were mapped to gene names by UniProt [67] and these human genes were mapped to their yeast homologs via SGD YeastMine [68]. For human gene pairs mapped to multiple yeast gene pairs, we obtained a single score for each human gene pair by applying the arithmetic mean.

Feature extraction and the train-test split

For a drug pair (A,B), four groups of features were calculated (): indication similarity scores between A and B, side effect similarity scores between A and B, target sequence similarity scores between targets of drug A and targets of drug B, and genetic interaction scores between targets of drug A and targets of drug B. Indications and side effects of drugs were mapped to 4 different levels of the MedDRA hierarchy as described above. At each level, indication similarity was calculated by taking the Jaccard index between the respective indication vectors of drug A and drug B (). Similarly, side effect similarity was calculated by applying the same measure on the side effect vectors at the 4 different MedDRA hierarchy levels (). For genetic interactions, since each drug can have multiple targets, we obtained a single score for each drug pair by aggregating the genetic interaction scores of all their corresponding target pairs using 4 different functions, namely taking the minimum, mean, median or maximum (). Similarly, the same 4 functions were used for constructing target similarity features, which were calculated from the target sequences with the Smith-Waterman algorithm using the scikit-bio Python library. The raw scores were normalized as described in Bleakley et al. [69]. Overall, 16 features belonging to 4 feature groups were constructed. Only drug pairs with all features available were considered when building the machine learning model. All drugs were randomly split into “training drugs” and “test drugs” with a 2:1 ratio. The training set consisted of all drug pairs where both drugs were “training drugs” and the test set consisted of all drug pairs where both drugs were “test drugs” (). We constrained the fraction of adversely interacting drug pairs in the training set and that in the test set to be fairly balanced. To obtain the optimal feature combination, we calculated all features for the training set and applied group minimax concave penalty (MCP) [38] with the ‘grpreg’ R package with default parameters. All subsequent training was done using this optimal set of features.

Hyperparameter optimization and classifier training

The gradient boosting-based algorithm XGBoost [40] was used in this study. To find the best combination of hyperparameters for the XGBoost classifier, the tree-structured Parzen estimator (TPE) approach [41] was adopted. Because of the drug-based approach by which we split our dataset into training and test sets, we applied the same splitting scheme on the training set multiple times to obtain training seti and validation seti instead of simply using cross-validation. Each split on the training set can be seen as a hold-out validation, as we used training seti to fit the model and validated model performance on validation seti. We selected one minus the average AUPR of 50 trials of hold-out validation as the loss function to minimize for TPE, and we ran TPE for 2,000 iterations to obtain set of hyperparameters that minimized the loss function for our XGBoost classifier (). After finding the optimal set of hyperparameters, we fit the model on the complete training data.

Model evaluation

Model performance on training set was evaluated by 1,000 runs of hold-out validation on the training set. For each hold-out validation, we fitted the model on training seti and obtained AUROC and AUPR. We averaged AUROC and AUPR over 1,000 runs of hold-out validation as measurements of the performance of the model. Approximate ROC curve and precision-recall curve () were plotted by averaging the 1,000 ROC curves and 1,000 precision-recall curves respectively at every thousandth of a point on the x-axis. In order to evaluate the ability of the classifier to identify drug-drug interactions between drugs whose interaction profiles were completely unknown during training, the model was evaluated on the test set which had no overlap with the training set in terms of the drugs involved. Predictions were ranked according to their raw prediction scores to produce the ROC curve and the precision-recall curve.

Making new predictions

To make novel adverse DDI predictions, we examined all combinations of drugs that appeared in DrugBank, excluding drug pairs where both drugs were involved in the first category of DDIs (), which we used for building the machine learning model. We then predicted 6,690 drug pairs involving 336 drugs for which all features could be calculated using the classifier retrained on the whole dataset. The probability cutoff that produced the maximum averaged F1 score over 1,000 runs of hold-out validation on the training set was chosen for determining new DDI predictions.

Schematics of our DDI prediction framework.

Four groups of features were calculated for each drug pair. Drug pairs were then divided into a training set and a test set. A gradient boosting-based model was built on the training set after feature selection. Model performance was evaluated on the training set using hold-out validation and also on the test set. We demonstrate the importance of our novel feature with a case study and provide novel DDI predictions at the end. (TIF) Click here for additional data file.

The distribution of adversely interacting drug pairs and non-interacting drug pairs in terms of the 5 unused features.

(a) Indication similarity score of hierarchy level HLT between two drugs. (b) Side effect similarity score of hierarchy levels PT and SOC between two drugs. (c) Mean and median genetic interaction score between targets of two drugs. Statistical significance was determined by the two-sided permutation test on the sample mean. * p < 0.001; ** p < 0.0001; *** p < 0.00001. (TIF) Click here for additional data file. (a) The total number of protein targets between two drugs. (b) Minimum, mean, median and maximum human K562 cell line genetic interaction score between targets of two drugs. (Statistical significance determined by two-sided Mann-Whitney U test) (c) Minimum, mean, median and maximum human HEK293T cell line genetic interaction score between targets of two drugs. (Statistical significance determined by two-sided Mann-Whitney U test). (TIF) Click here for additional data file.

The correlation between genetic interaction features and other features.

(TIF) Click here for additional data file.

Values of hyperparameters of the XGBoost model over 2000 TPE iterations.

(TIF) Click here for additional data file.

Construction of a set of drug pairs used for new predictions.

(a) All combinations between drugs that appear in the first category in DrugBank and other drugs, as well as all pairwise combinations of drugs not in the first category, are taken for new predictions. Green squares represent drug pairs used for building the classifier. Grey squares represent unused drug pairs. Blue squares represent drug pairs used for new predictions. (b) Maximum target similarity feature distribution of drug pairs used for model building (green triangular section in (a)), drug pairs where one drug appears in the dataset used for model building (blue rectangular section in (a)), and drug pairs where neither drug appears in the dataset used or model building (blue triangular section in (a)). (TIF) Click here for additional data file.

Five main DDI categories in DrugBank.

(DOCX) Click here for additional data file.

Summary statistics including mean, standard error of the mean and p-value of each feature.

Statistical significance was determined by the two-sided permutation test on the sample mean. (XLSX) Click here for additional data file.

Tab 1: performance comparison of XGBoost with several other algorithms with and without genetic interaction features.

Tab 2: comparison of our method with Zhao and Cheng, 2014. Tab 3: model performance using only genetic interaction features of target sequence similarity features. (XLSX) Click here for additional data file.

A list of 432 new adverse DDI predictions.

(XLSX) Click here for additional data file.

A list of all drug pairs in the training set and a list of all drug pairs in the test set.

(XLSX) Click here for additional data file.

Side effects, indications, human gene targets and their yeast homolog of all drugs that appear in the training set or the test set.

(XLSX) Click here for additional data file.

Table 2

Top 20 new adverse DDI predictions.

	ID1	ID2	Drug Name1	Drug Name2	Probability	In Twosides
1	DB00347	DB01189	Trimethadione	Desflurane	0.975	No
2	DB00136	DB00630	Calcitriol	Alendronic acid	0.9739	Yes
3	DB00417	DB01050	Phenoxymethylpenicillin	Ibuprofen	0.9699	Yes
4	DB00228	DB00347	Enflurane	Trimethadione	0.968	No
5	DB00347	DB00753	Trimethadione	Isoflurane	0.9668	No
6	DB00347	DB01236	Trimethadione	Sevoflurane	0.9656	No
7	DB00887	DB01586	Bumetanide	Ursodeoxycholic acid	0.9567	Yes
8	DB00532	DB01189	Mephenytoin	Desflurane	0.9531	No
9	DB00532	DB01236	Mephenytoin	Sevoflurane	0.9521	No
10	DB00228	DB00532	Enflurane	Mephenytoin	0.9466	No
11	DB01067	DB01083	Glipizide	Orlistat	0.9456	Yes
12	DB00731	DB01016	Nateglinide	Glyburide	0.943	Yes
13	DB01050	DB01053	Ibuprofen	Benzylpenicillin	0.9422	No
14	DB00532	DB00753	Mephenytoin	Isoflurane	0.9368	No
15	DB00421	DB01216	Spironolactone	Finasteride	0.9333	Yes
16	DB00162	DB00165	Vitamin A	Pyridoxine	0.926	No
17	DB01236	DB04930	Sevoflurane	Permethrin	0.925	No
18	DB00284	DB00731	Acarbose	Nateglinide	0.9199	Yes
19	DB00136	DB00273	Calcitriol	Topiramate	0.9132	Yes
20	DB00877	DB01586	Sirolimus	Ursodeoxycholic acid	0.9122	Yes

63 in total

1. Direct phosphorylation of NF-kappaB1 p105 by the IkappaB kinase complex on serine 927 is essential for signal-induced p105 proteolysis.

Authors: A Salmerón; J Janzen; Y Soneji; N Bump; J Kamens; H Allen; S C Ley
Journal: J Biol Chem Date: 2001-04-10 Impact factor: 5.157

2. State of the art and development of a drug-drug interaction large scale predictor based on 3D pharmacophoric similarity.

Authors: Santiago Vilar; Eugenio Uriarte; Lourdes Santana; Carol Friedman; Nicholas P Tatonetti
Journal: Curr Drug Metab Date: 2014 Impact factor: 3.731

3. YeastMine--an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit.

Authors: Rama Balakrishnan; Julie Park; Kalpana Karra; Benjamin C Hitz; Gail Binkley; Eurie L Hong; Julie Sullivan; Gos Micklem; J Michael Cherry
Journal: Database (Oxford) Date: 2012-03-20 Impact factor: 3.451

4. Arsenic trioxide overcomes rapamycin-induced feedback activation of AKT and ERK signaling to enhance the anti-tumor effects in breast cancer.

Authors: Cynthia Guilbert; Matthew G Annis; Zhifeng Dong; Peter M Siegel; Wilson H Miller; Koren K Mann
Journal: PLoS One Date: 2013-12-31 Impact factor: 3.240

5. Data-driven prediction of adverse drug reactions induced by drug-drug interactions.

Authors: Ruifeng Liu; Mohamed Diwan M AbdulHameed; Kamal Kumar; Xueping Yu; Anders Wallqvist; Jaques Reifman
Journal: BMC Pharmacol Toxicol Date: 2017-06-08 Impact factor: 2.483

6. Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes.

Authors: Pathima Nusrath Hameed; Karin Verspoor; Snezana Kusljic; Saman Halgamuge
Journal: BMC Bioinformatics Date: 2017-03-01 Impact factor: 3.169

7. UniProt: the universal protein knowledgebase.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

8. In silico drug combination discovery for personalized cancer therapy.

Authors: Minji Jeon; Sunkyu Kim; Sungjoon Park; Heewon Lee; Jaewoo Kang
Journal: BMC Syst Biol Date: 2018-03-19

9. Mapping the Genetic Landscape of Human Cells.

Authors: Max A Horlbeck; Albert Xu; Min Wang; Neal K Bennett; Chong Y Park; Derek Bogdanoff; Britt Adamson; Eric D Chow; Martin Kampmann; Tim R Peterson; Ken Nakamura; Michael A Fischbach; Jonathan S Weissman; Luke A Gilbert
Journal: Cell Date: 2018-07-19 Impact factor: 41.582

10. Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network.

Authors: Jialiang Huang; Chaoqun Niu; Christopher D Green; Lun Yang; Hongkang Mei; Jing-Dong J Han
Journal: PLoS Comput Biol Date: 2013-03-21 Impact factor: 4.475

2 in total

Review 1. On the road to explainable AI in drug-drug interactions prediction: A systematic review.

Authors: Thanh Hoa Vo; Ngan Thi Kim Nguyen; Quang Hien Kha; Nguyen Quoc Khanh Le
Journal: Comput Struct Biotechnol J Date: 2022-04-19 Impact factor: 6.155

2. Assessment of Polypharmacy, Drug Use Patterns, and Associated Factors at the Edna Adan University Hospital, Hargeisa, Somaliland.

Authors: Temesgen Sidamo; Alemu Deboch; Mohamed Abdi; Fikru Debebe; Khalid Dayib; Tamrat Balcha Balla
Journal: J Trop Med Date: 2022-08-29

2 in total