Ding Ruan (1), Shuyi Ji (2,3), Chenggang Yan (1), Junjie Zhu (2), Xibin Zhao (2), Yuedong Yang (4), Yue Gao (2,3), Changqing Zou (5), Qionghai Dai (3,6)
1. School of Automation, Hangzhou Dianzi University, Hangzhou, China
2. School of Software, KLISS, BNRist, Tsinghua University, Beijing, China
3. Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
4. School of Computer Science, Sun Yat-sen University, Guangzhou, China
5. Huawei Vancouver Research Center, Huawei Canada Technologies, Vancouver, Canada
6. Department of Automation, Tsinghua University, Beijing, China
Abstract
The continuous emergence of drug-target interaction data provides an opportunity to construct a biological network for systematically discovering unknown interactions. However, this is challenging due to complex and heterogeneous correlations between drug and target. Here, we describe a heterogeneous hypergraph-based framework for drug-target interaction (HHDTI) predictions by modeling biological networks through a hypergraph, where each vertex represents a drug or a target and a hyperedge indicates existing similar interactions or associations between the connected vertices. The hypergraph is then trained to generate suitably structured embeddings for discovering unknown interactions. Comprehensive experiments performed on four public datasets demonstrate that HHDTI achieves significant and consistently improved predictions compared with state-of-the-art methods. Our analysis indicates that this superior performance is due to the ability to integrate heterogeneous high-order information from the hypergraph learning. These results suggest that HHDTI is a scalable and practical tool for uncovering novel drug-target interactions.
The prediction of drug-target interactions (DTIs) plays a crucial role in drug discovery. However, the biochemical experimental approaches widely used in wet laboratories are expensive and time-consuming, which slows the progress of drug discovery. The ever-growing demand for inexpensive, effective, and rapid prediction methods has driven the development of computational approaches, which provide a cheaper and faster way to predict potential interactions between drugs and targets. Conventional computational approaches tend to begin with the inherent properties of drugs and targets, such as the chemical structure of drugs and the three-dimensional (3D) structure of proteins. Molecular docking, an important tool in structural molecular biology and computer-assisted drug design, is used to predict the binding mode(s) of a ligand with a protein of known 3D structure. Keiser et al. used a complementary technique based on the chemical similarity of ligands to quantitatively group and relate proteins and discover unexpected ligand-target links. However, molecular docking predictions cannot succeed without a known and accurate 3D protein structure, and ligand-based methods require several known binding ligands.
Recently, machine learning methods have attracted more attention and shown greater promise in drug discovery. Unlike the aforementioned methods, one key idea of current machine learning-based approaches is that similar drugs may share similar targets and vice versa. Typical computational approaches adopt machine learning methods to catalog the similarities of drugs and targets based on biological features and then predict DTIs [10, 11, 12]. Yamanishi et al. made the first attempt to predict DTIs based on biological feature information, such as the similarity between drug chemical structure and target protein sequence, unifying the chemical and genomic spaces of known drugs and targets into pharmacological spaces. Yu et al. integrated features from chemical and genomic space for large-scale drug discovery using random forest and support vector machine algorithms. Gao et al. used low-level representations such as Gene Ontology annotations, amino acid sequences, and chemical structural graphs as inputs to the neural network, generating embeddings for the targets and drugs, respectively, and then calculating the similarity between the embeddings to predict the interaction. This type of approach adequately extracts information from inherent properties, but problems arise when sufficient and reliable information is not available.
In addition to the inherent properties of drugs and targets, there is increasing interest in exploring the correlations among drugs, targets, and other biological entities in the data structure of a heterogeneous biological network. In contrast to biological feature-based methods, network topology-based methods make predictions from the topology information of the network. Several recent attempts have explored topological structures to model DTIs, with biological entities such as drugs, targets, side effects, and diseases denoting vertices in the biological graph and the interactions or associations indicating edges among them. Campillos et al. constructed a network of 1,018 side effect-driven drug-drug relations and validated 13 implied drug-target relations. Cheng et al. compared network-based inference with drug-based similarity inference and target-based similarity inference, showing that the former achieved higher-quality results. Chen et al. integrated and annotated data from public datasets to build a semantic-linked network. They developed a statistical model to assess the association of drug-target pairs and observed that drugs from the same disease area will cluster together. They noted that this mode of clustering is difficult to infer based on inherent properties alone.
We hypothesized that correlation among various biological entities can provide useful information that cannot be obtained from inherent properties. Some recent methods formulate DTI prediction tasks as “link predictions” in complex networks. TriModel represents heterogeneous topological correlations in the form of a knowledge graph and generates embeddings to predict whether there is a link between a drug and a target (supplementary note). Furthermore, similarities based on both inherent properties and topological correlations can be used to predict DTIs. DTINet integrates diverse inherent properties and topological correlations through a network diffusion process. It generates representations for drugs and targets, containing the similarities of vertices in the biological network, and then performs predictions using these representations (supplementary note). DeepDTnet is another network-based method that integrates information based on the inherent properties of drugs and targets (supplementary note). NeoDTI also integrates information from heterogeneous network data and predicts DTIs by learning topology-preserving representations of drugs and targets.
In summary, previous methods have performed DTI predictions by extracting the similarities between drugs and targets. However, they describe the interactions between drugs and targets in a low-order manner where only pairwise correlations are taken into consideration, i.e., a one-drug, one-target paradigm. The connections among biomedical entities can be far more intricate than merely pairwise links. For example, a single drug may be connected to a number of targets (so-called multi-target drugs, which can target various complex diseases as they are ubiquitous and effective), and these targets may share subtle but important pharmacological characteristics that contribute to the interactions.
When further considering more connections, such as drug-disease associations and target-disease associations, the overall heterogeneous biological network becomes even more complex and emerges in a many-to-many pattern. Under such circumstances, it is important to formulate and explore the underlying higher-order topological correlations for drug discovery, which is beyond the capability of the pairwise correlation-based methods. To tackle this issue, we adopted a heterogeneous hypergraph-based model to explore complex and heterogeneous correlations for drug-target interaction prediction (HHDTI) (see section “experimental procedures” for more details).
Unlike traditional graphs that model pairwise correlations, the hypergraph can model higher-order correlations and is thus more flexible and powerful, with the ability to incorporate complex correlations. There are precedents for modeling biological networks using hypergraphs, but they have not been used to predict DTIs. Vaida et al. modeled relations between pairs of drugs as a hypergraph and used a two-layer graph convolution neural network as an encoder to predict drug interactions. Niu et al. used diseases as hyperedges, connected microbes associated with them, and developed a hypergraph-based random walk model for microbe-disease association prediction.
Hypergraphs are indeed suitable for modeling drug-target interaction networks. When a drug-target hypergraph is constructed, targets are denoted by vertices, and the interactions between a specific drug and a certain number of targets can be modeled by a hyperedge. In this hypergraph, all targets interacting with the same drug are connected by a hyperedge; therefore, all the target vertices connected by one hyperedge can be regarded as a set.
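The hyperedge construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the drug and target identifiers are made-up placeholders, and `build_hyperedges` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative (made-up) drug-target pairs; identifiers are placeholders.
interactions = [
    ("drug_a", "target_1"),
    ("drug_a", "target_2"),
    ("drug_a", "target_3"),
    ("drug_b", "target_2"),
    ("drug_b", "target_4"),
]


def build_hyperedges(pairs):
    """Group targets by drug: each drug becomes one hyperedge over target vertices."""
    hyperedges = defaultdict(set)
    for drug, target in pairs:
        hyperedges[drug].add(target)
    return dict(hyperedges)


hyperedges = build_hyperedges(interactions)
for drug, targets in sorted(hyperedges.items()):
    print(drug, sorted(targets))
```

Each hyperedge here is a set of arbitrary size, whereas an ordinary graph edge would join exactly two vertices.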
Rather than a graph edge in a heterogeneous biological graph representing a two-order pairwise correlation (i.e., indicating direct DTIs), a hyperedge in a heterogeneous biological hypergraph instead models high-order multilateral (i.e., many-to-many) correlations between targets and drugs. Moreover, to provide a thorough understanding of DTIs, we comprehensively integrated several types of connections among various vertices (e.g., drug-target, target-disease, and drug-disease connections) in the heterogeneous biological networks. A representation modeled on higher-order correlations can significantly improve the prediction performance for DTIs.
Specifically, HHDTI infers candidate DTIs by fusing two types of embeddings: key and side embeddings. Key embeddings provide initial and major vector representations for all drugs and targets, which are learned using the direct drug-target interaction information. By contrast, side embeddings offer complementary representations learned by leveraging disease-relevant information. Structural drug-target embeddings are obtained by fusing the key embeddings with the side embeddings, and HHDTI estimates drug-target similarity from them to perform DTI predictions. We have demonstrated that, based on this embedding learning process, HHDTI consistently achieved higher-quality prediction results when analyzing several popular datasets compared with alternative state-of-the-art methods. Comprehensive evaluations have determined that the proposed HHDTI is a promising and powerful tool for drug discovery.
Results
Overview of HHDTI
We propose a computational framework for DTI prediction, called HHDTI, which captures implicit high-order topological correlations in heterogeneous biological networks. HHDTI first uses a generative model to construct key embeddings from drug-target and target-drug interactions (Figure 1). It then extracts drug-disease correlations and target-disease correlations to generate side embeddings using hypergraph neural networks (HGNNs). Ultimately, HHDTI fuses the key embeddings and side embeddings and obtains structural embeddings to perform DTI prediction. Integrating diverse information from heterogeneous biological data can assist in determining higher-order topological correlations among different vertices. HHDTI then can infer potential DTIs by computing and ranking the prediction scores of all candidate interactions. In summary, embeddings encode both topological properties and association information, resulting in a low-dimensional vector space where the distance between drug-target pairs correlates with their likelihood of interaction. More details of the HHDTI framework can be found in the section “experimental procedures.”
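To make the HGNN step more concrete, the sketch below implements one hypergraph convolution in the commonly used HGNN form, ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Θ). The text states only that HGNNs generate the side embeddings, so treating this as the exact operator used in HHDTI is an assumption. The function name `hgnn_layer`, the toy incidence matrix, and the identity features and weights are illustrative.

```python
import numpy as np


def hgnn_layer(H, X, Theta, edge_weights=None):
    """One hypergraph convolution: ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta)."""
    H = np.asarray(H, dtype=float)
    # Hyperedge weights default to 1 (assumption: unweighted hyperedges).
    w = np.ones(H.shape[1]) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    dv = H @ w            # weighted vertex degrees
    de = H.sum(axis=0)    # hyperedge degrees (vertices per hyperedge)
    # Guard against isolated vertices and empty hyperedges.
    dv_inv_sqrt = np.zeros_like(dv)
    dv_inv_sqrt[dv > 0] = dv[dv > 0] ** -0.5
    de_inv = np.zeros_like(de)
    de_inv[de > 0] = 1.0 / de[de > 0]
    left = dv_inv_sqrt[:, None] * H * (w * de_inv)[None, :]   # Dv^-1/2 H W De^-1
    right = H.T * dv_inv_sqrt[None, :]                        # H^T Dv^-1/2
    return np.maximum(left @ right @ X @ Theta, 0.0)          # ReLU


# Toy hypergraph: 3 vertices, 2 hyperedges; vertex 1 belongs to both.
H = np.array([[1, 0],
              [1, 1],
              [0, 1]])
X = np.eye(3)        # placeholder vertex features
Theta = np.eye(3)    # placeholder weights (learned in a real model)
out = hgnn_layer(H, X, Theta)
print(out.round(3))
```

With identity features and weights, the output is the normalized propagation matrix itself: vertices that share a hyperedge exchange information, and vertices that share none (0 and 2 here) do not.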
Figure 1
Schematic flowchart of the HHDTI pipeline
(A) Illustration of the hypergraph construction.
(B) Given the heterogeneous biological network in (A), four distinct types of sub-hypergraphs (drug-target, drug-disease, target-drug, and target-disease) can be built. Taking the target-drug interactions as an example, we used a hyperedge to connect all targets that interact with the same drug, i.e., a hyperedge in the heterogeneous biological hypergraph represents a drug. These hypergraphs provide the input for the key and side embedding learning in (B). The incidence matrix H represents the sub-hypergraph and serves as the input of the model, and the three Ф symbols represent the key embeddings, side embeddings, and structural embeddings, respectively. μ and σ refer to the means and variances, respectively, obtained by the variational autoencoder when generating the key embeddings. “Attention” denotes the bi-embedding attention fusion module.
Better DTI prediction performance by HHDTI
We initially evaluated the overall prediction performance of HHDTI using a 10-fold cross-validation procedure. We conducted these experiments on three public datasets (DTINet_17, deepDTnet_20, and KEGG_MED) and compared HHDTI with four state-of-the-art network-based drug discovery methods: DTINet, NeoDTI, deepDTnet, and TriModel. In each fold, 10% of the known drug-target interaction pairs and non-interaction pairs were randomly chosen as the positive and negative samples, respectively, for testing, and the remaining 90% were used for training. Two widely used metrics, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR), were calculated to comprehensively compare the performance of different methods. We conducted separate experiments on these three datasets and confirmed that there was no data overlap between the training and test sets within each dataset. The results of the four baseline methods were consistent with those reported in the original papers for their corresponding datasets (Figure 2). However, HHDTI outperformed each of these competitive baselines, consistently achieving the best prediction results on all three datasets. All four baselines are network-based methods, each with minor differences. DTINet, deepDTnet, and NeoDTI blend the inherent properties of drugs and targets with the topological correlations among biological entities. For this reason, these three methods perform poorly on the KEGG_MED dataset, which does not include any information related to inherent properties such as the chemical structures of drugs and the primary sequences of proteins. Although these baseline methods attempt to fuse diverse information in heterogeneous biological networks, they are still limited in terms of data modeling as they can only capture low-order pairwise correlations between vertices rather than high-order correlations.
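For readers unfamiliar with the two metrics, the following sketch computes them on a toy test fold. It is illustrative only: AUROC is computed as the fraction of correctly ordered positive-negative pairs, and AUPR is approximated by average precision, a common estimator that may differ from the integration scheme used in the paper. The labels and scores are made up.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (needs >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy fold: 1 = known interaction, 0 = sampled non-interaction (made-up values).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores))              # 0.75
print(average_precision(labels, scores))  # 0.8333...
```

In this toy fold, three of the four positive-negative pairs are ordered correctly, giving an AUROC of 0.75; the two positives sit at ranks 1 and 3, giving an average precision of (1/1 + 2/3)/2.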
Figure 2
HHDTI outperforms other models when used on all three datasets
(A–C) Experimental results as measured by AUROC and AUPR. 10-fold cross-validations were performed on (A) DTINet_17, (B) deepDTnet_20, and (C) KEGG_MED databases to compare the prediction ability of HHDTI with DTINet, NeoDTI, deepDTnet, and TriModel (supplementary note). The results of five trials for each method are expressed as mean ± SD; ∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001; ∗∗∗∗p < 0.0001.
The superior performance of the prediction methods might result from the easy prediction of homologous proteins or similar drugs in the dataset. To investigate this issue, we followed the work of Luo et al. [6] and performed an additional test on the DTINet_17 dataset after excluding the DTIs involving homologous proteins (sequence identity scores >40%). Removing homologous proteins reduces the potential redundancy in the DTIs that may lead to an inflated performance evaluation. The test results were robust even after removing homologous proteins from the training data, suggesting that HHDTI, by capturing high-order correlation information, can still achieve good performance and outperform other prediction methods even in the absence of similar targets (Figure S1).
Additional association information for DTI prediction
We further investigated how the quantity of isolated data influences DTI prediction results. We extracted all known drug-target interaction pairs of three different proportions of drugs (20%, 50%, and 80%) within the datasets as positive samples and the same number of non-interaction pairs as negative samples to generate the test sets (i.e., there are no known drug-target interaction pairs in the training data for these drugs). This experimental setting simulated the so-called cold-start problem by artificially creating isolated vertices, resulting in extremely difficult DTI predictions. Our analysis showed that the side embeddings generated from the association information (i.e., drug-disease and target-disease associations) can help improve DTI predictions to some extent, despite the absence of any known drug-target interaction pairs for these drugs within the training sets (Figure 3). These studies also showed that additional association information can be captured by the proposed HHDTI to enhance DTI predictions, which may provide new insights into understanding interaction mechanisms among drugs, targets, and diseases.
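The cold-start split described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `cold_start_split`, the seed handling, and the toy pairs are illustrative, and negative sampling is omitted because the text does not specify how the non-interaction pairs are drawn.

```python
import random


def cold_start_split(pairs, fraction, seed=0):
    """Hold out every known interaction of a random fraction of drugs."""
    drugs = sorted({d for d, _ in pairs})
    rng = random.Random(seed)
    n_held = max(1, round(fraction * len(drugs)))
    held = set(rng.sample(drugs, n_held))
    train = [(d, t) for d, t in pairs if d not in held]   # held-out drugs are isolated here
    test = [(d, t) for d, t in pairs if d in held]        # positive test samples
    return train, test, held


# Toy (made-up) interaction list with five drugs.
pairs = [("d1", "t1"), ("d1", "t2"), ("d2", "t1"),
         ("d3", "t3"), ("d4", "t2"), ("d5", "t4")]
train, test, held = cold_start_split(pairs, 0.2)
print("held-out drugs:", sorted(held))
print("train:", train)
print("test:", test)
```

Because the split is made per drug rather than per pair, a held-out drug has no interactions left in the training data, so any signal for it has to come from the association information.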
Figure 3
HHDTI evaluated under cold-start conditions
(A–C) All known interactions of three different proportions of drugs (20%, 50%, and 80%) in the datasets (A) DTINet_17, (B) deepDTnet_20, and (C) KEGG_MED, together with the same number of negative samples, form the test sets. Specifically, in the first, second, and third experiments, 20%, 50%, and 80% of the drug vertices in the training set, respectively, are isolated vertices. HHDTI_W/O_S means that HHDTI does not use side embeddings for DTI prediction. The results summarize five trials and are expressed as mean ± SD.
High-order topological correlations for DTI prediction
We conducted ablation experiments on the DTINet_17, deepDTnet_20, and KEGG_MED datasets to study the advantages and disadvantages of high-order topological correlations relative to low-order pairwise correlations. To this end, we replaced the hypergraph representation in HHDTI with plain graph representations and used this as the comparative method (specifically, we constructed standard plain graphs on these three datasets and performed a key-side embedding learning procedure similar to that of HHDTI for DTI prediction). The experimental results showed that HHDTI consistently outperformed the low-order correlation-based comparative method on each of the three datasets (Figure 4).
Figure 4
Ablation experiments determine the contribution of high-order topological correlation to HHDTI
We performed ablation experiments using the DTINet_17, deepDTnet_20, and KEGG_MED datasets to evaluate the superiority of high-order correlations. The results summarize five trials and are expressed as mean ± SD; ∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001; ∗∗∗∗p < 0.0001.
Practical drug discovery
Our goal was to study HHDTI's capability as a practical tool for unknown DTI discovery. We chose Target Drug-UniProt Links (approved) of the DrugBank database, version 5.1.0, for the evaluation, as it contains detailed and complete interaction information for targets and drugs. We chose deepDTnet as the comparative method because it achieved the highest quantitative prediction performance among the baselines. Since there is no disease association information in this dataset, we compared HHDTI (no disease) with deepDTnet. We trained the two methods using all the data in Target Drug-UniProt Links (approved) and produced a top-10 target prediction list for each drug using each of the two methods (Table S1). Data S1 and S2 are the lists of DTIs predicted by HHDTI (no disease) and deepDTnet, respectively, and validated by the literature. In the lists predicted by both methods, aside from the known targets in the training set, we observed a subset of newly predicted DTIs that were unknown in the training set but had been reported in the literature. Statistical analysis showed that HHDTI successfully predicted 17.9% more DTIs than deepDTnet. To further compare HHDTI (no disease) and deepDTnet, we used “recall @ top-10” as the evaluation metric, which is defined as the fraction of true interacting targets retrieved in the list of top-10 predictions for a drug. With this evaluation metric, the average recall at top-10 of HHDTI (no disease) and deepDTnet was 0.0590 and 0.0573, respectively. This indicates that both methods can successfully discover targets that interact with a given drug and that HHDTI (no disease) is more powerful than deepDTnet.
Figure 5 illustrates specific practical drug discovery results produced by HHDTI (no disease) and deepDTnet. The data in the training set show that the anti-epileptic drug phenytoin acts on nuclear receptor subfamily 1, group I, member 2 (NR1I2) and several targets from the sodium channel family (SCN1A, SCN3A, and SCN5A).
The drug brivaracetam, which is commonly used in the treatment of partial-onset seizures, is a ligand of synaptic vesicle protein 2A (SV2A) and inhibits voltage-gated sodium channels. Existing low-order correlation-based methods, including deepDTnet, make DTI inferences based on the “guilt-by-association” assumption that similar drugs may share similar targets and vice versa. Since both brivaracetam and phenytoin act on similar targets, deepDTnet predicted that phenytoin acts on SCN8A, a member of the sodium channel family. However, deepDTnet failed to predict the interaction between phenytoin and KCNH2, which is not similar to NR1I2 or the sodium channel family. The experimental results reveal that the problem with these methods is that they are only able to predict targets that are similar to known targets. In contrast, HHDTI (no disease) successfully predicted that phenytoin acts on KCNH2. As shown in Figure 5, the training data reveal the similarity between NR1I2 and KCNH2 because both interact with the same drug, ketoconazole. The two targets NR1I2 and KCNH2 are thus linked by a hyperedge and are regarded as a set. We first train the model to find a certain similarity between the targets in the set and project it into a low-dimensional common feature space as the embedding of the drug. In the same way, we can obtain the embedding of the target. The drug embedding and the target embedding with known interactions are then positioned close to each other (i.e., the embedding of ketoconazole and the embedding of KCNH2 are close in the low-dimensional feature space). Since phenytoin and ketoconazole both act on SCN5A, their embeddings will also be near each other in the feature space. Owing to this transfer of similarity, HHDTI successfully predicted the interaction of phenytoin with KCNH2. The interaction of propafenone with SCN5A and KCNH2 can also help predict the interaction between phenytoin and KCNH2.
Furthermore, SCN5A and KCNH2 both belong to the voltage-gated ion channel superfamily, suggesting that our method finds some similarity between these two proteins and may help further explore the role and structure of the proteins. The high-order topological correlation allows HHDTI to take full advantage of known interaction information in the heterogeneous biological network and recall more potential DTIs in a top-N prediction list.
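The “recall @ top-10” metric used in this section can be written down directly from its definition. The sketch below is illustrative: the function name `recall_at_k` and the toy scores are assumptions, and whether targets already known in the training set are excluded from the ranked list is not specified in the text, so no filtering is applied here.

```python
def recall_at_k(scores, true_targets, k=10):
    """Fraction of a drug's true targets that appear among its k top-scored targets.

    scores: dict mapping target id -> predicted score for one drug.
    """
    if not true_targets:
        return 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(ranked) & set(true_targets)) / len(true_targets)


# Toy (made-up) scores for one drug; k is reduced to 2 to keep the example small.
scores = {"t1": 0.9, "t2": 0.8, "t3": 0.3, "t4": 0.1}
true_targets = {"t1", "t4"}
print(recall_at_k(scores, true_targets, k=2))  # 0.5
```

Averaging this value over all drugs gives the per-method figure reported above.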
Figure 5
Predicted and validated DTI examples visualized in a heterogeneous biological network
Predicted and validated DTIs refer to the predicted DTIs that can also be confirmed by known experimental or clinical evidence in the literature. Targets of the same color belong to the same protein family. Compared with the state-of-the-art network-based method deepDTnet, HHDTI can discover more interacting targets that are not close, in terms of protein family proximity, to the known interacting targets of a drug.
We conducted additional rigorous testing. We downloaded the earliest available release (v4.6.0, released on 20 April 2016) from the DrugBank database. Using all the data in Target Drug-UniProt Links (approved) from this release, we obtained results that support the validity of HHDTI. As shown in Table S2, these results have been validated in the literature, and the corresponding publications appeared after April 2016. For example, the interactions related to the drug celiprolol (DB04846) in the training set were first documented in the literature in 2007. HHDTI predicts that the drug also interacts with beta-3 adrenergic receptor (ADRB3, P13945) and alpha-2A adrenergic receptor (ADRA2A, P08913), and these predictions were confirmed by the literature in 2017.
Discussion
The HHDTI method presented here is a computational approach based on hypergraph networks and deep neural networks. Based on known DTIs, HHDTI extracts the intrinsic characteristics of drugs and targets, models these correlations with a hypergraph capable of higher-order modeling, and then enhances these correlations with complementary information to generate structural embeddings for both drugs and targets. The major advantage of the proposed method lies in its powerful capability of modeling high-order correlations among various entities and its flexible framework capable of integrating several types of complementary information. Our study found that it can discover more DTIs that have been previously validated by the literature than other state-of-the-art computational approaches. It can therefore identify potential DTI candidates to efficiently guide validation experiments in the wet laboratory. In the future, we plan to perform wet-laboratory experimental validation as a form of cross-validation through cooperation with drug discovery industry partners, which will in turn help us further improve the framework.
Although network-based methods have been applied, correlation modeling based on one-to-one correspondence may not produce the essential features reflecting a single drug acting on multiple targets or multiple drugs acting on the same target. Integrating network biology and polypharmacology promises an expanded opportunity for druggable targets, which cannot be achieved without effective high-order correlation modeling. Capturing the high-order topological correlations among various vertices in a heterogeneous biological network can achieve more accurate and robust prediction performance and deserves further study. Although computational approaches have achieved decent results after years of development, there are still many unresolved problems.
The biological data used in this study are considered large-scale datasets, but the number of drug vertices, target vertices, and DTIs included in each dataset is quite limited. For example, the approved Target Drug-UniProt Links in the DrugBank database (version 5.1.0) contains only 2,020 drugs, 2,669 targets, and 9,796 DTIs. To construct a large-scale comprehensive heterogeneous biological network, more types of vertices in addition to drugs and targets should be provided to obtain complex relationships at different levels. It is not easy to accomplish this task using a single dataset. Fortunately, we may integrate complementary information from different public databases. For instance, we can integrate the known drug-disease associations from DrugCentral, clinically reported drug side effects from the Comparative Toxicogenomics Database (CTD), protein-protein interaction data from the Human Protein Reference Database (HPRD) and the HuRI, and clinically reported drug-drug interaction data from the DrugBank database. Even with plenty of data, coping with the noise from multiple databases is a challenging problem for data integration. The sample imbalance problem may also arise from collecting only positive sample information and ignoring information for non-interaction pairs. Furthermore, even an evaluated DTI may be rejected in the future. We believe that a high-quality, large-scale dataset that integrates various classes of information will significantly advance the development of computational approaches.
By convention, HHDTI selects drug-target pairs with no known interactions as negative samples. These negative samples are potentially positive, making it difficult to select genuine non-interacting drug-target pairs.
The proposed HHDTI method can be further expanded to incorporate more topological information (e.g., drug side effect associations) and other types of information.
For example, the similarity computed from the inherent property information of drugs and targets, such as drug chemical similarity and protein sequence similarity, can also be modeled in the form of hypergraphs to explore the high-order correlations in this respect, which will be considered in our future research. Importantly, although HHDTI was developed for DTI predictions, it can also be used as a general framework to address link prediction-related problems in other fields (e.g., drug interactions).
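The negative-sampling convention noted in the discussion above can be illustrated with a short sketch. It is not the authors' code: the function name `sample_negatives`, the toy interaction list, and the fixed seed are assumptions made for illustration only.

```python
import random

def sample_negatives(n_drugs, n_targets, known_pairs, n_samples, seed=0):
    """Draw drug-target pairs with no known interaction as presumed negatives.

    Pairs returned here are only *presumed* negative: any of them may be an
    interaction that has simply not been reported yet.
    """
    known = set(known_pairs)
    candidates = [(d, t) for d in range(n_drugs) for t in range(n_targets)
                  if (d, t) not in known]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, min(n_samples, len(candidates)))

# Toy example: 3 drugs, 3 targets, 3 known interactions (made-up indices).
known = [(0, 0), (0, 1), (1, 2)]
negatives = sample_negatives(n_drugs=3, n_targets=3,
                             known_pairs=known, n_samples=4)
```

Sampling as many negatives as positives is one common way to limit class imbalance, but it does not remove the label noise described above.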
Experimental procedures
Resource availability
Lead contact
Further information and requests for code and data should be directed to and will be fulfilled by the lead contact, Yue Gao (gaoyue@tsinghua.edu.cn).
Materials availability
This study did not generate any physical materials.
The framework of the HHDTI
The framework of the proposed HHDTI is shown in Figure 1. Taking the biological hypergraphs as input, HHDTI outperforms other state-of-the-art methods by simultaneously optimizing both the high-order association capture process and the DTI prediction model in an end-to-end manner. We first construct hypergraphs to model the biological network and then employ a structural embedding learning framework to capture the high-order correlations and generate structural embeddings for both targets and drugs. The interaction likelihood between a given drug and target is predicted by estimating the similarity of their structural embeddings. Specifically, for drug i and target j, the DTI score can be computed as $A_{ij} = \mathrm{sigmoid}\left(z^{d}_{i} (z^{t}_{j})^{\top}\right)$, where $z^{d}_{i}$ and $z^{t}_{j}$ denote the drug structural embeddings and target structural embeddings, respectively. These low-dimensional structural embeddings, $Z^{d}$ or $Z^{t}$, are generated by fusing key and side embeddings with a bi-embedding attention fusion module; drug (target) structural embeddings $Z^{d}$ ($Z^{t}$) are generated by fusing the key drug (target) embeddings $Z^{d}_{key}$ ($Z^{t}_{key}$) and the side drug (target) embeddings $Z^{d}_{side}$ ($Z^{t}_{side}$).
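As a minimal sketch of the scoring step described above, the snippet below computes an interaction score as the sigmoid of the inner product of a drug embedding and a target embedding. The function names and the two-dimensional toy embeddings are assumptions for illustration; in HHDTI the embeddings would come from the trained model.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dti_score(drug_emb, target_emb):
    """Interaction score: sigmoid of the inner product of two embeddings."""
    return sigmoid(sum(a * b for a, b in zip(drug_emb, target_emb)))

def score_matrix(drug_embs, target_embs):
    """Score every drug against every target (rows: drugs, columns: targets)."""
    return [[dti_score(d, t) for t in target_embs] for d in drug_embs]

# Toy structural embeddings (made-up values, two dimensions each).
drug_embs = [[1.0, 0.0], [0.0, -1.0]]
target_embs = [[2.0, 0.0], [0.0, 2.0]]
scores = score_matrix(drug_embs, target_embs)
```

Embeddings that point in the same direction give scores above 0.5, orthogonal embeddings give exactly 0.5, and opposing embeddings give scores below 0.5.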
Heterogeneous hypergraph modeling of biological networks
Biological networks in this work present both direct and indirect relationships between drugs and targets. A heterogeneous biological network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ refers to a biological network containing multiple types of vertices and edges, where $\mathcal{V}$ represents the set of vertices and $\mathcal{E}$ represents the set of edges. In our biological network, the set of vertex types is {drug, target, disease}, and the set of correlation types is {drug-target interaction, target-drug interaction, drug-disease association, target-disease association}. Given the different types of correlations, a heterogeneous multiple hypergraph $\mathcal{G}^{r} = \{\mathcal{V}^{r}, \mathcal{E}^{r}\}$ with M vertices and N hyperedges is constructed to model the biological networks, where r indexes the type of correlation and r = 1, 2, 3, 4. The heterogeneous hypergraph modeling of the biological networks is illustrated in Figure 1A. Each correlation type yields an individual sub-hypergraph, giving four sub-hypergraphs in total. The result of the heterogeneous hypergraph modeling is four incidence matrices, which can be represented by $H^{r} \in \{0, 1\}^{M \times N}$, where $H^{r}_{ij} = 1$ if vertex i is connected to hyperedge j and $H^{r}_{ij} = 0$ otherwise. We thus obtain four incidence matrices ($H^{1}$, $H^{2}$, $H^{3}$, $H^{4}$), one per correlation type, based on $\mathcal{G}^{r}$. Both drugs and targets employ the same structural embedding learning framework to generate the structural embeddings. For conciseness, we next present how drug structural embeddings are generated from this framework.
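The incidence-matrix construction described above can be sketched as follows. This is an illustrative example only: the helper names and the toy interaction list are assumptions, and treating the target-drug matrix as the transpose of the drug-target matrix follows from the description (drugs as vertices and targets as hyperedges in one view, the reverse in the other) rather than from the authors' code.

```python
def incidence_matrix(pairs, n_vertices, n_hyperedges):
    """Build H with H[i][j] = 1 if vertex i belongs to hyperedge j, else 0."""
    H = [[0] * n_hyperedges for _ in range(n_vertices)]
    for vertex, hyperedge in pairs:
        H[vertex][hyperedge] = 1
    return H

def transpose(M):
    """Swap the roles of vertices and hyperedges."""
    return [list(col) for col in zip(*M)]

def vertex_degrees(H):
    """Number of hyperedges each vertex belongs to."""
    return [sum(row) for row in H]

def hyperedge_degrees(H):
    """Number of vertices each hyperedge connects."""
    return [sum(col) for col in zip(*H)]

# Toy data: 3 drugs, 2 targets, known interactions as (drug, target) pairs.
dti_pairs = [(0, 0), (1, 0), (1, 1), (2, 1)]

# Drug-target view: drugs are vertices, each target is a hyperedge that
# connects all drugs interacting with it.
H_dt = incidence_matrix(dti_pairs, n_vertices=3, n_hyperedges=2)

# Target-drug view: targets are vertices, each drug is a hyperedge.
H_td = transpose(H_dt)
```

In this toy case, the hyperedge for target 0 connects drugs 0 and 1, so a single hyperedge expresses a relation among more than two vertices, which an ordinary pairwise edge cannot do.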
Drug structural embedding learning
We introduce a Bayesian deep generative model, a variational auto-encoder framework for unsupervised learning on hypergraph-structured data, to learn drug key embeddings from the drug-target interaction incidence matrix, and we employ the HGNN model to generate the drug side embeddings from the drug-disease association incidence matrix. For the drug-target interaction hypergraph with incidence matrix $H$, this Bayesian generative model is instantiated as a vertex encoder, which models the similarity and correlations of the drugs interacting with the same target. The vertex encoder (Figure 1B, vertex encoder) performs a nonlinear mapping from the observed space $H$ to the common latent space $\tilde{Z}$ by
$$\tilde{Z} = f\left(H W_{e} + b_{e}\right),$$
where $f$ is a nonlinear activation function that enables our model to approximate a nonlinear function. Based on our experiments (Figure S2), we adopted the hyperbolic tangent as the activation function because of its simplicity and superior performance. $W_{e} \in \mathbb{R}^{D_{in} \times D_{h}}$ and $b_{e} \in \mathbb{R}^{D_{h}}$ are the weight and bias learned by the encoder, and $D_{in}$ and $D_{h}$ are the dimensionalities of $H$ and $\tilde{Z}$, respectively. After obtaining $\tilde{Z}$, two individual fully connected layers are used to estimate the means μ and variances σ:
$$\mu = \tilde{Z} W_{\mu} + b_{\mu}, \qquad \sigma = \tilde{Z} W_{\sigma} + b_{\sigma},$$
where $W_{\mu}$, $W_{\sigma}$ and $b_{\mu}$, $b_{\sigma}$ are the learnable weights and biases, respectively. The dimensionality of the drug key embedding $Z^{d}_{key}$ is $D_{z}$, and we sample it by
$$Z^{d}_{key} = \mu + \sigma \odot \epsilon,$$
where ε ∼ N(0, I), and ⊙ stands for the element-wise product.
The key embeddings characterize the high-order topological correlations from the direct relationships between targets and drugs. Recent studies have found that integrating multiple types of information can improve prediction accuracy. For example, drug side effects are observable phenotypic effects resulting from drugs acting on genetic off-targets in the human body. Phenotypic side effect similarity can be used to infer whether two drugs share a target. Hu et al. found that targets can be used as bridges to link drugs and diseases.
Inspired by these studies, we integrated additional types of association correlations in HHDTI to provide complementary information so that the method can predict correctly even in extremely challenging cases such as the cold-start problem.
As shown in Figure 1B, we learn drug side embeddings from the drug-disease incidence matrix ($H$) to provide complementary information for the drug key embeddings. This is achieved by the HGNN model (Figure 1B, hypergraph convolutional layers). HGNN consists of hypergraph convolutional layers that encode high-order correlations:
$$Y = D_{v}^{-1/2} H D_{e}^{-1} H^{\top} D_{v}^{-1/2} X W,$$
where $D_{v}$ and $D_{e}$ are the diagonal degree matrices of the vertices and hyperedges, respectively, with $d(v) = \sum_{e} H_{ve}$ being the degree of vertex $v$ and $\delta(e) = \sum_{v} H_{ve}$ being the degree of hyperedge $e$. X denotes the vertex features, W is the learnable weight matrix, and $(\cdot)^{\top}$ is the transposition operator.
The output of the HGNN model is the side embeddings, which represent high-order correlations. The adopted HGNN has two hypergraph convolutional layers. Taking the drug side embedding learning on the drug-disease incidence matrix $H$ as an example, each layer can be formulated as
$$X^{(l)} = f\left(D_{v}^{-1/2} H D_{e}^{-1} H^{\top} D_{v}^{-1/2} X^{(l-1)} W^{(l-1)}\right),$$
where $X^{(l-1)}$, $X^{(l)}$, and $W^{(l-1)}$ are the input, output, and trainable weight matrix of the (l-1)-th layer, respectively, and $f$ is a nonlinear activation function. The vertex feature X represents the inherent properties of the drugs, and we replaced it with an identity matrix for the input $X^{(0)}$.
Then, we employ attention modules to fuse the key and side embeddings into a shared vector space to construct low-dimensional structural embeddings. We propose the bi-embedding attention fusion (Figure 1B, attention), which computes coefficients $\alpha_{k}$ that give different weights to the key embeddings and side embeddings. Here $Z_{k}$ stands for the key embeddings or the side embeddings; $W_{k}$, $b_{k}$, and $P_{k}$ are trainable parameters for the embeddings $Z_{k}$; and $d$ is the dimensionality of the trainable parameters. The overall structural embeddings are obtained by
$$Z = \alpha_{key} Z_{key} + \alpha_{side} Z_{side},$$
where $Z_{key}$ and $Z_{side}$ are the key and side embeddings, respectively.
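A hypergraph convolutional layer of the kind described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it follows the standard HGNN propagation rule with unit hyperedge weights, uses tanh as the activation, and relies on made-up toy weights; the helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def hgnn_layer(H, X, W, activation=math.tanh):
    """One layer: activation(Dv^-1/2 H De^-1 H^T Dv^-1/2 X W)."""
    n_v, n_e = len(H), len(H[0])
    dv = [sum(row) for row in H]            # vertex degrees
    de = [sum(col) for col in zip(*H)]      # hyperedge degrees
    # Isolated vertices or empty hyperedges get a zero scaling factor.
    inv_sqrt_dv = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in dv]
    inv_de = [1.0 / d if d > 0 else 0.0 for d in de]
    # left = Dv^-1/2 H De^-1 (vertices x hyperedges)
    left = [[inv_sqrt_dv[i] * H[i][j] * inv_de[j] for j in range(n_e)]
            for i in range(n_v)]
    # right = H^T Dv^-1/2 (hyperedges x vertices)
    right = [[H[i][j] * inv_sqrt_dv[i] for i in range(n_v)]
             for j in range(n_e)]
    propagation = matmul(left, right)       # vertices x vertices
    out = matmul(matmul(propagation, X), W)
    return [[activation(v) for v in row] for row in out]

# Toy drug-disease incidence matrix: 3 drugs (vertices), 2 diseases (hyperedges).
H = [[1, 0], [1, 1], [0, 1]]
X0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity features
W0 = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.6]]               # made-up weights
W1 = [[0.7, 0.1], [-0.3, 0.9]]                            # made-up weights

# Two stacked layers, as in the text; the output plays the role of side embeddings.
side_embeddings = hgnn_layer(H, hgnn_layer(H, X0, W0), W1)
```

Each layer first gathers vertex features into the hyperedges they belong to and then redistributes them to the vertices, so two drugs that share a disease exchange information even if they have no direct link.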
Target structural embedding learning
By contrast, the target structural embedding learning uses the target-drug interaction hypergraph and the target-disease association hypergraph as inputs. It models the similarity and correlations of the targets interacting with the same drug to generate the target key embeddings through a vertex encoder (with the same structure as the vertex encoder in drug structural embedding learning). It also uses the HGNN model to generate the target side embeddings from the target-disease incidence matrix and fuses the target key embeddings and target side embeddings by bi-embedding attention fusion to obtain the target structural embeddings $Z^{t}$.
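Because the drug and target branches share the same vertex encoder and fusion step, both can be illustrated with one sketch. The code below is a simplified illustration, not the authors' implementation. It assumes that the second encoder head outputs log σ (a common way to keep σ positive) and that the attention score of an embedding is a learned vector applied to its tanh, which is a simplified stand-in for the trainable attention parameters mentioned in the text. All names and numbers are hypothetical.

```python
import math
import random

def linear(X, W, b):
    """Row-wise affine map X W + b for list-of-lists matrices."""
    cols = list(zip(*W))
    return [[sum(x * w for x, w in zip(row, col)) + bias
             for col, bias in zip(cols, b)] for row in X]

def vertex_encoder(H, params, rng):
    """Encode incidence rows and sample key embeddings by reparameterization."""
    hidden = [[math.tanh(v) for v in row]
              for row in linear(H, params["W_enc"], params["b_enc"])]
    mu = linear(hidden, params["W_mu"], params["b_mu"])
    # Assumption: this head predicts log(sigma), so sigma is always positive.
    sigma = [[math.exp(v) for v in row]
             for row in linear(hidden, params["W_sigma"], params["b_sigma"])]
    # z = mu + sigma * eps with eps ~ N(0, 1), element-wise.
    z = [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(m_row, s_row)]
         for m_row, s_row in zip(mu, sigma)]
    return z, mu, sigma

def attention_fuse(z_key, z_side, p_key, p_side):
    """Fuse key and side embeddings with softmax-normalized weights per vertex."""
    fused = []
    for k, s in zip(z_key, z_side):
        e_key = sum(p * math.tanh(v) for p, v in zip(p_key, k))
        e_side = sum(p * math.tanh(v) for p, v in zip(p_side, s))
        m = max(e_key, e_side)                  # subtract max for stability
        w_key, w_side = math.exp(e_key - m), math.exp(e_side - m)
        a_key = w_key / (w_key + w_side)        # a_side = 1 - a_key
        fused.append([a_key * a + (1.0 - a_key) * b for a, b in zip(k, s)])
    return fused

# Toy target-drug incidence matrix: 2 targets (vertices), 3 drugs (hyperedges).
H_td = [[1, 1, 0], [0, 1, 1]]
params = {
    "W_enc": [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]], "b_enc": [0.0, 0.1],
    "W_mu": [[1.0, 0.0], [0.0, 1.0]], "b_mu": [0.0, 0.0],
    "W_sigma": [[0.1, 0.0], [0.0, 0.1]], "b_sigma": [-1.0, -1.0],
}
rng = random.Random(0)
z_key, mu, sigma = vertex_encoder(H_td, params, rng)
z_side = [[0.1, -0.2], [0.3, 0.4]]              # stand-in for HGNN output
z_target = attention_fuse(z_key, z_side, p_key=[0.5, -0.5], p_side=[0.2, 0.1])
```

Because the two attention weights are positive and sum to one, every coordinate of the fused embedding lies between the corresponding key and side coordinates.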
DTI prediction
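The DTI predictions are produced from the reconstruction space A, which is obtained by computing the likelihood of the drug and target structural embeddings:
$$A = \mathrm{sigmoid}\left(Z^{d} (Z^{t})^{\top}\right),$$
where sigmoid(·) is the sigmoid activation function. We optimize the variational lower bound $\mathcal{L}$:
$$\mathcal{L} = \mathbb{E}_{q}\left[\log p(A \mid Z^{d}, Z^{t})\right] - \beta\,\mathrm{KL}\left[q(Z^{d}_{key}) \,\|\, p(Z^{d}_{key})\right] - \beta\,\mathrm{KL}\left[q(Z^{t}_{key}) \,\|\, p(Z^{t}_{key})\right],$$
where KL[q(·)||p(·)] is the Kullback-Leibler divergence between q(·) and p(·). Varying β encourages different learned representations by changing the degree of applied learning pressure during training. Following the variational autoencoder, we take Gaussian priors $p(Z^{d}_{key}) = \mathcal{N}(0, I)$ and $p(Z^{t}_{key}) = \mathcal{N}(0, I)$. E[log p(·|·)] is the likelihood of the reconstruction space A learned by HHDTI.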
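The β-weighted variational lower bound can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' training code: the reconstruction term is taken to be a Bernoulli log-likelihood over the entries of A, σ is treated as a standard deviation, and the KL term uses the closed form for a diagonal Gaussian against a standard normal prior. All values are made up.

```python
import math

def bernoulli_log_likelihood(A_true, A_pred, eps=1e-12):
    """Sum of y*log(p) + (1-y)*log(1-p) over all drug-target entries."""
    total = 0.0
    for true_row, pred_row in zip(A_true, A_pred):
        for y, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)     # guard against log(0)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over all entries."""
    total = 0.0
    for mu_row, sigma_row in zip(mu, sigma):
        for m, s in zip(mu_row, sigma_row):
            total += 0.5 * (m * m + s * s - 1.0 - math.log(s * s))
    return total

def elbo(A_true, A_pred, mu_d, sigma_d, mu_t, sigma_t, beta=1.0):
    """Reconstruction likelihood minus beta-weighted KL terms (to be maximized)."""
    recon = bernoulli_log_likelihood(A_true, A_pred)
    kl = kl_to_standard_normal(mu_d, sigma_d) + kl_to_standard_normal(mu_t, sigma_t)
    return recon - beta * kl

# Toy values: 2 drugs x 2 targets, 2-dimensional key embeddings.
A_true = [[1, 0], [0, 1]]
A_pred = [[0.9, 0.2], [0.1, 0.8]]
mu_d, sigma_d = [[0.1, -0.2], [0.0, 0.3]], [[1.0, 0.9], [1.1, 1.0]]
mu_t, sigma_t = [[0.2, 0.1], [-0.1, 0.0]], [[0.8, 1.0], [1.0, 1.2]]

value_beta1 = elbo(A_true, A_pred, mu_d, sigma_d, mu_t, sigma_t, beta=1.0)
value_beta2 = elbo(A_true, A_pred, mu_d, sigma_d, mu_t, sigma_t, beta=2.0)
```

A larger β penalizes posteriors that drift from the prior more heavily, which is the "learning pressure" referred to above; in practice the negative of this quantity would be minimized by gradient descent.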
Model evaluation metrics
We used two evaluation metrics, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR), to evaluate prediction performance. A confusion matrix is shown in Figure S3. In the receiver operating characteristic (ROC) space, each point on the ROC curve is a pair of x and y values, where x is the false-positive rate (FPR) and y is the true-positive rate (TPR). We connected all points obtained by changing the cutoff to create the ROC curve:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}},$$
where true-positives (TPs) and false-positives (FPs) are positive samples correctly predicted as positive and negative samples incorrectly predicted as positive, respectively. True-negatives (TNs) are negatives correctly identified as negative. False-negatives (FNs) correspond to positives incorrectly predicted as negative.
The precision-recall curve is plotted in a comparable way to the ROC curve but with the x axis being recall and the y axis being precision:
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$
As discussed in previous work, AUPR can provide a better assessment when the test data are highly skewed (supplementary note).
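The construction of both curves from a list of scores can be sketched as below. This is an illustrative example, not the evaluation code used in the study: the labels and scores are made up, both areas are computed with the trapezoidal rule, and the precision-recall curve is started at recall 0 with the precision of the highest cutoff, which is only one of several conventions (library implementations may interpolate differently).

```python
def confusion_counts(labels, scores, cutoff):
    """Count TP, FP, TN, FN when predicting positive for score >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def curve_points(labels, scores):
    """ROC points (FPR, TPR) and PR points (recall, precision) over all cutoffs.

    Assumes the labels contain at least one positive and one negative.
    """
    roc, pr = [(0.0, 0.0)], []
    for cutoff in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, cutoff)
        tpr = tp / (tp + fn)                    # also the recall
        fpr = fp / (fp + tn)
        precision = tp / (tp + fp)
        roc.append((fpr, tpr))
        pr.append((tpr, precision))
    pr.insert(0, (0.0, pr[0][1]))               # anchor the PR curve at recall 0
    return roc, pr

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy predictions: 3 positives and 2 negatives (made-up scores).
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
roc, pr = curve_points(labels, scores)
auroc = trapezoid_area(roc)
aupr = trapezoid_area(pr)
```

In this toy case, five of the six positive-negative pairs are ranked correctly, so the AUROC equals 5/6.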
Datasets
The three public datasets proposed in DTINet, deepDTnet, and TriModel (named DTINet_17, deepDTnet_20, and KEGG_MED, respectively), as well as the Target Drug-UniProt Links (approved) from the DrugBank database (version 5.1.0), were used for evaluation.
The data in DTINet_17 were collected from public databases. Drug vertices, protein vertices, and disease vertices were obtained from the DrugBank database (version 3.0), the HPRD database (release 9), and CTD, respectively. The known DTIs were imported from the DrugBank database (version 3.0), and the drug-disease and target-disease associations were extracted from the CTD.
The deepDTnet_20 dataset was also derived by integrating information from multiple databases. The DTIs were collected from the DrugBank database (version 4.3), the Therapeutic Target Database, and the PharmGKB database. The drug-disease association information came from the DrugBank database (version 4.3), DrugCentral, and repoDB. The target-disease association data were integrated from the bioinformatics data sources CTD and HuGE Navigator.
The KEGG_MED dataset is larger than the above two datasets and was extracted from multiple databases, including KEGG, the DrugBank database, InterPro, and UniProt.
The Target Drug-UniProt Links (approved) dataset was extracted from the DrugBank database (version 5.1.0).
More specific information regarding the four datasets is shown in Table S3. For more information about the datasets, please refer to the works of DTINet, deepDTnet, TriModel, and the DrugBank database (version 5.1.0).
Statistical analysis
All statistical analyses were performed using GraphPad Prism software (version 8.0.2). The data shown in the study were obtained from at least five independent experiments. Values in different experimental groups are expressed as the mean ± standard deviation. p < 0.05 was considered statistically significant.