
A Machine Learning-Based Biological Drug-Target Interaction Prediction Method for a Tripartite Heterogeneous Network.

Abstract

Drug repositioning is the identification of interactions between drugs and target proteins in pharmaceutical sciences. Traditional large-scale validation through chemical experiments is time-consuming and expensive, while drug repositioning can drastically decrease the cost and duration of traditional drug development. With the rapid advancement of high-throughput technologies and the explosion of various biological and medical data, computational drug repositioning methods have been used to systematically identify potential drug-target interactions. Some of them are based on a particular class of machine learning algorithms called kernel methods. In this paper, we propose a new machine learning prediction method that combines multiple kernels in a tripartite heterogeneous drug-target-disease interaction space in order to integrate multiple sources of biological information simultaneously. This novel network algorithm extends the traditional drug-target interaction bipartite graph with a third, disease layer. Gaussian kernel functions on heterogeneous networks and the regularized least square method of the Kronecker product are then used to predict new drug-target interactions. The values of AUPR (the area under the precision-recall curve) and AUC (the area under the receiver operating characteristic curve) of the proposed algorithm are significantly improved. In particular, the AUC values are improved to 0.99, 0.99, 0.97, and 0.96 on four benchmark data sets. These experimental results substantiate that the network topology can be used for predicting drug-target interactions.


Year: 2021 PMID: 33553921 PMCID: PMC7860102 DOI: 10.1021/acsomega.0c05377

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

In the past few decades, financial investment in drug research and development has increased dramatically. However, the increasing demand for new drugs still cannot be met.[1] Drug repositioning is a creative and resourceful approach to increase the number of therapies by exploiting available and approved drugs.[2] Since the safety and effectiveness of these drugs for specific indications have already been tested and approved, the investment risk can also be greatly reduced.[3,4] Drug repositioning aims to identify new therapeutic opportunities for existing drugs, which can reduce the time, costs, and risk of traditional drug development and shorten the period of drug approval and launch.[5] Many successful cases of drug repositioning have inspired the global pharmaceutical industry to explore new uses of existing drugs.[6] Moreover, research on drug repositioning can provide biologists and pharmacists with drug–target interaction candidates for further research and clinical trials.[7] Traditional methods are mainly divided into target-based prediction methods and ligand-based prediction methods.[8−14] Target-based prediction methods include docking and reverse docking, but they need complete 3D structures of the target, and they perform poorly on unknown targets. Ligand-based prediction methods mainly rely on comprehensive information about the drug, so they are ineffective when the ligand information is inadequate.[15−19] Recently, machine learning has been widely applied in many fields.[20−24] Machine learning methods and network-based inference methods have been successfully introduced in drug repositioning.
Machine learning methods can extract topological structural features, and these features can be used to calculate the similarities of the drugs and the targets.[25] Chemical–chemical interactions and chemical–protein interactions have been used to select candidate drugs associated with approved lung cancer drugs and related genes,[26] with a permutation test and the K-means clustering algorithm introduced to exclude candidate drugs with a low possibility of treating lung cancer. Two distributed label propagation algorithms for heterogeneous networks, named DHLP-1 and DHLP-2, were developed, and their efficiency was measured on a biological network consisting of drugs, diseases, and targets.[27] A method named deepDR[28] was proposed to learn high-level features of drugs from heterogeneous networks by a multimodal deep autoencoder; it achieved a mean AUC (the area under the receiver operating characteristic, ROC, curve) of 0.908. The network-based prioritization method ProphNet[29] was developed to integrate data from complex networks involving a range of types of interactions. It achieved a mean AUC value of 0.9552 ± 0.0015 in fivefold cross-validation tests. An analysis of the tripartite network found it to have a stable structure, and simulated network growth was accompanied by a steady increase in assortativity.[30] The computational tool DR2DI[31] was presented to infer new candidate associations between drugs and diseases through a series of steps of a regularized kernel classifier, a semi-supervised and global learning algorithm. However, with the increasing number of drugs and targets, more than 10^16 units of storage space are still required in the intermediate calculation, because the numbers of drugs and targets reach 10^4–10^5. Such a large matrix is extremely difficult to handle at the present computing level.
The main contributions of this paper are summarized as follows. First, we propose a tripartite heterogeneous network model based on the former bipartite graph, which extends the conventional drug repositioning model to three layers: disease–target–drug. The disease–target and drug–target layers form separate bipartite graph models, and the target layer serves as the intermediate layer between the disease layer and the drug layer. Second, based on the tripartite heterogeneous network, we apply the Gaussian kernel function to construct a three-layer similarity space. Third, we use the Kronecker product regularized least square method to make the final prediction (termed THN_KRLS, tripartite heterogeneous network Kronecker product regularized least square method). We compare the results of THN_KRLS with two recent, efficient two-layer algorithms, RLS-Kron[32] and FLapRLS.[33] Experimental results show that the THN_KRLS method performs well. The values of AUC are 0.99, 0.99, 0.97, and 0.96 on four benchmark data sets. The value of sensitivity is increased by 0.181 on the GPCR data set, and the values of AUPR on the GPCR data set and nuclear receptor data set are increased by 0.14 and 0.256, respectively. The rest of this paper is organized as follows. The Related Work section introduces some related work, and the Relevant Model and Methods section presents the general framework and relevant methods in detail. In the Experimental Results and Analysis section, the performance of our proposed THN_KRLS method is evaluated through extensive experiments. Some discussion is provided in the Discussion section, and the Conclusions section concludes this paper.

Related Work

Data Sets

In our experiments, we use data sets with different focuses to build a tripartite heterogeneous network model and evaluate the performance of THN_KRLS on benchmark data sets. These benchmark data sets are named after four main target classes: enzymes, ion channels, GPCRs (G protein-coupled receptors), and nuclear receptors. Table 1 shows the standard data sources we used during the experiment. For the disease layer, the main data source is DisGeNET.

Table 1

Sources and Verification of Data Sets

resource	description	URL	drug-related entities
DrugBank	free access database with comprehensive drug data	https://www.drugbank.ca/	drug and drug–target data
Kegg	open access database for molecular-level information	https://www.kegg.jp/	system information, health information, genomic information, and chemical information
UniProt	free accessible protein sequence and annotation database	https://www.uniprot.org	UniProt knowledgebase, UniProt reference cluster, and UniProt archive
OMIM	free access compendium for Mendelian disorder	http://www.omim.org/	phenotypic and genotypic information for human disease
DisGeNET	free access human disease database	https://www.disgenet.org/	genotype and phenotype relationships for diseases–diseases and diseases–target
ChEMBL	free access drug and target database	https://www.ebi.ac.uk/chembl/	bioactivity and genomic data to aid the translation of genomic information into effective new drugs

In order to compare and evaluate the performance with other algorithms, we used the benchmark data sets proposed in ref 34. Table 2 gives the benchmark data sets, which are downloaded from ChEMBL.[34]

Table 2

Benchmark Data Sets

data set	drugs	targets	n_d/n_t	interactions
enzyme	445	664	0.67	2926
ion channel	210	204	1.03	1476
GPCR	223	95	2.35	635
nuclear receptor	54	26	2.08	90
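
The n_d/n_t column of Table 2 is the ratio of drugs to targets. As a small illustration, the sketch below recomputes that ratio from the table and also derives the fraction of all possible drug–target pairs that are known interactions; this density figure is not reported in the paper and is computed here only to show how sparse the benchmark matrices are.

```python
# Benchmark statistics copied from Table 2: (drugs, targets, interactions).
benchmarks = {
    "enzyme": (445, 664, 2926),
    "ion channel": (210, 204, 1476),
    "GPCR": (223, 95, 635),
    "nuclear receptor": (54, 26, 90),
}


def summarize(n_drugs, n_targets, n_interactions):
    """Return (n_d / n_t, fraction of drug-target pairs with a known interaction)."""
    ratio = n_drugs / n_targets
    density = n_interactions / (n_drugs * n_targets)
    return ratio, density


for name, stats in benchmarks.items():
    ratio, density = summarize(*stats)
    print(f"{name}: n_d/n_t = {ratio:.2f}, density = {density:.4f}")
```

In every data set only a small fraction of the possible pairs is labeled as interacting, which is one reason why AUPR is reported alongside AUC later in the paper.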

Another database that needs to be mentioned here is DrugBank.[35] The version we used is 5.1.4, which was released on 2019/07/02. It contains 13,463 drug entries, including 2621 approved small-molecule drugs, 1349 approved biological preparations (proteins, peptides, vaccines, and allergens), 130 nutritional drugs, and more than 6350 experimental (discovery-stage) drugs.

Tripartite Heterogeneous Network

Based on the related ideas of pharmacology, the therapeutic effect of a single drug is relatively limited for complex diseases with multiple pathologies.[36−38] Therefore, from the pharmacological perspective, it is necessary to set up a network of multiple drugs acting on the multiple target proteins involved in complex diseases.[39−41] Hence, drug repositioning can be formulated as the graph-theoretic problem of predicting missing edges, that is, predicting possible edges on a bipartite network. Figure 1a is part of the visualization of the ion channel data set, in which the red nodes are drugs and the green nodes are targets; a line from a red node to a green node indicates a drug–target interaction. Figure 1b is the bipartite graph model of a part of Figure 1a; in Figure 1b, the red nodes are targets and the black nodes are drugs.

Figure 1

Example of a bipartite graph model for drug–target interactions.

In this paper, we extend the previous drug–target bipartite structure model to a tripartite drug–target–disease heterogeneous network. It is worth mentioning that the disease layer plays a role in setting up the target–target interaction space. Additionally, drug repositioning aims to discover whether old drugs can act on new targets, and the tripartite heterogeneous structure provides a new direction for this: we can predict new drug–disease interactions directly through the target layer without loss. In a tripartite heterogeneous network, we mainly focus on the drug–target interaction. The disease layer and the similarity of drug chemical structures are introduced as auxiliary characteristic matrices. We hope to show the diversity of prediction results on multiple characteristics. For example, a new drug–target interaction may be predicted because two drugs have similar chemical structures or because two targets were previously related to the same disease. However, it is still difficult for computers to predict new drugs or targets from scratch, since in an abstract network a new drug or new target can only be introduced as an isolated node, and it is difficult to explain from the network structure alone why it would interact with other drugs or targets. That is the reason why the chemical structure similarity and the disease layer are introduced to construct a tripartite heterogeneous network structure. Therefore, we construct a tripartite network that includes three types of vertices: drugs, targets, and diseases. Correspondingly, two types of associations, drug–target interactions and disease–target interactions, are used as the edges that connect the vertices, as Figure 2 shows. The network is constructed from the knowledge (i.e., associations) in two existing knowledge bases, ChEMBL and DisGeNET (both listed in Table 1).
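
One way to picture the tripartite structure described above is as two binary adjacency matrices that share the target layer. The sketch below is only an illustration: the node names and edges are hypothetical toy data, and the `adjacency` helper is not from the paper.

```python
import numpy as np

# Hypothetical toy layers: 3 drugs, 4 targets, 2 diseases.
drugs = ["d1", "d2", "d3"]
targets = ["t1", "t2", "t3", "t4"]
diseases = ["r1", "r2"]

# Hypothetical known associations (the edges between layers).
drug_target_edges = [("d1", "t1"), ("d2", "t1"), ("d2", "t3"), ("d3", "t4")]
disease_target_edges = [("r1", "t1"), ("r1", "t2"), ("r2", "t3")]


def adjacency(rows, cols, edges):
    """Build a binary (len(rows) x len(cols)) adjacency matrix from an edge list."""
    row_index = {name: i for i, name in enumerate(rows)}
    col_index = {name: j for j, name in enumerate(cols)}
    matrix = np.zeros((len(rows), len(cols)), dtype=int)
    for r, c in edges:
        matrix[row_index[r], col_index[c]] = 1
    return matrix


# The two bipartite graphs of the tripartite network share the target layer.
Y_drug_target = adjacency(drugs, targets, drug_target_edges)
Y_disease_target = adjacency(diseases, targets, disease_target_edges)

# Entry (i, j) counts the diseases associated with both target i and target j.
shared_diseases = Y_disease_target.T @ Y_disease_target
print(shared_diseases)
```

In this toy network, t1 and t2 are linked through disease r1 even though no drug connects them, which is the kind of target–target information the disease layer contributes.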

Figure 2

Tripartite heterogeneous network model.

Table 3

Results of FLapRLS, RLS-Kron, and THN_KRLS

data sets	method	AUC	sensitivity	specificity	AUPR
enzyme	FLapRLS	0.985	0.913	0.999*	0.92
	RLS-Kron	0.978	0.905	0.997	0.915
	THN_KRLS	0.99*	0.979*	0.998	0.99*
ion channel	FLapRLS	0.991*	0.688	0.986	0.89
	RLS-Kron	0.984	0.721	0.98	0.943
	THN_KRLS	0.99	0.977*	0.998*	0.99*
GPCR	FLapRLS	0.944	0.737	0.986	0.83
	RLS-Kron	0.954	0.753	0.975	0.79
	THN_KRLS	0.97*	0.934*	0.990*	0.97*
nuclear receptor	FLapRLS	0.746	0.52	0.915	0.608
	RLS-Kron	0.92	0.713	0.937	0.684
	THN_KRLS	0.96*	0.930*	0.993*	0.94*

For each data set, * indicates the highest value.

As can be seen in Figure 2, drug = {d_1, d_2, ..., d_m} represents the drug layer, target = {t_1, t_2, ..., t_n} represents the target layer, and disease = {r_1, r_2, ...} represents the disease layer. In this example, the solid lines represent the similarity within a layer, and the dotted lines represent the interactions between two layers.

Relevant Model and Methods

Similarity of Medicinal Chemical Structures

In order to obtain the final prediction score, this section analyzes the drug–target heterogeneous network structure and related biological characteristics from multiple dimensions to form multiple similarity matrices. Both the drug–target interactions and the disease–target interactions can be formulated as bipartite graphs. On the heterogeneous network, we use the Gaussian kernel to build the similarity matrix space and use the Kronecker product regularized least square classification to predict the drug–target interactions with the highest scores. Finally, a 10-fold cross-validation is performed on the results. In the THN_KRLS model, it is necessary to ensure that the chemical structure of every drug has a unique representation that is easy to handle. Because heterogeneous structures can exist for a chemical formula, the simplified molecular-input line-entry system (SMILES) is adopted here. That is, the molecular structure is described by an ASCII string so that each chemical structure has a unique corresponding string. The string is then converted into a 166-bit binary chemical fingerprint, in which each bit corresponds to a specific molecular feature. This also guarantees the uniqueness of the chemical structure. Finally, the Tanimoto coefficient is calculated from the binary vectors. The specific calculation formula is

$$S_{\mathrm{chem}}(d_x, d_y) = \frac{|f(d_x) \cap f(d_y)|}{|f(d_x) \cup f(d_y)|}$$

where $f(d_x)$ is the binary chemical fingerprint of drug $x$, and the intersection and union are taken over the bits set to 1. Hence, a matrix of chemical structure similarity is constructed for all drugs.
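
The Tanimoto step can be sketched as follows. This is a minimal illustration that assumes the fingerprints are already available as 0/1 vectors; the paper uses 166-bit fingerprints, whereas the toy vectors here have 8 bits, and the function names are hypothetical.

```python
import numpy as np


def tanimoto(fp_x, fp_y):
    """Tanimoto coefficient of two binary fingerprints: |x AND y| / |x OR y|."""
    fp_x = np.asarray(fp_x, dtype=bool)
    fp_y = np.asarray(fp_y, dtype=bool)
    union = np.logical_or(fp_x, fp_y).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return float(np.logical_and(fp_x, fp_y).sum() / union)


def chemical_similarity_matrix(fingerprints):
    """Symmetric drug-drug similarity matrix with ones on the diagonal."""
    fps = np.asarray(fingerprints, dtype=bool)
    n = fps.shape[0]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = tanimoto(fps[i], fps[j])
    return sim


# Toy 8-bit fingerprints for three hypothetical drugs.
toy_fingerprints = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1],
]
S_chem = chemical_similarity_matrix(toy_fingerprints)
print(S_chem)
```

The first two toy drugs share two set bits out of four distinct set bits, so their similarity is 0.5, while the third drug shares no bits with the first and scores 0.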

Drug/Target Gaussian Kernel Similarity

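The similarity of the drug chemical structure cannot be the only measure in the similarity matrix.[42] Therefore, the interaction between the drug and target must be analyzed and calculated based on the bipartite graph structure. The Gaussian kernel is defined as a unimodal function of the Euclidean distance between any two points in the space. The specific calculation formulas are as follows:

$$K_d(d_i, d_j) = \exp\left(-\gamma_d \lVert y_{d_i} - y_{d_j} \rVert^2\right), \qquad \gamma_d = \gamma_d' \Big/ \left(\frac{1}{m}\sum_{i=1}^{m} \lVert y_{d_i} \rVert^2\right)$$

$$K_t(t_i, t_j) = \exp\left(-\gamma_t \lVert y_{t_i} - y_{t_j} \rVert^2\right), \qquad \gamma_t = \gamma_t' \Big/ \left(\frac{1}{n}\sum_{i=1}^{n} \lVert y_{t_i} \rVert^2\right)$$

where $d_i$ is the ith drug in the drug set $D$ and $m$ is the size of the drug set, and $t_i$ is the ith target in the target set $T$ and $n$ is the size of the target set. The adjacency matrix $Y \in \mathbb{R}^{m \times n}$ represents the known drug–target interactions: if drug $d_i$ and target $t_j$ have a known interaction, then $y_{ij} = 1$; otherwise $y_{ij} = 0$. $y_{d_i} = \{y_{i1}, y_{i2}, ..., y_{in}\}$ is the correlation vector between drug $d_i$ and all targets, and $y_{t_j}$, the jth column of $Y$, is the corresponding vector between target $t_j$ and all drugs. $\gamma_d$ and $\gamma_t$ are the adjustment parameters that control the width of the kernel, where $\gamma_d'$ and $\gamma_t'$ are set to 1 according to the experience of using Gaussian kernels.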
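
As a hedged illustration of the Gaussian kernel described above, the sketch below computes it over the rows of a binary interaction matrix, normalizing the bandwidth by the mean squared norm of the interaction profiles as in the Gaussian interaction profile kernel of RLS-Kron. The function name `gip_kernel` and the toy matrix are assumptions made for this example.

```python
import numpy as np


def gip_kernel(Y, gamma_prime=1.0):
    """Gaussian kernel between the rows (interaction profiles) of a binary matrix Y."""
    Y = np.asarray(Y, dtype=float)
    sq_norms = (Y ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("Y contains no interactions, so the bandwidth is undefined")
    gamma = gamma_prime / mean_sq_norm
    # Squared Euclidean distances between all pairs of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y @ Y.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)


# Hypothetical 3-drug x 4-target interaction matrix.
Y = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
])
K_drug = gip_kernel(Y)      # drug-drug kernel, from the rows of Y
K_target = gip_kernel(Y.T)  # target-target kernel, from the columns of Y
print(K_drug)
```

Drugs with identical interaction profiles get a kernel value of 1, and the value decays as the profiles diverge. The same function can be applied to a disease–target matrix to obtain the disease-derived target kernel.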

Disease–Target Similarity

As with the drug–target similarity, the human disease data set describes the interactions between diseases and targets. The disease–target network is constructed from these interactions, and the Gaussian kernel is calculated in the same way. The specific calculation formulas are as follows:

$$K_s(ts_i, ts_j) = \exp\left(-\gamma_{ts} \lVert y_{ts_i} - y_{ts_j} \rVert^2\right), \qquad \gamma_{ts} = \gamma_{ts}' \Big/ \left(\frac{1}{k}\sum_{i=1}^{k} \lVert y_{ts_i} \rVert^2\right)$$

We set $S = \{ts_1, ts_2, ..., ts_k\}$ as the set of targets derived from the disease–target interactions, where $k$ is the number of targets; the adjacency matrix $Y \in \mathbb{R}^{k \times n}$ represents the known disease–target associations, where $n$ here is the number of diseases. If there is a known association between target $ts_i$ and disease $ds_j$, then $y_{ij} = 1$; otherwise $y_{ij} = 0$. $y_{ts_i} = \{y_{i1}, y_{i2}, ..., y_{in}\}$ is the correlation vector between target $ts_i$ and all diseases. $\gamma_{ts}$ is the adjustment parameter that controls the width of the kernel, and $\gamma_{ts}'$ is set to 1 according to the experience of using Gaussian kernels.

Similarity Matrix Fusion

We construct the kernel containing the spatial information of the drug and the target from the above multiple similarity matrices. Since the similarity matrix is not a positive definite matrix, a two-class prediction is required in the end. We linearly combine the drug chemical structure similarity matrix with the drug Gaussian kernel similarity matrix, and the target Gaussian kernel similarity matrix with the disease Gaussian kernel similarity matrix, and we set the weighting factors empirically. In the later experiments, we weight all similarity spaces equally, that is, the ratio 0.5:0.5 is used when combining the similarity matrices.
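
The equal-weight fusion can be sketched as a simple weighted sum. The pairing shown here (chemical similarity with the drug Gaussian kernel on the drug side) follows our reading of the paragraph above, and the function name and toy matrices are hypothetical.

```python
import numpy as np


def fuse(kernel_a, kernel_b, weight_a=0.5):
    """Weighted linear combination of two similarity matrices of equal shape."""
    kernel_a = np.asarray(kernel_a, dtype=float)
    kernel_b = np.asarray(kernel_b, dtype=float)
    if kernel_a.shape != kernel_b.shape:
        raise ValueError("similarity matrices must have the same shape")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return weight_a * kernel_a + (1.0 - weight_a) * kernel_b


# Hypothetical 2-drug example: chemical similarity and Gaussian kernel similarity.
S_chem = np.array([[1.0, 0.2], [0.2, 1.0]])
K_gip_drug = np.array([[1.0, 0.6], [0.6, 1.0]])

# The 0.5:0.5 ratio stated in the text.
W_d = fuse(S_chem, K_gip_drug, weight_a=0.5)
print(W_d)
```

The target-side matrix would be built the same way from the target Gaussian kernel and the disease-derived kernel. Other weights could be tried by changing `weight_a`, although the paper reports only the equal split.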

Regularized Least Square Method of the Kronecker Product

Since the similarity matrix is a non-positive definite square matrix, the multiple similarity matrices should be merged into one large similarity matrix. Therefore, the Kronecker product of the two matrices is combined with the regularized least square method. The final Kronecker product is expressed as $W = W_d \otimes W_t$, where $W$ is a matrix of size (MN × MN). Each entry of the matrix is the score of a drug pair $(d_i, d_j)$ multiplied by the score of a target pair $(t_k, t_l)$. Because the numbers of drugs M and targets N are on the order of $10^4$–$10^5$ in current external data sets, the Kronecker product is difficult to compute, store, and operate on directly. Therefore, it is necessary to perform eigenvalue decomposition here: $W_d = V_d \Lambda_d V_d^T$ and $W_t = V_t \Lambda_t V_t^T$, where $V$ is an orthogonal matrix composed of the eigenvectors of the drug or target similarity matrix and $\Lambda$ is a diagonal matrix composed of the corresponding eigenvalues. Therefore, the final Kronecker product is

$$W = W_d \otimes W_t = (V_d \otimes V_t)(\Lambda_d \otimes \Lambda_t)(V_d \otimes V_t)^T = V \Lambda V^T$$

where $V = V_d \otimes V_t$ and $\Lambda = \Lambda_d \otimes \Lambda_t$. However, the matrix size is still (MN × MN), and the result is still difficult to store and operate on. It is necessary to introduce the regularized least square method, which is defined as follows:

$$\mathrm{vec}(\hat{Y}^T) = W (W + \sigma I)^{-1} \mathrm{vec}(Y^T)$$

where $\sigma$ is the regularization parameter and $\mathrm{vec}(Y^T)$ is the column vector formed by stacking all the columns of the matrix $Y^T$. Substituting the eigendecomposition of $W$ into this definition gives

$$\mathrm{vec}(\hat{Y}^T) = V \Lambda V^T (V \Lambda V^T + \sigma I)^{-1} \mathrm{vec}(Y^T) = V \Lambda (\Lambda + \sigma I)^{-1} V^T \mathrm{vec}(Y^T)$$

According to the matrix equation property of the Kronecker product and the distribution of the transposition operation over the Kronecker product,

$$(A \otimes B)\,\mathrm{vec}(C) = \mathrm{vec}(B C A^T), \qquad (A \otimes B)^T = A^T \otimes B^T$$

Therefore, the prediction can be simplified as

$$\mathrm{vec}(\hat{Y}^T) = (V_d \otimes V_t)(\Lambda_d \otimes \Lambda_t)(\Lambda_d \otimes \Lambda_t + \sigma I)^{-1} \mathrm{vec}(V_t^T Y^T V_d)$$

Let $X = (\Lambda_d \otimes \Lambda_t)(\Lambda_d \otimes \Lambda_t + \sigma I)^{-1} \mathrm{vec}(V_t^T Y^T V_d)$. The matrix equation property can still be used when $X$ is a column vector: writing $\tilde{X}$ for the N × M matrix with $\mathrm{vec}(\tilde{X}) = X$, the relation $\mathrm{vec}(\hat{Y}^T) = (V_d \otimes V_t) X$ gives

$$\hat{Y}^T = V_t \tilde{X} V_d^T$$

If a drug–target pair has a higher prediction score, then it has a higher possibility of interaction. The prediction result is thus obtained.
Based on the above similarity matrices, the complexity of our algorithm is $O(n^2)$, where n is the sum of the sizes of the drug, target, and disease sets. In the later experiments, the regularized least square method of the Kronecker product is used. It is easy to see that the algorithm can be applied to large-scale data sets.
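
The eigendecomposition shortcut described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes `W_d` and `W_t` are symmetric and that no eigenvalue product equals −σ (which holds for positive semidefinite inputs with σ > 0), and the function name `kron_rls` and the toy inputs are hypothetical.

```python
import numpy as np


def kron_rls(W_d, W_t, Y, sigma=1.0):
    """Kronecker RLS scores without forming the (MN x MN) matrix W_d kron W_t.

    W_d: (M, M) symmetric drug similarity, W_t: (N, N) symmetric target
    similarity, Y: (M, N) binary drug-target adjacency matrix.
    """
    lam_d, V_d = np.linalg.eigh(W_d)
    lam_t, V_t = np.linalg.eigh(W_t)
    # Eigenvalues of the Kronecker product, arranged as an (M, N) grid.
    lam = np.outer(lam_d, lam_t)
    # Project Y into the joint eigenbasis, shrink each coefficient, project back.
    projected = V_d.T @ Y @ V_t
    shrunk = projected * lam / (lam + sigma)
    return V_d @ shrunk @ V_t.T


# Toy check against the direct (MN x MN) formula on small random inputs.
rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((4, 4))
W_d_toy = A @ A.T  # symmetric positive semidefinite by construction
W_t_toy = B @ B.T
Y_toy = (rng.random((3, 4)) > 0.5).astype(float)

fast = kron_rls(W_d_toy, W_t_toy, Y_toy, sigma=0.7)
W_full = np.kron(W_d_toy, W_t_toy)
# Row-major flattening of Y equals vec(Y^T), the stacked columns of Y^T.
direct = W_full @ np.linalg.solve(W_full + 0.7 * np.eye(12), Y_toy.reshape(-1))
print(np.allclose(fast, direct.reshape(3, 4)))
```

Only the two small eigendecompositions and a few (M × N) products are needed, so the (MN × MN) matrix is never stored; the direct formula is included here only to check the result on a toy problem.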

Experimental Results and Analysis

Evaluation Criteria

In order to compare the performance of these methods, we conduct systematic experiments to simulate the analysis process of biological data on four heterogeneous interaction networks. In each run of the method, each drug–target pair (interacting or non-interacting) in the test set is excluded by marking it as absent in the adjacency matrix. Then, we try to use the remaining data to restore its true label. Note that during the 10-fold cross-validation process, the matrix needs to be rebuilt based on the current training set, so the Gaussian kernel must be recalculated. We assess the performance of the methods with two widely used quality measures: AUC and AUPR (the area under the precision–recall curve). We compute the ROC (receiver operating characteristic) curve and regard the AUC as the main quality measurement. The precision–recall curve plots the precision, that is, the fraction of true positives among all positive predictions, at each given recall rate. The AUPR value provides a quantitative assessment. These two quality measurements have become the standard criteria for evaluating methods. To systematically evaluate the performance of these methods, we use 10-fold cross-validation. Before the experiment, a heterogeneous network model was constructed on 10 sets of drug–target interactions, and the top N rankings and thresholds were obtained. During each iteration, one set of drug–target interactions, combined with the top N rankings obtained after complete training, forms the test set, and the remaining nine sets are regarded as the training set. After the algorithm is run on the training set, the tested drug is sorted in descending order together with all other drugs according to its final association weight to the target. For each specific ranking threshold, if the test rank is higher than the threshold, then it is regarded as a true positive.
The number of true positives found among all possible drug–target interactions gives the true positive rate corresponding to the specified threshold. The ROC curve and the precision–recall curve are constructed from this, and the quality measures AUC and AUPR, the areas under these curves, are obtained. The relevant parameters of the Gaussian kernel function and the regularized least square method are set to 1 empirically. In Figure 3, the precision–recall curve of THN_KRLS is the blue curve, the result of FLapRLS is green, and RLS-Kron is red. The areas under the precision–recall curves of these three algorithms are listed in Table 3. Under the same conditions and evaluation criteria, the AUPR results obtained by THN_KRLS are significantly improved. In particular, the values of AUPR on the GPCR data set (Figure 3c) and nuclear receptor data set (Figure 3d) are increased by 0.14 and 0.256, respectively. This shows that THN_KRLS maintains a high accuracy of prediction.
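
The two quality measures can be computed from a list of true labels and prediction scores as sketched below. This is an illustrative implementation, not the evaluation code used in the paper: AUC is computed as the probability that a random positive outranks a random negative, and AUPR is estimated by average precision, which is one common step-wise estimate and an assumption here. The labels and scores are hypothetical.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (mean precision at each hit)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if labels.sum() == 0:
        raise ValueError("need at least one positive example")
    order = np.argsort(-scores, kind="stable")  # rank pairs by descending score
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())


# Hypothetical held-out pairs: 1 = known interaction, 0 = no known interaction.
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.1]
print(roc_auc(y_true, y_score), average_precision(y_true, y_score))
```

In a cross-validation run, these functions would be applied to the scores of the held-out pairs of each fold, after the kernels have been recomputed from the training part of the adjacency matrix.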

Figure 3

Precision–recall curves of the three methods. There is a significant improvement on the GPCR (c) and nuclear receptor (d) data sets.

In Figure 4, we use the same standard and colors to show the ROC curves of the different methods. For AUC values, the THN_KRLS algorithm still maintains a high performance. For example, on the GPCR data set (Figure 4c) and nuclear receptor data set (Figure 4d), the value is increased by 0.016 and 0.04, respectively. On the ion channel data set (Figure 4b), our algorithm is only 0.001 below the best existing method (FLapRLS). In general, we achieve better performance on most data sets.

Figure 4

ROC curve of THN_KRLS, FLapRLS, and RLS-Kron.

Based on the tripartite heterogeneous network structure and the Gaussian kernel similarity matrices of the three layers, the ROC curves and the AUC values are obtained on the four benchmark data sets. The AUC results are improved to 0.99, 0.99, 0.97, and 0.96 on the respective data sets. The value of sensitivity is increased by 0.181 on the GPCR data set. At the same time, the AUPR results are improved to 0.99, 0.99, 0.97, and 0.94. The FLapRLS authors have pointed out that, when the scores of true interactions need to be separated from the scores of true non-interactions, AUPR becomes the more important performance measure because it punishes false positives found among the top-ranked prediction scores. This supports the high performance of the THN_KRLS method. From Table 3 and the definitions of sensitivity and specificity, the ROC curve has a best cutoff point, with abscissa 1 − specificity and ordinate sensitivity. This is the point at which the ROC curve is closest to the upper left corner. Taking the (0, 1) point in the upper left corner as the center and the distance between the best cutoff point and (0, 1) as the radius forms a quarter circle. The ROC curve is tangent to the quarter circle at the best cutoff point, so the AUC value must be less than 1 minus the area of the quarter circle. From this, we can infer that some reported data may have problems. For example, from the values of the FLapRLS method on the ion channel data set in Table 3, the ROC curve has its best cutoff point at (0.014, 0.688), as shown in Figure 5. The AUC value must then be less than 1 minus the area of the quarter circle, which is approximately 0.923. However, the reported AUC value of FLapRLS is 0.991, which does not match the deduction.

Figure 5

Red dot in the figure is the best cutoff point (0.014, 0.688), the radius is about 0.3123, and the area of the quarter circle is 0.0766, so the maximum remaining AUC area is less than 0.9234.
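
The arithmetic behind this bound can be reproduced in a few lines. The sketch below assumes, as the argument in the text does, that the reported sensitivity and specificity correspond to the point of the ROC curve closest to (0, 1); the function name is hypothetical.

```python
import math


def auc_upper_bound(false_positive_rate, sensitivity):
    """Quarter-circle bound on AUC from the ROC point closest to the corner (0, 1).

    Returns (radius, quarter-circle area, upper bound on AUC). The bound only
    makes sense when the given point really is the closest point to (0, 1).
    """
    radius = math.hypot(false_positive_rate, 1.0 - sensitivity)
    quarter_circle_area = math.pi * radius ** 2 / 4.0
    return radius, quarter_circle_area, 1.0 - quarter_circle_area


# FLapRLS on the ion channel data set (Table 3): sensitivity 0.688, specificity 0.986.
radius, area, bound = auc_upper_bound(1.0 - 0.986, 0.688)
print(f"radius = {radius:.4f}, area = {area:.4f}, AUC bound = {bound:.4f}")
```

With these inputs the radius is about 0.3123 and the bound about 0.9234, matching the values in the caption of Figure 5 and falling below the reported AUC of 0.991.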

New Interaction Predictions

In order to analyze the practical relevance of the method for predicting novel drug–target interactions, we conduct an external data validation similar to that of FLapRLS. Table 4 shows the highest-ranking new interactions in the enzyme data set; the pairs labeled with a database name are the drug–target interactions verified in external data sets such as DrugBank. We verify the results on these data sets. Four interactions are confirmed to exist, which indicates that our predictions have a high confidence. In addition, since these data sets are not complete, the absence of a predicted drug–target pair from the verified data sets does not mean that the pair has no interaction in reality.

Table 4

Top 10 New Drug–Target Interactions on the Enzyme Data Set

rank	verified in	pair & name
1	DrugBank	D00394: trimipramine (DrugBank ID DB00726), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)
2	DrugBank	D00225: alprazolam (DrugBank ID DB00404), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)
3	DrugBank	D00380: tolbutamide (DrugBank ID DB01124), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)
4		D00394: trimipramine (DrugBank ID DB00726), Hsa:28: histo-blood group ABO system transferase (UniProtKB P16442)
5		D00225: alprazolam (DrugBank ID DB00404), Hsa:28: histo-blood group ABO system transferase (UniProtKB P16442)
6	DrugBank	D01071: hexobarbital (DrugBank ID DB01355), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)
7	DrugBank	D00380: tolbutamide (DrugBank ID DB01124), Hsa:28: histo-blood group ABO system transferase (UniProtKB P16442)
8	DrugBank	D00574: aminoglutethimide (DrugBank ID DB00357), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)
9		D01071: hexobarbital (DrugBank ID DB01355), Hsa:28: histo-blood group ABO system transferase (UniProtKB P16442)
10		D00139: methoxsalen (DrugBank ID DB00553), Hsa:1557: cytochrome P450 2C19 (UniProtKB P33261)

In the top 10 list of the enzyme data set, the verified target is cytochrome P450 2C19 (CYP2C19, UniProt ID: P33261).[43] Also, (S)-mephenytoin hydroxylase has been confirmed to be CYP2C19, which is involved in the metabolism of several clinically useful drugs.[44] Additionally, as a terminal oxygenase, it participates in the synthesis of sterol hormones in the organism. In recent years, its role in drug metabolism has been further studied. We show the verification results of the highest predicted rankings of the GPCR data set in Table 5. Current research indicates that the results we verified are credible. The top-ranked drug–target interaction is between darifenacin and muscarinic acetylcholine receptor M1. Muscarinic acetylcholine receptor excitation can cause bladder smooth muscle contraction and saliva secretion, which has been proven.[45]

Table 5

Top 10 New Drug–Target Interactions on the GPCR Data Set

rank	verified in	pair & name
1	DrugBank	D01699: darifenacin (DrugBank ID DB00496), Hsa:1128: muscarinic acetylcholine receptor M1 (UniProtKB P11229)
2		D00465: oxybutynin (DrugBank ID DB01062), Hsa:57105: cysteinyl leukotriene receptor 2 (UniProtKB Q9NS75)
3		D00465: oxybutynin (DrugBank ID DB01062), Hsa:10800: cysteinyl leukotriene receptor 1 (UniProtKB Q9Y271)
4		D00645: bretylium (DrugBank ID DB01158), Hsa:1128: muscarinic acetylcholine receptor M1 (UniProtKB P11229)
5	Kegg	D00765: rocuronium (DrugBank ID DB00728), Hsa:1128: muscarinic acetylcholine receptor M1 (UniProtKB P11229)
6	DrugBank	D01699: darifenacin (DrugBank ID DB00496), Hsa:1129: muscarinic acetylcholine receptor M2 (UniProtKB P08172)
7		D00465: oxybutynin (DrugBank ID DB01062), Hsa:134: adenosine receptor A1 (UniProtKB P30542)
8		D00645: bretylium (DrugBank ID DB01158), Hsa:1129: muscarinic acetylcholine receptor M2 (UniProtKB P08172)
9		D00465: oxybutynin (DrugBank ID DB01062), Hsa:135: adenosine receptor A2a (UniProtKB P29274)
10	DrugBank	D00765: rocuronium (DrugBank ID DB00728), Hsa:1129: muscarinic acetylcholine receptor M2 (UniProtKB P08172)

A significant fraction of the predictions (4 out of 10) is found in one or more of these data sets. It is worth mentioning that a large fraction of the interactions in these data sets are already included in the training data and hence are not counted as new interactions. Moreover, these data sets are incomplete, so if a predicted interaction is not present in one of the data sets used, it does not necessarily mean that it does not exist.

Discussion

We present a new kernel method that leads to good predictive performance on the task of predicting interactions between drugs and target proteins. Moreover, we demonstrate that the THN_KRLS method performs better than the other existing methods when known drug–target interactions are missing from the training data. This provides a practical assessment of the predictive power of THN_KRLS in real scenarios of drug–target interaction prediction. However, some problems remain to be solved in the future. First, it is an open problem how to apply classical graph-theoretic algorithms to the heterogeneous network, such as maximum matching on the bipartite graph and classical binary classification algorithms. When a new drug or target enters the heterogeneous network structure, the prediction must rely on existing information in the layers, for example, the similarity of drug–drug chemical structures; it is hard to detect the presence of new drugs or targets with these graph algorithms alone. Second, the features and parameters in our experiments are obtained only empirically or by ordinary weighting, which makes it difficult to decide whether the model is over-fitted or under-fitted. Additionally, there is no definite standard for deciding similarity in biological research. Finally, since the Gaussian kernel function only plays a role in detecting common neighbors under this model, the Gaussian kernel that builds the similarity matrix space could be replaced by other machine learning methods.

Conclusions

We introduce a tripartite heterogeneous network model and the regularized least square method of the Kronecker product to fit multiple kernels and achieve better prediction performance in drug repositioning. The method proposed in this paper achieves significantly more accurate results than the other network methods under different prediction task settings and on different data sets.

Table A1

One-to-One ID Information of the Drugs and Targets Involved, Including KEGG ID, DrugBank ID or UniProt ID, and the Drug Name or Target Name

KEGG ID	DrugBank ID or UniProt ID	Drug name or target name
D00394	DB00726	trimipramine
D00225	DB00404	alprazolam
D01071	DB01355	hexobarbital
D00380	DB01124	tolbutamide
D00574	DB00357	aminoglutethimide
D00139	DB00553	methoxsalen
D01699	DB00496	darifenacin
D00465	DB01062	oxybutynin
D00645	DB01158	bretylium
D00765	DB00728	rocuronium
Hsa:28	P16442	histo-blood group ABO system transferase
Hsa:1557	P33261	cytochrome P450 2C19
Hsa:57105	Q9NS75	cysteinyl leukotriene receptor 2
Hsa:10800	Q9Y271	cysteinyl leukotriene receptor 1
Hsa:1128	P11229	muscarinic acetylcholine receptor M1
Hsa:1129	P08172	muscarinic acetylcholine receptor M2
Hsa:134	P30542	adenosine receptor A1
Hsa:135	P29274	adenosine receptor A2a


