
Improving compound-protein interaction prediction by building up highly credible negative samples.

Hui Liu¹, Jianjiang Sun², Jihong Guan², Jie Zheng², Shuigeng Zhou².

Abstract

MOTIVATION: Computational prediction of compound-protein interactions (CPIs) is of great importance for drug design and development, as genome-scale experimental validation of CPIs is not only time-consuming but also prohibitively expensive. Although an increasing number of validated interactions is available, the performance of computational prediction approaches is severely impeded by the lack of reliable negative CPI samples. A systematic method of screening reliable negative samples is therefore critical to improving the performance of in silico prediction methods.
RESULTS: This article aims to build a set of highly credible negative samples of CPIs via an in silico screening method. As most existing computational models assume that similar compounds are likely to interact with similar target proteins and achieve remarkable performance, it is rational to identify potential negative samples based on the converse negative proposition: proteins dissimilar to every known/predicted target of a compound are unlikely to be targeted by the compound, and vice versa. We integrated various resources, including chemical structures, chemical expression profiles and side effects of compounds, and amino acid sequences, protein-protein interaction network and functional annotations of proteins, into a systematic screening framework. We first tested the screened negative samples on six classical classifiers, and all these classifiers achieved remarkably higher performance on our negative samples than on randomly generated negative samples for both human and Caenorhabditis elegans. We then verified the negative samples on three existing prediction models, including bipartite local model, Gaussian kernel profile and Bayesian matrix factorization, and found that the performance of these models is also significantly improved on the screened negative samples. Moreover, we validated the screened negative samples on a drug bioactivity dataset. Finally, we derived two sets of new interactions by training a support vector machine classifier on the positive interactions annotated in DrugBank and our screened negative interactions. The screened negative samples and the predicted interactions provide the research community with a useful resource for identifying new drug targets and a helpful supplement to the current curated compound-protein databases.
AVAILABILITY: Supplementary files are available at: http://admis.fudan.edu.cn/negative-cpi/.


Year: 2015 PMID: 26072486 PMCID: PMC4765858 DOI: 10.1093/bioinformatics/btv256

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Compound–protein interactions (CPIs) are crucial to the discovery of new drugs by screening candidate compounds and are also helpful for understanding the causes of side effects of existing drugs. Although various biological assays are available, experimental validation of CPIs remains time-consuming and expensive. Therefore, there is a strong incentive to develop computational methods to detect CPIs accurately. Meanwhile, with the rapid growth of public chemical and biological databases, such as PubChem (Wheeler ), DrugBank (Wishart ), SIDER (Kuhn ), STITCH (Kuhn ), STRING (Franceschini ) and Gene Ontology (GO) (Ashburner ), various kinds of resources have become available to the research community and consolidate the basis of computational CPI prediction. These include drug features, such as chemical structures, side effects and gene expression profiles under drug treatments, and protein features, such as amino acid sequences, protein–protein interaction (PPI) networks and functional annotations. Traditional computational approaches fall roughly into two categories: structure based and ligand based. Structure-based methods depend on the structural information of target proteins, which is unavailable for most protein families. Ligand-based methods perform poorly for proteins with few or no known ligands. Recently, a variety of machine learning-based methods have been proposed and have achieved considerable success by taking the viewpoint of chemogenomics (Jaroch and Weinmann, 2006), which integrates the chemical attributes of drug compounds, the genomic attributes of proteins and the known CPIs into a unified mathematical framework. The main rationale underlying the chemogenomics approaches is that similar compounds tend to bind similar proteins, so that the lack of known ligands for a given protein can be compensated for by the availability of known ligands of similar proteins and vice versa.
Following the philosophy of chemogenomics, many methods have been proposed by exploiting various types of features and classification algorithms (Tabei and Yamanishi, 2013; Yabuuchi ; Yamanishi ). With the chemical structure similarity and protein sequence similarity measures, Bleakley and Yamanishi (2009) proposed the bipartite local model (BLM) to infer CPIs by training local support vector machine (SVM) classifiers based on known interactions. van Laarhoven proposed Gaussian interaction profile (GIP) kernels that exploit the topology of CPI networks. However, these methods still suffer from the lack of known interactions between the drugs and proteins of interest, which often leads to the failure of prediction. Therefore, Mei improved BLM by exploiting the known interactions of neighbors to compensate the lack of interaction information. van Laarhoven and Marchiori (2013) also used the interaction profiles of weighted nearest neighbors to improve the GIP method. Instead of utilizing the attributes of drugs and proteins separately, more and more researchers combined these attributes into a single feature vector by the concatenation or tensor product operators and then built classifiers based on the integrated features and known CPIs. For example, Jacob and Vert (2008) proposed SVM classifiers with pairwise kernels that were derived, respectively, from similarity measures of drugs and proteins. Yamanishi proposed the bipartite graph inference that maps drugs and proteins into a unified Euclidean feature space in which the distances between drugs and proteins linked by known interactions are minimized and otherwise maximized. The network-based inference originally proposed for personal recommendation (Zhou ) was also used to identify CPIs (Alaimo ; Cheng ). 
Other methods, including kernel-based data fusion (Wang ), Kernelized Bayesian matrix factorization with twin kernels (KBMF2K) (Gonen, 2012), restricted Boltzmann machine (Wang and Zeng, 2013) and semi-supervised methods (Chen and Zhang, 2013; Xia ), were successively proposed. In addition to chemical structures and protein sequences, researchers also resorted to other attributes of drugs and proteins to reveal their interacting associations, including the drug expression profiles (Carrella ; Iorio ; Wolpaw ), functional groups (He ) and side effects (Campillos ; Mizutani ) of drugs, signaling pathways and GO annotations (Jaeger ) of proteins or even the combination of these attributes (Gottlieb , 2012; Perlman ). Most previous approaches used experimentally validated CPIs as positive samples and randomly generated negative samples to learn the prediction models. However, the randomly generated negative samples may include real positive samples that are not yet known. A classifier trained on such randomly generated negative samples may yield high cross-validation accuracy but may well perform poorly on independent, real test datasets. Screening highly reliable negative samples is therefore critical to improving the accuracy of computational prediction methods. The importance of true-negative interactions was recently highlighted as one of the future developments in predicting drug–target interactions (Ding ). Motivated by this, we set out to screen in silico highly credible negative samples of CPIs. An assumption underlying most computational methods for predicting CPIs is that similar drug compounds are likely to interact with similar target proteins. Our method is based on the converse negative proposition, i.e. the proteins that are dissimilar to every known/predicted target of a given compound are unlikely to be targeted by the compound and vice versa.
We integrated various resources of compounds and proteins, including chemical structures, chemical expression profiles and side effects of compounds, amino acid sequences, PPI networks and GO functional annotations of proteins, to a systematic screening framework. We evaluated our method on both human and Caenorhabditis elegans data. We first tested our screened negative samples on six classical classifiers, including random forest, L1- and L2-regularized logistic regression, naive Bayes, SVM and k-nearest neighbor (kNN). All these classifiers achieved remarkably higher performance on our negative samples than on randomly generated negative samples. We also verified our negative samples on three existing prediction models, including BLM (Bleakley and Yamanishi, 2009), Gaussian kernel profile (van Laarhoven ) and Bayesian matrix factorization (Gonen, 2012), and found that the performances of these models are also significantly improved on the screened negative samples. Furthermore, we validated our screened negative samples with a drug bioactivity dataset. Finally, we derived two sets of new CPIs by training an SVM classifier on the positive interactions annotated in DrugBank and our screened negative interactions. These screened negative samples and the predicted interactions can serve the research community as a useful resource for identifying new drug targets and as a helpful supplement to the current curated compound–protein databases.

2 Materials

2.1 Compound–protein interaction

CPIs were retrieved from DrugBank 4.1 (Wishart ), Matador (Günther ) and STITCH 4.0 (Kuhn ). DrugBank and Matador are manually curated databases, and STITCH is a comprehensive database that collects CPIs from four different sources: experiments, databases, text mining and predicted interactions. STITCH also assigns each interaction a score ranging from 0 to 1000, which indicates the confidence of the CPI given these four types of evidence. We assigned the interactions from DrugBank and Matador the highest score of 1000 because these interactions are supported by biochemical experiments and the literature. In total, we obtained 2 290 630 interactions between 367 142 unique compounds and 19 342 proteins of human, and 2 141 740 interactions between 276 294 unique compounds and 11 234 proteins of C.elegans. For simplicity, in the rest of the article we refer to the created assembly of CPIs as K and denote by a triple $\langle c, p, w \rangle$ the interaction between drug c and protein p with confidence score w.

2.2 Chemical data

2.2.1 Chemical structure similarity

Chemical structures (also referred to as fingerprints) of drugs were obtained from the PubChem database (Wheeler ). We calculated the Jaccard score (Jaccard, 1908) of the fingerprints as the chemical structure similarity between compounds. The Jaccard score between compounds $c$ and $c'$ is defined as $J(c, c') = |c \cap c'| / |c \cup c'|$, which is the ratio of the number of common substructures between $c$ and $c'$ over the total number of substructures in $c$ and $c'$. In total, 881 kinds of substructures are used in our analysis for human and C.elegans. Applying this operation to all drug pairs, we constructed a chemical similarity matrix.
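As a minimal sketch of the Jaccard computation described above, the snippet below represents each fingerprint as the set of indices of the substructures that are present. The function names and the toy fingerprints are illustrative assumptions, not part of the original pipeline, and the treatment of two empty fingerprints is also an assumption.

```python
def jaccard(fp_a, fp_b):
    """Jaccard score between two fingerprints given as collections of 'on' indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as having similarity 0.
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Jaccard scores."""
    n = len(fingerprints)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = jaccard(fingerprints[i], fingerprints[j])
    return sim


# Toy fingerprints (hypothetical substructure indices).
fps = [{1, 5, 9}, {1, 5, 20}, {3, 4}]
sim = similarity_matrix(fps)
```

The same routine would apply to any binary feature set, for example side effect sets or domain fingerprints.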

2.2.2 Side effect similarity

Side effects of drugs were downloaded from the SIDER database (Kuhn ). For the drugs involved in CPIs but not included in the SIDER database, we predicted side effects with a recently proposed method based on chemical fragments (Pauwels ). As for chemical structures, we computed the Jaccard score of each pair of drugs as their side effect similarity, based on either their known side effects or, when these are unknown, their top 10 predicted side effects (Perlman ).

2.3 Protein data

2.3.1 Sequence similarity

Amino acid sequences of proteins were obtained from the UCSC Table Browser. We computed sequence similarity between proteins using a normalized version of the Smith–Waterman score (Smith and Waterman, 2010). The normalized Smith–Waterman score between two proteins $g$ and $g'$ is $SW(g, g') / \left(\sqrt{SW(g, g)}\,\sqrt{SW(g', g')}\right)$, where $SW(\cdot, \cdot)$ means the original Smith–Waterman score. Applying this operation to all protein pairs, we obtained the similarity matrix of protein sequences.
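The normalization can be sketched as follows, assuming the common form in which each raw score is divided by the geometric mean of the two self-alignment scores. The raw scores here are made-up numbers supplied as a matrix; computing the alignments themselves is outside the scope of this sketch, and `normalize_sw` is a hypothetical helper name.

```python
import math


def normalize_sw(raw):
    """Normalize a square matrix of raw Smith-Waterman scores.

    raw[i][j] is the raw score between proteins i and j, and raw[i][i] is the
    self-alignment score, assumed to be positive.
    """
    n = len(raw)
    return [
        [raw[i][j] / math.sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical raw scores for two proteins.
raw_scores = [[100.0, 30.0],
              [30.0, 225.0]]
norm_scores = normalize_sw(raw_scores)
```

With this normalization every protein has similarity 1 to itself, which puts the sequence similarity on the same 0 to 1 scale as the Jaccard-based measures.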

2.3.2 Functional annotation semantic similarity

GO annotations were downloaded from the GO database (Ashburner ). Semantic similarity score between each pair of proteins was calculated based on the overlap of the GO terms that were associated with the two proteins (Coutoa ). All three types of ontologies were used in the computation as similar drugs are expected to interact with proteins that act in similar biological processes or have similar molecular functions or reside in similar compartments. We computed the Jaccard score with respect to the GO terms of each pair of proteins as their similarity.

2.3.3 Protein domain similarity

Protein domains were extracted from PFAM database (Punta ). Each protein was represented by a domain fingerprint (binary vector) whose elements encode the presence or absence of each retained PFAM domain by 1 or 0, respectively. The numbers of PFAM domains for human and C.elegans are 1331 and 3837, respectively. We computed the Jaccard score of any two proteins via their domain fingerprints as their similarity.

3 Methods

3.1 Integration of multiple similarities

We have computed multiple similarity measures from different features for both drugs and proteins as mentioned above. For drugs $c_i$ and $c_j$, we combine them into a single comprehensive similarity measure as below:

$$CS(c_i, c_j) = 1 - \prod_{n} \left(1 - cs_n(c_i, c_j)\right) \qquad (1)$$

in which $cs_n$ (n = 1, 2) represents the similarity measure derived from features of chemical structure and side effect, respectively. Note that a similar formulation was also adopted by STITCH (Kuhn ), as it can be easily extended to integrate more similarity measures. Similarly, we computed the comprehensive similarity between proteins $p_i$ and $p_j$ by

$$PS(p_i, p_j) = 1 - \prod_{n} \left(1 - ps_n(p_i, p_j)\right) \qquad (2)$$

where $ps_n$ (n = 1, 2, 3) represents the similarity measure derived from sequence similarity, functional annotation semantic similarity and protein domain similarity, respectively.
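To make the integration step concrete, here is a small sketch that assumes the STITCH-style combination, in which the integrated score is one minus the product of the complements of the individual scores. The function name `integrate` and the example values are illustrative assumptions.

```python
def integrate(similarities):
    """Combine individual similarity scores in [0, 1] into one score.

    Assumed form: 1 - prod(1 - s_n). The result is never smaller than the
    largest individual score.
    """
    complement = 1.0
    for s in similarities:
        complement *= 1.0 - s
    return 1.0 - complement


# Hypothetical chemical structure and side effect similarities of a drug pair.
cs_pair = integrate([0.5, 0.5])
# Hypothetical sequence, GO and domain similarities of a protein pair.
ps_pair = integrate([0.2, 0.0, 0.5])
```

One consequence of this form is that adding a further similarity measure can only increase the integrated score, which makes the subsequent dissimilarity-based screening more conservative as more evidence is added.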

3.2 The screening framework

Most existing prediction methods for CPIs (positive samples) are based on the assumption that similar compounds are likely to interact with the proteins that are similar to the corresponding known target proteins. Our basic idea was inspired by the converse negative proposition of this assumption. Specifically, we assume that a protein dissimilar to every known/predicted target of a compound is unlikely to be targeted by this compound, and conversely, that a compound not similar to any known/predicted compound targeting a protein is unlikely to target this protein. For simplicity, we refer to them as the protein dissimilarity rule and the drug dissimilarity rule, respectively. Both rules are applied simultaneously in our screening framework so as to identify the most reliable negative samples of CPIs. Unlike existing prediction methods, which often depend on known CPIs for making reliable predictions, our negative sample screening framework exploits both validated and predicted CPIs. Figure 1 shows the flowchart of our method. Here, the three green dashed-line boxes show the data resources used in our screening framework, and the protein dissimilarities and drug dissimilarities are computed separately and then combined into a single score for each candidate negative sample. We summarize the screening steps as follows:

Fig. 1.

The flowchart of our negative CPI screening framework. Three green dashed-line boxes show the data resources used in our screening process, and the red dashed-line boxes represent the screening steps that include multiple operations

Step 1. Compute the integrated similarity of each pair of compounds/proteins via Equation (1)/Equation (2). Build the assembly K of known/predicted CPIs as mentioned above. Build the set of candidate negative interactions from all possible interactions excluding the created assembly K of known/predicted CPIs.

Step 2. We take the candidate negative interaction between compound $c_k$ and protein $p_j$, denoted by $\langle c_k, p_j, d_{kj} \rangle$ with $d_{kj}$ indicating the distance between compound $c_k$ and protein $p_j$, as an example to demonstrate the screening process. Figure 2 illustrates the process of calculating $d_{kj}$.

Fig. 2.

Schematic diagram of calculating the score $d_{kj}$ for a candidate compound–protein negative sample $\langle c_k, p_j, d_{kj} \rangle$. Two colors, blue and red, are used to differentiate the weights and similarities for calculating two combined scores SPC and SCP, respectively

Step 2(a). For any protein $p_l$ targeted by $c_k$ in K, we compute the weighted score spc that indicates the possibility of protein $p_j$ being targeted by compound $c_k$ in consideration of the similarity between $p_l$ and $p_j$. Taking into account the similarity between $p_j$ and each known/predicted protein $p_l$ targeted by compound $c_k$, i.e. $\langle c_k, p_l, w_{kl} \rangle \in K$, we calculate the combined score by summing up the weighted scores spc with respect to l, and thus obtain SPC.

Step 2(b). Similarly, we compute the weighted score scp that represents the possibility of compound $c_k$ targeting $p_j$ in consideration of the similarity between $c_i$ and $c_k$. Considering the similarity between $c_k$ and each known/predicted compound $c_i$ targeting protein $p_j$, i.e. $\langle c_i, p_j, w_{ij} \rangle \in K$, we calculate the combined score by summing up the weighted scores scp with respect to i and thus obtain SCP.

Step 2(c). For compound $c_k$ and protein $p_j$, we define the distance $d_{kj}$ between $c_k$ and $p_j$ by Equation (3), which combines SPC and SCP through a negative exponential function. $d_{kj}$ is the final score representing the possibility that compound $c_k$ does not target protein $p_j$. The larger $d_{kj}$ is, the higher the possibility of $c_k$ not targeting $p_j$.

Step 3. Build the set of positive interactions from two manually curated databases: DrugBank (Wishart ) and Matador (Günther ). Rank the potential negative CPIs according to the scores obtained by Equation (3), and those with the highest scores are taken to form the set of negative sample candidates. The negative sample candidates are further filtered by using feature divergence of compound and protein, as described in Section 3.3.

Step 4. Combining the positive interactions and negative interactions, we get a gold standard set of CPIs. On the basis of the chemical substructures and protein PFAM domains, we construct the tensor product for each CPI, so that each interaction is represented by a vector in the chemogenomical space.

Step 5. Train a classifier (e.g. SVM) by using the chemogenomical feature vectors, tune the model parameters via cross-validation and finally predict new CPIs.

Conceptually, the confidence values $w_{kl}$ and $w_{ij}$ of the known/predicted interactions are propagated to the candidate negative interactions via similar proteins and compounds in Step 2(a) and Step 2(b). The protein similarity linking compound $c_k$ to protein $p_j$ via protein $p_l$ is formulated by spc in Step 2(a), and the chemical similarity linking protein $p_j$ to compound $c_k$ via compound $c_i$ is formulated by scp in Step 2(b). In Step 2(c), the two resulting scores are combined according to Equation (3), which embodies the protein dissimilarity rule and the drug dissimilarity rule through a negative exponential function. In particular, the known/predicted interactions function as a bridge to link compounds and proteins that do not form potential interactions of high probability.
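The scoring of a candidate negative pair can be sketched as below. The exact forms of the weighted scores and of Equation (3) are not recoverable from the text here, so this sketch makes two explicit assumptions: each weighted score is the confidence rescaled to [0, 1] (w / 1000) multiplied by the integrated similarity, and the distance is exp(-(SPC + SCP)). All names (`negative_score`, `K`, `CS`, `PS`) and the toy data are hypothetical.

```python
import math


def negative_score(c_k, p_j, K, CS, PS, max_w=1000.0):
    """Distance-like score for the candidate negative pair (c_k, p_j).

    K  : dict mapping (compound, protein) -> confidence score in [0, max_w]
    CS : nested dict of integrated compound similarities, CS[a][b]
    PS : nested dict of integrated protein similarities, PS[a][b]
    """
    # Protein side: propagate confidences of the known targets p_l of c_k to p_j.
    spc_total = sum((w / max_w) * PS[p_l][p_j]
                    for (c, p_l), w in K.items() if c == c_k)
    # Compound side: propagate confidences of the known ligands c_i of p_j to c_k.
    scp_total = sum((w / max_w) * CS[c_i][c_k]
                    for (c_i, p), w in K.items() if p == p_j)
    # Assumed combination: a negative exponential of the two combined scores.
    return math.exp(-(spc_total + scp_total))


# Toy assembly of known/predicted interactions and similarities.
K = {("c1", "p1"): 1000, ("c2", "p2"): 500}
PS = {"p1": {"p1": 1.0, "p2": 0.1}, "p2": {"p1": 0.1, "p2": 1.0}}
CS = {"c1": {"c1": 1.0, "c2": 0.2}, "c2": {"c1": 0.2, "c2": 1.0}}

# Candidate negatives are all pairs not in K, ranked by decreasing score.
candidates = [(c, p) for c in CS for p in PS if (c, p) not in K]
ranked = sorted(candidates, key=lambda cp: negative_score(cp[0], cp[1], K, CS, PS),
                reverse=True)
```

Under these assumptions a pair whose compound and protein are dissimilar to everything in K scores close to 1, while a pair that resembles many high-confidence interactions scores close to 0, which matches the direction described in Step 2(c).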

3.3 Filtering by feature divergence

It is known that compounds with similar chemical features may have greatly different binding bioactivity (activity cliff) (Sun and Bajorath, 2012). On the other hand, compounds with completely different core structures could potentially target similar proteins (scaffold hopping) (Sun ). When the number of validated/predicted target proteins of a specific compound is small and thus covers limited proteomic features, it is likely that some proteins screened via proteomic feature dissimilarity based on all known target proteins are actually targets of the compound. From the perspective of the protein, a similar situation may exist when the number of validated/predicted compounds targeting a protein is small. Thus, we require that both the protein and the compound of each negative sample candidate participate in more validated/predicted interactions than some predefined threshold. By setting the threshold to 15, we obtained 23 392 compounds and 10 757 proteins of human, and 33 353 compounds and 7584 proteins of C.elegans, which were used to construct the negative samples. Moreover, we expect the features of the proteins targeted by a specific compound to differ from each other as much as possible, so that our dissimilarity rules can exclude more specious candidates that have similar features to known target proteins. Similarly, the more different the chemical features of the compounds targeting a specific protein are, the more false-positive targeting compounds can be excluded by our dissimilarity rules. In other words, the credibility of the screened negative samples is positively correlated with the feature divergence of the proteins (compounds) in validated/predicted interactions associated with a specific compound (protein). Therefore, we exploited the feature divergence to further screen the candidate negative samples.
Since variance is a commonly used measure of data divergence, we carried out a statistical test to check whether the similarity variance of the subset of proteins (compounds) in interactions associated with each compound (protein) in the candidate negative samples is significantly larger than the population variance. Taking compounds as an example, the similarity variance of the compound population is 0.0335, which can be easily computed based on CS (see Equation 1). For a subset of n compounds interacting with a protein, the null hypothesis is that the sample variance is not larger than the compound population variance, and the alternative hypothesis is the opposite. Under the null hypothesis, the sample variance, multiplied by n – 1 and divided by the population variance, follows a $\chi^2$ distribution with n – 1 degrees of freedom. With a significance level of 0.05, we filtered out more specious candidates and finally obtained 384 916 negative samples between 14 613 unique compounds and 2229 unique proteins of human by setting the threshold of d to 0.9 (). For C.elegans, we finally obtained 88 261 negative samples between 2224 unique proteins and 5278 unique compounds by setting the threshold of d to 0.368.
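The variance test can be sketched as a standard one-sided chi-square test for a variance. This is an illustration under stated assumptions rather than the authors' implementation: the statistic is taken to be (n - 1) s² / σ², the upper-tail probability is computed with a plain series expansion that is adequate only for moderate values of the statistic, and the helper names `chi2_sf` and `is_divergent` are hypothetical. In practice a statistics library would normally provide the tail probability.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with `dof` degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function;
    suitable for moderate x only (very large x may overflow).
    """
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def is_divergent(sample_var, n, pop_var=0.0335, alpha=0.05):
    """True if the sample variance is significantly larger than pop_var.

    pop_var defaults to the compound population variance quoted in the text.
    """
    statistic = (n - 1) * sample_var / pop_var
    return chi2_sf(statistic, n - 1) < alpha


# Hypothetical subsets of 20 compounds with different similarity variances.
keep_a = is_divergent(0.10, 20)
keep_b = is_divergent(0.0335, 20)
```

A subset whose variance merely equals the population variance is not flagged as divergent, so only candidates backed by clearly heterogeneous interaction partners would pass this filter.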

4 Results

4.1 Performance evaluation protocol

To conduct an objective and fair evaluation of the negative CPIs screened by our method, we first built the positive samples from the manually curated databases DrugBank and Matador and then generated two sets of negative samples: one was generated by randomly sampling compound–protein pairs not included in the positive samples, and the other was extracted from the list of negative samples screened by our method. We evaluated the screened negative samples by comparing the performance of six classical classifiers and three existing predictive methods on the same set of positive samples combined with screened and randomly generated negative samples, respectively. We selected the top 384 916 screened negative samples (the dataset is available in the Supplementary Material) from the ranking list as candidates and used some of them in the experiments. As shown in Supplementary Figure S1, the frequency distribution of interactions is biased toward only a small portion of compounds/proteins, indicating that random sampling over all interactions might cover only a limited number of compounds and proteins. Therefore, as in Tabei and Yamanishi (2013) and Yamanishi , we used two protocols, pairwise cross-validation and blockwise cross-validation, to evaluate our negative samples against randomly generated negative samples. Concretely, pairwise cross-validation assumes that the aim is to detect missing interactions between known ligand compounds and known target proteins with information of interaction partners, while blockwise cross-validation assumes that the goal is to detect new interactions for new ligand compounds and target proteins with no information of interaction partners.
Pairwise cross-validation was performed in three steps: (i) the CPIs in the gold standard set are randomly split into five subsets of roughly equal size; (ii) each subset is taken in turn as a test set and the remaining four subsets are used to train a predictive model, whose prediction accuracy on the test set is then evaluated and (iii) the average prediction accuracy over the five folds is used as the final performance measure. Instead of splitting interactions, blockwise cross-validation randomly splits the compounds and the proteins in the gold standard set into five subsets each. Each compound subset and each protein subset are taken in turn and combined as a test set, and then a predictive model is trained on the compound–target pairs included in the remaining four compound subsets and four protein subsets and is further evaluated on the test set. Finally, the average prediction accuracy over the five folds is calculated. Several performance measures are used in the following experiments. Let TP and FP denote the numbers of correctly and falsely predicted positive CPIs, and TN and FN the numbers of correctly and falsely predicted negative CPIs. The measures are precision [or positive predictive value (PPV)] = TP/(TP + FP), recall (or sensitivity) = TP/(TP + FN), specificity = TN/(FP + TN) and AUC (area under the ROC curve). In particular, the PPV measure reflects the power of a classifier to distinguish true positives when the number of negative samples is far larger than that of positive samples. In addition, we report the precision–recall curve because it is rather informative when the number of positive examples is small.
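The two protocols and the count-based measures can be sketched as follows. The blockwise variant rests on an assumption about a detail the text leaves open: a pair goes to the test set of fold f only if both its compound and its protein fall in fold f, it goes to the training set only if neither does, and pairs with exactly one held-out partner are dropped. Function names and the toy data are hypothetical.

```python
import random


def pairwise_folds(pairs, k=5, seed=0):
    """Yield (train, test) splits obtained by splitting the interactions themselves."""
    idx = list(range(len(pairs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = [pairs[i] for i in folds[f]]
        train = [pairs[i] for g in range(k) if g != f for i in folds[g]]
        yield train, test


def blockwise_folds(pairs, k=5, seed=0):
    """Yield (train, test) splits obtained by splitting compounds and proteins."""
    rng = random.Random(seed)
    compounds = sorted({pair[0] for pair in pairs})
    proteins = sorted({pair[1] for pair in pairs})
    rng.shuffle(compounds)
    rng.shuffle(proteins)
    c_fold = {c: i % k for i, c in enumerate(compounds)}
    p_fold = {p: i % k for i, p in enumerate(proteins)}
    for f in range(k):
        # Assumption: test pairs have both partners held out, train pairs have neither.
        test = [x for x in pairs if c_fold[x[0]] == f and p_fold[x[1]] == f]
        train = [x for x in pairs if c_fold[x[0]] != f and p_fold[x[1]] != f]
        yield train, test


def confusion_metrics(tp, fp, tn, fn):
    """Precision (PPV), recall (sensitivity) and specificity from raw counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


# Toy gold standard: (compound, protein, label) triples.
toy_pairs = [("c%d" % i, "p%d" % j, 1) for i in range(10) for j in range(10)]
pair_splits = list(pairwise_folds(toy_pairs))
block_splits = list(blockwise_folds(toy_pairs))
toy_metrics = confusion_metrics(tp=8, fp=2, tn=9, fn=1)
```

In the blockwise splits no test compound or test protein ever appears in the corresponding training set, which is what makes this protocol the harder of the two.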

4.2 Evaluation on classical classifiers

4.2.1 Pairwise cross-validation

The human dataset that we used includes 3369 positive interactions between 1052 unique compounds and 852 unique proteins, and the C.elegans dataset includes 4000 positive interactions between 1434 unique compounds and 2504 unique proteins (the datasets are available in the Supplementary Material). Similar to Tabei and Yamanishi (2013), we evaluated the performance of each classifier as the ratio of negative samples to positive samples increases from 1 to 5. The randomly generated negative samples were produced by randomly sampling pairs of compound and protein not included in the positive samples. For screened negative samples, we took the required number of interactions from the top 384 916 candidates in the ranking list produced by our method. We produced the chemogenomical features of the positive and negative samples by taking the tensor product of chemical substructures and protein domains. We evaluated six classical classifiers by comparing our screened negative samples against randomly generated negative samples. The six classical classifiers are naive Bayes, random forest, L1-logistic regression, L2-logistic regression, SVM and kNN. Naive Bayes, random forest and kNN were run using Weka 3.7 (Hall ), L1- and L2-logistic regression were run with liblinear 1.94 (Fan ) and SVM was run with libsvm 3.17 (Chang and Lin, 2011). All these methods were run with default settings except for kNN, where k was set to 1, 3 and 5. As similar results were obtained for different k values, we report only the results for k = 1. Table 1 shows the AUC, recall and precision measures of the six classifiers on human data. We found that the performance of all six classifiers was significantly better on our screened negative samples than on randomly generated negative samples.
For example, for the six classifiers in the order listed in Table 1 (from naive Bayes to SVM), the average AUC improvement achieved on our screened negatives over the randomly generated negatives is 8.0%, 53.4%, 39.7%, 4.2%, 5.3% and 29.6%, respectively. When the ratio of negative samples increases, the AUC values of most classifiers remain steady or increase slightly. However, we also noticed that the recall and precision measures of most classifiers decrease as the ratio of negative samples increases. This is mainly due to the increasingly imbalanced ratio of negative samples to positive samples, which increasingly biases the classification decision boundary against the positive ones. In addition, we obtained similar results on C.elegans, as shown in Supplementary Table S1. These empirical results demonstrate the high reliability of our screened negative samples.
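The tensor product features mentioned above can be illustrated with a short sketch, assuming that both the chemical fingerprint and the domain fingerprint are binary vectors and that the feature vector of a pair is the flattened outer product of the two. The function name and the toy vectors are illustrative assumptions.

```python
def tensor_product(chem_fp, domain_fp):
    """Flattened outer product of a compound fingerprint and a protein fingerprint.

    Entry (a, b) of the product is 1 only if substructure a is present in the
    compound and domain b is present in the protein.
    """
    return [a * b for a in chem_fp for b in domain_fp]


# Toy fingerprints: 2 substructures and 3 domains (hypothetical sizes).
chem_fp = [1, 0]
domain_fp = [1, 1, 0]
pair_features = tensor_product(chem_fp, domain_fp)
```

With 881 substructures and, for human, 1331 domains, the dense vector would have 881 × 1331 = 1 172 611 entries, nearly all of them zero, so a sparse representation that stores only the non-zero index pairs is the more practical choice at full scale.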

Table 1.

AUC/recall/precision values of six classical classifiers on screened and randomly generated negative samples of human (pairwise cross-validation)

Measure	Negative sample ratio	Naive Bayes (screened)	Naive Bayes (random)	kNN (screened)	kNN (random)	Random Forest (screened)	Random Forest (random)	L1 logistic (screened)	L1 logistic (random)	L2 logistic (screened)	L2 logistic (random)	SVM (screened)	SVM (random)
AUC	1	0.672	0.622	0.860	0.563	0.940	0.647	0.908	0.874	0.911	0.868	0.910	0.752
AUC	3	0.672	0.622	0.904	0.593	0.954	0.694	0.917	0.879	0.920	0.873	0.942	0.705
AUC	5	0.671	0.622	0.913	0.589	0.967	0.709	0.916	0.877	0.920	0.872	0.951	0.713
Precision	1	0.624	0.591	0.798	0.570	0.861	0.613	0.881	0.858	0.891	0.862	0.966	0.733
Precision	3	0.361	0.338	0.716	0.458	0.847	0.529	0.823	0.786	0.837	0.787	0.969	0.700
Precision	5	0.252	0.237	0.684	0.500	0.830	0.514	0.793	0.732	0.804	0.739	0.969	0.732
Recall	1	0.575	0.413	0.927	0.564	0.897	0.599	0.893	0.836	0.913	0.850	0.950	0.745
Recall	3	0.560	0.376	0.882	0.306	0.824	0.306	0.749	0.622	0.773	0.631	0.883	0.261
Recall	5	0.555	0.364	0.844	0.205	0.825	0.199	0.649	0.524	0.666	0.522	0.861	0.112

Bold numbers represent the highest performance measures achieved by each method.
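As a worked check of the average AUC improvements quoted in the text, the sketch below recomputes the mean relative gain from the AUC rows of Table 1. It uses only the three tabulated ratios (1, 3 and 5), which is an assumption: figures averaged over all ratios from 1 to 5 may differ slightly.

```python
# (screened AUC, random AUC) at negative sample ratios 1, 3 and 5, from Table 1.
auc = {
    "Naive Bayes":   [(0.672, 0.622), (0.672, 0.622), (0.671, 0.622)],
    "kNN":           [(0.860, 0.563), (0.904, 0.593), (0.913, 0.589)],
    "Random Forest": [(0.940, 0.647), (0.954, 0.694), (0.967, 0.709)],
    "L1 logistic":   [(0.908, 0.874), (0.917, 0.879), (0.916, 0.877)],
    "L2 logistic":   [(0.911, 0.868), (0.920, 0.873), (0.920, 0.872)],
    "SVM":           [(0.910, 0.752), (0.942, 0.705), (0.951, 0.713)],
}


def mean_relative_gain(pairs):
    """Average of (screened / random - 1) over the listed ratios, in percent."""
    return 100.0 * sum(s / r - 1.0 for s, r in pairs) / len(pairs)


gains = {name: round(mean_relative_gain(values), 1) for name, values in auc.items()}
```

On these three ratios the gains come out at about 8.0%, 53.4%, 39.7%, 4.2% and 5.3% for naive Bayes, kNN, random forest, L1-logistic and L2-logistic regression, in line with the text, and at about 29.3% for SVM, slightly below the quoted 29.6%.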


4.2.2 Blockwise cross-validation

Here the positive samples are the same as those used in pairwise cross-validation. An equal number of random negative samples to positive samples were selected by the random sampling procedure mentioned above. Also, an equal number of screened negative samples were randomly extracted from the top 384 916 candidates in our ranking list. The six classical classifiers were run in the same way as mentioned above, and the precision–recall curves and AUC histograms are shown in Figure 3 and Supplementary Figure S2. Compared with pairwise validation, the six classifiers perform worse in blockwise cross-validation on both screened and randomly generated negative samples, but their performances are still substantially improved on the screened negative samples. In particular, on randomly generated negative samples, the performance of each classifier deteriorates dramatically under blockwise cross-validation, but most classifiers except for naive Bayes and kNN still achieved relatively high AUC values on the screened negative samples: L1- and L2-logistic regression and SVM achieved AUC values larger than 0.8. As also shown in Supplementary Figure S3, all the six classifiers obtain larger AUC values on screened negative samples than on random negative samples of C.elegans, whereas the overall performances of these classifiers decrease slightly in comparison to that on human dataset.

Fig. 3.

Precision–recall curves of six classical classifiers on screened and randomly generated negatives of human (blockwise cross-validation)

4.3 Evaluation on existing predictive methods

We further checked whether existing predictive methods can achieve higher performance on screened negative samples than on randomly generated negative samples. The evaluated methods include BLM (Bleakley and Yamanishi, 2009), the RLS-avg and RLS-Kron classifiers with GIP kernels (van Laarhoven et al.), and KBMF2K-classification and KBMF2K-regression (Gönen, 2012). RLS-avg and RLS-Kron were each run with two different parameter settings, (0.5, 0.5) and (1, 1), and the other methods were run with their default settings. All these methods were originally evaluated on four widely used human datasets, Enzyme, Ion Channel, GPCR and Nuclear Receptor, proposed in Yamanishi et al. However, these four datasets are small and cover only a small number of the negative samples screened by our method, so we built a larger human dataset to evaluate these methods. We took the positive samples from DrugBank and then extracted from our ranking list the negative samples whose compounds and proteins both occur in the positive samples. The resulting human dataset includes 2315 positive interactions and 2576 negative interactions between 821 unique compounds and 846 unique proteins, and the C. elegans dataset includes 463 positive interactions and 1561 negative samples between 543 compounds and 504 proteins (the datasets are available in the Supplementary Material). As the compared methods take as input the chemical structure similarity matrix, the protein sequence similarity matrix and the CPI matrix, we built the similarity matrices and the interaction matrix as described in Section 2. The AUC values achieved by these predictive methods on screened and randomly generated negative samples of human are shown in Figure 4. Clearly, all the methods achieved significantly higher performance on the screened negative samples than on the randomly generated negative samples.
In particular, BLM had the lowest AUC (0.679) on the randomly generated negative samples but achieved an AUC (0.932) comparable to that of the other methods on the screened negative samples. KBMF2K-classification and KBMF2K-regression had fairly high AUCs (0.868 and 0.846) on the randomly generated negative samples, but their performance was also significantly improved on the screened negative samples. On C. elegans, the performance improvement is more notable than on human for all methods except KBMF2K-classification and KBMF2K-regression, as shown in Supplementary Figure S4. Although BLM and the four RLS variants performed only moderately on the randomly generated negative samples, their performance was substantially boosted on the screened negative samples. This result again shows that our screened negative samples help improve the performance of existing predictive methods.
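For readers less familiar with the metric, the AUC values quoted above can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The following is a minimal sketch of that pairwise definition; the function name `auc` and the toy scores are illustrative assumptions and do not reproduce any number from the paper.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy classifier scores (hypothetical): higher means "more likely to interact".
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
print(auc(positives, negatives))
```

The quadratic loop is chosen for clarity only; on datasets of the size described above a rank-based implementation would be preferable.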

Fig. 4.

Histogram of the AUC values achieved by three existing predictive methods on screened and randomly generated negative samples of human

4.4 Evaluation on drug bioactivity dataset

The quantitative drug–target bioactivity assays for kinase inhibitors provide experimental observations of the binding of drug molecules to targets, which enable us to derive both positive and negative interactions. As suggested by Pahikkala et al., recent kinase bioactivity assay data from Davis et al. can be used as an independent benchmark test set for evaluating drug–target prediction methods. This assay reported the quantitative interaction affinity as the dissociation constant (Kd), which reflects how tightly a drug molecule binds to a target protein: the smaller the Kd, the higher the interaction affinity between the chemical compound and the target protein. The bioactivity assay covered the interactions between 68 unique drugs and 442 unique proteins, from which 20 931 interactions with Kd ≥ 10 000 nM were extracted as negative samples. We found 3564 interactions shared between our screened negative samples and the experimentally supported ones. We calculated the frequency distribution of the overlapping negative samples with respect to an increasing cutoff on the confidence scores of our screened negative samples, which is shown in Supplementary Figure S5. The confidence scores of about 80% of the overlapping negative samples exceed 0.5, i.e. the screened negative samples with larger confidence scores are more likely to be supported by drug bioactivity experiments, which indicates the high credibility of our screened negative samples. Furthermore, we used the Kd threshold of 30.00 nM suggested by Davis et al. to extract positive samples and obtained 1867 positive interactions. Together with the same number of negative interactions, these formed an independent test set, which was used to evaluate the SVM and L2-logistic classifiers trained on screened and randomly generated negative samples, respectively.
We chose SVM and L2-logistic regression for this evaluation because they are representative binary-classification and regression methods, respectively. Figure 5 presents their precision–recall curves, which show that the classifiers trained on our screened negative samples greatly outperform those trained on the random negative samples.
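The labelling and overlap analysis described above can be sketched as follows. The 30 nM threshold is the one quoted in the text; treating 10 000 nM as the lower bound for negatives, the direction of the inequalities, and all function names and toy values are assumptions made for illustration only.

```python
POS_KD_NM = 30.0       # positive threshold quoted in the text
NEG_KD_NM = 10_000.0   # assumed reading of the "10 000" value in the text

def label_by_kd(kd_nm):
    """Return 1 (interaction), 0 (non-interaction) or None (left unlabelled)."""
    if kd_nm <= POS_KD_NM:
        return 1
    if kd_nm >= NEG_KD_NM:
        return 0
    return None

def overlapping_negatives(assay_kd, screened_conf):
    """Screened negatives that the assay also labels as non-interacting."""
    return {pair: conf for pair, conf in screened_conf.items()
            if pair in assay_kd and label_by_kd(assay_kd[pair]) == 0}

def fraction_above(conf_scores, cutoff):
    """Fraction of confidence scores strictly greater than the cutoff."""
    if not conf_scores:
        return 0.0
    return sum(s > cutoff for s in conf_scores) / len(conf_scores)

# Toy assay values in nM and toy screening confidences (hypothetical).
assay_kd = {("d1", "k1"): 5.0, ("d1", "k2"): 10_000.0,
            ("d2", "k1"): 250.0, ("d2", "k2"): 10_000.0}
screened_conf = {("d1", "k2"): 0.8, ("d2", "k2"): 0.4, ("d3", "k9"): 0.9}

overlap = overlapping_negatives(assay_kd, screened_conf)
for cutoff in (0.0, 0.5, 0.9):
    print(cutoff, fraction_above(list(overlap.values()), cutoff))
```

Sweeping the cutoff in this way gives the kind of frequency distribution that Supplementary Figure S5 is described as showing.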

Fig. 5.

Precision–recall curves of the SVM and L2-logistic classifiers trained on screened and randomly generated negative samples, evaluated on the kinase bioactivity assay data (Davis et al.)

4.5 Prediction of new interactions

After confirming the quality of our screened negative samples, we built two sets of predictions of potential CPIs on human. The first is a relatively small-scale prediction set built on a subset of compounds and proteins included in DrugBank. Specifically, we extracted 2675 interactions from DrugBank as positive samples, selected an equal number of negative samples from our screened ranking list and then trained an SVM classifier on the chemogenomical features to predict potential interactions. The trained SVM classifier predicted 390 838 new CPIs out of all 896 304 possible interactions whose compounds and proteins are included in DrugBank. We extracted the top 50 interactions for each compound to get a set of 35 425 interactions, of which 1093 predictions were annotated in DrugBank and 3224 predictions (3224/35 425 ≈ 9.1%) were annotated in STITCH. Note that only 18 580 of all 896 304 possible interactions are recorded in STITCH (18 580/896 304 ≈ 2.1%); our predictions therefore rank these curated interactions high and give priority to highly credible interactions. As a confirmative example, we examined the predicted interactions of Donepezil, a centrally acting reversible acetylcholinesterase inhibitor that is used therapeutically in the palliative treatment of Alzheimer’s disease (Birks and Harvey, 2006). Our method predicted 253 target proteins, including those encoded by the cholinesterase genes ACHE and BCHE, which are the two main targets of Donepezil annotated in DrugBank and STITCH. In fact, the set of predicted target proteins covers all 17 interactions annotated in STITCH whose associated proteins are included in the test set, as shown in Figure 6. To confirm other new predictions, we inspected the functional annotations of the top 125 target proteins using DAVID (Huang et al.) and found that 45 proteins are highly enriched in the neuroactive ligand-receptor interaction pathway (P value = 4.7E-32).
These proteins are closely related to many diseases, including multiple kinds of psychotic disorders. Furthermore, these proteins are significantly associated with various mental and nervous diseases, such as hypertension (P value = 1.5E-11), Alzheimer’s disease (P value = 3.1E-4), Parkinson’s disease (P value = 3.0E-4) and arteriosclerotic vascular disease (P value = 1.3E-3). Figure 6 illustrates Donepezil’s target proteins and the related functional annotations (for the detailed list of proteins involved in the pathways and diseases, see Supplementary Table S2).
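The top-50 selection and the comparison of annotation rates can be illustrated with a short sketch. The counts are those quoted in the text; the function name `top_k_per_compound`, the toy scores, and the "fold enrichment" summary are illustrative assumptions rather than part of the described method.

```python
from collections import defaultdict

def top_k_per_compound(scored_pairs, k):
    """Keep the k highest-scoring proteins for every compound."""
    by_compound = defaultdict(list)
    for compound, protein, score in scored_pairs:
        by_compound[compound].append((score, protein))
    kept = []
    for compound, items in by_compound.items():
        items.sort(reverse=True)  # highest score first
        kept.extend((compound, protein, score) for score, protein in items[:k])
    return kept

# Toy predictions (hypothetical identifiers and scores).
scored = [("c1", "p1", 0.9), ("c1", "p2", 0.2), ("c1", "p3", 0.7),
          ("c2", "p1", 0.4), ("c2", "p2", 0.6)]
print(top_k_per_compound(scored, k=2))

# Annotation rates computed from the counts quoted in the text.
hit_rate = 3224 / 35425        # STITCH-annotated share of the top-50 set
base_rate = 18580 / 896304     # STITCH-annotated share of all possible pairs
fold = hit_rate / base_rate
print(f"{hit_rate:.1%} vs {base_rate:.1%}, about {fold:.1f}-fold")
```

On these counts the top-ranked set contains STITCH-annotated pairs at roughly four times the background rate, which is the sense in which the predictions rank curated interactions high.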

Fig. 6.

Predicted target proteins of Donepezil and related functional annotations, including neuroactive ligand-receptor interaction pathways, diseases recorded in DrugBank and STITCH

In addition, for the benefit of the research community, we built a second, large-scale set of predictions by constructing a large training set that consists of all 6354 interactions included in DrugBank and an equal number of screened negative samples. The trained SVM classifier predicts more than 6 340 000 CPIs (see the Supplementary Material for details). These new predictions should be helpful for identifying truly druggable targets in new drug design.

5 Discussion and conclusion

The identification of interactions between compounds and proteins plays an important role in genomic drug discovery. However, experimental validation of CPIs is still laborious and expensive, although various high-throughput biochemical assays are available. In silico prediction methods are appealing as a guide to experimental design and as supporting evidence for experimental results. Methods based on machine learning have been proposed and have demonstrated encouraging performance. However, their performance and robustness depend on the training set, in which negative samples are as important as positive samples. Unfortunately, our knowledge of negative CPI samples is extremely limited, which severely restricts the performance of computational methods. This problem motivated us to propose a systematic screening workflow to identify reliable negative CPIs. To the best of our knowledge, this is the first work devoted to screening reliable negative samples of CPIs. Our screening framework is based on the assumption that proteins dissimilar to every known/predicted target of a given compound are unlikely to be targeted by the compound, and vice versa. Viewed in the chemogenomical space, we sought compound–protein pairs located far from all positive samples and took them as negative samples, which contributed to the performance improvement of both classical classifiers and existing computational methods. Furthermore, compounds and proteins associated with only a small number of known interactions were excluded to reduce the possibility of taking real interactions as negative interactions because of activity cliffs and scaffold hopping. The feature divergence filtering further consolidated the strength of our dissimilarity rules. Extensive experiments demonstrated that our screened negative samples are highly credible and helpful for identifying CPIs.
On the basis of the screened negative samples and the positive samples obtained from DrugBank, we predicted potential CPIs on human and C. elegans by training SVM classifiers on the chemogenomical features. We also gave a confirmative example showing that the newly predicted target proteins of Donepezil are highly enriched in mental and nervous pathways and diseases. In summary, our screened negative samples and predictions provide the research community with a useful resource for identifying drug targets and constitute a helpful supplement to the current curated compound–protein databases.
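The core idea summarized above, taking as negatives the candidate pairs that lie far from every positive sample, can be illustrated with a deliberately simplified sketch. It is not the screening framework itself, which integrates several similarity sources and additional filters; here each pair is assumed to be a point in a single Euclidean feature space, and the function names and toy coordinates are hypothetical.

```python
import math

def min_distance_to_positives(candidate, positives):
    """Distance from a candidate pair to its nearest positive pair."""
    return min(math.dist(candidate, p) for p in positives)

def screen_negatives(candidates, positives, top_n):
    """Rank candidates by distance to the nearest positive, farthest first."""
    ranked = sorted(
        candidates,
        key=lambda name: min_distance_to_positives(candidates[name], positives),
        reverse=True,
    )
    return ranked[:top_n]

# Toy 2-D feature vectors for compound-protein pairs (hypothetical).
positives = [(0.0, 0.0), (1.0, 0.0)]
candidates = {"a": (0.1, 0.1), "b": (5.0, 5.0), "c": (2.0, 0.0), "d": (-3.0, 4.0)}

print(screen_negatives(candidates, positives, top_n=2))
```

Using the distance to the nearest positive, rather than the mean distance, mirrors the requirement that a negative be dissimilar to every known or predicted target, not merely to most of them.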

Funding

This work was supported by the National Natural Science Foundation of China (grant nos. 31300707, 61272380 and 61173118) and by MOE AcRF Tier 2 grant ARC 39/13 (MOE2013-T2-1-079), Ministry of Education, Singapore.

Conflict of Interest: none declared.

46 in total

1. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization.

Authors: Mehmet Gönen
Journal: Bioinformatics Date: 2012-06-23 Impact factor: 6.937

Review 2. Donepezil for dementia due to Alzheimer's disease.

Authors: J Birks; R J Harvey
Journal: Cochrane Database Syst Rev Date: 2006-01-25

3. Combining drug and gene similarity measures for drug-target elucidation.

Authors: Liat Perlman; Assaf Gottlieb; Nir Atias; Eytan Ruppin; Roded Sharan
Journal: J Comput Biol Date: 2011-02 Impact factor: 1.479

4. Modulatory profiling identifies mechanisms of small molecule-induced cell death.

Authors: Adam J Wolpaw; Kenichi Shimada; Rachid Skouta; Matthew E Welsch; Uri David Akavia; Dana Pe'er; Fatima Shaik; J Chloe Bulinski; Brent R Stockwell
Journal: Proc Natl Acad Sci U S A Date: 2011-09-06 Impact factor: 11.205

5. PREDICT: a method for inferring novel drug indications with application to personalized medicine.

Authors: Assaf Gottlieb; Gideon Y Stein; Eytan Ruppin; Roded Sharan
Journal: Mol Syst Biol Date: 2011-06-07 Impact factor: 11.429

6. The Pfam protein families database.

Authors: Marco Punta; Penny C Coggill; Ruth Y Eberhardt; Jaina Mistry; John Tate; Chris Boursnell; Ningze Pang; Kristoffer Forslund; Goran Ceric; Jody Clements; Andreas Heger; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2011-11-29 Impact factor: 16.971

7. Scalable prediction of compound-protein interactions using minwise hashing.

Authors: Yasuo Tabei; Yoshihiro Yamanishi
Journal: BMC Syst Biol Date: 2013-12-13

8. A side effect resource to capture phenotypic effects of drugs.

Authors: Michael Kuhn; Monica Campillos; Ivica Letunic; Lars Juhl Jensen; Peer Bork
Journal: Mol Syst Biol Date: 2010-01-19 Impact factor: 11.429

9. A semi-supervised method for drug-target interaction prediction with consistency in networks.

Authors: Hailin Chen; Zuping Zhang
Journal: PLoS One Date: 2013-05-07 Impact factor: 3.240

10. Toward more realistic drug-target interaction predictions.

Authors: Tapio Pahikkala; Antti Airola; Sami Pietilä; Sushil Shakyawar; Agnieszka Szwajda; Jing Tang; Tero Aittokallio
Journal: Brief Bioinform Date: 2014-04-09 Impact factor: 11.622
