Fatemeh Ahmadi Moughari1, Changiz Eslahchi2,3. 1. Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran. 2. Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran. Ch-Eslahchi@sbu.ac.ir. 3. School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. Ch-Eslahchi@sbu.ac.ir.
Abstract
One of the prominent challenges in precision medicine is to select the most appropriate treatment strategy for each patient based on personalized information. The availability of massive data about drugs and cell lines makes it possible to propose efficient computational models for predicting anticancer drug response. In this study, we propose ADRML, a model for Anticancer Drug Response prediction using Manifold Learning, which systematically integrates cell line information with drug information to make accurate predictions about drug therapeutic response. The proposed model maps the drug response matrix into lower-rank spaces, which yields new perspectives on cell lines and drugs. The drug response for a new cell line-drug pair is computed using the low-rank features. The evaluation of ADRML on various types of cell line and drug information, together with comparisons against previously proposed methods, shows that ADRML provides accurate and robust predictions. Further investigation of the association between drug response and pathway activity scores reveals that the predicted drug responses can shed light on the underlying drug mechanism. In addition, case studies suggest that the predictions of ADRML for novel cell line-drug pairs are supported by reliable evidence from the literature. Consequently, the evaluations indicate that ADRML can be used to accurately predict and impute anticancer drug response.
Precision medicine aims to finely select treatments for cancer based on the genetic information of individual patients[1]. One of the most critical problems in precision medicine is predicting the anticancer drug response of each patient[2-4]. Due to the heterogeneity of tumors, patients with the same type of cancer may show different therapeutic responses to similar drugs[5]. Therefore, computational methods that discover the relationship between genomic information and drug sensitivity are of high importance and can be beneficial in precision medicine[3,6].
Genomics of Drug Sensitivity in Cancer (GDSC)[7] and Cancer Cell Line Encyclopedia (CCLE)[8] are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines against several anticancer drugs. These large datasets facilitate the development of computational methods for anticancer drug sensitivity prediction. Numerous computational methods have been proposed to predict drug response using the gene expression profile or other molecular information of cell lines. Some of them have also considered drug information, such as the chemical substructure of drugs, in addition to cell line information. These methods have utilized various machine learning techniques, such as sparse linear regression[4,9-11], random forest[2,12,13], kernel-based methods[4,14-17], matrix factorization[1,18-20], and neural networks and deep learning[21-24].
Wang et al. proposed a Similarity Regularized Matrix Factorization (SRMF) method, which utilizes the similarity of cell lines based on gene expression profiles and the chemical substructure similarity of drugs to predict anticancer drug sensitivity[1]. They also conducted drug repurposing and suggested new potential treatments for cell lines with Non-Small Cell Lung Cancer (NSCLC). It is suggested that patients who have similar genomic properties show similar responses to similar drugs[1].
Based on the SRMF study, Suphavilai et al. proposed a recommender system called “CaDRReS” that can predict drug response for unseen cell lines[19]. Furthermore, they showed that the latent space features are correlated with the associated pathways of drugs. However, they did not consider any drug features when predicting the drug response values. Afterwards, Chang et al. devised “CDRscan”, an ensemble model containing five Convolutional Neural Networks (CNNs)[21]. They used the mutational profiles of cell lines and the chemical substructure of drugs as the input features to these CNNs. The drug response values were obtained by averaging the outputs of the five CNNs. Moreover, they repurposed multiple non-oncology drugs as potential therapeutic agents for cancer cell lines. Recently, Wei et al. suggested a simple cell line-drug complex network called “CDCN”[25]. They constructed a complex network from various information, including cell line similarities, drug similarities, and drug responses, and inferred unknown drug responses from the network. They also proposed a generalized version that can predict the drug response for new drugs and new cell lines. Despite its simple structure, CDCN obtained satisfying results in imputing missing drug responses.
Nevertheless, the proposed methods had moderate performance and did not analyze several types of features for cell lines and drugs. Thus, investigating the influence of various cell line and drug features on the prediction of therapeutic response is still needed and remains challenging. We investigate three types of cell line features, namely gene expression, mutation profile, and copy number variation, in addition to three types of drug features, namely chemical substructure, target proteins, and associated KEGG pathways. In this work, we propose ADRML, Anticancer Drug Response prediction using Manifold Learning.
ADRML constructs a bipartite graph between drugs and cell lines, and then decomposes its adjacency matrix into two lower-dimensional latent matrices using similarity-constrained manifold learning. The proposed method is capable of predicting the therapeutic response for new cell lines and new drugs. Similarity-constrained manifold learning has previously been used for drug-disease association prediction[26] and drug–drug interaction prediction[27], where it yielded accurate performance.
The performance of ADRML is measured using various types of cell line similarities and drug similarities and is compared to recently proposed methods on both the GDSC and CCLE datasets. Moreover, the rationality of ADRML predictions is confirmed by analyzing the association between the predicted drug response values and the activity scores of Biocarta pathways. Finally, case studies that check the predictions of ADRML for unknown drug responses against the literature and reliable databases verify its capability in predicting unknown drug responses and confirm that ADRML obtains accurate results for new cell line-drug pairs.
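The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the exact objective and update rules are given in “Methods”, so the loss below (squared error on known responses, an L2 penalty, and a penalty that pulls similar cell lines or similar drugs towards similar latent vectors), the plain gradient descent optimizer, the toy data, and all names (`factorize`, `reg`, `sim_weight`) are assumptions chosen for illustration.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def loss(R, A, B, sim_c, sim_d, reg, sim_weight):
    """Squared error on known entries + L2 penalty + similarity penalty."""
    total = 0.0
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            if r is not None:  # None marks an unknown response
                total += (r - dot(A[i], B[j])) ** 2
    total += reg * sum(dot(v, v) for v in A + B)
    for F, S in ((A, sim_c), (B, sim_d)):
        for i in range(len(F)):
            for p in range(len(F)):
                diff = [x - y for x, y in zip(F[i], F[p])]
                total += sim_weight * S[i][p] * dot(diff, diff)
    return total

def gradient_step(R, A, B, sim_c, sim_d, reg, sim_weight, lr):
    """One full-batch gradient descent update of A and B, in place."""
    k = len(A[0])
    grad_a = [[2 * reg * x for x in a] for a in A]
    grad_b = [[2 * reg * x for x in b] for b in B]
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            if r is None:
                continue
            err = r - dot(A[i], B[j])
            for t in range(k):
                grad_a[i][t] -= 2 * err * B[j][t]
                grad_b[j][t] -= 2 * err * A[i][t]
    for F, G, S in ((A, grad_a, sim_c), (B, grad_b, sim_d)):
        for i in range(len(F)):
            for p in range(len(F)):
                for t in range(k):
                    # Gradient of the similarity penalty, assuming S is symmetric.
                    G[i][t] += 4 * sim_weight * S[i][p] * (F[i][t] - F[p][t])
    for F, G in ((A, grad_a), (B, grad_b)):
        for i in range(len(F)):
            for t in range(k):
                F[i][t] -= lr * G[i][t]

def factorize(R, sim_c, sim_d, k=2, reg=0.1, sim_weight=0.1,
              lr=0.01, steps=500, seed=0):
    """Return latent matrices A (cell lines x k), B (drugs x k) and the loss history."""
    rng = random.Random(seed)
    A = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in R]
    B = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in R[0]]
    history = [loss(R, A, B, sim_c, sim_d, reg, sim_weight)]
    for _ in range(steps):
        gradient_step(R, A, B, sim_c, sim_d, reg, sim_weight, lr)
        history.append(loss(R, A, B, sim_c, sim_d, reg, sim_weight))
    return A, B, history

# Hypothetical toy data: 4 cell lines x 3 drugs, None = unknown response.
R = [[0.9, 0.1, None],
     [0.8, 0.2, 0.5],
     [0.1, 0.9, 0.4],
     [None, 0.8, 0.3]]
sim_c = [[1.0, 0.9, 0.1, 0.2],
         [0.9, 1.0, 0.2, 0.1],
         [0.1, 0.2, 1.0, 0.8],
         [0.2, 0.1, 0.8, 1.0]]
sim_d = [[1.0, 0.1, 0.5],
         [0.1, 1.0, 0.4],
         [0.5, 0.4, 1.0]]

A, B, history = factorize(R, sim_c, sim_d)
predicted_missing = dot(A[0], B[2])  # imputed response for cell line 0, drug 2
```

In this sketch an unknown response is imputed as the inner product of the corresponding cell line and drug latent vectors; the three arguments `k`, `reg`, and `sim_weight` play the roles of the latent dimension, the regularization coefficient, and the similarity conservation coefficient.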
Results
Benchmark datasets and collected features
In this work, we used two pharmacogenomic datasets, namely the Genomics of Drug Sensitivity in Cancer (GDSC)[7] and the Cancer Cell Line Encyclopedia (CCLE)[8]. Among the several types of data in these datasets, the half-maximal inhibitory concentration (IC50), the gene expression profile, the copy number variation, and the mutation profile were downloaded using the PharmacoGx R package[28]. The collected genes were accessible in the COSMIC database[29], and the collected drugs were restricted to the drugs with a Compound ID (CID) in the PubChem database[30].
Some values of IC50, copy number variation, and mutation profiles were missing in both datasets. Following Lu et al.[2], a pre-processing procedure was applied to impute the missing values, which is fully described in “Pre-processing to impute the missing data”. After applying the pre-processing steps, the GDSC dataset contained 98 drugs and 555 cell lines from 19 cancer types, as defined by The Cancer Genome Atlas (TCGA)[31], and the CCLE dataset contained 24 drugs and 363 cell lines from 22 cancer types as defined by TCGA. Furthermore, several types of information about drugs were obtained from the following databases: the fingerprints of canonical simplified molecular-input line-entry system (SMILES) strings were obtained from PubChem[30]; the target proteins were collected from GDSC, DrugBank[32], and the literature; and the KEGG pathways related to the drugs were downloaded from the STiTCH database[33]. A brief description of the collected data is presented in Table 1.
Table 1
The number of collected samples and features.
| Dataset | Cell line | Drug | Tissue types | Expression profile | Mutation profile | Copy number variation | Drug fingerprint | Target protein | KEGG pathway |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCLE | 363 | 24 | 22 | 19,389 | 1,667 | 24,960 | 881 | 76 | 124 |
| GDSC | 555 | 98 | 19 | 11,712 | 54 | 24,959 | 881 | – | – |
The cell line features (gene expression profile, mutation profile, copy number variation, and tissue types), the drug names, and the drug response values were downloaded from the PharmacoGx package. Drug fingerprints were obtained from PubChem, target proteins were gathered mainly from GDSC and DrugBank, and KEGG pathways were obtained from the STiTCH database.
Hyper-parameter tuning
The ADRML model, which is fully described in “Methods”, has three hyper-parameters: k, the dimension of the latent space; the regularization coefficient; and the similarity conservation coefficient. In order to map the response matrix into a lower-dimensional space, the value of k was chosen to be less than the number of cell lines and drugs. For simplicity, we considered . We tuned the hyper-parameter values using grid search. We executed ADRML with fivefold cross-validation on all pairs of cell line and drug for all combinations of , . The hyper-parameters were tuned on the CCLE dataset, using the gene expression similarity of cell lines and the chemical similarity of drugs, by maximizing a fitness score (referred to as fitness in the following). The fitness score is defined from three evaluation criteria, the Coefficient of Determination (R²), the Pearson Correlation Coefficient (PCC), and the Root Mean Square Error (RMSE), which are explained in “Evaluation criteria”. This definition is reasonable because the best model is the one with the highest values of R² and PCC and the lowest value of RMSE. ADRML achieved the best results when and . We used the same hyper-parameter values for all types of similarities in CCLE and GDSC. To illustrate the impact of the regularization and similarity conservation coefficients on the fitness score, we fixed the latent dimension to and depicted the fitness function in the 3D histogram of Fig. 1a. It is evident that when is small, the fitness function increases with regard to . Conversely, when is small, larger values lead to a higher fitness score.
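The three evaluation criteria named above can be sketched as follows. These are the standard textbook definitions, which are assumed here; the definitions actually used in the paper are those given in “Evaluation criteria”, and the function names and toy vectors are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted responses."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson(y_true, y_pred):
    """Pearson Correlation Coefficient between observed and predicted responses."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical observed responses and two sets of predictions.
observed = [1.0, 2.0, 3.0, 4.0]
exact = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated but far from the observations
```

On the `scaled` predictions the PCC is 1 while R² is negative and RMSE is large, which shows why the three criteria are reported together: a prediction can be perfectly correlated with the observations and still lie far from them.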
Figure 1
The effect of choosing different values of hyper-parameters on ADRML performance.
Moreover, the values of the regularization and similarity conservation coefficients were fixed, and the influence of the latent space dimension was examined. Figure 1b demonstrates that larger latent space dimensions lead to a higher fitness score: PCC and R² improve as k increases, while RMSE declines. However, the criteria values change little or not at all after .
Performance of ADRML prediction
We investigated the effects of using different similarity constraints on ADRML performance. Cell line similarities based on gene expression, mutation, and copy number variation, and drug similarities based on chemical substructure, target proteins, and KEGG pathways were considered as the constraints of manifold learning.
Table 2 summarizes the performance of ADRML for every combination of cell line and drug similarity. Each pair of cell line and drug similarity is shown in one row, and the columns show the computed criteria. ADRML yields accurate and robust performance in every scenario, since the results of all conditions are high and close to each other. However, it achieves the best results using the similarity of cell lines based on gene expression and the similarity of drugs based on target proteins, which yields . We used these two similarities for further evaluations.
Table 2
Performance of ADRML on various types of similarities.
The performance of each model is evaluated using fivefold cross-validation on cell line-drug pairs and using , , and . Each row shows the performance of ADRML on a pair of cell line and drug similarity. The best result for each criterion is shown in boldface.
In order to investigate ADRML performance on each drug, we depicted drug-wise correlation plots. Figures 2 and 3 illustrate the Pearson correlation between the observed and predicted responses for four drugs in the CCLE and GDSC datasets, respectively. The figures show high drug-wise PCC and validate that ADRML can predict drug responses that are highly correlated with the real responses. The majority of data points in these scatter plots are centered near the fitted line. The outliers are to be expected given the technical noise in gene expression data and the inconsistency of drug responses in CCLE and GDSC[5,25,34,35]. Further plots of drug-wise PCC for GDSC are shown in Supplementary Figs. S1–S98, and those for CCLE are shown in Supplementary Figs. S99–S122.
Figure 2
Drug-wise PCC for 4 drugs in CCLE. The computed PCC is shown in the lower right corner of each plot.
Figure 3
Drug-wise PCC for 4 drugs in GDSC. The computed PCC is shown in the lower right corner of each plot.
Comparison of prediction performance with state-of-the-art methods
For a comprehensive evaluation of ADRML’s performance, we compared it to other recent state-of-the-art methods, namely CDRscan[21], CDCN[25], SRMF[1], and CaDRReS[19]. The implementations of all methods were obtained from the code referred to in their papers. In order to have a fair comparison, we conducted all evaluations in the same setting and on the same datasets. The comparison was made on the average performance of the models over 30 repetitions of fivefold cross-validation with tuned hyper-parameters.
It should be noted that the hyper-parameters of CaDRReS could not be fully tuned due to its high time complexity. The hyper-parameters for CaDRReS were set according to its paper and the authors’ suggestions.
The features used for cell lines and drugs differ between these methods. For each method, the required features, as mentioned in its paper, were provided from the benchmark datasets described in "Benchmark datasets and collected features".
In addition to the mentioned methods, K-nearest neighbor (KNN) with k = 1 was considered as a baseline method and compared to the other methods. KNN was implemented using the Scikit-learn module in Python[36]. For executing KNN, the input feature vector for each pair of cell line i and drug j was the concatenation of the ith row of simC and the jth column of simD. All types of cell line similarities and drug similarities were considered as simC and simD, respectively. The complete report of KNN performance on the various types of similarities is provided in Supplementary Table S1. KNN obtained its best performance with the gene expression similarity of cell lines and the chemical substructure similarity of drugs.
Tables 3 and 4 present the performance of the mentioned methods on CCLE and GDSC, respectively. Additionally, scatter plots with fitted lines for the predictions of the mentioned methods on CCLE are presented in Supplementary Figs. S123–S128.
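The feature construction for the KNN baseline can be sketched as follows. The paper used Scikit-learn; the dependency-free nearest-neighbour lookup below is a stand-in for illustration, and the function names, the squared Euclidean distance, and the toy similarity matrices are assumptions.

```python
def pair_features(sim_c, sim_d, i, j):
    """Concatenate the ith row of simC with the jth column of simD."""
    return list(sim_c[i]) + [row[j] for row in sim_d]

def knn1_predict(train, query):
    """Return the response of the single nearest training pair (k = 1).

    train is a list of (feature_vector, response) tuples; distance is
    squared Euclidean, which is an assumption of this sketch.
    """
    def sq_dist(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))
    nearest = min(train, key=lambda item: sq_dist(item[0]))
    return nearest[1]

# Hypothetical toy similarities: 3 cell lines, 2 drugs.
sim_c = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.2],
         [0.1, 0.2, 1.0]]
sim_d = [[1.0, 0.3],
         [0.3, 1.0]]

# Known responses for some (cell line, drug) pairs.
known = {(0, 0): 0.2, (2, 1): 0.9, (2, 0): 0.7}
train = [(pair_features(sim_c, sim_d, i, j), r) for (i, j), r in known.items()]

# Predict the unknown pair (cell line 1, drug 0).
query = pair_features(sim_c, sim_d, 1, 0)
prediction = knn1_predict(train, query)
```

Here the query pair inherits the response of the pair (cell line 0, drug 0), because cell lines 0 and 1 are highly similar and both pairs involve the same drug.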
Table 3
Comparison of methods’ performance on CCLE dataset.
The methods were evaluated by averaging over 30 repetitions of fivefold cross-validation on cell line-drug pairs. The best result of each criterion is shown in boldface.
Table 4
Comparison of methods’ performance on the GDSC dataset.
The methods were evaluated by averaging over 30 repetitions of fivefold cross-validation on cell line-drug pairs. The best result of each criterion is shown in boldface.
The results of the baseline method (KNN) on both datasets were not far from those of the state-of-the-art methods, which means that improving the results is challenging. On the CCLE dataset, SRMF achieved the best RMSE and a favorable PCC; however, it achieved a lower R² than the baseline, i.e., the variance of the predicted responses did not explain the variance of the real drug responses well. CaDRReS yielded reasonable results, but its R² and PCC were lower than those of the baseline. CDRscan obtained a favorable R² and PCC but had the highest RMSE; its predictions are therefore highly correlated with the real responses while lying far from them. CDCN showed a satisfying performance, but with lower R² and PCC and a higher RMSE than ADRML. Therefore, ADRML outperformed the other methods.
On the GDSC dataset, SRMF obtained the best RMSE and moderate R² and PCC. The performance of CaDRReS was satisfying, but its R² and PCC were worse than those of the baseline. CDRscan showed good performance but with a high RMSE, similar to its performance on the CCLE dataset. Moreover, CDCN’s performance was satisfying; however, its R² and PCC were lower than those of ADRML, and its RMSE was higher. Consequently, ADRML outperformed the other methods with regard to R² and PCC.
In addition to the mentioned analysis, we investigated whether using other types of cell line similarities and drug similarities would improve the results of the other methods. To this aim, we executed CDCN, SRMF, CaDRReS, and KNN on all types of similarities.
It is worth mentioning that CDRscan receives binary feature matrices as input, and the dimensions of the binary drug feature vectors in the CCLE and GDSC datasets were not appropriate for the CNNs designed in CDRscan; therefore, CDRscan could not be run on other types of similarities. The other methods (CDCN, SRMF, CaDRReS, and KNN) receive similarity matrices as input. Moreover, CaDRReS takes only the cell line similarity and does not take any drug similarity matrix as input.
The full report of the performance criteria for the mentioned methods is presented in Supplementary Table S1. It can be seen that the performance of the other methods barely improves when other similarities are used instead of their originally proposed similarities. For a given pair of cell line and drug similarities, SRMF often obtains the best RMSE, while ADRML achieves the best R² and the best PCC.
In sum, ADRML performed better than the other state-of-the-art methods on both CCLE and GDSC in terms of R² and PCC. These achievements further substantiate ADRML performance.
Removing redundant cell lines from CCLE and GDSC
The CCLE dataset contains 363 cell lines from 22 different tissue types. The number of cell lines in each tissue type is shown in Fig. 4. The least frequent tissue types (biliary tract and prostate) contain one cell line each, and the most frequent tissue type (lung) comprises 76 cell lines. Since cell lines from the same tissue may be highly similar, this may lead to redundancy. Thus, it is better to eliminate the redundancy within each tissue type, taking into account the number of cell lines from that tissue. To do so, we filtered out the cell lines that are highly similar to other cell lines in the same tissue type.
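The idea of filtering near-duplicate cell lines within a tissue type can be illustrated with a simple greedy filter. This is only a hypothetical sketch: the actual procedure is the one described in “Finding the most redundant cell lines”, and the greedy order, the similarity cutoff, the function name `filter_redundant`, and the toy data are all assumptions. In particular, the sketch does not take the number of cell lines per tissue into account.

```python
def filter_redundant(cell_lines, tissue_of, similarity, cutoff):
    """Greedily keep a cell line only if it is not too similar to any
    already-kept cell line from the same tissue type.

    similarity[a][b] is assumed to be symmetric and to lie in [0, 1].
    """
    kept = []
    for candidate in cell_lines:
        same_tissue = [c for c in kept if tissue_of[c] == tissue_of[candidate]]
        if all(similarity[candidate][c] < cutoff for c in same_tissue):
            kept.append(candidate)
    return kept

# Hypothetical toy data: three lung cell lines and one skin cell line.
cell_lines = ["A", "B", "C", "D"]
tissue_of = {"A": "lung", "B": "lung", "C": "lung", "D": "skin"}
similarity = {
    "A": {"A": 1.0, "B": 0.95, "C": 0.30, "D": 0.99},
    "B": {"A": 0.95, "B": 1.0, "C": 0.40, "D": 0.99},
    "C": {"A": 0.30, "B": 0.40, "C": 1.0, "D": 0.99},
    "D": {"A": 0.99, "B": 0.99, "C": 0.99, "D": 1.0},
}

kept = filter_redundant(cell_lines, tissue_of, similarity, cutoff=0.9)
```

In this toy example, B is dropped because it is nearly identical to A within the lung tissue, while D is kept despite its high similarity to the lung cell lines because comparisons are made only within a tissue type.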
Figure 4
Tissue types in CCLE. The number of cell lines in each tissue type is shown in parentheses.
The detailed procedure for removing redundant cell lines is described in “Finding the most redundant cell lines”. This procedure eliminated 64 cell lines from CCLE and retained 299. The remaining cell lines comprise the purified CCLE dataset without redundancy. The lists of remaining and excluded cell lines are reported in Supplementary Table S2. To analyze the performance of ADRML and the other state-of-the-art methods on the new dataset, we executed these methods using 30 repetitions of fivefold cross-validation. Table 5 shows the performance of the methods on the new dataset. It can be seen that ADRML outperforms the other methods with respect to R² and PCC.
Table 5
Comparison of methods’ performance on the purified CCLE dataset.
The methods were evaluated by averaging over 30 repetitions of fivefold cross-validation on cell line-drug pairs. The best result of each criterion is shown in boldface.
Moreover, the GDSC dataset comprises 555 cell lines from 19 tissue types. The tissue types have different numbers of cell lines, which are shown in Fig. 5. To remove the redundant cell lines from GDSC, the procedure described in “Finding the most redundant cell lines” was applied, eliminating 103 cell lines and preserving 452. The remaining cell lines form the purified GDSC dataset with lower redundancy. The lists of remaining and excluded cell lines are reported in Supplementary Table S2. The performance of the methods on the purified GDSC dataset using 30 repetitions of fivefold cross-validation is presented in Table 6. It can be seen that SRMF obtained the best RMSE, CDCN achieved the best R², and ADRML yielded the best PCC.
Figure 5
Tissue types in GDSC. The number of cell lines in each tissue type is shown in parentheses.
Table 6
Comparison of methods’ performance on the purified GDSC dataset.
The methods were evaluated by averaging over 30 repetitions of fivefold cross-validation on cell line-drug pairs. The best result of each criterion is shown in boldface.
Comparing the results in Tables 3 and 4 with those in Tables 5 and 6 shows that the performance of the models declines slightly when the redundant cell lines are removed. This may be due to the reduction in sample size or to the existence of bias before the redundant cell lines were removed.
Moreover, we applied the redundancy removal procedure with different thresholds to investigate the performance of ADRML at different levels of redundancy removal. Furthermore, this procedure was repeated based on the gene expression similarities of cell lines. Table 7 presents the number of remaining cell lines for each threshold value.
Table 7
The number of remaining cell lines in CCLE and GDSC after applying redundancy removal procedure with different levels of strictness (threshold) based on copy number variation and gene expression.
| Threshold | Dataset | Remained cell lines based on copy number variation | Remained cell lines based on gene expression |
| --- | --- | --- | --- |
| 0.01 | CCLE | 72 | 111 |
| 0.05 | CCLE | 116 | 161 |
| 0.1 | CCLE | 194 | 209 |
| 0.15 | CCLE | 245 | 245 |
| 0.2 | CCLE | 299 | 277 |
| 0.25 | CCLE | 325 | 303 |
| 0.3 | CCLE | 339 | 322 |
| 0.35 | CCLE | 351 | 345 |
| 0.01 | GDSC | 78 | 107 |
| 0.05 | GDSC | 161 | 209 |
| 0.1 | GDSC | 266 | 293 |
| 0.15 | GDSC | 363 | 362 |
| 0.2 | GDSC | 452 | 435 |
| 0.25 | GDSC | 501 | 480 |
| 0.3 | GDSC | 529 | 509 |
| 0.35 | GDSC | 547 | 532 |
ADRML performance was evaluated on each of the datasets resulting from redundancy removal at the various levels of strictness. Figure 6a,b illustrates the PCC values of ADRML assessed using fivefold cross-validation on the purified datasets. These figures show that the trend of ADRML performance is almost the same on the datasets purified based on copy number variation and on gene expression. ADRML achieves the best PCC at the strictest threshold, which removes many cell lines, and adding further cell lines lowers its PCC. Moreover, the PCC of ADRML on the purified CCLE datasets first increased sharply and then decreased as the level of strictness in redundancy removal was lowered.
Figure 6
Performance of ADRML on purified datasets with different levels of strictness. The purified datasets were obtained after redundancy removal with certain thresholds. Each panel shows the PCC of ADRML assessed using 5-fold cross-validation on the purified datasets.
Analysis of association between drugs and signaling pathways
To demonstrate that the predictions of ADRML are meaningful and rational, we investigated the correlation between the predicted drug responses and the pathway activity scores for several Biocarta pathways from MsigDB[37]. The detailed procedure is described in “Computing association of drugs and signalling pathways”. Figure 7 visualizes the association between drugs and signaling pathways for 24 drugs in the CCLE dataset and 25 Biocarta pathways. The complete association values are provided in Supplementary Table S3. There is considerable evidence in the literature for these correlations, some of which is provided here.
Figure 7
Correlation of pathway activity scores and drug responses. The drugs are shown in rows, and pathways are shown in columns. The positive correlations are represented in red and negative correlations are represented in blue. The intensity of the color indicates the extent of correlation.
Paclitaxel drug and signaling pathway exhibited a highly positive correlation. Paclitaxel is one of the agents that have been frequently reported for the activation of pathway[38-41]. Thus, the higher consumption of Paclitaxel leads to more activation of , which verifies the high positive correlation between Paclitaxel and . Moreover, Paclitaxel was positively associated with the P53 pathway. It has been verified that Paclitaxel activates the P53 signaling pathway[42] and that cell lines with disrupted P53 are resistant to Paclitaxel[43]. Furthermore, HSP90 inhibitor was positively correlated with P53 pathway. It has been suggested in the previous studies that has an anti-tumor activity via activation and stabilization of P53[44], that admits the positive association of and P53 pathway.
Irinotecan response has a highly significant positive correlation with the activity score of P53. Irinotecan is a topoisomerase I inhibitor that is frequently used for anticancer therapy. A previous study on human hepatocellular carcinoma (HCC) cell lines investigating the apoptotic mechanisms of Irinotecan revealed that it significantly activates P53[45]. Additionally, the positive correlation between Irinotecan response and the EGFR pathway is supported by several studies, which have shown that resistance to Irinotecan is connected with increased expression of EGFR[46] and have confirmed that Irinotecan upregulates the EGFR pathway[47]. Also, Panobinostat which is a potent inhibitor of deacetylases and HSP90[48], revealed a high significant positive correlation with pathway. Previous study have shown that using Panobinostat increased the level of [48].
Case studies
We conducted case studies on GDSC cell line-drug pairs with unknown IC50 values. To do this, we did not impute the missing values in the IC50 matrix and trained ADRML with all known drug responses. For each drug, the predictions of ADRML on the unknown pairs were partitioned into four quantiles, and the cell lines in the first and last quantiles were considered as the sensitive and resistant cell lines for that drug, respectively. The complete lists of predicted sensitive and resistant associations are provided in Supplementary Tables S4 and S5, respectively. The sensitive associations were checked against both the literature and the latest release of GDSC (released Feb. 2020). Table 8 presents supporting evidence from the literature for ADRML predictions. Table 9 lists some of the cell line-drug pairs that had unknown IC50 values in the data previously extracted from GDSC and whose drug response values are now available in the latest release of GDSC.
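The per-drug partitioning described above can be sketched as a simple rank-based split. The sketch assumes that lower predicted response values (such as IC50) indicate higher sensitivity, and it takes the lowest and highest quarter of the ranked cell lines; the function name, the tie handling, and the toy values are illustrative assumptions rather than the authors' code.

```python
def split_by_quartile(predictions):
    """Split the cell lines of one drug into sensitive and resistant groups.

    predictions maps cell line name -> predicted response for a single drug.
    Returns (sensitive, resistant): the lowest and highest quarter of the
    cell lines when ranked by predicted response.
    """
    ranked = sorted(predictions, key=predictions.get)
    quarter = len(ranked) // 4
    sensitive = ranked[:quarter]
    resistant = ranked[len(ranked) - quarter:]
    return sensitive, resistant

# Hypothetical predicted responses of eight cell lines for one drug.
predictions = {
    "c1": 0.1, "c2": 0.9, "c3": 0.4, "c4": 0.5,
    "c5": 0.2, "c6": 0.8, "c7": 0.6, "c8": 0.3,
}
sensitive, resistant = split_by_quartile(predictions)
```

With eight cell lines, each group contains two: the two lowest predicted responses form the sensitive group and the two highest form the resistant group.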
Table 8
Literature evidence for some of the sensitive predictions of ADRML about novel cell line-drug pairs.
| Cell line | Drug | Cell line cancer type | Drug indication | Evidence |
| --- | --- | --- | --- | --- |
| HOP-92 | Parthenolide | Non-Small Cell Lung Cancer | Pan-cancer | Janganati et al.[49] |
| MDA-MB-468 | Parthenolide | Breast cancer | Pan-cancer | Janganati et al.[49] |
| HCT-116 | Parthenolide | Colon cancer | Pan-cancer | Janganati et al.[49] |
| MKN45 | Roscovitine | Gastric adenocarcinoma | Pan-cancer | Trenti et al.[50] |
| AGS | Roscovitine | Gastric adenocarcinoma | Pan-cancer | Trenti et al.[50] |
These cell line-drug pairs had unknown IC50 values in the training dataset, and ADRML predicted them as sensitive. The evidence papers for these predictions are listed in the last column.
Table 9
ADRML sensitive predictions for novel cell line-drug pairs verified by the latest release of GDSC.
| Cell line | Drug | Cell line cancer type | Drug indication |
| --- | --- | --- | --- |
| SK-MEL-24 | NSC-87877 | Melanoma | Melanoma |
| SK-MEL-3 | NSC-87877 | Melanoma | Melanoma |
| TK10 | Erlotinib | Renal cell carcinoma | Non-small cell lung cancer (NSCLC) and pancreatic cancer |
| SW684 | WH-4-023 | Fibrosarcoma | Pan-cancer |
| SW982 | BMS-509744 | Synovial sarcoma | Pan-cancer |
| SK-LMS-1 | CMK | Vulvar leiomyosarcoma | Pan-cancer |
| SW982 | A-770041 | Synovial sarcoma | Sarcoma |
These pairs had unknown IC50 values in the training dataset and were predicted as sensitive pairs by ADRML. The latest release of GDSC reports these pairs as sensitive.
Discussion
In this study, we proposed ADRML, a computational model for predicting anticancer drug response using manifold learning. The model combines three sets of information, namely known drug responses, cell line similarity, and drug similarity, to infer novel predictions. The main contribution of this paper is evaluating the influence of various types of cell line similarities and drug similarities on the prediction performance. We collected various features for cell lines and drugs from CCLE, GDSC, STITCH, PubChem, and DrugBank. We investigated nine different scenarios using three cell line similarities based on gene expression, mutation, and copy number variation, and three drug similarities based on chemical substructure, target proteins, and KEGG pathways. The performance of ADRML was investigated using fivefold cross-validation on cell line-drug pairs. The best performance was obtained using gene expression data for cell lines and target protein data for drugs, which was more accurate than the previously proposed methods. We also investigated the performance of other state-of-the-art methods and KNN (with k = 1) as the baseline method on various types of similarities and showed that their best performance was achieved using the similarities that were suggested in their papers.

Another contribution of this paper was the purification of the CCLE and GDSC benchmarks by removing redundant cell lines. The purified benchmarks were also used for assessing the methods’ performance. The results showed that excluding redundant cell lines degrades the methods’ performance, which may be due to the reduction of sample size or the removal of bias from the database.

Interestingly, KNN with k = 1, a simple baseline method, shows favorable results and outperforms some more complicated methods, especially on the purified datasets. However, it should be noted that the performance of sophisticated methods declines when the data size is not sufficient. A complicated method needs a massive amount of data to train well and to get a good grasp of predicting outputs from inputs. For example, Chang et al.[21] provided CDRscan with more cell lines and drugs than used in this paper and trained CDRscan with 95% of its data (as opposed to 80% of the data in this paper). Therefore, the reported in[21] is better than the results reported in this paper. One can conclude that providing more informative data may enrich the training data and lead to better training of complex models. It is noteworthy that, due to the inherently challenging nature of the problem, even small improvements in results are welcome and useful.

The proposed method in this study outperformed other methods in terms of two criteria and PCC in most comparison scenarios. The predicted drug response values revealed high correlations with observed drug responses and suggested meaningful clues about drug mechanisms in activation/inhibition of pathways. Moreover, reliable literature evidence supports the predictions of ADRML about novel cell line-drug pairs. As a consequence, the promising results of ADRML verified its efficiency in anticancer drug response prediction and imputation.
Method
The proposed method includes five steps:

1. Pre-processing to impute missing data
2. Calculating various types of similarity matrices for cell lines and drugs
3. Normalizing the similarity matrices
4. Similarity-constrained manifold learning to factorize the IC50 matrix into low-rank latent matrices
5. Estimating unknown IC50 values using the latent matrices

The overall workflow of ADRML is illustrated in Fig. 8.

Figure 8
The overall workflow of ADRML. (A) Collecting various types of cell lines and drugs features. Further steps can be executed for each pair of cell line feature types and drug feature types. (B) Pre-processing the collected feature by removing the features with missing data for more than half samples and then imputing the remaining missing values. (C) Computing various types of cell line similarities and drug similarities using similarity functions. (D) Normalizing the similarity matrices using symmetric normalized Laplacian. (E) The IC50 matrix constructed from known IC50 values is factorized into two low-rank latent matrices with constraints of the similarity matrices. The unknown IC50 values can be predicted by multiplying the latent matrices.

For convenience, define $EXPR_c$, $CNV_c$, and $MUT_c$ as the expression, copy number variation, and mutation status of all genes in cell line $c$, respectively. More precisely, $CNV_c(g)$ and $MUT_c(g)$ denote the copy number variation and mutation status of gene $g$ in cell line $c$. Furthermore, $CHEM_d$, $TRGT_d$, and $KEGG_d$ stand for chemical features, target status (equals 1 for the proteins that are targets of the drug, 0 otherwise) for all proteins, and pathway status (equals 1 for the pathways that are associated with the drug, 0 otherwise) for drug $d$, respectively. Finally, $IC50(c, d)$ is defined as the IC50 value for cell line $c$ treated with drug $d$.
Pre-processing to impute the missing data
Several steps were taken to handle the missing data. First, the features that were missing in the majority of cell lines were removed. Second, the cell lines with missing values for more than half of the features were excluded. The remaining missing values were imputed using a k-nearest neighbor approach. To this aim, the distance between cell lines was defined as the Euclidean distance of their expression profiles, because the expression features of the cell lines contain no missing values; thus, the distance can be calculated for each pair of cell lines. The distance between cell lines $c$ and $c'$ is

$$dist(c, c') = \lVert EXPR_c - EXPR_{c'} \rVert_2 .$$

Then, the mean feature value among the 10 nearest cell lines was used to impute the missing IC50 value of drug $d$ or CNV value of gene $g$ in cell line $c$:

$$IC50(c, d) = \frac{1}{|N_c|} \sum_{c' \in N_c} IC50(c', d), \qquad CNV_c(g) = \frac{1}{|N_c|} \sum_{c' \in N_c} CNV_{c'}(g),$$

where $N_c$ is the set of 10 cell lines with the minimum distance from cell line $c$, and $|N_c| = 10$. Moreover, to impute the mutation status (“1” for mutated and “0” for wild type) of gene $g$ in cell line $c$, the majority vote of the 10 nearest cell lines is used, i.e. $MUT_c(g)$ is 1 if and only if $\sum_{c' \in N_c} MUT_{c'}(g) > |N_c| / 2$.
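A minimal sketch of this nearest-neighbor imputation is given below. The function names and the dictionary layout are hypothetical. Two details are assumptions rather than statements from the text: neighbors are restricted to cell lines in which the feature is observed, and a tied vote is resolved as wild type.

```python
import math

def knn_impute(expr, feature, k=10):
    """Fill None entries of `feature` using the k nearest cell lines.

    expr: dict cell line -> complete expression profile (list of floats).
    feature: dict cell line -> one feature value (float or None), e.g. the
    CNV of one gene or the IC50 of one drug.
    """
    filled = dict(feature)
    for c, value in feature.items():
        if value is not None:
            continue
        # Assumption: only cell lines with an observed value can be donors.
        donors = [o for o in feature if o != c and feature[o] is not None]
        # Rank donors by Euclidean distance between expression profiles.
        donors.sort(key=lambda o: math.dist(expr[c], expr[o]))
        nearest = donors[:k]
        filled[c] = sum(feature[o] for o in nearest) / len(nearest)
    return filled

def knn_impute_binary(expr, status, k=10):
    """Majority-vote variant for 0/1 mutation status (ties give 0)."""
    means = knn_impute(expr, status, k)
    return {c: (status[c] if status[c] is not None else int(means[c] > 0.5))
            for c in status}

# Toy data with k=2 so the result is easy to check by hand.
expr = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0], "d": [10.0, 10.0]}
cnv = {"a": None, "b": 2.0, "c": 4.0, "d": 100.0}
filled = knn_impute(expr, cnv, k=2)
```

Here the two nearest neighbors of cell line `a` are `b` and `c`, so the distant cell line `d` does not influence the imputed value.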
Similarity matrices construction and normalization
For computing the similarity score of two cell lines (or drugs), the PCC and Jaccard index (JI) were used as the similarity functions:

$$PCC(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}, \qquad JI(x, y) = \frac{|x \cap y|}{|x \cup y|},$$

where $x, y$ are two feature vectors, $x_i$ and $y_i$ denote the ith element of these vectors, and $\bar{x}$ and $\bar{y}$ are their mean values. For binary vectors, $|x \cap y|$ and $|x \cup y|$ count the positions in which both vectors, respectively at least one of them, equal 1. Basically, the PCC is used to calculate the similarity of two continuous vectors, while JI is appropriate to measure the similarity of two discrete vectors. We followed this rationale in the calculation of similarity matrices. The dimensions of the cell line and drug similarity matrices are $n \times n$ and $m \times m$, respectively, where n denotes the number of cell lines and m denotes the number of drugs.

Consequently, we constructed three types of similarity matrices for cell lines, based on EXPR, CNV, and MUT. Since EXPR and CNV features are real-valued, PCC was used to measure their similarity, while MUT is binary-valued and JI was used to measure mutation similarity.

- $S_{EXPR}$ is the similarity matrix of cell lines based on their gene expression profiles.
- $S_{CNV}$ is the similarity matrix of cell lines based on their copy number variations.
- $S_{MUT}$ is the similarity matrix of cell lines based on their mutation profiles.

Furthermore, three types of similarity matrices for drugs, based on PubChem SMILES (CHEM), target proteins (TRGT), and KEGG pathways (KEGG), were calculated as follows. It is notable that all drug features are binary-valued; thus, JI was used for measuring the similarity of drugs based on each type of information.

- $S_{CHEM}$ is the similarity matrix of drugs according to their chemical substructure fingerprints.
- $S_{TRGT}$ is the similarity matrix of drugs according to their target proteins.
- $S_{KEGG}$ is the similarity matrix of drugs according to their KEGG pathways.

Then all of the computed similarity matrices were normalized by computing the symmetric normalized Laplacian[51]. Let S be a similarity matrix; the normalized similarity matrix was obtained as follows:

$$\tilde{S} = D^{-1/2} S D^{-1/2},$$

where D is a diagonal matrix with diagonal elements equal to the summation of each row in S, i.e. $D_{ii} = \sum_j S_{ij}$. It is noteworthy that .
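The similarity functions and the normalization step can be illustrated with a short sketch. The helper names are hypothetical. The normalization is written as $D^{-1/2} S D^{-1/2}$, which is one common reading of the symmetric normalization described in the text and is an assumption here. It requires every row sum of S to be positive.

```python
import math

def pcc(x, y):
    """Pearson correlation coefficient of two real-valued vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def jaccard(x, y):
    """Jaccard index of two binary (0/1) vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def similarity_matrix(profiles, func):
    """Pairwise similarity of all profiles under the given function."""
    return [[func(p, q) for q in profiles] for p in profiles]

def normalize(S):
    """Symmetric normalization with D = diag(row sums); assumes sums > 0."""
    d = [sum(row) for row in S]
    n = len(S)
    return [[S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]

# Toy binary target profiles for three hypothetical drugs.
targets = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
S = similarity_matrix(targets, jaccard)
N = normalize(S)
```

Passing `pcc` instead of `jaccard` to `similarity_matrix` gives the real-valued case used for expression and copy number profiles.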
Manifold learning with similarity constraints
We constructed a bipartite graph with two parts: drugs and cell lines. The weight of the edge between cell line $c_i$ and drug $d_j$ is the IC50 value of drug $d_j$ on cell line $c_i$. Thus, the IC50 drug response matrix $R_{n \times m}$ is the adjacency matrix of this graph, where $n, m$ are the numbers of cell lines and drugs, respectively. We used manifold learning to factorize the drug response matrix R into two latent matrices $P_{n \times k}$ and $Q_{m \times k}$ with lower rank. By using this factorization we could map the cell line and drug features into a latent space with dimension k, i.e. P and Q are the cell line latent matrix and drug latent matrix, respectively. The ith row of P (shown by $p_i$) is the latent vector of cell line $c_i$, and the jth row of Q (shown by $q_j$) indicates the latent vector of drug $d_j$.

The initial goal is to find matrices P and Q such that each drug response value is obtained by the inner product of the corresponding latent vectors, i.e., $r_{ij} \approx p_i q_j^T$; thus, the loss function can be formulated as:

$$L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(r_{ij} - p_i q_j^T\right)^2 + \frac{\lambda}{2} \left(\lVert P \rVert_F^2 + \lVert Q \rVert_F^2\right).$$

The two terms $\lVert P \rVert_F^2$ and $\lVert Q \rVert_F^2$ are the regularization constraints of P and Q, and $\lambda$ is the regularization coefficient. The regularization terms prevent these matrices from growing dramatically, which guards against over-fitting. These regularization terms help to reduce the variance and increase the stability and generalization capabilities of the model[52].

Manifold learning studies[53,54] have shown that the mapping of data to a lower dimensional space can conserve the topological structure of data. Since $p_i$ is the latent feature vector of cell line $c_i$, the distance of two cell lines $c_i$ and $c_j$ can be measured by $\lVert p_i - p_j \rVert^2$. Similarly, $\lVert q_i - q_j \rVert^2$ denotes the distance of drugs $d_i$ and $d_j$. We should consider some constraints to maintain the distance of cell lines and the distance of drugs while mapping them from the original feature space to the lower dimensional latent space. Thus, the loss function is supplemented by two more terms:

$$L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(r_{ij} - p_i q_j^T\right)^2 + \frac{\lambda}{2} \left(\lVert P \rVert_F^2 + \lVert Q \rVert_F^2\right) + \frac{\mu}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s^C_{ij} \lVert p_i - p_j \rVert^2 + \frac{\mu}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} s^D_{ij} \lVert q_i - q_j \rVert^2,$$

where $\mu$ is the coefficient of similarity consistency, $S^C = [s^C_{ij}]$ is one of the normalized cell line similarity matrices (based on EXPR, CNV, or MUT), and $S^D = [s^D_{ij}]$ is one of the normalized drug similarity matrices (based on CHEM, TRGT, or KEGG). The last two terms are minimized when the feature vectors of cell line (or drug) pairs with high similarity are mapped to nearby latent vectors. Therefore, the topological distance of cell lines (or drugs) is maintained while mapping to the lower dimensional space.
Iterative optimization rules
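The latent matrices P, Q must be obtained by minimizing the loss function in Eq. (15). We used the iterative Newton’s method[55] to update the P, Q matrices:

$$p_i^{(t+1)} = p_i^{(t)} - \nabla_{p_i} L \left(\nabla^2_{p_i} L\right)^{-1}, \qquad q_j^{(t+1)} = q_j^{(t)} - \nabla_{q_j} L \left(\nabla^2_{q_j} L\right)^{-1},$$

where $p_i^{(t)}$ (or $q_j^{(t)}$) denotes the updated $p_i$ (or $q_j$) after t steps, and $p_i^{(0)}$ and $q_j^{(0)}$ were initialized randomly for all $i$ and $j$. The first and second derivatives (gradient and Hessian) of the loss function with respect to $p_i$ and $q_j$ are computed as the following, with regularization coefficient $\lambda$, similarity consistency coefficient $\mu$, and symmetric normalized similarity matrices $S^C = [s^C_{ij}]$ for cell lines and $S^D = [s^D_{jl}]$ for drugs:

$$\nabla_{p_i} L = \left(p_i Q^T - R(i, :)\right) Q + \lambda p_i + 2\mu \sum_{j} s^C_{ij} (p_i - p_j), \qquad \nabla^2_{p_i} L = Q^T Q + \Big(\lambda + 2\mu \sum_{j \neq i} s^C_{ij}\Big) I_k,$$

$$\nabla_{q_j} L = \left(q_j P^T - R(:, j)^T\right) P + \lambda q_j + 2\mu \sum_{l} s^D_{jl} (q_j - q_l), \qquad \nabla^2_{q_j} L = P^T P + \Big(\lambda + 2\mu \sum_{l \neq j} s^D_{jl}\Big) I_k.$$

Therefore, the latent matrices P, Q are updated alternately according to Eqs. (22, 23) until convergence.

The convergence criterion is met when . In this study, we considered . The value of the loss function declined in every iteration, due to the positive definite second derivatives. Therefore, the convergence criterion is definitely met after some steps[55] (usually after 10–20 steps). After convergence, an estimated matrix is obtained by $\hat{R}_1 = P Q^T$.

Moreover, the manifold learning was applied to the transpose of the response matrix, i.e. the above procedure was repeated for factorizing $R^T$ into $P'$ and $Q'$. In the second use of manifold learning we initialized $P'$ and $Q'$ by the final computed Q and P of the first run, respectively. After the convergence, the second predicted matrix was constructed by $\hat{R}_2 = P' Q'^T$. Consequently, the predicted was computed by .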
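The alternating Newton updates can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the objective is taken to be the squared reconstruction error plus an L2 penalty (coefficient `lam`) and similarity consistency terms (coefficient `mu`), with the constant factors chosen here; R is assumed fully observed; and the similarity matrices are assumed symmetric with non-negative entries. All function names and parameter values are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def loss(R, P, Q, Sc, Sd, lam, mu):
    """Assumed objective: squared error + L2 penalty + similarity terms."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    sq = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    err = sum((R[i][j] - dot(P[i], Q[j])) ** 2
              for i in range(len(P)) for j in range(len(Q)))
    reg = sum(dot(p, p) for p in P) + sum(dot(q, q) for q in Q)
    sim = sum(Sc[i][j] * sq(P[i], P[j])
              for i in range(len(P)) for j in range(len(P)))
    sim += sum(Sd[i][j] * sq(Q[i], Q[j])
               for i in range(len(Q)) for j in range(len(Q)))
    return 0.5 * err + 0.5 * lam * reg + 0.5 * mu * sim

def newton_rows(R, P, Q, S, lam, mu):
    """One Newton step per row of P with Q fixed (exact for a quadratic)."""
    k = len(Q[0])
    QtQ = [[sum(q[a] * q[b] for q in Q) for b in range(k)] for a in range(k)]
    for i in range(len(P)):
        w = sum(S[i][j] for j in range(len(P)) if j != i)
        # Hessian: Q^T Q + (lam + 2 mu w) I
        H = [[QtQ[a][b] + (lam + 2 * mu * w) * (a == b) for b in range(k)]
             for a in range(k)]
        # Gradient of the assumed loss with respect to row i of P.
        g = [sum((sum(P[i][c] * q[c] for c in range(k)) - R[i][j]) * q[a]
                 for j, q in enumerate(Q))
             + lam * P[i][a]
             + 2 * mu * sum(S[i][j] * (P[i][a] - P[j][a])
                            for j in range(len(P)))
             for a in range(k)]
        step = solve(H, g)
        P[i] = [P[i][a] - step[a] for a in range(k)]

# Toy problem: 4 cell lines, 3 drugs, latent dimension 2 (all values made up).
random.seed(0)
n, m, k = 4, 3, 2
R = [[random.random() for _ in range(m)] for _ in range(n)]
Rt = [list(col) for col in zip(*R)]
Sc = [[1.0 if i == j else 0.2 for j in range(n)] for i in range(n)]
Sd = [[1.0 if i == j else 0.2 for j in range(m)] for i in range(m)]
P = [[random.random() for _ in range(k)] for _ in range(n)]
Q = [[random.random() for _ in range(k)] for _ in range(m)]
history = [loss(R, P, Q, Sc, Sd, 0.1, 0.1)]
for _ in range(20):
    newton_rows(R, P, Q, Sc, 0.1, 0.1)   # update cell line vectors
    newton_rows(Rt, Q, P, Sd, 0.1, 0.1)  # update drug vectors
    history.append(loss(R, P, Q, Sc, Sd, 0.1, 0.1))
```

Because the assumed loss is quadratic in each latent vector when the others are held fixed, each Newton step is an exact minimization over that vector, so the recorded loss values do not increase. The second run on the transposed matrix is omitted from this sketch.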
Evaluation criteria
We measured the performance of ADRML using 5-fold cross-validation on cell line-drug pairs. To do this, each (cell line, drug) pair was considered as a sample. Then, the set of all samples was partitioned randomly into five almost equally-sized subsets (folds). One fold was considered as the test data and the other folds were regarded as the training data. The evaluation was computed for the test data. This procedure was iterated until each fold was considered once as the test data. Finally, the average of the evaluation criteria over these five iterations denoted the model performance. The evaluation of ADRML is summarized as pseudo-code in Fig. 9.
Figure 9
The pseudo-code for evaluation of ADRML performance.
To reduce the effect of randomness and the variance, the model performance was averaged over 30 random repetitions of 5-fold cross-validation. The evaluation criteria include RMSE, , and PCC as follows.

$$RMSE = \sqrt{\frac{1}{|Test|} \sum_{i \in Test} (y_i - \hat{y}_i)^2}, \qquad PCC = \frac{\sum_{i \in Test} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i \in Test} (y_i - \bar{y})^2}\,\sqrt{\sum_{i \in Test} (\hat{y}_i - \bar{\hat{y}})^2}},$$

where $y$ and $\hat{y}$ are the vectors of real and predicted drug response values for all samples in the test set, respectively, $\bar{y}$ and $\bar{\hat{y}}$ are their mean values, and |Test| is the number of samples in the test set. Each criterion evaluates the model performance from a different point of view. Therefore, it is possible to obtain results that give promising values for one criterion and unfavorable values for the others.
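The cross-validation loop over cell line-drug pairs can be sketched as follows, covering RMSE and PCC. The function names, the `fit_predict` callback, and the toy data are hypothetical. The additive baseline is only a stand-in so that the loop runs end to end; it is not ADRML.

```python
import math
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Randomly partition sample indices into almost equally sized folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def pcc(y, y_hat):
    my, mh = sum(y) / len(y), sum(y_hat) / len(y_hat)
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    sy = math.sqrt(sum((a - my) ** 2 for a in y))
    sh = math.sqrt(sum((b - mh) ** 2 for b in y_hat))
    return cov / (sy * sh)

def cross_validate(samples, fit_predict, n_folds=5, seed=0):
    """samples: list of (cell line, drug, IC50). Returns mean (RMSE, PCC)."""
    scores = []
    for test_idx in five_fold_indices(len(samples), n_folds, seed):
        held_out = set(test_idx)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        test = [samples[i] for i in test_idx]
        y_hat = fit_predict(train, [(c, d) for c, d, _ in test])
        y = [v for _, _, v in test]
        scores.append((rmse(y, y_hat), pcc(y, y_hat)))
    return tuple(sum(col) / n_folds for col in zip(*scores))

def additive_baseline(train, test_pairs):
    """Toy stand-in for a model: global mean + cell line and drug offsets."""
    mean = lambda xs, default: sum(xs) / len(xs) if xs else default
    g = mean([v for _, _, v in train], 0.0)
    def offset(pos, key):
        return mean([v for *cd, v in train if cd[pos] == key], g) - g
    return [g + offset(0, c) + offset(1, d) for c, d in test_pairs]

# Synthetic responses for 10 cell lines and 3 drugs (values are made up).
rng = random.Random(1)
samples = [(c, d, 0.5 * c + 2.0 * d + rng.gauss(0, 0.1))
           for c in range(10) for d in range(3)]
avg_rmse, avg_pcc = cross_validate(samples, additive_baseline)
```

Repeating the call with 30 different `seed` values and averaging the results would mirror the repeated cross-validation described in the text.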
Finding the most redundant cell lines
In order to eliminate the redundancy from the dataset, the cell lines in each tissue type that have high similarity to the majority of cell lines in that tissue type were considered as the most redundant cell lines and excluded from the dataset. To do this, the minimum (Q0), first quartile (Q1), second quartile (Q2), third quartile (Q3), and maximum (Q4) values for each type of cell line similarity in all tissue types were calculated, which are shown in Supplementary Tables S6 and S7. The diversity of cell lines was best reflected by the copy number variation similarities, since the quartile values differed widely for this similarity. Therefore, the third quartile of the copy number variation similarities between the cell lines was computed in each tissue type t (denoted by Q3(CNV, t)). A cell line c in tissue type t was excluded if its similarity was higher than Q3(CNV, t) with more than 20% of the cell lines in tissue type t.
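The exclusion rule can be sketched for a single tissue type as below. The function name and the toy matrix are hypothetical. Two details are assumptions: Q3 is computed over the off-diagonal pairwise similarities, and the 20% threshold is taken relative to the number of cell lines in the tissue type.

```python
from statistics import quantiles

def redundant_cell_lines(sim, max_fraction=0.2):
    """Return indices of cell lines judged redundant within one tissue type.

    sim: symmetric CNV similarity matrix (list of lists) restricted to the
    cell lines of a single tissue type.
    """
    n = len(sim)
    pairwise = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    q3 = quantiles(pairwise, n=4)[2]  # third quartile of pair similarities
    redundant = []
    for i in range(n):
        # Count the other cell lines that are more similar than Q3.
        close = sum(1 for j in range(n) if j != i and sim[i][j] > q3)
        if close > max_fraction * n:
            redundant.append(i)
    return redundant

# Toy tissue with six cell lines; the first three are near-duplicates.
sim = [[0.9 if (i < 3 and j < 3) else 0.1 for j in range(6)]
       for i in range(6)]
to_remove = redundant_cell_lines(sim)
```

In this toy matrix, each of the first three cell lines exceeds Q3 with two others, which is more than 20% of six, so all three are flagged.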
Computing association of drugs and signalling pathways
The association between a drug and a pathway was computed by the PCC of drug response values and pathway activity scores. To do this, we considered all Biocarta signaling pathways and eliminated the pathways for which the gene expression data of more than 10% of the genes were not provided. Therefore, we considered 107 Biocarta pathways for the CCLE dataset. The pathway activity score $AS(i, j)$ for cell line $i$ and pathway $j$ was computed according to Emdadi et al.[20], by summing up the fold change of gene expression for all genes in pathway $j$, where the fold change of each gene is taken with respect to the median of its expression in all cell lines. Thus, the score of a cell line in activating a pathway denotes the total amount of change in gene expression with respect to the median expression.

The correlation of drug $d$ and pathway $j$ was obtained by $PCC(\hat{R}(:, d), AS(:, j))$, where $\hat{R}(:, d)$ denotes the predicted drug response vector of drug $d$ for all cell lines and $AS(:, j)$ stands for the activity score vector of pathway $j$ for all cell lines.
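The activity score and the drug-pathway association can be sketched as follows. The function names and data layout are hypothetical. The fold change is taken here as the ratio of a gene's expression to its median across cell lines, which is an assumption about the exact definition and requires positive expression values.

```python
import math
from statistics import median

def activity_scores(expr, pathway_genes):
    """expr: dict cell line -> dict gene -> expression (positive values).

    Returns dict cell line -> activity score for one pathway, computed as
    the sum over pathway genes of expression divided by the gene's median.
    """
    med = {g: median(profile[g] for profile in expr.values())
           for g in pathway_genes}
    return {c: sum(profile[g] / med[g] for g in pathway_genes)
            for c, profile in expr.items()}

def pcc(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drug_pathway_association(predicted, scores):
    """PCC between one drug's predicted responses and one pathway's scores."""
    cells = sorted(scores)
    return pcc([predicted[c] for c in cells], [scores[c] for c in cells])

# Toy data: three cell lines, one pathway with two genes (values made up).
expr = {"c1": {"g1": 1.0, "g2": 2.0},
        "c2": {"g1": 2.0, "g2": 4.0},
        "c3": {"g1": 3.0, "g2": 6.0}}
scores = activity_scores(expr, ["g1", "g2"])
predicted = {"c1": 1.0, "c2": 2.0, "c3": 3.0}
assoc = drug_pathway_association(predicted, scores)
```

Repeating the last call for every drug and pathway yields a drugs-by-pathways correlation matrix of the kind shown in the heat map.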
Authors: James C Costello; Laura M Heiser; Elisabeth Georgii; Mehmet Gönen; Michael P Menden; Nicholas J Wang; Mukesh Bansal; Muhammad Ammad-ud-din; Petteri Hintsanen; Suleiman A Khan; John-Patrick Mpindi; Olli Kallioniemi; Antti Honkela; Tero Aittokallio; Krister Wennerberg; James J Collins; Dan Gallahan; Dinah Singer; Julio Saez-Rodriguez; Samuel Kaski; Joe W Gray; Gustavo Stolovitzky Journal: Nat Biotechnol Date: 2014-06-01 Impact factor: 54.908
Authors: Jordi Barretina; Giordano Caponigro; Nicolas Stransky; Kavitha Venkatesan; Adam A Margolin; Sungjoon Kim; Christopher J Wilson; Joseph Lehár; Gregory V Kryukov; Dmitriy Sonkin; Anupama Reddy; Manway Liu; Lauren Murray; Michael F Berger; John E Monahan; Paula Morais; Jodi Meltzer; Adam Korejwa; Judit Jané-Valbuena; Felipa A Mapa; Joseph Thibault; Eva Bric-Furlong; Pichai Raman; Aaron Shipway; Ingo H Engels; Jill Cheng; Guoying K Yu; Jianjun Yu; Peter Aspesi; Melanie de Silva; Kalpana Jagtap; Michael D Jones; Li Wang; Charles Hatton; Emanuele Palescandolo; Supriya Gupta; Scott Mahan; Carrie Sougnez; Robert C Onofrio; Ted Liefeld; Laura MacConaill; Wendy Winckler; Michael Reich; Nanxin Li; Jill P Mesirov; Stacey B Gabriel; Gad Getz; Kristin Ardlie; Vivien Chan; Vic E Myer; Barbara L Weber; Jeff Porter; Markus Warmuth; Peter Finan; Jennifer L Harris; Matthew Meyerson; Todd R Golub; Michael P Morrissey; William R Sellers; Robert Schlegel; Levi A Garraway Journal: Nature Date: 2012-03-28 Impact factor: 49.962
Authors: Mathew J Garnett; Elena J Edelman; Sonja J Heidorn; Chris D Greenman; Anahita Dastur; King Wai Lau; Patricia Greninger; I Richard Thompson; Xi Luo; Jorge Soares; Qingsong Liu; Francesco Iorio; Didier Surdez; Li Chen; Randy J Milano; Graham R Bignell; Ah T Tam; Helen Davies; Jesse A Stevenson; Syd Barthorpe; Stephen R Lutz; Fiona Kogera; Karl Lawrence; Anne McLaren-Douglas; Xeni Mitropoulos; Tatiana Mironenko; Helen Thi; Laura Richardson; Wenjun Zhou; Frances Jewitt; Tinghu Zhang; Patrick O'Brien; Jessica L Boisvert; Stacey Price; Wooyoung Hur; Wanjuan Yang; Xianming Deng; Adam Butler; Hwan Geun Choi; Jae Won Chang; Jose Baselga; Ivan Stamenkovic; Jeffrey A Engelman; Sreenath V Sharma; Olivier Delattre; Julio Saez-Rodriguez; Nathanael S Gray; Jeffrey Settleman; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Sridhar Ramaswamy; Ultan McDermott; Cyril H Benes Journal: Nature Date: 2012-03-28 Impact factor: 49.962