Na-Na Guan1, Yan Zhao2, Chun-Chun Wang2, Jian-Qiang Li3, Xing Chen4, Xue Piao5. 1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China. 2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China. 3. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China. Electronic address: lijq@szu.edu.cn. 4. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China. Electronic address: xingchen@amss.ac.cn. 5. School of Medical Informatics, Xuzhou Medical University, Xuzhou 221004, China. Electronic address: px@xzhmu.edu.cn.
Abstract
Precision medicine has become a novel and rising concept, which depends much on the identification of individual genomic signatures for different patients. The cancer cell lines could reflect the "omic" diversity of primary tumors, based on which many works have been carried out to study the cancer biology and drug discovery both in experimental and computational aspects. In this work, we presented a novel method to utilize weighted graph regularized matrix factorization (WGRMF) for inferring anticancer drug response in cell lines. We constructed a p-nearest neighbor graph to sparsify drug similarity matrix and cell line similarity matrix, respectively. Using the sparsified matrices in the graph regularization terms, we performed matrix factorization to generate the latent matrices for drug and cell line. The graph regularization terms including neighbor information could help to exclude the noisy ingredient and improve the prediction accuracy. The 10-fold cross-validation was implemented, and the Pearson correlation coefficient (PCC), root-mean-square error (RMSE), PCCsr, and RMSEsr averaged over all drugs were calculated to evaluate the performance of WGRMF. The results on the Genomics of Drug Sensitivity in Cancer (GDSC) dataset are 0.64 ± 0.16, 1.37 ± 0.35, 0.73 ± 0.14, and 1.71 ± 0.44 for PCC, RMSE, PCCsr, and RMSEsr in turn. And for the Cancer Cell Line Encyclopedia (CCLE) dataset, WGRMF got results of 0.72 ± 0.09, 0.56 ± 0.19, 0.79 ± 0.07, and 0.69 ± 0.19, respectively. The results showed the superiority of WGRMF compared with previous methods. Besides, based on the prediction results using the GDSC dataset, three types of case studies were carried out. The results from both cross-validation and case studies have shown the effectiveness of WGRMF on the prediction of drug response in cell lines.
Precision medicine has become a novel and rising concept, which depends much on the identification of individual genomic signatures for different patients. The cancer cell lines could reflect the "omic" diversity of primary tumors, based on which many works have been carried out to study the cancer biology and drug discovery both in experimental and computational aspects. In this work, we presented a novel method to utilize weighted graph regularized matrix factorization (WGRMF) for inferring anticancer drug response in cell lines. We constructed a p-nearest neighbor graph to sparsify drug similarity matrix and cell line similarity matrix, respectively. Using the sparsified matrices in the graph regularization terms, we performed matrix factorization to generate the latent matrices for drug and cell line. The graph regularization terms including neighbor information could help to exclude the noisy ingredient and improve the prediction accuracy. The 10-fold cross-validation was implemented, and the Pearson correlation coefficient (PCC), root-mean-square error (RMSE), PCCsr, and RMSEsr averaged over all drugs were calculated to evaluate the performance of WGRMF. The results on the Genomics of Drug Sensitivity in Cancer (GDSC) dataset are 0.64 ± 0.16, 1.37 ± 0.35, 0.73 ± 0.14, and 1.71 ± 0.44 for PCC, RMSE, PCCsr, and RMSEsr in turn. And for the Cancer Cell Line Encyclopedia (CCLE) dataset, WGRMF got results of 0.72 ± 0.09, 0.56 ± 0.19, 0.79 ± 0.07, and 0.69 ± 0.19, respectively. The results showed the superiority of WGRMF compared with previous methods. Besides, based on the prediction results using the GDSC dataset, three types of case studies were carried out. The results from both cross-validation and case studies have shown the effectiveness of WGRMF on the prediction of drug response in cell lines.
Benefiting from the development of high-throughput sequencing technology and the improvement of bioinformatics, the precision medicine has become a novel and burgeoning concept. The goal of precision medicine is to effectively classify the different states and processes for the same disease, and personally make accurate treatment for the individual patient. It is critical for the success of precision medicine to identify individual genomic signatures for different patients. Cancer is one of the most threatening human complex diseases. The cancer cell lines could reflect the “omic” diversity of primary tumors, which therefore could be considered as a promising proxy to characterize the therapeutic response. Using cell lines, many works have been carried out to study the cancer biology and drug discovery both in experimental and computational aspects.5, 6, 7 As the basis of in-depth researches, tremendous genomic and pharmacological data have been collected and categorized in large scale for diverse cancer cell lines.4, 8, 9, 10, 11, 12 The consequential mission is to develop powerful methods to extract useful information from those complicated datasets and find the connections between the cancer information and the drug response.Because experiments are expensive and time-consuming, computational approaches have been developed for tumorous drug response study, which are attracting more and more researchers’ attentions.13, 14, 15 There are mainly two classes of computational models to predict drug response in cancer cell lines. The first one is based on machine learning algorithms, such as elastic-net regression, support vector machine (SVM), and random forest (RF). For instance, Menden et al. proposed a machine learning method for drug sensitivity inference, which utilized neural network and RF as predictors. In their model, they combined cell line genomic features and drug chemical structures to be the input features, and collected response data from the Genomics of Drug Sensitivity in Cancer (GDSC) project as the training samples. The prediction accuracy of the method in 8-fold cross-validation was improved because of the integration of information from both cell lines and drugs. Besides, Fersini et al. presented a computational framework using the Consensus p-Median clustering approach for drug response inference in tumor cell lines. They performed the Consensus p-Median clustering algorithm to create homogeneous groups of tumor cell lines, based on which the relevant genes that could be responsible for drug responses were selected to characterize the cell line clusters. Then, the Bayesian networks were constructed for the prediction of different drug responses for cell lines with given genomic profiles. Geeleher et al. proposed a framework that adopted only the before-treatment gene expression profile to predict the drug response. After the gene expressions of cell line and clinical trial were combined and homogenized, the linear ridge regression was trained and tested, which could be implemented to predict drug sensitivity for new cell lines. Furthermore, Tan et al. devised a classifier based on SVM with pairwise kernels for the classification of drug responses. The chemical compound was represented with labeled undirected graph and fingerprint, whereas the cell line was characterized using gene transcriptional profile, gene copy number, gene mutational status, and microRNA expression information. Based on diverse features of drug and cell line, several different kernels, especially the pairwise kernel, were employed through computing drug similarity and cell line similarity in different ways. Similarly, combining various similarity information, Wang et al. presented a model called Predict Drug Response in Cancer Cells (PDRCC) to infer novel drug responses in cancer cells. The PDRCC calculated several drug similarities and cell line similarities based on diverse feature information. Then, they calculated the Kronecker product of any similarity of drug and cell line as the drug-cell line pair kernel function, which was sequentially fed to SVM to predict drug responses. Besides, based on SVM and a recursive feature selection tool, Dong et al. developed a predictor, using gene expression of cancer cell and drug response data in the Cancer Cell Line Encyclopedia (CCLE). They classified the cell lines into sensitive and resistant subsets according to their responses to each drug. Then, a wrapper method named Support Vector Machine Recursive Feature Elimination (SVM-RFE) was used for the feature selection. The classified responses and the selected features from CCLE data were finally input to the SVM model for training and predicting.As is known, the training dataset and the feature of samples are vitally important to machine learning models. To solve the problem of different source domains of the training and test datasets, Turki et al. presented a transfer learner for inferring drug response in cancer cell lines, using mean shift and Procrustes analysis. Honkela et al. proposed an approach that integrated differentially private learning and Bayesian linear regression, for drug sensitivity prediction with limited dataset. As a subtype of transfer learning, the multi-task learning models have shown their capability for drug response inference. For example, Gönen et al. developed a Kernelized Bayesian Multi-task Learning (KBMTL) for inference of drug response in cell lines, which employed kernel-based dimension reduction. Besides, Tan presented an approach using multi-task learning regularized with trace-norm to improve the kernelized transfer learning for anticancer drug response prediction. Similarly, Yuan et al. also developed a multi-task learning method with trace-norm regularization, in which an efficient optimization algorithm called alternating direction method of multipliers (ADMM) was adopted.Ensemble learning has been widely used in prediction problems of bioinformatics such as disease-specific risk variant prediction and disease-related non-coding RNA prediction.30, 31, 32, 33, 34, 35 For drug response prediction, Tan et al. proposed an Ensemble Learning for Drug Activity Prediction (ELDAP), which incorporated elastic net regression, KBMLT, pairwise support vector regressor (PSVR), and neural networks. Recently, Matlock et al. studied the effect of stacking three different machine learning algorithms that included RF, deep learning (DL), and k-nearest neighbor (KNN) for inference of drug response in cancer cell lines. The results revealed the ability of the ensemble models to improve the prediction accuracy.Within machine learning methods, matrix factorization is an efficient class of models for drug response inference. For example, Ammad-ud-din et al. developed a method to extend quantitative structure-activity relationship (QSAR) analysis of drugs in cancer cell lines, which applied the kernelized Bayesian matrix factorization (KBMF) algorithm. They calculated kernel matrices for each type of feature for both drug and cell line through computing Jaccard coefficient or Gaussian kernel. Moreover, Wang et al. recently proposed an improved method for drug response inference in cancer cell lines, which utilized the similarity regularized matrix factorization (SRMF).Machine learning models could infer drug response in cell lines in large scale, which often neglected the topological information of drug similarity and cell line similarity. The network-based models could remedy the limitation. Shivakumar et al. proposed a drug structural similarity-based model to predict drug sensitivity in cell lines, which assigned the sensitivity profile of the known drug to the new drug if they were structurally similar. Zhang et al. established a dual-layer network composed of cell lines and drugs, utilizing the cell line similarity based on their gene expression profiles, and drug similarity based on their chemical structures. On the basis of the dual-layer network, they proposed a weighted method to predict the response of a cell line to a drug. Moreover, Kim et al. predicted the drug response of cancer cell lines using a network-based classifier (NBC). In this method, a sensitive network and a resistant network were respectively constructed using selected genes, based on which linear and non-linear predictor functions were designed for gene expression prediction, which could be used to classify a new cell line via comparing the results from the sensitive and resistant networks. Turki et al. utilized a link-filtering algorithm on the cell line network followed with regression algorithms to predict the cancer drug sensitivity. Moreover, Stanfield et al. integrated the information of genes, cell lines, and drugs to construct a heterogeneous network, on which a link prediction with random walk with restart (RWR) was performed to build the network profiles of cell lines and drugs. Finally, the Pearson correlation coefficients (PCCs) between drugs and cell lines were calculated for drug-cell line association prediction. Recently, Zhang et al. developed a model of Heterogeneous Network-based Method for Drug Response Prediction (HNMDRP) through integrating five subnetworks, including cell line similarity network, drug similarity network, target similarity network, drug-cell line association network, and drug-protein network. An information flow-based algorithm was then implemented on the heterogenous network to predict novel drug-cell line associations. In addition, Le et al. also constructed a heterogeneous network by combining omics-based cell line similarity, drug structural similarity, and known drug responses of cell lines, based on which a global method called GloNetDRP was implemented for drug response prediction. The RWR algorithm was adopted in this model to compute the response value of test cell lines to test drugs.Previous computational methods often have their own limitations. For instance, some machine learning-based models predict only binary results; i.e., they classify cell line-drug pairs into sensitive or resistant groups. Some network-based models could be used to predict new drug-cell line associations, but cannot give the precision response values. And some other models have the need to improve their prediction accuracy. In this paper, we presented a novel approach to infer drug response in cancer cell lines using the weighted graph regularized matrix factorization (WGRMF) algorithm. For drug similarity and cell line similarity, a sparsification technique was operated using the p-nearest neighbor graphs that were constructed for drug and cell line, respectively, based on the KNN algorithm. The sparsified similarity matrices were then used to regularize the latent matrices learnt from matrix factorization together with Tikhonov regularization. Consequently, the latent drug vectors and latent cell line vectors could be obtained, respectively, through an alternative update operation. The predicted response matrix was finally computed via multiplying the two low-rank latent matrices. The performance of our model was evaluated in the 10-fold cross-validation through calculating the PCC and root-mean-square error (RMSE) between predicted values and observed values in two datasets, CCLE and GDSC. The results demonstrated the superiority of WGRMF compared with SRMF, which directly used the drug similarity and cell line similarity as regularization terms in the matrix factorization model. In the case studies, we compared the predicted values of missing drug responses and the known response values in GDSC to investigate the correlation between drug sensitivity and several gene variations, such as lapatinib sensitivity and erlotinib sensitivity to the Epidermal Growth Factor Receptor (EGFR) gene, PD-0332991 sensitivity to Cyclin Dependent Kinase Inhibitor 2A (CDKN2A) gene, as well as pazopanib resistance of KRAS Proto-Oncogene, GTPase (KRAS). Moreover, we found that the results of WGRMF could help to identify new sensitive genes of drugs through combining predicted data with existing data. The association between MET Proto-Oncogene, Receptor Tyrosine Kinase (MET) gene, and PHA-665752 was an example for this. Besides, we applied WGRMF to the discovery of sensitive drugs for non-small cell lung cancer (NSCLC).
Results
Performance Evaluation
We utilized 10-fold cross-validation to evaluate the performance of the proposed method in the CCLE dataset and GDSC dataset, respectively. Specifically, the known response values were randomly divided into 10 subsets with equal size. Then, one subset was left in turn as the test set and the other nine were combined as the training set. We can get the predicted response values of the test set through implementing WGRMF learnt from the training set. As soon as the loop ends, each of the 10 subsets was considered as the test sample one after another, and we could obtain the predicted value as the counterpart of each known response value. To improve the reliability of the results, the whole process of cross-validation was repeated a hundred times to estimate the average performance of the model. Based on the predicted and observed response data, we calculated the PCC and RMSE for each drug to estimate the capability of WGRMF on predicting drug response in cell lines. The PCC value indicates the extent of correlation between the predicted and observed response profiles of a drug, which could be formulated aswhere r and indicate the original and predicted response values, respectively; and denote their mean values, respectively; and nd is the total number of known response values for the query drug. The larger the PCC value is, the more accurate the prediction is. The RMSE stands for the deviation of predicted values from observed values, which is expected to be small. The RMSE for a drug could be calculated asAfter the PCC and RMSE were calculated for all drugs, we correspondingly computed the average PCC and average RMSE over drugs.In further estimation of the WGRMF performance, we focused on the sensitive and resistant cell lines for each drug. First, we ranked cell lines for each drug according to the response values and split them into four equal parts. Then the first and last parts were selected to compose the sensitive and resistant cell line set of each drug. Consequently, we could obtain the PCC and RMSE of sensitive and resistant cell lines for each drug, as well as the average values of PCC and RMSE.For convenience, we used PCC and RMSE to indicate the results from all cell lines, and PCCsr and RMSEsr to indicate the results from sensitive and resistant cell lines. As a result, the WGRMF got PCC, RMSE, PCCsr, and RMSEsr averaged over 23 drugs in CCLE as 0.72 ± 0.09, 0.56 ± 0.19, 0.79 ± 0.07, and 0.69 ± 0.19, respectively. For data in GDSC, the results of WGRMF are 0.64 ± 0.16, 1.37 ± 0.35, 0.73 ± 0.14, and 1.71 ± 0.44 for PCC, RMSE, PCCsr, and RMSEsr averaged over 135 drugs. To make a comparison with SRMF, the same process of 10-fold cross-validation was carried out using SRMF on both CCLE and GDSC datasets. It is worth noting that all of the above results refer to the average values and the corresponding SDs for a hundred experiments of cross-validation. The overall results of comparing WGRMF with SRMF can be seen from Table 1, which shows that the performance of WGRMF is superior to SRMF based on all indicators for both CCLE and GDSC datasets.
Table 1
The Comparison Results between WGRMF and SRMF under the 10-Fold Cross-Validation in CCLE and GDSC Datasets, Indicated by Drug-Averaged PCC, RMSE, PCCsr, and RMSEsr
Dataset
Model
Drug-Averaged PCC
Drug-Averaged RMSE
Drug-Averaged PCCsr
Drug-Averaged RMSEsr
CCLE
WGRMF
0.72 ± 0.09
0.56 ± 0.19
0.79 ± 0.07
0.69 ± 0.19
SRMF
0.71 ± 0.09
0.57 ± 0.18
0.78 ± 0.08
0.74 ± 0.22
GDSC
WGRMF
0.64 ± 0.16
1.37 ± 0.35
0.73 ± 0.14
1.71 ± 0.44
SRMF
0.61 ± 0.16
1.52 ± 0.36
0.71 ± 0.14
1.79 ± 0.45
The Comparison Results between WGRMF and SRMF under the 10-Fold Cross-Validation in CCLE and GDSC Datasets, Indicated by Drug-Averaged PCC, RMSE, PCCsr, and RMSEsrFurthermore, in order to inspect how the WGRMF performed on individual drugs, we selected drugs targeting genes in the phosphatidylinositol-4,5-bisphosphate 3-kinase (PI3K) pathway from the GDSC dataset as examples. PI3K is well known as a signaling component downstream of receptor tyrosine kinases (RTKs), which plays important roles in various biological responses.47, 48 The comparisons between WGRMF and SRMF on PCCsr and RMSEsr were illustrated in Figures 1 and 2, respectively. From the histograms, we can see that the results of WGRMF are better than that of SRMF for most of the PI3K pathway drugs.
Figure 1
The Comparison Histogram of PCCsr for the Drugs That Target Genes in PI3K Pathway
The comparison histogram of PCCsr between WGRMF and SRMF under the 10-fold cross-validation for the drugs that target genes in the PI3K pathway.
Figure 2
The Comparison Histogram of RMSEsr for the Drugs That Target Genes in PI3K Pathway
The comparison histogram of RMSEsr between WGRMF and SRMF under the 10-fold cross-validation for the drugs that target genes in the PI3K pathway.
The Comparison Histogram of PCCsr for the Drugs That Target Genes in PI3K PathwayThe comparison histogram of PCCsr between WGRMF and SRMF under the 10-fold cross-validation for the drugs that target genes in the PI3K pathway.The Comparison Histogram of RMSEsr for the Drugs That Target Genes in PI3K PathwayThe comparison histogram of RMSEsr between WGRMF and SRMF under the 10-fold cross-validation for the drugs that target genes in the PI3K pathway.
Case Studies
In this work, we have used all existing response values in the GDSC dataset to train the model, which was sequentially adopted to predict the response values for those drug-cell line pairs without known response values. Then, we ranked the cell lines for each drug in GDSC according to the predicted response values, and the top 20 most likely sensitive cell lines were selected for each drug. The ranked and selected result could be obtained from Table S1. This prediction result was released to provide some assistance for further experimental research. Furthermore, based on the whole predicted response of the GDSC dataset, we have conducted three types of case studies, including consistency identification for drug sensitivity of gene mutation, novel drug-gene association discovery, and drug reposition on specific cancer type.In the first case of consistency identification, we made a comparative analysis between the predicted and the observed responses of four drugs, based on several gene mutation profiles of cell lines. Lapatinib, known as a small-molecule kinase inhibitor, could target the EGFR gene and Erb-B2 Receptor Tyrosine Kinase 2 (ERBB2) gene, which was approved by the US Food and Drug Administration (FDA) in 2007 for the therapy of breast cancerpatients. Dual inhibition of EGFR and ERBB2 tyrosine kinases plays a clinical biological role in suppressing the proliferation and survival of those cancer cells that are promoted by signaling pathways.50, 51, 52 For lapatinib, the responses in 310 cell lines are known, whereas 342 cell lines are without known responses. The EGFR mutation profile of cell lines was used to investigate the association between the gene mutation and the response to lapatinib. The predicted responses to lapatinib were classified into the EGFR mutation group and EGFR wild-type group, and similar operation was exerted on the observed responses. All of the four groups of responses were plotted in one figure to identify the consistency between predicted and existing datasets for lapatinib sensitivity in EGFR mutation cell lines. The comparative result was shown in Figure 3A, from which we can see that EGFR-mutated cell lines are more sensitive to lapatinib for both predicted and observed data. Erlotinib, a kind of OSI Pharmaceuticals, has been reported as an inhibitor of the tyrosine kinase activity of EGFR,53, 54 which is efficiently used to prolong the lifetime of the previously treated patients with NSCLC.55, 56 In the GDSC dataset, there are 286 out of 652 cell lines with known response values to erlotinib. In this work, we also verified the sensitivity of EGFR mutation to erlotinib. The same trend could be observed between the predicted and existing data, which was shown in Figure 3B. Furthermore, PD-0332991, also named palbociclib, with 590 existing response values in cell lines in GDSC, is an effective drug in the treatment of several cancers including breast cancer, as a cyclin-dependent kinase 4/6 inhibitor.57, 58, 59 The CDKN2A mutation was selected to investigate its contribution to the PD-0332991 sensitivity. By comparing the predicted result of WGRMF with the existing data, we can identify the consistency between them, which is given in Figure 3C. Besides, the resistance of gene mutation to drug also could be predicted by the WGRMF. Taking the response of KRAS mutated cell lines to pazopanib as an example, we can see the resistant tendency both from the known and the predicted results, which is shown in Figure 3D. In order to show the comparison results more distinctly, the rank-sum test was performed between the mutated and wild groups for both predicted and observed datasets. The calculated p values have been shown in each panel of Figure 3. From the results of the case study for drugs mentioned above, we could observe the agreement of the predicted responses to the known responses in the GDSC dataset, based on the drug sensitivity of gene mutation profiles.
Figure 3
The Results for Consistency Identification between Predicted and Existing Data for Four Drug-Gene Pairs Based on the GDSC Dataset
(A) The responses of EGFR mutated and wild-type cell lines to lapatinib are shown. (B–D) EGFR mutation and erlotinib (B), CDKN2A mutation and PD-0332991 (C), and KRAS mutation and pazopanib (D). The p values obtained from the rank-sum test have been shown in each panel.
The Results for Consistency Identification between Predicted and Existing Data for Four Drug-Gene Pairs Based on the GDSC Dataset(A) The responses of EGFR mutated and wild-type cell lines to lapatinib are shown. (B–D) EGFR mutation and erlotinib (B), CDKN2A mutation and PD-0332991 (C), and KRAS mutation and pazopanib (D). The p values obtained from the rank-sum test have been shown in each panel.In the second case study, we used the predicted responses of WGRMF integrated with the known data in GDSC to infer potential drug-cancer gene associations. For instance, PHA-665752 was such a small molecule that it could act as the ATP competitive inhibitor of the catalytic activity of c-Met kinase.60, 61 It has been reported that PHA-665752 could act on the growth and motility of multiple cancer cell lines.62, 63, 64 In the GDSC dataset, the number of cell lines with known response to PHA-665752 is 381, out of which only two cell lines are related to MET amplification. It is difficult to observe the tendency of drug sensitivity based on the scarce available data. After combining the newly predicted responses and the existing data, we could obtain the sensitivity of MET amplification to PHA-665752 (see Figure 4). The p values obtained from the rank-sum test were also given in Figure 4. The extreme susceptibility of MET amplification to the PHA-665752 has been experimentally confirmed in gastric cancer cell lines.
Figure 4
The Responses of MET-Amplified and Wild-Type Cell Lines to PHA-665752 for Predicted, Existing, and Combined Data in GDSC
The sensitivity of MET amplification to PHA-665752 could be obtained through combining the newly predicted responses and the existing data. The p values obtained from rank-sum test were given for predicted, existing, and combined data, respectively.
The Responses of MET-Amplified and Wild-Type Cell Lines to PHA-665752 for Predicted, Existing, and Combined Data in GDSCThe sensitivity of MET amplification to PHA-665752 could be obtained through combining the newly predicted responses and the existing data. The p values obtained from rank-sum test were given for predicted, existing, and combined data, respectively.Furthermore, as the third case study, we used the predicted responses of WGRMF to reposition drugs on specific cancer through combining the existing responses. For a given drug, we compared the responses of cell lines with a specific tissue type with the responses of other cell lines, which would help to find some useful information of the drug sensitivity on the specific cancer type. Lung cancer is one of the most common malignant tumors in the world, of which about 80% is NSCLC. Gefitinib is the common drug to treat NSCLCpatients with mutated EGFR. Based on the predicted results of WGRMF combined with known responses in GDSC, we screened drugs for drug reposition on NSCLC. For example, through analyzing the response difference between NSCLC cell lines and other cell lines to PHA-665752, we found that NSCLC cell lines were more sensitive to PHA-665752 based on the integrated result, which could not be observed by considering only the existing data. The comparison among predicted, existing, and combined results and the p values that were computed using the rank-sum test between the NSCLC group and the other group were illustrated in Figure 5. It has been reported that the PHA-665752 could act as a c-MET inhibitor to prevent K-ras mutant NSCLC60, 68 or to treat NSCLC by cooperating with rapamycin.
Figure 5
The Reposition of PHA-665752 on NSCLC Based on the Combination of the Newly Predicted Responses and the Existing Data
The figure shows that NSCLC cell lines are more sensitive to PHA-665752 based on the integrated result (p value of rank-sum test is 3.7044 × 10−2), which could not be observed based only on existing data.
The Reposition of PHA-665752 on NSCLC Based on the Combination of the Newly Predicted Responses and the Existing DataThe figure shows that NSCLC cell lines are more sensitive to PHA-665752 based on the integrated result (p value of rank-sum test is 3.7044 × 10−2), which could not be observed based only on existing data.
Discussion
The work to investigate drug response in cancer cell lines is crucial for the precision therapy of cancerpatients. In this work, we presented a novel method to utilize WGRMF for inferring anticancer drug response in cell lines, which combined the information of drug similarity, cell line similarity, and known drug responses in cell lines. A sparsification technique was employed on the similarity matrices to construct the p-nearest neighbor graphs for drug and cell line, respectively. Then based on the low-rank approximation (LRA), together with Tikhonov regularization, drug graph regularization, and cell line graph regularization, the objective function of WGRMF was constructed, which could be solved through an alternative update rule to obtain the latent feature matrices for drug and cell line. The predicted response matrix was finally composed by the two latent matrices. To evaluate the performance of WGRMF, we implemented 10-fold cross-validation on two different datasets, GDSC and CCLE. The averaged PCC, RMSE, PCCsr, and RMSEsr over all drugs in each dataset were calculated as the measurement of model performance. In addition, several case studies based on the GDSC dataset were carried out. Just like its good performance on other issues, the WGRMF method has shown its excellent activity in predicting novel drug responses in cancer cell lines, which is better than SRMF in comparison.There are several factors that could contribute to the superior performance of WGRMF. First of all, WGRMF was constructed on the local invariance assumption; that is, two close points in the original space will be close in the learnt manifolds in latent space. The postulate was guaranteed by the graph regularization with p-nearest neighbor graphs of drug and cell line. Besides, WGRMF took full advantage of neighborhood information via performing graph regularization. Furthermore, the sparsification process could help to eliminate the redundant and noisy information in the similarity matrices. Finally, the introduction of weight matrix into the objective function made it possible to consider the contribution from known responses only rather than from those unknown ones.There are still some limitations existing in the current model. For example, the sparsification process and the LRA-based graph regularization have many parameters to choose or confirm. It is still difficult to optimize these parameters, which limits the application of this method. In addition, the cell line similarity was constructed only on the information of gene expression. If more genomic information of cell lines is incorporated in the cell line similarity, the prediction efficiency will be improved. In fact, some studies have demonstrated the effectiveness and robustness of the network framework based on diverse cancer hallmarks in cancer researches.70, 71 Thus, in future study, employing more cancer hallmarks to effectively construct the feature network of cell lines is imperative for drug response prediction. Recent researches indicated that both proteins and long non-coding RNAs (lncRNAs) could be used as a drug target.72, 73, 74, 75, 76 Therefore, making full use of predicted and experimentally validated drug-target interactions involving both protein targets and non-coding RNA targets would benefit the prediction of drug response. Third, the lack of known responses for some drugs also restricts the performance of WGRMF. We expect that more response data can be collected from experimental results, which will improve the accuracy of the model. Besides, the optimal solution obtained from the alternative iteration is not the analytic solution, which may affect the prediction precision. In addition, predicting the response of drug combination would be an interesting and important direction for future studies. Finally, the results in case studies were confirmed by only some published literature, which limited the scale of verification on the prediction results. In further research, more independent datasets derived from different databases should be applied to the validation of prediction results. This will make it more feasible and more authentic to evaluate the model performance.
Materials and Methods
Genomic of Drug Sensitivity in Cancer Project
The first dataset we used was collected from Genomic of Drug Sensitivity in Cancer project (release-5.0, https://www.cancerrxgene.org/downloads), including 652 cancer cell lines, 135 drugs, and 70,676 known response values.11, 12 The distribution of known responses for all drugs has been shown in Figure S1, from which we can see that most of the drugs have more than 90% observed values, and only 40 drugs have known responses in less than 50% of cell lines. In GDSC, the drug sensitivity in the cancer cell line is measured by log-transformed IC50 value that indicates the drug concentration for 50% inhibition in vitro. The more sensitive cell lines will get lower IC50 values to a drug. Furthermore, for the involved drugs and cell lines, we need to characterize them with some features. Thus, we adopted the PubChem fingerprint descriptors as the features of the 135 drugs, which could be acquired from https://pubchem.ncbi.nlm.nih.gov. For those 652 cell lines, the gene expression profiles were taken to characterize them. Then, based on the features of drugs and cell lines, we can compute the drug similarity matrix and the cell line similarity matrix, respectively. Motivated by the method used in SRMF, the Jaccard coefficient was utilized to calculate the drug similarity, and the PCC was computed to indicate the cell line similarity.
CCLE
The second dataset was collected from the CCLE (https://portals.broadinstitute.org/ccle), which contains 23 drugs and 491 cell lines with 10,870 known responses.
Figure S2 gives the distribution of known responses over the 23 drugs. It is obvious that most of the drugs except 4 have more than 450 known responses. In CCLE, the response value of cell line to drug represents the area over the drug response curve, which is called the activity area. The larger values of activity area stand for more sensitive responses. Similar to the method used in GDSC, we then calculated Jaccard coefficients between drugs to measure the drug similarity based on the drug fingerprints. And for cell lines, the similarity was computed using PCC between gene expression profiles of cell lines.
WGRMF
In this paper, inspired by the work of Ezzat et al., we have proposed a novel method using WGRMF for inferring anticancer drug response in cell lines (see Figure 6). We first represented the drug similarity, cell line similarity, as well as drug response in cell lines with matrix form, which were indicated by , , and , respectively. Before we implemented the WGRMF, we conducted a sparsification technique for the drug similarity matrix and cell line similarity matrix, respectively. For drugs, a matrix representing the p-nearest neighbor graph was constructed from drug similarity as follows:where represents the set of p nearest neighbors of drug . Similarly, a matrix corresponding to the p-nearest neighbor graph of cell line was constructed from the cell line similarity. Consequently, we could use the p-nearest neighbor graphs to sparsify the drug similarity, , and cell line similarity, , respectively, as follows:where and stand for the sparsified similarity matrices of drug and cell line, respectively.
Figure 6
The Flowchart of the WGRMF for Prediction of Drug Response in Cancer Cell Lines
The flowchart of the WGRMF for prediction of drug response in cancer cell lines, based on the drug chemical structure similarity, cell line gene expression similarity, and known response data.
The Flowchart of the WGRMF for Prediction of Drug Response in Cancer Cell LinesThe flowchart of the WGRMF for prediction of drug response in cancer cell lines, based on the drug chemical structure similarity, cell line gene expression similarity, and known response data.Based on the sparsified similarity matrices of drug and cell line, WGRMF was carried out for the prediction of drug response in cell lines. The fundamental of WGRMF is LRA, which can be formulated as follows:where the drug response matrix was decomposed into two low-rank latent feature matrices and , for drug and cell line, respectively. The variables n, m, and k indicate the number of drugs, cell lines, and latent features in turn. The operator means the Frobenius norm.In order to eliminate the effect of overfitting and exclude the contribution from unknown responses in matrix R, we added the Tikhonov and graph regularization terms, and introduced the weight matrix W into Equation 6 to obtain the objective function of WGRMF as follows:where W has the same shape with R, if there is a known response value in R(i, j), W(i, j) = 1, otherwise W(i, j) = 0; , , and are positive parameters; and and indicate the ith and jth rows of latent matrices A and B, respectively. Each term in the objective function reflects a different goal of the problem. The first term is to make the predicted result close to known response data. The second one, called Tikhonov regularization, is to minimize the norms of latent matrices A and B. The third one indicates the graph regularization of drugs, with the purpose to make two neighbor drugs nearest in the latent space. The last term has the similar meaning to the third one, which is the graph regularization of cell lines. According to some previous studies,78, 79 the objective function in Equation 7 can be transformed into the following equation:where is the operator to calculate the trace of a matrix, and and indicate the graph Laplacians of sparsified similarity matrices and , respectively, which are defined as follows:where and are two diagonal matrices derived from and :In order to improve the performance of the method, we utilized the normalized graph Laplacians to replace the unnormalized ones in Equation 8, which can be computed as follows:Thereby the final objective function can be rewritten as follows:The optimal solutions of the above equation can be obtained by solving and , where L denotes the objective function in Equation 15, which finally results in two alternative update formulas:We can get the latent matrices A and B row by row when the updates are converged. The predicted response matrix can be correspondingly computed as follows:
Parameter Settings
We now show how to set the hyper-parameters used in the method in cross-validation and prediction. For the GDSC dataset, the sparsification parameter p was set to p = 20 for both drug and cell line. The dimension parameter of latent matrix k was chosen from {50, 100, min(n, m)}. The three parameters, , , and , could be set using the grid search method, from the following values: , , . As for the CCLE dataset, the value of p in the sparsification process was set as p = 10 for both drug and cell line. The latent space dimension k was set as k = min(n, m), because the value of min(n, m) was <50.
, , and were set in the same way as in the GDSC dataset.
Author Contributions
N.-N.G. implemented the experiments, analyzed the result, and wrote the paper. X.C. conceived the project, developed the prediction method, designed the experiments, analyzed the result, revised the paper, and supervised the project. X.P., J.-Q.L., Y.Z., and C.-C.W. analyzed the result and revised the paper.
Authors: Frances A Shepherd; José Rodrigues Pereira; Tudor Ciuleanu; Eng Huat Tan; Vera Hirsh; Sumitra Thongprasert; Daniel Campos; Savitree Maoleekoonpiroj; Michael Smylie; Renato Martins; Maximiliano van Kooten; Mircea Dediu; Brian Findlay; Dongsheng Tu; Dianne Johnston; Andrea Bezjak; Gary Clark; Pedro Santabárbara; Lesley Seymour Journal: N Engl J Med Date: 2005-07-14 Impact factor: 91.245
Authors: Francesco Iorio; Theo A Knijnenburg; Daniel J Vis; Graham R Bignell; Michael P Menden; Michael Schubert; Nanne Aben; Emanuel Gonçalves; Syd Barthorpe; Howard Lightfoot; Thomas Cokelaer; Patricia Greninger; Ewald van Dyk; Han Chang; Heshani de Silva; Holger Heyn; Xianming Deng; Regina K Egan; Qingsong Liu; Tatiana Mironenko; Xeni Mitropoulos; Laura Richardson; Jinhua Wang; Tinghu Zhang; Sebastian Moran; Sergi Sayols; Maryam Soleimani; David Tamborero; Nuria Lopez-Bigas; Petra Ross-Macdonald; Manel Esteller; Nathanael S Gray; Daniel A Haber; Michael R Stratton; Cyril H Benes; Lodewyk F A Wessels; Julio Saez-Rodriguez; Ultan McDermott; Mathew J Garnett Journal: Cell Date: 2016-07-07 Impact factor: 41.582
Authors: Yitan Zhu; Thomas Brettin; Yvonne A Evrard; Alexander Partin; Fangfang Xia; Maulik Shukla; Hyunseung Yoo; James H Doroshow; Rick L Stevens Journal: Sci Rep Date: 2020-10-22 Impact factor: 4.996