MOLI: multi-omics late integration with deep neural networks for drug response prediction.

Hossein Sharifi-Noghabi^1,2, Olga Zolotareva^3, Colin C Collins^2,4, Martin Ester^1,2.

Abstract

MOTIVATION: Historically, gene expression has been shown to be the most informative data for drug response prediction. Recent evidence suggests that integrating additional omics can improve the prediction accuracy which raises the question of how to integrate the additional omics. Regardless of the integration strategy, clinical utility and translatability are crucial. Thus, we reasoned a multi-omics approach combined with clinical datasets would improve drug response prediction and clinical relevance.
RESULTS: We propose MOLI, a multi-omics late integration method based on deep neural networks. MOLI takes somatic mutation, copy number aberration and gene expression data as input, and integrates them for drug response prediction. MOLI uses type-specific encoding sub-networks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. The former makes the representations of responder samples more similar to each other and different from the non-responders, and the latter makes this representation predictive of the response values. We validate MOLI on in vitro and in vivo datasets for five chemotherapy agents and two targeted therapeutics. Compared to state-of-the-art single-omics and early integration multi-omics methods, MOLI achieves higher prediction accuracy in external validations. Moreover, a significant improvement in MOLI's performance is observed for targeted drugs when training on a pan-drug input, i.e. using all the drugs with the same target compared to training only on drug-specific inputs. MOLI's high predictive power suggests it may have utility in precision oncology.
AVAILABILITY AND IMPLEMENTATION: https://github.com/hosseinshn/MOLI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31510700 PMCID: PMC6612815 DOI: 10.1093/bioinformatics/btz318

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Precision oncology is the use of genomic data to tailor therapy for an individual cancer patient. However, response to a cancer treatment (chemotherapy or targeted drugs) is a complex phenotype and often depends on multiple factors, especially the genomic profile of the patient (Lee ). Presently, only 11% of patients treated with precision oncology can be placed in clinical trials and only 5% of patients benefit from precision oncology (Cheng ; Marquart ; Zehir ). Although there are many reasons underlying this modest success rate, improved drug response prediction will significantly increase the number of patients who benefit from targeted therapy (Marquart ) or chemotherapy, and spare those who will not benefit from adverse side effects (Gavan ; Mishra and Verma, 2010). Various in vitro studies of cancer cell lines and patient-derived xenograft (PDX) mouse models (Gao ) have created datasets such as Genomics of Drug Sensitivity in Cancer (GDSC) (Iorio ) and Cancer Cell Line Encyclopedia (CCLE) (Barretina ). These datasets provide researchers with multi-omics profiles, consisting of genomic [somatic mutation and copy number aberration (CNA)], transcriptomic, proteomic and methylomic data, together with the response to a large number of targeted and chemotherapy drugs. This is different from patient datasets, which record the response only to one or a few drugs that have been administered to a patient. These in vitro datasets enable researchers to investigate the drug response mechanism at a large scale, in particular for many drugs, and all the way from various types of pre-clinical models to patients (Barretina ; Iorio ). Complementing in vitro studies, in silico studies have aimed at building computational methods that analyze the cumulative effects of single- or multi-omics data to accurately predict drug response (Ding ; Geeleher ). These studies usually measure the drug response as the drug concentration that reduces viability by 50% (IC50). 
A critical challenge in drug response research is clinical utility, i.e. whether the outcome of the study is translatable to actual patients (Geeleher , 2017). Ideally, to achieve translatability, a computational method should be trained on in vivo data. However, available in vivo datasets such as The Cancer Genome Atlas (TCGA) datasets (Weinstein ) do not have enough patient records with drug response information and, in particular, unlike cell line datasets such as GDSC, they do not report responses to multiple drugs. For in silico drug response prediction, translatability in the simplest case means that a model with good performance (e.g. high prediction accuracy) on in vitro data, which offer more training samples than in vivo data, should also have good performance on in vivo data. The majority of studies suggest that gene expression data are the most effective data type for drug response prediction (Ding ; Geeleher ; Graim ; Iorio ). Geeleher showed that a ridge regression model trained on GDSC gene expression data is translatable to Docetaxel, Cisplatin, Erlotinib and Bortezomib clinical trial data. They also showed that, for Docetaxel, including non-breast cancer cell lines in model training increases the predictive power of the final model compared to the model trained only on breast cancer cell lines. This ridge regression-based pipeline on gene expression was also used to impute the drug response for TCGA patients (Geeleher ; Weinstein ). Despite the predictive power of gene expression, adding other omics data types can increase the predictive power, especially in pan-cancer models (Iorio ). Multi-omics data provide a machine learning model with different views of the same sample and promise better characterization of biological processes (Argelaguet ; Wang ). 
Multi-omics data have been exploited for different problems such as driver gene identification (Dimitrakopoulos ; Mo ; Shrestha ; Singh ), patient stratification (Khakabimamaghani and Ester, 2016), survival prediction (Chaudhary ), subgroup discovery (Liang ) and drug response prediction (Ding ). For the drug response prediction, Ding proposed a method that concatenates mutation, CNA and gene expression data and applies autoencoders to learn features for the concatenated multi-omics cell line data. The learned features were used as the input of an elastic net classifier which predicts the binarized IC50 values. We note that the classifier was validated only on CCLE cell lines without studying its translatability to patients or PDX models. A critical challenge in multi-omics data analysis is how to integrate different data types. There are two major approaches to multi-omics integration: early integration and late integration (Rappoport and Shamir, 2018; Zitnik ). In early integration, all omics data types available for a sample are first concatenated, and then an integrated representation of the sample is created by applying some feature learning method, such as autoencoders (Goodfellow ), to that representation. Early integration has three disadvantages: first, it disregards the unique distribution of each omics data type. Second, it requires proper normalization to avoid giving more weight to the omics data type with more dimensions. Third, it further increases the dimensionality of the input data which often is already a challenge for single-omics input data (Rappoport and Shamir, 2018). In late integration, features are learned separately for each omics data type, and these features are then integrated into one unified representation to be used as the input for a classifier or a regressor. 
The advantage of this approach is that it works with the unique distribution of each omics data type, it can employ single-omics normalization for each data type, and it does not increase the dimensionality of the input space. In this paper, we explore the problem of drug response prediction and propose MOLI, a multi-omics late integration method based on deep neural networks. MOLI takes somatic mutation, CNA and gene expression data as input, and predicts the response to a given drug as the output. MOLI learns features for each omics data type by type-specific encoding sub-networks and concatenates the learned features into one representation of the multi-omics profiles. To the best of our knowledge, MOLI is the first end-to-end late integration method with deep neural networks that optimizes this representation via a combined cost function consisting of a triplet loss function (Schroff ) and a binary cross-entropy loss function. The former makes the representations of responder cell lines more similar to each other and different from the representations of non-responder cell lines, and the latter makes this representation predictive of the IC50 values. As another contribution, MOLI employs transfer learning to increase the size of the training dataset. It trains a drug response model on pan-drug inputs (using all the drugs with the same target) instead of drug-specific inputs. Figure 1 illustrates the workflow of MOLI.

Fig. 1.

Schematic overview of MOLI. (A) Pre-processing mutation, CNA and gene expression data. (B) Each encoding sub-network learns features for its omics data type and the learned features are concatenated into one representation. (C) The MOLI cost function consists of a triplet loss and a classification loss, obtained from the classifier sub-network that uses the multi-omics representation to predict drug response.

We validated MOLI on in vitro (PDX) and in vivo (TCGA patients) datasets for five chemotherapy agents and two targeted therapeutics. Our comparison with the state-of-the-art single-omics and early integration multi-omics methods showed that MOLI can achieve significantly better performance in terms of Area Under the receiver operating characteristic Curve (AUC) on PDX and patient data. Moreover, we observed substantial improvement in MOLI’s performance for targeted drugs when training on a pan-drug dataset compared to training on drug-specific datasets. We conclude that MOLI models trained on in vitro data translate well to in vivo data and may have utility for precision oncology. Finally, we showed that the responses predicted by MOLI (trained on the pan-drug input for the epidermal growth factor receptor (EGFR) inhibitors) for breast, lung, kidney and prostate cancers from TCGA patients (without recorded drug response) had statistically significant associations with some of the genes in the EGFR pathway. This shows that MOLI captures biological aspects of the response.

2 Materials and methods

2.1 MOLI

MOLI is a deep neural network that predicts the drug response for a given sample, represented by its multi-omics profile, and for a given drug. MOLI assumes that values for the same genes are provided for each omics data type. MOLI’s network consists of the following sub-networks. It has multiple feed forward encoding sub-networks, one for each input omics data type. Each encoding sub-network receives its corresponding omics data and encodes it into a learned feature space. The learned features from the encoding sub-networks are integrated into one representation by concatenation. The concatenated representation serves as input for a classification sub-network, which predicts the drug response. The entire network is trained in an end-to-end fashion using a cost function combining a classification loss and a triplet loss. Figure 1 shows MOLI’s components during training and model development, while Figure 2A shows the application of MOLI for external validation.

Fig. 2.

(A) Using MOLI to make predictions for PDX/patient inputs during external validation. (B) Combining targeted drugs that target the same pathway or molecule to make a pan-drug training dataset for MOLI

Learning features by encoding sub-networks

To learn features for each omics data type in the input, we design separate feed forward encoding sub-networks to map the input space to the feature space. In this paper, we focus on mutation, CNA and gene expression data. $X_M$, $X_C$ and $X_E$ denote mutation, CNA and gene expression data, respectively, each of which is of dimensionality N × D, where N is the number of samples and D is the number of genes. We note that the proposed approach can be extended to any number of omics data types. Each encoding sub-network has a fully connected layer with rectified linear unit (ReLU) activation functions. In addition, each sub-network employs dropout to regularize the model and batch normalization to enhance the training process. The input of each encoding sub-network is one omics data type and the output is the learned features for that omics type (Fig. 1B). We denote these sub-networks as $f_M(\cdot)$, $f_C(\cdot)$ and $f_E(\cdot)$, respectively.

Integrating learned features by late integration

In the integration step, we utilize a late integration approach and concatenate the learned features of the different single-omics data types to obtain one multi-omics representation. For example, if each of the three encoding sub-networks outputs M learned features per sample, the concatenated representation has 3M features per sample. The integrated representation is further smoothed through an l2 normalization layer. We denote MOLI’s integration, receiving multi-omics data as input and returning the integrated representation, as follows: $F(X_M, X_C, X_E) = f_M(X_M) \oplus f_C(X_C) \oplus f_E(X_E)$, where $f_M$, $f_C$ and $f_E$ are the encoding sub-networks for mutation, CNA and gene expression data, and ⊕ denotes the concatenation operator.
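The integration step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the per-omics feature lists stand in for the outputs of already trained encoding sub-networks, the numbers are made up, and the function names `l2_normalize` and `late_integrate` are hypothetical rather than taken from the MOLI code base.

```python
import math

def l2_normalize(row, eps=1e-12):
    """Scale one sample's feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / max(norm, eps) for v in row]

def late_integrate(feat_m, feat_c, feat_e):
    """Concatenate per-omics features sample by sample, then L2-normalize.

    Each argument is a list of per-sample feature lists (assumed to be the
    outputs of the mutation, CNA and expression encoders, in the same
    sample order).
    """
    if not (len(feat_m) == len(feat_c) == len(feat_e)):
        raise ValueError("all omics types must cover the same samples")
    return [l2_normalize(m + c + e) for m, c, e in zip(feat_m, feat_c, feat_e)]

# Two samples, two learned features per omics type (made-up numbers).
feat_m = [[1.0, 0.0], [0.0, 2.0]]
feat_c = [[0.0, 2.0], [0.0, 0.0]]
feat_e = [[2.0, 0.0], [1.0, 2.0]]
z = late_integrate(feat_m, feat_c, feat_e)
```

With two features per omics type, each integrated row has six entries and unit length, which is the property the normalization layer is meant to provide.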

Optimizing the learned features by the combined cost function

The learned features will be used by a classifier that predicts the drug response. Therefore, the last sub-network of MOLI is a classification layer with the Sigmoid activation function, using dropout and weight decay for regularization (Fig. 1C). We denote this classifier as g(.). Since the MOLI network will be used for classification, i.e. drug response prediction, the cost function used for training must include a term that measures the difference between the predicted drug response and the ground truth drug response. We choose the binary cross-entropy classification loss, one of the most common classification losses, defined as follows:

$L_{\text{classification}} = -\left[ Y \log \hat{Y} + (1 - Y) \log (1 - \hat{Y}) \right]$

where $\hat{Y}$ is the response predicted by g(.) and $Y$ denotes the binarized IC50, which is used as the measure of the drug response.

We add a triplet loss to the cost function to impose a further constraint that is necessary for accurate classification. This constraint forces responders to be more similar to each other than to non-responders. The triplet loss function was introduced in FaceNet (Schroff ) for optimizing the mapping from a space of face images to a Euclidean space where the difference between learned features is correlated with the similarity among faces. The idea is that for the image of a given person’s face, the distance between that image’s learned features and the features of another image of the same person should be smaller than the distance between that image’s learned features and the learned features of the image of some other person. In our context, we employ the triplet loss function as follows. For T given triplets in the form of (Anchor, Positive, Negative), written $(a_i, p_i, n_i)$ for the i-th triplet, where the first two are (the multi-omics data of) responder cell lines to a given anti-cancer drug and the last one is (the multi-omics data of) a non-responder to that drug, we require the following condition:

$d(a_i, p_i) \le d(a_i, n_i)$

where d(.) is an arbitrary distance function (we used the Euclidean distance). If we move the right-hand side to the left, we obtain:

$d(a_i, p_i) - d(a_i, n_i) \le 0$

In order to avoid the trivial zero solution, a margin $\xi$ is required:

$d(a_i, p_i) - d(a_i, n_i) + \xi \le 0$

We want the distance of the Anchor and the Negative to be larger than the distance of the Anchor and the Positive. Thus, the value of the triplet loss function for the i-th triplet is:

$L_i = \max\left( d(a_i, p_i) - d(a_i, n_i) + \xi,\ 0 \right)$

and the total triplet loss for T triplets is:

$L_{\text{triplet}} = \sum_{i=1}^{T} L_i$

Generally, there are two approaches to select triplets for the triplet loss function: offline selection and online selection. The offline selection builds the triplets based on the value of the labels (in this case the drug response) before training the model. The online selection selects the triplets from samples in each mini-batch during the training. We adopted the online approach. Triplets can be built based on all possible combinations of the input samples/mini-batches (soft selection) or based only on those triplets with high triplet loss value (hard selection). Soft selection provides the model with more training triplet examples but the network might rely too much on easy cases, and as a result may be unable to perform well on hard examples (Schroff ). Hard selection solves this problem by only relying on the hard cases in the train data to build the triplets, but this approach may suffer from having fewer training triplets, especially in the case of small unbalanced datasets. We adopted the soft selection approach. Therefore, the combined cost J is defined as follows:

$J = L_{\text{classification}} + \gamma L_{\text{triplet}}$

where γ is a regularization term for the triplet loss.
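As a rough illustration of the combined cost with online soft triplet selection, the following plain-Python sketch enumerates every (Anchor, Positive, Negative) triplet in one mini-batch and adds the weighted triplet loss to the binary cross-entropy. Several details are assumptions rather than facts from the paper: the cross-entropy is averaged over the mini-batch, the triplet loss is summed over triplets, probabilities are clipped for numerical safety, and all function names are hypothetical. The authors' PyTorch implementation may differ.

```python
import math
from itertools import permutations

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def soft_triplets(labels):
    """All (anchor, positive, negative) index triplets in a mini-batch.

    Anchor and positive are two distinct responders (label 1); the
    negative is a non-responder (label 0).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return [(a, p, n) for a, p in permutations(pos, 2) for n in neg]

def triplet_loss(z, labels, margin):
    """Sum of max(d(a, p) - d(a, n) + margin, 0) over all soft triplets."""
    return sum(
        max(euclidean(z[a], z[p]) - euclidean(z[a], z[n]) + margin, 0.0)
        for a, p, n in soft_triplets(labels)
    )

def bce_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy, averaged over the mini-batch (an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def combined_cost(z, probs, labels, margin, gamma):
    """Classification loss plus gamma times the triplet loss."""
    return bce_loss(probs, labels) + gamma * triplet_loss(z, labels, margin)

# Made-up mini-batch: two responders close together, one distant non-responder.
z = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
labels = [1, 1, 0]
probs = [0.5, 0.5, 0.5]
cost = combined_cost(z, probs, labels, margin=1.0, gamma=0.5)
```

In this toy batch the non-responder is already far from both responders, so with a margin of 1 every triplet term is clipped to zero and only the cross-entropy contributes. Raising the margin makes the same triplets active again.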

2.2 Transfer learning for targeted drugs

For targeted drugs, we use transfer learning and train MOLI with a new pan-drug input. This pan-drug input consists of multi-omics profiles and drug responses for a family of targeted drugs that target the same pathway or molecule. Such drugs are expected to produce highly correlated responses in cell lines. One MOLI model is trained for a family of drugs instead of one separate model for each individual drug. This approach increases the training dataset size, since the set of the screened cell lines and the obtained responses are similar but not identical for the drugs of one family. In our experiments, we evaluate transfer learning for EGFR pathway inhibitors due to the availability of external validation data, but the approach is applicable to any family of targeted drugs. Figure 2B illustrates the idea of transfer learning for targeted drugs.
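A minimal sketch of how a pan-drug training set could be assembled is shown below. The row layout, the cell line identifiers and the function name `build_pan_drug` are hypothetical. The responder counts in the sanity check are copied from Table 1, where the five EGFR-pathway drugs in GDSC add up to the pan-drug row.

```python
def build_pan_drug(per_drug):
    """Pool (cell_line, label) rows from several drugs into one training set.

    A cell line screened with several drugs contributes one row per drug,
    so its multi-omics profile is reused with a possibly different label.
    """
    return [
        (drug, cell_line, label)
        for drug, rows in per_drug.items()
        for cell_line, label in rows
    ]

# Toy input with hypothetical cell line IDs; 1 = responder, 0 = non-responder.
toy = {
    "Erlotinib": [("CL1", 1), ("CL2", 0)],
    "Gefitinib": [("CL1", 1), ("CL3", 0)],
}
pooled = build_pan_drug(toy)

# Sanity check against Table 1: (non-responders, responders) per GDSC drug.
table1 = {
    "Afatinib": (678, 150),
    "Cetuximab": (735, 121),
    "Erlotinib": (298, 64),
    "Gefitinib": (710, 115),
    "Lapatinib": (326, 61),
}
total_nr = sum(nr for nr, _ in table1.values())
total_rs = sum(rs for _, rs in table1.values())
```

The totals reproduce the pan-drug row of Table 1 (2747 non-responders and 511 responders, 3258 samples in all), which is consistent with the pooled set being the row-wise union of the five drug-specific sets.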

2.3 Predicting drug response for TCGA patients

To study MOLI’s performance, similar to Geeleher , we employ the model trained on the pan-drug input for the EGFR inhibitors to predict the drug response for patients in several TCGA datasets for which no drug response was recorded. Since these drugs target the EGFR pathway, we expect the expression status of the genes of this pathway to be strongly correlated with the predicted drug response. We obtain the list of genes in the EGFR pathway from REACTOME. To study the correlation, we fit a multiple linear regression between the predicted responses and the expression levels. We obtain a P-value for each gene and correct these for multiple comparisons using the Bonferroni correction.
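The Bonferroni step can be illustrated as follows. The gene names and P-values are placeholders, not results from the paper, and the significance level of 0.05 is an assumption made only for this example.

```python
def bonferroni(p_values, alpha):
    """Bonferroni-adjust a dict of per-gene P-values.

    Each P-value is multiplied by the number of tests and capped at 1.
    A gene is called significant if its adjusted P-value is <= alpha.
    """
    m = len(p_values)
    adjusted = {gene: min(p * m, 1.0) for gene, p in p_values.items()}
    significant = sorted(g for g, p in adjusted.items() if p <= alpha)
    return adjusted, significant

# Placeholder P-values for three hypothetical genes.
raw = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.5}
adjusted, significant = bonferroni(raw, alpha=0.05)  # alpha is an assumption
```

With three tests, a raw P-value of 0.02 becomes 0.06 after adjustment and no longer passes the assumed threshold, which shows how the correction trades sensitivity for control of false positives.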

2.4 Datasets

We use four main resources in this paper: (i) the GDSC cell line dataset (Iorio ); (ii) the PDX Encyclopedia dataset (Gao ); (iii) TCGA patients with the drug response available in their records (Ding ); and (iv) TCGA patients without the drug response (Weinstein ). The GDSC project (Iorio ; Yang ) has created a multi-omics dataset of more than a thousand cell lines from different cancer types, screened with 265 targeted and chemotherapy drugs. We use GDSC as the training dataset because of its high number of screened drugs. Multi-omics profiles and drug responses for GDSC are retrieved from ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-7.0/. We use the other publicly available multi-omics datasets for external validation as follows. We use the PDX Encyclopedia mouse models published by Gao . This dataset has more than 300 PDX models for different cancer types, screened with 34 targeted and chemotherapy drugs. We also use TCGA (Weinstein ) data, which include profiles of tumor samples collected from more than 10 000 patients with different cancer types, downloaded from Firehose Broad GDAC (https://doi.org/10.7908/C11G0KM9, http://gdac.broadinstitute.org/runs/stddata__2016_01_28/). For TCGA datasets, we use clinical annotations of the drug response for some patients, which were obtained from the Supplementary Material of Ding . We also use TCGA patients with breast (BRCA), bladder (BLCA), pancreatic (PAAD), lung (LUAD), kidney (KIRP) and prostate (PRAD) cancers, whose records contain no drug response. We note that we used only those genes that are common to all of the omics data types in both the training and external validation datasets for each drug. The data pre-processing steps and the resource files used are summarized below and are presented in more detail in the Supplementary Material (Section S1) and Supplementary Table S1. Table 1 provides the characteristics of each dataset, such as the type of drug, the number of samples and the number of genes. 
After the pre-processing, we have the same number of genes for the training and the external validation datasets and for each of the three omics data types. We only consider samples for which all three omics data types are available.
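Restricting every matrix to the shared genes can be sketched as below. The gene symbols only serve as identifiers in a toy example, and the helper names `shared_ids` and `subset_columns` are hypothetical. The same intersection helper could be applied to sample identifiers to keep only samples with all three omics data types.

```python
def shared_ids(*id_lists):
    """Return the sorted IDs that occur in every one of the given lists."""
    first, *rest = id_lists
    return sorted(set(first).intersection(*rest))

def subset_columns(matrix, genes, keep):
    """Reorder and restrict the columns of a samples x genes matrix to `keep`."""
    index = {gene: j for j, gene in enumerate(genes)}
    return [[row[index[gene]] for gene in keep] for row in matrix]

# Toy gene lists for three omics types (training and validation lists
# would be passed in the same way).
mutation_genes = ["EGFR", "KRAS", "TP53"]
cna_genes = ["KRAS", "TP53", "BRAF"]
expression_genes = ["TP53", "KRAS", "EGFR"]

genes = shared_ids(mutation_genes, cna_genes, expression_genes)
mutation_matrix = [[1, 0, 1], [0, 0, 1]]  # rows: samples, columns: mutation_genes
mutation_common = subset_columns(mutation_matrix, mutation_genes, genes)
```

Sorting the shared identifiers gives every omics matrix the same column order, which matters because MOLI assumes that values for the same genes are provided for each omics data type.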

Table 1.

List of the studied drugs from the used resources with multi-omics profiles available

Drug	Type	Resource	Number of samples^a	Number of genes^b	Usage
Afatinib	Targeted	GDSC	828 (NR: 678, RS: 150)	13 081	Training
Cetuximab	Targeted	GDSC	856 (NR: 735, RS: 121)	12 346^c/13 081^d	Training
Cetuximab	Targeted	PDX	60 (NR: 55, RS: 5)	12 346^c/13 081^d	External validation
Cisplatin	Chemotherapy	GDSC	829 (NR: 752, RS: 77)	15 493	Training
Cisplatin	Chemotherapy	TCGA	66 (NR: 6, RS: 60)	15 493	External validation
Docetaxel	Chemotherapy	GDSC	829 (NR: 764, RS: 65)	15 016	Training
Docetaxel	Chemotherapy	TCGA	16 (NR: 8, RS: 8)	15 016	External validation
Erlotinib	Targeted	GDSC	362 (NR: 298, RS: 64)	12 325^c/13 081^d	Training
Erlotinib	Targeted	PDX	21 (NR: 18, RS: 3)	12 325^c/13 081^d	External validation
Gefitinib	Targeted	GDSC	825 (NR: 710, RS: 115)	13 081	Training
Gemcitabine	Chemotherapy	GDSC	844 (NR: 790, RS: 54)	12 067/15 381	Training
Gemcitabine	Chemotherapy	PDX	25 (NR: 18, RS: 7)	12 067	External validation
Gemcitabine	Chemotherapy	TCGA	57 (NR: 36, RS: 21)	15 381	External validation
Lapatinib	Targeted	GDSC	387 (NR: 326, RS: 61)	13 081	Training
Paclitaxel	Chemotherapy	GDSC	389 (NR: 363, RS: 26)	12 482	Training
Paclitaxel	Chemotherapy	PDX	43 (NR: 38, RS: 5)	12 482	External validation
Pan-drug	Targeted	GDSC	3258 (NR: 2747, RS: 511)	13 081	Training

Note: NR, non-responder; RS, responder.

^a Number of screened samples with all three omics data types available.

^b Number of genes in common between the train data and the external validation data for each drug.

^c Number of genes for the drug-specific experiments.

^d Number of genes for the pan-drug experiments.

Gene expression profiles

Raw intensities for the GDSC dataset are obtained from ArrayExpress (E-MTAB-3610), RMA-normalized (Irizarry ), log-transformed and aggregated to the level of genes. Gene expression values of the PDX and all TCGA datasets are converted to TPM (Li and Dewey, 2011) and log-transformed; for PDX samples, the conversion starts from FPKM values. To make expression profiled by different platforms comparable, we standardize gene expression and perform a pairwise homogenization procedure (Geeleher ; Johnson ). Also, in each dataset we exclude the 5% of genes with the lowest variance, assuming them to be uninformative.
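Part of this expression pre-processing can be sketched in plain Python. The FPKM-to-TPM conversion rescales one sample's FPKM values so that they sum to one million. The base-2 logarithm, the pseudocount of 1 and the function names are assumptions for illustration, and the homogenization step is not shown.

```python
import math
from statistics import pvariance

def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values to TPM (values sum to 1e6)."""
    total = sum(fpkm)
    return [v / total * 1e6 for v in fpkm]

def log_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); base and pseudocount are assumptions."""
    return [math.log2(v + pseudocount) for v in values]

def drop_low_variance(matrix, genes, fraction=0.05):
    """Remove the given fraction of genes with the lowest variance.

    `matrix` is samples x genes; returns the filtered matrix and gene list.
    """
    variances = [pvariance(column) for column in zip(*matrix)]
    n_drop = int(len(genes) * fraction)
    order = sorted(range(len(genes)), key=lambda j: variances[j])
    keep = sorted(order[n_drop:])
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [genes[j] for j in keep]

tpm = fpkm_to_tpm([1.0, 3.0, 6.0])
logged = log_transform([0.0, 1.0, 3.0])

# Toy matrix: g0 is constant, g1 barely varies, g2 and g3 vary strongly.
toy = [[1, 1, 0, 5], [1, 2, 4, 9], [1, 1, 8, 1]]
filtered, kept = drop_low_variance(toy, ["g0", "g1", "g2", "g3"], fraction=0.5)
```

The toy call uses a fraction of 0.5 only so that the effect is visible on four genes; the text above specifies 5%.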

Somatic copy number profiles

We remove unreliable segments from genome segmentation files for TCGA datasets and assign every gene a value corresponding to the intensity log-ratio of the segment it overlaps. If the gene overlaps more than one segment, we keep the most extreme log-ratio value. Different from TCGA, the GDSC and PDX datasets provided gene-level estimates of total copy number. In order to make these data comparable with TCGA, we compute for every gene the logarithm of its copy number divided by the ploidy of copy-neutral state in the sample. Finally, for all four datasets we binarize gene-level copy number estimates assigning zeros to copy-neutral genes and ones to all genes overlapping deletions or amplifications.
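The copy number steps above can be illustrated with a short sketch. The base-2 logarithm, the handling of a copy number of zero and the `neutral_band` tolerance are assumptions, since the text does not state them, and the function names are hypothetical.

```python
import math

def gene_log_ratio(segment_log_ratios):
    """Keep the most extreme log-ratio among the segments a gene overlaps."""
    return max(segment_log_ratios, key=abs)

def log_ratio_from_copy_number(copy_number, ploidy):
    """Log of copy number over copy-neutral ploidy (base 2 is an assumption)."""
    if copy_number == 0:
        return float("-inf")  # complete loss; log of zero is undefined
    return math.log2(copy_number / ploidy)

def binarize_cna(log_ratio, neutral_band=0.0):
    """0 for copy-neutral genes, 1 for deletions or amplifications.

    `neutral_band` is a hypothetical tolerance around zero for noisy
    log-ratios; the value used by the authors is not given in the text.
    """
    return 0 if abs(log_ratio) <= neutral_band else 1

tcga_gene = gene_log_ratio([0.1, -0.8, 0.3])        # gene spanning three segments
gdsc_gain = log_ratio_from_copy_number(4, ploidy=2)  # amplification
gdsc_flat = log_ratio_from_copy_number(2, ploidy=2)  # copy-neutral
calls = [binarize_cna(v) for v in (tcga_gene, gdsc_gain, gdsc_flat)]
```

Putting both data sources on the same log-ratio scale before binarizing is what makes the gene-level calls comparable across TCGA, GDSC and PDX.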

Somatic point mutations

Similarly with previous works (Ding ; Iorio ), we assign ones to genes carrying somatic point mutations and zeros to all others.

3 Results

3.1 Experimental design

In our experiments, we investigated the following questions: Does MOLI outperform single-omics and early integration baselines in terms of prediction AUC on PDX and patient data? Does transfer learning work for targeted drugs, i.e. does MOLI trained on pan-drug data outperform MOLI trained on drug-specific (single drug) data? Finally, for the targeted drugs, does the predicted response by MOLI have associations with the target of that drug? We trained MOLI on GDSC cell lines screened with Docetaxel, Cisplatin, Gemcitabine, Paclitaxel, Erlotinib and Cetuximab. We chose these drugs based on availability of PDX/patient multi-omics data for these drugs which is necessary for external validations. We trained all of the baselines for the same drugs and compared them to MOLI in terms of prediction AUC. We compared MOLI against early integration via deep neural networks inspired by Ding and early integration via non-negative matrix factorization (NMF) (Cichocki and Phan, 2009; Févotte and Idier, 2011), against the single-omics (gene expression) ridge regression method proposed by Geeleher , against an ordinary feed forward network with classification loss trained on the expression data, and against a version of MOLI trained only on the gene expression data. To test whether the triplet loss contributes to improve the performance, we compared MOLI to a late integration feed forward network with an architecture similar to MOLI but using only a classification loss. Finally, to study transfer learning for the targeted drugs, we focused on drugs that target the EGFR pathway because we have Cetuximab and Erlotinib that target this pathway in the PDX dataset utilized for external validations. In addition, GDSC was screened with numerous drugs that target EGFR including: Afatinib, Cetuximab, Erlotinib, Gefitinib and Lapatinib. We used multi-omics data of all of these drugs in GDSC and created a large training set (>3000 samples). 
We trained MOLI on this pan-drug data and compared the results to MOLI trained on the drug-specific inputs. We used 5-fold cross validation in most of the experiments to tune the hyper-parameters of the deep neural networks based on the AUC. The hyper-parameters tuned were the number of nodes in the hidden layers, learning rates, mini-batch size, weight decay, the dropout rate, the number of epochs, and the margin and regularization term (only for the triplet loss). Details on the ranges considered for each hyper-parameter and the selected settings for each drug are provided in Supplementary Tables S2 and S3. Finally, the network was re-trained with the selected hyper-parameters on the entire dataset for that drug (train and validation). We used Adagrad for optimizing parameters in all of the deep neural networks (Duchi ). We used the PyTorch framework to implement all deep neural network code. For the ridge regression pipeline, we downloaded the implemented pipeline with leave-one-out cross validation provided by the original authors (Geeleher ) and applied it to our datasets. To make sure that both the downloaded pipeline and our pre-processing of the gene expression data are correct, we evaluated it on the datasets from the original paper and obtained AUCs for Docetaxel and Bortezomib comparable to those of He . For early integration via NMF, we first concatenate the omics data types, and then train an NMF on the resulting matrix to learn the latent factors. Finally, we train the Geeleher method (using the learned factors as features) to predict the drug response. The details of this new baseline and the tuning of its hyper-parameter are presented in the Supplementary Material (Section S2).
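For readers unfamiliar with the evaluation protocol, the sketch below shows one way to compute the AUC and to generate 5-fold splits in plain Python. The AUC is computed as the fraction of (responder, non-responder) pairs ranked correctly, with ties counted as one half. The fold assignment is a simple unshuffled, unstratified round-robin, which is an assumption; the text does not say how the folds were drawn.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, validation_indices) pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    everything = set(range(n_samples))
    return [(sorted(everything - set(fold)), fold) for fold in folds]

# Made-up labels and predicted scores for four samples.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
splits = k_fold_indices(10, k=5)
```

In a tuning loop, each hyper-parameter setting would be trained on the training indices of every split, scored with `roc_auc` on the validation indices, and ranked by the mean validation AUC.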

3.2 Multi-omics integration by MOLI improves the drug response performance

Table 2 reports the performance of MOLI and the baselines in terms of AUC.

Table 2.

Performance of different versions of MOLI compared to the baselines in terms of prediction AUC across two targeted therapeutics and five chemotherapy agents

Method	PDX	PDX	PDX	PDX	TCGA	TCGA	TCGA	Input omics
Drug	Paclitaxel	Gemcitabine	Cetuximab	Erlotinib	Docetaxel	Cisplatin	Gemcitabine	Input omics
Geeleher et al. (2014)	0.52	0.59	0.58	0.67	0.59	0.62	0.53	Expression
Early integration via NMF	0.24	0.56	0.53	0.28	0.39	0.40	0.58	Multi
Early integration via DNNs	NSC	0.66	NSC	NSC	0.52	NSC	0.59	Multi
Feed forward net	0.68	0.48	0.43	0.37	0.69	0.44	0.65	Expression
MOLI complete	0.69	0.52	0.51	0.39	0.63	0.75	0.64	Expression
MOLI with classifier	NSC	0.55	0.46	NSC	0.58	0.6	0.69	Multi
MOLI complete	0.74	0.64	0.53	0.63	0.58	0.66	0.65	Multi
MOLI complete Pan-drug	NA	NA	0.80	0.72	NA	NA	NA	Multi

Note: NSC: no stable classifier during cross validation or final training. Boldface indicates the best method for the corresponding drug and italics indicates the second best method. NA: the corresponding drug is not for targeted therapy. Complete: when MOLI has both classification and the triplet losses; NMF, non-negative matrix factorization; DNNs, deep neural networks.

First, we compared the complete MOLI (MOLI trained on multi-omics data and using its combined cost function) to the early integration baselines. MOLI achieved better performance in six out of seven external validation datasets compared to early integration via deep neural networks. Moreover, MOLI also achieved better performance in all of the external validation datasets compared to early integration via NMF. These results indicate that MOLI outperforms multi-omics early integration. Second, we compared MOLI trained on multi-omics data with two deep neural network scenarios trained only on the gene expression data (one of them was MOLI itself). MOLI trained on multi-omics data showed better performance in four out of seven external validation datasets and tied in another dataset. These results indicate that deep neural networks trained on multi-omics data achieve better performance than those trained on single-omics data. Third, we compared MOLI with MOLI without the triplet loss, both trained on multi-omics input. MOLI with its combined cost function obtained better performance in five out of seven external validation datasets and tied in another one. These results demonstrate the contribution of the triplet loss to improving the prediction performance. 
Finally, we compared MOLI against a single-omics non-deep neural networks baseline which is the only published method for drug response prediction (Geeleher ) that had been tested on patient data. MOLI achieved better performance in four out of seven external validation datasets and tied with this baseline in another one. These experiments show the substantial gain in predictive performance resulting from the combination of using multi-omics data, deep neural networks and the proposed cost function. MOLI when trained on the pan-drug input (only applicable for targeted drugs) had significantly better performance compared to itself when it was trained on the drug-specific inputs for Erlotinib and Cetuximab. The majority of the baselines had either poor performance or no stable classifier status (NSC) for Paclitaxel and Erlotinib. NSC means that during either cross validation or final re-training with the obtained hyper-parameters the cost and/or AUC curves were fluctuating. This may be due to the small number of samples, because both of these drugs had the fewest number of cell lines (∼400). Also, we observed NSC in the early integration baseline for four drugs which may be due to the concatenation at the beginning because it increased the dimensionality substantially, which makes feature learning harder for the autoencoder and later the classifier in this method. MOLI achieved an AUC of >0.7 for four drugs (Paclitaxel, Cetuximab, Erlotinib and Cisplatin) which may be beneficial for precision oncology particularly for the targeted drugs (Cetuximab and Erlotinib). We also studied the area under precision–recall curve for MOLI and the main baselines including the early integration methods and the Geeleher method. Compared to the early integration methods, MOLI achieved better performance in four out of seven external validation datasets and tied in two other datasets. 
Compared to the single-omics baseline of Geeleher et al., MOLI performed better in six out of seven external validation datasets. We also investigated the area under the precision–recall curve for the pan-drug training data. Compared to MOLI trained on the drug-specific input, MOLI trained on pan-drug data performed better in one external validation dataset and was competitive in another (Supplementary Table S4). All these results again suggest that MOLI may be beneficial for precision oncology.
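The two evaluation metrics used above, the AUC and the area under the precision–recall curve, can be illustrated with a minimal sketch. This is not the evaluation code used in the study; the function names and the toy labels and scores are hypothetical, and the precision–recall area is computed here as step-wise average precision under the assumption that no two samples share the same score.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen responder
    receives a higher score than a randomly chosen non-responder
    (tied scores count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.
    Assumes no tied scores (ties would need to be grouped per threshold)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one responder")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each responder
    return total / n_pos


# Toy example: 1 = responder, 0 = non-responder (hypothetical values).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

Because responders are the minority class in these datasets, the precision–recall area complements the AUC: it is more sensitive to how highly the few responders are ranked.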

3.3 Transfer learning for targeted drugs improves performance significantly

We observed that for the targeted drugs (in our experiments, EGFR inhibitors), MOLI trained on the pan-drug multi-omics inputs achieved significantly better performance than MOLI trained on drug-specific inputs. Pan-drug MOLI achieved an AUC of 0.8 for Cetuximab and 0.72 for Erlotinib, both significantly higher than the corresponding drug-specific performance. This suggests that transfer learning can improve the prediction performance for targeted drugs.
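The difference between the drug-specific and the pan-drug training inputs can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the cell line identifiers, the drug "DrugX" and the target labels are hypothetical, and pooling is shown simply as keeping every screen whose drug shares the target of interest.

```python
# Hypothetical screening records: (cell_line, drug, drug_target, responder_label)
RECORDS = [
    ("CL1", "Cetuximab", "EGFR", 1),
    ("CL2", "Cetuximab", "EGFR", 0),
    ("CL1", "Erlotinib", "EGFR", 0),
    ("CL3", "DrugX", "EGFR", 1),        # placeholder for another EGFR inhibitor
    ("CL4", "Paclitaxel", "other", 0),  # not an EGFR inhibitor, never pooled
]


def drug_specific_set(records, drug):
    """Training samples for a single drug only."""
    return [(cl, y) for cl, d, _, y in records if d == drug]


def pan_drug_set(records, target):
    """Training samples pooled over all drugs that share the given target.
    A cell line screened with several such drugs contributes one sample
    per drug (an assumption of this sketch)."""
    return [(cl, d, y) for cl, d, t, y in records if t == target]


erlotinib_only = drug_specific_set(RECORDS, "Erlotinib")
egfr_pan_drug = pan_drug_set(RECORDS, "EGFR")
```

In this toy example the pan-drug set is four times larger than the Erlotinib-specific set, which illustrates why pooling may help for drugs with few screened cell lines.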

3.4 Predictions for TCGA patients by MOLI have associations with EGFR genes

We applied MOLI (trained on the pan-drug input for EGFR inhibitors) to multi-omics data from TCGA without drug response information (breast, bladder, pancreatic, lung, kidney and prostate cancers) and predicted the response for these patients. According to the P-values obtained from multiple linear regression, there are a number of statistically significant associations between the expression of EGFR pathway genes and the responses predicted by MOLI. For breast cancer, we observed statistically significant associations between the predicted responses and the expression levels of AP2A1 (P = 0.007), CALM2 (P = 0.01), CLTA (P = 0.0002), EGFR (), PIK3CA (P = 0.007) and UBA52 (). For prostate cancer, the predicted responses were significantly associated with the expression of AKT1 (P = 0.02), CDK1 (P = 0.01), RICTOR (P = 0.0002), CREB1 (P = 0.02) and CSK (P = 0.01). In kidney cancer, the expression of EGFR (P = 0.04) was associated with the predicted response. In lung cancer, we observed significant associations for CDC42 (P = 0.04), EGFR () and PRKAR2A (P = 0.01). For bladder and pancreatic cancers, however, we did not observe any significant associations.
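The kind of analysis described above, regressing the predicted responses on gene expression levels and reading off one P-value per gene, can be sketched with ordinary least squares. This is a simplified illustration, not the analysis code of the study: the data are simulated, the function names are hypothetical, and the two-sided P-values use a normal approximation to the t distribution, which is an assumption that is only reasonable for large sample sizes.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (a must be square and non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_pvalues(X, y):
    """Multiple linear regression of y on the columns of X plus an intercept.
    Returns per-gene coefficients and approximate two-sided P-values."""
    n = len(X)
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(Z[0])
    xtx = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    xty = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(k)]
    beta = solve(xtx, xty)
    residuals = [t - sum(b * v for b, v in zip(beta, z)) for z, t in zip(Z, y)]
    sigma2 = sum(e * e for e in residuals) / (n - k)  # residual variance
    pvalues = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        var_j = sigma2 * solve(xtx, unit)[j]  # j-th diagonal of sigma^2 (Z'Z)^-1
        z_score = beta[j] / math.sqrt(var_j)
        # Normal approximation to the t distribution (assumption, see lead-in).
        pvalues.append(math.erfc(abs(z_score) / math.sqrt(2.0)))
    return beta[1:], pvalues[1:]  # drop the intercept


# Simulated data: the "predicted response" depends on gene 0 but not on gene 1.
rng = random.Random(0)
genes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
predicted = [0.8 * g[0] + rng.gauss(0, 0.5) for g in genes]
coefs, pvals = ols_pvalues(genes, predicted)
```

In practice a statistics package that reports exact t-based P-values would be preferable, and testing many genes across several cancer types would call for a multiple-testing correction.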

4 Discussion

In this paper, we proposed MOLI, a method based on deep neural networks for predicting drug response. MOLI integrates somatic mutation, CNA and gene expression data to predict drug response. To the best of our knowledge, MOLI is the first end-to-end method for multi-omics late integration with deep neural networks that utilizes a combined cost function. Our experiments showed that MOLI with its combined cost function can achieve better performance than single-omics and early integration multi-omics methods based on deep neural networks. We also observed that transfer learning for targeted drugs improves the prediction performance compared to drug-specific inputs. To the best of our knowledge, this is the first method to use transfer learning with a pan-drug approach for targeted drugs. Finally, we analyzed MOLI’s predictions for drugs targeting the EGFR pathway on breast, kidney, lung and prostate cancer patients in TCGA. We showed that MOLI’s predictions have statistically significant associations with the expression levels of some of the genes in the EGFR pathway, including the EGFR gene itself, for breast, kidney and lung cancers.

We would like to point out the following directions for future research. Although we used only somatic mutation, CNA and gene expression data in our experiments, MOLI can be extended to integrate other omics data types. For example, proteomics data are a good candidate because they have been shown to contribute to pan-cancer drug response prediction (Ali et al.) and are known to be in concordance with the other omics data types (Gonçalves et al.; Ryan et al.).

We performed experiments on transfer learning only for the drugs that target EGFR, but this approach is also applicable to other families of targeted drugs if multi-omics data are available for external validation. Another advantage of the pan-drug approach is that there is no need to train a separate model for each EGFR inhibitor, and one model can be validated on different external datasets. In the drug-specific approach, we trained one model on Cetuximab data and another one on Erlotinib data, and could not validate them on each other’s external validation data. In the pan-drug approach, however, we trained one model for all of the EGFR inhibitors and validated it on both Cetuximab and Erlotinib data.

While we studied only the triplet loss for optimizing the concatenated representation, this loss function can be replaced by other similar losses, such as the contrastive loss function used in the Siamese network (Hadsell et al.). We trained separate MOLI models for different drugs, but utilizing multi-task learning (Yuan et al.) to predict the outcome for multiple drugs at the same time is an interesting direction for future research.

Unlike in areas such as medical imaging, transfer learning is yet to be explored in genomics, especially cancer genomics (Ching et al.). While in this paper we explored transferring related samples (also known as instance-transfer), other aspects of transfer learning such as relational-knowledge-transfer (Pan and Yang, 2010) should be explored in the future.

In all of the experiments and utilized datasets, we used pan-cancer inputs. The advantage of using pan-cancer multi-omics input is that it can address, to some degree, the challenge of inter-tumor heterogeneity (Almendro et al.). However, these datasets are not suitable for addressing intra-tumor heterogeneity, which would require other resources such as single cell data. Geeleher et al. showed that training on non-breast cancer cell lines in addition to breast cancer cell lines leads to improved prediction accuracy. However, predictions were tested only on breast cancer clinical trial cohort data and only for Docetaxel, a primary treatment for breast cancer. Because some drug-cancer event associations are specific to the tissue of origin and are less detectable in pan-cancer settings (Iorio et al.), we believe that further research is required to study the performance of pan-cancer versus cancer-specific training data for a more diverse range of cancer types and for more drugs.

We would like to point out the following limitations of this study.

The datasets used came from different resources, were not in the same format, and required substantial pre-processing and standardization (see the Supplementary Material). For example, different studies used different pipelines to detect CNA and reported different estimates of copy number, which could not be compared directly. A similar issue was observed for the drug response. While the GDSC cell lines used IC50 as the response measure, the majority of datasets used other metrics to measure the response. For example, the PDX dataset used tumor volume based on RECIST criteria to define responders and non-responders. Therefore, the lack of standardization on both the input and the output side adds extra challenges to the drug response prediction task.

In this study, we focused on monotherapy and did not explore the effect of drug combinations.

We did not discriminate between driver and passenger events in the somatic mutation and CNA data and treated all of them similarly. In reality, however, the majority of these genomic alterations seem to have no impact on cancer development (Vogelstein et al.) and might appear just by chance. Therefore, in future work, we plan to use another format for these data types to distinguish between potential driver and passenger events.

All of the datasets used suffered from severely unbalanced class distributions, since the number of responders was much smaller than the number of non-responders. We addressed this problem by oversampling the minority class. However, this approach often causes overfitting, particularly for deep neural networks with many parameters. We reduced overfitting with strong regularization, such as a high dropout rate and weight decay. Moreover, using triplets as input to the network increased the number of samples and led to a more stable network, owing to the large number of different combinations for triplets.

As a first investigation of late integration, we did not consider interactions between genes in different omics data types in MOLI or in the compared baselines. In reality, genes do not function in isolation; they work in biological networks and interact with each other. Recently, Ma et al. have shown that incorporating biological domain knowledge from the Gene Ontology leads to more interpretable neural networks with performance comparable to that of purely data-driven neural networks. Therefore, incorporating domain expert knowledge into MOLI via deep neural networks is a promising future direction.
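The combined cost function referred to throughout this discussion, a binary cross-entropy term plus a triplet term, can be sketched in a few lines. This is a minimal illustration and not the released implementation: the squared Euclidean distance, the margin value, the weighting factor `gamma` and all function names are assumptions made for this sketch.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy of one predicted probability against a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def sq_dist(u, v):
    """Squared Euclidean distance between two representations."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive (same class) than to
    the negative (other class) by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def combined_cost(triplets, predictions, gamma=0.5, margin=1.0):
    """Mean classification loss plus gamma times the mean triplet loss.
    triplets: list of (anchor, positive, negative) representations.
    predictions: list of (predicted_probability, label) pairs."""
    cls = sum(bce(p, y) for p, y in predictions) / len(predictions)
    trip = sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)
    return cls + gamma * trip


# Toy example with two-dimensional representations (hypothetical values).
well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
not_separated = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
cost = combined_cost([([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])], [(0.5, 1)])
```

The triplet term only shapes the geometry of the concatenated representation, whereas the cross-entropy term ties that representation to the response labels; a contrastive loss, as mentioned above, could be substituted for `triplet_loss` without changing the rest of the sketch.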

5 Conclusion

In this paper, we proposed MOLI, a method for drug response prediction based on deep neural networks and multi-omics late integration. We trained MOLI on a pan-cancer cell line dataset and successfully validated it on PDX and patient data for five chemotherapy agents and two targeted therapeutics. Our results suggest four major findings: (i) MOLI outperforms single-omics (gene expression) prediction in terms of AUC and area under the precision–recall curve; (ii) MOLI outperforms deep neural networks using early integration in terms of AUC and area under the precision–recall curve; (iii) MOLI with its combined cost function outperforms single- and multi-omics baselines that use only the classification loss; and (iv) MOLI trained on the pan-drug inputs, employing transfer learning, outperforms MOLI trained on drug-specific inputs for targeted therapeutics that target EGFR. Finally, we analyzed the biological significance of MOLI and found substantial evidence that the responses predicted by MOLI have statistically significant associations with the expression levels of numerous genes in the EGFR pathway for TCGA patients with breast, kidney, lung and prostate cancers. In conclusion, our experimental results suggest that MOLI may have a role in precision oncology, from which currently only ∼5% of all patients benefit.

References (38 in total; first 10 listed)

1. Summaries of Affymetrix GeneChip probe level data.

Authors: Rafael A Irizarry; Benjamin M Bolstad; Francois Collin; Leslie M Cope; Bridget Hobbs; Terence P Speed
Journal: Nucleic Acids Res Date: 2003-02-15 Impact factor: 16.971

2. Adjusting batch effects in microarray expression data using empirical Bayes methods.

Authors: W Evan Johnson; Cheng Li; Ariel Rabinovic
Journal: Biostatistics Date: 2006-04-21 Impact factor: 5.899

3. Pattern discovery and cancer gene identification in integrated cancer genomic data.

Authors: Qianxing Mo; Sijian Wang; Venkatraman E Seshan; Adam B Olshen; Nikolaus Schultz; Chris Sander; R Scott Powers; Marc Ladanyi; Ronglai Shen
Journal: Proc Natl Acad Sci U S A Date: 2013-02-21 Impact factor: 11.205

4. The Cancer Genome Atlas Pan-Cancer analysis project.

Authors: John N Weinstein; Eric A Collisson; Gordon B Mills; Kenna R Mills Shaw; Brad A Ozenberger; Kyle Ellrott; Ilya Shmulevich; Chris Sander; Joshua M Stuart
Journal: Nat Genet Date: 2013-10 Impact factor: 38.330

5. Cellular heterogeneity and molecular evolution in cancer.

Authors: Vanessa Almendro; Andriy Marusyk; Kornelia Polyak
Journal: Annu Rev Pathol Date: 2012-10-22 Impact factor: 23.472

6. Cancer genome landscapes.

Authors: Bert Vogelstein; Nickolas Papadopoulos; Victor E Velculescu; Shibin Zhou; Luis A Diaz; Kenneth W Kinzler
Journal: Science Date: 2013-03-29 Impact factor: 47.728

7. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

8. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.

Authors: Jordi Barretina; Giordano Caponigro; Nicolas Stransky; Kavitha Venkatesan; Adam A Margolin; Sungjoon Kim; Christopher J Wilson; Joseph Lehár; Gregory V Kryukov; Dmitriy Sonkin; Anupama Reddy; Manway Liu; Lauren Murray; Michael F Berger; John E Monahan; Paula Morais; Jodi Meltzer; Adam Korejwa; Judit Jané-Valbuena; Felipa A Mapa; Joseph Thibault; Eva Bric-Furlong; Pichai Raman; Aaron Shipway; Ingo H Engels; Jill Cheng; Guoying K Yu; Jianjun Yu; Peter Aspesi; Melanie de Silva; Kalpana Jagtap; Michael D Jones; Li Wang; Charles Hatton; Emanuele Palescandolo; Supriya Gupta; Scott Mahan; Carrie Sougnez; Robert C Onofrio; Ted Liefeld; Laura MacConaill; Wendy Winckler; Michael Reich; Nanxin Li; Jill P Mesirov; Stacey B Gabriel; Gad Getz; Kristin Ardlie; Vivien Chan; Vic E Myer; Barbara L Weber; Jeff Porter; Markus Warmuth; Peter Finan; Jennifer L Harris; Matthew Meyerson; Todd R Golub; Michael P Morrissey; William R Sellers; Robert Schlegel; Levi A Garraway
Journal: Nature Date: 2012-03-28 Impact factor: 49.962

9. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells.

Authors: Wanjuan Yang; Jorge Soares; Patricia Greninger; Elena J Edelman; Howard Lightfoot; Simon Forbes; Nidhi Bindal; Dave Beare; James A Smith; I Richard Thompson; Sridhar Ramaswamy; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Cyril Benes; Ultan McDermott; Mathew J Garnett
Journal: Nucleic Acids Res Date: 2012-11-23 Impact factor: 16.971

10. Cancer biomarkers: are we ready for the prime time?

Authors: Alok Mishra; Mukesh Verma
Journal: Cancers (Basel) Date: 2010-03-22 Impact factor: 6.639
