
A Cross-Level Information Transmission Network for Hierarchical Omics Data Integration and Phenotype Prediction from a New Genotype.

Abstract

MOTIVATION: An unsolved fundamental problem in biology is to predict phenotypes from a new genotype under environmental perturbations. The emergence of multiple omics data provides new opportunities but imposes great challenges in the predictive modeling of genotype-phenotype associations. Firstly, the high-dimensionality of genomics data and the lack of coherent labeled data often make the existing supervised learning techniques less successful. Secondly, it is challenging to integrate heterogeneous omics data from different resources. Finally, few works have explicitly modeled the information transmission from DNA to phenotype, which involves multiple intermediate molecular types. Higher-level features (e.g., gene expression) usually have stronger discriminative and interpretable power than lower-level features (e.g., somatic mutation).
RESULTS: We propose a novel Cross-LEvel Information Transmission network (CLEIT) framework to address the above issues. CLEIT aims to represent the asymmetrical multi-level organization of the biological system by integrating multiple incoherent omics data and to improve the prediction power of low-level features. CLEIT first learns the latent representation of the high-level domain then uses it as ground-truth embedding to improve the representation learning of the low-level domain in the form of contrastive loss. Besides, CLEIT can leverage the unlabeled heterogeneous omics data to improve the generalizability of the predictive model. We demonstrate the effectiveness and significant performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations via the assistance of gene expressions when compared with state-of-the-art methods. CLEIT provides a general framework to model information transmissions and integrate multi-modal data in a multi-level system.
AVAILABILITY: The source code is freely available at https://github.com/XieResearchGroup/CLEIT.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2021 PMID: 34390577 PMCID: PMC8696111 DOI: 10.1093/bioinformatics/btab580

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Advances in next-generation sequencing have generated abundant and diverse omics data. They provide us with unparalleled opportunities to reveal the secrets of biology. An unsolved problem in biology is how to predict observable traits (phenotypes) given a new genetic constitution (genotype) under environmental perturbations. The predictive modeling of genotype-phenotype associations will not only answer fundamental questions in biology but also address urgent needs in biomedicine. A typical application is anti-cancer personalized medicine. Given a new cancer patient’s genetic information, what is the best existing drug to treat this patient? Predicting phenotype from a new genotype is challenging due to the asymmetrical multi-level hierarchical organization of the biological system. Cell-, tissue- and organism-level phenotypes do not arise directly from DNA but hierarchically through multiple intermediate molecular or cellular phenotypes characterized by protein interactions, gene expressions, etc. (Blois, 1984), as illustrated in Figure 1. In other words, in the information transmission process from DNA to RNA to protein to a biological pathway to the observed phenotype of interest, higher-level features (e.g. gene expression) usually have stronger discriminative and interpretable power than lower-level features (e.g. somatic mutation) in a supervised learning task for predicting the phenotype, regardless of the machine learning model applied. This premise is supported by multiple studies such as anti-cancer drug sensitivity prediction (Costello ), cancer drug combination (Menden ), microbiome (Lloyd-Price ) and empirical studies (Chiu ). Therefore, a multi-scale modeling approach is needed to simulate the asymmetrical hierarchical information transmission process for linking the genotype to the phenotype (Hart and Xie, 2016). It will, in turn, improve the interpretability of model predictions and facilitate clinical decisions.
The interpretability of machine learning models is critical for biomedical applications. In principle, the multi-scale modeling of genotype-phenotype associations will help open the black box of machine learning (Yang ). For example, the embedding from the transcriptomics profile, directly or indirectly, can be used to elucidate biological pathways responsible for the synergy of drug combinations (Liu and Xie, 2021). In addition to the above fundamental challenge, the predictive modeling of genotype-phenotype associations faces several technical difficulties that hinder the application of existing machine learning methods. Firstly, omics data are often extremely high-dimensional. Secondly, coherently labeled data are scarce compared with unlabeled data. Finally, it is not a trivial task to integrate heterogeneous omics data from different resources and modalities.

Fig. 1.

Rationale of CLEIT. Cellular phenotypes arise from genotypes via multi-level intermediate molecular types hierarchically from DNA to RNA to protein to biological pathway (blue arrows). The predictive and interpretable power of the DNA-level features for the phenotype is weaker than that of the high-level features such as transcriptome and biological pathways. Instead of predicting the phenotype from the genotype directly by bypassing the intermediate molecular types (gray dashed arrow), we include the information of intermediate molecular types and model the hierarchical organization of a biological system (orange solid arrows).

We develop a novel neural network-based framework, the Cross-LEvel Information Transmission (CLEIT) network, to address the aforementioned challenges. Inspired by domain adaptation techniques, CLEIT first learns to construct the low-dimensional latent representation that encodes signals indicative of tasks at hand from a high-level domain. Then, CLEIT uses the embedding from the high-level domain as ground-truth embedding to regularize the representation learning of the low-level domain in the form of a contrastive loss. In addition, we adopt a pre-training-fine-tuning approach, where pre-training enables the usage of unlabeled heterogeneous omics data to improve the generalizability of CLEIT, while fine-tuning is employed to enable more task-focused predictions given a specific labeled dataset.
As a demonstration of CLEIT’s efficacy in a biological setting, we applied CLEIT to predicting anti-cancer drug sensitivity from somatic mutations. Precision anti-cancer therapy tailored to individual patients based on their genetic profile has gained tremendous interest in the clinic (The American Cancer Society, 2020). Existing studies such as Ben-Hamo and Mucaki focused on inferring drug response based on the most salient mutation signatures. Although the drug response of several successful targeted therapies, e.g. kinase inhibitors, can be predicted from a few driver mutations harbored in patients, the percentage of US patients who can benefit from the targeted therapy is only about 4.9% (Marquart ). The choice of optimal therapy for most cancer patients remains a significant challenge (Adam ). It is well known that cancer acquires numerous mutations during its evolution. Both driver and passenger mutations collectively confer cancer phenotypes and are associated with drug responses (Aparisi ). Thus, in most cases it is necessary to use the entire mutation profile of a cancer to predict anti-cancer drug sensitivity. Machine learning models that explicitly model hierarchical biological processes will facilitate the development of personalized medicine. In particular, we aim to build accurate predictive models solely using mutation data as inputs to mimic the practical clinical setting, where only patient mutation profiles are available for drug screening. In addition, we use a denoising autoencoder as a building block and a pre-training-fine-tuning strategy to integrate noisy and sparse mutation, gene expression and protein-protein interaction data from different resources. Our extensive experiments show that CLEIT significantly outperforms other state-of-the-art methods in this regard.

2 Related work

CLEIT aims to develop a framework that constructs an indicative, knowledge-abundant low-dimensional latent space from the high-dimensional feature space of a domain that lacks salient discriminative information for the tasks of interest. For example, although somatic mutation data undoubtedly possess rich biological information, their sparsity and binary nature often make it extremely challenging to use them to build effective machine learning models for downstream predictive tasks. We resort to a domain adaptation-inspired approach to combat such data limitation issues. Domain adaptation aims to transfer the knowledge gained on the source domain with sufficient labeled data to the target domain without or with limited labeled data when the source and target domains are of different data distributions. In particular, feature-based domain adaptation approaches (Weiss ) have gained popularity along with the advancement in deep learning techniques due to their power in feature representation learning. They aim to learn a shared feature representation by minimizing the discrepancy across different domains while leveraging supervised loss from labeled source domain samples to maintain trait space’s discriminative power. Deep domain confusion (DDC) (Tzeng ) and CORAL (Sun ) focus on exploring proper statistical distribution discrepancy metrics. Domain adversarial neural network (DANN) (Ganin ) and adversarial discriminative domain adaptation (ADDA) (Tzeng ) intend to minimize the distribution difference across domains with adversarial training and generative adversarial network (Goodfellow ), respectively. Moreover, domain separation network (DSN) (Bousmalis ) was proposed to explicitly separate private representations for each domain from shared representations across domains. Although CLEIT borrows some ideas from domain adaptive transfer learning, there is a significant difference between CLEIT and those approaches.
The goal of classic domain adaptation is to use the label information from the source domain data to boost the performance of supervised tasks in the target domain without abundant labels. The feature in the target domain usually has a similar discriminative power to that in the source domain. In our case, by contrast, we focus on resolving the inherent discrepancy in discriminative power between two hierarchically related domains. The feature of the high-level domain has higher discriminative power than that of the low-level domain. Moreover, the entity types of source and target domains are usually the same in conventional domain adaptation. In our case, they are of different types. Specifically, our goal for information transmission is solely to push the latent representation of the low-level domain to approximate that of the high-level domain; that is, the feature representation learned from the high-level domain is fixed and used as the ground-truth feature representation of the low-level domain. In this setting, the latent space where the cross-level information transmission happens is no longer a symmetrical consensus from different domains. The high-level and low-level domains are used as an input and an output, respectively, to boost the discriminative power of the low-level domain. A mapping function is learned between them.
The multi-modal integration of somatic mutation and gene expression data has been utilized to improve the prediction of anti-cancer drug sensitivity, e.g. in Costello and Sharifi-Noghabi . These methods assume that both labeled mutation data and labeled gene expression data are available during training and inference. In addition, they integrate omics data horizontally. In contrast, CLEIT only needs the mutation data as the input during the inference stage. During the training stage, the mutation and gene expression data can come from different data resources and be unlabeled. Thus, CLEIT is more practical than existing methods. Moreover, CLEIT explicitly models the hierarchical, asymmetrical information transmission in a biological system, as shown in Figure 1.

3 Contributions

CLEIT aims to address an important problem of multi-scale modeling of genotype-phenotype associations. The major contributions of this research are summarized as follows. (i) We propose a novel neural network framework that can explicitly model asymmetrical cross-level information transmissions in a complex system to boost the discriminative power of the low-level domain. The multi-level hierarchical structure is a fundamental characteristic of biological and ecological systems. The proposed architecture is general and can be applied to various machine learning tasks where two domains have different features. (ii) The proposed neural network framework provides a new approach to integrating multiple omics data vertically to represent the multi-level organization of a biological system. The integration of mutation, gene expression and protein-protein interaction data from different resources can help to address the heterogeneity problem. (iii) We design a pre-training-fine-tuning strategy to fully utilize both labeled and unlabeled omics data that are naturally noisy, high-dimensional and sparse. In particular, the incorporation of the autoencoder alleviates the high-dimensionality challenge of omics data and brings in denoising effects. Furthermore, the effective usage of unlabeled data addresses the scarcity of labeled data. (iv) In terms of biomedical application, CLEIT significantly improves personalized anti-cancer drug sensitivity prediction using only somatic mutation data. To the best of our knowledge, CLEIT is the first deep learning-based framework designed to perform drug sensitivity prediction tasks solely on whole-genome somatic mutation profiles, and it achieves performance comparable to that of the model trained on gene expression profiles. Oncology panels of somatic mutations are routinely used in cancer treatment. The application of CLEIT may improve the effectiveness of cancer treatment and advance personalized medicine.

4 Materials and methods

4.1 Problem formulation

The problem that we are interested in is to predict the phenotype of interest (e.g. cell viability following drug treatments) of a cell from its mutation profile. Due to the multi-level hierarchical organization of a biological system, the RNA-level gene expression profile can achieve superior performance to DNA-level mutation data for predicting phenotypes, independent of the machine learning models applied to them. Here, the performance difference is due to the nature of each data domain, instead of the volume of labeled samples as in a classical domain adaptation setting. However, although the feature spaces of the DNA and RNA domains are not the same, the entities across the feature spaces are hierarchically related, i.e. RNA conveys the information stored in DNA. Based on this realization, this work aims to utilize the knowledge learned from the gene expression data to boost the predictive power of the mutation profile. In other words, we want to achieve, using only the mutation data as features, a prediction performance similar to that obtained with the gene expression data. Formally, we denote a data domain as $D = \{\mathcal{X}, P(X)\}$, where $\mathcal{X}$ stands for the feature space, $X = \{x_1, \ldots, x_n\}$ with $x_i \in \mathcal{X}$ are the samples within domain $D$, and $P(X)$ is the affiliated marginal distribution. In this work, we consider two domains $D_h$ and $D_l$, namely the high-level domain and low-level domain, where $\mathcal{X}_h \neq \mathcal{X}_l$. In our benchmark experiments, the gene expression is used as $D_h$, while the somatic mutation is specified as $D_l$.

4.2 CLEIT framework

To use the knowledge learned from the high-level domain $D_h$ to boost the performance of the low-level domain $D_l$, we propose a Cross-LEvel Information Transmission (CLEIT) framework. The strategy of CLEIT is to encode the data from both domains into latent features. The embedded latent features are directly relevant to the task of interest, and cross-level information transmission is achieved by transferring knowledge across domains via the learned representations. Figure 2 shows the overall framework of CLEIT. The training of CLEIT involves five steps: (i) learning an embedding of $D_h$ from unlabeled data using a standard autoencoder (AE) (Hinton and Zemel, 1994), (ii) fine-tuning the pre-trained embedding of $D_h$ from step 1 using a multi-layer perceptron (MLP) in the setting of multi-task supervised learning, (iii) and (iv) learning an embedding of $D_l$ from unlabeled data using an AE along with the embedding regularization between $D_l$ samples and corresponding $D_h$ samples in the form of an MLP-based transmitter training, and (v) supervised learning of the final predictive model of $D_l$ using an architecture that appends the pre-trained multi-task MLP (as a warm start) from step 2 to the pre-trained AE encoder and the transmitter of $D_l$ from steps 3 and 4. We denote unlabeled $D_h$ samples as and labeled samples as , where stands for the number of samples in corresponding datasets. Furthermore, is used to symbolize the latent vectors (embeddings) learned in different phases throughout the training. Samples from $D_l$ are similarly denoted. For details of methods, see ‘Supplementary Method’ in Supplementary Material.
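The embedding regularization of steps 3 and 4 can be illustrated with a small, dependency-free sketch. The exact form of the contrastive loss is given in the Supplementary Material and is not reproduced here, so the InfoNCE-style formulation, the cosine similarity, the `temperature` value and the function names below are all assumptions made for illustration. The one property taken from the text is the asymmetry: the high-level embeddings are treated as fixed targets, and only the transmitted low-level embeddings would be updated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(z_low, z_high, temperature=0.5):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    z_low:  transmitted low-level embeddings, one per sample.
    z_high: fixed high-level embeddings of the same samples (the targets).
    For sample i, z_high[i] is the positive and every other z_high[j]
    in the batch acts as a negative.
    """
    total = 0.0
    for i, z in enumerate(z_low):
        logits = [cosine(z, h) / temperature for h in z_high]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]  # negative log-softmax of the positive
    return total / len(z_low)

# Toy embeddings: "aligned" rows point roughly at their own targets,
# "shuffled" pairs each row with the wrong target.
z_high = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

print(contrastive_loss(aligned, z_high))
print(contrastive_loss(shuffled, z_high))
```

In this toy setting the loss is lower when each low-level embedding sits close to the high-level embedding of the same sample, which is the behavior the transmitter training is meant to encourage.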

Fig. 2.

CLEIT Framework. The training of CLEIT involves five steps. First, the encoder of the high-level domain $D_h$ is learned from an autoencoder and fine-tuned by a supervised multi-task MLP in steps 1 and 2. Then, the embedding of the low-level domain $D_l$ is encoded from an autoencoder in step 3, and the difference between it and that of $D_h$ is minimized via an MLP transmitter in step 4, as measured by the contrastive loss. In step 5, the supervised model of $D_l$ is fine-tuned from a model that combines the pre-trained multi-task MLP of $D_h$ from step 2 with the regularized encoder of $D_l$ from step 3.

5 Experiments

5.1 Experiment set-up

Datasets

We evaluate the performance of CLEIT on a real-world problem: predicting anti-cancer drug sensitivity given the mutation profile of cell lines. The mutation profile (oncology panel) has been implemented in the clinic but has weaker discriminative power for drug sensitivity prediction than the gene expression profile, which is not yet a clinical standard. During the training stage, we use unlabeled mutation and gene expression data for unsupervised pre-training, and a small set of labeled data for supervised training. During the testing stage, we only use mutation data from cell lines that do not overlap with those used in the training stage to evaluate the performance. Specifically, we collected and integrated data from several diverse resources: cancer cell line data from CCLE (Ghandi ), pan-cancer data from TCGA (Goldman ), drug sensitivity data from GDSC (Yang ) and gene-gene interactions from STRING (Szklarczyk ). CCLE includes 1305 and 1697 cancer cell line samples with the gene expression profile and the somatic mutation profile, respectively. The pan-cancer datasets include 9808 and 9093 tumor samples with the gene expression profile and the somatic mutation profile, respectively. Moreover, we only keep the mutation profiles of samples with matched gene expression profiles in our unlabeled mutation dataset. All gene expression data are measured in standard transcripts per million (TPM) for each gene, with an additional log transformation. For the somatic mutation data, we kept only genes with non-silent mutations and encoded each sample as a binary-valued sparse vector. Furthermore, we applied pyNBS (Huang ), a random walk with restart algorithm, to transform the binary-valued mutation profile into continuous-valued features by performing mutation score propagation on the STRING gene-gene interaction network. The network-regularized mutation profile not only reduces the sparsity of features but also significantly boosts its prediction power (Huang ).
We selected the 1000 most variable genes, as measured by the percentage of unique values in gene expression samples, for cancer cell lines and tumor tissue samples separately. The use of 2000 genes achieves the best prediction performance, but is not significantly different from the use of 1000 genes (Supplementary Table S1). Then we combined the two sets of 1000 genes as the input features. The union has 1424 unique genes in total. In addition, we only kept the genes present in the mutation profiles as our final raw feature set, although CLEIT does not require it. We did so for a fair comparison with other domain adaptation methods, since all other methods in the comparison contain a shared encoder component that requires the same number of input features across domains. The final feature set consists of 1407 genes. Furthermore, we matched the omics data of CCLE cell lines against the GDSC drug sensitivity score measured by the Area Under Drug Response Curve (AUC), which is presented as the fraction of the total area under the drug response curve between the highest and lowest screening concentration in GDSC (Yang ). The AUC, a continuous-valued drug sensitivity measurement, is used across our experiments as the dependent variable for the supervised fine-tuning. In total, we assembled 680 CCLE cell lines with both mutation and gene expression data, which are associated with 93 anti-cancer drugs after removing drugs that have more than 10% missing drug sensitivity measurements within these cell line samples. These 680 cell lines and 59 203 drug sensitivity measurements were used as training data in the fine-tuning stage. An additional 278 non-overlapping cell lines that have only mutation information were used as hold-out testing data in our study. By combining the TCGA and CCLE datasets, 11 113 gene expression samples and 9743 somatic mutation samples that do not have measured drug sensitivities were used as unlabeled data in the pre-training stage. The gene expression profile is considered as the high-level domain $D_h$, while the mutation profile is the low-level domain $D_l$. A summary of the pre-processed data is shown in Table 1.
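The mutation score propagation step mentioned above (a random walk with restart over the gene-gene interaction network) can be sketched on a toy graph. This is a generic illustration of the technique and does not use the pyNBS API; the function name, the degree-based normalization, the `restart` probability and the four-gene network are hypothetical choices made for the example.

```python
def propagate(adjacency, mutations, restart=0.3, tol=1e-8, max_iter=1000):
    """Random walk with restart on an undirected gene network.

    adjacency: dict gene -> set of neighbouring genes (toy input).
    mutations: dict gene -> 0/1 binary mutation indicator.
    Returns a dict gene -> continuous propagated score.
    """
    genes = sorted(adjacency)
    f0 = {g: float(mutations.get(g, 0)) for g in genes}
    f = dict(f0)
    for _ in range(max_iter):
        new = {}
        for g in genes:
            # Each neighbour shares its score equally among its own neighbours.
            inflow = sum(f[n] / len(adjacency[n]) for n in adjacency[g])
            new[g] = (1.0 - restart) * inflow + restart * f0[g]
        delta = max(abs(new[g] - f[g]) for g in genes)
        f = new
        if delta < tol:
            break
    return f

# Toy path network A - B - C - D with a single mutated gene A.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = propagate(network, {"A": 1})
for gene in sorted(scores):
    print(gene, round(scores[gene], 4))
```

After propagation every gene in the connected network carries a non-zero score that decays with distance from the mutated gene, which is how a sparse binary profile becomes a dense continuous-valued one.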

Table 1.

Summary of pre-processed data for training and testing

Category	Unlabeled (pre-training)	Labeled (fine-tuning)	Labeled (test)
Gene Expression (#samples)	11 113	680	NA
Somatic Mutation (#samples)	9743	680	278
Drug Sensitivity (#cell line-drug pairs)	NA	59 203	23 475


Training, validation and testing procedure

To demonstrate CLEIT’s stable performance in the given anti-cancer drug sensitivity prediction task, we repeated the model training five times. First, we split the labeled dataset that has both gene expression and mutation profiles into five folds. Then, in each repetition, we used four of the five folds as the labeled training set and the remaining fold as the validation set. The detailed training procedure of CLEIT is summarized as follows. In the pre-training of the high-level domain $D_h$ (gene expression), we trained CLEIT for N epochs, where N is selected by a parameter grid search based on the target task performance. For the fine-tuning of $D_h$, we employed early stopping on the labeled validation fold (gene expression only) described above. For the pre-training of the low-level domain $D_l$ (mutation), as in the pre-training of $D_h$, we specified the number of epochs based on the task-specific performance. In the fine-tuning of $D_l$, we employed early stopping on the same validation fold (mutation only) as in the fine-tuning of $D_h$. The final trained model is used to make predictions on a labeled mutation-only test set. All baseline models followed the same training and testing procedure.
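The fold rotation described above can be sketched as follows. The text does not state how cell lines were assigned to folds, so the random shuffling, the fixed `seed` and the function name are assumptions; only the five-fold layout and the 680 labeled cell lines come from the text.

```python
import random

def five_fold_splits(sample_ids, n_folds=5, seed=0):
    """Yield (train, validation) lists, rotating the held-out fold."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # assumed: simple random assignment
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, valid

# 680 labeled cell lines, identified here by integer index.
splits = list(five_fold_splits(range(680)))
for train, valid in splits:
    print(len(train), len(valid))
```

The same split would be reused for both domains, so that early stopping for gene expression and for mutation is evaluated on the same validation cell lines.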

Performance evaluation

We evaluated CLEIT’s performance by predicting drug sensitivity on hold-out labeled mutation-only test data. We measured the regression performance using Pearson correlation, Spearman correlation and root mean squared error (RMSE). Note that there is a maximum of 93 drug sensitivity scores associated with each cell line sample. The results are reported as the average performance per cell line sample (sample-wise) and per drug (drug-wise). Because the ground-truth matrix is incomplete, prediction entries without a ground-truth sensitivity score are filtered out in the calculation of each evaluation metric.
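The difference between the sample-wise and drug-wise averages, and the filtering of missing ground-truth entries, can be made concrete with a small sketch. The matrix layout (rows are cell lines, columns are drugs), the use of `None` for missing scores, the toy values and the function names are assumptions for illustration; Pearson correlation is shown, and the other metrics would follow the same masking pattern.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_pearson(truth, pred, axis):
    """Average Pearson per row (axis='sample') or per column (axis='drug')."""
    if axis == "drug":
        truth, pred = list(zip(*truth)), list(zip(*pred))
    scores = []
    for t_vec, p_vec in zip(truth, pred):
        # Drop entries that have no ground-truth sensitivity score.
        pairs = [(t, p) for t, p in zip(t_vec, p_vec) if t is not None]
        if len(pairs) < 2:
            continue
        ts, ps = zip(*pairs)
        scores.append(pearson(ts, ps))
    return sum(scores) / len(scores)

# Toy AUC matrices: 3 cell lines x 3 drugs, None marks a missing measurement.
truth = [[0.9, 0.5, None], [0.8, None, 0.2], [0.7, 0.3, 0.1]]
pred = [[0.8, 0.6, 0.4], [0.7, 0.5, 0.3], [0.9, 0.2, 0.1]]

print(average_pearson(truth, pred, "sample"))
print(average_pearson(truth, pred, "drug"))
```

In this toy example the sample-wise average is much higher than the drug-wise one, because ranking drugs within a cell line is easier here than ranking cell lines within a drug; the two views can therefore tell quite different stories about the same predictions.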

5.2 Baseline models

We compared CLEIT with the following baseline models: an MLP without and with AE pre-training for the low-level domain $D_l$ (mutation), as well as several of the most popular domain adaptation algorithms, used here to transfer the knowledge learned from the high-level domain $D_h$ (gene expression) to $D_l$. They include Deep Domain Confusion (DDC) network (Tzeng ), Correlation Alignment (CORAL, Sun ), Domain Adversarial Neural Network (DANN, Ganin ), Adversarial Discriminative Domain Adaptation (ADDA, Tzeng ) and Domain Separation Network (DSN, Bousmalis ). We used the original architecture of each baseline model but exactly the same features, training/testing data and procedure as CLEIT, so that the performance comparison is fair. Specifically, for DDC, CORAL, DANN and ADDA, we followed their original approach and combined their respective domain adaptation objectives with drug response prediction in the supervised training (same dataset used in CLEIT fine-tuning). For DSN, we employed its MMD variant for the stability of training and adopted the same pre-training-fine-tuning process as used in CLEIT. In the pre-training, we leveraged unlabeled data from both domains to pre-train the encoders with an autoencoder reconstruction task and its domain adaptation objective. In the fine-tuning, we further adapted the encoder and the appended predictor with the labeled drug response dataset. To evaluate the contribution of different components in CLEIT, we performed ablation studies by (i) removing the unlabeled pre-training process and incorporating the cross-level transmission loss into the labeled training, (ii) removing the transmitter, and (iii) changing the cross-level transmission loss function to Maximum Mean Discrepancy (MMD) loss (Gretton ) or Earth Mover distance approximated using Wasserstein-GAN (WGAN) (Gulrajani ). The latent dimension of the hidden representation for all models is set to 128, and all autoencoder frameworks share the same [512, 256, 128, 256, 512] architecture. In addition, all pre-trained encoders are appended with a predictor module of the same architecture ([128] shared layer + [64,32] individual drug MLP) for the fine-tuning process.
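One way to read the layer specification above is as a list of dense-layer weight shapes. The sketch below does this using the 1407 input genes and 93 drugs reported in the Datasets section; the mirrored decoder output, the scalar output of each drug head and the helper name `dense_shapes` are assumptions, since the text lists only the hidden widths.

```python
def dense_shapes(input_dim, widths):
    """Return (in, out) weight shapes for a stack of dense layers."""
    dims = [input_dim] + list(widths)
    return list(zip(dims[:-1], dims[1:]))

N_GENES = 1407  # final feature set size reported in the Datasets section
N_DRUGS = 93    # drugs kept after filtering

# Autoencoder [512, 256, 128, 256, 512]: the 128-unit layer is the latent code.
encoder = dense_shapes(N_GENES, [512, 256, 128])
decoder = dense_shapes(128, [256, 512, N_GENES])  # assumed: reconstructs the input

# Predictor: one shared [128] layer, then one [64, 32] MLP per drug.
shared = dense_shapes(128, [128])
drug_head = dense_shapes(128, [64, 32, 1])        # assumed: scalar AUC output
predictor_heads = {drug: drug_head for drug in range(N_DRUGS)}

print("encoder  ", encoder)
print("decoder  ", decoder)
print("shared   ", shared)
print("drug head", drug_head, "x", len(predictor_heads))
```

Because every model in the comparison uses the same latent width and predictor layout, differences in the reported metrics can be attributed to the training objective rather than to model capacity.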

6 Results and discussion

6.1 Gene expression feature has stronger predictive power than somatic mutation-based feature

Consistent with extensive performance evaluations from blind tests in a DREAM challenge (Costello ), and other studies (Chiu ), the gene expression feature has stronger predictive power than the mutation-based feature. As shown in Supplementary Figure S1(a), the model trained with only labeled gene expression data has a 6.45% performance gain over the model trained with the corresponding labeled somatic mutation data when evaluated using a sample-wise average. With the additional utilization of unlabeled pre-training, models trained with only gene expression data and only mutation data both showed slightly better performance, while their performance gap is around 6.8%. In terms of the drug-wise average, as shown in Supplementary Figure S1(b), the performance gap between models built on mutation-only and expression-only data is even more apparent. The multi-modal learning that combines the mutation and gene expression features fails to improve the performance (Supplementary Fig. S1). These results confirm that gene expression is more predictive than somatic mutation for anti-cancer drug sensitivity.

6.2 CLEIT can transfer the knowledge learned from gene expression features to the model with mutation features

To demonstrate that CLEIT can transfer the knowledge learned from the gene expression feature to the model that uses the mutation-only data, we compared the drug-wise Pearson correlation distribution of CLEIT with those of the MLP+AE models trained with only gene expression or mutation data. Figure 3 shows the histogram of Pearson correlations of 93 drugs for the three models. CLEIT using the mutation data shifts the performance distribution close to that of the model trained using the gene expression data, with a False Discovery Rate (FDR) of 1.0 based on the Kolmogorov-Smirnov (KS) test, i.e. the two distributions are not distinguishable. It significantly outperforms the MLP+AE model using the mutation data (FDR = 3.47e-37 of KS test). This is aligned with our primary goal in this work. Note that the histograms in Figure 3 were from the validation data. Next, we evaluate the performance of CLEIT on hold-out mutation-only test data.

Fig. 3.

Drug-wise Pearson correlation on validation dataset

6.3 CLEIT significantly outperforms state-of-the-art models to predict anti-cancer drug sensitivity using mutation-only data

Given that gene expression data have stronger predictive power than somatic mutation data, we evaluate whether CLEIT can use the gene expression to boost the performance for predicting anti-cancer drug sensitivity when only the somatic mutation data are available as the input. The results for both drug-wise and sample-wise evaluation are shown in Tables 2 and 3. As seen in those tables, models that include unlabeled pre-training generally outperform the models trained with only labeled data, indicating the importance of leveraging unlabeled data. The models trained with domain adaptation methods with unlabeled pre-training (DSN or CLEIT variants) or only labeled training outperform their non-domain adaptation counterparts. This implies that the low-level domain $D_l$ benefits from the knowledge transfer from the high-level domain $D_h$. Furthermore, CLEIT models significantly outperform all other models in consideration (t-test P-value < 0.05). The best-performing model is the CLEIT that uses the contrastive loss. Compared with the best-performing state-of-the-art model (DSN), the Pearson correlation of CLEIT improves by about 176% in the drug-wise test (0.2770 versus 0.1003, i.e. about 2.8 times the DSN value) and by 2.2% in the sample-wise test. Similar results can be seen in terms of Spearman correlation and RMSE. The performance gain of CLEIT over MLP and MLP+AE is 3.4% and 2.5%, respectively, in the sample-wise setting. In the drug-wise setting, the gain widens to about 369% and 307% (0.2770 versus 0.0591 and 0.0681, i.e. 469% and 407% of the respective baseline values). The much improved drug-wise performance achieved by CLEIT indicates a much higher quality of drug sensitivity prediction with the mutation-only data.
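The relative changes discussed in this section can be recomputed from the Pearson columns of Tables 2 and 3. The short sketch below makes the arithmetic explicit and keeps the relative gain, (new - old) / old, separate from the plain ratio new / old, since the two differ by exactly 100 percentage points and are easy to confuse. The function names are illustrative.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

def ratio_percent(new, old):
    """`new` expressed as a percentage of `old`."""
    return 100.0 * new / old

# Mean Pearson correlations copied from Table 2 (drug-wise) and Table 3 (sample-wise).
drug_wise = {"MLP": 0.0591, "MLP+AE": 0.0681, "DSN": 0.1003, "CLEIT": 0.2770}
sample_wise = {"MLP": 0.7390, "MLP+AE": 0.7450, "DSN": 0.7470, "CLEIT": 0.7640}

for name, table in (("drug-wise", drug_wise), ("sample-wise", sample_wise)):
    for baseline in ("MLP", "MLP+AE", "DSN"):
        gain = relative_gain(table["CLEIT"], table[baseline])
        ratio = ratio_percent(table["CLEIT"], table[baseline])
        print(f"{name:11s} CLEIT vs {baseline:6s}: gain {gain:6.1f}%  ratio {ratio:6.1f}%")
```

These figures use the rounded means printed in the tables, so they may differ in the last digit from values computed on unrounded results.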

Table 2.

Evaluation results on test data (drug-wise)

Method	Pearson	Spearman	RMSE
MLP (mutation-only)	0.0591 ± 0.0069	0.0532 ± 0.0066	0.0233 ± 0.0018
MLP+AE (mutation-only)	0.0681 ± 0.0085	0.0629 ± 0.0108	0.0151 ± 0.0001
DDC	0.0633 ± 0.0087	0.0621 ± 0.0087	0.0150 ± 0.0006
CORAL	0.0580 ± 0.0105	0.0542 ± 0.0080	0.0164 ± 0.0005
DANN	0.0571 ± 0.0061	0.0516 ± 0.0038	0.0173 ± 0.0010
ADDA	0.0681 ± 0.0111	0.0685 ± 0.0142	0.0197 ± 0.0010
DSN	0.1003 ± 0.0186	0.0915 ± 0.0252	0.0147 ± 0.0007
CLEIT (w/o pre-training)	0.1005 ± 0.0236	0.0924 ± 0.0216	0.0147 ± 0.0005
CLEIT (w/o transmitter)	0.2587 ± 0.0126	0.2254 ± 0.0348	0.0124 ± 0.0006
CLEIT (MMD)	0.1758 ± 0.0086	0.1421 ± 0.0200	0.0148 ± 0.0009
CLEIT (WGAN)	0.0795 ± 0.0083	0.0821 ± 0.0106	0.0150 ± 0.0009
CLEIT	0.2770 ± 0.0086	0.2482 ± 0.0243	0.0121 ± 0.0006

Note: The best results are shown in bold.

Table 3.

Evaluation results on test data (sample-wise)

Method	Pearson	Spearman	RMSE
MLP (mutation-only)	0.7390 ± 0.0017	0.6957 ± 0.0022	0.0235 ± 0.0017
MLP+AE (mutation-only)	0.7450 ± 0.0003	0.6984 ± 0.0004	0.0150 ± 0.0001
DDC	0.7449 ± 0.0017	0.7010 ± 0.0010	0.0151 ± 0.0004
CORAL	0.7439 ± 0.0013	0.7002 ± 0.0010	0.0165 ± 0.0004
DANN	0.7428 ± 0.0017	0.6995 ± 0.0019	0.0174 ± 0.0008
ADDA	0.7315 ± 0.0053	0.6891 ± 0.0010	0.0199 ± 0.0008
DSN	0.7470 ± 0.0002	0.7024 ± 0.0004	0.0148 ± 0.0004
CLEIT (w/o pre-training)	0.7467 ± 0.0003	0.7023 ± 0.0004	0.0149 ± 0.0004
CLEIT (w/o transmitter)	0.7569 ± 0.0081	0.7172 ± 0.0070	0.0125 ± 0.0005
CLEIT (MMD)	0.7443 ± 0.0018	0.7003 ± 0.0009	0.0147 ± 0.0009
CLEIT (WGAN)	0.7465 ± 0.0005	0.7022 ± 0.0008	0.0152 ± 0.0009
CLEIT	0.7640 ± 0.0094	0.7233 ± 0.0063	0.0122 ± 0.0005

Note: The best results are shown in bold.

CLEIT models that incorporate the MLP transmission function perform significantly better than those without it, suggesting that the transmission function plays a role in CLEIT. The choice of loss function for the information transmission is also important: contrastive loss clearly performs better than MMD and WGAN. Note that MMD is also used in DSN. When CLEIT uses MMD as the loss function to measure the domain discrepancy, the major difference between CLEIT and DSN is that CLEIT treats the information transmission between the two domains as asymmetrical, whereas DSN treats domain adaptation as symmetrical. The results in Tables 2 and 3 show that CLEIT (MMD) outperforms DSN in the drug-wise setting and performs similarly in the sample-wise setting. This indicates that explicitly modeling the hierarchical organization of the two domains (somatic mutation and gene expression) is important.
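To make the contrast between loss functions concrete, the following is a minimal sketch, not the authors' implementation. It assumes an RBF-kernel MMD estimate and an InfoNCE-style contrastive loss over paired embeddings. Neither form is specified in this section, and all function names, the toy embeddings, and the hyperparameters `gamma` and `temperature` are hypothetical. It illustrates one property that distinguishes the two: MMD compares distributions and ignores which sample is paired with which, whereas a contrastive loss rewards matching each low-level embedding to the high-level embedding of the same sample.

```python
import math
import random


def rbf(a, b, gamma=1.0):
    """RBF kernel between two vectors."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    return (mean_kernel(xs, xs, gamma) + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(preds, targets, temperature=0.1):
    """InfoNCE-style loss: preds[i] should be closest to targets[i]."""
    total = 0.0
    for i, p in enumerate(preds):
        logits = [cosine(p, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(preds)


# Toy data (assumption): 8 samples with 4-dimensional embeddings.
random.seed(0)
expr = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
mut_aligned = [[v + 0.05 * random.gauss(0, 1) for v in row] for row in expr]
mut_shuffled = mut_aligned[1:] + mut_aligned[:1]  # same set, wrong pairing

print("MMD^2       aligned vs shuffled:",
      mmd2(mut_aligned, expr), mmd2(mut_shuffled, expr))
print("contrastive aligned vs shuffled:",
      contrastive_loss(mut_aligned, expr), contrastive_loss(mut_shuffled, expr))
```

In this toy setting the MMD estimate is unchanged when the pairing is shuffled, while the contrastive loss increases. Whether this accounts for the gap reported in Tables 2 and 3 is not established by the sketch.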

6.4 CLEIT outperforms state-of-the-art methods for predicting top-ranked cell-line specific anti-cancer therapies

Furthermore, CLEIT can predict the best therapy for a new patient using only mutation data, which is relevant to personalized medicine. We compared the methods by the precision of their top-k (k = 1, 10) predictions ranked by AUC score. For each cell line, this precision is the fraction of the k drugs with the smallest predicted scores that are also among the k top-ranked drugs by ground-truth score. Mutation-only test results are shown in Figure 4. CLEIT clearly outperforms the other models in this scenario as well. Compared with the second-best model, DSN, CLEIT improves the performance by approximately 40% when k = 1.
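The top-k precision described above can be sketched as follows. This is an illustrative reading of the definition, not the authors' evaluation code. It assumes that "top-k" means the k smallest AUC values for both predictions and ground truth, that every cell line has a score for every drug, and that the result is averaged over cell lines. The function name and the toy scores are hypothetical.

```python
def top_k_precision(y_true, y_pred, k):
    """Mean over cell lines of |top-k predicted drugs ∩ top-k true drugs| / k.

    y_true and y_pred are lists of rows (one row per cell line, one score
    per drug); "top" is taken to mean the smallest AUC score.
    """
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        drugs = range(len(true_row))
        true_top = set(sorted(drugs, key=lambda d: true_row[d])[:k])
        pred_top = set(sorted(drugs, key=lambda d: pred_row[d])[:k])
        hits += len(true_top & pred_top)
    return hits / (k * len(y_true))


# Toy example: 2 cell lines, 4 drugs (made-up scores).
y_true = [[0.2, 0.9, 0.5, 0.7],
          [0.8, 0.1, 0.6, 0.3]]
y_pred = [[0.3, 0.8, 0.4, 0.9],
          [0.2, 0.7, 0.5, 0.4]]

print(top_k_precision(y_true, y_pred, k=1))  # 0.5
print(top_k_precision(y_true, y_pred, k=2))  # 0.75
```

With k = 1 the metric reduces to the fraction of cell lines for which the single best predicted drug is also the best drug by ground truth.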

Fig. 4.

Top K Precision on Mutation-only Test Dataset

7 Conclusion

This article proposed a novel machine learning framework, CLEIT, for the predictive modeling of genotype-phenotype associations by explicitly modeling the asymmetric cross-level information transmission in the biological system. Using anti-cancer drug sensitivity prediction from mutation data alone as a benchmark, CLEIT clearly outperforms existing methods and demonstrates its potential in personalized medicine. Although we study only the knowledge transfer between the DNA level and the RNA level in this article, the same strategy can be applied to other levels of the biological system, for example, imputing proteomics data from transcriptomics data. Nevertheless, the performance of CLEIT could be further improved by incorporating domain knowledge. For example, an autoencoder module that models gene-gene interactions and biological pathways would be very helpful. Within the CLEIT framework, it is straightforward to integrate other omics data such as epigenomics and proteomics, which may further improve performance. Another challenge in personalized medicine is to transfer knowledge from cell line data to patient tissue data (He and Xie, 2021). It will be interesting to develop new neural network architectures within the CLEIT framework to address this problem.
