Abstract
MOTIVATION: Predicting phenotypes from a new genotype under environmental perturbations remains an unsolved fundamental problem in biology. The emergence of multi-omics data provides new opportunities but poses great challenges for the predictive modeling of genotype-phenotype associations. First, the high dimensionality of genomics data and the lack of coherent labeled data often make existing supervised learning techniques less successful. Second, it is challenging to integrate heterogeneous omics data from different resources. Finally, few works have explicitly modeled the information transmission from DNA to phenotype, which involves multiple intermediate molecular types. Higher-level features (e.g., gene expression) usually have greater discriminative power and interpretability than lower-level features (e.g., somatic mutation).
Year: 2021 PMID: 34390577 PMCID: PMC8696111 DOI: 10.1093/bioinformatics/btab580
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Rationale of CLEIT. Cellular phenotypes arise from genotypes via multi-level intermediate molecular types, organized hierarchically from DNA to RNA to protein to biological pathway (blue arrows). The predictive and interpretive power of DNA-level features for the phenotype is weaker than that of high-level features such as the transcriptome and biological pathways. Instead of predicting the phenotype directly from the genotype and bypassing the intermediate molecular types (gray dashed arrow), we include the information of the intermediate molecular type and model the hierarchical organization of a biological system (orange solid arrows).
Fig. 2. CLEIT framework. The training of CLEIT involves five steps. First, the encoder of D is learned from an autoencoder and fine-tuned by a supervised multi-task MLP in steps 1 and 2. Then, the embedding of D is encoded from an autoencoder in step 3, and the difference between it and that of D is minimized via an MLP transmitter in step 4, as measured by contrastive loss. In step 5, the supervised model of D is fine-tuned by the model that appends the pre-trained multi-task MLP of D in step 2 and the regularized encoder of D in step 3.
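The alignment in step 4 can be illustrated with a small sketch. The caption only states that the discrepancy between the two embeddings is measured by a contrastive loss; the InfoNCE-style form below, the function name `info_nce_loss`, the temperature value, and the random toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def info_nce_loss(z_src, z_tgt, temperature=0.1):
    """InfoNCE-style contrastive loss between two paired embedding batches.

    Row i of z_src and row i of z_tgt form a positive pair; every other
    row of z_tgt acts as a negative for row i. (Assumed form of the loss.)
    """
    # L2-normalize so that dot products are cosine similarities.
    z_src = z_src / np.linalg.norm(z_src, axis=1, keepdims=True)
    z_tgt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)

    logits = z_src @ z_tgt.T / temperature                # shape (n, n)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical toy embeddings: 8 paired samples, 16 dimensions each.
rng = np.random.default_rng(0)
target_embedding = rng.normal(size=(8, 16))                             # reference embedding
aligned_embedding = target_embedding + 0.01 * rng.normal(size=(8, 16))  # well-aligned output
unaligned_embedding = rng.normal(size=(8, 16))                          # unrelated output

loss_aligned = info_nce_loss(aligned_embedding, target_embedding)
loss_unaligned = info_nce_loss(unaligned_embedding, target_embedding)
print(loss_aligned, loss_unaligned)
```

In this sketch the loss is small when each transmitted embedding is closest to its own paired target and large when the pairing carries no signal, which is the property a transmitter trained against such a loss would exploit.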
Summary of pre-processed data for training and testing
| Category | Unlabeled (pre-training) | Labeled (fine-tuning) | Labeled (test) |
|---|---|---|---|
| Gene Expression (#samples) | 11 113 | 680 | NA |
| Somatic Mutation (#samples) | 9743 | 680 | 278 |
| Drug Sensitivity (#cell line-drug pairs) | NA | 59 203 | 23 475 |
Fig. 3. Drug-wise Pearson correlation on validation dataset
Evaluation results on test data (drug-wise)
| Method | Pearson | Spearman | RMSE |
|---|---|---|---|
| MLP (mutation-only) | 0.0591 ± 0.0069 | 0.0532 ± 0.0066 | 0.0233 ± 0.0018 |
| MLP+AE (mutation-only) | 0.0681 ± 0.0085 | 0.0629 ± 0.0108 | 0.0151 ± 0.0001 |
| DDC | 0.0633 ± 0.0087 | 0.0621 ± 0.0087 | 0.0150 ± 0.0006 |
| CORAL | 0.0580 ± 0.0105 | 0.0542 ± 0.0080 | 0.0164 ± 0.0005 |
| DANN | 0.0571 ± 0.0061 | 0.0516 ± 0.0038 | 0.0173 ± 0.0010 |
| ADDA | 0.0681 ± 0.0111 | 0.0685 ± 0.0142 | 0.0197 ± 0.0010 |
| DSN | 0.1003 ± 0.0186 | 0.0915 ± 0.0252 | 0.0147 ± 0.0007 |
| CLEIT (w/o pre-training) | 0.1005 ± 0.0236 | 0.0924 ± 0.0216 | 0.0147 ± 0.0005 |
| CLEIT (w/o transmitter) | 0.2587 ± 0.0126 | 0.2254 ± 0.0348 | 0.0124 ± 0.0006 |
| CLEIT (MMD) | 0.1758 ± 0.0086 | 0.1421 ± 0.0200 | 0.0148 ± 0.0009 |
| CLEIT (WGAN) | 0.0795 ± 0.0083 | 0.0821 ± 0.0106 | 0.0150 ± 0.0009 |
Note: The best results are shown in bold.
Evaluation results on test data (sample-wise)
| Method | Pearson | Spearman | RMSE |
|---|---|---|---|
| MLP (mutation-only) | 0.7390 ± 0.0017 | 0.6957 ± 0.0022 | 0.0235 ± 0.0017 |
| MLP+AE (mutation-only) | 0.7450 ± 0.0003 | 0.6984 ± 0.0004 | 0.0150 ± 0.0001 |
| DDC | 0.7449 ± 0.0017 | 0.7010 ± 0.0010 | 0.0151 ± 0.0004 |
| CORAL | 0.7439 ± 0.0013 | 0.7002 ± 0.0010 | 0.0165 ± 0.0004 |
| DANN | 0.7428 ± 0.0017 | 0.6995 ± 0.0019 | 0.0174 ± 0.0008 |
| ADDA | 0.7315 ± 0.0053 | 0.6891 ± 0.0010 | 0.0199 ± 0.0008 |
| DSN | 0.7470 ± 0.0002 | 0.7024 ± 0.0004 | 0.0148 ± 0.0004 |
| CLEIT (w/o pre-training) | 0.7467 ± 0.0003 | 0.7023 ± 0.0004 | 0.0149 ± 0.0004 |
| CLEIT (w/o transmitter) | 0.7569 ± 0.0081 | 0.7172 ± 0.0070 | 0.0125 ± 0.0005 |
| CLEIT (MMD) | 0.7443 ± 0.0018 | 0.7003 ± 0.0009 | 0.0147 ± 0.0009 |
| CLEIT (WGAN) | 0.7465 ± 0.0005 | 0.7022 ± 0.0008 | 0.0152 ± 0.0009 |
Note: The best results are shown in bold.
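The two evaluation tables differ only in how cell line-drug pairs are grouped before the metrics are computed. The sketch below assumes the usual reading: drug-wise scores correlate predictions across cell lines within each drug, and sample-wise scores correlate predictions across drugs within each cell line, with the per-group scores then averaged. The helper names and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation of two 1-D arrays (nan if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else float("nan")


def grouped_pearson(groups, y_true, y_pred):
    """Average Pearson correlation computed separately within each group."""
    scores = []
    for g in np.unique(groups):
        mask = groups == g
        if mask.sum() >= 2:  # a correlation needs at least two points
            scores.append(pearson(y_true[mask], y_pred[mask]))
    return float(np.nanmean(scores))


def rmse(y_true, y_pred):
    """Root mean squared error over all pairs."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical toy data: 4 cell lines x 3 drugs in long format.
drug_baseline = np.array([0.2, 0.5, 0.8])            # per-drug mean sensitivity
cell_effect = np.array([0.02, -0.02, 0.01, -0.01])   # true per-cell-line offset
pred_offset = np.array([0.01, 0.01, -0.01, -0.01])   # predicted offset, unrelated to the truth

cell_ids, drug_ids = np.meshgrid(np.arange(4), np.arange(3), indexing="ij")
cell_ids, drug_ids = cell_ids.ravel(), drug_ids.ravel()

y_true = drug_baseline[drug_ids] + cell_effect[cell_ids]
y_pred = drug_baseline[drug_ids] + pred_offset[cell_ids]

drug_wise = grouped_pearson(drug_ids, y_true, y_pred)    # across cell lines, per drug
sample_wise = grouped_pearson(cell_ids, y_true, y_pred)  # across drugs, per cell line
overall_rmse = rmse(y_true, y_pred)
print(drug_wise, sample_wise, overall_rmse)
```

In this toy data the model recovers each drug's baseline but not the cell-line-specific differences, so the sample-wise correlation is high while the drug-wise correlation is near zero. That pattern is at least consistent with the gap between the two tables, although the source does not state this explanation.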
Fig. 4. Top-K precision on mutation-only test dataset