Axel Kowald¹, Israel Barrantes¹, Steffen Möller¹, Daniel Palmer¹, Hugo Murua Escobar², Anne Schwerk³, Georg Fuellen¹,⁴.
Abstract
Accurate transfer learning of clinical outcomes from one cellular context to another, between cell types, developmental stages, omics modalities or species, is considered tremendously useful. When transferring a prediction task from a source domain to a target domain, what counts is the high quality of the predictions in the target domain. This requires states or processes common to both the source and the target, reflected by shared denominators, that can be learned by the predictor. These may form a compendium of knowledge that is learned in the source to enable predictions in the target, usually with few, if any, labeled target training samples to learn from. Transductive transfer learning refers to learning the predictor in the source domain and transferring its outcome label calculations to the target domain, considering the same task. Inductive transfer learning considers cases where the target predictor performs a different yet related task as compared with the source predictor. Often, there is also a need to first map the variables in the input/feature spaces and/or the variables in the output/outcome spaces. We here discuss and juxtapose various recently published transfer learning approaches, specifically designed (or at least adaptable) to predict clinical (human in vivo) outcomes based on preclinical (mostly animal-based) molecular data, towards finding the right tool for a given task, and paving the way for a comprehensive and systematic comparison of the suitability and accuracy of transfer learning of clinical outcomes.
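To make the two settings concrete, here is a toy sketch (our illustration, not from the paper; the synthetic data, model choices and all names are illustrative stand-ins for a preclinical source and a clinical target domain):

```python
# Toy illustration (ours, not from the paper) of transductive vs inductive
# transfer learning, with synthetic data standing in for a source
# (preclinical) and a target (clinical) domain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 10))               # source features
ys = (Xs[:, 0] > 0).astype(int)               # source outcome labels
Xt = rng.normal(size=(60, 10)) + 0.3          # target features (shifted domain)
yt = (Xt[:, 0] > 0.3).astype(int)             # related target task

# Transductive: learn the predictor in the source domain and transfer its
# outcome label calculations to the target -- same task, no target labels used.
src_model = LogisticRegression().fit(Xs, ys)
yt_transductive = src_model.predict(Xt)

# Inductive: the target predictor performs a different yet related task; here
# the source model's decision scores become an extra feature for a new
# predictor trained on the few labeled target samples available.
few = 20
Xt_aug = np.column_stack([Xt, src_model.decision_function(Xt)])
tgt_model = LogisticRegression().fit(Xt_aug[:few], yt[:few])
yt_inductive = tgt_model.predict(Xt_aug)
```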
Keywords: biomarkers; inductive transfer learning; shared denominators; transductive transfer learning; transfer learning; unsupervised transfer learning
Year: 2022 PMID: 35453145 PMCID: PMC9116218 DOI: 10.1093/bib/bbac133
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994
Figure 1. How to select a transfer learning methodology for a given task. The given task is the ‘learning task with existing solutions’ and refers to the source domain; see also Table 1 and [13]. For the examples (FIT, AITL and PRECISE), please see Table 1.
Table 1. Transfer learning, tools and techniques
| Name/acronym, reference | Source domain | Target domain | Input of the predictors | Output of the predictors | Transfer method; regression or classification task? | Availability, advantages and disadvantages (results/accomplishments) |
|---|---|---|---|---|---|---|
| Semisupervised transfer learning [ | Application-area-specific mouse phenotype-outcome-labeled gene expression data | Human gene expression data | Human gene expression data | Human phenotype data (and subsequently DEGs and enriched pathways inferred from these) | | Matlab code available from |
| XGSEA [ | GO (or similar) gene sets and enrichment scores, e.g. from mouse or zebrafish | GO (or similar) gene sets and enrichment scores, e.g. from human | Gene expression data from source species used to calculate enrichment scores | Gene sets significantly associated in target species | | Code available at |
| FIT [ | Precompiled datasets of mouse gene expression | Precompiled datasets of human gene expression | Mouse gene expression | Human gene expression for matching condition, genes with high effect size | Unsupervised (dimensionality reduction): gene-level lasso regression; follow-up classification task to identify high-effect genes | Available at |
| Translatable components regression (TransComp-R) [ | Human gene expression data (pretreatment), human drug response data | Mouse proteomics data | Human gene expression (pretreatment) and drug response data (the latter are given, not to be predicted) | Mouse proteins (and corresponding pathway enrichments) with association to human drug response | | Matlab code available from |
| Pathway RespOnsive GENes (PROGENy) [ | Two curated resources: one of footprint pathway perturbations (PROGENy) and one of footprint regulons (transcription factor–target interactions, DoRothEA), both from human data, plus human–mouse orthologs | The mouse equivalent of the source | Mouse gene expression data | Mouse pathway activity (PROGENy) or transcription factor activity and enrichment (DoRothEA) | | Both tools are available as R (Bioconductor) and python packages (a schematic footprint-scoring sketch follows below the table); for usage examples see |
| Adversarial Inductive Transfer Learning (AITL) [ | | | | | | Code available at |
| Patient Response Estimation Corrected by Interpolation of Subspace Embeddings (PRECISE) [ | Gene expression data from preclinical models (cell lines, patient-derived xenografts) and drug response | Human gene expression data | Human gene expression data | Human drug response | | Available as python package; example protocols provided as Jupyter notebooks; see |
| Transfer variational autoencoder (trVAE) [ | Gene expression data (cell line) or image data (or similar) under a specific (first) condition | Gene expression data or image data (or similar) under a different (second) condition | Data under the first condition and a label specifying the second condition | Data transformed to the second condition | | Available from |
| MultiPLIER [ | Preprocessed disease-related datasets of human gene expression, highlighting latent variables (LVs; characteristic patterns of correlated genes) | Human (rare disease) gene expression data | Human (rare disease) gene expression data | Characteristic expression patterns of correlated genes | | PLIER is available at |
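As flagged in the PROGENy row above, the footprint idea reduces to weighted sums of expression over curated footprint genes. The sketch below is ours, not the released packages' code; the real PROGENy and DoRothEA implementations use curated weight matrices and add permutation-based statistics omitted here.

```python
# Schematic of footprint-based scoring as in the PROGENy/DoRothEA table row
# (ours; curated weights and permutation statistics are omitted).
import numpy as np

def footprint_scores(expr, W):
    """Per-sample activity scores as weighted sums of scaled expression.

    expr : (n_samples, n_genes) gene expression matrix
    W    : (n_genes, n_pathways_or_TFs) footprint weights (zero outside a
           pathway's or transcription factor's footprint genes)
    """
    z = (expr - expr.mean(axis=0)) / (expr.std(axis=0) + 1e-9)  # per-gene z-scale
    return z @ W                        # (n_samples, n_pathways_or_TFs)
```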
Figure 2. Overview of the FIT algorithm [19]. FIT is built on a compendium of 170 cross-species pairs (CSPs) of mouse and human transcriptomics data for 28 different diseases. First, a lasso regression is performed to fit the parameters α and β of a linear model based on all genes, g, of all CSPs, p, penalizing α values deviating from 0 and β values deviating from 1. The fitting process is repeated 100 times to obtain mean values and confidence intervals for the two parameters (see also main text). For a new mouse expression dataset, the mean values of the parameters α and β are then used to predict human effect sizes Z for each gene therein. (Mouse clipart by Vincent Le Moign / CC BY 4.0.)
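The per-gene fit lends itself to a compact sketch. Below is a minimal Python reimplementation of the idea (ours, not the published FIT code; `lam`, `n_boot` and all names are illustrative). Reparametrizing β′ = β − 1 turns the unusual penalty of the caption (α shrunk toward 0, β toward 1) into a standard lasso:

```python
# Minimal sketch of the per-gene FIT fit (ours, not the published code).
# The model per gene is  z_human = alpha + beta * z_mouse.  With
# beta' = beta - 1, the residual form  z_human - z_mouse = alpha + beta' * z_mouse
# lets a standard lasso (which shrinks coefficients towards 0) shrink
# alpha towards 0 and beta towards 1, as described in the caption.
import numpy as np
from sklearn.linear_model import Lasso

def fit_gene(z_mouse, z_human, lam=0.1, n_boot=100, seed=0):
    """Bootstrap lasso estimates of (alpha, beta) for one gene.

    z_mouse, z_human : 1-D arrays of effect sizes, one entry per CSP.
    Returns the bootstrap mean and standard deviation of (alpha, beta).
    """
    rng = np.random.default_rng(seed)
    n = len(z_mouse)
    X = np.column_stack([np.ones(n), z_mouse])   # columns: alpha, beta'
    fits = []
    for _ in range(n_boot):                      # repeated fits, as in Figure 2
        idx = rng.integers(0, n, size=n)         # resample the CSPs
        m = Lasso(alpha=lam, fit_intercept=False)
        m.fit(X[idx], z_human[idx] - z_mouse[idx])
        a, b_prime = m.coef_
        fits.append((a, 1.0 + b_prime))          # undo the reparametrization
    fits = np.asarray(fits)
    return fits.mean(axis=0), fits.std(axis=0)

def predict_human_effect(z_mouse_new, alpha, beta):
    """Predicted human effect size Z for a new mouse dataset."""
    return alpha + beta * z_mouse_new
```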
Figure 3. Overview of trVAE. The encoder part of the neural net processes input data plus information about the condition (smiling, cell type, etc.) and generates a compressed latent layer. The decoder uses this latent layer together with information about the condition to produce an output. During training, the conditions fed to the encoder and the decoder are the same, while during prediction, the decoder receives the new condition for which an output should be generated. In contrast to a standard conditional VAE, an additional constraint is imposed on the first layer of the decoder for further regularization; we omit the details here. While two input nodes suffice for the condition if a unary (‘one-hot’) encoding is used, the number of nodes in the other parts of the neural nets is far larger than shown here for any realistic application. The diagram is redrawn from [30].
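The data flow in Figure 3 can be sketched as a plain conditional VAE in PyTorch (our sketch, not the trVAE implementation; trVAE's extra constraint on the first decoder layer is omitted, as in the caption, and all layer sizes are illustrative):

```python
# Minimal conditional VAE sketch in PyTorch (ours, not the trVAE code).
# The condition is concatenated to both encoder and decoder inputs, so the
# decoder can be given a *new* condition at prediction time.
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, n_features=2000, n_conditions=2, n_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features + n_conditions, 256),
                                 nn.ReLU())
        self.mu = nn.Linear(256, n_latent)
        self.logvar = nn.Linear(256, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent + n_conditions, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, n_features))

    def forward(self, x, cond_enc, cond_dec):
        # Encoder: data plus (one-hot) source condition -> latent layer.
        h = self.enc(torch.cat([x, cond_enc], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Decoder: latent layer plus condition -> output. During training,
        # cond_dec == cond_enc; for prediction, cond_dec is the new condition.
        return self.dec(torch.cat([z, cond_dec], dim=1)), mu, logvar
```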
Figure 4. Overview of the MultiPLIER framework. (A) PLIER [38], on which MultiPLIER is based, can analyze tissue-specific expression data and extract LVs by matrix factorization, resulting in matrices B and Z. PLIER then aligns the LVs with curated pathway gene sets in a downstream analysis. (B) To analyze data irrespective of tissue, MultiPLIER trains on a large collection of uniformly processed data, the recount2 compendium [39], which contains around 370 000 samples. The resulting LVs can then be used to interpret a new dataset by projecting the new gene expression data onto the latent space, to identify pathway-annotated LVs also featured in that new dataset. The diagram is based on [15].
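The projection step in (B) can be sketched as a regularized least-squares solve, assuming the PLIER factorization Y ≈ ZB has already been learned (our sketch; the MultiPLIER package includes gene-matching and normalization steps omitted here, and `lam` is an illustrative regularization weight):

```python
# Sketch of the "project a new dataset onto the latent space" step (ours,
# not the MultiPLIER code). Assumes Y ≈ Z @ B with Z the learned loadings.
import numpy as np

def project_onto_lvs(Z, Y_new, lam=1e-2):
    """Estimate LV activities for a new expression matrix.

    Z     : (n_genes, n_lvs) gene loadings learned on recount2
    Y_new : (n_genes, n_samples) new dataset, rows matched to Z's genes
    Solves the ridge problem  min_B ||Y_new - Z B||^2 + lam ||B||^2.
    """
    G = Z.T @ Z + lam * np.eye(Z.shape[1])   # regularized Gram matrix
    return np.linalg.solve(G, Z.T @ Y_new)   # B_new: (n_lvs, n_samples)
```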