Lisa Handl, Adrin Jalali, Michael Scherer, Ralf Eggeling, Nico Pfeifer.
Abstract
MOTIVATION: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains.
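The key idea can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: ridge regression stands in for the model of dependencies between inputs, the confidence score exp(-mean |z|) and the weight mapping (1 - confidence)^k are illustrative choices, and the function names `feature_confidences` and `weighted_elastic_net` are hypothetical. The weighted elastic net is fitted with a plain proximal gradient loop.

```python
import numpy as np

def feature_confidences(X_src, X_tgt, ridge=1.0):
    """Per-feature score in (0, 1]: how well a model of feature j, fitted on
    source data from all other features, still predicts feature j in the target."""
    n_features = X_src.shape[1]
    conf = np.empty(n_features)
    for j in range(n_features):
        others = np.delete(np.arange(n_features), j)
        A, b = X_src[:, others], X_src[:, j]
        # Ridge regression as the dependency model (an assumption of this sketch).
        coef = np.linalg.solve(A.T @ A + ridge * np.eye(len(others)), A.T @ b)
        sigma = np.std(b - A @ coef) + 1e-12  # residual scale on source data
        z = np.abs(X_tgt[:, j] - X_tgt[:, others] @ coef) / sigma
        conf[j] = np.exp(-np.mean(z))  # illustrative score, not from the paper
    return conf

def weighted_elastic_net(X, y, weights, lam=0.1, alpha=0.5, n_iter=2000):
    """Minimise 1/(2n) * ||y - X b||^2
    + lam * sum_j w_j * (alpha * |b_j| + (1 - alpha) / 2 * b_j^2)
    by proximal gradient descent."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        v = beta - step * (X.T @ (X @ beta - y) / n)
        # Proximal step: soft-threshold (L1 part), then shrink (L2 part).
        soft = np.sign(v) * np.maximum(np.abs(v) - step * lam * alpha * weights, 0.0)
        beta = soft / (1.0 + step * lam * (1.0 - alpha) * weights)
    return beta

# Synthetic example: two blocks of three correlated features each.
rng = np.random.default_rng(0)
M = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]], dtype=float)
Z_src = rng.normal(size=(200, 2))
Z_tgt = rng.normal(size=(200, 2))
X_src = Z_src @ M + 0.1 * rng.normal(size=(200, 6))
X_tgt = Z_tgt @ M + 0.1 * rng.normal(size=(200, 6))
X_tgt[:, 0] += 2.0  # feature 0 behaves differently in the target domain
y_src = Z_src.sum(axis=1) + 0.1 * rng.normal(size=200)

# Centre both domains with the source mean so the shift stays visible.
mu = X_src.mean(axis=0)
Xs, Xt = X_src - mu, X_tgt - mu

conf = feature_confidences(Xs, Xt)
k = 3  # exponent of the illustrative weight mapping
weights = (1.0 - conf) ** k  # low confidence -> high penalty
beta = weighted_elastic_net(Xs, y_src - y_src.mean(), weights)
print("confidences:", np.round(conf, 3))
print("coefficients:", np.round(beta, 3))
```

In this toy setting the shifted feature, and the features whose prediction depends on it, receive low confidence and therefore a higher penalty, while the unaffected block is penalised less. Only unlabeled target inputs are used, which is what makes the adaptation unsupervised.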
Year: 2019 PMID: 31510704 PMCID: PMC6612879 DOI: 10.1093/bioinformatics/btz338
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Mean absolute error (MAE) of wenda-pn, wenda-cv and wenda-mar on simulated test data. Each row shows results on one target domain (no mismatch, 10–30% altered variables). We report all errors relative to the MAE of en, showing the mean ± standard deviation over 10 simulations.
Fig. 2. (a) Mean absolute error of en-ls and wenda-pn with k = 3 per test tissue. We show the mean ± standard deviation over 10 runs of 10-fold cross-validation for en-ls, and over all splits of the test tissues where the tissue of interest was in the evaluation set for wenda-pn. Predicted versus true chronological age for typical runs of en-ls (b) and wenda-pn with k = 3 (c). In each plot, we show samples colored by tissue. As a typical run for en-ls, we show the one whose performance on cerebellum samples and on the full test set is closest to the median. For wenda-pn, we choose a typical run for each tissue: among all models with this tissue in the holdout set, we plot the predictions of the one whose performance is closest to the median.
Fig. 3. Mean absolute error of all models on cerebellum samples. We show the mean and standard deviation over 10 runs of 10-fold cross-validation or, in the case of wenda-pn, over all splits where cerebellum samples were in the evaluation set.
Fig. 4. Mean absolute error (MAE) of all models on the full test set of DNA methylation data. We show the mean and standard deviation over 10 runs of 10-fold cross-validation. In the case of wenda-pn, we compute the MAE based only on samples in the evaluation set, and plot the mean and standard deviation over all considered splits of the test tissues.