
Synthetic observations from deep generative models and binary omics data with limited sample size.

Jens Nußberger, Frederic Boesel, Stefan Lenz, Harald Binder, Moritz Hess.

Abstract

Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as removal of noise, imputation, better understanding of underlying patterns, or even exchanging data under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, as is relevant, for example, for SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as modeling gene expression conditional on SNPs. Recovery of pairwise odds ratios (ORs) is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 300 variables, with a tendency to over-estimate ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of ORs. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
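
The abstract names recovery of pair-wise odds ratios as the primary performance criterion. As a rough illustration only, the sketch below computes pair-wise ORs for the columns of a binary matrix and the mean signed difference in log ORs between an original and a synthetic data set. It is not the authors' evaluation code: the function names `pairwise_odds_ratios` and `mean_log_or_bias`, the 0.5 pseudocount added to every cell of each 2x2 table, and the random toy data are all assumptions made for this example.

```python
import numpy as np


def pairwise_odds_ratios(x, pseudocount=0.5):
    """Odds ratios for every pair of columns of a binary (n, p) array.

    The pseudocount (an assumption of this sketch) is added to each cell
    of the 2x2 table so that empty cells do not cause division by zero.
    """
    x = np.asarray(x, dtype=float)
    not_x = 1.0 - x
    n11 = x.T @ x          # both variables are 1
    n10 = x.T @ not_x      # row variable is 1, column variable is 0
    n01 = not_x.T @ x      # row variable is 0, column variable is 1
    n00 = not_x.T @ not_x  # both variables are 0
    c = pseudocount
    odds = ((n11 + c) * (n00 + c)) / ((n10 + c) * (n01 + c))
    np.fill_diagonal(odds, np.nan)  # a variable paired with itself is not meaningful
    return odds


def mean_log_or_bias(original, synthetic, pseudocount=0.5):
    """Mean of log(OR_synthetic) - log(OR_original) over all variable pairs.

    Positive values indicate over-estimated ORs in the synthetic data,
    negative values indicate under-estimation.
    """
    upper = np.triu_indices(np.shape(original)[1], k=1)  # each pair once
    log_or_orig = np.log(pairwise_odds_ratios(original, pseudocount)[upper])
    log_or_syn = np.log(pairwise_odds_ratios(synthetic, pseudocount)[upper])
    return float(np.mean(log_or_syn - log_or_orig))


# Toy data standing in for original and synthetic SNP matrices (hypothetical).
rng = np.random.default_rng(0)
original = rng.integers(0, 2, size=(200, 5))
synthetic = rng.integers(0, 2, size=(200, 5))
print(mean_log_or_bias(original, synthetic))
```

In this sketch the signed difference is kept rather than an absolute error, since the abstract distinguishes over-estimation (DBMs) from under-estimation (VAEs) of ORs.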

Keywords:  SNP data; benchmarking; data privacy; generative models; synthetic data

Year:  2021        PMID: 33003196     DOI: 10.1093/bib/bbaa226

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   11.622


  3 in total

1.  Synthetic single cell RNA sequencing data from small pilot studies using deep generative models.

Authors:  Martin Treppner; Adrián Salas-Bastos; Moritz Hess; Stefan Lenz; Tanja Vogel; Harald Binder
Journal:  Sci Rep       Date:  2021-04-30       Impact factor: 4.379

2.  Deep generative models in DataSHIELD.

Authors:  Stefan Lenz; Moritz Hess; Harald Binder
Journal:  BMC Med Res Methodol       Date:  2021-04-03       Impact factor: 4.615

3.  Interpretable generative deep learning: an illustration with single cell gene expression data. (Review)

Authors:  Martin Treppner; Harald Binder; Moritz Hess
Journal:  Hum Genet       Date:  2022-01-06       Impact factor: 5.881

