| Literature DB >> 33931726 |
Martin Treppner1,2,3, Adrián Salas-Bastos4,5, Moritz Hess6,7, Stefan Lenz6,7, Tanja Vogel4,8,9, Harald Binder6,7.
Abstract
Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could then be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear, however, whether synthetic observations generated from a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples from the posterior distribution, the standard approach, or from the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the scVI posterior variant resulted in high variability, most likely due to amplifying artifacts of small datasets. All approaches showed mixed results for cell types with different abundance, overrepresenting highly abundant cell types and missing less abundant ones. With increasing pilot dataset sizes, the proportions of cells in each cluster became more similar to those of the ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Across all analyses, comparing 10x Genomics and Smart-seq2 technologies, we could show that for 10x datasets, which have higher sparsity, it is more challenging to make inferences from small to larger datasets.
Overall, the results show that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments.
Year: 2021 PMID: 33931726 PMCID: PMC8087667 DOI: 10.1038/s41598-021-88875-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1: Design for evaluating the performance of deep generative models with small pilot datasets. (A) Take a sub-sample from an original dataset to obtain pilot data with known ground truth. (B) Train the deep generative approaches on the pilot dataset and generate synthetic data at the original data size. (C) Apply dimensionality reduction with UMAP and Seurat clustering to the original data, and map each synthetic observation to the closest observation from the original data, thereby obtaining a cluster assignment. (D) Evaluate the quality of synthetic samples based on the Davies–Bouldin index, adjusted Rand index, cluster proportions, and per-gene distributions. The complete analysis is performed for different sizes of pilot datasets (384, 768, 1152, 1536, 1920, and 2304 cells) and repeated 30 times for each size.
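The evaluation loop of Figure 1 can be sketched with scikit-learn stand-ins: a Gaussian resampler in place of a trained generative model (scVI/scDBM), KMeans in place of Seurat clustering, and nearest-neighbor mapping to transfer cluster assignments. All names, sizes, and parameters below are illustrative, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import davies_bouldin_score, adjusted_rand_score

rng = np.random.default_rng(0)

# (A) "Original" data: two well-separated groups of expression profiles.
original = np.vstack([rng.normal(0, 1, (200, 20)),
                      rng.normal(5, 1, (200, 20))])

# Ground-truth clustering of the original data (Seurat stand-in).
truth = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(original)

# (B) "Train" on a small pilot subset and generate synthetic data at the
# original size (crude generator: resample pilot cells plus Gaussian noise).
pilot = original[rng.choice(len(original), size=64, replace=False)]
synthetic = (pilot[rng.integers(0, len(pilot), size=len(original))]
             + rng.normal(0, 0.5, (len(original), 20)))

# (C) Map each synthetic cell to its nearest original cell to inherit
# that cell's cluster assignment.
nn = NearestNeighbors(n_neighbors=1).fit(original)
idx = nn.kneighbors(synthetic, return_distance=False).ravel()
synthetic_labels = truth[idx]

# (D) Evaluate: Davies-Bouldin index of the synthetic data under the
# inherited labels, and adjusted Rand index between the inherited labels
# and a fresh clustering of the synthetic data.
dbi = davies_bouldin_score(synthetic, synthetic_labels)
fresh = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(synthetic)
ari = adjusted_rand_score(synthetic_labels, fresh)
print(round(dbi, 2), round(ari, 2))
```

Lower DBI indicates more compact, better-separated clusters; ARI near 1 indicates that clustering the synthetic data recovers the inherited ground-truth assignments.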
Figure 2: Davies–Bouldin index (DBI; top) and adjusted Rand index (bottom), indicating the quality of synthetic data generated by scDBM, scVI (prior and posterior sampling), and a baseline from pilot data of different sizes for PBMC4k (a) and Segerstolpe (b). Each boxplot represents 30 sub-samples from the original data (lower and upper hinges correspond to the 25th and 75th percentiles). The orange dotted line indicates the reference DBI for the Seurat clustering on the original data.
Mean of absolute differences in the number of cells across all clusters.

| Dataset | Model | 1 plate | 2 plates | 3 plates | 4 plates | 5 plates | 6 plates |
|---|---|---|---|---|---|---|---|
| PBMC4k | scDBM | 2691 | 2017 | 1910 | 1776 | 1737 | 1624 |
| PBMC4k | scVI prior | 6567 | 4565 | 4352 | 4249 | 4205 | 4196 |
| PBMC4k | scVI posterior | 5037 | 2965 | 2711 | 2599 | 2709 | 2513 |
| Segerstolpe | scDBM | 1037 | 925 | 887 | 889 | 864 | – |
| Segerstolpe | scVI prior | 1273 | 1228 | 1108 | 972 | 857 | – |
| Segerstolpe | scVI posterior | 1293 | 1163 | 1122 | 856 | 645 | – |
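The table's statistic can be illustrated with a short sketch: for each cluster, take the absolute difference between ground-truth and synthetic cell counts, then average over clusters. The helper name and the toy labels below are made up for illustration.

```python
from collections import Counter

def mean_abs_count_diff(truth_labels, synth_labels):
    """Average |n_truth(c) - n_synth(c)| over all clusters c."""
    t, s = Counter(truth_labels), Counter(synth_labels)
    clusters = set(t) | set(s)
    return sum(abs(t[c] - s[c]) for c in clusters) / len(clusters)

truth = ["A"] * 100 + ["B"] * 50 + ["C"] * 10
synth = ["A"] * 120 + ["B"] * 38 + ["C"] * 2   # abundant type over-represented
print(mean_abs_count_diff(truth, synth))  # (20 + 12 + 8) / 3 = 40/3
```

Note how the toy generator inflates the abundant cluster A while depleting the rare cluster C, mirroring the abundance bias described in the abstract.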
Figure 3: Mean absolute deviation for various descriptive statistics across all models and sample sizes for the PBMC4k dataset (a) and the Segerstolpe dataset (b). Color coding indicates ranks among sample sizes.