Christos Maniatis, Catalina A Vallejos, Guido Sanguinetti.
Abstract
Single-cell multi-omics assays offer unprecedented opportunities to explore epigenetic regulation at the cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, which then identify very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.
Year: 2022 PMID: 35727848 PMCID: PMC9249169 DOI: 10.1371/journal.pcbi.1010163
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
Fig 1. Schematic and graphical representations of SCRaPL.
Here, we assume the observed data consist of RNA expression and DNA methylation. (1A) Schematic representation of the SCRaPL model. (1B) SCRaPL’s graphical model, depicting the statistical dependencies between the observed genomic data (Y, one component for RNA expression and one for DNA methylation), their associated latent variables (X, one per molecular layer) and the feature-specific model parameters (μ, Σ). The additional parameter π is specific to the noise model assigned to the RNA expression data and captures zero inflation. Full details are given in the model description section in Methods.
Summary of synthetic data experiments.
In all cases, latent means and standard deviations were set as μ = 4, μ = 1, σ = 3 and σ = 2. Unless otherwise stated, our simulations were based on: I = 60 cells, J = 300 features, an average zero-inflation (ZI) rate of 20% for the expression data (π = 0.20) and an average methylation coverage (n) of 275 (sampled from a Uniform distribution with range [50, 500]) across cells and genes. When varying the number of cells, we use I ∈ {5, 10, 25, 50, 100, 200, 400, 800, 1600}. When varying expression ZI, we use π ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8}. When varying methylation coverage, we sample n from Uniform distributions with ranges given by [5, 10], [10, 20], [20, 50], [50, 250] and [500, 1000]. Full details are provided in S3 Text.
| Experiment | Description |
|---|---|
| 1 | Correlations |
| 2 | Correlations |
| 3 | Correlations |
| 4 | Correlations |
| 5 | Correlations |
| 6 | Correlations |
| 7 | As experiment 1, but latent expression means sampled from scVI. |
| 8 | As experiment 2, but latent expression means sampled from scVI. |
| 9 | As experiment 3, but latent expression means sampled from scVI. |
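The default simulation settings in the caption above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' simulation code (see S3 Text for that): the function name `simulate_features` is hypothetical, the first mean and standard deviation (4 and 3) are assumed to belong to the expression layer and the second pair (1 and 2) to the methylation layer, true correlations are assumed to be drawn uniformly from (-1, 1), and the noise models (zero-inflated Poisson with a log link, binomial with a logistic link) are assumptions.

```python
import numpy as np

def simulate_features(n_cells=60, n_features=300, zi_rate=0.20,
                      coverage_range=(50, 500), seed=0):
    """Simulate zero-inflated counts and methylation reads (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu = np.array([4.0, 1.0])      # latent means (layer assignment is an assumption)
    sigma = np.array([3.0, 2.0])   # latent standard deviations (same assumption)

    # ASSUMPTION: one true correlation per feature, uniform on (-1, 1).
    rho = rng.uniform(-1.0, 1.0, size=n_features)

    # Correlated bivariate Gaussian latent variables, one pair per feature and cell.
    z1 = rng.standard_normal((n_features, n_cells))
    z2 = (rho[:, None] * z1
          + np.sqrt(1.0 - rho[:, None] ** 2)
          * rng.standard_normal((n_features, n_cells)))
    x_expr = mu[0] + sigma[0] * z1
    x_meth = mu[1] + sigma[1] * z2

    # ASSUMPTION: Poisson counts with rate exp(x), then zero inflation at rate pi.
    counts = rng.poisson(np.exp(x_expr))
    counts[rng.random((n_features, n_cells)) < zi_rate] = 0

    # Coverage n sampled uniformly from the stated integer range (inclusive).
    coverage = rng.integers(coverage_range[0], coverage_range[1] + 1,
                            size=(n_features, n_cells))

    # ASSUMPTION: binomial methylated reads with a logistic link on the latent value.
    p_meth = 1.0 / (1.0 + np.exp(-x_meth))
    methylated = rng.binomial(coverage, p_meth)
    return rho, counts, methylated, coverage
```

Varying `n_cells`, `zi_rate` or `coverage_range` over the grids listed in the caption would give data sets in the spirit of the experiments in the table.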
Fig 2. Plots summarizing differences in correlation estimation between SCRaPL, Spearman and Pearson in Experiment 1 with synthetic data.
(2A) Difference between estimated and true correlation as a function of the number of cells for SCRaPL, Spearman and Pearson. (2B) Estimated correlation as a function of true correlation for SCRaPL, Spearman and Pearson in synthetic datasets with 300 genes and 1600 cells. Each dot represents a gene and is color-coded by inference approach.
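For the two baseline estimators, the quantity plotted in panel 2A (estimated minus true correlation) can be computed as in the following minimal sketch. The helper names `average_ranks`, `pearson`, `spearman` and `estimation_errors` are hypothetical, and only NumPy is used; SCRaPL's own estimate comes from posterior inference and is not reproduced here.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    values = np.asarray(values)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks within each group of ties by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]

def pearson(x, y):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def estimation_errors(true_rho, x, y):
    """Estimated minus true correlation for both baseline estimators."""
    return {"pearson": pearson(x, y) - true_rho,
            "spearman": spearman(x, y) - true_rho}
```

Tie handling matters here because zero-inflated expression counts contain many identical values, which would otherwise receive arbitrary ranks.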
Fig 3. Summary of experiments on real data.
Figures summarizing the most important points from the synthetic and real data experiments. (3A, 3B) Bayesian volcano plots for mESC and mEBC data, respectively: scatter plots of the posterior probability under the null hypothesis (in log scale) as a function of posterior median correlation. Each dot represents a feature and is colored according to the method that labels it as a significant association. (3C, 3D) Venn diagrams summarizing detection rates for SCRaPL, Pearson and Spearman in mESC and mEBC data. By accounting for different sources of noise, SCRaPL detects a large set of the features identified by the frequentist alternatives. SCRaPL also uncovers a large additional set that frequentist methods could not identify in a robust way.
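The two coordinates of a Bayesian volcano plot can be derived from posterior samples of a feature's correlation, as in this rough sketch. It assumes that the null hypothesis is "the absolute correlation does not exceed a threshold"; the function name `volcano_coordinates` and the default threshold `gamma=0.2` are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def volcano_coordinates(rho_samples, gamma=0.2):
    """Return (posterior median correlation, posterior probability of the null).

    rho_samples: posterior draws of the correlation; the last axis indexes draws,
                 so a (features, draws) array yields one coordinate pair per feature.
    gamma:       ASSUMED null region |rho| <= gamma (hypothetical threshold).
    """
    rho_samples = np.asarray(rho_samples, dtype=float)
    median_rho = np.median(rho_samples, axis=-1)
    prob_null = np.mean(np.abs(rho_samples) <= gamma, axis=-1)
    return median_rho, prob_null
```

Because the plot shows the null probability in log scale, features whose Monte Carlo estimate is exactly zero would need a small floor before taking logarithms.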
Fig 4. SCRaPL’s behavior compared to Pearson/Spearman correlation at micro and macro scale.
In all panels apart from 4D, the scatter plot depicts raw data for chosen features, color-coded by CpG coverage, with normalized expression plotted in the log(1 + x) scale. The violin plots depict the posterior correlation densities estimated by SCRaPL for the raw data shown to their left. (4A) Example where both SCRaPL and Pearson/Spearman identify the feature’s association as significant. (4B) Example where only Pearson/Spearman identifies the feature’s association as significant. (4C) Example where only SCRaPL identifies the feature’s association as significant. (4D) Scatter plots demonstrating the negative/positive relationship between alternative correlation estimates and CpG coverage/% zeros in expression, respectively. (In Fig 4D, the two correlation estimates for feature j are the posterior mean and the Pearson correlation ρ.)
Fig 5. Cell label transfer from expression to accessibility data for raw (5A) and SCRaPL-preprocessed (5B) data.
Visualization of scRNA and scATAC data on the same plot for raw (5C) and SCRaPL-preprocessed (5D) data.
Fig 6. DIC difference between the models with and without zero inflation for mESC and mEBC data.
The more negative the difference, the stronger the evidence in favor of the model with zero inflation on the gene expression component, and vice versa. As a visual reference, zero is marked with a dashed red line.
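As a reminder of how such a difference can be obtained, the sketch below computes the deviance information criterion (DIC) in its standard form, the mean posterior deviance plus the effective number of parameters, from log-likelihood values. Whether the authors use exactly this variant is an assumption, and the function names `dic` and `dic_difference` are hypothetical. The sign convention (model with zero inflation minus model without) is inferred from the caption.

```python
import numpy as np

def dic(loglik_samples, loglik_at_posterior_mean):
    """Standard DIC = mean deviance + p_D, with deviance D = -2 * log-likelihood.

    loglik_samples:           log-likelihood evaluated at each posterior draw
    loglik_at_posterior_mean: log-likelihood evaluated at the posterior mean
    """
    mean_deviance = -2.0 * np.mean(loglik_samples)
    deviance_at_mean = -2.0 * loglik_at_posterior_mean
    p_d = mean_deviance - deviance_at_mean   # effective number of parameters
    return mean_deviance + p_d

def dic_difference(dic_with_zi, dic_without_zi):
    """Negative values favor the zero-inflated model (convention inferred from the caption)."""
    return dic_with_zi - dic_without_zi
```

For example, log-likelihood draws of -10 and -12 with a value of -9 at the posterior mean give a mean deviance of 22, p_D = 4 and DIC = 26.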