Literature DB >> 31584606

bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data.

Wenhao Tang¹, François Bertaux^1,2,3, Philipp Thomas¹, Claire Stefanelli¹, Malika Saint^2,3, Samuel Marguerat^2,3, Vahid Shahrezaei¹.

Abstract

MOTIVATION: Normalization of single-cell RNA-sequencing (scRNA-seq) data is a prerequisite to their interpretation. The marked technical variability, high amounts of missing observations and batch effect typical of scRNA-seq datasets make this task particularly challenging. There is a need for an efficient and unified approach for normalization, imputation and batch effect correction.
RESULTS: Here, we introduce bayNorm, a novel Bayesian approach for scaling and inference of scRNA-seq counts. The method's likelihood function follows a binomial model of mRNA capture, while priors are estimated from expression values across cells using an empirical Bayes approach. We first validate our assumptions by showing this model can reproduce different statistics observed in real scRNA-seq data. We demonstrate using publicly available scRNA-seq datasets and simulated expression data that bayNorm allows robust imputation of missing values generating realistic transcript distributions that match single molecule fluorescence in situ hybridization measurements. Moreover, by using priors informed by dataset structures, bayNorm improves accuracy and sensitivity of differential expression analysis and reduces batch effect compared with other existing methods. Altogether, bayNorm provides an efficient, integrated solution for global scaling normalization, imputation and true count recovery of gene expression measurements from scRNA-seq data.
AVAILABILITY AND IMPLEMENTATION: The R package 'bayNorm' is publishd on bioconductor at https://bioconductor.org/packages/release/bioc/html/bayNorm.html. The code for analyzing data in this article is available at https://github.com/WT215/bayNorm_papercode. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
RNA

Year: 2020 PMID： 31584606 PMCID： PMC7703772 DOI： 10.1093/bioinformatics/btz726

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) is a method of choice for profiling global gene expression heterogeneity across tissues in health and disease (Baslan and Hicks, 2017; Chen ). Because it relies on the detection of minute amounts of biological material, namely the RNA content of one single cell, scRNA-seq is characterized by unique and strong technical biases. These arise mainly because scRNA-seq library preparation protocols recover only a small fraction of the total RNA molecules present in each cell. As a result, scRNA-seq data are usually very sparse with many genes showing missing values (i.e. zero values, also called dropouts). The fraction of all transcripts recovered from a cell is called capture efficiency and varies from cell to cell, resulting in strong technical variability in transcripts expression levels and dropouts rates. Moreover, capture efficiencies tend to vary between experimental batches resulting in confounding ‘batch effects’. Correcting for these biases in order to recover scRNA-seq counts reflecting accurately the original numbers of transcripts present in a cell remains a major challenge in the field (Bacher and Kendziorski, 2016; Vallejos ; Ziegenhain ). A common approach to scRNA-seq normalization is the use of cell-specific global scaling factors. These methods are based on principles developed for normalization of bulk RNA-seq experiments and assume that gene-specific biases are small (Vallejos ). Typically, read counts per cell are divided by a cell-specific scaling factor estimated either from spike-in controls (Brennecke ), or directly from the transcriptome data using methods developed initially for bulk RNA-seq (Love ; Robinson and Oshlack, 2010; Robinson and Smyth, 2007) or specifically for scRNA-seq (Lun ; Vallejos ). A recent method called SCnorm extended the global scaling approach by introducing different scaling factors for different expression groups (Bacher ). Importantly, scaling methods do not correct for cell-to-cell variations in dropout rates, as genes with zero counts remain zero after division by a scaling factor. Several approaches have been designed to tackle this problem. A series of methods use zero-inflated distribution functions, to explicitly model the dropout characteristics (Finak ; Kharchenko ; Pierson and Yau, 2015). Alternatively, other studies have proposed to infer dropouts based on expression values pooled across cells or genes (Eraslan ; Huang ; Li and Li, 2018; van Dijk ). For instance, scImpute pools expression values across similar cell subpopulations in each dataset and imputes dropouts using a Gamma-Normal mixture model and population-specific thresholds (Li and Li, 2018). Similarly, the MAGIC package is based on pooling gene expression values across cells using a network-based similarity metric (van Dijk ). Another method is based on K-nearest neighbor smoothing, which uses Poisson distribution and aggregate information from similar cells (Wagner ). Conversely, the SAVER approach pools expression values across genes within each cell using a Gamma-Poisson Bayesian model (Huang ). The Gamma-Poisson model is also used in two other packages called Splatter and scVI for simulating and normalizing scRNA-seq data, respectively (Lopez ; Zappia ). scVI belongs to new class of approaches which implement deep learning methods (Ding ; Eraslan ; Grønbech ; Lopez ; Wang and Gu, 2018). For instance, DCA, an autoencoder method, utilizes a zero-inflated negative binomial noise model (Eraslan ). Apart from Gamma-Poisson model, multivariate Normal distribution was assumed for the log transformed data in BISCUIT, which is a Bayesian method that uses an iterative approach to normalization and clustering Azizi and Prabhakaran . However, the log transformation can affect downstream analysis (see Lun, 2018; for more discussion about issues in log transformation of scRNA-seq data). Experimental batch-to-batch variations are another common source of technical variability in scRNA-seq data. The origin of batch effects is not fully understood but results at least in part from differences in average capture efficiencies across experiments (Hicks ). Recently, several methods have been specifically developed to remove batch effect in scRNA-seq data (Butler ; Haghverdi ; Kiselev ). Many of the methods discussed above treat normalization, imputation and batch effect correction as separate tasks. Moreover, some methods rely on strong assumptions such as various zero-inflation models. Here we provide a detailed account of a novel integrated approach called bayNorm, which performs all the processing steps discussed above at the same time using minimal assumptions. We compared its performance with a series of available packages focusing on true count recovery, differential expression (DE) analysis and batch effect correction.

2 Materials and methods

A scRNA-seq dataset is typically represented in a matrix of dimension P × Q, where P denotes the total number of genes observed and Q denotes the total number of cells studied. The element x ( and ) in the matrix represents the number of transcripts reported for the ith gene in the jth cell. This is equal to the total number of sequencing reads mapping to that gene in that cell for a non-unique molecular identifier (UMI) protocol. For UMI-based protocols this is equal to the number of individual UMIs mapping to each gene (Parekh ; Smith ). The matrix can include data from different groups or batches of cells, representing different biological conditions. This can be represented as a vector of labels for the cell groups or conditions (C). bayNorm generates for each gene (i) in each cell (j) a posterior distribution of original expression counts (), given the observed scRNA-seq read count for that gene (x) (Fig. 1a).

Fig. 1.

A binomial model of mRNA capture is consistent with the statistics of raw experimental scRNA-seq data. (a) Cartoon illustration of the bayNorm approach. Only a fraction of the total number of mRNAs present in the cell is captured during scRNA-seq library preparation. This occurs with a global probability called capture efficiency (β). Using cell-specific estimates of β, bayNorm aims at recovering the original number of mRNA of each gene present in each cell. Comparisons between raw experimental scRNA-seq data from the Klein study (Klein ) and synthetic data obtained using the Binomial_bayNorm (orange), Binomial_Splatter (blue) or Splatter (Zappia ) (green) simulation protocols (see Supplementary Note S2 for details). (b) Variance versus mean expression relationship. (c) Dropout rates versus mean expression relationship (note that Binomial_Splatter and Binomial_bayNorm are on top of each other in this panel). The dotted line shows the function. (d) Distribution of dropout values per gene. (e) Distribution of dropout values per cell. (Color version of this figure is available at Bioinformatics online.) A common approach for normalizing scRNA-seq data is based on the use of a global scaling factor (s), ignoring any gene-specific biases (for a recent review see Vallejos ). The normalized data is obtained by dividing the raw data for each cell j by its global scaling factor s: In bayNorm, we implement global scaling using a Bayesian approach to infer the original transcript counts in each cell. We assume given the original number of transcripts in the cell (), the number of transcripts observed (x) follows a Binomial model with probability β (Klein ), which we refer to as capture efficiency and it represents the probability of original transcripts in the cell to be observed for a cell with average size (or average transcript content). The capture efficiencies are proportional to global scaling factors normalized by an estimate of mean capture efficiency (the average fraction of original transcripts that are observed across all cells) for the experiment and correct for cell-to-cell variation in transcript capture and original transcript content (see Supplementary Note S1). In addition, we assume that the original number or true count of the ith gene in the jth cell () follows Negative Binomial distribution with parameters mean (μ) and size (or dispersion parameter, ϕ), such that: So, overall we have the following model: Using the Bayes rule, we have the following posterior distribution of original number of mRNAs for each gene in each cell: The prior parameters μ and ϕ of each gene were estimated using an empirical Bayesian method by pooling information across cells as discussed in detail in Supplementary Note S1. The estimation is termed ‘global’, if priors informed by combining all cells in the study regardless of their conditions or batch (C) and is termed ‘local’, if the prior is estimated by pooling information across specific cell groups (C). The marginal likelihood for gene i in cell j is which follows from using and with in Equation (4). Hence we have that the number of transcripts reported for the ith gene in the jth cell has a Negative Binomial distribution with mean and size ϕ. It can also be shown that the posterior distribution of is a shifted Negative Binomial distribution. To sample from the posterior distribution, we note that the original count can be expressed as where ζ is the lost count satisfying The posterior mean and variance then evaluate to Note that when ϕ is small, the mean of posterior tends to . After estimating the posterior distribution for each gene in each cell, we can either sample a certain number of draws from it (3D array output, see Supplementary Fig. S1) or extract the mean or maximum a posteriori probability (MAP; Gelman ; 2D array output, see Supplementary Fig. S1). More details on the use of Binomial distribution and estimation of β and priors can be found in the Supplementary Note S1 and pseudo code (Algorithm 1) in the Supplementary Note.

3 Results

3.1 The bayNorm model reproduces statistics of real scRNA-seq data

bayNorm models the true transcript counts in each cell using a Bayesian approach to global scaling normalization. The two original aspects of the method are: (i) the use of the binomial likelihood function, and (ii) the use of shrinkage methods to estimate prior parameters (μ and ϕ). The bayNorm likelihood function is assumed to be binomial as it describes the random sampling of a fraction of a cell transcriptome with constant probability. This is a simple model of transcript capture in scRNA-seq (Klein ) and we therefore hypothesized that it would be a good choice for the bayNorm likelihood function. For the prior , we assume a negative binomial model, which describes the bursty distribution of mRNAs in simple models of gene expression (Raj ; Shahrezaei and Swain, 2008) and is also commonly used in RNA-seq analysis Love . Gene-specific prior parameters are estimated using an empirical Bayes approach by pooling gene expression values across multiple cells (see Supplementary Note S1 and Supplementary Figs S27 and S28 for details). Although shrinkage methods have been commonly used to estimate dispersion (see e.g. Love ; Robinson and Smyth, 2007), the use of empirical Bayes shrinkage approaches for estimation of the mean is less common (but see Huang ; Love ; Zhu ). bayNorm normalized count of gene i in cell j is either a point estimate from posterior (mean/MAP) (2D array output) or samples of the corresponding posterior distribution (3D array output). The bayNorm 2D or 3D output can be used for further downstream analysis. To validate our choice of binomial likelihood model and prior estimates, we generated simulated scRNA-seq data based on these assumptions and investigated how closely they captured statistics of several published scRNA-seq datasets (Fig. 1b–e and Supplementary Figs S2–S7; Bacher ; Klein ; Torre ; Tung ). The simulations assumed that mRNA counts per cell followed negative binomial distributions and used gene-specific priors obtained with bayNorm (Fig. 1, ‘Binomial_bayNorm’), or sampled from estimates obtained with a modified version of the Splatter package (Fig. 1, ‘Binomial_Splatter’, see Supplementary Note S2; Zappia ). These were compared with simulations generated with the original Splatter package which is based on the Gamma-Poisson distribution (Zappia ). Note that in Splatter, scaling factors are multiplicative to the Gamma distribution’s mean. In bayNorm, however, the cell-specific capture efficiencies, which act as scaling factors, are set as the probability parameter of the binomial model. Mean-variance relationship and mean-dropout relationship are two important features in scRNA-seq data. Several models have been proposed to explain these phenomenons (Anders and Huber, 2012; Hicks ; Kharchenko ; Pierson and Yau, 2015; Andrews and Hemberg, 2018a). The binomial model used in simulation (‘Binomial_bayNorm’) can better capture both relationships than Splatter (Fig. 1a–c). Moreover, a parameter free approximation based on the binomial model predicted the dropout fraction to depend on an exponential of the negative mean expression (see Supplementary Note S1). This function produced a very close fit to the experimental data providing additional support for our choice of the binomial model (Fig. 1c). Notably, the Binomial_bayNorm simulation protocol using inferred gene-specific priors together with cell-specific parameters (β) was the only one that recovered the distribution of dropout rates per gene observed in experimental data (Fig. 1d). Finally, the results presented on Figure 1b–e could be replicated consistently using several additional experimental scRNA-seq datasets (Supplementary Figs S2–S7). The datasets discussed so far were obtained based on UMIs experimental protocol (Islam ). Datasets obtained without using UMIs are less likely to be well described by the binomial distribution. Accordingly, their dependence of dropout fractions on the mean expression has been reported to be more complex than in UMI-based datasets (Andrews and Hemberg, 2018a). We investigated this issue further and found that a simple scaling of non-UMI raw data by a constant factor produced a reasonable match to the binomial model (Supplementary Fig. S9; see Section 2). This scaling factor can be interpreted as the average number of times original mRNA molecules were sequenced after PCR amplification. This indicates that, provided appropriate scaling, non-UMI datasets are also compatible with the bayNorm model. Importantly, as bayNorm recovers dropouts rates successfully in both UMI-based and non-UMI protocols without the need of specific assumptions, we conclude that invoking zero-inflation models is not required to describe scRNA-seq data. Consistent with this, the differences in mean expression levels of lowly expressed genes observed between bulk and scRNA-seq data, which were suggested to be indicative of zero-inflation, were recovered by our simulated data using the binomial model only (Supplementary Fig. S10; Hicks ). We note that the ability of simulation protocols to recover the statistics of experimental data depended intimately on the value of cell-specific capture efficiencies (β). We used different ways to estimate β (spike-in, Scran scaling factors, trimmed means, or housekeeping genes; see Supplementary Fig. S8) together with different in the Binomial_Splatter simulation protocol. We found that changes in β values affected recovery of the distribution of dropout rates per cell. (Supplementary Fig. S8). In particular, we found that the use of spike-in controls or of housekeeping reference gene expression levels did not improve estimates of capture efficiencies (Supplementary Fig. S8c–f). Altogether, this analysis demonstrates that accurate statistics of experimental scRNA-seq data can be consistently retrieved using the binomial model and empirical Bayes estimation of gene expression parameters implemented in bayNorm along with accurate estimates of cell-specific capture efficiencies.

3.2 Recovery of true gene expression distributions and gene–gene correlation from scRNA-seq data

Single-cell RNA-seq provides a unique opportunity to study stochastic cell-to-cell variability in gene expression at a near genome-wide scale. However, doing this requires normalization approaches able to retrieve from scRNA-seq data transcripts levels matching quantitatively in vivo mRNA numbers (Torre ). bayNorm imputes drop-outs that are a result of low capture efficiency using its Bayesian approach (see Supplementary Fig. S25). However as bayNorm posterior models the original counts in the cell bayNorm should be effective in inference of the full transcript distributions. With this in mind, we evaluated bayNorm performance in reconstructing true gene expression levels from a series of experimental scRNA-seq datasets that contained matched single molecule fluorescence in situ hybridization (smFISH) measurements for a series of genes. We used mean capture efficiencies estimated directly from smFISH together with gene-specific priors informed by the sequencing data (Supplementary Fig. S11). After bayNorm normalization, scRNA-seq counts reproduced accurately count distributions obtained by smFISH for several mRNAs (Fig. 2a and b). All methods captured mean smFISH counts across different genes well (Fig. 2c and d). However, noise in gene expression (coefficient of variation, CV) and expression dispersion (Gini coefficient) measured by smFISH were better captured by bayNorm compared with normalization by scaling or by several recent normalization and imputation methods (Fig. 2e–h; Bacher ; Eraslan ; Huang ; Li and Li, 2018; van Dijk ). bayNorm’s good performance could also be confirmed in a series of simulation studies (Supplementary Fig. S12).

Fig. 2.

bayNorm recovers distributions of gene expression observed by smFISH. (a) Stag3 mRNA distribution for cells grown in 2i measured by smFISH or by scRNA-seq and normalized with different methods (from Grün study). ‘Raw’ denotes unnormalized scRNA-seq data. (b) As in (a) for the LMNA gene (from Torre study). Legend as in (a). Smoothing bandwidth is 10 for every method shown in (a and b). (c) Log2 ratio between the means of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (d) As in (c) using 12 genes (Torre study). (e) Log2 ratio between the CV of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (f) As in (e) using 12 genes (from Torre study). (g) Log2 ratio between the Gini coefficients of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (h) As in (c) using 12 genes (from Torre study). For the bayNorm and SAVER normalized datasets, 20 or 5 samples were generated from posterior distributions for the Grün and the Torre studies, respectively. All normalized datasets except bayNorm and the Scaling method have been divided by the value used in bayNorm procedure. For this analysis smFISH data were normalized for variation in total transcript numbers using either cell size measurements (Grün study) or expression levels of a house keeping gene (Torre study) as detailed in see Supplementary Note S4 Estimation of gene–gene correlations is essential in network inference from scRNA-seq data. As the bayNorm prior, assumes no correlation between genes, bayNorm could underestimate the correlations. We used Torre study that contains smFISH data on gene–gene correlations to illustrate that bayNorm gene-specific priors indeed underestimate the gene-gene correlation (Supplementary Fig. S26). In comparison the adjusted SAVER correlation estimates tend to overestimate the gene–gene correlation for most pairs of genes (Supplementary Fig. S26). We believe this is due to pooling information across genes in the same cell in SAVER’s empirical Bayes approach. However, bayNorm does not inflate gene–gene correlations as observed for some imputation methods (Andrews and Hemberg, 2018b). In summary, bayNorm combined with gene-specific priors inferred directly from the scRNA-seq data, retrieves gene expression variability and gene–gene correlations matching smFISH data.

3.3 bayNorm enables accurate and sensitive DE analysis

Differential gene expression analysis in scRNA-seq studies is challenging as several factors including variability in capture efficiencies, dropout rates, sequencing depth and experimental batch effects can introduce significant, yet spurious, DE signal. Normalization and imputation approaches have, therefore, a significant impact on the sensitivity and accuracy of DE analysis protocols. Two features of the bayNorm approach have the potential to improve the performance of DE analysis. First, bayNorm posterior distribution of original counts maintains the uncertainty resulting from small capture efficiencies and could therefore reduce false positive DE discovery rates (Pimentel ). Second, the use of priors specific to each group of cells compared in the DE analysis could increase true positive discovery rates. With this in mind, we have assessed bayNorm performance in DE analysis using several experimental scRNA-seq datasets and compared it with other normalization and imputation methods. To identify DE genes we use model-based analysis of single-cell transcriptomics (MAST) (Finak ), which performs well in terms of false positives rates, precision and recall (Jaakkola ). MAST was first applied to individual sample from the bayNorm posterior distribution (3D array, Supplementary Fig. S1). Differentially expressed genes were then called based on the median of Benjamini-Hochberg adjusted P-values of the individual samples (Benjamini and Hochberg, 1995). As mentioned earlier, differences in capture efficiencies between cells is a source of technical variability that could affect DE analysis. To test bayNorm’s ability to correct for this bias, we selected the 1000 cells with the highest and lowest capture efficiencies based on total counts in a recent UMI-based scRNA-seq study of fission yeast with cell size measurements (Saint ). We then applied bayNorm to the 2000 cells using global prior estimation by pooling information across the two groups (see Section 2). In this design, the two groups of cells differ based on their capture efficiencies while the cell size in two groups is not markedly different (Fig. 3b, inset). Therefore, significant DE is not expected. Figure 3a shows the number of genes called differentially expressed as a function of increasing average expression levels using a series of normalization and imputation methods. bayNorm normalized data show almost no differentially expressed genes, outperforming all the other methods. Moreover, log2 gene expression ratios between cells of the two groups were consistently close to zero, confirming bayNorm ability to correct for biases inherent to different capture efficiencies in UMI-based datasets (Fig. 3b).

Fig. 3.

bayNorm enables robust and sensitive DE analysis. (a) Number of differentially expressed genes between the 1000 cells with the highest and the 1000 cells with the lowest total counts in Saint study (Saint ). DE genes were called using the MAST package () and plotted for six groups of genes with increasing mean expression (1—low to 6—high). (b) log2 fold-change from (a). Inset shows box plots of total count and cell sizes (as measured in Saint ) in the two groups, illustrating lack of strong correlation between scRNA-seq raw total count and cell size. (c) DE analysis using MAST for different normalization methods (Islam study) using a benchmark list of DE genes obtained from matched bulk RNA-seq data Ye . (d) DE analysis using data from Soumillon study (Soumillon ). 20, 50, 80, 100, 200 or 400 cells were selected randomly from two groups of stage-3 differentiated cells at day 0 (D3T0) or day 7 (D3T7). A list of DE genes obtained from matched bulk RNA-seq data was used as a benchmark (1000 genes with the largest magnitude of log fold-change between the D3T0 and D3T7 samples, Ye ). For bayNorm and SAVER, 3D arrays were used Sequencing depth is another parameter affecting DE analysis especially because it impacts on the dropout rates of lowly expressed genes. Moreover, differences in sequencing depth are likely to affect levels of capture efficiencies, especially for non-UMI datasets where PCR biases are not accounted for. To assess bayNorm’s ability to correct for this source of bias, we used a benchmark dataset published by Bacher that consists of non-UMI-based scRNA-seq data for two groups of cells isolated from a single culture and sequenced to a depth of either 1 or 4 million reads per cell. bayNorm and other imputation methods performed well in this setting (Supplementary Fig. S13). Finally, bayNorm corrected robustly for variability in sequencing depth when applied to a series of simulated datasets (Supplementary Figs S14 and S15; Bacher ). We have shown that bayNorm is efficient at removing spurious DE from scRNA-seq data caused by variability in capture efficiencies and sequencing depth. We next explored bayNorm performance in supporting sensitive and robust detection of genes truly regulated between samples. To do this, we used two experimental scRNA-seq datasets (Islam ; Soumillon ) and lists of benchmark DE genes derived from matched bulk RNA-seq data (Jaakkola ; Ye ). To maximize sensitivity, we used priors specific to each groups of cells in the comparison (we call this design ‘local priors’). With the first dataset, bayNorm normalized data generated an area under the curve (AUC) value as high as other normalization methods demonstrating that the approach supports sensitive DE detection (Fig. 3c). Analysis of the second dataset (UMI-based) (Soumillon ) further confirmed this observation with bayNorm performing better than all other methods (Fig. 3d). Importantly, bayNorm’s performance did not depend on the number of cells in each group, except for groups with very low numbers of cells (Fig. 3d andSupplementary Fig. S16). Finally, using a series of simulated datasets, we explored situations where the compared groups have different mean capture efficiencies and found that bayNorm supported robust DE detection in all cases (Supplementary Fig. S17). Three important parameters should be considered before bayNorm normalization: (i) the choice of priors, (ii) the choice of average capture efficiencies , iii) the choice of bayNorm output format (2D versus 3D array). Prior parameters can be either estimated for all cells across groups (global) or within each group (local). Since priors are gene specific, applying bayNorm across homogeneous cells (i.e. using global prior) allows for mitigating technical variations (Supplementary Fig. S18a and b). On the other hand, using priors estimated ‘locally’ within each group amplifies differences in signals between heterogeneous groups of cells increasing sensitivity (Supplementary Fig. S18c and d). Average capture efficiencies are specific to each scRNA-seq protocol and reflect their overall sensitivity. This value represents the ratio of the average number of mRNA molecules sequenced per cell to the total number of mRNA molecules present in an average cell. It is not always easy to determine as quantitative calibration methods such as smFISH are not widely used, and approaches based on spike-in controls have shortcomings (Vallejos ). We investigated the impact of inaccurate estimation of β on biases in DE detection. Critically we found that DE results based on bayNorm normalized data are not affected significantly by a 2-fold change of (Supplementary Figs S20 and S21). Finally, bayNorm output consists of either samples from its posterior distributions (3D array) or the modes/means of these distributions as point estimates (2D arrays). For DE analysis using MAST, 3D array outputs reduces false positive rates (FPRs) but 2D array outputs perform slightly better in terms of AUC (Supplementary Fig. S18c and d). Supplementary Figure S19 shows DE results for two other non-parametric methods: reproducibility optimized test statistic (Elo ) and Wilcoxon test (Jaakkola ). Both approaches perform equally well with 3D arrays but show variable results when applied to 2D arrays with the Wilcoxon test performing less well. In summary, our analysis demonstrates that in addition to correcting for technical biases, bayNorm also supports robust and accurate DE analysis of a wide range of experimental and simulated scRNA-seq datasets.

3.4 bayNorm correction of experimental batch effects

scRNA-seq protocols are subject to significant experimental batch effects (Tung ). bayNorm can mitigate batch effects in two ways. First, as described above, bayNorm efficiently corrects for differences in capture efficiencies which is a pervasive source of batch-to-batch variability (Hicks ). Second, the use of bayNorm data-informed priors is an efficient way to mitigate batch variation by estimating prior parameters across different batches but within the same biological condition. To investigate bayNorm’s performance for batch effect correction we use data from the Tung study, where there are three batches for each of three individuals (Tung ). We first used priors calculated within each individual, but across batches [bayNorm local (individual)]. This strategy allows for maintaining differences between individuals while minimizing batch effects as illustrated in Figure 4a and b (also see Supplementary Fig. S22). To quantify the result, we defined the ratio of the number of DE genes (detected between each pair of batches within the same individual, adjusted < 0.05) and the total number of genes (13 058) to be the FPRs (Supplementary Fig. S23). In parallel, we tested whether bayNorm also maintained differences between individuals. To do this, we detected DE genes between the iPSC lines NA19101 and NA19239 and compared it with a benchmark list of 498 DE genes (Ye ). Efficient batch effect correction is expected to minimize FPR while maximizing AUC values of DE detection between individuals. Using bayNorm with ‘within individual’ local priors (estimated across different batches within the same line) outperformed other methods in terms of correcting batch effects while maintaining meaningful biological information. As expected, using global priors (estimated across batches and individuals, bayNorm global) preserves low FPR, but reduces AUC significantly. Finally, using ‘within batch’ local priors [bayNorm local (batch)] result in higher FPRs, which is also expected.

Fig. 4.

Batch effect correction and cell type identification (a and b) each color represent a different cell line derived from a different individual. Color shades represent different batches within a line/individual. (c) Differentially expressed genes were called between lines NA19101 and NA19239 as well as different batches within each line (seven pair of comparisons in total). FDRs were averaged across the seven pairs. The vertical and horizontal dashed lines represent 0.25 and 0.75 indicative cutoffs, respectively. bayNorm was applied either across batches but within lines [‘bayNorm local (individual)’] or across all cells (‘bayNorm global’) or within each batch [‘bayNorm local (batch)’]. Global gene-specific prior parameter estimation across all cells results in clear clusters of different cell types compared with Scaling normalization using the data from Zeisel . t-SNE plots are shown based on Scaling normalization (d) and bayNorm (e). The clustering performance is quantified by Jaccard index (the value reported at the top left of each panel). For bayNorm, 1 sample of 3D array was used. (Color version of this figure is available at Bioinformatics online.) Another common use of scRNA-seq data in heterogeneous systems is to identify different cell types by clustering cells in an unsupervised manner. The Zeisel study provides a dataset where each cell is assigned to a specific cell type based on lineage markers expression, which can therefore be used as gold standard Zeisel . In Figure 4d and e, we compared 2D t-distributed stochastic neighbor embedding (t-SNE) plots (van der Maaten and Hinton, 2008) of cells based on the scaling and bayNorm methods, respectively. We used the Seurat package for cell clustering (Butler ). The Jaccard index was calculated for each method using ‘cluster_similarity’ function from R package ‘clusteval’. This analysis illustrates how bayNorm with global priors can preserve clustering of different cell types as well as the scaling methods. Moreover, this observation was confirmed using two additional datasets, where cell type annotation was not provided (Baron ; Chen ). There, we used cell labels determined based on scaling normalized data as references (see Supplementary Fig. S24).

4 Discussion

We introduced bayNorm, a versatile Bayesian approach for implementing global scaling that simultaneously provides imputation of missing values and true counts recovery of scRNA-seq data. Bayesian methods have been applied to different aspects of RNA-seq data analysis before (Hardcastle and Kelly, 2010; Huang ; Kharchenko ; Vallejos ). We showed that using the binomial model and an empirical Bayes approach to estimating gene expression priors across cells results in simulated data almost identical to experimental scRNA-seq measurements. Importantly, this suggests that zero-inflated models are not required to explain the frequency of dropout observed in scRNA-seq (see also Svensson, 2019; Soneson and Robinson, 2018). Although designed initially for UMI-containing scRNA-seq protocols, a simple scaling factor makes bayNorm applicable to non-UMI data as well. This flexibility will allow using this approach with most present scRNA-seq datasets. We showed using datasets that combine smFISH and scRNA-seq, that bayNorm is accurately recovering true gene expression across a wide range of expression levels. This approach could therefore be particularly useful for quantitative analysis of more difficult scRNA-seq datasets, such as those generated from small quiescent cells or microbes, for instance. In fact, we have recently used bayNorm successfully in the first scRNA-seq study of fission yeast (Saint ). One of the most powerful features of bayNorm is its use of gene expression priors directly calculated from gene expression values across cells. We showed that grouping cells according to experiment design or phenotypic features increased significantly the robustness and sensitivity of DE analysis. This removes almost completely the sequencing depth and capture efficiency biases, and reduces batch effects. Critically, this approach preserved accurate and sensitive detection of benchmark DE genes in contrast to some recently developed methods of batch correction. Also, where there is no prior knowledge available for cell types, using a global approach does not affect clustering of cell types. The clustering results could be improved if priors are iteratively improved using a method of simultaneous normalization and clustering as proposed in Prabhakaran . Bayesian methods have been applied to different aspects of RNA-seq data analysis before (Hardcastle and Kelly, 2010; Huang ; Kharchenko ; Prabhakaran ; Vallejos ). The approach most related to bayNorm is taken by SAVER, which uses a Poisson-Gamma model and pooling information across genes for true count recovery (Huang ). In contrast, bayNorm uses a binomial model of mRNA capture as likelihood and achieves similar or improved performance relative to SAVER on real and simulated data (Figs 2–4). Accurate estimation of cell capture efficiencies (or scaling factors) is central to most scRNA-seq normalization methods including bayNorm. Interestingly, we observed that the choice of cell-specific capture efficiencies affect how closely simulated data recovers statistics of real data. We therefore propose that comparison of drop-out rates per cell in simulated datasets and experimental data could be used as a tool to inform appropriate choice of global scaling factors and mean capture efficiency estimates. The option to tailor bayNorm priors based on phenotypic information about cell subpopulations will be a powerful asset for discovery of gene expression programs associated with specific phenotypic features of single cells such as cell size (Saint ). Finally, the concepts and mathematical framework behind bayNorm will be useful if combined with other emerging theoretical approaches such as deep learning (for instance Ding ; Eraslan ; Grønbech ; Lopez ; Wang and Gu, 2018). Overall, bayNorm provides a simple and integrated solution to remove the technical biases typical of scRNA-seq approaches, while enabling robust and accurate detection of cell-specific changes in gene expression. bayNorm has been made freely available as an R package (https://bioconductor.org/packages/release/bioc/html/bayNorm.html) released in Bioconductor (Gentleman ). Click here for additional data file.

51 in total

1. Accounting for technical noise in single-cell RNA-seq experiments.

Authors: Philip Brennecke; Simon Anders; Jong Kyoung Kim; Aleksandra A Kołodziejczyk; Xiuwei Zhang; Valentina Proserpio; Bianka Baying; Vladimir Benes; Sarah A Teichmann; John C Marioni; Marcus G Heisler
Journal: Nat Methods Date: 2013-09-22 Impact factor: 28.547

2. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.

Authors: Allon M Klein; Linas Mazutis; Ilke Akartuna; Naren Tallapragada; Adrian Veres; Victor Li; Leonid Peshkin; David A Weitz; Marc W Kirschner
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

3. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.

Authors: Saiful Islam; Una Kjällquist; Annalena Moliner; Pawel Zajac; Jian-Bing Fan; Peter Lönnerberg; Sten Linnarsson
Journal: Genome Res Date: 2011-05-04 Impact factor: 9.043

Review 4. Unravelling biology and shifting paradigms in cancer with single-cell sequencing.

Authors: Timour Baslan; James Hicks
Journal: Nat Rev Cancer Date: 2017-08-24 Impact factor: 60.716

5. scVAE: variational auto-encoders for single-cell gene expression data.

Authors: Christopher Heje Grønbech; Maximillian Fornitz Vording; Pascal N Timshel; Casper Kaae Sønderby; Tune H Pers; Ole Winther
Journal: Bioinformatics Date: 2020-08-15 Impact factor: 6.937

6. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors: Michael I Love; Wolfgang Huber; Simon Anders
Journal: Genome Biol Date: 2014 Impact factor: 13.583

7. False signals induced by single-cell imputation.

Authors: Tallulah S Andrews; Martin Hemberg
Journal: F1000Res Date: 2018-11-02

8. DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data.

Authors: Chengzhong Ye; Terence P Speed; Agus Salim
Journal: Bioinformatics Date: 2019-12-15 Impact factor: 6.937

9. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data.

Authors: Greg Finak; Andrew McDavid; Masanao Yajima; Jingyuan Deng; Vivian Gersuk; Alex K Shalek; Chloe K Slichter; Hannah W Miller; M Juliana McElrath; Martin Prlic; Peter S Linsley; Raphael Gottardo
Journal: Genome Biol Date: 2015-12-10 Impact factor: 13.583

10. Splatter: simulation of single-cell RNA sequencing data.

Authors: Luke Zappia; Belinda Phipson; Alicia Oshlack
Journal: Genome Biol Date: 2017-09-12 Impact factor: 13.583

18 in total

1. FRMC: a fast and robust method for the imputation of scRNA-seq data.

Authors: Honglong Wu; Xuebin Wang; Mengtian Chu; Ruizhi Xiang; Ke Zhou
Journal: RNA Biol Date: 2021-08-30 Impact factor: 4.766

2. Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation.

Authors: Ayoub Lasri; Vahid Shahrezaei; Marc Sturrock
Journal: BMC Bioinformatics Date: 2022-06-17 Impact factor: 3.307

3. A Bayesian factorization method to recover single-cell RNA sequencing data.

Authors: Zi-Hang Wen; Jeremy L Langsam; Lu Zhang; Wenjun Shen; Xin Zhou
Journal: Cell Rep Methods Date: 2021-12-20

Review 4. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data.

Authors: Tallulah S Andrews; Vladimir Yu Kiselev; Davis McCarthy; Martin Hemberg
Journal: Nat Protoc Date: 2020-12-07 Impact factor: 13.491

Review 5. How to Get Started with Single Cell RNA Sequencing Data Analysis.

Authors: Michael S Balzer; Ziyuan Ma; Jianfu Zhou; Amin Abedini; Katalin Susztak
Journal: J Am Soc Nephrol Date: 2021-03-15 Impact factor: 14.978

Review 6. Eleven grand challenges in single-cell data science.

Authors: David Lähnemann; Johannes Köster; Ewa Szczurek; Davis J McCarthy; Stephanie C Hicks; Mark D Robinson; Catalina A Vallejos; Kieran R Campbell; Niko Beerenwinkel; Ahmed Mahfouz; Luca Pinello; Pavel Skums; Alexandros Stamatakis; Camille Stephan-Otto Attolini; Samuel Aparicio; Jasmijn Baaijens; Marleen Balvert; Buys de Barbanson; Antonio Cappuccio; Giacomo Corleone; Bas E Dutilh; Maria Florescu; Victor Guryev; Rens Holmer; Katharina Jahn; Thamar Jessurun Lobo; Emma M Keizer; Indu Khatri; Szymon M Kielbasa; Jan O Korbel; Alexey M Kozlov; Tzu-Hao Kuo; Boudewijn P F Lelieveldt; Ion I Mandoiu; John C Marioni; Tobias Marschall; Felix Mölder; Amir Niknejad; Lukasz Raczkowski; Marcel Reinders; Jeroen de Ridder; Antoine-Emmanuel Saliba; Antonios Somarakis; Oliver Stegle; Fabian J Theis; Huan Yang; Alex Zelikovsky; Alice C McHardy; Benjamin J Raphael; Sohrab P Shah; Alexander Schönhuth
Journal: Genome Biol Date: 2020-02-07 Impact factor: 13.583