Literature DB >> 31197307

DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data.

Chengzhong Ye^1,2,3, Terence P Speed^1,4, Agus Salim^1,5,6.

Abstract

MOTIVATION: Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments.
RESULTS: We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The gain in improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model.
AVAILABILITY AND IMPLEMENTATION: The method is implemented as a publicly available R package available from https://github.com/cz-ye/DECENT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 31197307 PMCID： PMC6954660 DOI： 10.1093/bioinformatics/btz453

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Recent developments in sequencing technology have enabled high-throughput whole-transcriptome profiling at single-cell resolution. Single-cell RNA-seq (scRNA-seq) allows the quantification of gene expression of thousands of individual cells in a single experiment. It has already led to profound new discoveries that could not be have been made using data from bulk transcriptome sequencing, ranging from the identification of novel cell types to the study of global patterns of stochastic gene expression (Kolodziejczyk ; Wagner ). However, there are still many statistical challenges in drawing inferences from scRNA-seq data. Due to the small amount of starting material and the imperfect capturing of RNA molecules in current scRNA-seq experiments, failure to detect expressed transcripts in single cells is still common. This gives rise to the characteristic dropout phenomenon in scRNA-seq data, in which a gene shows zero or very low abundance in a fraction of cells in spite of moderate to high expression in others (Finak ; Hashimshony ; Ramskold ). Also, the capture rates can vary between cells and across genes (Brennecke ), showing as a major source of unwanted variation in scRNA-seq data, with the first principal component of raw counts typically exhibiting high correlation with the proportions of zero counts (Risso ). This unique feature of scRNA-seq will hinder downstream analyses if not properly modeled. Lots of effort has been made in order to alleviate this issue, including specialized normalization methods (Bacher ; Lun ), clustering algorithms (Kiselev ; Wang ; Zeisel ) and methods for differential expression analysis (Finak ; Jia ; Kharchenko ). One way to resolve this is through explicit modeling of the molecule capturing process and hence separating the biological variation of interest from unwanted variation in the experimental procedures. For instance, several methods (Huang ; van Dijk ; Wang ) are designed to recover the pre-dropout expression matrix by modeling the process from RNA molecule to read count. However, a difficulty in modeling the molecule capturing and dropout events is that this process is usually mixed up with other sources of technical variation, such as amplification and sequencing biases (Wagner ). The unique molecular identifier (UMI) barcoding approach has become increasingly popular in scRNA-seq experiments as an effective way to address this issue (Islam ; Svensson ). Random barcodes are attached to cDNA molecules during reverse transcription. Each individual molecule from a particular gene in each cell is expected to have a distinct UMI (Islam ). Therefore, after sequencing, by counting UMI barcodes instead of reads per se, the resulting UMI counts will be a more faithful representation of the original cDNA counts, with amplification and sequencing bias largely avoided. But the UMI count will still show as zero if an RNA molecule failed to convert to cDNA, or was completely lost in amplification and sequencing. As a consequence, the main source of technical variation left in UMI counts is the loss of molecules during the experimental procedure, namely, dropouts. Hence, UMI count data provides us with an opportunity to model the molecule capturing process in depth. Also, given the distinct features of UMI-based data, it is necessary to build specific models in order to perform statistical tests reliably. Currently scRNA-seq experiments mainly focus on cell-wise analyses such as clustering and trajectory inference for studying heterogeneity within cellular populations (Qiu ; Trapnell ; Zeisel ). Nevertheless, differential gene expression (DE), as one of the most common gene-wise analyses, still plays an essential role in complementing these analyses. For example, it is used to identify cluster-specific markers for identifying the cell types. It is also used to derive disease-associated gene signatures (Savas ; Sun ; Zhao ). However, DE methods originally designed for bulk RNA-seq tend to produce unreliable results due to failing to account for the extra variation in single-cell data (Jia ; Van den Berge ). Driven by this, a few DE methods have been designed specifically for scRNA-seq data. All of them use some strategy to deal with the large variation and number of zero observations. However, most of them do not distinguish biological from technical factors that are causing the phenomenon. For example, SCDE (Kharchenko ) uses a mixture model to distinguish counts affected by dropout from the rest of the data. This model almost always assigns a probability of one that a zero count belongs to the dropout component, in essence assuming all observed zeroes to be technical. MAST (Finak ) uses a two-part generalized linear model (GLM) in which the dropout rates are adjusted by the inclusion of the observed fraction of non-zero counts as a term in their regression model. This still does not differentiate the dropouts from real biological zeros. Additionally, the effect of dropout events is likely to be non-linear, especially for genes with low to moderate expression (Bacher ), and so the inclusion of simple linear term that represents capture rates in the regression model is unlikely to be optimal. Zero-inflated negative binomial (ZINB)-WaVE (Van den Berge ) uses a zero-inflated model directly fitted to the observed data to derive observation weights for adjusting bulk DE methods. Only Jia proposed a DE method, TASC, that relies on external RNA spike-in data (Jiang ) to fit a technical variation model in order to explicitly cater for dropouts, thus enabling separation of the biological variation for DE analysis. They showed improved performance of their method compared with methods that perform DE analysis directly using the observed data. Note that the methods mentioned so far are not specifically designed for UMI-count data. There are two existing methods that considers the unique features of UMI-based experiments: Monocle2 (Qiu ; Trapnell ) and NBID (Chen ). They both fit negative binomial (NB) models directly to the observed UMI count without any explicit modeling of dropouts. Here we propose a novel model for the DE analysis of UMI-based scRNA-seq data. Leveraging the features UMI-count data, we are able to model the molecule capturing process precisely. We build a capture model to account for the gene- and cell-specific properties of molecule capturing. This allows us to perform DE analysis on the inferred pre-dropout distributions of RNA molecules. We named our method Differential Expression with Capture Efficiency adjustmeNT (DECENT). DECENT can use the external RNA spike-in data to calibrate the capture model, but also works without spike-ins. In this paper, we describe the DECENT model and benchmark it against existing methods using both simulated data and four published UMI-based scRNA-seq datasets. The results showed improved performance of DECENT in various settings when compared with existing methods.

2 Materials and methods

2.1 Model formulation

DECENT assumes that UMIs (Islam ) have been used in the scRNA-seq experiment. Our statistical model is hierarchical, involving modelling the observed count Z of mRNA molecules ‘captured’ from gene i in cell j, and the unobserved total mRNA count Y of all mRNA molecules from gene i in cell j that could have been captured had there been no molecule dropout. We will subsequently use the term ‘pre-dropout count’ when referring to Y. Cells will in general be of more than one type, and our inferences will concern the mean parameters μ of the distribution of the unobserved pre-dropout count across cell types. Other parameters in our model for the unobserved pre-dropout count are zero inflation parameters π0 and (over) dispersion parameters ψ specific to gene i, and a size-factor parameter s specific to cell j, to account for differences in total mRNA molecules across cells. To summarize, we use ZINB distribution with the following probability density function to model the unobserved pre-dropout count: Note that, as , our ZINB model reduced to a zero-inflated Poisson model (ZIP). When π0 =0, our model reduces to the NB model and finally when π0 =0 and , our model reduces to the Poisson model. The second part of the DECENT model involves the specification of the capture model for modelling the distribution of the observed data Z given the unobserved pre-dropout count Y. We use a binomial model for the molecule capture, assuming where η is the capture rates for gene i within cell j. The probability density function for the capture model is given by, Finally, to account for variability in the capture rate, we assume a beta prior for the capture rate parameter η with the prior mean equal to the cell-specific capture rates η and the prior variance characterized by a dispersion parameter . When , our capture model reduces to the standard Binomial model with all genes within a cell having the same capture rate, . In our DECENT model, backed up by empirical evidence from published scRNA-seq datasets, we assume that the dispersion parameter is related to gene abundance through a logistic linear model: where and are cell-specific parameters estimated from the data. In our experience with real datasets, we usually found that assuming global parameters and is adequate. Furthermore, from real datasets we also found that τ1 <0, which suggests that within a cell, capture rates for highly abundant genes vary around the cell-specific capture rates much less than the capture rates for low or moderately abundant genes. As an alternative to estimating a global parameter, DECENT also has an option for estimating cell-specific if there is evidence that the parameters vary significantly between cells. Our hierarchical DECENT model specification can be summarized as follows: where the parameters of the beta prior for η satisfy Next, we describe empirical evidence that motivated us to choose this specification for DECENT.

2.1.1 Empirical evidence for beta-binomial capture model

We investigated the appropriateness of our capture model using six ERCC spike-in datasets, consisting of three plate-based (Grun ; Tung ; Zeisel ) and three droplet-based experiments (Klein ; Macosko ; Zheng ) (Supplementary Table S1). We use spike-ins because their average (nominal) pre-dropout count are known, with no biological variation between cells expected. To model the distribution of their unobserved pre-dropout counts, we thus use a Poisson distribution with mean equal to the nominal count for each spike-in. The full results from all six datasets are shown in Supplementary Figure S2. But as an example, we will use results from the Tung et al. data as shown in Figure 1. Figure 1a shows that the simple Binomial capture model does not provide an adequate fit to the data because the deviance statistic (see Supplementary Materials, pp. 13–14) for this model is much larger than the expected distribution when the model is adequate. Further analyses suggests that variation in capture rates between spikes contribute to this inadequate fit (Supplementary Fig. S1). We therefore model this extra variation by allowing capture rates to have a beta prior with cell-specific mean and dispersion parameter that differs from one spike to another. We find that the dispersion parameter can be well-approximated by a linear logistic model as a function of the spikes’ abundance (Fig. 1b). We then compare the fit of the beta-binomial capture model with the simple binomial model and find that the beta-binomial model provides much better fit to the data (Fig. 1c). The same analyses were carried out using the spike-in data from the other five experiments and similar results were obtained (Supplementary Fig. S2).

Fig. 1.

Modeling extra-binomial variation in the molecule capturing process. We evaluate the binomial and beta-binomial capture models using the ERCC spike-in data from the Tung et al. experiment. (a) The observed distribution (red) of deviances with cell-wise binomial capture model shows notable deviation from the expected χ2 distribution under the null hypothesis. This indicates inadequacy of the binomial capture model. (b) Modeling the relationship between the spike-in nominal count c and the dispersion parameter ρ in the beta-binomial capture model. If the parameter is estimated in a spike-in specific manner, a high correlation between the ρ estimates and the true pre-dropout mean abundance, namely the nominal count c, can be observed, which are shown as black points. We build a cell-wise linear model to characterize this relationship. Each blue line represents a fitted cell-wise model, which is shown to adequately describe this relationship. (c) A scatter plot comparing the cell-wise deviances under the binomial and beta-binomial capture models to assess goodness-of-fit. Deviances were standardized by dividing by the degrees of freedom to enable comparison, and logged. The blue and red marginal densities represent the observed distributions of deviances under the two models, respectively. It can be seen that the beta-binomial capture model fits better than the binomial model in the majority of the cells

2.1.2 Empirical evidence for pre-dropout count distribution

The ZINB distribution has two extra parameters when compared with the simple Poisson distribution for count data, namely the overdispersion and the zero-inflation parameters. To examine whether the ZINB distribution is needed for modelling the unobserved pre-dropout count, we used two scRNA-seq datasets (Tung ; Zeisel ). To investigate the need for overdispersion parameter, we first fit the DECENT model assuming a Poisson pre-dropout distribution to the data without considering zero-inflation. We found that the expected variances of most genes were noticeably lower than the observed values for the Zeisel et al. dataset. This extra variation was well-modeled by assuming NB as pre-dropout count distribution, hence indicating the need for the dispersion parameter (Supplementary Fig. S3a). For the Tung et al. data, the expected variances under the Poisson pre-dropout count assumption were already close to the observed values for most genes, showing little need for the extra parameter (Supplementary Fig. S3b). This suggests that overdispersion in pre-dropout counts is dataset-specific and depends on the amount of biological variability in the sample. The Tung et al. data used here are from one iPSC cell line where cells were highly homogeneous and hence lack biological variation. On the other hand, the Zeisel et al. data are from mouse brain tissue, which has a complex cellular composition. To investigate the need for zero-inflation parameter, we further fitted the DECENT model assuming a ZINB pre-dropout count distribution and compared this with the model that assumes NB pre-dropout count distribution. We performed chi-square goodness-of-fits test on the DECENT models with ZINB and NB pre-dropout count distribution to assess their adequacy. Consistent with previous findings (Chen ; Vieth ), the majority of genes do not appear to require zero-inflated model. However we still found a small number of genes in both datasets in which models with ZINB provide a more adequate fit than NB (Supplementary Fig. S4). We also investigated the appropriateness of the ZINB distribution using a single-molecule fluorescence in situ hybridization (smFISH) dataset. The smFISH technology allows precise quantification of RNA molecules from a list of targeted genes. This technology can achieve nearly 100% sensitivity detection of the RNA molecules (Raj ). Hence, the smFISH count data is a good approximation to the pre-dropout molecule counts that would normally be unobserved. We used the data from an experiment that profiled 33 marker genes in mouse somatosensory cortex (Codeluppi ). We examined three of the clusters identified by the authors, Oligodendrocyte Mature, Pyramidal L4 and Inhibitory Vip, finding most of the gene count distributions to be significantly overdispersed relative to the Poisson (Supplementary Fig. S5a). Yet we did not find zero-inflated genes in these clusters. This is quite possibly because the targeted genes are all canonical markers, which are expected to mostly exhibit constitutive expression and hence unlikely to have inflated zeros caused by transcriptional bursting. However, heterogeneity within a population can also result in zero-inflation, which is common in actual DE analysis. When we looked at 3 major cell types with higher within-group heterogeneity (Oligodendrocytes, Pyramidal neurons and Inhibitory neurons), we identified 2, 1 and 2 out of the 33 genes to have significant zero-inflation (Supplementary Fig. S5b).

2.2 Estimating capture rate parameters

Our capture model requires estimates of a cell-specific capture rate that will be used as part of the beta prior in the beta-binomial capture model. This capture rate needs to be estimated externally outside the main algorithm, for reasons that we will explain below. When the dataset contains spike-ins, their data are used to estimate the capture rates. Suppose we added n spike-ins at the known concentrations into cell j and subsequently observe molecules respectively. The cell-specific capture efficiency for any cell j is estimated as the proportion of molecules observed after sequencing relative to the total number of molecules initially added: This is the method of moments estimator of the capture rate η under the beta-binomial-Poisson model for spike-ins. However, many scRNA-seq data do not have spike-ins. Interestingly, we found that if we specified a set of inexact capture rates, other components of the model will compensate for the inaccuracy and produce DE results almost as reliable as if we had the correct values. This is due to a property of the our model that is explained below: Let Y be the unobserved pre-dropout count where and , with the capture rate given by . It turns out that the marginal distribution of Z in this case is almost indistinguishable from the marginal distribution of Z when the capture rate is and the size-factor parameter of the ZINB distribution is s′=s(η/η′) (see Supplementary Fig. S6). This feature means that if we incorrectly specify the capture rate as rather than η, the misspecification can be approximately corrected by scaling the size factor estimates accordingly. Nevertheless, it is still preferable to get capture efficiency estimates as close as possible to the true value. This identifiability issue involving a cell’s size-factor and capture rate parameter also means that it is nearly impossible to simultaneously estimate these parameters inside the main algorithm. Our approach here is to estimate the capture rate parameters before invoking the expectation conditional maximization (ECM) algorithm, and given these, we estimate the size-factor parameters within the ECM algorithm. Motivated by the above results and our experience with real datasets showing that capture efficiency is the biggest factor contributing to the variation in the observed library sizes, we devised a method for generating functional capture rates when spike-ins are not available. We refer to this as the ranked random capture efficiency. To use this method, users will need to specify an interval of plausible capture rate parameters in their experiment. DECENT uses as its default and we have found this to work well in all the datasets without spike-ins that we have analyzed. As an alternative, users can refer to Ziegenhain for guidance on plausible range of capture rate parameters across various different sequencing protocols. Let the lower and upper bounds of capture rates be ηmin and ηmax, respectively. The cell-specific capture rates are specified as follows: Compute library size for each cell and denote the of these by , where n is the number of cells. To minimize the impact of a few genes having very large counts, we can also use trimmed sums instead of full sums here. Denote the minimum and maximum library size as Lmin and Lmax. Calculate weight for cell j as w=L−Lmin/Lmax−Lmin. Estimate the capture efficiency for cell j as . This ensures that cells with larger library size will have larger capture efficiency and the capture efficiency estimates are bounded within the interval.

2.3 Parameter estimation and DE analyses

DECENT’s main aim is to estimate parameters of the unobserved pre-dropout count and perform statistical inference on these parameters for the purpose of identifying differentially expressed genes (DEGs). Because the pre-dropout count is unobserved, we use an ECM algorithm to estimate these parameters (see Supplementary Materials for details). The algorithm works as follows: First, capture rate parameters are estimated prior to invoking the ECM algorithm. Given the capture rates, initial estimates of cell-specific size factor parameters s and gene-specific parameters , as well as , the parameters that control the uncertainties in the beta prior for capture rates are calculated using method-of-moment approaches, assuming no DE genes and a simple Binomial capture model. Perform an E-step to estimate the following conditional expectations given the observed count: and . For each gene, perform an M-step to update the gene-specific parameters (see Supplementary Materials for details). For each cell, perform an M-step to update the cell-specific size-factor parameters s (see Supplementary Materials for details). Perform an M-step to update the (possibly) cell-specific parameters of the beta prior for capture rates, (see Supplementary Materials for details). Iterate between step [2] to [5] until convergence is achieved. To facilitate DE analysis, the gene-wise mean parameter is assumed to depend on the cell type or group through a log-linear model: where X is the design matrix providing group information and are the coefficients. We also allow the mean parameter to depend on cell-wise covariates W to remove unwanted variation (e.g. batch effects, cell-cycle phases). In the most common two group comparisons, we have where x is simply the binary indicator of cellular group, is logarithm of mean parameter for gene i in the reference cell type, is the log fold-change parameter for gene i and the is the gene-specific regression coefficients that adjust the DE analysis for the unwanted, cell-specific factors w. Differential expression across two cellular groups for the gene i is assessed by testing the hypotheses: using the likelihood ratio test statistic, where is the maximum likelihood estimator (MLE) of θ under the restriction that is the MLE under the unrestricted model and is the log-likelihood based on the observed data Z [see Supplementary Materials; Equation (9)]. For simple two cell type comparisons, the statistic is approximately distributed as under H0. More generally, when performing DE across p different cell types or conditions, the statistic is approximately distributed as under H0.

3 Results

3.1 Benchmarking using simulated data

We simulated 20 datasets, each consisting of 500 cells belonging to 2 cell types (224 cells from cell type 1 versus 276 cells from cell type 2) with 3000 endogenous genes and 50 spike-ins. The observed count were generated under the DECENT model using parameters estimated from Tung’s dataset (see Supplementary Materials for details). In each dataset, we set ∼10% of the genes to be DEGs. Figure 2 shows that DECENT estimates gene-specific pre-dropout proportion of zeroes and variance, as well as the actual pre-dropout counts unbiasedly. Figure 3 shows that DECENT’s performance in detecting DEGs also appear to be competitive when compared with existing methods, namely SCDE (Kharchenko ), MAST (Finak ), Monocle2 (Qiu ; Trapnell ), ZINB-WaVE adjusted edgeR (Van den Berge ) and edgeR (McCarthy ). Over the 20 datasets, the mean(SD) of the partial area under the receiver operating characteristic (pAUROC) for DECENT is 0.708(0.001), followed by MAST with 0.687(0.001) (see Supplementary Table S3). DECENT’s performance also appears to be relatively robust to misspecification of capture rates parameters (Supplementary Fig. S7).

Fig. 2.

Inferring pre-dropout molecule counts in simulation. (a) Scatter plot comparing for each gene the estimated proportion of zeros of the fitted pre-dropout distribution with the true proportion of zeros in the pre-dropout counts. (b) Scatter plot comparing the expected variance of the fitted pre-dropout distribution with the true gene-wise variance in the pre-dropout counts. (c) Scatter plot comparing the expected value of pre-dropout counts (see Supplementary Methods for details) under the fitted model with the true pre-dropout counts. We shows a random subsample of 5% of all the non-zero counts. The estimated pre-dropout counts used to calculate (a) and (b) were based on single imputation, i.e. drawing a single value from the conditional pre-dropout distribution for each gene and each cell given the observed data. The estimated pre-dropout counts shown in (c) were calculated as the expected value of the conditional pre-dropout distribution (see Supplementary Methods)

Fig. 3.

Differential expression analysis of simulated data. (a) Partial receiver operating characteristic curve for differential expression methods on the simulated data. (b) False discovery rate curves for differential expression methods on the simulated data. Both curves only focus on the low P-value region, since other regions were of little interest in actual DE analysis. Z-edgeR stands for ZINB-WaVE-adjusted edgeR

3.2 Benchmarking using real data

We further benchmarked our method against existing methods using real datasets. The difficulty in benchmarking using real datasets is that the genuine DEGs are usually unknown. In order to obtain a credible list of genuine DEGs, we searched for scRNA-seq datasets that have matching bulk RNA-seq experiments, which means a bulk RNA-seq was also performed using cells from exactly the same tissues or cell lines. We found four such experiments in total that also used UMI. Then a DEG list derived from these bulk data can be used as the reference set for benchmarking. These includes two plate-based experiments and two droplet-based experiments, with different scales, sources of tissues or cell lines and observed proportion of zero counts (Supplementary Table S2) (Chen ; Savas ; Soumillon ; Tung ). We use these datasets to benchmark DECENT against existing methods (see Supplementary Materials: Benchmarking for more details about these datasets). The same existing methods were benchmarked using all four datasets, except that we also applied TASC to the Tung et al. data where spike-ins are available. All of these datasets have gone through careful quality control steps by the authors of the original publications. Therefore, we do not further filter any cells. The only further filtering we perform is filtering very low abundance genes that correspond to the second peak in the histogram of average gene count distribution (Supplementary Fig. S13). We fitted the DECENT model to all four datasets and found that a global parameter was adequate for all datasets, except Tung’s where substantial variation in the τ0 and τ1 parameters was observed (Supplementary Fig. S14). As shown in Figures 4 and 5, DECENT showed superior performance on all four datasets. In particular, DECENT performs better than other methods for the Soumillon and Chen datasets where the dispersion parameters for the beta prior are larger and thus the capture models are more overdispersed relative to the Binomial model (Supplementary Fig. S15).

Fig. 4.

Fig. 5.

False discovery rate (FDR) curves for differential expression methods on real datasets. Evaluating the performance different method by FDR curves using the (a) Tung et al., (b) Soumillon et al., (c) Savas et al. and (d) Chen et al. datasets. Bulk DEGs were considered as conditional positives. DECENT consistently showed low number of false discoveries at the same number of declared DEGs across all four datasets. Again, only the top one thousand DEGs were considered to focus on the region of interested. DE methods are denoted by different colors. Z-edgeR denotes ZINB-WaVE-adjusted edgeR. TASC requires spike-ins and was only evaluated using the Tung et al. data

Partial receiver operating characteristic curves for differential expression methods on real datasets. Evaluating the performance of different methods by partial receiver operating characteristic curves using the (a) Tung et al., (b) Soumillon et al., (c) Savas et al. and (d) Chen et al. datasets. DEGs from matching bulk RNA-seq data were used as gold-standard for benchmarking. DECENT achieves competitive accuracy of identifying genuine DEGs in all four datasets. We used pROC to focus on the low P-value region with high specificity. DE methods are denoted by different colors. Z-edgeR stands for ZINB-WaVE-adjusted edgeR. TASC requires spike-ins and was only evaluated using the Tung et al. data False discovery rate (FDR) curves for differential expression methods on real datasets. Evaluating the performance different method by FDR curves using the (a) Tung et al., (b) Soumillon et al., (c) Savas et al. and (d) Chen et al. datasets. Bulk DEGs were considered as conditional positives. DECENT consistently showed low number of false discoveries at the same number of declared DEGs across all four datasets. Again, only the top one thousand DEGs were considered to focus on the region of interested. DE methods are denoted by different colors. Z-edgeR denotes ZINB-WaVE-adjusted edgeR. TASC requires spike-ins and was only evaluated using the Tung et al. data Among the other methods, MAST showed stable and generally good performance across datasets, while the performance of SCDE appeared to be dataset-specific, showing inadequacy for droplet-based experiments. The Monocle NB-based model based on observed UMI counts did not show satisfactory performance. The ZINB-WaVE adjustment of edgeR did not show noticeable improvements over standard edgeR for three out of four datasets. But it remarkably outperformed edgeR on the Chen et al. data, where both molecule counts and the cell numbers were high. To demonstrate the merit of performing DE analysis using a inferred pre-dropout rather than the observed expression, we selected a few genuine DEGs in the Tung et al. data that are detected by our method and not the others and compared their expression levels between the two cellular groups using either the observed counts or inferred pre-dropout counts. We discovered that the differential expression between two groups became more prominent in the pre-dropout counts (Supplementary Fig. S8). ERCC spike-ins were available in Tung et al. data. We thus used capture rates estimated from spike-ins for the result shown. This dataset also enabled us to examine how specifying the ranked random capture rates impacts DE performance on real data. We performed DECENT DE analysis again using the ranked random capture rates specifying the range as half, the same and 1.5 times the range of the spike-in estimates. The results turned out to be in concordance with the simulation studies. Although optimal performance was achieved when capture rates estimated from spike-ins were used, there were only small decreases in performance when using the ranked random capture rates (Supplementary Fig. S9). This convincingly demonstrated the viability of using the spike-in capture rates for endogenous RNA and that DECENT’s DE performance is also robust to misspecified capture rates. For the Soumillon et al. data, the median of the log fold-change estimates deviates from zero when the standard MLEs were used to estimate the cell size factors s. This default size factor estimator effectively performs library size normalization on the inferred pre-dropout counts. The bias is greatly reduced when using the trimmed mean of M values method (Robinson and Oshlack, 2010) to estimate the size factors instead and the overall performance of DECENT was slightly improved (Supplementary Fig. S10). This suggests that different datasets may require different normalization strategies, and highlights the flexibility of our method with regards to normalization. The benchmarking so far was based on two group comparisons. DECENT performs statistical tests under the under the well-established GLM framework and can readily accommodate more complex experimental designs. The Soumillon et al.’s data are a time course experiment, with three time points involved in adipose stem cell differentiation. This allowed us to have a glance at how different DE methods perform on more complex UMI-based scRNA-seq experiments beyond two-group comparisons. We tested the hypothesis that expression of a gene is constant across the three time points. Except for SCDE, which is designed only for two group comparison, and TASC, which requires spike-ins, other methods were compared in this setting. The reference genuine DEGs across the three time points were also derived from the matching bulk experiments. DECENT again outperformed all other methods with an even more pronounced advantage (Supplementary Fig. S11). In terms of controlling type I error, our in-silico investigation (see Supplementary Materials and Fig. S12) demonstrates that DECENT controls the type I error quite well.

4 Discussion

We presented DECENT, a novel statistical method for performing DE analysis on UMI-based scRNA-seq data. UMI count data have provided us with an excellent opportunity to model the molecule capturing process. We found that the technical variation arising in this process can be characterized by a gene and cell-specific beta-binomial capture model. We were able to perform DE analysis on the inferred pre-dropout counts, and achieve good performance. Our model is usable either with or without spike-ins and is compatible with different normalization strategies. Also, we can draw on established statistical theory and use the model for analyzing data from experiments and studies more complex than two group comparisons. Used in a three group setting, the model gave promising results. Adding more cell-level covariates is straightforward (see Section 2) and is catered for in our software. External RNA spike-ins, such as the ERCC spike-ins (Jiang ) can give valuable insights into the technical variation in scRNA-seq data. They have been used in this way in some scRNA-seq methods (Jia ; Lun ). However, spiked-in molecules differ from endogenous transcripts in properties such as overall length, and length of the poly(A) tract, and in their technical variation such as capture rates (Svensson ). This raises the question of whether and how to use spike-ins in analyses like ours. When they are available we use them to estimate the capture rates in our model, that is, to center the beta-binomial distribution in each cell. During the development of our beta-binomial capture model, we found more variation in the data than would be found in a cell-specific binomial capture model. This extra variation is more likely to be due to spike-specific biases in capture rates rather than due to random noise (Supplementary Fig. S2). However, unlike cell-specific capture rates, the estimated spike-specific biases cannot be generalized to endogenous genes. Indeed we are unable to estimate the such gene-specific biases using only gene abundance, because it is not separable from a gene’s mean expression. Such a separation would only be achievable if extra information was available. For example, it is plausible that capture rates would depend on features of the gene sequence such as GC-content and the length of the poly(A) tract. A more refined capture model might then be built by modeling the relationship between these gene-specific features and the gene-specific biases of capture rates. Fortunately, our method is flexible enough to permit the amount of over-dispersion in the capture process to differ between the spike-ins and endogenous genes to reflect any differences in the capture process of the two types of molecules. In this way we deal with the issue just mentioned. Although multilevel models fitted with an ECM algorithm are intrinsically computationally intensive, DECENT has achieved acceptable speed with a series of acceleration approaches such as a gaussian quadrature approximation for large integration and parallelization of all the main steps. For instance, our simulated data with 500 cells and 3000 genes took 18 min, while the largest dataset, Chen et al. with 6875 cells and 12 929 genes took ∼8 h to finish on a 28-core XENON Radon Duo R1885 server node with Intel(R) Xeon(R) E5-2690 v4 CPUS @ 2.60 GHz. Some existing models for scRNA-seq allow tests beyond the comparison of means, such as comparisons of zero fractions, of biological variation or even the overall distributions (Korthauer ; Wang ; Wu ). But there remains a difficulty in assessing the type I error control and power of these tests due to lack of ground-truth. Single molecule FISH technology is under rapid development and is able to produce accurate measurements of distributions, biological variation and the zero fractions. As the amount of such data and number of genes profiled in this way increases, we should soon have the opportunity to assess these tests objectively. While DECENT focuses on performing reliable statistical tests concerning gene mean abundance, it can be easily be extended to carry out other types of tests. For example, we can permit the zero-inflation parameter in the pre-dropout distribution to be a function of cell type. Then a linear logistic model can be used to test for biological differences in zero inflation. However, some alteration of the parameter estimation strategy may be needed to achieve valid testing results. We have not incorporated any forms of Empirical Bayes (EB) in the estimation of DECENT model parameters. Several bulk RNA-seq methods for differential expression such as edgeR (McCarthy ) and DESeq2 (Love ) use EB shrinkage to stabilize estimates of gene-specific dispersion parameter. Among methods for scRNA-seq data, MAST (Finak ) uses EB to shrink the gene-specific variance parameter. Given that scRNA-seq data are very sparse, we certainly think there is potential benefit in using EB to improve DECENT’s performance. One major challenge is there is no natural conjugate prior for the dispersion parameter of the ZINB or NB distribution that we use to model the count data. This is in contrast to MAST that uses Gaussian distribution to model the log(TPM) + 1 data and therefore able to take advantage of the inverse Gamma conjugate prior for the variance parameter. DECENT also estimates the parameters associated with the unobserved rather than the observed data, which makes it less straightforward to introduce Bayes or EB variants. Our current thinking is that to implement EB within DECENT, we need to either use a form of variational Bayes (Beal, 2003) or something similar to the weighted likelihood method (Robinson and Smyth, 2007) that has been successfully used for bulk RNA-seq data. Click here for additional data file.

42 in total

1. Quantitative single-cell RNA-seq with unique molecular identifiers.

Authors: Saiful Islam; Amit Zeisel; Simon Joost; Gioele La Manno; Pawel Zajac; Maria Kasper; Peter Lönnerberg; Sten Linnarsson
Journal: Nat Methods Date: 2013-12-22 Impact factor: 28.547

2. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Authors: Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

3. Comparative Analysis of Single-Cell RNA Sequencing Methods.

Authors: Christoph Ziegenhain; Beate Vieth; Swati Parekh; Björn Reinius; Amy Guillaumet-Adkins; Martha Smets; Heinrich Leonhardt; Holger Heyn; Ines Hellmann; Wolfgang Enard
Journal: Mol Cell Date: 2017-02-16 Impact factor: 17.970

4. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification.

Authors: Tamar Hashimshony; Florian Wagner; Noa Sher; Itai Yanai
Journal: Cell Rep Date: 2012-08-30 Impact factor: 9.423

5. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.

Authors: Davis J McCarthy; Yunshun Chen; Gordon K Smyth
Journal: Nucleic Acids Res Date: 2012-01-28 Impact factor: 16.971

6. SCnorm: robust normalization of single-cell RNA-seq data.

Authors: Rhonda Bacher; Li-Fang Chu; Ning Leng; Audrey P Gasch; James A Thomson; Ron M Stewart; Michael Newton; Christina Kendziorski
Journal: Nat Methods Date: 2017-04-17 Impact factor: 28.547

7. Batch effects and the effective design of single-cell gene expression studies.

Authors: Po-Yuan Tung; John D Blischak; Chiaowen Joyce Hsiao; David A Knowles; Jonathan E Burnett; Jonathan K Pritchard; Yoav Gilad
Journal: Sci Rep Date: 2017-01-03 Impact factor: 4.379

8. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.

Authors: Aaron T L Lun; Fernando J Calero-Nieto; Liora Haim-Vilmovsky; Berthold Göttgens; John C Marioni
Journal: Genome Res Date: 2017-10-13 Impact factor: 9.043

9. SAVER: gene expression recovery for single-cell RNA sequencing.

Authors: Mo Huang; Jingshu Wang; Eduardo Torre; Hannah Dueck; Sydney Shaffer; Roberto Bonasio; John I Murray; Arjun Raj; Mingyao Li; Nancy R Zhang
Journal: Nat Methods Date: 2018-06-25 Impact factor: 28.547

10. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications.

Authors: Koen Van den Berge; Fanny Perraudeau; Charlotte Soneson; Michael I Love; Davide Risso; Jean-Philippe Vert; Mark D Robinson; Sandrine Dudoit; Lieven Clement
Journal: Genome Biol Date: 2018-02-26 Impact factor: 13.583

12 in total

1. Single-Cell Analysis of the Gene Expression Effects of Developmental Lead (Pb) Exposure on the Mouse Hippocampus.

Authors: Kelly M Bakulski; John F Dou; Robert C Thompson; Christopher Lee; Lauren Y Middleton; Bambarendage P U Perera; Sean P Ferris; Tamara R Jones; Kari Neier; Xiang Zhou; Maureen A Sartor; Saher S Hammoud; Dana C Dolinoy; Justin A Colacino
Journal: Toxicol Sci Date: 2020-08-01 Impact factor: 4.849

2. SAREV: A review on statistical analytics of single-cell RNA sequencing data.

Authors: Dorothy Ellis; Dongyuan Wu; Susmita Datta
Journal: Wiley Interdiscip Rev Comput Stat Date: 2021-05-20

3. scClassify: sample size estimation and multiscale classification of cells using single and multiple reference.

Authors: Yingxin Lin; Yue Cao; Hani Jieun Kim; Agus Salim; Terence P Speed; David M Lin; Pengyi Yang; Jean Yee Hwa Yang
Journal: Mol Syst Biol Date: 2020-06 Impact factor: 11.429

4. Atlas of breast cancer infiltrated B-lymphocytes revealed by paired single-cell RNA-sequencing and antigen receptor profiling.

Authors: Qingtao Hu; Yu Hong; Pan Qi; Guangqing Lu; Xueying Mai; Sheng Xu; Xiaoying He; Yu Guo; Linlin Gao; Zhiyi Jing; Jiawen Wang; Tao Cai; Yu Zhang
Journal: Nat Commun Date: 2021-04-12 Impact factor: 14.919

5. What are the applications of single-cell RNA sequencing in cancer research: a systematic review.

Authors: Lvyuan Li; Fang Xiong; Yumin Wang; Shanshan Zhang; Zhaojian Gong; Xiayu Li; Yi He; Lei Shi; Fuyan Wang; Qianjin Liao; Bo Xiang; Ming Zhou; Xiaoling Li; Yong Li; Guiyuan Li; Zhaoyang Zeng; Wei Xiong; Can Guo
Journal: J Exp Clin Cancer Res Date: 2021-05-11

6. Statistical methods for analysis of single-cell RNA-sequencing data.

Authors: Samarendra Das; Shesh N Rai
Journal: MethodsX Date: 2021-11-17

7. DISC: a highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning.

Authors: Yao He; Hao Yuan; Cheng Wu; Zhi Xie
Journal: Genome Biol Date: 2020-07-10 Impact factor: 13.583

8. A systematic evaluation of single-cell RNA-sequencing imputation methods.

Authors: Wenpin Hou; Zhicheng Ji; Hongkai Ji; Stephanie C Hicks
Journal: Genome Biol Date: 2020-08-27 Impact factor: 13.583

9. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data.

Authors: Wenhao Tang; François Bertaux; Philipp Thomas; Claire Stefanelli; Malika Saint; Samuel Marguerat; Vahid Shahrezaei
Journal: Bioinformatics Date: 2020-02-15 Impact factor: 6.937

10. A Comprehensive Survey of Statistical Approaches for Differential Expression Analysis in Single-Cell RNA Sequencing Studies.

Authors: Samarendra Das; Anil Rai; Michael L Merchant; Matthew C Cave; Shesh N Rai
Journal: Genes (Basel) Date: 2021-12-02 Impact factor: 4.096