Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data.

Nils Eling¹, Arianne C Richard², Sylvia Richardson³, John C Marioni⁴, Catalina A Vallejos⁵.

Abstract

Cell-to-cell transcriptional variability in otherwise homogeneous cell populations plays an important role in tissue function and development. Single-cell RNA sequencing can characterize this variability in a transcriptome-wide manner. However, technical variation and the confounding between variability and mean expression estimates hinder meaningful comparison of expression variability between cell populations. To address this problem, we introduce an analysis approach that extends the BASiCS statistical framework to derive a residual measure of variability that is not confounded by mean expression. This includes a robust procedure for quantifying technical noise in experiments where technical spike-in molecules are not available. We illustrate how our method provides biological insight into the dynamics of cell-to-cell expression variability, highlighting a synchronization of biosynthetic machinery components in immune cells upon activation. In contrast to the uniform up-regulation of the biosynthetic machinery, CD4+ T cells show heterogeneous up-regulation of immune-related and lineage-defining genes during activation and differentiation.

Keywords: Bayesian; immune activation; single-cell RNA sequencing; statistics; transcriptional noise; variability

Year: 2018 PMID: 30172840 PMCID: PMC6167088 DOI: 10.1016/j.cels.2018.06.011

Source DB: PubMed Journal: Cell Syst ISSN: 2405-4712 Impact factor: 10.304

Introduction

Heterogeneity in gene expression within a population of single cells can arise from a variety of factors. Structural differences in gene expression within a cell population can reflect the presence of sub-populations of functionally different cell types (Zeisel et al., 2015, Paul et al., 2015). Alternatively, in a seemingly homogeneous population of cells, the so-called unstructured expression heterogeneity can be linked to intrinsic or extrinsic noise (Elowitz et al., 2002). Changes in physiological cell states (such as cell cycle, metabolism, abundance of transcriptional and translational machinery, and growth rate) represent extrinsic noise, which has been found to influence expression variability within cell populations (Miller, 1981, Keren et al., 2015, Buettner et al., 2015, Zeng et al., 2017). Intrinsic noise can be linked to epigenetic diversity (Smallwood et al., 2014), chromatin rearrangements (Buenrostro et al., 2015), as well as the genomic context of single genes, such as the presence of TATA-box motifs and the abundance of nucleosomes around the transcriptional start site (Hornung et al., 2012).

Single-cell RNA sequencing (scRNA-seq) generates transcriptional profiles of single cells, allowing the study of cell-to-cell heterogeneity on a transcriptome-wide (Grün et al., 2014) and single gene level (Goolam et al., 2016). Consequently, this technique can be used to study unstructured cell-to-cell variation in gene expression within and between homogeneous cell populations (i.e., where no distinct cell sub-types are present). Increasing evidence suggests that this heterogeneity plays an important role in normal development (Chang et al., 2008) and that control of expression noise is important for tissue function (Bahar Halpern et al., 2015). For instance, molecular noise was shown to increase before cells commit to lineages during differentiation (Mojtahedi et al., 2016), while the opposite is observed once an irreversible cell state is reached (Richard et al., 2016). A similar pattern occurs during gastrulation, where expression noise is high in the uncommitted inner cell mass compared to the committed epiblast and where an increase in heterogeneity is observed when cells exit the pluripotent state and form the uncommitted epiblast (Mohammed et al., 2017).

Motivated by scRNA-seq, recent studies have extended traditional differential expression analyses to explore more general patterns that characterize differences between cell populations or experimental conditions (e.g., Korthauer et al. [2016]). In particular, the Bayesian analysis of single-cell sequencing data (BASiCS) framework (Vallejos et al., 2015, 2016) introduced a probabilistic tool to assess differences in cell-to-cell heterogeneity between two or more cell populations. This feature has led to, for example, insights into the context of immune activation and aging (Martinez-Jimenez et al., 2017, Miller, 1981).

To meaningfully assess changes in biological variability across the entire transcriptome, two main confounding effects must be taken into account: differences due to artefactual technical noise and differential variability between populations that is driven by changes in mean expression. The latter arises because biological noise is negatively correlated with protein abundance (Bar-Even et al., 2006, Newman et al., 2006, Taniguchi et al., 2010) or mean RNA expression (Brennecke et al., 2013, Antolović et al., 2017). To address these two confounding effects, BASiCS separates biological noise from technical variability by borrowing information from synthetic RNA spike-in molecules. Additionally, to acknowledge the variance-mean relationship, it restricts differential variability testing to those genes with equal mean expression across populations.

This article extends the statistical model implemented in BASiCS by introducing a more general approach to account for the aforementioned confounding effects. First, we derive a residual measure of cell-to-cell transcriptional variability that is not confounded by mean expression. This is used to define a probabilistic rule to robustly highlight changes in variability, even for differentially expressed genes. Unlike previous related methods (e.g., Kolodziejczyk et al. [2015]), our approach directly performs gene-specific statistical testing between two conditions using a readily available measure of uncertainty. Second, by exploiting concepts from measurement error models, our method is extended to address experimental designs where spike-in sequences are not available. This is particularly critical due to the increasing popularity of droplet-based technologies.

Using our approach, we identify a synchronization of biosynthetic machinery components in CD4+ T cells upon early immune activation as well as an increased variability in the expression of genes related to CD4+ T cell immunological function. Furthermore, we detect evidence of early cell fate commitment of CD4+ T cells during malaria infection characterized by a decrease in Tbx21 expression heterogeneity and a rapid collapse of global transcriptional variability after infection. These results highlight biological insights into T cell activation and differentiation that are only revealed by jointly studying changes in mean expression and variability.

Results

Addressing the Mean Confounding Effect for Differential Variability Testing

Unlike bulk RNA-seq, scRNA-seq provides information about cell-to-cell expression heterogeneity within a population of cells. Previous studies have used a variety of measures to quantify this heterogeneity. These include, among others, the coefficient of variation (CV) (Brennecke et al., 2013) and entropy measures (Richard et al., 2016). As in Vallejos et al. (2015, 2016), we focus on biological over-dispersion as a proxy for transcriptional heterogeneity. This is defined by the excess of variability that is observed with respect to what would be predicted by Poisson sampling noise after accounting for technical variation. The aforementioned measures of variability can be used to identify genes whose transcriptional heterogeneity differs between groups of cells (defined by experimental conditions or cell types). However, the strong relationship that is typically observed between variability and mean estimates (e.g., Brennecke et al. [2013]) can hinder the interpretation of these results. A simple solution to avoid this confounding is to restrict the assessment of differential variability to those genes with equal mean expression across populations (see Figure 1A). However, this is sub-optimal, particularly when a large number of genes are differentially expressed between the populations. For example, reactive genes that change in mean expression upon changing conditions (e.g., transcription factors) are excluded from differential variability testing. An alternative approach is to directly adjust variability measures to remove this confounding. For example, Kolodziejczyk et al. (2015) computed the empirical distance between the squared CV and a rolling median along expression levels, referred to as the DM method.
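The idea behind a distance-to-median adjustment can be sketched in a few lines of Python. This is only an illustration of the general technique, not the implementation used by Kolodziejczyk et al. (2015): the function name, the log10 scale, and the window size are assumptions made for this example.

```python
import numpy as np

def distance_to_median(mean, cv2, window=51):
    """Distance of log10(CV^2) to a rolling median taken along sorted mean expression.

    `window` (number of neighbouring genes) is a hypothetical tuning choice.
    """
    mean = np.asarray(mean, dtype=float)
    cv2 = np.asarray(cv2, dtype=float)
    order = np.argsort(mean)            # walk through genes from low to high expression
    log_cv2 = np.log10(cv2[order])
    half = window // 2
    n = log_cv2.size
    trend = np.empty(n)
    for k in range(n):
        lo, hi = max(0, k - half), min(n, k + half + 1)
        trend[k] = np.median(log_cv2[lo:hi])   # local median; window is truncated at the edges
    dm = np.empty(n)
    dm[order] = log_cv2 - trend         # return values in the original gene order
    return dm

# Toy example: CV^2 decreases with the mean, plus gene-specific noise.
rng = np.random.default_rng(0)
toy_mean = np.exp(rng.uniform(0, 6, size=500))
toy_cv2 = np.exp(rng.normal(0, 0.3, size=500)) / toy_mean
toy_dm = distance_to_median(toy_mean, toy_cv2)
```

A measure of this kind yields one point estimate per gene and no accompanying measure of uncertainty.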

Figure 1

Avoiding the Mean Confounding Effect When Quantifying Expression Variability in scRNA-Seq Data

(A and B) Illustration of changes in expression variability for a single gene between two cell populations without (A) and with (B) changes in mean expression.

(C and D) Our extended BASiCS model infers a regression trend between gene-specific estimates of over-dispersion parameters δ and mean expression μ. Residual over-dispersion parameters are defined by departures from the regression trend. For a single gene, this is illustrated using a red arrow. The color code within the scatterplots is used to represent areas with high (yellow and red) and low (blue) concentration of genes. For illustration purposes, the data introduced by Antolović et al. (2017) have been used (see STAR Methods).

(C) Gene-specific estimates of over-dispersion parameters δ were plotted against mean expression parameters μ. The red line shows the regression trend. This illustrates the typical confounding effect that is observed between variability and mean expression measures. Genes that are not detected in at least 2 cells are indicated by purple points.

(D) Gene-specific estimates of residual over-dispersion parameters were plotted against mean expression parameters μ. This illustrates the lack of correlation between these parameters.

(E) Illustration of how posterior uncertainty is used to highlight changes in residual over-dispersion. Two example genes with (upper) and without (lower) differential residual over-dispersion are shown. Left inset illustrates the posterior density associated with residual over-dispersion parameters for a gene in two groups of cells (group A, light blue; group B, dark blue). The colored area in the right inset represents the posterior probability of observing an absolute difference that is larger than the minimum tolerance threshold ψ0 (see STAR Methods).

In line with this idea, our method extends the statistical model implemented in BASiCS (Vallejos et al., 2015, 2016). We define a measure of “residual over-dispersion” that is not correlated with mean expression, which allows us to meaningfully assess changes in transcriptional heterogeneity when genes exhibit shifts in mean expression (see Figure 1B). More concretely, we infer a regression trend between over-dispersion (δ) and gene-specific mean parameters (μ), by introducing a joint informative prior to capture the dependence between these parameters (see STAR Methods). A latent gene-specific residual over-dispersion parameter ε describes departures from this trend (see Figure 1C). Positive values of ε indicate that a gene exhibits more variation than expected relative to genes with similar expression levels. Similarly, negative values of ε suggest less variation than expected, and, as shown in Figure 1D, these residual over-dispersion parameters are not confounded by mean expression. Our hierarchical Bayes approach infers full posterior distributions for the gene-specific latent residual over-dispersion parameters ε. As a result, we can directly use a probabilistic approach to identify genes with large absolute differences in residual over-dispersion between two groups of cells (see Figure 1E and STAR Methods). The performance of this differential variability test was validated using simulated data (see Figure S1 and STAR Methods). In contrast, mean-corrected point estimates for residual noise parameters (such as those obtained by the DM method) cannot be directly used to perform gene-specific statistical testing between two conditions, as no measure of the uncertainty in the estimate is readily available.
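As a rough illustration of the probabilistic rule sketched in Figure 1E, the following Python snippet estimates, for each gene, the posterior probability that the absolute difference in residual over-dispersion between two groups exceeds a minimum tolerance threshold ψ0. It assumes that posterior draws of the residual over-dispersion parameters (written ε here) are already available as arrays; the function names are hypothetical, and the expected false discovery rate formula is the usual posterior-probability-based estimate, stated here as an assumption rather than taken from the text above.

```python
import numpy as np

def tail_posterior_probability(eps_a, eps_b, psi0=0.41):
    """P(|eps_a - eps_b| > psi0 | data), estimated per gene from posterior draws.

    eps_a, eps_b: arrays of shape (n_draws, n_genes), one column per gene.
    psi0: minimum tolerance threshold (0.41 is the value quoted in the figure captions).
    """
    diff = np.asarray(eps_a, dtype=float) - np.asarray(eps_b, dtype=float)
    return np.mean(np.abs(diff) > psi0, axis=0)

def expected_fdr(prob, cutoff):
    """Expected false discovery rate among genes whose probability exceeds `cutoff`."""
    prob = np.asarray(prob, dtype=float)
    called = prob > cutoff
    if not called.any():
        return 0.0
    return float(np.sum(1.0 - prob[called]) / called.sum())

# Toy example with simulated "posterior draws" for 3 genes in two groups.
rng = np.random.default_rng(1)
draws_a = rng.normal(loc=[0.0, 1.0, -0.8], scale=0.2, size=(2000, 3))
draws_b = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.2, size=(2000, 3))
prob = tail_posterior_probability(draws_a, draws_b)
efdr = expected_fdr(prob, cutoff=0.9)
```

In practice the probability cutoff would be chosen so that the expected false discovery rate matches a target level such as the 10% quoted in the figure captions.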

The Informative Prior Stabilizes Parameter Estimation

Our joint prior formulation has introduced a non-linear regression to capture the overall trend between gene-specific over-dispersion parameters δ and mean expression parameters μ (see STAR Methods). Thus, we also refer to the extended model induced by this prior as the “regression” BASiCS model. Accordingly, the model induced by the original independent prior specification (Vallejos et al., 2016) is referred to as the “non-regression” BASiCS model. To study the performance of the regression BASiCS model, we applied it to a variety of scRNA-seq datasets. Each dataset is unique in its composition, covering a range of different cell types and experimental protocols (see STAR Methods and Table S1). Qualitatively, we observe that the inferred regression trend varies substantially across different datasets (Figures 2 and S2), justifying the choice of a flexible semi-parametric approach (see STAR Methods). Moreover, as expected, we observe that residual over-dispersion parameters are not confounded by mean expression nor by the percentage of zero counts per gene.

Figure 2

Parameter Estimation Using a Variety of scRNA-Seq Datasets

Model parameters were estimated using the regression and non-regression BASiCS models on (A) naive CD4+ T cells (Martinez-Jimenez et al., 2017) and (B) Dictyostelium cells prior to differentiation (day 0) (Antolović et al., 2017). These datasets were selected to highlight two situations with different levels of sparsity (i.e., the proportion of zero counts; see fourth column). More details about these datasets are provided in STAR Methods. The color code within the scatterplots is used to represent areas with high (yellow and red) and low (blue) concentration of genes.

First column: gene-specific over-dispersion δ versus mean expression μ as estimated by the non-regression BASiCS model.

Second column: gene-specific over-dispersion δ versus mean expression μ as estimated by the regression BASiCS model. The red line indicates the estimated regression trend. Purple dots indicate genes detected (i.e., with at least one count) in fewer than 2 cells.

Third column: gene-specific residual over-dispersion versus mean expression μ as estimated by the regression BASiCS model.

Fourth column: gene-specific posterior estimates for residual over-dispersion parameters versus percentage of zero counts for each gene.

Inferring Technical Variability without Spike-In Genes

Another critical aspect to take into account when inferring transcriptional variability based on scRNA-seq datasets is technical variation (Brennecke et al., 2013). BASiCS achieves this through a vertical data integration approach, exploiting a set of synthetic RNA spike-in molecules (e.g., the set of 92 ERCC molecules developed by Jiang et al. [2011]) as a gold standard to aid normalization and to quantify technical artefacts (see Figure 4A). However, while the addition of spike-in genes prior to sequencing is theoretically appealing (Lun et al., 2017), several practical limitations can preclude their utility in practice (Vallejos et al., 2017). Furthermore, the use of spike-in genes is not compatible with (increasingly popular) droplet-based technologies, which have massively increased the throughput of scRNA-seq over the last few years (Svensson et al., 2018).

Figure 4

The Spikes and No-Spikes Implementations of BASiCS

(A) Diagram representing the spikes implementation of BASiCS (Vallejos et al., 2015, 2016). This uses a vertical data integration approach to borrow information from gold-standard spike-in genes to aid normalization and to quantify technical variability.

(B) Diagram representing the no-spikes implementation of BASiCS. This uses a horizontal data integration approach to borrow information across multiple batches of sequenced cells (not confounded by the biological effect of interest) to quantify technical variability. More details about this implementation are discussed in STAR Methods and Figure S4.

(C and D) Comparison between the vertical and horizontal implementations of BASiCS using a dataset of mouse embryonic stem cells grown in 2i medium (see STAR Methods and Grün et al., 2014). Dashed horizontal lines located at ± log2(1.5) indicate the default minimum tolerance log2-fold change threshold used for differential testing.

(D) Comparison in terms of posterior estimates for over-dispersion parameters δ across all genes.

Expression Variability Dynamics during Immune Activation and Differentiation

Here, we illustrate how our method assesses changes in expression variability using CD4+ T cells as a model system. For all datasets, pre-processing steps are described in STAR Methods.

Testing Variability Changes in Immune Response Gene Expression

To identify gene expression changes during early T cell activation, we compared CD4+ T cells before (naive) and after (active) 3 hr of stimulation (Martinez-Jimenez et al., 2017). When using the non-regression BASiCS model, our differential over-dispersion test avoided the confounding with mean expression by solely focusing on genes with no changes in mean expression. This represents only a small fraction out of the full set of expressed genes. In contrast, testing changes in variability using residual over-dispersion measures allows testing across all genes, including the large set of genes that are up-regulated upon immune activation (see Figures S5A and S5B and STAR Methods). The latter include immune-response genes and critical drivers for CD4+ T cell functionality. Our model classifies genes into four categories based on their expression dynamics: down-regulated upon activation with (1) lower and (2) higher variability; and up-regulated with (3) lower and (4) higher variability (Figure 5A; STAR Methods; Table S2).

Figure 5

Changes in Expression Patterns during Early Immune Activation in CD4+ T Cells

Differential testing (mean and residual over-dispersion) was performed between naive and activated murine CD4+ T cells. This analysis uses a minimum tolerance threshold of τ0 = 1 for changes in mean expression and a minimum tolerance threshold of ψ0 = 0.41 for differential residual over-dispersion testing (expected false discovery rate is fixed at 10%; see STAR Methods).

(A) For each gene, the difference in residual over-dispersion estimates (Active versus Naive) is plotted versus the log2-fold change in mean expression (Active versus Naive). Genes with statistically significant changes in mean expression and variability are colored based on their regulation (up- or down-regulated, higher or lower variability).

(B and C) Denoised expression counts across the naive (purple) and active (green) CD4+ T cell population are visualized for representative genes that (B) increase in mean expression and decrease in expression variability and (C) increase in mean expression as well as expression variability upon immune activation. Each dot represents a single cell.

Expression Dynamics during In Vivo CD4+ T Cell Differentiation

In contrast to the quick transcriptional switch that occurs within hours of naive T cell activation, transcriptional changes during cellular differentiation processes are more subtle and were found to be coupled with changes in variability prior to cell fate decisions (Richard et al., 2016, Mojtahedi et al., 2016). Here, we apply our method to study changes in expression variability during CD4+ T cell differentiation after malaria infection using the dataset introduced by Lönnberg et al. (2017). In particular, we focus on samples collected 2, 4, and 7 days post malaria infection, for which more than 50 cells are available. To study global changes in over-dispersion along the differentiation time course, we first compared posterior estimates for the gene-specific parameter δ, focusing on genes for which mean expression does not change (see Figure 6A and STAR Methods). This analysis suggests that the expression of these genes is most tightly regulated at day 4, when cells are in a highly proliferative state. Moreover, between days 4 and 7, the cell population becomes more heterogeneous. This is in line with the emergence of differentiated T helper (Th) 1 and Tfh cells that was observed by Lönnberg et al. (2017).

Figure 6

Dynamics of Expression Variability throughout CD4+ T Cell Differentiation

Analysis was performed on CD4+ T cells assayed 2, 4, and 7 days after Plasmodium infection. Changes in residual over-dispersion were tested using a minimum tolerance threshold of ψ0 = 0.41 (expected false discovery rate is fixed at 10%; see STAR Methods).

(A) Distribution of posterior estimates of over-dispersion parameters δ for genes that exhibit no changes in mean expression across the differentiation time course. Changes in mean expression were tested using a minimum tolerance threshold of τ0 = 0 (expected false discovery rate is fixed at 10%).

(B) Posterior estimates for residual over-dispersion parameters ε, focusing on genes with statistically significant changes in expression variability between time points. Gene set size is indicated for each plot.

(C and D) Denoised expression counts across cell populations at days 2 (yellow) and 4 (red) post infection are visualized for representative genes that (C) increase or (D) decrease in variability during differentiation. Each dot represents a single cell.

(E) Tbx21 (blue) and Cxcr5 (red) measured at days 2, 4, and 7 post infection. Posterior estimates for residual over-dispersion parameters ε are plotted against posterior estimates for mean expression parameters μ. Statistically significant changes in mean expression (DE, minimum tolerance threshold of τ0 = 1) and variability (DV, minimum tolerance threshold of ψ0 = 0.41) are indicated for each comparison (expected false discovery rate is fixed at 10%).

Discussion

In recent years, the importance of modulating cell-to-cell transcriptional variation within cell populations for tissue function maintenance and development has become apparent (Bahar Halpern et al., 2015, Mojtahedi et al., 2016, Goolam et al., 2016). Here, we present a statistical approach to robustly test changes in expression variability between cell populations using scRNA-seq data. Our method uses a hierarchical Bayes formulation to extend the BASiCS framework by addressing (increasingly popular) experimental protocols where spike-in sequences are not available and by incorporating an additional set of residual over-dispersion parameters that are not confounded by changes in mean expression. Together, these extensions ensure a broader applicability of the BASiCS software and allow statistical testing of changes in variability that are not confounded by technical noise or mean expression.

In general, stable gene-specific variability estimates ideally require a large and deeply sequenced dataset containing a homogeneous cell population (the use of unique molecular identifiers for quantifying transcript counts can also improve variability estimation; see Grün et al. [2014]). However, we observe that the regression BASiCS model leads to a more stable inference that requires fewer cells to accurately estimate gene-specific summaries, particularly for lowly expressed genes. Despite this, careful considerations should be taken in extreme scenarios where the number of cells is small and/or the data are highly sparse (e.g., droplet-based approaches). These features of the data not only affect parameter estimation but also downstream differential testing. For sparse datasets with low numbers of cells, we recommend the use of a stringent minimum tolerance threshold and/or calibrating the test to a low expected false discovery rate (e.g., 1%) to avoid detecting spurious signals. Moreover, if possible, an internal calibration can be performed to find a reasonable minimum tolerance threshold (e.g., by randomly permuting cells between two groups to calibrate the null distribution of the differences between populations).

Our method allows characterization of the extent and nature of variable gene expression in CD4+ T cell activation and differentiation. First, we observe that during acute activation of naive T cells, genes of the biosynthetic machinery are homogeneously up-regulated, while specific immune-related genes become more heterogeneously up-regulated. In particular, increased variability in expression of the apoptosis-inducing Fas ligand (Strasser et al., 2009) and the inhibitory ligand PD-L1 (Chikuma, 2016) suggests a mechanism by which newly activated cells might suppress re-activation of effector cells, thereby dynamically modulating the population response to activation. Likewise, more variable expression of Smad3, which translates inhibitory TGFβ signals into transcriptional changes (Delisle et al., 2013), may indicate increased diversity in cellular responses to this signal. Increased variability in Pou2f2 (Oct2) expression after activation suggests heterogeneous activities of the NF-κB and/or NFAT signaling cascades that control its expression (Mueller et al., 2013). Moreover, we detect up-regulated and more variable Il2 expression, suggesting heterogeneous IL-2 protein expression, which is known to enable T cell population responses (Fuhrmann et al., 2016).

Finally, we studied changes in gene expression variability during CD4+ T cell differentiation toward a Th1 and Tfh cell state over a 7-day time course after in-vivo malaria infection (Lönnberg et al., 2017). Our analysis provides several insights into this differentiation system. First, we observe a tighter regulation in gene expression among genes that do not change in mean expression during differentiation at day 4, when divergence of Th1 and Tfh differentiation was previously identified (Lönnberg et al., 2017). This decrease in variability on day 4 is potentially due to the induction of a strong pan-lineage proliferation program. However, we observe that not all genes follow this trend and uncover four different patterns of variability changes. Second, we observe that several Tfh and Th1 lineage-associated genes change in expression variability between days 2 and 4. For example, we noted a decrease in variability for one key Th1 regulator, Tbx21 (encoding Tbet), which suggests that a subset of cells may have already committed to the Th1 lineage at day 2. Three additional Th1 lineage-associated genes also followed this trend (Ahnak, Ctsd, Tmem154). These data suggest that differentiation fate decisions may arise as early as day 2 in subpopulations within this system, resulting in high gene expression variability. Such an effect is in accordance with the early commitment to effector T cell fates that was previously observed during viral infection (Choi et al., 2011). As these results illustrate, diversity in differentiation state within a population of T cells can drive our differential variability results. To further dissect these results, subsequent analyses such as the pseudotime inference used in Lönnberg et al. (2017) could be used to characterize a continuous differentiation process.

In sum, our model provides a robust tool for understanding the role of heterogeneity in gene expression during cell fate decisions. With the increasing use of scRNA-seq to study this phenomenon, ours and other related tools will become increasingly important.

STAR★Methods

Key Resources Table

Contact for Reagent and Resource Sharing

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, John Marioni (john.marioni@cruk.cam.ac.uk).

Method Details

The BASiCS Framework

The proposed statistical model builds upon BASiCS (Vallejos et al., 2015, 2016), an integrated Bayesian framework that infers technical noise in scRNA-seq datasets and simultaneously performs data normalisation as well as selected supervised downstream analyses. Let X_ij be a random variable representing the expression count of gene i (∈ {1, …, q}) in cell j (∈ {1, …, n}). To control for technical noise, we employ reads from synthetic RNA spike-ins (e.g., those introduced by Jiang et al., 2011). Without loss of generality, we assume the first q0 genes to be biological followed by the q − q0 spike-in genes. As in the original BASiCS method introduced by Vallejos et al. (2015), we assume a Poisson hierarchical formulation:

$$
X_{ij} \mid \mu_i, \phi_j, \nu_j, \rho_{ij} \sim
\begin{cases}
\text{Poisson}(\phi_j \nu_j \mu_i \rho_{ij}), & i = 1, \ldots, q_0,\ j = 1, \ldots, n, \\
\text{Poisson}(\nu_j \mu_i), & i = q_0 + 1, \ldots, q,\ j = 1, \ldots, n,
\end{cases}
\tag{1}
$$

where, to account for technical and biological factors that affect the variance of the transcript counts, we incorporate two random effects:

$$
\nu_j \mid s_j, \theta \sim \text{Gamma}\!\left(\frac{1}{\theta}, \frac{1}{s_j \theta}\right), \qquad
\rho_{ij} \mid \delta_i \sim \text{Gamma}\!\left(\frac{1}{\delta_i}, \frac{1}{\delta_i}\right).
\tag{2}
$$

In this setup, φ_j represents a cell-specific normalization parameter to correct for differences in mRNA content between cells and s_j models cell-specific scale differences affecting all biological and technical genes. Moreover, the random effect ν_j captures unexplained technical noise that is not accounted for by the normalisation. The strength of this noise is then quantified by a global parameter θ (shared across all genes and cells). Heterogeneous gene expression across cells is captured by ρ_ij, whose strength is controlled by gene-specific over-dispersion parameters δ_i. These quantify the excess of variability that is observed with respect to Poisson sampling noise, after accounting for technical noise. Finally, gene-specific parameters μ_i represent average expression of a gene across cells.

When comparing two or more groups of cells (e.g., experimental conditions or cell types), the notation above can be extended by assuming that gene-specific parameters are also group-specific (as in Vallejos et al., 2016). Comparisons of gene-specific parameters across populations can be used to identify statistically significant changes in gene expression at the mean and the variability level. However, the well-known confounding effect between mean and variability that typically arises in scRNA-seq datasets (Brennecke et al., 2013) can preclude a meaningful interpretation of these results.
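To make the roles of the parameters concrete, the following Python sketch simulates counts from a Poisson hierarchical model of this type. It assumes a shape/rate gamma parameterization in which the technical random effect of cell j has mean s_j (with strength θ) and the biological random effect of gene i has mean 1 and variance δ_i; the function name and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_counts(mu, delta, phi, s, theta, rng):
    """Simulate a (genes x cells) count matrix.

    mu:    mean expression for all q genes (biological genes first, then spike-ins)
    delta: over-dispersion for the q0 biological genes only
    phi:   cell-specific normalization (mRNA content), length n
    s:     cell-specific scale affecting all genes, length n
    theta: global strength of unexplained technical noise
    """
    mu = np.asarray(mu, dtype=float)
    delta = np.asarray(delta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    s = np.asarray(s, dtype=float)
    q, q0, n = mu.size, delta.size, phi.size

    # Technical random effect: Gamma(shape=1/theta, rate=1/(s_j*theta)), so mean s_j.
    nu = rng.gamma(shape=1.0 / theta, scale=s * theta)
    # Biological random effect: Gamma(shape=1/delta_i, rate=1/delta_i), so mean 1, variance delta_i.
    rho = rng.gamma(shape=1.0 / delta[:, None], scale=delta[:, None], size=(q0, n))

    rate = np.empty((q, n))
    rate[:q0] = phi * nu * mu[:q0, None] * rho   # biological genes
    rate[q0:] = nu * mu[q0:, None]               # spike-in genes: no phi, no rho
    return rng.poisson(rate)

rng = np.random.default_rng(2)
n_cells = 200
counts = simulate_counts(
    mu=[50.0, 5.0, 20.0],          # two biological genes and one spike-in
    delta=[1.0, 0.5],
    phi=np.ones(n_cells),
    s=rng.uniform(0.5, 1.5, n_cells),
    theta=0.1,
    rng=rng,
)
```

Because the spike-in rows depend only on the technical terms, they carry the information used to separate technical from biological variability.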

Modeling the Confounding between Mean and Dispersion

Here, we extend BASiCS to account for the confounding effect described above. For this purpose, we estimate the relationship between mean and over-dispersion parameters by introducing a joint prior distribution for (μ_i, δ_i), in which δ_i follows a log-Student-t distribution given μ_i (Equation 3). The latter is equivalent to the following non-linear regression model (Equation 4): log(δ_i) = f(μ_i) + ε_i, where f(μ_i) represents the over-dispersion (on the log-scale) that is predicted by the global trend (across all genes) at a given mean expression μ_i. Therefore, ε_i can be interpreted as a latent gene-specific residual over-dispersion parameter, capturing departures from the overall trend. If a gene exhibits a positive value for ε_i, this indicates more variation than expected for genes with a similar expression level. Accordingly, negative values of ε_i suggest less variation than expected for genes with a similar expression level. A similar approach was introduced by DESeq2 (Love et al., 2014) in the context of bulk RNA sequencing. Whereas DESeq2 assumes normally distributed errors when estimating this trend, here we opt for a Student-t distribution as it leads to inference that is more robust to the presence of outlier genes. Moreover, the parametric trend assumed by DESeq2 is replaced by a more flexible semi-parametric approach. This is defined by Equation 5, which combines an intercept, a linear term and L kernel terms, where g_1(⋅),…,g_L(⋅) represent a set of Gaussian radial basis function (GRBF) kernels, each weighted by a regression coefficient. As in Kapourani and Sanguinetti (2016), the GRBF kernels are defined as in Equation 6, where m_l and h_l represent location and scale hyper-parameters for the GRBF kernels. In Equation 5, the linear term captures the (typically negative) global correlation between over-dispersion and mean expression parameters. Its addition also stabilises inference of the GRBFs around mean expression values where only a handful of genes are observed. In Equation 6, the location and scale hyper-parameters (m_l, h_l) are assumed to be fixed a priori. Details about this choice are described below. The remaining elements of the prior were specified with all hyper-parameters fixed a priori.
Default values are provided for these hyper-parameters, with the remaining default hyper-parameter values as in Vallejos et al. (2016). In principle, the degrees of freedom parameter η could also be estimated within a Bayesian framework. However, we observed that fixing this parameter a priori led to more stable results. A default choice for this parameter is described below.
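The regression trend and the residual over-dispersion described above can be illustrated numerically. The following is a minimal sketch, not the BASiCS implementation: it assumes that the trend is an intercept plus a linear term in log(μ) plus a weighted sum of Gaussian kernels evaluated at log(μ), and that each kernel has the usual form exp(−½((x − m)/h)²). The names `alpha0`, `alpha1` and `betas`, and all numeric values, are hypothetical.

```python
import math

def grbf(log_mu, m, h):
    """Gaussian radial basis function kernel with location m and scale h."""
    return math.exp(-0.5 * ((log_mu - m) / h) ** 2)

def trend(mu, alpha0, alpha1, betas, locations, scales):
    """Global trend f(mu): intercept + linear term in log(mu) + weighted kernels."""
    log_mu = math.log(mu)
    kernels = sum(b * grbf(log_mu, m, h)
                  for b, m, h in zip(betas, locations, scales))
    return alpha0 + alpha1 * log_mu + kernels

def residual_overdispersion(mu, delta, alpha0, alpha1, betas, locations, scales):
    """epsilon = log(delta) - f(mu), i.e. the departure from the global trend."""
    return math.log(delta) - trend(mu, alpha0, alpha1, betas, locations, scales)

# Hypothetical coefficients and hyper-parameters, chosen only for illustration
locations = [0.0, 2.0, 4.0]
scales = [2.4, 2.4, 2.4]
betas = [0.3, -0.1, 0.2]

# A gene whose log over-dispersion sits 0.7 above the trend at mu = 10
f_val = trend(10.0, 1.0, -0.5, betas, locations, scales)
eps_high = residual_overdispersion(10.0, math.exp(f_val + 0.7),
                                   1.0, -0.5, betas, locations, scales)
print(round(eps_high, 6))
```

A positive value such as the one computed here corresponds to a gene that is more variable than expected for its expression level; a gene lying exactly on the trend has a residual of zero.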

Implementation

Posterior inference for the model described above is implemented by extending the Adaptive Metropolis within Gibbs sampler (Roberts and Rosenthal, 2009) that was adopted by Vallejos et al. (2016). For this purpose, the log-Student-t distribution in Equation 3 is represented via the same data augmentation scheme as in Vallejos and Steel (2015), which introduces an auxiliary set of parameters. Moreover, the regression coefficients are inferred by noting that Equation 5 can be rewritten as a linear regression model with design matrix X, where X is a q0×(L+2) matrix. In this setting, the full conditionals associated with s, φ, ν and θ are not affected by the new prior specification of the mean and over-dispersion parameters and can be found in Vallejos et al. (2016). The full conditionals for the remaining parameters are derived below. As in Vallejos et al. (2015), these are derived by integrating out the random effect in Equation 1, which leads to Equation 19 and to the associated likelihood function. Let f(μ_i) be as in Equation 5. The full conditionals associated to the mean expression parameters μ_i and over-dispersion parameters δ_i, as well as those associated to the remaining parameters, follow from this likelihood and the prior described above. Finally, the full conditionals associated to the global technical noise parameter (θ) and cell-specific parameters (φ, s and ν) are defined as in Vallejos et al. (2016).
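The data augmentation step mentioned above relies on the standard representation of a Student-t distribution as a scale mixture of normals. The following is a sketch of that representation in the present setting; the symbols σ² (scale of the residual term), λ_i (auxiliary parameters), β (vector of regression coefficients) and x_i (the i-th row of X) are notation assumed here for illustration, and the Gamma distribution is written in shape/rate form.

```latex
\begin{align*}
\log(\delta_i) \mid \mu_i, \beta, \sigma^2, \lambda_i &\sim
\text{N}\!\left(f(\mu_i), \frac{\sigma^2}{\lambda_i}\right), \qquad
\lambda_i \mid \eta \sim \text{Gamma}\!\left(\frac{\eta}{2}, \frac{\eta}{2}\right), \\
f(\mu_i) &= x_i^{\top} \beta, \qquad
x_i = \bigl(1, \log(\mu_i), g_1(\log(\mu_i)), \ldots, g_L(\log(\mu_i))\bigr)^{\top}.
\end{align*}
```

Integrating out λ_i recovers a Student-t distribution with η degrees of freedom for log(δ_i). Conditional on the λ_i, the model is a weighted linear regression with L + 2 columns, which is consistent with the stated dimension of X.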

Probabilistic Rule Associated to the Differential Test

We use a probabilistic approach to identify changes in gene expression between groups of cells. Let δ_i^A and δ_i^B be the over-dispersion parameters associated to gene i in groups A and B. Following Equation 4, the log2 fold change in over-dispersion between these groups can be decomposed into two terms (Equation 33), where the first term captures the over-dispersion change that can be attributed to differences between the mean expression parameters μ_i^A and μ_i^B. The second term in Equation 33 represents the change in residual over-dispersion that is not confounded by mean expression. Based on this observation, statistically significant differences in residual over-dispersion will be identified for those genes where the tail posterior probability of observing a large difference between ε_i^A and ε_i^B exceeds a certain threshold (Equation 34), with "large" defined by a pre-specified minimum tolerance threshold. As a default choice, this tolerance is set to a value that translates into a 50% increase in over-dispersion. In the limiting case when the tolerance threshold is equal to zero, the probability in Equation 34 is equal to 1 regardless of the information contained in the data. Therefore, as in Bochkina and Richardson (2007), our decision rule in that case is based on the maximum of the posterior probabilities associated to the one-sided hypotheses ε_i^A − ε_i^B > 0 and ε_i^A − ε_i^B < 0. In both cases, the posterior probability threshold is chosen to control the expected false discovery rate (EFDR) (Newton et al., 2004). The default value for EFDR is set to 10%. As a default and to support interpretability of the results, we exclude genes that are not expressed in at least 2 cells per condition from differential variability testing. Changes in mean and over-dispersion are highlighted using the decision rule of Vallejos et al. (2016).
To evaluate the performance of our differential test we generated synthetic data under a null model (without changes in variability) and an alternative model (with changes in variability). All datasets were generated following the BASiCS model, with parameter values set to empirical estimates based on 98 microglia cells (see below). For this purpose, we use the BASiCS_Sim function. To simulate data under an alternative model, 1000 genes were randomly selected and their associated over-dispersion parameters (δ_i) were increased or decreased by a log2 fold change of 5. Differential testing was performed either between data simulated on the same set of parameters (null model) or between data simulated from the original parameters and the altered parameters (alternative model). We report the EFDR (Newton et al., 2004) as well as the false positive rate (FPR) for simulations under the null model and the true positive rate (TPR) for simulations under the alternative model. Synthetic data were generated with different sample sizes, with 5 repetitions for each sample size (see Figure S1).
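The probabilistic decision rule described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the BASiCS code: tail posterior probabilities are estimated as proportions of paired MCMC draws, and the probability threshold is taken as the smallest value on a grid whose estimated EFDR does not exceed the target (the software may calibrate the threshold differently). All function names and numbers are hypothetical.

```python
def tail_probability(eps_a, eps_b, tolerance):
    """Estimate P(|eps_A - eps_B| > tolerance | data) from paired MCMC draws.
    With zero tolerance, fall back to the larger one-sided probability."""
    diffs = [a - b for a, b in zip(eps_a, eps_b)]
    n = len(diffs)
    if tolerance > 0:
        return sum(abs(d) > tolerance for d in diffs) / n
    p_pos = sum(d > 0 for d in diffs) / n
    p_neg = sum(d < 0 for d in diffs) / n
    return max(p_pos, p_neg)

def efdr(probs, threshold):
    """Expected false discovery rate among genes whose probability exceeds threshold."""
    called = [p for p in probs if p > threshold]
    if not called:
        return 0.0
    return sum(1.0 - p for p in called) / len(called)

def choose_threshold(probs, target=0.10, grid_size=200):
    """Smallest threshold on a grid over [0.5, 1) with estimated EFDR <= target."""
    for k in range(grid_size):
        thr = 0.5 + 0.5 * k / grid_size
        if efdr(probs, thr) <= target:
            return thr
    return 1.0

# Hypothetical per-gene tail posterior probabilities
probs = [0.99, 0.95, 0.90, 0.60, 0.55]
thr = choose_threshold(probs, target=0.10)
significant = [p for p in probs if p > thr]
print(thr, significant)
```

In this toy example the two genes with weak evidence are excluded because including either of them would push the estimated EFDR above 10%.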

Choice of Hyper-Parameters

As discussed above, the degrees of freedom η, the number of GRBFs L as well as the associated hyper-parameters (m_l, h_l) are set a priori. Here, we explain the default values implemented in the BASiCS software. These were chosen to achieve a compromise between flexibility and shrinkage strength when applied to the datasets described in Table S1. Firstly, we observed that large values of L can lead to over-fitting but that small values of L can limit the flexibility to capture non-linear relations between over-dispersion and mean expression. Thus, as a parsimonious choice, we selected L=10. Moreover, as in Kapourani and Sanguinetti (2016), values for m_l were chosen to be equally spaced across the range of (log) mean expression values, i.e. between a lower limit a and an upper limit b. As the mean expression values are unknown a priori, a and b are updated every 50 MCMC iterations during burn-in (fixed thereafter). Additionally, the scale hyper-parameters h_l control the width of the GRBFs and, consequently, the locality of the regression. As a default, we set these as h=c×Δm, where c is a fixed proportionality constant and Δm is the distance between consecutive values of m_l. In practice, we observed that the choice of a particular value of c is not critical, as long as narrow kernels (c<0.5) are avoided. As a default, c=1.2 was chosen. The degrees of freedom η controls the tails of the distribution for the residual term in Equation 4. This influences the shrinkage towards the global trend and the robustness against outlying observations (here, these refer to genes whose mean and over-dispersion values are far from the trend). If η is large, ε_i approximately follows a normal distribution, for which posterior inference is known to be sensitive to outliers. Instead, small values of η introduce heavy tails for ε_i, leading to more robust posterior inference. In principle, η could be estimated within a Bayesian framework. However, this is problematic as the likelihood function associated to Equation 4 can be unbounded (Fernandez and Steel, 1999).
Here, we opt for a pragmatic approach where the value of η is fixed a priori. To select a reasonable default value, we ran the regression BASiCS model for a grid of possible values of η ({1,2,3,4,5,6,7,8,9,15,20,25,30}), using the datasets described in Table S1 (with L, m_l and h_l fixed as described above). In all cases, we calculated Monte Carlo estimates for the log-likelihood associated to Equation 1 as a proxy for goodness-of-fit (data not shown). We observed that log-likelihood estimates were consistently the smallest at the lower end of this grid and that no substantial differences are observed across larger values of η. Based on these observations, default values implemented in the BASiCS software are set to L=10 and c=1.2, together with a fixed default for η. Despite this, the model's implementation also allows flexible adjustment of L, c and η by the user.
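The default choice of kernel locations and scales can be sketched as follows. This is an illustration only: it assumes that the limits a and b are the minimum and maximum of the current log mean expression values and that both limits are themselves kernel locations, which is one natural reading of "equally spaced across the range". The function name is hypothetical.

```python
def grbf_hyperparameters(log_mu_values, n_kernels=10, c=1.2):
    """Equally spaced kernel locations over [a, b] and a common scale h = c * spacing."""
    a, b = min(log_mu_values), max(log_mu_values)
    spacing = (b - a) / (n_kernels - 1)          # distance between consecutive locations
    locations = [a + l * spacing for l in range(n_kernels)]
    scales = [c * spacing] * n_kernels
    return locations, scales

# Hypothetical log mean expression values spanning 0 to 9
locations, scales = grbf_hyperparameters([0.0, 1.3, 4.2, 9.0])
print(locations)
print(scales[0])
```

With c = 1.2, each kernel is slightly wider than the gap between neighbouring locations, so adjacent kernels overlap and the fitted trend remains smooth. In an MCMC setting this function would be re-evaluated on the current mean expression values at the update interval described above during burn-in.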

Running the Different Implementations of BASiCS

In the BASiCS R library, the default setting is to run the spikes implementation of BASiCS. The no-spikes implementation can be used by setting WithSpikes = FALSE in the call to BASiCS_MCMC. To run the regression BASiCS model, the user can set Regression = TRUE in the call to BASiCS_MCMC and Regression = FALSE to run the non-regression BASiCS model.

The Horizontal Integration Approach

As seen in Figure 4A, BASiCS (Vallejos et al., 2015, Vallejos et al., 2016) builds upon a vertical integration framework, exploiting a set of spike-in sequences (e.g. the set of 92 ERCC molecules described in Jiang et al., 2011) as a gold standard to aid normalisation and to quantify technical artifacts. However, while the addition of spike-in genes prior to sequencing is theoretically appealing (Lun et al., 2017), several practical limitations affect their utility (Vallejos et al., 2017). For example, the addition of spike-ins is not trivial in droplet-based protocols such as those introduced by Klein et al. (2015) and Macosko et al. (2015). Here, we extend BASiCS to not rely on spike-in genes using principles of measurement error models where — in the absence of gold standard features — technical variation is quantified through replication (Carroll, 1998). As scRNA-seq is a destructive technology, it is not possible to replicate experiments by sequencing the same cells multiple times. However, we rely on the replication of population-level characteristics of the cells through appropriate experimental design (Tung et al., 2017) by randomly allocating cells from the same population to multiple independent experimental replicates (hereafter these are referred to as batches). Given such an experimental design, we assume that biological effects are shared across batches and that technical variation will be reflected by spurious differences between cells and batches.

The Horizontal Integration Model

Following this reasoning, we use a horizontal data integration approach to leverage information from multiple batches of sequenced cells to estimate biological effects that are not confounded by technical variation (see Figure 4B). Let X_ijk be a random variable representing the count (read- or UMI-based) for gene i in cell j of the k-th batch. The following model is proposed (Equation 37 and Equation 38). A key assumption underlying this model is that biological effects (μ_i and δ_i) are shared across all batches and, therefore, we borrow information across cells in all batches to infer these parameters. In contrast to the original implementation of BASiCS, the absence of spike-in genes prevents the definition of two separate normalisation effects to capture nuisance differences in the scale of the observed read-counts between cells: one to capture differences in cellular mRNA content and one to capture technical artefacts (e.g. sequencing depth). Instead, in Equation 38, the normalisation parameters s capture a combination of these effects. The latter are inferred by borrowing information across all genes, under the prior distribution assumed for these parameters. Residual technical over-dispersion that is not captured by these normalisation parameters is quantified by batch-specific parameters θ_k. Based on the proportion of variability that is attributed to a biological component, our model can be used to identify highly and lowly variable genes within a population of cells (see Vallejos et al., 2015). Moreover, differences in mean and over-dispersion between cell populations can be highlighted by comparing gene-specific parameters (μ_i, δ_i). Finally, when adopting the prior specification described for the regression BASiCS model, our model can also be used to compare transcriptional heterogeneity in terms of the residual over-dispersion parameters ε_i.

Identifiability and Prior Specification

The model in Equation 37 and Equation 38 is not identifiable, i.e. the scale of cell-specific normalisation parameters s and gene-specific mean expression parameters μ_i cannot be separately estimated from the data. As a solution, the identifiability restriction in Equation 39 is proposed. In Equation 39, the geometric mean of mean expression parameters is fixed to a constant (when analysing multiple populations, this restriction independently applies within each population). In practice, we replace the value of this constant by its empirical counterpart, e.g. adopting the normalization strategy implemented in Lun et al. (2016). To avoid ill-defined situations, this calculation must exclude genes with zero total counts across all cells (for which the empirical estimate of μ_i is equal to 0). We note, however, that the actual value of the constant is not critical, as global offset effects between cell populations can be corrected post hoc (see Vallejos et al., 2016). Marginally, we assign a log-Normal prior distribution to each μ_i. However, we do not assume these parameters to be a priori independent. Instead, an appropriate correlation structure is introduced to satisfy the identifiability restriction in Equation 39. Following Theorem 8.2 in West and Harrison (1989), this correlated prior is defined as in Equation 40, where q is the number of genes, 1 denotes a q-dimensional vector of ones and I denotes a q-dimensional identity matrix. Due to the identifiability constraint in Equation 39, the covariance matrix in Equation 40 is not full rank. Hence, for an arbitrarily chosen reference gene r, Equation 40 can be factorised as a multivariate normal prior for the mean expression parameters of all non-reference genes and a point mass prior for that of the reference gene (see Proposition 2). As a result, posterior inference can be implemented by drawing posterior samples for the non-reference genes, leaving posterior samples for the reference gene to be completely specified by the identifiability restriction.
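The way the identifiability restriction determines the reference gene can be made concrete with a short calculation. The sketch below assumes the restriction takes the form "the average of log(μ_i) over all q genes equals a fixed value", which is equivalent to fixing the geometric mean; the name `log_mu0` for that fixed value and the function name are hypothetical.

```python
import math

def reference_log_mean(log_mu_others, log_mu0):
    """log(mu_r) for the reference gene, given the other q - 1 genes and the
    fixed average log_mu0 of log(mu_i) over all q genes."""
    q = len(log_mu_others) + 1
    return q * log_mu0 - sum(log_mu_others)

# Three non-reference genes and a fixed average log mean expression of 2.0
log_mu_others = [1.0, 2.0, 3.5]
log_mu_r = reference_log_mean(log_mu_others, log_mu0=2.0)

# The geometric mean over all four genes equals exp(2.0) by construction
all_logs = log_mu_others + [log_mu_r]
geometric_mean = math.exp(sum(all_logs) / len(all_logs))
print(log_mu_r, geometric_mean)
```

Because the reference gene is a deterministic function of the others, its prior is a point mass once the remaining q − 1 parameters are given, which is why only the non-reference genes need to be sampled.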

Using a Stochastic Reference Gene

The vertical integration version of BASiCS (with spike-ins) is used as a benchmark for the model in Equation 37 and Equation 38. To illustrate its performance, we use the dataset of Grün et al. (2014), for which technical spike-ins and multiple batches of sequenced cells are available. In both cases, the MCMC sampler was run for 20,000 iterations, storing draws every 10 iterations and ignoring an initial burn-in period of 10,000 iterations (hence, results are shown in terms of 1,000 iterations). Overall, posterior inference is unaffected for the majority of genes (Figures 4C and 4D). However, as can be expected, the effect of the prior is more prominent for lowly expressed genes where the data are less informative. In those cases, the identifiability constraint in Equation 39 slightly shrinks posterior estimates of mean expression parameters towards the fixed geometric mean. We observe that posterior inference is distorted for the arbitrarily chosen reference gene (see Figures S4A and S4B). To overcome this problem, we introduce the use of a stochastic reference choice. The latter randomly selects a reference gene at each iteration of the MCMC algorithm. As a result, each gene is treated as reference only a small proportion of times, leading to valid posterior inference for all genes (see Figures S4C and S4D).
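Two numerical points in this paragraph can be checked with a small sketch: the number of retained draws implied by the run length, burn-in and thinning, and the claim that a randomly chosen reference gene is the reference in only a small share of iterations. The uniform selection of the reference gene is an assumption made for illustration, and the function names and gene count are hypothetical.

```python
import random

def retained_draws(n_iter, burn_in, thin):
    """Number of stored posterior draws after discarding burn-in and thinning."""
    return (n_iter - burn_in) // thin

def reference_counts(n_genes, n_iter, seed=0):
    """Count how often each gene is picked as reference under uniform random choice."""
    rng = random.Random(seed)
    counts = [0] * n_genes
    for _ in range(n_iter):
        counts[rng.randrange(n_genes)] += 1
    return counts

n_draws = retained_draws(n_iter=20_000, burn_in=10_000, thin=10)
counts = reference_counts(n_genes=100, n_iter=20_000)
largest_share = max(counts) / 20_000
print(n_draws, largest_share)
```

With 100 genes each gene serves as reference in roughly 1% of iterations; with thousands of genes, as in real datasets, the share per gene is correspondingly smaller.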

Technical Details

A Correlated Prior to Satisfy the Identifiability Restriction

Proposition 1. The prior distribution is equivalent to where 1q denotes a q-dimensional vector of ones and Iq denotes a q-dimensional identity matrix.

Proof. The proof follows the same steps as in the proof of Theorem 8.2 in West and Harrison (1989). Let . It can be shown that Hence Finally, replacing , we obtain

Proposition 2. Let , where r denotes an arbitrarily chosen reference gene. The correlated prior derived in Proposition 1 can be factorized in terms of a multivariate normal prior for and a point mass prior for which is located at .

Proof. Standard multivariate normal theory leads to and with and

Proposition 3. Under the same assumptions as in Proposition 1, let be the vector obtained after removing elements i and r from . It can be shown that where 1q−2 denotes a (q−2)-dimensional vector of ones.

Proof. Standard multivariate normal theory leads to with and

Bayesian inference is implemented using an adaptive Metropolis within Gibbs algorithm (Roberts and Rosenthal, 2009). After integrating out the random effects , the full conditionals required for this implementation are based on the following likelihood function: Let r denote an arbitrarily chosen reference gene. If and are assumed to be a priori independent (i.e. as in Vallejos et al., 2016), the associated full conditionals for (i≠r) are given by: where is defined as in Proposition 3 and is the vector obtained after removing elements i and r from . Due to the identifiability constraint, with probability 1. If a gene i (i≠r) is excluded from the identifiability constraint (genes with less than 1 count per cell [on average] are excluded), Equation 55 becomes Under this prior, the remaining full conditionals are given by: where . Alternatively, if the joint informative prior is adopted, Equation 55 and Equation 57 are respectively replaced by

Quantification and Statistical Analysis

Quality Filtering of Single Cell RNA Sequencing Data

We employed a range of different datasets to test the proposed methodology. These datasets were selected to cover different experimental techniques (with and without unique molecular identifiers, UMI) and to encompass a variety of cell populations. Key features of each dataset can be found in Table S1.

Dictyostelium Cells

Antolović et al. (2017) studied changes in expression variability between 0 hours (undifferentiated), 3 hours and 6 hours of Dictyostelium differentiation. Raw data are available by direct download (see Data S1 in Antolović et al., 2017). Across all time-points, 5 cells were removed due to low quality. Technical spike-in genes that were not detected and biological genes with an average expression (across all cells) smaller than 1 count were removed. In total, 433 cells (131 cells and 3 batches at 0h, 157 cells and 3 batches at 3h, and 145 cells and 3 batches at 6h) and 10738 genes (88 technical and 10650 biological genes) passed filtering. We used data from the 0h time point to test the functionality of our model.

Mouse Brain Cells

This dataset was composed of UMI scRNA-seq data of cells isolated from the mouse somatosensory cortex and hippocampal CA1 region (Zeisel et al., 2015). Raw data is available from Gene Expression Omnibus under accession code GEO: GSE60361. Prior to the analysis, we removed technical genes with 0 total counts and biological genes for which the average count across all 3007 cells was below 0.1. The groups comprising microglia cells and CA1 neurons were chosen to be analysed. For these groups, 98 cells (microglia), 939 cells (CA1 pyramidal neurons) and 10744 genes (10687 biological and 57 technical genes) were left to be analysed.

Pool-and-Split RNA-Seq Data

This UMI-based dataset provides a control experiment to assess changes in biological heterogeneity in a situation where mean expression remains unchanged across conditions. Pool-and-split samples were created by pooling 1 million mESCs grown in 2i or serum medium and splitting 20pg of RNA into aliquots. These libraries are compared against single-cell samples (mESCs) (Grün et al., 2014). Raw data is available from Gene Expression Omnibus under accession code GEO: GSE54695. As in Grün et al. (2014), some cells were removed from the analysis due to low expression of the stem cell marker Oct4. Technical genes with 0 total counts were also removed from the analysis. Additionally, lowly expressed biological genes with fewer than 0.5 counts (on average, across all samples) were excluded. This left 258 libraries (74 single mESCs grown in 2i medium, 52 single mESCs grown in serum medium, 76 pool-and-split aliquots from cells grown in 2i medium and 56 pool-and-split aliquots from cells grown in serum medium) as well as 8924 genes (50 technical spike-ins and 8874 biological genes) for the analysis. Each condition contained 2 batches. Matched single molecule fluorescence in situ hybridization (smFISH) data from mESCs grown in 2i and serum media were obtained from Dominic Grün (Max Planck Institute of Immunobiology and Epigenetics, Freiburg, Germany) through personal communications. This smFISH experiment assayed 9 genes (Gli1, Klf4, Notch1, Pcna, Pou5f1, Sohlh2, Sox2, Stag3, Tpx2) in more than 70 cells per condition. We excluded Notch1 from the analysis due to strong disagreement between smFISH and scRNA-seq data of cells grown in serum medium.

CD4+ T Cells

Non-UMI scRNA-seq data of CD4+ T cells were taken from Martinez-Jimenez et al. (2017). Raw data are available from ArrayExpress under accession code ArrayExpress: E-MTAB-4888. To perform a variety of tests, naive and activated CD4+ T cells from young Mus musculus (B6) mice were selected. Biological genes with an average count < 1 and non-detected technical genes were removed from the analysis. In total, 146 cells (93 naive and 53 activated CD4+ T cells) and 10553 genes (10495 biological and 58 technical genes) passed filtering. Each condition contains 2 replicates.

CD4+ T Cell Differentiation

Non-UMI scRNA-seq data were generated from CD4+ T cells during differentiation towards Th1 and Tfh cell fates after Plasmodium infection (Lönnberg et al., 2017). Raw reads were downloaded from ArrayExpress [ArrayExpress: E-MTAB-4388] and mapped against the Mus musculus genome (mm10) using gsnap (Wu and Nacu, 2010) with default settings. Read counting was performed using HTSeq (Anders et al., 2015) with default settings. Quality control was performed by removing cells with fewer than 300,000 biological reads or fewer than 600,000 technical reads at day 2. At day 4 and 7, cells with fewer than 1,000,000 biological reads were excluded from downstream analysis. Additionally, we removed genes that did not show an average detection of more than 1 read at day 2, day 3, day 4 or day 7 after infection. After applying these criteria, 376 cells (Day 0: 16 cells, Day 2: 89, Day 3: 21, Day 4: 133, Day 7: 64, Day 7 non-infected: 53) and 7899 genes (7847 biological and 52 technical) remained for analysis. Note that, due to low sample sizes, we focused our analysis on data from day 2, day 4 and day 7 post-infection.
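The cell-level quality control thresholds stated above can be expressed as a simple filter. This is a sketch, not the authors' script: the function name and the example cells are hypothetical, and because the text gives no read-depth threshold for time points other than days 2, 4 and 7, cells from other days are passed through unchanged here (an assumption).

```python
def passes_cell_qc(day, biological_reads, technical_reads):
    """Apply the read-depth thresholds quoted in the text for each time point."""
    if day == 2:
        # Day 2: remove cells with < 300,000 biological or < 600,000 technical reads
        return biological_reads >= 300_000 and technical_reads >= 600_000
    if day in (4, 7):
        # Days 4 and 7: remove cells with < 1,000,000 biological reads
        return biological_reads >= 1_000_000
    # Assumption: no read-depth threshold is stated for other days
    return True

# Hypothetical cells: (day, biological reads, technical reads)
cells = [
    (2, 450_000, 800_000),
    (2, 250_000, 900_000),
    (4, 1_200_000, 100_000),
    (7, 900_000, 700_000),
]
kept = [cell for cell in cells if passes_cell_qc(*cell)]
print(kept)
```

Filters of this kind are applied before model fitting so that low-quality libraries do not distort the estimated normalisation and technical noise parameters.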

Thresholds When Assessing Expression Changes

Statistical assessment of changes in mean expression and residual over-dispersion was performed between datasets using the regression BASiCS model. Unless otherwise indicated, the tolerance threshold was set to for differential mean expression testing, for differential over-dispersion testing and to for differential residual over-dispersion testing. The expected false discovery rate was controlled to 10%. This information is also displayed in figure legends.

Functional Annotation Analysis

We performed functional annotation analysis using DAVID version 6.8 (Dennis et al., 2003). All genes considered for differential testing were used as background. The functional annotation clustering function in DAVID was used to cluster annotation categories based on similarity and to sort them according to their enrichment score.

Stabilization of Posterior Inference for Small Sample Sizes

To compare parameter estimates of the regression and non-regression model across different sample sizes, we used the CA1 pyramidal neuron population from Zeisel et al. (2015). The regression BASiCS model was first run on the full population of 939 cells to generate pseudo ground truth parameter estimates. Subsequently, 50, 100, 150, 200, 250, 300 and 500 cells were randomly sub-sampled from the full population prior to parameter estimation. This procedure was repeated 10 times for each sample size. Based on parameter estimates using the non-regression model, we split the genes into three sets: lowly, medium and highly expressed. The cut-off values were chosen such that a third of the genes falls into each category. We dissected the results of this experiment in three ways. First, we visualized boxplots showing all estimates of gene-specific parameters for a single sub-sampling experiment (Figure 3). Second, we computed the log2 fold change for estimates of gene-specific over-dispersion parameters δ_i between the regression and non-regression BASiCS models (Figures S3A–S3C). Third, for each sub-sampling experiment, sample size and gene set, we computed the median log2 fold change in μ_i and δ_i and the median difference for ε_i between estimates and the pseudo ground truth. The median and the range of these values across the 10 sub-sampling experiments are used for visualization purposes (see Figures S3D–S3F). External validation for posterior estimates of gene-specific model parameters was obtained using matched scRNA-seq and smFISH data of mouse embryonic stem cells grown in 2i and serum media (see Table S1 and Grün et al., 2014). As in Brennecke et al. (2013), to calculate residual CV2 values for the smFISH data, we defined residuals obtained after fitting a gamma generalized linear model with an identity link (glmgam.fit of the statmod package in R) between the CV2 and the reciprocal log-transformed mean transcript counts.
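The summary statistics used in the sub-sampling comparison can be sketched as follows. The example assumes that mean and over-dispersion estimates are compared on the log2 fold change scale and that residual over-dispersion estimates, which already live on a log scale, are compared by simple differences. Function names and values are hypothetical.

```python
import math
from statistics import median

def median_log2_fold_change(estimates, ground_truth):
    """Median of log2(estimate / truth) across genes (for mean or over-dispersion)."""
    return median(math.log2(e / t) for e, t in zip(estimates, ground_truth))

def median_difference(estimates, ground_truth):
    """Median of (estimate - truth) across genes (for residual over-dispersion)."""
    return median(e - t for e, t in zip(estimates, ground_truth))

# Hypothetical over-dispersion estimates from a sub-sample versus the full data
delta_sub = [2.0, 4.0, 8.0]
delta_full = [1.0, 1.0, 1.0]
# Hypothetical residual over-dispersion estimates
eps_sub = [0.5, 0.1, -0.2]
eps_full = [0.0, 0.0, 0.0]

lfc = median_log2_fold_change(delta_sub, delta_full)
diff = median_difference(eps_sub, eps_full)
print(lfc, diff)
```

Values close to zero for both summaries indicate that estimates from the sub-sample agree with the pseudo ground truth for that gene set.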

Changes in Variability during CD4+ T Cell Activation

Firstly, we compare the results obtained by the regression BASiCS model with those presented in Martinez-Jimenez et al. (2017). To allow a direct comparison of the results, the same inclusion criteria as in Martinez-Jimenez et al. (2017) are adopted, i.e. we excluded genes with low mean expression in both conditions from testing. Moreover, our minimum tolerance thresholds were also adapted to match the choices in Martinez-Jimenez et al. (2017). To detect differentially expressed genes (mean), a minimum tolerance threshold was used (see Figure S5A). To compare the detection of differentially over-dispersed genes, we performed differential mean expression testing using a stringent minimum tolerance threshold for both models (this is to avoid the results being confounded by changes in mean, see upper panel in Figure S5B). For the 463 genes that are detected as non-differentially expressed by both models for this threshold, a total of 111 genes are detected as differentially over-dispersed by either model (using a minimum tolerance log2 fold change threshold). Out of this set, 93 genes (∼84%) are detected as differentially over-dispersed by both models (see lower panel in Figure S5B). In this article, we exclude genes whose estimated mean expression parameter μ_i was below 1 from the differential testing. Furthermore, a log2 fold change threshold was adopted for mean expression testing. Unlike the more stringent threshold used by Martinez-Jimenez et al. (2017), this choice allows us to detect more subtle changes in mean expression. Moreover, the default threshold was used for differential variability testing. The expected false discovery rate (EFDR) was controlled to 10%. Genes were sorted into four categories based on their changes in variability and mean expression: down-regulated upon activation with (i) lower and (ii) higher variability, and up-regulated with (iii) lower and (iv) higher variability (see Figure 5A).
For each of these gene sets, functional annotation analysis was performed using all tested genes as background. The functional annotation clustering tool in DAVID (Dennis et al., 2003) was used to cluster annotation categories based on similarity and to sort them according to their enrichment score. Here, we list the top 3 functional annotation clusters per gene set and their corresponding enrichment score (ES):
Down-regulated with lower variability: Pleckstrin homology domain (ES = 1.57), G protein signalling (ES = 1.51), glycosidase (ES = 1.49).
Down-regulated with higher variability: Ankyrin repeat-containing domain (ES = 2.19), GTPase mediated signalling (ES = 1.51), steroid biosynthesis (ES = 0.89).
Up-regulated with lower variability: RNA polymerase (ES = 1.6), RNA binding (ES = 1.53), splicing (ES = 1.41).
Up-regulated with higher variability: Cytokine-cytokine receptor interaction (ES = 1.65), WD40 repeat (ES = 1.22), transcription (ES = 1.18).
To visualize gene expression in individual cells, we denoised the raw expression counts using the BASiCS_DenoisedCounts function. Finally, we performed a synthetic experiment to illustrate how individual cells that highly express certain genes can drive the detection of changes in variability. For this purpose, we created a mixed population of cells by combining 5 activated CD4+ T cells with a population of 93 naive CD4+ T cells. In this mixture, response genes are lowly expressed on average and show expression outliers in a small subset of cells. Il2 represents a gene with statistically significant higher mean expression and higher residual over-dispersion in the mixed population (see Figure S5C). All genes that show increased mean expression as well as increased residual over-dispersion are visualized in Figure S5D.
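The four-way grouping of genes used in this analysis (direction of change in mean expression crossed with direction of change in variability) can be sketched as a small helper. The example assumes that each gene has already been tested, so that the inputs are the signs of statistically significant changes in activated relative to naive cells, with 0 meaning no significant change; the function name and example values are hypothetical.

```python
def categorise(mean_change, variability_change):
    """Assign a gene to one of the four categories, or None if either
    change is not significant (encoded as 0)."""
    if mean_change == 0 or variability_change == 0:
        return None
    direction = "up-regulated" if mean_change > 0 else "down-regulated"
    variability = "higher variability" if variability_change > 0 else "lower variability"
    return f"{direction} with {variability}"

# Hypothetical genes: (signed change in mean, signed change in residual over-dispersion)
genes = {
    "gene_a": (1.8, -0.9),
    "gene_b": (2.5, 1.1),
    "gene_c": (-1.2, 0.0),
}
categories = {name: categorise(m, v) for name, (m, v) in genes.items()}
print(categories)
```

Genes that fall into none of the four categories, such as the third example, are left out of the functional annotation analysis of these gene sets.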

Changes in Variability during CD4+ T Cell Differentiation

To detect changes in over-dispersion and residual over-dispersion (variability) during CD4+ T cell differentiation, we performed two sets of tests between day 2 and day 4, day 4 and day 7, and day 2 and day 7. The minimum tolerance log2 fold change threshold to test changes in mean expression in the first test was set to , while the threshold for the second test was set to . The default threshold was used for differential variability testing. EFDR was controlled to 10%. To visualize gene expression in individual cells, we denoised the raw expression counts using the BASiCS_DenoisedCounts function. The results of the first stringent test allow us to detect genes that do not change in mean expression between any of the three time points (126 genes). For these genes, the δ estimates are therefore comparable across the time points, avoiding the confounding with mean expression (see Figure 6A). To detect genes that show different variability patterns across the time points, we first removed all genes that are expressed in fewer than 2 cells in at least one time point. For the remaining genes, the second testing strategy was used and all genes with statistically significant changes in variability between day 2 and day 4, and day 4 and day 7 were collected (see Figure 6B). For analysis in Figures 6C and 6D the second testing strategy was used to detect changes in variability between day 2 and day 4. Finally, we selected gene sets listed in Lönnberg et al. (2017) to visualize their changes in mean expression and residual over-dispersion. The first set of genes is taken from Figure 3E of the original publication, which filtered genes based on their association with the bifurcation of Th1 and Tfh differentiation. The second set of genes with sequential peak expression over pseudotime is taken from Figure 5A of the original publication, which were selected based on immunological relevance from a list of dynamic genes during in vivo differentiation (see Figure S6).

Data and Software Availability

BASiCS is freely available as part of Bioconductor 3.7 (bioconductor.org). The results displayed in this manuscript and its supplemental material use BASiCS version 1.1.57. All R scripts for data preparation and analysis are available at github.com/MarioniLab/RegressionBASiCS2017. This link also includes instructions to download all the publicly available datasets used throughout our analyses.

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited Data
Mouse genome reference GRCm38	ENSEMBL	ftp://ftp.ensembl.org/pub/release-91/fasta/mus_musculus/dna/
Naive and activated CD4+ T cells	Martinez-Jimenez et al. (2017)	ArrayExpress: E-MTAB-4888
Differentiating CD4+ T cells	Lönnberg et al. (2017)	ArrayExpress: E-MTAB-4388
Mouse brain cells	Zeisel et al. (2015)	GEO: GSE60361
Dictyostelium cells	Antolović et al. (2017)	Data S1 in original publication
Pool-split RNA	Grün et al. (2014)	GEO: GSE54695
Software and Algorithms
BASiCS version 1.1.57	This paper	https://www.bioconductor.org/packages/3.7/bioc/html/BASiCS.html
Gsnap v2014-12-29	Wu and Nacu (2010)	https://github.com/juliangehring/GMAP-GSNAP
HTSeq v0.6	Anders et al. (2015)	https://htseq.readthedocs.io
DAVID v6.8	Dennis et al. (2003)	https://david.ncifcrf.gov/

53 in total

1. ICOS receptor instructs T follicular helper cell versus effector cell differentiation via induction of the transcriptional repressor Bcl6.

Authors: Youn Soo Choi; Robin Kageyama; Danelle Eto; Tania C Escobar; Robert J Johnston; Laurel Monticelli; Christopher Lao; Shane Crotty
Journal: Immunity Date: 2011-06-24 Impact factor: 31.745

2. Synthetic spike-in standards for RNA-seq experiments.

Authors: Lichun Jiang; Felix Schlesinger; Carrie A Davis; Yu Zhang; Renhua Li; Marc Salit; Thomas R Gingeras; Brian Oliver
Journal: Genome Res Date: 2011-08-04 Impact factor: 9.043

Review 3. Basics of PD-1 in self-tolerance, infection, and cancer immunity.

Authors: Shunsuke Chikuma
Journal: Int J Clin Oncol Date: 2016-02-10 Impact factor: 3.402

4. Pseudotemporal Ordering of Single Cells Reveals Metabolic Control of Postnatal β Cell Proliferation.

Authors: Chun Zeng; Francesca Mulas; Yinghui Sui; Tiffany Guan; Nathanael Miller; Yuliang Tan; Fenfen Liu; Wen Jin; Andrea C Carrano; Mark O Huising; Orian S Shirihai; Gene W Yeo; Maike Sander
Journal: Cell Metab Date: 2017-05-02 Impact factor: 27.287

Review 5. The many roles of FAS receptor signaling in the immune system.

Authors: Andreas Strasser; Philipp J Jost; Shigekazu Nagata
Journal: Immunity Date: 2009-02-20 Impact factor: 31.745

6. Transcriptome-wide noise controls lineage choice in mammalian progenitor cells.

Authors: Hannah H Chang; Martin Hemberg; Mauricio Barahona; Donald E Ingber; Sui Huang
Journal: Nature Date: 2008-05-22 Impact factor: 49.962

7. Suboptimal T-cell receptor signaling compromises protein translation, ribosome biogenesis, and proliferation of mouse CD8 T cells.

Authors: Thomas C J Tan; John Knight; Thomas Sbarrato; Kate Dudek; Anne E Willis; Rose Zamoyska
Journal: Proc Natl Acad Sci U S A Date: 2017-07-10 Impact factor: 11.205

8. Octamer-dependent transcription in T cells is mediated by NFAT and NF-κB.

Authors: Kerstin Mueller; Jasmin Quandt; Ralf B Marienfeld; Petra Weihrich; Katja Fiedler; Melina Claussnitzer; Helmut Laumen; Martin Vaeth; Friederike Berberich-Siebelt; Edgar Serfling; Thomas Wirth; Cornelia Brunner
Journal: Nucleic Acids Res Date: 2013-01-04 Impact factor: 16.971

9. HTSeq--a Python framework to work with high-throughput sequencing data.

Authors: Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal: Bioinformatics Date: 2014-09-25 Impact factor: 6.937

10. Batch effects and the effective design of single-cell gene expression studies.

Authors: Po-Yuan Tung; John D Blischak; Chiaowen Joyce Hsiao; David A Knowles; Jonathan E Burnett; Jonathan K Pritchard; Yoav Gilad
Journal: Sci Rep Date: 2017-01-03 Impact factor: 4.379
