| Literature DB >> 27052890 |
Rhonda Bacher1, Christina Kendziorski2.
Abstract
Single-cell RNA-sequencing (scRNA-seq) has emerged as a revolutionary tool that allows us to address scientific questions that eluded examination just a few years ago. With the advantages of scRNA-seq come computational challenges that are just beginning to be addressed. In this article, we highlight the computational methods available for the design and analysis of scRNA-seq experiments, their advantages and disadvantages in various settings, the open questions for which novel methods are needed, and expected future developments in this exciting area.Entities:
Mesh:
Year: 2016 PMID: 27052890 PMCID: PMC4823857 DOI: 10.1186/s13059-016-0927-y
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Prominent features in single-cell RNA-seq data relative to bulk RNA-seq include an abundance of zeros, increased variability, and multi-modal expression distributions. a Boxplots of the gene-specific proportion of zeros in a bulk (bulk1) and single-cell (sc1) dataset stratified by percentile of median gene expression. Sequencing depth ranges from 420,000 to 16.6 million in bulk1 and 385,000 to 16.4 million in sc1 (samples were chosen to have comparable depths; see the “Data” section). b Densities of gene-specific log variance for all genes in three bulk and three single-cell RNA-seq datasets. Densities are also shown for the single-cell datasets for log variances calculated following the removal of zeros, emphasizing that the increased variability observed relative to bulk is not entirely due to the presence of zeros. c For each dataset shown in b, 1000 genes were selected at random from the list of genes for which at least 75 % of cells showed non-zero expression. For each gene, zeros were removed and Mclust [92] was applied to log expression to estimate the number of modes. Because zeros were removed prior to Mclust, a mode at zero will not contribute to the total number of modes shown
Statistical methods for single-cell RNA-seq experiments
| Name | Description | Requirements/deliverables |
|---|---|---|
| Normalization | ||
| GRM [ | Fits polynomial gamma regression model to FPKM data from spike-ins; estimated parameters are used to convert FPKM of endogenous genes to an absolute scale within each cell | Performs within cell normalization and may be used with FPKM, RPKM, or TPM |
| SAMstrt [ | The resampling-based bulk normalization method in SAMseq is applied to spike-ins | Assumes that an equal number of spike-in control RNA molecules have been added to all samples |
| Identifying highly variable genes | ||
| Brennecke et al. [ | A gamma generalized linear model fit to the mean-variance relationship quantified by the square of the coefficient of variation (CV2) of the spike-ins estimates technical noise parameters. These parameters are then used to estimate technical variability for endogenous genes and to test whether each gene exceeds a variability threshold | Spike-ins and endogenous genes are normalized separately using the median normalization method. Gene specific |
| Kim et al. [ | Uses spike-ins to estimate parameters related to technical variance, allowing for differences in variability across cells. Estimates gene-specific biological variability by subtracting technical variability from total variance | Normalization factors are estimated using the median normalization method. A simulation based framework to test for highly variable genes is provided |
| BASiCS [ | Jointly models spike-ins and endogenous genes as two Poisson-Gamma hierarchicalmodels with shared parameters | Estimates normalization parameters jointly across all genes. Gene-specific posterior probabilities are provided to identify both lowly and highly variable genes |
| Noise reduction | ||
| scLVM [ | Uses a Gaussian Process Latent variable model to estimate the covariance matrix associated with latent factors. Residuals from a linear mixed model with the covariance term represent de-noised expression estimates | Requires genes associated with the latent factor to be identified a priori. Normalization factors are estimated using the median normalization method |
| OEFinder [ | Uses orthogonal polynomial regression to identify genes whose expression is associated with position on the C1 Fluidigm integrated fluidic circuit (IFC) | Gene-specific |
| Sub-population identification | ||
| ZIFA [ | Models dropout rate as a function of expression in a factor analysis (linear dimension reduction) framework | Requires normalized, log-transformed estimates of gene expression (zeros are not transformed) |
| Destiny [ | Extends diffusion maps (a non-linear dimension reduction approach) to handle zeros and sampling density heterogeneities inherent in single cell data | Requires variance-stabilized gene expression estimates; works best with a large number of cells |
| SNN-Cliq [ | Clusters cells by identifying and merging sub-graphs (quasi-cliques) in a shared nearest neighbor (SNN) graph; the number of clusters is chosen automatically | Requires a reduced set of genes. Xu and Su [ |
| RaceID [ | Uses k-means applied to a similarity matrix of Pearson’s correlation coefficients for all pairs of cells; the number of clusters is chosen using the gap statistic. Outlier cells are those that cannot be explained by a background model that accounts for technical and biological noise. In a second step, rare subpopulations can be identified and outlier cells may be merged to an outlier cluster; new cluster centers are then computed and each cell is assigned to the most highly correlated cluster center | Requires a reduced set of genes. Grün et al. [ |
| SCUBA [ | Uses k-means to cluster data along a binary tree detailing bifurcation events for time-course data. Models expression regulation along the tree using bifurcation theory | Requires a reduced set of genes. Marco et al. [ |
| BackSPIN [ | Iteratively splits a two-way sorted (by both genes and cells) expression matrix into two clusters containing independent cells and genes, for a maximum number of splits. The algorithm has a stopping condition to avoid splitting data that are very homogeneous | Requires a reduced set of genes and the maximum number of splits allowed. Zeisel et al. [ |
| PCA/t-SNE [ | Linear/non-linear dimension reduction approach used for unsupervised clustering of cells | Input is typically a correlation or similarity matrix |
| PAGODA [ | Allows for both detection and interpretation of the transcriptional heterogeneity within a cell population. A weighted principal component analysis (PCA) is conducted for each gene set; those sets for which the variance explained by the first principal component significantly exceeds genome-wide background expectation are identified. To provide a non-redundant view of heterogeneity structure, principal components from different gene sets showing high similarity are combined to form a single component of heterogeneity | Requires un-normalized gene expression counts (performs internal correction as in SCDE). Uses gene ontology (GO) annotated or user-defined gene sets |
| Differential detection | ||
| MAST [ | A logistic regression model is used to test differential expression rate between groups while a Gaussian generalized linear model (GLM) describes expression conditionally on non-zero expression estimates. Models are corrected for cellular detection rate | Requires normalized gene expression estimates and provides gene-specific |
| SCDE [ | Models gene-specific expression as a two-component mixture: a Poisson component describes zero and a Negative Binomial describes non-zero measurements | Requires un-normalized gene expression counts (performs internal correction) and provides gene-specific posterior probabilities of differential expression (DE) between two biological conditions. Tests for DE are performed on non-zeros |
| scDD [ | Models expressed counts as a Dirichlet process mixture (DPM) of normals to test for differentially distributed (DD) genes associated with multi-modality in the expressed component. Samples from the posterior further characterize the gene-specific distributional difference between two biological conditions to identify genes that are differentially expressed (DE), differ in the proportion of cells within modes (DP), differ in the number of modes (DM), or are both DE and DM (DB) | Requires normalized, log-transformed gene expression estimates and provides gene-specific |
| Pseudotemporal ordering | ||
| Monocle [ | Reduces data using independent component analysis (ICA) and constructs a minimum spanning tree (MST) to order cells in pseudotime | Requires normalized, log-transformed gene expression estimates and a reduced set of genes. Trapnell et al. [ |
| Waterfall [ | Unsupervised clustering is used to identify clusters of cells for which a putative ordering is determined on the basis of their relative location in a PCA plot. K-means clustering of single-cell transcriptomes on the PCA plot and an MST that connects cluster centers determines pseudotime | Requires normalized estimates of gene expression with outliers removed |
| Sincell [ | A flexible R workflow for building cell hierarchies with multiple options for dimension reduction, clustering, and graph building. Allows the user to assess the similarity of graphs and performs resampling or random cell substitution with simulated replicates to assess the robustness of estimated hierarchies | Requires normalized, log-transformed gene expression estimates and a reduced set of genes. Juliá et al. [ |
| Oscope [ | Uses a paired-sine model and K-medoids clustering to identify groups of oscillatory genes. For each oscillatory group, an extended nearest insertion algorithm is used to construct the cyclic order of cells, defined as the order that specifies each cell’s position within one cycle of the oscillation of that group | Identifies groups of oscillatory genes, when present. Requires normalized gene expression and use of only high mean, high variance genes is recommended |
| Wanderlust [ | Cells are represented as nodes in an ensemble of k-nearest neighbor graphs. For each graph, a user-defined starting cell is used to calculate an orientation trajectory by iteratively computing the shortest-path distance between cells. The final trajectory is an average over all graphs | Developed using single-cell mass cytometry data, which typically describe few genes (<50) and tens of thousands of cells |