Raghd Rostom, Valentine Svensson, Sarah A Teichmann, Gozde Kar.
Abstract
The recent developments in high-throughput single-cell RNA sequencing technology (scRNA-seq) have enabled the generation of vast amounts of transcriptomic data at cellular resolution. With these advances come new modes of data analysis, building on high-dimensional data mining techniques. Here, we consider biological questions for which scRNA-seq data is used, both at a cell and gene level, and describe tools available for these types of analyses. This is an exciting and rapidly evolving field, where clustering, pseudotime inference, branching inference and gene-level analyses are particularly informative areas of computational analysis.
Keywords: single-cell analysis methods and tools; single-cell genomics
Year: 2017 PMID: 28524227 PMCID: PMC5575496 DOI: 10.1002/1873-3468.12684
Source DB: PubMed Journal: FEBS Lett ISSN: 0014-5793 Impact factor: 4.124
Figure 1. Overview of analysis methods for the interpretation of scRNA‐seq data.
Tools for the visualization and clustering of cells
| Method | Description | Input | Availability |
|---|---|---|---|
| Dimensionality reduction and clustering of cells | | | |
| PCA | Linear dimensionality reduction, producing a set of uncorrelated components that explain decreasing amounts of variation in the data. | Expression table | |
| t‐SNE | Nonlinear dimensionality reduction: t‐distributed Stochastic Neighbour Embedding. | Expression table | |
| ZIFA | A linear dimensionality reduction technique, using the factor analysis framework, that explicitly models dropout characteristics. | Log‐transformed count values | |
| Destiny | A fast implementation of diffusion maps for R. | Expression matrix (with a suggested variance‐stabilizing transformation, for example, square root) | |
| SNN‐cliq | Graph‐theory‐based algorithm; uses a shared nearest neighbour (SNN) graph based upon a subset of genes. The number of clusters is chosen automatically. | Log‐transformation of normalized expression (e.g. RPKM) | |
| RaceID | Iterative k‐means clustering of a Pearson correlation matrix, with the number of clusters chosen using the gap statistic. | Raw gene expression matrix | |
| SC3 | Distance is calculated first, followed by k‐means clustering. Instead of optimizing parameters (e.g. distance metric, matrix transformation), SC3 combines several clustering outcomes and outputs an averaged result. | Normalized expression values | |
| SIMLR | Learns a similarity measure from scRNA‐seq data to perform dimensionality reduction, clustering and visualization. | Raw gene expression estimates and number of cell populations | |
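The first and simplest route through the table above (PCA on a log‐transformed expression table, followed by k‐means on the leading components) can be sketched as follows. This is a minimal illustration using only NumPy, not the implementation of any listed tool; the function names `pca` and `kmeans`, the toy Poisson data and all parameter values are assumptions made for the example.

```python
import numpy as np

def pca(expr, n_components=2):
    """Project cells (rows) onto the leading principal components via SVD."""
    centered = expr - expr.mean(axis=0)              # centre each gene (column)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # cell coordinates on the PCs
    explained = s ** 2 / np.sum(s ** 2)              # fraction of variance per PC
    return scores, explained[:n_components]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster became empty.
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Toy data (an assumption): two populations of 30 cells and 50 genes,
# differing in the mean expression of the first 10 genes.
rng = np.random.default_rng(0)
means_a = np.full(50, 5.0)
means_b = means_a.copy()
means_a[:10] = 20.0
means_b[:10] = 2.0
counts = np.vstack([
    rng.poisson(means_a, size=(30, 50)),
    rng.poisson(means_b, size=(30, 50)),
])

log_expr = np.log1p(counts)                  # simple log transformation
scores, explained = pca(log_expr, n_components=2)
labels = kmeans(scores, k=2)
```

On real data the number of components and the number of clusters would need to be chosen, which is the step that tools such as RaceID (gap statistic) and SC3 (consensus over several clusterings) address.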
Tools for the ordering of cells & bifurcation/branch identification
| Method | Description | Input | Availability |
|---|---|---|---|
| Pseudo‐temporal ordering of cells | | | |
| PQ‐trees | Samples are ordered by a minimum spanning tree of the data, using a PQ‐tree construction. | Expression table | |
| Monocle2 | A principal graph is embedded in the transcriptome space; distance along the graph from a start cell defines pseudotime. | Expression table, batch effect formula, gene list (can be found through DE), dimensionality reduction options (method, number of dimensions) | Bioconductor package ‘monocle’ |
| Wishbone | Diffusion maps on a reduced k‐NN graph (using waypoints). | Expression table, start cell, number of waypoints, number of nearest neighbours k | Python: |
| Wanderlust | Heuristic k‐NN graph geodesic distance. | Expression table | In CYT: |
| DPT | Diffusion components are averaged for each sample based on spectral embedding, and used as a distance between samples. | Expression table, variance of Gaussian kernel, start cell | For R and Matlab: |
| GPLVM | Assumes genes follow smooth functions and infers time as a latent parameter. | Expression table or dimensionality reduction, covariance function, optional priors, optional covariance function hyperparameters | GPy |
| Ouija | Given a small number of genes with sigmoidal expression over the trajectory, treats time as a latent variable. | Expression table, list of assumed switch‐like genes, optional priors on switching time and direction | Bioconductor package ‘ouija’ |
| Branching analysis | | | |
| Wishbone | Two branches are detected by clustering detours between cells relative to a starting cell in terms of pseudotime. | Expression table | |
| Anticorrelation clustering | Branch points are identified when anticorrelated distances (relative to a start cell) become correlated. After this, cells can be segmented to belong to either of the two branches, or the trunk. | Expression table | |
| OMGP/GPfates | Models data as a mixture of continuous processes. Each cell obtains a posterior probability of being generated by each of the branches. | Expression table | |
| Monocle | The principal graph fitted to the expression data explicitly has the concept of branches, to which cells are assigned. | Expression table, gene list | |
| Mpath | Finds a minimum spanning tree in a neighbourhood graph of landmarks. | Expression table | |
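Several of the ordering methods above rest on the same idea: build a k‐nearest‐neighbour graph over cells and measure distance along that graph from a chosen start cell. The sketch below illustrates only that core idea with NumPy and a standard Dijkstra search. It is not Wanderlust, Wishbone or DPT (no waypoints, diffusion components or ensembles); the function names, the value of `k` and the toy half‐circle trajectory are assumptions made for the example.

```python
import heapq
import numpy as np

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as adjacency dicts of Euclidean weights."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a cell is not its own neighbour
    graph = [dict() for _ in range(len(points))]
    for i in range(len(points)):
        for j in np.argsort(dists[i])[:k]:
            j = int(j)
            graph[i][j] = float(dists[i, j])
            graph[j][i] = float(dists[i, j])  # make the edge undirected
    return graph

def geodesic_pseudotime(points, start, k=5):
    """Shortest-path distance from the start cell along the k-NN graph (Dijkstra)."""
    graph = knn_graph(points, k)
    pseudotime = np.full(len(points), np.inf)
    pseudotime[start] = 0.0
    queue = [(0.0, start)]
    while queue:
        d, i = heapq.heappop(queue)
        if d > pseudotime[i]:
            continue                          # stale queue entry
        for j, w in graph[i].items():
            if d + w < pseudotime[j]:
                pseudotime[j] = d + w
                heapq.heappush(queue, (d + w, j))
    return pseudotime

# Toy trajectory (an assumption): 40 cells along a noisy half circle,
# presented in shuffled order so that the ordering has to be recovered.
rng = np.random.default_rng(1)
true_time = rng.permutation(np.linspace(0.0, 1.0, 40))
cells = np.column_stack([np.cos(np.pi * true_time), np.sin(np.pi * true_time)])
cells += rng.normal(scale=0.01, size=cells.shape)

start_cell = int(np.argmin(true_time))        # the start cell must be supplied
pseudotime = geodesic_pseudotime(cells, start_cell, k=5)
ordering = np.argsort(pseudotime)             # cells sorted from early to late
```

As the Input column indicates for several tools, the start cell is a user‐supplied parameter: the graph distance has no intrinsic direction without it.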
Tools for gene‐level analysis
| Method | Description | Input | Availability |
|---|---|---|---|
| Identification of differentially expressed genes | | | |
| Designed specifically for single‐cell RNA‐seq data | | | |
| SCDE | Bayesian method to compare two groups of single cells, taking into account variability in scRNA‐seq data due to dropout and amplification biases. | Raw gene expression counts | |
| MAST | Uses a two‐part generalized linear model that is adjusted for cellular detection rate. | Normalized gene expression values | |
| M3Drop | Applies Michaelis‐Menten modelling of dropouts to identify differential expression. | Raw gene expression counts | |
| scDD | A Bayesian modelling framework to identify genes that are differentially expressed and/or show a differential number of modes or differential proportion of cells within modes. | Normalized and log‐scaled gene expression values | |
| SINCERA | Identifies DE genes based on simple statistical tests such as Wilcoxon rank sum and | Raw gene expression values | |
| Designed originally for bulk RNA‐seq data | | | |
| DESeq2 | Fits a GLM for each gene, uses shrinkage estimation for dispersions and fold changes, and applies a Wald or likelihood‐ratio test for significance testing. | Raw gene expression counts | |
| edgeR | Fits a negative binomial distribution for each gene, estimates dispersions by conditional maximum likelihood, and identifies differential expression using an exact test adapted for overdispersed data. Supports arbitrary linear models. | Raw gene expression counts | |
| Identification of highly variable genes | | | |
| Brennecke | Biological variability of genes is inferred after quantifying the technical noise based on the squared coefficient of variation (CV²) of the spike‐in molecules. | Raw expression counts for both spike‐ins and endogenous genes | |
| Kim | Presents a statistical framework to decompose the total variance into technical and biological variance based on a generative model. | Raw expression counts for both spike‐ins and endogenous genes | |
| BASiCS | Uses a Bayesian approach that jointly models spike‐ins and endogenous genes. Posterior probabilities associated with highly (or lowly) variable genes are provided. | Raw expression counts for both spike‐ins and endogenous genes | |
| Unwanted factor removal | | | |
| scLVM | Uses a Gaussian process latent variable model to dissect observed heterogeneity into different sources, allowing removal of confounding factors of variation such as cell cycle‐induced variation. | Raw gene expression counts and a set of genes associated with the latent factor | |
| ComBat | Removes known batch effects based on an empirical Bayesian framework. | Normalized and log‐scaled gene expression counts and batch information | |
| OEFinder | Identifies potential artefacts (ordering effects) generated by the Fluidigm C1 platform using orthogonal polynomial regression. | A set of genes (and | |
| RUVSeq | Adjusts for nuisance technical effects by performing factor analysis on a set of control genes such as spike‐ins or samples such as replicate libraries. | Raw gene expression counts and a set of control genes, spike‐ins or replicate libraries | |
| Pseudotime analysis | | | |
| Monocle | Spline regression using VGAM. | Expression table, gene list | |
| SwitchDE | Finds genes which are explained as sigmoid curves over pseudotime. | Expression table | Bioconductor package ‘switchde’ |
| ImpulseDE | Finds genes which follow an impulse model. | Expression table | Bioconductor package ‘impulsede’ |
| GP Regression | Finds genes which follow any non‐linear smooth function. | Expression table | GPy |
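The spike‐in‐based approach to highly variable genes listed above can be illustrated with a short calculation: estimate how CV² depends on mean expression for the spike‐ins, which carry only technical noise, and then ask which endogenous genes exceed that expectation. The sketch below is a deliberately simplified stand‐in, not the published procedure of any listed method. The noise model CV² = a1/mean + a0, the ordinary least‐squares fit, the clipping of a0 at zero, the fixed `fold` threshold and the simulated counts are all assumptions made for the example; the actual tools use more careful fitting and formal statistical tests.

```python
import numpy as np

def mean_and_cv2(counts):
    """Per-gene mean and squared coefficient of variation (rows = cells)."""
    mean = counts.mean(axis=0)                # assumes every gene has a non-zero mean
    var = counts.var(axis=0, ddof=1)
    return mean, var / mean ** 2

def fit_technical_noise(spike_counts):
    """Least-squares fit of CV^2 = a1 / mean + a0 on the spike-ins."""
    mean, cv2 = mean_and_cv2(spike_counts)
    design = np.column_stack([1.0 / mean, np.ones_like(mean)])
    (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
    # Assumption: clip a negative intercept so expected noise stays positive.
    return float(a1), max(float(a0), 0.0)

def highly_variable(gene_counts, a1, a0, fold=2.0):
    """Flag genes whose CV^2 exceeds `fold` times the fitted technical CV^2."""
    mean, cv2 = mean_and_cv2(gene_counts)
    return cv2 > fold * (a1 / mean + a0)

# Simulated data (an assumption): 200 cells, 30 spike-ins with pure Poisson
# noise, and 100 genes of which the first 10 carry extra cell-to-cell variation.
rng = np.random.default_rng(0)
n_cells = 200
spikes = rng.poisson(np.geomspace(1, 1000, 30), size=(n_cells, 30))

gene_means = rng.uniform(20, 500, size=100)
rates = np.tile(gene_means, (n_cells, 1))
rates[:, :10] *= rng.gamma(shape=1.0, scale=1.0, size=(n_cells, 10))
genes = rng.poisson(rates)

a1, a0 = fit_technical_noise(spikes)
flags = highly_variable(genes, a1, a0)        # boolean mask, one entry per gene
```

The same decomposition idea underlies the Kim and BASiCS entries, which replace the fixed threshold with a generative model and posterior probabilities, respectively.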