Luke Zappia, Belinda Phipson, Alicia Oshlack.
Abstract
As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.
Keywords: RNA-seq; Simulation; Single-cell; Software
Year: 2017 PMID: 28899397 PMCID: PMC5596896 DOI: 10.1186/s13059-017-1305-0
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 The Splat simulation model. Input parameters are indicated with double borders and those that can be estimated from real data are shaded blue. Red shading indicates the final output. The simulation begins by generating means from a gamma distribution. Outlier expression genes are added by multiplying by a log-normal factor and the means are proportionally adjusted for each cell’s library size. Adjusting the means using a simulated Biological Coefficient of Variation (BCV) enforces a mean-variance trend. These final means are used to generate counts from a Poisson distribution. In the final step dropout is (optionally) simulated by randomly setting some counts to zero, based on each gene’s mean expression. DoF, degrees of freedom.
Input parameters for the Splat simulation model
| Name | Symbol | Description |
|---|---|---|
| Mean shape | | Shape parameter for the mean gene expression gamma distribution |
| Mean rate | | Rate parameter for the mean gene expression gamma distribution |
| Library size location | | Location parameter for the library size log-normal distribution |
| Library size scale | | Scale parameter for the library size log-normal distribution |
| Outlier probability | | Probability that a gene is an expression outlier |
| Outlier location | | Location parameter for the expression outlier factor log-normal distribution |
| Outlier scale | | Scale parameter for the expression outlier factor log-normal distribution |
| Common BCV | | Common BCV dispersion across all genes |
| BCV degrees of freedom | | Degrees of freedom for the BCV inverse chi-squared distribution |
| Dropout midpoint | | Midpoint for the dropout logistic function |
| Dropout shape | | Shape of the dropout logistic function |
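The steps in the Fig. 1 caption, combined with the parameters listed above, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Splatter implementation: the function names, the dictionary keys, and the example values are hypothetical, the exact form of the BCV trend and of the dropout logistic are assumed rather than taken from the text above, and the Poisson sampler is a simple stand-in because the Python standard library does not provide one.

```python
import math
import random


def poisson(rng, lam):
    """Poisson draw: Knuth's method for small lam, rounded normal approximation otherwise."""
    if lam <= 0:
        return 0
    if lam > 30:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def splat_sketch(n_genes, n_cells, p, dropout=False, seed=0):
    """Return a cells x genes list of counts following the steps in the Fig. 1 caption."""
    rng = random.Random(seed)
    # Step 1: gene means from a gamma distribution (gammavariate takes shape and scale = 1/rate).
    means = [rng.gammavariate(p["mean_shape"], 1.0 / p["mean_rate"]) for _ in range(n_genes)]
    # Step 2: with probability out_prob, multiply a gene's mean by a log-normal outlier factor.
    means = [m * rng.lognormvariate(p["out_loc"], p["out_scale"])
             if rng.random() < p["out_prob"] else m for m in means]
    total = sum(means)
    counts = []
    for _ in range(n_cells):
        # Step 3: draw a library size and rescale the means proportionally to it.
        lib_size = rng.lognormvariate(p["lib_loc"], p["lib_scale"])
        row = []
        for m in means:
            cell_mean = lib_size * m / total
            if cell_mean <= 0:  # guard against numerical underflow
                row.append(0)
                continue
            # Step 4 (assumed form): the BCV falls as the mean rises and is scaled by an
            # inverse chi-squared draw; a gamma draw with that CV then adjusts the mean.
            chisq = rng.gammavariate(p["bcv_df"] / 2.0, 2.0)
            bcv = (p["bcv_common"] + 1.0 / math.sqrt(cell_mean)) * math.sqrt(p["bcv_df"] / chisq)
            adjusted = rng.gammavariate(1.0 / bcv ** 2, cell_mean * bcv ** 2)
            # Step 5: counts from a Poisson distribution around the adjusted mean.
            count = poisson(rng, adjusted)
            # Step 6 (optional, assumed form): zero the count with a probability given by
            # a logistic function of the log mean.
            if dropout:
                p_zero = logistic(p["drop_shape"] * (math.log(cell_mean) - p["drop_mid"]))
                if rng.random() < p_zero:
                    count = 0
            row.append(count)
        counts.append(row)
    return counts


# Illustrative values only; they are not estimated from any dataset.
EXAMPLE_PARAMS = {
    "mean_shape": 0.6, "mean_rate": 0.3,
    "lib_loc": 11.0, "lib_scale": 0.2,
    "out_prob": 0.05, "out_loc": 4.0, "out_scale": 0.5,
    "bcv_common": 0.1, "bcv_df": 60.0,
    "drop_mid": 0.0, "drop_shape": -1.0,
}
```

Calling `splat_sketch(100, 20, EXAMPLE_PARAMS, dropout=True)` returns 20 rows of 100 integer counts. A negative dropout shape makes the probability of a zero fall as the mean rises, which is the behaviour the caption describes only loosely as being "based on each gene’s mean expression".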
Fig. 2 Comparison of simulations based on the Tung dataset. The left column panels show the distribution of mean expression (a), variance (c) and library size (g) across the real dataset and the simulations as boxplots, along with a scatter plot of the mean–variance relationship (e). The right column shows boxplots of the ranked differences between the real data and simulations for the same statistics: mean (b), variance (d), mean–variance relationship (f), and library size (h). Note that the y-axis for plots of the variance has been limited in order to show more detail. Variances for the Lun and Lun 2 simulations extend beyond what has been shown.
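One plausible reading of "ranked differences" is that the values of a statistic are sorted separately in the real and simulated datasets and then subtracted position by position. The sketch below assumes that reading; the function name is hypothetical and both inputs must have the same length.

```python
def ranked_differences(real, simulated):
    """Sort each vector, then subtract the real value from the simulated value at each rank."""
    if len(real) != len(simulated):
        raise ValueError("real and simulated must have the same length")
    return [s - r for r, s in zip(sorted(real), sorted(simulated))]
```

For example, `ranked_differences([3, 1, 2], [2, 4, 3])` compares 1 with 2, 2 with 3, and 3 with 4, so a simulation that reproduces the real distribution exactly yields all zeros regardless of gene order.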
Fig. 3 Comparison of zeros in simulations based on the Tung dataset. The top row shows boxplots of the distribution of zeros per cell (a) and the difference from the real data (b). The distribution (c) and difference (d) in zeros per gene are shown in the middle row. The bottom row shows scatter plots of the relationship between the mean expression of a gene (including cells with zero counts) and the percentage of zeros as both the raw observations (e) and ranked differences from the real data (f).
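The zero-count summaries named in this caption can be computed from a count matrix as in the following sketch. It assumes a cells x genes list of lists and reports zeros as percentages; the function name and return layout are hypothetical.

```python
def zero_stats(counts):
    """Summarise zeros in a cells x genes count matrix given as a list of rows."""
    n_cells, n_genes = len(counts), len(counts[0])
    # Percentage of genes with a zero count in each cell.
    zeros_per_cell = [100.0 * sum(c == 0 for c in row) / n_genes for row in counts]
    # Percentage of cells with a zero count for each gene.
    zeros_per_gene = [100.0 * sum(row[g] == 0 for row in counts) / n_cells
                      for g in range(n_genes)]
    # Mean expression per gene, including the cells with zero counts.
    mean_per_gene = [sum(row[g] for row in counts) / n_cells for g in range(n_genes)]
    return zeros_per_cell, zeros_per_gene, mean_per_gene
```

Pairing `mean_per_gene` with `zeros_per_gene` gives the points for a mean versus percentage-of-zeros scatter plot like the one described for panel (e).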
Details of real datasets
| Dataset | Species | Cell type | Platform | Protocol | UMI | Number of cells |
|---|---|---|---|---|---|---|
| Camp | Human | Whole brain organoids | Fluidigm C1 | SMARTer | No | 597 |
| Engel | Mouse | Natural killer T cells | Flow cytometry | Modified Smart-seq2 | No | 203 |
| Klein | Human | K562 cells | InDrop | CEL-Seq | Yes | 213 |
| Tung | Human | Induced pluripotent stem cells | Fluidigm C1 | Modified SMARTer | Yes | 564 |
| Zeisel | Mouse | Cortex and hippocampus cells | Fluidigm C1 | STRT-Seq | Yes | 3005 |
Fig. 4 Comparison of simulation models based on various datasets. For each dataset, parameters were estimated and synthetic datasets generated using various simulation methods. The median absolute deviation (MAD) between each simulation and the real data was calculated for a range of metrics and the simulations ranked. A heatmap of the ranks across the metrics and datasets is presented here. We see that the Splat simulation (with and without dropout) performs consistently well, with the BASiCS simulation and the two versions of the Lun 2 simulation also performing well.
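The ranking step can be illustrated with a short sketch. Here the MAD is taken as the median of the absolute ranked differences between a simulation and the real data; that definition is an assumption, since the caption does not spell it out. The function names and the simulation labels used in the usage note are hypothetical.

```python
from statistics import median


def mad_from_real(real, simulated):
    """Median of absolute differences between sorted real and sorted simulated values."""
    if len(real) != len(simulated):
        raise ValueError("real and simulated must have the same length")
    diffs = [abs(s - r) for r, s in zip(sorted(real), sorted(simulated))]
    return median(diffs)


def rank_simulations(real, sims):
    """Rank simulations for one metric: rank 1 has the smallest MAD from the real data.

    Ties keep the order in which the simulations were supplied.
    """
    mads = {name: mad_from_real(real, values) for name, values in sims.items()}
    ordered = sorted(mads, key=mads.get)
    ranks = {name: position + 1 for position, name in enumerate(ordered)}
    return ranks, mads
```

With `real = [1, 2, 3, 4]` and `sims = {"sim_a": [1, 2, 3, 5], "sim_b": [3, 4, 5, 6], "sim_c": [2, 3, 4, 5]}`, the MADs are 0, 2, and 1, so `sim_a` ranks first and `sim_b` last. Repeating this per metric and per dataset would give the grid of ranks that a heatmap summarises.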
Fig. 5 Examples of complex Splat simulations. a A principal component analysis (PCA) plot of a simulation with six groups with varying numbers of cells and levels of differential expression. b A PCA plot of a simulation with two groups (pink and blue) and two batches (circle and triangle). PC1 separates groups (wanted biological variation) while PC2 separates batches (unwanted technical variation). c A PCA plot of a simulation with differentiation paths; the colored gradient indicates how far along a path each cell is from blue to pink. A progenitor cell type (blue circles) differentiates into an intermediate cell type (pink circles/blue triangles or diamonds), which becomes one of two (pink triangle or diamond) mature cell types.
Fig. 6 Evaluation of SC3 results. Metrics for the evaluation of clustering (a) include the Rand index, Hubert and Arabie’s adjusted Rand index (HA), Morey and Agresti’s adjusted Rand index (MA), Fowlkes and Mallows index (FM), and the Jaccard index. Detection of differentially expressed and marker genes was evaluated (b) using accuracy, recall (true positive rate), precision, F1 score (harmonic mean of precision and recall), and false positive rate (FPR). All of the metrics are presented here as boxplots across the 20 simulations.
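Several of the metrics named in this caption have simple closed forms. The sketch below computes the unadjusted pair-counting indices (Rand, Jaccard, Fowlkes and Mallows) from two label vectors, and the gene detection metrics from confusion-matrix counts. The two adjusted Rand indices are omitted, the function names are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here for simplicity.

```python
import math
from itertools import combinations


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a convention, not a standard)."""
    return numerator / denominator if denominator else 0.0


def pair_counting_indices(truth, predicted):
    """Rand, Jaccard, and Fowlkes-Mallows indices from pairs of items."""
    both = truth_only = pred_only = neither = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            both += 1
        elif same_truth:
            truth_only += 1
        elif same_pred:
            pred_only += 1
        else:
            neither += 1
    total = both + truth_only + pred_only + neither
    return {
        "rand": _ratio(both + neither, total),
        "jaccard": _ratio(both, both + truth_only + pred_only),
        "fm": _ratio(both, math.sqrt((both + truth_only) * (both + pred_only))),
    }


def detection_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, F1, and false positive rate from confusion counts."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(fp, fp + tn),
    }
```

In a simulation the true group of every cell and the true status of every gene are known, which is what makes these counts available in the first place.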