| Literature DB >> 31197158 |
Xiuwei Zhang1,2, Chenling Xu1, Nir Yosef3,4,5.
Abstract
The abundance of new computational methods for processing and interpreting transcriptomes at a single cell level raises the need for in silico platforms for evaluation and validation. Here, we present SymSim, a simulator that explicitly models the processes that give rise to data observed in single cell RNA-Seq experiments. The components of the SymSim pipeline pertain to the three primary sources of variation in single cell RNA-Seq data: noise intrinsic to the process of transcription, extrinsic variation indicative of different cell states (both discrete and continuous), and technical variation due to low sensitivity and measurement noise and bias. We demonstrate how SymSim can be used for benchmarking methods for clustering, differential expression and trajectory inference, and for examining the effects of various parameters on their performance. We also show how SymSim can be used to evaluate the number of cells required to detect a rare population under various scenarios.Entities:
Mesh:
Year: 2019 PMID: 31197158 PMCID: PMC6565723 DOI: 10.1038/s41467-019-10500-w
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1Overview of SymSim. The true transcript counts, which are the number of molecules for each transcript in each cell at the time of analysis, are generated through the classical promoter kinetic model with parameters: promoter on rate (kon), off rate (koff) and RNA synthesis rate (s). The values of the kinetic parameters are determined by the product of gene-specific coefficients (termed gene effects) and cell-specific coefficients. The latter set of coefficients is termed extrinsic variability factors (EVF), and it is indicative of the cell state. The expected value of each EVF is determined in accordance to the position of the cell in a user-defined tree structure. The tree dictates the structure of the resulting cell–cell similarity map (which can be either discrete or continuous) since the distance between any two cells in the tree is proportional to the expected distance between their EVF values. For homogenous populations (represented by a single location in the tree), the EVFs are drawn iid from a distribution whose mean is the expected EVF value and variance is provided by the user. From the true transcript counts we explicitly simulate the key experimental steps of library preparation and sequencing, and obtain observed counts, which are read counts for full-length mRNA sequencing protocols, and UMI counts, otherwise
Fig. 2Intrinsic variation. a A diagram of how gene and cell-specific kinetic parameters are simulated from cell-specific EVF and gene-specific gene effect vectors, and how the kinetic parameters are used in a model of transcription. Each cell has a separate EVF vector for Kon, Koff, and S. Each parameter is generated through two steps: first, for each gene in each cell, we take the dot product of the corresponding EVF and gene effect vectors. Second, the dot product values are mapped to distributions of parameters estimated from experimental data. The matched parameters are used to generate true transcript counts (see Methods). b The distributions of kon, koff, and s that are used in SymSim for simulations. These distributions are aggregated from inferred results of three subpopulations of the UMI cortex dataset (oligodendrocytes, pyramidal CA1 and pyramidal S1) after imputation by scVI and MAGIC. c A heatmap showing the effect of parameter kon and koff on the number of modes in transcript counts. The value of s is fixed to 10 in this plot. The red area with low kon and koff have one zero mode and one non-zero mode. The gray area with low kon and high koff has only one zero mode, and the blue area with high kon and low koff have one non-zero mode. The yellow arrow shows how the parameter bimod can modify the amount of bimodality in the transcript count distribution. d Histogram heatmaps of transcript count distribution of the true simulated counts with varying values of bimod, showing that increasing bimod increases the zero-components of transcript counts and the number of bimodal genes. In these heatmaps, each row corresponds to a gene, each column corresponds to a level of expression, and the color intensity is proportional to the number of cells that express the respective gene at the respective expression level. Data used to plot b–d can be found in Source Data
Fig. 3Extrinsic variation. a Illustration of generating a diverse set of cell states with SymSim. The tree represents the relationship between cells. The numbers on the edges are branch lengths; the node numbers indicate the ID of the respective subpopulation (each subpopulation is represented by a single position [leaf] in the tree). The matrix to the right depicts the derivation of EVF values. Each row corresponds to an EVF (only two are Diff-EVF), each column corresponds to a position in the tree, and the content specifies the distribution from which the EVF values are drawn. We use the notation y(b) to represent the expected value of EVF b in position a in the tree. The rightmost plot depicts the derivation of these expected values with Brownian motion. We use subpopulations 2 and 3 as examples for both discrete cases (sampling only cells within the subpopulations) or continuous (sampling cells along the trajectories from the root progenitor state [node 6] to the two target subpopulations [nodes 2 and 3]). b tSNE plots of five discrete populations generated from the tree structure shown in a. Different values of σ give rise to different heterogeneity of each population. c tSNE plots of continuous populations generated from the same tree. The colors correspond to the colors on branches in the tree shown in a. When increasing σ, cells are more scattered around the main paths which follow the tree structure. Data used to plot b, c can be found in Source Data
Fig. 4Technical variation. a A diagram showing the workflow of adding technical variation to true simulated counts. Each gray or orange square represents a molecule of the same transcript in one cell. We implement the following steps: mRNA capturing, pre-amplification (PCR or linear amplification of the cDNAs), fragmentation, amplification after fragmentation, sequencing, and calculation of UMI counts or read counts. Details of these steps can be found in Methods. b Gene length bias in both simulated and experimental data for the non-UMI protocol. Error bars represent the ranges of (mean-SD, mean + SD), where SD means standard deviation. c Scatter plots comparing true counts and observed counts obtained through: (1) non-UMI, good parameters (α = 0.2, MaxAmpBias = 0.1, Depth = 1e6) for high quality data; (2) UMI, good parameters (α = 0.2, MaxAmpBias = 0.1, Depth = 5e5) for high quality data; (3) non-UMI, bad parameters (α = 0.05, MaxAmpBias = 0.2, Depth = 1e6) for low quality data; (4) UMI, bad parameters (α = 0.04, MaxAmpBias = 0.2, Depth = 5e5) for low quality data. d 2D transcript counts histogram heatmap of UMI and non-UMI simulated true counts and simulated observed counts, generated with parameters which best match the input experimental counts, and histogram heatmaps of the respective experimental counts (non-UMI Th17, UMI cortex and UMI 10x t4k datasets). e Q–Q plots comparing the mean, percent non-zero and standard deviation in experimental counts and SymSim simulated observed counts respectively for the non-UMI, UMI cortex and UMI 10x t4k datasets. A good match is indicated by most of the dots falling close to the red line. Data used to plot b–e can be found in Source Data
Fig. 5Benchmarking of clustering methods. a Coefficients of various parameters from multiple linear regression between parameters and the adjusted Rand index (ARI). In the left plot the ARI are averaged over all populations, and in the right plot the ARI is only for the rare population (population 2). b ARI of the rare populations using the four clustering methods when changing σ (α = 0.04). Left plot: the rare population accounts for 5% of all the cells; right plot: the rare population accounts for 10% of all the cells. c ARI of the rare populations using the four clustering methods when changing α (σ = 0.6). Left plot: the rare population accounts for 5% of all the cells; right plot: the rare population accounts for 10% of all the cells. Data used to plot a–c can be found in Source Data
Fig. 6Benchmarking of DE detection methods. a Illustration of how DE genes are generated through the Diff-EVFs. Red squares in the gene effect matrix correspond to non-zero values. The two genes indicated by the arrows are DE genes by number of Diff-EVFs they have (respectively, 2 and 1). b Q–Q plot comparing the p-value obtained from differential expression analysis between subpopulations 2 and 4 (using Wilcoxon test on the true simulated counts) to a uniform distribution. Genes are grouped by the number of Diff-EVFs they use and different groups are plotted in different colors. The numbers of Diff-EVFs used by genes can be thought of as the degree of DE-ness. Genes with more Diff-EVFs have p-values further diverged from uniform distribution. c Venn diagram showing that closely related populations have less DE genes between them compared to distantly related populations. We use populations 1, 2, 4 as examples: there are much more DE genes from comparison of “1 vs 2” and “1 vs 4” than “2 vs 4”, and DE genes from “1 vs 2” and “1 vs 4” have a big overlap. The DE genes are determined by log2 fold change (LFC) of true counts with criterion |LFC| > 0.8. d The AUROC (area under receiver operating characteristic curve) of detecting DE genes using four different methods from observed counts with changing capture efficiency α (σ = 0.6). The populations under comparison are 2 and 4. Three sets of criteria were used to define the true DE genes and the final performance was the average performance from the three sets: (1) nDiff-EVFgene > 0 and |LFC| > 0.6; (2) nDiff-EVFgene > 0 and |LFC| > 0.8; (3) nDiff-EVFgene > 0 and |LFC| > 1. LFC was calculated with theoretical means from the kinetic parameters. e The negative of correlation between log2 fold change on theoretical mean of gene-expression and p-values obtained by a DE detection method, with changing capture efficiency α (σ = 0.6). The populations under comparison are 2 and 4. Data used to plot b, d, e can be found in Source Data
Fig. 7Benchmark trajectory inference methods. a Pseudotime correlation and knn purity of all methods when varying σ (α = 0.1). b Pseudotime correlation and knn purity of all methods when varying α (σ = 0.6). Data used to plot a, b can be found in Source Data
Fig. 8The number of cells needed to detect a rare population. We generate five populations according to the tree structure shown in Fig. 3 and set population 2 as the rare population which accounts for 5% of the cells. Other populations share 95% of the cells evenly. The criteria of detecting the rare population are that at least 50 cells from this population are correctly detected and the precision (positive predicted value) is at least 70%. a–d The probability of detecting the rare population when sequencing N (x-axis) cells under different σ and α configurations, with different clustering methods. The black curve represents the theoretical probability from the binomial model, assuming that all cells sequenced are assigned correctly to the original population. The gray curve with transparency takes the maximum value at each data point from all four clustering methods with smoothing. Error bars are standard deviation over 20 randomizations. e The heatmaps show the number of cells needed to sequence under different configurations of σ and α to detect the rare population with success rates 0.6, 0.7, 0.8, 0.9, always using the best clustering method. Data used to plot a–e can be found in Source Data
Runtime and memory usage of SymSim
| Capture efficiency | Protocol | Sequencing depth | Runtime (min) | Memory |
|---|---|---|---|---|
| 0.05 | UMI | 100,000 | 9.5 | 1.8 G |
| 0.2 | UMI | 100,000 | 32.8 | 2.3 G |
| 0.05 | nonUMI | 100,000 | 8.6 | 1.8 G |
| 0.2 | nonUMI | 100,000 | 16.5 | 1.7 G |
Parameters and their ranges for simulating the UMI and nonUMI databases
| Parameters | non-UMI database | UMI database |
|---|---|---|
| Sigma (σ): within-population heterogeneity | 0.1, 0.2, 0.6 | 0.2, 0.6 |
| Gene_effects_sd: standard deviation for generating gene effect vectors | 1, 2 | 2 |
| scale_s: cell size parameter, use small values for cell types known to be small | 0.3, 0.5, 0.6, 1 | 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1 |
| prop_hge: proportion of extreme high-expression outlier genes | 0.015, 0.02, 0.025, 0.03 | 0.01, 0.015, 0.02, 0.025, 0.03 |
| mean_hge: scale factor for high-expression outlier genes | 3, 4, 5, 6 | 3, 4, 5, 6 |
| Alpha_mean (α): mean mRNA capture efficiency | 0.05, 0.1, 0.15, 0.2, 0.25 | 0.007, 0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1 |
| Alpha_sd (β): standard deviation of mRNA capture efficiency across cells | 0.005, 0.01, 0.015, 0.02, 0.03, 0.045, 0.06, 0.075 | 7e-04, 0.001, 0.0014, 0.002, 0.003, 0.0035, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 0.012, 0.015, 0.02, 0.025, 0.03, 0.035, 0.05 |
| depth_mean (depth): mean sequencing depth for a set of cells | 1e + 05, 5e + 05, 1e + 06, 2e + 06 | 45,000, 70,000, 95,000, 150,000, 3e + 05, 5e + 05 |
| Depth_sd: standard deviation of sequencing depth for a set of cells | 100,00,30,000, 50,000, 1e + 05, 150,000, 2e + 05, 3e + 05, 6e + 05 | 4500, 7e3, 9e3, 9.5e3, 13500, 14e3, 15e3, 19e3, 21e3, 22,500, 28,500, 3e4, 35e3, 4.5e4, 47500, 5e4, 7.5e4, 9e4, 15e4, 25e4 |
| nPCR1: number of PCRs in pre-amplification phase | 14, 18 | 10, 14 |