Pengyi Yang, Jean Yee Hwa Yang, Yue Cao.
Abstract
Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data, especially when ground truth is experimentally unattainable. The reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We develop a comprehensive evaluation framework, SimBench, which includes a kernel density estimation measure, to benchmark 12 simulation methods across 35 scRNA-seq experimental datasets. We evaluate the simulation methods on a panel of data properties, their ability to maintain biological signals, scalability and applicability. Our benchmark uncovers performance differences among the methods and highlights the varying difficulties in simulating data characteristics. Furthermore, we identify several limitations, including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide the selection of simulation methods and their future development.
Year: 2021 PMID: 34824223 PMCID: PMC8617278 DOI: 10.1038/s41467-021-27130-w
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Schematic of the benchmarking workflow.
a A total of 35 datasets, covering a range of protocols, tissue types, organisms and sample sizes, were used in this benchmark study. b We evaluated 12 simulation methods available in the literature to date. c Multiple aspects of evaluation were examined in this study, with the primary focuses illustrated in detail in panel d. e Finally, we summarised the results into a set of recommendations for users and identified potential areas of improvement for developers.
Table 1 scRNA-seq simulation methods evaluated in this study.
| Method | Year of publication | Approach | Estimate from multiple cell groups | Simulate multiple cell groups | Customise DE expressionᵃ | Assign gene name to generated data | Primary purpose as general simulation? |
|---|---|---|---|---|---|---|---|
| scDD | 2016 | Dirichlet process mixture of normals | Restricted to two groups | Restricted to two groups | Yes | No | No, used for generating differentially distributed genes defined in the scDD study and evaluating the scDD framework |
| Splat | 2017 | Gamma distribution for modelling mean expression; Poisson distribution for modelling count | No, requires a homogenous population (e.g. one cell type) | Yes, can simulate any number of groups | Yes | No | Yes |
| powsimR | 2017 | Negative binomial or zero-inflated negative binomial model | No, requires a homogenous population (i.e. one cell type) | Restricted to two groups | Yes | Yes | No, power analysis tool for single-cell and bulk RNA-seq |
| SparseDC | 2017 | Optimisation framework | Restricted to two conditions with multiple cell groups within each condition | Restricted to two conditions with multiple cell groups within each condition | Yes | No | No, used for generating the simulation data for assessing the performance of the SparseDC clustering method |
| zingeR | 2018 | Negative binomial model with additive logistic regression to account for zeros | Yes, can estimate from any number of groups | Yes, can simulate any number of groups | Yes | No | No, used for generating simulation data for assessing the performance of the zingeR DE method |
| ZINB-WaVE | 2018 | Zero-inflated negative binomial model | Yes, can estimate from any number of groups | Restricted to the groups in the input data | No | No | No, dimension reduction method for scRNA-seq |
| SymSim | 2019 | Kinetic model using Markov chain Monte Carlo | No, requires a homogenous population (i.e. one cell type) | Yes, can simulate any number of groups | Yes | No | Yes |
| scDesign | 2019ᵇ | Gamma-normal mixture model | Restricted to one and two groups | Restricted to one and two groups | Yes | No | No, power analysis tool for scRNA-seq |
| SPARSim | 2020 | Gamma distribution for modelling expression; multivariate hypergeometric distribution for modelling technical variability | Yes, can estimate from any number of groups | Yes, can simulate any number of groups | Yes | Yes | Yes |
| SPsimSeq | 2020 | Estimation of probability distribution uses fast log-linear model-based density estimation method; Gaussian-copulas for modelling gene–gene correlation | Yes, can estimate from any number of groups | Restricted to the groups in the input data | Yes | Yes | Yes |
| POWSC | 2020 | Mixture of zero-inflated Poisson for modelling inactive transcription; log-normal Poisson for modelling the active transcription | Yes, can estimate from any number of groups | Restricted to the groups in the input data | Yes | No | No, power analysis tool for scRNA-seq |
| cscGAN | 2020 | Generative adversarial network with Wasserstein distance | Yes, can estimate from any number of groups | Restricted to the groups in the input data | No | Yes | Yes |

ᵃ Includes either proportion of differential expression or fold change.
ᵇ We benchmarked the version of scDesign published in 2019. We note that during the final preparation stage of our work, a newer version, scDesign2, was published [35].
Fig. 2 Ranking of methods across key aspects of evaluation criteria.
The colour and size of each circle denote the ranking of a method, where a large blue circle represents the best possible rank of 1. A blank space indicates that a measurement could not be obtained, for example, because the output format is normalised count instead of raw count (see ‘Methods’). The ranks within each criterion were summarised into an overall tier rank, with tier 1 being the best tier. a Ranking of methods within data property estimation, ranked by median score across multiple datasets. b Ranking of methods within biological signals, ranked by median score across multiple datasets. c Scalability was ranked by the total computational speed and memory usage required for properties estimation and dataset generation across datasets. d Applicability was examined in terms of three criteria, which are explained in more detail in Table 1. The number of datasets used in the entire evaluation process and the success rate of each method in running the datasets are reported in Supplementary Fig. 4.
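The median-based ranking described for panels a and b can be illustrated with a minimal Python sketch. This is not the SimBench implementation (the framework is distributed as R packages); the function name `rank_by_median`, the placeholder method labels and all score values are hypothetical and serve only to show the idea of ranking methods by their median score across datasets, with methods lacking measurements left unranked.

```python
from statistics import median


def rank_by_median(scores):
    """Rank methods by median score across datasets (rank 1 = best).

    `scores` maps a method label to a list of per-dataset scores, where a
    higher score is assumed to be better. Methods with no scores are
    omitted, mirroring a blank space in a ranking figure. Ties keep the
    input order (an arbitrary choice made for this sketch).
    """
    medians = {m: median(v) for m, v in scores.items() if v}
    ordered = sorted(medians, key=medians.get, reverse=True)
    return {m: rank for rank, m in enumerate(ordered, start=1)}


# Hypothetical scores for three placeholder methods over a few datasets.
example_scores = {
    "method_A": [0.9, 0.7, 0.8],
    "method_B": [0.5, 0.6],
    "method_C": [],  # no measurement could be obtained
}
example_ranks = rank_by_median(example_scores)
print(example_ranks)
```

Using the median rather than the mean keeps a single unusually easy or difficult dataset from dominating a method's rank.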
Fig. 3 Impact of dataset characteristics on method performance.
a Impact of the number of cells on selected properties (see Supplementary Fig. 6 for all properties). Lines show the trend with increasing cell numbers; dots indicate where measurements were taken. b Impact of protocol, examined using two collections of datasets (see Supplementary Fig. 7 for individual methods). Boxplots show the individual scores of each property for each method.
Fig. 4 Comparison of criteria in data property estimation and in biological signals.
a Evaluation procedure for data property estimation and biological signals. b The evaluation results and the comparison of criteria within the two aspects of evaluation. For data property estimation, the KDE score measures the difference between the distributions of 13 data properties in simulated and in real data. A score closer to 1 indicates greater similarity. Each boxplot shows the distribution of the median KDE scores attained by all simulation methods (n = 12), with the KDE score attained by each method shown as an individual data point. The box represents quartiles, the line represents the median, and the lower and upper whiskers represent the bottom 25% and top 25% of the data. Outliers can be seen as the individual data points that lie outside the whiskers. For biological signals, the SMAPE score measures the percentage difference between the proportions of biological signals detected in simulated and in real data. A score of 1 indicates no difference in the biological signals detected in real and simulated data and a score of 0 indicates maximal difference.
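The SMAPE-based score for biological signals can be sketched as follows. This is a minimal illustration rather than the SimBench code: it assumes the variant of the symmetric mean absolute percentage error that is bounded between 0 and 1, |sim − real| / (|sim| + |real|), and reports one minus that value so that 1 means no difference and 0 means maximal difference, matching the caption. The function name `smape_score` and the example proportions are hypothetical.

```python
def smape_score(real_props, sim_props):
    """Return 1 - SMAPE for paired proportions from real and simulated data.

    Assumes the 0-to-1 bounded SMAPE variant:
        mean(|sim - real| / (|sim| + |real|))
    A pair where both proportions are zero is treated as "no difference"
    (an assumption made for this sketch).
    """
    if len(real_props) != len(sim_props) or not real_props:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for real, sim in zip(real_props, sim_props):
        denom = abs(real) + abs(sim)
        total += 0.0 if denom == 0 else abs(sim - real) / denom
    return 1.0 - total / len(real_props)


# Hypothetical proportions of a signal (e.g. DE genes) detected in each dataset.
identical = smape_score([0.2], [0.2])   # same proportion in real and simulated data
lost = smape_score([0.3], [0.0])        # signal present in real data, absent in simulated
partial = smape_score([0.1], [0.3])     # signal over-represented in simulated data
print(identical, lost, partial)
```

Because the denominator is the sum of the two proportions, the score penalises a relative mismatch rather than an absolute one, so a method that loses a rare signal entirely scores as poorly as one that loses a common signal.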