Charlotte Soneson, Mark D Robinson.
Abstract
Summary: Statistical tools for biological data analysis are often evaluated using synthetic data, designed to mimic the features of a specific type of experimental data. The generalizability of such evaluations depends on how well the synthetic data reproduce the main characteristics of the experimental data, and we argue that an assessment of this similarity should accompany any synthetic dataset used for method evaluation. We describe countsimQC, which provides a straightforward way to generate a stand-alone report that shows the main characteristics of (e.g. RNA-seq) count data and can be provided alongside a publication as verification of the appropriateness of any utilized synthetic data.
Availability and implementation: countsimQC is implemented as an R package (for R versions ≥ 3.4) and is available from https://github.com/csoneson/countsimQC under a GPL (≥2) license.
Contact: charlotte.soneson@uzh.ch or mark.robinson@imls.uzh.ch.
Year: 2018 PMID: 29028961 PMCID: PMC5860609 DOI: 10.1093/bioinformatics/btx631
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Example illustrations from a countsimQC report, comparing characteristics of three datasets: one real single-cell RNA-seq dataset (Original) and two datasets simulated using the real dataset as the underlying source of population parameters (powsim, splat). (A) BCV (biological coefficient of variation) as a function of average expression level, both calculated by edgeR (Robinson ). (B) Distribution of library sizes (total count sum) across all cells. (C) Distribution of average expression levels (log count per million, as calculated by edgeR) across all genes. (D) Distribution of the fraction of zeros across all genes. (E) Relationship between the average expression level (log count per million) and the fraction of zeros for all genes.
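To make the quantities behind panels B to E concrete, the following is a minimal sketch of how they could be computed from a genes-by-samples count matrix. It is not the countsimQC implementation and does not use edgeR: the function names, the toy matrix, and the `prior_count` parameter are all hypothetical, and the log count per million here is a plain `log2(CPM + prior_count)` average rather than edgeR's model-based estimate. It also assumes every sample has a non-zero library size.

```python
import math

# Toy count matrix (hypothetical data): rows are genes, columns are samples/cells.
counts = [
    [0, 10, 5],
    [3, 0, 0],
    [7, 10, 15],
]

def library_sizes(counts):
    """Total count sum per sample (column), as in panel B."""
    return [sum(col) for col in zip(*counts)]

def fraction_zeros_per_gene(counts):
    """Fraction of samples with a zero count, per gene (panels D and E)."""
    return [sum(1 for c in row if c == 0) / len(row) for row in counts]

def average_log_cpm(counts, prior_count=1.0):
    """Simplified per-gene average log2 count per million (panels C and E).

    Assumption: this is a naive stand-in for edgeR's calculation.
    Each count is scaled by its sample's library size to counts per
    million, offset by prior_count to avoid log(0), and log2-transformed;
    the per-gene value is the mean across samples.
    """
    libs = library_sizes(counts)
    result = []
    for row in counts:
        logs = [math.log2(c / lib * 1e6 + prior_count) for c, lib in zip(row, libs)]
        result.append(sum(logs) / len(logs))
    return result

if __name__ == "__main__":
    print("library sizes:", library_sizes(counts))
    print("fraction of zeros per gene:", fraction_zeros_per_gene(counts))
    print("average log2 CPM per gene:", average_log_cpm(counts))
```

Computing the same summaries for a real dataset and for each simulated dataset, and then comparing their distributions, is the kind of side-by-side comparison the figure illustrates.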