Hirak Sarkar, Avi Srivastava, Rob Patro.
Abstract
SUMMARY: With advances in high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start from a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction (PCR) amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines that produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2019 PMID: 31510649 PMCID: PMC6612833 DOI: 10.1093/bioinformatics/btz351
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the minnow pipeline. On the left-hand side (a), the construction of unitig-based equivalence classes is depicted, based on the compacted de Bruijn graph built from the reference sequences. Unitigs u1 and u2 are discarded because they are more than MAX_FRAGLEN bp away from the ends of transcripts t1 and t2. The equivalence classes are then constructed as discussed in Section 2.1.3. On the right-hand side (b), the transcript-level equivalence class structure obtained from alevin is used to derive a per-gene probability vector. Finally, the probabilities are mapped directly to the unitig labels.
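The distance filter described in the caption can be illustrated with a minimal sketch. It is not minnow's implementation: the `UnitigHit` record, the helper names, the threshold value of 1000 bp and the choice to measure distance from the last base of the unitig to the transcript end are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical threshold; the caption names MAX_FRAGLEN but does not give its value.
MAX_FRAGLEN = 1000


@dataclass(frozen=True)
class UnitigHit:
    """One occurrence of a unitig on a transcript (hypothetical record type)."""
    unitig: str
    transcript: str
    start: int   # 0-based start of the unitig on the transcript
    length: int  # unitig length in bp


def distance_to_end(hit, transcript_lengths):
    """Gap in bp between the last base of the unitig and the transcript end.

    Assumption: distance is measured from the end of the unitig.
    """
    return transcript_lengths[hit.transcript] - (hit.start + hit.length)


def filter_hits(hits, transcript_lengths, max_fraglen=MAX_FRAGLEN):
    """Keep only unitig occurrences within max_fraglen bp of the transcript end."""
    return [h for h in hits if distance_to_end(h, transcript_lengths) <= max_fraglen]


# Toy example mirroring the caption: u1 and u2 lie far from the transcript ends.
transcript_lengths = {"t1": 3000, "t2": 2500}
hits = [
    UnitigHit("u1", "t1", start=100, length=200),   # 2700 bp from the end
    UnitigHit("u2", "t2", start=50, length=300),    # 2150 bp from the end
    UnitigHit("u3", "t1", start=2500, length=300),  # 200 bp from the end
    UnitigHit("u3", "t2", start=2000, length=300),  # 200 bp from the end
]
kept = filter_hits(hits, transcript_lengths)
kept_unitigs = sorted({h.unitig for h in kept})
```

In this toy input only `u3` survives the filter, on both transcripts, so it would be the only unitig carried forward into an equivalence class.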
Timing and memory required by minnow to simulate data with various parameters
| Reads | Cells | PCR cycles | Threads | Time (hh:mm:ss) | Memory (KB) |
|---|---|---|---|---|---|
| 100M | 1000 | 4 | 8 | 0:10:44 | 7 556 108 |
| 100M | 1000 | 4 | 16 | 0:05:39 | 11 163 216 |
| 100M | 1000 | 7 | 8 | 0:16:56 | 8 723 320 |
| 100M | 1000 | 7 | 16 | 0:09:01 | 13 449 888 |
| 800M | 8000 | 4 | 8 | 0:56:28 | 28 249 676 |
| 800M | 8000 | 4 | 16 | 0:31:18 | 31 855 624 |
| 800M | 8000 | 7 | 8 | 1:43:32 | 29 246 148 |
| 800M | 8000 | 7 | 16 | 0:53:15 | 34 217 500 |
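One way to read the timing table is as a thread-scaling experiment. The sketch below converts the reported wall-clock times to seconds and computes the speedup from 8 to 16 threads for each configuration. The helper name `to_seconds` and the `runs` layout are illustrative choices; the times themselves are copied from the table.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string (spaces tolerated) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.replace(" ", "").split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (reads, cells, PCR cycles) -> (time with 8 threads, time with 16 threads)
runs = {
    ("100M", 1000, 4): ("0:10:44", "0:05:39"),
    ("100M", 1000, 7): ("0:16:56", "0:09:01"),
    ("800M", 8000, 4): ("0:56:28", "0:31:18"),
    ("800M", 8000, 7): ("1:43:32", "0:53:15"),
}

# Speedup = time on 8 threads divided by time on 16 threads.
speedups = {
    config: to_seconds(t8) / to_seconds(t16)
    for config, (t8, t16) in runs.items()
}

for config, speedup in speedups.items():
    print(config, f"{speedup:.2f}x")
```

On these numbers, doubling the thread count gives a speedup of roughly 1.8x to 1.95x in every configuration, while the memory column shows that peak memory also grows with the number of threads.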
Fig. 2. Performance of quantification tools, stratified by gene uniqueness, under a ‘basic’ configuration (based on pbmc 4k dataset).
Spearman correlation and MARD are calculated with respect to ground truth under three different configurations based on the same gene-count matrix produced by Splatter
| Configuration | Correlation (Spearman) | | | MARD | | |
|---|---|---|---|---|---|---|
| | CR2 | CR3 | | CR2 | CR3 | |
| Adversarial | 0.811 | 0.809 | 0.723 | 0.075 | 0.076 | 0.107 |
| Realistic | 0.920 | 0.915 | 0.880 | 0.043 | 0.046 | 0.076 |
| Favorable | 0.957 | 0.952 | 0.936 | 0.031 | 0.035 | 0.047 |
Note: CR2 and CR3 stand for Cell-Ranger-2 and Cell-Ranger-3, respectively.
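For readers unfamiliar with the two metrics in the table, the following sketch shows how they can be computed for one vector of true counts and one vector of estimated counts. The excerpt does not spell out the exact formulas or how values are aggregated across cells, so this assumes the commonly used definition of MARD (the mean over genes of |x − y| / (x + y), taken as 0 when both values are 0) and Spearman correlation as the Pearson correlation of average ranks. All function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mard(truth, estimate):
    """Mean absolute relative difference (assumed definition, see lead-in)."""
    ards = [
        0.0 if t + e == 0 else abs(t - e) / (t + e)
        for t, e in zip(truth, estimate)
    ]
    return mean(ards)


# Toy counts for four genes in one cell (made-up numbers, not from the paper).
truth = [0, 10, 5, 0]
estimate = [0, 8, 5, 2]
toy_mard = mard(truth, estimate)
toy_spearman = spearman(truth, estimate)
```

Under this definition MARD lies between 0 and 1, with lower values indicating better agreement, whereas a higher Spearman correlation indicates better agreement. This matches the direction of the trends from the adversarial to the favorable configuration in the table.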