Scott Van Buren, Hirak Sarkar, Avi Srivastava, Naim U Rashid, Rob Patro, Michael I Love.
Abstract
MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of "inferential replicates", which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storing and using these replicates increases computation time and memory requirements.
Year: 2021 PMID: 33471073 PMCID: PMC8289386 DOI: 10.1093/bioinformatics/btab001
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Compression of scRNA-seq quantification uncertainty. This procedure stores solely the mean and variance of the bootstrap replicate count matrices, with this compressed information later used to regenerate marginal (per-gene) pseudo-inferential replicates as needed. CB, cell barcode; UMI, unique molecular identifier; NB, negative binomial
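The compression idea in Fig. 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`compress`, `nb_size`, `poisson`, `pseudo_replicates`) and the example counts are hypothetical, the negative binomial is parameterised by method of moments (variance = mean + mean²/size), and the fallback to a Poisson draw when the variance does not exceed the mean is an assumption made here.

```python
import math
import random
from statistics import mean, variance


def compress(replicates):
    """Keep only the mean and sample variance of one entry's bootstrap counts."""
    return mean(replicates), variance(replicates)


def nb_size(mu, var):
    """Method-of-moments NB size r, solved from var = mu + mu**2 / r.

    Returns None when mu is zero or there is no overdispersion (var <= mu).
    """
    if mu <= 0 or var <= mu:
        return None
    return mu * mu / (var - mu)


def poisson(lam, rng):
    """Poisson draw by Knuth's product method, chunked so exp(-lam) cannot underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        limit, k, prod = math.exp(-step), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        total += k
        lam -= step
    return total


def pseudo_replicates(mu, var, n, rng):
    """Regenerate n counts from the stored (mean, variance) pair.

    NB draws use the gamma-Poisson mixture; Poisson is the assumed fallback.
    """
    size = nb_size(mu, var)
    draws = []
    for _ in range(n):
        lam = mu if size is None else rng.gammavariate(size, mu / size)
        draws.append(poisson(lam, rng))
    return draws


rng = random.Random(1)
bootstrap_counts = [3, 7, 4, 9, 2, 6, 5, 8]  # made-up counts for one gene in one cell
mu, var = compress(bootstrap_counts)
print(mu, var, pseudo_replicates(mu, var, 20, rng))
```

Only two numbers per gene and cell are stored, regardless of how many bootstrap replicates were generated, and any number of pseudo-replicates can be drawn from them later.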
Fig. 2. Per-gene coverage comparisons for the 95% intervals calculated using negative binomial distribution quantiles (A and B) and quantiles from the bootstrap empirical distribution (C and D), for the two group difference simulation. Panels A and C are stratified by inferential uncertainty (InfRV) and expression level, while panels B and D are stratified by the average gene tier value across samples. ‘High’ InfRV and expression correspond to the top 10% of InfRV and gene-level counts, respectively
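The two interval types compared in Fig. 2 can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's code: coverage is taken to mean the fraction of entries whose simulated true count falls inside the interval, the negative binomial is fitted by method of moments from a stored mean and variance (requiring variance > mean > 0), and all function names and example values are hypothetical.

```python
import math
from statistics import quantiles


def nb_quantile(q, mu, size):
    """Smallest count k whose NB CDF reaches q (requires mu > 0 and size > 0)."""
    p = size / (size + mu)
    k, cdf = 0, 0.0
    while True:
        log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
                   + size * math.log(p) + k * math.log1p(-p))
        cdf += math.exp(log_pmf)
        if cdf >= q:
            return k
        k += 1


def nb_interval(mu, var):
    """95% interval from NB quantiles, with size from var = mu + mu**2 / size."""
    size = mu * mu / (var - mu)
    return nb_quantile(0.025, mu, size), nb_quantile(0.975, mu, size)


def bootstrap_interval(replicates):
    """95% interval from the empirical 2.5% and 97.5% quantiles of the replicates."""
    cuts = quantiles(replicates, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def coverage(intervals, truths):
    """Fraction of true counts that lie inside their interval."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


# Made-up example: one entry with stored mean 10 and variance 30,
# and one entry with a full set of bootstrap replicates.
print(nb_interval(10.0, 30.0))
print(bootstrap_interval([3, 7, 4, 9, 2, 6, 5, 8]))
```

The negative binomial interval needs only the compressed mean and variance, whereas the empirical interval needs every bootstrap replicate to be kept.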
Computation comparisons for Swish and splitSwish for the two group difference simulation
| Method | R object size (MB) | Max memory (GB) | Load (s) | Compute (s) |
|---|---|---|---|---|
| Swish | 853 | 4.90 | 28.2 | 78 |
| splitSwish | 138 | 1.08 | 1.5 | 20 |
Note: Results include 60 179 genes across 200 cells, with 20 bootstrap replicates for Swish and 20 pseudo-inferential replicates for splitSwish. R object size and load time differ across methods, as Swish uses full bootstrap replicate matrices while splitSwish uses compressed inferential uncertainty. Max memory and compute time are provided per job (n = 8) for splitSwish.
Fig. 3. Comparison of counts across pseudotime for Nme1 and Nme2 for counts generated incorporating multi-mapping reads using the EM algorithm (A and C) and without incorporating multi-mapping reads (B and D). Counts are colored according to assignment to one of two lineages. Points represent mean of bootstrap replicates and vertical bars represent 95% normal-based intervals in A and C, while points in B and D provide estimated counts. Curves plot the fitted GAMs across pseudotime for each lineage
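A normal-based 95% interval of the kind drawn as vertical bars in panels A and C can be computed as the replicate mean plus or minus about 1.96 standard deviations. The sketch below assumes the spread is the sample standard deviation of the bootstrap replicates; the function name and example counts are hypothetical, and the lower bound is not truncated at zero.

```python
from statistics import NormalDist, mean, stdev


def normal_interval(replicates, level=0.95):
    """Mean +/- z * sd of the replicates, with z the standard normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre, spread = mean(replicates), stdev(replicates)
    return centre - z * spread, centre + z * spread


# Made-up bootstrap counts for one gene in one cell.
print(normal_interval([8, 10, 12]))
```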