Krishan Gupta1, Manan Lalit2, Aditya Biswas3, Chad D Sanada4, Cassandra Greene4, Kyle Hukari4, Ujjwal Maulik5, Sanghamitra Bandyopadhyay6, Naveen Ramalingam4, Gaurav Ahuja7, Abhik Ghosh8, Debarka Sengupta1,7,9,10.
Abstract
Systematic delineation of complex biological systems is an ever-challenging and resource-intensive process. Single-cell transcriptomics allows us to study cell-to-cell variability in complex tissues at an unprecedented resolution. Accurate modeling of gene expression plays a critical role in the statistical determination of tissue-specific gene expression patterns. In the past few years, considerable efforts have been made to identify appropriate parametric models for single-cell expression data. Zero-inflated versions of the Poisson/negative binomial and log-normal distributions have emerged as the most popular alternatives owing to their ability to accommodate the high dropout rates commonly observed in single-cell data. Although the majority of parametric approaches directly model expression estimates, we explore the potential of modeling expression ranks as robust surrogates for transcript abundance. Here we examined the performance of the discrete generalized beta distribution (DGBD) on real data and devised a Wald-type test for comparing gene expression across two phenotypically divergent groups of single cells. We performed a comprehensive assessment of the proposed method to understand its advantages compared with some of the existing best-practice approaches. We concluded that besides striking a reasonable balance between Type I and Type II errors, ROSeq, the proposed differential expression test, is exceptionally robust to expression noise and scales rapidly with increasing sample size. For wider dissemination and adoption of the method, we created an R package called ROSeq and made it available on the Bioconductor platform.
Year: 2021 PMID: 33674351 PMCID: PMC8015842 DOI: 10.1101/gr.267070.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Modeling single-cell gene expression using ROSeq. (A) As part of the ROSeq differential expression analysis workflow, cells are first binned depending on expression values associated with a particular gene. For each cell-group, bins are ranked depending on cell frequency. The discrete generalized beta distribution (DGBD) is used as a probability mass function to express a normalized bin-wise cell frequency as a function of its corresponding rank using two real parameters a and b. A Wald-type test is used on the MLE of these parameters across the cell-groups to find differentially expressed genes. (B) DGBD-based modeling of VAMP3 expression (source: Tung data) (Tung et al. 2017). Discretized expression bins are ranked based on normalized bin-wise cellular frequencies. (C) Distribution of R² values obtained from DGBD-based modeling of 11,513 expressed genes (source: Tung data) (Tung et al. 2017).
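The binning, ranking, and DGBD fit described in panel A can be sketched as follows. This is a minimal illustration, not the ROSeq implementation. It assumes the standard DGBD form f(r) = A (N + 1 - r)^b / r^a over ranks r = 1..N, with A a normalizing constant. It also assumes equal-width expression bins, and it uses a simple coarse-to-fine grid search in place of whatever optimizer the package uses. All function names are hypothetical, and the Wald-type test is not shown.

```python
import math


def dgbd_pmf(n_bins, a, b):
    """DGBD probabilities for ranks 1..n_bins (assumed standard form)."""
    weights = [(n_bins + 1 - r) ** b / r ** a for r in range(1, n_bins + 1)]
    total = sum(weights)  # 1/A, the normalizing constant
    return [w / total for w in weights]


def ranked_bin_counts(values, n_bins):
    """Bin expression values (equal-width bins are an assumption) and
    return the bin-wise cell counts sorted so rank 1 is the fullest bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant gene
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return sorted(counts, reverse=True)


def neg_log_likelihood(counts, a, b):
    """Multinomial negative log-likelihood of ranked counts under the DGBD."""
    pmf = dgbd_pmf(len(counts), a, b)
    return -sum(c * math.log(p) for c, p in zip(counts, pmf))


def fit_dgbd(counts, lo=-5.0, hi=5.0, steps=20, rounds=6):
    """Approximate the MLE of (a, b) by repeatedly zooming a grid
    around the best grid point (a stand-in for a real optimizer)."""
    a_lo, a_hi, b_lo, b_hi = lo, hi, lo, hi
    best = (0.0, 0.0)
    for _ in range(rounds):
        a_step = (a_hi - a_lo) / steps
        b_step = (b_hi - b_lo) / steps
        grid = [(a_lo + i * a_step, b_lo + j * b_step)
                for i in range(steps + 1) for j in range(steps + 1)]
        best = min(grid, key=lambda ab: neg_log_likelihood(counts, *ab))
        # Shrink the search window to two grid steps around the best point.
        a_lo, a_hi = best[0] - 2 * a_step, best[0] + 2 * a_step
        b_lo, b_hi = best[1] - 2 * b_step, best[1] + 2 * b_step
    return best
```

Under these assumptions, one would call `ranked_bin_counts` on one gene's expression values within a cell-group and pass the result to `fit_dgbd`. Doing this for each of the two groups yields the two (a, b) estimates that a Wald-type test would then compare.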
Figure 2. Benchmarking of single-cell DE call accuracy against DE genes detected at tissue levels. (A) ROC and the associated AUC values obtained by bulk-based benchmarking of single-cell DEG calls between BJ and K562 cells (source: Gupta data). (B) ROC plot for H1 and H9 cells (source: Chu data) (Chu et al. 2016). (C) ROC plot for NA19098 and NA19239 cells (source: Tung data) (Tung et al. 2017).
Figure 3. Type I error rates. (A) Line chart showing Type I error rates with SE (depicted by error bars), obtained by applying different DEG callers on 20 randomly sampled null data sets, for varied cell-group sizes. We applied a P-value cutoff of 0.01. These experiments were performed using Jurkat transcriptomes (approximately 3200 cells and approximately 32,000 transcripts) (Zheng et al. 2017). (B,C) Similar plots with P-value cutoff of 0.05 and 0.1, respectively.
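The quantities plotted here can be illustrated with a short sketch. It assumes that the Type I error rate of one null data set is the fraction of tested genes whose P-value falls below the cutoff, and that the error bars show the standard error of the mean across the null data sets. The strict inequality and the function names are assumptions, not details taken from the paper.

```python
import math


def type_one_error_rate(p_values, cutoff):
    """Fraction of genes called DE on a null data set, where every call
    is a false positive (strict '<' is an assumption)."""
    return sum(p < cutoff for p in p_values) / len(p_values)


def mean_and_se(rates):
    """Mean and standard error of the mean across null data sets."""
    n = len(rates)
    mean = sum(rates) / n
    variance = sum((r - mean) ** 2 for r in rates) / (n - 1)
    return mean, math.sqrt(variance / n)
```

For a well-calibrated test, the mean rate across null data sets should sit close to the chosen cutoff (0.01, 0.05, or 0.1 in the three panels).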
Figure 4. Tolerance against expression dropouts. (A) Line chart showing decline in AUC with the increase in dropout levels. Performance was recorded on the Gupta data set comprising BJ fibroblasts and K562 cells. (B) Line chart showing MCC values that largely mirror AUC values in panel A. (C) Line chart showing the trend of increased false DEG calls with the increase in dropout levels. Null data sets were created using Jurkat cell transcriptomes from the Zheng data set. Each of the contrasting groups contains 1000 cells.
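For reference, the Matthews correlation coefficient (MCC) reported in panel B can be computed from the confusion counts of DEG calls against a reference set. The sketch below uses the standard MCC formula. The helper that derives the counts from gene sets is hypothetical, and how the reference DE genes were defined is not specified here.

```python
import math


def confusion_counts(called, truth, all_genes):
    """Count TP, FP, TN, FN for a set of DEG calls against a reference set."""
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(all_genes - called - truth)
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """Standard MCC; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC ranges from -1 to 1 and stays informative when DE genes are a small minority of all genes tested.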
Figure 5. Tracking execution time on scRNA-seq data of varied sizes. (A) Line chart showing median time taken by each algorithm on 100 randomly sampled null data sets containing iPSC transcriptomes (replicate id: NA19098). (B) Line chart showing median time taken by each algorithm on 20 randomly sampled null data sets containing Jurkat transcriptomes. (C) Line chart showing median time taken by each algorithm on 20 randomly sampled null data sets generated using the Splatter R package (Zappia et al. 2017). Note that for the iPSC data, we used a single CPU core; for the remaining larger data sets, we used four cores of the workstation.