Samarendra Das, Anil Rai, Shesh N Rai.
Abstract
With the advent of single-cell RNA-sequencing (scRNA-seq), it is possible to measure the expression dynamics of genes at the single-cell level. A single scRNA-seq experiment generates a huge amount of expression data, covering several thousand genes over as many as millions of cells. Differential expression analysis is the primary downstream analysis of such data: it identifies gene markers for cell type detection and provides inputs to other secondary analyses. Many statistical approaches for differential expression analysis have been reported in the literature. We therefore critically discuss the underlying statistical principles of these approaches and divide them into six major classes, i.e., generalized linear, generalized additive, Hurdle, mixture models, two-class parametric, and non-parametric approaches. We also succinctly discuss the limitations that are specific to each class of approaches, and how they are addressed by subsequent classes. This study identifies a number of challenges that must be addressed to develop the next class of innovative approaches. Furthermore, we emphasize the methodological challenges involved in differential expression analysis of scRNA-seq data that researchers must address to draw maximum benefit from this recent single-cell technology. This study will serve as a guide for genome researchers and experimental biologists to objectively select options for their analysis.
Keywords: challenges; classification; differential expression analysis; scRNA-seq; statistical approaches
Year: 2022 PMID: 35885218 PMCID: PMC9315519 DOI: 10.3390/e24070995
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Operational framework of differential expression analysis of scRNA-seq data. Various steps in single-cell studies are shown. Pre-processing and various steps of DE analysis are also shown. Potential use and interpretation of obtained results are presented.
Comparative overview of the DEA approaches for scRNA-seq data analysis.
| SN. | Methods | Year | Model | Input | DE Test Stat. | Runtime | Platform | Ref. |
|---|---|---|---|---|---|---|---|---|
| 1 | NBID | 2018 | NB (GLM) | Counts | LRT | Medium | R code | [ |
| 2 | ZINB–WaVE | 2018 | ZINB (GLM) | Counts | LRT | High | Bioconductor, GitHub | [ |
| 3 | zingeR | 2018 | ZINB (GLM) | Counts | LRT | High | GitHub | [ |
| 4 | DECENT | 2019 | ZINB (GLM) | Counts | LRT | High | GitHub | [ |
| 5 | SwarnSeq | 2021 | ZINB (GLM) | Counts | LRT | High | GitHub | [ |
| 6 | Tweedieverse | 2021 | ZITweedie (GLM) | Counts | Wald | High | GitHub | [ |
| 7 | scMMST | 2021 | GLMM | Counts | Norm. score | High | NA | [ |
| 8 | TPMM | 2022 | GLMM | Norm. | Wald/LRT | High | GitHub | [ |
| 9 | Monocle2 | 2017 | GAM | Norm. | LRT | Medium | Bioconductor | [ |
| 10 | tradeSeq | 2020 | GAM | Counts | Wald | Medium | GitHub | [ |
| 11 | MAST | 2015 | Hurdle | Norm. | LRT/Wald | Medium | Bioconductor | [ |
| 12 | Random-Hurdle | 2019 | Hurdle | Counts | Chi-square test statistic | High | NA | [ |
| 13 | SCDE | 2014 | Poisson-NB (MM) | Counts | Bayesian stat. | High | Bioconductor | [ |
| 14 | BASiCS | 2015 | Poisson-Gamma (MM) | Norm. | Posterior prob. | High | Bioconductor | [ |
| 15 | D3E | 2016 | Poisson-Beta (MM) | Counts | CM/KS test | High | GitHub | [ |
| 16 | BPSC | 2016 | Beta-Poisson (MM) | Counts | LRT | Medium | GitHub | [ |
| 17 | TASC | 2017 | Logistic, Poisson Models (MM) | UMI | LRT | High | GitHub | [ |
| 18 | DESCEND | 2018 | Poisson-Alpha (MM) | Counts | Normalized Gini Score | High | GitHub | [ |
| 19 | SC2P | 2018 | ZIP, Poisson-Lognormal (MM) | Counts | Posterior prob. | High | GitHub | [ |
| 20 | ZIAQ | 2020 | Logistic and quantile Regression (MM) | Norm. | Fisher’s test | Medium | GitHub | [ |
| 21 | SimCD | 2021 | Gamma-NB (MM) | Counts | Bayesian | High | GitHub | [ |
| 22 | ZIQRank | 2022 | Zero-inflated model, quantile regression (MM) | Cont. | Rank-score test | High | NA | [ |
| 23 | Seurat | 2015 | NB (TCP) | Counts | LRT | Low | CRAN | [ |
| 24 | scDD | 2016 | Multi-modal Bayesian (TCP) | Norm. | Bayesian stat. | High | Bioconductor | [ |
| 25 | DEsingle | 2018 | ZINB (TCP) | Counts | LRT | High | Bioconductor, GitHub | [ |
| 26 | NYMP | 2019 | Logistic regression (TCP) | Cont. | | Medium | GitHub | [ |
| 27 | | | logCPM (TCP) | Norm. | T stat | Low | CRAN | [ |
| 28 | IDEAS | 2022 | NB/ZINB/Kernel Density estimation/ | Counts/Cont. | Jensen–Shannon Divergence/ | High | GitHub | [ |
| 29 | SAMstrt | 2013 | NP | Counts | | Medium | GitHub | [ |
| 30 | Wilcox | | NP | Counts/Norm. | Sum ranks | Low | CRAN | [ |
| 31 | SINCERA | 2015 | NP | Norm. | Welch (LS)/ | High | GitHub | [ |
| 32 | NODES | 2016 | NP | Norm. | Wilcox | Medium | Dropbox | [ |
| 33 | EMDomics | 2016 | NP | Norm. | Euclidean distance | High | Bioconductor | [ |
| 34 | sigEMD | 2018 | NP | Norm. | Distance measure | High | GitHub | [ |
| 35 | DTWscore | 2017 | NP | FPKM | Distance | Medium | GitHub | [ |
| 36 | ROSeq | 2021 | NP | Counts/Norm. | Wald | High | Bioconductor, GitHub | [ |
| 37 | scDEA 1 | 2021 | 12 Models (Hybrid) | Counts | Lancaster’s test (Chi) | High | GitHub | [ |
CM: Cramér–von Mises test; Counts: read/UMI counts; Cont.: continuous values, e.g., FPKM, log(CPM), RPKM; NA: source codes are not freely available; Norm.: normalized; GLM: generalized linear model; NB: negative binomial; GLMM: generalized linear mixed model; NP: non-parametric; GAM: generalized additive model; MM: mixture model; TCP: two-class parametric. 1: Integrated approach.
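As an illustration of the simplest non-parametric entry in the table (the rank-sum statistic listed for "Wilcox"), the following is a minimal sketch of a two-sided Wilcoxon rank-sum test for one gene across two cell groups. It is not the implementation used by any of the listed tools: the function names `midranks` and `rank_sum_test` are hypothetical, the p-value relies on a normal approximation with a tie correction, and the example values are invented for demonstration only.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test (normal approximation, tie-corrected variance).

    Returns (W, p) where W is the sum of ranks of group_a in the pooled sample.
    """
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    w = sum(ranks[:n1])
    mean_w = n1 * (n + 1) / 2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:  # all values identical: no evidence of a difference
        return w, 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, p


# Invented expression values for one gene in two cell groups (many zeros).
w, p = rank_sum_test([0, 0, 1, 2, 0, 3], [5, 8, 0, 9, 7, 12])
print(f"W = {w}, p = {p:.4f}")
```

Because the statistic depends only on ranks, the same sketch applies to counts or normalized values, which matches the "Counts/Norm." input listed for this approach. The many tied zeros typical of scRNA-seq data are the reason a tie-corrected variance is used here.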
Figure 2. Classification of available statistical approaches and tools used for DEA in single-cell studies. Classification of the approaches is conducted based on the requirement of input data, data distribution, and statistical models, etc. DE analytic tools belonging to each category are presented in pink colored boxes.
Figure 3. Operational outlines of DE analytic GLM and two-class comparison approaches in scRNA-seq studies. (A) Workflow of steps for GLM-based DE approaches. (B) Workflow of steps for two-class comparison approaches. In both classes, the framework can be divided into four major parts, namely: (i) input (data provided as input to tools); (ii) pre-processing of data, which involves data cleaning, outlier removal, normalization, etc.; (iii) model fitting and computation of the DE test statistic, in which distributional/model assumptions (e.g., GLM, simple statistical distribution or distribution-free) are made about the expression data, parameters of the models are estimated, and DE test statistic(s) for genes and their corresponding p-values are computed; and (iv) assessment and interpretation of DE results.
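Step (iii) of this workflow can be sketched for a single gene as a likelihood ratio test (LRT) between a null model with one expression rate for all cells and an alternative model with a separate rate per cell group. The sketch below deliberately assumes a Poisson model with a per-cell size factor as offset, because its maximum likelihood estimates have a closed form; the NB and ZINB models listed in the comparative table additionally require dispersion and zero-inflation parameters that are estimated iteratively. The function names and example values are hypothetical.

```python
import math


def poisson_loglik(counts, sizes, rate):
    """Poisson log-likelihood with mean rate * size for each cell.

    The log(y!) term is omitted: it does not depend on the rate and cancels
    in the likelihood ratio.
    """
    ll = 0.0
    for y, s in zip(counts, sizes):
        mu = rate * s
        ll += (y * math.log(mu) if y > 0 else 0.0) - mu
    return ll


def lrt_two_groups(counts_a, sizes_a, counts_b, sizes_b):
    """LRT of 'one common rate' (null) against 'one rate per group' (alternative).

    sizes_* are per-cell size factors (e.g., library size scaling), assumed > 0.
    Returns (statistic, p_value); the statistic is referred to a chi-square
    distribution with 1 degree of freedom.
    """
    counts_a, counts_b = list(counts_a), list(counts_b)
    sizes_a, sizes_b = list(sizes_a), list(sizes_b)
    # Closed-form maximum likelihood estimates of the rates.
    rate_a = sum(counts_a) / sum(sizes_a)
    rate_b = sum(counts_b) / sum(sizes_b)
    rate_0 = (sum(counts_a) + sum(counts_b)) / (sum(sizes_a) + sum(sizes_b))
    ll_alt = (poisson_loglik(counts_a, sizes_a, rate_a)
              + poisson_loglik(counts_b, sizes_b, rate_b))
    ll_null = poisson_loglik(counts_a + counts_b, sizes_a + sizes_b, rate_0)
    stat = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented UMI counts for one gene; all size factors set to 1 for simplicity.
stat, p = lrt_two_groups([0, 1, 0, 2], [1, 1, 1, 1], [10, 12, 9, 11], [1, 1, 1, 1])
print(f"LRT statistic = {stat:.2f}, p = {p:.2e}")
```

In a full analysis this test would be repeated for every gene and the resulting p-values adjusted for multiple testing before the assessment step (iv).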
Classes of statistical approaches and tools extensively used in DEA of scRNA-seq data.
| SN. | Class | Features | Limitations | Tools |
|---|---|---|---|---|
| 1 | GLM | Gene expression can have any form of exponential distribution type. Suitable for bi-modality of data. Able to deal with categorical predictors, e.g., cell type, cell cycle, etc. Easy to interpret and allows a clear understanding of how each of the predictors influences the gene parameters. Can be generalized to multi-cell group comparisons. Less susceptible to model over-fitting. | Strict exponential family distributional assumptions about the data. Needs relatively large datasets (with more predictors and a large number of cells). Sensitive to outliers. Sensitive to dropout events. Not suitable for low expressed genes. Cannot handle multi-modality of the data. ZIM–GLM approaches are not able to handle zero-deflation at any level of a factor and will result in parameter estimates of infinity for the logistic component. Higher computational cost especially for large datasets. | NBID, zingeR |
| 2 | GAM | Predictor functions are automatically derived during model estimation. Marginal impact of a single variable does not depend on the values of the other variables in the model. Flexibility in choosing the type of functions, which will help in finding patterns missed in a parametric model. Allows controlling smoothness of the predictor functions to prevent model over-fitting. By controlling the wiggliness of the predictor functions, we can directly tackle the bias/variance tradeoff. Highly effective in many settings, particularly when one wishes to model the response variable as a function of both categorical (e.g., cell groups) and continuous predictors (e.g., cell-level auxiliary variables). Considers both linear and non-linear functions of cell-level predictors to model gene parameters. Each lineage is represented by a separate cubic smoothing spline, and its flexibility allows adjustment for other covariates or confounders as fixed effects in the model. | Approaches such as Monocle can only handle a single lineage of cells. Lack of interpretability when inferring differences in expression between lineages of cells. Assumes the dropout events to be linear; however, the effect of dropout events is likely to be non-linear, especially for genes with low to moderate expression. Computationally complex. | Monocle, Monocle2, Monocle3, tradeSeq |
| 3 | Hurdle Model | Considers the excess zeros while model building. Can handle zero-inflation as well as zero-deflation present in data. Models the bimodality of gene expression distribution. | Does not differentiate the generating process for excessive zeros versus sampling zeros. Fails to consider the multi-modality of gene expression distribution. Requires higher runtime. | MAST, Random Hurdle |
| 4 | Mixture-Model | Considers bi-modal or multi-modal nature of single-cell data. Can differentiate between major sources of variation in single-cell data. | Certain approaches including BPSC, SC2P cannot consider the zero-inflation in single-cell data. Mostly uses linear models for DEA, which is cumbersome. Higher runtime and computationally intensive. | SCDE, D3E, BPSC, BASiCS, DESCEND, SC2P, ZIAQ, ZIQRank, SimCD |
| 5 | Non-parametric (two-class) | Distribution-free approaches. Considers the multi-modality of the data. Computationally not cumbersome (less runtime). Estimates the parameters without fitting any distribution for genes. Performs DEA with distance-like metrics across two cell types. Performs well when the proportion of zeros in the data is small. | Mostly focuses on two cellular groups’ comparison. Computationally complex for multi-groups. Performance severely affected by high dropouts (some methods exclude dropouts). Cannot separate true/biological zeros from false/dropout zeros. Sensitive to sparsity. Methods such as D3E, scDD fail to consider UMI count nature of the data. Cannot separate confounding factors from each other. | Wilcox, NODES, ROTS, EMDomics, ROSeq, SINCERA, sigEMD, DTWscore, SAMstrt |
| 6 | Parametric (two-class) | Easy to understand and execute. Lesser runtime. Particularly suitable for larger datasets. | Makes strict distributional assumption about the data. Cannot generalize to multi-group comparisons. Ignores the multi-modal distributions of the scRNA-seq data. Sensitive to sparsity or dropout events. Cannot differentiate between the major sources of variability in the data. | scDD, DEsingle, |
Figure 4. Operational outlines of DE analytic GAM, Hurdle and mixed model class of approaches in scRNA-seq studies. (A) Workflow of steps for GAM-based DEA approaches. (B) Workflow of steps for Hurdle and mixed-model-based approaches. In both classes, the framework can be divided into four major parts, namely: (i) input (data provided as input to tools); (ii) pre-processing of data, which involves data cleaning, outlier removal, normalization, etc.; (iii) model fitting and computation of the DEA test statistic, in which distributional/model assumptions (e.g., GAM, Hurdle or mixture model) are made about the expression data, parameters of the models are estimated, and DEA test statistic(s) for genes and their corresponding p-values are computed; and (iv) assessment and interpretation of DEA results.
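The two-part idea behind the Hurdle class (excess zeros modelled separately from the expressed values) can be sketched for one gene and two cell groups as follows. This is a simplified illustration under stated assumptions, not the implementation of MAST or Random-Hurdle: the discrete part is a binomial LRT on the detection rate, the continuous part is a Gaussian LRT (shared variance) on the non-zero normalized values, and the two statistics are summed and referred to a chi-square distribution. The function names and example values are hypothetical, and no cell-level covariates are included.

```python
import math


def binom_loglik(k, n):
    """Maximised binomial log-likelihood for k detections in n cells.

    The binomial coefficient is dropped because it cancels in the ratio.
    For k == 0 or k == n the maximised log-likelihood is 0.
    """
    if 0 < k < n:
        p = k / n
        return k * math.log(p) + (n - k) * math.log(1 - p)
    return 0.0


def hurdle_lrt(expr_a, expr_b):
    """Two-part LRT; expr_* hold normalised values per cell, 0 = not detected.

    Returns (statistic, df, p_value). The continuous part is added only when
    both groups have at least two detected cells; otherwise df stays at 1.
    """
    pos_a = [x for x in expr_a if x > 0]
    pos_b = [x for x in expr_b if x > 0]
    n_a, n_b, k_a, k_b = len(expr_a), len(expr_b), len(pos_a), len(pos_b)

    # Discrete part: does the detection rate differ between the groups?
    ll_alt = binom_loglik(k_a, n_a) + binom_loglik(k_b, n_b)
    ll_null = binom_loglik(k_a + k_b, n_a + n_b)
    stat = max(0.0, 2 * (ll_alt - ll_null))
    df = 1

    # Continuous part: does the mean of the detected values differ?
    if k_a >= 2 and k_b >= 2:
        pooled = pos_a + pos_b
        mean_a, mean_b = sum(pos_a) / k_a, sum(pos_b) / k_b
        mean_0 = sum(pooled) / len(pooled)
        rss_alt = (sum((x - mean_a) ** 2 for x in pos_a)
                   + sum((x - mean_b) ** 2 for x in pos_b))
        rss_null = sum((x - mean_0) ** 2 for x in pooled)
        if rss_alt > 0:
            # Gaussian LRT with shared variance: n * log(RSS_null / RSS_alt).
            stat += max(0.0, len(pooled) * math.log(rss_null / rss_alt))
            df = 2

    # Chi-square survival function: exp(-x/2) for 2 df, erfc(sqrt(x/2)) for 1 df.
    p_value = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, df, p_value


# Invented log-normalised expression values for one gene in two cell groups.
group_a = [0, 0, 0, 1.2, 0, 0.8, 0, 0]
group_b = [2.5, 3.1, 0, 2.8, 3.3, 2.9, 0, 3.0]
stat, df, p = hurdle_lrt(group_a, group_b)
print(f"statistic = {stat:.2f}, df = {df}, p = {p:.2e}")
```

The sketch treats every zero the same way, which mirrors the limitation noted for this class: it does not distinguish excess (dropout) zeros from sampling zeros.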