Samarendra Das, Shesh N Rai.
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a recent high-throughput genomic technology used to study the expression dynamics of genes at the single-cell level. Analyzing scRNA-seq data in the presence of biological confounding factors, including dropout events, is a challenging task. This article therefore presents a novel statistical approach for various analyses of scRNA-seq Unique Molecular Identifier (UMI) count data. These analyses include modeling and fitting of observed UMI data, cell type detection, estimation of cell capture rates, estimation of gene-specific model parameters, and estimation of the sample mean and sample variance of the genes, among others. In addition, the approach can perform differential expression and other downstream analyses that account for the molecular capture process in scRNA-seq data modeling. External spike-in data can also be used in the approach for better results. The unique feature of the method is that, in modeling the observed UMI counts of genes, it considers the biological process that leads to severe dropout events.
• The differential expression analysis of observed scRNA-seq UMI count data is performed after adjustment for cell capture rates.
• The statistical approach performs downstream differential zero-inflation analysis, classification of influential genes, and selection of top marker genes.
• Cell auxiliaries, including cell clusters and other cell variables (e.g., cell cycle, cell phase), are used to remove unwanted variation so that statistical tests can be performed reliably.
Published by Elsevier B.V.
Keywords: Mean; Molecular capture model; Observed UMI count; Overdispersion; True UMI count; Zero Inflation; Zero inflated negative binomial model
Year: 2021 PMID: 35004214 PMCID: PMC8720898 DOI: 10.1016/j.mex.2021.101580
Source DB: PubMed Journal: MethodsX ISSN: 2215-0161
Fig. 1 Relationship among the SwarnSeq model parameters and the expected values of sample statistics. (A) Expected value vs. variance of the observed UMI counts. X-axis: log of the expected value of the observed UMI counts. Y-axis: log of the variance. (B) Expected value vs. coefficient of variation (CV) of the observed UMI counts. X-axis: log of the expected value of the observed UMI counts. Y-axis: log of CV. (C) Zero-inflation vs. CV of the observed UMI counts. X-axis: log of CV. Y-axis: log of zero-inflation. (D) CV vs. dispersion. X-axis: log of the CV. Y-axis: log of dispersion. (E) Variance vs. zero-inflation of the observed UMI counts. X-axis: log of the variance. Y-axis: log of zero-inflation. (F) Variance of the observed UMI counts vs. dispersion. X-axis: log of the variance. Y-axis: log of dispersion.
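The quantities plotted in Fig. 1 (expected value, variance, CV, zero-inflation, dispersion) can be related to one another under a textbook zero-inflated negative binomial (ZINB) model. The sketch below is illustrative only: the parameterization (`mu`, `theta`, `pi`) and the function name `zinb_moments` are assumptions made here, and it does not include the cell capture rates that the SwarnSeq model adjusts for.

```python
import math

def zinb_moments(mu, theta, pi):
    """Mean, variance and CV of a zero-inflated negative binomial count.

    Assumed generic parameterization (not taken from the paper):
      mu    - mean of the negative binomial component (> 0)
      theta - NB size parameter, so NB variance = mu + mu**2 / theta
      pi    - probability of an extra (inflated) zero, 0 <= pi < 1
    """
    mean = (1.0 - pi) * mu
    variance = (1.0 - pi) * mu * (1.0 + mu / theta + pi * mu)
    cv = math.sqrt(variance) / mean
    return mean, variance, cv

# Log-scale values of the kind shown on the axes of panels (A) and (B).
for mu in (0.5, 2.0, 8.0):
    mean, variance, cv = zinb_moments(mu, theta=1.0, pi=0.5)
    print(math.log(mean), math.log(variance), math.log(cv))
```

With `pi = 0` the expressions reduce to the ordinary negative binomial mean and variance, which is a quick sanity check on the formula.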
Fig. 2 Parameters of the SwarnSeq model estimated through the EM algorithm. (A) Relationship between the estimated means and dispersion parameters of genes. X-axis: log of estimated values of means; Y-axis: log of estimated values of dispersions. (B) Relationship between the estimated means and zero-inflation parameters. X-axis: log of estimated values of means. Y-axis: log of estimated values of zero-inflation. (C) Relationship between the estimated zero-inflation and dispersion parameters of genes. X-axis: log of estimated values of dispersion. Y-axis: log of estimated values of zero-inflation. (D) Relationship between the estimated zero-inflation parameters and the observed zero proportions of genes. X-axis: observed zero proportions. Y-axis: estimated values of zero-inflation parameters. (E) Relationship between the observed zero proportions and the difference between the observed and true proportions of zeros of genes. X-axis: observed zero proportions. Y-axis: difference between observed and true proportion of zeros. (F) Relationship between true and dropout zeros. X-axis: dropout zero probability. Y-axis: true zero probability.
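Panels (D) to (F) of Fig. 2 compare zero-inflation with zero proportions. As a rough illustration of why the two differ, a generic ZINB model splits the probability of observing a zero into an inflated part and a part arising from the negative binomial component itself. The function name and parameterization below are assumptions for this sketch; the paper's own definitions of "true" and "dropout" zeros may differ.

```python
def zinb_zero_probabilities(mu, theta, pi):
    """Split P(count == 0) under a generic ZINB model.

    Assumed parameterization (hypothetical, for illustration):
      mu    - mean of the NB component
      theta - NB size parameter
      pi    - zero-inflation probability
    Returns (inflated_zero, nb_zero, total_zero).
    """
    # Probability that the NB component itself yields a zero count.
    nb_zero_given_nb = (theta / (theta + mu)) ** theta
    inflated_zero = pi
    nb_zero = (1.0 - pi) * nb_zero_given_nb
    return inflated_zero, nb_zero, inflated_zero + nb_zero

inflated, from_nb, total = zinb_zero_probabilities(mu=1.0, theta=1.0, pi=0.2)
print(inflated, from_nb, total)
```

The total zero probability is always at least `pi`, so an observed zero proportion would be expected to sit at or above the zero-inflation parameter under this assumed model.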
Fig. 3 Relationships among the cell-specific parameters. (A) Distribution of cell library sizes. X-axis represents the cell ranks; Y-axis represents the cell library sizes. The relationship between cell library size and cell rank is an S-shaped (sigmoid) curve. (B) Cell library sizes vs. zero counts % in cells. X-axis represents the cell library sizes; Y-axis represents the zero counts % in cells. Cells with lower library sizes have higher proportions of zero counts, and vice versa. (C) Relationship of cell capture rates with cell ranks. Here, the cell capture rates are estimated from the external RNA spike-in data. (D) Relationship of cell capture rates (estimated from the UMI data) with cell library sizes. The relationship between the capture rates and the cell library sizes is bell-shaped. That is, cells with higher library sizes have better cell capture rates, and vice versa. (E) Relationship between mean of non-zero counts and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the mean of non-zero UMI counts. The relationship is inverse, i.e., cells with a higher zero % have lower mean UMI counts, and vice versa. (F) Relationship between capture rates and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the cell capture rates.
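The per-cell quantities on the axes of Fig. 3 (library size, zero counts %, mean of non-zero counts) are simple column summaries of a UMI count matrix. A minimal sketch follows, assuming a genes-by-cells layout stored as nested lists; the function name `cell_summaries` and the toy matrix are made up for illustration, and capture-rate estimation is not attempted here.

```python
def cell_summaries(counts):
    """Per-cell summaries of a UMI count matrix.

    counts: list of gene rows, each a list of per-cell UMI counts
            (genes x cells layout is an assumption of this sketch).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    summaries = []
    for j in range(n_cells):
        column = [counts[i][j] for i in range(n_genes)]
        nonzero = [c for c in column if c > 0]
        summaries.append({
            "library_size": sum(column),  # total UMIs in the cell
            "zero_pct": 100.0 * (n_genes - len(nonzero)) / n_genes,
            "mean_nonzero": sum(nonzero) / len(nonzero) if nonzero else 0.0,
        })
    return summaries

# Toy data: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 3, 0],
    [2, 0, 0],
    [5, 1, 0],
    [0, 0, 4],
]

# Rank cells by library size, as on the x-axis of panel (A).
ranked = sorted(cell_summaries(toy_counts), key=lambda s: s["library_size"])
for rank, summary in enumerate(ranked, start=1):
    print(rank, summary)
```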
Fig. 4 Sample mean and variance of the observed UMI counts of the genes. (A) Expected value vs. variance of sample mean plot. X-axis: Expected value of sample mean; Y-axis: Variance of sample mean. (B) Expected value of sample mean vs. expected value of sample variance plot. X-axis: Expected value of sample mean; Y-axis: Expected value of sample variance. (C) Expected value of sample mean vs. CV of the sample mean plot. X-axis: Expected value of sample mean; Y-axis: CV of sample mean. (D) Expected value of sample mean vs. standard error of sample mean plot. X-axis: Expected value of sample mean; Y-axis: standard error of sample mean. (E) Variance of sample mean vs. expected value of sample variance plot. X-axis: Expected value of variance of sample mean; Y-axis: Expected value of sample variance. (F) CV of sample mean vs. expected value of sample variance. X-axis: CV of sample mean; Y-axis: Expected value of sample variance.
Fig. 5 Schematic layout of cluster analysis in the SwarnSeq method. (A) Flowchart of the algorithm for determining the number of cell clusters. (B) Determination of the optimum number of cell clusters for the experimental single-cell data. X-axis: Number of cell clusters; Y-axis: Clustering indices for every cell cluster. (C) Distribution of the cells across the cell clusters.
Fig. 6 Key analytical results obtained through the SwarnSeq model. (A) Volcano plot for differential expression analysis results. X-axis represents the log transformation of the fold change values of genes. Y-axis represents the -log transformation of the p-values computed through the SwarnSeq model. Red represents genes with both -log(p-value) > 20 and |log2FC| > 3; blue represents genes with -log(p-value) > 20; green represents genes with |log2FC| > 3; black indicates non-significant genes. (B) Volcano plot for differential zero-inflation analysis results. X-axis represents the log transformation of the fold change values of genes. Y-axis represents the -log transformation of the p-values computed through the SwarnSeq model. Red represents genes with both -log(p-value) > 7 and |log2FC| > 2; blue represents genes with -log(p-value) > 7; green represents genes with |log2FC| > 2; black indicates non-significant genes. (C) Schematic representation of the classification of key genes detected through the SwarnSeq model. DE: differentially expressed; DZI: differentially zero-inflated; DEZI: both differentially expressed and differentially zero-inflated; non-DE: non-differentially expressed; non-DZI: non-differentially zero-inflated. (D) Illustration of the SwarnSeq method for classification of influential genes. Numbers in cells represent the number of genes belonging to each category; (.): classes of the genes.
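The colour scheme of the volcano plots and the gene classes in panel (C) amount to simple threshold rules. The sketch below is one possible reading of the caption, not the authors' code: the function names are hypothetical, the logarithm of the p-value is assumed to be base 10 (the caption does not state the base), and red is given priority when both thresholds are exceeded.

```python
import math

def volcano_color(p_value, log2_fc, neg_log_cut=20.0, fc_cut=3.0):
    """Colour for one gene; default cut-offs follow panel (A).

    Assumption: the p-value is transformed with -log10.
    """
    neg_log_p = math.inf if p_value <= 0 else -math.log10(p_value)
    sig_p = neg_log_p > neg_log_cut
    sig_fc = abs(log2_fc) > fc_cut
    if sig_p and sig_fc:
        return "red"
    if sig_p:
        return "blue"
    if sig_fc:
        return "green"
    return "black"

def gene_class(is_de, is_dzi):
    """Combine DE and DZI calls into the classes named in panel (C)."""
    if is_de and is_dzi:
        return "DEZI"
    if is_de:
        return "DE"
    if is_dzi:
        return "DZI"
    return "non-DE/non-DZI"

print(volcano_color(1e-25, 4.0))                           # panel (A) cut-offs
print(volcano_color(1e-8, 2.5, neg_log_cut=7, fc_cut=2))   # panel (B) cut-offs
print(gene_class(True, False))
```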
| Subject area | Statistics |
| More specific subject area | Statistical Genomics and Computational Biology |
| Method name | SwarnSeq |
| Name and reference of original method | Das, S. and Rai, S.N. (2021). SwarnSeq: An improved statistical approach for differential expression analysis of single-cell RNA-seq data. |
| Resource availability |