F William Townes, Stephanie C Hicks, Martin J Aryee, Rafael A Irizarry.
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
Keywords: Dimension reduction; GLM-PCA; Gene expression; Principal component analysis; RNA-Seq; Single cell; Variable genes
Year: 2019 PMID: 31870412 PMCID: PMC6927135 DOI: 10.1186/s13059-019-1861-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Single cell RNA-Seq datasets used
| Number | Author | Tissue | Cells | MTU | Notes |
|---|---|---|---|---|---|
| 1 | Zheng [ | ERCC | 1015 | 11,125 | Spike-in only; technical negative control |
| 2 | Zheng [ | Monocytes | 2612 | 782 | 1 cell type; biological negative control |
| 3 | Tung [ | iPSCs | 57 | 24,170 | 1 cell type; biological negative control |
| 4 | Duo [ | PBMCs | 3994 | 1215 | 4 equal clusters of FACS-purified cells |
| 5 | Duo [ | PBMCs | 3994 | 1298 | 8 equal clusters of FACS-purified cells |
| 6 | Haber [ | Intestine | 533 | 3755 | Authors computationally identified 12 types |
| 7 | Muraro [ | Pancreas | 2282 | 18,795 | Authors computationally identified 9 types |
| 8 | Zheng [ | PBMCs | 68,579 | 1292 | Benchmarking computational speed |
Species: all H. sapiens except Haber (M. musculus). Protocols: all 10x except Muraro (CEL-Seq2) and Tung (SMARTer). MTU: median total UMI count. iPSCs: induced pluripotent stem cells.
Fig. 1 The multinomial model adequately characterizes the sampling distributions of the technical and biological replicates negative control data. a Fraction of zeros plotted against the total number of UMIs in each droplet for the technical replicates. b As a, but for cells in the biological replicates (monocytes). c After down-sampling replicates to 10,000 UMIs per droplet to remove variability due to differences in sequencing depth, the fraction of zeros is computed for each gene and plotted against the log of expression across all samples for the technical replicates data. The solid curves show the theoretical probability of observing a zero as a function of the expected counts, derived from the multinomial model (blue) and its Poisson approximation (green). d As c, but for the biological replicates (monocytes) dataset and after down-sampling to 575 UMIs per cell. Here, we also add the theoretical probability derived from a negative binomial model (red).
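The theoretical curves described in panels c and d can be sketched with the standard zero-probability formulas for each model. This is a minimal illustration, not the authors' code: the function names, the example relative abundances, and the negative binomial `size` parameter are assumptions chosen for the example. Under multinomial sampling, the count of a single gene is marginally binomial, so the probability of a zero is (1 - p)^n for relative abundance p and total count n.

```python
import numpy as np

def zero_prob_multinomial(p, n):
    """P(count == 0) for a gene with relative abundance p in a cell with n total UMIs.
    The multinomial marginal for one gene is Binomial(n, p)."""
    return (1.0 - p) ** n

def zero_prob_poisson(p, n):
    """Poisson approximation with mean n * p."""
    return np.exp(-n * p)

def zero_prob_negbinom(p, n, size):
    """Negative binomial with mean n * p and size (inverse dispersion) parameter.
    The value of `size` is a hypothetical choice, not taken from the paper."""
    mu = n * p
    return (size / (size + mu)) ** size

n_umi = 10_000                              # down-sampled depth used in panel c
p = np.array([1e-5, 1e-4, 1e-3])            # hypothetical relative abundances
multinom = zero_prob_multinomial(p, n_umi)
poisson = zero_prob_poisson(p, n_umi)
negbinom = zero_prob_negbinom(p, n_umi, size=10.0)

for row in zip(p, multinom, poisson, negbinom):
    print("p=%.0e  multinomial=%.5f  poisson=%.5f  negbinom=%.5f" % row)
```

For small p the multinomial and Poisson curves are nearly indistinguishable, while the negative binomial always assigns a higher probability to zero than the Poisson with the same mean.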
Fig. 2 Example of how current approaches to normalization and transformation artificially distort differences between zero and nonzero counts. a UMI count distribution for gene ENSG00000114391 in the monocytes biological replicates negative control dataset. b Counts per million (CPM) distribution for the exact same count data. c Distribution of log2(1+CPM) values for the exact same count data.
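The distortion described in this caption can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a single cell whose total UMI count equals the monocyte dataset's median of 782 (from the table above), and the helper name `cpm` is hypothetical.

```python
import numpy as np

def cpm(counts, total):
    """Scale raw counts to counts per million for a cell with the given total."""
    return counts / total * 1e6

counts = np.array([0, 1, 2, 3])   # raw UMI counts for one gene
total_umi = 782                   # assumed cell depth (median total UMI, monocytes)

cpm_values = cpm(counts, total_umi)
log_cpm = np.log2(1 + cpm_values)

# Gap between successive counts on the log2(1 + CPM) scale
gaps = np.diff(log_cpm)
print("CPM:          ", np.round(cpm_values, 1))
print("log2(1+CPM):  ", np.round(log_cpm, 2))
print("gaps:         ", np.round(gaps, 2))
```

On the raw scale, 0 and 1 differ by one UMI, the same as 1 and 2. After the transformation, the step from 0 to 1 is roughly ten times larger than the step from 1 to 2, which is why zeros dominate the variability downstream.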
Fig. 3 Current approaches to normalization and transformation cause variability in the fraction of zeros across cells to become the largest source of variability, which in turn biases clustering algorithms to produce false-positive results based on distorted latent factors. a First principal component (PC) from the technical replicates dataset plotted against the fraction of zeros for each cell. A red to blue color scale represents total UMIs per cell. b As a, but for the monocytes biological replicates data. c Using the technical replicates, we applied t-distributed stochastic neighbor embedding (tSNE) with perplexity 30 to the top 50 PCs computed from log-CPM. The first 2 tSNE dimensions are shown with a blue to red color scale representing the fraction of zeros. d As c, but for the biological replicates data. Here, we do not expect to find differences, yet we see distorted latent factors being driven by the total UMIs. PCA was applied to 5000 random genes.
Fig. 4 GLM-PCA dimension reduction is not affected by unwanted variability in the fraction of zeros and avoids false-positive results. a First GLM-PCA dimension (analogous to the first principal component) plotted against the fraction of zeros for the technical replicates, with colors representing the total UMIs. b As a, but using the monocytes biological replicates. c Using the technical replicates, we applied t-distributed stochastic neighbor embedding (tSNE) with perplexity 30 to the top 50 GLM-PCA dimensions. The first 2 tSNE dimensions are shown with a blue to red color scale representing the fraction of zeros. d As c, but for the biological replicates data. GLM-PCA using the Poisson approximation to the multinomial was applied to the same 5000 random genes as in Fig. 3.
Fig. 5 Dimension reduction with GLM-PCA and feature selection using deviance improve Seurat clustering performance. Each column represents a different ground truth dataset from [15]. a Comparison of dimension reduction methods based on the top 1500 informative genes identified by approximate multinomial deviance. The Poisson approximation to the multinomial was used for GLM-PCA. Dev. resid. PCA: PCA on approximate multinomial deviance residuals. b Comparison of feature selection methods. The top 1500 genes identified by deviance and by highly variable genes were passed to 2 different dimension reduction methods: GLM-PCA and PCA on log-transformed CPM. Only the results with the number of clusters within 25% of the true number are presented.
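Feature selection by deviance can be sketched as follows. This record does not state the exact formula, so the code assumes one common form: a per-gene binomial deviance against a null model in which the gene has the same relative abundance in every cell. The function name `binomial_deviance` and the simulated data are hypothetical and are not the authors' implementation.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene deviance of a constant-relative-abundance null model.

    counts: array of shape (cells, genes) with nonnegative UMI counts.
    Returns an array of shape (genes,); larger values mean the gene's
    expression varies more across cells than the null model predicts.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)               # total UMIs per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()    # pooled abundance per gene
    expected = n * pi                                   # expected count under the null
    rest = n - counts
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with a zero count contribute zero (limit of y * log(y) as y -> 0)
        term1 = np.where(counts > 0, counts * np.log(counts / expected), 0.0)
        term2 = np.where(rest > 0, rest * np.log(rest / (n - expected)), 0.0)
    return 2.0 * (term1 + term2).sum(axis=0)

# Simulated example: 20 genes, two groups of 100 cells, 1000 UMIs per cell.
# Gene 0 is strongly up-regulated in group B; all values here are made up.
rng = np.random.default_rng(0)
p_a = np.full(20, 1 / 20)
p_b = np.concatenate([[0.30], np.full(19, 0.70 / 19)])
counts = np.vstack([
    rng.multinomial(1000, p_a, size=100),
    rng.multinomial(1000, p_b, size=100),
])

dev = binomial_deviance(counts)
top_genes = np.argsort(dev)[::-1][:5]   # indices of the 5 highest-deviance genes
print(top_genes)
```

Ranking genes by this score and keeping the top k (1500 in the figure) yields the feature set passed on to dimension reduction. Unlike highly variable gene selection, the score is computed directly on raw counts and needs no prior normalization.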