Literature DB >> 33295604

glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data.

Constantin Ahlmann-Eltze¹, Wolfgang Huber¹.

Abstract

MOTIVATION: The Gamma-Poisson distribution is a theoretically and empirically motivated model for the sampling variability of single cell RNA-sequencing counts and an essential building block for analysis approaches including differential expression analysis, principal component analysis and factor analysis. Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which can comprise millions of cells; at the same time, they do not take full advantage of the fact that zero and other small numbers are frequent in the data. These limitations have hampered uptake of the model, leaving room for statistically inferior approaches such as logarithm(-like) transformation.
RESULTS: We present a new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods. The software can work with data on disk without having to load them into RAM simultaneously. AVAILABILITYAND IMPLEMENTATION: The package glmGamPoi is available from Bioconductor for Windows, macOS and Linux, and source code is available on github.com/const-ae/glmGamPoi under a GPL-3 license. The scripts to reproduce the results of this paper are available on github.com/const-ae/glmGamPoi-Paper. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Disease Species

Year: 2021 PMID： 33295604 PMCID： PMC8023675 DOI： 10.1093/bioinformatics/btaa1009

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

The statistical distribution of sequencing counts from single-cell RNA-Seq can be modeled by (Grün ; Hafemeister and Satija, 2019; Silverman ; Svensson, 2020), where y are the observed counts for a particular gene across a set of sufficiently similar cells (acting as replicates) and μ represents the underlying, true expression level of the gene (the expectation value). The parameter determines the dispersion of the distribution; the tightest case is θ = 0, in which case the distribution coincides with the Poisson distribution. Larger values of θ correspond to wider distributions. Biological interest is added by extending the model beyond (conceptual) replicates and letting μ vary across the cells. This can be done in different ways: via a generalized linear model, , as in the differential expression methods edgeR (McCarthy ; Robinson ) and DESeq (Anders and Huber, 2010; Love ); via a factor analysis model (Risso ); or via a matrix decomposition analogous to PCA (Townes ). The model fits then provide biological insight about ‘significant’ variations in gene expression across cells, above and beyond the sampling noise. A popular alternative approach is to transform the counts using the shifted logarithm , with some choice of c > 0, and then proceed with analysis methods that are based on the least squares error, such as used for normal distributed data. However, this approach is fundamentally inferior as it overemphasizes the influence of small count fluctuations (Silverman ; Warton, 2018) and deals poorly with variable sequencing depth across cells (Townes ). With the Gamma-Poisson generalized linear model, parameter estimation proceeds by minimizing the deviance, a generalization of the sum of squares of residuals used in the least squares method. There are already a number of implementations to this end, including the R packages MASS (Venables and Ripley, 2002), edgeR and DESeq2. These all follow a similar approach: for each gene, the parameter vector β is estimated using an iterative reweighted least squares algorithm, and the dispersion θ is found by likelihood maximization. After years of development, and with tens of thousands of users, edgeR and DESeq2 provide robust implementations and are a de facto standard for the analysis of bulk RNA-seq data Application of these implementations to single-cell RNA-seq data, however, suffers from several issues. First, their runtime becomes excessive as the number of cells gets large. Second, their functionality—fitting a Gamma-Poisson generalized linear model for a fixed, known design matrix X—is only part of what users need: with single-cell RNA-seq data, important research questions include identification of latent factors, dimension reduction, clustering and classification. These limitations hamper the development and uptake of statistical models based on the Gamma-Poisson distribution and appear to be driving analysts toward the transformation approach. The R package glmGamPoi provides inference of Gamma-Poisson generalized linear models (details of the algorithm in Supplementary Appendix S1) with the following improvements over edgeR and DESeq2: Substantially higher speed of the overdispersion estimation, by using an efficient data representation that makes uses of the fact that most entries in the count matrix are from a small set of integer numbers (). Better estimates (i.e. larger likelihood) of the overdispersion on datasets with many small counts. No size limitations for the datasets. glmGamPoi supports fitting the model without loading all data into RAM simultaneously (i.e. working ‘on-disk’), by using the HDF5Array (Pagès, 2020) and beachmat (Lun et al., 2018) packages. Small number of R package dependencies to facilitate use as a building-block for higher-level methods, such as factor analysis, dimension reduction or clustering/classification. Like edgeR, glmGamPoi also provides a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes (Lund et al., 2012). In addition, it provides the option to form a pseudobulk sample, which Crowell et al. (2019) found to be an effective way to identify differential expression between samples for replicated single cell experiments. To demonstrate how glmGamPoi can be integrated into other tools, we forked the DESeq2 package and integrated glmGamPoi as an alternative inference engine (github.com/mikelove/DESeq2/pull/24). We compared the runtime of glmGamPoi to other methods on the four single cell datasets summarized in Supplementary Table S1. The timing results are shown in Figure 1 and Supplementary Figure S1. The speedup by glmGamPoi compared to edgeR and DESeq2 was 6× to 13×. When the data were accessed directly from disk, the calculations took about twice as long. Omitting the estimation of θ (by setting it to zero) reduced the runtime to about a half. The forked version of DESeq2 that calls glmGamPoi was about as fast as calling glmGamPoi directly, indicating that inference carried out by glmGamPoi uses the largest part of the compute resources, while the additional steps carried out by DESeq2 make relatively small demands. Although all methods theoretically scale linearly with the number of genes and cells, we find empirically that glmGamPoi scales sub-linearly with the number of cells, which explains the observed performance benefit (Supplementary Fig. S2).

Fig. 1.

Bar plot comparing the runtime of glmGamPoi (in-memory, on-disk and without overdispersion estimation), edgeR and DESeq2 (with its own implementation, or calling glmGamPoi) on the Mouse Gastrulation dataset. The time measurements were repeated five times each as a single process without parallelization on a different node of a multi-node computing cluster with minor amounts of competing tasks. The points show individual measurements, the bars their median. To reproduce the results, see Supplementary Appendix S2 On the PBMC68k dataset, the calculations of DESeq2 and edgeR aborted because they ran out of memory (250 GB of RAM available). In contrast, glmGamPoi completed after ca. 45 min (Supplementary Fig. S1). Supplementary Figures S4–S6 show that glmGamPoi’s gain in performance does not come at a cost of accuracy. On the contrary, Supplement Figure S3 shows that glmGamPoi provides better estimates (in the sense of larger likelihood) than DESeq2 for 72% of the genes and 10% of the genes in comparison with edgeR. Those differences with edgeR, seem to be of minor importance for assessing differential expression: the bottom rows of Supplementary Figures S5 and S6 show that the P-values from glmGamPoi and edgeR are very similar, consistent with the fact that they use the same statistical test. In Supplementary Figure S7, we provide a more detailed comparison for which genes the P-values of glmGamPoi and DESeq2 are similar and for which genes they are different.

Funding

This work was supported by the EMBL International PhD Programme. In addition, this work has received funding from the European Research Council Synergy Grant DECODE under grant agreement No. 810296. Conflict of Interest: none declared. Click here for additional data file.

14 in total

1. Why you cannot transform your way out of trouble for small counts.

Authors: David I Warton
Journal: Biometrics Date: 2017-05-15 Impact factor: 2.571

2. Validation of noise models for single-cell transcriptomics.

Authors: Dominic Grün; Lennart Kester; Alexander van Oudenaarden
Journal: Nat Methods Date: 2014-04-20 Impact factor: 28.547

3. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

4. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors: Michael I Love; Wolfgang Huber; Simon Anders
Journal: Genome Biol Date: 2014 Impact factor: 13.583

5. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression.

Authors: Christoph Hafemeister; Rahul Satija
Journal: Genome Biol Date: 2019-12-23 Impact factor: 13.583

6. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model.

Authors: F William Townes; Stephanie C Hicks; Martin J Aryee; Rafael A Irizarry
Journal: Genome Biol Date: 2019-12-23 Impact factor: 13.583

7. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data.

Authors: Helena L Crowell; Charlotte Soneson; Pierre-Luc Germain; Daniela Calini; Ludovic Collin; Catarina Raposo; Dheeraj Malhotra; Mark D Robinson
Journal: Nat Commun Date: 2020-11-30 Impact factor: 14.919

8. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

9. A general and flexible method for signal extraction from single-cell RNA-seq data.

Authors: Davide Risso; Fanny Perraudeau; Svetlana Gribkova; Sandrine Dudoit; Jean-Philippe Vert
Journal: Nat Commun Date: 2018-01-18 Impact factor: 14.919

10. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.

Authors: Aaron T L Lun; Hervé Pagès; Mike L Smith
Journal: PLoS Comput Biol Date: 2018-05-03 Impact factor: 4.475

16 in total

1. A single-cell atlas of mouse lung development.

Authors: Nicholas M Negretti; Erin J Plosa; John T Benjamin; Bryce A Schuler; A Christian Habermann; Christopher S Jetter; Peter Gulleman; Claire Bunn; Alice N Hackett; Meaghan Ransom; Chase J Taylor; David Nichols; Brittany K Matlock; Susan H Guttentag; Timothy S Blackwell; Nicholas E Banovich; Jonathan A Kropski; Jennifer M S Sucre
Journal: Development Date: 2021-12-20 Impact factor: 6.868

2. Prenatal immune stress blunts microglia reactivity, impairing neurocircuitry.

Authors: Lindsay N Hayes; Kyongman An; Elisa Carloni; Fangze Li; Elizabeth Vincent; Chloë Trippaers; Manish Paranjpe; Gül Dölen; Loyal A Goff; Adriana Ramos; Shin-Ichi Kano; Akira Sawa
Journal: Nature Date: 2022-09-28 Impact factor: 69.504

3. ScRNA-seq expression of IFI27 and APOC2 identifies four alveolar macrophage superclusters in healthy BALF.

Authors: Xin Li; Fred W Kolling; Daniel Aridgides; Diane Mellinger; Alix Ashare; Claudia V Jakubzick
Journal: Life Sci Alliance Date: 2022-07-12

4. MBNL1 drives dynamic transitions between fibroblasts and myofibroblasts in cardiac wound healing.

Authors: Darrian Bugg; Logan R J Bailey; Ross C Bretherton; Kylie E Beach; Isabella M Reichardt; Kalen Z Robeson; Anna C Reese; Jagadambika Gunaje; Galina Flint; Cole A DeForest; April Stempien-Otero; Jennifer Davis
Journal: Cell Stem Cell Date: 2022-02-16 Impact factor: 25.269

5. Mapping the developing human immune system across organs.

Authors: Chenqu Suo; Emma Dann; Issac Goh; Laura Jardine; Vitalii Kleshchevnikov; Jong-Eun Park; Rachel A Botting; Emily Stephenson; Justin Engelbert; Zewen Kelvin Tuong; Krzysztof Polanski; Nadav Yayon; Chuan Xu; Ondrej Suchanek; Rasa Elmentaite; Cecilia Domínguez Conde; Peng He; Sophie Pritchard; Mohi Miah; Corina Moldovan; Alexander S Steemers; Pavel Mazin; Martin Prete; Dave Horsfall; John C Marioni; Menna R Clatworthy; Muzlifah Haniffa; Sarah A Teichmann
Journal: Science Date: 2022-06-03 Impact factor: 63.714