edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D Robinson, Davis J McCarthy, Gordon K Smyth.

Abstract

SUMMARY: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data.

AVAILABILITY: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

Year: 2009
PMID: 19910308
PMCID: PMC2796818
DOI: 10.1093/bioinformatics/btp616
Source DB: PubMed
Journal: Bioinformatics
ISSN: 1367-4803
Impact factor: 6.937


1 INTRODUCTION

Modern molecular biology data present major challenges for the statistical methods that are used to detect differential expression, such as the requirement of multiple testing procedures and, increasingly, empirical Bayes or similar methods that share information across all observations to improve inference. For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, effectively a continuous response, whereas for digital gene expression (DGE) data the abundance is observed as a count. Therefore, procedures that are successful for microarray data are not directly applicable to DGE data. This note describes the software package edgeR (empirical analysis of DGE in R), which forms part of the Bioconductor project (Gentleman et al., 2004). edgeR is designed for the analysis of replicated count-based expression data and is an implementation of methodology developed by Robinson and Smyth (2007, 2008). Although initially developed for serial analysis of gene expression (SAGE), the methods and software should be equally applicable to emerging technologies such as RNA-seq (Li et al., 2008; Marioni et al., 2008) that give rise to digital expression data. edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, in proteomics experiments where spectral counts are used to summarize the peptide abundance (Wong et al., 2008), or in barcoding experiments where several species are counted (Andersson et al., 2008). The software is designed for finding changes between two or more groups when at least one of the groups has replicated measurements.

2 MODEL

Bioinformatics researchers have learned many things from the analysis of microarray data. For instance, power to detect differential expression can be improved and false discoveries reduced by sharing information across all probes. One such procedure is limma (Smyth, 2004), where an empirical Bayes model is used to moderate the probe-wise variances. The moderated variances replace the probe-wise variances in the t- and F-statistic calculations. In a closely analogous but mathematically more complex procedure, edgeR models count data using an overdispersed Poisson model, and uses an empirical Bayes procedure to moderate the degree of overdispersion across genes.

We assume the data can be summarized into a table of counts, with rows corresponding to genes (or tags or exons or transcripts) and columns to samples. For RNA-seq experiments, these may be counts at the exon, transcript or gene level. We model the count $Y_{gi}$ for gene $g$ and sample $i$ as negative binomial (NB) distributed:

$$Y_{gi} \sim \mathrm{NB}(M_i p_{gj}, \phi_g)$$

Here, $M_i$ is the library size (total number of reads), $\phi_g$ is the dispersion and $p_{gj}$ is the relative abundance of gene $g$ in experimental group $j$ to which sample $i$ belongs. We use the NB parameterization where the mean is $\mu_{gi} = M_i p_{gj}$ and the variance is $\mu_{gi}(1 + \mu_{gi}\phi_g)$. For differential expression analysis, the parameters of interest are the $p_{gj}$. The NB distribution reduces to Poisson when $\phi_g = 0$. In some DGE applications, technical variation can be treated as Poisson. In general, $\phi_g$ represents the squared coefficient of variation of biological variation between the samples. In this way, our model is able to separate biological from technical variation.

edgeR estimates the genewise dispersions by conditional maximum likelihood, conditioning on the total count for that gene (Smyth and Verbyla, 1996). An empirical Bayes procedure is used to shrink the dispersions towards a consensus value, effectively borrowing information between genes (Robinson and Smyth, 2007).
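The dispersion estimation and shrinkage described above can be sketched in a few lines. The Python below is an illustration only, not edgeR's implementation. It assumes a single group of replicates with equal library sizes, uses a crude grid search in place of a proper optimizer, and takes the shrinkage weight `alpha` as a user-supplied number. The names `cond_loglik`, `estimate_dispersions` and `toy_table` are hypothetical. The moderated estimate maximizes the genewise conditional log-likelihood plus `alpha` times the log-likelihood summed over all genes, in the spirit of the weighted-likelihood approach of Robinson and Smyth (2007).

```python
import math

def cond_loglik(counts, phi):
    """Conditional log-likelihood of dispersion phi for one gene.

    Conditions on the gene's total count and assumes the replicates are
    identically distributed NB counts (equal library sizes, an assumption
    of this sketch). Terms that do not depend on phi are dropped.
    """
    n = len(counts)
    r = 1.0 / phi          # NB "size" parameter
    z = sum(counts)        # total count for the gene
    return (sum(math.lgamma(y + r) for y in counts)
            + math.lgamma(n * r)
            - math.lgamma(z + n * r)
            - n * math.lgamma(r))

def estimate_dispersions(table, alpha, grid=None):
    """Return (common, genewise, moderated) dispersions via grid search."""
    if grid is None:
        # log-spaced candidate dispersions from 0.001 to 10
        grid = [10 ** (-3 + 0.02 * k) for k in range(201)]
    per_gene = [[cond_loglik(row, phi) for phi in grid] for row in table]
    # the "consensus" likelihood pools all genes at a shared dispersion
    common_ll = [sum(col) for col in zip(*per_gene)]

    def best(values):
        return grid[max(range(len(grid)), key=lambda k: values[k])]

    common = best(common_ll)
    genewise = [best(ll) for ll in per_gene]
    # weighted likelihood: alpha = 0 gives the genewise estimate,
    # a very large alpha pulls every gene to the common value
    moderated = [best([l + alpha * c for l, c in zip(ll, common_ll)])
                 for ll in per_gene]
    return common, genewise, moderated

# toy counts: 5 genes x 3 replicate libraries (made-up numbers)
toy_table = [[10, 12, 9], [5, 50, 20], [100, 130, 90], [0, 3, 1], [40, 45, 38]]
common, genewise, moderated = estimate_dispersions(toy_table, alpha=0.5)
```

With only three replicates per gene the genewise estimates are very noisy, which is the motivation for pulling them towards the pooled value.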
Finally, differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data (Robinson and Smyth, 2008).
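
The idea behind the exact test can be sketched under simplifying assumptions made here for illustration: equal library sizes, a known dispersion `phi > 0`, and two groups with `n1` and `n2` libraries. Under the null hypothesis the two group totals are NB variables with a shared mean per library, so conditional on their sum the distribution of the first total no longer depends on that mean. The two-sided p-value adds up the probabilities of all splits that are no more likely than the observed one. This is not edgeR's code, and the function name `nb_exact_test` is hypothetical.

```python
import math

def _log_weight(a, n, r):
    # log of Gamma(a + n*r) / (Gamma(n*r) * a!), the part of the NB
    # probability of a group total `a` that does not cancel on conditioning
    return math.lgamma(a + n * r) - math.lgamma(n * r) - math.lgamma(a + 1)

def nb_exact_test(s1, n1, s2, n2, phi):
    """Two-sided p-value for group totals s1 (n1 libraries) vs s2 (n2 libraries)."""
    r = 1.0 / phi
    total = s1 + s2
    # unnormalized log-probability of every possible split of the total
    logw = [_log_weight(a, n1, r) + _log_weight(total - a, n2, r)
            for a in range(total + 1)]
    top = max(logw)
    weights = [math.exp(v - top) for v in logw]   # rescaled to avoid underflow
    norm = sum(weights)
    probs = [w / norm for w in weights]
    p_obs = probs[s1]
    # sum splits that are as likely or less likely than the observed one;
    # the small tolerance guards against floating-point ties
    p_value = sum(p for p in probs if p <= p_obs * (1.0 + 1e-9))
    return min(1.0, p_value)

# same observed totals, two assumed dispersions (made-up numbers)
p_low_disp = nb_exact_test(20, 2, 60, 2, phi=0.01)
p_high_disp = nb_exact_test(20, 2, 60, 2, phi=1.0)
```

As `phi` approaches zero the conditional distribution approaches a binomial, so the test behaves like an exact binomial test for Poisson counts. A larger dispersion widens the conditional distribution, and the same observed totals then give a larger p-value.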

3 FEATURES

The required inputs for edgeR are the table of counts and two vectors annotating the samples: the vector of the library sizes (i.e. total number of reads) and a factor specifying the experimental group or condition for each sample. For users of limma, the edgeR package has a number of analogous functions. Once the data have been processed and the dispersion estimates are moderated, the topTags function can be used to tabulate the top differentially expressed genes (or tags or exons, etc.). Also, MA (log ratio versus abundance) plots can be created using the plotSmear function, allowing the same visualizations for DGE data as used for microarray data analysis (Fig. 1).
Fig. 1. DGE data can be visualized as ‘MA’ plots (log ratio versus abundance), just as with microarray data where each dot represents a gene. This plot shows RNA-seq gene expression for DHT-stimulated versus Control LNCaP cells, as described in Li et al. (2008). The smear of points on the left side signifies that genes were observed in only one group of replicate samples and the points marked ‘×’ denote the top 500 differentially expressed genes.
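
The quantities behind such an MA plot can be computed directly from the three inputs named above (the table of counts, the library sizes and the group factor). The sketch below is not the plotSmear implementation. It pools counts within each group, takes the log2 ratio (M) and the average log2 abundance (A), and flags genes seen in only one group, for which the log ratio is infinite and which therefore end up in the smear. The function name `ma_values` and the toy data are hypothetical.

```python
import math

def ma_values(counts, lib_sizes, groups, pair):
    """Per-gene (A, M) values for the two groups named in `pair`.

    counts    : list of rows, one row of per-sample counts per gene
    lib_sizes : total number of reads per sample
    groups    : group label per sample
    pair      : (first, second); M is log2(second / first)
    """
    first, second = pair

    def pooled_abundance(row, label):
        idx = [i for i, g in enumerate(groups) if g == label]
        return sum(row[i] for i in idx) / sum(lib_sizes[i] for i in idx)

    out = []
    for row in counts:
        p1 = pooled_abundance(row, first)
        p2 = pooled_abundance(row, second)
        if p1 > 0 and p2 > 0:
            m = math.log2(p2 / p1)
            a = 0.5 * (math.log2(p1) + math.log2(p2))
            out.append({"A": a, "M": m, "one_group_only": False})
        else:
            # log ratio is undefined or infinite; a plotting routine has to
            # place these points by convention
            out.append({"A": None, "M": None,
                        "one_group_only": (p1 > 0) != (p2 > 0)})
    return out

# toy data: 3 genes, 2 control and 2 treated libraries (made-up numbers)
toy_counts = [[10, 12, 20, 22], [0, 0, 5, 7], [8, 8, 8, 8]]
toy_result = ma_values(toy_counts, [1000, 1000, 1000, 1000],
                       ["ctl", "ctl", "dht", "dht"], ("ctl", "dht"))
```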

A number of features have been added to the edgeR package since the initial publications. The initial methodology worked only for a two-group comparison. The extension to estimating and moderating the dispersion for multiple groups is straightforward and has been implemented recently. At present, testing for differential expression is supported only for pairwise comparisons; the user must specify which two groups to compare. We are currently investigating tests for more general cases.

Many of the early RNA-seq datasets involve sequence reads from technical replicates (e.g. same source of RNA) as opposed to biological replicates (e.g. RNA from different individuals). Technical replicates will generally have lower variability than biological replicates and, in our experience, the dispersion parameter (and the moderation procedure in edgeR) may not be necessary. For experiments with technical replicates, the data may be fitted well by the Poisson distribution, as demonstrated in Marioni et al. (2008). Since the Poisson distribution is a special case of the NB distribution (ϕ=0), edgeR can perform a Poisson-based analysis. The pairwise exact testing procedure will still be useful for these datasets.
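
A quick way to see whether a set of replicates looks Poisson is to compare the sample variance with the mean. The sketch below uses the NB variance μ(1+μϕ) from the model section together with a simple method-of-moments estimate of ϕ. That estimator is a stand-in chosen for brevity, not the conditional maximum likelihood estimator used by edgeR, and it assumes equal library sizes. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def nb_variance(mu, phi):
    """Variance of an NB count with mean mu and dispersion phi."""
    return mu * (1.0 + mu * phi)

def moment_dispersion(counts):
    """Method-of-moments dispersion: solve s^2 = m * (1 + m * phi) for phi.

    Truncated at zero, so replicates that are no more variable than
    Poisson give phi = 0 (the Poisson special case).
    """
    m = mean(counts)
    s2 = variance(counts)      # sample variance, n - 1 denominator
    if m == 0:
        return 0.0
    return max(0.0, (s2 - m) / (m * m))

# made-up replicate counts for two genes
tight = moment_dispersion([10, 12, 9])     # no evidence of overdispersion
noisy = moment_dispersion([5, 50, 20])     # clearly overdispersed

# the square root of the dispersion is the coefficient of variation of the
# extra-Poisson component; e.g. phi = 0.04 corresponds to 20%
cv = math.sqrt(0.04)
```

With ϕ=0 the function `nb_variance` returns the mean itself, which is the Poisson variance; with a mean of 100 and ϕ=0.04 the variance is five times the Poisson value.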

4 DISCUSSION

We have developed a Bioconductor package edgeR that addresses one of the fundamental downstream data analysis tasks for count-based expression data: determining differential expression. The package and methods are general, and can work on other sources of count data, such as barcoding experiments and peptide counts. To the authors' knowledge, edgeR is the only software for SAGE or DGE data at this time which can account for biological variability when there are only one or two replicate samples.

Funding: National Health and Medical Research Council Program (Grant 406657 to G.K.S.); NHMRC, Independent Research Institutes Infrastructure Support Scheme (Grant 361646); Victorian State Government OIS grant (awarded to the WEHI); a Melbourne International Research Scholarship (to M.D.R.); Belz, Harris and IBS Honours scholarships (to D.J.M.).

Conflict of Interest: none declared.

REFERENCES

1. Gordon K Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004-02-12.

2. Jason W H Wong; Matthew J Sullivan; Gerard Cagney. Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments. Brief Bioinform, 2007-09-28.

3. Mark D Robinson; Gordon K Smyth. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 2007-09-19.

4. Mark D Robinson; Gordon K Smyth. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 2007-08-29.

5. Hairi Li; Michael T Lovci; Young-Soo Kwon; Michael G Rosenfeld; Xiang-Dong Fu; Gene W Yeo. Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model. Proc Natl Acad Sci U S A, 2008-12-16.

6. John C Marioni; Christopher E Mason; Shrikant M Mane; Matthew Stephens; Yoav Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 2008-06-11.

7. Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 2004-09-15.

8. Anders F Andersson; Mathilda Lindberg; Hedvig Jakobsson; Fredrik Bäckhed; Pål Nyrén; Lars Engstrand. Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS One, 2008-07-30.
