
edgeRun: an R package for sensitive, functionally relevant differential expression discovery using an unconditional exact test.

Emmanuel Dimont, Jiantao Shi, Rory Kirchner, Winston Hide.

Abstract

Next-generation sequencing platforms for measuring digital expression such as RNA-Seq are displacing traditional microarray-based methods in biological experiments. The detection of differentially expressed genes between groups of biological conditions has led to the development of numerous bioinformatics tools, but so far, few exploit the expanded dynamic range afforded by the new technologies. We present edgeRun, an R package that implements an unconditional exact test that is a more powerful version of the exact test in edgeR. This increase in power is especially pronounced for experiments with as few as two replicates per condition, for genes with low total expression and with large biological coefficient of variation. In comparison with a panel of other tools, edgeRun consistently captures functionally similar differentially expressed genes.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25900919      PMCID: PMC4514933          DOI: 10.1093/bioinformatics/btv209

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Next-generation sequencing technologies, for instance RNA-Seq for transcriptome capture (Mortazavi ) and CAGE-Seq for promoterome capture (Kanamori-Katayama ), are steadily replacing microarray-based methods. All of these approaches result in digital expression data, where reads or tags are sequenced, mapped to the genome and then counted. The discrete nature of these counts has required the development of new bioinformatics tools for their analysis. Once expression has been quantified, an important next step is testing the statistical significance of differential expression between two or more groups of conditions. By far the simplest and most popular approach reduces differential expression to a pairwise comparison of mean parameters, resulting in a fold-change measure and a P-value to ascertain the statistical significance of the finding. To address this problem, tools such as edgeR (Robinson ) and DESeq2 (Love ), among many others, have been developed and can be applied to any experiment in which digital count data are produced. This vast array of tool choices can be bewildering for the biologist, since it is generally not clear under which conditions one tool is more appropriate than its alternatives. Traditional benchmarking metrics such as the false positive rate and power are useful but limited, as they are purely statistical concepts that can only be tested on simulated data. Moreover, they do not help in determining to what extent methods deliver truly biologically important genes. This is a major challenge because, in the vast majority of cases, we do not know what the true positives and negatives are.
In this article, we propose a novel metric to determine the number of functionally relevant genes reported by a differential expression tool and present edgeRun, an extension of the edgeR package delivering increased power to detect true positive differences between conditions without sacrificing control of the false positive rate. We show using simulations and a real data example that edgeRun is uniformly more powerful than a host of differential expression tools for small sample sizes. We also demonstrate that, even though it may be less statistically powerful than DESeq2 in some simulation cases, edgeRun nonetheless produces results that are functionally more relevant.

2 Methods

2.1 edgeRun: exact unconditional testing

Assuming independent samples, Robinson proposed edgeR, an R package that eliminates the nuisance mean expression parameter by conditioning on a sufficient statistic for the mean, a strategy first popularized by Fisher (1925) for the binomial distribution. This leads to an exact P-value whose calculation does not involve the mean. The advantages of this approach are its analytic simplicity and fast computation; a key disadvantage is that conditioning loses power, especially for genes whose counts are small. We propose a more powerful alternative, which we call ‘unconditional edgeR’ or edgeRun, that eliminates the nuisance mean parameter without conditioning by maximizing the exact P-value over all possible values of the mean. This technique was initially proposed by Barnard (1945) for the binomial distribution. Its main disadvantage is the higher computational burden of the maximization step; on the other hand, the gain in power can be significant. A thorough derivation and comparison of both methods can be found in the Supplementary Methods.
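
The difference between conditioning and maximizing can be illustrated in the binomial setting of Fisher and Barnard mentioned above. The following Python sketch is not the edgeRun implementation and does not use the negative binomial model; the function names, the pooled z ordering statistic and the grid search over the nuisance parameter are all assumptions made for illustration.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def conditional_pvalue(x1, n1, x2, n2):
    """Fisher-style two-sided P-value, conditioning on the total x1 + x2."""
    s = x1 + x2
    denom = comb(n1 + n2, s)
    lo, hi = max(0, s - n2), min(n1, s)
    probs = {k: comb(n1, k) * comb(n2, s - k) / denom for k in range(lo, hi + 1)}
    p_obs = probs[x1]
    # Sum every outcome that is no more probable than the observed one.
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))


def z_pooled(a, n1, b, n2):
    """Ordering statistic (an assumption of this sketch): pooled-variance z."""
    pooled = (a + b) / (n1 + n2)
    if pooled in (0.0, 1.0):
        return 0.0  # both groups all-failure or all-success: no difference
    return abs(a / n1 - b / n2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


def unconditional_pvalue(x1, n1, x2, n2, grid=200):
    """Barnard-style P-value: maximize the tail probability over the nuisance p."""
    t_obs = z_pooled(x1, n1, x2, n2)
    # Tables at least as extreme as the observed one, without fixing the total.
    extreme = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
               if z_pooled(a, n1, b, n2) >= t_obs - 1e-12]
    best = 0.0
    for i in range(1, grid):  # grid search over the common success probability
        p = i / grid
        tail = sum(binom_pmf(a, n1, p) * binom_pmf(b, n2, p) for a, b in extreme)
        best = max(best, tail)
    return best


p_cond = conditional_pvalue(3, 3, 0, 3)
p_uncond = unconditional_pvalue(3, 3, 0, 3)
print(f"conditional: {p_cond:.5f}  unconditional: {p_uncond:.5f}")
```

For this toy table (3 of 3 successes versus 0 of 3), the conditional P-value is 0.1 and the unconditional P-value is 0.03125, which illustrates how conditioning can lose power when counts are small. The nested loops over all tables and all grid points also show where the extra computational cost comes from.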

2.2 Benchmarking against other methods

The compcodeR Bioconductor package (Soneson, 2014) was used to benchmark the performance of edgeRun against a panel of other available tools using a combination of simulated and real datasets. edgeRun had the highest area under the curve (AUC) of all methods while maintaining a false discovery rate (FDR) comparable to that of the other tools. In terms of power, only DESeq2 was found to outperform edgeRun. For this reason, in the next section we perform a functional comparison only with DESeq2. The full results are summarized in the Supplementary Methods.
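
For readers unfamiliar with these metrics, the Python sketch below shows one way AUC, observed FDR and power could be computed on simulated data where the truly differentially expressed genes are known. It does not reproduce compcodeR, which is an R package; the function names and the toy P-values are hypothetical, and the Benjamini-Hochberg adjustment is assumed as the FDR-controlling procedure.

```python
def roc_auc(scores, is_true):
    """AUC via the rank-sum identity; a higher score means more evidence of DE."""
    pos = [s for s, t in zip(scores, is_true) if t]
    neg = [s for s, t in zip(scores, is_true) if not t]
    # Count (positive, negative) pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def observed_fdr_and_power(pvalues, is_true, alpha=0.05):
    """Observed FDR and power among genes called at adjusted P < alpha."""
    called = [q < alpha for q in bh_adjust(pvalues)]
    n_called = sum(called)
    true_pos = sum(c and t for c, t in zip(called, is_true))
    fdr = (n_called - true_pos) / n_called if n_called else 0.0
    power = true_pos / sum(is_true)
    return fdr, power


# Hypothetical toy simulation: eight genes, four of them truly DE.
pvalues = [0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.7, 0.9]
truth = [True, True, False, True, True, False, False, False]

auc = roc_auc([1 - p for p in pvalues], truth)
fdr, power = observed_fdr_and_power(pvalues, truth)
print(auc, fdr, power)
```

AUC summarizes how well the ranking separates true from false positives across all thresholds, whereas FDR and power describe a single cutoff, which is why a tool can lead on one metric and not on another.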

2.3 Comparing functional relevance

We propose to compare the genes called significant by various differential expression tools. Figure 1 compares the results of edgeRun and DESeq2 applied to a prostate cancer dataset (Li ) using an FDR < 5% cutoff. Of the 4226 genes reported as differentially expressed, 80% were common to both tools. The 500 most strongly up- or down-regulated of these consensus genes by fold-change are used as a seed signature. It is reasonable to hypothesize that true differentially expressed genes uniquely reported by one tool are functionally connected to genes in the consensus group. We use GRAIL (Raychaudhuri ) coupled with the global co-expression network COXPRESdb (Obayashi ) to assess the relatedness between a gene and the consensus group. As expected, nearly half of the seed genes are correlated with other members of the seed group, meaning that these consensus genes form a tightly connected network. Figure 1 shows that edgeRun reports 6.6 times more unique differentially expressed genes than DESeq2, and that a larger proportion of them are co-expressed with the consensus: 33% of genes unique to edgeRun compared with 17% of genes unique to DESeq2 (P-value < 0.001). This means that the genes reported by edgeRun are more likely to be functionally relevant, as they are more correlated with the consensus network. More details on this approach to evaluating functional relevance can be found in the Supplementary Methods.
Fig. 1. Comparing the functional relevance of genes called significantly differentially expressed by edgeRun and DESeq2
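
The partitioning logic of this comparison can be sketched in Python as follows. This is a simplified stand-in and not the GRAIL procedure: the gene names, fold-changes and co-expression lookup are invented toy data, all function names are hypothetical, and ranking the consensus genes by absolute log fold-change is an assumption about how the seed signature is selected.

```python
def partition_calls(calls_a, calls_b):
    """Split two sets of significant genes into consensus and tool-specific parts."""
    return calls_a & calls_b, calls_a - calls_b, calls_b - calls_a


def seed_signature(consensus, log_fc, k):
    """Pick the k consensus genes with the largest absolute log fold-change."""
    ranked = sorted(consensus, key=lambda g: abs(log_fc[g]), reverse=True)
    return set(ranked[:k])


def fraction_connected(genes, seed, coexpressed):
    """Share of `genes` co-expressed with at least one seed gene."""
    if not genes:
        return 0.0
    hits = sum(1 for g in genes if coexpressed.get(g, set()) & seed)
    return hits / len(genes)


# Hypothetical toy inputs; real use would take each tool's calls at FDR < 5%.
tool_a = {"G1", "G2", "G3", "G4", "G5", "G6"}
tool_b = {"G1", "G2", "G3", "G4", "G7"}
log_fc = {"G1": 3.0, "G2": -2.5, "G3": 0.4, "G4": 1.1}
# Stand-in for a co-expression network lookup: gene -> co-expressed partners.
coexpressed = {"G5": {"G1"}, "G6": {"G3"}, "G7": {"G9"}}

consensus, only_a, only_b = partition_calls(tool_a, tool_b)
seed = seed_signature(consensus, log_fc, k=2)
frac_a = fraction_connected(only_a, seed, coexpressed)
frac_b = fraction_connected(only_b, seed, coexpressed)
print(sorted(seed), frac_a, frac_b)
```

The two fractions play the role of the 33% and 17% figures reported above: a tool whose unique calls are more often connected to the seed signature is judged to report more functionally relevant genes.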

3 Discussion

We present edgeRun, an R package that improves on the popular package edgeR for differential digital expression by providing the capability to perform unconditional testing, resulting in more power to detect true differences in expression between two biological conditions. Even though the computational burden is increased, the power gained using this approach is significant, allowing researchers to detect more true positives, especially in experiments with as few as two replicates per condition and for genes with low expression, all without sacrificing type-I error rate control. edgeRun is simple to use, especially for users already experienced with edgeR, as it is designed to interface with edgeR objects directly, taking inputs and generating output in the same format.
References

1.  Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model.

Authors:  Hairi Li; Michael T Lovci; Young-Soo Kwon; Michael G Rosenfeld; Xiang-Dong Fu; Gene W Yeo
Journal:  Proc Natl Acad Sci U S A       Date:  2008-12-16       Impact factor: 11.205

2.  Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors:  Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal:  Nat Methods       Date:  2008-05-30       Impact factor: 28.547

3.  Unamplified cap analysis of gene expression on a single-molecule sequencer.

Authors:  Mutsumi Kanamori-Katayama; Masayoshi Itoh; Hideya Kawaji; Timo Lassmann; Shintaro Katayama; Miki Kojima; Nicolas Bertin; Ai Kaiho; Noriko Ninomiya; Carsten O Daub; Piero Carninci; Alistair R R Forrest; Yoshihide Hayashizaki
Journal:  Genome Res       Date:  2011-05-19       Impact factor: 9.043

4.  compcodeR--an R package for benchmarking differential expression methods for RNA-seq data.

Authors:  Charlotte Soneson
Journal:  Bioinformatics       Date:  2014-05-09       Impact factor: 6.937

5.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors:  Michael I Love; Wolfgang Huber; Simon Anders
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

6.  Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions.

Authors:  Soumya Raychaudhuri; Robert M Plenge; Elizabeth J Rossin; Aylwin C Y Ng; Shaun M Purcell; Pamela Sklar; Edward M Scolnick; Ramnik J Xavier; David Altshuler; Mark J Daly
Journal:  PLoS Genet       Date:  2009-06-26       Impact factor: 5.917

7.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors:  Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal:  Bioinformatics       Date:  2009-11-11       Impact factor: 6.937

8.  COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals.

Authors:  Takeshi Obayashi; Yasunobu Okamura; Satoshi Ito; Shu Tadaka; Ikuko N Motoike; Kengo Kinoshita
Journal:  Nucleic Acids Res       Date:  2012-11-29       Impact factor: 16.971
