Ludwig Geistlinger1, Gergely Csaba2, Mara Santarelli3, Marcel Ramos4, Lucas Schiffer5, Nitesh Turaga6, Charity Law7, Sean Davis8, Vincent Carey9, Martin Morgan, Ralf Zimmer, Levi Waldron1.
1. Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA.
2. Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, USA.
3. Institute for Bioinformatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany.
4. Roswell Park Cancer Institute, Buffalo, NY 14203, USA.
5. Graduate School of Arts and Sciences, Boston University, Boston, MA 02215, USA.
6. Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3052, Australia.
7. Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia.
8. Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA.
9. Harvard Medical School, Boston, MA 02215, USA.
Abstract
MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets.
RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance.
AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR.
CONTACT: ludwig.geistlinger@sph.cuny.edu.