Laurence de Torrenté (1), Samuel Zimmerman (1), Masako Suzuki (2), Maximilian Christopeit (3), John M Greally (2), Jessica C Mar (4, 5, 6)
1. Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
2. Center for Epigenomics and Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
3. Internal Medicine II, Hematology, Oncology, Clinical Immunology and Rheumatology, University Hospital Tuebingen, Otfried-Mueller-Strasse 10, 72076, Tuebingen, Germany.
4. Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
5. Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
6. Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, 4072, Australia.
Contact (Jessica C Mar): j.mar@uq.edu.au
Abstract
BACKGROUND: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption or to consider how broadly it applies across all genes in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from The Cancer Genome Atlas (TCGA).
RESULTS: Surprisingly, the expression of fewer than 50% of all genes was Normally distributed, with other distributions, including Gamma, Bimodal, Cauchy, and Lognormal, also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. Different assumptions based on the shape of the expression profile were used to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes identified when the shape of the distribution was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. We showed that when multiple types of distributions were permitted, that is, when the shape of the expression profile was used, the statistical classifiers had greater predictive accuracy for determining the prognosis of a patient than those that assumed only one type of gene expression distribution.
CONCLUSIONS: Our results highlight the value of studying a gene's distribution shape to model heterogeneity of transcriptomic data, and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked when using standard approaches that assume all genes follow the same type of distribution in a patient cohort.
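The abstract does not say how each gene was assigned to a distribution category, so the following is only a minimal sketch of one plausible approach: fit candidate distributions by maximum likelihood and keep the one with the lowest Bayesian information criterion (BIC). Both the use of BIC and the restriction to two candidates (Normal and Lognormal, which have closed-form estimates) are assumptions made for illustration. The function names and the simulated "genes" are hypothetical and do not come from the study.

```python
import math
import random


def normal_loglik(x):
    """Maximised Normal log-likelihood, using the MLE mean and variance."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def lognormal_loglik(x):
    """Maximised Lognormal log-likelihood (requires all values > 0)."""
    logs = [math.log(v) for v in x]
    # Lognormal density = Normal density of log(x), divided by x.
    return normal_loglik(logs) - sum(logs)


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2 * loglik


def classify_shape(x):
    """Return the candidate label with the lowest BIC, plus all scores."""
    n = len(x)
    scores = {"Normal": bic(normal_loglik(x), 2, n)}
    if min(x) > 0:  # the Lognormal is only defined for positive values
        scores["Lognormal"] = bic(lognormal_loglik(x), 2, n)
    return min(scores, key=scores.get), scores


# Simulated expression values for two hypothetical genes across 1000 patients.
rng = random.Random(0)
gene_a = [rng.gauss(8.0, 2.0) for _ in range(1000)]
gene_b = [rng.lognormvariate(1.0, 0.8) for _ in range(1000)]

label_a, scores_a = classify_shape(gene_a)
label_b, scores_b = classify_shape(gene_b)
print("gene_a:", label_a)
print("gene_b:", label_b)
```

Extending this sketch to the other categories named in the abstract (Gamma, Cauchy, Bimodal) would need numerical optimisation or a mixture-model fit. The selection step would stay the same: compute each candidate's penalised likelihood and pick the best.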
Keywords:
Cancer genomics; Gene expression; Multi-modality; Non-normal distribution; Survival analysis