Mumtahena Rahman1, Laurie K Jackson2, W Evan Johnson3, Dean Y Li4, Andrea H Bild5, Stephen R Piccolo6.
1. Department of Biomedical Informatics.
2. Department of Pharmacology and Toxicology.
3. Department of Oncological Sciences, University of Utah, Salt Lake City, UT, USA, Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA.
4. Department of Oncological Sciences, University of Utah, Salt Lake City, UT, USA, School of Medicine, Department of Human Genetics, University of Utah, Salt Lake City, UT 84132, USA.
5. Department of Biomedical Informatics, Department of Pharmacology and Toxicology, Department of Oncological Sciences, University of Utah, Salt Lake City, UT, USA.
6. Department of Biology, Brigham Young University, Provo, UT 84604, USA.
Abstract
MOTIVATION: The Cancer Genome Atlas (TCGA) RNA-Sequencing data are used widely for research. TCGA provides 'Level 3' data, which have been processed using a pipeline specific to that resource. However, we have found using experimentally derived data that this pipeline produces gene-expression values that vary considerably across biological replicates. In addition, some RNA-Sequencing analysis tools require integer-based read counts, which are not provided with the Level 3 data. As an alternative, we have reprocessed the data for 9264 tumor and 741 normal samples across 24 cancer types using the Rsubread package. We have also collated corresponding clinical data for these samples. We provide these data as a community resource.

RESULTS: We compared TCGA samples processed using either pipeline and found that the Rsubread pipeline produced fewer zero-expression genes and more consistent expression levels across replicate samples than the TCGA pipeline. Additionally, we used a genomic-signature approach to estimate HER2 (ERBB2) activation status for 662 breast-tumor samples and found that the Rsubread data resulted in stronger predictions of HER2 pathway activity. Finally, we used data from both pipelines to classify 575 lung cancer samples based on histological type. This analysis identified various non-coding RNA that may influence lung-cancer histology.

AVAILABILITY AND IMPLEMENTATION: The RNA-Sequencing and clinical data can be downloaded from Gene Expression Omnibus (accession number GSE62944). Scripts and code that were used to process and analyze the data are available from https://github.com/srp33/TCGA_RNASeq_Clinical.

CONTACT: stephen_piccolo@byu.edu or andreab@genetics.utah.edu

SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
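The comparison described under RESULTS (zero-expression genes and consistency across replicates) can be illustrated with a minimal Python sketch. This is not the authors' code. The gene names and values are made-up toy data, `pipeline_a` and `pipeline_b` are hypothetical stand-ins for two processing pipelines, and two definitions are assumptions made here for illustration: a "zero-expression gene" is one whose value is zero in every sample, and consistency is measured by the coefficient of variation. The abstract does not state which definitions were actually used.

```python
from statistics import mean, stdev

def count_zero_genes(expr):
    """Count genes whose value is zero in every sample (assumed definition)."""
    return sum(1 for values in expr.values() if all(v == 0 for v in values))

def mean_cv(expr):
    """Mean coefficient of variation (stdev / mean) over genes with a positive mean.

    Lower values indicate more consistent expression across replicates.
    """
    cvs = []
    for values in expr.values():
        m = mean(values)
        if m > 0:  # skip all-zero genes, for which the ratio is undefined
            cvs.append(stdev(values) / m)
    return mean(cvs)

# Toy data (hypothetical): gene -> expression values in three replicate samples.
pipeline_a = {
    "GENE1": [10.0, 11.0, 9.0],
    "GENE2": [0.0, 0.0, 0.0],
    "GENE3": [5.0, 15.0, 10.0],
}
pipeline_b = {
    "GENE1": [10.0, 10.5, 9.5],
    "GENE2": [1.0, 1.2, 0.8],
    "GENE3": [9.0, 11.0, 10.0],
}

for name, expr in (("pipeline_a", pipeline_a), ("pipeline_b", pipeline_b)):
    print(name, "zero-expression genes:", count_zero_genes(expr),
          "mean CV:", round(mean_cv(expr), 3))
```

With real data, each dictionary would be replaced by a gene-by-sample matrix restricted to replicate samples of the same biological condition, so that the variation being measured reflects the pipeline rather than biology.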