Hilary S Parker, Jeffrey T Leek, Alexander V Favorov, Michael Considine, Xiaoxin Xia, Sameer Chavan, Christine H Chung, Elana J Fertig
Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA.
Abstract
MOTIVATION: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms that remove these artifacts enhance differences between known biological covariates, but they also risk removing intragroup biological heterogeneity and, with it, any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging with standard algorithms, which are designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori.
RESULTS: We therefore assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), which uses a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even when the sample batches were highly confounded with HPV status in the training set.
AVAILABILITY AND IMPLEMENTATION: All analyses were performed using R version 2.15.0. The code and data used to generate the results of this manuscript are available from https://sourceforge.net/projects/psva.