Prasad Patil (1), Pierre-Olivier Bachant-Winner (2), Benjamin Haibe-Kains (3), Jeffrey T Leek (1).
(1) Department of Biostatistics, Johns Hopkins School of Public Health, Baltimore, MD, USA.
(2) Institut de Recherches Cliniques de Montréal, Montreal, Quebec H2W 1R7, Canada.
(3) Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario M5G 1L7, Canada and Department of Medical Biophysics, University of Toronto, Toronto, Ontario M5G 1L7, Canada.
Abstract
MOTIVATION: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur whenever cross-sample normalization is used before clinical prediction.
RESULTS: Using a set of curated, publicly available breast cancer microarray experiments, we demonstrate that results from existing gene signatures that rely on normalizing test data may be irreproducible when the patient population changes in composition or size. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms.
AVAILABILITY AND IMPLEMENTATION: The code, data and instructions necessary to reproduce our entire analysis are available at https://github.com/prpatil/testsetbias.
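The contrast drawn in the abstract can be illustrated with a small sketch. It is not the authors' analysis: the expression values are made up, gene-wise mean centering stands in for cross-sample normalization in general, and the functions `center_genes`, `threshold_call` and `pair_call` are hypothetical names introduced only for this example. The same patient is normalized alongside two different cohorts. The call based on a normalized value flips, while the within-sample comparison of two genes, a simple rank-based feature, does not.

```python
import numpy as np

def center_genes(batch):
    """Cross-sample normalization (illustrative): subtract each gene's mean
    across all patients in the batch. Rows are genes, columns are patients."""
    return batch - batch.mean(axis=1, keepdims=True)

def threshold_call(normalized, gene=0):
    """Hypothetical signature: positive when the normalized gene value is > 0."""
    return normalized[gene] > 0

def pair_call(batch, gene_a=0, gene_b=1):
    """Rank-based feature: is gene_a above gene_b within each patient?
    Uses only values from the same column, so other patients cannot affect it."""
    return batch[gene_a] > batch[gene_b]

# Made-up expression values for one patient (3 genes).
patient = np.array([5.0, 3.0, 4.0])

# Two made-up cohorts (genes x patients) that differ in gene 0 expression.
low_cohort = np.array([[2.0, 3.0], [3.0, 2.5], [4.0, 4.5]])
high_cohort = np.array([[8.0, 9.0], [3.0, 2.5], [4.0, 4.5]])

# The patient of interest is always column 0 of the batch.
batch_low = np.column_stack([patient, low_cohort])
batch_high = np.column_stack([patient, high_cohort])

# Prediction after cross-sample normalization depends on the cohort.
call_low = bool(threshold_call(center_genes(batch_low))[0])
call_high = bool(threshold_call(center_genes(batch_high))[0])

# Rank-based prediction is computed from the patient's own values only.
pair_low = bool(pair_call(batch_low)[0])
pair_high = bool(pair_call(batch_high)[0])

print("normalized call:", call_low, call_high)
print("rank-based call:", pair_low, pair_high)
```

In this toy setting the patient's gene 0 value lies above the mean of the first cohort and below the mean of the second, so the thresholded call changes with the cohort even though the patient's own measurements are identical.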