Ulisses M Braga-Neto, Amin Zollanvari, Edward R Dougherty. Department of Electrical and Computer Engineering, Center for Bioinformatics and Genomic Systems Engineering and Department of Statistics, Texas A&M University, College Station, TX, 77843, USA.
Abstract
MOTIVATION: It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.

RESULTS: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.

AVAILABILITY AND IMPLEMENTATION: The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.
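The dependence of the bias on the gap between the sampling ratio and the true class probabilities can be illustrated with a small numerical sketch. The code below is not the authors' C++ implementation or their estimator as published; it is a minimal illustration that assumes the held-out error counts per class are already available from some cross-validation run, and that the true prior probability of class 0 (here the hypothetical parameter `c0`) is known. The function names are also hypothetical. It compares the usual pooled error (total held-out errors over total sample size, which implicitly weights each class by its sampling ratio) with a version that weights the class-specific error rates by `c0`.

```python
def pooled_error(errors0, n0, errors1, n1):
    """Pooled estimate: all held-out misclassifications over all points.

    This implicitly weights class 0 by n0 / (n0 + n1), the sampling ratio.
    """
    return (errors0 + errors1) / (n0 + n1)


def prior_weighted_error(errors0, n0, errors1, n1, c0):
    """Class-specific error rates weighted by an assumed known prior.

    c0 is the assumed true probability of class 0 (a hypothetical input;
    under separate sampling it cannot be estimated from the sample itself).
    """
    if not 0.0 <= c0 <= 1.0:
        raise ValueError("c0 must lie in [0, 1]")
    if n0 <= 0 or n1 <= 0:
        raise ValueError("both classes need at least one sample point")
    return c0 * (errors0 / n0) + (1.0 - c0) * (errors1 / n1)


# Illustrative numbers only (not taken from the paper): a balanced design
# with 50 points per class, but an assumed true prior of 0.9 for class 0.
n0, n1 = 50, 50
errors0, errors1 = 5, 20   # held-out errors per class
c0 = 0.9

pooled = pooled_error(errors0, n0, errors1, n1)
weighted = prior_weighted_error(errors0, n0, errors1, n1, c0)
gap = pooled - weighted
```

With these made-up counts the pooled figure is 0.25, while the prior-weighted figure is about 0.13, so the two disagree by roughly 0.12. When the sampling ratio `n0 / (n0 + n1)` equals `c0`, the two quantities coincide, which is consistent with the abstract's statement that the bias depends on the difference between the sampling ratios and the true population probabilities.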