Ying Ding1, Shaowu Tang2, Serena G Liao2, Jia Jia2, Steffi Oesterreich2, Yan Lin2, George C Tseng1. 1. Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA. 2. Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA.
Abstract
MOTIVATION: Supervised machine learning is commonly applied in genomic research to construct a classifier from training data that generalizes to independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no method is universally best. It has been common practice to apply many machine learning methods and report the one that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that cannot be validated later in independent cohorts.

RESULTS: In this article, we illustrated the probabilistic framework of the problem and explored its statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperformed the other methods in bias correction, with smaller variance, and that it has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve the classifier and its accuracy. An R package 'MLbias' and all source files are publicly available.

AVAILABILITY AND IMPLEMENTATION: tsenglab.biostat.pitt.edu/software.htm.

CONTACT: ctseng@pitt.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
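As an illustration of the learning-curve idea behind IPL, the following is a minimal Python sketch of fitting an inverse power law to error rates observed at several training sizes and extrapolating to a larger size. It is not the 'MLbias' implementation. The functional form e(n) = a * n^(-alpha) + b, the grid-search fitting strategy, the function names `fit_ipl` and `predict_ipl`, and the example numbers are all assumptions made for illustration.

```python
def fit_ipl(sizes, errors, alphas=None):
    """Fit e(n) = a * n**(-alpha) + b to (size, error) pairs.

    Assumed strategy: grid-search the decay rate alpha; for each alpha the
    model is linear in a and b, so they follow from ordinary least squares.
    Returns the (a, alpha, b) with the smallest sum of squared errors.
    """
    if alphas is None:
        alphas = [k / 100 for k in range(5, 301)]  # candidate alphas 0.05 .. 3.00
    m = len(sizes)
    y_mean = sum(errors) / m
    best = None
    for alpha in alphas:
        x = [n ** (-alpha) for n in sizes]          # transformed predictor
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0:
            continue                                # degenerate: all sizes equal
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, errors))
        a = sxy / sxx                               # least-squares slope
        b = y_mean - a * x_mean                     # least-squares intercept
        sse = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    if best is None:
        raise ValueError("need at least two distinct training sizes")
    _, a, alpha, b = best
    return a, alpha, b


def predict_ipl(params, n):
    """Evaluate the fitted curve at training size n."""
    a, alpha, b = params
    return a * n ** (-alpha) + b


# Hypothetical, noise-free error rates generated from a known curve,
# used only to show that the fit recovers the generating parameters.
true_a, true_alpha, true_b = 1.2, 0.5, 0.10
sizes = [10, 15, 20, 30, 40, 50]
errors = [true_a * n ** (-true_alpha) + true_b for n in sizes]

params = fit_ipl(sizes, errors)
extrapolated = predict_ipl(params, 200)  # projected error rate at n = 200
print(params, extrapolated)
```

In practice the error rates at each training size would come from repeated subsampling and cross-validation, so they would be noisy. The fitted asymptote b then gives a rough indication of the error rate that additional samples could approach.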