Bias correction for selecting the minimal-error classifier from many machine learning models.

Ying Ding, Shaowu Tang, Serena G Liao, Jia Jia, Steffi Oesterreich, Yan Lin, George C Tseng.

Abstract

MOTIVATION: Supervised machine learning is commonly applied in genomic research to construct a classifier from training data that generalizes to independent test data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no method is universally best. It has been common practice to apply many machine learning methods and report the one that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias: the minimum of several noisy error estimates is, on average, smaller than the true error of the selected method. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that cannot be validated later in independent cohorts.
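The selection bias described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the article: every method has the same true error rate (0.5, i.e. uninformative features), each method's estimated error is an independent binomial proportion over n samples (real cross-validation estimates on shared data are correlated), and the function name `simulate_min_error` is hypothetical.

```python
import random
import statistics


def simulate_min_error(n_samples=40, n_methods=10, true_error=0.5,
                       n_rep=2000, seed=0):
    """Compare the error estimate of one fixed method with the minimum
    estimate over several methods that all share the same true error."""
    rng = random.Random(seed)
    single, selected = [], []
    for _ in range(n_rep):
        # One noisy error estimate per method (assumed independent here).
        errors = [
            sum(rng.random() < true_error for _ in range(n_samples)) / n_samples
            for _ in range(n_methods)
        ]
        single.append(errors[0])       # a method chosen in advance
        selected.append(min(errors))   # the "best" method, chosen after the fact
    return statistics.mean(single), statistics.mean(selected)


mean_single, mean_selected = simulate_min_error()
print(f"fixed method:    {mean_single:.3f}")
print(f"minimum over 10: {mean_selected:.3f}")
```

Under these assumptions the pre-specified method's estimate averages close to the true error of 0.5, while the reported minimum is noticeably lower even though no method is actually better than chance.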
RESULTS: In this article, we illustrated the probabilistic framework of the problem and explored its statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperformed the other methods in bias correction with smaller variance. It has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve the classifier and its accuracy. An R package 'MLbias' and all source files are publicly available.
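To make the idea of learning curve fitting concrete, the sketch below fits an inverse power law to error rates observed at several training-set sizes and extrapolates to a larger size. It is not the MLbias implementation: the parameterization e(n) = a * n^(-alpha) + b is one common form of an inverse power law learning curve and is assumed here, the fitting strategy (grid search over alpha with ordinary least squares for a and b) is chosen only for simplicity, the function names `fit_ipl` and `predict_ipl` are hypothetical, and the data points are synthetic and noise-free.

```python
def fit_ipl(sizes, errors, alphas=None):
    """Fit e(n) = a * n**(-alpha) + b by grid search over alpha.

    For a fixed alpha the model is linear in x = n**(-alpha), so a and b
    follow from ordinary least squares; the alpha with the smallest
    residual sum of squares is kept.
    """
    if alphas is None:
        alphas = [k / 100 for k in range(5, 301)]  # 0.05 ... 3.00
    best = None
    for alpha in alphas:
        x = [n ** -alpha for n in sizes]
        x_mean = sum(x) / len(x)
        y_mean = sum(errors) / len(errors)
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, errors))
        a = sxy / sxx
        b = y_mean - a * x_mean
        sse = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b


def predict_ipl(n, a, alpha, b):
    """Error rate predicted by the fitted curve at training-set size n."""
    return a * n ** -alpha + b


# Synthetic, noise-free learning curve (illustrative values only).
sizes = [10, 15, 20, 30, 40, 50]
true_a, true_alpha, true_b = 1.2, 0.8, 0.10
errors = [true_a * n ** -true_alpha + true_b for n in sizes]

a, alpha, b = fit_ipl(sizes, errors)
print(f"a={a:.3f}, alpha={alpha:.2f}, b={b:.3f}")
print(f"extrapolated error at n=200: {predict_ipl(200, a, alpha, b):.3f}")
```

In this parameterization b plays the role of the asymptotic error rate as n grows, which is what makes the fitted curve usable for judging whether recruiting more samples is likely to help. With real, noisy error estimates a weighted or nonlinear fit would probably be preferable to this grid search.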
AVAILABILITY AND IMPLEMENTATION: tsenglab.biostat.pitt.edu/software.htm.
CONTACT: ctseng@pitt.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

Year:  2014        PMID: 25086004      PMCID: PMC4221122          DOI: 10.1093/bioinformatics/btu520

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

