Bias correction for selecting the minimal-error classifier from many machine learning models.

Ying Ding, Shaowu Tang, Serena G Liao, Jia Jia, Steffi Oesterreich, Yan Lin, George C Tseng.

Abstract

MOTIVATION: Supervised machine learning is commonly applied in genomic research to construct a classifier from training data that generalizes to independent test data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists. It has been common practice to apply many machine learning methods and report the one that produces the smallest cross-validation error rate. Theoretically, such a procedure introduces a selection bias: the minimum of several error estimates tends to fall below the true error of the selected method. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that cannot be validated later in independent cohorts.
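The selection bias described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the article: every "classifier" guesses labels at random (true error 0.5), the classifiers are independent of each other, and a plain error rate on n = 40 samples stands in for a cross-validation estimate. The function names are hypothetical.

```python
import random


def min_error_over_classifiers(n_samples, n_classifiers, rng):
    """Smallest observed error rate among classifiers that guess at random."""
    labels = [rng.getrandbits(1) for _ in range(n_samples)]
    errors = []
    for _ in range(n_classifiers):
        # Each hypothetical classifier predicts independently of the labels,
        # so its true error rate is exactly 0.5.
        predictions = [rng.getrandbits(1) for _ in range(n_samples)]
        wrong = sum(p != y for p, y in zip(predictions, labels))
        errors.append(wrong / n_samples)
    return min(errors)


def mean_reported_error(n_samples=40, n_classifiers=10, n_repeats=2000, seed=0):
    """Average of the 'best' error rate a study would report, over many studies."""
    rng = random.Random(seed)
    total = sum(
        min_error_over_classifiers(n_samples, n_classifiers, rng)
        for _ in range(n_repeats)
    )
    return total / n_repeats


single = mean_reported_error(n_classifiers=1)
best_of_ten = mean_reported_error(n_classifiers=10)
print(f"one method:            {single:.3f}")
print(f"best of ten methods:   {best_of_ten:.3f}")
```

With a single method the average estimate stays near the true error of 0.5, whereas reporting the best of ten methods gives a clearly smaller average even though no method is better than chance. Real classifiers are correlated and evaluated by cross-validation, so the size of the bias in practice will differ from this toy setting.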
RESULTS: In this article, we illustrate the probabilistic framework of the problem and explore its statistical and asymptotic properties. We propose a new bias correction method based on learning curve fitting by inverse power law (IPL) and compare it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperforms the other methods in bias correction with smaller variance, and that it has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve the classifier and its accuracy. An R package 'MLbias' and all source files are publicly available.
AVAILABILITY AND IMPLEMENTATION: tsenglab.biostat.pitt.edu/software.htm.
CONTACT: ctseng@pitt.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
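The learning-curve idea mentioned in RESULTS can be sketched as follows. The abstract does not give the functional form, so this example assumes the commonly used inverse power law e(n) = a * n^(-alpha) + b, where b is the asymptotic error. The fitting routine (a grid search over b combined with a log-linear least-squares fit) is an illustrative choice, not the procedure implemented in the 'MLbias' package, and the data points are synthetic values generated from known parameters.

```python
import math


def fit_ipl(sizes, errors, n_grid=200):
    """Fit e(n) = a * n**(-alpha) + b to observed error rates.

    For each candidate asymptote b below the smallest observed error,
    log(e - b) is linear in log(n), so a and alpha follow from ordinary
    least squares. The b with the smallest squared error is kept.
    Assumes all observed errors are strictly positive.
    """
    xs = [math.log(n) for n in sizes]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b_max = min(errors)
    best = None
    for i in range(n_grid):
        b = b_max * i / n_grid  # strictly below the smallest observed error
        ys = [math.log(e - b) for e in errors]
        y_mean = sum(ys) / len(ys)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(y_mean - slope * x_mean)
        alpha = -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2 for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]


def predict_error(n, a, alpha, b):
    """Extrapolated error rate at sample size n under the fitted curve."""
    return a * n ** (-alpha) + b


# Synthetic learning curve generated from a = 1.0, alpha = 0.5, b = 0.1.
sizes = [20, 40, 80, 160, 320]
errors = [1.0 * n ** (-0.5) + 0.1 for n in sizes]

a_hat, alpha_hat, b_hat = fit_ipl(sizes, errors)
extrapolated = predict_error(640, a_hat, alpha_hat, b_hat)
print(f"a={a_hat:.3f} alpha={alpha_hat:.3f} b={b_hat:.3f}")
print(f"extrapolated error at n=640: {extrapolated:.3f}")
```

Once the curve is fitted, evaluating it at a larger n gives the kind of extrapolated error estimate that the abstract describes as useful for deciding whether to recruit more samples. With noisy real error estimates, a nonlinear least-squares routine would usually be preferable to this grid search.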

Year: 2014 PMID: 25086004 PMCID: PMC4221122 DOI: 10.1093/bioinformatics/btu520

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
