
Cross-validation under separate sampling: strong bias and how to correct it.

Ulisses M Braga-Neto, Amin Zollanvari, Edward R Dougherty.

Abstract

MOTIVATION: It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.
RESULTS: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
AVAILABILITY AND IMPLEMENTATION: The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
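The dependence of the bias on the gap between the sampling ratio and the true class probabilities can be illustrated with a small sketch. This is not the authors' C++ code, and the abstract does not spell out the proposed estimator; the sketch assumes that the true prior of class 0 (called `c0` here, a hypothetical name) is known from outside the data, and that class-specific cross-validation error rates are combined with that prior rather than with the sample proportions. The nearest-mean classifier, the interleaved folds, and all function names are illustrative choices only. Each class needs at least two points so that its training part is never empty.

```python
from statistics import mean


def nearest_mean_classifier(train0, train1):
    """Return a 1-D rule that picks the class whose training mean is closer."""
    m0, m1 = mean(train0), mean(train1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1


def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def class_error_rates(x0, x1, k):
    """Per-class k-fold error rates, with each class folded separately."""
    wrong0 = wrong1 = 0
    for f0, f1 in zip(folds(len(x0), k), folds(len(x1), k)):
        held0, held1 = set(f0), set(f1)
        train0 = [x for i, x in enumerate(x0) if i not in held0]
        train1 = [x for i, x in enumerate(x1) if i not in held1]
        classify = nearest_mean_classifier(train0, train1)
        wrong0 += sum(classify(x0[i]) != 0 for i in f0)
        wrong1 += sum(classify(x1[i]) != 1 for i in f1)
    return wrong0 / len(x0), wrong1 / len(x1)


def sample_ratio_cv(x0, x1, k):
    """Pooled error count: implicitly weights classes by n0/n and n1/n."""
    e0, e1 = class_error_rates(x0, x1, k)
    n0, n1 = len(x0), len(x1)
    return (n0 * e0 + n1 * e1) / (n0 + n1)


def prior_weighted_cv(x0, x1, k, c0):
    """Weights the class error rates by the assumed true prior c0."""
    e0, e1 = class_error_rates(x0, x1, k)
    return c0 * e0 + (1 - c0) * e1


# Toy data with equal class sizes, so the sampling ratio is 0.5.
x0 = [0.0, 0.0, 0.0, 3.0]
x1 = [4.0, 4.0, 4.0, 4.0]
print(sample_ratio_cv(x0, x1, k=2))
print(prior_weighted_cv(x0, x1, k=2, c0=0.9))
print(prior_weighted_cv(x0, x1, k=2, c0=0.1))
```

When `c0` equals the sampling ratio `n0 / (n0 + n1)`, the two functions agree; the further the true prior lies from that ratio, the further the pooled estimate drifts from the prior-weighted one.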

Year:  2014        PMID: 25123902      PMCID: PMC4296143          DOI: 10.1093/bioinformatics/btu527

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  4 in total

1.  Prediction of nucleic acid binding probability in proteins: a neighboring residue network based score.

Authors:  Zhichao Miao; Eric Westhof
Journal:  Nucleic Acids Res       Date:  2015-05-04       Impact factor: 16.971

2.  Unbiased bootstrap error estimation for linear discriminant analysis.

Authors:  Thang Vu; Chao Sima; Ulisses M Braga-Neto; Edward R Dougherty
Journal:  EURASIP J Bioinform Syst Biol       Date:  2014-10-03

3.  A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery.

Authors:  Oliver P Watson; Isidro Cortes-Ciriano; Aimee R Taylor; James A Watson
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

4.  Incorporating prior knowledge induced from stochastic differential equations in the classification of stochastic observations.

Authors:  Amin Zollanvari; Edward R Dougherty
Journal:  EURASIP J Bioinform Syst Biol       Date:  2016-01-20
