
Decorrelation of the true and estimated classifier errors in high-dimensional settings.

Blaise Hanczar, Jianping Hua, Edward R Dougherty.

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes.
We observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features. Which of the latter two yields the better correlation shows no general trend and differs from model to model.
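
The variance decomposition mentioned in the abstract can be illustrated with the standard identity Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), where rho is the correlation between the estimated and true errors. The sketch below is not the authors' experimental setup: the two-class Gaussian model, the nearest-mean classification rule, the Monte Carlo approximation of the true error, and all function and parameter names (simulate_errors, variance_decomposition, delta, and so on) are assumptions chosen only to keep the example short. It uses leave-one-out cross-validation, one of the three estimators named above.

```python
import numpy as np


def predict(x_train, y_train, x_new):
    """Nearest-mean rule (illustrative stand-in, not the paper's classifiers)."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    d0 = ((x_new - m0) ** 2).sum(axis=1)
    d1 = ((x_new - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def simulate_errors(n_trials=200, n=30, dim=5, delta=1.0, n_test=5000, seed=0):
    """Draw many small samples; return (true_error, loo_error) per sample."""
    rng = np.random.default_rng(seed)
    # Hypothetical model: class means at -mu and +mu, identity covariance.
    mu = np.zeros(dim)
    mu[0] = delta
    # A large independent test set approximates the true error.
    y_test = rng.integers(0, 2, size=n_test)
    x_test = rng.normal(size=(n_test, dim)) + np.where(y_test[:, None] == 1, mu, -mu)
    true_err = np.empty(n_trials)
    loo_err = np.empty(n_trials)
    for t in range(n_trials):
        y = np.arange(n) % 2  # balanced class labels
        x = rng.normal(size=(n, dim)) + np.where(y[:, None] == 1, mu, -mu)
        true_err[t] = np.mean(predict(x, y, x_test) != y_test)
        # Leave-one-out: hold out each point, train on the rest, count mistakes.
        mistakes = 0
        for i in range(n):
            keep = np.arange(n) != i
            mistakes += int(predict(x[keep], y[keep], x[i:i + 1])[0] != y[i])
        loo_err[t] = mistakes / n
    return true_err, loo_err


def variance_decomposition(est, true):
    """Split the variance of the deviation (est - true) into its three terms."""
    var_est, var_true = est.var(), true.var()
    rho = np.corrcoef(est, true)[0, 1]
    return {
        "var_dev": (est - true).var(),
        "var_est": var_est,
        "var_true": var_true,
        "rho": rho,
        "reconstructed": var_est + var_true - 2.0 * rho * np.sqrt(var_est * var_true),
    }


if __name__ == "__main__":
    true_err, loo_err = simulate_errors()
    for name, value in variance_decomposition(loo_err, true_err).items():
        print(f"{name}: {value:.6f}")
```

For fixed variances of the two errors, a smaller rho leaves more of Var(est) + Var(true) uncancelled, so the deviation variance grows. Rerunning the sketch with a larger dim or smaller n is one informal way to explore how rho behaves under these assumed conditions.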


Year:  2007        PMID: 18288255      PMCID: PMC3171336          DOI: 10.1155/2007/38473

Source DB:  PubMed          Journal:  EURASIP J Bioinform Syst Biol        ISSN: 1687-4145


