Quantification of the impact of feature selection on the variance of cross-validation error estimation.

Yufei Xiao, Jianping Hua, Edward R Dougherty.

Abstract

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data are used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, which is most directly analyzed via the deviation distribution of the estimator, that is, the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution beyond the variation present in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, accounting for over half of the variance in many of the cases studied. We consider linear discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and to patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection than when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection to the deviation variance.
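
The abstract describes the CRIDD only in words, so the following is a minimal sketch under an assumed formula rather than the authors' definition: it takes the variance of the deviations (estimated error minus true error) with feature selection, subtracts the variance obtained with a fixed feature set, and divides by the former. The function names `deviation` and `cridd`, the choice of normalizing by the with-selection variance, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def deviation(estimated_errors, true_errors):
    """Per-trial deviation: estimated error minus true error."""
    return [e - t for e, t in zip(estimated_errors, true_errors)]


def cridd(dev_with_fs, dev_without_fs):
    """Assumed CRIDD-style ratio (not confirmed by the abstract):
    the share of the with-selection deviation variance that is absent
    when the feature set is fixed in advance."""
    var_fs = pvariance(dev_with_fs)        # variance with feature selection
    var_fixed = pvariance(dev_without_fs)  # variance with a given feature set
    if var_fs == 0:
        raise ValueError("deviation variance with feature selection is zero")
    return (var_fs - var_fixed) / var_fs


# Illustrative, made-up deviations; these are not results from the study.
with_fs = [-0.2, 0.0, 0.2]
without_fs = [-0.1, 0.0, 0.1]
print(cridd(with_fs, without_fs))
```

Under this assumed normalization, a positive value means the deviation distribution is more dispersed with feature selection, and a value above 0.5 would correspond to feature selection accounting for more than half of the deviation variance.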

Year:  2007        PMID: 17713587      PMCID: PMC3171328          DOI: 10.1155/2007/16354

Source DB:  PubMed          Journal:  EURASIP J Bioinform Syst Biol        ISSN: 1687-4145


  6 in total

1.  Is cross-validation valid for small-sample microarray classification?

Authors:  Ulisses M Braga-Neto; Edward R Dougherty
Journal:  Bioinformatics       Date:  2004-02-12       Impact factor: 6.937

2.  Superior feature-set ranking for small samples using bolstered error estimation.

Authors:  Chao Sima; Ulisses Braga-Neto; Edward R Dougherty
Journal:  Bioinformatics       Date:  2004-10-28       Impact factor: 6.937

3.  Prediction error estimation: a comparison of resampling methods.

Authors:  Annette M Molinaro; Richard Simon; Ruth M Pfeiffer
Journal:  Bioinformatics       Date:  2005-05-19       Impact factor: 6.937

4.  Genetic test bed for feature selection.

Authors:  Ashish Choudhary; Marcel Brun; Jianping Hua; James Lowey; Ed Suh; Edward R Dougherty
Journal:  Bioinformatics       Date:  2006-01-20       Impact factor: 6.937

5.  Gene expression profiling predicts clinical outcome of breast cancer.

Authors:  Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal:  Nature       Date:  2002-01-31       Impact factor: 49.962

6.  A gene-expression signature as a predictor of survival in breast cancer.

Authors:  Marc J van de Vijver; Yudong D He; Laura J van't Veer; Hongyue Dai; Augustinus A M Hart; Dorien W Voskuil; George J Schreiber; Johannes L Peterse; Chris Roberts; Matthew J Marton; Mark Parrish; Douwe Atsma; Anke Witteveen; Annuska Glas; Leonie Delahaye; Tony van der Velde; Harry Bartelink; Sjoerd Rodenhuis; Emiel T Rutgers; Stephen H Friend; René Bernards
Journal:  N Engl J Med       Date:  2002-12-19       Impact factor: 91.245

  8 in total

1.  Decorrelation of the true and estimated classifier errors in high-dimensional settings.

Authors:  Blaise Hanczar; Jianping Hua; Edward R Dougherty
Journal:  EURASIP J Bioinform Syst Biol       Date:  2007

2.  Which is better: holdout or full-sample classifier design?

Authors:  Marcel Brun; Qian Xu; Edward R Dougherty
Journal:  EURASIP J Bioinform Syst Biol       Date:  2008

3.  Performance of feature selection methods.

Authors:  Edward R Dougherty; Jianping Hua; Chao Sima
Journal:  Curr Genomics       Date:  2009-09       Impact factor: 2.236

4.  Characterization of the effectiveness of reporting lists of small feature sets relative to the accuracy of the prior biological knowledge.

Authors:  Chen Zhao; Michael L Bittner; Robert S Chapkin; Edward R Dougherty
Journal:  Cancer Inform       Date:  2010-03-18

5.  Algebraic comparison of partial lists in bioinformatics.

Authors:  Giuseppe Jurman; Samantha Riccadonna; Roberto Visintainer; Cesare Furlanello
Journal:  PLoS One       Date:  2012-05-17       Impact factor: 3.240

6.  Bioinformatic-driven search for metabolic biomarkers in disease.

Authors:  Christian Baumgartner; Melanie Osl; Michael Netzer; Daniela Baumgartner
Journal:  J Clin Bioinforma       Date:  2011-01-20

7.  The Model-Based Study of the Effectiveness of Reporting Lists of Small Feature Sets Using RNA-Seq Data.

Authors:  Eunji Kim; Ivan Ivanov; Jianping Hua; Johanna W Lampe; Meredith Aj Hullar; Robert S Chapkin; Edward R Dougherty
Journal:  Cancer Inform       Date:  2017-06-12

8.  On the epistemological crisis in genomics.

Authors:  Edward R Dougherty
Journal:  Curr Genomics       Date:  2008-04       Impact factor: 2.236

