Is cross-validation better than resubstitution for ranking genes?

Ulisses Braga-Neto, Ronaldo Hashimoto, Edward R Dougherty, Danh V Nguyen, Raymond J Carroll.

Abstract

MOTIVATION: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.
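The two estimators described above can be sketched as follows. This is a minimal illustration, not the authors' code: the nearest-class-mean classifier, the function names and the four-point toy data set are all assumptions chosen to keep the example short.

```python
import numpy as np

def fit_nearest_mean(X, y):
    """Train a nearest-class-mean classifier (an illustrative stand-in
    for any classification rule); returns a predict function."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])

    def predict(Z):
        # Squared Euclidean distance from every row of Z to every class mean.
        d = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return classes[d.argmin(axis=1)]

    return predict

def resubstitution_error(X, y, fit=fit_nearest_mean):
    """Fit once on all data, then score on the same data."""
    predict = fit(X, y)
    return float(np.mean(predict(X) != y))

def loo_error(X, y, fit=fit_nearest_mean):
    """Leave-one-out: refit without observation i, then test on i."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        predict = fit(X[keep], y[keep])
        mistakes += int(predict(X[i:i + 1])[0] != y[i])
    return mistakes / n

# Hypothetical toy data: one feature, two observations per class.
X = np.array([[0.0], [2.2], [2.8], [5.0]])
y = np.array([0, 0, 1, 1])
print(resubstitution_error(X, y))  # 0.0
print(loo_error(X, y))             # 0.5
```

On this toy set the single fitted classifier separates the training points perfectly, while two of the four leave-one-out classifiers misclassify the held-out point, which illustrates the optimistic bias of resubstitution. Note also that leave-one-out refits the classifier once per observation, whereas resubstitution fits it once.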
RESULTS: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation with respect to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
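The idea of comparing an estimated-error ranking against the true-error ranking in a known model can be sketched as below. Everything specific here is an assumption made for illustration: each "feature set" is a single feature with classes N(-d, 1) and N(+d, 1), the classifier is one-dimensional linear discriminant analysis with equal priors assumed known (which reduces to a midpoint threshold), and Spearman rank correlation stands in for the paper's three metrics, which the abstract does not specify.

```python
import math
import numpy as np

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def design(x0, x1):
    """1-D LDA with equal known priors: midpoint threshold t, orientation s."""
    m0, m1 = x0.mean(), x1.mean()
    return 0.5 * (m0 + m1), (1.0 if m1 >= m0 else -1.0)

def classify(x, t, s):
    """Label 1 if x lies on the class-1 side of the threshold."""
    return (s * (x - t) > 0).astype(int)

def true_error(t, s, d):
    """Exact error of the designed classifier under N(-d,1) vs N(+d,1)."""
    e0 = 1.0 - phi(s * (t + d))  # class 0 labelled as 1
    e1 = phi(s * (t - d))        # class 1 labelled as 0
    return 0.5 * (e0 + e1)

def resub(x0, x1):
    t, s = design(x0, x1)
    wrong = classify(x0, t, s).sum() + (1 - classify(x1, t, s)).sum()
    return wrong / (len(x0) + len(x1))

def loo(x0, x1):
    wrong = 0
    for i in range(len(x0)):
        t, s = design(np.delete(x0, i), x1)
        wrong += classify(x0[i:i + 1], t, s)[0]
    for i in range(len(x1)):
        t, s = design(x0, np.delete(x1, i))
        wrong += 1 - classify(x1[i:i + 1], t, s)[0]
    return wrong / (len(x0) + len(x1))

def spearman(a, b):
    """Rank correlation; ties are broken by position (a simplification)."""
    ra = np.argsort(np.argsort(a, kind="stable"), kind="stable")
    rb = np.argsort(np.argsort(b, kind="stable"), kind="stable")
    return float(np.corrcoef(ra, rb)[0, 1])

def compare_rankings(deltas, n_per_class=10, seed=0):
    """Return (agreement of resubstitution, agreement of LOO) with truth."""
    rng = np.random.default_rng(seed)
    true, rs, cv = [], [], []
    for d in deltas:
        x0 = rng.normal(-d, 1.0, n_per_class)
        x1 = rng.normal(+d, 1.0, n_per_class)
        t, s = design(x0, x1)
        true.append(true_error(t, s, d))
        rs.append(resub(x0, x1))
        cv.append(loo(x0, x1))
    return spearman(true, rs), spearman(true, cv)

rho_resub, rho_loo = compare_rankings(np.linspace(0.1, 1.5, 20))
print(rho_resub, rho_loo)
```

A single run like this says little by itself; a study of ranking accuracy would average such agreement scores over many simulated samples and model settings. The sketch only shows why a model-based setup makes the comparison possible: the true error of each designed classifier can be computed exactly rather than estimated.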


Year:  2004        PMID: 14734317     DOI: 10.1093/bioinformatics/btg399

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

