Literature DB >> 26674595

The feature selection bias problem in relation to high-dimensional gene data.

Jerzy Krawczuk, Tomasz Łukaszuk.

Abstract

OBJECTIVE: Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem under consideration. In this paper, we consider feature selection for the classification of gene datasets. Gene data typically comprise just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data, but hard to find one that generalizes to new data as well as it fits the learning data. This overfitting issue is well known in classification and regression, but it also applies to feature selection. METHODS AND MATERIALS: We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia, and breast cancer. All the datasets are characterized by a large number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance (mRMR), support vector machine-recursive feature elimination (SVM-RFE), and relaxed linear separability.
RESULTS: Our main result reveals a positive feature selection bias in all 28 experiments (7 datasets × 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranged from 2.6% to as much as 41.67%. The validation (biased) accuracy was computed on the same dataset on which feature selection was performed; the test accuracy was computed on data that was not used for feature selection (so-called external cross-validation).
CONCLUSIONS: This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias.
Copyright © 2015 Elsevier B.V. All rights reserved.
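The selection bias the abstract measures can be reproduced on synthetic data. The sketch below is not the paper's code: it uses a hypothetical setup with pure-noise "gene" data (so true accuracy is ~50%), a crude mean-difference filter standing in for the feature selection methods, and a nearest-centroid classifier. Selecting features on the whole dataset before cross-validation (the biased protocol) inflates accuracy well above chance, while selecting features inside each training fold (external cross-validation) does not.

```python
import random

random.seed(0)

n_samples, n_features, k = 40, 2000, 10

# Synthetic microarray-like data: labels and features are independent noise,
# so any classifier's true accuracy is about 50%.
y = [i % 2 for i in range(n_samples)]
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]

def top_k_features(X_rows, y_rows, k):
    """Rank features by absolute difference of class means (a crude filter)."""
    n0 = sum(1 for v in y_rows if v == 0)
    n1 = len(y_rows) - n0
    scores = []
    for j in range(len(X_rows[0])):
        m0 = sum(X_rows[i][j] for i in range(len(y_rows)) if y_rows[i] == 0) / n0
        m1 = sum(X_rows[i][j] for i in range(len(y_rows)) if y_rows[i] == 1) / n1
        scores.append((abs(m0 - m1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nearest_centroid_acc(train_idx, test_idx, feats):
    """Fit class centroids on the training fold; return accuracy on the test fold."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in train_idx if y[i] == c]
        cents[c] = [sum(X[i][j] for i in rows) / len(rows) for j in feats]
    correct = 0
    for i in test_idx:
        d = {c: sum((X[i][feats[p]] - cents[c][p]) ** 2 for p in range(k))
             for c in (0, 1)}
        correct += (min(d, key=d.get) == y[i])
    return correct / len(test_idx)

def cross_val(select_inside_fold):
    """5-fold CV; feature selection either inside each fold or on all data."""
    folds = [list(range(f, n_samples, 5)) for f in range(5)]
    accs = []
    for test_idx in folds:
        train_idx = [i for i in range(n_samples) if i not in test_idx]
        if select_inside_fold:   # external CV: selection sees only the training fold
            feats = top_k_features([X[i] for i in train_idx],
                                   [y[i] for i in train_idx], k)
        else:                    # biased protocol: selection already saw the test data
            feats = top_k_features(X, y, k)
        accs.append(nearest_centroid_acc(train_idx, test_idx, feats))
    return sum(accs) / len(accs)

biased = cross_val(select_inside_fold=False)
unbiased = cross_val(select_inside_fold=True)
print(f"biased (selection on all data): {biased:.2f}")
print(f"unbiased (external CV):         {unbiased:.2f}")
```

The gap between the two printed accuracies is the selection bias on pure noise; on real gene data the paper reports gaps between 2.6% and 41.67%.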

Keywords:  Convex and piecewise linear classifier; Feature selection bias; Gene selection; Microarray data; Support vector machine

Year:  2015        PMID: 26674595     DOI: 10.1016/j.artmed.2015.11.001

Source DB:  PubMed          Journal:  Artif Intell Med        ISSN: 0933-3657            Impact factor:   5.326


  9 in total

1.  DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies.

Authors:  Bettina Mieth; Alexandre Rozier; Juan Antonio Rodriguez; Marina M C Höhne; Nico Görnitz; Klaus-Robert Müller
Journal:  NAR Genom Bioinform       Date:  2021-07-20

2.  Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.

Authors:  Trang T Le; W Kyle Simmons; Masaya Misaki; Jerzy Bodurka; Bill C White; Jonathan Savitz; Brett A McKinney
Journal:  Bioinformatics       Date:  2017-09-15       Impact factor: 6.937

3.  Interpretable Machine Learning Reveals Dissimilarities Between Subtypes of Autism Spectrum Disorder.

Authors:  Mateusz Garbulowski; Karolina Smolinska; Klev Diamanti; Gang Pan; Khurram Maqbool; Lars Feuk; Jan Komorowski
Journal:  Front Genet       Date:  2021-02-25       Impact factor: 4.599

4.  Deep Learning Recurrent Neural Network for Concussion Classification in Adolescents Using Raw Electroencephalography Signals: Toward a Minimal Number of Sensors.

Authors:  Karun Thanjavur; Dionissios T Hristopulos; Arif Babul; Kwang Moo Yi; Naznin Virji-Babul
Journal:  Front Hum Neurosci       Date:  2021-11-24       Impact factor: 3.169

5.  Combing machine learning and elemental profiling for geographical authentication of Chinese Geographical Indication (GI) rice.

Authors:  Fei Xu; Fanzhou Kong; Hong Peng; Shuofei Dong; Weiyu Gao; Guangtao Zhang
Journal:  NPJ Sci Food       Date:  2021-07-08

6.  Prediction of venous thromboembolism with machine learning techniques in young-middle-aged inpatients.

Authors:  Hua Liu; Hua Yuan; Yongmei Wang; Weiwei Huang; Hui Xue; Xiuying Zhang
Journal:  Sci Rep       Date:  2021-06-18       Impact factor: 4.379

7.  Variable selection and validation in multivariate modelling.

Authors:  Lin Shi; Johan A Westerhuis; Johan Rosén; Rikard Landberg; Carl Brunius
Journal:  Bioinformatics       Date:  2019-03-15       Impact factor: 6.937

8.  Early isolated V-lesion may not truly represent rejection of the kidney allograft.

Authors:  Mariana Wohlfahrtova; Petra Hruba; Jiri Klema; Marek Novotny; Zdenek Krejcik; Viktor Stranecky; Eva Honsova; Petra Vichova; Ondrej Viklicky
Journal:  Clin Sci (Lond)       Date:  2018-10-29       Impact factor: 6.124

9.  Recurrent neural network-based acute concussion classifier using raw resting state EEG data.

Authors:  Arif Babul; Brandon Foran; Maya Bielecki; Adam Gilchrist; Dionissios T Hristopulos; Leyla R Brucar; Naznin Virji-Babul; Karun Thanjavur
Journal:  Sci Rep       Date:  2021-06-11       Impact factor: 4.379

