
Benchmarking relief-based feature selection methods for bioinformatics data mining.

Ryan J Urbanowicz, Randal S Olson, Peter Schmitt, Melissa Meeker, Jason H Moore.

Abstract

Modern biomedical data mining requires feature selection methods that (1) scale to large feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) adapt flexibly to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data), and (5) remain computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We conduct a comprehensive genetic simulation study that compares existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
Copyright © 2018 Elsevier Inc. All rights reserved.
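
To make the idea behind Relief-style scoring concrete, the following is a minimal sketch of a basic two-class Relief scorer for discrete features. It is not the ReBATE implementation and it is not MultiSURF or MultiSURF*. The function name `relief_scores`, the use of every instance as a target, and the averaging over tied nearest neighbors are assumptions made here so that the toy example is deterministic. The toy data encode a pure 2-way interaction (class = feature 0 XOR feature 1) plus two irrelevant features.

```python
from itertools import product


def relief_scores(X, y):
    """Basic two-class Relief on discrete features (illustrative sketch).

    Assumptions (not taken from the paper): every instance is used once as
    the target, distance is Hamming distance, and ties among nearest
    neighbors are resolved by averaging over all tied neighbors.
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p

    def hamming(a, b):
        # Number of features on which two instances disagree.
        return sum(u != v for u, v in zip(a, b))

    def mean_diff(target, neighbors):
        # Per-feature fraction of neighbors whose value differs from target.
        return [
            sum(nb[f] != target[f] for nb in neighbors) / len(neighbors)
            for f in range(p)
        ]

    for i in range(n):
        # Split all other instances into same-class hits and other-class misses.
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]

        nearest = []
        for group in (hits, misses):
            d_min = min(hamming(X[i], g) for g in group)
            nearest.append([g for g in group if hamming(X[i], g) == d_min])

        hit_diff = mean_diff(X[i], nearest[0])
        miss_diff = mean_diff(X[i], nearest[1])

        # Reward features that differ on the nearest miss,
        # penalize features that differ on the nearest hit.
        for f in range(p):
            weights[f] += (miss_diff[f] - hit_diff[f]) / n
    return weights


# Toy data: all 16 combinations of 4 binary features.
# Features 0 and 1 interact (XOR) to define the class; 2 and 3 are irrelevant.
X = [list(bits) for bits in product([0, 1], repeat=4)]
y = [row[0] ^ row[1] for row in X]

scores = relief_scores(X, y)
print(scores)
```

In this toy problem neither interacting feature shows any association with the class on its own, yet the neighbor-based scores place both above the two irrelevant features. This is the kind of interaction sensitivity the abstract attributes to RBAs.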

Keywords:  Classification; Epistasis; Feature selection; Genetic heterogeneity; Regression; ReliefF

Year: 2018; PMID: 30030120; PMCID: PMC6299838; DOI: 10.1016/j.jbi.2018.07.015

Source DB: PubMed; Journal: J Biomed Inform; ISSN: 1532-0464; Impact factor: 6.317


