Classification with correlated features: unreliability of feature ranking and solutions.

Laura Tolosi, Thomas Lengauer.

Abstract

MOTIVATION: Classification and feature selection of genomics or transcriptomics data are often hampered by the large number of features relative to the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased: the weights of features belonging to groups of correlated features decrease as the group size increases, which leads to incorrect model interpretation and misleading feature ranking.
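The bias described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experiment: it swaps in closed-form ridge least squares (rather than the Lasso logistic regression, fused SVM, group Lasso or random forest models studied in the article) so that it needs only NumPy. The function names `ridge_weights` and `simulate`, and all parameter values, are assumptions chosen for illustration. One latent signal is observed through a group of noisy, mutually correlated features, and a second, independent feature is equally relevant to the response.

```python
import numpy as np


def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression weights: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def simulate(group_size, n=2000, noise=0.5, seed=0):
    """Fit ridge on one correlated group plus one independent feature.

    Returns (mean |weight| inside the group, |weight| of the independent feature).
    Hypothetical setup for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)  # latent signal shared by the correlated group
    u = rng.standard_normal(n)  # independent feature, equally relevant to y
    # Each group member is a noisy copy of z, so members are mutually correlated.
    group = z[:, None] + noise * rng.standard_normal((n, group_size))
    X = np.column_stack([group, u])
    y = z + u + 0.5 * rng.standard_normal(n)
    w = ridge_weights(X, y)
    return float(np.abs(w[:group_size]).mean()), float(abs(w[-1]))


for k in (1, 2, 5, 10):
    in_group, independent = simulate(k)
    print(f"group size {k:2d}: mean |w| in group = {in_group:.3f}, "
          f"|w| independent = {independent:.3f}")
```

In this setup the per-feature weight inside the group shrinks as the group grows, while the weight of the independent feature stays roughly constant, so a ranking by weight would place the group members below the independent feature even though both underlying signals contribute equally.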
RESULTS: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. We also show in simulation that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype.
AVAILABILITY: R code can be found at: http://www.mpi-inf.mpg.de/~laura/Clustering.r.
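The general idea of correcting the bias through feature clustering can be sketched as follows. This is only an illustration under stated assumptions and is not the authors' R implementation: the greedy correlation-threshold clustering (`correlation_clusters`), the cluster averaging (`cluster_averages`), the threshold of 0.6 and the ridge least-squares model are all stand-ins chosen so the example runs with NumPy alone. Correlated features are grouped, each group is replaced by its average, and the model is fitted on the group averages so that every group receives a single weight.

```python
import numpy as np


def correlation_clusters(X, threshold=0.6):
    """Greedy stand-in clustering: each unassigned feature seeds a cluster and
    absorbs every unassigned feature whose |correlation| with it >= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        members = [j for j in unassigned if corr[seed, j] >= threshold]
        clusters.append(members)
        unassigned = [j for j in unassigned if j not in members]
    return clusters


def cluster_averages(X, clusters):
    """Replace each cluster of columns by its row-wise mean."""
    return np.column_stack([X[:, c].mean(axis=1) for c in clusters])


def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression weights: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Simulated data (hypothetical): ten correlated features carrying one signal,
# plus one independent feature that is equally relevant to the response.
rng = np.random.default_rng(0)
n, k = 2000, 10
z = rng.standard_normal(n)
u = rng.standard_normal(n)
group = z[:, None] + 0.5 * rng.standard_normal((n, k))
X = np.column_stack([group, u])
y = z + u + 0.5 * rng.standard_normal(n)

clusters = correlation_clusters(X)
w = ridge_weights(cluster_averages(X, clusters), y)
for members, weight in zip(clusters, w):
    print(f"cluster of {len(members):2d} feature(s): weight = {weight:.3f}")
```

Because each cluster is represented by one averaged feature, the weight assigned to the correlated group no longer depends on how many members it has, and it becomes comparable to the weight of the independent feature.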

Year:  2011        PMID: 21576180     DOI: 10.1093/bioinformatics/btr300

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  68 in total

1.  Neuroimage-Based Consciousness Evaluation of Patients with Secondary Doubtful Hydrocephalus Before and After Lumbar Drainage.

Authors:  Jiayu Huo; Zengxin Qi; Sen Chen; Qian Wang; Xuehai Wu; Di Zang; Tanikawa Hiromi; Jiaxing Tan; Lichi Zhang; Weijun Tang; Dinggang Shen
Journal:  Neurosci Bull       Date:  2020-07-01       Impact factor: 5.203

2.  Identification of Mood-Relevant Brain Connections Using a Continuous, Subject-Driven Rumination Paradigm.

Authors:  Anna-Clare Milazzo; Bernard Ng; Heidi Jiang; William Shirer; Gael Varoquaux; Jean Baptiste Poline; Bertrand Thirion; Michael D Greicius
Journal:  Cereb Cortex       Date:  2014-10-19       Impact factor: 5.357

3.  Radiomics in nuclear medicine: robustness, reproducibility, standardization, and how to avoid data analysis traps and replication crisis.

Authors:  Alex Zwanenburg
Journal:  Eur J Nucl Med Mol Imaging       Date:  2019-06-25       Impact factor: 9.236

4.  Cancer Progression Prediction Using Gene Interaction Regularized Elastic Net.

Authors: 
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2015-12-23       Impact factor: 3.710

5.  Fast, Accurate, and Stable Feature Selection Using Neural Networks.

Authors:  James Deraeve; William H Alexander
Journal:  Neuroinformatics       Date:  2018-04

6.  A method of gene expression data transfer from cell lines to cancer patients for machine-learning prediction of drug efficiency.

Authors:  Nicolas Borisov; Victor Tkachev; Maria Suntsova; Olga Kovalchuk; Alex Zhavoronkov; Ilya Muchnik; Anton Buzdin
Journal:  Cell Cycle       Date:  2018-01-17       Impact factor: 4.534

7.  Sequential feature selection and inference using multi-variate random forests.

Authors:  Joshua Mayer; Raziur Rahman; Souparno Ghosh; Ranadip Pal
Journal:  Bioinformatics       Date:  2018-04-15       Impact factor: 6.937

8.  CT radiomics to predict high-risk intraductal papillary mucinous neoplasms of the pancreas.

Authors:  Jayasree Chakraborty; Abhishek Midya; Lior Gazit; Marc Attiyeh; Liana Langdon-Embry; Peter J Allen; Richard K G Do; Amber L Simpson
Journal:  Med Phys       Date:  2018-09-27       Impact factor: 4.071

9.  Development of Multivariable Models to Predict and Benchmark Transfusion in Elective Surgery Supporting Patient Blood Management.

Authors:  Dieter Hayn; Karl Kreiner; Hubert Ebner; Peter Kastner; Nada Breznik; Angelika Rzepka; Axel Hofmann; Hans Gombotz; Günter Schreier
Journal:  Appl Clin Inform       Date:  2017-06-14       Impact factor: 2.342

10.  Prioritization of retinal disease genes: an integrative approach.

Authors:  Alex H Wagner; Kyle R Taylor; Adam P DeLuca; Thomas L Casavant; Robert F Mullins; Edwin M Stone; Todd E Scheetz; Terry A Braun
Journal:  Hum Mutat       Date:  2013-04-12       Impact factor: 4.878
