
Predictor correlation impacts machine learning algorithms: implications for genomic studies.

Kristin K Nicodemus, James D Malley.

Abstract

MOTIVATION: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association, microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. There are important results from mathematical theory and numerous practical results documenting their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing for small-importance variables to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic-related data have not been well studied; in particular, correlations among predictors pose a problem.
RESULTS: Using extensive simulation, we showed that considering correlation within predictors is crucial in making valid inferences using variable importance measures (VIMs) from three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case-control illustration, we showed that the RF VIMs, even permutation-based ones, were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when 'causal' predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs were biased under correlation, dependent on predictor correlation strength/number and over-trained to random fluctuations in the data when tree terminal node size was small. Permutation-based VIM distributions were less variable for correlated predictors and were unbiased, and thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
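
ILLUSTRATION (not from the paper): The sketch below is a toy example of why predictor correlation complicates the reading of a Gini-based split criterion. It is not the authors' simulation design and does not build a random forest. It computes the Gini impurity decrease of a single split on each of three binary predictors. The names `causal`, `proxy` and `noise`, the copy probability of 0.9 and the case probabilities of 0.35/0.65 are all assumptions chosen for illustration. Only `causal` influences case status, yet `proxy`, which is correlated with it, also receives a clearly non-zero decrease.

```python
import random


def gini(labels):
    """Gini impurity 2p(1-p) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(x, y):
    """Impurity decrease from one split of labels y on a binary predictor x."""
    left = [yi for xi, yi in zip(x, y) if xi == 0]
    right = [yi for xi, yi in zip(x, y) if xi == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted


def simulate(n=5000, copy_prob=0.9, seed=0):
    """Simulate a toy case-control sample and score each predictor.

    All settings here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    causal, proxy, noise, status = [], [], [], []
    for _ in range(n):
        c = int(rng.random() < 0.5)
        # The proxy copies the causal predictor with probability copy_prob;
        # otherwise it is an independent coin flip.
        p = c if rng.random() < copy_prob else int(rng.random() < 0.5)
        # The noise predictor is independent of everything else.
        z = int(rng.random() < 0.5)
        # Case status depends on the causal predictor only.
        y = int(rng.random() < (0.65 if c else 0.35))
        causal.append(c)
        proxy.append(p)
        noise.append(z)
        status.append(y)
    return {
        "causal": gini_decrease(causal, status),
        "proxy": gini_decrease(proxy, status),
        "noise": gini_decrease(noise, status),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:>6}: Gini decrease = {value:.4f}")
```

Under these assumed settings the non-causal `proxy` inherits most of the marginal association of `causal`, while `noise` stays near zero. A single-split score of this kind cannot separate a causal predictor from its correlated neighbours, which is one reason correlation structure needs to be considered before interpreting VIMs.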

Year:  2009        PMID: 19460890     DOI: 10.1093/bioinformatics/btp331

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  35 in total

1.  Biological validation of increased schizophrenia risk with NRG1, ERBB4, and AKT1 epistasis via functional neuroimaging in healthy controls.

Authors:  Kristin K Nicodemus; Amanda J Law; Eugenia Radulescu; Augustin Luna; Bhaskar Kolachana; Radhakrishna Vakkalanka; Dan Rujescu; Ina Giegling; Richard E Straub; Kate McGee; Bert Gold; Michael Dean; Pierandrea Muglia; Joseph H Callicott; Hao-Yang Tan; Daniel R Weinberger
Journal:  Arch Gen Psychiatry       Date:  2010-10

2.  TRM: a powerful two-stage machine learning approach for identifying SNP-SNP interactions.

Authors:  Hui-Yi Lin; Y Ann Chen; Ya-Yu Tsai; Xiaotao Qu; Tung-Sung Tseng; Jong Y Park
Journal:  Ann Hum Genet       Date:  2011-12-11       Impact factor: 1.670

3.  On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data.

Authors:  Daniel F Schwarz; Inke R König; Andreas Ziegler
Journal:  Bioinformatics       Date:  2010-05-26       Impact factor: 6.937

4.  High-throughput measurement, correlation analysis, and machine-learning predictions for pH and thermal stabilities of Pfizer-generated antibodies.

Authors:  Amy C King; Matthew Woods; Wei Liu; Zhijian Lu; Davinder Gill; Mark R H Krebs
Journal:  Protein Sci       Date:  2011-07-13       Impact factor: 6.725

5.  Correction for population stratification in random forest analysis.

Authors:  Yang Zhao; Feng Chen; Rihong Zhai; Xihong Lin; Zhaoxi Wang; Li Su; David C Christiani
Journal:  Int J Epidemiol       Date:  2012-11-12       Impact factor: 7.196

6.  An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.

Authors:  Raymond Walters; Charles Laurin; Gitta H Lubke
Journal:  Bioinformatics       Date:  2012-07-30       Impact factor: 6.937

7.  Suitability of Sludge Biotic Index (SBI), Sludge Index (SI) and filamentous bacteria analysis for assessing activated sludge process performance: the case of piggery slaughterhouse wastewater.

Authors:  Roberta Pedrazzani; Laura Menoni; Stefano Nembrini; Livia Manili; Giorgio Bertanza
Journal:  J Ind Microbiol Biotechnol       Date:  2016-04-12       Impact factor: 3.346

8.  Systems biology data analysis methodology in pharmacogenomics. (Review)

Authors:  Andrei S Rodin; Grigoriy Gogoshin; Eric Boerwinkle
Journal:  Pharmacogenomics       Date:  2011-09       Impact factor: 2.533

9.  Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging.

Authors:  Kristin K Nicodemus; Joseph H Callicott; Rachel G Higier; Augustin Luna; Devon C Nixon; Barbara K Lipska; Radhakrishna Vakkalanka; Ina Giegling; Dan Rujescu; David St Clair; Pierandrea Muglia; Yin Yao Shugart; Daniel R Weinberger
Journal:  Hum Genet       Date:  2010-04       Impact factor: 4.132

10.  The behaviour of random forest permutation-based variable importance measures under predictor correlation.

Authors:  Kristin K Nicodemus; James D Malley; Carolin Strobl; Andreas Ziegler
Journal:  BMC Bioinformatics       Date:  2010-02-27       Impact factor: 3.169
