Benjamin B Chu (1), Kevin L Keys (2,3), Christopher A German (4), Hua Zhou (4), Jin J Zhou (5), Eric M Sobel (1,6), Janet S Sinsheimer (1,4,6), Kenneth Lange (1,6).
1. Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA.
2. Department of Medicine, University of California, San Francisco, 1701 Divisadero St, San Francisco, CA, 94115, USA.
3. Berkeley Institute of Data Science, University of California, Berkeley, 190 Doe Library, Berkeley, CA, 94720, USA.
4. Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA.
5. Division of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ, 85724, USA.
6. Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095, USA.
Abstract
BACKGROUND: Consecutive testing of single nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression.

RESULTS: We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models, prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and exhibits a 2-3 orders of magnitude decrease in false-positive rates compared with lasso regression. We also test IHT on the UK Biobank hypertension phenotypes and the Northern Finland Birth Cohort of 1966 cardiovascular phenotypes. We find that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identified by previous studies.

CONCLUSIONS: Our real data analysis and simulation studies suggest that IHT can (i) recover highly correlated predictors, (ii) avoid over-fitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, (v) exploit prior information and group-sparsity, and (vi) be used with biobank-sized datasets. Although these advances are studied for genome-wide association studies inference, our extensions are pertinent to other regression problems with large numbers of predictors.
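For readers unfamiliar with iterative hard thresholding, the basic idea for ordinary least squares can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `hard_threshold` and `iht_least_squares` are hypothetical, the fixed step size (the reciprocal of the squared spectral norm of X) is an assumption made for simplicity, and the extensions named in the abstract (generalized linear models, prior weights, group sparsity) are not shown.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argpartition(np.abs(v), -k)[-k:]
        out[keep] = v[keep]
    return out


def iht_least_squares(X, y, k, n_iter=200):
    """Projected gradient descent on 0.5 * ||y - X beta||^2 with at most k nonzeros.

    Assumption: a fixed step of 1 / ||X||_2^2, chosen for simplicity only.
    """
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # reciprocal of the largest eigenvalue of X'X
    beta = np.zeros(p)
    for _ in range(n_iter):
        neg_grad = X.T @ (y - X @ beta)  # negative gradient of the least-squares loss
        beta = hard_threshold(beta + step * neg_grad, k)  # gradient step, then project
    return beta
```

Each iteration takes a gradient step using all predictors at once and then projects onto the set of k-sparse vectors, which is what distinguishes this approach from SNP-by-SNP testing. In practice k would be chosen by a procedure such as cross-validation.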