Anastasia Gurinovich (1), Harold Bae (2), John J Farrell (3), Stacy L Andersen (3), Stefano Monti (3), Annibale Puca (4, 5), Gil Atzmon (6), Nir Barzilai (6), Thomas T Perls (3), Paola Sebastiani (7).
1. Bioinformatics Program, Boston University, Boston, MA, USA.
2. College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA.
3. Department of Medicine, Boston University School of Medicine, Boston, MA, USA.
4. Department of Medicine and Surgery, University of Salerno, Fisciano, Italy.
5. Cardiovascular Research Unit, IRCCS MultiMedica, Sesto San Giovanni, Italy.
6. Department of Medicine and Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA.
7. Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
Abstract
MOTIVATION: Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery.

RESULTS: In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects' ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype.

AVAILABILITY AND IMPLEMENTATION: PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
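The motivating problem, that a pooled analysis can hide population-specific effects, can be illustrated with a small worked example. The sketch below is not part of PopCluster and does not reproduce its method. It only computes allelic odds ratios from hypothetical allele counts (invented for illustration, not taken from the paper) for two subpopulations in which the variant has opposite effects, and shows that the pooled odds ratio can sit at 1 while each subpopulation shows a clear association. The function name `allelic_odds_ratio` and the variables `pop_a` and `pop_b` are assumptions made here.

```python
def allelic_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the alternate allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


# Hypothetical allele counts as (alternate, reference); not data from the paper.
pop_a = {"case": (300, 700), "ctrl": (200, 800)}  # allele enriched in cases
pop_b = {"case": (200, 800), "ctrl": (300, 700)}  # allele depleted in cases

# Per-population odds ratios.
or_a = allelic_odds_ratio(*pop_a["case"], *pop_a["ctrl"])
or_b = allelic_odds_ratio(*pop_b["case"], *pop_b["ctrl"])

# Pooled analysis: add the counts of both populations, then compute one odds ratio.
pooled_case = tuple(a + b for a, b in zip(pop_a["case"], pop_b["case"]))
pooled_ctrl = tuple(a + b for a, b in zip(pop_a["ctrl"], pop_b["ctrl"]))
or_pooled = allelic_odds_ratio(*pooled_case, *pooled_ctrl)

print(f"population A: OR = {or_a:.3f}")
print(f"population B: OR = {or_b:.3f}")
print(f"pooled:       OR = {or_pooled:.3f}")
```

With these made-up counts the two subpopulations have odds ratios on opposite sides of 1, and the opposing effects cancel in the pooled table. Real data would rarely cancel this neatly, but the same masking can occur in weaker form.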