Daniel Schlauch (1,2), Heide Fier (1,3), Christoph Lange (1,4)
1. Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA 02115, USA.
2. Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA.
3. Department of Genomic Mathematics, University of Bonn, Bonn, Germany.
4. Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA.
Abstract
MOTIVATION: To minimize the effects of genetic confounding on the analysis of high-throughput genetic association studies, such as whole-genome sequencing (WGS) studies and genome-wide association studies (GWAS), we propose a general framework to assess and formally test for genetic heterogeneity among study subjects. Because the approach fully utilizes the information about recent ancestry that is captured by rare variants, it is especially powerful in WGS studies. Even for relatively moderate sample sizes, the proposed testing framework can identify study subjects that are genetically too similar (e.g. cryptic relationships) or genetically too different (e.g. population substructure). The approach is straightforward to implement and computationally fast, which enables its application to whole-genome sequencing data.

RESULTS: Simulation studies illustrate the overall performance of our approach. In an application to the 1000 Genomes Project, we outline an analysis/cleaning pipeline that uses our approach to formally assess whether study subjects are related and whether population substructure is present. In the analysis of the 1000 Genomes Project data, our approach revealed subjects that are most likely related but had previously passed standard QC filters.

AVAILABILITY AND IMPLEMENTATION: An implementation of our method, Similarity Test for Estimating Genetic Outliers (STEGO), is available in the R package stego from GitHub (https://github.com/dschlauch/stego).

CONTACT: dschlauch@fas.harvard.edu.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
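
The central idea in the abstract, that rare variants carry information about recent ancestry, can be illustrated with a small sketch. The abstract does not give the STEGO statistic itself, so the code below is not the authors' method: the function name `rare_variant_similarity` and the weighting scheme are assumptions made for illustration. Each shared allele is weighted by the inverse of the probability that two distinct haplotypes drawn at random both carry it, so sharing a rare allele counts for more than sharing a common one, and the average score over all pairs is 1 by construction. Pairs scoring well above 1 would be candidates for relatedness; see the stego package for the actual test.

```python
import numpy as np

def rare_variant_similarity(haplotypes):
    """Pairwise allele-sharing score that up-weights rare variants.

    haplotypes: (n_haplotypes, n_variants) array of 0/1 allele indicators.
    Returns an (n, n) matrix with NaN on the diagonal.
    Illustrative weighting only (an assumption, not the published statistic).
    """
    h = np.asarray(haplotypes, dtype=float)
    n = h.shape[0]
    counts = h.sum(axis=0)

    # A variant carried by fewer than two haplotypes cannot be shared.
    keep = counts >= 2
    if not keep.any():
        raise ValueError("no variant is carried by at least two haplotypes")
    h, counts = h[:, keep], counts[keep]

    # Inverse probability that two distinct random haplotypes both carry
    # the allele: C(n, 2) / C(count, 2). Rare alleles get large weights.
    weights = (n * (n - 1)) / (counts * (counts - 1))

    # Weighted count of shared alleles, averaged over the retained variants.
    scores = (h * weights) @ h.T / h.shape[1]
    np.fill_diagonal(scores, np.nan)
    return scores

# Toy data: four haplotypes (rows) by four variants (columns).
# Variant 1 is carried by everyone; the others are each shared by one pair.
H = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
])
S = rare_variant_similarity(H)
print(S)
```

In the toy data, pairs that share one of the rare alleles score 1.75, while pairs that share only the common allele score 0.25, which mirrors the intuition that rare-variant sharing points to a recent common ancestor.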