Pitfalls of merging GWAS data: lessons learned in the eMERGE network and quality control procedures to maintain high data quality.

Rebecca L Zuvich, Loren L Armstrong, Suzette J Bielinski, Yuki Bradford, Christopher S Carlson, Dana C Crawford, Andrew T Crenshaw, Mariza de Andrade, Kimberly F Doheny, Jonathan L Haines, M Geoffrey Hayes, Gail P Jarvik, Lan Jiang, Iftikhar J Kullo, Rongling Li, Hua Ling, Teri A Manolio, Martha E Matsumoto, Catherine A McCarty, Andrew N McDavid, Daniel B Mirel, Lana M Olson, Justin E Paschall, Elizabeth W Pugh, Luke V Rasmussen, Laura J Rasmussen-Torvik, Stephen D Turner, Russell A Wilke, Marylyn D Ritchie.

Abstract

Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium composed of five sites that perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms, which together comprise a core set of 14 phenotypes, for extracting study samples from each site's DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and of various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and will also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information in the database of Genotypes and Phenotypes. Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
© 2011 Wiley Periodicals, Inc.
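
The abstract names marker quality and batch effects as QC targets for merged datasets but gives no procedure. The following is a minimal illustrative sketch, not the eMERGE pipeline itself, of two checks that could be run per marker after merging: overall call rate, and the gap in allele frequency between genotyping centers. The function names, the data layout, and both thresholds (`min_call_rate`, `max_freq_gap`) are assumptions made for this example.

```python
from collections import defaultdict


def call_rate(genotypes):
    """Fraction of non-missing calls; None marks a missing genotype."""
    if not genotypes:
        return 0.0
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def allele_freq(genotypes):
    """Frequency of the counted allele; calls are 0/1/2 allele counts."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def flag_markers(data, min_call_rate=0.98, max_freq_gap=0.10):
    """Flag markers in a merged dataset.

    data maps marker -> {center: [genotype calls]}.
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    flags = defaultdict(list)
    for marker, by_center in data.items():
        # Marker quality: call rate over all centers combined.
        merged = [g for calls in by_center.values() for g in calls]
        if call_rate(merged) < min_call_rate:
            flags[marker].append("low_call_rate")
        # Possible batch effect: allele frequency differs between centers.
        freqs = [allele_freq(calls) for calls in by_center.values()]
        freqs = [f for f in freqs if f is not None]
        if len(freqs) > 1 and max(freqs) - min(freqs) > max_freq_gap:
            flags[marker].append("center_freq_gap")
    return dict(flags)


example = {
    "rs1": {"centerA": [0, 1, 2, 0], "centerB": [0, 1, 2, 0]},
    "rs2": {"centerA": [0, 0, 0, 0], "centerB": [2, 2, 2, 2]},
    "rs3": {"centerA": [0, None, 1, 0], "centerB": [0, 0, 1, 0]},
}
flagged = flag_markers(example)
```

A frequency gap between centers is only a prompt for closer inspection: it can also reflect real differences in ancestry or phenotype selection between sites, so a flagged marker is a candidate for review rather than automatic removal.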

Year:  2011        PMID: 22125226      PMCID: PMC3592376          DOI: 10.1002/gepi.20639

Source DB:  PubMed          Journal:  Genet Epidemiol        ISSN: 0741-0395            Impact factor:   2.135


