
Has GWAS lost its status as a paragon of open science?

Callie Burt, Marcus Munafò.

Abstract

Genomic research led the way in open science, a tradition continued by genome-wide association studies (GWAS) through the sharing of materials, results, and data. Coordinated quality control procedures also contributed to robust findings. However, recent years have seen declines in GWAS transparency. Here, we assess some shifts away from open science practices with the aim of stimulating a discussion of these issues.

Year:  2021        PMID: 33939687      PMCID: PMC8118511          DOI: 10.1371/journal.pbio.3001242

Source DB:  PubMed          Journal:  PLoS Biol        ISSN: 1544-9173            Impact factor:   8.029


The Human Genome Project (HGP) led the way in open science, in particular in data sharing. In 1996, HGP scientists established the “Bermuda Principles,” which specified that DNA sequence data should be released in publicly accessible databases within 24 hours of generation. The following year, data quality standards were developed: the “Bermuda Sequence-Quality Standards.” These Bermuda agreements (see https://web.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml) were key to the multinational collaborative work behind the HGP’s remarkable successes, producing a global knowledge resource that has stimulated major scientific advances. Following completion of the HGP, these open science principles were applied to other genomics projects. Scientists and funders recognized the value of data sharing, coordination, and transparency in advancing knowledge, scientific credibility, and improvements in human health. Many data sharing policies reflect the ethos of these principles, and many areas of human genomics continue to lead the way in open science.
Genome-wide association studies (GWAS) continued these open science trends. The GWAS era arose following the widespread recognition of low statistical power and questionable methodological practices in candidate gene association studies, which were plagued by low reproducibility. The need to collaborate at scale to achieve the large sample sizes required to detect small effect sizes, while correcting for multiple testing, necessitated coordinated data analysis plans and, in turn, harmonized datasets and code. These were shared within consortia, making it a small step to sharing materials, results, and data publicly (albeit typically summary results, rather than individual-level data). Another benefit of this collaborative approach (particularly when handling complex datasets) was a focus on coordinated quality control procedures.
For these reasons, human genomics in general, and GWAS in particular, are often held up as exemplars of reproducible science. Does the GWAS field still live up to these standards, or is it slipping back? GWAS is now a mainstream technique, and increasingly only one part of a study, rather than the study itself. Studies that include a GWAS now often include functional work, analysis of causal pathways, polygenic risk score analyses, and so on. But this greater breadth risks coming at a cost. Often the details of the GWAS itself are relegated to a supplement, which reviewers may scrutinize less carefully [1], and when the GWAS is no longer the sole focus, the need to recruit reviewers for these other elements comes at the expense of having multiple experts inspect the GWAS itself.
There is evidence that this has been accompanied by inconsistency in standards. We have seen imputation quality thresholds range from as high as r² > .9 to an imputation accuracy score of < .1 [2]. GWAS may employ different thresholds across cohorts and analyses within the same study. While what is acceptable will depend on the specific nature of the study, these different thresholds may have a substantial impact on results. However, because imputed SNPs that pass the threshold are not treated any differently from measured SNPs, and imputation quality scores are not included in GWAS, we have no way of knowing whether this is a problem. Different software packages and bioinformatic pipelines are employed, with assumptions that may not be articulated. Even commonly adopted minimum thresholds for what constitutes “sufficient LD” for the purposes of identifying SNP “independence” (e.g., r² < .1) vary across studies, as well as across different analyses within studies. Employing different thresholds may be warranted, but methodological decisions should be clearly documented and justified.
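As a rough illustration of why the choice of imputation quality threshold matters, the sketch below filters a small set of variants at two different cut-offs. Everything in it is an assumption made for illustration: the `Variant` class, the `retained` and `threshold_report` helpers, the variant identifiers, and the quality scores are all hypothetical and are not taken from any study or software package discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    snp_id: str    # hypothetical identifier, not a real rsID
    imputed: bool  # True if the genotype was imputed rather than measured
    info: float    # imputation quality score in [0, 1]; 1.0 for measured SNPs


def retained(variants, min_info):
    """Keep measured variants, plus imputed variants whose score exceeds min_info."""
    return [v for v in variants if not v.imputed or v.info > min_info]


def threshold_report(variants, thresholds):
    """Map each threshold to the identifiers of the variants that survive it."""
    return {t: [v.snp_id for v in retained(variants, t)] for t in thresholds}


# Made-up scores, chosen only to show how the retained set changes.
VARIANTS = [
    Variant("snp_a", False, 1.0),
    Variant("snp_b", True, 0.95),
    Variant("snp_c", True, 0.72),
    Variant("snp_d", True, 0.35),
    Variant("snp_e", True, 0.12),
    Variant("snp_f", True, 0.05),
]

REPORT = threshold_report(VARIANTS, (0.1, 0.9))

for threshold, kept in REPORT.items():
    print(f"info > {threshold}: {len(kept)} variants kept -> {kept}")
```

In this toy data the permissive cut-off keeps five of six variants and the strict one keeps two, so any downstream analysis would start from different inputs. Reporting the threshold used, and ideally the per-variant score, would let readers judge this for themselves.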
In our view, simply relying on honesty, and assuming no mistakes, is not the best way forward in modern science, where the incentives to produce noteworthy findings can be substantial. Transparency can serve a quality control function [3]. The extraordinary complexity and density of many current studies that include a GWAS mean that methodological details can be relegated to extensive supplements. If these are not scrutinised fully, this may impact on the robustness and reproducibility of GWAS results, with downstream effects such as overinterpretation of noise (e.g., in post-GWAS analyses such as gene prediction and tissue-specific expression). Further, many bioinformatic pipelines use existing associations and functional annotations to link to new findings, so unreliable results may propagate into later work.
Alongside this increase in complexity, there has also been a shift away from open science practices. Efforts to achieve ever-greater sample sizes, coupled with the finite number of high-quality large cohorts with genetic information available, have encouraged researchers to increasingly partner with private companies that can offer large amounts of data. These companies have a direct interest in using GWAS results for profit and thus have a motivation to contribute data. But the results are commercially sensitive. One consequence is that these private-public research partnerships proceed largely on the terms of the private companies. These terms commonly include no access to individual data (analyzed with “in house pipelines”), no sharing of data, and sharing of only partial results. Many recently published GWAS using 23andMe data include partial results, no code, and no data [e.g., 4,5]. This is despite the fact that most of these studies are meta-analyses, and the data consist of summary statistics, rather than the primary, individual-level data, and therefore do not include sensitive, individually identifiable information.
Furthermore, such closed data practices often contravene explicit journal and funding agency data sharing policies. The result is that researchers’ ability to replicate and build on these studies is limited. Commercial datasets are also often highly unrepresentative. The problem of lack of representativeness is not unique to commercial datasets: for example, UK Biobank achieved only an approximately 5% recruitment rate, with evidence of “healthy volunteer” selection bias into that study [6]. But selection into commercial datasets can be particularly pronounced. The widely used 23andMe data is composed primarily of individuals from the USA who can afford to investigate their DNA. Participants therefore tend to be European-ancestry, more highly educated, more affluent, and in better health. Furthermore, these data can make up a considerable proportion of the total sample in a GWAS, in some cases over 50% [5]. These highly selected samples may bias results [7]. This concern is especially acute for socially patterned phenotypes such as educational attainment, income, health behaviours, and mental health (which are often minimally phenotyped via brief participant self-report).
The quest for ever-larger sample sizes seems to have come at the expense of the transparency and data sharing that characterized the field in the past. Collaborations between academia and industry can be powerful, and consortium efforts have been critical to the success of GWAS efforts, but we should always ask: At what cost? The question of whether this trade-off is a net positive deserves attention. We would encourage an open discussion of the costs and benefits of these trade-offs by the research community. Despite having led the way in open science and reproducibility, GWAS has become more opaque.
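The concern raised above, that highly selected samples may bias results [7], can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not an analysis of any real dataset: the function names (`pearson`, `simulate`), the participation rule, the cut-off, and the sample size are all hypothetical. A genetic score and a trait are generated independently, but the chance of joining the sample depends on both.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n=20000, cutoff=1.0, seed=1):
    """Return (correlation in everyone, correlation among participants, participant count)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        score = rng.gauss(0, 1)  # hypothetical genetic score
        trait = rng.gauss(0, 1)  # hypothetical trait, independent of the score
        # Assumed participation rule: both variables, plus noise, raise the
        # propensity to take part; only those above the cutoff are sampled.
        propensity = score + trait + rng.gauss(0, 1)
        people.append((score, trait, propensity > cutoff))

    full_r = pearson([p[0] for p in people], [p[1] for p in people])
    selected = [p for p in people if p[2]]
    selected_r = pearson([p[0] for p in selected], [p[1] for p in selected])
    return full_r, selected_r, len(selected)


FULL_R, SELECTED_R, N_SELECTED = simulate()
print(f"correlation in full population:   {FULL_R:.3f}")
print(f"correlation among participants:   {SELECTED_R:.3f} (n = {N_SELECTED})")
```

Under these assumptions the score and trait are unrelated in the full population, yet a negative association appears among participants purely because of how they were selected. Real selection mechanisms are likely to be more complicated, so the sketch shows only the direction of the problem, not its size.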
Perhaps the method is being taken for granted, given its track record of generating reproducible findings; but reproducible science requires enforcing existing standards as well as continued review and refinement [8]. The lesson is that no methodology stands still, and as particularly complex methodologies evolve (whether GWAS, fMRI, or others), we should continue to examine how these methodologies are applied, and how robust the findings they generate are. If GWAS wants to remain a paragon of open science, it cannot be open only when convenient. Otherwise, hard-won gains in openness and reproducibility can be gradually eroded, often at a significant cost to scientific credibility.
  8 in total

1.  Publish houses of brick, not mansions of straw.

Authors:  William G Kaelin
Journal:  Nature       Date:  2017-05-23       Impact factor: 49.962

2.  Scientific rigor and the art of motorcycle maintenance.

Authors:  Marcus Munafò; Simon Noble; William J Browne; Dani Brunner; Katherine Button; Joaquim Ferreira; Peter Holmans; Douglas Langbehn; Glyn Lewis; Martin Lindquist; Kate Tilling; Eric-Jan Wagenmakers; Robi Blumenstein
Journal:  Nat Biotechnol       Date:  2014-09       Impact factor: 54.908

3.  Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions.

Authors:  David M Howard; Mark J Adams; Toni-Kim Clarke; Jonathan D Hafferty; Jude Gibson; Masoud Shirali; Jonathan R I Coleman; Saskia P Hagenaars; Joey Ward; Eleanor M Wigmore; Clara Alloza; Xueyi Shen; Miruna C Barbu; Eileen Y Xu; Heather C Whalley; Riccardo E Marioni; David J Porteous; Gail Davies; Ian J Deary; Gibran Hemani; Klaus Berger; Henning Teismann; Rajesh Rawal; Volker Arolt; Bernhard T Baune; Udo Dannlowski; Katharina Domschke; Chao Tian; David A Hinds; Maciej Trzaskowski; Enda M Byrne; Stephan Ripke; Daniel J Smith; Patrick F Sullivan; Naomi R Wray; Gerome Breen; Cathryn M Lewis; Andrew M McIntosh
Journal:  Nat Neurosci       Date:  2019-02-04       Impact factor: 28.771

4.  Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants With Those of the General Population.

Authors:  Anna Fry; Thomas J Littlejohns; Cathie Sudlow; Nicola Doherty; Ligia Adamska; Tim Sprosen; Rory Collins; Naomi E Allen
Journal:  Am J Epidemiol       Date:  2017-11-01       Impact factor: 4.897

5.  Collider scope: when selection bias can substantially influence observed associations.

Authors:  Marcus R Munafò; Kate Tilling; Amy E Taylor; David M Evans; George Davey Smith
Journal:  Int J Epidemiol       Date:  2018-02-01       Impact factor: 7.196

6.  A manifesto for reproducible science.

Authors:  Marcus R Munafò; Brian A Nosek; Dorothy V M Bishop; Katherine S Button; Christopher D Chambers; Nathalie Percie du Sert; Uri Simonsohn; Eric-Jan Wagenmakers; Jennifer J Ware; John P A Ioannidis
Journal:  Nat Hum Behav       Date:  2017-01-10

7.  Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals.

Authors:  James J Lee; Robbee Wedow; Aysu Okbay; Edward Kong; Omeed Maghzian; Meghan Zacher; Tuan Anh Nguyen-Viet; Peter Bowers; Julia Sidorenko; Richard Karlsson Linnér; Mark Alan Fontana; Tushar Kundu; Chanwook Lee; Hui Li; Ruoxi Li; Rebecca Royer; Pascal N Timshel; Raymond K Walters; Emily A Willoughby; Loïc Yengo; Maris Alver; Yanchun Bao; David W Clark; Felix R Day; Nicholas A Furlotte; Peter K Joshi; Kathryn E Kemper; Aaron Kleinman; Claudia Langenberg; Reedik Mägi; Joey W Trampush; Shefali Setia Verma; Yang Wu; Max Lam; Jing Hua Zhao; Zhili Zheng; Jason D Boardman; Harry Campbell; Jeremy Freese; Kathleen Mullan Harris; Caroline Hayward; Pamela Herd; Meena Kumari; Todd Lencz; Jian'an Luan; Anil K Malhotra; Andres Metspalu; Lili Milani; Ken K Ong; John R B Perry; David J Porteous; Marylyn D Ritchie; Melissa C Smart; Blair H Smith; Joyce Y Tung; Nicholas J Wareham; James F Wilson; Jonathan P Beauchamp; Dalton C Conley; Tõnu Esko; Steven F Lehrer; Patrik K E Magnusson; Sven Oskarsson; Tune H Pers; Matthew R Robinson; Kevin Thom; Chelsea Watson; Christopher F Chabris; Michelle N Meyer; David I Laibson; Jian Yang; Magnus Johannesson; Philipp D Koellinger; Patrick Turley; Peter M Visscher; Daniel J Benjamin; David Cesarini
Journal:  Nat Genet       Date:  2018-07-23       Impact factor: 38.330

8.  Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences.

Authors:  Richard Karlsson Linnér; Pietro Biroli; Edward Kong; S Fleur W Meddens; Robbee Wedow; Mark Alan Fontana; Maël Lebreton; Stephen P Tino; Abdel Abdellaoui; Anke R Hammerschlag; Michel G Nivard; Aysu Okbay; Cornelius A Rietveld; Pascal N Timshel; Maciej Trzaskowski; Ronald de Vlaming; Christian L Zünd; Yanchun Bao; Laura Buzdugan; Ann H Caplin; Chia-Yen Chen; Peter Eibich; Pierre Fontanillas; Juan R Gonzalez; Peter K Joshi; Ville Karhunen; Aaron Kleinman; Remy Z Levin; Christina M Lill; Gerardus A Meddens; Gerard Muntané; Sandra Sanchez-Roige; Frank J van Rooij; Erdogan Taskesen; Yang Wu; Futao Zhang; Adam Auton; Jason D Boardman; David W Clark; Andrew Conlin; Conor C Dolan; Urs Fischbacher; Patrick J F Groenen; Kathleen Mullan Harris; Gregor Hasler; Albert Hofman; Mohammad A Ikram; Sonia Jain; Robert Karlsson; Ronald C Kessler; Maarten Kooyman; James MacKillop; Minna Männikkö; Carlos Morcillo-Suarez; Matthew B McQueen; Klaus M Schmidt; Melissa C Smart; Matthias Sutter; A Roy Thurik; André G Uitterlinden; Jon White; Harriet de Wit; Jian Yang; Lars Bertram; Dorret I Boomsma; Tõnu Esko; Ernst Fehr; David A Hinds; Magnus Johannesson; Meena Kumari; David Laibson; Patrik K E Magnusson; Michelle N Meyer; Arcadi Navarro; Abraham A Palmer; Tune H Pers; Danielle Posthuma; Daniel Schunk; Murray B Stein; Rauli Svento; Henning Tiemeier; Paul R H J Timmers; Patrick Turley; Robert J Ursano; Gert G Wagner; James F Wilson; Jacob Gratten; James J Lee; David Cesarini; Daniel J Benjamin; Philipp D Koellinger; Jonathan P Beauchamp
Journal:  Nat Genet       Date:  2019-01-14       Impact factor: 38.330
