Cameron Palmer, Itsik Pe'er.
Abstract
Genome-wide association studies (GWAS) have identified hundreds of SNPs responsible for variation in human quantitative traits. However, genome-wide-significant associations often fail to replicate across independent cohorts, in apparent inconsistency with their strong effects in discovery cohorts. This limited success of replication raises pervasive questions about the utility of the GWAS field. We identify all 332 studies of quantitative traits from the NHGRI-EBI GWAS Database with attempted replication. We find that the majority of studies provide insufficient data to evaluate replication rates. The remaining papers replicate significantly worse than expected (p < 10⁻¹⁴), even when adjusting for regression to the mean of effect size between discovery and replication cohorts, termed the Winner's Curse (p < 10⁻¹⁶). We show this is due in part to misreporting replication cohort size as a maximum number rather than a per-locus one. In 39 studies accurately reporting per-locus cohort size for attempted replication of 707 loci in samples with similar ancestry, the replication rate matched expectation (predicted 458, observed 457, p = 0.94). In contrast, ancestry differences between replication and discovery cohorts (13 studies, 385 loci) cause the most highly powered decile of loci to replicate worse than expected, due to differences in linkage disequilibrium.
Year: 2017 PMID: 28715421 PMCID: PMC5536394 DOI: 10.1371/journal.pgen.1006916
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Distribution of papers across journals, for journals that had at least one article with sufficient information for analysis.
The full distribution of all journals analyzed in the study, including those with all papers excluded, is in Table A in S1 File.
| Journal | Analyzed | Excluded |
|---|---|---|
| Am J Hum Genet | 8 | 23 |
| Am J Med Genet B Neuropsychiatr Genet | 2 | 2 |
| BMC Med Genet | 3 | 3 |
| Circ Cardiovasc Genet | 4 | 12 |
| Front Genet | 1 | 0 |
| Gene | 1 | 1 |
| Genet Epidemiol | 1 | 1 |
| Hum Genet | 3 | 7 |
| Hum Mol Genet | 24 | 48 |
| J Med Genet | 1 | 5 |
| Nat Genet | 26 | 48 |
| PLoS Genet | 18 | 31 |
| PLoS One | 6 | 21 |
| Science | 2 | 2 |
| Total: | 100 | 204 |
Ancestry distribution of samples included in GWAS.
Rows are as follows: (1) “Totals”: number of samples of a given ancestry in analyzed papers, with redundancy between studies published multiple times; (2) “Rate in GWAS”: percentage of total samples considered that were of this ancestry; (3) “Rate in Population”: percentage of world’s population that is of this ancestry; (4) “Enrichment in GWAS”: over- (or under-) representation of the ancestry in GWAS relative to its share of the world population. Ancestry labels are approximations with the standard correspondences to HapMap2 reference samples (European = CEU, East Asian = JPT+CHB, African = YRI); here, “African American” denotes samples reported with that nomenclature, which typically corresponds to 80:20 admixture between ancestral sub-Saharan African and Western European genetics [11]. All of these equivalences are oversimplifications but correspond to assumptions widely used in the field. Counts are computed from totals across all papers analyzed in this study, not adjusting for duplicate uses of the same datasets across multiple studies. Total sample sizes are maximum counts of samples assuming no per-genotype missingness is present. The totals are rounded to the nearest integer, as several imputed studies reported nonintegral sample sizes. Row 3 percentages in world population are approximations based on demographic data from 2014–2015 [12, 13].
| | European | East Asian | African | African American |
|---|---|---|---|---|
| Totals | 1601628 | 135472 | 1226 | 80006 |
| Rate in GWAS (percent) | 88.08 | 7.45 | 0.07 | 4.4 |
| Rate in Population (percent) | 13.3 | 59.8 | 13.1 | 0.5 |
| Enrichment in GWAS (percent) | 670.8 | 12.5 | 0.51 | 821.6 |
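The relationship between the rows of the ancestry table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `gwas_rates` and `enrichment` are hypothetical, and the inputs are the rounded values printed in the table. Because the "Rate in Population" row is rounded, the recomputed enrichment for some ancestries (for example European) will not exactly match the published "Enrichment in GWAS" row.

```python
# Sample totals copied from the "Totals" row of the table.
totals = {
    "European": 1601628,
    "East Asian": 135472,
    "African": 1226,
    "African American": 80006,
}

# Rounded percentages copied from the "Rate in Population" row.
population_pct = {
    "European": 13.3,
    "East Asian": 59.8,
    "African": 13.1,
    "African American": 0.5,
}


def gwas_rates(totals):
    """Percentage of all GWAS samples contributed by each ancestry (row 2)."""
    grand_total = sum(totals.values())
    return {name: 100.0 * count / grand_total for name, count in totals.items()}


def enrichment(rates, population_pct):
    """GWAS rate relative to population rate, in percent (row 4); 100 means parity."""
    return {name: 100.0 * rates[name] / population_pct[name] for name in rates}


rates = gwas_rates(totals)
enrich = enrichment(rates, population_pct)
for name in totals:
    print(f"{name}: {rates[name]:.2f}% of GWAS samples, enrichment {enrich[name]:.1f}%")
```

Values above 100 in the enrichment row indicate over-representation in GWAS relative to the world population, and values below 100 indicate under-representation.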
Fig 1. Expected and observed replication rate per publication, stratified by journal.
Top panel (A): predicted versus observed replication for each paper. Each paper is flagged as being within 95% confidence of the predicted replication rate under the WC-corrected model (dots), greater (diamonds) or lower (Xs) than expectation. X-axis: predicted number of replications in a given paper, calculated as the sum across all loci of power to replicate based on WC-corrected discovery effect estimates. Y-axis: observed (jittered integer) number of replications in the paper. Colors correspond to journals. Replication is defined as a one-tailed replication p-value surpassing a per-paper Bonferroni threshold. Confidence intervals are defined as 95% confidence according to Poisson binomial draws from the WC-corrected power distribution. Bottom panel (B): distinct behaviors in journals depending on which set of papers is considered. Clusters correspond to paper quality (point shapes) from the top panel; confidence intervals are 95% confidence intervals from the binomial distribution. Red lines are expected bar heights assuming that the observed paper data correspond to the WC-corrected model.
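The per-paper comparison described in this caption can be sketched as follows. The caption describes intervals obtained from Poisson binomial draws; the sketch below swaps in an exact Poisson binomial distribution computed by dynamic programming, which gives a deterministic interval. All function names and the example power values are hypothetical and are not taken from the study.

```python
def poisson_binomial_pmf(powers):
    """Exact distribution of the number of replications among independent loci,
    where locus i replicates with probability powers[i]."""
    pmf = [1.0]
    for p in powers:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this locus fails to replicate
            nxt[k + 1] += mass * p       # this locus replicates
        pmf = nxt
    return pmf


def central_interval(pmf, level=0.95):
    """Central interval: trim at most (1 - level) / 2 probability from each tail."""
    tail = (1.0 - level) / 2.0
    lo, cum = 0, 0.0
    for k, mass in enumerate(pmf):
        cum += mass
        if cum > tail:
            lo = k
            break
    hi, cum = len(pmf) - 1, 0.0
    for k in range(len(pmf) - 1, -1, -1):
        cum += pmf[k]
        if cum > tail:
            hi = k
            break
    return lo, hi


def classify_paper(powers, observed):
    """Return the predicted count (sum of power), the 95% interval, and a flag."""
    expected = sum(powers)
    lo, hi = central_interval(poisson_binomial_pmf(powers))
    if observed < lo:
        label = "lower"
    elif observed > hi:
        label = "greater"
    else:
        label = "within"
    return expected, (lo, hi), label


# Hypothetical per-locus replication power for one paper (illustrative only).
example_powers = [0.9, 0.8, 0.5, 0.3, 0.95]
expected, interval, label = classify_paper(example_powers, observed=1)
print(expected, interval, label)
```

The three labels correspond to the point shapes in panel A: "within" for dots, "greater" for diamonds, and "lower" for Xs.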
Fig 2. Expected and observed rates of replication in replication deciles.
All variants are sorted by replication p-value and partitioned into deciles; we then compute power to replicate the variants in each bin using effect estimates with or without the Winner’s Curse. Left panel (A): including all papers (WC-corrected χ² goodness of fit p < 10⁻⁴); right panel (B): including only papers conducting discovery and replication in the same continental ancestry per variant and reporting accurate per-locus N (WC-corrected χ² goodness of fit p = 0.67). Improvement of fit exceeds what is expected due to loss of power from subsetting data (p < 0.01). X-axis: upper p-value boundary of bin; Y-axis: predicted fraction of replication within corresponding bin based on power estimated from discovery data. Tracks correspond to predicted power to replicate using raw discovery (red) or WC-corrected (teal) effect estimates. Error bars correspond to 95% confidence intervals around mean replication rates as estimated across multiple loci.
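The binning procedure in this caption can be illustrated with a short Python sketch. It rests on several assumptions that are not in the source: power is approximated with a one-tailed normal (z) test, the record fields `p_rep`, `power`, and `replicated` are hypothetical, and the significance level `alpha` is left as a parameter to be set by the caller. It does not implement the Winner's Curse correction; a corrected or raw effect estimate is simply passed in as `beta`.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(beta, se_replication, alpha):
    """Approximate power of a one-tailed z-test at level alpha, assuming the true
    effect equals `beta` and the replication estimate has standard error
    `se_replication` (normal approximation; an assumption of this sketch)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - _STD_NORMAL.cdf(z_crit - abs(beta) / se_replication)


def decile_summary(loci, n_bins=10):
    """Sort loci by replication p-value, split them into n_bins nearly equal bins,
    and return (upper p-value boundary, mean predicted power, observed
    replication fraction) for each non-empty bin."""
    ordered = sorted(loci, key=lambda locus: locus["p_rep"])
    n = len(ordered)
    summary = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        upper = chunk[-1]["p_rep"]
        mean_power = sum(locus["power"] for locus in chunk) / len(chunk)
        observed = sum(1 for locus in chunk if locus["replicated"]) / len(chunk)
        summary.append((upper, mean_power, observed))
    return summary


# Hypothetical loci (illustrative values only, not data from the study).
alpha = 0.05 / 20  # hypothetical threshold: family-wise 0.05 across 20 loci
loci = [
    {
        "p_rep": (i + 1) / 100,
        "power": replication_power(beta=0.1, se_replication=0.02, alpha=alpha),
        "replicated": i < 10,
    }
    for i in range(20)
]
for upper, mean_power, observed in decile_summary(loci):
    print(f"p <= {upper:.2f}: predicted {mean_power:.3f}, observed {observed:.2f}")
```

Comparing the predicted and observed columns bin by bin is the kind of contrast the figure draws between the raw and WC-corrected tracks.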