Guo-Bo Chen1, Sang Hong Lee2,3, Matthew R Robinson2, Maciej Trzaskowski2, Zhi-Xiang Zhu4, Thomas W Winkler5, Felix R Day6, Damien C Croteau-Chonka7,8, Andrew R Wood9, Adam E Locke10, Zoltán Kutalik11,12,13, Ruth J F Loos14,15,16, Timothy M Frayling9, Joel N Hirschhorn17,18,19,20, Jian Yang2,21, Naomi R Wray2, Peter M Visscher22,23.
Abstract
Genome-wide association studies (GWASs) have been successful in discovering SNP-trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small, and this requires large genome-wide association meta-analyses (GWAMAs) to maximize statistical power. A trend towards ever-larger GWAMAs is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information and summary statistics, and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data, and does not breach individual privacy.
Year: 2016 PMID: 27552965 PMCID: PMC5159754 DOI: 10.1038/ejhg.2016.106
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
Figure 1. Recovery of cohort-level genetic background and inference of their geographic locations for GIANT BMI Metabochip cohorts and GIANT GWAS height cohorts using the Fst-derived genetic distance measure. (a) Genetic distance spectrum for all Metabochip cohorts to CEU, CHB, and YRI. The origins of the cohorts are denoted on the horizontal axis. (b) Projection of the Metabochip cohorts into the FPC space defined by the YRI, CHB, and CEU reference populations. The x and y axes represent relative distances derived from the genetic distance spectrum. Three dashed lines, blue for CEU, green for CHB, and red for YRI, partition the whole FPC space into three genealogical subspaces. (c) The genetic distance spectrum for the Metabochip European cohorts to CEU – northwest Europeans, FIN – northeast Europeans, and TSI – southern Europeans. The nationality of the cohorts is denoted on the horizontal axis. (d) The projection of the Metabochip European cohorts into the FPC space defined by the CEU, FIN, and TSI reference populations. The whole space is further partitioned into three subspaces: the CEU-TSI genealogical subspace (red and blue dashed lines), the FIN-TSI genealogical subspace (green and blue dashed lines), and the CEU-FIN genealogical subspace (red and green dashed lines). (e) Each cohort has three Fst values, obtained by comparison with the CEU, FIN, and TSI reference samples. The height of each bar represents its relative genetic distance to these three reference populations. The nationalities of the cohorts are denoted along the horizontal axis. The grey triangles along the x axis indicate MIGEN cohorts. (f) Given the three Fst values, the location of each cohort can be mapped. The whole space is partitioned into three subspaces: the CEU-TSI genealogical subspace (red and blue dashed lines), the FIN-TSI genealogical subspace (green and blue dashed lines), and the CEU-FIN genealogical subspace (red and green dashed lines). DGI (in the blue box) had samples from the Botnia study. 
Across the MIGEN cohorts (denoted as red triangles in the red box), the same allele frequencies (likely calculated from a South European cohort) were presented for each cohort. The open circles represent the mean of inferred geographic locations for the cohorts from the same country. Cohort/country codes: AF, African; AU, Australia; CA, Canada; CH, Switzerland; DE, Germany; DK, Denmark; EE, Estonia; ES, Iberian Population in Spain in 1KG; EU, European Nations; FI, Finland; FIN, Finns in 1000 Genomes Project (1KG); FR, France; GBR, British in 1KG; GIB, Gujarati Indian in 1KG; GR, Greece; Hawaii, Hawaii in USA; IBS, Iberian Population in Spain in 1KG; IT, Italy; IS, Iceland; JM, Jamaica; JPT, Japanese in 1KG; LWK, Luhya in 1KG; NL, Netherlands; NO, Norway; PH, the Philippines; PK, Pakistan; SC, Seychelles; SCT, Scotland; SE, Sweden; TSI, Tuscany in 1KG; UK, United Kingdom; US, United States of America.
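The idea behind Figure 1, comparing a cohort's reported allele frequencies with reference populations through Fst, can be sketched as follows. The caption does not state which Fst estimator was used, so this sketch assumes a simple Wright-style two-population Fst with equal population weights, combined across loci as a ratio of sums. The function names `fst_two_pop` and `closest_reference` are hypothetical and used only for illustration.

```python
import numpy as np


def fst_two_pop(p_cohort, p_ref):
    """Two-population Fst from per-locus allele frequencies.

    Assumption (not taken from the source): equal weighting of the two
    populations and a ratio-of-sums across loci.
    """
    p1 = np.asarray(p_cohort, dtype=float)
    p2 = np.asarray(p_ref, dtype=float)
    p_bar = (p1 + p2) / 2.0
    num = (p1 - p2) ** 2 / 4.0       # between-population variance of p
    den = p_bar * (1.0 - p_bar)      # total variance at the locus
    keep = den > 0                   # skip loci fixed in both populations
    return float(num[keep].sum() / den[keep].sum())


def closest_reference(p_cohort, refs):
    """Return the label of the reference with the smallest Fst, and all Fst values.

    `refs` maps a label (e.g. "CEU") to an array of reference allele
    frequencies for the same loci and the same reference allele.
    """
    fst = {name: fst_two_pop(p_cohort, p) for name, p in refs.items()}
    return min(fst, key=fst.get), fst
```

A cohort whose reported frequencies sit closest to an unexpected reference population, or that returns identical Fst values to several other cohorts, would be a candidate for follow-up; the three Fst values per cohort are the quantities the bar heights in panel (e) are described as representing.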
Figure 2. λmeta for the GIANT height GWAS cohorts. (a) Given 174 cohorts, there are 15 051 λmeta values, which provide an overview of the quality control of the summary statistics. The heat map represents the 15 051 λmeta statistics, and the x and y axes index each pair of cohorts. Pairs of cohorts showing heterogeneity (λmeta > 1) are illustrated in the upper-left triangle, and pairs showing homogeneity (λmeta < 1) in the lower-right triangle. (b) The distribution of λmeta from the 174 cohorts/files used in the GIANT height meta-analysis. The overall mean of the 15 051 λmeta values is 1.013, and the SD is 0.022. (c) Illustration of homogeneity between two cohorts (SORBS MEN and WOMEN), λmeta=0.876. (d) Illustration for SardiNIA and WGHS; this pair of cohorts has λmeta=1.245. The grey band represents the 95% confidence interval for λmeta.
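The pair count in panel (a) is simply the number of unordered pairs among 174 cohorts. The caption does not define λmeta, so the sketch below assumes one plausible form built from the reported effect sizes and standard errors: a per-SNP squared difference in effects scaled by the summed sampling variances, summarised by its median relative to the median of a 1-d.f. chi-square distribution. The function name `lambda_meta` is hypothetical.

```python
from math import comb

import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def lambda_meta(b1, se1, b2, se2):
    """Assumed form of lambda_meta for one pair of cohorts.

    b1, b2: per-SNP effect estimates for the same SNPs and effect allele.
    se1, se2: their standard errors.
    Values below 1 suggest correlated estimates (e.g. shared samples);
    values above 1 suggest heterogeneity between the cohorts.
    """
    b1, se1, b2, se2 = (np.asarray(x, dtype=float) for x in (b1, se1, b2, se2))
    t = (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)
    return float(np.median(t) / CHI2_1_MEDIAN)


# Number of cohort pairs for 174 cohorts: 174 * 173 / 2
n_pairs = comb(174, 2)  # 15051
```

Under this assumed definition, two independent cohorts with no real differences in effect give values near 1, which is in line with the overall mean of 1.013 reported in panel (b).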
The estimated correlation for a pair of cohorts via their summary statistics given 30 000 independent loci
| Heritability | n1,2 | n1 | n2 | γ1,2 | Estimate via direct correlation | Estimate via λmeta |
|---|---|---|---|---|---|---|
| 0.25 | 100 | 1000 | 1000 | 0.1 | 0.1072±0.0064 | 0.101±0.0093 |
| 0.25 | 100 | 1000 | 2000 | 0.0707 | 0.0814±0.0054 | 0.0709±0.0088 |
| 0.25 | 100 | 1000 | 5000 | 0.0447 | 0.0615±0.0055 | 0.0425±0.0096 |
| 0.25 | 100 | 1000 | 10 000 | 0.0316 | 0.0556±0.0063 | 0.0325±0.0099 |
| 0.25 | 1 | 1000 | 1000 | 0.001 | 0.0092±0.0056 | 0.0017±0.0093 |
| 0.25 | 1 | 1000 | 2000 | 0.0007 | 0.0126±0.0053 | 0.0006±0.0079 |
| 0.25 | 1 | 1000 | 5000 | 0.000447 | 0.0189±0.0060 | 0.0016±0.0090 |
| 0.25 | 1 | 1000 | 10 000 | 0.000316 | 0.0259±0.0059 | 0.0008±0.0092 |
| 0 | 100 | 1000 | 1000 | 0.1 | 0.0996±0.0052 | 0.094±0.0085 |
| 0 | 100 | 1000 | 2000 | 0.0707 | 0.0704±0.0048 | 0.0712±0.0097 |
| 0 | 100 | 1000 | 5000 | 0.0447 | 0.0453±0.0057 | 0.0441±0.0090 |
| 0 | 100 | 1000 | 10 000 | 0.0316 | 0.0335±0.0057 | 0.0325±0.0079 |
Notes: Heritability was simulated on 1000 QTLs. We also tried 100 QTLs, and the results were nearly identical. n1, n2, and n1,2 represent the sample sizes of cohort 1, cohort 2, and the samples overlapping between them. γ1,2 represents the true correlation between a pair of summary statistics due to overlapping samples. The second-to-last column gives the correlation estimated via direct correlation between summary statistics, the method proposed by Bolormaa et al.[18] and Zhu et al.[19]. The last column gives the correlation estimated via λmeta.
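The true correlations in the table can be reproduced from the sample sizes alone. The notes do not state the formula, so the relation used below, overlap size divided by the geometric mean of the two cohort sizes, is an assumption; it does match every γ1,2 value listed. The function name `overlap_correlation` is hypothetical.

```python
from math import sqrt


def overlap_correlation(n_overlap, n1, n2):
    """Assumed relation: gamma_{1,2} = n_{1,2} / sqrt(n1 * n2)."""
    return n_overlap / sqrt(n1 * n2)


# Rows with 100 overlapping samples and n1 = 1000
for n2 in (1000, 2000, 5000, 10000):
    print(n2, round(overlap_correlation(100, 1000, n2), 4))
# 1000 0.1
# 2000 0.0707
# 5000 0.0447
# 10000 0.0316

# A single overlapping sample, n1 = 1000, n2 = 5000
print(round(overlap_correlation(1, 1000, 5000), 6))  # 0.000447
```

Read against these values, the direct-correlation estimates in the heritability 0.25 rows sit above γ1,2, increasingly so as n2 grows, while the λmeta-based estimates stay close to it.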
Figure 3. Pseudo profile score regression (PPSR) for pinpointing overlapping samples/relatives. (a) Each cluster represents a pair of cohorts as denoted on the x axis. Within each cluster, from left to right, the bars show the overlapping controls detected using λmeta based on either effect size estimates or minor allele frequency (MAF), and using PPSR with 100, 200, and 500 markers. WTCCC cohort codes: BD for bipolar disorder, CAD for coronary artery disease, CD for Crohn’s disease, HT for hypertension, RA for rheumatoid arthritis, T1D for type 1 diabetes, and T2D for type 2 diabetes. (b) Illustration of the regression coefficients between WTCCC BD and CAD from 57 pseudo profile scores (PPS) generated from 500 markers. The x axis is the PPSR regression coefficient and the y axis is the real genetic relatedness (as calculated from individual-level genotype data). The red points are the controls shared between the two cohorts, and the blue points are first-degree relatives. (c) The PPS regression coefficients for detecting overlapping first-degree relatives using 286 PPS generated from 500 markers. (d) Decoding genotypes from the PPS. Given the set of profile scores, one may run a GWAS-like analysis to infer the genotypes. The ratio between the number of markers (M) and the number of pseudo profile scores (K) determines the potential discovery of individual-level information. The higher the ratio, and the higher the allele frequency, the less information can be recovered. From left to right, the profile scores were generated using different numbers of markers. The y axis is an R2 metric representing the accuracy between the inferred genotypes and the real genotypes. In the panels from left to right, 100, 200, 500, and 1000 SNPs were used to generate 10, 20, 50, and 1000 profile scores. In each cluster, the three bars show the inferred accuracy for alleles from different parts of the MAF spectrum, with the SE of the mean.
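The approach in Figure 3 can be illustrated with a minimal sketch. It assumes that both cohorts apply the same randomly generated marker weights to standardized genotypes, exchange only the resulting K scores per individual, and then regress one individual's scores on another's across the K scores. The standardization, the normal weights, and the function names `pseudo_profile_scores` and `pps_regression` are illustrative assumptions rather than details given in the caption.

```python
import numpy as np


def pseudo_profile_scores(genotypes, freqs, weights):
    """Compute K pseudo profile scores per individual.

    genotypes: (n, M) allele counts coded 0/1/2.
    freqs: (M,) allele frequencies used to standardize the genotypes.
    weights: (M, K) random weights shared by all cohorts.
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.asarray(freqs, dtype=float)
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ weights  # shape (n, K)


def pps_regression(scores_a, scores_b):
    """Slope of individual i in A regressed on individual j in B over K scores.

    Returns an (n_a, n_b) matrix; a slope near 1 suggests the same
    person, a slope near 0 suggests unrelated individuals.
    """
    a = scores_a - scores_a.mean(axis=1, keepdims=True)
    b = scores_b - scores_b.mean(axis=1, keepdims=True)
    cov = a @ b.T / a.shape[1]        # covariance across the K scores
    var_b = (b ** 2).mean(axis=1)     # variance of each B individual's scores
    return cov / var_b[None, :]
```

Only the score matrices leave each cohort in this sketch. As panel (d) indicates, how much genotype information could be decoded from them depends on the ratio of M to K, so K would be kept small relative to M.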