Across-cohort QC analyses of GWAS summary statistics from complex traits.

Guo-Bo Chen¹, Sang Hong Lee^2,3, Matthew R Robinson², Maciej Trzaskowski², Zhi-Xiang Zhu⁴, Thomas W Winkler⁵, Felix R Day⁶, Damien C Croteau-Chonka^7,8, Andrew R Wood⁹, Adam E Locke¹⁰, Zoltán Kutalik^11,12,13, Ruth J F Loos^14,15,16, Timothy M Frayling⁹, Joel N Hirschhorn^17,18,19,20, Jian Yang^2,21, Naomi R Wray², Peter M Visscher^22,23.

Abstract

Genome-wide association studies (GWASs) have been successful in discovering SNP trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small and this requires large genome-wide association meta-analyses (GWAMAs) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information, and summary statistics and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors that does not require the sharing of individual-level genotype data and does not breach individual privacy.

Year: 2016 PMID: 27552965 PMCID: PMC5159754 DOI: 10.1038/ejhg.2016.106

Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246

Introduction

To elucidate genetic architecture, which requires maximized statistical power for discovery of risk alleles of small effect, large genome-wide association meta-analyses (GWAMAs) are tending towards ever-larger scale that may contain data from hundreds of cohorts. At the individual cohort level, genome-wide association study (GWAS) analysis is often based on various genotyping chips and conducted with different protocols, such as different software tools and reference populations for imputation, inclusion of study-specific covariates and association analyses using different methods and software. Although solid quality control (QC) analysis pipelines of GWAMA exist,[1] these analyses focus on QC for each cohort independently. With ever-increasing sizes of GWAMA, there is a need for additional QC that goes beyond the cohort-by-cohort genotype-level analysis performed to date. In this study, we propose a new set of QC metrics for GWAMA. All these applications assume that there is a central analysis hub, where summary statistic data from GWAS are uploaded for each cohort. All methods proposed are implemented in freely available software GEAR.

Materials and methods

Overview of materials and methods

Cohort-level summary statistics

The height GWAS summary statistics were provided by the GIANT Consortium and were from 82 cohorts (174 separate files) representing a total of 253 288 individuals, and ~2.5 million autosomal SNPs imputed to the HapMap2 reference.[2] Metabochip summary statistics for body mass index (BMI) were from 43 cohorts (120 files), representing a total of 103 047 samples from multiple ethnicities with about 200 000 SNPs genotyped on customised chips.[3, 4]

1000 Genomes project samples

1000 Genomes Project (1KG) reference samples[5] were used as the reference samples for estimating Fst and meta-PC. When assessing the global-level Fst measures, Yoruba (YRI, 108 individuals), Han Chinese in Beijing (CHB, 103 individuals), and Utah Residents with Northern and Western European Ancestry (CEU, 99 individuals) were employed as the reference panels representing African, East Asian, and European samples, respectively. For calculating within-Europe Fst, CEU, Finnish (FIN, 99 individuals), and Toscani (TSI, 107 individuals) were employed to represent northwest, northeast, and southern Europeans, respectively. For analyses using a whole European panel, CEU, FIN, TSI, GBR (British, 91 individuals), and IBS (Iberian, 107 individuals) were pooled together as an ‘averaged’ European reference.

WTCCC GWAS data

WTCCC GWAS data has 2934 shared controls for seven diseases with a total of 14 000 cases.[6] Individual GWAS was conducted for each disease using PLINK[7] and their summary statistics used to estimate λmeta.

The four proposed metrics include:

(I) Fpc: a genome-wide comparison of allele frequency differences across cohorts or against a common reference population.
(II) Meta-PC: principal component analysis of reported allele frequencies.
(III) λmeta: a pairwise cohort statistic that uses allele frequency or effect size concordance to detect the proportion of sample overlap or heterogeneity.
(IV) Pseudo profile score regression: an easy to implement analysis to pinpoint each between-cohort overlapping sample that does not require the sharing of individual-level genotype data.

The technical details of these four methods summarized here can be found in the Supplementary Notes. Overview and application of these four metrics in GWAMA can be found in the Text Box.

Results

Population genetic QC analysis using Fst

In GWAMA, only summary statistics such as allele frequencies are available to the central analysis hub, so it is difficult to identify population outliers. Checks for gross differentiation in allele frequencies at specific SNPs between GWAMA cohorts and a reference (such as 1000 Genomes Project, denoted as 1KG)[5] are part of standard QC protocols,[1] but checking for more differentiation than expected across the entire genome is not usually part of the QC pipeline. We propose that a genetic distance inferred from Fst, which reflects genetic distance between pairwise populations, is a useful additional QC statistic to detect cohorts that are population outliers. Using the relationship between Fst and principal components,[8] our Fst cartographer algorithm can be used to estimate the relative genetic distance between cohorts (Supplementary Notes for Method I; Supplementary Figure S1). We applied the Fst metric to the GIANT Consortium BMI Metabochip cohorts (55 male-only cohorts, 55 female-only cohorts, and 10 mixed-sex cohorts), which were recruited from multiple ethnicities,[3] such as Europeans, African Americans in the Atherosclerosis Risk in Communities Study (ARIC) and cohorts from Jamaica (SPT), Pakistan (PROMISE), Philippines (CLHNS), and Seychelles (SEY). For each Metabochip cohort, we sampled 30 000 independent markers to calculate Fst values with each of three 1KG samples (CEU, CHB, and YRI, respectively). For validation of the method, we also calculated Fst values against the 1KG Japanese (JPT, Japanese in Tokyo, Japan), Indian (GIH, Gujarati Indian in Houston, US), Kenyan (LWK, Luhya in Webuye, Kenya), and European samples (IBS, Iberian populations, Spain; FIN, Finnish, Finland; TSI, Toscani, Italy; and GBR, British in England and Scotland), to see whether the known genetic origins of those cohorts can be recovered.
According to the origins of the samples, each Metabochip cohort showed a different genetic distance spectrum to the three reference populations (Figure 1a). The JPT and Philippine cohorts had very small genetic distances to CHB, as expected, but large to CEU and YRI; however, the Pakistan cohorts showed much closer genetic distances to CEU than to CHB and YRI, indicating their demographic history. The cohorts sampled from Jamaica, Seychelles, Hawaii, and the African American ARIC cohort had small genetic distances to YRI, but large distances to CHB and CEU. For most European cohorts, as expected, the distances to CEU were very small compared with those to CHB and YRI. Given their relative distances to CEU, CHB, and YRI, using our Fst cartographer algorithm (Supplementary Notes for Method I; Supplementary Figure S1), the cohorts were projected into a two-dimensional space, called Fst-derived principal components (FPC) space, constructed by YRI, CHB, and CEU as the reference populations (Figure 1b). The allocation of the cohorts to the FPC space resembles that of eigenvector 1 against eigenvector 2 in principal component analysis (PCA), and is similar to those observed in PCA using individual-level GWAS data for populations of various ethnicities such as in 1KG samples.[5] Therefore, our method to place cohorts in geographical regions from GWAS summary statistics works well at a global-population scale.
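The Fst comparison described above can be illustrated with a minimal Python sketch. It is not the GEAR implementation or the Fst cartographer algorithm (whose details are in the Supplementary Notes). It assumes a simple two-population Fst estimator with equal population weights and no sample-size correction, and it uses simulated allele frequencies. The names `fst_from_frequencies`, `drift`, `REF_A`, and `REF_B` are hypothetical.

```python
import numpy as np

def fst_from_frequencies(p_cohort, p_ref):
    """Genome-wide Fst between two populations from reported allele frequencies.

    Assumption: a ratio-of-sums estimator with equal weight for both
    populations and no correction for sample size.
    """
    p_cohort = np.asarray(p_cohort, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    p_bar = (p_cohort + p_ref) / 2.0
    between = (p_cohort - p_ref) ** 2 / 4.0   # variance of the two frequencies around p_bar
    total = p_bar * (1.0 - p_bar)
    keep = total > 0                          # skip SNPs that are fixed in both populations
    return between[keep].sum() / total[keep].sum()

# Simulated data (hypothetical): 30 000 independent markers, two reference panels.
rng = np.random.default_rng(1)
n_snps = 30_000
ancestral = rng.uniform(0.1, 0.9, n_snps)

def drift(p, f):
    """Perturb frequencies with variance f * p * (1 - p) to mimic genetic drift."""
    noise = rng.normal(0.0, np.sqrt(f * p * (1.0 - p)))
    return np.clip(p + noise, 0.01, 0.99)

references = {"REF_A": drift(ancestral, 0.05), "REF_B": drift(ancestral, 0.05)}
cohort = drift(references["REF_A"], 0.001)    # cohort drawn close to REF_A

fst_values = {name: fst_from_frequencies(cohort, p) for name, p in references.items()}
closest_reference = min(fst_values, key=fst_values.get)
print(fst_values, closest_reference)
```

In this toy setting the cohort has a much smaller Fst to the panel it was derived from, which is the pattern used in the text to recover cohort origins and to flag outliers.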

Figure 1

Recovery of cohort-level genetic background and inference of their geographic locations for GIANT BMI Metabochip cohorts and GIANT GWAS height cohorts using the Fst-derived genetic distance measure. (a) Genetic distance spectrum for all Metabochip cohorts to CEU, CHB, and YRI. The origins of the cohorts are denoted on the horizontal axis. (b) Projection for the Metabochip cohorts into FPC space defined by YRI, CHB, and CEU reference populations. The x and y axis represent relative distances derived from the genetic distance spectrum. Three dashed lines, blue for CEU, green for CHB, and red for YRI, partitioned the whole FPC space to three genealogical subspaces. (c) The genetic distance spectrum for the Metabochip European cohorts to CEU – northwest Europeans, FIN – northeast European, and TSI – southern Europeans. The nationality of the cohorts is denoted on the horizontal axis. (d) The projection for the Metabochip European cohorts to the FPC space defined by CEU, FIN, and TSI reference populations. The whole space is further partitioned into three subspaces, CEU-TSI genealogical subspace (red and blue dashed lines), FIN-TSI genealogical subspace (green-blue dashed lines), and CEU-FIN genealogical subspace (red-green dashed lines), respectively. (e) Each cohort has three Fst values by comparing with CEU, FIN, and TSI reference samples. The height of each bar represents its relative genetic distance to these three reference populations. The nationalities of the cohorts are denoted along the horizontal axis. The grey triangles along the x axis indicate MIGEN cohorts. (f) Given the three Fst values, the location of each cohort can be mapped. The whole space was partitioned into three subspaces, CEU-TSI genealogical subspace (red and blue dashed lines), FIN-TSI genealogical subspace (green and blue dashed lines), and CEU-FIN genealogical subspace (red and green dashed lines). DGI (in the blue box) had samples from the Botnia study. 
Across the MIGEN cohorts (denoted as red triangles in the red box), the same allele frequencies (likely calculated from a South European cohort) were presented for each cohort. The open circles represent the mean of inferred geographic locations for the cohorts from the same country. Cohort/country codes: AF, African; AU, Australia; CA, Canada; CH, Switzerland; DE, Germany; DK, Denmark; EE, Estonia; ES, Iberian Population in Spain in 1KG; EU, European Nations; FI, Finland; FIN, Finns in 1000 Genomes Project (1KG); FR, France; GBR, British in 1KG; GIH, Gujarati Indian in 1KG; GR, Greece; Hawaii, Hawaii in USA; IBS, Iberian Population in Spain in 1KG; IT, Italy; IS, Iceland; JM, Jamaica; JPT, Japanese in 1KG; LWK, Luhya in 1KG; NL, Netherlands; NO, Norway; PH, the Philippines; PK, Pakistan; SC, Seychelles; SCT, Scotland; SE, Sweden; TSI, Tuscany in 1KG; UK, United Kingdom; US, United States of America.

We next investigated whether our genetic distance method works at a much finer geographic scale. It is known that using individual-level data, PCA can mirror the geographic locations for European samples.[9] Here we analyzed the 103 GIANT European-ancestry Metabochip cohorts (48 male-only cohorts, 47 female-only cohorts, and 8 mixed-sex cohorts) for fine-scale Fst genetic distance measure using the CEU, FIN, and TSI reference populations, which represent northwest, northeast, and southern European populations, respectively. For each of the GIANT European-ancestry Metabochip cohorts, Fst was calculated relative to each of these three reference populations and showed concordance with the known origin of the samples (Figure 1c). For example, cohorts from Finland and Estonia were close to FIN but distant to TSI; cohorts from South Europe such as Italy and Greece had small genetic distance to TSI; and cohorts from West Europe had small genetic distance to CEU. Similarly, the projected origin for each European-ancestry Metabochip cohort resembles its geographic location within the European map as expected (Figure 1d). Therefore, Fpc based upon population differentiation also works at a fine scale. We next applied the Fst genetic distance measures to 174 GIANT height GWAS cohorts (79 male-only cohorts, 76 female-only cohorts, and 19 mixed-sex cohorts; excluding Metabochip data), which were all of European ancestry imputed to the HapMap reference panel.[2] Given the three Fst values to CEU, FIN, and TSI (Figure 1e), the geographic origin for each cohort can be inferred as for the GIANT BMI Metabochip data. The projected coordinates of each GWAS cohort match its origin very well (Figure 1f). For example, a Canadian cohort, the Quebec Family Study (QFS), was closely located to DESIR, a French cohort, consistent with the French genetic heritage of the QFS.[10] In addition, we also observe complexity due to mixed samples from different countries.
For example, the DGI/Botnia study had samples recruited from Sweden and Finland, and its inferred geographic location is between those of the Swedish and Finnish cohorts.[11] We also note that for the Myocardial Infarction Genetics Consortium (MIGEN) cohorts, which were recruited from Finland, Sweden, Spain, and the United States, the same allele frequencies were reported for all their sub-cohorts, and all cohorts were allocated to southern Europe (very closely located to 1KG IBS cohort; Figure 1f and Supplementary Figure S2). As the allele frequencies, used in QC steps to eliminate low-quality loci, were not directly used in estimating genetic effects in the GWAMA, the reported allele frequencies in MIGEN have had little impact on the published GWAMA results.[2] Next, we show that Fst can detect populations that have a different demographic past. Using all 1KG European samples as the reference panel (eg, an ‘averaged’ European reference panel), most cohorts in GIANT had Fst<0.005 with this average, which agrees with previously reported results using individual-level data from European nations.[9] A few cohorts showed large Fst, such as the AMISH cohort with Fst=0.018, and the North Swedish Population Health Study[12] with Fst=0.014. Both populations are known to have been genetically isolated (Supplementary Figure S3).

PCA for allele frequencies (meta-PCA)

Given the same allele frequencies as used for Fst-based analysis above, we conducted PCA for allele frequencies, denoted as meta-PCA (or mPC). In meta-PCA, each cohort was analogously considered as an ‘individual’. For example, 120 Metabochip cohorts were considered as a sample of 120 ‘individuals’. Although the inferred ancestral information was for each cohort rather than any individuals, implementation of meta-PCA was the same as the conventional PCA (Supplementary Notes for Method II). Testing meta-PCA with 1KG samples indicated that it could reveal the genetic background for each cohort as precisely as PCA based on individual-level data (Supplementary Figure S4). We applied meta-PCA to 120 Metabochip cohorts using nearly 34K SNPs common to Metabochip and 1KG variants, with the inclusion of 10 1KG cohorts (East Asian: CHB and JPT; South Asian: GIH; European: CEU, FIN, GBR, IBS, and TSI; African: LWK and YRI) as the reference cohorts. The inferred ancestral information of each cohort agreed well with demographic information. For example, PROMISE (Pakistan) was located very close to GIH, CLHNS (Philippines) close to CHB and JPT, ARIC (African American) and SPT (Jamaican) close to YRI and LWK, and the European cohorts close to CEU and FIN (Figure 4a). We also applied meta-PCA to 174 GIANT height GWAS cohorts for nearly 1M SNPs, with the inclusion of 10 1KG reference cohorts. At the global-population level, the 174 cohorts were all allocated close to CEU and FIN, consistent with their reported demographic information (Figure 4b). For fine-scale inference, we conducted meta-PCA again but with the inclusion of the five 1KG European samples.
As demonstrated (Figure 4c), the resolution of the inferred relative location between European cohorts reflected their real geographical locations, as previously observed using individual-level data.[9] For example, of the four cohorts from Italy, the MICROS cohort was from South Tyrol, northern Italy. MICROS had its meta-PC coordinates much closer to CEU than the other three Italian cohorts, reflecting its geographic location; the InCHIANTI cohort had its coordinates almost identical to TSI; the SardiNIA cohort was located further south than TSI, reflecting its relative geographic and genetic isolation as recently confirmed.[13] Similarly, in the sub-plots for Finland and Sweden, the cohorts from the MIGEN consortium, which all had reported allele frequencies of south European origin, were located near 1KG TSI and IBS. These results were consistent with what was observed from Fpc as described in the last section, and also agreed well with demographic information. Therefore, based on the reported allele frequencies, the demographic information could be verified by the meta-PCA method.
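As an illustration of treating each cohort as one 'individual', the following minimal Python sketch runs a PCA on a cohort-by-SNP matrix of allele frequencies. It is only a sketch under stated assumptions: SNPs are centred but not rescaled (the exact standardization used by the authors is given in the Supplementary Notes), the data are simulated, and the function name `meta_pca` is hypothetical.

```python
import numpy as np

def meta_pca(freq_matrix, n_components=2):
    """PCA on reported allele frequencies.

    Rows are cohorts (treated as 'individuals'), columns are SNPs.
    Assumption: each SNP is centred across cohorts and not rescaled.
    """
    x = np.asarray(freq_matrix, dtype=float)
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]   # cohort coordinates on the leading PCs

# Simulated example (hypothetical): five cohorts from each of two ancestries.
rng = np.random.default_rng(7)
n_snps = 5000
p_a = rng.uniform(0.1, 0.9, n_snps)
p_b = np.clip(p_a + rng.normal(0.0, 0.1, n_snps), 0.01, 0.99)

def reported_frequencies(p):
    """Cohort-level frequencies: ancestry frequencies plus small sampling noise."""
    return np.clip(p + rng.normal(0.0, 0.01, n_snps), 0.0, 1.0)

freqs = np.vstack([reported_frequencies(p_a) for _ in range(5)] +
                  [reported_frequencies(p_b) for _ in range(5)])
pcs = meta_pca(freqs)
pc1 = pcs[:, 0]
print(np.round(pc1, 2))
```

In this toy example the first component separates the two ancestry groups, mirroring how reference cohorts and study cohorts cluster by ancestry in the analysis above.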

λmeta to detect pairwise cohort heterogeneity and sample overlap

In this study, we use the summary statistics for a pair of cohorts to calculate λmeta, a metric that examines heterogeneity from the concordance of reported effect sizes and sampling variance. For a SNP marker (i), given its reported estimated effect size (b) and sampling variance (σ²) in a pair of cohorts 1 and 2, we can calculate a test statistic $T_i = (b_{i,1}-b_{i,2})^2/(\sigma_{i,1}^2+\sigma_{i,2}^2)$, the ratio between the squared difference of their reported effects and the sum of their reported sampling variances. We constructed 30 000 T statistics using markers in linkage equilibrium along the genome for a pair of cohorts. Under the null hypothesis of no overlapping samples/heterogeneity, T follows a χ² distribution with 1 degree of freedom (Supplementary Notes for Method III). Analogous to λGC, $\lambda_{meta} = \mathrm{median}(T)/0.455$, the ratio between the median of the 30 000 T values and the median of a χ² statistic with 1 degree of freedom (a value of 0.455), has an expected value of 1 for two independent GWAS summary statistics sets for the same trait. When there is heterogeneity between estimated genetic effects, the expectation is λmeta>1, and in contrast λmeta<1 if there are overlapping samples. In general, not only overlapping samples but also close relatives present in different cohorts can lead to correlated summary statistics generating λmeta<1. However, unless the proportion of overlapping relatives is substantial and their phenotypic correlation is high, the correlation of the summary statistics due to the effective number of overlapping samples (no) is expected to be dominated by the same individuals contributing phenotypic and genetic information to different cohorts (Supplementary Figure S5). Furthermore, if genomic control is applied to adjust the sampling variance, then λmeta will be reduced relative to its value without genomic control for λGC.
GWAS summary statistics for schizophrenia were available in two phases: the first had 9394 controls and 12 462 cases,[14] and in the next phase ~18 000 Swedish samples were added.[15] Such substantial sample overlap between these two sets of summary statistics led to an estimated λmeta as low as 0.257 (Supplementary Figure S6), consistent with this known overlap. In contrast, heterogeneity between data sets (represented by λmeta>1) was observed between GWAS summary statistics of rheumatoid arthritis from European and Asian studies,[16] for which λmeta=1.09 (Supplementary Figure S7). In addition, we note that the distribution of the empirical T-statistics deviates from expectation at the upper tail of the distribution, suggesting differences in effect size or linkage disequilibrium between these two ancestries. Next, we estimated λmeta from pairs of cohorts from the 174 GIANT height GWAS cohorts.[2] We found no evidence for substantial sample overlap but did observe between-cohort heterogeneity and technical artifacts. From the 174 GIANT height GWAS,[2] we calculated 15 051 cohort pairwise λmeta values, resulting in a bell-shaped distribution (Figure 2a and b), with a mean of 1.013 and an empirical SD of 0.022, which was greater than the theoretical SD of 0.014. The empirical mean and SD can be used to construct a z-score test for each λmeta. These results are consistent with a small amount of heterogeneity, which is not unexpected due to variation of actual (unknown) genetic architecture and analysis protocols. However, the mean is close to 1.0 and based upon this QC metric, the results are consistent with stringent QC and data cleaning.
The minimum λmeta value was ~0.88 (between SORBS men and SORBS women; Figure 2c), with P-value<1e−10 (testing for the difference from 1), and the maximum was 1.245 (between SardiNIA and WGHS; Figure 2d), with P-value<1e−10. These were the most deflated and most inflated λmeta values across the GIANT height study cohorts, and both were significant after correction for multiple testing. Of note, SORBS were analyzed using a method that corrected for relatedness, which potentially led to the deflated λmeta as implied by the theory (Supplementary Notes for Method III). Plotting λmeta for all cohort pairs (Figure 2a) highlighted that 20 cohorts from the MIGEN consortium showed substantially lower λmeta with many other cohorts (right-bottom triangle in Figure 2a) than the average, consistent with over-conservative models for statistical association analyses being used in these cohorts, which may be due to very small sample size (ranging from 36 to 320 for the 20 MIGEN cohorts, with an average sample size of 132). Consistent with this, many cohorts from MIGEN also have λGC<1 (Supplementary Figures S8 and S9). In contrast, the SardiNIA cohort (4303 samples) showed heterogeneity with nearly all other cohorts (Supplementary Figures S8 and S9), perhaps due to unknown artifacts or a slightly different genetic architecture for height as a result of demographic history.[17]
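The λmeta calculation described in this section can be sketched in a few lines of Python. The statistic itself follows the verbal definition in the text (squared effect difference over the sum of sampling variances, with the median scaled by 0.455). Everything else is an assumption for illustration: the simulated effect sizes and standard errors, the helper names `lambda_meta`, `simulate_pair`, and `n_pairs`, and the use of correlated estimation errors as a stand-in for sample overlap.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.455  # median of a chi-square with 1 degree of freedom, as quoted in the text

def lambda_meta(b1, se1, b2, se2):
    """Median of T_i = (b1 - b2)^2 / (se1^2 + se2^2), scaled by the chi-square(1) median."""
    t = (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)
    return np.median(t) / CHI2_1DF_MEDIAN

def n_pairs(n_cohorts):
    """Number of distinct cohort pairs."""
    return n_cohorts * (n_cohorts - 1) // 2

rng = np.random.default_rng(42)
n_snps = 30_000
se = np.full(n_snps, 0.03)   # hypothetical standard errors, equal in both cohorts

def simulate_pair(error_corr=0.0, effect_sd=0.0):
    """Simulate reported effects for two cohorts.

    error_corr: correlation of estimation errors (mimics overlapping samples).
    effect_sd:  SD of cohort-specific true effects (mimics heterogeneity).
    """
    shared = rng.standard_normal(n_snps)
    e1 = np.sqrt(error_corr) * shared + np.sqrt(1 - error_corr) * rng.standard_normal(n_snps)
    e2 = np.sqrt(error_corr) * shared + np.sqrt(1 - error_corr) * rng.standard_normal(n_snps)
    b1 = effect_sd * rng.standard_normal(n_snps) + se * e1
    b2 = effect_sd * rng.standard_normal(n_snps) + se * e2
    return b1, se, b2, se

lam_independent = lambda_meta(*simulate_pair())
lam_overlap = lambda_meta(*simulate_pair(error_corr=0.5))
lam_heterogeneous = lambda_meta(*simulate_pair(effect_sd=0.02))
print(lam_independent, lam_overlap, lam_heterogeneous, n_pairs(174))
```

Under these simulated conditions, independent cohorts give a value near 1, correlated errors give a value below 1, and cohort-specific effects give a value above 1, matching the direction of the interpretation in the text. The pair count for 174 cohorts reproduces the 15 051 comparisons reported for the GIANT height data.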

Figure 2

λmeta for the GIANT height GWAS cohorts. (a) Given 174 cohorts, there are 15 051 λmeta values, which provide the overview of the quality control of the summary statistics. The heat map represents 15 051 λmeta statistics, and the x and y axis index each pair of cohorts. The pairs of cohorts that showed heterogeneity (λmeta>1) are illustrated on the left-top triangle, and homogeneity (λmeta<1) on the right-bottom triangle. (b) The distribution of λmeta from 174 cohorts/files used in the GIANT height meta-analysis. The overall mean of 15 051 λmeta is 1.013, and SD is 0.022. (c) Illustration for homogeneity between two cohorts (SORBS MEN and WOMEN), λmeta=0.876. (d) Illustration of SardiNIA and WGHS, this pair of cohorts has λmeta=1.245. The grey band represents 95% confidence interval for λmeta.

The statistical power of detection of overlapping samples is maximized when a pair of cohorts has equal sample size (Supplementary Figure S10); in other words, the confidence interval under the null hypothesis of no overlapping samples depends on the sample sizes of the pair of cohorts. As a comparison, the estimation of a correlation between the genetic effects for a pair of cohorts has been proposed to quantify overlapping samples,[18, 19] but this metric is confounded with genetic architecture, such as heritability underlying the trait(s) (Table 1; Supplementary Notes IV). When there was heritability, the estimated correlation between genetic effects could be biased and could lead to an incorrect inference about overlapping samples for a pair of cohorts. When there was no heritability, the estimated correlation was correct and agreed well with the one estimated with λmeta. As the existence of heritability is one of the reasons to perform GWAMA, λmeta is preferred when estimating overlapping samples between cohorts.

Table 1

The estimated correlation for a pair of cohorts via their summary statistics given 30 000 independent loci

h²	n1,2	n1	n2	γ1,2 (true)	Estimate from direct correlation	Estimate via λmeta
0.25	100	1000	1000	0.1	0.1072±0.0064	0.101±0.0093
0.25	100	1000	2000	0.0707	0.0814±0.0054	0.0709±0.0088
0.25	100	1000	5000	0.0447	0.0615±0.0055	0.0425±0.0096
0.25	100	1000	10 000	0.0316	0.0556±0.0063	0.0325±0.0099
0.25	1	1000	1000	0.001	0.0092±0.0056	0.0017±0.0093
0.25	1	1000	2000	0.0007	0.0126±0.0053	0.0006±0.0079
0.25	1	1000	5000	0.000447	0.0189±0.0060	0.0016±0.0090
0.25	1	1000	10 000	0.000316	0.0259±0.0059	0.0008±0.0092
0	100	1000	1000	0.1	0.0996±0.0052	0.094±0.0085
0	100	1000	2000	0.0707	0.0704±0.0048	0.0712±0.0097
0	100	1000	5000	0.0447	0.0453±0.0057	0.0441±0.0090
0	100	1000	10 000	0.0316	0.0335±0.0057	0.0325±0.0079

Notes: Heritability (first column) was simulated on 1000 QTLs. We also tried 100 QTLs, and results were nearly identical. n1, n2, and n1,2 represent the sample sizes of cohort 1, cohort 2, and the overlapping samples between them. γ1,2 represents the true correlation for a pair of summary statistics due to overlapping samples. The sixth column gives the correlation estimated via direct correlation between summary statistics, the method proposed by Bolormaa et al.[18] and Zhu et al.[19] The last column gives the correlation estimated via λmeta.
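As a quick arithmetic check, the true correlations γ1,2 listed in Table 1 are consistent with the number of overlapping samples divided by the geometric mean of the two cohort sizes. This relationship is inferred here from the tabulated values rather than quoted from the text, and the function name `true_overlap_correlation` is hypothetical.

```python
import math

def true_overlap_correlation(n_overlap, n1, n2):
    """Assumed relationship: gamma = n_overlap / sqrt(n1 * n2), inferred from Table 1."""
    return n_overlap / math.sqrt(n1 * n2)

# (n_overlap, n1, n2) settings taken from the rows of Table 1
settings = [(100, 1000, 1000), (100, 1000, 2000), (100, 1000, 5000), (100, 1000, 10000),
            (1, 1000, 1000), (1, 1000, 5000), (1, 1000, 10000)]
gammas = [true_overlap_correlation(*s) for s in settings]
for s, g in zip(settings, gammas):
    print(s, round(g, 6))
```

The printed values can be compared directly with the γ1,2 entries in the table.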

Another parameterization of λmeta is to estimate it from differences in allele frequencies between a pair of cohorts instead of differences between estimated effect sizes (Supplementary Notes III; Supplementary Figure S11).

Detection of overlapping samples using pseudo profile score regression

In many circumstances, individual cohorts are not permitted to share individual-level data, either by national law or by local ethical review board conditions. Although the metric λmeta can be transformed to give an estimate of n_o (the effective number of overlapping samples) between cohorts for quantitative traits, it cannot give an estimate of overlapping samples in case–control studies due to the ratio of the cases and controls in each study. To get around this problem, Turchin and Hirschhorn[20] created a software tool, Gencrypt, which utilizes a security protocol known as one-way cryptographic hashes to allow overlapping participants to be identified without sharing individual-level data. We propose an alternative approach, pseudo profile score regression (PPSR), which involves sharing of weighted linear combinations of SNP genotypes with the central meta-analysis hub. In essence, multiple random profile scores are generated for each individual in each cohort, using SNP weights supplied by the analysis hub, and the resulting scores are provided back to the analysis hub. PPSR works through three steps (Supplementary Notes for Method IV; Supplementary Figure S12), and the purpose of PPSR is to estimate a relationship-like matrix of n1 × n2 dimension for a pair of cohorts, which have n1 and n2 individuals, respectively. Each entry of the matrix is filled with genetic similarity for a pair of samples from each of the two cohorts, estimated via the PPSR. The central hub analysts can determine the best set of SNPs that each individual cohort uses to generate PPS. Without loss of generality, a set of loci directly genotyped in all cohorts would make a good candidate set of SNPs for PPS. We use WTCCC data as an illustration to detect 2934 shared controls between any two of the diseases by PPSR. Among 330K non-palindromic loci, we randomly picked M=100, 200, and 500 SNPs, to generate pseudo profile scores.
The seven WTCCC cohorts generated 21 cohort-pair comparisons, for a total of 488 587 090 individual-pair tests. To have an experiment-wise type I error rate=0.01 and type II error rate=0.05 (power=0.95) for detecting overlapping individuals, we needed to generate at least 57 PPSs. We generated weights S=[s1,s2,s3,…,s57], where each s is a vector of M elements, sampled from a standard normal distribution. S is shared across seven cohorts for generating PPSs for each individual. In total, 57 PPSs were generated for each individual in each cohort. For a pair of cohorts, PPSR was conducted for each possible pair of individuals from the two cohorts over the generated PPSs. When the regression coefficient (b) was greater than the threshold, here b=0.95, the pair of individuals was inferred to have highly similar genotypes, implying that the individual was included in both cohorts (Supplementary Notes for Method IV). When using 200 and 500 random SNPs, all the known 2934 shared controls were detected from 21 cohort-pairwise comparisons; when using 100 random SNPs, on average 2931 shared samples were identified, which is more accurate than using λmeta constructed using either genetic effects or allele frequencies (Figure 3a). In addition, for detected overlapping samples, there were no false positives observed, consistent with simulations that show the method was conservative in controlling the type I error rate (Supplementary Notes for Method IV). For comparison, we also used Gencrypt to detect overlapping samples using the same set of SNPs as used in PPSR. Although Gencrypt guidelines suggest use of at least 20 000 random SNPs,[20] selecting 500 random SNPs in the WTCCC cohorts also provided good accuracy with Gencrypt, and on average about 2920 (99.6% of the shared controls) overlapping samples were detected, only slightly lower than PPSR.
For example, for BD and CAD, Gencrypt detected 2912 shared controls, but was unable to identify ~20 overlapping controls, due to missing data (on average 1% missing rate).
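The PPSR idea described above can be sketched in Python as follows. This is a toy illustration, not the authors' three-step procedure (which is specified in the Supplementary Notes). It assumes simulated genotypes, genotypes standardized with allele frequencies supplied by the hub, and a per-pair regression of one individual's K scores on the other's. The values M=500, K=57, and the 0.95 threshold come from the text; the cohort sizes, the frequency range, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2016)

M, K = 500, 57        # SNPs per score and number of scores, as in the WTCCC example
n1, n2 = 40, 40       # toy cohort sizes (hypothetical)
THRESHOLD = 0.95      # regression-coefficient cutoff for an overlapping individual

freq = rng.uniform(0.1, 0.5, size=M)          # assumption: frequencies supplied by the hub
weights = rng.standard_normal(size=(M, K))    # S: K weight vectors of M elements each

def simulate_genotypes(n):
    """Genotypes coded 0/1/2 for n individuals at M independent SNPs."""
    return rng.binomial(2, freq, size=(n, M)).astype(float)

def pseudo_profile_scores(genotypes):
    """Each cohort computes K weighted sums of standardized genotypes per individual."""
    z = (genotypes - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    return z @ weights                         # shape (n, K); only this is shared

def pairwise_slopes(scores_a, scores_b):
    """Slope of regressing individual j's scores (cohort b) on individual i's (cohort a)."""
    a = scores_a - scores_a.mean(axis=1, keepdims=True)
    b = scores_b - scores_b.mean(axis=1, keepdims=True)
    return (a @ b.T) / (a ** 2).sum(axis=1, keepdims=True)

geno1 = simulate_genotypes(n1)
geno2 = simulate_genotypes(n2)
shared_in_cohort1 = [3, 7, 11, 20, 39]
geno2[:5] = geno1[shared_in_cohort1]           # plant five individuals present in both cohorts

slopes = pairwise_slopes(pseudo_profile_scores(geno1), pseudo_profile_scores(geno2))
detected = sorted((int(i), int(j)) for i, j in zip(*np.where(slopes > THRESHOLD)))
print(detected)
```

Only the score matrices leave each cohort in this sketch. A pair with identical genotypes yields a regression coefficient of about 1, whereas unrelated pairs scatter around 0, which is why a high cutoff separates them.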

Figure 3

Pseudo profile score regression for pinpointing overlapping samples/relatives. (a) Each cluster represents a pair of cohorts as denoted on the x axis. Within each cluster, from left to right, the detected overlapping controls using λmeta based either on effect size estimates or minor allele frequency (MAF), and PPSR using 100, 200, and 500 markers. WTCCC cohort codes: BD for bipolar disorder, CAD for coronary artery disease, CD for Crohn’s disease, HT for hypertension, RA for rheumatoid arthritis, T1D for type 1 diabetes, and T2D for type 2 diabetes. (b) Illustration for regression coefficients between WTCCC BD and CAD from 57 pseudo profile scores (PPS) generated from 500 markers. The x axis is the PPSR regression coefficients and y axis is real genetic relatedness (as calculated from individual-level genotype data). The red points are the shared controls between two cohorts, and blue points are first-degree relatives. (c) The PPS regression coefficients for detecting overlapping first-degree relatives using 286 PPS generated from 500 markers. (d) Decoding genotypes from the PPS. Given the set of profile scores, one may run a GWAS-like analysis to infer the genotypes. The ratio between the number of markers (M) and number of pseudo profile scores (K) determines the potential discovery of individual-level information. The higher the ratio, and the higher the allele frequency, the less information can be recovered. From left to right, the profile scores generated using different numbers of markers. The y axis is an R2 metric representing the accuracy between the inferred genotypes and the real genotypes. From left to right panels, 100, 200, 500, and 1000 SNPs were used to generate 10, 20, 50, and 1000 profile scores. In each cluster, the three bars are inferred accuracy using different MAF spectrum alleles, given the SE of the mean.

Furthermore, PPSR is able to detect pairs of relatives. For example, between the BD and CAD cohorts, two pairs of apparent first-degree relatives were detected (Figure 3b). To find additional first-degree relatives between the BD and CAD cohorts, at least 286 PPSs were required to achieve a type I error rate of 0.01 and a type II error rate of 0.05 at a regression coefficient cutoff of 0.45, the threshold for first-degree relatives. As expected, all other pairs of individuals, which did not show high relatedness, did not reach the threshold of 0.45 for the PPS regression coefficient (Figure 3c). Gencrypt did not detect any first-degree relatives. PPSR uses very little personal information for each individual, and this information can be minimized so that there is a very low probability of decoding it. One way to attempt to decode the genotypes from PPS is to reverse the PPSR, so that the individual genotypes are predicted in the regression (Supplementary Notes for Method IV). The individual-level genotypic information that can be recovered by an analyst who knows the S matrix (the weights for generating PPS) is determined by the ratio between the number of markers (M) used to generate the PPS and the number of PPS (K). Therefore, the information that can be inferred about individual genotypes can be minimized and tailored to any specific ethics requirements. We suggest choosing this ratio so that privacy is protected while detection remains sufficiently accurate (Figure 3d).
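The idea behind PPSR can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the implementation used in this study: the genotypes are simulated, the weights S are drawn from a standard normal distribution, the allele frequencies are assumed to be shared by both cohorts, and the names `simulate_genotypes`, `standardize`, and `pps_regression` are hypothetical. Each cohort would only need to exchange its PPS matrix; the regression of one individual's K scores on another's then approximates their genetic relatedness.

```python
import numpy as np

rng = np.random.default_rng(2016)

M, K = 500, 300   # markers per score and number of scores (illustrative choices)
n1, n2 = 30, 30   # toy cohort sizes (assumption)

# Allele frequencies assumed known to both cohorts, e.g. from a shared reference.
freq = rng.uniform(0.1, 0.9, size=M)

def simulate_genotypes(n):
    """Draw n unrelated individuals as 0/1/2 allele counts."""
    return rng.binomial(2, freq, size=(n, M)).astype(float)

geno1 = simulate_genotypes(n1)
geno2 = simulate_genotypes(n2)

# Plant one shared sample and one parent-offspring pair across the two cohorts.
geno2[0] = geno1[0]
transmitted = rng.binomial(1, geno1[1] / 2.0)  # allele passed on by the parent
other = rng.binomial(1, freq)                  # allele from an unrelated parent
geno2[1] = transmitted + other

def standardize(geno):
    """Scale allele counts to mean 0 and variance 1 per marker."""
    return (geno - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))

# The random weight matrix S is the only object the cohorts must agree on.
S = rng.standard_normal((M, K))
pps1 = standardize(geno1) @ S   # n1 x K pseudo profile scores
pps2 = standardize(geno2) @ S   # n2 x K pseudo profile scores

def pps_regression(pps_a, pps_b):
    """Slope of regressing each row of pps_a on each row of pps_b across the K scores."""
    a = pps_a - pps_a.mean(axis=1, keepdims=True)
    b = pps_b - pps_b.mean(axis=1, keepdims=True)
    return (a @ b.T) / np.sum(b * b, axis=1)

coef = pps_regression(pps1, pps2)

# Pairs above the first-degree cutoff of 0.45 quoted in the text.
print(np.argwhere(coef > 0.45))
print(coef[0, 0], coef[1, 1])
```

In this toy setting the planted shared sample yields a coefficient of 1, the planted parent-offspring pair a coefficient scattered around 0.5, and unrelated pairs coefficients scattered around 0. The spread shrinks as K and M grow, which is why the number of PPS needed depends on the targeted error rates.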

Discussion

In this study, we provide four metrics for monitoring and improving the quality of large-scale GWAMA based on summary statistics. Using the Fst-derived genetic distance measure, we can place all cohorts on an inferred geographic map and can easily identify cohorts that are genetic outliers or that have unexpected ancestry. In application, we note that the Fst measure can also identify unusual summary information, such as that detected in the MIGEN cohorts from the GIANT Consortium GWAMAs, in which the same allele frequencies were reported for all cohorts. Meta-PCA can also be used to infer the genetic background of cohorts. The high concordance between FPC and meta-PCA indicates that both methods are robust. In practice, meta-PCA is much easier to implement when there are many cohorts, but FPC, which has closed-form analytical results, provides a theoretical grounding for meta-PCA. There are limitations to both FPC and meta-PCA. First, FPC depends on the choice of reference cohorts, such as the 1KG reference cohorts, and the projection may be slightly different when other reference cohorts are adopted. As with any PCA, the projection from meta-PCA depends on the context of all cohorts, and the inclusion or exclusion of other cohorts will change the projection slightly. However, we believe that this will not influence the inference of the genetic background of cohorts in a meta-analysis. Second, various mechanisms can give an identical projection in PCA. The purpose of both methods is to find discordance between demographic information and genetic information, or outliers, in GWAMA. Our third metric, λmeta, provides information on sample overlap and heterogeneity between cohorts by utilizing the estimated allelic effect sizes and their standard errors. In most meta-analyses, the overall λmeta is likely to be slightly >1 solely because of unknown heterogeneity, slight as observed, in generating the phenotype and genotype data that cannot be accounted for by QC.
The observed mean of λmeta for the GIANT height GWAMA was 1.03, but with more variation than expected by chance. The strong correlation between λGC and λmeta indicated that the sampling variances of the reported data were systematically driven by analysis protocols, such as single-marker regression and linear mixed model methods. For cohorts with λGC<1 and λmeta<1, it is likely that the modeling strategy employed for the GWAS in the cohort was too conservative; e.g., the MIGEN cohorts might on average have too small a sample size per cohort. Conversely, for cohorts with λGC>1 and λmeta>1, results are too heterogeneous, perhaps reflecting systematically smaller sampling variances of the reported genetic effects. As GWAMA often uses inverse-variance-weighted meta-analysis,[21] such cohorts may lead to incorrect weights being assigned to the different cohorts in the meta-analysis, suggesting that the statistical analysis in meta-analyses can be improved by applying better weighting factors. It is well recognised that overlapping samples may inflate the type I error rate of GWAMA and therefore lead to false positives. Although post hoc correction of the test statistic is possible,[18, 19, 21] stringent QC that rules out overlapping samples makes the whole analysis easier and lowers the risk of false positives. A better solution would be to rule out shared samples at the start, for pairs of cohorts that show deflated λmeta, and we propose PPSR to accomplish this.
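To make the point about weights concrete, the following is a minimal sketch of the standard fixed-effect inverse-variance-weighted estimator for a single SNP. The function name `inverse_variance_meta` and the numbers are illustrative assumptions, not values from this study. It shows how a cohort that reports a standard error that is too small pulls the pooled estimate towards its own effect.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect pooled estimate: weights are 1 / SE^2 for each cohort."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se

# Two hypothetical cohorts reporting effects for the same SNP.
beta_ok, se_ok = inverse_variance_meta([0.10, 0.20], [0.10, 0.10])

# Same effects, but the second cohort reports half the standard error.
beta_shifted, se_shifted = inverse_variance_meta([0.10, 0.20], [0.10, 0.05])

print(beta_ok, se_ok)
print(beta_shifted, se_shifted)
```

With equal standard errors the pooled effect is the simple average of the two cohorts; halving the second cohort's standard error quadruples its weight and moves the pooled effect towards 0.20.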

20 in total

1. Meta-analysis of genome-wide association studies with overlapping subjects.

Authors: Dan-Yu Lin; Patrick F Sullivan
Journal: Am J Hum Genet Date: 2009-12 Impact factor: 11.025

2. The Northern Swedish Population Health Study (NSPHS)--a paradigmatic study in a rural population combining community health and basic research.

Authors: Wilmar Igl; Asa Johansson; Ulf Gyllensten
Journal: Rural Remote Health Date: 2010-06-18 Impact factor: 1.759

3. Genes mirror geography within Europe.

Authors: John Novembre; Toby Johnson; Katarzyna Bryc; Zoltán Kutalik; Adam R Boyko; Adam Auton; Amit Indap; Karen S King; Sven Bergmann; Matthew R Nelson; Matthew Stephens; Carlos D Bustamante
Journal: Nature Date: 2008-08-31 Impact factor: 49.962

4. Genome-wide association study identifies five new schizophrenia loci.

Authors:
Journal: Nat Genet Date: 2011-09-18 Impact factor: 38.330

5. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917

6. Genetics of rheumatoid arthritis contributes to biology and drug discovery.

Authors: Yukinori Okada; Di Wu; Gosia Trynka; Towfique Raj; Chikashi Terao; Katsunori Ikari; Yuta Kochi; Koichiro Ohmura; Akari Suzuki; Shinji Yoshida; Robert R Graham; Arun Manoharan; Ward Ortmann; Tushar Bhangale; Joshua C Denny; Robert J Carroll; Anne E Eyler; Jeffrey D Greenberg; Joel M Kremer; Dimitrios A Pappas; Lei Jiang; Jian Yin; Lingying Ye; Ding-Feng Su; Jian Yang; Gang Xie; Ed Keystone; Harm-Jan Westra; Tõnu Esko; Andres Metspalu; Xuezhong Zhou; Namrata Gupta; Daniel Mirel; Eli A Stahl; Dorothée Diogo; Jing Cui; Katherine Liao; Michael H Guo; Keiko Myouzen; Takahisa Kawaguchi; Marieke J H Coenen; Piet L C M van Riel; Mart A F J van de Laar; Henk-Jan Guchelaar; Tom W J Huizinga; Philippe Dieudé; Xavier Mariette; S Louis Bridges; Alexandra Zhernakova; Rene E M Toes; Paul P Tak; Corinne Miceli-Richard; So-Young Bang; Hye-Soon Lee; Javier Martin; Miguel A Gonzalez-Gay; Luis Rodriguez-Rodriguez; Solbritt Rantapää-Dahlqvist; Lisbeth Arlestig; Hyon K Choi; Yoichiro Kamatani; Pilar Galan; Mark Lathrop; Steve Eyre; John Bowes; Anne Barton; Niek de Vries; Larry W Moreland; Lindsey A Criswell; Elizabeth W Karlson; Atsuo Taniguchi; Ryo Yamada; Michiaki Kubo; Jun S Liu; Sang-Cheol Bae; Jane Worthington; Leonid Padyukov; Lars Klareskog; Peter K Gregersen; Soumya Raychaudhuri; Barbara E Stranger; Philip L De Jager; Lude Franke; Peter M Visscher; Matthew A Brown; Hisashi Yamanaka; Tsuneyo Mimori; Atsushi Takahashi; Huji Xu; Timothy W Behrens; Katherine A Siminovitch; Shigeki Momohara; Fumihiko Matsuda; Kazuhiko Yamamoto; Robert M Plenge
Journal: Nature Date: 2013-12-25 Impact factor: 49.962

7. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

9. Genome-wide association analysis identifies 13 new risk loci for schizophrenia.

Authors: Stephan Ripke; Colm O'Dushlaine; Kimberly Chambert; Jennifer L Moran; Anna K Kähler; Susanne Akterin; Sarah E Bergen; Ann L Collins; James J Crowley; Menachem Fromer; Yunjung Kim; Sang Hong Lee; Patrik K E Magnusson; Nick Sanchez; Eli A Stahl; Stephanie Williams; Naomi R Wray; Kai Xia; Francesco Bettella; Anders D Borglum; Brendan K Bulik-Sullivan; Paul Cormican; Nick Craddock; Christiaan de Leeuw; Naser Durmishi; Michael Gill; Vera Golimbet; Marian L Hamshere; Peter Holmans; David M Hougaard; Kenneth S Kendler; Kuang Lin; Derek W Morris; Ole Mors; Preben B Mortensen; Benjamin M Neale; Francis A O'Neill; Michael J Owen; Milica Pejovic Milovancevic; Danielle Posthuma; John Powell; Alexander L Richards; Brien P Riley; Douglas Ruderfer; Dan Rujescu; Engilbert Sigurdsson; Teimuraz Silagadze; August B Smit; Hreinn Stefansson; Stacy Steinberg; Jaana Suvisaari; Sarah Tosato; Matthijs Verhage; James T Walters; Douglas F Levinson; Pablo V Gejman; Kenneth S Kendler; Claudine Laurent; Bryan J Mowry; Michael C O'Donovan; Michael J Owen; Ann E Pulver; Brien P Riley; Sibylle G Schwab; Dieter B Wildenauer; Frank Dudbridge; Peter Holmans; Jianxin Shi; Margot Albus; Madeline Alexander; Dominique Campion; David Cohen; Dimitris Dikeos; Jubao Duan; Peter Eichhammer; Stephanie Godard; Mark Hansen; F Bernard Lerer; Kung-Yee Liang; Wolfgang Maier; Jacques Mallet; Deborah A Nertney; Gerald Nestadt; Nadine Norton; Francis A O'Neill; George N Papadimitriou; Robert Ribble; Alan R Sanders; Jeremy M Silverman; Dermot Walsh; Nigel M Williams; Brandon Wormley; Maria J Arranz; Steven Bakker; Stephan Bender; Elvira Bramon; David Collier; Benedicto Crespo-Facorro; Jeremy Hall; Conrad Iyegbe; Assen Jablensky; Rene S Kahn; Luba Kalaydjieva; Stephen Lawrie; Cathryn M Lewis; Kuang Lin; Don H Linszen; Ignacio Mata; Andrew McIntosh; Robin M Murray; Roel A Ophoff; John Powell; Dan Rujescu; Jim Van Os; Muriel Walshe; Matthias Weisbrod; Durk Wiersma; Peter Donnelly; Ines Barroso; Jenefer M 
Blackwell; Elvira Bramon; Matthew A Brown; Juan P Casas; Aiden P Corvin; Panos Deloukas; Audrey Duncanson; Janusz Jankowski; Hugh S Markus; Christopher G Mathew; Colin N A Palmer; Robert Plomin; Anna Rautanen; Stephen J Sawcer; Richard C Trembath; Ananth C Viswanathan; Nicholas W Wood; Chris C A Spencer; Gavin Band; Céline Bellenguez; Colin Freeman; Garrett Hellenthal; Eleni Giannoulatou; Matti Pirinen; Richard D Pearson; Amy Strange; Zhan Su; Damjan Vukcevic; Peter Donnelly; Cordelia Langford; Sarah E Hunt; Sarah Edkins; Rhian Gwilliam; Hannah Blackburn; Suzannah J Bumpstead; Serge Dronov; Matthew Gillman; Emma Gray; Naomi Hammond; Alagurevathi Jayakumar; Owen T McCann; Jennifer Liddle; Simon C Potter; Radhi Ravindrarajah; Michelle Ricketts; Avazeh Tashakkori-Ghanbaria; Matthew J Waller; Paul Weston; Sara Widaa; Pamela Whittaker; Ines Barroso; Panos Deloukas; Christopher G Mathew; Jenefer M Blackwell; Matthew A Brown; Aiden P Corvin; Mark I McCarthy; Chris C A Spencer; Elvira Bramon; Aiden P Corvin; Michael C O'Donovan; Kari Stefansson; Edward Scolnick; Shaun Purcell; Steven A McCarroll; Pamela Sklar; Christina M Hultman; Patrick F Sullivan
Journal: Nat Genet Date: 2013-08-25 Impact factor: 38.330

10. Defining the role of common variation in the genomic and biological architecture of adult human height.

Authors: Andrew R Wood; Tonu Esko; Jian Yang; Sailaja Vedantam; Tune H Pers; Stefan Gustafsson; Audrey Y Chu; Karol Estrada; Jian'an Luan; Zoltán Kutalik; Najaf Amin; Martin L Buchkovich; Damien C Croteau-Chonka; Felix R Day; Yanan Duan; Tove Fall; Rudolf Fehrmann; Teresa Ferreira; Anne U Jackson; Juha Karjalainen; Ken Sin Lo; Adam E Locke; Reedik Mägi; Evelin Mihailov; Eleonora Porcu; Joshua C Randall; André Scherag; Anna A E Vinkhuyzen; Harm-Jan Westra; Thomas W Winkler; Tsegaselassie Workalemahu; Jing Hua Zhao; Devin Absher; Eva Albrecht; Denise Anderson; Jeffrey Baron; Marian Beekman; Ayse Demirkan; Georg B Ehret; Bjarke Feenstra; Mary F Feitosa; Krista Fischer; Ross M Fraser; Anuj Goel; Jian Gong; Anne E Justice; Stavroula Kanoni; Marcus E Kleber; Kati Kristiansson; Unhee Lim; Vaneet Lotay; Julian C Lui; Massimo Mangino; Irene Mateo Leach; Carolina Medina-Gomez; Michael A Nalls; Dale R Nyholt; Cameron D Palmer; Dorota Pasko; Sonali Pechlivanis; Inga Prokopenko; Janina S Ried; Stephan Ripke; Dmitry Shungin; Alena Stancáková; Rona J Strawbridge; Yun Ju Sung; Toshiko Tanaka; Alexander Teumer; Stella Trompet; Sander W van der Laan; Jessica van Setten; Jana V Van Vliet-Ostaptchouk; Zhaoming Wang; Loïc Yengo; Weihua Zhang; Uzma Afzal; Johan Arnlöv; Gillian M Arscott; Stefania Bandinelli; Amy Barrett; Claire Bellis; Amanda J Bennett; Christian Berne; Matthias Blüher; Jennifer L Bolton; Yvonne Böttcher; Heather A Boyd; Marcel Bruinenberg; Brendan M Buckley; Steven Buyske; Ida H Caspersen; Peter S Chines; Robert Clarke; Simone Claudi-Boehm; Matthew Cooper; E Warwick Daw; Pim A De Jong; Joris Deelen; Graciela Delgado; Josh C Denny; Rosalie Dhonukshe-Rutten; Maria Dimitriou; Alex S F Doney; Marcus Dörr; Niina Eklund; Elodie Eury; Lasse Folkersen; Melissa E Garcia; Frank Geller; Vilmantas Giedraitis; Alan S Go; Harald Grallert; Tanja B Grammer; Jürgen Gräßler; Henrik Grönberg; Lisette C P G M de Groot; Christopher J Groves; Jeffrey Haessler; Per Hall; Toomas Haller; Goran 
Hallmans; Anke Hannemann; Catharina A Hartman; Maija Hassinen; Caroline Hayward; Nancy L Heard-Costa; Quinta Helmer; Gibran Hemani; Anjali K Henders; Hans L Hillege; Mark A Hlatky; Wolfgang Hoffmann; Per Hoffmann; Oddgeir Holmen; Jeanine J Houwing-Duistermaat; Thomas Illig; Aaron Isaacs; Alan L James; Janina Jeff; Berit Johansen; Åsa Johansson; Jennifer Jolley; Thorhildur Juliusdottir; Juhani Junttila; Abel N Kho; Leena Kinnunen; Norman Klopp; Thomas Kocher; Wolfgang Kratzer; Peter Lichtner; Lars Lind; Jaana Lindström; Stéphane Lobbens; Mattias Lorentzon; Yingchang Lu; Valeriya Lyssenko; Patrik K E Magnusson; Anubha Mahajan; Marc Maillard; Wendy L McArdle; Colin A McKenzie; Stela McLachlan; Paul J McLaren; Cristina Menni; Sigrun Merger; Lili Milani; Alireza Moayyeri; Keri L Monda; Mario A Morken; Gabriele Müller; Martina Müller-Nurasyid; Arthur W Musk; Narisu Narisu; Matthias Nauck; Ilja M Nolte; Markus M Nöthen; Laticia Oozageer; Stefan Pilz; Nigel W Rayner; Frida Renstrom; Neil R Robertson; Lynda M Rose; Ronan Roussel; Serena Sanna; Hubert Scharnagl; Salome Scholtens; Fredrick R Schumacher; Heribert Schunkert; Robert A Scott; Joban Sehmi; Thomas Seufferlein; Jianxin Shi; Karri Silventoinen; Johannes H Smit; Albert Vernon Smith; Joanna Smolonska; Alice V Stanton; Kathleen Stirrups; David J Stott; Heather M Stringham; Johan Sundström; Morris A Swertz; Ann-Christine Syvänen; Bamidele O Tayo; Gudmar Thorleifsson; Jonathan P Tyrer; Suzanne van Dijk; Natasja M van Schoor; Nathalie van der Velde; Diana van Heemst; Floor V A van Oort; Sita H Vermeulen; Niek Verweij; Judith M Vonk; Lindsay L Waite; Melanie Waldenberger; Roman Wennauer; Lynne R Wilkens; Christina Willenborg; Tom Wilsgaard; Mary K Wojczynski; Andrew Wong; Alan F Wright; Qunyuan Zhang; Dominique Arveiler; Stephan J L Bakker; John Beilby; Richard N Bergman; Sven Bergmann; Reiner Biffar; John Blangero; Dorret I Boomsma; Stefan R Bornstein; Pascal Bovet; Paolo Brambilla; Morris J Brown; Harry Campbell; Mark J 
Caulfield; Aravinda Chakravarti; Rory Collins; Francis S Collins; Dana C Crawford; L Adrienne Cupples; John Danesh; Ulf de Faire; Hester M den Ruijter; Raimund Erbel; Jeanette Erdmann; Johan G Eriksson; Martin Farrall; Ele Ferrannini; Jean Ferrières; Ian Ford; Nita G Forouhi; Terrence Forrester; Ron T Gansevoort; Pablo V Gejman; Christian Gieger; Alain Golay; Omri Gottesman; Vilmundur Gudnason; Ulf Gyllensten; David W Haas; Alistair S Hall; Tamara B Harris; Andrew T Hattersley; Andrew C Heath; Christian Hengstenberg; Andrew A Hicks; Lucia A Hindorff; Aroon D Hingorani; Albert Hofman; G Kees Hovingh; Steve E Humphries; Steven C Hunt; Elina Hypponen; Kevin B Jacobs; Marjo-Riitta Jarvelin; Pekka Jousilahti; Antti M Jula; Jaakko Kaprio; John J P Kastelein; Manfred Kayser; Frank Kee; Sirkka M Keinanen-Kiukaanniemi; Lambertus A Kiemeney; Jaspal S Kooner; Charles Kooperberg; Seppo Koskinen; Peter Kovacs; Aldi T Kraja; Meena Kumari; Johanna Kuusisto; Timo A Lakka; Claudia Langenberg; Loic Le Marchand; Terho Lehtimäki; Sara Lupoli; Pamela A F Madden; Satu Männistö; Paolo Manunta; André Marette; Tara C Matise; Barbara McKnight; Thomas Meitinger; Frans L Moll; Grant W Montgomery; Andrew D Morris; Andrew P Morris; Jeffrey C Murray; Mari Nelis; Claes Ohlsson; Albertine J Oldehinkel; Ken K Ong; Willem H Ouwehand; Gerard Pasterkamp; Annette Peters; Peter P Pramstaller; Jackie F Price; Lu Qi; Olli T Raitakari; Tuomo Rankinen; D C Rao; Treva K Rice; Marylyn Ritchie; Igor Rudan; Veikko Salomaa; Nilesh J Samani; Jouko Saramies; Mark A Sarzynski; Peter E H Schwarz; Sylvain Sebert; Peter Sever; Alan R Shuldiner; Juha Sinisalo; Valgerdur Steinthorsdottir; Ronald P Stolk; Jean-Claude Tardif; Anke Tönjes; Angelo Tremblay; Elena Tremoli; Jarmo Virtamo; Marie-Claude Vohl; Philippe Amouyel; Folkert W Asselbergs; Themistocles L Assimes; Murielle Bochud; Bernhard O Boehm; Eric Boerwinkle; Erwin P Bottinger; Claude Bouchard; Stéphane Cauchi; John C Chambers; Stephen J Chanock; Richard S Cooper; 
Paul I W de Bakker; George Dedoussis; Luigi Ferrucci; Paul W Franks; Philippe Froguel; Leif C Groop; Christopher A Haiman; Anders Hamsten; M Geoffrey Hayes; Jennie Hui; David J Hunter; Kristian Hveem; J Wouter Jukema; Robert C Kaplan; Mika Kivimaki; Diana Kuh; Markku Laakso; Yongmei Liu; Nicholas G Martin; Winfried März; Mads Melbye; Susanne Moebus; Patricia B Munroe; Inger Njølstad; Ben A Oostra; Colin N A Palmer; Nancy L Pedersen; Markus Perola; Louis Pérusse; Ulrike Peters; Joseph E Powell; Chris Power; Thomas Quertermous; Rainer Rauramaa; Eva Reinmaa; Paul M Ridker; Fernando Rivadeneira; Jerome I Rotter; Timo E Saaristo; Danish Saleheen; David Schlessinger; P Eline Slagboom; Harold Snieder; Tim D Spector; Konstantin Strauch; Michael Stumvoll; Jaakko Tuomilehto; Matti Uusitupa; Pim van der Harst; Henry Völzke; Mark Walker; Nicholas J Wareham; Hugh Watkins; H-Erich Wichmann; James F Wilson; Pieter Zanen; Panos Deloukas; Iris M Heid; Cecilia M Lindgren; Karen L Mohlke; Elizabeth K Speliotes; Unnur Thorsteinsdottir; Inês Barroso; Caroline S Fox; Kari E North; David P Strachan; Jacques S Beckmann; Sonja I Berndt; Michael Boehnke; Ingrid B Borecki; Mark I McCarthy; Andres Metspalu; Kari Stefansson; André G Uitterlinden; Cornelia M van Duijn; Lude Franke; Cristen J Willer; Alkes L Price; Guillaume Lettre; Ruth J F Loos; Michael N Weedon; Erik Ingelsson; Jeffrey R O'Connell; Goncalo R Abecasis; Daniel I Chasman; Michael E Goddard; Peter M Visscher; Joel N Hirschhorn; Timothy M Frayling
Journal: Nat Genet Date: 2014-10-05 Impact factor: 38.330
