
Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction.

Yiming Hu¹, Qiongshi Lu¹, Wei Liu², Yuhua Zhang³, Mo Li¹, Hongyu Zhao^1,4,5,6.

Abstract

Accurate prediction of disease risk based on genetic factors is an important goal in human genetics research and precision medicine. Advanced prediction models will lead to more effective disease prevention and treatment strategies. Despite the identification of thousands of disease-associated genetic variants through genome-wide association studies (GWAS) in the past decade, accuracy of genetic risk prediction remains moderate for most diseases, which is largely due to the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes. In this work, we introduce PleioPred, a principled framework that leverages pleiotropy and functional annotations in genetic risk prediction for complex diseases. PleioPred uses GWAS summary statistics as its input, and jointly models multiple genetically correlated diseases and a variety of external information including linkage disequilibrium and diverse functional annotations to increase the accuracy of risk prediction. Through comprehensive simulations and real data analyses on Crohn's disease, celiac disease and type-II diabetes, we demonstrate that our approach can substantially increase the accuracy of polygenic risk prediction and risk population stratification, i.e. PleioPred can significantly better separate type-II diabetes patients with early and late onset ages, illustrating its potential clinical application. Furthermore, we show that the increment in prediction accuracy is significantly correlated with the genetic correlation between the predicted and jointly modeled diseases.

Year: 2017 PMID: 28598966 PMCID: PMC5482506 DOI: 10.1371/journal.pgen.1006836

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917

Introduction

Achieving accurate disease risk prediction using genetic information is a major goal in human genetics research and precision medicine. Accurate prediction models will have great impacts on disease prevention and treatment strategies[1]. Various approaches that utilize genome-wide data in genetic risk prediction have been proposed, including machine-learning models trained on individual-level genotype and phenotype data[2-7], and polygenic risk scores (PRS) derived from genome-wide association study (GWAS) summary statistics [8, 9]. Despite the potential information loss in summary data, PRS-based approaches have been widely adopted in practice due to computational efficiency and the easy accessibility of GWAS summary level data[10, 11]. However, prediction accuracies for most complex diseases remain moderate, which is largely due to the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes in the presence of linkage disequilibrium (LD) [12]. Integrating external information, e.g. pleiotropy [2, 3], LD [9], and functional annotations[13], has been shown to effectively address these challenges. Maier et al.[3] and Li et al.[2] showed that joint modeling of correlated traits could increase the prediction accuracy using individual level genotype data for psychiatric disorders and autoimmune diseases. Using summary level data, Hu et al.[13] proposed a single-trait risk prediction framework explicitly modeling LD and functional annotations, which consistently improves prediction accuracy for complex diseases. Furthermore, integrative genomic functional annotation, coupled with the rich collection of summary statistics from GWAS, has enabled increased statistical power in several different settings [14, 15].
Here, we introduce PleioPred (available at https://github.com/yiminghu/PleioPred), a principled framework that integrates GWAS summary statistics of genetically correlated diseases with various types of annotation data and reference genotype panels to improve risk prediction accuracy. Incorporating data from related traits and functional annotations increases the effective sample size and statistical power to detect functionally relevant variants, especially when diseases share similar genetic architecture. We compare PleioPred with state-of-the-art single-trait PRS-based approaches and demonstrate its consistent improvement in risk prediction performance using real data of multiple complex diseases. We first apply PleioPred to Crohn’s disease (CD), celiac disease (CEL) and type-II diabetes (T2D) by jointly modeling them with known correlated diseases (i.e. CD with Ulcerative Colitis (UC); CEL with UC; T2D with coronary artery disease (CAD)) and show a statistically significant improvement in prediction performance in independent validation cohort over single-trait models. By comparing two-trait prediction model with and without functional annotations in both simulation and real data analysis, we demonstrate that functional annotation may further improve the performance of joint modeling. Furthermore, we show that PRS calculated from PleioPred can effectively partition T2D patients by their age of onset, indicating the potential clinical usage of our approach[16, 17]. Through jointly modeling T2D with a wide spectrum of diseases, we demonstrate that the increment in prediction accuracy is significantly correlated with the genetic correlations between T2D and the jointly modeled diseases.

Results

Methods overview

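We propose a Bayesian framework to incorporate functional annotations and pleiotropy. We assume throughout the report that the phenotypes of two diseases, $Y_1$ ($N_1 \times 1$) and $Y_2$ ($N_2 \times 1$), and the genotypes, $X$ ($N_1 \times M$) and $Z$ ($N_2 \times M$), are standardized with mean zero and variance one. When phenotypes are binary, $Y_1$ and $Y_2$ denote disease liabilities instead [18, 19]. Here $N_1$ and $N_2$ denote the sample sizes for the two diseases and $M$ is the number of markers. We assume a linear model with genotype matrices, effect sizes (β and γ) and random errors (ε and δ) mutually independent as follows:

$$Y_1 = X\beta + \varepsilon, \qquad Y_2 = Z\gamma + \delta.$$

We also assume that the effect sizes of different SNPs are independent. As for random errors, we assume that

$$\varepsilon \sim N\left(0, (1 - h_1^2) I_{N_1}\right), \qquad \delta \sim N\left(0, (1 - h_2^2) I_{N_2}\right), \qquad \mathrm{cov}(\varepsilon_i, \delta_j) = \begin{cases} \rho_e & \text{if } i \text{ and } j \text{ index the same individual,} \\ 0 & \text{otherwise,} \end{cases}$$

where $h_1^2$ and $h_2^2$ denote the heritability of two diseases and $\rho_e$ measures the covariance within the overlapping individuals between two studies. Denote the LD matrix and marginal effect size estimator from GWAS as: $\hat D_1 = X^T X / N_1$, $\hat D_2 = Z^T Z / N_2$ and $\hat\beta = X^T Y_1 / N_1$, $\hat\gamma = Z^T Y_2 / N_2$. In practice, $\hat D_1$ and $\hat D_2$ can be estimated from a reference panel and we therefore denote the LD matrix as $\hat D$ for convenience. Then following the derivation in Hu et al. [13], we can derive the conditional distribution of GWAS summary statistics as

$$\begin{pmatrix} \hat\beta \\ \hat\gamma \end{pmatrix} \Big|\, \beta, \gamma \sim N\left( \begin{pmatrix} \hat D \beta \\ \hat D \gamma \end{pmatrix}, \begin{pmatrix} \frac{1 - h_1^2}{N_1} \hat D & \frac{\rho_e N_s}{N_1 N_2} \hat D \\ \frac{\rho_e N_s}{N_1 N_2} \hat D & \frac{1 - h_2^2}{N_2} \hat D \end{pmatrix} \right),$$

where $N_s$ is the number of overlapping samples between the two studies. When $N_s$ is relatively small, we can discard terms with $N_s / (N_1 N_2)$ to reduce the computation burden. We first consider an infinitesimal model to account for a polygenic genetic architecture. We assume that the effect sizes follow a multivariate normal distribution:

$$\begin{pmatrix} \beta_i \\ \gamma_i \end{pmatrix} \sim N\left( 0, \begin{pmatrix} \sigma_{1i}^2 & \rho\, \sigma_{1i} \sigma_{2i} \\ \rho\, \sigma_{1i} \sigma_{2i} & \sigma_{2i}^2 \end{pmatrix} \right),$$

where $\sigma_{1i}^2$ and $\sigma_{2i}^2$ denote the variance of effect sizes of SNP i and $\rho := \mathrm{cor}(\beta_i, \gamma_i)$ represents the genetic correlation between two diseases. This is equivalent to a multivariate random effects model with various variance components. Suppose that the whole genome is partitioned into K functional regions $A_1, \dots, A_K$. We assume that the effect size of a SNP depends on the functional regions it falls in and the effect sizes are additive in the overlapping regions. To be specific, we have

$$\sigma_{ji}^2 = \sum_{k:\, i \in A_k} \tau_{jk}, \qquad j = 1, 2,$$

where $\tau_{jk}$ denotes the variance of the effect size of SNPs on disease j falling in $A_k$ alone.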
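The additive annotation assumption can be illustrated with a short sketch. It assumes each annotation is supplied as a set of SNP indices with one variance parameter per annotation for a given disease; the function name `per_snp_variance` and the toy numbers are hypothetical and are not taken from the PleioPred code.

```python
def per_snp_variance(n_snps, annotations, tau):
    """Sum annotation variance parameters over the regions each SNP falls in.

    annotations: list of sets of SNP indices (one set per annotation, may overlap).
    tau: list of variance parameters for one disease, one per annotation.
    """
    sigma2 = [0.0] * n_snps
    for region, tau_k in zip(annotations, tau):
        for i in region:
            sigma2[i] += tau_k
    return sigma2


# Toy example (hypothetical values): a baseline annotation covering all 5 SNPs
# plus two overlapping annotations.
annotations = [set(range(5)), {0, 1, 2}, {2, 3}]
tau = [1e-5, 4e-5, 2e-5]
sigma2 = per_snp_variance(5, annotations, tau)
```

In this toy example SNP 2 sits in all three annotations and receives the sum of the three parameters, while SNP 4 only receives the baseline contribution.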
In the random effects model, the variance of effect size can be interpreted as heritability and thus for convenience, we will use heritability of SNP i instead of the variance of effect size in the rest of the manuscript. Details on parameter estimation are described in Methods. When all the parameters are specified, we can estimate the expectation of the effect sizes given the marginal effect size estimators of two diseases. The PRSs are defined as

$$\mathrm{PRS}_1 = \sum_{i=1}^{M} x_i\, E(\beta_i \mid \hat\beta, \hat\gamma, \hat D), \qquad \mathrm{PRS}_2 = \sum_{i=1}^{M} z_i\, E(\gamma_i \mid \hat\beta, \hat\gamma, \hat D),$$

where $x_i$ and $z_i$ are an individual's standardized genotypes at SNP i. Finally, we treat ρ as a tuning parameter and the posterior expectation of the effect sizes can be calculated in closed form (Methods). In practice[9, 13], we note that a sparse model yields higher accuracy for most diseases. Moreover, the infinitesimal model assumption is relatively strong in some cases. For example, two related diseases may only share some causal variants and have no correlation among the effect sizes, or the correlation structures may vary across the genome. We therefore propose a hierarchical Bayesian model with a more general assumption and we refer to this framework as the non-infinitesimal model. Under this model, we assume that the effect sizes follow a mixture distribution. That is, the effect sizes of SNP i for the two diseases follow a mixture distribution with two independent normal distributions (when SNP i is causal in both diseases), joint normal and point mass (when SNP i is causal in only one disease) and joint point mass (when SNP i is not causal in either disease) [20]. Although we do not have a closed form solution for the posterior expectation of the effect sizes, we use Markov Chain Monte Carlo (MCMC) to sample from the posterior distribution of the effect sizes to estimate the posterior expectation (Methods).
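To make the effect of the tuning parameter ρ concrete, the following is a minimal sketch of the infinitesimal-model shrinkage for a single SNP in a deliberately simplified setting. It assumes no LD (the LD matrix is the identity), no sample overlap, and sampling variances of 1/N1 and 1/N2 (that is, 1 - h² is rounded to 1). It is not the region-based closed form from Methods, and the function names are hypothetical.

```python
import math


def posterior_mean_unlinked(beta_hat, gamma_hat, s1, s2, rho, n1, n2):
    """Posterior mean of (beta_i, gamma_i) for one SNP, ignoring LD and sample overlap.

    Prior: bivariate normal with variances s1, s2 and correlation rho.
    Sampling noise: independent, with variances 1/n1 and 1/n2 (assumption).
    """
    v1, v2 = 1.0 / n1, 1.0 / n2
    c = rho * math.sqrt(s1 * s2)          # prior covariance between the two effects
    det = (s1 + v1) * (s2 + v2) - c * c   # determinant of (prior + noise) covariance
    # Rows of Sigma * (Sigma + V)^{-1} applied to the marginal estimates.
    post_beta = ((s1 * (s2 + v2) - c * c) * beta_hat + c * v1 * gamma_hat) / det
    post_gamma = (c * v2 * beta_hat + (s2 * (s1 + v1) - c * c) * gamma_hat) / det
    return post_beta, post_gamma


def polygenic_score(genotypes, weights):
    """PRS for one individual: weighted sum of standardized genotypes."""
    return sum(x * w for x, w in zip(genotypes, weights))


# Same marginal estimates, with and without genetic correlation (toy values).
independent = posterior_mean_unlinked(0.1, 0.1, 1e-4, 1e-4, 0.0, 10_000, 10_000)
correlated = posterior_mean_unlinked(0.1, 0.1, 1e-4, 1e-4, 0.5, 10_000, 10_000)
```

With ρ = 0 the estimate reduces to the familiar single-trait shrinkage factor s1 / (s1 + 1/N1). With ρ > 0 and a concordant signal in the second disease, the estimate for the first disease is shrunk less, which is the sense in which the related trait adds information.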
For both infinitesimal and non-infinitesimal models, we used a total of 61 different annotation categories, including functional genome predicted by GenoCanyon scores [14], GenoSkyline tissue-specific functionality scores of 7 tissue types [15], and 53 baseline annotations for diverse genomic features [21]. More specifically, GenoCanyon is a statistical framework to predict functional regions in the human genome through integrative analysis of ENCODE epigenomic data and multiple conservation metrics [14]. Later we further extended the framework and developed GenoSkyline, which aimed to predict tissue-specific functionality [15]. We smoothed GenoCanyon scores by a 10Kb window, a strategy previously shown to improve robustness of functionality prediction [22]. The smoothed GenoCanyon annotation and raw GenoSkyline annotations of seven tissue types were dichotomized based on a cutoff of 0.5. The regions with GenoCanyon or GenoSkyline scores greater than the cutoff are interpreted as non-tissue-specific or tissue-specific functional regions in the human genome. Such dichotomization has been previously shown to be robust against the cutoff choice [15]. We compare the prediction performance of eight methods, corresponding to infinitesimal and non-infinitesimal versions of single-trait and two-trait approaches with and without functional annotations. As shown in [9, 13], LDpred and AnnoPred outperform other state-of-the-art PRS methods; we therefore use these two approaches as the representative single-trait prediction methods.
AnnoPred-inf/AnnoPred: single-trait prediction model with 61 functional annotations
LDpred-inf/LDpred: single-trait prediction model without functional annotations, corresponding to a special case of AnnoPred when assuming only one annotation covering the whole genome
PleioPred-anno-inf/PleioPred-anno: two-trait prediction model with 61 functional annotations
PleioPred-inf/PleioPred: two-trait prediction model without functional annotations, corresponding to a special case of PleioPred-anno when assuming only one annotation covering the whole genome

All of the methods studied require a pre-specified tuning parameter except for PleioPred and PleioPred-anno. To select a suitable tuning parameter, we divided the independent testing dataset (individual level genotype and phenotype data) into two equal, non-overlapping parts (A and B), and selected the tuning parameters by optimizing prediction accuracy on dataset A. We then evaluated prediction accuracy using the remaining half of the testing data, i.e. dataset B. Finally, we repeated the analysis one more time by choosing the tuning parameter on dataset B while evaluating the prediction accuracy on dataset A. Results from these two separate analyses were averaged to quantify model performance. Ideally, the parameter should be tuned in an independent cohort and then evaluated in another independent cohort. However, it is very challenging to find two independent cohorts without any overlapping samples with the training GWAS and we therefore chose a cross-validation scheme. In real data analysis, tuning the parameter within the same cohort may lead to slightly over-optimistic results due to possible shared confounders. However, the proposed non-infinitesimal models address this issue via a hierarchical Bayesian approach that avoids a tuning parameter and thus result in more robust and generalizable estimation.
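The two-fold tuning scheme described above can be sketched as follows. This is an illustration only: it assumes the PRS for every candidate tuning-parameter value has already been computed for each test individual, uses the correlation between PRS and disease status as the accuracy measure, and the names `two_fold_accuracy` and `prs_by_param` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_fold_accuracy(prs_by_param, status, fold_a, fold_b):
    """Tune on one half, evaluate on the other, swap the halves, and average.

    prs_by_param: dict mapping a tuning-parameter value to per-individual PRS.
    status: case/control labels (1/0) for the same individuals.
    fold_a, fold_b: non-overlapping lists of individual indices.
    """
    def cor(param, idx):
        return pearson([prs_by_param[param][i] for i in idx],
                       [status[i] for i in idx])

    results = []
    for tune, test in ((fold_a, fold_b), (fold_b, fold_a)):
        best = max(prs_by_param, key=lambda p: cor(p, tune))  # chosen on the tuning half
        results.append(cor(best, test))                       # scored on the held-out half
    return sum(results) / len(results)


# Toy data (hypothetical): one parameter value tracks status exactly, one is noisy.
status = [0, 1, 0, 1, 0, 1, 0, 1]
prs_by_param = {
    0.1: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    0.3: [0.3, 0.2, 0.5, 0.1, 0.4, 0.6, 0.2, 0.3],
}
accuracy = two_fold_accuracy(prs_by_param, status, [0, 1, 2, 3], [4, 5, 6, 7])
```

The parameter value is never scored on the half that selected it, which is the point of the swap; the shared-cohort caveat raised in the text still applies because both halves come from the same study.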
Besides the methods discussed above, we have also compared the performance of proposed joint models with a recently developed multi-trait analysis tool (MTAG [23]). Following the Polygenic Prediction section in their bioRxiv preprint (page 8), we first applied MTAG to GWAS summary statistics to get the multi-trait adjusted p values and effect sizes and then used the generated summary statistics as input to LDpred. The AUC of LDpred with MTAG adjusted summary statistics and all other four methods are shown in S7 Table. Our method outperformed all other methods including MTAG. Notably, MTAG outperformed LDpred in Crohn’s disease but its performance was even slightly worse than LDpred for celiac disease and type-II diabetes.

Simulations

We first performed simulations to demonstrate PleioPred’s ability to improve risk prediction accuracy. We simulated traits from GERA (dbGaP access number phs000674.v1.p1) genotype data, which contains 61,172 individuals genotyped for 670,176 SNPs. More specifically, we randomly selected ~28,000 individuals as the training set to calculate the summary statistics for disease 1 and another ~28,000 for disease 2. The remaining ~5000 individuals were used for testing. Throughout the simulation we used genotype data of chromosome 1 (50,279 SNPs) to generate phenotypes. We first generated two annotations, denoted as A1 and A2, each simulated by randomly selecting 10% of the genome. Denote the heritability of each trait as $h_1^2$ and $h_2^2$ (both 30%) and the number of causal variants as m1 and m2 (both 300). Causal variants were generated as follows: one third of causal variants were selected from A1, one third from A2 and the rest from the complement of A1 ∪ A2, of which a proportion p of the causal variants was shared by both diseases (p = 0.2 and 0.8). Effect sizes of causal variants were sampled from $N(0, h_1^2/m_1)$ and $N(0, h_2^2/m_2)$. We also randomly selected 5000 individuals and 10000 individuals from the training data of disease 1 and 2 respectively to calculate summary statistics in order to study the effect of unbalanced sample sizes on the increment of prediction accuracy. Correlations between simulated and predicted traits of disease 1 were calculated from 50 replicates under different simulation settings. PleioPred-anno showed the best prediction performance in all settings (Fig 1). The performance of the two-trait model improves as the proportion of shared causal variants increases. In the unbalanced case when the sample size of disease 1 is smaller than that of disease 2, we observed a larger increment in prediction accuracy, indicating the benefit of integrating large GWAS of genetically correlated diseases and functional annotations when the sample size of the disease of interest is moderate.
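The selection of causal variants for one disease can be sketched as below. Several details are assumptions made for illustration: annotations are sets of SNP indices, the final third is drawn from SNPs outside both annotations, and effect sizes are drawn from a zero-mean normal with variance h²/m. The sharing of a proportion p of causal variants between the two diseases is not sketched here, and all function names are hypothetical.

```python
import random


def simulate_annotation(n_snps, fraction, rng):
    """Randomly mark a fraction of SNPs as belonging to an annotation."""
    k = int(round(n_snps * fraction))
    return set(rng.sample(range(n_snps), k))


def pick_causal(n_snps, m, a1, a2, rng):
    """Pick m causal SNPs: a third from A1, a third from A2, the rest outside both."""
    third = m // 3
    chosen = set(rng.sample(sorted(a1), third))
    # Exclude SNPs already chosen so that the m causal variants are distinct.
    chosen.update(rng.sample(sorted(a2 - chosen), third))
    outside = [i for i in range(n_snps) if i not in a1 and i not in a2]
    chosen.update(rng.sample(outside, m - 2 * third))
    return chosen


def draw_effects(causal, h2, rng):
    """Draw effect sizes so that the expected total variance explained is h2 (assumption)."""
    sd = (h2 / len(causal)) ** 0.5
    return {i: rng.gauss(0.0, sd) for i in causal}


rng = random.Random(0)
n_snps = 50_279                      # chromosome 1 SNP count quoted in the text
a1 = simulate_annotation(n_snps, 0.10, rng)
a2 = simulate_annotation(n_snps, 0.10, rng)
causal = pick_causal(n_snps, 300, a1, a2, rng)
effects = draw_effects(causal, 0.30, rng)
```

Because the two annotations are drawn independently they can overlap, so a variant drawn "from A1" may also lie in A2; the sketch only guarantees where each third is drawn from.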

Fig 1

Prediction accuracy of non-infinitesimal models in simulated data.

We trained the models with equal training sample sizes (N1 = N2 = 28068, right panel) and unequal training sizes (N1 = 5000, N2 = 10000, left panel). Prediction accuracy was measured by correlation between simulated traits and predicted PRS.

Real data analysis

To further illustrate the improvement in risk prediction accuracy, we first applied PleioPred to Crohn’s disease (CD), celiac disease (CEL) and type-II diabetes (T2D). We jointly modeled CD with ulcerative colitis (UC), CEL with UC, and T2D with coronary artery disease (CAD). We trained PleioPred using publicly accessible GWAS summary statistics and evaluated risk prediction performance using individual-level genotype and phenotype data from cohorts independent of the training GWAS samples. The training summary statistics for the two autoimmune diseases are from the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC; CD: Ncase = 6,333 and Ncontrol = 15,056, with samples from the Wellcome Trust Case Control Consortium (WTCCC) removed from the meta-analysis), a CEL GWAS with 4,533 cases and 10,750 controls [24], and a UC GWAS from IIBDGC (Ncase = 6,687 and Ncontrol = 19,718). For the validation data, we merged the CD cases from WTCCC (Ncase = 1,829) and CEL cases from the National Institute of Diabetes and Digestive and Kidney Diseases study (NIDDK, Ncase = 1,716) with healthy controls from the Resource for Genetic Epidemiology Research on Aging Cohort (GERA, Ncontrol = 5,488). For T2D, we trained the model on summary data from the Diabetes Genetics Replication and Meta-analysis study (DIAGRAM, Ncase = 12,171 and Ncontrol = 56,862) [25] and the Coronary ARtery DIsease Genome wide Replication and Meta-analysis study (CARDIoGRAM, Ncase = 22,233 and Ncontrol = 64,762)[26]. Samples from the Northwestern NUgene Project (Ncase = 662 and Ncontrol = 517) [27] were used for validation. Details for each training GWAS summary statistics and independent testing cohorts are provided in S1 Text and S3 and S4 Tables. We evaluated the effectiveness of the per-SNP heritability estimated from functional annotations of the two autoimmune diseases (i.e. CD, CEL) with well-powered testing cohorts (N>3,000).
Interestingly, not only the per-SNP heritability of the testing diseases (CD and CEL) but those of related diseases (UC) could effectively identify SNPs with large effect sizes (Fig 2A and 2B) and consistent effect directions in independent validation cohorts (Fig 2C and 2D), which shows that functional annotations can effectively prioritize shared causal variants between genetically correlated diseases.

Fig 2

Evaluating effectiveness of annotations and per-SNP heritability.

(A, B) Comparing signal strengths of SNPs with high and low heritability of related diseases in independent validation cohorts. SNPs with higher heritability of either the testing disease or the related disease have significantly stronger associations across two independent and well-powered testing datasets (N>3,000; (A) Crohn’s disease; (B) Celiac disease). P-values were calculated using a one-sided Kolmogorov-Smirnov test. (C, D) Comparing consistency of SNPs’ effect direction between training and testing datasets. Each bar quantifies the proportion of SNPs with consistent effect directions. P-values were calculated using a one-sided two-sample binomial test. (C) Crohn’s disease; (D) Celiac disease.

Correlations between the calculated PRS and disease status (COR) for different approaches and area under the ROC curve (AUC) are summarized in Table 1 and S1 Table. In both infinitesimal and non-infinitesimal models, we observed that two-trait models consistently outperformed single-trait methods and incorporating functional annotations could further improve the prediction accuracy across different diseases. Furthermore, non-infinitesimal models achieved much better performance than infinitesimal models. We also fitted a logistic regression model with the case/control status as outcome and PRS as covariates and reported the corresponding slopes of PRSs, which measure the increase in odds ratio of getting disease with a unit change in PRS (Table 1) and further validated the advantage of integrating pleiotropy and functional annotations. A likelihood ratio test was used to test for the difference in the prediction accuracy between models, comparing the likelihood of a logistic regression fitting PRS of one method to that of a logistic regression fitting PRS of two methods jointly (Table 2).
From the test, PleioPred with 61 annotations performed significantly better than single-trait models (infinitesimal model: p = 1.4e-33 for CD, p = 1.6e-12 for CEL and p = 1.7e-3 for T2D; non-infinitesimal model: p = 5.2e-29 for CD, p = 2.8e-7 for CEL and p = 0.027 for T2D). Reversing the order of test (that is, comparing the likelihood of two-trait model with that of two-trait and single-trait model jointly or model using annotations with model using and not using annotations jointly) results in non-significant p-values for most tests (S2 Table), which further demonstrates that PRS incorporating functional annotations and pleiotropy mostly encompasses the information of PRS of single trait model. Besides CAD, we also jointly modeled T2D with a spectrum of traits, whose genetic correlations with T2D have been systematically studied [28], including age at menarche (AAM), autism spectrum (AUT), bipolar disorder (BIP), body mass index (BMI), birth length (BIL), birth weight (BIW), childhood obesity (CHO), fasting glucose (FG), HDL Cholesterol (HDL), height (HGT), major depressive disorder (MDD), rheumatoid arthritis (RA) and schizophrenia (SCZ). We estimated the genetic correlations between T2D and these traits using LDSC[21, 28] and showed that the increment in prediction accuracy is significantly correlated with the genetic correlation between T2D and the jointly modeled traits (P = 0.002; Fig 3 and S1 Fig).

Table 1

Mean CORs and regression slopes of infinitesimal and non-infinitesimal methods in independent validation cohorts of CD, CEL, and T2D.

For two-trait prediction models, we jointly modeled CD with UC, CEL with UC, and T2D with CAD.

	COR^a			Regression Slope^b
	CD	CEL	T2D	CD	CEL	T2D
ldpred-inf	0.196	0.072	0.137	0.454	0.168	1.99
AnnoPred-inf	0.219	0.098	0.145	0.572	0.255	2.15
PleioPred-inf	0.246	0.100	0.168	0.661	0.292	2.198
PleioPred-anno-inf	0.248	0.122	0.184	0.739	0.400	2.333
ldpred	0.247	0.120	0.217	0.873	0.661	2.83
AnnoPred	0.279	0.132	0.219	1.306	0.924	2.86
PleioPred	0.307	0.141	0.225	1.284	1.332	3.05
PleioPred-anno	0.297	0.156	0.22	1.340	1.361	3.063

a correlations between disease status and PRS;

b Regression slopes of logistic regression with case/control status as outcome and PRS as covariates, larger value indicates a larger increase in odds ratio when PRS increases by one unit.

Table 2

p-values from the likelihood ratio tests comparing different models.

		CD	CEL	T2D
x₁	x₂	p-values from LRT^a
ldpred-inf	AnnoPred-inf	4.4e-15	2.8e-6	0.011
ldpred-inf	PleioPred-inf	3.9e-34	2.3e-7	0.041
AnnoPred-inf	PleioPred-anno-inf	1.5e-18	4.9e-8	0.031
PleioPred-inf	PleioPred-anno-inf	1.8e-9	1.9e-8	0.017
ldpred-inf	PleioPred-anno-inf	6.4e-31	1.6e-12	1.7e-3
ldpred	AnnoPred	1.3e-5	1.7e-5	0.066
ldpred	PleioPred	9.3e-40	0.022	0.039
AnnoPred	PleioPred-anno	8.6e-13	5.7e-5	0.021
PleioPred	PleioPred-anno	7.7e-3	0.014	0.45
ldpred	PleioPred-anno	5.2e-29	2.8e-7	0.027

a Likelihood ratio = -2[logL(x1) - logL(x1 + x2)], where logL(x1) and logL(x1 + x2) are the log likelihoods from logistic regressions with case/control status as outcome and, respectively, x1 alone or x1 and x2 jointly as covariates.
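As a worked illustration of the statistic in footnote a, the sketch below converts two log likelihoods into a p-value. It assumes the two logistic models are nested and differ by exactly one covariate (the second PRS), so the statistic is compared with a chi-square distribution with one degree of freedom; the function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_p_value(loglik_x1, loglik_x1_x2):
    """p-value for adding one covariate (x2) to a logistic model that already has x1.

    Statistic: -2 * [logL(x1) - logL(x1 + x2)], referred to chi-square with 1 df.
    """
    stat = -2.0 * (loglik_x1 - loglik_x1_x2)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimizer noise
    # Chi-square survival function with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log likelihoods from the two fitted models.
stat, p = lrt_p_value(-100.0, -98.0)
```

A small p-value indicates that the second PRS carries information about case/control status beyond the first, which is how the rows of Table 2 are read.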

Fig 3

Prediction accuracy of the PleioPred-anno on T2D when jointly modeled with additional traits.

Genetic correlations were estimated using LDSC[28] and the significant correlations are labeled in purple. The P-value and confidence region indicate the significant correlation between prediction accuracy and genetic correlation. A similar pattern was observed in infinitesimal and non-infinitesimal models without annotations (S1 Fig). AAM: age at menarche, AUT: autism spectrum, BIP: bipolar disorder, BMI: body mass index, BIL: birth length, BIW: birth weight, CHO: childhood obesity, CAD: coronary artery disease, FG: fasting glucose, HDL: HDL Cholesterol, MDD: major depressive disorder, RA: rheumatoid arthritis, and SCZ: schizophrenia.

Since COR only measures the global discriminating power of a prediction method, it might not be the best evaluation metric for risk prediction approaches, for which it is more useful to stratify the population into clinically meaningful groups[1, 17, 29]. In order to test different methods’ ability to stratify individuals with high risk, we compared the proportion of cases among testing samples with high PRS from non-infinitesimal models in CD and CEL. PleioPred-anno showed the highest power in stratifying patients within the top risk population (Fig 4A). For T2D, we compared the distribution of the age of onset within risk groups stratified by different non-infinitesimal PRSs (Fig 4B). Onset ages of T2D are significantly lower among the individuals with higher two-trait PRS than those with higher single-trait PRS, which indicates that PRS of two-trait methods could effectively stratify the population with high absolute risk of T2D and demonstrates the potential clinical usage of PleioPred and the advantage of joint modeling of related diseases over single-trait prediction methods.
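The stratification comparison can be illustrated with a small sketch that computes the proportion of cases among the individuals with the highest PRS. It is a simplified illustration rather than the analysis pipeline behind Fig 4; the function name and the toy data are hypothetical.

```python
def case_proportion_in_top(prs, status, fraction):
    """Proportion of cases among the individuals with the highest PRS.

    prs: per-individual scores; status: 1 for case, 0 for control.
    fraction: size of the top risk group, e.g. 0.05 for the top 5%.
    """
    n_top = max(1, int(round(len(prs) * fraction)))
    ranked = sorted(range(len(prs)), key=lambda i: prs[i], reverse=True)
    top = ranked[:n_top]
    return sum(status[i] for i in top) / n_top


# Toy data (hypothetical): 10 individuals, 3 cases.
prs = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.4, 0.6, 0.5, 0.0]
status = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
top20 = case_proportion_in_top(prs, status, 0.20)
overall = sum(status) / len(status)
```

Comparing `top20` with `overall` gives the enrichment of cases in the top risk group; a method that stratifies better yields a larger ratio at the same group size.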

Fig 4

Comparing non-infinitesimal methods in different standards.

(A) Enrichment of proportion of cases in testing samples with high PRS (top 1%, 5%, 10%, 20% and 30% risk groups stratified by PRS) in CD and CEL. (B) Distribution of age of onset of T2D in testing samples with high PRS (top 5%, 10%, 20% and 30% risk groups stratified by PRS) in T2D. P-values were calculated using a Wilcoxon rank test comparing the two-trait models with the one-trait models. The last column represents the overall age of onset in testing samples.

In the non-infinitesimal two-trait model, the major contribution to improved performance came from pleiotropy. That is, the variants that are causal in both diseases would be prioritized and those that are not causal or have smaller effect sizes in both diseases would be given lower effect size estimates. Therefore, incorporating a genetically correlated disease is equivalent to integrating a functional annotation, and its effectiveness and power depend on the genetic correlation between the two diseases. When the two diseases are very similar and share a large number of causal and non-causal variants, adding less effective annotations may dilute the signals and lead to lower prediction accuracy. This aligns with our results in Tables 1 and 2, in which CD-UC and T2D-CAD have a rather high genetic correlation (0.427 and 0.432, respectively) and PleioPred yields better performance. In contrast, CEL-UC have a relatively lower genetic correlation (0.283) and PleioPred-anno yields the best prediction accuracy. We performed further analysis with T2D and 13 other correlated diseases (those used in Fig 3). We plotted the prediction accuracy of PleioPred and PleioPred-anno against absolute genetic correlation; when the functional annotations are fixed, PleioPred tends to yield slightly better results as the absolute genetic correlation increases (S2 Fig).

Discussion

Our work demonstrates that pleiotropy and functional annotations can effectively improve the performance of genetic risk prediction. PleioPred jointly analyzes genetically correlated diseases and diverse types of annotation data with GWAS summary statistics to upweight causal SNPs shared between diseases and with a higher likelihood of functionality, which leads to consistently better prediction accuracy for multiple complex diseases. Besides prediction accuracy, PleioPred can better stratify the population into different risk groups and has greater potential in clinical usage. Our method is not without limitations. First, despite consistent improvement compared with existing PRS-based methods, AUCs for most diseases remain moderate. In order to effectively stratify risk groups for clinical usage, our model remains to be further calibrated using large cohorts with measured environmental and clinical risk factors [1]. Second, accurate estimation of GWAS signal enrichment and SNP effect sizes requires a large sample size for the training dataset. This could be potentially improved by better estimators for annotation-stratified heritability in the future [30]. Third, it is non-trivial to foresee whether PleioPred or PleioPred-anno would work better for a given pair of diseases. According to our observation in real data analysis, PleioPred would eventually outperform PleioPred-anno with an increasing genetic correlation. The threshold at which the change happens could be learned with a validation dataset in practice. The proposed framework can be easily customized and extended to incorporate more than two diseases, which could potentially further increase the prediction accuracy. However, it is worth noting that the computational burden and the difficulty in model fitting also increase with the number of diseases. Furthermore, many GWAS have shared control samples, which may result in duplicated information and noise in the training samples.
A few Bayesian models combining GWAS summary statistics with functional annotations have been proposed for the purpose of fine-mapping functional variants [31-33]. Whether these models could be adapted to benefit risk prediction accuracy remains to be investigated in the future. Importantly, the rich collection of publicly available integrative annotation data, in conjunction with the increasing accessibility of GWAS summary statistics, makes PleioPred a customizable and powerful tool. As GWAS sample size continues to grow, PleioPred has the potential to achieve even better prediction accuracy and become widely adopted as a summary of genetic contribution in clinical applications of risk prediction. Although more and more GWAS summary results are becoming available [34], evaluating the prediction accuracy requires a cohort independent of both training GWAS samples, which is very challenging to find. We will apply the proposed methods to a wide range of diseases when independent validation data become available in the future.

Methods

Conditional distribution of marginal effect size estimators

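Assume the phenotypes of two diseases, $Y_1$ ($N_1 \times 1$) and $Y_2$ ($N_2 \times 1$), and the genotypes, $X$ ($N_1 \times M$) and $Z$ ($N_2 \times M$), are standardized with mean zero and variance one. Here $N_1$ and $N_2$ denote the sample sizes for the two diseases and $M$ is the number of markers. We further assume a linear model with genotype matrices, effect sizes (β and γ) and random errors (ε and δ) mutually independent:

$$Y_1 = X\beta + \varepsilon, \qquad Y_2 = Z\gamma + \delta.$$

Assume that the effect sizes of different SNPs are independent. As for random errors, we assume that

$$\varepsilon \sim N\left(0, (1 - h_1^2) I_{N_1}\right), \qquad \delta \sim N\left(0, (1 - h_2^2) I_{N_2}\right), \qquad \mathrm{cov}(\varepsilon_i, \delta_j) = \begin{cases} \rho_e & \text{if } i \text{ and } j \text{ index the same individual,} \\ 0 & \text{otherwise,} \end{cases}$$

where $h_1^2$ and $h_2^2$ denote the heritability of two diseases and $\rho_e$ measures the covariance within the overlapping individuals between two studies. Denote the LD matrix and marginal effect size estimator from GWAS as: $\hat D_1 = X^T X / N_1$, $\hat D_2 = Z^T Z / N_2$ and $\hat\beta = X^T Y_1 / N_1$, $\hat\gamma = Z^T Y_2 / N_2$. In practice, $\hat D_1$ and $\hat D_2$ can be estimated from a reference panel and we therefore denote the LD matrix as $\hat D$ for convenience. Then following the derivation in [13], we can derive the conditional distribution of GWAS summary statistics as

$$\begin{pmatrix} \hat\beta \\ \hat\gamma \end{pmatrix} \Big|\, \beta, \gamma \sim N\left( \begin{pmatrix} \hat D \beta \\ \hat D \gamma \end{pmatrix}, \begin{pmatrix} \frac{1 - h_1^2}{N_1} \hat D & \frac{\rho_e N_s}{N_1 N_2} \hat D \\ \frac{\rho_e N_s}{N_1 N_2} \hat D & \frac{1 - h_2^2}{N_2} \hat D \end{pmatrix} \right),$$

where $N_s$ is the number of overlapping samples between the two studies. When $N_s$ is relatively small, we can discard terms with $N_s / (N_1 N_2)$ to reduce the computation burden. In practice, we usually ignore the overlap between samples mainly due to four reasons: 1) It is usually challenging to estimate the parameter $\rho_e$ and obtain the exact number of overlapping samples. 2) The off-diagonal term is much smaller than the diagonal terms, even in the case of complete overlap. 3) Sensitivity analysis through simulations indicated that the method is very robust to overlapping samples (S6 Table). 4) In practice, $\rho_e$ can be estimated via LDSC if $N_s$ is known. However, including the covariance matrix of $\hat\beta$ and $\hat\gamma$ can significantly increase the computational cost and thus increase the variability of estimation.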
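The origin of the off-diagonal term can be traced in a few lines. The derivation below is a sketch under the stated model; the symbols are assumptions introduced for it: $Y_1, Y_2$ are the phenotype vectors, $X, Z$ the genotype matrices, $\rho_e$ the error covariance within shared individuals, $N_s$ the number of shared individuals, and $X_s, Z_s$ the rows of $X$ and $Z$ belonging to those shared individuals (so $X_s = Z_s$).

```latex
\begin{aligned}
\hat\beta &= \frac{X^{T} Y_1}{N_1} = \hat D_1 \beta + \frac{X^{T}\varepsilon}{N_1},
\qquad
\hat\gamma = \frac{Z^{T} Y_2}{N_2} = \hat D_2 \gamma + \frac{Z^{T}\delta}{N_2}, \\
\operatorname{cov}(\hat\beta, \hat\gamma \mid \beta, \gamma)
  &= \frac{X^{T} \operatorname{cov}(\varepsilon, \delta)\, Z}{N_1 N_2}
   = \frac{\rho_e\, X_s^{T} Z_s}{N_1 N_2}
   \approx \frac{\rho_e N_s}{N_1 N_2}\, \hat D .
\end{aligned}
```

As a hypothetical numerical example, with $N_1 = N_2 = 50{,}000$ and $N_s = 1{,}000$ the factor $N_s/(N_1 N_2)$ is $4 \times 10^{-7}$, compared with $1/N_1 = 2 \times 10^{-5}$ on the diagonal, before the further multiplication by $\rho_e$.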

Infinitesimal model

Assume that the effect sizes follow a multivariate normal distribution:

$$
\begin{pmatrix} \beta_i \\ \gamma_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{1i}^2 & \rho\,\sigma_{1i}\sigma_{2i} \\ \rho\,\sigma_{1i}\sigma_{2i} & \sigma_{2i}^2 \end{pmatrix} \right),
$$

where $\sigma_{1i}^2$ and $\sigma_{2i}^2$ denote the variances of the effect sizes of SNP $i$ and $\rho := \mathrm{cor}(\beta_i, \gamma_i)$ represents the genetic correlation between the two diseases. Suppose that the whole genome is partitioned into $K$ functional regions $A_1, \ldots, A_K$. Specific annotations used in PleioPred were described previously (Results). We assume that the effect size of a SNP depends on the functional regions it falls in and that the effect-size variances are additive in overlapping regions:

$$
\sigma_{ji}^2 = \sum_{c:\, i \in A_c} \tau_{jc}, \qquad j = 1, 2,
$$

where $\tau_{jc}$ denotes the variance of the effect size of SNPs on disease $j$ falling in $A_c$ alone. For parameter estimation, we applied a two-stage approach. First, $\tau_{1c}$ and $\tau_{2c}$ are estimated using annotation-stratified LD score regression (LDSC) [21], which is essentially a method-of-moments estimator, since LDSC utilizes the relationship between the second moment of the marginal estimators and the variance components of each functional region. Specifically, for each disease we use $C_j \hat{\sigma}_{ji}^2$ to specify the per-SNP heritability for disease $j$, where $C_j$ is a constant chosen so that the rescaled per-SNP heritabilities match the estimated chip heritability of disease $j$. We do not directly use $\hat{\sigma}_{ji}^2$ as the per-SNP heritability because it is estimated in the context where all SNPs in the 1000 Genomes database are included in the model [21]. Such per-SNP heritability estimates cannot be extrapolated to the risk prediction context, where many fewer SNPs are analyzed [35]. Therefore, we rescale the heritability estimates to better quantify each SNP's contribution towards chip heritability. Following [36], the chip heritability is estimated with a summary statistics-based estimator that approximates the Haseman-Elston estimator; it is computed from the mean squared marginal estimators (the means of $\hat{\beta}_i^2$ and $\hat{\gamma}_i^2$ for diseases 1 and 2) and the mean non-stratified LD score. In the GWAS setting, the LD matrix $\hat{D}$ is usually non-invertible and has very high dimension. We thus study the posterior distribution of a small chunk of marginal effect size estimators instead.
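The additive annotation model and the rescaling step can be sketched in a few lines. This is only an illustration under stated assumptions: the function `per_snp_heritability` is hypothetical, the constant is assumed to be chosen so that the rescaled values sum to a supplied chip heritability estimate, and the annotation matrix and variance components are made-up numbers rather than LDSC output.

```python
import numpy as np

def per_snp_heritability(annot, tau, h2_chip):
    """Rescaled per-SNP heritability under an additive annotation model.

    annot:   (M, K) 0/1 matrix, annot[i, c] = 1 if SNP i lies in region A_c.
    tau:     (K,) per-annotation variance components for one disease.
    h2_chip: chip heritability estimate for that disease.
    Assumes the unscaled variances sum to a positive number.
    """
    sigma_sq = annot @ tau            # additive over overlapping annotations
    c = h2_chip / sigma_sq.sum()      # assumed normalization constant
    return c * sigma_sq

# Hypothetical example: 3 SNPs, 2 annotations; SNP 1 lies in both regions.
annot = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
tau = np.array([1e-4, 3e-4])
h2_snp = per_snp_heritability(annot, tau, h2_chip=0.2)
print(h2_snp)
```

The SNP that falls in both regions receives the sum of the two variance components before rescaling, which is what additivity in overlapping regions means here.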
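Let $\hat{\beta}_b$ and $\hat{\gamma}_b$ be the estimated marginal effect sizes of SNPs in a region $b$ (e.g. an LD block), with corresponding genotype matrices $X_b$ and $Z_b$ and sample correlation matrix $\hat{D}_b$. Then, assuming no overlapping individuals or omitting the off-diagonal terms, the conditional distributions of the marginal effect size estimators are $\hat{\beta}_b \mid \beta_b \sim N\left(\hat{D}_b \beta_b, \frac{1-h_{1b}^2}{N_1}\hat{D}_b\right)$ and $\hat{\gamma}_b \mid \gamma_b \sim N\left(\hat{D}_b \gamma_b, \frac{1-h_{2b}^2}{N_2}\hat{D}_b\right)$, where $h_{1b}^2$ and $h_{2b}^2$ denote the heritability of SNPs in region $b$ for the two diseases. These are usually close to zero because region $b$ is relatively small, and they can be safely rounded to zero in calculation. We choose the size of $b$ using the standard described in [9]. Finally, we treat $\rho$ as a tuning parameter. Because both the prior and the conditional distribution are Gaussian, the posterior expectation of the effect sizes given $\hat{\beta}_b$ and $\hat{\gamma}_b$ can be calculated in closed form.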
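For one LD block, the posterior expectation under the infinitesimal model can be sketched with standard Gaussian conjugacy. The expression used below is derived from the stated prior and conditional distributions with the block heritabilities rounded to zero and sample overlap ignored; it is not copied from the PleioPred code, and the function name `posterior_mean_inf` and all numerical inputs are hypothetical. Writing $\Sigma_b$ for the prior covariance of $(\beta_b, \gamma_b)$ and $\Omega_b = \mathrm{diag}(N_1\hat{D}_b, N_2\hat{D}_b)$, the posterior mean is $\Sigma_b (I + \Omega_b \Sigma_b)^{-1} (N_1\hat{\beta}_b, N_2\hat{\gamma}_b)^T$.

```python
import numpy as np

def posterior_mean_inf(beta_hat, gamma_hat, D, n1, n2, sig1_sq, sig2_sq, rho):
    """Posterior mean of (beta_b, gamma_b) for one LD block.

    Assumes a bivariate normal prior per SNP with correlation rho, Gaussian
    marginal estimators with covariance D / n_j (block heritability rounded
    to zero) and no sample overlap between the two studies.
    sig1_sq, sig2_sq: (m,) arrays of per-SNP prior variances.
    """
    m = len(beta_hat)
    s12 = rho * np.sqrt(sig1_sq * sig2_sq)
    # Prior covariance of the stacked vector (beta_b, gamma_b).
    Sigma = np.block([[np.diag(sig1_sq), np.diag(s12)],
                      [np.diag(s12), np.diag(sig2_sq)]])
    zeros = np.zeros((m, m))
    # Precision contributed by the two sets of summary statistics.
    Omega = np.block([[n1 * D, zeros], [zeros, n2 * D]])
    rhs = np.concatenate([n1 * beta_hat, n2 * gamma_hat])
    # Equivalent to (Sigma^{-1} + Omega)^{-1} rhs, without inverting Sigma.
    post = Sigma @ np.linalg.solve(np.eye(2 * m) + Omega @ Sigma, rhs)
    return post[:m], post[m:]

# Hypothetical block of three SNPs in moderate LD.
D = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
beta_hat = np.array([0.02, 0.01, -0.01])
gamma_hat = np.array([0.015, 0.0, -0.02])
sig1 = np.full(3, 1e-3)
sig2 = np.full(3, 2e-3)
b, g = posterior_mean_inf(beta_hat, gamma_hat, D, 1000, 2000, sig1, sig2, rho=0.5)
print(b, g)
```

With $\rho = 0$ and no LD, the formula reduces to independent per-SNP shrinkage of each marginal estimate by $N_j\sigma_j^2/(1 + N_j\sigma_j^2)$; a nonzero $\rho$ lets the summary statistics of one disease inform the effect sizes of the other.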

Non-infinitesimal model

In practice [9, 13], a sparse model yields higher accuracy for most diseases. Moreover, the infinitesimal model assumption is relatively strong in some cases. For example, two related diseases may share only some causal variants and have no correlation among the effect sizes, or the correlation structure may vary across the genome. We therefore propose a hierarchical Bayesian model with a more general assumption, which we refer to as the non-infinitesimal model. Under this model, the effect sizes of SNP $i$ on the two diseases follow a mixture distribution with three kinds of component: a joint normal (when SNP $i$ is causal in both diseases), a normal for one disease combined with a point mass for the other (when SNP $i$ is causal in only one disease), and a joint point mass (when SNP $i$ is not causal in either disease). Although there is no closed-form solution for the posterior expectation of the effect sizes, we can use a Gibbs sampler to sample from the posterior distribution of the effect sizes and thereby estimate the posterior expectation. Each SNP is updated from the joint posterior distribution of $\beta_i$ and $\gamma_i$ given $\hat{\beta}$, $\hat{\gamma}$, $\beta_{-i}$, $\gamma_{-i}$ and the mixture proportions. The posterior distribution of the mixture proportions is rather complicated, and we therefore applied a Metropolis-Hastings method to sample them, using a proposal distribution based on the counts $d_{11}$, $d_{10}$, $d_{01}$ and $d_{00}$, in which $d_{11}$ represents the number of SNPs that are causal in both diseases, $d_{10}$ and $d_{01}$ represent the numbers of SNPs that are causal in only one disease and $d_{00}$ denotes the number of non-causal SNPs from the previous sampling step. To ensure convergence, we shrink the posterior probability of being causal if the heritability estimated at the current step for either disease is larger than the heritability estimated from the GWAS summary statistics. That is, the posterior probabilities of being causal are shrunken by a factor computed from $\beta_j^{(i)}$ and $\gamma_j^{(i)}$, the sampled effect sizes of SNP $j$ in the $i$th iteration. Simulations showed that the algorithm yields fast convergence and high accuracy in estimation (S5 Table).
An important advantage of the non-infinitesimal approach is that it has no tuning parameters and is thus more computationally efficient. Furthermore, by imposing Bayesian shrinkage, we can better select functionally relevant variants and down-weight unrelated information. The running time mainly depends on the number of SNPs and the number of MCMC iterations used in prediction; for a typical GWAS dataset with 400,000 SNPs, it usually takes approximately two hours to finish 250 MCMC iterations (which already leads to good convergence). Following the guideline of [9], we recommend an LD reference panel of at least one thousand unrelated individuals with the same ancestry as the samples from which the summary statistics were obtained.
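To make the mixture assumption of the non-infinitesimal model concrete, the sketch below draws effect sizes for two diseases from a four-component mixture and tabulates the component counts that correspond to d11, d10, d01 and d00. It only illustrates the prior and does not implement the Gibbs or Metropolis-Hastings steps. The function name `sample_mixture_prior`, the mixture proportions, the common per-SNP variances and the optional correlation `rho` in the shared component are all assumptions chosen for illustration.

```python
import numpy as np

def sample_mixture_prior(m, p, sig1_sq, sig2_sq, rho=0.0, seed=0):
    """Draw (beta, gamma) for m SNPs from a four-component mixture.

    p: proportions for (causal in both, disease 1 only, disease 2 only,
       neither); must sum to one.
    sig1_sq, sig2_sq: scalar effect-size variances for the two diseases.
    rho: effect-size correlation within the shared component (assumed).
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(4, size=m, p=p)                 # component label per SNP
    c12 = rho * np.sqrt(sig1_sq * sig2_sq)
    both = rng.multivariate_normal([0.0, 0.0],
                                   [[sig1_sq, c12], [c12, sig2_sq]], size=m)
    beta = np.zeros(m)                             # point mass at zero by default
    gamma = np.zeros(m)
    beta[z == 0] = both[z == 0, 0]                 # causal in both diseases
    gamma[z == 0] = both[z == 0, 1]
    beta[z == 1] = rng.normal(0.0, np.sqrt(sig1_sq), size=int((z == 1).sum()))
    gamma[z == 2] = rng.normal(0.0, np.sqrt(sig2_sq), size=int((z == 2).sum()))
    return beta, gamma, z

beta, gamma, z = sample_mixture_prior(10000, [0.01, 0.02, 0.02, 0.95], 1e-3, 1e-3)
d11, d10, d01, d00 = (int((z == k).sum()) for k in range(4))
print(d11, d10, d01, d00)
```

In a full sampler, counts of this kind would be recomputed from the sampled causal status at every iteration and fed into the proposal for the mixture proportions.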

Ethics statement

The study was approved by the Yale University Human Investigation Committee with approval numbers 100 FR1 and 100 FR27.

Software availability

PleioPred software: https://github.com/yiminghu/PleioPred
AnnoPred software: https://github.com/yiminghu/AnnoPred
GenoCanyon: http://genocanyon.med.yale.edu/
GenoSkyline: http://genocanyon.med.yale.edu/GenoSkyline

Supplemental data

Supplemental data include two figures, six tables, and a detailed description of the GWAS summary statistics and validation cohorts.

Prediction accuracy of the PleioPred-inf, PleioPred-anno-inf and PleioPred on T2D when jointly modeled with a wide spectrum of diseases.

Genetic correlations were estimated using LDSC [28], and significant correlations are labeled in purple. The P value and confidence region indicate the significant correlation between the increment in prediction accuracy and genetic correlation. AAM: age at menarche, AUT: autism spectrum, BIP: bipolar disorder, BMI: body mass index, BIL: birth length, BIW: birth weight, CHO: childhood obesity, CAD: coronary artery disease, FG: fasting glucose, HDL: HDL cholesterol, MDD: major depressive disorder, RA: rheumatoid arthritis and SCZ: schizophrenia. (TIFF)

Prediction accuracy of the PleioPred and PleioPred-anno on T2D when jointly modeled with a wide spectrum of diseases.

Genetic correlations were estimated using LDSC [28]. (AAM: age at menarche, gc (genetic correlation) = 0.1221; AUT: autism spectrum, gc = 0.0638; BIP: bipolar disorder, gc = 0.1227; BMI: body mass index, gc = 0.3445; BIL: birth length, gc = 0.2196; BIW: birth weight, gc = 0.2732; CHO: childhood obesity, gc = 0.2249; CAD: coronary artery disease, gc = 0.432; FG: fasting glucose, gc = 0.6234; HDL: HDL cholesterol, gc = 0.4008; MDD: major depressive disorder, gc = 0.0288; RA: rheumatoid arthritis, gc = 0.0434; and SCZ: schizophrenia, gc = 0.0694). (TIFF)

Mean AUCs of infinitesimal and non-infinitesimal methods in independent validation cohorts of CD, CEL, and T2D.

For two-trait prediction models, we jointly modeled CD with UC, CEL with UC, and T2D with CAD. (XLSX)

p-values from the likelihood ratio tests comparing different models.

(XLSX)

URLs of GWAS summary statistics.

(XLSX)

URLs of validation data.

(XLSX)

Accuracy of parameter estimation in simulations using the proposed MCMC method.

Data were generated from real genotype data of chromosome 1 with 29,596 individuals for both traits. We randomly selected 300 out of 41,334 SNPs as causal variants, with 1/3 shared between the two traits. We simulated six scenarios in total, corresponding to different heritabilities of the two traits. In each setting, we use the maximum of the mean squared error (MAX_MSE) of the effect sizes of the 41,334 SNPs to evaluate the estimation accuracy. (XLSX)

Influence of overlapped individuals in training samples.

Data were generated from real genotype data of chromosome 1 with 29,596 individuals for both traits. We randomly selected 300 out of 41,334 SNPs as causal variants, with 1/3 shared between the two traits. N_s: the number of overlapping individuals between diseases; rho_e: the covariance between random errors of the two traits on the same individuals (see Methods for details); MAX_MSE1 and MAX_MSE2: the maximum of the mean squared error of the effect sizes of the 41,334 SNPs in the two traits, respectively (100 replications, used for evaluating estimation accuracy). (XLSX)

Mean AUCs of MTAG compared with other methods in real data analysis.

(XLSX)

Details on GWAS summary statistics and validation data.

(DOCX)

