
A machine-learning heuristic to improve gene score prediction of polygenic traits.

Guillaume Paré^1,2,3, Shihong Mao⁴, Wei Q Deng⁵.

Abstract

Machine-learning techniques have helped solve a broad range of prediction problems, yet are not widely used to build polygenic risk scores for the prediction of complex traits. We propose a novel heuristic based on machine-learning techniques (GraBLD) to boost the predictive performance of polygenic risk scores. Gradient boosted regression trees were first used to optimize the weights of SNPs included in the score, followed by a novel regional adjustment for linkage disequilibrium. A calibration set with sample size of ~200 individuals was sufficient for optimal performance. GraBLD yielded prediction R² of 0.239 and 0.082 using GIANT summary association statistics for height and BMI in the UK Biobank study (N = 130 K; 1.98 M SNPs), explaining 46.9% and 32.7% of the overall polygenic variance, respectively. For diabetes status, the area under the receiver operating characteristic curve was 0.602 in the UK Biobank study using summary-level association statistics from the DIAGRAM consortium. GraBLD outperformed other polygenic score heuristics for the prediction of height (p < 2.2 × 10⁻¹⁶) and BMI (p < 1.57 × 10⁻⁴), and was equivalent to LDpred for diabetes. Results were independently validated in the Health and Retirement Study (N = 8,292; 688,398 SNPs). Our report demonstrates the use of machine-learning techniques, coupled with summary-level data from large genome-wide meta-analyses, to improve the prediction of polygenic traits.


Year: 2017 PMID: 28979001 PMCID: PMC5627249 DOI: 10.1038/s41598-017-13056-1

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

The advent of precision medicine depends in large part on the availability of accurate and highly predictive polygenic risk scores. While progress has been made identifying genetic determinants of polygenic traits, the amount of phenotypic variance explained by polygenic risk scores derived from genome-wide significant associations remains modest. On the other hand, moderate to high narrow-sense heritability has been established for many human traits. It has been proposed that weak, yet undetected, associations underlie polygenic trait heritability[1]. Consistent with this hypothesis, polygenic risk scores that include both strongly and weakly associated variants are vastly superior to those including only genome-wide significant variants. For example, a recent study by Abraham et al. showed that a polygenic risk score incorporating 49,310 variants had a discrimination ability that was similar and complementary to the widely used clinical Framingham risk score for the prediction of coronary artery disease (CAD)[2]. Thus, there is a need for polygenic risk score methods that can integrate a large number of genetic variants.
The most popular heuristic for polygenic risk scores is based on linkage disequilibrium (LD) pruning of SNPs, prioritizing the most significant associations up to an empirically determined p-value threshold, and pruning the remaining SNPs based on LD[3]. This “pruning and thresholding” (P+T) approach has the advantage of being simple and computationally efficient, but discards some information because of LD pruning. To remedy this issue, a novel method, LDpred, which uses LD information from an external reference panel, was recently proposed to infer the mean causal effect size using a Bayesian approach[4]. While the latter method has been shown to improve prediction R², it depends on estimates of polygenic heritability and causal fraction, and can be sensitive to the misspecification of LD.
We hypothesized that a further gain in prediction R² could be made by tuning the weights of SNPs included in polygenic risk scores using principles of machine learning. Machine learning encompasses a wide-ranging class of algorithms widely used to solve complex prediction problems. It has proven particularly useful when prediction is dependent on the integration of a large number of predictors, including higher-order interactions, and when sizeable training datasets are available for model fitting. In particular, gradient boosted regression trees are powerful and versatile methods for continuous outcome prediction[5], and thus, are ideal for updating the SNP weights in polygenic risk scores. Tree-based models partition the predictor space according to simple rules by identifying regions having the most homogeneous responses to predictors and fitting the mean response for observations in that region. Gradient boosting[6] is an efficient algorithm that sequentially combines a large number of weakly predictive models to optimize performance. We propose to leverage the large number of SNPs and the available summary-level statistics from genome-wide association studies (GWAS) to calibrate the weights of SNPs contributing to the polygenic risk score, adjusting for LD instead of pruning. Our heuristic, gradient boosted and LD adjusted (GraBLD; https://github.com/GMELab/GraBLD), involves two steps and uses the univariate regression coefficients from external meta-analysis[7-9] summary association statistics as the starting point (see Fig. 1 and Methods). First, the external univariate regression coefficients were updated with respect to a target population by the gradient boosted regression tree models. Second, the updated weights were corrected for LD to produce the final polygenic risk score.

Figure 1

An overview of the proposed machine-learning heuristic to boost polygenic risk scores and study design.

Results

We applied our machine-learning heuristic for height predictions using a calibration set of 10,000 participants, as well as an independent validation set of 130,215, both from the UK Biobank (UKB). The inputs for the gradient boosted regression trees were obtained from the Genetic Investigation of Anthropometric Traits (GIANT) consortium summary association statistics[10,11] of 1.98 M SNPs. Since the UKB is not part of the GIANT consortium, the initial weights were assumed to be independent of the target population. As recently proposed[12], principal components were added to the model and included in the prediction R². The prediction R² of our GraBLD polygenic risk score that included all SNPs was 0.239, corresponding to 46.9% of the total polygenic genetic variance estimated at 0.509 in the UKB using variance component models[13]. This compared favorably to the optimal prediction R² obtained with P+T (0.220; 177 K SNPs), LDpred (0.207), or an unadjusted polygenic risk score (0.165) (p < 2.2 × 10⁻¹⁶ for all pairwise comparisons with GraBLD; Figs 2 and 3).

Figure 2

Prediction R² using polygenic risk scores as a function of increasing proportion of SNPs for height, BMI, and diabetes. The prediction R² of polygenic risk scores, as a function of the proportion of top SNPs from the GIANT consortium, is illustrated for height (A) and BMI (C) in the UKB validation set (N = 130,215), with 95% confidence bands. A total of 1.98 M SNPs were considered and ordered from the most to the least significant, according to GIANT summary association statistics. LDpred requires a determination of the fraction of causal SNPs; only the best scores are illustrated, obtained by setting the causal fractions to 0.3 and 0.01 for height and BMI, respectively. The prediction R² of the UKB polygenic risk scores in HRS is similarly illustrated for (B) height (N = 8,291) and (D) BMI (N = 8,269). The UKB polygenic risk scores were tested in HRS without additional fitting or adjustment. The area under the curve is illustrated for diabetes as a function of the proportion of top SNPs from the DIAGRAM consortium in the UKB validation set (E) and HRS (F) with 95% confidence bands. The LDpred causal fraction of 0.003 was determined in the UKB calibration set for diabetes.

Figure 3

Relative improvement in discrimination for height, BMI, and diabetes, compared to unadjusted polygenic risk scores. The relative improvement in the prediction R² of gene scores, compared to the unadjusted polygenic risk score, is illustrated for height and BMI in the UKB validation set and HRS. For diabetes, the relative improvement in the area under the curve (AUC) is illustrated. For all traits, the polygenic risk score weights were derived from the UKB calibration set and tested without additional fitting or adjustment.

We also tested the performance of GraBLD for the prediction of body mass index (BMI) and diabetes in the UKB using summary association statistics from the GIANT consortium for BMI, and the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium[14] for diabetes, respectively.
The resulting score for BMI had a prediction R² of 0.082, outperforming the prediction R² of the unadjusted polygenic risk score (0.069), P+T (0.069), and LDpred (0.074) (p < 1.6 × 10⁻⁴ for all pairwise comparisons with GraBLD; Fig. 2). The GraBLD polygenic risk score accounted for 32.7% of the total polygenic variance, which was estimated at 0.251 for BMI in the UKB using variance component models. For diabetes, the area under the receiver operating characteristic curve (AUC) was 0.602, which was not statistically different from LDpred (0.613; p = 0.06), and compared favorably to the unadjusted polygenic risk score (0.583), as well as P+T (0.576) (p < 10⁻⁵ for comparisons with GraBLD; Figs 2 and 3). For sensitivity analyses, we tested the influence of the number of folds used to fit SNP weights, the number of SNPs used for LD adjustments, and the interaction depth on polygenic score performance (Supplementary Figure S2). We also illustrated the relationship between weights updated by gradient boosted regression trees and the external regression coefficients from consortia (Supplementary Figure S3).
Calibration, or the ability of a gene score to accurately predict real observations, is as important as predictiveness when gene scores are used to infer unobserved traits. To evaluate calibration, we calculated the average absolute difference between the predicted trait and the actual trait for height and BMI in the validation set. For all methods, polygenic risk scores were first calibrated in the training set through the use of a linear regression model (along with the top principal components). The average absolute difference was the smallest for GraBLD for height (0.690 SD) and BMI (0.742 SD), compared to other polygenic scores (p < 10⁻⁵² for all pairwise comparisons with GraBLD). We tested for calibration for diabetes using the Hosmer-Lemeshow test[15], partitioning the UKB validation set by deciles of the predicted trait (Fig. 4).
There was no evidence of mismatch between the predicted and observed event rates (p > 0.05).
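The two calibration checks described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code: the function names are hypothetical, the predicted values are assumed to come from a model already fitted in the calibration set, and the Hosmer-Lemeshow statistic uses the standard textbook form, which is conventionally compared with a chi-square distribution on (number of groups − 2) degrees of freedom.

```python
def mean_absolute_difference(predicted, observed):
    """Average |predicted - observed| across individuals (trait in SD units)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def decile_groups(predicted, n_groups=10):
    """Indices of individuals split into n_groups bins by ascending prediction."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    return [order[g * n // n_groups:(g + 1) * n // n_groups]
            for g in range(n_groups)]


def hosmer_lemeshow_statistic(pred_prob, outcome, n_groups=10):
    """Sum over groups of (observed - expected)^2 / (n * p_bar * (1 - p_bar)).

    Assumes every group has a mean predicted probability strictly
    between 0 and 1 (otherwise the denominator is zero).
    """
    stat = 0.0
    for idx in decile_groups(pred_prob, n_groups):
        n_g = len(idx)
        expected = sum(pred_prob[i] for i in idx)   # expected number of events
        observed = sum(outcome[i] for i in idx)     # observed number of events
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1 - p_bar))
    return stat
```

A statistic near zero indicates that observed and predicted event counts agree within each decile; the p-value itself would require a chi-square distribution function, which is deliberately left out of this dependency-free sketch.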

Figure 4

Calibration of height, BMI, and diabetes polygenic risk scores. For each trait and method, the polygenic risk score values for the UKB validation set were divided into deciles. For each decile, the difference between the mean observed and predicted trait (95% confidence interval) is illustrated as a function of the mean predicted trait for that decile. The trait is expressed per SD unit for height (A) and BMI (B). A similar analysis was performed for diabetes, whereby the difference between the observed probability of diabetes and the predicted probability is illustrated as a function of the predicted probability for each decile.

The set of participants used to calibrate GraBLD can theoretically be the test set since the univariate regression coefficient of each SNP in the target population is not used to tune its own polygenic score weight. However, doing so presents practical challenges when one wants to predict a trait unobserved in the target population. In such cases, a smaller training sample size is advantageous. Therefore, we explored the effect of the size of the calibration set on GraBLD performance by sub-sampling an increasing proportion of our calibration set for tuning. We determined that a calibration set as small as 200 was adequate to provide a high prediction R² for height and BMI (Fig. 5). For diabetes, we selected an increasing number of case-control pairs, and 100 pairs were sufficient for adequate performance.

Figure 5

GraBLD polygenic risk score discrimination as a function of calibration set sample size. The size of the UKB calibration set varied from 20 to 10,000 for height and BMI, and from 3 to 1,000 case-control pairs for diabetes. For each calibration sample size, discrimination of the corresponding polygenic risk score was calculated in the independent UKB validation set (N = 130,215 for height and BMI; N = 5,746 case-control pairs for diabetes).

For any given SNP, the regression coefficient observed in the UKB was not used to determine its own weight in the polygenic risk score. Nonetheless, regression coefficients of other SNPs in the UKB were used, raising the issue of transferability to other populations. Hence, we tested GraBLD derived from the UKB in Health and Retirement Study (HRS) participants of European descent (N = 8,292). Only directly genotyped SNPs were used for this analysis and 683 K SNPs overlapped with both the UKB and consortia associations. For each method, the optimal polygenic risk score derived in the UKB calibration set was tested in the HRS without any further fitting or adjustment. Consistent with the UKB results, our machine-learning heuristic produced superior polygenic risk scores for height and BMI, compared to all other methods, and was a close second to LDpred for diabetes (Figs 2 and 3).

Discussion

Our proposed machine-learning heuristic led to significant improvements of polygenic risk scores in prediction R², compared to existing methods. Furthermore, we showed that GraBLD risk scores were well calibrated, requiring only a small “tuning” set sample size (N ~ 200) to achieve satisfactory performance. This latter characteristic makes our method advantageous for the prediction of unobserved traits, and stems from the fact that our heuristic leverages the large number of genetic variants reported in GWAS to train gradient boosted regression tree models through genome partitioning. Overall, our results demonstrate that machine-learning techniques coupled with summary-level data from large genome-wide meta-analyses improve the prediction of polygenic traits.
The regression trees approach we used can capture nonlinear effects and higher-order interactions, while the gradient boosting algorithm combines individually weak predictors to produce a strong classifier that enables a better prediction of genetic effects. The gradient boosted regression trees adaptively reweight the contribution of each SNP in order to maximize the prediction R² in a target population. Summary association statistics obtained from large external meta-analyses are implicitly assumed to provide the best initial estimates and regression trees “adapt” them to the regression coefficients observed in the target population. To avoid overfitting, SNPs were divided into five distinct contiguous sets (thus circumventing potential LD spillover) and the weights of SNPs in each set were calculated using the prediction models trained on the remaining four sets. For example, the first set comprised SNPs from chromosomes 1, 2, and part of 3 such that SNPs from the remaining part of chromosome 3, as well as those on chromosomes 4 to 22, were used to derive prediction models for SNPs in the first set. Thus, the observed regression coefficient of any single SNP in the target population was never used directly or indirectly to derive its own weight in the polygenic score. In addition, we used a small learning rate for the boosting algorithm to further reduce the risk of overfitting; boosting has also been suggested to be quite robust to overfitting[16]. We also explored alternative machine-learning techniques to tune the SNP weights, with bagging being a close second to gradient boosted regression trees in terms of prediction R² (0.229 for height and 0.080 for BMI) as it is based on a similar principle of subsampling. Neural networks produced inferior results and were slower to compute. Support vector machine and random forest proved to be computationally prohibitive, with run times exceeding 7 days for the same analyses done in 8.25 hours by GraBLD.
It is advantageous to correct the derived weights for LD when including multiple SNPs in a score, unless SNPs were first LD pruned. The novel correction we propose is based on the sum of pairwise LD r² of each SNP over neighboring SNPs. The polygenic risk score weights of each SNP were divided by the corresponding sum of r². To illustrate with a simple example, if five SNPs are in perfect LD (r² = 1) with each other, but in linkage equilibrium with all other SNPs (r² = 0), then the polygenic score weights of those five SNPs would be divided by five. Since all five SNPs are included in the score and the effects of all five SNPs are summed, the corrected weight contributions are equivalent to including a single SNP without correction. Thus, it is necessary to apply the LD correction only after adjusting SNP weights with gradient boosted regression trees, as otherwise important information on the strength of association of individual SNPs would be lost. LD is only summed over SNPs included in the polygenic risk score such that our correction is specific to the set of SNPs included in a given score.
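The five-SNP example can be checked numerically. The sketch below is an illustration in plain Python rather than the GraBLD R implementation; the function name, the toy r² matrix, and the toy weights are all made up for the example. Each SNP's own r² of 1 is counted in the sum, which is what makes the divisor equal to five in the example.

```python
def ld_adjusted_weights(weights, r2, window=100):
    """Divide each SNP weight by the sum of r^2 with SNPs within `window` positions.

    weights: per-SNP weights in genomic order (already tuned).
    r2:      square list of lists of pairwise r^2, with r2[j][j] == 1.
    """
    m = len(weights)
    adjusted = []
    for j in range(m):
        lo, hi = max(0, j - window), min(m, j + window + 1)
        eta = sum(r2[j][k] for k in range(lo, hi))  # includes the SNP itself
        adjusted.append(weights[j] / eta)
    return adjusted


# Toy data: SNPs 0-4 are in perfect LD with each other; SNP 5 is independent.
m = 6
r2 = [[1.0 if (j < 5 and k < 5) or j == k else 0.0 for k in range(m)]
      for j in range(m)]
weights = [0.2] * 5 + [0.3]
adjusted = ld_adjusted_weights(weights, r2)

# The five linked SNPs together contribute as much as one uncorrected SNP.
print(sum(adjusted[:5]), adjusted[5])
```

With these toy inputs the five linked SNPs each receive one fifth of their original weight, while the independent SNP keeps its weight unchanged.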
When the genetic effects were strictly additive (i.e., no haplotype or interaction effect), the resulting polygenic score provided an unbiased estimate of the underlying genetic variance, although at a tradeoff of increased polygenic score variance as compared to the “true” unobserved genetic model (see Methods). It can be shown that the variance explained by the polygenic risk score, R², equals R²_true in simple cases where the pairwise r² LD is either 0 or 1 and the summary association statistics are derived from an asymptotically large sample. In more common scenarios with partial LD, the variance explained is lower (R² < R²_true), reflecting the loss of information when, for example, two SNPs are in partial LD and have true genetic effects with opposite directions. Using simulations, we estimated this loss of information at ~12% as the prediction R² explained ~88% of the true genetic variance, on average (see Methods).
A few limitations are worth mentioning. First, our method was based on the premise that SNPs contribute additively to genetic variance. While empirical evidence suggests this holds true in most cases, our method is not expected to perform well in genomic regions where strong genetic interactions are present (e.g. HLA). In such situations, alternative methods such as LDpred might be better suited[4]. Second, there is a possibility that the polygenic risk scores derived using our method are inherently population-specific. However, with the exception of unadjusted polygenic risk scores, all methods require a determination of parameters in the target population and ours is no exception. Furthermore, if the genetic architecture varies between populations, then no polygenic risk score will perform universally well and it will be beneficial to tailor gene scores to each population. The observation that our heuristic performed equally well in the HRS, compared to other methods, suggests this might not be the case.
Moreover, the small calibration sample size required by GraBLD is an advantage over other gene score methods. Third, our correction for LD yielded advantageous results yet is expected to lead to some loss of information when truly associated SNPs are in partial LD. Nonetheless, our method has several benefits over other methods, including its simplicity, use of summary association statistics, and intrinsic robustness to minor misspecification of LD or association strength. In summary, we propose a novel heuristic based on machine-learning concepts to improve the prediction of polygenic traits using gene scores. Our results show that for the classic polygenic traits, height and BMI, 46.9% and 32.7% of the estimated polygenic genetic variance was captured by our GraBLD gene scores. These results demonstrate the potential of machine-learning methods to harness the considerable amount of information available from local GWAS and external genome-wide meta-analyses. This is made possible through partitioning of the genome, enabling training of regression trees over large numbers of observations. Indeed, a small training sample size (~200) was sufficient to greatly improve the predictiveness of polygenic risk scores. As with other prediction problems involving machine-learning techniques, incremental improvements are to be expected with increased sample size, the inclusion of additional predictors, and the availability of more precise summary association statistics.

Methods

Datasets

Summary association statistics were used to tune the weight of SNPs in polygenic risk scores according to the target population. Univariate regression coefficients for height and body mass index (BMI) were downloaded from the Genetic Investigation of Anthropometric Traits (GIANT) consortium[4,10,11,17] at http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files. Univariate coefficients for diabetes were obtained from DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium[14] at www.diagram-consortium.org/. The UK Biobank[18] (UKB) is a large population-based study from the United Kingdom. A total of 152,249 participants were genotyped using either the UK BiLEVE or the UK Biobank Affymetrix Axiom arrays, and a subset of 140,215 participants of European (British and Irish) Caucasian ancestry were used in the analyses. Genotypes were imputed using the UK10K reference panel using IMPUTE2, resulting in ~72 M SNPs. Height and BMI were adjusted for age and sex in all analyses. To mitigate the effects of outliers, values outside the 1st and 99th percentile were removed. All analyses were adjusted for the first 15 genetic principal components unless stated otherwise. The final sample size for height and BMI was 130,215. The UKB is not part of the GIANT meta-analysis of height and BMI[19,20], nor of the DIAGRAM consortium for diabetes[14]. There are 6,746 individuals with prevalent diabetes in the subset of the UKB included in the current report. We randomly selected 6,746 individuals without diabetes as paired controls on a 1:1 ratio. We then randomly sampled 1,000 case-control pairs as the calibration set, with the remaining 5,746 pairs forming the validation set. The Health and Retirement Study (HRS) is a longitudinal study conducted on Americans over age 50. 
We downloaded publicly available genome-wide data that are part of the HRS (dbGaP Study Accession: phs000428.v1.p1) and were generated using the Illumina Human Omni2.5-Quad BeadChip. The following HRS quality control criteria were used to filter genotype and phenotype data: (1) SNPs and individuals with missingness higher than 2% were excluded, (2) related individuals were excluded, (3) only participants with self-reported European ancestry that was genetically confirmed by principal component analysis were included, (4) individuals for whom the reported sex did not match their genetic sex were excluded, (5) SNPs with Hardy-Weinberg equilibrium p < 1 × 10⁻⁶ were excluded, and (6) SNPs with minor allele frequencies lower than 0.02 were removed. The final dataset included 8,292 European participants genotyped for 688,398 SNPs. Height and BMI were adjusted for age and sex in all analyses, and to mitigate the effect of outliers, values outside the 1st and 99th percentile range were removed. All analyses were adjusted for the first 20 genetic principal components unless stated otherwise. The final sample sizes for height and BMI were 8,291 and 8,262, respectively. There were 1,815 individuals with diabetes and 6,477 controls. HRS was not part of the GIANT meta-analysis of height and BMI[19,20], nor of the DIAGRAM consortium for diabetes[14].
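The UKB diabetes design described above (6,746 cases, an equal number of randomly selected controls, then 1,000 pairs for calibration and 5,746 for validation) amounts to a random split. The following plain-Python sketch shows one way to do it; the function name, the seed, and the random pairing of cases with controls are assumptions made for illustration, since the text does not describe how pairs were formed.

```python
import random


def split_case_control_pairs(case_ids, control_pool_ids, n_calibration_pairs, seed=0):
    """Draw 1:1 controls at random, then split the pairs into calibration/validation.

    Returns (calibration_pairs, validation_pairs), each a list of
    (case_id, control_id) tuples. Pairing is random, not matched on covariates.
    """
    rng = random.Random(seed)
    controls = rng.sample(control_pool_ids, len(case_ids))  # without replacement
    pairs = list(zip(case_ids, controls))
    rng.shuffle(pairs)
    return pairs[:n_calibration_pairs], pairs[n_calibration_pairs:]


# Hypothetical identifiers standing in for UKB participants.
cases = list(range(6746))
control_pool = list(range(100000, 120000))
calibration, validation = split_case_control_pairs(cases, control_pool, 1000)
print(len(calibration), len(validation))
```

Fixing the seed keeps the split reproducible, which matters because the validation pairs must stay untouched while the calibration pairs are sub-sampled.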

Polygenic risk scores

The genotypes for n individuals at m SNPs in the target population are given by a matrix $X = (X_1, \dots, X_n)^T$, with each column vector $X_i$ representing the coded genotypes for an individual. Without loss of generality, we assume each column of $X$ (i.e., genotypes for a single SNP) to be standardized to have mean 0 and variance 1. For a standardized quantitative trait $Y$ with mean 0 and variance 1, the underlying linear model can be expressed as:

$$Y = X\beta + \epsilon \tag{1}$$

where $\beta$ is a vector of true genetic effects that are fixed across individuals, but random across SNPs, with mean 0 and covariance matrix $\sigma^2 I_m$ such that the total expected genetic variance is:

$$R^2_{true} = m\sigma^2 \tag{2}$$

and $\epsilon$ is the error term with mean 0 and covariance $(1 - R^2_{true}) I_n$, so that the covariance of $Y$ is $\sigma^2 X X^T + (1 - R^2_{true}) I_n$. Given $X_i$, the genotypes of m SNPs for the i-th individual, the gradient boosted and LD adjusted (GraBLD) polygenic risk score $g(X_i)$ is defined as:

$$g(X_i) = X_i^T D w \tag{3}$$

where $w$ is an m-dimensional vector of boosted weights and $D$ is an m × m diagonal matrix with entries $1/\eta_j$, adjusting for LD. For quantitative traits, the performance of the polygenic risk score was measured by the coefficient of determination (i.e., the prediction $R^2$), and for binary traits, performance was measured using the area under the receiver operating characteristic (ROC) curve.
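To make the score and the two performance measures concrete, here is a small plain-Python sketch. It is an illustration only: the function names and the `eta` argument (the per-SNP LD adjustment) are naming assumptions, the R² is computed as a squared Pearson correlation without the principal-component covariates used in the analyses, and the AUC uses the pairwise-comparison definition, which is quadratic in sample size and only suitable for small examples.

```python
def polygenic_score(genotypes, weights, eta):
    """Score for one individual: sum over SNPs of genotype * weight / LD adjustment."""
    return sum(x * w / e for x, w, e in zip(genotypes, weights, eta))


def prediction_r2(scores, trait):
    """Squared Pearson correlation between the score and a quantitative trait."""
    n = len(scores)
    mean_s, mean_t = sum(scores) / n, sum(trait) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, trait))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in trait)
    return cov * cov / (var_s * var_t)


def auc(scores, labels):
    """Probability that a random case scores higher than a random control.

    Ties count as one half. labels are 1 for cases and 0 for controls.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))
```

For example, `auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` compares four case-control pairs, three of which rank the case above the control.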

Gradient boosted regression trees

Gradient boosted regression trees are powerful and versatile methods that combine otherwise weak classifiers to produce a strong learner for continuous outcome prediction[5]. They are ideally suited for improving SNP weights ($w$) in the polygenic risk score, without requiring individual-level genotypes, since they can be used to predict continuous outcomes and can model non-linear relationships without feature selection. We also tested support vector machine (SVM), bagging, neural net, and random forest. SVM (“e1071” R package) and random forest (“randomForest” R package) took an inordinate amount of time to complete and were deemed impractical. Gradient boosted regression trees gave the best results when compared to bagging (“caret” R package) and neural net (“nnet” R package) using default parameters. Thus, all analyses were performed using gradient boosted regression trees.
The fitted model gave the contribution of individual SNPs to the final polygenic risk score. The weights used in gene scores ($w$) were defined by the following: [equation (4) missing from the source text] where $\beta_{ext}$ refers to the univariate regression coefficients obtained as summary-level association statistics from the external consortium (assumed to have been standardized for reference allele frequency), and the deviation term is derived to reflect the amount of deviation towards the null hypothesis of no association in the target population ($\beta_{obs}$) with respect to the externally derived estimates of summary association statistics ($\beta_{ext}$). When the deviation is 0, $\beta_{obs} = \beta_{ext}$, implicitly assuming regression coefficients from large meta-analyses provide the best initial weights. While some information is lost because of this construct, the fitted weights are more robust and expected to improve the overall performance of polygenic risk scores. The dependent variable used in gradient boosted regression trees is constructed as: [equation (5) missing from the source text] and the fitted deviation can be found by minimizing the squared-error loss function: [equation (6) missing from the source text] where the minimization is over regression functions of trees of the input variables.
The gradient boosting algorithm aims to iteratively minimize the expected squared-error loss, with respect to the regression function, on weighted versions of the training data. While multiple SNP annotations could be included as inputs, we only included the absolute value of the SNP regression coefficient for the target trait from the external consortium to reflect the strength of association, irrespective of the direction of effect. Importantly, SNPs were divided into 5 distinct sets of contiguous SNPs (to avoid LD spillover), and the fitted deviation derived using the regression tree models trained on the remaining 4 sets was used to calculate the actual polygenic risk scores. The observed regression coefficient ($\beta_{obs}$) of an individual SNP is, therefore, never used directly or indirectly to derive its own weight. Furthermore, the SNP annotations used in the regression trees model should be independent of the population in which the polygenic risk score is applied. Gradient boosted regression tree models were fitted using the “gbm” R package (https://CRAN.R-project.org/package=gbm) with a squared-error loss function. A total of 2,000 trees were fitted with an interaction depth of 5, a shrinkage parameter of 0.001, and a bag fraction of 0.5. The final number of trees used for modeling was selected as per the gbm package instructions. All other parameters were set to their default values. The run time for each of the 5 folds was 8.25 hours when performed on a single 3 GHz core.
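The partitioning scheme, in which each SNP's weight comes from a model trained only on the other four contiguous sets, can be sketched as follows. This plain-Python illustration covers only the fold logic: the function names are hypothetical, and `fit_mean` is a deliberately trivial stand-in learner, not gradient boosting. Any learner exposing the same `fit(train_x, train_y)` interface could be substituted.

```python
def contiguous_folds(n_snps, n_folds=5):
    """Split SNP indices (in genomic order) into n_folds contiguous blocks."""
    bounds = [f * n_snps // n_folds for f in range(n_folds + 1)]
    return [list(range(bounds[f], bounds[f + 1])) for f in range(n_folds)]


def out_of_fold_predictions(features, targets, fit, n_folds=5):
    """Predict each block's targets with a model trained on the other blocks only.

    fit(train_x, train_y) must return a callable model(x) -> prediction.
    """
    predictions = [None] * len(features)
    for held_out in contiguous_folds(len(features), n_folds):
        held = set(held_out)
        train = [i for i in range(len(features)) if i not in held]
        model = fit([features[i] for i in train], [targets[i] for i in train])
        for i in held_out:
            predictions[i] = model(features[i])  # SNP i never trained this model
    return predictions


def fit_mean(train_x, train_y):
    """Stand-in learner (NOT gradient boosting): always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


# Toy data: only the first block has non-zero targets.
features = list(range(10))
targets = [1.0, 1.0] + [0.0] * 8
preds = out_of_fold_predictions(features, targets, fit_mean)
print(preds)
```

In the toy run, the first block is predicted as 0.0 because its own targets were excluded from training, which is the property the contiguous partitioning is meant to guarantee.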

LD Adjustment for SNP weights

We propose a simple method to correct weights for LD in such a way that all SNPs can be included in a gene score, irrespective of LD. Let $r^2_{jk}$ denote the pairwise linkage disequilibrium between the j-th and k-th SNPs. The LD adjustment ($\eta_j$) for the j-th SNP is defined by the sum of $r^2$ between the j-th SNP and the 100 SNPs upstream and downstream:

$$\eta_j = \sum_{k=j-100}^{j+100} r^2_{jk} \tag{7}$$

with a distance of 100 SNPs assumed sufficient to ensure linkage equilibrium (other values may be used). Including only SNPs that are part of the polygenic risk score in the calculation of $\eta_j$, the LD-corrected weights are given by:

$$w_j / \eta_j \tag{8}$$

where $w_j$ is the weight for the j-th SNP.

Prediction of polygenic risk score

The prediction $R^2$ of the gene score in the target population is expressed as:

$$R^2 = \frac{\mathrm{Cov}(g(X), Y)^2}{\mathrm{Var}(g(X))\,\mathrm{Var}(Y)} \tag{9}$$

and, since $\mathrm{Var}(Y) = 1$, the expected value can be approximated by:

$$E[R^2] \approx \frac{E[\mathrm{Cov}(g(X), Y)]^2}{E[\mathrm{Var}(g(X))]} \tag{10}$$

and further simplified to

$$E[R^2] \approx \frac{(R^2_{true})^2}{E[\mathrm{Var}(g(X))]} \tag{11}$$

by deriving the following relations: (1) $E[\mathrm{Cov}(g(X), Y)] = R^2_{true}$, implying the covariance between the gene score and the trait is an unbiased estimator of the true genetic variance, and (2) $E[\mathrm{Var}(g(X))] \ge R^2_{true}$; thus, $E[R^2] \le R^2_{true}$.

An Unbiased Estimator of the True Genetic Variance

The sample covariance of the gene score with the observed $Y$ in the target sample is given by: [equation (12) missing from the source text] where $\epsilon^*$ and $\epsilon$ are the residual errors in the unobserved population used to derive summary association statistics and in the target population, respectively. The reported $\beta^*$ in GWAS meta-analyses are constructed to estimate the univariate regression coefficients from the otherwise unobserved genotype matrix $X^*$ and quantitative trait $Y^*$: [equation missing from the source text]. Assuming the target population is independent of the meta-analysis (i.e., $\epsilon^*$ and $\epsilon$ are independent), we establish the expected value of the quadratic forms in (eq. 12): [equation missing from the source text]. This equality holds for all positive definite matrices of the form [expression missing from the source text], assuming the LD structure in the two populations is identical. Thus, $\mathrm{Cov}(g(X), Y)$ is an unbiased estimator of the true genetic variance.

Variance of the polygenic risk score

The denominator in (eq. 11), $E[\mathrm{Var}(g(X))]$, can be shown to be greater than $R^2_{true}$: [derivation missing from the source text]. From the resulting inequality, we can conclude that $E[\mathrm{Var}(g(X))]$ is biased and will always be greater than or equal to the true genetic variance.
All analyses were conducted in R statistical software; the scripts for gradient boosted regression trees and LD adjustments can be found at https://github.com/GMELab/GraBLD.

Simulations to assess the effect of LD adjustment on polygenic risk score bias and variance

We performed simulations to confirm the effect of LD adjustment on bias and polygenic risk score variance. A total of 5,000 individuals were simulated for 450 contiguous SNPs using phased haplotypes from the 1000 Genomes Project[19]. The genetic effect of each SNP was randomly selected from a normal distribution according to a pre-defined, unobserved, true regional genetic variance that assumed genome-wide heritability varying from 0 to 0.5. For each genetic variance set-point, 1,000 simulations were completed and a polygenic risk score incorporating LD correction was derived. The average (±SD) gene score prediction $R^2$, the gene score variance, and the covariance between the gene score and the true (unobserved) genetic effect were calculated (Supplementary Figure S3). Based on these simulations, we confirmed that (1) LD-corrected gene scores were unbiased estimators of true genetic variance (i.e., $E[\mathrm{Cov}(g(X), Y)] = R^2_{true}$), and (2) the variance of the gene score was indeed higher than the true genetic variance. We further estimated the loss of information at ~12%, or in other words, the polygenic risk score prediction $R^2$ explained ~88% of the true genetic effect variance, on average.

Pruning and thresholding polygenic risk score, LDpred and other methods

Pruning and thresholding (P+T) polygenic scores were derived using the “clump” function of PLINK[21] with an LD r² threshold of 0.2, testing p-value thresholds in a continuous manner from the most to the least significant association. LDpred adjusts GWAS summary statistics for the effects of linkage disequilibrium, providing re-weighted effect estimates that are then used in polygenic risk scores[4]. LDpred was run as recommended by the authors, and included data synchronization and LDpred steps. LDpred requires a specification for the fraction of SNPs assumed to be causal. For each model, we tested causal fractions of 1 (infinitesimal), 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, and 0.0001, as recommended. Results are presented for the causal fraction that gave the best performance. A heritability estimate was also required by the algorithm and was estimated by LDpred from the summary association statistics. As a sensitivity analysis, we additionally used heritability estimates given by the variance component models in the UKB. The results were consistent and only the default option is shown. Polygenic genetic variance (i.e., narrow-sense heritability) was estimated for height and BMI in the UKB using the variance components implemented in GCTA[13]. All LD measures or related estimates used throughout the manuscript were derived from the UKB calibration set genotypes.
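The P+T procedure can be sketched as a greedy selection followed by a p-value cut-off. The code below is a minimal illustration, not PLINK's implementation: it ignores the physical-distance window and the index-SNP significance options that PLINK applies, treats missing r² pairs as zero, and uses hypothetical SNP names, p-values, weights and genotypes.

```python
def clump(p_values, r2, r2_threshold=0.2):
    """Greedy LD clumping.

    p_values: dict mapping SNP id -> association p-value.
    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise LD r^2
        (pairs not listed are treated as r^2 = 0).
    Returns index SNPs ordered from most to least significant.
    """
    kept = []
    for snp in sorted(p_values, key=p_values.get):
        # Keep a SNP only if it is in low LD with every SNP already kept.
        if all(r2.get(frozenset((snp, k)), 0.0) < r2_threshold for k in kept):
            kept.append(snp)
    return kept


def apply_threshold(kept, p_values, p_max):
    """Retain clumped SNPs whose p-value does not exceed p_max."""
    return [s for s in kept if p_values[s] <= p_max]


def polygenic_score(genotype, weights, snps):
    """Weighted allele count over the selected SNPs for one individual."""
    return sum(weights[s] * genotype[s] for s in snps)


# Hypothetical toy data for illustration only.
p_values = {"snp_a": 1e-8, "snp_b": 1e-6, "snp_c": 0.01, "snp_d": 0.2}
r2 = {
    frozenset(("snp_a", "snp_b")): 0.8,
    frozenset(("snp_b", "snp_c")): 0.5,
    frozenset(("snp_c", "snp_d")): 0.1,
}
weights = {"snp_a": 0.10, "snp_b": 0.08, "snp_c": -0.05, "snp_d": 0.02}
genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 1, "snp_d": 0}

index_snps = clump(p_values, r2)
selected = apply_threshold(index_snps, p_values, p_max=0.05)
score = polygenic_score(genotype, weights, selected)
```

Testing thresholds "in a continuous manner" then amounts to repeating `apply_threshold` with each successive p-value in the clumped list as `p_max` and keeping the threshold that predicts best in the calibration data.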
