Gustavo de Los Campos, Ana I Vazquez, Rohan Fernando, Yann C Klimentidis, Daniel Sorensen.
Abstract
Despite important advances from Genome-Wide Association Studies (GWAS), for most complex human traits and diseases a sizable proportion of genetic variance remains unexplained and prediction accuracy (PA) is usually low. Evidence suggests that PA can be improved using Whole-Genome Regression (WGR) models, where phenotypes are regressed on hundreds of thousands of variants simultaneously. Genomic Best Linear Unbiased Prediction (G-BLUP, a ridge-regression type method) is a commonly used WGR method and has shown good predictive performance when applied to plant and animal breeding populations. However, breeding and human populations differ greatly in a number of factors that can affect the predictive performance of G-BLUP. Using theory, simulations, and real data analysis, we study the performance of G-BLUP when applied to data from related and unrelated human subjects. Under perfect linkage disequilibrium (LD) between markers and QTL, the prediction R-squared ($R^2$) of G-BLUP asymptotically reaches the trait heritability. However, under imperfect LD between markers and QTL, prediction $R^2$ based on G-BLUP has a much lower upper bound. We show that the minimum decrease in prediction accuracy caused by imperfect LD between markers and QTL is given by $(1-b)^2$, where $b$ is the regression of marker-derived genomic relationships on those realized at causal loci. For pairs of related individuals, due to within-family disequilibrium, the patterns of realized genomic similarity are similar across the genome; therefore $b$ is close to one, inducing a small decrease in $R^2$. However, with distantly related individuals $b$ reaches very low values, imposing a very low upper bound on prediction $R^2$. Our simulations suggest that for the analysis of data from unrelated individuals, the asymptotic upper bound on $R^2$ may be of the order of 20% of the trait heritability. We show how PA can be enhanced with the use of variable selection or differential shrinkage of estimates of marker effects.
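The abstract describes G-BLUP as a ridge-regression type method built on genomic relationships. The following is a minimal sketch of that idea on simulated data, not the authors' implementation: the function names, the standardization of genotypes, the use of a known variance ratio `h2`, and the toy simulation settings are all assumptions made for illustration.

```python
import numpy as np

def standardize(X, mean, sd):
    """Center and scale genotype columns with the supplied (training) moments."""
    return (X - mean) / sd

def gblup_predict(X_trn, y_trn, X_tst, h2):
    """Predict testing phenotypes with a G-BLUP style predictor.

    h2 is treated as known here (an assumption); in practice the
    variance ratio would be estimated from the training data.
    """
    mean, sd = X_trn.mean(axis=0), X_trn.std(axis=0)
    poly = sd > 0                        # drop monomorphic markers
    Z_trn = standardize(X_trn[:, poly], mean[poly], sd[poly])
    Z_tst = standardize(X_tst[:, poly], mean[poly], sd[poly])
    p = Z_trn.shape[1]
    G = Z_trn @ Z_trn.T / p              # training x training relationships
    G_tst = Z_tst @ Z_trn.T / p          # testing x training relationships
    lam = (1.0 - h2) / h2                # residual-to-genomic variance ratio
    y_bar = y_trn.mean()
    alpha = np.linalg.solve(G + lam * np.eye(len(y_trn)), y_trn - y_bar)
    return y_bar + G_tst @ alpha

# Toy data: every marker is causal, i.e. the "perfect LD" situation.
rng = np.random.default_rng(0)
n, n_tst, p, h2 = 300, 100, 50, 0.8
X = rng.binomial(2, 0.3, size=(n + n_tst, p)).astype(float)
g = X @ rng.normal(size=p)
g = (g - g.mean()) / g.std()             # genetic values with unit variance
y = np.sqrt(h2) * g + np.sqrt(1.0 - h2) * rng.normal(size=n + n_tst)

y_hat = gblup_predict(X[:n], y[:n], X[n:], h2)
r2 = np.corrcoef(y[n:], y_hat)[0, 1] ** 2
print(f"prediction R2 in the toy testing set: {r2:.3f}")
```

Because the toy relationships are computed from the causal loci themselves, this corresponds to the favorable case in the abstract; replacing `X` in the predictor with markers in imperfect LD with the causal loci would be the case where the upper bound applies.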
Year: 2013 PMID: 23874214 PMCID: PMC3708840 DOI: 10.1371/journal.pgen.1003608
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. Minimum $R^2$ reduction factor, $(1-b)^2$, due to imperfect linkage disequilibrium between markers and QTL versus values of the regression of genomic relationships realized at markers on those realized at causal loci ($b$).
Table 1. Proportion of loci by minor allele frequency (MAF), scenario and data set.
| Type of loci | Scenario | Data set | MAF <3% | MAF 3%–5% | MAF 5%–10% | MAF 10%–15% | MAF >15% |
| Tag | — | FHS | .061 | .049 | .119 | .116 | .654 |
| Tag | — | GEN | .065 | .049 | .119 | .115 | .652 |
| Causal | RAND | FHS | .063 | .047 | .117 | .123 | .651 |
| Causal | RAND | GEN | .066 | .048 | .117 | .117 | .651 |
| Causal | Low-MAF | FHS | .310 | .233 | .239 | .207 | .011 |
| Causal | Low-MAF | GEN | .321 | .237 | .231 | .201 | .010 |
FHS = Framingham Heart Study; GEN = GENEVA. RAND: scenario in which causal and marker loci were drawn from the same distribution. Low-MAF: scenario in which marker loci were drawn at random and causal loci were drawn by over-sampling loci with extreme minor allele frequency (MAF). In the Low-MAF scenario, causal loci were sampled using the average MAF of the FHS and GEN data sets. Although MAF were very similar across data sets, they were not exactly equal, which explains why in the Low-MAF scenario roughly 1% of the causal loci had a within-data-set MAF greater than 15%.
Table 2. Estimates (estimated standard errors) of the proportion of variance explained and of the prediction R-squared of phenotypes in testing data sets, $R^2$ (TST).
| Scenario | Genetic information used to compute relationships | Variance explained, FHS | Variance explained, GEN | $R^2$ (TST), FHS | $R^2$ (TST), GEN |
| RAND | Causal loci | 0.775 (0.009) | 0.773 (0.010) | 0.545 (0.040) | 0.517 (0.031) |
| RAND | Markers | 0.774 (0.018) | 0.737 (0.040) | 0.263 (0.048) | 0.071 (0.023) |
| RAND | Pedigree | 0.764 (0.020) | — | 0.223 (0.047) | — |
| Low-MAF | Causal loci | 0.777 (0.007) | 0.775 (0.008) | 0.551 (0.026) | 0.536 (0.026) |
| Low-MAF | Markers | 0.748 (0.018) | 0.573 (0.058) | 0.240 (0.029) | 0.049 (0.019) |
| Low-MAF | Pedigree | 0.755 (0.023) | — | 0.224 (0.033) | — |
FHS = Framingham Heart Study, GEN = GENEVA, RAND: in this scenario causal and marker loci were drawn from the same distribution, Low-MAF: in this scenario marker loci were drawn at random and causal loci were drawn over-sampling loci with low minor allele frequency, TST = Testing data set.
Variance explained: average (over 30 MC replicates) estimated posterior mean of the ratio of genomic variance over the sum of genomic and residual variance.
$R^2$ (TST): average prediction $R^2$ (phenotypes) over 30 training (N = 5,300)-testing (N = 500) partitions.
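The evaluation described in this footnote (30 random partitions into 5,300 training and 500 testing records, with prediction R-squared averaged over partitions) can be sketched as follows. This is only an illustration: it assumes prediction R-squared is computed as the squared correlation between observed and predicted phenotypes, and the helper names and the seed are hypothetical.

```python
import random

def prediction_r2(y, y_hat):
    """Squared Pearson correlation between observed and predicted values
    (assumed definition of prediction R-squared)."""
    n = len(y)
    m_y, m_h = sum(y) / n, sum(y_hat) / n
    s_yh = sum((a - m_y) * (b - m_h) for a, b in zip(y, y_hat))
    s_yy = sum((a - m_y) ** 2 for a in y)
    s_hh = sum((b - m_h) ** 2 for b in y_hat)
    return s_yh * s_yh / (s_yy * s_hh)

def random_partitions(n_total, n_test, n_reps, seed=0):
    """Yield (training indices, testing indices) for n_reps random splits."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    for _ in range(n_reps):
        test = set(rng.sample(indices, n_test))
        train = [i for i in indices if i not in test]
        yield train, sorted(test)

# Sizes taken from the footnote: 5,800 records split 5,300 / 500, 30 times.
partitions = list(random_partitions(5800, 500, 30))
print(len(partitions), len(partitions[0][0]), len(partitions[0][1]))
```

In use, a model would be fitted on each training index set, `prediction_r2` evaluated on the matching testing set, and the 30 values averaged.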
Table 3. Average (over individuals in TST data sets) regression coefficient ($b$, see eq. 4) between realized genomic relationships at markers and those realized at causal loci, corresponding minimum reduction factor in prediction $R^2$ (see eq. 7) and observed reduction factor in prediction R-squared.
| Data set | Information used to compute relationships | Simulation scenario | Regression coefficient ($b$) | Minimum reduction factor in $R^2$ | Observed reduction factor in $R^2$ |
| Framingham | Pedigree | Random | 0.295 | 50% | 59% |
| Framingham | Pedigree | Low-MAF | 0.285 | 51% | 60% |
| Framingham | Markers | Random | 0.371 | 40% | 52% |
| Framingham | Markers | Low-MAF | 0.334 | 44% | 56% |
| GENEVA | Markers | Random | 0.127 | 76% | 86% |
| GENEVA | Markers | Low-MAF | 0.089 | 83% | 91% |
Low-MAF: scenario where causal loci were over-sampled among loci with low minor allele frequency. Random: scenario where markers and causal loci were sampled from the same distribution.
Regression coefficient: for each individual in the testing (TST) data sets we computed the regression of marker-derived or pedigree-derived relationships on genomic relationships computed at causal loci, where j (j = n+1, n+2, …) indexes individuals in the testing data set and i (i = 1, …, n) indexes individuals in the training (TRN) data sets. The TRN-TST partitions used were those used in the simulation. Results in the table are averages across individuals in TST data sets.
Minimum: upper bound calculated using expression (7).
Observed: reduction in prediction $R^2$ observed when data were analyzed using markers relative to that obtained when data were analyzed using genotypes at causal loci (see Table 2).
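The two reduction factors reported here can be reproduced with simple arithmetic from the tabulated values. The sketch below assumes the minimum factor is $(1-b)^2$, as stated in the abstract, and that the observed factor is one minus the ratio of the marker-based to the causal-loci-based prediction R-squared; the function names are illustrative.

```python
def min_reduction_factor(b):
    """Minimum proportional reduction in prediction R-squared, (1 - b)^2."""
    return (1.0 - b) ** 2

def observed_reduction_factor(r2_used, r2_causal):
    """Reduction in R-squared relative to using genotypes at causal loci."""
    return 1.0 - r2_used / r2_causal

# Average regression coefficients, markers, Random scenario.
b_fhs, b_gen = 0.371, 0.127
# Prediction R-squared, RAND scenario: (markers, causal loci).
r2_fhs = (0.263, 0.545)
r2_gen = (0.071, 0.517)

for name, b, (r2_m, r2_c) in [("FHS", b_fhs, r2_fhs), ("GEN", b_gen, r2_gen)]:
    minimum = 100 * min_reduction_factor(b)
    observed = 100 * observed_reduction_factor(r2_m, r2_c)
    print(f"{name}: minimum {minimum:.0f}%, observed {observed:.0f}%")
```

Because the inputs are rounded to three decimals, recomputed percentages can differ from the published ones by about one percentage point in some rows (for example, the pedigree-based Low-MAF row).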
Figure 2. Genomic relationships realized at markers (vertical axis) versus those realized at causal loci (horizontal axis).
The plot displays realized relationships between one individual in TST and all the other individuals in TRN for GEN (right panel) and FHS (left panel). Genomic relationships computed using markers are given on the vertical axis and those computed using genotypes at causal loci on the horizontal axis.
Table 4. Regression coefficient (see expression 6) between realized genomic relationships at markers and those realized at causal loci, by data set, type of relationship and simulation scenario.
| Data set | Framingham | Framingham | Framingham | Framingham | GENEVA | GENEVA |
| Relationships | Related | Related | Unrelated | Unrelated | Unrelated | Unrelated |
| Scenario | RAND | Low-MAF | RAND | Low-MAF | RAND | Low-MAF |
| Average | 0.992 | 0.998 | 0.119 | 0.078 | 0.127 | 0.089 |
| q5% | 0.898 | 0.827 | 0.085 | 0.048 | 0.087 | 0.051 |
| q95% | 1.083 | 1.174 | 0.184 | 0.130 | 0.329 | 0.269 |
Relationships: relationship between the individual whose phenotype is predicted and those used for model training; the regression coefficients were estimated for each individual in training datasets. q5% and q95% represent the 5% and 95% empirical percentiles of the estimated regression coefficients.
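The per-individual regression coefficients and their empirical percentiles can be computed along the following lines. This is a sketch under assumptions: it uses an ordinary least-squares slope with an intercept and a nearest-rank percentile, neither of which is confirmed by the source, and the toy relationship values are invented purely to make the example run.

```python
def regression_slope(x, y):
    """OLS slope of y on x, with an intercept (assumed form of the regression)."""
    n = len(x)
    m_x, m_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    s_xx = sum((a - m_x) ** 2 for a in x)
    return s_xy / s_xx

def empirical_percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)   # integer ceiling of q*n/100
    return ordered[rank - 1]

# Toy relationships between one predicted individual and four training
# individuals (hypothetical numbers, not taken from the study).
g_causal = [0.00, 0.10, 0.20, 0.30]
g_marker = [0.01, 0.03, 0.05, 0.07]
b_j = regression_slope(g_causal, g_marker)
print(f"b for this individual: {b_j:.2f}")

# Summaries over a set of per-individual coefficients (again hypothetical).
coefficients = [i / 100 for i in range(1, 21)]
print(empirical_percentile(coefficients, 5), empirical_percentile(coefficients, 95))
```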
Table 5. Estimates (estimated posterior standard deviation) of the proportion of variance accounted for by regression on pedigree or regression on markers, by training data set and analysis method.
| Data set | Pedigree | G-BLUP | wG-BLUP |
| Framingham (N = 5,800) | 0.857 (0.016) | 0.837 (0.016) | 0.814 (0.013) |
| GENEVA (N = 5,800) | — | 0.374 (0.049) | 0.268 (0.026) |
| Framingham+GENEVA (N = 11,600) | — | 0.721 (0.016) | 0.632 (0.015) |
wG-BLUP uses all SNPs (p = 400 K), but the contribution of each SNP to the genomic relationship matrix was weighted by a function of the SNP-associated p-value reported by the GIANT consortium [5].
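A weighted genomic relationship matrix of the kind described for wG-BLUP can be sketched as below. The weighting function of the p-values is not recoverable from this text, so the weights are simply passed in by the caller; the function name and the normalization by the sum of weights are assumptions.

```python
import numpy as np

def weighted_grm(Z, weights):
    """Relationship matrix in which each SNP column contributes in
    proportion to its (non-negative) weight.

    Z: individuals x SNPs matrix of centered and scaled genotypes.
    weights: one weight per SNP, supplied by the caller.
    """
    w = np.asarray(weights, dtype=float)
    if w.shape != (Z.shape[1],):
        raise ValueError("one weight per SNP is required")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return (Z * w) @ Z.T / w.sum()

rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 8))              # stand-in for standardized genotypes
G_equal = weighted_grm(Z, np.ones(8))    # equal weights: unweighted matrix
G_weighted = weighted_grm(Z, rng.uniform(size=8))
print(np.allclose(G_equal, Z @ Z.T / 8))
```

With equal weights the function returns the usual unweighted matrix, which makes it easy to check that the weighting is the only difference between the two analyses.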
Table 6. Prediction R-squared evaluated in testing data sets (average over 30 randomly drawn testing data sets, each having 500 individuals; standard errors in parentheses) by training and testing data sets and model.
| Training N-FHS | Training N-GEN | Testing N-FHS | Testing N-GEN | Pedigree-BLUP | G-BLUP | wG-BLUP |
| 5,300 | — | 500 | — | 0.284 (0.048) | 0.281 (0.051) | 0.311 (0.037) |
| 5,300 | 5,800 | 500 | — | — | 0.273 (0.048) | 0.290 (0.036) |
| — | 5,300 | — | 500 | — | 0.031 (0.013) | 0.086 (0.020) |
| 5,800 | 5,300 | — | 500 | — | 0.036 (0.015) | 0.110 (0.027) |
N-FHS = number of records from Framingham; N-GEN = number of records from GENEVA. G-BLUP uses 400 K SNPs; wG-BLUP uses the same 400 K SNPs, but the contribution of each SNP to the genomic relationship matrix was weighted by a function of the SNP-associated p-value reported by [5].
Figure 3. Prediction $R^2$ (vertical axis) versus thousands of markers included in the model (horizontal axis; markers selected based on p-values from the GWAS of the GIANT consortium [5]), by validation data set (FHM in the left panel, GEN in the right panel) and training data set (line with dots: training with FHM and GEN combined; line with circles: training and testing within each study).
Dotted horizontal lines give the prediction $R^2$ obtained when all markers (p = 400,000) were used.