Robert Makowsky, Nicholas M Pajewski, Yann C Klimentidis, Ana I Vazquez, Christine W Duarte, David B Allison, Gustavo de los Campos.
Abstract
Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the "missing heritability" for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h² up to 0.83, R² up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R² values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (~0.80), substantial room for improvement remains.
Year: 2011 PMID: 21552331 PMCID: PMC3084207 DOI: 10.1371/journal.pgen.1002051
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. A simplified representation of assessment of goodness of fit in a training dataset and of predictive ability across a population: an example with the Framingham population.
R-squared statistic measured in the data used to train the model (R²), estimated posterior mean of heritability (h²), and Deviance Information Criterion (DIC) by model and number of SNPs (where K = 1,000).
| Number of SNPs | Bayesian Lasso | | Genomic Relationship GY | | | Genomic Relationship GH | | |
| | R² | DIC | R² | h² | DIC | R² | h² | DIC |
| | 0.33 | 32,920 | 0.36 | 0.21 | 32,883 | 0.34 | 0.26 | 32,912 |
| | 0.47 | 32,666 | 0.49 | 0.31 | 32,605 | 0.48 | 0.37 | 32,642 |
| | 0.65 | 32,106 | 0.69 | 0.47 | 31,950 | 0.66 | 0.52 | 32,081 |
| | 0.79 | 31,359 | 0.82 | 0.60 | 31,124 | 0.79 | 0.65 | 31,365 |
| | 0.87 | 30,564 | 0.89 | 0.70 | 30,201 | 0.87 | 0.74 | 30,564 |
| | 0.92 | 29,629 | 0.93 | 0.77 | 29,220 | 0.92 | 0.80 | 29,685 |
| | - | - | 0.95 | 0.79 | 28,925 | 0.93 | 0.81 | 29,416 |
| | - | - | 0.96 | 0.81 | 28,444 | 0.94 | 0.83 | 29,017 |
Estimates were obtained by fitting models to height adjusted by sex and age and using all available data (N = 5,117).
For the Bayesian LASSO, due to high memory requirements, only models including up to 80K markers were considered. This model does not include a genetic variance parameter, therefore it does not yield a direct estimate of heritability. For this reason heritability is not reported for this model.
R-squared between predicted and observed values (R²) estimated using different numbers of SNPs (where K = 1,000), models, and validation designs.
| Number of SNPs | 10-Fold CV | | | 2-Generations design | | | Training-Testing Random | | |
| | BL | GY | GH | BL | GY | GH | BL | GY | GH |
| | .097 | .102 | .098 | .054 | .035 | .035 | .064 | .035 | .033 |
| | .126 | .130 | .129 | .066 | .058 | .061 | .080 | .059 | .057 |
| | .166 | .174 | .168 | .087 | .088 | .093 | .099 | .094 | .088 |
| | .200 | .204 | .199 | .106 | .111 | .115 | .119 | .119 | .114 |
| | .217 | .221 | .216 | .117 | .118 | .123 | .128 | .131 | .126 |
| | .236 | .237 | .236 | .124 | .126 | .129 | .138 | .139 | .137 |
| | - | .240 | .240 | - | .130 | .132 | - | .142 | .141 |
| | - | .247 | .249 | - | .133 | .133 | - | .146 | .145 |
BL = Bayesian LASSO, GH = Goddard-Hayes, and GY = Yang study (see Materials and Methods for elucidation).
10-fold cross validation, where the training set comprised 4,605–4,606 individuals.
Models were trained using the original cohort (N = 1,493) and predictive ability was assessed in the Offspring cohort (N = 3,624).
Data were assigned at random to a training set (N = 1,493) and predictive ability was evaluated in the remaining individuals (N = 3,624). This was repeated 10 times; each time, individuals were randomly reassigned to the training and testing sets. Results are averaged across the ten replicates.
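The sample sizes quoted in these footnotes can be checked with a short sketch. The Python below is illustrative only and is not the authors' code: the function names and the random seed are hypothetical, and R² is assumed here to be the squared Pearson correlation between observed and predicted values, which the source does not state explicitly. It shows how splitting N = 5,117 individuals into 10 folds yields training sets of 4,605 or 4,606, and how a replicated random Training-Testing split of 1,493 versus 3,624 individuals could be drawn.

```python
import random
from statistics import mean

N_TOTAL = 5117   # individuals with data (from the table footnotes)
N_TRAIN = 1493   # training-set size in the random Training-Testing design


def kfold_training_sizes(n, k=10):
    """Training-set size for each fold when n individuals are split into k folds."""
    base, extra = divmod(n, k)  # 'extra' folds hold one additional individual
    fold_sizes = [base + 1 if i < extra else base for i in range(k)]
    return [n - size for size in fold_sizes]


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of R-squared)."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def random_train_test_split(n, n_train, rng):
    """Randomly assign indices 0..n-1 to a training set and a testing set."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:]


sizes = kfold_training_sizes(N_TOTAL)
print(sorted(set(sizes)))  # → [4605, 4606]

rng = random.Random(0)  # hypothetical seed, for reproducibility only
splits = [random_train_test_split(N_TOTAL, N_TRAIN, rng) for _ in range(10)]
print(len(splits[0][0]), len(splits[0][1]))  # → 1493 3624
```

In a full analysis, a model would be fitted on each training set, `r_squared` would be evaluated on the corresponding testing set, and the ten values would be averaged, as described for the replicated Training-Testing design.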
Figure 2. We averaged the estimates of R² (measured in the training data), h², R² (measured in a 10-fold cross validation), R² (measured in a 2-generation validation), and R² (measured in a replicated Training-Testing validation) over the three modeling techniques (BL, GH, GY) and showed their relationship to the number of SNPs included in the model.
Figure 3. Averaged (across the three different models) estimates of R² (measured in a 10-fold cross validation) while varying the number of close relatives (si) in the training dataset with 2.5K to 400K SNPs.