Athina Spiliopoulou, Reka Nagy, Mairead L Bermingham, Jennifer E Huffman, Caroline Hayward, Veronique Vitart, Igor Rudan, Harry Campbell, Alan F Wright, James F Wilson, Ricardo Pong-Wong, Felix Agakov, Pau Navarro, Chris S Haley.
Abstract
We explore the prediction of individuals' phenotypes for complex traits using genomic data. We compare several widely used prediction models, including Ridge Regression, LASSO and Elastic Nets estimated from cohort data, and polygenic risk scores constructed using published summary statistics from genome-wide association meta-analyses (GWAMA). We evaluate the interplay between relatedness, trait architecture and optimal marker density, by predicting height, body mass index (BMI) and high-density lipoprotein level (HDL) in two data cohorts, originating from Croatia and Scotland. We empirically demonstrate that dense models are better when all genetic effects are small (height and BMI) and target individuals are related to the training samples, while sparse models predict better in unrelated individuals and when some effects have moderate size (HDL). For HDL, sparse models achieved good across-cohort prediction, performing similarly to the GWAMA risk score and to models trained within the same cohort, which indicates that, for predicting traits with moderately sized effects, large sample sizes and familial structure become less important, though still potentially useful. Finally, we propose a novel ensemble of whole-genome predictors with GWAMA risk scores and demonstrate that the resulting meta-model achieves higher prediction accuracy than either model on its own. We conclude that although current genomic predictors are not accurate enough for diagnostic purposes, performance can be improved without requiring access to large-scale individual-level data. Our methodologically simple meta-model is a means of performing predictive meta-analysis for optimizing genomic predictions and can be easily extended to incorporate multiple population-level summary statistics or other domain knowledge.
Year: 2015 PMID: 25918167 PMCID: PMC4476450 DOI: 10.1093/hmg/ddv145
Source DB: PubMed Journal: Hum Mol Genet ISSN: 0964-6906 Impact factor: 6.150
Figure 1. Whole-genome Identity-By-State coefficient of relationship between individuals in Croatia (left) and in Orkney (right). Each panel shows a histogram of the genetic similarity between every pair of individuals from each data set. The sub-panels display the right tail of the histogram in more detail by zooming in on the y-axis. The similarity between individuals $i$ and $k$ is computed by $f_{ik} = \frac{1}{P}\sum_{j=1}^{P}\frac{(x_{ij}-2p_j)(x_{kj}-2p_j)}{2p_j(1-p_j)}$, where $p_j$ is the minor allele frequency (MAF) of SNP $j$, $x_{ij}$ is the genotype of individual $i$ at SNP $j$, and $P$ = 267 912 (260 562 in Orkney) is the number of genotyped SNPs. Both cohorts contain related individuals, demonstrated in the sub-panels by the density mass around 0.5 (expected IBS between parent-offspring and full siblings), 0.25 (expected IBS between half siblings and uncle/aunt with nephew/niece) and 0.125 (expected IBS between first cousins). The proportion of related individuals is higher in Orkney than in Croatia, demonstrated by the higher density in IBS values corresponding to familial relationships.
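The pairwise similarity described in the caption can be illustrated with a small sketch. It assumes genotypes coded as minor-allele counts (0, 1 or 2) and the common allele-frequency-weighted estimator; the function name `relationship_matrix`, the in-sample frequency estimate and the skipping of monomorphic SNPs are illustrative choices, not details taken from the paper.

```python
def relationship_matrix(genotypes):
    """Allele-frequency-weighted relationship estimate for every pair.

    genotypes: list of per-individual lists of allele counts (0, 1 or 2).
    Assumption: allele frequencies are estimated from the same sample.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    # Frequency of the counted allele at each SNP.
    freq = [sum(ind[j] for ind in genotypes) / (2.0 * n) for j in range(n_snps)]
    # Monomorphic SNPs have zero variance 2p(1-p) and carry no information.
    used = [j for j in range(n_snps) if 0.0 < freq[j] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs available")
    rel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            total = 0.0
            for j in used:
                p = freq[j]
                total += ((genotypes[i][j] - 2 * p) * (genotypes[k][j] - 2 * p)
                          / (2 * p * (1 - p)))
            rel[i][k] = rel[k][i] = total / len(used)
    return rel


# Toy data: individuals 0 and 1 carry identical genotypes.
toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0], [1, 1, 1]]
rel = relationship_matrix(toy)
```

With real data the off-diagonal entries of `rel` would be the values histogrammed in each panel; a vectorised implementation would be needed for hundreds of thousands of SNPs.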
Heritability estimates in our cohort data and reported in the literature
| Trait | ||||
|---|---|---|---|---|
| Height | 0.81 | 0.90 | 0.88 (0.09) (36) | 0.46 (0.05) (36) |
| BMI | 0.29 | 0.56 | 0.34 (0.12) (36) | 0.14 (0.05) (36) |
| HDL | 0.57 | 0.61 | 0.48 (0.11) (36) | 0.12 (0.05) (36) |
Figure 2. Accuracy of penalized regression models as we increase the number of input SNPs together with accuracy of the GWAMA-based polygenic score. (A) Within-cohort prediction in Croatia. (B) Within-cohort prediction in Orkney. (C) Across-cohort prediction (training & validation: Croatia, testing: Orkney). y-axis: accuracy is measured using the Pearson's correlation coefficient between the predicted and true phenotype of the testing samples. The error bar corresponds to the 95% Confidence Interval (upper CI). x-axis: an increasing number of SNPs (genetic markers) are given as input to the regression models. The SNP selection is performed by GWAS pre-filtering using the training data in each case. The number of input SNPs for the GWAMA-based risk score is constant (height: 180 SNPs, BMI: 32 SNPs, HDL: 70 SNPs). Black ‘x’ symbol: depicts the optimal penalized regression model (in terms of the type of shrinkage penalty and the number of input markers) based on prediction accuracy on the validation set. This corresponds to the model that we would select as our best predictor in a real-world application.
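As a rough illustration of the quantities plotted here, the sketch below builds a polygenic score as a weighted sum of allele counts and measures accuracy as the Pearson correlation with the observed phenotype. The caption does not say how the 95% confidence interval is obtained; the Fisher z-transform used in `fisher_ci` is an assumption, and all function names and toy numbers are hypothetical.

```python
import math


def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts; weights stand in for published effect sizes."""
    return sum(w * x for w, x in zip(weights, allele_counts))


def pearson_r(pred, true):
    """Pearson correlation between predicted and observed phenotypes."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(pred, true))
    var_p = sum((a - mean_p) ** 2 for a in pred)
    var_t = sum((b - mean_t) ** 2 for b in true)
    return cov / math.sqrt(var_p * var_t)


def fisher_ci(r, n, z_crit=1.959963984540054):
    """Approximate 95% CI for a correlation via the Fisher z-transform (assumed method)."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Toy example: four testing samples with made-up predictions and phenotypes.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(predicted, observed)
lower, upper = fisher_ci(r, len(predicted))
```

The upper bound returned by `fisher_ci` plays the role of the error bar in the figure, under the stated assumption about how the interval is formed.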
Figure 3. Accuracy of LASSO and RR in within-cohort prediction computed using ‘related’, ‘unrelated’ and ‘all’ testing individuals (IBS threshold = 0.1) together with accuracy of the GWAMA-based polygenic score for the same individuals. The ‘related’ group (triangles) contains testing individuals with at least one nominal relative in the training data (IBS ≥ 0.1). The ‘unrelated’ group (squares) contains testing samples that are nominally unrelated to all the training samples. The ‘all’ group (circles) contains all the testing samples. y-axis: accuracy is measured using the Pearson's correlation coefficient between the predicted and true phenotype of the corresponding testing samples. x-axis: an increasing number of SNPs are given as input to the regression models (plotted in log-scale). The SNP selection is performed by GWAS pre-filtering using the training data in each case. Left: training, validation and testing are performed using samples from the Croatia dataset (nested cross-validation design). Right: training, validation and testing are performed using samples from the Orkney dataset (nested cross-validation design). Performance in the ‘related’ group is always better than that in the ‘unrelated’ group for LASSO and RR, and this difference increases as we consider denser models. Performance of the GWAMA score is very similar in the ‘related’ and the ‘unrelated’ groups. The accuracies computed in ‘related’ and ‘unrelated’ groups based on a smaller IBS threshold (IBS threshold = 1/16) are given in Supplementary Material, Figure S8.
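The grouping rule in this caption is simple enough to sketch directly: a testing individual counts as ‘related’ if its IBS with at least one training individual reaches the threshold. The function name `split_by_relatedness` and the layout of the input (one row of IBS values per testing individual) are assumptions made for illustration.

```python
def split_by_relatedness(ibs_test_train, threshold=0.1):
    """Partition testing individuals by their closest training relative.

    ibs_test_train[t][s] is the IBS between testing individual t and
    training individual s. Returns (related, unrelated) index lists.
    """
    related, unrelated = [], []
    for idx, row in enumerate(ibs_test_train):
        # One nominal relative in the training data is enough.
        if max(row) >= threshold:
            related.append(idx)
        else:
            unrelated.append(idx)
    return related, unrelated


# Toy IBS values for three testing and two training individuals.
ibs = [[0.02, 0.50], [0.01, 0.03], [0.10, 0.00]]
related, unrelated = split_by_relatedness(ibs)
strict_related, strict_unrelated = split_by_relatedness(ibs, threshold=1 / 16)
```

Accuracy would then be computed separately on each index list; passing `threshold=1 / 16` mirrors the alternative cut-off mentioned for the supplementary figure.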
Figure 4. Accuracy of the optimal penalized regression model, the GWAMA-based polygenic score and the meta-model combining the two. y-axis: accuracy is measured using the Pearson's correlation coefficient between the predicted and true phenotype of the testing samples (nested cross-validation design for best penalized regression model, doubly nested cross-validation design for the meta-model). The error bar corresponds to the 95% Confidence Interval (upper CI). abc: denotes statistical significance from a one-tailed paired z-test comparing model performance. Superscript a denotes that a model is not statistically different from the first bar of the group, while superscript b denotes that a model is not statistically different from the second bar of the group. The type of shrinkage and number of input SNPs for the optimal penalized regression model in each case are given in Table 3.
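The caption does not give the functional form of the meta-model. One simple way to combine two predictors, shown here purely as an assumed sketch, is an ordinary least-squares regression of the phenotype on the penalized-regression prediction and the GWAMA score. The function names are hypothetical, and in a nested design the weights would be fitted on data held out from the testing samples.

```python
def fit_meta_model(pred_a, pred_b, y):
    """Least-squares fit of y ~ b0 + wa * pred_a + wb * pred_b.

    Solves the centred 2x2 normal equations in closed form.
    """
    n = len(y)
    ma, mb, my = sum(pred_a) / n, sum(pred_b) / n, sum(y) / n
    saa = sum((a - ma) ** 2 for a in pred_a)
    sbb = sum((b - mb) ** 2 for b in pred_b)
    sab = sum((a - ma) * (b - mb) for a, b in zip(pred_a, pred_b))
    say = sum((a - ma) * (t - my) for a, t in zip(pred_a, y))
    sby = sum((b - mb) * (t - my) for b, t in zip(pred_b, y))
    det = saa * sbb - sab ** 2
    if det == 0:
        raise ValueError("the two predictors are collinear")
    wa = (sbb * say - sab * sby) / det
    wb = (saa * sby - sab * say) / det
    b0 = my - wa * ma - wb * mb
    return b0, wa, wb


def predict_meta(model, a, b):
    """Apply fitted weights to one pair of base predictions."""
    b0, wa, wb = model
    return b0 + wa * a + wb * b


# Toy check: phenotype generated exactly as 1 + 2*a + 3*b.
a_vals = [0.0, 1.0, 0.0, 1.0, 2.0]
b_vals = [0.0, 0.0, 1.0, 1.0, 0.0]
y_vals = [1.0, 3.0, 4.0, 6.0, 5.0]
model = fit_meta_model(a_vals, b_vals, y_vals)
```

A linear combination keeps the ensemble easy to extend with further summary-statistic scores, at the cost of ignoring any interaction between the base predictors.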
Table 3. Type of shrinkage and number of input SNPs for the optimal penalized regression model in within-cohort prediction
| | Height Croatia | BMI Croatia | HDL Croatia | Height Orkney | BMI Orkney | HDL Orkney |
|---|---|---|---|---|---|---|
| Type of shrinkage | Ridge | Ridge | Ridge | Ridge | Ridge | Elastic Net |
| Number of input SNPs | 100 000 | 1000 | 25 000 | 25 000 | 200 000 | 100 000 |
MAFs of the three SNPs retained in the LASSO model with the highest accuracy in across-cohort prediction of HDL
| Rsid | MAF in Orkney | MAF in Croatia |
|---|---|---|
| rs3764261 | 0.3909 | 0.3138 |
| rs1532624 | 0.4922 | 0.4099 |
| rs7499892 | 0.2292 | 0.1631 |
The three SNPs have a higher minor allele frequency (MAF) in Orkney than in Croatia, so we expect these three SNPs to explain more phenotypic variance in Orkney.
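The reasoning behind this expectation can be made concrete with a short calculation. Under an additive model with Hardy-Weinberg proportions, a SNP with allele frequency p and per-allele effect β contributes genotypic variance 2p(1-p)β². The sketch below assumes, for illustration only, that each SNP has the same per-allele effect in both cohorts, so the ratio of explained variance reduces to the ratio of 2p(1-p). The MAF values are those in the table; the variable and function names are hypothetical.

```python
# MAFs from the table: (Orkney, Croatia).
mafs = {
    "rs3764261": (0.3909, 0.3138),
    "rs1532624": (0.4922, 0.4099),
    "rs7499892": (0.2292, 0.1631),
}


def genotype_variance(p):
    """Variance of an additively coded genotype under Hardy-Weinberg proportions."""
    return 2.0 * p * (1.0 - p)


# Ratio of variance explained in Orkney relative to Croatia, assuming the
# per-allele effect is identical in the two cohorts (the effect cancels out).
variance_ratio = {
    rsid: genotype_variance(orkney) / genotype_variance(croatia)
    for rsid, (orkney, croatia) in mafs.items()
}

for rsid, ratio in variance_ratio.items():
    print(f"{rsid}: Orkney/Croatia variance ratio = {ratio:.3f}")
```

Because 2p(1-p) increases with p for frequencies below 0.5, each ratio exceeds 1 for these three SNPs.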