Jacqueline N Milton, Martin H Steinberg, Paola Sebastiani.
Abstract
Many genetic markers have been shown to be associated with common quantitative traits in genome-wide association studies. Typically, these associated markers have small to modest effect sizes, and individually they explain only a small amount of the variability of the phenotype. To build a genetic prediction model without fitting a multiple linear regression model with possibly hundreds of genetic markers as predictors, researchers often summarize the joint effect of risk alleles into a genetic score that is used as a covariate in the genetic prediction model. However, the prediction accuracy can be highly variable, and selecting the optimal number of markers to include in the genetic score is challenging. In this manuscript we present a strategy to build an ensemble of genetic prediction models from data, and we show that the ensemble-based method makes the challenge of choosing the number of genetic markers more tractable. Using simulated data with varying heritability and number of genetic markers, we compare a single genetic prediction model with our proposed ensemble method in terms of predictive accuracy and the inclusion of true positive and false positive markers. The results show that the ensemble of genetic models tends to include a larger number of genetic variants than a single genetic model and is more likely to include all of the true genetic markers. This increased sensitivity comes at the price of a lower specificity, which appears to affect the predictive accuracy of the ensemble only minimally.
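The idea of a genetic score and of an ensemble of score-based models can be illustrated with a small sketch. This is not the authors' implementation: it assumes an unweighted, sign-oriented score, SNP ranking by marginal correlation, bootstrap resampling of individuals (bagging), and simple averaging of member predictions. All function names, the toy simulation, and its parameters are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def score(row, weighted_snps):
    """Genetic score: signed sum of allele counts over the selected SNPs."""
    return sum(sign * row[j] for j, sign in weighted_snps)

def fit_member(geno, pheno, k):
    """Pick the k SNPs most correlated with the phenotype (assumed ranking rule),
    then regress the phenotype on the resulting score by least squares."""
    n_snps = len(geno[0])
    cors = [pearson([row[j] for row in geno], pheno) for j in range(n_snps)]
    top = sorted(range(n_snps), key=lambda j: -abs(cors[j]))[:k]
    snps = [(j, 1 if cors[j] >= 0 else -1) for j in top]
    s = [score(row, snps) for row in geno]
    ms, mp = sum(s) / len(s), sum(pheno) / len(pheno)
    sxx = sum((v - ms) ** 2 for v in s)
    slope = sum((v - ms) * (p - mp) for v, p in zip(s, pheno)) / sxx if sxx > 0 else 0.0
    return snps, mp - slope * ms, slope

def fit_ensemble(geno, pheno, k, n_models, rng):
    """Fit one member per bootstrap resample of the training individuals."""
    n = len(geno)
    members = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        members.append(fit_member([geno[i] for i in idx], [pheno[i] for i in idx], k))
    return members

def predict(members, geno):
    """Average the member predictions for each individual."""
    return [sum(a + b * score(row, snps) for snps, a, b in members) / len(members)
            for row in geno]

def simulate(n, n_snps, causal, rng):
    """Toy data: allele counts at frequency 0.3; each causal allele adds 1.0,
    plus standard normal noise (all values are illustrative assumptions)."""
    geno = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_snps)]
            for _ in range(n)]
    pheno = [sum(row[j] for j in causal) + rng.gauss(0, 1) for row in geno]
    return geno, pheno

rng = random.Random(0)
causal = [0, 1, 2]
train_g, train_y = simulate(300, 20, causal, rng)
test_g, test_y = simulate(200, 20, causal, rng)

single = [fit_member(train_g, train_y, k=5)]
ensemble = fit_ensemble(train_g, train_y, k=5, n_models=25, rng=rng)
selected = {j for snps, _, _ in ensemble for j, _ in snps}

print("single model r:", round(pearson(predict(single, test_g), test_y), 2))
print("ensemble r:    ", round(pearson(predict(ensemble, test_g), test_y), 2))
print("SNPs used by the ensemble:", sorted(selected))
```

Because each bootstrap resample may rank the non-causal SNPs differently, the union of SNPs used across members in a sketch like this can exceed the size of any single model, which mirrors the sensitivity and specificity trade-off described in the abstract.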
Keywords: bagging predictors; ensemble-based classifiers; genetic risk prediction; genetic risk score; prediction accuracy
Year: 2015 PMID: 25628649 PMCID: PMC4292739 DOI: 10.3389/fgene.2014.00474
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Distribution of the correlation between observed values and values predicted by the single genetic model (GS) and the ensemble of genetic models (ENS GS). The side-by-side boxplots display the correlation between the observed and predicted phenotype in the 1000 simulated test data sets vs. the number of SNPs in the single genetic model (GS: odd columns) and the ensemble of genetic models (ENS GS: even columns) for increasing heritability (h²). Row 1: 5 causal SNPs; row 2: 10 causal SNPs; row 3: 30 causal SNPs.
Distribution of the number of SNPs included in the best single genetic model selected with the single split of the data (GS), the best ensemble of genetic models (ENS GS), and the best single genetic model selected using cross-validation (CV).
| Causal SNPs | Method | Lowest h² | Middle h² | Highest h² |
|---|---|---|---|---|
| 5 | GS | 5 (5, 6) | 5 (5, 5) | 5 (5, 5) |
| | ENS GS | 8 (6, 11) | 8 (7, 10) | 8 (7, 9) |
| | CV | 5 (5, 6) | 5 (5, 5) | 5 (5, 5) |
| 10 | GS | 9 (7, 11) | 10 (9, 10) | 10 (10, 10) |
| | ENS GS | 14 (10, 21) | 15 (13, 19) | 16 (14, 19) |
| | CV | 9 (7, 10) | 10 (9, 10) | 10 (10, 10) |
| 30 | GS | 17 (8, 29) | 23 (16, 29) | 24 (20, 29) |
| | ENS GS | 30 (11, 48) | 37 (30, 49) | 40 (34, 49) |
| | CV | 15 (9, 24) | 21 (16, 26) | 23 (20, 27) |
Numbers in the table are median and interquartile range.
Figure 2. Plots of model selection accuracy vs. number of causal SNPs. The top panel displays side-by-side boxplots of the proportion of causal SNPs included in the most predictive models (sensitivity), for increasing number of causal SNPs (x-axis) and increasing heritability (h²). The middle panel displays side-by-side boxplots of the specificity of the most predictive models, and the bottom panel displays summaries of the prediction accuracy of the same methods. ENS GS: ensemble of genetic models selected using a single split of the data (red); GS: single genetic model selected using a single split of the data (green); GS + CV: single genetic model selected using 10-fold cross-validation (blue).
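The selection accuracy measures in this figure can be computed from the set of SNPs a model includes and the set of truly causal SNPs. The following is a minimal sketch, assuming sensitivity is the fraction of causal SNPs included and specificity is the fraction of non-causal SNPs excluded; the function name and the example SNP sets are hypothetical and are not taken from the study.

```python
def selection_accuracy(selected, causal, n_snps):
    """Return (sensitivity, specificity) of a selected SNP set.

    selected: indices of SNPs included in the model (or in any ensemble member)
    causal:   indices of the truly causal SNPs
    n_snps:   total number of candidate SNPs
    """
    selected, causal = set(selected), set(causal)
    n_null = n_snps - len(causal)
    if not causal or n_null <= 0:
        raise ValueError("need at least one causal and one non-causal SNP")
    true_pos = len(selected & causal)    # causal SNPs that were included
    false_pos = len(selected - causal)   # non-causal SNPs that were included
    return true_pos / len(causal), (n_null - false_pos) / n_null

# Hypothetical example: 100 candidate SNPs, of which SNPs 0-9 are causal.
causal = range(10)
single_model = range(9)   # misses one causal SNP, no false positives
ensemble = range(15)      # all causal SNPs plus five false positives

gs_sens, gs_spec = selection_accuracy(single_model, causal, n_snps=100)
ens_sens, ens_spec = selection_accuracy(ensemble, causal, n_snps=100)
print(f"GS:     sensitivity={gs_sens:.2f}, specificity={gs_spec:.2f}")
print(f"ENS GS: sensitivity={ens_sens:.2f}, specificity={ens_spec:.2f}")
```

In this made-up example the larger SNP set gains sensitivity and loses some specificity, the same direction of trade-off the abstract reports for the ensemble.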
Summary of the predictive accuracy of the best single genetic model selected with the single split of the data (GS), the best ensemble of genetic models (ENS GS), and the best single genetic model selected using cross-validation (CV).
| Causal SNPs | Method | Lowest h² | Middle h² | Highest h² |
|---|---|---|---|---|
| 5 | GS | 0.41 (0.33, 0.39) | 0.54 (0.49, 0.59) | 0.60 (0.56, 0.65) |
| | ENS GS | 0.39 (0.33, 0.44) | 0.51 (0.46, 0.56) | 0.58 (0.53, 0.62) |
| | CV | 0.40 (0.38, 0.42) | 0.53 (0.51, 0.54) | 0.60 (0.59, 0.62) |
| 10 | GS | 0.37 (0.31, 0.42) | 0.52 (0.48, 0.57) | 0.60 (0.56, 0.65) |
| | ENS GS | 0.34 (0.28, 0.40) | 0.49 (0.44, 0.54) | 0.57 (0.52, 0.62) |
| | CV | 0.38 (0.35, 0.41) | 0.52 (0.50, 0.54) | 0.60 (0.58, 0.62) |
| 30 | GS | 0.26 (0.21, 0.31) | 0.41 (0.35, 0.46) | 0.51 (0.46, 0.56) |
| | ENS GS | 0.22 (0.16, 0.28) | 0.38 (0.32, 0.44) | 0.48 (0.42, 0.53) |
| | CV | 0.26 (0.21, 0.31) | 0.44 (0.41, 0.47) | 0.53 (0.51, 0.56) |
Numbers in the table are median and interquartile range.
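A "median (interquartile range)" entry of the kind reported in these tables can be produced from a list of per-simulation values as sketched below. The quartile convention is an assumption (Python's default `statistics.quantiles` method); the source does not state which convention was used, and the input values are made up for illustration.

```python
import statistics

def median_iqr(values):
    """Format values as 'median (Q1, Q3)'.

    Quartiles use the default 'exclusive' method of statistics.quantiles,
    which is an assumption; other conventions can give slightly different bounds.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return f"{med:g} ({q1:g}, {q3:g})"

# Hypothetical numbers of SNPs selected in five simulated data sets.
snp_counts = [5, 5, 6, 7, 9]
print(median_iqr(snp_counts))
```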