Abstract
Genome-wide association studies (GWAS) are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could be used either as a "black box" to promote changes in lifestyle and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction typically rank single nucleotide polymorphisms (SNPs) by the p-value of their association with the disease, and use the top-associated SNPs as input to a classification algorithm. However, the predictive power of such methods is relatively poor. To improve the predictive power, we devised BootRank, which uses bootstrapping to obtain a robust prioritization of SNPs for use in predictive models. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data and results in a more robust set of SNPs and a larger number of enriched pathways being associated with the different diseases. Finally, we show that combining BootRank with seven different classification algorithms improves performance compared to previous studies that used the WTCCC data. Notably, diseases for which BootRank results in the largest improvements were recently shown to have more heritability than previously thought, likely due to contributions from variants with low minor allele frequency (MAF), suggesting that BootRank can be beneficial in cases where SNPs affecting the disease are poorly tagged or have low MAF. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
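The abstract contrasts plain p-value ranking with a bootstrap-based ranking but does not spell out the procedure. The sketch below is one plausible reading, not the authors' implementation: it resamples individuals with replacement, ranks SNPs by an allelic chi-square p-value in each resample, and orders SNPs by their summed rank. The genotype coding (0/1/2 minor-allele counts), the choice of association test, the mean-rank aggregation, and all function names are assumptions made for illustration.

```python
import math
import random

def allelic_p_value(genotypes, labels):
    """Chi-square (1 df) p-value for a 2x2 allele-count table, cases vs controls.

    genotypes: minor-allele counts (0, 1, or 2) per individual (assumed coding).
    labels: 1 = case, 0 = control.
    """
    minor = [0, 0]
    total = [0, 0]
    for g, y in zip(genotypes, labels):
        minor[y] += g
        total[y] += 2
    major = [total[0] - minor[0], total[1] - minor[1]]
    n = total[0] + total[1]
    all_minor, all_major = minor[0] + minor[1], major[0] + major[1]
    if min(all_minor, all_major, total[0], total[1]) == 0:
        return 1.0  # degenerate table: treat as no evidence of association
    chi2 = n * (minor[0] * major[1] - minor[1] * major[0]) ** 2 / (
        all_minor * all_major * total[0] * total[1])
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def rank_by_p_value(snps, labels):
    """Plain p-value ranking: SNP indices from most to least associated."""
    p = [allelic_p_value(col, labels) for col in snps]
    return sorted(range(len(snps)), key=lambda j: p[j])

def bootstrap_rank(snps, labels, n_boot=50, seed=0):
    """Hypothetical bootstrap ranking: order SNPs by summed rank over resamples."""
    rng = random.Random(seed)
    n = len(labels)
    rank_sum = [0] * len(snps)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample individuals with replacement
        boot_labels = [labels[i] for i in idx]
        boot_snps = [[col[i] for i in idx] for col in snps]
        for rank, j in enumerate(rank_by_p_value(boot_snps, boot_labels)):
            rank_sum[j] += rank
    return sorted(range(len(snps)), key=lambda j: rank_sum[j])
```

Here `snps` is a list with one genotype list per SNP, so `bootstrap_rank(snps, labels)[:k]` would give the top `k` SNP indices to pass to a classifier. In a cross-validation setting the ranking would be computed on training individuals only.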
Year: 2013 PMID: 23990773 PMCID: PMC3749941 DOI: 10.1371/journal.pcbi.1003200
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Fraction of intersection of filtered SNP lists between different cross-validation partitions.
For each disease (T1D, Type 1 diabetes; T2D, Type 2 diabetes; CD, Crohn's disease; CAD, coronary artery disease; BD, bipolar disorder; RA, rheumatoid arthritis; HT, hypertension), shown is the mean fraction (y-axis) of top SNPs shared between training sets from different cross-validations when ranking SNPs by GWASRank (red) or BootRank (blue). The x-axis shows the number of SNPs that were selected as top SNPs from the SNP ranking.
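The stability measure in this caption can be illustrated with a short sketch. It assumes the shared fraction is computed pairwise between partitions and then averaged, which the caption does not state explicitly; the function name and the SNP identifiers in the usage note are hypothetical.

```python
from itertools import combinations

def mean_top_k_overlap(rankings, k):
    """Mean fraction of top-k SNPs shared between every pair of rankings.

    rankings: one ranked list of SNP identifiers per cross-validation partition.
    """
    tops = [set(r[:k]) for r in rankings]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)
```

For example, with three partitions ranking `["rs1", "rs2", "rs3"]`, `["rs2", "rs1", "rs4"]`, and `["rs1", "rs5", "rs2"]`, calling `mean_top_k_overlap(rankings, 2)` averages the pairwise overlaps 1.0, 0.5, and 0.5. Sweeping `k` would trace out a curve like the one plotted on the x-axis.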
Figure 2. BootRank reduces noise in p-value pathway enrichment scores and detects more enriched pathways.
(a) For each disease (T1D, Type 1 diabetes; T2D, Type 2 diabetes; CD, Crohn's disease; CAD, coronary artery disease; BD, bipolar disorder; RA, rheumatoid arthritis; HT, hypertension), shown are the differences in the average (x-axis) and coefficient of variation (y-axis) of the enrichment p-values of all significantly enriched KEGG pathways. (b) The mean noise (measured as coefficient of variation) of pathway enrichment p-values is shown for all diseases for GWASRank (red) or BootRank (blue).
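The coefficient of variation used here as the noise measure is the standard deviation divided by the mean. A minimal sketch follows; it assumes the values are one pathway's enrichment p-values across cross-validation partitions and uses the sample standard deviation, neither of which the caption specifies.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (requires a nonzero mean)."""
    return statistics.stdev(values) / statistics.fmean(values)
```

A lower value means the enrichment p-value for a pathway changes less from one partition to the next.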
Prediction performance for WTCCC data on test data.
| Disease/method | T1D | T2D | BD | CD | CAD | RA | HT |
| LO, AC | 0.75 | 0.6 | 0.67 | 0.63 | 0.6 | 0.67 | 0.61 |
| SVM | 0.82 | 0.71 | | | | | |
| GWASelect | 0.79 | | | | | | |
| SVM, LR | 0.89 | | | | | | |
| Forward ROC | 0.71 | | | | | | |
| LR, SVM, RF, BN | 0.56 | | | | | | |
| Elastic-net | 0.64 | | | | | | |
| LR, AC, SVM | 0.6 | | | | | | |
Shown are the AUC values obtained by different studies across the seven diseases in the WTCCC dataset. The reported AUCs were calculated only for test individuals. For each study, we took the best AUC reported for each disease, and missing diseases were left blank.
Diseases: T1D, Type 1 diabetes; T2D, Type 2 diabetes; CD, Crohn's disease; CAD, coronary artery disease; BD, bipolar disorder; RA, rheumatoid arthritis; HT, hypertension. Algorithms: SVM, support vector machine; LR, logistic regression; AC, allele count; RF, random forest; LO, log odds; BN, Bayesian networks.
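For readers unfamiliar with the metric in this table, AUC is the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the function name and inputs are illustrative and not taken from any of the cited studies.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise case/control comparisons.

    scores: predicted risk per individual; labels: 1 = case, 0 = control.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in the number of individuals, so a rank-based formula would be preferable for large cohorts.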
Figure 3. BootRank improves disease risk prediction for held-out test individuals.
(a) For each disease (T1D, Type 1 diabetes; T2D, Type 2 diabetes; CD, Crohn's disease; CAD, coronary artery disease; BD, bipolar disorder; RA, rheumatoid arthritis; HT, hypertension), shown are the training (empty circles) and test (filled circles) AUC values as a function of different numbers of SNPs used in the model (x-axis) when employing either GWASRank (red) or BootRank (blue) to rank SNPs. (b) The mean difference between training AUC and test AUC is shown for all diseases for GWASRank (red) or BootRank (blue).
Mean test AUC for different algorithms using BootRank.
| Disease/algorithm | T1D | T2D | BD | CD | CAD | RA | HT |
| Support vector machine (SVM) | 0.90 | 0.76 | 0.78 | 0.64 | 0.63 | 0.71 | 0.61 |
| Random forest (RF) | 0.88 | 0.76 | 0.77 | 0.65 | 0.68 | 0.71 | 0.64 |
| Regularized logistic regression (RLR) | 0.91 | 0.77 | 0.76 | 0.696 | 0.71 | 0.78 | 0.68 |
| Naïve Bayes (NB) | 0.77 | 0.83 | 0.83 | 0.67 | 0.72 | 0.71 | 0.68 |
| Allele count (AC) | 0.80 | 0.79 | 0.80 | 0.63 | 0.59 | 0.65 | 0.61 |
| Log Odds (LO) | 0.81 | 0.81 | 0.81 | 0.699 | 0.69 | 0.71 | 0.67 |
| Robust adaboost (RAB) | 0.89 | 0.78 | 0.78 | 0.695 | 0.75 | 0.75 | 0.71 |
| Majority (all algorithms) | 0.90 | 0.82 | 0.83 | 0.70 | 0.72 | 0.74 | 0.68 |
| 4-Majority (only RF, RLR, NB and RAB) | 0.91 | 0.82 | 0.82 | 0.71 | 0.75 | 0.77 | 0.70 |
Shown are the average AUC values for test individuals for the different algorithms when using BootRank, or when combining all 7 algorithms (Majority), or only 4 algorithms (4-Majority).
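The table does not say how the Majority and 4-Majority predictions are formed from the individual classifiers. One simple possibility, shown here purely as an assumption, is to use the fraction of classifiers that predict "case" as the combined score and to call an individual a case when more than half agree. Function names are hypothetical.

```python
def majority_score(votes_per_classifier):
    """Fraction of classifiers voting 'case' (1) for each individual.

    votes_per_classifier: one list of 0/1 predictions per classifier,
    all covering the same individuals in the same order.
    """
    n_clf = len(votes_per_classifier)
    return [sum(votes) / n_clf for votes in zip(*votes_per_classifier)]

def majority_label(votes_per_classifier):
    """Predict 'case' when strictly more than half of the classifiers agree."""
    return [int(s > 0.5) for s in majority_score(votes_per_classifier)]
```

Under this reading, the vote fractions from `majority_score` could serve as the ranking scores from which a combined AUC is computed, and restricting the input to four classifiers would give the 4-Majority variant.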