Gengxin Li, John Ferguson, Wei Zheng, Joon Sang Lee, Xianghua Zhang, Lun Li, Jia Kang, Xiting Yan, Hongyu Zhao.
Abstract
We consider the application of Efron's empirical Bayes classification method to risk prediction in a genome-wide association study using the Genetic Analysis Workshop 17 (GAW17) data. A major advantage of using this method is that the effect size distribution for the set of possible features is empirically estimated and that all subsequent parameter estimation and risk prediction is guided by this distribution. Here, we generalize Efron's method to allow for some of the peculiarities of the GAW17 data. In particular, we introduce two ways to extend Efron's model: a weighted empirical Bayes model and a joint covariance model that allows the model to properly incorporate the annotation information of single-nucleotide polymorphisms (SNPs). In the course of our analysis, we examine several aspects of the possible simulation model, including the identity of the most important genes, the differing effects of synonymous and nonsynonymous SNPs, and the relative roles of covariates and genes in conferring disease risk. Finally, we compare the three methods to each other and to other classifiers (random forest and neural network).
Year: 2011 PMID: 22373389 PMCID: PMC3287883 DOI: 10.1186/1753-6561-5-S9-S46
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Prediction rule of three proposed methods
| Feature | Empirical Bayes method | Weighted empirical Bayes method | Joint covariance model | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Genes | #SNP | MAF | Genes | #Syn SNP | #Non SNP | MAF | Genes | #Syn SNP | #Non SNP | MAF | |
| 1 | |||||||||||
| 2 | |||||||||||
| 3 | 1 | 0.29 | 13 | 23 | <0.01 | 1 | 0.29 | ||||
| 2 | 4 | 0.01–0.05 | |||||||||
| 1 | 2 | ≥0.05 | |||||||||
| 4 | 25 | <0.01 | 8 | 17 | <0.01 | 1 | 0.11 | ||||
| 7 | 0.01–0.05 | 5 | 2 | 0.01–0.05 | |||||||
| 3 | ≥0.05 | 2 | 1 | ≥0.05 | |||||||
| 5 | 36 | 1 | 0.29 | 1 | 0.13 | ||||||
| 6 | |||||||||||
| 3 | |||||||||||
| 6 | 1 | 0.11 | 4 | 13 | <0.01 | 4 | 13 | <0.01 | |||
| 1 | 1 | 0.01–0.05 | 1 | 1 | 0.01–0.05 | ||||||
| 1 | 1 | ≥0.05 | 1 | 1 | ≥0.05 | ||||||
| 7 | 17 | <0.01 | 1 | 0.11 | 13 | 23 | <0.01 | ||||
| 2 | 0.01–0.05 | 2 | 4 | 0.01–0.05 | |||||||
| 2 | ≥0.05 | 1 | 2 | ≥0.05 | |||||||
| 8 | 1 | 0.13 | 10 | 23 | <0.01 | 8 | 17 | <0.01 | |||
| 2 | 2 | 0.01–0.05 | 5 | 2 | 0.01–0.05 | ||||||
| 1 | 2 | ≥0.05 | 2 | 1 | ≥0.05 | ||||||
| 9 | 33 | <0.01 | 8 | 7 | < 0.01 | 1 | 0.1 | ||||
| 4 | 0.01–0.05 | 1 | 2 | 0.01–0.05 | |||||||
| 3 | ≥0.05 | 2 | ≥0.05 | ||||||||
| 10 | 14 | <0.01 | 1 | <0.01 | 14 | 12 | <0.01 | ||||
| 3 | 0.01–0.05 | 1 | 0.01–0.05 | 1 | 0.01–0.05 | ||||||
| 1 | ≥0.05 | ≥0.05 | |||||||||
Top 10 important features from the model incorporating genes and environmental variables for the three proposed methods. #SNP, number of SNPs within a specific gene; #Syn SNP, number of synonymous SNPs; #Non SNP, number of nonsynonymous SNPs. MAF shows three intervals of minor allele frequency: MAF < 0.01, 0.01 ≤ MAF < 0.05, and MAF ≥ 0.05. The boldfaced genes and environmental variables are real causal features that are selected across the three proposed models.
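The three MAF intervals used in these tables can be expressed as a small binning function. The following is a minimal Python sketch, not the authors' code: the function names `maf_interval` and `count_snps_by_interval` and the example MAF values are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Interval labels as they appear in the tables.
RARE = "<0.01"
LOW = "0.01–0.05"
COMMON = "≥0.05"


def maf_interval(maf: float) -> str:
    """Assign a minor allele frequency to one of the three table intervals."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf < 0.01:
        return RARE      # MAF < 0.01
    if maf < 0.05:
        return LOW       # 0.01 <= MAF < 0.05
    return COMMON        # MAF >= 0.05


def count_snps_by_interval(mafs):
    """Count the SNPs of one gene per MAF interval (hypothetical helper)."""
    counts = Counter(maf_interval(m) for m in mafs)
    return {label: counts.get(label, 0) for label in (RARE, LOW, COMMON)}


if __name__ == "__main__":
    # Made-up MAF values for the SNPs of a single illustrative gene.
    print(count_snps_by_interval([0.001, 0.02, 0.3, 0.004]))
```

Note that the boundaries are half-open, so a SNP with MAF exactly 0.01 falls in the middle interval and one with MAF exactly 0.05 falls in the top interval.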
Comparison of the prediction rule between the empirical Bayes and other classifiers
| Feature | Empirical Bayes method | Random forest classifier | Logistic regression | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Genes | #SNP | MAF | Genes | #SNP | MAF | Genes | #SNP | MAF | |
| 1 | |||||||||
| 2 | |||||||||
| 3 | 1 | 0.29 | 25 | <0.01 | 36 | <0.01 | |||
| 7 | 0.01–0.05 | 6 | 0.01–0.05 | ||||||
| 3 | ≥0.05 | 3 | ≥0.05 | ||||||
| 4 | 25 | <0.01 | 36 | <0.01 | 1 | 0.29 | |||
| 7 | 0.01–0.05 | 6 | 0.01–0.05 | ||||||
| 3 | ≥0.05 | 3 | ≥0.05 | ||||||
| 5 | 36 | 10 | < 0.01 | 1 | 0.11 | ||||
| 6 | 1 | 0.01–0.05 | |||||||
| 3 | 2 | ≥0.05 | |||||||
| 6 | 1 | 0.11 | 17 | <0.01 | 17 | <0.01 | |||
| 2 | 0.01–0.05 | 2 | 0.01–0.05 | ||||||
| 2 | ≥0.05 | 2 | ≥0.05 | ||||||
| 7 | 17 | <0.01 | 23 | <0.01 | 25 | <0.01 | |||
| 2 | 0.01–0.05 | 4 | 0.01–0.05 | 7 | 0.01–0.05 | ||||
| 2 | ≥0.05 | 3 | ≥0.05 | 3 | ≥0.05 | ||||
| 8 | 1 | 0.13 | 8 | <0.01 | 14 | <0.01 | |||
| 0.01–0.05 | 3 | 0.01–0.05 | |||||||
| 4 | ≥0.05 | ≥0.05 | |||||||
| 9 | 33 | <0.01 | 1 | <0.01 | 33 | <0.01 | |||
| 4 | 0.01–0.05 | 1 | 0.01–0.05 | 4 | 0.01–0.05 | ||||
| 3 | ≥0.05 | 1 | ≥0.05 | 3 | ≥0.05 | ||||
| 10 | 14 | <0.01 | 16 | <0.01 | 1 | 0.13 | |||
| 3 | 0.01–0.05 | 1 | 0.01–0.05 | ||||||
| 2 | ≥0.05 | ||||||||
The top 10 important features from the model incorporating genes and environmental variables for our proposed method (empirical Bayes) and the other classifiers (random forest and logistic regression). #SNP, number of SNPs within a specific gene. MAF shows three intervals of minor allele frequency: MAF < 0.01, 0.01 ≤ MAF < 0.05, and MAF ≥ 0.05. The boldfaced genes are real causal features that are selected simultaneously from the three models; for example, FLT1 is observed using the three classifiers.
Cross-validation error and AUC value for the three methods
| Item | Model | Statistic | Empirical Bayes method | Weighted empirical Bayes method | Joint covariance model |
|---|---|---|---|---|---|
| Cross-validation error | Gene + environment | Mean | 0.26 | 0.24 | 0.24 |
| Cross-validation error | Gene + environment | SE | 0.0020 | 0.0011 | 0.0012 |
| AUC value | Gene + environment | Mean | 0.76 | 0.80 | 0.78 |
| AUC value | Gene + environment | SE | 0.0102 | 0.0015 | 0.0148 |
| AUC value | Gene | Mean | 0.60 | 0.64 | 0.62 |
| AUC value | Gene | SE | 0.0191 | 0.0183 | 0.0191 |
AUC is the area under the ROC curve when minimizing the cross-validation error. SE, standard error of the cross-validation error and the AUC value.
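As a rough illustration of how a mean AUC and its standard error might be obtained, the sketch below computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, then summarizes several AUC values. This is an assumption-laden example rather than the authors' procedure: the record does not state how the SE was computed, so the sample standard deviation divided by the square root of the number of values is assumed here, and the function names and inputs are hypothetical.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparisons.

    labels: 1 for cases, 0 for controls. Tied scores count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def mean_and_se(values):
    """Mean and standard error (assumed here to be sd / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate an SE")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    # Made-up risk scores and case/control labels, for illustration only.
    print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
    # Made-up AUC values, e.g. one per replicate or cross-validation split.
    print(mean_and_se([0.7, 0.8, 0.9]))
```

The pairwise form is quadratic in the sample size, which is acceptable for a small illustration; a rank-based computation would be preferable for large data sets.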
Comparison of AUC value for the empirical Bayes and other classifiers
| Item | Model | Statistics | Empirical Bayes model | Random forest classifier | Neural network 1 | Neural network 2 |
|---|---|---|---|---|---|---|
| AUC value | Gene + environment | Mean | 0.76 | 0.67 | 0.68 | 0.70 |
| AUC value | Gene + environment | SE | 0.0102 | – | – | – |
AUC value indicates the area under the ROC curve when minimizing the cross-validation error. Neural network 1 used selected features from the logistic regression; neural network 2 used selected features from the empirical Bayes method. SE is the standard error of the AUC value.
Figure 1. ROC curves for the EB, WEB, and JC methods for the prediction model using genes and environmental covariates. The black dotted line is the ROC curve generated from gene and environmental covariates in the prediction model based on the empirical Bayes (EB) method. The blue solid line is the ROC curve from the weighted empirical Bayes (WEB) model. The purple dot-dashed line is the ROC curve from the joint covariance (JC) model. The red dashed line is the diagonal.
Figure 2. ROC curves for the EB, WEB, and JC methods for the prediction model using genes only. The black dotted line is the ROC curve generated from the prediction model using genes only, based on the empirical Bayes (EB) method. The blue solid line is the ROC curve from the weighted empirical Bayes (WEB) model. The purple dot-dashed line is the ROC curve from the joint covariance (JC) model. The red dashed line is the diagonal.
Prediction rule for two classifiers based on one replicate
| Feature | Empirical Bayes classifier | Random forest classifier | ||||
|---|---|---|---|---|---|---|
| Genes | #SNP | MAF | Genes | #SNP | MAF | |
| 1 | ||||||
| 2 | ||||||
| 3 | 1 | <0.01 | <0.01 | |||
| 1 | 0.01–0.05 | 3 | 0.01–0.05 | |||
| 1 | ≥0.05 | 1 | ≥0.05 | |||
| 4 | 25 | <0.01 | 9 | <0.01 | ||
| 7 | 0.01–0.05 | 1 | 0.01–0.05 | |||
| 3 | ≥0.05 | 1 | ≥0.05 | |||
| 5 | 6 | <0.01 | 19 | <0.01 | ||
| 0.01–0.05 | 3 | 0.01–0.05 | ||||
| 2 | ≥0.05 | 4 | ≥0.05 | |||
| 6 | 17 | <0.01 | 4 | <0.01 | ||
| 4 | 0.01–0.05 | 3 | 0.01–0.05 | |||
| 1 | ≥0.05 | 4 | ≥0.05 | |||
| 7 | 23 | <0.01 | 8 | <0.01 | ||
| 4 | 0.01–0.05 | 0.01–0.05 | ||||
| 2 | ≥0.05 | 4 | ≥0.05 | |||
| 8 | 1 | 0.30 | 14 | <0.01 | ||
| 3 | 0.01–0.05 | |||||
| ≥0.05 | ||||||
| 9 | 22 | <0.01 | 24 | <0.01 | ||
| 5 | 0.01–0.05 | 4 | 0.01–0.05 | |||
| 3 | ≥0.05 | 1 | ≥0.05 | |||
| 10 | 33 | <0.01 | 9 | <0.01 | ||
| 4 | 0.01–0.05 | 1 | 0.01–0.05 | |||
| 3 | ≥0.05 | 6 | ≥0.05 | |||
Top 10 important features from the model incorporating genes and environmental variables (Age and Smoke) using one replicate, for our proposed method (empirical Bayes) and the random forest method. #SNP, number of SNPs within a specific gene. MAF shows three intervals of minor allele frequency: MAF < 0.01, 0.01 ≤ MAF < 0.05, and MAF ≥ 0.05. The boldfaced gene FLT1 can still be selected using the empirical Bayes method but is not observed using the random forest method.
Cross-validation error and AUC value for the empirical Bayes and random forest methods based on one replicate
| Item | Model | Statistics | Empirical Bayes method | Random forest method |
|---|---|---|---|---|
| Cross-validation error | Gene + environment | Mean | 0.26 | 0.23 |
| Cross-validation error | Gene + environment | SE | 0.009 | – |
| AUC value | Gene + environment | Mean | 0.72 | 0.66 |
| AUC value | Gene + environment | SE | 0.058 | – |
AUC value is the area under the ROC curve when minimizing the cross-validation error. SE is the standard error of the cross-validation error and the AUC value.