Zhi Wei, Kai Wang, Hui-Qi Qu, Haitao Zhang, Jonathan Bradfield, Cecilia Kim, Edward Frackleton, Cuiping Hou, Joseph T Glessner, Rosetta Chiavacci, Charles Stanley, Dimitri Monos, Struan F A Grant, Constantin Polychronakos, Hakon Hakonarson.
Abstract
Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm to a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in an area under the ROC curve (AUC) of approximately 0.84 in both datasets. In contrast, poor performance was achieved when the SVM model or a logistic regression model was limited to dozens of known susceptibility loci. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
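The AUC reported in the abstract can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The following is a minimal sketch of that rank-based definition in plain Python; the function name `roc_auc` and the toy scores are hypothetical illustrations, not the study's code or data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for case, 0 for control. Ties count as half a correct pair,
    which matches the Mann-Whitney U formulation of the AUC.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy model scores (hypothetical, not taken from the study).
toy_scores = [1.2, 0.4, -0.3, 0.9, -1.1, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a sort-based rank computation would be preferable for cohorts of several thousand subjects.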
Year: 2009 PMID: 19816555 PMCID: PMC2748686 DOI: 10.1371/journal.pgen.1000678
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Description of the three T1D datasets used in the study.
| GWAS dataset | No. of cases | No. of controls | Array platform | Purpose |
| WTCCC-T1D | 1,963 | 1,480 | Affymetrix Mapping 500K | Prediction model training and parameter selection; evaluation of predictive models trained on CHOP/Montreal-T1D |
| CHOP/Montreal-T1D | 1,008 | 1,000 | Illumina HumanHap550 | Evaluation of predictive models trained on WTCCC-T1D, using whole-genome imputed genotype data |
| GoKinD-T1D | 1,529 | 1,458 | Affymetrix Mapping 500K | Evaluation of predictive models trained on WTCCC-T1D or CHOP/Montreal-T1D |
Evaluation of risk assessment models on the WTCCC-T1D dataset by five-fold cross-validation.
| SNP selection | SVM (support vector machine) | | | LR (logistic regression) | | | Min #SNP | Max #SNP |
| | AUC | Sensitivity | Specificity | AUC | Sensitivity | Specificity | | |
| P < 1×10⁻⁸ | 0.89 (0.017) | 0.87 (0.018) | 0.75 (0.041) | 0.89 (0.016) | 0.86 (0.026) | 0.75 (0.035) | 240 | 280 |
| P < 1×10⁻⁷ | 0.89 (0.018) | 0.87 (0.024) | 0.75 (0.036) | 0.88 (0.018) | 0.86 (0.034) | 0.76 (0.031) | 286 | 328 |
| P < 1×10⁻⁶ | 0.89 (0.018) | 0.88 (0.019) | 0.74 (0.041) | 0.89 (0.022) | 0.86 (0.033) | 0.76 (0.044) | 328 | 372 |
| P < 1×10⁻⁵ | 0.89 (0.013) | 0.88 (0.013) | 0.73 (0.041) | 0.88 (0.014) | 0.85 (0.028) | 0.75 (0.037) | 399 | 433 |
| P < 1×10⁻⁴ | 0.88 (0.012) | 0.87 (0.021) | 0.73 (0.026) | 0.87 (0.011) | 0.84 (0.016) | 0.75 (0.030) | 519 | 558 |
| P < 1×10⁻³ | 0.86 (0.010) | 0.85 (0.020) | 0.69 (0.015) | 0.80 (0.009) | 0.77 (0.040) | 0.69 (0.025) | 1007 | 1085 |
AUC: area under the receiver operating characteristic curve.
Values in parentheses are standard deviations.
Sensitivity and specificity were calculated with the default cutoff of zero.
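To illustrate what a cutoff of zero means in practice, the sketch below classifies a subject as a case when the model's decision value is above zero and then tallies sensitivity and specificity. It assumes signed decision values, such as an SVM's distance from the separating hyperplane or a logistic model's log-odds (where zero corresponds to a predicted probability of 0.5); how the study mapped each model's output to this cutoff is not stated here, and the function name and toy values are hypothetical.

```python
def sensitivity_specificity(decision_values, labels, cutoff=0.0):
    """Return (sensitivity, specificity) at a given decision cutoff.

    labels: 1 for case, 0 for control. A subject is called a case when
    its decision value is strictly greater than the cutoff.
    """
    tp = fn = tn = fp = 0
    for value, label in zip(decision_values, labels):
        predicted_case = value > cutoff
        if label == 1:
            if predicted_case:
                tp += 1  # case correctly called a case
            else:
                fn += 1  # case missed
        else:
            if predicted_case:
                fp += 1  # control wrongly called a case
            else:
                tn += 1  # control correctly called a control
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


# Toy decision values (hypothetical, not taken from the study).
toy_values = [1.2, 0.4, -0.3, 0.9, -1.1, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_sens, toy_spec = sensitivity_specificity(toy_values, toy_labels)
```

Unlike the AUC, both quantities depend on the chosen cutoff: raising it trades sensitivity for specificity, which is why a fixed default is reported alongside the threshold-free AUC.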
Figure 1. Performance of risk assessment models trained on the WTCCC-T1D dataset.
For both the CHOP/Montreal-T1D and the GoKinD-T1D datasets, the SVM (support vector machine) algorithm consistently outperforms LR (logistic regression), and the best performance is achieved when SNPs are selected using a P-value cutoff of 1×10⁻⁶ or 1×10⁻⁵.
Figure 2. Performance of risk assessment models trained on the CHOP/Montreal-T1D dataset.
For both the WTCCC-T1D and the GoKinD-T1D datasets, the SVM (support vector machine) algorithm consistently outperforms LR (logistic regression), and the best performance is achieved when SNPs are selected using a P-value cutoff of 1×10⁻⁶ or 1×10⁻⁵.
Figure 3. Specificity of the SVM-based risk assessment models.
The risk assessment models were parameterized on the WTCCC-T1D dataset and evaluated on other disease cohorts from WTCCC, including bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), and type 2 diabetes (T2D). Specificity was calculated with the default cutoff of zero. Except for RA, the specificity of the prediction model in the other disease cohorts is comparable to that in the control subjects.
Comparative analysis of prediction models by including different sets of markers.
| Marker selection (P < 1×10⁻⁵) | # markers | AUC | AUC |
| All (MHC and non-MHC) SNPs | 409 | 0.81 | 0.84 |
| MHC SNPs | 338 | 0.78 | 0.81 |
| Non-MHC SNPs | 71 | 0.65 | 0.64 |
| Pruned MHC and non-MHC SNPs | 82 | 0.74 | 0.76 |
| Pruned MHC SNPs | 27 | 0.70 | 0.74 |
| Pruned MHC SNPs | 98 | 0.74 | 0.75 |
AUC: area under the receiver operating characteristic curve.
SNPs are pruned using a pairwise r² threshold of 0.2.
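Pairwise pruning of correlated markers can be sketched as a greedy pass: walk through the SNPs in priority order and keep a SNP only if its correlation with every SNP already kept stays at or below the threshold. The sketch below assumes the threshold of 0.2 applies to the squared Pearson correlation of 0/1/2 genotype dosages and that SNPs are visited from most to least significant; both are assumptions, as are the names `pearson_r` and `greedy_prune` and the toy genotypes. The tool actually used for pruning is not named in this section.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def greedy_prune(genotypes, order, r2_threshold=0.2):
    """Keep SNPs whose r^2 with every previously kept SNP is <= threshold.

    genotypes: dict mapping SNP id -> list of 0/1/2 dosages per subject.
    order: SNP ids in priority order (assumed: most significant first).
    """
    kept = []
    for snp in order:
        correlated = any(
            pearson_r(genotypes[snp], genotypes[other]) ** 2 > r2_threshold
            for other in kept
        )
        if not correlated:
            kept.append(snp)
    return kept


# Toy genotypes (hypothetical): snp_b nearly duplicates snp_a,
# while snp_c is uncorrelated with snp_a.
toy_genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 1],
    "snp_c": [1, 0, 1, 1, 2, 1],
}
toy_kept = greedy_prune(toy_genotypes, ["snp_a", "snp_b", "snp_c"])
```

Because every candidate is compared against all kept SNPs, this version ignores physical distance; genome-wide tools typically restrict comparisons to a sliding window to keep the cost manageable.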