Yan V Sun, Zhaohui Cai, Kaushal Desai, Rachael Lawrance, Richard Leff, Ansar Jawaid, Sharon LR Kardia, Huiying Yang.
Abstract
Using the North American Rheumatoid Arthritis Consortium (NARAC) candidate gene and genome-wide single-nucleotide polymorphism (SNP) data sets, we applied regression methods and tree-based random forests (RFs) to identify genetic associations with rheumatoid arthritis (RA) and to predict RA disease status. Several genes were consistently identified as weakly associated with RA, without a significant interaction or combinatorial effect with other candidate genes. Using random forests, the tested candidate gene SNPs were not sufficient to distinguish RA patients from normal subjects with high accuracy. However, using the top 500 SNPs, ranked by importance score, from the genome-wide linkage panel of 5742 SNPs, we were able to classify RA patients and normal subjects with a sensitivity of approximately 90% and a specificity of approximately 80%, which was confirmed by five-fold cross-validation. In a complete training-testing framework, however, replication of genetic predictors was less satisfactory; thus, further evaluation of existing methodology and development of new methods are warranted.
Year: 2007 PMID: 18466563 PMCID: PMC2367463 DOI: 10.1186/1753-6561-1-s1-s62
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Summary of significant associations (p < 0.05) between RA and candidate genes
| Gene*SNP | Single-gene model: carrier-test p-value | Single-gene model: genotype-test p-value | Multiple-gene model: p-value | Multiple-gene model: OR (95% CI) |
| PTPN22*rs2476601 | <0.0001 | <0.0001 | <0.0001 | 2.43 (1.82, 3.24) |
| CTLA4*CT60 | 0.0060 | 0.0172 | 0.0115 | 0.72 (0.56, 0.93) |
| HAVCR1*5509_5511delCAA | 0.0339 | 0.0640 | 0.0731 | 0.80 (0.63, 1.02) |
| SUMO4*rs237025 | <0.0001 | 0.0002 | 0.1044 | 1.39 (0.94, 2.04) |
| MAP3K7IP2*rs577001 | 0.0012 | 0.0017 | 0.4527 | 1.14 (0.80, 1.63) |
Figure 1. Two distinct clusters of RA patients in a multidimensional scaling (MDS) plot. The X-axis and the Y-axis represent the dimensions with the two largest eigenvalues generated by the MDS algorithm.
Summary of classification accuracy rates for different classification schemes
| RFs classification | Sensitivity | Specificity | AUC of ROC |
| All cases vs. Controls | 57.5% | 60.6% | 0.59 |
| Cluster A vs. Cluster B | 77.1% | 36.3% | 0.56 |
| Cluster B vs. Controls | 67.1% | 52.3% | 0.62 |
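The sensitivity and specificity columns above follow the usual definitions: the fraction of true cases predicted as cases, and the fraction of true controls predicted as controls. The following is a minimal Python sketch of that calculation. The function name and the toy labels are illustrative assumptions, not the NARAC data or the authors' code.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    `positive` marks the case class (e.g. RA); every other value is
    treated as a control. Hypothetical helper, not from the paper.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1  # case correctly called a case
            else:
                fn += 1  # case missed
        else:
            if pred == positive:
                fp += 1  # control wrongly called a case
            else:
                tn += 1  # control correctly called a control
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 4 cases and 5 controls (made-up labels).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 1, 1]
print(sensitivity_specificity(truth, preds))
```

Reporting both numbers matters here because a classifier can trade one for the other, as the Cluster A vs. Cluster B row (77.1% vs. 36.3%) illustrates.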
Figure 2. ROC curves from five-fold cross-validation (CV) using RFs with the 500 most important SNPs. For each CV fold, a prediction model is built on the training data set, and the ROC curve is generated by comparing the predicted RA status with the true RA status in the testing data set. Each colored curve represents the prediction accuracy of one of the five CV folds.
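The per-fold evaluation described in this caption can be sketched in plain Python: split the samples into five folds, fit on four, and score the held-out fold with the area under the ROC curve. The sketch below covers only the splitting and the AUC step. The function names are hypothetical, the model-fitting step is left out, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which is an assumption about how one might compute it rather than the authors' implementation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def roc_auc(true_labels, scores, positive=1):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(true_labels, scores) if t == positive]
    neg = [s for t, s in zip(true_labels, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy usage: fold sizes for 23 samples, and the AUC of made-up scores.
for train, test in k_fold_indices(23, k=5):
    print(len(train), len(test))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a full pipeline, any SNP ranking used to pick predictors would need to be recomputed on each training fold only; ranking on all samples first lets information from the test fold leak into the model.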
Figure 3. Reducing dimensionality can improve the predictive ability of RFs. The four series show: the most important predictors in data set 1; the most important predictors in data set 2; random predictors in data set 1; random predictors in data set 2.
Overlap of the top-ranked SNPs between the two data sets, ranked by RFs or by association test
| Most significant SNPs from both data sets | RFs: number of common SNPs | RFs: p-value | Association test: number of common SNPs | Association test: p-value |
| Top 50 | 0 | NA | 1 | 0.357 |
| Top 100 | 2 | 0.525 | 2 | 0.525 |
| Top 200 | 12 | 0.046* | 8 | 0.405 |
| Top 500 | 49 | 0.205 | 48 | 0.265 |
| Top 1000 | 191 | 0.068 | 191 | 0.068 |
| Top 2000 | 679 | 0.865 | 720 | 0.113 |
| Top 3000 | 1588 | 0.216 | 1576 | 0.413 |
| Top 4000 | 2820 | 0.058 | 2815 | 0.103 |
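To put the overlap counts above in context, it helps to compare them with what two unrelated rankings would share by chance. The sketch below counts the SNPs common to the top k of two rankings and computes the chance expectation k²/N (the mean of a hypergeometric draw). Two assumptions are made loudly here: that the rankings cover the same N = 5742 SNPs mentioned in the abstract, and that chance overlap is a useful yardstick. This is not necessarily how the p-values in the table were obtained, and the function names are hypothetical.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Number of items shared by the first k entries of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


def expected_chance_overlap(k, n_total):
    """Expected shared items if both top-k sets were drawn at random
    from the same n_total items (hypergeometric mean, k * k / n_total)."""
    return k * k / n_total


# Toy rankings with made-up SNP identifiers.
rank_1 = ["rs1", "rs2", "rs3", "rs4"]
rank_2 = ["rs3", "rs1", "rs9", "rs2"]
print(top_k_overlap(rank_1, rank_2, 2))

# Chance expectation for the top 500, assuming N = 5742 (from the abstract).
print(round(expected_chance_overlap(500, 5742), 1))
```

Under these assumptions, the chance expectation for the top 500 is about 43.5 shared SNPs, close to the 49 (RFs) and 48 (association test) observed, which is consistent with the weak replication the abstract reports.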