Chih-Wei Chung1, Tzu-Hung Hsiao2, Chih-Jen Huang3, Yen-Ju Chen2,4, Hsin-Hua Chen2,4,5,6, Ching-Heng Lin2, Seng-Cho Chou1, Tzer-Shyong Chen7, Yu-Fang Chung8, Hwai-I Yang3, Yi-Ming Chen9,10,11,12,13.
Abstract
BACKGROUND: Rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE) are autoimmune rheumatic diseases that share a complex genetic background and common clinical features. The purpose of this study was to construct machine learning (ML) models for the genomic prediction of RA and SLE.
Keywords: Genome-wide association studies; Genomic prediction; Human leukocyte antigen imputation; Machine learning; Rheumatoid arthritis; Single nucleotide polymorphism; Systemic lupus erythematosus
Year: 2021 PMID: 34895289 PMCID: PMC8666017 DOI: 10.1186/s13040-021-00284-5
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1. Manhattan plot of the differences in SNPs between patients with RA and SLE. (A) Whole genome and (B) detailed HLA region. Blue and red dotted lines indicate thresholds for significance (p < 1 × 10⁻⁵ and p < 5 × 10⁻⁸, respectively). X-axis: chromosome number; y-axis: -log10(P); SNP: single nucleotide polymorphism; RA: rheumatoid arthritis; SLE: systemic lupus erythematosus
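As a small illustration of how the two thresholds in the Fig. 1 caption relate to the plot's vertical axis, the sketch below converts a p-value to its -log10 height and labels it against both lines. The function names and the labels "suggestive" and "genome-wide" are assumptions chosen for this example, not terms taken from the record.

```python
import math

# Thresholds quoted in the Fig. 1 caption.
SUGGESTIVE_P = 1e-5    # blue dotted line
GENOME_WIDE_P = 5e-8   # red dotted line


def manhattan_height(p_value: float) -> float:
    """Return -log10(p), the usual y-axis value on a Manhattan plot."""
    return -math.log10(p_value)


def significance_label(p_value: float) -> str:
    """Classify a p-value against the two caption thresholds.

    The label strings are illustrative names, not the authors' wording.
    """
    if p_value < GENOME_WIDE_P:
        return "genome-wide"
    if p_value < SUGGESTIVE_P:
        return "suggestive"
    return "not significant"


if __name__ == "__main__":
    for p in (2.5e-9, 3e-6, 0.02):
        print(p, round(manhattan_height(p), 2), significance_label(p))
```

On this scale the red line sits at roughly 7.3 and the blue line at 5.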
Comparison of machine learning model performance with 5-fold cross-validation
| Classifier | Accuracy | Precision | Sensitivity | Specificity | F1 score | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.7610 | 0.7385 | 0.7801 | 0.7430 | 0.7587 | 0.8451 |
| Random Forest | 0.9402 | 0.9376 | 0.9384 | 0.9420 | 0.9379 | 0.9871 |
| Support Vector Machine | 0.9373 | 0.9310 | 0.9398 | 0.9352 | 0.9353 | 0.9829 |
| Gradient Tree Boosting | 0.9635 | 0.9579 | 0.9668 | 0.9606 | 0.9623 | 0.9953 |
| Extreme Gradient Boosting | 0.9618 | 0.9544 | 0.9668 | 0.9573 | 0.9606 | 0.9948 |
RA: rheumatoid arthritis; SLE: systemic lupus erythematosus; AUC: area under the curve
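The F1 column can be sanity-checked against the precision and sensitivity columns, since F1 is the harmonic mean of the two. The sketch below recomputes it from the table values. It assumes the reported figures are fold averages, so small gaps in the fourth decimal place are expected and do not indicate an error; the helper names are hypothetical.

```python
# (precision, sensitivity, reported F1) copied from the table above.
REPORTED = {
    "Logistic Regression": (0.7385, 0.7801, 0.7587),
    "Random Forest": (0.9376, 0.9384, 0.9379),
    "Support Vector Machine": (0.9310, 0.9398, 0.9353),
    "Gradient Tree Boosting": (0.9579, 0.9668, 0.9623),
    "Extreme Gradient Boosting": (0.9544, 0.9668, 0.9606),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)


def f1_gaps(rows: dict) -> dict:
    """Absolute gap between recomputed and reported F1 per classifier."""
    return {
        name: abs(f1_from(precision, recall) - reported_f1)
        for name, (precision, recall, reported_f1) in rows.items()
    }


if __name__ == "__main__":
    for name, gap in f1_gaps(REPORTED).items():
        print(f"{name}: gap = {gap:.4f}")
```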
Fig. 2. Data visualization of machine learning model performance. (A) ROC curve and (B) PR curve. Comparisons of genomic prediction performances of RA and SLE by machine learning models. RA: rheumatoid arthritis; SLE: systemic lupus erythematosus; LR: logistic regression; RF: random forest; SVM: support vector machine; GTB: gradient tree boosting; XGM: extreme gradient boosting; AUC: area under the curve
Top 20 SNPs and their mapped genes, ranked by feature importance for predicting RA and SLE in the proposed models
| Rank | SNP (gradient tree boosting) | Gene | Importance | SNP (extreme gradient boosting) | Gene | Importance |
|---|---|---|---|---|---|---|
| 1 | rs6906021 | HLA-DQA1 | 0.1205 | rs34965214 | HLA-DQA1 | 0.0998 |
| 2 | rs9271858 | HLA-DRB1 | 0.1030 | rs3104376 | HLA-DQA1 | 0.0562 |
| 3 | rs9273505 | HLA-DQB1 | 0.0895 | rs1391371 | HLA-DQA1 | 0.0557 |
| 4 | Affx-28477341 | HLA-DRB5 | 0.0848 | rs9271662 | HLA-DRB1 | 0.0494 |
| 5 | rs4999342 | HLA-DRB5 | 0.0741 | rs9273322 | HLA-DQA1 | 0.0387 |
| 6 | rs41269945 | HLA-DQA1 | 0.0452 | rs1049072 | HLA-DQB1 | 0.0380 |
| 7 | rs3104376 | HLA-DQA1 | 0.0372 | rs9274605 | HLA-DQB1 | 0.0359 |
| 8 | rs9274605 | HLA-DQB1 | 0.0348 | rs9273370 | HLA-DQA1 | 0.0345 |
| 9 | rs2395533 | HLA-DQA1 | 0.0323 | rs17843604 | HLA-DQA1 | 0.0337 |
| 10 | rs9274655 | HLA-DQB1 | 0.0317 | rs9271850 | HLA-DRB1 | 0.0269 |
| 11 | rs9271662 | HLA-DRB1 | 0.0239 | rs9271588 | HLA-DRB1 | 0.0241 |
| 12 | rs3830059 | HLA-DQB1 | 0.0224 | rs9271425 | HLA-DRB1 | 0.0220 |
| 13 | rs200716952 | HLA-DQB2 | 0.0207 | rs4999342 | HLA-DRB5 | 0.0211 |
| 14 | rs1003879 | C6orf10 | 0.0201 | rs9469219 | HLA-DQB1 | 0.0210 |
| 15 | rs9271489 | HLA-DRB1 | 0.0194 | rs9271858 | HLA-DRB1 | 0.0177 |
| 16 | rs9272461 | HLA-DQA1 | 0.0166 | rs9275087 | HLA-DQB1 | 0.0173 |
| 17 | Affx-28498545 | HLA-DQB1 | 0.0137 | rs17843619 | HLA-DQA1 | 0.0165 |
| 18 | rs2395111 | NOTCH4 | 0.0135 | rs17843605 | HLA-DQA1 | 0.0162 |
| 19 | rs1049072 | HLA-DQB1 | 0.0124 | rs9273505 | HLA-DQB1 | 0.0159 |
| 20 | Affx-28494632 | HLA-DQA1 | 0.0121 | rs2894249 | C6orf10 | 0.0158 |
SNP: single nucleotide polymorphism
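Because several top-ranked SNPs map to the same gene, it can help to sum the importances per gene. The sketch below does this for the gene and importance columns of the table above. The aggregation is an illustration added here, not an analysis reported in the record, and the sums cover only the listed top 20 SNPs of each model.

```python
from collections import defaultdict

# (gene, importance) pairs in rank order, copied from the table above.
GTB = [
    ("HLA-DQA1", 0.1205), ("HLA-DRB1", 0.1030), ("HLA-DQB1", 0.0895),
    ("HLA-DRB5", 0.0848), ("HLA-DRB5", 0.0741), ("HLA-DQA1", 0.0452),
    ("HLA-DQA1", 0.0372), ("HLA-DQB1", 0.0348), ("HLA-DQA1", 0.0323),
    ("HLA-DQB1", 0.0317), ("HLA-DRB1", 0.0239), ("HLA-DQB1", 0.0224),
    ("HLA-DQB2", 0.0207), ("C6orf10", 0.0201), ("HLA-DRB1", 0.0194),
    ("HLA-DQA1", 0.0166), ("HLA-DQB1", 0.0137), ("NOTCH4", 0.0135),
    ("HLA-DQB1", 0.0124), ("HLA-DQA1", 0.0121),
]
XGB = [
    ("HLA-DQA1", 0.0998), ("HLA-DQA1", 0.0562), ("HLA-DQA1", 0.0557),
    ("HLA-DRB1", 0.0494), ("HLA-DQA1", 0.0387), ("HLA-DQB1", 0.0380),
    ("HLA-DQB1", 0.0359), ("HLA-DQA1", 0.0345), ("HLA-DQA1", 0.0337),
    ("HLA-DRB1", 0.0269), ("HLA-DRB1", 0.0241), ("HLA-DRB1", 0.0220),
    ("HLA-DRB5", 0.0211), ("HLA-DQB1", 0.0210), ("HLA-DRB1", 0.0177),
    ("HLA-DQB1", 0.0173), ("HLA-DQA1", 0.0165), ("HLA-DQA1", 0.0162),
    ("HLA-DQB1", 0.0159), ("C6orf10", 0.0158),
]


def importance_by_gene(rows):
    """Sum SNP-level importances for each gene."""
    totals = defaultdict(float)
    for gene, importance in rows:
        totals[gene] += importance
    return dict(totals)


if __name__ == "__main__":
    for label, rows in (("GTB", GTB), ("XGB", XGB)):
        ranked = sorted(importance_by_gene(rows).items(), key=lambda kv: -kv[1])
        print(label, [(gene, round(total, 4)) for gene, total in ranked])
```

In both models the largest per-gene total among the listed SNPs belongs to HLA-DQA1.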
Fig. 3. SHAP summary graph of the top 20 SNPs of the machine learning models. (A) Gradient tree boosting and (B) extreme gradient boosting models. A higher SHAP value for an SNP (x-axis) indicates a higher predicted probability of RA; a lower SHAP value indicates a higher predicted probability of SLE. Each dot is the attribution value of one SNP for one participant, computed from the prediction model. Dots are colored by the participant's feature value and accumulate vertically to indicate density. Blue represents 0/0 and red represents 0/1 or 1/1. SNP: single nucleotide polymorphism; SHAP: SHapley Additive exPlanation
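To make the sign convention in the Fig. 3 caption concrete, the sketch below computes exact Shapley values for a single participant by enumerating feature coalitions. Everything in it is hypothetical: `toy_score` is an invented three-SNP scoring function, features are coded 0 for genotype 0/0 and 1 for 0/1 or 1/1 as in the caption, and absent features are replaced by an all-zero baseline. Tree-based SHAP implementations use faster algorithms and a background distribution, so this is only a conceptual illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition take their baseline value. Cost grows
    exponentially with the number of features, so use only on toy inputs.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(g):
    """Hypothetical score: positive leans RA, negative leans SLE."""
    return 0.8 * g[0] - 0.5 * g[1] + 0.2 * g[0] * g[2]


if __name__ == "__main__":
    participant = [1, 1, 0]  # carries variants at SNP 0 and SNP 1
    print(shapley_values(toy_score, participant, baseline=[0, 0, 0]))
```

In this toy case the first SNP receives a positive attribution (toward RA), the second a negative one (toward SLE), and the attributions sum to the difference between the participant's score and the baseline score.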
Associations of imputed HLA alleles with SLE compared with RA
| HLA allele | SLE count | SLE % | RA count | RA % | P value | OR | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|---|
| DQA1*01:02 | 877 | 20.0 | 655 | 15.7 | 1.28E-07 | 1.35 | 1.21 | 1.51 |
| DQA1*01:03 | 516 | 11.8 | 365 | 8.7 | 3.13E-06 | 1.40 | 1.21 | 1.61 |
| DQA1*03:01 | 252 | 5.8 | 342 | 8.2 | 1.07E-05 | 0.69 | 0.58 | 0.81 |
| DQA1*03:02 | 662 | 15.1 | 736 | 17.5 | 1.99E-03 | 0.83 | 0.74 | 0.94 |
| DQA1*03:03 | 300 | 6.9 | 598 | 14.3 | 2.84E-29 | 0.44 | 0.38 | 0.51 |
| DQA1*05:01 | 420 | 9.6 | 181 | 4.3 | 1.48E-21 | 2.35 | 1.96 | 2.81 |
| DQA1*05:05 | 437 | 10.0 | 401 | 9.6 | 5.36E-01 | 1.05 | 0.91 | 1.21 |
| DQA1*06:01 | 312 | 7.3 | 387 | 9.3 | 3.35E-04 | 0.75 | 0.64 | 0.88 |
| DQB1*02:01 | 414 | 9.5 | 178 | 4.3 | 2.44E-21 | 2.35 | 1.96 | 2.82 |
| DQB1*03:01 | 833 | 19.0 | 858 | 20.6 | 8.55E-02 | 0.91 | 0.82 | 1.01 |
| DQB1*03:02 | 243 | 5.6 | 321 | 7.7 | 7.61E-05 | 0.71 | 0.60 | 0.84 |
| DQB1*03:03 | 684 | 15.6 | 763 | 18.2 | 1.26E-03 | 0.83 | 0.74 | 0.93 |
| DQB1*04:01 | 268 | 6.1 | 554 | 13.2 | 5.20E-29 | 0.43 | 0.37 | 0.50 |
| DQB1*05:02 | 515 | 11.7 | 431 | 10.3 | 3.10E-02 | 1.16 | 1.01 | 1.33 |
| DQB1*06:01 | 647 | 14.8 | 468 | 11.2 | 7.90E-07 | 1.38 | 1.21 | 1.56 |
| DQB1*06:02 | 248 | 5.7 | 167 | 4.0 | 3.14E-04 | 1.44 | 1.18 | 1.77 |
| DRB1*03:01 | 415 | 9.5 | 179 | 4.3 | 2.87E-21 | 2.34 | 1.96 | 2.81 |
| DRB1*04:05 | 287 | 6.6 | 607 | 14.4 | 2.51E-33 | 0.41 | 0.36 | 0.48 |
| DRB1*08:03 | 484 | 11.1 | 336 | 8.0 | 2.00E-06 | 1.42 | 1.23 | 1.65 |
| DRB1*09:01 | 665 | 15.2 | 750 | 17.8 | 6.55E-04 | 0.82 | 0.73 | 0.92 |
| DRB1*11:01 | 291 | 6.6 | 296 | 7.1 | 4.34E-01 | 0.94 | 0.79 | 1.11 |
| DRB1*12:02 | 321 | 7.3 | 387 | 9.3 | 1.28E-03 | 0.78 | 0.67 | 0.91 |
| DRB1*15:01 | 476 | 10.9 | 341 | 8.2 | 1.82E-05 | 1.37 | 1.19 | 1.59 |
| DRB1*16:02 | 310 | 7.1 | 246 | 5.9 | 2.42E-02 | 1.22 | 1.03 | 1.45 |
P values by Pearson’s chi-squared test, with RA as the reference group. HLA: human leukocyte antigen; SLE: systemic lupus erythematosus; RA: rheumatoid arthritis; OR: odds ratio; CI: confidence interval
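The odds ratios and confidence intervals in the table can be approximately reproduced from the allele counts. The sketch below uses the standard log (Woolf) method for the interval, which is an assumption, since the record does not state how the intervals were computed. The group totals are also not given in the record; here they are back-calculated from the count and percentage columns, so the result is approximate.

```python
import math


def odds_ratio_ci(case_count, case_total, ref_count, ref_total, z=1.96):
    """Odds ratio with a Wald interval on the log scale (Woolf method).

    case_* describe the group of interest (SLE), ref_* the reference (RA).
    """
    a, b = case_count, case_total - case_count
    c, d = ref_count, ref_total - ref_count
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# DQA1*01:02 row: counts are from the table; totals are ASSUMED,
# back-calculated from the rounded percentage columns.
SLE_COUNT, RA_COUNT = 877, 655
SLE_TOTAL = round(SLE_COUNT / 0.200)
RA_TOTAL = round(RA_COUNT / 0.157)

if __name__ == "__main__":
    estimate, lower, upper = odds_ratio_ci(SLE_COUNT, SLE_TOTAL, RA_COUNT, RA_TOTAL)
    print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these approximate totals the estimate lands within about 0.01 of the reported 1.35 (1.21 to 1.51); the remaining gap is consistent with rounding in the percentage columns.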