Ryan J Langdon, Paul Yousefi, Caroline L Relton, Matthew J Suderman.
Abstract
BACKGROUND: DNA methylation (DNAm) discriminates current and former smokers from never smokers very well: AUCs > 0.9 are regularly reported using a single CpG site (cg05575921; AHRR). However, few DNAm models attempt to distinguish current, former and never smokers as individual classes. A robust DNAm model that accurately distinguishes between current, former and never smokers would be particularly valuable to epidemiological research (as a more accurate smoking definition than self-report) and could potentially translate to clinical settings. Therefore, we appraise four DNAm models of ternary smoking status (that is, current, former and never smokers): methylation at cg05575921 (AHRR model), weighted scores from 13 CpGs created by Maas et al. (Maas model), weighted scores from a LASSO model of candidate smoking CpGs from the literature (candidate CpG LASSO model), and weighted scores from a LASSO model supplied with genome-wide 450K data (agnostic LASSO model). Discrimination is assessed by AUC, whilst classification performance is assessed by accuracy and kappa, derived from confusion matrices.
Keywords: Classification; Epidemiology; Epigenetic; Methylation; Smoking
Year: 2021 PMID: 34789321 PMCID: PMC8597260 DOI: 10.1186/s13148-021-01191-6
Source DB: PubMed Journal: Clin Epigenetics ISSN: 1868-7075 Impact factor: 6.551
Initial and final numbers of CpGs for each DNAm model of smoking
| Classes | Model name | Novel/literature | Number of supplied features to LASSO (CpGs) | Final number of features (CpGs) |
|---|---|---|---|---|
| “Ever” versus never | AHRR | Literature | NA | 1 |
| | Maas | Literature | NA | 13 |
| | Agnostic LASSO | Novel | 450K | 29 |
| | Candidate CpG LASSO | Novel | 14 | 9 |
| Current versus former | AHRR | Literature | NA | 1 |
| | Maas | Literature | NA | 13 |
| | Agnostic LASSO | Novel | 450K | 20 |
| | Candidate CpG LASSO | Novel | 40 | 4 |
“Literature”-based models contain pre-specified CpG sites and betas and were therefore not supplied to LASSO models in this paper. “Novel” models denote models where we supplied sets of CpGs for feature selection via cross-validated LASSO. The “final number of features” are those used to create the various DNAm scores of smoking seen in this paper
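
The feature selection described above relies on the LASSO penalty shrinking some coefficients exactly to zero. The following is a minimal sketch of that mechanism only, under loud assumptions: it fits a linear (not classification) model by coordinate descent, uses a fixed penalty rather than cross-validation, has no intercept, and runs on toy data. The function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical and are not taken from the paper.

```python
def soft_threshold(value, penalty):
    """Shrink `value` towards zero by `penalty`; return 0.0 inside the dead zone."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of outcomes. Illustration only (assumption):
    the paper's models were chosen by cross-validated LASSO, not by this code.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coef[j] = soft_threshold(rho, penalty) / scale
    return coef


# Toy data (hypothetical): the outcome depends strongly on feature 0,
# weakly on feature 1 and not at all on feature 2.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.1, -1.9, 1.9, -2.1]
coef = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
```

In this toy case only feature 0 keeps a non-zero coefficient, which mirrors how the number of supplied CpGs is reduced to the final number of features.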
Performance of DNA methylation scores for discrimination between binary smoking statuses
Comparison of the discrimination of DNAm scores for binary smoking status problems, with the AHRR model (cg05575921 methylation) as the reference. AUCs were compared to the reference using DeLong's Z-test. Green cells indicate a statistically significant improvement over the reference. Orange cells indicate statistically significantly worse performance than the reference
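
As a reminder of what the AUC measures, here is a minimal sketch that computes it as the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half. The function name `auc` and the scores are hypothetical, and the sketch does not implement DeLong's test for comparing two AUCs.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical DNAm scores where higher values indicate the positive class.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

The pairwise loop is quadratic in sample size, which is acceptable for illustration; rank-based formulas give the same value more efficiently.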
Performances of binary classifiers of smoking status
| Classes | Data | Accuracy statistics | AHRR model (reference) | Candidate CpG LASSO model | Maas model | Agnostic LASSO model |
|---|---|---|---|---|---|---|
| “Ever” versus never | Training data | Accuracy (95% CI) | 0.721 (0.693–0.747) | 0.771 (0.744–0.795) | 0.752 (0.725–0.777) | 0.792 (0.766–0.816) |
| | | NIR (P: Acc > NIR) | 0.658 (6.3 × 10⁻⁶) | 0.658 (7.3 × 10⁻¹⁶) | 0.658 (2.1 × 10⁻¹¹) | 0.658 (< 2.2 × 10⁻¹⁶) |
| | | Kappa | 0.444 | 0.527 | 0.480 | 0.577 |
| | | Sensitivity | 0.661 | 0.743 | 0.745 | 0.744 |
| | | Specificity | 0.835 | 0.824 | 0.764 | 0.885 |
| | | PPV | 0.885 | 0.890 | 0.858 | 0.925 |
| | | NPV | 0.562 | 0.625 | 0.610 | 0.658 |
| | External validation data | Accuracy (95% CI) | 0.815 (0.784–0.842) | 0.837 (0.808–0.863) | 0.822 (0.791–0.849) | 0.822 (0.791–0.849) |
| | | NIR (P: Acc > NIR) | 0.637 (< 2.2 × 10⁻¹⁶) | 0.637 (< 2.2 × 10⁻¹⁶) | 0.637 (< 2.2 × 10⁻¹⁶) | 0.637 (< 2.2 × 10⁻¹⁶) |
| | | Kappa | 0.624 | 0.661 | 0.627 | 0.633 |
| | | Sensitivity | 0.766 | 0.818 | 0.814 | 0.792 |
| | | Specificity | 0.900 | 0.869 | 0.835 | 0.873 |
| | | PPV | 0.931 | 0.917 | 0.896 | 0.917 |
| | | NPV | 0.686 | 0.731 | 0.719 | 0.705 |
| Current versus former | Training data | Accuracy (95% CI) | 0.707 (0.671–0.740) | 0.512 (0.474–0.550) | 0.700 (0.664–0.733) | 0.757 (0.723–0.788) |
| | | NIR (P: Acc > NIR) | 0.522 (< 2.2 × 10⁻¹⁶) | 0.522 (0.715) | 0.522 (< 2.2 × 10⁻¹⁶) | 0.522 (< 2.2 × 10⁻¹⁶) |
| | | Kappa | 0.416 | 0.025 | 0.403 | 0.516 |
| | | Sensitivity | 0.658 | 0.504 | 0.625 | 0.701 |
| | | Specificity | 0.761 | 0.521 | 0.781 | 0.817 |
| | | PPV | 0.750 | 0.535 | 0.758 | 0.808 |
| | | NPV | 0.670 | 0.490 | 0.656 | 0.715 |
| | External validation data | Accuracy (95% CI) | 0.646 (0.600–0.689) | 0.541 (0.494–0.587) | 0.619 (0.573–0.664) | 0.674 (0.629–0.717) |
| | | NIR (P: Acc > NIR) | 0.576 (1.3 × 10⁻³) | 0.576 (0.940) | 0.576 (0.03) | 0.576 (9.9 × 10⁻⁶) |
| | | Kappa | 0.318 | 0.093 | 0.251 | 0.373 |
| | | Sensitivity | 0.825 | 0.603 | 0.706 | 0.861 |
| | | Specificity | 0.513 | 0.494 | 0.555 | 0.536 |
| | | PPV | 0.556 | 0.468 | 0.539 | 0.578 |
| | | NPV | 0.799 | 0.628 | 0.719 | 0.839 |
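
The statistics in this table all derive from a 2×2 confusion matrix. The sketch below shows one way to compute them, assuming that "kappa" means Cohen's kappa and that the no-information rate (NIR) is the proportion of the larger class. The function name `binary_metrics` and the example counts are hypothetical, and the P value for accuracy exceeding the NIR is not computed here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary statistics from the counts of a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # No-information rate: accuracy of always predicting the larger true class.
    nir = max(tp + fn, fp + tn) / n
    # Chance agreement implied by the row and column totals (Cohen's kappa).
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "nir": nir,
        "kappa": kappa,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, not taken from the study.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

Kappa is useful alongside accuracy because it discounts the agreement expected by chance from the class proportions alone.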
Performance of ternary classifiers of smoking status (current, former and never)
| Data | Accuracy statistics | AHRR model (reference) | Candidate CpG LASSO model | Maas model | Agnostic LASSO model |
|---|---|---|---|---|---|
| Training data | Accuracy (95% CI) | 0.606 (0.576–0.635) | 0.538 (0.508–0.568) | 0.619 (0.589–0.648) | 0.695 (0.667–0.723) |
| | NIR (P: Acc > NIR) | 0.364 (< 2.2 × 10⁻¹⁶) | 0.364 (< 2.2 × 10⁻¹⁶) | 0.364 (< 2.2 × 10⁻¹⁶) | 0.364 (< 2.2 × 10⁻¹⁶) |
| | Kappa | 0.405 | 0.306 | 0.427 | 0.541 |
| | Sensitivity | 0.835 | 0.824 | 0.764 | 0.885 |
| | Specificity | 0.661 | 0.743 | 0.745 | 0.744 |
| | PPV | 0.562 | 0.625 | 0.610 | 0.643 |
| | NPV | 0.885 | 0.890 | 0.858 | 0.925 |
| | Sensitivity | 0.299 | 0.380 | 0.455 | 0.518 |
| | Specificity | 0.872 | 0.763 | 0.797 | 0.892 |
| | PPV | 0.518 | 0.423 | 0.507 | 0.687 |
| | NPV | 0.731 | 0.729 | 0.762 | 0.802 |
| | Sensitivity | 0.658 | 0.397 | 0.625 | 0.669 |
| | Specificity | 0.875 | 0.802 | 0.887 | 0.905 |
| | PPV | 0.730 | 0.512 | 0.743 | 0.787 |
| | NPV | 0.830 | 0.720 | 0.819 | 0.839 |
| External validation data | Accuracy (95% CI) | 0.612 (0.576–0.648) | 0.594 (0.557–0.609) | 0.603 (0.566–0.639) | 0.637 (0.601–0.673) |
| | NIR (P: Acc > NIR) | 0.367 (< 2.2 × 10⁻¹⁶) | 0.367 (< 2.2 × 10⁻¹⁶) | 0.367 (< 2.2 × 10⁻¹⁶) | 0.367 (< 2.2 × 10⁻¹⁶) |
| | Kappa | 0.405 | 0.390 | 0.406 | 0.462 |
| | Sensitivity | 0.900 | 0.869 | 0.835 | 0.873 |
| | Specificity | 0.766 | 0.818 | 0.814 | 0.792 |
| | PPV | 0.686 | 0.731 | 0.719 | 0.705 |
| | NPV | 0.931 | 0.917 | 0.896 | 0.917 |
| | Sensitivity | 0.171 | 0.368 | 0.297 | 0.270 |
| | Specificity | 0.914 | 0.811 | 0.824 | 0.916 |
| | PPV | 0.536 | 0.530 | 0.494 | 0.651 |
| | NPV | 0.656 | 0.689 | 0.669 | 0.684 |
| | Sensitivity | 0.825 | 0.531 | 0.706 | 0.820 |
| | Specificity | 0.748 | 0.767 | 0.771 | 0.757 |
| | PPV | 0.548 | 0.458 | 0.533 | 0.556 |
| | NPV | 0.920 | 0.815 | 0.876 | 0.919 |
N.B. Ternary classifiers are the result of two binary classifiers being applied to DNAm data in sequence: ever versus never smoker classification, then current versus former classification of the ever smokers
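
The two-stage procedure in this note can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `classify_ternary`, the use of scores on a 0 to 1 scale, and the 0.5 cutoffs are all hypothetical.

```python
def classify_ternary(ever_score, current_score, ever_cutoff=0.5, current_cutoff=0.5):
    """Apply two binary classifiers in sequence to obtain a ternary label.

    Stage 1 separates ever from never smokers. Stage 2 is applied only to
    predicted ever smokers and separates current from former smokers.
    The cutoffs are placeholders (assumption), not values from the paper.
    """
    if ever_score < ever_cutoff:
        return "never"
    return "current" if current_score >= current_cutoff else "former"


# Hypothetical (ever_score, current_score) pairs for three individuals.
labels = [classify_ternary(e, c) for e, c in [(0.2, 0.9), (0.8, 0.9), (0.8, 0.1)]]
```

One consequence of this design is that stage 1 errors cannot be recovered: an ever smoker misclassified as a never smoker never reaches the current versus former classifier.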
Summaries of contributing publicly available studies
| Publication | Liu et al. | Su et al. | Tsaprouni et al. | Ventham et al. | Overall |
|---|---|---|---|---|---|
| Consortium | EIRA | N/A | CARDIOGENICS | IBD-BIOM | – |
| GEO accession | GSE42861 | GSE85210 | GSE50660 | GSE87648 | – |
| N | 689 | 253 | 464 | 383 | 1789 |
| Mean age (SD) | 51.9 (11.8) | 34.5 (8.8) | 55.4 (6.7) | 36.7 (14.2) | 47.0 (13.8) |
| Gender | 492 female; 197 male | 82 female; 171 male | 137 female; 327 male | 184 female; 199 male | 895 female; 894 male |
| Never smokers | 193 | 81 | 179 | 171 | 624 |
| Current smokers | 228 | 172 | 22 | 99 | 559 |
| Former smokers | 266 | 0 | 263 | 106 | 597 |
Fig. 1 Diagrammatic view of two-stage approach to ternary smoking status classification
DNAm classification score generation using Maas et al. stepwise regression data
For each individual in our DNAm data, a weighted score was obtained by multiplying the normalised methylation value at each CpG by the effect size reported by Maas et al., then summing these values:
where “cpg” is the normalised methylation value in our dataset and “
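
The weighted sum described in this box can be sketched as below. The function name `weighted_score`, the CpG identifiers and the weights are placeholders (assumptions) chosen for illustration; they are not the 13 CpGs or effect sizes from Maas et al.

```python
def weighted_score(methylation, weights):
    """Sum of (normalised methylation value x effect size) over the model's CpGs.

    `methylation` maps CpG ID -> normalised methylation value for one individual.
    `weights` maps CpG ID -> effect size for the model being applied.
    """
    missing = [cpg for cpg in weights if cpg not in methylation]
    if missing:
        raise KeyError(f"missing CpGs: {missing}")
    return sum(methylation[cpg] * beta for cpg, beta in weights.items())


# Placeholder IDs and values (hypothetical), for one individual.
example_weights = {"cg_example_1": -2.0, "cg_example_2": 8.0}
example_methylation = {"cg_example_1": 0.5, "cg_example_2": 0.25, "cg_unused": 0.9}
score = weighted_score(example_methylation, example_weights)
```

CpGs present in the data but absent from the model are ignored, whereas a CpG required by the model but missing from the data raises an error rather than silently lowering the score.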