Silvana C E Maas, Athina Vidaki, Rory Wilson, Alexander Teumer, Fan Liu, Joyce B J van Meurs, André G Uitterlinden, Dorret I Boomsma, Eco J C de Geus, Gonneke Willemsen, Jenny van Dongen, Carla J H van der Kallen, P Eline Slagboom, Marian Beekman, Diana van Heemst, Leonard H van den Berg, Liesbeth Duijts, Vincent W V Jaddoe, Karl-Heinz Ladwig, Sonja Kunze, Annette Peters, M Arfan Ikram, Hans J Grabe, Janine F Felix, Melanie Waldenberger, Oscar H Franco, Mohsen Ghanbari, Manfred Kayser.
Abstract
Inferring a person's smoking habit and history from blood is relevant for complementing or replacing self-reports in epidemiological and public health research, and for forensic applications. However, a finite DNA methylation marker set and a validated statistical model based on a large dataset are not yet available. Employing 14 epigenome-wide association studies for marker discovery, and using data from six population-based cohorts (N = 3764) for model building, we identified 13 CpGs most suitable for inferring smoking versus non-smoking status from blood with a cumulative Area Under the Curve (AUC) of 0.901. Internal fivefold cross-validation yielded an average AUC of 0.897 ± 0.137, while external model validation in an independent population-based cohort (N = 1608) achieved an AUC of 0.911. These 13 CpGs also provided accurate inference of current (average cross-validation AUC 0.925 ± 0.021, external validation AUC 0.914), former (0.766 ± 0.023, 0.699) and never smoking (0.830 ± 0.019, 0.781) status, allowed inferring pack-years in current smokers (10 pack-years 0.800 ± 0.068, 0.796; 15 pack-years 0.767 ± 0.102, 0.752) and allowed inferring smoking cessation time in former smokers (5 years 0.774 ± 0.024, 0.760; 10 years 0.766 ± 0.033, 0.764; 15 years 0.767 ± 0.020, 0.754). Model application to children revealed highly accurate inference of the true non-smoking status (6 years of age: accuracy 0.994, N = 355; 10 years: 0.994, N = 309), suggesting that prenatal and passive smoking exposure have no impact on model applications in adults. This finite set of DNA methylation markers allows accurate inference of smoking habit, with accuracy comparable to that of plasma cotinine, and of smoking history from blood, which we envision becoming useful in epidemiology and public health research, and in medical and forensic applications.
Keywords: DNA methylation; Epidemiology; Epigenetics; Forensics; Smoking inference
Year: 2019 PMID: 31494793 PMCID: PMC6861351 DOI: 10.1007/s10654-019-00555-w
Source DB: PubMed Journal: Eur J Epidemiol ISSN: 0393-2990 Impact factor: 8.082
Top 20 smoking-associated CpGs from 14 previous EWASs considered here for marker sub-selection and their contribution to smoking inference from blood
| CpG ID | Chr:positionᵇ | Gene IDᶜ | Location of CpG | Cumulative AUC |
|---|---|---|---|---|
| cg05575921ᵃ | 5:373,378 | | Gene body | 0.8801 |
| cg13039251ᵃ | 5:32,018,601 | | Gene body | 0.8888 |
| cg03636183ᵃ | 19:17,000,585 | | Gene body | 0.8883 |
| cg12803068ᵃ | 7:45,002,919 | | Gene body | 0.8889 |
| cg22132788ᵃ | 7:45,002,486 | | Gene body | 0.8934 |
| cg06126421ᵃ | 6:30,720,080 | NA | – | 0.8929 |
| cg21566642ᵃ | 2:233,284,661 | NA | – | 0.8957 |
| cg23576855ᵃ | 5:373,299 | | Gene body | 0.8967 |
| cg15693572ᵃ | 3:22,412,385 | NA | – | 0.8982 |
| cg05951221ᵃ | 2:233,284,402 | NA | – | 0.8989 |
| cg01940273ᵃ | 2:233,284,934 | NA | – | 0.8998 |
| cg12876356ᵃ | 1:92,946,825 | | Gene body | 0.9005 |
| cg09935388ᵃ | 1:92,947,588 | | Gene body | 0.9010 |
| cg19572487 | 17:38,476,024 | | 5′UTR | 0.9012 |
| cg19859270 | 3:98,251,294 | | Gene body (1st Exon) | 0.9015 |
| cg18146737 | 1:92,946,700 | | Gene body | 0.9015 |
| cg21161138 | 5:399,360 | | Gene body | 0.9015 |
| cg23480021 | 3:22,412,746 | NA | – | 0.9016 |
| cg21188533 | 3:53,700,263 | | Gene body | 0.9015 |
| cg03274391 | 3:22,413,232 | NA | – | 0.9015 |
NA: not annotated to any gene according to the Illumina Infinium Human Methylation 450 K annotation file
AUC: area under the curve
ᵃ CpGs included in our final 13-CpG model
ᵇ Genome coordinates provided by Illumina (GRCh37/hg19)
ᶜ According to the Illumina Infinium Human Methylation 450 K annotation file
Fig. 1 DNA methylation β-value differences between smokers and never-smokers for the top 20 smoking-associated CpGs. Previously reported differences in β-values in mean or median (depending on availability per EWAS) between smokers and never-smokers (¤ or non-smokers, when non-smoking data was available) for the selected 20 top-associated CpGs obtained from the 14 reviewed EWASs on smoking habits that did not include samples used here for model building
Fig. 2 Cumulative AUC profile for smoking habit inference from blood based on the top 20 CpGs. The 20 CpGs were selected from previous EWASs on smoking habits (see Fig. 1) and were tested in the model-building set (N = 3764). Presented is the cumulative contribution of each of the selected 20 CpGs to the model-based smoking habit inference, shown as the AUC plotted against the number of CpGs included in the binary logistic regression model. In the model selection process, all CpGs were included first; then, using backward elimination, the CpG with the lowest z-statistic in each model was removed, one at a time. The AUC plateaus after 13 CpGs; on that basis, and considering the results of chi-squared testing, these 13 CpGs were used for further analyses
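The AUC tracked in this profile can be computed from true labels and model-predicted probabilities. The following is a minimal sketch of the rank-based (Mann-Whitney) formulation in plain Python; it is not the authors' code, the function name `auc` is illustrative, and the toy labels and scores are hypothetical values, not data from the study.

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    (label 1) receives a higher score than a randomly chosen negative
    (label 0). Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = smoker, 0 = non-smoker.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.05]
toy_auc = auc(toy_labels, toy_scores)  # 10 of 12 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in sample size, which is fine for illustration; for cohorts of several thousand samples a sort-based implementation would be the usual choice.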
Outcomes of the two-category-model (smokers vs. non-smokers) for inferring smoking habits from blood based on CpGs
| Metric | 13-CpG model: model building (N = 3764) | 13-CpG model: fivefold cross-validation (N = 3764) | 13-CpG model: external validation, KORA (N = 1608) | 10-CpG modelᵃ: model building (N = 3764) | 10-CpG modelᵃ: fivefold cross-validation (N = 3764) | 10-CpG modelᵃ: external validation, SHIP-Trend (N = 244) |
|---|---|---|---|---|---|---|
| Accuracyᵇ (95% CI) ± SD | 0.923 (0.914, 0.931) | 0.921 ± 0.008 | 0.926 (0.912, 0.938) | 0.917 (0.908, 0.926) | 0.917 ± 0.011 | 0.873 (0.825, 0.912) |
| Specificity | 0.976 | 0.976 ± 0.005 | 0.983 | 0.975 | 0.975 ± 0.006 | 0.995 |
| Sensitivity | 0.585 | 0.577 ± 0.044 | 0.580 | 0.548 | 0.551 ± 0.042 | 0.412 |
| AUC | 0.901 | 0.897 ± 0.137 | 0.911 | 0.896 | 0.893 ± 0.012 | 0.888 |
Cross-validation analysis results are presented as mean ± standard deviation (SD)
AUC: area under the curve
ᵃ Three CpGs (cg06126421, cg22132788 and cg05951221) are not included in the EPIC methylation microarray dataset from SHIP-Trend; the 10-CpG model is included here to provide a second external validation, in SHIP-Trend, in addition to the KORA validation of the full 13-CpG model
ᵇ Proportion of accurately inferred smoking habits; CI, confidence interval
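The accuracy, specificity and sensitivity reported above are standard confusion-matrix quantities, and accuracy is accompanied by a 95% confidence interval. The source does not say which interval method was used, so the sketch below assumes a Wilson score interval purely for illustration; the function names and the confusion-matrix counts are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts,
    with smokers as the positive class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # smokers correctly inferred
        "specificity": tn / (tn + fp),  # non-smokers correctly inferred
    }


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed in n samples.
    ASSUMPTION: the interval method is not stated in the source."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts, for illustration only (not taken from the study).
toy = classification_metrics(tp=6, fp=2, tn=88, fn=4)

# Reported model-building accuracy and sample size from the table above.
lo, hi = wilson_interval(0.923, 3764)
```

With the reported accuracy of 0.923 and N = 3764, this gives roughly (0.914, 0.931), which agrees with the tabulated interval to three decimals; that agreement alone does not establish which method the authors used.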
Fig. 3 Inferred probability of being a smoker versus the percentage of correctly inferred smoking habits. Histogram of predicted probabilities in our model-building dataset (N = 3764); probabilities were determined using the 13 CpGs included in the final prediction model. The y-axis presents the number of individuals for whom the predicted probability of being a smoker was within the given probability range (x-axis). The red dots present the percentage of individuals in each probability bin whose smoking habit was accurately inferred using a probability threshold of > 0.5 for being a smoker
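The quantities plotted in Fig. 3 (individuals per probability bin, and the percentage correctly inferred in each bin at the > 0.5 threshold) can be tabulated as in the sketch below. The use of ten equal-width bins is an assumption, since the caption does not give the bin width; the function name and the toy inputs are hypothetical.

```python
def per_bin_accuracy(probs, labels, n_bins=10, threshold=0.5):
    """For each equal-width probability bin, return a tuple
    (number of individuals, percent correctly inferred).
    An individual is inferred to be a smoker when prob > threshold.
    Empty bins report None for the percentage."""
    counts = [0] * n_bins
    correct = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        predicted = 1 if p > threshold else 0
        counts[b] += 1
        correct[b] += int(predicted == y)
    return [
        (counts[b], 100.0 * correct[b] / counts[b] if counts[b] else None)
        for b in range(n_bins)
    ]


# Hypothetical toy data: predicted smoker probabilities and true status (1 = smoker).
toy_probs = [0.05, 0.08, 0.45, 0.55, 0.95, 0.97]
toy_truth = [0, 0, 1, 0, 1, 1]
bins = per_bin_accuracy(toy_probs, toy_truth)
```

In this toy example the two bins adjacent to the threshold contain only misclassified individuals, while the outermost bins are fully correct, which is the kind of pattern such a plot is designed to reveal.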
Outcomes of the three-category-model (current smokers vs. former smokers vs. never smokers) for inferring smoking habits from blood based on CpGs
| Analysis | Metric | Never (N = 1243) | Former (N = 1332) | Current (N = 364) |
|---|---|---|---|---|
| Model building | Specificity | 0.746 | 0.770 | 0.997 |
| Model building | Sensitivity | 0.780 | 0.652 | 0.668 |
| Model building | AUC | 0.835 | 0.772 | 0.928 |
| Fivefold cross-validation | Specificity | 0.739 ± 0.017 | 0.766 ± 0.053 | 0.975 ± 0.008 |
| Fivefold cross-validation | Sensitivity | 0.769 ± 0.060 | 0.643 ± 0.039 | 0.669 ± 0.056 |
| Fivefold cross-validation | AUC | 0.830 ± 0.019 | 0.766 ± 0.023 | 0.925 ± 0.021 |
Cross-validation analysis results are presented as mean ± standard deviation
AUC: area under the curve
ᵃ Three CpGs (cg06126421, cg22132788 and cg05951221) are not included in the EPIC methylation microarray dataset from SHIP-Trend
Outcomes of the two-category models for inferring smoking history (years of cessation time) in former smokers from blood based on 13 CpGs
| Metric | < 5 vs ≥ 5 years cessation: model building (N = 1332) | < 5 vs ≥ 5 years cessation: fivefold cross-validation (N = 1332) | < 5 vs ≥ 5 years cessation: external validation, KORA (N = 652) | < 10 vs ≥ 10 years cessation: model building (N = 1332) | < 10 vs ≥ 10 years cessation: fivefold cross-validation (N = 1332) | < 10 vs ≥ 10 years cessation: external validation, KORA (N = 652) | < 15 vs ≥ 15 years cessation: model building (N = 1332) | < 15 vs ≥ 15 years cessation: fivefold cross-validation (N = 1332) | < 15 vs ≥ 15 years cessation: external validation, KORA (N = 652) |
|---|---|---|---|---|---|---|---|---|---|
| Accuracyᵃ (95% CI) ± SD | 0.725 (0.700, 0.749) | 0.715 ± 0.020 | 0.830 (0.799, 0.858) | 0.730 (0.705, 0.753) | 0.721 ± 0.029 | 0.799 (0.766, 0.829) | 0.732 (0.707, 0.756) | 0.718 ± 0.016 | 0.759 |
| Specificity | 0.715 | 0.691 ± 0.090 | 0.494 | 0.694 | 0.682 ± 0.063 | 0.471 | 0.663 | 0.644 ± 0.033 | 0.449 |
| Sensitivity | 0.727 | 0.718 ± 0.026 | 0.879 | 0.740 | 0.733 ± 0.026 | 0.900 | 0.767 | 0.756 ± 0.015 | 0.902 |
| AUC | 0.793 | 0.774 ± 0.024 | 0.760 | 0.778 | 0.766 ± 0.033 | 0.764 | 0.779 | 0.767 ± 0.020 | 0.754 |
Cross-validation analysis results are presented as mean ± standard deviation (SD)
AUC: area under the curve
ᵃ Proportion of accurately inferred smoking habits; CI, confidence interval
Outcomes of model applications to infer smoking history (pack-years) in current smokers (N = 364) from blood based on CpGs
| Pack-years threshold | Metric | 13-CpG model: model building (N = 364) | 13-CpG model: fivefold cross-validation | 13-CpG model: KORA F4 (N = 224) | 10-CpG modelᵃ: model building (N = 364) | 10-CpG modelᵃ: fivefold cross-validation | 10-CpG modelᵃ: SHIP-Trend (N = 41) |
|---|---|---|---|---|---|---|---|
| 10 pack-years | Accuracy (95% CI)ᵇ | 0.824 (0.781, 0.862) | 0.783 ± 0.05 | 0.813 (0.755, 0.861) | 0.808 (0.76, 0.847) | 0.770 ± 0.035 | 0.805 (0.651, 0.912) |
| 10 pack-years | Specificity | 0.644 | 0.577 ± 0.131 | 0.343 | 0.602 | 0.548 ± 0.14 | 0.778 |
| 10 pack-years | Sensitivity | 0.911 | 0.882 ± 0.045 | 0.899 | 0.907 | 0.879 ± 0.046 | 0.813 |
| 10 pack-years | AUC | 0.846 | 0.800 ± 0.068 | 0.796 | 0.834 | 0.809 ± 0.039 | 0.837 |
| 15 pack-years | Accuracy (95% CI)ᵇ | 0.733 (0.685, 0.778) | 0.719 ± 0.093 | 0.786 (0.726, 0.838) | 0.728 (0.679, 0.773) | 0.709 ± 0.059 | 0.659 (0.494, 0.799) |
| 15 pack-years | Specificity | 0.617 | 0.600 ± 0.204 | 0.455 | 0.597 | 0.575 ± 0.143 | 0.533 |
| 15 pack-years | Sensitivity | 0.819 | 0.805 ± 0.042 | 0.894 | 0.824 | 0.808 ± 0.035 | 0.731 |
| 15 pack-years | AUC | 0.815 | 0.767 ± 0.102 | 0.752 | 0.786 | 0.757 ± 0.077 | 0.779 |
Cross-validation analysis results are presented as mean ± standard deviation
Pack-years were calculated as the number of cigarettes smoked per day divided by 20, multiplied by the total years of smoking
AUC: area under the curve
ᵃ Three CpGs (cg06126421, cg22132788 and cg05951221) are not included in the EPIC methylation microarray dataset from SHIP-Trend
ᵇ Proportion of accurately inferred smoking habits; CI, confidence interval
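The pack-years definition given in the table notes (cigarettes per day divided by 20, multiplied by years of smoking) translates directly into code. The sketch below is illustrative only; the function names are hypothetical, and the inputs are made-up examples rather than study data.

```python
def pack_years(cigarettes_per_day, years_smoked):
    """Pack-years = (cigarettes per day / 20) * total years of smoking."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return cigarettes_per_day / 20.0 * years_smoked


def at_or_above(py, threshold):
    """True when pack-years reach the given threshold (e.g. 10 or 15)."""
    return py >= threshold


# Hypothetical examples: one pack a day for 15 years, half a pack a day for 20 years.
heavy = pack_years(20, 15)   # 15.0 pack-years
lighter = pack_years(10, 20)  # 10.0 pack-years
```

Under the 15 pack-years threshold, the first example falls in the "≥ 15" group and the second in the "< 15" group.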
Outcomes of the five-category-model for inferring smoking habits and smoking history from blood based on 13 CpGs
Categories: never smokers versus former smokers with > 10 years cessation time versus former smokers with ≤ 10 years cessation time versus current smokers with < 15 pack-years versus current smokers with ≥ 15 pack-years
| Analysis | Metric | Never (N = 1243) | F > 10 year (N = 1021) | F ≤ 10 year (N = 311) | < 15PY (N = 154) | ≥ 15PY (N = 210) |
|---|---|---|---|---|---|---|
| Model building | Specificity | 0.712 | 0.777 | 0.979 | 0.987 | 0.967 |
| Model building | Sensitivity | 0.817 | 0.554 | 0.206 | 0.299 | 0.724 |
| Model building | AUC | 0.835 | 0.739 | 0.793 | 0.869 | 0.949 |
| Fivefold cross-validation | Specificity | 0.711 ± 0.022 | 0.775 ± 0.036 | 0.977 ± 0.009 | 0.984 ± 0.009 | 0.963 ± 0.014 |
| Fivefold cross-validation | Sensitivity | 0.809 ± 0.047 | 0.545 ± 0.040 | 0.199 ± 0.042 | 0.274 ± 0.128 | 0.695 ± 0.064 |
| Fivefold cross-validation | AUC | 0.832 ± 0.014 | 0.731 ± 0.026 | 0.779 ± 0.018 | 0.855 ± 0.046 | 0.947 ± 0.016 |
Cross-validation analysis results are presented as mean ± standard deviation
AUC: area under the curve; F: former smokers, grouped by years of cessation time; PY: pack-years
Model application to children from the Generation R study at 6 and 10 years of age
| | Six years old: whole dataset (N = 355) | Six years old: serial samples (N = 197) | Ten years old: whole dataset (N = 309) | Ten years old: serial samples (N = 197) |
|---|---|---|---|---|
| Accuracyᵃ | 0.994 | 0.994 | 0.994 | 0.995 |
| N | 0:309 | 0:173 | 0:274 | 0:173 |
| | 1:46 | 1:24 | 1:35 | 1:24 |
| Specificity | 0.997 | 0.994 | 0.993 | 0.994 |
| Sensitivity | 0.022 | 0.0 | 0.0 | 0.0 |
| AUC | 0.649 | 0.650 | 0.606 | 0.592 |
AUC: area under the curve
ᵃ Proportion of children correctly predicted as non-smokers