Bette Loef, Albert Wong, Nicole A H Janssen, Maciek Strak, Jurriaan Hoekstra, H Susan J Picavet, H C Hendriek Boshuizen, W M Monique Verschuren, Gerrie-Cor M Herber.
Abstract
Due to the wealth of exposome data from longitudinal cohort studies that is currently available, the need for methods to adequately analyze these data is growing. We propose an approach in which machine learning is used to identify longitudinal exposome-related predictors of health, and illustrate its potential through an application. Our application involves studying the relation between exposome and self-perceived health based on the 30-year running Doetinchem Cohort Study. Random Forest (RF) was used to identify the strongest predictors due to its favorable prediction performance in prior research. The relation between predictors and outcome was visualized with partial dependence and accumulated local effects plots. To facilitate interpretation, exposures were summarized by expressing them as the average exposure and average trend over time. The RF model's ability to discriminate poor from good self-perceived health was acceptable (Area-Under-the-Curve = 0.707). Nine exposures from different exposome-related domains were largely responsible for the model's performance, while 87 exposures seemed to contribute little to the performance. Our approach demonstrates that ML can be interpreted more than widely believed, and can be applied to identify important longitudinal predictors of health over the life course in studies with repeated measures of exposure. The approach is context-independent and broadly applicable.
Year: 2022 PMID: 35725920 PMCID: PMC9209521 DOI: 10.1038/s41598-022-14632-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Flowchart of study participants. ¹ Roughly two-thirds of the participants from round 1 were randomly selected and re-invited to participate in round 2.
Overview of demographic, lifestyle, environmental, and biological exposures included in the current study.
| Exposure | Label | Roundᵃ |
|---|---|---|
| Sex | Male; female | r1 |
| Age | In years | r1–r5 |
| Educational level (highest level of education attained) | Primary education or less; lower vocational education or lower secondary education; intermediate vocational education or higher secondary education; higher vocational education or university | r1–r4 |
| Nationality | Dutch; non-Dutch | r1 |
| Marital status | Single, never married; married; widow/widower; divorced | r1–r5 |
| Household composition | With partner; with partner and children; single-parent household; single household; other household | r2–r5 |
| Working hours | In hours per week | r2–r5 |
| Alcohol use | No, never; no, I stopped using alcohol; every now and then, but less than 1 glass per week; yes | r1–r5 |
| Number of glasses of alcohol per day | In glasses per day | r1–r5 |
| Smoking status | Smoker; former smoker; never smoker | r1–r5 |
| Number of cigarettes per day | In cigarettes per day | r1–r5 |
| Smoking pack years | In the number of smoking years times the number of packs smoked per day | r1–r5 |
| Occupational physical activity (EPIC Physical Activity Questionnaire (Pols et al. 1997)) | Sedentary job; standing job; manual work; heavy manual work; not applicable | r1–r5 |
| Time spent on moderate to vigorous physical activity per week (EPIC Physical Activity Questionnaire (Pols et al. 1997)) | < 0.5 h; 0.5–3.5 h; ≥ 3.5 h, of which < 2 h vigorous; ≥ 3.5 h, of which ≥ 2 h vigorous | r2–r5 |
| Dutch Healthy Diet index 2015 (Looman et al. 2017) | On a scale from 0 to 130 (a higher score indicates higher adherence to the Dutch dietary guidelines) | r2–r4 |
| Number of hours of sleep per day | ≤ 5 h; 6 h; 7 h; 8 h; ≥ 9 h | r1–r5 |
| Reproductive cycle status | Male; female, regular cycle; female, irregular cycle; female, pregnant; female, contraceptive or hormone use; female, unknown/surgery; female, menopause | r1–r5 |
| Total NO2 concentration at home address (dispersion models (Velders et al. 2020))ᵇ | In µg/m³ | r1–r5 |
| Total PM2.5 concentration at home address (dispersion models (Velders et al. 2020))ᵇ | In µg/m³ | r1–r5 |
| Total elemental carbon concentration at home address (dispersion models (Velders et al. 2020))ᵇ | In µg/m³ | r1–r5 |
| Rail traffic noise levels in 2016 for the entire 24-h period at home address (Standard Model Instrumentation for Noise Assessments (Schreurs et al. 2010)) | In dB | r1–r5 |
| Road traffic noise levels in 2016 for the entire 24-h period at home address (Standard Model Instrumentation for Noise Assessments (Schreurs et al. 2010)) | In dB | r1–r5 |
| Normalized difference vegetation index in 2010 in buffer 300 m around home address (Landsat 5 Thematic Mapper (United States Geological Survey)) | On a scale from −1 to 1 (higher score indicating more greenness) | r1–r5 |
| Normalized difference vegetation index in 2010 in buffer 1000 m around home address (Landsat 5 Thematic Mapper (United States Geological Survey)) | On a scale from −1 to 1 (higher score indicating more greenness) | r1–r5 |
| Damp stains in the house in the past two years | Not at all; occasionally; often; always | r2–r3 |
| Mold growth in the house in the past two years | Not at all; occasionally; often; always | r2–r3 |
| Hot water supply in the house | Geyser with drain; geyser without drain; boiler; combi boiler; combination or other | r2–r3 |
| Heat source for cooking | Gas; electric; combination or other | r2–r3 |
| Pet (cat, dog, bird or rodent) in the house | Yes; no, not anymore; no, never | r2–r3 |
| Smoking in the participant's environment | Yes, at home and at work; yes, at home; yes, at work; no | r2–r3 |
| Social support measured by positive social experiences (Van Oostrom et al. 1995) | On a scale from 8 to 32 (higher score indicates more positive experiences) | r1–r3 |
| Social support measured by negative social experiences (Van Oostrom et al. 1995) | On a scale from 8 to 32 (higher score indicates more negative experiences) | r1–r3 |
| Social support measure for elderly (Van Eijk et al. 1994) | On a scale from 12 to 48 (higher score indicates more social support) | r5 |
| Loneliness scale (De Jong-Gierveld et al. 1985) | On a scale from 0 to 11 (higher score indicates more loneliness) | r5 |
| Body mass index | In kg/m² | r1–r5 |
| Waist/hip ratio | Ratio | r2–r5 |
| Waist circumference | In centimeters | r2–r5 |
| Pulse rate | In beats per minute | r1–r5 |
| Systolic pressure | In mm Hg | r1–r5 |
| Diastolic pressure | In mm Hg | r1–r5 |
| Total cholesterol | In mmol/l | r1–r5 |
| HDL cholesterol | In mmol/l | r1–r5 |
| Total/HDL cholesterol ratio | Ratio | r1–r5 |
| Use of high blood pressure medication | Yes; no | r1–r5 |
| Use of cholesterol lowering medication | Yes; no | r1–r5 |
ᵃ Measurement rounds during which an exposure was measured (round 1 20–59 years, round 2 26–65 years, round 3 31–70 years, round 4 36–75 years, round 5 41–80 years).
ᵇ Based on concentration estimates of the year 2000 for round 1–3; the average of the years 2000 and 2010 for round 4; and the year 2010 for round 5.
Figure 2. Summary of the six analysis steps.
The average value and trend over time of a few selected exposures, stratified by good or poor perceived health status at round 6.
| Exposure | Total population (n = 3419): mean/% | Total population: SD/n | Good perceived health (n = 2876): mean/% | Good perceived health: SD/n | Poor perceived health (n = 543): mean/% | Poor perceived health: SD/n | P-value |
|---|---|---|---|---|---|---|---|
| Working hours (in hours per week), AUE | 20.9 | 15.9 | 22.1 | 15.6 | 14.9 | 16.3 | < 0.001 |
| Working hours (in hours per week), TOE | − 1.2 | 7.5 | − 1.1 | 7.6 | − 1.8 | 7.0 | 0.040 |
| Marital status (% of the time married), AUE | 81 | 32 | 82 | 31 | 76 | 37 | < 0.001 |
| Marital status (% from married to widowed or divorced), TOE | 17 | 569 | 16 | 456 | 21 | 113 | 0.005 |
| Smoking (in pack years), AUE | 9 | 12 | 8 | 11 | 12 | 14 | < 0.001 |
| Smoking (in pack years), TOE | 0.8 | 2.2 | 0.8 | 2.1 | 1.1 | 2.9 | 0.005 |
| Alcohol use (% of the time every now and then or yes), AUE | 89 | 26 | 90 | 24 | 84 | 31 | < 0.001 |
| Alcohol use (% from never user to current user), TOE | 9 | 297 | 9 | 251 | 8 | 46 | 0.914 |
| NO2 concentration (in µg/m³), AUE | 27.7 | 1.9 | 27.7 | 1.9 | 27.7 | 1.9 | 0.922 |
| NO2 concentration (in µg/m³), TOE | − 1.6 | 0.6 | − 1.6 | 0.6 | − 1.6 | 0.7 | 0.662 |
| Damp stains in the house (% of the time yes), AUE | 22 | 34 | 22 | 34 | 26 | 36 | 0.014 |
| Damp stains in the house (% from no to yes), TOE | 10 | 295 | 10 | 246 | 11 | 49 | 0.553 |
| Body mass index (in kg/m²), AUE | 25.6 | 3.5 | 25.4 | 3.3 | 26.9 | 4.2 | < 0.001 |
| Body mass index (in kg/m²), TOE | 0.7 | 0.7 | 0.6 | 0.7 | 0.8 | 0.9 | < 0.001 |
| Use of high blood pressure medication (% of the time yes), AUE | 10 | 21 | 9 | 20 | 15 | 25 | < 0.001 |
| Use of high blood pressure medication (% from no to yes), TOE | 18 | 609 | 16 | 462 | 27 | 147 | < 0.001 |
For continuous exposures, the AUE and TOE indicate the average value of the exposure over time and the average trend in the exposure over time, respectively. For categorical exposures, the AUE is the proportion of the time that participants were in a particular category, and the TOE is the proportion of participants for whom a change from one reference category to another category occurred during the rounds.
AUE, Area-Under-the-Exposure; TOE, Trend-of-the-Exposure.
P-values were calculated using the independent-samples t test and the chi-square test.
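As a rough illustration of the two summaries defined above, the sketch below computes a time-weighted average (trapezoidal area divided by follow-up time) and a least-squares slope for one participant's repeated measurements. Both formulas are assumptions: the text states only that the AUE is the average value over time and the TOE the average trend over time. The function names, measurement years, and BMI values are hypothetical.

```python
from typing import Sequence


def area_under_exposure(times: Sequence[float], values: Sequence[float]) -> float:
    """Time-weighted average exposure (assumed definition: trapezoidal area / follow-up time)."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired measurements")
    area = 0.0
    for i in range(1, len(times)):
        # Trapezoid between two consecutive measurement rounds.
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2.0
    return area / (times[-1] - times[0])


def trend_of_exposure(times: Sequence[float], values: Sequence[float]) -> float:
    """Average trend (assumed definition: ordinary least-squares slope per unit of time)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return sxy / sxx


# Hypothetical participant: BMI measured at five rounds, five years apart.
years = [0, 5, 10, 15, 20]
bmi = [24.0, 24.5, 25.0, 25.5, 26.0]

aue = area_under_exposure(years, bmi)
toe = trend_of_exposure(years, bmi)
print(f"AUE = {aue:.2f} kg/m2, TOE = {toe:.3f} kg/m2 per year")
```

Summarizing each exposure as one level and one slope keeps the number of model inputs at two per exposure, regardless of how many rounds were measured.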
Figure 3. Examples of average trajectories over time of a demographic (a), a lifestyle (b), an environmental (c), and a biological (d) exposure for those with good (solid green line) and poor (dashed blue line) perceived health status at round 6.
Prediction performance metrics for the total model and the models without a particular domain of exposures.
| AUC (95% CI) | Optimal threshold ROC curve | Specificity at optimal threshold | Sensitivity at optimal threshold | Sensitivity + specificity at optimal threshold | Accuracy at optimal threshold | Specificity at a predefined point of 0.5 | Sensitivity at a predefined point of 0.5 |
|---|---|---|---|---|---|---|---|
| 0.707 (0.655–0.759) | 0.789 | 0.725 | 0.593 | 1.318 | 0.704 | 0.777 | 0.787 |
| 0.684 (0.630–0.739) | 0.792 | 0.713 | 0.565 | 1.278 | 0.690 | 0.767 | 0.759 |
| 0.695 (0.642–0.747) | 0.875 | 0.494 | 0.815 | 1.309 | 0.545 | 0.774 | 0.806 |
| 0.702 (0.650–0.754) | 0.866 | 0.539 | 0.796 | 1.335 | 0.580 | 0.774 | 0.796 |
| 0.669 (0.611–0.726) | 0.811 | 0.645 | 0.611 | 1.256 | 0.640 | 0.730 | 0.685 |
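The metrics in this table can be illustrated with a small standard-library sketch. It computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and picks the threshold that maximizes sensitivity + specificity. That selection rule is an assumption suggested by the "Sensitivity + specificity" column; the source does not spell it out. The labels and scores are made-up toy values, not study data.

```python
from typing import List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels: List[int], scores: List[float], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(labels: List[int], scores: List[float]) -> float:
    """Threshold among observed scores that maximizes sensitivity + specificity (assumed rule)."""
    return max(sorted(set(scores)), key=lambda t: sum(sens_spec(labels, scores, t)))


# Toy example: 1 = outcome present, 0 = outcome absent; scores are predicted probabilities.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

toy_auc = auc(labels, scores)
threshold = best_threshold(labels, scores)
sensitivity, specificity = sens_spec(labels, scores, threshold)
print(toy_auc, threshold, sensitivity, specificity)
```

Reporting metrics both at the data-driven threshold and at a fixed cut-off, as the table does, shows how much the classification depends on the choice of threshold.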
Figure 4. Variable importance ranking of the 30 most important exposures in predicting self-perceived health. The x-axis displays the mean decrease in accuracy that occurs when a particular exposure is permuted randomly in the random forest. AUE, Area-Under-the-Exposure; BMI, body mass index; BP, blood pressure; LPA, leisure time physical activity; r5, round 5; trend, Trend-of-the-Exposure; WHR, waist/hip ratio.
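The mean-decrease-in-accuracy idea behind this ranking can be sketched as follows. This is a simplified, model-agnostic version: a fixed hypothetical rule stands in for the random forest, and accuracy is measured on the full toy data set rather than on out-of-bag samples as a random forest implementation typically does. All names and values are illustrative assumptions.

```python
import random
from typing import Callable, List


def accuracy(predict: Callable[[List[float]], int], X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column: int, n_repeats: int = 20, seed: int = 0) -> float:
    """Mean drop in accuracy after randomly permuting one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the permuted value in place of the original one.
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats


def toy_predict(row: List[float]) -> int:
    # Hypothetical stand-in model: uses only column 0 (a BMI-like value).
    return int(row[0] > 27)


X = [[22, 5], [24, 1], [26, 3], [28, 2], [30, 4], [32, 0]]
y = [0, 0, 0, 1, 1, 1]

importance_bmi = permutation_importance(toy_predict, X, y, column=0)
importance_noise = permutation_importance(toy_predict, X, y, column=1)
print(importance_bmi, importance_noise)
```

A column the model never uses gets an importance of exactly zero, whereas permuting an informative column breaks its link with the outcome and lowers accuracy.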
Figure 5. Exposure selection through cross-validation showing the prediction performance (Area-Under-the-Curve, AUC) (y-axis) of the model using a particular number of top-ranked exposures (x-axis). The dotted gray line represents the optimum number of exposures to select (q = 9).
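The selection step shown in this figure can be outlined as a loop over the number of top-ranked exposures. In the sketch below, `cv_auc` is a hypothetical callback that would refit the model on a subset of exposures and return its cross-validated AUC; here it is replaced by a lookup of made-up values. The rule of taking the highest AUC and breaking ties in favor of fewer exposures is an assumption, not something the source specifies.

```python
from typing import Callable, Dict, List, Tuple


def select_number_of_exposures(
    ranked: List[str], cv_auc: Callable[[List[str]], float]
) -> Tuple[int, Dict[int, float]]:
    """Score the top-q exposures for every q and return the q with the best score."""
    scores = {q: cv_auc(ranked[:q]) for q in range(1, len(ranked) + 1)}
    best = max(scores.values())
    # Assumed tie-break: prefer the smallest model that reaches the best score.
    best_q = min(q for q, s in scores.items() if s == best)
    return best_q, scores


# Hypothetical ranking and made-up AUC values, for illustration only.
ranked = ["working_hours", "waist", "bmi", "loneliness", "noise"]
toy_auc_by_size = {1: 0.60, 2: 0.64, 3: 0.68, 4: 0.68, 5: 0.66}

best_q, scores = select_number_of_exposures(ranked, lambda subset: toy_auc_by_size[len(subset)])
print(best_q, ranked[:best_q])
```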
Top 9 predictors of self-perceived health.
| Rank | Exposure | Label | Type | Round | Domain |
|---|---|---|---|---|---|
| 1 | Working hours | In hours per week | Average over time | r2–r5 | Demographic |
| 2 | Waist circumference | In centimeters | Average over time | r2–r5 | Biological |
| 3 | Body mass index | In kg/m² | Average over time | r1–r5 | Biological |
| 4 | Loneliness | On a scale from 0 to 11 | Measured in round 5 | r5 | Environmental |
| 5 | Waist/hip ratio | Ratio | Average over time | r2–r5 | Biological |
| 6 | Age | In years | Average over time | r1–r5 | Demographic |
| 7 | Sleep duration | In hours per day | Average over time | r1–r5 | Lifestyle |
| 8 | Smoking pack years | In pack years | Average over time | r1–r5 | Lifestyle |
| 9 | Total/HDL cholesterol ratio | Ratio | Average trend over time | r1–r5 | Biological |
Figure 6. Partial dependence plots (PDPs) of the relation between predictors of self-perceived health and poor self-perceived health. The dotted gray line represents the reference value, i.e. the predicted outcome corresponds to the prevalence of poor perceived health in the total population at round 6 (0.16). AUE, Area-Under-the-Exposure; BMI, body mass index; r5, round 5; WHR, waist/hip ratio.
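A partial dependence curve of the kind plotted here can be computed by fixing one predictor at each grid value for every participant and averaging the model's predictions. The sketch below is a minimal version under stated assumptions: `toy_model` is a hypothetical logistic function standing in for the random forest, and the rows and grid are made-up values.

```python
import math
from typing import Callable, List


def partial_dependence(
    predict_proba: Callable[[List[float]], float],
    rows: List[List[float]],
    column: int,
    grid: List[float],
) -> List[float]:
    """Average prediction over all rows with one column forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict_proba(row[:column] + [value] + row[column + 1:]) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row: List[float]) -> float:
    # Hypothetical stand-in for the fitted model: risk rises with BMI and loneliness.
    bmi, loneliness = row
    return 1.0 / (1.0 + math.exp(-(0.2 * (bmi - 30.0) + 0.3 * (loneliness - 5.0))))


rows = [[22.0, 1.0], [25.0, 3.0], [28.0, 0.0], [31.0, 6.0]]
grid = [20.0, 25.0, 30.0, 35.0]

pdp_bmi = partial_dependence(toy_model, rows, column=0, grid=grid)
# Reference line: the average prediction with no predictor altered.
reference = sum(toy_model(row) for row in rows) / len(rows)
print(pdp_bmi, reference)
```

Where the curve lies above the reference value, the model predicts a higher-than-average probability of the outcome at that predictor value, and below it a lower one.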