Francisco O Cortés-Ibañez, Sunil Belur Nagaraj, Ludo Cornelissen, Grigory Sidorenkov, Geertruida H de Bock.
Abstract
Health behaviors affect health status in cancer survivors. We hypothesized that nonlinear algorithms would identify distinct key health behaviors compared to a linear algorithm and better classify cancer survivors. We aimed to use three nonlinear algorithms to identify such key health behaviors and to compare their performance with that of logistic regression for distinguishing cancer survivors from those without cancer in a population-based cohort study. We used six health behaviors and three socioeconomic factors for analysis. Participants from the Lifelines population-based cohort were classified into a cancer-survivor group and a cancer-free group using either nonlinear algorithms or logistic regression, and performance was compared by the area under the curve (AUC). In addition, we performed case-control analyses (matched by age, sex, and education level) to evaluate classification performance by health behaviors alone. Data were collected for 107,624 cancer-free participants and 2760 cancer survivors. Using all variables resulted in an AUC of 0.75 ± 0.01. Using only the six health behaviors, the logistic regression and nonlinear algorithms differentiated cancer survivors from cancer-free participants with AUCs of 0.62 ± 0.01 and 0.60 ± 0.01, respectively. The main distinctive classifier was age. Though not relevant to classification, the main distinctive health behaviors were body mass index and alcohol consumption. In the case-control analyses, algorithms produced AUCs of 0.52 ± 0.01. Neither the linear nor the nonlinear algorithms identified key health behaviors that differentiate cancer survivors from cancer-free participants in this population-based cohort.
Keywords: cancer survivors; classification; health behaviors; lifestyle; machine learning; medical informatics
Year: 2021 PMID: 34066093 PMCID: PMC8151639 DOI: 10.3390/cancers13102335
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639
Figure 1. Study cohort selection based on health behaviors and socioeconomic factors.
Figure 2. Overview of the procedure followed to reduce class imbalance (equalization strategy) in the Lifelines cohort. * Variables in the testing set were standardized using the mean and standard deviations from the training set.
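The equalization and standardization steps named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes, without confirmation from the source, that the cancer-free group was cut into disjoint subsets of roughly the size of the survivor group (which would give the 39 subsets reported in the results), and the function names `split_majority`, `fit_scaler`, and `standardize` are hypothetical.

```python
import random
from statistics import mean, pstdev


def split_majority(majority_ids, n_minority, seed=0):
    """Shuffle the majority class and cut it into disjoint subsets
    whose size is close to the minority class size (assumed strategy)."""
    ids = list(majority_ids)
    random.Random(seed).shuffle(ids)
    n_subsets = max(1, round(len(ids) / n_minority))
    # Round-robin slicing keeps subset sizes within one of each other.
    return [ids[i::n_subsets] for i in range(n_subsets)]


def fit_scaler(train_column):
    """Estimate mean and standard deviation on the training data only."""
    mu = mean(train_column)
    sd = pstdev(train_column) or 1.0  # guard against a constant column
    return mu, sd


def standardize(column, mu, sd):
    """Apply training-set statistics to any column (training or testing)."""
    return [(x - mu) / sd for x in column]


# Group sizes taken from the abstract: 107,624 cancer-free, 2760 survivors.
subsets = split_majority(range(107_624), n_minority=2_760)
print(len(subsets))  # 39

# Hypothetical values: the test column is scaled with training statistics.
mu, sd = fit_scaler([22.0, 25.0, 28.0])
test_scaled = standardize([25.0, 31.0], mu, sd)
```

Reusing the training mean and standard deviation for the testing set, as the caption describes, keeps information from the testing set out of the preprocessing step.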
Baseline characteristics of participants stratified into cancer survivors, matched cancer-free controls, and cancer-free.
| Variables | Cancer Survivors | Matched Cancer-Free Controls | All Participants without Cancer |
|---|---|---|---|
| Participants | 2760 | 2759 | 107,624 |
| Age, mean (SD) | 57 (18) | 57 (18) | 44 (16) |
| Sex, females (%) | 1883 (68.2%) | 1882 (68.2%) | 62,910 (58.5%) |
| Education level | | | |
| Low (%) | 1209 (43.8%) | 1179 (42.7%) | 30,676 (28.5%) |
| Medium (%) | 862 (31.2%) | 869 (31.5%) | 43,107 (40.1%) |
| High (%) | 689 (25.0%) | 711 (25.8%) | 33,841 (31.4%) |
| Time since cancer diagnosis | | | |
| ≤5 years (%) | 1153 (41.7%) | | |
| >5 years (%) | 1607 (58.3%) | | |
| Body mass index | 26.2 (5.20) | 26.0 (5.10) | 25.4 (5.20) |
| Smoking, g/day, mean (SD) | 2.02 (5.69) | 1.65 (4.85) | 2.21 (5.75) |
| Never (%) | 1013 (36.7%) | 1095 (39.7%) | 50,624 (47.0%) |
| Former (%) | 1288 (46.7%) | 1238 (44.9%) | 35,067 (32.6%) |
| Current (%) | 459 (16.6%) | 426 (15.4%) | 21,933 (20.4%) |
| Alcohol intake, g/day | 3.31 (9.35) | 3.57 (9.24) | 3.95 (9.46) |
| Physical activity, hrs/week | 3.25 (5.75) | 3.50 (5.58) | 3.00 (5.00) |
| Diet, LLDS | 26.00 (8.00) | 26.00 (8.00) | 24.00 (8.00) |
| Sedentary behavior (TV hrs/day) | 3.00 (1.61) | 3.50 (1.50) | 2.00 (1.50) |
Abbreviations: SD, standard deviation; hrs, hours; LLDS, Lifelines Diet Score; TV, television. Results are shown as median (interquartile range) unless otherwise specified.
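The matched control group in the table above was formed by matching on age, sex, and education level. A minimal sketch of one way to do this is greedy 1:1 exact matching; the source does not state the matching algorithm, so this is an assumption, and the function `match_controls`, the field names, and the example records are all hypothetical.

```python
import random
from collections import defaultdict


def match_controls(cases, pool, keys=("age", "sex", "education"), seed=0):
    """Greedy 1:1 exact matching: each case receives one unused control
    with identical values for every matching key, if one exists."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in pool:
        by_stratum[tuple(person[k] for k in keys)].append(person)
    for members in by_stratum.values():
        rng.shuffle(members)  # pick controls at random within a stratum

    pairs = []
    for case in cases:
        candidates = by_stratum.get(tuple(case[k] for k in keys))
        if candidates:
            pairs.append((case, candidates.pop()))  # control is used once
    return pairs


# Hypothetical records for illustration only.
cases = [
    {"id": "s1", "age": 57, "sex": "F", "education": "low"},
    {"id": "s2", "age": 80, "sex": "M", "education": "high"},
]
pool = [
    {"id": "c1", "age": 57, "sex": "F", "education": "low"},
    {"id": "c2", "age": 44, "sex": "F", "education": "low"},
]
pairs = match_controls(cases, pool)
print([(case["id"], control["id"]) for case, control in pairs])  # [('s1', 'c1')]
```

Under exact matching, a case with no identical control stays unmatched, so the control group can end up slightly smaller than the case group.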
Overall performance of machine learning algorithms by AUCs for the 39 subsets and case–control analysis.
| Scenarios | AUC, 39 Subsets: Logistic | AUC, 39 Subsets: Random Forest | AUC, 39 Subsets: Support Vector Machines | AUC, 39 Subsets: Gradient Boosting Machines | AUC, Case–Controls: Logistic | AUC, Case–Controls: Random Forest | AUC, Case–Controls: Support Vector Machines | AUC, Case–Controls: Gradient Boosting Machines |
|---|---|---|---|---|---|---|---|---|
| All variables included * (95% CI). | 0.75 ± 0.01 | 0.75 ± 0.01 | 0.76 ± 0.02 | 0.74 ± 0.01 | 0.52 ± 0.01 | 0.52 ± 0.01 | 0.55 ± 0.02 | 0.53 ± 0.01 |
| - Excluding age (95% CI) | 0.63 ± 0.01 | 0.63 ± 0.01 | 0.66 ± 0.01 | 0.65 ± 0.02 | - | - | - | - |
| - Excluding age and sex (95% CI) | 0.62 ± 0.01 | 0.63 ± 0.01 | 0.65 ± 0.01 | 0.64 ± 0.02 | - | - | - | - |
| - Excluding age and education level (95% CI) | 0.60 ± 0.01 | 0.62 ± 0.01 | 0.63 ± 0.01 | 0.61 ± 0.01 | - | - | - | - |
Abbreviation: AUC, area under the receiver operating characteristic curve. * All health behaviors and socioeconomic factors included.
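To make the AUC values in this table easier to interpret, the following minimal sketch computes the AUC as the probability that a randomly chosen cancer survivor receives a higher predicted score than a randomly chosen cancer-free participant. It is a generic pairwise formulation, not the authors' code, and the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive
    has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = cancer survivor, 0 = cancer-free.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores))  # 0.75

# A classifier that gives everyone the same score is at chance level.
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

On this scale, 0.5 is chance-level discrimination, so the case–control AUCs of 0.52 to 0.55 indicate almost no separation by health behaviors alone.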
Consistency of variable importance in the random forest classifier by the mean decrease in Gini (MDG) for every subanalysis.
| Variables | All Variables | Health Behaviors * | Case–Control * |
|---|---|---|---|
| Age | 100 | - | - |
| Sex | 7.65 | - | - |
| Education level | 6.03 | - | - |
| Body Mass Index | 56.44 | 100 | 100 |
| Alcohol intake | 54.04 | 99.42 | 99.15 |
| Physical activity | 45.87 | 83.23 | 84.93 |
| Diet | 43.77 | 73.95 | 76.93 |
| Sedentary behavior | 32.96 | 53.19 | 58.77 |
| Smoking | 13.27 | 12.84 | 24.30 |
The scale ranges from 1 to 100, where a number close to 100 indicates a more important variable in the analysis. The data show the consistency when including all variables, when including only health behaviors, and in the case–control analysis. * In these analyses, we included only health behaviors; therefore, data for age, sex, and education level are not shown.
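A relative importance scale like the one in this table can be obtained by dividing each raw importance by the largest one. The sketch below assumes that is how the scores were rescaled, which the source does not state; the function `scale_importance` and the raw values are hypothetical.

```python
def scale_importance(raw):
    """Rescale raw importances so the top variable scores 100,
    returning variables ordered from most to least important."""
    top = max(raw.values())
    if top <= 0:
        raise ValueError("at least one importance must be positive")
    ordered = sorted(raw.items(), key=lambda item: item[1], reverse=True)
    return {name: round(100 * value / top, 2) for name, value in ordered}


# Hypothetical raw mean-decrease-in-Gini values, not taken from the study.
raw_mdg = {"smoking": 10.0, "bmi": 40.0, "alcohol": 30.0}
scaled = scale_importance(raw_mdg)
print(scaled)  # {'bmi': 100.0, 'alcohol': 75.0, 'smoking': 25.0}
```

Because each column is rescaled to its own maximum, values are comparable within a column but not across columns.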