Sang Min Nam1, Thomas A Peterson2, Kyoung Yul Seo3, Hyun Wook Han4, Jee In Kang5.
Abstract
BACKGROUND: In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.
Keywords: XGBoost; depression; epidemiology; machine learning; network; prediction model
Year: 2021 PMID: 34184998 PMCID: PMC8277318 DOI: 10.2196/27344
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. Flow diagram of the study. AUC: area under the curve; EBICglasso: extended Bayesian information criterion graphical lasso; KNHANES: Korea National Health and Nutrition Examination Survey.
Figure 2. XGBoost model for depression and its performance test. (A) Beeswarm plot of Shapley additive explanation (SHAP) values (log odds of current depression) for the training data; red dots represent a positive value for a binary factor or a high feature value for a continuous factor. (B) Weighted receiver operating characteristic (ROC) curve; sensitivity is 0.78 and specificity is 0.82 at the best probability threshold of 0.461 for test samples. (C) The model is also tested using the Patient Health Questionnaire-9 (depression when score ≥10) for 6098 samples. AUC: area under the curve.
Confusion matrix for the test data set.
| Actual depression | Predicted depression^a: current | Predicted depression^a: no lifetime |
| --- | --- | --- |
| Current | 103,182 | 28,933 |
| No lifetime | 804,195 | 3,657,604 |

^a Case numbers are estimated using complex-survey-design weights at the best threshold.
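
The sensitivity and specificity quoted in the Figure 2 caption can be checked against the weighted confusion matrix above. The following is a minimal sketch, assuming that rows are actual classes and columns are predicted classes, as the table header indicates; the variable names are illustrative and do not come from the study.

```python
# Survey-weighted case counts copied from the confusion matrix above.
# Assumption: rows are actual classes, columns are predicted classes.
true_positive = 103_182     # actual current, predicted current
false_negative = 28_933     # actual current, predicted no lifetime
false_positive = 804_195    # actual no lifetime, predicted current
true_negative = 3_657_604   # actual no lifetime, predicted no lifetime

# Sensitivity: share of actual current-depression cases that are detected.
sensitivity = true_positive / (true_positive + false_negative)

# Specificity: share of actual no-lifetime cases that are correctly ruled out.
specificity = true_negative / (true_negative + false_positive)

print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
```

Rounded to two decimals, both values agree with the 0.78 and 0.82 reported at the best threshold.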
Survey-weighted multiple logistic regression analysis of depression-related factors.
| Variable | No lifetime depression^a (N=22,262,880) | Current depression^a (N=616,082) | Odds ratio (95% CI)^b | P value |
| --- | --- | --- | --- | --- |
| | | | | |
| Yes | 5,999,970 (26.95%) | 372,138 (60.40%) | 3.3 (2.5-4.3) | <.001 |
| No (reference) | 16,262,910 (73.05%) | 243,944 (39.60%) | N/A^c | N/A |
| | | | | |
| Male | 11,072,884 (49.74%) | 165,770 (26.91%) | 0.5 (0.3-0.7) | <.001 |
| Female (reference) | 11,189,996 (50.26%) | 450,312 (73.09%) | N/A | N/A |
| | | | | |
| Separated or divorced | 852,170 (3.83%) | 105,567 (17.14%) | 2.2 (1.4-3.5) | <.001 |
| Single | 6,153,788 (27.64%) | 133,952 (21.74%) | 1.4 (0.9-2.2) | .11 |
| Widowed | 397,387 (1.78%) | 36,923 (5.99%) | 1.4 (0.8-2.6) | .25 |
| Married (reference) | 14,859,535 (66.75%) | 339,640 (55.13%) | N/A | N/A |
| | | | | |
| Nonmanual | 10,304,597 (46.29%) | 161,758 (26.26%) | 0.6 (0.4-0.8) | <.001 |
| Manual | 4,394,739 (19.74%) | 102,028 (16.56%) | 0.7 (0.4-1.0) | .06 |
| Farm^d | 498,239 (2.24%) | 10,848 (1.76%) | 0.4 (0.2-0.8) | .02 |
| Unemployed (reference) | 7,065,305 (31.73%) | 341,448 (55.42%) | N/A | N/A |
| | | | | |
| Current | 205,771 (0.92%) | 34,769 (5.64%) | 3.1 (1.5-6.5) | .002 |
| Past | 303,213 (1.36%) | 10,387 (1.69%) | 1.3 (0.5-3.2) | .57 |
| No lifetime (reference) | 21,753,896 (97.72%) | 570,926 (92.67%) | N/A | N/A |
| | | | | |
| Yes | 2,963,312 (13.31%) | 146,742 (23.82%) | 1.4 (0.9-1.9) | .10 |
| No (reference) | 19,299,568 (86.69%) | 469,340 (76.18%) | N/A | N/A |
| | | | | |
| Current | 825,130 (3.71%) | 100,307 (16.28%) | 1.4 (0.9-2.1) | .17 |
| Past | 155,266 (0.70%) | 8,385 (1.36%) | 0.7 (0.1-4.6) | .71 |
| No lifetime (reference) | 21,282,484 (95.59%) | 507,390 (82.36%) | N/A | N/A |
| | | | | |
| Yes | 1,466,133 (6.59%) | 84,663 (13.74%) | 1.4 (0.9-2.2) | .11 |
| Prediabetes | 4,504,230 (20.23%) | 125,783 (20.42%) | 1.0 (0.7-1.4) | .81 |
| No (reference) | 16,292,517 (73.18%) | 405,636 (65.84%) | N/A | N/A |
| | | | | |
| Yes | 3,398,898 (15.27%) | 173,323 (28.13%) | 0.9 (0.7-1.3) | .65 |
| No (reference) | 18,863,982 (84.73%) | 442,759 (71.87%) | N/A | N/A |
| | | | | |
| Yes | 5,178,518 (23.26%) | 139,892 (22.71%) | 1.2 (0.8-1.8) | .47 |
| No | 17,084,362 (76.74%) | 476,190 (77.29%) | N/A | N/A |
| Pain or discomfort level (1 to 3 points), mean (SE) | 1.2 (0.004) | 1.6 (0.034) | 1.3 (1.1-1.5) | <.001 |
| Fasting serum triglyceride (mg/dL)^d, mean (SE) | 134.8 (1.3) | 161.8 (12.0) | 1.2 (1.1-1.3) | .002 |
| Weight gain level within 1 year (0 to 3 points), mean (SE) | 0.4 (0.009) | 0.6 (0.047) | 1.2 (1.1-1.3) | .003 |
| Urine specific gravity, mean (SE) | 1.020 (0.00007) | 1.018 (0.0004) | 0.8 (0.7-0.9) | .003 |
| Usual activity problem level (1 to 3 points), mean (SE) | 1.0 (0.002) | 1.3 (0.029) | 1.1 (1.0-1.3) | .006 |
| Income quintiles (household) (1 to 5 points), mean (SE) | 3.4 (0.02) | 2.6 (0.09) | 0.8 (0.7-0.9) | .007 |
| Educational level (1 to 4 points), mean (SE) | 3.2 (0.01) | 2.6 (0.06) | 0.8 (0.7-1.0) | .01 |
| Urine pH, mean (SE) | 5.7 (0.009) | 5.8 (0.050) | 1.1 (1.0-1.3) | .07 |
| Mobility problem level (1 to 3 points), mean (SE) | 1.1 (0.002) | 1.3 (0.032) | 1.1 (1.0-1.2) | .22 |
| Self-care problem level (1 to 3 points), mean (SE) | 1.0 (0.001) | 1.1 (0.017) | 1.0 (0.9-1.1) | .98 |
| Age (years)^e, mean (SE) | 40.7 (0.2) | 45.1 (0.8) | 1.1 (0.8-1.3) | .61 |

^a Population counts (n), means, and standard errors are estimated using complex-survey-design weights.
^b For continuous factors, the value was standardized by removing the mean and scaling to unit variance.
^c N/A: not applicable.
^d Nonmodel factors were found by controlling for model features and age.
^e Not a depression-related factor, but included to control confounding effects of age.
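
The standardization described in footnote b (removing the mean and scaling to unit variance) can be sketched as follows. This is an illustration only: the sample values are made up, the function name `standardize` is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used. After this transformation, an odds ratio for a continuous factor refers to a change of one standard deviation rather than one raw unit.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation.

    Assumption: the population standard deviation is used (divisor n).
    """
    mean = fmean(values)
    sd = pstdev(values, mu=mean)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


# Made-up triglyceride readings in mg/dL, for illustration only.
triglyceride_mg_dl = [90.0, 120.0, 135.0, 160.0, 210.0]
z_scores = standardize(triglyceride_mg_dl)

# The standardized values have mean 0 and unit variance (up to rounding error).
print(fmean(z_scores), pstdev(z_scores))
```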
Figure 3. Partial correlation network graph and centrality indices for depression-associated factors. The factors can be positively (risk) or negatively (protective) related to current depression. Node size is proportional to the effective size of the odds ratio. Green and red edges represent positive and negative correlations, respectively. The edge with the highest absolute weight has full-color saturation and the widest width.
Possible confounders on indirect factors in depression.
| Indirect factor | Confounders^a | Odds ratio (95% CI) with confounders | Odds ratio (95% CI) without confounders |
| --- | --- | --- | --- |
| Current osteoarthritis | Pain or discomfort level and mobility problem level | 1.4 (0.9-2.1) | 1.6 (1.1-2.5) |
| Hypercholesterolemia | Triglyceride | 1.4 (1.0-1.9) | 1.5 (1.0-2.1) |
| Diabetes | Triglyceride and hypercholesterolemia | 1.4 (0.9-2.2) | 1.7 (1.1-2.6) |

^a A group of confounders that makes the coefficient of the indirect factor statistically nonsignificant.
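
One conventional way to read the table above is to check whether each 95% CI contains 1: an interval that excludes 1 corresponds to a two-sided P<.05 for the usual Wald-type interval. The sketch below applies that reading to the printed values. The helper names `parse_or` and `ci_versus_one` are hypothetical, and a bound printed as exactly 1.0 is reported as borderline because one-decimal rounding hides which side of 1 it falls on.

```python
import re


def parse_or(text):
    """Parse a string such as '1.4 (0.9-2.1)' into (odds_ratio, lower, upper)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*\(([\d.]+)-([\d.]+)\)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized odds ratio format: {text!r}")
    return tuple(float(group) for group in match.groups())


def ci_versus_one(text):
    """Classify a printed 95% CI relative to the null value of 1."""
    _, lower, upper = parse_or(text)
    if lower == 1.0 or upper == 1.0:
        return "borderline at printed precision"
    if lower > 1.0 or upper < 1.0:
        return "excludes 1"
    return "includes 1"


# Values copied from the table: (with confounders, without confounders).
rows = {
    "Current osteoarthritis": ("1.4 (0.9-2.1)", "1.6 (1.1-2.5)"),
    "Hypercholesterolemia": ("1.4 (1.0-1.9)", "1.5 (1.0-2.1)"),
    "Diabetes": ("1.4 (0.9-2.2)", "1.7 (1.1-2.6)"),
}

for factor, (adjusted, unadjusted) in rows.items():
    print(factor, "|", ci_versus_one(adjusted), "|", ci_versus_one(unadjusted))
```

For osteoarthritis and diabetes, the interval excludes 1 without the confounders and includes 1 with them. For hypercholesterolemia, the printed lower bounds of 1.0 do not allow that distinction to be made from the rounded values alone.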
Figure 4. The proportion of current depression according to sex and smoking (A) and according to weight gain and diabetes (B). The interaction effects between them are significant (P=.001 and .01, respectively; survey-weighted logistic regression).
Figure 5. Linear regression plots between age and urine specific gravity in females and males. In the images, 95% confidence intervals for the regression estimate are drawn using translucent bands around the regression line.
Figure 6. Linear regression plots between age and daily water intake in females and males. In the images, 95% confidence intervals for the regression estimate are drawn using translucent bands around the regression line.