Abstract
We aimed to develop prediction models for depression among U.S. adults with hypertension using various machine learning (ML) approaches, and to analyze the mechanisms of the developed models. This cross-sectional study included 8,628 adults with hypertension (11.3% with depression) from the National Health and Nutrition Examination Survey (2011-2020). We selected several significant features using feature selection methods to build the models. Data imbalance was managed with random down-sampling. Six ML classification methods implemented in the R package caret (artificial neural network, random forest, AdaBoost, stochastic gradient boosting, XGBoost, and support vector machine) were employed with 10-fold cross-validation for predictions. Model performance was assessed by examining the area under the receiver operating characteristic curve (AUC), accuracy, precision, sensitivity, specificity, and F1-score. For an interpretable algorithm, we used the variable importance evaluation function in caret. Of all classification models, the artificial neural network trained with selected features (n = 30) achieved the highest AUC (0.813) and specificity (0.780) in predicting depression. The support vector machine predicted depression with the highest accuracy (0.771), precision (0.969), sensitivity (0.774), and F1-score (0.860). The most frequent and important features contributing to the models included the ratio of family income to poverty, triglyceride level, white blood cell count, age, sleep disorder status, the presence of arthritis, hemoglobin level, marital status, and education level. In conclusion, ML algorithms performed comparably in predicting depression among hypertensive populations. Furthermore, the developed models shed light on the variables' relative importance, paving the way for further clinical research.
Year: 2022 PMID: 35905087 PMCID: PMC9337649 DOI: 10.1371/journal.pone.0272330
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
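The modeling pipeline described in the abstract (random down-sampling, caret, 10-fold cross-validation) can be sketched in R. This is a minimal illustration rather than the authors' code: the data frame `nhanes` and the two-level outcome factor `depression` are hypothetical names, and the paper's preprocessing and feature selection steps are omitted.

```r
## Minimal sketch of the abstract's workflow; `nhanes` and `depression`
## (levels "No"/"Yes") are assumed, not taken from the paper.
library(caret)

set.seed(2022)

## Random down-sampling to manage the class imbalance (11.3% depressed)
balanced <- downSample(
  x = nhanes[, setdiff(names(nhanes), "depression")],
  y = nhanes$depression,
  yname = "depression"
)

## 10-fold cross-validation, optimizing AUC via twoClassSummary
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

## Example fit: random forest with the mtry value reported in the table below
fit_rf <- train(depression ~ ., data = balanced, method = "rf",
                metric = "ROC", trControl = ctrl,
                tuneGrid = data.frame(mtry = 2))
```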
ML algorithms used in the current study [24].
| Algorithm | Description | Hyperparameters used in this study |
|---|---|---|
| ANN | ANN is a group of interconnected artificial neurons that utilizes a mathematical model or computational model to process information. The generic structure of a basic ANN comprises a series of nodes arranged in 3 layers (input, hidden, and output layers). The input nodes and the output node of an ANN correspond to the predictor variables and outcome variable, respectively. The nodes in the hidden layer are intermediate unobserved values that allow the ANN to model complex nonlinear associations between the input nodes and the output node. The nodes in different layers are connected by weights. | hidden layer = 1, decay weight = 0.09 |
| Random forest | Random forest is a tree-based ensemble method that utilizes parallel decision trees built on subsets of the data to develop an optimal predictive model. Each tree in the random forest casts a vote based on its prediction, and the classification with the most votes becomes the overall model's prediction. | mtry = 2 |
| AdaBoost | AdaBoost is also an ensemble method like random forest. The core principle of AdaBoost is to fit a sequence of "weak learners" (i.e., models that are only slightly better than random guessing) to repeatedly modified data. All predictions are then combined through a weighted majority vote (or sum) to generate the final prediction. | nIter = 100, method = Adaboost.M1 |
| Stochastic gradient boosting | Stochastic gradient boosting is another ensemble technique. It iteratively builds several small decision trees, each based on a random subset of the data, with each additional tree emphasizing observations poorly modeled by the existing collection of trees. Ultimately, observations are assigned a class based on the most common classification among the trees. | n.trees = 100, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10 |
| XGBoost | XGBoost implements gradient boosting with decision trees as the underlying learners. Whereas random forest employs individual trees in parallel to solve the same problem, XGBoost builds individual trees sequentially. Each tree is trained to resolve the prediction error remaining after the prior tree, thereby improving prediction. This offers another approach to building more complex and accurate tree-based models while controlling individual tree depth and complexity. | nrounds = 1000, max_depth = 10, eta = 0.07, gamma = 0.01, colsample_bytree = 0.5, min_child_weight = 1, subsample = 0.5 |
| SVM^a | An SVM model represents data samples as points in a space. The samples of separate categories are divided by a clear gap that should be as wide as possible. New data samples are then mapped onto that same space and predicted to belong to a category based on the side of the gap onto which they are mapped. | C = 0.1 |
^a In this study, we chose the linear kernel function as the kernel function of the SVM classifier.
ANN: artificial neural network; SVM: support vector machine
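For readers who want to reproduce these settings, the hyperparameters above map onto caret tuning grids roughly as follows. This is a sketch under assumptions: the paper reports parameter values but not caret method strings, so the standard methods whose tuning parameters match the table ("nnet", "rf", "adaboost", "gbm", "xgbTree", "svmLinear") are assumed, as are the `balanced` data frame and `ctrl` control object from the earlier sketch.

```r
## Assumed caret method strings whose tuning parameters match the table
grids <- list(
  nnet      = data.frame(size = 1, decay = 0.09),  # "hidden layer = 1" read as 1 hidden unit (assumption)
  rf        = data.frame(mtry = 2),
  adaboost  = data.frame(nIter = 100, method = "Adaboost.M1"),
  gbm       = data.frame(n.trees = 100, interaction.depth = 1,
                         shrinkage = 0.1, n.minobsinnode = 10),
  xgbTree   = data.frame(nrounds = 1000, max_depth = 10, eta = 0.07,
                         gamma = 0.01, colsample_bytree = 0.5,
                         min_child_weight = 1, subsample = 0.5),
  svmLinear = data.frame(C = 0.1)                  # linear kernel, per footnote a
)

## Train all six classifiers with fixed (non-tuned) grids
fits <- lapply(names(grids), function(m) {
  train(depression ~ ., data = balanced, method = m,
        metric = "ROC", trControl = ctrl, tuneGrid = grids[[m]])
})
names(fits) <- names(grids)
```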
Comparison of baseline characteristics (unweighted).
| Variables | Non-depressed (n = 7,652), n (%) or mean ± SD | Depressed (n = 976), n (%) or mean ± SD | t or χ² | p-value |
|---|---|---|---|---|
| Sociodemographic characteristics | | | | |
| Age | 63.31 ± 11.40 | 61.32 ± 10.68 | 5.43 | < 0.001 |
| Race/ethnicity | | | 56.69 | < 0.001 |
| Mexican American | 740 (9.7) | 121 (12.4) | | |
| Other Hispanic | 748 (9.8) | 126 (12.9) | | |
| Non-Hispanic White | 2,989 (39.1) | 376 (38.5) | | |
| Non-Hispanic Black | 2,282 (29.8) | 268 (27.5) | | |
| Non-Hispanic Asian | 665 (8.7) | 35 (3.6) | | |
| Other | 228 (3.0) | 50 (5.1) | | |
| Gender | | | 70.99 | < 0.001 |
| Male | 3,863 (50.5) | 353 (36.2) | | |
| Female | 3,789 (49.5) | 623 (63.8) | | |
| Marital status | | | 102.17 | < 0.001 |
| Married/living with partner | 4,449 (58.2) | 401 (41.1) | | |
| Widowed/divorced/separated | 2,502 (32.7) | 449 (46.1) | | |
| Never married | 696 (9.1) | 125 (12.8) | | |
| Education level | | | 136.09 | < 0.001 |
| Less than 9th grade | 754 (9.8) | 165 (16.9) | | |
| 9-11th grade | 984 (12.9) | 192 (19.7) | | |
| High school graduate/GED or equivalent | 1,914 (25.0) | 242 (24.8) | | |
| Some college or AA degree | 2,335 (30.5) | 286 (29.3) | | |
| College graduate or above | 1,665 (21.8) | 91 (9.3) | | |
| Ratio of family income to poverty^a | 2.58 ± 1.60 | 1.65 ± 1.30 | 19.49 | < 0.001 |
| Insurance status | | | 3.04 | 0.080 |
| Yes | 6,828 (89.3) | 851 (87.5) | | |
| No | 817 (10.7) | 122 (12.5) | | |
| Time spent uninsured in the past year | | | 15.42 | < 0.001 |
| Yes | 309 (4.5) | 65 (7.6) | | |
| No | 6,547 (95.5) | 795 (92.4) | | |
| Health-related behaviors | | | | |
| Smoking status | | | 30.48 | < 0.001 |
| Yes | 3,713 (48.6) | 565 (58.0) | | |
| No | 3,933 (51.4) | 410 (42.0) | | |
| Minutes of sedentary activity | 373.0 ± 200.6 | 404.4 ± 224.3 | -4.13 | < 0.001 |
| Vigorous work activity | | | 2.79 | 0.090 |
| Yes | 1,294 (16.9) | 186 (19.1) | | |
| No | 6,355 (83.1) | 790 (80.9) | | |
| Moderate work activity | | | 0.63 | 0.430 |
| Yes | 2,621 (34.3) | 322 (33.0) | | |
| No | 5,026 (65.7) | 654 (67.0) | | |
| Walking or cycling | | | 0.44 | 0.510 |
| Yes | 1,535 (20.1) | 187 (19.2) | | |
| No | 6,115 (79.9) | 789 (80.8) | | |
| Vigorous recreational activity | | | 51.32 | < 0.001 |
| Yes | 958 (12.5) | 46 (4.7) | | |
| No | 6,693 (87.5) | 930 (95.3) | | |
| Moderate recreational activity | | | 92.68 | < 0.001 |
| Yes | 2,877 (37.6) | 214 (21.9) | | |
| No | 4,771 (62.4) | 762 (78.1) | | |
| Comorbidities | | | | |
| Presence of arthritis | | | 185.50 | < 0.001 |
| Yes | 3,375 (44.2) | 654 (67.4) | | |
| No | 4,261 (55.8) | 317 (32.6) | | |
| Presence of kidney disease | | | 51.01 | < 0.001 |
| Yes | 512 (6.7) | 127 (13.1) | | |
| No | 7,128 (93.3) | 844 (86.9) | | |
| Presence of asthma | | | 100.46 | < 0.001 |
| Yes | 1,138 (14.9) | 268 (27.5) | | |
| No | 6,512 (85.1) | 708 (72.5) | | |
| Presence of liver disease | | | 67.15 | < 0.001 |
| Yes | 468 (6.1) | 128 (13.2) | | |
| No | 7,174 (93.9) | 842 (86.8) | | |
| Presence of cancer or a malignancy of any kind | | | 1.24 | 0.265 |
| Yes | 1,251 (16.4) | 173 (17.8) | | |
| No | 6,398 (83.6) | 801 (82.2) | | |
| Presence of cardiovascular disease^b | | | 77.94 | < 0.001 |
| Yes | 1,541 (20.4) | 311 (33.0) | | |
| No | 6,007 (79.6) | 631 (67.0) | | |
| Presence of sleep disorder | | | 478.04 | < 0.001 |
| Yes | 2,461 (32.2) | 662 (67.9) | | |
| No | 5,190 (67.8) | 313 (32.1) | | |
| Laboratory findings | | | | |
| Segmented neutrophils number (1000 cells/µL) | 4.27 ± 1.70 | 4.66 ± 2.04 | -5.57 | < 0.001 |
| White blood cell count (1000 cells/µL) | 7.28 ± 5.25 | 7.72 ± 2.53 | -4.25 | < 0.001 |
| Red cell distribution width (%) | 13.93 ± 1.42 | 14.17 ± 1.54 | -4.52 | < 0.001 |
| Mean cell volume (fL) | 89.48 ± 6.15 | 89.21 ± 6.36 | 1.27 | 0.205 |
| Platelet count (1000 cells/µL) | 233.1 ± 64.22 | 245.3 ± 71.26 | -4.96 | < 0.001 |
| Gamma-glutamyl transferase (U/L) | 33.91 ± 50.52 | 31.91 ± 61.31 | -3.75 | < 0.001 |
| Alanine aminotransferase (U/L) | 23.49 ± 18.39 | 25.71 ± 47.56 | -1.39 | 0.008 |
| Alkaline phosphatase (U/L) | 75.48 ± 27.92 | 81.01 ± 27.93 | -5.60 | < 0.001 |
| Eosinophils number (1000 cells/µL) | 0.21 ± 0.18 | 0.22 ± 0.17 | -1.43 | 0.154 |
| Basophils number (1000 cells/µL) | 0.05 ± 0.05 | 0.06 ± 0.05 | -2.60 | 0.009 |
| Glycohemoglobin (%) | 6.15 ± 1.22 | 6.40 ± 1.58 | -4.55 | < 0.001 |
| Triglyceride (mmol/L) | 1.79 ± 1.29 | 2.01 ± 1.50 | -4.26 | < 0.001 |
| Total cholesterol (mmol/L) | 4.89 ± 1.13 | 4.98 ± 1.18 | -2.11 | 0.035 |
| Body mass index (kg/m²) | 30.71 ± 7.15 | 32.81 ± 8.53 | -7.26 | < 0.001 |
| Direct high-density lipoprotein cholesterol (mmol/L) | 1.38 ± 0.43 | 1.35 ± 0.46 | 1.68 | 0.094 |
| Sodium (mmol/L) | 139.7 ± 2.76 | 139.7 ± 2.94 | 0.44 | 0.661 |
| Total bilirubin (µmol/L) | 9.69 ± 4.84 | 8.86 ± 4.47 | 5.17 | < 0.001 |
| Hemoglobin (g/dL) | 13.82 ± 1.55 | 13.54 ± 1.65 | 4.93 | < 0.001 |
| Hematocrit (%) | 41.16 ± 4.30 | 40.45 ± 4.60 | 4.46 | < 0.001 |
| Albumin, urine (mg/L) | 84.52 ± 438.8 | 130.2 ± 534.5 | -2.53 | 0.010 |
| Albumin, refrigerated serum (g/L) | 41.41 ± 3.39 | 40.53 ± 3.73 | 6.73 | < 0.001 |
| Monocyte number (1000 cells/µL) | 0.59 ± 0.24 | 0.60 ± 0.22 | -1.88 | 0.060 |
| Lymphocyte number (1000 cells/µL) | 2.16 ± 4.48 | 2.18 ± 0.82 | -0.29 | 0.772 |
| Potassium (mmol/L) | 4.04 ± 0.41 | 4.04 ± 0.43 | -0.38 | 0.703 |
| Uric acid (µmol/L) | 343.7 ± 89.79 | 337.5 ± 94.54 | 1.93 | 0.054 |
| Creatinine, urine (µmol/L) | 10,290.7 ± 6,677.7 | 10,771.2 ± 7,273.8 | -1.93 | 0.054 |
^a Family annual income was expressed as a ratio of family income to the federal poverty guidelines (available at https://aspe.hhs.gov/prior-hhs-poverty-guidelines-and-federal-registerreferences). The poverty index is the ratio of household income to the poverty threshold after accounting for inflation and family size.
^b Participants were considered prevalent cardiovascular disease cases if ever told by a doctor that they had any of the following conditions: congestive heart failure, coronary heart disease, angina/angina pectoris, heart attack, or stroke.
SD: standard deviation
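The unweighted comparisons in this table are standard two-group tests. A brief sketch, assuming the hypothetical `nhanes` data frame from the earlier code, with hypothetical column names `age` and `gender` (statistic signs depend on group ordering):

```r
## Continuous variables: two-sample t-test (e.g., age, t = 5.43 above)
t.test(age ~ depression, data = nhanes, var.equal = TRUE)

## Categorical variables: chi-squared test (e.g., gender, χ² = 70.99 above)
chisq.test(table(nhanes$gender, nhanes$depression))
```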
Fig 1. ROC curves for six machine learning models in predicting depression.
Ten-fold cross-validation was used to build and evaluate the prediction models. Different colors represent the different machine learning classifiers used in this study. The gray line is the reference corresponding to a classifier that assigns classes completely at random.
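A curve like those in Fig 1 can be drawn with the pROC package. This sketch assumes the `fit_rf` object from the earlier code and "Yes" as the positive-class label; for brevity it predicts on the training frame, whereas the paper's curves reflect cross-validated predictions.

```r
library(pROC)

## Class probabilities from the fitted caret model
probs <- predict(fit_rf, newdata = balanced, type = "prob")

## ROC curve and AUC; plot.roc draws the chance (identity) line by default
roc_rf <- roc(response = balanced$depression, predictor = probs[, "Yes"])
plot(roc_rf, print.auc = TRUE, legacy.axes = TRUE)
```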
Average metrics of the six models trained with features selected by stepwise backward elimination.
| Model | AUC | Accuracy | Precision | Sensitivity | Specificity | F1-score |
|---|---|---|---|---|---|---|
| ANN | **0.813** | 0.706 | 0.961 | 0.697 | **0.780** | 0.808 |
| Random forest | 0.772 | 0.686 | 0.958 | 0.676 | 0.769 | 0.792 |
| AdaBoost | 0.762 | 0.673 | 0.956 | 0.662 | 0.759 | 0.782 |
| Stochastic gradient boosting | 0.803 | 0.707 | 0.956 | 0.701 | 0.748 | 0.809 |
| XGBoost | 0.808 | 0.696 | 0.958 | 0.688 | 0.764 | 0.801 |
| SVM | 0.760 | **0.771** | **0.969** | **0.774** | 0.739 | **0.860** |
The highest value in each column is bolded.
AUC: area under the receiver operating characteristic curve; ANN: artificial neural network; SVM: support vector machine
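The table's metrics follow directly from each model's confusion matrix; a sketch using caret's confusionMatrix (again with the hypothetical `fit_rf` and the assumed positive class "Yes"):

```r
## Hold-out predictions would be used in practice; training-set predictions
## are shown here only to keep the sketch self-contained.
pred <- predict(fit_rf, newdata = balanced)
cm <- confusionMatrix(pred, balanced$depression,
                      positive = "Yes", mode = "everything")

cm$overall["Accuracy"]
cm$byClass[c("Precision", "Sensitivity", "Specificity", "F1")]
```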
Top contributing features appearing in at least five of the six models.
| Feature | Frequency | Rank in ANN | Rank in random forest | Rank in AdaBoost | Rank in stochastic gradient boosting | Rank in XGBoost | Rank in SVM | Description |
|---|---|---|---|---|---|---|---|---|
| Family income-to-poverty ratio | 6 | 1 | 1 | 2 | 1 | 1 | 2 | The ratio of family income to poverty guidelines |
| Triglyceride | 6 | 14 | 4 | 10 | 4 | 3 | 10 | Triglycerides, refrigerated serum (mmol/L) |
| White blood cell count | 6 | 13 | 7 | 13 | 11 | 5 | 13 | White blood cell count (1000 cells/µL) |
| Age | 6 | 9 | 9 | 11 | 14 | 8 | 11 | Age in years of the participant at the time of screening |
| Sleep disorder | 5 | 2 | 2 | 1 | 2 | NA | 1 | Ever told a doctor that you had trouble sleeping |
| Arthritis | 5 | 8 | 16 | 3 | 3 | NA | 3 | Ever told by a doctor that you had arthritis |
| Hemoglobin | 5 | NA | 6 | 12 | 16 | 6 | 12 | Hemoglobin (g/dL) |
| Marital status | 5 | NA | 18 | 5 | 10 | 16 | 5 | Marital status of the participants |
| Education level | 5 | 15 | 17 | 8 | 8 | NA | 8 | The highest grade or level of schooling or the highest degree |
ANN: artificial neural network; SVM: support vector machine
NA: This feature was not ranked in the top 20 features for that model.
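The per-model ranks above come from caret's variable importance evaluation; a sketch of the call on one fitted model (the hypothetical `fit_rf` from the earlier code):

```r
## Scaled importance scores (0-100) for one caret model; features outside
## the top 20 correspond to the NA entries in the table above.
imp <- varImp(fit_rf, scale = TRUE)
plot(imp, top = 20)

## Full importance table, ordered from most to least important
imp$importance[order(-imp$importance$Overall), , drop = FALSE]
```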