Youngihn Kwon, Juyeon Lee, Joo Hee Park, Yoo Mee Kim, Se Hwa Kim, Young Jun Won, Hyung-Yong Kim.
Abstract
As osteoporosis is a degenerative disease related to postmenopausal aging, early diagnosis is vital. This study used data from the Korea National Health and Nutrition Examination Surveys to predict a patient's risk of osteoporosis using machine learning algorithms. Data from 1431 postmenopausal women aged 40–69 years were used, including 20 features affecting osteoporosis, chosen by feature importance and recursive feature elimination. Random Forest (RF), AdaBoost, and Gradient Boosting (GBM) machine learning algorithms were each used to train three models: A, on checkup features; B, on survey features; and C, on both checkup and survey features. Of the three models, Model C generated the best outcomes, with an accuracy of 0.832 for RF, 0.849 for AdaBoost, and 0.829 for GBM. Its area under the receiver operating characteristic curve (AUROC) was 0.919 for RF, 0.921 for AdaBoost, and 0.908 for GBM. By utilizing multiple feature selection methods, the ensemble models of this study achieved excellent results, with an AUROC score of 0.921 with AdaBoost, which is 0.1–0.2 higher than those of the best performing models from recent studies. Our model can be further improved as a practical medical tool for the early diagnosis of osteoporosis after menopause.
Keywords: feature selection; machine learning; osteoporosis; postmenopausal women; pre-screening; risk assessment
Year: 2022 PMID: 35742158 PMCID: PMC9222287 DOI: 10.3390/healthcare10061107
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Figure 1. Study procedure. Model A: model trained on checkup features. Model B: model trained on survey features. Model C: model trained on total (checkup + survey) features.
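The procedure in Figure 1 (select features, then train RF, AdaBoost, and GBM on the checkup, survey, and combined feature sets) can be sketched with scikit-learn. This is a minimal illustration only, not the authors' code: the data are synthetic, the split of columns into "checkup" and "survey" groups is hypothetical, and the number of retained features (`N_KEEP`), the tree counts, and the use of a random forest inside recursive feature elimination are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption: NOT the KNHANES data).
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           random_state=0)

# Hypothetical assignment of columns to the three feature sets.
feature_sets = {
    "A (checkup)": np.arange(0, 20),
    "B (survey)": np.arange(20, 40),
    "C (both)": np.arange(0, 40),
}
N_KEEP = 10  # assumed number of features kept per set, for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Factories so that every feature set gets freshly initialised models.
model_factories = {
    "RF": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": lambda: AdaBoostClassifier(n_estimators=50, random_state=0),
    "GBM": lambda: GradientBoostingClassifier(n_estimators=50, random_state=0),
}

selected = {}  # feature set name -> retained column indices
results = {}   # (feature set, model) -> (accuracy, AUROC)

for set_name, cols in feature_sets.items():
    # Recursive feature elimination: repeatedly fit a random forest and
    # drop the two least important features until N_KEEP remain.
    rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
              n_features_to_select=N_KEEP, step=2)
    rfe.fit(X_train[:, cols], y_train)
    kept = cols[rfe.support_]
    selected[set_name] = kept

    for model_name, make_model in model_factories.items():
        model = make_model().fit(X_train[:, kept], y_train)
        pred = model.predict(X_test[:, kept])
        proba = model.predict_proba(X_test[:, kept])[:, 1]
        results[(set_name, model_name)] = (
            accuracy_score(y_test, pred),
            roc_auc_score(y_test, proba),
        )

for (set_name, model_name), (acc, auc) in results.items():
    print(f"{set_name:12s} {model_name:9s} accuracy={acc:.3f} AUROC={auc:.3f}")
```

Feature selection is fitted on the training split only, so the held-out scores are not influenced by the test rows; whether the study followed the same order of operations is not stated in this record.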
Figure 2. Draft model performance. (A): Principal component analysis plot based on 1151 features. GBM: Gradient Boosting Machine. (B): Receiver operating characteristic (ROC) curves of the three best models (Random Forest, AdaBoost, and Gradient Boosting Machine) based on total features (number of features = 1151).
Descriptive statistics of normal and osteoporosis subjects in the study.
| Variables | Characteristics | Normal | Osteoporosis |
|---|---|---|---|
| Age | Age (years) | 55.15 (49.46, 60.84) | 62.34 (56.92, 67.77) |
| LW_mp_a | Age of menopause (years) | 49.53 (45.07, 53.99) | 48.86 (43.93, 53.78) |
| LW_ms_a | Age of menarche (years) | 15.22 (13.37, 17.07) | 16.21 (14.16, 18.26) |
| BP8 | Average sleeping time for a day (hours) | 6.6 (5.25, 7.96) | 6.5 (4.93, 8.08) |
| BD2 | Beginning age of drinking (years) | 23.42 (7.69, 39.16) | 22.07 (2.02, 42.12) |
| HE_fev1fvc | Expired lung vol. for 1 s | 0.8 (0.75, 0.86) | 0.79 (0.72, 0.86) |
| HE_HDL_st2 | HDL cholesterol (mg/dL) | 49.58 (38.28, 60.87) | 48.35 (37.58, 59.11) |
| HE_ht | Height (cm) | 156.71 (151.59, 161.82) | 152.65 (147.57, 157.72) |
| DX_Q_ht | Highest height of the young (cm) | 158.88 (154.2, 163.56) | 156.26 (151.22, 161.3) |
| HE_insulin | Insulin (μIU/mL) | 10.7 (2.73, 18.66) | 10.07 (4.5, 15.64) |
| LQ_VAS | Quality of life scale (index) | 72.96 (54.39, 91.53) | 68.32 (47.18, 89.46) |
| HE_ALP | Serum alkaline phosphatase (IU/L) | 231.77 (165.21, 298.33) | 267.75 (188.13, 347.37) |
| HE_sbp2 | Systolic blood pressure (mmHg) | 124.67 (106.22, 143.12) | 127.36 (109.16, 145.56) |
| HE_crea | Serum Creatinine (mg/dL) | 0.72 (0.62, 0.82) | 0.7 (0.52, 0.89) |
| HE_vitD | Vitamin D (ng/mL) | 18.58 (11.98, 25.18) | 18.49 (11.38, 25.61) |
| HE_wt | Weight (kg) | 62.03 (53.58, 70.48) | 54.52 (47.07, 61.98) |
| HE_wc | Waist Circumference (cm) | 83.71 (74.44, 92.98) | 80.62 (71.98, 89.26) |
| BE5_1 | Muscle exercise per week (%) * | | |
| 1 | Never | 80 | 88.94 |
| 2 | One day a week | 3.97 | 1.84 |
| 3 | Two days a week | 4.13 | 2.21 |
| 4 | Three days a week | 4.63 | 2.83 |
| 5 | Four days a week | 2.15 | 1.6 |
| 6 | More than five days a week | 5.12 | 2.58 |
| edu | Education Level (%) * | | |
| 1 | Primary or less | 37.25 | 72.52 |
| 2 | Middle | 23.18 | 12.52 |
| 3 | High | 28.64 | 12.15 |
| 4 | College or more | 10.93 | 2.82 |
| LW_wh | Use of estrogen (%) * | | |
| 0 | No | 25.96 | 12.93 |
| 1 | Yes | 74.04 | 87.07 |
An asterisk (*) marks categorical variables; the values listed for each category of these variables are percentages.
Results of the univariate correlation analysis between the 20 independent variables and the dependent variable.
| Data Type | Variables | Characteristics | Correlation |
|---|---|---|---|
| Checkup | HE_wt | Weight (kg) | −0.426 (−0.467, −0.383) |
| | HE_ht | Height (cm) | −0.367 (−0.411, −0.321) |
| | HE_wc | Waist Circumference (cm) | −0.170 (−0.219, −0.119) |
| | HE_fev1fvc | Expired lung vol. for 1 s | −0.115 (−0.172, −0.056) |
| | HE_HDL_st2 | HDL cholesterol (mg/dL) | −0.055 (−0.108, −0.002) |
| | HE_insulin | Insulin (μIU/mL) | −0.046 (−0.103, 0.010) |
| | HE_crea | Serum Creatinine (mg/dL) | −0.045 (−0.098, 0.008) |
| | HE_vitD | Vitamin D (ng/mL) | −0.006 (−0.059, 0.047) |
| | HE_sbp2 | Systolic blood pressure (mmHg) | 0.073 (0.021, 0.124) |
| | HE_ALP | Serum alkaline phosphatase (IU/L) | 0.233 (0.183, 0.283) |
| Survey | edu | Education Level | −0.345 (−0.390, −0.298) |
| | DX_Q_ht | Highest height of the young (cm) | −0.261 (−0.317, −0.203) |
| | LQ_VAS | Quality of life scale (index) | −0.112 (−0.163, −0.060) |
| | BE5_1 | Muscle exercise per week (days) | −0.107 (−0.158, −0.055) |
| | LW_mp_a | Age of menopause (years) | −0.070 (−0.123, −0.016) |
| | BD2 | Beginning age of drinking (years) | −0.037 (−0.088, 0.015) |
| | BP8 | Average sleeping time for a day (hours) | −0.034 (−0.085, 0.018) |
| | LW_wh | Use of estrogen | 0.17 |
| | LW_ms_a | Age of menarche (years) | 0.243 (0.192, 0.292) |
| | Age | Age (years) | 0.540 (0.503, 0.576) |
Parentheses under the correlation column indicate a 95% confidence interval.
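For readers who want to see how a correlation and its 95% confidence interval of this kind can be computed, the sketch below uses the Pearson coefficient with a Fisher z-transformation interval. This record does not state which correlation coefficient or interval method the authors used, so both choices are assumptions, as are the helper names `pearson_r` and `fisher_ci` and the toy data. With a 0/1 outcome, the Pearson coefficient is the point-biserial correlation.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transformation.

    Assumed method: atanh(r) is treated as approximately normal with
    standard error 1 / sqrt(n - 3), and the bounds are mapped back with tanh.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return math.tanh(z - q * se), math.tanh(z + q * se)


# Toy data: a continuous feature against a binary outcome (0 = normal, 1 = case).
feature = [1, 2, 3, 4, 5, 6]
outcome = [0, 0, 0, 1, 1, 1]
r_toy = pearson_r(feature, outcome)
print("toy r:", round(r_toy, 3), "CI:", fisher_ci(r_toy, len(feature)))

# Interval for the Age row's point estimate, assuming all 1431 subjects
# contributed to it (the per-variable sample size is not given here).
lo, hi = fisher_ci(0.540, 1431)
print("Age row interval under these assumptions:", round(lo, 3), round(hi, 3))
```

Under these assumptions the interval for the Age row lands close to the one reported in the table. Rows with wider reported intervals may reflect fewer non-missing observations, which this sketch does not model.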
Figure 3. Box plot of the AUROC scores of three different prediction models across three different data types. Model A: model trained on checkup features. Model B: model trained on survey features. Model C: model trained on total (survey + checkup) features.
Figure 4. Best model (Model C) performance. (A): 2D principal component analysis plot based on the 20 selected features. (B): Receiver operating characteristic (ROC) curves of the three best models (Random Forest, AdaBoost, and Gradient Boosting Machine) based on the 20 selected features (total).
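The ROC curves and AUROC scores reported above can be illustrated with a short, dependency-free sketch. It builds the curve by sweeping a decision threshold over the predicted scores and integrates it with the trapezoidal rule. The function names `roc_points` and `auroc` and the four-sample toy data are made up for this example and are not taken from the study.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step.

    y_true holds 0/1 labels; scores holds predicted scores for class 1.
    Samples with tied scores are added together so that ties produce a
    single diagonal segment rather than an arbitrary staircase.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Toy example: two negatives and two positives with one ranking mistake.
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels, predicted)
auc = auroc(fpr, tpr)
print(list(zip(fpr, tpr)))
print("AUROC:", auc)  # 0.75 for this toy data
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which is the scale on which the 0.908 to 0.921 scores for Model C should be read.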