Ruiyi Liu1, Yongle Zhan1,2, Xuan Liu1, Yifang Zhang1, Luting Gui1, Yimin Qu1, Hairong Nan3, Yu Jiang1.
Abstract
Gestational diabetes mellitus (GDM) is closely related to adverse pregnancy outcomes and other diseases. Early intervention in pregnant women who are at high risk of developing GDM could help prevent adverse health consequences. The study aims to develop a simple model using the stacking ensemble method to predict GDM for women in the first trimester based on easily available factors. We used data from the Chinese Pregnant Women Cohort Study collected from July 2017 to November 2018. A total of 6,848 pregnant women in the first trimester were included in the analysis. Logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) were considered as base learners. The optimal feature subset for each learner was chosen by recursive feature elimination with cross-validation. We then built a pipeline to process imbalanced data, tune hyperparameters, and evaluate model performance. The learners with the best hyperparameters were employed in the first layer of the proposed stacking method. Their predictions were obtained using the optimal feature subsets and served as the meta-learner's inputs. Another LR was used as the meta-learner to obtain the final prediction results. Accuracy, specificity, error rate, and other metrics were calculated to evaluate the performance of the models. A paired samples t-test was performed to compare model performance. In total, 967 (14.12%) women developed GDM. Among the base learners, the RF model had the highest accuracy (0.638 (95% confidence interval (CI) 0.628–0.648)) and specificity (0.683 (0.669–0.698)) and the lowest error rate (0.362 (0.352–0.372)). The stacking method improved the accuracy (0.666 (95% CI 0.663–0.670)) and specificity (0.725 (0.721–0.729)) and decreased the error rate (0.333 (0.330–0.337)). The differences in performance between the stacking method and RF were statistically significant.
Our proposed stacking method based on easily available factors has better performance than other learners such as RF.
Year: 2022 PMID: 36147870 PMCID: PMC9489389 DOI: 10.1155/2022/8948082
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 3.822
Figure 1. Flowchart illustrating data cleaning and processing.
Figure 2. Workflow for the stacking method. (a) The structure of the proposed stacking method. (b) The principle of the stacking method.
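The two-layer structure described in the abstract (LR, RF, and a boosted-tree learner in the first layer, with another LR as the meta-learner) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, uses synthetic data in place of the cohort data, substitutes `GradientBoostingClassifier` for XGBoost, and feeds every base learner the same features, whereas the study used a separate optimal feature subset per learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 14% positives); NOT the cohort data.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# First layer: three base learners (gradient boosting stands in for XGBoost).
base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Second layer: a logistic regression meta-learner trained on the
# out-of-fold predicted probabilities of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
# One column per base learner: the predicted probability of the positive class.
meta_features = stack.transform(X_test)
```

Here `meta_features` is what the meta-learner sees at prediction time: one positive-class probability per base learner for each woman in the test split.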
Figure 3. Workflow illustrating the predictive model construction process.
Figure 4. Pipeline for training, optimizing, and evaluating models.
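A pipeline of this kind (handle class imbalance, tune hyperparameters, evaluate) might look like the following sketch. The record does not state which imbalance technique or search strategy the authors used, so the choices here are assumptions: scikit-learn, `class_weight="balanced"` as the imbalance handling, an exhaustive grid search over a hypothetical grid, and balanced accuracy as the tuning score, all on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data; NOT the cohort data.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and the learner live in one pipeline, so each
# cross-validation fold is scaled using its own training part only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    # Assumption: class reweighting stands in for the unspecified imbalance step.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Hypothetical grid over the regularization strength.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Evaluate the refitted best pipeline on data not used for tuning.
held_out_score = balanced_accuracy_score(y_test, search.predict(X_test))
```

Keeping the imbalance handling inside the cross-validated pipeline avoids leaking information from the validation folds into training.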
Characteristics between gestational diabetes mellitus (GDM) and non-GDM groups, N (%).
| Feature | Category | Non-GDM group (n = 5,881) | GDM group (n = 967) | P value |
|---|---|---|---|---|
| Age group (years) | <25 | 1,353 (23.0) | 101 (10.4) | <0.001 |
| | 25–30 | 2,957 (50.3) | 475 (49.1) | |
| | >30 | 1,571 (26.7) | 391 (40.4) | |
| Body mass index (kg/m²) | <18.5 | 825 (14.0) | 77 (8.0) | <0.001 |
| | 18.5–24.0 | 3,798 (64.6) | 587 (60.7) | |
| | >24.0 | 1,258 (21.4) | 303 (31.3) | |
| Educational level | Primary school and below | 30 (0.5) | 6 (0.6) | 0.044 |
| | Junior and senior high school | 2,084 (35.4) | 299 (30.9) | |
| | University | 3,408 (57.9) | 593 (61.3) | |
| | Master's degree and above | 359 (6.1) | 69 (7.1) | |
| Education level of partner | Primary school and below | 29 (0.5) | 10 (1.0) | <0.001 |
| | Junior and senior high school | 2,226 (37.9) | 292 (30.2) | |
| | University | 3,275 (55.7) | 589 (60.9) | |
| | Master's degree and above | 351 (6.0) | 76 (7.9) | |
| Occupation | No† | 1,662 (28.3) | 224 (23.2) | 0.001 |
| | Yes | 4,219 (71.7) | 743 (76.8) | |
| Occupation of partner | No | 463 (7.9) | 54 (5.6) | 0.015 |
| | Yes | 5,418 (92.1) | 913 (94.4) | |
| Personal annual income (10,000 CNY) | <5 | 3,655 (62.1) | 517 (53.5) | <0.001 |
| | 5–10 | 1,754 (29.8) | 347 (35.9) | |
| | >10 | 472 (8.0) | 103 (10.7) | |
| Household annual income (10,000 CNY) | <10 | 3,199 (54.4) | 463 (47.9) | 0.001 |
| | 10–20 | 1,810 (30.8) | 335 (34.6) | |
| | >20 | 872 (14.8) | 169 (17.5) | |
| Household size | 1–3 | 3,230 (54.9) | 588 (60.8) | 0.003 |
| | 4 | 1,332 (22.6) | 188 (19.4) | |
| | ≥5 | 1,319 (22.4) | 191 (19.8) | |
| Moderate physical activity | No | 5,074 (86.3) | 858 (88.7) | 0.043 |
| | Yes | 807 (13.7) | 109 (11.3) | |
| Vitamin without vitamin D | No | 3,277 (55.7) | 505 (52.2) | 0.046 |
| | Yes | 2,604 (44.3) | 462 (47.8) | |
| Vitamin D | No | 4,480 (76.2) | 683 (70.6) | <0.001 |
| | Yes | 1,401 (23.8) | 284 (29.4) | |
| Calcium | No | 4,162 (70.8) | 644 (66.6) | 0.010 |
| | Yes | 1,719 (29.2) | 323 (33.4) | |
| Soybean oil intake | No | 4,029 (68.5) | 711 (73.5) | 0.002 |
| | Yes | 1,852 (31.5) | 256 (26.5) | |
| Olive oil intake | No | 5,338 (90.8) | 853 (88.2) | 0.015 |
| | Yes | 543 (9.2) | 114 (11.8) | |
| GDM history | No | 5,872 (99.8) | 959 (99.2) | 0.001 |
| | Yes | 9 (0.2) | 8 (0.8) | |
| Unplanned pregnancy‡ | No | 4,203 (71.5) | 734 (75.9) | 0.005 |
| | Yes | 1,678 (28.5) | 233 (24.1) | |
| Hypertension history | No | 5,841 (99.3) | 951 (98.3) | 0.003 |
| | Yes | 40 (0.7) | 16 (1.7) | |
| Uterine fibroids | No | 5,796 (98.6) | 935 (96.7) | <0.001 |
| | Yes | 85 (1.4) | 32 (3.3) | |
| Hysteroscopic surgery | No | 5,841 (99.3) | 954 (98.7) | 0.047 |
| | Yes | 40 (0.7) | 13 (1.3) | |
| FDRs (diabetes) | No | 5,689 (96.7) | 905 (93.6) | <0.001 |
| | Yes | 192 (3.3) | 62 (6.4) | |
| SDRs (stroke) | No | 5,838 (99.3) | 967 (100.0) | 0.014 |
| | Yes | 43 (0.7) | — | |
†No, no full-time job; ‡Unplanned pregnancy, whether this pregnancy was unplanned. FDRs, first-degree relatives; SDRs, second-degree relatives.
Optimal feature subsets with three models.
| Model | Features' list |
|---|---|
| LR | (i) Sociodemographic factors: age, ethnic minority, occupation of partner, and personal annual income. |
| | (ii) Lifestyle and behavioral factors: moderate physical activity, ventilator physical activity, smoking before pregnancy, smoked within the last month, drinking, cooking, and BMI. |
| | (iii) Environmental factors: living environmental pollution (including sewer, dumpster, and chemical, pesticide) and work environmental pollution (including noise, high temperature, heavy metal, and hair dye). |
| | (iv) Dietary habit and supplement intake: unsaturated fatty acids (including soybean oil, olive oil, and linseed oil); vitamins (including vitamin D and other types of vitamins); calcium, iron dietary supplement, and probiotics. |
| | (v) Previous pregnancy status: parity, GDM, hypertensive disorders complicating pregnancy, preeclampsia, placental abruption, late abortion, small for gestational age, premature delivery, and stillborn fetus. |
| | (vi) Personal disease history: hypertension, hyperlipidemia, hyperthyroidism, anemia, heart disease, chronic glomerulonephritis, cancer, epilepsy, tuberculosis, viral hepatitis type B, infertility, cervical intraepithelial neoplasia, uterine fibroids, ovarian cyst, gonorrhea, systemic lupus erythematosus, ulcerative colitis, and Sjogren syndrome. |
| | (vii) Family disease history: FDRs disease history (including hypertension, diabetes, hyperlipidemia, and cancer) and SDRs disease history (including diabetes, hyperlipidemia, stroke, and cancer). |
| | (viii) Gynecological surgery history: myomectomy, oophorocystectomy, hysteroscopic treatment, extrauterine pregnancy, diagnostic-curettage, and abortion. |
| RF | (i) Sociodemographic factors: age, education level, education level of partner, personal annual income, household annual income, and household size. |
| | (ii) Lifestyle and behavioral factors: mild physical activity, sleep quality, depression level, and BMI. |
| | (iii) Environmental factors: passive smoking, noise pollution of living environment, renovation (working and living environment), and cooking. |
| | (iv) Dietary habit and supplement intake: unsaturated fatty acids (including soybean oil, peanut oil, and sunflower oil), dietary habit, vitamins (excluding vitamin D), calcium, and probiotics. |
| | (v) Previous pregnancy status: parity, unplanned pregnancy, and gravidity. |
| | (vi) Gynecological surgery history: abortion. |
| XGBoost | (i) Sociodemographic factors: age |
| | (ii) Lifestyle and behavioral factors: BMI |
| | (iii) Environmental factors: renovation (living environment) |
| | (iv) Family disease history: SDRs having stroke |
| | (v) Gynecological surgery history: gravidity |
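Feature subsets like these come from recursive feature elimination with cross-validation, which repeatedly drops the weakest feature and keeps the subset size with the best cross-validated score. The sketch below shows the idea for a single learner. It assumes scikit-learn's `RFECV`, a logistic regression estimator, accuracy as the selection score, and synthetic data; the authors' actual settings are not given in this record.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with a few informative features; NOT the cohort data.
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=4,
    n_redundant=0, random_state=0,
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # remove one feature per elimination round
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",          # assumption: the selection criterion is not stated
    min_features_to_select=1,
)
selector.fit(X, y)

selected_mask = selector.support_   # boolean mask over the 15 input columns
n_selected = selector.n_features_   # size of the best-scoring subset
X_reduced = selector.transform(X)   # data restricted to the selected columns
```

Running the same procedure once per base learner would yield a different subset for each, which is why the LR, RF, and XGBoost lists above differ so much in length.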
Figure 5. Boxplot of performance for the three models with optimal hyperparameters and for the stacking method.
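The metrics compared here follow directly from the confusion matrix. As a small illustration, the helper below (a hypothetical function, with made-up counts rather than study results) computes accuracy, error rate, and specificity; sensitivity is included as one plausible example of the "other metrics", which is an assumption.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute basic binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,     # complement of accuracy
        "specificity": tn / (tn + fp),    # true negative rate
        "sensitivity": tp / (tp + fn),    # true positive rate (assumed extra metric)
    }


# Hypothetical counts for illustration only; not taken from the study.
metrics = classification_metrics(tp=60, fp=180, tn=420, fn=40)
```

Because the error rate is the complement of accuracy, the accuracy and error-rate differences between two models always have the same magnitude and opposite signs.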
The paired samples t-test between the stacking method and RF.
| Comparison | Mean | Standard deviation | Standard error mean | 95% CI of the difference (lower) | 95% CI of the difference (upper) | t | df | P value |
|---|---|---|---|---|---|---|---|---|
| Accuracy (stacking-RF) | 0.028 | 0.035 | 0.005 | 0.018 | 0.038 | 5.709 | 49 | <0.001 |
| Specificity (stacking-RF) | 0.041 | 0.049 | 0.007 | 0.027 | 0.056 | 5.951 | 49 | <0.001 |
| Error rate (stacking-RF) | −0.028 | 0.035 | 0.005 | −0.038 | −0.018 | −5.709 | 49 | <0.001 |
A value of p < 0.05 was considered significant.
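The columns of the paired t-test table are linked by simple formulas: the standard error is the standard deviation of the paired differences divided by the square root of the number of pairs, the t statistic is the mean difference divided by the standard error, and the confidence interval is the mean plus or minus a critical value times the standard error. The sketch below recomputes the accuracy row from its rounded summary values. It assumes 50 pairs (df + 1) and uses the tabulated two-sided 95% critical value for 49 degrees of freedom; because the inputs are rounded, the recomputed t only approximates the reported 5.709.

```python
import math


def paired_t_summary(mean_diff: float, sd_diff: float, n: int, t_crit: float):
    """Recompute paired t-test quantities from summary statistics of the differences."""
    se = sd_diff / math.sqrt(n)          # standard error of the mean difference
    t_stat = mean_diff / se              # t statistic with n - 1 degrees of freedom
    half_width = t_crit * se             # half-width of the confidence interval
    return se, t_stat, (mean_diff - half_width, mean_diff + half_width)


# Tabulated two-sided 95% critical value of Student's t with 49 df.
T_CRIT_DF49 = 2.0096

# Accuracy row (stacking minus RF): rounded mean and SD from the table; n = df + 1 = 50.
se, t_stat, ci = paired_t_summary(mean_diff=0.028, sd_diff=0.035, n=50, t_crit=T_CRIT_DF49)
```

Rounded to three decimals, `se` and `ci` reproduce the tabulated 0.005, 0.018, and 0.038.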
Figure 6. Feature importance ranking of the three models.