Qi Sun, Xiaoxuan Zou, Yousheng Yan, Hongguang Zhang, Shuo Wang, Yongmei Gao, Haiyan Liu, Shuyu Liu, Jianbo Lu, Ying Yang, Xu Ma.
Abstract
Objective: Preterm birth (PTB) is one of the leading causes of neonatal death. Predicting PTB in the first and second trimesters would help improve pregnancy outcomes. The aim of this study is to propose a prediction model for PTB based on machine learning algorithms.
Method: Data from 2008 to 2018 were reviewed, and all participants were selected from a hospital in China. Six algorithms, namely Naive Bayesian (NBM), support vector machine (SVM), random forest tree (RF), artificial neural networks (ANN), K-means, and logistic regression, were used to predict PTB. The receiver operating characteristic (ROC) curve, accuracy, sensitivity, and specificity were used to assess the performance of the models.
Year: 2022 PMID: 35463669 PMCID: PMC9020923 DOI: 10.1155/2022/9635526
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 3.822
Figure 1: Workflow of this study.
Figure 2: Preprocessing of variables. var_week_i is the measurement of the variable in week i; var_mean20 is the composite indicator representing the variable before 20 weeks of gestation.
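The preprocessing step in Figure 2 can be illustrated with a minimal sketch. It assumes that the composite indicator is the arithmetic mean of all measurements taken at or before the cutoff week; the source does not spell out the aggregation rule, and the function name, the data layout, and the sample values are hypothetical.

```python
from statistics import mean

def composite_indicator(weekly, cutoff_week):
    """Mean of all measurements taken at or before cutoff_week.

    `weekly` maps gestational week -> measured value (assumed layout).
    Returns None when no measurement falls inside the window.
    """
    values = [value for week, value in weekly.items() if week <= cutoff_week]
    return mean(values) if values else None

# Made-up systolic blood pressure readings, keyed by gestational week.
sbp_by_week = {12: 110.0, 16: 112.0, 20: 114.0, 24: 118.0}

print(composite_indicator(sbp_by_week, 20))  # uses weeks 12, 16, 20
print(composite_indicator(sbp_by_week, 27))  # uses all four readings
```

Repeating the call with cutoffs of 20, 22, 24, 26, and 27 weeks would yield one feature per dataset of the kind evaluated in the performance table.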
Figure 3: Classifiers used in this study.
(1) Naive Bayesian (NBM): Naive Bayes calculates the posterior probability P(B|A) from P(A), P(B), and P(A|B). P(B|A) is the posterior probability of class B given predictor A, P(A) is the prior probability of predictor A, P(B) is the prior probability of class B, and P(A|B) is the probability of the predictor given the class.
(2) Support Vector Machine (SVM): SVM outputs the hyperplane (wx + b = 0) that best separates the classes, that is, the one with the largest geometric margin.
(3) Logistic regression: Logistic regression uses a logistic function to map the output of a linear regression to a value between 0 and 1. X is the input features, β is the weights of the features, and P(Y = 1) is the predicted probability of class 1.
(4) Artificial Neural Networks (ANN): An artificial neural network consists of an input layer, a hidden layer, and an output layer, and its core component is the artificial neuron. Each neuron computes a weighted sum of the outputs of several other neurons; x is the input features.
(5) K-means: The K-means algorithm minimizes the squared error over the clusters C. x is an unclassified sample, C is the clusters, and u is the mean vector of a cluster in C.
(6) Random Forest Tree (RF): Random forest integrates multiple decision trees through the bagging idea of ensemble learning. Bagging combines the classification results of several weak classifiers by voting to form a strong classifier.
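The formulas drawn in Figure 3 did not survive in this text. For reference, the standard textbook forms that match the caption's symbols can be written as below. The activation function f, the bias b in the neuron, the weights w_i, the cluster index j, and the cluster count k are notational assumptions and are not taken from the source.

```latex
\begin{align}
  \text{Naive Bayes:}\quad & P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \\
  \text{SVM hyperplane:}\quad & w \cdot x + b = 0 \\
  \text{Logistic regression:}\quad & P(Y = 1) = \frac{1}{1 + e^{-\beta^{\top} X}} \\
  \text{Artificial neuron:}\quad & y = f\Big(\sum_{i} w_i x_i + b\Big) \\
  \text{K-means objective:}\quad & E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - u_j \rVert^{2}
\end{align}
```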
Characteristics of mothers and newborns in the PTB and control groups.
| Variables | Level | Control (n = 4775) | Case (n = 4775) | t/χ² | P |
|---|---|---|---|---|---|
| Age, years | | 30.72 ± 4.00 | 29.94 ± 5.39 | 8.00 | <0.001 |
| Gestation, days | | 274.66 ± 7.15 | 251.19 ± 11.51 | 119.70 | <0.001 |
| Gravidity | 1 | 3437 (0.72) | 3644 (0.76) | 25.08 | <0.001 |
| | 2–3 | 1240 (0.26) | 1063 (0.22) | | |
| | >3 | 98 (0.02) | 68 (0.01) | | |
| Parity | 1 | 4006 (0.84) | 4176 (0.87) | 24.37 | <0.001 |
| | >2 | 769 (0.16) | 599 (0.13) | | |
| Multiple birth | No | 4763 (1.00) | 4284 (0.90) | 479.50 | <0.001 |
| | Yes | 12 (0.00) | 491 (0.10) | | |
| Birth gender | Male | 2464 (0.52) | 2659 (0.56) | 15.85 | <0.001 |
| | Female | 2311 (0.48) | 2116 (0.44) | | |
| Birth weight, g | | 3410.68 ± 402.05 | 2691.13 ± 544.90 | 73.43 | <0.001 |
| Birth height, cm | | 50.38 ± 1.25 | 47.85 ± 2.82 | 56.81 | <0.001 |
| Apgar scores (1 min) | | 9.95 ± 0.71 | 9.70 ± 1.37 | 11.19 | <0.001 |
| Apgar scores (5 min) | | 10.00 ± 0.66 | 9.82 ± 1.20 | 8.97 | <0.001 |
| Apgar scores (10 min) | | 9.95 ± 0.54 | 9.77 ± 1.32 | 8.68 | <0.001 |
PTB: preterm birth.
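The chi-square statistics for categorical variables can be checked directly from the reported counts. The sketch below is an illustration, not the authors' code: it applies the Yates continuity-corrected chi-square formula to the multiple-birth counts (No: 4763 vs. 4284; Yes: 12 vs. 491). The function name is hypothetical, and the use of a continuity correction is an assumption, since the source does not say which variant of the test was run.

```python
def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity correction: shrink |ad - bc| by n/2 before squaring.
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    return numerator / (row1 * row2 * col1 * col2)

# Rows: multiple birth No / Yes; columns: control / case.
chi2 = chi2_yates_2x2(4763, 4284, 12, 491)
print(round(chi2, 2))
```

With the correction the statistic agrees with the tabulated 479.50 to two decimals, whereas the uncorrected Pearson statistic for the same counts is about 481.5. This suggests, but does not confirm, that a continuity correction was used.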
Prenatal test results of pregnant women before 27 weeks of gestation in the PTB and control groups.
| Category | Variables | Level | Control (n = 4775) | Case (n = 4775) | t/χ² | P |
|---|---|---|---|---|---|---|
| Physical examination | Waist size, cm | | 82.68 ± 14.19 | 83.30 ± 13.74 | −2.17 | 0.030 |
| | Fundal height, cm | | 20.57 ± 3.62 | 20.90 ± 3.84 | −4.33 | <0.001 |
| | SBP, mmHg | | 112.09 ± 10.26 | 113.34 ± 10.44 | −5.86 | <0.001 |
| | DBP, mmHg | | 69.47 ± 7.72 | 70.71 ± 16.09 | −4.82 | <0.001 |
| | FHR, times/min | | 145.50 ± 3.05 | 146.46 ± 17.27 | −3.78 | <0.001 |
| | Weight, kg | | 63.16 ± 9.15 | 63.04 ± 9.39 | 0.60 | 0.549 |
| | Edema | No | 4759 (1.00) | 4747 (0.99) | 2.76 | 0.096 |
| | | Yes | 16 (0.00) | 28 (0.01) | | |
| Blood test | BG | A | 1238 (0.26) | 1063 (0.22) | 128.27 | <0.001 |
| | | B | 1571 (0.33) | 2106 (0.44) | | |
| | | AB | 484 (0.10) | 417 (0.09) | | |
| | | O | 1482 (0.31) | 1189 (0.25) | | |
| | Blood RH | Ne | 24 (0.01) | 15 (0.00) | 1.65 | 0.199 |
| | | Po | 4751 (0.99) | 4760 (1.00) | | |
| | ALB, g/L | | 41.24 ± 3.46 | 41.45 ± 2.75 | −3.20 | 0.001 |
| | ALT, U/L | | 20.65 ± 14.05 | 21.02 ± 13.69 | −1.28 | 0.199 |
| | AST, U/L | | 20.91 ± 7.81 | 22.14 ± 7.72 | −7.74 | <0.001 |
| | Glu, mmol/L | | 4.57 [4.25, 4.93] | 4.56 [4.23, 4.72] | | <0.001 |
| | Ca, mmol/L | | 2.30 ± 0.14 | 2.31 ± 0.12 | −1.36 | 0.174 |
| | Cr, umol/L | | 50.86 ± 7.61 | 51.17 ± 8.25 | −1.92 | 0.055 |
| | DB, umol/L | | 1.72 [1.10, 2.30] | 1.74 [1.43, 1.90] | | 0.594 |
| | TSI, umol/L | | 17.44 ± 3.33 | 17.60 ± 2.63 | −2.61 | 0.009 |
| | GLOB, g/L | | 27.28 ± 3.32 | 27.24 ± 2.43 | 0.73 | 0.466 |
| | Mg, mmol/L | | 0.87 ± 0.13 | 0.88 ± 0.09 | −4.43 | <0.001 |
| | IP, mmol/L | | 1.25 ± 0.15 | 1.25 ± 0.12 | −1.61 | 0.108 |
| | TBA, umol/L | | 3.83 [2.90, 5.10] | 4.90 [3.32, 5.01] | | <0.001 |
| | TB, umol/L | | 11.28 ± 3.53 | 11.18 ± 2.64 | 1.67 | 0.095 |
| | CHOL, mmol/L | | 4.78 ± 0.73 | 4.80 ± 0.38 | −1.66 | 0.096 |
| | TP, g/L | | 68.76 ± 4.84 | 68.78 ± 3.55 | −0.22 | 0.829 |
| | TG, mmol/L | | 1.52 ± 0.54 | 1.58 ± 0.41 | −6.57 | <0.001 |
| | Urea, mmol/L | | 2.80 [2.38, 3.28] | 2.84 [2.40, 3.10] | | 0.002 |
| | UA, umol/L | | 199.75 ± 40.26 | 198.22 ± 39.37 | 1.88 | 0.060 |
| | BA, 10^9/L | | 0.01 ± 0.03 | 0.01 ± 0.05 | −1.01 | 0.314 |
| | Plt, 10^9/L | | 220.31 ± 48.25 | 224.52 ± 48.60 | −4.25 | <0.001 |
| | EOS, 10^9/L | | 0.09 ± 0.09 | 0.09 ± 0.07 | 0.43 | 0.665 |
| | Hb, g/L | | 117.98 ± 8.56 | 117.69 ± 8.59 | 1.62 | 0.105 |
| | MID, 10^9/L | | 0.55 ± 0.10 | 0.56 ± 0.12 | −4.51 | <0.001 |
| | LY, 10^9/L | | 1.72 ± 0.40 | 1.75 ± 0.41 | −2.92 | 0.004 |
| | MCH, pg | | 31.49 ± 1.91 | 31.33 ± 1.85 | 4.15 | <0.001 |
| | MCHC, g/L | | 344.87 ± 10.25 | 343.39 ± 10.52 | 6.95 | <0.001 |
| | MCV, fL | | 91.31 ± 4.72 | 91.24 ± 4.50 | 0.76 | 0.445 |
| | MO, 10^9/L | | 0.53 ± 0.14 | 0.54 ± 0.14 | −4.33 | <0.001 |
| | MPV, fL | | 8.58 ± 1.10 | 8.60 ± 1.09 | −1.07 | 0.283 |
| | NE, 10^9/L | | 7.23 ± 1.69 | 7.36 ± 1.72 | −3.60 | <0.001 |
| | P-LCR, % | | 0.23 ± 0.05 | 0.23 ± 0.05 | 5.83 | <0.001 |
| | HCT, % | | 0.34 ± 0.02 | 0.35 ± 0.25 | −1.18 | 0.238 |
| | PCT, % | | 0.19 ± 0.04 | 0.19 ± 0.03 | −0.23 | 0.819 |
| | PDW, % | | 15.16 ± 2.25 | 14.83 ± 2.51 | 6.75 | <0.001 |
| | RDW-CV, % | | 0.16 ± 0.51 | 0.16 ± 0.35 | 0.81 | 0.416 |
| | RDW-SD, fL | | 42.84 ± 2.46 | 43.45 ± 2.12 | −13.07 | <0.001 |
| | RBC, 10^12/L | | 3.76 ± 0.31 | 3.77 ± 0.32 | −1.51 | 0.131 |
| | WBC, 10^9/L | | 9.58 ± 1.93 | 9.74 ± 1.97 | −3.93 | <0.001 |
| Urine test strip | Urine pH | | 6.67 ± 0.46 | 6.73 ± 0.46 | −6.77 | <0.001 |
| | USG | | 1.02 ± 0.01 | 1.02 ± 0.01 | 4.96 | <0.001 |
| | BIL | Ne | 4737 (0.99) | 4749 (0.99) | 1.90 | 0.168 |
| | | Po | 38 (0.01) | 26 (0.01) | | |
| | Glycosuria | Ne | 3780 (0.79) | 3820 (0.80) | 0.98 | 0.322 |
| | | Po | 995 (0.21) | 955 (0.20) | | |
| | KET | Ne | 4593 (0.96) | 4589 (0.96) | 0.03 | 0.873 |
| | | Po | 182 (0.04) | 186 (0.04) | | |
| | Nitrituria | Ne | 4728 (0.99) | 4740 (0.99) | 1.49 | 0.222 |
| | | Po | 47 (0.01) | 35 (0.01) | | |
| | Blood | Ne | 4322 (0.91) | 4397 (0.92) | 7.22 | 0.007 |
| | | Po | 453 (0.09) | 378 (0.08) | | |
| | Proteinuria | Ne | 4729 (0.99) | 4698 (0.98) | 7.41 | 0.006 |
| | | Po | 46 (0.01) | 77 (0.02) | | |
| | Bilirubinuria | Ne | 4758 (1.00) | 4755 (1.00) | 0.11 | 0.742 |
| | | Po | 17 (0.00) | 20 (0.00) | | |
| | Urine WBC | Ne | 3490 (0.73) | 3475 (0.73) | 0.10 | 0.747 |
| | | Po | 1285 (0.27) | 1300 (0.27) | | |
| Gynecological examination | BV | Ne | 4678 (0.98) | 4719 (0.99) | 10.63 | 0.001 |
| | | Po | 97 (0.02) | 56 (0.01) | | |
| | CDV | 1 | 854 (0.18) | 975 (0.20) | 60.20 | <0.001 |
| | | 2 | 2904 (0.61) | 3066 (0.64) | | |
| | | 3 | 845 (0.18) | 590 (0.12) | | |
| | | 4 | 172 (0.04) | 144 (0.03) | | |
| | VYI | Ne | 4499 (0.94) | 4549 (0.95) | 5.05 | 0.025 |
| | | Po | 276 (0.06) | 226 (0.05) | | |
ALB: serum albumin; ALT: alanine transaminase; AST: aspartate transaminase; BA: basophil granulocytes; BG: blood group; BIL: urine bilirubin; Blood RH: blood RH; BV: bacterial vaginosis; Ca: total calcium; CDV: cleaning degree of vagina (the higher the value, the worse the cleanliness); CHOL: total cholesterol; Cr: creatinine; DB: direct bilirubin; DBP: diastolic blood pressure; EOS: eosinophil granulocytes; FHR: fetal heart rate; GLOB: globulins; Glu: plasma glucose (fasting); Hb: hemoglobin; HCT: hematocrit; IP: serum inorganic phosphorus; KET: urine ketone bodies; LY: lymphocytes; MCH: mean cell hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean cell volume; Mg: magnesium; MID: intermediate cell; MO: monocytes; MPV: mean platelet volume; NE: neutrophil granulocytes; PCT: plateletcrit; PDW: platelet distribution width; P-LCR: platelet-large cell ratio; Plt: platelet count; RBC: red blood cells; RDW-CV: red blood cell distribution width-CV; RDW-SD: red blood cell distribution width-SD; SBP: systolic blood pressure; TB: total bilirubin; TBA: total bile acid; TG: triglycerides; TP: total protein; TSI: total serum iron; UA: uric acid; Urea: urea; Urine WBC: urine white blood cell; USG: urine specific gravity; VYI: vaginal yeast infection; WBC: white blood cell count; Ne: negative; Po: positive; PTB: preterm birth. Variables that are not normally distributed are expressed as p50 [p25, p75].
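For the normally distributed variables, a two-sample t statistic can be approximated from the tabulated means, standard deviations, and group sizes alone. The sketch below is illustrative: the function name is hypothetical, and it assumes that every participant contributed a measurement, which the source does not confirm. With equal group sizes the Welch and pooled-variance statistics coincide, so the choice between them does not matter here.

```python
from math import sqrt

def t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic computed from group means, SDs, and sizes."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# SBP, mmHg: control 112.09 ± 10.26 vs. case 113.34 ± 10.44, n = 4775 each.
t_sbp = t_from_summary(112.09, 10.26, 4775, 113.34, 10.44, 4775)
print(round(t_sbp, 2))
```

The result is roughly −5.9, close to but not identical with the tabulated −5.86. A small gap of this kind is expected when the inputs are rounded summary values or when some measurements are missing.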
The performance of models in the test set.
| Dataset | Model | Accuracy | AUC (95% CI) | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Dataset 1 | SVM | 0.720 | 0.791 (0.771–0.811) | 0.710 | 0.731 |
| | RF | 0.777 | 0.861 (0.841–0.871) | 0.720 | 0.840 |
| | NBM | 0.677 | 0.741 (0.721–0.761) | 0.705 | 0.646 |
| | ANN | 0.634 | 0.691 (0.671–0.711) | 0.687 | 0.576 |
| | K-means | 0.611 | 0.681 (0.661–0.701) | 0.794 | 0.412 |
| | Log | 0.610 | 0.701 (0.681–0.721) | 0.378 | 0.861 |
| Dataset 2 | SVM | 0.721 | 0.791 (0.781–0.811) | 0.722 | 0.721 |
| | RF | 0.794 | 0.871 (0.851–0.881) | 0.756 | 0.832 |
| | NBM | 0.682 | 0.771 (0.751–0.791) | 0.785 | 0.581 |
| | ANN | 0.666 | 0.731 (0.711–0.751) | 0.595 | 0.738 |
| | K-means | 0.602 | 0.681 (0.671–0.701) | 0.811 | 0.393 |
| | Log | 0.606 | 0.701 (0.681–0.721) | 0.364 | 0.847 |
| Dataset 3 | SVM | 0.719 | 0.801 (0.781–0.811) | 0.695 | 0.743 |
| | RF | 0.806 | 0.881 (0.871–0.901) | 0.765 | 0.846 |
| | NBM | 0.674 | 0.791 (0.771–0.811) | 0.837 | 0.515 |
| | ANN | 0.733 | 0.801 (0.791–0.821) | 0.741 | 0.726 |
| | K-means | 0.612 | 0.711 (0.691–0.731) | 0.824 | 0.405 |
| | Log | 0.633 | 0.701 (0.681–0.721) | 0.421 | 0.839 |
| Dataset 4 | SVM | 0.719 | 0.791 (0.781–0.811) | 0.678 | 0.763 |
| | RF | 0.807 | 0.881 (0.871–0.891) | 0.743 | 0.875 |
| | NBM | 0.626 | 0.741 (0.721–0.761) | 0.328 | 0.946 |
| | ANN | 0.732 | 0.811 (0.801–0.831) | 0.730 | 0.734 |
| | K-means | 0.626 | 0.721 (0.701–0.741) | 0.801 | 0.436 |
| | Log | 0.611 | 0.701 (0.691–0.721) | 0.361 | 0.880 |
| Dataset 5 | SVM | 0.729 | 0.801 (0.781–0.811) | 0.685 | 0.773 |
| | RF | 0.816 | 0.891 (0.871–0.901) | 0.751 | 0.882 |
| | NBM | 0.622 | 0.741 (0.721–0.761) | 0.315 | 0.937 |
| | ANN | 0.747 | 0.811 (0.801–0.831) | 0.730 | 0.763 |
| | K-means | 0.609 | 0.701 (0.681–0.721) | 0.780 | 0.434 |
| | Log | 0.623 | 0.691 (0.671–0.711) | 0.391 | 0.861 |
NBM: Naive Bayesian; SVM: Support Vector Machine; RF: Random Forest Tree; ANN: Artificial Neural Networks; Log: Logistic regression; Dataset 1: 20 weeks gestation; Dataset 2: 22 weeks gestation; Dataset 3: 24 weeks gestation; Dataset 4: 26 weeks gestation; Dataset 5: 27 weeks gestation. AUC: the area under the curve; CI: confidence interval.
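The four performance measures in this table can be computed from true labels, predicted labels, and predicted scores. The following is a minimal sketch with made-up data. The function names, the 0.5 decision threshold, and the rank-based AUC estimate (the fraction of positive/negative pairs ranked correctly, with ties counted as half) are illustrative choices and are not taken from the source.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = PTB)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for six hypothetical pregnancies.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
print(metrics, round(auc(y_true, scores), 3))
```

When the two classes are the same size, accuracy equals the mean of sensitivity and specificity. For RF on Dataset 1 that mean is 0.780, against a reported accuracy of 0.777, which would be consistent with a roughly balanced test set.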
Figure 4: AUC (a) and accuracy (b) of models at different gestation times. (NBM: Naive Bayesian; SVM: Support Vector Machine; RF: Random Forest Tree; ANN: Artificial Neural Networks; Log: logistic regression; AUC: the area under the curve).
Figure 5: The ROC curve of the models. (a) Based on 20 weeks of gestation. (b) Based on 22 weeks of gestation. (c) Based on 24 weeks of gestation. (d) Based on 26 weeks of gestation. (e) Based on 27 weeks of gestation.
The top 20 most important variables in the RF model.
| Variables | Decreased accuracy |
|---|---|
| Age (physical examination) | 0.0251 |
| Magnesium (blood test) | 0.0098 |
| Fundal height (physical examination) | 0.0077 |
| Serum inorganic phosphorus (blood test) | 0.0038 |
| Mean platelet volume (blood test) | 0.0038 |
| Waist size (physical examination) | 0.0038 |
| Total cholesterol (blood test) | 0.0035 |
| Triglycerides (blood test) | 0.0031 |
| Globulins (blood test) | 0.0024 |
| Total bilirubin (blood test) | 0.0024 |
| Neutrophil granulocytes (blood test) | 0.0024 |
| Red blood cell distribution width-SD (blood test) | 0.0024 |
| Bacterial vaginosis (gynecological examination) | 0.0021 |
| Urine bilirubin (urine test strip) | 0.0021 |
| Urine white blood cell (urine test strip) | 0.0021 |
| Diastolic blood pressure (physical examination) | 0.0014 |
| Blood group (blood test) | 0.0014 |
| Parity (physical examination) | 0.0014 |
| Eosinophil granulocytes (blood test) | 0.0010 |
| White blood cell count (blood test) | 0.0010 |
RF: Random Forest tree.
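The "decreased accuracy" column can be understood through permutation importance: shuffle one variable, re-score the model, and record how much accuracy drops. The sketch below is a simplified, model-agnostic version with made-up data. The source does not state how its values were computed; random forest implementations commonly average this drop over out-of-bag samples per tree, which this sketch does not reproduce. All names are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature column across rows."""
    rng = random.Random(seed)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original.
    permuted = [r[:feature] + [v] + r[feature + 1:]
                for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

def toy_predict(row):
    """Stand-in for a fitted model; it only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0

rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
labels = [0, 1, 0, 1, 0, 1]

for feature in (0, 1):
    print(feature, permutation_importance(toy_predict, rows, labels, feature))
```

Because the toy model ignores feature 1, shuffling that column leaves accuracy unchanged and its importance is exactly zero, while shuffling feature 0 can only lower accuracy.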
Figure 6: The AUC of the model increases with the number of predictors. (NBM: Naive Bayesian; SVM: Support Vector Machine; RF: Random Forest Tree; ANN: Artificial Neural Networks; Log: logistic regression; AUC: the area under the curve).