Hang Qiu1,2, Hai-Yan Yu3,4,5,6, Li-Ya Wang1, Qiang Yao7, Si-Nan Wu8, Can Yin9, Bo Fu1,2, Xiao-Juan Zhu1,2, Yan-Long Zhang9, Yong Xing9, Jun Deng9, Hao Yang10, Shun-Dong Lei11.
Abstract
Gestational diabetes mellitus (GDM) is conventionally confirmed with an oral glucose tolerance test (OGTT) at 24 to 28 weeks of gestation, but it is still uncertain whether it can be predicted in early pregnancy through secondary use of electronic health records (EHRs). To this end, a cost-sensitive hybrid model (CSHM) and five conventional machine learning methods are used to construct predictive models that capture the future risk of GDM from temporally aggregated EHRs. The experimental data come from a nested case-control study cohort of 33,935 pregnant women at West China Second Hospital. After data cleaning, the data set retains 4,378 cases and 50 attributes. Once the most feasible method has been selected, the cost parameter of CSHM is adapted to deal with the class imbalance of the data set. In the experiment, 3,940 samples are used for training and the remaining 438 samples for testing. Although the accuracy on positive samples is barely acceptable (62.16%), the results suggest that the vast majority (98.4%) of instances predicted positive are true positives. To our knowledge, this is the first study to apply machine learning models with EHRs to predict GDM, which will facilitate personalized medicine in maternal health management in the future.
Year: 2017 PMID: 29180800 PMCID: PMC5703904 DOI: 10.1038/s41598-017-16665-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Prediction model and data processing schematic diagram. In the EHRs, the feature vectors were extracted from first-trimester characteristics and the class labels from the diagnostic International Classification of Diseases (ICD-10) codes of the OGTT at 24–28 weeks' gestation. After EHR preprocessing, the experimental data were divided into two subsets in the evaluation design. The training set was then modelled using six machine learning techniques and the variants of the cost-sensitive hybrid model (CSHM). Five performance metrics were collected: accuracy, area under the ROC curve (AUC), true positive rate, false positive rate and confidence reports.
Statistical description of the sample attributes.
| index | field | Description | #. values | #. Missing | mean | median | Mode | s.d. | variance | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | high_risk | High Risk Pregnancy (Age Over 35) | 4378 | 0 | 0.13 | 0.00 | 0 | 0.336 | 0.113 | 0 | 1 |
| 2 | marriage_ages | Marriage Years | 4213 | 165 | 3.93 | 3.00 | 1 | 3.687 | 13.596 | 0 | 26 |
| 3 | height | Height | 4374 | 4 | 160.14 | 160.00 | 160 | 4.799 | 23.034 | 130 | 177 |
| 4 | pregnancy_times | Pregnancy Times | 4378 | 0 | 2.13 | 2.00 | 1 | 1.330 | 1.770 | 0 | 12 |
| 5 | husband_age | Husband Age | 4372 | 6 | 32.19 | 31.00 | 29 | 4.898 | 23.994 | 21 | 64 |
| 6 | delivery_age | Delivery Age | 4374 | 4 | 30.53 | 30.00 | 28 | 3.926 | 15.415 | 19 | 47 |
| 7 | bmi | Body Mass Index (BMI) | 4373 | 5 | 20.9005 | 20.5700 | 20.70 | 2.59699 | 6.744 | 14.62 | 36.89 |
| 8 | Nonnative | Nonnative | 4298 | 80 | 0.12 | 0.00 | 0 | 0.325 | 0.106 | 0 | 1 |
| 9 | HCT | Hematocrit | 4378 | 0 | 1.79 | 2.00 | 2 | 0.409 | 0.167 | 1 | 3 |
| 10 | MCH | The Level Of Mean Corpuscular Hemoglobin | 4378 | 0 | 2.28 | 2.00 | 2 | 0.557 | 0.310 | 1 | 3 |
| 11 | WBC | Count of White Blood Cell | 4378 | 0 | 2.17 | 2.00 | 2 | 0.380 | 0.144 | 1 | 3 |
| 12 | EOS | Eosinophils | 4378 | 0 | 1.55 | 2.00 | 2 | 0.508 | 0.258 | 1 | 3 |
| 13 | MPV | Mean Platelet Volume | 4378 | 0 | 2.10 | 2.00 | 2 | 0.297 | 0.088 | 2 | 3 |
| 14 | PDW | Platelet Distribution Width | 4378 | 0 | 2.07 | 2.00 | 2 | 0.248 | 0.061 | 2 | 3 |
| 15 | RDW.CV | Red Blood Cell Distribution Width CV | 4378 | 0 | 1.84 | 2.00 | 2 | 0.545 | 0.297 | 0 | 2 |
| 16 | RDW.SD | Red Blood Cell Distribution Width SD | 4378 | 0 | 2.18 | 2.00 | 2 | 0.385 | 0.148 | 1 | 3 |
| 17 | MONO. | Monocytes | 4378 | 0 | 1.94 | 2.00 | 2 | 0.255 | 0.065 | 1 | 3 |
| 18 | EOS. | Eosinophil | 4341 | 37 | 1.67 | 2.00 | 2 | 0.487 | 0.237 | 1 | 3 |
| 19 | PCT | Path CAST | 4378 | 0 | 2.05 | 2.00 | 2 | 0.227 | 0.052 | 1 | 3 |
| 20 | P.LCR | Platelet-Large Cell Rate | 4378 | 0 | 2.10 | 2.00 | 2 | 0.293 | 0.086 | 2 | 3 |
| 21 | HBsAg | Hepatitis B[Virus] Surface Antigen | 4378 | 0 | 1.88 | 2.00 | 2 | 0.475 | 0.226 | 0 | 2 |
| 22 | Anti.HBs | Hepatitis B Surface Antibody | 4378 | 0 | 0.51 | 0.00 | 0 | 0.874 | 0.764 | 0 | 2 |
| 23 | Anti.HBe | Hepatitis B E Antibody | 4378 | 0 | 1.69 | 2.00 | 2 | 0.727 | 0.529 | 0 | 2 |
| 24 | HBcAb.T. | Hepatitis B Core Antibody | 4378 | 0 | 1.55 | 2.00 | 2 | 0.838 | 0.702 | 0 | 2 |
| 25 | ALT | Alanine Aminotransferase | 4378 | 0 | 1.83 | 2.00 | 2 | 0.557 | 0.310 | 0 | 2 |
| 26 | AST | Aspartate Transaminase | 4378 | 0 | 1.88 | 2.00 | 2 | 0.470 | 0.221 | 0 | 2 |
| 27 | PA | Prealbumin | 4375 | 3 | 1.75 | 2.00 | 2 | 0.436 | 0.190 | 1 | 2 |
| 28 | UN | Urea | 4378 | 0 | 1.43 | 1.00 | 1 | 0.495 | 0.245 | 1 | 2 |
| 29 | UA | Uric Acid | 4378 | 0 | 1.82 | 2.00 | 2 | 0.384 | 0.147 | 1 | 3 |
| 30 | FPG | Fasting Plasma Glucose | 4370 | 8 | 1.90 | 2.00 | 2 | 0.294 | 0.087 | 1 | 3 |
| 31 | RBC | Red Blood Cell | 4378 | 0 | 2.30 | 2.00 | 2 | 0.458 | 0.209 | 2 | 3 |
| 32 | EC | Epithelial Cell | 4378 | 0 | 2.41 | 2.00 | 2 | 0.492 | 0.242 | 2 | 3 |
| 33 | XYSPXB | Number of Small Round Epithelial Cells | 4378 | 0 | 2.90 | 3.00 | 3 | 0.294 | 0.087 | 2 | 3 |
| 34 | CAST | Cast | 4378 | 0 | 2.25 | 2.00 | 2 | 0.435 | 0.189 | 2 | 3 |
| 35 | CAST.1 | Pathological cast | 4378 | 0 | 2.19 | 2.00 | 2 | 0.392 | 0.154 | 2 | 3 |
| 36 | EC.1 | Education | 4378 | 0 | 2.40 | 2.00 | 2 | 0.491 | 0.241 | 2 | 3 |
| 37 | WBC.1 | White Blood Cell | 4377 | 1 | 2.41 | 2.00 | 2 | 0.492 | 0.242 | 2 | 3 |
| 38 | TPOAb | Antithyroid Peroxidase Autoantibody | 4267 | 111 | 1.65 | 2.00 | 2 | 0.759 | 0.575 | 0 | 2 |
| 39 | TSH3UL | Thyroid Stimulating Hormone - Hypersensitivity | 4272 | 106 | 1.85 | 2.00 | 2 | 0.406 | 0.165 | 1 | 3 |
| 40 | Anti.A | Anti-A Blood Grouping Reagents | 4378 | 0 | 2.39 | 2.00 | 2 | 0.489 | 0.239 | 2 | 3 |
| 41 | Anti.B | Anti-B Blood Grouping Reagents | 4378 | 0 | 2.35 | 2.00 | 2 | 0.476 | 0.226 | 2 | 3 |
| 42 | A1cells | A1cells | 4378 | 0 | 2.61 | 3.00 | 3 | 0.489 | 0.239 | 2 | 3 |
| 43 | Bcells | Bursa Oriented Cells | 4378 | 0 | 2.65 | 3.00 | 3 | 0.476 | 0.227 | 2 | 3 |
| 44 | RBC.1 | Red Blood Cell Count | 4378 | 0 | 1.91 | 2.00 | 2 | 0.345 | 0.119 | 1 | 3 |
| 45 | LYMPH. | Lymphocyte | 4378 | 0 | 1.31 | 1.00 | 1 | 0.463 | 0.214 | 1 | 3 |
| 46 | NEUT | Neutrophil | 4378 | 0 | 2.20 | 2.00 | 2 | 0.404 | 0.163 | 1 | 3 |
| 47 | NEUT. | Neutrophilic Granulocyte | 4378 | 0 | 2.89 | 3.00 | 3 | 0.312 | 0.097 | 2 | 3 |
| 48 | r.GT | Glutamyl Transpeptidase | 4378 | 0 | 1.87 | 2.00 | 2 | 0.487 | 0.237 | 0 | 2 |
| 49 | ALP | Alkaline Phosphatase | 4378 | 0 | 1.69 | 2.00 | 2 | 0.467 | 0.218 | 1 | 3 |
| 50 | label_gdm | Gestational diabetes mellitus | 4378 | 0 | 0.14 | 0.00 | 0 | 0.346 | 0.120 | 0 | 1 |
Note: #. values (#. Missing) is the number of recorded (missing) values; s.d.: standard deviation.

In most clinical scenarios, patients visit hospitals irregularly, and pregnant women normally do not take every test and examination at each visit. Often only some phenotype information is observed for a patient at a given visit, leaving the rest missing, so missing values are pervasive in EHR data. In addition, EHR data are inherently high-dimensional and spread across multiple aspects of health care. Features were therefore carefully selected or constructed before the data analysis in order to achieve the best predictive performance. To ensure the stability of the predictive models, some features were removed prior to data imputation: features present in fewer than 50% of patients in the EHR cohort were discarded from the analysis. The attribute “Family History of Type 2 Diabetes” should also be considered for training the model; however, the hospital's current information system does not collect it.
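The 50% presence rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' preprocessing code: the function name `drop_sparse_features`, the column-dictionary layout and the toy values are all hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def drop_sparse_features(columns: Dict[str, Column],
                         min_present: float = 0.5) -> Dict[str, Column]:
    """Keep only features observed in at least `min_present` of patients.

    A feature present in fewer than 50% of patients is discarded,
    mirroring the rule stated in the text (the data layout is an assumption).
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # an empty column carries no information
        present = sum(v is not None for v in values)
        if present / len(values) >= min_present:
            kept[name] = values
    return kept


# Hypothetical toy cohort of four patients (values are made up).
toy = {
    "bmi": [20.5, None, 22.1, 19.8],        # present in 75% of patients
    "rare_test": [None, None, None, 1.0],   # present in 25% of patients
    "half_observed": [1.0, None, 2.0, None],  # present in exactly 50%
}
kept = drop_sparse_features(toy)
print(sorted(kept))
```

The filter runs before imputation, so sparsely observed attributes never contribute imputed values to the models.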
Setting details of the six methods.
| Methods | Details of setting |
|---|---|
| Logistic Regression | Procedure: polynomial |
| | Selection of variables in equation fitting: forward |
| | Target class: 1, Model type: main effect |
| | Include constants in the equation |
| Bayesian network | Structure type: Markov cover |
| | Parameter learning method: maximum likelihood |
| Neural Networks | Primary objective: Enhanced model accuracy (boosting) |
| | Model: multilayer perceptron NN |
| | Hidden layer: automatically calculates the number of cells |
| | Termination rule: Maximum number of training cycles = 250 |
| | Number of component models (boosting): 10 |
| | Prevent over fitting sets: 30% |
| Support Vector Machines | Kernel: radial basis function (non-linear) |
| | Stop threshold: 1.0e-3 |
| | Regression accuracy (epsilon): 0.1 |
| CHAID trees | Tree growth algorithm: CHAID |
| | Maximum tree depth: 16 |
| | Termination rule: ① Minimum number of records in a parent branch: 2.0%; ② Minimum number of records in a child branch: 1.0% |
| | Segmentation and merging: Significance level (0.05) |
| | Split Merge classes within a node: No |
| | The maximum number of iterations of convergence: 200 |
| CSHM | Base classifiers: LR, SVM, CHAID trees |
| | Method: Confidence weighted voting (maximum) |
| | Model discard criteria: AUCROC < 0.6 |
| | Cost ratio: |
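The CSHM row combines base classifiers by confidence-weighted voting and discards models whose AUC falls below 0.6. The sketch below shows one common reading of that scheme, in which each retained model votes for its predicted class with a weight equal to its confidence and the class with the largest total wins. It is an assumption-laden illustration rather than the authors' implementation: the function name, the input layout and the example numbers are hypothetical, and the cost ratio is left out because the table does not state it.

```python
from collections import defaultdict
from typing import Dict, Tuple


def confidence_weighted_vote(predictions: Dict[str, Tuple[int, float]],
                             auc_by_model: Dict[str, float],
                             min_auc: float = 0.6) -> Tuple[int, float]:
    """Combine per-model (label, confidence) pairs into one prediction.

    Models with AUC below `min_auc` are discarded (the table's discard
    criterion). The returned confidence is the winning total divided by
    the number of retained models, which is an assumed normalisation.
    """
    totals = defaultdict(float)
    retained = 0
    for model, (label, confidence) in predictions.items():
        if auc_by_model[model] < min_auc:
            continue  # model fails the AUC criterion and does not vote
        totals[label] += confidence
        retained += 1
    if retained == 0:
        raise ValueError("no base classifier passed the AUC criterion")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / retained


# Hypothetical predictions for one patient: (predicted label, confidence).
preds = {"LR": (1, 0.9), "SVM": (0, 0.6), "CHAID": (0, 0.2)}
all_pass = {"LR": 0.70, "SVM": 0.72, "CHAID": 0.65}
label_a, conf_a = confidence_weighted_vote(preds, all_pass)

# Same predictions, but LR now fails the AUC >= 0.6 requirement.
lr_fails = {"LR": 0.55, "SVM": 0.72, "CHAID": 0.65}
label_b, conf_b = confidence_weighted_vote(preds, lr_fails)
```

In the first call the single confident positive vote outweighs two weaker negative votes, which is the behaviour that distinguishes confidence weighting from simple majority voting.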
Figure 2. Performance of six techniques with cross validation. Bar graphs in (A), (B), (C) and (D) illustrate accuracy, area under ROC curve (AUC), true positive rate (TPR) and false positive rate (FPR) of those six techniques, respectively. Curves in (E) and (F) demonstrate receiver operating characteristic (ROC) for training and testing. LR: logistic regression; NB: naive Bayes; NN: neural network; SVM: support vector machine; CHAID: Chi-square automatic interaction detection tree; CSHM (1): cost-sensitive hybrid model with cost parameter λ1 = 1 (symmetrical costs of misclassification). TPR and FPR are obtained from their confusion matrix.
Figure 3. Performance of CSHM in five cost-sensitive contexts with cross validation. Bar graphs in (A), (B), (C) and (D) illustrate accuracy, area under ROC curve (AUC), true positive rate (TPR) and false positive rate (FPR) of CSHM in five cost-sensitive contexts, respectively. Curves in (E) and (F) demonstrate receiver operating characteristic (ROC) for training and testing. CSHM (1.5): cost-sensitive hybrid model with cost parameter λ1 = 1.5 (asymmetrical costs of misclassification). TPR and FPR are obtained from their confusion matrix.
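Since the captions derive TPR and FPR from a confusion matrix, the following sketch shows how the reported metrics relate to its four cells. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the study. Precision is included because it corresponds to the share of predicted positives that are real positives.

```python
from typing import Dict


def confusion_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive summary rates from binary confusion-matrix counts."""
    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "tpr": ratio(tp, tp + fn),        # true positive rate (sensitivity)
        "fpr": ratio(fp, fp + tn),        # false positive rate
        "precision": ratio(tp, tp + fp),  # predicted positives that are real
    }


# Hypothetical counts, not taken from the paper.
rates = confusion_rates(tp=30, fp=10, tn=50, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

A model can pair a modest TPR with a high precision when it raises few false alarms, which is the pattern the abstract describes.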
Figure 4. Significance of CSHM compared with other methods. (A) Significance of CSHM relative to SVM, LR and NN; (B) significance of CSHM(100) relative to the other four cost-sensitive contexts; (C) comparison of the results of CSHM and SVM on the experimental data set. T(1): CSHM(1), the CSHM model with cost parameter λ1 = 1. T(1)-LR (or NN, SVM): the true positive rates of CSHM(1) minus those of LR (or NN, SVM). T(100)-T(1) (or T(5), T(10), T(1000)): the true positive rates of CSHM(100) minus those of CSHM(1) (or CSHM(5), CSHM(10), CSHM(1000)). A p-value < 0.001 indicates a significant difference between the two methods under a two-sided test for difference in AUC.
Significance comparison.
| Comparison | Abbreviation | N | Mean | Standard deviation | Standard error of mean | t | Degrees of freedom | Sig. (two-sided) | Lower bound* | Upper bound* |
|---|---|---|---|---|---|---|---|---|---|---|
| CSHM-LR | T(1)-LR | 1871 | 0.1926 | 0.13892 | 0.00321 | 59.961 | 1870 | <0.001 | 0.1863 | 0.1989 |
| CSHM-NN | T(1)-NN | 1871 | 0.2439 | 0.16816 | 0.00389 | 62.727 | 1870 | <0.001 | 0.2362 | 0.2515 |
| CSHM-SVM | T(1)-SVM | 1871 | 0.0733 | 0.10030 | 0.00232 | 31.623 | 1870 | <0.001 | 0.0688 | 0.0779 |
| CSHM(100)-CSHM(1) | T(100)-T(1) | 1887 | 0.0698 | 0.06764 | 0.00156 | 44.828 | 1886 | <0.001 | 0.0667 | 0.0729 |
| CSHM(100)-CSHM(5) | T(100)-T(5) | 1887 | 0.0880 | 0.08092 | 0.00186 | 47.218 | 1886 | <0.001 | 0.0843 | 0.0916 |
| CSHM(100)-CSHM(10) | T(100)-T(10) | 1887 | 0.0320 | 0.03911 | 0.00090 | 35.562 | 1886 | <0.001 | 0.0302 | 0.0338 |
| CSHM(100)-CSHM(1000) | T(100)-T(1000) | 1887 | 0.0665 | 0.05797 | 0.00133 | 49.811 | 1886 | <0.001 | 0.0639 | 0.0691 |
*95% confidence interval of difference.
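The columns of the table are linked by the usual paired-difference relations: the standard error is the standard deviation divided by √N, t is the mean divided by the standard error, and the interval is the mean plus or minus a critical value times the standard error. The sketch below recomputes the CSHM-LR row from its N, mean and standard deviation. The function name is hypothetical, and the critical value 1.96 is a normal approximation assumed here in place of the exact t quantile for 1870 degrees of freedom, so the results agree with the table only up to rounding.

```python
import math
from typing import Dict


def paired_t_summary(n: int, mean_diff: float, sd_diff: float,
                     critical: float = 1.96) -> Dict[str, object]:
    """Recompute SE, t, df and an approximate 95% CI from summary values."""
    se = sd_diff / math.sqrt(n)   # standard error of the mean difference
    t_stat = mean_diff / se       # test statistic against a zero difference
    half_width = critical * se    # normal approximation to the t quantile
    return {
        "se": se,
        "t": t_stat,
        "df": n - 1,
        "ci": (mean_diff - half_width, mean_diff + half_width),
    }


# Inputs copied from the CSHM-LR row of the table above.
row = paired_t_summary(n=1871, mean_diff=0.1926, sd_diff=0.13892)
print(f"SE={row['se']:.5f}, t={row['t']:.2f}, df={row['df']}")
print(f"95% CI = ({row['ci'][0]:.4f}, {row['ci'][1]:.4f})")
```

The small gap between the recomputed t and the tabulated 59.961 comes from using the rounded mean and standard deviation as inputs.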
Figure 5. Confidence reports of six techniques and CSHM in five cost-sensitive contexts with cross validation. Bar graphs in (A) and (B) illustrate the mean correct, and bar graphs in (C) and (D) the mean incorrect, of the six techniques and of CSHM in five cost-sensitive contexts, respectively. Boxplots in (E) and (F) illustrate confidence distributions for training, and those in (G) and (H) confidence distributions for testing, of the six techniques and of CSHM in five cost-sensitive contexts, respectively. Mean correct: mean confidence of correct predictions; mean incorrect: mean confidence of incorrect predictions.