I C Crockart, L T Brink, C du Plessis, H J Odendaal.
Abstract
OBJECTIVE: Intrauterine growth restriction (IUGR) is one of the most common causes of stillbirths. The objective of this study is to develop a machine learning model that can accurately and consistently predict whether the estimated fetal weight (EFW) will be below the 10th percentile at 34 + 0 to 37 + 6 weeks' gestation, using data collected at 20 + 0 to 23 + 6 weeks' gestation.
Keywords: Classification; Fetal heart rate accelerations; IUGR; Machine learning; Umbilical artery Doppler
Year: 2021 PMID: 34007875 PMCID: PMC8128140 DOI: 10.1016/j.imu.2021.100533
Source DB: PubMed Journal: Inform Med Unlocked ISSN: 2352-9148
Table summarising each dataset used. Note, titles of datasets exist only to assist referencing and do not necessarily denote the information within.
| Dataset Title: | Week20_ONLY |
|---|---|
| Source: | Academic Dataset |
| Dimensions: | 4764 rows × 14 columns |
| Brief Description: | This dataset contains the information collected from mothers and their fetuses at 20–24 weeks gestational age. This data has been processed from ECG data collected via the Monica AN24 Device. Originally processed by Ivan Calitz Crockart (2019). |
| Dataset Title: | F3 |
| Source: | Academic Dataset |
| Dimensions: | 2767 rows × 5 columns |
| Brief Description: | This dataset contains the information collected from mothers and their fetuses at 34 + 0 to 37 + 6 weeks gestational age. The data describe the health of the mother and the fetus at this stage, measuring the pulsatility indices of several arteries as well as intrauterine growth restriction as assessed by the estimated fetal weight. |
Feature overview with a brief description of each feature present in the respective datasets. Week20_ONLY refers to a gestational age of 20 + 0 to 23 + 6 weeks, and F3 to 34 + 0 to 37 + 6 weeks.
| Feature | Data Type | Description |
|---|---|---|
| Week20_ONLY | | |
| nGB | Numerical | The number of times gained beats were recorded for the maternal heart rate |
| total_GB | Numerical | The sum of all the gained beats recorded for the maternal heart rate |
| GBoverTime(per hour) | Numerical | Ratio of the total gained beats recorded per hour of heart rate recording for the maternal heart rate |
| accDuration | Numerical | The total time (in seconds) of the recording that was classified as accelerations for the maternal heart rate |
| totalDuration | Numerical | The total time (in seconds) of the recording for the maternal heart rate |
| nLB | Numerical | The number of times lost beats were recorded for the maternal heart rate |
| total_LB | Numerical | The sum of all the lost beats recorded for the maternal heart rate |
| nfGB | Numerical | The number of times gained beats were recorded for the fetal heart rate |
| total_fGB | Numerical | The sum of all the gained beats recorded for the fetal heart rate |
| fGBoverTime(per hour) | Numerical | Ratio of the total gained beats recorded per hour of heart rate recording for the fetal heart rate |
| fAccDuration | Numerical | The total time (in seconds) of the recording that was classified as accelerations for the fetal heart rate |
| fTotalDuration | Numerical | The total time (in seconds) of the recording for the fetal heart rate |
| f_nLB | Numerical | The number of times lost beats were recorded for the fetal heart rate |
| total_fLB | Numerical | The sum of all the lost beats recorded for the fetal heart rate |
| F3 | | |
| F3_UMBILICAL_ARTERY_PI | Numerical | The Pulsatility Index (PI) value for the Umbilical Artery |
| F3_AVG_UTERINE_ARTERY_PI | Numerical | The average Pulsatility Index (PI) value for the Uterine Artery |
| F3_MCA_PI | Numerical | The Pulsatility Index (PI) value for the Middle Cerebral Artery |
| F3_IUGR3 | Categorical | Whether or not the fetus was classified as growth restricted, with an estimated fetal weight below the 3rd percentile |
| F3_IUGR10 | Categorical | Whether or not the fetus was classified as growth restricted, with an estimated fetal weight below the 10th percentile |
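The two "overTime (per hour)" features are described as ratios of a beat total to the recording length. A minimal sketch of that scaling follows, assuming the ratio is simply the total divided by the duration expressed in hours; the function name `beats_per_hour` and the zero-duration guard are illustrative choices, not taken from the source.

```python
def beats_per_hour(total_beats: float, duration_seconds: float) -> float:
    """Scale a beat total to an hourly rate.

    Assumption: rate = total / (duration in hours). Recordings with no
    duration return 0.0 instead of raising a division error.
    """
    if duration_seconds <= 0:
        return 0.0
    return total_beats * 3600.0 / duration_seconds


# Illustrative values only: 1817 gained beats over a 3230 second recording.
example_rate = beats_per_hour(1817, 3230)
```

The guard matters because a recording length of zero seconds would otherwise make the ratio undefined.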
The data quality report for the Week20_ONLY dataset.
| Feature | Count | %Miss. | Card. | Min | 1st Qrt. | Mean | Median | 3rd Qrt. | Max | Std Dev. |
|---|---|---|---|---|---|---|---|---|---|---|
| nGB | 4764 | 0.00 | 39 | 0 | 3.00 | 8.9404 | 8.0 | 13.0 | 59.0 | 6.8574 |
| total_GB | 4764 | 0.00 | 3240 | 0 | 616.75 | 3003.7278 | 1817.0 | 4152.5 | 37044.0 | 3539.0984 |
| GBoverTime(per hour) | 4764 | 0.00 | 1603 | 0 | 684.75 | 3308.1258 | 2000.0 | 4570.0 | 43000.0 | 3899.0871 |
| accDuration | 4764 | 0.00 | 786 | 0 | 1550.00 | 1654.8328 | 1695.0 | 1860.0 | 4030.0 | 451.3060 |
| totalDuration | 4764 | 0.00 | 296 | 0 | 3130.00 | 3141.0569 | 3230.0 | 3360.0 | 9610.0 | 781.0597 |
| nLB | 4764 | 0.00 | 38 | 0 | 0.00 | 2.3573 | 1.0 | 3.0 | 60.0 | 4.1340 |
| total_LB | 4764 | 0.00 | 1161 | 0 | 0.00 | 575.6371 | 115.0 | 377.25 | 107698.0 | 2999.5273 |
| nfGB | 4764 | 0.00 | 42 | 0 | 9.00 | 14.3298 | 14.0 | 19.0 | 48.0 | 6.9648 |
| total_fGB | 4764 | 0.00 | 3388 | 0 | 1759.75 | 3462.5649 | 2813.0 | 4239.0 | 86965.0 | 3544.5789 |
| fGBoverTime(per hour) | 4764 | 0.00 | 1156 | 0 | 2040.00 | 4128.5980 | 3200.0 | 4800.0 | 176000.0 | 5998.3244 |
| fAccDuration | 4764 | 0.00 | 770 | 0 | 1460.00 | 1528.6576 | 1558.0 | 1660.0 | 3362.0 | 319.4827 |
| fTotalDuration | 4764 | 0.00 | 371 | 0 | 3100.00 | 3141.6913 | 3200.0 | 3330.0 | 7860.0 | 616.9372 |
| f_nLB | 4764 | 0.00 | 41 | 0 | 8.00 | 13.0126 | 12.0 | 18.0 | 40.0 | 7.3192 |
| total_fLB | 4764 | 0.00 | 3111 | 0 | 1184.00 | 2500.1064 | 2110.0 | 3296.25 | 45277.0 | 2084.3124 |
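The columns of this report can be reproduced for any numeric feature with a few standard-library calls. The sketch below is one possible implementation, not the tool the authors used; it assumes missing entries are encoded as `None`, that Count refers to non-missing values, and that quartiles use the inclusive interpolation method.

```python
import statistics


def continuous_quality_report(values):
    """Summarise one numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    # Inclusive quartiles treat the data as the full population (an assumption).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
        "min": min(present),
        "q1": q1,
        "mean": statistics.fmean(present),
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),
    }


# Hypothetical column with one missing entry.
report = continuous_quality_report([0, 3, 8, 8, 13, None, 59])
```

Different quartile conventions give slightly different 1st and 3rd quartile values, so small discrepancies against another tool's report are expected.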
The data quality report for the continuous features of the F3 dataset.
| Feature | Count | %Miss. | Card. | Min | 1st Qrt. | Mean | Median | 3rd Qrt. | Max | Std Dev. |
|---|---|---|---|---|---|---|---|---|---|---|
| F3_UMBILICAL_ARTERY_PI | 644 | 76.7 | 10 | 0.5 | 0.8 | 0.9003 | 0.9 | 1.0 | 1.4 | 0.1546 |
| F3_AVG_UTERINE_ARTERY_PI | 657 | 76.3 | 15 | 0.4 | 0.6 | 0.7441 | 0.7 | 0.8 | 1.8 | 0.2078 |
| F3_MCA_PI | 638 | 76.9 | 18 | 1.5 | 1.5 | 1.7566 | 1.7 | 2.0 | 2.7 | 0.3174 |
The data quality report for the categorical features of the F3 dataset.
| Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode % |
|---|---|---|---|---|---|---|---|---|---|
| F3_IUGR3 | 429 | 49.1 | 2 | 0.0 | 419 | 97.6690 | 1.0 | 10 | 2.3310 |
| F3_IUGR10 | 429 | 49.1 | 2 | 0.0 | 402 | 93.7063 | 1.0 | 27 | 6.2937 |
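For categorical features the report replaces quartiles with the two most frequent levels. The following sketch shows how the mode columns could be derived; the helper name and the `None` encoding for missing values are assumptions. The example input uses the class frequencies listed for F3_IUGR10 above (402 and 27).

```python
from collections import Counter


def categorical_quality_report(values):
    """Summarise one categorical column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    report = {
        "count": len(present),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
    }
    # most_common(2) returns the mode and, if it exists, the second mode.
    ranked = Counter(present).most_common(2)
    for name, (level, freq) in zip(("mode", "second_mode"), ranked):
        report[name] = level
        report[name + "_freq"] = freq
        report[name + "_pct"] = 100.0 * freq / len(present)
    return report


iugr10_report = categorical_quality_report([0.0] * 402 + [1.0] * 27)
```

With these frequencies the mode percentages evaluate to roughly 93.7063 and 6.2937, which agrees with the F3_IUGR10 row.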
The tabulated results of the data visualization analysis.
| Feature vs Density | Graph Shape |
|---|---|
| nGB | Unimodal (Skewed Right) |
| total_GB | Exponential |
| GBoverTime(per hour) | Exponential |
| accDuration | Normal (Unimodal) |
| nfGB | Normal (Unimodal) |
| total_fGB | Exponential |
| fGBoverTime(per hour) | Exponential |
| fAccDuration | Normal (Unimodal) |
| totalDuration | Normal (Unimodal) |
| fTotalDuration | Normal (Unimodal) |
| nLB | Exponential |
| f_nLB | Multimodal/Skewed Right |
| total_LB | Exponential |
| total_fLB | Unimodal (Skewed Right) |
| F3_UMBILICAL_ARTERY_PI | Multimodal/Normal (Unimodal) |
| F3_AVG_UTERINE_ARTERY_PI | Unimodal (Skewed Right) |
| F3_MCA_PI | Normal (Unimodal) |
Fig. 1. One of the scatter plot facet grids. This one compares the variables GBoverTime(per hour), fGBoverTime(per hour) and F3_MCA_PI. Particular focus should be on the effect the ‘0’ values have on the data.
The Data Quality Plan in tabular form, giving a brief overview of the issues present and the handling strategies to be employed.
| Feature | Data Quality Issue(s) | Potential Handling Strategies |
|---|---|---|
| Week20_ONLY | | |
| nGB | Skewed data | Remove rows with 0 values. Remove outliers. |
| total_GB | Skewed data/Outliers/High Cardinality | Remove rows with 0 values. Remove outliers. |
| GBoverTime(per hour) | Skewed data/Outliers | Remove rows with 0 values. Remove outliers. |
| accDuration | Outliers (Low) | Remove rows with 0 values. Remove outliers. |
| totalDuration | Outliers (Low) | Remove rows with 0 values. Remove outliers. |
| nLB | Skewed data/Outliers | Remove rows with 0 values. Remove outliers. |
| total_LB | Skewed data/Outliers | Remove rows with 0 values. Remove outliers. |
| nfGB | Outliers (Low) | Remove rows with 0 values. Remove outliers. |
| total_fGB | Skewed data/Outliers/High Cardinality | Remove rows with 0 values. Remove outliers. |
| fGBoverTime(per hour) | Skewed data/Outliers | Remove rows with 0 values. Remove outliers. |
| fAccDuration | Outliers (Low) | Remove rows with 0 values. Remove outliers. |
| fTotalDuration | Outliers (Low) | Remove rows with 0 values. Remove outliers. |
| f_nLB | Skewed data | Remove rows with 0 values. Remove outliers. |
| total_fLB | Skewed data/High Cardinality | Remove rows with 0 values. Remove outliers. |
| F3 | | |
| f3_umbilical_artery_pi | Missing Data (76.7%) | Match metavalues (patID) to isolate relevant data |
| f3_avg_uterine_artery_PI | Missing Data (76.3%) | Match metavalues (patID) to isolate relevant data |
| f3_mca_pi | Missing Data (76.9%) | Match metavalues (patID) to isolate relevant data |
| f3_iugr3 | Missing Data (49.1%)/Irregular Cardinality | Drop from dataset |
| f3_iugr10 | Missing Data (49.1%)/Irregular Cardinality | Match metavalues (patID) to isolate relevant data |
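The three handling strategies in the plan (dropping rows with 0 values, removing outliers, and matching the two datasets on patID) can be sketched as follows. The source does not say which outlier rule was applied, so the Tukey fence (1.5 × IQR) used here is an assumption, as are the helper names and the tiny example records.

```python
import statistics


def drop_zero_rows(rows, features):
    """Keep only rows where none of the listed features equals 0."""
    return [r for r in rows if all(r[f] != 0 for f in features)]


def drop_outliers(rows, feature, k=1.5):
    """Assumed rule: discard values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[feature] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[feature] <= high]


def match_on_pat_id(week20_rows, f3_rows, target="F3_IUGR10"):
    """Inner join on patID, keeping only F3 rows with a known target."""
    f3_by_id = {r["patID"]: r for r in f3_rows if r.get(target) is not None}
    return [{**w, **f3_by_id[w["patID"]]}
            for w in week20_rows if w["patID"] in f3_by_id]


# Hypothetical records for illustration only.
week20 = [
    {"patID": 1, "nfGB": 14, "total_fGB": 2800},
    {"patID": 2, "nfGB": 0, "total_fGB": 0},
    {"patID": 3, "nfGB": 12, "total_fGB": 3100},
    {"patID": 4, "nfGB": 15, "total_fGB": 2900},
    {"patID": 5, "nfGB": 13, "total_fGB": 86965},
    {"patID": 6, "nfGB": 16, "total_fGB": 3000},
]
f3 = [
    {"patID": 1, "F3_IUGR10": 0},
    {"patID": 3, "F3_IUGR10": 1},
    {"patID": 4, "F3_IUGR10": None},
    {"patID": 6, "F3_IUGR10": 0},
]

cleaned = drop_outliers(drop_zero_rows(week20, ["nfGB", "total_fGB"]), "total_fGB")
merged = match_on_pat_id(cleaned, f3)
```

Joining on patID is what resolves the high missing-data percentages in F3: only patients present in both datasets with a recorded target remain.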
A table displaying the feature rankings according to Orange’s ‘Rank’ method, determined from the Information Gain and the Gain Ratio.
| Feature | Info. gain | Gain ratio |
|---|---|---|
| fGBoverTime(per hour) | 0.160 | 0.080 |
| F3_UMBILICAL_ARTERY_PI | 0.104 | 0.052 |
| F3_AVG_UTERINE_ARTERY_PI | 0.088 | 0.044 |
| f_nLB | 0.072 | 0.036 |
| fAccDuration | 0.026 | 0.013 |
| nfGB | 0.026 | 0.013 |
| F3_MCA_PI | 0.010 | 0.005 |
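Information gain is the reduction in class entropy after splitting on a feature, and gain ratio divides that by the entropy of the split itself. The sketch below computes both for a feature that has already been discretised into bins; it is a generic textbook formulation, not Orange's internal code, and the bin and label vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits of a sequence of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())


def info_gain_and_ratio(bins, labels):
    """Return (information gain, gain ratio) for a binned feature."""
    total = len(labels)
    by_bin = {}
    for b, y in zip(bins, labels):
        by_bin.setdefault(b, []).append(y)
    # Class entropy remaining after the split, weighted by bin size.
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_bin.values())
    gain = entropy(labels) - conditional
    split_info = entropy(bins)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical feature cut into four equally populated bins.
bins = [0, 0, 1, 1, 2, 2, 3, 3]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
gain, ratio = info_gain_and_ratio(bins, labels)
```

Four equally populated bins have a split entropy of 2 bits, so the ratio here is exactly half the gain. Every row of the table above shows the same one-half relationship, which would be consistent with that kind of discretisation, although the source does not state how the features were binned.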
Fig. 2. A screenshot of the Tree Model produced by Orange’s ‘Tree’ function.
Fig. 3. Graphical representation of the preparation process for each model.
The selected features and their types.
| Name | Type |
|---|---|
| fGBoverTime(per hour) | Feature |
| fAccDuration | Feature |
| f_nLB | Feature |
| F3_UMBILICAL_ARTERY_PI | Feature |
| F3_AVG_UTERINE_ARTERY_PI | Feature |
| D1 | Constructed Feature |
| D2 | Constructed Feature |
| F3_IUGR10 | Target Variable |
| patID | Meta Attribute |
Fig. 4. Visual results of the Hierarchical Clustering on the data. Average ‘Linkage’ used. Each colour represents a separate cluster of values.
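Average linkage merges, at each step, the two clusters whose members have the smallest mean pairwise distance. A naive sketch of that procedure is given below; it assumes Euclidean distance and a fixed target number of clusters, neither of which is stated in the source, and the points are hypothetical.

```python
import math


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage.

    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Evaluate every pair of clusters and merge the closest one.
        candidates = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(candidates)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical two-dimensional points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
groups = average_linkage(points, n_clusters=3)
```

This version recomputes all pairwise linkages on every pass, which is fine for a small dataset but scales poorly; library implementations cache the distance matrix.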
The evaluation results for each model as produced by Orange’s ‘Test and Score’ function. Both sampling methods are included. The results shown are attained using a 95% confidence interval, inherent in the software function.
| Model 1 - Random Sampling(10, 70%); Stratified | |||||
|---|---|---|---|---|---|
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.544 | 0.833 | 0.812 | 0.797 | 0.833 |
| Random Forest | 0.754 | 0.907 | 0.889 | 0.904 | 0.907 |
| Logistic Regression | 0.547 | 0.813 | 0.777 | 0.745 | 0.813 |
| kNN | 0.745 | 0.873 | 0.872 | 0.871 | 0.873 |
| Model 1 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.607 | 0.878 | 0.856 | 0.859 | 0.878 |
| Random Forest | 0.639 | 0.898 | 0.872 | 0.909 | 0.898 |
| Logistic Regression | 0.561 | 0.857 | 0.791 | 0.735 | 0.857 |
| kNN | 0.779 | 0.816 | 0.821 | 0.827 | 0.816 |
| Model 2 - Random Sampling(10, 70%); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.544 | 0.833 | 0.812 | 0.797 | 0.833 |
| Random Forest | 0.668 | 0.853 | 0.818 | 0.803 | 0.853 |
| Logistic Regression | 0.738 | 0.833 | 0.818 | 0.806 | 0.833 |
| kNN | 0.492 | 0.800 | 0.785 | 0.771 | 0.800 |
| Model 2 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.607 | 0.878 | 0.856 | 0.859 | 0.878 |
| Random Forest | 0.692 | 0.918 | 0.904 | 0.925 | 0.918 |
| Logistic Regression | 0.823 | 0.878 | 0.856 | 0.859 | 0.878 |
| kNN | 0.575 | 0.796 | 0.781 | 0.769 | 0.796 |
| Model 3 - Random Sampling(10, 70%); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.746 | 0.927 | 0.917 | 0.926 | 0.927 |
| Random Forest | 0.692 | 0.907 | 0.896 | 0.898 | 0.907 |
| Logistic Regression | 0.732 | 0.907 | 0.893 | 0.899 | 0.907 |
| kNN | 0.578 | 0.827 | 0.801 | 0.782 | 0.827 |
| Model 3 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.774 | 0.918 | 0.913 | 0.913 | 0.918 |
| Random Forest | 0.833 | 0.898 | 0.886 | 0.888 | 0.898 |
| Logistic Regression | 0.803 | 0.898 | 0.886 | 0.888 | 0.898 |
| kNN | 0.590 | 0.837 | 0.825 | 0.817 | 0.837 |
| Model 4 - Random Sampling(10, 70%); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.746 | 0.927 | 0.917 | 0.926 | 0.927 |
| Random Forest | 0.739 | 0.907 | 0.893 | 0.899 | 0.907 |
| Logistic Regression | 0.732 | 0.907 | 0.893 | 0.899 | 0.907 |
| kNN | 0.578 | 0.827 | 0.801 | 0.782 | 0.827 |
| Model 4 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.774 | 0.918 | 0.913 | 0.913 | 0.918 |
| Random Forest | 0.779 | 0.918 | 0.904 | 0.925 | 0.918 |
| Logistic Regression | 0.803 | 0.898 | 0.886 | 0.888 | 0.898 |
| kNN | 0.590 | 0.837 | 0.825 | 0.817 | 0.837 |
| Model 5 - Random Sampling(10, 70%); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.771 | 0.933 | 0.926 | 0.932 | 0.933 |
| Random Forest | 0.707 | 0.927 | 0.917 | 0.926 | 0.927 |
| Logistic Regression | 0.762 | 0.920 | 0.908 | 0.919 | 0.920 |
| kNN | 0.812 | 0.873 | 0.830 | 0.849 | 0.873 |
| Model 5 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.774 | 0.918 | 0.913 | 0.913 | 0.918 |
| Random Forest | 0.816 | 0.898 | 0.886 | 0.888 | 0.898 |
| Logistic Regression | 0.789 | 0.918 | 0.913 | 0.913 | 0.918 |
| kNN | 0.867 | 0.878 | 0.836 | 0.893 | 0.878 |
| Model 6 - Random Sampling(10, 70%); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.771 | 0.933 | 0.926 | 0.932 | 0.933 |
| Random Forest | 0.707 | 0.927 | 0.917 | 0.926 | 0.927 |
| Logistic Regression | 0.762 | 0.920 | 0.908 | 0.919 | 0.920 |
| kNN | 0.812 | 0.873 | 0.830 | 0.849 | 0.873 |
| Model 6 - Cross Validation(3 folds); Stratified | |||||
| Method | AUC | CA | F1 | Precision | Recall |
| SGD | 0.774 | 0.918 | 0.913 | 0.913 | 0.918 |
| Random Forest | 0.816 | 0.898 | 0.886 | 0.888 | 0.898 |
| Logistic Regression | 0.789 | 0.918 | 0.913 | 0.913 | 0.918 |
| kNN | 0.867 | 0.878 | 0.836 | 0.893 | 0.878 |
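Both sampling schemes named in the table headings, repeated random sampling (10 repeats, 70% training) and 3-fold cross validation, are stratified, meaning each split preserves the class proportions. The sketch below generates index splits for both schemes; it is a generic illustration and not Orange's implementation, and the label vector, seeds and function names are assumptions.

```python
import random
from collections import defaultdict


def indices_by_class(labels):
    """Group row indices by their class label."""
    groups = defaultdict(list)
    for i, y in enumerate(labels):
        groups[y].append(i)
    return groups


def stratified_random_splits(labels, repeats=10, train_fraction=0.7, seed=0):
    """Yield (train, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    groups = indices_by_class(labels)
    for _ in range(repeats):
        train, test = [], []
        for idx in groups.values():
            shuffled = rng.sample(idx, len(idx))
            cut = round(train_fraction * len(idx))
            train += shuffled[:cut]
            test += shuffled[cut:]
        yield sorted(train), sorted(test)


def stratified_folds(labels, k=3, seed=0):
    """Yield (train, test) index lists for k stratified folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in indices_by_class(labels).values():
        shuffled = rng.sample(idx, len(idx))
        # Deal each class round-robin so every fold gets its share.
        for position, i in enumerate(shuffled):
            folds[position % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical imbalanced target: 40 negatives and 10 positives.
labels = [0] * 40 + [1] * 10
random_splits = list(stratified_random_splits(labels))
cv_splits = list(stratified_folds(labels))
```

With a minority class this small, stratification keeps at least a few positive cases in every test set, which an unstratified split would not guarantee.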
Fig. 5. Comparative bar graph showing the AUC values for the different models for each evaluation method. These results were obtained using Random Sampling with Stratification.
Fig. 6. Comparative bar graph showing the AUC values for the different models for each evaluation method. These results were obtained using Cross Validation with Stratification.
Table showing the summarised Confusion Matrix results for each method used for the final model. The results are displayed as a percentage of the predictions made for each class, so each Predicted column sums to 100%; the row and column totals are counts. The values highlighted in green represent the True values (desirable values), while those in red represent False values (undesirable values).
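| KNN | | | | |
|---|---|---|---|---|
| | | Predicted 0.0 | Predicted 1.0 | Total |
| Actual | 0.0 | 87.8% | 33.3% | 130 |
| | 1.0 | 12.2% | 66.7% | 20 |
| | Total | 147 | 3 | 150 |
| Logistic Regression | | | | |
| | | Predicted 0.0 | Predicted 1.0 | Total |
| Actual | 0.0 | 92.1% | 10.0% | 130 |
| | 1.0 | 7.9% | 90.0% | 20 |
| | Total | 140 | 10 | 150 |
| Random Forest | | | | |
| | | Predicted 0.0 | Predicted 1.0 | Total |
| Actual | 0.0 | 92.8% | 9.1% | 130 |
| | 1.0 | 7.2% | 90.9% | 20 |
| | Total | 139 | 11 | 150 |
| SGD | | | | |
| | | Predicted 0.0 | Predicted 1.0 | Total |
| Actual | 0.0 | 93.5% | 8.3% | 130 |
| | 1.0 | 6.5% | 91.7% | 20 |
| | Total | 138 | 12 | 150 |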
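The KNN matrix can be tied back to the scores in the evaluation table. The sketch below computes classification accuracy and support-weighted precision, recall and F1 from a matrix of counts. The counts are an assumption: they were recovered by multiplying each column percentage by its column total and rounding (for example 87.8% of 147 ≈ 129), and the function name is illustrative.

```python
def weighted_metrics(matrix):
    """matrix[actual][predicted] holds counts.

    Returns accuracy plus precision, recall and F1 averaged over
    classes with each class weighted by its number of actual cases.
    """
    classes = range(len(matrix))
    total = sum(sum(row) for row in matrix)
    scores = {"CA": sum(matrix[c][c] for c in classes) / total,
              "Precision": 0.0, "Recall": 0.0, "F1": 0.0}
    for c in classes:
        tp = matrix[c][c]
        actual = sum(matrix[c])
        predicted = sum(matrix[r][c] for r in classes)
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = actual / total
        scores["Precision"] += weight * p
        scores["Recall"] += weight * r
        scores["F1"] += weight * f
    return scores


# Counts reconstructed from the KNN percentages and totals (assumed rounding).
knn_counts = [[129, 1],   # actual 0.0: predicted 0.0, predicted 1.0
              [18, 2]]    # actual 1.0: predicted 0.0, predicted 1.0
knn_scores = weighted_metrics(knn_counts)
```

Under these assumptions the scores come out at about CA 0.873, F1 0.830, Precision 0.849 and Recall 0.873, the same values as the kNN rows under Random Sampling for Models 5 and 6. Support-weighted recall always equals accuracy, which would explain why the Recall and CA columns coincide throughout the evaluation table. Only 2 of the 20 actual positive cases are predicted correctly here, so the high weighted scores mostly reflect the majority class.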
Fig. 7. The resulting ROC curve for the final model. The curve shown is for Target Class ‘0’.
Fig. 8. The resulting ROC curve for the final model. The curve shown is for Target Class ‘1’.
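The area under a ROC curve equals the probability that a randomly chosen case of the target class receives a higher score than a randomly chosen case of the other class. A minimal sketch of that rank-based calculation is shown below for both target classes; the scores and labels are hypothetical, and the class ‘0’ score is assumed to be one minus the class ‘1’ probability.

```python
def auc(scores, labels, positive):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for class 1.
labels = [0, 0, 0, 1, 0, 1, 1, 0]
p_class1 = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.05]
p_class0 = [1 - p for p in p_class1]

auc_class1 = auc(p_class1, labels, positive=1)
auc_class0 = auc(p_class0, labels, positive=0)
```

In a two-class problem with complementary probabilities the two areas are equal, so the curves for Target Class ‘0’ and ‘1’ differ in shape but not in AUC.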