Artificial intelligence-assisted prediction of preeclampsia: Development and external validation of a nationwide health insurance dataset of the BPJS Kesehatan in Indonesia.

Herdiantri Sufriyana¹, Yu-Wei Wu², Emily Chia-Yu Su³.

Abstract

BACKGROUND: We developed and validated an artificial intelligence (AI)-assisted prediction of preeclampsia applied to a nationwide health insurance dataset in Indonesia.
METHODS: The BPJS Kesehatan dataset was preprocessed using a nested case-control design into preeclampsia/eclampsia cases (n = 3318) and normotensive pregnant controls (n = 19,883), drawn from all women with one pregnancy. The dataset provided 95 features consisting of demographic variables and medical histories, covering the period from 24 months before the event up to the event itself, which was defined as delivery. Six algorithms were compared by the area under the receiver operating characteristics curve (AUROC), with a subgroup analysis by time to the event. We compared our model to similar prediction models from systematically reviewed studies. In addition, we conducted a text mining analysis based on natural language processing techniques to interpret our modeling results.
FINDINGS: The best model consisted of 17 predictors extracted by a random forest algorithm. Nine∼12 months to the event was the period that had the best AUROC in external validation by either geographical (0.88, 95% confidence interval (CI) 0.88-0.89) or temporal split (0.86, 95% CI 0.85-0.86). We compared this model to prediction models in seven studies from 869 records in PUBMED, EMBASE, and SCOPUS. This model outperformed the previous models in terms of the precision, sensitivity, and specificity in all validation sets.
INTERPRETATION: Our low-cost model improved preliminary prediction for deciding which pregnant women should go on to be assessed by models with high specificity and advanced predictors.
FUNDING: This work was supported by grant no. MOST108-2221-E-038-018 from the Ministry of Science and Technology of Taiwan.

Keywords: Artificial intelligence; Clinical prediction rule; Health insurance dataset; Machine learning; Natural language processing; Preeclampsia

Year: 2020; PMID: 32283530; PMCID: PMC7152721; DOI: 10.1016/j.ebiom.2020.102710

Source DB: PubMed; Journal: EBioMedicine; ISSN: 2352-3964; Impact factor: 8.143

Evidence before this study

Preeclampsia is a subtype of pregnancy-induced hypertension (PIH) that is a common cause of maternal mortality. The etiology and pathogenesis are not well understood, but the evidence indicates that the only cure is delivery; thus, false positives of preeclampsia predictions might lead to unnecessary early deliveries. This contributes to premature and low-birth-weight babies, which, in turn, increases inefficient utilization of neonatal intensive care units (ICUs). There are four problems with preeclampsia prediction models from previous studies: (1) no robust prediction of all subtypes of preeclampsia; (2) biased predictive performances; (3) low precision or positive predictive values; and (4) a need for a high-resource setting to apply the prediction model. Most models achieved greater than 90% sensitivity and specificity only for early-onset but not all subtypes of preeclampsia. Meanwhile, late-onset preeclampsia is more common than early-onset cases; thus, admission to neonatal ICUs was not significantly reduced. A previous study showed promising predictive performance of a prediction model for late-onset preeclampsia. Yet, this did not apply recent standards that have been developed for predictive modeling, which were designed to avoid risks of bias. Ultimately, low precision of prediction models is common in preeclampsia predictions because of class imbalances in which preeclampsia outcomes were very low compared to normotensive controls (mostly 1:9). In low-precision prediction models, a predicted preeclampsia case is likely to be a false positive, and this leads to unnecessary early deliveries. Although there are highly precise prediction models limited to early-onset preeclampsia, these require expensive, inaccessible biophysical and biochemical markers such as the pulsatility index of the uterine artery by ultrasound measurement, soluble fms-like tyrosine kinase-1 (sFlt-1), and/or placental growth factor (PlGF).
We need a prediction model that has a low false positive rate and low-cost predictors, with high sensitivity at the same specificity as other models with low-cost predictors. This model will serve as a preliminary model for deciding whether prediction models with advanced predictors should then be used. We can thereby reduce both maternal mortality and neonatal morbidity as well as the cost of preeclampsia prediction at the community level.

Added value of this study

The prediction model proposed in this study was robust for preeclampsia in both internal and external validation sets. This model was developed based on the Prediction Model Risk of Bias Assessment Tool (PROBAST), which contains recent guidelines for prediction model development to avoid risks of bias. We compared the precision, together with sensitivity and specificity, to similar prediction models from previous studies. These were systematically reviewed among 879 records from PUBMED, EMBASE, and SCOPUS within the last 5 years (since 2015). Our model applied a machine learning algorithm that uses demographic variables and diagnoses on previous visits, which are conceivably applicable in low-resource settings. To develop and validate this model, we utilized big data from a nationwide health insurance dataset in Indonesia (n = 2,641,096) with preeclampsia/eclampsia (n = 3318) vs. non-PIH nested control (n = 19,883) outcomes. Our model outperformed those from systematically reviewed studies in both internal and external validation sets. For external validation, our precision levels were 0.59 (95% confidence interval (CI) 0.58–0.60; by geographical splitting) and 0.72 (95% CI 0.72–0.72; by temporal splitting) compared to the best previous model of 0.17 (95% CI 0.17–0.17) at sensitivity ∼95%. Meanwhile, the specificities were 0.47 (95% CI 0.45–0.49; by geographical splitting) and 0.44 (95% CI 0.43–0.45; by temporal splitting) compared to the best previous model of 0.47 (95% CI 0.40–0.55).
The areas under the receiver operating characteristics curve of our model were 0.88 (95% CI 0.88–0.89) and 0.86 (95% CI 0.85–0.86) for the geographical and temporal splits, respectively. Subjects predicted as preeclampsia/eclampsia were ∼80% in both external validation sets, which implies a potential reduction of ∼20% in the cost of prediction by highly-specific models with advanced predictors. We also applied natural language processing techniques to assist interpretation of our model, which is considered one of the most important artificial intelligence applications.

Implications of all the available evidence

Since our model showed an acceptable predictive performance using information from a health insurance dataset that came from multiple healthcare facilities, we encourage health insurance companies to facilitate deployment of this model so that it can be used across healthcare facilities in privacy-aware information systems. We expect this model to have an impact on improving efficient neonatal ICU utilization and in turn reduce expenses of insurance companies. Our prediction model also supported several recent findings on preeclampsia pathogenesis. The best predictive performance of our model used predictors during 9–<12 months to the event. This supports recent findings from bioinformatics studies which revealed that preeclampsia pathogenesis possibly starts before pregnancy rather than during the first trimester. Approximately one-third of text profiling results from diagnoses on previous visits were bacterial infection-related conditions, as inferred by natural language processing techniques. This also corresponds to recent findings from microbiology and microbiome studies that provided evidence of the role of bacterial infections or specific microbial communities in several organs of women with preeclampsia.

Introduction

Predicting preeclampsia may prevent neonatal morbidity because this disorder can lead to neonatal prematurity [1,2]. Admission to neonatal intensive care units (ICUs) was not reduced (odds ratio [OR] 0.93, 95% confidence interval (CI) 0.55–1.59), although preterm/early preeclampsia was prevented by aspirin administration at 11–13 weeks’ gestation [3]. This is because term/late preeclamptic women are more common than preterm/early ones and only 85% of those were detected with a false positive rate of 10% at 35–37 weeks in high-resource settings [4]. A nationwide health insurance dataset of the BPJS Kesehatan in Indonesia can provide big data to develop artificial intelligence (AI)-assisted predictions that may reduce false positives. However, the predictive performance of predicting preeclampsia developed based on this health insurance dataset is still unclear. Preeclampsia is one of the pregnancy-induced hypertension (PIH) and placenta dysfunction-related disorders [5,6]. Preeclampsia affects 4.6% (95% uncertainty range 2.7–8.2) of pregnant women [7], and may also impair fetal growth, which leads to low birth weights as a predisposing factor to neonatal deaths [8,9]. Although many conditions in pregnant women contribute to premature and low-birth-weight infants [10], preeclampsia is the major contributor, because early delivery is the only cure for this disease. Yet, the decision to deliver early may be based on a false positive that leads to inefficient utilization of neonatal ICUs for preventable premature babies. Although neonatal ICU admission was not reduced by babies from preeclamptic women given aspirin at 11–13 weeks’ gestation [3], the length of stay in neonatal ICU was reduced by 20.3 days (95% CI 7.0–38.6) [11]. However, this was because of decreased birth rates at <32 weeks' gestation (OR 0.42; 95% CI, 0.19–0.93), or prevention of early preeclampsia. 
Meanwhile, the number of babies admitted to neonatal ICUs was larger from term/late preeclamptic women (n = 14/102, 13.72%) than from preterm/early ones (n = 7/102, 6.86%). By reducing the length of stay without reducing neonatal ICU admissions, the cost was reduced mostly at the individual but not at the community level. Therefore, reducing false positives from preeclampsia predictions may improve the efficiency of utilization of these scarce facilities. Predicting preeclampsia is important because effective prevention is available only for preterm preeclampsia (risk ratio [RR] 0.62, 95% CI 0.45–0.87), for which aspirin is given at ≤16 weeks’ gestation [12]. Ninety predictors and 52 prediction models were compared by 126 systematic reviews, and 63.49% of them included advanced biomarkers, genomics, and/or ultrasound measures [13]. Nevertheless, few of those tests had both sensitivity and specificity above 90% in the external validation. Although there was an externally validated prediction model with a sensitivity of 93% (95% CI 76%–99%; at a specificity of 90%), this was true only for early but not all preeclampsia (sensitivity 49%, 95% CI 43%–56%) [14]. This model also used advanced biomarkers, while another model only achieved a sensitivity of 47.6% (95% CI 44.0–51.1%; at a specificity of 89.4%) using maternal characteristics and medical histories [15]. A prediction model with high precision and sensitivity but low-cost predictors is needed for preeclampsia. The model should have high sensitivity at the same or better specificity compared to the others with low-cost predictors. The model is intended to decide which patients should then be assessed by other highly-specific models with advanced predictors. This preliminary prediction model would improve the efficiency of neonatal ICU utilization and reduce the prediction cost at the community level without sacrificing either maternal or neonatal patient safety.
The poor performances of preeclampsia prediction may be caused by the complexity of this disease at the transcriptomic level [16]. Machine learning can potentially deal with this problem [17]; however, it needs big data to achieve good predictive performances. A recent machine learning prediction study demonstrated a promising predictive performance in internal validation by a stochastic gradient-boosting algorithm for late preeclampsia (c-statistic 0.924; with a sensitivity of 0.60 and a specificity of 0.99) [18]. That study utilized electronic medical records consisting of 24 clinical and biochemical predictors, but there were only 474 events of preeclampsia. It therefore had too few events per variable (EPV), which may cause overfitting in several machine learning algorithms [19]. However, no previous studies have developed and externally validated prediction models for preeclampsia using big data with sufficient EPV for machine learning algorithms. The Nationwide Health Insurance Dataset of BPJS Kesehatan (NHID-BPJSKes) in Indonesia can provide big data for developing machine learning prediction models. Health insurance datasets have been utilized for association studies involving PIH in Taiwan [20–23] and a predictive study for postpartum women in the UK [24]. Although only demographic data and diagnoses are provided by the NHID-BPJSKes, machine learning prediction models can be developed using this dataset since it provides sufficient EPV. This is because a systematic review showed that Indonesia is one of the countries with high incidences of preeclampsia based on two studies (8.6%; n = 43,464) [7]. Although there is no effective prevention for late preeclampsia, machine learning trained on big data may provide a predictive model with better precision. By reducing false positives that lead to pregnancy termination because of preeclampsia, it may eventually improve utilization of ICUs.
In addition, this predictive model can be used to efficiently construct prospective cohorts of preeclampsia for further development of machine learning predictions, especially in poor-resource settings. This study attempted to develop and validate an AI-assisted prediction of preeclampsia by machine learning applied to the NHID-BPJSKes in Indonesia.

Materials and methods

This study developed a prognostic prediction model using a publicly accessible dataset; thus, our study was non-interventional and non-observational. We applied guidelines extended from the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement, which is widely accepted for diagnostic/prognostic studies. The guidelines were extended by several TRIPOD authors specifically for multivariable prediction models instead of a single predictor to minimize risks of bias and optimism in prediction model development [25]. The extended guidelines are called the Prediction Model Risk of Bias Assessment Tool (PROBAST). We applied PROBAST in conjunction with the Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [26].

Data source

This study utilized the NHID-BPJSKes, a cross-sectional dataset representing real-world data of insurance-based healthcare in Indonesia in 2015 and 2016; we preprocessed this dataset to build a tidy dataset for a nested case-control design. By 2018, the health insurance covered 200,259,147 (75.8%) individuals in Indonesia [27]. This reflects the coverage of this insurance among the pregnant women in this study. We utilized the initial version of the original dataset, which had no accession number but was published with a dataset code book for each version [27]. This book describes details of the cross-sectional sampling procedures of the original dataset. Briefly, all individuals covered by this insurance were sampled by stratified random sampling across 66,072 combinations of primary care facility (22,024 facilities) and family category (3 classes). The family categories were families whose members: (1) never used the health insurance; (2) used it in primary care only; and (3) used it in both primary care and hospital. The dataset only included strata combinations that contained at least one family. It also included a maximum of 10 families per combination; thus, if a combination contained >10 families, these were undersampled to 10 families. In the end, the sampling procedures resulted in 586,969 families and 1,697,452 individuals. Before we reconstructed the NHID-BPJSKes dataset for this study, it had been sampled from the overall data in the insurance database by its owner, the social security administrator for health, or Badan Penyelenggara Jaminan Sosial (BPJS) Kesehatan, in Indonesia. None of the authors were part of BPJS Kesehatan, and the sample dataset had also been deidentified before it was made publicly accessible by request; thus, there was neither an issue of patient privacy nor a need for informed consent in this study. Permission to use the dataset for this study was granted by the owner (dataset request approval no.: 12047/I.2/0919).
The datasets for model development and validation in this study are available by request to the corresponding author and with the approval of BPJS Kesehatan in Indonesia. There were approximately 2.6 million instances from 34 provinces of Indonesia. These consisted of four tables of claims data, which describe membership, primary care visits by capitation, primary care visits by case-based group (CBG) payments, and hospital visits by CBG payments. Diagnoses in this dataset were coded according to the International Classification of Diseases 10th Revision (ICD10).

Study design

We applied a nested case-control design for this study. Inclusion criteria were pregnant women with and those without preeclampsia/eclampsia. All women with pregnancy records were included without considering whether the subjects were first-time mothers or not. Exclusion criteria were all subtypes of PIH except preeclampsia/eclampsia. Cases were defined as women with preeclampsia/eclampsia and one pregnancy but without other PIH diagnoses, while controls were defined as women with one pregnancy but without PIH diagnoses including preeclampsia/eclampsia. To approach these definitions, we applied data preprocessing by ICD10 codes. Case groups consisted of all visits from subjects that had codes of both O14–15 and Z33–37, while controls consisted of those from subjects having codes of Z33–37 only (Table 1). Neither cases nor controls included visits from subjects that had other codes within O10–16. The pregnancy period was defined between the earliest and latest dates of visits coded by Z33–37 or O. This period was applied for feature extraction. We also removed all records that possibly had more than one episode of pregnancy within a 2-year period in the dataset. This was achieved by removing subjects with differences of known earliest and latest delivery codes (O80–82) that were greater than zero. In the end, the age range of control group was matched with the case group (12–55 years old).

Table 1

Diagnosis codes for nested case-control sampling.

ICD10 codes	Description
O	Pregnancy, childbirth, and puerperium
O10–16	Oedema, proteinuria, and hypertensive disorders in pregnancy, childbirth, and puerperium
O14–15	Preeclampsia and eclampsia
O80–82	Encounter for delivery
Z33–37	Pregnant state, encounter for supervision of normal pregnancy, encounter for antenatal screening of mother, and outcome of delivery
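
The code-based case and control definitions in Table 1 can be sketched as a small labelling function. This is only an illustrative sketch and not the preprocessing actually used in the study; the function names `in_range` and `label_subject` and the input format (a list of ICD10 code strings per subject) are assumptions. The single-pregnancy filter based on O80–82 dates and the age matching are not shown.

```python
import re

def in_range(code, letter, lo, hi):
    """True if an ICD10 code such as 'O14.1' falls within letter+lo .. letter+hi."""
    m = re.match(r"^([A-Z])(\d{2})", code.strip().upper())
    return bool(m) and m.group(1) == letter and lo <= int(m.group(2)) <= hi

def label_subject(codes):
    """Return 'case', 'control', or None (excluded) from one subject's ICD10 codes.

    Hypothetical helper: cases have O14-15 and Z33-37, controls have Z33-37 only,
    and any other code within O10-16 excludes the subject.
    """
    pregnant = any(in_range(c, "Z", 33, 37) for c in codes)
    preeclampsia = any(in_range(c, "O", 14, 15) for c in codes)
    other_pih = any(
        in_range(c, "O", 10, 16) and not in_range(c, "O", 14, 15) for c in codes
    )
    if not pregnant or other_pih:
        return None
    return "case" if preeclampsia else "control"
```

For example, a subject with codes `["Z34", "O14.1"]` would be labelled a case, while one with `["Z34", "O13"]` would be excluded because O13 is a PIH code other than preeclampsia/eclampsia.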


Feature extraction, representation, and selection

Features consisted of demographic variables and diagnoses on previous visits. We conducted a time-to-event analysis to extract diagnosis predictors. The event was delivery, the time of which was treated as a comparable outcome time between cases and controls. Demographic variables were age (years), marital status (married/single/divorced or widowed/undefined), family role (wife/child/primary member/additional member), membership strata (first/second/third), and membership type (government-paid labor/company-paid labor/self-paid labor/non-labor). Diagnoses were derived from encounters coded by A to N (Table S1 in Supplementary materials). To capture codes specific to several causes of disease and organ-related diseases, 15 features were also added. All of the diagnostic features were accounted for in each period of either 1 year before the event or during gestation. In addition, the time to the event (months) and distinct diagnoses compared to all available three-digit codes were included. We also included diagnoses within 2 years before the event along with an additional feature of censored time to the event (months) for those with event times in 2015. All continuous variables were normalized. In total, there were 95 candidate features included in the predictive model. To avoid irrelevant and redundant features, we applied several feature selection techniques in a multivariate logistic regression model (MvLRM). These could be forward, backward, or stepwise feature selection. We applied 0.05 as the significance level for retaining a candidate feature in the model. Each of these methods might or might not be preceded by feature representation as either polynomial terms or principal components. Forward, backward, and stepwise selection were not used at the same time. We compared the MvLRM performances using any of these combinations of feature selection and/or representation with those using neither feature selection nor representation.
Instead of using one feature selection technique, comparing multiple combinations allowed us to have a larger search space for model optimization. However, we kept this search space to a feasible size by applying only 2-degree polynomial terms. The goal of the feature selection was to reach the predefined number of candidate features that fit a sufficient EPV for machine learning development [19,28]. We defined this number after finishing the subject selection (see Section 2.2). We applied the number as the starting number of features in backward selection, which stopped at two remaining features. Conversely, we started the forward and stepwise selection from two features and stopped at the predefined number. We chose at least two features because we applied a conditional MvLRM to force the time to the event to be retained in the feature selection. This feature had to be in the model because we conducted a subgroup analysis using this feature in the best model (see Section 2.7). In addition, we limited the maximum number of principal components to the predefined number of candidate features. From this predefined number, we estimated the sample size having sufficient power for developing a predictive model, including the machine learning ones.

Model development

We compared six state-of-the-art machine learning algorithms using SAS Enterprise Miner 14.3 (SAS Institute, Cary, NC, US) to develop a prognostic prediction model. These included machine learning-optimized logistic regression (LR), decision tree (DT), artificial neural network (ANN), random forest (RF), support vector machine (SVM), and an ensemble (Ens.) algorithm that combined the other algorithms. We conducted parameter tuning of the algorithms by comparing 726 configurations (Table S2 in Supplementary materials). Each algorithm used the best feature set from the preceding feature selection. The best-tuned configuration of each model was used for the final comparison. We also applied a critical appraisal to the best model based on domain knowledge.

Model evaluation

We evaluated all models using both calibration and discrimination tests. Calibration was assessed by a linear regression of the predicted and true probabilities, while discrimination was assessed by the area under the receiver operating characteristics curve (AUROC). Finally, we also compared the positive predictive value, or information retrieval precision (Prec.), at a false positive rate (FPR or 1-specificity) of 10%.
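
The three metrics described here can be illustrated with a minimal pure-Python sketch. It is not the SAS procedure used in the study. The function names are hypothetical, the AUROC uses the pairwise ranking formulation, and grouping predictions into ten equal-width bins before fitting the calibration line is an assumption, since the text does not state how true probabilities were estimated.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random case outranks a random control."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_fpr(y_true, y_score, max_fpr=0.10):
    """Precision at the lowest cut-off whose FPR does not exceed max_fpr."""
    n_neg = sum(1 for y in y_true if y == 0)
    best = None
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        if fp / n_neg > max_fpr:
            break
        best = tp / (tp + fp)
    return best

def calibration_line(y_true, y_score, n_bins=10):
    """Least-squares slope and intercept of observed vs. mean predicted risk per bin.

    Assumption: predictions are grouped into equal-width bins; at least two
    non-empty bins are needed for the fit.
    """
    bins = {}
    for y, s in zip(y_true, y_score):
        b = min(int(s * n_bins), n_bins - 1)
        bins.setdefault(b, []).append((s, y))
    xs = [sum(s for s, _ in v) / len(v) for v in bins.values()]
    ys = [sum(y for _, y in v) / len(v) for v in bins.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx
```

A well-calibrated model would give a slope near 1 and an intercept near 0 under this sketch.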

Model validation

We split the tidy dataset into training and test sets for internal and external validation, respectively. Data partitioning for external validation was further split geographically and temporally. In the geographical split for external validation (GEV), one city in each province was randomly sampled. The city list was used to filter the dataset into the test set, while the rest was split for the training and another validation set. Geographical randomization across each province of Indonesia is important to avoid racial/ethnic disparities in the association between preeclampsia and its risk factors [29]. In the temporal split for external validation (TEV), 25% of the days in each month were randomly selected. All visits from subjects with delivery times on those selected days were split off for the validation set; thus, the subjects were completely external to those in the training set and internal validation set. Temporal randomization was intended to avoid a seasonality effect on preeclampsia [30]. Women who deliver during winter (non-tropical regions) or the rainy season (tropical regions) have a higher prevalence of preeclampsia/eclampsia. Both geographical and temporal randomization differ from simple randomization, which may leave subjects in the training set who lived in the same cities or delivered in the same time periods as those in the test sets. Using geographical and temporal randomization, the test sets would have subjects with features related to the city and time period that were unobserved in the training set. Meanwhile, since preeclampsia was associated with race/ethnicity and seasonality [29,30], selecting cities and time periods without randomization might bias the predictive performance. We also conducted 10-time bootstrapping to iterate the external validation. Therefore, we could estimate the predictive performance of our model on future datasets.
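
The two sampling steps behind the geographical and temporal splits might look like the following sketch. The function names, the city-to-province mapping, and the fixed seeds are illustrative assumptions, and 25% of the days is rounded down here (7 days per month), a detail the text does not specify.

```python
import calendar
import datetime
import random
from collections import defaultdict

def geographic_test_cities(city_to_province, seed=0):
    """Randomly pick one city per province; subjects from these cities form the GEV set."""
    rng = random.Random(seed)
    by_province = defaultdict(list)
    for city, province in sorted(city_to_province.items()):
        by_province[province].append(city)
    return {rng.choice(cities) for cities in by_province.values()}

def temporal_test_dates(years, fraction=0.25, seed=0):
    """Randomly pick a fraction of the days in every month of the given years.

    Subjects whose delivery date falls on a selected day form the TEV set.
    Assumption: the number of days per month is rounded down.
    """
    rng = random.Random(seed)
    selected = set()
    for year in years:
        for month in range(1, 13):
            n_days = calendar.monthrange(year, month)[1]
            k = int(n_days * fraction)
            for day in rng.sample(range(1, n_days + 1), k):
                selected.add(datetime.date(year, month, day))
    return selected
```

Because whole cities and whole delivery days are held out, no subject in either test set shares a city or delivery day with the development data, which is the property the paragraph above relies on.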
Before feature selection, a tidy dataset with balanced cases and controls was constructed for internal and external validations and analyzed for missing data. We conducted oversampling of cases by stratified random sampling. All candidate features and outcomes were used as stratification variables. Chi-square and t-tests were conducted on the dataset before and after oversampling to ensure that there was no significant effect of oversampling of the case group. Statistical tests were also conducted on the dataset before and after removing missing data. We conducted 10-fold cross-validation when developing the models. The training set and internal validation (IV) set were randomly assigned into 10 groups by stratified random sampling. The stratification variable for this randomization was the target outcome. The models were trained/fitted by aggregating nine groups and were internally validated by the remaining group each time. This process was repeated 10 times until all groups were used as the validation set. We applied 10-fold cross-validation starting from feature selection. However, to efficiently search for the best parameter tuning for each algorithm, we applied test split validation with a 9:1 ratio for both the training and validation sets. In the final comparison among all algorithms, we also applied 10-fold cross validations. External validations were applied to feature selection, parameter tuning, and final comparison, but these datasets had no role in parameter updating in each model.
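
The stratified 10-fold assignment described above can be sketched as follows. The helper names `stratified_folds` and `cv_splits` are hypothetical, the outcome label is the only stratification variable (as stated in the text), and the shuffling seed is an arbitrary choice.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each subject to one of k folds, keeping the outcome ratio similar per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            fold_of[i] = position % k  # deal subjects of one class round-robin
    return fold_of

def cv_splits(fold_of, k=10):
    """Yield (training indices, validation indices): nine folds train, one validates."""
    for f in range(k):
        train = [i for i, g in enumerate(fold_of) if g != f]
        valid = [i for i, g in enumerate(fold_of) if g == f]
        yield train, valid
```

Each subject appears in exactly one validation fold, so after ten rounds every group has served once as the internal validation set.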

Subgroup and text mining analyses

A subgroup analysis was conducted using the time-to-event. This determines the period before the event that has the highest discrimination ability according to the AUROC. To do so, we split all external validation sets after prediction. Datasets were split by the time-to-event into four groups which were 2 days to <6 months, 6∼<9 months, 9∼<12 months, and 12–24 months. The groupings were intended to imply known prediction periods and pathogenesis paradigms from previous studies within the range of available data. These were second- or third-trimester predictions [4], first-trimester prediction/pathogenesis [14], near-pregnancy pathogenesis (related to endometrial maturation) [31], and genetic paradigms of pathogenesis (related to vascular susceptibility) [32]. In addition, we also conducted a text mining analysis using SAS Enterprise Miner 14.3 (SAS Institute). It was based on natural language processing techniques on the internal validation set to interpret the best model. We extracted all ICD10 codes in any visits that had non-zero values in each diagnosis predictor and that had codes classified to the predictor. The visits were limited to those that had true predictions by the best model. Text profiling of ICD10 codes was conducted for each diagnosis predictor in the case group.
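
As a small illustration of the subgroup definition, the four time-to-event groups can be written as a binning function. The function name and the conversion of the 2-day lower bound into months (using an average month of 30.4375 days) are assumptions made only for this sketch.

```python
MIN_MONTHS = 2 / 30.4375  # 2 days expressed in months (assumed average month length)

def time_to_event_group(months):
    """Map a time-to-event in months to one of the four subgroups, or None if out of range."""
    if months < MIN_MONTHS or months > 24:
        return None
    if months < 6:
        return "2 days to <6 months"
    if months < 9:
        return "6 to <9 months"
    if months < 12:
        return "9 to <12 months"
    return "12-24 months"
```

The AUROC would then be computed separately on the predictions falling in each group.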

Comparison to previous studies

We also compared the best period for prediction of our model with those from previous studies. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were applied for this comparison. We searched both research and review articles (systematic review/meta-analysis) in PUBMED, EMBASE, and SCOPUS within the last 5 years (since 2015) that developed and/or validated clinical prediction models. The models had to match our eligibility criteria as defined by PICOTS [28]: (1) population: women or pregnant women without restriction to a specific subpopulation; (2) index: multivariable, prognostic clinical prediction model using demographic and/or clinical predictors in a poor-resource setting; (3) comparator: the best model in this study; (4) outcome: preeclampsia without differentiating early- or late-onset and with or without fetal growth restriction; (5) timing: before or during pregnancy until 2 days before onset or delivery; and (6) setting: survey, primary care, or hospital. The studies had to report the point and interval estimates of predictive performance, sample sizes in either case or control, and model validation methods. These were parts of the quality assessment we followed from the Prediction Model Risk of Bias Assessment Tool (PROBAST) [25] and the Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [26]. All authors independently assessed the criteria in the order described. If a criterion was not matched, then we did not assess the next criterion. Any disagreement among authors was resolved through discussion. The data of eligible articles were extracted by HS, and the extracted data were reviewed by YWW and ECYS.

Decision curve analysis

To determine cut-off values, we applied a decision curve analysis that showed the FPR, negative predictive value (NPV), positive predictive value (PPV), proportion of predicted positives, and true positive rate (TPR) or sensitivity. We identified cut-off values at either sensitivity ∼0.95 or specificity ∼0.90 to compare the predictive performances with those of prediction models from previous studies. The cut-off values were chosen based on the internal validation set and then applied to the other sets. The final model used only the cut-off value at sensitivity ∼0.95 to achieve a sufficient preliminary prediction model that will be combined with other highly-specific models. The complement of the proportion of predicted positives might imply a potential reduction in further prediction by the models with advanced predictors.
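
The cut-off selection can be illustrated with a short sketch: pick the cut-off on the internal validation set, then report the metrics at that fixed cut-off on any other set. The function names and the tie-handling rule (predicting positive when the score is at or above the cut-off) are assumptions, not details given in the text.

```python
import math

def threshold_for_sensitivity(y_true, y_score, target=0.95):
    """Highest cut-off at which sensitivity on this set is at least `target`."""
    pos = sorted((s for y, s in zip(y_true, y_score) if y == 1), reverse=True)
    need = math.ceil(target * len(pos))  # cases that must score at or above the cut-off
    return pos[need - 1]

def _ratio(a, b):
    """a / b, or None when the denominator is zero."""
    return a / b if b else None

def metrics_at(y_true, y_score, cutoff):
    """Confusion-matrix metrics when predicting positive for scores >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        if s >= cutoff:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            fn, tn = fn + (y == 1), tn + (y == 0)
    predicted_positive = (tp + fp) / len(y_true)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "predicted_positive": predicted_positive,
        # share of subjects who would not need a further, more costly prediction
        "potential_reduction": 1 - predicted_positive,
    }
```

Under this sketch, a proportion of predicted positives of about 0.80 corresponds to a potential reduction of about 0.20.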

Statistical analysis

We used SAS Enterprise Guide 7.1 (SAS Institute) to conduct all statistical analyses. Evaluation metrics were expressed as point and interval estimates with the 95% confidence interval (CI). The results from 10-fold cross-validation and 10-time bootstrapped external validation were used to calculate the interval estimate. We used the interval estimates to compare evaluation metrics of the models. The best model was determined by the AUROC and the PPV, or information retrieval precision, from both external validations. In addition, the significance of the selected candidate features was expressed as an adjusted p-value. To describe the continuous features, we used the mean and standard deviation as the center and dispersion metrics, respectively, while frequencies and proportions were used for the categorical features.
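
One common way to turn ten repeated estimates into a point and interval estimate is a t-based interval around their mean. The text does not say which interval formula was used, so the sketch below is purely an assumption for illustration; the function name is hypothetical and the critical value applies only when there are exactly ten estimates.

```python
import math
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t with 9 degrees of freedom

def mean_ci95(values, t_crit=T_CRIT_DF9):
    """Mean and t-based 95% CI of repeated estimates (assumes len(values) == 10)."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - t_crit * se, mean + t_crit * se
```

Two models could then be compared by checking whether their intervals for the same metric overlap.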

Results

Characteristics of the dataset

Datasets were constructed for internal and external validations (n = 23,201; Fig. 1). From these datasets, the proportions of visits by women who would deliver in primary care and in hospital (ratio of IV, GEV, and TEV) were 43.82% (9035:717:414) and 56.17% (11,940:605:490), respectively. Missing data were minor in both cases (0.33%, n = 11/3329) and controls (1.44%, n = 290/20,173). These numbers count only the cases or controls that had a missing value in any of the features. The outcome, either case or control, was complete in this study. Differences in predictor and outcome candidates were described before and after removing missing data or balancing data (Tables S3 and S4 in Supplementary materials).

Fig. 1

Dataset constructed for model development. The original dataset was constructed with a nested case-control design. Controls were sampled within the same age range of case groups (12–55 years old). NHID-BPJSKes, nationwide health insurance dataset of BPJS Kesehatan; PIH, pregnancy-induced hypertension; IV, internal validation; GEV, geographical split for external validation; TEV, temporal split for external validation.

Censored diagnoses on previous visits in the training set were also minor in both cases (n = 878, 28.75%) and controls (n = 5856, 32.67%). This is important because censored diagnoses can be viewed as negatives, while they may actually be positives that were not recorded because of data availability. In real-world data, this situation can occur with a new member of the health insurance program. We added the censored time-to-event to tell the algorithms how many months a subject had censored diagnoses on previous visits. Nonetheless, this candidate feature was not chosen in the selection process.

Selected feature candidates

To achieve sufficient EPV for model development in all machine learning algorithms, we selected up to 17 features as the predefined number of candidates. Several feature candidates were selected from the MvLRM using forward selection of both original features and principal components (Table 2). We also forced the principal component analysis to obtain 17 components. In addition to the original set of 95 feature candidates, the principal components made a feature set with 112 feature candidates. With this number of candidates, the EPV for the training set was 27.3. This number was sufficient for preliminary feature selection by the MvLRM in order to avoid optimism according to the standard, which was 20∼50 EPV for logistic regressions.
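The events-per-variable (EPV) figure can be reproduced with simple arithmetic. The sketch below assumes that the events are the 3054 cases listed in Table 2; the function name `events_per_variable` is hypothetical.

```python
def events_per_variable(n_events, n_candidates):
    """Events per variable: number of outcome events divided by candidate predictors."""
    return n_events / n_candidates


n_cases = 3054            # cases in Table 2 (assumed to be the events counted)
n_original = 95           # original feature candidates
n_components = 17         # principal components forced from the PCA
n_selected = 17           # predefined number of selected features

# Preliminary selection by the MvLRM over all 112 candidates.
epv_selection = events_per_variable(n_cases, n_original + n_components)
# Final models trained on the 17 selected features.
epv_final = events_per_variable(n_cases, n_selected)

print(round(epv_selection, 1))  # 27.3
print(round(epv_final))         # 180
```

The first value matches the EPV of 27.3 stated above, and the second is consistent with the 180 EPVs reported for the random forest in the Discussion.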

Table 2

Feature candidates selected by the multivariate logistic regression model with forward selection from original candidates and principal components.

#	Feature	Cases (n = 3054)	Controls (n = 17,921)	p value
1	Time-to-event (months) ± SD *	4.56 ± 5.19	4.16 ± 4.43	0.08
	Demographic variables
2	Age (years) ± SD	32 ± 12	30 ± 12	<0.0001
3	Family role, n (%)
	. Wife†	1895 (62.05)	10,953 (61.12)	–
	. Primary member	849 (27.80)	4381 (24.45)	0.06
	. Child	214 (7.01)	2161 (12.05)	0.01
	. Additional member	96 (3.14)	426 (2.38)	<0.0001
4	Member stratum, n (%)
	. First	459 (15.03)	2494 (13.92)	<0.0001
	. Second	1306 (42.76)	8114 (45.28)	0.45
	. Third †	1289 (42.21)	7313 (40.80)	–
5	Member type, n (%)
	. Company-paid labor	1517 (49.67)	8720 (48.66)	<0.0001
	. Government-paid labor	769 (25.18)	4997 (27.88)	<0.0001
	. Self-paid labor †	747 (24.46)	4173 (23.29)	–
	. Non-labor	21 (0.69)	31 (0.17)	<0.0001
	Diagnoses within the last 2 years to the event (partially censored)
6	A codes - Certain infectious and parasitic diseases, visits ± SD; n (%) ‡	2.72 ± 1.79; 248 (8.12)	1.54 ± 1.06; 1412 (7.88)	0.02
7	E codes - Endocrine, nutritional and metabolic diseases, visits ± SD; n (%) ‡	5.00 ± 5.41; 187 (6.12)	2.65 ± 2.33; 310 (1.73)	<0.0001
8	I codes - Diseases of the circulatory system, visits ± SD; n (%) ‡	4.05 ± 3.75; 570 (18.66)	2.63 ± 2.33; 609 (3.40)	<0.0001
9	Immune-related codes, visits ± SD; n (%) ‡	2.97 ± 2.15; 308 (10.09)	1.77 ± 1.39; 1142 (6.37)	<0.0001
10	Eye-related codes, visits ± SD; n (%) ‡	2.81 ± 1.62; 57 (1.87)	1.78 ± 1.10; 444 (2.48)	<0.0001
	Diagnoses within the last year to the event
11	N codes - Diseases of the genitourinary system, visits ± SD; n (%) ‡	3.94 ± 3.37; 172 (5.63)	1.95 ± 1.99; 856 (4.78)	<0.0001
12	Eye-related codes, visits ± SD; n (%) ‡	2.28 ± 1.37; 248 (8.12)	1.71 ± 0.99; 1412 (7.88)	<0.0001
	Diagnoses within the pregnancy period to the event
13	Breast-related codes, visits ± SD; n (%) ‡	5.85 ± 2.58; 13 (0.43)	1.00 ± 0.00; 6 (0.03)	<0.0001
14	Digestive system-related codes, visits ± SD; n (%) ‡	2.48 ± 2.35; 186 (6.09)	1.85 ± 1.60; 768 (4.29)	<0.0001
15	Skin and subcutaneous-related codes, visits ± SD; n (%) ‡	1.81 ± 0.71; 36 (1.18)	1.52 ± 1.14; 287 (1.60)	<0.0001
	Principal components
16	Principal components 8 (see Table 4 for the profile)	2.72 ± 1.79; 248 (8.12)	1.54 ± 1.06; 1412 (7.88)	<0.0001
17	Principal components 10 (see Table 4 for the profile)	−0.09 ± 0.03	0.09 ± 0.01	<0.0001

* Forced into the multivariate logistic regression model.

† Comparator.

‡ Non-zero visits.


Model comparison

Six machine learning models were compared (Table 3). The best model used the random forest algorithm consisting of 500 trees. This model was consistently superior in terms of both internal and external validations by geographical and temporal splits, compared to the other machine learning models, including the ensemble model, which did not outperform the random forest model. Three algorithms were combined in the ensemble model: the logistic regression, decision tree, and artificial neural network. This configuration had the best predictive performance among other configurations of ensemble models.

Table 3

Calibration and discrimination tests of six machine learning models by both internal and external validations.

Validation	Algorithm	Calibration		Discrimination tests
Validation	Algorithm	Slope (95% CI)	Intercept (95% CI)	AUROC (95% CI)	Prec. (95% CI) *
Internal	LR	1.08 (1.08, 1.09)	−0.04 (−0.04, −0.03)	0.70 (0.69, 0.70)	0.78 (0.78, 0.78)
	DT	0.99 (0.99, 1.00)	0.01 (0.01, 0.01)	0.66 (0.66, 0.67)	0.73 (0.72, 0.74)
	ANN	0.64 (0.63, 0.64)	0.14 (0.14, 0.15)	0.65 (0.64, 0.67)	0.74 (0.73, 0.75)
	RF	1.54 (1.54, 1.54)	−0.27 (−0.27, −0.26)	0.86 (0.85, 0.86)	0.86 (0.85, 0.86)
	SVM	2.68 (2.66, 2.70)	−0.89 (−0.90, −0.88)	0.68 (0.67, 0.68)	0.78 (0.76, 0.79)
	Ens.	1.21 (1.21, 1.22)	−0.13 (−0.13, −0.12)	0.70 (0.70, 0.71)	0.78 (0.77, 0.78)
External, geographical split	LR	1.80 (1.76, 1.83)	−0.34 (−0.35, −0.32)	0.74 (0.73, 0.76)	0.68 (0.67, 0.70)
	DT	0.69 (0.67, 0.71)	0.15 (0.14, 0.16)	0.60 (0.59, 0.61)	0.80 (0.79, 0.81)
	ANN	0.75 (0.73, 0.77)	0.08 (0.07, 0.09)	0.67 (0.64, 0.70)	0.55 (0.52, 0.58)
	RF	1.47 (1.45, 1.50)	−0.19 (−0.21, −0.18)	0.76 (0.76, 0.77)	0.82 (0.81, 0.83)
	SVM	3.12 (3.02, 3.21)	−1.07 (−1.12, −1.02)	0.62 (0.61, 0.62)	0.54 (0.52, 0.57)
	Ens.	1.52 (1.49, 1.55)	−0.28 (−0.30, −0.26)	0.72 (0.71, 0.73)	0.70 (0.68, 0.72)
External, temporal split	LR	0.74 (0.72, 0.76)	0.16 (0.15, 0.17)	0.62 (0.62, 0.63)	0.77 (0.76, 0.77)
	DT	0.92 (0.90, 0.93)	0.08 (0.08, 0.09)	0.63 (0.62, 0.63)	0.69 (0.68, 0.70)
	ANN	0.30 (0.29, 0.31)	0.34 (0.33, 0.35)	0.58 (0.58, 0.59)	0.71 (0.70, 0.72)
	RF	1.09 (1.08, 1.11)	0.02 (0.02, 0.03)	0.70 (0.70, 0.70)	0.78 (0.78, 0.79)
	SVM	2.25 (2.20, 2.30)	−0.65 (−0.67, −0.62)	0.63 (0.63, 0.63)	0.72 (0.71, 0.73)
	Ens.	0.74 (0.72, 0.76)	0.15 (0.14, 0.16)	0.61 (0.61, 0.62)	0.74 (0.73, 0.74)

AUROC, area under the receiver operating characteristic curve; LR, machine learning-optimized logistic regression; DT, decision tree; ANN, artificial neural network; RF, random forest; SVM, support vector machine; Ens., ensemble algorithm.

* For a specificity of ∼90%.

The calibration slope of the random forest model was significantly different from 1, as demonstrated by the estimates (Table 3). This was also the case for the models with other algorithms. The receiver operating characteristics (ROC) curves and the areas under the curve (AUROCs) are also shown for the random forest model (Fig. 2). The ROC curve and AUROC for the training set were similar to those for the internal validation set, but not to those for the external validation sets. For a specificity of ∼90%, the detection rates were 0.58 (95% CI 0.57–0.59), 0.44 (95% CI 0.43–0.46), and 0.37 (95% CI 0.36–0.38) for IV, GEV, and TEV, respectively. For the same specificity, the precisions were 0.86 (95% CI 0.85–0.86), 0.82 (95% CI 0.81–0.83), and 0.78 (95% CI 0.78–0.79) for IV, GEV, and TEV, respectively.
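Precision depends on the case-to-control ratio of the evaluated set as well as on sensitivity and specificity. The sketch below shows this relationship; the function name `precision_from_rates` is hypothetical, and the 1:1 ratio is an assumption for illustration, since the text mentions balancing data but does not state the ratio in each validation set.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision (PPV) implied by sensitivity, specificity, and case prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Reported detection rates at a specificity of about 0.90.
detection_rates = {"IV": 0.58, "GEV": 0.44, "TEV": 0.37}

# Assumption: balanced sets (prevalence 0.5).
balanced = {name: precision_from_rates(rate, 0.90, 0.5)
            for name, rate in detection_rates.items()}

# For contrast: the unbalanced case-to-control ratio of Table 2.
raw_prevalence = 3054 / (3054 + 17921)
unbalanced_iv = precision_from_rates(0.58, 0.90, raw_prevalence)
```

Under the balanced assumption, the implied precisions are roughly 0.85, 0.81, and 0.79, close to the reported 0.86, 0.82, and 0.78. With the unbalanced ratio, the same sensitivity and specificity would give a precision of about 0.50, so precision values should only be compared between models evaluated at similar prevalence.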

Fig. 2

Receiver operating characteristics (ROC) curves for the random forest model. Four panels show the ROC curves with AUROCs and 95% CIs using these datasets: (a) training set; (b) internal validation set; (c) external validation set by geographical split; and (d) external validation set by temporal split. The dashed line is a reference line. AUROC, area under the receiver operating characteristics curve.

Subgroup analysis by the time-to-event of the best model

Instances in each external validation set were subgrouped by period of the time-to-event. The AUROCs were re-computed in each subgroup (Fig. 3). The period of 9–<12 months to the event showed the highest AUROC for both GEV (0.88, 95% CI 0.88–0.89) and TEV (0.86, 95% CI 0.85–0.86). The wide discrepancy in AUROCs between internal and external validation was contributed by instances in the periods of 12–24 months, 6–<9 months, and 2 days–<6 months, but not 9–<12 months. This indicated that the features from this period might be more important than those from other periods.
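The subgroup analysis can be sketched as follows: each instance is assigned to a period by its time-to-event, and the AUROC is recomputed within each period. The function names, the record layout `(months_to_event, label, score)`, and the toy data are assumptions for illustration; the AUROC is computed with the pairwise (Mann-Whitney) definition rather than with any particular statistical package.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random case outscores a random control.

    Ties count as 0.5. Returns None if either class is absent.
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        return None
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def period_of(months_to_event):
    """Map a time-to-event in months to the periods used in the subgroup analysis."""
    if months_to_event < 6:
        return "2 days-<6 months"
    if months_to_event < 9:
        return "6-<9 months"
    if months_to_event < 12:
        return "9-<12 months"
    return "12-24 months"


def auroc_by_period(records):
    """records: iterable of (months_to_event, label, score) tuples."""
    groups = {}
    for months, label, score in records:
        groups.setdefault(period_of(months), []).append((label, score))
    return {period: auroc([y for y, _ in rows], [s for _, s in rows])
            for period, rows in groups.items()}


# Toy records (made up for illustration only).
toy = [(10, 1, 0.9), (10, 0, 0.2), (11, 1, 0.7), (9.5, 0, 0.4),
       (3, 1, 0.3), (3, 0, 0.6), (4, 1, 0.8), (5, 0, 0.1)]
by_period = auroc_by_period(toy)
```

The pairwise computation is quadratic in the number of instances, which is acceptable for a sketch but would normally be replaced by a rank-based implementation on datasets of the size used here.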

Fig. 3

Area under receiver operating characteristics curve (AUROC) of subgroups by the time-to-event from the random forest model. Four panels show the AUROCs using these datasets: (a) training set; (b) internal validation set; (c) external validation set by geographical split; and (d) external validation set by temporal split. The error bar and 95% confidence interval are shown. To improve readability, the y-axis scale was begun from 0.45; all of the data are completely shown. The dashed line shows the minimum AUROC among those using training and IV sets. AUROC, area under the receiver operating characteristics curve.

Text mining analysis of the best model

A text mining analysis was conducted for all instances that were true-predicted by the random forest model in the internal validation set. Text profiles are shown for all diagnosis predictors (Table 4). Several codes in the text profiles were classified to one or more diagnosis predictors in the case group. Therefore, we could identify specific diagnoses in the true-predicted instances to interpret the best model in this study.

Table 4

Text profile for ICD10 codes of diagnosis predictors in the true-predicted case group by the random forest model.

Time-to-event	Diagnosis predictor	ICD10 codes and description
Diagnoses within the last 2 years to the event (partially censored)	A codes - Certain infectious and parasitic diseases	A010 (Typhoid fever)
		A09 (Infectious gastroenteritis and colitis, unspecified)
		A182 (Tuberculous peripheral lymphadenopathy)
		A231 (Brucellosis due to Brucella abortus)
		A78 (Q fever)
		A91 (Dengue haemorrhagic fever)
	E codes - Endocrine, nutritional, and metabolic diseases	E059 (Thyrotoxicosis, unspecified)
		E118 (Type 2 diabetes mellitus with unspecified complications)
		E119 (Type 2 diabetes mellitus without complications)
		E780 (Pure hypercholesterolemia)
		E785 (Hyperlipidaemia, unspecified)
		E86 (Volume depletion)
	I codes - Diseases of the circulatory system	I10 (Essential [primary] hypertension)
		I159 (Secondary hypertension, unspecified)
		I500 (Congestive heart failure)
	Immune-related codes	J304 (Allergic rhinitis, unspecified)
		J329 (Chronic sinusitis, unspecified)
		J459 (Asthma, unspecified)
		L208 (Other atopic dermatitis)
		L209 (Atopic dermatitis, unspecified)
		M154 (Erosive [osteo]arthrosis)
	Eye-related codes	H000 (Hordeolum and other deep inflammation of eyelid)
		H055 (Retained [old] foreign body following penetrating wound of orbit)
		H109 (Conjunctivitis, unspecified)
		H521 (Myopia)
		H527 (Disorder of refraction, unspecified)
Diagnoses within the last year to the event	Diseases of the genitourinary system	N300 (Acute cystitis)
		N309 (Cystitis, unspecified)
		N601 (Diffuse cystic mastopathy)
		N608 (Other benign mammary dysplasias)
		N609 (Benign mammary dysplasia, unspecified)
		N61 (Inflammatory disorders of breast)
	Eye-related codes	H000 (Hordeolum and other deep inflammation of eyelid)
		H055 (Retained [old] foreign body following penetrating wound of orbit)
		H109 (Conjunctivitis, unspecified)
		H521 (Myopia)
		H527 (Disorder of refraction, unspecified)
Diagnoses within the pregnancy period to the event	Breast-related codes	N61 (Inflammatory disorders of breast)
	Digestive system-related codes	A09 (Infectious gastroenteritis and colitis, unspecified)
		K029 (Dental caries, unspecified)
		K040 (Pulpitis)
		K045 (Chronic apical periodontitis)
		K047 (Periapical abscess without sinus)
		K053 (Chronic periodontitis)
		K30 (Excessive attrition of teeth)
	Skin and subcutaneous-related codes	L209 (Atopic dermatitis, unspecified)
Principal components	Principal components 8	H000 (Hordeolum and other deep inflammation of eyelid)
		H109 (Conjunctivitis, unspecified)
		H521 (Myopia)
		H527 (Disorder of refraction, unspecified)
		H608 (Other otitis externa)
		H609 (Otitis externa, unspecified)
		H811 (Benign paroxysmal vertigo)
		H814 (Vertigo of central origin)
	Principal components 10	D509 (Iron deficiency anemia, unspecified)
		D648 (Other specified anaemias)
		D649 (Anemia, unspecified)

We found 879 records from PUBMED, EMBASE, and SCOPUS for ‘preeclampsia prediction model’ within the last 5 years, and seven studies were eligible for comparison to our random forest model in the subgroup of 9–12 months to the event (Supplementary materials). Compared to most previous models, our model in this subgroup had the best AUROC, competing with that from MacDonald-Wallis (2015), including the predictive performance using GEV and TEV (Table 5) [33]. The precision and sensitivity of our model were also the highest among those with a specificity of ∼90% [15,34,35]. For a sensitivity of ∼95%, our model had higher precision and competing specificity compared to that of MacDonald-Wallis (2015) [33]. For comparison purposes, we applied 0.34 and 0.54 as cut off values for the model at a sensitivity of ∼0.95 and a specificity of ∼0.90, respectively. The cut off values were determined based on internal validation (Fig. S2–S7 in Supplementary materials). However, we recommend a cut off value of 0.34 to obtain highly sensitive performance when using our model as the preliminary prediction model to decide which patients will be assessed by other models with high specificity. We also recommend applying the prediction model from MacDonald-Wallis (2015) [33], which had an NPV of 1.00 (95% CI 0.99–1.00), to confirm predicted negatives by our model. Using a cut off value of 0.34, the proportions of predicted positives were 77% (95% CI 75–78%) in GEV and 78% (95% CI 77–78%) in TEV. This implies a potential reduction of ∼20% in the cost needed for prediction models with advanced predictors.
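The ∼20% figure follows from the complement of the proportion of predicted positives: only women predicted positive by the preliminary model would go on to the models with advanced predictors. A minimal arithmetic sketch, with the hypothetical function name `avoided_fraction` and the reported proportions as inputs:

```python
def avoided_fraction(proportion_predicted_positive):
    """Fraction of women who would not need the follow-up models."""
    return 1.0 - proportion_predicted_positive


# Reported proportions of predicted positives at the cut off value of 0.34:
# (point estimate, lower CI bound, upper CI bound).
reported = {"GEV": (0.77, 0.75, 0.78), "TEV": (0.78, 0.77, 0.78)}

avoided = {}
for name, (point, low, high) in reported.items():
    # A higher proportion of predicted positives means a smaller reduction,
    # so the CI bounds swap places after taking the complement.
    avoided[name] = tuple(round(avoided_fraction(p), 2) for p in (point, high, low))

print(avoided)  # {'GEV': (0.23, 0.22, 0.25), 'TEV': (0.22, 0.22, 0.23)}
```

This treats the cost of the follow-up models as proportional to the number of women assessed, which is a simplifying assumption.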

Table 5

Algorithm	Validation	AUROC (95% CI)	Prec. (95% CI)	Sens. (95% CI)	Spec. (95% CI)
Internal validation
At sensitivity ~0.95
RF 9–<12 mo.; cut off value of 0.34	10-fold CV	0.90 (0.88, 0.91)	0.71 (0.68, 0.73)	0.98 (0.97, 0.99)	0.52 (0.49, 0.55)
MacDonald-Wallis et al. (2015)[33]	Bootstrapping	0.88 (0.86, 0.90)	0.04 (0.03, 0.04)	0.95 *	0.37 (0.31, 0.42)
At specificity ~0.90
RF 9–<12 mo.; cut off value of 0.54	10-fold CV	0.90 (0.88, 0.91)	0.88 (0.87, 0.90)	0.70 (0.67, 0.73)	0.89 (0.87, 0.91)
Guy et al. (2017)[34]	No IV	0.80 (0.75, 0.85)	0.09 (0.07, 0.12)	0.41 (0.29, 0.54)	0.90 *
Viguiliouk et al. (2017)[36]	No IV	0.76 (0.72, 0.81)	NA	NA	NA
Wright et al. (2015)[15]	5-fold CV	0.76 †	0.08 †	0.40 (0.39, 0.42)	0.89 †
Rocha et al. (2017)[35]	No IV	0.75 (0.72, 0.79)	0.18 †	0.44 †	0.90 *,†
External validation
At sensitivity ~0.95
RF 9–<12 mo.; cut off value of 0.34	Bootstrapped GEV	0.88 (0.88, 0.89)	0.59 (0.58, 0.60)	1.00 (1.00, 1.00)	0.47 (0.45, 0.49)
MacDonald-Wallis et al. (2015)[33]	Bootstrapping	0.88 (0.84, 0.93)	0.05 (0.04, 0.06)	0.95 *	0.47 (0.40, 0.55)
RF 9–<12 mo.; cut off value of 0.34	Bootstrapped TEV	0.86 (0.85, 0.86)	0.72 (0.72, 0.72)	0.90 (0.90, 0.90)	0.44 (0.43, 0.45)
ACOG (2017)[37]	Bootstrapping	0.57 (0.54, 0.61)	0.17 †	0.87 †	0.27 †
At specificity ~0.90
RF 9–<12 mo.; cut off value of 0.54	Bootstrapped GEV	0.88 (0.88, 0.89)	0.82 (0.80, 0.85)	0.52 (0.52, 0.52)	0.91 (0.90, 0.93)
RF 9–<12 mo.; cut off value of 0.54	Bootstrapped TEV	0.86 (0.85, 0.86)	0.89 (0.89, 0.89)	0.70 (0.70, 0.70)	0.86 (0.86, 0.86)
NICE (2015)[15]	Bootstrapping	0.76 †	0.07 †	0.39 (0.33, 0.37)	0.89 †
NICE (2017)[37]	Bootstrapping	0.61 (0.58, 0.65)	0.09 †	0.38 †	0.85 †

AUROC, area under the receiver operating characteristic curve; Prec., precision; Sens., sensitivity; Spec., specificity; RF, random forest; NA, not available; NICE, National Institute for Health and Care Excellence; ACOG, American College of Obstetrics and Gynaecology.

* Fixed specificity.

† Interval estimate was not reported.

Predictive performances of the random forest model in the subgroup of 9–12 months to the event with cut off value at either similar sensitivity or specificity based on internal validation compared to those from previous studies.

Discussion

Our model included predictors mostly from the medical history, which was the most frequent type of predictor used in preeclampsia prediction models [38]. Some of our predictors were also used in clinical prediction models from previous studies. Age, chronic hypertension (I10, I159), and diabetes mellitus (E118, E119) were used in the NICE and ACOG guidelines with or without modification [37]. In addition, a previous meta-analysis also showed associations between some of these predictors and preeclampsia [39]. These were a maternal age of >40 years (OR 1.50, 95% CI 1.20–2.00; I² = 95%; df=15) and chronic hypertension (ICD10 I10, I159; OR 5.10, 95% CI 4.00–6.50; I² = 98%; df=20). The random forest and text profiling algorithms included these diseases that were available in our training set. Nevertheless, there was no systemic lupus erythematosus (SLE), antiphospholipid syndrome or other thrombophilia, or chronic kidney disease in our training set, while previous models included those diseases along with age, chronic hypertension, and diabetes mellitus [15,33-36]. Although there was no SLE, erosive arthritis was available in our training set and was found in the text profiling results for immune-related codes. Erosive arthritis can be determined by the anti-citrullinated peptide and anti-carbamylated protein antibodies that were predictive among four other predictors in a predictive model for SLE (AUROC 0.81, 95% CI 0.80–0.81). In the same diagnosis predictor that included chronic hypertension, congestive heart failure (I500) was also shown in the text profile of I codes (diseases of the circulatory system). This disease shares a prognostic factor with preeclampsia, namely serum homocysteine. This factor was predictive for early-onset preeclampsia (AUROC 0.87; OR 1.54, 95% CI 1.30–1.84) [40]. Meanwhile, the serum homocysteine level was elevated in patients with congestive heart failure compared to controls (p<0.01) [41].
Preeclampsia was also associated with future congestive heart failure (RR 3.62, 95% CI 2.25–5.85; I² = 83%; df=6) [42]; however, this may be related to common genetic backgrounds between the diseases [32]. Beyond diabetes mellitus, other endocrine, nutritional, and metabolic diseases were also included as diagnosis predictors in our model. Thyrotoxicosis or hyperthyroidism (E059) was associated with preeclampsia in a nationwide population-based study (OR 1.21, 95% CI 1.14–1.29) [43]. A meta-analysis also showed significant mean differences in total cholesterol (ICD10 E780; 20.20 mg/dL, 95% CI 8.70–31.70; I² = 99%; df=45) and triglycerides (ICD10 E785; 80.29 mg/dL, 95% CI 51.45–109.13; I² = 99%; df=43) in women with preeclampsia compared to controls, especially in the third trimester.

Interestingly, there was an atopic pattern in the text profile of immune-related and skin-related codes, which consisted of allergic rhinitis (J304), asthma (J459), and atopic dermatitis (L208, L209). Women with atopic dermatitis had higher risks of severe preeclampsia (OR 1.27, 95% CI 1.01–1.58) and eclampsia (OR 2.08, 95% CI 1.09–3.98) [44]. Women with preeclampsia had higher incidences of having a child with allergic rhinitis (incidence rate ratio [IRR] 1.29, 95% CI 1.11–1.50), asthma (IRR 1.17, 95% CI 1.11–1.25), and atopic dermatitis (IRR 1.15, 95% CI 1.01–1.32) at ≥14 days old. Women with asthma had a higher risk of having a child with asthma whether the mother developed preeclampsia (adjusted hazard ratio [aHR] 4.73, 95% CI 2.20–10.70) or not (aHR 2.18, 95% CI 1.46–3.26), compared to mothers with neither asthma nor preeclampsia [45]. This genetic tendency could appear in both mother and offspring, so it is not easy to identify cause and effect between preeclampsia and maternal atopy. In addition, asthma might be correlated with Brucella abortus infection (A231), since the numbers of B. abortus in the lungs were higher in asthma-induced murine models compared to the controls (p<0.001) [46].

There were other specific infections included in the A codes (certain infectious and parasitic diseases). In the immune response specifically mediated by Salmonella typhi (A010), individuals with a multiphasic response harboured more-diverse gut microbial communities than those with a late response [47]. This may be related to preeclampsia because there was also a significant shift in the gut microbial communities in women with this disease [48]. Tuberculous peripheral lymphadenopathy (A182) symptoms were frequent in human immunodeficiency virus (HIV) infection, which shares dysregulation of the complement system with preeclampsia [49,50]. Infections by Coxiella burnetii or Q fever (A78) were associated with adverse maternal outcomes related to preeclampsia, such as intrauterine growth retardation and preterm delivery [51]. Dengue haemorrhagic fever (A91) may make pregnant women more susceptible to endothelial dysfunction and volume depletion (E86) in preeclampsia [52,53]. In addition, both intranasal bacteria and dysbiosis of microbiomes play putative roles in chronic sinusitis (J329) in immune-related codes [54]. Meanwhile, the means of hematogenous spread, including from the respiratory tract, were shown to be involved in great obstetrical syndromes like preeclampsia [55]. However, associations of these diseases with preeclampsia are still poorly understood.

Conversely, preeclampsia and infectious diseases of the urinary tract as well as periodontal diseases are well studied. Maternal infections associated with preeclampsia were included as a diagnosis predictor by the N codes (diseases of the genitourinary system within the last year to the event). Urinary tract infections, including cystitis (N300, N309), were associated with preeclampsia (OR 1.57, 95% CI 1.45–1.70; I² = 79%; df=16) [56].
Preeclampsia was also associated with periodontal diseases (ICD10 K040, K045, K047, K053; OR 1.76, 95% CI 1.43–2.18; I² = 80%; df=5) [56]. Several microbes were identified on placental tissue samples from women with preeclampsia (n = 7; who underwent an elective caesarean delivery) by a polymerase chain reaction (PCR), 16S ribosomal (r)RNA gene, and next-generation sequencing, while all samples from the controls were negative (n = 48; p = 0.006). At the genus level, microbiomes associated with periodontal disease included Variovorax, Prevotella, Porphyromonas, and Dialister [57]. In a previous study, Bacillus cereus was also found in >90% of microbial communities from all women with late-onset preeclampsia (n = 4), but not in those from all women with early-onset preeclampsia (n = 3). 16S rRNA genes were negative in all venous blood, urine, and amniotic fluid samples, except in one woman whose amniotic fluid had B. cereus. This bacterium is an opportunistic pathogen among gastrointestinal infections (A09), and is widely recognized as a challenging problem in the food industry [58]. Outbreaks of B. cereus gastroenteritis were reported [59-61]. Interestingly, the numbers of patients with B. cereus bloodstream infections were higher in summer, and the source was urinary catheters [62]. Meanwhile, women with a month of conception during the summer (OR 1.22, 95% credible interval [CrI] 0.89–1.65) also had higher incidences of preeclampsia, but those who delivered during winter (OR 3.33, 95% CrI 0.31–35.48) had higher incidences of eclampsia in both the northern and southern hemispheres [63].

Inflammatory disorders of the breast (N61) were found in the text profiling results of both the N codes (diseases of the genitourinary system within the last year to the event) and the breast-related codes (within the pregnancy period to the event).
However, most studies investigated breast neoplasms (N601, N608, and N609) as effects of preeclampsia instead of focusing on the inflammation alone. The incidences were lower in women with preeclampsia compared to those without preeclampsia if these were adjusted by the sex of the fetus (RR 0.85, 95% CI 0.77–0.95; I² = 49%; df=5) [64]. Inflammatory disorders of the breast might be related to neoplasms in that study, since toll-like receptor (TLR)-mediated regulation of inflammation is strongly linked to breast cancer [65]. In the context of this study, inflammatory disorders of the breast (N61) before pregnancy might be more related to preeclampsia than those during pregnancy. This was implied by the finding that the period of 9–<12 months to the event showed the best predictive performance.

Preeclampsia is also associated with other digestive system-related codes, such as dental caries (K029) and excessive attrition of the teeth (K30). Pregnant women with dental caries had a higher prevalence of preeclampsia compared to normotensive controls (adjusted odds ratio [aOR] 1.76, 95% CI 1.43–2.18) [66]. However, excessive attrition of the teeth might not be directly associated with preeclampsia, but with age, because attrition of the teeth was greater in the age group of 51–60 years compared to that in either the age group of 20–30 years or other younger age groups (p<0.003) [67].

Surprisingly, eye-related codes were included in several diagnosis predictors in our model. Codes included those for disorders of refraction, especially myopia. These disorders are likely associated with age. A meta-regression of age and the year of birth with the prevalence of myopia demonstrated a U-shaped relationship (p<0.05) with an increasing prevalence from the age of 30 years [68]. The prevalence of hypermetropia increased from the age group of 41–50 (aOR 2.7, 95% CI 1.3–5.7; p = 0.007) to 61–70 years (aOR 5.8, 95% CI 2.7–12.7; p<0.001) [69].
Age was also correlated with astigmatisms of ≥1.00 D, in terms of both corneal (OR 1.007, 95% CI 1.001–1.013; p = 0.02) and refractive astigmatism (OR 1.043, 95% CI 1.036–1.051; p<0.0001) [70]. This diagnosis predictor was a part of the principal components that also included age, as shown by text profiling results. In addition, other eye-related codes had unclear associations with preeclampsia. However, these diseases are common in clinical practice and might involve bacterial or viral infections, such as hordeolum and other deep inflammation of the eyelid (H000) [71], a retained foreign body following a penetrating wound of the eye orbit (H055) [72], conjunctivitis (H109) [73], otitis externa (H608 and H609) [74], and vertigo (H811 and H814) [75,76].

Another principal component included D codes (neoplasms or diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism). The text profiling results showed codes for anemia (D509, D648, and D649). Severe anemia was associated with preeclampsia/eclampsia in both nulliparous (aOR 3.74, 95% CI 2.90–4.81) and multiparous (aOR 3.45, 95% CI 2.79–4.25) women [77]. The effect of anemia in our model may have been adjusted by other conditions, since it was also a part of the principal components. A qualitative assessment from a systematic review described increases in anemia and eclampsia during periods of greater rainfall, which included studies suggesting that the seasonality of those diseases was associated with malaria [78].

The random forest outperformed other machine learning algorithms in this study. This algorithm was also the best model with superior predictive performance in several studies that developed clinical predictive models for such conditions as end-stage renal disease [79], incident delirium [80], H3K27M mutations in brainstem gliomas [81], prostate cancer [82], in-hospital mortality [83], chemoradiotherapy outcomes [84] and acute kidney injuries [85].
Features included demographic characteristics, comorbidities, medical histories, clinical predictors, laboratory findings, medical imaging, treatments, and biomarkers. Two studies also utilized routine registry data that were preprocessed by a nested case-control design [80,83]. Although machine learning algorithms did not show a higher AUROC compared to logistic regression for clinical prediction models, particularly those with a low risk of bias [86], the previous systematic review was limited. It did not compare the algorithms using the same datasets. Modern machine learning algorithms, including random forest, are data hungry; thus, the algorithms need a dataset with a higher EPV than that needed for logistic regression [19]. A low EPV causes overfitting, which in turn makes the predictive performance on an external validation set far poorer than that on either the training or internal validation set [28]. Meanwhile, the systematic review demonstrated that the most common cause of a high risk of bias is the external validation method. The comparison of predictive performance was confounded by factors other than the model algorithm, such as sample size and number of predictors. These factors were our reasons to utilize a dataset with a larger EPV and to apply the external validation method rigorously based on the PROBAST guidelines.

Calibration slopes of all models in this study were significantly different from 1 based on the 95% confidence interval, although the AUROCs of models with several algorithms, including the random forest algorithm, were moderate to high. This may happen in a prediction model using a nonlinear machine learning algorithm [87]. We can expect a model that is both well calibrated and has a high AUROC if the model uses a linear function. Logistic regression predicts the outcome probability as a function where predictors come into the model linearly [88].
Depending on the sample size, a prediction model that assumes linearity between predictors and the outcome may have poor predictive performance [86]. If a training set lacks sufficient outcome events for one predictor adjusted for the others, linearity is unlikely to be found between that predictor and the outcome, even though the association may be linear in a larger sample. Meanwhile, nonlinearity has been found in many associations between predictors and the preeclampsia outcome [89–93]. Preeclampsia prediction using a nonlinear machine learning algorithm may outperform that using a linear algorithm, depending on other factors. In this study, the factors that made the random forest outperform the other algorithms were the dominance of predictors derived from variables with binomial probability (the medical histories) [94] and a high-dimensional training set with a large sample size [95].
All of the selected candidate features for the random forest model were continuous variables, except three features that were categorical in origin in the NHID-BPJSKES. The continuous variables were the proportion of days with a visit out of the total days since being recorded in the database. Instances with missing feature values were simply removed because they were a small minority. There was no significant difference in any selected candidate feature before and after removing missing data. Categorization of continuous variables can lead to optimism because it uses cutoffs based on the same dataset, but handling missing data by exclusion is still acceptable if it does not significantly change the distributions of predictors and outcomes [28].
The random forest algorithm in the best model did not apply built-in feature selection to the 17 features. These were taken from preliminary filtering of 95 candidates and 17 principal components by the MvLRM with forward selection. Principal components 8 and 10 were selected over 1 and 2.
This was because the principal components were compared not only with the other principal components but also with the original candidates; thus, selection did not follow the order in which the principal components were ranked. Using this workflow, we achieved ≥20 EPVs for the MvLRM and 180 EPVs for the random forest. These algorithms need 20–50 and 50–200 EPVs, respectively, as estimated in a study utilizing three datasets [19].
In our dataset, candidate features such as blood pressure and proteinuria were not available. However, unlike in a diagnostic prediction model, a prognostic model should avoid candidate features that are part of the outcome definition, e.g., blood pressure and proteinuria, which are part of the definition of preeclampsia. This situation is called outcome leakage in both prognostic prediction modeling and machine learning prediction [26,28]. Nonetheless, we validated our best model by comparing its predictive performance in external validation with that of traditional clinical scoring models for prognostic prediction of preeclampsia as a pregnancy outcome. These were the NICE and ACOG models, as externally validated in previous studies (Table 5) [15,37].
We applied cross-validation and geographical/temporal splitting for internal and external validation, respectively. Cross-validation was applied from feature selection through model selection with parameter updating. This technique is recommended by the PROBAST guidelines as an unbiased method, in preference to a non-repeated random split [28]. Our external validation sets included >100 instances in the case group, as recommended by the same guidelines. We applied geographical and temporal randomization instead of simple randomization to split the dataset for external validation. Simple randomization is not recommended because training and external validation sets that differ only by chance would probably have similar predictive performances [96].
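The geographical and temporal splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields `city` and `year`, the function name `geo_temporal_split`, and the rule of holding out whole cities first and then a later time period are hypothetical, and they do not reproduce the exact randomization used in this study.

```python
def geo_temporal_split(records, holdout_cities, cutoff_year):
    """Split records into training, geographical, and temporal validation sets.

    Records from held-out cities form the geographical validation set.
    Remaining records dated at or after `cutoff_year` form the temporal
    validation set. Everything else is used for training.
    """
    train, geo_val, temporal_val = [], [], []
    for rec in records:
        if rec["city"] in holdout_cities:
            geo_val.append(rec)
        elif rec["year"] >= cutoff_year:
            temporal_val.append(rec)
        else:
            train.append(rec)
    return train, geo_val, temporal_val


# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": i, "city": city, "year": year}
    for i, (city, year) in enumerate(
        (c, y) for c in ("A", "B", "C", "D") for y in (2015, 2016)
    )
]
train, geo_val, temporal_val = geo_temporal_split(records, {"D"}, 2016)
print(len(train), len(geo_val), len(temporal_val))
```

Because no city or time period in either validation set appears in the training set, the validation sets differ from the training data by more than chance, which is the property that simple random splitting lacks.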
With geographical and temporal splitting, our external validations were similar to those of independent validation studies because the variance of features related to the city and time period in our training set was not observed in our external validation sets. Therefore, this study applied standards for feature extraction, feature selection, model validation, and other steps that were designed to avoid bias and overfitting in the development of either prognostic factors or prediction models [25,28,96,97].
Our random forest model had the best predictive performance for the period of 9–<12 months to the event compared to 2 days–<6 months, 6–<9 months, and 12–24 months to the event. The first trimester was the period for which most studies developed a preeclampsia prediction model (n = 42/70, 61.43%) [38]. This is approximately equivalent to <9 months to the event. Only one of the studies developed the model before conception, and two studies used only non-time-varying maternal characteristics (i.e., ethnicity or social class). This was probably due to a common belief that the pathogenesis of preeclampsia begins at 11 to 13 weeks’ gestation [98]. In the two-stage model of preeclampsia pathogenesis, the disease is initiated by placental dysfunction, followed by endothelial dysfunction; yet, various theories have attempted to explain the cause of placental dysfunction. However, pregnant women with preeclampsia and endothelial dysfunction have been reported without placental disease [99]. Impaired endometrial maturation before and during early pregnancy has also been demonstrated in preeclamptic women [31,100,101]. Only the late secretory phase of the menstrual cycle was impaired, and this phase is the only one enabling a successful pregnancy. This evidence suggests that an event may impair the endometrium in the last menstrual period before pregnancy.
Our best model outperformed prognostic prediction models from previous studies that used only demographic and/or clinical predictors. These models included the preeclampsia risk scoring from NICE and ACOG. In particular, the precision, or positive predictive value, of our model stood out compared with those of previous models. This is because many preeclampsia prediction models were developed with an imbalanced dataset in which the preeclampsia group was small compared with the control group. Imbalanced outcomes impair the precision of prediction models, which can be handled by oversampling [102]. The most widely used oversampling method, the synthetic minority over-sampling technique (SMOTE), can also improve the sensitivity in a training set with a minority of positives, although it slightly reduces the accuracy and specificity [103]. However, SMOTE may cause problems in the distribution of the dataset [104]. We applied naïve random oversampling, which randomly samples the minority outcome with replacement. In a previous study, this oversampling technique improved the AUROCs of many machine learning algorithms compared with those trained on an imbalanced dataset [105]. We also provide evidence that this technique did not affect the distribution of predictors and outcome in this study.
However, several limitations of our prediction model should be considered. The training set in this study consisted of patients with medical histories that might have been recorded by several healthcare facilities. The predictive performance might be poor in a healthcare facility at a certain visit if the patient had medical histories that were recorded mostly in the databases of other healthcare facilities. Model deployment will need an information system that can be shared across healthcare facilities. Transforming this model into a risk calculator for clinical practice needs further validation.
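The naïve random oversampling mentioned above, which resamples the minority outcome with replacement, can be sketched in a few lines of Python. This is a minimal illustration that assumes the minority class is resampled until every class matches the size of the largest one; the function name `random_oversample` and the toy data are hypothetical, and the target class ratio used in this study is not specified here.

```python
import random


def random_oversample(X, y, seed=0):
    """Resample smaller classes with replacement until all classes are equal in size."""
    rng = random.Random(seed)
    rows_by_class = {}
    for xi, yi in zip(X, y):
        rows_by_class.setdefault(yi, []).append(xi)
    n_max = max(len(rows) for rows in rows_by_class.values())

    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        # Draw the missing rows from the existing rows of the same class.
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 2 cases (label 1) and 8 controls (label 0).
X = [[i] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(1), y_bal.count(0))
```

Oversampling of this kind is normally applied to the training set only, so that validation sets keep the original outcome prevalence.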
Because our dataset had no explicit information identifying the healthcare facilities where the visits took place, we could not construct another external validation set that approximated prediction from a patient's medical histories retrieved from the database of a single healthcare facility. Nearly one-third of instances in our training set were also back-censored; thus, some medical histories were not observed by our model, particularly during 12–24 months to the event. Other predictors, such as body mass index and gestational age at diagnosis or at delivery, were not available in the original dataset we utilized for model development. This model also could not differentiate early- vs. late-onset and preterm vs. term preeclampsia with or without intrauterine growth restriction. However, the prediction model in this study was not intended to predict either subtypes of preeclampsia or adverse events. It was intended as a preliminary prediction model to determine which pregnant women should then be assessed by other models with high specificity and advanced predictors.
Our model had a high sensitivity and low-cost predictors. The model also had better precision than the other model with low-cost predictors. This model may reduce false positives of preeclampsia. At the same specificity, the sensitivity of our model was also higher than those of previous models with low-cost predictors. Our preliminary prediction model may reduce by ∼20% the community-level cost of using a highly specific prediction model with advanced predictors. This could be achieved if the advanced prediction is applied only to the positives predicted by our model. We also recommend confirming the negatives predicted by our model using the low-cost prediction model from MacDonald-Wallis (2015) [33].
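The reasoning behind applying the advanced prediction only to predicted positives can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the function name `two_stage_screening` and the input values are hypothetical, the cost of the first-stage model is ignored, and the numbers are not taken from this study and do not reproduce the ∼20% estimate.

```python
def two_stage_screening(prevalence, sensitivity, specificity):
    """Summarize a first-stage screen that gates access to an advanced test.

    referred_fraction: share of all women predicted positive (true + false positives)
    ppv: precision of the first-stage screen at the given prevalence
    advanced_tests_avoided: share of women who would skip the advanced test
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    referred = true_pos + false_pos
    ppv = true_pos / referred if referred > 0 else float("nan")
    return {
        "referred_fraction": referred,
        "ppv": ppv,
        "advanced_tests_avoided": 1.0 - referred,
    }


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
result = two_stage_screening(prevalence=0.10, sensitivity=0.80, specificity=0.50)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The calculation also shows why precision depends on prevalence: at a fixed sensitivity and specificity, a rarer outcome yields a lower positive predictive value.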
Conceivably, only through combined prediction will the low-cost prediction improve neonatal morbidity and ICU utilization without sacrificing maternal safety.
Indonesian people have rich genetic diversity [106,107]. Genetic variation may affect the characteristics of many diseases; yet, geographic variation may differ from the variants affecting those diseases [108]. Nevertheless, Indonesian genomic variation covers only Asian and Austronesian variation [107,109]. The generalizability of this model may be limited to Indonesia or other regions that share similar race/ethnicity and climate conditions. More external validations are still needed before this model can be considered valid in other populations. This was not feasible for this study because we did not have access to health insurance datasets from other countries with a similar data structure and the same disease classification system. In addition, because gestational period information was not available, we could not determine whether 9–<12 months to the event corresponded to a period shortly or long before pregnancy. Selected features may be bystanders rather than causal or risk factors for preeclampsia; thus, this model should be interpreted carefully. Further investigation at the molecular level should be conducted to confirm this association.
In conclusion, the best model in this study had robust performance on all validation sets, including external validations. This may indicate the generalizability of the prediction model to unobserved samples. Our model also outperformed, especially in precision, previous models that used maternal characteristics and medical histories without biophysical or biochemical markers. This may reduce false positives in decisions for early delivery, especially in resource-poor settings. However, a future study is needed to investigate the impact of our prediction model on reducing false positives.
In addition, the model applied the random forest algorithm to features that best predicted preeclampsia/eclampsia within 9–<12 months to the event and that corresponded to findings of previous studies. This may give more insight into the pathogenesis of preeclampsia; however, future investigations are needed to confirm these insights. Because the medical histories used by our model were recorded at multiple healthcare facilities, we also recommend that health insurance companies, particularly in Indonesia, facilitate deployment of this model in privacy-aware information systems shared across healthcare facilities. With our prediction model, which showed acceptable performance, the cost efficiency of insurance management may be improved, in addition to the expected prevention of inefficient use of neonatal ICUs.

Declaration of Competing Interest

Dr. Sufriyana, Dr. Wu, and Dr. Su have nothing to disclose.

102 in total

1. The prevalence of refractive errors among adult rural populations in Iran.

Authors: Hassan Hashemi; Payam Nabovati; Abbasali Yekta; Fereshteh Shokrollahzadeh; Mehdi Khabazkhoob
Journal: Clin Exp Optom Date: 2017-07-12 Impact factor: 2.742

2. Distribution of astigmatism as a function of age in an Australian population.

Authors: Paul G Sanfilippo; Seyhan Yazar; Lisa Kearns; Justin C Sherwin; Alex W Hewitt; David A Mackey
Journal: Acta Ophthalmol Date: 2015-01-13 Impact factor: 3.761

3. A simple clinical method to identify women at higher risk of preeclampsia.

Authors: Effie Viguiliouk; Alison L Park; Howard Berger; Michael P Geary; Joel G Ray
Journal: Pregnancy Hypertens Date: 2017-07-25 Impact factor: 2.899

4. Aspirin for Evidence-Based Preeclampsia Prevention trial: effect of aspirin on length of stay in the neonatal intensive care unit.

Authors: David Wright; Daniel L Rolnik; Argyro Syngelaki; Catalina de Paco Matallana; Mirian Machuca; Mercedes de Alvarado; Sofia Mastrodima; Min Yi Tan; Siobhan Shearing; Nicola Persico; Jacques C Jani; Walter Plasencia; George Papaioannou; Francisca S Molina; Liona C Poon; Kypros H Nicolaides
Journal: Am J Obstet Gynecol Date: 2018-03-02 Impact factor: 8.661

5. Prediction models for preeclampsia: A systematic review.

Authors: Annelien C De Kat; Jane Hirst; Mark Woodward; Stephen Kennedy; Sanne A Peters
Journal: Pregnancy Hypertens Date: 2019-03-11 Impact factor: 2.899

6. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration.

Authors: Karel G M Moons; Douglas G Altman; Johannes B Reitsma; John P A Ioannidis; Petra Macaskill; Ewout W Steyerberg; Andrew J Vickers; David F Ransohoff; Gary S Collins
Journal: Ann Intern Med Date: 2015-01-06 Impact factor: 25.391

7. Prediction model development of late-onset preeclampsia using machine learning-based methods.

Authors: Jong Hyun Jhee; SungHee Lee; Yejin Park; Sang Eun Lee; Young Ah Kim; Shin-Wook Kang; Ja-Young Kwon; Jung Tak Park
Journal: PLoS One Date: 2019-08-23 Impact factor: 3.240

Review 8. Pathogenesis of vascular leak in dengue virus infection.

Authors: Gathsaurie Neelika Malavige; Graham S Ogg
Journal: Immunology Date: 2017-05-24 Impact factor: 7.397

9. Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers.

Authors: Timo M Deist; Frank J W M Dankers; Gilmer Valdes; Robin Wijsman; I-Chow Hsu; Cary Oberije; Tim Lustberg; Johan van Soest; Frank Hoebers; Arthur Jochems; Issam El Naqa; Leonard Wee; Olivier Morin; David R Raleigh; Wouter Bots; Johannes H Kaanders; José Belderbos; Margriet Kwint; Timothy Solberg; René Monshouwer; Johan Bussink; Andre Dekker; Philippe Lambin
Journal: Med Phys Date: 2018-06-13 Impact factor: 4.071

10. Allergic Asthma Favors Brucella Growth in the Lungs of Infected Mice.

Authors: Arnaud Machelart; Georges Potemberg; Laurye Van Maele; Aurore Demars; Maxime Lagneaux; Carl De Trez; Catherine Sabatel; Fabrice Bureau; Sofie De Prins; Pauline Percier; Olivier Denis; Fabienne Jurion; Marta Romano; Jean-Marie Vanderwinden; Jean-Jacques Letesson; Eric Muraille
Journal: Front Immunol Date: 2018-08-10 Impact factor: 7.561
