Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data.

Lin Li¹, Chuang-Chung Lee², Fang Liz Zhou¹, Cliona Molony², Zoran Doder¹, Evgeny Zalmover¹, Kristen Sharma¹, Juhaeri Juhaeri¹, Chuntao Wu¹.

Abstract

PURPOSE: To assess the performance of different machine learning (ML) approaches in identifying risk factors for diabetic ketoacidosis (DKA) and predicting DKA.
METHODS: This study applied flexible ML (XGBoost, distributed random forest [DRF] and feedforward network) and conventional ML approaches (logistic regression and least absolute shrinkage and selection operator [LASSO]) to 3400 DKA cases and 11 780 controls nested in adults with type 1 diabetes identified from Optum® de-identified Electronic Health Record dataset (2007-2018). Area under the curve (AUC), accuracy, sensitivity and specificity were computed using fivefold cross validation, and their 95% confidence intervals (CI) were established using 1000 bootstrap samples. The importance of predictors was compared across these models.
RESULTS: In the training set, XGBoost and feedforward network yielded higher AUC values (0.89 and 0.86, respectively) than logistic regression (0.83), LASSO (0.83) and DRF (0.81). However, the AUC values were similar (0.82) among these approaches in the test set (95% CI range, 0.80-0.84). While the accuracy values were >0.8 and the specificity values were >0.9 for all models, the sensitivity values were only about 0.4. The differences in these metrics across these models were minimal in the test set. All approaches selected some known risk factors for DKA among their top 10 features. XGBoost and DRF included more laboratory measurements or vital signs than the conventional ML approaches, while feedforward network included more social demographics.
CONCLUSIONS: In our empirical study, all ML approaches demonstrated similar performance, and identified overlapping, but different, top 10 predictors. The difference in selected top predictors needs further research.

Keywords: AUC; diabetic ketoacidosis; least absolute shrinkage and selection operator; logistic regression; machine learning; prediction model

Year: 2021 PMID: 33480091 PMCID: PMC8049019 DOI: 10.1002/pds.5199

Source DB: PubMed Journal: Pharmacoepidemiol Drug Saf ISSN: 1053-8569 Impact factor: 2.890

Flexible machine learning (ML) approaches do not automatically result in improved performance over conventional ML approaches.
The performances of flexible ML and conventional ML approaches were similar in predicting diabetic ketoacidosis (DKA) using an electronic health records data source.
Flexible ML and conventional ML approaches identified overlapping, but different, top 10 risk factors for DKA.
Flexible ML approaches only provided the relative importance for each predictor, while logistic regression could also estimate an odds ratio for each predictor.
Interpretation of the findings of flexible ML approaches is challenging.

INTRODUCTION

Artificial intelligence, including machine learning (ML), has been increasingly used to analyze healthcare data, including electronic health records (EHR). ML is a natural extension of conventional analysis approaches such as logistic regression, and has been widely used to learn complex relationships or patterns from data to make accurate predictions. Conventional analysis approaches are commonly used to identify risk factors for a disease or outcome in clinical epidemiology. However, there are limited data on the performance of different ML approaches in such studies, and the theoretical superiority of more flexible ML approaches such as XGBoost and distributed random forest (DRF) over conventional ML approaches has not been consistently observed in real‐world settings.

Diabetic ketoacidosis (DKA) is an acute, life‐threatening but preventable complication of type 1 diabetes (T1D). In the United States, DKA hospitalizations increased by 54.9% from 2009 to 2014 after a decline in 2000–2009. Because DKA is caused by insulin deficiency that is often precipitated by discontinuation of insulin or inadequate insulin treatment, it is important to identify risk factors for DKA and to predict DKA for effective T1D management. However, no prediction model for DKA has been developed and adopted in clinical practice.

Given the growing use of ML in clinical prediction, including pharmacoepidemiology, with limited assessment of model performance, we conducted a study to assess the empirical performance of different ML approaches in identifying risk factors for DKA and predicting DKA in adults with T1D using an EHR database.

METHODS

Different ML approaches were applied in a case–control study nested in adults with T1D to identify risk factors and predict DKA. The nested case–control design can identify risk factors for DKA more readily and efficiently than a cohort design. Previous studies have demonstrated that this design can be used for the development of clinical prediction models.

Data source

The Optum de‐identified EHR database was used in this study. It currently encompasses approximately 80 million patients from all Census regions in the United States, with at least 5 million patients from each region. Data on the full spectrum of inpatient and outpatient treatments are collected from more than 140 000 physicians at more than 600 hospitals and 6500 clinics. On average, patients contribute 4 years of medical history to the database. Information such as patient‐reported symptoms and outcomes as well as treatment rationale is captured directly from providers' notes via natural language processing. Approximately 82% of patients in this database are part of an Integrated Delivery Network, which includes hospital and emergency care as well as outpatient visits. In addition, about 20% of patients can be linked with administrative claims data.

Study patients

Patients with T1D were identified from the Optum EHR database between January 1, 2007 and September 30, 2018 by adapting the Klompas algorithm. The positive predictive value (PPV) of the Klompas algorithm for T1D was 89% in the original publication and 94.5% in an external validation study. For each patient with T1D, the start of follow‐up for a DKA event was the later of the first date on which the patient met the diabetes surveillance algorithm criteria or the date on which the patient turned 18 years. The end of follow‐up was the date of the first hospitalized DKA event during follow‐up, the date of last recorded clinical activity in the database, or September 30, 2018, whichever occurred earliest. Hospitalized DKA events were identified using ICD‐9‐CM (250.1) or ICD‐10‐CM (E1X.1) diagnosis codes, which have been reported to have a PPV of 88.9%. Because DKA is an emergency condition and patients with DKA rarely meet criteria to be safely discharged from emergency departments, outpatient or emergency encounters without subsequent hospitalization were excluded to minimize false positive cases.

The index date of a case was the date of the first DKA event that occurred during follow‐up. The DKA cases included in the study must have had ≥1 non‐emergency clinical activity encounter and insulin treatment within 365 days before the index date, T1D for ≥365 days before the index date, and ≥1 HbA1c measurement within 183 days before the index date. Cases who were pregnant and those who used antihyperglycemic agents indicated for type 2 diabetes only (except for metformin) within 365 days prior to or on the index date were excluded. For each DKA case, up to 10 controls without DKA on the index date of the case were randomly selected from the T1D cohort using incidence‐density sampling without replacement and matched on whether a patient's EHR data were linked with claims data. The index date for controls was assigned as the date of DKA diagnosis of their matched case. The same inclusion and exclusion criteria for cases were also applied to controls.

Potential predictors

Seven groups of potential predictors (i.e., features in ML) and the T1D cohort entry year as well as year of the index date were explored. These groups included social demographics, lifestyle factors, health service use, treatment, chronic comorbidities, acute medical conditions as well as laboratory test results, vital signs and other common measurements (Table S1). The selection of predictors was guided by the background knowledge of DKA and data availability.

Statistical methods

The overall structure of patient data that served as input to the analysis algorithm is shown in Figure 1. All study subjects were first randomly split into a training set and a test set in a ratio of 4:1. The training set was used for feature selection preprocessing and model building. The test set was used to assess the performance of each model.
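The 4:1 random split can be illustrated with a minimal Python sketch. This is not the authors' code (the study used R with the H2O package); the function name `split_train_test`, the seed, and the integer subject identifiers are assumptions for illustration, and the text does not say whether the split was stratified by case status, so none is applied here.

```python
import random

def split_train_test(ids, test_fraction=0.2, seed=0):
    """Randomly split ids into (train, test); test_fraction=0.2 gives a 4:1 ratio."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical identifiers: 15 180 subjects = 3400 cases + 11 780 controls.
train_ids, test_ids = split_train_test(range(15180))
print(len(train_ids), len(test_ids))
```

With 15 180 subjects this yields 12 144 training and 3036 test subjects.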

FIGURE 1

Data processing flowchart. DKA, diabetic ketoacidosis; LASSO, least absolute shrinkage and selection operator; T1D, type 1 diabetes

Feature selection preprocessing

We carried out feature selection preprocessing on all potential predictors in the training set. First, any features with ≥60% missing data were removed. This threshold was chosen to retain as many laboratory measurements as possible, since their missing values were most likely due to an absence of testing. The remaining features with missing data are listed in Table S2. Missing values of these features were imputed with the mean of the available values of that feature or grouped as “unknown” where applicable; otherwise, many patients' records would have been too incomplete for the models to make predictions. Second, to avoid collinearity among the features, which can cause unstable estimates and sign flipping of the coefficients, the variance inflation factor (VIF) was calculated for each feature (Table S3). Features with VIF values ≥10 were flagged as highly correlated. Within each group of highly correlated features, or of features that caused sign flipping, only the one feature with higher clinical utility or a lower percentage of missing data was retained.
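The two preprocessing steps can be sketched in plain Python as follows. This is an illustrative sketch, not the study's implementation: the function names, the use of `None` for missing records, and the toy values are all assumptions. The VIF is computed here through the standard identity that VIF for feature j equals the j-th diagonal entry of the inverse correlation matrix, which is equivalent to 1/(1 − R²) from regressing feature j on the others.

```python
def missing_fraction(column):
    """Share of entries recorded as None."""
    return sum(v is None for v in column) / len(column)

def drop_sparse_features(table, threshold=0.60):
    """Keep only features whose missing fraction is below the threshold."""
    return {name: col for name, col in table.items()
            if missing_fraction(col) < threshold}

def impute_mean(column):
    """Replace None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long, fully observed columns."""
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (fails on singular input)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of feature j is the j-th diagonal entry of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Hypothetical toy table; None marks a missing record.
table = {
    "hba1c": [9.1, None, 8.0, 7.5, 10.2],
    "ketones": [None, None, None, 0.4, None],  # 80% missing, so it is dropped
}
kept = {name: impute_mean(col) for name, col in drop_sparse_features(table).items()}

# Two hypothetical, fully observed features with correlation 0.8.
vif_values = vif([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
flagged = [j for j, v in enumerate(vif_values) if v >= 10]
```

In the toy example both VIF values are about 2.78, below the threshold of 10, so no feature is flagged. Choosing which member of a flagged group to keep was a clinical judgment in the study and is not automated in this sketch.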

Predictive modeling algorithms

Two conventional and three flexible ML approaches were used to identify important risk factors for DKA from the selected features and to build prediction models.

Conventional ML approaches included logistic regression and least absolute shrinkage and selection operator (LASSO). For logistic regression, backward selection with a cutoff p‐value of 0.05 was used to select statistically significant predictors; the adjusted odds ratio (OR) and its 95% confidence interval (CI) were estimated for each selected predictor. LASSO is a parameter regularization technique, which was applied to a logistic regression. During regularization, penalties are introduced into the model building process to avoid over‐fitting and reduce the number of covariates.

Flexible ML approaches included XGBoost, DRF and feedforward network. XGBoost is a supervised learning algorithm that implements a process called boosting to yield accurate models. Boosting refers to the ensemble learning technique of building many models sequentially, with each new model attempting to correct the deficiencies of the previous one. DRF generates a forest of classification trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees can reduce the variance. The classification process takes the average prediction over all trees to make a final prediction. Neural networks are learning algorithms inspired by brain research. The feedforward network we used is the simplest form of neural network, since information flows only in the forward direction without cycling back. It consists of an input layer, an output layer, and several hidden layers in between. Each layer includes multiple nodes whose weights, combined with the input, determine the output of the network. During the learning process, these weights are updated to minimize the loss function.
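For the logistic regression step, the link between a fitted coefficient and the reported adjusted OR can be sketched as below. The text does not state how the 95% CI was computed; a Wald interval, exp(β ± 1.96·SE), is assumed here, and the function name and numeric inputs are hypothetical rather than taken from the study.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error.

    Assumes a Wald interval on the log-odds scale, exponentiated to the OR scale.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error, not estimates from the study.
or_, lower, upper = odds_ratio_with_ci(beta=0.40, se=0.05)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A positive coefficient gives an OR above 1 (increased risk) and a negative one an OR below 1. Tree ensembles and neural networks do not yield such a per-predictor effect estimate directly.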

Cross validation

During the training process, k‐fold cross validation was conducted for each model. That is, the training set was further split into a training subset and a validation subset in a ratio of 4:1 to tune hyperparameters. This process was repeated five times (k = 5). The final model was then built by aggregating the five cross validation models and evaluated in the full training data set. Further details are provided in Appendix I.
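The fivefold partition of the training set can be sketched as follows. This is a generic illustration in Python, not the H2O procedure the authors ran. The function name `k_fold_indices` and the seed are assumptions, and the hyperparameter tuning and model aggregation steps are left out.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every row is in exactly one validation fold."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# With k = 5 each round trains on 4/5 of the rows and validates on 1/5 (ratio 4:1).
splits = list(k_fold_indices(n_rows=100, k=5))
```

Each row serves as validation data exactly once, so every training record contributes to both fitting and tuning.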

Model performance assessment

The model performance was assessed in the test set using the metrics below. Area under the receiver operating characteristic (ROC) curve (AUC): the ROC curve plots the true positive rate against the false positive rate, and the AUC ranges from 0 to 1. A value of 1 indicates that the model predicts perfectly. Accuracy, specificity and sensitivity were calculated from a confusion matrix. To build a confusion matrix, a specific threshold value is required to determine whether a predicted probability is assigned to a case or a control. We chose the threshold value that optimized accuracy. Finally, to demonstrate the variability of the model predictions, the 95% CIs of the AUC, accuracy, specificity, and sensitivity values were established using 1000 bootstrap samples.
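The metrics described above can be sketched in plain Python as follows. The sketch makes several assumptions that the text does not specify: the AUC is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control), a score at or above the threshold is classified as a case, candidate thresholds are the observed scores, and the bootstrap interval uses the percentile method. The labels and scores are toy values.

```python
import random

def auc(labels, scores):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_metrics(labels, scores, threshold):
    """Accuracy, sensitivity and specificity when score >= threshold means 'case'."""
    pairs = list(zip(labels, scores))
    tp = sum(y == 1 and s >= threshold for y, s in pairs)
    fn = sum(y == 1 and s < threshold for y, s in pairs)
    tn = sum(y == 0 and s < threshold for y, s in pairs)
    fp = sum(y == 0 and s >= threshold for y, s in pairs)
    return {"accuracy": (tp + tn) / len(pairs),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

def accuracy_optimal_threshold(labels, scores):
    """Observed score that maximizes accuracy when used as the threshold."""
    return max(sorted(set(scores)),
               key=lambda t: confusion_metrics(labels, scores, t)["accuracy"])

def bootstrap_ci(labels, scores, metric, n_boot=1000, seed=0):
    """Percentile 95% interval of metric(labels, scores) over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Toy data: 4 cases (1) and 6 controls (0) with hypothetical predicted probabilities.
labels = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

threshold = accuracy_optimal_threshold(labels, scores)
metrics = confusion_metrics(labels, scores, threshold)
auc_value = auc(labels, scores)
auc_low, auc_high = bootstrap_ci(labels, scores, auc)
```

On the toy data the accuracy-optimal threshold is 0.7, giving an accuracy of 0.9, a sensitivity of 0.75 and a specificity of 1.0. When controls outnumber cases, a threshold chosen this way tends to favor specificity over sensitivity.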

Feature importance

For each approach, the feature importance percentage was determined by calculating the relative influence of each predictor. For conventional ML approaches, it was derived from, and ranked by, the magnitude of the standardized coefficient of each statistically significant selected predictor. For XGBoost and DRF, two factors were considered to determine the relative importance of each feature: whether the variable was used to split a decision tree node and how much the prediction error was reduced as a result of the split. That is, when a split on a feature contributed a larger decrease in the squared error, that feature was regarded as having greater relative influence. For feedforward network, the Gedeon method was used to calculate feature importance; it considers the weights connecting the input features to the hidden layers. A model can be simplified by including only the top 10 features, and we assessed the AUC of the top 10 feature model for each approach.

Data management was conducted using the Palantir Foundry system (https://www.palantir.com/palantir-foundry/) housed at Sanofi. The control selection process via incidence‐density sampling was conducted using SAS 9.4 (SAS Institute, Cary, NC), and all other statistical analyses were performed using R software version 3.4 (www.r-project.org). We used the H2O R package to implement the conventional and flexible ML processes.
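For the conventional models, ranking by the magnitude of standardized coefficients can be sketched as below. Scaling the magnitudes so that they sum to 100% is an assumption about how the importance percentage was formed, and the function name and coefficient values are hypothetical rather than the study's estimates.

```python
def relative_importance(standardized_coefs):
    """Rank features by |standardized coefficient| and express each as a percentage."""
    total = sum(abs(b) for b in standardized_coefs.values())
    ranked = sorted(standardized_coefs.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    return [(name, 100 * abs(b) / total) for name, b in ranked]

# Hypothetical standardized coefficients; the sign gives direction, not importance.
coefs = {"Age": -0.3, "HbA1c": 0.5, "BMI": -0.2}
ranking = relative_importance(coefs)
top_features = [name for name, _ in ranking[:10]]
```

Because only the magnitude enters the ranking, a strongly protective predictor can rank as high as a strong risk factor. The tree-based and Gedeon importances are computed on different scales, which is one reason rankings are hard to compare across model families.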

RESULTS

A total of 3400 DKA cases and 11 780 controls were selected for the final analysis (Table 1). After feature selection preprocessing, 43 features were selected to predict DKA and are described in Table 2. Compared with controls, DKA cases were younger, had lower socioeconomic status and had more comorbidities. The mean HbA1c level was 9.3% for DKA cases and 8.0% for controls.

TABLE 1

Study population attrition process

	Patient counts (January 01, 2007–September 30, 2018)
Individuals in Optum® de‐identified Electronic Health Record database	95 823 300
Individuals with diabetes	7 153 077
Individuals with type 1 diabetes	169 779
Individuals aged ≥18 years and within an Integrated Delivery Network	130 052
Individuals with at least 1 HbA1c measurement and at least 1 year of clinical activity any time	105 816
DKA case and control selection
Potential candidates	Potential DKA Cases N = 15 454	Potential Controls ^a N = 105 816
After application of the study criteria on potential DKA cases before matching: type 1 diabetes for at least 365 days before index date; treated with insulin within 365 days before index date; at least 1 HbA1c measurement within 183 days before index date; without pregnancy within 365 days before index date; without off‐label use of antihyperglycemic agents indicated for type 2 diabetes only (except for metformin) within 365 days before or on index date	3400	NA
Control selection via incidence density sampling based on 1:10 matching ratio	3400	34 000
After application of the same study criteria defined above on controls	3400	11 780

Abbreviations: DKA, diabetic ketoacidosis; HbA1c, hemoglobin A1c.

^a A control could not develop DKA before or at the matched index date but could become a case after the index date.

TABLE 2

Selected characteristics of DKA cases and controls

	DKA cases N = 3400	Controls N = 11 780	p‐Value ^a
Calendar year of T1D cohort entry (%)
2007	554 (16.3)	1831 (15.5)	0.111
2008	578 (17.0)	1910 (16.2)
2009	367 (10.8)	1175 (10.0)
2010	414 (12.2)	1469 (12.5)
2011	376 (11.1)	1286 (10.9)
2012	370 (10.9)	1257 (10.7)
2013	283 (8.3)	1009 (8.6)
2014	228 (6.7)	847 (7.2)
2015	131 (3.9)	532 (4.5)
2016	80 (2.4)	361 (3.1)
2017	19 (0.6)	103 (0.9)
Age, years, mean (SD)	42.9 (16.5)	45.4 (16.4)	<0.001
Sex (%)
Female	1818 (53.5)	5594 (47.5)	<0.001
Male	1579 (46.4)	6184 (52.5)
Unknown	3 (0.1)	2 (0.0)
Race (%)
Caucasian	2929 (86.1)	10 653 (90.4)	<0.001
African American	346 (10.2)	584 (5.0)
Asian	12 (0.4)	71 (0.6)
Other/Unknown	113 (3.3)	472 (4.0)
Annual household income, $, mean (SD)	41 773 (8302)	42 935 (8938)	<0.001
Insurance type (%)
Commercial	1306 (38.4)	6342 (53.8)	<0.001
Medicare	621 (18.3)	1519 (12.9)
Medicaid	514 (15.1)	727 (6.2)
Other payor type	134 (3.9)	353 (3.0)
Uninsured	202 (5.9)	255 (2.2)
Unknown	623 (18.3)	2584 (21.9)
Geographic region (%)
Midwest	2042 (60.1)	6792 (57.7)	<0.001
Northeast	315 (9.3)	1389 (11.8)
South	646 (19.0)	2237 (19.0)
West	269 (7.9)	1008 (8.6)
Other/Unknown	128 (3.8)	354 (3.0)
Lifestyle risk factors within 365 days before index date (%)
Alcohol abuse	175 (5.1)	226 (1.9)	<0.001
Controlled substance abuse	523 (15.4)	454 (3.9)	<0.001
Health service use within 365 days before index date (%)
Visit to endocrinologist	1789 (52.6)	7129 (60.5)	<0.001
Visit to primary care	1781 (52.4)	4959 (42.1)	<0.001
Chronic comorbidities any time between study start date and index date (%)
Cardiovascular disease	2042 (60.1)	5927 (50.3)	<0.001
Diabetic microvascular complications	1981 (58.3)	5160 (43.8)	<0.001
Chronic liver disease	244 (7.2)	435 (3.7)	<0.001
Chronic kidney disease	859 (25.3)	1328 (11.3)	<0.001
Dementia	137 (4.0)	217 (1.8)	<0.001
Psychiatric disorder	1743 (51.3)	3580 (30.4)	<0.001
Autoimmune disorders	432 (12.7)	1577 (13.4)	0.315
Cancer	377 (11.1)	1264 (10.7)	0.575
Acute medical conditions (%)
Infection within 7 days before index date	230 (6.8)	96 (0.8)	<0.001
Major surgery within 7 days before index date	45 (1.3)	6 (0.1)	<0.001
Non‐DKA hospitalization within 30 days before index date	610 (17.9)	168 (1.4)	<0.001
Treatments (%)
Insulin pump within 7 days before index date	154 (4.5)	153 (1.3)	<0.001
Insulin type within 7 days before index date
Intermediate/long‐acting insulin	218 (6.4)	309 (2.6)	<0.001
Rapid/short‐acting insulin	285 (8.4)	512 (4.3)	<0.001
Premixed insulin	10 (0.3)	15 (0.1)	0.061
Other medications within 30 days before index date
Systemic steroids	81 (2.4)	101 (0.9)	<0.001
Diuretics	132 (3.9)	221 (1.9)	<0.001
Antipsychotics	60 (1.8)	57 (0.5)	<0.001
Laboratory test results or vital signs within 183 days before index date, mean (SD) ^b
HbA1c, %	9.3 (1.8)	8.0 (1.4)	<0.001
Random blood glucose level, mg/dl	194.6 (60.6)	169.9 (66.3)	<0.001
eGFR, ml/min/1.73m²	83.1 (36.9)	96.4 (30.5)	<0.001
Total cholesterol, mg/dl	178.5 (46.7)	173.3 (38.3)	<0.001
Systolic blood pressure, mm Hg	126.8 (15.6)	124.4 (13.8)	<0.001
BMI, kg/m²	26.4 (5.8)	27.9 (5.7)	<0.001
Height, cm	169.6 (10.2)	171.1 (10.2)	<0.001
White blood cell count, x10³ per microliter	9.0 (3.3)	7.6 (2.7)	<0.001
Platelet count, x10³ per microliter	268.4 (79.7)	254.5 (72.4)	<0.001
Temperature, °C	36.7 (0.3)	36.7 (0.3)	<0.001
Pulse rate, beats per minute	85.0 (12.8)	78.7 (11.9)	<0.001
Respiratory rate, breaths per minute	17.4 (2.1)	16.7 (2.1)	<0.001
Hemoglobin, g/dl	12.6 (2.1)	13.4 (1.9)	<0.001
Oxygen saturation, SpO₂ (pulse oximetry), %	97.6 (1.5)	97.6 (1.5)	0.264

Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; SD, standard deviation.

^a Based on univariate analysis.

^b Based on non‐missing values.

Model performance with full set of features

In the training set, XGBoost outperformed the other four approaches with an AUC of 0.887, followed by feedforward network (AUC = 0.859), LASSO and logistic regression (AUC = 0.829 for each), and DRF (AUC = 0.808). In the test set, the AUC values ranged from 0.817 to 0.821 among these models, so the largest difference between models was only 0.004 (Table 3). The 95% CIs of the accuracy values spanned 0.812 to 0.839. While the specificity values were higher than 0.9 for all models, the sensitivity values were only about 0.4. Consistent with the AUC findings, the differences in accuracy, sensitivity and specificity between flexible and conventional ML approaches were all minimal in the test set (Table 3). The confusion matrices are provided in Table S4.

TABLE 3

The performance of study models with full set of features in the test data set

Models	AUC (95% CI)	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)
Logistic regression	0.821 (0.804–0.837)	0.827 (0.814–0.839)	0.409 (0.321–0.497)	0.947 (0.925–0.969)
LASSO	0.821 (0.805–0.838)	0.827 (0.814–0.839)	0.407 (0.318–0.496)	0.948 (0.925–0.970)
XGBoost	0.819 (0.802–0.836)	0.825 (0.813–0.837)	0.414 (0.311–0.518)	0.944 (0.916–0.971)
DRF	0.817 (0.799–0.834)	0.827 (0.815–0.839)	0.420 (0.319–0.522)	0.944 (0.917–0.971)
Feedforward network	0.817 (0.799–0.834)	0.825 (0.812–0.837)	0.400 (0.291–0.508)	0.947 (0.920–0.975)

Abbreviations: AUC, area under the receiver operating characteristic curve; CI, confidence interval; DRF, distributed random forest; LASSO, least absolute shrinkage and selection operator.
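The relationship between the columns of Table 3 can be checked with a short calculation, since accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The sketch below assumes that the proportion of cases in the test set equals the overall proportion, 3400/(3400 + 11 780); the text does not report this and the test-set class counts are in Table S4. The variable and function names are illustrative.

```python
CASES, CONTROLS = 3400, 11780
prevalence = CASES / (CASES + CONTROLS)  # assumed to hold in the test set as well

# model: (sensitivity, specificity, reported accuracy), copied from Table 3
table3 = {
    "Logistic regression": (0.409, 0.947, 0.827),
    "LASSO": (0.407, 0.948, 0.827),
    "XGBoost": (0.414, 0.944, 0.825),
    "DRF": (0.420, 0.944, 0.827),
    "Feedforward network": (0.400, 0.947, 0.825),
}

def implied_accuracy(sensitivity, specificity, prev):
    """Accuracy as the prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prev + specificity * (1 - prev)

implied = {model: implied_accuracy(sens, spec, prevalence)
           for model, (sens, spec, _) in table3.items()}

for model, (_, _, reported) in table3.items():
    print(f"{model}: implied {implied[model]:.3f}, reported {reported:.3f}")

# Reference point: accuracy obtained by labeling every subject as a control.
all_control_accuracy = 1 - prevalence
```

Under this assumption the implied accuracies agree with the reported ones to within rounding. Labeling every subject as a control would already give an accuracy of about 0.776, which helps put the reported accuracies of roughly 0.83 in context.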

Feature importance

HbA1c level, non‐DKA hospitalization, and white blood cell count were each among the top 10 features in all five models (Table 4 and Figure S1). Logistic regression and LASSO identified the same top 10 features with slightly different ranks, and most of them are well‐established risk factors for DKA. XGBoost and DRF also identified almost the same top 10 features, eight of which were laboratory test results or vital signs, while feedforward network selected a very different set of top 10 features, six of which were social demographics. XGBoost and DRF each shared five of their top 10 features with the conventional ML approaches, while feedforward network shared four.

TABLE 4

Top 10 features by each study model

	Conventional machine learning		Flexible machine learning		
Rank	Logistic regression	LASSO	XGBoost	DRF	Feedforward network
1	Insurance type – uninsured	HbA1c	HbA1c	HbA1c	Race – Asian
2	HbA1c	Non‐DKA hospitalization	Non‐DKA hospitalization	White blood cell count	Insurance type – uninsured
3	Non‐DKA hospitalization	Insurance type ‐ uninsured	White blood cell count	Non‐DKA hospitalization	Geographic region – Northeast
4	BMI	BMI	Hemoglobin	Hemoglobin	Race – African American
5	Pulse rate	Pulse rate	Pulse rate	Pulse rate	Geographic region – West
6	Psychiatric disorder	Psychiatric disorder	BMI	Random glucose level	Platelet count
7	Age	Age	Oxygen saturation	Respiratory rate	Gender ‐ female
8	Calendar year of diabetes cohort entry	Calendar year of diabetes cohort entry	Random glucose level	Platelet count	HbA1c
9	White blood cell count	White blood cell count	Platelet count	eGFR	Non‐DKA hospitalization
10	Acute infection	Acute infection	eGFR	BMI	White blood cell count

Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; DRF, distributed random forest; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; LASSO, least absolute shrinkage and selection operator.

In the logistic regression model, there were 16 positive predictors of DKA (i.e., associated with increased risk), such as higher HbA1c and non‐DKA hospitalization, and seven negative predictors (i.e., associated with decreased risk), such as older age and higher annual household income (Table S5). For each approach, the AUC of the top 10 feature model was close to that of the full model in the test set, ranging from 0.785 to 0.802 (Figure S2).

DISCUSSION

We evaluated the performance of different ML approaches in a nested case–control study that used an EHR database to identify risk factors for DKA in adults with T1D. We found that prediction of DKA is achievable with either conventional or flexible ML approaches and that the differences in performance among these approaches were minimal. All approaches consistently identified known risk factors for DKA, including HbA1c and non‐DKA hospitalization. XGBoost and DRF included more laboratory test results or vital signs in their top 10 features, while feedforward network included more social demographics. The flexible ML approaches only provided the relative importance of each predictor, while logistic regression could also estimate an OR for each predictor. Therefore, interpreting the findings of the flexible ML approaches is challenging.

In this study, we found that flexible ML approaches offered very limited improvement over conventional ones in predicting DKA using an EHR database. In a systematic review comparing the performance of logistic regression with ML for clinical prediction modeling, the AUCs of logistic regression and ML models were similar when comparisons were restricted to studies with low risk of bias. Other studies also reported comparable AUCs between flexible and conventional ML models.

Although the accuracy and specificity were above 0.8 in this study, sensitivity was only about 0.4. The main reason for the low sensitivity is the probability threshold we used in constructing the confusion matrix. For simplicity, we chose the threshold that optimized overall accuracy. As a result, the threshold was set relatively high, which led to low sensitivity and high specificity.

In general, the DKA risk factors that were most often selected by the different models are clinically sensible as triggering factors for insulin deficiency or discontinuation of insulin. Known risk factors for DKA include high HbA1c level, infection, surgery/trauma, younger age, female sex, low BMI and low socioeconomic status, and both conventional and flexible ML approaches included some of these factors in their top 10 features. However, XGBoost and DRF included more laboratory test results or vital sign measurements than the conventional models, while feedforward network included more social demographics. Because each model uses different theories and algorithms to determine feature importance, the different features selected by the various models cannot be compared directly. Nevertheless, one possible explanation for the difference is that some of the identified laboratory test results or vital signs reflect underlying causes of DKA, for example increased white blood cell count and pulse rate with infection, or elevated hemoglobin with dehydration. Another possible explanation is that some test results and social demographics are largely interrelated; for example, high random blood glucose level and uninsured status may both be associated with suboptimal diabetes management.

This study has several limitations that should be considered. First, misclassification of DKA was possible, because we could not retrieve medical records to validate the outcome. However, the PPV for the proposed approach of DKA identification was 89%. Second, unlike administrative claims data, EHR data contain no patient enrollment information. To minimize the possibility that the medical encounters in EHR data were incomplete, the study patients were selected among those with non‐emergency clinical activities recorded within 1 year prior to the index date, assuming all medical encounters were captured in the defined time window. In addition, we applied other inclusion and exclusion criteria, which may limit the implementation of these predictive algorithms in a real‐world setting. Third, several laboratory measurements, including white blood cell count, had close to 60% missing data, and the impact of imputing these features on model performance is unknown. However, the high percentages of missing values were driven by controls, suggesting that the values were most likely missing because controls had fewer medical conditions that would trigger laboratory testing than cases. Fourth, the predictor selection was based on knowledge of DKA and data availability. This a priori feature selection may limit the potential performance of some of the more complex ML algorithms and affect the interpretation of the feature importance results. Fifth, each model uses different theories and algorithms to determine feature importance. Therefore, no principled comparison can be made across models to explain the differences in feature selection. Last, the model performance was assessed in one typical nested case–control study using an EHR database. This needs to be considered when interpreting the generalizability of these results.

Overall, the flexible ML approaches offered very limited performance improvement over conventional ones in predicting DKA using structured data recorded in an EHR data source in this study. Both conventional and flexible ML approaches identified overlapping, but different, top 10 risk factors for DKA. Further research is needed to determine the conditions under which the flexible ML approaches would outperform the conventional ones, and vice versa, and to better understand the reasons for differences in feature importance ranking among these approaches.

CONFLICT OF INTEREST

Lin Li, Chuang‐Chung Lee, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, and Juhaeri Juhaeri are employees of Sanofi. Kristen Sharma was an employee of Sanofi at the time of this study, and is a current employee of Astellas Pharmaceuticals, Inc. Chuntao Wu was an employee of Sanofi at the time of this study, and is a current employee of Alexion Pharmaceuticals, Inc.

AUTHOR CONTRIBUTIONS

Chuntao Wu, Cliona Molony, Lin Li, Chuang‐Chung Lee: Study concept and design. Chuang‐Chung Lee: Data acquisition and statistical analysis. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Interpretation of data. Lin Li: Drafting of the manuscript. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Critical revision of the manuscript for important intellectual content. Chuntao Wu, Juhaeri Juhaeri: Study supervision. Lin Li, Chuang‐Chung Lee: Accountable for accuracy and integrity.

Appendix S1: Supporting Information.

8. Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data.

Authors: Lin Li; Chuang-Chung Lee; Fang Liz Zhou; Cliona Molony; Zoran Doder; Evgeny Zalmover; Kristen Sharma; Juhaeri Juhaeri; Chuntao Wu
Journal: Pharmacoepidemiol Drug Saf Date: 2021-02-03 Impact factor: 2.890
