
Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data.

Lin Li, Chuang-Chung Lee, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri, Chuntao Wu.

Abstract

PURPOSE: To assess the performance of different machine learning (ML) approaches in identifying risk factors for diabetic ketoacidosis (DKA) and predicting DKA.
METHODS: This study applied flexible ML (XGBoost, distributed random forest [DRF] and feedforward network) and conventional ML approaches (logistic regression and least absolute shrinkage and selection operator [LASSO]) to 3400 DKA cases and 11 780 controls nested in adults with type 1 diabetes identified from Optum® de-identified Electronic Health Record dataset (2007-2018). Area under the curve (AUC), accuracy, sensitivity and specificity were computed using fivefold cross validation, and their 95% confidence intervals (CI) were established using 1000 bootstrap samples. The importance of predictors was compared across these models.
RESULTS: In the training set, XGBoost and feedforward network yielded higher AUC values (0.89 and 0.86, respectively) than logistic regression (0.83), LASSO (0.83) and DRF (0.81). However, the AUC values were similar (0.82) among these approaches in the test set (95% CI range, 0.80-0.84). While the accuracy values were >0.8 and the specificity values were >0.9 for all models, the sensitivity values were only about 0.4. The differences in these metrics across these models were minimal in the test set. All approaches selected some known risk factors for DKA among their top 10 features. XGBoost and DRF included more laboratory measurements or vital signs than the conventional ML approaches, while feedforward network included more social demographics.
CONCLUSIONS: In our empirical study, all ML approaches demonstrated similar performance, and identified overlapping, but different, top 10 predictors. The difference in selected top predictors needs further research.
© 2021 Sanofi. Pharmacoepidemiology and Drug Safety published by John Wiley & Sons Ltd.

Keywords:  AUC; diabetic ketoacidosis; least absolute shrinkage and selection operator; logistic regression; machine learning; prediction model

Year:  2021        PMID: 33480091      PMCID: PMC8049019          DOI: 10.1002/pds.5199

Source DB:  PubMed          Journal:  Pharmacoepidemiol Drug Saf        ISSN: 1053-8569            Impact factor:   2.890


Flexible machine learning (ML) approaches do not automatically result in improved performance over conventional ML approaches.
The performances of flexible ML and conventional ML approaches were similar in predicting diabetic ketoacidosis (DKA) using an electronic health records data source.
Flexible ML and conventional ML approaches identified overlapping, but different, top 10 risk factors for DKA.
Flexible ML approaches only provided the relative importance for each predictor, while logistic regression could also estimate an odds ratio for each predictor.
Interpretation of the findings of flexible ML approaches is challenging.

INTRODUCTION

Artificial intelligence, including machine learning (ML), has been increasingly used to analyze healthcare data such as electronic health records (EHR). ML is a natural extension of conventional analysis approaches such as logistic regression, and has been widely used to learn complex relationships or patterns from data to make accurate predictions. Conventional analysis approaches are commonly used to identify risk factors for a disease or outcome in clinical epidemiology. However, there are limited data on the performance of different ML approaches in such studies, and the theoretical superiority of more flexible ML approaches such as XGBoost and distributed random forest (DRF) over conventional ML approaches has not been consistently observed in real-world settings.

Diabetic ketoacidosis (DKA) is an acute, life-threatening but preventable complication of type 1 diabetes (T1D). In the United States, DKA hospitalization increased by 54.9% from 2009 to 2014 after a decline in 2000–2009. Because DKA is caused by insulin deficiency that is often precipitated by discontinuation of insulin or inadequate insulin treatment, it is important to identify risk factors for DKA and to predict DKA for effective T1D management. However, no prediction model for DKA has been developed and used in clinical practice. Given the growing use of ML in clinical prediction, including in pharmacoepidemiology, and the limited assessment of model performance, we conducted a study to assess the empirical performance of different ML approaches in identifying risk factors for DKA and predicting DKA in adults with T1D using an EHR database.

METHODS

Different ML approaches were applied in a case–control study nested in adults with T1D to identify risk factors and predict DKA. The nested case–control design can identify risk factors for DKA more readily and efficiently than a cohort design. Previous studies have demonstrated that this design can be used for the development of clinical prediction models.

Data source

The Optum de-identified EHR database was used in this study. It currently encompasses approximately 80 million patients from all Census regions in the United States, with at least 5 million patients from each region. Data on the full spectrum of inpatient and outpatient treatments are collected from more than 140 000 physicians at more than 600 hospitals and 6500 clinics. On average, patients contribute 4 years of medical history to the database. Information such as patient-reported symptoms and outcomes as well as treatment rationale is captured directly from providers' notes via natural language processing. Approximately 82% of patients in this database are part of an Integrated Delivery Network, which includes hospital and emergency care as well as outpatient visits. In addition, about 20% of patients can be linked with administrative claims data.

Study patients

Patients with T1D were identified from the Optum EHR database between January 1, 2007 and September 30, 2018 by adapting the Klompas algorithm. The positive predictive value (PPV) of the Klompas algorithm for T1D was 89% in the original publication and 94.5% in an external validation study. For each patient with T1D, the start of follow-up for a DKA event was the later of the first date on which the patient met the diabetes surveillance algorithm criteria or the date on which the patient turned 18 years. The end of follow-up was the date of the first hospitalized DKA event during follow-up, the date of last-recorded clinical activity in the database, or September 30, 2018, whichever occurred earliest. Hospitalized DKA events were identified using ICD-9-CM (250.1) or ICD-10-CM (E1X.1) diagnosis codes, which have been reported to have a PPV of 88.9%. Because DKA is an emergency condition and patients with DKA rarely meet criteria to be safely discharged from emergency departments, outpatient or emergency encounters without subsequent hospitalization were excluded to minimize false positive cases.

The index date of a case was the date of the first DKA event that occurred during follow-up. DKA cases included in the study must have had ≥1 non-emergency clinical activity encounter and insulin treatment within 365 days before the index date, T1D for ≥365 days before the index date, and ≥1 HbA1c measurement within 183 days before the index date. Cases who were pregnant and those who used antihyperglycemic agents indicated for type 2 diabetes only (except for metformin) within 365 days prior to or on the index date were excluded. For each DKA case, up to 10 controls without DKA on the index date of the case were randomly selected from the T1D cohort using incidence-density sampling without replacement and matched on whether a patient's EHR data were linked with claims data. The index date for controls was assigned as the date of DKA diagnosis of their matched case. The same inclusion and exclusion criteria for cases were also applied to controls.

Potential predictors

Seven groups of potential predictors (i.e., features in ML) and the T1D cohort entry year as well as year of the index date were explored. These groups included social demographics, lifestyle factors, health service use, treatment, chronic comorbidities, acute medical conditions as well as laboratory test results, vital signs and other common measurements (Table S1). The selection of predictors was guided by the background knowledge of DKA and data availability.

Statistical methods

The overall structure of the patient data that served as input to the analysis algorithms is shown in Figure 1. All study subjects were first randomly split into a training data set and a test data set in a ratio of 4:1. The training set was used for feature selection preprocessing and model building. The test set was used to assess the performance of each model.
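The 4:1 split described above can be illustrated with a minimal sketch. The function name `split_train_test`, the fixed seed, and the use of simple (unstratified) random sampling are assumptions made for illustration; the text does not state whether the authors stratified the split by case status.

```python
import random

def split_train_test(subject_ids, test_fraction=0.2, seed=0):
    """Randomly split subject IDs into training and test sets (4:1 by default).

    Hypothetical helper: simple random sampling without stratification.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled IDs form the test set; the rest form the training set.
    return shuffled[n_test:], shuffled[:n_test]

# 3400 cases + 11 780 controls = 15 180 subjects in the study
train_ids, test_ids = split_train_test(range(15180))
print(len(train_ids), len(test_ids))
```

With 15 180 subjects this yields 12 144 training and 3036 test subjects.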
FIGURE 1

Data processing flowchart. DKA, diabetic ketoacidosis; LASSO, least absolute shrinkage and selection operator; T1D, type 1 diabetes


Feature selection preprocessing

We carried out feature selection preprocessing on all potential predictors in the training set. First, any feature with ≥60% missing data was removed. This threshold was chosen to retain as many laboratory measurements as possible, because their missing values were most likely due to an absence of testing. The remaining features with missing data are listed in Table S2. Missing records of these features were imputed with the mean of the available records of that feature or grouped as “unknown” where applicable; otherwise, many patients' records would have been too incomplete for the models to make predictions. Second, to avoid collinearity among the features, which can cause unstable estimates and sign flipping of the coefficients, the variance inflation factor (VIF) was calculated for each feature (Table S3). Features with VIF values ≥10 were flagged as highly correlated. Within each group of highly correlated features, or of features that caused sign flipping, only the one feature with higher clinical utility or a lower percentage of missing data was retained.
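The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the feature names and values are invented toy data, and `vif_pair` covers only the special case of two features, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation. The general VIF regresses each feature on all of the others.

```python
from statistics import mean

def drop_sparse_features(columns, max_missing=0.60):
    """Keep features whose share of missing values (None) is below max_missing."""
    return {name: vals for name, vals in columns.items()
            if sum(v is None for v in vals) / len(vals) < max_missing}

def impute_mean(values):
    """Replace missing values with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_pair(x, y):
    """VIF for the two-feature special case: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical toy data; None marks a missing record.
columns = {
    "hba1c": [9.1, None, 8.0, 7.5, 10.2],        # 20% missing: kept, then imputed
    "ketones": [None, None, None, 1.2, None],    # 80% missing: removed
    "weight_kg": [60.0, 72.0, 81.0, 95.0, 70.0],
    "bmi": [21.0, 24.5, 27.9, 32.0, 24.0],
}
kept = drop_sparse_features(columns)
hba1c_imputed = impute_mean(kept["hba1c"])
flagged = vif_pair(kept["weight_kg"], kept["bmi"]) >= 10  # highly correlated pair
```

In this toy example the weight and BMI columns are nearly collinear, so the pair is flagged and only one of the two would be retained.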

Predictive modeling algorithms

Two conventional and three flexible ML approaches were used to identify important risk factors for DKA from the selected features and to build prediction models.

Conventional ML approaches included logistic regression and least absolute shrinkage and selection operator (LASSO). For logistic regression, backward selection with a cutoff p-value of 0.05 was used to select statistically significant predictors; the adjusted odds ratio (OR) and its 95% confidence interval (CI) were estimated for each selected predictor. LASSO is a parameter regularization technique, which was applied to a logistic regression. During regularization, penalties are introduced into the model building process to avoid over-fitting and reduce the number of covariates.

Flexible ML approaches included XGBoost, DRF and feedforward network. XGBoost is a supervised learning algorithm that implements a process called boosting to yield accurate models. Boosting refers to the ensemble learning technique of building many models sequentially, with each new model attempting to correct the deficiencies of the previous one. DRF generates a forest of classification trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees can reduce the variance. The classification process takes the average prediction over all trees to make a final prediction. Neural networks are learning algorithms inspired by brain research. The feedforward network we used is the simplest form of neural network, because information flows only in the forward direction without circling back. It consists of an input layer, an output layer, and several hidden layers in between. Each layer includes multiple nodes whose weights, combined with the input, determine the output of the network. During the learning process, these weights are updated to minimize the loss function.
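As a point of reference, the LASSO penalty described above is commonly written as an L1-penalized logistic log-likelihood. The notation below is ours rather than the authors' (y_i is the case indicator, x_i the feature vector, p_i the predicted probability of DKA for subject i, m the number of features, and λ the penalty strength), and the exact parameterization used by the software in this study may differ.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
  + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert,
\qquad
p_i = \frac{1}{1 + \exp\!\big(-(\beta_0 + x_i^{\top}\beta)\big)}
```

Larger values of λ shrink more coefficients exactly to zero, which is how the penalty reduces the number of covariates.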

Cross validation

During the training process, k-fold cross validation was conducted for each model. That is, the training set was further split into a training subset and a validation subset in a ratio of 4:1 to tune hyperparameters. This process was repeated five times (k = 5). The final model was then built by aggregating the five cross validation models and evaluated in the full training data set. Further details are provided in Appendix I.
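The fivefold partitioning described above can be sketched as below. The helper `k_fold_indices` is hypothetical and only generates the row indices for each training and validation subset; hyperparameter tuning and the aggregation of the five models are not shown, and the details of the authors' procedure are in their Appendix I.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row lands in exactly one validation fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled rows into k folds of (nearly) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

# With 100 rows and k = 5, every split uses 80 rows for training and 20 for validation (4:1).
splits = list(k_fold_indices(100, k=5))
```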

Model performance assessment

Model performance was assessed in the test set using the metrics below. The area under the receiver operating characteristic (ROC) curve (AUC) summarizes the ROC curve, which plots the true positive rate against the false positive rate; it ranges from 0 to 1, and a value of 1 indicates that the model predicts perfectly. Accuracy, specificity and sensitivity were calculated from a confusion matrix. Building a confusion matrix requires a threshold that determines whether a predicted probability is assigned to a case or a control; we chose the threshold that optimized accuracy. Finally, to quantify the variability of the model predictions, the 95% CIs of the AUC, accuracy, specificity, and sensitivity were established using 1000 bootstrap samples.
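The metrics above can be illustrated with a short, dependency-free sketch. Several choices here are assumptions: the labels and scores are invented toy values, the AUC is computed by pairwise comparison of cases and controls, and the percentile method is used for the bootstrap interval, which the text does not specify.

```python
import random

def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def metrics_at(y_true, scores, threshold):
    """Accuracy, sensitivity and specificity when score >= threshold is called a case."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(y_true, scores))
    tn = sum(y == 0 and s < threshold for y, s in zip(y_true, scores))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp + tn) / len(y_true), tp / n_pos, tn / n_neg

def best_accuracy_threshold(y_true, scores):
    """Lowest observed score that maximizes accuracy when used as the threshold."""
    return max(sorted(set(scores)), key=lambda t: metrics_at(y_true, scores, t)[0])

def bootstrap_ci(y_true, scores, stat, n_boot=1000, seed=0):
    """Percentile 95% CI of stat(y, scores) over bootstrap resamples (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack cases or controls
            values.append(stat(ys, [scores[i] for i in idx]))
    values.sort()
    return values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]

# Hypothetical toy data: 1 = DKA case, 0 = control.
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]
threshold = best_accuracy_threshold(y, p)
accuracy, sensitivity, specificity = metrics_at(y, p, threshold)
ci_low, ci_high = bootstrap_ci(y, p, auc)
```

Because controls outnumber cases, a threshold chosen to maximize accuracy tends to favor specificity over sensitivity in real data, which is the pattern the study reports.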

Feature importance

For each approach, the feature importance percentage was determined by calculating the relative influence of each predictor. For conventional ML approaches, it was derived from, and ranked by, the magnitude of the standardized coefficient of each selected statistically significant predictor. For XGBoost and DRF, two factors were considered to determine the relative importance of each feature: whether the variable was used to split a decision tree node, and how much the prediction error was reduced as a result of the split. That is, when a split on a feature contributed a larger decrease in the squared error, that feature was regarded as having greater relative influence. For feedforward network, the Gedeon method was used to calculate feature importance. It considers the weights connecting the input features to the hidden layers. A model can be simplified by including only the top 10 features, and we assessed the AUC of the top 10 feature model for each approach.

Data management was conducted using the Palantir Foundry system (https://www.palantir.com/palantir-foundry/) housed in Sanofi. The control selection process via incidence-density sampling was conducted using SAS 9.4 (SAS Institute, Cary, NC), and all other statistical analyses were performed using R software version 3.4 (www.r-project.org). We used the H2O R package to implement the conventional and flexible ML processes.
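For the conventional models, ranking by the magnitude of standardized coefficients can be sketched as follows. The coefficient values are invented, and normalizing each magnitude by the sum of all magnitudes is our assumption about how an importance percentage could be formed; the text does not give the exact formula.

```python
def importance_percent(std_coefs):
    """Rank features by |standardized coefficient| and express each as a share of the total."""
    total = sum(abs(c) for c in std_coefs.values())
    ranked = sorted(std_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, 100 * abs(c) / total) for name, c in ranked]

# Hypothetical standardized coefficients from a fitted logistic regression.
ranking = importance_percent({"age": -0.30, "hba1c": 0.60, "bmi": -0.10})
for name, pct in ranking:
    print(f"{name}: {pct:.1f}%")
```

The sign of a coefficient is discarded here: a strongly protective predictor ranks as high as a strongly harmful one, which is why the odds ratios remain necessary for interpreting direction.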

RESULTS

A total of 3400 DKA cases and 11 780 controls were selected for the final analysis (Table 1). After the feature selection preprocessing, 43 features were selected to predict DKA; selected characteristics are described in Table 2. Compared with controls, DKA cases were younger, had lower socioeconomic status and had more comorbidities. The mean HbA1c level was 9.3% for DKA cases and 8.0% for controls.
TABLE 1

Study population attrition process

| Step | Patient counts (January 01, 2007–September 30, 2018) |
|---|---|
| Individuals in Optum® de-identified Electronic Health Record database | 95 823 300 |
| Individuals with diabetes | 7 153 077 |
| Individuals with type 1 diabetes | 169 779 |
| Individuals aged ≥18 years and within an Integrated Delivery Network | 130 052 |
| Individuals with at least 1 HbA1c measurement and at least 1 year of clinical activity any time | 105 816 |

DKA case and control selection

| Step | Potential DKA cases | Potential controls (a) |
|---|---|---|
| Potential candidates | N = 15 454 | N = 105 816 |
| After application of the study criteria on potential DKA cases before matching: type 1 diabetes for at least 365 days before index date; treated with insulin within 365 days before index date; at least 1 HbA1c measurement within 183 days before index date; without pregnancy within 365 days before index date; without off-label use of antihyperglycemic agents indicated for type 2 diabetes only (except for metformin) within 365 days before or on index date | 3400 | NA |
| Control selection via incidence density sampling based on 1:10 matching ratio | 3400 | 34 000 |
| After application of the same study criteria defined above on controls | 3400 | 11 780 |

Abbreviations: DKA, diabetic ketoacidosis; HbA1c, hemoglobin A1c.

(a) A control could not develop DKA before or at the matched index date but could become a case after the index date.

TABLE 2

Selected characteristics of DKA cases and controls

| Characteristic | DKA cases (N = 3400) | Controls (N = 11 780) | p-Value (a) |
|---|---|---|---|
| Calendar year of T1D cohort entry (%) | | | |
| 2007 | 554 (16.3) | 1831 (15.5) | 0.111 |
| 2008 | 578 (17.0) | 1910 (16.2) | |
| 2009 | 367 (10.8) | 1175 (10.0) | |
| 2010 | 414 (12.2) | 1469 (12.5) | |
| 2011 | 376 (11.1) | 1286 (10.9) | |
| 2012 | 370 (10.9) | 1257 (10.7) | |
| 2013 | 283 (8.3) | 1009 (8.6) | |
| 2014 | 228 (6.7) | 847 (7.2) | |
| 2015 | 131 (3.9) | 532 (4.5) | |
| 2016 | 80 (2.4) | 361 (3.1) | |
| 2017 | 19 (0.6) | 103 (0.9) | |
| Age, years, mean (SD) | 42.9 (16.5) | 45.4 (16.4) | <0.001 |
| Sex (%) | | | |
| Female | 1818 (53.5) | 5594 (47.5) | <0.001 |
| Male | 1579 (46.4) | 6184 (52.5) | |
| Unknown | 3 (0.1) | 2 (0.0) | |
| Race (%) | | | |
| Caucasian | 2929 (86.1) | 10 653 (90.4) | <0.001 |
| African American | 346 (10.2) | 584 (5.0) | |
| Asian | 12 (0.4) | 71 (0.6) | |
| Other/Unknown | 113 (3.3) | 472 (4.0) | |
| Annual household income, $, mean (SD) | 41 773 (8302) | 42 935 (8938) | <0.001 |
| Insurance type (%) | | | |
| Commercial | 1306 (38.4) | 6342 (53.8) | <0.001 |
| Medicare | 621 (18.3) | 1519 (12.9) | |
| Medicaid | 514 (15.1) | 727 (6.2) | |
| Other payor type | 134 (3.9) | 353 (3.0) | |
| Uninsured | 202 (5.9) | 255 (2.2) | |
| Unknown | 623 (18.3) | 2584 (21.9) | |
| Geographic region (%) | | | |
| Midwest | 2042 (60.1) | 6792 (57.7) | <0.001 |
| Northeast | 315 (9.3) | 1389 (11.8) | |
| South | 646 (19.0) | 2237 (19.0) | |
| West | 269 (7.9) | 1008 (8.6) | |
| Other/Unknown | 128 (3.8) | 354 (3.0) | |
| Lifestyle risk factors within 365 days before index date (%) | | | |
| Alcohol abuse | 175 (5.1) | 226 (1.9) | <0.001 |
| Controlled substance abuse | 523 (15.4) | 454 (3.9) | <0.001 |
| Health service use within 365 days before index date (%) | | | |
| Visit to endocrinologist | 1789 (52.6) | 7129 (60.5) | <0.001 |
| Visit to primary care | 1781 (52.4) | 4959 (42.1) | <0.001 |
| Chronic comorbidities any time between study start date and index date (%) | | | |
| Cardiovascular disease | 2042 (60.1) | 5927 (50.3) | <0.001 |
| Diabetic microvascular complications | 1981 (58.3) | 5160 (43.8) | <0.001 |
| Chronic liver disease | 244 (7.2) | 435 (3.7) | <0.001 |
| Chronic kidney disease | 859 (25.3) | 1328 (11.3) | <0.001 |
| Dementia | 137 (4.0) | 217 (1.8) | <0.001 |
| Psychiatric disorder | 1743 (51.3) | 3580 (30.4) | <0.001 |
| Autoimmune disorders | 432 (12.7) | 1577 (13.4) | 0.315 |
| Cancer | 377 (11.1) | 1264 (10.7) | 0.575 |
| Acute medical conditions (%) | | | |
| Infection within 7 days before index date | 230 (6.8) | 96 (0.8) | <0.001 |
| Major surgery within 7 days before index date | 45 (1.3) | 6 (0.1) | <0.001 |
| Non-DKA hospitalization within 30 days before index date | 610 (17.9) | 168 (1.4) | <0.001 |
| Treatments (%) | | | |
| Insulin pump within 7 days before index date | 154 (4.5) | 153 (1.3) | <0.001 |
| Insulin type within 7 days before index date | | | |
| Intermediate/long-acting insulin | 218 (6.4) | 309 (2.6) | <0.001 |
| Rapid/short-acting insulin | 285 (8.4) | 512 (4.3) | <0.001 |
| Premixed insulin | 10 (0.3) | 15 (0.1) | 0.061 |
| Other medications within 30 days before index date | | | |
| Systemic steroids | 81 (2.4) | 101 (0.9) | <0.001 |
| Diuretics | 132 (3.9) | 221 (1.9) | <0.001 |
| Antipsychotics | 60 (1.8) | 57 (0.5) | <0.001 |
| Laboratory test results or vital signs within 183 days before index date, mean (SD) (b) | | | |
| HbA1c, % | 9.3 (1.8) | 8.0 (1.4) | <0.001 |
| Random blood glucose level, mg/dl | 194.6 (60.6) | 169.9 (66.3) | <0.001 |
| eGFR, ml/min/1.73 m² | 83.1 (36.9) | 96.4 (30.5) | <0.001 |
| Total cholesterol, mg/dl | 178.5 (46.7) | 173.3 (38.3) | <0.001 |
| Systolic blood pressure, mm Hg | 126.8 (15.6) | 124.4 (13.8) | <0.001 |
| BMI, kg/m² | 26.4 (5.8) | 27.9 (5.7) | <0.001 |
| Height, cm | 169.6 (10.2) | 171.1 (10.2) | <0.001 |
| White blood cell count, ×10³ per microliter | 9.0 (3.3) | 7.6 (2.7) | <0.001 |
| Platelet count, ×10³ per microliter | 268.4 (79.7) | 254.5 (72.4) | <0.001 |
| Temperature, °C | 36.7 (0.3) | 36.7 (0.3) | <0.001 |
| Pulse rate, beats per minute | 85.0 (12.8) | 78.7 (11.9) | <0.001 |
| Respiratory rate, breaths per minute | 17.4 (2.1) | 16.7 (2.1) | <0.001 |
| Hemoglobin, g/dl | 12.6 (2.1) | 13.4 (1.9) | <0.001 |
| Oxygen saturation, SpO2 (pulse oximetry) | 97.6 (1.5) | 97.6 (1.5) | 0.264 |

Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; SD, standard deviation.

(a) Based on univariate analysis.

(b) Based on non-missing values.


Model performance with full set of features

In the training set, XGBoost outperformed the other 4 approaches with an AUC of 0.887, followed by feedforward network (AUC = 0.859), LASSO and logistic regression (AUC = 0.829 for each), and DRF (AUC = 0.808). In the test set, the AUC values ranged from 0.817 to 0.821 among these models and the difference decreased to 0.004 only (Table 3). The 95% CI of accuracy values ranged between 0.812 and 0.839. While the specificity values were higher than 0.9 for all models, the sensitivity values were only as high as 0.4. Consistent with the AUC findings, the differences in accuracy, sensitivity and specificity between flexible and conventional ML approaches were all minimal in the test set (Table 3). The confusion matrices are provided in Table S4.
TABLE 3

The performance of study models with full set of features in the test data set

| Model | AUC (95% CI) | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
|---|---|---|---|---|
| Logistic regression | 0.821 (0.804–0.837) | 0.827 (0.814–0.839) | 0.409 (0.321–0.497) | 0.947 (0.925–0.969) |
| LASSO | 0.821 (0.805–0.838) | 0.827 (0.814–0.839) | 0.407 (0.318–0.496) | 0.948 (0.925–0.970) |
| XGBoost | 0.819 (0.802–0.836) | 0.825 (0.813–0.837) | 0.414 (0.311–0.518) | 0.944 (0.916–0.971) |
| DRF | 0.817 (0.799–0.834) | 0.827 (0.815–0.839) | 0.420 (0.319–0.522) | 0.944 (0.917–0.971) |
| Feedforward network | 0.817 (0.799–0.834) | 0.825 (0.812–0.837) | 0.400 (0.291–0.508) | 0.947 (0.920–0.975) |

Abbreviations: AUC, area under the receiver operating characteristic curve; CI, confidence interval; DRF, distributed random forest; LASSO, least absolute shrinkage and selection operator.
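As a rough consistency check on the Table 3 values, accuracy can be written as a prevalence-weighted average of sensitivity and specificity. The calculation below assumes that the share of cases in the test set equals the overall share, π = 3400 / 15 180 ≈ 0.224; the test-set composition is not reported, so this is an approximation. The logistic regression row is used as the example.

```latex
\text{accuracy} = \pi \cdot \text{sensitivity} + (1 - \pi) \cdot \text{specificity}
\approx 0.224 \times 0.409 + 0.776 \times 0.947 \approx 0.826
```

This is close to the reported accuracy of 0.827. Under the same assumption, classifying every subject as a control would already give an accuracy of about 0.776, which helps explain why a threshold chosen to maximize accuracy yields high specificity and low sensitivity.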


Feature importance

HbA1c level, non-DKA hospitalization, and white blood cell count were each among the top 10 features in all 5 models (Table 4 and Figure S1). Logistic regression and LASSO identified the same top 10 features with slightly different ranks, and most of these are well-established risk factors for DKA. XGBoost and DRF also identified almost the same top 10 features, eight of which were laboratory test results or vital signs, while feedforward network selected a very different set of top 10 features, six of which were social demographics. XGBoost and DRF each shared five of their top 10 features with the conventional ML approaches, while feedforward network shared four.
TABLE 4

Top 10 features by each study model

| Rank | Logistic regression (conventional) | LASSO (conventional) | XGBoost (flexible) | DRF (flexible) | Feedforward network (flexible) |
|---|---|---|---|---|---|
| 1 | Insurance type – uninsured | HbA1c | HbA1c | HbA1c | Race – Asian |
| 2 | HbA1c | Non-DKA hospitalization | Non-DKA hospitalization | White blood cell count | Insurance type – uninsured |
| 3 | Non-DKA hospitalization | Insurance type – uninsured | White blood cell count | Non-DKA hospitalization | Geographic region – Northeast |
| 4 | BMI | BMI | Hemoglobin | Hemoglobin | Race – African American |
| 5 | Pulse rate | Pulse rate | Pulse rate | Pulse rate | Geographic region – West |
| 6 | Psychiatric disorder | Psychiatric disorder | BMI | Random glucose level | Platelet count |
| 7 | Age | Age | Oxygen saturation | Respiratory rate | Gender – female |
| 8 | Calendar year of diabetes cohort entry | Calendar year of diabetes cohort entry | Random glucose level | Platelet count | HbA1c |
| 9 | White blood cell count | White blood cell count | Platelet count | eGFR | Non-DKA hospitalization |
| 10 | Acute infection | Acute infection | eGFR | BMI | White blood cell count |

Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; DRF, distributed random forest; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; LASSO, least absolute shrinkage and selection operator.

In the logistic regression model, there were 16 positive predictors of DKA (i.e., associated with increased risk), such as higher HbA1c and non-DKA hospitalization, and seven negative predictors (i.e., associated with decreased risk), such as older age and higher annual household income (Table S5). For each approach, the AUC of the top 10 feature model in the test set was close to that of the full model, ranging from 0.785 to 0.802 (Figure S2).

DISCUSSION

We evaluated the performance of different ML approaches in a nested case–control study that used an EHR database to identify risk factors for DKA in adults with T1D. We found that prediction of DKA is achievable with either conventional or flexible ML approaches and that the differences in performance among these approaches were minimal. All approaches consistently identified known risk factors for DKA, including HbA1c and non-DKA hospitalization. XGBoost and DRF included more laboratory test results or vital signs in their top 10 features, while feedforward network included more social demographics. The flexible ML approaches only provided the relative importance of each predictor, while logistic regression could also estimate an OR for each predictor. Therefore, interpreting the findings of the flexible ML approaches is challenging.

In this study, we found that flexible ML approaches offered very limited improvement over conventional ones in predicting DKA using an EHR database. In a systematic review comparing the performance of logistic regression with ML for clinical prediction modeling, the AUCs of logistic regression and ML models were similar when comparisons were restricted to studies with low risk of bias. Other studies also reported comparable AUCs between flexible and conventional ML models. Although the accuracy and specificity were above 0.8 in this study, sensitivity was only about 0.4. The main reason for the low sensitivity is the probability threshold we used in constructing the confusion matrix. For simplicity, we chose the threshold that optimized overall accuracy. As a result, the threshold was set relatively high, which led to low sensitivity and high specificity.

In general, the DKA risk factors that were most often selected by the different models are clinically sensible as triggering factors for insulin deficiency or discontinuation of insulin.
Known risk factors for DKA include high HbA1c level, infection, surgery/trauma, younger age, female sex, low BMI, and low socioeconomic status, among others, and both conventional and flexible ML approaches included some of these factors in their top 10 features. However, XGBoost and DRF included more laboratory test results or vital sign measurements than the conventional models, while feedforward network included more social demographics. Because each model uses different theories and algorithms to determine feature importance, a direct comparison cannot be made to explain the different features selected by the various models. Despite this, one possible explanation for the difference is that some of the identified laboratory test results or vital signs reflect underlying causes of DKA, for example increased white blood cell count and pulse rate with infection, or elevated hemoglobin with dehydration. Another possible explanation is that some test results and social demographics are largely interrelated; for example, high random blood glucose level and uninsured status may both be associated with suboptimal diabetes management.

This study has several limitations that should be considered. First, misclassification of DKA was possible, because we could not retrieve medical records to validate the outcome. However, the PPV for the proposed approach of DKA identification was 89%. Second, unlike administrative claims data, EHR data contain no patient enrollment information. To minimize the possibility that the medical encounters in the EHR data were incomplete, the study patients were selected among those with non-emergency clinical activities recorded within 1 year prior to the index date, assuming all medical encounters were captured in the defined time window. In addition, we applied other inclusion and exclusion criteria, which may limit the implementation of these predictive algorithms in a real-world setting.
Third, several laboratory measurements, including white blood cell count, had close to 60% missing data, and the impact of imputing these features on model performance is unknown. However, the high percentages of missing values were driven by controls, suggesting that controls most likely lacked laboratory testing because they had fewer medical conditions that could trigger testing than cases. Fourth, the predictor selection was based on knowledge of DKA and data availability. This a priori feature selection may limit the learning potential of some of the more complex ML algorithms and affect the interpretation of the feature importance results. Fifth, each model uses different theories and algorithms to determine feature importance. Therefore, a principled comparison cannot be made across models to explain the differences in feature selection. Last, the model performance was assessed based on one typical nested case–control study using an EHR database. This needs to be considered when interpreting the generalizability of these results.

Overall, the flexible ML approaches offered very limited performance improvement over conventional ones in predicting DKA using structured data recorded in an EHR data source in this study. Both conventional and flexible ML approaches identified overlapping, but different, top 10 risk factors for DKA. Further research is needed to determine the conditions under which the flexible ML approaches would outperform the conventional ones, and vice versa, and to better understand the reasons for differences in feature importance ranking among these approaches.

CONFLICT OF INTEREST

Lin Li, Chuang‐Chung Lee, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, and Juhaeri Juhaeri are employees of Sanofi. Kristen Sharma was an employee of Sanofi at the time of this study, and is a current employee of Astellas Pharmaceuticals, Inc. Chuntao Wu was an employee of Sanofi at the time of this study, and is a current employee of Alexion Pharmaceuticals, Inc.

AUTHOR CONTRIBUTIONS

Chuntao Wu, Cliona Molony, Lin Li, Chuang‐Chung Lee: Study concept and design. Chuang‐Chung Lee: Data acquisition and statistical analysis. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Interpretation of data. Lin Li: Drafting of the manuscript. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Critical revision of the manuscript for important intellectual content. Chuntao Wu, Juhaeri Juhaeri: Study supervision. Lin Li, Chuang‐Chung Lee: Accountable for accuracy and integrity.

Appendix S1: Supporting Information