Convolutional Neural Network Model for Intensive Care Unit Acute Kidney Injury Prediction.

Sidney Le¹, Angier Allen¹, Jacob Calvert¹, Paul M Palevsky², Gregory Braden³, Sharad Patel⁴, Emily Pellegrini¹, Abigail Green-Saxena¹, Jana Hoffman¹, Ritankar Das¹.

Abstract

INTRODUCTION: Acute kidney injury (AKI) is common among hospitalized patients and has a significant impact on morbidity and mortality. Although early prediction of AKI has the potential to reduce adverse patient outcomes, it remains a difficult condition to predict and diagnose. The purpose of this study was to evaluate the ability of a machine learning algorithm to predict AKI, as defined by Kidney Disease: Improving Global Outcomes (KDIGO) stage 2 or 3, up to 48 hours in advance of onset using convolutional neural networks (CNNs) and patient electronic health record (EHR) data.
METHODS: A CNN prediction system was developed to use EHR data gathered during patients' stays to predict AKI up to 48 hours before onset. A total of 12,347 patient encounters were retrospectively analyzed from the Medical Information Mart for Intensive Care III (MIMIC-III) database. An XGBoost AKI prediction model and the sequential organ failure assessment (SOFA) scoring system were used as comparators. The outcome was AKI onset. The model was trained on routinely collected patient EHR data. Measurements included area under the receiver operating characteristic (AUROC) curve, positive predictive value (PPV), and a battery of additional performance metrics for advance prediction of AKI onset.
RESULTS: On a hold-out test set, the algorithm attained an AUROC of 0.86 and PPV of 0.24, relative to a cohort AKI prevalence of 7.62%, for long-horizon AKI prediction at a 48-hour window before onset.
CONCLUSION: A CNN-based machine learning model for AKI prediction outperformed XGBoost and the SOFA scoring system in predicting AKI 48 hours before onset, without reliance on serum creatinine (SCr) measurements.

Keywords: acute kidney injury; convolutional neural net; electronic health record data; machine learning; prediction; serum creatinine

Year: 2021 PMID: 34013107 PMCID: PMC8116756 DOI: 10.1016/j.ekir.2021.02.031

Source DB: PubMed Journal: Kidney Int Rep ISSN: 2468-0249

Acute kidney injury (AKI) is a complex syndrome associated with large clinical and financial burdens.1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9 Despite its prevalence in hospitalized patients, and reported incidence as high as 70% in the critically ill, no treatment has been developed to effectively reverse injury to the kidney and restore kidney function. The reasons for this failure have been attributed to delays in diagnosis and intervention,15, 16, 17, 18, 19, 2, 20, 21, 22, 23 the complex nature of the AKI syndrome and the staging of its severity, and its multiple etiologies. Until recently, studies of incidence and outcomes of AKI have produced inconsistent results owing to varying definitions of AKI.24, 25, 26 The Risk, Injury, Failure, Loss, End-stage kidney disease criteria, followed by the AKI Network, and most recently the Kidney Disease: Improving Global Outcomes (KDIGO) criteria, have provided consensus on AKI definition. KDIGO guidelines define AKI as an absolute increase of serum creatinine (SCr) of ≥0.3 mg/dl within 48 hours or a relative increase of ≥50% within 7 days. Doubling of SCr at steady state reflects an approximate 50% decrease in kidney function as evaluated by glomerular filtration rate.
Some studies have suggested that changes in SCr even smaller than 0.3 mg/dl within 48 hours are associated with significant increases in the risk of death, dialysis, and other morbidities,32, 33, 34, 35, 36, 37, 38 and other studies are consistent with worsening outcomes with increasing AKI stage.39, 40, 41, 42 However, increases of SCr are known to lag kidney injury by hours to days after the initial kidney insult, and therefore recognition of AKI is delayed owing to reliance on SCr measurements. Early AKI detection is critical to improving patient outcomes.45, 46, 47, 48 Given that the components necessary for defining and staging AKI are routinely available in the EHR, a number of automated alerts have been developed to predict AKI events before onset. However, these alerts are generally triggered by detecting changes in SCr and urine output alone or in combination. Because a range of kidney injuries can exist before a loss of kidney function can be estimated with these standard laboratory tests, there is great interest in developing methods that can be used to detect AKI in patients at an earlier stage.50, 51, 52, 53, 54, 55, 56 In this article, we describe our methodology for the development of a convolutional neural net (CNN) prediction system that predicts AKI up to 48 hours before onset using patient data extracted from the EHR. The CNN model does not require SCr or urine output values.

Methods

Description of Data

This study uses data from the MIMIC-III version 1.3 data set, collected at Beth Israel Deaconess Medical Center in Boston, Massachusetts, from 2001 to 2012. The MIMIC data set offers a variety of encounter information for more than 40,000 unique patients and includes both structured (e.g., laboratory results) and unstructured (e.g., clinician notes) data. Owing to differences in the storage of patient procedure information, we restrict our study to data collected from 2008 to 2012 using the iMDsoft MetaVision ICU (iMDsoft, Needham, MA) EHR system and do not include data collected from 2001 to 2008 using the Philips CareVue Clinical Information System (Philips Healthcare, Andover, MA). Because the collection of the MIMIC data did not affect patient safety and because all data were anonymized in accordance with the Health Insurance Portability and Accountability Act Privacy Rule, the Institutional Review Boards of Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology waived the requirement for patient consent. From the MetaVision EHR MIMIC encounters, we selected for inclusion stays involving adult patients (i.e., age 18 years or older) with at least one measurement of diastolic blood pressure, systolic blood pressure, temperature, respiratory rate, heart rate, oxygen saturation, and Glasgow Coma Scale. These measurements were selected because they were frequently available and easily collected at the patient bedside, even before clinical suspicion of AKI was present. These were the only direct variables used during the training and testing of the algorithm; clinical notes vectorized with the Doc2Vec algorithm were also used as inputs to the CNN model. Serum creatinine was used as part of the KDIGO criteria, which served as the gold standard for identifying patients with AKI, but it was not used as a model input.
To facilitate the analysis of the 48-hour advance prediction of AKI onset with a 5-hour window of measurements upon which to base such a prediction, we required the patient stay duration to be at least 53 hours long. For convenience and with minimal restrictions, we required that patient encounters lasted no more than 1000 hours. To train and test the algorithm on the broadest possible patient sample, no further inclusion or exclusion criteria were applied. Patients with prevalent AKI, those with chronic kidney disease, and those who received dialysis were therefore included. Inclusion criteria are listed in Figure 1 for 24- and 48-hour prediction windows, and the demographic characteristics of encounters meeting the inclusion criteria are reported in Table 1.
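The stay-length requirement above (5 hours of measurements plus the prediction offset, capped at 1000 hours) can be illustrated with a minimal Python sketch. The function names, the inclusive bounds, and the choice to return None when onset falls too early for a full window are assumptions made for illustration; they are not taken from the study code.

```python
from typing import Optional, Tuple

WINDOW_HOURS = 5        # hours of measurements used per prediction (from the text)
MAX_STAY_HOURS = 1000   # upper bound on encounter length (from the text)


def is_eligible(stay_hours: float, offset_hours: int) -> bool:
    """A stay must cover the prediction offset plus the 5-hour measurement window."""
    return WINDOW_HOURS + offset_hours <= stay_hours <= MAX_STAY_HOURS


def positive_window(onset_hour: float, offset_hours: int) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the measurement window in hours since admission.

    The window runs from 5 + T hours before onset to T hours before onset.
    Returning None when the window would start before admission is an
    assumption of this sketch.
    """
    start = onset_hour - (WINDOW_HOURS + offset_hours)
    end = onset_hour - offset_hours
    if start < 0:
        return None
    return (start, end)
```

For a 48-hour offset, the shortest eligible stay under these assumptions is 5 + 48 = 53 hours, which matches the requirement stated above.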

Figure 1

Table 1

Demographic characteristics of MIMIC-III ICU encounters found in the 48 hour data set and meeting the inclusion criteria of Figure 1

Characteristic		Count	%
Gender	Female	3186	46.71
Gender	Male	3635	53.29
Age (yr):Median 65, IQR (53–77)	18–29	317	4.65
	30–39	307	4.50
	40–49	665	9.75
	50–59	1246	18.27
	60–69	1599	23.44
	70+	2687	39.39
Length of stay (d):Median 5, IQR (4–9)	<3	43	0.63
	3–5	4282	62.78
	6–8	1200	17.59
	9–11	528	7.74
	≥12	768	11.26
Inhospital death	Yes	1747	25.61
Inhospital death	No	5074	74.39
KDIGO stage 2 or 3	Positive	520	7.62
KDIGO stage 2 or 3	Negative	6301	92.38
KDIGO stage 1, 2, or 3	Positive	1410	20.67
KDIGO stage 1, 2, or 3	Negative	5411	79.33

ICU, intensive care unit; IQR, interquartile range; KDIGO, Kidney Disease: Improving Global Outcomes; MIMIC-III, Medical Information Mart for Intensive Care III.

We note that the determination of KDIGO positive or negative was made after the data preprocessing steps described in the Methods section.

Figure 1. Inclusion diagram. Patients were required to be at least 18 years of age and must have at least 1 measurement of at least 1 of the input features. MIMIC-III, Medical Information Mart for Intensive Care III.

Overview of Preprocessing, Training, and Testing

MIMIC-III intensive care unit (ICU) encounter data were gathered as follows: patients in encounters from the MetaVision database in MIMIC-III were required to be at least 18 years of age, and each encounter had to include at least 1 measurement for at least 1 of the required input features. For each prediction offset T, the encounters were filtered such that each encounter lasted between 5 + T hours and 1000 hours. A total of 5 + T hours were required to account for the offset and give the model the required 5 hours of measurements used for prediction. For each prediction offset T, positive example measurements were taken between 5 + T and T hours before onset for prediction, whereas negative example measurements were taken during random 5-hour windows of the patient stays. Onset was defined as the first time that the relevant KDIGO criteria were met during the patient stay. Patient encounters satisfying the inclusion criteria were immediately allocated to training and testing sets. Approximately 90% and 10% of all encounters were randomly allocated to the training and testing sets, respectively, stratified by positive and negative classes so that both sets preserved the class proportions. We binned the data by the hour, imputed missing measurements, and standardized measurements on a variable-by-variable basis. AKI was defined according to KDIGO stage 2 or KDIGO stage 3 criteria, and positive cases were identified as those patients reaching KDIGO stage 2 or stage 3 during the encounter. KDIGO stage 2 or stage 3 classifications were determined for each encounter, along with the corresponding times of KDIGO onset where appropriate. Stage 2 AKI is defined in the KDIGO staging system as an increase in SCr to more than 200% to 300% (>2- to 3-fold) from baseline or urine output <0.5 ml/kg per hour for more than 12 hours.
Stage 3 AKI is defined as an increase in SCr to more than 300% (>3-fold) from baseline, or ≥4.0 mg/dl (≥354 μmol/l), or kidney replacement therapy, or a decrease in estimated glomerular filtration rate to <35 ml/min per 1.73 m² (if <18 years of age), or urine output <0.3 ml/kg per hour for ≥24 hours or anuria for ≥12 hours. In both cases, the smaller of either the Modification of Diet in Renal Disease SCr estimate based on KDIGO 2012 guidelines or the 20th percentile of observed creatinine measurements was used for the baseline creatinine measurement in each patient encounter. Any missing features required for measurement, including missing urine or SCr measures, made a contribution of 0 to the total KDIGO score. A Doc2Vec embedding network was created to vectorize clinical text data. The Doc2Vec algorithm works by creating vectors for the most common words in all the documents and separate vectors for each document. These vectors are trained by selecting a window of words in each document; the corresponding vectors for these words, in addition to the vector for the document that the text came from, predict the next word in the sequence. The resulting document vectors are used as inputs, whereas the word vectors are discarded. The embedding network was prepared on a large collection of midstay clinical notes, ranging from the primary complaint to radiology notes, including everything up to, but not including, the discharge summary, from encounters allocated to the training set. The network embedded texts into 250-dimensional numeric vectors, which served as inputs to the classifiers, alongside the structured data associated with the stays. Any notes dated after the onset of AKI were not used as inputs for the model to ensure that the model used only data found at or before prediction time.
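As a rough illustration of the SCr-based part of the KDIGO labeling described above, the following Python sketch assigns a stage from a single SCr value and a baseline. It is a simplification under stated assumptions: it covers only the SCr fold-change and absolute-value criteria as worded in the text, ignores urine output, kidney replacement therapy, and the pediatric criterion, and treats missing values as contributing 0. The function name is hypothetical.

```python
from typing import Optional


def kdigo_stage_from_scr(scr_mg_dl: Optional[float],
                         baseline_mg_dl: Optional[float]) -> int:
    """Simplified SCr-only staging; 0 means no stage 2 or 3 criterion was met.

    Thresholds follow the wording in the text: more than 2-fold baseline for
    stage 2, more than 3-fold baseline or at least 4.0 mg/dl for stage 3.
    Missing inputs contribute 0, as described for the KDIGO score.
    """
    if scr_mg_dl is None:
        return 0
    if scr_mg_dl >= 4.0:
        return 3
    if baseline_mg_dl is None or baseline_mg_dl <= 0:
        return 0
    ratio = scr_mg_dl / baseline_mg_dl
    if ratio > 3.0:
        return 3
    if ratio > 2.0:
        return 2
    return 0


def is_positive_case(stage: int) -> bool:
    """Positive class in this study: KDIGO stage 2 or stage 3."""
    return stage >= 2
```

A full labeling pipeline would apply such a rule at each time point and take the first time any criterion is met as onset; that step is omitted here.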
Training data were passed to a CNN structure, with hyperparameters optimized on the training set using the Python-based optimization package Talos (Autonomio Talos [Computer software]). Tuned hyperparameters include learning rate, batch size, optimization loss, L1 and L2 regularization coefficients, and the size of dense layers in the model. CNN was chosen instead of a recurrent neural network as it is faster to train and has fewer parameters (M. Blohm et al., unpublished data, 2018). In addition, the window of time from which the structured data were gathered for prediction was relatively short (5 hours). CNN modeling techniques have been found to outperform recurrent neural network modeling techniques with improved generalizability when applied to speech recognition tasks (A.V. Oord et al., unpublished data, 2016). After the end of the training on each fold, network performance was evaluated using the hold-out test set. Results were reported as the average test set performance across cross-validation folds.

Structured Data Preprocessing

Structured data were binned by the hour, with multiple intrahour measurements of the same variable replaced by their average. Missing measurements were handled separately for the training and testing sets using last-observation-carried-forward imputation. Any remaining missing values were filled in using the median of that measurement in the training data. Quantitative data and document vectors were then standardized using the training data such that each feature had a mean of 0 and a variance of 1.
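The preprocessing steps in this section (hourly binning, carrying the last observation forward, median fill, and standardization with training statistics) can be sketched for a single variable as follows. This is a minimal NumPy illustration, not the authors' code; the function names and the convention that leading gaps remain missing until the median fill are assumptions.

```python
import numpy as np


def bin_hourly(times_h, values, n_hours):
    """Average all measurements that fall in the same hour; empty hours are NaN."""
    sums = np.zeros(n_hours)
    counts = np.zeros(n_hours)
    for t, v in zip(times_h, values):
        h = int(np.floor(t))
        if 0 <= h < n_hours:
            sums[h] += v
            counts[h] += 1
    out = np.full(n_hours, np.nan)
    seen = counts > 0
    out[seen] = sums[seen] / counts[seen]
    return out


def locf(series):
    """Carry the last observation forward; leading NaNs are left untouched."""
    out = np.array(series, dtype=float)
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    return out


def fill_and_standardize(series, train_median, train_mean, train_std):
    """Fill remaining gaps with the training median, then z-score with training stats."""
    filled = np.where(np.isnan(series), train_median, series)
    return (filled - train_mean) / train_std
```

Computing the median, mean, and standard deviation on the training set only, and reusing them for the test set, keeps test-set information out of the preprocessing.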

Document Vector Encoding Network and Unstructured Data Preprocessing

To facilitate the use of unstructured text data alongside the structured inputs, we trained a Doc2Vec (Q.V. Le, unpublished data, 2014) embedding network with 250 nodes on 238,468 midstay clinical notes. Document vectors were produced for the text data available from each encounter, using 125 epochs of the Doc2Vec algorithm (to better ensure the stability of the inferred document vectors) and an initial learning rate of 0.01. The number of epochs and the learning rate were chosen through experimentation. Clinical notes dated after AKI onset were excluded from the input when training and testing the CNN.

Training of Neural Network Classifier

Using the Python deep learning library Keras, we constructed a classifier to predict the probability of the presence of AKI at a given offset time from prediction. The classifier uses variants of multichannel, multiheaded attention together with convolutions to extract information from quantitative time series data. A separate network for handling the document vector produced by the Doc2Vec network was combined downstream through concatenation in a fully connected output layer. This allowed the model to incorporate information from both the time series data in the EHRs and the qualitative information found in the clinical notes. Model parameters were optimized using the Nadam optimizer as implemented in the Keras library with a learning rate of 0.0009 and binary cross-entropy loss. A diagram of this neural network architecture is available as Supplementary Figure S1. Owing to the low prevalence of AKI in the data, random oversampling was performed to artificially inflate the positive population. This was performed by picking examples from the positive class at random with replacement until the number of positive examples matched the number of negative examples. To fit the weights of the network with 10-fold cross-validation, we split the training data into 10 subsets of roughly equal size and iteratively used 9 subsets for intrafold training and the final subset for intrafold testing. Model parameters were fit over the course of 50 epochs on the 9 intrafold training subsets, with evaluation on the final subset. For each iteration, we obtained an ROC curve and a battery of performance metrics. We then randomly reset the model parameters before performing another iteration. From cross-validation, we obtained an average ROC curve and average performance metrics, along with standard deviation for the performance metrics.
These results are presented in comparison with an XGBoost classifier and the SOFA score, which has been found to independently predict AKI outcomes62, 63, 64 and therefore serves as a validated comparison measure for AKI prediction. SOFA was computed using all organ systems; any missing inputs required for computation contributed zero points to the total SOFA score. The XGBoost classifier was trained on the same processed training sets (5-hour windows of quantitative, clinical EHR data) and evaluated on the same testing set. The time series data were turned into a list of the binned measurements at the different hours and given to XGBoost as input, requiring no additional feature engineering. Document vectors were not given as input for XGBoost. XGBoost hyperparameters were tuned using a cross-validated grid search on the training data. The grid search covered “gamma,” which sets the minimum loss reduction required to split a tree node further, and “colsample_bytree,” which sets the fraction of features randomly sampled when constructing each tree.
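The random oversampling and 10-fold splitting described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the original positives are kept and extra positives are drawn with replacement until the classes match, the seed and function names are hypothetical, and the text does not say whether oversampling was applied before or after splitting into folds.

```python
import numpy as np


def oversample_minority(X, y, seed=0):
    """Draw positives with replacement until positives match negatives in number."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_extra = len(neg) - len(pos)
    if len(pos) == 0 or n_extra <= 0:
        return X, y
    extra = rng.choice(pos, size=n_extra, replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k folds of roughly equal size."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]
```

One design choice worth noting: applying `oversample_minority` only to the training indices of each fold avoids duplicated positive examples appearing in both the intrafold training and intrafold testing subsets.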
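The conversion of each 5-hour window into a flat input for the tree-based comparator can be illustrated with a short NumPy sketch. The array layout (encounters by hours by features) and the function name are assumptions for illustration; the text states only that the binned hourly measurements were passed as a list without further feature engineering.

```python
import numpy as np


def flatten_windows(windows):
    """Reshape (n_encounters, n_hours, n_features) to (n_encounters, n_hours * n_features)."""
    windows = np.asarray(windows)
    n, hours, feats = windows.shape
    return windows.reshape(n, hours * feats)
```

With 5 hourly bins and 7 measurements per bin, each encounter would become a vector of 35 values, ordered hour by hour.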

Results

The demographic characteristics associated with MIMIC-III ICU encounters meeting the inclusion criteria of Figure 1 are provided in Table 1. The study population consisted of 53.29% men, with a few patients younger than 30 years of age (4.65%) and a substantial percentage of patients aged 70 years or more (39.39%). More than half the patients had stays lasting between 3 and 5 days (62.78%), with a substantial percentage of patients experiencing stays of 12 days or longer (11.26%). The overall in-hospital mortality rate was 25.61%, with 7.62% of encounters meeting the criteria for KDIGO stage 2 or stage 3 at some point during the stay, and 20.67% of stays meeting some stage of the KDIGO criteria at any point during the stay. Performance was evaluated by predicting once in each encounter using 5 hours of data. These data were taken either from a random portion of the stay for negative examples, or from the specified model offset for positive examples. The results from 10-fold cross-validation on the 90% training set are reported in Tables 2 and 3 for 48 and 24 hour predictions, respectively. Test performance is reported for the best performing model, selected by cross-validation of the training data. The CNN model, with the use of the Doc2Vec embeddings of encounter text data, outperformed the XGBoost comparator model and the SOFA score for advance prediction of KDIGO stage 2 or stage 3 onset. We note that, to provide nonsummative performance metrics (i.e., the metrics other than AUROC), we selected an operating point for each model or score that provided a sensitivity nearest to 0.80. The CNN model performed better (AUROC of 0.86 for 24 and 48 hour predictions) when text data were made available through Doc2Vec than when these data were unavailable (AUROC of 0.77 and 0.76 for 24 and 48 hour predictions, respectively). In addition, the quality of prediction was higher for KDIGO stage 2 or stage 3 onset, as compared with the prediction of onset for any of KDIGO stages 1–3.
For corresponding CNN and XGBoost results without oversampling of the minority class, see Supplementary Table S1. Permutation feature importance methods were implemented to provide information on the relative importance of each input variable. A precision-recall curve comparison between the CNN model, the XGBoost model, and the SOFA score is presented in Supplementary Figure S2.
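The operating-point selection and the threshold-dependent metrics reported in Tables 2 and 3 can be sketched in plain Python. This is an illustrative sketch, not the evaluation code used in the study: the function names are hypothetical, ties are broken in favor of the lowest threshold, and the metric formulas assume nonzero denominators.

```python
def threshold_for_sensitivity(scores, labels, target=0.80):
    """Return the score threshold whose sensitivity is closest to the target."""
    n_pos = sum(labels)
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        gap = abs(tp / n_pos - target)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


def confusion_at(scores, labels, threshold):
    """Count (tp, fp, tn, fn) when predicting positive for scores >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            if y:
                tp += 1
            else:
                fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics of the kind listed in the results tables."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": lr_pos / lr_neg,
        "f1": 2 * ppv * sens / (ppv + sens),
    }
```

Fixing sensitivity near 0.80 for every model means the remaining metrics compare the models at a matched detection rate rather than at arbitrary thresholds.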

Table 2

Results from 10-fold cross-validation of predictions 48 hours before onset on the MIMIC-III data set

Performance metric	CNN	XGBoost	SOFA	No Doc2Vec	Stage 1 included	Stage 3 only
AUROC mean (SD)	0.856 (0.034)	0.654 (0.011)	0.701	0.763 (0.035)	0.778 (0.037)	0.819 (0.036)
Sensitivity mean (SD)	0.804 (0.000)	0.798 (0.000)	0.798	0.805 (0.006)	0.806 (0.008)	0.806 (0.000)
Specificity mean (SD)	0.763 (0.057)	0.380 (0.006)	0.441	0.623 (0.064)	0.649 (0.074)	0.679 (0.079)
PPV mean (SD)	0.236 (0.039)	0.095 (0.001)	0.127	0.163 (0.022)	0.310 (0.044)	0.105 (0.023)
NPV mean (SD)	0.975 (0.002)	0.956 (0.001)	0.960	0.970 (0.003)	0.940 (0.006)	0.985 (0.002)
Accuracy mean (SD)	0.765 (0.052)	0.411 (0.005)	0.612	0.638 (0.056)	0.672 (0.062)	0.683 (0.076)
DOR mean (SD)	14.076 (3.779)	2.421 (0.059)	3.123	7.123 (1.899)	8.167 (2.425)	9.566 (3.410)
LR+ mean (SD)	3.558 (0.739)	1.287 (0.012)	1.429	2.191 (0.362)	2.389 (0.478)	2.658 (0.660)
LR− mean (SD)	0.258 (0.021)	0.532 (0.008)	0.458	0.316 (0.035)	0.301 (0.035)	0.288 (0.035)
F1 mean (SD)	0.361 (0.047)	0.169 (0.001)	0.214	0.270 (0.030)	0.444 (0.045)	0.184 (0.036)

AUROC, area under the receiver operating characteristic curve; CNN, convolutional neural network; DOR, diagnostic odds ratio; KDIGO, Kidney Disease: Improving Global Outcomes; LR+, positive likelihood ratio; LR−, negative likelihood ratio; MIMIC-III, Medical Information Mart for Intensive Care III; NPV, negative predictive value; PPV, positive predictive value; SD, standard deviation; SOFA, sequential organ failure assessment.

The CNN model is compared with an XGBoost classifier and the SOFA score. SOFA required no training and thus could be applied to the entire test set at once; hence, no SD is reported. Additional comparison is made to the CNN model without the use of the Doc2Vec network (i.e., without unstructured text data) and for the prediction of KDIGO criteria of any stage.

Table 3

Results from 10-fold cross-validation of predictions 24 hours before onset on the MIMIC-III data set

Performance metric	CNN	XGBoost	SOFA	No Doc2Vec	Stage 1 included	Stage 3 only
AUROC mean (SD)	0.863 (0.009)	0.729 (0.009)	0.727	0.769 (0.028)	0.834 (0.004)	0.867 (0.009)
Sensitivity mean (SD)	0.803 (0.000)	0.801 (0.000)	0.784	0.801 (0.003)	0.798 (0.005)	0.795 (0.000)
Specificity mean (SD)	0.772 (0.021)	0.463 (0.026)	0.537	0.585 (0.066)	0.716 (0.018)	0.785 (0.024)
PPV mean (SD)	0.221 (0.016)	0.111 (0.005)	0.151	0.153 (0.019)	0.359 (0.014)	0.131 (0.014)
NPV mean (SD)	0.978 (0.001)	0.964 (0.002)	0.961	0.968 (0.003)	0.944 (0.001)	0.988 (0.000)
Accuracy mean (SD)	0.773 (0.020)	0.489 (0.024)	0.684	0.602 (0.060)	0.728 (0.014)	0.784 (0.023)
DOR mean (SD)	13.905 (1.617)	3.484 (0.367)	4.200	5.861 (1.440)	10.030 (0.822)	14.396 (2.212)
LR+ mean (SD)	3.545 (0.319)	1.494 (0.073)	1.692	1.970 (0.292)	2.821 (0.178)	3.740 (0.452)
LR− mean (SD)	0.256 (0.007)	0.431 (0.024)	0.403	0.344 (0.038)	0.282 (0.007)	0.261 (0.008)
F1 mean (SD)	0.345 (0.019)	0.194 (0.007)	0.247	0.256 (0.027)	0.494 (0.013)	0.224 (0.020)

AUROC, area under the receiver operating characteristic curve; CNN, convolutional neural network; DOR, diagnostic odds ratio; KDIGO, Kidney Disease: Improving Global Outcomes; LR+, positive likelihood ratio; LR−, negative likelihood ratio; MIMIC-III, Medical Information Mart for Intensive Care III; NPV, negative predictive value; PPV, positive predictive value; SD, standard deviation; SOFA, sequential organ failure assessment.

The CNN model is compared with an XGBoost classifier and the SOFA score. SOFA required no training and thus could be applied to the entire test set at once; hence, no SD is reported. Additional comparison is made to the CNN model without the use of the Doc2Vec network (i.e., without unstructured text data) and for the prediction of KDIGO criteria of any stage.

The CNN model averaged a PPV of 0.24 over cross-validation folds for the 48-hour prediction of KDIGO stages 2 and 3, compared with average PPVs of 0.09 and 0.13 for XGBoost and the SOFA score, respectively (Table 2). The CNN's advantage was smaller (PPV of 0.16) in the absence of text data through Doc2Vec input. The average PPV was highest when the CNN classifier was given access to Doc2Vec input and tasked with 48-hour prediction of KDIGO stages 1–3 (PPV of 0.31). Relative to the 7.62% prevalence of KDIGO stages 2 and 3, positive predictions made by the CNN model enriched KDIGO stage 2 or 3 encounters by a factor of about 3.1 (PPV divided by prevalence, using the values in Table 2), whereas XGBoost and the SOFA score enriched these encounters by factors of about 1.2 and 1.7, respectively. The ROC curve comparison of the 48-hour prediction on the 10% hold-out test set is found in Figure 2. The CNN model, which was provided text data through the Doc2Vec input, performed substantially better than the XGBoost model and SOFA. The XGBoost model and SOFA had similar performance on the test set.
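Because PPV depends on prevalence, it can help to see how it follows from sensitivity, specificity, and prevalence through Bayes' rule. The short Python sketch below is illustrative only; the function name is hypothetical, and applying the rule to the fold-averaged sensitivity and specificity gives an approximation rather than the fold-averaged PPV reported in Table 2.

```python
def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = P(disease | positive test), computed with Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# CNN operating point from Table 2 at the cohort prevalence of 7.62%.
ppv_cohort = ppv_from_rates(0.804, 0.763, 0.0762)

# Same sensitivity and specificity at a hypothetical prevalence of 20%,
# shown only to illustrate how PPV rises with prevalence.
ppv_hypothetical = ppv_from_rates(0.804, 0.763, 0.20)
```

With the same sensitivity and specificity, a higher prevalence yields a higher PPV, which is why PPVs should be read against the prevalence of the cohort in which they were measured.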

Figure 2

ROC curve comparison of prediction performance using a CNN classifier, an XGB classifier, and the SOFA score, 48 hours before AKI onset on the MIMIC-III ICU hold-out data set. AKI, acute kidney injury; AUROC, area under the receiver operating characteristic curve; CNN, convolutional neural network; ICU, intensive care unit; MIMIC-III, Medical Information Mart for Intensive Care III; ROC, receiver operating characteristic; SOFA, Sequential Organ Failure Assessment score; XGB, XGBoost.

Discussion

These experiments reveal that a CNN can predict AKI up to 48 hours in advance of KDIGO stage 2 or stage 3 AKI onset, with AUROC performance superior to that of an XGBoost classifier and the SOFA scoring system (Table 2, Figure 2). Unlike other diseases for which multiple severity scores exist, AKI represents a group of syndromes that are loosely connected by the characteristic rapid drop in estimated glomerular filtration rate found in patients with AKI. With more than 30 definitions of AKI, attempts at a uniform definition for AKI have included the Risk, Injury, Failure, Loss, End-stage kidney disease classification, followed by the AKI Network, and, most recently, the KDIGO criteria. The absence of a consistent, uniform definition may explain the current lack of an AKI-specific risk score that serves as a standard-of-care. To provide context for the performance of their AKI prediction models, previous studies have used the biomarker serum neutrophil gelatinase–associated lipocalin as a comparator, compared their model to other machine learning models, or not included a standard-of-care comparator. In the current study, we compare 2 machine learning models and provide the SOFA score as a comparator. Although the SOFA score was not developed for the purpose of long-horizon AKI prediction, because of the ubiquity of the SOFA score and its previous use in AKI outcome prediction, it serves as a validated comparator for our current approach.62, 63, 64 The XGBoost comparator is similarly important, primarily owing to its broad and successful use in applications for other clinical prediction tasks (e.g., the 2019 Physionet Computing in Cardiology Challenge). The superiority of the CNN classifier over the XGBoost classifier and the commonly used SOFA score is evidenced by key performance metrics, such as AUROC and PPV (Table 2). The PPV performance improvement is of particular importance. Romero-Brufau et al.
have argued that AUROC performance may be misleading for clinicians interested in evaluating the clinical impact of a diagnostic tool, as AUROC does not incorporate information on the prevalence of a condition. In fact, for the same reason, AUROC is useful for comparing the performance of tools retrospectively validated on different data sets. This concern regarding PPV and prevalence is relevant to our study, as we found that the prevalence of KDIGO stages 2 or 3 is roughly 7.6% in the cohort, an estimate consistent with previous epidemiologic studies. The AUROC is a summative metric that may include ranges of operating points that are irrelevant to a given task, whereas PPV can be focused on a clinically relevant operating point. To produce the metrics in Table 2, we chose the operating points for the CNN and the comparators such that their sensitivities were fixed near 0.80. Beyond the text data input through Doc2Vec, CNN predictions were made using only age and 7 routinely collected patient measurements (diastolic blood pressure, systolic blood pressure, temperature, respiratory rate, heart rate, oxygen saturation, and Glasgow Coma Scale) as inputs. Although this study was restricted to the MetaVision (iMDSoft) EHR system for technical reasons, the use of these widely available inputs supports the generalizability of the model to broad clinical practice. Importantly, the CNN model did not rely on SCr to make predictions, distinguishing it from other AKI prediction tools. Creatinine levels can take hours or days to rise to AKI thresholds as defined in the KDIGO staging system; therefore, changes in SCr may reflect preexisting kidney damage. An AKI prediction tool that does not depend on SCr measurements may better afford clinicians the opportunity to intervene early, to prevent AKI development or progression, or to limit further kidney damage. 
Furthermore, using only often collected variables in the EHR for AKI prediction allows automatic screening of a general patient population for impending AKI without requiring specialized evaluation. This study contributes to the growing body of retrospective machine learning literature for the prediction of AKI. Chiofolo et al. developed a model for AKI prediction and surveillance in patients in the ICU for a 6-hour prediction window with an AUROC of 0.88. Flechet et al. developed the AKIpredictor, a prognostic calculator for prediction of AKI in patients in the ICU during the first week of stay. Their KDIGO stage 2 and 3 models produced AUROCs between 0.77 and 0.84. The AUROC of 0.84 corresponds to a prediction of KDIGO stage 2 and 3 after gathering 24 hours of data. As a point of comparison, the CNN model used only 5 hours of data before making a prediction. Recent work by Tomašev et al. pursued a deep learning approach for continuous risk prediction of deterioration in patients with AKI and evaluated their tool on a Veteran’s Health Administration data set of 703,782 adult patients. Algorithm performance for a 48-hour prediction window corresponded to a sensitivity of 55.8% and a specificity of 82.7%. This performance is reported to be in the range required for regulatory approval. Although these studies make important contributions to the domain of AKI research, they depend on the use of SCr to make predictions, which is a lagging marker of kidney function. In contrast, the CNN described in this work does not rely on SCr to make predictions of AKI onset, allowing for both longer lead times and improved predictive performance and for making predictions for patients who may not yet be clinically suspected of having AKI and who have not yet had their SCr measures drawn. 
The CNN also offers improved performance compared with our previous work, which used gradient-boosted trees to predict AKI before onset and included SCr as a model input. In comparison, results from our current work suggest that AKI predictions can be made with a more robust machine learning architecture, without reliance on SCr, while achieving stronger predictive performance.

Although the CNN described in this study offers substantial lead time in AKI identification (up to 48 hours) and improved predictive performance over our previous work, it still requires prospective validation. Furthermore, we cannot determine from this retrospective study what impact the algorithm might have on clinicians and their provision of care in clinical settings, nor can we provide an analysis of how the model's prediction performance evolves over time. Although the CNN model performance was superior to that of SOFA and XGBoost, improvements in PPV achieved by the CNN compared with XGBoost or SOFA are less pronounced without the use of clinical notes. Algorithm performance was evaluated only on patients in the United States older than 18 years with stays in the ICU, which limits the generalizability of our results to other patient populations and levels of care. Although most of the patients in the negative class had an SCr measurement at some point in the ICU stay, it is possible that inclusion of patients missing urine measurements in the negative class led to the misclassification of some patients in our data set. It is also possible that misclassifications occurred for some patients in the data set owing to inclusion of patients with a previous diagnosis of chronic kidney disease or who received dialysis. Owing to the lack of a standard-of-care AKI score, we used the SOFA score and the XGBoost model to provide context for our model performance.
Although the SOFA score has been used in AKI outcome prediction studies [62, 63, 64], it was not developed for the purpose of long-horizon AKI prediction. Furthermore, although the XGBoost comparator was included owing to its use in other clinical prediction tasks, it does not serve as a standard of care for AKI prediction. Last, because several consensus definitions for AKI have been proposed, the algorithm we described may produce different results when compared against non-KDIGO definitions, or in settings that use a different standard in their diagnostic procedures.
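
The operating-point choice described earlier in this discussion (sensitivities fixed near 0.80) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's implementation: the helper names `threshold_for_sensitivity` and `metrics_at`, the rule of predicting positive when the score is at or above the threshold, and the synthetic scores are all hypothetical. Only the 0.80 sensitivity target and the 7.62% prevalence used to simulate labels come from the text.

```python
import numpy as np


def threshold_for_sensitivity(scores, labels, target=0.80):
    """Highest threshold whose sensitivity is at least `target`.

    A case is predicted positive when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(int)
    pos_scores = np.sort(scores[labels == 1])[::-1]  # positives, high to low
    # Number of positives that must score at or above the threshold.
    k = max(1, int(np.ceil(target * len(pos_scores) - 1e-9)))
    return pos_scores[k - 1]


def metrics_at(scores, labels, threshold):
    """Sensitivity, specificity, and PPV at a single operating point."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(int)
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Synthetic data (hypothetical): labels drawn at roughly 7.62% prevalence,
# scores drawn from two overlapping normal distributions.
rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.0762).astype(int)
scores = rng.normal(loc=1.5 * labels, scale=1.0)

thr = threshold_for_sensitivity(scores, labels, target=0.80)
print(metrics_at(scores, labels, thr))
```

Fixing sensitivity for every model, as done for Table 2, means the remaining differences in specificity and PPV can be compared at a matched detection rate.
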

Conclusion

A CNN for AKI prediction outperforms XGBoost and the traditional SOFA scoring system in predicting AKI up to 48 hours before onset, without reliance on measurements of changes in SCr. Although the use of clinical text data through a Doc2Vec network substantially strengthened CNN prediction performance, the CNN outperformed both XGBoost and SOFA even when clinical notes were not included as model inputs, supporting the use of CNN models for the task of AKI prediction. Such a tool may improve prediction and early detection of AKI in clinical settings, thereby allowing for earlier intervention.

Disclosure

SL, AA, JC, EP, AS, JH, and RD are or were employees or contractors of Dascena (Houston, Texas, USA) at the time the work was performed.

72 in total

1. Association between e-alert implementation for detection of acute kidney injury and outcomes: a systematic review.

Authors: Philippe Lachance; Pierre-Marc Villeneuve; Oleksa G Rewa; Francis P Wilson; Nicholas M Selby; Robin M Featherstone; Sean M Bagshaw
Journal: Nephrol Dial Transplant Date: 2017-02-01 Impact factor: 5.992

2. Impact of real-time electronic alerting of acute kidney injury on therapeutic intervention and progression of RIFLE class.

Authors: Kirsten Colpaert; Eric A Hoste; Kristof Steurbaut; Dominique Benoit; Sofie Van Hoecke; Filip De Turck; Johan Decruyenaere
Journal: Crit Care Med Date: 2012-04 Impact factor: 7.598

3. Acute renal failure in the ICU: risk factors and outcome evaluated by the SOFA score.

Authors: A de Mendonça; J L Vincent; P M Suter; R Moreno; N M Dearden; M Antonelli; J Takala; C Sprung; F Cantraine
Journal: Intensive Care Med Date: 2000-07 Impact factor: 17.440

4. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine.

Authors: J L Vincent; R Moreno; J Takala; S Willatts; A De Mendonça; H Bruining; C K Reinhart; P M Suter; L G Thijs
Journal: Intensive Care Med Date: 1996-07 Impact factor: 17.440

5. Electronic Alerts for Acute Kidney Injury.

Authors: Paul M Palevsky
Journal: Am J Kidney Dis Date: 2018-01 Impact factor: 8.860

6. Long-term risk of mortality and acute kidney injury during hospitalization after major surgery.

Authors: Azra Bihorac; Sinan Yavas; Sophie Subbiah; Charles E Hobson; Jesse D Schold; Andrea Gabrielli; A Joseph Layon; Mark S Segal
Journal: Ann Surg Date: 2009-05 Impact factor: 12.969

7. Declining mortality in patients with acute renal failure, 1988 to 2002.

Authors: Sushrut S Waikar; Gary C Curhan; Ron Wald; Ellen P McCarthy; Glenn M Chertow
Journal: J Am Soc Nephrol Date: 2006-02-22 Impact factor: 10.121

8. Intensity of renal support in critically ill patients with acute kidney injury.

Authors: Paul M Palevsky; Jane Hongyuan Zhang; Theresa Z O'Connor; Glenn M Chertow; Susan T Crowley; Devasmita Choudhury; Kevin Finkel; John A Kellum; Emil Paganini; Roland M H Schein; Mark W Smith; Kathleen M Swanson; B Taylor Thompson; Anitha Vijayan; Suzanne Watnick; Robert A Star; Peter Peduzzi
Journal: N Engl J Med Date: 2008-05-20 Impact factor: 91.245

9. Prediction of Acute Kidney Injury With a Machine Learning Algorithm Using Electronic Health Record Data.

Authors: Hamid Mohamadlou; Anna Lynn-Palevsky; Christopher Barton; Uli Chettipally; Lisa Shieh; Jacob Calvert; Nicholas R Saber; Ritankar Das
Journal: Can J Kidney Health Dis Date: 2018-06-08

10. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use.

Authors: Santiago Romero-Brufau; Jeanne M Huddleston; Gabriel J Escobar; Mark Liebow
Journal: Crit Care Date: 2015-08-13 Impact factor: 9.097

10 in total

1. Development and Validation of Machine Learning Models for Real-Time Mortality Prediction in Critically Ill Patients With Sepsis-Associated Acute Kidney Injury.

Authors: Xiao-Qin Luo; Ping Yan; Shao-Bin Duan; Yi-Xin Kang; Ying-Hao Deng; Qian Liu; Ting Wu; Xi Wu
Journal: Front Med (Lausanne) Date: 2022-06-15

2. A LASSO-derived clinical score to predict severe acute kidney injury in the cardiac surgery recovery unit: a large retrospective cohort study using the MIMIC database.

Authors: Tucheng Huang; Wanbing He; Yong Xie; Wenyu Lv; Yuewei Li; Hongwei Li; Jingjing Huang; Jieping Huang; Yangxin Chen; Qi Guo; Jingfeng Wang
Journal: BMJ Open Date: 2022-06-02 Impact factor: 3.006

3. Association between delta anion gap and hospital mortality for patients in cardiothoracic surgery recovery unit: a retrospective cohort study.

Authors: Kai Xie; Chao Zheng; Gao-Ming Wang; Yi-Fei Diao; Chao Luo; Ellen Wang; Li-Wen Hu; Zhi-Jian Ren; Jing Luo; Bin-Hui Ren; Yi Shen
Journal: BMC Surg Date: 2022-05-14 Impact factor: 2.030

4. Machine learning for the prediction of acute kidney injury in patients with sepsis.

Authors: Suru Yue; Shasha Li; Xueying Huang; Jie Liu; Xuefei Hou; Yumei Zhao; Dongdong Niu; Yufeng Wang; Wenkai Tan; Jiayuan Wu
Journal: J Transl Med Date: 2022-05-13 Impact factor: 8.440

5. Machine learning for early discrimination between transient and persistent acute kidney injury in critically ill patients with sepsis.

Authors: Xiao-Qin Luo; Ping Yan; Ning-Ya Zhang; Bei Luo; Mei Wang; Ying-Hao Deng; Ting Wu; Xi Wu; Qian Liu; Hong-Shen Wang; Lin Wang; Yi-Xin Kang; Shao-Bin Duan
Journal: Sci Rep Date: 2021-10-12 Impact factor: 4.379

6. Account of Deep Learning-Based Ultrasonic Image Feature in the Diagnosis of Severe Sepsis Complicated with Acute Kidney Injury.

Authors: Yi Lv; Zhijia Huang
Journal: Comput Math Methods Med Date: 2022-01-31 Impact factor: 2.238