
Evaluating machine learning models for sepsis prediction: A systematic review of methodologies.

Hong-Fei Deng1,2, Ming-Wei Sun3, Yu Wang1,2,3,4, Jun Zeng1,2,3,4, Ting Yuan1,2, Ting Li1,2, Di-Huan Li1,2, Wei Chen5, Ping Zhou6, Qi Wang7, Hua Jiang1,2,3,4.   

Abstract

Studies for sepsis prediction using machine learning are developing rapidly in medical science recently. In this review, we propose a set of new evaluation criteria and reporting standards to assess 21 qualified machine learning models for quality analysis based on PRISMA. Our assessment shows that (1.) the definition of sepsis is not consistent among the studies; (2.) data sources and data preprocessing methods, machine learning models, feature engineering, and inclusion types vary widely among the studies; (3.) the closer to the onset of sepsis, the higher the value of AUROC is; (4.) the improvement in AUROC is primarily due to using machine learning as a feature engineering tool; (5.) deep neural networks coupled with Sepsis-3 diagnostic criteria tend to yield better results on the time series data collected from patients with sepsis. The new evaluation criteria and reporting standards will facilitate the development of improved machine learning models for clinical applications.
© 2021 The Authors.


Keywords:  Clinical medicine; Machine learning

Year:  2021        PMID: 35028534      PMCID: PMC8741489          DOI: 10.1016/j.isci.2021.103651

Source DB:  PubMed          Journal:  iScience        ISSN: 2589-0042


Introduction

Sepsis is a significant threat to patients' lives. A meta-analysis estimated that about 31.5 million sepsis and 19.4 million severe sepsis cases occur each year, contributing to 5.3 million deaths worldwide (Fleischmann et al., 2016). This grim picture has worsened during the COVID-19 pandemic, in which most deaths could be traced to sepsis (Alhazzani et al., 2020). In 2016, the Third International Consensus Definition for Sepsis and Septic Shock (Sepsis-3) defined sepsis as “life-threatening organ dysfunction resulting from dysregulated host responses to infection.” It highlighted the mortality risk of sepsis and the necessity of early identification and intervention (Cecconi et al., 2018). Researchers recognize the value of early warning and accurate prediction of sepsis, which give physicians the opportunity to take preventive measures against its devastating consequences. A successful early warning together with the best clinical care provides the best chance to reduce mortality and lower the risk of severe septic shock (Shashikumar et al., 2017; Mira et al., 2017; Singer et al., 2016). Some clinical prognostic tools, such as Sequential Organ Failure Assessment (SOFA), Modified Early Warning Score (MEWS), Systemic Inflammatory Response Syndrome (SIRS), and quick Sequential Organ Failure Assessment (qSOFA), have been developed to predict the risk of death after the onset of sepsis (Raith et al., 2017). However, these tools are not sufficiently reliable, because most values of the tested markers are recorded at ICU admission and can hardly be linked definitively to the onset of infection. Consequently, traditional methods are limited in their ability to identify or predict the onset of sepsis accurately and to make a high-fidelity prognosis. A new methodology is clearly needed. Two earlier systematic reviews evaluated the performance of machine learning models used to predict the occurrence and prognosis of sepsis (Fleuren et al., 2020; Islam et al., 2019).
However, the influence of the evolution of diagnostic criteria has not been discussed. Compared with the old criteria, Sepsis-3 needs more clinical data to complete the SOFA assessment and to confirm infection. In addition, there is no consensus on how to establish a reasonable dataset, how to choose an appropriate feature-treatment method, or how to obtain a dynamic prediction of sepsis development. The goal of this review is to identify the characteristics and shortcomings of the models and methods in previous studies, and to try to establish a unified standard and evaluation tool for machine learning models, in order to guide future model development in medical science toward reliable predictions of this deadly condition.

Results

Studies included

A total of twenty-one studies are included in this review from two hundred and sixty-two potentially eligible papers based on our criteria (Figure 1). Most selected studies focused on early sepsis detection, prediction, and mortality. Only two aimed at predicting severe sepsis (Table 1). We notice that seven studies used data from The Medical Information Mart for Intensive Care (MIMIC) database. Two studies used data from the University of California San Francisco Medical Center database and the Beth Israel Deaconess Medical Center database (UCSF + BIDMC database).
Figure 1

Literature screening flowchart

Table 1

Basic information of the included studies

| Study | Sepsis definition | Target | Data sources | Missing data processing | Training data | Testing data | Validation data |
|---|---|---|---|---|---|---|---|
| Delahanty et al. (2019) | Sepsis-3 | Early prediction of sepsis | 49 urban community hospitals operated by Tenet Healthcare | NR | 1,839,503 | 920,026 | NR |
| Barton et al. (2019) | Sepsis-3 | Detection and early prediction of sepsis | UCSF data + BIDMC data | Carry-forward and replacing by mean | NR | NR | NR |
| Taylor et al. (2016) | Infection + SIRS | Mortality prediction of sepsis | Four emergency departments | K-means | 4222 | NR | 1056 |
| Kam and Kim (2017) | ICD-9 | Detection and early prediction of sepsis | MIMIC-II | Replacing by nearest measured value | 2527236 | | |
| Mao et al. (2018) | SIRS | Sepsis detection | UCSF data + BIDMC data | Carry-forward and replacing by mean | 80% | 20% | NR |
| Taneja et al. (2017) | Clinical adjudication label | Early prediction of sepsis | Carle Foundation Hospital | NR | NR | NR | NR |
| Saqib et al. (2018) | Angus | Early prediction of sepsis | MIMIC-III | Forward-filling | 81% | 10% | 9% |
| Perng et al. (2019) | SIRS + qSOFA | Mortality prediction of sepsis | Chang Gung Research Database | Replacing by median value of the column | 70% | 30% | NR |
| Thottakkara et al. (2016) | Criteria of the Agency for Healthcare Research and Quality | Severe sepsis prediction | DECLARE data | Replacing by mean value | 70% | NR | 30% |
| Bloch et al. (2019) | Infection + SIRS | Early prediction of sepsis | Israel Rabin Medical Center | NR | 75% | 25% | NR |
| Kwon and Baek (2020) | Infection + qSOFA | Mortality prediction of sepsis | Four hospitals of Korea | NR | 74% | 18% | 8% |
| Nemati et al. (2018) | Sepsis-3 | Early prediction of sepsis | Two hospitals within the Emory Healthcare system and an ICU database | NR | 80% | 20% | NR |
| Lauritsen et al. (2020) | Infection + SIRS | Early detection and prediction of sepsis | Four Danish municipalities data | NR | 80% | 10% | 10% |
| Scherpf et al. (2019) | ICD-9 + SIRS | Early prediction of sepsis | MIMIC-III | Linear interpolation and “carry forward/backward” extrapolation | NR | NR | NR |
| Hou et al. (2020) | Sepsis-3 | Mortality prediction of sepsis | MIMIC-III v1.4 | Remove the variables with more than 20% observations missing + multiple imputation method | NR | NR | NR |
| Kong et al. (2020) | Sepsis-3 | Mortality prediction of sepsis | MIMIC-III | Remove the patients with more than 30% predictor variable missing + replace by mean value | NR | NR | NR |
| Bedoya et al. (2020) | SIRS + infection + end organ failure | Early detection of sepsis | ED of a quaternary academic hospital | NR | NR | NR | NR |
| van Doorn et al. (2021) | Infection + SIRS/SOFA | Mortality prediction of sepsis | ED at the Maastricht University Medical Center+ | NR | 1244 | NR | 100 |
| Li et al. (2021) | ICD-9 | Mortality prediction of sepsis | MIMIC-III v1.4 | Remove the patients with data missing more than 30% + replace by mean value | NR | NR | NR |
| Burdick et al. (2020) | SIRS | Early severe sepsis prediction | The Dascena Analysis Dataset and the Cabell Huntington Hospital Dataset | Last-one carry forward | NR | NR | NR |
| Qi et al. (2021) | Sepsis-3 | Mortality prediction of sepsis | MIMIC-III | Remove the patients with data missing more than 40% + replace by 21% and mean value | NR | NR | NR |

Abbreviation: SIRS: Systemic Inflammatory Response Syndrome; ICD9: International Classification of Diseases 9; NR: not reported.

Thirteen studies described how they preprocessed the clinical data, using methods that included filling missing data with the mean, median, or nearest measured value, K-means clustering, forward-filling, linear interpolation, and carry forward/backward extrapolation. Twelve studies provided detailed descriptions of sample sizes or of the proportions between training groups and test/validation groups. However, not a single study discussed the rationale for adopting its methods. Only six studies adopted the latest Sepsis-3 definition; the others used older criteria (SIRS, ICD [the International Classification of Diseases definition], Angus, or the criteria of the Agency for Healthcare Research and Quality).

SIRS: heart rate >90 beats/min; body temperature >38°C or <36°C; respiration rate >20 breaths/min or PaCO2 <32 mm Hg; white blood cell count >12 × 10^9/L or <4 × 10^9/L.

Sepsis-3: infection + SOFA ≥2.
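The two definitions quoted above can be read as simple rule checks. The following is a minimal Python sketch of that reading, not code from any of the reviewed studies. The function and parameter names are hypothetical, the "at least two criteria" cutoff for SIRS is the conventional rule and is an assumption here (the thresholds above do not state it), and the infection flag and SOFA score are assumed to be computed elsewhere.

```python
from typing import Optional


def sirs_count(heart_rate: float, temp_c: float, resp_rate: float,
               wbc_1e9_per_l: float, paco2_mmhg: Optional[float] = None) -> int:
    """Count how many of the four SIRS criteria listed above are met."""
    criteria = [
        heart_rate > 90,                               # beats/min
        temp_c > 38.0 or temp_c < 36.0,                # degrees Celsius
        resp_rate > 20 or (paco2_mmhg is not None and paco2_mmhg < 32),
        wbc_1e9_per_l > 12.0 or wbc_1e9_per_l < 4.0,   # x 10^9 cells/L
    ]
    return sum(criteria)


def meets_sirs(count: int, min_criteria: int = 2) -> bool:
    # Assumption: conventional ">= 2 criteria" cutoff, not stated in the text.
    return count >= min_criteria


def meets_sepsis3(suspected_infection: bool, sofa: int) -> bool:
    # As summarized in this review: infection plus SOFA >= 2.
    return suspected_infection and sofa >= 2


# Hypothetical patient values, for illustration only.
n = sirs_count(heart_rate=104, temp_c=38.6, resp_rate=18, wbc_1e9_per_l=9.5)
print(n, meets_sirs(n), meets_sepsis3(True, 1))
```

A patient can satisfy one definition and not the other, as in the example values, which is one reason cohorts built on different definitions are hard to compare.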

Quality evaluation

Using the Joanna Briggs Institute Critical Appraisal (JBI) tool, Kwong et al. proposed a method to evaluate the quality of machine learning research. JBI is a checklist for cross-sectional research that Islam et al. and Kwong et al. adopted to evaluate the quality of machine learning studies (Kwong et al., 2019; Islam et al., 2018). It consists of eight items. We first applied their tool to the included studies; the results are shown in Table 2.
Table 2

Quality evaluation of included studies

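| Study | Inclusion criteria | Data preprocessed | Data source and collection | The source of the feature | Ethical issue | Detail discussion | Measurement of models' performance | Cross-validation/evaluation method |
|---|---|---|---|---|---|---|---|---|
| Delahanty et al. (2019) | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| Barton et al. (2019) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Taylor et al. (2016) | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| Kam and Kim (2017) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| Mao et al. (2018) | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Taneja et al. (2017) | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| Saqib et al. (2018) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| Perng et al. (2019) | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Thottakkara et al. (2016) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Bloch et al. (2019) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Kwon and Baek (2020) | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| Nemati et al. (2018) | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| Lauritsen et al. (2020) | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| Scherpf et al. (2019) | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| Hou et al. (2020) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| Kong et al. (2020) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Bedoya et al. (2020) | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| van Doorn et al. (2021) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Li et al. (2021) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Burdick et al. (2020) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Qi et al. (2021) | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |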

Annotation: The contents of the JBI checklist have been tweaked to better fit machine learning research.


Prediction in time

Because sepsis progression is time-sensitive, a good predictive model should report its accuracy at different time points. However, only seven studies provided this information (Table 3).
Table 3

Prediction (AUROC) of each model at different hours in the sepsis studies

Different hours (table columns): −48, −24, −12, −10, −8, −6, −5, −4, −3, −2, −1, −0.25, 0

| Study | Model | Algorithm | AUROC values (in column order, earliest hour first) |
|---|---|---|---|
| Delahanty et al. (2019) | RoS | Gradient boosting | 0.97, 0.93 |
| Barton et al. (2019) | MLA | Gradient boosted trees | 0.83, 0.84, 0.88 |
| Kam and Kim (2017) | SepLSTM | Long short-term memory | 0.93, 0.94, 0.96, 0.99 |
| Bloch et al. (2019) | SVM-RBF | SVM-RBF | 0.8141, 0.8879, 0.8807, 0.8639, 0.8675 |
| Nemati et al. (2018) | Weibull-Cox proportional hazards | Weibull-Cox proportional hazards | 0.79, 0.8, 0.81, 0.82 |
| Lauritsen et al. (2020) | CNN-LSTM | CNN-LSTM | 0.752, 0.792, 0.842, 0.879 |
| Scherpf et al. (2019) | RNN | RNN | 0.76, 0.79, 0.81 |

Abbreviation: RoS: Risk of Sepsis; MLA: machine learning algorithm; LSTM: long short-term memory; SVM-RBF: support vector machines with radial basis function; CNN-LSTM: convolutional neural network-long short-term memory; RNN: recurrent neural network.


Performance in predictions

Within individual studies, the AUROC of machine learning models was mostly above 0.8, and in some studies above 0.9, which is considerably higher than that of the traditional predictive tools, whose AUROC was around 0.7 (Table S1). Two studies also addressed sepsis detection. Their models showed AUROC values around 0.9 (Table 3), demonstrating a strong ability to distinguish sepsis from non-sepsis patients at 0 h. We are therefore confident that machine learning algorithms can effectively predict sepsis.
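Since AUROC is the metric used to compare models throughout this review, a small worked example may help. The sketch below computes AUROC from its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and risk scores are hypothetical and are not taken from any reviewed study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = sepsis, 0 = no sepsis, with model risk scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(round(auroc(labels, scores), 3))
```

In this toy example 11 of the 12 positive-negative pairs are ordered correctly. The pairwise loop is quadratic in the number of cases, which is fine for illustration but not for large cohorts.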

Time sensitivity

The predictions can be divided into three categories: (1.) using only one model, (2.) using more than one model, and (3.) using the best model among several for the prediction. Most of the 21 included studies belong to the third category and focused on predicting the early occurrence of sepsis. Based on the information collected, the prediction performance of the third category at different hours is shown in Figure 2. Here, we draw AUROC trend lines for five studies and find that model performance increases notably as the time gets closer to the onset of sepsis. The ideal time period for early sepsis prediction ranges from 0 to 24 h before onset.
Figure 2

Predicting performance of multi-time points, related to Table 3
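A trend line of the kind described for this figure can be obtained with an ordinary least-squares fit of AUROC against prediction time. The sketch below shows the calculation under stated assumptions: the hours and AUROC values are hypothetical placeholders, not values read from Table 3, and `fit_line` is an illustrative helper rather than the authors' procedure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical values: hours relative to onset (negative = before onset).
hours = [-12, -6, -3, 0]
aurocs = [0.76, 0.80, 0.84, 0.88]
slope, intercept = fit_line(hours, aurocs)
print(f"AUROC change per hour closer to onset: {slope:.4f}")
```

A positive slope corresponds to the pattern reported here, with performance rising as the prediction time approaches onset.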


Mortality prediction

Eight studies targeted the prediction of sepsis mortality in emergency departments or ICUs; we list the models, algorithms, AUROC, and prediction time of seven of them in Table 4. These researchers tried many algorithms to build their predictive models. In the study of 28-day mortality prediction by Perng et al., convolutional neural networks (CNN) + SoftMax achieved AUROC = 0.92, the highest among all the models in that study. The same study also predicted 72-h mortality, where CNN + SoftMax was again the best model (AUROC = 0.94). Li et al. used Gradient Boosting Decision Tree (GBDT) and random forest (RF) to predict in-hospital mortality. They attained remarkably high AUROC scores (0.992 and 0.980, respectively), demonstrating the excellent predictive ability of ensemble learning and traditional machine learning algorithms in sepsis.
Table 4

AUROC and time points of mortality prediction studies

| Study | Model | Algorithm | AUROC | Time |
|---|---|---|---|---|
| Taylor et al. (2016) | Logistic regression | Logistic regression | 0.755 | 28 days |
| | CART | Classification and regression tree | 0.693 | |
| | Random forest | Random forest | 0.860 | |
| | MEDS score | NR | 0.705 | |
| | CURB-65 score | NR | 0.734 | |
| | REMS score | NR | 0.717 | |
| Perng et al. (2019) | KNN | KNN | 0.84 | 28 days |
| | SoftMax | SoftMax | 0.88 | |
| | PCA + SoftMax | PCA + SoftMax | 0.91 | |
| | AE + SoftMax | AE + SoftMax | 0.90 | |
| | CNN + SoftMax | CNN + SoftMax | 0.92 | |
| Kwon and Baek (2020) | qSOFA scores | NR | 0.78 | 3 days |
| | qSOFA-based machine-learning models | Extreme gradient boosting, light gradient boosting machine, and random forest | 0.86 | |
| Hou et al. (2020) | XGBoost | eXtreme Gradient Boosting | 0.857 | 30 days |
| | Logistic regression | Logistic regression | 0.819 | |
| | SAPS-II scores | Simplified acute physiology score-II | 0.797 | |
| Kong et al. (2020) | LASSO | Least absolute shrinkage and selection operator | 0.829 | In hospital |
| | RF | Random forest | 0.829 | |
| | GBM | Gradient boosting machine | 0.845 | |
| | LR | Logistic regression | 0.833 | |
| | SAPS II | Simplified acute physiology score-II | 0.77 | |
| Li et al. (2021) | GBDT | GBDT | 0.992 | In hospital |
| | LR | Logistic regression | 0.876 | |
| | KNN | k-nearest neighbor | 0.877 | |
| | RF | Random forest | 0.980 | |
| | SVM | Support vector machine | 0.898 | |
| Qi et al. (2021) | XGBoost | Extreme gradient boosting | 0.848 | In hospital |
| | SAPSII | The simplified acute physiology score | 0.777 | |
| | SOFA | Sequential organ failure assessment score | 0.704 | |
| | SIRS | Systemic inflammatory response syndrome | 0.609 | |
| | qSOFA | Quick sequential organ failure assessment | 0.580 | |

Abbreviation: CART: classification and regression tree; MEDS: mortality in emergency department sepsis score; KNN: K nearest neighbor; REMS: rapid emergency medicine score; CURB-65 score: the confusion, urea nitrogen, respiratory rate, blood pressure, 65 years of age and older; PCA: principal component analysis; AE: Autoencoder; CNN: Convolutional Neural Network; qSOFA: quick Sequential Organ Failure Assessment.


Feature engineering

All studies collected vital signs and laboratory data. For vital signs, researchers collected body temperature, heart rate, blood pressure, and respiratory rate. For laboratory data, researchers collected white blood cell count, lactic acid, and similar measurements. A few studies also included demographic characteristics, clinical scores, and other features. We show ten representative studies and list their features in Table 5.
Table 5

Feature engineering and included features of each study

| Study | Number of initial features | Number of final features | Included features |
|---|---|---|---|
| Delahanty et al. (2019) | 217 | 13 | Lactic acid (max), Shock index age (last), WBC count (max), Lactic acid (change), Neutrophils (max), Glucose (max), Blood urea nitrogen (max), Shock index age (first), Respiratory rate (max), Albumin (last), Systolic blood pressure (min), Serum creatinine (max), Temperature (max) |
| Barton et al. (2019) | 6 | 6 | SpO2, heart rate, respiratory rate, temperature, systolic blood pressure, diastolic blood pressure |
| Taylor et al. (2016) | 566 | 20 | Oxygen saturation, Respiratory rate, Blood pressure, BUN, Albumin, Intubation, Procedures (in ED), Need for vasopressors, Age, RN resp care, RDW, Potassium, AST, Heart rate, Acuity level (triage), ED impression (Dx), CO2 (Lab), ECG performed, Beta-blocker (Home Med), Cardiac dysrhythmia (PMHx) |
| Kam and Kim (2017) | 9 | 9 | Systolic pressure, pulse pressure, heart rate, body temperature, respiratory rate, WBC count, pH, blood oxygen saturation, age |
| Mao et al. (2018) | 6 | 6 | Systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, temperature, peripheral capillary oxygen saturation |
| Taneja et al. (2017) | 31 | NR | TNF-α, IL-1β, GCSF, IL-6, PCT, sTREM1, IL18, MMP9, TNFR1, TNFR2, IP10, MCP1, IL-1ra, NA, CD64, WBC, Lactic Acid, Systolic Blood Pressure, Diastolic Blood Pressure, Pulse, Temperature, Respirations, PCO2, Age, Gender, Bilirubin, Glasgow Coma Scale, Creatinine, Platelet, SOFA score, qSOFA score |
| Saqib et al. (2018) | 47 | 34 | White blood cell count, Heart rate, Diastolic blood pressure, Systolic blood pressure, Mean blood pressure, Weight, Anion gap, Bicarbonate, Oxygen saturation, Height, Temperature, pH |
| Bloch et al. (2019) | 20 | 4 | The number of trend changes in respiratory rate and arterial pressure, the minimal change in respiratory rate, and the median change in heart rate |
| Kwon and Baek (2020) | 14 | NR | Age, sex, diagnoses at the ED, systolic blood pressure, respiration rate, mental status, body temperature, heart rate, arterial partial pressure of carbon dioxide, white blood cell count, duration of hospitalization, ICU admission, mechanical ventilation, mortality |
| Nemati et al. (2018) | 65 | 65 | RRSTD, MAPSTD, HRV1, BPV1, HRV2, BPV2, MAP, HR, O2Sat, SBP, DBP, RESP, Temp, GCS, PaO2, FIO2, WBC, Hemoglobin, Hematocrit, Creatinine, Bilirubin and Bilirubin direct, Platelets, INR, PTT, AST, Alkaline Phosphatase, Lactate, Glucose, Potassium, Calcium, BUN, Phosphorus, Magnesium, Chloride, B-type BNP, Troponin, Fibrinogen, CRP, Sedimentation Rate, Ammonia, pH, pCO2, HCO3, Base Excess, SaO2, Care Unit (Surgical, Cardiac Care, or Neuro intensive care), Surgery in the past 12 h, Wound Class (clean, contaminated, dirty, or infected), Surgical Specialty (Cardiovascular, Neuro, Ortho-Spine, Oncology, Urology, etc.), Number of antibiotics in the past 12, 24, and 48 h, Age, CCI, Mechanical Ventilation, maximum change in SOFA score over the past 6 h |
| Hou et al. (2020) | 22 | 11 | Urine output, lactate, BUN, sysbp, INR, age, cancer, SpO2, sodium, AG, creatinine |

Annotation: Taneja et al. (2017) and Kwon and Baek (2020) each established several models with different numbers of included features, so all features are provided. The Saqib et al. (2018) study provides only partial features.

Abbreviation: WBC count: white blood cell count; BUN: blood urea nitrogen; RDW: red blood cell distribution width; AST: aspartate aminotransferase; ED: emergency department; ECG: electrocardiogram; SOFA: Sequential Organ Failure Assessment; qSOFA: quick Sequential Organ Failure Assessment; RRSTD: standard deviation of respiratory rate intervals; MAPSTD: standard deviation of mean arterial pressure; HRV1: average multiscale entropy of respiratory rate; BPV1: average multiscale entropy of mean arterial pressure; HRV2: average multiscale conditional entropy of respiratory rate; BPV2: average multiscale conditional entropy of mean arterial pressure; MAP: Mean Arterial Blood Pressure; HR: Heart Rate; O2Sat: Oxygen Saturation; SBP: Systolic Blood Pressure; DBP: Diastolic Blood Pressure; RESP: Respiratory Rate; Temp: Temperature; GCS: Glasgow Coma Scale; PaO2: Partial Pressure of Arterial Oxygen; FIO2: Fraction of Inspired O2; INR: International Normalized Ratio; PTT: Partial Thromboplastin Time; BNP: B-type Natriuretic Peptide; CCI: Charlson Comorbidity Index; sysbp: systolic blood pressure; AG: anion gap.

In general, feature preprocessing can be divided into two categories. The studies in category one used feature engineering methods to identify the key features for machine learning. For example, Bloch et al. recorded four vital signs six times an hour and calculated summary statistics such as the median and mean. They obtained 20 features and selected the 4 most important for their machine learning models (Bloch et al., 2019).
The studies in category two relied on the researchers' expertise to choose which variables should be used to build the models. For example, Barton et al. used six factors, including heart rate and respiratory rate, to develop their models for predicting sepsis occurrence (Barton et al., 2019). Mao et al. chose data that are easily available in the intensive care unit and emergency department as features (Mao et al., 2018).

Discussion

As the first attempt to systematically review the methodologies of sepsis prediction studies, we find that most studies focused on early prediction and detection of sepsis and on mortality. Beyond the results presented above, there are nine issues that we would like to address in this review.

Diagnostic criteria

The studies that adopted an old sepsis definition or improper inclusion criteria performed adequately. However, they were viewed as too lax in sample inclusion and as lacking sufficient specificity and sensitivity. For example, Mao et al. and Bloch et al. included patients over 18 years old with few other restrictions and selected SIRS as the diagnostic criteria (Mao et al., 2018; Bloch et al., 2019). Both used large datasets and had enough patients meeting the SIRS criteria, which led to high AUROC values. Compared with older diagnostic criteria, the latest Sepsis-3 includes more stringent clinical features and describes sepsis more accurately. The large disparities among sepsis definitions make it impossible to compare the AUROC of each study to find the best machine learning model. However, it should be noted that Kam and Kim used the long short-term memory (LSTM) model to predict sepsis occurrence and obtained a high AUROC value (Kam and Kim, 2017). In addition, Perng et al. selected a 1D CNN combined with SoftMax, which significantly improved mortality prediction compared with the traditional predictive models in the same study. The CNN model reached an AUROC of 0.92, whereas traditional models such as KNN reached only 0.84 (Perng et al., 2019). This is because deep learning algorithms can remove many redundant dimensions by self-learning (Kam and Kim, 2017; Mücke et al., 2021), and multiclass classification problems can be handled with SoftMax. We also noted that Li et al. reported that GBDT and RF predicted sepsis mortality well. GBDT is an ensemble learning method and may correct the training results and reduce the degree of overfitting through a regularization function (Chen et al., 2020). However, other studies reported conflicting, weaker performance for random forest (Table 4); thus, further studies are needed to determine the robustness of RF for sepsis prediction.

Prediction time

Prediction times differed among the studies, and AUROC changed with time (Figure 2). These characteristics are consistent with clinical experience: the closer to the onset of sepsis, the more accurately the model predicts. The results of one study, by Delahanty et al., were inconsistent with this pattern (Delahanty et al., 2019). This study used inappropriate inclusion criteria and concluded that neutrophil count had a negative effect on the RoS model, which is contrary to the pathophysiological mechanism. Therefore, we think the robustness of RoS should be examined further. Because of the huge heterogeneity among the studies predicting sepsis mortality, we cannot compare them reasonably to derive similar rules.

Importance of feature engineering

Machine learning studies commonly involve two major steps. The first is to extract features from the input samples, and the second is to feed the feature vectors into the machine learning algorithm for training and prediction. It is especially important to select key features and reduce the data dimension, which is known as feature engineering or order reduction. Feature engineering can not only significantly reduce redundant information and improve computational efficiency but also minimize the negative influence of complex data dimensions on model robustness (Dai et al., 2020). There are two categories of feature engineering among the studies: one is designed by domain expertise, while the other is designed using machine learning methods (Miotto et al., 2018).

Designed by domain expertise

Three studies chose features that were common and easily obtained in hospitals, or relied on the researchers' clinical experience. For example, several features that are common in intensive care units (ICU) or emergency departments (systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, and body temperature) were selected in the studies by Barton et al. and Mao et al. (Barton et al., 2019; Mao et al., 2018). However, relying only on clinical expertise can introduce strong subjectivity and may overlook some key features (Garcia et al., 2014). Even though the models performed adequately, their outcomes were difficult to validate with external data; therefore, the applicability of these models is limited.

Designed by machine learning methods

Traditional reduced-order or feature-extraction algorithms, represented by principal component analysis (PCA) and autoencoder methods, can reduce data dimensions and significantly improve model performance. Perng et al. increased the accuracy of the support vector machine (SVM) from 74.33% to 78.91% by using PCA to preprocess the data. Meanwhile, the AUROC of SoftMax increased from 0.88 to 0.91 (Perng et al., 2019). Similarly, Thottakkara et al. improved accuracy and AUROC after using PCA to preprocess their data (Thottakkara et al., 2016). In addition, the deep feedforward neural network (DFN) can independently learn and obtain the most crucial features. For instance, Kam and Kim used DFN to detect early sepsis (Kam and Kim, 2017). They also adopted LSTM, a deep recurrent neural network, to learn long-range dependencies and handle the vanishing gradient problem. As a result, the accuracy, sensitivity, and AUROC of LSTM were the highest, and the number of final features was the smallest.

Data granularity

Before processing data, researchers should first establish the definition for data granularity. The degree of data refinement and predictive performance can be improved by changing the granularity level (Dormosh et al.,2020). Here, we screened studies that refined data. In the 21 studies we included in this review, only Bloch et al. reported and discussed the issue of data granularity in detail, and the others did not mention this essential information at all (Bloch et al.,2019). Bloch et al. selected four features at the early stage and then expanded them to 20 features by calculating mean, median, minimum, maximum, and standard deviation. As a result, they obtained four satisfying features by ranking the features' importance. To some extent, it is another type of feature engineering used to explore intrinsic regularities of the clinical data.
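The expansion step described above, in which each raw signal is summarized by five statistics over a time window, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Bloch et al.: the signal names and sample values are hypothetical, the sample standard deviation is an arbitrary choice, and the subsequent importance ranking is not shown.

```python
import statistics


def window_features(signals):
    """Expand each raw signal into five summary statistics for one window."""
    features = {}
    for name, values in signals.items():
        features[f"{name}_mean"] = statistics.mean(values)
        features[f"{name}_median"] = statistics.median(values)
        features[f"{name}_min"] = min(values)
        features[f"{name}_max"] = max(values)
        features[f"{name}_std"] = statistics.stdev(values)  # sample std (assumption)
    return features


# Hypothetical one-hour window sampled six times; names are placeholders.
window = {
    "heart_rate": [88, 92, 95, 91, 97, 101],
    "resp_rate": [18, 19, 22, 21, 24, 23],
    "arterial_pressure": [82, 80, 78, 79, 75, 74],
    "temperature": [37.2, 37.4, 37.6, 37.9, 38.1, 38.2],
}
feats = window_features(window)
print(len(feats))  # 4 signals x 5 statistics
```

Changing the window length or sampling frequency changes the granularity of the derived features, which is the design decision this section argues should be reported.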

Missing data

Predictive models built on data with unhandled missing values are inadequate (Beaulieu-Jones et al., 2018). Twelve studies reported how they dealt with missing data; six of them filled the missing data with the mean or median value, and three used methods such as linear interpolation or “carry forward/backward” extrapolation (Table 1). There is a consensus that missing data should be processed before conducting any analysis (Mehrabani-Zeinabad et al., 2020), but in machine learning research on sepsis this standard practice has clearly not been widely followed.
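To make the differences between these strategies concrete, the sketch below applies carry-forward, mean filling, and linear interpolation to the same short series. It is a minimal illustration with hypothetical values and helper names; it assumes equally spaced measurements and marks missing values with `None`.

```python
def carry_forward(values):
    """Replace each missing value (None) with the last observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled  # gaps before the first observation stay None


def fill_with_mean(values):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def linear_interpolate(values):
    """Fill interior gaps on a straight line between neighbouring observations."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # gaps outside the observed range stay None


# Hypothetical lactate series with missing measurements.
lactate = [None, 1.0, None, None, 4.0, None]
print(carry_forward(lactate))
print(fill_with_mean(lactate))
print(linear_interpolate(lactate))
```

The three methods give different values for the same gaps, so the choice of method changes what the model sees, which is why the rationale for that choice matters.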

Machine learning algorithms

Various algorithms were applied, and their predictive performance is summarized in Tables 3 and 4, respectively. These consisted of popular current machine learning algorithms, including logistic regression, decision tree, support vector machine, random forest, and deep learning algorithms in supervised learning. The other category is unsupervised learning, including principal component analysis, K-means clustering, and autoencoder. Regardless of the influence of diagnostic criteria, the neural network-based algorithm performed better on average in sepsis detection, mortality, and early prediction. GBDT, which is a kind of classical and popular ensemble algorithm, may also have a broad prospect in sepsis prediction.

Continuous dataset

The studies included in this review established outcome prediction models based on cross-sectional data. In contrast, some researchers predicted sepsis by using continuous data (time series). Kamaleswaran et al., Mohammed et al., and van Wyk et al. constructed models for predicting sepsis onset with continuous physiologic data streams (Mohammed et al., 2021; Kamaleswaran et al., 2021a, 2021b; van Wyk et al., 2019). After preprocessing, they fed the data into selected algorithms and obtained the best predictive model. In addition, Kamaleswaran et al. studied the significance of continuous data in predicting sepsis patients' response to volume treatment. We noticed that this study reported better performance with continuous data than with EMR data (Kamaleswaran et al., 2021a, 2021b); high-frequency data contain more information about patients, which may account for this result. This is an interesting finding. In fact, our team is currently conducting similar research, which will be reported later.

Data heterogeneity

Predictive models of sepsis were mostly based on large databases. However, every patient with sepsis is unique, so there are significant differences among patients. For example, sepsis detection aims to distinguish confirmed sepsis patients from non-sepsis patients, but this classification is difficult in clinical settings because symptoms and treatments differ from patient to patient. Because the discrepancies among individual patients are large, pooling all patients' data into a single dataset for training or testing can be misleading. Therefore, all the models mentioned above lack generalizability and cannot be used widely to assist clinical decision-making (Fohner et al., 2019). Guided by clinical experience, researchers can collect the necessary higher-frequency clinical data every day to observe the dynamic evolution of sepsis. Meanwhile, sepsis progression can be simulated by a neural network-based machine learning model so that patients' prognosis can be predicted. Although, as discussed, feature selection cannot rely solely on physicians' experience, that experience must be integrated when a model is transferred to a new patient; doing so will mitigate the inherent heterogeneity. In a recent study, we developed a deep learning method that integrates clinical knowledge with clinical data to make successful short-term predictions (up to 48 h) for clinical practitioners (Lei et al., 2020). Based on the above discussion, we strongly recommend that future model development incorporate clinical experience into data preprocessing instead of relying solely on routinely collected data. An objective-oriented data preprocessing standard must be established so that the preprocessed data are AI-ready for machine learning use.

Quality reevaluation and reporting standards

Through comparative study, we believe that the JBI is a crude tool for evaluating machine learning methods. Based on the analysis above, we propose a new quality evaluation tool for machine learning methods. (1.) The evaluation methodology should include an appropriate and accurate disease definition, a data preprocessing protocol, and reasonable inclusion criteria. For example, we think that only Sepsis-3 can describe patient conditions accurately and serve as a basis for patient inclusion. (2.) For common problems in clinical data, such as missing data, data redundancy, data collected in different forms, and noisy data, one should develop a protocol that produces standardized or normalized datasets, makes sparse data non-sparse and “smooth,” and improves data granularity. (3.) To avoid data redundancy and improve computational efficiency, the description of feature engineering should state how many types of original features are included, how many key features are selected, and into how many types they are classified. (4.) The process of sample removal and grouping should be shown in a flowchart, with the rationale clearly explained. (5.) The algorithms should be introduced, including the rationale for choosing them based on relevant mathematical and statistical principles. (6.) Every model needs a set of corresponding evaluation criteria. We suggest adopting AUROC as the evaluation standard for model performance. (7.) Finally, a prospective validation process is needed to ensure that the predictive model can be adapted to clinical settings. Based on these new criteria, we reevaluate the models in the 21 studies. We score 1 for any item that meets a criterion above and 0 otherwise. A total score of 8 or more is considered high quality, from 5 to less than 8 average quality, and less than 5 low quality. The quality reevaluation results of this review are shown in Table 2: 10 studies are ranked low and 11 average. There are no high-quality models according to the score table. It is obvious that there are significant differences in data sources, data preprocessing, and feature engineering among sepsis prediction models. In addition, using diverse evaluation indices and predicting sepsis occurrence at different times could result in differing model performance. Naturally, we believe there should be a unified standard to guide machine learning models in clinical research and applications. With reference to the Standards for Reporting Diagnostic Accuracy (STARD) and the quality evaluation table above, we propose a new reporting list for machine learning models in clinical medicine (Table 6).
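The scoring rule described above (1 point per satisfied item, then thresholds at 5 and 8) can be written out as a short sketch. The function name and the input format are assumptions made for illustration; the actual list of scored items is the one in Table 2 and is not reproduced or assumed here.

```python
from typing import Iterable


def quality_grade(item_scores: Iterable[int]) -> str:
    """Map per-item scores (1 = criterion met, 0 = not met) to a quality grade.

    Thresholds follow the text: total >= 8 is high, 5 <= total < 8 is average,
    total < 5 is low. The number of items is not fixed by this function.
    """
    scores = list(item_scores)
    if any(s not in (0, 1) for s in scores):
        raise ValueError("each item must be scored 0 or 1")
    total = sum(scores)
    if total >= 8:
        return "high"
    if total >= 5:
        return "average"
    return "low"


if __name__ == "__main__":
    # Hypothetical study meeting 6 of 9 scored items.
    print(quality_grade([1, 1, 0, 1, 1, 0, 1, 0, 1]))
```

Making the thresholds explicit in this way shows that a score of exactly 5 counts as average and a score of exactly 8 counts as high.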
Table 6

Report standards list of machine learning in clinical medicine

Section and topic | Item | Description
Title/Abstract/Keywords | 1 | Can be identified as machine learning predictive research (keywords such as machine learning, prediction).
Introduction | 2 | Introduce the background, existing problems, and study targets, such as evaluating machine learning models to predict prognosis and the probability of disease occurrence.
Method: research subject | 3 | Inclusion and exclusion criteria, the locations where data were collected, and the time range.
 | 4 | Describe the reasons for patient selection, including symptoms, laboratory results, or the gold standard for the disease.
 | 5 | Describe the gold standard and provide references.
Research data | 6 | Describe whether the study is based on past datasets (retrospective study) or newly collected data (prospective study).
 | 7 | Describe the data collection process.
 | 8 | Describe the process of feature engineering. At a minimum, explain why this approach to feature selection was chosen.
Results: building model | 9 | Provide a flowchart of the inclusion and exclusion process, and describe demographic and clinical characteristics (such as age, sex, height, and weight).
 | 10 | Describe data preprocessing methods, including the handling of missing data and the smoothing of sparse data.
 | 11 | Describe the mathematical theory of the algorithm and its advantages.
 | 12 | Describe the number and names of the features finally included.
Research results | 13 | Describe model performance at different time points (provide at least one evaluation indicator, such as AUROC or accuracy).
Discussion | 14 | Discuss the clinical generalizability of the predictive models, including a discussion of heterogeneity and prospective clinical validation.

Conclusions

Through a systematic review, we find that the number of studies using machine learning to predict the occurrence and mortality of sepsis has grown rapidly in recent years, and that the accuracy of predictions has improved considerably. However, no model can yet be widely adopted in the real world, because a unified validation standard and procedure are lacking and patient cohorts are heterogeneous. In addition, data collected from patients with sepsis are normally high-dimensional and highly heterogeneous, comprising static data as well as structured and unstructured data that evolve in a time-sensitive fashion. Compared with traditional tools, deep neural networks are more suitable for this type of data. The traditional SIRS criteria cannot describe sepsis comprehensively because they lack sufficient features, and should not be used in machine learning studies of sepsis. We note that studies based on Sepsis-3 have only just begun, so further studies are necessary. Hence, the new quality evaluation tool and reporting standard list suggested in this review would help improve the effective use of machine learning methods in clinical medicine.

Limitations of the study

We do not have access to enough medical information on the treatment process of sepsis and therefore cannot evaluate its significance in model development. Moreover, because the studies are limited to very few open-source databases, it is difficult to compare them and have a meaningful discussion. We did not find any study that described the specific influence of data preprocessing, and so have not reached a conclusion on which method is best.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further requests for resources and materials should be directed to and will be fulfilled by the Lead Contact, Hua Jiang (cdjianghua@qq.com).

Materials availability

This study did not yield new unique reagents.

Methods details

Eligibility criteria

Eligible studies should provide a clear data source based on Electronic Medical Records (EMR) or Electronic Health Records (EHR) from the Emergency Department (ED) or Intensive Care Unit (ICU), so that disease and demographic information can be obtained for the patients. In addition, we require the AUROC and a clearly delineated description of each predictive model in order to compare the models and determine which is best. Because the definition of sepsis has changed several times, we require each study to provide at least one accepted definition based on the current standard. We consider only the application of machine learning algorithms to sepsis in adults; the target conditions include early detection and mortality of sepsis and severe sepsis.
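Because AUROC is the quantity used to compare models, a minimal sketch of one standard way to compute it may be helpful. It uses the pairwise (rank-based) formulation: the AUROC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the example data are illustrative assumptions, not values from any reviewed study.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for sepsis (positive), 0 for non-sepsis (negative).
    scores: model outputs, where a higher score means a higher predicted risk.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and risk scores for six patients.
    y_true = [1, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
    print(round(auroc(y_true, y_score), 3))
```

The pairwise loop is quadratic in the number of cases and is meant only to make the definition concrete; library implementations use a sorting-based computation that gives the same value.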

Search strategy

A comprehensive literature search is conducted in PubMed, ScienceDirect, Engineering Index (EI), Web of Science, China National Knowledge Infrastructure (CNKI), and WANFANG DATA for papers published between January 2010 and November 2021. Keywords such as sepsis, machine learning, and prediction are used for the search; the strategy for each database is given in the table “A literature retrieval strategy for sepsis prediction.” All the included papers are screened by two independent reviewers (TL and DHL), at both the title-abstract and full-text stages. All disagreements between the two reviewers are resolved by a third author (TY) and the principal investigators (HFD, HJ). Only papers written in Chinese or English are included.

Evaluating tool and reporting standard

Having established rules for evaluating machine learning models for sepsis prediction, we recognize that, as in clinical medicine, specialized quality evaluation tools and reporting standards are needed to guide research, analogous to those used in evidence-based medicine. Therefore, based on the above analysis, we propose a new quality evaluation tool and a new reporting standard covering data acquisition, algorithm selection, feature engineering, and model building, with reference to the Standards for Reporting Diagnostic Accuracy (STARD). These are more comprehensive than existing tools and standards, and more appropriate for machine learning research in medicine.

Quantification and statistical analysis

This work systematically evaluates the statistical and quantitative analysis methods of published studies. The authors did not perform further quantitative analysis, e.g., meta-analysis.
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data | |
Studies' methodologies and AUROC of prediction | Contained in the article | N/A
Other | |
MIMIC database | MIMIC database | https://mimic.mit.edu/

A literature retrieval strategy for sepsis prediction

Databases | Search strategy
PubMed | ((sepsis [Title/Abstract]) and (machine learning [Title/Abstract])) and (prediction [Title/Abstract])
ScienceDirect | Title, abstract, keywords: sepsis, machine learning, predict
The Engineering Index | (((sepsis) and (machine learning) and (prediction) and (mortality) and (onset)) WN KY)
Web of Science | Title:(sepsis) and Title:(machine learning) and Title:(prediction)
CNKI | ky = 'sepsis' and ky = 'machine learning' and ky = 'prediction'
WANFANG DATA | Title or keywords: “sepsis” and “machine learning” and “prediction”