External validation of a prognostic model for intensive care unit mortality: a retrospective study using the Ontario Critical Care Information System.

Fran Priestap, Raymond Kao, Claudio M Martin.

Abstract

PURPOSE: To externally validate an intensive care unit (ICU) mortality prediction model that was created using the Ontario Critical Care Information System (CCIS), which includes the Multiple Organ Dysfunction Score (MODS).
METHODS: We applied the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) recommendations to a prospective longitudinal cohort of patients discharged between 1 July 2015 and 31 December 2016 from 90 adult level-3 critical care units in Ontario. We used multivariable logistic regression with measures of discrimination, calibration-in-the-large, calibration slope, and flexible calibration plots to compare prediction model performance for the entire data set and for each ICU subtype.
RESULTS: Among 121,201 CCIS records with ICU mortality of 11.3%, the C-statistic for the validation data set was 0.805. The C-statistic ranged from 0.775 to 0.846 among the ICU subtypes. After intercept recalibration to adjust the baseline risk, the mean predicted risk of death matched actual ICU mortality. The calibration slope was close to 1 with all CCIS data and ICU subtypes of cardiovascular and community hospitals with low ventilation rates. Calibration slopes significantly less than 1 were found for ICUs in teaching hospitals and community hospitals with high ventilation rates whereas coronary care units had a calibration slope significantly higher than 1. Calibration plots revealed over-prediction in high risk groups to a varying degree across all cohorts.
CONCLUSIONS: A risk prediction model primarily based on the MODS shows reproducibility and transportability after intercept recalibration. Risk adjusting models that use existing and feasible data collection can support performance measurement at the individual ICU level.

Keywords: external validation; intensive care unit (ICU); mortality; prognostic model

Year: 2020 PMID: 32383124 PMCID: PMC7223438 DOI: 10.1007/s12630-020-01686-5

Source DB: PubMed Journal: Can J Anaesth ISSN: 0832-610X Impact factor: 5.063

Benchmarking can be used to identify opportunities for quality improvement.1 Performance or benchmarks can be monitored over time within a single practice, or compared across different practices. These methods for performance measurement and improvement require careful interpretation of the results and awareness of limitations.2 In complex systems, such as intensive care units (ICUs), it can be difficult to compare measures of quality since patients present with heterogeneous illnesses and varied disease severity. Methods have been proposed to account for this heterogeneity, most commonly regression techniques to risk-adjust the measure of interest.3-5 An ideal benchmarking system will use data that are readily available and simple to interpret.6
Ontario is the most populous province in Canada. In 2007, the Critical Care Information System (CCIS) was implemented by the provincial health ministry as part of a strategy to improve the quality and efficiency of the critical care system.7 The CCIS includes a measure of organ dysfunction on ICU admission (Multiple Organ Dysfunction Score [MODS])8 and daily nursing workload measures (Nine Equivalents Nursing Manpower Use Score [NEMS])9; however, these data have not been used to perform risk adjustment, likely because validated models for this purpose are lacking. The ability of MODS to predict mortality has been reported in small, single-centre studies from Canada, Finland, and other countries.10,11 We used CCIS data from the two medical-surgical ICUs in our hospital to develop and internally validate a prediction model for ICU mortality.12 None of these models has been externally validated.
External validation of a prediction model’s performance is an important and necessary process prior to clinical implementation.13-16 Access to “big data” is increasing, as evidenced by analyses of registry databases that contain electronic health records for thousands or even millions of patients from multiple practices and hospitals.17 The CCIS is an example of a large e-health database that includes data from different types of ICUs, and thus provides an opportunity to assess both reproducibility (similar case-mix) and transportability (different but related populations) within the same study.18 The objective of this study was to conduct and report a methodologically sound external validation using guidelines and referenced statistical articles from the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) explanation and elaboration document.19

Methods

Approval for this study was granted by the Western University Research Ethics Board on 15 February 2017. Requirement for consent was waived.

Study design

We used an independent population-based cohort to perform a validation study on a previously published ICU mortality prediction model.12

Data source

We used data from the Ontario CCIS for this study. The CCIS is a web-based data application that uses a combination of methods to capture data. Demographic data can be auto-populated directly from the hospital electronic admission, discharge, and transfer system, but most of the data are manually entered by clerical and clinical staff as appropriate. Data elements used in this study, a subset of those captured in CCIS, are shown in Electronic Supplementary Material (ESM), eTable 1. All ICUs in Ontario are required to enter data into the CCIS for all admissions. Data were obtained for all level-3 ICU admissions between July 1 2015 and December 31 2016. Level 3 ICUs are defined as those providing life support and mechanical ventilation for more than 48 hours. Critical Care Services Ontario has organized the ICUs into groups based on ICU subtype (Table 1). The eligibility criteria, conditions, definitions, and measurements in this validation study were identical to those used in the original development study.

Table 1

CCIS level-3 ICU subtype groups and number of critical care units

Criteria	# Critical care units
Teaching hospitals (medical surgical ICU)	17
Community hospitals (medical surgical ICU) with ventilator patient day rate above the mean rate*	28
Community hospitals (medical surgical ICU) with ventilator patient day rate equal to or less than the mean rate*	23
Cardiac/cardiovascular unit	10
Coronary care units†	10
Burn units	2

*Ventilator patient day rate = (ventilator days/patient care days) * 100 based on fiscal year 2016–2017; mean rate = 43.61%

†Coronary care units that provide invasive ventilation for longer than 48 hr

CCIS = Critical Care Information System; ICU = intensive care unit
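The ventilator patient day rate in the Table 1 footnote can be sketched in a few lines. This is a minimal illustration only: the function names and the "high"/"low" labels are assumptions chosen for the example, and the 43.61% threshold is the mean rate quoted in the footnote.

```python
MEAN_RATE_PERCENT = 43.61  # mean rate quoted in the Table 1 footnote


def ventilator_patient_day_rate(ventilator_days, patient_care_days):
    """Return ventilator days as a percentage of patient care days."""
    if patient_care_days <= 0:
        raise ValueError("patient_care_days must be positive")
    return ventilator_days / patient_care_days * 100


def community_hospital_group(rate_percent, mean_rate=MEAN_RATE_PERCENT):
    """Split at the mean rate: above the mean vs equal to or below it."""
    # "high" and "low" are illustrative labels, not CCIS codes.
    return "high" if rate_percent > mean_rate else "low"
```

A unit with 500 ventilator days over 1,000 patient care days would have a rate of 50% and fall in the group above the mean rate.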

The minimum effective sample size for external validation has been reported as 100 outcome events.20 The data set included over 13,500 deaths. All ICU subtype groups had well over 100 deaths except burn ICUs, which were excluded from the subgroup analyses. The validation data set was first subject to administrative cleaning. We excluded admissions to pediatric and labour and delivery level-3 ICUs. Also excluded were records where patient age was reported as < 18 yr or > 115 yr, length of ICU stay was reported as 0 days (entry errors), or where duplicate MODS and/or NEMS entries were reported. For duplicate records, the record with the later time stamp was selected for linkage with the admission and discharge data. Finally, any records with missing predictor data were omitted from the analyses. Complete case analyses were used to assess model performance. Records with missing data represented approximately 5% of all cases and exclusion of these cases was not considered a threat to the validity of the results.21 The outcome of interest was ICU mortality.
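The cleaning steps described above can be sketched as follows. This is an illustration under stated assumptions, not the study's SAS code: the field names (`unit_type`, `icu_los_days`, `admission_id`, `timestamp`) and the unit-type labels are hypothetical, since the CCIS schema is not given in the text, and time stamps are assumed to be directly comparable values.

```python
# Hypothetical field names and unit labels; the CCIS schema is not given in the text.
EXCLUDED_UNIT_TYPES = {"pediatric", "labour_and_delivery"}
PREDICTOR_FIELDS = ("age", "sex", "nems", "mods", "admission_source",
                    "admitting_diagnosis", "readmission")
MIN_AGE, MAX_AGE = 18, 115  # age limits as stated in the Methods text


def latest_per_admission(score_entries):
    """Keep, for each admission, the score entry with the later time stamp."""
    latest = {}
    for entry in score_entries:
        key = entry["admission_id"]
        if key not in latest or entry["timestamp"] > latest[key]["timestamp"]:
            latest[key] = entry
    return latest


def passes_administrative_cleaning(record):
    """Apply the unit-type, age, and length-of-stay exclusions."""
    if record["unit_type"] in EXCLUDED_UNIT_TYPES:
        return False
    age = record.get("age")
    if age is not None and not (MIN_AGE <= age <= MAX_AGE):
        return False
    if record.get("icu_los_days") == 0:  # treated as an entry error
        return False
    return True


def has_complete_predictors(record):
    """Complete-case rule: every predictor must be present."""
    return all(record.get(field) is not None for field in PREDICTOR_FIELDS)


def build_analysis_set(records):
    """Return (complete cases, cleaned records set aside for missing predictors)."""
    cleaned = [r for r in records if passes_administrative_cleaning(r)]
    complete = [r for r in cleaned if has_complete_predictors(r)]
    missing = [r for r in cleaned if not has_complete_predictors(r)]
    return complete, missing
```

Returning the records with missing predictors separately, rather than discarding them, makes it possible to compare them with the complete cases afterwards.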
Predictor variables, available within the first 24 hr of critical care admission, were defined as follows: 1) age group (18–39, 40–79, ≥ 80 yr); 2) sex (M or F); 3) NEMS group (0–22, 23–29, ≥ 30); 4) MODS group (0, 1–4, 5–8, 9–12, ≥ 13); 5) admission source (operating room/postanesthesia care unit, emergency department, unit/ward, other hospital and other); 6) admitting diagnosis (cardiovascular/cardiac/vascular, respiratory, gastrointestinal, neurologic, trauma, other); and 7) readmission to critical care during the same hospital stay.12 Since we chose to restrict our analyses to variables contained within the CCIS data set, we modified our previously published model 12 by excluding the Charlson Comorbidity Index. eTable 1 (available as ESM) is provided as supplemental digital content and shows the equation for Logit [ICU Mortality].
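The grouping of the three numeric predictors can be sketched as below. The cut-points are those listed in the predictor definitions; the function names and the ASCII group labels are assumptions made for this example.

```python
def age_group(age_years):
    """Map age in years to the three age groups used as a predictor."""
    if age_years < 18:
        raise ValueError("patients younger than 18 yr are excluded")
    if age_years < 40:
        return "18-39"
    if age_years < 80:
        return "40-79"
    return ">=80"


def nems_group(nems):
    """Map a NEMS value to the groups 0-22, 23-29, >=30."""
    if nems < 0:
        raise ValueError("NEMS cannot be negative")
    if nems <= 22:
        return "0-22"
    if nems <= 29:
        return "23-29"
    return ">=30"


def mods_group(mods):
    """Map a MODS value to the groups 0, 1-4, 5-8, 9-12, >=13."""
    if mods < 0:
        raise ValueError("MODS cannot be negative")
    if mods == 0:
        return "0"
    if mods <= 4:
        return "1-4"
    if mods <= 8:
        return "5-8"
    if mods <= 12:
        return "9-12"
    return ">=13"
```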

Statistical analyses

The relatedness of the development and validation data sets was reviewed using two approaches. First, the distribution of context-important patient characteristics, including predictors and outcomes, was compared. Descriptive analyses of these characteristics were performed for the development and validation data sets and for the latter, also stratified by CCIS ICU subtype. Continuous data elements are expressed as mean (standard deviation [SD]) or median [interquartile range (IQR)] as appropriate. Categorical data elements are reported as proportions. Second, to quantify the extent of the relatedness in case-mix between the development and validation samples, a binary logistic regression model (membership model) was created to predict the probability that an individual record belonged to either sample.22 Independent variables were the predictors and outcome from the prediction model. The discriminative ability of the model was quantified using its C-statistic with lower values indicating similarity between the data sets.
Three measures were used to assess the performance of the model in the validation data set: 1) calibration-in-the-large, 2) calibration slope, and 3) discrimination. Calibration-in-the-large represents the level of agreement between observed and predicted mortality. It was calculated as the logistic regression model intercept given that the calibration slope equals 1 (logit(y) = a + logit(ŷ)).22,23 Where calibration-in-the-large was significantly different from 0, intercept recalibration was performed by fitting a new logistic regression model with an intercept only and an offset term for the linear predictor. Calibration slope reflects whether predicted risks are appropriately scaled with respect to each other over the entire range of possible values. It was estimated from the recalibration model equation logit(y) = a + b_overall × logit(ŷ).22,24 Loess-based calibration plots were created with predicted risk on the x-axis and observed mortality on the y-axis to illustrate the agreement across the range of predicted risks.23 Discrimination refers to the ability of the prediction model to separate individuals that died and those that survived. The concordance statistic was used to evaluate the discriminative value of the prediction model.
For those observations excluded from the analyses because of missing predictors, comparisons with the observations used in the validation were also made. All analyses were performed using SAS 9.4 (SAS Institute Inc., Cary, NC, USA).
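The two recalibration models described above can be sketched in plain Python. The study itself used SAS 9.4, so this is only an illustration of the equations: the function names are hypothetical, the fitting method (a fixed number of Newton-Raphson steps) is a choice made for this example, and predicted risks are assumed to lie strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibration_in_the_large(y, p_hat, iterations=25):
    """Fit logit(y) = a + logit(p_hat): intercept only, linear predictor as offset."""
    lp = [logit(p) for p in p_hat]
    a = 0.0
    for _ in range(iterations):
        mu = [expit(a + x) for x in lp]
        score = sum(yi - mi for yi, mi in zip(y, mu))
        info = sum(mi * (1.0 - mi) for mi in mu)
        a += score / info  # Newton-Raphson step for the single parameter
    return a


def calibration_slope(y, p_hat, iterations=25):
    """Fit logit(y) = a + b * logit(p_hat) and return (a, b)."""
    lp = [logit(p) for p in p_hat]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iterations):
        mu = [expit(a + b * x) for x in lp]
        w = [mi * (1.0 - mi) for mi in mu]
        # Gradient of the log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, lp))
        # Entries of the 2 x 2 information matrix
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b
```

An intercept below 0 from the first function corresponds to over-prediction on average, and a slope of 1 from the second corresponds to predicted risks that are appropriately scaled. Confidence intervals, which the study reports, are not computed in this sketch.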

Results

After applying the exclusion criteria, 121,201 records were available for external validation (Fig. 1). The demographic and clinical characteristics (predictors) and ICU mortality of the patient population included in the development model and external validation data set are shown in Table 2. The C-statistic for the membership model comparing the development data set to the entire CCIS cohort was 0.764. Values between 0.7 and 0.8 are generally considered to reflect acceptable discrimination25 and, in the case of this membership model, represent a data set that is somewhat related to the development data set, but not strongly so; for strongly related data sets, a C-statistic of < 0.7 would be expected. This is confirmed by some key differences illustrated in Table 2. Specifically, the development population was younger, had a different source distribution (fewer admissions from the operating room and emergency department, more from the ward and referrals from other hospitals), as well as higher levels of organ dysfunction upon admission, daily nurse workload, readmission, and ICU mortality. Admitting diagnosis also differed between the data sets, with the development sample having a higher proportion of admissions for respiratory issues and a lower proportion of cardiovascular-related admissions.
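A C-statistic such as the one reported for the membership model can be computed from ranks. The sketch below is a generic rank-based calculation, not the study's SAS procedure; the function name is hypothetical. For a membership model, `y` would be the indicator of which sample a record belongs to and `scores` the predicted membership probabilities; for the mortality model, `y` is ICU death and `scores` the predicted risks.

```python
def c_statistic(y, scores):
    """Probability that a random event scores higher than a random non-event.

    Tied scores count as one half, which matches the area under the ROC curve.
    """
    pairs = sorted(zip(scores, y))
    n = len(pairs)
    n_events = sum(1 for _, yi in pairs if yi == 1)
    n_non_events = n - n_events
    if n_events == 0 or n_non_events == 0:
        raise ValueError("need at least one event and one non-event")
    rank_sum_events = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for tied scores
        events_in_tie = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_events += midrank * events_in_tie
        i = j
    u = rank_sum_events - n_events * (n_events + 1) / 2.0
    return u / (n_events * n_non_events)
```

A value of 0.5 means the scores cannot separate the two groups, which for a membership model indicates closely related samples.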

Fig. 1

Table 2

Baseline and clinical characteristics and outcomes of patients in the development and external validation data sets

	Development	External validation
		Included	Missing
Total number of subjects, n	4,321	121,201	6,507
Sex
% Female	42.7	40.2	38.7
N missing			3
Age (yr)
0–39	13.5	9.1	8.7
40–79	72.8	73.4	72.5
≥ 80	13.7	17.5	18.8
N missing			0
ICU admission source
Operating room/postanesthesia care unit	21.6	28.3	25.6
Other hospital	18.3	11.4	13.5
Emergency department	29.3	36.1	32.1
Other source*	8.8	9.7	15.0
Unit/ward	22.2	14.5	13.8
N missing			28
ICU admission diagnosis
Cardiovascular/cardiac/vascular	15.1	43.1	55.6
Other diagnosis^†	23.1	21.1	18.1
Gastrointestinal	10.5	6.1	5.7
Respiratory	32.5	19.3	13.7
Trauma	6.6	2.3	1.5
Neurologic	12.2	8.0	5.5
N missing			0
Multiple Organ Dysfunction Score (MODS)
0	5.7	16.9	21.1
1–4	39.9	46.0	47.7
5–8	40.1	29.0	22.8
9–12	12.3	7.2	7.3
≥ 13	2.0	0.9	1.1
N missing			5607
Nine Equivalents Nursing Manpower Use Score (NEMS)
0–22	12.9	36.4	43.1
23–29	32.5	26.3	23.6
≥ 30	54.6	37.3	33.3
N missing			2280
Re-admission to ICU (same hospital admission)	9.1	5.8	6.4
N missing			128
Mortality	22.8	11.2	11.9
N missing			28

†Other diagnosis includes patients with the following diseases: Metabolic/endocrine, Genitourinary, Musculoskeletal, Skin, Oncology, Hematology, Other

*Other source includes patients admitted from the following locations: Home – within or outside LHIN, Level 2 unit or step-down unit, Level 3 unit (medical/surgical or specialty unit), Complex continuing-care facility, Rehabilitation facility, Outside province, Other

ICU = intensive care unit; LHIN = Local Health Integration Network

Flow chart of patient records included in the external validation. Administrative cleaning includes the following: n = 1,609 (duplicates), n = 427 (admitted in error), n = 88 (ICU LOS = 0), n = 511 (age < 18 yr), n = 6 (age > 105 yr). ICU = intensive care unit; LOS = length of stay

These same analyses were performed for each ICU subtype group. The discrimination of the membership models indicated varying degrees of relatedness to the development sample. Relatedness to the development sample was found in teaching hospital medical-surgical units (C-statistic = 0.660) and community hospital medical-surgical units with high rates of mechanical ventilation (C-statistic = 0.740) but discordance in community hospital medical-surgical units with low rates of mechanical ventilation (C-statistic = 0.836), cardiac/cardiovascular units (C-statistic = 0.969), and coronary care units (C-statistic = 0.974). eTable 2 (available as ESM) is provided as supplemental digital content and shows the characteristics and outcomes for each individual ICU subtype group compared with those for the entire cohort. The demographic and clinical profile of cases excluded from the analyses because of missing data was similar to that of cases included in the external validation (Table 2), and as such, data were considered to be missing completely at random.
Calibration-in-the-large represents overall calibration of the model. Perfect agreement between observed and predicted values has an intercept value of 0. For all data combined and also for all ICU subtype groups except medical-surgical units in teaching hospitals, the intercept value was less than 0 indicating that the model over-predicted ICU mortality.22 This over-estimation was greatest in cardiac/cardiovascular and coronary care units. In the medical-surgical units in teaching hospitals, the intercept value was greater than 0 showing a slight under-estimation of mortality (Table 3). Given the differences between actual ICU mortality and predicted risk, an intercept recalibration was performed for all models resulting in calibration-in-the-large values that are essentially 0.
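The effect of an intercept recalibration on a single predicted risk can be illustrated as follows. This is a sketch only: the function name is hypothetical, the intercept of -0.343 is the calibration-in-the-large value reported for all level-3 units before recalibration, and the starting risk of 0.20 is an arbitrary example rather than a value from the study.

```python
import math


def recalibrate_risk(predicted_risk, intercept):
    """Shift a predicted risk on the logit scale by a recalibration intercept."""
    if not 0.0 < predicted_risk < 1.0:
        raise ValueError("predicted_risk must be strictly between 0 and 1")
    shifted = math.log(predicted_risk / (1.0 - predicted_risk)) + intercept
    return 1.0 / (1.0 + math.exp(-shifted))


# Hypothetical patient with an original predicted risk of 20%.
example = recalibrate_risk(0.20, -0.343)  # roughly 0.15
```

Because the shift is applied on the logit scale, a negative intercept lowers every predicted risk while preserving their rank order, so discrimination is unchanged.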

Table 3

Predicted risk and model performance statistics for external validation of the entire CCIS cohort and for ICU subtype groups

ICU subtype group	n	Observed ICU mortality (n)	Observed ICU mortality (%)	Predicted risk of death in ICU before intercept recalibration, mean (SD)	Predicted risk of death in ICU before intercept recalibration, median [IQR]	Predicted risk of death in ICU after intercept recalibration, mean (SD)	Predicted risk of death in ICU after intercept recalibration, median [IQR]	Calibration in the large before intercept recalibration (95% CI)	Calibration in the large after intercept recalibration (95% CI)	Calibration slope (βrisk)	Discrimination (C-statistic) (95% CI)
All Level 3 units	121,201	13,594	11.2	14.4 (14.7)	8.7 [3.9–18.7]	11.2 (12.5)	6.3 [2.8–14.0]	−0.343 (−0.362 to −0.324)	0.001 (−0.018 to 0.021)	1.019 (1.001 to 1.037)	0.807 (0.804 to 0.811)
Teaching hospitals – medical/surgical units	28,894	4,711	16.3	15.5 (15.2)	10.4 [4.2–20.8]	16.3 (15.7)	10.8 [4.6–22.2]	0.070 (0.036 to 0.105)	−0.000 (−0.035 to 0.034)	0.909 (0.878 to 0.941)	0.781 (0.774 to 0.788)
Community hospitals – medical/surgical units with high ventilator patient day rate*	33,653	5,398	16.0	16.4 (17.1)	10.0 [3.7–22.3]	16.1 (16.9)	9.6 [3.6–21.6]	−0.037 (−0.069 to −0.004)	−0.006 (−0.038 to 0.027)	0.982 (0.953 to 1.011)	0.816 (0.810 to 0.822)
Community hospitals – medical/surgical units with low ventilator patient day rate*	23,597	2,100	8.9	11.6 (12.7)	7.5 [3.5–16.3]	9.1 (10.7)	5.5 [2.5–12.5]	−0.350 (−0.399 to −0.302)	−0.032 (−0.080 to 0.016)	1.048 (1.003 to 1.093)	0.810 (0.801 to 0.819)
Cardiac/cardiovascular units	18,929	608	3.2	14.6 (12.4)	11.4 [7.3–16.3]	3.0 (3.8)	1.9 [1.3–3.0]	−1.799 (−1.883 to −1.716)	0.064 (−0.019 to 0.147)	1.062 (0.980 to 1.144)	0.768 (0.747 to 0.789)
Coronary care units	15,159	746	4.9	12.4 (12.8)	7.7 [3.5–17.1]	5.0 (6.6)	2.6 [1.2–6.3]	−1.148 (−1.225 to −1.070)	−0.011 (−0.088 to 0.067)	1.232 (1.156 to 1.307)	0.850 (0.836 to 0.863)

*Ventilator patient day rate = (ventilator days / patient care days) *100; based on fiscal year 2016–2017; mean rate = 43.61%; β = standardized regression (beta) coefficient; CCIS = Critical Care Information System; CI = confidence interval; ICU = intensive care unit; IQR = interquartile range; SD = standard deviation

The calibration plots in Figs 2a and 2b show that some over-prediction remains following intercept recalibration, specifically when the risk of death is higher. The extent of over-prediction varies across ICU subtype groups but represents a small proportion of patients.

Fig. 2

a Loess-based calibration plots for validation of entire CCIS cohort. CCIS = Critical Care Information System. b Loess-based calibration plots for validation of individual ICU subtype groups. ICU = intensive care unit; TH = teaching hospitals; CH = community hospitals

The calibration slope reflects whether the predicted risks are scaled appropriately to each other over the complete range of predicted probabilities and was another measure used to evaluate the model’s predictive performance in the validation samples. Calibration slopes not significantly different from 1 include all CCIS data, as well as community hospital medical-surgical units and cardiac/cardiovascular units. The calibration slope for teaching hospital medical-surgical units was significantly less than 1, showing higher variation in predicted probabilities (Table 3). Specifically, the variation between predicted and observed risks is too low for low-outcome risks and too high for high-outcome risks. The coronary care unit data set has a calibration slope significantly above 1, indicating too little variation in the predicted risks; predicted risks are systematically too high. Discrimination for all CCIS data and the individual ICU subtype groups ranged from acceptable to very good (Table 3). The validation data sets with the lowest area under the curve (AUC) [95% CI] were teaching hospital medical-surgical units (C = 0.781 [0.774–0.788]) and cardiovascular/cardiac units (C = 0.768 [0.747–0.789]). The data sets including all CCIS data and all other ICU subtype groups had areas under the curve greater than 0.80.

Discussion

We used a prospectively collected, population-based cohort to perform external validation on a risk prediction model for ICU mortality. We found that an intercept update was required, which greatly improved the calibration-in-the-large for the entire cohort as well as for all ICU subtype groups. Over-estimation for higher predicted risk groups remains, but this population represents relatively few patients. Since the intention of the model is for performance measurement and not individual patient prognosis, the model fit is acceptable for the entire cohort of ICUs. The development and application of robust prognostic models are essential for valid performance measurement. Many existing prognostic models have a limited life span because changes in clinical practice and healthcare over time can alter the risk of mortality for a given clinical situation, so prognostic models require periodic updating. Current prognostic models for mortality were published between 2005 and 2007, including Acute Physiology and Chronic Health Evaluation (APACHE) IV (AUC = 0.88),5 Simplified Acute Physiology Score (AUC = 0.848),26 and Mortality Probability Admission Model (MPM0)-III (AUC = 0.823).27 The organ dysfunction scores that assess the presence and severity of organ dysfunction include MODS (AUC = 0.695), Sequential Organ Failure Assessment (SOFA) (AUC = 0.776), and Logistic Organ Dysfunction Score (AUC = 0.805).11 The AUC we report here for the entire cohort and for ICU subtype groups compares favourably with these other models. The development model showed strong agreement between observed and expected mortality as assessed using the Hosmer-Lemeshow goodness-of-fit test.
Limitations of this decile-based analysis include the influence of sample size and the arbitrary selection of the risk categories.28-30 In this external validation, calibration was assessed using loess-based calibration plots, calibration-in-the-large, and calibration slope.23 Although the results are not directly comparable, the underlying conclusions are that the model has acceptable calibration in both the development and validation data sets, indicating good overall agreement between observed and expected ICU mortality. Discriminative ability increased slightly in this external validation and the membership model did indicate some case-mix differences. We anticipated that a data set containing over 120,000 patients would include a more diverse case-mix than the developmental model. Differences in case-mix can include the distribution of predictor values, varied participant or setting characteristics, and incidence of the outcome.18 This increase in heterogeneity would enhance discriminative ability in the validation cohort, and has several effects on model performance across different settings and populations.31,32 In fact, case-mix variation can lead to differences in the performance of a prediction model, even when the true predictors’ effects are consistent.31 Benchmarking is an approach to identify and implement best practices.1,33 Indicators selected for benchmarking can be compared over time within a single unit or practice, across units or practices or against a predetermined goal. Many potential indicators will not require risk or case-mix adjustment, while this will be needed for most patient-related outcomes such as mortality and length of stay. We caution against use of simple rank ordering or comparisons of one unit to another since regression models, such as the one we report, provide an estimated risk based on the average of the entire cohort. 
While our recalibration has reduced the bias across this cohort, estimates for subgroups or individual ICUs will remain biased. As can be seen in our data, it appears that teaching hospitals perform worse than average, community hospitals with high ventilator usage perform better than average, and cardiac units perform much better than average. Nevertheless, this would be a false conclusion since the differences across subgroups must cancel out across the entire cohort. At most, evaluation of subgroups or individual ICU results should only be compared with the average estimated performance and include confidence intervals.3 Models could be recalibrated for specific ICU subtypes but this involves subjective categorization of units and will not resolve the bias for individual ICUs. One randomized trial used quantiles to identify achievable performance levels for groups of units and reported improved performance in individual units.34 Ultimately, we believe that models such as these should be used to monitor performance over time only within individual ICUs. One such approach incorporates risk-adjusted measures into statistical process control methods.35,36 There are numerous strengths to this study. First, the breadth of the units that submit data to the CCIS allows for testing of both reproducibility (similar ICU subtype groups) and transportability (different ICU subtype groups), and the size of the CCIS data set provided ample statistical power for the required analyses. 
The TRIPOD framework indicates that a model’s predictive performance should be evaluated in relation to subgroups of interest, such as age or sex, specific settings or population rather than just across all individuals combined, which can mask any deficiencies in the model.19 It is increasingly recognized that the predictive performance of a model tends to vary across settings, populations, and periods,22,31,37,38 which implies there is often heterogeneity in model performance and that multiple external validation studies are needed to fully appreciate the generalizability of a prediction model.22 In this study, we have conducted subgroup analyses for each ICU subtype to evaluate performance in specific ICU patient populations. Another strength is adherence to the TRIPOD guidelines, which include references to appropriate analytic methods and complete reporting of the results.15,22,39,40 Next, both MODS and NEMS are relatively easy to collect, making this prediction tool more apt for risk-adjustment compared with more complex scoring systems. MODS requires only eight routinely collected variables and, in contrast to SOFA, is not dependent on treatment.41 NEMS assesses ICU resource utilization and efficiency and has been validated as a nurse workload measure in large cohorts of ICU patients.42 It is easy to use with minimum inter-observer variability,9,42 but has not been evaluated as a mortality or risk prediction tool. Limitations of this study include our inability to adjust for chronic health status as these data are not captured in the CCIS. Linkage to other data sets containing comorbidity data such as the Canadian Institute for Health Information Discharge Abstract Database could resolve this limitation, but we did not have access to identifiable patient information and such linkage was not possible.
Another limitation is that, although ICU mortality is a proximal metric that can be used to evaluate quality of care in the ICU and ultimately improve patient outcomes, ICU survival is not a patient-centred goal. We found a low frequency of patients within the range of severity where mortality is over-predicted; however, this would need to be monitored regularly to ensure that results are interpreted correctly. Also, we could not evaluate the burn ICU subtype group accurately because of the low number of deaths. Finally, although there are no published studies on the accuracy of the CCIS data, we previously reported that inter-observer variability in data collection appears to be randomly distributed.43

Conclusion

Following an intercept update to adjust for the difference in mortality between the development and validation data sets, our ICU mortality prediction model performs well and shows both reproducibility and transportability. Some ICU subtype groups show inferior model fit compared with others, but the over-estimation of mortality occurs primarily in risk groups with low prevalence and thus has a minimal impact on overall calibration. These models could be used to provide risk-adjusted mortality rates to support performance measurement over time within individual ICUs using data that are easy and feasible to collect. Since the model represents an average of all the patients in the cohort, we recommend that it not be used for simple comparisons between ICUs or ICU subtypes.

Electronic supplementary material

Supplementary material 1 (PDF 118 kb): Logistic regression equation for predicted risk of ICU mortality (Logit [ICU Mortality]). ICU = intensive care unit.
Supplementary material 2 (PDF 139 kb): Baseline and clinical characteristics and outcomes for all level-3 unit patients discharged between July 1 2015 and December 31 2016. Data are provided for the entire cohort and stratified by the CCIS peer group. CCIS = Critical Care Information System.
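As an illustration of the intercept update referred to in the Conclusion, the Python sketch below re-estimates only the model intercept while holding the original linear predictor fixed as an offset (calibration slope fixed at 1). It is a minimal sketch under stated assumptions, not the code used in this study: the names `intercept_update`, `logit`, and `expit` and the toy data are hypothetical, the correction is found with a plain Newton-Raphson iteration, and the model's actual coefficients (given in the supplementary material) are not reproduced here.

```python
import math
from typing import Sequence


def expit(x: float) -> float:
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def intercept_update(
    linear_predictor: Sequence[float],
    died: Sequence[int],
    tol: float = 1e-10,
    max_iter: int = 100,
) -> float:
    """Estimate the intercept correction `a` in logit(p) = a + linear_predictor.

    This is the maximum-likelihood intercept of a logistic model with the
    original linear predictor as an offset. At the solution, the mean
    recalibrated risk equals the observed mortality. The outcomes must
    include both deaths and survivors, otherwise no finite solution exists.
    """
    a = 0.0
    for _ in range(max_iter):
        probs = [expit(a + lp) for lp in linear_predictor]
        score = sum(y - p for y, p in zip(died, probs))    # first derivative
        information = sum(p * (1.0 - p) for p in probs)    # minus second derivative
        step = score / information
        a += step
        if abs(step) < tol:
            break
    return a


# Toy validation data (hypothetical): original predictions average 0.50,
# but observed mortality is 8/20 = 0.40.
original_risk = [0.2, 0.4, 0.6, 0.8] * 5
died = [0, 0, 0, 1] * 2 + [0, 0, 1, 1] * 3
offsets = [logit(p) for p in original_risk]

a_hat = intercept_update(offsets, died)
recalibrated = [expit(a_hat + lp) for lp in offsets]

print(f"intercept correction: {a_hat:.4f}")
print(f"mean recalibrated risk: {sum(recalibrated) / len(recalibrated):.4f}")
print(f"observed mortality: {sum(died) / len(died):.4f}")
```

Because only the intercept changes, the ranking of patients by predicted risk, and therefore the C-statistic, is unaffected by this update; it shifts the baseline risk without altering the calibration slope.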

41 in total

Review 1. The Critical Care Research Network: a partnership in community-based research and research transfer.

Authors: S P Keenan; C M Martin; J D Kossuth; J Eberhard; W J Sibbald
Journal: J Eval Clin Pract Date: 2000-02 Impact factor: 2.431

2. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma.

Authors: Richard Lilford; Mohammed A Mohammed; David Spiegelhalter; Richard Thomson
Journal: Lancet Date: 2004-04-03 Impact factor: 79.321

Review 3. Severity scoring in the critically ill: part 2: maximizing value from outcome prediction scoring systems.

Authors: Michael J Breslow; Omar Badawi
Journal: Chest Date: 2012-02 Impact factor: 9.410

Review 4. Benchmarking: a method for continuous quality improvement in health.

Authors: Amina Ettorchi-Tardy; Marie Levif; Philippe Michel
Journal: Healthc Policy Date: 2012-05

5. Validation of "nine equivalents of nursing manpower use score" on an independent data sample.

Authors: H U Rothen; V Küng; D H Ryser; R Zürcher; B Regli
Journal: Intensive Care Med Date: 1999-06 Impact factor: 17.440

6. Reporting recommendations for tumor marker prognostic studies (REMARK).

Authors: Lisa M McShane; Douglas G Altman; Willi Sauerbrei; Sheila E Taube; Massimo Gion; Gary M Clark
Journal: J Natl Cancer Inst Date: 2005-08-17 Impact factor: 13.506

Review 7. Multiple organ dysfunction score: a reliable descriptor of a complex clinical outcome.

Authors: J C Marshall; D J Cook; N V Christou; G R Bernard; C L Sprung; W J Sibbald
Journal: Crit Care Med Date: 1995-10 Impact factor: 7.598

8. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers.

Authors: Peter C Austin; Ewout W Steyerberg
Journal: Stat Med Date: 2013-08-23 Impact factor: 2.373

9. Accounting for missing data in statistical analyses: multiple imputation is not always the answer.

Authors: Rachael A Hughes; Jon Heron; Jonathan A C Sterne; Kate Tilling
Journal: Int J Epidemiol Date: 2019-08-01 Impact factor: 7.196

10. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges.

Authors: Richard D Riley; Joie Ensor; Kym I E Snell; Thomas P A Debray; Doug G Altman; Karel G M Moons; Gary S Collins
Journal: BMJ Date: 2016-06-22
