Mortality-risk prediction models are used to benchmark ICU performance against ‘gold standards’: the ICUs in which the models were derived, at the time they were derived. Therefore, it is important to apply up-to-date
risk prediction models.[[1]] ICU performance metrics based on mortality
risk prediction models include effectiveness, measured by the standardised
mortality ratio (SMR), and efficiency, measured by the number of patients requiring at least one ICU-care modality and with a mortality risk >1%.[[2]] They
are also used as the basis for risk-adjusted control chart methodologies
which track ICU performance over time[[3]] and for quality improvement
efforts, for example, where the presence of intensivists has reduced SMR[[4]]
and where excessive deaths among low-risk patients were related to
invasive procedures and inadequate infection control practices.[[5]]

Model derivation methods have evolved over time. Most models are
derived by a combination of expert selection of variables subjected to
univariate and multivariate logistic regression analysis. This process
produces a list of independent variables with their related coefficients
and odds ratios for the dependent variable, or outcome, usually death
or survival. Generally, these models generate a score from the logistic
function which is in turn transformed to a mortality risk at individual
patient level. Therefore, in a population of patients, the sum of individual
mortality risks generates the expected number of deaths.
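As a rough illustration (not the method of any particular model), the short Python sketch below uses entirely hypothetical logistic scores and outcomes to show how individual risks are obtained from the logistic function, summed to give expected deaths, and compared with observed deaths to give the SMR.

```python
import numpy as np

# Hypothetical logistic scores (linear predictors) and observed outcomes (1 = died).
scores = np.array([-3.0, -1.5, 0.2, 1.1, 2.4])
observed = np.array([0, 0, 1, 1, 1])

# Transform each score to an individual mortality risk via the logistic function.
risk = 1.0 / (1.0 + np.exp(-scores))

expected_deaths = risk.sum()              # sum of individual risks = expected deaths
observed_deaths = observed.sum()
smr = observed_deaths / expected_deaths   # standardised mortality ratio

print(f"Expected: {expected_deaths:.2f}, observed: {observed_deaths}, SMR: {smr:.2f}")
```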
Neural network-based machine learning models are emerging and show promise to perform as well as, if not better than, statistical models.[[6]] Better
performance, however, may come at the not insurmountable cost of added
complexity and the need for access to appropriate computational resources.[[7]]

However they are derived, models need to perform well both in the derivation cohort and in the intended use-case context. This is important
to establish prior to investing the time, effort and money it takes to
deploy these models in any ICU. The models need to demonstrate
good discrimination, defined as the ability to correctly classify patients as survivors or non-survivors, i.e. to assign higher predicted risks to those who die than to those who survive. Receiver Operating Characteristic (ROC) curve analysis is commonly used to assess model classification ability. Essentially, ROC curves plot the rate of correctly predicted non-survivors (true positive rate) against the rate of survivors falsely predicted to be non-survivors (false positive rate) for each threshold of the score. The ideal classifier, which
does not exist, would have an area under this curve (AUC) of 1. A poor
(random) classifier would have an AUC of 0.5. Classifiers are regarded as
acceptable for AUCs from 0.7 to <0.8, good for AUCs of 0.8 to <0.9 and
excellent if ≥ 0.9.[[1]]
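By way of illustration, a ROC curve and its AUC can be computed from predicted risks and observed outcomes as in the sketch below (Python with scikit-learn; the risks and outcomes are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical predicted mortality risks and observed outcomes (1 = died).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_risk = np.array([0.05, 0.10, 0.12, 0.20, 0.25, 0.35, 0.45, 0.40, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_risk)  # points of the ROC curve
auc = roc_auc_score(y_true, y_risk)               # area under the ROC curve
print(f"ROC AUC: {auc:.2f}")                      # 1 = ideal, 0.5 = random classifier
```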
ROC analysis applied to datasets with unbalanced dependent outcomes, such as mortality and survival, may nevertheless give an overly optimistic impression of performance: a hypothetical model, for example, which categorises all cases as survivors in a population with a 10% mortality rate would be correct 90% of the time, yet would identify none of the deaths.

An alternative strategy to assess model discrimination on imbalanced
datasets is to consider model precision, or positive predictive value (the ratio of true positives to all cases predicted positive), and recall, or sensitivity/true positive rate (the ratio of true positives to all observed positives). The associated precision/recall curve
(PRC) relates precision to recall.[[8]] Similar to ROC curves, the PRC plot
generates a curve with an AUC. The greater the AUC, the better the
discrimination. Again, perfect discrimination would yield an AUC of 1.
Random model performance determined by PRC depends on the degree
of class imbalance. This ‘baseline’ is a horizontal line on the PRC plot at y = P/(P+N), the ratio of positive cases to the sum of positive and negative cases. The AUC of a PRC reflecting random model performance will therefore also vary with class balance and will be equal to y.
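A minimal sketch of a PRC assessment, again using hypothetical risks and outcomes, might look as follows; the baseline is simply the proportion of positive (non-survivor) cases:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical predicted mortality risks and observed outcomes (1 = died).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_risk = np.array([0.05, 0.10, 0.12, 0.20, 0.25, 0.35, 0.45, 0.40, 0.70, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_risk)
prc_auc = auc(recall, precision)           # area under the precision/recall curve
baseline = y_true.sum() / len(y_true)      # y = P/(P+N): random-model performance
print(f"PRC AUC: {prc_auc:.2f}, baseline: {baseline:.2f}")
```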
Other measures of model performance include accuracy, defined as the proportion of true positive and true negative classifications within the whole sample, (TP+TN)/(TP+FP+TN+FN), and the F1 score, which is the harmonic mean of precision and recall: 2 × (precision × recall)/(precision + recall).[[9]]
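From a 2×2 confusion matrix with hypothetical counts, these point metrics could be computed as follows:

```python
# Hypothetical confusion-matrix counts: predicted v. observed non-survivors.
TP, FP, TN, FN = 120, 40, 600, 60

accuracy = (TP + TN) / (TP + FP + TN + FN)          # proportion classified correctly
precision = TP / (TP + FP)                          # positive predictive value
recall = TP / (TP + FN)                             # sensitivity / true positive rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(f"Accuracy: {accuracy:.2f}, F1: {f1:.2f}")
```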
Models also need to perform well within categories of mortality risk, different demographics and across diagnostic groups, i.e. model
calibration. The Hosmer-Lemeshow (H-L) statistical analysis is widely used
to assess model calibration, by determining the significance (p-value) of the
differences between observed and expected outcome rates within, usually,
10 groups (deciles) of increasing mortality risk. This methodology is valid
only if a few provisos are met, facilitated by sufficiently large data sets.[[1]]
H-L would be unreliable if sample size is less than 400 or if >4 out of 20
values of the ‘expected’ columns of the H-L table are <5. The same p-value
may be found if there is a small difference in a large sample, as with a large
difference in a small sample. Therefore, the p-value per se says nothing
about the clinical significance of a difference between observed and
expected mortality rates. Careful inspection of the H-L table will yield a
better ‘idea’ of the nature of the lack-of-fit. The H-L test will almost always
show a significant lack-of-fit if the observed vs expected mortality rates
are not similar in all deciles of risk. 95% confidence intervals (CI) for SMR
must be narrow for it to have meaning. This depends on the number of
recorded deaths. CIs will be wide if <50 are recorded.
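For readers who wish to experiment, a bare-bones sketch of the H-L statistic (grouping patients into deciles of predicted risk and comparing observed with expected deaths per group) is given below; it assumes NumPy arrays of 0/1 outcomes and predicted risks, and omits the small-sample and small-cell checks discussed above.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_risk, n_groups=10):
    """Hosmer-Lemeshow statistic over groups of increasing predicted risk."""
    order = np.argsort(y_risk)
    groups = np.array_split(order, n_groups)       # ~equal-sized groups (deciles)
    stat = 0.0
    for g in groups:
        n = len(g)
        observed = y_true[g].sum()                 # observed deaths in the group
        expected = y_risk[g].sum()                 # expected deaths = sum of risks
        p_bar = expected / n
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    p_value = 1 - chi2.cdf(stat, df=n_groups - 2)  # small p-value suggests lack of fit
    return stat, p_value
```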
Shann[[1]] states that when interpreting a lack of calibration, it is more likely to mean that the standard of care in the unit differs from that of the units in
which the model was derived, at the time that it was derived. Thus, if SMR
is >1, care is worse. If SMR is <1, care is better. Based on ROC and H-L
analysis, Shann[[1]] further suggests a guide for deciding whether a model is
appropriate for any context. If the ROC AUC is >0.7 and similar numbers
or proportions of observed v. expected outcomes are found across all H-L
deciles of risk, the model is appropriate. Conversely, a model may not be appropriate if the ROC AUC is <0.7, or if more deaths occur than predicted in lower risk categories and fewer deaths than predicted occur in higher risk categories.[[1]] However, issues of case mix and resource differences between
derivation and application contexts need to be considered.[[10]]

Owing to the limitations of the H-L method, which include artificial
grouping of data into risk strata, a p-value that is indifferent to the type or extent of miscalibration, and low statistical power,[[6,11]] alternative assessments of calibration need to be considered. The
flexible calibration curve with its slope and intercept is considered a
superior, though less popular, assessment of model calibration.[[11]] Model
calibration thus assessed can be classified as mean, weak, moderate
or strong. Mean calibration refers to the ability of the model to predict
on average the outcome of interest. Over-prediction occurs when the
mean predicted outcome is greater than observed, while under-prediction
occurs when the average predicted outcome is less than observed.

Weak calibration is assessed using the slope and intercept of the flexible calibration curve; a model is weakly calibrated when the slope is 1 and the intercept is 0. A slope of less than 1 indicates predictions that are too extreme (over-prediction in high-risk and under-prediction in low-risk patients), while a slope of greater than 1 indicates predictions that are too moderate. The intercept of the flexible curve indicates over-prediction or under-prediction at values less than or greater than zero, respectively.

Moderate calibration is assessed by comparing how well the calibration
curve fits the model prediction to the observed outcomes. The flexible
curve will be close to the diagonal when proportions of predicted and
observed outcomes are similar.
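As a hedged sketch of how the calibration intercept and slope might be estimated in practice (logistic recalibration on the logit of the predicted risk, one common implementation consistent with the framework in reference [[11]]; the flexible curve itself is usually drawn with a loess or spline smoother), assuming predicted risks strictly between 0 and 1:

```python
import numpy as np
import statsmodels.api as sm

def calibration_intercept_slope(y_true, y_risk):
    """Estimate calibration intercept and slope from outcomes and predicted risks."""
    logit_risk = np.log(y_risk / (1 - y_risk))
    # Calibration slope: regress the outcome on the logit of the predicted risk.
    slope_model = sm.Logit(y_true, sm.add_constant(logit_risk)).fit(disp=0)
    slope = slope_model.params[1]
    # Calibration intercept: intercept-only model with the logit as a fixed offset.
    intercept_model = sm.GLM(y_true, np.ones((len(y_true), 1)),
                             family=sm.families.Binomial(),
                             offset=logit_risk).fit()
    intercept = intercept_model.params[0]
    return intercept, slope
```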
In the current issue, Pazi et al.,[[12]] using a cohort of 829 patients admitted to an adult ICU in a tertiary hospital in South Africa (SA), have
developed a SAPS III-based mortality prediction model calibrated to their
specific ICU population using data available from a previous SAPS III
validation study.[[13]] Data were collected for one year starting in January
2017. They reported a mortality rate of 21.35%.

The rationale for this unit-specific model derivation was the age of SAPS
III[[14]] and the fact that no data from centres in low- to middle-income countries, and none from SA, were included in the derivation dataset.
They employed logistic regression to select variables for four models, which were then subjected to cross-validation analyses, and a final
model was internally validated using measures of discrimination (ROC
analysis, precision-recall, balanced accuracy, bookmaker informedness
and markedness) and calibration (H-L and the flexible calibration
curve).

The authors report a ROC AUC of 0.86 and a PRC AUC of 0.67 with a
baseline of 0.2. Therefore, global model discrimination is deemed good. Calibration as determined by the H-L C- and H-statistics produced p-values of 0.95 and 0.93, respectively. However, SMRs are not consistent among all risk categories and the associated CIs are consistently wide and include unity. This illustrates the concerns expressed above regarding the reliability of the H-L test as a measure of model calibration. The authors further present the
flexible calibration curve with its slope and intercept and 95% CIs. The
intercept, though negative, is close to zero, with a 95% CI of -0.44 to 0.27. Similarly, the slope of the curve is 1.04, with a 95% CI of 0.76 to 1.36. The
model therefore shows moderate calibration, using the above criteria, with
a tendency to ‘oscillate’ within the CI limits.

External validation of this model is needed, as pointed out by the
authors, preferably on larger data-sets among a range of ICUs in SA,
before it can be accepted as a national standard for benchmarking
performance of adult ICUs.

In conclusion, understanding model validation methods will promote appropriate derivation, choice, validation and application of mortality risk prediction models. This, in turn, could increase the confidence clinicians have in the ICU performance metrics that these models facilitate and, ultimately, allow systems of care to be designed that maximise outcomes for patients in the ICUs where these models are deployed.