
Developing and validating subjective and objective risk-assessment measures for predicting mortality after major surgery: An international prospective cohort study.

Danny J N Wong1,2, Steve Harris3, Arun Sahni1,2, James R Bedford1,2, Laura Cortes2, Richard Shawyer4, Andrew M Wilson5, Helen A Lindsay5, Doug Campbell5, Scott Popham6, Lisa M Barneto7, Paul S Myles8, S Ramani Moonesinghe1,2.   

Abstract

BACKGROUND: Preoperative risk prediction is important for guiding clinical decision-making and resource allocation. Clinicians frequently rely solely on their own clinical judgement for risk prediction rather than objective measures. We aimed to compare the accuracy of freely available objective surgical risk tools with subjective clinical assessment in predicting 30-day mortality.
METHODS AND FINDINGS: We conducted a prospective observational study in 274 hospitals in the United Kingdom (UK), Australia, and New Zealand. For 1 week in 2017, prospective risk, surgical, and outcome data were collected on all adults aged 18 years and over undergoing surgery requiring at least a 1-night stay in hospital. Recruitment bias was avoided through an ethical waiver to patient consent; a mixture of rural, urban, district, and university hospitals participated. We compared subjective assessment with 3 previously published, open-access objective risk tools for predicting 30-day mortality: the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT). We then developed a logistic regression model combining subjective assessment and the best objective tool and compared its performance to each constituent method alone. We included 22,631 patients in the study: 52.8% were female, median age was 62 years (interquartile range [IQR] 46 to 73 years), median postoperative length of stay was 3 days (IQR 1 to 6), and inpatient 30-day mortality was 1.4%. Clinicians used subjective assessment alone in 88.7% of cases. All methods overpredicted risk, but visual inspection of plots showed the SORT to have the best calibration. The SORT demonstrated the best discrimination of the objective tools (SORT Area Under Receiver Operating Characteristic curve [AUROC] = 0.90, 95% confidence interval [CI]: 0.88–0.92; P-POSSUM = 0.89, 95% CI: 0.88–0.91; SRS = 0.85, 95% CI: 0.82–0.87). Subjective assessment demonstrated good discrimination (AUROC = 0.89, 95% CI: 0.86–0.91) that was not different from the SORT (p = 0.309). Combining subjective assessment and the SORT improved discrimination (bootstrap optimism-corrected AUROC = 0.92, 95% CI: 0.90–0.94) and demonstrated continuous Net Reclassification Improvement (NRI = 0.13, 95% CI: 0.06–0.20, p < 0.001) compared with subjective assessment alone.
Decision-curve analysis (DCA) confirmed the superiority of the SORT over other previously published models, and the SORT-clinical judgement model again performed best overall. Our study is limited by the low mortality rate, by the lack of blinding in the 'subjective' risk assessments, and because we only compared the performance of clinical risk scores as opposed to other prediction tools such as exercise testing or frailty assessment.
CONCLUSIONS: In this study, we observed that the combination of subjective assessment with a parsimonious risk model improved perioperative risk estimation. This may be of value in helping clinicians allocate finite resources such as critical care and to support patient involvement in clinical decision-making.


Year:  2020        PMID: 33057333      PMCID: PMC7561094          DOI: 10.1371/journal.pmed.1003253

Source DB:  PubMed          Journal:  PLoS Med        ISSN: 1549-1277            Impact factor:   11.069


Introduction

The provision of safe surgery is an international healthcare priority [1]. Guidelines recommend that preoperative risk estimation should guide treatment decisions and facilitate shared decision-making [2,3]. Furthermore, there is an ethical imperative (and in the United Kingdom [UK], a legal requirement) to provide an individualised assessment of a patient’s risk of adverse outcomes [4]. Increasing evidence suggests that postoperative mortality in both high and low/middle-income settings is due less to what happens in the operating theatre and more to our ‘failure to rescue’ patients who develop postoperative complications [5,6]. These observations also point towards opportunity: once a patient has been identified as high risk, mitigation strategies such as pre-emptive admission to critical care or enhanced postoperative surveillance may prevent adverse outcomes [2]. However, critical care is a finite resource, with competition for beds between surgical and emergency medical admissions. To that end, the requirement for a postoperative critical care bed is itself a risk factor for last-minute cancellation, with consequent potential for disruption and harm for both patients and healthcare providers [7]. Thus, there is a need to accurately stratify patient risk so as to make the most of limited resources and improve perioperative outcomes. This is especially true given the scale of demand; more than 300 million operations take place annually worldwide [8]. With a major postoperative morbidity rate of around 15% [9,10], a short-term mortality rate between 1 and 3% [11], and a reproducible association between short-term morbidity and long-term survival [9,12,13], the impact of surgical complications on individual patients, healthcare resources, and society at large is clearly evident. Furthermore, if resources permitted, substantially larger numbers of patients would be considered for surgical intervention [1]. 
There are numerous methods available to help clinicians estimate perioperative risk, including frailty indices [14], functional capacity assessments such as cardiopulmonary exercise testing (CPET) [15], and dozens of risk prediction scores and models, many of which are open-source, are easily applied, and have been validated in multiple heterogeneous surgical cohorts [16]. Despite this myriad of choices, data from national Quality Improvement (QI) programmes indicate that clinicians do not routinely document an individualised risk assessment before surgery [10,17]. In part, this may relate to the availability of complex investigations and equipoise over which method is most accurate, particularly when the accuracy of objective methods compared with subjective assessment alone is disputed [15]. We therefore performed a prospective cohort study with the following objectives: to describe how clinicians assess risk in routine practice, to externally validate and compare the performance of 3 open-access risk models with subjective assessment, and to investigate whether objective risk tools add value to subjective assessment.

Methods

This is a planned analysis of the Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery (SNAP-2: EPICCS) study, a prospective observational cohort study conducted in 274 hospitals from the UK, Australia, and New Zealand [18]. We report our findings in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE; S1 Text) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD; S2 Text) statements [19,20]. National research networks, including trainee-led networks, were used to maximise recruitment from public hospitals in all countries. All adult (≥18 years) patients undergoing inpatient surgery and meeting our criteria (see ‘Data set’, below) during a 1-week period were included in our analyses for this paper. Patients were recruited on 21–27 March 2017 in the UK, 21–27 June 2017 in Australia, and 6–13 September 2017 in New Zealand.

Ethical and governance approvals

UK-wide ethical approval for the study was obtained from the Health Research Authority (South Central–Berkshire B REC, reference number: 16/SC/0349); additional permission to collect patient-identifiable data without consent was granted through Section 251 exemption from the Confidentiality Advisory Group for England and Wales (CAG reference: 16/CAG/0087), the NHS Scotland Public Benefit and Privacy Panel for Health and Social Care (PBPP reference: 1617–0126), and individual Health and Social Care Trust research and development departments for each site in Northern Ireland (Belfast, Northern, South Eastern, and Western Health and Social Care Trusts, IRAS reference number: 154486). In Australia, each state had different regulatory approval processes, and approvals were received from the following ethics committees: New South Wales—Hunter New England and Greater Western Human Research Ethics Committee; Queensland—Metro South Hospital and Health Service Human Research Ethics Committee; South Australia—Southern Adelaide Clinical Human Research Ethics Committee; Tasmania—Tasmania Health and Medical Human Research Ethics Committee; Victoria—Alfred Health, Eastern Health, Goulburn Valley, Mercy Health, Monash Health, Peter MacCallum Cancer Centre Research Ethics Committees; Western Australia—South Metropolitan Health Service Human Research Ethics Committee. In New Zealand, the study received national approval from the Health and Disability Ethics Committees (Ethics ref: 17/NTB/139).

Data set

All data (S3 Text) were collected prospectively. In this study, we defined objective risk assessment as the use of a risk calculation model, equation, or tool that supplies a prediction of risk on a probability scale. Before surgery, perioperative teams answered the following question for each patient: ‘What is the perioperative team’s estimate of the risk of death within 30 days?’, with 6 categorical response options (<1%, 1%–2.5%, 2.6%–5%, 5.1%–10%, 10.1%–50%, and >50%). These thresholds were decided by expert consensus within the study steering group and study authors. Teams were then asked to record how they arrived at this estimate (for example, clinical judgement and/or an objective risk tool). The patient data for this study were collected from a wide range of participating publicly funded hospitals in the UK (n = 245), Australia (n = 21), and New Zealand (n = 8). These were a heterogeneous mix of secondary (42%) and tertiary care (58%) institutions, likely reflective of the general composition of hospitals in these countries. We have previously described the hospitals and their available facilities for providing perioperative care [21]. Patients included in the study were adults (≥18 years) undergoing surgery or other interventions that required the presence of an anaesthetist and who were expected to require overnight stay in hospital. We included all procedures taking place in an operating theatre, radiology suite, endoscopy suite, or catheter laboratory for which inpatient (overnight) stay was planned, including both planned and emergency/urgent surgery of all types, endoscopy, and interventional radiology procedures. Patients were excluded if they indicated they did not want to participate in the study.
We also excluded ambulatory surgery, obstetric procedures (for example, cesarean sections and surgery for complications of childbirth), procedures on ASA-PS (American Society of Anesthesiologists Physical Status score) grade VI patients, noninterventional diagnostic imaging (for example, CT or MRI scanning without interventions), and emergency department or critical care interventions requiring anaesthesia or sedation but no interventional procedure.

Statistical analysis

The protocol for SNAP-2: EPICCS was previously published with aims, objectives, and research questions outlined [18]. Our primary outcome for the study described in this paper was inpatient 30-day mortality, recorded prospectively by local collaborators. We conducted 3 inferential analyses, the first using the entire patient data set and the second and third omitting the patients for whom an objective tool was used to predict perioperative risk (Fig 1). For the first analysis, we evaluated performance of the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT) [16,22-24]. The calibration and discrimination of all models was assessed in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) recommendations [20]. Calibration was assessed by graphical inspection of observed versus expected mortality and by the Hosmer–Lemeshow goodness-of-fit test [25]. Discrimination was assessed by calculating the Area Under Receiver Operating Characteristic curve (AUROC) [26]. AUROCs were compared using DeLong’s test for 2 correlated ROC curves [27]. ROC curves can be constructed for both continuous predictions (for example, P-POSSUM, SRS, and SORT) and ordinal categorical predictions (for example, ASA-PS or the 6-category subjective predictions that clinicians were asked to make): in the former, sensitivities and specificities are calculated for every value in the probability range of 0 to 1, and then each point is plotted to obtain a smooth curve; in the latter, sensitivities and specificities are computed for each category, and the points form a polygon on the ROC plot.
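The ordinal-category case can be made concrete with a small sketch. The following Python illustration (the study itself used R; the data here are hypothetical) computes the AUROC via the equivalent rank-based (Mann–Whitney) formulation, which naturally handles the ties produced when clinicians choose among 6 discrete risk bands:

```python
def auroc(scores, outcomes):
    """AUROC as P(case scores higher than control), counting ties as 0.5.

    Works for ordinal categorical predictors (e.g. the 6 risk bands)
    as well as continuous ones, because only rank order is used.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # tied pair contributes half
    return wins / (len(cases) * len(controls))

# Hypothetical cohort: risk bands coded 0-5 (<1%, 1%-2.5%, ..., >50%)
scores = [0, 0, 1, 1, 2, 3, 4, 5, 5, 2]
deaths = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(round(auroc(scores, deaths), 3))  # -> 0.881
```

With a continuous predictor every pairwise comparison is strict and the trapezoidal ROC area is recovered; with banded predictions the tie term is what produces the polygonal ROC described above.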
Fig 1

Participant flowchart.

The second analysis compared the performance of subjective assessment (defined as using clinical judgement and/or ASA-PS) against the best-performing risk tool. For this, we included only patients for whom subjective assessment alone was used to predict the risk of 30-day mortality. Subjective assessment was then evaluated on calibration and discrimination. Point estimates of risk prediction were taken as the midpoint of the predicted risk intervals provided by clinicians (i.e., 0.5% for the interval <1%, 1.75% for the interval 1%–2.5%, and so on), and the proportion of observed mortality in each of these risk categories was calculated. Calibration was then assessed by plotting the observed mortality proportions against the midpoints of clinician-predicted risk intervals. We then compared the performance of subjective assessment against the best-performing risk model, using AUROC and the continuous Net Reclassification Improvement (NRI) statistic [25]. The NRI quantifies the proportion of individuals whose predictions improve in accuracy (positive reclassification) minus the proportion whose predictions worsen in accuracy (negative reclassification) when using one prediction model versus another [28]. An NRI >0 indicates an overall improvement, <0 an overall deterioration, and zero no difference in prediction accuracy. The third analysis evaluated the added value of combining subjective assessment with the best-performing risk tool by creating a logistic regression model with variables from both sources.
For this, we fitted a logistic regression model with 2 variables: the subjective assessment of risk and the mortality prediction from the best objective risk tool according to the following logit formula: ln(R/(1 − R)) = β0 +β1Xsubjective + β2Xobjective, where R is the probability of 30-day mortality; β0, β1, and β2 are the model coefficients; Xsubjective is the subjective clinical assessment (6 ordered categories, as above); and Xobjective is the risk of mortality as predicted using the most accurate risk model. An optimism-corrected performance estimate of the combined model was obtained using bootstrapped internal validation with 1,000 repetitions; this was then compared with subjective assessment and the most accurate risk model alone. We used decision-curve analysis (DCA) to describe and compare the clinical implications of using each risk model. In DCA, a model is considered to have clinical value if it has the highest net benefit across the whole range of thresholds for which a patient would be labelled as ‘high risk’. The net benefit is defined as the difference between the proportion of true positives (labelled as high risk and then going on to die within 30 days of surgery) and the proportion of false positives (labelled as high risk but not going on to die within 30 days) weighted by the odds of the selected threshold for the high-risk label. At any given threshold, the model with the higher net benefit is the preferred model [20, 25, 29].
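The continuous NRI defined above can be sketched in a few lines (illustrative Python on hypothetical risks, not the study's R code): each patient's predicted risk under the new model is compared with the old model, and events that move up plus non-events that move down count as correct reclassifications.

```python
def continuous_nri(old_risk, new_risk, outcomes):
    """Continuous NRI: (up - down)/n among events plus (down - up)/n among non-events."""
    ev_up = ev_dn = ne_up = ne_dn = 0
    n_events = sum(outcomes)
    n_nonevents = len(outcomes) - n_events
    for old, new, died in zip(old_risk, new_risk, outcomes):
        if died:
            ev_up += new > old   # event moved up: correct
            ev_dn += new < old   # event moved down: incorrect
        else:
            ne_up += new > old   # non-event moved up: incorrect
            ne_dn += new < old   # non-event moved down: correct
    return (ev_up - ev_dn) / n_events + (ne_dn - ne_up) / n_nonevents

# Hypothetical predictions for 4 patients (the last 2 died):
old = [0.02, 0.05, 0.10, 0.30]
new = [0.01, 0.06, 0.20, 0.50]
print(continuous_nri(old, new, [0, 0, 1, 1]))  # -> 1.0
```

A perfectly reclassifying model approaches the maximum of 2 (both component terms equal to 1); a value of 0 indicates no net improvement.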

Missing data

The P-POSSUM requires biochemical and haematological data for calculation; however, fit patients may not have preoperative blood tests [30], and in other cases, there may be no time for blood analysis before surgery. Therefore, in cases for which these data were missing, normal physiological ranges were imputed because this most closely follows what clinicians might reasonably do in practice when tests are not indicated or not feasible or results are missing. Following imputation, we performed a complete case analysis because we considered the proportion of cases with missing data in the remaining variables to be low (1.08%) [31].
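A minimal sketch of this normal-value imputation strategy (Python for illustration; the field names and reference values below are assumptions for the example, not the study's actual P-POSSUM definitions):

```python
# Illustrative mid-normal reference values for labs the P-POSSUM needs.
NORMAL_VALUES = {
    "sodium_mmol_l": 140.0,
    "potassium_mmol_l": 4.0,
    "urea_mmol_l": 5.0,
    "haemoglobin_g_l": 140.0,
    "white_cell_count_e9_l": 8.0,
}

def impute_normals(record):
    """Replace missing lab results with a normal physiological value."""
    return {
        field: record.get(field) if record.get(field) is not None else default
        for field, default in NORMAL_VALUES.items()
    }

# Hypothetical patient whose preoperative bloods were not fully taken:
patient = {"sodium_mmol_l": 131.0, "potassium_mmol_l": None}
print(impute_normals(patient)["potassium_mmol_l"])  # -> 4.0
```

Measured values pass through unchanged; only absent results fall back to the normal range, mirroring the assumption that a test was not indicated rather than abnormal.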

Sensitivity analyses

We conducted a number of sensitivity analyses to examine the potential effects of differences in population characteristics on our main study findings. First, we repeated our analyses in a full cohort of patients, including those undergoing obstetric procedures. Second, we repeated the analysis in a subgroup of high-risk patients, defined according to previously published criteria based on age, type of surgery, and comorbidities [15,32]. Third, we evaluated the impact on the accuracy of subjective assessment of using objective tools by comparing discrimination and calibration of subjective assessment in the subgroup of patients whose risk estimates were not solely informed by clinical judgement. Fourth, we repeated our analyses separately in the UK and Australian/New Zealand cohorts to investigate the potential for geographical influences on our findings. Fifth, we examined the potential impact of normal value imputation on missing P-POSSUM values by repeating the analysis on only cases in which no missing P-POSSUM variables were present. Finally, we conducted analyses on surgical specialty subgroups to evaluate the accuracy of the new model created on different subcohorts. Analyses were performed using R version 3.5.2; p < 0.05 was considered statistically significant. Statistical code is available on request.

Results

Patient data were collected on 26,502 surgical episodes in 274 hospitals across the UK, Australia, and New Zealand (Table 1). A total of 3,871 cases were excluded from all analyses: 3,660 obstetric cases in which there were no deaths, plus a further 211 cases with missing values. This left 22,631 cases with adequate data for external validation of the P-POSSUM, SRS, and SORT models, the first part of our analyses (Fig 1). For the second and third analyses, in which we compared subjective assessment against the best-performing objective risk tool and combined these measures to create a new model for internal validation, we excluded a further 4,786 cases in which clinician prediction was aided by the use of other risk tools. This left 17,845 cases for these analyses. There were 317 inpatient deaths within 30 days of surgery (1.40%). In most cases, subjective assessment alone was used to estimate risk (n = 17,845, 78.9%; Table 2). No patients were lost to follow-up.
Table 1

Patient demographics stratified by 30-day mortality.

Characteristic | Overall | Survived | Died
N | 22,631 | 22,314 | 317
Male sex (%) | 10,671 (47.2) | 10,481 (47.0) | 190 (59.9)
Female sex (%) | 11,960 (52.8) | 11,833 (53.0) | 127 (40.1)
Age, years (median [IQR]) | 62 [46–73] | 62 [45–73] | 76 [64–83]
Operative urgency (%)
  Elective | 12,061 (53.3) | 12,029 (53.9) | 32 (10.1)
  Expedited | 3,311 (14.6) | 3,270 (14.7) | 41 (12.9)
  Urgent | 6,617 (29.2) | 6,460 (29.0) | 157 (49.5)
  Immediate | 642 (2.8) | 555 (2.5) | 87 (27.4)
ASA-PS class (%)
  I | 4,462 (19.7) | 4,458 (20.0) | 4 (1.3)
  II | 10,192 (45.0) | 10,168 (45.6) | 24 (7.6)
  III | 6,574 (29.0) | 6,454 (28.9) | 120 (37.9)
  IV | 1,337 (5.9) | 1,206 (5.4) | 131 (41.3)
  V | 66 (0.3) | 28 (0.1) | 38 (12.0)
Procedure severity (%)*
  Minor | 1,951 (8.6) | 1,919 (8.6) | 32 (10.1)
  Intermediate | 4,523 (20.0) | 4,476 (20.1) | 47 (14.8)
  Major | 7,478 (33.0) | 7,369 (33.0) | 109 (34.4)
  Xmajor | 5,281 (23.3) | 5,218 (23.4) | 63 (19.9)
  Complex | 3,398 (15.0) | 3,332 (14.9) | 66 (20.8)
Surgical specialty (%)
  Gastrointestinal surgery | 4,472 (19.8) | 4,384 (19.6) | 88 (27.8)
  Gynaecology/urology | 4,309 (19.0) | 4,297 (19.3) | 12 (3.8)
  Neuro/spinal surgery | 1,208 (5.3) | 1,181 (5.3) | 27 (8.5)
  Orthopaedics | 6,772 (29.9) | 6,688 (30.0) | 84 (26.5)
  Thoracic/cardiac surgery | 1,033 (4.6) | 1,015 (4.5) | 18 (5.7)
  Vascular | 674 (3.0) | 645 (2.9) | 29 (9.1)
  Other | 4,163 (18.4) | 4,104 (18.4) | 59 (18.6)
Past medical history: coronary artery disease (%) | 3,029 (13.4) | 2,923 (13.1) | 106 (33.4)
Past medical history: congestive cardiac failure (%) | 893 (3.9) | 839 (3.8) | 54 (17.0)
Past medical history: metastatic cancer (active) (%) | 825 (3.6) | 799 (3.6) | 26 (8.2)
Past medical history: dementia (%) | 676 (3.0) | 644 (2.9) | 32 (10.1)
Past medical history: COPD (%) | 1,955 (8.6) | 1,909 (8.6) | 46 (14.5)
Past medical history: pulmonary fibrosis (%) | 180 (0.8) | 173 (0.8) | 7 (2.2)
Past medical history: liver cirrhosis (%) | 224 (1.0) | 206 (0.9) | 18 (5.7)
Past medical history: renal disease (%) | 381 (1.7) | 362 (1.6) | 19 (6.0)
Past medical history: diabetes (%)
  Type 1 | 274 (1.2) | 265 (1.2) | 9 (2.8)
  Type 2 (dietary-controlled) | 614 (2.7) | 598 (2.7) | 16 (5.0)
  Type 2 (insulin-controlled) | 761 (3.4) | 743 (3.3) | 18 (5.7)
  Type 2 (oral hypoglycaemic medication) | 1,570 (6.9) | 1,522 (6.8) | 48 (15.1)
  No diabetes | 19,399 (85.8) | 19,173 (86.0) | 226 (71.3)
Postoperative length of stay, days (median [IQR]) | 3 [1–6] | 3 [1–6] | 7 [2–13]
SORT-calculated mortality risk, % (median [IQR]) | 0.4 [0.2–1.6] | 0.4 [0.2–1.6] | 8.9 [4.2–20.6]
P-POSSUM-calculated mortality risk, % (median [IQR]) | 1.1 [0.6–2.9] | 1.1 [0.6–2.9] | 18.1 [5.7–41.6]
SRS-calculated mortality risk, % (median [IQR]) | 1.9 [0.8–4.4] | 1.9 [0.8–4.4] | 19.6 [4.4–36.1]
Subjective clinical assessment made on clinical judgement and/or ASA-PS grading alone (%) | 17,845 (78.9) | 17,657 (79.1) | 188 (59.3)

*Procedure severity classification (minor, intermediate, major, Xmajor, and complex: ordinal scale).

Abbreviations: ASA-PS, American Society of Anesthesiologists Physical Status; COPD, Chronic Obstructive Pulmonary Disease; IQR, interquartile range; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

Table 2

Methods used by clinicians to estimate 30-day mortality.

Clinicians could select one or more categories; therefore, the total percentages (in parentheses) exceed 100%.

Method | n (%)
N (total) | 22,631
Clinical judgement | 20,064 (88.7)
ASA-PS score | 8,622 (38.1)
Duke Activity Status Index or other activity index | 515 (2.3)
Six-minute walk test or incremental shuttle walk test | 48 (0.2)
Cardiopulmonary exercise testing | 215 (1.0)
Formal frailty assessment (for example, Edmonton Frail Scale) | 48 (0.2)
SRS | 315 (1.4)
SORT | 750 (3.3)
EuroSCORE | 442 (2.0)
POSSUM | 287 (1.3)
P-POSSUM | 1,397 (6.2)
Surgery-specific POSSUM (for example, Vasc-POSSUM) | 192 (0.8)
Other risk scoring system | 651 (2.9)

Abbreviations: ASA-PS, American Society of Anesthesiologists Physical Status; EuroSCORE, European System for Cardiac Operative Risk Evaluation; POSSUM, Physiology and Operative Severity Score for the enUmeration of Mortality; P-POSSUM, Portsmouth-POSSUM; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.


External validation of existing risk prediction models

The SORT was the best calibrated of the pre-existing models; however, all overpredicted risk (Fig 2A–2C; Hosmer–Lemeshow p-values all <0.001 for the SORT, P-POSSUM, and SRS). All models exhibited good-to-excellent discrimination (Fig 2D; AUROC SORT = 0.90, 95% confidence interval [CI]: 0.88–0.92; P-POSSUM = 0.89, 95% CI: 0.88–0.91; SRS = 0.85, 95% CI: 0.82–0.87). The AUROC for the SORT was significantly better than SRS (p < 0.001), but not P-POSSUM (p = 0.298).
Fig 2

Calibration plots for the SORT (A), P-POSSUM (B), SRS (C), and ROC curves for the 3 models (D).

In the calibration plots (A–C), nonparametric smoothed best-fit curves (blue) are shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each decile of predicted mortality. External validation of all 3 models were performed on the entire patient data set (n = 22,631). ASA-PS, American Society of Anesthesiologists Physical Status; CI, confidence interval; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; ROC, Receiver Operating Characteristic; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.


Subjective assessment

There were 188 deaths (1.05%) within 30 days of surgery in the subset of 17,845 patients who had mortality estimates based on clinical judgement and/or ASA-PS alone. Subjective assessment overpredicted risk (Fig 3A, Hosmer–Lemeshow test p < 0.001) but demonstrated good discrimination (Fig 3B and Table 3, AUROC = 0.89, 95% CI: 0.86–0.91), which was not significantly different from the SORT (p = 0.309). Continuous NRI analysis did not show improvement in classification when using the SORT compared with subjective assessment (Table 3 and S4 Text). The 30-day mortality outcomes at each level of clinician risk prediction were cross-tabulated, showing that clinician predictions correlated well with actual mortality outcomes (S2 Table).
Fig 3

Calibration plots and ROC curves for subjective clinical assessments (A, B) and the logistic regression model combining clinician and SORT predictions (C, D), validated on the subset of patients in whom clinicians estimated risk based on clinical judgement alone (n = 17,845).

For (A), a nonparametric smoothed best-fit curve (blue) is shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each range of clinician-predicted mortality. For (C), the apparent (blue) and optimism-corrected (red) nonparametric smoothed calibration curves are shown; the latter was generated from 1,000 bootstrapped resamples of the data set. CI, confidence interval; ROC, Receiver Operating Characteristic; SORT, Surgical Outcome Risk Tool.

Table 3

Coefficients of the logistic regression model combining subjective clinical assessment with SORT-predicted risk; p-values in a logistic regression model test the null hypothesis that the estimated coefficient is equal to zero using a z-test.

Variable | Coefficient | Standard Error | z-Statistic | p-Value
Intercept | −6.403 | 0.2135 | −30 | <0.001
SORT-predicted risk (per 1% risk) | 0.04028 | 0.007049 | 5.714 | <0.001
Clinical assessment of risk
  <1% | Reference | — | — | —
  1%–2.5% | 1.487 | 0.2962 | 5.021 | <0.001
  2.6%–5% | 2.365 | 0.3177 | 7.444 | <0.001
  5.1%–10% | 3.074 | 0.2976 | 10.33 | <0.001
  10.1%–50% | 4.156 | 0.2852 | 14.57 | <0.001
  >50% | 5.028 | 0.3186 | 15.78 | <0.001

Abbreviations: SORT, Surgical Outcome Risk Tool.
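To illustrate how the published coefficients in Table 3 combine, the following sketch (Python for illustration; the example patient is hypothetical) converts a SORT prediction and a clinician's chosen risk band into the combined model's 30-day mortality estimate via the logit formula given in the Methods:

```python
import math

# Coefficients as reported in Table 3 of this paper.
INTERCEPT = -6.403
BETA_SORT = 0.04028  # per 1% of SORT-predicted risk
BETA_BAND = {        # clinician risk-band coefficients (reference: <1%)
    "<1%": 0.0, "1%-2.5%": 1.487, "2.6%-5%": 2.365,
    "5.1%-10%": 3.074, "10.1%-50%": 4.156, ">50%": 5.028,
}

def combined_risk(sort_pct, band):
    """Predicted probability of 30-day mortality from the combined model."""
    logit = INTERCEPT + BETA_SORT * sort_pct + BETA_BAND[band]
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical patient: SORT predicts 5% and the clinician chose 2.6%-5%.
print(round(100 * combined_risk(5.0, "2.6%-5%"), 1))  # -> 2.1 (% risk)
```

Note that the combined estimate (about 2.1%) sits below both inputs, consistent with the overprediction of risk observed for all methods in this cohort.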


Combining subjective and objective risk assessment

Bootstrapped internal validation yielded an optimism-corrected AUROC of 0.92 for a combined model using both subjective assessment and SORT predictions as independent variables (Table 3); this was better than subjective assessment alone (p < 0.001) and the SORT alone (p = 0.021) (Table 4). The model also significantly (p < 0.001) improved reclassification compared with subjective assessment alone in continuous NRI analysis (S4 Text). The improved NRI was largely attributable to the correct downgrading of patient risks; that is, a large proportion of patients were correctly reclassified as lower risk using the combined model compared with subjective assessment. The DCA also favoured the SORT over the other previously published models, but the combined clinician judgement–SORT model again performed best (Fig 4). The effect of combining information from subjective assessment and the SORT is further demonstrated by computing the conditional probabilities of 30-day mortality using the combined model over a full range of predictor values (Fig 5). Across all risk thresholds, the combined model outperformed the P-POSSUM and SRS, and beyond approximately the 10% risk threshold, the P-POSSUM and SRS demonstrated negative net benefit. The decision curve for our combined model incorporating both subjective assessment and the SORT showed increased net benefit across almost the entire range of risk thresholds versus the SORT alone.
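The net-benefit calculation underlying the decision curves can be sketched as follows (illustrative Python on hypothetical predictions, not the study's R code): at each threshold, patients at or above the threshold are labelled high risk, and false positives are penalised by the odds of the threshold.

```python
def net_benefit(pred_risk, outcomes, threshold):
    """Net benefit at a high-risk threshold: TP/N - (FP/N) * pt/(1 - pt)."""
    n = len(pred_risk)
    flagged = [(p >= threshold, y) for p, y in zip(pred_risk, outcomes)]
    tp = sum(1 for high, y in flagged if high and y == 1)  # flagged, died
    fp = sum(1 for high, y in flagged if high and y == 0)  # flagged, survived
    weight = threshold / (1.0 - threshold)  # odds of the chosen threshold
    return tp / n - (fp / n) * weight

# Hypothetical model output for 4 patients (the last 2 died), 10% threshold:
print(net_benefit([0.02, 0.08, 0.20, 0.60], [0, 0, 1, 1], 0.10))  # -> 0.5
```

Sweeping the threshold over a clinically plausible range and plotting net benefit for each model reproduces the form of a decision curve; the model tracing the highest curve at a given threshold is preferred.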
Table 4

Performance metrics for clinician prediction versus SORT and versus a logistic regression model combining clinician and SORT prediction.

Calculations based on the subset of patients in whom clinician judgement alone was used to estimate risk (n = 17,845). The reported AUROC for the combined model is the optimism-corrected value from bootstrapped internal validation.

Model | AUROC | 95% CI | p-Value (1) | Continuous NRI | 95% CI | p-Value (2)
Clinical | 0.886 | 0.858–0.914 | Reference | Reference | — | —
SORT | 0.900 | 0.877–0.923 | 0.309 | 0.073 | −0.062 to 0.208 | 0.288
Combined | 0.920 | 0.899–0.940 | <0.001 | 0.130 | 0.057–0.202 | <0.001

(1) Differences between AUROCs are tested using DeLong’s test for 2 correlated ROC curves with a null hypothesis of no difference.

(2) Differences between continuous NRI statistics are tested using a z-test with a null hypothesis of no difference.

Abbreviations: AUROC, Area Under the Receiver Operating Characteristic curve; CI, confidence interval; NRI, Net Reclassification Improvement.

Fig 4

DCA.

DCA, decision-curve analysis; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.
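In a DCA, a model's net benefit at a risk threshold pt is TP/n − (FP/n) × pt/(1 − pt), where patients with predicted risk at or above pt are treated; sweeping pt over (0, 1) traces the decision curve, compared against treat-all and treat-none strategies. A minimal sketch on toy data (not the study data):

```python
import numpy as np

def net_benefit(y, p, pt):
    """Net benefit of treating patients with predicted risk >= pt,
    for a threshold pt in (0, 1):  NB = TP/n - (FP/n) * pt/(1 - pt)."""
    y = np.asarray(y, dtype=bool)
    treat = np.asarray(p) >= pt
    n = y.size
    tp = np.count_nonzero(treat & y)
    fp = np.count_nonzero(treat & ~y)
    return tp / n - (fp / n) * pt / (1.0 - pt)

def net_benefit_treat_all(y, pt):
    """Net benefit of treating everyone (the treat-all reference line)."""
    prev = np.mean(y)
    return prev - (1.0 - prev) * pt / (1.0 - pt)

# A model that ranks the single event above the non-events:
y_toy = [1, 0, 0, 0]
p_toy = [0.8, 0.2, 0.1, 0.1]
nb_model = net_benefit(y_toy, p_toy, 0.5)      # treats only the true event
nb_all = net_benefit_treat_all(y_toy, 0.5)     # treats everyone
```

A model whose curve dips below zero at a threshold, as reported here for P-POSSUM and SRS beyond roughly 10%, would do more harm than good if used to trigger treatment at that threshold.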

Fig 5

Predicted risks from combined model, stratified by clinical assessments.

We model the changes to risk predictions (y-axis) based on subjective clinical assessments (coloured lines) as SORT-predicted risks (x-axis) change, to illustrate how risk predictions change when information from both is combined. P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

A summary of the different sensitivity analyses is provided in S5 Text. In the first sensitivity analysis (S6 Text), we repeated the main study analyses using the full cohort of patients available from SNAP-2: EPICCS, including those undergoing obstetric procedures, and found minimal differences from our main study findings. The SORT was again the best calibrated of the pre-existing models in this larger cohort, and all objective risk tools again overpredicted risk (S1 Fig; Hosmer–Lemeshow p-values all <0.001 for the SORT, P-POSSUM, and SRS). The estimates for AUROC were minimally affected (S1 Fig; AUROC SORT = 0.91, 95% CI: 0.90–0.93; P-POSSUM = 0.90, 95% CI: 0.88–0.92; SRS = 0.85, 95% CI: 0.83–0.88). The AUROC for the SORT was still significantly better than that of the SRS (p < 0.001), but not the P-POSSUM (p = 0.121). Subjective assessment in this first sensitivity analysis demonstrated similar overprediction of risk (S2 Fig, Hosmer–Lemeshow test p < 0.001) and similar discrimination (S2 Fig, AUROC = 0.89, 95% CI: 0.87–0.92) to the main study analysis. Discrimination of subjective assessment and the SORT again did not differ significantly (p = 0.216). Continuous NRI analysis again did not show improvement in classification when using the SORT compared with subjective assessment in this larger group of patients. For the second sensitivity analysis (S7 Text), we used previously defined, more restrictive inclusion criteria to identify high-risk patients [15,32]. This yielded a subgroup of 12,985 patients in whom the 30-day mortality rate was 2.01%.
In this subgroup, calibrations of P-POSSUM, SRS, and SORT predictions were similar to those in the full cohort (S3 Fig). The AUROCs were lower in this subgroup (SORT = 0.88, 95% CI: 0.86–0.90; P-POSSUM = 0.86, 95% CI: 0.84–0.89; SRS = 0.81, 95% CI: 0.78–0.84, S3 Fig). The calibration of subjective assessment was again similar to that of the full cohort, and discrimination was reduced but still good (AUROC = 0.85, 95% CI: 0.82–0.89, S3 Fig). The discrimination of subjective assessment in this subgroup was not significantly different from that in the full cohort (p = 0.155). The third sensitivity analysis (S8 Text) used the subgroup whose mortality estimate was based on clinical judgement in conjunction with any objective risk tool (n = 4,751, S4 Fig). The AUROC for subjective assessment in this subgroup was 0.88, which was not significantly different from the AUROC in the main cohort (p = 0.769). The calibration of subjective assessment in this subgroup was similar to that in the main cohort, again with a tendency to overpredict risk. In the fourth sensitivity analysis (S9 Text), we looked for differences in the performance of subjective clinical assessment and objective risk tools between the UK and the Australia/New Zealand cohorts (S5 Fig and S3 Table). The 30-day mortality in the Australia/New Zealand cohort (1.09%) was comparable to that of the UK (1.45%, p = 0.127). Visual inspection of calibration plots showed the SORT to be worse calibrated in Australasia than in the UK. AUROCs for the objective tools in the Australasian subset (P-POSSUM = 0.90, SRS = 0.81, SORT = 0.87) were not significantly different from those in the UK subset (P-POSSUM = 0.89, SRS = 0.85, SORT = 0.90; p > 0.05 for all). The calibration of subjective clinical assessment was comparable in the 2 geographical subgroups, and there were also no significant differences in AUROCs (Australasia: 0.88, UK: 0.89, p = 0.860, S6 Fig).
For the fifth sensitivity analysis (S10 Text), we used the subgroup of patients who had no missing P-POSSUM variables (n = 18,362; see S1 Table for patient characteristics). Patients with complete P-POSSUM variables appeared to be older, to have higher ASA-PS grades, and to undergo higher-severity surgery than those with missing P-POSSUM variables. The AUROC for clinical assessments in the subgroup with full P-POSSUM variables was 0.90, which was not significantly different from the AUROC obtained for clinical assessments in the main study analysis (p = 0.587), and the predictions were similarly calibrated to clinical assessments in the main study analysis, again with a tendency to overpredict risk (S7 Fig). The performance of the P-POSSUM (AUROC = 0.89), SRS (AUROC = 0.84), and SORT (AUROC = 0.90) in this subgroup was again similar to that in the main study cohort (p > 0.05 for all comparisons). In the sixth and final sensitivity analysis (S11 Text), we evaluated the AUROC and calibration of the SORT–clinical judgement model in subgroups of patients according to surgical specialty (S4 Table). The AUROC remained high within these subgroups (ranging from 0.87, 95% CI 0.75–0.98 in 1,033 cardiothoracic surgical patients to 0.95, 95% CI 0.90–0.99 in 4,309 gynaecology and urology patients). Calibration was also good across different specialties, with the exception of vascular surgery (674 patients, AUROC 0.88, 95% CI 0.82–0.94; Hosmer–Lemeshow p = 0.009).
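The Hosmer–Lemeshow test cited throughout these analyses compares observed with expected event counts within groups (conventionally deciles) of predicted risk. A minimal sketch on simulated data, not the study data (scipy is assumed for the chi-squared tail probability, and the df = groups − 2 convention is an assumption of this sketch):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test over equal-size risk groups.
    Returns (statistic, p-value); small p-values suggest poor calibration."""
    order = np.argsort(p)
    y = np.asarray(y, dtype=float)[order]
    p = np.asarray(p, dtype=float)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(y.size), groups):
        n_g = idx.size
        obs = y[idx].sum()          # observed events in group
        exp = p[idx].sum()          # expected events in group
        pbar = exp / n_g
        stat += (obs - exp) ** 2 / (n_g * pbar * (1.0 - pbar))
    return stat, chi2.sf(stat, groups - 2)

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.95, size=4000)
y = (rng.random(4000) < p_true).astype(int)

stat_good, pval_good = hosmer_lemeshow(y, p_true)       # well calibrated
stat_bad, pval_bad = hosmer_lemeshow(y, p_true ** 2)    # underpredicts risk
```

The systematically underpredicting model yields a far larger statistic and a tiny p-value, the same pattern the study reports for the overpredicting pre-existing tools.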

Discussion

We present data from an international cohort of patients undergoing inpatient surgery with a low risk of recruitment bias. Despite a plethora of options for objective risk assessment, subjective assessment alone was used to predict 30-day mortality risk in over 80% of patients. All previously published risk models were poorly calibrated for this cohort of patients, reflecting the common problem of calibration drift over time. However, the combination of subjective clinical assessment with the parsimonious SORT model provides an accurate prediction of 30-day mortality that is significantly better than any of the methods we evaluated when used on their own. These findings should give clinicians confidence that the combined SORT–clinical judgement model can be used to support the appropriate allocation of finite resources and to inform discussions with patients about the risks of surgery. The combined model accurately downgraded predicted risk compared with other methods; application of this approach may therefore result in fewer low-risk patients being inappropriately admitted to critical care (thus easing system pressures) and fewer patients having their surgery cancelled for lack of a critical care bed [7]. Finally, application of the SORT–clinical judgement model may assist hospital managers and policy makers in determining the likely demand for postoperative critical care, thus supporting best practice at the hospital, regional, or national level. This new model will now be incorporated into an open-access risk-assessment system (http://www.sortsurgery.com/), enabling clinicians to combine their clinical estimation of risk with the SORT model to evaluate patient risk from major surgery. To our knowledge, this is the first study comparing subjective and objective assessment for predicting perioperative mortality risk in a large multicentre international cohort.
The highest-quality previous studies in this field have been challenged by recruitment bias because of the predominant participation of research-active centres and the need for patient consent. For example, the METS (Measurement of Exercise Tolerance before Surgery) study [15], which compared clinical assessment of functional capacity with exercise testing, self-assessment, and a serum biomarker in 1,401 patients, and the VISION (Vascular Events in Non-cardiac Surgery Cohort) study [32], which evaluated postoperative biomarkers in 15,133 patients, had screening-to-recruitment rates of 27% and 68%, respectively. One way of overcoming such biases would be to study the accuracy of prognostic models using routinely collected or administrative data; however, this approach is unlikely to enable the evaluation of subjective assessments in multiple centres. Our study avoided these issues through prospective data collection in an unselected cohort with an ethical waiver for patient consent. The mortality in our sample closely matches that recorded in UK administrative data for patients undergoing major or complex surgery [11], supporting our assertion that our cohort was representative of the ‘real-world’ perioperative population. Our observation that the majority of risk assessments conducted for perioperative patients do not involve objective measures is also noteworthy because subjective assessment is currently almost never incorporated into risk prediction tools for surgery. One exception is the American College of Surgeons National Surgical Quality Improvement Program Surgical Risk Calculator [33], which incorporates a 3-point scale of clinically assessed surgical risk (normal, high, or very high) to supplement a calculated prediction of mortality and various short-term outcomes.
However, this system is proprietary, has rarely been evaluated outside the US, and is substantially more complex than the SORT–clinical judgement model, with 21 input variables compared with 8. Furthermore, the methodology used to develop this ‘uplift’ was quite different from ours: a panel of 80 surgeons retrospectively evaluated and graded 10 case scenarios. We recognise some limitations to our study. First, models predicting rare events may appear optimistically accurate: a model that identifies every patient as being at low risk of mortality in a group in which the probability of death approaches 0% would almost always appear to be correct. For this reason, we undertook several sensitivity analyses, including one that evaluated the performance of the various risk-assessment methods in a subgroup of patients who had been defined as high risk in previous studies of prognostic indicators and in whom the mortality rate was higher. We found that the performance of the SORT and subjective assessment remained good and compared favourably with previous evaluations of more complex risk-assessment methods [15,32]. Second, whilst we assumed that subjective assessments were truly clinically based judgements, because this was a pragmatic unblinded study, it is possible that information from other sources may have subconsciously influenced these assessments. For this reason, we undertook the second sensitivity analysis, which did not support this concern. Third, the very act of estimating mortality risk may lead clinicians to take actions that reduce that risk, thereby biasing the outcome of the assessments made and in particular affecting the calibration of subjective risk estimates. The only way to avoid this bias would be to use subjective assessments made by clinicians independent of the clinical management of individual patients, and this may be an interesting opportunity for future research.
Fourth, since we undertook this study, other promising risk-assessment methods have been developed, including the Combined Assessment of Risk Encountered in Surgery (CARES) system, which was derived from electronic health records; unfortunately, we were unable to externally validate this system because we did not collect all the required variables [34]. We also did not evaluate the accuracy of other risk prediction methods such as frailty assessment or cardiopulmonary exercise testing. However, this was not an a priori objective of our study [18]; furthermore, our observation of the lack of ‘real-world’ use of these types of predictors is in itself an important finding, particularly given the substantial interest in such measures (some of which carry considerable cost) in the research literature [15,35]. Fifth, the UK cohort was substantially larger than the Australasian cohort; however, we found no significant differences in mortality or in the accuracy of the various risk-assessment methods between the 2 geographical groups. Finally, the study was conducted entirely in high-income countries; our findings should therefore now be tested in low- and middle-income nations to evaluate global generalisability. Our finding that the combination of subjective and SORT-based assessment is the best approach is important because it is likely to have face validity with clinicians, thereby improving the likelihood that our new model will be incorporated into clinical practice. There is a sound rationale for this finding: clinicians likely consider otherwise unmeasured factors that they recognise as important, such as severity of comorbid diseases, frailty, socioeconomic status, patient motivation, and anticipated technical challenges.
Modern approaches to risk assessment using machine learning [36] hold promise for automating risk prediction and for incorporating data and calculations that clinicians may subconsciously consider when making subjective decisions; however, even these methods do not substantially outperform our simpler approach, and they are currently limited by recruitment biases and lack of availability. Future research could evaluate the benefits of incorporating clinical judgement into risk-assessment methods in medicine more generally. Implementation of a widely available, parsimonious, and free-to-use risk-assessment tool to guide clinical decision-making about critical care allocation and other aspects of perioperative care may now be considered particularly important in view of the likely prevalence of endemic COVID-19 leading to increased demand for critical care facilities. Therefore, now more than ever, risk-based allocation of these resources is important for the benefit of individual patients and the hospitalised population as a whole. Further to this, application of either the SORT or the SORT–clinical judgement model to perioperative population data may assist healthcare policy makers and managers in modelling the likely demand for postoperative critical care, thus improving system-level planning and resource utilisation. Based on the results of this large generalisable cohort study, the focus of the perioperative academic community could now shift from evaluating which risk prediction method might be best to testing the impact of SORT–clinical judgement-based decision-making on perioperative outcomes. In conclusion, the combination of subjective and objective risk assessment using the SORT calculator provides a more accurate estimate of 30-day postoperative mortality than subjective assessment alone.
Implementation of the SORT–clinical judgement model should lead to better clinical decision-making and improved allocation of resources such as critical care beds to patients who are most likely to benefit.

STROBE checklist.

STROBE, Strengthening the Reporting of Observational Studies in Epidemiology. (PDF)

TRIPOD checklist.

TRIPOD, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis. (DOCX)

Case report form.

(DOCX)

Continuous net reclassification index analysis and reclassification tables.

(DOCX)

Sensitivity analyses overview.

(DOCX)

Sensitivity analysis 1.

(DOCX)

Sensitivity analysis 2.

(DOCX)

Sensitivity analysis 3.

(DOCX)

Sensitivity analysis 4.

(DOCX)

Sensitivity analysis 5.

(DOCX)

Sensitivity analysis 6.

(DOCX)

Acknowledgments and full list of SNAP2: EPICCS collaborators.

SNAP2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery. (DOCX)

Characteristics of the patient subgroups used in all sensitivity analyses.

ASA-PS, American Society of Anesthesiologists Physical Status; COPD, Chronic Obstructive Pulmonary Disease; IQR, interquartile range; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale. (DOCX)

Confusion matrix of patients’ 30-day mortality outcomes versus clinician predictions.

(%) represents row percentage. (DOCX)

AUROCs of the objective risk tools and subjective assessment, compared between the UK and Australian/New Zealand data subsets.

We found no significant difference in discrimination using any of the risk prediction tools or using subjective assessment when comparing their performance in the UK and Australian/New Zealand data sets. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale. (DOCX)

Discrimination and calibration performance of the new combined prediction model in different specialty subgroups.

(DOCX)

Calibration plots for the SORT (A), P-POSSUM (B), SRS (C), and ROC curves for the 3 models (D) validated in the whole patient cohort, including those undergoing obstetric procedures.

In the calibration plots (A–C), nonparametric smoothed best-fit curves (blue) are shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each decile of predicted mortality. External validation of all 3 models was performed on the entire SNAP-2: EPICCS patient data set (n = 25,854). CI, confidence interval; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale. (PDF)

Calibration plots and ROC curves for subjective clinical assessments (A, B) and the logistic regression model combining clinician and SORT predictions (C, D), validated on the subset of patients in whom clinicians estimated risk based on clinical judgement alone, drawn from the full SNAP-2: EPICCS data set, including patients who underwent obstetric surgery (n = 21,325).

For (A), a nonparametric smoothed best-fit curve (blue) is shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each range of clinician-predicted mortality. For (C), the apparent (blue) and optimism-corrected (red) nonparametric smoothed calibration curves are shown; the latter was generated from 1,000 bootstrapped resamples of the data set. CI, confidence interval; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool. (PDF)

Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with restricted inclusion criteria (n = 12,985).

The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.863, 0.810, 0.875, and 0.853 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale. (PDF)

Calibration plot (A) and ROC curve (B) for clinical assessments, validated in the sensitivity analysis patient subgroup in which clinical assessments were made in conjunction with 1 or more other risk prediction tools (n = 4,786).

The AUROC for clinical assessments was 0.880 in this subgroup. AUROC, Area Under Receiver Operating Characteristic curve. (PDF)

Calibration plots (A to F) and ROC curves (G & H) for objective risk tools, validated in patients stratified by their country groups.

There was minimal difference between countries. ROC, Receiver Operating Characteristic. (PDF)

Calibration plots (A & B) and ROC curves (C & D) for clinical assessments, validated in patients stratified by their country groups.

There was minimal difference between countries. ROC, Receiver Operating Characteristic. (PDF)

Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with complete P-POSSUM variables (n = 18,362).

The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.893, 0.838, 0.899, and 0.896 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale. (PDF)
Raw data.

(ZIP)
The prospective nature of the study, the diverse demographics involved (over 26,000 patients from 274 hospitals in the United Kingdom, Australia and New Zealand) and the open availability of the examined objective risk tools (P-POSSUM, SRS and SORT) are particular strengths of the manuscript. Overall, the work seems to have the potential to broadly benefit surgical management practice. However, there are a number of points that might be expanded upon. Firstly, on the comparison methods: the authors may wish to clarify their definition of objective vs. subjective for risk assessment procedures. In particular, the "objective" Surgical Risk Scale (SRS) appears to comprise a summation of ASA-PS scores together with the CEPOD and BUPA scores ("The surgical risk scale as an improved tool for risk-adjusted analysis in comparative surgical audit", Sutton et al., 2002), but the ASA-PS is itself considered "subjective" assessment (Page 8). Moreover, it is not clear whether CEPOD and BUPA are any less "subjective" than ASA-PS, from their descriptions. Also, for P-POSSUM, it is noted that a number of biochemical/haematological parameters were missing in practice ("Missing data" section, Page 9), and in these cases, normal data was assumed/imputed. Moreover, there remained a small number of cases with further missing data beyond these bio/haemo parameters (as understood from "Following imputation, we performed a complete case analysis as we considered the proportion of cases with missing data in *the remaining variables* to be low [1.08%]"). This practice, however, seems problematic, in that close to half of the variables used for P-POSSUM appear to be biochemical/haematological (i.e. haemoglobin, WBC, urea, sodium, potassium). Simply assuming normal values for these variables would then appear to rob P-POSSUM of quite a bit of its utility, since these assumptions constitute unsubstantiated evidence towards better patient outcomes.
The authors might clarify exactly how many cases were affected by missing data for P-POSSUM, and consider an analysis of only those cases where full data was available for P-POSSUM, for a fair comparison. In general, the authors might consider summarizing the parameters and criteria used for the various objective methods as supplementary material. Secondly, on the use of AUROC as a main quantitative assessment metric of the various methods: it is stated that the 30-day risk of death was predicted as one of six categorical responses, for all methods used (Page 7). Then, although the continuous net reclassification improvement statistic (NRI) is cited (Page 8), the authors might clarify in greater detail how this relates to the construction of ROC curves (i.e. Figures 2 & 3). For example, if the prediction is 0.5% and the patient indeed survives for 30 days, what sensitivity/specificity does this amount to, as opposed to if the patient had not survived (we assume a binary outcome)? While readers might be familiar with the usual binary ROC formulation, the implementation of multiple categories might warrant more description. Further on the categorical prediction, the calibration graphs in Figures 2 and 3 appear to suggest that the various methods output different numbers of point estimates. For example, P-POSSUM has 10, SRS has 6, SORT has 9 and clinicians have 4 (as opposed to the six categories implied in the Dataset section). While it seems that a greater number of point estimates improves the level of detail of the corresponding ROC curve, the effect seems exaggerated for SORT and SRS; in particular, the ASA-PS ROC curve in Figure 2 seems to be a piecewise construction from about 3 datapoints, while the SRS curve likewise seems to be piecewise constructed from about 6 datapoints. However, the SORT and SRS curves have much finer detail (i.e.
have an independent value for each 0.01 change in specificity or less), despite only having slightly more point estimates than SRS. The authors might wish to explain this discrepancy. Thirdly, while there is detailed analysis of the aggregate statistics for the various methods (including on a high-risk patient subset), an additional inter-model analysis at finer granularity would seem appropriate. In particular, do the various objective and subjective models tend to agree/disagree on specific patients, and in cases where they strongly disagree (i.e. one method predicts a very low mortality risk, while another method predicts a significant risk, for the same patient), what are the factors that might have led to this disagreement? Such an analysis would help to determine whether the choice of particular models (e.g. the combined logistic regression model vs. SORT alone) is strictly beneficial for all patients (strictly dominating method), or if it may shift the risks of inaccurate prediction from one group of patients onto another group. A confusion matrix of the six prediction categories vs. binary outcomes at a reasonable ROC operating point for each method would also be illuminating. Fourthly, the authors may wish to comment further on the practical implications of accurate mortality prediction (it is noted in the Discussion section that accurate prediction may result in fewer inappropriate admissions to critical care, though there remains a lack of "real-world" clinical uptake); will such predictions be used by clinicians/patients in deciding whether to commence surgery? Indeed, given that mortality estimates were made by the perioperative team beforehand for participating patients, did these estimates have any impact on the treatment offered (i.e. the abovementioned reduction in inappropriate admissions)?
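The reviewer's point about piecewise ROC curves follows directly from how an ROC curve is constructed from discrete predictions: a predictor with k distinct values can yield at most k + 1 ROC points. A minimal illustration (the scores and outcomes below are hypothetical, not the study data):

```python
import numpy as np

def roc_points(scores, outcome):
    """Return (1 - specificity, sensitivity) pairs obtained by thresholding
    at each distinct predicted value, in descending order. A predictor with
    k distinct values gives at most k + 1 points, which is why categorical
    predictors (e.g. ASA-PS) produce coarse, piecewise-linear ROC curves."""
    scores = np.asarray(scores, float)
    outcome = np.asarray(outcome)
    pts = [(0.0, 0.0)]  # the all-negative classifier
    for t in sorted(set(scores.tolist()), reverse=True):
        pred_pos = scores >= t
        sens = float(np.mean(pred_pos[outcome == 1]))       # true positive rate
        fpr = float(np.mean(pred_pos[outcome == 0]))        # false positive rate
        pts.append((fpr, sens))
    return pts

# Hypothetical 6-category risk bands mapped to midpoint probabilities, as one
# way the study's categorical predictions might be scored for an ROC curve.
scores = [0.005, 0.005, 0.0175, 0.04, 0.075, 0.3]
outcome = [0, 0, 0, 0, 1, 1]
pts = roc_points(scores, outcome)
```

With 5 distinct score values this yields 6 ROC points, ending at (1, 1); a continuous predictor such as P-POSSUM, by contrast, can contribute a point per patient, which accounts for the finer curves.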
Finally, while perhaps somewhat out of the scope of this manuscript, the authors might consider exploring data-centric machine learning methods in the future (i.e. training models directly from the available patient demographic features). Reviewer #2: In this international prospective cohort study, the authors address the very important issue of preoperative risk assessment for mortality. The article is clear and well written, but there is an important methodological question that has to be considered. 1) It is certainly a very good point to test the subjective risk assessment. However, a main limitation is that the physicians may have used objective assessment scores to help them answer the question proposed. These cases should have been excluded from all analyses initially (flowchart 1). Additionally, patients for whom the ASA score was used were kept in the other analyses and classified as belonging to the 'subjective analysis' group. Although the ASA score is the oldest and has a lot of subjectivity in itself, it cannot be considered that no score was used. Also, the numbers in Flowchart 1 and Table 2 are a bit confusing (23,540 patients with clinical judgement only and 9,928 with ASA score in Table 2, and 21,325 patients included in the flowchart). The main analysis should have been done only with patients classified by clinical judgement according to the answer to the question proposed, not using any additional tool, including ASA. 2) It is stated in the methods that data imputation was done for the laboratory values that were missing for the POSSUM score. Please provide how many imputations were performed. Additionally, while assuming that the values are normal may be acceptable in patients undergoing elective surgery without a blood draw available, that may not be accurate for emergency patients, for whom there was no time to perform a blood test.
3) Although there is no consensus on what the best risk stratification tool is, in the discussion the authors mention as a limitation that the NSQIP risk calculator was never validated outside the US. This would have been a good opportunity to do so because, although it involves more variables, it is freely available online and could easily be implemented in an electronic chart, for example. Reviewer #3: Dear authors, Thank you for providing me the opportunity to read your manuscript; it reports a comparison of objective and subjective predictive tools for predicting postoperative mortality after major surgery. While the objectives of the study are interesting and would have some potentially important clinical applications, several major limitations limit the interpretation of the observed results. My key suggestions to improve the impact of this study are: To select preoperative scores (not P-POSSUM) that do not already include a subjective component (neither SORT nor SRS). To not attempt to create a universal predictive tool working for any type of surgery. To better describe the subjective score introduced: inter-rater variability, distributions, and eventually to reduce the number of strata to limit variability. To emphasize the results provided by the decision curves rather than focusing on AUROC in models with poor calibration. I summarized the most important methodological concerns below. 1 - Population: The study population includes a non-representative mix of cases, including some surgical specialties that are associated with high postoperative mortality and in which general predictive tools do not work well (i.e. cardiac surgery: 3.9% of the cohort, 5.7% of the observed deaths). The objective of the study is to evaluate the predictive performances of scores already described elsewhere. Therefore we are not seeing representative samples, and we are simply evaluating the performances of some predictive models.
Unfortunately, the combination of the clinical prediction and the other scores is dependent on the study population case-mix, because a new model has been fitted. Whether this combined model works on other populations with a different case-mix is unknown and must be discussed properly as an important limitation. 2 - Population (2): The inclusion of obstetrics is a usual mistake found in similar studies. It increases the number of patients in the cohort, but it is not associated with any deaths. Therefore, it drops the average apparent mortality (no deaths after obstetrics in the study we are discussing today) and makes the methodological tools used to compare the predictive models inaccurate and potentially biased. Further, with no deaths and a clearly identified subgroup (which is included in the calculation of the scores), the inclusion of OB patients artificially increases discrimination performances but is not clinically relevant. Which clinician(s) would use the same score to accurately predict the outcome of OB and cardiac patients? 3 - Major methodological mistake: We are dealing with a low-frequency primary outcome (i.e. 30-day mortality = 1.2%; 317 deaths in 26,616 patients). P-POSSUM, SRS and SORT showed terrible calibrations in Figure 2, where we observe dramatic overestimations of the risk of death (the calibration curves are way down in the lower right part of the calibration plots). This common methodological concern is poorly detected with ROC curves. Combining infrequent outcomes and poorly calibrated predictive models constantly produces very high AUCs while the models have no clinical value. That is what we observe here, and in many other studies dealing with similar population characteristics; including OB patients amplified the phenomenon.
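The calibration the reviewer is describing is conventionally assessed by grouping patients into risk deciles and comparing mean predicted risk with the observed event rate in each group. A minimal sketch of that computation, on simulated data (not the study data) where the model systematically doubles the true risk:

```python
import numpy as np

def calibration_by_decile(pred, outcome, bins=10):
    """Sort patients by predicted risk, split into quantile bins, and return
    (mean predicted risk, observed event rate) per bin - the points that make
    up a calibration plot. A well-calibrated model sits on the diagonal."""
    pred = np.asarray(pred, float)
    outcome = np.asarray(outcome, float)
    order = np.argsort(pred)
    pred, outcome = pred[order], outcome[order]
    groups = np.array_split(np.arange(len(pred)), bins)
    return [(float(pred[g].mean()), float(outcome[g].mean())) for g in groups]

# Simulated low-frequency outcome: true risks up to 10%, and a model that
# over-predicts by a factor of two, as the reviewer says the tools do here.
rng = np.random.default_rng(0)
true_risk = rng.uniform(0.0, 0.1, 5000)
outcome = rng.binomial(1, true_risk)
overpredicted = np.clip(2 * true_risk, 0, 1)
points = calibration_by_decile(overpredicted, outcome)
```

For such a model every point falls below the diagonal (observed rate roughly half the predicted risk), exactly the pattern the reviewer reads off the study's Figure 2, while the model's ranking of patients, and hence its AUROC, is unaffected.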
4 - Major methodological mistake (2):
SORT: Truly preoperative; includes ASA (subjective, with wide inter-rater variability); fine description of the surgical procedures.
P-POSSUM: A more objective risk score with less inter-rater variability, but includes intraoperative variables not available preoperatively.
SRS: Truly preoperative; includes ASA (subjective, with wide inter-rater variability) and a subjective stratification associated with the planned surgery.
The new subjective assessment introduced in this study: Preoperative, very subjective, 6 categories, somewhat like ASA. No quantification of the inter-rater variability, which is suspected to be very high in the non-extreme categories.
2 of the 3 scores evaluated already include a clinical subjective evaluation of the preoperative patients' characteristics (i.e. ASA, with all the limitations we know); 1 score includes intraoperative characteristics, which makes the comparison impossible. The authors finally combined SORT and the clinical prediction they introduced. They demonstrated in the study population (after model fitting) that the predictive performance is somewhat better, and that the calibration is way better (they fitted a new model...). SORT already includes ASA; how does ASA interact with the clinical prediction, which looks very similar? While it is recognized that the inter-rater variability of ASA is huge, what do we know about the inter-rater variability of the clinical prediction described here? 5 - The use of decision curves is a very strong methodological part of this study. Unfortunately, their description and interpretation are somewhat neglected compared to the useless ROC comparison. The reviewer strongly recommends expanding the description, interpretation and discussion of the decision curves, which actually provide the information that clinicians are seeking.
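The decision curves the reviewer wants emphasized plot net benefit (Vickers & Elkin, reference 9) against threshold probability. A minimal sketch of the net-benefit calculation, with hypothetical arrays rather than the study data:

```python
import numpy as np

def net_benefit(pred, outcome, threshold):
    """Net benefit of acting on (e.g. admitting to critical care) patients
    whose predicted risk meets `threshold`:
        NB = TP/n - (FP/n) * threshold / (1 - threshold)
    i.e. true positives credited in full, false positives penalised by the
    odds at the chosen threshold."""
    pred = np.asarray(pred, float)
    outcome = np.asarray(outcome)
    n = len(outcome)
    treat = pred >= threshold
    tp = np.sum(treat & (outcome == 1))
    fp = np.sum(treat & (outcome == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(outcome, threshold):
    """Reference strategy 'treat everyone'; 'treat no one' has net benefit 0.
    A model is only useful at thresholds where it beats both references."""
    prevalence = float(np.mean(np.asarray(outcome)))
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

Sweeping `threshold` over a clinically plausible range and plotting the three strategies gives the decision curve; unlike AUROC, this comparison is directly degraded by the miscalibration the reviewer describes, which is why it carries the information clinicians need.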
Some specific comments: Page 18 line 1: "The SORT was the best-calibrated of the pre-existing models, however all over-predicted risk (Figure 2A-2C; Hosmer-Lemeshow p-values all …" The HL test is not an appropriate approach to evaluate calibration in large cohorts where the frequency of the outcome is low (see the TRIPOD statement). Further, an HL statistic probability lower than 0.10 suggests miscalibration; in Figure 2, the HL probabilities are all lower than 0.0001. This somewhat confirms the visual analysis of the curves suggesting that the calibration of these scores is terrible (results already widely described elsewhere). Page 18 line 6: "All models exhibited good-to-excellent discrimination (Figure 2D; AUROC SORT=0.91 (95% confidence interval (CI): 0.90-0.93); P-POSSUM=0.90 (95% CI: 0.88-0.92); SRS=0.85 (95% CI: 0.83-0.88)." High observed discriminations are the consequence of: 1/ the poor calibration of the models and 2/ a primary outcome frequency of approximately 1.2%. The presented CIs seem to have been produced assuming a binormal distribution (the usual approach); this is not likely to be appropriate in this setting, and the reader can guess that these CIs are widely underestimated. More importantly, no discrimination should be interpreted with such poor calibration. Figure 5: This figure looks great, but some confidence intervals would probably show that there are some major overlaps between the clinical prediction strata. Thank you again for your work and I hope my comments will be useful. There is a true need for preoperative stratification/predictive tools, and this work has the potential to fill a part of this need. Yannick Le Manach MD PhD Reviewer #4: The manuscript entitled "Comparing subjective and objective risk assessment for predicting mortality after major surgery - an international prospective cohort study" was reviewed.
In this study, the authors aimed to compare subjective and objective risk assessment tools used to predict the probability of postoperative mortality. The study was a well-designed, prospective, multicentre, observational study, with great effort to reduce bias as much as possible in the recruitment phase and data analysis. However, there are some concerns regarding the manuscript which should be addressed. Comments: 1. In a recent study, Chan et al. (1) calculated the derivation and validation cohorts for mortality using the Combined Assessment of Risk Encountered in Surgery (CARES) surgical risk calculator, which is also comparable to the present study outcomes. The authors should also discuss the findings of this study. 2. How did the authors define the needed cut-off values for the classification of the preoperative estimation of mortality risk? 3. I would suggest that the authors perform subgroup and validation analyses by classifying the patients based on the type of surgery. 4. How would it be possible to combine the objective and subjective assessments in clinical practice? The combination of the tools has improved the outcomes for sure, but the authors did not suggest any combined approach to establish an estimation using both risk assessments simultaneously. Reference: 1. Chan DXH, Sim YE, Chan YH, Poopalalingam R, Abdullah HR. Development of the Combined Assessment of Risk Encountered in Surgery (CARES) surgical risk calculator for prediction of postsurgical mortality and need for intensive care unit admission risk: a single-center retrospective study. BMJ Open. 2018;8(3):e019427. Any attachments provided with reviews can be seen via the following link: [LINK] 2 Jun 2020 Submitted filename: Reviewer responses SRM 01062020.docx 18 Jun 2020 Dear Dr.
Moonesinghe, Thank you very much for re-submitting your manuscript "Developing and validating subjective and objective risk assessment measures for predicting mortality after major surgery: an international prospective cohort study" (PMEDICINE-D-19-04548R1) for review by PLOS Medicine. I have discussed the paper with my colleagues and the academic editor and it was also seen again by previous reviewers. I am pleased to say that provided the remaining editorial and production issues are dealt with we are planning to accept the paper for publication in the journal. The remaining issues that need to be addressed are listed at the end of this email. Any accompanying reviewer attachments can be seen via the link below. Please take these into account before resubmitting your manuscript: [LINK] Our publications team (plosmedicine@plos.org) will be in touch shortly about the production requirements for your paper, and the link and deadline for resubmission. DO NOT RESUBMIT BEFORE YOU'VE RECEIVED THE PRODUCTION REQUIREMENTS. ***Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.*** In revising the manuscript for further consideration here, please ensure you address the specific points made by each reviewer and the editors. In your rebuttal letter you should indicate your response to the reviewers' and editors' comments and the changes you have made in the manuscript. Please submit a clean version of the paper as the main article file. A version with changes marked must also be uploaded as a marked up manuscript file. Please also check the guidelines for revised papers at http://journals.plos.org/plosmedicine/s/revising-your-manuscript for any that apply to your paper. 
If you haven't already, we ask that you provide a short, non-technical Author Summary of your research to make findings accessible to a wide audience that includes both scientists and non-scientists. The Author Summary should immediately follow the Abstract in your revised manuscript. This text is subject to editorial change and should be distinct from the scientific abstract. We expect to receive your revised manuscript within 1 week. Please email us (plosmedicine@plos.org) if you have any questions or concerns. We ask every co-author listed on the manuscript to fill in a contributing author statement. If any of the co-authors have not filled in the statement, we will remind them to do so when the paper is revised. If all statements are not completed in a timely fashion this could hold up the re-review process. Should there be a problem getting one of your co-authors to fill in a statement we will be in contact. YOU MUST NOT ADD OR REMOVE AUTHORS UNLESS YOU HAVE ALERTED THE EDITOR HANDLING THE MANUSCRIPT TO THE CHANGE AND THEY SPECIFICALLY HAVE AGREED TO IT. Please ensure that the paper adheres to the PLOS Data Availability Policy (see http://journals.plos.org/plosmedicine/s/data-availability), which requires that all data underlying the study's findings be provided in a repository or as Supporting Information. For data residing with a third party, authors are required to provide instructions with contact information for obtaining the data. PLOS journals do not allow statements supported by "data not shown" or "unpublished results." For such statements, authors must provide supporting data or cite public sources that include it. If you have any questions in the meantime, please contact me or the journal staff on plosmedicine@plos.org. We look forward to receiving the revised manuscript by Jun 25 2020 11:59PM. 
Sincerely, Clare Stone, PhD Managing Editor PLOS Medicine plosmedicine.org ------------------------------------------------------------ Requests from Editors: Please provide summary demographic information to the abstract as previously requested. We suggest quoting AUROC/95% CI in the abstract for all the objective models. Please add a new final sentence to the "methods and findings" subsection of your abstract to quote 2-3 of the study's main limitations. Please remove the instructions from the "author summary". Please convert p<0.0001 to p<0.001 throughout. Please move the reference call-outs to precede punctuation (e.g., "... decision making [2,3]."). Is reference 2 missing full access details? Comments from Reviewers: Reviewer #1: We thank the authors for addressing most of the points raised previously, and particularly appreciate the addition of a fourth sensitivity analysis on P-POSSUM, and an informative confusion matrix at 5% mortality. On the confusion matrix (Supplementary Table S6), we agree with the authors that a single cut-off ROC value is inadequate to characterize the ROC profile; the intention was chiefly to gain some additional perspective on the performance of the tool, which does appear to fulfil expectations. The authors might however note that some entries in the matrix appear slightly off-by-one (e.g. the Dead column is stated to total 189, but the sum of the six categories appears to be 188; 2.6-5% is stated to total 891, but the Alive+Dead for that row appears to be 892) On part of the second point raised in the previous review on different methods having different numbers of point estimates in Figure 2 and 3, the authors' clarification on clinicians having 4 points in their calibration graph was much appreciated. However, given that it is stated that the predictions for P-POSSUM, SRS and SORT are continuous variables, the authors may wish to briefly comment on the different number of sampled points for these methods, as shown in Figure 2. 
On the third point raised in the previous review ("additional inter-model analysis at finer granularity"), the major concern was whether the various risk models tend to have similar risk predictions for the same patients (in aggregate), and if not, what the characteristics were of patients that tended to be assigned different risks by different models. We agree with the authors that this may not be critical to the main thrust of the manuscript; it was proposed largely because the data appears available, and since it might be of interest given that multiple models are already being compared. As such, we would respect the authors' preference on whether to include such an analysis. Minor issue: On page 7 line 10, "and equipoise over method is most accurate" might be "...which method is most accurate". Reviewer #3: Dear Authors, Thank you for the responses to my comments. Regarding the limitations associated with the datasets and the objectives of your study, I believe you significantly improved your manuscript. I still disagree with the value of the subjective score in some countries where these scores are used for billing purposes. It seems that was not the case in your cohort; however, I would be prudent about any attempt at generalization. A. Sankar et al. (BJA Volume 113, Issue 3, September 2014, Pages 424-432) provided a report of this problem in a jurisdiction where ASA is used for billing. While I will not ask for any revision for this version of the manuscript, I would be delighted if the authors would consider adding a sentence about the need for further research, since objective quantifications of the risk did not perform as well as when a subjective component is included (i.e. current models and approaches do not capture this part of the information provided by clinicians… more research is needed to capture it in a more objective manner).
Thanks for considering Yannick Le Manach MD PhD Any attachments provided with reviews can be seen via the following link: [LINK] 3 Sep 2020 Dear Prof. Moonesinghe, On behalf of my colleagues and the academic editor, Dr. David Menon, I am delighted to inform you that your manuscript entitled "Developing and validating subjective and objective risk assessment measures for predicting mortality after major surgery: an international prospective cohort study" (PMEDICINE-D-19-04548R2) has been accepted for publication in PLOS Medicine. PRODUCTION PROCESS Before publication you will see the copyedited word document (in around 1-2 weeks from now) and a PDF galley proof shortly after that. The copyeditor will be in touch shortly before sending you the copyedited Word document. We will make some revisions at the copyediting stage to conform to our general style, and for clarification. When you receive this version you should check and revise it very carefully, including figures, tables, references, and supporting information, because corrections at the next stage (proofs) will be strictly limited to (1) errors in author names or affiliations, (2) errors of scientific fact that would cause misunderstandings to readers, and (3) printer's (introduced) errors. If you are likely to be away when either this document or the proof is sent, please ensure we have contact information of a second person, as we will need you to respond quickly at each point. PRESS A selection of our articles each week are press released by the journal. You will be contacted nearer the time if we are press releasing your article in order to approve the content and check the contact information for journalists is correct. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. PROFILE INFORMATION Now that your manuscript has been accepted, please log into EM and update your profile. 
Go to https://www.editorialmanager.com/pmedicine, log in, and click on the "Update My Information" link at the top of the page. Please update your user information to ensure an efficient production and billing process. Thank you again for submitting the manuscript to PLOS Medicine. We look forward to publishing it. Best wishes, Clare Stone, PhD Managing Editor PLOS Medicine plosmedicine.org
References (showing 10 of 32):

1.  The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies.

Authors:  Erik von Elm; Douglas G Altman; Matthias Egger; Stuart J Pocock; Peter C Gøtzsche; Jan P Vandenbroucke
Journal:  J Clin Epidemiol       Date:  2008-04       Impact factor: 6.437

2.  The long-term effects of postoperative complications.

Authors:  Andrew Toner; Mark Hamilton
Journal:  Curr Opin Crit Care       Date:  2013-08       Impact factor: 3.687

3.  Complications, failure to rescue, and mortality with major inpatient surgery in medicare patients.

Authors:  Amir A Ghaferi; John D Birkmeyer; Justin B Dimick
Journal:  Ann Surg       Date:  2009-12       Impact factor: 12.969

4.  Perioperative patient outcomes in the African Surgical Outcomes Study: a 7-day prospective observational cohort study.

Authors:  Bruce M Biccard; Thandinkosi E Madiba; Hyla-Louise Kluyts; Dolly M Munlemvo; Farai D Madzimbamuto; Apollo Basenero; Christina S Gordon; Coulibaly Youssouf; Sylvia R Rakotoarison; Veekash Gobin; Ahmadou L Samateh; Chaibou M Sani; Akinyinka O Omigbodun; Simbo D Amanor-Boadu; Janat T Tumukunde; Tonya M Esterhuizen; Yannick Le Manach; Patrice Forget; Abdulaziz M Elkhogia; Ryad M Mehyaoui; Eugene Zoumeno; Gabriel Ndayisaba; Henry Ndasi; Andrew K N Ndonga; Zipporah W W Ngumi; Ushmah P Patel; Daniel Zemenfes Ashebir; Akwasi A K Antwi-Kusi; Bernard Mbwele; Hamza Doles Sama; Mahmoud Elfiky; Maher A Fawzy; Rupert M Pearse
Journal:  Lancet       Date:  2018-01-03       Impact factor: 79.321

5.  POSSUM and Portsmouth POSSUM for predicting mortality. Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity.

Authors:  D R Prytherch; M S Whiteley; B Higgins; P C Weaver; W G Prout; S J Powell
Journal:  Br J Surg       Date:  1998-09       Impact factor: 6.939

6.  Recent Innovations, Modifications, and Evolution of ACC/AHA Clinical Practice Guidelines: An Update for Our Constituencies: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines.

Authors:  Glenn N Levine; Patrick T O'Gara; Joshua A Beckman; Sana M Al-Khatib; Kim K Birtcher; Joaquin E Cigarroa; Lisa de Las Fuentes; Anita Deswal; Lee A Fleisher; Federico Gentile; Zachary D Goldberger; Mark A Hlatky; José A Joglar; Mariann R Piano; Duminda N Wijeysundera
Journal:  Circulation       Date:  2019-04-23       Impact factor: 29.690

7.  Cancelled operations: a 7-day cohort study of planned adult inpatient surgery in 245 UK National Health Service hospitals.

Authors:  D J N Wong; S K Harris; S R Moonesinghe
Journal:  Br J Anaesth       Date:  2018-09-07       Impact factor: 9.166

8.  The Surgical Risk Scale as an improved tool for risk-adjusted analysis in comparative surgical audit.

Authors:  R Sutton; S Bann; M Brooks; S Sarin
Journal:  Br J Surg       Date:  2002-06       Impact factor: 6.939

9.  Decision curve analysis: a novel method for evaluating prediction models.

Authors:  Andrew J Vickers; Elena B Elkin
Journal:  Med Decis Making       Date:  2006 Nov-Dec       Impact factor: 2.583

10. (Review) Risk stratification tools for predicting morbidity and mortality in adult patients undergoing major surgery: qualitative systematic review.

Authors:  Suneetha Ramani Moonesinghe; Michael G Mythen; Priya Das; Kathryn M Rowan; Michael P W Grocott
Journal:  Anesthesiology       Date:  2013-10       Impact factor: 7.892

Related articles: 8 in total

1. (Review) Data Science Trends Relevant to Nursing Practice: A Rapid Review of the 2020 Literature.

Authors:  Brian J Douthit; Rachel L Walden; Kenrick Cato; Cynthia P Coviak; Christopher Cruz; Fabio D'Agostino; Thompson Forbes; Grace Gao; Theresa A Kapetanovic; Mikyoung A Lee; Lisiane Pruinelli; Mary A Schultz; Ann Wieben; Alvin D Jeffery
Journal:  Appl Clin Inform       Date:  2022-02-09       Impact factor: 2.342

2.  Postoperative mortality risk prediction that incorporates intraoperative vital signs: development and internal validation in a historical cohort.

Authors:  Janny Xue Chen Ke; Daniel I McIsaac; Ronald B George; Paula Branco; E Francis Cook; W Scott Beattie; Robin Urquhart; David B MacDonald
Journal:  Can J Anaesth       Date:  2022-08-22       Impact factor: 6.713

3.  Antiseizure medications (antiepileptic drugs) in adults: starting, monitoring and stopping.

Authors:  Heather Angus-Leppan; Michael R Sperling; Vicente Villanueva
Journal:  J Neurol       Date:  2022-09-24       Impact factor: 6.682

4.  Implementation of Routine Computed Tomography (CT) Following Laparoscopic Sleeve Gastrectomy: New Evidence Brings New Challenges.

Authors:  Dimitrios E Magouliotis; Prokopis-Andreas Zotos; Dimitris Zacharoulis
Journal:  Obes Surg       Date:  2022-05-09       Impact factor: 3.479

5.  Evaluation of surgical skill using machine learning with optimal wearable sensor locations.

Authors:  Rahul Soangra; R Sivakumar; E R Anirudh; Sai Viswanth Reddy Y; Emmanuel B John
Journal:  PLoS One       Date:  2022-06-03       Impact factor: 3.752

6.  Highlighting uncertainty in clinical risk prediction using a model of emergency laparotomy mortality risk.

Authors:  Jakob F Mathiszig-Lee; Finneas J R Catling; S Ramani Moonesinghe; Stephen J Brett
Journal:  NPJ Digit Med       Date:  2022-06-08

7.  Global guidelines for emergency general surgery: systematic review and Delphi prioritization process.

Authors: 
Journal:  BJS Open       Date:  2022-01-06

8.  A "snap-shot" visual estimation of health and objectively measured frailty: capturing general health in aging older women.

Authors:  Patrik Bartosch; Linnea Malmgren; Paul Gerdhem; Jimmie Kristensson; Fiona Elizabeth McGuigan; Kristina Eva Akesson
Journal:  Aging Clin Exp Res       Date:  2022-03-25       Impact factor: 4.481

