B Van Calster (1), E W Steyerberg (2), T Bourne (3), D Timmerman (4), G S Collins (5)

1. KU Leuven, Department of Development and Regeneration, Leuven, Belgium; Department of Public Health, Erasmus MC, Rotterdam, The Netherlands.
2. Department of Public Health, Erasmus MC, Rotterdam, The Netherlands.
3. KU Leuven, Department of Development and Regeneration, Leuven, Belgium; Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium; Queen Charlotte's & Chelsea Hospital, Imperial College London, London, UK.
4. KU Leuven, Department of Development and Regeneration, Leuven, Belgium; Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium.
5. Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK.
Dear Editor,

External validation studies are of utmost importance for assessing the performance of a prediction model in different locations (Altman et al., 2009). We therefore read with interest the recent external validation study of the ADNEX model (Szubert et al., 2016).

For patients with a persistent adnexal tumor who are scheduled for surgery, the ADNEX model predicts the risk of five tumor types: benign, borderline malignant, stage I cancer, stage II–IV cancer, or secondary metastatic cancer (Van Calster et al., 2014). The model was developed on data from 5909 patients collected at 24 centers in 10 countries between 1999 and 2012. ADNEX aims to help clinicians make appropriate clinical decisions for patients presenting with an adnexal mass. When validating the ADNEX model, it is natural to first evaluate the prediction of malignancy, followed by the multiclass prediction of malignancy subtypes, in a similar way to other validation studies of multiclass models (Steyerberg et al., 1998). This approach is followed in the recent paper, but we wish to raise a number of important issues around the design, analysis, and reporting.

First, validation studies should be designed to reliably assess performance in terms of discrimination and calibration (Steyerberg, 2009). In this particular case, the authors report a sample size calculation for testing the hypothesis that the AUC of the model is higher than 0.5. Assuming an AUC of 0.94 leads to a very low required sample size (n = 22). This approach is at odds with methodological guidance, and the result is that the precision of performance measures will be low: for dichotomous prediction, previous studies have suggested that at least 100, and preferably at least 200, individuals with the event (in this case ovarian malignancy) are required for a meaningful validation (Steyerberg, 2009; Vergouwe et al., 2005; Collins et al., 2016).
Here, center 1 has 70 malignant tumors, whereas center 2 has only 34, leading to unreliable per-center results. Validation would therefore best be done on all patients, with center-specific results as an exploratory addition. Furthermore, statistical tests to compare results between centers are provided throughout the text. Although heterogeneity of performance across locations is important (Riley et al., 2016), p-values comparing two specific centers are uninformative. It is useful to observe that the AUCs were 0.955 and 0.907, since this is in line with the center-specific values reported in the original publication describing the ADNEX model (Van Calster et al., 2014). A detailed investigation of heterogeneity should, however, involve a larger dataset with patients from many different centers. Furthermore, subgroup analyses by menopausal status become very unreliable when stratified by center.

Second, the authors have not adequately described their population and results. The prevalence of each of the five tumor types is not clearly provided, and the prevalence of stage I cancer and stage II–IV cancer can only be derived from the confusion matrix. The ADNEX model has variants with and without the serum marker CA125 as a predictor. The authors mix both variants depending on the availability of CA125, so it is unclear which variant the reported performance refers to.

Third, the calibration of the predicted risk of malignancy has not been investigated, i.e. whether observed frequencies of malignancy correspond to predicted risks, especially around the risk threshold of 10%. Unfortunately, this aspect of risk prediction models is often overlooked despite its importance (Steyerberg, 2009).

Finally, the ‘multiclass’ performance evaluation is fundamentally flawed. The key problem is the confusion matrix, which classifies patients into one of the five tumor types by choosing the group with the highest predicted risk.
Baseline risk, or prevalence, of each tumor type varies substantially: among 327 patients, 223 have benign tumors (68%), 16 borderline tumors (5%), 14 stage I primary cancers (4%), 64 stage II–IV primary cancers (20%), and 10 secondary metastatic cancers (3%). Given these large differences in prevalence, it is unlikely that the ADNEX-based predicted risk of secondary metastatic cancer will be larger than that of a benign tumor. As a result, the confusion matrix will rarely classify a tumor as a metastatic cancer, resulting in near-zero sensitivity for this tumor type. Analogous arguments apply to borderline tumors and stage I primary cancers. Such results are misleading, since they are unrelated to the model's ability to discriminate between tumor types. More generally, it makes little clinical sense to classify patients into only one category. It is much more relevant to monitor which risks are high or increased, and to act upon them accordingly. For example, the predicted risk of advanced-stage ovarian cancer and the risk of secondary metastasis might both be increased (although the latter will usually be smaller than the former due to the lower prevalence). In such cases the clinician may focus management decisions on both tumor types. An elevated risk of a metastatic tumor may trigger the planning of additional preoperative diagnostic tests, such as gastroscopy, X-ray mammography or a full body MRI. Instead of a confusion matrix, concordance or c statistics for subgroup discrimination should be given. We would advise presenting pairwise c statistics using the conditional risk method (Van Calster et al., 2012; Van Calster et al., 2014), although other approaches could be followed.
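The contrast between highest-risk classification and pairwise c statistics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code of any cited study: the six patients and their predicted risks are invented, only three of the five tumor types are included for brevity, and the function names `c_statistic` and `pairwise_c` are hypothetical. The conditional risk for a pair of tumor types is taken here as the predicted risk of the first type divided by the sum of the predicted risks of both types, evaluated only in patients who have one of the two types.

```python
def c_statistic(scores_pos, scores_neg):
    """Share of (pos, neg) pairs in which the pos patient has the higher score.

    Ties count as one half.
    """
    concordant = 0.0
    for s_pos in scores_pos:
        for s_neg in scores_neg:
            if s_pos > s_neg:
                concordant += 1.0
            elif s_pos == s_neg:
                concordant += 0.5
    return concordant / (len(scores_pos) * len(scores_neg))


def pairwise_c(patients, type_a, type_b):
    """c statistic for type_a versus type_b based on conditional risks."""
    scores_a, scores_b = [], []
    for risks, observed in patients:
        if observed not in (type_a, type_b):
            continue  # only patients with one of the two tumor types contribute
        conditional_risk = risks[type_a] / (risks[type_a] + risks[type_b])
        if observed == type_a:
            scores_a.append(conditional_risk)
        else:
            scores_b.append(conditional_risk)
    return c_statistic(scores_a, scores_b)


# Invented toy data: (predicted risks, observed tumor type) per patient.
PATIENTS = [
    ({"benign": 0.90, "stage II-IV": 0.07, "metastatic": 0.03}, "benign"),
    ({"benign": 0.80, "stage II-IV": 0.15, "metastatic": 0.05}, "benign"),
    ({"benign": 0.50, "stage II-IV": 0.30, "metastatic": 0.20}, "metastatic"),
    ({"benign": 0.10, "stage II-IV": 0.70, "metastatic": 0.20}, "stage II-IV"),
    ({"benign": 0.45, "stage II-IV": 0.20, "metastatic": 0.35}, "metastatic"),
    ({"benign": 0.20, "stage II-IV": 0.60, "metastatic": 0.20}, "stage II-IV"),
]

# Confusion-matrix style classification: pick the type with the highest risk.
highest_risk_labels = [max(risks, key=risks.get) for risks, _ in PATIENTS]
print("Highest-risk labels:", highest_risk_labels)

# Pairwise discrimination based on conditional risks.
for type_a, type_b in [("metastatic", "benign"), ("metastatic", "stage II-IV")]:
    print(type_a, "vs", type_b, "c =", pairwise_c(PATIENTS, type_a, type_b))
```

In this toy example no patient is ever labeled as metastatic by the highest-risk rule, so the sensitivity for that type is zero, yet the conditional risks separate the metastatic cases from the other two types in every pair. The low sensitivity therefore reflects the low baseline risk of the rare type and not a failure of discrimination.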
Nevertheless, we warn that in this study the sample size is far too small to draw meaningful conclusions, although we realize that it would require a very large sample to have information on 100 secondary metastatic cancers, as in the IOTA collaboration (Van Calster et al., 2014).

In conclusion, we are happy to observe the excellent discrimination between benign and malignant tumors seen in this study, in line with the original publication (Van Calster et al., 2014). However, the analysis does not allow us to draw any reliable conclusions with respect to multiclass discrimination. To improve reporting of prediction model studies, the TRIPOD guidelines have recently been introduced (Moons et al., 2015). These guidelines highlight the need for adequate sample size, assessment of calibration, and transparent reporting of key information such as the number of events in each category. Although we recognize that validation of multiclass models involves additional difficulties, it is clear that the TRIPOD recommendations should be followed to ensure all key information is clearly reported.
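The sample-size concern raised in this letter can be illustrated numerically. The sketch below uses the Hanley and McNeil (1982) approximation for the standard error of an AUC, which is our choice for illustration and not a method taken from the validation study or from the cited guidance. The assumed AUC of 0.90, the ratio of two patients without the event per patient with the event (roughly the 223 benign to 104 malignant tumors reported overall), and the function names are all assumptions. The Wald-type interval is crude for AUCs close to 1 and is shown only to convey how precision changes with the number of events.

```python
import math


def auc_standard_error(auc, n_events, n_nonevents):
    """Approximate standard error of an AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_events - 1) * (q1 - auc ** 2)
        + (n_nonevents - 1) * (q2 - auc ** 2)
    ) / (n_events * n_nonevents)
    return math.sqrt(variance)


def approximate_95_interval(auc, n_events, n_nonevents):
    """Crude Wald-type 95% interval, truncated to the range [0, 1]."""
    se = auc_standard_error(auc, n_events, n_nonevents)
    return max(0.0, auc - 1.96 * se), min(1.0, auc + 1.96 * se)


ASSUMED_AUC = 0.90               # assumption, for illustration only
ASSUMED_NONEVENTS_PER_EVENT = 2  # assumption, roughly 223 benign : 104 malignant

for n_events in (34, 70, 100, 200):
    n_nonevents = ASSUMED_NONEVENTS_PER_EVENT * n_events
    low, high = approximate_95_interval(ASSUMED_AUC, n_events, n_nonevents)
    print(f"events={n_events:3d}  non-events={n_nonevents:3d}  "
          f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval narrows steadily as the number of events grows from 34 to 200, which is the practical reason for preferring a pooled analysis over per-center results when individual centers contribute few malignancies.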
Authors: Ben Van Calster; Yvonne Vergouwe; Caspar W N Looman; Vanya Van Belle; Dirk Timmerman; Ewout W Steyerberg Journal: Eur J Epidemiol Date: 2012-10-07 Impact factor: 8.082
Authors: E W Steyerberg; A Gerl; S D Fossá; D T Sleijfer; R de Wit; W J Kirkels; N Schmeller; C Clemm; J D Habbema; H J Keizer Journal: J Clin Oncol Date: 1998-01 Impact factor: 44.544
Authors: Sebastian Szubert; Andrzej Wojtowicz; Rafal Moszynski; Patryk Zywica; Krzysztof Dyczkowski; Anna Stachowiak; Stefan Sajdak; Dariusz Szpurek; Juan Luis Alcazar Journal: Gynecol Oncol Date: 2016-06-30 Impact factor: 5.482
Authors: Karel G M Moons; Douglas G Altman; Johannes B Reitsma; John P A Ioannidis; Petra Macaskill; Ewout W Steyerberg; Andrew J Vickers; David F Ransohoff; Gary S Collins Journal: Ann Intern Med Date: 2015-01-06 Impact factor: 25.391
Authors: Ben Van Calster; Kirsten Van Hoorde; Lil Valentin; Antonia C Testa; Daniela Fischerova; Caroline Van Holsbeke; Luca Savelli; Dorella Franchi; Elisabeth Epstein; Jeroen Kaijser; Vanya Van Belle; Artur Czekierdowski; Stefano Guerriero; Robert Fruscio; Chiara Lanzani; Felice Scala; Tom Bourne; Dirk Timmerman Journal: BMJ Date: 2014-10-15
Authors: Dirk Timmerman; François Planchamp; Tom Bourne; Chiara Landolfo; Andreas du Bois; Luis Chiva; David Cibula; Nicole Concin; Daniela Fischerova; Wouter Froyman; Guillermo Gallardo Madueño; Birthe Lemley; Annika Loft; Liliana Mereu; Philippe Morice; Denis Querleu; Antonia Carla Testa; Ignace Vergote; Vincent Vandecaveye; Giovanni Scambia; Christina Fotopoulou Journal: Int J Gynecol Cancer Date: 2021-06-10 Impact factor: 3.437