Literature DB >> 34876430

Do comprehensive deep learning algorithms suffer from hidden stratification? A retrospective study on pneumothorax detection in chest radiography.

Jarrel Seah^1,2, Cyril Tang², Quinlan D Buchlak^2,3, Michael Robert Milne^2,2, Xavier Holt², Hassan Ahmad², John Lambert², Nazanin Esmaili^3,4, Luke Oakden-Rayner⁵, Peter Brotchie^2,6, Catherine M Jones^2,7.

Abstract

OBJECTIVES: To evaluate the ability of a commercially available comprehensive chest radiography deep convolutional neural network (DCNN) to detect simple and tension pneumothorax, as stratified by the following subgroups: the presence of an intercostal drain; rib, clavicular, scapular or humeral fractures or rib resections; subcutaneous emphysema and erect versus non-erect positioning. The hypothesis was that performance would not differ significantly in each of these subgroups when compared with the overall test dataset.
DESIGN: A retrospective case-control study was undertaken.
SETTING: Community radiology clinics and hospitals in Australia and the USA. PARTICIPANTS: A test dataset of 2557 chest radiography studies was ground-truthed by three subspecialty thoracic radiologists for the presence of simple or tension pneumothorax as well as each subgroup other than positioning. Radiograph positioning was derived from radiographer annotations on the images. OUTCOME MEASURES: DCNN performance for detecting simple and tension pneumothorax was evaluated over the entire test set, as well as within each subgroup, using the area under the receiver operating characteristic curve (AUC). A difference in AUC of more than 0.05 was considered clinically significant.
RESULTS: When compared with the overall test set, performance of the DCNN for detecting simple and tension pneumothorax was statistically non-inferior in all subgroups. The DCNN had an AUC of 0.981 (0.976-0.986) for detecting simple pneumothorax and 0.997 (0.995-0.999) for detecting tension pneumothorax.
CONCLUSIONS: Hidden stratification has significant implications for potential failures of deep learning when applied in clinical practice. This study demonstrated that a comprehensively trained DCNN can be resilient to hidden stratification in several clinically meaningful subgroups in detecting pneumothorax. © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

Entities: Chemical

Keywords: accident & emergency medicine; chest imaging; health informatics

Mesh：

Year: 2021 PMID： 34876430 PMCID： PMC8655590 DOI： 10.1136/bmjopen-2021-053024

Source DB: PubMed Journal: BMJ Open ISSN： 2044-6055 Impact factor: 2.692

This study analysed a comprehensively trained deep learning algorithm that can detect up to 124 clinical findings on a chest radiograph. Strengths are that it not only evaluated the algorithm’s ability to detect pneumothorax but also the clinically significant groups of pneumothorax—simple and tension pneumothorax. Additionally, the study evaluated the algorithm’s ability to detect pneumothorax in specific, clinically salient subgroups. A large dataset of 2557 chest radiographs with a robust ground truth was evaluated. The study was limited by its retrospective nature, which necessitates further prospective, external validation studies.

Introduction

Deep convolutional neural networks (DCNNs) are state-of-the-art for various image classification and processing tasks1 2 In the medical imaging and artificial intelligence (AI) literature, these have frequently claimed to have near-human or even super-human performance in a variety of classification tasks performed by radiologists.3 4 However, recent concerns have been raised about the translation of such results to clinical practice.5 Most deep learning models in medical imaging are developed to detect specific findings or a group of similar findings, and as such performance is typically reported using a summary metric such as the area under the receiver operating characteristic curve (AUC). This can hide the performance of the models on clinically distinct and meaningful subgroups within these single findings, a phenomenon described as hidden stratification by Oakden-Rayner et al.6 This work has shown that the algorithms trained to detect pneumothorax on chest X-ray are often affected by hidden stratification, performing well in summary across an entire test dataset but performing worse in the subset of pneumothorax patients without the presence of an accompanying intercostal drain. As a pneumothorax is frequently treated with the insertion of an intercostal drain,7 many chest radiography training datasets labelled for pneumothorax, including well-known public datasets such as the National Institutes of Health CXR14 dataset8 demonstrate a strong correlation between pneumothorax and intercostal drains. Due to this correlation and the absence of explicit labels that distinguish intercostal drains as a separate finding, DCNNs trained on these datasets frequently erroneously rely on the presence of intercostal drains to identify pneumothoraces, a process which has been called ‘shortcut learning’ or ‘unintended cue learning’.9 Reliance on these unintended cues can lead to reduced performance when the model is evaluated on the subset of cases without intercostal drains. This example is particularly dangerous as intercostal drain insertion usually indicates that the pneumothorax has already been identified and treated. As such, an algorithm demonstrating good performance for pneumothorax detection overall may be masking poor performance within the most clinically relevant subgroups, namely those yet to be treated and who would most benefit from prompt diagnosis. Other clinically meaningful pneumothorax subgroups include patients with subcutaneous emphysema; patients with acute rib, clavicular, scapular or humeral fractures or rib resections and patients with semierect or supine (referred to as non-erect) positioning. While not an exhaustive list, these subgroups contain features that correlate with pneumothorax that a DCNN may erroneously rely on. For instance, subcutaneous emphysema may be benign but is often associated with pneumomediastinum and pneumothorax.10 Patients with trauma with rib and other skeletal fractures often have associated pneumothorax or haemopneumothorax.11 Additionally, trauma bay patients are often imaged in the supine or semierect position due to the acute nature of their injuries. The positioning of the patient also alters the visibility of pneumothoraces as well as their radiological appearances.12 Pneumothorax is a common postoperative complication following thoracic surgery, of which rib resection is a common indicator.13 Labelling these subgroups and evaluating the performance of a DCNN on each one has been described as schema completion.6 Evaluating the performance of DCNNs in such a way can help answer critical questions about the true clinical utility of such AI-driven computer-aided diagnosis tools. We hypothesised that a DCNN trained comprehensively to detect multiple findings, including some of these subgroups, would demonstrate non-inferior performance as measured by AUC for both simple and tension pneumothorax in these subgroups when compared with the overall test dataset.

Methods

DCNN software

A commercially available DCNN-based computer-aided diagnosis algorithm (Annalise CXR V.1.2, annalise.ai, Sydney, Australia) was evaluated. This algorithm has been trained to detect 124 clinical findings on chest radiography14 and is publicly available at https://cxrdemo.annalise.ai. This algorithm indicates if each of the findings are present, as well as provides a numerical score indicating its confidence that the finding is present. This algorithm consists of several convolutional neural networks based on the EfficientNet architecture using the Keras library with Tensorflow V.2.1.

Patient and public involvement

Patients and public were not involved in the design, conduct or reporting of this study.

Study design and dataset

This project’s test dataset was obtained retrospectively from a more extensive study by (Seah et al) titled ‘Radiologist chest X-ray diagnostic accuracy performance when augmented by a comprehensive deep learning model: a multireader multicase study’, previously undertaken to validate the DCNN algorithm, which describes the case selection and participant flow. A reanalysis of the performance data from that study was conducted to test the hypothesis that the DCNN algorithm is resilient to hidden stratification. The chest radiographs for the test dataset were retrospectively obtained from two sources: a large dataset from a private radiology clinic in Australia and the publicly available Medical Information Mart for Intensive Care CXR (MIMIC-CXR) dataset.15 These radiographs were not used in the DCNN training dataset, and there was no overlap between patients in the training and testing cohorts. Each study comprised multiple images from a single patient. The test dataset’s inclusion criteria were age >16 years; and studies that contained at least one frontal image. Additional frontal or lateral images acquired within each study were also used. Studies were in DICOM format and were deidentified. The original resolution and bit-depth was preserved. For the original study, cases were selected to comprise a wide variety of pathology that included each of the subgroups analysed in this investigation, as well as non-pneumothorax related pathologies such as lung nodules or airspace opacities. Table 1 presents the breakdown of the number of cases with and without simple and tension pneumothorax in the overall dataset and in each subgroup.

Table 1

Demographics of the test dataset

Patients	2286
Studies	2568
Images	4568
Sex	29% male28% female43% unknown*
Age	74 years (SD 15 years)*
View position	28% Posteroanterior (PA)33% Anteroposterior (AP)31% Lateral (LAT)8% other

*MIMIC-CXR does not provide sex or age information and hence data for this is incomplete.

Demographics of the test dataset *MIMIC-CXR does not provide sex or age information and hence data for this is incomplete.

Ground truthing

Each study was evaluated by three subspecialist thoracic radiologists, including one of the authors (CMJ), who had each undertaken dedicated chest imaging fellowships. Each of these radiologists was trained on a specific set of definitions for simple and tension pneumothorax, as well as each of the subgroups, as defined in online supplemental appendix A. In particular, tension pneumothorax was defined as pneumothorax with mediastinal shift towards the contralateral lung. For the ‘no fractures’ subgroup, specialist radiologists were instructed to label acute rib, humerus, clavicular, spinal and scapular fractures, as well as the presence of any rib resections. The presence of any of these fractures or rib resection disqualified the study from the ‘no fractures’ subgroup. Radiologists independently assessed each study, with access to the patient’s past and future imaging, clinical reports, as well as any CT chest reports if available, and identified if each finding was absent or present within that study. The consensus for each finding for each triple-read study was obtained using the Dawid-Skene consensus algorithm,16 which considers the relative accuracies of each labeller for each finding. This was performed to mitigate variability and resolve discrepancies. Additionally, a radiology registrar (JS) reviewed the radiographer annotations on each image to identify if it was erect, semierect or supine. Studies with an accompanying lateral were considered erect. Where such annotations were not present, the positioning of the patient was estimated by considering indicators on the image such as the presence of air fluid levels or arm positioning. All annotations were performed on an in-house web-browser-based labelling tool capable of displaying DICOM images.

Statistical analysis

For both simple and tension pneumothorax, the AUC was calculated, which is a commonly used metric of interest in the assessment of diagnostic classification tests.17 To obtain the performance in a subgroup, the sample was filtered to retain only patients from that subgroup before recalculating the AUC. The difference in AUC between the full test set and each subgroup was bootstrapped to obtain a Bonferroni adjusted 95% CI. A difference of greater than 0.05 in AUC was considered clinically significant, therefore, if the lower bound of the CI of the delta exceeded −0.05, the performance in that subgroup was considered statistically non-inferior. This is a commonly chosen non-inferiority margin in diagnostic radiology AUC analysis.18 A p<0.00625, adjusted for eight hypotheses tested, was considered statistically significant. Analyses were conducted using Excel 2016 as well as custom Python scripts and the scipy,19 scikitlearn,20 nltk,21 gensim22 and keras23 packages.

Results

Population characteristics

A total of 2568 studies from 2286 patients were included, comprising 4568 images. Forty-three per cent of cases from the test dataset were drawn from the MIMIC-CXR dataset and 57% were drawn from the private Australian radiology practice dataset. Table 1 presents the demographic and imaging characteristics of the test dataset. MIMIC-CXR does not provide age and sex data. Eleven studies were deemed unsuitable by the DCNN and hence were excluded from analysis. None of these 11 studies were labelled by the ground-truthers as positive for simple or tension pneumothorax. There were 162 cases of simple pneumothorax and 49 cases of tension pneumothorax. Most cases of pneumothorax were found on erect chest radiographs. Sixty simple pneumothoraces and 11 tension pneumothoraces were seen on non-erect chest radiographs. Table 2 presents the number of cases with simple or tension pneumothorax in the entire test dataset, as well as within each subgroup. The complete co-occurrence matrix of each of the subgroups is presented in online supplemental appendix B.

Table 2

Number of studies with simple or tension pneumothorax within the testing set as well as throughout each subgroup

Subgroup	Simple pneumothorax	Tension pneumothorax	Total
Total	166	49	2557
Non-erect	64	19	595
No subcutaneous emphysema	96	44	2409
No intercostal drain	65	39	2277
No fractures	130	45	2017

Number of studies with simple or tension pneumothorax within the testing set as well as throughout each subgroup

AUC performance

Figures 1 and 2 present the difference in AUC within each subgroup as compared with the overall test dataset. For simple pneumothorax, the lower bound of the adjusted 95% CI of the AUC delta exceeds −0.05 and the upper bound exceeds 0 in all subgroups, indicating that performance in those subgroups was statistically non-inferior to the overall test dataset. For tension pneumothorax, the ‘no fractures’ and ‘non-erect’ subgroups were statistically non-inferior to the overall test dataset, however the ‘no subcutaneous emphysema’ and ‘no intercostal drain’ subgroups both demonstrated lower bounds of the adjusted 95% CIs of the AUC delta exceeding 0, meaning these two subgroups demonstrated significantly superior performance. Table 3 presents the raw AUC values for the DCNN’s performance and table 4 presents AUC deltas.

Table 3

AUC values with 95% CI (non-adjusted) for the DCNN’s performance on simple and tension pneumothorax in the test dataset as well as in specific subgroups

	AUC
Simple pneumothorax	0.981 (0.976–0.986)
No subcutaneous emphysema	0.983 (0.977–0.989)
No fractures	0.979 (0.972–0.986)
No intercostal drain	0.986 (0.979–0.992)
Non erect	0.968 (0.954–0.980)
Tension pneumothorax	0.997 (0.995–0.999)
No subcutaneous emphysema	0.998 (0.997–0.999)
No fractures	0.997 (0.995–0.999)
No intercostal drain	0.998 (0.997–0.999)
Non erect	0.995 (0.990–0.999)

AUC, area under the receiver operating characteristic curve; DCNN, deep convolutional neural network.

Table 4

Difference in AUC values between each specific subgroup and the overall test dataset with 95% adjusted CI

	Delta AUC (subpopulation—full population)
	Mean	Lower	Upper
Simple pneumothorax
No subcutaneous emphysema	0.002	−0.003	0.006
No fractures	−0.001	−0.005	0.003
No intercostal drain	0.005	−0.002	0.011
Non erect	−0.013	−0.030	0.001
Tension pneumothorax
No subcutaneous emphysema	0.001	0.000	0.002
No fractures	−0.000	−0.001	0.001
No intercostal drain	0.001	0.000	0.002
Non erect	−0.002	−0.009	0.003

AUC, area under the receiver operating characteristic curve.

AUC values with 95% CI (non-adjusted) for the DCNN’s performance on simple and tension pneumothorax in the test dataset as well as in specific subgroups AUC, area under the receiver operating characteristic curve; DCNN, deep convolutional neural network. Difference in AUC values between each specific subgroup and the overall test dataset with 95% adjusted CI AUC, area under the receiver operating characteristic curve. Difference in AUC for detecting simple pneumothorax in the test dataset versus each specific subgroup with adjusted 95% CI. AUC, area under the receiver operating characteristic curve. Difference in AUC for detecting tension pneumothorax in the test dataset versus each specific subgroup with adjusted 95% CI. AUC, area under the receiver operating characteristic curve.

Discussion

We evaluated the clinical performance of a commercially available DCNN algorithm for the detection of simple and tension pneumothorax in a retrospective study with a large test dataset, representative of real-world clinical conditions. Of note, 33% of images in the test dataset were AP images, reflecting the inpatient and emergent nature of the studies. The test dataset was obtained from a wide range of Australian radiology sites as well as the MIMIC-CXR dataset, improving the generalisability of the results. When compared with existing algorithms24 25 the DCNN algorithm demonstrated comparable or higher AUC for the detection of pneumothorax. The DCNN algorithm performed better in detecting tension pneumothorax when compared with simple pneumothorax, presumably due to the larger size and conspicuity of this type of pneumothorax, along with other associated features like mediastinal shift. The DCNN algorithm appeared to be resilient to hidden stratification for the four tested subgroups, with statistically non-inferior performance in six of the eight subgroups tested, with the remaining two demonstrating borderline statistically superior performance. The superior performance of the tension pneumothorax subgroups with no intercostal drain or no subcutaneous emphysema was likely due to limitations with the testing dataset itself, which contained 49 tension pneumothoraces in total. Because of the relatively low number of cases, a small number of strongly confidently positive or negative cases unique to any one of these subgroups would likely influence the distribution of AUC deltas and may explain the counterintuitive result observed. Out of all subgroups, the DCNN algorithm demonstrated the greatest decrease in AUC when tested on the non-erect subgroup, although the result remained statistically non-inferior. This DCNN algorithm appears to be resilient to hidden stratification as it has been trained on a comprehensively labelled dataset with 124 findings, including subgroups. The fact that the non-erect subgroup demonstrates the greatest decrease in AUC is circumstantial evidence that comprehensive labelling is beneficial as ‘non-erect’ was the only subgroup examined in this study that was not part of the 124 findings explicitly labelled during model training. Another likely contributing factor is that non-erect pneumothoraxes are less conspicuous and indeed may not be visible on supine chest radiographs at all. As computer-aided diagnosis and clinical decision support software becomes more prevalent in clinical practice,26 it is likely that clinically meaningful failures will stem from hidden stratification, or more specifically, the lack of evaluation of clinically relevant subclasses6 Therefore, explicit evaluation of such clinically relevant subclasses is critical to responsible clinical decision support research, and that it is the domain of clinicians to define these subclasses and resist the temptation of oversimplifying performance into a single metric such as AUC for broad disease categories. While recent literature has highlighted this problem, it has been identified and cautioned against for years,27 and unfortunately ignored in most clinical deep learning research. This is one of the risks to be mitigated when implementing computer-aided diagnosis tools at the bedside.28

Limitations and future research

One notable limitation of this study is that the test dataset was drawn from the same population as the training dataset, and further research and external validation is required to verify these results. Another limitation is that radiologist readers may have missed subtle pneumothoraces, especially on supine patients,12 although this was mitigated by the availability of future chest radiographs and reports as well as contemporaneous CT reports. Results may underestimate the true decrease in performance in the non-erect subgroup. Opportunities for future research include providing paired chest CT images and radiographs to radiologists engaged in the ground-truth process to ensure that the ground-truth truly reflects the underlying pathology, as well as testing for resilience to other subgroups that were not available in this study. Future work is required to define the performance of the model in subgroups of tension pneumothorax over a larger number of cases than available in the test dataset to clarify whether superior performance is truly indicative of model behaviour, or simply an artefact of the test dataset. This would require obtaining data from sources and populations external to the training dataset and ensuring sufficiently large numbers of tension pneumothorax cases to verify results. Additionally, this work was conducted as a retrospective analysis, which limits the generalisability of results to datasets that the DCNN algorithm has not seen before. Furthermore, as this is a reanalysis of previously acquired data, it may be underpowered to detect subtle differences in these subgroups. Therefore, additional prospective studies in different geographies, with a priori power analyses to determine adequate sample sizes, are required to see if similar performance is obtained in other populations. The resilience of the comprehensively trained DCNN algorithm to hidden stratification may also be due to additional factors, such as the already high baseline performance in identifying simple and tension pneumothorax. Further work is needed to explore the benefits of comprehensive labelling and training in findings that the DCNN algorithm does not perform as well on. One hypothesis is that as such findings may be more difficult to identify, the DCNN algorithm may rely on associated features or ‘unintended cues’ more, leading to worse hidden stratification.

Conclusion

We have demonstrated that in a retrospective analysis a comprehensively trained DCNN algorithm can be resilient to hidden stratification when detecting simple and tension pneumothorax in clinically relevant subgroups. Further external validation and prospective study is needed to see if the benefits of a comprehensively trained model are generalisable in other settings.

17 in total

Review 1. Receiver operating characteristic curve in diagnostic test assessment.

Authors: Jayawant N Mandrekar
Journal: J Thorac Oncol Date: 2010-09 Impact factor: 15.609

Review 2. Machine learning applications to clinical decision support in neurosurgery: an artificial intelligence augmented systematic review.

Authors: Quinlan D Buchlak; Nazanin Esmaili; Jean-Christophe Leveque; Farrokh Farrokhi; Christine Bennett; Massimo Piccardi; Rajiv K Sethi
Journal: Neurosurg Rev Date: 2019-08-17 Impact factor: 3.042

3. Ethical thinking machines in surgery and the requirement for clinical leadership.

Authors: Quinlan D Buchlak; Nazanin Esmaili; Jean-Christophe Leveque; Christine Bennett; Massimo Piccardi; Farrokh Farrokhi
Journal: Am J Surg Date: 2020-07-08 Impact factor: 2.565

4. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review.

Authors: Waseem Rawat; Zenghui Wang
Journal: Neural Comput Date: 2017-06-09 Impact factor: 2.026

5. Statistical considerations for testing an AI algorithm used for prescreening lung CT images.

Authors: Nancy A Obuchowski; Jennifer A Bullen
Journal: Contemp Clin Trials Commun Date: 2019-08-22

6. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.

Authors: Alistair E W Johnson; Tom J Pollard; Seth J Berkowitz; Nathaniel R Greenbaum; Matthew P Lungren; Chih-Ying Deng; Roger G Mark; Steven Horng
Journal: Sci Data Date: 2019-12-12 Impact factor: 6.444

Review 7. Convolutional neural networks in medical image understanding: a survey.

Authors: D R Sarvamangala; Raghavendra V Kulkarni
Journal: Evol Intell Date: 2021-01-03

8. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists.

Authors: Pranav Rajpurkar; Jeremy Irvin; Robyn L Ball; Kaylie Zhu; Brandon Yang; Hershel Mehta; Tony Duan; Daisy Ding; Aarti Bagul; Curtis P Langlotz; Bhavik N Patel; Kristen W Yeom; Katie Shpanskaya; Francis G Blankenberg; Jayne Seekins; Timothy J Amrhein; David A Mong; Safwan S Halabi; Evan J Zucker; Andrew Y Ng; Matthew P Lungren
Journal: PLoS Med Date: 2018-11-20 Impact factor: 11.069

9. Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: A retrospective study.

Authors: Andrew G Taylor; Clinton Mielke; John Mongan
Journal: PLoS Med Date: 2018-11-20 Impact factor: 11.069

Review 10. SciPy 1.0: fundamental algorithms for scientific computing in Python.

Authors: Pauli Virtanen; Ralf Gommers; Travis E Oliphant; Matt Haberland; Tyler Reddy; David Cournapeau; Evgeni Burovski; Pearu Peterson; Warren Weckesser; Jonathan Bright; Stéfan J van der Walt; Matthew Brett; Joshua Wilson; K Jarrod Millman; Nikolay Mayorov; Andrew R J Nelson; Eric Jones; Robert Kern; Eric Larson; C J Carey; İlhan Polat; Yu Feng; Eric W Moore; Jake VanderPlas; Denis Laxalde; Josef Perktold; Robert Cimrman; Ian Henriksen; E A Quintero; Charles R Harris; Anne M Archibald; Antônio H Ribeiro; Fabian Pedregosa; Paul van Mulbregt
Journal: Nat Methods Date: 2020-02-03 Impact factor: 28.547