Catherine M Jones1,2, Luke Danaher2, Michael R Milne3,2, Cyril Tang1, Jarrel Seah1,4, Luke Oakden-Rayner5, Andrew Johnson1, Quinlan D Buchlak1,6, Nazanin Esmaili6,7.
Abstract
OBJECTIVES: Artificial intelligence (AI) algorithms have been developed to detect imaging features on chest X-ray (CXR), with a comprehensive AI model capable of detecting 124 CXR findings recently developed. The aim of this study was to evaluate the real-world usefulness of the model as a diagnostic assistance device for radiologists.
Keywords: deep learning; machine learning; chest X-ray
Year: 2021 PMID: 34930738 PMCID: PMC8689166 DOI: 10.1136/bmjopen-2021-052902
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1 Flow diagram illustrating the AI-assisted reporting process described in this study. AI, artificial intelligence; CXR, chest X-ray; PACS, picture archiving and communication system; RIS, radiological information system.
Figure 2 Example of the modified user interface used by the participating radiologists in this study. The red box highlights the feedback options added to the interface for this study.
List of review options presented to the radiologist with each case
| Review option | Description |
| Rejected clinical finding | A model-detected finding disputed by the radiologist |
| Missed clinical finding | A model-detected finding missed by the radiologist |
| Add additional findings | Finding(s) identified by the radiologist but not identified by the model |
| These findings significantly impacted my report | A binary yes/no question on the effect of the model output on the radiologist's report |
| These findings may impact patient management | A binary yes/no question on the effect of the model output on patient management, as perceived by the reporting radiologist |
| These findings led to additional imaging recommendations | A binary yes/no question on whether the radiologist recommended further imaging based on the model output |
Demographics and results for the eleven radiologists involved in this study
| Radiologist ID | No of years post-training | Cases reported (% outpatient) | Significant report impact (%) | Patient management changes (%) | Imaging recommendations (%) |
| 1 | 19 | 136 (21.3) | 1 (0.7) | 1 (0.7) | 0 (0.0) |
| 2 | 1 | 325 (46.2) | 4 (1.2) | 0 (0.0) | 1 (0.3) |
| 3 | 4 | 230 (86.1) | 20 (8.6) | 14 (6.1) | 10 (4.3) |
| 4 | 6 | 375 (22.7) | 3 (1.0) | 0 (0.0) | 1 (0.2) |
| 5 | 4 | 186 (45.7) | 22 (11.8) | 9 (4.8) | 8 (4.3) |
| 6 | 20 | 333 (11.1) | 3 (1.0) | 2 (0.6) | 1 (0.3) |
| 7 | 3 | 312 (48.4) | 15 (4.8) | 8 (2.5) | 1 (0.3) |
| 8 | 26 | 408 (39.7) | 10 (2.4) | 5 (1.2) | 4 (1.0) |
| 9 | 9 | 214 (43.0) | 6 (2.8) | 2 (0.9) | 2 (0.9) |
| 10 | 6 | 159 (98.1) | 1 (0.6) | 1 (0.6) | 1 (0.6) |
| 11 | 5 | 294 (40.1) | 7 (2.4) | 1 (0.3) | 0 (0.0) |
| Total | | 2972 | | | |
Percentages (%) represent the associated value as a proportion of the total case number for that radiologist.
Figure 3 Counts of critical findings for the cases seen by the radiologists, defined as the number of critical findings agreed plus the number of critical findings added. 1513 cases returned zero critical findings.
Breakdown of the critical findings detected by the model and the level of radiologist agreement with each, including the number of findings reportedly missed by the model (and added by the radiologist) or missed by the radiologist
| Critical finding | Displayed by model | Radiologist agreed with finding (%) | Radiologist rejected finding (%) | Added in by radiologist | Missed by radiologist |
| Acute aortic syndrome | 2 | 2 (100.0) | 0 (0.0) | 0 | 0 |
| Acute humerus fracture | 5 | 5 (100.0) | 0 (0.0) | 0 | 0 |
| Acute rib fracture | 54 | 39 (72.2) | 15 (27.8) | 0 | 5 |
| Cardiomegaly | 1008 | 979 (97.1) | 29 (2.9) | 0 | 0 |
| Cavitating mass | 14 | 13 (92.9) | 1 (7.1) | 0 | 0 |
| Cavitating mass internal content | 6 | 5 (83.3) | 1 (16.7) | 0 | 0 |
| Diffuse airspace opacity | 13 | 13 (100.0) | 0 (0.0) | 0 | 0 |
| Diffuse lower airspace opacity | 153 | 148 (96.7) | 5 (3.3) | 0 | 0 |
| Diffuse perihilar airspace opacity | 45 | 45 (100.0) | 0 (0.0) | 0 | 0 |
| Diffuse upper airspace opacity | 2 | 2 (100.0) | 0 (0.0) | 0 | 0 |
| Focal airspace opacity | 341 | 321 (94.1) | 20 (5.9) | 0 | 2 |
| Hilar lymphadenopathy | 8 | 6 (75.0) | 2 (25.0) | 0 | 0 |
| Inferior mediastinal mass | 8 | 7 (87.5) | 1 (12.5) | 0 | 0 |
| Loculated effusion | 87 | 80 (92.0) | 7 (8.0) | 0 | 1 |
| Lung collapse | 11 | 10 (90.9) | 1 (9.1) | 0 | 0 |
| Malpositioned CVC | 85 | 78 (91.8) | 7 (8.2) | 0 | 1 |
| Malpositioned ETT | 52 | 43 (82.7) | 9 (17.3) | 0 | 0 |
| Malpositioned NGT | 39 | 31 (79.5) | 8 (20.5) | 0 | 0 |
| Malpositioned PAC | 13 | 9 (69.2) | 4 (30.8) | 0 | 0 |
| Multifocal airspace opacity | 125 | 120 (96.0) | 5 (4.0) | 0 | 1 |
| Multiple pulmonary masses | 43 | 38 (88.4) | 5 (11.6) | 0 | 0 |
| Pneumomediastinum | 5 | 5 (100.0) | 0 (0.0) | 1 | 0 |
| Pulmonary congestion | 220 | 215 (97.7) | 5 (2.3) | 1 | 0 |
| Segmental collapse | 292 | 290 (99.3) | 2 (0.7) | 0 | 1 |
| Shoulder dislocation | 1 | 0 (0.0) | 1 (100.0) | 0 | 0 |
| Simple effusion | 687 | 650 (94.6) | 37 (5.4) | 0 | 1 |
| Simple pneumothorax | 90 | 77 (85.6) | 13 (14.4) | 1 | 1 |
| Single pulmonary mass | 41 | 38 (92.7) | 3 (7.3) | 1 | 1 |
| Single pulmonary nodule | 105 | 95 (90.5) | 10 (9.5) | 3 | 5 |
| Subcutaneous emphysema | 53 | 51 (96.2) | 2 (3.8) | 0 | 1 |
| Subdiaphragmatic gas | 7 | 7 (100.0) | 0 (0.0) | 1 | 0 |
| Superior mediastinal mass | 37 | 32 (86.5) | 5 (13.5) | 0 | 0 |
| Tension pneumothorax | 11 | 7 (63.6) | 4 (36.4) | 0 | 0 |
| Tracheal deviation | 133 | 133 (100.0) | 0 (0.0) | 0 | 0 |
| Total | 3796 | 3594 (94.7) | 202 (5.3) | 8 | 20 |
Percentages (%) represent the associated value as a proportion of the total number of findings displayed by the model.
CVC, central venous catheter; ETT, endotracheal tube; NGT, nasogastric tube; PAC, pulmonary artery catheter.
Factors affecting AI model influence on report, patient management, or imaging recommendation
| Predictor | Outcome | OR (adjusted CI) | P value | Benjamini-adjusted threshold | Significance |
| No of critical findings | Report | 1.306 (1.132 to 1.507) | 0 | 0.0042 | Yes |
| No of critical findings | Patient management | 1.267 (1.056 to 1.521) | 0.001 | 0.0083 | Yes |
| No of critical findings | Imaging recommendation | 1.319 (1.035 to 1.681) | 0.004 | 0.0125 | Yes |
| Lateral CXR | Imaging recommendation | 6.495 (1.297 to 32.530) | 0.005 | 0.0167 | Yes |
| Lateral CXR | Patient management | 2.158 (0.837 to 5.565) | 0.061 | 0.0208 | No |
| Lateral CXR | Report | 1.542 (0.848 to 2.805) | 0.105 | 0.025 | No |
| Radiologist experience | Report | 0–5 years: Baseline | 0.120 | 0.0292 | No |
| Radiologist experience | Patient management | 0–5 years: Baseline | 0.262 | 0.0333 | No |
| Radiologist experience | Imaging recommendation | 0–5 years: Baseline | 0.516 | 0.0458 | No |
| Inpatient/outpatient | Imaging recommendation | 1.550 (0.613 to 3.919) | 0.326 | 0.0375 | No |
| Inpatient/outpatient | Report | 0.794 (0.476 to 1.323) | 0.358 | 0.0417 | No |
| Inpatient/outpatient | Patient management | 0.818 (0.408 to 1.640) | 0.572 | 0.0500 | No |
Significance testing by the Benjamini-Hochberg algorithm to account for multiple hypotheses. ORs derived from stepwise logistic regression coefficients, with CIs calculated at the Benjamini-adjusted thresholds. Radiologist experience was analysed as a categorical variable, with ORs representing the effect of changing experience level from the baseline (0–5 years) to a different level.
AI, artificial intelligence; CXR, chest X-ray.
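The Benjamini-adjusted thresholds in the table above follow the standard step-up pattern k·α/m for m = 12 hypotheses at α = 0.05 (0.05/12 ≈ 0.0042, 2 × 0.05/12 ≈ 0.0083, and so on). A minimal sketch of the Benjamini-Hochberg procedure, applied to the 12 p values as printed in the table (the function name and code are illustrative, not taken from the study's analysis code):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (in the input order) marking which
    hypotheses are rejected at false discovery rate `alpha`.
    """
    m = len(p_values)
    # Rank p values in ascending order; the threshold for rank k is k * alpha / m.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            max_k = rank  # largest rank whose p value passes its threshold
    # Step-up rule: reject every hypothesis ranked at or below max_k.
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            significant[i] = True
    return significant

# P values in the row order of the factors table.
p = [0.000, 0.001, 0.004, 0.005, 0.061, 0.105,
     0.120, 0.262, 0.516, 0.326, 0.358, 0.572]
print(sum(benjamini_hochberg(p)))  # prints 4
```

Running this reproduces the table's significance column: the four smallest p values (the three critical-findings predictors plus lateral CXR for imaging recommendation) are retained, and the remaining eight predictors are not.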