Literature DB >> 35434976

Improving the Performance of Radiologists Using Artificial Intelligence-Based Detection Support Software for Mammography: A Multi-Reader Study.

Jeong Hoon Lee¹, Ki Hwan Kim¹, Eun Hye Lee², Jong Seok Ahn¹, Jung Kyu Ryu³, Young Mi Park⁴, Gi Won Shin⁴, Young Joong Kim⁵, Hye Young Choi⁶.

Abstract

OBJECTIVE: To evaluate whether artificial intelligence (AI) for detecting breast cancer on mammography can improve the performance and time efficiency of radiologists reading mammograms.
MATERIALS AND METHODS: A commercial deep learning-based software for mammography was validated using external data collected from 200 patients, 100 each with and without breast cancer (40 with benign lesions and 60 without lesions) from one hospital. Ten readers, including five breast specialist radiologists (BSRs) and five general radiologists (GRs), assessed all mammography images using a seven-point scale to rate the likelihood of malignancy in two sessions, with and without the aid of the AI-based software, and the reading time was automatically recorded using a web-based reporting system. Two reading sessions were conducted with a two-month washout period in between. Differences in the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and reading time between reading with and without AI were analyzed, accounting for data clustering by readers when indicated.
RESULTS: The AUROC of the AI alone, BSR (average across five readers), and GR (average across five readers) groups was 0.915 (95% confidence interval, 0.876-0.954), 0.813 (0.756-0.870), and 0.684 (0.616-0.752), respectively. With AI assistance, the AUROC significantly increased to 0.884 (0.840-0.928) and 0.833 (0.779-0.887) in the BSR and GR groups, respectively (p = 0.007 and p < 0.001, respectively). Sensitivity was improved by AI assistance in both groups (74.6% vs. 88.6% in BSR, p < 0.001; 52.1% vs. 79.4% in GR, p < 0.001), but the specificity did not differ significantly (66.6% vs. 66.4% in BSR, p = 0.238; 70.8% vs. 70.0% in GR, p = 0.689). The average reading time pooled across readers was significantly decreased by AI assistance for BSRs (82.73 vs. 73.04 seconds, p < 0.001) but increased in GRs (35.44 vs. 42.52 seconds, p < 0.001).
CONCLUSION: AI-based software improved the performance of radiologists regardless of their experience and affected the reading time.

Entities: Chemical

Keywords: Artificial intelligence; Breast cancer; Deep-learning; Mammography; Multi-reader study; Reading time; Screening

Mesh：

Year: 2022 PMID： 35434976 PMCID： PMC9081685 DOI： 10.3348/kjr.2021.0476

Source DB: PubMed Journal: Korean J Radiol ISSN： 1229-6929 Impact factor: 7.109

INTRODUCTION

Breast cancer is the second-leading cause of cancer-related death in female worldwide [12]. Mammography screening has been widely used to reduce breast cancer mortality; however, reading a large number of mammography screenings is laborious, time-consuming, and cost-intensive [345]. Computer-aided diagnosis (CAD) systems were introduced as a support tool in the 1990s as a potential method for overcoming resource deficiency in mammography screening [678]. Recently, with the introduction of artificial intelligence (AI), breast experts have evaluated the performance of AI algorithms in mammography [91011]. CAD systems were introduced into clinical practice based on their high sensitivity; however, subsequent large-scale studies have shown that these systems rarely change the decisions made by radiologists in breast cancer screening [121314]. The main reason for the inability of CAD systems to improve radiologists’ performance in practice was that CAD identified many redundant markers per image; therefore, radiologists did not trust the results provided [15]. Moreover, their high false-positive rates led to an increase in unnecessary examinations. The use of AI in detecting breast cancer on mammograms is of great interest [16171819]. Although AI algorithms show great potential in breast cancer screening, their characteristics must be carefully evaluated and radiologists should fully understand these characteristics when they are used in practice [2021]. This is because the effectiveness of the AI algorithms in diagnostic decision-making may vary with the radiologist’s experience [1016]. Therefore, the strengths and weaknesses of AI algorithms should be identified based on radiologist experience before they are introduced into clinical practice. This study investigated whether AI could improve radiologists’ performance using a commercially available deep-learning-based software for mammography depending on their experience with external data [10]. The effects of AI on reading time efficiency were also evaluated.

MATERIALS AND METHODS

This retrospective study was approved by the Ethics Committee of the Institutional Review Board of Soonchunhyang University Hospital Bucheon (IRB No. SCHBC 2019-08-022-001), and the requirement for informed consent was waived. Lunit Inc. provided technical support for the AI system and statistical analysis of the study data. The first two authors, who are employees of Lunit Inc., contributed to the study design and analysis of the results of the image interpretation by the study readers; however, they did not participate in collecting and reading the mammograms in the reader study, which were collected from Soonchunhyang University Hospital Bucheon.

Patients and Datasets

A total of 200 external validation cases, including biopsy-proven malignant (n = 100), benign (n = 40), and negative (n = 60) cases, were collected from adult Asian female who underwent four-view mammography for breast cancer screening between March 2009 and September 2018 (Fig. 1, Table 1). Based on the assumption that the AUROC for mammography interpretation by radiologists is approximately 0.90, the AUROC difference between readings with and without AI assistance was greater than 0.15, and the variability between observers was moderate. The sample size of 200 was estimated to be sufficient for demonstrating the performance change elicited by the use of AI [22].

Fig. 1

Schematic of the workflow for the multi-reader study.

The squares in sessions 1 and 2 represent readers. AI = artificial intelligence

Table 1

Population Characteristics and Mammographic Features

		Cancer	Benign	Negative
Number of samples		100	40	60
Age, years
	Mean	53.03	50.15	51.97
	Median	51	48	51
	Range	36–78	36–75	39–78
	Interquartile range	47.75–57.00	44.00–55.00	45.75–57.00
BI-RADS composition categories
	a	5 (5.0)	0 (0.0)	3 (5.0)
	b	22 (22.0)	11 (27.5)	22 (36.7)
	c	38 (38.0)	23 (57.5)	18 (30.0)
	d	35 (35.0)	6 (15.0)	17 (28.3)
Mammographic features
	Calcification	49	27
	Mass	42	14
	Asymmetry	23	8
	Distortion	5	1
T stage
	0	28
	I	54
	II	11
	Unknown	7
N stage
	0	75
	1	18
	2	0
	Unknown	7

Data are number of cases with % in parentheses, unless specified otherwise. BI-RADS = Breast Imaging Reporting and Data System

Biopsy-proven malignant cases were consecutively collected from patients who were pathologically confirmed as having breast cancer within six months after mammography, and bilateral breast cancer was not included. Negative cases were collected from patients who were Breast Imaging Reporting and Data System (BI-RADS) final assessment category 1 on mammography and confirmed as negative over more than 2 years of follow-up imaging. Benign cases were confirmed via biopsy (biopsy-confirmed benign) or follow-up imaging for more than 2 years (clinically confirmed benign). During the data collection period, 20 cases were randomly collected for each benign category and 60 negative cases were randomly collected. Age and BI-RADS composition categories did not differ significantly among the three groups (ANOVA, p > 0.05). Of the malignant cases, 28 were ductal carcinoma in situ and 65 were invasive cancers. Of the latter, the majority were invasive ductal carcinomas (55/65, 84.6%). The remaining invasive cancers included lobular carcinoma (6/65, 9.2%), mucinous carcinoma (2/65, 3.1%), apocrine carcinoma (1/65, 1.5%), and tubular carcinoma (1/65, 1.5%). Seven cancer patients underwent surgery in other hospitals after receiving a core biopsy; therefore, their stage information was not available.

Mammography Examination

All digital mammography examinations included four views of full-field digital mammograms (i.e., right craniocaudal view, right mediolateral oblique view, left craniocaudal view, and left mediolateral oblique view), acquired using an equivalent full-field digital mammography system (Senographe 2000D; GE Healthcare).

AI Algorithm

Commercially available deep learning-based software for mammography (Lunit INSIGHT MMG, version 1.1.1.0; Lunit) was used in this study [10]. This AI algorithm was trained to detect breast cancer using data from 170230 mammograms, including data from over 30000 breast cancer cases. The AI algorithm provided an abnormality score based on four views between 0 and 100, indicating the possibility of breast cancer, and a heatmap was shown at the location of the abnormal region, indicating an abnormality score of 10 or more.

Reader Study

Ten radiologists with different levels of experience participated in the reader study. Of these, five were breast specialist radiologists (BSRs) with 4–19 years of breast imaging experience and five were general radiologists (GRs) with 1–10 years of post-training experience. BSRs read more than 3000 mammography examinations per year, whereas GRs did not. The workflow of this multicase multireader study is shown in Figure 1. Five readers (three BSRs and two GRs) read the mammograms with AI, and after two months of washout, they re-read the mammograms without AI. The remaining five readers (two BSRs and three GRs) read the mammograms with and without AI assistance. A reader-crossover design and two months of washout period were adopted to minimize the potential bias that the order of AI usage can affect the reading time and performance of the readers [23]. There was no change in the reading environment for each reader between reading sessions with and without AI. All readers performed this study using a web-based reader study system with the hardware configuration used in daily practice. All radiologists read the mammograms rated the likelihood of malignancy (LOM) in each case on a seven-point scale in both readings with and without AI: 1 = definitely normal, 2 = benign, 3 = probably benign, 4 = low suspicion of malignancy, 5 = moderate suspicion of malignancy, 6 = high suspicion of malignancy, and 7 = highly suggestive of malignancy. The radiologist was requested to record an LOM score of at least 3 when a suspected breast cancer lesion was detected. In the reading with AI assistance, the heatmap was overlaid on the original mammograms, and the heatmap could be turned on and off easily using the shortcut key. The readers considered both the AI results and the original mammogram findings and made their own final judgments using a seven-point scale. Additionally, for localization ROC (LROC) analysis, each radiologist was requested to mark one location of the most suspicious region and then provide the LOM using the same seven-point scale. For each case, the reading time from the initial image view by the reader to the final decision was automatically recorded using a web-based reporting system, and the readers were informed of this time measurement in advance.

Reference Standard

To conduct the LROC analysis, an expert with 20 years of experience in breast imaging annotated the location of malignant lesions with a free-form contour line referring to the original mammography and pathology reports, and this annotation was considered the ground truth. The expert also evaluated the BI-RADS composition categories and dominant mammographic features. For mammographic features, one of the three most prominent features (mass, asymmetry, and architectural distortion) was selected, and the presence of microcalcifications suspected of implying malignancy was evaluated separately.

Statistical Analysis

Diagnostic performance was primarily analyzed using conventional receiver operating characteristic (ROC) curve analysis. For AI, the area under the ROC (AUROC) was measured using the abnormality score (range: 0–100). The radiologists’ AUROC values were obtained based on the LOM. The difference in AUROCs between the readings with and without AI was compared using the DeLong test in the pROC R package (version 3.6.3; R Project for Statistical Computing, https://www.r-project.org) [24]. The average AUROC for each group was obtained from the arithmetic mean of the AUC [25]. The average AUROC was compared using the iMRMC method [26]. To analyze the performance considering the correct localization of the lesion, LROC analysis was performed. The area under the LROC (AULROC) was measured using the non-parametric trapezoidal method [27]. SensitivityLROC was calculated using true positives, excluding the incorrect location cases. The localization of the AI was considered correct if the location of the maximum pixel-level abnormality scores was inside the closed free-form line of the ground truth drawn by the expert radiologist. The correctness of the localization of the readers was determined on the basis of whether the reader’s point mark was inside the ground truth [28]. The difference in AULROCs between the readings with and without AI was evaluated using Hillis’ modification of the Dorfman-Berbaum-Metz (DBMH) method [2930]. The DBMH method was conducted using the StSignificanceTesting function of the RJafroc R package v1.3.2 [3132]. For both analyses, a representative set of sensitivity and specificity values was obtained and compared between the reading modes. The sensitivity and specificity of AI were determined using a threshold of 10, as described in a previous study [10]. For the radiologists’ interpretation, an LOM of 1–2 was considered negative, and an LOM ≥ 3 was considered positive to calculate sensitivity and specificity. The change in sensitivity and specificity of each reader was compared using McNemar’s test, and the change in sensitivity and specificity of each reading group (BSR and GR) was evaluated using logistic regression with the generalized estimating equation (GEE) method, and 95% confidence intervals (CIs) were calculated for each group using the DTComPair and the geepack R package [3334]. In cases of cancer, the sensitivities of the readers with and without AI were compared in subgroups according to T and N stages using the GEE method. Additionally, the Wilcoxon signed-rank test was used to compare the reading times pooled across readers with and without AI assistance. Statistical significance was set at p < 0.05. Statistical analyses were performed using R statistical software (version 3.6.3).

RESULTS

Comparative Performance of AI-Unassisted and AI-Assisted Reading

Figure 2 shows the performance results of the AI algorithm and BSRs and GRs with and without AI assistance. The AUROC values of the AI, BSRs (average across five readers), and GRs (average across five readers) were 0.915, 0.813, and 0.684, respectively (Table 2). With AI assistance, the AUROC values significantly increased to 0.884 (p = 0.007) and 0.833 (p < 0.001) in the BSR and GR groups, respectively. Sensitivity was also significantly increased by AI assistance in both the BSR (74.6% vs. 88.6%, p < 0.001) and GR groups (52.1% vs. 79.4%, p < 0.001), but the specificity did not differ significantly (BSR: 66.6% vs. 66.4%, p = 0.238; GR: 70.8% vs. 70.0%, p = 0.689). When the GR group used AI assistance, the AUROC was not significantly different from that of the BSR group without AI (p = 0.138). Table 2 lists the performance of each reader. With AI, the AUROCs changed from 0.794–0.839 to 0.835–0.920 in the BSR group and from 0.608–0.727 to 0.794–0.859 in the GR group, with statistically significant differences, except for one BSR. The sensitivity also significantly increased for two of the BSRs and all GRs. However, there was no significant difference in the specificity of the nine readers (three BSRs and five GRs) between AI-assisted and unassisted readings.

Fig. 2

Performance of the AI alone and BSR and GR groups with and without AI.

A. ROC curves of radiologists with and without AI. B. Difference of AUROC between AI and radiologists. AI = artificial intelligence, AUROC = area under the ROC curve, BSR = breast specialist radiologist, GR = general radiologist, ROC = receiver operating characteristic

Table 2

Diagnostic Performance of Five BSRs and Five GRs with and without AI Assistance for 100 Patients with and 100 Patients without Breast Cancer

		Conventional ROC Analysis									LROC Analysis
		AUROC			Sensitivity			Specificity			AULROC			Sensitivity_LROC
		AI-Unassisted	AI-Assisted	P	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted	AI-Assisted	P	AI-Unassisted, %	AI-Assisted, %	P
BSR
	1	0.839	0.903	0.008	75.0	86.0	0.005	69.0	68.0	0.819	0.673	0.787	0.003	71.0	81.0	0.012
	2	0.826	0.901	0.002	83.0	90.0	0.052	58.0	66.0	0.033	0.714	0.820	0.004	77.0	85.0	0.045
	3	0.808	0.835	0.293	79.0	83.0	0.317	68.0	61.0	0.108	0.646	0.689	0.242	70.0	75.0	0.225
	4	0.797	0.863	0.004	77.0	82.0	0.166	63.0	67.0	0.317	0.615	0.713	0.003	68.0	76.0	0.021
	5	0.794	0.920	< 0.001	59.0	92.0	< 0.001	75.0	60.0	< 0.001	0.526	0.824	< 0.001	55.0	86.0	< 0.001
		0.813 (0.756–0.870)	0.884 (0.840–0.928)	0.007	74.6 (70.8–78.4)	88.6 (83.6–89.6)	< 0.001	66.6 (62.5–70.7)	64.4 (60.2–68.6)	0.238	0.635 (0.564–0.706)	0.767 (0.700–0.833)	0.034	68.2 (64.1–72.3)	80.6 (77.1–84.1)	< 0.001
GR
	1	0.727	0.855	< 0.001	75.0	87.0	0.011	52.0	56.0	0.433	0.499	0.738	< 0.001	57.0	80.0	< 0.001
	2	0.716	0.822	0.001	46.0	81.0	< 0.001	75.0	67.0	0.102	0.399	0.713	< 0.001	43.0	78.0	< 0.001
	3	0.712	0.833	< 0.001	58.0	81.0	< 0.001	68.0	68.0	1.000	0.496	0.710	< 0.001	53.0	76.0	< 0.001
	4	0.657	0.859	< 0.001	41.0	77.0	< 0.001	81.0	83.0	0.564	0.383	0.693	< 0.001	40.0	73.0	< 0.001
	5	0.608	0.794	< 0.001	36.0	71.0	< 0.001	78.0	76.0	0.593	0.323	0.616	< 0.001	34.0	67.0	< 0.001
	Average (95% CI)	0.684 (0.616–0.752)	0.833 (0.779–0.887)	< 0.001	51.2 (46.8–55.6)	79.4 (75.9–82.9)	< 0.001	70.8 (66.8–74.8)	70.0 (66.0–74.0)	0.689	0.420 (0.356–0.484)	0.694 (0.625–0.763)	< 0.001	45.4 (41.0–49.8)	74.8 (0.710–0.786)	< 0.001
All radiologists (95% CI)		0.748 (0.686–0.811)	0.858 (0.809–0.907)	< 0.001	62.9 (59.9–65.9)	83.0 (80.7–85.3)	< 0.001	68.7 (65.8–71.6)	67.2 (64.3–70.1)	0.273	0.527 (0.464–0.590)	0.730 (0.666–0.794)	< 0.001	56.8 (53.7–59.9)	77.7 (75.1–80.3)	< 0.001
AI (95% CI)		0.915 (0.876–0.954)			87.0 (80.4–93.6)			79.0			0.769 (0.689–0.849)			79.0 (78.8–93.0)

AI = artificial intelligence, AUROC = area under the receiver operating characteristics curve, BSR = breast specialist radiologist, CI = confidence interval, GR = general radiologist, LROC = localization receiver operating characteristic

In the LROC analysis, the AULROC values were significantly increased by AI assistance in the BSR (0.635 vs. 0.767, p = 0.034) and GR (0.420 vs. 0.694, p < 0.001) groups. The corresponding sensitivity (sensitivityLROC) values were also significantly increased using AI in the BSR (68.2% vs. 80.6%, p < 0.001) and GR (45.4% vs. 74.8%, p < 0.001). A representative example of mammography examinations in which the number of correct recall decisions across readers changed between two reading sessions is shown in Figure 3.

Fig. 3

Examples of breast cancer detected with the aid of AI.

A. Mammograms in 47-year-old female with invasive ductal carcinoma. B. Heatmap and abnormality score are shown as in the viewer of the AI-based software. C. The patient was recalled by three of 10 radiologists when reading without AI assistance and by nine of 10 radiologists using AI-based software for support. AI = artificial intelligence

Effects of AI Assistance in Breast Cancer Detection according to T and N Stages

The sensitivities of the AI and readers were evaluated in patients (n = 93) with T-stage and N-stage information. Table 3 lists the changes in sensitivity and statistical significance according to T-stage and N-stage, representing tumor size and lymph node metastasis status, respectively. In both the BSR and GR groups, sensitivity significantly increased with AI assistance, regardless of the T-stage and N-stage of the patients.

Table 3

Effects of AI Assistance on Sensitivity for Radiologist Groups according to T-Stage and N-Stage

	T-0 (n = 28)			T-1 (n = 54)			T-2 (n = 11)			N-0 (n = 75)			N-1 (n = 18)
	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted, %	AI-Assisted, %	P	AI-Unassisted, %	AI-Assisted, %	P
BSR	73.6 (66.3–80.9)	82.9 (76.6–89.1)	0.001	72.6 (67.3–77.9)	86.7 (82.6–90.7)	0.004	72.7 (61.0–84.5)	89.1 (80.9–97.3)	0.048	72.3 (67.7–76.8)	84.5 (80.9–88.2)	0.001	75.6 (66.7–84.4)	91.1 (85.2–97.0)	0.011
GR	56.4 (48.2–64.6)	75.7 (68.6–82.8)	< 0.001	47.4 (41.5–53.4)	80.4 (75.6–85.1)	< 0.001	41.8 (28.8–54.9)	74.5 (63.0–86.1)	0.004	50.9 (45.9–56.0)	77.6 (73.4–81.8)	< 0.001	43.3 (33.1–53.6)	81.1 (73.0–89.2)	< 0.001
AI	82.1 (68.0–96.3)			87.0 (78.1–96.0)			90.9 (73.9–107.9)			84.0 (75.7–92.3)			94.4 (83.9–105.0)

p values of unaided versus with AI from generalized estimating equation analysis. Values in parentheses are 95% confidence interval. AI = artificial intelligence, BSR = breast specialist radiologist, GR = general radiologist

Subgroup analysis was performed on AUROC, sensitivity, and specificity by dividing the BI-RADS composition category into two groups: a-b and c-d (Supplementary Table 1). In both groups, the AUROC and sensitivity increased significantly with AI. In the BSR group, the specificity significantly increased in the BI-RADS composition category a-b group (67.8% vs. 74.4%, p = 0.014) and decreased in the c-d group (65.9% vs. 58.8%, p = 0.003). There was no significant difference in specificity in the GR group. To observe how the decision of readers was changed by AI assistance, the numbers of detected and missed cases according to the AI and radiologists are shown in Table 4. The majority of radiologists made the decision. For cases detected by AI, the number of cases detected correctly by radiologists increased from 54 to 75 with AI assistance. For cases that the AI alone could not detect, radiologists additionally detected three cases using AI.

Table 4

Number of Breast Cancer Detected or Missed by AI and Radiologists

	Detected by Both	Detected by Radiologists Alone	Detected by AI Alone	Missed by Both
Unaided	54	3	25	18
With AI	75	6	4	15

AI = artificial intelligence

Effects of AI Assistance on Reading Time

With AI assistance, the overall average reading time pooled across readers significantly changed between before and after using AI in the BSR (decrease from 82.73 seconds to 73.04 seconds, p < 0.001) and GR (increase from 35.44 seconds to 42.52 seconds, p < 0.001) groups. The average reading times of the BSR group were significantly decreased regardless of the three case types: cancer, benign and negative (83.05 seconds vs. 74.79 seconds, p = 0.004; 86.41 seconds vs. 74.20 seconds, p = 0.025; and 79.72 seconds vs. 69.37 seconds, p = 0.040, respectively) (Fig. 4). However, in the GR group, there was a significant increase in cancer and benign cases (37.99 seconds vs. 46.94 seconds, p < 0.001 and 33.62 seconds vs. 47.45 seconds, p < 0.001, respectively).

Fig. 4

Reading time pooled across radiologists with and without AI assistance.

AI = artificial intelligence

DISCUSSION

We investigated whether deep-learning-based AI assistance could improve the detection performance of radiologists. This study analyzed the change in detection performance and reading time using AI assistance according to the reader’s experience and characteristics of the samples. AI assistance improved detection performance regardless of the experience of the radiologists and the characteristics of the samples. AI increased the efficiency of the BSR, which had a relatively long reading time, and helped the GR focus the lesions in detail, thereby increasing the reading time. Before introducing AI algorithms into clinical practice, this study will be helpful in anticipating how AI affects the detection performance of radiologists in the real world. Recently, many AI algorithms have been developed to assist breast cancer detection, and they have focused on increasing sensitivity without reducing specificity, which can lead to unnecessary recalls and higher medical costs [15]. Consequently, many studies have shown that AI algorithms have superior performance compared to traditional CAD [9101618]. A study by Kooi et al. [35] reported that a deep learning model outperformed the traditional CAD system at a high sensitivity. AI algorithms can accurately identify malignant lesions and ignore benign lesions by training using a large amount of mammography data. In a previous study with 546 cases (110 cancers), the AUROC for reading with AI was significantly higher than that without AI (0.886 vs. 0.866, respectively) [16]. The sensitivity (83.0%–86.0%) and specificity (77.1%–79.3%) were improved using AI. Another study also demonstrated that AI could improve the AUROC value (0.801–0.881) in a reader study [10]. In this study, the AUROCs of the BSR and GR groups were significantly improved (0.813 vs. 0.884, p = 0.007 and 0.684 vs. 0.833, p < 0.001, respectively), which is consistent with previous studies. In terms of specificity, AI was superior to radiologists in distinguishing clinically confirmed benign cases from cancer, and AI did not differ from BSRs in biopsy-confirmed benign cases. Although false-positive cases detected by the AI tended to increase the false positives identified by radiologists, the overall specificity was not statistically different because the false positive rate of radiologists also decreased in cases where the AI analysis was negative. This high AI specificity could help improve radiologists’ performance without losing specificity in reader studies. In terms of sensitivity, AI can assist radiologists based on the fact that 25 cancers were only detected using AI (Table 4). With AI assistance, the number of cancers detected only using AI was significantly reduced to 4. Although it has been reported that reading time is significantly lower for digital breast tomosynthesis (DBT) reading with AI assistance [36], there may be a limit to reducing the reading time in 2D mammography because the number of images is significantly smaller than that of DBT. In another study, Rodriguez-Ruiz et al. [18] found that reading time tended to decrease in low-suspicion examinations and increase in high-suspicion examinations but did not differentiate radiologists’ experience. However, we found that the use of AI could also reduce the average reading time in the BSR group, probably because AI helps radiologists to detect lesions more quickly. This reduction in reading time is expected to improve radiologists’ efficiency in clinical practice. Moreover, the reading time for GRs increased slightly with the use of AI, presumably because the average reading time of GRs was shorter than that of BSRs, and the sensitivity increased with the use of AI. This increase in sensitivity indicates that AI can encourage GRs to focus more on mammograms with suspicious findings. However, negative cases with AI had the shortest reading time regardless of the reader's experience, suggesting that the use of AI may increase the efficiency of breast cancer screening, where the negative is relatively dominant. This study had some limitations. First, it included a small number of cancer cases, so it was difficult to perform a detailed analysis in terms of mammographic features and cancer stage. Second, there was a high proportion of cancer and biopsy-confirmed benign cases, which does not reflect breast cancer screening. However, in this study, we focused on changes in radiologists’ decisions following AI assistance. Third, multiple lesions were not considered; therefore, comparisons at the lesion level were impossible. Fourth, the reader’s study environment differs from real-world clinical settings. Furthermore, the readers were probably not fully committed to accepting AI results because of their lack of experience in using AI before participating in this study. This can explain why the AUROC of one reader did not change significantly between that before and after using the AI algorithm and why the specificity of the readers was lower than that of the AI. Further studies, including radiologists with sufficient experience in AI-based software and a larger dataset representing the real-world environment, such as breast screening data or use in a diagnostic setting, are needed. In summary, this study demonstrated the effect of AI-based software in interpreting mammograms. AI improved cancer detection without sacrificing specificity, regardless of reader experience, and affected reading time.

33 in total

1. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis.

Authors: Stephen L Hillis
Journal: Stat Med Date: 2007-02-10 Impact factor: 2.373

2. Generalized Roe and Metz receiver operating characteristic model: analytic link between simulated decision scores and empirical AUC variances and covariances.

Authors: Brandon D Gallas; Stephen L Hillis
Journal: J Med Imaging (Bellingham) Date: 2014-09-25

3. Large scale deep learning for computer aided detection of mammographic lesions.

Authors: Thijs Kooi; Geert Litjens; Bram van Ginneken; Albert Gubern-Mérida; Clara I Sánchez; Ritse Mann; Ard den Heeten; Nico Karssemeijer
Journal: Med Image Anal Date: 2016-08-02 Impact factor: 8.545

4. Unintended Consequences of Machine Learning in Medicine.

Authors: Federico Cabitza; Raffaele Rasoini; Gian Franco Gensini
Journal: JAMA Date: 2017-08-08 Impact factor: 56.272

5. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study.

Authors: Hyo-Eun Kim; Hak Hee Kim; Boo-Kyung Han; Ki Hwan Kim; Kyunghwa Han; Hyeonseob Nam; Eun Hye Lee; Eun-Kyung Kim
Journal: Lancet Digit Health Date: 2020-02-06

6. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection.

Authors: Constance D Lehman; Robert D Wellman; Diana S M Buist; Karla Kerlikowske; Anna N A Tosteson; Diana L Miglioretti
Journal: JAMA Intern Med Date: 2015-11 Impact factor: 21.873

7. Impact of computer-aided detection systems on radiologist accuracy with digital mammography.

Authors: Elodia B Cole; Zheng Zhang; Helga S Marques; R Edward Hendrick; Martin J Yaffe; Etta D Pisano
Journal: AJR Am J Roentgenol Date: 2014-10 Impact factor: 3.959

8. pROC: an open-source package for R and S+ to analyze and compare ROC curves.

Authors: Xavier Robin; Natacha Turck; Alexandre Hainard; Natalia Tiberti; Frédérique Lisacek; Jean-Charles Sanchez; Markus Müller
Journal: BMC Bioinformatics Date: 2011-03-17 Impact factor: 3.307

9. Detecting and classifying lesions in mammograms with Deep Learning.

Authors: Dezső Ribli; Anna Horváth; Zsuzsa Unger; Péter Pollner; István Csabai
Journal: Sci Rep Date: 2018-03-15 Impact factor: 4.379

10. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms.

Authors: Thomas Schaffter; Diana S M Buist; Christoph I Lee; Yaroslav Nikulin; Dezso Ribli; Yuanfang Guan; William Lotter; Zequn Jie; Hao Du; Sijia Wang; Jiashi Feng; Mengling Feng; Hyo-Eun Kim; Francisco Albiol; Alberto Albiol; Stephen Morrell; Zbigniew Wojna; Mehmet Eren Ahsen; Umar Asif; Antonio Jimeno Yepes; Shivanthan Yohanandan; Simona Rabinovici-Cohen; Darvin Yi; Bruce Hoff; Thomas Yu; Elias Chaibub Neto; Daniel L Rubin; Peter Lindholm; Laurie R Margolies; Russell Bailey McBride; Joseph H Rothstein; Weiva Sieh; Rami Ben-Ari; Stefan Harrer; Andrew Trister; Stephen Friend; Thea Norman; Berkman Sahiner; Fredrik Strand; Justin Guinney; Gustavo Stolovitzky; Lester Mackey; Joyce Cahoon; Li Shen; Jae Ho Sohn; Hari Trivedi; Yiqiu Shen; Ljubomir Buturovic; Jose Costa Pereira; Jaime S Cardoso; Eduardo Castro; Karl Trygve Kalleberg; Obioma Pelka; Imane Nedjar; Krzysztof J Geras; Felix Nensa; Ethan Goan; Sven Koitka; Luis Caballero; David D Cox; Pavitra Krishnaswamy; Gaurav Pandey; Christoph M Friedrich; Dimitri Perrin; Clinton Fookes; Bibo Shi; Gerard Cardoso Negrie; Michael Kawczynski; Kyunghyun Cho; Can Son Khoo; Joseph Y Lo; A Gregory Sorensen; Hwejin Jung
Journal: JAMA Netw Open Date: 2020-03-02