Karoline Freeman, Julia Geppert, Chris Stinton, Daniel Todkill, Samantha Johnson, Aileen Clarke, Sian Taylor-Phillips.
Abstract
OBJECTIVE: To examine the accuracy of artificial intelligence (AI) for the detection of breast cancer in mammography screening practice.
Year: 2021 PMID: 34470740 PMCID: PMC8409323 DOI: 10.1136/bmj.n1872
Source DB: PubMed Journal: BMJ ISSN: 0959-8138
Summary of study characteristics for studies using AI as standalone system
| Study | Study design | Population | Mammography vendor | Index test | Comparator | Reference standard |
|---|---|---|---|---|---|---|
| Lotter 2021 | Enriched test set MRMC laboratory study (accuracy of a read) | 285 women from 1 US health system with 4 centres (46.0% screen detected cancer); age and ethnic origin NR | Hologic 100% | In-house AI system (DeepHealth); threshold NR (set to match readers’ sensitivity and specificity, respectively) | 5 MQSA certified radiologists (US), single reading; threshold of BI-RADS scores 3, 4, and 5 considered recall | Cancer: pathology confirmed cancer within 3 months of screening; confirmed negative: a negative examination followed by an additional BI-RADS score 1 or 2 interpretation at the next screening examination 9-39 months later |
| McKinney 2020 | Retrospective test accuracy study (accuracy of a read) | 3097 women from 1 US centre (22.2% cancer within 27 months of screening); age <40, 181 (5.8%); 40-49, 1259 (40.7%); 50-59, 800 (25.8%); 60-69, 598 (19.3%); ≥70, 259 (8.4%) | Hologic / Lorad branded: >99%; | In-house AI system (Google Health); threshold: to achieve superiority for both sensitivity and specificity compared with original single reading using validation set | Original single radiologist decision (US); threshold: BI-RADS scores 0, 4, 5 were treated as positive | Cancer: biopsy confirmed cancer within 27 months of imaging; non-cancer: one follow-up non-cancer screen or biopsied negative (benign pathologies) after ≥21 months |
| Rodriguez-Ruiz 2019 | Enriched test set MRMC laboratory study (accuracy of a read) | 199 examinations from a Dutch digital screening pilot project (39.7% cancer); | Hologic 100% | Transpara version 1.4.0 (Screenpoint Medical BV, Nijmegen, Netherlands); | Nine Dutch radiologists, single reading, as part of a previously completed MRMC study | Cancer: histopathology-proven cancer; non-cancer: ≥1 normal follow-up screening examination (2 year screening interval) |
| Salim 2020 | Retrospective test accuracy study (accuracy of a read) | 8805 women from a Swedish cohort study (8.4% cancer within 12 months of screening); median age 54.5 (IQR 47.4-63.5) | Hologic 100% | 3 commercial AI systems (anonymised: AI-1, AI-2, and AI-3); threshold: corresponding to the specificity of the first reader | Original radiologist decision (Sweden); (1) single reader (R1; R2), (2) consensus reading; no threshold | Cancer: pathology confirmed cancer within 12 months of screening; non-cancer: ≥2 years cancer free follow-up |
| Schaffter 2020 | Retrospective test accuracy study (accuracy of a read) | 68 008 consecutive women from 1 Swedish centre (1.1% cancer within 12 months of screening); mean age 53.3 (SD 9.4) | NR | 4 in-house AI systems: 1 top performing model submitted to the DREAM challenge, 1 ensemble method of the eight best performing models (CEM), CEM combined with reader decision (single reader or consensus reading); threshold: corresponding to the sensitivity of single and consensus reading, respectively | Original radiologist decision (Sweden); (1) single reader (R1; R2), (2) consensus reading; no threshold | Cancer: tissue diagnosis within 12 months of screening; non-cancer: no cancer diagnosis ≥12 months after screening |
AI=artificial intelligence; BI-RADS=breast imaging reporting and data system; CEM=challenge ensemble method; DREAM=Dialogue on Reverse Engineering Assessment and Methods; IQR=interquartile range; MQSA=Mammography Quality Standards Act; MRMC=multiple reader multiple case; NR=not reported; R1=first reader; R2=second reader; SD=standard deviation.
Summary of study characteristics for studies using AI for triage
| Study | Study design | Population | Mammography vendor | Index test | Comparator | Reference standard |
|---|---|---|---|---|---|---|
| Balta 2020 | Retrospective cohort study (accuracy of classifying into low and high risk categories) | 17 895 consecutively acquired screening examinations from 1 centre in Germany (0.64% screen detected cancer), age NR | Siemens 70%, Hologic 30% | Transpara version 1.6.0 (Screenpoint Medical BV, Nijmegen, Netherlands); | No comparator as human consensus reading decisions used as reference standard for screen negative results | Cancer: biopsy proven screen detected cancers; non-cancer: no information about follow-up for the normal examinations was available. |
| Dembrower 2020 | Retrospective case-control study (accuracy of classifying into low and high risk categories) | 7364 women with screening examinations obtained during 2 consecutive screening rounds in 1 centre in Sweden (7.4% cancer: 347 screen detected in current round, 200 interval cancers within 30 months of previous screening round), median age 53.6 (IQR 47.6-63.0) | Hologic 100% | Lunit (Seoul, South Korea, version 5.5.0.16) | None | Cancer: diagnosed with breast cancer at current screening round or within ≤30 months of previous screening round; non-cancer: >2 years’ follow up |
| Lång 2021 | Retrospective cohort study (accuracy of classifying into low and high risk categories) | 9581 women attending screening at 1 centre in Sweden, consecutive subcohort of Malmö Breast Tomosynthesis Screening Trial (0.71% screen detected cancers), mean age 57.6 (range 40-74) | Siemens 100% | Transpara version 1.4.0 (Screenpoint Medical BV, Nijmegen, Netherlands); | No comparator as human consensus reading decisions used as reference standard for screen negative results | Cancer: histology of surgical specimen or core needle biopsies with a cross reference to a regional cancer register; |
| Raya-Povedano 2021 | Retrospective cohort study (accuracy of classifying into low and high risk categories) | 15 986 consecutive women from the Córdoba Tomosynthesis Screening Trial, 1 Spanish centre (0.7% cancer: 98 screen detected (FFDM or DBT), 15 interval cancers within 24 months of screening); mean age 58 (SD 6), range 50-69 years | Hologic (Selenia Dimensions) 100% | Transpara, version 1.6.0 (ScreenPoint Medical BV, Nijmegen, Netherlands); | Original radiologist decision from Córdoba Tomosynthesis Screening Trial (double reading without consensus or arbitration) | Cancer: histopathologic results of biopsy, screen detected via FFDM or DBT and interval cancers within 24 months of screening; non-cancer: normal reading with 2-years’ follow-up |
AI=artificial intelligence; DBT=digital breast tomosynthesis; FFDM=full field digital mammography; IQR=interquartile range; NR=not reported; SD=standard deviation.
Summary of study characteristics for studies using AI as reader aid
| Study | Study design | Population | Mammography vendor | Index test | Comparator | Reference standard |
|---|---|---|---|---|---|---|
| Pacilè 2020 | Enriched test set MRMC laboratory study, | 240 women from 1 US centre (50.0% cancer), mean age 59 (range 37-85) | NR | 14 MQSA certified radiologists (US) with AI support (MammoScreen version 1, Therapixel, Nice, France); threshold: level of suspicion (0-100) >40 | 14 MQSA certified radiologists (US) without AI support, single reading; threshold: level of suspicion (0-100) >40 | Cancer: histopathology; |
| Rodriguez-Ruiz 2019 | Enriched test set MRMC laboratory study, fully crossed (accuracy of a read) | 240 women (120 from 1 US centre and 120 from 1 German centre; 41.7% cancer), median age 62 (range 39-89) | Hologic 50% | 14 MQSA certified radiologists (US) with AI support (Transpara version 1.3.0, Screenpoint Medical BV, Nijmegen, the Netherlands); threshold: BI-RADS score ≥3 | 14 MQSA certified radiologists (US) without AI support, single reading; | Cancer: histopathology confirmed cancer; false positives: histopathologic evaluation or negative follow-up for ≥1 year; |
| Watanabe 2019 | Enriched test set MRMC laboratory study, first without AI support, then AI aided (accuracy of a read) | 122 women from 1 US centre (73.8% cancer, all false negative mammograms), mean age 65.4 (range 40-90) | NR | 7 MQSA certified radiologists (US) with AI support (cmAssist, CureMetrix, Inc., La Jolla, CA); no threshold | 7 MQSA certified radiologists (US) without AI support, single reading; no threshold | Cancer: biopsy proven cancer; non-cancer: BI-RADS 1 and 2 women with a 2 year follow-up of negative diagnosis |
AI=artificial intelligence; BI-RADS=Breast Imaging-Reporting and Data System; MQSA=Mammography Quality Standards Act; MRMC=multireader multicase; NR=not reported.
Fig 1. Overview of published evidence in relation to proposed role in screening pathway. Purple shade=current pathway; orange shade=AI added to pathway; green shade=level of evidence for proposed AI role. AI=artificial intelligence; +/−=high/low risk of breast cancer; person icon=radiologist reading of mammograms as single, first, or second reader; MRMC=multiple reader multiple case; R1, R2=reader 1, reader 2; RCT=randomised controlled trial; sens=sensitivity; spec=specificity
Fig 2. Overview of concerns about risk of bias and applicability of included studies. *Low concerns about applicability for consensus reading; high concerns about applicability for single reading as comparator test. †Low concerns about risk of bias and low applicability for the previous screening round (biopsy proven cancer or at least two years’ follow-up); high concerns about risk of bias and high applicability for the current screening round (biopsy-proven cancer but no follow-up of test negatives)
Summary of test accuracy outcomes
| Study | Index test (manufacturer)/comparator | TP | FP | FN | TN | % Sensitivity | Δ % Sensitivity | % Specificity | Δ % Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Standalone AI (5 studies): | | | | | | | | | |
| Lotter 2021 | AI (in-house) at reader’s specificity | 126 | 51 | 5 | 103 | 96.2 (91.7 to 99.2) | +14.2, P<0.001 | 66.9 | Set to be equal |
| | AI (in-house) at reader’s sensitivity | 107 | 14 | 24 | 140 | 82.0 | Set to be equal | 90.9 (84.9 to 96.1) | +24.0, P<0.001 |
| | Comparator: average single reader† | NA | NA | NA | NA | 82.0 | — | 66.9 | |
| McKinney 2020 | AI (in-house) | NR | NR | NR | NR | 56.24 | +8.1, P<0.001 | 84.29 | +3.46, P=0.02 |
| | Comparator: original single reader | NR | NR | NR | NR | 48.1 | — | 80.83 | — |
| Rodriguez-Ruiz 2019 | AI (Transpara version 1.4.0) | 63 | 25 | 16 | 95 | 80 (70 to 90) | +3 (−6.2 to 12.6) | 79 (73 to 86) | Set to be equal |
| | Comparator: average single reader§ | NA | NA | NA | NA | 77 (70 to 83) | — | 79 (73 to 86) | — |
| Salim 2020 | AI-1 (anonymised) | 605 | NR | NR | NR | 81.9 (78.9 to 84.6) | See below | 96.6 (96.5 to 96.7) | Set to be equal |
| | AI-2 (anonymised) | 495 | NR | NR | NR | 67.0 (63.5 to 70.4) | −14.9 | 96.6 (96.5 to 96.7) | Set to be equal |
| | AI-3 (anonymised) | 498 | NR | NR | NR | 67.4 (63.9 to 70.8) | −14.5 | 96.7 (96.6 to 96.8) | Set to be equal |
| | Comparator: original reader 1 | 572 | NR | NR | NR | 77.4 (74.2 to 80.4) | −4.5 | 96.6 (96.5 to 96.7) | — |
| | Comparator: original reader 2 | 592 | NR | NR | NR | 80.1 (77.0 to 82.9) | −1.8 | 97.2 (97.1 to 97.3) | +0.6 |
| | Comparator: original consensus reading | 628 | NR | NR | NR | 85.0 (82.2 to 87.5) | +3.1 | 98.5 (98.4 to 98.6) | +1.9 |
| Schaffter 2020 | Top-performing AI (in-house) | NR | NR | NR | NR | 77.1 | Set to be equal | 88 | −8.7 |
| | Ensemble method (CEM; in-house) | NR | NR | NR | NR | 77.1 | Set to be equal | 92.5 | −4.2 |
| | Comparator: original reader 1 | NR | NR | NR | NR | 77.1 | — | 96.7 (96.6 to 96.8) | |
| Schaffter 2020 | Top-performing AI (in-house) | NR | NR | NR | NR | 83.9 | Set to be equal | 81.2 | −17.3 |
| | Comparator: original consensus reading | NR | NR | NR | NR | 83.9 | — | 98.5 | — |
| AI for triage pre-screen (4 studies): | | | | | | | | | |
| Balta 2020 | AI as pre-screen (Transpara version 1.6.0): | | | | | | | | |
| | AI score ≤2: ~15% low risk | 114 | 15 028 | 0 | 2754 | 100.0 | NA | 15.49 | NA |
| | AI score ≤5: ~45% low risk | 109 | 9791 | 5 | 7991 | 95.61 | NA | 44.94 | NA |
| | AI score ≤7: ~65% low risk | 105 | 6135 | 9 | 11 647 | 92.11 | NA | 65.50 | NA |
| Lång 2020 | AI as pre-screen (Transpara version 1.4.0): | | | | | | | | |
| | AI score ≤2: ~19% low risk | 68 | 7684 | 0 | 1829 | 100.0 | NA | 19.23 | NA |
| | AI score ≤5: ~53% low risk | 61 | 4438 | 7 | 5075 | 89.71 | NA | 53.35 | NA |
| | AI score ≤7: ~73% low risk | 57 | 2541 | 11 | 6972 | 83.82 | NA | 73.29 | NA |
| Raya-Povedano 2021 | AI as pre-screen (Transpara version 1.6.0); AI score ≤7: ~72% low risk | 100 | 4450 | 13 | 11 424 | 88.5 (81.1 to 93.7) | NA | 72.0 (71.3 to 72.7) | NA |
| Dembrower 2020 | AI as pre-screen (Lunit version 5.5.0.16): | | | | | | | | |
| | AI score ≤0.0293: 60% low risk¶ | 347 | 29 787 | 0 | 45 200 | 100.0 | NA | 60.28 | NA |
| | AI score ≤0.0870: 80% low risk¶ | 338 | 14 729 | 9 | 60 258 | 97.41 | NA | 80.36 | NA |
| AI for triage post-screen (1 study): | | | | | | | | | |
| Dembrower 2020 | AI as post-screen (Lunit version 5.5.0.16); | 32 | 1413 | 168 | 73 921 | 16 | NA | 98.12 | NA |
| Dembrower 2020 | AI as post-screen (Lunit version 5.5.0.16); prediction of interval and next round screen detected cancers: | 103 | 1342 | 444 | 73 645 | 19 | NA | 98.21 | NA |
| AI as reader aid (3 studies): | | | | | | | | | |
| Pacilè 2020 | AI support§ (MammoScreen version 1) | NA | NA | NA | NA | 69.1 (60.0 to 78.2) | +3.3, P=0.02 | 73.5 (65.6 to 81.5) | +1.0, P=0.63 |
| | Comparator: average single reader** | NA | NA | NA | NA | 65.8 (57.4 to 74.3) | — | 72.5 (65.6 to 79.4) | — |
| Rodriguez-Ruiz 2019 | AI support (Transpara version 1.3.0) | 86 | 29 | 14 | 111 | 86 (84 to 88) | +3, P=0.05 | 79 (77 to 81) | +2, P=0.06 |
| | Comparator: average single reader | 83 | 32 | 17 | 108 | 83 (81 to 85) | — | 77 (75 to 79) | — |
| Watanabe 2019 | AI support** (cmAssist) | NA | NA | NA | NA | 62 (range 41 to 75) | +11, P=0.03 | 77.2 | −0.9 (NR) |
| | Comparator: average single reader** | NA | NA | NA | NA | 51 (range 25 to 71) | — | 78.1 | — |
AI=artificial intelligence; CEM=challenge ensemble method of eight top performing AIs from DREAM challenge; CI=confidence interval; DREAM=Dialogue on Reverse Engineering Assessment and Methods; FN=false negatives; FP=false positives; NA=not applicable; NR=not reported; TN=true negatives; TP=true positives.
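Where the table reports all four cells of the 2×2 table, the sensitivity and specificity columns can be recomputed as TP/(TP+FN) and TN/(TN+FP). The sketch below is only a worked check: the function name `accuracy_from_counts` is an assumption introduced here, and the counts are copied from the Lotter 2021 and Balta 2020 rows.

```python
def accuracy_from_counts(tp, fp, fn, tn):
    """Return (sensitivity %, specificity %) from the four 2x2 cells."""
    sensitivity = 100.0 * tp / (tp + fn)   # share of cancers that were flagged
    specificity = 100.0 * tn / (tn + fp)   # share of non-cancers that were not flagged
    return sensitivity, specificity

# Counts copied from the accuracy table, in the order (TP, FP, FN, TN).
rows = {
    "Lotter 2021, AI at reader's specificity": (126, 51, 5, 103),
    "Balta 2020, AI score <=5": (109, 9791, 5, 7991),
}

for label, counts in rows.items():
    sens, spec = accuracy_from_counts(*counts)
    print(f"{label}: sensitivity {sens:.2f}%, specificity {spec:.2f}%")
```

Rounded to the precision used in the table, these reproduce 96.2% and 66.9% for Lotter 2021 and 95.61% and 44.94% for Balta 2020.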
Inverse probability weighting: negative cases were upweighted to account for the spectrum enrichment of the study population. Patients associated with negative biopsies were downweighted by 0.64. Patients who were not biopsied were upweighted by 23.61.
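A minimal sketch of how weights of this kind change a specificity estimate. The two weights are the ones quoted in the footnote; the counts and the function name `weighted_specificity` are hypothetical, and the study's own estimator may differ in detail.

```python
# Weights quoted in the footnote.
W_BIOPSY_NEGATIVE = 0.64
W_NOT_BIOPSIED = 23.61

def weighted_specificity(groups):
    """groups: iterable of (weight, true_negatives, false_positives)."""
    weighted_tn = sum(w * tn for w, tn, fp in groups)
    weighted_negatives = sum(w * (tn + fp) for w, tn, fp in groups)
    return weighted_tn / weighted_negatives

# HYPOTHETICAL counts, invented purely for illustration.
groups = [
    (W_BIOPSY_NEGATIVE, 600, 400),   # women with a negative biopsy
    (W_NOT_BIOPSIED, 1800, 200),     # women who were never biopsied
]

unweighted = (600 + 1800) / (1000 + 2000)
weighted = weighted_specificity(groups)
print(f"unweighted specificity {unweighted:.3f}, weighted specificity {weighted:.3f}")
```

Because women who were never biopsied are easier negatives and carry the larger weight, the weighted estimate in this toy example sits closer to their specificity than the unweighted one does.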
Applied an inverse probability weighted bootstrapping (1000 samples) with a 14:1 ratio of healthy women to women receiving a diagnosis of cancer to simulate a study population with a cancer prevalence matching a screening cohort.
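One way such a weighted bootstrap could look is sketched below, using only the Python standard library. The 14:1 ratio and the 1000 replicates come from the footnote; the data, the function name `weighted_bootstrap`, and the choice of a percentile interval are assumptions made for illustration, not the study's documented procedure.

```python
import random
from statistics import mean

def weighted_bootstrap(cancer_flags, healthy_flags, ratio=14, n_boot=1000, seed=0):
    """Resample cases so that healthy:cancer is `ratio`:1 in expectation.

    Flags are 1 if the AI recalled the woman, else 0.
    Returns one (sensitivity, specificity, prevalence) tuple per replicate.
    """
    rng = random.Random(seed)
    cases = [(1, f) for f in cancer_flags] + [(0, f) for f in healthy_flags]
    # Per-case sampling weights: total healthy weight is `ratio` times total cancer weight.
    w_cancer = 1.0 / len(cancer_flags)
    w_healthy = ratio / len(healthy_flags)
    weights = [w_cancer if truth else w_healthy for truth, _ in cases]
    replicates = []
    for _ in range(n_boot):
        sample = rng.choices(cases, weights=weights, k=len(cases))
        cancers = [f for truth, f in sample if truth]
        healthy = [f for truth, f in sample if not truth]
        if not cancers or not healthy:
            continue  # skip degenerate replicates with an empty class
        sens = sum(cancers) / len(cancers)
        spec = 1 - sum(healthy) / len(healthy)
        replicates.append((sens, spec, len(cancers) / len(sample)))
    return replicates

# HYPOTHETICAL enriched test set: 100 cancers (80 recalled), 400 healthy (20 recalled).
cancer_flags = [1] * 80 + [0] * 20
healthy_flags = [1] * 20 + [0] * 380
reps = weighted_bootstrap(cancer_flags, healthy_flags)

specs = sorted(spec for _, spec, _ in reps)
low, high = specs[int(0.025 * len(specs))], specs[int(0.975 * len(specs))]
print(f"mean simulated prevalence {mean(p for _, _, p in reps):.3f}")
print(f"specificity 95% percentile interval {low:.3f} to {high:.3f}")
```

With a 14:1 ratio the simulated prevalence settles near 1/15 (about 6.7%), far below the 20% prevalence of the hypothetical enriched set.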
In addition, the challenge ensemble method prediction was combined with the original radiologist assessment. At the first reader’s sensitivity of 77.1%, CEM+reader 1 resulted in a specificity of 98.5% (95% confidence interval 98.4% to 98.6%), higher than the specificity of the first reader alone of 96.7% (95% confidence interval, 96.6% to 96.8%; P<0.001). At the consensus readers’ sensitivity of 83.9%, CEM+consensus did not significantly improve the consensus interpretations alone (98.1% v 98.5% specificity, respectively). These simulated results of the hypothetical integration of AI with radiologists’ decisions were excluded as they did not incorporate radiologist behaviour when AI is applied.
Applied 11 times upsampling of the 6817 healthy women, resulting in 74 987 healthy women and a total simulated screening population of 75 534.
Specificity estimates not based on exact numbers; the numbers were calculated by reviewers from reported proportions applied to 75 334 women (347 screen detected cancers and 74 987 healthy women).
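The population sizes in these two notes can be checked with simple arithmetic. In the sketch below, the reading that the larger total (75 534) additionally includes the 200 interval cancers listed in the study characteristics is an inference from the numbers, not something the source states; the variable names are illustrative.

```python
HEALTHY_SAMPLED = 6817      # healthy women before upsampling
UPSAMPLING_FACTOR = 11
SCREEN_DETECTED = 347       # screen detected cancers in the current round
INTERVAL_CANCERS = 200      # from the study characteristics table (assumed to explain 75 534)

healthy_simulated = HEALTHY_SAMPLED * UPSAMPLING_FACTOR
print(healthy_simulated)                                        # 74987
print(SCREEN_DETECTED + healthy_simulated)                      # 75334
print(SCREEN_DETECTED + INTERVAL_CANCERS + healthy_simulated)   # 75534

# Dembrower 2020 pre-screen rows of the accuracy table, as (TP, FP, FN, TN).
pre_screen = {
    "AI score <=0.0293": (347, 29_787, 0, 45_200),
    "AI score <=0.0870": (338, 14_729, 9, 60_258),
}
specificities = {}
for threshold, (tp, fp, fn, tn) in pre_screen.items():
    # The negatives in each row should add up to the upsampled healthy population.
    assert fp + tn == healthy_simulated
    assert tp + fn == SCREEN_DETECTED
    specificities[threshold] = round(100 * tn / (fp + tn), 2)
    print(threshold, specificities[threshold])
```

Both rows are internally consistent with 74 987 healthy women and 347 screen detected cancers.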
In enriched test set multiple reader multiple case laboratory studies where multiple readers assess the same images, there are considerable problems in summing 2×2 test data across readers.
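One such problem can be sketched numerically: reads of the same image by different readers are not independent, so treating every read as a separate observation overstates the sample size and narrows the confidence interval. The example below uses a simple Wald interval and takes 14 readers, 100 cancer cases, and 86% sensitivity from the Rodriguez-Ruiz 2019 reader aid row; the independence assumption is what is being illustrated, and nothing here describes how that study derived its intervals.

```python
import math

def wald_interval(p, n, z=1.96):
    """Approximate 95% interval for a proportion p observed on n independent units."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

sensitivity = 0.86
cancers = 100    # distinct cancer cases
readers = 14     # each reader sees the same cases

per_case = wald_interval(sensitivity, cancers)            # n = number of women
pooled = wald_interval(sensitivity, cancers * readers)    # n = number of reads, wrongly treated as independent

print(f"n = cases: {100 * per_case[0]:.1f} to {100 * per_case[1]:.1f}")
print(f"n = reads: {100 * pooled[0]:.1f} to {100 * pooled[1]:.1f}")
```

The pooled interval is narrower by a factor of √14, even though no additional women were studied.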
Fig 3. Study estimates of sensitivity and false positive rate (1−specificity) in receiver operating characteristic space by index test (artificial intelligence) and comparator (radiologist) for eight included studies. Comparators are defined as consensus of two readers and arbitration (radiologist consensus), or single reader decision/average of multiple readers (radiologist single/average). Vertical dashed lines represent specificity for screening programmes for Denmark (2% false positive rate),61 UK (3% false positive rate),62 63 and US (11% false positive rate).64 Retrospective test accuracy studies: Salim et al,35 Schaffter et al,36 and McKinney et al.29 Enriched test set multiple reader multiple case laboratory studies: Pacilè et al,30 Watanabe et al,37 Rodriguez-Ruiz et al33 (Rodriguez-Ruiz 2019a in figure), Lotter 2021,28 and Rodriguez-Ruiz et al32 (Rodriguez-Ruiz 2019b in figure)
Fig 4. Study estimates of sensitivity and false positive rate (1−specificity) in receiver operating characteristic space for studies of artificial intelligence (AI) as a pre-screen (A) or post-screen (B). Pre-screen requires very high sensitivity, but can have modest specificity; post-screen requires very high specificity, but can have modest sensitivity. Reference standard for test negatives was double reading not follow-up. (A) Dembrower 2020a: retrospective study using AI (Lunit version 5.5.0.16) for pre-screen (point estimates not based on exact numbers). Reference standard includes only screen detected cancers. No data reported for radiologists.26 Balta 2020 (Transpara version 1.6.0),25 Raya-Povedano 2021 (Transpara version 1.6.0),31 and Lång 2020 (Transpara version 1.4.0)27: retrospective studies using AI as pre-screen. Reference standard includes only screen detected cancers. (B) Dembrower 2020b: retrospective study using AI (Lunit version 5.5.0.16) for post-screen detection of interval cancers;26 Dembrower 2020c: retrospective study using AI (Lunit version 5.5.0.16) for post-screen detection of interval cancers and next round screen detected cancers.26 Thresholds highlighted represent thresholds specified in studies. Radiologist double reading for this cohort would be 100% specificity and 0% sensitivity as this was only in a cohort of women with screen (true and false) negative mammograms