Literature DB >> 36207474

Systematic analysis of the test design and performance of AI/ML-based medical devices approved for triage/detection/diagnosis in the USA and Japan.

Mitsuru Yuba1, Kiyotaka Iwasaki2,3,4,5.   

Abstract

The development of computer-aided detection (CAD) using artificial intelligence (AI) and machine learning (ML) is rapidly evolving. Submission of AI/ML-based CAD devices for regulatory approval requires information about clinical trial design and performance criteria, but the requirements vary between countries. This study compares the requirements for AI/ML-based CAD devices approved by the US Food and Drug Administration (FDA) and the Pharmaceuticals and Medical Devices Agency (PMDA) in Japan. A list of 45 FDA-approved and 12 PMDA-approved AI/ML-based CAD devices was compiled. In the USA, devices classified as computer-aided simple triage were approved based on standalone software testing, whereas devices classified as computer-aided detection/diagnosis were approved based on reader study testing. In Japan, however, there was no clear distinction between evaluation methods according to the category. In the USA, a prospective randomized controlled trial was conducted for AI/ML-based CAD devices used for the detection of colorectal polyps, whereas in Japan, such devices were approved based on standalone software testing. This study indicated that the different viewpoints of AI/ML-based CAD in the two countries influenced the selection of different evaluation methods. This study's findings may be useful for defining a unified global development and approval standard for AI/ML-based CAD.
© 2022. The Author(s).


Year:  2022        PMID: 36207474      PMCID: PMC9542463          DOI: 10.1038/s41598-022-21426-7

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.996


Introduction

Software development for medical devices utilizing artificial intelligence (AI) and machine learning (ML) has been evolving rapidly. The number of AI/ML-based software medical devices approved by the US Food and Drug Administration (FDA) has been increasing every year[1,2]. An annual survey by the American College of Radiology showed that more than 30% of radiologists used AI to improve diagnostic interpretation accuracy[3]. The use of AI/ML for computer-aided detection (CADe), computer-aided diagnosis (CADx), and computer-aided simple triage (CAST)[4] has allowed practitioners to use these computer-aided methods to their full potential. However, owing to the rapid progress of AI/ML-based CAD, there is a demand to properly evaluate the efficacy and safety of these methods before approval is acquired[5-7]. Evaluation methods for AI/ML-based CAD devices can be classified into standalone software testing and reader study testing. Standalone software testing is a performance test of the AI algorithm alone, using test data collected retrospectively. Its advantage is savings in cost and time, because no readers need to be recruited for the performance evaluation. However, because it does not reflect performance in clinical practice, it cannot evaluate usability or the effect of AI assistance on clinicians. Reader study testing is a performance test that evaluates how the interaction between AI and physicians affects diagnostic or detection accuracy. Readers must be recruited, which makes it more costly and time-consuming than standalone software testing. Reader study testing can be performed not only prospectively but also retrospectively, using previously collected images. A recent prospective randomized controlled trial (RCT) evaluating the performance of AI/ML-based CAD for cataract detection failed to demonstrate diagnostic accuracy comparable to that of the pilot study. 
This finding indicated the need to evaluate the influence of physician intervention in clinical practice[8]. In a recent reader study[9,10], 45 clinicians from 9 clinical institutions evaluated a product intended to detect breast cancer; comparing results with and without AI assistance, the study reported that AI assistance improved clinicians' accuracy. Hence, evaluations differ depending on the test design and domain. It would therefore be beneficial for developers of AI/ML-based CAD to know whether diagnostic accuracy can be evaluated by standalone software testing or should be evaluated by reader study testing, which includes the influence of physicians on diagnostic accuracy. In this study, we investigated the AI/ML-based CAD devices approved by the FDA and the Pharmaceuticals and Medical Devices Agency (PMDA) in Japan and analyzed their requirements in terms of target and study design to provide insights into the global development of AI/ML-based CAD.
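To make the two evaluation paradigms concrete, standalone software testing essentially scores the algorithm's outputs against a retrospectively assembled ground truth using the indexes reported throughout this study (sensitivity, specificity, and AUC). The following is an illustrative sketch only; the labels and scores are hypothetical and are not taken from any approved device.

```python
# Illustrative standalone software testing metrics (hypothetical data only).

def sensitivity_specificity(labels, preds):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]                 # hypothetical ground truth
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]   # hypothetical AI outputs
preds = [1 if s >= 0.5 else 0 for s in scores]  # binarize at a 0.5 threshold
sens, spec = sensitivity_specificity(labels, preds)
```

The rank-based AUC used here is equivalent to the area under the empirical ROC curve, which is why AUC can be reported without fixing a single operating threshold.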

Results

AI/ML-based medical devices in the USA

We identified 45 FDA-approved AI/ML-based medical devices using the FDA Product Code Classification Database[11] (Fig. 1). The 45 devices covered a variety of targets, categorized as follows: 16 (35.6%) for triage of intracranial hemorrhage or large vessel occlusion, 11 (24.4%) for breast cancer detection/diagnosis, 9 (20.0%) for triage of pulmonary embolism, pneumothorax, pleural effusion, or intra-abdominal free gas, 5 (11.1%) for diagnosis of wrist fracture, cervical spine bone fracture, vertebral compression fracture, or rib fracture, 3 (6.7%) for diabetic retinopathy diagnosis, and 1 (2.2%) for colorectal polyp detection. In terms of study design, 35 of the 45 devices (77.8%) were approved based on standalone software testing (Table 1), and the other 10 (22.2%) were approved based on reader study testing (Table 2). In terms of sources of clinical data, three studies (6.7%) were conducted prospectively, while 42 (93.3%) used previously collected clinical data to evaluate efficacy (Fig. 2).
Figure 1

Flowchart for extraction of AI/ML-based CAD devices approved in the USA and Japan.

Table 1

Characteristics of the 35 FDA approved AI/ML-based CAD evaluated by standalone software testing.

Device | Intended use/summary | CAD | Imaging modality | Test case | Sensitivity/specificity | AUC | Approval pathway | Approval date | Manufacturer

Chest and abdomen imaging
BriefCase for Pulmonary Embolism triage | Triage and notification of pulmonary embolism | CAST | CT | 184 | 90.6/89.9 | – | 510(k) | April 2019 | Aidoc Medical Ltd
HealthPNX | Triage and notification of pneumothorax | CAST | X-ray | 588 | 93.1/92.9 | 0.98 | 510(k) | May 2019 | Zebra Medical Vision Ltd
Critical Care Suite | Triage and notification of pneumothorax | CAST | X-ray | 804 | 84.3/93.5 | 0.96 | 510(k) | August 2019 | GE Medical Systems LLC
HealthCXR | Triage and notification of pleural effusion | CAST | X-ray | 554 | 96.7/93.1 | 0.98 | 510(k) | November 2019 | Zebra Medical Vision Ltd
Red Dot | Triage and notification of pneumothorax | CAST | X-ray | 888 | 94.6/87.9 | 0.97 | 510(k) | February 2020 | Behold.AI Technologies Limited
AIMI-Triage CXR PTX | Triage and notification of pneumothorax | CAST | X-ray | 300 | 92/90 | 0.96 | 510(k) | April 2020 | RADLogics Inc
BriefCase for Intra-abdominal Free Gas triage | Triage and notification of intra-abdominal free gas | CAST | CT | 184 | 91/88.9 | – | 510(k) | June 2020 | Aidoc Medical Ltd
BriefCase for incidental Pulmonary Embolism triage | Triage and notification of incidental pulmonary embolism | CAST | CT | 268 | 90.5/88.7 | – | 510(k) | August 2020 | Aidoc Medical Ltd
CINA CHEST | Triage and notification of pulmonary embolism (PE) and aortic dissection (AD) | CAST | Chest CTA/thoraco-abdominal CTA | PE: 396; AD: 298 | PE: 91.1/91.8; AD: 96.4/97.5 | – | 510(k) | May 2021 | Avicenna.AI

Head imaging
ContaCT | Triage and notification of large vessel occlusion | CAST | CTA | 300 | 87.8/89.6 | 0.91 | de novo | February 2018 | Viz.AI Inc
BriefCase for Intracranial Hemorrhage triage | Triage and notification of intracranial hemorrhage | CAST | CT | 198 | 93.6/92.3 | – | 510(k) | August 2018 | Aidoc Medical Ltd
Accipiolx | Triage and notification of intracranial hemorrhage | CAST | CT | 360 | 92/86 | – | 510(k) | October 2018 | MaxQ-AI Ltd
HealthICH | Triage and notification of intracranial hemorrhage | CAST | CT | 427 | 94.4/92.5 | – | 510(k) | June 2019 | Zebra Medical Vision Ltd
DeepCT | Triage and notification of intracranial hemorrhage | CAST | CT | 260 | 93.8/92.3 | – | 510(k) | July 2019 | Deep01 Limited
BriefCase for Large Vessel Occlusion triage | Triage and notification of large vessel occlusion | CAST | CTA | 383 | 88.8/87.2 | – | 510(k) | December 2019 | Aidoc Medical Ltd
Viz ICH | Triage and notification of intracranial hemorrhage | CAST | CT | 261 | 93/90 | 0.96 | 510(k) | March 2020 | Viz ai Inc
Rapid ICH | Triage and notification of intracranial hemorrhage | CAST | CT | 336 | 89.9/94.3 | – | 510(k) | March 2020 | iSchemaView Inc
CuraRad-ICH | Triage and notification of intracranial hemorrhage | CAST | CT | 388 | 90.6/93.1 | – | 510(k) | April 2020 | CuraCloud Corp
NineAI | Triage and notification of intracranial hemorrhage (ICH) and mass effect (ME) | CAST | CT | – | ICH: 89.9/97.4; ME: 96.4/91.1 | – | 510(k) | April 2020 | Nines Inc
qER | Triage and notification of intracranial hemorrhage (ICH), mass effect (ME), midline shift (MS), and cranial fracture (CF) | CAST | CT | Total: 1320 (ICH: 629; ME: 471; MS: 414; CF: 248) | 98.5/91.2; 96.9/93.9; 96.3/96; 97.3/95.3; 96.7/92.7 | 0.98; 0.99; 0.99; 0.97 | 510(k) | June 2020 | Qure.ai Technologies
CINA | Triage and notification of intracranial hemorrhage (ICH) and large vessel occlusion (LVO) | CAST | CT/CTA | ICH: 814; LVO: 476 | ICH: 91.4/97.5; LVO: 97.9/97.6 | ICH: 0.94; LVO: 0.98 | 510(k) | June 2020 | Avicenna.AI
Rapid LVO | Triage and notification of large vessel occlusion | CAST | CTA | – | 97/95.6 | 0.99 | 510(k) | July 2020 | iSchemaView Inc
Accipiolx | Improve performance by changing algorithm | CAST | CT | 360 | 97/93 | N/A | 510(k) | August 2020 | MaxQ AI Ltd
HALO | Triage and notification of large vessel occlusion | CAST | CTA | 364 | 91.1/87 | 0.97 | 510(k) | November 2020 | NiCo-Lab BV
Viz ICH | Addition of GE's non-contrast CT as supported systems | CAST | CT | 387 | 95/96 | 0.97 | 510(k) | March 2021 | Viz ai Inc

Fracture imaging
BriefCase for C-Spine fracture triage | Triage and notification of cervical spine bone fracture | CAST | CT | 186 | 91.7/88.6 | – | 510(k) | May 2019 | Aidoc Medical Ltd
HealthVCF | Triage and notification of vertebral compression fractures | CAST | CT | 611 | 90.2/86.9 | 0.95 | 510(k) | May 2020 | Zebra Medical Vision Ltd
uAI Easy Triage-Rib | Triage and notification of multiple (3 or more) acute rib fractures | CAST | CT | 200 | 92.7/84.7 | 0.93 | 510(k) | January 2021 | Shanghai United Imaging Intelligence Co., Ltd

Breast imaging
cmTriage | Triage and notification of breast cancer | CAST | Mammogram | 1255 | 86.9/88.5 | 0.95 | 510(k) | March 2019 | CureMetrix Inc
ProFound AI Software V2.1 | Application to add Siemens modalities as supported systems | CADe/CADx | Digital breast tomosynthesis | 694 | – | – | 510(k) | October 2019 | iCAD Inc
Transpara V1.5 | Addition of Fujifilm's mammogram as supported systems | CADe/CADx | Mammogram | – | – | – | 510(k) | December 2019 | ScreenPoint Medical BV
HealthMammo | Triage and notification of breast cancer | CAST | Mammogram | 835 | 89.9/90.7 | 0.96 | 510(k) | July 2020 | Zebra Medical Vision Ltd
Saige-Q | Triage and notification of breast cancer | CAST | Mammogram/Digital breast tomosynthesis | Mammogram: 1333; DBT: 1528 | 92.2/91.2; 98.3/95.7 | 0.96; 0.98 | 510(k) | April 2021 | DeepHealth, Inc
Transpara V1.7 | Addition of Fujifilm's digital breast tomosynthesis as supported systems | CADe/CADx | Mammogram/Digital breast tomosynthesis | – | – | – | 510(k) | June 2021 | ScreenPoint Medical B.V

Ophthalmology imaging
IDx-DR | Addition of training mode, and change of the user interface | CADe/CADx | Fundus camera | – | – | – | 510(k) | June 2021 | Digital Diagnostics Inc
Table 2

Characteristics of the 10 FDA approved AI/ML-based CAD evaluated by reader study testing.

Device | Intended use/summary | CAD | Imaging modality | Study design | Test case | Readers | Sensitivity/specificity | AUC | Approval pathway | Approval date | Manufacturer

Endoscope imaging
GI Genius | Detection of colonic mucosal lesions | CADe | Endoscopy | Prospective, RCT | 685 | 6 | ADR: 54.8 (40.4) | – | de novo | April 2021 | Cosmo Artificial Intelligence-AI, LTD

Ophthalmology imaging
IDx-DR | Detection and diagnosis of more than mild diabetic retinopathy | CADe/CADx | Fundus camera | Prospective | 900 | – | 87/90 | – | de novo | April 2018 | IDx LLC
EyeArt | Detection and diagnosis of more than mild diabetic retinopathy and vision-threatening diabetic retinopathy | CADe/CADx | Fundus camera | Prospective | 942 | – | 95.5/86.5 | – | 510(k) | August 2020 | Eyenuk Inc

Fracture imaging
OsteoDetect | Detection and diagnosis of distal radius fractures of adult wrists | CADe | X-ray | Retrospective, MRMC, fully-crossed | 200 | 24 | 80/91 (74/88) | 0.88 (0.84) | de novo | May 2018 | Imagen Technologies Inc
FractureDetect | Detection and diagnosis of 12 fractures (ankle, clavicle, elbow, femur, forearm, hip, humerus, knee, pelvis, shoulder, tibia/fibula, wrist) | CADe | X-ray | Retrospective, MRMC, fully-crossed | 175 | 24 | 90/91.8 (82/89) | 0.95 (0.91) | 510(k) | July 2020 | Imagen Technologies Inc

Breast imaging
Transpara V1.3 | Detection and diagnosis of breast cancer | CADe/CADx | Mammogram | Retrospective, MRMC, fully-crossed | 240 | 14 | – | 0.88 (0.86) | 510(k) | November 2018 | ScreenPoint Medical BV
ProFound AI Software V2.0 | Detection and diagnosis of breast cancer | CADe/CADx | Digital breast tomosynthesis | Retrospective, MRMC, fully-crossed | 260 | 24 | 85/69 (77/62) | 0.85 (0.79) | 510(k) | December 2018 | iCAD Inc
Transpara V1.6 | Addition of digital breast tomosynthesis as supported systems | CADe/CADx | Mammogram/Digital breast tomosynthesis | Retrospective, MRMC, fully-crossed | 240 | 18 | – | 0.86 (0.83) | 510(k) | March 2020 | ScreenPoint Medical BV
MammoScreen | Detection and diagnosis of breast cancer | CADe/CADx | Mammogram | Retrospective, MRMC | 240 | 14 | – | 0.8 (0.77) | 510(k) | March 2020 | Therapixel
Genius AI Detection | Detection and diagnosis of breast cancer | CADe/CADx | Digital breast tomosynthesis | Retrospective, MRMC | 390 | 17 | 75.9/25.8 (66.8/23.4) | 0.82 (0.79) | 510(k) | November 2020 | Hologic Inc

Sensitivity/specificity and AUC are shown as with AI (without AI).

Figure 2

Number of approved AI/ML-based CAD devices in the USA and Japan.


AI/ML-based medical devices in Japan

We identified 12 PMDA-approved AI/ML-based medical devices using the database of the Japan Association for the Advancement of Medical Equipment Search (JAAME Search)[12] (Fig. 1). The targets of the devices were as follows: 6 (50%) for colorectal lesion detection, 3 (25%) for detection of COVID-19 infection, 2 (16.7%) for detection of pulmonary nodules, and 1 (8.3%) for cerebral aneurysm detection. In terms of study design, 9 of the 12 devices (75%) were approved based on standalone software testing (Table 3), and three (25%) were approved based on reader study testing (Table 4). No prospective studies were conducted to acquire market approval (Fig. 2).
Table 3

Characteristics of the 9 PMDA approved AI/ML-based CAD evaluated by standalone software testing.

Device | Intended use/summary | CAD | Imaging modality | Test case | Sensitivity/specificity | AUC | Accuracy | Approval date | Manufacturer

Endoscope imaging
EndoBRAIN | Diagnosis of neoplastic or non-neoplastic lesions | CADx | Endocytoscopy | 100 | 96.9/– | – | 98 | December 2018 | CYBERNET
EndoBRAIN-UC | Diagnosis of inflammation for ulcerative colitis | CADx | Endocytoscopy | 1000 | 95.1/90.7 | – | 91.9 | April 2020 | CYBERNET
EndoBRAIN-EYE | Detection of colonic mucosal lesions | CADe | Endoscopy | 300 | 96.3/93.7 | – | – | January 2020 | CYBERNET
EndoBRAIN-Plus | Diagnosis of non-neoplastic, adenoma, or invasive cancer | CADx | Endocytoscopy | 50 | 91.8/97.3 | – | – | July 2020 | CYBERNET
EW10-EC02 | Detection and diagnosis of colonic mucosal lesions | CADe/CADx | Endoscopy | WLI: 912; LCI: 943; BLI: 296; WLI: 308 | 94.5; 96; 94.9; 93.2 | – | – | September 2020 | FUJIFILM
WISE VISION | Detection of colonic mucosal lesions | CADe | Endoscopy | 350 | 83/89 | – | – | November 2020 | NEC

COVID-19 detection
InferRead CT Pneumonia | Detection of pneumonia caused by COVID-19 | CADe | CT | 190 | 77.1/90.7 | – | – | June 2020 | CES decartes
Ali-M3 | Detection of pneumonia caused by COVID-19 | CADe | CT | 704 | 89.6/37.1 | – | – | June 2020 | MIC Medical Corp
FS-AI693 | Detection of pneumonia caused by COVID-19 | CADe | CT | 217 | 87.5/37.1 | 0.74 | – | May 2021 | FUJIFILM
Table 4

Characteristics of the 3 PMDA approved AI/ML-based CAD evaluated by Reader Study Testing.

Device | Intended use/summary | CAD | Imaging modality | Test case | Readers | Sensitivity/specificity | AUC | Accuracy | Approval date | Manufacturer

Chest and abdomen imaging
FS-AI688 | Detection of lung nodules | CADe | CT | 36 | 10 | 61.4 (49) | – | – | May 2020 | FUJIFILM
EIRL X-Ray Lung nodule | Detection of lung nodules | CADe | CT | 320 | 18 | 56.9/96.7 (45.4/96.3) | 0.76 (0.70) | 88.4 (85.6) | August 2020 | LPIXEL

Head imaging
EIRL aneurysm | Detection of unruptured cerebral aneurysms | CADe | MRA | 200 | 20 | 77.2/72.1 (68.2/79.4) | 0.75 (0.71) | – | September 2019 | LPIXEL

Sensitivity/specificity and AUC are shown as with AI (without AI).


Endoscope imaging

As a device targeting colorectal lesions, GI Genius (Medtronic) was approved by the FDA based on the data of a prospective RCT. A total of 685 patients were enrolled and divided into two groups. The adenoma detection rate (ADR) was compared between participants diagnosed using traditional endoscopy methods and those diagnosed using CADe. The efficacy of CAD was demonstrated by the fact that the detection rate with CADe exceeded that of traditional endoscopy without CADe (54.8% vs. 40.4%)[13]. In Japan, six devices[14-19] were approved for colorectal lesions, and all were evaluated using retrospective data and standalone software testing. Of the six devices, three analyzed images captured with an endocytoscope (one for differentiating the degree of inflammation in ulcerative colitis, and two for differentiating the degree of tumor malignancy), and three analyzed images captured with an endoscope. The average number of images used for the evaluation of the endocytoscope devices was 383.3 (minimum 50, maximum 1000). The average sensitivity, specificity, and accuracy rates were 94.6% (minimum 91.8, maximum 96.9), 94.1% (minimum 91, maximum 97.3), and 95% (minimum 92, maximum 98), respectively. AUC was not reported. For the devices designed for endoscopic detection of polyps, video data were used instead of still images for performance evaluation. The efficacy of EndoBRAIN-EYE (CYBERNET)[16] was evaluated using 12 h of video including 300 lesions. The efficacy of WISE VISION (NEC)[18] was evaluated using videos including 350 lesions, with the number of continuous frames in which the lesions were identified serving as the performance index. EW10-EC02 (FUJIFILM)[19] has both CADe and CADx functions. The CADe performance was evaluated based on the successful continuous detection of polyps, and the CADx performance was evaluated based on the correct identification of a lesion as tumorous or non-tumorous[20].
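The ADR endpoint used in the trial above is simply the fraction of screened patients in whom at least one adenoma is detected. A minimal sketch follows; the per-arm patient counts are hypothetical, chosen only to yield rates close to the reported 54.8% and 40.4%, and are not the actual trial counts.

```python
# Illustrative ADR comparison for a two-arm colonoscopy RCT.
# Counts below are hypothetical; only the target rates (54.8% vs 40.4%)
# come from the reported GI Genius trial results.

def adenoma_detection_rate(patients_with_adenoma, patients_examined):
    """ADR = fraction of examined patients with at least one adenoma found."""
    return patients_with_adenoma / patients_examined

cade_arm = adenoma_detection_rate(188, 343)     # hypothetical CADe arm
control_arm = adenoma_detection_rate(138, 342)  # hypothetical control arm
absolute_difference = cade_arm - control_arm    # effect size on ADR
```

Because ADR is a per-patient rate rather than a per-lesion sensitivity, it directly captures whether AI assistance helps the endoscopist find adenomas in more patients, which is why it serves as the primary endpoint in this setting.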

Chest and abdominal imaging

In the USA, nine medical devices[21-29] aimed at triaging pulmonary embolism, pneumothorax, pleural effusion, or intra-abdominal free gas, all categorized as CAST, were approved through standalone software testing. An average of 496 images (minimum 184, maximum 888) were used for performance evaluation. The sensitivity and specificity reported by all nine studies averaged 92.0% (minimum 84.3, maximum 96.4) and 91.4% (minimum 87.9, maximum 97.5), respectively. AUC was reported by only 5 studies, with an average of 0.97 (minimum 0.96, maximum 0.98). In Japan, two devices[30,31] for lung nodule detection were categorized as CADe and approved based on reader study testing. An average of 178 images (minimum 36, maximum 320) were used for testing. The reported averages of sensitivity and specificity were 59.1% (minimum 56.9, maximum 61.4) and 63.9% (minimum 37.1, maximum 90.7), respectively. AUC was not reported by either study.
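The average/minimum/maximum figures quoted in this and the following sections are plain descriptive statistics over the per-device values in Tables 1-4. As a worked check, the test-case counts of the nine US chest/abdomen CAST devices can be aggregated as follows (the CINA CHEST count is taken as the PE + AD total, 396 + 298 = 694):

```python
# Test-case counts for the nine US chest/abdomen CAST devices (Table 1).
test_cases = [184, 588, 804, 554, 888, 300, 184, 268, 694]

average = sum(test_cases) / len(test_cases)   # reported as 496
lowest, highest = min(test_cases), max(test_cases)  # reported as 184 and 888
```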

Head imaging

In the USA, 16 devices[32-47] designed for the triage of intracranial hemorrhage and/or large vessel occlusion were labeled as CAST devices and approved after analysis of performance evaluation results. An average of 414.6 images (minimum 198, maximum 1320) were used for evaluation. The average sensitivity and specificity reported by all studies were 93.6% (minimum 87.7, maximum 98.5) and 92.8% (minimum 86, maximum 97.6), respectively. The average AUC (reported by only 7 studies) was 0.97 (minimum 0.91, maximum 0.99). Accipiolx (MaxQ AI Ltd.), a medical device targeting the detection of intracranial hemorrhage, was first approved by the FDA in 2018[34]. However, after the original ML-based algorithm was changed to a convolutional neural network (CNN)-based algorithm in 2020[45], an application for re-approval became necessary. With this change, the sensitivity increased from 92 to 97%, and the specificity increased from 86 to 93%. Both tests were performed using 360 images. Similarly, Viz ICH (Viz ai Inc.)[38,47], another device for intracranial hemorrhage detection, was granted FDA clearance after the development of an add-on allowing automatic AI detection on non-contrast CT scans acquired on scanners manufactured by General Electric (GE). Device sensitivity increased from 93 to 95%, specificity increased from 90 to 96%, and AUC increased from 0.96 to 0.97. In Japan, a CADe device that analyzes head magnetic resonance angiography images to detect unruptured cerebral aneurysms[48] was approved based on reader study testing. A total of 200 images were used for the testing. The reported sensitivity and specificity were 77.2% and 72.1%, respectively. AUC was not reported.

Ophthalmology imaging

In the USA, two devices[49,50] for diabetic retinopathy diagnosis were approved using data from prospective studies. The average number of images used for performance evaluation was 921 (minimum 900, maximum 942). Sensitivity and specificity were 91.2% (minimum 87, maximum 95.5) and 88.6% (minimum 86.5, maximum 90), respectively. AUC was not reported. Notably, the percentage of images that could be correctly evaluated by the AI was calculated as an imageability factor, with a reported average of 97.3% (minimum 96, maximum 98.6). Both devices are used in primary care facilities in the USA and were developed to help caregivers decide whether to encourage patients to see specialists based on the results of the AI analysis. Regarding IDx-DR (Digital Diagnostics Inc.), a second application for the addition of a training mode and alterations to the user interface was approved[51]. However, no additional performance evaluation was conducted at the time of the second application. AI/ML-based CAD for the diagnosis of diabetic retinopathy has not yet been approved in Japan.

Fracture imaging

Of the five devices[52-56] aimed at fracture detection that received FDA approval, three were evaluated by standalone software testing[54-56]. These three devices were categorized as CAST for cervical spine fracture, vertebral compression fracture, and rib fracture, and CT images were used for analyses. The other two devices were developed for wrist fracture detection[52] and 12 types of fracture detection on X-ray images[53]. These two devices were approved based on reader study testing and an improvement in diagnostic accuracy using X-ray images was demonstrated with the assistance of the software. Standalone software testing was conducted using an average of 332.3 images (minimum 186, maximum 611). The average sensitivity, specificity, and AUC were 91.5% (minimum 90.2, maximum 92.7), 86.7% (minimum 84.7, maximum 88.6), and 0.94 (minimum 0.93, maximum 0.95), respectively. AUC was only reported for two of the three devices (not reported for the cervical spine fracture triage device). Reader study testing was performed using an average of 187.5 images (minimum 175, maximum 200). Average sensitivity, specificity, and AUC were 85% (minimum 80, maximum 90), 91.4% (minimum 91, maximum 91.8), and 0.91 (minimum 0.88, maximum 0.95), respectively. AI/ML devices aimed at detecting fractures are yet to be approved in Japan.

Breast imaging

Among the 11 devices[57-67] for detection and diagnosis of breast cancer approved in the USA, six were evaluated based on standalone software testing[57,58,60,62-64]. Five devices categorized as CADe/CADx, designed to detect suspected breast cancer sites and malignancy levels, were approved based on reader study testing[59,61,65-67]. Among the devices evaluated through standalone software testing, ProFound AI Software V2.1 (iCAD Inc.)[58], Transpara V1.5 (ScreenPoint Medical BV)[60], and Transpara V1.7 (ScreenPoint Medical BV)[62] were classified as CADe/CADx devices. However, all three were upgraded versions of devices that had been approved based on reader study testing, with mammography or digital breast tomosynthesis systems added as supported inputs. For Transpara V1.6 (ScreenPoint Medical BV)[61], a second reader study test was conducted at the time of the upgrade from the previous version because digital breast tomosynthesis was added as a usable data input. For standalone software testing, an average of 1411 images were used (minimum 694, maximum 1528), with an average sensitivity of 91.8% (minimum 86.9, maximum 98.3), specificity of 91.5% (minimum 88.5, maximum 95.7), and AUC of 0.96 (minimum 0.95, maximum 0.98). For reader study testing, an average of 274 images were used (minimum 240, maximum 390), with an average sensitivity of 80.4% (minimum 75.9, maximum 85), specificity of 47.4% (minimum 25.8, maximum 69), and AUC of 0.84 (minimum 0.8, maximum 0.88). Currently, no AI/ML medical devices for breast cancer detection and diagnosis have been approved in Japan.

SARS-CoV-2 (COVID-19) detection

In Japan, three medical devices[68-70] aimed at detecting COVID-19 infection have been approved based on standalone software testing. In response to the rapid spread of COVID-19, all three were fast-tracked for evaluation and approved within two months of application. The average number of images used to evaluate the performance of these devices was 370.3 (minimum 190, maximum 704), with an average sensitivity and specificity of 84.9% (minimum 77.7, maximum 89.6) and 54.7% (minimum 37.1, maximum 90.7), respectively. AUC was not reported in any of the studies.

Discussion

In this study, we extracted AI/ML-based CAD devices approved in the USA and Japan and thoroughly assessed the performance evaluation methods. The main findings are as follows: (1) In the USA, devices classified as CAST were approved based on standalone software testing, and all devices classified as CADe/CADx were approved based on reader study testing; in Japan, however, there is no such clear classification. (2) AI/ML-based CAD in the field of endoscopy for the detection of colorectal polyps was approved based on the data of a prospective RCT in the USA, whereas in Japan it was approved based on evaluation of the software alone. This difference was influenced by the fact that the use of colonoscopy in the healthcare systems of the two countries is quite different, as discussed in the "Necessity of prospective testing" section. (3) A wider variety of devices is available in the USA than in Japan. To the best of our knowledge, this is the first comprehensive systematic comparative analysis of evaluation methods for AI/ML-based CAD devices approved in the USA and Japan.

Different methodological approaches to standalone software testing and reader study testing

There are two major testing methods for evaluating AI/ML-based CAD devices: standalone software testing and reader study testing. The 31 devices approved as CAST in the USA were all evaluated by software alone. On the other hand, all the devices classified as CADe/CADx were subjected to reader study testing, except for post-market improvements. The CAST concept is said to have been introduced by Goldenberg et al. in 2011[4,71]. In the USA, devices classified as having CAST functions are intended for use in urgent situations, such as intracerebral hemorrhage or cerebrovascular obstruction, where the devices assist non-specialists in promptly determining the best course of action. As the devices may contribute to the clinical decision-making process, software test results are required to demonstrate sensitivity and specificity of 90% or higher. The mean values of sensitivity, specificity, and AUC for all approved CAST devices in the USA were high: 92.9% (minimum 84.3, maximum 98.3), 91.7% (minimum 83.5, maximum 97.6), and 0.96 (minimum 0.91, maximum 0.99), respectively. The FDA has issued guidance on the standalone evaluation of software and recommends AUC, sensitivity, and specificity as evaluation indexes[72]. Probably owing to this guidance, these indexes were evaluated for many devices, and they would contribute to a reasonable evaluation of the performance of newly developed devices. Furthermore, in the USA, all retrospective reader study tests used the multiple reader multiple case (MRMC) method. The reported average number of doctors who participated in the tests was 19.2 (minimum 14, maximum 24). Of the devices approved based on reader study testing, five studies used a fully-crossed design, following the FDA recommendation, owing to its greater statistical power. This design is recommended by the FDA when the output results are displayed concurrently (2020 revision)[73]. 
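In a fully-crossed MRMC design, every reader interprets every case under both reading conditions (without and with AI assistance), so each (reader, case) pair contributes a paired observation. The following is a schematic sketch of that layout with hypothetical scores; it illustrates the data structure only, not the variance-component analysis an actual MRMC study would perform.

```python
# Schematic fully-crossed MRMC layout (hypothetical readers, cases, scores).
readers = ["R1", "R2", "R3"]
cases = ["C1", "C2", "C3", "C4"]

# scores[modality][reader][case] -> confidence score; fully crossed means
# every reader scores every case in both modalities.
scores = {
    "without_ai": {r: {c: 0.5 for c in cases} for r in readers},
    "with_ai":    {r: {c: 0.7 for c in cases} for r in readers},
}

# Total readings: 2 modalities x 3 readers x 4 cases = 24.
n_readings = sum(len(scores[m][r]) for m in scores for r in scores[m])

# Reader-averaged score difference, a simple stand-in for the
# reader-averaged AUC difference an MRMC analysis would actually test.
mean_delta = sum(
    scores["with_ai"][r][c] - scores["without_ai"][r][c]
    for r in readers for c in cases
) / (len(readers) * len(cases))
```

Because every reader-case pairing is observed in both arms, the design removes between-reader and between-case variability from the comparison, which is the source of the greater statistical power noted above.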
Despite the extensive testing procedures performed before approval, there were instances where the clinical performance of approved devices did not measure up to expectations[74-76]. This was the case with BriefCase for C-Spine fracture triage (Aidoc Medical Ltd.) and HealthVCF (Zebra Medical Vision Ltd.). Such cases underline the necessity of analyzing the generalizability of current evaluation methods for increasingly diverse devices. There are currently no approved CAST devices in Japan; hence, it was not possible to find any information on an evaluation method for CAD devices in this category. The data indicate that, in Japan, the method used to evaluate the performance of a device does not depend on the category into which the device is classified, be it CAST or CADe/CADx. We believe that the reason there were no PMDA-approved CAST devices in Japan lies in differences in the medical environment compared with the USA. For instance, according to data reported by the Organisation for Economic Co-operation and Development in a survey on the distribution of medical equipment[77], Japan ranked highest in the CT scanner category with 111 scanners per 1,000,000 people, whereas the USA ranked 11th with only 43 scanners per 1,000,000 people (27 in hospitals and 16 in ambulatory settings). However, the reported number of CT examinations per 1000 individuals was comparable in both countries: 200-250 examinations per 1000 individuals. Thus, over an identical period, the data volume output by a single CT scanner in the USA would be far greater than that in Japan. Therefore, the need for prompt screening of high-risk patients may be greater in the USA than in Japan, which may explain why CAST devices are widely developed in the USA. In Japan, there is no guidance on the evaluation of devices by standalone software testing or by reader study testing. 
Establishing such guidance along with evaluation indexes may be necessary if Japan hopes to continue promoting research and development of AI/ML-based medical devices.

Necessity of prospective testing

Of the 57 AI/ML-based CAD devices selected and analyzed in the present study, three were supported by prospective studies: IDx-DR (IDx) and EyeArt (Eyenuk, Inc.) for the detection of diabetic retinopathy, and GI Genius (Cosmo Artificial Intelligence-AI, Ltd.) for the detection of colorectal polyps. The common denominator among these three devices is that the quality of the image used as input to the software depends greatly on the skill of the operator. Hence, the performance of the AI/ML-based CAD device is considerably dependent on the dexterity of the user, and a less skilled professional may not be able to realize the full potential of the device. Moreover, this dependence on the user's skill makes such devices more likely than their counterparts to misdiagnose. Although many studies have reported on AI/ML-based CAD for ultrasonography used to detect breast tumors[78], no such device has yet been granted regulatory approval. We speculate that this is because the imaging skill of the operator has a significant impact on the performance of the software. We believe that the difference between CAD-assisted colorectal polyp detection in the USA and Japan is due to significant differences in the clinical positioning of colonoscopy in each healthcare system. In the USA, colonoscopy is recognized as the "gold-standard" screening test for colorectal cancer prevention, and most practitioners choose to remove all polyps found during the procedure. Therefore, there is a concern that the use of AI/ML-based CAD devices will inevitably increase the number of polyps detected, including benign ones, thereby increasing the burden of the procedure on patients. Furthermore, for colonoscopy, the best indicator for performance evaluation is not sensitivity or specificity but the adenoma detection rate (ADR)[79,80]. Indeed, the ADR has been shown to correlate inversely with colorectal cancer mortality[81,82].
In Japan, when a polyp is found during colonoscopy, the physician makes a qualitative diagnosis using a magnifying endoscope and judges its malignancy level. The term "semi-clean colon" refers to a colon in which small adenomas judged to be benign (also known as microadenomas; less than 5 mm wide) are left in place and followed up without excision[83]. This indicates that, contrary to practice in the USA, not all polyps are resected during a procedure in Japan. Therefore, the impact of colonoscopy on treatment strategy differs between Japan and the USA, which might explain why the CADe device was approved in Japan on the basis of standalone software testing. Furthermore, to evaluate the performance of the CAD system for detecting colorectal polyps, video frames were used as test data, so that performance was assessed in a manner reflecting actual clinical practice. It can thus be said that the PMDA made a reasonable evaluation of CAD for colorectal polyp detection in line with the clinical scenario in Japan. The IDx-DR and EyeArt devices designed for the detection of diabetic retinopathy are used in primary care facilities. These devices, using the results of AI analysis, are intended to help caregivers decide whether to refer patients to specialists, and they are intended to be operated by non-expert practitioners. Therefore, manufacturers must train operators properly so that the device can achieve its full potential, and must create an appropriate imaging protocol. This explains why imageability was also used as an evaluation factor when reviewing the performance of such devices[84].

Comparison of diversity of AI/ML-based CAD

A comprehensive analysis of the AI/ML-based CAD devices approved by the two regulatory agencies revealed that the FDA approved a wider variety of devices than the PMDA. In the USA, AI/ML-based CAD for the diagnosis of intracerebral hemorrhage, cerebrovascular obstruction, breast cancer, pneumothorax, pulmonary embolism, and pleural effusion remains on the cutting edge of the healthcare industry. The constantly updated and improved head CT, mammogram, and chest CT databases may be one reason for such technological advances. Indeed, digital databases such as the Digital Database for Screening Mammography (DDSM)[85] and CheXpert[86] are known for their large scale and are frequently used in studies on image analysis algorithm development. Furthermore, the National Institutes of Health made the ChestX-ray14[87] dataset, comprising 112,000 images covering 14 different types of lesions, available through Kaggle (an online community that periodically organizes data science competitions). Similarly, the Radiological Society of North America published 25,312 head CT images on Kaggle[88]. As these instances show, there is a constant push for the further development of AI/ML-based CAD-assisted diagnostic/detection devices. Historically, the USA pioneered the application of CAD to the medical field, with the FDA approving the world's first CAD device in 1998 (ImageChecker[89], by R2, now manufactured by Hologic). This is assumed to be one reason why medical AI/ML research and development in the USA is at a more advanced stage than in other countries. The most advanced AI/ML-based CAD sector in Japan targets the colonoscopy market. Currently, the Japanese company Olympus accounts for 70% of the global endoscope market[90]. This may be a factor in why the use of AI for endoscopic image analysis and lesion detection in Japan is more advanced than in other areas.
In Japan, research teams at Showa University and Nagoya University have published a database (SUN[91]) containing 49,799 images of colorectal polyps. Further research and development focusing on this area is therefore expected.

Future work

Because the European Union’s European Medicines Agency (EMA) is another important regulatory agency, including AI/ML-based medical devices approved by the EMA would result in a more complete analysis of the current state of global device approval procedures. However, the EMA does not appear to have a comprehensive database accessible to the public. If the EMA makes its data publicly available, we will incorporate it in a future study, generating results of higher quality and consistency.

Limitations

The present study had two limitations. First, devices approved in the USA or Japan were extracted using their general names or product codes; therefore, relevant devices may not have been identified. Second, this systematic analysis was limited to AI/ML-based CAD devices. Nevertheless, we believe that our comprehensive analysis and comparison of the evaluation methods of AI/ML-based medical devices in terms of target and study design between the USA and Japan provide valuable knowledge for the global development of AI/ML-based CAD.

Conclusions

To the best of our knowledge, the present study is the first systematic, comprehensive comparative analysis to clarify differences in the performance evaluation methods of AI/ML-based CAD devices approved in the USA and Japan. In the USA, there are two prevalent methods for performance evaluation, standalone software testing and reader study testing, and which one is used depends on whether the device is CAST or CADe/CADx. In contrast, Japan does not make such a clear distinction, as illustrated by the use of either standalone software testing or reader study testing for devices with the same classification (CADe/CADx). In addition, the present study indicated that the AI/ML-based CAD devices approved in the USA were much more diverse than those approved in Japan. As a regulatory agency, the FDA has issued clear guidance specifying points to keep in mind when conducting standalone software testing or reader study testing. The authors believe that the active publication of such guidance and comprehensive documentation by regulatory agencies encourages the development of AI/ML medical devices. Finally, from the perspective of mutual acceptance of AI/ML-based CAD devices developed in both countries, it seems relevant to address the international harmonization of AI/ML-based CAD evaluation and to obtain consensus on reliable evaluation methods for these devices.

Methods

Extraction of AI/ML-based medical devices in the USA

AI/ML medical devices were extracted from the FDA product code database[11]. A product code is a three-character unique product identifier. As of June 22, 2021 (the date on which devices were selected), 6701 product codes were listed. Two authors independently performed a keyword search, determined whether each AI-based CAD device met the inclusion criteria, and resolved discrepancies by joint review and consensus. Using the search keywords "artificial intelligence," "machine learning," and "deep learning," 18 product codes were identified (8 for artificial intelligence, 9 for machine learning, and 1 for deep learning). Among these 18 product codes, seven duplicates were removed, and five further product codes were excluded after screening (codes that did not correspond to triage, notification, detection, or diagnosis). The final six product codes encompassed a total of 48 devices. Of these 48 devices, four had been granted de novo clearance and 44 had been granted 510(k) clearance [none had premarket approval (PMA)]. Two devices were excluded because of insufficient information in the 510(k) summary, and one was excluded because of a minor change in the target user. The final number of US-approved devices used in the present study was 45. Details of the screening and selection processes are shown in Fig. 1. Information was collected from the de novo classification requests, decision summaries, and 510(k) summaries of the AI/ML-based CAD devices approved in the USA, covering (1) device name; (2) manufacturer; (3) approval date; (4) intended use; (5) test method; (6) target disease; (7) test data volume; and (8) performance [sensitivity, specificity, area under the curve (AUC), and accuracy rate].
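The screening steps above form a simple exclusion funnel; a minimal sketch that reproduces the counts reported in the text (stage names are illustrative, not actual FDA database fields):

```python
# Sketch of the US device-screening funnel described above.
# Counts are taken from the text; stage labels are illustrative.
code_funnel = [
    ("product codes matching AI/ML/DL keywords", 18),          # 8 + 9 + 1
    ("after removing duplicate codes", 18 - 7),
    ("after excluding non-triage/detection/diagnosis codes", 18 - 7 - 5),
]
device_funnel = [
    ("devices under the final 6 product codes", 48),
    ("after excluding insufficient 510(k) summaries", 48 - 2),
    ("after excluding a minor target-user change", 48 - 2 - 1),
]
for stage, n in code_funnel + device_funnel:
    print(f"{stage}: {n}")
```

Recording the funnel as data like this makes the arithmetic auditable: the final stages yield the 6 product codes and 45 devices reported above.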

Extraction of AI/ML-based medical devices in Japan

Japanese PMDA-approved AI/ML medical devices were extracted from the database of the Japan Association for the Advancement of Medical Equipment using its search service (JAAME Search)[12]. The JAAME database comprehensively stores the general names of medical devices and information on approved or certified medical devices. First, because this study focuses on AI/ML medical devices used to diagnose specific diseases, the initial search was performed using the generic category "disease diagnostic program." The search returned 165 device categories with generic names (equivalent to FDA product codes), comprising a combined total of 349 approved/certified medical devices (as of June 22, 2021). Because the first AI/ML medical device approved in Japan was EndoBRAIN (CYBERNET), approved in December 2018[14], the search was refined to devices approved between December 2018 and June 2021, which reduced the number of matching devices to 57. After excluding devices for genome analysis or other non-image-based tasks, 32 devices remained. Finally, press release information and package inserts were checked to confirm whether each device used AI/ML, yielding 12 devices for the study.

Classification of AI/ML-based CAD

The identified AI/ML-based CAD devices were classified based on the definitions of CADe, CADx, and CAST described in the FDA guidance document[72]. Taking lesion detection as an example, a device that marks or emphasizes a suspected lesion is a CADe, a device that identifies the malignancy level of a lesion is a CADx, and a device whose output is meant to prioritize cases and thereby reduce or eliminate the burden on physicians is a CAST.
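The classification rule just described maps a device's output type to a category; a toy sketch of that mapping (the output descriptions are illustrative paraphrases, not the FDA guidance's exact wording):

```python
# Toy classifier following the CADe/CADx/CAST definitions paraphrased above.
# The keyword triggers are illustrative assumptions, not regulatory criteria.
def cad_category(device_output: str) -> str:
    if "prioritize" in device_output or "triage" in device_output:
        return "CAST"  # flags/prioritizes studies to reduce clinician workload
    if "malignancy" in device_output or "characterize" in device_output:
        return "CADx"  # characterizes the malignancy level of an identified lesion
    if "mark" in device_output or "highlight" in device_output:
        return "CADe"  # marks or highlights suspected lesions
    return "unclassified"

print(cad_category("highlight polyps in colonoscopy video"))      # → CADe
print(cad_category("estimate malignancy level of a lesion"))      # → CADx
print(cad_category("triage head CT cases with suspected bleed"))  # → CAST
```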

Data analysis

After grouping the identified devices according to their target area, we further divided them into subgroups according to the evaluation method used for approval (standalone software testing or reader study testing). Where reported, the number of test cases, sensitivity, specificity, and AUC were summarized as averages and ranges (minimum–maximum) using Microsoft Excel.
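The same mean and minimum–maximum summaries computed in Excel can be sketched as follows (the performance figures below are placeholders, not values from the approved devices):

```python
# Summarize reported performance metrics as a mean and a (min-max) range,
# skipping devices that did not report a given metric (None).
# The numbers are illustrative placeholders, not actual device data.
from statistics import mean

devices = [
    {"sensitivity": 0.92, "specificity": 0.88, "auc": 0.95},
    {"sensitivity": 0.87, "specificity": None, "auc": 0.91},
    {"sensitivity": 0.95, "specificity": 0.90, "auc": None},
]

def summarize(records, metric):
    values = [r[metric] for r in records if r[metric] is not None]
    return mean(values), min(values), max(values)

for metric in ("sensitivity", "specificity", "auc"):
    avg, lo, hi = summarize(devices, metric)
    print(f"{metric}: mean {avg:.3f} (range {lo:.2f}-{hi:.2f})")
```

Handling missing values explicitly matters here, because not every approval summary reports every metric; averaging only over the devices that report a metric mirrors the "known data" qualifier above.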

Research involving human participants

This study is a systematic review and did not involve human participants.