Seong Ho Park, Jaesoon Choi, Jeong Sik Byeon.
Abstract
Artificial intelligence (AI) will likely affect various fields of medicine. This article aims to explain the fundamental principles of clinical validation, device approval, and insurance coverage decisions for AI algorithms for medical diagnosis and prediction. The discrimination accuracy of AI algorithms is often evaluated with the Dice similarity coefficient, sensitivity, specificity, and traditional or free-response receiver operating characteristic curves. Calibration accuracy should also be assessed, especially for algorithms that provide probabilities to users. As current AI algorithms have limited generalizability to real-world practice, clinical validation of AI should subject it to proper external testing in its intended assisting role. External testing can adopt diagnostic case-control or diagnostic cohort designs. A diagnostic case-control study evaluates the technical validity/accuracy of AI, whereas a diagnostic cohort study tests the clinical validity/accuracy of AI in samples representing the target patients in real-world clinical scenarios. Ultimate clinical validation of AI requires evaluation of its impact on patient outcomes, referred to as clinical utility, for which randomized clinical trials are ideal. Device approval of AI is typically granted with proof of technical validity/accuracy and thus does not directly indicate whether AI is beneficial for patient care or whether it improves patient outcomes. Nor can it categorically address the issue of limited generalizability of AI. After device approval, it is up to medical professionals to determine whether the approved AI algorithms are beneficial for real-world patient care. Insurance coverage decisions generally require a demonstration of clinical utility, that is, evidence that the use of AI has improved patient outcomes.
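The abstract distinguishes discrimination accuracy from calibration accuracy for algorithms that output probabilities. As an illustration only (not a method from the article), calibration can be checked by binning predicted probabilities and comparing each bin's mean prediction with the observed event rate; the function name and binning scheme below are hypothetical:

```python
def calibration_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed event rate in each bin.
    Illustrative sketch; a well-calibrated model has mean_pred ~= obs_rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:  # skip empty bins
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_rate = sum(y for _, y in b) / len(b)
            out.append((round(mean_pred, 3), round(obs_rate, 3), len(b)))
    return out

# Toy example: predictions roughly track outcomes, so bins line up
print(calibration_bins([0.05, 0.1, 0.9, 0.95, 0.5, 0.55], [0, 0, 1, 1, 0, 1]))
```

Plotting mean prediction against observed rate per bin yields the familiar calibration (reliability) curve; perfect calibration lies on the diagonal.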
Keywords: Artificial intelligence; Device approval; Insurance coverage; Software validation
Year: 2021 PMID: 33629545 PMCID: PMC7909857 DOI: 10.3348/kjr.2021.0048
Source DB: PubMed Journal: Korean J Radiol ISSN: 1229-6929 Impact factor: 3.500
Fig. 1. Dice similarity coefficient.
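The Dice similarity coefficient in Figure 1 measures overlap between two segmentations as 2|A∩B| / (|A| + |B|). A minimal sketch, representing each binary mask as a set of pixel coordinates (the masks below are made-up examples):

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient: 2*|A intersect B| / (|A| + |B|).
    Masks are sets of pixel coordinates; 1.0 = perfect overlap, 0.0 = none."""
    a, b = set(mask_a), set(mask_b)
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 1.0  # two empty masks agree

# Algorithm's segmentation vs. the reference standard (toy 2x2-ish masks)
algo  = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 0), (1, 1), (2, 1)}
print(dice(algo, truth))  # 2*3 / (4+4) = 0.75
```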
Fig. 2. Diagnostic cross-table (also referred to as confusion matrix).
AI = artificial intelligence, FN = false negative, FP = false positive, TN = true negative, TP = true positive
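From the TP/FP/FN/TN counts of the cross-table in Figure 2, sensitivity and specificity follow directly. A minimal sketch with made-up predictions and labels:

```python
def confusion(preds, labels):
    """Tally the diagnostic cross-table from binary predictions and truth."""
    tp = sum(p and y for p, y in zip(preds, labels))          # true positives
    fp = sum(p and not y for p, y in zip(preds, labels))      # false positives
    fn = sum(not p and y for p, y in zip(preds, labels))      # false negatives
    tn = sum(not p and not y for p, y in zip(preds, labels))  # true negatives
    return tp, fp, fn, tn

preds  = [1, 1, 0, 0, 1, 0]   # AI outputs (toy data)
labels = [1, 0, 0, 1, 1, 0]   # reference standard
tp, fp, fn, tn = confusion(preds, labels)
sensitivity = tp / (tp + fn)  # TP / (TP + FN)
specificity = tn / (tn + fp)  # TN / (TN + FP)
```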
Fig. 3. Exemplary receiver operating characteristic curves that show the performance of four readers in interpreting breast ultrasonography assisted by a deep-learning algorithm.
Adapted from Choi et al. Korean J Radiol 2019;20:749-758, with permission from the Korean Society of Radiology [4]. AUC = area under the curve
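The AUC values reported alongside ROC curves such as those in Figure 3 have a simple probabilistic reading: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney formulation). An illustrative sketch, not the article's method:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: fraction of (positive, negative)
    pairs where the positive case scores higher; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: one negative case outranks one positive case -> AUC = 0.75
print(auc([0.9, 0.5, 0.2, 0.1], [1, 0, 1, 0]))
```

Sweeping a decision threshold over the scores and plotting sensitivity against 1 − specificity at each threshold traces the ROC curve itself; the function above gives the area under it directly.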
Fig. 4. Exemplary free-response receiver operating characteristic curves that show the performance of six methods of detecting polyps in colonoscopy videos.
The x-axis is the mean number of false positives per image frame. A curve closer to the upper left corner indicates higher performance; for example, the red curve outperforms the blue curve. Adapted from Tajbakhsh et al. Proceedings of the IEEE 12th International Symposium on Biomedical Imaging. New York: IEEE; 2015, with permission from IEEE [7]. CNN = convolutional neural network
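Unlike a traditional ROC curve, a free-response ROC (as in Figure 4) plots lesion-level sensitivity against the mean number of false-positive marks per image, since a detector may emit several candidate marks per frame. A hypothetical sketch of computing one operating point (the data structure below is an assumption, not the cited paper's):

```python
def froc_point(detections, n_lesions, n_images, threshold):
    """One FROC operating point: lesion-level sensitivity vs. mean false
    positives per image. `detections` is a list of candidate marks as
    (confidence, is_true_lesion) pairs pooled over all images."""
    kept = [hit for conf, hit in detections if conf >= threshold]
    sensitivity = sum(kept) / n_lesions            # detected lesions / all lesions
    mean_fp = (len(kept) - sum(kept)) / n_images   # false marks per image
    return mean_fp, sensitivity

# Toy candidate marks from 2 images containing 3 true lesions
dets = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
print(froc_point(dets, n_lesions=3, n_images=2, threshold=0.5))
```

Sweeping the confidence threshold and collecting these points traces the full FROC curve.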
Examples of Limited Generalizability of the Performance of Artificial Intelligence Algorithms for Medical Diagnosis/Prediction
| Author | Algorithm | Result |
|---|---|---|
| Zech et al. | CNN algorithm to detect pneumonia on chest radiographs | AUC of 0.931 in internal testing compared with 0.815 in external testing |
| Ting et al. | CNN algorithm to detect referable diabetic retinopathy on retinal photographs | AUC ranging from 0.889 to 0.983 when tested externally at 10 different hospitals |
| Ridley | CNN algorithm to detect intracranial hemorrhage on noncontrast head computed tomography scans | Sensitivity, specificity, and AUC of 98%, 95%, and 0.993, respectively, when tested internally compared with 87.1%, 58.3%, and 0.834, respectively, when tested on a real-world data set |
| Hwang et al. | CNN algorithm to distinguish normal chest radiographs from abnormal chest radiographs containing any of four pathologies: malignancy, tuberculosis, pneumonia, or pneumothorax | When externally tested at five different hospitals with a single fixed threshold applied to the raw algorithm output, specificity varied widely from 56.6% to 100%, while sensitivity was less variable, ranging from 91.3% to 100% |
| Lee et al. | CNN algorithm to categorize hepatic fibrosis (F0, F1, F2–3, and F4 according to METAVIR scoring) on B-mode ultrasonography images | Accuracy of 83.5% in internal testing compared with 76.4% in external testing |
AUC = area under the curve, CNN = convolutional neural network
Fig. 5. Typical data sets used for development and testing of an AI algorithm.
AI = artificial intelligence
Examples of Randomized Controlled Trials that Compared Practice with and without Artificial Intelligence Algorithms
| Author | Algorithm | Patient | Primary Outcome |
|---|---|---|---|
| Wijnberge et al. | Non-deep-learning machine learning algorithm that continuously analyzes the arterial pressure waveform during surgery and warns if a hypotensive event is expected within the next 15 minutes | Adult patients (≥ 18 years old) scheduled to undergo elective noncardiac surgery under general anesthesia with a need for continuous invasive blood pressure monitoring via an arterial line | Time-weighted average of hypotension during surgery, defined as the depth of hypotension below a mean arterial pressure of 65 mm Hg (in mm Hg) × the time spent below a mean arterial pressure of 65 mm Hg (in minutes), divided by the total duration of the operation (in minutes) |
| INFANT Collaborative Group | Non-deep-learning machine learning algorithm that continuously analyzes cardiotocographic data and delivers color-coded alerts to physicians when abnormalities are noted | Women in labor requiring continuous electronic fetal heart rate monitoring | Rate of poor neonatal outcome (intrapartum stillbirth or early neonatal death excluding lethal congenital anomalies; neonatal encephalopathy; or admission to the neonatal unit within 24 h for ≥ 48 h with evidence of feeding difficulties, respiratory illness, or encephalopathy with evidence of compromise at birth), and developmental assessment at age 2 years in a subset of surviving children |
| Repici et al. | CNN-based CADe algorithm that detects polyps on colonoscopy images | Patients undergoing screening, surveillance, or diagnostic colonoscopy | Adenoma detection rate (percentage of patients with at least one histologically proven adenoma or carcinoma) |
| Wu et al. | CNN-based algorithm that monitors the occurrence of blind spots during esophagogastroduodenoscopy | Patients undergoing esophagogastroduodenoscopy | Rate of blind spots (number of unobserved sites/views out of 26 investigator-defined sites/views per patient) during endoscopic examination |
CADe = computer-aided detection, CNN = convolutional neural network
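The time-weighted average hypotension outcome in the Wijnberge et al. row can be made concrete. Assuming mean arterial pressure (MAP) is sampled at a fixed interval, a minimal discrete sketch of the stated formula (depth below 65 mm Hg × time below 65 mm Hg, divided by total duration):

```python
def twa_hypotension(map_samples, interval_min, threshold=65.0):
    """Time-weighted average hypotension: area of MAP below the threshold
    (mm Hg x minutes) divided by total duration (minutes). `map_samples`
    are mean arterial pressures taken every `interval_min` minutes."""
    total_min = len(map_samples) * interval_min
    area = sum((threshold - m) * interval_min
               for m in map_samples if m < threshold)
    return area / total_min

# Toy case: 4 one-minute samples, two dip below 65 mm Hg (by 5 and 10 mm Hg)
print(twa_hypotension([70, 60, 55, 70], interval_min=1.0))  # (5 + 10) / 4 = 3.75
```

A value of 0 means MAP never fell below the threshold; larger values reflect deeper or longer hypotensive episodes relative to the length of surgery.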