Seong Ho Park, Jaesoon Choi, Jeong Sik Byeon.
Abstract
Artificial intelligence (AI) will likely affect various fields of medicine. This article aims to explain the fundamental principles of clinical validation, device approval, and insurance coverage decisions for AI algorithms for medical diagnosis and prediction. The discrimination accuracy of AI algorithms is often evaluated with the Dice similarity coefficient, sensitivity, specificity, and traditional or free-response receiver operating characteristic curves. Calibration accuracy should also be assessed, especially for algorithms that provide probabilities to users. As current AI algorithms have limited generalizability to real-world practice, clinical validation of AI should subject it to proper external testing in its intended assisting role. External testing can adopt diagnostic case-control or diagnostic cohort designs. A diagnostic case-control study evaluates the technical validity/accuracy of AI, whereas a diagnostic cohort study tests the clinical validity/accuracy of AI in samples representing the target patients in real-world clinical scenarios. Ultimate clinical validation of AI requires evaluation of its impact on patient outcomes, referred to as clinical utility, for which randomized clinical trials are ideal. Device approval of AI is typically granted with proof of technical validity/accuracy and thus does not directly indicate whether AI is beneficial for patient care or whether it improves patient outcomes. Nor can it categorically address the issue of limited generalizability of AI. After device approval, it is up to medical professionals to determine whether the approved AI algorithms are beneficial for real-world patient care. Insurance coverage decisions generally require a demonstration of clinical utility, that is, evidence that the use of AI has improved patient outcomes.
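The abstract distinguishes discrimination accuracy from calibration accuracy for algorithms that output probabilities. As an illustration only (not a method from the article), calibration can be checked by binning predicted probabilities and comparing each bin's mean prediction with the observed event rate; the function name and binning scheme below are hypothetical:

```python
def calibration_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed event rate in each bin.
    Illustrative sketch; a well-calibrated model has mean_pred ~= obs_rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:  # skip empty bins
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_rate = sum(y for _, y in b) / len(b)
            out.append((round(mean_pred, 3), round(obs_rate, 3), len(b)))
    return out

# Toy example: predictions roughly track outcomes, so bins line up
print(calibration_bins([0.05, 0.1, 0.9, 0.95, 0.5, 0.55], [0, 0, 1, 1, 0, 1]))
```

Plotting mean prediction against observed rate per bin yields the familiar calibration (reliability) curve; perfect calibration lies on the diagonal.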
Keywords: Artificial intelligence; Device approval; Insurance coverage; Software validation
Year: 2021 PMID: 33629545 PMCID: PMC7909857 DOI: 10.3348/kjr.2021.0048
Source DB: PubMed Journal: Korean J Radiol ISSN: 1229-6929 Impact factor: 3.500
Fig. 1. Dice similarity coefficient.
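The Dice similarity coefficient in Figure 1 measures overlap between two segmentations as 2|A∩B| / (|A| + |B|). A minimal sketch, representing each binary mask as a set of pixel coordinates (the masks below are made-up examples):

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient: 2*|A intersect B| / (|A| + |B|).
    Masks are sets of pixel coordinates; 1.0 = perfect overlap, 0.0 = none."""
    a, b = set(mask_a), set(mask_b)
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 1.0  # two empty masks agree

# Algorithm's segmentation vs. the reference standard (toy 2x2-ish masks)
algo  = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 0), (1, 1), (2, 1)}
print(dice(algo, truth))  # 2*3 / (4+4) = 0.75
```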
Fig. 2. Diagnostic cross-table (also referred to as confusion matrix).
AI = artificial intelligence, FN = false negative, FP = false positive, TN = true negative, TP = true positive
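From the TP/FP/FN/TN counts of the cross-table in Figure 2, sensitivity and specificity follow directly. A minimal sketch with made-up predictions and labels:

```python
def confusion(preds, labels):
    """Tally the diagnostic cross-table from binary predictions and truth."""
    tp = sum(p and y for p, y in zip(preds, labels))          # true positives
    fp = sum(p and not y for p, y in zip(preds, labels))      # false positives
    fn = sum(not p and y for p, y in zip(preds, labels))      # false negatives
    tn = sum(not p and not y for p, y in zip(preds, labels))  # true negatives
    return tp, fp, fn, tn

preds  = [1, 1, 0, 0, 1, 0]   # AI outputs (toy data)
labels = [1, 0, 0, 1, 1, 0]   # reference standard
tp, fp, fn, tn = confusion(preds, labels)
sensitivity = tp / (tp + fn)  # TP / (TP + FN)
specificity = tn / (tn + fp)  # TN / (TN + FP)
```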
Fig. 3. Exemplary receiver operating characteristic curves that show the performance of four readers in interpreting breast ultrasonography assisted by a deep-learning algorithm.
Adapted from Choi et al. Korean J Radiol 2019;20:749-758, with permission from the Korean Society of Radiology [4]. AUC = area under the curve
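The AUC values reported alongside ROC curves such as those in Figure 3 have a simple probabilistic reading: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney formulation). An illustrative sketch, not the article's method:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: fraction of (positive, negative)
    pairs where the positive case scores higher; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: one negative case outranks one positive case -> AUC = 0.75
print(auc([0.9, 0.5, 0.2, 0.1], [1, 0, 1, 0]))
```

Sweeping a decision threshold over the scores and plotting sensitivity against 1 − specificity at each threshold traces the ROC curve itself; the function above gives the area under it directly.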
Fig. 4. Exemplary free-response receiver operating characteristic curves that show the performance of six methods of detecting polyps in colonoscopy videos.
The x-axis is the mean number of false positives per image frame. A curve closer to the upper left corner indicates higher performance; for example, the red curve outperforms the blue curve. Adapted from Tajbakhsh et al. Proceedings of the IEEE 12th International Symposium on Biomedical Imaging. New York: IEEE; 2015, with permission from IEEE [7]. CNN = convolutional neural network
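Unlike a traditional ROC curve, a free-response ROC (as in Figure 4) plots lesion-level sensitivity against the mean number of false-positive marks per image, since a detector may emit several candidate marks per frame. A hypothetical sketch of computing one operating point (the data structure below is an assumption, not the cited paper's):

```python
def froc_point(detections, n_lesions, n_images, threshold):
    """One FROC operating point: lesion-level sensitivity vs. mean false
    positives per image. `detections` is a list of candidate marks as
    (confidence, is_true_lesion) pairs pooled over all images."""
    kept = [hit for conf, hit in detections if conf >= threshold]
    sensitivity = sum(kept) / n_lesions            # detected lesions / all lesions
    mean_fp = (len(kept) - sum(kept)) / n_images   # false marks per image
    return mean_fp, sensitivity

# Toy candidate marks from 2 images containing 3 true lesions
dets = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
print(froc_point(dets, n_lesions=3, n_images=2, threshold=0.5))
```

Sweeping the confidence threshold and collecting these points traces the full FROC curve.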
Examples of Limited Generalizability of the Performance of Artificial Intelligence Algorithms for Medical Diagnosis/Prediction
| Author | Algorithm | Result |
|---|---|---|
| Zech et al. | CNN algorithm to detect pneumonia on chest radiographs | AUC of 0.931 in internal testing compared with 0.815 in external testing |
| Ting et al. | CNN algorithm to detect referable diabetic retinopathy on retinal photographs | AUC ranging from 0.889 to 0.983 when tested externally at 10 different hospitals |
| Ridley | CNN algorithm to detect intracranial hemorrhage on noncontrast head computed tomography scans | Sensitivity, specificity, and AUC of 98%, 95%, and 0.993, respectively, when tested internally compared with 87.1%, 58.3%, and 0.834, respectively, when tested on a real-world data set |
| Hwang et al. | CNN algorithm to distinguish normal chest radiographs from abnormal chest radiographs containing any of four pathologies: malignancy, tuberculosis, pneumonia, or pneumothorax | When externally tested at five different hospitals with a single fixed threshold applied to the raw algorithm output, specificity varied widely from 56.6% to 100%, while sensitivity was less variable, ranging from 91.3% to 100% |
| Lee et al. | CNN algorithm to categorize hepatic fibrosis (F0, F1, F2–3, and F4 according to METAVIR scoring) on B-mode ultrasonography images | Accuracy of 83.5% in internal testing compared with 76.4% in external testing |
AUC = area under the curve, CNN = convolutional neural network
Fig. 5. Typical data sets used for development and testing of an AI algorithm.
AI = artificial intelligence
Examples of Randomized Controlled Trials that Compared Practice with and without Artificial Intelligence Algorithms
| Author | Algorithm | Patient | Primary Outcome |
|---|---|---|---|
| Wijnberge et al. | Non-deep-learning machine learning algorithm that continuously analyzes the arterial pressure waveform during surgery and warns if a hypotensive event is expected within the next 15 minutes | Adult patients (≥ 18 years old) scheduled to undergo elective noncardiac surgery under general anesthesia with a need for continuous invasive blood pressure monitoring via an arterial line | Time-weighted average of hypotension during surgery, defined as the depth of hypotension below a mean arterial pressure of 65 mm Hg (in mm Hg) × the time spent below a mean arterial pressure of 65 mm Hg (in minutes), divided by the total duration of the operation (in minutes) |
| INFANT Collaborative Group | Non-deep-learning machine learning algorithm that continuously analyzes cardiotocographic data and delivers color-coded alerts to physicians when abnormalities are noted | Women in labor requiring continuous electronic fetal heart rate monitoring | Rate of poor neonatal outcome (intrapartum stillbirth or early neonatal death excluding lethal congenital anomalies; neonatal encephalopathy; or admission to the neonatal unit within 24 h for ≥ 48 h with evidence of feeding difficulties, respiratory illness, or encephalopathy with evidence of compromise at birth), and developmental assessment at age 2 years in a subset of surviving children |
| Repici et al. | CNN-based CADe algorithm that detects polyps on colonoscopy images | Patients undergoing screening, surveillance, or diagnostic colonoscopy | Adenoma detection rate (percentage of patients with at least one histologically proven adenoma or carcinoma) |
| Wu et al. | CNN-based algorithm that monitors the occurrence of blind spots during esophagogastroduodenoscopy | Patients undergoing esophagogastroduodenoscopy | Rate of blind spots (number of unobserved sites/views out of 26 investigator-defined sites/views per patient) during endoscopic examination |
CADe = computer-aided detection, CNN = convolutional neural network
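The time-weighted average hypotension outcome in the Wijnberge et al. row can be made concrete. Assuming mean arterial pressure (MAP) is sampled at a fixed interval, a minimal discrete sketch of the stated formula (depth below 65 mm Hg × time below 65 mm Hg, divided by total duration):

```python
def twa_hypotension(map_samples, interval_min, threshold=65.0):
    """Time-weighted average hypotension: area of MAP below the threshold
    (mm Hg x minutes) divided by total duration (minutes). `map_samples`
    are mean arterial pressures taken every `interval_min` minutes."""
    total_min = len(map_samples) * interval_min
    area = sum((threshold - m) * interval_min
               for m in map_samples if m < threshold)
    return area / total_min

# Toy case: 4 one-minute samples, two dip below 65 mm Hg (by 5 and 10 mm Hg)
print(twa_hypotension([70, 60, 55, 70], interval_min=1.0))  # (5 + 10) / 4 = 3.75
```

A value of 0 means MAP never fell below the threshold; larger values reflect deeper or longer hypotensive episodes relative to the length of surgery.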