David B Larson, Hugh Harvey, Daniel L Rubin, Neville Irani, Justin R Tse, Curtis P Langlotz.
Abstract
Although artificial intelligence (AI)-based algorithms for diagnosis hold promise for improving care, their safety and effectiveness must be ensured to facilitate wide adoption. Several recently proposed regulatory frameworks provide a solid foundation but do not address a number of issues that may prevent algorithms from being fully trusted. In this article, we review the major regulatory frameworks for software as a medical device applications, identify major gaps, and propose additional strategies to improve the development and evaluation of diagnostic AI algorithms. We identify the following major shortcomings of the current regulatory frameworks: (1) conflation of the diagnostic task with the diagnostic algorithm, (2) superficial treatment of the diagnostic task definition, (3) no mechanism to directly compare similar algorithms, (4) insufficient characterization of safety and performance elements, (5) lack of resources to assess performance at each installed site, and (6) inherent conflicts of interest. We recommend the following additional measures: (1) separate the diagnostic task from the algorithm, (2) define performance elements beyond accuracy, (3) divide the evaluation process into discrete steps, (4) encourage assessment by a third-party evaluator, and (5) incorporate these elements into the manufacturers' development process. Specifically, we recommend four phases of development and evaluation, analogous to those that have been applied to pharmaceuticals and proposed for software applications, to help ensure world-class performance of all algorithms at all installed sites. In the coming years, we anticipate the emergence of a substantial body of research dedicated to ensuring the accuracy, reliability, and safety of the algorithms.
Year: 2020 PMID: 33096088 PMCID: PMC7574690 DOI: 10.1016/j.jacr.2020.09.060
Source DB: PubMed Journal: J Am Coll Radiol ISSN: 1546-1440 Impact factor: 5.532
Examples of types of measurements used as diagnostic tasks in medical imaging
| Measurement Type | Example | Description |
|---|---|---|
| Lesion size | RECIST | Describes the technique for measuring lesion dimensions |
| Volumetric analysis | Future liver remnant | Quantifies volume of a lesion or organ of interest |
| Physiologic process | Gastric half emptying time | Quantifies the function of a physiologic process |
| Growth and maturity | Greulich and Pyle bone age | Quantifies growth and maturity of an individual or organ system |
| X-ray attenuation | Hounsfield unit | Quantifies attenuation of an x-ray beam on CT |
| Contrast enhancement | Peak lesion enhancement on CT or MR | Quantifies enhancement based on threshold increase in attenuation or signal intensity after contrast administration on CT or MR, respectively |
| Dynamic characteristics of contrast enhancement | Adrenal adenoma washout calculation | Describes the dynamics of the decrease in lesion enhancement from the peak |
| Fat fraction | Liver fat fraction | Quantifies fat fraction within an organ or lesion based on MRI |
| Iron content | Liver T2∗ imaging | Quantifies iron content within an organ based on MRI |
| Diffusion-weighted imaging | Diffusion restriction | Quantifies Brownian motion of water molecules on MRI |
| Organ stiffness | Liver stiffness, based on MR or ultrasound elastography | Quantifies shear modulus of the liver on MR or ultrasound |
| Bolus perfusion | Stroke imaging | Quantifies mean transit time, cerebral blood volume, and cerebral blood flow to estimate tissue at risk |
| Fluid velocity | Portal vein velocity measured on ultrasound | Quantifies magnitude and direction of blood flow |
| Flow characteristics | Arterial resistive index on ultrasound | Estimates the resistivity to flow in an artery, based on minimum and maximum velocities |
| Radiotracer uptake | Standard uptake value | Ratio of radioactivity concentration in a lesion to whole body concentration of injected radiotracer |
RECIST = Response Evaluation Criteria in Solid Tumours.
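Several of the measurement types above reduce to simple arithmetic over pixel or region-of-interest values. As an illustrative sketch (the formulas are standard in radiology, but the function and parameter names are my own, not from the article), the adrenal adenoma washout calculation and the standardized uptake value can be written as:

```python
def absolute_washout(unenhanced_hu, enhanced_hu, delayed_hu):
    """Absolute washout (%) for adrenal lesion characterization on CT.

    Compares the drop from peak (portal venous) enhancement to the
    delayed phase, relative to the total enhancement above baseline.
    """
    return 100.0 * (enhanced_hu - delayed_hu) / (enhanced_hu - unenhanced_hu)


def relative_washout(enhanced_hu, delayed_hu):
    """Relative washout (%), used when no unenhanced phase is available."""
    return 100.0 * (enhanced_hu - delayed_hu) / enhanced_hu


def suv_bw(tissue_activity_kbq_per_ml, injected_dose_kbq, body_weight_g):
    """Body-weight-normalized standardized uptake value (SUV).

    Ratio of the activity concentration in a lesion to the injected
    dose distributed uniformly over the patient's body weight.
    """
    return tissue_activity_kbq_per_ml / (injected_dose_kbq / body_weight_g)


# Example: lesion measuring 10 HU unenhanced, 80 HU enhanced, 40 HU delayed
print(round(absolute_washout(10, 80, 40), 1))  # 57.1
print(round(relative_washout(80, 40), 1))      # 50.0
```

These definitions make concrete the article's point that the diagnostic task (the measurement and its formula) is separable from any particular algorithm that computes its inputs.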
Examples of types of classification schemes used as diagnostic tasks in medical imaging
| Classification Type | Example | Description |
|---|---|---|
| Findings associated with disease | Findings of pulmonary embolism on CT pulmonary angiogram | Describes the imaging findings associated with the presence (or absence) of a disease process |
| Probability of disease | PIOPED criteria | Provides a probability of disease based on imaging findings |
| Type of pathology | WHO CNS tumor classification system | Describes the type of disease, based on pathologically provable characteristics, such as histopathology |
| Grade of pathology | Gleason grading system for prostate adenocarcinoma, AAST organ injury scoring scale | Describes the severity of disease, based on findings that are associated with relevant outcomes |
| Stage of pathology | TNM staging system | Describes the anatomical extent of the disease, based on findings that are associated with relevant outcomes |
| Lesion characterization based on risk assessment | BI-RADS | Categorizes lesions, or potential lesions, based on imaging findings that are associated with relevant outcomes |
| Pathology characterization | Stanford aortic dissection classification, Salter-Harris fracture types | Describes subtypes of a disease process based on imaging findings; subtypes are generally associated with relevant outcomes and may have implications for management |
| Diagnostic criteria based on imaging findings | Fleischner diagnostic HRCT criteria for UIP pattern | Image-based criteria for classifying a disease process |
| Clinical management based on imaging findings | Fleischner Society guidelines for management of incidental nodules detected on CT | Provides recommendations for clinical management based on imaging findings |
| Imaging pattern description | Wolfe classification of breast parenchymal patterns | Characterizes different types of normal imaging findings, typically to provide context for diagnosis of disease |
| Anatomical variants | Geist classification of os naviculare types | Describes different normal imaging appearances of an organ system |
AAST = American Association for the Surgery of Trauma; CNS = central nervous system; HRCT = high resolution CT; PIOPED = Prospective Investigation of Pulmonary Embolism Diagnosis; UIP = usual interstitial pneumonia; WHO = World Health Organization.
Examples of important performance elements of diagnostic algorithms
| Element | Explanation |
|---|---|
| Accurate | The algorithm should accurately perform all diagnostic tasks for which it is designed. |
| Reliable | The algorithm should remain accurate in the setting of reasonably expected variation encountered in the clinical environment, including reasonable variations in image quality. |
| Applicable | The accuracy of the algorithm should be maintained across all makes and models of image modalities and for all patient populations for which it is designed to function. |
| Deterministic | The algorithm should give the same answer for the same image when used at different times and in different settings. |
| Nondistractible | The algorithm should be able to recognize the salient information from the image and not change its assessment based on extraneous, noncontributory image data. |
| Self-aware of limitations | The algorithm should have the means to detect when it is at or beyond the boundaries of its capabilities, whether because of inherent limitations of the model, limitations of its clinical applicability, or limitations imposed by clinical variation such as unexpected patient anatomy or image quality. |
| Fail-safe | The algorithm should recognize when it has reached an erroneous conclusion and have the means for ensuring that all errors are caught and stopped before they are propagated into the clinical environment. |
| Transparent logic | The user interface should enable the operator to clearly see the linkage between the input and output, including what data were analyzed, what alternatives were considered, and why certain possibilities were excluded, to be able to correctly accept or reject the algorithm’s conclusion on any given case. |
| Transparent degree of confidence | The algorithm should share with the user a level of confidence in its assessment for each case. The accuracy of the model’s expression of confidence should be validated as well as the accuracy of the model itself. |
| Able to be monitored | The algorithm should share performance data with users to enable ongoing monitoring of both individual and aggregated cases, quickly highlighting any significant deviations in performance. |
| Auditable | An independent means should be provided to monitor the algorithm’s ongoing performance in a way that guides appropriate intervention. This may include periodic quality control checks similar to those performed by operators on imaging equipment. |
| Intuitive user interface | The user interface should enable the operator to learn intuitively how to use the algorithm with as little training as possible and impose the minimum possible cognitive load on the user. |
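The "transparent degree of confidence" element calls for validating the accuracy of a model's stated confidence, not just the accuracy of its predictions. One common way to check this (a sketch of a standard calibration metric, expected calibration error, not a procedure prescribed by the article) is to bin cases by stated confidence and compare each bin's mean confidence to its observed accuracy:

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error (ECE).

    Bins predictions by stated confidence and returns the weighted mean
    absolute gap between each bin's mean confidence and its observed
    accuracy. A well-calibrated model (e.g., 80% of cases flagged with
    0.8 confidence are actually correct) yields an ECE near zero.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of cases in the bin
    return ece


# Well calibrated: 0.8 confidence, 8 of 10 correct -> ECE = 0
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))
# Overconfident: 0.9 confidence, only 5 of 10 correct -> ECE = 0.4
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A check like this could feed the "able to be monitored" and "auditable" elements as well, by recomputing calibration periodically on locally accumulated cases.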
Fig 1. Illustration depicting the proposed linkage of the evaluation of diagnostic algorithm performance from the defined task to implementation at the local site. Algorithms are developed according to a defined standard diagnostic task. Performance is compared with other algorithms in a controlled environment, which becomes the internal benchmark for general real-world performance and local site performance, which in turn becomes the benchmark for ongoing monitoring.
Fig 2. Phased development and evaluation process of diagnostic algorithms.