Akifumi Hagiwara1, Shohei Fujita, Yoshiharu Ohno2, Shigeki Aoki1. 1. From the Department of Radiology, Juntendo University School of Medicine, Tokyo. 2. Department of Radiology, Fujita Health University School of Medicine, Toyoake, Aichi, Japan.
Abstract
Radiological images have been assessed qualitatively in most clinical settings by the expert eyes of radiologists and other clinicians. On the other hand, quantification of radiological images has the potential to detect early disease that may be difficult to detect with human eyes, complement or replace biopsy, and provide clear differentiation of disease stage. Further, objective assessment by quantification is a prerequisite of personalized/precision medicine. This review article aims to summarize and discuss how the variability of quantitative values derived from radiological images is induced by a number of factors, how this variability is mitigated, and how standardization of the quantitative values is achieved. We discuss the variabilities of specific biomarkers derived from magnetic resonance imaging and computed tomography, and focus on diffusion-weighted imaging, relaxometry, lung density evaluation, and computer-aided computed tomography volumetry. We also review the sources of variability and current standardization efforts for rapidly evolving techniques, which include radiomics and artificial intelligence.
Quantitative imaging, defined as the extraction of quantifiable features from radiological images,[1] has been increasingly performed for the measurement of normal biological and pathological processes, patient risk stratification, evaluation of treatment response and outcome, and drug development.[2] Such features of quantitative imaging in clinical settings are called quantitative imaging biomarkers (QIBs); a biomarker is a characteristic that is objectively measured and evaluated.[3] Although the term “biomarker” is often meant to imply a measurand (the true value of the quantity intended to be measured) of laboratory assays, such as blood sugar tests, it can also denote clinical measurands such as blood pressure and metrics obtained with quantitative imaging. Quantitative imaging biomarkers are continuous variables, whereas ordinal variables, such as the PI-RADS (Prostate Imaging Reporting and Data System) with 5 numbered categories for assessment of prostate carcinoma,[4] are not considered to be QIBs.[5] Biomarkers are important in healthcare because they help a physician determine the most appropriate management for a patient's unique state of disease at the molecular level. This concept is called personalized or precision medicine. A biopsied specimen is only a small fraction of the entire tissue that is sampled at a certain time point, and spatial/temporal sampling biases are not negligible.[1] On the other hand, a QIB covers a wide segment or the whole of a subject and can provide more comprehensive spatial information concerning the tissues.
Repetitive sampling is also much easier for imaging than for a biopsy, and imaging data can be dynamically obtained in some cases in the order of seconds to milliseconds.[6,7] In addition, QIBs may enable the detection of a subclinical presentation of disease that is too subtle to be detected by human eyes[8]; this can lead to a better outcome for patients than when disease is detected after the clinical presentation is recognized. Reliable QIBs can also help foster the development of medical products in regulatory settings.[9] For example, if a QIB is qualified by the Food and Drug Administration for drug development, it could help deliver a new therapy to the public through either a traditional or accelerated approval pathway.[10]
In addition to the clinical relevance and sensitivity to the disease process, good reproducibility is the key element of a qualified biomarker.[11] Although QIBs can be used similarly to laboratory assays, their clinical application has been hindered by their generally lower reproducibility. This is partly because the extraction of most QIBs from radiological images is not yet fully automated, and it requires a radiologist or other experienced practitioner to engage in the analysis process, which introduces an inevitable variability arising from human perception.[12] Further, the variability of a QIB is also derived from acquisition hardware, software, procedures, operators, and the measurement methods. The Quantitative Imaging Biomarkers Alliance (QIBA) was established by the Radiological Society of North America (RSNA) in 2007 to advance quantitative imaging and introduce the use of QIBs in clinical trials and practice by engaging researchers, healthcare professionals, and the industry (https://www.rsna.org/en/research/quantitative-imaging-biomarkers-alliance). The mission of QIBA is to improve the value and practicability of QIBs by reducing variability across devices/sites, patients, and time.
The QIBA has been developing QIBA Profiles that standardize methods for each selected QIB to achieve a useful level of performance.[13] Claims written in the Profiles describe the performance of the QIB and focus on a quantitative interpretation of the measurements for the individual subject. Conformance to the specifications of a Profile is required not only for hardware, software, and analysis methods, but also for operators and analysts. In collaboration with QIBA, the Japan Radiological Society and European Society of Radiology have also established Japan QIBA (J-QIBA) (http://www.radiology.jp/j-qiba/english/index.html) and European Imaging Biomarker Alliance, respectively, both of which have the same goal.
This review article aims to summarize and discuss how the variability of quantitative values derived from radiological images is induced by a number of factors, how this variability is mitigated, and how standardization of the quantitative values is achieved. For the interpretation of studies related to evaluating the performance of QIBs, terminology and key statistics will be explained. We also discuss the variabilities of specific biomarkers derived from magnetic resonance imaging (MRI) and computed tomography (CT). Further, we review the sources of variability and current standardization efforts for rapidly evolving techniques, including radiomics and artificial intelligence (AI). Overall, the terminologies related to variability used in this article conform to those suggested by the QIBA Terminology Working Group in 2015.[9]
STATISTICS
In this section, we explain the statistics and related terminology to understand the literature describing the performance of QIBs and to help readers conduct an appropriate evaluation of a QIB by themselves.
Terminology
For a QIB to be clinically useful, it should be reliably comparable to known reference measurements or the true value, and repeated measurements in the same subject must be comparable to one another.[14] These properties of a QIB can be characterized by accuracy (systematic measurement error or bias) and precision (random measurement error), which are together called uncertainty.[9] Measurement bias can be estimated only when the true value is known, and it can be mitigated by improving the calibration of a measurement system. The term measurand refers to the true value of the quantity intended to be measured.[15]
Accuracy commonly describes a range of characteristics including how a measured value relates to a known reference. Accuracy in terms of quantitative imaging usually consists of linearity and bias. Linearity is the ability of a measurement to provide a value directly proportional to the measurand or a known reference. Bias is an estimate of systematic measurement error; it describes the difference between the average of measurements made on a subject and its true value or known reference.
Although reliability, agreement, precision, repeatability, and reproducibility are often used interchangeably, these terms are distinct.[9,15]
Reliability is defined as the ratio of variance based on the between-subject measurement to total variance based on the observed measurement. In other words, reliability represents how well different subjects can be distinguished from each other despite a QIB's uncertainty or measurement errors. Reliability is typically assessed by an intraclass correlation coefficient (ICC). Agreement has a broader meaning than reliability and indicates the degree of closeness between measurements made on the same subject by different observers or measurement methods.
Precision or repeatability represents the closeness between measured values obtained by replicate measurements of the same subject with the same measurement method under identical or near-identical conditions (including subject, measurement procedure, environment, and scanner) over a short period. Repeatability studies are often referred to as test-retest or scan-rescan studies.
Reproducibility describes the closeness of measurements under a set of conditions that includes different locations, operators, measuring systems, or replicate measurements on the same or similar subjects. These conditions are analogous to real clinical practice where various external factors cannot be tightly regulated.
Linearity
Linearity can be evaluated by regressing the measurements (Y values) on the true values (X values). A linear model can be fit by least squares as:
$$Y = \beta_0 + \beta_1 X$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope. If the relationship between Y and X is well explained by a line (ie, R² > 0.90), then the assumption of linearity is met.[16] Although linearity is the ideal condition, a monotonic relation (ie, the relationship of a QIB and the measurand can be described as a strictly increasing or decreasing function) is necessary and generally sufficient for a QIB to be clinically useful, because it allows every distinct value of the measurand to be discriminated.[9] However, the slope of a function is related to sensitivity. If the relationship between Y and X is nonlinear, the ability of a measurement to detect change in the measurand is not constant over the measurement range.[15] If a standard reference is lacking, another imaging measurement method for which proportionality is established can be used as the reference standard to evaluate the new imaging measurement method.
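The linearity check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a phantom with known reference values; the data and the function name `fit_line` are hypothetical and are not taken from the cited studies.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = my - b1 * mx                   # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1.0 - ss_res / ss_tot

# Hypothetical phantom data: reference values (X) and measured values (Y).
reference = [100.0, 200.0, 300.0, 400.0, 500.0]
measured = [108.0, 205.0, 310.0, 402.0, 511.0]

intercept, slope, r2 = fit_line(reference, measured)
linearity_met = r2 > 0.90               # threshold quoted in the text
print(f"intercept={intercept:.2f}, slope={slope:.3f}, R^2={r2:.4f}, linear={linearity_met}")
```

A slope close to 1 together with a high R² suggests proportionality over the tested range; it says nothing about values outside that range.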
Bias
Bias is the difference between the sampled mean and true value or known reference. %Bias is calculated by dividing the bias by the true value or known reference. If the true value or known reference is unavailable, bias cannot be evaluated. Hence, bias is typically calculated using validated phantoms with a well-defined reference. At least 5 to 7 similarly spaced values over the relevant range of true values should be chosen.[15] If the data are from various cohorts and the bias is inconsistent, a bias profile should be reported rather than a single bias value.[16] For example, bias for tumors with different sizes, shapes, and densities can be reported as a bias profile for CT volumetry.[17] Inconstant biases should be specified cautiously, especially when assessment of change in the QIB is the focus; this is because the different biases do not cancel out in calculating the change.[15] In this case, transformations, such as log-transformation, may render an inconstant bias constant. The assessments of linearity and bias are directly linked to each other; both should be presented when assessing either for the technical performance of a QIB.
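As a rough illustration of a bias profile, the sketch below computes bias and %bias at 5 reference levels of a phantom. The reference values, replicate measurements, and the function name `bias_profile` are invented for the example.

```python
from statistics import mean

def bias_profile(measurements_by_reference):
    """Map each reference value to (bias, percent_bias) of its replicate measurements."""
    profile = {}
    for ref, values in measurements_by_reference.items():
        bias = mean(values) - ref                  # sampled mean minus reference
        profile[ref] = (bias, 100.0 * bias / ref)  # %bias relative to the reference
    return profile

# Hypothetical phantom: 5 evenly spaced reference values, 3 replicates each.
phantom = {
    200.0: [202.0, 204.0, 206.0],
    400.0: [408.0, 412.0, 416.0],
    600.0: [618.0, 624.0, 630.0],
    800.0: [832.0, 840.0, 848.0],
    1000.0: [1050.0, 1060.0, 1070.0],
}

profile = bias_profile(phantom)
for ref, (bias, pct) in profile.items():
    print(f"reference={ref:.0f}  bias={bias:.1f}  %bias={pct:.1f}")
```

In this made-up example the %bias grows with the reference value, which is the kind of inconstant bias that is better reported as a profile than as a single number.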
Precision (Repeatability)
Precision, or repeatability, is concerned with whether a measurement agrees with a second measurement of the same quantity; high precision is a good indicator of the ability of a QIB to reveal an effect of treatment, identify disease, or discriminate between groups using the same scanner, sequence, software, and analysis method. Precision can be assessed by repeated imaging of a phantom, although this does not perfectly reflect the real clinical situation. When precision is assessed by repeated imaging of human subjects, the variance in measurements can be contaminated by subject-related variability due to a variety of reasons including behavioral, physiological, and psychological factors that may have changed between scans, even if the actual process of imaging acquisition remains unchanged. Further, it may be ethically inappropriate to scan subjects with repeated doses of radiation or with the use of contrast agents or tracers. A washout period also needs to be considered before a rescan if a contrast agent or tracer is used.
Precision can be expressed numerically by measures of variability such as within-subject standard deviation (wSD), within-subject coefficient of variation (wCV), or 95% precision limit.[9] The wSD represents the standard deviation of measurements from the same or similar subjects under specified conditions. The wCV for repeated measurements of a subject is the wSD divided by the mean. The wCV of a group is typically obtained by taking the square root of the mean of the per-subject wCV² values. Only precision, not biological variation, is recommended to be included when reporting the performance of a QIB. Within-subject variance may also arise from patient repositioning and scanner calibrations. If the precision varies over a range of relevant magnitudes in the measurands, a precision profile should be considered; it should be reported as a table or plot showing estimates of precision, possibly stratified by one or more variables affecting the precision.
The 95% precision limit is calculated as the repeatability coefficient (RC) or % repeatability coefficient (%RC). The standard deviation of the difference between 2 repeated measurements is $\sqrt{2} \times \mathrm{wSD}$. The repeatability coefficient is the least significant difference between 2 repeated measurements at a 2-sided significance of α = 0.05, and it is calculated as[15]:
$$\mathrm{RC} = 1.96 \times \sqrt{2 \times \mathrm{wSD}^2} = 2.77 \times \mathrm{wSD}$$
Likewise, %RC is calculated as:
$$\%\mathrm{RC} = 1.96 \times \sqrt{2 \times \mathrm{wCV}^2} = 2.77 \times \mathrm{wCV}$$
The limit of agreement (LOA), the interval containing 95% of the differences between repeated measurements on the same subjects, is −RC to +RC. The RC represents the minimum detectable difference in 2 measurements with 95% confidence. A meta-analysis of the literature can summarize an RC by taking a weighted average of the reported values.[16]
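The precision metrics can be computed from test-retest data as in the following sketch. It assumes that the pooled wSD is the square root of the mean per-subject variance, that the group wCV is the square root of the mean squared per-subject wCV, and that RC = 1.96 × √2 × wSD ≈ 2.77 × wSD; the data and the function name `repeatability_metrics` are hypothetical.

```python
import math
from statistics import mean, variance

def repeatability_metrics(replicates):
    """Return (wSD, wCV, RC, %RC) from per-subject lists of repeated measurements."""
    # Per-subject sample variance of the replicate measurements.
    within_var = [variance(r) for r in replicates]
    # Pooled wSD: square root of the mean within-subject variance.
    wsd = math.sqrt(mean(within_var))
    # Group wCV (as a fraction): square root of the mean squared per-subject wCV.
    wcv = math.sqrt(mean(v / mean(r) ** 2 for v, r in zip(within_var, replicates)))
    factor = 1.96 * math.sqrt(2.0)       # approximately 2.77
    # %RC is expressed in percent, hence the factor of 100 on the fractional wCV.
    return wsd, wcv, factor * wsd, 100.0 * factor * wcv

# Hypothetical scan-rescan pairs for 4 subjects.
scan_rescan = [[100.0, 104.0], [200.0, 196.0], [150.0, 152.0], [120.0, 118.0]]

wsd, wcv, rc, percent_rc = repeatability_metrics(scan_rescan)
print(f"wSD={wsd:.3f}  wCV={wcv:.4f}  RC={rc:.3f}  %RC={percent_rc:.2f}")
```

Under these assumptions, a change between 2 scans of the same subject that is smaller than the RC cannot be distinguished from measurement error at the 95% level.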
Bland-Altman Graph Analysis
The Bland-Altman plot provides a graphic representation of agreement in addition to the 95% LOA.[18] The 95% LOA is the interval that is expected to contain 95% of differences between the measurement and true value or the other measurement, and it is calculated using the standard deviation of the difference. The Bland-Altman plot illustrates the differences between a measurement method and another one, or the true value, plotted against their mean. If the true value is used, one may plot the differences against the true value instead of their mean. The differences can also be expressed as percentages, which is useful when the variability of the difference increases as the magnitude of the measurement increases. The Bland-Altman plot also helps to demonstrate the relationship between bias and variance.
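The quantities behind a Bland-Altman plot can be sketched as follows. The example assumes 2 measurement methods applied to the same subjects and uses the conventional mean difference ± 1.96 × SD of the differences for the 95% LOA; the data and the function name `bland_altman` are hypothetical, and plotting is left out to keep the sketch dependency-free.

```python
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Return (pair means, differences, mean difference, 95% limits of agreement)."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2.0 for a, b in zip(method_a, method_b)]
    mean_diff = mean(diffs)              # average disagreement between the methods
    sd_diff = stdev(diffs)               # sample SD of the differences
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    return means, diffs, mean_diff, loa

# Hypothetical paired measurements of the same 5 subjects.
method_a = [10.0, 12.0, 14.0, 16.0, 18.0]
method_b = [9.0, 12.0, 13.0, 17.0, 17.0]

pair_means, diffs, mean_diff, (loa_low, loa_high) = bland_altman(method_a, method_b)
print(f"mean difference={mean_diff:.2f}, 95% LOA=({loa_low:.2f}, {loa_high:.2f})")
```

Plotting `diffs` against `pair_means` with horizontal lines at the mean difference and the 2 limits gives the usual graph; a trend in the scatter would suggest that the differences depend on the magnitude of the measurement.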
Intraclass Correlation Coefficient
Instead of reporting the components of uncertainty (eg, bias and precision) in a separate manner, ICC can also be used to summarize the uncertainty.[9] The intraclass correlation coefficient considers both the within-subject variance originating from measurement error and the variance originating from the difference between subjects.[19] The ICC is the fraction of the total variance that is attributed to the subjects and is calculated as:
$$\mathrm{ICC} = \frac{\sigma^2_{\mathrm{between}}}{\sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}}}$$
where $\sigma^2_{\mathrm{between}}$ is the between-subject variance and $\sigma^2_{\mathrm{within}}$ is the within-subject variance due to measurement error. If the measurement error is small compared with the true variance between subjects, ICC approaches 1. Although subjective, adjectives to describe ranges of ICC values include the following: poor (0 to 0.5), moderate (0.5 to 0.75), substantial (0.75 to 0.9), and excellent (0.9 to 1).[20] A moderate ICC can be considered sufficient when a measurement is used for group-level comparisons for research purposes. However, if a measurement is used in individual patients for important clinical decisions, an excellent ICC is required.[21] The ICC puts measurement error in perspective: when between-subject variance is large, a given measurement error is of less concern. However, ICC depends on the subject population being studied,[22,23] and an ICC calculated for one group of subjects may not be applicable to another population. For example, when ICC is calculated for a group of healthy subjects, it may become unacceptably low because a group of healthy subjects tends to be homogeneous and biological variance is low. However, ICC may be acceptable when calculated for a group of patients, which may typically be more heterogeneous than a group of healthy subjects. Further, when we assess subtle differences within a subject (eg, evaluating treatment response), ICC is often impractical. In this case, precision reported by RC would be more suitable, as the RC shows the smallest within-subject change that can be reliably detected.
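The dependence of ICC on the study population can be illustrated with a small sketch. It assumes the one-way random-effects form of the ICC estimated from ANOVA mean squares, which is only one of several ICC variants and is chosen here for simplicity; the 2 datasets are invented so that the measurement error is identical and only the spread between subjects differs.

```python
from statistics import mean

def icc_oneway(data):
    """One-way random-effects ICC for n subjects with k replicates each."""
    n, k = len(data), len(data[0])
    grand_mean = mean(v for row in data for v in row)
    row_means = [mean(row) for row in data]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((v - m) ** 2 for row, m in zip(data, row_means) for v in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scan-rescan data with the same measurement error in both groups.
heterogeneous_group = [[10.0, 11.0], [20.0, 19.0], [30.0, 31.0], [40.0, 39.0]]
homogeneous_group = [[24.0, 25.0], [26.0, 25.0], [25.0, 26.0], [27.0, 26.0]]

icc_heterogeneous = icc_oneway(heterogeneous_group)
icc_homogeneous = icc_oneway(homogeneous_group)
print(f"heterogeneous: {icc_heterogeneous:.3f}, homogeneous: {icc_homogeneous:.3f}")
```

Although every subject in both groups differs by exactly 1 unit between scans, the ICC is near 1 in the heterogeneous group and below 0.5 in the homogeneous one, which is why an ICC should be read together with the population it was estimated in.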
Pearson Correlation Coefficient
The Pearson correlation coefficient has been frequently used to compare repeated measurements or a new measurement technique with the old one. However, this approach only evaluates the linear association between 2 measurements without consideration of bias, and it does not give an indication of repeatability or agreement.[18] Further, a large between-subject variation makes the correlation coefficient higher. High correlation coefficients may be achieved for 2 QIBs with wide ranges, even when they are in poor agreement, such as when one is twice the size of the other.
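The last point can be checked numerically. In the hypothetical example below, one method returns exactly twice the value of the other; the Pearson coefficient is written out by hand to avoid depending on a particular Python version.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements: method B always reports twice the value of method A.
method_a = [10.0, 20.0, 30.0, 40.0, 50.0]
method_b = [2.0 * v for v in method_a]

r = pearson_r(method_a, method_b)
mean_difference = mean(b - a for a, b in zip(method_a, method_b))
print(f"r={r:.3f}, mean difference={mean_difference:.1f}")
```

The correlation is perfect even though the 2 methods disagree by 30 units on average, so a high correlation alone does not establish agreement.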
Reproducibility Coefficient and Multicenter Study
The reproducibility coefficient (RDC) is a measure of precision that is used when scanners, imaging procedures, location, operator, analysts, and/or algorithms differ at 2 time points. It shows the minimum detectable difference between 2 repeated measurements performed under different conditions with 95% confidence, and it can be measured directly from clinical studies.[16] Just like RC and %RC, RDC and %RDC are calculated as 2.77 wSD and 2.77 wCV, respectively, but from measurements acquired under different rather than unchanged imaging conditions. An example of a reproducibility study may compare the volume of an organ measured by CT with that measured by MRI for the same organ of the same subject.
Reproducibility is especially important in multicenter studies where reproduction of the same measurement is required across different centers and often with different kinds of scanners.[24] Multicenter studies of human subjects enable a comprehensive investigation of the disease. This is an advantage, especially when the disease is rare. However, if reproducibility across scanners is low, variability across scanners may reduce the statistical power of detecting differences between groups and annul the benefit of using data from multiple centers.[25] Mitigation of variation across centers can be achieved by (1) setting scanning and analysis procedures to be as identical as possible so that any systematic errors are replicated across participating centers[26] or (2) aiming for high accuracy at each center.[27] Although standardizing the scanning protocol is the simplest method for reducing measurement variabilities, differences in the scanners produced by different vendors may prevent identical protocols from being used at every site. In a cross-sectional study that compares groups, all groups should be included at each center, and the effect of the center should be added as a covariate in the statistical analysis.[11]
Meta-analysis of Technical Performance Studies
Before a QIB is accepted for clinical use, performance metrics, such as repeatability and reproducibility, should be evaluated. Ideally, this evaluation should involve summaries from multiple studies to overcome any limitations arising from a small sample size (typically 10 to 20 subjects) of a study concerning technical performance and include a wider range of relevant clinical settings and patient populations.[28] Although a meta-analysis of any technical performance metric is theoretically feasible, a meta-analysis of reproducibility and agreement is more complicated than that of repeatability because the studies that assess reproducibility and agreement are more heterogenous than those of repeatability. For example, a reproducibility study can be performed using scanners of the same type across different sites, scanners of different types from the same vendor, or different scanners from multiple vendors. Generally, reproducibility of the measurement decreases in this order.
VARIABILITY SOURCES, STANDARDIZATION, AND HARMONIZATION
This section focuses on the variability sources common to QIBs (Fig. 1) and how these variabilities could be mitigated. Variability sources specific to the modality or each biomarker will be discussed later in the corresponding section. The degree of measurement imperfections in comparison to the pathophysiological changes due to disease determines the significance of measurement imperfections for each QIB and hence the amount of effort required to be taken to reduce such variabilities. This effort may include building and keeping quality assurance (QA) at each center and improving the acquisition/analytical method. For example, the MAGNIMS (magnetic resonance in multiple sclerosis) research group has led a number of multicenter studies on MS, which occasionally included MRI physicists traveling to different centers in Europe, sometimes with a phantom, to decrease the measurement variability.[11]
FIGURE 1
Variability sources of quantitative imaging biomarkers.
Patient Positioning and Movement
The operator should be trained for adequate and consistent positioning of the phantoms/human subjects. Movement of the subject during and between scan sequences can cause artifacts and degrade the image quality. This can be mitigated by paying careful attention to the comfort level of the subject. Involuntary movement, such as respiratory, cardiac, and gut motion, can also cause degradation of the image.
Region of Interest
The size of the region of interest (ROI) has been known to affect the repeatability and reproducibility of QIBs.[29] A QIB is often measured as the mean value of a map within an ROI, where an increase in the size of the ROI leads to a smaller variance (ie, higher repeatability/reproducibility) of the measurement. This is important when the ROI size can be variable due to treatment response or disease progression, when monitoring the effect of treatment on lesions such as tumors,[30] or when the ROI size is small such as in the case of measuring multiple sclerosis focal plaques.[31-33] The appropriate selection of an ROI size and estimation of size effect would help adjust the decision threshold by a QIB in monitoring a treatment effect.[29] The ROI placement procedure can also be variable among radiologists. Before starting a clinical trial involving a number of radiologists, especially when they are from different sites, the variability across them should be assessed and desirably standardized. Region of interest placement using automated techniques, such as deep learning, is a possible approach[34] that reduces the burden of clinicians and may increase both repeatability and reproducibility.
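The effect of ROI size on the variance of an ROI mean can be illustrated with a toy simulation. It assumes independent Gaussian noise in every voxel, which real images generally violate because of spatial correlation, so the numbers only indicate the direction of the effect; the true value, noise level, and function name `roi_mean_sd` are hypothetical.

```python
import random
from statistics import stdev

def roi_mean_sd(n_voxels, true_value=1000.0, noise_sd=50.0, n_repeats=500, seed=0):
    """SD of the ROI mean over simulated repeat scans with independent voxel noise."""
    rng = random.Random(seed)            # fixed seed for a reproducible sketch
    roi_means = []
    for _ in range(n_repeats):
        voxels = [rng.gauss(true_value, noise_sd) for _ in range(n_voxels)]
        roi_means.append(sum(voxels) / n_voxels)
    return stdev(roi_means)

sd_small_roi = roi_mean_sd(10)           # eg, a small focal lesion
sd_large_roi = roi_mean_sd(250)          # eg, a large homogeneous region
print(f"SD of ROI mean: small ROI={sd_small_roi:.2f}, large ROI={sd_large_roi:.2f}")
```

Under the independence assumption the SD of the ROI mean falls roughly as 1/√N, so a lesion that shrinks during treatment is measured with progressively lower precision.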
Observer
If an analysis (eg, ROI placement) involves observers, the observers should be trained to follow a set of well-defined rules. Agreement across observers should also be assessed using ICC. There is a possibility of a practice effect, so observers should perform the actual analysis after reaching the plateau of their learning curve.[5] Software for semiautomated or fully automated analyses will increase the repeatability. Automated techniques using deep learning trained with a large dataset are expected to reduce the variability in tasks such as tumor segmentation.[35]
Hardware and Software Upgrades
Hardware and scanner software upgrades may introduce more bias and/or less precision in the derived QIB.[36] For example, Keenan et al[36] showed that the variable flip angle (VFA) T1 measurements on upgraded systems (hardware and software) had an overestimation of approximately 18% compared with the measurements of the original system (Fig. 2). Lee et al[37] also showed that a consistent bias of up to 3% was observed between VFA T1 measurements before and after a scanner software upgrade. Even when performing a study that uses only a single scanner, consistent versions should be used for both hardware and software.
FIGURE 2
Variable flip angle measurement of T1 relaxation time on the original and postupgrade systems. Following upgrade, variable flip angle overestimates the T1 compared with the original measurements. Postupgrade measurements were completed using 3 different head coils (coils A–C) and the body coil (reproduced with permission from Keenan et al[36]).
Standardization
The performance of QIBs can be assessed with the true value (eg, phantoms, digital reference objects [DROs], simulation, and test-retest datasets in which no change is assumed), a reference standard, or without a reference standard (eg, agreement studies between algorithms and studies of algorithm precision). Phantoms and simulation data are cost-effective and reliable, and can be obtained in large amounts. Phantoms can be scanned repeatedly without any ethical constraints and are relatively easy to transport between centers. However, one must bear in mind that a QIB optimized on a phantom or simulation data may not work well on in vivo data, due to the lack of realism. For example, pulmonary nodules in a phantom have several characteristics that may differ from human pulmonary nodules, including sharp margins, smooth surfaces, elemental shapes (spheroids and conics), homogeneous density, no vascular interaction, and no motion artifact. An algorithm that is optimized for any of these properties may appear to have an overly optimistic performance and may not show high performance for real in vivo data.
In human studies assessing QIB performance, the true value is often unavailable. Although histology or pathology tests are usually considered as the true value, these are more appropriately referred to as reference standards; they are described as well-accepted or commonly used methods for measuring a biomarker but also have associated bias and/or random error.[38] For example, histology and pathology are known to be affected by fixation and staining, spatial and temporal sampling errors due to heterogeneity in tissue and the difference in time between imaging and sampling, the nonquantitative nature of the histopathology examinations, and subjective interpretations by humans.
Specialist markings (eg, setting a boundary of a region for volumetry) are also sometimes considered as the true value, but they often have a variable degree of interreader variation and should be considered as a reference standard.[39]
Data Harmonization
Although a multicenter study has the potential to increase statistical power, the inclusion of different scanner vendors, acquisition protocols, image reconstruction algorithms, and field strengths results in unwanted systematic variation. Data harmonization aims to remove these variations retrospectively after acquisition while preserving the biological variability. Harmonization can be performed using data from traveling human subjects acquired at each site by determining a scanner-specific correction factor.[40-42] If only postprocessed data (eg, fractional anisotropy map and cortical thickness) are available, regression analysis or more sophisticated statistical approaches can be performed for harmonization.[43,44] Harmonization of raw data is particularly important for diffusion-weighted imaging (DWI) data to be analyzed by multicompartment models or tractography, and model-free methods for harmonization of raw DWI signals have also been suggested.[45,46] Harmonization by deep learning has also been proposed, but the algorithm should be trained on data of the same subjects acquired on the different scanners that are intended to be harmonized.[47]
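A scanner-specific correction factor derived from traveling subjects can be sketched as below. The sketch assumes a purely multiplicative site effect and a single reference site, which is a deliberate simplification and not necessarily the model used in the cited studies; the site names, values, and function name `site_correction_factors` are hypothetical.

```python
from statistics import mean

def site_correction_factors(traveling_data, reference_site):
    """Multiplicative factor per site that maps its values onto the reference site."""
    reference_mean = mean(traveling_data[reference_site])
    return {site: reference_mean / mean(values) for site, values in traveling_data.items()}

# Hypothetical measurements of the same 3 traveling subjects at 2 sites.
traveling_data = {
    "site_A": [0.80, 0.70, 0.90],
    "site_B": [0.88, 0.77, 0.99],        # reads 10% higher than site_A
}

factors = site_correction_factors(traveling_data, reference_site="site_A")

# Apply the site_B factor to new, non-traveling subjects scanned at site_B.
site_b_patients = [0.66, 0.95]
harmonized = [value * factors["site_B"] for value in site_b_patients]
print(factors, [round(v, 3) for v in harmonized])
```

Because the factor is estimated from the same subjects at every site, it captures the scanner effect without removing biological differences between the local cohorts.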
MAGNETIC RESONANCE IMAGING
Magnetic resonance imaging can extract a variety of quantitative tissue properties that include not only length and volume but also relaxation properties (T1, T2, and T2*), diffusion, perfusion, phase, fat fraction, temperature, tissue chemical properties (eg, spectroscopy and chemical exchange saturation transfer), and physical properties (eg, elastography).[48] However, a large number of variabilities in image acquisition methods and postprocessing algorithms hinder the extraction of accurate and reproducible quantitative information from MRI. In this section, we discuss sources of variability in QIBs that are specific to MRI and the importance of periodic QA to maintain sufficient accuracy and precision. We also discuss the current body of knowledge regarding the standardization of quantitative MRI metrics that are fundamental in MRI.
Temperature
Temperature control is required for phantom scanning. T1 and T2 of Ni-DTPA were reported to change 0.2% to 1% and approximately 1.3% to 1.5%, respectively, per °C at a temperature of approximately 21°C.[49] The apparent diffusion coefficient (ADC) of pure water changes approximately 3% per °C at room temperature.[50] The phantom should be stored in the MRI room and reach a temperature close to that in the magnet bore so that the temperature of the phantom does not fluctuate during the scan. The temperature of the phantom should be measured and recorded after the scan is complete. Conversion of the acquired values to those at a standard temperature might be possible.[11] In the case of a human scan, temperature control is assumed to be unnecessary because homeostasis provides intrinsic temperature control. However, core temperature can increase by more than 1°C during an MRI scan, especially at 3 T for obese subjects, and this thermal effect on quantitative MRI remains to be investigated.
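A conversion to a standard temperature could look like the following sketch for the ADC of a water phantom. It assumes a first-order linear dependence of about 3% per °C around room temperature, taken from the figure quoted above; the reference temperature of 20°C, the example values, and the function name are assumptions made for illustration, and a calibrated temperature model should be preferred where one is available.

```python
def adc_at_reference_temperature(adc_measured, temp_measured_c,
                                 temp_reference_c=20.0, change_per_degree=0.03):
    """First-order conversion of a measured ADC to a reference temperature.

    Assumes the ADC changes by `change_per_degree` (fractional) per degree Celsius,
    which is only a local approximation around room temperature.
    """
    scale = 1.0 + change_per_degree * (temp_measured_c - temp_reference_c)
    return adc_measured / scale

# Hypothetical phantom measurement: ADC of 2.12 (x10^-3 mm^2/s) recorded at 22 degrees C.
adc_corrected = adc_at_reference_temperature(2.12, temp_measured_c=22.0)
print(f"ADC converted to 20 degrees C: {adc_corrected:.3f}")
```

The 2°C difference in this example corresponds to a correction of about 6%, which is larger than many of the biases discussed in this review and shows why the phantom temperature should be recorded.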
B1 Field Nonuniformity
Nonuniformity in the radiofrequency transmit field (B1+) is the major cause of error in quantitative MRI, especially when using high magnetic fields and surface coils for transmission.[50] Body coil excitation is preferable for uniform transmission.[26] Calibration of the transmitter output can be carried out periodically as part of routine maintenance. The accuracy of the flip angle depends on the B1+ inhomogeneity at a given spatial location, which can be measured by B1+ mapping.[51] Notably, every vendor uses their own radiofrequency pulse shapes that lead to variability in flip angle, complicating comparison across scanners from different vendors. The acquired B1+ map can be used in the measurement of tissue parameters such as T1 and magnetization transfer to correct for the difference between the achieved flip angle and the intended one.[52,53] The B1+ field is smooth compared with anatomical structures, even at high field strengths, so B1+ maps are often acquired at low resolution to spare acquisition time.[54]
With the increasing use of a large number of coils and parallel imaging, the receive sensitivity field (B1−) nonuniformity should be addressed. B1− nonuniformity used to be measured from a B1+ map based on the reciprocity principle (B1+ = B1−) if the excitation and receiving are done by the same coil. The reciprocity principle can still be used when different coils are used by performing an additional acquisition in which the transmit coil is used for receiving.[55] However, the reciprocity principle becomes less accurate at a field strength of 3 T or higher.[56,57] B1− nonuniformity affects the spatial distribution of image intensity and thereby any quantitative MRI, especially the measurement of proton density and absolute metabolite concentration. The receiver gain can be automatically set or changed during the prescan procedure, but it should be fixed during the acquisition of the image series.
B0 Field Nonuniformity
When an object is placed in the magnet, the magnetic susceptibility of the object alters the static magnetic field B0 in the object slightly. The shim coil usually adjusts to obtain a spatially uniform B0 distribution. However, for extended fields of view, observable deviations from uniformity and image degradation can occur in the periphery.[58] Spatially varying tissue susceptibility, especially at the air-tissue interface, can also induce B0 field nonuniformity. This is one cause of a generally higher repeatability and reproducibility for in vitro phantom studies than for in vivo human studies. In human subjects, the ROIs have to be placed at varying locations, and signal variability increases as the pixel departs from the isocenter of the magnet. Proton density fat fraction is vulnerable to B0 field nonuniformity because it is difficult to differentiate the phase shifts due to B0 nonuniformity from those due to the chemical shift used for extracting the fat signal.[59]
Field Strength
Some tissue parameters, including ADC, diffusion tensor imaging, proton density, volume, and perfusion, are independent of field strength. However, a higher field strength may contribute to an increased signal-to-noise ratio. Other parameters, including T1, T2, and magnetization transfer, are dependent on field strength.[60]
Quality Assurance in Quantitative MRI
Quality assurance is an ongoing process of ensuring that the instrument continues to operate adequately.[61-63] To use a QIB in a clinical routine, regular QA on a weekly basis (possibly on a daily basis as an initial assessment) is required. Quality assurance for quantitative MRI can be performed in healthy controls and/or in phantoms. Phantoms have the advantage of providing accurate values and being stable and always available. Phantoms and analysis software are ideally developed specifically for each QIB to address some of the variabilities. Anthropomorphic phantoms have been developed for certain body parts including the breast,[64] prostate,[65] and brain,[66] considering the fact that the spatial relationship between scan objects and the coil affects the patterns of field inhomogeneity. For example, the breast phantoms were developed partly because the previous phantoms were not physically compatible with a breast coil.[64] The properties of the phantom may vary over time because the material can become unstable through fungal invasion, chemical decay, evaporation, or contamination by water vapor. Temperature dependence is also a problem, whereas human temperature is homeostatically controlled.[11] Further, the realism of a phantom may not be sufficient because many potential sources of variability in vivo (eg, movement, positioning variability, and B1 variation due to subject shape) are absent. Normal white matter can be a standard for some MRI parameters (eg, ADC or magnetization transfer) because the normal biological range is narrow. In a multicenter study, standardized QA procedures should be followed by all institutions to keep the acquired data as uniform as possible.
Computer-simulated phantoms, or DROs, can also be used to evaluate the propagation of error in quantitative MRI, covering error from the measurement and bias from parameter constraints or assumptions, as well as error from noise. However, simulations often do not match measurements in vivo because biological effects other than those being simulated are neglected.[67]
Diffusion-Weighted Imaging
One of the most widely investigated MRI QIBs in clinical trials is the ADC derived from DWI, which is sensitive to the random motion of water molecules.[68] Although DWI is used clinically as a qualitative indicator of disease presence, ADC has been investigated in clinical trials for diagnosis, staging tumors, assessing treatment response, and predicting tumor aggressiveness. However, confidence in its use has not been fully established due to differences across scanners and populations, which hinders the use of ADC in the clinical workflow. The complexity of the tissue structures makes ADC dependent on a number of factors including pulse sequence construction,[69] acquisition parameters, modeling techniques, anatomic regions being evaluated, and the subject orientation with respect to the diffusion directions.[68,70-72] An example of systematic ADC variations arises from a scanner upgrade to a high-end machine that allows a shorter echo time for improving DWI quality, which would shorten the diffusion time, possibly decrease the visibility of acute brain infarction, and increase the measured ADC value (Fig. 3).[71,73] It is accepted that ADC is independent of field strength,[74,75] although higher field strengths may be beneficial due to improvements in signal-to-noise ratio. Huo et al[76] reported lower variance in ADC measurement at 3 T compared with 1.5 T. Control of these variabilities may enable ADC to replace biopsy, such as for the differential diagnosis between tumor recurrence and necrosis.[77]
FIGURE 3
Apparent diffusion coefficient dependency on diffusion time. Diffusion-weighted imaging with short (A), intermediate (B), and long diffusion times (Δeff) (C) shows acute infarction at the right paramedian aspect of the pons, responsible for medial longitudinal fasciculus syndrome. Diffusion-weighted imaging with short Δeff (A) demonstrated decreased contrast of the lesion with the surrounding tissue compared with diffusion-weighted imaging with longer Δeff (B and C). D–F, Images show ADC maps of corresponding diffusion-weighted imaging. The apparent diffusion coefficient values of the lesion were increased with short Δeff (D) compared with long Δeff (F) (reproduced with permission from Boonrod et al[71]).
The presumption for using ADC in clinical practice for managing tumors is that treatment-associated change in the microenvironment precedes changes in the lesion size, thereby encouraging the use of ADC as a biomarker of treatment response.[78] Conformance to the specifications of the QIBA DWI Profile[13] by all relevant staff, scanner, and software involved in ADC acquisition/measurement supports the following claim: a measured change in the ADC of lesions in the brain,[79-81] liver,[82-85] prostate,[86-90] and breast[91,92] of 11%, 26%, 47%, and 13% (each denotes %RC), respectively, or larger indicates that a true change has occurred with 95% confidence. Due to the intrinsic dependence of the measured ADC on biophysical tissue properties, these claims are organ specific. Notably, the Profile requires usage of the same scanner and image acquisition parameters for baseline and subsequent measurements with periodic QA (Fig. 4). Estimation of reproducibility based on previous studies is more complicated than repeatability because the reproducibility condition is heterogeneous among studies. When interscanner CV is evaluated, one should note whether the scanners are from the same vendor or from different vendors, because significant intervendor bias in ADC measurement of the brain has been reported with a %bias up to 7%.[93]
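The organ-specific claim above amounts to a threshold test on the relative ADC change. The following is a minimal Python sketch of that test, using the %RC values quoted in the text. The function name and the choice to express the change relative to the baseline value are assumptions made for illustration, and the test is only meaningful when both measurements satisfy the Profile conditions (same scanner, same acquisition parameters).

```python
# %RC values quoted in the text for the QIBA DWI Profile claim (95% confidence).
PERCENT_RC = {"brain": 11.0, "liver": 26.0, "prostate": 47.0, "breast": 13.0}


def is_true_adc_change(baseline_adc, followup_adc, organ):
    """Return True if the relative ADC change reaches the organ-specific %RC.

    Hypothetical helper: the change is expressed relative to the baseline
    value (an assumption of this sketch).
    """
    if baseline_adc <= 0:
        raise ValueError("baseline ADC must be positive")
    percent_change = 100.0 * abs(followup_adc - baseline_adc) / baseline_adc
    return percent_change >= PERCENT_RC[organ]
```

For example, a 12% change would be classified as a true change for a brain lesion but not for a liver lesion, which reflects the organ specificity of the claim.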
FIGURE 4
Typical quantitative diffusion-weighted magnetic resonance imaging trial workflow for treatment response assessment with key QIBA (Quantitative Imaging Biomarkers Alliance) Profile activities (reprinted with permission from QIBA RSNA diffusion-weighted imaging profile v:12.20.2019: https://qibawiki.rsna.org/index.php/Profiles).
Before a multicenter trial, qualification of each site should be assessed according to the specific protocol for the site's ability to adopt a standardized acquisition protocol and image analysis. The performance of ADC in each site should be assessed by the ice-water DWI phantom.[50,94,95] Although substrates, such as sucrose,[96] alkane,[97] and copper sulfate,[98] have been used to achieve a wide range of ADC values, sensitivity of ADC values to temperature variation has been problematic; ADC of pure water changes approximately 3% per °C.[50] The ice-water phantom was designed to eliminate thermal variability and keep the phantom at 0°C by filling the phantom with an ice-water bath. The inner tube is typically filled with distilled water,[94,95] giving an ADC of 1.1 × 10−3 mm2/s, or alternatively, polyvinylpyrrolidone solutions with a range of ADC values.[99]
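As a point of reference for phantom checks, ADC is commonly estimated from two b-values under a monoexponential signal model, S(b) = S(0) exp(−b · ADC). The sketch below is a minimal Python illustration of that estimate. The function name is hypothetical, and the two-point monoexponential model is an assumption of this example, not a description of any specific vendor or Profile implementation.

```python
import math


def adc_two_point(signal_low_b, signal_high_b, b_low, b_high):
    """Estimate ADC (mm2/s) from two diffusion-weighted signals.

    Assumes monoexponential decay: S(b) = S(0) * exp(-b * ADC),
    with b-values given in s/mm2.
    """
    if signal_low_b <= 0 or signal_high_b <= 0:
        raise ValueError("signals must be positive")
    if b_high <= b_low:
        raise ValueError("b_high must be larger than b_low")
    return math.log(signal_low_b / signal_high_b) / (b_high - b_low)
```

With a simulated ice-water signal generated from an ADC of 1.1 × 10−3 mm2/s at b = 0 and b = 1000 s/mm2, the function returns the same ADC value, which is the kind of consistency check a site qualification might start from.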
Magnetic Resonance Relaxometry
The signal intensity of conventional MR images, such as T1- and T2-weighted images, depends on many acquisition parameters and MR scanner variations. Thus, absolute signal intensity has no direct meaning, and the evaluation of MRI scans mainly involves comparison with surrounding tissues in the same slice. Absolute quantification of longitudinal relaxation time (T1), transverse relaxation time (T2) or their inverse relaxation rates (R1 and R2), and proton density (PD) provides an absolute scale; hence, it enables a more objective evaluation of development,[100] aging,[101] and diseases.[102]
Proton density indicates the amount of protons detectable by MRI and is proportional to the MRI signal intensity. Calculation of PD is based on the estimation of the magnetization at equilibrium (M0), which represents the signal intensity in the absence of any relaxation.[103]
T1 relaxation time characterizes the approach of the polarized spins to equilibrium in the direction of the external magnetic field, and it is affected by a number of tissue properties including free water content, macromolecules, iron, and gadolinium chelate. T1 values significantly increase with the field strength.[60] The criterion standard method for measuring T1 relaxation time is the inversion recovery technique, in which only one echo is acquired at a time and full spin relaxation is awaited (approximately 5 T1 periods) before the next spin inversion. This technique is time-consuming and infeasible in clinical settings. To improve time efficiency, 2 other techniques, namely, the Look-Locker (LL)[104] and VFA[105] techniques, were introduced. Stikov et al[52] compared inversion recovery, LL, and VFA techniques using a phantom and the brains of healthy volunteers. Although these techniques agreed well on the phantom, LL and VFA consistently underestimated and overestimated, respectively, the T1 values measured by the inversion recovery technique. 
The deviations reached over 30% in WM, from 750 milliseconds (LL) to 1070 milliseconds (VFA). They found that major sources of differences were inaccurate B1+ mapping and incomplete spoiling of transverse magnetizations. Thus, they concluded that quality assessment of T1 mapping techniques should be performed both for a phantom and in vivo.
T2 relaxation time indicates the rate at which the transverse component of magnetization decays to zero, and it is primarily driven by nearby nuclei. By changing the echo time, T2 relaxation time can be measured by the spin echo technique with as few as 2 measurements, assuming monoexponential decay; however, the measurement suffers from partial volume and is susceptible to noise.[106] Further, the acquisition time is very long because full T1 relaxation is required during the acquisitions. Multiecho T2 (MET2) accelerates the spin echo method by using multiple refocusing pulses at increasing TE. A version of MET2, termed the Carr-Purcell-Meiboom-Gill sequence, accelerates MET2 by incorporating a 180-degree phase increment to refocusing pulses and is now considered to be the criterion standard for T2 measurement.[107,108] Another approach for T2 measurement is driven equilibrium single-pulse observation of T2 (DESPOT2), which uses a balanced steady-state free precession pulse sequence.[109] An alternative approach to T2 measurement is to separate the imaging section of the sequence from T2-weighting using the T2 preparation pulse (T2-prep), which enables acquisition of multiple echoes with a fast imaging technique.[110] As with T1 measurement, all these methods are affected by B1+ and B1− inhomogeneities. 
T2 measurements are also affected by magnetization transfer effects between the water and macromolecular protons, resulting in diminished signal in the free water and inaccurate T2 measurement.[111,112] Similar to the discrepancy among T1 measurement methods, T2 measurement methods are known to show disagreement with each other.[113] Jutras et al[114] reported that WM T2 values of 70 and 50 milliseconds were measured by MET2 and DESPOT2, respectively, in the same subject, partly due to the different weighting of each tissue component by these methods.
Instead of measuring T1, T2, and PD separately, these values can also be measured simultaneously. Simultaneous measurement has attracted attention for its merit in inherent alignment of the acquired maps and potential reduction in scan time. Two major approaches are quantitative synthetic MRI[102] and MR fingerprinting (MRF).[115] Quantitative synthetic MRI is commonly performed by a 2D multidynamic multiecho (MDME) sequence, which is a turbo spin-echo sequence typically performed with 4 delay times and 2 echo times in the brain so that the scan time is clinically feasible (approximately 6 minutes for whole-head coverage) (Fig. 5). B1+ field measurements can be simultaneously performed based on the same acquisition data.[116] The performance of the MDME sequence was examined on 3 scanners from 3 different vendors.[117] The highest intrascanner wCVs for T1, T2, and PD were 2.07%, 7.60%, and 12.86%, respectively, for the ISMRM/NIST standardized phantom, and 1.33%, 0.89%, and 0.77%, respectively, for healthy volunteer brains. The highest interscanner wCVs of T1, T2, and PD were 10.86%, 15.27%, and 9.95%, respectively, for the phantom, and 3.15%, 5.76%, and 3.21%, respectively, for the volunteer brains. 
Estimating the myelin volume fraction in each voxel by using a 4-compartment model based on the MDME data has also been suggested[118] and applied to diseases such as multiple sclerosis[33,119] and Sturge-Weber syndrome.[120] For applications in other organs, the sequence may have to be adjusted to each target tissue.[121,122] Because radiologists are not accustomed to reading parameter maps, synthetic MRI techniques have also been applied to the MDME data. Synthetic MRI enables the creation of clinically used contrast-weighted images including T1-weighted, T2-weighted, and fluid-attenuated inversion recovery (FLAIR) images based on the T1, T2, and PD maps.[123] Although the image quality of synthetic FLAIR is generally perceived to be inferior to FLAIR acquired by conventional methods, improvement of synthetic FLAIR quality by deep learning has also been suggested.[124] Hence, relaxometry data derived from MDME may become an adjunct to contrast-weighted images in clinical settings. The effect of variability in in-plane resolution on the volumetry based on the MDME data was found to be small in healthy volunteers and MS patients, presumably because the segmentation algorithm considers tissue partial volumes in the interval of 0% to 100% rather than assigning a single tissue type to each voxel.[125,126] The 3-dimensional (3D) version of the MDME, namely, 3D-QALAS (3D-quantification using an interleaved LL acquisition sequence with T2-prep pulse), was recently developed for the heart[127] and has also been applied to the brain.[128,129] Synthetic MR angiography constructed by deep learning is also feasible based on the 3D-QALAS data of high resolution.[130]
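To illustrate how a contrast-weighted image can be derived from T1, T2, and PD maps, the following Python sketch evaluates an idealized spin-echo signal equation, S = PD · (1 − exp(−TR/T1)) · exp(−TE/T2), voxel by voxel. This textbook equation and the function names are assumptions chosen for illustration; commercial synthetic MRI software may use different or more elaborate signal models.

```python
import math


def synthetic_spin_echo_signal(t1_ms, t2_ms, pd, tr_ms, te_ms):
    """Idealized spin-echo signal for one voxel (illustrative model only)."""
    if t1_ms <= 0 or t2_ms <= 0:
        return 0.0  # treat voxels without valid relaxation times as background
    return pd * (1.0 - math.exp(-tr_ms / t1_ms)) * math.exp(-te_ms / t2_ms)


def synthesize_image(t1_map, t2_map, pd_map, tr_ms, te_ms):
    """Apply the signal model to flattened T1, T2, and PD maps of equal length."""
    return [
        synthetic_spin_echo_signal(t1, t2, pd, tr_ms, te_ms)
        for t1, t2, pd in zip(t1_map, t2_map, pd_map)
    ]
```

Because TR and TE are free parameters of the calculation, the same quantitative maps can be re-evaluated with different virtual settings after the scan, which is the property that makes synthetic contrast-weighted images possible.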
FIGURE 5
Quantification using quantitative synthetic magnetic resonance imaging. The QRAPMASTER acquisition was applied to retrieve the R1 map (top row), R2 map, and proton density map. Based on these maps, conventional (eg, T2-weighted) images can be synthesized (middle row). Furthermore, the R1, R2, and proton density maps provide an absolute scale and hence a robust input to brain segmentation. An example of one of these segmentations (of myelin) is shown in the bottom row. The quantitative synthetic magnetic resonance imaging method provides maps that are independent of the magnetic resonance scanner and hence provide the same result on all major platforms. For this example, the subject was scanned at 3.0 T on a GE 750 (A), Siemens Skyra (B), Philips Ingenia (C), and at 1.5 T on a GE 450 W (D), Siemens Aera (E), and Philips Ingenia (F) (adapted and reproduced with permission from Hagiwara et al[116]).
Another promising approach of simultaneous relaxometry is MRF. In contrast to quantitative synthetic MRI, MRF adopts a novel approach that does not rely on a traditional curve fitting approach. In MRF, radiofrequency pulses and repetition times are simultaneously varied in a pseudorandom fashion to create signal evolutions that characterize the various relaxation processes unique for each type of tissue (so-called fingerprint).[131] The acquired signal evolutions are pattern-matched against a separately simulated dictionary data, allowing the extraction of multiple tissue properties, including but not limited to T1, T2, PD, and B0. Proton density is estimated as a scaling factor between the acquired and simulated signal evolutions. 
Magnetic resonance fingerprinting can measure any property that can be simulated by the Bloch equation, for example, and recent works have also incorporated the measurements of B1+,[132,133] T2*,[134] magnetization transfer,[135] amide,[136] spectroscopy, perfusion,[137] and microvascular characteristics into the MRF.[138] The pattern matching can be performed even in the presence of undersampling artifacts; hence, the scan can be highly accelerated to reduce the scan time.[139] The effect of motion on the resulting image is also small as long as the errors are incoherent in such a way that pattern-matching is still possible; however, MRF is known to be more vulnerable to through-plane motion than to in-plane motion.[140] The dictionary of MRF should cover the signal evolutions of a physiologically possible range of tissue properties. The dictionary size presents a trade-off between accuracy and the speed of pattern-matching. The pattern-matching process may benefit from deep learning in terms of both accuracy and speed.[141] The pattern-matching process is a distinctive factor of MRF in view of standardization because resulting maps are dependent on the structure of the dictionary (Fig. 6). The dictionary should be carefully prepared based on the intended purpose, computational resource, and acceptable matching time.
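The pattern-matching step described above can be sketched in a few lines of Python. In this toy example, each dictionary entry is matched to the acquired signal by the normalized inner product, and the scaling factor between the signal and the best-matching entry serves as a PD-like estimate. The monoexponential "simulation" is a deliberately simplified stand-in for a Bloch simulation, and all function names are hypothetical.

```python
import math


def simulate_toy_evolution(t2_ms, times_ms):
    """Toy stand-in for a Bloch simulation: plain monoexponential decay."""
    return [math.exp(-t / t2_ms) for t in times_ms]


def match_fingerprint(signal, dictionary):
    """Return (params, scale) of the dictionary entry best matching the signal.

    dictionary is a list of (params, simulated_evolution) pairs. The match
    maximizes the normalized inner product; scale is the least-squares
    factor between the signal and the matched entry (PD-like estimate).
    """
    signal_norm = math.sqrt(sum(s * s for s in signal))
    if signal_norm == 0:
        raise ValueError("signal must not be all zeros")
    best_params, best_scale, best_score = None, 0.0, -1.0
    for params, atom in dictionary:
        dot = sum(s * a for s, a in zip(signal, atom))
        atom_energy = sum(a * a for a in atom)
        score = abs(dot) / (signal_norm * math.sqrt(atom_energy))
        if score > best_score:
            best_params, best_scale, best_score = params, dot / atom_energy, score
    return best_params, best_scale
```

The estimate can only take values that exist in the dictionary, which is one way to see why the resulting maps depend on how the dictionary is constructed.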
FIGURE 6
Magnetic resonance fingerprinting relies on pattern-matching of through-time signals of highly undersampled images to separately simulated dictionary data. The resulting maps depend not only on the pulse sequence, like conventional magnetic resonance imaging, but also on the dictionary with which the raw data are processed. Although the images processed with 2 different dictionaries in this figure look similar, their histograms are quite different from each other.
Sequence design is flexible in MRF, and several sequence designs have been used, including balanced steady-state free precession,[131] fast imaging with steady-state precession (FISP),[142] RF-spoiled gradient echo, and quick echo splitting nuclear magnetic resonance.[143] The MRF has been primarily investigated for brain and phantom imaging, but methods for adjusting MRF acquisition to the heart,[144] abdomen,[145] and prostate[146] have also been proposed. High-resolution 3D MRF with the resolution of 1 mm isovoxel was also proposed with a scan time less than 8 minutes for full brain coverage.[147]
Kato et al[148] investigated FISP-based MRF with B1+ correction and T1 and T2 measurements on the ISMRM/NIST phantom scanned for 100 days and showed high repeatability with a CV of T1 less than 1% and that of T2 less than 3%, which were better than the values reported by Jiang et al[149] without B1+ correction. Korzdorfer et al[150] investigated the repeatability and reproducibility of T1 and T2 measurements by FISP-based MRF with B1+ correction on the brains of 10 healthy volunteers, each of whom was scanned 4 times using 4 or more of the 10 MRI scanners; these include 3 different models at 3 T from the same vendor at 4 sites. Repeatability, defined by 95% confidence intervals on relative difference, was 2.0% to 3.1% for T1 and 3.1% to 7.9% for T2 in GM and WM, respectively. Interscanner reproducibility was 3.4% for T1 and 8% for T2 in GM and WM. 
Larger variations of T2 were likely attributed to scanner imperfections related to certain system characteristics, such as different eddy current behaviors and the diffusion effect.[132]
Multiparametric Quantitative MRI
There is a growing amount of evidence showing that multiparametric MRI can offer better diagnostic ability[151] or more specific biological information[152] than each quantitative measurement. However, this approach is limited by time constraint; hence, a balance between benefit and time should be considered before clinical implementation. One possible approach for implementing multiparametric MRI into clinical practice is setting a cutoff value for each parameter.[146,153] Pinker et al[154] evaluated the diagnostic accuracy of contrast-enhanced MRI, DWI, and MR spectroscopy in combinations of 2 or 3 and used a single measure by setting a threshold for each parameter. As a result, 3 parameters achieved higher accuracy for differentiating between benign and malignant breast lesions than did 1 or 2 parameters. Although the current practice of cancer assessment by multiparametric MRI relies largely on qualitative analysis, if interscanner differences can be overcome or quantified, the current practice may be replaced by more objective and precise quantitative analyses. However, approaches taken toward a single QIB may not be appropriate for multiparametric imaging. For example, the use of multiple QIBs at the same time leads to increases in false-positives for declaring change in at least one QIB, especially when the correlations between the QIBs are low.[155] This could be solved by introducing Mahalanobis distance (ie, distance between a point and zero in a multivariate space that is corrected by variance), resulting in an appropriate type I error rate. 
Currently, the QIBA Multiparametric Metrology Group is working on developing guidelines for treating multiparametric imaging data.[155]
Another possible application of multiparametric MRI in clinical practice is feeding the data into machine learning for diagnosis,[156] tumor grading,[157] or the prediction of treatment response.[158] Multiparametric quantitative MRI can also be used for extracting new measures that reflect subvoxel microstructural information such as myelin and axon density, axon diameter, and membrane permeability.[152,159] Geometric distortion of images, image misregistration, and different interpolation techniques will introduce errors in created maps; hence, these issues should be cautiously handled.
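The Mahalanobis distance approach mentioned above can be illustrated for the simplest case of 2 QIBs. The Python sketch below computes the squared Mahalanobis distance of an observed change vector from zero and compares it with a chi-square threshold. It assumes that, when no true change has occurred, the measured changes follow a bivariate normal distribution with a known covariance matrix; under that assumption the squared distance follows a chi-square distribution with 2 degrees of freedom, whose quantile has the closed form −2 ln(α). The function names are hypothetical.

```python
import math


def mahalanobis_sq_2d(delta, cov):
    """Squared Mahalanobis distance of a 2-element change vector from zero."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    x, y = delta
    # Uses the closed-form inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]].
    return (x * (d * x - b * y) + y * (-c * x + a * y)) / det


def joint_change_detected(delta, cov, alpha=0.05):
    """Declare change only if the joint distance exceeds the chi-square cutoff."""
    threshold = -2.0 * math.log(alpha)  # chi-square quantile, 2 degrees of freedom
    return mahalanobis_sq_2d(delta, cov) > threshold
```

For two uncorrelated QIBs with unit variance, a change of 2 standard deviations in one QIB alone would exceed a per-QIB cutoff of 1.96 but gives a squared distance of 4, below the joint cutoff of about 5.99, so no change is declared. This shows how the joint test keeps the overall type I error rate at the intended level.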
COMPUTED TOMOGRAPHY
Since the beginning of the clinical application of multidetector row CT (MDCT) in the late 1990s, CT has played a critical role in routine clinical practice. Further, in the last decade, some societies have considered the application of quantitative CT-based indexes as QIBs for patient management, including therapeutic planning and treatment response assessment, and have worked on standardizing the CT protocols for quantitative assessment of CT-based indexes.[160,161] In addition, several investigators have proposed CT-based QIBs for the management of chronic obstructive pulmonary disease (COPD), pulmonary nodules, interstitial lung disease, pulmonary thromboembolism, and pulmonary hypertension.[162-169] In line with this, RSNA QIBA has been working on standardizing the CT protocols and has published profiles through the following committees: (1) CT angiography, (2) CT volumetry, (3) lung density, and (4) small lung nodule.[170]
However, academic and social interests in radiation dose reduction for CT examinations without any accompanying reduction in diagnostic capability have been steadily on the rise. In addition, newly developed iterative reconstruction (IR) methods have been introduced and applied in routine clinical practice.[171,172] In fact, dose reduction strategies have been realized by employing a variety of techniques for data acquisition, such as tube current reduction, tube voltage reduction, increased helical pitch, scan length optimization, scan protocol individualization, and utilization of automatic exposure control (AEC).[172-176] In contrast to RSNA QIBA, J-QIBA primarily aims to determine state-of-the-art CT protocols while keeping the suggested accuracy of CT numbers and bronchial wall thickness, as well as volumetry, within QIBA CT profiles.[170]
Lung Density Evaluation for Quantitative Assessment of Chronic Obstructive Pulmonary Disease
Computed tomography is currently the most widely used modality to evaluate morphologic and pulmonary functional changes for the assessment of COPD.[168,177-180] For both clinical and academic purposes, several commercially available and proprietary software packages and visual scoring systems have been adopted for the CT-based assessment of pulmonary emphysema.[168,177] Two major approaches have been reported for the quantitative assessment of COPD.[168,177,181-183] One approach determines the percentage of low attenuation area in the lung, which reflects the destruction of the lung parenchyma,[168,177,181-183] and the other determines the percentage of wall area in the bronchi, which reflects bronchial narrowing and wall thickening.[183] In addition, 3D airway luminal volumetry has been introduced as another quantitative airway evaluation method for COPD patients.[184-186] Taking these quantitative CT assessments of COPD and the current situation regarding radiation dose reduction strategies into consideration,[168,177,181-186] the application of IR can be viewed as an important issue related not only to radiation dose reduction, but also to the accuracy of quantitative CT evaluation of COPD.
Meanwhile, Chen-Mayer et al,[187] members of RSNA QIBA, published an article regarding the standardization of CT protocols for 64-detector row CT using a variety of scanner models. They provided a quantitative assessment of the variations observed in CT lung density measurements attributed to nonbiological sources, including scanner calibration, the x-ray spectrum, and filtration. However, this study did not address differences in factors such as scan protocols, reconstruction methods, or tube current. 
Hence, Ohno et al,[188] as part of J-QIBA activity, compared the effect of different acquisition and reconstruction algorithms on the radiation dose and accuracy of CT number measurements using a 320-detector row CT and the same phantom used by Chen-Mayer et al.[187] They found that the use of a forward projected model-based iterative reconstruction (FIRST, model-based IR method) and adaptive iterative dose reduction using 3D processing (AIDR 3D, hybrid-type IR method) for the 80-detector row helical and wide-volume acquisitions can reduce the radiation dose to a level of 10 mA while keeping the CT number accuracy within the RSNA QIBA Profile requirement. Therefore, a collaboration between RSNA QIBA and J-QIBA will provide not only standard CT protocols, but also state-of-the-art CT protocols for lung density measurement and the application of CT number as one of the QIBs for pulmonary diseases.
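The percentage of low attenuation area discussed in this section is, in essence, a voxel count below a CT number threshold, which is why CT number accuracy matters directly. The Python sketch below shows the calculation on already segmented lung voxels. The function name is hypothetical, and the default threshold of −950 HU is an assumption for illustration (a commonly used value that is not specified in the text).

```python
def percent_low_attenuation_area(lung_hu_values, threshold_hu=-950.0):
    """Percentage of lung voxels with a CT number below the threshold.

    lung_hu_values: CT numbers (HU) of voxels already segmented as lung.
    threshold_hu: illustrative default; choose per study protocol.
    """
    values = list(lung_hu_values)
    if not values:
        raise ValueError("no lung voxels provided")
    low_count = sum(1 for hu in values if hu < threshold_hu)
    return 100.0 * low_count / len(values)
```

Because voxels close to the threshold can switch category with a shift of only a few HU, a small systematic bias in CT numbers from calibration, dose, or reconstruction can change the resulting percentage.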
Computer-Aided Volumetry for Quantitative Assessment of a Small Pulmonary Nodule
Several large cohort trials, including the National Lung Screening Trial for reducing lung cancer mortality,[189] showed that lung cancer screening with low-dose CT could reduce lung cancer-specific mortality.[163,190-194] Many studies have reported the importance of volume measurements and/or doubling time assessment by computer-aided volumetry (CADv) software in nodule management.[163,190-195] In line with this, the RSNA-QIBA has evaluated the measurement accuracy of various CADv software programs provided by many vendors in a QIBA recommended phantom study[196] and given feedback to suppliers. The J-QIBA contributed to this study by providing scan data. However, this study did not address the effect of differences in scan methods, tube currents, or reconstruction methods. Hence, Ohno et al[197] performed a phantom study in accordance with QIBA recommendations to evaluate the effects of tube current and reconstruction methods on the nodule volume measured with 3D CADv software (CT Lung Nodule Analysis; Vital Images Inc, Minnetonka, MN).[197] In this study, an anthropomorphic thoracic phantom with 30 simulated nodules with various densities and diameters was scanned with an area-detector CT at several tube currents. The mean absolute measurement errors of AIDR 3D and FIRST methods were significantly lower than those of the FBP algorithm in ultra-low-dose CT. For all nodule types, absolute measurement errors of the FBP method in ultra-low-dose CT were significantly higher than those of standard-dose CT. Both IR algorithms were thus shown to be more effective than the FBP algorithm for radiation dose reduction. Ohno et al are now considering performing a study with the RSNA-QIBA investigating clinical application of the 3D CADv software with deep learning technique in routine clinical practice.
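The doubling time assessment mentioned above is usually derived from two volume measurements under an exponential growth assumption: VDT = Δt · ln 2 / ln(V2/V1). The following Python sketch implements that formula. The function name is hypothetical, and the exponential growth model is an assumption of this example.

```python
import math


def volume_doubling_time(volume_1, volume_2, interval_days):
    """Volume doubling time in days, assuming exponential growth.

    Returns infinity for an unchanged volume and a negative value for a
    shrinking nodule.
    """
    if volume_1 <= 0 or volume_2 <= 0:
        raise ValueError("volumes must be positive")
    if volume_1 == volume_2:
        return math.inf
    return interval_days * math.log(2.0) / math.log(volume_2 / volume_1)
```

Because the volume ratio enters through a logarithm, a measurement error in either volume propagates into the doubling time most strongly when the true change between scans is small, which is one reason volumetric accuracy at low dose is relevant for nodule management.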
RADIOMICS
Radiomics is based on the high-throughput computer extraction of potentially innumerable numbers of quantitative imaging metrics, or “radiomic features,” which will be collectively used for the prediction of diagnosis and prognosis and gene expression profiling.[198] These radiomic features can be combined with other patient characteristics to increase the accuracy of prediction. Because radiomics analyses can be conducted with conventionally used clinical images such as T1-weighted images, FLAIR images, and ADC maps, it is conceivable that conversion of radiological images to mineable data will become routine practice for improving decision-making in precision medicine. Radiomic features are often categorized into shape and first- and higher-order features. First-order features are based on histogram-based analyses and include mean, maximum, minimum, and entropy. Higher-order features are described as texture features related to spatial patterns of voxel intensities. Due to the complexity of radiomic features, there is the danger of overfitting, and hence, dimensionality should be reduced by prioritizing the features. This can be performed by detecting redundant features that are highly correlated with each other. 
Determining the repeatability and reproducibility of each feature and extracting stable ones can also help the prioritization process in reducing redundant dimensions.[199] Radiomics-specific phantoms with known features are useful for evaluating the effect of scanner and vendor variance on radiomic features and for optimizing the protocols and image processing used to obtain radiomic features.[200-202]
In general, a lack of reproducibility in radiomic features is a limitation for radiomics to be widely used in clinical practice.[199,203,204] The stability of radiomic features is sensitive, to varying degrees, to a number of processing factors, including image acquisition parameters, reconstruction algorithms, digital image preprocessing, and feature extraction methods (Fig. 7). For example, Zhao et al[205] reported that different reconstruction algorithms (sharp or smooth) for lung CT introduce variability in radiomic features of tumors. Voxel size resampling is often performed for CT datasets acquired with variable voxel sizes in order to obtain more reproducible CT features.[206-208] However, the selection of the interpolation method (eg, nearest neighbor, trilinear, or tricubic interpolation) has been shown to affect the reproducibility of radiomic features.[209] How the software treats the boundary of the ROI also affects radiomic features; when the voxels outside the ROI are treated as zero, the boundary regions can have an extremely high gradient and may affect the resulting feature values. Eroding the ROI to include only the core of the target tissue[210] and generating 3 different regions (the tumor, boundary, and peritumor)[211] are possible approaches to achieve reproducible features.
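To make the idea of eroding the ROI concrete, the following minimal sketch removes the outermost layer of a binary mask so that only voxels fully surrounded by the ROI remain. It is a simplified assumption for illustration: a 2D mask stored as nested lists, a 4-connected neighborhood, and a single erosion step. The function name is hypothetical, and the sketch does not reproduce the specific procedures of the cited studies.

```python
def erode_mask(mask):
    """One step of 4-connected binary erosion on a 2D mask.

    mask is a list of rows containing 0 or 1. A voxel stays in the
    eroded mask only if it and its four in-plane neighbors are all
    inside the original ROI; voxels on the image border are removed.
    """
    rows, cols = len(mask), len(mask[0])
    eroded = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if all(0 <= i < rows and 0 <= j < cols and mask[i][j]
                   for i, j in neighbors):
                eroded[r][c] = 1
    return eroded
```

Features computed only within the eroded mask exclude the voxels adjacent to the boundary, which are the ones most affected by segmentation differences and by how values outside the ROI are handled.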
FIGURE 7
Typical variability sources in radiomics analysis. Each feature is the result of multiple processes performed on radiological images. Example factors that introduce variabilities in the resulting features are shown for each process in the bottom row.
In 2018, Traverso et al[206] reported the results of a systematic review of previous articles investigating the repeatability and reproducibility of radiomic features. Overall, first-order features had higher reproducibility than shape and higher-order features, with entropy being consistently reported as one of the most stable features. Among higher-order features, coarseness and contrast were generally poorly reproducible. Interobserver differences in segmentation affected radiomic features, and variabilities were higher for higher-order features. Semiautomated or fully automated segmentation methods improved feature reproducibility.[212] However, the articles included in this systematic review were mostly about CT and PET. A more recent multicenter study investigated the stability of radiomic features extracted from ADC maps and concluded that 122 of 1322 features were stable, with a concordance correlation coefficient of more than 0.85, for all tumor entities investigated (ie, ovarian cancer, lung cancer, and colorectal liver metastasis).[213] Across differences in magnetic field strength and vendor, 245 and 209 features, respectively, were stable. In a phantom study, Baeßler et al[214] showed that FLAIR provided the highest number of stable radiomic features among T1-weighted, T2-weighted, and FLAIR images.
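The stability criterion mentioned above, a concordance correlation coefficient (CCC) above 0.85, can be illustrated with a short sketch. It assumes Lin's CCC computed with population (1/n) variances on paired test and retest values of each feature. The function names and the dictionary-based input layout are hypothetical and are not taken from the cited study.

```python
import math


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = sxx + syy + (mx - my) ** 2
    if denom == 0:
        # Both series are constant and identical; the CCC is undefined.
        return math.nan
    return 2 * sxy / denom


def stable_features(test, retest, threshold=0.85):
    """Names of features whose test-retest CCC exceeds the threshold.

    test and retest map each feature name to its values across the
    same subjects, in the same order.
    """
    return [name for name in test
            if concordance_cc(test[name], retest[name]) > threshold]
```

Unlike the Pearson coefficient, the CCC penalizes a systematic offset between the two measurements: a feature that is shifted by a constant between scanners has a Pearson correlation of 1 but a CCC below 1.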
It is largely unknown how the repeatability and reproducibility of each feature are propagated to the final result of radiomics, and therefore, validation of a radiomics algorithm against another independent dataset is considered to be crucial.[215]
There are several multi-institutional efforts to standardize and increase the reproducibility of the radiomics approach; these include providing guidelines, standardized frameworks, and DROs. The Imaging Biomarker Standardization Initiative provides consensus-based recommendations, nomenclature, and guidelines to improve the reproducibility of radiomic studies.[208] Radiomics Ontology, which is publicly accessible via the NCBO BioPortal (https://bioportal.bioontology.org/ontologies/RO), provides a semantic framework for radiomic features that is in line with the nomenclature addressed by the Imaging Biomarker Standardization Initiative. The NRG Oncology investigators have provided recommendations and a guideline specifically for use in the National Clinical Trials Network.[216] They suggest that the radiomics quality score[217,218] may serve to evaluate the quality of radiomics studies.
ARTIFICIAL INTELLIGENCE
Artificial intelligence, including machine learning and deep learning, has been increasingly applied to medical imaging.[219] Promising results have been shown in various tasks related to radiological images, such as the detection of lesions,[220] segmentation (eg, labeling organs),[221] classification (eg, pneumonia vs cancer),[222] reconstruction (eg, MRI k-space to clinical image),[223] and noise reduction.[224] In relation to the standardization of QIBs, an AI algorithm that automates the process of QIB extraction has the capability to decrease variability, for example through an automated pipeline that reduces ambiguity and variability in lesion segmentation.[225] Extracting a QIB using AI in a fully automated manner is also feasible. For example, AI can predict the fractional flow reserve from cardiac CT data by point estimation.[226] The major advantages of AI approaches over manual approaches in terms of decreasing variability in QIBs are as follows: (1) no variance is caused by fatigue, as it is in human analysts, and (2) AI returns consistent results from the same input. There are several recommendations and guidelines for the development and evaluation of an AI algorithm in the medical field.[219,227,228] In brief, desired steps to develop a reliable AI algorithm include the following: (1) using reliable reference standards, (2) using a training dataset that matches the intended use, (3) tuning hyperparameters on a dataset independent of the training dataset, and (4) using external datasets to evaluate the model performance. To develop an AI algorithm that is robust to variability in acquisition parameters, machine settings, and clinical conditions, the algorithm should be trained with a heterogeneous dataset.[229] The standardization/harmonization of the input images could be one approach to making an algorithm that is generalizable to multiple scanners, although this is unrealistic for multiple vendors, especially when the inputs are multimodal.
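Steps (3) and (4) of the development process listed above imply that the training, tuning, and external evaluation data must not overlap. The following minimal sketch shows one way such a partition could be made at the patient level, assuming each case is labeled with a patient identifier and an acquisition site. The function name, the 20% tuning fraction, and the use of whole sites as the external dataset are hypothetical choices for illustration only.

```python
import random


def split_cases(cases, external_sites, tune_fraction=0.2, seed=0):
    """Partition patients into training, tuning, and external sets.

    cases is a list of (patient_id, site) tuples. All patients scanned
    at any site in external_sites are reserved for external evaluation;
    the remaining patients are shuffled and divided into a tuning set
    (for hyperparameters) and a training set.
    """
    external = {pid for pid, site in cases if site in external_sites}
    # A patient seen at an external site is never used for development.
    internal = sorted({pid for pid, _ in cases} - external)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(internal)
    n_tune = round(len(internal) * tune_fraction)
    return {
        "train": internal[n_tune:],
        "tune": internal[:n_tune],
        "external": sorted(external),
    }
```

Splitting by patient rather than by image prevents several scans of the same patient from appearing on both sides of a split, which would otherwise make the evaluation overly optimistic.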
Although a QIB extracted using AI can be assessed through conventional approaches, one possible issue is specific to AI: AI models can be further fine-tuned at each institution using its own data, and the repeatability and reproducibility may change with each update. Quality assessment methods for AI algorithms that are easy to implement at each institution remain to be established.
CONCLUSIONS
Quantification of radiological images has the potential to enable earlier detection of disease, complement or replace biopsy, provide clear differentiation of disease stage, and play an important role in precision medicine. Various sources of variability in QIBs have been identified, and extensive efforts have been made to achieve accurate and precise results. Artificial intelligence, especially deep learning techniques, may further mitigate the variability of QIBs. In recent years, there has been a surge of interest in multiparametric imaging, including radiomics, but methods for evaluating the accuracy and precision of the end results of such techniques remain to be investigated.