Jeroen Bleker, Thomas C. Kwee, Derya Yakar.
Abstract
Background: Reproducibility and generalization are major challenges for clinically significant prostate cancer (PCa) modeling using MRI radiomics. Multicenter data seem indispensable to deal with these challenges, but the quality of such studies is currently unknown. The aim of this study was to systematically review the quality of multicenter studies on MRI radiomics for diagnosing clinically significant PCa.
Keywords: multicenter MRI; prostate cancer; radiomics
Year: 2022 PMID: 35888036 PMCID: PMC9324573 DOI: 10.3390/life12070946
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
CLAIM checklist table with three columns containing the section/subsection, CLAIM item number, and the description of the item.
| Section/Subsection | Item | Description |
|---|---|---|
| Title/Abstract | | |
| | 1 | Identification as a study of AI methodology, specifying the category of technology used (e.g., deep learning) |
| | 2 | Structured summary of study design, methods, results, and conclusions |
| Introduction | | |
| | 3 | Scientific and clinical background, including the intended use and clinical role of the AI approach |
| | 4 | Study objectives and hypotheses |
| Methods | | |
| Study Design | 5 | Prospective or retrospective study |
| | 6 | Study goal, such as model creation, exploratory study, feasibility study, non-inferiority trial |
| Data | 7 | Data sources |
| | 8 | Eligibility criteria: how, where, and when potentially eligible participants or studies were identified (e.g., symptoms, results from previous tests, inclusion in registry, patient-care setting, location, dates) |
| | 9 | Data pre-processing steps |
| | 10 | Selection of data subsets, if applicable |
| | 11 | Definitions of data elements, with references to Common Data Elements |
| | 12 | De-identification methods |
| | 13 | How missing data were handled |
| Ground Truth | 14 | Definition of ground truth reference standard, in sufficient detail to allow replication |
| | 15 | Rationale for choosing the reference standard (if alternatives exist) |
| | 16 | Source of ground-truth annotations; qualifications and preparation of annotators |
| | 17 | Annotation tools |
| | 18 | Measurement of inter- and intrarater variability; methods to mitigate variability and/or resolve discrepancies |
| Data Partitions | 19 | Intended sample size and how it was determined |
| | 20 | How data were assigned to partitions; specify proportions |
| | 21 | Level at which partitions are disjoint (e.g., image, study, patient, institution) |
| Model | 22 | Detailed description of model, including inputs, outputs, all intermediate layers and connections |
| | 23 | Software libraries, frameworks, and packages |
| | 24 | Initialization of model parameters (e.g., randomization, transfer learning) |
| Training | 25 | Details of training approach, including data augmentation, hyperparameters, number of models trained |
| | 26 | Method of selecting the final model |
| | 27 | Ensembling techniques, if applicable |
| Evaluation | 28 | Metrics of model performance |
| | 29 | Statistical measures of significance and uncertainty (e.g., confidence intervals) |
| | 30 | Robustness or sensitivity analysis |
| | 31 | Methods for explainability or interpretability (e.g., saliency maps), and how they were validated |
| | 32 | Validation or testing on external data |
| Results | | |
| Data | 33 | Flow of participants or cases, using a diagram to indicate inclusion and exclusion |
| | 34 | Demographic and clinical characteristics of cases in each partition |
| Model performance | 35 | Performance metrics for optimal model(s) on all data partitions |
| | 36 | Estimates of diagnostic accuracy and their precision (such as 95% confidence intervals) |
| | 37 | Failure analysis of incorrectly classified cases |
| Discussion | | |
| | 38 | Study limitations, including potential bias, statistical uncertainty, and generalizability |
| | 39 | Implications for practice, including the intended use and/or clinical role |
| Other Information | | |
| | 40 | Registration number and name of registry |
| | 41 | Where the full study protocol can be accessed |
| | 42 | Sources of funding and other support; role of funders |
Figure 1. PRISMA 2020 flow diagram.
Checklist for Artificial Intelligence in Medical Imaging (CLAIM) evaluation for each of the four studies included in the review. If the study met the full CLAIM item description, a score of 1 was awarded. For example, item 1, "Indicate the use of the AI techniques—such as 'deep learning' or 'random forests'—in the article's title and/or abstract", requires detailed mention of all AI techniques used; if one or more is missing, a score of 0 was given. N/A stands for not applicable and is used when the specific item does not fit the goal or approach of the study. Each N/A reduces the possible total score (42 minus the number of N/As) that is used for calculating the percentage of items fulfilled; a short scoring sketch follows the table.
| Domain | Item | Bleker et al. | Castillo et al. | Lim et al. | Montoya Perez et al. |
|---|---|---|---|---|---|
| Title/Abstract | | | | | |
| | 1 | 0 | 0 | 1 | 0 |
| | 2 | 1 | 1 | 1 | 1 |
| Introduction | | | | | |
| | 3 | 1 | 1 | 1 | 1 |
| | 4 | 1 | 0 | 0 | 0 |
| Methods | | | | | |
| Study Design | 5 | 1 | 1 | 1 | 1 |
| | 6 | 1 | 1 | 1 | 1 |
| Data | 7 | 1 | 1 | 1 | 1 |
| | 8 | 1 | 1 | 1 | 0 |
| | 9 | 0 | 0 | 0 | 0 |
| | 10 | N/A | N/A | N/A | N/A |
| | 11 | 1 | 1 | 1 | 1 |
| | 12 | 0 | 0 | 0 | 0 |
| | 13 | 0 | 1 | 0 | 1 |
| Ground Truth | 14 | 1 | 1 | 1 | 1 |
| | 15 | 1 | 1 | 1 | 1 |
| | 16 | 1 | 0 | 1 | 1 |
| | 17 | N/A | 0 | 1 | 1 |
| | 18 | N/A | N/A | 0 | N/A |
| Data Partitions | 19 | 1 | 1 | 1 | 1 |
| | 20 | 1 | 1 | 0 | 1 |
| | 21 | 1 | 1 | 1 | 1 |
| Model | 22 | 1 | 1 | 1 | 1 |
| | 23 | 0 | 1 | 1 | 0 |
| | 24 | 1 | 1 | 1 | 1 |
| Training | 25 | 1 | 1 | 1 | 1 |
| | 26 | 1 | 1 | 1 | 1 |
| | 27 | N/A | 1 | N/A | N/A |
| Evaluation | 28 | 1 | 1 | 1 | 1 |
| | 29 | 1 | 0 | 1 | 1 |
| | 30 | 1 | 1 | 1 | 0 |
| | 31 | 1 | 1 | 0 | 1 |
| | 32 | 1 | 1 | 1 | 1 |
| Results | | | | | |
| Data | 33 | 1 | 1 | 1 | 1 |
| | 34 | 0 | 0 | 1 | 1 |
| Model performance | 35 | 1 | 1 | 0 | 1 |
| | 36 | 1 | 1 | 1 | 1 |
| | 37 | 1 | 0 | 0 | 0 |
| Discussion | | | | | |
| | 38 | 1 | 1 | 1 | 1 |
| | 39 | 0 | 0 | 0 | 0 |
| Other Information | | | | | |
| | 40 | N/A | N/A | N/A | N/A |
| | 41 | N/A | N/A | N/A | N/A |
| | 42 | 1 | 0 | 1 | 1 |
| Total score percentage | | 80.6 (29/36) | 71.1 (27/38) | 71.1 (27/38) | 75.7 (28/37) |
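As a quick illustration of the scoring rule described in the table caption, the following minimal Python sketch reproduces the denominator adjustment for N/A items; the function and the example score list are hypothetical, not taken from the paper.

```python
# Illustrative helper (not from the paper): how the CLAIM percentage
# in the bottom row is derived, with "N/A" items removed from the
# denominator (42 minus the number of N/As).
def claim_percentage(item_scores):
    """item_scores: one entry per CLAIM item, each 1 (fulfilled),
    0 (not fulfilled), or "N/A" (item does not apply)."""
    n_na = item_scores.count("N/A")
    fulfilled = sum(s for s in item_scores if s != "N/A")
    return 100.0 * fulfilled / (len(item_scores) - n_na)

# Bleker et al. column above: 29 fulfilled, 7 unmet, 6 N/A -> 29/36
scores = [1] * 29 + [0] * 7 + ["N/A"] * 6
print(f"{claim_percentage(scores):.1f}%")  # 80.6%
```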
RQS grading for each of the studies included in this review can be found in Table 2.
Radiomics quality scores and total percentages for each of the studies included in this review. The maximum total score that could be achieved is 36 points.
| RQS | Bleker et al. | Castillo et al. | Lim et al. | Montoya Perez et al. |
|---|---|---|---|---|
| Image Protocol Quality | 2 | 2 | 2 | 1 |
| Multiple segmentations | 1 | 1 | 1 | 0 |
| Phantom Study on all scanners | 0 | 0 | 0 | 0 |
| Imaging at multiple time points | 0 | 0 | 0 | 0 |
| Feature reduction or adjustment for multiple testing | 3 | 3 | 3 | 3 |
| Multivariable analysis with non-radiomics features | 0 | 0 | 0 | 1 |
| Detect and discuss biological correlates | 0 | 0 | 0 | 1 |
| Cut-off analyses | 0 | 0 | 0 | 0 |
| Discrimination statistics | 2 | 2 | 2 | 2 |
| Calibration statistics | 1 | 1 | 1 | 1 |
| Prospective study registered in a trial database | 0 | 0 | 0 | 0 |
| Validation | 5 | 5 | 3 | 3 |
| Comparison to ‘gold standard’ | 2 | 2 | 2 | 2 |
| Potential clinical utility | 2 | 2 | 2 | 2 |
| Cost-effectiveness analysis | 0 | 0 | 0 | 0 |
| Open science and data | 2 | 3 | 0 | 3 |
| Total score percentage | 55.6 (20/36) | 58.3 (21/36) | 44.4 (16/36) | 52.8 (19/36) |
The radiomics quality score (RQS) criteria and point allocation; a scoring sketch follows the table.
| No. | Criteria | Points |
|---|---|---|
| 1 | Image protocol quality—well-documented image protocols (for example, contrast, slice-thickness, energy, etc.) and/or usage of public image protocols allow reproducibility/replicability | +1 (if protocols are well-documented) +1 (if public protocol is used) |
| 2 | Multiple segmentations—possible actions are: segmentation by different physicians/algorithms/software, perturbing segmentations by (random) noise, segmentation at different breathing cycles. Analyze feature robustness to segmentation variabilities | +1 |
| 3 | Phantom study on all scanners—detect inter-scanner differences and vendor-dependent features. Analyze feature robustness to these sources of variability | +1 |
| 4 | Imaging at multiple time points—collect images of individuals at additional time points. Analyze feature robustness to temporal variabilities (for example, organ movement, organ expansion/shrinkage) | +1 |
| 5 | Feature reduction or adjustment for multiple testing—decreases the risk of overfitting. Overfitting is inevitable if the number of features exceeds the number of samples. Consider feature robustness when selecting features | −3 (if neither measure is implemented) +3 (if either measure is implemented) |
| 6 | Multivariable analysis with non-radiomics features (for example, EGFR mutation)—is expected to provide a more holistic model. Permits correlating/inferencing between radiomics and non-radiomics features | +1 |
| 7 | Detect and discuss biological correlates—demonstration of phenotypic differences (possibly associated with underlying gene-protein expression patterns) deepens understanding of radiomics and biology | +1 |
| 8 | Cut-off analyses—determine risk groups by either the median, a previously published cut-off or report a continuous risk variable. Reduces the risk of reporting overly optimistic results. | +1 |
| 9 | Discrimination statistics—report discrimination statistics (for example, C-statistic, ROC curve, AUC) and their statistical significance (for example, p-values, confidence intervals). One can also apply resampling methods (for example, bootstrapping, cross validation) | +1 (if a discrimination statistic and its statistical significance are reported) +1 (if a resampling method technique is also applied) |
| 10 | Calibration statistics—report calibration statistics (for example, calibration-in-the-large/slope, calibration plots) and their statistical significance (for example, p-values, confidence intervals). One can also apply resampling methods (for example, bootstrapping, cross validation) | +1 (if a calibration statistic and its statistical significance are reported) +1 (if a resampling method technique is also applied) |
| 11 | Prospective study registered in a trial database—provides the highest level of evidence supporting the clinical validity and usefulness of the radiomics biomarker | +7 (for prospective validation of a radiomics signature in an appropriate trial) |
| 12 | Validation—the validation is performed without retraining and without adaptation of the cut-off value, which provides crucial information with regard to credible clinical performance | −5 (if validation is missing) +2 (if validation is based on a dataset from the same institute) +3 (if validation is based on a dataset from another institute) +4 (if validation is based on two datasets from two distinct institutes) +4 (if the study validates a previously published signature) +5 (if validation is based on three or more datasets from distinct institutes); datasets should be of comparable size and should have at least 10 events per model feature |
| 13 | Comparison to ’gold standard’—assess the extent to which the model agrees with/is superior to the current ’gold standard’ method (for example, TNM-staging for survival prediction). This comparison shows the added value of radiomics | +2 |
| 14 | Potential clinical utility—report on the current and potential application of the model in a clinical setting (for example, decision curve analysis) | +2 |
| 15 | Cost-effectiveness analysis—report on the cost-effectiveness of the clinical application (for example, QALYs generated) | +1 |
| 16 | Open science and data—make code and data publicly available. Open science facilitates knowledge transfer and reproducibility of the study | +1 (if scans are open source) +1 (if region of interest segmentations are open source) +1 (if code is open source) +1 (if radiomics features are calculated on a set of representative ROIs and the calculated features and representative ROIs are open source) |
| Total points (36 = 100%) | ||
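As a quick illustration of how the criteria above combine into the percentages reported in Table 2, here is a minimal Python sketch; the function and the dictionary keys are hypothetical paraphrases, not part of the published RQS.

```python
# Illustrative helper (not part of the published RQS): items can
# contribute negative points (e.g., -5 for missing validation, -3 if
# no feature reduction), and the signed sum is scaled against the
# 36-point maximum.
RQS_MAX = 36  # maximum achievable RQS (= 100%)

def rqs_percentage(item_points):
    return 100.0 * sum(item_points.values()) / RQS_MAX

# Lim et al. column in Table 2: 2+1+3+2+1+3+2+2 = 16 -> 44.4%
# (items scoring 0 are omitted for brevity)
lim_points = {
    "image_protocol_quality": 2,
    "multiple_segmentations": 1,
    "feature_reduction_or_multiple_testing_adjustment": 3,
    "discrimination_statistics": 2,
    "calibration_statistics": 1,
    "validation": 3,
    "gold_standard_comparison": 2,
    "potential_clinical_utility": 2,
}
print(f"{rqs_percentage(lim_points):.1f}%")  # 44.4%
```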