Jonathan H Lu, Alison Callahan, Birju S Patel, Keith E Morse, Dev Dash, Michael A Pfeffer, Nigam H Shah.
Abstract
Importance: Various model reporting guidelines have been proposed to ensure clinical prediction models are reliable and fair. However, no consensus exists about which model details are essential to report, and commonalities and differences among reporting guidelines have not been characterized. Furthermore, how well documentation of deployed models adheres to these guidelines has not been studied.
Year: 2022 PMID: 35984654 PMCID: PMC9391954 DOI: 10.1001/jamanetworkopen.2022.27779
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Summary of 15 Model Reporting Guidelines
| Source | Abbreviation or short title | Title | Total No. of citations | Items |
|---|---|---|---|---|
| Schulz et al | CONSORT-AI | CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials | 11 529 | 68 |
| Moher et al | CONSORT-AI | CONSORT 2010 Explanation and Elaboration: updated guidelines for reporting parallel group randomized trials | | |
| Liu et al | CONSORT-AI | Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension | | |
| Moons et al | Risk | Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker | 1320 | 41 |
| Moons et al | Risk | Risk prediction models: II. External validation, model updating, and impact assessment | | |
| Chan et al | SPIRIT-AI | SPIRIT 2013 Statement: Defining Standard Protocol Items for Clinical Trials | 2952 | 75 |
| Chan et al | SPIRIT-AI | SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials | | |
| Rivera et al | SPIRIT-AI | Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension | | |
| Steyerberg and Vergouwe | ABCD | Toward better clinical prediction models: seven steps for development and an ABCD for validation | 709 | 33 |
| Moons et al | CHARMS | Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies: The CHARMS Checklist | 565 | 63 |
| Collins et al | TRIPOD | Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement | 3031 | 86 |
| Moons et al | TRIPOD | Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration | | |
| Cohen et al | STARD | STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration | 711 | 55 |
| Luo et al | Guidelines | Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View | 244 | 49 |
| Breck et al | ML test score | The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction | 68 | 34 |
| Wolff et al | PROBAST | PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies | 284 | 55 |
| Moons et al | PROBAST | PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration | | |
| Mitchell et al | Model Cards | Model Cards for Model Reporting | 311 | 49 |
| Sendak et al | Model facts labels | Presenting machine learning model information to clinical end users with model facts labels | 14 | 37 |
| Hernandez-Boussard et al | MINIMAR | MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care | 18 | 28 |
| Norgeot et al | MI-CLAIM checklist | Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist | 24 | 40 |
| Silcox et al | Trust and value checklist | AI-Enabled Clinical Decision Support Software: A “Trust and Value Checklist” for Clinicians | 2 | 26 |
Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy.
We included the explanation and elaboration papers for CONSORT, SPIRIT, TRIPOD, and PROBAST. For CONSORT and SPIRIT, we also included the AI-specific extensions. We grouped Risk prediction models II with Risk prediction models I.
Citations are summed for each guideline group, excluding the explanation and elaboration papers, as of May 2021.
Indicates the number of deduplicated items sourced from that guideline.
Model Reporting Guidelines With Their Items Mapped Onto Different Stages in the Creation and Evaluation of a Machine Learning Model to Guide Care
| Model reporting guideline | Use case assessment | Model formulation | Model development | Model development: fairness | Practical feasibility | Utility assessment | Deployment design | Execution of workflow | Model monitoring | Prospective evaluation |
|---|---|---|---|---|---|---|---|---|---|---|
| Model cards | 8 | 5 | 29 | 9 | 1 | 0 | 0 | 0 | 0 | 0 |
| Model facts labels | 10 | 7 | 9 | 0 | 1 | 1 | 0 | 0 | 2 | 1 |
| Guidelines | 7 | 6 | 31 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| MI-CLAIM | 4 | 3 | 29 | 3 | 0 | 1 | 0 | 0 | 0 | 1 |
| MINIMAR | 4 | 4 | 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| TRIPOD | 7 | 9 | 53 | 1 | 0 | 3 | 0 | 0 | 3 | 2 |
| CONSORT-AI | 10 | 3 | 23 | 6 | 1 | 0 | 0 | 0 | 2 | 19 |
| SPIRIT-AI | 9 | 3 | 17 | 1 | 2 | 0 | 0 | 0 | 2 | 18 |
| Trust and value checklist | 4 | 0 | 9 | 0 | 2 | 1 | 0 | 0 | 4 | 2 |
| ML test score | 0 | 0 | 12 | 4 | 1 | 0 | 0 | 2 | 17 | 0 |
| Risk | 2 | 4 | 24 | 0 | 0 | 1 | 0 | 0 | 2 | 6 |
| STARD | 8 | 2 | 37 | 6 | 0 | 1 | 0 | 0 | 0 | 0 |
| ABCD | 1 | 3 | 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| CHARMS | 5 | 9 | 42 | 1 | 2 | 0 | 0 | 0 | 1 | 4 |
| PROBAST | 4 | 6 | 41 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| Total | 14 | 14 | 104 | 10 | 5 | 4 | 0 | 2 | 19 | 25 |
Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; MINIMAR, Minimum Information for Medical AI Reporting; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy; TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.
Stages are listed in Figure 4 of Jung et al.[7] Each cell contains the number of items contributed by the relevant model reporting guideline toward a given stage of the workflow (columns).
Commonly Requested Items Across Reporting Guidelines
| Item description | No. of reporting guidelines requesting the item | Task | Stage | Reporting rate, % |
|---|---|---|---|---|
| Provide any description of the data set (eg, training or study) in question | 12 | Data composition | Model development | 100 |
| Define the output or outcome produced by the model | 10 | Data composition: output | Model formulation | 100 |
| Define the specific local area, environment, or setting of training data and model deployment | 10 | Study design and/or population | Use case | 100 |
| Describe how data were preprocessed (eg, data cleaning, predictor transformation, outlier removal, predictor coding) | 10 | Preprocessing and data cleaning | Model development | 100 |
| Describe how missing data were handled | 10 | Preprocessing and data cleaning | Model development | 50 |
| Describe parameters used to train and select models, including constraints and penalties added as loss terms (eg, shrinkage penalties) | 10 | Model building | Model development | 58 |
| Provide CIs, statistical significance, or some other handling of uncertainty and variability in model performance metrics | 10 | Model performance and comparison | Model development | 0 |
| Clarify what type of validation was performed, whether internal or external | 11 | Validation | Model development | 100 |
| Describe internal validation strategy to account for model optimism (eg, cross-validation, bootstrapping, data splitting) | 11 | Validation | Model development | 100 |
| Describe performance measures | 13 | Metrics | Model development | 100 |
| AUROC (C index) | 11 | Metrics: discrimination | Model development | 100 |
| Describe how the ML model should be used in clinical context | 11 | Intended use | Use case | 100 |
Abbreviations: AUROC, area under the receiver operating characteristic curve; ML, machine learning.
Lists all items requested by at least 10 model reporting guidelines.
Indicates the item’s related task.
Indicates stage of clinical predictive model development.[7]
Indicates the percentage of the model briefs that reported the information requested in the item, where the denominator is the number of model briefs for which the item was applicable.
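Two of the items above concern the C statistic and the handling of uncertainty in performance metrics (the latter reported by no model brief). As an illustration of what satisfying both items involves, the sketch below (plain Python, standard library only; the function names are illustrative and not from the paper) computes the AUROC and a percentile-bootstrap confidence interval for it:

```python
import random

def auroc(y_true, y_score):
    """C statistic: probability that a randomly chosen positive case
    is scored higher than a randomly chosen negative case (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement and
    take the alpha/2 and 1 - alpha/2 quantiles of the recomputed AUROCs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        ys = [y_score[i] for i in idx]
        if 0 < sum(yt) < n:  # skip resamples missing a class
            stats.append(auroc(yt, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Reporting the point estimate from `auroc` alongside the interval from `bootstrap_ci` would address both the discrimination item and the uncertainty item in one table row of a model brief.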
Adherence Rates to Entire Reporting Guidelines Across Model Briefs
Values in the model brief columns are adherence rates (%) for the 12 Epic Systems Corporation model briefs.
| Model reporting guideline | Deterioration index | Early detection of sepsis | Risk of unplanned readmission | Pediatric risk of hospital admission or ED visit | Risk of hospital admission or ED visit | Inpatient risk of falls | Projected block utilization | Remaining length of stay | Risk of patient no-show | Risk of admission for heart failure | Risk of hospital admission or ED visit for asthma | Risk of hypertension | Mean (IQR), % | No. of applicable items, mean (IQR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model cards | 66 | 47 | 63 | 51 | 40 | 69 | 51 | 45 | 50 | 47 | 41 | 57 | 52 (46-58) | 48.6 (48.0-49.0) |
| Model facts labels | 77 | 71 | 80 | 89 | 71 | 80 | 71 | 71 | 82 | 60 | 63 | 71 | 74 (71-80) | 34.8 (35.0-35.0) |
| Guidelines | 64 | 66 | 66 | 66 | 57 | 74 | 62 | 49 | 70 | 64 | 64 | 66 | 64 (63-66) | 46.9 (47.0-47.0) |
| MI-CLAIM | 55 | 58 | 63 | 58 | 47 | 68 | 53 | 34 | 51 | 53 | 45 | 58 | 54 (50-58) | 37.9 (38.0-38.0) |
| MINIMAR | 71 | 71 | 79 | 61 | 68 | 86 | 71 | 46 | 67 | 75 | 61 | 82 | 70 (65-76) | 27.9 (28.0-28.0) |
| TRIPOD | 63 | 63 | 61 | 48 | 42 | 61 | 47 | 36 | 57 | 48 | 44 | 51 | 52 (46-61) | 75.4 (74.8-76.0) |
| CONSORT-AI | 63 | 43 | 63 | 60 | 33 | 67 | 53 | 47 | 47 | 49 | 42 | 51 | 52 (46-61) | 42.4 (42.0-43.0) |
| SPIRIT-AI | 61 | 55 | 54 | 54 | 38 | 61 | 44 | 49 | 51 | 41 | 39 | 46 | 49 (43-54) | 40.4 (40.0-41.0) |
| Trust and value checklist | 46 | 33 | 39 | 50 | 29 | 42 | 38 | 46 | 50 | 25 | 33 | 46 | 40 (33-46) | 23.9 (24.0-24.0) |
| ML test score | 27 | 15 | 33 | 24 | 9 | 33 | 15 | 6 | 18 | 12 | 9 | 15 | 18 (11-25) | 32.9 (33.0-33.0) |
| Risk | 64 | 65 | 63 | 53 | 50 | 68 | 53 | 48 | 61 | 56 | 56 | 56 | 58 (53-63) | 33.6 (33.0-34.0) |
| STARD | 54 | 45 | 50 | 40 | 29 | 52 | 52 | 39 | 35 | 40 | 40 | 52 | 44 (39-52) | 48.7 (48.0-49.0) |
| ABCD | 65 | 65 | 48 | 55 | 61 | 68 | 52 | 39 | 60 | 65 | 61 | 61 | 58 (54-65) | 30.9 (48.0-49.0) |
| CHARMS | 78 | 70 | 68 | 65 | 56 | 75 | 66 | 47 | 73 | 65 | 63 | 64 | 66 (64-71) | 54.9 (54.0-55.0) |
| PROBAST | 69 | 71 | 67 | 62 | 53 | 68 | 58 | 46 | 63 | 60 | 58 | 60 | 61 (58-67) | 52.1 (51.8-52.5) |
Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; ED, emergency department; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; MINIMAR, Minimum Information for Medical AI Reporting; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy; TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.
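The summary columns of this table are simple arithmetic over the per-brief rates. As a worked check (Python, standard library; the quantile convention is an assumption, since the authors' exact method is not stated), using the Model Cards row from the table:

```python
from statistics import mean, quantiles

def adherence_rate(n_reported, n_applicable):
    """Adherence rate: percentage of applicable guideline items
    that a model brief actually reports."""
    return 100 * n_reported / n_applicable

# Per-brief adherence rates (%) for Model Cards, from the table above.
rates = [66, 47, 63, 51, 40, 69, 51, 45, 50, 47, 41, 57]

avg = round(mean(rates))  # 52, matching the published mean for Model Cards
# Inclusive quantiles give 46.5 and 58.5, close to the published IQR of 46-58.
q1, _, q3 = quantiles(rates, n=4, method="inclusive")
```

An item that does not apply to a given model (e.g., a regression-only item for a tree ensemble) is excluded from the denominator, which is why the rightmost column reports the mean number of applicable items per guideline.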
Requested but Less-Reported Items
| Item description | Reporting rate, % | No. of applicable model briefs | No. of reporting model briefs | No. of model reporting guidelines requesting the item |
|---|---|---|---|---|
| Specify who funded or supported the study and clarify any conflicts of interest | 0 | 10 | 0 | 4 |
| Provide information on how to access the data used | 0 | 12 | 0 | 4 |
| Provide statistics on the amount of missing data | 8 | 12 | 1 | 5 |
| Given the problem context, describe what factors or subgroups would be helpful to perform a subanalysis of model performance evaluation (eg, demographics, environment, lighting); these factors do not have to be available in the data | 42 | 12 | 5 | 5 |
| Provide summary statistics of key demographics, characteristics, or other factors for the data set in question | 50 | 12 | 6 | 6 |
| Discuss age as an important demographic factor to report summary statistics on or disaggregate performance by | 50 | 12 | 6 | 4 |
| Discuss sex as an important demographic factor to report summary statistics on or disaggregate performance by | 33 | 12 | 4 | 4 |
| Discuss other factors for the prediction problem to report summary statistics on or disaggregate performance by (eg, sex, sexual orientation, Fitzpatrick skin type, socioeconomic status, geographic location, presenting symptoms, clinical signs, laboratory values, and other diagnoses) | 50 | 12 | 6 | 4 |
| Provide flowchart describing how participants were interacted with, assigned, and followed up in the study (especially in clinical trials) | 0 | 12 | 0 | 5 |
| Describe the annotation process of the input data, including who annotated the input data, what instructions they were given, and what expertise was needed | 18 | 11 | 2 | 4 |
| Describe blinding of data collectors and predictor assessors to outcomes, if done | 0 | 9 | 0 | 4 |
| Describe the annotation process of the output data, including who annotated the output data, what instructions they were given, and what expertise was needed | 27 | 11 | 3 | 7 |
| Describe blinding of outcome assessors to predictors of the model, if done | 0 | 9 | 0 | 7 |
| Describe how missing data were handled | 50 | 12 | 6 | 10 |
| Indicate whether feature selection involved computing univariate associations between input features and outcomes (not recommended) | 18 | 11 | 2 | 4 |
| Provide CIs, statistical significance, or some other handling of uncertainty and variability in model performance metrics | 0 | 12 | 0 | 10 |
| Provide sufficient information to enable reproducibility or replication | 0 | 12 | 0 | 7 |
| Report model coefficients (regression) or saliency map | 8 | 12 | 1 | 7 |
| Disaggregate performance by subgroup or other important data slice | 33 | 12 | 4 | 8 |
| Describe external validation strategy and evaluation data set (eg, what external data set was used), ways it may differ from the training set (eg, geography, time), and why the data set was chosen | 33 | 12 | 4 | 9 |
| Provide calibration plot | 0 | 12 | 0 | 6 |
| Provide negative predictive value | 17 | 12 | 2 | 6 |
| Provide sensitivity, ideally at a predefined probability threshold | 42 | 12 | 5 | 9 |
| Provide specificity, ideally at a predefined probability threshold | 8 | 12 | 1 | 8 |
| Net reclassification improvement | 0 | 12 | 0 | 5 |
| Specify directions, explanations, and other user-facing materials that will be included in the model | 0 | 12 | 0 | 9 |
| Guidance on how to deploy the machine learning model into clinical workflows | 33 | 12 | 4 | 7 |
| Indicate which version of the model is being discussed | 45 | 11 | 5 | 6 |
| Describe how models are updated or locally tuned | 42 | 12 | 5 | 8 |
All items requested by 4 or more model reporting guidelines but reported by no more than 50% of applicable model briefs are listed.
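Among the items above, the calibration plot (requested by 6 guidelines, reported by 0 of 12 briefs) is straightforward to produce from predicted risks and observed outcomes. A minimal sketch (plain Python; the equal-width binning scheme and function name are illustrative assumptions) computes the points behind such a plot:

```python
def calibration_bins(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width probability bins and compare
    mean predicted risk with the observed event rate in each bin --
    the data behind a calibration plot."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 to last bin
        bins[k].append((y, p))
    points = []
    for b in bins:
        if b:  # skip empty bins
            observed = sum(y for y, _ in b) / len(b)
            predicted = sum(p for _, p in b) / len(b)
            points.append((predicted, observed, len(b)))
    return points
```

Plotting mean predicted risk against observed event rate per bin, with the diagonal as reference, yields the requested calibration plot; calibration-in-the-large and the calibration slope (the "A" and "B" of the ABCD guideline) can then be estimated from the same predictions.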