Literature DB >> 35984654

Assessment of Adherence to Reporting Guidelines by Commonly Used Clinical Prediction Models From a Single Vendor: A Systematic Review.

Jonathan H Lu1, Alison Callahan1, Birju S Patel1, Keith E Morse2,3, Dev Dash1, Michael A Pfeffer1,4, Nigam H Shah1,4,5.   

Abstract

Importance: Various model reporting guidelines have been proposed to ensure clinical prediction models are reliable and fair. However, no consensus exists about which model details are essential to report, and commonalities and differences among reporting guidelines have not been characterized. Furthermore, how well documentation of deployed models adheres to these guidelines has not been studied.
Objectives: To assess information requested by model reporting guidelines and whether the documentation for commonly used machine learning models developed by a single vendor provides the information requested.

Evidence Review: MEDLINE was queried using machine learning model card and reporting machine learning from November 4 to December 6, 2020. References were reviewed to find additional publications, and publications without specific reporting recommendations were excluded. Similar elements requested for reporting were merged into representative items. Four independent reviewers and 1 adjudicator assessed how often documentation for the most commonly used models developed by a single vendor reported the items.

Findings: From 15 model reporting guidelines, 220 unique items were identified that represented the collective reporting requirements. Although 12 items were commonly requested (requested by 10 or more guidelines), 77 items were requested by just 1 guideline. Documentation for 12 commonly used models from a single vendor reported a median of 39% (IQR, 37%-43%; range, 31%-47%) of items from the collective reporting requirements. Many of the commonly requested items had 100% reporting rates, including items concerning outcome definition, area under the receiver operating characteristic curve, internal validation, and intended clinical use. Several items reported half the time or less were related to reliability, such as external validation, uncertainty measures, and strategy for handling missing data. Other frequently unreported items related to fairness (summary statistics and subgroup analyses, including for race and ethnicity or sex).

Conclusions and Relevance: These findings suggest that consistent reporting recommendations for clinical predictive models are needed for model developers to share necessary information for model deployment. The many published guidelines would, collectively, require reporting more than 200 items. Model documentation from 1 vendor reported the most commonly requested items from model reporting guidelines. However, areas for improvement were identified in reporting items related to model reliability and fairness. This analysis led to feedback to the vendor, which motivated updates to the documentation for future users.

Entities:  

Mesh:

Year:  2022        PMID: 35984654      PMCID: PMC9391954          DOI: 10.1001/jamanetworkopen.2022.27779

Source DB:  PubMed          Journal:  JAMA Netw Open        ISSN: 2574-3805


Introduction

Despite good predictive performance in metrics such as the area under the receiver operating characteristic (AUROC) curve, the use of machine learning models trained on electronic health record data[1] to guide care has not often been demonstrated to translate into measurable clinical gains in the form of better medical care, lower cost, or more equitable outcomes,[2,3,4] leading to a gap that has been referred to as an “artificial intelligence (AI) chasm.”[5] Some potential reasons for this chasm are that current models are not useful,[4,6,7] reliable,[8,9] or fair.[10,11,12,13,14,15,16,17,18] Nevertheless, predictive models have frequently been deployed in health care settings without transparency or independent validation,[19,20] and their subsequent failures have occasionally been met with public outcry.[2,21,22,23]

Adhering to model reporting guidelines is one way to improve the reliability,[24,25,26,27,28] fairness,[29,30] and usefulness[25,31,32,33,34] of clinical predictive models. Reporting guidelines have long been used to assess the strength of clinical trial,[35,36] observational,[37] and diagnostic[38] studies. Guidelines about reporting the performance of predictive models are receiving increasing attention, including from the National Institutes of Health,[39] and several more guidelines are in development.[40,41,42] However, limited information is available about the overlapping coverage of these varying guidelines, making it difficult for participants in the community to understand what common set of items should be expected, let alone which items can be reported in practice. As a result, important information is often missing from documentation. For example, a review that examined 164 models described in the scientific literature[43] found low reporting rates of demographic variables such as race (36%) and socioeconomic status (8%) as well as low external validation rates (12%). A critical review of published models for diagnosis and prognosis of COVID-19[44] found that most models were at high risk of bias due to poor reporting.

The goal of this systematic review was to summarize clinical predictive model reporting guidelines and characterize how often items are requested across guidelines. In addition, we assessed whether the documentation for commonly deployed models provided the information requested by model reporting guidelines. Compared with previous work,[43,44] we focused on user-facing product documentation accompanying models, which allowed us to analyze models that have been deployed in practice and are not limited to those described in peer-reviewed publications. Furthermore, we comprehensively measured the reporting rates of every requested item covered in all the guidelines.

Methods

Our analysis consisted of 2 phases. We first compiled model reporting guidelines, summarized them to identify the unique reporting items they request, and analyzed which items are most and least requested across all guidelines. A team of 4 reviewers (J.H.L., A.C., B.S.P., and K.E.M.) and 1 adjudicator (D.D.) then assessed a sample of model documentation to identify the items they report as well as any gaps in reporting. We describe each of these phases in detail and provide additional information in the eMethods in the Supplement. Through the review process, we addressed the items from the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guideline that were applicable to this study.

Summarizing Model Reporting Guidelines

We searched MEDLINE via PubMed using queries for machine learning model card and reporting machine learning from November 4 to December 6, 2020. We reviewed citations to find additional publications. Finally, we excluded publications that did not give specific model reporting recommendations.

We then gathered the set of reportable elements in these reporting guidelines and merged similar elements into distinct, representative items to eliminate duplication. For example, “report the intended user of the model”[31] and “describe external validation strategy”[24] were kept as distinct items. First, we identified an initial set of elements by reviewing each reporting guideline, including the explanation and elaboration documents and AI extensions, to verify that every guideline’s elements were captured. Second, we reviewed each element and, using expert judgment, merged those that requested the same information into the same item. We recorded each study’s phrases describing the elements to enable a full traceback of which elements were merged into each item. Last, we created a 1-line summary of each item for reviewers to reference (eAppendix in the Supplement).
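As an illustration of the merging and traceback steps, the minimal sketch below records which guideline elements were merged into each representative item; the item summaries, guideline labels, and element phrasings shown are hypothetical stand-ins rather than entries from the eAppendix.

from collections import defaultdict

# Map each representative (deduplicated) item to the guideline elements merged into it.
# Item summaries and element phrases are hypothetical stand-ins for the eAppendix content.
item_traceback = defaultdict(list)

def merge_element(item_summary, guideline, element_phrase):
    """Record that a guideline's element was merged into a representative item."""
    item_traceback[item_summary].append((guideline, element_phrase))

merge_element("Describe external validation strategy", "Risk (Moons et al)",
              "describe external validation strategy")
merge_element("Describe external validation strategy", "TRIPOD",
              "report evaluation in data not used for development")
merge_element("Report the intended user of the model", "Model facts labels",
              "report the intended user of the model")

# Items requested by 10 or more guidelines would count as "commonly requested" (Table 3).
commonly_requested = [item for item, sources in item_traceback.items()
                      if len({guideline for guideline, _ in sources}) >= 10]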

Assessing Item Reporting in Existing Model Documentation

To assess the use of this collective set of reportable items in user-facing documentation, we obtained a convenience sample of model documentation in March 2021. We reviewed the user-facing documentation (analogous to a drug package insert) provided by 1 vendor (Epic Systems Corporation), which the vendor terms cognitive computing model briefs (hereafter referred to as model briefs) (eTable 1 in the Supplement). Each model brief has a community adoption score, which represents the proportion of organizations using any model that have used that specific model, expressed on a scale ranging from 1 to 3. We chose all models that had a community adoption score of 2 or 3 in March 2021. The model briefs with a community adoption score of 3 of 3 were the Deterioration Index,[45] Early Detection of Sepsis,[46] Risk of Unplanned Readmission (Version 2),[47] Risk of Patient No-Show (Version 2),[48] Pediatric Hospital Admissions and ED Visits,[49] and Risk of Hospital Admission or ED Visit (Version 2)[50] models. The model briefs with a community adoption score of 2 of 3 were for Inpatient Risk of Falls,[51] Projected Block Utilization,[52] Remaining Length of Stay,[53] Hospital Admissions for Heart Failure,[54] Hospital Admissions and ED Visits for Asthma,[55] and Hypertension.[56] Note that model briefs are periodically updated by the vendor, and we assessed the most recent version available at the time of our study.

The 4 reviewers read each of the 12 model briefs and independently assessed whether they reported information specified in the items as summarized in the eAppendix in the Supplement (process described in the eMethods in the Supplement). Specifically, for each item, each reviewer first determined whether the item was applicable to the model and, if it was determined to be applicable, whether that item was reported or not reported. For example, an item such as “a link to the clinical trial registration” was determined to be not applicable to models whose documentation does not intend to describe a clinical trial. For applicable items, the reviewer then decided whether the model brief reported the information requested in the item, recording the relevant part of the model brief supporting their decision. The reviewers’ specific assessments are all available (eAppendix in the Supplement). Reviewers were informatics experts (J.H.L. and A.C.) and clinicians (B.S.P. and K.E.M.) who had expertise in deployment of machine learning at our academic medical center. The adjudicator (D.D.), also a clinician with similar expertise in deployment of machine learning models, then reviewed the items for which there was disagreement among reviewers to make a final determination and was constrained to choose only from the options already selected by the reviewers. Detailed terminology and summary statistics calculations are provided in the eMethods in the Supplement.
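The per-item review and adjudication logic can be summarized in the short sketch below, under the assumption that each reviewer assigns one of three labels (not applicable, reported, not reported), unanimous assessments stand, and the adjudicator may only choose among labels already selected by the reviewers; the function names and the pairwise-agreement definition are illustrative assumptions, not the exact procedure described in the eMethods.

from itertools import combinations

LABELS = {"not applicable", "reported", "not reported"}

def adjudicate(reviewer_labels, adjudicator_choice):
    """Final label for one item of one model brief after adjudication."""
    assert set(reviewer_labels) <= LABELS
    if len(set(reviewer_labels)) == 1:   # reviewers agree; no adjudication needed
        return reviewer_labels[0]
    if adjudicator_choice not in set(reviewer_labels):
        raise ValueError("adjudicator must pick from labels the reviewers already chose")
    return adjudicator_choice

def pairwise_agreement(assessments_by_item):
    """Share of (reviewer pair, item) comparisons with identical labels for one model brief.

    assessments_by_item maps item -> list of the 4 reviewers' labels for that item.
    """
    agree = total = 0
    for labels in assessments_by_item.values():
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    return agree / total if total else 0.0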

Results

Items Requested by Model Reporting Guidelines

The literature search for model reporting guidelines resulted in a list of 27 publications,[25,29,30,31,32,33,34,38,41,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74] and a citation review yielded 3 additional publications.[26,27,28] We excluded publications that did not provide specific model reporting recommendations, yielding 15 model reporting guidelines (Table 1).[24,25,26,27,28,29,30,31,32,33,34,35,57,58,59,74,75,76,77,78,79,80]
Table 1.

Summary of 15 Model Reporting Guideline Papers

Source(a) | Abbreviation or short title | Title | Journal | Total No. of citations(b) | Items(c)
Schulz et al,[35] 2010 | CONSORT-AI | CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials | International Journal of Surgery | 11 529 | 68
Moher et al,[78] 2010 | | CONSORT 2010 Explanation and Elaboration: updated guidelines for reporting parallel group randomized trials | Journal of Clinical Epidemiology | |
Liu et al,[32] 2020 | | Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension | Nature Medicine | |
Moons et al,[74] 2012 | Risk | Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker | Heart | 1320 | 41
Moons et al,[24] 2012 | | Risk prediction models: II. External validation, model updating, and impact assessment | Heart | |
Chan et al,[75] 2013 | SPIRIT-AI | SPIRIT 2013 Statement: Defining Standard Protocol Items for Clinical Trials | Annals of Internal Medicine | 2952 | 75
Chan et al,[79] 2013 | | SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials | BMJ | |
Rivera et al,[33] 2020 | | Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension | BMJ | |
Steyerberg and Vergouwe,[27] 2014 | ABCD | Toward better clinical prediction models: seven steps for development and an ABCD for validation | European Heart Journal | 709 | 33
Moons et al,[28] 2014 | CHARMS | Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies: The CHARMS Checklist | PLoS Medicine | 565 | 63
Collins et al,[59] 2015 | TRIPOD | Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement | British Journal of Surgery | 3031 | 86
Moons et al,[80] 2015 | | Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration | Annals of Internal Medicine | |
Cohen et al,[76] 2016 | STARD | STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration | BMJ Open | 711 | 55
Luo et al,[57] 2016 | Guidelines | Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View | Journal of Medical Internet Research | 244 | 49
Breck et al,[25] 2017 | ML test score | The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction | Proceedings of the 2017 IEEE International Conference on Big Data | 68 | 34
Wolff et al,[77] 2019 | PROBAST | PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies | Annals of Internal Medicine | 284 | 55
Moons et al,[26] 2019 | | PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration | Annals of Internal Medicine | |
Mitchell et al,[29] 2019 | Model Cards | Model Cards for Model Reporting | Proceedings of the Conference on Fairness, Accountability, and Transparency | 311 | 49
Sendak et al,[31] 2020 | Model facts labels | Presenting machine learning model information to clinical end users with model facts labels | NPJ Digital Medicine | 14 | 37
Hernandez-Boussard et al,[30] 2020 | MINIMAR | MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care | Journal of the American Medical Informatics Association | 18 | 28
Norgeot et al,[58] 2020 | MI-CLAIM checklist | Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist | Nature Medicine | 24 | 40
Silcox et al,[34] 2020 | Trust and value checklist | AI-Enabled Clinical Decision Support Software: A “Trust and Value Checklist” for Clinicians | NEJM Catalyst | 2 | 26

Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy.

(a) We included the explanation and elaboration papers for CONSORT, SPIRIT, TRIPOD, and PROBAST. For CONSORT and SPIRIT, we also included the AI-specific extensions. We grouped risk prediction models II with the risk prediction models I.

(b) Sums the citations for each report, excluding the explanation and elaboration papers as of May 2021.

(c) Indicates the number of deduplicated items sourced from that guideline.

These model reporting guidelines were published in computer science publications (Proceedings of the Conference on Fairness, Accountability, and Transparency[29] and Proceedings of the 2017 IEEE International Conference on Big Data[25]), biomedical informatics journals (Journal of the American Medical Informatics Association,[30] NPJ Digital Medicine,[31] and Journal of Medical Internet Research[57]), and clinical journals (Annals of Internal Medicine,[26,75,77,80] BMJ,[33,79] BMJ Open,[76] Nature Medicine,[32,58] Heart,[24,74] European Heart Journal,[27] PLOS Medicine,[28] NEJM Catalyst,[34] Journal of Clinical Epidemiology,[78] International Journal of Surgery,[35] and British Journal of Surgery[59]). Four guidelines published between 2010 and 2015 have been cited by other articles more than 1000 times, whereas 4 guidelines were published after 2019 and have been cited fewer than 50 times to date. Of the 15 reporting guidelines, 11 had examples of how to complete their requested items.[25,26,27,29,30,31,38,74,78,79,80] However, only 5 showed a full example completing all items for a single model,[27,29,30,31,74] and only 1 of those models had actually been deployed at a health system.[31,81]

After deduplication, 220 distinct items were requested across the reporting guidelines (eAppendix in the Supplement). A cross-tabulation of the 220 items against the 15 reporting guidelines is provided in eTable 2 in the Supplement. For example, the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline has more items requesting details on preprocessing,[59] whereas the Minimum Information About Clinical Artificial Intelligence Modeling (MI-CLAIM) has more items requesting details for model examinations.[58]

Table 2 summarizes the model reporting guidelines in terms of the number of items that map to each stage in the creation and evaluation of a machine learning model (Figure 4 in Jung et al[7]). For example, Model Cards[29] contributes the most items to fairness in model development (n = 29), whereas model facts labels (n = 10)[31] or Consolidated Standards of Reporting Trials (CONSORT)-AI (n = 10)[32] contribute the most items to use case assessment.
Table 2.

Model Reporting Guidelines With Their Items Mapped Onto Different Stages in the Creation and Evaluation of a Machine Learning Model to Guide Care

Model reporting guideline | No. of items that map to each stage(a)
Stages (column order): Use case assessment; Model: formulation; Model: development; Model: development (fairness); Practical feasibility; Utility assessment; Deployment design; Execution of workflow; Model monitoring; Prospective evaluation
Model cards85299100000
Model facts labels10790110021
Guidelines76311010010
MI-CLAIM43293010001
MINIMAR44185000000
TRIPOD79531030032
CONSORT-AI1032361000219
SPIRIT-AI931712000218
Trust and value checklist4090210042
ML test score001241002170
Risk24240010026
STARD82376010000
ABCD13270010000
CHARMS59421200014
PROBAST46410110010
Total14141041054021925

Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; MINIMAR, Minimum Information for Medical AI Reporting; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy; TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.

(a) Stages are listed in Figure 4 of Jung et al.[7] Each cell contains the number of items contributed by the relevant model reporting guideline toward a given stage of the workflow (columns).

Table 3 lists the items requested by at least 10 of the 15 reporting guidelines. The most commonly requested items relate to tasks, such as preprocessing, handling missing data, model performance including handling of uncertainty (eg, CIs, statistical significance) or AUROC, and internal validation. A total of 28 distinct performance metrics were requested (eTable 3 in the Supplement), including AUROC, sensitivity, positive predictive value, and calibration plot.
Table 3.

Commonly Requested Items Across Reporting Guidelines

Item description(a) | No. of reporting guidelines requesting the item | Task(b) | Stage(c) | Reporting rate, %(d)
Provide any description of the data set (eg, training or study) in question | 12 | Data composition | Model development | 100
Define the output or outcome produced by the model | 10 | Data composition: output | Model formulation | 100
Define the specific local area, environment, or setting of training data and model deployment | 10 | Study design and/or population | Use case | 100
Describe how data were preprocessed (eg, data cleaning, predictor transformation, outlier removal, predictor coding) | 10 | Preprocessing and data cleaning | Model development | 100
Describe how missing data were handled | 10 | Preprocessing and data cleaning | Model development | 50
Describe parameters used to train and select models, including constraints and penalties added as loss terms (eg, shrinkage penalties) | 10 | Model building | Model development | 58
Provide CIs, statistical significance, or some other handling of uncertainty and variability in model performance metrics | 10 | Model performance and comparison | Model development | 0
Clarify what type of validation was performed, whether internal or external | 11 | Validation | Model development | 100
Describe internal validation strategy to account for model optimism (eg, cross-validation, bootstrapping, data splitting) | 11 | Validation | Model development | 100
Describe performance measures | 13 | Metrics | Model development | 100
AUROC (C index) | 11 | Metrics: discrimination | Model development | 100
Describe how the ML model should be used in clinical context | 11 | Intended use | Use case | 100

Abbreviations: AUROC, area under the receiver operating characteristic curve; ML, machine learning.

(a) Lists all items requested by at least 10 model reporting guidelines.

(b) Indicates the item’s related task.

(c) Indicates stage of clinical predictive model development.[7]

(d) Indicates the percentage of the model briefs that reported the information requested in the item, where the denominator is the number of model briefs for which the item was applicable.

Finally, 77 items were requested by just 1 reporting guideline (eTable 4 in the Supplement). Twelve of the items were model performance metrics such as the F score. The ML Test Score had 20 unique items related to model deployment and monitoring, such as the model updating process. CONSORT-AI and Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT)-AI had a combined 21 clinical trial–specific items, which mostly did not apply to Epic Systems Corporation’s model briefs.

Reporting of Items by Model Briefs

Interrater agreement on assessments of item reporting was 76%, computed across all pairs of reviewers and every item of a given model brief. Of 220 items, 176 (80%) were applicable to at least 1 model brief. Of these, 119 items (68%) were reported by at least 1 model brief. Model briefs reported a median of 39% (IQR, 37%-43%; range, 31%-47%) of applicable items (eTable 5 in the Supplement). After excluding items corresponding to performance metrics—to avoid penalizing model briefs for not reporting multiple, nearly redundant performance metrics—the median completion rate for applicable items was 43% (IQR, 41%-48%; range, 33%-52%). Overall, items had a median reporting rate across model briefs of 25% (IQR, 0%-83%; range, 0%-100%).

Forty items were reported by more than 90% of the model briefs (eTable 6 in the Supplement). These commonly reported items include information about model development and formulation, specifically the training data set, preprocessing, model type, internal validation, and performance metrics. These items include 9 of the 12 most commonly requested items by the reporting guidelines (Table 3). All 12 model briefs reported the following use case–related items: how the model is to be used in clinical care, who will use the model, ways the model could impact clinical care, and rationale for use.

Seventy-five items were reported by fewer than 10% of the model briefs (eTable 7 in the Supplement). These items included missing data statistics, blinding of predictor and/or outcome assessors, variability of performance measures (eg, CIs), reporting of model coefficients or most predictive features, model examinations including performance errors and intersectional subgroup analyses, user-facing materials and warnings on when to stop use of the model, and monitoring of input data and model predictions. In addition, of 28 distinct performance metrics requested, only AUROC (100%), positive predictive value (67%), and sensitivity (42%) were reported by more than one-fifth of the model briefs (eTable 3 in the Supplement).
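For reference, the summary statistics in this section reduce to straightforward counting over the adjudicated labels. The sketch below assumes a brief-by-item matrix of final labels and uses NumPy percentiles for the median and IQR, which may differ slightly from the exact quantile convention described in the eMethods.

import numpy as np

# final_labels maps model brief -> {item: "not applicable" | "reported" | "not reported"}

def brief_completion_rate(labels_for_brief):
    """Share of applicable items that one model brief reports."""
    applicable = [v for v in labels_for_brief.values() if v != "not applicable"]
    return sum(v == "reported" for v in applicable) / len(applicable)

def item_reporting_rate(final_labels, item):
    """Share of model briefs reporting an item, among briefs where it is applicable."""
    labels = [final_labels[b][item] for b in final_labels
              if final_labels[b][item] != "not applicable"]
    return sum(v == "reported" for v in labels) / len(labels) if labels else None

def median_iqr(values):
    """Median and interquartile range, eg, of completion rates across the 12 briefs."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return median, (q1, q3)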

Adherence to Entire Reporting Guidelines by Model Briefs

Table 4 shows the adherence rates to individual reporting guidelines, that is, each model brief’s completion rate for the items requested by a given reporting guideline. Model reporting guidelines had a median adherence rate of 53% (IQR, 50%-63%; range, 18%-74%). The ML Test Score had the lowest median adherence rate (18% [IQR, 11%-25%]), whereas Model Facts Labels had the highest (74% [IQR, 71%-80%]). After excluding items corresponding to performance metrics as before, the median adherence rates remained similar, at 57% (IQR, 50%-70%; range, 16%-73%).
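Adherence to an individual guideline, as tabulated in Table 4, can be expressed by restricting the same completion-rate calculation to the items that guideline requests (per the item-to-guideline mapping in eTable 2); the sketch below is a simplified restatement under that assumption, not the exact eMethods formula.

def guideline_adherence(labels_for_brief, guideline_items):
    """One model brief's completion rate over the applicable items a guideline requests."""
    labels = [v for item, v in labels_for_brief.items()
              if item in guideline_items and v != "not applicable"]
    return sum(v == "reported" for v in labels) / len(labels)

# Repeating this for all 12 model briefs and summarizing across briefs yields one row of Table 4.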
Table 4.

Adherence Rates to Entire Reporting Guidelines Across Model Briefs

Model reporting guideline | Epic Systems Corporation model briefs, % (values in the order: Deterioration index; Early detection of sepsis; Risk of unplanned readmission; Risk of patient no-show; Pediatric risk of hospital admission or ED visit; Risk of hospital admission or ED visit; Inpatient risk of falls; Projected block utilization; Remaining length of stay; Risk of admission for heart failure; Risk of hospital admission or ED visit for asthma; Risk of hypertension) | Mean (IQR), % | No. of applicable items, mean (IQR)
Model cards | 66, 47, 63, 51, 40, 69, 51, 45, 50, 47, 41, 57 | 52 (46-58) | 48.6 (48.0-49.0)
Model facts labels | 77, 71, 80, 89, 71, 80, 71, 71, 82, 60, 63, 71 | 74 (71-80) | 34.8 (35.0-35.0)
Guidelines | 64, 66, 66, 66, 57, 74, 62, 49, 70, 64, 64, 66 | 64 (63-66) | 46.9 (47.0-47.0)
MI-CLAIM | 55, 58, 63, 58, 47, 68, 53, 34, 51, 53, 45, 58 | 54 (50-58) | 37.9 (38.0-38.0)
MINIMAR | 71, 71, 79, 61, 68, 86, 71, 46, 67, 75, 61, 82 | 70 (65-76) | 27.9 (28.0-28.0)
TRIPOD | 63, 63, 61, 48, 42, 61, 47, 36, 57, 48, 44, 51 | 52 (46-61) | 75.4 (74.8-76.0)
CONSORT-AI | 63, 43, 63, 60, 33, 67, 53, 47, 47, 49, 42, 51 | 52 (46-61) | 42.4 (42.0-43.0)
SPIRIT-AI | 61, 55, 54, 54, 38, 61, 44, 49, 51, 41, 39, 46 | 49 (43-54) | 40.4 (40.0-41.0)
Trust and value checklist | 46, 33, 39, 50, 29, 42, 38, 46, 50, 25, 33, 46 | 40 (33-46) | 23.9 (24.0-24.0)
ML test score | 27, 15, 33, 24, 9, 33, 15, 6, 18, 12, 9, 15 | 18 (11-25) | 32.9 (33.0-33.0)
Risk | 64, 65, 63, 53, 50, 68, 53, 48, 61, 56, 56, 56 | 58 (53-63) | 33.6 (33.0-34.0)
STARD | 54, 45, 50, 40, 29, 52, 52, 39, 35, 40, 40, 52 | 44 (39-52) | 48.7 (48.0-49.0)
ABCD | 65, 65, 48, 55, 61, 68, 52, 39, 60, 65, 61, 61 | 58 (54-65) | 30.9 (48.0-49.0)
CHARMS | 78, 70, 68, 65, 56, 75, 66, 47, 73, 65, 63, 64 | 66 (64-71) | 54.9 (54.0-55.0)
PROBAST | 69, 71, 67, 62, 53, 68, 58, 46, 63, 60, 58, 60 | 61 (58-67) | 52.1 (51.8-52.5)

Abbreviations: ABCD, alpha calibration-in-the-large, beta calibration slope, C statistic, decision-curve analysis; AI, artificial intelligence; CHARMS, Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies; CONSORT, Consolidated Standards of Reporting Trials; ED, emergency department; MI-CLAIM, Minimum Information About Clinical Artificial Intelligence Modeling; MINIMAR, Minimum Information for Medical AI Reporting; ML, machine learning; PROBAST, Prediction Model Risk of Bias Assessment Tool; SPIRIT, Standard Protocol Items: Recommendations for Interventional Trials; STARD, Standards for Reporting of Diagnostic Accuracy; TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.


Requested But Less Reported Items

We identified 29 items that were requested by at least 4 of 15 reporting guidelines but were reported by 50% or fewer of model briefs (Table 5). Many of these less-reported items are related to measures of reliability. These include performance of an external validation (33%) and CIs or statistical significance in model performance metrics (0). There was also low reporting of statistics on the amount of missing data (8%) and how missing data were handled (50%). In addition, there was less reporting on items related to fairness (eg, data set representativeness and performance across subgroups). These include summary statistics of key characteristics of the training data set (reporting rate, 50%) or disaggregating performance by a subgroup (33%). Demographic factors such as age (50%), sex (33%), and other relevant factors (50%) lacked both summary statistics and disaggregated performance. Furthermore, there was low reporting of guidance on how to deploy the machine learning model into a clinical workflow (33%), what user-facing materials there will be with the model (0), and how models are updated (42%). Last, some items related to transparency were provided less often, including model coefficients (8%), who funded the study (which might be relevant for conflict of interest purposes) (0), and how to access the data set (0).
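The selection rule for Table 5 is a simple filter over the per-item statistics; the thresholds below come from the text (requested by at least 4 of 15 guidelines, reported by 50% or fewer of applicable model briefs), while the variable names are illustrative assumptions.

def requested_but_less_reported(item_stats):
    """Items requested by >= 4 guidelines but reported by <= 50% of applicable briefs.

    item_stats maps item -> {"n_guidelines": int, "reporting_rate": float between 0 and 1}.
    """
    return [item for item, s in item_stats.items()
            if s["n_guidelines"] >= 4 and s["reporting_rate"] <= 0.5]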
Table 5.

Requested but Less-Reported Items

Item description | Reporting rate, % | No. of model briefs: applicable | No. of model briefs: reporting | No. of model reporting guidelines requesting the item
Specify who funded or supported the study and clarify any conflicts of interest | 0 | 10 | 0 | 4
Provide information on how to access the data used | 0 | 12 | 0 | 4
Provide statistics on the amount of missing data | 8 | 12 | 1 | 5
Given the problem context, describe what factors or subgroups would be helpful to perform a subanalysis of model performance evaluation (eg, demographics, environment, lighting); these factors do not have to be available in the data | 42 | 12 | 5 | 5
Provide summary statistics of key demographics, characteristics, or other factors for the data set in question | 50 | 12 | 6 | 6
Discuss age as an important demographic factor to report summary statistics on or disaggregate performance by | 50 | 12 | 6 | 4
Discuss sex as an important demographic factor to report summary statistics on or disaggregate performance by | 33 | 12 | 4 | 4
Discuss other factors for the prediction problem to report summary statistics on or disaggregate performance by (eg, sex, sexual orientation, Fitzpatrick skin type, socioeconomic status, geographic location, presenting symptoms, clinical signs, laboratory values, and other diagnoses) | 50 | 12 | 6 | 4
Provide flowchart describing how participants were interacted with, assigned, and followed up in the study (especially in clinical trials) | 0 | 12 | 0 | 5
Describe the annotation process of the input data, including who annotated the input data, what instructions they were given, and what expertise was needed | 18 | 11 | 2 | 4
Describe blinding of data collectors and predictor assessors to outcomes, if done | 0 | 9 | 0 | 4
Describe the annotation process of the output data, including who annotated the output data, what instructions they were given, and what expertise was needed | 27 | 11 | 3 | 7
Describe blinding of outcome assessors to predictors of the model, if done | 0 | 9 | 0 | 7
Describe how missing data were handled | 50 | 12 | 6 | 10
Indicate whether feature selection involved computing univariate associations between input features and outcomes (not recommended) | 18 | 11 | 2 | 4
Provide CIs, statistical significance, or some other handling of uncertainty and variability in model performance metrics | 0 | 12 | 0 | 10
Provide sufficient information to enable reproducibility or replication | 0 | 12 | 0 | 7
Report model coefficients (regression) or saliency map | 8 | 12 | 1 | 7
Disaggregate performance by subgroup or other important data slice | 33 | 12 | 4 | 8
Describe external validation strategy and evaluation data set (eg, what external data set was used), ways it may differ from the training set (eg, geography, time), and why the data set was chosen | 33 | 12 | 4 | 9
Provide calibration plot | 0 | 12 | 0 | 6
Provide negative predictive value | 17 | 12 | 2 | 6
Provide sensitivity, ideally at a predefined probability threshold | 42 | 12 | 5 | 9
Provide specificity, ideally at a predefined probability threshold | 8 | 12 | 1 | 8
Net reclassification improvement | 0 | 12 | 0 | 5
Specify directions, explanations, and other user-facing materials that will be included in the model | 0 | 12 | 0 | 9
Guidance on how to deploy the machine learning model into clinical workflows | 33 | 12 | 4 | 7
Indicate which version of the model is being discussed | 45 | 11 | 5 | 6
Describe how models are updated or locally tuned | 42 | 12 | 5 | 8

All items requested by 4 or more model reporting guidelines but reported by no more than 50% of applicable model briefs are listed.


Discussion

The research community has published many model reporting guidelines with the goal of improving the transparency of prediction models for informed decisions about which models to deploy. However, among 15 reporting guidelines, 220 items are collectively requested, which is both burdensome for model developers to report in their entirety and overwhelming for an end user. We found that the documentation we examined consistently reported the most requested items from this collective set, but overall only a median of 39% of applicable items was reported. This discrepancy underscores the urgent need to identify items that are both feasible to report in practice and necessary to support a decision to deploy a given clinical prediction model. Adhering to a single model reporting guideline may be insufficient because no single guideline is fully comprehensive, and some items may be familiar only to certain model development communities or have only recently been recognized as relevant. Our approach identified patterns in terms of frequently requested items across guidelines and corresponding gaps in reporting, which inform the following suggestions on reporting model information for both the research community and model developers.

For model developers, we suggest prioritizing reporting of the most commonly requested items (Table 3). Model briefs were excellent at reporting these: 9 of the 12 most commonly requested items had 100% reporting rates. These included information on model development and use, such as the outcome definition and how the model is intended to be used. These commonly requested items—which tend to be about model performance—are not always the most important for making a decision for deployment and do not inform us whether a model will be useful.[7,82] These 12 commonly requested items are only a subset of what guidelines consider important to report. Therefore, we suggest additional focus on items that were requested but were not often reported (Table 5), such as items related to reliability: external validation, data missingness, and monitoring. Specific example items include external validation strategy, uncertainty measures such as CIs, calibration plots, performance comparison against a baseline, missing data statistics and strategy of missingness handling, how models are updated and tuned, and methods for monitoring input data or regressions in prediction quality in newer data.

We further suggest reporting items related to fairness (in this interpretation, referring to data set representativeness and model performance for subgroups) and transparency, which were also often requested but not reported (Table 5). For fairness, model documentation should report summary statistics or disaggregated performance by sex, age, race and ethnicity, and other relevant attributes, as well as the results of subgroup and intersectional analyses. We acknowledge this is a limited view of fairness (which is becoming better defined by a dedicated field of scholarship)[83] and that items must be contextualized depending on how the model is used and how the data are collected. For example, biased outcome measurement would not be surfaced by subgroup analyses of performance.[6] For transparency, we suggest reporting model coefficients, model reproducibility, how to access the data set, and who funded the study, which might be relevant for conflict of interest purposes.

That these items were rarely reported in the documentation may be unsurprising given that companies have to protect intellectual property such as model architecture details and coefficients, although there is increasing pressure to demonstrate external validation.[19,84] We suggest that the research community directly engage model developers and information technologists to ensure that published recommendations are feasible to follow and relevant for deployment decisions. As a positive development, dialogue with Epic Systems Corporation’s data science team based on this article’s preprint led to updates to model briefs to include CIs for performance metrics, information about the missing data imputation strategy used, and additional details about algorithm types, including, where applicable, parameters used in grid search and type of penalization.[47,85,86] Such interactions, occurring at a larger scale, are necessary to bridge the implementation gap by ensuring developers provide the most relevant and necessary information about their models. Because many model reporting guidelines[29,30,31,34,58] aim to support model developers and users, we think these recommendations are applicable to model briefs and that there is a need for an open forum for bidirectional conversation. In eTable 2 and the eAppendix in the Supplement, we group the 220 items by task to enable conversation about which additional items are relevant. Finally, we suggest that deployment teams use items as checklists for ensuring quality in model development, usefulness, workflow capacity, and reliability monitoring[25] and that teams review items at project initiation.[87]

Limitations

This study has several key limitations. First, we analyzed model documentation from only a single vendor, Epic Systems Corporation. Documentation for models from other vendors, such as the Cerner model for patient volume,[88] could also be analyzed through this framework. Also, to respect copyright, we were not able to release the sections of the model briefs that our reviewers used to justify when an item was reported.

In addition, although reviewers worked independently, future work could improve on our process for adjudication. Interrater agreement of 76% suggests opportunities to improve reporting. Items that lacked consensus across all model briefs (eTable 8 in the Supplement) often required subjective judgments, such as whether certain items applied if the model brief was not a research study (eg, “Describe how participants were enrolled or recruited into the data” or “Describe the design of the study that was used to collect the data”). Others involved judgments about what reporting was sufficient, such as “Discuss any limitations and caveats of the study.” Methods to assess reporting adherence could be made more consistent and specific through more granular rubrics for third-party reviewers (eg, “partially provided” or “don’t know” categories).

Our findings should be interpreted with caution because our deduplication process may mask certain differences among guidelines (eg, some guidelines provide explicit instructions and examples, whereas others merely call for reporting). We also caution against overinterpreting the completion rate across all items, because items are not exchangeable entities. Two items such as “missing data statistics” and “sensitivity” provide different information, so we recommend considering the completion of individual items when possible. In addition, we were unable to directly assess which items are useful for making deployment decisions, so not every item may be equally important to report. Last, to provide an upper bound on the quality of reporting, reviewers were instructed, in situations in which they were uncertain how to score a particular item, to err on the side of affirming that the item was addressed. For example, we gave credit for “describe how models were tested in a new setting before deployment” for statements that simply directed the reader to contact a support representative to validate the model.

Conclusions

Model reporting guidelines have been developed to ensure that deployed clinical predictive models are reliable and fair. Although many have been published, to our knowledge they have not previously been gathered and analyzed in aggregate. In this study, we compiled reportable items from 15 reporting guidelines and found that the guidelines collectively request 220 distinct items. Such a wide breadth of items poses a large reporting burden for model developers. To provide a snapshot of reporting quality for deployed models, we examined the 12 most adopted models from a single widely used vendor. We found that the documentation reports the most commonly requested items; however, it could provide more information on reliability, transparency, and fairness. Direct engagement with the vendor led to improvements in its documentation for future users. Overall, there is a need for better prioritization of items to report for predictive models in health care to aid informed decisions about which models to deploy.
References: 59 in total (first 10 listed below)

Review 1.  Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker.

Authors:  Karel G M Moons; Andre Pascal Kengne; Mark Woodward; Patrick Royston; Yvonne Vergouwe; Douglas G Altman; Diederick E Grobbee
Journal:  Heart       Date:  2012-03-07       Impact factor: 5.994

2.  Ensuring Fairness in Machine Learning to Advance Health Equity.

Authors:  Alvin Rajkomar; Michaela Hardt; Michael D Howell; Greg Corrado; Marshall H Chin
Journal:  Ann Intern Med       Date:  2018-12-04       Impact factor: 25.391

3.  Making Machine Learning Models Clinically Useful.

Authors:  Nigam H Shah; Arnold Milstein; Steven C Bagley PhD
Journal:  JAMA       Date:  2019-10-08       Impact factor: 56.272

4.  Good intentions are not enough: how informatics interventions can worsen inequality.

Authors:  Tiffany C Veinot; Hannah Mitchell; Jessica S Ancker
Journal:  J Am Med Inform Assoc       Date:  2018-08-01       Impact factor: 4.497

5.  Dissecting racial bias in an algorithm used to manage the health of populations.

Authors:  Ziad Obermeyer; Brian Powers; Christine Vogeli; Sendhil Mullainathan
Journal:  Science       Date:  2019-10-25       Impact factor: 47.728

6.  Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist.

Authors:  Karel G M Moons; Joris A H de Groot; Walter Bouwmeester; Yvonne Vergouwe; Susan Mallett; Douglas G Altman; Johannes B Reitsma; Gary S Collins
Journal:  PLoS Med       Date:  2014-10-14       Impact factor: 11.069

7.  Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View.

Authors:  Wei Luo; Dinh Phung; Truyen Tran; Sunil Gupta; Santu Rana; Chandan Karmakar; Alistair Shilton; John Yearwood; Nevenka Dimitrova; Tu Bao Ho; Svetha Venkatesh; Michael Berk
Journal:  J Med Internet Res       Date:  2016-12-16       Impact factor: 5.428

8.  Veridical data science.

Authors:  Bin Yu; Karl Kumbier
Journal:  Proc Natl Acad Sci U S A       Date:  2020-02-13       Impact factor: 11.205

9.  The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement.

Authors:  David M Kent; Jessica K Paulus; David van Klaveren; Ralph D'Agostino; Steve Goodman; Rodney Hayward; John P A Ioannidis; Bray Patrick-Lake; Sally Morton; Michael Pencina; Gowri Raman; Joseph S Ross; Harry P Selker; Ravi Varadhan; Andrew Vickers; John B Wong; Ewout W Steyerberg
Journal:  Ann Intern Med       Date:  2019-11-12       Impact factor: 25.391

10.  Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group.

Authors:  Viknesh Sounderajah; Hutan Ashrafian; Ravi Aggarwal; Jeffrey De Fauw; Alastair K Denniston; Felix Greaves; Alan Karthikesalingam; Dominic King; Xiaoxuan Liu; Sheraz R Markar; Matthew D F McInnes; Trishan Panch; Jonathan Pearson-Stuttard; Daniel S W Ting; Robert M Golub; David Moher; Patrick M Bossuyt; Ara Darzi
Journal:  Nat Med       Date:  2020-06       Impact factor: 53.440

Citing articles: 2 in total

1.  Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings.

Authors:  Sharon E Davis; Colin G Walsh; Michael E Matheny
Journal:  Front Digit Health       Date:  2022-09-02

Review 2.  Addressing racial disparities in surgical care with machine learning.

Authors:  John Halamka; Mohamad Bydon; Paul Cerrato; Anjali Bhagra
Journal:  NPJ Digit Med       Date:  2022-09-30
