
MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care.

Tina Hernandez-Boussard^1,2,3, Selen Bozkurt^1, John P A Ioannidis^1,4,5, Nigam H Shah^1,2.

Abstract

The rise of digital data and computing power has contributed to significant advancements in artificial intelligence (AI), leading to the use of classification and prediction models in health care to enhance clinical decision-making for diagnosis, treatment, and prognosis. However, such advances are limited by the lack of reporting standards for the data used to develop those models, the model architecture, and the model evaluation and validation processes. Here, we present MINIMAR (MINimum Information for Medical AI Reporting), a proposal describing the minimum information necessary to understand intended predictions, target populations, and hidden biases, as well as the ability to generalize these emerging technologies. We call for a standard to accurately and responsibly report on AI in health care. This will facilitate the design and implementation of these models and promote the development and use of associated clinical decision support tools, as well as manage concerns regarding accuracy and bias.

Keywords: artificial intelligence; clinical decision support; electronic health records; reporting standards

Year: 2020 PMID: 32594179 PMCID: PMC7727333 DOI: 10.1093/jamia/ocaa088

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

The rise of digital data and advances in computing power have contributed to significant advancements in artificial intelligence (AI), including machine learning (ML), for clinical decision support for diagnosis, treatment, and prognosis. The literature suggests that these methods may approach or exceed the performance of expert clinicians, particularly in the fields of signal processing, image classification, and spotting medication errors. These advances bring hopes for better personalized and value-based care. The healthcare industry is becoming comfortable with AI-based solutions, which are rapidly emerging at the point of care. However, the influx of AI models into the healthcare setting presents a fundamental shift in the use of data to guide clinical care and treatment decisions. Up until now, most models have been fed select input variables that were often handpicked by clinicians because they are known or suspected to have a valid clinical association with the outcome of interest. There are currently over 250,000 publications based on this kind of clinical scoring system. With the increasing use of machine learning, the machine decides which input variables or features are important and related to the outcome of interest. Therefore, the data used for training and the definition of the task, be it classification or prediction, become more important than the specifics of the machine learning algorithm.
Detailed knowledge of the data used to train the model (ie, the training data) and of the population those data represent (or, often, do not represent) is essential to understanding the validity and generalizability of the “AI solution.” New reports suggest that biases hidden in the training data used for model development could have negative consequences in certain populations. It is clear that the performance of any AI model broadly depends on its reliability and its ability to generalize to the setting and population in which it is applied, rather than on its performance on the training and test data alone. However, the characteristics of the data necessary to assess how these predictive models perform are not being adequately reported in the literature, leaving uncertainty and doubt about their application in the broader healthcare setting. An empirical evaluation of 81 studies comparing AI models against clinicians showed major problems with lack of transparency, bias, and unjustified claims, likely because key details about the studies were often missing. Given the fast-evolving pace of AI solutions in health care, regulating them is complicated, and global efforts are emerging to safely and efficiently standardize this regulatory task. The current regulatory environment is developing rapidly, with regulatory leaders and diverse stakeholders (eg, healthcare systems, clinicians, patients) developing a framework that both promotes innovation and ensures safety, privacy, and good intent.
There is a global consensus that AI solutions must be fair and nondiscriminatory and that AI solutions in health care should have a positive impact across all sectors of social and economic life. However, through a lack of incentives, restrictions around data sharing and data privacy, and the acceptance of stealth science in industry (eg, science that is not backed by peer-reviewed evidence), we have created a healthcare environment that allows AI solutions to be disseminated and deployed at the point of care without understanding how the model was developed, what data the model learned from, and what data were used to deem the model satisfactory for use. Transparency is needed across 3 main categories: the population from which the data were acquired; model design and development, including training data; and model evaluation and validation. A lack of transparency regarding the training data used for model development directly affects the reproducibility, generalizability, and interpretability of a proposed model. Indeed, our recent study showed an alarming lack of transparency of ML models developed in research studies. Therefore, we need transparency in the reporting of the design, development, evaluation, and validation of AI models in health care to achieve and retain the confidence and trust of all stakeholders. Minimal standards for reporting scientific information are common and have improved the standards of biomedical as well as clinical research. From MIAME (Minimum Information About a Microarray Experiment) for gene expression microarrays to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) for meta-analyses, reporting standards have emerged from communities to promote replication, validation, and the use of secondary resources.
These standards not only ensure transparent reporting of findings, but also guide authors in preparing their manuscripts and allow journals to critically evaluate and appraise the findings, thus aiding the general interpretation of scientific information. Many standards comprise a short checklist of minimal information required, such as the 25-item CONSORT (Consolidated Standards of Reporting Trials) statement for clinical trials, the 22-item STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist for observational studies, and the 33-item SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) checklist for interventional trials. Importantly, both CONSORT and SPIRIT will be extending their checklists to include guidelines for trials that include an ML or AI component. This will complement a new initiative from TRIPOD, TRIPOD-ML (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis for Machine Learning). Feeding into these ongoing initiatives, we propose MINIMAR (MINimum Information for Medical AI Reporting) as a starting point for a broader community discussion. We believe that the adoption of such a standard will help the dissemination of such algorithms across healthcare systems and provide transparency to address potential biases and unintended consequences. MINIMAR will also promote external validation, encouraging the use of secondary resources.

GENERAL PRINCIPLES OF THE MINIMAR DESIGN

As a starting point, such a standard should satisfy the following requirements: (1) include information on the population providing the training data, in terms of data sources, cohort selection; (2) include training data demographics in a way that enables a comparison with the population the model is applied to; (3) provide detailed information about the model architecture and development so as to interpret the intent of the model and compare it to similar models and permit replication; and (4) transparently report model evaluation, optimization, and validation to clarify how local model optimization can be achieved and enable replication and resource sharing (Table 1).
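Requirement (2) implies a concrete comparison between the reported training demographics and the population in which the model will be applied. The following is a minimal sketch of such a check, not part of the MINIMAR proposal itself: the function name `representation_gaps`, the 10-percentage-point tolerance, and the target-population shares are all assumptions made for illustration. The training shares are the race percentages from the first example column of Table 1.

```python
def representation_gaps(training, target, tolerance=0.10):
    """Return groups whose share in the training data differs from the
    target population by more than `tolerance` (absolute proportion).

    Positive values mean the group is over-represented in training data,
    negative values mean it is under-represented.
    """
    gaps = {}
    for group in sorted(set(training) | set(target)):
        diff = training.get(group, 0.0) - target.get(group, 0.0)
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


# Training shares as reported in the first example column of Table 1.
training_race = {"White": 0.690, "Black": 0.031, "Asian": 0.059}

# Hypothetical deployment population, invented for this illustration.
target_race = {"White": 0.40, "Black": 0.35, "Asian": 0.05}

gaps = representation_gaps(training_race, target_race)
for group, diff in gaps.items():
    print(f"{group}: {diff:+.3f}")
```

A check like this is only possible when the demographic items are reported at all, which is the point of the requirement; the threshold that should trigger concern is a judgment call that depends on the clinical setting.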

Table 1.

Reporting standards for 4 essential components of artificial intelligence solutions in health care

Features	Description	Example²³	Example²⁴
1. Study population and setting
Population	Population from which study sample was drawn	Patients undergoing elective surgery	All patients
Study setting	The setting in which the study was conducted (eg, academic medical center, community healthcare system, rural healthcare clinic)	U.S. academic, tertiary care hospital	2 U.S. academic medical centers
Data source	The source from which data were collected	EHRs	EHRs
Cohort selection	Exclusion/inclusion criteria	Adult patients; Patients were excluded if they died during hospitalization.	All admissions for adult patients. Hospitalizations of 24 h or longer.
2. Patient demographic characteristics
Age	Age of patients included in the study	Mean 58.34 y	Median ∼56 y
Sex	Sex breakdown of study cohort	Female: 73.0%	Female: 55.0%
		Male: 27.0%
Race	Race characteristics of patients included in the study	White: 69.0%	Not provided
		Black: 3.1%
		Asian: 5.9%
Ethnicity	Ethnicity breakdown of patients included in the study	Hispanic: 13.2%	Not provided
Socioeconomic status	A measure or proxy measure of the socioeconomic status of patients included in the study	Private: 31.9%	Not provided
		Medicare: 47.8%
		Medicaid: 11.7%
3. Model architecture
Model output	The computed result of the model	Postoperative pain scores	In-hospital deaths, 30-day unplanned readmission, length of stay, discharge status
Target user	The intended user of the model output (eg, clinician, hospital management team, insurance company)	Risk scores produced by the model will be used by the hospital team for pain management	Predictions produced by the model will be used by hospitals for care management
Data splitting	How data were split for training, testing, and validation	10-fold cross-validation	80%/10%/10% (train/validation/test)
Gold standard	Labeled data used to train and test the model	100 manually annotated clinical notes and pain scores recorded in EHR	Death, readmission and ICD codes in EHRs
Model task	Classification or prediction	Prediction	Prediction
Model architecture	Algorithm type (eg, machine learning, deep learning, etc.)	ElasticNet regularized regression	Recurrent neural networks, attention-based time-aware neural network model, and neural network
Features	List of variables used in the model and how they were used in the model in terms of categories or transformation	65 predictive features including age, race, ethnicity, sex, insurance type (as public and private) and preoperative pain (log transformation was applied)	Provided in detail for all models
Missingness	How missingness was addressed: reported, imputed, or corrected	Missing data were imputed using median of the variable distribution	Not provided
4. Model evaluation
Optimization	Model or parameter tuning applied	Generated vectors with a dimension of 300 and a window size of 5	Documented and provided for all models in detail
Internal model validation	Study internal validation	Internal 10-fold cross-validation	Hold-out validation set
External model validation	External validation using data from another setting	Not performed	Not performed
Transparency	How code and data are shared with the community.	Code and sample data available via GitHub	Data is not available; code is available via GitHub

EHR: electronic health record; ICD: International Classification of Diseases.
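Because Table 1 is in effect a checklist, a report can be screened mechanically for items that are absent or marked "Not provided." The sketch below is one possible way to do this and is an illustration only: the item names follow Table 1, but the dictionary layout and the function `missing_items` are assumptions, not an official MINIMAR schema. The example report transcribes the second example column of the table.

```python
# Item names follow Table 1; the data structure is an illustrative assumption.
MINIMAR_ITEMS = {
    "Study population and setting": [
        "Population", "Study setting", "Data source", "Cohort selection"],
    "Patient demographic characteristics": [
        "Age", "Sex", "Race", "Ethnicity", "Socioeconomic status"],
    "Model architecture": [
        "Model output", "Target user", "Data splitting", "Gold standard",
        "Model task", "Model architecture", "Features", "Missingness"],
    "Model evaluation": [
        "Optimization", "Internal model validation",
        "External model validation", "Transparency"],
}


def missing_items(report):
    """List (component, item) pairs that are absent or 'Not provided'."""
    missing = []
    for component, items in MINIMAR_ITEMS.items():
        for item in items:
            value = report.get(item, "").strip()
            if not value or value.lower() == "not provided":
                missing.append((component, item))
    return missing


# Second example column of Table 1, keyed by item name.
example_report = {
    "Population": "All patients",
    "Study setting": "2 U.S. academic medical centers",
    "Data source": "EHRs",
    "Cohort selection": "All admissions for adult patients; 24 h or longer",
    "Age": "Median ~56 y",
    "Sex": "Female 55.0%",
    "Race": "Not provided",
    "Ethnicity": "Not provided",
    "Socioeconomic status": "Not provided",
    "Model output": "In-hospital deaths, 30-day unplanned readmission, "
                    "length of stay, discharge status",
    "Target user": "Hospitals, for care management",
    "Data splitting": "80%/10%/10% (train/validation/test)",
    "Gold standard": "Death, readmission and ICD codes in EHRs",
    "Model task": "Prediction",
    "Model architecture": "Recurrent neural networks and related models",
    "Features": "Provided in detail for all models",
    "Missingness": "Not provided",
    "Optimization": "Documented and provided for all models in detail",
    "Internal model validation": "Hold-out validation set",
    "External model validation": "Not performed",
    "Transparency": "Data is not available; code is available via GitHub",
}

gaps = missing_items(example_report)
for component, item in gaps:
    print(f"{component}: {item}")
```

Note that "Not performed" for external validation counts as reported here: the item is disclosed, even though the validation itself was not done. Whether that distinction should be kept is a design choice of this sketch.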

The first requirement is related to the study population and setting, including patient demographics and cohort selection. It is essential to know the target population and how the training data were derived from this target population. This includes the need to understand the data that were used to develop (and train) the model, including the target patient population, the study setting, and data source, and how the final cohort was selected. These details describe the data on which a model is trained and are needed to anticipate potential biases and equity issues.
As the second requirement, this should include the detailed documentation of patient characteristics and sensitive variables in the population, such as race and socioeconomic status. For example, a model that predicts general maternal mortality and is then applied to a Black community must include a significant proportion of Black patients in the training data, as well as risk factors applicable to them, such as sickle cell disease or high blood pressure, in order to adequately predict outcomes in that community. Data transparency is essential to promote fair and equitable models.
The third requirement would serve to provide a detailed explanation of the design and development of the AI or ML model in every publication. To evaluate any AI solution, it is essential to know the model task (ie, classification or prediction), the intended model output (eg, risk score for 30-day mortality), and the model beneficiary, if any. Currently, this is not widely done, which has led to important misinterpretations of model outcomes. For example, a recent study highlighted downstream bias in an ML model that was developed to predict costs of care yet was implemented in the healthcare setting to predict need of care. This misinterpretation resulted in allocating more intensive care resources to patients who had higher reimbursement rates, rather than to patients who had higher clinical need for those resources. Other necessary model details, such as modeling technique, feature selection, and the handling of missing values, should be transparent to appropriately apply an AI model in health care.
The fourth requirement is related to information on model evaluation, including optimization and validation. Model evaluation strategies should be defined in detail, in terms of the data used for both internal and external validation as well as the approach adopted for evaluation (eg, 5-fold cross-validation or 80/20 split). The choice of validation metrics, such as sensitivity, specificity, positive predictive value, or area under the receiver-operating characteristic curve, also needs to be defined. In addition, the overall model performance metrics and the hyperparameters chosen for the final model should be reported. Finally, as part of model evaluation, transparency is necessary for broad AI application in health care in order to achieve and retain the confidence and trust of all stakeholders. Indeed, recent studies show an alarming difficulty in reproducing models developed in research studies and suggest that even if the training data cannot be shared due to privacy issues, the source code of the model should be shared publicly. Therefore, in order to demonstrate the provenance and authenticity of the data and knowledge used to make decisions by AI models, promoting access to training data and source code is crucial to ensure that ML in biomedicine can be broadly applied and generalized. This is essential not only for choosing the best model for the given setting, but also for the unbiased comparison of different models or different settings.
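To make the validation metrics named under the fourth requirement concrete, the following sketch computes sensitivity, specificity, positive predictive value, and the area under the receiver-operating characteristic curve from first principles. It is an illustration under stated assumptions, not a recommended implementation: the labels and scores are toy values invented for this example, the 0.5 decision threshold is arbitrary, and the function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division: returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and PPV at a fixed decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
    }


def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Toy outcome labels and model scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

metrics = threshold_metrics(y_true, y_score)
metrics["auroc"] = auroc(y_true, y_score)
print(metrics)
```

The threshold-dependent metrics change with the chosen cutoff while the AUROC does not, which is one reason a report should state both the metrics used and the threshold at which they were computed.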

DISCUSSION

Our goal is to set forth a standard for the minimum information necessary to understand intended predictions, target populations, and hidden biases of an AI and ML clinical decision tool for both research scientists and medical practitioners. To that end, we hope that this description will stimulate discussion of the proposed MINIMAR standards and encourage the medical informatics community, as well as the general research community, to provide us with their views on how this standard can be improved. Clearly, the consequences of making wrong or inaccurate classifications or predictions in health care can be fatal. To address this, we need clear reporting of the training data, the model architecture, and evaluation and validation procedures. For that, we need reporting standards. Here, we start this conversation by proposing MINIMAR, the minimal information for medical AI reporting. We believe it would be valuable if groups producing these studies would strive for a level of transparency in their methods that supports the reproducibility of results, in particular on different underlying population representations. This information can help prioritize research agendas and highlight populations underrepresented in this wave of medical informatics. We call for a standard to accurately and responsibly report on AI in health care. This will facilitate the design and implementation of these models and promote the development and use of associated clinical decision support tools, as well as manage concerns regarding accuracy and bias. In this era of data-driven medicine, establishing minimum standards for developing and reporting methodologies, sharing algorithms and tools, and establishing other resources is essential to ensure transparency and equity are at the forefront of AI-augmented health care. This is a necessary step in a larger agenda that will help assess the ethics, regulation, and effectiveness of AI models in transforming health care.

AUTHOR CONTRIBUTIONS

TH-B attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. TH-B affirms that the article is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained. TH-B was involved in study concept and design; drafting of the article; administrative, technical, or material support; and study supervision. TH-B, SB, JPAI, and NHS were involved in critical revision of the manuscript for important intellectual content.

CONFLICT OF INTEREST STATEMENT

None declared.

19 in total

1. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

2. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies.

Authors: Erik von Elm; Douglas G Altman; Matthias Egger; Stuart J Pocock; Peter C Gøtzsche; Jan P Vandenbroucke
Journal: Lancet Date: 2007-10-20 Impact factor: 79.321

3. The Proliferation of Reports on Clinical Scoring Systems: Issues About Uptake and Clinical Utility.

Authors: Douglas W Challener; Larry J Prokop; Omar Abu-Saleh
Journal: JAMA Date: 2019-06-25 Impact factor: 56.272

4. Screening for medication errors using an outlier detection system.

Authors: Gordon D Schiff; Lynn A Volk; Mayya Volodarskaya; Deborah H Williams; Lake Walsh; Sara G Myers; David W Bates; Ronen Rozenblum
Journal: J Am Med Inform Assoc Date: 2017-03-01 Impact factor: 4.497

Review 5. Future of electronic health records: implications for decision support.

Authors: Brian Rothman; Joan C Leonard; Michael M Vigoda
Journal: Mt Sinai J Med Date: 2012 Nov-Dec

6. Dissecting racial bias in an algorithm used to manage the health of populations.

Authors: Ziad Obermeyer; Brian Powers; Christine Vogeli; Sendhil Mullainathan
Journal: Science Date: 2019-10-25 Impact factor: 47.728

7. SPIRIT 2013 Statement: defining standard protocol items for clinical trials.

Authors: An-Wen Chan; Jennifer M Tetzlaff; Douglas G Altman; Andreas Laupacis; Peter C Gøtzsche; Karmela Krle A-Jerić; Asbjørn Hrobjartsson; Howard Mann; Kay Dickersin; Jesse A Berlin; Caroline J Dore; Wendy R Parulekar; William S M Summerskill; Trish Groves; Kenneth F Schulz; Harold C Sox; Frank W Rockhold; Drummond Rennie; David Moher
Journal: Rev Panam Salud Publica Date: 2015-12

8. Scalable and accurate deep learning with electronic health records.

Authors: Alvin Rajkomar; Eyal Oren; Kai Chen; Andrew M Dai; Nissan Hajaj; Michaela Hardt; Peter J Liu; Xiaobing Liu; Jake Marcus; Mimi Sun; Patrik Sundberg; Hector Yee; Kun Zhang; Yi Zhang; Gerardo Flores; Gavin E Duggan; Jamie Irvine; Quoc Le; Kurt Litsch; Alexander Mossin; Justin Tansuwan; James Wexler; Jimbo Wilson; Dana Ludwig; Samuel L Volchenboum; Katherine Chou; Michael Pearson; Srinivasan Madabushi; Nigam H Shah; Atul J Butte; Michael D Howell; Claire Cui; Greg S Corrado; Jeffrey Dean
Journal: NPJ Digit Med Date: 2018-05-08

9. Predicting inadequate postoperative pain management in depressed patients: A machine learning approach.

Authors: Arjun Parthipan; Imon Banerjee; Keith Humphreys; Steven M Asch; Catherine Curtin; Ian Carroll; Tina Hernandez-Boussard
Journal: PLoS One Date: 2019-02-06 Impact factor: 3.240

10. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges.

Authors: Richard D Riley; Joie Ensor; Kym I E Snell; Thomas P A Debray; Doug G Altman; Karel G M Moons; Gary S Collins
Journal: BMJ Date: 2016-06-22
