
Randomized Clinical Trials of Machine Learning Interventions in Health Care: A Systematic Review.

Deborah Plana1, Dennis L Shung2, Alyssa A Grimshaw3, Anurag Saraf4, Joseph J Y Sung5, Benjamin H Kann6.   

Abstract

Importance: Despite the potential of machine learning to improve multiple aspects of patient care, barriers to clinical adoption remain. Randomized clinical trials (RCTs) are often a prerequisite to large-scale clinical adoption of an intervention, and important questions remain regarding how machine learning interventions are being incorporated into clinical trials in health care.

Objective: To systematically examine the design, reporting standards, risk of bias, and inclusivity of RCTs for medical machine learning interventions.

Evidence Review: In this systematic review, the Cochrane Library, Google Scholar, Ovid Embase, Ovid MEDLINE, PubMed, Scopus, and Web of Science Core Collection online databases were searched and citation chasing was done to find relevant articles published from the inception of each database to October 15, 2021. Search terms for machine learning, clinical decision-making, and RCTs were used. Exclusion criteria included implementation of a non-RCT design, absence of original data, and evaluation of nonclinical interventions. Data were extracted from published articles. Trial characteristics, including primary intervention, demographics, adherence to the CONSORT-AI reporting guideline, and Cochrane risk of bias, were analyzed.

Findings: The literature search yielded 19 737 articles, of which 41 were RCTs; these trials involved a median of 294 participants (range, 17-2488 participants). A total of 16 RCTs (39%) were published in 2021, 21 (51%) were conducted at single sites, and 15 (37%) involved endoscopy. No trial adhered to all CONSORT-AI standards. Common reasons for nonadherence were not assessing poor-quality or unavailable input data (38 trials [93%]), not analyzing performance errors (38 [93%]), and not including a statement regarding code or algorithm availability (37 [90%]). Overall risk of bias was high in 7 trials (17%). Of the 11 trials (27%) that reported race and ethnicity data, the median proportion of participants from underrepresented minority groups was 21% (range, 0%-51%).

Conclusions and Relevance: This systematic review found that despite the large number of medical machine learning–based algorithms in development, few RCTs for these technologies have been conducted. Among published RCTs, there was high variability in adherence to reporting standards and risk of bias, and a lack of participants from underrepresented minority groups. These findings merit attention and should be considered in future RCT design and reporting.


Year:  2022        PMID: 36173632      PMCID: PMC9523495          DOI: 10.1001/jamanetworkopen.2022.33946

Source DB:  PubMed          Journal:  JAMA Netw Open        ISSN: 2574-3805


Introduction

Machine learning has the potential to improve the diagnosis and prognosis of disease to enhance clinical care. Given the increasing amount of digital data generated from routine medical care, available computational processing power, and research advances, such as deep learning, there has been substantial interest in applying machine learning techniques to improve patient care across medical disciplines.[1,2] Models have been investigated for tasks such as improved cancer diagnosis, emergency department triage, and intensive care unit decision support.[3,4,5] However, recent failures to successfully implement machine learning systems in clinical settings have highlighted the limitations of these technologies, generating disillusionment and distrust in their potential to impact medicine.[6,7] These machine learning system failures are often attributable to a lack of generalizability, an inability to adapt a system trained with data from 1 context to perform well in a new one,[8] or an inability to demonstrate a clinically meaningful benefit.[7] Mitigation strategies have been proposed to ensure model applicability, such as the use of larger and more diverse data sets and direct collaborations with clinical experts in model development.[9,10,11] We investigated a different and complementary area of study: machine learning model-testing procedures, specifically randomized clinical trials (RCTs), which may affect the models' ultimate use in heterogeneous clinical settings. Randomized clinical trials are considered the gold standard for assessing an intervention’s impact in clinical care,[12] and the landscape of RCTs for machine learning in health care continues to evolve.
Randomized clinical trials, particularly those with transparent and reproducible methods, are important for demonstrating the clinical utility of machine learning interventions given the inherent opacity and black box nature of these models and the difficulty in deciphering the mechanistic basis for model predictions.[13,14] Furthermore, machine learning model performance in the clinical setting depends on the training data that were used during model development and may not generalize well to patient populations that are outside the training data’s distribution.[15] Factors such as geographic location[16] and racial, ethnic, and sex characteristics of model training data are often overlooked; thus, RCTs that are inclusive of a range of demographic backgrounds are crucial to avoiding known biases that can be propagated and deepened based on flawed training data.[17,18] Therefore, we performed a systematic review to better understand the landscape of machine learning RCTs and trial qualities that affect reproducibility, inclusivity, generalizability, and successful implementation of artificial intelligence (AI) or machine learning interventions in clinical care. We focused the review on trials that used AI or machine learning as a clinical intervention, with patients allocated randomly to either a treatment arm with a therapeutic intervention based on machine learning or a standard of care arm.

Methods

This systematic review used the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA)[19] and Synthesis Without Meta-analysis (SWiM)[20] reporting guidelines. The protocol was registered a priori (CRD42021230810).

Search Strategy and Selection Criteria

A systematic search of the literature was conducted by a medical librarian (A.A.G.) in the Cochrane Library, Google Scholar, Ovid Embase, Ovid MEDLINE, PubMed, Scopus, and Web of Science Core Collection databases to find relevant articles published from the inception of each database to October 15, 2021; final searches were performed in all databases on that date. The search was peer reviewed by a second medical librarian using the Peer Review of Electronic Search Strategies (PRESS) guideline.[21] Databases were searched using a combination of controlled vocabulary and free-text terms for AI, clinical decision-making, and RCTs. The search was not limited by language or year. Details of the full search strategies are given in eAppendix 1 in the Supplement. The citationchaser[22] package for R software, version 4.0.3 (R Foundation for Statistical Computing), was used to search the reference lists of included studies and to retrieve articles that cited the included studies, finding additional relevant studies not retrieved by the database search. Citations from all databases were imported into an EndNote 20 library (Clarivate Analytics), in which duplicates were removed. The deduplicated results were imported into the Covidence systematic review management program for screening and data extraction. Two independent screeners performed a title and abstract review, with a third screener resolving disagreements; this phase of screening was performed by 5 of us (D.P., D.L.S., A.A.G., A.S., and B.H.K.). The full texts of the resulting articles were then independently reviewed for inclusion by 2 screeners (D.P., D.L.S., and A.S.), with a third screener (B.H.K.) resolving disagreements. Articles were included if they were deemed by consensus to be RCTs in which AI or machine learning was used in at least 1 randomization arm in a medical setting. Reasons for exclusion are shown in Figure 1.
Figure 1.

Screening and Selection of Randomized Clinical Trials

AI indicates artificial intelligence.

aDatabases and registers included Cochrane Library, Google Scholar, Ovid Embase, Ovid MEDLINE, PubMed, Scopus, and Web of Science Core Collection.


Statistical Analysis

Two of us (D.P. and D.L.S.) independently extracted data and assessed the risk of bias for each study using standardized data extraction forms. The completed forms were compared; disagreement was resolved by review and discussion, with another of us (B.H.K.) serving as final arbiter. Authors were not contacted for additional unpublished data. Risk of bias was assessed using the Cochrane Risk of Bias, version 2 tool for RCTs.[23] This tool has 5 domains: risk of bias owing to the randomization process, deviations from the intended interventions (effect of assignment to intervention), missing outcome data, measurement of the outcome, and selection of the reported result. To assess reproducibility and reporting transparency, we assessed article adherence to the recently published Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI) reporting guideline,[24] an extension of the CONSORT guideline developed by an international multistakeholder group through staged consensus. Machine learning–based RCTs are recommended to routinely report the extended criteria in addition to the core CONSORT items. Two of us (D.P. and D.L.S.) independently assessed each of the 11 CONSORT-AI extension criteria for each article, with disagreement resolved by review and discussion and another of us (B.H.K.) serving as final arbiter. To evaluate inclusivity, we assessed reporting of sex, race, and ethnicity. We calculated the proportion of participants from underrepresented minority groups within each study using the National Institutes of Health definition of groups underrepresented in biomedical research,[25] which designates American Indian or Alaska Native, Black or African American, Hispanic or Latino, and Native Hawaiian or other Pacific Islander individuals as underrepresented minority groups.
To assess other qualities pertaining to generalizability and clinical adoption, we assessed the use of clinical vs nonclinical end points, whether the trial was conducted at a single site or multiple sites, and geographic location. Other qualities assessed were the use of measures with vs without performance thresholds, the disease setting of the trial, the model training data type, and the machine learning model type. The data for all aforementioned items were independently extracted by 2 of us (D.P. and D.L.S.) for each article, with disagreement resolved by review and discussion, with another of us (B.H.K.) serving as final arbiter. All summary statistics were computed using R software, version 4.0.3.
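The descriptive statistics described above (medians, ranges, and proportions) were computed by the authors in R 4.0.3. As a rough illustration only, the same summaries can be sketched in a few lines of Python; the enrollment numbers below are hypothetical, not the actual extracted data from the included trials:

```python
from statistics import median

# Hypothetical enrollment counts per included RCT (illustrative only;
# NOT the actual extracted data from the 41 trials)
enrollment = [17, 120, 294, 700, 2488]

med = median(enrollment)                   # median trial size
rng = (min(enrollment), max(enrollment))   # range of trial sizes

# Proportion of trials meeting some criterion,
# eg reporting race and ethnicity data
n_reporting = 2
proportion = round(100 * n_reporting / len(enrollment))

print(med, rng, proportion)  # 294 (17, 2488) 40
```

The same pattern (a per-trial vector, then a median, range, or percentage) covers every summary statistic reported in the Results.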

Results

The search resulted in 28 159 records; after duplicates were removed, 19 737 remained for title and abstract screening, and 19 621 of these were excluded (Figure 1). No additional articles were located through citation chasing. Full-text review was conducted for 116 articles; of those, 75 studies were excluded because they were conference abstracts (n = 19), had the wrong study design (n = 16), performed the wrong intervention (n = 14), contained duplicate study data (n = 11), did not involve clinical decision-making (n = 6), did not use AI or machine learning (n = 3), provided preliminary results only (n = 2), were not conducted in a medical setting (n = 2), did not assess any outcomes that impacted clinical decision-making (n = 1), or were not written in the English language (n = 1) (eAppendix 1 and eAppendix 2 in the Supplement). Overall, 41 RCTs involving a median of 294 participants (range, 17-2488 participants) met inclusion criteria.[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] The main study characteristics are shown in the Table as well as eAppendix 3 in the Supplement. No quantitative meta-analysis was performed owing to the heterogeneity of reported outcomes across clinical trials. The number of published RCTs increased substantially over the study period.
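As a quick consistency check (not part of the original analysis), the screening-flow counts above can be verified arithmetically; all figures below are taken directly from the text:

```python
# Screening-flow counts taken directly from the Results text
records_identified = 28_159          # records from all database searches
after_dedup = 19_737                 # remaining after duplicate removal
excluded_title_abstract = 19_621     # excluded at title/abstract screening

fulltext_reviewed = after_dedup - excluded_title_abstract   # 116 articles

# Full-text exclusion reasons and counts, as reported
exclusion_reasons = {
    "conference abstract": 19,
    "wrong study design": 16,
    "wrong intervention": 14,
    "duplicate study data": 11,
    "no clinical decision-making": 6,
    "no AI or machine learning": 3,
    "preliminary results only": 2,
    "not a medical setting": 2,
    "no outcome affecting clinical decisions": 1,
    "not in English": 1,
}
excluded_fulltext = sum(exclusion_reasons.values())         # 75
included_rcts = fulltext_reviewed - excluded_fulltext       # 41
```

The counts are internally consistent: 19 737 minus 19 621 leaves 116 full texts, the 10 exclusion reasons sum to 75, and 116 minus 75 yields the 41 included RCTs.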
Of the 41 included RCTs,[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] 16 (39%) were published from January to October 2021[42,47,48,49,50,51,53,54,55,56,57,59,60,61,62,63] and 36 (88%) from January 2019 to October 2021 (Figure 2).[26,27,29,31,32,33,34,35,37,38,39,41,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] Trials were most often conducted in the US (15 [37%])[29,30,31,32,33,36,40,44,46,49,55,59,61,62,63] or China (13 [32%]),[27,37,38,39,43,45,48,52,54,56,60,65,66] and 6 studies (15%) were conducted across multiple countries.[26,29,42,47,50,57] In terms of qualities associated with generalizability, 20 RCTs[26,28,29,30,32,33,34,39,42,44,47,50,51,54,56,57,59,62,63,64] (49%) were conducted at multiple sites, and 21 RCTs (51%) were conducted at a single site.[27,31,35,36,37,38,40,41,43,45,46,48,49,52,53,55,58,60,61,65,66] Only 11 trials (27%) reported race and ethnicity (Figure 2)[30,31,32,33,36,44,49,55,59,61,63]; among those trials, a median of 21% (range, 0%-51%) of participants were from underrepresented minority groups.
Table.

Summary of Randomized Clinical Trials Included in the Systematic Review

Source | Study location | Study aim | Female, % | Race and ethnicity (%) | Medical specialty | Disease
Pavel et al,[26] 2020 | Ireland, the Netherlands, Sweden, UK | Detect neonatal seizures | 40 | NR | Neonatology | Neonatal seizures
Wang et al,[27] 2020 | China | Detect colorectal adenomas | 52 | NR | Gastroenterology | Colon polyp and adenoma
Caparros-Gonzalez et al,[28] 2018 | Spain | Assess premature infant physiological response | 41 | NR | Neonatology | None
Nimri et al,[29] 2020 | Europe (multiple countries), Israel, US | Optimize insulin dose | 52 | NR | Endocrinology | Type 1 diabetes
Vennalaganti et al,[30] 2018 | US | Detect Barrett esophagus–associated neoplasia | 24 | African American (2), White (95), other (3) | Gastroenterology | Barrett esophagus–associated neoplasia
Voss et al,[31] 2019 | US | Improve socialization in children with autism spectrum disorder | 11 | Black (3), East Asian (24), Hispanic (23), Middle Eastern (1), South Asian (8), White/European American (35), unknown (17) | Pediatrics | Autism spectrum disorders
Manz et al,[32] 2020 | US | Increase serious illness conversations among patients with cancer | 54 | Non-Hispanic Black (20), non-Hispanic White (70), other (10) | Oncology | End of life
Persell et al,[33] 2020 | US | Improve blood pressure control in outpatients with hypertension | 61 | Black (35), Hispanic (8), White (52), other (9) | Primary care | Hypertension
Repici et al,[34] 2020 | Italy | Detect colorectal adenomas | 51 | NR | Gastroenterology | Colorectal neoplasia
Wijnberge et al,[35] 2020 | Netherlands | Detect intraoperative hypotension | 43 | NR | Cardiac surgery | Intraoperative hypotension
Shimabukuro et al,[36] 2017 | US | Predict outcomes in patients with sepsis | 54 | African American (11), Asian (16), Hispanic (21), White (47), other (5) | Intensive care | Sepsis
Wang et al,[37] 2020 | China | Detect colorectal adenomas | 49 | NR | Gastroenterology | Colon polyp and adenoma
Gong et al,[38] 2020 | China | Detect colorectal adenomas | 51 | NR | Gastroenterology | Colorectal adenomas
Lin et al,[39] 2019 | China | Diagnose childhood cataracts | 55 | NR | Ophthalmology | Childhood cataracts
Rabbi et al,[40] 2015 | US | Facilitate weight loss through automated personalized feedback for physical activity and diet | 47 | NR | Primary care | None
Auloge et al,[41] 2020 | France | Assess feasibility of AI/AR tool for vertebroplasty | 65 | NR | Orthopedics | Vertebral fracture
Avari et al,[42] 2021 | Spain, UK | Decrease hypoglycemia episodes with personalized bolus advice for people with type 1 diabetes | 52 | NR | Endocrinology | Type 1 diabetes
Wang et al,[43] 2019 | China | Detect colorectal adenomas | 52 | NR | Gastroenterology | Colon polyp and adenoma
Forman et al,[44] 2019 | US | Facilitate weight loss by predicting and preventing dietary lapses | 85 | Hispanic or non-White (30), non-Hispanic White (70) | Primary care | Obesity
Wu et al,[45] 2019 | China | Reduce rate of blind spots during EGD | 52 | NR | Gastroenterology | No specific disease
El Solh et al,[46] 2009 | US | Predict optimal CPAP using neural network to reduce titration failure | 57 | NR | Pulmonology | Obstructive sleep apnea
Luštrek et al,[47] 2021 | Belgium, Italy | Assess self-management of congestive heart failure using app and wristband | NR | NR | Cardiology | Congestive heart failure
Chen and Gao,[48] 2021 | China | Assess AI-based echocardiography for diagnosis of acute heart failure | 36 | NR | Cardiology | Acute left heart failure
Seol et al,[49] 2021 | US | Assess management of childhood asthma | 43 | White (72) | Pediatrics | Asthma
Repici et al,[50] 2022 | Italy, Switzerland | Investigate colon adenoma detection of nonexpert endoscopists | 50 | NR | Gastroenterology | Colorectal cancer screening
Kamba et al,[51] 2021 | Japan | Decrease colon adenoma miss rate | 23 | NR | Gastroenterology | Colorectal cancer screening
Liu et al,[52] 2020 | China | Increase polyp and adenoma detection with CADe | 53 | NR | Gastroenterology | Colorectal cancer screening
Blomberg et al,[53] 2021 | Denmark | Assess emergency-dispatched recognition of cardiac arrest during call | 36 | NR | Emergency medicine | Dispatcher assessment
Xu et al,[54] 2021 | China | Assess polyp detection of AI-assisted colonoscopy | 49 | NR | Gastroenterology | Colorectal cancer screening
Jayakumar et al,[55] 2021 | US | Evaluate effect of AI-enabled patient decision aid on knee osteoarthritis management | 64 | Asian (12), Black or African American (17), Hispanic or Latino (34), White (36) | Orthopedics | Osteoarthritis
Wu et al,[56] 2021 | China | Identify blind spots in EGD | 52 | NR | Gastroenterology | Early gastric cancer
Sandal et al,[57] 2021 | Denmark, Norway | Improve quality of life in patients with lower back pain using app | 55 | NR | Primary care | Lower back pain
Noor et al,[58] 2020 | India | Identify follicles in patients with ovarian stimulation | 100 | NR | Gynecology | Infertility
Yao et al,[59] 2021 | US | Identify patients with low ejection fraction from ECG data | 54 | Asian (1), Black or African American (2), White (94), other (2), missing (0.5) | Cardiology | Heart failure
Wu et al,[60] 2021 | China | Identify gastric neoplasms on EGD | 54 | NR | Gastroenterology | Early gastric neoplasia
Strömblad et al,[61] 2021 | US | Predict surgical case durations | 83 | Asian (8), Black (8), White (77), other (4), unknown (4) | Surgery | Solid tumor surgical procedures for gynecological and colorectal cancers
Eng et al,[62] 2021 | US | Assess skeletal age | 46 | NR | Radiology | Skeletal development
Glissen Brown et al,[63] 2022 | US | Reduce adenoma miss rate using computer-aided polyp detection | 45 | African American (21), White (68) | Gastroenterology | Colorectal cancer screening
Meijer et al,[64] 2020 | Netherlands | Reduce pain after surgical procedures | 56 | NR | Anesthesiology | Postoperative pain
Liu et al,[65] 2020 | China | Improve detection rate of polyps and adenomas | 46 | NR | Gastroenterology | Colon polyp and adenoma
Su et al,[66] 2020 | China | Improve detection rate of polyps and adenomas | 51 | NR | Gastroenterology | Colon polyp and adenoma

Abbreviations: AI, artificial intelligence; AR, artificial reality; CADe, computer-aided detection; CPAP, continuous positive airway pressure; EGD, esophagogastroduodenoscopy; NR, not reported.

Figure 2.

Characteristics of Randomized Clinical Trials

A total of 41 randomized clinical trials were included in the analysis.[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] Individuals from underrepresented minority groups were participants in 11 clinical trials in which information on participant race and/or ethnicity was reported.[30,31,32,33,36,44,49,55,59,61,63] B, Data for 2021 are from January through October 15. D, The other medical specialty category includes anesthesiology, cardiac surgery, emergency medicine, general surgery, gynecology, intensive care, ophthalmology, pulmonology, and radiology.

To assess RCT transparency and reproducibility, we assessed trial adherence to CONSORT-AI standards (Figure 3). We found that no RCT met all the criteria. A total of 13 RCTs (32%) met at least 8 of 11 criteria (eAppendix 3 in the Supplement).[28,29,38,42,45,47,49,51,59,60,62,63,65] The most common reasons for lack of guideline adherence were not assessing poor-quality or unavailable input data (38 trials [93%]),[26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] not analyzing performance errors (38 [93%]),[26,27,28,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,48,49,50,51,52,53,54,55,56,57,58,59,60,61,63,64,65,66] and lack of a statement regarding code or algorithm availability (37 [90%]).[26,27,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,51,52,53,54,55,56,57,58,61,62,63,64,65,66]
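Adherence tallies like these reduce to simple boolean counting over a trials-by-criteria matrix. A minimal Python sketch with an invented three-trial adherence matrix (illustrative only, not the review's extraction data) shows how per-trial scores and per-criterion gaps could be computed:

```python
# Hypothetical CONSORT-AI adherence matrix (illustrative only, NOT the
# review's extraction data). Each row is one trial; each of the 11
# columns is True when that trial met that extension criterion.
adherence = [
    [True] * 8 + [False] * 3,   # meets 8 of 11 criteria
    [True] * 11,                # meets all 11 criteria
    [True] * 5 + [False] * 6,   # meets 5 of 11 criteria
]

# Per-trial adherence scores (number of criteria met)
scores = [sum(row) for row in adherence]
n_fully_adherent = sum(s == 11 for s in scores)
n_meeting_8_plus = sum(s >= 8 for s in scores)

# Per-criterion nonadherence counts, to identify the most commonly
# missed items (as reported for input-data and error analyses)
missed_per_criterion = [sum(not row[i] for row in adherence)
                        for i in range(11)]
```

With the full 41-trial matrix, `n_fully_adherent` and `n_meeting_8_plus` would correspond to the 0 and 13 trials reported above, and `missed_per_criterion` would identify the three most commonly missed items.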
Figure 3.

Adherence to Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI) Extension Guideline

A total of 41 randomized clinical trials were included in the analysis.[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] The CONSORT-AI extension is an internationally developed consensus document reflecting recommended clinical trial reporting characteristics to ensure transparency and reproducibility.[24]

The risk of bias for the RCTs is summarized in Figure 4. Overall risk of bias was high in 7 trials (17%).[27,36,40,46,48,55,58] Bias from measurement of outcomes was the type most often observed, with at least some concern for bias in 19 trials (46%).[27,33,38,40,41,42,43,44,45,46,48,49,51,55,56,59,63,65,66]
Figure 4.

Risk of Bias in Randomized Clinical Trials

A total of 41 randomized clinical trials were included in the analysis.[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] Risk of bias was assessed using the revised Cochrane Risk of Bias, version 2 tool for randomized clinical trials.[23]

Regarding clinical use cases in RCTs, the most common clinical specialty represented was gastroenterology (16 [39%])[27,30,34,37,38,43,45,50,51,52,54,56,60,63,65,66]; most of these RCTs involved endoscopic imaging.[27,34,37,38,43,45,50,51,52,54,56,60,63,65,66] Most studies involving clinical use cases enrolled adult patients (36 [88%]).[27,29,30,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] Four trials (10%) were performed in a primary care setting, and all of these trials involved user-inputted data[33,40,44,57]; 4 other trials (10%) were in cardiology or cardiac surgery[35,47,48,59] and involved electrocardiographic, wearable device, echocardiographic, or arterial waveform data. Two trials (5%) were performed in the neonatal setting,[26,28] evaluating seizures and physiological distress, and 3 studies (7%) were performed primarily among pediatric populations more broadly,[31,39,49] evaluating asthma, autism spectrum disorder, and childhood cataracts.
Most RCTs involved clinical outcome measures (34 [83%])[26,27,29,30,31,32,33,34,35,36,37,38,39,42,43,44,45,46,48,49,50,51,52,53,54,55,57,58,59,60,63,64,65,66] and outcome measures without performance thresholds (32 [78%]).[28,31,32,33,34,35,36,38,39,40,42,44,46,47,48,49,50,51,52,53,54,55,56,57,58,60,61,62,63,64,65,66] In terms of data sources, 15 trials (37%) mostly used endoscopic imaging–based interventions,[27,34,37,38,43,45,50,51,52,54,56,60,63,65,66] 5 (12%) used patient-inputted data,[33,40,44,55,57] 2 (5%) used primary electronic health record data,[32,61] 2 (5%) used electrocardiogram data,[48,59] and 2 (5%) used blood-based data (glucose and insulin levels).[29,42] A total of 20 articles (49%) used deep learning neural networks.[27,30,34,37,38,43,45,46,48,50,51,52,54,56,59,60,62,63,65,66]

Discussion

This systematic review found a lack of RCTs for medical machine learning interventions, highlighting the need for additional well-designed, transparent, and inclusive RCTs to promote clinical adoption of these technologies. There is growing concern that new machine learning models are being released after preliminary validation studies without follow-through on their ability to formally show superiority in a gold standard RCT.[67,68] Of note, there are currently 343 US Food and Drug Administration (FDA)–approved medical AI or machine learning interventions.[69] Our finding of 41 medical machine learning RCTs suggests that most FDA-approved machine learning–enabled medical devices are approved without efficacy demonstrated in an RCT. This finding is likely explained, in part, by the lower burden of evidence required for AI or machine learning algorithm clearance (often classified by the FDA as software as a medical device) compared with pharmaceutical drugs.[70] To our knowledge, this review is the first rigorous attempt to quantify this gap. Prior work has shown a lack of prospective testing in this field but either has not rigorously quantified RCTs using a PROSPERO-registered method with tie-breaking arbitration[71] or has analyzed only technologies related to imaging data.[72] In addition, these studies did not explore adherence to CONSORT-AI standards or assess the inclusivity of underrepresented minority groups and women in the study populations. Finally, the scope of our review differed from that of prior work: we focused specifically on clinical AI or machine learning interventions used as investigational arms in RCTs. We excluded RCTs that used traditional statistical models and RCTs in which AI or machine learning was included in the study protocol but was not part of the randomized intervention.
In this way, we highlighted RCTs that directly compared AI or machine learning with standard of care and RCTs that were designed to demonstrate a high level of evidence for clinical utility. A comparison of the trials included in this study with prior work is available in eAppendix 4 in the Supplement. Our initial search of 28 159 records and subsequent yield of only 41 RCTs[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] indicates a translational gap between development and clinical impact. Most of the RCTs included in this review were conducted between January 2019 and October 2021 (36 [88%]),[26,27,29,31,32,33,34,35,37,38,39,41,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] and 16 studies (39%) were conducted between January and October 2021,[42,47,48,49,50,51,53,54,55,56,57,59,60,61,62,63] indicating that the rate of new RCTs for machine learning interventions increased over time. Clinical use cases for these technologies most often involved endoscopic imaging in gastroenterology (15 [37%])[27,34,37,38,43,45,50,51,52,54,56,60,63,65,66] and enrolled adult patients (36 [88%]).[27,29,30,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] Regarding trial reporting, no RCT included in this review adhered to all the machine learning-specific reporting standards (ie, the CONSORT-AI extension guideline[24]). 
Specifically, 37 trials (90%) did not share code and data along with study results,[26,27,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,51,52,53,54,55,56,57,58,61,62,63,64,65,66] 38 (93%) did not analyze poor-quality or unavailable input data,[26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] and 38 (93%) did not assess performance errors,[26,27,28,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,48,49,50,51,52,53,54,55,56,57,58,59,60,61,63,64,65,66] all of which may contribute to issues in reproducibility. These results suggest that machine learning RCT reporting quality needs improvement. The CONSORT-AI guideline helps ensure transparency and reproducibility of RCT methods,[24] and the lack of guideline adherence observed among RCTs in this review may be another barrier to clinical adoption. Of note, the CONSORT-AI standards were published in September 2020, when most of the trials analyzed in this review would have been either published or under peer review. Future work should reassess the percentage of guideline-adherent RCTs published after 2021 to assess the impact of the CONSORT-AI guideline in RCT design. 
Regarding RCT inclusivity, among the trials selected based on our search criteria, we found that only 20 (49%) were conducted at more than 1 site.[26,28,29,30,32,33,34,39,42,44,47,50,51,54,56,57,59,62,63,64] In addition, we found a lack of reporting of demographic information across studies, with only 11 RCTs (27%) reporting participant race or ethnicity.[30,31,32,33,36,44,49,55,59,61,63] Within this subset, a median of 21% of enrolled participants belonged to underrepresented minority groups, a proportion concordant with, or slightly higher than, those reported in prior systematic reviews of medical RCTs.[73,74] Trials were most often conducted in the US (15 [37%])[29,30,31,32,33,36,40,44,46,49,55,59,61,62,63] or China (13 [32%]),[27,37,38,39,43,45,48,52,54,56,60,65,66] and only 6 studies (15%) were conducted across multiple countries.[26,29,42,47,50,57] Taken together, this lack of diversity in patient populations involved in RCTs indicates that the generalizability of their results across clinical sites is unknown, a growing concern for the federal regulation of machine learning interventions as medical devices.[75] Regarding risk of bias, a high risk was found in 7 trials (17%);[27,36,40,46,48,55,58] although substantial, this proportion was lower than that found in a cross-sectional study of non–machine learning interventions,[76] in which a median of 50% of studies had a high risk of bias. This difference suggests that deficiencies in the design, execution, and reporting of machine learning RCTs are not more widespread than those in other trials of medical interventions. This systematic review found a low but increasing number of RCTs of machine learning interventions in health care.
This low number is in contrast to the large number of preliminary validation studies of medical machine learning interventions and the increasing number of FDA approvals in this research area; many of these technologies have reached the clinical implementation phase without a gold standard assessment of efficacy through an RCT.[69] It is not practical to formally assess every potential iteration of a new technology through an RCT (eg, a machine learning algorithm used in a hospital system and then used for the same clinical scenario in another geographic location). In particular, when an algorithm only indirectly affects patient care (eg, risk stratification, enhanced diagnosis), local, independent validation studies may provide an adequate level of evidence to encourage early adoption, although this is an area of ongoing debate. A baseline RCT of an intervention’s efficacy would help to establish whether a new tool provides clinical utility and value. This baseline assessment could be followed by retrospective or prospective external validation studies to demonstrate how an intervention’s efficacy generalizes over time and across clinical settings.

Limitations

This study has several limitations. Of note, this analysis selected only RCTs that assessed a machine learning intervention directly affecting clinical decision-making. Additional work could quantify the use of machine learning in other settings (eg, improving clinician-facing tools for workflow efficiency or assessing patient stratification, including biomarker discovery and validation efforts). Future work should incorporate these wider definitions of a clinical tool to assess the impact of machine learning across diverse steps in the clinical care pipeline. Nonetheless, we hypothesize that such literature contains a similar abundance of preliminary results and a dearth of RCTs assessing the relevance of machine learning in a controlled clinical setting. An additional limitation is that this area of research is rapidly evolving, and our findings are current only through October 2021. Future systematic reviews of machine learning interventions in health care will require frequent updating as new results become available.

Conclusions

This systematic review found a low but increasing number of RCTs for machine learning interventions in health care. These results highlight the need for medical machine learning RCTs to promote safe and effective clinical implementation. The findings also reveal concerns about the quality of existing medical machine learning RCTs, along with opportunities to improve reporting transparency and inclusivity, which should be considered in the design and publication of future trials.

1.  Minority Representation in Clinical Trials in the United States: Trends Over the Past 25 Years.

Authors:  Manuel A Ma; Dora E Gutiérrez; Joanna M Frausto; Wael K Al-Delaimy
Journal:  Mayo Clin Proc       Date:  2021-01       Impact factor: 7.616

2.  Insulin dose optimization using an automated artificial intelligence-based decision support system in youths with type 1 diabetes.

Authors:  Revital Nimri; Tadej Battelino; Lori M Laffel; Robert H Slover; Desmond Schatz; Stuart A Weinzimer; Klemen Dovc; Thomas Danne; Moshe Phillip
Journal:  Nat Med       Date:  2020-09-09       Impact factor: 53.440

3.  Impact of a real-time automatic quality control system on colorectal polyp and adenoma detection: a prospective randomized controlled study (with videos).

Authors:  Jing-Ran Su; Zhen Li; Xue-Jun Shao; Chao-Ran Ji; Rui Ji; Ru-Chen Zhou; Guang-Chao Li; Guan-Qun Liu; Yi-Shan He; Xiu-Li Zuo; Yan-Qing Li
Journal:  Gastrointest Endosc       Date:  2019-08-24       Impact factor: 9.427

4.  Big data and black-box medical algorithms.

Authors:  W Nicholson Price
Journal:  Sci Transl Med       Date:  2018-12-12       Impact factor: 17.956

5.  The single-monitor trial: an embedded CADe system increased adenoma detection during colonoscopy: a prospective randomized study.

Authors:  Peixi Liu; Pu Wang; Jeremy R Glissen Brown; Tyler M Berzin; Guanyu Zhou; Weihui Liu; Xun Xiao; Ziyang Chen; Zhihong Zhang; Chao Zhou; Lei Lei; Fei Xiong; Liangping Li; Xiaogang Liu
Journal:  Therap Adv Gastroenterol       Date:  2020-12-15       Impact factor: 4.409

6.  The Role of Deep Learning-Based Echocardiography in the Diagnosis and Evaluation of the Effects of Routine Anti-Heart-Failure Western Medicines in Elderly Patients with Acute Left Heart Failure.

Authors:  Jinyou Chen; Yue Gao
Journal:  J Healthc Eng       Date:  2021-08-09       Impact factor: 2.682

7.  Artificial intelligence and colonoscopy experience: lessons from two randomised trials.

Authors:  Alessandro Repici; Marco Spadaccini; Giulio Antonelli; Loredana Correale; Roberta Maselli; Piera Alessia Galtieri; Gaia Pellegatta; Antonio Capogreco; Sebastian Manuel Milluzzo; Gianluca Lollo; Dhanai Di Paolo; Matteo Badalamenti; Elisa Ferrara; Alessandro Fugazza; Silvia Carrara; Andrea Anderloni; Emanuele Rondonotti; Arnaldo Amato; Andrea De Gottardi; Cristiano Spada; Franco Radaelli; Victor Savevski; Michael B Wallace; Prateek Sharma; Thomas Rösch; Cesare Hassan
Journal:  Gut       Date:  2021-06-29       Impact factor: 23.059

8.  Time to reality check the promises of machine learning-powered precision medicine.

Authors:  Jack Wilkinson; Kellyn F Arnold; Eleanor J Murray; Maarten van Smeden; Kareem Carr; Rachel Sippy; Marc de Kamps; Andrew Beam; Stefan Konigorski; Christoph Lippert; Mark S Gilthorpe; Peter W G Tennant
Journal:  Lancet Digit Health       Date:  2020-09-16

9.  Effect of Home Blood Pressure Monitoring via a Smartphone Hypertension Coaching Application or Tracking Application on Adults With Uncontrolled Hypertension: A Randomized Clinical Trial.

Authors:  Stephen D Persell; Yaw A Peprah; Dawid Lipiszko; Ji Young Lee; Jim J Li; Jody D Ciolino; Kunal N Karmali; Hironori Sato
Journal:  JAMA Netw Open       Date:  2020-03-02

10.  Three-Dimensional Automated Volume Calculation (Sonography-Based Automated Volume Count) versus Two-Dimensional Manual Ultrasonography for Follicular Tracking and Oocyte Retrieval in Women Undergoing in vitro Fertilization-Embryo Transfer: A Randomized Controlled Trial.

Authors:  Nilofar Noor; Chithira Pulimoottil Vignarajan; Neena Malhotra; Perumal Vanamail
Journal:  J Hum Reprod Sci       Date:  2020-12-28
