Yunying Feng1, Siyu Liang2, Yuelun Zhang3,4, Shi Chen2,4, Qing Wang5, Tianze Huang1, Feng Sun6, Xiaoqing Liu4,7, Huijuan Zhu2,4, Hui Pan8. 1. Eight-year Program of Clinical Medicine, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 2. Department of Endocrinology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 3. Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 4. Clinical Epidemiology Unit, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 5. Department of Automation, Tsinghua University, Beijing, China. 6. Department of Epidemiology and Biostatistics, School of Public Health, Peking University Health Science Center, Beijing, China. 7. Department of Infectious Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China. 8. Department of Endocrinology, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.
Abstract
OBJECTIVE: We aim to investigate the application and accuracy of artificial intelligence (AI) methods for automated medical literature screening for systematic reviews. MATERIALS AND METHODS: We systematically searched PubMed, Embase, and the IEEE Xplore Digital Library to identify potentially relevant studies. We included studies on automated literature screening that reported the study question and the source of the dataset, and that developed algorithmic models for literature screening. The literature screening results produced by human investigators were considered the reference standard. Quantitative synthesis of accuracy was conducted using a bivariate model. RESULTS: Eighty-six studies were included in our systematic review, and 17 of these were further included in the meta-analysis. The combined recall, specificity, and precision were 0.928 [95% confidence interval (CI), 0.878-0.958], 0.647 (95% CI, 0.442-0.809), and 0.200 (95% CI, 0.135-0.287) when recall was maximized, but were 0.708 (95% CI, 0.570-0.816), 0.921 (95% CI, 0.824-0.967), and 0.461 (95% CI, 0.375-0.549) when precision was maximized in the AI models. No significant difference in recall was found among subgroup analyses, including the algorithms, the number of screened records, and the fraction of included records. DISCUSSION AND CONCLUSION: This systematic review and meta-analysis showed that recall is more important than specificity or precision in literature screening, and a recall over 0.95 should be prioritized. We recommend reporting the effectiveness indices of automated algorithms separately. At the current stage, manual literature screening is still indispensable for medical systematic reviews.
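As a reference for the effectiveness indices reported above, the following is a minimal sketch of their standard definitions computed from a binary screening confusion matrix; the counts used here are hypothetical, not data from the review.

```python
# Standard effectiveness indices for literature screening, derived from a
# confusion matrix of one screening run. Counts below are illustrative only.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return recall, specificity, and precision for a screening run.

    tp: relevant articles correctly flagged for inclusion
    fp: irrelevant articles incorrectly flagged for inclusion
    tn: irrelevant articles correctly excluded
    fn: relevant articles missed by the screen
    """
    return {
        "recall": tp / (tp + fn),        # share of relevant articles found
        "specificity": tn / (tn + fp),   # share of irrelevant articles excluded
        "precision": tp / (tp + fp),     # share of flagged articles that are relevant
    }

# Hypothetical run: 100 relevant and 100 irrelevant articles screened.
metrics = screening_metrics(tp=93, fp=35, tn=65, fn=7)
print(metrics)  # recall 0.93, specificity 0.65, precision ~0.727
```

Note that a missed relevant study (a false negative) directly biases a systematic review, whereas a false positive only adds manual screening work, which is why the abstract prioritizes recall over specificity or precision.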