Joshua Feldman (1), Andrea Thomas-Bachli (2,3), Jack Forsyth (2,3), Zaki Hasnain Patel (2,3), Kamran Khan (2,3,4). 1. Harvard University, School of Engineering and Applied Sciences, Cambridge, Massachusetts, USA. 2. Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, Ontario, Canada. 3. BlueDot, Toronto, Ontario, Canada. 4. Division of Infectious Diseases, Department of Medicine, University of Toronto, Toronto, Ontario, Canada.
Abstract
OBJECTIVE: We assessed whether machine learning can efficiently extract infectious disease activity information from online media reports. MATERIALS AND METHODS: We curated a data set of media reports (n = 8322), each labeled to indicate whether the article contains an update about disease activity, and trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports. RESULTS: Our classifier achieved a recall of 88.8% and a precision of 86.1%. Of the outbreaks identified by the WHO that were covered by online media (89%), the overall surveillance system detected 94%, and did so 43.4 (IQR: 9.5-61) days earlier on average. DISCUSSION: We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. The system still needs further assessment for bias, representativeness, granularity, and accuracy. CONCLUSION: Machine learning, natural language processing, and human expertise can be combined to efficiently identify disease activity from digital media reports.