PURPOSE: Early detection of infectious disease outbreaks is crucial to protecting public health. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles together with labeled relevant articles.

METHODS: Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant articles and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15,000 articles published over 15 days. Top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.

RESULTS: Daily averages of the areas under the ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836 for the naïve Bayes and SVM classifiers, respectively. We referenced a database of disease outbreak reports to confirm that the evaluation data set resulting from the pooling method indeed covered the incidents recorded in the database during the evaluation period.

CONCLUSIONS: The proposed text classification framework, which uses randomly sampled unlabeled articles, can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and articles in non-English languages.
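The training setup described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain multinomial naïve Bayes with Laplace smoothing, whitespace tokenization, and a tiny invented corpus, and it treats the randomly sampled unlabeled articles as if they were negatives. The function names (`train_nb`, `score`, `auc`) and all example texts are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()

def train_nb(relevant_docs, unlabeled_docs):
    """Fit word counts for two classes; unlabeled docs stand in for negatives."""
    counts = {1: Counter(), 0: Counter()}
    for doc in relevant_docs:
        counts[1].update(tokenize(doc))
    for doc in unlabeled_docs:
        counts[0].update(tokenize(doc))
    vocab = set(counts[1]) | set(counts[0])
    totals = {c: sum(counts[c].values()) for c in counts}
    n_docs = len(relevant_docs) + len(unlabeled_docs)
    log_prior = {1: math.log(len(relevant_docs) / n_docs),
                 0: math.log(len(unlabeled_docs) / n_docs)}
    return counts, totals, vocab, log_prior

def score(model, text):
    """Log-odds that an article is relevant; higher means ranked earlier."""
    counts, totals, vocab, log_prior = model
    v = len(vocab)
    s = log_prior[1] - log_prior[0]
    for w in tokenize(text):
        if w not in vocab:
            continue  # words unseen in training are ignored
        s += math.log((counts[1][w] + 1) / (totals[1] + v))
        s -= math.log((counts[0][w] + 1) / (totals[0] + v))
    return s

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data. The unlabeled sample may contain relevant articles (last item).
relevant = ["cholera outbreak reported in city",
            "avian influenza outbreak kills poultry",
            "health officials confirm measles cases"]
unlabeled = ["stock market rises on earnings",
             "football team wins championship final",
             "new phone released this week",
             "officials confirm ebola outbreak in region"]

model = train_nb(relevant, unlabeled)
test_articles = ["measles outbreak confirmed in school",
                 "market falls after earnings report"]
test_labels = [1, 0]  # labels an analyst would assign on review
test_scores = [score(model, a) for a in test_articles]
toy_auc = auc(test_scores, test_labels)
```

In this sketch, an unlabeled article that happens to be relevant only dilutes the word statistics of the negative class rather than breaking the classifier, which is the intuition behind using random unlabeled samples in place of manually labeled irrelevant articles.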