Yulei Zhang, Yan Dang, Hsinchun Chen, Mark Thurmond, Cathy Larson.
Abstract
Syndromic surveillance can play an important role in protecting the public's health against infectious diseases. Infectious disease outbreaks can have a devastating effect on society as well as the economy, and global awareness is therefore critical to protecting against major outbreaks. By monitoring online news sources and developing an accurate news classification system for syndromic surveillance, public health personnel can be apprised of outbreaks and potential outbreak situations. In this study, we have developed a framework for automatic online news monitoring and classification for syndromic surveillance. The framework is unique, and none of the techniques adopted in this study have previously been used in the context of syndromic surveillance on infectious diseases. In recent classification experiments, we compared the performance of different feature subsets on different machine learning algorithms. The results showed that the combined feature subsets, which include Bag of Words, Noun Phrases, and Named Entities features, outperformed the Bag of Words feature subsets. Furthermore, feature selection improved the performance of feature subsets in online news classification. The highest classification performance was achieved when using SVM on the selected combination feature subset.
Keywords: Feature selection; News classification; News monitoring; Syndromic surveillance
Year: 2009 PMID: 32287567 PMCID: PMC7114309 DOI: 10.1016/j.dss.2009.04.016
Source DB: PubMed Journal: Decis Support Syst ISSN: 0167-9236 Impact factor: 5.795
Major data sources used for syndromic surveillance [48].
| Data source | Description |
|---|---|
| Chief complaints from ED visits or ambulatory visits | Patient-reported signs and symptoms of their illnesses |
| School or work absenteeism | Data collected from school or workplace |
| Hospital admission | Data that is recorded when hospitalization takes place |
| Triage nurse calls, 911 calls | Symptoms or signs recorded during patient calls when consulting health care nurses |
| ICD-9 | Preliminary diagnoses for billing |
| ICD-9-CM | Allows assignment of codes to diagnoses and procedures; often used for third-party insurance reimbursement purposes |
| Laboratory test orders | Orders for laboratory tests |
| Laboratory test results | Results of laboratory tests |
| Public sources | News reports or bulletin notification |
Major news-based syndromic surveillance systems.
| System | Data sources | Domain |
|---|---|---|
| ProMED-mail | Media reports, health department alerts, government reports, local observers, and other sources | Human diseases, zoonotic diseases and diseases that affect sources of human nutrition (both plants and livestock animals) |
| Argus (DIB) | Active case files, event reports, articles, etc. | Over 130 infectious diseases |
| MiTAP | Multiple information sources (epidemiological reports, newswire feeds, emails, online news, transcribed audios) in multiple languages (English, Chinese, French, German, Italian, Portuguese, Russian, and Spanish) | Infectious diseases |
| HealthMap | A variety of electronic sources: online news wires, Really Simple Syndication (RSS) feeds, expert-curated accounts (such as ProMED-mail), and validated official alerts (such as WHO) | About 90 infectious diseases |
Studies on adding intelligence (heuristics) into the crawling strategies.
| Study | Intelligence (heuristics) |
|---|---|
| Chen, Chung, Ramsey, & Yang | Best-first search and genetic algorithm |
| Chakrabarti, Berg, & Dom | Naïve Bayesian |
| Rennie & McCallum | Reinforcement learning |
| Menczer & Belew | Evolutionary algorithm and neural network |
| Chau & Chen | Neural network, traditional graph search, and PageRank algorithm |
| Johnson, Tsioutsiouliklis, & Giles | SVM with linear kernel |
| Pant & Srinivasan | Compared various machine learning algorithms |
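To make the idea of a heuristic-guided crawl concrete, here is a minimal sketch of a generic best-first frontier in Python. It is not the method of any study listed above: the link graph `TOY_LINKS`, the scoring function `toy_score`, and the function name `best_first_crawl` are all hypothetical, and a real crawler would have to estimate a link's relevance before fetching it (for example from anchor text or the parent page).

```python
import heapq

def best_first_crawl(seeds, get_links, score, max_pages):
    """Visit pages in descending order of score and return the visit order."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited

# Hypothetical in-memory link graph standing in for real web pages.
TOY_LINKS = {
    "seed": ["sports", "fmd-outbreak-report"],
    "fmd-outbreak-report": ["weather", "fmd-vaccination"],
    "sports": ["football-scores"],
}

def toy_links(url):
    return TOY_LINKS.get(url, [])

def toy_score(url):
    # Assumed relevance heuristic: count topical terms in the page identifier.
    return sum(term in url for term in ("fmd", "outbreak", "vaccination"))

order = best_first_crawl(["seed"], toy_links, toy_score, max_pages=3)
print(order)  # ['seed', 'fmd-outbreak-report', 'fmd-vaccination']
```

Swapping `toy_score` for a trained classifier's confidence would give the general shape of the learning-based strategies in the table, under the same caveat that the details here are assumptions.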
Fig. 1 System architecture of automatic online news monitoring and classification.
FMD keywords used by the FMD Lab at UC-Davis.
| Language | FMD keywords |
|---|---|
| English | Foot and mouth disease/hoof and mouth disease |
| Spanish | Fiebre Aftosa |
| Portuguese | Febre Aftosa |
| French | Fièvre aphteuse |
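One way such a keyword list might be applied to incoming articles is sketched below. The keywords are copied from the table above; the matching logic (lowercasing, accent stripping, treating hyphens as spaces) and the names `normalize` and `matched_languages` are illustrative assumptions, not a description of how the FMD Lab filters news.

```python
import unicodedata

# Keywords taken from the table above.
FMD_KEYWORDS = {
    "English": ["foot and mouth disease", "hoof and mouth disease"],
    "Spanish": ["fiebre aftosa"],
    "Portuguese": ["febre aftosa"],
    "French": ["fièvre aphteuse"],
}

def normalize(text):
    """Lowercase, strip accents, treat hyphens as spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.replace("-", " ").split())

def matched_languages(article_text):
    """Return the languages whose FMD keywords occur in the article text."""
    norm = normalize(article_text)
    return [
        language
        for language, keywords in FMD_KEYWORDS.items()
        if any(normalize(keyword) in norm for keyword in keywords)
    ]

print(matched_languages("Brote de Fiebre Aftosa en ganado"))
print(matched_languages("Foot-and-mouth disease confirmed in cattle"))
```

Normalizing both the keywords and the article text means that "Fievre aphteuse" typed without the accent still matches the French entry. Plain substring matching is deliberately simple and would need refinement for inflected forms or abbreviations such as "FMD".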
Important online FMD news sources identified by the FMD Lab.
| News website | Government website | International organization | Research lab |
|---|---|---|---|
| All Africa | European Commission for Agriculture | New OIE | WRL |
| PigSite | FGI ARRIAH | Old OIE | FMD Lab in UC Davis |
| BBC FMD News | Federation of American Scientists | FAO | DEFRA |
| The New Vision | EUFMD | SEAFMD | |
| Bloomberg News | Agriculture, Canada | ProMED | |
| Times Online | Argentina-SENASA | OIE-JP | |
| Agrolink Noticias | Peru-SENASA | | |
| World Farming News | SESA | | |
| Arabic News | European Commission for Agriculture | | |
| 9 sites | 9 sites | 6 sites | 3 sites |
Fig. 2 The news distribution on different websites.
Performance measures of the online news classification component.
| Feature subset | Classification algorithm | Accuracy | Average precision | Average recall | Average F-measure |
|---|---|---|---|---|---|
| FeatureBFS–BW | KNN | 63.89% | 70.68% | 63.89% | 67.11% |
| | LBN | 71.56% | 72.83% | 62.64% | 67.35% |
| | NB | 74.96% | 74.64% | 63.78% | 68.78% |
| | SVM | 72.22% | 72.35% | 72.19% | 72.27% |
| | Average | 70.66% | | | |
| FeatureBFS–Comb | KNN | 64.43% | 54.70% | 64.45% | 59.18% |
| | LBN | 73.94% | 74.44% | 73.95% | 74.19% |
| | NB | 75.85% | 75.45% | 75.87% | 75.66% |
| | SVM | 74.30% | 74.60% | 74.26% | 74.43% |
| | Average | 72.13% | | | |
| FeatureSFS–BW | KNN | 72.22% | 58.05% | 72.21% | 64.36% |
| | LBN | 73.17% | 71.77% | 73.18% | 72.47% |
| | NB | 72.58% | 71.44% | 72.61% | 72.02% |
| | SVM | 75.13% | 74.59% | 75.13% | 74.86% |
| | Average | 73.27% | | | |
| FeatureSFS–Comb | KNN | 71.56% | 74.60% | 71.57% | 73.05% |
| | LBN | 76.15% | 75.19% | 76.14% | 75.66% |
| | NB | 75.07% | 74.24% | 75.06% | 74.65% |
| | SVM | | | | |
| | Average | 74.96% | | | |
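The final column of the table above is numerically consistent with the harmonic mean of the average precision and average recall columns (the F-measure). The short check below is a sketch under that assumption; the function name `f_measure` is ours, and the input values are copied from the KNN and SVM rows of FeatureBFS–BW.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# FeatureBFS-BW rows from the table: (algorithm, precision, recall, reported value).
rows = [
    ("KNN", 70.68, 63.89, 67.11),
    ("SVM", 72.35, 72.19, 72.27),
]

for algorithm, precision, recall, reported in rows:
    computed = round(f_measure(precision, recall), 2)
    print(algorithm, computed, reported)
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a row with a large gap between precision and recall scores noticeably lower than its accuracy alone would suggest.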
Results of hypothesis testing.
| No. | Hypothesis | p-value | Result | p-value | Result |
|---|---|---|---|---|---|
| H1a | FeatureBFS–Comb > FeatureBFS–BW | | | | |
| | KNN | 0.0008⁎⁎ | Confirmed | 0.0550 | Not confirmed |
| | LBN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | NB | < 0.0001⁎⁎ | Confirmed | 0.0010⁎⁎ | Confirmed |
| | SVM | < 0.0001⁎⁎ | Confirmed | 0.0005⁎⁎ | Confirmed |
| H1b | FeatureSFS–Comb > FeatureSFS–BW | | | | |
| | KNN | 0.1260 | Not confirmed | 0.0010⁎⁎ | Confirmed |
| | LBN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | NB | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | SVM | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| H2a | FeatureSFS–BW > FeatureBFS–BW | | | | |
| | KNN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | LBN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | NB | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | SVM | < 0.0001⁎⁎ | Confirmed | < 0.4968 | Not confirmed |
| H2b | FeatureSFS–Comb > FeatureBFS–Comb | | | | |
| | KNN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | LBN | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| | NB | < 0.0001⁎⁎ | Confirmed | 0.0384⁎ | Confirmed |
| | SVM | < 0.0001⁎⁎ | Confirmed | 0.1499 | Not confirmed |
| H3a | SVM > KNN on FeatureSFS–BW | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| H3b | SVM > KNN on FeatureSFS–Comb | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| H3c | SVM > LBN on FeatureSFS–BW | < 0.0001⁎⁎ | Confirmed | 0.4577 | Not confirmed |
| H3d | SVM > LBN on FeatureSFS–Comb | 0.0041⁎⁎ | Confirmed | 0.2919 | Not confirmed |
| H3e | SVM > NB on FeatureSFS–BW | < 0.0001⁎⁎ | Confirmed | < 0.0001⁎⁎ | Confirmed |
| H3f | SVM > NB on FeatureSFS–Comb | < 0.0001⁎⁎ | Confirmed | 0.0002⁎⁎ | Confirmed |
Note. Significance levels ⁎ α = 0.05 and ⁎⁎ α = 0.01.
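The marking convention in the note can be expressed as a small rule. The sketch below assumes that a hypothesis counts as confirmed when its p-value falls below 0.05, which is consistent with the entries in the table (for example, 0.0384 is marked once and confirmed, while 0.0550 is unmarked and not confirmed). The function names are hypothetical, and a plain ASCII `*` stands in for the table's asterisk symbol.

```python
def significance_marker(p_value, alpha_strong=0.01, alpha_weak=0.05):
    """Return '**' below the 0.01 level, '*' below the 0.05 level, else ''."""
    if p_value < alpha_strong:
        return "**"
    if p_value < alpha_weak:
        return "*"
    return ""

def verdict(p_value, alpha=0.05):
    """Assumed decision rule: confirmed when the p-value is below alpha."""
    return "Confirmed" if p_value < alpha else "Not confirmed"

# p-values copied from the table above.
for p in (0.0008, 0.0384, 0.0550, 0.1260):
    print(f"{p:.4f}{significance_marker(p)}", verdict(p))
```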
The 56 features in FeatureSFS–Comb.
| No. | Feature | No. | Feature | No. | Feature |
|---|---|---|---|---|---|
| 1 | | 20 | | 39 | Such as |
| 2 | Beef | 21 | | 40 | Susceptible |
| 3 | Board | 22 | | 41 | |
| 4 | Campaign | 23 | Measures | 42 | The disease |
| 5 | Cattle | 24 | Meat | 43 | The embargo |
| 6 | Company | 25 | Ministry | 44 | |
| 7 | Confirmed | 26 | Mouth | 45 | The herd |
| 8 | DES | 27 | | 46 | The outbreak |
| 9 | Detected | 28 | Origin | 47 | |
| 10 | Director | 29 | Outbreak | 48 | Threatening |
| 11 | Disinfection | 30 | Prices | 49 | To control |
| 12 | Emerging | 31 | Quarantine | 50 | US |
| 13 | FMD | 32 | Received | 51 | Vaccinate |
| 14 | Herds | 33 | Reported | 52 | Vaccinated |
| 15 | Imports | 34 | Results | 53 | |
| 16 | Infected | 35 | Sacrificed | 54 | |
| 17 | Information | 36 | Samples | 55 | Venezuela |
| 18 | Isolated | 37 | Serotype | 56 | Village |
| 19 | | 38 | Stricken | | |