
GENI-DB: a database of global events for epidemic intelligence.

Nigel Collier, Son Doan.

Abstract

We present a novel public health database (GENI-DB) in which news events on the topic of over 176 infectious diseases and chemicals affecting human and animal health are compiled from surveillance of the global online news media in 10 languages. News event frequency data have been gathered systematically through the BioCaster public health surveillance system from July 2009 to the present and are available for download by the research community for analyzing trends in the global burden of infectious diseases. The database can be searched by year, country, disease and language. AVAILABILITY: The GENI-DB is freely available via a web portal at http://born.nii.ac.jp/.

Year:  2012        PMID: 22383735      PMCID: PMC3324518          DOI: 10.1093/bioinformatics/bts099

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Systems which gather information about disease outbreak events from informal digital sources such as news media are now seen as having high value to national and transnational public health agencies (Heymann and Rodier, 2001). Although agencies in wealthy countries have a sophisticated array of indicator sources such as over-the-counter sales or sentinel networks, not all countries possess the resources to implement or maintain such systems. With concerns about newly emerging diseases such as A(H5N1), there has been increasing attention on epidemic intelligence (EI) systems that can complement indicator networks by detecting events on a global scale so that they can be acted on close to source. While there are several surveillance systems offering alerts and news browsing (Hartley et al., 2010), to the best of our knowledge there are only a few databases where researchers, government health officials, physicians and public health practitioners can look for historical event data which are updated in real time. ProMED (Madoff and Woodall, 2005), a human network organized through the International Society for Disease Surveillance, is an excellent source with its wide-coverage open access database of human disease reports from 1995 onwards. Reports are gathered manually and reviewed by experts in a staged process before being sent out via email and stored in an online database. ProMED reports on both human and animal health events, although its coverage of animal health is more limited in scope. The World Health Organization (WHO) offers a more specialized service through its weekly EuroFlu bulletins, which are archived online, as well as a news bulletin service through its Global Alert and Response site. Additionally, the World Animal Health Information Database (WAHID) provides access to reports of exceptional events submitted by member states of the OIE (World Organisation for Animal Health).
GENI-DB is a complementary service that provides additional support to those interested in understanding the context of ongoing disease outbreaks as well as analyzing the global burden of infectious diseases. We developed the GENI-DB database as a free, structured and searchable source of news event statistics reported in the global media. Information is gathered fully automatically, without human intervention, by the BioCaster EI system (Collier et al., 2008) from thousands of sources in 10 languages. Our experiments have shown that aggregated event counts from news can provide valuable early warning alerts that in some cases are more timely than ProMED (Collier, 2010). An additional advantage is that, because BioCaster is a single system with a common reporting standard, users can obtain comparative estimates of disease outbreaks across disease conditions and geographic areas.

2 METHODS

The GENI-DB database and web server are implemented on a 24 × 2.66 GHz Xeon core server running Ubuntu Linux version 9.04, Apache (version 2.2.11), PHP (version 5.2.9) and MySQL (version 14.14), and the site is viewable in all major web browsers and operating systems, e.g. Safari, IE, Firefox and Chrome on Linux/Windows/Apple OS. The database is freely available for users to view and download data 24/7. Updates to the database take place once every hour during normal operation, but this interval can be shortened to 20 min as required during public health emergencies. The BioCaster system comprises a modularized text mining pipeline running on a dedicated cluster linked to the backend of the web server. The modules consist of efficient natural language processing algorithms for web scraping, language detection, machine translation (Koehn ), classifying documents into relevant or non-relevant (Bow toolkit: www.cs.cmu.edu/~mccallum/bow) as well as dedicated modules for identifying terms and their relationships (Simple Rule Language editor: http://code.google.com/p/srl-editor). These modules are implemented in various programming languages and glued together using Perl scripts. Various modules are integrated with a sophisticated knowledge model of the domain defining semantic categories for diseases, species, symptoms, agents etc. and the relationships between them. These relationships are assembled automatically into an event report comprising a slot filler template with a minimum fill of a country, province, disease, species and time element (Collier et al., 2008). One event report is generated for each relevant news event. Reports cover 176 infectious diseases including under-specified types such as ‘Unclassified disease’. At various points in the pipeline, staged filtering heuristics are applied to ensure a minimum level of quality.
For example, events with no identifiable province are entered into the database but are considered to be of lower quality and are therefore not shown in the output of GENI-DB. Currently BioCaster monitors ~27 000 news items per day from Google News as well as various public and NPO sources such as ProMED-mail, the Hong Kong SAR Communicable Disease Watch list, the OIE alert lists, the European Media Monitor alerts and AlertNet. One important change to the system occurred in December 2011, when the freely available Google Translate service was deprecated. Up to this point, BioCaster used this service to translate articles to English in order to assess topical relevance. We have endeavored to work around this by implementing the freely available MOSES machine translation system and have so far trained translation engines for Arabic, Russian, French, Portuguese and Spanish to English. Engines for Chinese, Dutch, German, Italian, Korean and Vietnamese to English are expected to be ready by April 2012. However, we have not been able to recover Thai to English due to a lack of the parallel texts needed to train the system. It is still too early to assess the impact on the quality of results, but we hope to report on this in future publications. Domain modeling is encapsulated through the BioCaster ontology, a freely available public health applications ontology designed to integrate laymen's language of disease reporting across 12 languages (http://code.google.com/p/biocaster-ontology).
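The slot-filler template and the province filter described above can be sketched as follows. This is only an illustrative sketch, not BioCaster's implementation: the class name `EventReport`, its field names, the helper functions and the sample values are all hypothetical, and it assumes that a report is displayable only when every slot in the stated minimum fill is present.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EventReport:
    """Hypothetical slot-filler template; field names are illustrative."""
    country: Optional[str]
    province: Optional[str]
    disease: Optional[str]
    species: Optional[str]
    event_date: Optional[date]


# Slots that the text names as the minimum fill for an event report.
REQUIRED_SLOTS = ("country", "province", "disease", "species", "event_date")


def missing_slots(report: EventReport) -> List[str]:
    """Return the names of required slots that are empty."""
    return [slot for slot in REQUIRED_SLOTS if not getattr(report, slot)]


def is_displayable(report: EventReport) -> bool:
    """Assumption: a stored report is shown only if no required slot is empty."""
    return not missing_slots(report)


# Invented sample values, for illustration only.
complete = EventReport("South Korea", "Gyeonggi", "Foot-and-mouth",
                       "animal", date(2011, 1, 5))
no_province = EventReport("South Korea", None, "Foot-and-mouth",
                          "animal", date(2011, 1, 5))
```

Under these assumptions both reports would be stored, but only `complete` would pass `is_displayable`; `missing_slots(no_province)` names the slot that caused the second report to be held back.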

3 RESULTS

GENI-DB is a useful source for exploring media reporting patterns as well as following disease outbreaks. Figure 1 illustrates how aggregated multilingual reporting can be used to visualise media coverage and timeliness in different languages for the porcine foot-and-mouth epidemic in South Korea during 2010–2011. Ranking diseases by the number of reports (Table 1) and countries by the number of outbreak events detected per unit of population (Tables S2 and S3 in Supplementary Material) gives an indication of both the incidence of disease and the characteristics of online media reports.
Fig. 1.

Porcine foot-and-mouth outbreak in South Korea 2010–2011. Daily news event counts are shown for several languages as denoted by the ISO 639-1 codes. Event frequencies are given in stacked bar graphs and on the left-hand axis. The line graph and right-hand axis show cumulative event frequencies.
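The two series plotted in Fig. 1 (daily counts per language and the cumulative total) could be derived from a list of dated events roughly as below. This is a minimal sketch under stated assumptions: the function names and the `(day, language)` input shape are hypothetical rather than part of GENI-DB, and the sample events are invented.

```python
from collections import Counter, defaultdict
from datetime import date
from itertools import accumulate


def daily_counts_by_language(events):
    """Map each day to a Counter of ISO 639-1 code -> number of events."""
    counts = defaultdict(Counter)
    for day, lang in events:
        counts[day][lang] += 1
    # Return the days in chronological order.
    return dict(sorted(counts.items(), key=lambda item: item[0]))


def cumulative_totals(daily):
    """Running total of events across all languages, in date order."""
    days = list(daily)
    totals = accumulate(sum(daily[day].values()) for day in days)
    return dict(zip(days, totals))


# Invented sample events: (event day, ISO 639-1 language code).
sample_events = [
    (date(2011, 1, 1), "ko"),
    (date(2011, 1, 1), "en"),
    (date(2011, 1, 2), "ko"),
    (date(2011, 1, 2), "ko"),
    (date(2011, 1, 4), "zh"),
]
daily = daily_counts_by_language(sample_events)
running = cumulative_totals(daily)
```

Here `daily` holds the per-language values for the stacked bars and `running` holds the values for the cumulative line. Days with no events are simply absent from both mappings in this sketch.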

Table 1.

Disease event frequency by species in GENI-DB between 15th July 2009 and 28th July 2011

Rank  Human disease           Reports  Animal disease               Reports
1     Unclassified influenza  20 982   Unclassified influenza       7519
2     Cholera                 19 936   Foot-and-mouth               3827
3     Influenza A(H1N1)       17 759   Influenza A(H5N1)            2202
4     Dengue fever            14 064   Influenza A(H1N1)            1351
5     Measles                 6378     West Nile fever              815
6     E-coli                  2557     Anthrax                      658
7     Anthrax                 2123     Rabies                       595
8     Influenza A(H5N1)       1946     Herpes                       583
9     HFMD                    1788     Brucellosis                  568
10    Malaria                 1716     Eastern equine encephalitis  512

Ranking countries according to the language of the report also yields some interesting trends in media focus, supporting our view that multilingual reports are necessary to maximize sensitivity. For example, Haiti features in the top three reported countries between July 2009 and July 2011 for most languages except for Chinese, where it ranked seventh behind Japan, Taiwan, USA and France. In French both Angola and Canada were more widely reported than USA. A recent quantitative study by Lyon et al. (2011) provides further insights into the volume, geographic coverage, timeliness and sources of BioCaster's information and compares it against two other systems: HealthMap and EpiSpider. Within the GENI-DB database we have non-zero event counts for 170 states. Several states have unexpectedly low counts, either because there were very few open source reports during the two-year period (e.g. for sub-Saharan Africa or central Asia) or because of technical limitations in the system such as missing languages (e.g. Polish), out-of-vocabulary names for provinces (e.g. Egypt), failure to normalise diacritics to Roman and non-registration of small island states. These issues are now being addressed by adding automated detection for alternative Arabic Romanizations and by extending our place names ontology to include all world states and provinces as well as a greater number of minor cities (populations under 100 000). Additionally, we are constantly looking at how we can improve place name disambiguation from evidence in the text, which is one of the greatest technological challenges we face.
Several caveats need to be kept in mind when interpreting the data. Perhaps the most important is that the data have been sourced and analyzed automatically, i.e. no human moderation has taken place. In this respect, the events in the database are provided as is. De-duplication has not been attempted, except to exclude articles with the same URL, since the frequency of reports may itself say something useful about the degree of concern felt about an event.
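The URL-only de-duplication policy can be illustrated with a short sketch. The function name, the dictionary shape of an article and the example URLs are hypothetical, and matching is assumed to be on the exact URL string, with no normalization.

```python
def drop_repeated_urls(articles):
    """Keep the first article seen for each exact URL, preserving order."""
    seen = set()
    kept = []
    for article in articles:
        url = article["url"]
        if url in seen:
            continue  # same URL already counted once
        seen.add(url)
        kept.append(article)
    return kept


# Invented sample: one story syndicated at two URLs, plus a repeated fetch.
sample_articles = [
    {"url": "http://example.org/a/cholera-update", "title": "Cholera update"},
    {"url": "http://example.net/b/cholera-update", "title": "Cholera update"},
    {"url": "http://example.org/a/cholera-update", "title": "Cholera update"},
]
unique_articles = drop_repeated_urls(sample_articles)
```

In this sketch the repeated fetch of the first URL is dropped, while the same story published at a different URL is kept and still contributes to the event frequency, which matches the rationale given above.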

4 CONCLUSIONS

The goal of GENI-DB is to offer a service that complements extant databases, providing insights and helping experts to overcome information overload. The database provides opportunities for comparisons against other sources as well as material for generating synthetic datasets. We hope that making GENI-DB available will aid the analysis of global trends, advance the state of the art in automated event alerting and help those interested more generally in the patterns of media reporting on public health.

1.  Comparison of web-based biosecurity intelligence systems: BioCaster, EpiSPIDER and HealthMap.

Authors:  A Lyon; M Nunn; G Grossel; M Burgman
Journal:  Transbound Emerg Dis       Date:  2011-12-20       Impact factor: 5.005

Review 2.  The internet and the global monitoring of emerging diseases: lessons from the first 10 years of ProMED-mail.

Authors:  Lawrence C Madoff; John P Woodall
Journal:  Arch Med Res       Date:  2005 Nov-Dec       Impact factor: 2.235

Review 3.  Hot spots in a wired world: WHO surveillance of emerging and re-emerging infectious diseases.

Authors:  D L Heymann; G R Rodier
Journal:  Lancet Infect Dis       Date:  2001-12       Impact factor: 25.071

4.  What's unusual in online disease outbreak news?

Authors:  Nigel Collier
Journal:  J Biomed Semantics       Date:  2010-03-31

5.  Landscape of international event-based biosurveillance.

Authors:  DM Hartley; NP Nelson; R Walters; R Arthur; R Yangarber; L Madoff; JP Linge; A Mawudeku; N Collier; JS Brownstein; G Thinus; N Lightfoot
Journal:  Emerg Health Threats J       Date:  2010-02-19

6.  BioCaster: detecting public health rumors with a Web-based text mining system.

Authors:  Nigel Collier; Son Doan; Ai Kawazoe; Reiko Matsuda Goodwin; Mike Conway; Yoshio Tateno; Quoc-Hung Ngo; Dinh Dien; Asanee Kawtrakul; Koichi Takeuchi; Mika Shigematsu; Kiyosu Taniguchi
Journal:  Bioinformatics       Date:  2008-10-15       Impact factor: 6.937
