Maimuna S Majumder, Sherri Rose.
Abstract
During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.
Keywords: automation; data curation; infectious diseases; outbreaks; regular expressions
Year: 2021 PMID: 34350393 PMCID: PMC8327373 DOI: 10.1093/jamiaopen/ooab058
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Data collected across case study outbreaks
| Case study | Data source | Reporting period | Number of fields | Total cells curated |
|---|---|---|---|---|
| Measles in Samoa | Twitter | November 22, 2019–December 8, 2019 | 3 | 51 |
| Ebola in the DRC | Email Newsletters | August 6, 2018–July 31, 2019 | 10 | 3600 |
| MERS in South Korea | Disease Outbreak News Reports | May 30, 2015–June 9, 2015 | 5 | 315 |
Abbreviations: DRC: Democratic Republic of the Congo; MERS: Middle East Respiratory Syndrome.
Figure 1. Assembly algorithm flowchart depicting automatic curation of text-based information into machine-readable data. Three example rows of data from the Ebola case study are shown for a single field (of 360 rows and 10 fields total). Trigger phrases are shown in purple and the numerical values of interest are shown in orange.
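The trigger-phrase idea in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, trigger phrases, the `extract_fields` helper, and the sample sentence are all hypothetical, and it assumes each number of interest appears immediately before its trigger phrase.

```python
import re

# Hypothetical trigger phrases; in practice these would be calibrated
# against the wording each health agency actually uses.
TRIGGERS = {
    "confirmed_cases": r"confirmed cases",
    "deaths": r"deaths",
}

# An integer, optionally with thousands separators (e.g. "2,577").
# The lookbehind stops a match from starting in the middle of a number.
NUMBER = r"(?<![\d,])(\d{1,3}(?:,\d{3})+|\d+)"


def extract_fields(text, triggers=TRIGGERS):
    """Return one row: the number preceding each trigger phrase, or None."""
    row = {}
    for field, phrase in triggers.items():
        pattern = re.compile(NUMBER + r"\s+" + phrase, re.IGNORECASE)
        match = pattern.search(text)
        # A missing trigger phrase yields None rather than a guess.
        row[field] = int(match.group(1).replace(",", "")) if match else None
    return row


# Made-up sample text for illustration only, not quoted from any agency.
sample = "A total of 2,577 confirmed cases and 1,790 deaths have been reported."
row = extract_fields(sample)
```

Returning `None` when a trigger phrase is absent keeps missing values distinguishable from misread ones, which matters when the curated data are later compared against a manually curated validation set.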
Figure 2. Implementation flowchart depicting how a researcher may apply the assembly algorithm to a new outbreak. The process is partitioned into 3 phases: (1) calibration (using N URLs), (2) execution, and (3) modification. N may vary across use cases; for data reported daily, at least a week is recommended (N = 7).
Figure 3. Assembly algorithm performance curves over time for the Ebola case study. Cumulative missingness is shown in orange, accuracy in teal, misidentification in purple, and data availability (at the source) in green.
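Curves like those in Figure 3 could be computed along the following lines. The record above does not give the metric formulas, so this Python sketch rests on assumed definitions: a cell counts as missing when the algorithm returns nothing although the validation set holds a value, and as misidentified when the algorithm returns a value that differs from the validation value. The `cumulative_rates` function and the toy inputs are hypothetical.

```python
def cumulative_rates(curated, validation):
    """Return cumulative missingness and misidentification after each cell.

    Assumed definitions (not taken from the source):
      missing       -> curated value is None but the validation value exists
      misidentified -> curated value exists but differs from validation
    """
    missing = wrong = 0
    missingness, misidentification = [], []
    for n, (got, truth) in enumerate(zip(curated, validation), start=1):
        if got is None and truth is not None:
            missing += 1
        elif got is not None and got != truth:
            wrong += 1
        # Rates are taken over all cells seen so far.
        missingness.append(missing / n)
        misidentification.append(wrong / n)
    return missingness, misidentification


# Toy values for illustration only.
curated = [10, None, 12, 15]
validation = [10, 11, 12, 14]
miss_curve, misid_curve = cumulative_rates(curated, validation)
```

Plotting the two returned lists against the reporting date of each cell would give curves of the same general shape as the missingness and misidentification lines described in the caption.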