
A generalizable data assembly algorithm for infectious disease outbreaks.

Abstract

During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.


Keywords: automation; data curation; infectious diseases; outbreaks; regular expressions

Year: 2021 PMID: 34350393 PMCID: PMC8327373 DOI: 10.1093/jamiaopen/ooab058

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

INTRODUCTION

Since 2000, thousands of infectious disease outbreaks have been reported by the World Health Organization (WHO) globally. A considerable subset of these have been due to emerging zoonotic pathogens, including the novel coronavirus SARS-CoV-2, the causative agent of coronavirus disease 2019; its predecessors, Middle East Respiratory Syndrome (MERS) coronavirus and SARS-CoV-1; Zika virus; and Ebola virus, among others. Emergence of these pathogens has been driven by the increasing permeability of the animal–human interface, whereas ease of travel has enabled their transmission across borders. Not all outbreaks from the last 2 decades have been due to emerging infections, however; notably, due to increasing vaccine hesitancy around the world, re-emerging diseases, such as measles and mumps, have experienced a resurgence as well. During these outbreaks, epidemiological information from a variety of data sources (from formal reports by the WHO to email newsletters and social media posts from national ministries of health) is often made available to the public, including researchers responsible for monitoring and mitigation efforts. Unfortunately, these publicly available data are typically locked in blocks of text that are rarely machine-readable, which poses a considerable roadblock for surveillance and response activities that hinge on mathematical modeling (eg, data-driven allocation of ventilators or vaccines). To overcome this hurdle, researchers typically commit substantial labor toward manually curating and converting these text-based data into an analyzable format (eg, comma-separated values, CSV). The time and effort required are often directly related to the complexity of the available information; thus, outbreak researchers have publicly advocated for the development of an algorithm that can be easily implemented to automatically curate such information across multiple settings.
In this article, we introduce a generalizable data assembly algorithm to automate curation of text-based, outbreak-related information shared online by health agencies and demonstrate its performance across 3 recent case study outbreaks: measles in Samoa (2019), Ebola in the Democratic Republic of the Congo (DRC) (2018–2019), and MERS in South Korea (2015). We implement this algorithm on semistructured source text of increasing complexity from social media (ie, Twitter), email newsletters, and WHO disease outbreak news (DON) reports, respectively, to produce machine-readable CSV files for each of our 3 case studies. Though the data available for curation vary across source texts, the underlying structure of the algorithm—regular expressions to extract pertinent outbreak-related information—remains constant across applications and is generalizable. The source texts considered in this study represent a spectrum of information complexity, and when combined with mathematical modeling approaches, can be used to inform decision-making during infectious disease outbreaks. For measles in Samoa and Ebola in the DRC, we extract simple aggregate statistics (eg, case counts) over time, which can be used for case count projections, assessment of intervention performance, and vaccination rate estimation. Meanwhile, for MERS in South Korea, we extract more complex multifeature patient-level data (ie, data in which every row is a patient and every column is a feature), which enable reconstruction of transmission networks and evaluation of risk factors associated with mortality.
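To make the patient-level output format concrete, the following is a minimal Python sketch of writing a line list (one row per patient, one column per feature) to CSV. The column names paraphrase the 5 MERS fields named in this article; the two rows are invented placeholders, not data from the study, and the code is not taken from the authors' repository.

```python
import csv
import io

# Column names paraphrase the 5 MERS fields; they are assumptions, not the
# headers used in the authors' files.
FIELDS = ["sex", "age", "date_of_symptoms", "date_of_diagnosis",
          "healthcare_worker"]

# Invented placeholder rows; None marks a value that was not curated.
patients = [
    {"sex": "M", "age": 40, "date_of_symptoms": "2015-06-01",
     "date_of_diagnosis": "2015-06-03", "healthcare_worker": False},
    {"sex": "F", "age": 35, "date_of_symptoms": "2015-06-02",
     "date_of_diagnosis": None, "healthcare_worker": True},
]

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(patients)  # None is written as an empty cell

csv_text = buffer.getvalue()
print(csv_text)
```

Leaving uncurated values as empty cells keeps missing data visible in the CSV rather than replacing it with a guess.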

METHODS

Data on the evolving epidemiology of each outbreak were first manually curated for validation purposes. Summary information for each study is available in Table 1. Aggregate cases and deaths associated with the measles outbreak in Samoa were collected from the Government of Samoa Twitter account from November 22, 2019 (date of first tweet) to December 8, 2019 (date of last tweet). Similar aggregate statistics were also collected for the Ebola outbreak in the DRC from email newsletters issued by the Ministère de la Santé RDC (MSRDC) from August 6, 2018 (date of first newsletter received) to July 31, 2019 (date of last newsletter received). Finally, patient-level data were collected from WHO DON reports for the MERS outbreak in South Korea from May 30, 2015 (date of first report) to June 9, 2015 (date of last report). These same text-based data were then algorithmically collected using our data assembly algorithm.

Table 1.

Data collected across case study outbreaks

Case study	Data source	Reporting period	Number of fields	Total cells curated
Measles in Samoa	Twitter	November 22, 2019–December 8, 2019	3	51
Ebola in the DRC	Email Newsletters	August 6, 2018–July 31, 2019	10	3600
MERS in South Korea	Disease Outbreak News Reports	May 30, 2015–June 9, 2015	5	315

Abbreviations: DRC: Democratic Republic of the Congo; MERS: Middle East Respiratory Syndrome.

The assembly algorithm was developed in the Python programming language and, as shown in Figure 1 and Supplementary Figures S1 and S2, uses regular expressions and trigger phrases to automatically transform semistructured text-based information from user-inputted URLs into machine-readable data. Here, trigger phrases are the phrases that accompany the information of interest in a given block of text. When these phrases are translated into searchable patterns of characters (ie, regular expressions) in any given language, they act as “triggers” for the data assembly algorithm to identify and collect information for the desired fields (ie, variables). This underlying regex-based structure enables generalizability of the algorithm to a wide variety of source texts and information types, as demonstrated by the 3 case study outbreaks selected.
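The trigger-phrase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the repository cited in this article); the trigger phrases, field names, and sample sentence below are all invented for illustration.

```python
import re

# Hypothetical trigger phrases, written as regular expressions. In practice
# these would be identified by reading the source text for a given outbreak.
TRIGGERS = {
    "cumulative_cases": r"total of\s+(\d[\d,]*)\s+measles cases",
    "incident_cases": r"(\d[\d,]*)\s+new cases",
    "cumulative_deaths": r"(\d[\d,]*)\s+measles[- ]related deaths",
}


def extract_fields(text):
    """Return {field: int or None}; None marks a field with no match."""
    row = {}
    for field, pattern in TRIGGERS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        # Strip thousands separators before converting to an integer.
        row[field] = int(match.group(1).replace(",", "")) if match else None
    return row


# Invented sample text, not an actual health agency post.
sample = ("A total of 4,357 measles cases have been reported, with 140 new "
          "cases in the last 24 hours. To date, 63 measles-related deaths "
          "have been recorded.")
row = extract_fields(sample)
print(row)
```

Returning None when a trigger phrase is absent, rather than guessing, mirrors the conservative preference for a missing cell over an incorrect one.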

Figure 1.

Assembly algorithm flowchart depicting automatic curation of text-based information into machine-readable data. Three example rows of data from the Ebola case study are shown for a single field (of 360 rows and 10 fields total). Trigger phrases are shown in purple and the numerical values of interest are shown in orange.

For the measles case study, the following 3 data fields were automatically curated using our assembly algorithm: cumulative cases, incident cases, and cumulative deaths. Seventeen rows of data, where each row is a date, were collected across these 3 fields for a total of 51 cells. Similarly, data for the following 10 fields were automatically curated for the Ebola case study: confirmed cumulative cases, total cumulative cases (confirmed + probable), confirmed cumulative deaths, total cumulative deaths (confirmed + probable), cumulative cases recovered, cumulative vaccinations deployed, cumulative vaccinations deployed in Region A, cumulative vaccinations deployed in Region B, cumulative vaccinations deployed in Region C, and cumulative vaccinations deployed in Region D. Across these 10 fields, 360 rows of data, where again each row is a date, were collected for a total of 3600 cells. Finally, data for the MERS case study were automatically curated to populate the following 5 fields: documented sex, age, date of symptoms, date of diagnosis, and healthcare worker status. Sixty-three rows of data, where each row is a patient, were collected for a total of 315 cells across these 5 fields.

In all 3 case study outbreaks, the manually curated data for the aforementioned fields were used to validate the performance (ie, missingness and misidentification) of the assembly algorithm. Missingness is defined as a cell for which the algorithm did not curate a value but for which a value was available when compared against manual curation. Misidentification is defined as a cell for which the algorithm curated a value but for which the value was incorrect when compared against manual curation. Given its intended application in outbreak settings, the assembly algorithm was designed conservatively, placing priority on increasing accuracy over decreasing missingness.

Code for all 3 implementations of the assembly algorithm, as well as the manually collected validation data, are available at https://github.com/mmajumder/Data_Assembly_Algorithm. Though we manually curated all available data for our 3 case study outbreaks to comprehensively validate algorithmic performance, researchers who wish to implement our algorithm for a new outbreak need only to validate a subset of data early in the collection process. Figure 2 describes this implementation process in 3 phases: (1) calibration, (2) execution, and (3) modification. Each phase is disaggregated into actionable steps and includes guidance regarding common challenges, such as changes to trigger phrases at the source.
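The missingness and misidentification definitions used for validation can be expressed as a cell-by-cell comparison. The Python sketch below is one possible reading, not the authors' code. It assumes both tables are lists of rows with identical fields, that None means "no value", and that rates are taken over all cells; the article does not state the denominator explicitly. The tables are invented.

```python
def score(algorithmic, manual):
    """Compare algorithmic and manual tables cell by cell.

    Each table is a list of dict rows with the same fields; None means that
    no value was curated. Rates use all cells as the denominator (assumption).
    """
    total = missing = misidentified = 0
    for algo_row, manual_row in zip(algorithmic, manual):
        for field, truth in manual_row.items():
            total += 1
            value = algo_row.get(field)
            if value is None and truth is not None:
                # A value was available at the source but was not curated.
                missing += 1
            elif value is not None and value != truth:
                # A value was curated but disagrees with manual curation.
                misidentified += 1
    return {
        "cells": total,
        "missingness": missing / total,
        "misidentification": misidentified / total,
    }


# Invented example tables: 2 rows (dates) by 2 fields.
manual = [{"cases": 10, "deaths": 1}, {"cases": 12, "deaths": 2}]
algorithmic = [{"cases": 10, "deaths": None}, {"cases": 21, "deaths": 2}]
result = score(algorithmic, manual)
print(result)
```

Here one cell is missing and one is misidentified out of 4, so both rates are 0.25.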

Figure 2.

Implementation flowchart depicting how a researcher may apply the assembly algorithm to a new outbreak. The process is partitioned into 3 phases: (1) calibration (using N URLs), (2) execution, and (3) modification. N may vary across use cases; for data reported daily, at least a week is recommended (N = 7).
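As a rough illustration of the calibration phase, the Python sketch below runs an extraction function over the first N source texts and reports which ones disagree with manual curation. The function names, URLs, page text, and trigger phrase are all hypothetical stand-ins; the steps in Figure 2 are not reproduced here, and no network access is performed.

```python
import re


def calibrate(urls, fetch, extract, manual_rows):
    """Return the URLs whose algorithmic row differs from manual curation."""
    mismatches = []
    for url, manual in zip(urls, manual_rows):
        if extract(fetch(url)) != manual:
            mismatches.append(url)
    return mismatches


# Stand-in for downloaded pages (invented text and URLs).
pages = {
    "https://example.org/day1": "Total cases: 10",
    "https://example.org/day2": "Total cases: 12",
    "https://example.org/day3": "Cumulative cases: 15",  # phrase changed
}


def extract(text):
    """Hypothetical single-field extractor keyed on one trigger phrase."""
    match = re.search(r"Total cases:\s*(\d+)", text)
    return {"cases": int(match.group(1)) if match else None}


manual = [{"cases": 10}, {"cases": 12}, {"cases": 15}]
mismatches = calibrate(list(pages), pages.__getitem__, extract, manual)
print(mismatches)
```

An empty result would suggest moving on to execution; a nonempty one, as for the third page here, points to a trigger phrase that needs to be revised.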

RESULTS

When validating algorithmically collected data against manually collected data, the data assembly algorithm performed well for all 3 iterations. Across the entirety of each outbreak reporting period, overall cumulative missingness for the case studies was 0% (0 cells) for measles, 1% (34 cells) for Ebola, and 2% (7 cells) for MERS, while overall cumulative misidentification was 0% (0 cells), 0% (0 cells), and 1% (3 cells), respectively. Because the reporting period for the Ebola outbreak was considerably longer (360 days) than the measles (17 days) and MERS (11 days) case studies, we also examined missingness and misidentification over time by day for the Ebola case study. Notably, the assembly algorithm exhibited steady gains in cumulative accuracy from August 2018 through June 2019, as displayed in Figure 3. Decreased cumulative availability of data in the source itself (ie, fields for which MSRDC reported data in May 2019 but no longer reported in June 2019) coincided with minor decreases in cumulative accuracy between June 2019 and August 2019. Cumulative missingness dropped from 5% in August 2018 to near 0% in August 2019, and due to the conservative nature of the assembly algorithm, cumulative misidentification was 0% over the same time period.
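The reported percentages can be checked against the cell counts with a few lines of Python. This sketch assumes the percentages are taken over the total number of cells per case study (rows times fields, as given in Methods and Table 1) and rounded to the nearest whole percent; the article does not state the denominator explicitly.

```python
# Rows and fields per case study (Methods) and flagged cells (Results).
shape = {"measles": (17, 3), "ebola": (360, 10), "mers": (63, 5)}
missing = {"measles": 0, "ebola": 34, "mers": 7}
misidentified = {"measles": 0, "ebola": 0, "mers": 3}


def percent(count, total):
    """Whole-number percentage, rounded to the nearest integer."""
    return round(100 * count / total)


summary = {}
for study, (rows, fields) in shape.items():
    cells = rows * fields  # total cells curated for this case study
    summary[study] = {
        "cells": cells,
        "missing_pct": percent(missing[study], cells),
        "misidentified_pct": percent(misidentified[study], cells),
    }

print(summary)
```

Under that assumption, 34 of 3600 cells rounds to 1%, 7 of 315 to 2%, and 3 of 315 to 1%, which agrees with the figures above.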

Figure 3.

Assembly algorithm performance curves over time for the Ebola case study. Cumulative missingness is shown in orange, accuracy in teal, misidentification in purple, and data availability (at the source) in green.
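A cumulative curve of this kind could be computed as sketched below in Python. The sketch assumes a fixed number of cells per day and defines the cumulative rate as flagged cells to date divided by cells to date; the article does not specify how the plotted curves were calculated, and the daily counts here are invented.

```python
from itertools import accumulate


def cumulative_rate(daily_flagged, cells_per_day):
    """Cumulative share of flagged cells after each day.

    daily_flagged: number of flagged (eg, missing) cells on each day.
    cells_per_day: cells expected per day (assumed constant).
    """
    running = accumulate(daily_flagged)
    return [flagged / (cells_per_day * day)
            for day, flagged in enumerate(running, start=1)]


# Invented counts of missing cells over 4 days, with 10 fields per day.
rates = cumulative_rate([1, 0, 0, 1], cells_per_day=10)
print(rates)
```

Because the denominator grows every day, an early miss weighs heavily at first and is diluted as more days are curated correctly.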

DISCUSSION

By showcasing its performance within the context of 3 distinct infectious disease outbreaks, we demonstrated the generalizability of our data assembly algorithm across diverse source texts and information types. Intuitively, we found that algorithmic curation of more complex data (eg, multifeature patient-level data for MERS in South Korea) exhibited slightly higher rates of missingness and misidentification than simpler data (eg, case counts over time); however, overall cumulative missingness and misidentification remained at or below 2% and 1%, respectively, across curated fields for all 3 case study outbreaks. Our work nonetheless has several limitations. First, our algorithm was not designed to collect data from unstructured text (eg, tweets by a random user). Instead, we prioritized semistructured source texts produced by health agencies given that they are widely considered to be vital information sources during outbreaks. Second, the current version of our algorithm assumes that URLs of source texts will be manually collected by researchers who are familiar with health agencies and their information reporting practices. We also note that the source texts we considered for our case studies featured unpredictable URL formats (eg, Twitter), which makes automation of URL collection a nontrivial task that is ripe for future work. Finally, application of our algorithm to a new outbreak necessitates an initial period of manual curation and trigger phrase identification during the calibration phase, and if needed, ad hoc instances of these steps during the modification phase. Nevertheless, by reducing the overall amount of manual curation that outbreak scientists must perform, our algorithm creates additional capacity for data validation.
In the absence of algorithmic assembly, 2 researchers may typically be tasked with manual curation to establish intercurator accuracy for all data points, but the implementation process for our algorithm is designed such that only one researcher must manually curate a mere fraction of said data points. Moreover, by allowing researchers to identify their own trigger phrases, our algorithm provides them with the flexibility to collect data on fields that are of specific interest to them and pertinent to their subject matter expertise. Within the context of the 3 case study outbreaks presented in this study, the fields for which data were automatically curated by our assembly algorithm were selected purposefully given their long-standing utility to mathematical modeling for informed epidemiological decision-making. Historically, counts of cases and deaths over time (fields that were collected both for measles in Samoa and for Ebola in the DRC) have been used to model the transmission dynamics associated with outbreaks, including important epidemiological parameters such as fatality rates and reproduction numbers. These parameters are critical to formulating case count projections and assessing performance of interventions, which enable public health decision-makers to approach outbreaks from a position of preparedness. Furthermore, these parameters can also be used to model vaccination rates during outbreaks of vaccine-preventable diseases, which can be leveraged to lobby for the resources necessary to vaccinate vulnerable communities. Meanwhile, patient-level “line list” data have traditionally been employed to assess risk factors for different outcomes; indeed, the data presented in this article for MERS in South Korea have been used precisely in this way to assess risk factors for mortality given MERS-CoV infection, as well as for transmission to others following infection.
Such analyses allow for improvements to resource allocation both with respect to patient care (ie, preferentially allocate intensive care units to patients who are less likely to survive infection) and with respect to contact-tracing (ie, preferentially allocate resources to contact trace individuals who are more likely to transmit to others following infection), among other applications. As recently noted by George et al, tools that can transform text-based information into machine-readable data are urgently needed by the outbreak management community. Given the epidemiological utility of the data types curated by our data assembly algorithm across our 3 case study outbreaks, we believe that the usefulness of the work we present here will persist as infectious diseases continue to emerge and re-emerge. We encourage other researchers to apply it to novel contexts (ie, new outbreaks), while carefully considering the ethical implications before deployment in new settings. Our algorithm is designed to generalize across diseases and enable the democratization of essential epidemiological data that are otherwise locked in blocks of non-machine-readable text. However, despite strong accuracy and missingness assessments for all 3 case study outbreaks considered in this article, we recommend that the implementation process we have outlined above be employed to validate the robustness of our data assembly algorithm during future outbreaks as well.

FUNDING

Research reported in this work was supported by the National Institutes of Health through an NIH Director’s New Innovator Award DP2-MD012722. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

AUTHOR CONTRIBUTIONS

MSM: Study design; data acquisition, analysis, and interpretation; drafting the work; and critical revision of the work. SR: Study design; data interpretation; and critical revision of the work.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

None declared.
