OutbreakTools: a new platform for disease outbreak analysis using the R software.

Thibaut Jombart¹, David M Aanensen², Marc Baguelin³, Paul Birrell⁴, Simon Cauchemez⁵, Anton Camacho⁶, Caroline Colijn⁷, Caitlin Collins⁸, Anne Cori⁸, Xavier Didelot⁸, Christophe Fraser⁸, Simon Frost⁹, Niel Hens¹⁰, Joseph Hugues¹¹, Michael Höhle¹², Lulla Opatowski¹³, Andrew Rambaut¹⁴, Oliver Ratmann⁸, Samuel Soubeyrand¹⁵, Marc A Suchard¹⁶, Jacco Wallinga¹⁷, Rolf Ypma¹⁷, Neil Ferguson⁸.

Abstract

The investigation of infectious disease outbreaks relies on the analysis of increasingly complex and diverse data, which offer new prospects for gaining insights into disease transmission processes and informing public health policies. However, the potential of such data can only be harnessed using a number of different, complementary approaches and tools, and a unified platform for the analysis of disease outbreaks is still lacking. In this paper, we present the new R package OutbreakTools, which aims to provide a basis for outbreak data management and analysis in R. OutbreakTools is developed by a community of epidemiologists, statisticians, modellers and bioinformaticians, and implements classes and methods for storing, handling and visualizing outbreak data. It includes real and simulated outbreak datasets. Together with a number of tools for infectious disease epidemiology recently made available in R, OutbreakTools contributes to the emergence of a new, free and open-source platform for the analysis of disease outbreaks.

Keywords: Bioinformatics; Epidemics; Epidemiology; Free; Infectious disease; Public health; R; Software

Year: 2014 PMID: 24928667 PMCID: PMC4058532 DOI: 10.1016/j.epidem.2014.04.003

Source DB: PubMed Journal: Epidemics ISSN: 1878-0067 Impact factor: 4.396

Introduction

Infectious disease outbreak investigation is a complex task in which a variety of data sources can be exploited for attempting to uncover the spatio-temporal dynamics and transmission pathways of a pathogen in a population. These data can include information on cases’ symptoms, their contacts, results of diagnostic tests and, increasingly, pathogen genetic sequences. Such rich and diverse data offer unprecedented prospects for understanding the process of disease transmission and ultimately designing adapted containment strategies and prophylaxis. Dedicated methodological approaches are traditionally used to analyze different types of data separately, and can exploit information such as the generation time distribution and the timing of symptom onsets (Wallinga and Teunis, 2004; Hens et al., 2012), contact patterns amongst individuals (Calatayud et al., 2010; Cauchemez et al., 2011), geographic locations of the cases (Truscott et al., 2007; Chis Ster and Ferguson, 2007), or pathogen genetic sequences (Vega et al., 2004; Jombart et al., 2011; Harris et al., 2013). Interestingly, the advent of genetic data has also triggered a number of methodological developments aiming to exploit different types of data simultaneously (Ypma et al., 2012; Morelli et al., 2012; Teunis et al., 2013; Jombart et al., 2014; Mollentze et al., 2014). Unfortunately, few of these approaches are widely available to the community as computer software, and a unified platform for the analysis of disease outbreaks is still lacking.

Because it is free, open-source, and hosts the largest collection of tools for statistical analysis, the R software environment (R Core Team, 2013a) appears an ideal host for the development of such a platform. Besides dedicated packages for e.g. advanced linear modelling (Faraway, 2004), time series (Cowpertwait and Metcalfe, 2009), spatial processes (Bivand et al., 2008), multivariate methods (Karatzoglou et al., 2004; Zou and Hastie, 2012; Dray and Dufour, 2007), genetic data analysis (Paradis et al., 2004; Jombart, 2008; Jombart and Ahmed, 2011; Paradis, 2010) and graphics (Wickham, 2009), R offers the full flexibility of an interpreted computer language, allied with the possibility of calling upon precompiled routines, e.g. in C, C++ or Fortran, whenever computationally intensive tasks need to be undertaken. R is already hosting a growing number of packages for infectious disease epidemiology, including surveillance (Höhle, 2007) for temporal and spatio-temporal modelling (including outbreak detection), R0 (Obadia et al., 2012), TreePar (Stadler and Bonhoeffer, 2013) and EpiEstim (Cori et al., 2013) for reproduction number estimation, and outbreaker (Jombart et al., 2014) for transmission tree reconstruction.

To ensure coherence between these different approaches and promote further developments, basic tools for storing and handling outbreak data are needed. In order to fill this gap, a community of epidemiologists, modellers, statisticians and bioinformaticians has developed the R package OutbreakTools. Here, “outbreak data” is defined as the above-described collection of data originating from a set of outbreak cases. This software, initiated during a hackathon for the analysis of disease outbreaks in R (http://sites.google.com/site/hackoutwiki/), provides object classes implementing a flexible and coherent representation of outbreak data, alongside procedures to manipulate, summarize and visualize these data. In this paper, we provide an overview of the main features of OutbreakTools, and discuss the future of R as a platform for the analysis of outbreak data.

Results

The main purpose of OutbreakTools is to provide a coherent yet flexible way of storing outbreak data. To achieve this goal, a new formal (S4) class ‘obkData’ (short for ‘outbreak data’) has been developed. This class uses different slots (Table 1) to store individual meta data (e.g. age, sex), time-stamped observations made on the individuals (e.g. fever, swab results, or answers on food exposures from questionnaires), contacts between patients, DNA sequences of the pathogen, phylogenetic trees, and contextual data at the population level (e.g. school closures, climatic variables). Complex data structures such as dynamic contact networks or DNA sequences from different genes are respectively stored using the new classes ‘obkContacts’ and ‘obkSequences’.

Table 1

Content of the formal (S4) class ‘obkData’. Instances of the class obkData can store a variety of data in the indicated slots. Filling the slots is optional, and empty slots are all NULL.

Slot name	Content
@individuals	data.frame containing patient meta-data (e.g. age, sex).
@records	list of data.frame containing time-stamped observations made on cases (e.g. fever, swab results); allows for repeated observations on the same individual.
@dna	obkSequences object containing pathogen genetic sequences for one or several genes with recorded collection dates; uses the class ‘DNAbin’ to store sequences; allows for multiple sequences for the same cases.
@contacts	obkContacts object storing contact data between patients, stored as a static or dynamic network; uses the classes ‘network’ and ‘networkDynamic’.
@trees	multiPhylo object storing one or several phylogenetic trees of pathogen genomes; uses the class ‘phylo’ to store trees.
@context	list of data.frame containing contextual data relevant at a population level (e.g. school closure).

To promote interoperability, obkData objects can be created from standard input files via procedures already available in R. Data tables can be imported from text files (extensions ‘.txt’ and ‘.csv’), from other statistical software using the package foreign (R Core Team, 2013b), or from XML files using the package XML (Butts, 2008). Aligned DNA sequences in FASTA format can be read using ape (Paradis et al., 2004) or adegenet (Jombart, 2008; Jombart and Ahmed, 2011), and phylogenetic trees can be imported from Newick or NEXUS format using ape (Paradis et al., 2004). To ensure that obkData objects are readily compatible with other R packages, existing classes have been used for storing data whenever possible: the class ‘DNAbin’ for DNA sequences (Paradis et al., 2004), the classes ‘network’ and ‘networkDynamic’ for contact data (Butts, 2008), and the class ‘phylo’ for phylogenetic trees (Paradis et al., 2004).

Considerable efforts have been made to ensure that these different pieces of information are stored in a coherent way. The use of a formal (S4) class system offers multiple advantages in this respect, as it allows one to accurately define the object's content, and to perform consistency checks between the different data sources when the object is created. This means, for instance, that individuals documented in the contact or symptom data will be linked, through unique individual identifiers, to available individual meta-data, or that tips of the trees will be linked to existing DNA sequences whenever possible. Similarly, dates provided in different formats are automatically standardized, and sequences of the same genes are checked for consistent length. As obkData objects allow for coherent data storage and can be saved easily as compressed R objects (using the function save), they also offer a new and efficient way of sharing data amongst collaborators and making studies reproducible after publication.
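
The kind of consistency check that a formal (S4) class makes possible at object creation can be illustrated with a toy example. The following is a minimal sketch in base R and not the obkData implementation: the class name toyOutbreak, its two slots and the column name individualID are assumptions made purely for illustration.

```r
library(methods)

# Toy class with two slots: per-individual meta-data and time-stamped records.
# Hypothetical names; the real obkData class has more slots (see Table 1).
setClass(
  "toyOutbreak",
  representation(individuals = "data.frame", records = "data.frame"),
  validity = function(object) {
    # Every individual referenced in the records must have meta-data.
    known <- rownames(object@individuals)
    unknown <- setdiff(as.character(object@records$individualID), known)
    if (length(unknown) > 0) {
      return(paste("records refer to unknown individuals:",
                   paste(unknown, collapse = ", ")))
    }
    TRUE
  }
)

individuals <- data.frame(age = c(34, 7), sex = c("f", "m"),
                          row.names = c("case1", "case2"))
records <- data.frame(
  individualID = c("case1", "case2", "case2"),
  date = as.Date(c("2014-01-02", "2014-01-03", "2014-01-05")),
  fever = c(TRUE, FALSE, TRUE)
)

# Consistent input: the object is created.
ok <- new("toyOutbreak", individuals = individuals, records = records)

# Inconsistent input: the validity check stops object creation,
# and the error message is captured instead of aborting the script.
bad_records <- rbind(records,
                     data.frame(individualID = "case9",
                                date = as.Date("2014-01-06"), fever = TRUE))
result <- tryCatch(
  new("toyOutbreak", individuals = individuals, records = bad_records),
  error = function(e) conditionMessage(e)
)
```

Because the check runs inside new(), an inconsistent object is never created, which is the property the paragraph above relies on.
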
Despite this complex data structure, accessing information stored in obkData objects is facilitated by a large number of accessors. These functions allow for the retrieval of specific data (get.data), including sampling dates (get.dates), contacts (get.contacts), individual meta-data (get.individuals) or DNA sequences from given genes (get.dna), without requiring knowledge about the internal data structure. Importantly, decoupling the access to information from the internal data storage also ensures long-term code portability: future changes in the data structure will not affect results as long as accessors return the same information. This approach will enable future developments of the obkData class and allow for the incorporation of new types of data. Besides accessors, data handling is also facilitated by a subsetting procedure (function subset) which allows one to isolate data for given sets of individuals, samples, genes, sequences, or from a given time window.

The information contained in obkData objects can be easily visualized using options of the generic function plot, or directly using dedicated functions. Individual timelines can be used to visualize the course of illness and collection dates of samples for each individual (function plotIndividualTimeline, Fig. 1), maps can be drawn to assess the geographic distribution of the cases (function plotGeo), contact data can be visualized as graphs (function plot for obkContacts objects), and genetic data can be visualized as phylogenies (function plotggphy, Fig. 2) and minimum spanning trees (function plotggMST). Most of these graphs take advantage of the high-quality customisable graphics implemented in ggplot2 (Wickham, 2009).
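
The idea of decoupling data access from data storage, and of subsetting by time window, can be sketched in a few lines of base R. This sketch does not use OutbreakTools: the list layout and the function names get_record_dates and subset_window are hypothetical and only stand in for the accessor and subsetting mechanisms described above.

```r
# Hypothetical internal storage: a plain list holding a records table.
outbreak <- list(
  records = data.frame(
    individualID = c("case1", "case2", "case2", "case3"),
    date = as.Date(c("2014-01-02", "2014-01-03", "2014-01-05", "2014-01-09")),
    stringsAsFactors = FALSE
  )
)

# Accessor: callers ask for dates without touching the internal layout.
# If the storage changes later, only this function needs updating.
get_record_dates <- function(x) {
  x$records$date
}

# Subsetting by time window: keep records dated within [from, to].
subset_window <- function(x, from, to) {
  keep <- x$records$date >= as.Date(from) & x$records$date <= as.Date(to)
  x$records <- x$records[keep, , drop = FALSE]
  x
}

early <- subset_window(outbreak, "2014-01-01", "2014-01-05")
early_dates <- get_record_dates(early)
```

Code written against get_record_dates keeps working if the records table is later renamed or restructured, which is the portability argument made in the text.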

Fig. 1

Timeline of samples of the Newmarket equine influenza outbreak (HorseFlu dataset). This figure represents the temporal distribution of the viral shedding samples gathered during the outbreak. Each horizontal line represents an individual. Individuals are sorted and coloured by yard. Repeated samples gathered on the same individual are represented using different symbols. The code for reproducing this figure is provided in Appendix 1.

Fig. 2

Phylogeny of pandemic influenza H1N1 sequences (FluH1N1pdm2009 dataset). This phylogenetic tree based on 514 hemagglutinin segments of pandemic influenza H1N1 was plotted using the function plotggphy. The code for reproducing this figure is provided in Appendix 1.

While OutbreakTools focuses on storing, handling and visualizing data, the package also implements basic tools for data analysis. Adapted summaries (function summary) have been implemented to provide quick insights into the data, make.phylo can be used to obtain phylogenies for all genes of the dataset, and get.incidence can be used to compute incidence from dates of symptom onsets, but also from any time-stamped data. In the latter situation, positive cases can be defined from either quantitative or categorical data, by specifying a range of numerical values, a list of character strings or even regular expressions. In practice, this allows for the computation of incidence based on any symptom data or sample analysis. This feature therefore allows for a direct use of procedures implemented in R0 (Obadia et al., 2012) or EpiEstim (Cori et al., 2013) for estimating reproduction numbers. To illustrate its features, OutbreakTools is released with both simulated and empirical datasets, including 514 annotated DNA sequences of the 2009 influenza pandemic (dataset FluH1N1pdm2009, Fig. 2) and data from a large Newmarket (UK) outbreak of equine influenza (dataset HorseFlu; Hughes et al., 2012, Fig. 1). Finally, OutbreakTools also includes a simulation tool (function simuEpi) which allows for the generation of outbreaks (including pathogen genome sequences) under a standard SIR model (Fig. 3), and can easily be extended to use other models (e.g. SIS, SEIR). OutbreakTools is documented in a 50-page manual and released with a tutorial introducing the data structures and the main functionalities of the package.
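
Computing incidence from arbitrary time-stamped data, with positives defined either by a regular expression or by a numerical range, can be sketched in base R as follows. This is an illustration of the idea only and not the code of get.incidence: the records table, its column names and the helper daily_incidence are hypothetical.

```r
# Hypothetical time-stamped records: a swab result (categorical) and a
# temperature (quantitative) for each observation.
records <- data.frame(
  date = as.Date(c("2014-01-02", "2014-01-02", "2014-01-03",
                   "2014-01-03", "2014-01-03", "2014-01-05")),
  swab = c("positive (H1N1)", "negative", "positive (H3N2)",
           "inconclusive", "positive (H1N1)", "negative"),
  temperature = c(38.6, 36.9, 39.1, 37.2, 37.8, 38.4),
  stringsAsFactors = FALSE
)

# Count positive observations per day over the full date range,
# including days without any positive observation.
daily_incidence <- function(dates, is_positive) {
  days <- seq(min(dates), max(dates), by = "day")
  counts <- table(factor(as.character(dates[is_positive]),
                         levels = as.character(days)))
  data.frame(date = days, incidence = as.vector(counts))
}

# Positives defined by a regular expression on categorical data.
by_swab <- daily_incidence(records$date, grepl("^positive", records$swab))

# Positives defined by a range of numerical values on quantitative data.
by_fever <- daily_incidence(records$date,
                            records$temperature >= 38 & records$temperature <= 42)
```

Both results are daily count series over the same date range, so the same downstream analysis can be applied whichever definition of a positive case is used.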

Fig. 3

Simulated outbreak using simuEpi. This outbreak was simulated under a SIR model with 100 hosts, contact rate β = 0.005 and recovery rate ν = 0.1. (a) Dynamics of the outbreak showing the numbers of susceptibles, infected and recovered over time. (b) Transmission tree, where each dot is a labelled case with colours representing the date of infection. (c) Neighbour-Joining phylogeny reconstructed from the simulated DNA sequences, ladderized and rooted to the first case. The code for reproducing these figures is provided in Appendix 1.
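
For readers unfamiliar with the SIR model, the dynamics in panel (a) can be approximated with a short base R sketch using the parameter values of the caption (100 hosts, β = 0.005, ν = 0.1). The formulation below is an assumption made for illustration: it is a discrete-time stochastic model in which β is treated as a per-pair, per-time-step infection probability, it does not simulate genome sequences, and it is not the algorithm implemented in simuEpi. The function name simulate_sir and its arguments are hypothetical.

```r
# Discrete-time stochastic SIR sketch (assumed formulation, not simuEpi's code).
simulate_sir <- function(n_hosts = 100, beta = 0.005, nu = 0.1,
                         n_steps = 100, seed = 1) {
  set.seed(seed)
  n_s <- n_hosts - 1  # susceptibles
  n_i <- 1            # infected: the outbreak starts with a single case
  n_r <- 0            # recovered
  out <- data.frame(time = 0:n_steps,
                    S = NA_real_, I = NA_real_, R = NA_real_)
  out[1, c("S", "I", "R")] <- c(n_s, n_i, n_r)
  for (step in seq_len(n_steps)) {
    # A susceptible escapes infection by each infected with probability 1 - beta.
    p_infection <- 1 - (1 - beta)^n_i
    new_infections <- rbinom(1, n_s, p_infection)
    # Each infected individual recovers with probability nu per time step.
    new_recoveries <- rbinom(1, n_i, nu)
    n_s <- n_s - new_infections
    n_i <- n_i + new_infections - new_recoveries
    n_r <- n_r + new_recoveries
    out[step + 1, c("S", "I", "R")] <- c(n_s, n_i, n_r)
  }
  out
}

epidemic <- simulate_sir()
```

The three trajectories can then be drawn with base graphics, for instance with matplot(epidemic$time, epidemic[, c("S", "I", "R")], type = "l"). Results depend on the random seed, and some runs die out after only a few infections.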

Discussion

While a number of packages for infectious disease epidemiology have recently been developed in the R software (Jombart et al., 2014; Obadia et al., 2012; Stadler and Bonhoeffer, 2013; Cori et al., 2013), basic tools for storing, handling and visualizing outbreak data have so far been lacking. OutbreakTools fills this gap by implementing new formal classes allowing for a coherent yet flexible representation of disease outbreak data, alongside a number of functions for manipulating and visualizing that data. As such, it represents a significant step towards building a comprehensive platform for outbreak analysis in R. The collaborative and open nature of this project, together with the possibility of modifying internal data structures seamlessly for the user, ensures that OutbreakTools will be able to evolve and adapt to incorporate new types of data and approaches used for outbreak analysis. The new availability of basic tools for outbreak analysis will hopefully encourage the further development of tools for investigating epidemics. It should in particular facilitate the implementation of novel integrative approaches able to exploit various types of data simultaneously (Ypma et al., 2012; Morelli et al., 2012; Teunis et al., 2013; Mollentze et al., 2014). Comparing the tools emerging from this still-burgeoning methodological field will likely be useful, as was recently demonstrated by the HIV modelling community (Eaton et al., 2012). In this respect, the existence of a unified platform for the analysis of disease outbreaks should provide the common ground needed for such comparisons to be drawn. More generally, the provision of a coherent structure for storing outbreak data will drastically improve the ease of data exchange amongst collaborators and hopefully encourage data sharing within the community. 
Arguably, the choice of R for developing a new platform for outbreak analysis may initially appeal mostly to a community of R experts, and considerable efforts should be made to reach as broad an audience as possible. First, providing free tutorials and teaching material is paramount for making new tools accessible to the community at large. This is the objective of the “R-epi project” (http://sites.google.com/site/therepiproject/), a website developed collaboratively and aiming to provide free resources for the analysis of disease outbreaks primarily in R, but also using other free software. Interestingly, recent developments such as the package shiny (Beeley, 2013) dramatically aid in the development of user-friendly web interfaces running R tools. Such approaches could be considered for reaching out to an even broader audience and trying to maximize the availability of leading-edge methods for epidemics analysis to the community at large, including not only modellers and statisticians, but also epidemiologists and public health agencies.

Resources

Availability: OutbreakTools 0.1-0 is distributed on CRAN (http://cran.r-project.org/) and available for R 3.0.2 on Windows, Mac OSX, and Linux platforms. It can be installed as any other package using the graphical user interface or typing the instruction: install.packages("OutbreakTools")

Licence: GNU General Public Licence (GPL) ≥2.

Website: http://sites.google.com/site/therepiproject/r-pac/about

Documentation: besides the usual package documentation, OutbreakTools is released with a tutorial which can be opened by typing: vignette("OutbreakTools"). More documentation can be found on the project's website.

Development: the development of OutbreakTools is hosted on Sourceforge (http://sourceforge.net/projects/hackout/). New contributions are welcome and encouraged.

23 in total

1. Control of a highly pathogenic H5N1 avian influenza outbreak in the GB poultry flock.

Authors: James Truscott; Tini Garske; Irina Chis-Ster; Javier Guitian; Dirk Pfeiffer; Lucy Snow; John Wilesmith; Neil M Ferguson; Azra C Ghani
Journal: Proc Biol Sci Date: 2007-09-22 Impact factor: 5.349

2. pegas: an R package for population genetics with an integrated-modular approach.

Authors: Emmanuel Paradis
Journal: Bioinformatics Date: 2010-01-14 Impact factor: 6.937

3. adegenet: a R package for the multivariate analysis of genetic markers.

Authors: Thibaut Jombart
Journal: Bioinformatics Date: 2008-04-08 Impact factor: 6.937

4. Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods.

Authors: Tanja Stadler; Sebastian Bonhoeffer
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2013-02-04 Impact factor: 6.237

5. Infectious disease transmission as a forensic problem: who infected whom?

Authors: Peter Teunis; Janneke C M Heijne; Faizel Sukhrie; Jan van Eijkeren; Marion Koopmans; Mirjam Kretzschmar
Journal: J R Soc Interface Date: 2013-02-06 Impact factor: 4.118

6. Robust reconstruction and analysis of outbreak data: influenza A(H1N1)v transmission in a school-based population.

Authors: Niel Hens; Laurence Calatayud; Satu Kurkela; Teele Tamme; Jacco Wallinga
Journal: Am J Epidemiol Date: 2012-07-12 Impact factor: 4.897

7. Reconstructing disease outbreaks from genetic data: a graph approach.

Authors: T Jombart; R M Eggo; P J Dodd; F Balloux
Journal: Heredity (Edinb) Date: 2010-06-16 Impact factor: 3.821

8. HIV treatment as prevention: systematic comparison of mathematical models of the potential impact of antiretroviral therapy on HIV incidence in South Africa.

Authors: Jeffrey W Eaton; Leigh F Johnson; Joshua A Salomon; Till Bärnighausen; Eran Bendavid; Anna Bershteyn; David E Bloom; Valentina Cambiano; Christophe Fraser; Jan A C Hontelez; Salal Humair; Daniel J Klein; Elisa F Long; Andrew N Phillips; Carel Pretorius; John Stover; Edward A Wenger; Brian G Williams; Timothy B Hallett
Journal: PLoS Med Date: 2012-07-10 Impact factor: 11.069

9. Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study.

Authors: Simon R Harris; Edward J P Cartwright; M Estée Török; Matthew T G Holden; Nicholas M Brown; Amanda L Ogilvy-Stuart; Matthew J Ellington; Michael A Quail; Stephen D Bentley; Julian Parkhill; Sharon J Peacock
Journal: Lancet Infect Dis Date: 2012-11-14 Impact factor: 25.071

10. Transmission parameters of the 2001 foot and mouth epidemic in Great Britain.

Authors: Irina Chis Ster; Neil M Ferguson
Journal: PLoS One Date: 2007-06-06 Impact factor: 3.240
