Spatial and temporal epidemiological analysis in the Big Data era.

Abstract

Concurrent with global economic development in the last 50 years, the opportunities for the spread of existing diseases and emergence of new infectious pathogens, have increased substantially. The activities associated with the enormously intensified global connectivity have resulted in large amounts of data being generated, which in turn provides opportunities for generating knowledge that will allow more effective management of animal and human health risks. This so-called Big Data has, more recently, been accompanied by the Internet of Things which highlights the increasing presence of a wide range of sensors, interconnected via the Internet. Analysis of this data needs to exploit its complexity, accommodate variation in data quality and should take advantage of its spatial and temporal dimensions, where available. Apart from the development of hardware technologies and networking/communication infrastructure, it is necessary to develop appropriate data management tools that make this data accessible for analysis. This includes relational databases, geographical information systems and most recently, cloud-based data storage such as Hadoop distributed file systems. While the development in analytical methodologies has not quite caught up with the data deluge, important advances have been made in a number of areas, including spatial and temporal data analysis where the spectrum of analytical methods ranges from visualisation and exploratory analysis, to modelling. While there used to be a primary focus on statistical science in terms of methodological development for data analysis, the newly emerged discipline of data science is a reflection of the challenges presented by the need to integrate diverse data sources and exploit them using novel data- and knowledge-driven modelling methods while simultaneously recognising the value of quantitative as well as qualitative analytical approaches. 
Machine learning regression methods, which are more robust and can handle large datasets faster than classical regression approaches, are now also used to analyse spatial and spatio-temporal data. Multi-criteria decision analysis methods have gained greater acceptance, due in part to the growing need to combine data from diverse sources, including published scientific information and expert opinion, in an attempt to fill important knowledge gaps. The opportunities for more effective prevention, detection and control of animal health threats arising from these developments are immense, but not without risks given the different types, and much higher frequency, of biases associated with these data.

Keywords: Data science; Exploratory analysis; Internet of Things; Modelling; Multi-criteria decision analysis; Spatial analysis; Visualisation

Year: 2015 PMID: 26092722 PMCID: PMC7114113 DOI: 10.1016/j.prevetmed.2015.05.012

Source DB: PubMed Journal: Prev Vet Med ISSN: 0167-5877 Impact factor: 2.670

Introduction

Economic and technological developments in the last 50 years have led to global eco-social system changes that greatly facilitate the emergence and spread of infectious diseases in both animals and humans. This represents a major challenge for the management of infectious disease risks and is likely to require a paradigm shift in analytical approaches rather than an evolution of existing ones. This change in approach is reflected in the widespread recognition of the need to adopt inter- and transdisciplinary approaches in risk research and management. In addition, the digital revolution has provided major opportunities with respect to data collection and analysis. This has now evolved into the Internet of Things where everyday objects are connected through information networks, allowing them to send and receive data (Anon., 2014b, Kamel Boulos and Al-Shorbaji, 2014). Related to this, is the so-called Industry 4.0 (a collective term for technologies and concepts of value chain organisation; (Lee et al., 2014)), which reflects a vision for how the industrial sector may respond to the tight integration between the physical and digital world through the implementation of smart value chains. The concepts of smart health (Solanas et al., 2014), mHealth (Istepanian et al., 2004) and eHealth (Eysenbach, 2001) can be seen as the starting point for these developments and, together with the recent increase in popularity, and availability, of wearable sensors, have boosted the development of associated technologies. However, these sensors, other measurement devices and data sources are of limited use if the raw data they generate are not converted into information that can inform decision making, which has led to the need for suitable data management and analytical methods that can handle the resulting large, heterogeneous datasets. 
In animal health in general, and veterinary epidemiology specifically, the established methodological frameworks provide guidance for research of cause-effect relationships based on data generated through a priori designed field and laboratory studies. This review explores recent developments, and future directions, for spatial and temporal analysis in support of managing complex animal health problems. We begin this review from a broader perspective by focussing on the developments that have led to the data revolution and its impact on the health sciences. We then discuss how the new scientific discipline of data science has been established to tackle the analytical challenges and opportunities resulting from the data revolution. From this wider analytical context, we then focus on the specific developments in spatio-temporal epidemiological data analysis resulting from the data revolution.

Data revolution: from the Internet via Big Data to the Internet of Things

Scientific approaches aimed at improving our understanding of the complexity of the systems of which animal and human diseases form a part, usually involve data collection. However, the way in which data are generated has changed radically over the last 30 years, mainly as a result of the emergence of electronic methods for measuring, recording, storing and distributing data. As part of this development, the Internet now forms the backbone of a globally-reaching information network. The drivers behind the data revolution have been multiple, and early on were dominated by defence, public safety and scientific interests. Only once commercial companies such as Google (https://www.google.com), Amazon (http://www.amazon.com) and Facebook (https://www.facebook.com) were able to demonstrate, during the last 10 years, the potential for commercial exploitation, did the data revolution truly take off. There are also now increasing concerns in relation to potential abuse of Big Data (Schadt, 2012, Anon., 2014a). Mayer-Schönberger and Cukier (2014) define Big Data as ‘The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value’ and ‘…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.’ Big data are generally characterised by 3Vs: volume (relative magnitude of dataset), velocity (rate at which new data are generated) and variety (heterogeneous structure of dataset [e.g. text, video, audio]) (Gandomi and Haider, 2015). A fourth ‘v’ frequently used to describe Big Data is veracity which acknowledges the inherent uncertainty frequently associated with, in particular, web-based Big Data and the corresponding need for analytical approaches that are able to account for this unreliability (Gandomi and Haider, 2015). In addition, the business community has added a fifth ‘v’; value. 
Traditional database management systems based on tabular or relational data management structures are not suited to dealing with Big Data as most of it is unstructured. Cloud-based data storage using the Apache Hadoop® distributed file system (http://hadoop.apache.org; last accessed April 2015) has been developed to allow efficient management of such data (O’Driscoll et al., 2013, Fernández et al., 2014). A data mining approach was used to explore the use of search term data for prediction of flu trends (Ginsberg et al., 2009) based on the assumption that changes in information and communication patterns on the Internet can act as early warning of changes in population health (Wilson and Brownstein, 2009). This resulted in the development of the search-term surveillance system, Google Flu Trends (GFT; http://www.google.org/flutrend, last accessed April 2015). By combining data-mining of Google search queries and statistical modelling, GFT provides a baseline indicator of the trend or changes in the rate of influenza, thereby providing estimates of weekly regional US influenza activity with a reporting lag of only one day compared with the 1–2 week delays associated with the Centers for Disease Control and Prevention (CDC) Influenza Sentinel Provider Surveillance reports (Ginsberg et al., 2009). However, the results generated by this algorithm have been the subject of controversy as predictions were incorrect at specific time points when they particularly mattered (Butler, 2013, Lazer et al., 2014). The fact remains though, that the relative immediacy of web-based surveillance systems allows for much quicker targeting of infection hot-spots in pandemic situations, as was done by companies such as Google, in the recent influenza H1N1 crisis (Chew and Eysenbach, 2010, Signorini et al., 2011, St Louis and Zorlu, 2012). 
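The premise of search-term surveillance, that search activity tracks and may precede reported cases, can be illustrated with a lagged correlation. The sketch below is not the GFT model; it is a minimal illustration that assumes two hypothetical weekly series (`searches` and `cases`, both invented for this example) and asks at which lag search volume is most strongly correlated with later case counts.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(searches, cases, max_lag):
    """Correlate searches in week t with cases in week t + lag, for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        s = searches[: len(searches) - lag]  # drop the last `lag` weeks of searches
        c = cases[lag:]                      # drop the first `lag` weeks of cases
        result[lag] = pearson(s, c)
    return result


# Hypothetical weekly data: reported cases follow search volume by two weeks.
searches = [2, 3, 5, 9, 14, 20, 17, 12, 8, 5, 3, 2]
cases = [0, 0] + [10 * s for s in searches[:-2]]

result = lagged_correlation(searches, cases, max_lag=3)
best_lag = max(result, key=result.get)
print(best_lag, round(result[best_lag], 3))
```

In practice the reference series would come from an established surveillance system, which is why a robust existing system is needed for calibration.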
Although search-term surveillance systems such as GFT are currently best suited to track disease activity in developed countries – the system requires large populations of web-search users in order to be most effective (Carneiro and Mylonakis, 2009) and a robust existing surveillance system to provide data for calibration (Wilson et al., 2009), – retrospective analysis of Google Trend’s search frequency for the term ‘Ebola’, in the developing countries of Guinea, Liberia and Sierra Leone, showed a moderate-to-high correlation with epidemic curves for the outbreak in those countries (Milinovich et al., 2015) suggesting that web-based surveillance systems have the potential to be used as early-warning systems in developing, as well as in developed, countries. However, systems which mine secondary (e.g. news reports) rather than primary web-based data sources (e.g. search queries) are possibly better suited for disease surveillance in developing countries. Examples of such systems include BioCaster (http://biocaster.nii.ac.jp, last accessed April 2015; Collier et al., 2008), EpiSPIDER (Tolentino et al., 2007, Keller et al., 2009), HealthMap (http://www.healthmap.org, last accessed April 2015; Brownstein et al., 2008, Freifeld et al., 2008, Brownstein et al., 2009, Keller et al., 2009, Brownstein et al., 2010), ProMED-mail (http://www.promedmail.org, last accessed April 2015; Cowen et al., 2006, Tolentino et al., 2007, Zeldenrust et al., 2008) and Canada’s Global Public Health Intelligence Network (GPHIN) (Mykhalovskiy and Weir, 2006). The value of such systems for flagging potential health threats is highlighted by the fact that GPHIN identified the 2002 severe acute respiratory syndrome (SARS) outbreak in Guangdong Province, China, more than two months before the World Health Organisation’s (WHO) official announcement (Mykhalovskiy and Weir, 2006). 
Similarly, HealthMap identified news stories reporting a strange fever in Guinea nine days before official notification of the 2014 West Africa Ebola outbreak (Milinovich et al., 2015). Although the inadequate initial response by the international community to the 2014 Ebola outbreak has been highlighted by some as a failure of Big Data analytical approaches for purposes of early warning (Leetaru, 2014, Milinovich et al., 2015), the fact remains that the primary value of such systems currently lies in their ability to flag events that may warrant further investigation rather than acting as the primary surveillance system (Wilson and Brownstein, 2009, Hartley et al., 2013). As such, although web-based surveillance systems are still a long way from replacing traditional surveillance methods, they provide a useful complement to conventional approaches (Milinovich et al., 2014), to the extent that they have become an important component of the influenza surveillance scene. For example, WHO’s Global Outbreak Alert and Response Network (GOARN; http://www.who.int/csr/outbreaknetwork/en/, last accessed May 2015) uses such data as part of its day-to-day surveillance activities (Grein et al., 2000, Heymann and Rodier, 2001) and is authorised to act on this information (Wilson et al., 2008). Moving from surveillance to delivery of health care, precision medicine aims to utilise Big Data for the purpose of optimising the use of diagnostic tools, therapeutics and preventive management (Anon., 2011, Collins and Varmus, 2015). More recently, an increasing number of sensors and other measurement devices have been connected to the Internet, giving rise to the so-called Internet of Things. The Internet of Things also includes data collected through participatory, crowdsourcing or citizen science mechanisms (Heipke, 2010, Kamel Boulos et al., 2011, Chunara et al., 2013). 
The opportunities and challenges arising from the Internet of Things are only just being recognised by manufacturing industries, and this has been referred to as the fourth industrial revolution or Industry 4.0 (Lee et al., 2014). In animal production, precision livestock farming is considered to have significant potential to improve animal health, production and welfare. While sensor technology is already used, for example, in dairy cattle feeding, mastitis, fertility, locomotion and metabolism, the integration and analysis of the data for decision making still needs further development (Rutten et al., 2013, Mortari and Lorenzelli, 2014). It is very likely that more widespread utilisation and better adaptation of these digital technologies will provide an opportunity for more effective traceability of livestock and their products and animal health surveillance. However, to get the most out of both Big Data and data generated by the Internet of Things requires a change in analytical approach, which has led to the development of data science.

Data science

While the amount of data available for analysis continues to increase exponentially, the development of suitable analytical tools for converting this raw data into useful knowledge has been much slower (Anon., 2013, Kambatla et al., 2014, Gandomi and Haider, 2015). Up until about five to ten years ago, the challenge associated with analysing epidemiological data sourced from existing, and sometimes collated across, multiple data sources was addressed using secondary data analysis approaches (Sorensen et al., 1996, Olsen, 2008). The associated methodological developments were strongly underpinned by the well-established principles of the scientific method, with data analysis primarily being the responsibility of statistical science. While most epidemiologists will have experienced the technical challenges associated with data management, it is notable that most postgraduate epidemiology training programmes focus primarily on tabular data structures while barely covering relational databases. With Big Data, we have now reached levels of data complexity, as expressed in the first four of the five V attributes (i.e. volume, velocity, variety, veracity and value), which cannot be effectively addressed by the ‘classical’ data management and analytical skill set. As a result, and strongly incentivised by businesses interested in exploiting value, the fifth of the Big Data attributes, data analysts with advanced computer science skills have now become involved, in order to effectively convert the variety of data types and sources into knowledge (Wing, 2008, Bell et al., 2009, Porter et al., 2012). An extreme interpretation of this new situation was expressed by the Editor-in-Chief of Wired Magazine in an article entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (Anderson, 2008). He suggested that in the Petabyte Age, hypothesis-driven research would become irrelevant and be replaced by mining of data for associations. 
This extreme view has resulted in some debate (Norvig, 2009, Pigliucci, 2009, Schutt and O’Neil, 2013, Faghmous and Kumar, 2014, Mayer-Schönberger and Cukier, 2014). Arguably, this debate has had the benefit that scientists reflected on the utility of their respective research methodologies. It has also re-emphasised the importance of knowledge discovery going beyond the development of descriptive models optimised for mathematical accuracy. The renewed belief in the development of increasingly more powerful artificial intelligence applications also emphasises the effectiveness of machine learning algorithms (Jones, 2014, Gibney, 2015, Scholkopf, 2015, You, 2015). This does not mean that statistical algorithms will become less relevant, as is reflected in the current trend towards synergistic method development between the computer and statistical sciences (Kuhn and Johnson, 2014, Peters et al., 2014). To more effectively deal with Big Data, and the associated analysis challenges, the new discipline of data science has been established which explicitly requires a multidisciplinary team approach (Dhar, 2013; Schutt and O’Neil, 2013). The four-bubble Data Science Venn diagram (Fig. 1 ) adapted from the three-bubble original by Conway (2010) reflects the interdependence between required disciplines (Malak, 2014). As such, it emphasises the importance of integrating computer science, statistical science, specialist domain expertise and social science. Conway (2010) had not explicitly separated social science from specialist domain expertise, but it seems justified to separate it out given that human behaviour has a major influence on the characteristics of most data sources. Arguably, this perspective is very similar to the interdisciplinary approach that underpins One Health and Ecohealth.

Fig. 1

Four-bubble Data Science Venn diagram (reproduced with permission from Malak, 2014).

Gartner Inc. (Gartner, 2014), an international information technology research and advisory company, annually evaluates the maturity of emerging technologies and presents their conclusions using the Gartner Hype Cycle (Fig. 2). By representing time on the x-axis and expectations on the y-axis, they define five phases through which a technology will typically pass before it potentially achieves widespread adoption. Starting with the Innovation Trigger phase and rapidly climbing the Peak of Inflated Expectations, the cycle then descends into the Trough of Disillusionment (with respect to expectations). From there it may ascend the Slope of Enlightenment before finally reaching the Plateau of Productivity. As of 2014, the Gartner Hype Cycle considered data science (entering the Peak of Inflated Expectations) to be lagging behind both the Internet of Things (midway through the Peak) and Big Data (entering the Trough of Disillusionment) (Gartner, 2014) – a trend that mirrors the development of spatial analytical methods suitable for taking advantage of the opportunities offered by georeferenced Big Data.

Fig. 2

The generic Gartner Hype Cycle which defines the five phases through which a technology will typically pass before it potentially achieves widespread adoption (reproduced with permission from Gartner, 2014; Gartner Methodologies, Hype Cycle, http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp).

The development of spatial analytical methods

The analysis framework based on Pfeiffer et al. (2008a), presented in a slightly updated format in Fig. 3 , is still relevant for structuring the different spatial and spatio-temporal epidemiological analytical methods. These are based primarily on classical statistical theory, with the addition of Bayesian methods to address the issue of spatial and temporal dependence. However, the assumptions, in particular, of frequentist statistical methods are usually not met for Big Data, and therefore analytical algorithms are required which are statistically robust (i.e. non-parametric) and also are capable of efficiently analysing very large datasets. The developments for epidemiological analyses have, so far, been primarily through the inclusion of machine learning regression amongst the modelling methods, whereas in visualisation and exploration it has been primarily through more effective use of interfaces and flexible software environments. It needs to be emphasised that data analysis plays an important role also in dealing with the five Vs of Big Data (i.e. volume, velocity, variety, veracity and value). The first four attributes refer to areas which are subject to significant research aimed at optimisation of data utilisation. A particularly difficult aspect is the challenge presented by differences in data quality, including the ubiquitous presence and heterogeneity of bias. Below, we discuss developments for each of the three analysis categories of the framework.

Fig. 3

Spatial and temporal data analysis in support of decision making in animal health in the Big Data era.

Visualisation of spatial patterns

Visualisation, whether as part of the analysis process or communication purposes, has always been a particular strength of spatial analysis and so it is not surprising that the biggest advances in the field of spatial analysis, with respect to Big Data, have occurred in this area. Big Data analytics emphasise the use of interactive visualisation methods using charts and maps, so that analysts and decision makers can quickly obtain insights from the most up-to-date data (e.g. GAPMINDER; http://www.gapminder.org, last accessed May 2015). While geographical information system (GIS) software remains at the forefront for manipulating and producing complex visualisations of spatio-temporal data, the advent of interactive digital maps and virtual globes such as Google Maps™ and Google Earth™ has encouraged simple visualisation of disease data in real time, as illustrated by the integration of such digital platforms into an ever-expanding number of animal and public-health projects and platforms. For example, HealthMap (http://www.healthmap.org, last accessed April 2015), together with its mobile app Outbreaks Near Me (http://www.healthmap.org/outbreaksnearme/, last accessed May 2015), provides real-time surveillance of emerging public health threats (Brownstein et al., 2008, Freifeld et al., 2008), while Nature’s use of the platform to track the global spatio-temporal spread of highly pathogenic avian influenza H5N1 (Google Earth Avian Flu; http://www.nature.com/nature/multimedia/googleearth, last accessed April 2015; Butler, 2006) won the Association of Online Publishers (AOP) Use of a New Digital Platform Award in 2006. Google Earth has also proved valuable for visualising disease data from informal settlements or rural areas in developing countries where the lack of geolocation infrastructure such as road names or house numbers precludes the use of conventional mapping software for visualising disease data. 
In a modern-day reprise of John Snow’s 1854 cholera investigation, use of the digital platform allowed Baker et al. (2011) to map the spread of a typhoid outbreak in Kathmandu – where street names are not used – and trace the cause of the epidemic to low-lying public water resources. In addition to web-based mapping of disease, a related field is that of volunteered geographic information (VGI) (Goodchild, 2007, Goodchild and Li, 2012) or crowdsourced cartography (Dodge and Kitchin, 2013) which uses volunteers to create maps. A well-known example of VGI is OpenStreetMap (OSM; http://www.openstreetmap.org/, last accessed May 2015), an open, online, editable map of the world being created by volunteers using a combination of local knowledge, GPS tracks and aerial imagery. During the 2014 West Africa Ebola crisis, personnel of Médecins Sans Frontières (MSF) enlisted the help of the Humanitarian OSM Team (HOT) – an extension of OSM – to map Guéckédou – the main city in Guinea affected by the outbreak (Hodson, 2014). Within 20 h of receiving the request, online volunteers had mapped three cities in Guinea based on satellite imagery of the area, populating them with over 100 000 buildings – information that proved crucial for door-to-door canvassing of inhabitants and mapping the spread of disease. Other examples of crowdsourced cartography include Geo-Wiki (http://www.geo-wiki.org/, last accessed May 2015), a global network of volunteers working to improve the quality of global land-cover maps. In a systematic review of visualisation and analytics for infectious disease research, Carroll et al. (2014) identified limitations of visualisation tools in terms of their utility and usability for end users, including the risk of choropleth maps being misinterpreted when missing data and uncertainty are not adequately shown. 
They report a need for interdisciplinary tool development to allow valid integrated analysis of data sourced from different areas such as molecular, network and population data. Similarly, not all crowdsourced information is of equal quality; some data are of higher quality than others just as some contributors are consistently better than others (Haklay, 2010). The inclusion of robust measures of quality for VGI would be useful to indicate the level of confidence associated with each piece of information, and although traditional statistical concepts of uncertainty and bias are hard to apply to VGI, other options are available. For example, See et al. (2013) found that when classifying land-cover, volunteer accuracy appeared to be higher when responses for a given location were more consistent and when the volunteers indicated higher confidence in their responses, suggesting that these additional pieces of information could be used to develop associated robust measures of quality. Additional possibilities include the application of Bayesian probability or Dempster–Shafer theory (Eastman, 2009) to provide measures of confidence. Another area that has received significant attention is the analysis of molecular, movement and network data (Brunker et al., 2012, Okabe and Sugihara, 2012, Andrienko and Andrienko, 2013, Carrel and Emch, 2013). In this context, the utility of mobile phone call location records for infectious disease research and policy development has been of recent interest (Tatem, 2014, Wesolowski et al., 2014b). For example, mobile call location records were used during the 2014 Ebola outbreak to visualise and quantify the movements of a sample of the human population in West Africa (Wesolowski et al., 2014a), effectively visualising the spatial catchment areas of urban centres which reached even the more distant locations of the region.
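The Dempster–Shafer option mentioned above for attaching confidence to volunteered information can be illustrated with Dempster's rule of combination. The following is a minimal sketch, not the procedure of any cited study: the two "volunteers", the land-cover classes and all mass values are hypothetical, and mass assigned to the set `EITHER` represents a volunteer's stated uncertainty.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass; masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


def belief(m, hypothesis):
    """Total mass committed to subsets of the hypothesis (lower bound)."""
    return sum(w for h, w in m.items() if h <= hypothesis)


def plausibility(m, hypothesis):
    """Total mass not contradicting the hypothesis (upper bound)."""
    return sum(w for h, w in m.items() if h & hypothesis)


FOREST = frozenset({"forest"})
CROP = frozenset({"cropland"})
EITHER = FOREST | CROP  # "don't know": mass left uncommitted

# Hypothetical classifications of one location by two volunteers.
volunteer_a = {FOREST: 0.6, EITHER: 0.4}
volunteer_b = {FOREST: 0.5, CROP: 0.2, EITHER: 0.3}

m = combine(volunteer_a, volunteer_b)
print(round(belief(m, FOREST), 3), round(plausibility(m, FOREST), 3))
```

The gap between belief and plausibility gives an interval-style measure of confidence for each location, which narrows as consistent responses accumulate.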

Exploration of spatial data

Spatial exploratory analysis uses statistical methods to assess how likely it is that an observed spatial or spatio-temporal pattern is the result of chance variation. Amongst these, the spatial and space-time scan statistics are probably the most often used cluster detection methods. In recent years, the scan statistic has been further developed to incorporate diverse spatial structures and a range of outcome variables with different measurement scales (Correa et al., 2014, Costa and Kulldorff, 2014, Murray et al., 2014, Prates et al., 2014). Similarly, interpolation methods for spatial data, such as kriging, have also been expanded to accommodate different types of outcome variables such as ordinal or Poisson measurement scales (Li and Heap, 2014, Oliver and Webster, 2014). However, kernel smoothing – used to convert point data into smooth raster maps and an effective tool for visualising continuous spatial variation in risk and rates – still requires continuing methodological development, particularly in the selection of appropriate bandwidths for kernel functions (Sarojinie Fernando and Hazelton, 2014).
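The logic of a purely spatial scan statistic can be sketched as follows. This is a simplified illustration under stated assumptions rather than a replacement for dedicated software: it uses a Poisson likelihood ratio, builds roughly circular zones from each area's nearest neighbours (ties broken by index order), caps zones at 50% of the population, and obtains a p-value by Monte Carlo simulation. The 5 x 5 grid, the populations and the case counts are hypothetical.

```python
import math
import random


def llr(c, e, total):
    """Poisson log-likelihood ratio for a zone with c observed and e expected cases."""
    if c <= e:
        return 0.0  # only zones with excess risk are of interest
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def candidate_zones(coords, pops, max_frac=0.5):
    """Zones made of each area plus its nearest neighbours, up to max_frac of the population."""
    total_pop = sum(pops)
    zones = set()
    for cx, cy in coords:
        order = sorted(range(len(coords)),
                       key=lambda i: (coords[i][0] - cx) ** 2 + (coords[i][1] - cy) ** 2)
        members, pop = [], 0
        for i in order:
            pop += pops[i]
            if pop > max_frac * total_pop:
                break
            members.append(i)
            zones.add(frozenset(members))
    return zones


def max_llr(cases, pops, zones):
    """Return the largest likelihood ratio over all zones and the zone attaining it."""
    total_c, total_p = sum(cases), sum(pops)
    best, best_zone = 0.0, None
    for z in zones:
        c = sum(cases[i] for i in z)
        e = total_c * sum(pops[i] for i in z) / total_p
        value = llr(c, e, total_c)
        if value > best:
            best, best_zone = value, z
    return best, best_zone


def scan(coords, cases, pops, n_sim=199, seed=1):
    """Most likely cluster, its test statistic and a Monte Carlo p-value."""
    rng = random.Random(seed)
    zones = candidate_zones(coords, pops)
    observed, zone = max_llr(cases, pops, zones)
    total_c = sum(cases)
    exceed = 0
    for _ in range(n_sim):
        # Null hypothesis: cases are allocated to areas in proportion to population.
        sim = [0] * len(pops)
        for i in rng.choices(range(len(pops)), weights=pops, k=total_c):
            sim[i] += 1
        if max_llr(sim, pops, zones)[0] >= observed:
            exceed += 1
    return zone, observed, (exceed + 1) / (n_sim + 1)


# Hypothetical 5 x 5 grid of areas with a cluster in one corner.
coords = [(x, y) for x in range(5) for y in range(5)]
pops = [1000] * 25
cases = [2] * 25
for i in (0, 1, 5):  # areas at (0, 0), (0, 1) and (1, 0)
    cases[i] = 12

zone, statistic, p_value = scan(coords, cases, pops)
print(sorted(zone), round(statistic, 1), p_value)
```

Space-time versions extend the same idea by scanning cylinders whose height is a time interval rather than circles alone.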

Spatial modelling

Modelling approaches can be broadly categorised into data- and knowledge-driven methods (Pfeiffer et al., 2008b, Stevens and Pfeiffer, 2011). The former use a dataset comprising several risk factors together with an outcome variable, and risk-factor effect estimates are usually obtained using one of a range of regression methods. Data-driven approaches can be further sub-divided depending on whether they require both disease presence and absence data to calibrate the model, or presence-only data. Alternatively, with knowledge-driven methods, risk estimates are derived based on existing or hypothesised understanding of the causal relationships leading to disease occurrence (Stevens and Pfeiffer, 2011, Stevens et al., 2013). Amongst presence-absence data-driven methods, Bayesian approaches used to be a major focus of development but these have recently been complemented by machine learning methods which are better able to deal with the large datasets of the Big Data era (Vatsavai et al., 2012, Lawson, 2014, van Zyl, 2014a, van Zyl, 2014b, Ziegler and König, 2014). Machine learning regression modelling used to consist primarily of classification tree analysis (Breiman et al., 1984), but in recent years this approach has been more or less replaced by random forest and boosted regression tree methods. These approaches are considered to be less affected by missing values, non-linearity, autocorrelation, lack of independence and distributional assumptions than parametric methods. In addition, several comparative reviews of the performance of the different species distribution modelling methods (Hirzel et al., 2006, Elith and Graham, 2009, França and Cabral, 2015) suggest that, in general, tree-based regression methods tend to perform slightly better than other spatial regression approaches. Requiring large datasets to be able to produce generalizable inferences, these methods are ideally suited for analysing Big Data. 
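To make the idea of boosting concrete, the sketch below fits an ensemble of depth-one regression trees (stumps) by repeatedly fitting each new stump to the residuals of the current ensemble and adding a shrunken copy of it. It is a minimal illustration with squared-error loss, not the implementation used in any of the cited studies; the two predictors (a temperature-like and a rainfall-like variable) and the outcome are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold midway between observed values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1:]


def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def fit_boosted(X, y, n_trees=200, learning_rate=0.1):
    """Gradient boosting with squared-error loss and stumps as base learners."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Hypothetical data: risk is elevated in a mid-range of the first predictor
# (a non-linear effect) and above a threshold of the second.
X = [(t, r) for t in range(10, 41, 2) for r in range(50, 151, 25)]
y = [(1.0 if 20 <= t <= 30 else 0.0) + (0.5 if r > 100 else 0.0) for t, r in X]

model = fit_boosted(X, y)
mse = sum((predict(model, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)
print(round(mse, 5), round(predict(model, (24, 150)), 2))
```

No distributional form is assumed for either predictor, which is one reason tree-based methods cope with non-linear effects. Random forests differ in that they average many trees grown independently on bootstrap samples rather than fitting them sequentially.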
Boosted regression trees are being used with increasing frequency to predict species distributions and disease risk (Hay et al., 2006, Martin et al., 2011, Gilbert et al., 2014, Pigott et al., 2014), while Tatem et al. (2014) used random forest regression tree analysis to generate risk maps for malaria occurrence and human movement flows based on mobile phone call location records to describe the spatial variation in malaria exportation/importation potential for Namibia. However, a common problem with disease regression modelling is that, while the outcome variable may consist of fairly reliable disease presence information, for a usually unknown number of space-time observations, absence of disease reporting may not reflect true absence of disease or absence data may not be available (e.g. surveillance data). This is also common in ecological species distribution modelling and has led to the development of different sampling approaches to generate pseudo-absence data that can be used with regression methods requiring both presence and absence data, as well as the development of specific modelling techniques requiring presence-only data such as the ecological niche modelling (ENM) methods including ecological niche factor analysis (ENFA), genetic algorithm for rule-set production (GARP) and maximum entropy (Maxent) (Hirzel et al., 2002, Elith and Leathwick, 2009). Requiring only disease presence data means that ENM methods can make use of the extensive disease occurrence data available in surveillance databases, and by extension, of web-based Big Data systems containing information on location of disease occurrence but lacking absence data. Increased access to molecular information on hosts and pathogens has resulted in the emergence of the field of phylogeography which integrates geospatial with genetic data (Liang et al., 2010, Chan et al., 2011, Faria et al., 2011, Pybus et al., 2012, Carrel and Emch, 2013, Alvarado-Serrano and Knowles, 2014). 
Further, combining ENM and phylogeography can be particularly informative for studies of globally distributed pathogens where environmental associations may be linked to genetic variation. From a methodological perspective, knowing that certain lineages exhibit niche specialisation and unique geographic distributions can improve model accuracy, because a large population can then be divided into biologically meaningful sub-populations (Mullins et al., 2013). Conversely, ignoring such genetic variation may result in ENMs that are biased towards the dominant strain in a particular region. There are also now a number of examples of integrated analysis of spatial and social network data (Firestone et al., 2011, Giebultowicz et al., 2011, Firestone et al., 2012). Hay et al. (2013) discussed the opportunities arising from taking advantage of Big Data through integrated analyses and emphasised the need for a dynamic risk-mapping capability based on integrated analysis of risk factor variables ranging from relatively static environmental variables to highly dynamic social media variables. While data-driven methods still dominate in spatial modelling, the use of knowledge-driven approaches has increased during the last ten years. This is particularly the case for dynamic modelling, but also for static approaches such as multi-criteria decision analysis (MCDA). A key characteristic of these modelling approaches is their emphasis on inter-disciplinarity, in that system understanding generated by different disciplines needs to be integrated so that the particular modelling objectives can be meaningfully achieved. Big Data is unlikely to remove the need for expert opinion and for methods that integrate existing knowledge, such as MCDA, particularly in the context of management of new and emerging risks. 
The use of knowledge-driven approaches, and the interpretation of their results, need to recognise the potential for bias and underestimation of variability, given that the model structure is based on expert opinion and the parameters also tend to be based on expert opinion or derived from a variety of research activities. Malczewski (2006), in his review of spatial MCDA, notes that the methodology has been applied in many areas, particularly for land suitability analysis, and that it has facilitated the development of participatory GIS. However, he highlights that the methods are frequently used without taking account of their underlying assumptions. More recently, Malczewski (2010) and Hongoh et al. (2011) emphasised the benefits of using spatially explicit MCDA to improve transparency and trans-disciplinarity of decision-making processes. In animal health, Clements et al. (2006) and Stevens et al. (2013) used spatial MCDA to generate suitability maps for Rift Valley fever for Africa and avian influenza H5N1 for Asia, respectively. Both applied Dempster–Shafer theory to explicitly express and propagate uncertainty in relation to knowledge about the underlying processes expressed in the decision rules. de Glanville et al. (2014) generated suitability maps for African swine fever for Africa and used Monte-Carlo sensitivity analysis to express uncertainty in relation to model outputs. Other animal health applications of spatial MCDA have addressed diseases such as African horse sickness in Spain and Rift Valley fever in Italy (Tran et al., 2013, Sanchez-Matamoros et al., 2014). The increasing use of MCDA in the environmental sciences has resulted in further development of MCDA methodologies to reduce the influence of subjectivity of individual criteria weights on the risk score outcome (Yemshanov et al., 2013, Feizizadeh et al., 2014, Jankowski et al., 2014, Ligmann-Zielinska and Jankowski, 2014).
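The sensitivity of an MCDA suitability score to subjective criteria weights can be illustrated with a short sketch. It combines criteria with a weighted linear combination, one common MCDA aggregation rule, and then perturbs the weights repeatedly in a simple Monte Carlo loop. It does not reproduce the procedure of any cited study: the criteria names, weights, grid cells and the choice of a normal perturbation with 20% relative standard deviation are all hypothetical.

```python
import random
import statistics


def weighted_score(criteria, weights):
    """Weighted linear combination of criteria already scaled to [0, 1]."""
    total = sum(weights.values())
    return sum(criteria[name] * w for name, w in weights.items()) / total


def monte_carlo_scores(criteria, weights, rel_sd=0.2, n_iter=1000, seed=0):
    """Recompute the score under randomly perturbed (and renormalised) weights."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        # Perturb each weight independently; clamp to keep it strictly positive.
        perturbed = {name: max(1e-9, rng.gauss(w, rel_sd * w))
                     for name, w in weights.items()}
        scores.append(weighted_score(criteria, perturbed))
    return scores


# Hypothetical expert-derived weights and scaled criteria for two grid cells.
weights = {"host_density": 0.5, "wetland_proximity": 0.3, "market_access": 0.2}
cells = {
    "cell_A": {"host_density": 0.9, "wetland_proximity": 0.8, "market_access": 0.4},
    "cell_B": {"host_density": 0.2, "wetland_proximity": 0.3, "market_access": 0.9},
}

for name, criteria in cells.items():
    scores = monte_carlo_scores(criteria, weights)
    print(name,
          round(weighted_score(criteria, weights), 3),
          round(statistics.mean(scores), 3),
          round(statistics.stdev(scores), 3))
```

Mapping the standard deviation alongside the mean score for each cell gives a simple indication of where the suitability map depends most heavily on the chosen weights.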

Conclusions

It is almost certain that in the near future humanity will have to deal with major infectious disease threats, largely as either a direct or indirect consequence of anthropogenic development. The latter involves technological and scientific advances which have generated, and will continue to generate, opportunities for more effective management of current, new and emerging infectious disease threats. Big Data, together with the Internet of Things, has introduced a new way of collecting and analysing data that is very different from the hypothesis-driven approaches previously accepted by the international scientific community as the primary mechanism for generating new scientific knowledge. Within the area of epidemiological analysis of spatial and spatio-temporal data, Big Data-associated technologies and data sources have, so far, had limited impact, with the main advances having been associated with machine learning modelling methods and the recent use of mobile phone location records, molecular diagnostic data and animal movement data. To more effectively harness the opportunities offered by these new digital technologies in animal and human health, an interdisciplinary approach will have to be embraced which, in addition to the various scientific domains associated with human, animal and environmental health, also includes computer science. This will result in a particularly interesting situation for epidemiologists, whose scientific strength has been the integration between the applied health sciences and the more theoretical and abstract methods underpinning statistical analysis, and who could now add the role of acting as an interface with the computer science aspects of Big Data and the Internet of Things. By doing so they will be able to continue their substantial contribution to the understanding of cause-effect relationships in eco-social systems, and thereby expand the knowledge-base underpinning effective animal health risk management.

Conflict of interest

The authors report no conflict of interest.

References (81 in total)

1. Interpreting Google flu trends data for pandemic H1N1 influenza: the New Zealand experience.

Authors: N Wilson; K Mason; M Tobias; M Peacey; Q S Huang; M Baker
Journal: Euro Surveill Date: 2009-11-05

Review 2. Google trends: a web-based tool for real-time surveillance of disease outbreaks.

Authors: Herman Anthony Carneiro; Eleftherios Mylonakis
Journal: Clin Infect Dis Date: 2009-11-15 Impact factor: 9.079

Review 3. Integrating statistical genetic and geospatial methods brings new power to phylogeography.

Authors: Lauren M Chan; Jason L Brown; Anne D Yoder
Journal: Mol Phylogenet Evol Date: 2011-02-23 Impact factor: 4.286

4. DeepMind algorithm beats people at classic video games.

Authors: Elizabeth Gibney
Journal: Nature Date: 2015-02-26 Impact factor: 49.962

5. Evaluation of ProMED-mail as an electronic early warning system for emerging animal diseases: 1996 to 2004.

Authors: Peter Cowen; Tam Garland; Martin E Hugh-Jones; Arnon Shimshony; Stuart Handysides; Donald Kaye; Lawrence C Madoff; Marjorie P Pollack; Jack Woodall
Journal: J Am Vet Med Assoc Date: 2006-10-01 Impact factor: 1.936

6. Identification of Suitable Areas for African Horse Sickness Virus Infections in Spanish Equine Populations.

Authors: A Sánchez-Matamoros; J M Sánchez-Vizcaíno; V Rodríguez-Prieto; E Iglesias; B Martínez-López
Journal: Transbound Emerg Dis Date: 2014-12-05 Impact factor: 5.005

7. Rumors of disease in the global village: outbreak verification.

Authors: T W Grein; K B Kamara; G Rodier; A J Plant; P Bovier; M J Ryan; T Ohyama; D L Heymann
Journal: Emerg Infect Dis Date: 2000 Mar-Apr Impact factor: 6.883

8. Adding the spatial dimension to the social network analysis of an epidemic: investigation of the 2007 outbreak of equine influenza in Australia.

Authors: Simon M Firestone; Robert M Christley; Michael P Ward; Navneet K Dhand
Journal: Prev Vet Med Date: 2012-02-23 Impact factor: 2.670

Review 9. Use of unstructured event-based reports for global infectious disease surveillance.

Authors: Mikaela Keller; Michael Blench; Herman Tolentino; Clark C Freifeld; Kenneth D Mandl; Abla Mawudeku; Gunther Eysenbach; John S Brownstein
Journal: Emerg Infect Dis Date: 2009-05 Impact factor: 6.883

10. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports.

Authors: Clark C Freifeld; Kenneth D Mandl; Ben Y Reis; John S Brownstein
Journal: J Am Med Inform Assoc Date: 2007-12-20 Impact factor: 4.497
