
Mind the Scales: Harnessing Spatial Big Data for Infectious Disease Surveillance and Inference.

Elizabeth C Lee¹, Jason M Asher², Sandra Goldlust¹, John D Kraemer³, Andrew B Lawson⁴, Shweta Bansal¹,⁵.

Abstract

Spatial big data have the velocity, volume, and variety of big data sources and contain additional geographic information. Digital data sources, such as medical claims, mobile phone call data records, and geographically tagged tweets, have entered infectious disease epidemiology as novel sources of data to complement traditional infectious disease surveillance. In this work, we provide examples of how spatial big data have been used thus far in epidemiological analyses and describe opportunities for these sources to improve disease-mitigation strategies and public health coordination. In addition, we consider the technical, practical, and ethical challenges with the use of spatial big data in infectious disease surveillance and inference. Finally, we discuss the implications of the rising use of spatial big data in epidemiology for health risk communication and for public health policy recommendations and coordination across scales.

Keywords: digital epidemiology; disease mapping; infectious diseases; spatial big data; spatial epidemiology; statistical bias

Year: 2016 PMID: 28830109 PMCID: PMC5144899 DOI: 10.1093/infdis/jiw344

Source DB: PubMed Journal: J Infect Dis ISSN: 0022-1899 Impact factor: 5.226

During one of epidemiology's formative moments, John Snow mapped London households in which residents had cholera and succeeded in highlighting the risk of cholera associated with the Broad Street pump. Since then, spatial investigations have played a critical role in improving our understanding of the associations between risks and disease outcomes. In infectious disease epidemiology, we ask, “Which populations are at higher risk for disease?” “Where did this outbreak originate?” and “Where can we expect future disease outbreaks to arise?” Fundamentally, these are spatial questions that rely on spatial data for answers. Traditional infectious disease epidemiology is built on the foundation of relatively high-quality and high-accuracy data on disease (eg, serological diagnostic assays) and behavior (eg, vaccination surveys). These data are usually characterized by small size, but they benefit from control groups or designed observational samples from known underlying populations, thus rendering it possible to make population-level inferences. On the other hand, digital infectious disease epidemiology typically uses existing digital traces, repurposing them to identify patterns in health-related processes. Digital data are electronic and can often be characterized as big data when they are produced in large volumes (ie, when there is a large number of subjects or a large number of measurements per subject), with high velocity (ie, when data are created in near real time), and have variety in sources and organizational structures [1]. When big data are characterized by fine spatial granularity, in which point or areal locations are identified, we refer to them here as spatial big data. Big data provide opportunities for infectious disease epidemiology and public health because they increase accessibility to populations over space and time; data on personal beliefs, behaviors, and health outcomes are now available at unprecedented breadth and depth. 
The trade-off to this tremendous access is the potential for loss of quality and accuracy. Streams of digital data relevant to public health may serve as proxies for a desired measure, but these data sets may not meet the assumptions for standard methods of epidemiological comparison (eg, self-reported symptoms on social media and serological diagnoses both serve as proxies for so-called true cases, but they have different biases and collection procedures and represent different populations). The trade-off between access and accuracy and the task of separating true signal from large and varied noise characterize the challenge and opportunity of big data for infectious disease epidemiology [2]. In this article, we focus on spatial big data and their applications to the field of spatial epidemiology. We highlight the opportunities for spatial big data to improve spatial modeling and data coverage and describe ongoing challenges as spatial big data become more pervasive in informing disease surveillance, disease control, and public health policy.

SPATIAL BIG DATA OPEN NEW DOORS IN EPIDEMIOLOGY

True to the promise of variety in big data streams, several familiar technologies produce spatial big data that can be used for infectious disease surveillance and modeling. Social media sites like Facebook and Twitter allow users to tag individual posts with specific locations, linking geography to specific health behaviors. Mobile phones send signals with global positioning system locations, and their call data records are spatially referenced through cell tower locations, both of which enable the measurement of human activity and mobility [3, 4]. Web search data may capture user location through Internet protocol addresses, and online encyclopedia (Wikipedia) access logs may identify locations on the basis of the search language [5, 6]. Administrative medical claims and pharmacy transactions indicate the location of healthcare facilities and drugstores where patients seek care and medications [7, 8]. Restaurant reservation cancellations on sites like OpenTable may provide insight into disease incidence in specific cities [9]. Infectious disease epidemiology has already witnessed an impact from spatial big data, and the development of new methods and improvements to computational efficiency will only increase the potential of these data sources. Satellite imagery to infer climate, land use, and population density information has contributed to a better understanding of the spatial distribution of critical mosquito disease vectors and the seasonal epidemic dynamics of measles [eg, 10, 11]; and HealthMap, an automated, online news and outbreak reporting aggregator, has enabled the assimilation of disparate sources of disease occurrence data and has been used to examine spatial dynamics of cholera [12]. Mobile phone call data records have provided insights into human mobility that have informed risk maps, importation potential, and spatial dynamics of dengue and malaria [eg, 4]. 
Medical claims data have been used to examine spatial heterogeneity in influenza epidemic timing and severity [13, 14], while geographically referenced Twitter data have been used to identify spatial antivaccination sentiment [15]. While these studies with spatial big data have leveraged the fine spatial resolution to develop a detailed understanding of disease risk, there remain untapped opportunities with real-time surveillance, large-scale ecological inference, and adaptive disease mitigation strategies. Harnessing disease data from digital sources may enable epidemiological analyses to be performed at finer spatial scales in areas with poor coverage from traditional public health surveillance, and traditional and digital sources of spatial big data may be combined to account for the bias and gaps in each [eg, 16]. The assimilation of multiple spatial big data sources through flexible statistical modeling methods and the continuous nature of data streams could enable near real-time dynamic disease mapping and risk mapping in the near future. For example, Bayesian statistical approaches have emerged as tools for merging multiscale big data sources, incorporating explicit spatial dependencies into maps and models, and providing a framework for joining disease surveillance data across spatial scales while explicitly capturing the variation in measurement bias across locations [eg, 17]. Finally, access to multiple spatial scales of data allows one scale with missing observations to borrow information from a different scale through the addition of contextual effects in modeling inference [18].
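
The idea of one spatial scale borrowing information from another can be sketched with a deliberately simple estimator. The Python example below is not the hierarchical Bayesian modeling cited above; it is a minimal partial-pooling illustration in which each county rate is pulled toward the state rate, and a county with no report falls back to the state rate entirely. The county data, the function name pooled_rate, and the prior_strength value are all hypothetical.

```python
def pooled_rate(cases, population, parent_rate, prior_strength=1000.0):
    """Shrink a fine-scale rate toward the coarse-scale (parent) rate.

    This is the posterior mean of a beta-binomial model whose prior is
    centred on parent_rate. prior_strength is an assumed tuning value:
    the number of pseudo-observations contributed by the coarse scale.
    """
    if cases is None or population is None:
        # No observation at the fine scale: use the coarse scale alone.
        return parent_rate
    return (cases + prior_strength * parent_rate) / (population + prior_strength)


# Hypothetical county data: (cases, population); None marks a missing report.
counties = {
    "A": (50, 10_000),  # large county, stable raw rate
    "B": (2, 200),      # small county, noisy raw rate
    "C": (None, None),  # county with no data
}

# State-level rate computed from the counties that did report.
observed = [(c, n) for c, n in counties.values() if c is not None]
state_rate = sum(c for c, _ in observed) / sum(n for _, n in observed)

estimates = {
    name: pooled_rate(c, n, state_rate) for name, (c, n) in counties.items()
}

for name, rate in estimates.items():
    print(f"county {name}: {rate:.5f}")
```

In this sketch the small county B moves much further toward the state rate than the large county A does, which is the qualitative behavior one would expect from models that share information across scales.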

SPATIAL BIG DATA PRESENT TECHNICAL CHALLENGES

While big data offer significant opportunities for epidemiological modeling and analysis, they also present a variety of technical and practical challenges. The measurement of incomplete and unrepresented populations, the lack of consistency and reliability in data over time, and the need for data and model validation are broad challenges with big data and statistical analysis that are discussed elsewhere [eg, 19, 20]. Here, we discuss a narrower set of challenges that arise specifically from the spatial nature of big data.

Spatial Coverage and Representation

Spatial big data may provide precise spatial information, but careful users should question the validity of available data. For example, we know that sources of spatial big data have biases in usership rates and demographic characteristics by location (Figure 1A) [21]. Medical claims record data only from insured and care-seeking populations, which may vary systematically according to socioeconomic and demographic characteristics. Social media sites where users volunteer spatial data tend to have more users and higher-quality information per capita in urban areas as compared to rural areas [21]. Mobile phone ownership varies by sex and literacy, and phone sharing between multiple individuals and SIM card switching complicate comparisons of these data across locations [3, 4]. As we cannot often measure the heterogeneities in user populations, these heterogeneities can translate into poor choices in sampling design (eg, how to stratify samples to get a representative population). Beyond heterogeneities in user populations, the populations captured by big data (eg, Twitter users) are not usually relevant to epidemiology; even if we could generate an unbiased sample of the population, it may not provide information important to public health. All of these issues complicate analyses that seek to compare different locations. Ultimately, issues with spatial coverage and representation cause problems for statistical inference, which often depends on assumptions of independent random variation and representative sampling for validity. Future research should compare analyses of spatial big data and analyses of designed observational data, to demonstrate the validity of spatial big data samples and to understand which features of a big data sample can produce robust statistical inference.

Figure 1.

A, Spatial big data have spatial biases in the populations they represent. For instance, as reported by the 2013 American Community Survey, there is spatial variation in home Internet access across the United States, which might affect the populations generating search query data in Google Trends. B, With static spatial data (left), individuals (represented with different colors) report case events (points) at fixed locations. For instance, 2 individuals visited the same physician's office with symptoms multiple times (points along the time axis), so their events are recorded at the same position along the space axis (see overlapping trajectories in the lower part of the space axis), while another individual visited a different physician's office with symptoms 3 times in a similar period (upper part of space axis). Events from the same individual are connected with a dashed line. With dynamic spatial data (right), events are recorded as individuals move through space. For example, the dark blue individual (see trajectory that begins earliest on the time axis) recorded 4 events when they tweeted about symptoms at work, at the grocery store, at the pharmacy, and at home, so their case events occur at 4 different positions along the space axis. Events occur in time dynamically (as shown in this figure), but events may also be aggregated to regular intervals (eg, weekly). C, Data at different spatial scales may have different magnitudes and variability in time, after adjustment for population size, even if they are derived from the same data source. For instance, we observe time-varying fluctuations and variation in epidemic peak timing and magnitude in the county-level disease data (gray) that are lost in the state-level data (black). D, One possible method to protect privacy is to mask individual-level data by aggregating collected data to larger spatial resolutions. 
In reality, individuals (black circles) may be connected to other individuals through mobile phone calls (black lines). The publicly released data may be aggregated to the level of neighborhoods (green circles), and the number of calls between individuals from different neighborhoods (green lines) would be represented with different weights (here, depicted with varying thickness according to number of individual calls).

Spatial Uncertainty and Noise

Each source of big data provides a different type of spatial insight, despite the high spatial resolution among the sources. Users of social media volunteer their geographical locations in their profiles or posts, while Internet search engines can log spatial information automatically every time a Web search is performed. Sometimes the data are tied to a static location, as in the case of medical claims and healthcare facilities, but the cell towers associated with call data records and the locations of geographically tagged tweets vary dynamically over time (Figure 1B). Across the combinations of features—self-reported or automated, and static or dynamic—among these data sources, there are additional layers of uncertainty to consider in the context of epidemiology. For one, when spatial information in big data is not clearly specified, systematic biases in the results may be generated from the data-cleaning process itself (eg, addresses may be less likely to be geolocated in rural areas) [eg, 22]. Second, locations of potential transmission events will often differ from locations where disease is reported. While these components are explicitly differentiated in medical claims data (ie, transmitted in the community and reported at healthcare facilities), social media posts affiliated with dynamic movements could provide undifferentiated information about both transmission and reporting event locations. Big data provide information at unprecedented levels of spatial precision, but the spatial information fundamental to infectious disease epidemiology (eg, location and conditions that caused a disease transmission event) continues to remain obscured. As big data become more prevalent in epidemiological analysis, public health officials should take care not to conflate spatial precision with spatial accuracy in statistical inference for disease transmission and control.

Spatial Scales and Misalignment

When spatial big data are available at the level of individuals or precise spatial coordinates, practitioners may need to choose the scale of analysis and aggregate data accordingly. Analyzing data at the individual scale is prone to overfitting and the atomistic fallacy, in which we may make incorrect inferences at the group or population level on the basis of relationships observed in individual-level data [23]. For example, if we observe an association between body mass index (BMI) and hospitalization for influenza among individuals, it may be incorrect to assume that populations with a high average BMI would have higher rates of influenza-associated hospitalization. On the other hand, analyzing data at aggregated scales is prone to the ecological fallacy, in which inferences about individuals are derived falsely from population-level observations [23, 24]. As an example, if we observed a negative association between average income and cholera prevalence at a national scale, it would be erroneous to assume that poor individuals have a higher risk of cholera than wealthy individuals. Similarly, statistical relationships between predictors and disease outcomes may change when analyses are performed at different spatial aggregations. For instance, Google Flu Trends attempted to estimate influenza activity across different regions of the United States by modeling the relationship between Google search terms and visits for influenza-like illness (ILI), as reported in traditional influenza surveillance systems [5]. However, the set of search terms identified as “most predictive” of ILI activity was tuned to a specific spatial scale (regional level) and may not apply at finer resolutions (eg, zip code level) [5, 25]. Additionally, spatial questions often require the use of multiple data sources, and spatial misalignment arises when data are collected at different spatial scales and need to be incorporated into a single analysis. 
For instance, we may seek to understand the spatial distribution of cases at the state level when data were collected at the parish or county level (switching between 2 areal scales), or translate case data associated with household coordinates to cases at the county level (switching between point and areal scales; Figure 1C). Spatial big data have expanded the types of spatial information available for data aggregation—posts geographically tagged on social media might provide information at the level of countries, cities, neighborhoods, landmarks, and latitude-longitude coordinates—potentially engaging statistical change of support problems, even for one individual in a single day [24]. The multiplicity of highly resolved spatial scales also poses concerns for standard data checks, since traditional public health data will not necessarily be available at scales appropriate for validating comparisons to spatial big data [7, 16]. Finally, choices about how to deal with spatial misalignment have consequences for modeling results. For instance, recent studies have asked whether Zika virus–associated microcephaly was occurring at unusually high rates in different Brazilian states. Birth rate data might be collected at one spatial scale according to regular demographic surveys, but data systems tracking microcephalic live births would likely have finer spatial detail. Depending on the choice of spatial scale, the combination of these 2 data sources creates the potential for both overestimation and underestimation of microcephaly rates.
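
The dependence of statistical relationships on the scale of aggregation, discussed in this section, can be illustrated with a small constructed example. The Python sketch below uses entirely hypothetical records (the region names and the exposure and outcome values are invented for illustration) in which the association within every region is negative, while the association between region-level averages is positive. Inferring the individual-level relationship from the aggregated data would therefore give the wrong sign.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical individual records: region -> list of (exposure, outcome).
records = {
    "north": [(1, 5), (2, 4), (3, 3)],
    "central": [(4, 8), (5, 7), (6, 6)],
    "south": [(7, 11), (8, 10), (9, 9)],
}

# Individual-scale analysis: correlation within each region.
within = {
    region: pearson([x for x, _ in rows], [y for _, y in rows])
    for region, rows in records.items()
}

# Aggregated analysis: correlation between region-level averages.
mean_exposure = [mean(x for x, _ in rows) for rows in records.values()]
mean_outcome = [mean(y for _, y in rows) for rows in records.values()]
between = pearson(mean_exposure, mean_outcome)

print("within-region correlations:", within)
print("between-region correlation:", between)
```

The same data support opposite conclusions depending on the scale at which they are summarized, so the scale of inference should match the scale of the question being asked.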

Spatial Confidentiality and Ethics

The practice of collecting data without seeking appropriate ethical approval presents some risk for digital infectious disease epidemiology, and the access to fine-grain spatial information further deepens this concern. Safeguards currently implemented for collecting and sharing spatial big data have focused on the obfuscation and aggregation of shared data to protect privacy and on the anonymization and de-identification of individuals. Many research institutions have standardized practices to protect individual privacy that follow the guidance of institutional review boards, disclosure review boards for public use data, and federal laws (eg, the Health Insurance Portability and Accountability Act of 1996, in the United States), but these organizations do not often recognize high-resolution spatial data as a source that should be covered under human subjects protection policies [26]. Several studies have provided examples in which seemingly anonymized data could be mined (or linked with other databases) for de-anonymization: de Montjoye et al [27] showed that 4 spatiotemporal position points from mobile phone records can be sufficient to uniquely identify 95% of individuals in a large de-identified data set; and Homer et al [28] showed that the sheer quantity of data collected could be sufficient to re-identify individuals in a genetic database. These issues already push the limits of existing ethical review mechanisms and our understanding of de-anonymization. In the future, guidelines to protect privacy and confidentiality may require the masking of individual-level records through the aggregation of data to coarser spatial resolutions (Figure 1D), the provision of synthetic data sets that attempt to mimic underlying distributions [29], or the distillation of spatial big data to parameters commonly used in epidemiological models. 
Investigations may consider the optimal choice of spatial scale in the context of trade-offs between the accurate representation of process heterogeneity, the protection of privacy [26], and the improvement of computational efficiency [30]. Nevertheless, public data become increasingly vulnerable to breaches of privacy as additional data are released and data-mining techniques improve over time.
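
The aggregation approach depicted in Figure 1D can be sketched in a few lines. The Python example below assumes a hypothetical lookup from individuals to neighborhoods and a hypothetical list of individual-level calls; it releases only the number of calls between pairs of neighborhoods. The min_count threshold for suppressing small cells is an added assumption, not something described in the text, and aggregation of this kind does not by itself guarantee that individuals cannot be re-identified.

```python
from collections import Counter

# Hypothetical lookup: individual ID -> neighborhood.
home = {"p1": "N1", "p2": "N1", "p3": "N2", "p4": "N2", "p5": "N3"}

# Hypothetical individual-level call records as (caller, callee) pairs.
calls = [
    ("p1", "p3"),
    ("p2", "p3"),
    ("p3", "p1"),
    ("p4", "p5"),
    ("p1", "p2"),
]


def aggregate_calls(calls, home, min_count=1):
    """Collapse individual calls into undirected neighborhood-level weights.

    Edges with fewer than min_count calls are dropped (an assumed
    small-cell suppression rule, included here only for illustration).
    """
    weights = Counter()
    for caller, callee in calls:
        a, b = home[caller], home[callee]
        if a == b:
            # Calls inside one neighborhood do not link two areas.
            continue
        weights[tuple(sorted((a, b)))] += 1
    return {edge: n for edge, n in weights.items() if n >= min_count}


edges = aggregate_calls(calls, home)
print(edges)
```

Raising min_count trades spatial detail for protection: with min_count=2, the single call between N2 and N3 would no longer appear in the released data.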

IMPLICATIONS FOR PUBLIC HEALTH COMMUNICATION AND POLICY

The promise of high spatial and temporal resolutions in spatial big data opens opportunities for change in the standard practice of public health. In circumstances where adjacent or subordinate administrative units issue separate public health recommendations (eg, US federal, state, and local governments may issue independent influenza vaccination recommendations), spatial big data may enable these entities to derive their policies from analyses of a common data set and encourage coordination of preparedness activities across scales [eg, 14]. There is a growing panoply of adaptive, behavioral, and health economic modeling methods aimed at identifying the most-effective interventions for human and livestock diseases. As these methods begin to find use during ongoing outbreaks, the combination of spatial big data and adaptive models could enable the real-time adaptive management of infectious diseases and the coordination of disease control efforts across spatial scales. In the long term, some sources of big data may become more readily available at finer spatial resolutions than the administrative regions at which policy decisions are made, even to the level of the individual. Spatial big data have already changed consumer-marketing strategies: rather than targeting geographic areas with certain socioeconomic and behavioral characteristics, marketers can now target individual users on the basis of behaviors demonstrated in their digital traces [31]. Should epidemiological modeling and design reflect these cultural changes to public health data? Perhaps an analogous scenario would see individual epidemiological data being used to inform optimal intervention strategies, ignoring the administrative boundaries that typically constrain decision making. 
It is difficult to imagine how such a public health infrastructure could operate—resources must still be coordinated and expended by administrative units, and policy decisions must still apply to populations (rather than individuals) to maintain feasibility. Nevertheless, epidemiological analyses with spatial big data expand the possibilities for multiscale coordination of infectious disease surveillance, response, and forecasting. The real-time high-volume nature of spatial big data makes more epidemiological information readily available to policymakers, but it also creates challenges for the communication of public health information. Spatial big data enable small-area analyses, which are simultaneously highly precise to spatial locations and highly uncertain in modeling results about risk of disease. Similarly, the rise of epidemic forecasting technologies based on spatial big data might present predictions about risk and epidemic outcomes in precise locations even though the forecasts themselves are subject to uncertainty [16]. Consumers of analyses derived from spatial big data—clinicians, public health officials, epidemiologists, and modelers—should develop conscientious practices for communicating uncertainty about spatial results to the public.

23 in total

1. Impact of human mobility on the emergence of dengue epidemics in Pakistan.

Authors: Amy Wesolowski; Taimur Qureshi; Maciej F Boni; Pål Roe Sundsøy; Michael A Johansson; Syed Basit Rasheed; Kenth Engø-Monsen; Caroline O Buckee
Journal: Proc Natl Acad Sci U S A Date: 2015-09-08 Impact factor: 11.205

2. Detecting influenza epidemics using search engine query data.

Authors: Jeremy Ginsberg; Matthew H Mohebbi; Rajan S Patel; Lynnette Brammer; Mark S Smolinski; Larry Brilliant
Journal: Nature Date: 2009-02-19 Impact factor: 49.962

3. Heterogeneous mobile phone ownership and usage patterns in Kenya.

Authors: Amy Wesolowski; Nathan Eagle; Abdisalan M Noor; Robert W Snow; Caroline O Buckee
Journal: PLoS One Date: 2012-04-25 Impact factor: 3.240

4. Guess who's not coming to dinner? Evaluating online restaurant reservations for disease surveillance.

Authors: Elaine O Nsoesie; David L Buckeridge; John S Brownstein
Journal: J Med Internet Res Date: 2014-01-22 Impact factor: 5.428

5. Spatial Transmission of 2009 Pandemic Influenza in the US.

Authors: Julia R Gog; Sébastien Ballesteros; Cécile Viboud; Lone Simonsen; Ottar N Bjornstad; Jeffrey Shaman; Dennis L Chao; Farid Khan; Bryan T Grenfell
Journal: PLoS Comput Biol Date: 2014-06-12 Impact factor: 4.475

6. Enhancing disease surveillance with novel data streams: challenges and opportunities.

Authors: Benjamin M Althouse; Samuel V Scarpino; Lauren Ancel Meyers; John W Ayers; Marisa Bargsten; Joan Baumbach; John S Brownstein; Lauren Castro; Hannah Clapham; Derek At Cummings; Sara Del Valle; Stephen Eubank; Geoffrey Fairchild; Lyn Finelli; Nicholas Generous; Dylan George; David R Harper; Laurent Hébert-Dufresne; Michael A Johansson; Kevin Konty; Marc Lipsitch; Gabriel Milinovich; Joseph D Miller; Elaine O Nsoesie; Donald R Olson; Michael Paul; Philip M Polgreen; Reid Priedhorsky; Jonathan M Read; Isabel Rodríguez-Barraquer; Derek J Smith; Christian Stefansen; David L Swerdlow; Deborah Thompson; Alessandro Vespignani; Amy Wesolowski
Journal: EPJ Data Sci Date: 2015-10-16 Impact factor: 3.184

7. Unique in the Crowd: The privacy bounds of human mobility.

Authors: Yves-Alexandre de Montjoye; César A Hidalgo; Michel Verleysen; Vincent D Blondel
Journal: Sci Rep Date: 2013 Impact factor: 4.379

8. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays.

Authors: Nils Homer; Szabolcs Szelinger; Margot Redman; David Duggan; Waibhav Tembe; Jill Muehling; John V Pearson; Dietrich A Stephan; Stanley F Nelson; David W Craig
Journal: PLoS Genet Date: 2008-08-29 Impact factor: 5.917

9. Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales.

Authors: Donald R Olson; Kevin J Konty; Marc Paladini; Cecile Viboud; Lone Simonsen
Journal: PLoS Comput Biol Date: 2013-10-17 Impact factor: 4.475

10. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US.

Authors: Cécile Viboud; Vivek Charu; Donald Olson; Sébastien Ballesteros; Julia Gog; Farid Khan; Bryan Grenfell; Lone Simonsen
Journal: PLoS One Date: 2014-07-29 Impact factor: 3.240
