
Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital: A Retrospective Observational Study.

David Andre Broniatowski¹, Mark Dredze², Michael J Paul³, Andrea Dugas⁴.

Abstract

BACKGROUND: Public health officials and policy makers in the United States expend significant resources at the national, state, county, and city levels to measure the rate of influenza infection. These individuals rely on influenza infection rate information to make important decisions during the course of an influenza season, driving vaccination campaigns, clinical guidelines, and medical staffing. Web and social media data sources have emerged as attractive alternatives to supplement existing practices. While traditional surveillance methods take 1-2 weeks, and significant labor, to produce an infection estimate in each locale, web and social media data are available in near real-time for a broad range of locations.
OBJECTIVE: The objective of this study was to analyze the efficacy of influenza surveillance that combines data from the websites Google Flu Trends and HealthTweets at the local level. We considered both emergency department influenza-like illness cases and laboratory-confirmed influenza cases for a single hospital in the City of Baltimore.
METHODS: This was a retrospective observational study comparing estimates of influenza activity from Google Flu Trends and Twitter to actual counts of individuals with laboratory-confirmed influenza, and counts of individuals presenting to the emergency department with influenza-like illness. Data were collected from November 20, 2011 through March 16, 2014. Each parameter was evaluated at the municipal, regional, and national scales. We examined the utility of social media data for tracking actual influenza infection at the municipal, state, and national levels. Specifically, we compared the efficacy of Twitter and Google Flu Trends data.
RESULTS: We found that municipal-level Twitter data were more effective than regional and national data when tracking actual influenza infection rates in a Baltimore inner-city hospital. When combined, national-level Twitter and Google Flu Trends data outperformed each data source individually. In addition, influenza-like illness data at all levels of geographic granularity were best predicted by national Google Flu Trends data.
CONCLUSIONS: In order to overcome sensitivity to transient events, such as the news cycle, the best-fitting Google Flu Trends model relies on a 4-week moving average, suggesting that it may also be sacrificing sensitivity to transient fluctuations in influenza infection to achieve predictive power. Implications for influenza forecasting are discussed in this report.


Keywords: Web mining; social computing; time series analysis

Year: 2015 PMID: 27014744 PMCID: PMC4803078 DOI: 10.2196/publichealth.4472

Source DB: PubMed Journal: JMIR Public Health Surveill ISSN: 2369-2960

Introduction

Public health officials and policy makers rely on influenza infection rate information to make important decisions during the course of an influenza season. Whereas influenza surveillance has traditionally been conducted using laboratory data, hospitalizations, and physician visits for influenza-like illness (ILI), web and social media data sources have emerged as attractive alternatives to supplement existing practices. While traditional surveillance methods take 1-2 weeks, and significant labor, to produce an infection estimate in each locale, web and social media data are available in near real-time for a broad range of locations. Studies have demonstrated that web queries [1-3], Twitter messages [4-12], and other sources (eg, Wikipedia [13], mobile app reporting [14]) may be productively mined for influenza surveillance data. New resources like Google Flu Trends [1], HealthTweets [15,16] (Figure 1), and Flu Near You [14] deliver near real-time estimates of infection rates.

Figure 1

Screenshot of HealthTweets.

However, few studies have examined the efficacy of local surveillance [12,17,18]. In this study, we analyzed the efficacy of local flu surveillance from Google Flu Trends and HealthTweets. Whereas previous studies considered either Google or Twitter in isolation, we evaluated multiple trends available from both. Furthermore, instead of restricting our study to hospitals designated as ILI sentinels, or to emergency department ILI rates, we considered both emergency department ILI and laboratory-confirmed influenza cases for a single hospital in the city of Baltimore. This enabled us to evaluate the impact on specific care centers when making influenza response decisions, such as staffing and resource allocation.

Methods

Study Population and Setting

This was a retrospective observational study comparing estimates of influenza activity from Google Flu Trends and Twitter to actual counts of individuals with laboratory-confirmed influenza, and counts of individuals presenting to the emergency department with ILI. Each parameter was evaluated at the municipal, regional, and national scales.

Data Collection and Methods of Measurement

Data were collected from November 20, 2011 through March 16, 2014. All measurements were recorded weekly to allow for direct comparison between data sources. Following the Centers for Disease Control and Prevention (CDC) convention, each weekly value summed the data points from Sunday through the following Saturday. The number of municipal- (city) level subjects was estimated by evaluating the number of patients presenting to an urban academic emergency department in Baltimore, Maryland with an annual volume of over 60,000 adult and 24,000 pediatric visits. The number of confirmed influenza cases was determined by summing the number of emergency department visits with laboratory-confirmed influenza that occurred during each week. Similarly, the number of patients with ILI was determined by summing the number of emergency department patients who reported fever with cough or sore throat each week. Regional data were collected via the CDC surveillance reports for Health and Human Services (HHS) Region 3, including both the percentage of patients reporting ILI and the percentage of tests positive for influenza. National data were collected from the CDC surveillance report of the nationwide percentage of patients reporting ILI and the total percentage of patients testing positive for influenza. Google Flu Trends data for the United States, the state of Maryland, and the city of Baltimore were downloaded directly from the Google Flu Trends website [19]. Twitter data for the same three locations were obtained from the HealthTweets website [15], an online platform for public health surveillance aimed at sharing the latest research results on Twitter data with the scientific community and public officials. The underlying data were generated using a sequence of supervised machine-learning algorithms [10,12], namely logistic regression classifiers, the first of which identified tweets that were relevant to health. Next, tweets that were about influenza were isolated. The final classifier separated tweets that reported influenza infection from those that only reported awareness of the flu. The tweets indicating influenza infection constituted our dataset. Message locations were identified using Carmen [20], a software package that infers tweet locations using Global Positioning System (GPS) coordinates and self-reported locations from the free text of the user biographic profiles.
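The Sunday-through-Saturday weekly summation described above can be illustrated with a minimal Python sketch. The function names and the daily-count input format are assumptions made for illustration; the study's actual data pipeline is not described at this level of detail.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import date, timedelta


def week_start_sunday(day: date) -> date:
    """Return the Sunday that opens the Sunday-to-Saturday week containing `day`."""
    # date.weekday() is 0 for Monday through 6 for Sunday, so the number of
    # days elapsed since the most recent Sunday is (weekday + 1) % 7.
    days_since_sunday = (day.weekday() + 1) % 7
    return day - timedelta(days=days_since_sunday)


def weekly_totals(daily_counts: dict[date, int]) -> dict[date, int]:
    """Sum hypothetical daily counts into weeks keyed by their opening Sunday."""
    totals: dict[date, int] = defaultdict(int)
    for day, count in daily_counts.items():
        totals[week_start_sunday(day)] += count
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    # Illustrative counts only; these are not data from the study.
    example = {
        date(2011, 11, 20): 2,  # Sunday, first day of the study period
        date(2011, 11, 26): 3,  # Saturday of the same week
        date(2011, 11, 27): 1,  # Sunday, opens the next week
    }
    for sunday, total in weekly_totals(example).items():
        print(sunday.isoformat(), total)
```

Keying each week by its opening Sunday keeps every source on the same weekly grid, which is what allows the direct comparison between data sources mentioned above.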

Statistical Analysis

Data were analyzed by evaluating weekly trends over time using the Box-Jenkins procedure [21] applied to each data source (influenza tests at our medical center, ILI at our medical center, % reported flu cases in HHS Region 3 and the USA, and % reported ILI in HHS Region 3 and the USA) in order to control for autocorrelation in the corresponding time series. We next fit an autoregressive integrated moving average model with exogenous covariates (ARIMAX) to each time series, X_t, where p, d, and q are the respective autoregressive, differencing, and moving average orders of the model (Figure 2, part a). The φ_i and θ_i are the autoregressive and moving average parameters, respectively; ε_t is a normally distributed error term with a mean of 0; L is a lag operator defined as in Figure 2, part b; and m_t is defined as in Figure 2, part c, where y_t is a series of predictors (eg, Twitter and/or Google Flu Trends data), the η_i are a series of predictor weights, and b is the total number of predictor time series.

Figure 2

Equations defining the ARIMAX model.
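Because the figure itself is not reproduced in this text, the following is a hedged reconstruction of its three parts from the symbol definitions given above, written in the common regression-with-ARIMA-errors form. The exact layout and sign conventions of the original figure, and the double index on the predictor series y, are assumptions.

```latex
\begin{align}
  % (a) ARIMAX model of orders p, d, q
  \Bigl(1 - \sum_{i=1}^{p} \varphi_i L^i\Bigr) (1 - L)^d \,\bigl(X_t - m_t\bigr)
    &= \Bigl(1 + \sum_{i=1}^{q} \theta_i L^i\Bigr) \varepsilon_t \\
  % (b) lag operator
  L^i X_t &= X_{t-i} \\
  % (c) weighted sum of the b exogenous predictor series
  m_t &= \sum_{i=1}^{b} \eta_i \, y_{i,t}
\end{align}
```

In this form, b = 1 would correspond to a model using Twitter or Google Flu Trends alone, and b = 2 to a model using both.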

We chose the autoregressive, differencing, and moving average terms of each model that minimized its Akaike Information Criterion (AIC), subject to the constraint that each model used the same degree of differencing for each data source. This constraint was imposed to enable comparison across social media predictors (ie, Twitter, Google Flu Trends, or both). All statistical analyses were conducted using the R Project for Statistical Computing, version 3.0.2 (The R Foundation for Statistical Computing). Specifically, we used the "arima()" function in the forecast package [22]. Parameter selection was informed by the "auto.arima()" function, using the Hyndman and Khandakar algorithm [23]. Deviations from the algorithm's output were then examined by hand, and parameters that deviated from the algorithm's output were chosen if they minimized AIC.
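The selection rule described above can be sketched in a few lines of Python using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the log-likelihood. This is an illustration only, not the R workflow used in the study: the `Candidate` type, the way parameters are counted, and the numbers in the example are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A fitted model summarized by its order, log-likelihood, and size."""
    order: Tuple[int, int, int]  # (p, d, q)
    log_likelihood: float
    n_params: int  # number of estimated parameters (counting is an assumption)


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates: List[Candidate], d: int) -> Candidate:
    """Pick the lowest-AIC candidate among those with differencing order d."""
    # Fixing d mirrors the constraint that models being compared share the
    # same degree of differencing.
    eligible = [c for c in candidates if c.order[1] == d]
    if not eligible:
        raise ValueError(f"no candidate with differencing order d={d}")
    return min(eligible, key=lambda c: aic(c.log_likelihood, c.n_params))


if __name__ == "__main__":
    # Hypothetical fits; these values are not taken from Table 1.
    fits = [
        Candidate(order=(0, 1, 0), log_likelihood=-100.0, n_params=2),
        Candidate(order=(1, 1, 4), log_likelihood=-94.0, n_params=7),
        Candidate(order=(2, 0, 0), log_likelihood=-80.0, n_params=4),
    ]
    winner = best_by_aic(fits, d=1)
    print(winner.order, aic(winner.log_likelihood, winner.n_params))
```

The penalty term 2k is what lets a model with more parameters be preferred only when its gain in log-likelihood outweighs the added complexity.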

Results

Table 1 summarizes the results of each ARIMAX model incorporating Twitter and Google Flu Trends data. Our results show that Baltimore-area Twitter data provided a better estimate of actual influenza cases reported in the Baltimore metropolitan area when compared to state- and national-level Twitter data (see Figure 3). Furthermore, a combination of Twitter and Google Flu Trends data sources outperformed either Twitter or Google Flu Trends individually when predicting actual influenza outbreaks at municipal and regional levels.

Table 1

Log-likelihood (AIC^a) for each surveillance method.

		Laboratory-confirmed influenza			Influenza-like illness (ILI)
Source	Level	City	Region	US	City	Region	US
Twitter^b
	US^c	-311 (627)^0,1,0,e	-317^g (653)^5,1,3	-235^g (484)^0,1,5	-502^g (1009)^0,2,1	-66^g (143)^0,1,0	-27^g (61)^1,1,1
	MD^d	-310 (624)^0,1,0	-321 (661)^5,1,3	-236 (486)^0,1,5	-503 (1012)^0,1,0	-70 (144)^0,1,0	-30 (68)^1,1,1
	Baltimore	-308^g (620)^0,1,0	-323 (666)^5,1,3	-235 (484)^0,1,5	-504 (1013)^0,2,1	-74 (158)^0,1,3	-32 (74)^1,1,1
Google Flu Trends
	US	-291^g (596)^1,1,4	-313^g (648)^5,1,4	-230^f,g (475)^0,1,5	-494^f,g (1002)^1,2,4	-49^f,g (110)^0,1,4	-1^f,g (15)^1,1,4
	MD	-299 (612)^1,1,4	-318 (656)^5,1,3	-236 (486)^0,1,5	-498 (1010)^1,2,4	-58 (129)^0,1,4	-27 (61)^1,1,1
	Baltimore	-295 (604)^1,1,4	-320 (660)^5,1,3	-236 (486)^0,1,5	-495 (1005)^1,2,4	-60 (132)^0,1,4	-23 (56)^1,1,2
Both
	US	-289^f,g (594)^1,1,4	-312^f,g (646)^5,1,3	-230^g (477)^0,1,5	-495^g (1003)^0,1,4	-49^g (112)^0,1,4	-0^g (17)^1,1,4
	MD	-299 (613)^1,1,4	-318 (657)^5,1,3	-235 (485)^0,1,5	-498 (1011)^1,2,4	-58 (130)^0,1,4	-27 (68)^1,1,1
	Baltimore	-294 (604)^1,1,4	-319 (659)^5,1,3	-235 (486)^0,1,5	-500 (1007)^0,2,1	-60 (134)^0,1,4	-22 (55)^1,1,2

^a AIC=Akaike Information Criterion.

^b Twitter data from the HealthTweets website.

^c US=United States.

^d MD=Maryland.

^e Superscript numerals indicate the autoregressive order, the order of differencing, and the moving average order, respectively. Models were chosen to minimize AIC, guided by examinations of autocorrelation and partial autocorrelation values.

^f The best predictor across all data sources.

^g The best predictor within each data source (HealthTweets website, Google, or a linear combination of both).

Figure 3

Plot of weekly confirmed influenza cases (right axis) as compared to standardized Baltimore social media data (left axis).

When directly comparing models that rely only on one data source (ie, Twitter or Google Flu Trends but not both), we found that the best-fitting Twitter models were simple whereas the best-fitting Google Flu Trends models generally required more parameters. For example, at the municipal level, the best-fitting Twitter model did not require any autoregressive or moving average terms, whereas the best-fitting Google Flu Trends model required a 4-week moving average of Google Flu Trends data and an autoregressive term. In general, these more complex Google Flu Trends models outperformed the best-fitting Twitter models. Although these Google Flu Trends models were significantly more complex (ie, one must fit more parameters), they had a lower AIC, indicating that they were also more informative.

Discussion

Principal Findings

Consistent with prior work [18], we found that national-level Google Flu Trends data may be used to track actual influenza cases in the Baltimore area. The fact that a combination of Twitter and Google Flu Trends data at the national (US) level outperformed all other data sources for local and regional confirmed influenza cases indicates that these data sources are not redundant and that Twitter data are contributing information useful to influenza surveillance that is not captured by the corresponding Google Flu Trends data.

Comparison With Prior Work

Whereas prior work using Google Flu Trends data has largely focused on US ILI data, we extended this finding to multiple levels of geographic granularity by examining social media surveillance at the regional and city levels as well. We found that US Google Flu Trends data best explained ILI rates at all levels (including the municipal level, see Figure 4). This contrasts with prior research, which found that Google Flu Trends data conflated signals of influenza awareness (eg, media attention) with signals of actual infection, thereby overestimating the flu season's peak prevalence. In addition, this prior work found that there was insufficient control for temporal autocorrelation and a lack of analysis of Google Flu Trends data at local, rather than national, levels [24].

Figure 4

Plot of weekly influenza-like illness cases (right axis) as compared to standardized US social media data (left axis).

In this study, we controlled for autocorrelation and exogenous temporal factors using an ARIMAX model. The improved performance of this model might be an indication that the 4-week moving average terms are smoothing out fluctuations due to the news cycle. Nevertheless, because Google Flu Trends data do not explicitly differentiate between signals of influenza awareness and actual infection, this relatively complicated model may buy accuracy at the cost of sensitivity to transient phenomena. Thus, temporary spikes in media coverage are smoothed out, but so would temporary spikes in influenza infection. Elsewhere, we have shown that our Twitter data overcome the limitations identified in prior Google Flu Trends studies by filtering out signals of influenza awareness from signals of actual infection and enabling analysis at multiple levels of geographic granularity [12,25]. Furthermore, the fact that the Twitter model is more lightweight means that it is more able to correctly track transient increases in infection when they occur [12]. Finally, municipal-level Twitter data provided a better account of actual influenza cases in Baltimore than did state- or national-level data. This finding is consistent with prior work [12] showing that local Twitter data do contribute information that is useful for municipal surveillance. In contrast, state- and local-level Google Flu Trends data did not improve surveillance when compared to national Google Flu Trends data.
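One way to see why moving average terms trade sensitivity to transients for smoothness is to look at how a pure MA(q) process responds to a single one-week shock. The Python sketch below is a simplified illustration under stated assumptions: the coefficients are hypothetical rather than the fitted values, and the differencing and autoregressive terms of the models in Table 1 are omitted.

```python
from typing import List, Sequence


def ma_response(shocks: Sequence[float], thetas: Sequence[float]) -> List[float]:
    """Compute X_t = e_t + sum_i theta_i * e_(t-i) for a pure MA(q) process.

    `shocks` holds the error terms e_t and `thetas` the q moving average
    coefficients; shocks before the start of the series are treated as zero.
    """
    series = []
    for t in range(len(shocks)):
        value = float(shocks[t])
        for i, theta in enumerate(thetas, start=1):
            if t - i >= 0:
                value += theta * shocks[t - i]
        series.append(value)
    return series


if __name__ == "__main__":
    # A single unit shock in week 2, with four hypothetical MA coefficients.
    one_week_spike = [0, 0, 1, 0, 0, 0, 0, 0, 0]
    hypothetical_thetas = [0.5, 0.5, 0.5, 0.5]
    print(ma_response(one_week_spike, hypothetical_thetas))
```

With four moving average terms, the one-week shock keeps contributing to the series for the four following weeks before it vanishes. A model of this kind therefore attributes a brief spike to a multi-week pattern, whether the spike comes from media coverage or from a real increase in infection.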

Limitations

One limitation of our approach is that it only relies upon one municipality. Furthermore, our analysis only examined three seasons of influenza data, one of which (the 2012-2013 season) is known to have been anomalous. Future work should therefore focus on incorporating data from multiple influenza seasons.

Conclusions

Overall, our results motivate the need for future work examining how social media may be used to track measures relevant to influenza surveillance in multiple different locations and seasons.

12 in total

1. Using internet searches for influenza surveillance.

Authors: Philip M Polgreen; Yiling Chen; David M Pennock; Forrest D Nelson
Journal: Clin Infect Dis Date: 2008-12-01 Impact factor: 9.079

2. Twitter: big data opportunities.

Authors: David Andre Broniatowski; Michael J Paul; Mark Dredze
Journal: Science Date: 2014-07-11 Impact factor: 47.728

3. Big data. The parable of Google Flu: traps in big data analysis.

Authors: David Lazer; Ryan Kennedy; Gary King; Alessandro Vespignani
Journal: Science Date: 2014-03-14 Impact factor: 47.728

4. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak.

Authors: Cynthia Chew; Gunther Eysenbach
Journal: PLoS One Date: 2010-11-29 Impact factor: 3.240

5. Detecting influenza epidemics using search engine query data.

Authors: Jeremy Ginsberg; Matthew H Mohebbi; Rajan S Patel; Lynnette Brammer; Mark S Smolinski; Larry Brilliant
Journal: Nature Date: 2009-02-19 Impact factor: 49.962

6. Influenza forecasting with Google Flu Trends.

Authors: Andrea Freyer Dugas; Mehdi Jalalpour; Yulia Gel; Scott Levin; Fred Torcaso; Takeru Igusa; Richard E Rothman
Journal: PLoS One Date: 2013-02-14 Impact factor: 3.240

7. Monitoring influenza epidemics in china with search query from baidu.

Authors: Qingyu Yuan; Elaine O Nsoesie; Benfu Lv; Geng Peng; Rumi Chunara; John S Brownstein
Journal: PLoS One Date: 2013-05-30 Impact factor: 3.240

8. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time.

Authors: David J McIver; John S Brownstein
Journal: PLoS Comput Biol Date: 2014-04-17 Impact factor: 4.475

9. Influenza-like illness surveillance on Twitter through automated learning of naïve language.

Authors: Francesco Gesualdo; Giovanni Stilo; Eleonora Agricola; Michaela V Gonfiantini; Elisabetta Pandolfi; Paola Velardi; Alberto E Tozzi
Journal: PLoS One Date: 2013-12-04 Impact factor: 3.240

10. National and local influenza surveillance through Twitter: an analysis of the 2012-2013 influenza epidemic.

Authors: David A Broniatowski; Michael J Paul; Mark Dredze
Journal: PLoS One Date: 2013-12-09 Impact factor: 3.240
