Using clinicians' search query data to monitor influenza epidemics.

Mauricio Santillana, Elaine O Nsoesie, Sumiko R Mekaru, David Scales, John S Brownstein.

Abstract

Search query information from a clinician's database, UpToDate, is shown to predict influenza epidemics in the United States in a timely manner. Our results show that digital disease surveillance tools based on experts' databases may be able to provide an alternative, reliable, and stable signal for accurate predictions of influenza outbreaks.

Keywords: Internet-based disease surveillance; digital disease detection; prediction of influenza

Year: 2014; PMID: 25115873; PMCID: PMC4296132; DOI: 10.1093/cid/ciu647

Source DB: PubMed; Journal: Clin Infect Dis; ISSN: 1058-4838; Impact factor: 9.079

The discovery of unusual outbreaks often depends on individual health practitioners who can promptly identify abnormal circumstances and then report those concerns to the greater community [1, 2]. Although the impact of these reports cannot be overstated, recent developments in Internet technologies have demonstrated the power of the crowd as well. For example, crowdsourcing approaches allow members of the public to complete tasks relevant to a larger goal [3]. Search activity on diseases such as influenza and dengue has been shown to correlate with traditional surveillance data in multiple instances [4-8]. Google Flu Trends (GFT) demonstrated a link between influenza-related search query data and the Centers for Disease Control and Prevention's (CDC) influenza-like illness (ILI) index [5]. Other examples include the use of search query data from Yahoo! [9] and from Baidu [8] to track influenza epidemics. Internet search queries are available much earlier than data from validated traditional surveillance systems and have the potential to provide timely epidemiologic intelligence to inform prevention messaging and healthcare facility staffing decisions. However, the potential for the public's search activity to be influenced by anxiety, fears, and rumors raises concerns regarding reliability [10-13]. Although recent revisions to GFT have shown that these concerns can be partially mitigated [13-15], shifting Internet-based surveillance from the entire public to subject-matter experts may maintain timeliness while generating a more reliable and stable signal requiring much less data. A recent small retrospective study using data on queries to a Finnish primary care guidelines database demonstrated, for example, that disease-specific queries for Lyme disease, tularemia, and other infectious diseases correlated well with concurrent confirmed cases [16]. 
Here, we show that UpToDate (www.uptodate.com), a physician-authored clinical decision support Internet resource that is used by 700 000 clinicians in 158 countries and almost 90% of academic medical centers in the United States, can be used for syndromic surveillance of influenza. Specifically, we use UpToDate's search query activity related to ILI to design a timely sentinel of influenza incidence in the United States.

METHODS

Data

UpToDate is a professional database used by healthcare practitioners for point-of-care decisions. The information provided is rigorously authored and edited by experienced physicians. UpToDate topics are accessed more than 18 million times per month, and studies suggest that information provided through the site helps improve healthcare outcomes in hospitals [17-19]. In collaboration with UpToDate, we obtained search volumes for 23 search terms related to ILI, as well as overall search activity, from November 2011 to November 2013 for US accounts only. The search terms were as follows: influenza, Haemophilus influenzae, flu, parainfluenza, H1N1, H7N9, H5N1, H3N2, grippe, gripe, adenovirus, rhinovirus, respiratory syncytial virus, metapneumovirus, coronavirus, Bordetella pertussis, Mycoplasma pneumoniae, pneumonia, bronchitis, H9N2, sinusitis, upper respiratory tract infection, and Tamiflu. We obtained a weekly search fraction for each search term by dividing the number of searches for that term in a given week by the total number of searches in the UpToDate database in that week, thus minimizing the effects of variation in the overall use of the UpToDate database through time. We also obtained the national ILI weekly index from the CDC for the same time period to use as a comparator (available at: http://www.cdc.gov/flu/weekly/pastreports.htm).
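The normalization described above can be illustrated with a minimal Python sketch. The weekly counts are invented for illustration and are not UpToDate data; the function names are hypothetical, and the use of the population standard deviation for the z scores (which the Analysis section applies to the search fractions) is an assumption, since the paper does not say which convention was used.

```python
from statistics import mean, pstdev

def search_fractions(term_counts, total_counts):
    """Weekly search fraction: searches for one term divided by all searches that week."""
    return [count / total for count, total in zip(term_counts, total_counts)]

def z_scores(values):
    """Standardize a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical weekly counts for a single term, invented for illustration only.
tamiflu_searches = [120, 150, 300, 420]
all_searches = [400_000, 410_000, 450_000, 460_000]

fractions = search_fractions(tamiflu_searches, all_searches)
standardized = z_scores(fractions)
```

Dividing by the weekly total means that a week with more overall database use does not by itself raise a term's signal; only a change in the term's share of searches does.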

Analysis

We built a collection of multivariate linear models using the z scores of the aforementioned 23 search terms' weekly search fraction as explanatory variables and the CDC ILI index as our dependent variable. The multiplicative coefficients associated with each search term in each multivariate linear model were updated weekly as the CDC ILI index was updated. Our multivariate models can be expressed as

$$I(t) = \sum_{i=1}^{23} \alpha_i(t)\, Q_i(t) + e,$$

where $I(t)$ is the percentage of national ILI physician visits, $Q_i(t)$ is the search fraction associated with term $i$ at time $t$, $\alpha_i(t)$ is the multiplicative coefficient associated with each term at time $t$, and $e$ is the normally distributed error term. Model selection was performed using a least absolute shrinkage and selection operator (LASSO) technique [20] every week, incorporating new CDC ILI information as it became available. Therefore, our approach recalibrated weekly the relevance of the search activity for each individual term according to its historical prediction ability. The LASSO technique uses an optimization algorithm that favors models that minimize the mean squared error between the observations and predictions, while penalizing models containing many variables by simultaneously minimizing the sum of the absolute size of the regression coefficients. We produced real-time estimates of ILI activity at time t, assuming that (1) we only had access to CDC-reported ILI data up to 2 weeks prior (ie, up to t–2 weeks), and (2) we had access to the real-time (time = t) number of searches in the UpToDate database. Our dynamic approach is similar to the one presented in Santillana et al [15], and inspired by data assimilation techniques widely used in weather forecasting and oceanography [21, 22] and supervised machine-learning techniques [20]. Our methodology was implemented in Matlab version R2011a. The LASSO routine (glmnet for Matlab) was obtained in November 2013 (available at: http://www.stanford.edu/~hastie/glmnet_matlab/) [23].
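The weekly recalibration described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Matlab implementation: a small pure-Python coordinate-descent LASSO stands in for the glmnet routine, the penalty is fixed at an arbitrary value rather than selected from the data, an unpenalized intercept is included, and the data are synthetic. All function and variable names are hypothetical.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_fit(X, y, penalty, n_sweeps=100):
    """Coordinate descent for (1/2n) * sum of squared errors + penalty * sum(|coef|)."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coefs = [0.0] * p
    residual = [yi - intercept for yi in y]
    mean_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if mean_sq[j] == 0.0:
                continue
            # Mean product of column j with the residual after adding back column j's own fit.
            rho = sum(X[r][j] * residual[r] for r in range(n)) / n + mean_sq[j] * coefs[j]
            updated = soft_threshold(rho, penalty) / mean_sq[j]
            change = updated - coefs[j]
            if change != 0.0:
                residual = [residual[r] - X[r][j] * change for r in range(n)]
                coefs[j] = updated
        # Re-centre the residual so the unpenalized intercept (an assumption) stays optimal.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, coefs

def rolling_estimates(Q, ili, first_week, penalty, lag=2):
    """Estimate ILI for week t from searches at t, refitting on ILI known up to t - lag."""
    estimates = {}
    for t in range(first_week, len(Q)):
        known = range(0, t - lag + 1)  # weeks whose ILI value has already been reported
        intercept, coefs = lasso_fit([Q[s] for s in known], [ili[s] for s in known], penalty)
        estimates[t] = intercept + sum(a * q for a, q in zip(coefs, Q[t]))
    return estimates

# Synthetic data, invented for illustration: 4 standardized "terms", only term 0 matters.
random.seed(0)
n_weeks, n_terms = 60, 4
Q = [[random.gauss(0.0, 1.0) for _ in range(n_terms)] for _ in range(n_weeks)]
ili = [2.0 + 1.5 * row[0] + random.gauss(0.0, 0.05) for row in Q]

# 26 training weeks (indices 0-25) and a first estimate at index 27, mirroring the 2-week lag.
estimates = rolling_estimates(Q, ili, first_week=27, penalty=0.1)
final_intercept, final_coefs = lasso_fit(Q, ili, penalty=0.1)
```

Because the model is refit at every week on all ILI values reported so far, the coefficient attached to each term can change over time, which is the behavior the heatmap of term relevance is meant to display. In this toy setting the L1 penalty is expected to shrink the coefficients of the uninformative terms toward zero.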

RESULTS

The training period for our first prediction comprised 26 weeks (5 November 2011–28 April 2012). Thus, our first real-time estimate of ILI was calculated for the week of 12 May 2012 (2 weeks later) using the optimal multivariate model. We produced a weekly time series consisting of real-time estimates using our approach for the subsequent weeks up to the week of 30 November 2013. Figure 1 shows our real-time estimates and the CDC-reported ILI visits. GFT estimates are included for context.
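The week arithmetic in this paragraph can be checked with a short Python sketch. It assumes that each weekly data point is labeled by the dates quoted above and that exactly one estimate is produced per week; the variable names are hypothetical.

```python
from datetime import date, timedelta

WEEK = timedelta(weeks=1)

train_start = date(2011, 11, 5)
train_end = date(2012, 4, 28)
first_estimate = date(2012, 5, 12)
last_estimate = date(2013, 11, 30)

# Number of weekly observations in the first training window (both endpoints included).
n_training_weeks = (train_end - train_start) // WEEK + 1

# Gap between the last ILI value used for training and the first estimated week.
reporting_lag_weeks = (first_estimate - train_end) // WEEK

# Number of weekly real-time estimates in the prediction period (both endpoints included).
n_estimates = (last_estimate - first_estimate) // WEEK + 1
```

Under these assumptions the window holds 26 weekly observations, the first estimate falls 2 weeks after the last training week, and the prediction period spans 82 weekly estimates.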

Figure 1.

Performance of our methodology along with Centers for Disease Control and Prevention (CDC)–reported influenza-like illness (ILI) activity. CDC ILI is shown in black; our model, named UpToDate, is shown in light grey; and Google Flu Trends (GFT) estimates are shown with a dashed grey line for context.

Our estimates closely track the CDC-reported ILI visits and outperform GFT estimates during the prediction period. Moreover, our approach accurately estimates the peak of the 2012–2013 influenza season (in the week of 30 December 2012) and produces a slight overestimation of the influenza epidemic curve in the second week of January 2013 (overestimating the flu activity by approximately 25% in relative terms: 5.6% of ILI as opposed to the actual 4.5%). This overestimation is minimal when compared to the GFT estimates (overestimating the influenza activity by 130% in relative terms: 10.5% of ILI as opposed to the actual 4.5%). Our methodology has strong predictive power (Pearson correlation of 0.972; a root mean square error [RMSE] of 0.2829%) during the prediction period starting in the week of 12 May 2012 and ending in the last week of November 2013. Although GFT has a very high Pearson correlation (0.9499) during this same time period, it clearly fails to produce reliable estimates for the peak of the 2012–2013 influenza season. This mismatch is better captured by the RMSE, which shows that GFT estimates are on average off by 1.4% of the national population (ie, almost 5 times larger than our RMSE). In Figure 2 we present a heatmap representing the relevance of each search term in predicting influenza activity as a function of time, during the validation time period. The term Tamiflu is the strongest predictor, whereas sinusitis, influenza, H1N1, and coronavirus display relevance as predictors during different time periods.
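The accuracy measures quoted above can be reproduced with a few lines of Python. The sketch below uses only the figures stated in the text (5.6%, 10.5%, 4.5%, 1.4%, 0.2829%) plus a toy series, invented for illustration, showing why a high Pearson correlation can coexist with a large RMSE. The function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rmse(estimates, observed):
    """Root mean square error between estimates and observations."""
    return sqrt(sum((e - o) ** 2 for e, o in zip(estimates, observed)) / len(observed))

def relative_overestimate(estimate, actual):
    """Overestimation expressed as a fraction of the actual value."""
    return (estimate - actual) / actual

# Figures quoted in the text for the second week of January 2013.
uptodate_peak_error = relative_overestimate(5.6, 4.5)   # "approximately 25%"
gft_peak_error = relative_overestimate(10.5, 4.5)       # "130%"
rmse_ratio = 1.4 / 0.2829                               # "almost 5 times larger"

# Toy series, invented for illustration: doubling every value keeps the
# correlation perfect while the RMSE becomes large.
toy_observed = [1.0, 2.0, 4.5, 2.0, 1.0]
toy_inflated = [2.0 * v for v in toy_observed]
toy_correlation = pearson(toy_inflated, toy_observed)
toy_rmse = rmse(toy_inflated, toy_observed)
```

Correlation is insensitive to a constant scaling of the estimates, so it can stay high even when the peak is badly overestimated; RMSE is measured in the same units as the ILI percentage and penalizes that overshoot directly.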

Figure 2.

Heatmap representing the relevance of each search term in predicting influenza activity as a function of time (in weeks, starting in May 2012). Clinicians’ Tamiflu search activity is highly correlated with Centers for Disease Control and Prevention–reported influenza-like illness and thus is found to be the strongest predictor by our algorithm. Sinusitis, influenza, H1N1, and coronavirus display significant relevance as predictors during different time periods.

DISCUSSION

Our findings demonstrate that combining a robust dynamic methodology and subject-matter experts' search activity more accurately predicts influenza activity than the well-established Internet-based tool Google Flu Trends. Specifically, the model presented here has numerous strengths compared to GFT. First, the model does not require expert supervision to adjust the search terms over the course of the influenza season. Our approach can also accommodate and identify changes in clinicians’ selection of search terms over time while retaining the model's predictive power, as demonstrated in Figure 2. Not only does this strength address evolving medical vocabulary, it also avoids “model drift”: static models typically match the training data well, but as time progresses their predictions may drift farther and farther from truth (as seen in Cook et al [10] with GFT). The success of our approach suggests that low volumes of queries (on the order of hundreds to tens of thousands) in relevant subject-matter experts' databases, such as UpToDate, provide a promising way to identify meaningful signals to track influenza activity. This motivates future research aimed at testing the accuracy of our methodology at state and city levels, and potentially in the prediction of other diseases. Moreover, our findings in combination with those shown in Jormanainen et al [16] suggest that data acquired from specialized databases may have an improved signal-to-noise ratio and may be less likely to be impacted by public disruption resulting from anxiety or media reports on increased morbidity and mortality during (novel) outbreaks of influenza. Limitations in this data source include those inherent in most novel data sources advanced for monitoring infectious diseases. Although timely, these data sources lack the specificity observed in traditional surveillance systems, which rely on hierarchical reporting procedures. 
These data streams therefore supplement traditional disease surveillance provided by organizations such as the CDC. Finally, UpToDate data is not publicly available and thus not ready to be used as an alternative disease detection sentinel.

CONCLUSIONS

In this study, we demonstrate that search queries from the UpToDate database in conjunction with a dynamic multivariate methodology can be successfully utilized to obtain real-time estimates of influenza incidence in the United States before the release of official reports. Clinicians can use outcomes from the model to monitor estimated levels of influenza in the United States. We also discuss the potential usefulness and limitations of digital data sources for infectious disease surveillance based on search query data [5, 7, 8, 24–28]. Future work may include analysis of smaller geographic units.

21 in total

1. Data assimilation and its applications.

Authors: B Wang; X Zou; J Zhu
Journal: Proc Natl Acad Sci U S A Date: 2000-10-10 Impact factor: 11.205

2. How doctors make use of online, point-of-care clinical decision support systems: a case study of UpToDate©.

Authors: John Addison; Jo Whitcombe; Steven William Glover
Journal: Health Info Libr J Date: 2012-10-15

3. Evaluation of ProMED-mail as an electronic early warning system for emerging animal diseases: 1996 to 2004.

Authors: Peter Cowen; Tam Garland; Martin E Hugh-Jones; Arnon Shimshony; Stuart Handysides; Donald Kaye; Lawrence C Madoff; Marjorie P Pollack; Jack Woodall
Journal: J Am Vet Med Assoc Date: 2006-10-01 Impact factor: 1.936

4. What can digital disease detection learn from (an external revision to) Google Flu Trends?

Authors: Mauricio Santillana; D Wendong Zhang; Benjamin M Althouse; John W Ayers
Journal: Am J Prev Med Date: 2014-07-02 Impact factor: 5.043

5. Physicians' database searches as a tool for early detection of epidemics.

Authors: V Jormanainen; J Jousimaa; I Kunnamo; P Ruutu
Journal: Emerg Infect Dis Date: 2001 May-Jun Impact factor: 6.883

6. Detecting influenza epidemics using search engine query data.

Authors: Jeremy Ginsberg; Matthew H Mohebbi; Rajan S Patel; Lynnette Brammer; Mark S Smolinski; Larry Brilliant
Journal: Nature Date: 2009-02-19 Impact factor: 49.962

7. A new approach to monitoring dengue activity.

Authors: Lawrence C Madoff; David N Fisman; Taha Kass-Hout
Journal: PLoS Negl Trop Dis Date: 2011-05-31

8. Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic.

Authors: Samantha Cook; Corrie Conrad; Ashley L Fowlkes; Matthew H Mohebbi
Journal: PLoS One Date: 2011-08-19 Impact factor: 3.240

9. Influenza forecasting with Google Flu Trends.

Authors: Andrea Freyer Dugas; Mehdi Jalalpour; Yulia Gel; Scott Levin; Fred Torcaso; Takeru Igusa; Richard E Rothman
Journal: PLoS One Date: 2013-02-14 Impact factor: 3.240

10. Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales.

Authors: Donald R Olson; Kevin J Konty; Marc Paladini; Cecile Viboud; Lone Simonsen
Journal: PLoS Comput Biol Date: 2013-10-17 Impact factor: 4.475
