Literature DB >> 29181246

Methods Using Social Media and Search Queries to Predict Infectious Disease Outbreaks.

Dong-Woo Seo1, Soo-Yong Shin2.   

Abstract

OBJECTIVES: For earlier detection of infectious disease outbreaks, a digital syndromic surveillance system based on search queries or social media should be utilized. By using real-time data sources, a digital syndromic surveillance system can overcome the limitation of time-delay in traditional surveillance systems. Here, we introduce an approach to develop such a digital surveillance system.
METHODS: We first explain how the statistics data of infectious diseases, such as influenza and Middle East Respiratory Syndrome (MERS) in Korea, can be collected for reference data. Then we also explain how search engine queries can be retrieved from Google Trends. Finally, we describe the implementation of the prediction model using lagged correlation, which can be calculated by the statistical packages, i.e., SPSS (Statistical Package for the Social Sciences).
RESULTS: Lag correlation analyses demonstrated that search engine data/Twitter have a significant temporal relationship with influenza and MERS data. Therefore, the proposed digital surveillance system can be used to predict infectious disease outbreaks earlier.
CONCLUSIONS: This prediction method could be the core engine for implementing a (near-) real-time digital surveillance system. A digital surveillance system that uses Internet resources has enormous potential to monitor disease outbreaks in the early phase.

Entities:  

Keywords:  Digital Syndromic Surveillance System; Disease Outbreak; Search Engine; Social Media

Year:  2017        PMID: 29181246      PMCID: PMC5688036          DOI: 10.4258/hir.2017.23.4.343

Source DB:  PubMed          Journal:  Healthc Inform Res        ISSN: 2093-3681


I. Introduction

Emerging infectious diseases, such as Severe Acute Respiratory Syndrome (SARS) in 2002, the H1N1 pandemic in 2009, and Middle East Respiratory Syndrome (MERS) in 2015, have highlighted the necessity for a syndromic surveillance system, which can play a significant role in detecting the beginning of an infectious disease outbreak [123]. However, traditional surveillance systems mainly depend on case reports, such as influenza-like illness (ILI) reports, which have time-delays in reporting and case confirmation. To enable the earlier detection of infectious disease outbreaks, a syndromic surveillance system should utilize real-time or near-real-time data, i.e., school or work absenteeism, over the counter medication sales, or the volume of Internetbased health inquiries [4567]. Among diverse alternative data sources, search queries, social media, and website visits have proven potential for digital surveillance systems [3578910]. In Korea, there is a national health alert system operated by the National Health Insurance Service which uses data from social media and blogs [11]. In this paper, we briefly introduce how social media and search queries can be used to predict infectious disease outbreaks. This prediction method could be the core engine for implementing a (near-) real-time digital surveillance system.

II. Case Description

1. Data Collection

To implement a prediction model for the digital surveillance system, the statistics of an infectious disease and digital data, such as search engine queries or social media data, should be collected. Disease statistics are used as target data, and digital data is used as input for the prediction model.

1) Collection of disease statistics

ILI data in Korea can be gathered from Korea Centers for Disease Control and Prevention (KCDC) [12]. Since the ILI reports of the KCDC are published in word processor formats, such as HWP files in Korean and DOC files in English, the data should be manually curated. Figure 1 shows an example ILI report published by the KCDC. The ratio of ILI in week 28 was 6.0.
Figure 1

ILI report example from week 28, 2017 (July 9, 2017–July 15, 2017). The ratio is the number of outpatients divided by 1,000.

The MERS statistics of Korea in 2015 were released via the official governmental MERS statistics site. However, since that site was not accessible at the time of this study, the statistics could be alternately collected from Wikipedia [13]. When curating disease statistics, researchers should be cautious about time intervals. ILI data is weekly data, whereas MERS statistics are reported as daily data.

2) Choosing keywords

To collect digital data from search engine queries or Twitter, the keywords that relate to influenza or MERS need to be determined first. This step is the most important step for the performance of disease prediction model. To choose the keywords, both laypersons' opinions and experts' opinion are taken into consideration, since search queries and social media data are generated by laypersons. In our previous works regarding influenza prediction, we conducted a survey by quota sampling based on sex and age to choose keywords [89]. Moreover, we also included keywords that were chosen by physicians, such as fever, cough, and sore throat for influenza prediction, since these keywords are related to the definitions of ILI. For MERS prediction, MERS and 메르스 (MERS in Korean) were chosen first. Then the top two related combined queries in Korean were added [3]. Moreover, selected keywords were translated into English or Korean since people can perform search queries in both languages. The chosen keywords are shown in Table 1.
Table 1

The chosen keywords for influenza and MERS

MERS: Middle East Respiratory Syndrome, PCR: polymerase chain reaction.

aOnly Korean keyword was used, bonly English keywords were used.

3) Collection of data from search engines

The daily or weekly trends of the keywords for web search queries can be obtained from Google Trends [14] or Naver DataLab [15]. The newly renovated Naver DataLab only offers weekly trends, and it requires several steps to download the output. Therefore, we only explain how to use Google Trends. Google Trends offers various data patterns starting from the previous minute's data to monthly data based on the selected time range. For example, Figure 2 shows the weekly trend of searches for “influenza” of Google Trends between September 9, 2007, and September 8, 2012. Trend data associated with the predefined keywords were retrieved by setting the location parameter to “South Korea” and setting the time parameters. The time parameters were based on previous disease outbreak seasons, for example, “May 2015 to Jun 2015” for MERS in Korea. May 11, 2015 was the symptom onset day of the first laboratory-confirmed patient. The results can be downloaded in a CSV format by clicking the arrow located in upper right in the graph. The downloaded CSV file consists of two columns. The first column represents the week by indicating the starting date of the week and the second column is the trend data. The researcher should collect the necessary trend data for each keyword. Table 2 shows example influenza statistics from ILI reports (Figure 1) with Google Trends data for the following five keywords: “New influenza,” “New flu,” “Fever,” “Tamiflu,” and “Flu” (Figure 2). As shown in Table 2, there is a timedelay pattern between the influenza statistics and Google Trends data.
Figure 2

Trends of influenza search queries, according to Google Trends, between September 9, 2007 and September 8, 2012.

Table 2

Example of influenza data and Google Trends data

Virological and ILI data were extracted manually from ILI reports. The trend data of representative keywords were extracted from Google Trends. Data column represents the week data with starting date.

KCDC: Korea Centers for Disease Control & Prevention, ILI: influenza-like illness.

The important fact is that the current trend data obtained from Google Trends or Naver DataLab are available as normalized values, not the absolute numbers of searches. According to Google Trends, “Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. Likewise, a score of 0 means the term was less than 1% as popular as the peak.” The previous site of Naver DataLab, called Naver Trends offered the separate data trends for mobile and desktop searches [16]. However, Naver DataLab does not distinguish between mobile and desktop trends as Google Trends does. In our previous work, we collaborated with Daum to collect curated data [89]. People search web pages using one or multiple words at a time. To reflect this behavior, combinations of the chosen keywords should be considered, for example, “H1N1”, “H1N1 Treatment”, “H1N1 Symptom”, “Influenza”, “Influenza Treatment”, and “Influenza Symptom”.

4) Collection of data from Twitter

Among diverse social media platforms, Twitter has been used as a data sources for digital surveillance systems [17]. In our previous work, the number of tweets containing one of the predefined keywords was collected through Topsy, which is a certified partner of Twitter that offers social searching and social analytics [3]. However, Topsy closed a few years ago. There are alternative services, such as GNIP [18] or Talkwalker [19]; however, researchers would be required to purchase the relevant data from them. Therefore, there is currently a barrier to using Twitter data.

2. Data Analysis

Spearman correlation analyses can be used to examine the correlations between search engine data and disease data. Lag correlation analyses can be used to assess the temporal relationships between these sets of data for up to userdefined days or weeks. Significance was set at p < 0.05. In our work, we used the SPSS package to obtain statistical values, and the proposed method is summarized as follows. The KCDC data values are placed tidily in a column (Table 2). The search engine query data values to be compare are placed next to it. The data must be arranged in rows so that each row represents a specific time period. The correlation value obtained in this state means present (0 week lag) [8]. To get the correlation coefficient of preceding or lagging week, the column containing a search engine query data value is moved in the desired direction and correlation analysis is performed. In this case, the data that differ by the number of weeks to be compare is placed in the same row. To see the changes in correlation coefficients over time, correlation coefficients in subsequent epidemiological intervals are calculated. These correlation coefficients can be used to validate the developed prediction model. If necessary, subgroup analyses for the period are conducted along the same lines. For example, we performed a subgroup analysis focusing on the acceleration and deceleration period of MERS (June 3, 2015–June 26, 2015) adopting the CDC interval [3].

III. Discussion

The proposed digital surveillance system which uses Internet resources has enormous potential to monitor disease outbreaks in the early phase; however, this approach has some limitations as well. First, in our work, it was difficult to choose keywords although they have a considerable effect on the performance of a prediction model. More importantly, keywords should be changed periodically. For example, before 2015 only experts were aware of MERS. However, most Koreans know about MERS nowadays. Since people continuously learn new terminology and change the search keywords they use, keywords should be updated regularly to maintain prediction performance [9]. Second, as in the case of Google Flu, this system can fail to predict disease outbreaks correctly [20]. Therefore, the proposed digital surveillance system should be used with caution or as a complementary method.
  13 in total

1.  What is syndromic surveillance?

Authors:  Kelly J Henning
Journal:  MMWR Suppl       Date:  2004-09-24

2.  Infodemiology: tracking flu-related searches on the web for syndromic surveillance.

Authors:  Gunther Eysenbach
Journal:  AMIA Annu Symp Proc       Date:  2006

3.  Big data. The parable of Google Flu: traps in big data analysis.

Authors:  David Lazer; Ryan Kennedy; Gary King; Alessandro Vespignani
Journal:  Science       Date:  2014-03-14       Impact factor: 47.728

4.  Emergence of a novel swine-origin influenza A (H1N1) virus in humans.

Authors:  Fatimah S Dawood; Seema Jain; Lyn Finelli; Michael W Shaw; Stephen Lindstrom; Rebecca J Garten; Larisa V Gubareva; Xiyan Xu; Carolyn B Bridges; Timothy M Uyeki
Journal:  N Engl J Med       Date:  2009-05-07       Impact factor: 91.245

5.  Detecting influenza epidemics using search engine query data.

Authors:  Jeremy Ginsberg; Matthew H Mohebbi; Rajan S Patel; Lynnette Brammer; Mark S Smolinski; Larry Brilliant
Journal:  Nature       Date:  2009-02-19       Impact factor: 49.962

6.  The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic.

Authors:  Alessio Signorini; Alberto Maria Segre; Philip M Polgreen
Journal:  PLoS One       Date:  2011-05-04       Impact factor: 3.240

7.  Correlation between national influenza surveillance data and google trends in South Korea.

Authors:  Sungjin Cho; Chang Hwan Sohn; Min Woo Jo; Soo-Yong Shin; Jae Ho Lee; Seoung Mok Ryoo; Won Young Kim; Dong-Woo Seo
Journal:  PLoS One       Date:  2013-12-05       Impact factor: 3.240

8.  High correlation of Middle East respiratory syndrome spread with Google search and Twitter trends in Korea.

Authors:  Soo-Yong Shin; Dong-Woo Seo; Jisun An; Haewoon Kwak; Sung-Han Kim; Jin Gwack; Min-Woo Jo
Journal:  Sci Rep       Date:  2016-09-06       Impact factor: 4.379

9.  Correlation between National Influenza Surveillance Data and Search Queries from Mobile Devices and Desktops in South Korea.

Authors:  Soo-Yong Shin; Taerim Kim; Dong-Woo Seo; Chang Hwan Sohn; Sung-Hoon Kim; Seung Mok Ryoo; Yoon-Seon Lee; Jae Ho Lee; Won Young Kim; Kyoung Soo Lim
Journal:  PLoS One       Date:  2016-07-08       Impact factor: 3.240

10.  Web queries as a source for syndromic surveillance.

Authors:  Anette Hulth; Gustaf Rydevik; Annika Linde
Journal:  PLoS One       Date:  2009-02-06       Impact factor: 3.240

View more
  11 in total

1.  Prediction of COVID-19 Outbreaks Using Google Trends in India: A Retrospective Analysis.

Authors:  U Venkatesh; Periyasamy Aravind Gandhi
Journal:  Healthc Inform Res       Date:  2020-07-31

2.  Evaluation of Scientific Quality of YouTube Video Content Related to Umbilical Hernia.

Authors:  Guner Cakmak
Journal:  Cureus       Date:  2021-04-25

Review 3.  Social media based surveillance systems for healthcare using machine learning: A systematic review.

Authors:  Aakansha Gupta; Rahul Katarya
Journal:  J Biomed Inform       Date:  2020-07-02       Impact factor: 6.317

4.  Modification of the Conventional Influenza Epidemic Models Using Environmental Parameters in Iran.

Authors:  Ahmad Naserpor; Sharareh R Niakan Kalhori; Marjan Ghazisaeedi; Rasoul Azizi; Mohammad Hosseini Ravandi; Sajad Sharafie
Journal:  Healthc Inform Res       Date:  2019-01-31

5.  Correlation between Google Trends on dengue fever and national surveillance report in Indonesia.

Authors:  Atina Husnayain; Anis Fuad; Lutfan Lazuardi
Journal:  Glob Health Action       Date:  2019       Impact factor: 2.640

Review 6.  Vaccines Meet Big Data: State-of-the-Art and Future Prospects. From the Classical 3Is ("Isolate-Inactivate-Inject") Vaccinology 1.0 to Vaccinology 3.0, Vaccinomics, and Beyond: A Historical Overview.

Authors:  Nicola Luigi Bragazzi; Vincenza Gianfredi; Milena Villarini; Roberto Rosselli; Ahmed Nasr; Amr Hussein; Mariano Martini; Masoud Behzadifar
Journal:  Front Public Health       Date:  2018-03-05

7.  Forecasting influenza epidemics by integrating internet search queries and traditional surveillance data with the support vector machine regression model in Liaoning, from 2011 to 2015.

Authors:  Feng Liang; Peng Guan; Wei Wu; Desheng Huang
Journal:  PeerJ       Date:  2018-06-25       Impact factor: 2.984

8.  Googling Musculoskeletal-Related Pain and Ranking of Medical Associations' Patient Information Pages: Google Ads Keyword Planner Analysis.

Authors:  Yoichiro Yamaguchi; Deokcheol Lee; Takuya Nagai; Taro Funamoto; Takuya Tajima; Etsuo Chosa
Journal:  J Med Internet Res       Date:  2020-08-14       Impact factor: 5.428

9.  Prediction of Number of Cases of 2019 Novel Coronavirus (COVID-19) Using Social Media Search Index.

Authors:  Lei Qin; Qiang Sun; Yidan Wang; Ke-Fei Wu; Mingchih Chen; Ben-Chang Shia; Szu-Yuan Wu
Journal:  Int J Environ Res Public Health       Date:  2020-03-31       Impact factor: 3.390

Review 10.  Emojis in public health and how they might be used for hand hygiene and infection prevention and control.

Authors:  Nasim Lotfinejad; Reza Assadi; Mohammad Hassan Aelami; Didier Pittet
Journal:  Antimicrob Resist Infect Control       Date:  2020-02-10       Impact factor: 4.887

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.