Omar Enzo Santangelo¹, Sandro Provenzano², Dimple Grigis³, Domiziana Giordano⁴, Francesco Armetta⁵, Alberto Firenze⁶
1. Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties "G. D'Alessandro", University of Palermo, Palermo, Italy. omarenzosantangelo@hotmail.it
2. Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties "G. D'Alessandro", University of Palermo, Palermo, Italy. provenzanosandro@hotmail.it
3. University of Bergamo, Bergamo, Italy. dimplyg1@gmail.com
4. Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties "G. D'Alessandro", University of Palermo, Palermo, Italy. domiziana.giordano@gmail.com
5. Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties "G. D'Alessandro", University of Palermo, Palermo, Italy. francesco.armetta03@gmail.com
6. Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties "G. D'Alessandro", University of Palermo, Palermo, Italy. alberto.firenze@unipa.it
Abstract
INTRODUCTION: Cases of measles are increasing in some European countries. The aim of this study is to assess the correlation between Google Trends and Wikipedia searches and the number of cases actually notified. MATERIALS AND METHODS: The data on Internet searches were obtained from Google Trends and Wikipedia. The reported cases of measles were selected from January 2013 until December 2018 for Google Trends and from July 2015 until December 2018 for Wikipedia. We selected data from four European countries: Italy, France, Germany and Romania. The data extracted from Wikipedia and Google Trends were shifted in time (lag), one month into the future and one month into the past. Cross-correlation results were obtained as product-moment correlations between the two time series. The statistical analyses were performed using the Pearson correlation coefficient (Wikipedia) or Spearman's rank correlation coefficient (Google Trends). RESULTS: A temporal correlation was observed between the ECDC bulletin and Wikipedia search trends. For Wikipedia the strongest correlation was at a lag of +1 for rougeole (r=0.9006) and masern (r=0.7023) and at lag 0 for morbillo (r=0.8892) and rujeola (r=0.5462); for Google Trends the strongest correlation was at lag 0 for rougeole (rho=0.7398), symptômes rougeole (rho=0.3399), masern (rho=0.6484), sintomi morbillo (rho=0.6029), rujeola (rho=0.7209) and simptome rujeola (rho=0.5297), and at lag -1 for masern symptome (rho=0.4536) and morbillo (rho=0.5804). CONCLUSIONS: Google and Wikipedia could play an important role in surveillance, although these tools need to be combined with traditional surveillance systems.
Cases of measles are increasing in some European countries, and large outbreaks with fatalities are ongoing in countries that had previously eliminated or interrupted endemic transmission (1).
Internet-based surveillance systems offer a novel and developing means of monitoring conditions of public health concern, including emerging infectious diseases (2).
The Google Trends database is searchable by term, geography and time, with a one-week sampling rate. Google Trends allows a user to compare up to five terms or topics simultaneously, and results are displayed as a set of time series. Google Trends normalizes the search data: the day with the most searches receives a reference value of 100, whereas the day with the fewest searches receives a reference value of 0. The standardized data are then presented by Google Trends as "relative search volume" (RSV), an "Interest Index" that can take a value between 0 and 100 based on the proportion to all searches on all terms or topics (3).
Various authors have shown, for different diseases, an association between Google Trends data and the data of official surveillance systems in various countries, concluding that these data can help to monitor and predict infectious diseases (2,4).
The objective of the study is to evaluate, through two comparative studies, the time correlation between Google Trends, Wikipedia Trends and the conventional surveillance data on measles infection cases reported in the bulletin of the European Centre for Disease Prevention and Control (ECDC).
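The 0-100 scaling behind the relative search volume can be illustrated with a minimal sketch. It assumes that each period's share of all searches is rescaled so that the peak share equals 100; Google does not publish raw counts, so the counts below and the function name `relative_search_volume` are hypothetical.

```python
def relative_search_volume(term_counts, total_counts):
    """Rescale per-period search shares so that the peak share equals 100.

    term_counts: searches for the term in each period (hypothetical data)
    total_counts: all searches in the same periods (hypothetical data)
    """
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        # No searches at all: every period gets the minimum value.
        return [0 for _ in shares]
    return [round(100 * share / peak) for share in shares]


# Invented counts for four periods, for illustration only.
term = [50, 200, 100, 0]
total = [10_000, 10_000, 10_000, 10_000]
rsv = relative_search_volume(term, total)
print(rsv)
```

Because the values are relative to the peak within the chosen time frame and geography, RSV series downloaded for different time frames or countries are not directly comparable in absolute terms.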
Materials and methods
A cross-sectional study design was used. Every month the ECDC issues a bulletin with the measles cases reported in European nations in the previous months (5).
We selected data from four European countries: Italy, France, Germany and Romania.
Wikipedia Trends (6) shows how many times a specific page is viewed by users; the data were extracted and aggregated on a monthly basis. We extracted the number of monthly views by users from 1 July 2015 to 31 December 2018 of the pages morbillo (Italian term for measles), rougeole (French term for measles), masern (German term for measles) and rujeola (Romanian term for measles).
From Google Trends (3), on June 10, 2019, data were obtained in the "Health" category using the Italian, French, German and Romanian search terms morbillo (Italian), rougeole (French), masern (German) and rujeola (Romanian), which mean "measles" in English, and sintomi morbillo (Italian), symptômes rougeole (French), masern symptome (German) and simptome rujeola (Romanian), which mean "measles symptoms" in English, for the time frame from 1 January 2013 to 31 December 2018; the data were aggregated by month.
The files were downloaded in ".CSV" format.
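The monthly aggregation of page views described above can be sketched as follows. This is only an illustration: the daily figures are invented, and the function name `monthly_totals` is hypothetical rather than part of any tool used in the study.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_views):
    """Sum (date, views) pairs into a {(year, month): total_views} dict."""
    totals = defaultdict(int)
    for day, views in daily_views:
        totals[(day.year, day.month)] += views
    # Return the months in chronological order.
    return dict(sorted(totals.items()))


# Invented daily page views spanning a month boundary.
sample = [
    (date(2015, 7, 30), 120),
    (date(2015, 7, 31), 95),
    (date(2015, 8, 1), 110),
    (date(2015, 8, 2), 130),
]
monthly = monthly_totals(sample)
print(monthly)
```

Summing suits raw view counts; for an index such as RSV a monthly mean may be more appropriate, and the source does not state which aggregation was applied to the Google Trends series.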
Google Trends provides a relative search volume (RSV), which is computed as the percentage of queries concerning a particular term for a specific location and time period, where 100 is the maximum value and 0 is the minimum value.
We then created two databases:
- a monthly database (MDW) with the measles cases reported in the ECDC bulletin and the Wikipedia Trends data from July 2015 to December 2018;
- a monthly database (MDG) with the measles cases reported in the ECDC bulletin and the Google Trends data from January 2013 to December 2018.
The data extracted from Wikipedia and Google Trends were shifted in time (lag), one month into the future and one month into the past.
Cross-correlation results were obtained as product-moment correlations between the two time series. The advantage of using cross-correlations is that they account for time dependence between two time-series variables.
Statistical analyses were performed using the Pearson correlation coefficient (r) for the "MDW" database and Spearman's rank correlation coefficient (rho) for the "MDG" database. The statistical significance level for the analyses was 0.05. The data were analyzed using the STATA statistical software, version 14 (7).
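The lagged correlation analysis can be sketched in a few lines of plain Python. The study itself used STATA, so this is an illustrative re-implementation, not the authors' code. The sign convention is an assumption chosen to match the interpretation given in the Discussion: at lag +1, searches in month t are paired with cases in month t+1. The helper names and the toy series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def lagged_pairs(search, cases, lag):
    """Pair search[t] with cases[t + lag] (assumed sign convention)."""
    if lag >= 0:
        return search[:len(search) - lag], cases[lag:]
    return search[-lag:], cases[:lag]


# Toy series: cases repeat the search series one month later.
search = [1, 3, 2, 5, 4, 6, 8, 7]
cases = [0] + search[:-1]
for lag in (-1, 0, 1):
    s, c = lagged_pairs(search, cases, lag)
    print(lag, len(s), round(pearson(s, c), 4), round(spearman(s, c), 4))
```

In the toy series the correlation is perfect at lag +1 by construction, which is the pattern that would indicate searches anticipating notified cases by one month. Shifting a series by one month drops one pair, so the lagged correlations are computed on one observation fewer than the unshifted series.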
Results
The raw data for Wikipedia Trends are shown in Figure 1. A temporal correlation was observed between the ECDC bulletin and Wikipedia search trends. In the MDW database, the strongest correlation is at a lag of +1 for rougeole (r=0.9006) and masern (r=0.7023) and at lag 0 for morbillo (r=0.8892) and rujeola (r=0.5462) (Table 1). Google Trends Internet search data showed the strongest correlation at lag 0 for rougeole (rho=0.7398), symptômes rougeole (rho=0.3399), masern (rho=0.6484), sintomi morbillo (rho=0.6029), rujeola (rho=0.7209) and simptome rujeola (rho=0.5297), and at lag -1 for masern symptome (rho=0.4536) and morbillo (rho=0.5804) (Table 2).
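Identifying the lag with the strongest correlation for each term is a simple maximum over the three coefficients. The sketch below uses the Wikipedia coefficients reported in Table 1; the variable and function names are illustrative.

```python
# Pearson coefficients from Table 1, keyed by lag in months.
table1 = {
    "rougeole": {-1: 0.5803, 0: 0.8278, 1: 0.9006},
    "masern": {-1: 0.3258, 0: 0.6400, 1: 0.7023},
    "morbillo": {-1: 0.8085, 0: 0.8892, 1: 0.6840},
    "rujeola": {-1: 0.5101, 0: 0.5462, 1: 0.5390},
}


def strongest_lag(coefficients_by_lag):
    """Return the lag whose correlation coefficient is largest."""
    return max(coefficients_by_lag, key=coefficients_by_lag.get)


best = {term: strongest_lag(by_lag) for term, by_lag in table1.items()}
for term, lag in best.items():
    print(f"{term}: lag {lag:+d} (r={table1[term][lag]:.4f})")
```

For rujeola the three coefficients differ by less than 0.04, so the choice of lag 0 as the strongest is much less clear-cut than for rougeole or morbillo.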
Table 1.
Time series bi-directional cross-correlation coefficients for 1 month displaying relationships between Wikipedia Trends and cases reported by the ECDC. The Pearson correlation coefficient was used.
Wikipedia Trends term | Lag -1 (42 observations) | Lag 0 (42 observations) | Lag +1 (41 observations)
Rougeole (France)     | 0.5803*                  | 0.8278*                 | 0.9006*
Masern (Germany)      | 0.3258**                 | 0.6400*                 | 0.7023*
Morbillo (Italy)      | 0.8085*                  | 0.8892*                 | 0.6840*
Rujeola (Romania)     | 0.5101*                  | 0.5462*                 | 0.5390*
Lag is in months compared to cases reported by the ECDC.
*p-value<0.001 / **p-value<0.05
Table 2.
Time series bi-directional cross-correlation coefficients for 1 month displaying relationships between Google Trends and cases reported by the ECDC. The strongest correlation for each term is marked with †. Spearman's rank correlation coefficient was used.
Country | Google Trends term | Lag -1   | Lag 0    | Lag +1
France  | rougeole           | 0.6982*  | 0.7398*† | 0.6726*
France  | symptômes rougeole | 0.2919** | 0.3399**† | 0.2734**
Germany | masern             | 0.6354*  | 0.6484*† | 0.5033*
Germany | masern symptome    | 0.4536*† | 0.4424*  | 0.4441*
Italy   | morbillo           | 0.5804*† | 0.5398*  | 0.4515*
Italy   | sintomi morbillo   | 0.5787*  | 0.6029*† | 0.5288*
Romania | rujeola            | 0.6908*  | 0.7209*† | 0.6869*
Romania | simptome rujeola   | 0.4829*  | 0.5297*† | 0.4434*
Lag is in months compared to cases reported by the ECDC.
*p-value<0.001 / **p-value<0.05
Discussion and Conclusions
The results for months at lag 0 showed that the peaks of the curves for France and Germany anticipate by about one month the peaks of the curve of cases notified by the ECDC, while the peaks of the curve for Italy can be superimposed on the ECDC curve. Table 1 shows the time series bi-directional cross-correlation coefficients for 1 month between Wikipedia Trends and cases reported by the ECDC. For France and Germany the maximum correlation between ECDC and Wikipedia data was observed at lag +1. This could mean that searches for the selected terms on Wikipedia anticipate ECDC notifications by about a month. For Italy and Romania the highest correlation occurs at lag 0, so searches for the terms on Wikipedia occur at about the same time as the cases notified by the ECDC.
In the Google Trends data (Table 2), medium or strong correlations emerge mainly at lag 0. We attribute this to the fact that people search for the terms in Table 2 while the outbreak is in progress, so the number of searches is directly connected to the number of ongoing measles cases. More specific information could be obtained if the ECDC bulletin were weekly; nevertheless, the monthly lags are still large enough to plan a possible response to an epidemic. This type of analysis has already been carried out in other studies (8, 9).
With regard to the limits of the study, it should be noted that the media could influence the population's online searches. A peak in searches for measles terms can have several causes, such as an increase in the number of cases in the community or increased media attention (10).
A further limitation concerns Wikipedia: whereas Google Trends (3) allows the data to be separated at regional level, Wikipedia Trends does not provide data at that level, so a possible epidemic cannot be located geographically. In addition, temporal and geographical changes are not well documented, which may affect the outcome of the research and the results of our study (10). Therefore, the interpretation and generalization of the results require caution.
In conclusion, the results of this study suggest that Google Trends and Wikipedia-based surveillance systems have a potential role as a public health tool. Today they can be a valuable complement to traditional surveillance systems, and in the future they could be further validated and consolidated.