
Reliability of Google Trends: Analysis of the Limits and Potential of Web Infoveillance During COVID-19 Pandemic and for Future Research.

Alessandro Rovetta

Abstract

Background: Alongside the COVID-19 pandemic, government authorities around the world have had to face a growing infodemic capable of causing serious damage to public health and the economy. In this context, the use of infoveillance tools has become a primary necessity. Objective: The aim of this study is to test the reliability of Google Trends, a widely used infoveillance tool. In particular, the paper focuses on the analysis of relative search volumes (RSVs), quantifying their dependence on the day they are collected.
Methods: RSVs of the query coronavirus + covid during February 1–December 4, 2020 (period 1), and February 20–May 18, 2020 (period 2), were collected daily from Google Trends from December 8 to 27, 2020. The survey covered Italian regions and cities, and countries and cities worldwide. The search category was set to all categories. Each dataset was analyzed to observe any dependence of RSVs on the day they were gathered. To do this, by calling $i$ the country, region, or city under investigation and $j$ the day its RSV was collected, a Gaussian distribution $X_i = X(\sigma_i, \bar{x}_i)$ was used to represent the trend of daily variations of $x_{ij} = \mathrm{RSVs}_{ij}$. When a missing value was revealed (anomaly), the affected country, region, or city was excluded from the analysis. When the anomalies exceeded 20% of the sample size, the whole sample was excluded from the statistical analysis. Pearson and Spearman correlations between RSVs and the number of COVID-19 cases were calculated day by day to highlight any variations related to the day RSVs were collected. Welch's t-test was used to assess the statistical significance of the differences between the average RSVs of the various countries, regions, or cities of a given dataset. Two RSVs were considered statistically confident when $t < 1.5$. A dataset was deemed unreliable if the confident data exceeded 20% (confidence threshold). The percentage increase $\Delta$ was used to quantify the difference between two values.
Results: Google Trends showed an acceptable number of anomalies only for the RSVs of Italian regions (0% in both periods 1 and 2) and countries worldwide (9.7% during period 1 and 10.9% during period 2). However, the correlations between RSVs and COVID-19 cases underwent significant variations even in these two datasets ($\max|\Delta| = +625\%$ for Italian regions, and $\max|\Delta| = +175\%$ for countries worldwide). Furthermore, only RSVs of countries worldwide did not exceed the confidence threshold. Finally, the large number of anomalies registered in Italian and international cities' RSVs made these datasets unusable for any kind of statistical inference.
Conclusion: In the considered timespans, Google Trends proved to be reliable only for surveys concerning RSVs of countries worldwide. Since RSV values showed a high dependence on the day they were gathered, it is essential for future research that authors collect query data for several consecutive days and work with RSV averages instead of daily RSVs, trying to minimize the standard errors until an established confidence threshold is respected. Further research is needed to evaluate the effectiveness of this method.
Copyright © 2021 Rovetta.


Keywords:  COVID-19; google trends; google trends data; google trends data analysis; social science research

Year:  2021        PMID: 34113751      PMCID: PMC8186442          DOI: 10.3389/frma.2021.670226

Source DB:  PubMed          Journal:  Front Res Metr Anal        ISSN: 2504-0537


Introduction

A novel coronavirus was identified in Wuhan (Hubei province, China) in late 2019 (Wu et al., 2020). It was responsible for a severe respiratory disease named COVID-19 by the WHO on February 11, 2020. The virus, formerly reported as 2019-nCoV, was subsequently named SARS-CoV-2. Despite attempts by governments and the scientific community to contain the infection, COVID-19 spread beyond the Chinese borders and was declared a pandemic by the WHO on March 11, 2020 (Cascella et al., 2021). The pandemic has put a strain on the health systems and economies of countries worldwide, causing more than 3 million deaths and forcing governments to implement very restrictive lockdowns (Askitas et al., 2021). In this scenario, fake news and inaccurate information circulated widely on the web, creating severe issues for public health and the economy all over the world (Pennycook et al., 2020; Rovetta and Bhagavathula, 2020; Tagliabue et al., 2020; Tasnim et al., 2020). Dr Tedros Adhanom Ghebreyesus, director of the World Health Organization (WHO), claimed that the battle we are fighting concerns not only the epidemic but also its infodemic (UNS, 2020). Moreover, the WHO itself has launched an international campaign called “Managing the COVID-19 infodemic: Promoting healthy behaviors and mitigating the harm from misinformation and disinformation” to encourage states to counter the spread of misinformation (WHO, 2020). To date, one of the main problems consists of conspiracy news relating to alleged vaccine damage, which can seriously compromise the international strategy for the abatement of SARS-CoV-2 (Tollefson, 2021). Therefore, the demand for new effective and efficient infodemiological methods has never been as pressing as today. In this regard, scientists are increasingly adopting infoveillance tools to monitor the infodemic on websites, social media, and newspapers (Zeraatkar and Ahmadi, 2020).
Numerous research groups have exploited state-of-the-art machine learning to catalog and analyze the large flows of COVID-19-related information circulating on social networks, forums, and online platforms like Twitter, Reddit, Instagram, Facebook, and YouTube (Tsao et al., 2021). Among the most sophisticated approaches, Rustam et al. adopted a wide variety of supervised algorithms such as random forest (RF), XGBoost classifier, support vector classifier (SVC), extra trees classifier (ETC), decision tree (DT), and a long short-term memory (LSTM) deep learning model to analyze the sentiment of COVID-19-related tweets (Rustam et al., 2021). Their results showed that: 1) the extra trees classifier outperformed all other models, achieving a 0.93 accuracy score using the authors’ proposed concatenated feature set; 2) the LSTM achieved low accuracy compared to the machine learning classifiers. Nonetheless, Jelodar et al. implemented a novel natural language processing (NLP) application based on an LSTM model for the same purpose on Reddit posts, obtaining convincing results (Jelodar et al., 2020). Mackey et al. also studied the dissemination of fake and dangerous information on Twitter and Instagram through NLP and deep learning (Mackey et al., 2020). Although this evidence seems contradictory, it merely shows the vast range of unexplored possibilities offered by machine learning for infodemiological aims and, at the same time, that model accuracy depends strongly on the initial conditions. Thus, machine learning has shown excellent effectiveness, but it has limitations (Mohri et al., 2012). As a comparison of the scientific literature highlights, its application needs ad-hoc interventions that cannot always be absorbed into a general methodology. Supervised algorithms require large training datasets to produce inferred functions for mapping new examples; such a procedure consumes time and resources, thus slowing down the infoveillance process considerably.
On the contrary, unsupervised models learn from raw data without any prior knowledge; therefore, results might be less accurate and take more time compared with supervised learning. Moreover, datasets always require appropriate processing before use. In this regard, many authors have preferred to adopt more traditional methods, like multivariate regressions, cross-correlations, time-series analysis, and descriptive statistics (Tsao et al., 2021). The majority of these are now integrated into easy-to-use automatic kits available for Microsoft Excel or similar software (e.g., Real Statistics and Zaiontz, 2021; XLSTAT, 2021), which is a great advantage in terms of operational speed. However, when dealing with platforms such as Twitter, Reddit, Instagram, or Facebook, the collection and analysis of posts is still laborious: it requires either databases already extracted (which limits the power of investigation) or application programming interfaces (APIs), and all datasets must be suitably processed before use (Kim et al., 2020). Therefore, while all of the above methods are essential and powerful for historical data analysis, more immediate and rapid tools are equally necessary for quasi-real-time infoveillance. In particular, Google Trends, an open online infoveillance tool developed by Google™, has been widely used by the scientific community not only for monitoring disinformation but also for making rapid epidemiological predictions on the spread of infectious diseases (Mavragani and Ochoa, 2019). Google Trends quantifies the users’ web interest in a keyword (e.g., “football”) by returning a normalized value ranging from 0 to 100, called relative search volume, proportional to the ratio between the keyword-related queries and the total web queries. The user can also narrow the analysis to specific geographical areas (continents, states, regions, cities, etc.) within a fixed time frame.
In this regard, the quantitative analysis of relative search volumes of pre-selected queries was used for several purposes during COVID-19 pandemic: 1) predicting COVID-19 cases (Ahmad et al., 2020; Ayyoubzadeh et al., 2020; Jimenez et al., 2020; Mavragani and Gkillas, 2020; Sulyok et al., 2020; Venkatesh and Gandhi, 2020; Prasanth et al., 2021), 2) studying the web interest in COVID-19 (Effenberger et al., 2020; Hu et al., 2020; Rovetta and Castaldo, 2020; Springer et al., 2020), 3) studying the adoption of infodemic terms and related consequences (Cinelli et al., 2020; Cuan-Baltazar et al., 2020; Rovetta and Bhagavathula, 2020), 4) studying a full range of users’ psychological-emotional responses (Husnayain et al., 2020; Rovetta and Castaldo, 2020; Zattoni et al., 2020; Brodeur et al., 2021; Zitting et al., 2021), 5) studying the impact of mass media and governmental policies on users’ web searches (Rovetta and Bhagavathula, 2020; Sousa-Pinto et al., 2020; Huynh Dagher et al., 2021), 6) studying the economic-commercial impact (Brodeur et al., 2021; Sotis, 2021), 7) studying the spread of COVID-19 symptoms (Ahmad et al., 2020; Jimenez et al., 2020; Kluger and Scrivener, 2020; Walker et al., 2020), 8) studying other various web interests (Berger et al., 2021; Elsaie and Youssef, 2021). This type of research is mainly based on the search for statistical cross-correlations between users’ web searches related to specific topics, such as symptoms, drugs, therapies, vaccines, number of infected people, number of deaths, anxiety, fear, stress, etc., and the number of disease contagions and deaths officially registered after a certain timespan. However, not all that glitters is gold. First, many of these studies propose conflicting conclusions: specifically, some authors claim that the correlations between COVID-19 cases and web searches are generally spurious as mass media and government agencies’ announcements can influence them. 
Second, this paper shows that Google Trends has some limitations that are often overlooked and which risk heavily biasing and distorting correlation-based analytics. Furthermore, some anomalies in the calculation of relative search volumes (RSVs) could also alter any infodemiological analysis in an unpredictable way. Nonetheless, as shown above, a considerable portion of the academic world continues to rely on this tool to conduct its scientific investigations. This is probably because Google Trends offers a simple and immediate way to obtain clean data (i.e., without complications related to privacy) on the vast majority of users’ web interests all over the world. This efficiency can be decisive in epidemiological and infodemiological evaluation; indeed, although scientists and governments have launched mobile applications and websites with similar purposes (Kondylakis et al., 2020), the percentage of users involved is significantly lower than that of Google. Such services could also select for a more targeted user base, undermining the demand for randomness in the sample extraction (e.g., deniers and conspiracists are automatically excluded from the dataset). In any case, the combined use of both these methods could help us better understand their strengths and limitations and serve as a complete infoveillance approach. Therefore, the aim of this study is to delve into the aforementioned issues, exploring their nature and searching for solutions to circumvent them, thus allowing the scientific community to continue using Google Trends through a more reliable approach.

Methods

To assess the reliability of Google Trends (GT), relative search volumes (RSVs) of a specific query in a fixed period were downloaded on different days so as to reveal any dependence on the date they were collected. According to Google, RSVs are calculated as follows: each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0–100 based on a topic’s proportion to all searches on all topics (Google Support, 2021). In this context, “anomalies” were defined as those countries, regions, or cities whose RSVs appeared only on specific days.
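The normalization quoted above can be sketched in a few lines. This is only an illustration of the two described steps (share of total searches, then rescaling to 0–100); Google does not publish raw counts, its sampling scheme, or its rounding rules, so the function name `relative_search_volumes` and all the counts below are hypothetical.

```python
def relative_search_volumes(keyword_counts, total_counts):
    """Return 0-100 scores proportional to each area's keyword share.

    Hypothetical helper: it mirrors the quoted description only and is
    not Google's actual (unpublished) algorithm.
    """
    if len(keyword_counts) != len(total_counts):
        raise ValueError("both lists must describe the same areas")
    # Step 1: divide by the total searches of each area (relative popularity).
    shares = [k / t if t else 0.0 for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0] * len(shares)
    # Step 2: rescale so that the largest share becomes 100.
    return [round(100 * s / peak) for s in shares]


# Made-up counts for three areas: the second has the most keyword searches
# in absolute terms, but the smallest share of its own total traffic.
rsv = relative_search_volumes([50, 200, 30], [1000, 10000, 300])
print(rsv)
```

In this toy case the area with the largest absolute number of keyword searches receives the lowest score, which is the effect the division by total searches is meant to achieve.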

Data Collection

RSVs of the query coronavirus + covid were collected from two distinct periods: 1 February–4 December, 2020 (period 1), and 20 February–18 May, 2020 (period 2). As shown in previous studies, this query encompasses 80% of COVID-19-related web searches worldwide (Rovetta and Bhagavathula, 2020). For this reason, it was considered well representative of the web interest in COVID-19. Furthermore, an independent verification confirmed these results, highlighting an increasing use of the keyword covid (Supplementary File S1). This final survey was carried out using the rising queries and the associated queries directly provided by Google Trends. Period 1, corresponding to the Italian lockdown, was chosen for GT to provide daily RSVs, while period 2 was chosen for GT to provide weekly RSVs. The survey was carried out on Italian regions and cities, and worldwide countries and cities. All RSVs of periods 1 and 2 were collected daily for a minimum of 7 days and until any anomaly was highlighted; when no anomaly was identified within 15–20 days, the investigation was considered concluded. The data-collection period ranged from 8 to 27 December, 2020. The Google Trends category search parameter was set to all categories. All details are shown in Table 1.
TABLE 1

Google Trends keywords summary.

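Geographical region | Investigation period (2020) | Subregion | Collection period (2020) | Google Trends URL
Italy | February 1–December 4 | Regions | December 8–26 | URL Italy period 1 (Google Trends 1)
Italy | February 1–December 4 | Cities | December 14–26 | URL Italy period 1 (Google Trends 1)
Italy | February 20–May 18 | Regions | December 8–26 | URL Italy period 2 (Google Trends 2)
Italy | February 20–May 18 | Cities | December 14–26 | URL Italy period 2 (Google Trends 2)
World | February 1–December 4 | Regions | December 14–26 | URL world period 1 (Google Trends 3)
World | February 1–December 4 | Cities | December 16–26 | URL world period 1 (Google Trends 3)
World | February 20–May 18 | Regions | December 14–27 | URL world period 2 (Google Trends 4)
World | February 20–May 18 | Cities | December 16–27 | URL world period 2 (Google Trends 4)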

Google Trends 1, coronavirus + covid query in Italian regions and cities during period 1. URL: https://trends.google.it/trends/explore?date=2020-02-01%202020-12-04&geo=IT&q=coronavirus%20%2B%20covid.

Google Trends 2, coronavirus + covid query in Italian regions and cities during period 2. URL:https://trends.google.it/trends/explore?date=2020-02-20%202020-05-18&geo=IT&q=coronavirus%20%2B%20covid.

Google Trends 3, coronavirus + covid query in World countries and cities during period 1. URL: https://trends.google.it/trends/explore?date=2020-02-01%202020-12-04&q=coronavirus%20%2B%20covid.

Google Trends 4, coronavirus + covid query in World countries and cities during period 2. URL: https://trends.google.it/trends/explore?date=2020-02-20%202020-05-18&q=coronavirus%20%2B%20covid.

Data on Italian COVID-19 cases was collected from the Italian Civil Protection Department official dashboard (ICPD, 2020). Data on international COVID-19 cases was collected from the World Health Organization official dashboard (WHO, 2020).
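For readers who want to revisit the same query on their own collection dates, the explore URLs listed above follow a simple pattern that can be rebuilt programmatically. The sketch below only reproduces that pattern with the Python standard library; the helper name `trends_explore_url` is hypothetical, and the snippet does not download any data.

```python
from urllib.parse import quote


def trends_explore_url(query, start, end, geo=None, host="trends.google.it"):
    """Rebuild a Google Trends explore URL of the form listed in Table 1.

    Hypothetical helper: it only mirrors the URL pattern shown in the text.
    """
    date_range = f"{start} {end}"            # e.g. "2020-02-01 2020-12-04"
    params = [f"date={quote(date_range)}"]   # space becomes %20
    if geo:                                  # omitted for worldwide queries
        params.append(f"geo={quote(geo)}")
    params.append(f"q={quote(query)}")       # "+" becomes %2B
    return f"https://{host}/trends/explore?" + "&".join(params)


italy_period_1 = trends_explore_url(
    "coronavirus + covid", "2020-02-01", "2020-12-04", geo="IT"
)
world_period_2 = trends_explore_url("coronavirus + covid", "2020-02-20", "2020-05-18")
print(italy_period_1)
print(world_period_2)
```

Opening the same URL on different days is exactly the repeated-collection design described in this section.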

Statistical Analysis

By calling $i$ the country, region, or city under investigation and $j$ the day its RSV was collected, a Gaussian distribution $X_i = X(\sigma_i, \bar{x}_i)$, where $\sigma_i$ is the standard deviation (also called ) and $\bar{x}_i$ is the mean value of $x_{ij} = \mathrm{RSVs}_{ij}$, was used to represent the trend of $x_{ij}$. To evaluate data normality, the Shapiro-Wilk test was used (Ghasemi and Zahediasl, 2012). The significance threshold was indicatively set at (Amrhein et al., 2017). Data distributions that deviated greatly from normality were marked with an asterisk (*). The impact of daily variations of $x_{ij}$ in $j$ on Pearson (R) and Spearman (r) correlations with COVID-19 total cases was estimated; to do this, it was enough to compute the correlations on different days and calculate their percentage increases $\Delta$. For the adoption of these correlations, standard criteria were exploited (Mukaka, 2012). Welch’s t-test (Kim, 2015) was performed in order to understand whether the differences between the mean RSVs $\bar{x}_i$, extracted from the same geographical area and period but on different days, were significant. A difference between two $\bar{x}_i$ was considered statistically significant when $t > 1.5$. This test was considered appropriate since the mean values, together with their relative 95% confidence intervals, well represented the samples (i.e., the arithmetic mean was sufficiently centered and the confidence interval comprised the clear majority of values). Furthermore, it does not require that the variances be similar. A dataset was deemed unreliable if the confident data exceeded 20% (confidence threshold) for at least one country, region, or city. When anomalies were identified in more than 20% of cases, no investigation on the distributions was conducted.
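The Welch comparison and the confidence threshold can be illustrated with a short sketch. Two assumptions are made here and are not confirmed by the text: the statistic is taken in absolute value, and the "confident" percentage of an area is read as the share of the other areas whose mean RSV gives $t < 1.5$ against it. The three samples are the Bari, Bologna, and Palermo rows of Table 2; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic (does not assume equal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return abs(mean(a) - mean(b)) / se


def confident_share(samples, area, threshold=1.5):
    """Share of the other areas that are 'confident' (t < threshold) with `area`.

    Assumed reading of the confidence threshold, not the author's code.
    """
    others = [name for name in samples if name != area]
    hits = sum(welch_t(samples[area], samples[o]) < threshold for o in others)
    return hits / len(others)


# RSVs collected on 13 consecutive days (rows taken from Table 2).
samples = {
    "Bari":    [94, 98, 94, 94, 97, 95, 94, 96, 94, 95, 91, 91, 94],
    "Bologna": [94, 92, 96, 94, 96, 95, 95, 96, 95, 94, 90, 91, 94],
    "Palermo": [79, 84, 79, 80, 81, 80, 78, 78, 80, 78, 79, 79, 81],
}
t_close = welch_t(samples["Bari"], samples["Bologna"])  # means about 0.4 apart
t_far = welch_t(samples["Bari"], samples["Palermo"])    # means about 15 apart
share = confident_share(samples, "Bari")
print(round(t_close, 2), round(t_far, 2), share)
```

With only three areas the share is purely illustrative; under this reading, a dataset would be flagged as unreliable as soon as one area's share exceeds 20%.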

Results

Italian Regions’ Web Interest During Period 1 (1 February–4 December, 2020)

As shown in Figure 1, there have been strong relationships between RSVs and the dates they were collected: in fact, the regional ranking of web interest underwent several unpredictable variations even as regards the peak values RSV = 100.
FIGURE 1

Dependence of Google Trends relative search volumes (RSVs) on collection date: Italian regions’ web interest in the query coronavirus + covid during period 1 (1 February–4 December, 2020). X-axis: dates on which the RSVs were collected. Y-axis: Google Trends RSV. * = Regions that showed a non-normal trend over time.

The daily standard deviation of the sample ranged in the interval [6.6, 7.6], making all values in the central band mutually confident. Because of that, any correlations between RSVs and COVID-19 cases (or related statistics) could not be meaningful if merely based on a single-day dataset. Furthermore, even supposing no variance in daily samples, the correlation between the number of COVID-19 cases and RSVs went from on December 8 to on the following day ). Considering the whole dataset, the same correlations ranged in the interval . The mean value and standard error of the distributions were and respectively, with ranging in the interval [0.1, 0.7]. Therefore, the confidence threshold was exceeded (e.g., Abruzzo, 37%). However, no anomalies have been found.
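The day-by-day comparison of correlations described above can be sketched as follows. The data are invented for illustration (five areas, two collection days) and are not the values of this study; the percentage increase is assumed to follow the usual definition $\Delta = 100\,(new - old)/|old|$, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def percentage_increase(old, new):
    """Assumed definition of the percentage increase between two values."""
    return 100 * (new - old) / abs(old)


# Invented example: same areas and period, RSVs downloaded on two days.
cases = [100, 200, 300, 400, 500]
rsv_day_1 = [80, 85, 90, 95, 100]
rsv_day_2 = [90, 85, 100, 80, 95]
rho_1 = spearman(cases, rsv_day_1)
rho_2 = spearman(cases, rsv_day_2)
delta = percentage_increase(rho_1, rho_2)
print(round(rho_1, 2), round(rho_2, 2), round(delta, 1))
```

Even a modest reshuffling of the ranking between two collection days is enough to move the rank correlation substantially, which is the kind of effect the percentage increase is meant to quantify.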

Italian Regions’ Web Interest During Period 2 (20 February–18 May, 2020)

As shown in Figure 2, the variance of RSVs as a function of the day they were gathered was lower than that of the previous dataset. This is probably due not only to the investigated period but also to the different sampling frequency. However, there was greater variability in RSV peaks and a larger number of non-normal trends.
FIGURE 2

Dependence of Google Trends relative search volumes (RSVs) on collection date: Italian regions’ web interest in the query coronavirus + covid during period 2 (20 February–18 May, 2020). X-axis: dates on which the RSVs were collected. Y-axis: Google Trends RSV. * = Regions that showed a non-normal trend over time.

The confidence threshold was exceeded (e.g., Abruzzo, 47%). Spearman and Pearson correlations between COVID-19 cases and daily RSVs ranged in the intervals and respectively. No anomalies have been found.

Italian Cities’ Web Interest During Period 1 and Period 2

As shown in Tables 2 and 3, significant anomalies occurred in 33.3% of Italian cities during period 1 and 45.8% during period 2. In particular, Perugia and Prato, absent 7 and 10 times respectively during period 1, recorded RSVs = 100 on 6 occasions. During period 2, Messina, Perugia, Pescara, Prato, and Salerno recorded only 1 RSV out of 14 samples, while Parma recorded 2 RSVs. Therefore, any type of correlation or other statistical calculation, evaluation, or consideration on this dataset would be highly dependent on the day the data was gathered.
TABLE 2

Dependence of Google Trends relative search volumes (RSVs) on collection date: Italian cities’ web interest in the query coronavirus + covid during period 1 (1 February–4 December, 2020).

City | Weekly RSVs collected day by day from 14 to 26 December, 2020 (listed in collection order; days on which no RSV was returned are omitted)
Bari | 94 98 94 94 97 95 94 96 94 95 91 91 94
Bologna | 94 92 96 94 96 95 95 96 95 94 90 91 94
Brescia | 89 91 87 88 87 89 90 86 88 88 85 86 87
Cagliari | 97 98 100 96 97 95 94 98 96 96 97 91 98
Catania | 88 89 86 82 86 86 89 85 85 88 84 83 87
Firenze | 96 100 98 100 100 97 100 99 97 98 100 95 97
Genova | 88 92 89 91 89 91 89 92 90 90 89 86 87
Milano | 89 94 90 90 91 91 91 91 86 93 90 88 90
Modena | 95 92 95 94
Napoli | 89 93 89 88 91 90 91 87 90 94 88 84 89
Padova | 88 90 88 87 91 88 87 84 87 89 87 84 87
Palermo | 79 84 79 80 81 80 78 78 80 78 79 79 81
Parma | 87 88 89 87 86 85 86 87
Perugia | 100 98 100 100 100 100
Prato | 100 100 95
Reggio calabria | 95 100 96
Reggio emilia | 88 90
Roma | 90 94 92 93 94 93 92 93 92 94 92 88 92
Salerno | 87 86 87 84 85 88 85 85
Taranto | 100
Torino | 92 92 95 91 96 88 96 90 87 97 88 88 87
Trieste | 90 88 91 92 90 91 93 92 85 90 90 88 86
Venezia | 82 85 83 81 80 83 81 82 79 84 80 79 80
Verona | 83 85 86 81 86 86 86 84 82 84 82 79 82
TABLE 3

Dependence of Google Trends relative search volumes (RSVs) on collection date: Italian cities’ web interest in the query coronavirus + covid during period 2 (20 February–18 May, 2020).

City | Daily RSVs collected day by day from 14 to 26 December, 2020 (listed in collection order; days on which no RSV was returned are omitted)
Bari | 93 90 89 90 90 91 90 87 92 88 87 90 87
Bologna | 96 95 96 95 95 96 95 92 95 92 92 98 95
Brescia | 93 93 92 93 94 92 94 92 88 90 94 91 90
Cagliari | 100 100 98 100 100 100 100 100 100 100 100 100
Catania | 89 87 89 89 89 87 88 91 94 86 88 89 86
Firenze | 93 93 95 96 95 96 96 97 97 94 93 96 93
Genova | 88 87 86 88 89 89 89 86 89 83 85 89 86
Messina | 77
Milano | 97 98 100 100 95 95 98 97 96 94 96 98 96
Modena | 89 94 93 92 93
Napoli | 90 89 90 90 88 88 87 86 87 87 88 87 84
Padova | 93 92 93 92 93 90 94 95 90 91 90 91
Palermo | 78 77 79 77 77 79 81 79 79 75 78 78 74
Parma | 83 86
Perugia | 97
Pescara | 95
Prato | 91
Reggio emilia | 87 86 84
Roma | 90 91 92 93 89 90 91 90 89 90 89 90 89
Salerno | 86
Torino | 93 92 94 95 91 92 94 94 94 92 94 91 92
Trieste | 90 92 90 91 86 89 91 89 89 89 86 89 89
Venezia | 91 89 93 91 89 91 89 94 87 88 88 87 87
Verona | 89 87 90 88 89
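The anomaly rate reported for Italian cities in period 1 can be re-derived from Table 2 by counting, for each city, how many of the 13 daily collections (14 to 26 December) actually returned an RSV. The sketch below encodes those counts as read from the table; the helper `anomaly_rate` is hypothetical, and treating any city with fewer than 13 values as anomalous is an assumption based on the definition of anomaly given in the Methods.

```python
def anomaly_rate(returned, expected):
    """Fraction of areas whose RSV was missing on at least one collection day."""
    anomalous = sorted(city for city, n in returned.items() if n < expected)
    return len(anomalous) / len(returned), anomalous


# Number of RSVs returned per city in Table 2 (13 collection days in total).
returned_period_1 = {
    "Bari": 13, "Bologna": 13, "Brescia": 13, "Cagliari": 13, "Catania": 13,
    "Firenze": 13, "Genova": 13, "Milano": 13, "Modena": 4, "Napoli": 13,
    "Padova": 13, "Palermo": 13, "Parma": 8, "Perugia": 6, "Prato": 3,
    "Reggio calabria": 3, "Reggio emilia": 2, "Roma": 13, "Salerno": 8,
    "Taranto": 1, "Torino": 13, "Trieste": 13, "Venezia": 13, "Verona": 13,
}
rate, anomalous_cities = anomaly_rate(returned_period_1, expected=13)
excluded = rate > 0.20  # exclusion rule stated in the Methods
print(f"{100 * rate:.1f}%", anomalous_cities, excluded)
```

The resulting 8 anomalous cities out of 24 agree with the 33.3% stated for period 1, and the rate is above the 20% limit beyond which a sample was excluded from the statistical analysis.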

Global Web Interest During Period 1 (1 February–4 December, 2020)

Google Trends reported a maximum of 62 countries’ RSVs (Supplementary Table S1). Significant anomalies occurred in 6 cases (9.7%) and the peak was reached and maintained unchanged by Italy (. In 64.5% of cases, data was not normally distributed. No nation exceeded the confidence threshold even if the dataset showed a high variability range if compared to that of Italy (). Spearman correlations with COVID-19 total cases ranged in the interval ; however, it must be pointed out that the value was an outlier (recorded on December 16, 2020) and a more representative interval is .

Global Web Interest During Period 2 (20 February–18 May, 2020)

Google Trends reported a maximum of 64 countries’ RSVs (Supplementary Table S2). Significant anomalies occurred in 7 cases (10.9%) and the peak was reached and maintained unchanged by Italy (. In 56.2% of cases, data was not normally distributed. No nation exceeded the confidence threshold even if the dataset showed a high variability range if compared to that of Italy (). Spearman correlations with COVID-19 total cases ranged in the interval ; however, it must be pointed out that the value was an outlier (recorded on December 16, 2020) and a more representative interval is .

International Cities’ Web Interest During Period 1 and Period 2

As shown in Tables 4 and 5, significant anomalies occurred in 30.4% of international cities during period 1 and 38.1% during period 2. In particular, Bogotá, Chicago, Dubai, Houston, Hyderabad, Los Angeles, Sao Paulo, and Santiago of Chile were affected by anomalies during period 1 and period 2, which also included Milan () and Rome ( on December 25, 2020). Therefore, any type of correlation or other statistical calculation, evaluation, or consideration on this dataset would be highly dependent on the day the data was gathered.
TABLE 4

Dependence of Google Trends relative search volumes (RSVs) on collection date: international cities’ web interest in the query coronavirus + covid during period 1 (1 February–4 December, 2020).

City | Daily RSVs collected day by day from 16 to 26 December, 2020 (listed in collection order; days on which no RSV was returned are omitted)
Bangalore | 60 63 63 59 63 60 62 63 61 62 61
Bogotá | 48 49 49 50 49
Chicago | 62 63 62 62 62 63 64
Mexico city | 50 49 50 49 51 51 49 49 51 49 50
Dubai | 71
Houston | 52 54 51 53
Hyderabad | 43
London | 66 67 67 67 67 67 65 66 64 66 67
Los angeles | 58 60 59 60 60 60 58 57 58 61
Madrid | 80 82 82 85 81 84 80 78 80 80 84
Melbourne | 87 88 86 88 84 87 85 83 85 85 88
Milan | 97 100 98 97 100 100 94 100 100 97 100
Mumbai | 73 74 70 71 69 72 72 71 72 71 72
New York | 52 51 51 50 52 50 50 50 52 51 50
New Delhi | 59 60 56 58 59 59 58 59 56 57 59
Paris | 70 71 71 73 70 71 69 72 70 73 72
Rome | 100 98 100 100 98 100 100 100 97 100 100
Sao paulo | 32 33 33 34 34 33
Santiago of Chile | 43 44 44 44
Singapore | 56 56 56 56 58 57 55 57 56 55 58
Sydney | 61 60 61 60 59 60 60 58 60 61 61
Toronto | 81 80 78 79 79 82 77 81 78 79 79
TABLE 5

Dependence of Google Trends relative search volumes (RSVs) on collection date: international cities’ web interest in the query coronavirus + covid during period 2 (20 February–18 May, 2020).

City | Daily RSVs collected day by day from 16 to 27 December, 2020 (listed in collection order; days on which no RSV was returned are omitted)
Bangalore | 67 67 65 67 65 66 66 66 70 68 65 65
Bogotá | 50 52 49 51 50 51 52 53 50 48
Chicago | 61 62 60 60 59 63 62
Mexico city | 46 46 45 48 46 47 46
Houston | 53 53 50 51 52 52
Hyderabad | 49 48 49 48 50
London | 64 64 64 63 64 65 66 64 67 65 63 62
Los angeles | 58 58 55 57 56 58 56 56 60 57 56 57
Madrid | 83 85 85 83 82 85 86 84 84 86 83 87
Melbourne | 60 61 60 58 58 60 59 58 62 64 58 60
Milan | 100 100 100 100 100 100 100 100 100 100 100
Mumbai | 78 77 75 76 76 76 76 78 78 80 77 77
New York | 53 56 56 51 52 56 54 54 55 55 53 54
New Delhi | 61 62 60 62 59 60 60 61 61 63 61 61
Paris | 69 71 70 70 69 70 71 70 71 71 68 69
Rome | 91 94 91 92 93 94 96 100 91 98
Sao paulo | 34 34 32 33 35 35 32 36 33
Santiago of Chile | 45 47 46 48 46
Singapore | 55 57 58 57 56 58 57 59 58 60 58 57
Sydney | 55 56 56 56 54 55 55 56 57 59 55 57
Toronto | 72 72 70 71 71 72 72 70 75 71 70 70

Discussion

As far as the author knows, this is the first study to assess Google Trends reliability through an iterated query analysis. In particular, this paper demonstrates a strong dependence of Google Trends relative search volume (RSV) values on the date they are gathered. The dataset of Italian regions above all, although not affected by anomalies, showed how collecting the RSVs of the same query (i.e., same category, area, and period) on different days can substantially modify a statistical correlation between the RSVs themselves and an external quantity (in this case, the number of COVID-19 infections). Moreover, all the other datasets revealed an even greater problem: the presence or absence of specific RSVs depending on the day the sample was gathered. This phenomenon also affected cities that reached peak values on several occasions, such as Milan and Rome in the global dataset and Perugia and Prato in the Italian dataset. Furthermore, the fact that Prato and Perugia reached a peak of web interest in the Italian dataset but not in the international dataset shows that Google Trends RSV measurement includes only specific geographical areas according to the search item chosen by the user. Finally, RSVs of Italian regions and cities, as well as RSVs of international cities, showed such a daily variance that these areas were often statistically confident with each other, compromising any search for correlations or any other rank-based grouping. The most reliable dataset (i.e., a sample that showed an acceptable number of anomalies and whose data did not exceed the confidence threshold) was that of countries worldwide, during both period 1 and period 2. However, even in this case there were outliers capable of destroying the correlation between RSVs and COVID-19 cases. The results of this research cast uncertainty on the use of Google Trends for making infodemiological or epidemiological evaluations.
In all studies conducted so far, data was extracted only once, as the authors could not have anticipated the dependence of RSVs on the day of collection. Anomalies can disrupt the statistical significance of a correlation, as they can change the distributive nature of the sample by transforming it from Gaussian to non-Gaussian or vice versa. Thus, depending on the type of correlation, they can irrevocably compromise the use of p-values as graded measures of evidence against the null hypothesis. Moreover, since the Pearson coefficient is sensitive to outliers, they can drastically affect its strength (Mukaka, 2012). Finally, pronounced changes in RSV can invalidate the reliability of a dataset. For example, no analysis of the geographical distribution of web interest in a chosen topic for a specific area can be carried out when RSVs vary significantly from day to day (although the investigated period is always the same). But these are not the only critical issues that this survey has pointed out: indeed, this evidence shows that any study performed through Google Trends is inherently not reproducible. At present, the actual algorithm by which Google Trends detects query data is unknown. This makes it difficult, if not impossible, to identify the causes of this phenomenon. Alongside the limitations highlighted in this work, Cervellin et al. pointed out that web queries can be influenced by mass media, further reducing the credibility of this research tool (Cervellin et al., 2017). Nuti et al. had previously found that a large number of papers lack the information needed to make them fully reproducible (Nuti et al., 2014).
Nevertheless, Google Trends has served and still serves as an excellent tool for infoveillance and infodemiology: even admitting that newspapers and newscasts can influence web queries, it provides a way to quantify the web interest in a specific topic more efficiently than any other method historically used (e.g., population surveys) (Amber et al., 2016; Dreher et al., 2018; Mohamad and Kok, 2019; Havelka et al., 2020). Moreover, it can be used as a complement to a traditional analysis (Schootman et al., 2014). During the COVID-19 pandemic, it was widely used by the scientific community and continues to be. Therefore, infoveillance and infodemiology scholars must adopt a more robust criterion for collecting data from Google Trends. Specifically, a series of steps can minimize the likelihood of serious misinterpretation: 1) The trend of the RSV of a query for pre-selected periods and geographic regions must be gathered and monitored daily to assess its stability, i.e., the absence of anomalies and dramatic changes in the RSV of geographic subunits (such as cities, regions, or nations). 2) If the trend has been stable for at least 7 days, continue to download data until statistical incompatibility between the RSVs of the various subunits has been reached (e.g., Welch's t > 1.5). If the dataset is not normally distributed, it is recommended to perform at least 30 extractions; alternatively, a non-parametric test can be adopted. 3) Use the mean RSV value of each subunit as the measure to represent the sample, also providing its 95% confidence interval (or variability range). 4) When searching for correlations between RSV and other quantities, calculate them for every daily dataset and provide each mean value together with its 95% confidence interval (or variability range). This technique relies on standard frequentist inference criteria, so it applies to any frequentist dataset.
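Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the region names and RSV values are hypothetical, one RSV per subunit is assumed for each collection day, and the 95% confidence interval uses the normal approximation (z = 1.96), which is only reasonable for a sufficiently large number of extractions. The 1.5 threshold on Welch's t is the one quoted in the text.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

T_THRESHOLD = 1.5  # threshold on Welch's t quoted in the text

def welch_t(a, b):
    """Absolute Welch t statistic between two samples of daily RSVs."""
    squared_se = stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)
    return abs(mean(a) - mean(b)) / sqrt(squared_se)

def mean_ci95(values, z=1.96):
    """Mean and approximate 95% CI (normal approximation, an assumption)."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width

# Hypothetical RSVs, one value per collection day (NOT data from this study).
daily_rsv = {
    "region_A": [80, 82, 79, 81, 83, 80, 82],
    "region_B": [60, 58, 61, 59, 62, 60, 59],
    "region_C": [79, 83, 80, 82, 81, 80, 81],
}

# Step 3: represent each subunit by its mean RSV and confidence interval.
for name, values in daily_rsv.items():
    m, low, high = mean_ci95(values)
    print(f"{name}: mean RSV = {m:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")

# Step 2: check which pairs of subunits are statistically distinguishable.
for first, second in combinations(daily_rsv, 2):
    t = welch_t(daily_rsv[first], daily_rsv[second])
    verdict = "distinguishable" if t > T_THRESHOLD else "not distinguishable"
    print(f"{first} vs {second}: t = {t:.2f} ({verdict})")
```

For step 4, the same `mean_ci95` helper could be applied to the list of correlation coefficients obtained from each daily dataset.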
For sufficiently large samples, the central limit theorem ensures that mean values and confidence intervals are valid statistical measures for making comparisons regardless of data distribution (Kwak and Kim, 2017). Regarding the influence of media or external sources on RSV, Sato et al. are developing an analytical approach to clean up data from these disturbances (Sato et al., 2021). This would allow the adoption of Google Trends in the epidemiological field as well. In conclusion, Google Trends represents a great source of information for the entire scientific community. Nonetheless, Google should provide more details on how RSVs are presented to users. To ensure full reliability of a Google Trends dataset, it is essential for future research that authors collect query data for several consecutive days and work with RSV averages instead of daily RSVs, trying to minimize the standard errors until an established confidence threshold is respected. However, since this analysis is limited to a single query in two fixed time frames, further research is needed to understand when and how the proposed method is sufficient to contain the oscillations of the RSV acceptably. In particular, it is necessary to establish the causal relationship between the selection of Google Trends datasets and the occurrence of anomalies and sudden changes in the RSV.
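The recommendation to keep collecting until the standard error is small enough can be read as a simple stopping rule. The sketch below is one possible interpretation, not a procedure defined in this study: the function name, the target half-width of 2 RSV points, and the daily values are all hypothetical, and the confidence interval again relies on the normal approximation (z = 1.96). The minimum of 7 days mirrors the stability period suggested in the steps above.

```python
from math import sqrt
from statistics import mean, stdev

def extractions_until_stable(daily_values, max_half_width, z=1.96, min_days=7):
    """Find the first number of days n (n >= min_days) at which the
    approximate 95% CI half-width of the mean RSV is <= max_half_width.

    Returns (n, mean, half_width), or None if the target is never reached.
    """
    for n in range(max(min_days, 2), len(daily_values) + 1):
        window = daily_values[:n]
        half_width = z * stdev(window) / sqrt(n)  # z times the standard error
        if half_width <= max_half_width:
            return n, mean(window), half_width
    return None

# Hypothetical daily RSVs of one subunit (NOT data from this study).
rsv = [70, 74, 66, 73, 68, 71, 69, 72, 70, 67, 71, 70, 69, 72, 70]

result = extractions_until_stable(rsv, max_half_width=2.0)
if result is None:
    print("Target precision not reached: keep collecting.")
else:
    n, m, half_width = result
    print(f"Stop after {n} days: mean RSV = {m:.2f} +/- {half_width:.2f}")
```

A stricter target simply requires more collection days, and the function returns `None` when the available extractions are not enough to reach it.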