Takahiro Yabe1,2, Kota Tsubouchi3, Yoshihide Sekimoto4, Satish V Ukkusuri1. 1. Lyles School of Civil Engineering, Purdue University, 550 Stadium Mall Avenue, West Lafayette, IN 47907, USA. 2. Institute for Data, Systems, and Society, Massachusetts Institute of Technology, 50 Ames St, Cambridge, MA 02142, USA. 3. Yahoo Japan Corporation, Kioi Tower, Tokyo, Garden Terrace Kioicho, 1-3, Kioi-cho, Chiyoda-ku, Tokyo, Japan. 4. Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba Meguro-Ku, Tokyo 153-8505, Japan.
Abstract
COVID-19 has disrupted the global economy and the well-being of people at an unprecedented scale and magnitude. To contain the disease, an effective early warning system that predicts the locations of outbreaks is of crucial importance. Studies have shown the effectiveness of using large-scale mobility data to monitor the impacts of non-pharmaceutical interventions (e.g., lockdowns) through population density analysis. However, predicting the locations of potential outbreaks is difficult using mobility data alone. Meanwhile, web search queries have been shown to be good predictors of disease spread. In this study, we utilize a unique dataset of human mobility trajectories (GPS traces) and web search queries with common user identifiers (> 450 K users) to predict COVID-19 hotspot locations beforehand. More specifically, web search query analysis is conducted to identify users with a high risk of COVID-19 contraction, and social contact analysis is then performed on the mobility patterns of these users to quantify the risk of an outbreak. Our approach is empirically tested using data collected from users in Tokyo, Japan. We show that by integrating COVID-19 related web search query analytics with social contact networks, we are able to predict COVID-19 hotspot locations 1–2 weeks beforehand, improving on predictions that use only social contact indexes or web search data analysis. This study proposes a novel method that can be used in early warning systems for disease outbreak hotspots, which can assist government agencies in preparing effective strategies to prevent further disease spread.
Human mobility data and web search query data linked with common IDs are used to predict COVID-19 outbreaks. The high risk social contact index captures both the contact density and the COVID-19 contraction risks of individuals. Real world data was collected from 200 K individual users in Tokyo during the COVID-19 pandemic. Experiments showed that the index can be used for microscopic outbreak early warning.
The coronavirus pandemic (COVID-19) has inflicted significant health and economic impacts across the globe. Due to the contagious nature of the disease, monitoring and controlling how people come in contact with each other has been shown to be key in containing the disease (Zhang et al., 2020). Therefore, in response to the COVID-19 pandemic, countries have taken various non-pharmaceutical interventions (NPIs) (e.g., lockdowns, social distancing, testing and tracing) (Flaxman et al., 2020), while facing the conundrum of balancing the benefits to public health against the depletion of economic performance (Ukkusuri, Yabe, & Seetharam, 2020). In order to monitor and evaluate the impacts of such interventions, large scale mobile phone data has been utilized as an effective data source (Fan & Stewart, 2021; Oliver et al., 2020). Studies on human mobility analysis have used such data to model disease dynamics (Bengtsson et al., 2015; Dodge et al., 2021; Finger et al., 2016; Tizzoni et al., 2014; Wesolowski et al., 2012). During the COVID-19 crisis, various stakeholders have utilized large-scale mobility datasets to evaluate the effects of NPIs in various regions (Bonato et al., 2020; Cintia et al., 2020; Dahlberg et al., 2020; Gao, Rao, Kang, Liang, & Kruse, 2020; Klein et al., 2020; Kraemer et al., 2020; Lai et al., 2020; Long & Reuschke, 2021; Pepe et al., 2020; Santana et al., 2020; Schlosser, Maier, Hinrichs, Zachariae, & Brockmann, 2020; Wellenius et al., 2020; Yabe et al., 2020).
The aforementioned studies have shown the effectiveness of using large-scale mobility data to monitor physical co-location of the population (which can be used as a proxy for social contacts) at a fine-grained spatial and temporal scale. More specifically, metrics such as the “social contact index (SCI)” have been proposed, and their strong relationships with the estimated transmissibility of COVID-19 (i.e., effective reproduction number R) have been shown (Yabe et al., 2020). However, such metrics are incapable of predicting outbreaks beforehand, since such analysis can only be conducted in a retrospective manner. This is a critical drawback since countries are already facing second and third waves of COVID-19 (as of October 2020), and are in need of effective early warning systems that can predict when and where the next outbreaks will occur.
To tackle this problem, we utilize a unique dataset that contains both the GPS location data (mobility trajectories) and web search queries of users, which are linked with common user identifiers (IDs). We hypothesize that users who search COVID-19 symptom related queries more frequently and intensely have a higher risk of having contracted the virus. By integrating the mobility analysis and web search analysis, we propose the “high-risk social contact index (HR-SCI)” metric, which takes into consideration both the density of population and the risk of COVID-19 (Fig. 1), in contrast to previous metrics that rely on either mobility data or web search data alone. The methods are tested using data of >450,000 users in Tokyo, who were observed across a 7-month period (February 1, 2020 to September 10, 2020). Experiments show that the HR-SCI is capable of predicting COVID-19 outbreak hotspots 1–2 weeks before the official observations of the outbreak. The HR-SCI metric can be used to develop early warning systems for COVID-19 hotspots that inform government agencies of the locations of potential outbreaks, allowing them to plan effective prevention and preparation strategies.
Fig. 1
Overview of the study. Using individual human mobility trajectories and web search data with common user IDs, we aim to predict COVID-19 hotspot locations via search query analysis and social contact detection.
The key contributions of this paper are as follows:
This study is the first to test the usage of web search and mobility data, which are linked by user IDs, for the prediction of COVID-19 outbreak hotspots.
We propose a novel metric, the high risk social contact index (HR-SCI), that captures both the social contact density and the COVID-19 contraction risk levels of the users, with high spatio-temporal granularity.
We verify through several case studies that the HR-SCI can improve the predictability of COVID-19 hotspot locations, compared to using the social contact index or web search queries alone.
Related works
Mobility analysis during COVID-19
Mobile devices have become ubiquitous in many areas around the world, providing the opportunity to analyze human mobility dynamics at an unprecedented spatial and temporal granularity and scale (Blondel, Decuyper, & Krings, 2015). To monitor and evaluate the impacts of non-pharmaceutical interventions (NPIs), large scale mobile phone data has been utilized as an effective data source (Oliver et al., 2020). Studies on human mobility analysis have used such data to model disease dynamics (Bengtsson et al., 2015; Finger et al., 2016; Tizzoni et al., 2014; Wesolowski et al., 2012). During the COVID-19 crisis, researchers, industry, and government agencies have utilized large-scale mobility datasets to evaluate the effects of NPIs in various countries, including the United States (Gao et al., 2020; Klein et al., 2020; Wellenius et al., 2020), the United Kingdom (Santana et al., 2020), Italy (Bonaccorsi et al., 2020; Bonato et al., 2020; Cintia et al., 2020; Pepe et al., 2020), China (Kraemer et al., 2020; Lai et al., 2020), Sweden (Dahlberg et al., 2020), Germany (Schlosser et al., 2020), Spain (Orro, Novales, Monteagudo, Pérez-López, & Bugarn, 2020), Austria (Heiler et al., 2020), and Japan (Mizuno & Ohnishi, 2020; Yabe et al., 2020). None of these studies have attempted to integrate the analysis of web search queries to predict COVID-19 outbreak hotspots.
Applications of web search data analysis
Mining of web search query data has attracted the attention of researchers and practitioners ever since search engines were introduced to the world (Silverstein, Marais, Henzinger, & Moricz, 1999). Web search data has been utilized for various applications, for example, to predict users' demographics (Wu et al., 2019) and mobility decisions during crisis (Yabe, Tsubouchi, Shimizu, Sekimoto, & Ukkusuri, 2019). During the COVID-19 pandemic, many studies have utilized web search query data to understand information seeking behavior and the occurrence of infodemics (Bento et al., 2020; Mavragani, 2020; Rovetta & Bhagavathula, 2020). Others have used such data to detect the increase in COVID-19 symptoms (e.g. loss of smell) (Rajan et al., 2020; Walker, Hopkins, & Surda, 2020), and also for predicting outbreaks (Hisada et al., 2020; Li et al., 2020). With the availability of open datasets (Xu et al., 2020), there is great potential in further using web search query data for pandemic response and prevention. Despite the vast array of studies, none have attempted to integrate web search data with mobility data to predict COVID-19 outbreaks.
Context: COVID-19 in Tokyo, Japan
Japan has experienced a low number of cases and deaths due to COVID-19 in comparison to other countries in Europe and America, despite its social and physical proximity to China and intervention policies that are not as aggressive as those of some other countries (Dong, Du, & Gardner, 2020). Non-pharmaceutical interventions implemented by the Japanese government include a non-mandatory request for non-essential businesses to close and for their employees to work remotely (February 26th), closures of public elementary, junior high and high schools (March 2nd), and inbound entry restrictions, which started with Hubei Province, China (February 3rd) and were extended to visitors from 73 countries (April 3rd). No mandatory lockdowns were enforced in Japan. The State of Emergency (SoE) was declared on April 7th, and was lifted on May 25th, after a significant decrease in cases was observed (see Fig. 2, gray bars). Although Japan was able to contain the disease, many cities started to see an increase in cases in early July, and experienced the second wave from July to September (see Fig. 2, gray bars). Tokyo, which has the largest number of COVID-19 cases among prefectures in Japan, had around 400 new daily cases at its peak. As of October 17th, Tokyo has had 28,839 cases (out of 90,979 in Japan) and 434 deaths (out of 1650 in Japan).
Fig. 2
Time series of estimated COVID-19 related metrics aggregated at the entire Tokyo metropolitan scale between February 1st and September 5th, 2020 (left axis). A) Total number of high risk users. B) Social contact index of an average user. C) High risk social contact index, which is the total amount of social contacts the high risk users encounter. Gray bars in each panel show the daily number of new COVID-19 cases in Tokyo (right axis).
Data
In this study, we utilized web search data and GPS location data owned by Yahoo Japan Corporation, which are linked with the same user identifiers (IDs). Because the user IDs are linked, we are able to 1) identify users who have a high risk of COVID-19 contraction from their web search behavior, and 2) track their mobility patterns to measure their social contact rates with other users in the city.
Privacy policies for user data
Yahoo Japan Corporation (YJ) has developed its own privacy policy and requires users to read and agree to it before using any of the services provided by YJ. Furthermore, because location data is highly sensitive, users were asked to sign an additional consent form specific to the collection and usage of location data when they used apps that collect location information. The additional consent form explains the frequency and accuracy of location information collection, as well as the purpose of collection and how the data will be used. In addition to the above, YJ asked for further consent from the users in this study because analysis related to COVID-19 and personal health is much more sensitive. Therefore, YJ performed a double consent process, in which the users who had given their consent to the usage of location information and web search queries were asked again whether they wished to be included in the dataset.
Moreover, YJ implemented strict restrictions on the analysis procedure. The methodology for handling the data and for obtaining user consent for this study was supervised by an advisory board composed of external experts. YJ also ensured that research institutions other than YJ that participated in this study did not have direct access to the data. Although external research institutions were allowed to analyze aggregated data, the raw data were kept within YJ, and any analysis on raw data was performed within servers administered by YJ. In summary, given the high sensitivity of the study, it was performed with careful consideration of the users' privacy.
Individual human mobility data
GPS location data are anonymized so that individuals cannot be identified, and personal information such as gender, age and occupation is unknown. Each GPS location record contains the user's unique ID, the timestamp of the observation, longitude, and latitude. The data has a sample rate of approximately 2% of the entire population. The data acquisition frequency of GPS locations varies according to the movement speed of the user to minimize the burden on the user's smartphone battery. If it is determined that the user is staying in a certain place for a long time, data is acquired at a relatively low frequency, and if it is determined that the user is moving, data is acquired more frequently. We overcome this varying data acquisition frequency by spatially and temporally interpolating the location data, as explained in Section 5.3. A panel of users who were active each day in the Tokyo metropolitan area before, during and after the COVID-19 pandemic was selected from the pool of users who had given the aforementioned consents. This led to a sample of about 450,000 users, with approximately 50 observation points per user each day.
Individual web search data
In addition to GPS location data, Yahoo Japan collects the web search queries of the users to improve the web search quality. Each web search query record contains the user's unique ID, timestamp of the search, and query text that was searched. The web search data were used to identify users who have a high risk of COVID-19 contraction, following the methods explained in Section 5.2.
Number of COVID-19 cases in Tokyo
The daily number of confirmed COVID-19 cases in Tokyo was reported by the Tokyo metropolitan government through its GitHub data portal (Tokyo Metropolitan Government, 2020) (gray bars in Fig. 2).
Methodology
Preliminaries
Definition 1 (web search session)
Each user's web search behavior can be observed as a sequence of web search queries performed by the user. Usually, a continuous sequence of searches within a short time period is performed under a consistent underlying search intent. We define these short sequences of searches, which are assumed to share a consistent search intent, as “web search sessions”. Web search sessions are used to detect high risk users who intensively search about COVID-19 symptoms (Section 5.2).
Definition 2 (high risk user)
A high risk user is defined as a user who conducts more web search sessions that are related to COVID-19 symptoms than the pre-defined threshold.
Definition 3 (social contact index)
The social contact index (SCI) measures the average amount of encounters that each user experiences due to movements outside of their estimated home locations. When n individuals are observed in the given spatial unit and temporal window, the number of contacts in this cell can be quantified as .
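As a hedged worked illustration of how co-location translates into contact counts, assume the counting convention described in the social contact analysis section, where each user's contact count $c_i$ is the number of other users nearby. This is an assumption for illustration and not necessarily the expression intended in the definition. If $n$ users share one cell and each counts the other $n - 1$ users, then:

```latex
\sum_{i=1}^{n} c_i = n(n-1),
\qquad
\text{number of distinct pairs} = \binom{n}{2} = \frac{n(n-1)}{2}.
```

Under either convention, the contact count of a cell grows quadratically with the number of co-located individuals. For example, 10 users in one cell give 90 directed contacts, or 45 distinct pairs.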
Definition 4 (high risk social contact index)
The high risk social contact index (HR-SCI) is a composite measure of the SCI and the number of high risk users. HR-SCI measures the average amount of social contacts that high risk users encounter due to movements outside of their estimated home locations.
Scoring of web search queries
To identify users with a high risk of COVID-19 contraction, the users' web search queries from Yahoo Japan Search were given risk scores. Queries submitted by users were treated as “COVID-19 queries” if they matched pre-defined query patterns. The query patterns consist of 3 types of query phrases: 1) COVID-19 symptom related queries, 2) names of medical institutions related to COVID-19 care, and 3) names of locations, as shown in Table 1. For the first group of queries, 186 COVID-19 symptom related queries were determined, such as “coronavirus high fever” or “may have coronavirus”, including various slang words (e.g., “corona” instead of “coronavirus”, which is typically used by Japanese-speaking users). For the second group of queries, 5 queries that represent medical institutions concerned with COVID-19 were determined. These include facilities such as hospitals designated by the local health authorities as specialized for the treatment of COVID-19 patients. The third group includes 2168 queries representing names of locations (e.g., Shibuya). Queries from the latter two groups were scored as “COVID-19 related” only when they were searched together (e.g., “Central hospital Shibuya”). To ensure the fairness of future analyses and to prevent users from making these searches even when they do not have COVID-19 symptoms, we do not disclose the details of the query list. When a user has at least one COVID-19 related search query within a session, we classify that web search session as a COVID-19 related web search session. Given a pre-defined threshold k, a user is identified as a high risk user if they had more than k COVID-19 related web search sessions. Fig. 2A shows the hourly number of high risk users with k = 3 in the Tokyo region, along with the daily number of new cases.
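The scoring rule above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the pattern lists are hypothetical stand-ins built from the examples in Table 1 (the real query list is not disclosed), matching is done by simple substring search, and the 30-minute gap used to split searches into sessions is an assumed value that the text does not specify.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical stand-ins for the undisclosed query lists (examples from Table 1).
SYMPTOM = ["coronavirus high fever", "no smell", "may have corona"]
MEDICAL = ["hospital", "medical clinic"]
LOCATION = ["shibuya", "shinjuku"]
SESSION_GAP = timedelta(minutes=30)  # assumed; the source gives no value


def is_covid_query(text):
    """True for a symptom phrase, or a medical institution AND a location together."""
    q = text.lower()
    if any(p in q for p in SYMPTOM):
        return True
    return any(m in q for m in MEDICAL) and any(loc in q for loc in LOCATION)


def split_sessions(records):
    """Group one user's (timestamp, query) records into sessions by time gap."""
    sessions = []
    for ts, text in sorted(records):
        # Continue the current session if the previous search was recent enough.
        if sessions and ts - sessions[-1][-1][0] <= SESSION_GAP:
            sessions[-1].append((ts, text))
        else:
            sessions.append([(ts, text)])
    return sessions


def high_risk_users(log, k=3):
    """Return users with more than k COVID-19 related web search sessions.

    log: iterable of (user_id, timestamp, query_text) tuples.
    """
    by_user = defaultdict(list)
    for user, ts, text in log:
        by_user[user].append((ts, text))
    flagged = set()
    for user, records in by_user.items():
        # A session is COVID-19 related if at least one of its queries matches.
        n_related = sum(
            any(is_covid_query(text) for _, text in session)
            for session in split_sessions(records)
        )
        if n_related > k:
            flagged.add(user)
    return flagged
```

With k = 3, a user needs at least four COVID-19 related sessions to be flagged, which follows the "more than k" wording of the definition.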
Table 1
Query words that were used for scoring users' risk level using web search data.⁎

| Type | Quantity | Examples |
| --- | --- | --- |
| Symptom related | 186 | “coronavirus high fever”, “no smell”, “may have corona” |
| Medical institutions | 5 | “hospital”, “medical clinic” |
| Location names | 2168 | “Shibuya”, “Shinjuku” |

⁎ Queries on medical institution names and location names need to be searched together to qualify as a “COVID-19 related search”.
Social contact analysis
Using the GPS location data of each user, we compute the social contact index within a given region at the time of interest. To overcome data sparsity caused by the battery saving functions of the app (explained in Section 4.2), we perform spatio-temporal interpolation of the GPS location observations. Because the GPS data are collected less frequently when no movement is detected, we assume that individual users are static while there are no observations. In the final interpolated dataset, each individual's location is recorded every 30 min. The social contact indexes were computed from these interpolated individual trajectories; those shown in Fig. 2B were computed for 30 min intervals. First, for each time interval $[t, t + dt)$, where $dt$ = 30 minutes, users who were not within 125 m of their estimated home locations (obtained via mean-shift clustering of nighttime staypoints) were detected as “staying out”. We denote this set of individual users as $N_{out}$. For each user $i$ staying out ($i \in N_{out}$), we compute the number of other users “staying out” who are within 125 m of user $i$, and use that count $c_i$ as a proxy for social contacts. The mean social contact value is the total social contacts of all users staying out, divided by the total number of users including those staying at their homes: $C = \sum_{i \in N_{out}} c_i / N$, where $N$ is the total number of users observed on that day. The social contact index is the relative value of the mean social contacts with respect to typical mobility patterns observed before the COVID-19 pandemic. Thus, a social contact index (SCI) of 1 corresponds to the same amount of social contacts as the daily peak times on weekdays before the COVID-19 pandemic. A previous study has shown that the spatial threshold does not significantly affect the estimation results at the urban scale (Yabe et al., 2020); nevertheless, to verify this we conduct a sensitivity analysis on the temporal and spatial threshold parameters in Section 6.3.
Fig. 2B shows the SCI between February 1st and September 5th in metropolitan Tokyo. We observe a gradual decrease starting from April, low SCI during May and June, and a rise in SCI in July and August. Comparing with the new daily cases (gray bars), SCI decreases significantly after the first wave, but the decrease is more subtle after the second wave. Fig. 2C shows the high risk SCI (HR-SCI). We observe that, similar to the number of high risk users (Fig. 2A), the HR-SCI increases slightly before both the first and second waves; however, the increase is smaller during the second wave. In the following sections, we investigate the effectiveness of these three metrics in predicting COVID-19 outbreaks at different geographical scales (macroscopic and microscopic).
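The contact counting for a single 30-minute interval can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: coordinates are assumed to be already projected to metres, home locations are assumed to be given (the mean-shift step is not reproduced), the pre-pandemic `baseline` value is a hypothetical input, and the high risk variant follows the Fig. 2 caption by summing the contacts encountered by high risk users. The pairwise distance matrix is quadratic in the number of users, so a dataset of the size described here would need a spatial index instead.

```python
import numpy as np

RADIUS_M = 125.0  # spatial threshold from the text


def contact_counts(xy, home_xy, radius=RADIUS_M):
    """Per-user contact counts c_i for one time interval.

    xy, home_xy: (N, 2) arrays of projected coordinates in metres (assumed).
    Users within `radius` of their home are treated as staying home (c_i = 0).
    """
    xy = np.asarray(xy, dtype=float)
    home_xy = np.asarray(home_xy, dtype=float)
    staying_out = np.linalg.norm(xy - home_xy, axis=1) > radius
    counts = np.zeros(len(xy), dtype=int)
    pts = xy[staying_out]
    if len(pts):
        # Pairwise distances among users who are staying out.
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        counts[staying_out] = (d <= radius).sum(axis=1) - 1  # exclude self
    return counts


def mean_contacts(counts):
    """C = sum_i c_i / N, where N includes users staying at home."""
    return counts.sum() / len(counts)


def sci(counts, baseline):
    """Mean contacts relative to a pre-pandemic baseline value (hypothetical input)."""
    return mean_contacts(counts) / baseline


def hr_contacts(counts, high_risk_mask):
    """Total contacts encountered by high risk users."""
    return int(counts[np.asarray(high_risk_mask, dtype=bool)].sum())
```

The boolean `high_risk_mask` would come from the web search scoring step, which is what ties the two data sources together through the shared user IDs.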
Results
In this section, we investigate the effectiveness of each metric (number of high risk users, social contact index, high risk social contact index) in predicting COVID-19 outbreak hotspot locations at different spatial scales. Sections 6.1 and 6.2 investigate the predictability of the macroscopic (entire Tokyo metropolitan area) and microscopic (125 m × 125 m level) trends of new COVID-19 cases, respectively. Section 6.3 conducts a sensitivity analysis of the spatial and temporal threshold parameters with respect to the predictability of the number of cases. Using homogeneous grid cells, as opposed to municipal boundaries that vary in size, is effective in avoiding the modifiable areal unit problem (MAUP), a statistical bias that may affect the results of statistical hypothesis tests because of heterogeneous zone sizes.
Macroscopic trend prediction
From visual inspection of Fig. 2, we observe that the trends of high risk users (panel A) and the high risk social contact index (panel C) have two peaks, in early April and mid July, similar to the trend of daily new COVID-19 cases, which also has two peaks. To quantify the predictability of the daily cases trend, we computed the time lagged cross correlation between each metric's time series and the daily cases trend (“Both waves”). A positive lag indicates that the metric precedes the daily number of cases, while a negative lag indicates the opposite. Also, since we observe different patterns during the first and second waves, we divide the daily cases trend into the first wave (February 1st to May 31st) and the second wave (June 1st to September 5th), and compute the lag for each of the periods.
Fig. 3A-C shows the estimation results of the temporal lag for the three metrics: high risk users (HRU), social contact index (SCI), and high risk social contact index (High Risk SCI). We used the daily values of each metric to compute the cross-correlations with the number of daily cases. The results indicate that two of the metrics (high risk users and high risk social contact index) have similarly high predictability of the daily number of cases when the two waves are inspected separately; however, the High Risk SCI outperforms the HRU index when we perform lagged correlation analysis on both waves together. The time lagged time series data are visualized in Fig. 3D-F, where the dotted black line shows the original observation, the solid black line shows the lagged waves in aggregate (“both waves”), and the colored dashed and solid lines show the time lagged first wave and second wave predictions, respectively. The peak cross correlations are shown in Table 2, and indicate that while the HRU and HR-SCI metrics perform better, SCI by itself is not able to capture the trends of the number of new cases. Although the HRU metric performed the best during the first wave with R = 0.894 and a lag of 8 days, the HR-SCI performed the best during the second wave with R = 0.756 and for the two waves combined with R = 0.641. From these results, we conclude that the HR-SCI metric performs well in predicting the temporal trends of new daily COVID-19 cases across multiple waves.
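The lag estimation described above can be illustrated with a short Python sketch. It assumes two equally long daily series and searches lags in a symmetric window; the function name, the `max_lag` window of 30 days, and the use of `numpy.corrcoef` are illustrative choices, not details taken from the paper.

```python
import numpy as np


def lagged_correlation(metric, cases, max_lag=30):
    """Pearson r between metric[t] and cases[t + lag] for each lag.

    A positive lag means the metric precedes the case counts.
    Returns (best_lag, best_r).
    """
    metric = np.asarray(metric, dtype=float)
    cases = np.asarray(cases, dtype=float)
    n = len(metric)
    lags = list(range(-max_lag, max_lag + 1))
    rs = []
    for lag in lags:
        if lag >= 0:
            x, y = metric[: n - lag], cases[lag:]
        else:
            x, y = metric[-lag:], cases[: n + lag]
        rs.append(np.corrcoef(x, y)[0, 1])
    best = int(np.nanargmax(rs))  # ignore lags where a series is constant
    return lags[best], float(rs[best])
```

Applying the function separately to the first wave, the second wave, and the full period mirrors the three rows reported in Table 2.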
Fig. 3
Time lagged cross correlation analysis of the high risk users, social contact index, and high risk social contact index metrics against the daily number of new cases. Results indicate higher predictability using HRU and HR-SCI; the metrics preceded the daily cases trend by 8–16 days during the first wave, and by 16–23 days during the second wave. HR-SCI had a better overall fit on the two waves.
Table 2
Time lagged cross correlation and lagged days (in brackets) of the three metrics with the total daily number of COVID-19 cases in Tokyo.

| Pearson Correlation (lagged days) | High Risk Users | SCI | High Risk SCI |
| --- | --- | --- | --- |
| 1st wave | 0.894 (+8 days) | 0.324 (+30 days) | 0.614 (+16 days) |
| 2nd wave | 0.740 (+16 days) | 0.619 (+23 days) | 0.756 (+23 days) |
| Both waves combined | 0.166 (+9 days) | 0.299 (+23 days) | 0.641 (+16 days) |
Microscopic outbreak hotspot prediction
In the previous section, it was shown that the HR-SCI metric was the most effective in predicting the macroscopic (Tokyo metropolitan area scale) trend of COVID-19 cases across multiple waves. Policy makers could further benefit from more microscopic, spatially fine-grained prediction (and early warning) of outbreak hotspots. In this section, we test the effectiveness of the three metrics in predicting outbreak hotspots beforehand at the microscopic (i.e., 125 m grid) spatial scale.
First, we visualize the three indexes (SCI, HRU, and HR-SCI) for each 1 km × 1 km grid cell in the Tokyo metropolitan region. Fig. 4 shows the spatial distributions of the three index scores on the 1 km × 1 km grid scale, aggregated between June 1st and July 31st, and normalized between 0 and 1. From Table 3, we observe high SCI scores in major central business districts including Shinjuku, Tokyo, and Ikebukuro, whereas the high risk users index is more dispersed, with lower density areas such as Hibiya, Hamamatsu-cho, and Ogikubo ranked higher than under SCI. HR-SCI is a composite measure, and its ranking includes locations from both the SCI and HRU rankings. As we can see from the normalized scores of the top 100 ranked locations of each metric in Fig. 5, HR-SCI is more selective, showing high scores on only a few grid cells.
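The normalization and ranking of grid cells can be sketched as below. This is a minimal illustration that assumes min-max scaling (the text only states that scores are normalized between 0 and 1) and a hypothetical dictionary mapping grid cell identifiers to aggregated scores.

```python
import numpy as np


def normalize_and_rank(scores, top=10):
    """Min-max normalize {cell_id: score} to [0, 1] and return the top cells.

    Returns a list of (cell_id, normalized_score), highest score first.
    """
    ids = list(scores)
    values = np.array([scores[i] for i in ids], dtype=float)
    span = values.max() - values.min()
    # Guard against a constant score map, where min-max scaling is undefined.
    norm = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    order = np.argsort(-norm, kind="stable")[:top]
    return [(ids[j], float(norm[j])) for j in order]
```

Running this once per metric and per week gives rank trajectories of the kind compared across areas in the case studies.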
Fig. 4
Spatial plot of the total (A) social contact index, (B) high risk user count, and (C) high risk social contact index during June 1st - July 31st in the metropolitan Tokyo area, aggregated into 1 km grid cells.
Table 3
Top 10 locations in Tokyo metropolitan area with highest SCI, HRU, and HR-SCI indexes.

| Rank | SCI | High Risk Users | High Risk SCI |
| --- | --- | --- | --- |
| 1 | West Shinjuku | Hibiya | Tokyo |
| 2 | Tokyo | Tokyo | West Shinjuku |
| 3 | Ikebukuro | Ikebukuro | Akihabara |
| 4 | Akihabara | Akihabara | Hibiya |
| 5 | Ginza | Hamamatsu-cho | Ikebukuro |
| 6 | Ohtemachi | Shimbashi | Kitasenju |
| 7 | Hibiya | Shimokitazawa | Ohtemachi |
| 8 | Kitasenju | West Shinjuku | Shimbashi |
| 9 | South Shibuya | Ogikubo | Ogikubo |
| 10 | Shinagawa | Iidabashi | Hamamatsu-cho |
Fig. 5
Normalized scores of each metric in 1 km grid cells sorted by their rank, showing high skewness of the HR-SCI.
To validate the effectiveness of the HR-SCI scores in predicting outbreak hotspot locations, we collected news articles reporting the occurrence of COVID-19 outbreaks in Tokyo. We found news reports of outbreak events in eight areas in Tokyo (Shinagawa, Mizonokuchi, Ikebukuro, Nishi-Kasai, Tachikawa, Ooimachi, Kinshicho, and Kitasenju), which were used to validate the predictability of outbreak hotspots using the metrics. The SCI and HR-SCI values of the grid cell that contains the main location of each area were used. For example, for Shinagawa, the 125 m × 125 m cell that contains the central business district around Shinagawa Station was used.
Shinagawa area
Shinagawa (Fig. 6A) is one of the largest central business districts in Tokyo, and therefore consistently has a high social contact index, ranking between 8th and 10th in all periods. However, Shinagawa ranks very low (between 100th and 1000th) on the high risk user count index. In contrast, Shinagawa's high risk social contact index rank fluctuates over time, reaching 5th (week of July 6th) and 7th (week of July 13th). Cases in Shinagawa were reported to have increased from July 13th to August 2nd, with several COVID-19 clusters occurring during that period. Therefore, using SCI would result in false positive predictions until mid-July, whereas HR-SCI predicts the outbreak roughly 2 weeks in advance.
Fig. 6
The rank of eight areas in Tokyo using social contact index, high risk users index, and high risk social contact index between early June and end of July. Orange shaded periods show local COVID-19 outbreak timings. Rises in HR-SCI ranks precede outbreaks by 1–2 weeks.
Ikebukuro area
Similar to the Shinagawa area, Ikebukuro is a bustling area with many shopping and business facilities, and consistently ranks in the top 10 using SCI (between 2nd and 7th over the entire period) (Fig. 6B). However, Ikebukuro never ranks in the top 50 using the high risk users index. Using HR-SCI, Ikebukuro ranks 2nd, 1st, and 1st in the three weeks from June 1st, and again ranks in the top 10 at the end of July (weeks of July 20th and 27th). It has been reported that the number of cases in Ikebukuro rose from mid-June to the end of June, and that by the end of June the area had the largest number of new cases in Tokyo. Moreover, COVID-19 clusters in large shopping malls and the city hall were reported around early August, which is 1–2 weeks after the second increase in HR-SCI. Ikebukuro is another example where SCI would constantly produce false positive predictions of an outbreak, whereas HR-SCI predicts the outbreak timings 1–2 weeks beforehand.
Mizonokuchi area
Unlike Shinagawa and Ikebukuro, Mizonokuchi is a residential area located in the suburbs of Tokyo. As shown in Fig. 6C, Mizonokuchi never ranks in the top 10 using the social contact index, due to its low population density. However, we observe an increase in HR-SCI in early June and in the week of July 20th (ranked 6th in the entire Tokyo region), which coincides with the two outbreaks of COVID-19 in the area reported in early July and from the end of July to the beginning of August. The case study of Mizonokuchi shows that even in low population density areas, HR-SCI is able to predict outbreaks about 2 weeks beforehand.
Nishi-Kasai area
Nishi-Kasai, similar to Mizonokuchi, is a residential area with a smaller active population than Shinagawa and Ikebukuro. Therefore, the area always ranks below 20th in SCI. However, using the HR-SCI, Nishi-Kasai ranks in the top 10 twice during the study period: the weeks of June 22nd and July 20th (Fig. 6D). Indeed, COVID-19 outbreaks were reported in the area on July 4th and July 30th, both within 2 weeks after the area entered the top 10 using the HR-SCI. Nishi-Kasai is another case where, because of the low active population density, the area never appears in the top rankings using SCI, but does appear shortly before actual outbreaks using HR-SCI.
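The rank-based reading of the four case studies above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the `top_k=10` alert rule, and the convention of measuring lead time from the Monday that starts each week are assumptions made here. Only the Nishi-Kasai dates (top-10 weeks of June 22nd and July 20th, outbreak reports on July 4th and July 30th, 2020) come from the text.

```python
from datetime import date

def weekly_rank(scores_by_area):
    """Map each area to its rank (1 = highest score) for one week."""
    ordered = sorted(scores_by_area, key=scores_by_area.get, reverse=True)
    return {area: i + 1 for i, area in enumerate(ordered)}

def alert_weeks(weekly_scores, area, top_k=10):
    """Return the week start dates in which `area` ranks within the top_k.

    weekly_scores: {week_start_date: {area_name: score}} (hypothetical layout).
    """
    return [week for week, scores in sorted(weekly_scores.items())
            if weekly_rank(scores)[area] <= top_k]

def lead_time_days(alert_week_start, outbreak_date):
    """Days between the start of an alert week and a reported outbreak."""
    return (outbreak_date - alert_week_start).days

# Nishi-Kasai dates as reported in the text (year 2020).
first_lead = lead_time_days(date(2020, 6, 22), date(2020, 7, 4))
second_lead = lead_time_days(date(2020, 7, 20), date(2020, 7, 30))
```

Under these assumptions, the two Nishi-Kasai lead times come out at 12 and 10 days when counted from the start of the alert week, which is in line with the reported lead of up to 2 weeks.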
Sensitivity analysis
To understand the effects of the spatial and temporal threshold parameters used to compute the social contacts in the previous experiments (125 m and 30 min), we conducted a sensitivity check on both thresholds. Fig. 7 shows the sensitivity of the high risk social contact index when computed using different temporal (30 min, 1 h, 2 h) and spatial (1 km, 500 m, 250 m, 125 m) threshold parameters. Note that since the original data contained the location of each individual every 30 min, the minimum temporal threshold was 30 min. The overall patterns computed using all spatial and temporal threshold parameters present similar trends, with two significant peaks: April 2020, just before the first wave, and July 2020, just before the second wave. Changing the temporal threshold between 30 min, 60 min, and 120 min changes the granularity of the estimates; as expected, the time series is smoother when social contacts are computed with the 120 min threshold. Changing the spatial threshold reveals interesting patterns, especially in the first wave. The high risk SCI is more sensitive to the spatial threshold during the first wave, where reducing the contact threshold from 1 km to 125 m reduces the high risk social contact index by around 70%. In contrast, the second wave is significantly less affected by the change in the spatial threshold. This indicates that during the first wave, high risk individuals were in closer spatial proximity to each other, with many of them located within 1 km of one another. On the other hand, high risk individuals were more dispersed and distanced during the second wave, posing lower contact risks.
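To illustrate why the contact count depends on the two thresholds, the following Python sketch bins location records into square grid cells and time bins and counts co-located user pairs that involve at least one high risk user. This is a simplified stand-in and not the paper's definition of the high risk social contact index: the record layout `(user_id, minute, x_m, y_m)`, the use of grid cells instead of pairwise distances, and the "at least one high risk user" rule are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def count_high_risk_contacts(records, high_risk, cell_m=125.0, bin_min=30):
    """Count co-located user pairs that include at least one high risk user.

    records:   iterable of (user_id, minute, x_m, y_m), with x_m and y_m in
               projected metre coordinates (hypothetical layout).
    high_risk: set of user ids flagged as high risk.
    cell_m:    spatial threshold, used here as the grid cell size in metres.
    bin_min:   temporal threshold, used here as the time bin width in minutes.
    """
    buckets = defaultdict(set)
    for user, minute, x, y in records:
        key = (int(x // cell_m), int(y // cell_m), int(minute // bin_min))
        buckets[key].add(user)

    contacts = 0
    for users in buckets.values():
        for a, b in combinations(sorted(users), 2):
            if a in high_risk or b in high_risk:
                contacts += 1
    return contacts

# Toy records (not real data): four users observed in the same 30 min bin.
toy_records = [("u1", 0, 10, 10), ("u2", 10, 100, 100),
               ("u3", 0, 400, 400), ("u4", 0, 900, 900)]
toy_high_risk = {"u1"}
counts = {cell: count_high_risk_contacts(toy_records, toy_high_risk, cell_m=cell)
          for cell in (125.0, 500.0, 1000.0)}
```

In this toy example the count grows from 1 to 2 to 3 as the cell size increases from 125 m to 500 m to 1 km. A large gap between the 1 km and 125 m counts therefore means that many high risk users are within 1 km of other users but not within 125 m.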
Fig. 7
Sensitivity of the high risk social contact index for different temporal (30 min, 1 h, 2 h) and spatial (1 km, 500 m, 250 m, 125 m) threshold parameters.
To further understand the effects of the spatial and temporal threshold parameters on the early warning performance for the two COVID-19 waves, a lagged correlation analysis was conducted for each of the 12 (= 4 spatial × 3 temporal) threshold parameter pairs. Fig. 8 plots the lagged correlation coefficient (top row) and the time lag (bottom row). Fig. 9 shows the lagged correlation analysis plots for the parameter combinations. The time lags between the high risk social contact index and the number of cases are shown for the 1st wave, the 2nd wave, and both waves. We observe that the correlation is higher when we compute the index using more spatially granular information. Interestingly, the right plot on the top row shows that for both waves, the temporal threshold of 60 min performs best (also for the second wave). The time lags under different parameters are mostly stable for each wave.
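A lagged correlation analysis of this kind can be sketched as follows. The sketch assumes two equally long daily series, uses the Pearson coefficient, and searches lags from 0 to `max_lag` days; these choices and the function names are assumptions made here, and the paper's exact procedure may differ. The toy series at the end are synthetic and only show that the search recovers a known shift.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(index, cases, max_lag=30):
    """Correlate index[t] with cases[t + lag] for each lag in 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = index[: len(index) - lag]   # index values that have a matching case count
        y = cases[lag:]                 # case counts observed `lag` days later
        result[lag] = pearson(x, y)
    return result

def best_lag(index, cases, max_lag=30):
    """Return (lag, correlation) for the lag with the highest correlation."""
    corr = lagged_correlation(index, cases, max_lag)
    lag = max(corr, key=corr.get)
    return lag, corr[lag]

# Synthetic example: the case curve is the index curve shifted by 14 days.
days = range(120)
toy_index = [math.exp(-((t - 40) / 8) ** 2) for t in days]
toy_cases = [math.exp(-((t - 54) / 8) ** 2) for t in days]
toy_lag, toy_corr = best_lag(toy_index, toy_cases)
```

Running the same search once per threshold pair would give the 12 coefficient and lag values summarized per wave in the sensitivity plots.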
Fig. 8
Sensitivity of the lagged correlation coefficient (top row) and time lag (bottom row) between the high risk social contact index and number of cases for different temporal (30 min, 1 h, 2 h) and spatial (1 km, 500 m, 250 m, 125 m) threshold parameters for the 1st wave, 2nd wave, and both waves.
Fig. 9
Sensitivity of the lagged correlation between the high risk social contact index and number of cases for different temporal (30 min, 1 h, 2 h) and spatial (1 km, 500 m, 250 m, 125 m) threshold parameters.
Discussion
Advantages and disadvantages of proposed metrics
In this study, we investigated the effectiveness of various metrics for predicting outbreak hotspot locations of COVID-19, using a unique dataset that contains both the web search queries and GPS location information of users. The two experiment setups, predicting at the macroscopic scale (Tokyo metropolitan area) and at the microscopic scale (outbreaks at the 125 m cell scale), revealed the strengths and weaknesses of the three metrics: high risk user (HRU) count, social contact index (SCI), and high risk social contact index (HR-SCI).
It was shown in Fig. 3 and Table 2 that HRU is fairly effective in capturing the macroscopic trends of COVID-19 outbreaks for each of the waves independently. We observed high correlation between the daily number of mobile phone users who had a high number of COVID-19 related web search sessions and the daily new case count provided by the Tokyo Metropolitan Government. In particular, cross correlation analysis showed that the fluctuations in the high risk user count preceded the case count by 1–2 weeks, which can be useful in predicting outbreaks at the macroscopic scale. However, the HRU index was not effective in predicting outbreak locations at the microscopic scale, as shown in Section 6.2. This is because at the microscopic scale (125 m grid cells), the smaller average number of users observed in each cell increases the noise in the data. The metric could become biased even by small groups of people who search about COVID-19 out of pure interest.
The social contact index (SCI), which is used in many studies, could be effective in predicting outbreaks if COVID-19 cases occur in high-density areas, since SCI essentially captures the population density of an area. However, as shown in the analysis in Section 6.2, SCI has two major drawbacks in predicting hotspots. First, the SCI is relatively static over time, staying consistently large in locations that are more congested, including central business districts and hub stations (e.g., Tokyo, Shibuya, Shinjuku), and consistently low in more rural areas. Although it is true that COVID-19 has a higher probability of spreading in high-density areas, this static nature of the SCI metric makes it difficult to predict the timing of such outbreaks. In the case of Shinagawa (one of the busiest central business districts in Tokyo), outbreaks could not be detected using SCI because the area was congested throughout the COVID-19 crisis. The outbreak in Mizonokuchi (a town on the outskirts of Tokyo) also could not be detected using SCI because of its consistently low SCI.
Considering the pros and cons of the HRU and SCI metrics, the high risk SCI (HR-SCI) metric is able to capture both 1) the existence of high risk users and 2) high population (contact) density. Thus, especially at the microscopic scale, HR-SCI was shown to be more effective in predicting hotspot locations in Section 6.2. Therefore, HR-SCI could be utilized to implement early warning systems for COVID-19 outbreak hotspot locations.
Limitations and future works
The presented empirical results should be considered in light of some limitations. First, we found that although the HRU and HR-SCI metrics had high predictability of COVID-19 cases in Fig. 3, the optimal time lags in the two waves varied (first wave: 9–16 days, second wave: 16–23 days). Many factors could contribute to this difference, but it suggests a change in the web search behavior of users between the two waves. As the virus spreads and more information becomes available due to the increase in cases, the types of symptoms that users search for in relation to COVID-19 could increase (e.g., it was unknown that loss of smell was related to COVID-19 in the first weeks of the outbreak). Moreover, the difference in the absolute number of high risk users between the two waves shows that awareness levels of COVID-19 had also changed, and that as people become more used to the disease, fewer people search about COVID-19 symptoms. Despite these issues, the finding that we can predict an outbreak around 1–2 weeks beforehand is valuable and has various potential applications, as discussed later. Future research could focus more on modeling web search behavior to improve the accuracy of outbreak prediction.
Second, computational costs increase drastically as we increase the spatial resolution of the analysis. For example, in Tokyo, there are around 2400 1 km × 1 km grid cells, which were further divided into over 150,000 125 m × 125 m grid cells. Running the analysis with high computing resources could be one solution to this issue.
Finally, collecting ground truth data for COVID-19 outbreaks at microscopic spatial scales was challenging, with information sources scattered across the websites of various institutions and agencies. Constructing a comprehensive database that contains when, where, and how many patients were reported at fine spatial scales (with careful consideration of privacy concerns) would be valuable for future research.
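The growth in grid cells mentioned in the second limitation follows from simple arithmetic, sketched below. The helper name is hypothetical, and the sketch assumes that each coarse cell is split evenly into square sub-cells; the figure of around 2400 1 km cells is taken from the text.

```python
def subcell_count(n_coarse_cells, coarse_m, fine_m):
    """Number of fine cells when each coarse square cell is split evenly."""
    per_side = coarse_m // fine_m          # e.g. 1000 m / 125 m = 8 cells per side
    return n_coarse_cells * per_side ** 2  # 8 x 8 = 64 fine cells per coarse cell

# Cell counts for the four spatial resolutions used in the sensitivity analysis.
cells_by_resolution = {fine_m: subcell_count(2400, 1000, fine_m)
                       for fine_m in (1000, 500, 250, 125)}
```

With about 2400 coarse cells this gives 153,600 cells at 125 m, consistent with the "over 150,000" figure above: each halving of the cell size multiplies the number of cells by four.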
Potential applications
The usability of the high risk social contact index is not limited to COVID-19; in theory, it can be applied to any contagious disease. For example, flu has been the subject of many previous studies on predicting disease spread using web search data and mobility data (e.g., Lampos & Cristianini, 2010; Panigutti, Tizzoni, Bajardi, Smoreda, & Colizza, 2017). Although the web search intensity, frequency, and the disease related parameters (e.g., effective reproduction number) could be different in other diseases, it would be valuable to test this method on other epidemics. Moreover, testing the applicability of this method in different regions around the world could be an interesting extension to this study.
In addition to the early warning system mentioned previously, these metrics can be used to develop a personalized alert and navigation app on smartphones. Although there are existing apps that inform users about their risk of contracting the disease based on their mobility information (e.g., the COCOA app developed by the Japanese Government), they do not provide early warning information on where potential outbreaks could occur. Using the high risk social contact index, we would be able to provide navigation and personalized early warnings to users so that they can avoid visiting or passing through high risk areas.
Conclusion
As COVID-19 continues to affect public health in cities across the world, early warning systems that can predict where the next outbreak will occur are of significant importance for government agencies. In this study, we used web search data and GPS location data, which are linked with common user IDs, to predict outbreak hotspot locations using the high risk social contact index. Validation using data from Tokyo, Japan showed that, compared to previously proposed metrics, the high risk social contact index is capable of predicting the timing of outbreaks 1–2 weeks beforehand at a microscopic (125 m) spatial scale. This study proposes a novel method to predict disease outbreak hotspots, which can be used to develop early warning systems that may assist government agencies in preparing effective strategies for preventing disease spread.
Authors: Flavio Finger; Tina Genolet; Lorenzo Mari; Guillaume Constantin de Magny; Noël Magloire Manga; Andrea Rinaldo; Enrico Bertuzzo Journal: Proc Natl Acad Sci U S A Date: 2016-05-23 Impact factor: 11.205
Authors: Amy Wesolowski; Nathan Eagle; Andrew J Tatem; David L Smith; Abdisalan M Noor; Robert W Snow; Caroline O Buckee Journal: Science Date: 2012-10-12 Impact factor: 47.728
Authors: Anjana Rajan; Ravi Sharaf; Robert S Brown; Reem Z Sharaiha; Benjamin Lebwohl; SriHari Mahadev Journal: JMIR Public Health Surveill Date: 2020-07-17
Authors: Ana I Bento; Thuy Nguyen; Coady Wing; Felipe Lozano-Rojas; Yong-Yeol Ahn; Kosali Simon Journal: Proc Natl Acad Sci U S A Date: 2020-05-04 Impact factor: 11.205
Authors: Michele Tizzoni; Paolo Bajardi; Adeline Decuyper; Guillaume Kon Kam King; Christian M Schneider; Vincent Blondel; Zbigniew Smoreda; Marta C González; Vittoria Colizza Journal: PLoS Comput Biol Date: 2014-07-10 Impact factor: 4.475
Authors: Cecilia Panigutti; Michele Tizzoni; Paolo Bajardi; Zbigniew Smoreda; Vittoria Colizza Journal: R Soc Open Sci Date: 2017-05-17 Impact factor: 2.963