Inclusion of environmentally themed search terms improves Elastic net regression nowcasts of regional Lyme disease rates.

Eric Kontowicz^1,2, Grant Brown^3, James Torner^1, Margaret Carrel^4, Kelly K Baker^5, Christine A Petersen^1,2,6.

Abstract

Lyme disease is the most widely reported vector-borne disease in the United States. 95% of confirmed human cases are reported in the Northeast and upper Midwest (25,778 of 27,203 total US confirmed cases). Human cases typically occur in the spring and summer months when an infected nymphal Ixodid tick takes a blood meal. Current federal surveillance strategies report data on an annual basis, leading to a lag of nearly a year in national data reporting. These lags in reporting make it difficult for public health agencies to assess and plan for the current burden of Lyme disease. Implementation of a nowcasting model, using historical data to predict current trends, provides a means for public health agencies to evaluate current Lyme disease burden and make timely priority-based budgeting decisions. The objective of the study was to develop and compare the performance of nowcasting models using free data from Google Trends and Centers for Disease Control and Prevention surveillance reports. We developed two sets of elastic net models for five regions of the United States: (1) using only monthly proportional hit data from the 21 disease symptom and tick-related terms, and (2) using monthly proportional hit data from terms identified via Google Correlate together with the disease symptom and vector terms. Elastic net models using the full-term list were highly accurate (Root Mean Square Error: 0.74, Mean Absolute Error: 0.52, R²: 0.97) for four of the five regions of the United States and improved accuracy 1.33-fold while reducing error 0.5-fold compared to predictions from models using disease symptom and vector terms alone. Many of the terms included and found to be important for model performance were environmentally related. These models can be implemented to help local and state public health agencies accurately monitor Lyme disease burden during times of reporting lag from federal public health reporting agencies.

Year: 2022 PMID: 35271589 PMCID: PMC8912246 DOI: 10.1371/journal.pone.0251165

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Lyme disease is the most widely reported vector-borne disease in the United States [1], with 95% of human cases (25,778 of 27,203 total US confirmed cases) occurring in the Northeast and upper Midwest [2]. Borrelia burgdorferi sensu lato (including Borrelia mayonii, hereafter B. burgdorferi) is the causative agent of Lyme disease. It is transmitted to people predominantly when nymphal or, to a lesser extent, adult ticks infected with B. burgdorferi take a blood meal [3, 4]. Hard-to-detect nymphal Ixodes ticks quest for blood meals during spring and early summer months. People are at greatest risk of contracting Lyme disease during and immediately following this time [5-9] when spending time outdoors for either work or recreation [2]. Sandy soil and wooded vegetation are environmental factors associated with higher tick densities [10]. As the geographic range of Lyme disease has expanded, incidence has increased since 2000 [11]. Lyme disease places a large economic burden on patients and their surrounding communities [12, 13]. Surveillance of Lyme disease in the United States requires participation from many different areas of the health care system [14]. This surveillance relies on case reports from physicians, lab reports from diagnostic labs, and collation of these data as cases by local and state health departments. These case reports are forwarded to the Centers for Disease Control and Prevention (CDC), which then aggregates the data and produces summary reports on national Lyme disease incidence. Due to differences in reporting from states and localities, compilation of data at the federal level can take several years, resulting in a time lag for release of nationwide surveillance and summary reports.
This lag in federal reporting has been problematic for local health departments (LHDs), as they must predict current and emerging public health needs based on federal data that are several years old [15]. LHDs not only play a vital role in surveillance of Lyme disease, but also help mitigate disease incidence through the implementation of local interventions. Funded prevention efforts and campaigns by LHDs can have a positive effect on health in communities [16]. Unfortunately, there are often many important competing health priorities in communities. As such, LHDs must make critical decisions to allocate their limited funds to areas of highest need. Modeling methods that accurately nowcast, or predict the present, Lyme disease incidence in a region would allow for better planning on the part of LHDs to allocate their efforts. Using statistical learning methods for nowcasting can also discover, or highlight, patterns that are associated with disease and can be used to generate future hypotheses. Infodemiology is an emerging area of science research focused on utilizing information from an electronic medium (typically the internet) with an aim to inform or improve public health [17, 18]. Examples of infodemiology include monitoring Twitter, Facebook posts, or Instagram hashtags for syndromic surveillance, identifying access or misinformation about vaccination or other public health initiatives, and measuring the effectiveness of public health education messages [17]. In developed countries, approximately 94% of younger generations have access to and use the internet according to the International Telecommunication Union [19]. This increase in internet usage has changed the way individuals seek and receive health information and provides researchers with new opportunities to improve disease prediction and public information [20].
Usage of non-traditional indicators of disease spread, like Google search traffic history, has gained credibility from public health audiences [21, 22]. Google search data has been used with a variety of mathematical and statistical models to predict obesity rates, unemployment rates, and infectious diseases with varying levels of accuracy [23-26]. The principal insight of these approaches is that search data is available at a wide temporal and geographical scale, and such queries may be correlated with a phenomenon or disease process of interest or with human behaviors [27]. This correlation can be leveraged to make predictions of current or future health outcome rates. In addition, relative frequencies of search terms may generate interesting hypotheses concerning human behaviors and their relationship with disease outcomes. Given the complex and potentially high-dimensional nature of search data, statistical and machine learning tools are a natural fit for model development. There are a variety of parametric and non-parametric statistical learning approaches used in the literature for infectious disease prediction, as discussed in a recent review [28]. In this work, we do not provide a comprehensive review of such options, but rather seek to demonstrate that nowcasting is a promising opportunity for Lyme disease specifically. For this reason, we employ elastic net regression. Elastic net regression provides a flexible parametric approach which strikes a compromise between the L1 and L2 penalties of Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regression, respectively. It is also computationally straightforward, being easily employed on modest hardware. An additional advantage of elastic net regression is the grouping effect, where strongly correlated features tend to remain in or be excluded from the model together [29].
In this study, we built elastic net regression models capable of nowcasting Lyme disease rates in five different regions of the United States. We developed two models for each region: (1) using search traffic data from only disease name, symptom and vector-related terms, and (2) using search traffic from terms identified via Google Correlate™ in addition to disease name, symptom and vector-related terms, to identify trends using information recently sought by the general public on the disease, its symptoms, and correlated terms [30, 31]. We hypothesized that nowcasting models would have better predictive accuracy and lower error when using a full list of search terms that the average person would search compared to models that only use terms related to disease name, symptoms and vectors of Lyme disease. Further, we hypothesized that the three most important terms from accurate models would be potential exposure/location themed and that their search patterns would align temporally with the timing of Lyme disease incidence in endemic areas, the Northeast and Midwest, and less well in non-endemic areas, the Southwest and West.

Materials and methods

Outcome data

All Lyme disease incidence data for this study was provided by the United States Centers for Disease Control and Prevention (CDC) (https://www.cdc.gov/lyme/stats/tables.html). In 2008, the CDC switched to a Suspected, Probable, or Confirmed case reporting approach. A case was considered confirmed if it involved erythema migrans (bullseye rash) with a known exposure, erythema migrans with laboratory evidence of infection but without known exposure, or at least one late manifestation with laboratory evidence of infection. Any other case of physician-diagnosed Lyme disease with laboratory evidence of infection was considered a probable case. Both confirmed and probable case definitions were included to provide a more sensitive and inclusive criterion. Laboratory evidence of infection in both definitions allowed for strong confidence in a Lyme disease diagnosis. Even so, heterogeneity remains in reporting strategy; between 2015 and 2016, Massachusetts changed its reporting strategy to only report laboratory-confirmed cases to the CDC. Only reporting laboratory-confirmed cases is likely to lead to underreporting of the true burden of disease [32]. Lyme disease incidence is reported by the CDC by county of diagnosis for each US state. For the purposes of this study, we aggregated these counts to state and month based on date of diagnosis. Next, regional incidence rates were calculated for five different regions: Northeast, Midwest, Southeast, Southwest and West. Regions were developed as a hybrid of known high-incidence regions and the US Census regions [14, 33]. Regional monthly Lyme disease incidence rates were calculated using combined state-level population data from the 2010 US Census.
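The regional rate calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout, and the example counts are hypothetical, and the per-100,000 scaling is taken from the units used later in the figure captions.

```python
def regional_rate_per_100k(state_cases, state_population, region_states):
    """Monthly regional incidence per 100,000 population.

    state_cases: dict mapping state -> case count for one month
    state_population: dict mapping state -> census population
    region_states: iterable of states assigned to the region
    """
    cases = sum(state_cases[s] for s in region_states)
    population = sum(state_population[s] for s in region_states)
    return 100_000 * cases / population


# Hypothetical two-state region (values are illustrative, not CDC data).
cases = {"State A": 30, "State B": 20}
population = {"State A": 2_000_000, "State B": 3_000_000}
rate = regional_rate_per_100k(cases, population, ["State A", "State B"])
print(rate)  # 50 cases over 5,000,000 people
```

Summing cases and populations before dividing weights each state by its population, which is what a pooled regional rate requires; averaging state-level rates would give small states disproportionate influence.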
Data was split into training and hold-out sets; models were fit on observations between February 2004 and December 2014 and validated on the hold-out observations which had available surveillance data from January 2015 to December 2017.

Regions of the United States.

United States divided into five regions by the Geography Division of the U.S. Census Bureau (Northeast, Midwest, Southeast, Southwest, and West), used to calculate regional Lyme disease incidence and regional search term data. Map created in ArcGIS software using U.S. Census Bureau data.

Google search term data

Regional Lyme disease incidence trends from the training period were used with Google Correlate™ to identify the top 100 correlated search terms, for which monthly proportional search hit data was later collected via Google Trends™ [34]. Google Correlate™ was not able to identify terms at the state level; these correlations can only be made on a nationwide basis for a submitted time series. Thus, we were not able to limit our search term identification by region. However, using regional Lyme disease time series data provided many regionally specific terms in the top 100 correlated terms for each region (S1 Table). Correlation was considered strong when the correlation value was greater than 0.8, moderate when it was between 0.5 and 0.8, and poor when it was less than 0.5. Strength of correlation was determined by correlation value (r), and significance of correlation was determined by p-value < 0.05. Google Correlate™ implements an Approximate Nearest Neighbor (ANN) system to identify candidate search terms that match the temporal trends of supplied data. This system implements a two-pass hash-based search. The first pass computes the approximate distance from the supplied time series to a hash of each series in Google’s database [34]. The second pass computes the exact distance function using the top results supplied from the first pass [34]. For each region, the 100 terms identified from Google Correlate™ and the 21 Lyme disease symptom and Ixodid vector-related terms were entered into gtrendsR (an interface to obtain Google Trends™ queries via R) [35] to collect proportional monthly search hit data for each term per region [35, 36]. Data was collected for regional search traffic in a systematic way similar to Mavragani and Ochoa (2019) [37]. Keywords were identified via Google Correlate™ plus 21 Lyme disease symptom and Ixodid vector-related terms.
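The correlation thresholds above can be written out as a short sketch. The helper names are hypothetical, and two details are assumptions because the text does not specify them: the category is assigned from the signed value of r (so a negative correlation counts as poor), and values exactly at 0.5 or 0.8 are treated as moderate.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r):
    """Label r using the thresholds stated in the text.

    Assumption: boundary values (exactly 0.5 or 0.8) count as moderate.
    """
    if r > 0.8:
        return "strong"
    if r >= 0.5:
        return "moderate"
    return "poor"


# Illustrative series only (not study data): search hits vs. incidence.
hits = [10, 20, 30, 40]
incidence = [1.0, 2.1, 2.9, 4.2]
r = pearson_r(hits, incidence)
print(round(r, 3), correlation_strength(r))
```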
These terms were entered into Google Trends™ via gtrendsR without alteration or use of quotations or combination. Regional search hit data was collected at the overall state level (including metropolitan, urban, suburban and rural searches) for each term and averaged to regional aggregates. Geographic regions were: Northeast, Southeast, Midwest, Southwest and West. The selected period of search data was 2004 to 2019, collected at monthly resolution to match the temporal aggregation of Lyme disease data. Search categories were not implemented in this research. Search hit data was collected between September 18, 2019 and September 26, 2019. This was then used as feature data for nowcasting Lyme disease incidence trends [38, 39].

Modeling

For each region, two groups of elastic net regression models were fit for comparison: (1) a model using only monthly proportional hit data from the 21 disease symptom and tick-related terms list, and (2) a model using monthly proportional hit data from terms identified via Google Correlate™ in combination with the disease symptom and tick term list (this will be referred to as the full-term list for the remainder of the paper). The training data was from February 2004 through December 2014. To help prevent overfitting, we implemented a rolling training window for the statistical learning process with a twelve-month learning window and a one-month validation window. To further address the potential for overfitting, we excluded data from January 2015 through December 2017 from the model training process. The hold-out data set was not used in any model training or in-sample validation and was only used to determine how models would respond to new data and whether the models overfitted to the training data. We collected all search data in September of 2019; therefore, all nowcasting done by the models presented in this article does not extend beyond September 2019. All elastic net models were built and run in R version 3.6.2 using the caret and glmnet packages [40, 41]. Model fit was determined using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R². All graphics of model fit and search term correlation were created using ggplot2 in R version 3.6.2, and search terms are presented as directly provided by Google Correlate™. Elastic net regression is a penalized form of ordinary least squares regression and contains a hybrid of ridge and Least Absolute Shrinkage and Selection Operator (LASSO) regression penalties [29]. Elastic net regression was implemented to either reduce the impact of or outright eliminate non-essential feature data, as it strikes a compromise between the L1 and L2 penalties of LASSO and ridge regression, respectively.
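The rolling window and the three fit metrics can be illustrated with a small sketch. The function names are hypothetical and this is not the authors' R code. Advancing the window one month at a time is an assumption, and R² is written here as the coefficient of determination; some software instead reports the squared Pearson correlation between observed and predicted values, which can differ.

```python
import math


def rolling_windows(n_obs, train_len=12, valid_len=1):
    """Yield (train_indices, valid_indices) for a fixed-length rolling window.

    Assumption: the window advances by one observation (month) per step.
    """
    start = 0
    while start + train_len + valid_len <= n_obs:
        train = list(range(start, start + train_len))
        valid = list(range(start + train_len, start + train_len + valid_len))
        yield train, valid
        start += 1


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# February 2004 through December 2014 is 131 monthly observations.
windows = list(rolling_windows(131))
print(len(windows), windows[0][1], windows[-1][1])
```

Each window trains on twelve consecutive months and validates on the month that follows, so validation data always lies after the data used for fitting.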
Alpha and lambda hyperparameters are used in elastic net regression to balance the tuning of the L1 (LASSO) and L2 (ridge) norm penalty parameters (Eq 1A). Alpha determines the relative weights of the two penalty parameters, and lambda determines the overall weight of the summation of the individual penalties. For each region and elastic net model group (disease symptom and vector terms alone vs. full-term list), we tested a combination of 50 and 150 different automatically generated values of alpha and lambda to select optimal values. Regional monthly Lyme disease incidence rates, as calculated from CDC surveillance data, were the outcome of the nowcasting models. Feature data was regional monthly search hit data from each region. We only used data from search terms for which gtrendsR was able to return proportional monthly hit data. Although some terms were correlated at the national level and were therefore identified via Google Correlate™, their monthly proportional hit data held a constant value of zero at the regional level. These zero-variance terms would cause model failure and thus were excluded from the modeling process.
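The equation referenced as Eq 1A is not reproduced in this text. As a hedged reconstruction, the objective below is the standard elastic net form for a Gaussian outcome in the parameterization documented for the glmnet package; we assume, but cannot confirm from the text, that it matches the authors' Eq 1A. Here N is the number of monthly observations, y_i the regional incidence rate, x_i the vector of search term features, and beta_0 and beta the intercept and coefficients.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
\;+\;\alpha\,\lVert\beta\rVert_1\right]
```

Setting alpha = 1 gives the LASSO penalty alone and alpha = 0 gives the ridge penalty alone; lambda scales the combined penalty, consistent with the roles described above.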

Variable importance

Elastic net regression can reduce or outright eliminate feature data from final models. We wanted to determine which search terms had the greatest influence in the final, best-tuned models. To determine each search term's influence, the varImp function from the caret package was used to calculate the scaled importance of each term in the final models. The varImp function takes the absolute value of each coefficient, ranks these values, and stores them as variable importance scaled from zero to one hundred. Put simply, larger coefficients have greater influence and thus are associated with increased importance.
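The scaling described above can be sketched in a few lines. This is an illustration of the idea, not caret's implementation: the function name is hypothetical, min-max scaling of the absolute coefficients is our assumption about how the zero-to-one-hundred range is produced, and the coefficient values are invented for the example.

```python
def scaled_importance(coefficients):
    """Scale absolute coefficient values to the range 0-100.

    coefficients: dict mapping term -> fitted coefficient.
    Assumption: min-max scaling, so the largest |coefficient| maps to 100
    and the smallest to 0.
    """
    abs_values = {term: abs(value) for term, value in coefficients.items()}
    lowest = min(abs_values.values())
    highest = max(abs_values.values())
    if highest == lowest:
        # All coefficients have the same magnitude; no ranking is possible.
        return {term: 0.0 for term in abs_values}
    return {
        term: 100 * (value - lowest) / (highest - lowest)
        for term, value in abs_values.items()
    }


# Hypothetical coefficients (not results from the study).
example = {"camping": 2.0, "tick": -1.0, "fever": 0.0}
importance = scaled_importance(example)
top_terms = sorted(importance, key=importance.get, reverse=True)
print(importance, top_terms)
```

Because the ranking uses absolute values, a term with a large negative coefficient ranks as highly as one with a large positive coefficient; and because coefficient size depends on feature scale, such rankings are most meaningful when features share a common scale.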

Results

Between 2004 and 2017, the Northeast consistently had the highest counts of Lyme disease, followed by the Midwest. The lowest incidence areas were consistently the West and Southwest regions. All regions showed seasonal oscillation of Lyme disease incidence, with typical peaks in summer months (July, August, and September) and declines in winter. Seasonal oscillation occurred at a lower incidence in the West and Southwest regions compared to the high-incidence regions of the Northeast and Midwest. These regional temporal trends were used with Google Correlate™ to identify 100 terms with correlated search patterns. Across all regions, there were environmental themes of outdoor activities that included concerts, camping, and water parks; places where people are likely to be exposed to Ixodes ticks during the late spring, summer and early fall [10] (complete list of candidate search terms provided in S1 Table). gtrendsR was used to collect regional monthly proportional search data for each term identified from Google Correlate™ along with the symptom and vector terms (120 total terms for each region). Some terms identified with Google Correlate™ at the national level were found to have no search traffic at the regional level and were removed from the regional list.

Regional Lyme disease incidence count from CDC surveillance.

Incidence counts calculated by summing monthly state incidence from CDC surveillance in each region. Calculations and graphs made using RStudio version 3.6.2.

Regional Lyme disease incidence.

(A) All regions relative to Northeast incidence rates, (B) Southwest, (C) West. Incidence rates calculated by summing monthly state incidence from CDC surveillance in each region. Denominator values calculated from 2010 US Census state populations and aggregated to region. Calculations and graphs made using RStudio version 3.6.2.
For accurate modeling predictions, or nowcasts, it is important to use feature data that is correlated to the outcome data of interest. Pearson’s correlation was calculated between each term’s proportional monthly search traffic and regional Lyme disease rates within the training timeframe. Individual term correlation with Lyme disease incidence had a large range for each region of the US, with moderate mean and median correlation values (complete results provided in S2 Table). The ten most correlated search terms for the training period were either strongly or moderately correlated with regional Lyme disease rates, except for terms matching the trend of rates, or lack thereof, in the Southwest. Each region, except the Southwest, had sixteen terms with a correlation greater than 0.7 (complete results provided in S2 Table). Over the regions that have suitable Ixodes climate and habitat (Northeast, Midwest, Southeast, and West regions), we found high maximum correlation values (0.893, 0.898, 0.840, and 0.836, respectively) for the top correlated search terms. Many of the 21 terms based on known Lyme disease symptoms or vectors had poor bivariate correlation with regional Lyme disease incidence. For example, fever, which is more often searched in winter months [42], was negatively correlated with Lyme disease incidence in every region for the entire timeframe of the study.

Negative bivariate correlation of fever to Lyme disease incidence for all regions of the United States.

Correlation calculated using Pearson method with independent variable as proportional Google hits for each term and dependent variable Lyme incidence per 100,000 for each region. ** p << 0.05 * p < 0.05.

The variance of feature data is also important for making accurate predictions. Features that have little to no variance over time make for poor predictors. The variability of each term was assessed per region. Two terms in the Northeast, one term in the Midwest, and ten terms in the Southeast had zero variance. These terms were excluded from the nowcasting process. To evaluate the hypothesis that nowcast predictions would be more accurate when including the full list of candidate search terms as compared to a list of Lyme disease specific terms, two sets of elastic net regression models were constructed: (1) models with only Lyme disease symptom and vector terms as features, and (2) models with the full list of non-zero variance terms identified from Google Correlate™ coupled with symptom and vector terms (S1 Table). Regression models developed using data from symptom and vector terms exclusively produced accurate nowcasts (assessed via R²) with low error (assessed via RMSE and MAE) in four of the five US regions (results for both models provided in S3 Table). The predictions from these models provide accurate estimations of the timing of the seasonal pattern of Lyme disease.

Elastic net modeling using disease symptom and vector terms only produces accurate nowcasting model for Lyme disease.

(A) Northeast, (B) Midwest, (C) Southeast, (D) Southwest, and (E) West. Same one-year period from Northeast region with accurate nowcast model. Two elastic net models were developed for each region. Elastic net models trained using CDC surveillance data and search term data from February 2004 through December 2014. Vertical dashed line starts at January 2015 and indicates the start of the hold-out data set. Nowcasting performed using search term data from January 2018 until September 2019.

Nowcasting models developed using the full list of search terms produced predictions that had a 1.33-fold improvement in accuracy and a 0.5-fold reduction in error compared to the symptom and vector only models (results for both models provided in S4 Table). For each region, using the full-term lists, which often included environmentally themed terms, increased the accuracy and reduced the error of predictions. On average, model accuracy (R²) improved by 0.2 when using the full list of search terms. The greatest improvement in accuracy when using the full-term list models was seen in the West (R² difference was 0.31). The Southeast had the least improvement in accuracy (0.12). RMSE was reduced by 0.18 on average across all regions and MAE was reduced by 0.14 when comparing predictions between the full-term list models and the symptom and vector only models. The greatest reduction in error was seen in the Northeast region, where predictions from the full-term list models reduced RMSE by 0.69 and MAE by 0.56 cases per 100,000 population compared to the symptom and vector only models. Reductions in error for the Southwest and West were approximately 0 (RMSE = 0.001 and 0.004, respectively; MAE = 0.001 and 0.002, respectively). Predictions from the full-term list models also produced accurate timing of seasonal patterns of Lyme disease, but with improved mimicking of peaks and recessions. Compared to the symptom and vector term only models, predictions from the full-term list models showed more accurate variation in the spring and summer peaks of Lyme disease across all regions. In both modeling efforts, the Southwest consistently had the poorest predictive accuracy.

Elastic net modeling using the full-term list produces predictions with greater accuracy and less error.

(A) Northeast, (B) Midwest, (C) Southeast, (D) Southwest, and (E) West. Same one-year period from Northeast region with accurate nowcast model. Two elastic net models were developed for each region. Elastic net models trained using CDC surveillance data and search term data from February 2004 through December 2014. Vertical dashed line starts at January 2015 and indicates the start of the hold-out data set. Nowcasting performed using search term data from January 2018 until September 2019.

In some years, Lyme disease incidence in the Northeast and Midwest showed secondary peaks or plateaus following the summer spike of incident cases. These secondary spikes or plateaus typically occur in late summer and early fall months as infected adult ticks take blood meals, transmitting Lyme disease to people. Predictions from models using only symptom and vector terms did not have sufficient sensitivity to detect these changes. In contrast, predictions from the full-term list models had sufficient sensitivity to detect these secondary spikes or plateaus of decreasing incidence at the regional level.

Elastic net modeling using full-term list is sensitive to secondary spikes of Lyme disease incidence in Northeast and Midwest regions.

(A) Northeast Lyme disease incidence (black line) and disease symptom and vector terms only model predictions (red line), (B) Northeast Lyme disease incidence (black line) and full-term list model predictions (red line), (C) Midwest Lyme disease incidence (black line) and disease symptom and vector terms only model predictions (purple line), and (D) Midwest Lyme disease incidence (black line) and full-term list model predictions (purple line). Elastic net models trained using CDC surveillance data and search term data from February 2004 through December 2014 and hold-out data from January 2015 through December 2017.

Statistical learning techniques can help highlight specific areas in which future hypotheses or interventions could be generated. We identified the three most important terms from the accurate full-term list nowcasting models. As hypothesized, many of the top three most important terms for producing accurate nowcasts were regionally specific and environmentally themed. The Northeast and Southeast were the only regions that had a potential symptom term (bulls-eye rash, rash) identified in the top three important terms. We further hypothesized that, due to the importance of these environmentally related themes, the time series of these search terms would mimic the same general trends as Lyme disease. These patterns are particularly evident in areas with higher incidence of Lyme disease: the Northeast, Midwest and Southeast. The search traffic for these top three terms aligned with the peaks and recessions of Lyme disease on the same monthly scale.

Time series of regional candidate search terms for simple Lyme disease tracking.

(A) Northeast, (B) Midwest, and (C) Southeast. The top three most important terms from each regional model identified by the varImp function in R. (a-c) Candidate terms scaled to align with regional Lyme disease incidence. Terms presented directly as provided by Google Correlate™.

Discussion

With the growing incidence of Lyme disease in the United States, novel methods that help health departments prepare for years of increased Lyme disease exposure are critical. We found that accurate predictions of Lyme disease can be generated when using Google search history data in nowcasting. Importantly, the search traffic for the top three search terms generally followed the same temporal pattern as regional Lyme disease incidence. These terms and nowcasting methods could help health departments determine approximate trends of Lyme disease in their area by monitoring the search traffic trends of the terms via the free tool Google Trends™. Additionally, many of the terms that remained in these accurate models were environmentally themed and can be used to generate future hypotheses for intervention and prevention activities. Overall, each elastic net model performed well and provided accurate estimations of the regional Lyme disease incidence provided by surveillance data from the CDC. Results showed that predictions were more accurate from models using a full list of colloquial search terms the average person is likely to search compared to models that only used symptom, disease or vector terms. Predictions from models that included the full-term list were also more sensitive to secondary spikes and recession plateaus in the fall months in the Northeast and Midwest. Moreover, many of the search terms identified via Google Correlate™ which had high levels of bivariate correlation and remained important throughout the elastic net modeling process were environmentally related. While not all these terms directly relate to an activity that has obvious risk of tick exposure and transmission of Lyme disease, environmentally related terms can serve as a proxy for people's intention to spend time outdoors. Increased time spent outdoors has been shown to increase exposure to ticks in the environment [43-45].
Causal inference cannot be directly drawn from these results; however, given the common pattern of environmental terms and their often high correlations, a pattern has emerged. These terms can help LHDs generate hypotheses on where to perform future tick surveillance, implement intervention measures, or spread tick awareness. These findings suggest the importance of including colloquial search terms over symptom or vector-related terms alone for current and future prediction efforts. Our models can be implemented by LHDs as they currently are, or terms more specific to the local population's search habits can be substituted to further improve performance. The Southwest, a non-endemic region for Lyme disease [14], consistently had the poorest performing predictions. Ixodes ticks in the Southwest are suspected to feed more often on lizards and other non-reservoir hosts [46]; thus, it is not surprising that Lyme disease incidence was low. The CDC also classifies cases by county of residence and not county of acquisition in surveillance reports; therefore, it is likely that those diagnosed in this region were exposed elsewhere. The Southwest also had the fewest features of any region. These factors all likely led to the low performance of predictions in this region. On the other hand, the West region, which also had a low number of incident cases but had a greater number of features, had better performing model predictions. The West also has suitable habitat for Ixodes pacificus, a known vector of B. burgdorferi [14]. These results indicate that in addition to having an appropriate number of features and outcomes, regions also need to have a suitable environment for the tick vectors in order to produce accurate nowcasts. These findings continue to show the importance of including environment-related feature data for current or future prediction efforts in areas that are either endemic for Lyme disease or have suitable Ixodid tick habitats.
To our knowledge, two prior studies have used Google search data to try to improve model performance [47, 48]. One study concluded that using a single term, “Borreliose”, was not helpful in improving model accuracy [47]. While “Borreliose” is a medically accurate term for Lyme disease, we found that colloquial disease terms had moderate to high levels of correlation. The bivariate correlation for disease symptoms and colloquial disease terms ranged from -0.33 to 0.85 across five US regions. Terms that were often moderately (correlation value > 0.5) or highly (correlation value > 0.8) correlated with regional monthly Lyme disease incidence included “lyme disease”, “lyme”, “rash”, and “tick”. Further, environmentally related terms often had the highest levels of correlation across all regions. Another study developed a tool, Lymelight, which monitored the incidence of Lyme disease in real time using Lyme disease symptom web searches in a two-year period to predict future Lyme disease burden and treatment impacts [48]. Despite producing accurate models, this method used only symptom terms, which may not capture true patterns of Lyme disease or risky behaviors. Our findings show that using symptom, disease, and vector terms in combination with terms that focus on environments in which one may be at risk of exposure can greatly improve model performance over symptom and vector terms alone. These findings continue to suggest the importance of direct or proxy measures of time spent outdoors when predicting vector-borne diseases. An advantage of using data from Google search history, RStudio as modeling software, and elastic net regression is that accurate predictions can be made quickly (approximately 24 hours from start to finish) and at no cost. This can allow LHDs to have more up-to-date estimates of regional Lyme disease incidence ahead of federal reporting schedules without additional financial burden.
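The correlation bands used above can be made concrete with a small sketch. The Python below is illustrative only: the monthly values are invented toy numbers, not study data, and the function names are hypothetical. It computes a Pearson correlation between one term's monthly search series and a regional incidence series, then applies the 0.5 and 0.8 cutoffs. How a value falling exactly on a cutoff is handled is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_band(r):
    """Label a correlation value using the 0.5 and 0.8 cutoffs.

    Assumption: values exactly equal to a cutoff fall into the lower band.
    """
    if r > 0.8:
        return "high"
    if r > 0.5:
        return "moderate"
    return "poor"


# Invented toy data: 12 monthly values (January to December).
incidence = [0.2, 0.3, 0.8, 2.0, 4.5, 6.0, 5.5, 3.0, 1.5, 0.8, 0.4, 0.2]
seasonal_term = [1, 2, 4, 7, 10, 12, 11, 8, 5, 3, 2, 1]  # peaks in summer
flat_term = [5, 6, 5, 6, 5, 7, 5, 6, 5, 6, 5, 6]         # no seasonal shape

for name, series in [("seasonal_term", seasonal_term), ("flat_term", flat_term)]:
    r = pearson_r(series, incidence)
    print(name, round(r, 2), correlation_band(r))
```

In this toy example the summer-peaking series lands in the high band and the flat series in the poor band, which mirrors how a term's seasonal shape, rather than its medical accuracy, drives its bivariate correlation with incidence.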
When we graphed the search traffic for the three most important terms from the regional models, we found that, as hypothesized, they provide a very good broad-scale indication of timing in the endemic areas of the Northeast and Midwest. Following these terms, or more locally specific environmental terms, could provide even quicker tracking of general temporal trends of Lyme disease for LHDs. Most of the top three important terms were environmentally related. This further suggests the importance of including terms or variables that focus on the environment in current and future prediction efforts. While statistical learning approaches have strengths, our approach has limitations as well. These models were developed at the regional level and are likely to give less accurate predictions at the state or local level without refitting. Additionally, grouping states into different regions would alter these results, as both the regional rate calculation and the search term identification using Google Correlate™ were performed under this regional aggregation strategy. These models are not generalizable to other vector-borne diseases in their current form. Similar approaches could be used for other vector-borne diseases such as anaplasmosis, which is also vectored by Ixodid ticks and therefore will have similar temporal trends and environmental risk factors. Additionally, these models are not generalizable to other countries. Because all the Lyme disease and search data were based on US disease patterns and Google search habits, it is unlikely that our models would produce accurate results in other countries. However, a similar approach could be used in other countries that have strong surveillance data and a freely accessible database for the country's most used search engine. Moreover, other sources of data on human behavior (e.g., data from social networks like Twitter) present additional opportunities for such models, potentially at greater spatial and temporal granularity.
Greater consideration or different modeling techniques may need to be implemented for communicable diseases. However, these models can be used by LHDs that are vastly underfunded to get a general idea of trends in surrounding areas. Local or regionally specific terms could easily be substituted into these models, which could help improve model fit on a case-by-case basis. These findings highlight the importance of strong disease surveillance and computational modeling efforts working together. Predictions are likely to improve over time not only because of increases in statistical and computing power, but also because of the maintenance and enhancement of strong disease surveillance efforts nationwide.
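As a rough illustration of why uninformative terms drop out of an elastic net model while informative ones are retained with shrunken coefficients, the following is a minimal pure-Python sketch of elastic net regression fitted by coordinate descent. It is not the authors' code (the study reports working in R); the data are invented toy values, and the penalty settings `lam` and `alpha` and all function and variable names are assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the objective
    (1 / (2n)) * sum(residual^2)
        + lam * (alpha * sum(|b|) + (1 - alpha) / 2 * sum(b^2)).
    Returns (intercept, coefficients). Features are centered, not scaled.
    """
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # all coefficients start at zero
    coefs = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of feature j with the partial residual.
            rho = sum(
                Xc[i][j] * (residual[i] + Xc[i][j] * coefs[j]) for i in range(n)
            ) / n
            new_coef = soft_threshold(rho, lam * alpha) / (
                col_sq[j] + lam * (1.0 - alpha)
            )
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                coefs[j] = new_coef

    intercept = y_mean - sum(c * m for c, m in zip(coefs, x_means))
    return intercept, coefs


# Invented toy data: 12 monthly values for two hypothetical search terms.
outdoor_term = [1, 2, 4, 7, 10, 12, 11, 8, 5, 3, 2, 1]
unrelated_term = [5, 6, 5, 6, 5, 7, 5, 6, 5, 6, 5, 6]
X = [[a, b] for a, b in zip(outdoor_term, unrelated_term)]
# Synthetic "incidence" generated from the first term only.
y = [1.0 + 0.5 * a for a in outdoor_term]

intercept, coefs = fit_elastic_net(X, y, lam=0.5, alpha=0.5)
print("intercept:", round(intercept, 4))
print("coefficients:", [round(c, 4) for c in coefs])
```

With these toy inputs the weakly correlated second term receives a coefficient of exactly zero, and the first term's coefficient is shrunk slightly below the 0.5 used to generate the data. Substituting a locally specific term amounts to swapping a column of `X` and refitting.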

Complete list of search terms identified by Google Correlate from each region.

Terms for each region were identified via Google Correlate using region specific Lyme disease rates from training period data. (PDF)

Bivariate correlations of each search term to the regional Lyme disease rates.

Pearson correlation values were calculated between each term's monthly proportional search data and the corresponding Lyme disease rates for each term and region. (PDF)

Predictions from symptoms and vector terms only models produce accurate predictions with low error.

(PDF)

Predictions from full list models produce highly accurate predictions with low error.

(PDF)

18 Jun 2021

PONE-D-21-11338

Inclusion of environmentally themed search terms improved Elastic Net regression nowcasts of regional Lyme disease rates

PLOS ONE

Dear Christy:

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

One of the reviewers does not think that infodemiology has any utility and rejected it. I had to find reviewers who had actually done this kind of work before. One of them had substantive comments that need to be addressed in a revision. In comments to me, it was suggested that the authors needed more familiarity with infodemiology, particularly Google Trends analyses, and that there was a good literature on its applications, limitations, and standards of practice. The other comment was that the database was at least 2 years old and it was likely that there was additional data that could be used.

Please submit your revised manuscript by Aug 02 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols . We look forward to receiving your revised manuscript. Best regards, Sam Sam R. Telford III Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please confirm in your manuscript that you have adhered to the Terms and Use/Terms of Service of Google Correlate. 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. 
If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. 4. Thank you for stating the following financial disclosure: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." At this time, please address the following queries: a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution. b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” c) If any authors received a salary from any of your funders, please state which authors and which funders. d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.” Please include your amended statements within your cover letter; we will change the online submission form on your behalf. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 3. 
Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This manuscript proposes a method to forecast Lyme disease incidence using regressions methods with Google search history data. I believe that the application of machine learning/Big Data methods is misguided for ecologically complex phenomena like the incidence tick-borne diseases. This manuscript does not provide any compelling evidence for the contribution of these techniques to the understanding of Lyme disease incidence.
As an exquisitely seasonal process, Lyme disease perpetuation and zoonotic transmission will be powerfully correlated to Google search terms. The same would be true for indicators of “nice weather” (not obtained from Google!). Reviewer #2: General comments This is a paper on inclusion of environmentally themed search terms improved Elastic Net regression nowcasts of regional Lyme disease rates. I have some comments on your manuscript. Specific comments 1. Abstract and Introduction: ”…with 95% of human cases occurring…” Please provide the absolute numbers of cases (n/N), showing where this percentage is coming from. 2. Introduction: “CDC”. Please write this out when mentioned for the first time in the main text. 3. Introduction: “LASSO”. Please write this out when mentioned for the first in the text. 4. Material and Methods: “erythema migrans”. You may consider briefly describing what this is, you may put this description in the parentheses, for example. 5. Please check up the capital letters concerning Google, United States, Table, for example. Also, some words are lowercased instead of capital letters. 6. Both abbreviations are used: “US” and “U.S.”. Please consider choosing one of them. 7. Figure 7: If you use color lines in the figures, please tell the readers which color indicates which line. 8. Please check up the reference list concerning the links and make sure that they are updated. Reviewer #3: This is an interesting approach in modeling Lyme Disease with Google Trends data. However, there are some issues that need to be addressed before this manuscript can be reconsidered for publication. The authors mention that “High correlation was determined when the correlation value was greater than 0.8, moderate if correlation value was between 0.5 and 0.8, and poor when less than 0.5”. Shouldn’t the significance of a correlation be measured by, for example, the p-values (or CIs)?
Also, “high” and “moderate” should be defined (I assume the authors mean that high is p<0.01 and that moderate is p<0.05; however, a correlation with a p-value less than .05 is considered quite strong). There is no description of the Google Trends data selection criteria and collection procedure. This is an important drawback of this manuscript. All methodology steps should be reported in detail (e.g., period, region, category, web search, use of quotes for keywords with more than one word, individual searches, comparisons, etc.). This is an information epidemiology (infodemiology) study. I suggest that the authors study the relevant literature in order to gain insight and enhance their literature review. An introductory paragraph could be added in the Introduction Section. The analysis (data collection) was conducted in September 2019, considering data up to December 2018. It is now 2021, and there are two more years’ data available. I believe it would add to the value of this manuscript if the analysis was updated. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Ivo M Foppa Reviewer #2: Yes: Samuli Pesälä Reviewer #3: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

24 Aug 2021

We thank our reviewers for their thoughtful comments. By responding to them, we believe that we have greatly strengthened the manuscript. Our responses will appear below in Garamond.

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

There were no sources of funding for this research.

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

c) If any authors received a salary from any of your funders, please state which authors and which funders.

No authors received a salary from any of our funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

The authors received no specific funding for this work.

5. Review Comments to the Author

Reviewer #1: This manuscript proposes a method to forecast Lyme disease incidence using regressions methods with Google search history data.
I believe that the application of machine learning/Big Data methods is misguided for ecologically complex phenomena like the incidence tick-borne diseases. This manuscript does not provide any compelling evidence for the contribution of these techniques to the understanding of Lyme disease incidence. As an exquisitely seasonal process, Lyme disease perpetuation and zoonotic transmission will be powerfully correlated to Google search terms. The same would be true for indicators of “nice weather” (not obtained from Google!).

We appreciate the considerations of this reviewer and completely agree that the ecological system of Borrelia is complex. This exercise was to determine how well these computational methods could reproduce and “nowcast” trends of reported Lyme disease, and surprisingly it did quite well and found trends even if the weather was not nice… like the fall adult Ixodes transmission “bump”. That is why we feel this should be reported through this work.

Reviewer #2: General comments This is a paper on inclusion of environmentally themed search terms improved Elastic Net regression nowcasts of regional Lyme disease rates. I have some comments on your manuscript. Specific comments

1. Abstract and Introduction: ”…with 95% of human cases occurring…” Please provide the absolute numbers of cases (n/N), showing where this percentage is coming from.

We have now included the total number of confirmed positive cases (n) from Northeast and upper Midwest states and the total number of confirmed positive cases for the United States (N).

2. Introduction: “CDC”. Please write this out when mentioned for the first time in the main text.

We have included the full name (Centers for Disease Control and Prevention) of the CDC before its first mention in the introduction.

3. Introduction: “LASSO”. Please write this out when mentioned for the first in the text.
We have included the full name (Least Absolute Shrinkage and Selection Operator) of the LASSO before its first mention in the introduction.

4. Material and Methods: “erythema migrans”. You may consider briefly describing what this is, you may put this description in the parentheses, for example.

We have added “bullseye rash” as a parenthetical comment in this section.

5. Please check up the capital letters concerning Google, United States, Table, for example. Also, some words are lowercased instead of capital letters.

We have made edits throughout the manuscript to maintain consistency of capitalization and apologize for these errors.

6. Both abbreviations are used: “US” and “U.S.”. Please consider choosing one of them.

We have edited the manuscript to maintain consistency of US throughout.

7. Figure 7: If you use color lines in the figures, please tell the readers which color indicates which line.

We have included text in the legend of Figure 7 to make clear that the black lines are regional incidence and the colored lines are model predictions.

8. Please check up the reference list concerning the links and make sure that they are updated.

We have updated the reference list to ensure that all links are active and functioning.

Reviewer #3: This is an interesting approach in modeling Lyme Disease with Google Trends data. However, there are some issues that need to be addressed before this manuscript can be reconsidered for publication.

1. The authors mention that “High correlation was determined when the correlation value was greater than 0.8, moderate if correlation value was between 0.5 and 0.8, and poor when less than 0.5”. Shouldn’t the significance of a correlation be measured by, for example, the p-values (or CIs)? Also, “high” and “moderate” should be defined (I assume the authors mean that high is p<0.01 and that moderate is p<0.05; however, a correlation with a p-value less than .05 is considered quite strong).
We thank the reviewer for this comment, and indicated when p-values were significant in Supplemental Table 2 to accompany the correlation values. We feel that reporting the correlation value directly is important, as this measures the strength of the linear relationship directly. P-values (and by extension confidence intervals) measure the strength of evidence for the presence of nonzero correlation, but do not give any indication of the strength of correlation itself. In a large sample with multiple comparisons, one could obtain very strong evidence for nonzero correlation when the linear relationship is actually quite weak, while in a small sample a high correlation might not reach significance.

2. There is no description of the Google Trends data selection criteria and collection procedure. This is an important drawback of this manuscript. All methodology steps should be reported in detail (e.g., period, region, category, web search, use of quotes for keywords with more than one word, individual searches, comparisons, etc.).

We have added to the methods section to make this clear. We state in the methods that we use the terms identified via Google Correlate to collect search hit data on. We have added more language to make it clear that terms identified from Google Correlate were inputted into Google Trends unaltered to collect search term hit data for each region. I also included text to make it clear that gtrendsR is an R interface for Google Trends that allows for an automated process of collecting search term data.

3. This is an information epidemiology (infodemiology) study. I suggest that the authors study the relevant literature in order to gain insight and enhance their literature review. An introductory paragraph could be added in the Introduction Section.

We have included a paragraph, lines 93-102, to the introduction outlining infodemiology and its use in predicting disease and better informing the general public about health-related outcomes.

4.
The analysis (data collection) was conducted in September 2019, considering data up to December 2018. It is now 2021, and there are two more years’ data available. I believe it would add to the value of this manuscript if the analysis was updated.

We appreciate and understand the reviewer's concern for having recent data for publication. However, the authors’ intention in this work is to show the value of including environmentally related features when nowcasting with Google search terms, which does not require the most recent data. This manuscript highlights the importance of considering environmental factors when creating prediction models for vector-borne diseases. To this end, the models have also been posted on Eric Kontowicz’s Github (https://github.com/ekontowicz/Lyme-disease-Elastic-Net-regression-Nowcasting) for further use by other researchers and additional updates. Lastly, given the time it takes to have rigorous peer review, particularly during this COVID-19 pandemic, these findings will lag all surveillance data available.

Submitted filename: PLOSOne_response to ReviewerComments final.docx

29 Sep 2021

PONE-D-21-11338R1

Inclusion of environmentally themed search terms improved Elastic Net regression nowcasts of regional Lyme disease rates

PLOS ONE Dear Christy, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 13 2021 11:59PM.

We look forward to receiving your revised manuscript. Kind regards, Sam R. Telford III Academic Editor PLOS ONE

Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided): Reviewer 2 makes the point that these kinds of studies should follow a standard methodology, e.g., Mavragani and Ochoa 2019 JMIR Public Health and Surveillance. I have examined this reference and believe that if you have a table in the methods or at lines 380 et seq that summarizes the 4 critical aspects of this kind of work, as recommended by Mavragani and Ochoa (keywords, region, period, category), this concern would be satisfied. Citing this reference as informing your study reinforces the fact that standardized methods should be the basis for analyses using GT. At the very least, because it is suggested that GT might be used by local health departments in a predictive manner, it might be good to make it easy for them to test it out by having a very simple set of search terms provided. It is clear that you spent much time analyzing keywords; in the end, what were they? (Table S1 is very comprehensive but can the most high-value keywords be highlighted in a summary table?) For region, was it overall, including metropolitan, urban, suburban and rural? For period, it looks like the specific searches were monthly from 2004-2015. According to Mavragani and Ochoa, search term category does not need to be specified if the keywords are very specific and you provided much analysis on selecting the useful keywords. I would prefer not to send this back to the one reviewer and delay a decision. If you can provide such a table, the ms could be accepted without a third round of review.

Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #2: All comments have been addressed Reviewer #3: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #2: Yes Reviewer #3: Yes

3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #2: Yes Reviewer #3: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #2: Yes Reviewer #3: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #2: Yes Reviewer #3: Yes

6. Review Comments to the Author Reviewer #2: Authors have revised the text as requested. This manuscript has definitely improved after revisions. Reviewer #3: The GT methodology is still not properly reported. Please see relevant representative literature to understand how to report the methodology (like this one for example: https://www.jmir.org/2020/8/e19611/).

7. PLOS authors have the option to publish the peer review history of their article. Do you want your identity to be public for this peer review? Reviewer #2: Yes: Samuli Pesälä Reviewer #3: No

21 Nov 2021

We thank Dr. Telford, the editor, and our reviewers for their thoughtful comments. We agree that placing this work in the rigorous context of Mavragani and Ochoa, 2019, and adding a new summary table containing what they emphasize as the four key components for this type of computational epidemiology, adds great strength to this method and specifically to our manuscript. We have added this summary table at line 293, containing the five regions and the top 10 focused keywords (so categories are not needed) over the critical training period, 2004-2012. Our responses will appear below in Garamond.

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

There were no sources of funding for this research.

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

c) If any authors received a salary from any of your funders, please state which authors and which funders.

No authors received a salary from any of our funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

The authors received no specific funding for this work.

5. Review Comments to the Author

Reviewer #1: This manuscript proposes a method to forecast Lyme disease incidence using regression methods with Google search history data. I believe that the application of machine learning/Big Data methods is misguided for ecologically complex phenomena like the incidence of tick-borne diseases. This manuscript does not provide any compelling evidence for the contribution of these techniques to the understanding of Lyme disease incidence. As an exquisitely seasonal process, Lyme disease perpetuation and zoonotic transmission will be powerfully correlated to Google search terms. The same would be true for indicators of “nice weather” (not obtained from Google!).

We appreciate the considerations of this reviewer and completely agree that the ecological system of Borrelia is complex. This exercise was to determine how well these computational methods could reproduce and “nowcast” trends of reported Lyme disease. Surprisingly, they did quite well and found trends even when the weather was not nice, like the fall adult Ixodes transmission “bump”. That is why we feel this should be reported through this work.

Reviewer #2: General comments

This is a paper on inclusion of environmentally themed search terms improved Elastic Net regression nowcasts of regional Lyme disease rates. I have some comments on your manuscript.

Specific comments

1. Abstract and Introduction: “…with 95% of human cases occurring…” Please provide the absolute numbers of cases (n/N), showing where this percentage is coming from.

We have now included the total number of confirmed positive cases (n) from Northeast and upper Midwest states and the total number of confirmed positive cases for the United States (N).

2. Introduction: “CDC”. Please write this out when mentioned for the first time in the main text.

We have included the full name (Centers for Disease Control and Prevention) of the CDC before its first mention in the introduction.

3. Introduction: “LASSO”. Please write this out when mentioned for the first time in the text.

We have included the full name (Least Absolute Shrinkage and Selection Operator) of the LASSO before its first mention in the introduction.

4. Material and Methods: “erythema migrans”. You may consider briefly describing what this is; you may put this description in parentheses, for example.

We have added “bullseye rash” as a parenthetical comment in this section.

5. Please check the capital letters concerning Google, United States, Table, for example. Also, some words are lowercased instead of capitalized.

We have made edits throughout the manuscript to maintain consistency of capitalization and apologize for these errors.

6. Both abbreviations are used: “US” and “U.S.”. Please consider choosing one of them.

We have edited the manuscript to maintain consistency of US throughout.

7. Figure 7: If you use color lines in the figures, please tell the readers which color indicates which line.

We have included text in the legend of Figure 7 to make clear that the black lines are regional incidence and the colored lines are model predictions.

8. Please check the reference list concerning the links and make sure that they are updated.

We have updated the reference list to ensure that all links are active and functioning.

Reviewer #3: This is an interesting approach in modeling Lyme Disease with Google Trends data. However, there are some issues that need to be addressed before this manuscript can be reconsidered for publication.

1. The authors mention that “High correlation was determined when the correlation value was greater than 0.8, moderate if correlation value was between 0.5 and 0.8, and poor when less than 0.5”. Shouldn’t the significance of a correlation be measured by, for example, the p-values (or CIs)? Also, “high” and “moderate” should be defined (I assume the authors mean that high is p<0.01 and that moderate is p<0.05; however, a correlation with a p-value less than .05 is considered quite strong).

We thank the reviewer for this comment, and have indicated when p-values were significant in Supplemental Table 2 to accompany the correlation values. We feel that reporting the correlation value directly is important, as this measures the strength of the linear relationship directly. P-values (and by extension confidence intervals) measure the strength of evidence for the presence of nonzero correlation, but do not give any indication of the strength of correlation itself. In a large sample with multiple comparisons, one could obtain very strong evidence for nonzero correlation when the linear relationship is actually quite weak, while in a small sample a high correlation might not reach significance.

2. There is no description of the Google Trends data selection criteria and collection procedure. This is an important drawback of this manuscript. All methodology steps should be reported in detail (e.g., period, region, category, web search, use of quotes for keywords with more than one word, individual searches, comparisons, etc.).

We have added to the methods section to make this clear. We state in the methods that we collect search hit data for the terms identified via Google Correlate. We have added more language to make it clear that terms identified from Google Correlate were entered into Google Trends unaltered to collect search term hit data for each region. We also included text to make it clear that gtrendsR is an R interface for Google Trends that allows for an automated process of collecting search term data.

3. This is an information epidemiology (infodemiology) study. I suggest that the authors study the relevant literature in order to gain insight and enhance their literature review. An introductory paragraph could be added in the Introduction Section.

We have added a paragraph, lines 93-102, to the introduction outlining infodemiology and its use in predicting disease and better informing the general public about health-related outcomes.

4. The analysis (data collection) was conducted in September 2019, considering data up to December 2018. It is now 2021, and there are two more years’ data available. I believe it would add to the value of this manuscript if the analysis was updated.

We appreciate and understand the reviewer’s concern for having recent data for publication. However, the intention of this work is to show the value of including environmentally related features when nowcasting with Google search terms, which does not require all recent data. This manuscript highlights the importance of considering environmental factors when creating prediction models for vector-borne diseases. To this end, the models have also been posted on Eric Kontowicz’s Github (https://github.com/ekontowicz/Lyme-disease-Elastic-Net-regression-Nowcasting) for further use by other researchers and additional updates. Lastly, given the time it takes to have rigorous peer review, particularly during this COVID-19 pandemic, these findings will lag all surveillance data available.

Submitted filename: PLOSOne_response to ReviewerComments final.docx

2 Feb 2022

Inclusion of environmentally themed search terms improves Elastic Net regression nowcasts of regional Lyme disease rates

PONE-D-21-11338R2

Dear Christy:

I am pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sam R. Telford III

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Christy, sorry this has taken so long. Paradoxically with COVID and people being at home, it is no easier finding qualified reviewers and getting reviews back than when they were at work.

Reviewers' comments:

1 Mar 2022

PONE-D-21-11338R2

Inclusion of environmentally themed search terms improves Elastic Net regression nowcasts of regional Lyme disease rates

Dear Dr. Petersen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sam R. Telford III

Academic Editor

PLOS ONE

Table 1

Candidate search terms identified via Google Correlate^TM by region with symptom/vector terms.

Northeast Search Terms	Midwest Search Terms
Identified by Google Correlate ^TM	Identified by Google Correlate ^TM
free concerts, july calendar, necbl, little league all stars, alive at five, movies under the stars, prospect park bandshell, summer recipe, harwich mariners, freezer jam	festivals milwaukee, beaches in michigan, kings island discount, easy summer recipes, lake beaches, motel wisconsin dells, movies in the park, summer desserts, dorm bedding, drive in ohio
Southeast Search Terms	Southwest Search Terms
Identified by Google Correlate ^TM	Identified by Google Correlate ^TM
intex, cloudy pool, summer things, alabama water park, blue bayou in baton rouge, cloudy pool water, baking soda pool, summer things to do, green pool, springtails	loans for, how to make string bracelets, pigeon forge hotels, recipes on the grill, sandstone amphitheater, cheap bmx bikes, cataratas del niagara, world rv, cave of the winds colorado springs, produce stand
West Search Terms	Symptom and Ixodid-vector Terms
Identified by Google Correlate ^TM	added for Each Region
concert in the park, berry picking, movies in park, concert in park, blueberry picking, outdoor movies, soak city, lake water park, blueberry farm, broomfield bay	tick, black tick, lyme, lyme disease, rash, bullseye rash, bell’s palsy, facial paralysis, side of face paralyzed, knee pain, swollen knees, swollen joint, swollen joints, joint pain, fever, tired, deer tick, black-legged tick, black legged tick, black leg tick

Table 2

Number of search terms that had monthly proportional hit data available from Gtrends^TM.

Region	Terms Into Gtrends^TM	Terms From Gtrends^TM
Northeast	120	87
Midwest	120	86
Southeast	120	80
Southwest	120	42
West	120	83

Table 3

Summary values of bivariate correlation of full-term list search terms to regional Lyme disease rates of model training data.

Region	Range	Mean Correlation	Median Correlation
Northeast	-0.279, 0.893	0.560	0.663
Midwest	-0.245, 0.898	0.602	0.691
Southeast	-0.137, 0.840	0.524	0.590
Southwest	-0.065, 0.612	0.229	0.231
West	-0.165, 0.836	0.421	0.416

Table 4

Ten most correlated regional search terms for training period (2004–2012).

Northeast		Midwest		Southeast		Southwest		West
Search Term	Corr. Value	Search Term	Corr. Value	Search Term	Corr. Value	Search Term	Corr. Value	Search Term	Corr. Value
july calendar	0.89**	kings island discount	0.90**	intex	0.84**	loans for	0.61**	movies in park	0.84**
free concerts	0.88**	beaches in michigan	0.90**	cloudy pool	0.84**	hotels ca	0.55**	movies in the park	0.83**
movies under the stars	0.87**	festivals milwaukee	0.89**	summer things	0.81**	ca water	0.55**	movie in park	0.82**
lyme	0.85**	easy summer recipes	0.88**	baking soda pool	0.80**	deer tick	0.45*	concert in the park	0.80**
summer recipe	0.85**	lake beaches	0.88**	green pool	0.80**	moon bay ca	0.44*	berry picking	0.80**
lyme disease	0.85**	motel wisconsin dells	0.87**	alabama water park	0.79**	half moon bay ca	0.40*	blueberry farm	0.79**
little league all stars	0.85**	blueberry farm	0.85**	cloudy pool water	0.79**	make string bracelets	0.40*	concert in park	0.79**
necbl	0.84**	summer desserts	0.85**	summer things to do	0.79**	rash	0.39*	blueberry picking	0.78**
berry picking	0.83**	movies in the park	0.85**	blue bayou in baton rouge	0.77**	tick	0.38*	outdoor movies	0.77**
alive at five	0.83**	watermelon recipe	0.84**	springtails	0.75**	how to make string bracelets	0.38*	lake water park	0.76**

** p << 0.05

* p < 0.05.
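
Table 4 reports a correlation value alongside a significance marker, and the two measure different things, as the authors argue in their response to Reviewer #3. The following is a minimal sketch of that argument, not the authors' analysis: the (r, n) pairs are hypothetical, the labels use the thresholds quoted in the review (high above 0.8, moderate from 0.5 to 0.8, poor below 0.5), and significance is judged by comparing the usual t statistic for a Pearson correlation with tabulated two-sided 5% critical values.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation r against zero, n paired points."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def strength_label(r: float) -> str:
    """Label correlation strength using the thresholds quoted in the review."""
    if r > 0.8:
        return "high"
    if r >= 0.5:
        return "moderate"
    return "poor"

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom (n - 2).
# Only the two cases used below are tabulated here.
T_CRIT = {3: 3.182, 998: 1.962}

def is_significant(r: float, n: int) -> bool:
    """True if the correlation differs from zero at the two-sided 5% level."""
    return abs(t_statistic(r, n)) > T_CRIT[n - 2]

# Hypothetical cases, chosen only to illustrate the point.
cases = [
    ("weak correlation, large sample", 0.10, 1000),
    ("strong correlation, small sample", 0.85, 5),
]

for name, r, n in cases:
    print(f"{name}: r={r:.2f}, n={n}, strength={strength_label(r)}, "
          f"t={t_statistic(r, n):.2f}, significant={is_significant(r, n)}")
```

Under these assumptions the weak correlation is significant and the strong one is not, which is why a correlation value and a p-value are best reported together.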

Table 5

Predictions from symptoms and vector terms only models produce accurate predictions with low error.

	Northeast	Midwest	Southeast	Southwest	West
α, λ	0.47, 0.60	0.33, 0.20	0.29, 0.07	0.11, 0.01	0.1, 0.01
Training
RMSE	1.32	0.36	0.11	0.01	0.01
MAE	0.89	0.21	0.07	0.01	0.01
R²	0.77	0.65	0.67	0.32	0.50
In-sample Validation
RMSE	1.50	0.38	0.11	0.01	0.01
MAE	1.01	0.25	0.07	0.01	0.01
R²	0.71	0.59	0.69	0.38	0.29
Out of Sample
RMSE	1.65	0.43	0.14	0.01	0.01
MAE	1.38	0.34	0.10	0.01	0.01
R²	0.79	0.76	0.82	0.37	0.63
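
The first row of Table 5 gives the two elastic net tuning parameters for each region. As a reminder of what they control, one common parameterization is shown below. It is the form used by glmnet-style software; that the authors' models use exactly this form is an assumption, since the source text here does not state it. The symbols N (number of monthly observations), y_i (regional Lyme disease rate), and x_i (vector of search-term hit values) are introduced only for illustration.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
    + \alpha\,\lVert\beta\rVert_1\right]
```

Under this reading, λ sets the overall strength of the penalty, and α mixes the two penalty types: α = 1 gives the LASSO, α = 0 gives ridge regression, and values in between shrink coefficients while still setting some of them to zero.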

Table 6

Predictions from full-term list models produce highly accurate predictions with low error.

	Northeast	Midwest	Southeast	Southwest	West
α, λ	0.1, 0.85	0.93, 0.00	0.1, 0.07	0.1, 0.01	0.1, 0.00
Training
RMSE	0.66	0.12	0.06	0.01	0.01
MAE	0.46	0.09	0.04	0.01	0.00
R²	0.94	0.95	0.91	0.56	0.84
In-sample Validation
RMSE	0.99	0.23	0.08	0.01	0.01
MAE	0.62	0.14	0.05	0.01	0.01
R²	0.87	0.85	0.84	0.44	0.70
Out of Sample
RMSE	0.74	0.29	0.14	0.01	0.01
MAE	0.52	0.17	0.09	0.01	0.01
R²	0.97	0.94	0.91	0.45	0.82
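
Tables 5 and 6 summarize each model with three metrics. The sketch below shows how such metrics are commonly computed. The observed and predicted values are hypothetical, and R² is taken here as one minus the ratio of residual to total sum of squares; some software instead reports the squared correlation between observed and predicted values, and the source does not say which definition the authors used.

```python
import math
from typing import Sequence

def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot (an assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical monthly rates and nowcasts, for illustration only.
observed = [1.0, 2.0, 4.0, 3.0]
predicted = [1.5, 2.0, 3.0, 3.5]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"MAE  = {mae(observed, predicted):.3f}")
print(f"R2   = {r_squared(observed, predicted):.3f}")
```

RMSE and MAE are in the units of the outcome, so their magnitudes are not directly comparable between regions with very different Lyme disease rates, whereas R² is unitless.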

Table 7

Three most important terms for each model are often environmentally themed.

Northeast
Elastic Net 1		Elastic Net 2
Search Term	Scaled Importance	Search Term	Scaled Importance
July Calendar	100.00	July Calendar	100.00
Fresh Cherry Pie	82.12	Fresh Cherry Pie	83.29
Bullseye Rash	75.51	Bullseye Rash	75.47
Midwest
Elastic Net 1		Elastic Net 2
Search Term	Scaled Importance	Search Term	Scaled Importance
Festivals Milwaukee	100.00	Festivals Milwaukee	100.00
Lake Beaches	97.35	Kings Island Discount	99.16
Kings Island Discount	96.35	Lake Beaches	97.40
Southeast
Elastic Net 1		Elastic Net 2
Search Term	Scaled Importance	Search Term	Scaled Importance
Intex Pool Cover	100.00	Intex Pool Cover	100.00
Rash	87.07	Rash	88.06
Swampdogs	85.64	Swampdogs	85.45
Southwest
Elastic Net 1		Elastic Net 2
Search Term	Scaled Importance	Search Term	Scaled Importance
Loans for	100.00	Loans for	100.00
CA Water	67.20	CA Water	66.82
Hotels CA	61.00	Hotels CA	60.14
West
Elastic Net 1		Elastic Net 2
Search Term	Scaled Importance	Search Term	Scaled Importance
Movies in the Park	100.00	Movies in the Park	100.00
Concert in the Park	69.18	Concert in the Park	69.65
Waterworld Denver	62.13	Waterworld Denver	62.44
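
Table 7 ranks terms by a scaled importance in which the top term is always 100.00. The source here does not state how that scale was computed. One common convention, assumed in this sketch, is to take the absolute value of each fitted coefficient and rescale so the largest equals 100. The term names and coefficients below are hypothetical.

```python
from typing import Dict

def scaled_importance(coefficients: Dict[str, float]) -> Dict[str, float]:
    """Rescale absolute coefficients so the largest one equals 100 (assumed convention)."""
    largest = max(abs(value) for value in coefficients.values())
    if largest == 0.0:
        # Every coefficient was shrunk to zero, so no term carries importance.
        return {term: 0.0 for term in coefficients}
    return {term: 100.0 * abs(value) / largest for term, value in coefficients.items()}

# Hypothetical fitted coefficients for three search terms.
example_coefficients = {"term_a": 0.50, "term_b": -0.25, "term_c": 0.0}

importance = scaled_importance(example_coefficients)
for term, score in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{term}\t{score:.2f}")
```

Because the scale is relative to the top term within each model, scores can be compared within a region but not across regions.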

35 in total

1. Emerging infectious diseases: prediction and detection.

Authors: N H Ogden; P AbdelMalik; Jrc Pulliam
Journal: Can Commun Dis Rep Date: 2017-10-05

2. Time of year and outdoor recreation affect human exposure to ticks in California, United States.

Authors: Daniel J Salkeld; W Tanner Porter; Samantha M Loh; Nathan C Nieto
Journal: Ticks Tick Borne Dis Date: 2019-06-07 Impact factor: 3.744

3. The clinical assessment, treatment, and prevention of lyme disease, human granulocytic anaplasmosis, and babesiosis: clinical practice guidelines by the Infectious Diseases Society of America.

Authors: Gary P Wormser; Raymond J Dattwyler; Eugene D Shapiro; John J Halperin; Allen C Steere; Mark S Klempner; Peter J Krause; Johan S Bakken; Franc Strle; Gerold Stanek; Linda Bockenstedt; Durland Fish; J Stephen Dumler; Robert B Nadelman
Journal: Clin Infect Dis Date: 2006-10-02 Impact factor: 9.079