Literature DB >> 32246053

Search trends and prediction of human brucellosis using Baidu index data from 2011 to 2018 in China.

Songyu Wu1, Wenchao Wu1, Hetian Xue1, Chenhao Zhao1, Yuhan Yang1, Kai An1, Qing Zhen2.   

Abstract

Reporting on brucellosis, a relatively rare infectious disease caused by Brucella, is often delayed or incomplete in traditional disease surveillance systems in China. Internet search engine data related to brucellosis can provide an economical and efficient complement to a conventional surveillance system because people tend to seek brucellosis-related health information from Baidu, the largest search engine in China. In this study, brucellosis incidence data reported by the CDC of China and Baidu index data were gathered to evaluate the relationship between them. We applied an autoregressive integrated moving average (ARIMA) model and an ARIMA model with Baidu search index data as the external variable (ARIMAX) to predict the incidence of brucellosis. The two models based on brucellosis incidence data were then compared, and the ARIMAX model performed better in all the measurements we applied. Our results illustrate that Baidu index data can enhance the traditional surveillance system to monitor and predict brucellosis epidemics in China.

Entities:  

Mesh:

Year:  2020        PMID: 32246053      PMCID: PMC7125199          DOI: 10.1038/s41598-020-62517-7

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Human brucellosis is one of the most common zoonotic diseases, among many infectious diseases, including zoonoses[1]. After the initial infection, brucellosis has a wide range of clinical manifestations in the acute stage: patients can develop fever, sweating, and arthritis. Without appropriate treatment, the disease will progress to the chronic stage, which is far more severe than the acute stage[2], such as endocarditis, neurological diseases[3,4]. Brucellosis is transferred from livestock to humans through nonsterile or contaminated products or contact with sick animals, and there have been no clear cases of brucellosis transfer from person to person[5]. Brucellosis not only endangers the development of the animal husbandry industry but also seriously threatens the health of the occupational population and people in epidemic areas. The health burden of brucellosis has caused severe damage to the health and economies of the affected countries. Thus, in the last few years, increasing attention has been paid to the epidemiology surveillance of brucellosis incidence[6]. Currently, brucellosis appears in more than 160 countries and regions; the annual average number of new cases is higher than half a million globally[7]. In China, the incidence of brucellosis varies annually. There was no consistent pattern in incidence from 2011 to 2016; however, after 2016, there has been a distinct decrease. Other studies have also drawn the same conclusion either nationally or provincially[7-9]. The areas with brucellosis outbreak are mainly distributed in the northern provinces[10]. Moreover, brucellosis has several other epidemiological characteristics. For example, according to a previous study and reports, in terms of age and gender, young and middle-aged males in the labour force, especially those who work with livestock, are the main population being infected[3]. The Internet has become a way for increasing numbers of people to acquire health information[11]. Internet data have entered the public health field with rapid updating and availability in large quantities. By analyzing Internet data, researchers can establish estimates of the distribution of health-related events in an Internet-based monitoring system model[12]. The Autoregressive Integrated Moving Average (ARIMA) model is a widely used model that can use the past and the present to predict the future[13]. The ARIMA model has been applied to predict other diseases epidemics such as AIDS[14]. A well-known example of the use of Internet data in health is the monitoring of influenza outbreaks, where it is as accurate as traditional methods[11]. Google Trends is an Internet platform that specifies the term “public health surveillance” to provide users with geospatial and temporal patterns of search volume[15]. In the United States, researchers have successfully predicted flu outbreaks by analyzing the Google Trends and accurately estimated flu levels 1–2 weeks earlier than the published CDC report[16]. They also used Google Search to accurately predict influenza activity levels every week across the United States. Google Trends is widely used around the world, but it is not available in China. Most Chinese people use the Baidu engine to search for information on the Internet[17]. At present, Baidu, as the largest search engine on China’s Internet, has a vast share of the search engine market[18]. The information generated by the Baidu search data could be applied to epidemiological research. In Zhejiang Province of China, the prevalence of Norwalk virus has been determined through Internet monitoring (Baidu Index), and an appropriate model has been established to predict potential Norwalk virus infection[19]. Similarly, some methods of using Baidu Search Index(BSI) data have also been carried out in HIV/AIDS; hand, foot and mouth disease; and dengue[17,20,21]. The current brucellosis surveillance system in China is flawed and incomplete. The reported number of human brucellosis cases may be only a portion of the actual cases of human brucellosis in the country[22]. Therefore, we need to adopt a more pragmatic approach to monitor the prevalence of human brucellosis. Based on previous studies of using search index data for the prediction of disease incidence, we believe that the Baidu index could provide a new approach to monitor the incidence of brucellosis.

Results

Descriptive and correlation analysis

The average incidence of human brucellosis from 2011 to 2018 is 48,111.63 per year. According to the time series plot of the human brucellosis incidence (Fig. 1). The annual incidence increased from 43,827 in 2011 to 60,782 in 2015; decreased from 2015 to 40,028 in 2018. The time distribution of the cases showed strong seasonality: mainly occurred from March to August, spring, and summer.
Figure 1

Trends of the actual incidence of brucellosis from January 2011 to December 2018.

Trends of the actual incidence of brucellosis from January 2011 to December 2018. Table 1 is a brief description of the keyword data of BSI. The BSI data showed seasonal fluctuations and reached a peak in summer. We can infer from Figs. 1 and 2 that a similar seasonality occurred among cases of brucellosis and BSI data of keywords.
Table 1

Description of the observed data, there are five forms used to express brucellosis.

Abbr.Min.MeanMedianMaxStd.Dev.
B1211582.5580.51133201.2
B284324.9305.51957197.6
B3155264.2261.546560.5
B435127.2130.518728.0
B545140.0140.024328.3

B1, B2, B3, B4, and B5 are the different keywords describing brucellosis in the Chinese language.

Figure 2

Time series of 5 different index terms from January 2011 to December 2018.

Description of the observed data, there are five forms used to express brucellosis. B1, B2, B3, B4, and B5 are the different keywords describing brucellosis in the Chinese language. Time series of 5 different index terms from January 2011 to December 2018. Spearman correlation analysis was applied to find the correlation between the monthly observed brucellosis incidence and the BSI data (Table 2). The study showed a positive association between the BSI data and the monthly cases of brucellosis.
Table 2

The results of Spearman correlation analysis between BSI data and actual brucellosis incidence shows that all Baidu search indexes correlated with the time series of brucellosis incidence.

Abbr.rsP-value
B10.491<0.01
B20.362<0.01
B30.426<0.01
B40.447<0.01
B50.501<0.01
The results of Spearman correlation analysis between BSI data and actual brucellosis incidence shows that all Baidu search indexes correlated with the time series of brucellosis incidence. We used cross-correlation analysis to further detect the correlation between the incidence of brucellosis and the lag order of BSI (Table 3). All keywords showed a significant relationship (p < 0.05). Two correlation analyses showed a significant relationship between the BSI data and the incidences of brucellosis, indicating the BSI data could reflect the changes in the brucellosis epidemic.
Table 3

The results of cross-correlation analysis between BSI data and actual brucellosis incidence. CCF: cross-correlation function.

Abbr.Maximum CCFLagP-value
B10.4910<0.01
B20.21200.038
B30.4760<0.01
B40.3670<0.01
B50.3910<0.01
The results of cross-correlation analysis between BSI data and actual brucellosis incidence. CCF: cross-correlation function.

Seasonal ARIMA model building and diagnosis

The seasonal ARIMA model uses past values of brucellosis incidence to predict future value. The result of the augmented Dickey-Fuller (ADF) test was not significant (Dickey-Fuller = -3.2995, lag order = 6, p = 0.076), suggesting that the time series is not stable and must be differentiated. After one-order differentiation, a method that could stationalize a non-stationary time series by computing the differences between two consecutive observations, the result of the ADF test is significant (Dickey-Fuller = -8.394, lag order = 6, p < 0.001). After differentiation, we use seasonal differentiation to eliminate the seasonality of the time series. After all the differentiating, we use ACF(autocorrelation function) and PACF (partial autocorrelation function) plots of the time series in the supplement to estimate the MA and AR terms of the model (Supplementary Fig. S2). According to the minimum information quantity of the Akaike information criterion (AIC), after several trials, we selected ARIMA (1,1,1) (0,1,1)[12] as the forecasting model (Table 4). We tested each coefficient of the parameter with the null hypothesis that the coefficient is zero. T in the table is the statistics of the t-test, and the p-value is the probability that the null hypothesis is correct. All the parameters are significant, and the residual autocorrelation test (Ljung-Box test) shows that the residuals are not distinguishable from a white noise series (Q* = 0.0057379, p = 0.9396).
Table 4

Parameters of ARIMA (1,1,1) (0,1,1)[12].

CoefficientStandard errorTP-value
AR10.6060.1693.576<0.001
MA1−0.8171.150−7.107<0.001
SMA1−0.7620.209−3.644<0.001
Parameters of ARIMA (1,1,1) (0,1,1)[12].

ARIMAX model construction and comparison

The ARIMAX model is a multivariable version of ARIMA. The BSI data are the covariables of the ARIMAX model. After several tries and external variable selection, we selected B1, B2, and B4 based on the minimal AIC values (Table 5), and we tested each coefficient the same way as the testing on the previous ARIMA model.
Table 5

The parameters of the ARIMA model with BSI as the external variable.

CoefficientStandard errorTP-value
AR10.6880.1694.074<0.001
MA1−0.8920.119−7.524<0.001
SMA1−0.5720.144−3.972<0.001
B13.0570.8153.751<0.001
B2−0.5360.211−2.5300.011
B48.7123.2022.7210.007
The parameters of the ARIMA model with BSI as the external variable. To evaluate the in-sample fitting performance of the models, we use ME (mean error), MAE (mean absolute error), MPE (mean percentage error), MAPE (mean absolute percentage error) RMSE (root mean square error), and AIC (Akaike information criterion) to compare models 1 and 2. Model 2 has better performance than model 1 in all measurements we applied. (Table 6)
Table 6

Comparison between the ARIMA and ARIMAX models.

Model 1(ARIMA)Model 2(ARIMAX)
ME−16.3709−3.4060
MAE321.1026292.1836
MPE−0.76450.3039
MAPE9.02%8.07%
RMSE445.6888399.3604
AIC1288.431259.64
Ljung-Box p0.93960.7705

ME: mean error, MAE: mean average error, MPE: mean percentage error, MAPE: mean absolute percentage error, RMSE: root mean square error, AIC: akaike information criterion.

Comparison between the ARIMA and ARIMAX models. ME: mean error, MAE: mean average error, MPE: mean percentage error, MAPE: mean absolute percentage error, RMSE: root mean square error, AIC: akaike information criterion. To show the performance of forecasting, we compared the prediction results with criterion interval from January 2019 to May 2019 of two models (Table 7). Model 2 still performs better than Model 1, since the RMSE for model 2 is 371.1875 and 377.6349 for model 1.
Table 7

Actual brucellosis incidence in 2019 and the out-of-sample prediction for ARIMA (Model 1) and ARIMAX (Model 2).

TimeActual incidenceModel 1Model 2
Value95%CIValue95%CI
Jan 201923901696.793(735.412, 2658.175)1851.391(1007.858, 2694.924)
Feb 201922271833.312(609.734, 3056.889)1738.133(659.862, 2816.403)
Mar 201940213808.437(2430.329, 5189.545)4035.677(2823.679, 5247.675)
Apr 201945594406.562(2919.176, 5893.947)4927.924(3627.244, 6228.604)
May 201952385333.362(3759.529, 6907.195)5391.204(4025.604, 6756.804)
Actual brucellosis incidence in 2019 and the out-of-sample prediction for ARIMA (Model 1) and ARIMAX (Model 2).

Discussion

This study used the keyword digging tool to find brucellosis-related keywords. We introduced different combinations of the keywords as covariables of the ARIMAX model. Since there are only five different keywords, we could compare models with all possible combinations. The fact that our selected model (model 2) has a lower AIC value than the model with all keywords combined (1259.64 VS 1278.95) is the reason we do not introduce the combination of the keywords into the model even though we calculated it. Since Internet-based epidemiology related data is inexpensive, fine-grained, and real-time capable, it can fill the gap in public health supervision[23]. Therefore, in recent years, as an innovative way to improve the effectiveness of disease prevention and control programs, Internet-based monitoring systems are increasingly being explored[17]. Many similar studies have been conducted on various diseases. According to the World Health Organization (WHO), the actual incidence of brucellosis is 10 to 25 times higher than that reported[24]. The convenience of the Internet has attracted millions of people around the world to choose the Internet as a way to obtain health information[25]. BSI shows the changes in the public in interest for time series in the selected term. In addition, different terms can be compared simultaneously. The source of the interest from the public about the brucellosis could be the news reports or the actual incidence. As a relatively rare disease, the news reports about the brucellosis are significantly less than common infectious diseases such as flu. The primary source of the online search queries could come from the brucellosis patients and the people around the patients. This hypothesis is the reason BSI could reflect the actual brucellosis incidence through the interests of the online search queries. The ARIMA model is the most common model used to predict the epidemiological characteristics of a disease, but it can only make use of historical data. ARIMAX, however, can combine BSI and past incidence numbers to predict future incidence. Additionally, in this study, Table 6 has shown that the ARIMAX model has a better fitting outcome than the ARIMA model. Thus, the BSI can play a vital role in establishing the ARIMAX model as well as the monitoring of brucellosis. The BSI of the previous month can be available at the beginning of each month. Moreover, through this model, the prediction of the cases of brucellosis can be calculated immediately, which is faster than the traditional monitoring system. When the value predicted by the model increases, it may indicate a rise in the incidence of brucellosis, and the relevant departments related to disease control and prevention can make preparations for potential outbreak brucellosis. This study nevertheless has some deficiencies: First, China is a country with a wide range of cultures ethnicities, geographies, and population distributions[26]. Due to the scarcity of data at the provincial and municipal levels, it is not possible to predict the incidence trend in individual provinces and cities through this model. If data for each region can be collected, the forecast of brucellosis incidence can be targeted to different towns and areas. Second, according to the statistics of the China International Network Information Center, by December 2018, the Chinese Internet penetration rate was 59.6% (https://www.cnnic.net.cn/). This relatively low rate suggests that the Internet is not yet widely available in various remote parts of China. BSI cannot cover these areas. Therefore, the accuracy of predicting the number of cases with this model needs further evaluation. Third, although its exposure rate is relatively low compared to that of other diseases, it is still not possible to identify the reason for the search for brucellosis. Some queries may come from patients who are not sick and seek information for other purposes. When media reports on a new outbreak, the BSI will increase as a result of curiosity and panic, which leads to a false spike in the number of searches. Thus, further research is needed to find methods to eliminate the impact of media on the results[17].

Materials and Methods

Data sources

Epidemiological data for brucellosis: We used the monthly reported data from the official website of the Chinese Center for Disease Control and Prevention (China CDC), which covers the period from January 2011 to December 2018. These data are aggregated, open to the public, and do not contain any personal information. We accessed the official website of the CDC (http://www.chinacdc.cn/), viewed the monthly report, and gathered the information of monthly brucellosis incidence into the table. BSI data: We obtained monthly search engine query data from the Baidu index (https://index.baidu.com) by month, along with the monthly reported brucellosis incidence data. Because the BSI data were available from January 2011, we collected BSI data from January 2011 to December 2018. Data from both sources were entered into Microsoft Excel 2016. The data were then exported in CSV format to R 3.6.1 for further analysis.

Keyword selection

We used the keyword excavation tool (http://www.7c.com/keyword/) to excavate the different expressions that refer to brucellosis in the Chinese language. After removing the search terms that were not related to brucellosis from the excavation results, five different keywords that express the same meaning in the Chinese language were selected. Then, data for each keyword were collected separately from the Baidu Index website. After the correlation analysis, a five-keywords-combined variable was calculated as follows:Where n is the number of keywords (n = 5), and were i th keywords value of BSI data and the weight of i th keyword. The are the value of Spearman correlation coefficients. In the ARIMAX model constructing process, we tried a single keyword, the combination of several keywords or the combination of five keywords as covariable of the model multiple times, and we ultimately chose B1, B2, and B4 as the final covariables of the model based on the AIC criteria.

ARIMA and ARIMAX

The ARIMA model is one of the most common time series models that use the historical value of a variable to predict its future values. The ARIMAX, multivariable version ARIMA, can take other time series information as the external variable. We use both the ARIMA and ARIMAX models to predict the same time series, which is the incidence of brucellosis. The difference between the two models is that the second model (model 2) is multivariable. Three steps were mainly introduced for model construction. Firstly, the autocorrelation term was determined through the ACF and PACF of brucellosis cases. Secondly, ARIMA and ARIMAX model was built. Finally, select the best performing model as the forecasting model.

Model selection and prediction

We followed several criteria for time series model selection; first, the residuals of the models should not be distinguishable from a white noise series; second, when comparing two models, we assume the better model with lower absolute value in AIC and RMSE than the other model. After we find the best single-variable model and best multi-variable model, more measurements would be taken into the comparison between ARIMA and ARIMAX model, including ME, MAE, MPE, and MAPE. Each of the measurements has its benefits. MPE is easy to calculate, and the result can be easily understood. MAPE is similar to MPE and can better represent the forecast error. MAE and RMSE are similar; the main difference between the two measurements is the sensitivity to the outlier of forecasting outcome; RMSE is more sensitive than MAE. After model selection, we predicted the future value from the selected model. We did five times one-month-ahead prediction and compared the actual brucellosis cases from January to May of 2019. The Fig. 3 briefly summarized the whole research process from the selection of keywords to out-sample prediction.
Figure 3

The whole research process from the selection of keywords to out-sample prediction.

The whole research process from the selection of keywords to out-sample prediction.

Statistical tests

Spearman rank correlation was used to detect the association between monthly BSI data and brucellosis incidence. Cross-correlation analysis is a measure of similarity between two time series. The range of results of this test is −1 to 1, and the closer the result to 1 or −1, the more closely two-time series are related. The augmented Dickey-Fuller test is a unit root test to determine whether the differencing of a time series is required. In our analysis, the null hypothesis is that the data are stationary; a p-value less than 0.05 (p < 0.05) suggests that differencing is required. The Ljung-Box test is the test that we applied to detect the autocorrelations of the residual series. If the p-value of the test is significant, we suggest that autocorrelations exist in the series, which indicates that the model does not fully utilize the information of the training set. All statistical and modeling processes for the time series were analyzed using R 3.6.1. The ARIMA and ARIMAX model were built and analyzed using the “TSA” package. P < 0.05 was considered significant in the analysis process. Supplementary Information.
  18 in total

Review 1.  Brucellosis - regionally emerging zoonotic disease?

Authors:  Mayada Gwida; Sascha Al Dahouk; Falk Melzer; Uwe Rösler; Heinrich Neubauer; Herbert Tomaso
Journal:  Croat Med J       Date:  2010-08       Impact factor: 1.351

Review 2.  Clinical manifestations of human brucellosis: a systematic review and meta-analysis.

Authors:  Anna S Dean; Lisa Crump; Helena Greter; Jan Hattendorf; Esther Schelling; Jakob Zinsstag
Journal:  PLoS Negl Trop Dis       Date:  2012-12-06

3.  Evolving forecasting classifications and applications in health forecasting.

Authors:  Ireneous N Soyiri; Daniel D Reidpath
Journal:  Int J Gen Med       Date:  2012-05-08

4.  Mapping publication trends and identifying hot spots of research on Internet health information seeking behavior: a quantitative and co-word biclustering analysis.

Authors:  Fan Li; Min Li; Peng Guan; Shuang Ma; Lei Cui
Journal:  J Med Internet Res       Date:  2015-03-25       Impact factor: 5.428

5.  Changing Epidemiology of Human Brucellosis, China, 1955-2014.

Authors:  Shengjie Lai; Hang Zhou; Weiyi Xiong; Marius Gilbert; Zhuojie Huang; Jianxing Yu; Wenwu Yin; Liping Wang; Qiulan Chen; Yu Li; Di Mu; Lingjia Zeng; Xiang Ren; Mengjie Geng; Zike Zhang; Buyun Cui; Tiefeng Li; Dali Wang; Zhongjie Li; Nicola A Wardrop; Andrew J Tatem; Hongjie Yu
Journal:  Emerg Infect Dis       Date:  2017-02       Impact factor: 6.883

Review 6.  A Systematic Review and Meta-Analysis of Epidemiology and Clinical Manifestations of Human Brucellosis in China.

Authors:  Rongjiong Zheng; Songsong Xie; Xiaobo Lu; Lihua Sun; Yan Zhou; Yuexin Zhang; Kai Wang
Journal:  Biomed Res Int       Date:  2018-04-22       Impact factor: 3.411

7.  A systematic review and meta-analysis of the prevalence of osteoarticular brucellosis.

Authors:  Shakirat A Adetunji; Gilbert Ramirez; Margaret J Foster; Angela M Arenas-Gamboa
Journal:  PLoS Negl Trop Dis       Date:  2019-01-18

8.  Epidemiological features and risk factors associated with the spatial and temporal distribution of human brucellosis in China.

Authors:  Yin-Jun Li; Xin-Lou Li; Song Liang; Li-Qun Fang; Wu-Chun Cao
Journal:  BMC Infect Dis       Date:  2013-11-16       Impact factor: 3.090

9.  Temporal trends analysis of human brucellosis incidence in mainland China from 2004 to 2018.

Authors:  Yongbin Wang; Chunjie Xu; Shengkui Zhang; Zhende Wang; Ying Zhu; Juxiang Yuan
Journal:  Sci Rep       Date:  2018-10-26       Impact factor: 4.379

10.  Spatio-temporal cluster and distribution of human brucellosis in Shanxi Province of China between 2011 and 2016.

Authors:  Ting Wang; Xiang Wang; Ping Tie; Yongfei Bai; Yuhua Zheng; Changfu Yan; Zhikai Chai; Jing Chen; Huaxiang Rao; Lingjia Zeng; Limin Chen; Lixia Qiu
Journal:  Sci Rep       Date:  2018-11-19       Impact factor: 4.379

View more
  8 in total

1.  Utilizing Baidu Index to Investigate Seasonality, Spatial Distribution and Public Attention of Dry Eye Diseases in Chinese Mainland.

Authors:  Haozhe Yu; Weizhen Zeng; Mengyao Zhang; Gezheng Zhao; Wenyu Wu; Yun Feng
Journal:  Front Public Health       Date:  2022-07-06

2.  Human brucellosis: Widespread information deficiency hinders an understanding of global disease frequency.

Authors:  Christopher G Laine; Harvey M Scott; Angela M Arenas-Gamboa
Journal:  PLoS Negl Trop Dis       Date:  2022-05-17

3.  Impact of a health campaign on Chinese public awareness of stroke: evidence from internet search data.

Authors:  Man Cao; Tianjia Guan; Xueyan Han; Bingjie Shen; Baohua Chao; Yuanli Liu
Journal:  BMJ Open       Date:  2021-12-14       Impact factor: 2.692

4.  Uncertainty, Search Engine Data, and Stock Market Returns During a Pandemic.

Authors:  Sheng Xu; Jing Zhang; Rui Shen
Journal:  Front Public Health       Date:  2022-04-06

5.  Incidence of pulmonary tuberculosis under the regular COVID-19 epidemic prevention and control in China.

Authors:  Ziwei Wu; Ziyi Chen; Siyu Long; Aiping Wu; Hongsheng Wang
Journal:  BMC Infect Dis       Date:  2022-07-24       Impact factor: 3.667

6.  The research of SARIMA model for prediction of hepatitis B in mainland China.

Authors:  Daren Zhao; Huiwu Zhang; Qing Cao; Zhiyi Wang; Ruihua Zhang
Journal:  Medicine (Baltimore)       Date:  2022-06-10       Impact factor: 1.817

7.  Online public attention toward allergic rhinitis in Wuhan, China: Infodemiology study using Baidu index and meteorological data.

Authors:  Yunfei Wang; Ziang Gao; Hao Lv; Yu Xu
Journal:  Front Public Health       Date:  2022-10-03

8.  Using Search Trends to Analyze Web-Based Interest in Lower Urinary Tract Symptoms-Related Inquiries, Diagnoses, and Treatments in Mainland China: Infodemiology Study of Baidu Index Data.

Authors:  Shanzun Wei; Ming Ma; Changjing Wu; Botao Yu; Lisha Jiang; Xi Wen; Fudong Fu; Ming Shi
Journal:  J Med Internet Res       Date:  2021-07-06       Impact factor: 5.428

  8 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.