
The research of SARIMA model for prediction of hepatitis B in mainland China.

Daren Zhao1, Huiwu Zhang1, Qing Cao2, Zhiyi Wang3, Ruihua Zhang4.   

Abstract

Hepatitis B virus infection is a major global public health concern. This study explored the epidemic characteristics and trends of hepatitis B in 31 provinces of mainland China, constructed a SARIMA model for prediction, and proposed corresponding preventive measures. Monthly hepatitis B case data for mainland China were obtained from the website of the National Health Commission of the People's Republic of China. Monthly data from 2013 to 2020 were used to build the SARIMA model and data from 2021 were used to test the model. Between 2013 and 2020, 9,177,313 hepatitis B cases were reported in mainland China. SARIMA(1,0,0)(0,1,1)12 was the optimal model and its residual was white noise. It was used to predict the number of hepatitis B cases from January to December 2021, and the predicted values for 2021 were within the 95% confidence interval. This study suggests that the SARIMA model fitted the epidemiological trends of hepatitis B in mainland China well. The SARIMA model is a feasible tool for monitoring hepatitis B virus infections in mainland China.
Copyright © 2022 the Author(s). Published by Wolters Kluwer Health, Inc.

Year:  2022        PMID: 35687775      PMCID: PMC9276452          DOI: 10.1097/MD.0000000000029317

Source DB:  PubMed          Journal:  Medicine (Baltimore)        ISSN: 0025-7974            Impact factor:   1.817


Introduction

Hepatitis B, caused by hepatitis B virus (HBV) infection, is transmitted through percutaneous or permucosal exposure to infected blood or body fluids. Although effective prophylactic vaccines and treatments for hepatitis B exist, countries still face high morbidity and mortality from hepatitis B and a tremendous economic burden. It has been reported that 3.5% of the global population is chronically infected with hepatitis B virus. In 2013, there were 1.45 million deaths due to viral hepatitis, which was the seventh highest cause of mortality worldwide. Between 2015 and 2019, the number of confirmed cases of hepatitis B increased from 257 million to 296 million, with 1.5 million new infections each year. Hepatitis B remains a global epidemic, particularly in Southeast Asia, sub-Saharan Africa, and China. In China, hepatitis B is classified as a Category B infectious disease. Over the past few decades, a series of comprehensive preventive measures, including HBV immunization programs, integrated prevention of mother-to-child transmission, and effective finance policies, have been implemented in China and have achieved remarkable progress. However, China has been reported to have a higher intermediate prevalence (5%–7.99%); therefore, many challenges remain in the prevention and control of hepatitis B. Scientific prediction and analysis play a crucial role in the prevention and control of hepatitis B, the proper allocation of health resources, the provision of effective measures, and the promotion of public health. A reasonable prediction of hepatitis B can also serve as an early warning, allowing preventive measures to be taken in advance. Prediction and analysis of hepatitis B are therefore crucial for its prevention and control.
Time series analysis is a statistical method used to predict future trends based on historical data and time variables. Currently, various prediction approaches have been adopted to handle the temporal characteristics of infectious diseases, such as the exponential smoothing model, the grey model first-order one-variable (GM(1,1) model), linear regression, and the autoregressive integrated moving average (ARIMA) model. The ARIMA model is widely used to predict infectious diseases. SARIMA is an extension of the ARIMA model for series that exhibit periodicity, seasonality, and randomness. In this study, we collected monthly hepatitis B case data from 31 provinces in mainland China from 2013 to 2020 and used them to construct a SARIMA prediction model. The SARIMA(1,0,0)(0,1,1)12 model was fitted to predict the hepatitis B epidemic trends in 2021.

Materials and methods

Data source

Monthly HBV case data for 31 provinces in mainland China from 2013 to 2020 were provided by the National Health Commission of the People's Republic of China (http://www.nhc.gov.cn/). Hepatitis B has been classified as a category B infectious disease since the “Regulations of the People's Republic of China on the Administration of Acute Infectious Diseases” were promulgated by the Ministry of Health of the People's Republic of China in 1978. The diagnostic criteria for hepatitis B (WS 299–2008 Diagnostic Criteria for Hepatitis B) were formulated by the Ministry of Health of the People's Republic of China (http://www.nhc.gov.cn/wjw/s9491/200907/41983.shtml). In China, once a diagnosis of hepatitis B is confirmed, physicians must report it to the Internet-based surveillance system of the local health department within 24 hours. Data from 2013 to 2020 were used as the basic material for the study (Table 1). The research database was divided into training and validation datasets. Monthly data from 2013 to 2020 were used as the training set to build the SARIMA model, and monthly data from 2021 were used as the validation set to test the model.
Table 1

The number of Hepatitis B cases from 2013 to 2020 in mainland China.

| Year | January | February | March | April | May | June | July | August | September | October | November | December | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 102,367 | 78,884 | 107,535 | 97,225 | 96,978 | 86,335 | 97,401 | 97,430 | 87,915 | 86,161 | 88,479 | 87,609 | 1,114,319 |
| 2014 | 90,210 | 83,068 | 99,292 | 94,768 | 91,936 | 88,201 | 95,648 | 94,075 | 87,827 | 85,996 | 85,125 | 88,397 | 1,084,543 |
| 2015 | 96,649 | 72,869 | 104,427 | 94,350 | 91,194 | 89,224 | 93,586 | 89,228 | 89,806 | 86,393 | 87,284 | 90,103 | 1,085,113 |
| 2016 | 89,699 | 82,204 | 105,745 | 93,190 | 95,079 | 90,166 | 91,219 | 97,670 | 87,390 | 85,480 | 91,478 | 91,371 | 1,100,691 |
| 2017 | 86,657 | 99,417 | 110,717 | 98,123 | 101,783 | 100,155 | 98,501 | 103,977 | 96,856 | 89,105 | 97,694 | 97,560 | 1,180,545 |
| 2018 | 109,021 | 86,886 | 120,659 | 106,398 | 108,831 | 98,491 | 103,809 | 105,068 | 95,312 | 94,217 | 100,849 | 96,336 | 1,225,877 |
| 2019 | 107,754 | 90,985 | 113,941 | 110,266 | 106,431 | 97,362 | 112,454 | 106,985 | 97,815 | 98,774 | 102,174 | 102,151 | 1,247,092 |
| 2020 | 91,026 | 51,506 | 88,150 | 101,262 | 97,651 | 99,319 | 106,135 | 102,304 | 105,377 | 95,633 | 100,561 | 100,209 | 1,139,133 |

Ethics statement

In this study, monthly hepatitis B case data for 31 provinces in mainland China from 2013 to 2020 were obtained from publicly accessible sources (http://www.nhc.gov.cn/). The data did not contain detailed personal information about patients. Therefore, this study did not require a formal ethical assessment.

SARIMA model

The SARIMA model, which is an extended version of the ARIMA model, is generally expressed as SARIMA(p, d, q)(P, D, Q)s. Its distinctive difference from ARIMA is that it includes seasonal characteristics. In the expression, p, d, and q represent the order of autoregression, the degree of trend differencing, and the order of the moving average, respectively; P, D, and Q represent the seasonal autoregressive order, the degree of seasonal differencing, and the seasonal moving average order, respectively; and s represents the length of the seasonal period, which was 12 in our study. The modeling process of the SARIMA model consists of 3 parts: series stationarity, model identification and estimation, and model diagnosis. First, 2 methods can be used to assess the stationarity of the series. One is the original sequence plot, which is inspected to determine whether the level of the series is constant. The other is the augmented Dickey-Fuller (ADF) test, which confirms the stationarity of the series. The ADF test was performed using the EViews 10.0 software. Second, model identification and estimation were carried out using autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. On this basis, the parameters p, q, P, and Q were determined and the candidate SARIMA models were identified. Third, model diagnosis was performed using a white noise test. The Ljung-Box Q test was used to determine whether the residuals were independently distributed, that is, free of autocorrelation. A t-test and its P value were used to determine whether the model parameters were statistically significant. Finally, the candidate with the lowest Schwarz Bayesian information criterion (BIC) value whose residuals were white noise was identified as the optimal model to predict future epidemic trends of hepatitis B.
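To make the notation concrete, the following is a minimal sketch of how one-step-ahead fitted values arise from a SARIMA(1,0,0)(0,1,1)12 model, the form this study fits. It rests on several assumptions that are not stated in the paper: no constant term, the Box-Jenkins sign convention (the seasonal moving average term is subtracted), pre-sample residuals set to zero, and a series on its original scale. The function name `sarima_100_011_fitted` is hypothetical, and the sketch does not reproduce how EViews or SPSS estimate the coefficients.

```python
def sarima_100_011_fitted(x, phi, theta_s, s=12):
    """One-step-ahead fitted values for SARIMA(1,0,0)(0,1,1)_s, no constant.

    With w_t = x_t - x_{t-s} (one seasonal difference), the model is
        w_t = phi * w_{t-1} + e_t - theta_s * e_{t-s}
    Pre-sample residuals are taken as zero (a conditional approximation).
    Returns (fitted, residuals) aligned with x; the first s + 1 entries,
    which cannot be computed, are None.
    """
    n = len(x)
    fitted = [None] * n
    resid = [0.0] * n  # stays 0.0 wherever no residual is available yet
    for t in range(s + 1, n):
        w_prev = x[t - 1] - x[t - 1 - s]   # w_{t-1}
        e_lag = resid[t - s]               # e_{t-s}
        fitted[t] = x[t - s] + phi * w_prev - theta_s * e_lag
        resid[t] = x[t] - fitted[t]
    residuals = [None if f is None else r for f, r in zip(fitted, resid)]
    return fitted, residuals


# Illustrative call on a made-up, perfectly periodic series: every
# seasonal difference is zero, so each fitted value equals the observation.
season = [10, 4, 12, 9, 8, 7, 9, 9, 8, 7, 8, 8]
fitted, residuals = sarima_100_011_fitted(season * 3, phi=0.504, theta_s=0.611)
```

The coefficients 0.504 and 0.611 in the call are used only as illustrative inputs; a different software sign convention would flip the sign in front of `theta_s`.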

Assessment of prediction performance

Prediction performance was evaluated using goodness-of-fit measures computed from the predicted and observed values, where X̂t is the predicted value, Xt is the observed value, and n is the sample size of the sequence. The measures reported for the candidate models in Table 2 are R-squared, RMSE, MAPE, and MAE.
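As an illustration of such goodness-of-fit measures, the sketch below computes the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) from paired observed and predicted values. The definitions used are the standard textbook ones and are assumed here; the paper's own formulas may differ in detail, and the function names are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean square error: sqrt(mean((pred - obs)^2))."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: mean(|pred - obs|)."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observations must be nonzero."""
    n = len(observed)
    return 100.0 * sum(abs((p - o) / o) for o, p in zip(observed, predicted)) / n


# Made-up values for illustration only (not data from the study).
obs = [100.0, 200.0, 400.0]
pred = [110.0, 190.0, 400.0]
print(rmse(obs, pred), mae(obs, pred), mape(obs, pred))
```

All three measures are zero for a perfect fit; MAPE is scale-free, whereas RMSE and MAE are expressed in the units of the series (here, monthly case counts).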

Statistical analysis

EViews 10.0 and SPSS software (version 23.0) were used to construct the SARIMA model. The significance level was set at P < .05.

Results

Description of hepatitis B cases from 2013 to 2020 in mainland China

A total of 9,177,313 hepatitis B cases were reported in mainland China between 2013 and 2020, with an average annual growth rate of 0.05% (Table 1 and Fig. 1). Figure 2 shows that the monthly number of hepatitis B cases was lowest in February in every year from 2013 to 2020 except 2017 (January) and highest in March in every year except 2020 (July) (Table 1). The series was dominated by seasonal and periodic fluctuations.
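The descriptive figures above can be rechecked directly from Table 1. The sketch below transcribes the monthly counts from Table 1, recomputes the annual totals and the overall total, and locates the lowest and highest month of each year. The variable names are illustrative.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Monthly hepatitis B case counts transcribed from Table 1.
CASES = {
    2013: [102367, 78884, 107535, 97225, 96978, 86335,
           97401, 97430, 87915, 86161, 88479, 87609],
    2014: [90210, 83068, 99292, 94768, 91936, 88201,
           95648, 94075, 87827, 85996, 85125, 88397],
    2015: [96649, 72869, 104427, 94350, 91194, 89224,
           93586, 89228, 89806, 86393, 87284, 90103],
    2016: [89699, 82204, 105745, 93190, 95079, 90166,
           91219, 97670, 87390, 85480, 91478, 91371],
    2017: [86657, 99417, 110717, 98123, 101783, 100155,
           98501, 103977, 96856, 89105, 97694, 97560],
    2018: [109021, 86886, 120659, 106398, 108831, 98491,
           103809, 105068, 95312, 94217, 100849, 96336],
    2019: [107754, 90985, 113941, 110266, 106431, 97362,
           112454, 106985, 97815, 98774, 102174, 102151],
    2020: [91026, 51506, 88150, 101262, 97651, 99319,
           106135, 102304, 105377, 95633, 100561, 100209],
}

annual_totals = {year: sum(counts) for year, counts in CASES.items()}
grand_total = sum(annual_totals.values())

# (lowest month, highest month) for each year.
extremes = {
    year: (MONTHS[counts.index(min(counts))], MONTHS[counts.index(max(counts))])
    for year, counts in CASES.items()
}

print(grand_total)  # 9177313, matching the total reported in the text
for year in sorted(CASES):
    print(year, annual_totals[year], extremes[year])
```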
Figure 1

The monthly number of Hepatitis B cases from 2013 to 2020 in mainland China.

Figure 2

The monthly number of Hepatitis B cases each year from 2013 to 2020 in mainland China.


Construction of SARIMA model

As shown in Fig. 3, the original sequence appears to be non-stationary, suggesting that trend differencing, seasonal differencing, and an augmented Dickey-Fuller (ADF) test should be applied. After natural logarithm transformation, first-order differencing, and seasonal differencing with a period of 12 (Fig. 4), the time series was stationary. Thus, the parameters d, D, and s were 0, 1, and 12, respectively. Furthermore, the ADF test results (t = −4.734, P < .001) indicated that the time series was stationary after the natural logarithm transformation and differencing.
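The transformations described above can be sketched in a few lines. The helper below applies a natural logarithm, an optional number of regular (lag-1) differences d, and one seasonal difference at lag s. The function names are illustrative, and d is left as a parameter because the text reports d = 0 while also mentioning first-order differencing. This is only a sketch of the preprocessing, not of the ADF test itself.

```python
import math


def difference(x, lag=1):
    """Return the lag-differenced series x_t - x_{t-lag}."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


def log_seasonal_difference(x, s=12, d=0):
    """Natural log, then d regular differences, then one seasonal difference (D = 1)."""
    y = [math.log(v) for v in x]
    for _ in range(d):
        y = difference(y, 1)
    return difference(y, s)


# Made-up series: exponential growth times a fixed 12-month seasonal pattern.
pattern = [1.0, 0.8, 1.1, 1.0, 1.0, 0.9, 1.0, 1.0, 0.9, 0.9, 0.95, 0.95]
series = [1000.0 * math.exp(0.01 * t) * pattern[t % 12] for t in range(36)]

# The seasonal pattern cancels, leaving the constant 12 * 0.01 = 0.12.
stationary = log_seasonal_difference(series, s=12, d=0)
```

Each seasonal difference shortens the series by s observations and each regular difference by one, so 96 monthly observations leave 84 points after one seasonal difference.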
Figure 3

Plot of the original sequence.

Figure 4

Plot of the sequence after natural logarithm transformation, first-order differencing, and seasonal differencing with a period of 12.

Figures 5 and 6 show the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. In the ACF plot of the differenced sequence, the autocorrelation coefficients decayed slowly. The ACF was maximal at order 1, where it was significantly higher than at order 2 or 3; therefore, we identified the parameter q as 0 or 1 and p as 1. Similarly, the PACF plot suggested that Q was 1 and P was 0 or 1, because the partial autocorrelation coefficients tailed off slowly and were maximal at order 1 compared with orders 2 and 3. In this way, the parameters p, q, P, and Q and the candidate SARIMA models were determined (Table 2).
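For readers who want to see what the ACF and PACF plots are computed from, the sketch below gives the standard sample autocorrelation and derives partial autocorrelations from it with the Durbin-Levinson recursion. This is a generic illustration with hypothetical function names, not the routine used by EViews or SPSS.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..len(r)-1 via Durbin-Levinson.

    r must start with r_0 = 1.
    """
    pacf = []
    prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, len(r)):
        if k == 1:
            phi_kk = r[1]
            cur = [phi_kk]
        else:
            num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf


# An AR(1)-like autocorrelation pattern r_k = 0.5**k has a single PACF
# spike at lag 1; the partial autocorrelations at lags 2 and 3 are zero.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))
```

A pattern such as this one, where the ACF decays gradually and the PACF cuts off after lag 1, is the kind of evidence typically read as pointing to an autoregressive order of 1.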
Figure 5

ACF plot of the differenced sequence.

Figure 6

PACF plot of the differenced sequence.

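Table 2

Parameter estimation of candidate SARIMA models.

| Model | R-squared | RMSE | MAPE | MAE | Normalized BIC | Ljung-Box Q(18) statistic | Ljung-Box Q(18) DF | Ljung-Box Q(18) P value |
|---|---|---|---|---|---|---|---|---|
| SARIMA(1,0,0)(0,1,1)12 | 0.441 | 7236.625 | 5.493 | 4980.087 | 17.879 | 21.036 | 16 | .177 |
| SARIMA(1,0,0)(1,1,1)12 | 0.489 | 7009.027 | 5.290 | 4728.593 | 17.921 | 16.843 | 15 | .328 |
| SARIMA(1,0,1)(0,1,1)12 | 0.496 | 6957.034 | 5.127 | 4545.904 | 17.906 | 13.915 | 15 | .532 |
| SARIMA(1,0,1)(1,1,1)12 | 0.496 | 7005.569 | 5.095 | 4513.149 | 17.973 | 12.559 | 14 | .561 |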
Further analysis using the white noise test revealed that all candidate SARIMA models passed the Ljung-Box Q test (P > .05). Based on the estimates and standard errors of the candidate SARIMA model parameters (Table 3), as well as the lowest Schwarz Bayesian information criterion (BIC) value, we found that SARIMA(1,0,0)(0,1,1)12 was the optimal model: its parameters passed the t-test (P < .001), and its residual was white noise (Fig. 7). Moreover, the SARIMA(1,0,0)(0,1,1)12 model fitted the observed values from 2013 to 2020 in mainland China well (Fig. 8).
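The white noise test can be illustrated with a short sketch of the Ljung-Box statistic, Q = n(n + 2) Σ r_k² / (n − k) summed over lags 1 to h, together with a chi-square tail probability. Two assumptions apply: the statistic is referred to a chi-square distribution with h minus the number of estimated ARMA coefficients as degrees of freedom, and the closed-form tail used here is valid only for even degrees of freedom (it covers the 16 degrees of freedom of the first row of Table 2; odd values would need a statistics library). Function names are illustrative.

```python
import math


def ljung_box_q(resid, h):
    """Ljung-Box Q statistic over lags 1..h for a residual series."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [v - mean for v in resid]
    denom = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, h + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q


def chi2_sf_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Q = 21.036 with 16 degrees of freedom, as reported in Table 2 for
# SARIMA(1,0,0)(0,1,1)12; the result rounds to the tabulated P value of .177.
print(round(chi2_sf_even_df(21.036, 16), 3))
```

A P value above .05 means the hypothesis of uncorrelated residuals is not rejected, which is the sense in which a candidate model "passes" the test.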
Table 3

Estimates and standard errors of the candidate SARIMA model parameters.

| Model | Parameter | Estimate | SE | t | P value |
|---|---|---|---|---|---|
| SARIMA(1,0,0)(0,1,1)12 | AR Lag 1 | 0.504 | 0.098 | 5.145 | .000 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.611 | 0.147 | 4.150 | .000 |
| SARIMA(1,0,0)(1,1,1)12 | Constant | 1213.634 | 600.071 | 2.022 | .046 |
| | AR Lag 1 | 0.449 | 0.101 | 4.434 | .000 |
| | AR, Seasonal Lag 1 | 0.186 | 0.284 | 0.654 | .515 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.984 | 5.270 | 0.187 | .852 |
| SARIMA(1,0,1)(0,1,1)12 | Constant | 1382.365 | 606.969 | 2.277 | .025 |
| | AR Lag 1 | 0.649 | 0.183 | 3.542 | .001 |
| | MA Lag 1 | 0.237 | 0.234 | 1.011 | .315 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.932 | 0.847 | 1.101 | .274 |
| SARIMA(1,0,1)(1,1,1)12 | Constant | 1097.881 | 737.941 | 1.488 | .141 |
| | AR Lag 1 | 0.663 | 0.197 | 3.374 | .001 |
| | MA Lag 1 | 0.287 | 0.249 | 1.152 | .253 |
| | AR, Seasonal Lag 1 | 0.272 | 0.340 | 0.800 | .426 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.998 | 42.748 | 0.023 | .981 |
Figure 7

Normal Q-Q plot of noise residual from SARIMA(1,0,0)(0,1,1)12.

Figure 8

Comparison of the predicted value and observed value results from SARIMA(1,0,0)(0,1,1)12 model.

Therefore, SARIMA(1,0,0)(0,1,1)12 was used to predict the number of HBV cases in mainland China between January and December 2021. The prediction results are presented in Table 4. Moreover, the predicted values were within the 95% confidence interval.
Table 4

Prediction result of SARIMA(1,0,0)(0,1,1)12 model.

| Date | Predicted value | 95% Lower confidence limit | 95% Upper confidence limit |
|---|---|---|---|
| Jan-21 | 98968 | 84888 | 113049 |
| Feb-21 | 74889 | 59123 | 90655 |
| Mar-21 | 103648 | 87482 | 119814 |
| Apr-21 | 103111 | 86844 | 119377 |
| May-21 | 101303 | 85012 | 117595 |
| Jun-21 | 97444 | 81146 | 113742 |
| Jul-21 | 104920 | 88620 | 121220 |
| Aug-21 | 102957 | 86657 | 119257 |
| Sep-21 | 99006 | 82705 | 115306 |
| Oct-21 | 94277 | 77977 | 110578 |
| Nov-21 | 99150 | 82850 | 115450 |
| Dec-21 | 98523 | 82223 | 114823 |

Discussion

Accurate prediction and estimation of infectious diseases is crucial for public health authorities to clearly understand their epidemic characteristics and take preventive and control measures in advance. In the era of big data, the use of historical data to forecast future HBV incidence has become an important issue in public health research. It is therefore important to apply predictive models to explore regularities in HBV infections, as this plays a significant role in their prevention and control. Our findings suggest that the monthly number of hepatitis B cases from 2013 to 2020 showed steady growth together with seasonal and periodic fluctuations. The steady growth of hepatitis B cases may be explained by China's large population and the increasingly frequent movement of large floating populations between urban and rural areas. The seasonal and periodic fluctuations in hepatitis B cases may be explained as follows. First, the most important festival in China is the Spring Festival, which is celebrated in January or February each year, and people usually return to their hometowns early to attend family reunions. Consequently, large numbers of migrant workers move between urban and rural areas. During this period, people are reluctant to visit hospitals because, according to traditional Chinese beliefs, doing so is considered unlucky. Therefore, the lowest monthly number of hepatitis B cases from 2013 to 2020 was typically recorded in February. Second, it is challenging to prevent and control HBV infection because migrant workers return to their workplaces after the Spring Festival. In addition, migrant workers may frequently spend time at entertainment venues. Therefore, the highest monthly number from 2013 to 2020 was typically recorded in March.
Statistical models have been shown to help governments and health departments at all levels recognize epidemic behavior earlier and make better decisions for the prevention and control of infectious diseases. Time series analysis was therefore used to fit an appropriate prediction model that can provide a reference for the surveillance of infectious diseases. It should be noted that choosing a model suited to the features of the data is a prerequisite for obtaining credible results. Based on a literature review and our previous research, we found that ARIMA, the GM (1, 1) model, Decision tree, Random forest, AdaBoost with decision tree (AdaBoost), extreme gradient boosting decision tree (XGBoost), the Elman network, and the generalized regression neural network (GRNN) can be used to predict hepatitis B. Among these, the GM (1,1) model is suitable for short-term prediction with small sample sizes. In contrast, the Decision tree, Random forest, AdaBoost, XGBoost, and GRNN techniques, described as machine-learning-based forecasting models, specialize in dealing with big data and nonlinear problems but require a large number of samples. For example, Yin et al used Decision tree, Random forest, AdaBoost, XGBoost, and GRNN prediction techniques to fit and predict patients co-infected with hepatitis B virus (HBV) and human immunodeficiency virus (HIV) in Jiangsu, China, with a sample of 12,469. Compared with GM (1, 1) and machine-learning-based forecasting models, the SARIMA model has numerous advantages for infectious disease prediction. Owing to its simple theoretical principle, it is much easier to construct and requires at least 36 observations rather than a large number of samples.
Moreover, the SARIMA model uses historical data and effectively captures the dependence between current and past observations, while considering the dynamic nature of infectious disease sequences. As a result, the SARIMA model is widely used to predict infectious diseases such as hemorrhagic fever, seasonal influenza, mumps, visceral leishmaniasis, hand-foot-mouth disease, brucellosis, and scarlet fever. In our study, the prediction technique was selected in accordance with the data characteristics and sample size. The hepatitis B data displayed seasonality and periodicity, and the sample size was 96, which fully met the requirements of SARIMA. SARIMA was therefore applied to hepatitis B data for mainland China from 2013 to 2020. The fitted values reproduced the epidemiological trends of hepatitis B in mainland China well. This research therefore suggests that the SARIMA model has good short-term predictive performance, which is in accordance with previous studies. This study also found that the predicted values for 2021 were within the 95% confidence interval, which theoretically indicates that an outbreak of hepatitis B during this period was unlikely. However, outbreaks of hepatitis B are affected by many social, sociocultural, and economic factors. Consequently, when taking preventive measures in the future, the potential factors that can cause HBV infection cannot be ignored. To the best of our knowledge, this is the first detailed study to explore the construction of a SARIMA model to predict HBV cases in mainland China. However, our study had several limitations. First, only a linear relationship between the incidence of hepatitis B and its associated factors was considered, leaving social, cultural, economic, and other factors out of account.
Second, the SARIMA model has disadvantages: it is feasible only for short-term predictions and cannot handle nonlinear problems. To obtain accurate and long-term prediction results, a large amount of data must be continually collected and updated. Combined forecasting models could then be applied for prediction, such as the SARIMA-NARNNX hybrid model, the SARIMA-NAR hybrid model, and the SARIMA-SDGM hybrid prediction model. Third, almost all prediction techniques have defects; therefore, prediction results should serve only as a reference. We should adopt a correct attitude towards the prediction results and use them to guide practical work, rather than regarding them as the exclusive frame of reference for policy-making.

Conclusions

This study showed that hepatitis B cases from 2013 to 2020 exhibited steady growth and a seasonal and periodic fluctuation tendency. This indicates that predictive and preventive measures should be taken to prevent outbreaks of this infectious disease, which typically peaks in March. The SARIMA model fitted the actual epidemiological trends of hepatitis B in mainland China well. It can help governments and health departments at all levels to provide appropriate measures and allocate health resources reasonably.

Acknowledgments

We thank the Project of the Hospital Management Institute, National Health Commission of the People's Republic of China (Grant No. YLZLXZ-2021-004) for funding this study.

Author contributions

Conceptualization: Daren Zhao, Huiwu Zhang, Ruihua Zhang. Formal analysis: Daren Zhao. Resources: Daren Zhao. Writing – original draft: Daren Zhao, Qing Cao, Zhiyi Wang. Writing – review & editing: Huiwu Zhang.