Daren Zhao1, Huiwu Zhang1, Qing Cao2, Zhiyi Wang3, Ruihua Zhang4. 1. Department of Medical Administration, Sichuan Provincial Orthopedics Hospital, Chengdu, Sichuan, China. 2. Department of Medical Administration, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, Sichuan, China. 3. Department of Medical Administration, Sichuan Cancer Hospital & Institute, Chengdu, Sichuan, China. 4. School of Management, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan, China.
Abstract
Hepatitis B virus infection is a major global public health concern. This study explored the epidemic characteristics and tendency of hepatitis B in 31 provinces of mainland China, constructed a SARIMA model for prediction, and provided corresponding preventive measures. Monthly hepatitis B case data from mainland China from 2013 to 2020 were obtained from the website of the National Health Commission of the People's Republic of China. Monthly data from 2013 to 2020 were used to build the SARIMA model, and data from 2021 were used to test the model. Between 2013 and 2020, 9,177,313 hepatitis B cases were reported in mainland China. SARIMA(1,0,0)(0,1,1)12 was the optimal model, and its residual was white noise. It was used to predict the number of hepatitis B cases from January to December 2021, and the predicted values for 2021 were within the 95% confidence interval. This study suggests that the SARIMA model simulated the epidemiological trends of hepatitis B in mainland China well. The SARIMA model is a feasible tool for monitoring hepatitis B virus infections in mainland China.
Hepatitis B, caused by hepatitis B virus (HBV) infection, is transmitted through percutaneous or permucosal exposure to infected blood or body fluids. Although effective prophylactic vaccines and treatments for hepatitis B exist, the disease still causes high morbidity and mortality and imposes a tremendous economic burden. It has been reported that 3.5% of the global population is chronically infected with hepatitis B virus. In 2013, there were 1.45 million deaths due to viral hepatitis, which was the seventh highest cause of mortality worldwide. Between 2015 and 2019, the number of confirmed cases of hepatitis B increased from 257 million to 296 million, with 1.5 million new infections each year.
Hepatitis B remains a global epidemic, particularly in Southeast Asia, sub-Saharan Africa, and China. In China, hepatitis B is classified as a Category B infectious disease. Over the past few decades, a series of comprehensive preventive measures, including HBV immunization programs, integrated prevention of mother-to-child transmission, and effective finance policies, have been implemented in China and have achieved remarkable progress.
However, China has been reported to have a higher-intermediate prevalence (5%–7.99%); therefore, many challenges remain in the prevention and control of hepatitis B.
Scientific prediction and analysis play a crucial role in the prevention and control of hepatitis B, the proper allocation of health resources, the provision of effective measures, and the promotion of public health. Moreover, a reasonable prediction of hepatitis B can serve as an early warning, so that corresponding preventive measures can be taken in advance. The prediction and analysis of hepatitis B are therefore crucial for its prevention and control.
Time series analysis is a statistical method used to predict future trends based on historical data and time variables. Various prediction approaches have been adopted to handle the temporal characteristics of infectious diseases, such as the exponential smoothing model, the grey model first-order one-variable (GM(1,1)) model, linear regression, and the autoregressive integrated moving average (ARIMA) model. The ARIMA model is widely used to predict infectious diseases. The seasonal ARIMA (SARIMA) model is an extended version of the ARIMA model for series that exhibit periodicity, seasonality, and randomness.
In this study, we collected monthly hepatitis B case data from 31 provinces in mainland China from 2013 to 2020 and applied the SARIMA model to construct a prediction model. The SARIMA(1,0,0)(0,1,1)12 model was fitted to predict the hepatitis B epidemic trends in 2021.
Materials and methods
Data source
Monthly HBV case data for 31 provinces in mainland China from 2013 to 2020 were provided by the National Health Commission of the People's Republic of China (http://www.nhc.gov.cn/). Hepatitis B has been classified as a category B infectious disease since the “Regulations of the People's Republic of China on the Administration of Acute Infectious Diseases” were promulgated by the Ministry of Health of the People's Republic of China in 1978. The diagnostic criteria for hepatitis B (WS 299–2008 Diagnostic Criteria for Hepatitis B) were formulated by the Ministry of Health of the People's Republic of China (http://www.nhc.gov.cn/wjw/s9491/200907/41983.shtml). In China, once a diagnosis of hepatitis B is confirmed, physicians must report the case to the Internet-based surveillance system of the local hygiene departments within 24 hours.
Data from 2013 to 2020 were used as the basic data and materials for the study (Table 1). The research database was divided into training and validation datasets. Monthly data from 2013 to 2020 were used as the training set to build the SARIMA model, and monthly data from 2021 were used as the validation set to test the model.
Table 1
The number of Hepatitis B cases from 2013 to 2020 in mainland China.
| Year | January | February | March | April | May | June | July | August | September | October | November | December | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013 | 102,367 | 78,884 | 107,535 | 97,225 | 96,978 | 86,335 | 97,401 | 97,430 | 87,915 | 86,161 | 88,479 | 87,609 | 1,114,319 |
| 2014 | 90,210 | 83,068 | 99,292 | 94,768 | 91,936 | 88,201 | 95,648 | 94,075 | 87,827 | 85,996 | 85,125 | 88,397 | 1,084,543 |
| 2015 | 96,649 | 72,869 | 104,427 | 94,350 | 91,194 | 89,224 | 93,586 | 89,228 | 89,806 | 86,393 | 87,284 | 90,103 | 1,085,113 |
| 2016 | 89,699 | 82,204 | 105,745 | 93,190 | 95,079 | 90,166 | 91,219 | 97,670 | 87,390 | 85,480 | 91,478 | 91,371 | 1,100,691 |
| 2017 | 86,657 | 99,417 | 110,717 | 98,123 | 101,783 | 100,155 | 98,501 | 103,977 | 96,856 | 89,105 | 97,694 | 97,560 | 1,180,545 |
| 2018 | 109,021 | 86,886 | 120,659 | 106,398 | 108,831 | 98,491 | 103,809 | 105,068 | 95,312 | 94,217 | 100,849 | 96,336 | 1,225,877 |
| 2019 | 107,754 | 90,985 | 113,941 | 110,266 | 106,431 | 97,362 | 112,454 | 106,985 | 97,815 | 98,774 | 102,174 | 102,151 | 1,247,092 |
| 2020 | 91,026 | 51,506 | 88,150 | 101,262 | 97,651 | 99,319 | 106,135 | 102,304 | 105,377 | 95,633 | 100,561 | 100,209 | 1,139,133 |
Ethics statement
In this study, monthly hepatitis B case data for 31 provinces in mainland China from 2013 to 2020 were obtained from publicly accessible sources (http://www.nhc.gov.cn/). The data did not include detailed personal patient information. Therefore, this study did not require a formal ethical assessment.
SARIMA model
The SARIMA model, an extended version of the ARIMA model, is generally expressed as SARIMA(p, d, q)(P, D, Q)s. Its distinctive difference from the ARIMA model is that it includes seasonal characteristics. In this expression, p, d, and q represent the order of autoregression, the degree of trend differencing, and the order of the moving average, respectively; P, D, and Q represent the order of seasonal autoregression, the degree of seasonal differencing, and the order of the seasonal moving average, respectively; and s represents the length of the seasonal period, which was set to 12 in our study.
The modeling process of the SARIMA model consists of 3 parts: series stability, model identification and estimation, and model diagnosis. First, series stability was assessed in 2 ways. One is the original sequence plot, which was inspected to determine whether the series was constant over time. The other is the augmented Dickey-Fuller (ADF) test, which confirms series stability. The ADF test was performed using the EViews 10.0 software. Second, model identification and estimation were based on autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. On this basis, the parameters p, q, P, and Q and the candidate SARIMA models were determined. Third, model diagnosis was performed using a white noise test. The Ljung-Box Q test was used to determine whether the residuals were independently distributed (white noise). A t-test and its P value were used to determine whether the model parameters were statistically significant. Finally, the model with the lowest Schwarz Bayesian information criterion (BIC) value and white noise residuals was identified as the optimal model for predicting future epidemic trends of hepatitis B.
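For readers who prefer the notation written out, the following is the standard Box-Jenkins form of a multiplicative SARIMA model. It is given here as a reference sketch and is not taken from the original text: the backshift operator B (with B X_t = X_{t-1}), the polynomial symbols, and the sign convention for the moving average terms are assumptions, and software packages differ in the sign they report for moving average coefficients.

```latex
% General multiplicative SARIMA(p, d, q)(P, D, Q)_s model in backshift notation.
% \phi_p, \theta_q: nonseasonal AR and MA polynomials of orders p and q.
% \Phi_P, \Theta_Q: seasonal AR and MA polynomials of orders P and Q.
\phi_p(B)\,\Phi_P(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D}\,X_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t

% Special case SARIMA(1,0,0)(0,1,1)_{12} without a constant term:
(1 - \phi_1 B)\,(1 - B^{12})\,X_t = (1 - \Theta_1 B^{12})\,\varepsilon_t

% Expanded, the same special case reads:
X_t = X_{t-12} + \phi_1\,(X_{t-1} - X_{t-13}) + \varepsilon_t - \Theta_1\,\varepsilon_{t-12}
```

Under this convention, each monthly value is modeled as the value from the same month of the previous year, adjusted by the most recent year-over-year change and by the forecast error made 12 months earlier.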
Assessment of prediction performance
In this section, the prediction performance is evaluated using a mathematical formula that describes the goodness-of-fit between the predicted and observed values, where X̂t is the predicted value, Xt is the observed value, and n is the sequence sample size.
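The formula itself did not survive in this copy of the text. As a hedged illustration, the sketch below implements the textbook definitions of three error measures that appear in Table 2 (RMSE, MAE, and MAPE). It is an assumption that the study used these exact definitions; the function names and the toy numbers are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: sqrt(mean((predicted - observed)^2))."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: mean(|predicted - observed|)."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observed values must be nonzero."""
    n = len(observed)
    return 100.0 * sum(abs((p - o) / o) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    # Toy values for illustration only; they are not data from the study.
    observed = [100.0, 200.0, 400.0]
    predicted = [110.0, 190.0, 400.0]
    print("RMSE:", rmse(observed, predicted))
    print("MAE:", mae(observed, predicted))
    print("MAPE (%):", mape(observed, predicted))
```

Lower values of all three measures indicate a closer fit between predicted and observed counts.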
Statistical analysis
EViews 10.0 and SPSS software (version 23.0) were used to construct the SARIMA model. The significance level was set at P < .05.
Results
Description of hepatitis B cases from 2013 to 2020 in mainland China
A total of 9,177,313 hepatitis B cases were reported in mainland China between 2013 and 2020, with an average annual growth rate of 0.05% (Table 1 and Fig. 1). Figure 2 shows that the monthly number of hepatitis B cases was lowest in February and highest in March in most years from 2013 to 2020; in Table 1, the exceptions are 2017, when January was the lowest month, and 2020, when July was the highest. The series was dominated by seasonal and periodic fluctuations.
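The seasonal pattern can be checked directly against Table 1. The sketch below transcribes the monthly counts from Table 1 and reports the lowest and highest month of each year, together with the overall total. The helper name `extreme_months` is illustrative and not part of the original analysis.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Monthly hepatitis B case counts transcribed from Table 1 (January to December).
CASES = {
    2013: [102367, 78884, 107535, 97225, 96978, 86335, 97401, 97430, 87915, 86161, 88479, 87609],
    2014: [90210, 83068, 99292, 94768, 91936, 88201, 95648, 94075, 87827, 85996, 85125, 88397],
    2015: [96649, 72869, 104427, 94350, 91194, 89224, 93586, 89228, 89806, 86393, 87284, 90103],
    2016: [89699, 82204, 105745, 93190, 95079, 90166, 91219, 97670, 87390, 85480, 91478, 91371],
    2017: [86657, 99417, 110717, 98123, 101783, 100155, 98501, 103977, 96856, 89105, 97694, 97560],
    2018: [109021, 86886, 120659, 106398, 108831, 98491, 103809, 105068, 95312, 94217, 100849, 96336],
    2019: [107754, 90985, 113941, 110266, 106431, 97362, 112454, 106985, 97815, 98774, 102174, 102151],
    2020: [91026, 51506, 88150, 101262, 97651, 99319, 106135, 102304, 105377, 95633, 100561, 100209],
}


def extreme_months(cases):
    """Return {year: (lowest_month, highest_month)} for monthly count lists."""
    result = {}
    for year, counts in cases.items():
        lowest = MONTHS[counts.index(min(counts))]
        highest = MONTHS[counts.index(max(counts))]
        result[year] = (lowest, highest)
    return result


if __name__ == "__main__":
    for year, (lowest, highest) in extreme_months(CASES).items():
        print(year, "lowest:", lowest, "highest:", highest)
    print("total cases:", sum(sum(counts) for counts in CASES.values()))
```

On the Table 1 counts, February is the lowest month and March the highest in most years; January is lowest in 2017 and July is highest in 2020.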
Figure 1
The monthly number of Hepatitis B cases from 2013 to 2020 in mainland China.
Figure 2
The monthly number of Hepatitis B cases each year from 2013 to 2020 in mainland China.
Construction of SARIMA model
As shown in Fig. 3, the original sequence appears to be non-stationary, suggesting that trend and seasonal differencing or an augmented Dickey-Fuller (ADF) test should be performed. After natural logarithm transformation, first-order differencing, and seasonal differencing with a period of 12 (Fig. 4), the time series was stationary. Thus, parameters d, D, and s were 0, 1, and 12, respectively. Furthermore, the ADF test results (t = −4.734, P < .001) indicated that the time series was stationary after the natural logarithm transformation and differencing.
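The transformation steps named above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and synthetic data, not the EViews or SPSS procedure used in the study: `difference(series, lag=1)` corresponds to trend differencing, and `lag=12` to seasonal differencing of monthly data.

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def log_seasonal_difference(series, period=12):
    """Take natural logs, then difference at the seasonal lag."""
    logged = [math.log(x) for x in series]
    return difference(logged, lag=period)


if __name__ == "__main__":
    # Synthetic two-year series: the second year is 10% above the first,
    # so every seasonal difference of the logs equals log(1.1).
    year1 = [100.0, 80.0, 110.0, 98.0, 97.0, 90.0, 96.0, 95.0, 89.0, 87.0, 88.0, 90.0]
    year2 = [1.1 * x for x in year1]
    diffs = log_seasonal_difference(year1 + year2, period=12)
    print(len(diffs), "seasonal differences, first value:", diffs[0])
```

Seasonal differencing shortens the series by one full period, so 96 monthly observations leave 84 values for estimation.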
Figure 3
Plot of the original sequence.
Figure 4
Plot of the sequence after natural logarithm transformation, first-order differencing, and seasonal differencing with a period of 12.
Figures 5 and 6 show the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. In the ACF plot, after first-order differencing, the autocorrelation coefficients exhibited slow decay. Meanwhile, the ACF was maximal at order 1 and significantly higher than at order 2 or 3; therefore, we identified parameter q as 0 or 1 and p as 1. Similarly, the PACF plot suggests that Q is 1 and P is 0 or 1, because the coefficients tail off with slow decay and their maximum is at order 1 rather than at order 2 or 3. On this basis, the parameters p, q, P, and Q and the candidate SARIMA models were determined (Table 2).
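To make the quantity plotted in an ACF chart concrete, the sketch below computes sample autocorrelation coefficients with the usual estimator (lag-k autocovariance divided by lag-0 autocovariance) and the approximate large-sample 95% bound of 1.96/sqrt(n). The function names and the toy series are hypothetical; the study's plots were produced with statistical software, not with this code.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series)
    coefficients = []
    for k in range(1, max_lag + 1):
        ck = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
        coefficients.append(ck / c0)
    return coefficients


def approx_95_bound(n):
    """Approximate 95% significance bound for a white noise series of length n."""
    return 1.96 / math.sqrt(n)


if __name__ == "__main__":
    # Toy alternating series: strong negative lag-1 and positive lag-2 correlation.
    toy = [1, -1, 1, -1, 1, -1, 1, -1]
    print("ACF:", acf(toy, 2))
    print("bound:", approx_95_bound(len(toy)))
```

Coefficients that fall outside the bound are the ones usually read as significant when choosing candidate orders.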
Figure 5
Plot of the ACF of the differenced sequence.
Figure 6
Plot of the PACF of the differenced sequence.
Table 2
Parameter estimation of candidate SARIMA models.
| Model | R-squared | RMSE | MAPE | MAE | Normalized BIC | Ljung-Box Q(18) statistic | Ljung-Box Q(18) DF | Ljung-Box Q(18) P value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SARIMA(1,0,0)(0,1,1)12 | 0.441 | 7236.625 | 5.493 | 4980.087 | 17.879 | 21.036 | 16 | .177 |
| SARIMA(1,0,0)(1,1,1)12 | 0.489 | 7009.027 | 5.290 | 4728.593 | 17.921 | 16.843 | 15 | .328 |
| SARIMA(1,0,1)(0,1,1)12 | 0.496 | 6957.034 | 5.127 | 4545.904 | 17.906 | 13.915 | 15 | .532 |
| SARIMA(1,0,1)(1,1,1)12 | 0.496 | 7005.569 | 5.095 | 4513.149 | 17.973 | 12.559 | 14 | .561 |
Further analysis with the white noise test revealed that the candidate SARIMA models passed the Ljung-Box Q test (P > .05). Based on the estimates and standard errors of the candidate SARIMA model parameters (Table 3) and the lowest Schwarz Bayesian information criterion (BIC) value, we found that SARIMA(1,0,0)(0,1,1)12 was the optimal model: its parameters passed the t-test (P < .001), and its residual was white noise (Fig. 7). Moreover, the observed values from 2013 to 2020 in mainland China were well simulated by the SARIMA(1,0,0)(0,1,1)12 model (Fig. 8).
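The selection rule described here (keep the candidates whose residuals pass the Ljung-Box test, then take the lowest BIC) can be sketched as follows. The BIC and P values are transcribed from Table 2; the function name `select_model` and the dictionary layout are illustrative assumptions, and the sketch does not reproduce the additional check on parameter significance from Table 3.

```python
# Normalized BIC and Ljung-Box Q(18) P values transcribed from Table 2.
CANDIDATES = [
    {"model": "SARIMA(1,0,0)(0,1,1)12", "bic": 17.879, "ljung_box_p": 0.177},
    {"model": "SARIMA(1,0,0)(1,1,1)12", "bic": 17.921, "ljung_box_p": 0.328},
    {"model": "SARIMA(1,0,1)(0,1,1)12", "bic": 17.906, "ljung_box_p": 0.532},
    {"model": "SARIMA(1,0,1)(1,1,1)12", "bic": 17.973, "ljung_box_p": 0.561},
]


def select_model(candidates, alpha=0.05):
    """Return the lowest-BIC candidate whose residuals pass the white noise test."""
    passing = [c for c in candidates if c["ljung_box_p"] > alpha]
    if not passing:
        raise ValueError("no candidate has white noise residuals at this alpha")
    return min(passing, key=lambda c: c["bic"])


if __name__ == "__main__":
    best = select_model(CANDIDATES)
    print("selected:", best["model"], "with normalized BIC", best["bic"])
```

All four candidates pass the Ljung-Box test, so the choice comes down to the BIC, which favors the most parsimonious model.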
Table 3
Estimates and standard errors of the candidate SARIMA model parameters.
| Model | Parameter | Estimate | SE | t | P value |
| --- | --- | --- | --- | --- | --- |
| SARIMA(1,0,0)(0,1,1)12 | AR, Lag 1 | 0.504 | 0.098 | 5.145 | <.001 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.611 | 0.147 | 4.150 | <.001 |
| SARIMA(1,0,0)(1,1,1)12 | Constant | 1213.634 | 600.071 | 2.022 | .046 |
| | AR, Lag 1 | 0.449 | 0.101 | 4.434 | <.001 |
| | AR, Seasonal Lag 1 | 0.186 | 0.284 | 0.654 | .515 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.984 | 5.270 | 0.187 | .852 |
| SARIMA(1,0,1)(0,1,1)12 | Constant | 1382.365 | 606.969 | 2.277 | .025 |
| | AR, Lag 1 | 0.649 | 0.183 | 3.542 | .001 |
| | MA, Lag 1 | 0.237 | 0.234 | 1.011 | .315 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.932 | 0.847 | 1.101 | .274 |
| SARIMA(1,0,1)(1,1,1)12 | Constant | 1097.881 | 737.941 | 1.488 | .141 |
| | AR, Lag 1 | 0.663 | 0.197 | 3.374 | .001 |
| | MA, Lag 1 | 0.287 | 0.249 | 1.152 | .253 |
| | AR, Seasonal Lag 1 | 0.272 | 0.340 | 0.800 | .426 |
| | Seasonal Difference | 1 | | | |
| | MA, Seasonal Lag 1 | 0.998 | 42.748 | 0.023 | .981 |
Figure 7
Normal Q-Q plot of noise residual from SARIMA(1,0,0)(0,1,1)12.
Figure 8
Comparison of the predicted value and observed value results from SARIMA(1,0,0)(0,1,1)12 model.
Therefore, SARIMA(1,0,0)(0,1,1)12 was used to predict the number of HBV cases in mainland China between January and December 2021. The prediction results are presented in Table 4. Moreover, the predicted values were within the 95% confidence interval.
Table 4
Prediction results of the SARIMA(1,0,0)(0,1,1)12 model.
| Date | Predicted value | 95% Lower confidence limit | 95% Upper confidence limit |
| --- | --- | --- | --- |
| Jan-21 | 98968 | 84888 | 113049 |
| Feb-21 | 74889 | 59123 | 90655 |
| Mar-21 | 103648 | 87482 | 119814 |
| Apr-21 | 103111 | 86844 | 119377 |
| May-21 | 101303 | 85012 | 117595 |
| Jun-21 | 97444 | 81146 | 113742 |
| Jul-21 | 104920 | 88620 | 121220 |
| Aug-21 | 102957 | 86657 | 119257 |
| Sep-21 | 99006 | 82705 | 115306 |
| Oct-21 | 94277 | 77977 | 110578 |
| Nov-21 | 99150 | 82850 | 115450 |
| Dec-21 | 98523 | 82223 | 114823 |
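One way such a forecast table could be used for surveillance is to flag months whose reported count falls outside the 95% limits. The sketch below uses the limits transcribed from Table 4. It assumes a symmetric, normal-based interval (z = 1.96) when backing out the forecast standard error, which is an assumption and not something the source states, and the observed count in the example is a hypothetical number, not 2021 surveillance data.

```python
# (predicted, lower 95% limit, upper 95% limit) transcribed from Table 4.
FORECAST_2021 = {
    "Jan-21": (98968, 84888, 113049),
    "Feb-21": (74889, 59123, 90655),
    "Mar-21": (103648, 87482, 119814),
    "Apr-21": (103111, 86844, 119377),
    "May-21": (101303, 85012, 117595),
    "Jun-21": (97444, 81146, 113742),
    "Jul-21": (104920, 88620, 121220),
    "Aug-21": (102957, 86657, 119257),
    "Sep-21": (99006, 82705, 115306),
    "Oct-21": (94277, 77977, 110578),
    "Nov-21": (99150, 82850, 115450),
    "Dec-21": (98523, 82223, 114823),
}


def implied_standard_error(lower, upper, z=1.96):
    """Back out the forecast standard error, assuming a normal-based interval."""
    return (upper - lower) / (2 * z)


def outside_interval(observed, lower, upper):
    """True when an observed count lies outside the forecast limits."""
    return observed < lower or observed > upper


if __name__ == "__main__":
    predicted, lower, upper = FORECAST_2021["Feb-21"]
    hypothetical_observed = 95000  # illustrative value only, not real 2021 data
    print("implied SE:", round(implied_standard_error(lower, upper)))
    print("flag for review:", outside_interval(hypothetical_observed, lower, upper))
```

A flagged month is a prompt for closer review and does not by itself establish an outbreak.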
Discussion
Accurate prediction and estimation of infectious diseases is crucial for public health authorities to clearly understand their epidemic characteristics and take preventive and control measures in advance. In the era of big data, the use of historical data to forecast future HBV incidence has recently become an important issue in public health research. Therefore, applying predictive models to explore the regularity of HBV infections is important and plays a significant role in their prevention and control.
Our findings suggest that the monthly number of hepatitis B cases from 2013 to 2020 showed uniform growth together with seasonal and periodic fluctuations. The uniform growth is attributed to China's large population and its increasingly mobile floating population in urban and rural areas. The seasonal and periodic fluctuations can be explained as follows. First, the most important festival in China is the Spring Festival, which is celebrated in January or February each year, and people return to their hometowns early for family reunions. Consequently, large numbers of migrant workers move between urban and rural areas. During this time, people avoid going to the hospital because, according to traditional Chinese beliefs, doing so is considered unlucky. Therefore, the lowest monthly number of hepatitis B cases each year from 2013 to 2020 was generally recorded in February. Second, it is challenging to prevent and control HBV infection because migrant workers return to their workplaces after the Spring Festival.
In addition, migrant workers frequently spend time in entertainment venues and enjoy nightlife. Therefore, the highest point in each year from 2013 to 2020 was generally in March.
Statistical and mathematical models have been shown to help governments and hygiene departments at all levels recognize epidemic behavior earlier and make better decisions for the prevention and control of infectious diseases. Therefore, time series analysis was developed to fit an appropriate prediction model that can provide a reference for the surveillance of infectious diseases. It should be noted that choosing the right model according to the features of the data is a prerequisite for obtaining credible results. Based on a literature review and our previous research, we found that the ARIMA model, GM(1,1) model, decision tree, random forest, AdaBoost with decision tree (AdaBoost), extreme gradient boosting decision tree (XGBoost), Elman network, and generalized regression neural network (GRNN) can be used to predict hepatitis B. Among these, the GM(1,1) model is suitable for short-term prediction with small sample sizes. In contrast, decision tree, random forest, AdaBoost, XGBoost, and GRNN, which are machine-learning-based forecasting models, specialize in dealing with big data and nonlinear problems but require a large number of samples. For example, Yin et al used decision tree, random forest, AdaBoost, XGBoost, and GRNN to fit and predict patients co-infected with HBV and human immunodeficiency virus (HIV) in Jiangsu, China, in a study with 12,469 samples.
Compared with the GM(1,1) and machine-learning-based forecasting models, the SARIMA model has numerous advantages for infectious disease prediction. Owing to its simple theoretical principle, it is much easier to construct and requires at least 36 samples rather than a large number of samples.
Moreover, the SARIMA model uses historical data and effectively captures the dependence between current and past observations while considering the dynamic nature of infectious disease sequences. As a result, the SARIMA model is widely used to predict infectious diseases such as hemorrhagic fever, seasonal influenza, mumps, visceral leishmaniasis, hand-foot-mouth disease, brucellosis, and scarlet fever.
In our study, the prediction technique was selected in accordance with the data characteristics and sample size. The hepatitis B data displayed seasonality and periodicity, and the sample size was 96, which fully met the requirements of SARIMA. Therefore, SARIMA was applied to the prediction of hepatitis B in mainland China from 2013 to 2020. The prediction results simulated the epidemiological trends of hepatitis B in mainland China well. This research therefore suggests that the SARIMA model has good short-term predictive performance, which is in accordance with previous studies. This study also found that the predicted values for 2021 were within the 95% confidence interval, which theoretically indicates that an outbreak of hepatitis B during this period was unlikely. However, outbreaks of hepatitis B are affected by many social, sociocultural, and economic factors. Consequently, when taking preventive measures in the future, the potential factors that can cause HBV infection cannot be ignored.
To the best of our knowledge, this is the first detailed study to explore the construction of a SARIMA model to predict HBV cases in mainland China. However, our study had several limitations. First, only a linear relationship between the incidence of hepatitis B and its associated factors was considered, leaving social, cultural, economic, and other factors out of account.
Second, the SARIMA model has disadvantages: it is feasible only for short-term predictions and cannot handle nonlinear problems. To obtain accurate and long-term prediction results, a large amount of data must be continually collected and updated. Combined forecasting models could then be applied for prediction, such as the SARIMA-NARNNX hybrid model, SARIMA-NAR hybrid model, and SARIMA-SDGM hybrid prediction model. Third, almost all prediction techniques have defects; therefore, the prediction results serve only as a reference. We should adopt a correct attitude towards the prediction results and use them to guide practical work, rather than regarding them as the exclusive reference for policy-making.
Conclusions
This study showed that hepatitis B cases from 2013 to 2020 exhibited uniform growth and a seasonal and periodic fluctuation tendency. This warns us that predictive and preventive measures should be taken to prevent outbreaks of this infectious disease, which generally peaks in March each year. The SARIMA model simulated the actual epidemiological trends of hepatitis B in mainland China well. It can help governments and hygiene departments at all levels to provide appropriate measures and reasonably allocate health resources.
Acknowledgments
We thank the Project of the Hospital Management Institute, National Health Commission of the People's Republic of China (Grant No. YLZLXZ-2021-004) for funding this study.