Comparing statistical models to predict dengue fever notifications.

Arul Earnest, Say Beng Tan, Annelies Wilder-Smith, David Machin.

Abstract

Dengue fever (DF) is a serious public health problem in many parts of the world, and, in the absence of a vaccine, disease surveillance and mosquito vector eradication are important in controlling the spread of the disease. DF is primarily transmitted by the female Aedes aegypti mosquito. We compared two statistical models that can be used in the surveillance and forecasting of notifiable infectious diseases, namely, the Autoregressive Integrated Moving Average (ARIMA) model and the Knorr-Held two-component (K-H) model. The Mean Absolute Percentage Error (MAPE) was used to compare the models. We developed the models using data on DF notifications in Singapore from January 2001 to December 2006 and then validated them with data from January 2007 to June 2008. The K-H model resulted in a slightly lower MAPE of 17.21, compared with 17.54 for the ARIMA model. We conclude that the models' performances are similar, but we found that the K-H model was relatively more difficult to fit, both in the specification of the prior parameters and in the longer time taken to run the models.

Year: 2012 PMID: 22481978 PMCID: PMC3310403 DOI: 10.1155/2012/758674

Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238

1. Introduction

The incidence of dengue fever (DF) has grown dramatically around the world in recent decades, with some 2.5 billion people now at risk of the disease [1]. Dengue haemorrhagic fever (DHF) is a potentially lethal complication, with an estimated 500 000 people requiring hospitalization each year, a very large proportion of whom are children. About 2.5% of those affected die [1]. DF is a viral vector-borne disease that is common in the tropics and subtropics and is primarily spread by the female Aedes aegypti mosquito. Mosquito vector control is important in restricting its spread. Controlling the vector population before disease is detected has been found to reduce transmission, with the Aedes aegypti population falling from 16% to 2% over a 3-month period, as measured by the premises index [2].

However, predicting the incidence of vector-borne diseases like DF remains difficult, as DF shows strong variations over time [3-5]. In Singapore, seasonal trends are seen, with peaks generally occurring in June or September. DF is characterized both by epidemic peaks that appear every 3–5 years and by seasonal oscillations within a year. Possible reasons for changes in outbreak patterns include changes in the number of infections due to interventions to eradicate the mosquitoes, as well as changes in the number of people who are susceptible to the disease through prior infections [6]. Seasonal trends in DF can be caused by several factors, including climatic variables such as temperature and precipitation [7-10].

Autoregressive Integrated Moving Average (ARIMA) models have been used in applications such as the assessment of seasonal variation in selected medical conditions [11], and as a surveillance tool for outbreak detection [12]. ARIMA (AR, D, MA) models use previous observations to predict future values through lagged parameters. Lags of the differenced series appearing in the forecasting equation are termed Auto Regressive (AR), lags of the forecast errors are termed Moving Average (MA), and a time series that needs to be differenced to achieve stationarity is termed Differenced (D). The prediction process uses constantly updated information (in our example, weekly DF cases) to predict the course of dengue in subsequent weeks.

Time series analysis of infectious diseases within the Bayesian framework has been considered in some studies [13-16]. One such example demonstrated that Klebsiella pneumoniae is related to the quantity of a third-generation antibiotic (cephalosporin) used in a hospital, with a lag of three months [17]. Others included a Knorr-Held (K-H) two-component model to incorporate both seasonal and epidemic characteristics of notifiable infectious diseases [15], as well as a Bayesian hierarchical time series model to detect outbreaks of Rubella and Salmonella infections [14]. Studies have compared ARIMA models with dynamic models for infectious diseases (fitted via maximum likelihood methods) [18, 19]. However, to the best of our knowledge, a direct comparison between the single-component (ARIMA) and two-component (K-H) models has not been undertaken.

2. Methods

The purpose of this paper is to compare the two-component K-H with the single-component ARIMA model in predicting weekly DF notifications. Different formulations of models within each type are compared, together with a sensitivity analysis of the K-H model, fitted within a Bayesian framework.

2.1. Data

The Singapore Infectious Diseases Act (1977) requires medical practitioners to notify all cases of DF to the Ministry of Health (MoH) within 24 hours. We obtained data from the published “Weekly Infectious Disease Bulletin”, available from the MoH website, which uses the World Health Organization 2009 criteria for DF; these criteria are also detailed there [20]. All notified and registered DF cases were laboratory confirmed, with laboratory assays from Polymerase Chain Reaction (PCR) and/or NS1 antigen (in the first 5 days of illness) and/or a positive Dengue Immunoglobulin M after day 5 of illness. We studied weekly DF notifications in Singapore from January 2001 to June 2008. Data from January 2001 to December 2006 were used to estimate the model parameters. Thereafter, we performed external validation of the models using data from January 2007 to June 2008.

2.2. ARIMA Model

If $f_T$ represents the number of cases of DF in week $T$, then AR relates this observation to an earlier $f_{T-J}$, where $J = 1, 2, \ldots, T-1$. MA relates the error (defined as the difference between the observed, $f$, and the predicted, $F$, notifications) at week $T$ to that at week $(T-K)$, where $K = 1, 2, \ldots$. D allows the differenced series, $\Delta_T = f_T - f_{T-L}$, to be modelled in the event of nonstationarity in the time series, where $L = 0, 1, 2, \ldots$. Here $J$, $K$, and $L$ are the “orders” of the respective ARIMA components. Partial autocorrelation (PAC) and autocorrelation (AC) plots are used to determine $J$ and $K$, respectively.

We describe the ARIMA (3,1,1) model equation used in our analysis. The number of cases of DF at week $T$ is denoted as $f_T$, where $T$ is the first week for which DF is to be predicted:

$$F_T = \mu + \varphi_1 f_{T-1} + \varphi_2 f_{T-2} + \varphi_3 f_{T-3} + \theta\,\varepsilon_{T-1} + \varepsilon_T,$$

where $F_T$ is the predicted number of DF cases for week $T$; $f_{T-1}$, $f_{T-2}$, and $f_{T-3}$ are the DF counts in the three immediately preceding weeks, termed lag 1, lag 2, and lag 3, respectively; and $\varepsilon_T$ and $\varepsilon_{T-1}$ are the error terms at times $T$ and $T-1$, respectively. In essence, we used observed values up to time $T-1$ to predict dengue fever cases at week $T$. $\mu$ is a constant, $\varphi_1$, $\varphi_2$, and $\varphi_3$ are the coefficients of the three autoregressive terms in the model, and $\theta$ is the first-order moving average parameter. These were estimated within Stata V11.0 [21] via full (unconditional) maximum likelihood. For the ARIMA models, we used the Mean Absolute Percentage Error (MAPE), described below, to compare the predictive accuracy of the models.
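To make the one-week-ahead recursion concrete, the following Python sketch applies three autoregressive coefficients to the first-differenced weekly series and adds the predicted change to the last observed count. It is an illustration under stated assumptions, not the Stata routine used in the study: the function name, the toy counts, the omission of the moving average term, and the textbook intercept form (the constant is added directly to the autoregressive terms) are all assumptions.

```python
def forecast_next_week(counts, mu, phi):
    """One-step-ahead forecast for an ARIMA(p,1,0)-style model.

    counts: observed weekly counts up to week T-1 (at least p + 1 values)
    mu:     constant term (assumed to enter the differenced equation directly)
    phi:    AR coefficients [phi_1, ..., phi_p] for lags 1..p
    """
    p = len(phi)
    if len(counts) < p + 1:
        raise ValueError("need at least p + 1 observations")
    # First differences: diffs[i] = counts[i + 1] - counts[i]
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    # The most recent difference is lag 1, the one before it lag 2, and so on.
    recent = diffs[::-1][:p]
    predicted_change = mu + sum(c * d for c, d in zip(phi, recent))
    return counts[-1] + predicted_change


# Hypothetical counts for four consecutive weeks; coefficients are illustrative.
weekly_counts = [100, 110, 105, 120]
forecast = forecast_next_week(weekly_counts, mu=0.28, phi=[-0.10, 0.10, 0.23])
print(round(forecast, 2))  # 120.58
```

Working on differences and then adding back the last level is what the "integrated" order of 1 amounts to for a single-step forecast.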

2.3. Two-Component K-H Model

The K-H model distinguishes between the endemic, $x$, and epidemic, $y$, components of DF such that the number of cases observed is $f_T = x_T + y_T$, and the corresponding prediction model is formulated as

$$F_T = X_T + Y_T.$$

Here $X_T$ and $Y_T$ have independent Poisson distributions with composite parameters $(\omega_T \nu_T)$ and $(\omega_T \lambda_T F_{T-1})$, in which $\omega_T$ handles overdispersion; hence $F_T$ is also Poisson with parameter $\omega_T[\nu_T + \lambda_T F_{T-1}]$. This in turn corresponds to a negative binomial distribution with dispersion parameter $\psi$. The mixing parameter, $\omega_T$, is assumed to have a Gamma distribution with parameters $(\psi + F_T)$ and $(\psi + \nu_T + \lambda_T F_{T-1})$. The endemic parameter, $\nu_T$, is modelled as a harmonic wave (to handle the strong seasonality inherent in infectious disease surveillance data) with

$$\log \nu_T = \gamma_0 + \gamma_1 \sin\!\left(\frac{2\pi}{52}\,T\right) + \gamma_2 \cos\!\left(\frac{2\pi}{52}\,T\right),$$

see [15], where $2\pi/52$ is the base frequency of the curve, which is suitable for weekly data, and $\gamma_0$ is a constant. The logarithmic transformation is necessary to ensure stationarity in the variance of the series.

The epidemic component is derived from the parameter sequence $\lambda = (\lambda_1, \ldots, \lambda_n)$, which is assumed to be piecewise constant [15] with an unknown number $K$ of changepoints at unknown locations $\theta_1 < \cdots < \theta_K$, that is,

$$\lambda_T = \begin{cases} \lambda^{(1)} & \text{if } T = 1, \ldots, \theta_1, \\ \lambda^{(2)} & \text{if } T = \theta_1 + 1, \ldots, \theta_2, \\ \;\vdots & \\ \lambda^{(K+1)} & \text{if } T = \theta_K + 1, \ldots, n, \end{cases}$$

where $\theta_1 < \theta_2 < \cdots < \theta_K$ are the $K$ unknown changepoints, such that $\theta_k \in \{1, 2, \ldots, n-1\}$ for all $k \in \{1, 2, \ldots, K\}$. For $K = 0$, there is no changepoint and $\lambda_T = \lambda^{(1)}$ for all $T = 1, \ldots, n$ [15]. The piecewise function provides flexibility in modelling the outbreaks of dengue fever in addition to the possible seasonal trends that we observe.

The two-component model formulation is completed by specifying prior distributions for the parameters in the model as follows:

$$\gamma \sim N(0, \sigma^2 I), \qquad \psi \sim \mathrm{Ga}(\alpha, \beta),$$

where $N$ denotes a normal and $\mathrm{Ga}$ a Gamma distribution. $\sigma^2$ was set to $10^6$, representing highly dispersed independent normal priors for each coefficient. $I$ is an identity matrix. For $\lambda^{(k)}$, $k = 1, \ldots, K+1$, independent exponential distributions with mean $1/\xi$ and variance $1/\xi^2$ were specified. $\xi$ was then assigned a gamma hyperprior $\mathrm{Ga}(\chi, \delta)$. The marginal prior distribution for $\lambda^{(k)}$ is then a gamma-gamma distribution [22]. In our study, we used $\chi = 10$ and $\delta = 10$, for which the gamma-gamma marginal of $\lambda^{(k)}$ is an F-distribution with (2, 20) degrees of freedom. The marginal prior probability of an outbreak occurring (i.e., $\lambda^{(k)} \geq 1$) is then 0.39, while smaller values of $\lambda^{(k)}$ are always favoured, since the density function is monotonically decreasing. The dispersion parameter for the negative binomial distribution, $\psi$, which was designed to handle extra-Poisson variation in the data, was assigned a gamma hyperprior as well, $\mathrm{Ga}(\alpha, \beta)$. $\alpha$ and $\beta$ were assigned values of 1 and 0.1, respectively, in the original analysis, corresponding to a prior mean and standard deviation of 10.

The K-H models were fitted using the customised Bayesian software Twins V1.0 [15]. Markov Chain Monte Carlo (MCMC) methods, in particular the Metropolis-Hastings algorithm, were used to estimate the parameters. For each model, we ran 200 iterations as burn-in. These burn-in samples were discarded and not used in the analysis. We ran a further 60,000 iterations but saved only every 20th observation, resulting in a final sample size of 3000. This was to circumvent the problem of autocorrelated samples.
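The structure of the two-component mean can be sketched in a few lines of Python. This is only an illustration, not the Twins software: the function names and input values are hypothetical, a single harmonic is assumed (with the sine term paired with γ1 and the cosine term with γ2), and the overdispersion multiplier ω is assumed to have mean 1, so that the expected count is the endemic rate plus λ times the previous week's count.

```python
import math
from bisect import bisect_left

BASE_FREQUENCY = 2 * math.pi / 52  # one full cycle every 52 weeks


def endemic_rate(week, gamma0, gamma1, gamma2):
    """Endemic rate nu_T from a single harmonic on the log scale (assumed form)."""
    log_nu = (gamma0
              + gamma1 * math.sin(BASE_FREQUENCY * week)
              + gamma2 * math.cos(BASE_FREQUENCY * week))
    return math.exp(log_nu)


def epidemic_rate(week, changepoints, levels):
    """Piecewise-constant lambda_T.

    changepoints: sorted weeks theta_1 < ... < theta_K
    levels:       K + 1 values lambda^(1), ..., lambda^(K+1)
    Weeks up to and including theta_1 use levels[0], and so on.
    """
    if len(levels) != len(changepoints) + 1:
        raise ValueError("need exactly one more level than changepoints")
    return levels[bisect_left(changepoints, week)]


def expected_count(week, previous_count, gammas, changepoints, levels):
    """Conditional mean nu_T + lambda_T * previous count (assumes E[omega_T] = 1)."""
    nu = endemic_rate(week, *gammas)
    lam = epidemic_rate(week, changepoints, levels)
    return nu + lam * previous_count


# Hypothetical inputs chosen only for illustration.
gammas = (3.3, -0.2, -0.5)
changepoints = [10, 20]
levels = [0.5, 1.2, 0.8]
print(expected_count(week=11, previous_count=100, gammas=gammas,
                     changepoints=changepoints, levels=levels))
```

Values of λ at or above 1 make the epidemic term self-sustaining from week to week, which is why the prior probability of λ ≥ 1 is read as the prior probability of an outbreak.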

2.4. Model Comparison

We compared the ARIMA model with the K-H model and also conducted a sensitivity analysis on the K-H model using the MAPE:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{T=1}^{n} \left| \frac{f_T - F_T}{f_T} \right|,$$

where $n$ is the total number of weeks of data, and $f_T$ and $F_T$ are the observed and predicted notifications for week $T$. The Bayesian analyses were based on several assumptions regarding the prior distributions, and we assessed the robustness of our results in a sensitivity analysis. For the sensitivity analyses, we considered 4 different scenarios, which involved varying the values of χ, δ or α, β while keeping the other parameters at their original values: Model 1: α = 0.1 and β = 0.1, Model 2: α = 10 and β = 1, Model 3: χ = 1 and δ = 1, and Model 4: χ = 10 and δ = 1. The prior values for the sensitivity analysis were selected to represent a range of realistic scenarios in which the probabilities of an outbreak were expected to be different. In particular, we selected priors where the probability of observing an outbreak ranged from 0.001 (for χ = 10 and δ = 1) to 0.5 (for χ = 1 and δ = 1).
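As a minimal sketch of the comparison metric, the function below computes the MAPE in its standard form (mean of absolute errors relative to the observed value, expressed as a percentage). The function name and the example counts are hypothetical and are not taken from the study data.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(predicted):
        raise ValueError("series must have the same length")
    if not observed:
        raise ValueError("series must not be empty")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed count is zero")
    total = sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


# Hypothetical weekly counts: every forecast is off by 10% of the observed value.
observed = [100, 200, 50]
predicted = [90, 220, 55]
print(round(mape(observed, predicted), 2))  # 10.0
```

Because each error is divided by the observed count, weeks with few notifications can contribute large percentage errors even when the absolute miss is small.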

3. Results

Figure 1 highlights the weekly distribution of DF notifications in Singapore from January 2001 to June 2008. It is evident that DF notification exhibits both seasonal trends (e.g., regular peaks around June or September and troughs seen in the first 4 months of the year) and epidemic trends (most markedly shown during the 2005 epidemic, when average weekly counts exceeded 600 cases).

Figure 1

Weekly cases of dengue fever (DF) in Singapore.

The autocorrelation plots for DF (Figure 2(a)) indicated that correlations gradually declined over the weeks to insignificant values after 12 weeks. The partial autocorrelation plots (Figure 2(b)) showed spikes at week 1 and week 4, indicating the possible inclusion of AR terms of order up to four in the ARIMA model. We evaluated various combinations, including autoregressive terms of order 3 and 4, in our analysis.
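For readers unfamiliar with how such plots are built, here is a minimal Python sketch of the sample autocorrelation at a given lag. The function name and the weekly counts are hypothetical, and the partial autocorrelation, which needs an additional recursion, is not covered.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("series is constant")
    covariance_sum = sum(deviations[i] * deviations[i + lag]
                         for i in range(n - lag))
    return covariance_sum / variance_sum


# Hypothetical weekly counts with a smooth rise and fall.
counts = [40, 45, 52, 60, 75, 90, 84, 70, 61, 50, 44, 41]
for lag in range(4):
    print(lag, round(autocorrelation(counts, lag), 3))
```

A slow decay of these values across lags, as described for Figure 2(a), is the usual sign that differencing or autoregressive terms are needed.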

Figure 2

Plots of autocorrelation and partial autocorrelation for dengue fever (DF).

We explored various formulations of the ARIMA model, and we summarise some of the more important ones in Table 1. As can be seen, ARIMA (3,1,0) provided the lowest MAPE value of 19.86. Including a moving average term did not improve the fit of the model, nor did adding a fourth autoregressive term. Adding a 12-month seasonal component (not shown) also did not lower the MAPE. The parameters for the final ARIMA model are shown in Table 2. We found all three autoregressive terms, AR(1) = −0.10 (P = 0.001), AR(2) = 0.10 (P = 0.002), and AR(3) = 0.23 (P < 0.001), to be statistically significant. The parameters for the K-H model are also provided in Table 2.

Table 1

Comparison of MAPE values across various ARIMA models.

Model	Model specification	MAPE
1	ARIMA (1,0,0)	23.61
2	ARIMA (2,0,0)	23.09
3	ARIMA (3,0,0)	23.20
4	ARIMA (4,0,0)	23.23
5	ARIMA (3,1,0)	19.86
6	ARIMA (3,1,1)	19.96

Table 2

Parameters for the final models.

ARIMA model	Coefficient	95% confidence interval (lower)	95% confidence interval (upper)	P value
Constant (μ)	0.28	−3.86	4.41	0.896
AR 1 (φ ₁)	−0.10	−0.16	−0.04	0.001
AR 2 (φ ₂)	0.10	0.04	0.17	0.002
AR 3 (φ ₃)	0.23	0.17	0.29	<0.001

K-H model	Coefficient	95% credible interval (lower)	95% credible interval (upper)
ψ	25.1	18.4	32.3
γ ₀	3.3	1.9	3.6
γ ₁	−0.2	−0.3	0.3
γ ₂	−0.5	−0.7	−0.4
ξ	1.0	1.0	1.0
K	7.6	0.01	15.0
λ	1.3	0.7	2.0

The comparison between the ARIMA and K-H models is shown in Figure 3. Table 3 shows the results from comparing the two models. Overall, the K-H model performed marginally better than the ARIMA model (MAPE of 17.21 and 17.54, respectively). In particular, the model predicted well (out-of-sample) for certain periods, including the early endemic periods between weeks 1 and 12. Fine-tuning the parameters for the K-H model allowed us to make better predictions for the epidemic periods, as we show in the sensitivity analysis (Table 4). For instance, the model predicted well for the epidemic periods within weeks 17 to 24 (sensitivity analysis 4).

Figure 3

Comparison of out-of-sample forecasts of dengue fever (DF) between ARIMA and two-component K-H model (January 2007 to June 2008).

Table 3

Comparison of out-of-sample predictions (external validation) between ARIMA and K-H models.

MAPE	ARIMA	K-H
Overall	17.54	17.21

Stratified (in 4 week intervals)
Year 2007
Weeks 1 to 4	17.07	14.27
5 to 8	28.60	25.62
9 to 12	33.41	30.63
13 to 16	32.52	33.09
17 to 20	21.83	20.53
21 to 24	20.64	19.76
25 to 28	12.86	13.22
29 to 32	11.53	14.40
33 to 36	8.54	10.26
37 to 40	5.07	6.50
41 to 44	18.49	17.42
45 to 48	8.54	10.35
49 to 52	15.70	12.44
Year 2008
Weeks 1 to 4	11.13	11.16
5 to 8	29.09	25.63
9 to 12	16.39	19.41
13 to 16	15.51	10.77
17 to 20	19.21	18.38
21 to 24	9.83	10.07
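A quick tally of Table 3 shows how evenly the two models split the 4-week intervals. The sketch below simply copies the stratified MAPE values from the table and counts, for each model, the intervals in which it had the lower error. The variable names are hypothetical.

```python
# (interval, ARIMA MAPE, K-H MAPE), copied from Table 3.
TABLE_3 = [
    ("2007 weeks 1-4", 17.07, 14.27),
    ("2007 weeks 5-8", 28.60, 25.62),
    ("2007 weeks 9-12", 33.41, 30.63),
    ("2007 weeks 13-16", 32.52, 33.09),
    ("2007 weeks 17-20", 21.83, 20.53),
    ("2007 weeks 21-24", 20.64, 19.76),
    ("2007 weeks 25-28", 12.86, 13.22),
    ("2007 weeks 29-32", 11.53, 14.40),
    ("2007 weeks 33-36", 8.54, 10.26),
    ("2007 weeks 37-40", 5.07, 6.50),
    ("2007 weeks 41-44", 18.49, 17.42),
    ("2007 weeks 45-48", 8.54, 10.35),
    ("2007 weeks 49-52", 15.70, 12.44),
    ("2008 weeks 1-4", 11.13, 11.16),
    ("2008 weeks 5-8", 29.09, 25.63),
    ("2008 weeks 9-12", 16.39, 19.41),
    ("2008 weeks 13-16", 15.51, 10.77),
    ("2008 weeks 17-20", 19.21, 18.38),
    ("2008 weeks 21-24", 9.83, 10.07),
]

# Count the intervals in which each model has the strictly lower MAPE.
kh_wins = sum(1 for _, arima, kh in TABLE_3 if kh < arima)
arima_wins = sum(1 for _, arima, kh in TABLE_3 if arima < kh)
print(f"K-H lower in {kh_wins} of {len(TABLE_3)} intervals; "
      f"ARIMA lower in {arima_wins}")
# K-H lower in 10 of 19 intervals; ARIMA lower in 9
```

The near-even split is consistent with the small difference between the overall MAPE values.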

Table 4

Sensitivity analysis on K-H model parameters.

MAPE	Initial K-H model	Sensitivity analysis
		1	2	3	4
Overall	17.21	17.71	17.71	17.50	16.54
Stratified (in 4 week intervals)
Year 2007
Weeks 1 to 4	14.27	20.12	22.33	20.30	20.03
5 to 8	25.62	25.41	25.66	25.12	23.39
9 to 12	30.63	31.06	31.30	30.95	31.07
13 to 16	33.09	33.20	32.31	32.49	27.89
17 to 20	20.53	20.41	20.40	20.92	21.82
21 to 24	19.76	21.14	20.90	21.25	21.58
25 to 28	13.22	13.45	14.18	13.28	12.91
29 to 32	14.40	13.98	13.04	13.39	10.65
33 to 36	10.26	10.76	9.97	10.55	6.11
37 to 40	6.50	6.69	6.39	5.54	3.30
41 to 44	17.42	17.62	17.54	16.91	15.94
45 to 48	10.35	11.30	10.59	11.09	10.67
49 to 52	12.44	12.99	12.37	12.58	13.12
Year 2008
Weeks 1 to 4	11.16	11.09	10.85	11.13	11.31
5 to 8	25.63	25.53	26.01	25.49	25.83
9 to 12	19.41	20.03	20.25	19.31	16.28
13 to 16	10.77	10.72	10.47	10.72	11.25
17 to 20	18.38	19.20	17.97	18.59	19.27
21 to 24	10.07	9.33	10.98	9.95	9.92

A description of the parameters used in the sensitivity analysis is provided in Section 2.4.

In terms of forecasting one-week-ahead DF notifications, both methods performed well (Figure 3). For instance, the K-H model forecast 53 (observed 58) for 2007 week 1, 356 (observed 371) for 2007 week 26, 112 (observed 115) for 2008 week 1, and 171 (observed 132) for 2008 week 26. It is worth noting that these results are for out-of-sample predictions. The Bayesian analysis is influenced by the prior specification. As such, we investigated the robustness of our results to different formulations of the priors. These priors represented a wide range of realistic scenarios in which the probability of an outbreak is expected to differ. As can be seen from Table 4, the models have generally similar MAPE values except for sensitivity analysis 4, where the MAPE is the lowest at 16.54. In our local setting, we found that specifying a small prior probability of 0.001 for an outbreak to occur provided a better fit of the data.
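The prior outbreak probabilities quoted for the different settings can be checked in closed form. If λ given ξ is exponential with rate ξ, and ξ follows a gamma distribution with shape χ and rate δ, then integrating out ξ gives P(λ ≥ 1) = (δ / (δ + 1))^χ. The sketch below evaluates this expression. The function name is hypothetical, and the shape/rate reading of Ga(χ, δ) is an assumption, although it reproduces the probabilities stated in the text.

```python
def prior_outbreak_probability(chi, delta):
    """P(lambda >= 1) when lambda | xi ~ Exp(rate xi), xi ~ Ga(shape chi, rate delta)."""
    if chi <= 0 or delta <= 0:
        raise ValueError("chi and delta must be positive")
    return (delta / (delta + 1.0)) ** chi


scenarios = {
    "original (chi=10, delta=10)": (10, 10),
    "sensitivity 3 (chi=1, delta=1)": (1, 1),
    "sensitivity 4 (chi=10, delta=1)": (10, 1),
}
for label, (chi, delta) in scenarios.items():
    print(f"{label}: {prior_outbreak_probability(chi, delta):.3f}")
# original (chi=10, delta=10): 0.386
# sensitivity 3 (chi=1, delta=1): 0.500
# sensitivity 4 (chi=10, delta=1): 0.001
```

These values match the 0.39, 0.5, and 0.001 reported for the original prior and for sensitivity analyses 3 and 4.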

4. Discussion

We found that the K-H model performed better than the conventional ARIMA time series model; however, the difference was only marginal. Forecasting weekly cases of DF has important implications for hospital resource planning. For an infectious disease ward, knowing the normal trend of DF, along with predictions of the following week's DF cases, can allow hospital planners to better plan for and allocate their manpower and other resources. Intensive media campaigns (e.g., television advertisements) in the weeks prior to a projected increase in DF notifications may help to reduce the number of new cases.

Though we used the MAPE index to compare the models, other indices are also available. The Mean Squared Error, for instance, is calculated from the sum of the squared error values. Compared to MAPE, its values are not relative to the magnitude of the observation and are not intuitively easy to interpret.

There were several limitations in our study. Firstly, our analysis was dependent on notification data. While clinicians are required to report all cases of DF and DHF to the MoH, there is a possibility that cases were underreported, especially since mild or asymptomatic cases of DF may not have been diagnosed. While this may have led to underestimates in the forecasts, the comparisons across the models are still valid, as they make use of the same weekly case counts. In our analysis, we compared the predictive capability of the models using one-week-ahead forecasts of dengue fever notifications. It is possible to forecast for longer periods, although such predictions may inherently be less accurate than a one-week forecast.

In conclusion, we found that the final ARIMA and K-H models both predicted the future course of DF in Singapore reliably well, with the latter performing marginally better. The ARIMA models were relatively faster to implement and run, while the K-H model was sensitive to the choice of priors, which needs to be made carefully before the study is conducted.
