Enbin Yang, Hao Zhang, Xinsheng Guo, Zinan Zang, Zhen Liu, Yuanning Liu.
Abstract
BACKGROUND: Tuberculosis (TB) is the respiratory infectious disease with the highest incidence in China. We aim to design a series of forecasting models and identify the factors that affect the incidence of TB, thereby improving the accuracy of incidence prediction.
Keywords: Hybrid forecasting model; Machine learning; Model explanation; Multivariate multi-step LSTM model; SHAP; Tuberculosis
Year: 2022 PMID: 35606725 PMCID: PMC9128107 DOI: 10.1186/s12879-022-07462-8
Source DB: PubMed Journal: BMC Infect Dis ISSN: 1471-2334 Impact factor: 3.667
Fig. 1 a Finding the optimal value of the regularization parameter in LASSO by MSE. b Compressing the coefficients of irrelevant factors to 0 by LASSO
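To illustrate the idea behind this figure, the following is a minimal sketch of LASSO fitted by coordinate descent, followed by a hold-out MSE comparison over a few penalty values. It is not the authors' implementation: the toy data, the function names `soft_threshold` and `lasso_coordinate_descent`, the penalty grid, and the single hold-out split are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it falls inside."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 (hypothetical helper)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j added back in.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

# Toy data (assumption): the third factor has no effect on the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Panel b idea: a sufficiently large penalty sets the irrelevant coefficient to 0.
beta = lasso_coordinate_descent(X, y, lam=0.2)

# Panel a idea: compare penalties by MSE on held-out data.
train, valid = slice(0, 150), slice(150, 200)
for lam in (0.01, 0.1, 0.5):
    b = lasso_coordinate_descent(X[train], y[train], lam)
    mse = np.mean((y[valid] - X[valid] @ b) ** 2)
    print(f"lam={lam}: hold-out MSE={mse:.4f}, coefficients={np.round(b, 3)}")
```

In practice a cross-validated search over a finer penalty grid would be more robust than the single split used here.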
Fig. 2 a Monthly incidence of TB in Liaoning Province from Jan 2005 to Dec 2015. b First-order difference. c First-order seasonal difference. d First-order difference and first-order seasonal difference
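The transformations named in panels b to d can be sketched as follows. This is an illustrative example on a made-up monthly series, not the TB data; the helper `difference` and the 12-month seasonal lag are assumptions.

```python
import numpy as np

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    series = np.asarray(series, dtype=float)
    return series[lag:] - series[:-lag]

# Hypothetical monthly series: linear trend plus a repeating 12-month pattern.
months = np.arange(48)
seasonal = np.tile([3, 2, 1, 0, 0, 1, 2, 3, 2, 1, 0, 1], 4)
toy = 0.5 * months + seasonal

first_diff = difference(toy, lag=1)      # first-order difference
seasonal_diff = difference(toy, lag=12)  # first-order seasonal difference
both = difference(first_diff, lag=12)    # both differences combined
```

On this toy series the seasonal difference removes the repeating pattern and leaves only the trend increment, while applying both differences removes the trend as well.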
A series of alternative ARIMA (p, 1, q) models (24-step-ahead prediction)
| (p, q) | AIC | BIC | RMSE | MAE | MAPE (%) | sMAPE (%) |
|---|---|---|---|---|---|---|
| (5, 5) | 408.04 | 442.55 | 0.5628 | 0.4724 | 11.1506 | 10.3224 |
| (3, 3) | 419.26 | 442.26 | 1.3423 | 1.1488 | 26.5904 | 22.5749 |
| (2, 1) | 427.84 | 442.21 | 0.8994 | 0.7413 | 16.9383 | 19.3824 |
| (3, 5) | 417.13 | 445.88 | 0.9290 | 0.7617 | 17.8239 | 15.7419 |
Bold indicates the best performing model
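The four error measures reported in these tables can be computed along the following lines. This is a sketch using common textbook definitions; the exact sMAPE variant used in the paper is not shown here, so the form below (absolute error divided by the mean of the absolute actual and predicted values) is an assumption, as is the function name `forecast_errors`.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Return RMSE, MAE, MAPE (%) and sMAPE (%) for two equal-length series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # MAPE is undefined when an actual value is zero.
    mape = 100.0 * np.mean(np.abs(err) / np.abs(actual))
    # Assumed sMAPE variant: error relative to the mean magnitude of both series.
    smape = 100.0 * np.mean(2.0 * np.abs(err) / (np.abs(actual) + np.abs(predicted)))
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "sMAPE": smape}

# Made-up values for illustration only.
scores = forecast_errors([4.0, 5.0], [5.0, 4.0])
```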
A series of alternative SARIMA models (24-step-ahead prediction)
| Parameters | AIC | BIC | RMSE | MAE | MAPE (%) | sMAPE (%) |
|---|---|---|---|---|---|---|
| (0, 1, 0, 1) | 340.64 | 348.98 | 0.8471 | 0.7204 | 15.9791 | 14.5258 |
| (0, 2, 0, 1) | 341.56 | 352.68 | 0.7778 | 0.6519 | 14.4502 | 13.2851 |
| (0, 1, 1, 1) | 342.54 | 353.66 | 0.8323 | 0.7083 | 15.7193 | 14.3164 |
| (1, 2, 0, 1) | 342.83 | 356.73 | 0.8040 | 0.6764 | 14.9867 | 13.7548 |
Bold indicates the best performing model
Fig. 3 QQ plot of the ARIMA (2, 1, 4) model.
Fig. 4 a, c, and e are the autocorrelation plots of the original series, the ARIMA (2, 1, 4) model, and the SARIMA model. b, d, and f are the corresponding partial autocorrelation plots
Fig. 5 a ARIMA (2, 1, 4) model prediction. b SARIMA model prediction
Forecast performance of the multivariate LSTM model with factor sets of different sizes
| Number of factors | RMSE | MAE | MAPE (%) | sMAPE (%) |
|---|---|---|---|---|
| 24 | 0.5002 | 0.3779 | 9.1007 | 8.7484 |
| 10 | 0.4854 | 0.3502 | 8.6350 | 8.0442 |
| 5 | 0.6184 | 0.4462 | 11.1905 | 9.9987 |
| 0 | 0.5213 | 0.4106 | 10.2265 | 9.4457 |
Bold indicates the best performing model
Fig. 6 a Multivariate 2-step LSTM model prediction. b 3-step ARIMA–LSTM hybrid model prediction
Comparison of the forecast performance of each model
| Model | RMSE | MAE | MAPE (%) | sMAPE (%) |
|---|---|---|---|---|
| 6-step ahead prediction between January 2016 and June 2016 | ||||
| A | 0.3244 (−) | 0.2811 (−) | 6.0454 (−) | 5.8097 (−) |
| B | 1.0157 (+ 213.10%) | 0.9339 (+ 232.23%) | 20.0035 (+ 230.89%) | 17.9006 (+ 208.12%) |
| | ||||
| | 0.4659 (+ 43.62%) | 0.3206 (+ 14.05%) | 6.8661 (+ 13.58%) | 7.4156 (+ 27.64%) |
| 12-step ahead prediction between January 2016 and December 2016 | ||||
| A | 0.4425 (−) | 0.3917 (−) | 9.7674 (−) | 9.1462 (−) |
| B | 0.7825 (+ 63.40%) | 0.6508 (+ 66.15%) | 14.5400 (+ 48.86%) | 13.2301 (+ 44.65%) |
| | 0.4060 (− 8.25%) | 0.3073 (− 21.55%) | 7.8076 (− 20.06%) | 7.5203 (− 17.78%) |
| | ||||
| 24-step ahead prediction between January 2016 and December 2017 | ||||
| A | 0.4672 (−) | 0.4177 (−) | 9.9328 (−) | 9.3198 (−) |
| B | 0.7634 (+ 63.40%) | 0.6384 (+ 52.84%) | 14.1518 (+ 42.48%) | 13.0495 (+ 40.02%) |
| | 0.4108 (− 12.07%) | 0.3295 (− 21.12%) | 7.7436 (− 22.04%) | 7.4895 (− 19.64%) |
| | ||||
Values are given as x (y), where x is the error value and y is the percentage change relative to the ARIMA model; (−) indicates a null value. A is the ARIMA model and B is the SARIMA model. The new models proposed in this paper are marked with a superscript: the multivariate 2-step LSTM model and the 3-step ARIMA–LSTM hybrid forecasting model
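The percentage change in parentheses can be reproduced from the error values themselves. The sketch below uses the 24-step RMSE of the ARIMA baseline (0.4672) and the RMSE of 0.4108 from the same block of the table; the function name `percent_change` is an assumption.

```python
def percent_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline

arima_rmse_24 = 0.4672   # ARIMA baseline, 24-step ahead
other_rmse_24 = 0.4108   # compared model, 24-step ahead
change = percent_change(other_rmse_24, arima_rmse_24)
print(f"{change:+.2f}%")  # prints -12.07%
```

A negative value means the compared model has a lower error than the ARIMA baseline.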
Fig. 7 Impact of single sample characteristics (January 2016 forecast)
Fig. 8 Feature impact (24 samples of the test set)
Fig. 9 a Scatter plot of feature density. b Feature importance SHAP values
Fig. 10 The three-layer LSTM internal and external structure used in this paper
Fig. 11 a Input and output of multivariate n-step LSTM model (when ). b Hybrid forecasting model principle
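One common way to prepare inputs and outputs for a multivariate multi-step forecasting model is a sliding window over the time axis. The sketch below shows that general technique only; how the paper actually defines its windows is not recoverable from this caption, so the function `make_windows`, the window length `n_in`, and the horizon `n_out` are all hypothetical.

```python
import numpy as np

def make_windows(data, target_col, n_in, n_out):
    """Slice a (time, features) array into supervised samples.

    X[i] holds n_in consecutive rows of all features;
    y[i] holds the next n_out values of the target column.
    """
    data = np.asarray(data, dtype=float)
    n_samples = data.shape[0] - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([data[i:i + n_in] for i in range(n_samples)])
    y = np.stack([data[i + n_in:i + n_in + n_out, target_col]
                  for i in range(n_samples)])
    return X, y

# Made-up data: 20 time steps, 2 features; column 0 plays the role of the target.
toy = np.arange(40, dtype=float).reshape(20, 2)
X, y = make_windows(toy, target_col=0, n_in=12, n_out=2)
print(X.shape, y.shape)
```

The resulting `X` has the (samples, time steps, features) layout that recurrent layers in most deep learning libraries expect.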