Samuel Dixon, Ravikiran Keshavamurthy, Daniel H Farber, Andrew Stevens, Karl T Pazdernik, Lauren E Charles.
Abstract
Accurate infectious disease forecasting can inform efforts to prevent outbreaks and mitigate adverse impacts. This study compares the performance of statistical, machine learning (ML), and deep learning (DL) approaches in forecasting infectious disease incidence across different countries and time intervals. We forecasted three diverse diseases (campylobacteriosis, typhoid, and Q-fever) using a wide variety of features (n = 46) from public datasets, e.g., landscape, climate, and socioeconomic factors. We compared autoregressive statistical models to two tree-based ML models (extreme gradient boosted trees [XGB] and random forest [RF]) and two DL models (multi-layer perceptron and encoder-decoder model). The disease models were trained on region-level data from seven different countries for 2009–2017. Forecasting performance of all models was assessed using mean absolute error, root mean square error, and Poisson deviance across Australia, Israel, and the United States for January through August 2018. The overall model results were compared across diseases and across various data splits, including country, regions with the highest and lowest case counts, and forecast horizon (i.e., nowcasting, short-term, and long-term forecasting). Overall, the XGB models performed best for all diseases and, in general, tree-based ML models performed best across data splits. In a few instances, the statistical or DL models had marginally smaller error metrics for specific subsets of typhoid, a disease with very low case counts. Feature importance per disease was measured using four tree-based ML models (i.e., XGB and RF, each with and without region name as a feature). The most important feature groups included previous case counts, region name, population counts and density, causes of mortality from neonatal to under 5 years of age, sanitation factors, and elevation.
This study demonstrates the power of ML approaches to incorporate a wide range of factors to forecast various diseases, regardless of location, more accurately than traditional statistical approaches.
Keywords: GLARMA; Q-fever; big data; campylobacteriosis; deep learning; infectious disease forecasting; machine learning; multi-feature fusion; prediction; typhoid
Year: 2022 PMID: 35215129 PMCID: PMC8875569 DOI: 10.3390/pathogens11020185
Source DB: PubMed Journal: Pathogens ISSN: 2076-0817
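The three error metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the Poisson deviance is averaged over observations (the common "mean Poisson deviance" form), which is an assumption because the record does not say whether the study summed or averaged it.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted counts."""
    return sum(abs(y - mu) for y, mu in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference."""
    squared = sum((y - mu) ** 2 for y, mu in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mean_poisson_deviance(observed, predicted):
    """Mean of 2 * (y * log(y / mu) - (y - mu)) over all observations.

    The term y * log(y / mu) is taken as 0 when y == 0.
    Predictions must be strictly positive (assumption of this sketch).
    """
    total = 0.0
    for y, mu in zip(observed, predicted):
        if mu <= 0:
            raise ValueError("predicted counts must be strictly positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total / len(observed)


# Hypothetical monthly case counts for one region (not data from the study).
observed = [0, 2, 5]
predicted = [1.0, 2.0, 4.0]
print(mean_absolute_error(observed, predicted))
print(root_mean_squared_error(observed, predicted))
print(mean_poisson_deviance(observed, predicted))
```

Unlike MAE and RMSE, the Poisson deviance scores an error relative to the predicted count, so it is more sensitive to errors in low-count regions, which is relevant for a disease such as typhoid.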
The total case counts for each disease by country for 2009–2018, including mean, minimum, and maximum of the monthly regional cases. All available data through 2017 for the seven countries were included in the ML training sets but only highlighted countries were used in performance evaluations. NA means not available.
| Country | Campylobacteriosis | | | Q-Fever | | | Typhoid | | |
|---|---|---|---|---|---|---|---|---|---|
| | Mean | Min–Max | Total | Mean | Min–Max | Total | Mean | Min–Max | Total |
| Australia | 190.10 | 0–880 | 182,498 | 3.92 | 0–32 | 3768 | 1.17 | 0–14 | 1122 |
| Israel | 40.64 | 0–361 | 29,429 | 1.12 | 0–21 | 809 | 0.06 | 0–7 | 63 |
| Japan | NA | NA | NA | NA | 0–2 | 10 | 0.07 | 0–7 | 264 |
| Norway | 10.36 | 0–184 | 22,839 | 0.201 | 0–1 | 14 | 0.08 | 0–8 | 169 |
| Sweden | 31.55 | 0–427 | 16,380 | NA | NA | NA | ~0.00 | 0–1 | 5 |
| United States | 42.04 | 0–794 | 98,720 | 0.06 | 0–3 | 423 | 0.19 | 0–29 | 1251 |
Figure 1. Mean campylobacteriosis case counts per region for available data between January 2009 and August 2018 for all seven countries with shaded areas representing two standard deviations from the mean. Note: only data from 2018 in Australia, Israel, and the United States was held out as the test set.
Figure 2. Mean Q-fever case counts per region for available data between January 2009 and August 2018 with shaded areas representing two standard deviations from the mean.
Figure 3. Time series of mean typhoid case counts per region for all available data between January 2009 and August 2018 with shaded areas representing two standard deviations from the mean.
Figure 4. Average forecasting model performance by region for all predicted months during 2018 for campylobacteriosis, Q-fever, and typhoid as assessed by mean absolute error (MAE), root mean squared error (RMSE), and Poisson deviance (deviance) in Australia, Israel, and the United States.
Figure 5. Mean performance of ML and statistical models by mean absolute error (MAE), root mean squared error (RMSE), and Poisson deviance (deviance) for Australia, United States, and Israel in forecasting monthly regional disease case counts in 2018.
The top-performing disease model by location and forecast length. When the best model is the same for all three metrics, one model is listed; when the best model differs by metric, the cell lists the winners in the order ‘MAE; RMSE; deviance’; when more than one model produced the same value for a metric, they are listed together and separated by commas.
| Disease | Location | Nowcast | Short-Term | Long-Term |
|---|---|---|---|---|
| Campylobacteriosis | All countries | Alt-XGB | Alt-XGB | Alt-XGB; RF (Both); Alt-XGB |
| Campylobacteriosis | Australia | XGB (Both) | Alt-XGB | Alt-XGB |
| Campylobacteriosis | Israel | MLP; MLP; XGB | XGB | XGB; GLARMA; XGB |
| Campylobacteriosis | US | Alt-XGB; GLARMA; Alt-XGB | GLARMA; GLARMA; Alt-XGB | Alt-XGB; Alt-RF; Alt-XGB |
| Q-Fever | All countries | RF | Alt-RF; Alt-XGB; Alt-XGB | Alt-XGB |
| Q-Fever | Australia | GLARMA | Alt-XGB | XGB |
| Q-Fever | Israel | MLP | Enc–Dec; GLARMA; Enc–Dec | Alt-XGB |
| Q-Fever | US | All Models | All Tree-based ML (Alt-XGB *) | All Tree-based ML (Alt-XGB *) |
| Typhoid | All countries | Enc–Dec | Alt-XGB; GLARMA; XGB (Both) | MLP |
| Typhoid | Australia | GLARMA; XGB (Both), GLARMA; Enc–Dec | GLARMA | MLP; Alt-RF; MLP |
| Typhoid | Israel | All Models | All Models | All Models |
| Typhoid | US | MLP | All Tree-based ML | RF; MLP; RF |
* Smallest error range. Note: “Both” refers to the Alt and non-Alt versions of the same model.
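The cell notation used in the table above (one model when all metrics agree, otherwise ‘MAE; RMSE; deviance’ with ties separated by commas) can be expressed as a small helper. This is a sketch under assumptions: the function names and the scores are hypothetical, lower values are treated as better, and ties are detected by exact equality, whereas the study may have compared rounded values.

```python
def best_models(scores, metric):
    """Return the sorted names of all models tied for the lowest value of `metric`."""
    best = min(model_scores[metric] for model_scores in scores.values())
    return sorted(name for name, model_scores in scores.items()
                  if model_scores[metric] == best)


def table_cell(scores, metrics=("MAE", "RMSE", "deviance")):
    """Build a table cell label from per-model metric values."""
    # Models tied on one metric are joined with commas.
    winners = [", ".join(best_models(scores, metric)) for metric in metrics]
    # A single entry is enough when every metric picks the same winner(s).
    if len(set(winners)) == 1:
        return winners[0]
    # Otherwise list the winners in metric order, separated by semicolons.
    return "; ".join(winners)


# Hypothetical error values for one disease, location, and forecast length.
scores = {
    "XGB": {"MAE": 1.0, "RMSE": 2.0, "deviance": 0.5},
    "GLARMA": {"MAE": 1.2, "RMSE": 1.8, "deviance": 0.5},
    "MLP": {"MAE": 3.0, "RMSE": 3.0, "deviance": 3.0},
}
print(table_cell(scores))
```

With these made-up values, XGB wins on MAE, GLARMA wins on RMSE, and the two tie on deviance, so the cell combines both separators.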
Figure 6. Comparison of model performance by metric in forecasting regional monthly case counts for January to August 2018 in Australia, Israel, and the United States. Error bar shows highest and lowest error by month.
Figure 7. Comparison of model performance in forecasting monthly case counts from 2018 by metric for the 20 highest case count regions (highest) and lowest case count regions (lowest) across Australia, Israel, and the United States.
The top-performing disease models by data split. When the best model is the same for MAE, RMSE, and deviance, one model is listed; when the best model differs by metric, the cell lists the winners in the order ‘MAE; RMSE; deviance’; when more than one model produced the same value, they are listed together and separated by commas.
| Disease | Number of Cases | | | Country Over All Months | | | Forecast Time Over All Locations | | |
|---|---|---|---|---|---|---|---|---|---|
| | Overall | High Cases | Low Cases (Zero) | Australia | Israel | US | Nowcasting | Short Term | Long Term |
| Campylobacteriosis | Alt-XGB | Alt-XGB | GLARMA | Alt-XGB | XGB (Both) | GLARMA, Alt-XGB | XGB | XGB | RF (Both); RF (Both); Alt-XGB |
| Q-fever | Alt-XGB | Alt-XGB | Tree-based, DL | Alt-XGB | GLARMA; Enc–Dec; Alt-XGB | All | RF | Tree-based | Alt-XGB |
| Typhoid | Alt-XGB | GLARMA | All | Alt-XGB; GLARMA; GLARMA | All | RF | Enc–Dec | Alt-XGB; GLARMA; XGB (Both) | MLP |
Note: “Both” refers to the Alt and non-Alt versions of the same model.
Figure 8. Top 20 features grouped in categories with their relative importance by disease and tree-based ML model-type shaded by contribution. Features are sorted by decreasing average relative importance of all Alt-models.
Summary of case count data used in the analysis by country and disease name for the study period of January 2009 through August 2018.
| Country | Campylobacteriosis | | Q-Fever | | Typhoid | |
|---|---|---|---|---|---|---|
| | Date Range | # Regions | Date Range | # Regions | Date Range | # Regions |
| Australia | 2009–2018 | 8 | 2009–2018 | 8 | 2009–2018 | 8 |
| Finland | 2009–2017 | 18 | NA | 0 | NA | 0 |
| Israel | 2012–2018 | 6 | 2012–2018 | 6 | 2009–2018 | 6 |
| Japan | NA | 0 | 2012–2017 | 47 | 2012–2017 | 47 |
| Norway | 2009–2017 | 18 | 2009–2017 | 18 | 2009–2017 | 18 |
| Sweden | 2009–2017 | 21 | NA | 0 | 2009–2017 | 21 |
| United States | 2015–2018 | 51 | 2009–2018 | 51 | 2009–2018 | 51 |
“NA” means no data available in EpiArchive for the specified diseases and countries. “#” means number of.
Explanatory variable data types, website for public access, individual features by name, geographic location, geographic resolution, time period, and periodicity.
| Data Type | Website | Individual Features | Geographic Location | Geographic Resolution | Time Period | Periodicity |
|---|---|---|---|---|---|---|
| Case Counts | | Incidences of select human diseases | Countries of interest | Region-level | 2009–2018 | Daily |
| Political Borders | | Geopolitical borders (country and within country) | Countries of interest | Region-level | 2018 | Single instance |
| Climate | | Air temperature, humidity, precipitation, soil moisture, and wind speed | Global | Gridded 0.25° × 0.25° | 2012–2018 | Monthly |
| Gross Domestic Product | | Gross Domestic Product | Global | Country-level | Varies | Yearly |
| Elevation | | Digital Elevation Map | Global | 43,200 × 17,200 (30 arc seconds) | NA | NA |
| Mortality | | Deaths by country, year, sex, age group, and cause of death | Global | Country-level | 2009–2018 | Yearly |
| Municipal waste | | Municipal waste generation and treatment | Countries of interest | Country-level | 2009–2017 | Yearly |
| Socio-political | | Country and internal administrative borders; socioeconomic and political attributes | Global | Varies by country; 1: 10 m–110 m | 2019 | Single instance |
| Population | | Population by age intervals by location | Global | Country-level | 2009–2015 | Every 5 years |
| Population Density | | Population density | Global | 30 arc-seconds | 2009–2015 | Every 5 years |
| Water Potability and Treatment | | Freshwater resources, available water, wastewater treatment plant capacity, surface water | Countries of interest | Country-level | 2009–2017 | Yearly |
“NA” means no data available.