Tim C D Lucas1, Anita K Nandi2, Suzanne H Keddie2, Elisabeth G Chestnutt2, Rosalind E Howes2, Susan F Rumisha3, Rohan Arambepola2, Amelia Bertozzi-Villa4, Andre Python2, Tasmin L Symons2, Justin J Millar2, Punam Amratia2, Penelope Hancock2, Katherine E Battle2, Ewan Cameron2, Peter W Gething5, Daniel J Weiss2.
Abstract
Maps of disease burden are a core tool for the control and elimination of malaria. Reliable routine surveillance data on malaria incidence, typically aggregated to administrative units, are becoming more widely available. Disaggregation regression is an important model framework for estimating high-resolution risk maps from aggregated data. However, the aggregation of incidence over large, heterogeneous areas means that these data are underpowered for estimating complex, non-linear models. In contrast, prevalence point-surveys are directly linked to local environmental conditions but are not common in many areas of the world. Here, we train multiple non-linear machine learning models on Plasmodium falciparum prevalence point-surveys. We then ensemble the predictions from these machine learning models with a disaggregation regression model that uses aggregated malaria incidence as response data. We find that using a disaggregation regression model to combine predictions from machine learning models improves model accuracy relative to a baseline model.
Keywords: Disaggregation regression; Spatial statistics; Stacking; Surveillance data
Year: 2020 PMID: 35691633 PMCID: PMC9205339 DOI: 10.1016/j.sste.2020.100357
Source DB: PubMed Journal: Spat Spatiotemporal Epidemiol ISSN: 1877-5845
Fig. 1. Schematic of the baseline disaggregation regression model (Enviro) and the two-stage method (MLl). Models are shown in yellow ovals, malaria data are shown in purple rectangles and covariates are shown in green rectangles. The baseline model (Enviro) uses aggregated incidence data and raw environmental covariates in a disaggregation regression model. In the two-stage method (MLl), new covariates are created in stage 1 by training machine learning models on prevalence data. Predictions from these machine learning models are used as covariates in the stage 2 disaggregation regression. Only one of the two-stage models (MLl) is shown for simplicity. If MLg were shown as well, it would look the same as MLl except that the prevalence data (pink box in stage 1) would be the global database of prevalence surveys. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
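The stage 2 step in the schematic can be illustrated with a minimal sketch. It assumes a log-linear link between pixel-level covariates and incidence rate, and that expected polygon cases are the population-weighted sum of pixel rates; neither detail is stated in this excerpt, and any spatial random effects or likelihood terms of the full model are omitted. All names, coefficients and data values below are hypothetical.

```python
import math

def pixel_rates(covariates, intercept, weights):
    """Assumed log-linear link: rate = exp(intercept + sum_k w_k * x_k) per pixel."""
    return [
        math.exp(intercept + sum(w * x for w, x in zip(weights, row)))
        for row in covariates
    ]

def aggregate_cases(rates, population, polygon_ids):
    """Sum rate * population over the pixels belonging to each polygon."""
    totals = {}
    for rate, pop, poly in zip(rates, population, polygon_ids):
        totals[poly] = totals.get(poly, 0.0) + rate * pop
    return totals

# Hypothetical toy inputs: four pixels, each with predictions from two
# stage 1 machine learning models used as stage 2 covariates.
stage1_predictions = [[0.10, 0.12], [0.30, 0.25], [0.05, 0.07], [0.20, 0.22]]
population = [1000, 500, 2000, 800]
polygon_ids = ["A", "A", "B", "B"]

# Hypothetical coefficients; in a real fit these would be estimated so that
# the aggregated values match the observed polygon-level incidence counts.
rates = pixel_rates(stage1_predictions, intercept=-3.0, weights=[1.5, 0.5])
expected_cases = aggregate_cases(rates, population, polygon_ids)
```

In this framing the fitted weights play the role of ensemble weights over the stage 1 predictions, which is why the approach is described as stacking.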
Summary of stage 2 models and the experiments they are grouped into. All models are disaggregation regression models fitted to aggregated incidence data. The only difference between the models is which covariates are used. Environmental covariates include the full set of eight variables. Local ML covariates are the predictions from stage 1 machine learning models, trained with local prevalence data and the eight environmental covariates as inputs. The local prevalence datasets are data from within each country, except for Colombia, where local prevalence refers to South American data. Global ML covariates are the predictions from stage 1 machine learning models, trained with global prevalence data (i.e. the full prevalence database) and the eight environmental covariates as inputs. Experiment 1 tests whether including Local ML covariates improves predictive performance, while Experiment 2 tests whether including Global ML covariates improves performance.
| Experiment | Model | Environmental covariates | Local ML covariates | Global ML covariates |
|---|---|---|---|---|
| 1 | Enviro | ✓ | | |
| 1 | Enviro + MLl | ✓ | ✓ | |
| 1 | MLl | | ✓ | |
| 2 | Enviro | ✓ | | |
| 2 | MLg | | | ✓ |
| 2 | MLl + MLg | | ✓ | ✓ |
Pearson correlations between observed and predicted values for experiment 1.
| Country | Cross-validation | Enviro | Enviro + MLl | MLl |
|---|---|---|---|---|
| Colombia | Random | 0.59 | 0.59 | |
| Colombia | Spatial | 0.12 | 0.25 | |
| Indonesia | Random | 0.52 | 0.48 | |
| Indonesia | Spatial | 0.45 | 0.44 | |
| Madagascar | Random | 0.69 | 0.68 | |
| Madagascar | Spatial | 0.22 | 0.18 | |
| Senegal | Random | 0.57 | 0.51 | |
| Senegal | Spatial | 0.63 | 0.63 | 0.51 |
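For reference, the Pearson correlation reported in this table can be computed from paired observed and predicted values as in the sketch below. The function name and the toy inputs are illustrative only, and the excerpt does not say whether the correlations were computed on a transformed scale.

```python
import math

def pearson(observed, predicted):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)

# Hypothetical held-out observations and predictions
r = pearson([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.5, 3.9])
```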
Fig. 2. Observed data against predictions for random cross-validation hold-out samples on a square root transformed scale. There are 12 cases composed of 4 countries (COL: Colombia, IDN: Indonesia, MDG: Madagascar, SEN: Senegal) and three sets of covariates (Enviro: raw environmental covariates only; Enviro + MLl: raw environmental covariates and machine learning covariates trained on local prevalence data combined; MLl: machine learning covariates trained on local prevalence data only).
Fig. 3. Observed data against predictions for spatial cross-validation hold-out samples on a square root transformed scale. There are 12 cases composed of 4 countries (COL: Colombia, IDN: Indonesia, MDG: Madagascar, SEN: Senegal) and three sets of covariates (Enviro: raw environmental covariates only; Enviro + MLl: raw environmental covariates and machine learning covariates trained on local prevalence data combined; MLl: machine learning covariates trained on local prevalence data only).
Fig. 4. A) Observed data for Colombia (grey for zero incidence). B) Out-of-sample predictions for the spatial cross-validation, environmental covariates only model. C) Out-of-sample predictions for the spatial cross-validation, local machine learning only model. For each cross-validation fold, predictions are made for the held-out data, and these are then combined to make a single surface.
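The way held-out predictions from each fold are combined into one surface can be sketched as follows. The placeholder model (the mean of the training values) stands in for the disaggregation regression, and the fold assignment, function names and data are hypothetical.

```python
def out_of_fold_predictions(values, fold_ids, fit, predict):
    """For each fold, fit on the other folds and predict the held-out units.

    Returns one prediction per unit, each made by a model that never saw
    that unit during fitting.
    """
    combined = [None] * len(values)
    for fold in sorted(set(fold_ids)):
        train = [v for v, f in zip(values, fold_ids) if f != fold]
        model = fit(train)
        for i, f in enumerate(fold_ids):
            if f == fold:
                combined[i] = predict(model, i)
    return combined

# Placeholder model: predict the training mean everywhere (illustration only).
def fit_mean(train):
    return sum(train) / len(train)

def predict_mean(model, index):
    return model

# Hypothetical incidence values for four administrative units in two folds.
# For spatial cross-validation the folds would group neighbouring units;
# for random cross-validation they would be assigned at random.
values = [1.0, 2.0, 3.0, 4.0]
fold_ids = [0, 0, 1, 1]
combined = out_of_fold_predictions(values, fold_ids, fit_mean, predict_mean)
```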
Pearson correlations between observed and predicted values for experiment 2.
| Country | Cross-validation | Enviro | MLg | MLl + MLg |
|---|---|---|---|---|
| Colombia | Random | 0.55 | 0.58 | |
| Colombia | Spatial | 0.12 | 0.12 | |
| Indonesia | Random | 0.32 | 0.46 | |
| Indonesia | Spatial | 0.45 | 0.41 | 0.45 |
| Madagascar | Random | 0.67 | 0.68 | |
| Madagascar | Spatial | 0.22 | 0.51 | |
| Senegal | Random | 0.50 | 0.49 | |
| Senegal | Spatial | 0.55 | 0.52 |
Coverage of 80% credible intervals. Values outside 0.7–0.9 are shown in bold.
| Country | CV | Enviro | MLl | Enviro + MLl | MLg | MLl + MLg |
|---|---|---|---|---|---|---|
| Colombia | Random | |||||
| Colombia | Spatial | |||||
| Indonesia | Random | 0.80 | 0.81 | 0.78 | 0.79 | 0.77 |
| Indonesia | Spatial | 0.80 | 0.78 | 0.78 | 0.76 | 0.75 |
| Madagascar | Random | 0.80 | 0.77 | 0.75 | 0.77 | 0.76 |
| Madagascar | Spatial | 0.74 | 0.70 | 0.75 | 0.76 | |
| Senegal | Random | 0.79 | 0.79 | 0.79 | 0.79 | 0.82 |
| Senegal | Spatial | 0.85 | 0.85 | 0.85 |
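Coverage here is the fraction of held-out observations that fall inside their predicted 80% credible interval. A minimal sketch, with hypothetical function names and toy data, including the 0.7 to 0.9 band used for flagging values in the table:

```python
def interval_coverage(observed, lower, upper):
    """Fraction of observations lying within [lower, upper] (inclusive)."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return inside / len(observed)

def outside_band(coverage, low=0.7, high=0.9):
    """True when coverage falls outside the acceptable band."""
    return coverage < low or coverage > high

# Hypothetical observations with lower and upper bounds of 80% intervals
observed = [1, 5, 10, 20, 3]
lower = [0, 4, 11, 15, 2]
upper = [2, 6, 14, 25, 4]

coverage = interval_coverage(observed, lower, upper)  # 4 of 5 inside
flagged = outside_band(coverage)
```

Coverage well below 0.8 suggests intervals that are too narrow, while coverage well above 0.8 suggests intervals that are too wide.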
Machine learning model results and fitted parameters (i.e. model weights) of the machine learning predictions only models (i.e. MLl, local only, and MLg, global only). For each dataset (the country and whether the data were local or global), the best model (lowest RMSE) is shown in bold. Similarly, the largest coefficient within each disaggregation model is shown in bold.
| Country | Model | RMSEl | Weight (MLl) | RMSEg | Weight (MLg) |
|---|---|---|---|---|---|
| Colombia | RF | 0.625 | 0.180 | ||
| Colombia | GBM | 0.073 | 0.178 | -0.218 | |
| Colombia | enet | 0.070 | 0.219 | 0.233 | 0.183 |
| Colombia | nnet | 0.070 | 0.129 | 0.220 | 0.527 |
| Colombia | ppr | 0.070 | 0.667 | 0.205 | |
| Indonesia | RF | 0.447 | 0.178 | ||
| Indonesia | GBM | 0.085 | 0.357 | 0.178 | 0.289 |
| Indonesia | enet | 0.091 | 0.303 | 0.233 | |
| Indonesia | nnet | 0.089 | 0.220 | 0.316 | |
| Indonesia | ppr | 0.089 | 0.364 | 0.205 | 0.089 |
| Madagascar | RF | 0.538 | |||
| Madagascar | GBM | 0.105 | 0.178 | 0.432 | |
| Madagascar | enet | 0.116 | 0.301 | 0.233 | 0.262 |
| Madagascar | nnet | 0.113 | 0.033 | 0.220 | 0.364 |
| Madagascar | ppr | 0.109 | 0.469 | 0.205 | 0.403 |
| Senegal | RF | ||||
| Senegal | GBM | 0.099 | 0.261 | 0.178 | 0.408 |
| Senegal | enet | 0.103 | 0.344 | 0.233 | 0.205 |
| Senegal | nnet | 0.099 | 0.254 | 0.220 | 0.190 |
| Senegal | ppr | 0.098 | 0.268 | 0.205 | 0.126 |
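Selecting the best stage 1 model by lowest RMSE, as the caption describes, can be sketched as follows. The model names and values are hypothetical and are not taken from the table.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical held-out prevalence values and predictions from two models
observed = [0.10, 0.20, 0.30]
predictions = {
    "model_a": [0.10, 0.25, 0.30],
    "model_b": [0.20, 0.30, 0.40],
}

scores = {name: rmse(observed, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)  # name of the model with the lowest RMSE
```

Note that the model with the lowest stage 1 RMSE need not receive the largest weight in the stage 2 disaggregation model, since the two are fitted to different data.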