Denis Valle, Kok Ben Toh, Gabriel Zorello Laporta, Qing Zhao.
Abstract
Count data commonly arise in the natural sciences, but adequately modeling these data is challenging due to zero-inflation and over-dispersion. While multiple parametric modeling approaches have been proposed, there is no consensus on how to choose the best model. In this article, we propose an ordinal regression model (MN) as a default model for count data, given that this model is shown to fit data arising from several types of discrete distributions well. We extend this model to allow for automatic model selection (MN-MS) and show that the MN-MS model generates superior inference when compared to using the full model or more traditional model selection approaches. The MN-MS model is used to determine how the human biting rates of mosquitoes known to be able to transmit malaria are influenced by environmental factors in the Peruvian Amazon. The MN-MS model had one of the best fits and one of the best out-of-sample predictive skills among all models. While A. darlingi is strongly associated with highly anthropized landscapes, all the other mosquito species had higher mean biting rates in landscapes with a lower fraction of exposed soil and urban area, revealing a striking shift in species composition. We believe that the MN and MN-MS models are valuable additions to the modeling toolkit employed by environmental modelers and quantitative ecologists.
Year: 2019 PMID: 30816185 PMCID: PMC6395857 DOI: 10.1038/s41598-019-39377-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Assumptions used to simulate data for each model.
| Reg. model | Mean | Variance | Assumptions | Parameter values |
|---|---|---|---|---|
| Poisson | Small | n/a | | |
| Poisson | Large | n/a | | |
| NB | Small | Small | | |
| NB | Small | Large | | |
| NB | Large | Small | | |
| NB | Large | Large | | |
| ZIP | Small | n/a | | |
| ZIP | Large | n/a | | |
| ZINB | Small | Small | | |
| ZINB | Small | Large | | |
| ZINB | Large | Small | | |
| ZINB | Large | Large | | |
In these equations, q is a latent binary variable, ωi is the response count variable, x is an explanatory variable, λ = exp (β0 + β1x), and . For the negative binomial distribution, E[w] = μ and .
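The zero-inflated Poisson setting described in this note can be illustrated with a small simulation. The sketch below is not the authors' simulation code: it assumes that q = 1 marks a structural zero occurring with a hypothetical probability `p_zero`, and the values of `beta0`, `beta1`, and `p_zero` are illustrative placeholders rather than the parameter values used in the study.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_zip(x_values, beta0, beta1, p_zero, rng):
    """Simulate zero-inflated Poisson counts.

    Assumption (not from the source): q = 1, drawn with probability p_zero,
    forces a structural zero; otherwise the count is Poisson with
    mean lambda = exp(beta0 + beta1 * x).
    """
    counts = []
    for x in x_values:
        q = 1 if rng.random() < p_zero else 0
        lam = math.exp(beta0 + beta1 * x)
        counts.append(0 if q == 1 else sample_poisson(lam, rng))
    return counts


rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
w = simulate_zip(x, beta0=1.0, beta1=0.5, p_zero=0.4, rng=rng)  # placeholder parameters

# Two simple diagnostics: share of zeroes and the variance-to-mean ratio.
prop_zero = sum(1 for c in w if c == 0) / len(w)
mean_w = sum(w) / len(w)
var_w = sum((c - mean_w) ** 2 for c in w) / (len(w) - 1)
dispersion = var_w / mean_w
print(prop_zero, dispersion)
```

A variance-to-mean ratio above 1 indicates over-dispersion relative to a Poisson distribution, and the share of zeroes exceeds `p_zero` because the Poisson component also produces zeroes.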
Data on mosquito human biting rate is zero-inflated and over-dispersed.
| Species | Proportion of zeroes | Maximum number of mosquitoes caught in a 6-hour period |
|---|---|---|
| | 0.70 | 109 |
| | 0.92 | 24 |
| | 0.60 | 308 |
| | 0.82 | 249 |
| | 0.71 | 124 |
| | 0.86 | 33 |
The MN model fits data generated from a diverse set of conditional distributions well, despite lacking information on the correct distribution.
| Reg. model | Mean | Variance | MN model fits equally well or has better fit (proportion) |
|---|---|---|---|
| Poisson | Small | n/a | 1.0 |
| Poisson | Large | n/a | 1.0 |
| NB | Small | Small | 0.9 |
| NB | Small | Large | 1.0 |
| NB | Large | Small | 1.0 |
| NB | Large | Large | 1.0 |
| ZIP | Small | n/a | 0.8 |
| ZIP | Large | n/a | 0.0 |
| ZINB | Small | Small | 0.9 |
| ZINB | Small | Large | 1.0 |
| ZINB | Large | Small | 1.0 |
| ZINB | Large | Large | 1.0 |
Numbers correspond to the proportion of datasets (based on 10 datasets) for which the MN model fitted the data equally well or had a better fit when compared to the true model with estimated parameters. Models were judged to fit the data equally well if their 95% credible intervals for the log-likelihood (our measure of goodness-of-fit) overlapped.
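The overlap criterion described in this note can be sketched as follows. This is a minimal illustration, assuming posterior draws of the log-likelihood are available for each model; the function names and the simulated draws are hypothetical and do not come from the study.

```python
import random


def credible_interval(samples, lower=0.025, upper=0.975):
    """Equal-tailed interval from posterior draws, using linear interpolation between order statistics."""
    ordered = sorted(samples)

    def quantile(p):
        pos = p * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile(lower), quantile(upper)


def intervals_overlap(a, b):
    """True when the closed intervals a = (lo, hi) and b = (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def fits_equally_well_or_better(candidate_ci, reference_ci):
    """Equal fit if the intervals overlap; better fit if the candidate interval lies entirely above."""
    return intervals_overlap(candidate_ci, reference_ci) or candidate_ci[0] > reference_ci[1]


# Hypothetical posterior draws of the log-likelihood for two models.
rng = random.Random(0)
loglik_mn = [rng.gauss(-1250.0, 5.0) for _ in range(1000)]
loglik_true = [rng.gauss(-1245.0, 5.0) for _ in range(1000)]

ci_mn = credible_interval(loglik_mn)
ci_true = credible_interval(loglik_true)
print(ci_mn, ci_true, fits_equally_well_or_better(ci_mn, ci_true))
```

Repeating this comparison over the simulated datasets and averaging the boolean results would give a proportion of the kind reported in the table.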
Figure 1. The MN-MS model performs slightly worse than the Poisson regression models in identifying the true non-zero slopes (left panel) but performs substantially better in identifying the true zero slopes (right panel). Results for the Poisson model without model selection (Poisson no MS; purple), with AIC model selection (Poisson AIC MS; red), and the MN model with model selection (MN-MS; blue) are displayed. A 1:1 line was added for reference (dashed diagonal line), where results closer to this line indicate better performance. Circles represent the median, thick lines represent the 20–80% range, while thin lines represent the full range (minimum to maximum) based on 10 datasets. Left panel: The x-axis displays the true number of non-zero slopes used to generate the data while the y-axis reveals how many of these slopes were correctly identified to be non-zero and were estimated with the correct sign. Right panel: The x-axis displays the true number of zero slopes used to generate the data while the y-axis reveals how many of these slopes were correctly identified to be zero.
Figure 2. The MN-MS model accurately estimates the non-linear effects of covariates x1, x2, and x3 (top panels) and the absence of effects associated with covariates x4, x5, and x6 (bottom panels). True mean response functions are depicted with red lines, while the estimated relationships are shown with black lines (continuous and dashed lines are the median and point-wise 95% credible intervals, respectively). Circles show the knot locations for each covariate, set a priori to the 0.2, 0.4, 0.6, and 0.8 quantiles of the corresponding covariate. The displayed response curves are based on one of the 10 simulated datasets and were created by varying only the focal covariate while the other covariates were set to their mean values.
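The two conventions in this caption, knots placed at fixed quantiles and response curves built by varying one covariate at a time, can be sketched as below. This is an illustration under assumptions: the function names and the toy data are hypothetical, and the quantile interpolation rule (`method="inclusive"`) is a choice made here, not one stated in the source.

```python
import statistics


def knot_locations(values):
    """Return the 0.2, 0.4, 0.6, and 0.8 quantiles of a covariate (quintile cut points)."""
    return statistics.quantiles(values, n=5, method="inclusive")


def response_curve_grid(data, focal, n_points=50):
    """Build prediction rows that vary only the focal covariate.

    data maps covariate names to lists of observed values. All non-focal
    covariates are held at their observed means.
    """
    means = {name: statistics.fmean(vals) for name, vals in data.items()}
    lo, hi = min(data[focal]), max(data[focal])
    grid = []
    for i in range(n_points):
        row = dict(means)
        row[focal] = lo + (hi - lo) * i / (n_points - 1)
        grid.append(row)
    return grid


# Hypothetical toy covariates.
covariates = {"x1": [0.0, 1.0, 2.0, 3.0, 4.0], "x2": [10.0, 20.0, 30.0, 40.0, 50.0]}
print(knot_locations(covariates["x1"]))
print(response_curve_grid(covariates, "x1", n_points=5))
```

Each row of the grid would then be passed to the fitted model to obtain the predicted mean response along the focal covariate.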
The MN and MN-MS models generally fit mosquito data better than competing regression models.
| Species | Poisson | NB | ZINB | ZIP | MN | MN-MS |
|---|---|---|---|---|---|---|
| | −4756 | −1283 | −1245 | −2682 | − | −1245 |
| | −407 | −318 | −311 | −316 | − | −314 |
| | −6922 | −1616 | −1591 | −3905 | − | −1552 |
| | −4131 | −775 | −770 | −1406 | − | − |
| | −2533 | −1057 | −1035 | −1682 | − | −1033 |
| | −1147 | −587 | − | −670 | −563 | −568 |
The median of the log-likelihood (model fit) is provided for each combination of model and mosquito species. The best model for each species is emphasized in bold. “ZI” stands for zero-inflation.
The MN and MN-MS models generally predict out-of-sample mosquito data better than competing regression models.
| Species | MN vs. Poisson | MN vs. NB | MN vs. ZINB | MN vs. ZIP | MN-MS vs. Poisson | MN-MS vs. NB | MN-MS vs. ZINB | MN-MS vs. ZIP | MN-MS vs. MN |
|---|---|---|---|---|---|---|---|---|---|
| | 0.86 | 0.79 | 0.79 | 0.64 | 0.86 | 0.79 | 0.79 | 0.64 | 0.36 |
| | 0.86 | 0.86 | 0.93 | 1.00 | 0.93 | 0.79 | 0.93 | 1.00 | 0.57 |
| | 0.79 | 0.71 | 0.79 | 0.64 | 0.79 | 0.71 | 0.79 | 0.64 | 0.29 |
| | 0.79 | 0.79 | 0.86 | 0.71 | 0.79 | 0.86 | 0.93 | 0.71 | 0.79 |
| | 0.79 | 0.86 | 0.71 | 0.71 | 0.79 | 0.79 | 0.79 | 0.79 | 0.64 |
| | 0.79 | 0.93 | 0.93 | 0.79 | 0.71 | 0.93 | 0.93 | 0.79 | 0.79 |
Numbers indicate the proportion of cross-validation folds (based on 14 folds) in which the MN and MN-MS models had lower MSE scores when compared to each alternative model and for each mosquito species. “ZI” stands for zero-inflation. The last column on the right shows the proportion of cross-validation folds in which the MN-MS model had lower MSE score relative to the MN model.
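The fold-wise comparison described in this note amounts to counting the folds in which one model attains a lower MSE than another. The sketch below assumes that per-fold MSE values have already been computed; the function names and the fold-level numbers are hypothetical placeholders, not results from the study.

```python
def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted counts."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def proportion_folds_lower_mse(mse_candidate, mse_alternative):
    """Share of folds in which the candidate model has a strictly lower MSE."""
    wins = sum(1 for a, b in zip(mse_candidate, mse_alternative) if a < b)
    return wins / len(mse_candidate)


# Hypothetical per-fold MSE values for 14 cross-validation folds.
mse_mn = [1.0] * 14
mse_alternative = [2.0] * 12 + [0.5] * 2  # candidate wins in 12 of 14 folds

proportion = proportion_folds_lower_mse(mse_mn, mse_alternative)
print(round(proportion, 2))
```

With 14 folds, every reported proportion is a multiple of 1/14, so a value of 0.86 corresponds to 12 of 14 folds and 0.79 to 11 of 14.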
Figure 3. Statistical associations between mosquito biting-rates and environmental covariates based on the MN-MS model. Modeling results from individual mosquito species are shown separately in each row (A. darlingi = darli., A. nuneztovari = nunez., A. triannulatus = trian., A. benarrochi = benar., A. oswaldoi = oswal., and A. rangeli = range.). Continuous and dashed lines represent the median and the 95% credible intervals, respectively. Circles show potential inflection points (i.e., knot locations), set a priori to the 0.2, 0.4, 0.6, and 0.8 quantiles of the covariate. Left to right panels show the inferred associations between mosquito biting-rate (number of mosquitoes caught per 6-hour period) and precipitation (mm/hr), proportion of forest pixels, and proportion of exposed soil/urban pixels, respectively. Proportion of pixels was calculated within a 500 m buffer of each observation location.
Figure 4. Large shift in species composition in mean mosquito biting-rates associated with changes in the proportion of exposed soil/urban area. Modeling results from individual mosquito species are shown in different colors (A. darlingi = darli., A. nuneztovari = nunez., A. triannulatus = trian., A. benarrochi = benar., A. oswaldoi = oswal., and A. rangeli = range.) as a function of the proportion of exposed soil/urban pixels. Proportion of pixels was calculated within a 500 m buffer of each observation location.
Figure 5. Spatial prediction of mean mosquito biting-rates for the two most common anopheline species and overall biting rate. From left to right, each panel shows the spatial prediction of mean mosquito biting-rate for A. darlingi (darli.), A. triannulatus (trian.), and the sum of the predicted mean biting-rate of the six anopheline mosquito species (Sum). Axes depict UTM coordinates in meters. The road network is depicted with black lines and covariate extrapolation is avoided by removing all areas for which covariate values were outside the range used to fit the model. Spatial extrapolation is avoided by restricting spatial prediction to within 2.5 km of sampled sites.
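The two extrapolation safeguards in this caption can be expressed as a simple mask over candidate prediction locations. The sketch below is one possible implementation, assuming planar UTM coordinates in meters and straight-line distances; the function names, site coordinates, and covariate ranges are hypothetical.

```python
import math


def within_distance(point, sites, max_dist_m=2500.0):
    """True if the point lies within max_dist_m (Euclidean, in meters) of any sampled site."""
    return any(math.hypot(point[0] - sx, point[1] - sy) <= max_dist_m for sx, sy in sites)


def within_covariate_range(covariates, fitted_ranges):
    """True if every covariate value falls inside the range observed when fitting the model."""
    return all(fitted_ranges[name][0] <= value <= fitted_ranges[name][1]
               for name, value in covariates.items())


def can_predict(point, covariates, sites, fitted_ranges):
    """Combine both safeguards: no spatial and no covariate extrapolation."""
    return within_distance(point, sites) and within_covariate_range(covariates, fitted_ranges)


# Hypothetical sampled sites (UTM easting, northing) and fitted covariate ranges.
sites = [(500000.0, 9500000.0), (510000.0, 9505000.0)]
fitted_ranges = {"forest": (0.1, 0.9), "soil_urban": (0.0, 0.6)}

print(can_predict((501000.0, 9501000.0), {"forest": 0.5, "soil_urban": 0.2}, sites, fitted_ranges))
print(can_predict((530000.0, 9500000.0), {"forest": 0.5, "soil_urban": 0.2}, sites, fitted_ranges))
```

Locations failing either check would be left blank on the map rather than assigned a predicted biting rate.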