Benjamin J Krajacich, Jacob I Meyers, Haoues Alout, Roch K Dabiré, Floyd E Dowell, Brian D Foy.
Abstract
BACKGROUND: Understanding the age-structure of mosquito populations, especially malaria vectors such as Anopheles gambiae, is important for assessing the risk of infectious mosquitoes, and how vector control interventions may impact this risk. The use of near-infrared spectroscopy (NIRS) for age-grading has been demonstrated previously on laboratory and semi-field mosquitoes, but to date has not been utilized on wild-caught mosquitoes whose age is externally validated via parity status or parasite infection stage. In this study, we developed regression and classification models using NIRS on datasets of wild An. gambiae (s.l.) reared from larvae collected from the field in Burkina Faso, and two laboratory strains. We compared the accuracy of these models for predicting the ages of wild-caught mosquitoes that had been scored for their parity status as well as for positivity for Plasmodium sporozoites.
Keywords: Aging; Anopheles; Mosquitoes; Spectroscopy
Year: 2017 PMID: 29116006 PMCID: PMC5678599 DOI: 10.1186/s13071-017-2501-1
Source DB: PubMed Journal: Parasit Vectors ISSN: 1756-3305 Impact factor: 3.876
Algorithms used in analysis
| Algorithm | Used for regression or classification? | Outlier detection? | Variable selection? |
|---|---|---|---|
| Partial Least Squares (PLS) | Both | No | No |
| interval PLS (iPLS) | Regression | No | Yes |
| ensemble PLS with feature selection (enPLS) | Regression | Yes | Yes |
| Model Adaptive Space Shrinkage - PLS (MASS) | Regression | Yes | Yes |
| Variable Combination Population Analysis (VCPA) | Regression | No | Yes |
| Support Vector Machine-Linear Kernel (svmLinear) | Both | No | No |
| Oblique Random Forest - Ridge (ORF) | Classification | No | No |
Calibration, cross-validation, validation and independent test set 1 (ITS1) results for each algorithm on the 6 datasets
| Dataset | Samples | No. var | RMSEC | R2Cal | RMSECV | R2CV | LV | RMSEV | RMSEP-ITS1 |
|---|---|---|---|---|---|---|---|---|---|
| Dataset 1 | | | | | | | | | |
| PLS | 178 | 1851 | 2.68 | 0.55 | 3.16 | 0.39 | 10 | 2.90 | 3.88 |
| iPLS | 178 | 180 | 2.41 | 0.64 | 2.92 | 0.55 | 10 | 2.97 | 5.52 |
| enPLS | 175 | 400 | 1.71 | 0.82 | 2.04 | 0.74 | na | 2.62 | 7.01 |
| MASS | 173 | 258 | 2.00 | 0.74 | 2.28 | 0.66 | 10 | 2.93 | 4.04 |
| VCPA | 178 | 11 | 2.36 | 0.65 | 2.52 | 0.60 | 10 | 3.11 | 4.64 |
| svmLinear | 178 | 1851 | na | na | 2.83 | 0.59 | na | 2.70 | 4.29 |
| Dataset 2 | | | | | | | | | |
| PLS | 156 | 1851 | 1.85 | 0.83 | 2.28 | 0.74 | 10 | 2.71 | 4.08 |
| iPLS | 156 | 120 | 1.54 | 0.93 | 1.20 | 0.90 | 10 | 2.41 | 3.88 |
| enPLS | 152 | 300 | 0.81 | 0.97 | 1.05 | 0.95 | na | 1.89 | 4.19 |
| MASS | 153 | 385 | 0.87 | 0.96 | 1.10 | 0.94 | 10 | 2.41 | 4.33 |
| VCPA | 156 | 10 | 1.88 | 0.82 | 2.08 | 0.78 | 10 | 2.49 | 3.29 |
| svmLinear | 156 | 1851 | na | na | 1.89 | 0.81 | na | 2.13 | 4.60 |
| Dataset 3 | | | | | | | | | |
| PLS | 160 | 1851 | 2.05 | 0.80 | 2.61 | 0.70 | 10 | 2.85 | 5.53 |
| iPLS | 160 | 60 | 1.97 | 0.81 | 2.41 | 0.78 | 10 | 2.29 | 5.61 |
| enPLS | 158 | 350 | 0.76 | 0.97 | 1.44 | 0.90 | na | 1.96 | 4.29 |
| MASS | 158 | 441 | 1.24 | 0.93 | 1.59 | 0.88 | 10 | 2.06 | 4.17 |
| VCPA | 160 | 10 | 1.95 | 0.82 | 2.05 | 0.80 | 8 | 2.55 | 3.40 |
| svmLinear | 160 | 1851 | na | na | 1.94 | 0.82 | na | 2.23 | 3.76 |
| Dataset 4 | | | | | | | | | |
| PLS | 200 | 1851 | 2.10 | 0.76 | 2.60 | 0.64 | 10 | 2.43 | 5.18 |
| iPLS | 200 | 60 | 1.71 | 0.84 | 2.17 | 0.80 | 10 | 2.41 | 4.05 |
| enPLS | 195 | 350 | 0.85 | 0.96 | 1.32 | 0.90 | na | 1.49 | 3.56 |
| MASS | 196 | 140 | 1.55 | 0.87 | 1.78 | 0.82 | 10 | 1.98 | 3.95 |
| VCPA | 200 | 11 | 2.28 | 0.71 | 2.39 | 0.69 | 7 | 2.72 | 6.44 |
| svmLinear | 200 | 1851 | na | na | 1.99 | 0.77 | na | 1.74 | 4.32 |
| Dataset 5 | | | | | | | | | |
| PLS | 334 | 1851 | 2.94 | 0.50 | 3.16 | 0.43 | 10 | 3.42 | 3.57 |
| iPLS | 334 | 180 | 2.50 | 0.64 | 2.76 | 0.58 | 10 | 2.72 | 6.70 |
| enPLS | 330 | 200 | 1.77 | 0.82 | 2.07 | 0.75 | na | 3.10 | 4.69 |
| MASS | 329 | 466 | 2.20 | 0.71 | 2.36 | 0.67 | 10 | 3.10 | 3.67 |
| VCPA | 334 | 12 | 2.82 | 0.54 | 2.89 | 0.51 | 8 | 3.70 | 4.79 |
| svmLinear | 334 | 1851 | na | na | 2.66 | 0.63 | na | 2.81 | 3.70 |
| Dataset 6 | | | | | | | | | |
| PLS | 494 | 1851 | 3.24 | 0.43 | 3.50 | 0.34 | 10 | 3.29 | 3.43 |
| iPLS | 494 | 120 | 3.21 | 0.44 | 3.36 | 0.41 | 8 | 2.99 | 5.01 |
| enPLS | 479 | 300 | 1.76 | 0.83 | 2.21 | 0.73 | na | 2.77 | 3.33 |
| MASS | 492 | 482 | 2.58 | 0.64 | 2.83 | 0.56 | 10 | 3.08 | 2.96 |
| VCPA | 494 | 10 | 3.43 | 0.47 | 3.15 | 0.46 | 10 | 3.43 | 2.48 |
| svmLinear | 494 | 1851 | na | na | 2.68 | 0.61 | na | 2.78 | 3.49 |
Note: Each of the six datasets was used to generate models using six regression algorithms. The root mean squared error (RMSE) is presented for the calibration, cross-validation and validation sets, and for independent test set 1. This measure (in units of “days”) approximates how much error is present across the range of ages in each dataset.
Abbreviations: No. var, number of variables used; RMSEC, root mean squared error of calibration; R2Cal, coefficient of determination of calibration; RMSECV, root mean squared error of cross-validation; R2CV, coefficient of determination of cross-validation, based on the actual vs predicted ages averaged over the 5- or 10-fold cross-validation; LV, number of latent variables used in PLS regression (if applicable); RMSEV, root mean squared error of the validation set; RMSEP-ITS1, root mean squared error of prediction for independent test set 1; na, not available for RMSEC/R2Cal values (not calculated natively in the implementation of svmLinear) or not applicable for LV (enPLS uses ensemble models, and latent variables are not used in support vector machines)
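The two summary statistics reported in the table can be illustrated with a minimal sketch. It assumes R2 is computed as 1 minus the ratio of residual to total sum of squares; the record does not say whether the authors used this form or a squared correlation. The ages below are hypothetical and are not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs (days here)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted ages in days (illustration only).
actual_age = [1, 3, 5, 7, 9, 11, 13, 15]
predicted_age = [2, 2, 6, 8, 7, 12, 11, 14]

print(f"RMSE: {rmse(actual_age, predicted_age):.2f} days")
print(f"R2:   {r_squared(actual_age, predicted_age):.2f}")
```

Applying the same two functions to the calibration, cross-validation, validation and independent test predictions would give the RMSEC, RMSECV, RMSEV and RMSEP-ITS1 columns, respectively.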
Fig. 1 Averaged spectra per dataset, and wavelengths utilized by variable selection approaches. Datasets 1–6 are displayed in panels a–f, respectively. Wavelengths selected by the four algorithms are represented by the tick marks under the spectral profile
Fig. 2 Predicted vs actual age for NIRS validation set 6 (VS6) with two models. Partial least squares (a) and ensemble partial least squares (b) are displayed. 25–75% confidence (box) and 5–95% confidence intervals (whiskers) are marked. Groups with statistically different means (P < 0.05) via ANOVA with Tukey’s multiple comparisons adjustment are marked with different letters
Classification model accuracy for cross-validation, validation, and independent test sets. For cross-validation, validation, and ITS1, accuracy is the proportion of mosquitoes correctly predicted as “young” (actual age less than 7 days) or “old” (actual age greater than 7 days). For ITS2 and ITS3, accuracy is the proportion of nulliparous mosquitoes predicted as “young” and of parous or sporozoite-positive mosquitoes predicted as “old”. All classifications within sets are binary (i.e. young vs old). Where accuracy was significant by McNemar’s Chi-square test, the 5–95% confidence interval is given in parentheses and the degree of significance is marked with asterisks
| Dataset | Accuracy CV | Accuracy V | ITS1 Accuracy | ITS2 Accuracy | ITS3 Accuracy |
|---|---|---|---|---|---|
| Dataset 1 | | | | | |
| PLS | 0.7913 | 0.7727 (0.6216–0.8853)** | 0.5507 | 0.5625 | 0.5128 |
| ObliqueRF | 0.8649 | 0.7955 (0.647–0.902)*** | 0.5652 | 0.625 (0.5096–0.7308)* | 0.5128 |
| svmLinear | 0.8422 | 0.8636 (0.7265–0.9483)*** | 0.6232 (0.4983–0.7371)* | 0.6232 (0.4983–0.7371)* | 0.5128 |
| Dataset 2 | | | | | |
| PLS | 0.9165 | 0.8421 (0.6875–0.9398)*** | 0.4493 | 0.6 (0.4844–0.708)* | 0.5385 |
| ObliqueRF | 0.9354 | 0.8684 (0.7191–0.9559)*** | 0.4058 | 0.55 | 0.5385 |
| svmLinear | 0.9356 | 0.8947 (0.752–0.9706)*** | 0.4348 | 0.6 (0.4844–0.708)* | 0.5769 |
| Dataset 3 | | | | | |
| PLS | 0.95 | 0.878 (0.738–0.9592)*** | 0.5072 | 0.4625 | 0.4872 |
| ObliqueRF | 0.9687 | 0.9756 (0.8714–0.9994)*** | 0.5942 | 0.55 | 0.4744 |
| svmLinear | 0.9562 | 0.9756 (0.8714–0.9994)*** | 0.5217 | 0.5375 | 0.4872 |
| Dataset 4 | | | | | |
| PLS | 0.895 | 0.88 (0.7569–0.9547)*** | 0.4928 | 0.5 | 0.5128 |
| ObliqueRF | 0.97 | 0.98 (0.8935–0.9995)*** | 0.5072 | 0.525 | 0.4615 |
| svmLinear | 0.945 | 0.96 (0.8629–0.9951)*** | 0.5362 | 0.55 | 0.4744 |
| Dataset 5 | | | | | |
| PLS | 0.7726 | 0.7073 (0.5965–0.8026)*** | 0.5942 | 0.55 | 0.5385 |
| ObliqueRF | 0.8442 | 0.7805 (0.6754–0.8644)*** | 0.6667 (0.5429–0.7756)** | 0.525 | 0.4872 |
| svmLinear | 0.8232 | 0.8049 (0.7026–0.8842)*** | 0.6812 (0.5579–0.7883)** | 0.5875 | 0.5769 |
| Dataset 6 | | | | | |
| PLS | 0.7348 | 0.748 (0.6617–0.8219)*** | 0.6812 (0.5579–0.7883)** | 0.55 | 0.4872 |
| ObliqueRF | 0.8502 | 0.8537 (0.7786–0.9109)*** | 0.6232 (0.4983–0.7371)* | 0.625 (0.5096–0.7308)* | 0.5256 |
| svmLinear | 0.8518 | 0.8374 (0.7601–0.8978)*** | 0.6957 (0.5731–0.8008)** | 0.5625 | 0.5 |
*P < 0.05, **P < 0.01, ***P < 0.001
Abbreviations: CV, cross-validation; V, validation; ITS, independent test set; LV, latent variables used if applicable
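The young/old accuracy described for this table can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record does not say how a mosquito aged exactly 7 days was labelled (the sketch treats it as “old”), nor which method produced the intervals in parentheses (the sketch uses an exact Clopper-Pearson binomial interval as one common choice). All ages and counts are hypothetical.

```python
from math import comb

AGE_THRESHOLD_DAYS = 7  # cut-off from the caption; handling of exactly 7 is assumed


def label(age_days):
    """Binary age class; ages at or above the threshold count as 'old' (assumption)."""
    return "old" if age_days >= AGE_THRESHOLD_DAYS else "young"


def accuracy_counts(actual_ages, predicted_ages):
    """Return (correct, total) after converting both age lists to young/old labels."""
    hits = sum(label(a) == label(p) for a, p in zip(actual_ages, predicted_ages))
    return hits, len(actual_ages)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_increasing(func, target):
    """Solve func(p) = target on [0, 1] for a function increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(hits, n, alpha=0.05):
    """Clopper-Pearson interval for a proportion (assumed method, see lead-in)."""
    lower = 0.0 if hits == 0 else bisect_increasing(
        lambda p: 1 - binom_cdf(hits - 1, n, p), alpha / 2)
    upper = 1.0 if hits == n else bisect_increasing(
        lambda p: 1 - binom_cdf(hits, n, p), 1 - alpha / 2)
    return lower, upper


# Hypothetical actual and predicted ages in days (illustration only).
hits, total = accuracy_counts([2, 4, 6, 8, 10, 12], [3, 8, 5, 9, 6, 13])
print(f"accuracy: {hits}/{total} = {hits / total:.4f}")

# Hypothetical counts for a larger set: 34 correct out of 44.
low, high = exact_interval(34, 44)
print(f"interval: {low:.4f}-{high:.4f}")
```

The significance test named in the caption (McNemar’s Chi-square test) is a separate step and is not reproduced here.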
Fig. 3 Comparison of predicted vs actual age for independent test set 1 (ITS1) with two models. Partial least squares (a) and ensemble partial least squares (b) are displayed
Fig. 4 Prediction of independent test set 2 (ITS2) (nulliparous vs Plasmodium sporozoite positive, a) and independent test set 3 (ITS3) (nulliparous vs parous, b) for five algorithms created from dataset 6