Judith Leo, Edith Luhanga, Kisangiri Michael.
Abstract
Cholera epidemics have remained a public health threat throughout history, affecting vulnerable populations living with unreliable water supplies and substandard sanitary conditions. Various studies have observed that the occurrence of cholera is strongly linked to environmental factors such as climate change and geographical location. Climate change has been strongly linked to the seasonal occurrence and wide spread of cholera through the creation of weather patterns that favor the disease's transmission, infection, and the growth of Vibrio cholerae, which causes the disease. Over the past decades, there have been great achievements in developing epidemic models for the prediction of cholera. However, the integration of weather variables and the use of machine learning techniques have not been explicitly deployed in modeling cholera epidemics in Tanzania, owing to challenges in the datasets such as imbalanced data and missing information. This paper explores the use of machine learning techniques to model cholera epidemics with linkage to seasonal weather changes while overcoming the data imbalance problem. The Adaptive Synthetic Sampling Approach (ADASYN) and Principal Component Analysis (PCA) were used to restore the sampling balance and to reduce the dimensionality of the dataset, respectively. In addition, sensitivity, specificity, and balanced-accuracy metrics were used to evaluate the performance of the seven models. Based on the results of the Wilcoxon signed-rank test and the features of the models, the XGBoost classifier was selected as the best model for the study. Overall, the results improved our understanding of the significant role of machine learning strategies in health-care data. However, the study could not be treated as a time series problem because of data collection bias. The study recommends a review of health-care systems in order to facilitate quality data collection and the deployment of machine learning techniques.
Year: 2019 PMID: 31427903 PMCID: PMC6683776 DOI: 10.1155/2019/9397578
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. Cholera cases reported by WHO by year and by continent from 1989 to 2017 [43].
Description of data for daily seasonal weather changes.
| Variable | Description | Unit |
|---|---|---|
| Temp_max | Maximum Temperature | Degrees Celsius (°C) |
| Temp_mean | Mean Temperature | Degrees Celsius (°C) |
| Temp_min | Minimum Temperature | Degrees Celsius (°C) |
| Temp_range | Temperature Range | Degrees Celsius (°C) |
| Rainfall | Rainfall | Millimeters (mm) |
| Humidity | Relative Humidity | Percent (%) |
| Wind_Spd | Wind Speed | Knots |
| Wind_Dir | Wind Direction | Degrees |
Description of cholera cases data with regard to patient details.
| Variable | Description | Values / format |
|---|---|---|
| District | District name | Dar es Salaam districts |
| Date | Date of onset | Day, month, year |
| Result | Lab result | Yes or No |
Statistical data description of cholera cases using count, mean, std, min, max, and percentile.
| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Rainfall | 2951 | 1.962 | 7.518 | 0 | 0 | 0 | 0.2 | 105.1 |
| Temp_max | 2951 | 31.343 | 1.816 | 0 | 30 | 31.1 | 32.7 | 36.3 |
| Temp_min | 2951 | 22.496 | 2.505 | 0 | 21 | 21.4 | 24.2 | 28.8 |
| Temp_mean | 2951 | 26.92 | 1.854 | 0 | 25.5 | 26.7 | 28.2 | 31.55 |
| Temp_range | 2951 | 8.847 | 2.323 | 0 | 7.5 | 9 | 10.4 | 16.4 |
| Humidity | 2951 | 78.835 | 5.2 | 0 | 75 | 78 | 81 | 97 |
| Wind_Dir | 2951 | 117.32 | 91.23 | 0 | 50 | 120 | 160 | 360 |
| Wind_Spd | 2951 | 5.33 | 3.706 | 0 | 3 | 5 | 8 | 18 |
| result | 2951 | 0.07 | 0.255 | 0 | 0 | 0 | 0 | 1 |
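The `result` row above already hints at the class imbalance that the study sets out to correct. The following is a minimal sketch, assuming `result` is coded as 1 for a positive lab result and 0 otherwise (the coding is not stated in the table), and using only the rounded `count` and `mean` values listed. The variable names are illustrative, and the derived counts are estimates because the tabulated mean is rounded to two decimals.

```python
import math

# Values copied from the `result` row of the table above.
count = 2951          # number of records
positive_rate = 0.07  # mean of `result`, rounded to two decimals in the table

# Estimated class sizes (approximate, because the mean is rounded).
est_positives = round(count * positive_rate)
est_negatives = count - est_positives
imbalance_ratio = est_negatives / est_positives

# For a 0/1 variable the sample standard deviation follows from the mean,
# which gives a quick consistency check against the tabulated std of 0.255.
implied_std = math.sqrt(positive_rate * (1 - positive_rate) * count / (count - 1))

print(f"estimated positives: {est_positives}")
print(f"estimated negatives: {est_negatives}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"implied std: {implied_std:.3f}")
```

Under these assumptions there are roughly thirteen negative records for every positive one, and the implied standard deviation agrees with the tabulated value.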
Figure 2. Patient distribution per month.
Figure 3. Rainfall distribution per month.
Figure 4. Patient distribution across districts.
Figure 5. Model formulation approach.
Figure 6. Imbalanced cholera dataset.
Classifiers with a detailed range of sensitivity, specificity, and balanced accuracy.
| Classifiers | Sensitivity score | Specificity score | Balanced accuracy score |
|---|---|---|---|
| | | | |
| XGB | 0.055 ± 0.08 | 0.995 ± 0.006 | 0.525 ± 0.04 |
| K-NN | 0.095 ± 0.103 | 0.985 ± 0.014 | 0.54 ± 0.053 |
| DT | 0.119 ± 0.116 | 0.98 ± 0.016 | 0.549 ± 0.061 |
| RF | 0.166 ± 0.137 | 0.981 ± 0.016 | 0.574 ± 0.072 |
| ExtraTrees | 0.114 ± 0.113 | 0.984 ± 0.015 | 0.549 ± 0.06 |
| AdaBoost | 0 | 0.997 ± 0.005 | 0.498 ± 0.003 |
| LDA | 0 | 1 | 0.5 |
| | | | |
| XGB | 0.801 ± 0.148 | 0.742 ± 0.053 | 0.772 ± 0.079 |
| K-NN | 0.656 ± 0.24 | 0.83 ± 0.042 | 0.743 ± 0.123 |
| DT | 0.579 ± 0.17 | 0.882 ± 0.032 | 0.73 ± 0.09 |
| RF | 0.632 ± 0.156 | 0.88 ± 0.034 | 0.756 ± 0.085 |
| ExtraTrees | 0.589 ± 0.161 | 0.88 ± 0.032 | 0.734 ± 0.085 |
| AdaBoost | 0.708 ± 0.206 | 0.707 ± 0.058 | 0.707 ± 0.103 |
| LDA | 0.593 ± 0.23 | 0.594 ± 0.051 | 0.593 ± 0.111 |
| | | | |
| XGB | 0.056 ± 0.096 | 0.9912 ± 0.01 | 0.524 ± 0.049 |
| K-NN | 0.061 ± 0.09 | 0.989 ± 0.013 | 0.525 ± 0.045 |
| DT | 0.119 ± 0.117 | 0.983 ± 0.015 | 0.551 ± 0.062 |
| RF | 0.153 ± 0.121 | 0.983 ± 0.016 | 0.568 ± 0.063 |
| ExtraTrees | 0.114 ± 0.113 | 0.984 ± 0.015 | 0.549 ± 0.06 |
| AdaBoost | 0 | 0.999 ± 0.004 | 0.5 |
| LDA | 0 | 1 | 0.5 |
| | | | |
| XGB | 0.805 ± 0.169 | 0.73 ± 0.05 | 0.767 ± 0.09 |
| K-NN | 0.705 ± 0.199 | 0.828 ± 0.034 | 0.767 ± 0.105 |
| DT | 0.596 ± 0.163 | 0.879 ± 0.032 | 0.737 ± 0.086 |
| RF | 0.645 ± 0.155 | 0.877 ± 0.031 | 0.761 ± 0.082 |
| ExtraTrees | 0.585 ± 0.19 | 0.880 ± 0.034 | 0.732 ± 0.102 |
| AdaBoost | 0.691 ± 0.168 | 0.731 ± 0.040 | 0.711 ± 0.087 |
| LDA | 0.534 ± 0.226 | 0.622 ± 0.072 | 0.578 ± 0.117 |
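The abstract states that a Wilcoxon signed-rank test informed the choice of classifier. The sketch below shows how the signed-rank statistic could be computed for two classifiers evaluated on the same cross-validation folds. It is an illustration under stated assumptions, not the study's code: the per-fold scores are made up (the source reports only means and spreads), and the function name `wilcoxon_signed_rank` is hypothetical.

```python
from typing import List, Tuple


def wilcoxon_signed_rank(a: List[float], b: List[float]) -> Tuple[float, float, int]:
    """Return (W_plus, W_minus, n) for paired samples a and b.

    Zero differences are dropped, and tied absolute differences receive
    the average of the ranks they span.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    # Round to suppress floating-point noise before testing for ties and zeros.
    diffs = [round(x - y, 6) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    # Indices of the differences, ordered by absolute size.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, n


# Hypothetical per-fold balanced-accuracy scores for two classifiers.
model_a = [0.78, 0.74, 0.81, 0.69, 0.77, 0.80]
model_b = [0.75, 0.76, 0.76, 0.70, 0.71, 0.74]

w_plus, w_minus, n = wilcoxon_signed_rank(model_a, model_b)
print(f"W+ = {w_plus}, W- = {w_minus}, n = {n}")
```

The smaller of the two rank sums is the usual test statistic, and it is compared against a critical value or converted to a p-value. In practice a library routine such as `scipy.stats.wilcoxon` handles that step.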
Description of sensitivity and specificity [53].
| | Disease present | Disease absent | Total |
|---|---|---|---|
| Test positive | a (TP) | b (FP) | a + b (all test positives) |
| Test negative | c (FN) | d (TN) | c + d (all test negatives) |
| Total | a + c (all diseased) | b + d (all nondiseased) | all participants in the study |
| | Sensitivity = a/(a+c) | Specificity = d/(b+d) | |
Note: TP: True Positive, TN: True Negative, FP: False Positive, and FN: False Negative.
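The two formulas in the table, together with balanced accuracy (taken here as the mean of sensitivity and specificity, which is the standard definition), can be sketched in a few lines. The function names and the confusion-matrix counts are hypothetical and serve only to illustrate the arithmetic; they are not results from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: a / (a + c) in the table's notation."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: d / (b + d) in the table's notation."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical counts: a = TP, b = FP, c = FN, d = TN.
a, b, c, d = 40, 130, 10, 370

print(f"sensitivity       = {sensitivity(a, c):.3f}")
print(f"specificity       = {specificity(d, b):.3f}")
print(f"balanced accuracy = {balanced_accuracy(a, b, c, d):.3f}")
```

Averaging the two rates keeps a classifier that predicts only the majority class from looking good: with sensitivity 0 and specificity 1 the balanced accuracy is 0.5. The same averaging is consistent with the classifier results reported earlier, where for example an XGB row with sensitivity 0.801 and specificity 0.742 gives (0.801 + 0.742) / 2 = 0.7715, in line with the listed 0.772.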
Figure 7. Results of sensitivity, specificity, and balanced-accuracy metrics.