Phani Krishna Kondeti, Kumar Ravi, Srinivasa Rao Mutheneni, Madhusudhan Rao Kadiri, Sriram Kumaraswamy, Ravi Vadlamani, Suryanaryana Murty Upadhyayula.
Abstract
Filariasis is one of the major public health concerns in India, where approximately 600 million people across 250 districts are at risk. To predict the disease, a pilot-scale study was carried out in 30 villages of Karimnagar district, Telangana, from 2004 to 2007 to collect epidemiological and socio-economic data. The collected data were analysed with several machine learning techniques: Naïve Bayes (NB), logistic model tree (LMT), probabilistic neural network (PNN), J48 (C4.5), classification and regression tree (CART), JRip and gradient boosting machine (GBM). Classifier performance is reported as sensitivity, specificity, accuracy and area under the ROC curve (AUC). Among the classifiers employed, NB yielded the best AUC (64%) and was statistically comparable to the other classifiers. In addition, the J48 algorithm generated 23 decision rules that can support an early warning system for better prevention and control of filariasis.
Keywords: Filariasis; Machine learning techniques; mosquito; socio-economic factors
Year: 2019 PMID: 31475670 PMCID: PMC6805759 DOI: 10.1017/S0950268819001481
Source DB: PubMed Journal: Epidemiol Infect ISSN: 0950-2688 Impact factor: 2.451
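The four measures named in the abstract can be illustrated with a short sketch. The confusion-matrix counts and the scores below are hypothetical values chosen for illustration, not data from the study, and the function names are assumptions of this example.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of positives detected
        "specificity": tn / (tn + fp),   # share of negatives detected
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=30, fn=20, tn=400, fp=50)
auc = auc_from_scores(pos_scores=[0.9, 0.4], neg_scores=[0.5, 0.3, 0.1])
print(metrics, auc)
```

Accuracy alone can look favourable when one class dominates, which is why sensitivity, specificity and AUC are reported alongside it.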
Fig. 1. Schematic diagram of the proposed methodology.
Epidemiological and socio-economic attributes for the prediction of filariasis
| Major attributes | Sub-attributes | Survey participants (n) |
|---|---|---|
| Samples for filariasis | Samples positive for microfilaria | 199 |
| | Samples negative for microfilaria | 5195 |
| Age groups | 1–5 | 173 |
| | 6–10 | 549 |
| | 11–17 | 1033 |
| | 18–25 | 831 |
| | 26–40 | 1430 |
| | 41–60 | 1209 |
| | >61 | 169 |
| Gender | Male | 2623 |
| | Female | 2771 |
| Occupation | Agriculture | 2191 |
| | Labourers | 2135 |
| | Business | 586 |
| | Employees | 183 |
| | Others | 299 |
| Education | Undergraduate | 5067 |
| | Graduate | 327 |
| House structure | Hut and thatched | 1555 |
| | Tiled | 2372 |
| | RCC | 1467 |
| Income (INR/Rs) | <1000 | 1251 |
| | 1000–3000 | 3380 |
| | >3000 | 763 |
| Breeding habitats | Cess pool | 991 |
| | Cess pit | 782 |
| | Open drainage | 1950 |
| | No breeding habitats | 1671 |
| Drainage system | Kutcha | 1360 |
| | Pucca | 4034 |
| Filaria awareness | Yes | 3832 |
| | No | 1562 |
| Participated in MDA programme | Yes | 3144 |
| | No | 2250 |
| Mosquito avoidance | Yes | 982 |
| | No | 4412 |
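The counts in the table above can be cross-checked with a little arithmetic. The sketch below copies a few attribute groups from the table (the dictionary layout and variable names are assumptions of this example), confirms that each group sums to the same number of participants, and derives the microfilaria prevalence and the class imbalance ratio.

```python
# Counts copied from the attribute table; only a few groups are shown.
survey = {
    "Samples for filariasis": {"Positive": 199, "Negative": 5195},
    "Gender": {"Male": 2623, "Female": 2771},
    "House structure": {"Hut and thatched": 1555, "Tiled": 2372, "RCC": 1467},
    "Income (INR/Rs)": {"<1000": 1251, "1000-3000": 3380, ">3000": 763},
    "Drainage system": {"Kutcha": 1360, "Pucca": 4034},
}

# Every attribute group should account for the same set of participants.
totals = {attr: sum(counts.values()) for attr, counts in survey.items()}
n_participants = totals["Samples for filariasis"]
consistent = all(total == n_participants for total in totals.values())

positives = survey["Samples for filariasis"]["Positive"]
negatives = survey["Samples for filariasis"]["Negative"]
prevalence = positives / n_participants   # share of microfilaria-positive samples
imbalance_ratio = negatives / positives   # negatives per positive sample

print(n_participants, consistent)
print(f"prevalence = {prevalence:.2%}, imbalance = {imbalance_ratio:.1f}:1")
```

With roughly 26 negative samples for every positive one, the dataset is strongly imbalanced, which is the motivation for the oversampling used in the later result tables.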
Results obtained for the imbalanced dataset without feature selection
| Classifiers | Specificity | Sensitivity | AUC | Accuracy | |
|---|---|---|---|---|---|
| Naïve Bayes | 0.92 | 0.11 | 0.51 | 0.80 | 1.8 |
| J48 | 1.0 | 0 | 0.50 | 0.85 | 2.4 |
| JRip | 0.99 | 0 | 0.50 | 0.85 | 2.4 |
| LMT | 1.0 | 0 | 0.50 | 0.85 | 2.5 |
| CART | 1.0 | 0 | 0.50 | 0.85 | 2.4 |
| PNN | 0.57 | 0.33 | 0.45 | 0.36 | 3.35 |
| GBM | 0.62 | 0.53 | 0.57 | 0.60 | – |
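For a classifier that outputs hard class labels, the ROC curve has a single operating point, and the area under the resulting two-segment curve equals (sensitivity + specificity) / 2. The sketch below checks whether the AUC values tabulated above are consistent with that midpoint. This is an observation on the reported numbers under that assumption, not a statement of how the authors computed AUC; the 0.01 tolerance allows for the two-decimal rounding of the table entries.

```python
# (specificity, sensitivity, reported AUC) copied from the table above.
rows = {
    "Naïve Bayes": (0.92, 0.11, 0.51),
    "J48": (1.0, 0.0, 0.50),
    "JRip": (0.99, 0.0, 0.50),
    "LMT": (1.0, 0.0, 0.50),
    "CART": (1.0, 0.0, 0.50),
    "PNN": (0.57, 0.33, 0.45),
    "GBM": (0.62, 0.53, 0.57),
}

TOLERANCE = 0.01  # table entries are rounded to two decimals

gaps = {}
for name, (specificity, sensitivity, reported_auc) in rows.items():
    midpoint = (specificity + sensitivity) / 2
    gaps[name] = abs(midpoint - reported_auc)
    print(f"{name:12s} midpoint={midpoint:.3f} reported={reported_auc:.2f}")

all_consistent = all(gap <= TOLERANCE for gap in gaps.values())
print("all within tolerance:", all_consistent)
```

Read this way, a classifier with specificity 1.0 and sensitivity 0 labels every sample negative, and its AUC of 0.50 is no better than chance despite the high accuracy.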
Results obtained using 100% oversampling and gain ratio feature selection
| Classifiers | Specificity | Sensitivity | AUC | Accuracy | |
|---|---|---|---|---|---|
| Naïve Bayes | 0.83 | 0.29 | 0.56 | 0.75 | 2.02 |
| J48 | 0.86 | 0.27 | 0.56 | 0.77 | 1.9 |
| LMT | 0.85 | 0.27 | 0.56 | 0.76 | 2.08 |
| CART | 0.94 | 0.06 | 0.50 | 0.82 | 4.78 |
| JRip | 0.94 | 0.06 | 0.50 | 0.81 | 4.57 |
| PNN | 0.80 | 0.47 | 0.63 | 0.75 | – |
| GBM | 0.63 | 0.58 | 0.61 | 0.63 | 0.65 |
Results obtained using 400% oversampling and gain ratio feature selection
| Classifiers | Specificity | Sensitivity | AUC | Accuracy | |
|---|---|---|---|---|---|
| Naïve Bayes | 0.60 | 0.68 | 0.64 | 0.61 | – |
| J48 | 0.73 | 0.51 | 0.62 | 0.70 | 0.66 |
| JRip | 0.73 | 0.48 | 0.61 | 0.69 | 1.68 |
| LMT | 0.73 | 0.48 | 0.60 | 0.69 | 1.05 |
| CART | 0.73 | 0.44 | 0.58 | 0.69 | 1.08 |
| PNN | 0.53 | 0.64 | 0.58 | 0.54 | 2.15 |
| GBM | 0.65 | 0.61 | 0.63 | 0.64 | 1.28 |
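The captions refer to 100% and 400% oversampling, but this record does not say which oversampling technique was used. A minimal sketch, assuming plain random duplication of minority-class samples (a synthetic method such as SMOTE is another possibility) and using the whole-survey class counts purely for illustration; the function name and parameters are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(labels, percent, minority=1, seed=0):
    """Return sample indices with the minority class enlarged by `percent` percent.

    Extra minority samples are drawn at random, with replacement, from the
    existing minority samples (random oversampling, an assumption here).
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_extra = round(len(minority_idx) * percent / 100)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return list(range(len(labels))) + extra


# 1 = microfilaria positive, 0 = negative (survey-level counts).
labels = [1] * 199 + [0] * 5195

class_counts = {}
for percent in (100, 400):
    idx = oversample_minority(labels, percent)
    class_counts[percent] = Counter(labels[i] for i in idx)
    share = class_counts[percent][1] / len(idx)
    print(percent, dict(class_counts[percent]), f"positive share = {share:.3f}")
```

In practice oversampling is normally applied to the training portion only, so that duplicated samples do not leak into the evaluation data.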
Fig. 2. Tree generated using J48 algorithm.
Fig. 3. Decision rules obtained using CART.
Test sample dataset
| Variables | Test data |
|---|---|
| MDA | 0 |
| Breeding habitats – open drainage | 1 |
| Breeding habitats – no breeding habitat | 0 |
| Breeding habitats – cesspit | 0 |
| Drainage system – kutcha | 0 |
| Drainage system – pucca | 1 |
| Mosquito avoidance | 1 |
| House type – tiled | 1 |
| House type – RCC | 0 |
| Awareness | 0 |
| Sex – F | 0 |
| Sex – M | 1 |
| Remarks | Negative (neg) |
Fig. 4. Rules generated by the J48 algorithm.
Results obtained using a different combination of variables
| Feature | Specificity | Sensitivity | AUC | Accuracy | Impact |
|---|---|---|---|---|---|
| MDA | 0.50 | 0.62 | 0.56 | 0.52 | 0.06 |
| Mosquito avoidance | 0.64 | 0.59 | 0.61 | 0.63 | 0.09 |
| Drainage system | 0.64 | 0.58 | 0.61 | 0.63 | 0.10 |
| House type | 0.66 | 0.50 | 0.58 | 0.64 | 0.17 |
| Sex | 0.65 | 0.49 | 0.57 | 0.63 | 0.18 |
| Awareness | 0.66 | 0.49 | 0.58 | 0.64 | 0.19 |
| Breeding habitat | 0.67 | 0.47 | 0.57 | 0.64 | 0.21 |
Relative feature importance obtained using GBM
| Sl. No. | Feature name | Feature relative importance |
|---|---|---|
| 1 | MDA | 17.8 |
| 2 | Mosquito avoidance | 11.3 |
| 3 | Drainage system – kutcha | 10.7 |
| 4 | Awareness | 9.91 |
| 5 | Breeding habitats – open drainage | 9.27 |
| 6 | House type – RCC | 8.26 |
| 7 | Drainage system – pucca | 8.12 |
| 8 | House type – tiled | 8.04 |
| 9 | Breeding habitats – cesspit | 7.22 |
| 10 | Breeding habitats – no breeding habitat | 6.23 |
| 11 | Sex – M | 1.6 |
| 12 | Sex – F | 1.56 |
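Several rows in the table above are indicator (dummy) variables of the same survey attribute. As a rough illustration, the sketch below sums the tabulated importances per parent attribute. This grouping is a convenience of this example, not an analysis reported by the authors, and the two sex indicators are written here as `Sex – M` and `Sex – F`.

```python
# Relative importances copied from the table above.
importances = {
    "MDA": 17.8,
    "Mosquito avoidance": 11.3,
    "Drainage system – kutcha": 10.7,
    "Awareness": 9.91,
    "Breeding habitats – open drainage": 9.27,
    "House type – RCC": 8.26,
    "Drainage system – pucca": 8.12,
    "House type – tiled": 8.04,
    "Breeding habitats – cesspit": 7.22,
    "Breeding habitats – no breeding habitat": 6.23,
    "Sex – M": 1.6,
    "Sex – F": 1.56,
}

# Sum the indicator variables that belong to the same parent attribute.
grouped = {}
for name, value in importances.items():
    parent = name.split(" – ")[0]
    grouped[parent] = grouped.get(parent, 0.0) + value

ranking = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
for parent, value in ranking:
    print(f"{parent:20s} {value:6.2f}")
print("total:", round(sum(importances.values()), 2))
```

Under this grouping the breeding-habitat and drainage indicators together outweigh MDA participation, which ranks first as a single variable; the totals come to about 100, as expected for relative importances.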
Fig. 5. ROC area under the curve for GBM.