Yan-Feng Gong1, Ling-Qian Zhu1, Yin-Long Li1, Li-Juan Zhang1, Jing-Bo Xue1, Shang Xia1,2, Shan Lv1,2, Jing Xu1, Shi-Zhu Li3,4.
Abstract
BACKGROUND: Schistosomiasis control is moving toward transmission interruption and, ultimately, elimination, so evidence-led control is vital for removing the remaining hidden risks of schistosomiasis. This study attempts to identify high-risk areas of schistosomiasis in China using the information value method and machine learning.
Keywords: China; Information value; Machine learning; Risk prediction; Schistosomiasis
Year: 2021 PMID: 34176515 PMCID: PMC8237418 DOI: 10.1186/s40249-021-00874-9
Source DB: PubMed Journal: Infect Dis Poverty ISSN: 2049-9957 Impact factor: 4.520
Fig. 1 Case location and river distribution in this study
Summary of environmental variables related to the distribution of schistosomiasis and Oncomelania hupensis
| Category | Variable | Definition | Source |
|---|---|---|---|
| Climate variables | AR | Aridity | |
| | IM | Index of moisture | |
| | AAP | Average annual precipitation | |
| | AAT | Average annual temperature | |
| | BIO2 | Mean diurnal temperature range | |
| | BIO7 | Temperature annual range | |
| | BIO10 | Mean temperature of warmest quarter | |
| | BIO11 | Mean temperature of coldest quarter | |
| | BIO16 | Precipitation of wettest quarter | |
| | BIO17 | Precipitation of driest quarter | |
| Geographic variables | LF | Landform | |
| | LD | Land use | |
| | SLOPE | Slope | |
| | DST | Distance to waterway | |
| | EL | Elevation | |
| | ANDVI | Annual normalized difference vegetation index | |
| Socio-economic variables | GDP | Gross domestic product | |
| | DP | Density of population | |
| | NTL | Night-time lights | |
Confusion matrix of binary classification results
| Predicted result | Predicted presence | Predicted absence |
|---|---|---|
| Investigated presence | a | b |
| Investigated absence | c | d |
a. True predicted presence; b. False predicted presence; c. False predicted absence; d. True predicted absence
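As a minimal sketch of how such a matrix is usually turned into scores, the snippet below computes accuracy and F1 from the four cells. The counts are invented for illustration and are not taken from the study. Both formulas are symmetric in b and c, so the result does not depend on which off-diagonal cell holds the false presences.

```python
def accuracy(a, b, c, d):
    """Share of all sites whose prediction matches the investigation."""
    return (a + d) / (a + b + c + d)


def f1_score(a, b, c):
    """Harmonic mean of precision and recall for the presence class."""
    return 2 * a / (2 * a + b + c)


# Hypothetical counts, for illustration only (not from the study).
a, b, c, d = 80, 10, 20, 90
print(round(accuracy(a, b, c, d), 3))  # 0.85
print(round(f1_score(a, b, c), 3))     # 0.842
```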
Fig. 2 Implementation path of model building
Number and meaning of environmental factor classification based on the principle of chi-square binning
| Factors | Number | Classification index |
|---|---|---|
| AAP (mm) | 8 | < 850; 850–950; 950–1000; 1000–1350; 1350–1450; 1450–1500; 1500–1550; > 1550 |
| AAT (°C) | 8 | < 11.5; 11.5–16.0; 16.0–17.0; 17.0–17.5; 17.5–18.0; 18.0–18.5; 18.5–19.0; > 19.0 |
| IM (%) | 8 | < 45; 45–50; 50–55; 55–60; 60–65; 65–70; 70–90; > 90 |
| AR (%) | 8 | < 62; 62–66; 66–68; 68–72; 72–74; 74–92; 92–95; > 95 |
| BIO2 | 8 | < 7.3; 7.3–7.8; 7.8–7.9; 7.9–8.2; 8.2–8.6; 8.6–9.3; 9.3–9.9; > 9.9 |
| BIO7 | 8 | < 24; 24–27.5; 27.5–29; 29–31; 31–31.5; 31.5–33; 33–33.5; > 33.5 |
| BIO10 (°C) | 8 | < 17; 17–20; 20–22; 22–25; 25–26.5; 26.5–27; 27–28; > 28 |
| BIO11 (°C) | 8 | < 5.8; 5.8–6.0; 6.0–6.2; 6.2–6.4; 6.4–6.6; 6.6–7.6; 7.6–8.6; > 8.6 |
| BIO16 (mm) | 8 | < 440; 440–460; 460–480; 480–500; 500–520; 520–540; 540–560; > 560 |
| BIO17 (mm) | 8 | < 20; 20–50; 50–130; 130–140; 140–155; 155–160; 160–175; > 175 |
| LF | 6 | Plains; terraces; hills; small undulating mountains; medium undulating mountains; large undulating mountains |
| LD | 7 | Paddy field; dry land; woodland; grassland; water area; urban and rural residential land; unused land |
| EL (m) | 7 | < 50; 50–100; 100–450; 450–700; 700–2150; 2150–2500; > 2500 |
| SLOPE (°) | 8 | < 2; 2–3; 3–6; 6–9; 9–13; 13–22; 22–29; > 29 |
| DST (km) | 8 | < 0.5; 0.5–1.0; 1.0–1.5; 1.5–2; 2–2.5; 2.5–3; 3–3.5; > 3.5 |
| ANDVI | 8 | < 0.78; 0.78–0.79; 0.79–0.8; 0.8–0.81; 0.81–0.82; 0.82–0.83; 0.83–0.84; > 0.84 |
| GDP (10 000/km2) | 8 | < 50; 50–100; 100–150; 150–250; 250–350; 350–800; 800–1000; > 1000 |
| DP (Person/km2) | 8 | < 100; 100–150; 150–200; 200–250; 250–400; 400–450; 450–550; > 550 |
| NTL | 8 | < 0.08; 0.08–0.10; 0.10–0.12; 0.12–0.14; 0.14–0.16; 0.16–0.18; 0.18–0.54; > 0.54 |
AAP average annual precipitation, AAT average annual temperature, IM index of moisture, AR aridity, BIO2 mean diurnal temperature range, BIO7 temperature annual range, BIO10 mean temperature of warmest quarter, BIO11 mean temperature of coldest quarter, BIO16 precipitation of wettest quarter, BIO17 precipitation of driest quarter, LF landform, LD land use, SLOPE slope, DST distance to waterway, EL elevation, ANDVI annual normalized difference vegetation index, GDP gross domestic product, DP density of population, NTL night-time lights
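To illustrate how a continuous measurement maps onto one of these classes, here is a small sketch that looks up the grade of an AAP value from the cut points in the table. The function name `grade` is hypothetical, and the handling of values that fall exactly on a boundary (assigned to the upper class here) is an assumption, since the table does not say which side is inclusive.

```python
from bisect import bisect_right

# Interior cut points for AAP (mm), copied from the table above.
AAP_CUTS = [850, 950, 1000, 1350, 1450, 1500, 1550]


def grade(value, cuts):
    """Return the 1-based class of `value` given sorted cut points.

    Assumption: a value equal to a cut point goes to the upper class.
    """
    return bisect_right(cuts, value) + 1


print(grade(800, AAP_CUTS))   # 1  (< 850)
print(grade(1400, AAP_CUTS))  # 5  (1350-1450)
print(grade(1600, AAP_CUTS))  # 8  (> 1550)
```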
Results for grading information value by environmental influencing factors
| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| AAP | −1.435 | −0.941 | −0.789 | 0.223 | 0.901 | 1.219 | 0.118 | −0.811 |
| AAT | −2.970 | 0.411 | 0.647 | 0.693 | 0.544 | 1.067 | 0.582 | −0.305 |
| IM | −0.498 | 0.916 | 0.095 | 0.693 | 0.818 | 1.587 | −0.288 | −1.466 |
| AR | −1.224 | −0.693 | 0.836 | 0.773 | 0.383 | 0.228 | −0.801 | −1.447 |
| BIO2 | 0.547 | 1.176 | 0.319 | 0.323 | 0 | −0.553 | −1.194 | −1.269 |
| BIO7 | −0.406 | −0.651 | −1.504 | −1.355 | 0.357 | 0.774 | 0.568 | −1.082 |
| BIO10 | −2.773 | −0.838 | −0.693 | −1.674 | −0.827 | −1.584 | 0.894 | 1.192 |
| BIO11 | −1.064 | 1.121 | 1.121 | 1.118 | 0.847 | 0.228 | −0.773 | −0.406 |
| BIO16 | −0.916 | −0.074 | −0.442 | −1.065 | 0.598 | 0.223 | 0.811 | 0.180 |
| BIO17 | −2.110 | −0.887 | −0.203 | 0.499 | 1.421 | 0.767 | 0.534 | −0.095 |
| LF | 0.950 | 1.068 | 0.766 | −0.300 | −0.742 | −1.789 | | |
| LD | 0.347 | 0.169 | −0.266 | 0.342 | 0.234 | 0.123 | −1.634 | |
| SLOPE | 0.841 | 0 | −0.187 | −1.099 | −2.485 | −0.821 | −0.949 | −1.946 |
| DST | 0.395 | 0.560 | 0.821 | 0.442 | 0.147 | −0.406 | 0 | −0.515 |
| EL | 0.959 | 0.195 | −1.126 | −0.167 | −0.975 | −0.651 | −2.169 | |
| ANDVI | 0.227 | −0.105 | −0.486 | 0.223 | −0.452 | −0.223 | −1.299 | −0.256 |
| GDP | −1.052 | −0.065 | 0.035 | 0.773 | 0.619 | 0.218 | 0.211 | 0.511 |
| DP | −0.946 | −0.102 | −1.179 | 0.560 | 0.431 | 0.368 | 1.099 | 0.621 |
| NTL | −0.887 | −0.674 | 0.111 | −0.827 | 0.143 | 0.470 | 0.186 | 0.450 |
AAP average annual precipitation, AAT average annual temperature, IM index of moisture, AR aridity, BIO2 mean diurnal temperature range, BIO7 temperature annual range, BIO10 mean temperature of warmest quarter, BIO11 mean temperature of coldest quarter, BIO16 precipitation of wettest quarter, BIO17 precipitation of driest quarter, LF landform, LD land use, SLOPE slope, DST distance to waterway, EL elevation, ANDVI annual normalized difference vegetation index, GDP gross domestic product, DP density of population, NTL night-time lights
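This section does not state how the values were computed. The sketch below assumes the form of the information value model commonly used in hazard mapping: the natural log of the share of positive sites falling in a class divided by the share of the study area in that class. Several entries in the table (0.693 ≈ ln 2, 1.099 ≈ ln 3) are consistent with a natural-log ratio, but the exact inputs are an assumption. Summing the values of a cell's classes into one score is likewise the usual convention rather than something confirmed here, and the helper names are hypothetical.

```python
import math


def information_value(cases_in_class, cases_total, area_in_class, area_total):
    """ln of (share of cases in the class) / (share of area in the class)."""
    case_share = cases_in_class / cases_total
    area_share = area_in_class / area_total
    return math.log(case_share / area_share)


# Grade-indexed values for two factors, copied from the table above.
IV_TABLE = {
    "AAT": [-2.970, 0.411, 0.647, 0.693, 0.544, 1.067, 0.582, -0.305],
    "DST": [0.395, 0.560, 0.821, 0.442, 0.147, -0.406, 0.0, -0.515],
}


def total_iv(grades):
    """Sum the information values of one cell's 1-based grades (assumed rule)."""
    return sum(IV_TABLE[factor][g - 1] for factor, g in grades.items())


# A class holding twice its area share of cases scores ln 2.
print(round(information_value(20, 100, 10, 100), 3))  # 0.693
# Hypothetical cell: AAT grade 6, DST grade 3.
print(round(total_iv({"AAT": 6, "DST": 3}), 3))       # 1.888
```

A positive total suggests the cell's conditions are over-represented among positive sites, and a negative total the opposite.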
Predictive performance indicators for the seven models
| Model | IV | LR | IV + LR | RF | IV + RF | GBM | IV + GBM |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.732 | 0.790 | 0.815 | 0.785 | 0.820 | 0.849 | 0.878 |
| AUC | 0.750 | 0.827 | 0.853 | 0.840 | 0.872 | 0.859 | 0.902 |
| F1 | 0.705 | 0.867 | 0.871 | 0.854 | 0.875 | 0.903 | 0.920 |
IV information value, LR logistic regression, RF random forest, GBM generalized boosted model, AUC area under the curve
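To make the effect of coupling easier to read off, the following sketch subtracts each standalone model's scores from those of its IV-coupled counterpart. The numbers are copied from the table above; the dictionary layout and the `gains` helper are illustrative only.

```python
# Scores copied from the table above: standalone models.
ALONE = {
    "LR": {"Accuracy": 0.790, "AUC": 0.827, "F1": 0.867},
    "RF": {"Accuracy": 0.785, "AUC": 0.840, "F1": 0.854},
    "GBM": {"Accuracy": 0.849, "AUC": 0.859, "F1": 0.903},
}
# Same models coupled with information value (IV).
COUPLED = {
    "LR": {"Accuracy": 0.815, "AUC": 0.853, "F1": 0.871},
    "RF": {"Accuracy": 0.820, "AUC": 0.872, "F1": 0.875},
    "GBM": {"Accuracy": 0.878, "AUC": 0.902, "F1": 0.920},
}


def gains(model):
    """Per-metric improvement of the IV-coupled model over the standalone one."""
    return {
        metric: round(COUPLED[model][metric] - ALONE[model][metric], 3)
        for metric in ALONE[model]
    }


for model in ALONE:
    print(model, gains(model))
# LR {'Accuracy': 0.025, 'AUC': 0.026, 'F1': 0.004}
# RF {'Accuracy': 0.035, 'AUC': 0.032, 'F1': 0.021}
# GBM {'Accuracy': 0.029, 'AUC': 0.043, 'F1': 0.017}
```

Every difference is positive, and IV + GBM scores highest on all three indicators.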
Fig. 3 Current risk prediction for schistosomiasis in China based on the optimal coupled model