Van Tran1, Tazmilur Saad1, Mehret Tesfaye2, Sosina Walelign2, Moges Wordofa2, Dessie Abera2, Kassu Desta2, Aster Tsegaye2, Ahmet Ay3,4, Bineyam Taye5.
Abstract
BACKGROUND: Although previous epidemiological studies have examined the potential risk factors that increase the likelihood of acquiring Helicobacter pylori infections, most of these analyses have utilized conventional statistical models, including logistic regression, and have not benefited from advanced machine learning techniques.
Keywords: Classification; Ethiopia; Feature selection; H. pylori infection; Logistic regression; Machine learning; School children
Year: 2022 PMID: 35902812 PMCID: PMC9330977 DOI: 10.1186/s12879-022-07625-7
Source DB: PubMed Journal: BMC Infect Dis ISSN: 1471-2334 Impact factor: 3.667
Fig. 1 Map of the study sites in Ethiopia. Sululta and Ziway (Batu) are located to the north and south of Addis Ababa, respectively
List of risk factors (features) used in the study and the survey results
| Feature | Response | | |
|---|---|---|---|
| Residence | **Urban** | 293 (54.3%) | 247 (45.7%) |
| | Rural | 50 (12.1%) | 364 (87.9%) |
| Allergies | Any allergic disease | 70 (31%) | 156 (69%) |
| | **No allergy** | 273 (37.5%) | 455 (62.5%) |
| Parasites | Any parasites found | 110 (37.2%) | 186 (62.8%) |
| | **No parasites found** | 233 (35.4%) | 425 (64.6%) |
| Cooking area | **Inside house** | 276 (33.5%) | 547 (66.5%) |
| | Outside house | 67 (51.1%) | 64 (48.9%) |
| Dewormed status | **Dewormed** | 247 (33.7%) | 487 (66.3%) |
| | Not dewormed | 96 (43.6%) | 124 (56.4%) |
| Cow | Family owns cow(s) | 47 (23.4%) | 154 (76.6%) |
| | **No cow** | 296 (39.3%) | 457 (60.7%) |
| Smoking | **Smokers** | 10 (16.4%) | 51 (83.6%) |
| | No smokers | 333 (37.3%) | 560 (62.7%) |
| Cat | **No cat** | 257 (38.6%) | 408 (61.4%) |
| | Cat lives inside | 53 (25.4%) | 156 (74.6%) |
| | Cat lives outside | 33 (41.3%) | 47 (58.7%) |
| Dog | **No dog** | 228 (40.1%) | 341 (59.9%) |
| | Dog lives inside | 0 (0%) | 4 (100%) |
| | Dog lives outside | 115 (30.2%) | 266 (69.8%) |
| Electricity use | **Everyday** | 273 (55.2%) | 222 (44.8%) |
| | Sometimes | 11 (12.8%) | 75 (87.2%) |
| | Never | 59 (15.8%) | 314 (84.2%) |
| Floor in home | **Cement** | 150 (36.8%) | 258 (63.2%) |
| | Wood | 4 (20%) | 16 (80%) |
| | Mud | 186 (35.9%) | 332 (64.1%) |
| | Other | 3 (37.5%) | 5 (62.5%) |
| Waste disposal | **Garbage bin** | 80 (44.2%) | 101 (55.8%) |
| | Pit | 56 (36.6%) | 97 (63.4%) |
| | Open field | 26 (12.6%) | 181 (87.4%) |
| | Burn | 181 (43.8%) | 232 (56.2%) |
| Age | **0–5 years** | 43 (47.3%) | 48 (52.7%) |
| | 6–10 years | 167 (39.3%) | 258 (60.7%) |
| | 11–15 years | 133 (30.4%) | 305 (69.6%) |
| Family size | **0–3 people** | 55 (34.6%) | 104 (65.4%) |
| | 4–5 people | 165 (35.6%) | 299 (64.4%) |
| | > 5 people | 123 (37.2%) | 208 (62.8%) |
| Toilet | **Flush toilet** | 10 (25%) | 30 (75%) |
| | Pit toilet | 325 (38.7%) | 514 (61.3%) |
| | Open field | 8 (10.7%) | 67 (89.3%) |
| Water source | **Piped** | 327 (38.3%) | 526 (61.7%) |
| | Well | 12 (15.4%) | 66 (84.6%) |
| | River or rain water | 4 (17.4%) | 19 (82.6%) |
The reference group is the one that is bolded in the response column
Univariate and stepwise multivariate logistic regression models of risk factors associated with H. pylori infection in school children, Ethiopia
| Feature | Meaning | | | p-value (Uni) | COR | 95% CI | p-value (multi) | AOR | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| Residence | Place of residence | ||||||||
| 0 | Urban | 293 (54.3%) | 247 (45.7%) | ||||||
| 1 | Rural | 50 (12.1%) | 364 (87.9%) | 0.000 | 0.117 | 0.083–0.164 | 0.000 | 0.216 | 0.144–0.321 |
| Allergy | Any allergic disease | ||||||||
| 0 | No allergy | 273 (37.5%) | 455 (62.5%) | ||||||
| 1 | Allergy | 70 (31%) | 156 (69%) | 0.075 | 0.748 | 0.543–1.029 | |||
| Parasite found | Any parasites found | ||||||||
| 0 | No | 233 (35.4%) | 425 (64.6%) | ||||||
| 1 | Yes | 110 (37.2%) | 186 (62.8%) | 0.566 | 1.087 | 0.818–1.446 | |||
| Cook area | Cooking area | ||||||||
| 0 | Inside house | 276 (33.5%) | 547 (66.5%) | ||||||
| 1 | Outside house | 67 (51.1%) | 64 (48.9%) | 0.000 | 2.075 | 1.430–3.009 | |||
| Deworm | Deworming status | ||||||||
| 0 | Deworm | 247 (33.7%) | 487 (66.3%) | ||||||
| 1 | Not deworm | 96 (43.6%) | 124 (56.4%) | 0.007 | 1.526 | 1.123–2.076 | 0.006 | 1.676 | 1.161–2.424 |
| Cow | Any cow | ||||||||
| 0 | No | 296 (39.3%) | 457 (60.7%) | ||||||
| 1 | Yes | 47 (23.4%) | 154 (76.6%) | 0.000 | 0.471 | 0.329–0.674 | 0.011 | 0.583 | 0.381–0.880 |
| Smoking | Anyone smoke? | ||||||||
| 0 | Smoke | 10 (16.4%) | 51 (83.6%) | ||||||
| 1 | Non-smoke | 333 (37.3%) | 560 (62.7%) | 0.002 | 3.033 | 1.519–6.054 | 0.008 | 2.830 | 1.352–6.419 |
| Cat | Do you have a cat? | ||||||||
| 0 | No | 257 (38.6%) | 408 (61.4%) | ||||||
| 1 (CatInside) | Lives inside | 53 (25.4%) | 156 (74.6%) | 0.001 | 0.539 | 0.381–0.764 | |||
| 2 (CatOutside) | Kept outside | 33 (41.3%) | 47 (58.7%) | 0.652 | 1.115 | 0.695–1.786 | |||
| Dog | Do you have a dog | ||||||||
| 0 | No | 228 (40.1%) | 341 (59.9%) | ||||||
| 1 (DogInside) | Lives inside | 0 (0%) | 4 (100%) | 0.998 | 0.000 | 0.000–inf | |||
| 2 (DogOutside) | Kept outside | 115 (30.2%) | 266 (69.8%) | 0.002 | 0.647 | 0.491–0.852 | |||
| Elec | Electricity? | ||||||||
| 0 | Everyday | 273 (55.2%) | 222 (44.8%) | ||||||
| 1 (ElecSometimes) | Sometimes | 11 (12.8%) | 75 (87.2%) | 0.000 | 0.124 | 0.064–0.239 | 0.007 | 0.363 | 0.167–0.734 |
| 2 (ElecNever) | Never | 59 (15.8%) | 314 (84.2%) | 0.000 | 0.154 | 0.111–0.214 | 0.000 | 0.427 | 0.283–0.641 |
| Floor | Type of floor | ||||||||
| 0 | Cement | 150 (36.8%) | 258 (63.2%) | ||||||
| 1 (WoodFloor) | Wood | 4 (20%) | 16 (80%) | 0.202 | 0.434 | 0.121–1.563 | |||
| 2 (MudFloor) | Mud | 186 (35.9%) | 332 (64.1%) | 0.846 | 0.974 | 0.744–1.274 | |||
| 9 (OtherFloor) | Others | 3 (37.5%) | 5 (62.5%) | 0.955 | 1.042 | 0.246–4.421 | |||
| Waste | Where do you dispose of waste? | ||||||||
| 0 | Garbage bin | 80 (44.2%) | 101 (55.8%) | ||||||
| 1 (WastePit) | Pit | 56 (36.6%) | 97 (63.4%) | 0.523 | 0.868 | 0.563–1.339 | |||
| 2 (WasteField) | Open field | 26 (12.6%) | 181 (87.4%) | 0.000 | 0.207 | 0.124–0.345 | 0.016 | 0.530 | 0.311–0.878 |
| 3 (WasteBurn) | Burning | 181 (43.8%) | 232 (56.2%) | 0.405 | 1.156 | 0.822–1.626 | |||
| Years (Age) | |||||||||
| 0 | 0–5 | 43 (47.3%) | 48 (52.7%) | ||||||
| 1 (6–10Years) | 6 to 10 | 167 (39.3%) | 258 (60.7%) | 0.176 | 0.735 | 0.471–1.148 | |||
| 2 (11–15Years) | 11 to 15 | 133 (30.4%) | 305 (69.6%) | 0.002 | 0.492 | 0.314–0.772 | |||
| FamSize | Family Size | ||||||||
| 0 | 0–3 | 55 (34.6%) | 104 (65.4%) | ||||||
| 1 (4-5FamSize) | 4 or 5 | 165 (35.6%) | 299 (64.4%) | 0.616 | 1.100 | 0.758–1.595 | |||
| 2 (>5FamSize) | > 5 | 123 (37.2%) | 208 (62.8%) | 0.446 | 1.164 | 0.788–1.718 | 0.121 | 1.303 | 0.933–1.822 |
| Toilet | |||||||||
| 0 | Flush toilet | 10 (25%) | 30 (75%) | ||||||
| 1 (ToiletPit) | Pit | 325 (38.7%) | 514 (61.3%) | 0.109 | 1.738 | 0.885–3.415 | 0.004 | 2.340 | 1.337–4.244 |
| 2 (ToiletField) | Open field | 8 (10.7%) | 67 (89.3%) | 0.027 | 0.328 | 0.122–0.881 | |||
| Water | |||||||||
| 0 | Piped | 327 (38.3%) | 526 (61.7%) | ||||||
| 1 (WaterWell) | Well | 12 (15.4%) | 66 (84.6%) | 0.000 | 0.292 | 0.156–0.549 | |||
| 2 (WaterNatural) | River/rain water | 4 (17.4%) | 19 (82.6%) | 0.051 | 0.339 | 0.114–1.004 | |||
| HPYLORI | | 343 (36%) | 611 (64%) | | | | | | |
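The crude odds ratios (COR) in the table can be approximately reproduced from the two count columns. The sketch below assumes a standard Wald interval on the log odds ratio; the source does not state which interval method was used, and the function name `odds_ratio_wald` is hypothetical.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald CI for a 2x2 table (assumed method).

    a, b: counts in the two outcome columns for the comparison group
    c, d: counts in the two outcome columns for the reference group
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); undefined if any cell count is zero
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Counts from the Residence rows: Rural 50 / 364, Urban (reference) 293 / 247
or_, lo, hi = odds_ratio_wald(50, 364, 293, 247)
print(f"COR = {or_:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```

For the Residence rows this gives roughly 0.116 (0.082 to 0.163), close to the reported 0.117 (0.083–0.164); the small difference may come from rounding or from how the published model was fitted. The formula breaks down when a cell is zero, which is consistent with the degenerate interval reported for the DogInside row.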
Fig. 2 Average H. pylori prevalence prediction accuracy and F1-scores of machine learning classifiers using various feature selection methods. Maroon and blue colors represent high and low values, respectively, of accuracy (A) and F1-score (B). The numbers within each cell indicate the accuracy/F1-score of each classifier-feature selection method pair. KNN indicates K-Nearest Neighbors; SVM, Support Vector Machines; XGB, XGBoost; LR, Logistic Regression; NB, Naive Bayes; and RF, Random Forests. FULL indicates all risk factors are used. IG indicates Information Gain; ReF, ReliefF; MRMR, Minimum Redundancy Maximum Relevance; CFS, Correlation-based Feature Selection; FCBF, Fast Correlation Based Filter; and SFFS, Sequential Floating Forward Selection. The numbers -10 and -20 indicate the number of risk factors selected for the ranking-based feature selection methods. (C) Receiver Operating Characteristic (ROC) curves of the six classifiers (each using its best hyperparameter combination) when predicting H. pylori infection from the subset of risk factors selected by the IG-20 feature selection method. The area under the ROC curve (AUROC) was 0.76 for KNN, 0.79 for NB, and 0.78 for the other classifiers. The X-axis represents the False Positive Rate (1-Specificity), whereas the Y-axis represents the True Positive Rate (Sensitivity)
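The three metrics named in the caption (accuracy, F1-score, and AUROC) can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not study data, and the helper names `accuracy_f1` and `auroc` are hypothetical. AUROC is computed here as the fraction of positive-negative pairs ranked correctly, which equals the area under the ROC curve.

```python
def accuracy_f1(y_true, y_pred):
    """Accuracy and F1-score for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values): four samples, threshold at 0.5
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc, f1 = accuracy_f1(y_true, y_pred)
print(acc, round(f1, 3), auroc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the caption reports it separately in panel C.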
Fig. 3 The relative importance of H. pylori risk factors based on all feature selection methods. X-axis indicates the H. pylori risk factors, summarized in Table 1. Y-axis indicates the average probability of being selected across all feature selection methods. The error bars indicate one standard error across all cross-validation folds
The frequency of H. pylori risk factors being chosen for all feature selection methods
| Feature | All* | Multivariate LR |
|---|---|---|
| Residence: rural | | X |
| Allergies | 43.16% (± 3.82%) | |
| Parasites | 44.42% (± 4.09%) | |
| Dewormed status: not dewormed | | X |
| Cow | 38.63% (± 3.58%) | X |
| Smoking: no smokers | | X |
| Cat: lives inside | 47.89% (± 3.86%) | |
| Dog: kept outside | 26% (± 2.78%) | |
| Electricity use: sometimes | | X |
| Electricity use: never | | X |
| Floor in home: wood | 44.32% (± 3.78%) | |
| Floor in home: mud | 27.37% (± 2.82%) | |
| Waste disposal: pit | 44.84% (± 3.88%) | |
| Waste disposal: open field | | X |
| Waste disposal: burn | 29.26% (± 2.98%) | |
| Age: 6–10 years | 32.63% (± 3.35%) | |
| Age: 11–15 years | 30.11% (± 3.54%) | |
| Family size: 4–5 | 35.16% (± 3.47%) | |
| Family size: > 5 | 46% (± 3.94%) | |
| Toilet: pit toilet | | X |
| Water source: well | 44.11% (± 3.38%) | |
| Water source: river or rain water | 38.63% (± 3.45%) | |
*Results from ranking-based, subset-based, and SFFS feature selection methods are combined. The features are indicated in the first column. The second column shows the average (± 1 standard error) frequency of being picked across all feature selection methods and cross-validation folds. The third column shows the features that the multivariate logistic regression approach determined to be significant. Bold, italic, and highlighted numbers show features that occur more frequently than 75 percent, 60–75 percent, and , respectively
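The "average (± 1 standard error) frequency" described in the footnote can be sketched as follows. This assumes each feature has one 0/1 selection flag per combination of feature selection method and cross-validation fold, and that the standard error is the sample standard deviation divided by the square root of the number of flags; the source does not spell out the exact computation. The flags are toy values and `selection_frequency` is a hypothetical helper.

```python
import math
import statistics

def selection_frequency(flags):
    """Mean selection frequency and its standard error.

    flags: list of 0/1 values, one per (method, fold) combination,
    where 1 means the feature was selected in that run.
    """
    n = len(flags)
    mean = sum(flags) / n
    # Assumed SE definition: sample standard deviation / sqrt(n)
    se = statistics.stdev(flags) / math.sqrt(n) if n > 1 else 0.0
    return mean, se

# Toy flags (hypothetical): feature picked in 7 of 10 runs
flags = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
mean, se = selection_frequency(flags)
print(f"{mean:.2%} (± {se:.2%})")
```

A feature selected in nearly every run has both a high mean and a small standard error, which is the pattern the frequency bands in the footnote are meant to highlight.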
Fig. 4 Two-dimensional hierarchical clustering heatmap of H. pylori risk factors and feature selection methods. Maroon and blue colors indicate more and less frequently selected features in five tenfold cross-validation runs, respectively. X-axis shows the H. pylori risk factors, summarized in Table 1. Y-axis indicates all feature selection methods. The risk factors found more frequently by feature selection methods appear on the heatmap's left columns. The feature selection methods that select the greatest number of risk factors appear on the heatmap's bottom rows. The risk factors grouped together suggest that they have been chosen similarly under varying feature selection methods. The feature selection methods grouped together indicate that these methods choose a similar set of risk factors