Kah Yee Tai1, Jasbir Dhaliwal2, KokSheik Wong1.
Abstract
BACKGROUND: Malaria risk prediction is currently limited to advanced statistical methods, such as time series and cluster analysis, applied to epidemiological data. Machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study has analysed the contribution of single nucleotide polymorphisms (SNPs) to malaria using a machine learning model. This study therefore aims to quantify an individual's susceptibility to developing malaria using risk scores derived from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS).
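As a rough illustration of how a weighted genetic risk score accumulates SNP effects, the sketch below uses the common additive definition: the sum over SNPs of a per-SNP weight multiplied by the number of risk alleles carried (0, 1, or 2). The abstract does not state the exact formula or the weights used in the study, so the function name `weighted_grs`, the SNP identifiers, and every numeric value here are hypothetical.

```python
from typing import Mapping


def weighted_grs(allele_counts: Mapping[str, int],
                 weights: Mapping[str, float]) -> float:
    """Additive wGRS: sum of (weight * risk-allele count) over SNPs.

    Assumption: each count is the number of risk alleles (0, 1, or 2)
    and each weight is a per-SNP effect size; neither is taken from
    the study itself.
    """
    total = 0.0
    for snp, count in allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"invalid risk-allele count for {snp}: {count}")
        total += weights[snp] * count
    return total


# Hypothetical individual and hypothetical weights, for illustration only.
genotype = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
weights = {"snp_a": 0.25, "snp_b": 0.5, "snp_c": -0.125}

score = weighted_grs(genotype, weights)
print(score)  # 0.25*2 + 0.5*0 + (-0.125)*1 = 0.375
```

A negative weight, as for `snp_c` above, would represent a protective variant under this assumed convention.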
Keywords: Feature extraction algorithm; Genetic risk factors; Machine learning; Malaria; Single nucleotide polymorphisms; Weighted genetic risk score
Year: 2022 PMID: 35934714 PMCID: PMC9358850 DOI: 10.1186/s12859-022-04870-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Machine learning pipeline for individual malaria risk score prediction
Fig. 2 Methodology flow chart
Analysed populations and samples
| Population | Case | Control | Sample size |
|---|---|---|---|
| Burkina Faso | 807 | 639 | 1446 |
| Cameroon | 693 | 778 | 1471 |
| Gambia | 2807 | 2786 | 5593 |
| Ghana | 422 | 342 | 764 |
| Kenya | 1944 | 1738 | 3682 |
| Malawi | 1590 | 1498 | 3088 |
| Mali | 475 | 394 | 869 |
| Nigeria | 288 | 131 | 419 |
| Tanzania | 485 | 494 | 979 |
| Vietnam | 860 | 868 | 1728 |
| Papua New Guinea | 420 | 395 | 815 |
| Total | | | 20,854 |
Sample size indicates the total number of individuals for each population
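The arithmetic in the table above can be checked with a short script. The sketch below copies the case, control, and sample-size figures from the table, confirms that cases plus controls equals the sample size in every row, and recomputes the grand total. The variable names are illustrative only.

```python
# (cases, controls, sample size) per population, copied from the table.
populations = {
    "Burkina Faso": (807, 639, 1446),
    "Cameroon": (693, 778, 1471),
    "Gambia": (2807, 2786, 5593),
    "Ghana": (422, 342, 764),
    "Kenya": (1944, 1738, 3682),
    "Malawi": (1590, 1498, 3088),
    "Mali": (475, 394, 869),
    "Nigeria": (288, 131, 419),
    "Tanzania": (485, 494, 979),
    "Vietnam": (860, 868, 1728),
    "Papua New Guinea": (420, 395, 815),
}

# Each row should satisfy cases + controls == sample size.
inconsistent = [name for name, (cases, controls, size) in populations.items()
                if cases + controls != size]

total_cases = sum(cases for cases, _, _ in populations.values())
total_controls = sum(controls for _, controls, _ in populations.values())
total_samples = sum(size for _, _, size in populations.values())

print(inconsistent)    # []
print(total_samples)   # 20854
```

The recomputed total matches the 20,854 individuals reported in the table.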
Fig. 3 Pseudocode of the proposed feature extraction algorithm
Fig. 4 Overview of genotype-pattern-frequency-based features
Fig. 5 High-level pseudocode of the feature extraction and selection stage
Fig. 6 Feature importance ranking of all 104 SNPs, computed using the proposed feature extraction algorithm with LR-RFE
Fig. 7 Feature importance ranking of all 104 SNPs, computed using the benchmark feature extraction algorithm with LR-RFE
Fig. 8 Comparison of feature importance scores using different feature extraction algorithms with LR-RFE: (1) proposed algorithm and (2) benchmark algorithm
Fig. 9 Performance analysis of the wGRS-based and wGRS + GF-based models with respect to MAE scores and feature sets
Comparison of prediction performance between different feature combinations determined using the feature extraction algorithms with LR-RFE
| Feature set | Proposed feature extraction algorithm: LightGBM | Proposed: XGBoost | Proposed: Ridge regression | Baseline feature extraction algorithm: LightGBM | Baseline: XGBoost | Baseline: Ridge regression |
|---|---|---|---|---|---|---|
| wGRS | | | | | | |
| 10 | 0.2304 | 0.2313 | 0.2313 | 0.2035 | 0.2037 | 0.2027 |
| 20 | 0.4862 | 0.4883 | 0.4868 | 0.2678 | 0.2675 | 0.2706 |
| 30 | 0.6135 | 0.6238 | 0.6141 | 0.3471 | 0.3514 | 0.3622 |
| 40 | 0.6114 | 0.6204 | 0.6209 | 0.4683 | 0.4726 | 0.5050 |
| 50 | 0.6263 | 0.6464 | 0.6362 | 0.8010 | 0.8124 | 0.8274 |
| 60 | 0.6856 | 0.7009 | 0.7078 | 0.8145 | 0.8330 | 0.8491 |
| 70 | 0.7431 | 0.7522 | 0.7731 | 0.8129 | 0.8321 | 0.8260 |
| 80 | 0.9077 | 0.9546 | 0.9219 | 0.9047 | 0.9509 | 0.9394 |
| 90 | 0.9586 | 1.0018 | 0.9744 | 0.9302 | 0.9704 | 0.9538 |
| 100 | 1.0934 | 1.1240 | 1.1236 | 1.0318 | 1.1065 | 1.0641 |
| wGRS + GF | | | | | | |
| 10 | 0.0748 | 0.0749 | 0.0831 | 0.0584 | 0.0597 | 0.0600 |
| 20 | 0.2220 | 0.2231 | 0.2274 | 0.0747 | 0.0751 | 0.0760 |
| 30 | 0.2625 | 0.2624 | 0.2641 | 0.1265 | 0.1285 | 0.1352 |
| 40 | 0.2713 | 0.2736 | 0.2751 | 0.1743 | 0.1800 | 0.1931 |
| 50 | 0.2605 | 0.2644 | 0.2625 | 0.5646 | 0.5656 | 0.5671 |
| 60 | 0.2929 | 0.2945 | 0.2944 | 0.5660 | 0.5678 | 0.5721 |
| 70 | 0.2940 | 0.2981 | 0.3034 | 0.5354 | 0.5345 | 0.5359 |
| 80 | 0.5323 | 0.5421 | 0.5317 | 0.5602 | 0.5610 | 0.5670 |
| 90 | 0.5501 | 0.5646 | 0.5509 | 0.5437 | 0.5463 | 0.5488 |
| 100 | 0.6249 | 0.6333 | 0.6323 | 0.5929 | 0.6041 | 0.6003 |
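Fig. 9 and the table notes in this record refer to MAE scores. Assuming MAE here means the standard mean absolute error between predicted and reference risk scores, a minimal sketch of the metric follows. The helper name `mean_absolute_error` and the sample values are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float],
                        y_pred: Sequence[float]) -> float:
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical reference and predicted risk scores, for illustration only.
reference = [1.0, 2.0, 4.0, 8.0]
predicted = [1.5, 2.0, 3.0, 8.5]

mae = mean_absolute_error(reference, predicted)
print(mae)  # (0.5 + 0.0 + 1.0 + 0.5) / 4 = 0.5
```

Under this definition a lower value indicates predictions closer to the reference scores.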
P-values obtained with best parameters
| LightGBM | XGBoost | Ridge regression |
|---|---|---|
| 2.52E-24 | 8.56E-24 | 2.13E-24 |
Compares the p-values of MAE scores using the aforementioned risk scores with best parameters
P-values obtained with default parameters
| LightGBM | XGBoost | Ridge regression |
|---|---|---|
| 6.41E-24 | 7.76E-23 | 2.59E-23 |
Compares the p-values of MAE scores using the aforementioned risk scores with default parameters