Patricia Martins Conde, Thomas Sauter, Thanh-Phuong Nguyen.
Abstract
Hereditary haemochromatosis (HH) is an autosomal recessive disease in which HFE C282Y homozygosity accounts for 80-85% of clinical cases in the Caucasian population. HH is characterised by the accumulation of iron, which, if untreated, can lead to liver cirrhosis and liver cancer. Because iron overload is preventable and treatable when diagnosed early, high-risk individuals can be identified through effective screening with artificial intelligence-based approaches. However, such tools raise new challenges in handling and integrating large heterogeneous datasets. We have developed an efficient computational model to screen individuals for HH using the family study data of the Hemochromatosis and Iron Overload Screening (HEIRS) cohort. This dataset, consisting of 254 cases and 701 controls, contains variables extracted from questionnaires and laboratory blood tests. The final model is an extreme gradient boosting classifier trained on the most relevant risk factors: HFE C282Y homozygosity, age, mean corpuscular volume, iron level, serum ferritin level, transferrin saturation, and unsaturated iron-binding capacity. Hyperparameter optimisation was carried out over multiple runs, yielding an area under the receiver operating characteristic curve (AUCROC) of 0.94 ± 0.02 in tenfold stratified cross-validation and outperforming the iron overload screening (IRON) tool.
Year: 2020 PMID: 33244054 PMCID: PMC7691515 DOI: 10.1038/s41598-020-77367-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow for the construction of an HH risk model based on machine learning. The workflow consists of three steps: (1) data preprocessing, (2) feature selection, and (3) model development and evaluation. In the first step, the family and family history datasets from the family study were merged. The family dataset contains data from different sources, i.e., demographics (age, gender, and ethnicity), blood markers, and personal medical history. The data were cleaned, and categorical variables with more than two classes were one-hot encoded. In the second step, feature selection was performed with six different methods (statistical and machine learning-based), and eight different sets of risk factors were manually selected. In the last step, each selected risk factor set was evaluated using different machine learning (ML) algorithms. First, the data were split into training and testing sets using tenfold stratified cross-validation (CV). The hyperparameters of each ML algorithm were tuned using GridSearch with tenfold stratified CV, optimizing for F1 score. After hyperparameter optimization, the optimal model was trained and evaluated on an unseen test set. This step was repeated 10 times. After the final performance estimate, the best model and the best feature set were selected, and hyperparameter optimization was run on the whole dataset using GridSearch and tenfold stratified CV.
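The evaluation loop in step (3) can be sketched as a nested cross-validation. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, a toy one-dimensional nearest-neighbour classifier stands in for the real algorithms, the grid `[1, 3, 5]` is hypothetical, and only one repeat of the outer loop is shown. All function names are invented for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out evenly
    return folds


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def knn_predict(train_x, train_y, test_x, n_neighbors):
    """Toy 1-D nearest-neighbour classifier (stand-in for the real models)."""
    preds = []
    for value in test_x:
        order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - value))
        votes = sum(train_y[i] for i in order[:n_neighbors])
        preds.append(1 if 2 * votes > n_neighbors else 0)
    return preds


def split(x, y, fold):
    """Return train and held-out portions of x and y for one fold."""
    held_out = set(fold)
    train = [i for i in range(len(y)) if i not in held_out]
    return ([x[i] for i in train], [y[i] for i in train],
            [x[i] for i in fold], [y[i] for i in fold])


def cv_f1(x, y, n_neighbors, k, seed):
    """Mean F1 of one hyperparameter value over k inner stratified folds."""
    scores = []
    for fold in stratified_folds(y, k, seed):
        tx, ty, vx, vy = split(x, y, fold)
        scores.append(f1_score(vy, knn_predict(tx, ty, vx, n_neighbors)))
    return sum(scores) / len(scores)


def nested_cv(x, y, grid, k=10):
    """Outer folds estimate performance; inner grid search picks the setting."""
    outer_scores = []
    for fold in stratified_folds(y, k, seed=0):
        tx, ty, vx, vy = split(x, y, fold)
        best = max(grid, key=lambda n: cv_f1(tx, ty, n, k, seed=1))
        outer_scores.append(f1_score(vy, knn_predict(tx, ty, vx, best)))
    return outer_scores


# Synthetic data: 70 controls and 30 cases with one informative feature.
rng = random.Random(42)
y = [0] * 70 + [1] * 30
x = [rng.gauss(60, 8) if label else rng.gauss(30, 8) for label in y]
scores = nested_cv(x, y, grid=[1, 3, 5])
print(f"F1 = {sum(scores) / len(scores):.3f} over {len(scores)} outer folds")
```

The point of the nesting is that the hyperparameter is chosen using the training portion only, so each outer test fold stays unseen until the final evaluation.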
Characteristics of the family study participants (n = 955).
| | Controls (%) | Cases (%) |
|---|---|---|
| Male | 30.58 | 12.57 |
| Female | 42.83 | 14.03 |
| < 20 | 1.99 | 0.21 |
| 20–29 | 10.89 | 1.88 |
| 30–39 | 14.97 | 2.72 |
| 40–49 | 17.17 | 7.02 |
| 50–59 | 12.15 | 6.07 |
| ≥ 60 | 16.23 | 8.69 |
| Caucasian | 56.75 | 19.27 |
| Asian/Pacific Islander | 6.81 | 4.19 |
| Hispanic | 6.28 | 1.68 |
| African American | 2.2 | 0.84 |
| American Indian, Multiple, Unknown | 1.36 | 0.63 |
| Healthy | 20.94 | 4.92 |
| C282Y/C282Y | 6.7 | 15.92 |
The healthy genotype represents individuals with no C282Y or H63D mutation in HFE gene.
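A quick arithmetic check suggests that the percentages in this table are fractions of the whole cohort (n = 955) rather than of each group. The sketch below uses only the counts from the abstract (254 cases, 701 controls) and the two sex rows of the table; the variable names are illustrative.

```python
n_total = 955
n_cases, n_controls = 254, 701

# Sex rows of the table as (controls %, cases %).
male = (30.58, 12.57)
female = (42.83, 14.03)

controls_pct = male[0] + female[0]  # sum of the controls column
cases_pct = male[1] + female[1]     # sum of the cases column

# Expected shares if every percentage is relative to all 955 participants.
expected_controls = 100 * n_controls / n_total
expected_cases = 100 * n_cases / n_total

print(f"controls: table {controls_pct:.2f}% vs counts {expected_controls:.2f}%")
print(f"cases:    table {cases_pct:.2f}% vs counts {expected_cases:.2f}%")
```

Both column sums agree with the counts to within rounding, so each block of rows (sex, age group, ethnicity) adds up to 100% across the two columns together.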
List of top risk factors obtained after feature ranking. Multiple methods were used to extract the most relevant variables.
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| uibc | uibc | uibc | ts | uibc | uibc |
| ts | ts | ts | uibc | ts | sf |
| fer | C282Y/C282Y | sf | C282Y/C282Y | C282Y/C282Y | C282Y/C282Y |
| sf | sf | C282Y/C282Y | sf | sf | ts |
| C282Y/C282Y | gender | fer | fer | gender | fer |
| tibc | ast | alt | C282Y/+ | ast | tibc |
| mch | fer | age | plt | fer | mch |
| mcv | age | plt | mcv | Caucasian | mcv |
| C282Y/+ | alt | tibc | tibc | age | ast |
| rdw | mcv | | age | aneut | C282Y/+ |
| age | Asian/PacificIslander | | mch | mcv | rdw |
| hgb | NumRel_hemo | | wmono | alt | alt |
| hct | rbc | | Rel_hemo | rbc | age |
| rhMen | | | mhArth | hgb | ggt |
| ggt | | | alt | Perso_arthrit | plt |
(A) Wilcoxon signed-rank test followed by Bonferroni correction. Only features with an adjusted p-value ≤ 0.05 are shown. (B) Extreme gradient boosting and (C) random forests employing both tenfold stratified CV and RFE, and optimized for F1 score. In columns D to F, only the top 15 features are shown, and these were obtained by employing (D) mutual information, (E) extreme gradient boosting, and (F) random forests. All risk factors are sorted by decreasing order of significance.
Abbreviations: alt: alanine aminotransferase serum activity; aneut: absolute number of neutrophils; Asian/PacificIslander: Asian or Pacific Islander ethnicity; ast: aspartate aminotransferase serum activity; C282Y/+: HFE C282Y heterozygosity; C282Y/C282Y: HFE C282Y homozygosity; Caucasian: Caucasian ethnicity; fer: serum iron concentration; ggt: gamma glutamyl transferase serum activity; hct: haematocrit; hgb: haemoglobin concentration; mch: mean corpuscular haemoglobin/RBC; mcv: mean corpuscular volume; mhArth: positive medical history of arthritis; NumRel_hemo: number of relatives affected by haemochromatosis; Perso_arthrit: personal history of arthritis; plt: platelet count; rbc: red blood cell count; rdw: red blood cell distribution width; Rel_hemo: positive family history of haemochromatosis; rhMen: at menopause; sf: serum ferritin concentration; tibc: total iron binding capacity; ts: transferrin saturation; uibc: unsaturated iron binding capacity; wmono: % monocytes in whole blood cell count.
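The combined feature sets that appear in the performance table (A&B, A&C, B&C, A&B&C) are consistent with plain set intersections of columns A, B and C. The sketch below assumes that reading, and assumes that columns A, B and C hold 15, 13 and 9 features, matching the feature counts reported in the performance table.

```python
# Feature lists transcribed from columns A, B and C of the ranking table.
A = {"uibc", "ts", "fer", "sf", "C282Y/C282Y", "tibc", "mch", "mcv",
     "C282Y/+", "rdw", "age", "hgb", "hct", "rhMen", "ggt"}
B = {"uibc", "ts", "C282Y/C282Y", "sf", "gender", "ast", "fer", "age",
     "alt", "mcv", "Asian/PacificIslander", "NumRel_hemo", "rbc"}
C = {"uibc", "ts", "sf", "C282Y/C282Y", "fer", "alt", "age", "plt", "tibc"}

# Assumption: "&" in the set names denotes intersection.
combined = {
    "A&B": A & B,
    "A&C": A & C,
    "B&C": B & C,
    "A&B&C": A & B & C,
}

for name, features in combined.items():
    print(f"{name}: {len(features)} features -> {sorted(features)}")
```

Under this assumption the intersection sizes come out as 7, 7, 7 and 6, and A&B contains exactly the seven risk factors named in the abstract.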
Figure 2. Distribution of (a) serum ferritin concentration, (b) transferrin saturation and (c) unsaturated iron binding capacity among the different HFE genotypes. The red and blue dashed lines mark the reference limits for female and male individuals, respectively. For females, the reference limits are 200 ng/mL for serum ferritin and 45% for transferrin saturation; for males, they are 300 ng/mL and 50%. Because the serum ferritin range was very wide, concentrations above 1650 ng/mL are not shown. Abbreviations: +/+: individuals with no C282Y or H63D mutation in HFE gene; C282Y/+: HFE C282Y heterozygosity; C282Y/C282Y: HFE C282Y homozygosity; H63D/+: HFE H63D heterozygosity; H63D/H63D: HFE H63D homozygosity; C282Y/H63D: HFE compound heterozygosity. Number of individuals in each category: female +/+ (n = 128); male +/+ (n = 119); female C282Y/+ (n = 180); male C282Y/+ (n = 146); female C282Y/C282Y (n = 141); male C282Y/C282Y (n = 75); female H63D/+ (n = 43); male H63D/+ (n = 36); female H63D/H63D (n = 9); male H63D/H63D (n = 8); female C282Y/H63D (n = 41); male C282Y/H63D (n = 27). The plotted data are the raw, non-imputed values; the figures therefore represent the 953 individuals for whom a genotype was available.
Figure 3. Spearman correlation plot of HH-associated variables from the family dataset. Variables with more than 10% missing values were removed. Only risk factors that fulfil both of the following criteria are shown: (1) an absolute correlation with the target variable (Cases) of at least 0.4 and (2) a Bonferroni-corrected p-value ≤ 0.05. Abbreviations: C282Y/C282Y: HFE C282Y homozygosity; fer: serum iron concentration; sf: serum ferritin concentration; ts: transferrin saturation; uibc: unsaturated iron binding capacity.
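The first selection criterion can be illustrated with a small, dependency-free Spearman computation. This is only a sketch on made-up values: the variables `ts` and `noise` and their numbers are hypothetical, and the p-value criterion is left out because it needs a statistics library.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: binary target and two candidate variables.
cases = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "ts": [20, 25, 30, 35, 50, 55, 60, 65],
    "noise": [5, 1, 7, 3, 6, 2, 8, 4],
}

# Keep variables whose absolute correlation with the target is at least 0.4.
selected = [name for name, values in candidates.items()
            if abs(spearman(values, cases)) >= 0.4]
print(selected)
```

Because the target is binary, its ranks are heavily tied, which is why the helper assigns average ranks rather than arbitrary tie-breaking.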
Hereditary haemochromatosis risk score model’s performance on the test set.
| Set | Number of features | Model | Accuracy ± sd | F1 Score ± sd | Sensitivity ± sd | Specificity ± sd |
|---|---|---|---|---|---|---|
| A | 15 | XGB | 0.8797 ± 0.0383 | 0.7772 ± 0.0654 | 0.784 ± 0.0815 | 0.9145 ± 0.0496 |
| | | RF | 0.8692 ± 0.0405 | 0.7489 ± 0.0697 | 0.7285 ± 0.0741 | 0.9202 ± 0.0465 |
| | | LR | 0.8534 ± 0.0548 | 0.7465 ± 0.0749 | 0.7912 ± 0.0644 | 0.8759 ± 0.0764 |
| B | 13 | XGB | 0.8995 ± 0.0376 | 0.8095 ± 0.0691 | 0.8032 ± 0.0985 | 0.9344 ± 0.0467 |
| | | MLP | 0.8817 ± 0.0387 | 0.7766 ± 0.0645 | 0.7638 ± 0.0661 | 0.9244 ± 0.0512 |
| | | RF | 0.8838 ± 0.042 | 0.7762 ± 0.0861 | 0.7638 ± 0.1133 | 0.9273 ± 0.0453 |
| C | 9 | XGB | 0.8974 ± 0.0404 | 0.8092 ± 0.0756 | 0.8234 ± 0.1136 | 0.9244 ± 0.0456 |
| | | RF | 0.889 ± 0.0495 | 0.7863 ± 0.0923 | 0.7683 ± 0.1089 | 0.9329 ± 0.0457 |
| | | KNN | 0.8796 ± 0.0342 | 0.7742 ± 0.0645 | 0.7825 ± 0.1088 | 0.9145 ± 0.046 |
| A&B | 7 | XGB | 0.8943 ± 0.0389 | 0.8041 ± 0.0704 | 0.8154 ± 0.0989 | 0.923 ± 0.0456 |
| | | MLP | 0.8764 ± 0.0557 | 0.7776 ± 0.0936 | 0.8031 ± 0.0973 | 0.903 ± 0.0577 |
| | | RF | 0.8786 ± 0.0426 | 0.7746 ± 0.0788 | 0.7838 ± 0.0944 | 0.913 ± 0.0443 |
| A&C | 7 | RF | 0.8901 ± 0.0283 | 0.798 ± 0.0429 | 0.8111 ± 0.068 | 0.9187 ± 0.0466 |
| | | XGB | 0.8849 ± 0.0462 | 0.786 ± 0.0841 | 0.7954 ± 0.1071 | 0.9173 ± 0.0493 |
| | | MLP | 0.8754 ± 0.0442 | 0.7769 ± 0.0771 | 0.8112 ± 0.094 | 0.8988 ± 0.0499 |
| B&C | 7 | XGB | 0.8974 ± 0.0373 | 0.8067 ± 0.0675 | 0.804 ± 0.1021 | 0.9316 ± 0.0511 |
| | | KNN | 0.8838 ± 0.039 | 0.7921 ± 0.0593 | 0.8185 ± 0.0598 | 0.9073 ± 0.057 |
| | | MLP | 0.8796 ± 0.0491 | 0.7803 ± 0.0858 | 0.7992 ± 0.1039 | 0.9087 ± 0.0575 |
| A&B&C | 6 | XGB | 0.8891 ± 0.0397 | 0.7939 ± 0.0738 | 0.8034 ± 0.0937 | 0.9202 ± 0.0426 |
| | | RF | 0.8807 ± 0.0417 | 0.7793 ± 0.0734 | 0.7877 ± 0.0871 | 0.9145 ± 0.0507 |
| | | MLP | 0.8764 ± 0.0493 | 0.778 ± 0.0862 | 0.8112 ± 0.1131 | 0.9002 ± 0.0586 |
| ALL | 122 | XGB | 0.8848 ± 0.0431 | 0.7818 ± 0.0809 | 0.7798 ± 0.1157 | 0.923 ± 0.0522 |
| | | RF | 0.8649 ± 0.0486 | 0.7265 ± 0.0954 | 0.6729 ± 0.0941 | 0.9344 ± 0.0475 |
| | | LR | 0.8408 ± 0.0557 | 0.7211 ± 0.0711 | 0.76 ± 0.0807 | 0.8701 ± 0.0784 |
Only the top three algorithms are shown for each feature set.
sd standard deviation.
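For reference, the four metrics in this table follow directly from the confusion counts of each test fold, and the "± sd" columns summarise their spread across folds. The sketch below uses made-up confusion counts and the sample standard deviation; whether the source used the sample or population formula is not stated, so that choice is an assumption.

```python
from statistics import mean, stdev


def fold_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from one confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical confusion counts (tp, fp, fn, tn) for three test folds.
fold_counts = [(20, 6, 5, 64), (21, 5, 4, 65), (19, 7, 6, 63)]
per_fold = [fold_metrics(*counts) for counts in fold_counts]

# Mean and sample standard deviation of each metric across folds.
summary = {
    name: (mean([m[name] for m in per_fold]), stdev([m[name] for m in per_fold]))
    for name in per_fold[0]
}
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.4f} ± {sd:.4f}")
```

With roughly one case for every three controls, accuracy is dominated by specificity, which is one reason to report F1 and sensitivity alongside it.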
Figure 4. Performance curves for the best set of risk factors. (a) Receiver Operating Characteristic (ROC) curve. The ROC curve was interpolated with the function interp from the scipy package using the true and the false positive rate values obtained in each CV run. (b) Precision-Recall curve. The PR curve was interpolated with the function interp from the scipy package using the recall and precision values obtained in each CV run.
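Interpolation is needed here because each CV run produces its ROC points at different false positive rates, so the curves cannot be averaged point by point. The sketch below resamples two made-up fold curves onto a shared grid and averages them. The hand-written `interp` helper stands in for the library function named in the caption, and the fold curves and grid are hypothetical.

```python
def interp(x, xp, fp):
    """Linear interpolation of the points (xp, fp) at x; xp must be increasing."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    for i in range(1, len(xp)):
        if x <= xp[i]:
            weight = (x - xp[i - 1]) / (xp[i] - xp[i - 1])
            return fp[i - 1] + weight * (fp[i] - fp[i - 1])


# Hypothetical ROC points (false positive rate, true positive rate) per CV run.
fold_curves = [
    ([0.0, 0.2, 1.0], [0.0, 0.8, 1.0]),
    ([0.0, 0.1, 0.5, 1.0], [0.0, 0.6, 0.9, 1.0]),
]

# Shared false positive rate grid on which all curves are resampled.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
resampled = [[interp(x, fpr, tpr) for x in grid] for fpr, tpr in fold_curves]

# Point-wise mean curve and its area by the trapezoidal rule.
mean_tpr = [sum(column) / len(column) for column in zip(*resampled)]
mean_auc = sum((grid[i + 1] - grid[i]) * (mean_tpr[i] + mean_tpr[i + 1]) / 2
               for i in range(len(grid) - 1))
print(mean_tpr, round(mean_auc, 4))
```

A finer grid (for example 100 evenly spaced points) gives a smoother mean curve; the five-point grid is used only to keep the numbers easy to follow. The same resampling applies to the precision-recall curve, with recall on the shared axis.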
Performance of the IRON score and risk factors (age, gender, medical history of liver condition, osteoporosis and thyroid disease) on the family study from the HEIRS cohort.
| | Number of features | Model/criterion | F1 Score ± sd | AUCROC ± sd | AUPRC ± sd |
|---|---|---|---|---|---|
| IRON score risk factors | 5 | RF | 0.2016 ± 0.0606 | 0.6041 ± 0.0588 | 0.356 ± 0.0576 |
| | | KNN | 0.1937 ± 0.0701 | 0.5637 ± 0.0565 | 0.313 ± 0.0512 |
| | | SVC | 0.1931 ± 0.0935 | 0.4903 ± 0.0901 | 0.2912 ± 0.0544 |
| IRON Score | 5 | > 0 | 0.4540 | 0.5744 | 0.2990 |
| | | > 2 | 0.4257 | 0.5594 | 0.2924 |
| | | > 3 | 0.4031 | 0.5475 | 0.2869 |
| | | > 5 | 0.3767 | 0.5309 | 0.2792 |
| | | > 6 | 0.3202 | 0.5375 | 0.2837 |
Only the top three algorithms are shown for the IRON score feature set.
AUCROC area under the ROC curve, AUPRC area under the precision-recall curve, sd standard deviation.
Statistical comparison of the AUCROC values between the best HH risk score model (XGB and feature set B) and all the other tested models.
| Model | Feature set | Adjusted p-value |
|---|---|---|
| DT | IRON score | 0.0099 |
| KNN | IRON score | 0.0099 |
| LR | IRON score | 0.0099 |
| MLP | IRON score | 0.0099 |
| RF | IRON score | 0.0099 |
| SVC | IRON score | 0.0099 |
| XGB | IRON score | 0.0099 |
| DT | ALL | 0.0099 |
| KNN | ALL | 0.0240 |
| MLP | ALL | 0.0240 |
This table lists the tested models that yielded significantly lower AUCROC values than the best HH risk score model. Model performance was compared using a two-tailed Wilcoxon signed-rank test followed by Bonferroni correction. Only models with an adjusted p-value ≤ 0.05 are shown.
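The comparison procedure can be sketched as follows: an exact two-tailed Wilcoxon signed-rank test on paired per-fold AUCROC values, then a Bonferroni adjustment that multiplies each raw p-value by the number of comparisons. The AUCROC values, the other raw p-values and the number of comparisons below are all hypothetical, and the source does not say whether an exact or approximate p-value was used.

```python
from itertools import product


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_exact(a, b):
    """Exact two-tailed signed-rank p-value by enumerating all sign patterns."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** len(diffs)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-fold AUCROC values for two models on the same ten folds.
best = [0.94, 0.95, 0.93, 0.96, 0.92, 0.94, 0.95, 0.93, 0.97, 0.91]
other = [0.60, 0.58, 0.62, 0.55, 0.61, 0.57, 0.63, 0.50, 0.59, 0.56]

p_raw = wilcoxon_exact(best, other)
adjusted = bonferroni([p_raw, 0.02, 0.5])  # two further made-up raw p-values
significant = [p for p in adjusted if p <= 0.05]
print(p_raw, adjusted, significant)
```

With ten paired folds, the smallest attainable exact two-tailed p-value is 2/1024, reached when one model wins on every fold; this floor is one reason several rows of a table like this can share the same adjusted value.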