Ariela Mota Ferreira1, Laércio Ives Santos2, Ester Cerdeira Sabino3, Antonio Luiz Pinho Ribeiro4, Léa Campos de Oliveira-da Silva3, Renata Fiúza Damasceno1, Marcos Flávio Silveira Vasconcelos D'Angelo5, Maria do Carmo Pereira Nunes4, Desirée Sant Ana Haikal1.
Abstract
Chagas disease (CD) is recognized by the World Health Organization as one of the thirteen most neglected tropical diseases. More than 80% of people affected by CD will not have access to diagnosis and continued treatment, which partly explains its high morbidity and mortality. Machine Learning (ML) can identify patterns in data that can be used to increase our understanding of a specific problem or to make predictions about the future. The aim of this study was therefore to evaluate different ML models for predicting death within two years in patients with CD. ML models were developed using different techniques and configurations. The techniques used were Random Forests, Adaptive Boosting, Decision Tree, Support Vector Machine, and Artificial Neural Networks. The configurations considered only interview variables, only complementary exam variables, and both combined. Data from SaMi-Trop, a cohort study of CD patients, were analyzed. The predictor variables came from the baseline, and the outcome, death, came from the first follow-up. All models were evaluated in terms of Sensitivity, Specificity and G-mean. Among the 1694 individuals with CD considered, 134 (7.9%) died within two years of follow-up. Using only the predictor variables from the interview, the different techniques achieved a maximum G-mean of 0.64 in predicting death. Using only the variables from complementary exams, the G-mean was up to 0.77. In this configuration the dominant role of NT-proBNP was evident: an ML model using only this single variable reached a G-mean of 0.76. The configuration that mixed interview variables and complementary exams achieved a G-mean of 0.75. ML can be a useful tool for the management of patients with CD by identifying those with the highest probability of death. Trial Registration: This trial is registered with ClinicalTrials.gov, Trial ID: NCT02646943.
Year: 2022 PMID: 35421085 PMCID: PMC9041770 DOI: 10.1371/journal.pntd.0010356
Source DB: PubMed Journal: PLoS Negl Trop Dis ISSN: 1935-2727
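The abstract reports Sensitivity, Specificity and G-mean rather than plain accuracy. A minimal sketch of why that choice matters for this cohort, assuming the usual definition of G-mean as the geometric mean of sensitivity and specificity (the abstract does not spell the formula out). The "always predict survival" baseline is a hypothetical illustration, not one of the study's models; only the class counts (134 deaths among 1694 patients) come from the abstract.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 = death, 0 = survival."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp

def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and G-mean."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed definition: geometric mean of sensitivity and specificity.
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Class balance from the abstract: 134 deaths among 1694 patients.
y_true = [1] * 134 + [0] * (1694 - 134)

# Hypothetical baseline (not a model from the study): always predict survival.
always_survive = [0] * len(y_true)
baseline = evaluate(y_true, always_survive)
print(baseline)
```

Under these assumptions the baseline scores about 92% accuracy while its G-mean is zero, because it never identifies a death. This is the kind of degenerate behaviour on imbalanced outcomes that G-mean penalizes.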
Fig 1. Flowchart of the process of selecting predictor variables based on cross-validation analysis for predicting death among patients with CD within two years of follow-up.
Description of Machine Learning (ML) techniques adopted in the study.
| ML Technique | Description | Adjusted hyperparameters |
|---|---|---|
| Random Forests | Uses an aggregate of decision trees from randomly selected instances and predictors. Each tree predicts a problem class and the model prediction is determined by majority vote. | Number of trees, number of sampled predictors, fraction of sampled data. |
| Adaptive Boosting | Uses an aggregate of decision trees from randomly selected instances and predictors in the first tree. From the second tree onwards, the selection of instances is made considering a probability proportional to the prediction error of the previous trees. | Number of trees, Learning Rate. |
| Decision tree | Recursively defined structure composed of decision nodes and leaf nodes. A decision node contains a test on some predictor and for each result of that test there is a link to a subtree. A leaf node corresponds to one of the problem classes. | Maximum number of subdivisions. |
| Support Vector Machine | Searches for a hyperplane that maximizes the distance between instances of two different classes. When the problem dealt with has only two predictors, this hyperplane is represented by a line, and with | Kernel function. |
| Artificial Neural Networks | Simulates the way a human brain learns through artificial neurons. An artificial neuron takes input information from an external source and combines such inputs with non-linear operations producing a result based on the assimilated knowledge. | Number of neurons in the hidden layer, number of epochs and learning rate. |
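
The boosting row of the table says that, after the first tree, instances are selected with a probability tied to the errors of earlier trees. A minimal sketch of that idea, assuming the classic discrete AdaBoost weight update; the study's actual implementation and settings are not given here, and the function name `adaboost_reweight` and the toy data are made up for illustration.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the classic (discrete) AdaBoost weight update.

    weights: current instance weights, summing to 1.
    misclassified: booleans, True where the current weak learner was wrong.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase the weight of misclassified instances, decrease the rest.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Toy data (made up): 10 equally weighted instances, the first tree gets 2 wrong.
weights = [0.1] * 10
misclassified = [True, True] + [False] * 8
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 4) for w in new_weights])
```

After the update, the two misclassified instances together carry half of the total weight, so a sampling-based variant that draws instances by these weights would focus the next tree on them.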
Fig 2. Flowchart of patients included and excluded from the cohort and of eligible patients in the study and in the models obtained.
SaMi-Trop Project. Minas Gerais.
Descriptive and bivariate analysis of categorical predictor variables selected among those with the greatest predictive power, and their association with death in patients with Chagas disease (CD).
Minas Gerais, Brazil (n = 1,694).
| Variable | n (%)* | Death: No, n (%) | Death: Yes, n (%) | p-value π |
|---|---|---|---|---|
| Gender | | | | |
| Male | 562 (33.2) | 503 (89.5) | 59 (10.5) | |
| Female | 1132 (66.8) | 1057 (93.4) | 75 (6.6) | |
| Literate | | | | |
| No | 750 (44.5) | 671 (89.5) | 79 (10.5) | |
| Yes | 937 (55.5) | 884 (94.3) | 53 (5.7) | |
| Age | | | | |
| Up to 60 years | 935 (55.2) | 886 (94.8) | 49 (5.2) | |
| Above 61 years | 759 (44.8) | 674 (88.8) | 85 (11.2) | |
| Self-declared color | | | | |
| White | 361 (21.4) | 332 (92) | 29 (8.0) | 0.836 |
| Non-white | 1324 (78.6) | 1222 (92.3) | 102 (7.7) | |
| Per capita income | | | | |
| Greater than R$ 356.33 | 665 (39.7) | 602 (90.5) | 63 (9.5) | |
| Less than R$ 356.32 | 1011 (60.3) | 944 (93.4) | 67 (6.6) | |
| Climb stairs | | | | |
| No | 617 (36.7) | 538 (87.2) | 79 (12.8) | |
| Yes | 1065 (63.3) | 1010 (94.8) | 55 (5.2) | |
| Self-reported ECG irregularity | | | | |
| No | 651 (39.2) | 602 (92.5) | 49 (7.5) | 0.559 |
| Yes | 1009 (60.8) | 925 (91.7) | 84 (8.3) | |
| Racing heart | | | | |
| No | 603 (36.3) | 602 (92.5) | 49 (7.5) | 0.559 |
| Yes | 1057 (63.7) | 925 (91.7) | 84 (8.3) | |
| Arterial hypertension | | | | |
| No | 608 (35.9) | 573 (94.2) | 35 (5.8) | |
| Yes | 1086 (64.1) | 987 (90.9) | 99 (9.1) | |
| Self-reported permanent pacemaker | | | | |
| No | 1559 (93.9) | 1446 (92.8) | 113 (7.2) | |
| Yes | 101 (6.1) | 81 (80.2) | 20 (19.8) | |
| Heart rate | | | | |
| Normal | 1112 (67.4) | 1024 (92.1) | 88 (7.9) | 0.141 |
| Below normal (up to 59 bpm) | 509 (30.8) | 474 (93.1) | 35 (6.9) | |
| Above normal (above 101 bpm) | 30 (1.8) | 25 (83.3) | 5 (16.7) | |
| Corrected QT interval | | | | |
| Normal (up to 440 ms) | 816 (49.4) | 778 (95.3) | 38 (4.7) | |
| Altered (above 441 ms) | 835 (50.6) | 745 (89.2) | 90 (10.8) | |
| QRS complex duration | | | | |
| Normal (up to 120 ms) | 959 (58.1) | 911 (95) | 48 (5) | |
| Altered (above 121 ms) | 692 (41.9) | 612 (88.4) | 80 (11.6) | |
| Isolated right bundle branch block plus left anterior fascicular block | | | | |
| Negative | 1460 (88.4) | 1352 (92.6) | 108 (7.4) | 0.135 |
| Positive | 191 (11.6) | 171 (89.5) | 20 (10.5) | |
| Isolated right bundle branch block | | | | |
| Negative | 1315 (79.6) | 1209 (91.9) | 106 (8.1) | 0.355 |
| Positive | 336 (20.4) | 314 (93.5) | 22 (6.5) | |
| Pacemaker | | | | |
| Absent | 1592 (96.4) | 1482 (93.1) | 110 (6.9) | |
| Present | 59 (3.6) | 41 (69.5) | 18 (30.5) | |
| Pathological Q waves | | | | |
| Negative | 1398 (84.7) | 1308 (93.6) | 90 (6.4) | |
| Positive | 253 (14.9) | 215 (85) | 38 (15) | |
| Low QRS complex voltage | | | | |
| Negative | 1556 (94.2) | 1444 (92.8) | 112 (7.2) | |
| Positive | 95 (5.8) | 79 (83.2) | 16 (16.8) | |
| Categorized NT-proBNP | | | | |
| Normal (below 300 pg/dl) | 1194 (70.8) | 1166 (97.7) | 28 (2.3) | |
| Altered (above 301 pg/dl) | 492 (29.2) | 386 (78.5) | 106 (21.5) | |
* The n varies from the total of 1,694 because of missing information.
π Chi-squared test.
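
The footnote names the chi-squared test as the source of the p-values. A minimal sketch of that test for a 2x2 table, assuming Pearson's statistic without continuity correction (whether the authors applied a correction is not stated). It is applied to the "Self-declared color" rows, where a p-value survives in the table; `chi2_2x2` is an illustrative helper, not code from the study, and it does not cover variables with three categories such as heart rate.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# "Self-declared color" rows: (no death, death) for White and Non-white.
stat, p = chi2_2x2(332, 29, 1222, 102)
print(round(stat, 3), round(p, 3))
```

Under these assumptions the result lands close to the 0.836 listed for that variable, which suggests (but does not prove) that the reported values are uncorrected Pearson p-values.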
Fig 3. Performance of models in predicting death for patients with CD, within two years, according to each machine learning technique adopted.
Fig 3A: Considering the interview variables. Fig 3B: Considering the variables of complementary exams. Fig 3C: Considering the variables of complementary exams, excluding the categorized NT-proBNP variable. Fig 3D: Considering the variables of complementary exams, considering only the NT-proBNP variable. Fig 3E: Considering the interview variables and complementary exam variables.
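
The single-variable setting of Fig 3D can be approximated directly from the "Categorized NT-proBNP" rows of the table. A rough sketch, assuming a hypothetical rule that predicts death whenever NT-proBNP is in the altered category. This is an in-sample calculation on the published counts, not the authors' cross-validated model, so it is only expected to be in the same range as the reported G-mean of 0.76.

```python
import math

# Counts from the "Categorized NT-proBNP" rows of the table: (survived, died).
normal = (1166, 28)    # below the cutoff
altered = (386, 106)   # above the cutoff

# Hypothetical rule (an assumption for illustration):
# predict death whenever NT-proBNP is altered.
tp, fn = altered[1], normal[1]   # deaths flagged / deaths missed
tn, fp = normal[0], altered[0]   # survivors cleared / survivors flagged

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
g_mean = math.sqrt(sensitivity * specificity)
print(f"sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} g_mean={g_mean:.3f}")
```

The rule flags roughly four in five deaths while clearing about three in four survivors, which is consistent with the abstract's remark that NT-proBNP alone carries most of the predictive signal.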
Fig 4. Importance of predictor variables of complementary exams for predicting death in patients with CD, within two years, according to Random Forests ranking.