Raquel Teixeira^1,2, Carina Rodrigues^3,4, Carla Moreira^3,4,5, Henrique Barros^3,4,6, Rui Camacho^7,8.
Abstract
The timely identification of cohort participants at higher risk of attrition is important for earlier interventions and the efficient use of research resources. Machine learning may have advantages over conventional approaches, improving discrimination by analysing complex interactions among predictors. We developed predictive models of attrition applying a conventional regression model and different machine learning methods. A total of 542 very preterm (< 32 gestational weeks) infants born in Portugal as part of the Effective Perinatal Intensive Care in Europe (EPICE) cohort were included. We tested a model with a fixed number of predictors (Baseline) and a second with a dynamic number of variables added from each follow-up (Incremental). Eight classification methods were applied: AdaBoost, Artificial Neural Networks, Functional Trees, J48, J48Consolidated, K-Nearest Neighbours, Random Forest and Logistic Regression. Performance was compared using AUC-PR (Area Under the Precision-Recall Curve), Accuracy, Sensitivity and F-measure. Attrition at the four follow-ups was, respectively, 16%, 25%, 13% and 17%. Both models demonstrated good predictive performance, with AUC-PR ranging from 69 to 94.1 in the Baseline model and from 72.5 to 97.1 in the Incremental model. Of the whole set of methods, Random Forest presented the best performance at all follow-ups [AUC-PR1: 94.1 (2.0); AUC-PR2: 91.2 (1.2); AUC-PR3: 97.1 (1.0); AUC-PR4: 96.5 (1.7)]. Logistic Regression performed well below Random Forest. The top-ranked predictors were common to both models at all follow-ups: birthweight, gestational age, maternal age, and length of hospital stay. Random Forest presented the highest predictive capacity and provided interpretable predictors. Researchers involved in cohort studies can use these robust models to prepare for and prevent loss to follow-up by directing efforts toward individuals at higher risk.
Year: 2022 PMID: 35732850 PMCID: PMC9217966 DOI: 10.1038/s41598-022-13946-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
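The abstract compares eight classifiers on AUC-PR reported as mean (SD). This is a minimal, hedged sketch of how such an estimate can be obtained with stratified cross-validation in scikit-learn; it is not the authors' pipeline, and the synthetic data (542 samples, roughly 16% positives, matching the first follow-up's attrition rate) merely stands in for the cohort.

```python
# Sketch only: repeated stratified CV estimate of AUC-PR (average precision)
# for two of the methods named in the abstract. Synthetic data, not EPICE-PT.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 542 infants, ~16% experiencing attrition.
X, y = make_classification(n_samples=542, n_features=10, weights=[0.84],
                           random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # "average_precision" is scikit-learn's estimate of the area under
    # the precision-recall curve (AUC-PR).
    scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC-PR {scores.mean():.3f} (SD {scores.std():.3f})")
```

Reporting the cross-validated mean with its SD, as the paper's tables do, conveys both the central performance and its fold-to-fold variability.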
General characteristics of the study population (n = 542).
| Characteristics | n^a (%) |
|---|---|
| **Sex** | |
| Female | 232 (42.8) |
| Male | 310 (57.2) |
| **Birthweight (g)** | |
| Median (p25–p75) | 1172 (940–1436) |
| **Gestational age (weeks)** | |
| Median (p25–p75) | 29 (27–31) |
| < 26 | 27 (5.0) |
| 26–27 | 118 (21.8) |
| 28–29 | 148 (27.3) |
| 30–31 | 249 (45.9) |
| **SGA^b** | |
| Yes (< 10th percentile) | 52 (9.7) |
| No (≥ 10th percentile) | 485 (90.3) |
| Missing | 5 (0.9) |
| **Plurality** | |
| Singleton | 372 (68.6) |
| Multiple | 170 (31.4) |
| **Parity** | |
| 0 | 342 (63.2) |
| 1 | 144 (26.6) |
| ≥ 2 | 55 (10.2) |
| Missing | 1 (0.2) |
| No | 156 (29.1) |
| Yes | 381 (70.9) |
| Missing | 5 (0.9) |
| **Maternal age (years)** | |
| Median (p25–p75) | 31 (27–35) |
| < 25 | 85 (15.7) |
| 25–34 | 300 (55.4) |
| ≥ 35 | 157 (29.0) |
| No | 81 (15.1) |
| Yes | 454 (84.9) |
| Missing | 7 (1.3) |
| **Deprivation quintile** | |
| Least deprived (q1–q4) | 447 (83.2) |
| Most deprived (q5) | 90 (16.8) |
| Missing | 5 (0.9) |
| **Length of hospital stay (days)** | |
| Median (p25–p75) | 51 (37–71) |
^a Calculation of percentages does not include missing values.
^b SGA, small for gestational age, based on intrauterine curves developed for the cohort [54].
^c The sum of the categories surpasses 100% because the numbers were rounded.
Figure 1. Area Under the Curve-Precision Recall (AUC-PR) for follow-ups 1, 2, 3 and 4.
Performance results of the classification methods applied to the prediction of attrition at the four follow-ups of the EPICE-PT cohort (mean, SD).
| Follow-up | Method | Baseline: Sensitivity | SD | Baseline: Accuracy | SD | Baseline: F-measure | SD | Incremental^a: Sensitivity | SD | Incremental: Accuracy | SD | Incremental: F-measure | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AdaBoost | 82.3 | 6.0 | 83.2 | 5.7 | 83.3 | 5.7 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | Artificial Neural Networks | 81.4 | 3.1 | 81.1 | 3.1 | 81.2 | 3.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | Functional Trees | 74.5 | 5.2 | 74.7 | 1.8 | 74.7 | 1.8 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | J48 | 76.9 | 3.3 | 78.0 | 2.9 | 78.0 | 2.8 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | J48Consolidated | 82.0 | 4.2 | 79.3 | 2.0 | 79.3 | 1.9 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | K-Nearest Neighbours | 86.0 | 3.9 | 76.5 | 2.1 | 76.5 | 2.2 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | Logistic Regression | 69.7 | 5.7 | 73.7 | 2.0 | 73.6 | 2.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | Random Forest | 82.3 | 6.3 | 88.2 | 1.9 | 88.1 | 2.0 | N/A | N/A | N/A | N/A | N/A | N/A |
| 2 | AdaBoost | 82.4 | 5.8 | 71.6 | 7.2 | 70.9 | 7.6 | 85.6 | 3.6 | 82.3 | 3.7 | 82.3 | 3.7 |
| 2 | Artificial Neural Networks | 82.6 | 6.3 | 75.2 | 3.5 | 74.8 | 3.5 | 82.2 | 1.8 | 79.9 | 1.9 | 79.9 | 2.0 |
| 2 | Functional Trees | 76.8 | 3.8 | 71.4 | 2.6 | 71.2 | 2.6 | 76.1 | 2.8 | 73.1 | 3.2 | 73.1 | 3.2 |
| 2 | J48 | 77.8 | 7.4 | 73.2 | 5.3 | 73.1 | 5.3 | 79.4 | 3.1 | 77.0 | 1.8 | 76.9 | 1.9 |
| 2 | J48Consolidated | 73.7 | 4.1 | 73.6 | 4.2 | 73.6 | 4.3 | 76.5 | 4.1 | 78.1 | 1.5 | 78.2 | 1.5 |
| 2 | K-Nearest Neighbours | 87.6 | 4.5 | 71.7 | 3.9 | 70.5 | 4.0 | 85.4 | 2.7 | 76.7 | 1.6 | 76.4 | 1.7 |
| 2 | Logistic Regression | 77.2 | 2.5 | 67.0 | 1.7 | 66.4 | 1.8 | 80.2 | 4.7 | 74.7 | 2.5 | 74.6 | 2.4 |
| 2 | Random Forest | 86.8 | 2.4 | 82.6 | 1.8 | 82.5 | 1.8 | 85.0 | 3.3 | 84.6 | 2.5 | 84.6 | 2.5 |
| 3 | AdaBoost | 75.4 | 6.2 | 85.0 | 3.5 | 84.8 | 3.6 | 87.9 | 7.3 | 90.3 | 1.7 | 90.3 | 1.8 |
| 3 | Artificial Neural Networks | 79.0 | 7.0 | 81.3 | 3.1 | 81.3 | 3.2 | 87.2 | 5.1 | 89.8 | 0.3 | 89.8 | 0.3 |
| 3 | Functional Trees | 74.4 | 5.7 | 78.2 | 3.0 | 78.3 | 3.0 | 84.9 | 6.0 | 87.5 | 2.1 | 87.5 | 2.1 |
| 3 | J48 | 70.8 | 3.4 | 81.0 | 2.2 | 80.8 | 2.2 | 84.2 | 6.4 | 89.0 | 2.7 | 89.0 | 2.8 |
| 3 | J48Consolidated | 74.1 | 4.6 | 80.5 | 2.7 | 80.5 | 2.7 | 87.8 | 3.0 | 89.6 | 1.9 | 89.6 | 1.9 |
| 3 | K-Nearest Neighbours | 72.5 | 2.6 | 77.7 | 2.0 | 77.7 | 1.9 | 88.9 | 6.6 | 90.1 | 1.8 | 90.1 | 1.9 |
| 3 | Logistic Regression | 69.5 | 5.5 | 77.6 | 1.1 | 77.4 | 1.2 | 87.9 | 6.4 | 88.1 | 3.0 | 88.2 | 3.1 |
| 3 | Random Forest | 73.4 | 3.8 | 86.1 | 2.1 | 85.7 | 2.2 | 89.8 | 4.1 | 92.9 | 0.9 | 92.9 | 0.9 |
| 4 | AdaBoost | 83.3 | 3.1 | 84.2 | 1.5 | 84.2 | 1.5 | 88.5 | 4.5 | 92.1 | 2.6 | 92.1 | 2.6 |
| 4 | Artificial Neural Networks | 82.3 | 4.0 | 78.4 | 2.9 | 78.4 | 2.9 | 91.0 | 1.6 | 92.9 | 2.1 | 92.9 | 2.1 |
| 4 | Functional Trees | 76.2 | 4.1 | 74.3 | 1.2 | 74.2 | 1.2 | 91.5 | 3.7 | 92.2 | 3.1 | 92.2 | 3.1 |
| 4 | J48 | 74.6 | 5.6 | 79.6 | 2.5 | 79.5 | 2.6 | 88.7 | 3.4 | 92.5 | 1.7 | 92.4 | 1.7 |
| 4 | J48Consolidated | 77.4 | 4.3 | 77.0 | 5.4 | 77.0 | 5.3 | 89.2 | 3.3 | 92.7 | 1.6 | 92.7 | 1.6 |
| 4 | K-Nearest Neighbours | 84.1 | 1.0 | 72.6 | 2.0 | 72.4 | 2.1 | 89.0 | 1.5 | 93.3 | 1.4 | 93.3 | 1.4 |
| 4 | Logistic Regression | 76.1 | 3.0 | 73.5 | 1.8 | 73.6 | 1.9 | 87.7 | 4.9 | 89.2 | 1.6 | 89.2 | 1.6 |
| 4 | Random Forest | 82.6 | 3.0 | 85.3 | 2.3 | 85.2 | 2.3 | 91.0 | 2.3 | 94.3 | 2.2 | 94.2 | 2.2 |
^a At follow-up 1, the Baseline and Incremental models are equivalent.
Top-ranked variables by variable importance for each year, in the Baseline and Incremental models.
| Model | Mean rank | Follow-up 1 | Follow-up 2 | Follow-up 3 | Follow-up 4 |
|---|---|---|---|---|---|
| Baseline | 1 | Birthweight | Birthweight | Birthweight | Birthweight |
| Baseline | 2 | Maternal age | Gestational age | Maternal age | Region of birth |
| Baseline | 3 | Length of hospital stay | Maternal age | Gestational age | Gestational age |
| Baseline | 4 | Gestational age | Length of hospital stay | Length of hospital stay | Length of hospital stay |
| Baseline | 5 | Sex | Region of birth | Sex | Maternal age |
| Incremental | 1 | Birthweight | Birthweight | Birthweight | Birthweight |
| Incremental | 2 | Maternal age | Maternal age | Length of hospital stay | Maternal age |
| Incremental | 3 | Length of hospital stay | Gestational age | Gestational age | Gestational age |
| Incremental | 4 | Gestational age | Sex | Sex | Region of birth |
| Incremental | 5 | Sex | Length of hospital stay | Maternal age | Length of hospital stay |
Figure 2. Importance of the predictor variables (based on the mean decrease in impurity) in the Random Forest for each year (Baseline model).
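Figure 2 ranks predictors by the mean decrease in impurity in the Random Forest. As a hedged illustration of how such a ranking is obtained in scikit-learn (not the authors' code), the `feature_importances_` attribute exposes exactly this impurity-based measure; the feature names below are illustrative stand-ins for the cohort predictors, and the data are synthetic.

```python
# Sketch only: impurity-based variable importance from a Random Forest.
# Feature names mirror the paper's top predictors but the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["birthweight", "gestational_age", "maternal_age",
                 "length_of_stay", "sex", "region_of_birth"]
X, y = make_classification(n_samples=542, n_features=6, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the mean decrease in impurity per feature,
# normalised so that the importances sum to 1.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Because impurity-based importance can favour high-cardinality features, permutation importance is a common cross-check when rankings inform decisions such as targeting participants at risk of attrition.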