| Literature DB >> 36240212 |
Sara Domínguez-Rodríguez1, Miquel Serna-Pascual1, Andrea Oletto2, Shaun Barnabas3, Peter Zuidewind3, Els Dobbels3, Siva Danaviah4, Osee Behuhuma4, Maria Grazia Lain5, Paula Vaz5, Sheila Fernández-Luis6,7, Tacilta Nhampossa7, Elisa Lopez-Varela7, Kennedy Otwombe8, Afaaf Liberty8, Avy Violari8, Almoustapha Issiaka Maiga9, Paolo Rossi10, Carlo Giaquinto11, Louise Kuhn12, Pablo Rojo1, Alfredo Tagarro1,13,14.
Abstract
Logistic regression (LR) is the most common prediction model in medicine. In recent years, supervised machine learning (ML) methods have gained popularity, but concerns remain about their utility for small sample sizes. In this study, we compare the performance of seven algorithms in predicting 1-year mortality and clinical progression to AIDS in a small cohort of infants living with HIV from South Africa and Mozambique. The data set (n = 100) was randomly split into a 70% training set and a 30% validation set. Seven algorithms (LR, Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Artificial Neural Network (ANN), and Elastic Net) were compared. The predictors were the same across all models and included sociodemographic, virologic, immunologic, and maternal status features. For each model, hyperparameters were tuned by 5-times-repeated 10-fold cross-validation. A confusion matrix was built to assess accuracy, sensitivity, and specificity. RF ranked as the best algorithm in terms of accuracy (82.8%), sensitivity (78%), and AUC (0.73). RF also showed better sensitivity and specificity than the other algorithms in the external validation, as well as the highest AUC. LR performed worse than RF, SVM, or KNN. The outcome of children living with perinatally acquired HIV can be predicted with considerable accuracy using ML algorithms. Better models could help less specialized staff in resource-limited countries to promptly refer children at high risk of clinical progression.
Year: 2022 PMID: 36240212 PMCID: PMC9565414 DOI: 10.1371/journal.pone.0276116
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
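The evaluation pipeline described in the abstract (random 70/30 split, model fitting, and accuracy/sensitivity/specificity from a confusion matrix) can be sketched as follows. This is a minimal illustration on synthetic data, since the cohort itself (n = 100) is not part of this record; the feature count, random seeds, and random-forest settings are assumptions, not the study's.

```python
# Sketch of the study's evaluation protocol on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                        # 100 infants, 12 predictors (assumed)
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)  # 1 = death/progression

# 70% training / 30% validation split, as in the study
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
pred = clf.predict(X_val)

# Accuracy, sensitivity and specificity read off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_val, pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc_val = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(round(accuracy, 2), round(sensitivity, 2), round(specificity, 2))
```

The same split and metrics apply to each of the seven algorithms; only the estimator changes.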
Feature distribution according to the different data sets. Values are median [IQR] or n (%). Several variable labels were lost in extraction; their rows are left unlabeled.

| Feature | Training set | Testing set | p-value |
|---|---|---|---|
| | 36.0 [29.6;69] | 41.0 [30;89.1] | 0.316 |
| Sex | | | 0.416 |
| Female | 35 (49.3%) | 11 (37.9%) | |
| Male | 36 (50.7%) | 18 (62.1%) | |
| | -1.46 [-2.62;-0.87] | -1.18 [-2.98;-0.30] | 0.805 |
| | | | 0.140 |
| No | 41 (57.7%) | 22 (75.9%) | |
| Yes | 30 (42.3%) | 7 (24.1%) | |
| | 30.0 [0.00;35.5] | 31.0 [0.00;50.0] | 0.563 |
| | 32.0 [18.5;62.5] | 36.0 [23.0;82.0] | 0.195 |
| ART regimen | | | 0.558 |
| 3TC+ABC+LPVr | 33 (46.5%) | 14 (48.3%) | |
| 3TC+ABC+NVP | 0 (0.00%) | 1 (3.45%) | |
| 3TC+AZT+LPVr | 22 (31.0%) | 8 (27.6%) | |
| 3TC+AZT+NVP | 16 (22.5%) | 6 (20.7%) | |
| | 609715 [36738;2570245] | 226844 [36295;1344319] | 0.350 |
| | 36.9 [29.9;45.2] | 40.0 [28.0;47.0] | 0.587 |
| | | | 1.000 |
| No | 34 (47.9%) | 14 (48.3%) | |
| Yes | 37 (52.1%) | 15 (51.7%) | |
| Maternal adherence | | | 0.589 |
| Poor | 3 (4.23%) | 3 (10.3%) | |
| Intermediate low | 12 (16.9%) | 4 (13.8%) | |
| Intermediate high | 18 (25.4%) | 5 (17.2%) | |
| Good | 38 (53.5%) | 17 (58.6%) | |
ART: Antiretroviral therapy; 3TC: Lamivudine; ABC: Abacavir; LPVr: Lopinavir boosted with ritonavir; NVP: Nevirapine; Maternal severe life events: change in employment, separation or relationship break-up, new partner, loss of home or move, or death in the family; Maternal adherence (Optimal: no ART dose missed; Intermediate low: 10–50% of doses missed; Intermediate high: 50–90%; Good: >90%).
Algorithm tuning parameters.
| Algorithm | Tuning parameter |
|---|---|
| Logistic regression | - |
| Random forest | mtry = 12 |
| Support Vector Machine | C = 8; sigma = 4.69·10⁻¹¹ |
| Naïve Bayes | fL = 0; adjust = 1 |
| K-nearest neighbor | K = 5 |
| Artificial Neural Network | Size = 11; decay = 0.1 |
| GLMNET | Alpha = 0.8; lambda = 0.21 |
Algorithm tuning parameters selected by repeated (5 times) 10-fold cross-validation over a grid. Mtry: number of variables considered for splitting at each tree node in a random forest; C: regularization parameter that controls the trade-off between achieving a low training error and a low testing error; sigma: determines how fast the similarity metric goes to zero as points move further apart; fL: Laplace smoother; adjust: adjusts the bandwidth of the kernel density; K: number of nearest neighbours; size: number of units in the hidden layer; decay: regularization parameter to avoid over-fitting; alpha: regularization mixing parameter; lambda: penalty on the coefficients.
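The tuning procedure above (a grid search scored by 5-times-repeated 10-fold cross-validation on the training set) can be sketched with scikit-learn; the paper's fL/adjust parameter names suggest it used R's caret, so this is an analogue, not the original code. The grid values, data, and scoring metric here are assumptions, except that `max_features` plays the role of mtry for the random forest.

```python
# Sketch: selecting a random-forest mtry (max_features) by 5x-repeated
# 10-fold cross-validation over a grid, on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 20))            # training set only (70 of 100)
y = (X[:, 0] + rng.normal(size=70) > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=1),
    param_grid={"max_features": [2, 6, 12, 20]},  # grid including mtry = 12
    cv=cv,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_)
```

Each candidate value is fitted 50 times (10 folds × 5 repeats), and the value with the best mean cross-validated score is retained, which is how stable hyperparameters can be chosen despite the small sample.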
Fig 1. Algorithm performance in the validation set.
Fig 2. Algorithm receiver operating characteristic (ROC) curves in the validation set.
Fig 3. Probability of death/progression according to each algorithm in the validation set.
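Figures of this kind are typically derived from each model's per-child predicted probability of death/progression, from which the ROC curve is traced. A minimal sketch, using a logistic regression on synthetic stand-in data (the model, data, and split here are assumptions):

```python
# Sketch: per-subject event probabilities and the resulting ROC curve/AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X[:70], y[:70])
proba = model.predict_proba(X[70:])[:, 1]   # probability of death/progression

fpr, tpr, _ = roc_curve(y[70:], proba)      # points tracing the ROC curve
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))
```

Plotting `fpr` against `tpr` gives a curve like those in Fig 2, and the distribution of `proba` by observed outcome corresponds to Fig 3.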