INTRODUCTION: High-yield HIV testing strategies are critical to reach epidemic control in high-prevalence, low-resource settings such as East and Southern Africa. In this study, we aimed to predict the HIV status of individuals living in Angola, Burundi, Ethiopia, Lesotho, Malawi, Mozambique, Namibia, Rwanda, Zambia and Zimbabwe with the highest precision and sensitivity for different policy targets and constraints, based on a minimal set of socio-behavioural characteristics. METHODS: We analysed the most recent Demographic and Health Survey from these 10 countries to predict individuals' HIV status using four different algorithms (a penalized logistic regression, a generalized additive model, a support vector machine, and gradient boosted trees). The algorithms were trained and validated on 80% of the data, and tested on the remaining 20%. We compared the predictions based on the F1 score, the harmonic mean of sensitivity and positive predictive value (PPV), and we assessed the generalization of our models by testing them against an independent left-out country. The best-performing algorithm was trained on a minimal subset of variables identified as the most predictive, and used to 1) identify 95% of people living with HIV (PLHIV) while maximising precision and 2) identify groups of individuals, by adjusting the probability threshold of being HIV positive (90% in our scenario), for specific testing strategies. RESULTS: Overall, 55,151 males and 69,626 females were included in the analysis. The gradient boosted trees algorithm performed best in predicting HIV status, with a mean F1 score of 76.8% [95% confidence interval (CI) 76.0%-77.6%] for males (vs [CI 67.8%-70.6%] for SVM) and 78.8% [CI 78.2%-79.4%] for females (vs [CI 73.4%-75.8%] for SVM).
Among the ten most predictive variables for each sex, nine were identical: longitude, latitude and altitude of place of residence, current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last intercourse, and wealth index. Only age at first sex for males (ranked 10th) and Rohrer's index for females (ranked 6th) differed between the sexes. Our large-scale scenario, which consisted of identifying 95% of all PLHIV, would have required testing 49.4% of males and 48.1% of females while achieving a precision of 15.4% for males and 22.7% for females. For the second scenario, only 4.6% of males and 6.0% of females would have had to be tested to find 55.7% of all males and 50.5% of all females living with HIV. CONCLUSIONS: We trained a gradient boosted trees algorithm to find 95% of PLHIV with a precision twice as high as with general population testing, using only a limited number of socio-behavioural characteristics. We also successfully identified people at high risk of infection who may be offered pre-exposure prophylaxis or voluntary medical male circumcision. These findings can inform the implementation of new high-yield HIV testing approaches and help develop precise strategies suited to the constraints of low-resource settings.
In order to reach epidemic control by 2030, the Joint United Nations Programme on HIV/AIDS (UNAIDS) has set fast-track targets to rapidly scale up effective HIV services [1]. One of the aims is to ensure that 95% of the approximately 38 million people globally living with HIV (PLHIV) are aware of their HIV status and that 95% of those diagnosed HIV positive are on treatment [2]. People in East and Southern Africa are disproportionately burdened by HIV, constituting more than half of the global PLHIV, with 20.7 million people currently estimated to be HIV positive [2]. As of 2020, 87% of PLHIV in this region were aware of their HIV status, of whom 83% were accessing treatment [3]. In addition, 25% of new HIV infections in East and Southern Africa were concentrated among key populations such as female sex workers, men who have sex with men, prisoners, and people who inject drugs [3]. HIV is transmitted within a complex network that is influenced by biological, behavioural, and social factors. In East and Southern Africa, there is large geographical variation in the distribution of the HIV epidemic [4]. In order to identify populations at high risk of infection, global HIV prevention efforts have shifted toward optimizing resource allocation by considering geographical data as a way of increasing programme impact and efficiency [5]. Modern predictive algorithms have the power to substantially enhance HIV prevention and detection, increasing predictive capability by processing large amounts of heterogeneous data. Such methods have been used to establish patterns of HIV risk behaviour, to optimise HIV treatment modalities, and to identify high-risk individuals for targeted interventions from a number of novel data sources [6-15]. As more PLHIV are diagnosed, finding persons with undiagnosed HIV becomes progressively more difficult and expensive.
Hence, resource constraints and potential funding shortages have resulted in demands for differentiated high-yield testing strategies in parallel to provider-initiated HIV testing and counselling (PITC) [14, 16, 17]. In this paper, we aim to identify new key populations based on socio-behavioural characteristics by comparing four different prediction algorithms. These insights are intended both to inform targeted case-finding strategies and to identify high-risk HIV-negative individuals eligible for prevention services such as voluntary medical male circumcision (VMMC) and/or pre-exposure prophylaxis (PrEP).
Methods
Data
Since 1984, the Demographic and Health Surveys (DHS) program has provided technical assistance for over 400 surveys in more than 90 countries, advancing global understanding of health and population trends in developing countries [18]. DHS are nationally representative household surveys that provide data for a wide range of monitoring and impact evaluation indicators on health and nutrition. Standard DHS surveys have large sample sizes (usually between 5,000 and 30,000 households) and are typically conducted every five years [19]. We used the most recent DHS surveys, conducted in or after 2013, of ten East and Southern African countries (S1 Table) with a generalised HIV epidemic: Angola, Burundi, Ethiopia, Lesotho, Malawi, Mozambique, Namibia, Rwanda, Zambia, and Zimbabwe. We combined, separately for males and females, the datasets of each country with the corresponding households' geographic positions and HIV test results. We then merged the ten countries and obtained two datasets containing 68,979 males and 83,910 females, with 527 and 3,213 variables respectively, since different socio-behavioural characteristics are recorded for each sex. The target variable was the HIV status of the individual (0 for HIV negative and 1 for HIV positive). During the data pre-processing step, only individuals with a positive or negative HIV status were included in the analysis; those with unknown status were discarded. We cleaned, concatenated, filtered, transformed, and aggregated the data (S2 Table). We imputed missing values, which we assumed were missing at random, using multiple imputation by chained equations (MICE) (as detailed in S3 and S4 Tables), and the data were further harmonized and scaled [20, 21]. The final dataset included 55,151 males and 69,626 females, with 84 and 122 variables respectively; 73 variables were common to both sexes (S5 Table).
Training, validation and test procedure steps
Fig 1—Step 1
From these two datasets, we first left one of the 10 countries out (switching the left-out country) to create 10 different datasets per sex, each comprising only 9 countries. This was done for generalization purposes, in order to assess the quality of our models when the data were not drawn from the exact same distribution. Each of the 10 newly created datasets per sex was then split at the individual level into a stratified (due to imbalanced outcomes) 80% training set and a 20% test set. The MICE imputation and data standardization described above were then performed separately on the training and test datasets, to prevent information from the training dataset from contaminating the test dataset.
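The splitting procedure can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' implementation (which is available in their GitLab repository); the record structure and field names (`country`, `hiv_status`) are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key="hiv_status", test_frac=0.2, seed=0):
    """Split records 80/20 at the individual level while preserving the
    HIV-positive/negative ratio in both parts (stratification)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[r[label_key]].append(r)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def leave_one_country_out(records, countries, country_key="country"):
    """Yield, for each country, a 9-country pool and the held-out rows."""
    for left_out in countries:
        pool = [r for r in records if r[country_key] != left_out]
        held = [r for r in records if r[country_key] == left_out]
        yield left_out, pool, held
```

Each 9-country pool would then itself be passed to `stratified_split` before imputation and standardization, so that no test information leaks into training.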
Fig 1—Step 2
Using 50 randomly selected sets of hyperparameters, a stratified 5-fold cross-validation was then performed for each algorithm on each of the training sets. The set of hyperparameters that obtained the maximum mean F1 score over the validation folds was selected.
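The hyperparameter search can be sketched as below, assuming a grid given as a dict of candidate values per parameter; the parameter names are illustrative only, and the scoring of each candidate on the 5 folds is assumed to happen elsewhere.

```python
import random

def sample_hyperparameters(grid, n_sets=50, seed=0):
    """Draw n_sets random combinations from a hyperparameter grid
    (dict: parameter name -> list of candidate values)."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_sets)]

def select_best(candidates, fold_f1s):
    """fold_f1s[i] holds the validation F1 scores of candidates[i] on
    each of the 5 folds; keep the candidate with the highest mean F1."""
    best = max(range(len(candidates)),
               key=lambda i: sum(fold_f1s[i]) / len(fold_f1s[i]))
    return candidates[best]
```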
Fig 1—Step 3
Each of the 10 best models per sex and per algorithm was then run on the corresponding test set, and the resulting metric scores were averaged. We selected the algorithm with the maximum mean F1 score over the 10 test datasets. Finally, we applied each selected model to the corresponding left-out country dataset.
Algorithms
We compared four algorithms for the prediction of an individual's HIV status: a penalized logistic regression (Elastic Net) [22], a generalized additive model (GAM) [23], a support vector machine (SVM) [24], and an implementation of gradient boosted trees (XGBoost) [25]. The Elastic Net and the GAM are among the most widely used classification methods in biology and medicine, SVM is a very common machine learning algorithm, and XGBoost is a decision-tree-based ensemble which has gained considerable traction since its development a few years ago owing to its excellent performance (more details about the models can be found in S1 File). Our primary interest was to find the largest number of HIV-positive individuals (sensitivity) with the highest possible yield (positive predictive value (PPV)). We therefore used the F1 score for assessing the performance of the different algorithms. This metric combines sensitivity and precision in a harmonic mean and is often recommended for unbalanced datasets when comparing models [26]. The probability threshold for classifying someone as HIV positive was set at 50%. In addition, to validate our results with a strictly proper scoring rule, we also computed the Brier score. This score is equivalent to the mean squared error applied to predicted probabilities for unidimensional predictions. The analysis was done in two steps for each of the four algorithms (Fig 1—Steps 2, 3), and separately for males and females. Training and validation were performed using stratified 5-fold cross-validation on the training sample with 50 different sets of hyperparameters randomly chosen from a grid (as detailed in S1 File). Among these sets, we selected the one with the highest mean F1 score, and tested the obtained model on the test sample and on the left-out country, which were not used during training and validation (Fig 1—Step 3).
We selected the best algorithm based on its averaged F1 scores on the ten test samples.
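The two evaluation metrics described above reduce to a few lines; this is a generic sketch of their standard definitions, not code from the study.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision (PPV) and sensitivity (recall),
    computed from the confusion-matrix counts."""
    ppv = tp / (tp + fp)
    sens = tp / (tp + fn)
    return 2 * ppv * sens / (ppv + sens)

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and the
    observed 0/1 outcomes (a strictly proper scoring rule)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

Unlike accuracy, the F1 score never rewards the degenerate "predict everyone negative" strategy on an imbalanced dataset, which is why it was used for model selection here.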
Fig 1
Methodology diagram of the analysis part 1.
Variables selection and HIV status prediction
For variable selection and HIV status prediction, we used the exact same training, validation and testing strategy as in the first part of our analysis, except that no country was left out. We split each dataset per sex into a stratified 80% training and validation set and a 20% test set. The best algorithm was trained and validated using a random grid search over 250 sets of hyperparameters and a stratified 5-fold cross-validation. The first predictions were performed using all available variables. Based on the F1 scores, sensitivity, and PPV, we compared two imputation methods, namely MICE (models M1 and F1 for males and females, respectively) and a built-in method from the selected algorithm [25] (models M2 and F2). We used sequential forward floating selection (SFFS), which adds (or removes) variables based on a defined classifier performance metric, on the 80% training samples and calculated the F1 scores for different subsets of variables. We selected the subset of variables at which the F1 score plateaued, and we then assessed the direction of the association between these variables and the probability of being HIV positive using Shapley values [27]. We retrained the best algorithm with the subsets of variables defined above (models M3 and F3) and also on a minimal subset common to both sexes (models M4 and F4). The F1 scores, sensitivity, and PPV were compared to those obtained for M1, M2, F1, and F2. With our last models, based on a minimal subset common to both sexes (models M4 and F4), we further analysed the results at country level, comparing F1 scores, sensitivity and PPV between countries, as well as the differences between observed and predicted prevalences.
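The greedy forward step of this selection procedure can be sketched as follows. Note that this sketch shows only the forward part; the full "floating" variant used in the study additionally removes a previously selected variable whenever doing so improves the score. The `score_fn` here stands in for refitting the classifier and measuring the validation F1 on the candidate subset.

```python
def forward_selection(features, score_fn, target_size):
    """Greedy forward selection: repeatedly add the feature that
    maximises the classifier score (e.g. the validation F1) of the
    current subset. The floating step of SFFS is omitted here."""
    selected = []
    while len(selected) < target_size:
        # Try every remaining feature and keep the best addition.
        best = max((f for f in features if f not in selected),
                   key=lambda f: score_fn(selected + [f]))
        selected.append(best)
    return selected
```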
Scenarios
We tested two scenarios. In the first, the sensitivity was set to 95%, equivalent to 95% of PLHIV knowing their status, and we reported the corresponding precision and number of individuals to be tested. In the second, we identified a population for which the probability of being HIV positive was higher than 90%. We considered these groups of individuals to be targets for specific testing strategies or ideal candidates for prevention services.
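Both scenarios amount to choosing an operating point on the model's predicted probabilities. A minimal sketch, assuming `probs` are the model's predicted probabilities and `labels` the true 0/1 HIV statuses (the function names are ours, not the study's):

```python
def threshold_for_sensitivity(probs, labels, target_sens=0.95):
    """Scenario 1: test individuals from highest to lowest predicted
    probability until target_sens of the positives are captured; report
    the implied threshold, precision, and fraction of people tested."""
    n_pos = sum(labels)
    order = sorted(zip(probs, labels), reverse=True)
    tp = tested = 0
    for p, y in order:
        tested += 1
        tp += y
        if tp >= target_sens * n_pos:
            return {"threshold": p,
                    "precision": tp / tested,
                    "fraction_tested": tested / len(labels)}

def high_risk_group(probs, labels, cutoff=0.9):
    """Scenario 2: flag individuals whose predicted probability of being
    HIV positive is at least the cutoff; report size and yield."""
    flagged = [(p, y) for p, y in zip(probs, labels) if p >= cutoff]
    tp = sum(y for _, y in flagged)
    return {"fraction_flagged": len(flagged) / len(labels),
            "sensitivity": tp / sum(labels),
            "ppv": tp / len(flagged) if flagged else 0.0}
```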
Ethical review
No ethical approval was needed for this study.
Data and code availability
The data supporting the findings of this study are available from the DHS Program at https://dhsprogram.com/. The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required to access the data. The data were collected between 2013 and 2017, depending on the country. All analyses were performed in Python version 3.7.4. The code is available at https://gitlab.com/Triphon/predicting_hiv_status.
Results
Overall, 55,151 males and 69,626 females were analysed, with a prevalence ranging from 0.8% among males in Ethiopia to 33.3% among females in Lesotho, and an overall HIV prevalence of 8.0% (4,417 individuals) for males and 11.5% (8,011 individuals) for females. Individuals aged 25 to 34 years were the largest age group, representing 35.9% of females and 31.9% of males. About two-thirds of people lived in rural areas (S6 Table). Fig 2 shows the performance of the four algorithms on the test samples and independent left-out countries. XGBoost had the highest F1 scores on all test samples, with a mean F1 score of 76.8% [95% confidence interval (CI) 76.0%-77.6%] for males and 78.8% [CI 78.2%-79.4%] for females. In comparison, SVM had a mean F1 score of 69.2% [CI 68.2%-70.2%] for males and 74.6% [CI 73.7%-75.5%] for females. For Elastic Net, the mean F1 score was 32.6% [CI 31.8%-33.4%] for males and 41.5% [CI 40.3%-42.7%] for females. GAM performed the worst, with a mean F1 score of 26.2% [CI 25.0%-27.4%] for males and 39.8% [CI 38.1%-41.5%] for females. When focusing on the Brier scores, XGBoost was still the best-performing algorithm, followed by SVM, GAM and finally Elastic Net. In general, the scores obtained by the models with the best Brier score were very similar to those obtained with the best F1 score (S7 to S10 Tables).
Fig 2
Boxplot of the F1 scores for the four algorithms on the test and left-out samples, per sex.
When tested against the ten left-out countries, the performance of the algorithms was substantially lower than on the test samples, and the F1 scores varied more widely (Fig 2, right). The mean F1 score was highest for Elastic Net, with 21.4% [CI 12.3%-30.5%] for males and 32.6% [CI 21.2%-44.0%] for females, followed closely by XGBoost with 20.9% [CI 14.3%-27.5%] and 29.8% [CI 19.0%-40.6%], respectively. In comparison, the mean F1 score for SVM was 15.4% [CI 10.9%-19.9%] for males and 22.3% [CI 14.1%-30.5%] for females. Again, GAM performed the worst, with a mean F1 score of 6.6% [CI 0.9%-12.1%] and 17.1% [CI 4.4%-29.8%]. See S7 to S10 Tables for details on PPV, sensitivity and Brier scores. The algorithms generally performed better in countries with higher prevalence (S7 to S10 Tables). Given that the best performance on the test samples was obtained with XGBoost, both for F1 and Brier scores, we used this algorithm for the selection of variables and the prediction of the HIV status of individuals on the entire datasets, where no country was left out. The results on all variables using the two different imputation methods are shown in Table 1. For both sexes, the XGBoost imputation (M2 and F2) resulted in slightly higher F1 scores than the MICE imputation (M1 and F1). The F1 scores on the validation samples were 75.5% [CI 73.7%-77.3%] vs 74.9% [CI 73.3%-76.5%] for males and 76.1% [CI 74.9%-77.3%] vs 75.5% [CI 74.6%-76.4%] for females. Given the similarity of these results, we decided to use the built-in XGBoost method for further analyses (i.e. models M3, F3, M4 and F4) because of its simplicity of implementation and its lower computation time.
Table 1
Results per sex of the XGBoost algorithm for different imputation methods and sets of variables.
True Positive (TP), False Negative (FN), False Positive (FP), True Negative (TN), Positive Predictive Value (PPV), Multiple Imputation by Chained Equations (MICE). (± %): 95% Confidence Interval.
Fig 3 shows the subset of most relevant variables for predicting an individual's HIV status, as selected by the SFFS procedure. With 15 variables for males and 27 variables for females, the F1 score plateaued at 99.6% and 97%, respectively.
Fig 3
Shapley values.
The variables are displayed sorted by importance from top to bottom (from the highest Shapley value to the lowest). The blue and red colours represent the value range of the variable (blue (red) represents the low (high) value range of the variable). For example, the older the age, the more likely the person will be HIV positive. N.b.: Shapley values do not describe the causal impact of each covariate, only the additional change in overall outcome by adding this covariate.
Among those variables, four were specific to females (‘currently breastfeeding’, ‘fertility preference’, ‘time to get to water source’ and ‘entries in birth history’) and two to males (‘number of women fathered children with’ and ‘respondent circumcised’). Of the ten most predictive variables for both sexes, nine were identical: geographic position (longitude, latitude, and altitude), current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last sexual intercourse with most recent partner, and a wealth index from the DHS which combines numerous wealth-related variables such as household assets and utility services [28]. Age at first sexual intercourse ranked tenth for males but only twentieth for females; Rohrer's index (an estimate of obesity) ranked sixth for females but was not available for males. Older age, older age of most recent partner, older age at and more years since first cohabitation, higher total lifetime number of sexual partners, longer time since last sex, higher number of unions, higher number of women fathered children with, condom use during last sexual intercourse with most recent partner, having been tested for HIV, living in an urban area, higher longitude coordinate and buying vegetables from a vendor with HIV were positively associated with the probability of HIV positivity for most individuals, whether males, females or both.
Higher age at first sex, higher wealth index, higher latitude coordinate, higher altitude, more years of education, higher number of entries in birth history, circumcision, higher Rohrer's index, more years lived in place of residence, use of contraception, currently breastfeeding and higher number of household members were mainly negatively associated with HIV positivity. The direction of association was not clear for the age of the household head and the time to get to the water source (Fig 3). Table 1 shows the results (confusion matrix, PPV, sensitivity and F1 score) of the XGBoost algorithm on the 15 most important variables for males (M3) and the 27 most important variables for females (F3). As expected from the SFFS procedure, the F1 scores of these two models were close to those obtained with all available variables (M2 and F2): they decreased only by 1.8 percentage points for males and by 0.5 percentage points for females. In comparison, using the nine most predictive common variables (M4 and F4), the F1 scores decreased by 2.6 and 3.7 percentage points relative to M2 and F2, respectively. M4 and F4 were the models used for the two scenarios, considering that the drop in performance compared to the previous, more complex models was minimal. S11 Table shows the results of our models' predictions at country level and per sex. For males, Malawi had the lowest predictive power, with an F1 score of 61.4%, compared to 81.8% for Angola. For females, Angola had the lowest F1 score, with 61.8%, versus 80.0% for Burundi. Sensitivity ranged from 51.3% for males in Malawi to 79.3% for females in Lesotho. For PPV, the lowest value was for males in Burundi, with 75.0%, versus 100% for males in Angola and Ethiopia. At country level, we then aggregated the HIV status predictions per country to estimate national prevalence.
Our models underestimated country-specific HIV prevalence, with relative differences ranging from -0.6% for females in Zimbabwe to -33.3% for males in Malawi (S12 Table). Fig 4 shows two maps per sex, one representing the predicted prevalence per country (left) and the other the absolute difference between the predicted and the observed prevalence (right). The largest absolute difference was for males in Zambia, with -3.1%, versus -0.1% for males in Burundi and females in Zimbabwe.
Fig 4
Predicted prevalence per country (LH) and absolute difference between predicted and observed prevalence (RH).
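The distinction between the absolute differences (in percentage points, as mapped in Fig 4) and the relative differences (in percent of the observed prevalence, as in S12 Table) can be made explicit; the numbers in the test below are illustrative, not taken from the study.

```python
def prevalence_differences(predicted_pct, observed_pct):
    """Return (absolute, relative) differences between predicted and
    observed national prevalence, both inputs given in percent.
    absolute: percentage points; relative: percent of observed."""
    absolute = predicted_pct - observed_pct
    relative = 100.0 * (predicted_pct - observed_pct) / observed_pct
    return absolute, relative
```

This explains why a small absolute gap can still be a large relative gap in a low-prevalence country such as Malawi for males.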
1) 95% PLHIV know their status
For males, a sensitivity of 95% would require testing 5,450 individuals out of 11,031 (49.4%) to identify 840 HIV-positive individuals out of the 883 PLHIV. The corresponding PPV is 15.4%; 7 individuals would therefore need to be tested to find one HIV-positive person (number needed to test, NNT). For females, 6,696 individuals out of 13,926 (48.1%) would need to be tested to find 1,522 HIV-positive individuals out of the 1,602 PLHIV. The PPV is 22.7% and the NNT is 5.
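The NNT figures above follow directly from the PPV: it is the reciprocal of the yield, rounded up to a whole number of tests.

```python
import math

def number_needed_to_test(ppv):
    """Tests needed, on average, to find one HIV-positive person:
    the reciprocal of the positive predictive value, rounded up."""
    return math.ceil(1 / ppv)
```

For example, a PPV of 15.4% gives 1/0.154 ≈ 6.5, i.e. an NNT of 7, and a PPV of 22.7% gives an NNT of 5, matching the values reported for males and females.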
2) At least 90% probability of being HIV positive
Out of 11,031 males and 13,926 females, 512 males (4.6%) and 837 females (6.0%) were identified as high-risk populations (i.e. with at least a 90% probability of being HIV positive). Overall, 492 males would have been correctly identified as HIV positive out of the 883 male PLHIV (sensitivity of 55.7% and PPV of 96.1%), and 809 females out of the 1,602 female PLHIV (sensitivity of 50.5% and PPV of 96.7%).
Discussion
Using large representative datasets with over 120,000 persons from ten East and Southern African countries, we were able to accurately predict the HIV status of individuals using demographic and socio-behavioural characteristics only. Our approach allowed us to select the nine most important predictor variables common to both sexes: geographic position (longitude, latitude and altitude), current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last sexual intercourse with most recent partner, and wealth index. Using these nine variables to predict HIV positivity dramatically reduces the amount of information needed to identify key populations. We also determined the direction of the association between predictor variables and HIV status. We confirmed a number of established HIV risk factors such as older age or older age of the most recent partner [29], a high number of sexual partners [30], and living in an urban area [31]. Additionally, circumcision and breastfeeding were associated with a lower risk of HIV positivity [31]. Unlike previous findings [32], condom use during the last sexual intercourse increased the probability of HIV positivity in our study. This seemingly counterintuitive finding may be the result of increased condom use among individuals who are already aware of their positive HIV status. The differences in individual HIV status attributable to altitude are likely multifactorial, stemming from environmental, biological, socio-behavioural and policy-level differences that impact infection and transmission [30, 33–35]. The cross-sectional nature of our study limits our ability to investigate this further. We also identified risk factors for HIV infection which have rarely been investigated before. For example, an increased distance to the water source was associated with HIV status; the association could be either positive or negative, but not neutral.
A previous study showed that the risk of sexual assault of women, and hence the risk of HIV infection, increased with the time needed to reach a water source [32]. However, longer times to reach water sources are more common in rural areas, where HIV prevalence is known to be generally lower, hence a decrease in the risk of HIV positivity. Our model was also able to accurately predict the prevalence at country level per sex. The difference in predictive power by country depends on many factors, such as the country's prevalence, its sample size relative to the overall sample, and the similarity of its risk factors to those of its peers. When adapting our predictive algorithm to finding 95% of PLHIV, we needed to test 7 males (NNT of 7; PPV of 15.4%) and 5 females (NNT of 5; PPV of 22.7%) to find one HIV-positive individual. A previous systematic review of different testing strategies showed that NNTs ranged between 3 and 86 for community-based testing strategies and between 4 and 154 for facility-based testing strategies [36]. Our method is consequently among the best-performing testing strategies and can halve the number of tests needed to find 95% of PLHIV compared to current general population testing. When targeted HIV case-finding strategies are implemented to increase the cost-effectiveness of testing, a high yield is important to ensure that many of those tested are HIV positive. It is currently unknown whether additional behavioural-based testing strategies can enhance or complement current targeted case-finding strategies such as index testing. Acceptable cut-offs for both sensitivity and PPV would need to be adapted to specific low-resource settings and to the desired testing coverage. In our second scenario, we identified about 5% of the population at high risk of being HIV positive using a probability cut-off of 90%.
This allowed us to identify more than 50% of all PLHIV, with most of the tested population being HIV positive; the remaining HIV-negative tested individuals are prime candidates for preventative services such as pre-exposure prophylaxis (PrEP). We were consequently able to maximise the efficiency of testing, and we believe that our method would therefore be a valuable addition to current targeted strategies. To our knowledge, this study is the first to use machine learning methods to predict HIV status in East and Southern African countries with generalised HIV epidemics using routinely collected survey data. The main scope was to determine common risk factors of HIV positivity between countries with high HIV prevalence and the predictive ability of machine learning models based on these common risk factors. Hence, one of the limitations of this study was the generalizability of our predictive models to countries that were not used to train the algorithm: the accuracy of the predictions decreased, probably due to different risk factor distributions between countries. Future studies could improve generalizability by selecting countries more similar to the one to which generalization is intended, or by applying our algorithm to country-specific individuals. We were also limited by the variables available in our dataset; as a result, we were unable to consider differences in viral load suppression, health-care expenditure, specific HIV-related interventions, and conflicts and wars. Additionally, the data contained missing values, which required assumptions about their randomness and the use of imputation methods that are by nature imperfect [37]. Finally, although HIV testing was laboratory-based and not self-reported, some results were inconclusive and thus discarded, and a number of variables were self-reported and therefore subject to social desirability and recall bias.
Conclusions
Using machine learning algorithms, we identified strong predictors of HIV positivity. Our findings may explain the spatial variability of HIV prevalence and can inform HIV testing strategies in resource-limited settings. While the implementation of a machine learning-based risk score for targeted interventions proved feasible in rural East Africa [38], the acceptability and use of potentially sensitive behavioural risk factors to directly identify individuals for HIV testing need to be evaluated. Our algorithm performed well with only a limited number of variables, which do not require extensive interviews or questionnaires. This approach may be implemented by clinicians and community health-care workers, or utilised through additional HIV case-finding modalities such as call centres, social media, and self-testing initiatives. The availability of individual-level data on the association of various diseases with socio-behavioural characteristics is rapidly increasing. Advanced methods to analyse these large sources of data can help to prevent, diagnose and treat HIV and other diseases more efficiently.
Values of the F1 score and the Brier score for each of the 50 sets of parameters per algorithm (female ex-Zambia dataset).
(DOCX)
List of the Demographic and Health Surveys (DHS) survey years.
(DOCX)
Data processing.
(DOCX)
Summary statistics of the observed and imputed data for the incomplete variables in the male test dataset.
(DOCX)
Summary statistics of the observed and imputed data for the incomplete variables in the female test dataset.
(DOCX)
List of variables.
(DOCX)
Characteristics of Demographic and Health Survey (DHS) individuals.
(DOCX)
Results of the XGBoost algorithm per sex for the validation (80%; 5-fold cross-validation), test (20%) and left-out (excluded country) samples.
(DOCX)
Results of the Support Vector Machine (SVM) algorithm per sex for the validation, test and left-out samples.
(DOCX)
Results of the ElasticNet algorithm per sex for the validation, test and left-out samples.
(DOCX)
Results of the Generalized Additive Model (GAM) algorithm per sex for the validation, test and left-out samples.
(DOCX)
F1, sensitivity, PPV and Brier score per country for models M4 and F4.
(DOCX)
Predicted prevalence per country.
(DOCX)
Transfer Alert
This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

30 Jul 2021

PONE-D-21-10249

Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa

PLOS ONE

Dear Dr. Orel,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by 7th of September 2020. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results.
Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Daniel Boateng
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript: "We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 to O Keiser) which funded this study." We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 to O Keiser) which funded this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Additional Editor Comments (if provided):

The article addresses a very important subject in Global health. However, both reviewers raised important methodological concerns that must be addressed and resubmitted for consideration.

Reviewers' comments:

Reviewer's Responses to Questions
Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No
Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics.
(Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an interesting and timely application of machine learning to a problem of substantial societal impact. Unfortunately, there are shortcomings in the description of the methods and potentially also methodological issues which I fear invalidate the results (for these reasons I cannot judge whether the statistical analyses are performed appropriately). Please see the attached document for my detailed evaluation of this paper.

Reviewer #2: Reviewer comments on "Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa"

Overall, this is an interesting paper that explores different machine learning methods (algorithms) to predict individual-level HIV status in East and Southern Africa. Although the work could be relevant to the global public health field, more emphasis should be placed on (i) exploration of model accuracy/predictive power by country, African region and at smaller scales within countries, (ii) comparison with previous studies that looked at underlying predictors of HIV prevalence, and (iii) judging the suitability of such models for areas where no or scarce HIV data are available and what this depends on.

The authors can further elaborate on the difference in prediction accuracy by country and by African region: now, the same variables are used for the predictions, but mechanisms underlying HIV trends and risks might be highly country-specific. Does modelling by country lead to the same selection of the 10 most predictive variables? And how much does the predictive power differ by country, and what would this depend on? Also, would it be possible to use the algorithms with the same 10 variables to make predictions for smaller (subnational) areas, so that policy makers could estimate the HIV prevalence in their area when only outdated or unreliable HIV data are available?

Would it be possible to add maps of the measured versus predicted HIV prevalence by country or province, and/or a map of the discrepancy between measured and predicted prevalence? The current figures are nice but quite technical; it would be good to add one or two plots or maps that are easier to interpret without a lot of technical knowledge, to enhance usefulness to policy makers.

Please further explain in the discussion why the 10 most important variables could logically be predictive of HIV status and compare your findings with those of previous studies that looked at predictive variables for HIV status in sub-Saharan Africa, such as:

- Palk, Laurence, and Sally Blower. "Geographic variation in sexual behavior can explain geospatial heterogeneity in the severity of the HIV epidemic in Malawi." BMC Medicine 16.1 (2018): 1-9.
- Dwyer-Lindgren, Laura, et al. "Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017." Nature 570.7760 (2019): 189-193.
- Bulstra, Caroline A., et al. "Mapping and characterising areas with high levels of HIV transmission in sub-Saharan Africa: A geospatial analysis of national survey data." PLoS Medicine 17.3 (2020): e1003042.

For some predictive variables in particular, the predictive mechanisms are not obvious. For example, how does altitude affect HIV status, and are latitude, longitude and altitude included separately? The charm of the applied machine learning algorithms is that you let the data speak, but it would be informative if the authors think about potential mechanisms by which the variables affect HIV status. They might be proxy variables for other, more intuitive, variables of impact.

Some further explanation is needed on splitting the data between testing and training datasets; was this done at the individual level or at the sample location level? And was the 80%-20% selection applied for every country, or only for the overall data sample?

In the abstract, I would suggest adding the included countries, instead of only "10 countries in East and Southern Africa", and explaining what the F1 score is, since readers might not be familiar with it.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Caroline A. Bulstra

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Submitted filename: review_report.pdf

8 Nov 2021

Dear editor,

Thank you very much for the helpful reviews which allowed us to substantially improve the manuscript.
Please find below our answers to the referee comments.

Yours sincerely,
Erol Orel, Aziza Merzouki and Olivia Keiser on behalf of all co-authors.

Comments

Editorial Comments to Authors

Journal Requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

The manuscript meets PLOS ONE's requirements, including those for file naming.

2. Thank you for stating the following in the Acknowledgments Section of your manuscript: "We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 to O Keiser) which funded this study." We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.

Any funding-related text has been removed from the Acknowledgements section.

Currently, your Funding Statement reads as follows: "We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 to O Keiser) which funded this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Amended Funding Statement: "We acknowledge the support of the Swiss National Science Foundation (SNF professorship grant n° 163878 and grant n° 202660 to O Keiser) which funded this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Additional Editor Comments (if provided): The article addresses a very important subject in Global health. However, both reviewers raised important methodological concerns that must be addressed and resubmitted for consideration.

Reviewer 1

In this manuscript, the authors use publicly available socio-behavioural data from the Demographic and Health Surveys program for 10 countries in Eastern and Southern Africa to predict HIV status for individuals using machine learning and statistical learning models. The authors first compared four different model types (penalized logistic regression, generalized additive model, support vector machines and gradient-boosted trees) to identify the best-performing type (gradient-boosted trees) according to the so-called F1 metric, which is a well-known but less transparent metric. Subsequently, the authors identified the subset of most contributing features and built new models of the best type around these. Based on these models, the authors then investigated two scenarios for targeting preventive action: a) the number of persons to be tested to reach a sensitivity of 95% (true positive rate), and b) identifying a population for which the probability of being HIV positive was higher than 95%.

This is an area of significant importance, and indeed an area in which modern machine learning methods can be brought to bear for great societal impact. The paper is furthermore well-written and mostly easy to read, and the figures and tables are generally well-prepared (more on this below).
It therefore saddens me to report that I cannot support publication at this stage due to serious shortcomings of the methods description, and potentially also inconsistencies in the results which undermine the claims of the authors (the latter I cannot judge in full due to shortcomings of the methods description).

Major points

1. The methods section is missing a number of important aspects in order for me to assess in detail the steps the authors have taken. How were the models fitted?

Each algorithm was assessed using a stratified 5-fold cross-validation on 50 sets of hyperparameter values, with the same metrics measured (namely F1, sensitivity, positive predictive value (PPV) and Brier score), the same threshold used (0.5), and best-model selection based on the highest F1 on the test dataset. What differs is the algorithms and their respective objective functions (training loss + regularization), together with the number and type of hyperparameters. We have added the following information as a section named "Models" in the Supplementary Material.

Generalized additive models (Logistic GAM): GAM takes the functional form above with, in our case, the intercept set to zero. Since the scale of the Binomial distribution is known, our grid search minimizes an Un-Biased Risk Estimator (UBRE) objective, UBRE = D/n + 2·s·DoF/n − s, where D is the deviance, n the number of data points, s the scale parameter (equal to 1 in our case) and DoF the effective degrees of freedom of the model. The feature functions are penalized B-splines (P-splines) with 20 basis functions each by default. The smoothing hyperparameter lambda is drawn from a continuous uniform distribution between exp(-3) and exp(3) for each of the spline functions.

Penalized Logistic Regression (Elastic Net): Logistic regression takes the above functional form with, in our case, the intercept set to zero. The two hyperparameters of the loss function were evenly distributed on a logarithmic space (C: logspace(-9, 9), l1ratio: logspace(-9, 0)). The parameter C is the inverse of the regularization strength alpha, and l1ratio is the Elastic-Net mixing regularization hyperparameter, where l1ratio=0 is equivalent to using an l2 penalty and l1ratio=1 to using an l1 penalty.

Support Vector Classifier (SVC): The radial basis function kernel was used for this learning algorithm. The parameter C, evenly distributed on a logarithmic space (logspace(-9, 9)), is the regularization hyperparameter of the loss function, similar to the one of Elastic Net (i.e. the strength of the regularization is inversely proportional to C). The penalty is a squared l2 penalty.

Extreme Gradient Boosting (XGBoost): Nine hyperparameters were included in the random grid search (the details of each parameter space are given in the "Models" section of the Supplementary Material). The objective function is the sum of the training loss and a regularization term, Σᵢ l(yᵢ, ŷᵢ) + Σₖ Ω(fₖ), where in our case l is the log likelihood of the Bernoulli distribution (i.e. log loss). For a more detailed overview of the regularization term and all the hyperparameters, refer to the article by Chen & Guestrin: https://arxiv.org/pdf/1603.02754.pdf.

What were the target variables, and what was the outcome from each of the trainings?

The target variable was the HIV status of the individuals (0 for HIV negative and 1 for HIV positive). We have now added this explicitly at line 15 of the Methods section. The outcome of the training for each of the algorithms can be found in the GitLab repository https://gitlab.com/Triphon/predicting_hiv_status, in the "scripts/algorithm_name/" folder. It consists of the precision, recall, F1 and Brier score obtained for the 50 sets of hyperparameters on each of the 5 validation samples, for each of the 20 datasets (2 sexes and 10 leave-one-country-out datasets), stored in a joblib format.

As far as I can tell from the files shared in the authors' model and code repository, the model selection analysis is not included (and the readme in the repository is not providing much help).

Each algorithm's results can be found in Tables A6i to A6iv in the Supplementary Material and can be extracted from each "scripts/algorithm_name" folder using the algorithm_name.ipynb Jupyter notebook. The selection between the 4 algorithms was then simply done by averaging the F1 scores of the 10 test datasets by sex and ranking the obtained values in descending order. Regarding the variable selection in the second part of the analysis, the scripts and the results of the SFFS procedure can be found in the "scripts/variable_selection" folder ("sffs.ipynb"). The final model with only 9 variables is available in the "scripts/xgb" folder and named xgb_9.ipynb. We have simplified the structure of the repository and rewritten the readme file to better explain this.

As will be apparent further below, details of model training will be important to assess the validity of the results the authors claim. Description of the details of the imputation methods used is missing. Is the variance conserved for the MICE algorithm that the authors use?

We expanded the description of the MICE imputation method, which can be found in the "MICE imputation" section of the Supplementary Material and is referenced in the main manuscript in the Methods at line 19. An in-depth overview of how the imputation was conducted can be found in the script "scripts/data_processing_engineering/imputation.py" in the GitLab repository.
Two descriptive statistics tables (mean, standard deviation, minimum and maximum) of observed and imputed data for the male and female test datasets have been added to the Supplementary Material (Table 5i and Table 5ii). For a detailed description of the XGBoost "imputation" method (indeed more a method for dealing with missing values), please refer to section 3.4 of the XGBoost article https://arxiv.org/pdf/1603.02754v2.pdf. This reference was cited in the main manuscript (reference 23) at line 62 of the Methods section.

And what are the implications of using the imputation of the XGBoost algorithm? This needs to be clarified to know whether this step is artificially limiting or enhancing the presented results.

The two methods gave very similar results, and the implications of using the XGBoost algorithm with missing values were mainly practical. First, the computational time of MICE when dealing with a high number of variables (84 and 122 for males and females, respectively) was approximately 3 days using the High-Performance Computing resources (Baobab cluster) of the University of Geneva, against a few hours when using the XGBoost algorithm. Also, hundreds of lines of code (see "scripts/data_processing_engineering/imputation.py") were needed to perform the MICE imputation, whereas no additional coding was needed with the XGBoost algorithm. We have rephrased this in the main manuscript in the Results section at line 33.

I cannot follow the details of the three-step training process (Fig. 1). Is the whole thing a nested cross-validation where the outer loop (Step 1) is across countries (switching "holdout country") and the inner step (Steps 2 and 3) is the 5-fold cross-validation used during model training? I sense the approach is fine but cannot follow all steps to make sure.

From our entire datasets for males and females, we first left one of the 10 countries out (switching the left-out country) to create 10 different datasets per sex comprised of only 9 countries. This was done for generalization purposes, in order to assess the quality of our models when the data were not drawn from the exact same distribution. Then, each of the 10 newly created datasets per sex was split into a stratified (due to imbalanced outcomes) 80% training set and a 20% test set (Fig. 1, Step 1). A stratified 5-fold cross-validation was then performed for each algorithm on each of the training sets for training and validation (Step 2); this can effectively be seen as the inner step of a classical nested cross-validation. Each of the 10 best models per sex and per algorithm was then tested on the corresponding test set, and the resulting F1 scores were averaged (first part of Step 3), similar to the outer step. Finally, we applied each selected model to the corresponding left-out country dataset (second part of Step 3). We have rewritten the Methods section for better clarity (see lines 23 to 34).

Figures and tables describing data preparation and model training are largely missing captions. This makes it harder to follow the steps; in cases like Table A2 I simply cannot infer how to read the table. What does each row represent? What happens between each row?

The different steps of the data preparation can be found in the Supplementary Material (see lines 1 to 24). We have modified Table A2 in the Supplementary Material to make it self-explanatory and more readable.

2. Two of the four model types, specifically support vector machines and gradient-boosted trees, do not actually model the probability that an individual has HIV unless specific steps are taken (e.g. Platt scaling). It does not seem to be the case that such steps have been taken, but I cannot know for sure due to insufficient methods details provided.
It is highly problematic that gradient-boosted trees (the type selected by the authors) do not return probabilities, since the authors use them to identify the subpopulation which has more than 95% probability of having HIV (scenario 2); put differently, the results from scenario 2 cannot be trusted since gradient-boosted trees are used. This concern seems to invalidate these results. For context: these models assign individuals to classes based on classification rules (the sign of the solution to the convex quadratic optimization problem for support vector machines, the particular splitting rules used in the gradient-boosted trees (usually cross-entropy) [1]) instead of directly modeling the probability that an individual has HIV. These are examples of improper scoring rules which are well known to cause incorrect probabilities. An observed probability can be inferred afterwards from these models, but it will depend on the particular elements of the dataset and as such will change with the addition of a single new data point, and is in general not a good estimate of the actual probability.

We have now calibrated the predicted probabilities obtained by XGBoost using a sigmoid (Platt scaling) and incorporated these changes into the Results section, lines 87 to 93. The differences in scenario 2 resulting from the calibration are as follows: 1) The probability threshold has been moved from 95% to 90%. 2) Without calibration, out of 11,031 males and 13,926 females, 551 males (5.0%) (461 previously, i.e. 4.2%) and 1,113 females (8.0%) (862 previously, i.e. 6.2%) were identified as high-risk populations. Overall, 526 males (447 previously) would have been correctly identified as HIV positive out of the 883 male PLHIV (sensitivity of 59.6% (50.6% previously) and PPV of 95.5% (97.0% previously)), and 1,065 females (833 previously) would have been correctly identified as HIV positive out of the 1,602 female PLHIV (sensitivity of 66.5% (52.0% previously) and PPV of 95.7% (96.6% previously)). 3) With calibration, 512 males (4.6%) and 837 females (6.0%) were identified as high-risk populations. Overall, 492 males would have been correctly identified as HIV positive (sensitivity of 55.7% and PPV of 96.1%) and 809 females would have been correctly identified as HIV positive (sensitivity of 50.5% and PPV of 96.7%).

3. The results in Fig. 2 illustrate substantially lower F1 scores on the left-out samples for support vector machines and gradient-boosted trees (XGBoost), which need to be investigated. This illustrates that the performance found in training is not at all retrieved when the models are applied to new data, i.e. it undermines the trust which can be put in the abilities of these models to offer usable predictions. This puts all presented results at risk of being incorrect! While the authors do observe this, no explanation is offered nor is further investigation conducted. As a minimum, the authors would need to explain why this should not be a point of concern for trusting the results. It is unclear to me whether this is a result of overfitting, a consequence of using the F1 metric (itself an improper scoring rule) to evaluate the models (that metric itself being a nonlinear function of the class assignments of the models), or something else entirely.

We believe this should not be a point of concern for trusting the results because, unlike the test sets (for which the results are similar to the cross-validation), the left-out countries are not drawn from the same probability density function as the training sets.
The purpose of this was to study the generalization of our model on data distributions that differ from the one used to train and validate it. In the Discussion section, the following explanation is given at line 51: "Hence, one of the limitations of this study was the generalizability of our predictive models for countries that were not used to train the algorithm. The accuracy of the prediction decreased, probably due to different risk factor distributions between countries. Future studies could improve the generalizability of our models by training them on more similar countries than the country we aimed at generalizing to."

4. It appears the authors are using the F1 score to pick the best model (in Algorithms, as part of the Methods section). As already mentioned, this is an improper scoring rule (which does not indicate which model best predicts the probability that an individual has HIV), and selecting models based on this metric could lead to models with good F1 scores which are a bad representation of whether or not an individual has HIV! The authors will need to clarify that they are not selecting models based on an improper scoring rule. More details on considerations for the important topic of proper scoring rules for classification can be found in [2] and [3].

While we understand the reviewer's concern about the F1 score being an improper scoring metric, not based on probability and depending on an arbitrary threshold (a "business" decision), we decided to use this metric for three main reasons: 1) the ability to compare our results with previous studies; 2) its appropriateness for imbalanced datasets; 3) the importance of targeting specific sensitivity and PPV values to achieve different testing strategies. This is explained in the Algorithms sub-section of the Methods section, from line 8 to line 13.
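The distinction discussed here, between the threshold-dependent F1/sensitivity/PPV and the probability-based Brier score, can be made concrete in a few lines (illustrative values only, not the study's code; the formulas are the standard definitions):

```python
import numpy as np

def evaluate(y_true, y_prob, threshold=0.5):
    """F1, sensitivity and PPV depend on the classification threshold;
    the Brier score is computed directly on the predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sens = tp / (tp + fn)                  # sensitivity (recall)
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    f1 = 2 * sens * ppv / (sens + ppv)     # harmonic mean of sensitivity and PPV
    brier = np.mean((y_prob - y_true) ** 2)
    return f1, sens, ppv, brier

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.2, 0.9, 0.7])
f1, sens, ppv, brier = evaluate(y_true, y_prob)
# moving the threshold changes f1, sens and ppv, but leaves brier untouched
```

This is why a model can rank well on F1 at a given threshold while still producing poorly calibrated probabilities, and vice versa.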
However, in order to compare our results with proper scoring rules, we have entirely recomputed the part of our analysis comparing the 4 algorithms using the Brier score and populated Tables A6i to A6iv in the Supplementary Material section with its values. XGBoost is still the best performing algorithm, followed by SVM, GAM and finally ElasticNet. In general, the scores obtained by the models with the best Brier score were very similar to the ones obtained with the best F1 score. We have added some sentences about the Brier score results in the Results section from line 14 to line 17.
Minor points
1. The authors use random selection of model hyper-parameters, which is a well-established approach. It would be good with some graphical illustration of the improvement from hyperparameter optimization as an insight into whether model type or hyperparameter tuning is more important for this case.
We have added in the Supplementary Material a graphical illustration (Figure A1) showing the F1 scores per algorithm obtained with the 50 sets of parameters for one of the female datasets with 9 countries (ex-Zambia) as a case example. In addition, we have also compared it with the Brier scores obtained.
2. The authors use Shapley values to assess the impact of each covariate on the outcome, which is also an established approach. However, readers without deep statistical training could be led to believe that the type of impact suggested by Shapley values could be used for shaping intervention strategies. Unfortunately, this would not be correct because Shapley values do not describe the causal impact of each covariate, only the additional change in overall outcome by adding this covariate. I would suggest adding a comment along these lines to mitigate any unintended conclusions.
A comment has been added at the top of the Shapley values graph to explain these limitations.
“It has to be noted here that Shapley values do not describe the causal impact of each covariate, only the additional change in overall outcome by adding this covariate.”
Reviewer 2
Overall, this is an interesting paper that explores different machine learning methods (algorithms) to predict individual-level HIV status in East and Southern Africa. Although the work could be relevant to the global public health field, more emphasis should be placed on (i) exploration of model accuracy/predictive power by country, African region and at smaller scales within countries, (ii) comparison with previous studies that looked at underlying predictors of HIV prevalence, and (iii) judging the suitability of such models for areas where no or scarce HIV data are available and what this depends on.
The authors can further elaborate on the difference in prediction accuracy by country and by African region: now, the same variables are used for the predictions, but mechanisms underlying HIV trends and risks might be highly country-specific. Does modelling by country lead to the same selection of the 10 most predictive variables?
The main scope of this study was to determine common risk factors of HIV positivity between countries with high HIV prevalence and the predictive ability of machine learning models based on these common risk factors. Hence, we did not train our algorithm with country-specific individuals. However, although this was not in the scope of this paper, we now mention it in the Discussion section line 55 as something worth exploring in the future.
And how much does the predictive power differ by country, and what would this depend on?
We computed the F1 score, sensitivity and PPV values separately for each of the 10 countries, as obtained by the model. We have created Table A7 and added it in the Supplementary Material.
The difference in predictive power by country would depend on many factors, some of them being: the prevalence of the country, the percentage of the country sample size compared to the overall sample, the similarity of the risk factors between the country and its peers, etc.
Would it be possible to add maps of the measured versus predicted HIV prevalence by country or province, and/or a map with the discrepancy in measured versus predicted prevalence?
Yes, the predictions of individuals’ HIV status can be pooled by country in order to compute predicted prevalence. We have added Table A8 in the Supplementary Material, which shows the predicted versus the measured prevalence per country, and the absolute and relative differences between both. In addition, we have produced two maps per sex showing the predicted prevalence per country and the absolute differences. We have incorporated these maps in the main document by replacing the previous Figure 4. We have reported the results in the Results section from line 73 and discussed these findings in the Discussion section from line 28.
Also, would it be possible to use the algorithms with the same 10 variables to make predictions for smaller (subnational) areas? So that policy makers could estimate the HIV prevalence in their area when outdated or unreliable HIV data are available.
Yes, by the same methodology used to compute predicted prevalence at country level, we could compute predicted prevalence at district level.
The current figures are nice, but quite technical. It would be good to add one or two plots or maps that are easier to interpret without a lot of technical knowledge, to enhance usefulness to policy makers.
Following your advice, we have replaced the quite technical Figure 4 with the maps discussed above. We have also removed Figures 3A and 3B for the sake of simplicity and better clarity.
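The pooling of individual-level predictions into a predicted prevalence per country (or, by the same logic, per district) described above can be sketched as follows; the data and the `predicted_prevalence` helper are illustrative, not the authors' code:

```python
# Sketch: pool individual predicted HIV statuses (0/1) into a predicted
# prevalence per geographic unit, as done for Table A8. Toy data only.
from collections import defaultdict

def predicted_prevalence(records):
    """records: iterable of (unit, predicted_status) pairs, status in {0, 1}.
    Returns {unit: fraction predicted positive}."""
    totals = defaultdict(lambda: [0, 0])  # unit -> [positives, n]
    for unit, status in records:
        totals[unit][0] += status
        totals[unit][1] += 1
    return {unit: pos / n for unit, (pos, n) in totals.items()}

records = [("Malawi", 1), ("Malawi", 0), ("Malawi", 0),
           ("Zambia", 1), ("Zambia", 1), ("Zambia", 0), ("Zambia", 0)]
print(predicted_prevalence(records))  # Malawi ~0.33, Zambia 0.5
```

Comparing these pooled values against the measured survey prevalence per unit then yields the absolute and relative differences reported in the table.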
We were unfortunately limited to five figures and tables in total; hence, we have added in the Supplementary Material section Table A7 showing the different values of the F1 score, sensitivity and PPV per country, and Table A8 highlighting the difference between predicted and measured prevalence per country.
Please further explain in the discussion why the 10 most important variables could logically be predictive of HIV status and compare your findings with those of previous studies that looked at predictive variables for HIV status in sub-Saharan Africa, such as:
- Palk, Laurence, and Sally Blower. "Geographic variation in sexual behavior can explain geospatial heterogeneity in the severity of the HIV epidemic in Malawi." BMC Medicine 16.1 (2018): 1-9.
- Dwyer-Lindgren, Laura, et al. "Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017." Nature 570.7760 (2019): 189-193.
- Bulstra, Caroline A., et al. "Mapping and characterising areas with high levels of HIV transmission in sub-Saharan Africa: A geospatial analysis of national survey data." PLoS Medicine 17.3 (2020): e1003042.
We have further explained in the Discussion section (from line 10 to line 27) how the most important variables found by our method could logically be predictive of individual HIV status and compared it with previous studies on risk factors. We have included in these comparisons the studies you have kindly brought to our attention, and these papers have been added to the References section.
The charm of the applied machine learning algorithms is that you let the data speak, but it would be informative if the authors think about potential mechanisms in which the variables affect HIV status. They might be proxy variables for other, more intuitive, variables of impact. For some predictive variables in particular, the predictive mechanisms are not obvious.
For example, how does altitude affect HIV status, and are latitude, longitude and altitude included separately?
We have added the following sentence line 17 in the Discussion section: “The differences in individual HIV status due to the altitude are likely multifactorial. These factors stem from environmental, biological, as well as social and policy-level differences that impact infection and transmission.” Yes, altitude, latitude and longitude have been included separately; however, algorithms based on decision trees are able to discover interactions among independent variables.
Some further explanation is needed on splitting the data between the testing and training datasets; was this done at the individual level or at the sample location level?
The split of data between training and testing has been done at the individual level. We have added this precision in the Methods section line 26.
And was the 80%-20% selection present for every country, or only for the overall data sample?
The 80/20 split has been performed on each of the 10 subsets per sex (each comprising 9 countries) in the first part of our analysis, and on the overall data samples per sex in the second part of our analysis.
In the abstract, I would suggest adding the included countries, instead of only “10 countries in East and Southern Africa”, and explain what the F1 score is, since readers might not be familiar with this.
We have added the countries’ names in the Introduction section line 3 of the Abstract and a definition of the F1 score in the Methods section line 5 of the Abstract.
Submitted filename: Response to Reviewers.docx
15 Dec 2021
PONE-D-21-10249R1
Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa
PLOS ONE
Dear Dr. OREL,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by Jan 29 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Daniel Boateng
Academic Editor
PLOS ONE
Journal Requirements:
Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: (No Response)
Reviewer #2: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Partly
Reviewer #2: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: No
Reviewer #2: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
**********
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The authors have addressed nearly all the comments I raised in the first review, and have done so in satisfactory fashion. There seems to be one open point on the statistical methods related to their imputation methods, which I think should be addressed before the paper can be accepted for publication. More details in attached document.
Reviewer #2: The authors have rigorously improved the methods descriptions in the manuscript and have added several sensitivity analyses and visualisations to further improve understanding of the results for the general public. My recommendation is to accept the paper for publication in PLOS One.
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments".
If this link does not appear, there are no attachment files.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
Submitted filename: review.docx
31 Jan 2022
Reviewer 1
I am happy to see that the authors have addressed most of my comments. In their reply, they provide satisfactory replies and they have made good changes to the manuscript, supplement and code repository, which makes it easier to follow. I am also happy to see the new Fig 4, which is a great addition. Unfortunately, there are still two outstanding points which I think should be further elaborated.
1) The authors have added a description of the MICE imputation as I requested, but it is unfortunately not clear from this description if steps have been taken to ensure that the imputation does not affect the variance of the imputed variables. If no steps are taken, imputation can influence the distribution of the imputed variable (often reducing the variance, sometimes also affecting the mean), which will likely impact the results (e.g. influence the ability to provide accurate predictions). Judging from the first 10-15 rows in Table S3i, it seems that the imputed variables have lower variance. The authors should clarify which steps have been taken to ensure that variance and mean have not been affected by the imputations.
Many thanks, we are happy that the important work done to address the comments and revise the paper is acknowledged and valued. Regarding MICE imputation, imputing missing data has always been a challenge, and as stated in the reviewer’s comment, the results of any statistical analysis can be only as good as the quality of the data (garbage in, garbage out), and this is particularly true in the presence of missing data. In our specific case, we have assumed that the missing values were Missing At Random (MAR) and used the MICE imputation method (similar to the flowchart below from Jakobsen, J.C., Gluud, C., Wetterslev, J. et al. When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts.
BMC Med Res Methodol 17, 162 (2017). https://doi.org/10.1186/s12874-017-0442-1). We now explicitly mention this assumption in the revised manuscript (line 129) and cite the additional reference paper from Jakobsen et al.
In other words, we assumed that there might be a relationship between the propensity of missing values and the observed data. MAR data is more common than Missing Completely At Random (MCAR) data in all disciplines. Whereas in the case of MCAR, missing and observed observations are generated from the same distribution, in the case of MAR the missing and observed observations no longer come from the same distribution. Nguyen et al. [Nguyen, C.D., Carlin, J.B. & Lee, K.J. Model checking in multiple imputation: an overview and case study. Emerg Themes Epidemiol 14, 8 (2017). https://doi.org/10.1186/s12982-017-0062-6] suggest that these discrepancies between observed and imputed data are not necessarily problematic, since under MAR we may expect such differences to arise.
In order to check the models obtained in MICE and to ensure the validity of our findings, we have taken a multifaceted approach:
- We assessed the plausibility of the imputed data using subject matter knowledge, harmonizing imputed data to the observed data types (e.g. dichotomous) and to the domain of definition or scale (e.g. strictly positive or maximum value). For a complete overview, the Python script can be found in the project repository at “scripts/data_processing_engineering/imputation.py”. This reference to the code was added to the Supplementary Material in the MICE imputation section.
- We compared the imputed data with the observed data to identify major problems with the imputation model. As a rule of thumb, Stuart et al. [Stuart EA, Azur M, Frangakis C, Leaf P. Multiple Imputation with large data sets: a case study of the children’s mental health initiative. Am J Epidemiol. 2009;169(9):1133–9.]
proposed comparing the means and variances of observed and imputed values and suggested flagging variables if the ratio of variances of the observed and imputed values was less than 0.5 or greater than 2, or if the absolute difference in means was greater than two standard deviations.
- We compared the scores and results obtained with the MICE imputation method to the ones obtained using the XGBoost built-in method and found very similar values. This is presented in the main manuscript, lines 241-248.
- We used MICE imputation in the first part of our analysis, when we compared the predictive performance of the different algorithms. The performance of all algorithms was compared over the same potentially imperfect imputed datasets. We did not use the MICE imputed datasets in the second part of our analysis, i.e. for variables selection and HIV status prediction (we used the XGBoost built-in method).
Finally, we have added as a limitation the following sentence, lines 376-378 of the “Discussion” section: “Additionally, missing values were to be found in the data and implied making assumptions about their randomness and using imputation methods that are necessarily imperfect by nature.”
2) The authors take several steps during their model identification and training procedure. This methodological rigor is a strong point of the paper.
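Returning to the imputation diagnostics above, the Stuart et al. rule of thumb can be expressed as a small check: flag an imputed variable if the observed-to-imputed variance ratio falls outside [0.5, 2], or if the absolute difference in means exceeds two standard deviations of the observed values. A sketch on toy data (`flag_imputation` is a hypothetical helper, not the authors' script):

```python
# Stuart et al. (2009) rule-of-thumb diagnostic for imputed variables,
# sketched on toy data for illustration only.
from statistics import mean, pvariance, pstdev

def flag_imputation(observed, imputed):
    """Return True if the imputed values look suspicious under the rule."""
    var_ratio = pvariance(observed) / pvariance(imputed)
    mean_diff = abs(mean(observed) - mean(imputed))
    return var_ratio < 0.5 or var_ratio > 2 or mean_diff > 2 * pstdev(observed)

observed    = [10, 12, 14, 16, 18]
ok_imputed  = [11, 13, 15, 17]           # similar spread and mean -> not flagged
bad_imputed = [13.9, 14.0, 14.1, 14.2]   # variance collapsed -> flagged
print(flag_imputation(observed, ok_imputed),
      flag_imputation(observed, bad_imputed))  # prints: False True
```

The collapsed-variance case is exactly the pattern the reviewer suspected in Table S3i, which is why this diagnostic is a useful routine check after MICE.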
Unfortunately, it can still be hard to follow all the steps the authors take, so I think it could improve the paper even more if the methods section started with an overview of the different steps.
For better clarity, we have rewritten the different steps for the model identification and training procedure in the “Methods” section by adding a specific sub-section named “Training, validation and test procedure steps” starting line 23.
Furthermore, upon identifying the best model, the authors seem to follow a different process when fitting the selected model in order to draw the conclusions they present (using data from all 10 countries, fitting using a regular five-fold cross validation). I cannot seem to find a description of whether any special considerations have been made to the test sets for this model (which is used to generate the results in the paper) – are they randomly selected? Stratified? It would be good if the authors could comment on this as well.
Thank you for pointing this out. We used the exact same training, validation and testing strategy as in the first part of our analysis, except that no country was left out. We have made it more explicit in the “Methods” section, lines 178-181 (beginning of the sub-section “Variables selection and HIV status prediction”):
“For variables selection and HIV status prediction, we used the exact same training, validation and testing strategy as in the first part of our analysis except that no country was left out. We split each unique dataset per sex into a stratified 80% training and validation set and a 20% test set. The best algorithm was trained and validated using a random grid-search over 250 sets of hyperparameters and a stratified 5-fold cross-validation.”
Submitted filename: Response to Reviewers.docx
11 Feb 2022
Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa
PONE-D-21-10249R2
Dear Dr.
Orel,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Daniel Boateng
Guest Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
18 Feb 2022
PONE-D-21-10249R2
Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa
Dear Dr. OREL:
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours.
Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
If we can help with anything else, please email us at plosone@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.
Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Daniel Boateng
Guest Editor
PLOS ONE
Feller DJ, Zucker J, Yin MT, Gordon P, Elhadad N. J Acquir Immune Defic Syndr. 2018 Feb 1.
Koss CA, Ayieko J, Mwangwa F, Owaraganise A, Kwarisiima D, Balzer LB, Plenty A, Sang N, Kabami J, Ruel TD, Black D, Camlin CS, Cohen CR, Bukusi EA, Clark TD, Charlebois ED, Petersen ML, Kamya MR, Havlir DV. Clin Infect Dis. 2018 Nov 28.
Suthar AB, Ford N, Bachanas PJ, Wong VJ, Rajan JS, Saltzman AK, Ajose O, Fakoya AO, Granich RM, Negussie EK, Baggaley RC. PLoS Med. 2013 Aug 13.
Bisaso KR, Karungi SA, Kiragga A, Mukonzo JK, Castelnuovo B. BMC Med Inform Decis Mak. 2018 Sep 4.
Bulstra CA, Hontelez JAC, Giardina F, Steen R, Nagelkerke NJD, Bärnighausen T, de Vlas SJ. PLoS Med. 2020 Mar 6.