OBJECTIVE: Quantifying HIV incidence is essential for tracking epidemics but doing this in concentrated epidemic can be a particular challenge because of limited consistent high-quality data about the size, behaviour and prevalence of HIV among key populations. Here, we examine a method for estimating HIV incidence from routinely collected case-reporting data. METHODS: A flexible model of HIV infection, diagnosis and survival is constructed and fit to time-series data on the number of reported cases in a Bayesian framework. The time trend in the hazard of infection is specified by a penalized B-spline. We examine the performance of the model by applying it to synthetic data and determining whether the method is capable of recovering the input incidence trend. We then apply the method to real data from Colombia and compare our estimates of incidence with those that have been derived using alternative methods. RESULTS: The method can feasibly be applied and it successfully recovered a range of incidence trajectories in synthetic data experiments. However, estimates for incidence in the recent past are highly uncertain. When applied to data from Colombia, a credible trajectory of incidence is generated which indicates a much lower historic level of HIV incidence than has previously been estimated using other methods. CONCLUSION: It is feasible, though not satisfactory, to estimate incidence using case-report data in settings with good data availability. Future work should examine the impact on missing or biased data, the utility of alternative formulations of flexible functions specifying incidence trends, and the benefit of also including data on deaths and programme indicators such as the numbers receiving antiretroviral therapy.
OBJECTIVE: Quantifying HIV incidence is essential for tracking epidemics but doing this in concentrated epidemic can be a particular challenge because of limited consistent high-quality data about the size, behaviour and prevalence of HIV among key populations. Here, we examine a method for estimating HIV incidence from routinely collected case-reporting data. METHODS: A flexible model of HIV infection, diagnosis and survival is constructed and fit to time-series data on the number of reported cases in a Bayesian framework. The time trend in the hazard of infection is specified by a penalized B-spline. We examine the performance of the model by applying it to synthetic data and determining whether the method is capable of recovering the input incidence trend. We then apply the method to real data from Colombia and compare our estimates of incidence with those that have been derived using alternative methods. RESULTS: The method can feasibly be applied and it successfully recovered a range of incidence trajectories in synthetic data experiments. However, estimates for incidence in the recent past are highly uncertain. When applied to data from Colombia, a credible trajectory of incidence is generated which indicates a much lower historic level of HIV incidence than has previously been estimated using other methods. CONCLUSION: It is feasible, though not satisfactory, to estimate incidence using case-report data in settings with good data availability. Future work should examine the impact on missing or biased data, the utility of alternative formulations of flexible functions specifying incidence trends, and the benefit of also including data on deaths and programme indicators such as the numbers receiving antiretroviral therapy.
In any setting, one of the most important pieces of information in responding to an HIV epidemic, evaluating past efforts, and planning for the future is the time-course of the HIV incidence rate.Direct observation of new infections, whether through cohort studies or via the use of incidence assays and algorithms [1], is unfeasible in most national settings. Mathematical modelling provides an alternative mean to infer incidence using other more readily available data. Currently, UNAIDS estimates of HIV incidence trends in generalized epidemics are derived through fitting a simple model to HIV prevalence measurement among pregnant women and prevalence measurements in national household surveys [2,3]. This approach has been less satisfying in settings with concentrated epidemics because of a combination of factors: inconsistencies in the measurements of prevalence over time, scarce prevalence measurements among hard-to-reach groups and lack of robust estimates on the size of these groups. For this reason, it is important to explore other means of estimating HIV incidence in these settings, which would not exclusively rely on estimates of HIV prevalence and size estimates of key populations.Other sources of available data, which could potentially be combined with models to draw inference on incidence in concentrated epidemics, include time-series data on AIDS deaths and reported cases. The use of estimated death time series is investigated in another article in this collection (Stover et al. in this supplement). Here, we focus on the use of reported cases.In many settings, especially in Latin America and Europe, systems are in place to centrally record the number of persons newly diagnosed with HIV. In many settings, it is believed that the coverage of this system (the proportion of newly diagnosed cases that are counted) could be as high as 80% [4,5]. However, the interval between an infection and a diagnosis is not known and may change over time, and this would confound all estimates of incidence unless further information could be included on the interval between testing and diagnosis. Possible candidates to inform the interval between infection and diagnosis are the clinical stage and CD4+ cell count [6]. Further, in the future, measurement of a well characterized biomarker of recent infection may also be available [7].In this study, we build on prior work [8,9] to develop a modelling framework that can be used to estimate HIV incidence drawing primarily on HIV case-report data. We then use synthetic data to test the performance in recovering a wide range of possible incidence trajectories. Finally, we apply the model to Colombia to derive HIV incidence estimates, which are then compared with other estimates available for Colombia.
Methods
Mathematical model
The proposed model consists of a deterministic mathematical model to simulate the process of HIV infection and disease progression, death and diagnosis (Fig. 1).
Fig. 1
Model of HIV infection and progression.
Model of HIV infection and progression.Parameter values and prior distributions for calibration can be found in Table 1.
Table 1
Model parameters.
This structure is expressed through a set of six ordinary differential equations as follows:Where U(t)is the number of persons HIV susceptible at time t; I1(t) is the number of infected individuals in acute infection at time t; I(t) (j = 2, . . ., 5) is the number of infected individuals in CD4+ cell count category j at time t and s(t)is the time-varying rate at which the population of infected individuals transmits HIV to susceptible persons. The rate at which new infections occur in this population is given by s(t)U(t). All symbols and parameters are described in Table 1.Model parameters.At infection, individuals pass through a stage of acute infection following which individuals enter one of four CD4+ cell count categories (<200, 200–350, 350–500, 500+ cell/μl). Thereafter, individuals progress to lower CD4+ cell counts at rates which have previously been estimated from European data [10]. Death due to AIDS in the model is assumed to occur at a fixed rate only among individuals with CD4+ cell count below 200 cell/μl. The overall median survival time is 10.4 years, consistent with estimates from low-income and middle-income countries [11]. In the model framework, these parameters describing the progression of HIV infections are considered to be known perfectly.At all times after infection, individuals can be diagnosed with HIV. At that point they transition into a ‘diagnosed’ category. The number of HIV-infected persons that transition to the diagnosed category in a given year in the model is compared with the data on the number of reported cases.
Model parameterization
HIV infection rate
Following Hogan et al.[2], the instantaneous hazard of infection for susceptible individuals s(t) is specified as a B-spline function of time. The B-spline is parameterized by a vector of coefficients β1, . . ., n, where n-3 is the number of knots of the spline which were evenly spaced on the time interval 1975–2015. This flexible functional form allows many credible incidence trajectories while constraining the trajectory to evolve smoothly with respect to time. This does not attempt to provide a mechanistic description of HIV transmission. Note that in other formulations, the spline describes a force of infection (which gives incidence when multiplied with the product of U(t) and all infected individuals) rather than a hazard of infection, but this is not possible here as, for parsimony, we do not model the entire infected population.We set β1 = β2 = β3 = 0 to anchor the incidence trajectory at zero before 1975, whereas β to β6 were independently specified with uniform priors:β4 ∼U(0, 5), β5 ∼U(-5, 5)and β6 ∼U(-5, 5). We impose penalties [12] on the spline, in order to limit over-fitting to potentially noisy data and to represent our prior assumption that rapid large oscillations in the HIV incidence rate are unlikely. We use second-degree difference penalties [12] expressed as:β = 2β - β + μ, . . . , for i>6where the error μ is the amount of deviance from the trajectory defined by the two previous coefficients.The prior for μ is normally distributed with parameters N(0, τ2) , where τ2 is a hyper-parameter controlling the overall smoothness of the function; small values of τ2 result in little deviance from the previous trajectory and large values in contrast produce more flexible curves.
HIV detection rates
The rate at which HIV-infected individuals are diagnosed is ρ(t), and is allowed to vary by CD4+ cell count category j and calendar time t:where T is the average time (years) to reach CD4+ cell count category below 200 cell/μl (j = 5) for those in CD4+ cell count category j = 1, . . . , 4(T = 8.17; T = 7.93; T = 6.74; T = 3.74; T = 0).This parameterization sets the rate of detection to vary linearly as the time to reach CD4+ <200 cell/μl decreases. The slope of this line is specified by parameter ω, which allows the linear relationship to be either positive or negative. With ω > 0, individuals with high CD4+ cell counts are more likely to be detected, which is plausible scenario if exposure to HIV is suspected or symptoms of acute infection drive testing; with ω < 0, individuals with low CD4+ cell counts are more likely to be detected, which may be true if HIV testing is prompted by increasing experience of illness. The prior on ω reflects that both of these possibilities are equally likely: ω∼U(-1, 1). Parameter m is a constant scaling parameter which ensures that the diagnosis rate is greater than zero for all CD4+ categories, irrespective of the value of ω.Rate of testing for those in CD4+ cell count category below 200 cell/μL (ρ) has a prior distribution such that the average interval from infection to diagnosis is uniformly distributed between 0.5 and 10 years: .σ(t) describes the change in the rates of diagnosis over time. Here we assume that this change can be represented by the logistic function: where r is the rate of increase, and t50 is the time to reach half the upper limit rate. Parameter r is set at 0.003 and a prior was placed on t50 that allows a wide range in shapes in σ(t) from a stable trend over time (t50 = 1) to a sharp increase in the rate of diagnosis from close to zero at the start of the epidemic (t50 = 0.1) : t50 ∼U(0.1, 1).
Model validation and calibration procedures
The inference on the HIV incidence rate is done in a Bayesian framework. The likelihood is given by the probability of observing the number of HIV cases without AIDS reported in a year, the number of HIV cases with AIDS reported in a year (i.e. symptomatic illness or CD4+ cell count <200 cell/μl) and the number of persons diagnosed with HIV after death in a given year, each modelled as Poisson process. We used the Metropolis Hastings algorithm with component-wise updating to sample from the posterior distribution of the model. We used an adaptive proposal density variance in order to achieve an acceptance rate of approximately 30%. Three chains were run in parallel for 500 000 iterations after which chains were visually inspected for convergence and 50% of the initial runs were discarded.Models are estimated that use B-spline functions with number of knots n = 7, 8, 10, and with hyper-prior for τ2 variously given as ∼U(0.001,0.1), ∼U(0.1,0.5), ∼U(0.5,1.0), ∼U(1.0,1.5) or ∼U(1.5, 2.5), giving a total of 20 models runs for each estimation problem. Among these, a model is chosen that has the greatest agreement to data, given its number of effective parameters, as measured by the model with smallest deviance information criterion (DIC) [13]. For that chosen model, a sample from the joint posterior distribution is formed by randomly sampling 1000 iterations across all its chains. Medians and 95% credible intervals are reported for the posterior samples.
Assessment of model performance using synthetic data
The performance of model in recovering the ‘correct’ incidence trajectory from sets of synthetic data was assessed. One major concern was that the specification of this estimation method would render it unable to correctly discern all possible trajectories with different amounts of variation over time or timing of inflections. For this reason, we created three ‘challenge datasets’ using the same assumptions as the estimation model about natural history but varying the number in the spline function that generates that incidence rate (n = 7, 8, 10). Synthetic data were generated under the assumption of constant rates of diagnosis over time and CD4+ cell count.
Application of the model to Colombia
Between 1983 and 2010, Colombia reported 78 999 HIV/AIDS patients, the vast majority coming from urban areas and major cities [17]. Overall prevalence was estimated at 0.22% in 2009 in the adult population [18], but HIV prevalence appears to be substantially concentrated among MSM, among whom prevalence has been estimated to be between 5.6 and 24.1% in different cross-sectional studies in 2010 [19]. In our method, we assumed equivalence between an AIDS diagnosis in the data and a diagnosis for a patient with a CD4+ cell count below 200 in the model.There have been official records of the number of deaths that were classified as AIDS deaths among persons who had not previously been recorded as HIV-infected (i.e. post-mortem HIV diagnosis) since 1985 [20]. No reporting bias or misclassification in this is assumed.We used data on the total number of HIV tests reported to have been performed in Colombia to centre the prior on the trend in the rate of diagnosis over time [21]. This trend can be broadly described as a sigmoid increase with a turning point in 1996 and a plateau in the more recent years.These data were used in the method outlined above to estimate incidence over time. The results were compared with reported estimates of incidence from UNAIDS [22], which are based on fitting a model to prevalence data and population size estimates for key populations [3].
Results
We created three challenge datasets with different epidemic trajectories and the proposed method was able to successfully recover each of them (Fig. 2). A part of the method is to estimate multiple models and then select that model with the lowest DIC; as expected, the model that was favoured in each of these tests was the model which has the same number of knots as that used to generate the synthetic data. Also as expected, uncertainty becomes greater in more recent years as the data on case reports are less informative on current incidence rates. Uncertainty for estimates in the recent years is especially great when the data indicate that trend in incidence has changed.
Fig. 2
Incidence estimation with three configurations of the spline function.
Incidence estimation with three configurations of the spline function.In each panel, the black solid line is the true simulated incidence trend; the dashed red is the incidence estimation using the proposed method (median of the posterior distribution); the shaded blue region gives a 95% credible interval for incidence using the proposed method. The synthetic data were created using a model with (a) 7 knots, (b) 8 knots or (c) 10 knots.
Estimating HIV incidence in Colombia
The model was applied to Colombian data using the proposed method. Model estimates and data on HIV cases, AIDS cases and post-mortem detection are presented in Fig. 3a–c and resultant estimates of HIV incidence are presented in Fig. 3d. For Colombia, the model formulation with the lowest DIC was one with eight knots and hyper-prior τ2 ∼U(1, 1.5).
Fig. 3
Model fit to data in Colombia.
Model fit to data in Colombia.Panels with data points (blue dots) as reported on a yearly basis by Colombian surveillance system, in three categories: (a) cases previously undiagnosed and only detected after death, (b) cases detected during AIDS stage and (c) cases detected and classified as non-AIDS at the time of diagnosis. In a–c, the grey shaded area are the posterior estimation from the model chosen for Colombia. (d) The resulting HIV incidence trajectory in Colombia from the proposed method (red dashed line) with 95% credible interval (shaded area).The model has a good fit with the observed data (Fig. 3a–c) in the early years, whereas uncertainty is widely propagated in the recent years as noted in the experiments with synthetic data. The data for post-mortem-detected cases and AIDS cases exhibit complex patterns which the model is forced to reconcile with the clear monotonic increase in reported HIV cases and with our a priori belief that the incidence trajectory should vary smoothly over time.The resulting estimate of HIV incidence (Fig. 3d), describes an early peak in new infections in 1990 and resurgence in the epidemic since 2000. The period with the highest incidence rate is estimated to be 2008–2009.These incidence estimates can be compared with those previously presented by UNAIDS and which are based on an entirely different methodology whereby models are fitted to observed prevalence data and estimated sizes of key population (Fig. 4). In recent years, the estimates derived through the proposed method and the UNAIDS methods are in very close agreement. However, the UNAIDS-estimated historical trajectory of incidence suggests very high-peak incidence during 1995, of 0.3 per 100 person-years at risk (pyar), whereas the proposed method suggests an incidence rate much lower, at 0.03 per 100 pyar.
Fig. 4
Estimated number of new HIV infections in Colombia from the proposed method (red dashed lines) and current UNAIDS methods (blue dots).
Estimated number of new HIV infections in Colombia from the proposed method (red dashed lines) and current UNAIDS methods (blue dots).The orange shaded area shows the 95% credible interval of the estimates of the proposed method. Note that UNAIDS does compute uncertainty intervals for most of its statistics but intervals were not retrievable from online data sources for this statistic [22].
Discussion
We examined a method for estimating HIV incidence from case-report data. We have shown that it is capable of correctly recovering a wide range of possible incidence trajectories and have applied to data from Colombia to give complementary estimates of incidence to that derived from other methods. The present formulation gives highly uncertain estimated for most recent years, limiting its usefulness.Although it is promising that our method can broadly identify a wide range of incidence trajectories, some small inconsistencies remained. We believe these will be because of the setting of the fixed placement of the knots and the rigid ordering of unconstrained and constrained knots. It will be possible to determine whether the evaluation of a wider range of models that incorporate these differences and model selection based on DIC is sufficient to overcome these challenges. This would inevitably increase the computational demands of the estimation procedure and so alternative methods for estimating the model will also be explored.It will also be useful to examine how this method works under different conditions of data availability. In the challenges with synthetic data, we assumed that all required data were present and unbiased. Although this may be a reasonable assumption for some settings, it will not be universally the case and small errors in early estimates of number of cases may propagate to large errors in the inferred incidence trajectory. We also used the same assumption for the natural history of HIV (survival rates and CD4+ progression) in the method that generated the synthetic data as in the estimation model, which will inevitably flatter the performance of the model. It will be crucial to test the performance of the model when alternative assumptions are made and the model cannot be considered to have been fully tested until this step is done. A further step would be to examine the impact of the survival rates used to generate the synthetic data departing from those assumed in the estimation model, as well as propagating uncertainty in the assumptions about natural history in the estimation procedure. Finally, we did not assess the impact of the use of inappropriate assumptions on the trend in diagnosis rates over time.Here, we allowed HIV detection rates to vary by different CD4+ cell count stage, according to a prior which reflects a wide spectrum of detection scenarios. It is possible that unlikely detection configurations are allowed within this approach, but the absence of evidence to support a more restrictive parameterization led us to opt for the more flexible assumption. Ideally, in a model-aggregation framework specific detection rates by group of transmission and stage of disease would be modelled separately, but this will only be feasible for those contexts with highly detailed case-report data and evidence of differential HIV detection and linkage to care.This method gives uncertain results, especially in recent years, because there remains confounding by potential changes in the interval between infection and death and because numbers of cases and deaths are informative of incidence rates sometime in the past rather than current incidence. Therefore, in the current form, this method cannot be expected to reliably detect recent changes in incidence, which would be very important in monitoring a national epidemic. We expect, however, that it would be highly advantageous for this method to draw on further data. It will be possible to extend this method to incorporate data and/or estimates on deaths and numbers on antiretroviral therapy (ART), which should contribute to discriminating the earlier part of incidence trajectory in particular. Although it is not commonly used at present, the addition of measurement of biomarkers that relate to recency of infection at diagnosis could also help discriminate current incidence levels, though this would depend on the characteristics of the biomarker and how well they are known (Bao et al. in this collection).Our estimates of HIV incidence for Colombia must be interpreted with a high degree of caution and in light of our method's assumptions and present limitations. Of particular note, the estimates from the proposed model suggest that incidence was historically much lower than the levels the previous UNAIDS estimate indicate. This could stem from our assumption of complete case reporting and no misclassification of AIDS deaths: if, in fact, some diagnoses go unreported and some true AIDS deaths are misclassified, then our estimates will be too low. Nevertheless, it seems unlikely that these biases would fully explain the discrepancy. Added to this, we note that the level of incidence estimated here is in closer agreement with that recently reported by Murray et al.[23].In conclusion, this method is promising technique for estimating HIV incidence trends that does not rely on using prevalence data and size estimates of key populations, and which leverages high quality routinely collected data. Future work should focus on updating the model structure to allow other forms of surveillance data, such as AIDS deaths and people on ART, and further scrutiny of model performance under circumstances of missing or biased availability of data.
Acknowledgements
T.B.H. thanks UNAIDS, The World Bank, The Rush Foundation and the Bill & Melinda Gates Foundation for funding support. A.C. thanks the NIH for funding through the NIAID cooperative agreement UM1 AI068619.
Authors: Sara Lodi; Andrew Phillips; Giota Touloumi; Ronald Geskus; Laurence Meyer; Rodolphe Thiébaut; Nikos Pantazis; Julia Del Amo; Anne M Johnson; Abdel Babiker; Kholoud Porter Journal: Clin Infect Dis Date: 2011-10 Impact factor: 9.079
Authors: H Irene Hall; Ruiguang Song; Philip Rhodes; Joseph Prejean; Qian An; Lisa M Lee; John Karon; Ron Brookmeyer; Edward H Kaplan; Matthew T McKenna; Robert S Janssen Journal: JAMA Date: 2008-08-06 Impact factor: 56.272
Authors: Tim Brown; Le Bao; Adrian E Raftery; Joshua A Salomon; Rebecca F Baggaley; John Stover; Patrick Gerland Journal: Sex Transm Infect Date: 2010-10-06 Impact factor: 3.519
Authors: Anne Cori; Helen Ayles; Nulda Beyers; Ab Schaap; Sian Floyd; Kalpana Sabapathy; Jeffrey W Eaton; Katharina Hauck; Peter Smith; Sam Griffith; Ayana Moore; Deborah Donnell; Sten H Vermund; Sarah Fidler; Richard Hayes; Christophe Fraser Journal: PLoS One Date: 2014-01-15 Impact factor: 3.240
Authors: Paul J Birrell; O Noel Gill; Valerie C Delpech; Alison E Brown; Sarika Desai; Tim R Chadborn; Brian D Rice; Daniela De Angelis Journal: Lancet Infect Dis Date: 2013-02-01 Impact factor: 25.071
Authors: Timothy B Hallett; Basia Zaba; John Stover; Tim Brown; Emma Slaymaker; Simon Gregson; David P Wilson; Kelsey K Case Journal: AIDS Date: 2014-11 Impact factor: 4.177
Authors: Tara D Mangal; Ana Roberta Pati Pascom; Juan F Vesga; Mariana Veloso Meireles; Adele Schwartz Benzaken; Timothy B Hallett Journal: Epidemics Date: 2019-02-07 Impact factor: 4.396