Raabya Rossenkhan, Morgane Rolland, Jan P L Labuschagne, Roux-Cil Ferreira, Craig A Magaret, Lindsay N Carpp, Frederick A Matsen IV, Yunda Huang, Erika E Rudnicki, Yuanyuan Zhang, Nonkululeko Ndabambi, Murray Logan, Ted Holzman, Melissa-Rose Abrahams, Colin Anthony, Sodsai Tovanabutra, Christopher Warth, Gordon Botha, David Matten, Sorachai Nitayaphan, Hannah Kibuuka, Fred K Sawe, Denis Chopera, Leigh Anne Eller, Simon Travers, Merlin L Robb, Carolyn Williamson, Peter B Gilbert, Paul T Edlefsen.
Abstract
Knowledge of the time of HIV-1 infection and the multiplicity of viruses that establish HIV-1 infection is crucial for the in-depth analysis of clinical prevention efficacy trial outcomes. Better estimation methods would improve the ability to characterize immunological and genetic sequence correlates of efficacy within preventive efficacy trials of HIV-1 vaccines and monoclonal antibodies. We developed new methods for infection timing and multiplicity estimation using maximum likelihood estimators that shift and scale (calibrate) estimates by fitting true infection times and founder virus multiplicities to a linear regression model with independent variables defined by data on HIV-1 sequences, viral load, diagnostics, and sequence alignment statistics. Using Poisson models of measured mutation counts and phylogenetic trees, we analyzed longitudinal HIV-1 sequence data together with diagnostic and viral load data from the RV217 and CAPRISA 002 acute HIV-1 infection cohort studies. We used leave-one-out cross-validation to evaluate the prediction error of these calibrated estimators versus that of existing estimators and found that both infection time and founder multiplicity can be estimated with improved accuracy and precision by calibration. Calibration considerably improved all estimators of time since HIV-1 infection, in terms of reducing bias to near zero and reducing root mean squared error (RMSE) to 5-10 days for sequences collected 1-2 months after infection. The calibration of multiplicity assessments yielded strong improvements with accurate predictions (ROC-AUC above 0.85) in all cases. These results have not yet been validated on external data, and the best-fitting models are likely to be less robust than simpler models to variation in sequencing conditions. For all evaluated models, these results demonstrate the value of calibration for improved estimation of founder multiplicity and of time since HIV-1 infection.
Keywords: HIV-1; HIV-1 primary infection; acute and early HIV-1 infection; founder multiplicity; infection time; leave-one-out-cross-validation (LOOCV); sequence analysis; vaccine efficacy assessment
Year: 2019 PMID: 31277299 PMCID: PMC6669737 DOI: 10.3390/v11070607
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Characteristics of HIV-1 sequences and participants in each study.
| Study Feature | RV 217 (ECHO) | CAPRISA 002 |
|---|---|---|
| HIV-1 subtype(s) | CRF01_AE (MSM); A1/D/C and Recombinants (WSM) | C (WSM) |
| Sequencing strategy | Single genome amplification and sequencing | Next generation sequencing (Illumina w/PrimerID) |
| HIV-1 genomic region | Near full length genome (NFLG) | V3 variable loop of the gp120 envelope protein |
| Median bases per HIV-1 sequence (min, IQR, max) | NFLG: 8813 (8624, 8753-8841, 8891); | 498 (495, 498-498, 501) |
| Median HIV-1 sequences per participant after removing recombination and hypermutation (min, IQR, max) | 9.5 (2.6, 8.4-10, 11) | 352 (26, 142.3-640, 2764) |
| Median HIV-1 sequences removed per participant (min, IQR, max) | 0 (0, 0-1, 8) | 0 (0, 0-1, 356) |
| Total number of participants | 36 | 21 |
| Number of MSM | 17 | 0 |
| Number of WSM | 19 | 21 |
| N participants with 1-2M sample | 36 | 20 |
| N participants with ~6M sample | 34 | 18 |
| Mean Gold days 1-2M (SD) | 47 (4.3) | 62 (4.9) |
| Mean Gold days ~6M (SD) | 184 (11.3) | 180 (12.1) |
| N Gold isMultiple 1-2M (%) | 10 (28%) | 5 (25%) |
| N Gold isMultiple ~6M (%) | 10 (29%) | 6 (33%) |
| Median bounds width in days 1-2M (min, IQR, max) | 48 (20, 34-76, 308) | 54 (27, 41-70, 108) |
| Median bounds width in days ~6M (min, IQR, max) | 146 (18, 91-195, 369) | 120 (30, 86-170, 183) |
| Mean lPVL 1-2M (SD) | 4.5 (0.8) | 4.9 (0.7) |
| Mean lPVL ~6M (SD) | 4.1 (1.0) | 4.5 (0.8) |
Calculations reflect the evaluated dataset, after removal of sequences with evidence of hypermutation or recombination. NFLG, near full-length genome; LH, left half genome; RH, right half genome; WSM, women who have sex with men; MSM, men who have sex with men; IQR, interquartile range; lPVL, log10 plasma viral load; SD, standard deviation; Gold, gold-standard value. For timing (Gold days), this is the modified center of bounds (COB) estimate applied to previously unavailable tight acute diagnostic bounds (prior to the 1-2M sample date); for multiplicity (Gold isMultiple), it is the agreed-upon gold-standard indicator of a multiple-founder infection (see Methods).
Figure 1. Prediction errors of the Center of Bounds, PrankenBeast, Poisson Fitter, and modified Poisson Fitter estimators of infection time before (a–d) and after (e–h) calibration for mutation rate after fitting a linear model with terms for log10 plasma viral load (lPVL), the interaction of the estimator with lPVL, the last negative date, the interaction of the estimator with the last negative date, and an intercept. Predictions were made on held-out data in a leave-one-out cross-validation scheme (see Methods). The sequences used for prediction were: (a), (e): RV217 (NFLG) 1-2 months; (b), (f): CAPRISA 002 (V3) 1-2 months; (c), (g): RV217 ~6 months; (d), (h): CAPRISA 002 ~6 months. The median difference between the predicted and gold-standard values is shown as the center line of each box; the solid box boundaries illustrate the 25th and 75th percentiles (interquartile range, IQR). The leftmost entry (“Gold standard”) depicts the (zero) “prediction” error when the true days-since-infection values are known. Values in parentheses indicate the root mean squared error, which is an estimate of the standard error when the fitted predictor is applied to future samples; the bias is shown above these. The whiskers extend to the most extreme data point within 1.5 times the IQR from the box boundaries; points outside this range are plotted as outliers. NFLG, near full-length genome.
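The calibrate-then-evaluate loop behind Figure 1 can be illustrated with a minimal sketch: fit a linear model on all participants but one, predict the held-out participant, and summarize the held-out errors as bias and RMSE. This is not the study's software. The function names (`design_row`, `fit_ols`, `loocv_predictions`, `bias_and_rmse`) and the toy numbers are hypothetical, and the design is deliberately reduced to an intercept, the uncalibrated estimate, and lPVL; the interaction and last-negative-date terms named in the caption are omitted for brevity.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def design_row(estimate_days, lpvl):
    """Reduced design (an assumption): intercept, uncalibrated estimate, lPVL."""
    return [1.0, estimate_days, lpvl]


def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def loocv_predictions(rows, gold_days):
    """Predict each participant from a model trained on all the others."""
    preds = []
    for held_out in range(len(rows)):
        train_rows = rows[:held_out] + rows[held_out + 1:]
        train_gold = gold_days[:held_out] + gold_days[held_out + 1:]
        beta = fit_ols(train_rows, train_gold)
        preds.append(sum(b * x for b, x in zip(beta, rows[held_out])))
    return preds


def bias_and_rmse(preds, gold_days):
    """Bias is the mean held-out error; RMSE is the root mean squared error."""
    errors = [p - g for p, g in zip(preds, gold_days)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


if __name__ == "__main__":
    # Toy values only: uncalibrated estimates (days), lPVL, gold-standard days.
    estimates = [40.0, 55.0, 38.0, 62.0, 47.0, 70.0, 51.0, 44.0]
    lpvls = [4.1, 5.2, 3.8, 4.6, 5.0, 4.3, 3.9, 4.8]
    gold = [46.0, 58.0, 44.0, 65.0, 53.0, 69.0, 55.0, 50.0]
    rows = [design_row(e, v) for e, v in zip(estimates, lpvls)]
    bias, rmse = bias_and_rmse(loocv_predictions(rows, gold), gold)
    print(f"LOOCV bias = {bias:.2f} days, RMSE = {rmse:.2f} days")
```

Because each prediction comes from a model that never saw the held-out participant, the resulting bias and RMSE estimate out-of-sample error rather than in-sample fit.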
Figure 2. Multiplicity AUC of estimators of multiple-founder infections. Bar plots show areas under the receiver operating characteristic (ROC) curve (AUC) for uncalibrated (red) and calibrated (turquoise) predictors of multiplicity when evaluating predictions on held-out values during leave-one-out cross-validation (LOOCV), using the LASSO procedure to select and fit a logistic regression model. Uncalibrated predictors include the method used in the past HVTN sieve analyses, two values computed by the Poisson Fitter software to evaluate a fit to a star-like phylogenetic model, and variants of these using preprocessed inputs (see Methods). Calibrated versions of these predictors are made using models trained on all available data except for the one participant held out at a time. AUC values of 1.0 indicate a perfect predictor, and values of 0.5 indicate a predictor that is no better than random chance. The sequences used for prediction were: (a): RV217 NFLG 1–2 months; (b): CAPRISA 002 V3 1–2 months; (c): NFLG ~6 months; (d): V3 ~6 months.
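To make the AUC scale in Figure 2 concrete, the sketch below computes ROC-AUC from held-out scores using its rank interpretation: the probability that a randomly chosen multiple-founder participant receives a higher score than a randomly chosen single-founder participant, with ties counted as one half. The function name `roc_auc` and the example scores are hypothetical. The sketch assumes the scores are already LOOCV predictions and does not reproduce the LASSO logistic regression step.

```python
def roc_auc(scores, is_multiple):
    """ROC-AUC as the fraction of (multiple, single) pairs ranked correctly.

    scores: held-out predicted scores, higher meaning "more likely multiple".
    is_multiple: gold-standard indicators (truthy for multiple-founder).
    """
    pos = [s for s, y in zip(scores, is_multiple) if y]
    neg = [s for s, y in zip(scores, is_multiple) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one participant in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # correctly ordered pair
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy held-out scores for six hypothetical participants.
    scores = [0.91, 0.15, 0.40, 0.72, 0.08, 0.33]
    gold = [1, 0, 0, 1, 0, 1]
    print(f"AUC = {roc_auc(scores, gold):.3f}")
```

The pairwise form is quadratic in the number of participants, which is acceptable at cohort sizes such as the 36 and 21 participants reported here; a rank-sum formulation would scale better for larger datasets.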