Craig A Magaret, David C Benkeser, Brian D Williamson, Bhavesh R Borate, Lindsay N Carpp, Ivelin S Georgiev, Ian Setliff, Adam S Dingens, Noah Simon, Marco Carone, Christopher Simpkins, David Montefiori, Galit Alter, Wen-Han Yu, Michal Juraska, Paul T Edlefsen, Shelly Karuna, Nyaradzo M Mgodi, Srilatha Edugupanti, Peter B Gilbert.
Abstract
The broadly neutralizing antibody (bnAb) VRC01 is being evaluated for its efficacy in preventing HIV-1 infection in the Antibody Mediated Prevention (AMP) trials. A secondary objective of AMP uses sieve analysis to investigate how VRC01 prevention efficacy (PE) varies with HIV-1 envelope (Env) amino acid (AA) sequence features. An exhaustive analysis that tests how PE depends on every AA feature with sufficient variation would have low statistical power. To design an adequately powered primary sieve analysis for AMP, we modeled VRC01 neutralization as a function of Env AA sequence features of 611 HIV-1 gp160 pseudoviruses from the CATNAP database, with two objectives: (1) to develop models that best predict the neutralization readouts; and (2) to rank AA features by their predictive importance with classification and regression methods. The dataset was split in half, and machine learning algorithms were applied to each half separately using cross-validation and hold-out validation. We selected Super Learner, a nonparametric ensemble-based cross-validated learning method, for advancement to the primary sieve analysis. This method predicted the dichotomous resistance outcome of whether the IC50 neutralization titer of VRC01 for a given Env pseudovirus is right-censored (indicating resistance) with an average validated AUC of 0.868 across the two hold-out datasets. Quantitative log IC50 was predicted with an average validated R² of 0.355. Features predicting neutralization sensitivity or resistance included 26 surface-accessible residues in the VRC01 and CD4 binding footprints, the length of gp120, the length of Env, the number of cysteines in gp120, the number of cysteines in Env, and 4 potential N-linked glycosylation sites; the top features will be advanced to the primary sieve analysis. This modeling framework may also inform the study of VRC01 in the treatment of HIV-infected persons.
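The idea behind a cross-validated ensemble such as Super Learner can be illustrated with a deliberately small sketch. It is not the authors' implementation (the study used the SuperLearner R package with many learners and variable sets). The two base learners, the grid search over a single blending weight, and all function names below are assumptions chosen for illustration. Each base learner predicts every observation from a model fit without that observation's fold, and the ensemble weight is the convex combination that minimizes the cross-validated squared error.

```python
def fit_mean(xs, ys):
    """Base learner 1: ignore the feature and predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base learner 2: ordinary least squares on a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def cv_predictions(fit, xs, ys, folds):
    """Predict each observation from a model that never saw its fold."""
    preds = [0.0] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = model(xs[i])
    return preds


def ensemble_weight(xs, ys, k=5, grid=101):
    """Return (weight on fit_line, cross-validated MSE of the best blend)."""
    folds = [list(range(f, len(xs), k)) for f in range(k)]
    p_line = cv_predictions(fit_line, xs, ys, folds)
    p_mean = cv_predictions(fit_mean, xs, ys, folds)
    best = None
    for step in range(grid):
        w = step / (grid - 1)  # convex weight in [0, 1]
        mse = sum((w * a + (1 - w) * b - y) ** 2
                  for a, b, y in zip(p_line, p_mean, ys)) / len(ys)
        if best is None or mse < best[1]:
            best = (w, mse)
    return best


# Toy, noise-free data (hypothetical): the linear learner should get all the weight.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
weight, cv_mse = ensemble_weight(xs, ys)
```

On this toy outcome the linear learner predicts held-out points essentially perfectly, so the selected weight sits at the upper end of the grid. With real, noisy data the weight would typically fall strictly between 0 and 1.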
Year: 2019 PMID: 30933973 PMCID: PMC6459550 DOI: 10.1371/journal.pcbi.1006952
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Distributions of neutralization sensitivity outcomes of Env pseudoviruses in dataset 1 and in dataset 2.
Fig 2. Performance of the Super Learner and the top 5 individual models in classifying the IC50 censored outcome.
Cross-validated AUC point estimates and 95% confidence intervals are shown for A) models trained on dataset 1 and B) models trained on dataset 2.
Fig 3. Cross-validated receiver operating characteristic curves for the best-performing models in classifying the IC50 censored outcome.
Results are shown for the top three cross-validated models plus the cross-validated performance of the Super Learner, for A) dataset 1 and B) dataset 2. Values in parentheses are the cross-validated areas under the receiver operating characteristic curve (CV-AUC) for the different models.
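As a reminder of what the CV-AUC values summarize, the AUC equals the probability that a randomly chosen positive case (here, a pseudovirus with a right-censored IC50) receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes that quantity directly and then averages it over validation folds, which is one common definition of CV-AUC. Whether the authors used exactly this definition is an assumption, and the labels, scores, and folds are made-up toy values.

```python
def auc(labels, scores):
    """P(score of a random positive > score of a random negative); ties count 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cv_auc(labels, scores, folds):
    """Average of fold-specific AUCs computed on held-out predictions."""
    fold_aucs = [auc([labels[i] for i in fold], [scores[i] for i in fold])
                 for fold in folds]
    return sum(fold_aucs) / len(fold_aucs)


# Toy held-out predictions (hypothetical): 1 = right-censored IC50, 0 = not censored.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
overall = auc(labels, scores)                                # 7 of 9 pairs ordered correctly
by_fold = cv_auc(labels, scores, [[0, 2], [1, 3], [4, 5]])   # mean of 1.0, 0.0, 1.0
```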
Fig 4. Classification boxplots for the best-performing models and the Super Learner in classifying the IC50 censored outcome.
Cross-validated performance is shown for the Super Learner and the top three individual models for (A) dataset 1 and (B) dataset 2. Glmnet is the lasso learner.
Fig 5. Performance of the Super Learner and the top 5 individual models in predicting the quantitative log IC50 outcome.
Cross-validated R² point estimates and 95% confidence intervals are shown for A) models trained on dataset 1 and B) models trained on dataset 2.
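For the quantitative outcome, a cross-validated R² can be read as the proportion of outcome variance explained by predictions made on held-out observations. A minimal sketch, assuming the usual definition of one minus the ratio of the prediction error sum of squares to the total sum of squares (the numbers are toy values, not study data):

```python
def r_squared(observed, predicted):
    """1 - SSE/SST, where `predicted` are held-out (cross-validated) predictions."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - mean_obs) ** 2 for y in observed)
    if sst == 0:
        raise ValueError("R^2 is undefined when the outcome does not vary")
    return 1.0 - sse / sst


# Toy log IC50 values and held-out predictions (hypothetical).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(observed, predicted)   # SSE = 1, SST = 5
```

Unlike an in-sample R², this quantity can be negative when the held-out predictions are worse than simply predicting the overall mean.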
Table 1. Variable importance measure (VIM) information for the features that have a Holm-Bonferroni p-value less than 0.05, ranked by their contribution to the classification of the log IC50 censored outcome.
| Feature | MCCV Composite VIM | Ensemble VIM | Ensemble VIM SE | Ensemble VIM Rank | Direction of Effect | p-value | q-value | FWER |
|---|---|---|---|---|---|---|---|---|
| 456 is R | 90.772 | 0.045 | 0.030 | 1 | Sensitive | 2.18E-41 | 1.81E-38 | 1.81E-38 |
| 459 is G | 66.592 | 0.013 | 0.027 | 9 | Sensitive | 5.59E-33 | 2.32E-30 | 4.63E-30 |
| 280 is N | 44.915 | 0.018 | 0.030 | 5 | Sensitive | 1.42E-28 | 3.93E-26 | 1.18E-25 |
| 458 is G | 40.411 | 0.003 | 0.027 | 97 | Sensitive | 2.62E-28 | 5.43E-26 | 2.16E-25 |
| 655 is N | 34.969 | -0.004 | 0.027 | 394 | Resistant | 6.42E-10 | 1.55E-08 | 5.12E-07 |
| 279 is E | 26.415 | -0.006 | 0.027 | 553 | Resistant | 1.42E-05 | 1.61E-04 | 0.011 |
| Length of gp120 | 23.992 | -0.003 | 0.027 | 362 | Resistant | 2.71E-05 | 2.77E-04 | 0.02 |
| 471 is I | 21.989 | -0.002 | 0.030 | 278 | Resistant | 3.59E-14 | 1.19E-12 | 2.90E-11 |
| 181 is M | 17.443 | -0.009 | 0.027 | 683 | Resistant | 1.13E-05 | 1.34E-04 | 0.009 |
| Length of Env | 14.893 | -0.002 | 0.027 | 271 | Resistant | 4.65E-06 | 5.84E-05 | 0.004 |
| 428 is Q | 14.592 | 0.003 | 0.027 | 79 | Sensitive | 2.28E-19 | 1.90E-17 | 1.88E-16 |
| 466 is E | 13.877 | -0.004 | 0.027 | 426 | Sensitive | 4.42E-19 | 3.33E-17 | 3.62E-16 |
| 124 is P | 13.515 | 0.005 | 0.028 | 54 | Sensitive | 1.04E-16 | 4.13E-15 | 8.46E-14 |
| 469 is R | 13.424 | -0.008 | 0.027 | 622 | Sensitive | 2.85E-08 | 5.25E-07 | 2.24E-05 |
| 589 is D | 12.691 | 0.001 | 0.027 | 137 | Sensitive | 5.19E-15 | 1.87E-13 | 4.19E-12 |
| Total cysteines in Env | 12.57 | 0.003 | 0.027 | 86 | Resistant | 1.06E-06 | 1.57E-05 | 8.23E-04 |
| 569 is T | 11.7 | -0.004 | 0.027 | 395 | Sensitive | 5.44E-17 | 2.66E-15 | 4.43E-14 |
| 616 is PNGS | 10.648 | -0.002 | 0.027 | 306 | Sensitive | 1.38E-11 | 4.25E-10 | 1.11E-08 |
| 365 is S | 10.641 | -0.006 | 0.027 | 508 | Sensitive | 3.27E-10 | 8.47E-09 | 2.61E-07 |
| 457 is D | 10.137 | 0 | 0.027 | 202 | Sensitive | 3.46E-06 | 4.42E-05 | 0.003 |
| 456 is W | 10.122 | -0.013 | 0.027 | 788 | Resistant | 4.19E-11 | 1.24E-09 | 3.36E-08 |
| Total PNG sites in V5 region | 10.083 | -0.003 | 0.027 | 346 | Sensitive | 3.20E-05 | 3.05E-04 | 0.024 |
| 456 is H | 8.955 | -0.011 | 0.027 | 742 | Resistant | 1.42E-05 | 1.61E-04 | 0.011 |
| 374 is H | 8.929 | -0.008 | 0.026 | 620 | Sensitive | 5.87E-16 | 2.22E-14 | 4.75E-13 |
| 471 is G | 8.881 | 0 | 0.027 | 205 | Sensitive | 9.08E-10 | 2.02E-08 | 7.22E-07 |
| 459 is D | 7.234 | 0 | 0.027 | 221 | Resistant | 4.85E-11 | 1.39E-09 | 3.89E-08 |
| Total cysteines in gp120 | 6.826 | 0 | 0.027 | 193 | Resistant | 7.53E-07 | 1.18E-05 | 5.86E-04 |
| 397 is C | 6.679 | 0.001 | 0.027 | 133 | Resistant | 1.22E-24 | 2.03E-22 | 1.01E-21 |
| 425 is N | 6.503 | -0.012 | 0.027 | 773 | Sensitive | 3.45E-10 | 8.68E-09 | 2.75E-07 |
| 156 is N | 6.245 | -0.005 | 0.027 | 442 | Sensitive | 9.23E-10 | 2.02E-08 | 7.33E-07 |
| 156 is PNGS | 6.222 | NA | NA | 805 | Sensitive | 9.23E-10 | 2.02E-08 | 7.33E-07 |
| 280 is S | 5.592 | -0.011 | 0.027 | 740 | Resistant | 5.76E-17 | 2.66E-15 | 4.68E-14 |
| 425 is R | 5.346 | -0.007 | 0.027 | 558 | Resistant | 1.64E-10 | 4.54E-09 | 1.31E-07 |
| 824 is PNGS | 1.303 | 0.012 | 0.027 | 13 | Resistant | 3.21E-06 | 4.16E-05 | 0.002 |
| 229 is PNGS | 0.99 | 0.013 | 0.028 | 10 | Resistant | 6.62E-07 | 1.06E-05 | 5.16E-04 |
Features shown were ranked among the top 50 features by either VIM method and had a Holm-Bonferroni 2-sided p-value less than 0.05 for an association with the outcome in a logistic regression model using both datasets (with adjustment for geographic region as in all analyses).
1 The ensemble-based VIM standard error is based on the estimated influence function for the ensemble-based VIM [49].
2 When the direction of effect is “sensitive” (“resistant”), the presence of or a higher quantity of the feature associates with a non-censored (censored) IC50, which is interpreted as VRC01 sensitivity (resistance).
3 The p-value is from a Wald test in a logistic regression model testing the association of the feature with the outcome, adjusting for the sequences’ geographic region of origin to control for possible confounding.
4 The q-value is the Benjamini-Hochberg false discovery rate.
5 The FWER p-value is the Holm-Bonferroni family-wise error-rate adjusted p-value.
FWER, family-wise error rate; MCCV, Monte Carlo cross-validation; SE, standard error; VIM, variable importance measure.
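The two multiplicity adjustments named in the footnotes can be sketched as follows. This is a generic illustration of the Holm-Bonferroni step-down and Benjamini-Hochberg step-up procedures, not the authors' code, and the input p-values are toy values. For the smallest p-value in a family of m tests, both adjustments reduce to m times the raw p-value. The first row of the table above has identical q-value and FWER entries, at roughly 830 times the raw p-value, which is consistent with a family of about that many tested features.

```python
def holm(pvals):
    """Holm-Bonferroni adjusted p-values (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the (rank+1)-th smallest p-value by (m - rank), enforce monotonicity.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values controlling the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # Scale by m / (rank+1), stepping up from the largest p-value.
        running = min(running, pvals[i] * m / (rank + 1))
        adjusted[i] = running
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]        # toy raw p-values
fwer = holm(pvals)                       # approx. [0.03, 0.06, 0.06, 0.02]
qvals = benjamini_hochberg(pvals)        # approx. [0.02, 0.04, 0.04, 0.02]
```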
Table 2. Variable importance measure (VIM) information for the features that have a Holm-Bonferroni p-value less than 0.05, ranked by their contribution to the prediction of the quantitative log IC50 outcome.
| Feature | MCCV Composite VIM | Ensemble VIM | Ensemble VIM SE | Ensemble VIM Rank | Direction of Effect | p-value | q-value | FWER |
|---|---|---|---|---|---|---|---|---|
| 456 is R | 100 | 0.131 | 0.024 | 15 | Sensitive | 1.68E-23 | 1.40E-20 | 1.40E-20 |
| 459 is G | 68.458 | 0.11 | 0.022 | 377 | Sensitive | 9.00E-20 | 3.74E-17 | 7.46E-17 |
| 181 is M | 40.702 | 0.108 | 0.021 | 504 | Resistant | 4.12E-06 | 8.77E-05 | 0.003 |
| 279 is D | 33.329 | 0.109 | 0.021 | 460 | Sensitive | 5.39E-05 | 8.29E-04 | 0.042 |
| Subtype is A1 | 31.624 | 0.117 | 0.022 | 56 | Sensitive | 1.09E-05 | 1.97E-04 | 0.009 |
| Length of Env | 30.962 | 0.105 | 0.021 | 680 | Resistant | 6.53E-06 | 1.29E-04 | 0.005 |
| 655 is N | 26.362 | 0.103 | 0.021 | 768 | Resistant | 2.30E-05 | 3.82E-04 | 0.018 |
| 471 is I | 23.337 | 0.112 | 0.022 | 204 | Resistant | 1.77E-09 | 7.74E-08 | 1.44E-06 |
| Length of gp120 | 19.636 | 0.106 | 0.021 | 636 | Resistant | 6.86E-06 | 1.33E-04 | 0.005 |
| 471 is G | 19.324 | 0.106 | 0.021 | 638 | Sensitive | 8.80E-09 | 3.04E-07 | 7.10E-06 |
| 280 is N | 16.896 | 0.11 | 0.021 | 368 | Sensitive | 2.43E-15 | 5.05E-13 | 2.01E-12 |
| 179 is L | 10.191 | 0.113 | 0.022 | 170 | Sensitive | 5.74E-06 | 1.16E-04 | 0.005 |
| 456 is S | 6.81 | 0.116 | 0.021 | 69 | Resistant | 1.47E-06 | 3.48E-05 | 0.001 |
| 459 is D | 5.934 | 0.106 | 0.021 | 626 | Resistant | 1.89E-07 | 5.42E-06 | 1.52E-04 |
| 425 is N | 4.186 | 0.119 | 0.021 | 44 | Sensitive | 3.33E-09 | 1.38E-07 | 2.70E-06 |
| 455 is Q | 0.011 | 0.12 | 0.022 | 37 | Resistant | 8.70E-09 | 3.04E-07 | 7.03E-06 |
| 428 is M | 0.007 | 0.204 | 0.032 | 3 | Resistant | 2.31E-07 | 6.40E-06 | 1.85E-04 |
| 280 is T | 0.005 | 0.126 | 0.023 | 19 | Resistant | 7.63E-06 | 1.44E-04 | 0.006 |
Features shown were ranked among the top 50 features by either VIM method and had a Holm-Bonferroni 2-sided p-value less than 0.05 for an association with the outcome in a linear regression model using both datasets (with adjustment for geographic region as in all analyses).
1 The ensemble-based VIM standard error is based on the estimated influence function for the ensemble-based VIM [49].
2 When the direction of effect is “sensitive” (“resistant”), the presence of or a higher quantity of the feature associates with a lower (higher) quantitative log IC50, which is interpreted as VRC01 sensitivity (resistance).
3 The p-value is from a Wald test in a linear regression model testing the association of the feature with the outcome, adjusting for the sequences’ geographic region of origin to control for possible confounding.
4 The q-value is the Benjamini-Hochberg false discovery rate.
5 The FWER p-value is the Holm-Bonferroni family-wise error-rate adjusted p-value.
FWER, family-wise error rate; MCCV, Monte Carlo cross-validation; SE, standard error; VIM, variable importance measure.
Fig 6. Positions, magnitudes, and distributions of amino acid residues at predictive sites selected by VIM methods.
A) IC50 censored outcome; B) Quantitative log IC50 outcome. C and D) Logo plots of the probabilities of each of the amino acids observed at key positions in C) VRC01-sensitive Env pseudoviruses and D) VRC01-resistant Env pseudoviruses. FWER, family-wise error rate; VIM, variable importance measure. The positions illustrated here correspond to the results in Tables 1 and 2 for the presence of residues at specific sites.
Table 3. Distinct input variable sets used for the machine learning analyses, and learning algorithm types.
| A. Distinct input variable sets used for the Super Learning analyses | |
|---|---|
| 1: geog | Geographic region (Asia/Europe and Americas/N. Africa/S. Africa) |
| 2: geog.AAchVRC01 | geog + Group 1 |
| 3: geog.AAchCD4bs | geog + Group 2 (AAs in the CD4 binding site) (124, 125, 198, 279, 280, 281, 282, 283, 365, 369, 374, 425, 426, 428, 429, 430, 432, 455, 456, 458, 459, 460, 461, 471, 474, 475, 476, 477) |
| 4: geog.AAchESA | geog + Group 3 (AAs with sufficient Exposed Surface Area) (97, 198, 276, 278, 279, 280, 281, 282, 365, 371, 415, 428, 429, 430, 455, 458, 459, 460, 461, 467, 474, 476) |
| 5: geog.AAchGLYCO | geog + Group 4 (AAs important for glycosylation) (61, 197, 276, 362, 363, 386, 392, 462, 463) |
| 6: geog.AAchCOVAR | geog + Group 5 (AAs that covary with the VRC01 binding footprint) (46, 132, 138, 144, 150, 179, 181, 186, 190, 290, 321, 328, 354, 389, 394, 396, 397, 406) |
| 7: geog.AAchPNGS | geog + Group 6 (AAs associated with VRC01-specific PNGS effects) (130, 139, 143, 156, 187, 197, 241, 289, 339, 355, 363, 406, 408, 410, 442, 448, 460, 462) |
| 8: geog.AAchgp41 | geog + All gp41 sites that affect global neutralization sensitivity (544, 569, 582, 589, 655, 668, 675, 677, 680, 681, 683, 688, 702) |
| 9: geog.AAchGlyGP160 | geog + All gp160 N-glycosylation sites that are not included in VRC01 contact sites or paratope or sites with covariability |
| 10: geog.st | geog + Group 8 (viral subtypes) (01_AE/02_AG/07_BC/A1/A1C/A1D/B/C/D/O/Other) |
| 11: geog.sequonCt | geog + Group 9 (region-specific PNGS counts) |
| 12: geog.geom | geog + Group 10 (viral geometry metrics) |
| 13: geog.cys | geog + Group 11 (counts of cysteine residues in certain regions) |
| 14: geog.sbulk | geog + Group 12 (steric bulk at critical locations) |
| 15: geog.corP | geog + features selected with t-test univariate p-values |
| 16: geog.glmnet | geog + features selected with non-zero coefficients based off lasso |
| 17: geog.all.MCCV | All variables in sets 1–13, described above (AAs at positions 46, 61, 97, 124, 125, 130, 132, 138, 139, 143, 144, 150, 156, 179, 181, 186, 187, 190, 197, 198, 241, 276, 278, 279, 280, 281, 282, 283, 289, 290, 321, 328, 339, 354, 355, 362, 363, 365, 369, 371, 374, 386, 389, 392, 394, 396, 397, 406, 408, 410, 415, 425, 426, 428, 429, 430, 432, 442, 448, 455, 456, 458, 459, 460, 461, 462, 463, 465, 466, 467, 471, 474, 475, 476, and 477, plus all features in Groups 8 through 12) |
| B. Learning algorithm types and the distinct input variable groups used with each learner | |
| SL.randomForest | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
| SL.glmnet | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
| SL.xgboost | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
| SL.naiveBayes | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
| SL.glm | 1,10,11,12,13,15,16 |
| SL.step | 1,10,11,12,13,15,16 |
| SL.step.interaction | 1,10,11,12,13,15,16 |
| SL.mean | None |
geog = geography; AA = amino acid. AA positions are given in HXB2 coordinates.
1 All amino acids included in the variable sets met the minimum variability filter: the site had to differ from the consensus in at least 3 sequences in the entire CATNAP data set (i.e., before splitting into the two analysis sets).
2 See Methods for details on the listed input variable Groups 1−13.
3 The algorithms are listed by the names of the functions used in the SuperLearner R package. An exception is “SL.naiveBayes”, which was a custom wrapper designed to use the naiveBayes function from the e1071 package. The SL.glmnet wrapper was used with the lasso penalty. All tuning parameters were set to the default values of the SuperLearner package, except for SL.xgboost, which we modified to fit decision stumps rather than trees.
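The minimum variability filter and the residue-indicator features (named in the style "456 is R") can be sketched as below. This is an illustrative reading of the footnote, not the authors' pipeline. The toy three-column alignment, the mapping of columns to HXB2 positions, and the function names are all hypothetical, and the consensus is taken here to be the most frequent residue in each column.

```python
from collections import Counter


def variable_sites(aligned, min_diff=3):
    """Columns where at least `min_diff` sequences differ from the consensus residue."""
    keep = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        consensus_count = counts.most_common(1)[0][1]
        if len(aligned) - consensus_count >= min_diff:
            keep.append(col)
    return keep


def indicator_features(aligned, cols, positions):
    """One 0/1 feature per (site, residue) pair, named like '456 is R'."""
    features = {}
    for col in cols:
        for residue in sorted({seq[col] for seq in aligned}):
            name = f"{positions[col]} is {residue}"
            features[name] = [int(seq[col] == residue) for seq in aligned]
    return features


# Toy alignment (hypothetical): columns stand in for HXB2 positions 456, 459, 280.
aligned = ["RGN", "RGN", "RGN", "WGS", "HGS", "WDS"]
positions = [456, 459, 280]
kept = variable_sites(aligned)                    # column 1 varies in only one sequence
features = indicator_features(aligned, kept, positions)
```

In this toy alignment the middle column is dropped because only one sequence departs from the consensus, so no "459 is ..." indicator is created, while the other two columns each yield one indicator per observed residue.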
Fig 7. Specific sites in Feature Groups 1 to 7 before application of the minimum variability filter.