Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies.

Raymond H Y Louie^1,2, Kevin J Kaczorowski^3,4, John P Barton^4,5,6, Arup K Chakraborty^7,4,5,6,8, Matthew R McKay^9,10.

Abstract

HIV is a highly mutable virus, and over 30 years after its discovery, a vaccine or cure is still not available. The isolation of broadly neutralizing antibodies (bnAbs) from HIV-infected patients has led to renewed hope for a prophylactic vaccine capable of combating the scourge of HIV. A major challenge is the design of immunogens and vaccination protocols that can elicit bnAbs that target regions of the virus's spike proteins where the likelihood of mutational escape is low due to the high fitness cost of mutations. Related challenges include the choice of combinations of bnAbs for therapy. An accurate representation of viral fitness as a function of its protein sequences (a fitness landscape), with explicit accounting of the effects of coupling between mutations, could help address these challenges. We describe a computational approach that has allowed us to infer a fitness landscape for gp160, the HIV polyprotein that comprises the viral spike that is targeted by antibodies. We validate the inferred landscape through comparisons with experimental fitness measurements, and various other metrics. We show that an effective antibody that prevents immune escape must selectively bind to high escape cost residues that are surrounded by those where mutations incur a low fitness cost, motivating future applications of our landscape for immunogen design.

Keywords: HIV; broadly neutralizing antibodies; envelope protein; fitness landscape; statistical inference

Year: 2018 PMID: 29311326 PMCID: PMC5789945 DOI: 10.1073/pnas.1717765115

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

After over three decades of effort, there is now renewed hope for an effective vaccine against HIV. This is because of the isolation of broadly neutralizing antibodies (bnAbs) that are capable of neutralizing diverse HIV strains in vitro and progress made toward inducing them by vaccination (1–7), and reports of protection against infection for macaques immunized with a T cell-based vaccine (8, 9). However, much progress still needs to be made to realize the goal of deploying an effective vaccine, and also to develop rational strategies for optimizing immunogens and vaccination protocols. A related issue is the choice of combinations of bnAbs for passive therapy of infected persons, an approach that has shown promise in humans and macaques (10–12). Because HIV is highly mutable, one key challenge is that the virus can evolve mutations that abrogate the binding of antibodies to HIV’s envelope proteins or T cell receptors to peptides derived from viral proteins, while also preserving fitness (the ability to properly assemble the virus, replicate, and propagate infection). This allows HIV to escape from human immune responses. On the other hand, some parts of the viral proteome are vulnerable to mutations because they result in a severe loss of fitness. Vaccination strategies aimed at targeting these parts of the proteome and not the ones that readily allow escape might be effective. The design of effective vaccination strategies would therefore benefit from knowledge of the fitness landscape of viral proteins—that is, knowledge of the fitness of the virus as a function of its amino acid sequence. A conventional approach to estimate the fitness landscape of a virus is to assume that the fitness of a sequence is directly related to the extent to which the amino acids at a residue are conserved in circulating viral strains; more conserved residues are ones where mutations are predicted to result in a larger fitness penalty. 
This approach was used to obtain a landscape experimentally for the HIV envelope protein, gp160 (13), and computationally for other HIV proteins (14, 15). However, it ignores epistatic coupling between mutations at different residues, which is known to be an important factor for viral fitness (16–18). For example, if a mutation that allows the virus to evade an immune response occurs at a particular relatively conserved residue, it is likely to significantly impair viral fitness. However, if another mutation at a different residue has a compensatory effect, the fitness cost incurred by making the primary immune-evading mutation can be partially offset. Because HIV is highly mutable and also exhibits a high replication rate, such compensatory mutations can be accessed in vivo, and have been found to be a significant factor in determining viral fitness and mutational escape pathways (16–19). Thus, to define the mutational vulnerabilities of a virus like HIV in vivo, and thereby guide the design of vaccination strategies, the desired fitness landscape must include the effects of coupling between mutations at different residues. Unfortunately, despite continuing progress (13, 20, 21), experimentally obtaining a fitness landscape that includes coupling information for the majority of proteins in HIV is infeasible, because the number of experimental fitness values that need to be measured is prohibitively large. For example, on the order of 10^7 experimental fitness values would need to be measured for gp160 (Fig. 1), the polyprotein that comprises the HIV spike that is targeted by antibodies.

Fig. 1.

For different HIV proteins, the number of parameters, average site entropy, and protein length are shown. gp160 has orders of magnitude more parameters than the majority of the other HIV proteins, the longest length, and one of the largest average site entropies. Note that, to obtain a landscape purely based on experiments, the number of required in vitro experimental fitness values is proportional to the number of parameters. The amino acid MSA for each protein was downloaded from the Los Alamos National Laboratory (LANL) HIV sequence database (https://www.hiv.lanl.gov/).

One approach that has been successful in obtaining the fitness landscape of HIV proteins, including the effects of couplings, has relied on inferring this information from the sequences of circulating strains [a multiple sequence alignment (MSA)] (19, 22–25). First, a maximum-entropy–based computational approach was used to infer a “prevalence landscape” from the sequence data, taking into account couplings between residues; the prevalence landscape describes the probability of observing a virus with a particular protein sequence in circulation (Fig. 2). Theoretical analyses and tests against in vitro and clinical data (19, 22, 26) suggest that, because of the evolutionary history of HIV and the diversity of (largely ineffective) immune responses that the HIV population has been subjected to, the prevalence landscape is a reasonably good proxy for the fitness landscape. This approach has been applied and validated for various internal proteins of HIV (19, 22–25). Similar methods have also been employed to predict the fitness effects of mutations in bacterial proteins (27). However, a fitness landscape for gp160 has not been previously obtained, despite its importance as a target of antibody responses. This is because the inference problem is far more challenging. In particular, the gp160 primary sequence is more than twice as long as any other HIV protein, and it is also among the most variable.
This leads to an explosion in the number of model parameters to be estimated (Fig. 1). Standard inference approaches, such as those based on gradient descent algorithms [commonly referred to as Boltzmann machine-learning (BML) methods] or Markov chain Monte Carlo-based simulation methods (22, 28), are intractable for the gp160 protein.
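To make the scale of this parameter explosion concrete, the following minimal sketch counts the fields and couplings of a pairwise maximum-entropy model from the number of explicitly modeled states at each residue. The function name `count_potts_parameters` and the input format are assumptions made for illustration; the exact counting convention used for Fig. 1 (for example, whether the consensus state carries parameters) is not stated in this text.

```python
def count_potts_parameters(states_per_residue):
    """Count fields plus couplings of a pairwise maximum-entropy model.

    states_per_residue[i] is the number of explicitly modeled states at
    residue i (a hypothetical input; the gauge convention is an assumption).
    """
    n_fields = sum(states_per_residue)
    n_couplings = 0
    running = 0  # total number of states over residues already visited
    for q in states_per_residue:
        # Every state at this residue couples to every state at each
        # earlier residue, which counts each pair (j, i) with j < i once.
        n_couplings += running * q
        running += q
    return n_fields + n_couplings


if __name__ == "__main__":
    # Purely illustrative: 815 residues with a uniform 4 states each.
    print(count_potts_parameters([4] * 815))
```

Because the coupling count grows with the product of the state counts at every pair of residues, both the length of the protein and its per-residue variability inflate the total, which is the difficulty described above for gp160.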

Fig. 2.

(A) The general framework used for inferring a fitness landscape for gp160. Protein sequence data, in the form of a multiple sequence alignment (MSA), are used to infer a maximum-entropy, or least-biased, probability distribution of sequences, subject to recapitulating the observed one- and two-point mutation probabilities. This distribution is termed the sequence prevalence landscape. Based on past work on other HIV proteins, we assume that protein fitness is directly proportional to sequence prevalence. As shown in a schematic depiction (Right), the fitness landscape traces out fitness values over the space of possible protein sequences and highlights several features expected of the gp160 landscape: HIV often escapes even in the presence of bnAbs, suggesting that many mutational pathways may have small fitness costs (i); other escape pathways may lead to a large decrease in fitness (ii); but compensatory mutations can restore fitness (iii). (B) Our computational framework includes three steps to solve the maximum-entropy problem. Step 1 involves reducing the number of mutants from the original MSA. The bar chart shows the single-mutant probabilities for one particular residue (highlighted in gray in the MSA depicted in A). Infrequent amino acids are combined together and treated as a single mutant (“post-combined”). Step 2 involves solving a modified minimum probability flow (MPF) objective function, based on the KL divergence between the empirical distribution and a short-time evolved distribution and extended to include regularization and an arbitrary number of mutants. Step 3 involves refining the parameters from step 2 using a Boltzmann machine-learning (BML) algorithm, which produces the field (hf) and coupling (Jf) parameters of the landscape.

Computational Method

The fitness landscape is specified as a probabilistic model, chosen as the model having the maximum entropy (i.e., the model least biased by intuition), subject to the constraint of reproducing the single- and double-mutant probabilities observed in the MSA. For a given length-L amino acid sequence $x = [x_1, \ldots, x_L]$, where each $x_i$ can take on any of the 20 naturally occurring amino acids or a gap in the MSA, this maximum-entropy model assigns the probability (22):

$$P_{h,J}(x) = \frac{e^{-E(x)}}{Z}, \qquad E(x) = \sum_{i=1}^{L} h_i(x_i) + \sum_{i=1}^{L} \sum_{j=i+1}^{L} J_{ij}(x_i, x_j),$$

where $Z$ is the normalization constant and, in analogy with statistical mechanics, we refer to $E(x)$ as the energy of sequence $x$. The parameters $h$ and $J$, referred to as fields and couplings, respectively, can be obtained by solving the following convex optimization problem:

$$(h^*, J^*) = \arg\min_{h,J} D_{KL}(P_0 \,\|\, P_{h,J}),$$

subject to the single- and double-mutant probabilities matching those in the MSA. Here, $P_0$ denotes the patient-weighted sequence probability distribution from the MSA, and $D_{KL}(P_0 \,\|\, P_{h,J})$ denotes the Kullback–Leibler divergence between $P_0$ and $P_{h,J}$. This problem can be solved in principle by BML algorithms; however, such algorithms require computing a gradient function that involves an exponential number of terms in L, thereby making direct computation infeasible for the gp160 protein. A solution to this challenge is provided for other HIV proteins by use of a cluster expansion approach along with BML (24, 29, 30). However, achieving reliable statistical inference for gp160 is difficult due to the large parameter space (Fig. 1) and the relatively small number of sequences available for gp160. The method described here addresses these issues via three main steps (Fig. 2). The most important of these is the introduction of a computationally efficient algorithm that produces accurate initial estimates of the fields and couplings.
This method is augmented by the refinement of these estimates using BML, which, due to the accuracy of the initialization, converges quickly, and by the application of a variable-combining method for reducing the effective number of parameters, which is more systematic than past such approaches.

Our approach for providing an initial estimate of the field and coupling parameters is based on the principle of minimum probability flow (MPF) (31), which has been successfully applied in settings such as deep belief networks and independent component analysis (31), although to our knowledge it has yet to be used for the analysis of protein sequence data. We first describe the general idea of MPF as applied to our maximum-entropy problem, and then discuss some distinctions with related work. The MPF principle leads to replacing the model distribution $P_{h,J}$ in the maximum-entropy optimization problem with an alternate (related) distribution that is simpler to optimize, and which is expected to yield an accurate approximation to the desired maximum-entropy solution. Specifically, the maximum-entropy distribution is viewed as the equilibrium distribution of a Markov process with appropriately specified deterministic dynamics that evolves the empirical data distribution $P_0$ toward $P_{h,J}$. Based on these dynamics, the original optimization problem is replaced with the alternative problem:

$$\arg\min_{h,J} D_{KL}(P_0 \,\|\, P_\tau),$$

where $P_\tau$ (which depends on $h$ and $J$) represents the distribution obtained by running the dynamics for time $\tau$. While this distribution, as well as the associated KL divergence, is still quite complicated, it is greatly simplified for small values of $\tau$. Specifically, for small $\tau$, the KL divergence is well approximated by a linear function of $\tau$:

$$D_{KL}(P_0 \,\|\, P_\tau) \approx \tau \, K(h, J),$$

where $K(h, J)$ is a simple function of the field and coupling parameters $h$, $J$. The parameters that minimize this function can then be easily found using a standard gradient descent algorithm. Importantly, choosing $\tau$ small to facilitate an expansion of the KL divergence results in an efficient objective function $K$, which involves only a quadratic number of terms in L, as opposed to the original problem, which was exponential in L. The estimates obtained by minimizing $K$ are also statistically consistent (31): if the data are drawn from the maximum-entropy model, then the parameter estimates become arbitrarily accurate given a sufficiently large number of samples.

Of direct relevance to our problem, the MPF procedure was previously applied to learn Ising spin-glass models (31), which have the form of the maximum-entropy distribution, but restricted to modeling only two amino acids per residue. Our approach extends that procedure to make it suitable for the gp160 problem. First, we allow for arbitrary numbers of amino acids per residue, which is essential due to the large sitewise amino acid variability observed in the MSA. Second, we introduce regularization to control sampling noise, which is required due to the large number of parameters and limited sequence data for gp160. The regularization parameters are chosen such that the model parameters yield a suitable balance between statistical overfitting and underfitting, based on a previously defined metric (32). This is achieved by appropriately biasing the estimated single- and double-mutant probabilities through regularization, to faithfully capture real mutational variability and ignore variability resulting from finite sampling.

By solving a proxy optimization problem, the MPF inference procedure yields a set of inferred model parameters that serve as approximations to the desired optimal maximum-entropy parameters. These parameters are then refined using a gradient descent-based BML algorithm, referred to as RPROP (33), with an additional regularization term. While BML algorithms are intractable when applied to the gp160 maximum-entropy problem with arbitrary initialization, such algorithms can converge quickly when accurate initial parameters resulting from the MPF-based inference procedure are provided.

Multiple amino acids are observed at each residue of gp160 in the MSA. Here, we restrict the number of amino acids explicitly modeled in the inference procedure to decrease the computational burden (30). We propose a general data-driven approach to combine the least frequent mutants (nonconsensus amino acids) together. Combining such mutants also makes sense statistically since the low-frequency mutants are particularly sensitive to sampling noise, and their corresponding parameters are more challenging to meaningfully estimate. This approach, applicable to any protein, can substantially decrease the number of parameters to be estimated. To achieve this, as in the past (30), for residue i we model only the k most frequent mutants, while grouping the remaining q − k + 1 mutants together, where q denotes the number of mutants at residue i as observed in the MSA. Here, k is chosen such that, upon grouping, the entropy at residue i is at least a certain fraction ϕ of the corresponding entropy without grouping. The design question is to choose an appropriate value of ϕ. In previous work (19, 22, 30), this factor was chosen arbitrarily. However, for a protein like gp160, which has a significantly larger number of parameters than the other HIV proteins (Fig. 1), a more judicious choice of ϕ may have considerable benefits. A lower value will combine more mutants together, thus decreasing the number of parameters to be estimated and reducing the computational burden. However, combining too many mutants can result in a loss of useful information regarding amino acid identities. This loss of information can be quantified in the form of a statistical bias. To balance these competing issues systematically, we introduce a method for selecting ϕ based on the sequence data. It is described briefly as follows.
For any given ϕ, for each residue i we can specify k as indicated above. Letting $f_i(a)$ denote the frequency of amino acid a at residue i, this leads to a modified (sparsified) model in which the explicitly modeled amino acids retain their observed frequencies, while the grouped mutants are represented by the frequency of a single coarse-grained amino acid. The squared error (bias) of this model (i.e., for the specific ϕ) and the total variance of the amino acid frequencies are then each estimated at residue i from the MSA. Our aim is to select a level of coarse-graining, corresponding to the fractional entropy captured, ϕ, such that these two quantities are commensurate. Intuitively, this approach seeks to select the sparsest model (i.e., choosing the ϕ giving the most coarse-grained combining) such that the resulting errors caused by combining do not generally exceed the statistical fluctuations in the estimated amino acid frequencies from the MSA. A significant advantage of such a variable selection approach is that it can substantially reduce the number of maximum-entropy parameters to infer (i.e., reducing the number of fields and couplings in the maximum-entropy model), thereby simplifying the parameter estimation problem, but without sacrificing the predictive power of the inferred model. To this end, we take the ratio of this error to the variance in the amino acid frequencies at each residue i, average the ratio over all residues, and select ϕ by numerically searching for a value at which this average is close to 1.
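The entropy-based grouping rule for a single residue can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the analysis: the function names `site_entropy` and `choose_k` are hypothetical, the frequency of the coarse-grained amino acid is taken to be the sum of the merged frequencies, and the bias and variance estimates used to select ϕ are not reproduced because their explicit expressions are not available in this text.

```python
import math


def site_entropy(freqs):
    """Shannon entropy (in nats) of a list of frequencies."""
    return -sum(f * math.log(f) for f in freqs if f > 0)


def choose_k(freqs, phi):
    """Pick the smallest k whose grouped entropy is >= phi * full entropy.

    freqs: frequencies at one residue, consensus first, then the mutants
           in any order (assumed to sum to 1).
    phi:   fraction of the ungrouped entropy to retain (0 < phi <= 1).
    Returns (k, grouped_freqs), where grouped_freqs holds the consensus,
    the k most frequent mutants, and one merged entry for the remainder.
    """
    consensus = freqs[0]
    mutants = sorted(freqs[1:], reverse=True)
    full = site_entropy([consensus] + mutants)
    for k in range(len(mutants) + 1):
        grouped = [consensus] + mutants[:k]
        rest = sum(mutants[k:])  # assumption: merged frequency is the sum
        if rest > 0:
            grouped.append(rest)
        # Small tolerance guards against floating-point round-off.
        if site_entropy(grouped) >= phi * full - 1e-12:
            return k, grouped
    return len(mutants), [consensus] + mutants


if __name__ == "__main__":
    # Hypothetical residue: consensus at 70%, three mutants.
    print(choose_k([0.7, 0.2, 0.05, 0.05], phi=0.9))
```

Merging states can only lower the entropy, so scanning k upward and stopping at the first value that meets the threshold yields the most coarse-grained grouping compatible with the chosen ϕ.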

Results

We applied the computational framework to an MSA of HIV-1 clade B gp160 amino acid sequences downloaded from the Los Alamos National Laboratory (LANL) HIV sequence database (https://www.hiv.lanl.gov/). The MSA was processed to ensure both sequence and residue quality, resulting in L = 815 residues and 20,043 sequences belonging to 1,918 patients. These sequences were highly variable, with an average pairwise Hamming distance of 0.1824, normalized by L.
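For reference, the variability statistic quoted above can be computed as in the following minimal sketch. It assumes equal-length aligned sequences and weights every pair equally; whether the reported value used patient weighting is not stated here, and the function name `mean_normalized_hamming` is hypothetical.

```python
from itertools import combinations


def mean_normalized_hamming(sequences):
    """Average pairwise Hamming distance divided by sequence length.

    sequences: list of equal-length aligned strings (at least two).
    Every pair is weighted equally (an assumption for illustration).
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same aligned length")
    n_pairs = 0
    total_mismatches = 0
    for s, t in combinations(sequences, 2):
        total_mismatches += sum(a != b for a, b in zip(s, t))
        n_pairs += 1
    return total_mismatches / (n_pairs * length)


if __name__ == "__main__":
    # Toy alignment, not real gp160 data.
    print(mean_normalized_hamming(["AAAA", "AAAT", "AATT"]))
```

The pure-Python loop is quadratic in the number of sequences, so for an alignment of the size used here a vectorized or subsampled computation would be preferable.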

The Computational Method Is Fast and Accurately Captures the Observed Sequence Statistics.

When applied to the MSA of gp160, our computational framework took only 2.5 d of CPU time (12 h for MPF, 2 d for convergence of the subsequent BML) on a 16-core node with 128-GB RAM and 2.7-GHz processing speed. This is significant given the large number of parameters that needed to be estimated (∼4.4 million). The computational efficiency was aided by the initial variable selection phase, which reduced the number of parameters by a factor of 6. The inferred model accurately reproduced the single- and double-mutant probabilities observed in the MSA (Fig. 3), as required. The fast convergence of the BML algorithm in our framework was due both to the significant reduction in parameters achieved with the initial selection phase and to the high accuracy of the initialization parameters estimated by MPF (Fig. 3). Nonetheless, the BML refinement led to a better fit of the single- and double-mutant probabilities, while also yielding a clear improvement in terms of connected correlations (Fig. 3). Finally, while the model was designed to explicitly reproduce single- and double-mutant probabilities, the inferred parameters also captured higher order statistics (Fig. 3). This information was not reflected by the model produced by MPF, emphasizing the importance of the subsequent BML estimation phase.

Fig. 3.

(A and B) Scatter plot of the MPF/MPF-BML vs. MSA single (double)-mutant probabilities, verifying that the inferred fields and couplings for both MPF and the final model with additional BML refinement, labeled MPF-BML, can accurately reproduce the single (double)-mutant probabilities. (C) Scatter plot of the connected correlations (covariances), demonstrating the benefits of BML refinement over MPF alone. (D) Probability of number of mutations, verifying that the inferred fields and couplings after BML, but not after MPF alone, accurately reproduce higher order statistics (which were not inputs to the inference procedure) beyond the single- and double-mutant probabilities.

Inferred Fitness Landscape Accurately Predicts in Vitro Replicative Fitness Measurements.

To assess the ability of the inferred model to capture the fitness of different strains of gp160, we compared predictions based on this model to in vitro measurements of HIV fitness. We compiled 98 fitness measurements from in vitro experiments using competition (16, 17, 34, 35) or infectivity assays (18, 36). Strains used in these experiments were constructed by engineering single, double, or higher order mutations on a CCR5-tropic strain. However, there was no explicit consideration given to choosing higher order mutants that exhibit epistatic coupling, known to be important in vivo. The metric of fitness in our model is the energy E of the maximum-entropy model. Larger values of E correspond to less-fit strains. Based on the values of E computed using our fitness landscape, we predicted the relative fitness of the experimentally tested strains. We then determined the rank correlation (based on a weighted Spearman correlation measure) between these values of E and the measured fitness values (Fig. 4). As anticipated, we found a strong negative weighted correlation between model energies and viral fitness. These results demonstrate the ability of the inferred landscape to discriminate the relative intrinsic fitness of different HIV strains with mutations in gp160. We note that the predicted combining factor ϕ = 0.95, established at the initial variable combining phase, produced a landscape that achieved the strongest correlation with fitness values compared with other values of ϕ.

Fig. 4.

(A) Normalized logarithm of fitness vs. normalized energy for all seven experimental datasets collected from the literature for a combining factor of ϕ = 0.95. References for the datasets are shown in the legend. Normalization is performed by subtracting the mean of each dataset and dividing by the SD. A strong weighted correlation is observed. (B) Fraction of the top x predicted pairs that are actually in contact [true-positive rate (TPR)] vs. the top x predicted pairs as determined by a high value of the coupling constant (black line). True contacts are calculated from a crystal structure of the SOSIP trimer (5D9Q). These data show that the pairs with high predicted values of the coupling constants predict true contacts observed in the crystal structure of the SOSIP trimer. Also graphed is the total number of contacts (observed in the crystal structure) divided by the total number of pairs (horizontal blue line), which represents the probability of choosing a pair in contact purely by chance. Only pairs that are more than five residues apart in sequence space are considered, and out of these pairs, two residues are assumed to be in contact if they are <8 Å apart. PDB ID code 5D9Q.

Inferred Couplings Predict Contact Residues.

We expect that strongly interacting pairs of residues should be associated with strong couplings. For example, we previously showed in other HIV proteins that strong couplings are associated with compensatory mutations or those that interact synergistically to make the double mutant especially unfit (22, 24, 25). We used the couplings inferred for gp160 to predict residues that are likely to be in contact in the protein’s native state, which can be determined from crystal structures. The rationale is that residues that are in close spatial proximity in the protein’s native state should have a greater influence on each other, and this effect can be quantified by a measure dependent on the corresponding couplings (37, 38). Similar approaches were originally developed to predict residues in contact in the 3D protein structure on the basis of their inferred interactions, and have recently been extended to identify sets of interacting proteins on the basis of strong inferred couplings (39, 40). As our model is inferred from in vivo data, we anticipate that the model should capture contacts in the gp160 trimer that constitutes the functionally important HIV spike, not just contacts between residues in monomers of gp120 or gp41. To test this, we compared the top predicted contacts (a function of our model couplings) with known contacts based on a crystal structure of SOSIP (41), a synthetic mimic of the native gp160 trimer [Protein Data Bank (PDB) ID code 5D9Q]. The results (Fig. 4) demonstrate that the couplings in our inferred model are predictive of protein contacts for the gp160 trimeric spike of HIV. We found similar results for nine other crystal structures. Although our inferred couplings can predict contacts well, apparent false positives are still observed.
To examine this further, we considered the top 20 residue pairs that have the largest average product correction–direct information (APC-DI) scores, of which 8 are not in contact according to the crystal structure. With the exception of one residue pair, these residues are located in the V2 or V4 loop, or are CD4 contacts. For the residues in the V2 loop or those that are CD4 contacts, a possible reason for the high APC-DI scores is the conformational changes that occur when gp160 interacts with CD4 and other coreceptors, which are not captured in the protein structures. Indeed, conformational changes due to gp160–CD4 interaction have been well documented for the V2 loop (42). These conformational changes may favor or disfavor certain pairs of mutations. For the 389–417 residue pair in the V4 loop, it is not clear why this pair has a high APC-DI score, although we note that these residues are located near the base of the loop.
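The contact-prediction evaluation summarized in Fig. 4B can be sketched as follows, using the criteria stated in the caption (pairs more than five residues apart in sequence, contact defined as a distance below 8 Å). The function names and the dictionary-based inputs are hypothetical; computing the coupling-derived scores (such as APC-DI) and extracting distances from a PDB file are outside the scope of this sketch.

```python
def _eligible_pairs(distances, min_separation):
    """Pairs (i, j), i < j, more than min_separation apart in sequence."""
    return [p for p in distances if p[1] - p[0] > min_separation]


def top_pair_tpr(scores, distances, top_x, min_separation=5, contact_cutoff=8.0):
    """Fraction of the top_x highest-scoring pairs that are true contacts.

    scores:    dict mapping (i, j) with i < j to a coupling-derived score.
    distances: dict mapping (i, j) with i < j to a distance in angstroms.
    """
    eligible = [p for p in _eligible_pairs(distances, min_separation) if p in scores]
    ranked = sorted(eligible, key=lambda p: scores[p], reverse=True)[:top_x]
    if not ranked:
        raise ValueError("no eligible residue pairs")
    hits = sum(distances[p] < contact_cutoff for p in ranked)
    return hits / len(ranked)


def chance_level(distances, min_separation=5, contact_cutoff=8.0):
    """Probability that a randomly chosen eligible pair is in contact."""
    eligible = _eligible_pairs(distances, min_separation)
    if not eligible:
        raise ValueError("no eligible residue pairs")
    return sum(distances[p] < contact_cutoff for p in eligible) / len(eligible)
```

Comparing `top_pair_tpr` against `chance_level` over a range of `top_x` values corresponds to comparing the black curve with the horizontal baseline in Fig. 4B.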

Observed Mutations in Footprints of CD4 Binding Site bnAbs Have Low Fitness Costs.

As further validation of our landscape, we investigated whether its predictions explain why certain mutations that change the binding of bnAbs are observed in circulating virus strains, and others are not. Based on crystal structures, we first determined the residues on gp160 that were in contact with a number of CD4 binding-site bnAbs. The CATNAP online tool (43) is a database of experimental IC-50 measurements between panels of bnAb–virus pairs. To determine whether mutations in gp160 residues that were in contact with the bnAbs affected binding, we used a feature in CATNAP that uses Fisher’s exact test to identify viral amino acids that are statistically associated (P < 0.05) with either high or low IC-50 scores. This analysis allowed us to determine a set of residues in gp160 contained in the footprints of CD4 binding-site antibodies that are statistically associated with a change in IC-50 for at least one mutant amino acid. These residues have at least one amino acid substitution pathway that changes the binding affinity to bnAbs. We analyzed only viral strains observed in the MSA that correspond to viable viruses. At the identified set of residues, we further classified specific amino acid mutations as either “observed,” defined as those mutations associated with a change in binding to bnAbs in the CATNAP data and observed in the MSA, or “unobserved,” defined as mutations that are not in the CATNAP data (irrespective of whether they are observed in the MSA). We found that roughly 90% of these unobserved mutations are absent from the MSA, and therefore are likely to be unviable. Furthermore, the largest frequency for an amino acid mutation present in the MSA but absent from CATNAP is small (i.e., less than 5%), while the amino acid mutations observed in the CATNAP data are representative of those observed in the MSA.
We expect, therefore, that observed mutations would be associated with lower fitness costs than unobserved mutations, since they are associated with viable viruses, and hence they are expected to escape the immune pressure with the least effect on the ability to propagate infection. To assess whether the predictions of our fitness landscape are consistent with these expectations, we employed a measure that has been used before to predict the fitness cost of a mutation averaged over all sequence backgrounds (24). Specifically, we compute the change in energy associated with the mutation averaged over a Monte Carlo sample of diverse sequence backgrounds obtained using our fitness landscape to approximate:

$$\Delta E_i(a) = \frac{\sum_{x : x_i = 0} \left[ E(\tilde{x}) - E(x) \right] e^{-E(x)}}{\sum_{x : x_i = 0} e^{-E(x)}},$$

where the energy $E$ is evaluated with the final field and coupling parameters after BML refinement, and the summations are over all sequences $x$ having the consensus amino acid (“0”) at the ith residue. Here, $\tilde{x}$ is identical to $x$, except with the amino acid at residue i replaced by a. This measure quantifies the typical change in fitness upon introducing this mutation at the residue across various sequence backgrounds in circulating viruses. A positive fitness cost implies a decrease in fitness (22, 24). As expected, the observed mutations at the residues that affect IC-50 are predicted to incur a lower average fitness cost compared with amino acid mutations that were not observed (Fig. 5).
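A minimal sketch of this background-averaged fitness cost is given below. It assumes that sequences are encoded as lists of integer states with 0 denoting the consensus, that fields are stored as `h[i][a]` and couplings as `J[(i, j)][a][b]` for i < j, and that the supplied sample was drawn from the model distribution, so that a plain average over sampled backgrounds approximates the weighted sum. These storage conventions and function names are assumptions for illustration.

```python
def energy(x, h, J):
    """Energy of sequence x: sum of fields plus pairwise couplings."""
    total = sum(h[i][a] for i, a in enumerate(x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += J[(i, j)][x[i]][x[j]]
    return total


def mean_fitness_cost(samples, i, a, h, J):
    """Average energy change of mutating residue i from consensus (0) to a.

    samples: sequences assumed to be drawn from the model distribution
             (for example, by Monte Carlo sampling); only those with the
             consensus state at residue i are used as backgrounds.
    """
    backgrounds = [x for x in samples if x[i] == 0]
    if not backgrounds:
        raise ValueError("no sampled background has the consensus at residue i")
    total = 0.0
    for x in backgrounds:
        mutated = list(x)
        mutated[i] = a
        total += energy(mutated, h, J) - energy(x, h, J)
    return total / len(backgrounds)


if __name__ == "__main__":
    # Toy two-residue model with a compensatory coupling (made-up values).
    h = [[0.0, 1.0], [0.0, 2.0]]
    J = {(0, 1): [[0.0, 0.0], [0.0, -0.5]]}
    samples = [[0, 0], [0, 1], [1, 1]]
    print(mean_fitness_cost(samples, i=0, a=1, h=h, J=J))
```

In the toy example the mutation at the first residue costs less in the background that already carries the second mutation, which is the kind of compensatory effect that a model without couplings cannot represent.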

Fig. 5.

Distribution of fitness costs for residues in the footprints of CD4bs-directed bnAbs. The footprint residues for a set of CD4bs-directed bnAbs were determined using crystal structures. Next, the CATNAP online tool (43), a database of experimental IC-50 measurements between panels of bnAb–virus pairs, was used to identify amino acids within these footprints that are statistically associated with either high or low IC-50 for the bnAb–virus pairs tested. The footprint residues were separated into two sets of residues: the first set (A) consists of those residues for which at least one amino acid is statistically associated with a change in IC-50; the second set (B) consists of all other residues in the footprint where no such statistical associations with IC-50 were found. For both graphs, the fitness costs are calculated over all amino acids present in the CATNAP bnAb–virus panels (observed) and all amino acids not present in these CATNAP panels (unobserved). For the residues in A, the means of the observed and unobserved fitness cost distributions are 3.86 and 4.17, respectively (P < 0.001); for the residues in B, the means of the observed and unobserved fitness cost distributions are 3.61 and 5.42, respectively (P < 0.001); the P values were estimated by simulating from a null model in which the amino acids that were observed or unobserved are randomly permuted for each residue in the footprint.

The difference in mean fitness cost between observed and unobserved mutations is due to the presence of some observed mutations with very low fitness costs, coupled with high fitness costs for many unobserved mutations. The existence of the low-fitness cost pathways that also likely abrogate binding implies that even canonically broad and potent bnAbs are imperfect.
In patients who develop bnAbs upon natural infection, the viral quasispecies readily evolves escape mutations (1); escape mutations also eventually arise when bnAbs are administered passively (44), even when multiple bnAbs are used in concert (44, 45). To further test our fitness landscape, we next focused on the gp160 residues in CD4 binding site antibody footprints that are not statistically associated with changes in IC-50 according to the analysis using CATNAP. We first examined whether the amino acid mutations at these residues that are observed are biochemically similar to each other and to the consensus amino acid at the residue. To determine this, for each residue, we calculated the biochemical similarity between the nonconsensus amino acid and the respective consensus amino acid, averaged over all sequences in the panels with a nonconsensus amino acid at that residue. The similarity matrix (46) assigns to each pair of amino acids a similarity score ranging from 0 to 6, with matching amino acids assigned a value of either 5 or 6, where 6 is assigned to the rarer, more distinctive amino acids (F, M, Y, H, C, W, R, and G). Mismatched amino acids are assigned values from 3 to 0 based on their polarity, hydrophobicity, shape, and charge (47). For each residue, the average biochemical similarity calculated above was compared with a simulated null model in which the nonconsensus amino acids observed in the bnAb–virus panels were randomly selected from all nonconsensus amino acids at that residue. The average similarity was above the fifth percentile of the respective null model for 53% of the residues, indicating that, on average, the amino acids sampled at these residues are more biochemically similar to the consensus amino acid than a uniformly selected random sample. The fact that the observed mutations are biochemically similar may explain why they did not significantly alter binding characteristics of the bnAbs.
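The comparison against the similarity null model can be sketched in Python as follows. The sketch draws nonconsensus amino acids uniformly at random, averages their similarity to the consensus, and reports where the actual average falls in that null distribution. The toy similarity scores in the usage note, the sampling with replacement, and the function name `similarity_percentile` are hypothetical; the real analysis uses the similarity matrix of ref. 46.

```python
import numpy as np

def similarity_percentile(observed_aas, consensus, candidates, similarity,
                          n_draws=5000, seed=0):
    """Percentile of the observed mean similarity within a uniform-sampling null.

    observed_aas : nonconsensus amino acids seen at the residue (one per sequence)
    candidates   : all nonconsensus amino acids allowed at the residue
    similarity   : dict mapping frozenset({aa1, aa2}) to a similarity score
    """
    rng = np.random.default_rng(seed)

    def score(aa):
        return similarity[frozenset((aa, consensus))]

    actual = float(np.mean([score(a) for a in observed_aas]))
    cand_scores = np.array([score(a) for a in candidates], dtype=float)
    # Each null draw picks as many amino acids as were observed, uniformly at random.
    draws = rng.choice(cand_scores, size=(n_draws, len(observed_aas)), replace=True)
    null_means = draws.mean(axis=1)
    percentile = float(np.mean(null_means <= actual) * 100.0)
    return actual, percentile
```

For example, with consensus `"L"`, candidates `["I", "V", "D", "K"]`, and a toy table scoring L–I and L–V as 3 and L–D and L–K as 0, a residue where only I and V are observed sits at the top of the null distribution.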
We then computed the fitness costs for the observed mutations at these residues and compared them, as before, with those for the mutations that were not observed. We expect that the latter mutations to biochemically different amino acids may have affected IC-50, but did not arise in viable viruses because of their high fitness costs; that is, the fitness cost associated with evolving escape mutations was prohibitive. Consistent with this expectation, our fitness landscape predicts that the fitness costs for the observed mutations are lower than those for the unobserved mutations (Fig. 5).

bnAbs That Target the CD4 Binding Site Contact Select Residues That Are Predicted to Be Associated with a High Fitness Cost upon Mutation.

We computed the fitness cost associated with making mutations at all gp160 residues by carrying out Monte Carlo simulations to obtain an ensemble of sequences, all of which contained the consensus amino acid at a chosen residue. This ensemble of sequences was used to estimate the fitness cost for each particular mutation from the consensus to a nonconsensus amino acid, using the sequence-background average described above. The results were then averaged over the mutant amino acids (24), weighting more likely mutations more highly; the average runs over the mutants at residue i that remain after mutant combining. This weighted average provides us with an average fitness cost of evolving mutations at a particular residue, averaged over all sequence backgrounds in which the mutation might arise and over all mutant amino acids. We then superimposed these predicted residue-level fitness costs onto the SOSIP crystal structure. This can be visualized in a heat map depicting the fitness costs of residues that we determined to be accessible on the surface of SOSIP (Fig. 6). These results show that the map is very rugged, composed of a mixture of closely spaced low- and high-fitness cost residues. To obtain a clearer picture of whether low-fitness cost residues are predominant in any region of the size of a typical antibody footprint, we calculated the fitness cost averaged over all surface residues within 12.5 Å of the chosen residue (the radius of a typical antibody footprint, as estimated from analyzing various bnAb–gp160 crystal structures). The corresponding heat map shows that fitness costs averaged over the antibody footprint are smooth over the entire surface on this scale (Fig. 6). Ab-footprint–sized regions are typically dominated by low-fitness cost residues. Thus, most Ab responses that arise may be escaped via mutations. Even though the CD4 binding site has one of the largest fitness costs when averaged over a typical antibody footprint size (Fig. 6), there still exist several residues associated with low fitness costs for mutations. Hence, if mutations at these residues lead to abrogation of Ab binding, then they present relatively easy escape pathways for the virus, consistent with the previous observation based on CATNAP data. We note that two other regions in Fig. 6 appear to have large numbers of high-escape cost residues: (i) two motifs of the C1 region of gp120, located in the trimer core near the intertrimer interfaces as well as the gp120–gp41 interface (48), and (ii) a region of gp41 containing portions of the H2 and membrane-proximal external region (MPER) motifs (49–52), although much of the MPER region is truncated in the 5D9Q crystal structure, including the epitopes for MPER-directed bnAbs 2F5 and 4E10 (51). However, the first region is sterically inaccessible to antibodies, while the second region contains a few highly variable residues that dominate the weighted averaging over mutant amino acids, and thus on the scale of a footprint it is not very conserved.
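The two averaging steps in this paragraph, over mutant amino acids at a residue and then over surface residues within 12.5 Å, can be sketched in Python. The exact weighting used by the authors is not reproduced in this text, so the Boltzmann-style weights `exp(-cost)` below are an assumption chosen only because they favor low-cost (more likely) mutations. The function names, the use of one 3-D coordinate per residue, and the inclusion of the chosen residue in its own neighborhood are likewise illustrative assumptions.

```python
import numpy as np

def residue_cost(mutant_costs):
    """Weighted mean of per-amino-acid fitness costs at one residue.

    Assumption: weights proportional to exp(-cost), so low-cost mutations dominate.
    """
    c = np.asarray(mutant_costs, dtype=float)
    w = np.exp(-(c - c.min()))  # shifting by the minimum leaves normalized weights unchanged
    return float(np.sum(w * c) / np.sum(w))

def footprint_average(coords, costs, radius=12.5):
    """Average residue-level costs over all surface residues within `radius` angstroms.

    coords : (N, 3) array, one representative coordinate per surface residue
    costs  : (N,) array of residue-level fitness costs
    """
    coords = np.asarray(coords, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Pairwise Euclidean distances between residues.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    within = dist <= radius  # each residue counts itself (distance 0)
    return (within * costs[None, :]).sum(axis=1) / within.sum(axis=1)
```

Under this weighting a single low-cost mutant pulls the residue-level cost down sharply, which is the behavior the text attributes to highly variable residues dominating the average.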

Fig. 6.

Fitness costs superimposed on SOSIP Env gp160 trimer (PDB ID code 5D9Q). High-fitness cost residues (red) are more difficult for the virus to mutate, while low-fitness cost residues (blue) tend to be highly variable and easy to mutate. Residues in black are considered hypervariable and are not included in our model. (A) Fitness cost for each residue is averaged over all possible sequence backgrounds and amino acids at that residue. (B) Additionally averaged over surface residues within 12.5 Å of the residue. In both A and B, the regions circled in black, from Top Left to Bottom Right, correspond to the CD4bs, two motifs of the C1 region of gp120, and a region of gp41 containing portions of the H2 and MPER motifs, respectively.

We next examined the residues in the CD4 binding site region that are most strongly associated with the binding of the VRC01 class of antibodies. Li et al. (53) created an extensive set of alanine scanning mutants on a background of JRCSF, a CCR5-tropic clade B gp160 strain. For each alanine mutant, the authors measured binding affinity of full-length gp120 monomer to VRC01 relative to the wild-type virus. This assay led to the identification of residues where mutation to alanine led to a significant decrease in VRC01–gp120 binding. The authors additionally conducted a neutralization assay in which whole mutant pseudoviruses are incubated with live target cells, and loss of infectivity is measured as a function of concentration of added CD4–Ig, a synthetic construct in which the CD4 domains 1 and 2 are fused to the human IgG1 Fc domain. CD4–Ig competes with target cells to bind to the CD4 binding site of the pseudovirus, thus hindering infection of target cells. In this assay, if lower concentrations of CD4–Ig are required to inhibit infection, a decreased ability of the alanine mutant to infect via CD4-mediated interactions is indicated. In other words, a viral strain with a low CD4–Ig value has a low replicative fitness.
The portion of these data corresponding to the gp160 residues in contact with VRC01 can be determined from crystal structures. Of the residues on gp160 that were determined to be important for binding to VRC01, seven result in significant loss of binding (<33%) to gp120 upon mutation to alanine. While three of these seven (367, 368, and 457) are predicted by our landscape to have a high fitness cost upon mutation, the remaining four (279, 371, 467, and 474) have a much lower predicted fitness cost. This is seemingly at odds with the apparent high fitness cost of mutation to alanine for all seven residues, as measured experimentally (loss in sensitivity to CD4–Ig neutralization). This discrepancy can likely be explained by examining these residues at the amino acid level. For the four residues in question, the most common nonconsensus amino acid accounts for a large fraction of sequences in the MSA, and it is not alanine. Thus, mutation to the most common nonconsensus amino acid is unlikely to result in a high fitness cost. Furthermore, in the CATNAP panel described in the previous section, neither the consensus nor the most common mutant amino acid is significantly associated with a change in IC-50. Rather, the distributions of IC-50s for sequences containing either the consensus or the most common mutant amino acid are indistinguishable by two-tailed t test. This evidence suggests that mutation to the most common mutant amino acid may be insufficient to abrogate binding to VRC01. Our predicted fitness costs are likely artificially low, since the weighted average over mutant amino acids tends to weight low-fitness cost amino acid mutations highly and we infer that mutation to the most common nonconsensus amino acid does not incur a high fitness cost. If we instead calculate the fitness costs for these residues specifically to alanine, we predict much higher values (3.7, 4.2, 2.7, and 4.7, for residues 279, 371, 467, and 474, respectively).
In contrast, for the three residues with high predicted fitness cost, the frequency of the most common mutant amino acid was at most 0.2% in the MSA. Thus, although alanine was not the most common mutant, the experimental fitness measurements made using alanine mutants were representative of typical escape costs at these residues. Most convincingly, 9 out of the top 10 residues with the largest predicted fitness cost had an undetectable CD4–Ig value, further suggesting that our landscape provides an accurate measure of intrinsic fitness. These results show that, consistent with observations regarding the VRC01 class of antibodies, our fitness landscape predicts that bnAbs that target the CD4 binding-site region selectively bind to those residues associated with the largest fitness costs. They furthermore highlight the importance of describing fitness at the level of individual amino acids, not just residues, which underscores the value of the fitness landscape that we have inferred.

Discussion

HIV is a highly mutable virus, and the design of both T cell and antibody arms of an effective vaccine would benefit from a knowledge of its mutational vulnerabilities as this would identify targets on the HIV proteome and viral spike where potent immune responses should be directed. Determining the fitness of the virus (ability to assemble, replicate, and propagate infection) as a function of the sequence of its proteins would help in this regard. Because of the significance of compensatory or deleterious coupling between mutations at multiple residues, in determining such a fitness landscape, it is important to account for the effects of coupling between mutations. Past work has been reasonably successful in determining such a fitness landscape for diverse HIV polyproteins by inferring the same from sequence data and testing predictions against in vitro experiments and clinical data (19, 22–25). This has not been true for the gp160 Envelope polyprotein that forms the virus’s spike and is targeted by antibodies. This is due to the large diversity and size of gp160 (Fig. 1). We have reported a computational approach, based on the maximum-entropy method (19, 22), which includes a data-driven parameter reduction method, an initial estimation of the prevalence landscape parameters based on the principle of MPF, followed by a BML procedure to refine the parameters (Fig. 2). We applied this framework to infer the fitness landscape of gp160 from available sequence data. Predictions of the resulting model compare very well with published experimental data, including intrinsic fitness measurements, protein contacts in the SOSIP trimer, and known escape mutations that arise to evade antibodies that target the CD4 binding site. A natural question, equally valid for any viral protein, is why one may see a strong relation between prevalence and fitness, and what are the fundamental processes underpinning this relationship. 
Motivated by this question, for internal HIV proteins that are targeted by cytotoxic T lymphocytes (CTLs), a mechanistic explanation has been proposed on the basis of simulation studies of HIV evolution and associated theory (22, 23, 26). Four key factors were identified for HIV proteins: (i) Because of the huge diversity of HLA genes in the population, very few regions of the HIV proteome are targeted by a significant fraction of humans. (ii) If a mutation is forced by the immune response of a particular individual and it incurs a fitness cost, then upon infection of a new patient who does not target the same region, the mutation will revert during chronic infection. (iii) Unlike influenza, the HIV population has not been subjected to a few effective classes of natural or vaccine-induced immune responses by a significant fraction of humans across the world. Therefore, it has not evolved in narrowly directed ways to evade effective herd memory responses. (iv) Furthermore, the effects of phylogeny are ameliorated due to high rates of recombination (54). For these reasons, over reasonable mutational distances, HIV proteins are in a steady state and so amenable to inference of fitness landscapes using maximum-entropy approaches applied to sequence data. This is decidedly not true for the influenza population, which is driven far out of equilibrium (55, 56). The first of the above arguments, however, does not translate directly to gp160, which is primarily targeted by antibodies (57). While it is tempting to draw high-level analogies with the CTL-based phenomena on the basis that the diversity of antibodies generated against HIV is large, there are clear distinctions. For example, the V2 loop of gp160 is estimated to be immunogenic in 20–40% of infected patients, while the V3 loop induces antibodies in essentially all infected patients (58).
This would seem to promote more directed evolution compared with the internal proteins under CTL pressure. However, it is possible that different individuals evolve antibodies that target different residues in the V2 and V3 loops. It is also true that peptides derived from gp160 are targeted by CTLs, and that antibody responses are often directed toward debris from the fragile HIV spike. These reasons may underlie why we see excellent correspondence between our inferred model for gp160 and experimental measurements. Nonetheless, the mechanistic reasons remain unclear for gp160. Resolving these basic principles governing the strong prevalence–fitness correspondences identified in this paper for gp160 is worthy of future study. We also considered an optimized model using the fields only, obtained by fitting only the single-mutant probabilities. We observe a modest gain in using the model with couplings, compared with the fields model; that is, the correlation with experimental fitness measurements is slightly weaker for the fields model than for the model inferred using our computational framework. The comparison with in vitro fitness measurements is not much worse for the model with fields only because the coupling matrix is relatively sparse, and so unless fitness measurements are carried out with only the subset of mutants with large couplings, the fields are dominant. No such preselection was done in the data on in vitro fitness that we compared our predictions against. The value of the coupling parameters is seen in several contexts. For example, protein contacts simply cannot be predicted without the couplings because the physical interactions between these residues induce a correlated mutation structure. More importantly, the fitness consequences of the couplings can be very significant for the evolution of HIV in vivo (the situation of ultimate interest), as described below. Barton et al.
(24) studied a cohort of HIV-infected individuals and used the inferred fitness landscapes of HIV proteins (other than gp120) to predict how long it took for the virus to escape from the initial T cell immune pressure in individual patients. When using single-residue entropy alone (i.e., fields only), there was only a 15% correlation between single-residue conservation and escape time. However, when they carried out dynamic simulations of the evolution using the fitness landscape that includes coupling parameters, this correlation was 72% and statistically significant. The reason for this dramatic increase in performance upon including the couplings is that evolutionary trajectories in vivo can sample many potential compensatory pathways that may aid escape due to the high rate of replication and mutation of HIV. Therefore, pathways characterized by particularly strong compensatory couplings can be accessed. Similarly, Barton et al. (24) showed that dramatic differences in escape time for patients targeting the same epitope could be explained by differences in the background mutations contained in the sequence of the virus strains that infected these patients. For example, if the background mutations had strong negative couplings with the ultimate escape mutation, the escape mutation took much longer to evolve and take over the population. Therefore, in the context of the real in vivo situation of interest, including the couplings is very significant. This point has also been emphasized in other contexts where compensatory mutations have been observed (16–18, 59, 60). While our method was developed to address the specific challenges posed by the gp160 protein, the approach is general and may be applied to other high-dimensional maximum-entropy inference problems. For example, it may be directly applied to estimate the prevalence landscape of other HIV proteins, or proteins of other viruses. 
As an example, we applied the framework to p24, a relatively conserved internal protein of HIV. The landscape with ∼80,000 parameters was inferred very quickly (∼2.3 h for MPF and 1.2 h for the subsequent BML), and predictions from this landscape compared favorably with experimental fitness values (19), returning a strong Spearman correlation of −0.8. For p24, along with other HIV proteins, similar correspondences between prevalence and fitness have been obtained through other maximum-entropy inference methods (19, 22–24). We showed that the fitness landscape of the surface-accessible parts of the Env trimer is very rugged: a few residues that incur a large fitness cost upon mutation (including the effects of couplings between mutations) are surrounded by residues with a low fitness cost upon mutation (Fig. 6). Even for the relatively “conserved” CD4 binding-site region, mutations at only some residues are predicted to incur large fitness costs. This is consistent with the coarse-grained models of antibody–virus binding employed in models of affinity maturation with variant antigens by Wang et al. (61) and Shaffer et al. (62). In contrast, in a related study, Luo and Perelson (62, 63) assumed that variant antigens contain epitopes composed of only conserved or only variable residues. Our observation about the spatial distribution of escape costs on the surface of HIV envelope suggests that antibody footprints will in general contain both conserved and variable residues. It is likely that these conserved residues need to be the principal targets of effective antibody responses that are difficult for the virus to evade.
For this reason, the availability of our fitness landscape of HIV’s envelope proteins is expected to aid the rational design of immunogens that could potentially induce bnAbs upon vaccination (1–7), as well as guide the choice of combination of known bnAbs that would prevent escape in humans undergoing passive antibody therapy (10). Specifically, our fitness landscape can be clinically useful in the future for the selection of combination bnAb therapy. Much like in the realm of antiretroviral drugs, the use of multiple different bnAbs has the potential to prevent or delay escape mutations, as the viruses in the host must evolve escape mutations in combinations of binding sites of the bnAbs administered. Two bnAbs with epitopes that have many detrimental couplings between them in all possible sequence backgrounds are likely to delay escape in diverse patients with different virus strains due to the added difficulty of introducing simultaneous escape mutations in both epitopes. Furthermore, bnAbs that target epitopes containing residues wherein escape mutations can be easily compensated by other mutations should be avoided. The importance of considering the fitness landscape in an analogous context of antiretroviral drugs has been previously shown for protease by Butler et al. (25). An additional application that is clinically relevant is the selection of variant antigens (e.g., variants of SOSIP) for use in a candidate HIV vaccine designed to elicit bnAbs. The induction of bnAbs via vaccination will likely require immunization with multiple variant antigens that share some conserved residues, but whose variable regions contain distinct mutations compared with each other. The fitness landscape inferred in this work can provide insight into which residues within these antigens should be mutated within candidate immunogens such that the induced bnAbs are cross-reactive to diverse circulating viral strains with different sequences.

Materials and Methods

The supplementary information includes detailed descriptions of the data preprocessing, computational framework, fitness verification, residue contact prediction, comparison with other methods, and application of the landscape for characterizing the fitness cost of escape mutations from bnAbs. The landscape, processed MSA, and implementation of the computational framework are freely available on the following website: https://github.com/raymondlouie/MPF-BML.

60 in total

Review 1. The Antibody Response against HIV-1.

Authors: Julie Overbaugh; Lynn Morris
Journal: Cold Spring Harb Perspect Med Date: 2012-01 Impact factor: 6.915

2. Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis.

Authors: Thomas Gueudré; Carlo Baldassi; Marco Zamparo; Martin Weigt; Andrea Pagnani
Journal: Proc Natl Acad Sci U S A Date: 2016-10-11 Impact factor: 11.205

3. ACE: adaptive cluster expansion for maximum entropy graphical model inference.

Authors: J P Barton; E De Leonardis; A Coucke; S Cocco
Journal: Bioinformatics Date: 2016-06-21 Impact factor: 6.937

4. Repeating sequences and gene duplication in proteins.

Authors: A D McLachlan
Journal: J Mol Biol Date: 1972-03-14 Impact factor: 5.469

5. Somatic mutations of the immunoglobulin framework are generally required for broad and potent HIV-1 neutralization.

Authors: Florian Klein; Ron Diskin; Johannes F Scheid; Christian Gaebler; Hugo Mouquet; Ivelin S Georgiev; Marie Pancera; Tongqing Zhou; Reha-Baris Incesu; Brooks Zhongzheng Fu; Priyanthi N P Gnanapragasam; Thiago Y Oliveira; Michael S Seaman; Peter D Kwong; Pamela J Bjorkman; Michel C Nussenzweig
Journal: Cell Date: 2013-03-28 Impact factor: 41.582

6. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design.

Authors: Andrew L Ferguson; Jaclyn K Mann; Saleha Omarjee; Thumbi Ndung'u; Bruce D Walker; Arup K Chakraborty
Journal: Immunity Date: 2013-03-21 Impact factor: 31.745

Review 7. A Blueprint for HIV Vaccine Discovery.

Authors: Dennis R Burton; Rafi Ahmed; Dan H Barouch; Salvatore T Butera; Shane Crotty; Adam Godzik; Daniel E Kaufmann; M Juliana McElrath; Michel C Nussenzweig; Bali Pulendran; Chris N Scanlan; William R Schief; Guido Silvestri; Hendrik Streeck; Bruce D Walker; Laura M Walker; Andrew B Ward; Ian A Wilson; Richard Wyatt
Journal: Cell Host Microbe Date: 2012-10-18 Impact factor: 21.023

8. Relationship between functional profile of HIV-1 specific CD8 T cells and epitope variability with the selection of escape mutants in acute HIV-1 infection.

Authors: Guido Ferrari; Bette Korber; Nilu Goonetilleke; Michael K P Liu; Emma L Turnbull; Jesus F Salazar-Gonzalez; Natalie Hawkins; Steve Self; Sydeaka Watson; Michael R Betts; Cynthia Gay; Kara McGhee; Pierre Pellegrino; Ian Williams; Georgia D Tomaras; Barton F Haynes; Clive M Gray; Persephone Borrow; Mario Roederer; Andrew J McMichael; Kent J Weinhold
Journal: PLoS Pathog Date: 2011-02-10 Impact factor: 6.823

9. Therapeutic efficacy of potent neutralizing HIV-1-specific monoclonal antibodies in SHIV-infected rhesus monkeys.

Authors: Dan H Barouch; James B Whitney; Brian Moldt; Florian Klein; Thiago Y Oliveira; Jinyan Liu; Kathryn E Stephenson; Hui-Wen Chang; Karthik Shekhar; Sanjana Gupta; Joseph P Nkolola; Michael S Seaman; Kaitlin M Smith; Erica N Borducchi; Crystal Cabral; Jeffrey Y Smith; Stephen Blackmore; Srisowmya Sanisetty; James R Perry; Matthew Beck; Mark G Lewis; William Rinaldi; Arup K Chakraborty; Pascal Poignard; Michel C Nussenzweig; Dennis R Burton
Journal: Nature Date: 2013-10-30 Impact factor: 49.962

10. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing.

Authors: Jaclyn K Mann; John P Barton; Andrew L Ferguson; Saleha Omarjee; Bruce D Walker; Arup Chakraborty; Thumbi Ndung'u
Journal: PLoS Comput Biol Date: 2014-08-07 Impact factor: 4.475
