Graph-based optimization of epitope coverage for vaccine antigen design.

Abstract

Epigraph is a recently developed algorithm that enables the computationally efficient design of single or multi-antigen vaccines to maximize the potential epitope coverage for a diverse pathogen population. Potential epitopes are defined as short contiguous stretches of proteins, comparable in length to T-cell epitopes. This optimal coverage problem can be formulated in terms of a directed graph, with candidate antigens represented as paths that traverse this graph. Epigraph protein sequences can also be used as the basis for designing peptides for experimental evaluation of immune responses in natural infections to highly variable proteins. The epigraph tool suite also enables rapid characterization of populations of diverse sequences from an immunological perspective. Fundamental distance measures are based on immunologically relevant shared potential epitope frequencies, rather than simple Hamming or phylogenetic distances. Here, we provide a mathematical description of the epigraph algorithm, include a comparison of different heuristics that can be used when graphs are not acyclic, and we describe an additional tool we have added to the web-based epigraph tool suite that provides frequency summaries of all distinct potential epitopes in a population. We also show examples of the graphical output and summary tables that can be generated using the epigraph tool suite and explain their content and applications. Published 2017. This article is a U.S. Government work and is in the public domain in the USA. Statistics in Medicine published by John Wiley & Sons Ltd.

Keywords: algorithm; antigen; de Bruijn graph; directed acyclic graph; epitope; vaccine

Year: 2017 PMID： 28132437 PMCID： PMC5763320 DOI： 10.1002/sim.7203

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.373

Introduction

The human immunodeficiency virus (HIV) causes a chronic infection that without treatment ultimately leads to AIDS and death, although the virus can be held in check by daily life‐long antiretroviral therapy. HIV is a retrovirus with a high mutation rate. Immune responses in natural infection drive diversification in vivo by selecting for immune escape variants 1, 2, 3, 4. Thus, distinct and immunologically relevant mutations accumulate over time in every infected individual, creating a complex HIV quasispecies in each infected person; this translates to highly diverse viruses at the population level. This diversity limits the cross‐reactivity of single antigen vaccines, such as a natural protein or a consensus sequence 5, 6. Thus, we, along with our team at Los Alamos National Laboratory, developed a multiple antigen mosaic strategy about 10 years ago, expressly aimed at contending with this diversity 7. The mosaic vaccine employs a genetic algorithm to generate antigens that collectively maximize their coverage of diverse epitopes in an HIV target population. Epigraph is a recently introduced algorithm 8 that uses a graph‐based approach to maximize the same potential epitope coverage that the mosaic algorithm maximized, but with better computational efficiency and, under some conditions, with provably optimal solutions. Building on 8, which emphasized applications of the epigraph code, here, we provide a more detailed mathematical description of the epigraph algorithm, a comparison of various heuristics for removing cycles from the directed graph, and an overview of what is available in the online tool that is hosted on the HIV Database website (https://www.hiv.lanl.gov/content/sequence/EPIGRAPH/epigraph.html). This includes the ability to ‘characterize potential T‐cell epitopes (PTEs)’, and we show what that characterization looks like for US B‐clade Gag protein. 
The foundation of both mosaic and epigraph design is that antigen optimization is based on coverage of PTEs, which are strings of k contiguous amino acids (or k‐mers), as first suggested in 9. Because the optimal length of most cytotoxic T‐cell epitopes is nine amino acids, we usually set k = 9 (see 7, 10). Epigraph's speed allowed us to explore additional vaccine design issues that were previously intractable with mosaics. These include studying the impact of excluding rare epitopes, optimizing on coverage of imperfectly matched epitopes, and incorporation into more complex iterative clustering algorithms that enabled a tailored therapeutic vaccine design 8. Vaccines designed using the HIV mosaic code have performed very well when tested experimentally; we expect epigraphs to behave comparably. To date, all mosaics that have been tested produce proteins that are well folded (i.e., they bind to many neutralizing antibodies that recognize discontinuous antibody epitopes – epitopes that consist of nonadjacent amino acids that are brought close together in natural folded proteins). Furthermore, they are highly immunogenic, eliciting both T‐cell and B‐cell antibody responses 6, 11, 12. T‐cell responses to mosaic vaccines effectively target HIV‐infected cells 13, and, as intended, they are far more cross‐reactive than those induced by natural proteins 6, 11, 14, 15, 16. Mosaic designs can be applied to whole proteins, or just to regions that are relatively conserved; even the most conserved regions of HIV are variable when considering epitope length fragments 10, 17, 18. Mosaic designs have also been explored for influenza 19, hepatitis C 20, Ebola 8, 21, and Chlamydia 22. Of note, the diversity coverage of antibody epitopes as well as T‐cell epitopes should theoretically be augmented by using mosaic/epigraph designed polyvalent vaccines rather than using combinations of natural strains. 
Antibody epitopes are generally discontinuous, but contain short linear stretches of neighboring amino acids that come together in the folded protein. By algorithmically favoring inclusion of commonly found combinations of amino acids that are in close proximity in sequence space, as a consequence of maximizing k‐mer coverage in mosaic/epigraph vaccines, and by tending to minimize the inclusion of rare amino acids, even discontinuous antibody epitopes will tend to be enriched for forms of the epitopes that are common among natural strains. As epigraphs are complementary sets, common variants will be represented, and through exposure to common variants during affinity maturation, antibodies may be selected that have greater breadth. In contrast, amino acids that are very rare in a given position and also very rare local combinations of amino acids pepper virtually every natural HIV sequence; thus, when natural proteins are used for vaccination, they may yield type‐specific vaccine responses if rare amino acids or combinations of amino acids are embedded in immunogenic epitopes. For example, Env is the key antibody vaccine target for HIV‐1. When we generated an epigraph for the HIV Env protein global alignment, containing 4256 sequences each isolated from a different individual, we found there were a remarkable 650,000 distinct 9‐mers. Over 500,000 were only found once in the whole population of 4256 sequences. Each Env is less than 900 amino acids long, and this averaged to 120 unique 9‐mers per natural strain 8.

Formulation

A multiset $\mathcal{S}$ of N sequences is taken to characterize the variability of a virus over a population of interest. Each sequence is a string of alphabetic characters, corresponding to the 20 amino acids. Some special characters are allowed, including a gap character (‐) in the case of aligned sequences, and characters indicating a premature stop codon (*), an uncertain amino acid due to an ambiguous base call within the corresponding codon (X), and a frame‐shifted codon (#). For this exposition, we define an epitope as any short subsequence of k amino acids. Strictly speaking, this is an abuse of terminology, because true epitopes correspond to very specific subsequences that are known to be immunologically relevant. For viral proteins of interest here, however, such as HIV‐1 Gag, T‐cell epitopes are ubiquitous across the entire Gag sequence 23. More formally (and in previous treatments 8), we refer to these k‐mers as ‘PTEs’. Here, however, we will simply use ‘epitope’ to mean k‐mer. For each epitope e, we assign an integer frequency f(e) corresponding to the number of sequences in $\mathcal{S}$ in which the epitope appears. If the epitope appears more than once in a given sequence (e.g., due to a repeat), it is still only counted once for that sequence. The simple version of the problem is to design an artificial sequence q that optimally ‘covers’ the epitopes in $\mathcal{S}$. If we write $\mathcal{E}(q)$ as the set of epitopes that appear in the sequence q, then our measure of coverage is $\sum_{e \in \mathcal{E}(q)} f(e)$, the sum of the frequencies for all the epitopes that appear in q. It is convenient to normalize this quantity by the sum of the frequencies of all the epitopes: $\sum_{e \in \mathcal{E}(q)} f(e) \,/\, \sum_{e \in \mathcal{E}(\mathcal{S})} f(e)$, where $\mathcal{E}(\mathcal{S})$ is the set of all epitopes that appear in $\mathcal{S}$. In the ‘cocktail’ version of the problem, the aim is to generate M > 1 artificial sequences $Q = \{q_1, \dots, q_M\}$ that optimally cover the sequences in $\mathcal{S}$. We write $\mathcal{E}(Q)$ as the set of epitopes that appear in at least one of the sequences in $Q$, and we seek to optimize $\sum_{e \in \mathcal{E}(Q)} f(e)$, the sum of the frequencies for all the epitopes that appear in $Q$.
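The frequency and coverage definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the epigraph code itself: the function names are hypothetical, the sequences are assumed to be unaligned strings with no special characters, and the toy population is the one used in Figure 1 (k = 3).

```python
from collections import Counter

def kmers(seq, k):
    """Set of distinct k-mers (potential epitopes) in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def epitope_frequencies(sequences, k):
    """f(e): number of sequences containing e (repeats in a sequence count once)."""
    f = Counter()
    for s in sequences:
        f.update(kmers(s, k))  # a set, so each k-mer adds at most 1 per sequence
    return f

def coverage(antigens, f, k):
    """Normalized coverage of a list of M >= 1 antigen sequences."""
    covered = set().union(*(kmers(q, k) for q in antigens))
    return sum(f[e] for e in covered) / sum(f.values())

# Toy population from Figure 1 (k = 3).
population = ["MSAMQL"] * 2 + ["MSARGL"] * 4 + ["VGARQL"] * 4 + ["MGARGL"] * 3
f = epitope_frequencies(population, k=3)
single = coverage(["VGARGL"], f, k=3)              # one antigen (M = 1)
cocktail = coverage(["VGARGL", "MSARQL"], f, k=3)  # two antigens (M = 2)
```

Because `coverage` takes the union of the antigens' k-mers before summing, an epitope shared by two antigens in a cocktail is counted only once.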

Graph‐based formulation

A graph is composed of two sets: a set of nodes and a corresponding set of edges, where each edge connects a pair of nodes. In a directed graph, the edges have a direction, and the connection goes from one node to the other. The graphs we will construct are directed and very much like the de Bruijn graphs 24 that are used in other kinds of sequence analysis, particularly in sequence assembly 25, 26. We can construct a single directed graph in which each k‐mer epitope e is associated with a node. A pair of epitopes $e_i, e_j$ is connected by a directed edge if the last k − 1 characters of $e_i$ match the first k − 1 characters of $e_j$. For example, the nodes $e_i$ = VTSSNMNNA and $e_j$ = TSSNMNNAD are connected by a directed edge because they share the k − 1 characters TSSNMNNA and therefore could be adjacent (overlapping) epitopes in a sequence …VTSSNMNNAD…. A path through the graph is a sequence of nodes, associated with epitopes $e_1, e_2, \dots, e_L$, such that there is an edge from $e_\ell$ to $e_{\ell+1}$ for ℓ = 1,…,L − 1. Such a path corresponds to a sequence of L + k − 1 characters. For computational convenience, we add two extra nodes, a BEGIN and an END node. The BEGIN node connects to all the nodes that lack predecessors (corresponding to epitopes that are the first k characters in a sequence). Nodes that lack successors (because they are the last k characters in a sequence) are connected to the END. All paths of interest, then, go from the BEGIN node to the END node. In this graph‐based formulation, the goal is to identify a path P through the graph that maximizes $\sum_{e \in P} f(e)$, where the sum is over all the epitopes that are in the path (but if a given node appears more than once in a path, its epitope frequency f(e) is only included once in the sum).
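As an illustration of the construction just described, the following sketch builds the successor map of the k‐mer graph, including the BEGIN and END bookkeeping nodes. The function name and the node labels `<BEGIN>` and `<END>` are assumptions made for this example; indexing nodes by their (k − 1)‐character prefix avoids comparing every pair of nodes.

```python
from collections import defaultdict

BEGIN, END = "<BEGIN>", "<END>"  # hypothetical labels for the bookkeeping nodes

def build_graph(sequences, k):
    """Return {node: set of successor nodes} for the k-mer graph."""
    nodes = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    # Index nodes by their first k-1 characters.
    by_prefix = defaultdict(set)
    for e in nodes:
        by_prefix[e[:-1]].add(e)
    # Edge a -> b when the last k-1 characters of a are the first k-1 of b.
    succ = {e: set(by_prefix.get(e[1:], ())) for e in nodes}
    with_pred = set().union(*succ.values())
    succ[BEGIN] = nodes - with_pred      # BEGIN feeds nodes lacking predecessors
    for e in nodes:
        if not succ[e]:
            succ[e] = {END}              # nodes lacking successors feed END
    succ[END] = set()
    return succ

population = ["MSAMQL"] * 2 + ["MSARGL"] * 4 + ["VGARQL"] * 4 + ["MGARGL"] * 3
succ = build_graph(population, k=3)
nine = build_graph(["VTSSNMNNAD"], k=9)  # the 9-mer example from the text
```

In the toy graph, MSA has the successors SAM and SAR but not GAR, as in Figure 1(b).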

Solving the graph‐formulated M = 1 problem in the ideal case

If the directed graph of epitopes is acyclic (which means that there are no paths that include the same node more than once) and if only M = 1 artificial sequence is needed, then we can find the optimal path through this graph, which corresponds to the optimal vaccine sequence given our criteria of maximizing epitope coverage. The algorithm for finding the optimal path uses dynamic programming, a strategy that has been widely used for other sequence analysis problems, including sequence alignments 27, 28. The algorithm involves a forward loop followed by a backward loop. The forward loop defines the function F(e) to be the largest sum achievable for any path that terminates with the epitope e. This function can be computed for each node in the graph in a stepwise manner. Let $\mathcal{P}(e)$ be the set of predecessors of node e: that is, the set of nodes e′ for which there exists a directed edge that connects from e′ to e. Then we have

$$F(e) = f(e) + \max_{e' \in \mathcal{P}(e)} F(e'). \qquad (1)$$

If the set of predecessors is empty, then we define F(e) = f(e). If the graph of epitopes is a directed acyclic graph, then there exists a ‘topological ordering’ of the epitopes, $e_1, e_2, \dots$, with the property that if $(e_i, e_j)$ is a directed edge, then i < j. By proceeding in this topological order, we can straightforwardly evaluate Equation (1) for all the nodes. Having evaluated F(e) for all the nodes e, we choose a node with maximum value: $e_0 = \arg\max_e F(e)$. This will be the final epitope in our optimal string. Now, we just work backwards: $e_{-(i+1)} = \arg\max_{e' \in \mathcal{P}(e_{-i})} F(e')$ for i = 0, 1, 2, …. If the set $\mathcal{P}(e_{-p})$ is empty, then we are finished, and the sequence of epitopes $e_{-p}, \dots, e_{-1}, e_0$ corresponds to a sequence q of p + k characters that optimizes the epitope coverage. We remark that the argmax operator may not have a unique value; if it does not, then there will be multiple solutions, all of which are optimal in the sense of coverage. Furthermore, this optimality is achieved with computational effort that scales only linearly with the size (as measured in edges) of the network. 
Figure 1 illustrates the creation of a graph and the optimal path through the graph, for a toy example involving k = 3‐mers, spanning 13 ‘sequences’ of six amino acids each.

Figure 1

Simple example with k = 3 and a population sample composed of these 13 sequences: 2 × MSAMQL, 4 × MSARGL, 4 × VGARQL, and 3 × MGARGL. In this simple illustration, we take k = 3, but in actual practice, we choose k in the range 8–12, with k = 9 generally preferred. (a) Nodes are associated with k‐mers, each of which is a potential T‐cell epitope. In this simple illustration, there are twelve nodes, corresponding to the twelve distinct 3‐mers that appear in the 13 sequences listed in the legend. For each 3‐mer e, the expression f(e) corresponds to the number of sequences in which e appears. (b) The graph is produced by connecting pairs of nodes with directed edges. If the last k ‐ 1 characters of one node match the first k ‐ 1 characters of another node, then a directed edge connects the one node to the other. In this case the last 2 characters of the first node match the first 2 characters in the following node. Thus, MSA connects to SAM and to SAR, but not to GAR. Also, we add a BEGIN and an END node to the graph as a bookkeeping convenience. (c) The frequency f(e) corresponds to the number of sequences in the sample population in which the k‐mer associated with the node e appears. For each path P that ends on epitope e, we can compute the sum of frequencies of the nodes in the path. The maximum of this sum, over all such paths, is given by F(e). For a directed acyclic graph, such as the one in this illustration, we can compute F(e) very efficiently using the expression in Eq. (1). In particular, to compute F(e), find the largest F(e′) among the nodes e′ that are predecessors to e and add that to f(e). For instance, ARQ has two predecessors: e′ = SAR and e′ = GAR. The larger value is F(e′) = 11 for GAR, and adding that to f(e) = 4 for e = ARQ gives F(e) = 15. (d) The optimal path, which maximizes the sum of f(e) over all the nodes in the path, is obtained by starting at the END node and working backward. At each step, the predecessor with the largest value of F(e) is chosen. 
In this illustration, of the three predecessors to END, the node RGL has F(e) = 25, which is larger than for the other two predecessors. We work backward to ARG, and then choose between SAR and GAR; we select GAR since its F(e) = 11 is higher than SAR's value. We continue moving backward until we reach the BEGIN node. (e) The Epigraph solution (here given by VGARGL) is associated with the optimal path [VGA,GAR,ARG,RGL]. Note that this solution is not among the sequences that were in the population sample, and that it is not the consensus sequence of those sequences. It is, however, the single sequence that maximizes the coverage of the 12 distinct 3‐mers, given their frequencies.
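The forward and backward loops of the M = 1 algorithm can be sketched as follows for the toy example of Figure 1. This is a minimal illustration under stated assumptions, not the epigraph implementation: the function names are hypothetical, BEGIN and END are left implicit (a node with no predecessors plays the role of a BEGIN successor), and the topological order comes from `graphlib.TopologicalSorter` in the Python standard library (Python 3.9 or later), which raises `CycleError` if the graph is not acyclic.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def optimal_path(sequences, k):
    """Dynamic-programming solution of the M = 1 problem on an acyclic k-mer graph."""
    # f(e): number of sequences in which k-mer e appears.
    f = defaultdict(int)
    for s in sequences:
        for e in {s[i:i + k] for i in range(len(s) - k + 1)}:
            f[e] += 1
    # preds[e]: nodes whose last k-1 characters equal the first k-1 of e.
    by_suffix = defaultdict(set)
    for e in f:
        by_suffix[e[1:]].add(e)
    preds = {e: by_suffix.get(e[:-1], set()) for e in f}

    # Forward loop, Eq. (1): F(e) = f(e) + max F(e') over predecessors e'.
    F = {}
    for e in TopologicalSorter(preds).static_order():
        F[e] = f[e] + max((F[p] for p in preds[e]), default=0)

    # Backward loop: start at the best node, follow the best predecessor.
    e = max(F, key=F.get)
    path = [e]
    while preds[e]:
        e = max(preds[e], key=F.get)
        path.append(e)
    path.reverse()
    return path, F

def path_to_sequence(path):
    """A path of L k-mers spells a sequence of L + k - 1 characters."""
    return path[0] + "".join(e[-1] for e in path[1:])

population = ["MSAMQL"] * 2 + ["MSARGL"] * 4 + ["VGARQL"] * 4 + ["MGARGL"] * 3
path, F = optimal_path(population, k=3)
antigen = path_to_sequence(path)
```

Ties in either `max` call are broken arbitrarily here; as noted in the text, any such choice yields a solution with the same coverage.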

Multi‐antigen (M > 1) vaccines

Here, we seek a set $Q = \{q_1, \dots, q_M\}$ of antigens that collectively cover the sequences in the population sample $\mathcal{S}$. With $\mathcal{E}(Q)$, the set of epitopes that appear in at least one of the sequences in $Q$, the goal is to maximize $\sum_{e \in \mathcal{E}(Q)} f(e)$. A variety of strategies were described in 8, but the main idea is to sequentially add new antigens q in a way that optimizes a complementary coverage function. As in the M = 1 case, this coverage is a sum of frequencies f(e), but the sum is only over epitopes that have not yet appeared in any of the other antigens. Algorithmically, one proceeds just as for the M = 1 case, but the sum is over $f^*(e)$, which is equal to f(e) for epitopes that have not yet appeared in the vaccine and is equal to zero for epitopes that have already appeared in the vaccine.
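The sequential strategy can be sketched as below: each round solves the M = 1 problem with the complementary frequencies f∗, then sets f∗ to zero for the epitopes just covered. This is only the simplest of the strategies mentioned above, and the function names are hypothetical. One detail is an assumption of this sketch: because f∗ can be zero, the backward loop starts from the best node that has no successors, so that every antigen spans a complete path.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def kmer_graph(sequences, k):
    """Return (f, preds): k-mer frequencies and predecessor sets."""
    f = defaultdict(int)
    for s in sequences:
        for e in {s[i:i + k] for i in range(len(s) - k + 1)}:
            f[e] += 1
    by_suffix = defaultdict(set)
    for e in f:
        by_suffix[e[1:]].add(e)
    preds = {e: by_suffix.get(e[:-1], set()) for e in f}
    return dict(f), preds

def best_path(preds, weight):
    """Maximum-weight complete path through an acyclic graph."""
    F = {}
    for e in TopologicalSorter(preds).static_order():
        F[e] = weight[e] + max((F[p] for p in preds[e]), default=0)
    has_successor = set().union(*preds.values())
    e = max((n for n in preds if n not in has_successor), key=F.get)
    path = [e]
    while preds[e]:
        e = max(preds[e], key=F.get)
        path.append(e)
    return path[::-1]

def epigraph_cocktail(sequences, k, M):
    """Add M antigens one at a time, each maximizing complementary coverage."""
    f, preds = kmer_graph(sequences, k)
    f_star = dict(f)                      # f*(e) starts equal to f(e)
    antigens = []
    for _ in range(M):
        path = best_path(preds, f_star)
        antigens.append(path[0] + "".join(e[-1] for e in path[1:]))
        for e in path:
            f_star[e] = 0                 # already covered: contributes nothing
    return antigens, f

population = ["MSAMQL"] * 2 + ["MSARGL"] * 4 + ["VGARQL"] * 4 + ["MGARGL"] * 3
antigens, f = epigraph_cocktail(population, k=3, M=2)
covered = {q[i:i + 3] for q in antigens for i in range(len(q) - 2)}
total = sum(f[e] for e in covered) / sum(f.values())
```

On the toy population, the second antigen picks up the MSA, SAR, ARQ, and RQL epitopes that the first antigen missed.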

Decycling

Unfortunately, the network that is created from a sequence list is not guaranteed to be acyclic. In practice, particularly for larger values of k (and larger values of f ), the network is often ‘very nearly’ acyclic and can be made acyclic with only a few perturbations to the network. The optimal solution to this perturbed network is then taken as a nearly optimal solution to the original network. Removing the least number of edges to produce an acyclic graph is equivalent to an NP‐hard problem, the ‘minimum feedback arc set’ problem 29, 30. Thus, we cannot expect a universally efficient algorithm for optimally decycling a graph, and that is what motivates us to consider a variety of heuristic approaches for making the directed graph acyclic with a minimum amount of perturbation. The first step in eliminating cycles is to identify them. To do this, we decompose the graph into ‘strongly connected components’ 31; for an acyclic graph, each node is its own strongly connected component. Within a single strongly connected component, a path can be found from every node to every other node. Cycles can be identified in the following way: if $e_a$ and $e_b$ are two nodes in the same component, then the directed path from $e_a$ to $e_b$ can be merged with the directed path from $e_b$ to $e_a$ to form a cycle (although possibly not a ‘simple’ cycle) that includes both $e_a$ and $e_b$. Our approach for eliminating cycles is to keep all the nodes from the original network, but to successively cut edges until an acyclic network is obtained. Each time a cycle is located in the graph, we choose one of the edges in the cycle to remove from the graph. This choice is heuristic, and we tried a number of alternatives. But in general, because cutting edges may have the effect of isolating nodes, we seek cuts that isolate low‐value nodes. For each edge $(e_a, e_b)$, we can define a value, based on $f(e_a)$ and $f(e_b)$; then we choose the edge with the smallest value and remove it from the graph. 
We considered eight specific heuristic functions, listed in Table 1.
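The cut‐until‐acyclic loop described above can be sketched as follows. Two simplifications are assumptions of this example rather than features of the epigraph code: cycles are located with a plain depth‐first search instead of a strongly connected component decomposition, and the edge value is fixed to the sum f(e_a) + f(e_b). The function names and the four‐node graph are hypothetical.

```python
def find_cycle(succ):
    """Return the edges of one directed cycle as (a, b) pairs, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / finished
    color = {v: WHITE for v in succ}
    for root in succ:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        path = [root]
        stack = [(root, iter(succ[root]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if color[w] == GRAY:      # back edge: w is on the current path
                    cyc = path[path.index(w):] + [w]
                    return list(zip(cyc, cyc[1:]))
                if color[w] == WHITE:
                    color[w] = GRAY
                    path.append(w)
                    stack.append((w, iter(succ[w])))
                    break
            else:                          # all successors of v explored
                color[v] = BLACK
                stack.pop()
                path.pop()
    return None

def decycle(succ, f):
    """Cut the lowest-valued edge of each located cycle until none remain."""
    succ = {v: set(ws) for v, ws in succ.items()}   # work on a copy
    removed = []
    while (cycle := find_cycle(succ)) is not None:
        a, b = min(cycle, key=lambda edge: f[edge[0]] + f[edge[1]])
        succ[a].discard(b)
        removed.append((a, b))
    return succ, removed

# Hypothetical 2-mer graph with one cycle: AB -> BC -> CA -> AB, plus CA -> AD.
succ = {"AB": {"BC"}, "BC": {"CA"}, "CA": {"AB", "AD"}, "AD": set()}
f = {"AB": 5, "BC": 3, "CA": 1, "AD": 4}
acyclic, removed = decycle(succ, f)
```

Every node must appear as a key of `succ`, even if it has no successors. Here the edge from BC to CA has the smallest sum (3 + 1 = 4), so it is the one that is cut.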

Table 1

Decycling heuristics: a heuristic function provides a value for each edge in the directed graph; edges with the lowest values are removed until the graph becomes acyclic.

Heuristic	Description
sum	A very simple and (empirically) effective heuristic is to take the value to be the sum $f(e_a) + f(e_b)$.
max	Another simple heuristic is to take the maximum of the two values associated with the two nodes that define the given edge: that is, $\max\{f(e_a), f(e_b)\}$.
iso	We observe that if $e_a$ is the sole predecessor of $e_b$, then cutting edge $(e_a, e_b)$ will isolate node $e_b$; similarly, if $e_b$ is the sole successor to $e_a$, then cutting the edge will isolate $e_a$. The cost associated with the iso statistic takes the sum of the f(e) values of the isolated nodes.
sum+iso	The cost associated with edge $(e_a, e_b)$ is the sum of the sum and iso costs. This is the heuristic we used in 8.
max+iso	Sum of max and iso heuristics.
posn	Every node can be assigned a position x(e) corresponding to the length of the shortest path that connects the BEGIN node to e. We use $-x(e_a)$ as the heuristic value, thus preferring to cut edges whose first node is farthest from BEGIN.
del‐posn	The value given by $x(e_b) - x(e_a)$ tells us the extent to which the directed edge points ‘forward’ – large negative values correspond to edges that point backward, and are good candidates for cutting.
random	Assign random values to edges.

Because the choice of which cycle to draw from the graph is based on random choices, we do seven trials for each of the eight heuristics and consider both the average and the maximum coverage values for those seven trials. The results are shown in Figures 2, 3, 4, 5. These experiments, to some extent, vindicate our choice of heuristic (sum+iso) used in 8, which is seen as a stable choice that gives results that are, over a range of different proteins and clades, consistently competitive with the more expensive mosaic algorithm. But this heuristic does not always lead to the best coverage. In Pol B, we see the posn heuristic produces significantly higher coverage than any other heuristic, even though posn for other proteins often gives significantly poorer coverage.
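For concreteness, the first five edge values of Table 1 can be written out as below. This is a hedged sketch with hypothetical function names; the posn and del‐posn heuristics (which need shortest‐path distances from BEGIN) and the random heuristic are omitted, and the bookkeeping nodes BEGIN and END are ignored when deciding whether a cut isolates a node.

```python
def predecessors(succ):
    """Invert a successor map {a: {b, ...}} into a predecessor map."""
    pred = {v: set() for v in succ}
    for a, targets in succ.items():
        for b in targets:
            pred[b].add(a)
    return pred

def edge_values(a, b, f, succ, pred):
    """Heuristic values of edge (a, b); lower values are cut first."""
    s = f[a] + f[b]
    m = max(f[a], f[b])
    # Cutting (a, b) isolates b if a is its sole predecessor,
    # and isolates a if b is its sole successor.
    iso = (f[b] if pred[b] == {a} else 0) + (f[a] if succ[a] == {b} else 0)
    return {"sum": s, "max": m, "iso": iso, "sum+iso": s + iso, "max+iso": m + iso}

# Hypothetical 2-mer graph with one cycle: AB -> BC -> CA -> AB, plus CA -> AD.
succ = {"AB": {"BC"}, "BC": {"CA"}, "CA": {"AB", "AD"}, "AD": set()}
f = {"AB": 5, "BC": 3, "CA": 1, "AD": 4}
pred = predecessors(succ)
values = {(a, b): edge_values(a, b, f, succ, pred)
          for a, b in [("AB", "BC"), ("BC", "CA"), ("CA", "AB")]}
cheapest = min(values, key=lambda edge: values[edge]["sum+iso"])
```

In this small graph the sum and sum+iso heuristics agree on which cycle edge to cut, but in general the iso term penalizes cuts that would strand a high‐frequency node.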

Figure 2

Coverage score (left, blue) and removed edges count (right, orange) for four Gag protein clades B, C, E, and M. Each plot compares results for eight different heuristic decycling algorithms; these heuristics are described in Table 1. Dark blue bars correspond to average coverage, over seven trials. The lighter cyan bars show the maximum coverage, over those seven trials. The bars are arranged so that the top bar has the highest average coverage and the bottom bar has the lowest average coverage. The vertical red line corresponds to the value provided by the mosaic algorithm, as reported in 8. Light orange bars indicate the number of edges removed from the graph, using the different heuristics, in order to obtain an acyclic graph.

Figure 3

Similar to Figure 2, but for the Nef protein.

Figure 4

Similar to Figure 2, but for the Pol protein.

Figure 5

Similar to Figure 2, but for the Env protein.

As a general rule, we expected the heuristics that removed the fewest edges (i.e., that produced the least ‘damage’ to the full graphs) to give the highest coverage, and although that trend roughly holds (particularly for the Gag and Env proteins), there are some striking exceptions. For both Nefs B and M, we see that the del‐posn heuristic produces not only the worst or second worst damage but also the highest coverage. Further study is called for (both in comparing heuristics and in developing new ones), but a reasonable strategy at this point would be to apply the whole slate of available heuristics to any given population sample of interest and to simply take the one with the best coverage. Because different heuristics yielded different outcomes depending on the input dataset, we have now added the ability for the user to specify which heuristic to employ in the design phase of epigraph.

The epigraph web interface

Users can execute the epigraph algorithm on their own data using a web interface 32 maintained by the Los Alamos HIV database (http://www.hiv.lanl.gov/).

Human immunodeficiency virus type 1 Gag as a test case

User input for all of the web‐based tools in the epigraph tool suite is a set of sequences considered to be a reasonable (or best available) sampling of the pathogen of interest's protein diversity in the infected population that is being targeted for a vaccine, or for immune response studies. Epigraph, like all of the tools provided through the HIV database project, has a readily available input sample set to enable users to quickly explore the tool, even if they do not have a data set to hand. For the epigraph tool, we chose a highly relevant data set as the sample input, incorporating the Gag p24 proteins sampled within the last decade in the USA, one sequence derived from each of 189 different individuals. We pulled this set from the HIV database pre‐made alignments 33, and analogous HIV sets can readily be obtained representing any global region, country, or HIV subtype. The immunology informing Gag p24 as the choice for our example file is worth considering, because Gag is an excellent protein for inclusion in T‐cell response vaccines 34 and the example highlights the merits of an epigraph approach. Gag p24 is one of the most conserved proteins in HIV, but conservation is relative, and even p24 is highly variable at the epitope level. The number of vaccine‐elicited Gag cytotoxic T‐lymphocyte (CTL) responses has been shown to directly correlate with viral control in rhesus macaque SIV challenge models 35, and Gag is very immunogenic. It is spanned by a dense overlay of experimentally well‐defined known human T‐cell epitopes 10. To see the extent of known CTL epitopes, see also the HIV database maps of known epitopes collected from the literature 23. 
This list of experimentally defined HIV epitopes is certainly an underestimate, as most experimental studies rely on the use of consensus peptides (e.g., see 36, 37), and when specific peptides are synthesized that exactly match an infecting virus in an individual whose immune response is being studied, CTL responses to novel epitopes are readily discovered 38. The high density of T‐cell epitopes is further supported by population‐based T‐cell recognition studies 36, 37. A high density of known epitopes is not exclusive to HIV Gag and other HIV proteins. For example, the influenza A hemagglutinin protein, another intensively studied viral protein, has over 800 T‐cell epitopes listed in the Immune Epitope Database (http://www.iedb.org/), suggesting a comparably dense tiling of epitopes to Gag. Such epitope density is part of the justification for the assumption that any 9‐mer fragment is a potential T‐cell epitope in epigraph/mosaic design. This is particularly important when considering the array of epitope target possibilities across a diverse vaccinated population with a complex spectrum of human leukocyte antigens (HLAs). HLAs are the human proteins that present T‐cell epitopes: They are highly polymorphic, and they play a key role in determining which particular epitopes are recognized in a given individual. While not all 9‐mers will be precisely defined optimal epitopes, still they will partially overlap with valid epitopes. The epigraph solution is optimized across all possibly relevant potential T‐cell epitopes, without having to define those regions a priori when not enough data is available to define and differentiate natural epitope regions. In a natural infection, a given individual will recognize a relatively small number of T‐cell epitopes 38, and this is also true for vaccines using standard delivery methods 39. But vaccination with a cytomegalovirus (CMV) vector changes this paradigm. 
Rhesus macaques vaccinated with SIV immunogens delivered in rhesus CMV vectors can clear over half of the infections that are established upon challenge 40, 41. The T‐cell response to these vaccines is extraordinary. A remarkable number of CTL epitopes in Gag are recognized 40, 41, and they are presented in unusual and not readily predictable ways, using the non‐classical major histocompatibility complex E molecule for presentation 42. Epigraphs were originally developed with the design of HIV inserts for preclinical development of CMV vectors in mind 8. Given the very high density of non‐canonical and unpredictable epitope responses generated in the context of the CMV vector, it is reasonable to treat all 9‐mers as potential epitopes.

Epigraph input and output

The epigraph tool is organized into multiple tabs, and essential elements of the output of each page are shown in Figure 6. The first tool is simply the epigraph design page. A set of diverse protein sequences is provided by the user, and epigraph produces sequences that optimize epitope coverage. A graphic showing the coverage values relative to the input population is created (Figure 6A). To compare epigraphs to other vaccines, or to explore their cross‐reactive potential in other populations, one can use the antigen evaluation tool (Figure 6B), where the frequency of matched k‐mers between the vaccine and sequence set can be evaluated. Both the epitope coverage and the number of the rarest epitopes in the vaccine are summarized. In Figure 6B, we compare two two‐antigen epigraph vaccines, one based on the US B‐clade sequence set described earlier, and one based on the CRF08 recombinant form that is common in China. One can see the dramatic drop off in epitope coverage when one evaluates the US B‐clade epigraphs against CRF08 sequences and vice versa. Figure 6C illustrates the coverage if two natural strains are used instead of an epigraph pair. One hundred sets of two natural strains were selected randomly from the US B‐clade set, and coverage of the population of US B‐clade viruses was evaluated. If one deliberately selects the natural strains with the best coverage, the potential for cross‐reactive vaccine responses can be greatly enhanced over typical choices that might be made for reasons of history or convenience. Epigraphs provide better coverage than the best two natural strains, even in the context (Figure 6A vs. C) of Gag p24, one of the most conserved proteins in HIV. Epigraph has a tool that enables a user to weigh the benefit of excluding rare epitopes against the cost of excluding these epitopes on total coverage (Figure 6D).
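The natural‐strain comparison of Figure 6C amounts to evaluating the coverage of pairs of sequences drawn from the population itself. A minimal sketch follows, with hypothetical function names and the toy population of Figure 1 standing in for real data; it scans all pairs exhaustively rather than sampling 100 random pairs, which is only feasible because the toy set is tiny.

```python
from collections import Counter
from itertools import combinations

def kmers(seq, k):
    """Set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pair_coverage(pair, f, k):
    """Fraction of population k-mer frequency covered by a pair of strains."""
    covered = kmers(pair[0], k) | kmers(pair[1], k)
    return sum(f[e] for e in covered) / sum(f.values())

population = ["MSAMQL"] * 2 + ["MSARGL"] * 4 + ["VGARQL"] * 4 + ["MGARGL"] * 3
k = 3
f = Counter()
for s in population:
    f.update(kmers(s, k))          # each k-mer counted once per sequence

# Score every pair of distinct natural strains and keep the best one.
strains = sorted(set(population))
scores = {pair: pair_coverage(pair, f, k) for pair in combinations(strains, 2)}
best_pair = max(scores, key=scores.get)
```

The same `pair_coverage` function can score a designed pair of antigens against the population, which is the comparison the text draws between Figure 6A and C. On a population this small the best natural pair already does very well, so the toy data should not be read as reproducing the advantage reported for Gag p24.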

Figure 6

Summary output of the epigraph tool suite using 189 B‐clade Gag p24 protein viruses from the USA as an example set. (A) Design output, including the sequences for a two‐antigen vaccine optimized for 9‐mer epitope coverage, a table of coverage information, and a graphic comparing the coverage of the first epigraph (EG‐0) to the additional coverage obtained by the addition of the complementary epigraph (EG‐1). The red bar measures perfectly matched epitopes, while the orange bar adds the off‐by‐one matches. (B) Antigen coverage evaluation, showing the fraction of epitope matches between epigraph vaccines and two populations of viral sequences: US B clade and the Chinese CRF08 (labeled 08) recombinant lineage. Shown are the fraction of covered epitopes (left) and the raw counts (right). The counts for the CRF08 are much reduced because only 48 CRF08 Gag sequences were available, while 189 B‐clade sequences were included. (C) Coverage distribution is shown for 100 randomly selected pairs of natural p24 B‐clade sequences. The best natural sequence coverage is 84%, marked in red. (D) Exclude rare epitope output. This code sequentially explores the impact on coverage of excluding rare 9‐mers, starting with excluding the rarest (those only found once) and stepping up the minimal count by 1 until a graph can no longer be completed (for this data set, that limit occurs at a count of 62). The blue line is a simple epigraph run; the black line is slightly improved and reflects the best scores given 100 random restarts. A complete table is given in the epigraph output, but here, we only track the values up to a minimum count of 13. The maximum coverage if all paths through the graph are available is 0.873. If the antigens are restricted to include only 9‐mers that are found 12 or more times, the coverage is reduced by less than a percent, to 0.866. 
(E) A histogram graphic summarizes the number of distinct 9‐mers that are found in a given number of sequences, for the full Gag protein in the B US data set. Note that almost 10,000 completely unique 9‐mers (i.e., only found in one sequence) are observed in this conserved protein in this relatively conserved dataset. The few highly conserved peptides are summarized in Table 2.

Table 2

The most conserved Gag peptides in the US B‐clade set were identified using the ‘characterized PTEs’ tool in the epigraph tool suite.

Start	Stop	Frequency	Peptide	Class I HLA presentation
35	43	183	VWASRELER
			SPR[TL]LNAWVK[VW]
148	156	181	SPRTLNAWV	B8101, B0702
149	157	187	PRTLNAWVK
150	158	187	RTLNAWVKV	A*0201
164	172	181	FSPEVIPMF	B57, B58, B63
177	185	180	EGATPQDLN
			MLNTVGGHQAAMQMLK
187	195	179	MLNTVGGHQ
188	196	179	LNTVGGHQA	human, unknown HLA
189	197	179	NTVGGHQAA
190	198	179	TVGGHQAAM
191	199	182	VGGHQAAMQ
192	200	185	GGHQAAMQM
193	201	185	GHQAAMQML	B1510, B3901, B38
194	202	184	HQAAMQMLK	B52, A11
			REPRGSDIAGTTS
229	237	179	REPRGSDIA
230	238	179	EPRGSDIAG
231	239	184	PRGSDIAGT
232	240	184	RGSDIAGTT
233	241	184	GSDIAGTTS
			GLNK[I‐]VRMYSP
269	277	181	GLNKIVRMY	B*1501, B62
270	278	182	LNKIVRMYS
271	279	182	NKIVRMYSP
			QGPKEPFRDYVDRFYK
287	295	182	QGPKEPFRD
288	296	182	GPKEPFRDY	B7
289	297	182	PKEPFRDYV	human, unknown HLA
290	298	182	KEPFRDYVD
291	299	182	EPFRDYVDR	A*0201
			EPFRDYVDRFF	B81
292	300	182	PFRDYVDRF	human, unknown HLA
293	301	189	FRDYVDRFY
			FRDYVDRFYK	B*1801, B27
294	302	183	RDYVDRFYK
			VKNWMTETLL
313	321	181	VKNWMTETL	B*4801
314	322	181	KNWMTETLL
			WMTETLLVQN
316	324	184	WMTETLLVQ
317	325	185	MTETLLVQN
			LEEMMTACQGVGGP
343	351	182	LEEMMTACQ	human, unknown HLA
344	352	182	EEMMTACQG	human, unknown HLA
345	353	182	EMMTACQGV	A*0201
346	354	184	MMTACQGVG	human, unknown HLA
347	355	184	MTACQGVGG	human, unknown HLA
348	356	184	TACQGVGGP

This generated the data in the frequency and peptide columns. The HIV database sequence locator tool 43 generated positions for each of the peptides, relative to the HXB2 reference strain. The HIV database ELF tool 44 identified the known database epitopes within the set. PTEs, potential T‐cell epitopes; ELF, Epitope Location Finder; HIV, human immunodeficiency virus; HLA, human leukocyte antigen.

Figure 6E illustrates a newly added feature of the epigraph tool suite, introduced for this paper: frequency counts of potential epitopes. Tables are produced that enable users to resolve the frequency of each unique k‐mer in their input data (essentially, this summarizes the data encapsulated in the nodes of the graph), either overall or by position in an alignment. In the case of the B‐clade US HIV Gag sequences, this ranges from a very large number of unique 9‐mers that are sampled only once in the population to a very small number of highly conserved 9‐mers that are present in nearly all of the sequences in the dataset. A graphic is produced to illustrate this distribution (Figure 6E). These data can be piped through additional HIV database tools to enhance their usefulness. We provide an example in Table 2. Here, for the purpose of illustration, we take all 9‐mers that are present in 95% of the US B‐clade input data, identifying the most highly conserved epitopes in this data set. This peptide set is then piped through the HIV database sequence locator tool 43 to identify the positions of each conserved 9‐mer in Gag. The data are sorted by position, and overlapping 9‐mers are combined to define larger highly conserved regions. Each conserved region is then piped through the HIV database Epitope Location Finder tool 44 to identify known epitopes within the region and their HLA presenting proteins. Epitope Location Finder also enables epitope predictions (data not shown). All of these data are brought together to produce Table 2. Such a table could provide a useful baseline for interpreting protective immune responses, or for informing peptide vaccine designs.
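
The steps described above (counting k‐mer frequencies, keeping those above a conservation threshold, and combining overlapping 9‐mers into regions) can be sketched in a few lines. This is an illustrative sketch only, not the 'characterized PTEs' tool: all function names are hypothetical, frequency is assumed to mean the number of sequences containing a k‐mer, and positions are assumed to be supplied by the caller (in the paper they come from the HIV database sequence locator tool). Consistent with the layout of Table 2, k‐mers are merged only when their start positions are consecutive.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k=9):
    """Number of sequences in which each distinct k-mer occurs."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def occupancy_histogram(counts):
    """Map n -> number of distinct k-mers found in exactly n sequences."""
    return dict(Counter(counts.values()))


def conserved_kmers(sequences, k=9, min_fraction=0.95):
    """k-mers present in at least min_fraction of the sequences."""
    counts = kmer_sequence_counts(sequences, k)
    threshold = min_fraction * len(sequences)
    return {km: n for km, n in counts.items() if n >= threshold}


def merge_overlapping(positioned, k=9):
    """Merge k-mers with consecutive start positions into regions.

    positioned: iterable of (start, kmer) pairs, 1-based starts.
    Returns a list of (start, stop, peptide) tuples.
    """
    regions = []
    prev_start = None
    for start, km in sorted(positioned):
        if (regions and start == prev_start + 1
                and regions[-1][2].endswith(km[:-1])):
            regions[-1][1] = start + k - 1   # extend the stop position
            regions[-1][2] += km[-1]         # append the one new residue
        else:
            regions.append([start, start + k - 1, km])
        prev_start = start
    return [tuple(r) for r in regions]


# Four rows taken from Table 2 (start position, 9-mer):
rows = [(313, "VKNWMTETL"), (314, "KNWMTETLL"),
        (316, "WMTETLLVQ"), (317, "MTETLLVQN")]
print(merge_overlapping(rows))
# [(313, 322, 'VKNWMTETLL'), (316, 325, 'WMTETLLVQN')]
```

The two merged peptides match the region rows shown in Table 2 for these positions. The histogram returned by `occupancy_histogram` corresponds in spirit to the distribution plotted in Figure 6E.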

Discussion

Epigraph produces vaccine immunogen cocktail designs that include intact proteins, so they can be delivered using the same strategies as natural proteins. Because of this, epitopes will be naturally processed and presented, and large stretches of proteins with many potential epitopes and many HLA presentation motifs are included in epigraph vaccines. In addition to the design of full‐length proteins, there is also an interest in focusing vaccine‐stimulated T‐cell responses exclusively on conserved regions, to shift immunodominance to epitopes under fitness constraints 45, 46, 47. A variant on this theme is to consider coevolutionary patterns and fitness landscapes and to restrict vaccine design to inclusion of epitopes that require compensatory mutations to escape 48. Either way, responses to epitopes with high fitness costs may be beneficial in either a preventative or therapeutic setting, but delivery is challenging. HIV peptide vaccines comprised of concatenated epitopes have been essentially immunologically silent when tested in human trials 49, 50. For this reason, when designing epigraphs and mosaics that target conserved epitopes presumed to have a high viral fitness cost associated with escape, we have focused on the use of conserved regions rather than conserved epitopes, so that local protein context is preserved to enable more natural epitope processing. To date, our conserved region HIV mosaic designs have been immunogenic in animal studies 17, 18. Of note, and of some concern, responses to conserved region epitopes were stronger when they were embedded in a complete protein than when they were isolated in conserved region fragments 17. We have recently used epigraph to design a conserved region pan‐filovirus vaccine 8, and immunogenicity testing is currently underway.

As an alternative to conserved regions, recent exploration of the use of dendritic cells for peptide presentation to T cells may offer a valuable strategy for presenting short peptides to the immune system 51. Epigraphs could also be helpful for the design of peptide variants in such a system.

Cytomegalovirus (CMV) vectors have sparked great recent interest in T‐cell‐based vaccines. SIV proteins presented in CMV vectors generate prolific T‐cell responses, which enable control and clearance of pathogenic SIV upon infection in over 50% of vaccinated monkeys; thus, CMV‐vectored HIV vaccines may have both therapeutic and prophylactic potential 40, 41. CMV responses in rhesus macaques have advanced the field into new areas of T‐cell‐mediated immunity that we are just beginning to understand, and provided new impetus for exploring T‐cell‐mediated vaccine approaches 40, 41, 42. Pairing epigraph inserts with CMV vectors may be able to enhance the cross‐reactive potential of the responses, for both preventative and therapeutic vaccine applications 8.

In summary, epigraphs employ a graph‐based approach to design antigens for vaccine cocktails against highly variable pathogens. Epigraphs provide significantly improved runtimes over the first‐generation mosaic vaccine design strategies (typically seconds to minutes for epigraph versus hours to days for mosaics). This enhanced computational efficiency allows users to explore aspects of vaccine design that were previously intractable. Epigraph requires that the epitope graph be acyclic, and we have here explored the outcome of using different heuristics for removing cycles from graphs. While some trends were noted, we have found that the optimal heuristic choice varies from one input data set to another. We have also provided more detailed explanations of the basic input and output of the epigraph tool suite.
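
To make the acyclicity requirement concrete, the following minimal sketch builds a k‐mer graph and finds a highest‐scoring path by dynamic programming over a topological order. It is a simplified illustration under stated assumptions, not the epigraph implementation: nodes are distinct k‐mers weighted by the number of input sequences containing them, an edge joins two k‐mers whose overlap is k−1 residues (a de Bruijn‐style construction), and the function names `build_graph` and `best_path` are hypothetical. The sketch simply reports an error when a cycle is present; it does not implement any of the cycle‐removal heuristics compared in this paper, nor the design of complementary antigens.

```python
from collections import Counter, defaultdict, deque


def build_graph(sequences, k=9):
    """Return (weight, succ): k-mer weights and successor lists."""
    weight = Counter()
    for seq in sequences:
        weight.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    by_prefix = defaultdict(list)
    for node in weight:
        by_prefix[node[:-1]].append(node)
    # a -> b whenever the last k-1 residues of a are the first k-1 of b
    succ = {node: by_prefix.get(node[1:], []) for node in weight}
    return weight, succ


def best_path(weight, succ):
    """Highest total-weight path in an acyclic k-mer graph.

    Returns (assembled_sequence, score). Raises ValueError on a cycle.
    """
    indegree = {node: 0 for node in weight}
    for node in succ:
        for nxt in succ[node]:
            indegree[nxt] += 1
    # Kahn's algorithm: repeatedly emit nodes with no unprocessed parents.
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(weight):
        raise ValueError("graph contains a cycle; remove cycles first")

    score = dict(weight)   # best score of any path ending at each node
    back = {}              # predecessor on that best path
    for node in order:     # score[node] is final when node is visited
        for nxt in succ[node]:
            if score[node] + weight[nxt] > score[nxt]:
                score[nxt] = score[node] + weight[nxt]
                back[nxt] = node
    end = max(score, key=score.get)
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    path.reverse()
    sequence = path[0] + "".join(km[-1] for km in path[1:])
    return sequence, score[end]
```

Because each node is visited once in topological order, the search is linear in the size of the graph, which is in keeping with the short runtimes noted above. A repeated motif in the input (for example, the toy sequence `"ABABAB"` with `k=2`) produces a cycle and triggers the error, which is the situation the cycle‐removal heuristics are meant to address.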

41 in total

1. Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants.

Authors: Will Fischer; Simon Perkins; James Theiler; Tanmoy Bhattacharya; Karina Yusim; Robert Funkhouser; Carla Kuiken; Barton Haynes; Norman L Letvin; Bruce D Walker; Beatrice H Hahn; Bette T Korber
Journal: Nat Med Date: 2006-12-24 Impact factor: 53.440

Review 2. T-cell vaccine strategies for human immunodeficiency virus, the virus with a thousand faces.

Authors: Bette T Korber; Norman L Letvin; Barton F Haynes
Journal: J Virol Date: 2009-05-13 Impact factor: 5.103

3. Full-length HIV-1 immunogens induce greater magnitude and comparable breadth of T lymphocyte responses to conserved HIV-1 regions compared with conserved-region-only HIV-1 immunogens in rhesus monkeys.

Authors: Kathryn E Stephenson; Adam SanMiguel; Nathaniel L Simmons; Kaitlin Smith; Mark G Lewis; James J Szinger; Bette Korber; Dan H Barouch
Journal: J Virol Date: 2012-08-15 Impact factor: 5.103

4. Short conserved sequences of HIV-1 are highly immunogenic and shift immunodominance.

Authors: Otto O Yang; Ayub Ali; Noriyuki Kasahara; Emmanuelle Faure-Kumar; Jin Young Bae; Louis J Picker; Haesun Park
Journal: J Virol Date: 2014-11-05 Impact factor: 5.103

5. Hepatitis C genotype 1 mosaic vaccines are immunogenic in mice and induce stronger T-cell responses than natural strains.

Authors: Karina Yusim; Rebecca Dilan; Erica Borducchi; Kelly Stanley; Elena Giorgi; William Fischer; James Theiler; Joseph Marcotrigiano; Bette Korber; Dan H Barouch
Journal: Clin Vaccine Immunol Date: 2012-12-05

6. Therapeutic Vaccination With Dendritic Cells Loaded With Autologous HIV Type 1-Infected Apoptotic Cells.

Authors: Bernard J C Macatangay; Sharon A Riddler; Nicole D Wheeler; Jonathan Spindler; Mariam Lawani; Feiyu Hong; Mary J Buffo; Theresa L Whiteside; Mary F Kearney; John W Mellors; Charles R Rinaldo
Journal: J Infect Dis Date: 2015-12-08 Impact factor: 5.226

Review 7. Justification for the inclusion of Gag in HIV vaccine candidates.

Authors: Anna-Lise Williamson; Edward P Rybicki
Journal: Expert Rev Vaccines Date: 2015-12-28 Impact factor: 5.217

8. Comprehensive epitope analysis of human immunodeficiency virus type 1 (HIV-1)-specific T-cell responses directed against the entire expressed HIV-1 genome demonstrate broadly directed responses, but no correlation to viral load.

Authors: M M Addo; X G Yu; A Rathod; D Cohen; R L Eldridge; D Strick; M N Johnston; C Corcoran; A G Wurcel; C A Fitzpatrick; M E Feeney; W R Rodriguez; N Basgoz; R Draenert; David R Stone; C Brander; P J R Goulder; E S Rosenberg; M Altfeld; B D Walker
Journal: J Virol Date: 2003-02 Impact factor: 5.103

9. Epigraph: A Vaccine Design Tool Applied to an HIV Therapeutic Vaccine and a Pan-Filovirus Vaccine.

Authors: James Theiler; Hyejin Yoon; Karina Yusim; Louis J Picker; Klaus Fruh; Bette Korber
Journal: Sci Rep Date: 2016-10-05 Impact factor: 4.379

10. Broadly targeted CD8⁺ T cell responses restricted by major histocompatibility complex E.

Authors: Scott G Hansen; Helen L Wu; Benjamin J Burwitz; Colette M Hughes; Katherine B Hammond; Abigail B Ventura; Jason S Reed; Roxanne M Gilbride; Emily Ainslie; David W Morrow; Julia C Ford; Andrea N Selseth; Reesab Pathak; Daniel Malouli; Alfred W Legasse; Michael K Axthelm; Jay A Nelson; Geraldine M Gillespie; Lucy C Walters; Simon Brackenridge; Hannah R Sharpe; César A López; Klaus Früh; Bette T Korber; Andrew J McMichael; S Gnanakaran; Jonah B Sacha; Louis J Picker
Journal: Science Date: 2016-01-21 Impact factor: 47.728

7 in total

1. Dendritic cells focus CTL responses toward highly conserved and topologically important HIV-1 epitopes.

Authors: Tatiana M Garcia-Bates; Mariana L Palma; Renee R Anderko; Denise C Hsu; Jintanat Ananworanich; Bette T Korber; Gaurav D Gaiha; Nittaya Phanuphak; Rasmi Thomas; Sodsai Tovanabutra; Bruce D Walker; John W Mellors; Paolo A Piazza; Eugene Kroon; Sharon A Riddler; Nelson L Michael; Charles R Rinaldo; Robbie B Mailliard
Journal: EBioMedicine Date: 2021-01-12 Impact factor: 8.143