Computational cancer neoantigen prediction: current status and recent advances.

Abstract

Over the last few decades, immunotherapy has shown significant therapeutic efficacy in a broad range of cancer types. Antitumor immune responses are contingent on the recognition of tumor-specific antigens, which are termed neoantigens. Tumor neoantigens are ideal targets for immunotherapy since they can be recognized as non-self antigens by the host immune system and are thus able to elicit an antitumor T-cell response. An increasing number of studies highlight the importance of tumor neoantigens in immunoediting and in the sensitivity to immune checkpoint blockade. Therefore, one of the most fundamental tasks in immuno-oncology research is the identification of patient-specific neoantigens. To this end, a plethora of computational approaches have been developed to predict tumor-specific aberrant peptides and to quantify their likelihood of binding to patients' human leukocyte antigen molecules, which is a prerequisite for recognition by T cells. In this review, we systematically summarize and present the most recent advances in computational neoantigen prediction, and discuss the challenges and the novel methods that are being developed to resolve them.

Keywords: immunotherapy; neoantigens; personalized medicine

Year: 2021 PMID: 35755950 PMCID: PMC9216660 DOI: 10.1016/j.iotech.2021.100052

Source DB: PubMed Journal: Immunooncol Technol ISSN: 2590-0188

Introduction

Conventional treatment of malignant tumors is based on surgery, chemotherapy, and radiation therapy, each of which has its advantages and drawbacks. Surgical procedures cannot always ensure the complete removal of tumor cells, and recent studies show that the inflammatory response to a post-operative infection can increase the risk of tumor recurrence through the release of proinflammatory mediators. Radiation therapy and chemotherapy can elicit acquired resistance by different mechanisms, including multidrug resistance, suppression of apoptosis, altered drug metabolism, and enhanced DNA repair and gene amplification. Immunotherapy, which harnesses the power of the immune system to target malignant cells, has emerged in recent years and is showing remarkable results in clinical trials. One of the major drawbacks of immunotherapy is that tumor cells can evolve immunoevasive and immunosuppressive phenotypes, thus achieving immune escape. Immunosuppressive tumor cells can express membrane proteins such as programmed death-ligand 1 (PD-L1), which binds to its receptor, programmed cell death protein 1 (PD-1), on activated T cells and delivers a signal that inhibits T-cell receptor (TCR)-mediated activation of interleukin-2 production and T-cell proliferation. The winners of the 2018 Nobel Prize in Physiology or Medicine, James P. Allison and Tasuku Honjo, have shown that PD-1 blockade is effective against many types of tumors because it enhances the antitumor activity of cytotoxic T lymphocytes, which recognize various tumor-specific antigens (TSAs), and these findings formed the basis of immune checkpoint inhibition (ICI) therapy. The inherent genetic instability of tumor cells leads to a large number of non-synonymous somatic mutations that are not present in healthy tissue. Expression of these tumor-specific mutations produces aberrant proteins, which are subsequently cleaved by the proteasome (Figure 1). 
The resulting mutated peptides are then transferred to the endoplasmic reticulum (ER) lumen through the transporter associated with antigen processing (TAP) complex, where they are made available for binding to major histocompatibility complex class I (MHC-I; in vertebrates) or human leukocyte antigen class I (HLA-I; in humans) molecules within the peptide loading complex. These peptides, named neoantigens, are defined as the tumor-specific mutated peptides that are presented on the membrane of malignant cells via the HLA-I protein complex and are not subjected to central or peripheral tolerance, and are thus capable of inducing CD8+ T-cell mediated antitumor responses. Over the years, mouse cancer models and strong correlative clinical data have provided definitive experimental evidence of how targeting neoantigens can result in a positive response to immune-mediated therapies [12-14]. Whereas HLA-I molecules are expressed by most nucleated cells and primarily present endogenously derived peptide antigens to CD8+ T cells, HLA class II (HLA-II) molecules are predominantly expressed by professional antigen presenting cells (pAPCs) and present antigenic peptides (mainly generated from exogenous proteins) to CD4+ T cells. Although HLA-II is constitutively expressed only on pAPCs, there is evidence that HLA-II expression can be induced by interferon-γ in many other cell types, including tumor cells (HLA-II-positive malignant cells). Briefly, once HLA-II chains are complexed with the invariant chain (Ii) protein and the co-chaperone HLA-DM in the ER, the complex buds off in a vesicle that fuses with endosomes. Subsequently, HLA-II is loaded with endocytosed exogenous peptides or with endogenously derived peptides originating from autophagy. Stabilized peptide-HLA-II complexes are presented to CD4+ T cells on the cell surface. 
Tumor-specific HLA-II expression has been associated with improved prognosis and response to immunotherapy in humans [17-20] and with increased tumor rejection in murine models.

Figure 1

Sources of non-self neoantigens. Neoantigens originate from mutated proteins expressed only in cancer cells.

These non-self antigens can derive from a number of different events at the gene, transcript, or protein level, such as point mutations (SNVs), small insertions or deletions (indels), alternative splicing, and gene fusions. Translation errors and post-transcriptional modifications can also lead to aberrant proteins. These aberrant proteins are then processed by the proteasome and cleaved into shorter peptides. The transporter associated with antigen processing (TAP) brings these peptides to the endoplasmic reticulum, where they are loaded on to the major histocompatibility complex (MHC) molecule. The peptide–MHC complex is then transported to the cell surface and presented to T cells.

AG, antigen; ER, endoplasmic reticulum; SNVs, single nucleotide variations; uORF, upstream open reading frame.

Despite promising results and the increasing interest in immunotherapy, there are many technical challenges and questions that arise from the very nature of tumors and their ability to acquire immune escape mechanisms. In addition to immunosuppression, the immunoevasive attributes of tumor cells reside in the weak immunogenicity (defined as the ability of a peptide bound to an MHC molecule to induce adaptive immune responses) of most neoantigens. Therefore, the identification of immunogenic neoantigens is a pivotal step in immuno-oncology research and plays an instrumental role in the development of novel immunotherapeutic approaches. Advances in next generation sequencing (NGS) techniques have permitted improved identification of tumor-specific neoantigens and a better understanding of tumor–immune system interactions. Sequencing depth, quality of tumor tissue, the source of the sequencing material, and other factors, however, still pose a major challenge in the process of in silico neoantigen prediction. 
Moreover, the analysis of high-throughput sequencing data is a daunting task and requires a high level of bioinformatics expertise. A typical computational workflow for neoantigen prediction can be summarized in three steps: (i) variant calling and inference of tumor-specific mutated peptides, (ii) HLA typing, and (iii) HLA binding affinity prediction and filtering/prioritization of neoantigens. The main focus of this review is to (i) highlight important information regarding the neoantigen landscape and the approaches available for mining this information for each class of neoantigens, (ii) present the latest developments and advances in the algorithms and computational frameworks for the identification and prioritization of neoantigens that have emerged since our last survey, and (iii) briefly discuss the technical caveats of the available methods, together with some important biological questions that must be answered in order to develop methods that can predict the immunogenicity of neoantigens.

The tumor antigen landscape

Tumor cells express a broad spectrum of antigens including TSAs (or neoantigens), tumor-associated antigens (TAAs), and cancer germline antigens (CGAs). TAAs and CGAs are not expressed exclusively by tumor cells, but can also be found on the surface of cells residing within normal tissue. Due to their expression in healthy tissue, targeting such antigens would pose two issues: (i) poor results due to central immunological tolerance mechanisms, and more importantly (ii) increased risk of cross-reactivity with structurally related self-peptides and of off-target toxicities. Unlike TAAs and CGAs, neoantigens (TSAs) are expressed only by tumor cells and can thus be considered akin to truly foreign peptides, as they are completely absent from normal tissue. They therefore represent an ideal immunotherapy target, since they can be recognized as non-self by the host immune system. Despite these theoretical advantages, however, neoantigen-specific approaches cannot completely eliminate the risk of autoimmunity. This is exemplified by the fact that neoantigens derived from single nucleotide variations (SNVs) can closely resemble their normal counterparts, and thus neoantigen-specific T cells can be cross-reactive with the non-mutated peptides.

Sources of neoantigens

Neoantigens are cancer-specific aberrant peptides that can be recognized as non-self and can elicit an immune response by the host immune system. These aberrations can result from several types of genomic or transcriptome-based alterations and from post-translational modifications (PTMs) in tumors. The best characterized neoantigens are the result of non-synonymous somatic mutations such as SNVs, small insertions and deletions (indels), frameshift mutations, or other genomic rearrangements, such as gene fusions [31-33]. Neoantigens can also arise from post-transcriptional aberrations, including cancer-specific alternative exon splicing, intron retention, and premature transcription termination. Finally, another less explored source of neoantigens is cancer-specific post-translational protein modification, such as methylation, phosphorylation, acetylation, and glycosylation (Figure 1). Due to the considerable cost and difficulty of the experimental methods used to identify PTMs, many computational methods, such as GPS-Lipid and MusiteDeep, have recently been developed for predicting PTMs. To date, PTM-derived neoantigens have not been a significant focus of immuno-oncology research, and there is therefore an urgent need to expand the characterization of post-translationally modified HLA-bound peptides as well as the repertoire of TCRs that recognize these modified peptides. Technological advances in deep RNAseq gene expression analysis and in whole-cell and MHC-elute mass spectrometry (MS) peptide detection will be essential for the discovery of neoantigens of this class. Depending on the neoantigen class in question, different approaches and sequencing techniques must be utilized in order to elucidate the diverse repertoire of tumor-specific mutations.

SNVs and small indels

Peptides derived from SNVs and small indels belong to the most commonly studied category of neoantigens, mainly because the methods involved in the identification of these mutations are well established, maintained, and easily accessible, but also because SNVs and small indels were considered to be the major source of neoantigens until recently. The problem with SNV-derived neoantigens is that they can exhibit significant similarity to their normal counterparts, and thus only a small percentage of these putative neoantigens appear to be immunogenic. In order to assess TSAs derived from SNVs and small indels, it is important that whole exome sequencing (WES) or whole genome sequencing (WGS) reads of tumor and matched normal DNA samples are used. Typically, the reads first go through quality control and, if required, are processed to remove low-quality base calls and residual sequencing adapters. The reads are then aligned to a reference genome using a short read aligner. According to the GATK Best Practices workflow, the BAM files should undergo additional processing before variant calling, namely marking of duplicate (redundant) reads and base quality score recalibration (BQSR), which adjusts the base quality scores of the reads using an empirical error model; both steps can be carried out using tools such as GATK4. In addition, if it is not an integral element of the variant caller, read realignment around known indels using GATK3 can be carried out in order to reduce alignment errors. Using the evidence of the aligned reads (BAM files) from tumor and normal tissue, numerous variant callers can detect somatic variants that are present in tumor samples. Most commonly, these callers use either Bayesian inference or traditional statistical models combined with specific filters. Some examples of variant calling tools include MuTect/MuTect2, VarScan2, Manta, SomaticSniper, FreeBayes, and Strelka. 
Since there is no universal ‘gold-standard’ tool and the results of variant callers often disagree, finding a single best caller for all datasets is considered impractical. One solution to this issue is to combine the results from individual callers either by majority voting or with consensus approaches. Once the variants are identified, the final step is to annotate them and produce the resulting mutated protein sequences; the most commonly used tools for this process are the Ensembl Variant Effect Predictor (VEP) and SnpEff.
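
The majority-voting strategy mentioned above can be illustrated with a minimal Python sketch. It assumes that each caller's output has already been parsed into a set of (chromosome, position, reference allele, alternate allele) tuples; the caller names, the toy variants, and the function `majority_vote` are hypothetical and do not correspond to any specific published consensus tool.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def majority_vote(calls_by_caller: Dict[str, Set[Variant]],
                  min_callers: Optional[int] = None) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` callers.

    If `min_callers` is not given, a strict majority of callers is required.
    """
    if min_callers is None:
        min_callers = len(calls_by_caller) // 2 + 1
    # Count, for every variant, how many callers reported it.
    support = Counter(v for calls in calls_by_caller.values() for v in calls)
    return {v for v, n in support.items() if n >= min_callers}


# Toy input: three hypothetical callers with partially overlapping calls.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}
consensus = majority_vote(calls)
```

With three callers the default threshold is two, so the variant seen by a single caller is dropped. In practice, variants would need to be normalized (for example, left-aligned indels) before comparison, which this sketch does not attempt.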

Alternative splicing variants

Mutations in splice sites or splicing factors, exon skipping, intron retention, and a variety of post-transcriptional modifications can produce splice variants, which have been suggested to be particularly relevant for cancer types with low tumor mutational burden but harboring splice factor mutations. These events are detectable only at the transcriptome level and can be quantified using RNA sequencing (RNA-seq) data. This class of neoantigens, however, includes non-mutated peptides, and given that potential off-target effects of cell-based immunotherapies may have drastic consequences, there are several open questions regarding the specificity and cross-reactivity of the predicted alternative splicing (AS)-derived neoantigens. In a recent study, the authors developed a proteogenomic strategy to identify cancer-restricted non-mutated antigens using medullary thymic epithelial cells (mTECs) as ‘normal control’. This was based on the unique characteristic of mTECs to express peripheral antigens, which contributes to the establishment of T-cell self-tolerance. mTECs display a high level of AS and RNA editing, further expanding the broad repertoire of self antigens in the thymus [54-56]. We anticipate that the method can be complemented by the incorporation of RNA-seq data of normal tissues from the Genotype-Tissue Expression (GTEx) project. There are two major approaches to AS event analysis: (i) isoform-based and (ii) count-based strategies. The first step for isoform-based methods is the reconstruction of full-length transcripts, followed by the estimation of their relative abundances from the sequencing reads supporting these transcripts. Once the abundances are estimated, statistical testing is applied in order to identify differential expression of the reconstructed transcripts between conditions (e.g. between tumor and normal samples). Tools using isoform-based methods include Trinity, Scripture, Cufflinks, Cuffdiff2, EBSeq, StringTie, and DiffSplice. 
These approaches rely heavily on accurate transcript quantification and may be affected by the sequencing depth and the read length. Count-based methods can be further divided into exon-based and event-based approaches. Exon-based methods seek to assign read counts to different features (such as exons or junctions) instead of reconstructing full-length transcripts. These approaches tend to be more robust in detecting differentially expressed exons/junctions between conditions, but their limitation is that they cannot identify the type of splicing event occurring in a gene. Tools using exon-based methods include DEXSeq, SplicingCompass, edgeR, and limma. Finally, event-based approaches seek to quantify the splicing events directly by measuring the fraction of mRNAs expressed from a gene that contain a specific form of an AS event. The problem with this approach is that it is not designed to accommodate the varying uncertainty of isoform expression across isoform groups. Consequently, its application to isoform inference results in reduced power for some classes of isoforms and an increased false discovery rate for others. Several tools use event-based approaches, including MAJIQ, SplAdder, rMATS, SUPPA2, MISO, and dSpliceType.
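
The quantity targeted by event-based approaches, the fraction of transcripts that contain a given form of a splicing event, is commonly expressed as a 'percent spliced in' (PSI) value. The following Python sketch computes PSI from raw read counts and compares tumor with normal tissue. It is a simplified illustration only: the read counts are invented, the function names and the minimum-coverage cutoff are assumptions, and the length normalization and statistical modelling applied by dedicated tools are omitted.

```python
from typing import Optional


def percent_spliced_in(inclusion_reads: int, exclusion_reads: int,
                       min_reads: int = 10) -> Optional[float]:
    """Fraction of reads supporting the inclusion form of a splicing event.

    Returns None when coverage is too low for a meaningful estimate
    (the cutoff `min_reads` is an arbitrary illustrative choice).
    """
    total = inclusion_reads + exclusion_reads
    if total < min_reads:
        return None
    return inclusion_reads / total


def delta_psi(tumor: Optional[float], normal: Optional[float]) -> Optional[float]:
    """Difference in PSI between tumor and normal, if both are defined."""
    if tumor is None or normal is None:
        return None
    return tumor - normal


# Toy counts for a single exon-skipping event (invented numbers).
psi_tumor = percent_spliced_in(inclusion_reads=45, exclusion_reads=5)
psi_normal = percent_spliced_in(inclusion_reads=10, exclusion_reads=40)
difference = delta_psi(psi_tumor, psi_normal)
```

A large positive difference would mark the inclusion form as tumor-enriched; whether such an event yields a usable neoantigen still depends on the specificity concerns discussed in this section.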

Gene fusions

Recurrent balanced rearrangements, most commonly translocations, have been shown to represent important early steps in the initiation of carcinogenesis [76-78]. These rearrangements usually exert their action either by deregulating gene expression at one of the breakpoints or by creating a hybrid gene through the fusion of parts of two genes. The expression of fusion genes results in chimeric proteins, which have the potential to be highly immunogenic due to their difference from their normal counterparts, especially in terms of peptides that include the fusion junction point and both neighboring breakpoint regions. Fusion events can be identified using WGS or RNA-seq data. In principle, WES data can also be used for the detection of fusion events, although WGS provides the most comprehensive and unbiased characterization of genomic alterations. Gene fusion predictors using targeted captured DNA data include BreakID and GRIDSS2. Expressed gene fusions, however, are only detectable with RNA-seq data, which require less storage space and analysis time. Fusion transcript prediction algorithms follow two strategies, which are broadly defined as (i) assembly-first and (ii) mapping-first approaches. Assembly-first methods perform de novo assembly of reads into longer transcripts and then identify chimeric transcripts that are consistent with recurrent balanced rearrangements. Such methods allow for the exploration of fusion transcripts that are not well represented by the reference genome sequence, or of novel fusion transcripts that are entirely absent from the reference genome. The major drawback of the assembly-first methods is that, in most cases, they exhibit low sensitivity when compared with read mapping methods. Assembly-first tools include TrinityFusion and JAFFA-assembly. 
Mapping-first approaches align the sequencing reads directly to the reference genome and then identify two kinds of evidence: reads composed of segments that map in a non-linear way to two different locations of the reference genome (split reads), and read pairs from the same fragment whose alignments to the reference genome have a distance and/or orientation that differs from what would be expected if the fragment were contiguous in the reference genome (discordant reads). Such methods require the number of supporting reads to correlate with the expression of the genes involved in the fusion. True predictions usually have a balanced number of split and discordant reads. Events supported only by discordant reads, or events without discordant reads whose split reads have anchors in just one gene, are frequently artifacts. These methods exhibit increased sensitivity, but their limitation is their inability to identify novel fusion transcripts and intragenic deletions (deletions within a gene are difficult to distinguish from ordinary splicing in RNA-seq data). The two top performing and most widely used mapping-first tools are Arriba and STAR-Fusion.
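
The read-support heuristics described above for mapping-first predictions can be written down as a small filter. The Python sketch below is a hedged illustration under stated assumptions: the record layout (`FusionCandidate`), the gene labels, and the read counts are hypothetical, and real fusion callers apply many additional filters that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class FusionCandidate:
    gene_a: str
    gene_b: str
    split_reads_anchored_in_a: int  # split reads whose anchor lies in gene A
    split_reads_anchored_in_b: int  # split reads whose anchor lies in gene B
    discordant_pairs: int           # read pairs spanning both genes


def is_likely_artifact(c: FusionCandidate) -> bool:
    """Flag candidates whose read support matches the artifact patterns in the text."""
    split_total = c.split_reads_anchored_in_a + c.split_reads_anchored_in_b
    # Pattern 1: supported by discordant pairs only, with no split reads.
    only_discordant = split_total == 0 and c.discordant_pairs > 0
    # Pattern 2: no discordant pairs, and split reads anchored in just one gene.
    one_sided_split_only = (
        c.discordant_pairs == 0
        and split_total > 0
        and (c.split_reads_anchored_in_a == 0 or c.split_reads_anchored_in_b == 0)
    )
    # Candidates without any supporting reads are also rejected.
    no_support = split_total == 0 and c.discordant_pairs == 0
    return only_discordant or one_sided_split_only or no_support


candidates = [
    FusionCandidate("GENE_A", "GENE_B", 4, 5, 6),  # balanced support
    FusionCandidate("GENE_C", "GENE_D", 0, 0, 7),  # discordant pairs only
    FusionCandidate("GENE_E", "GENE_F", 3, 0, 0),  # one-sided split reads only
]
kept = [c for c in candidates if not is_likely_artifact(c)]
```

Only the first candidate survives the filter. The sketch treats the patterns as hard rules for clarity, whereas the text describes them as frequent, not universal, indicators of artifacts.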

HLA typing

The HLA complex is one of the most gene-dense and polymorphic regions in the human genome and encodes key components of the human adaptive immune system. The HLA region contains six classical HLA genes, more specifically HLA-A, -B, -C for HLA class I and HLA-DR, -DQ, -DP for HLA class II. HLA polymorphisms occur in domains responsible for epitope binding, and thus the overall immune repertoire is exponentially broadened. Until recently, only CD8+ T cells, and therefore only HLA class I molecules, were considered to have an important role in neoantigen recognition, but there is increasing evidence that the majority of the immunogenic tumor mutanome is recognized by CD4+ T cells, and thus HLA class II neoantigens can also elicit immune responses to cancer. In 1998, the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) created a publicly available, curated database containing serologically defined HLA antigens and their genes/alleles defined by nucleotide sequences (IMGT/HLA database). The first step for almost every HLA typing tool is the mapping of reads to the exonic and intronic regions of the HLA genes, as defined in the IMGT/HLA database. Different approaches can then be taken to identify the HLA types. Briefly, some tools create a list of HLA alleles by selecting those with no missed exons and no more than one mismatch, and then form pairs of HLA alleles (e.g. A*01:01:01 and A*01:01:02) from this list. For each pair, a score is calculated (applying a scoring scheme based on multiple sequence alignment) and the pair with the best score is reported as the final result. Other tools construct a binary hit matrix for all the reads mapping to at least one HLA allele and, assuming that the correct HLA genotype explains the highest number of mapped reads, formulate an integer linear programming optimization in order to find an optimal solution. 
Finally, some tools use a graph-based alignment approach to ensure increased read mapping sensitivity and then seek maximum likelihood estimates of abundance through an expectation-maximization algorithm. Most tools accept RNA-seq, WES, and WGS data as input and can carry out HLA class I typing and/or class II typing. One of the best performing tools for HLA class I typing is OptiType, followed by Polysolver, while for class I and II typing the two best performing tools are HLA-HD and HISAT-genotype. HLA class I typing tools have been researched extensively and can achieve high sensitivity. The use of long reads has been proposed in order to increase HLA typing accuracy, and while the results look promising, there are still several limitations, mainly attributed to the high background signal. Using long but noisy sequencing reads for HLA typing requires the development of novel bioinformatics solutions distinct from those designed for shorter but more accurate reads; moreover, current HLA typing tools can already reach 99% sensitivity for common HLA class I alleles using short paired-end reads. This percentage, however, does not represent the whole biological truth, because HLA allele sequences may be only partially available in the IMGT/HLA repository. For example, so far 3644 alleles have been classified for HLA-A, and although all alleles of HLA-A have known sequences for exons 2 and 3, only 383 alleles have full-length sequences available. This problem is not restricted to HLA class I, as HLA class II typing is less investigated, as is evident from the number of named alleles for each HLA gene in the IMGT/HLA database: as of October 2021, the IMGT/HLA database contains 23 002 entries with known sequences for HLA class I, whereas it contains only 8673 entries with known sequences for class II. 
Other factors that add to this problem are the dimeric nature of the functional HLA class II complex and the copy number variation of one of the loci (HLA-DRB), which makes this region exceptionally convoluted. Furthermore, the IMGT/HLA database includes the most frequent HLA alleles in the human population, which proves problematic in the case of rare or novel HLA alleles and subsequently leads to an increased false-negative rate. Therefore, specialized HLA typing [polymerase chain reaction (PCR)- or NGS-based] using either long/mid-range PCR-based isolation or hybridization-based capture methods will be superior in a clinical setting. Nonetheless, while PCR-based HLA typing methods using sequence-specific primers, sequence-specific oligonucleotide probes, and Sanger sequencing have significantly improved HLA typing resolution, there are several caveats, including time-consuming protocols, low throughput, unphased data, and ambiguity.
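
The hit-matrix idea described above (choose the genotype that explains the largest number of mapped reads) can be demonstrated on a toy scale. The Python sketch below replaces the integer linear programming formulation with an exhaustive search over allele pairs at a single locus, which is feasible only for very small inputs and is used here purely as a stand-in. The read identifiers and the read-to-allele assignments are invented, and `best_allele_pair` is a hypothetical helper, not the interface of any HLA typing tool.

```python
from itertools import combinations_with_replacement
from typing import Dict, Set, Tuple


def best_allele_pair(hits: Dict[str, Set[str]]) -> Tuple[Tuple[str, str], int]:
    """Return the allele pair that explains the most reads, and that read count.

    `hits` maps an allele name to the set of read ids compatible with it,
    i.e. one column of the binary hit matrix. Homozygous pairs are allowed.
    On ties, the first pair in sorted order is kept.
    """
    best_pair, best_count = None, -1
    for a, b in combinations_with_replacement(sorted(hits), 2):
        explained = len(hits[a] | hits[b])  # reads hit by either allele
        if explained > best_count:
            best_pair, best_count = (a, b), explained
    return best_pair, best_count


# Toy hit matrix for one locus (invented read ids).
hits = {
    "A*01:01": {"r1", "r2", "r3"},
    "A*02:01": {"r4", "r5"},
    "A*03:01": {"r1", "r4"},
}
pair, n_explained = best_allele_pair(hits)
```

Here the pair A*01:01 and A*02:01 explains all five reads. An exhaustive search scales quadratically with the number of candidate alleles per locus and ignores the joint optimization across loci, which is why an optimization formulation is used in practice.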

HLA binding affinity prediction

Among the various processes described so far in this review, the major determinant of neoantigen presentation is the binding of the tumor-specific epitopes to the HLA molecules. Therefore, computational predictors that discriminate HLA binding from non-binding peptides are critical. Typically, these predictors utilize MS-identified HLA eluted ligands (EL) or binding affinity data deposited in the Immune Epitope Database (IEDB), the SysteMHC Atlas, the proteomics identifications database (PRIDE), or other publicly available MS-based immunopeptidomic datasets, to train machine learning (ML) classifiers. Early tools use linear regression-based methods to predict HLA peptide binding affinity. The problem with this approach is that it assumes that the contribution of individual residues to the overall binding affinity is linear in nature, which is rarely the case, since the correlation between neighboring peptide residues can also affect HLA binding. In order to account for this non-linear relationship, current ML classifiers utilize artificial neural networks (ANNs). For example, a feedforward neural network can simulate the contribution of each peptide residue type by adapting the weights of locally connected, one-dimensional convolutional layers in order to capture the complex interactions between HLA binding residues. Allele-specific methods train a model for each HLA allele and learn the binding patterns of each allele separately. Sufficient experimentally validated ligands are available for only a few hundred HLA alleles, however, which represent only a small fraction of the HLA alleles observed in the human population. To address this issue, pan-allele predictors have been introduced that allow for interpolation between ligands, but also between receptors. 
The input of these algorithms consists of both the sequence of the ligand and the sequence of the HLA allele’s binding site, and thus they are powerful at capturing correlations between amino acids in the HLA binding site and in the ligand. It should be noted that, in principle, pan-specific methods can predict binders of any HLA allele with a known protein sequence, which implies that truly novel HLA alleles might pose an issue when it comes to binding affinity predictions. The most widely used allele-specific binding affinity predictors are NetMHC and MHCflurry for HLA-I, and NetMHCII and MixMHC2pred for HLA-II. Among the top performing pan-specific HLA binding affinity predictors are MHCflurry 2.0 and NetMHCpan for HLA class I and NetMHCIIpan for HLA class II. Due to the different training approaches, allele-specific methods outperform the pan-specific methods for HLA molecules where sufficient data are available to accurately characterize the binding motif, and pan-specific methods outperform the allele-specific methods when data are scarcer. It has been shown, however, that consensus approaches combining both methods can improve the binding affinity prediction accuracy. Although ANNs have addressed the non-linear nature of the peptide-HLA binding process, there are known limitations to the methods, depending on the datasets each method uses to train its internal ML algorithm. Typically, these ML algorithms fall into two broad categories, the first being ML classifiers trained on binding affinity data alone. This can substantially limit the predictive power, since only the binding event is modeled and no other biological feature involved in the process is accounted for. In order to resolve this issue, the second category of ML classifiers is trained on combined binding affinity and MS-based EL data. Despite the major improvements in the quality of immunopeptidomics data, there are still several technical restrictions to overcome. 
The spectra obtained by MS are compared with in silico generated spectra of peptides from protein sequence databases using MS search tools (spectra searches). First, the spectra search is limited to the available databases, which are usually restricted to the annotated human proteome. To address this limitation, dedicated proteogenomics computational pipelines for customized reference databases have been developed to expand the search space beyond the canonical human proteome. Second, peptides with features that make them incompatible with ionization might not be detected with standard methods. Finally, the antibodies employed during the immunopurification of peptide-HLA complexes in EL assays are mostly pan-specific, which may eventually result in multiallelic data. More recent ML algorithms seek to annotate the EL datasets and deconvolute the multiallelic data into single-allelic data before using them to train the predictors. Another promising approach to this issue is the monoallelic strategy for profiling the HLA peptidome, which leverages cell lines expressing a single HLA allele and optimized immunopurifications.
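
To make the pan-allele input format described in this section more concrete, the following Python sketch one-hot encodes a peptide together with a binding-site sequence and concatenates the two into a single feature vector. This is an illustrative assumption about how such an input could be represented, not the encoding used by any particular predictor: the padding scheme, the maximum peptide length, the toy peptide, and the toy binding-site string are all hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"                             # padding symbol for short peptides
ALPHABET = AMINO_ACIDS + PAD


def one_hot(sequence: str, length: int) -> List[int]:
    """One-hot encode `sequence`, right-padded with PAD to `length` positions."""
    if len(sequence) > length:
        raise ValueError("sequence longer than the encoding length")
    vector: List[int] = []
    for residue in sequence.ljust(length, PAD):
        if residue not in ALPHABET:
            raise ValueError(f"unknown residue: {residue!r}")
        vector.extend(1 if residue == symbol else 0 for symbol in ALPHABET)
    return vector


def encode_pair(peptide: str, binding_site: str, max_peptide_len: int = 11) -> List[int]:
    """Concatenate the encodings of the ligand and of the HLA binding site."""
    return one_hot(peptide, max_peptide_len) + one_hot(binding_site, len(binding_site))


# Toy example: both sequences are invented for illustration.
features = encode_pair("ACDEFGHIK", "YFAMYQEN")
```

Because the receptor sequence is part of the input, a model trained on such vectors can, in principle, share information across alleles with similar binding sites, which is the interpolation between receptors referred to above.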

Filtering and prioritization of neoantigens

Early neoantigen prediction methods relied on binding affinity, measured as the half-maximal inhibitory concentration (IC50), for the filtering and prioritization of neoantigens. As a rule of thumb, every peptide exhibiting an IC50 <500 nM was considered a ‘candidate’ and the remaining peptides (with IC50 >500 nM) were filtered out. The putative neoantigens were then prioritized according to their IC50 values, from the lowest (strong binders) to the highest (weak binders). As the methods evolved, the concept of neoantigen ranking scores was introduced for the classification of peptides into strong and weak binders. In brief, the binding affinity predictions are scored and ranked against a set of random natural HLA binding peptides in order to address the inherent bias of certain molecules towards higher/lower mean predicted affinities. Due to the great diversity and the stochastic nature of the T-cell immune response, however, a single value associated with one part of the whole process can hardly provide sufficient information to accurately model the complex tumor–immune interactions. In this context, systematic integration of multiple features into a unified neoantigen prioritization algorithm would yield increased classification accuracy. These features extend beyond the characteristics of HLA binding and presentation, and include clonality of the neoantigen, amino acid characteristics such as ‘hydrophobicity’, ‘polarity and charged value’, ‘molecular size’, and ‘entropy of peptides’, and promiscuity of HLA molecules, which has been shown to correlate with poor prognosis after ICI therapy. Most recently developed algorithms [125-128] measure the information gain from such features by utilizing feature selection processes, and then train ML classifiers on the basis of the selected immunogenicity features. 
Nevertheless, a major downside of ML approaches is overfitting, and their performance can be significantly affected by the quantity and quality of the training datasets. Ideally, in order to avoid this issue, ML classifiers should be trained on large positive (e.g. experimentally validated immunogenic epitopes) and negative (non-immunogenic) datasets. Unfortunately, there is still a lack of such comprehensive positive/negative datasets. Peptide-MHC multimers or yeast display assays allow the isolation of antigen-specific T cells. Together with single-cell TCR sequencing, immunogenic peptides can be identified and characterized at a large scale, although the high costs and technical challenges may limit their application. The combination of these technologies and the integration of structural modeling information can further improve the classification accuracy, as has been shown by recently developed tools like NetTCR-2.0 or PRIME. In a recent study, the Tumor Neoantigen Selection Alliance of the Parker Institute for Cancer Immunotherapy identified key components of tumor epitope immunogenicity. According to the study, these components can be classified into ‘presentation’ and ‘recognition’ features of the immunogenic peptides. The first category encompasses features that are associated with effective antigen presentation, namely HLA binding affinity, expression of the originating gene (‘tumor abundance’), expected duration of peptide-HLA interaction (‘binding stability’), and peptide hydrophobicity. The second category involves peptide features considered to be associated with immunogenicity among peptides that have the highest likelihood of being presented. Two features were identified: (i) ‘agretopicity’,¹³⁵⁻¹³⁷ which is the ratio of mutant binding affinity to wild-type binding affinity, and (ii) ‘foreignness’,¹³⁸⁻¹⁴⁰ which is the probability of TCR recognition as inferred by the homology of the tumor peptide to known pathogenic peptides in the IEDB.
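As a small worked example of the agretopicity definition given above, the ratio can be computed directly from a pair of predicted affinities. The sketch assumes that both affinities are expressed as IC50 values in nM, so that a ratio below 1 means the mutant peptide is predicted to bind more strongly than its wild-type counterpart; the function name and the numbers are hypothetical.

```python
def agretopicity(mutant_ic50_nm, wildtype_ic50_nm):
    """Ratio of mutant to wild-type binding affinity (both as IC50 in nM).

    Assumption: affinities are given as IC50 values, so lower is stronger
    and a ratio below 1 favors the mutant peptide.
    """
    if mutant_ic50_nm <= 0 or wildtype_ic50_nm <= 0:
        raise ValueError("IC50 values must be positive")
    return mutant_ic50_nm / wildtype_ic50_nm


# Hypothetical pair: the mutation turns a weak binder into a strong one.
ratio = agretopicity(mutant_ic50_nm=50.0, wildtype_ic50_nm=2000.0)
mutant_binds_better = ratio < 1.0
```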

Future perspectives

In-depth understanding of tumor–immune interactions

Although peptide-HLA binding has been researched extensively and is in fact one of the best characterized processes in neoantigen presentation, several caveats impede the accurate and unbiased elucidation of the complex relationships between the tumor and the immune system. This may be attributed to the aforementioned inherent technical biases in MS data and to our limited understanding of the nature of tumor–immune interactions, which restricts the modeling capabilities, especially in terms of T-cell recognition. In order to increase the robustness of the models, some methods integrate information that goes beyond peptide-HLA binding, such as proteasomal cleavage sites, TAP transportation, and ER loading. Although this seemed promising at first, in practice the gain in accuracy is marginal, and most computational pipelines consider this step optional. We expect that integrating immunogenicity features such as binding stability, peptide hydrophobicity, agretopicity, and foreignness into computational pipelines will increase the accuracy of predicting immunogenic neoantigens.

Exploration of the immunological ‘dark matter’

So far, cancer research has mostly focused on mutations that alter protein coding sequences. There is increasing evidence, however, that non-canonical and cryptic peptides also contribute to the HLA peptidome. The emergence of proteogenomics has changed our perspective of the cancer proteome by identifying peptides encoded by all reading frames of any genomic region.¹⁴³⁻¹⁴⁵ Moreover, experiments involving ribosome profiling provided strong evidence for pervasive translation outside of annotated protein coding genes. Using approaches derived from statistical physics, it has become possible to quantify transcriptome-wide motif usage in human and murine non-coding RNAs, determining that most have motif usage consistent with the coding genome. In a recent study, Liepe et al. report evidence that a large fraction of HLA class I ligands are spliced together by the proteasome from two different fragments of the same protein, through proteasome-catalyzed peptide splicing. Although proteasomal splicing is a controversial subject and there are numerous published concerns regarding the findings of the aforementioned study,¹⁴⁹⁻¹⁵¹ it highlighted our poor understanding of the biological processes involved and the challenges that remain regarding the computational identification of peptides that are not encoded in the proteome. Exploring the uncharted waters of the immunological ‘dark matter’ may uncover the contribution of proteins derived from non-canonical sources to the cancer immune repertoire, and expand the range of putative neoantigens.
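To make the idea of ‘all reading frames of any genomic region’ concrete, the sketch below translates a DNA sequence in the three forward and the three reverse-complement frames using the standard genetic code. It is only an illustration of six-frame translation in general and does not reproduce any proteogenomics pipeline cited here; the function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate both strands at offsets 0, 1 and 2; stop codons appear as '*'."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


frames = six_frame_translation("ATGGCC")  # toy sequence
```

In a customized search database, each of the six translations would typically be split at stop codons and digested in silico into candidate peptides before the spectra search.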

Ease of access and deployment of computational pipelines

Dependency issues, version control, lack of scalability, and inconsistencies between development and production environments are only a few of the problems a researcher has to resolve in order to perform computational analysis in general and to predict neoantigens specifically. Software container technology, such as Docker (https://www.docker.com) and Singularity (https://sylabs.io), along with package and environment management systems like Conda (https://conda.io) have revolutionized the practice of software development and deployment. These technologies enable the containerization of software along with its required dependencies in a sanitized environment, thus avoiding any software conflicts, and ensure portability and reproducibility across different information technology platforms in healthcare systems and the cloud. Finally, the deployment of the multiple components required by a bioinformatics pipeline for neoantigen prediction in an orchestrated manner can prove to be a daunting task and usually requires a high level of bioinformatics expertise. Workflow management systems such as Nextflow, Snakemake, Airflow (https://airflow.apache.org/), and CWL allow the development of scalable and reproducible data analysis workflows. Most of the current bioinformatics pipelines (Table 1) are deposited as open source projects in repositories like GitHub (https://github.com/) and can be ported locally and deployed by researchers with minimal effort.
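The orchestration problem that workflow managers address can be shown in miniature: given the dependencies between pipeline steps, derive a valid execution order and pass each step the outputs of its prerequisites. The sketch below uses only the Python standard library (`graphlib`, available from Python 3.9) and is not how Nextflow, Snakemake, Airflow, or CWL are implemented; the step names and the placeholder actions are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "hla_typing": set(),
    "variant_calling": set(),
    "peptide_generation": {"variant_calling"},
    "binding_prediction": {"hla_typing", "peptide_generation"},
    "prioritization": {"binding_prediction"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, actions):
    """Run each step once, handing it the results of its prerequisites."""
    results = {}
    for step in execution_order(dependencies):
        inputs = {dep: results[dep] for dep in dependencies[step]}
        results[step] = actions[step](inputs)
    return results


# Placeholder actions that only record which inputs each step received.
actions = {step: (lambda inputs: sorted(inputs)) for step in PIPELINE}
results = run_pipeline(PIPELINE, actions)
order = execution_order(PIPELINE)
```

Real workflow systems add what this sketch leaves out, such as parallel execution of independent steps, caching and resumption, and running each step inside its own container or Conda environment.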

Table 1

Computational tools and pipelines used in/for neoantigen prediction

Purpose	Name	Input data	HLA class	Repository (if available)
HLA typing tools	OptiType⁹²	WGS/WES/RNA-seq	Class I	https://github.com/FRED-2/OptiType
	PolySolver⁹³	WES	Class I	https://github.com/jason-weirather/hla-polysolver
	HLA-HD⁹⁴	WGS/WES/RNA-seq	Class I and II	https://www.genome.med.kyoto-u.ac.jp/HLA-HD/
	HISAT-genotype⁹⁵	WGS/WES/RNA-seq	Class I and II	https://daehwankimlab.github.io/hisat-genotype/
	arcasHLA¹⁵⁹	WGS/WES/RNA-seq	Class I and II	https://github.com/RabadanLab/arcasHLA
	HLAscan¹⁶⁰	WGS/WES	Class I and II	https://github.com/SyntekabioTools/HLAscan
	xHLA¹⁶¹	WGS/WES	Class I and II	https://github.com/humanlongevity/HLA
	seq2HLA¹⁶²	RNA-seq	Class I and II	https://github.com/TRON-Bioinformatics/seq2HLA
	PHLAT¹⁶³	WGS/WES/RNA-seq	Class I and II	https://sites.google.com/site/phlatfortype/home
	ATHLATES¹⁶⁴	WGS/WES/amplicon	Class I and II	https://github.com/cliu32/athlates
	HLA-VBSeq¹⁶⁵	WGS	Class I and II	http://nagasakilab.csml.org/hla/
	HLAminer¹⁶⁶	WGS/WES/RNA-seq/amplicon	Class I and II	https://github.com/bcgsc/HLAminer
	HLA-LA¹⁶⁷	WGS/WES/RNA-seq	Class I and II	https://github.com/DiltheyLab/HLA-LA

ANN, artificial neural network; CSV, comma separated values; DCNN, deep convoluted neural network; HLA, human leukocyte antigen; GBDT, gradient boosted decision trees; GLM, generalized linear model; MGF, mascot generic format; MHC, major histocompatibility complex; MS, mass spectrometry; RNA-seq, RNA sequencing; SMM, stabilized matrix method; SNVs, single nucleotide variations; VCF, variant call format; WES, whole exome sequencing; WGS, whole genome sequencing.

Conclusion

The advancements in immuno-oncology have been impressive over the past 30 years; however, there are several bottlenecks and open questions that need to be resolved before immunotherapy can become one of the major pillars of cancer treatment. Computational approaches, like neoantigen prediction, will likely play a key role in unlocking the potential of immunotherapies like adoptive T-cell therapy or neoantigen vaccination, which showed significant tumor regression and even durable responses in patients with melanoma, cholangiocarcinoma, or glioblastoma.¹⁵⁵⁻¹⁵⁸ Integration of computational methods in clinical settings will thus pave the way for personalized medicine.

188 in total

1. High-resolution HLA typing by long reads from the R10.3 Oxford nanopore flow cells.

Authors: Chang Liu; Xiao Yang; Brian F Duffy; Jessica Hoisington-Lopez; MariaLynn Crosby; Rhonda Porche-Sorbet; Katsuyuki Saito; Rick Berry; Victoria Swamidass; Robi D Mitra
Journal: Hum Immunol Date: 2021-02-19 Impact factor: 2.850

2. SomaticSniper: identification of somatic point mutations in whole genome sequencing data.

Authors: David E Larson; Christopher C Harris; Ken Chen; Daniel C Koboldt; Travis E Abbott; David J Dooling; Timothy J Ley; Elaine R Mardis; Richard K Wilson; Li Ding
Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937

3. Association of the autoimmune disease scleroderma with an immunologic response to cancer.

Authors: Christine G Joseph; Erika Darrah; Ami A Shah; Andrew D Skora; Livia A Casciola-Rosen; Fredrick M Wigley; Francesco Boin; Andrea Fava; Chris Thoburn; Isaac Kinde; Yuchen Jiao; Nickolas Papadopoulos; Kenneth W Kinzler; Bert Vogelstein; Antony Rosen
Journal: Science Date: 2013-12-05 Impact factor: 47.728

4. SplicingCompass: differential splicing detection using RNA-seq data.

Authors: Moritz Aschoff; Agnes Hotz-Wagenblatt; Karl-Heinz Glatting; Matthias Fischer; Roland Eils; Rainer König
Journal: Bioinformatics Date: 2013-02-28 Impact factor: 6.937

Review 5. Cancer immunoediting: integrating immunity's roles in cancer suppression and promotion.

Authors: Robert D Schreiber; Lloyd J Old; Mark J Smyth
Journal: Science Date: 2011-03-25 Impact factor: 47.728

6. DeepImmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity.

Authors: Guangyuan Li; Balaji Iyer; V B Surya Prasath; Yizhao Ni; Nathan Salomonis
Journal: Brief Bioinform Date: 2021-05-03 Impact factor: 11.622

7. Checkpoint blockade cancer immunotherapy targets tumour-specific mutant antigens.

Authors: Matthew M Gubin; Xiuli Zhang; Heiko Schuster; Etienne Caron; Jeffrey P Ward; Takuro Noguchi; Yulia Ivanova; Jasreet Hundal; Cora D Arthur; Willem-Jan Krebber; Gwenn E Mulder; Mireille Toebes; Matthew D Vesely; Samuel S K Lam; Alan J Korman; James P Allison; Gordon J Freeman; Arlene H Sharpe; Erika L Pearce; Ton N Schumacher; Ruedi Aebersold; Hans-Georg Rammensee; Cornelis J M Melief; Elaine R Mardis; William E Gillanders; Maxim N Artyomov; Robert D Schreiber
Journal: Nature Date: 2014-11-27 Impact factor: 49.962

8. Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames.

Authors: Céline M Laumont; Tariq Daouda; Jean-Philippe Laverdure; Éric Bonneil; Olivier Caron-Lizotte; Marie-Pierre Hardy; Diana P Granados; Chantal Durette; Sébastien Lemieux; Pierre Thibault; Claude Perreault
Journal: Nat Commun Date: 2016-01-05 Impact factor: 14.919

9. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971

10. neoANT-HILL: an integrated tool for identification of potential neoantigens.

Authors: Ana Carolina M F Coelho; André L Fonseca; Danilo L Martins; Paulo B R Lins; Lucas M da Cunha; Sandro J de Souza
Journal: BMC Med Genomics Date: 2020-02-22 Impact factor: 3.063