Machine Learning Methods for Predicting HLA-Peptide Binding Activity.

Heng Luo¹, Hao Ye², Hui Wen Ng², Leming Shi³, Weida Tong², Donna L Mendrick², Huixiao Hong².

Abstract

As the major histocompatibility complexes in humans, the human leukocyte antigens (HLAs) present antigen peptides to T-cell receptors for immunological recognition and responses. Interpreting and predicting HLA-peptide binding are important for studying T-cell epitopes, immune reactions, and the mechanisms of adverse drug reactions. We review different types of machine learning methods and tools that have been used for HLA-peptide binding prediction. We also summarize the descriptors on which the HLA-peptide binding prediction models have been built and discuss the limitations and challenges of the current methods. Lastly, we give a future perspective on HLA-peptide binding prediction based on network analysis.

Keywords: HLA; MHC; binding; machine learning; peptide; prediction

Year: 2015 PMID: 26512199 PMCID: PMC4603527 DOI: 10.4137/BBI.S29466

Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322

Background

Human leukocyte antigens (HLAs) and peptides

The major histocompatibility complexes (MHCs), a major component of the vertebrate immune system, are expressed on cell surfaces for cellular recognition and antigen presentation. In humans, the MHCs are called HLAs. Located on the short arm of chromosome 6, the HLA genes are among the most polymorphic genes in humans and differ among countries and ethnicities.1–3 According to the statistics of the international ImMunoGeneTics database/HLA database,4 >12,500 HLA alleles had been recorded by March 2015. HLA alleles are named systematically. For example, in HLA-A*02:01, A indicates the HLA-A gene locus and 02:01 specifies the protein sequence for this allele. Detailed information on HLA nomenclature can be found at http://hla.alleles.org/.

Three human MHC categories have been identified as Classes I, II, and III due to their different genetic loci. The Class I HLAs, including HLA loci A, B, C, E, F, and G, are codominantly expressed on the surface of all nucleated cells. They present intracellular-processed antigen peptides to cytotoxic CD8+ T-cells for cytotoxicity responses such as natural-killer-cell-induced apoptosis.5–7 Class II HLAs, including the HLA D locus, are selectively expressed on the surface of dendritic cells, B-cells, and other antigen-presenting cells. They present the antigen peptides to helper CD4+ T-cells to trigger acquired immune responses such as B-cell activation.8–10 The Class III MHCs function in the complement system for the clearance of pathogens.11,12 For Class I and Class II HLAs, studying their binding to peptides is essential to understanding the immune system.

The Class I and Class II HLAs are similar in structure. Both classes have a long binding groove that can bind peptides degraded from antigens. Though HLAs contain two chains, the binding grooves of Class I HLAs are formed by the α chain only, while those of Class II HLAs are formed by both the α and β chains.13–15 In addition, these two classes of HLAs bind to peptides of different lengths. While Class I HLAs bind to shorter peptides of around 9 residues, Class II HLAs can bind to a large variety of peptides of around 15 residues or longer due to their open-ended binding grooves.16–18 Though the peptide binders of Class II HLAs are generally longer, the core-binding regions are still around nine residues.15 Therefore, when predicting the binding between Class II HLAs and peptides, extra processes are sometimes needed to determine which part of the peptide binds within the HLA pockets.18 In addition to the structural differences, Class I and Class II HLAs present the antigen peptides to trigger immune responses through different pathways, as shown in Figure 1.19,20
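The allele naming scheme described earlier in this section (for example, HLA-A*02:01) lends itself to simple parsing. The following is a minimal sketch, assuming names with a locus followed by exactly two colon-separated numeric fields; the function name `parse_allele` and the regular expression are illustrative choices, not part of any official nomenclature tool, and longer names with additional fields are not handled.

```python
import re

# Assumption: "HLA-<locus>*<field>:<field>", e.g. "HLA-A*02:01" or "HLA-DRB1*01:01".
ALLELE_RE = re.compile(r"^HLA-([A-Z]+[0-9]*)\*(\d+:\d+)$")

def parse_allele(name):
    """Split an allele name into its gene locus and protein-level fields."""
    match = ALLELE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    locus, fields = match.groups()
    return {"locus": locus, "protein_fields": fields}

parsed = parse_allele("HLA-A*02:01")
```

Here `parsed["locus"]` is the gene locus (A) and `parsed["protein_fields"]` is the part of the name that specifies the protein sequence (02:01).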

Figure 1

The typical pathways by which HLAs present antigen peptides to T-cells. In the HLA Class I pathway, endogenous antigen proteins are degraded by proteasomes into peptides that are transported via transporters associated with antigen processing (TAPs) into the ER. The peptides are loaded onto Class I HLAs and the complexes are sent to the Golgi apparatus for modification. Finally, the complexes are fused into the cell membrane where they can be recognized by TCRs on CD8+ T-cells. In the HLA Class II pathway, exogenous protein antigens are ingested by the cell into endocytic vesicular compartments and loaded onto Class II HLAs in the ER and processed by Golgi apparatus. The complexes are presented on the cell surface and recognized by TCR of CD4+ T-cells.

Most peptides presented by Class I HLAs are from endogenous cytosolic proteins (eg, defective products and even viral proteins if the cell is infected by a virus) synthesized by the cell itself. These proteins are degraded by proteasomes into peptides that are transported into the endoplasmic reticulum (ER) by transporters associated with antigen processing and loaded onto Class I HLAs. After glycosylation in the Golgi apparatus, the Class I HLA–peptide complexes are fused into the cell membrane and presented to the T-cell receptors (TCRs) on CD8+ T-cells for cellular immune responses. If the CD8+ cell, or cytotoxic T-cell, recognizes the specific antigen, it can trigger the presenting cell to undergo apoptosis.

The peptides presented by Class II HLAs are usually from extracellular antigens. The exogenous antigens are engulfed into the endocytic route compartments and digested by proteases. Class II HLAs synthesized in the ER and glycosylated in the Golgi apparatus acquire the peptides in the vesicular compartments and present them on the cell surface. The Class II HLA–peptide complexes are then recognized by TCRs of CD4+ T-cells to further the immune response, such as antibody synthesis.19,20 The two pathways are not separated, and antigens that are mainly processed by one class of HLAs can be presented by the other via a cross-presentation pathway. However, details of this mechanism remain obscure.21

HLAs play an important role in the immune system by presenting peptides to TCRs for immune responses; however, this process may result in adverse outcomes under certain circumstances. Autoimmunity can occur when HLAs present peptides that are structurally similar to self-peptides to TCRs.22 Exogenous drugs may react with the antigen protein, insert into the binding groove of HLAs, or interfere with the HLA–peptide–TCR complex to cause adverse events (not shown in Fig. 1).23–25 The variety of HLAs, peptides, and TCRs all affect the immune response and make it challenging to understand the underlying mechanisms that could help with the prevention of adverse events. However, our recent study showed that considering the binding peptide inside the HLA-binding groove improved the performance of molecular modeling and prediction.13 Thus, understanding HLA–peptide binding can help interpret the interaction mechanisms between drugs and HLAs. Understanding, and even predicting, HLA–peptide binding is therefore a crucial step in addressing the complexity of the immune system. Various methods have already been developed to address such needs.

HLA–peptide binding prediction

The methods for HLA–peptide binding prediction can be divided into three categories: (1) position-specific scoring matrix (PSSM) based, (2) machine learning based, and (3) structure based.26 The PSSM-based methods generate, for a specific HLA, a matrix that holds a value for each residue at each position of a peptide. When predicting the binding affinity of a new peptide, the values for the residue at each position are looked up and combined into a score by a given formula. The PSSM methods were introduced when the available data were limited. They were gradually replaced by machine learning methods, which handle larger data sets, predict quickly, and achieve reliable accuracy.27,28 Meanwhile, the structure-based methods, such as residue-based statistical energy function,29 quantitative structure–activity relationship (QSAR) analysis,30 and quantitative sequence–activity models,31 are an alternative to the machine learning methods as more HLA–peptide binding structures become available for analysis. The structure-based methods provide better insight into HLA–peptide binding at the structure level; however, their prediction accuracy, speed, and scope remain a challenge due to the limited number of available crystal structures.26,32,33 Since machine learning models are widely developed and used by major institutions, including the largest repository of HLA–peptide binding data, IEDB,34,35 this review focuses on the machine learning methods used for predicting HLA–peptide binding.
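The PSSM lookup described above can be sketched in a few lines. This is a minimal illustration, assuming a simple additive formula (the sum of per-position values); published PSSM tools may combine the values differently. The matrix values below are made up for illustration and are not taken from any real scoring matrix.

```python
def pssm_score(peptide, matrix, default=0.0):
    """Sum the matrix value of the residue found at each peptide position.

    matrix is a list with one dict per position, mapping residue -> value.
    Residues missing from a position's dict contribute `default`.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix positions")
    return sum(column.get(residue, default)
               for residue, column in zip(peptide, matrix))

# Hypothetical 9-position matrix: only positions 2 and 9 carry non-zero values.
toy_matrix = [{} for _ in range(9)]
toy_matrix[1] = {"L": 2.0, "M": 1.5}
toy_matrix[8] = {"V": 2.0, "L": 1.0}

score = pssm_score("GLCTLVAML", toy_matrix)
```

With this toy matrix, only the leucines at positions 2 and 9 of the peptide contribute to the score.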

Current Status

Existing methods

Various machine learning approaches have been used for HLA–peptide binding prediction, including artificial neural network (ANN), decision tree, hidden Markov model (HMM), regression methods, support vector machine (SVM), and consensus methods, the last of which combine several of the others. Table 1 gives an overall summary of these tools including their descriptors, supported HLAs and peptides, and performance.

Table 1

An overview of major machine learning tools for predicting HLA–peptide binding sorted by category and method. The tools were divided into two categories, qualitative or quantitative, depending on the outputs. The underlying method, descriptors, performance, and URL were harvested from the original papers. The supported number and class of HLAs and corresponding length of peptides were harvested from either the original papers or their websites. Some tools utilize extra processes to deal with peptides of various lengths, which are listed in the “extra process” column.

CATEGORY	NAME	METHOD	DESCRIPTOR	PERFORMANCE	HLA (CLASS)	PEPTIDE LENGTH (HLA CLASS)	EXTRA PROCESS	URL
Qualitative	ANNPred37	ANN	Sparse encoding	Accuracy: 87.3%±5.9%	30(I)	9-mers(I)	N/A	http://www.imtech.res.in/raghava/nhlapred/neural.html
	MULTIPRED38,39	ANN/HMM/SVM	Sparse encoding	AUC >0.80	23(I), 6(II)	9-mers(I), 9-mer cores(II)	N/A	http://antigen.i2r.a-star.edu.sg/multipred/
	nHLAPred37	ANN/PSSM	Sparse encoding	Accuracy: 93.6%±2.92%	30(I)	9-mers(I)	N/A	http://www.imtech.res.in/raghava/nhlapred/comp.html
	Zhu et al.46	Decision Tree	N/A	Accuracy: ~0.8	16(I)	9-mers(I)	N/A	N/A
	S-HMM48	HMM	N/A	AUC: 0.85~0.89	1(II)	9~25-mers (II)	N/A	N/A
	ocHMM49	HMM	Physicochemical property grouping	Accuracy: 0.35~0.99	2(I)	Various(I)	N/A	N/A
	Salomon et al.61	Kernel	BLOSUM62$	AUC: 0.82~0.96	25(II)	9~33-mers (II)	N/A	N/A
	KISS59	SVM	Heckerman et al.81#	AUC: 0.86~0.90	35(I)	9-mers(I)	N/A	http://cbio.ensmp.fr/kiss/
	MHC2PRED54	SVM	Sparse encoding	Accuracy: ~80%	42(II)	9-mers or longer(II)	Matrix optimization techniques (MOTs)	http://www.imtech.res.in/raghava/mhc2pred/
	POPI58	SVM	Physicochemical properties	Accuracy: ~60%	23(I), 21(II)	9-mers(I), 9-mer cores(II)	N/A	http://iclab.life.nctu.edu.tw/POPI/
	SVMHC56,57	SVM/PSSM	Sparse encoding	MCC: 0.85	32(I), 51(II)	9-mers(I), 9-mer cores(II)	N/A	http://abi.inf.uni-tuebingen.de/Services/SVMHC
Quantitative	NetMHC/NetMHCII40,41	ANN	Sparse encoding/BLOSUM50	AUC: 0.914(I), 0.787(II)	78(I), 14(II)	8~11-mers(I), various(II)	NN-align	http://www.cbs.dtu.dk/services/
	NetMHCpan/NetMHCIIpan42,43	ANN	Sparse encoding/BLOSUM50#	Pearson: 0.77(I), AUC: 0.847(II)	150(I), 35(II)*	8~14-mers(I), 9~19-mers(II)	Similar to NN-align	http://www.cbs.dtu.dk/services/
	IEDB35,62,63	ANN/Consensus	N/A	AUC: 0.96(I), 0.76(II)	50(I), 54(II)	Various(I/II)	N/A	http://tools.immuneepitope.org/main/tcell/
	NetMHCcons64	Consensus	N/A	Better than single methods	101(I)*	8~15-mers(I)	N/A	http://www.cbs.dtu.dk/services/NetMHCcons/
	MHCMIR/MHC2MIR52,53	MIL/MIR	BLOSUM62$	AUC: 0.73~0.89	26(II)	9~25-mers(II)	N/A	http://datamining-iip.fudan.edu.cn/service/MHC2MIL/index.html
	MHCPRED50	QSAR regression	N/A	q2: 0.3~0.8	11(I), 3(II)	9-mers(I), 9-mer cores(II)	N/A	http://www.ddg-pharmfac.net/mhcpred/MHCPred/
	SVRMHC51	SVR	Sparse encoding/11 physicochemical properties	q2: −0.6~0.7	36(I), 6(II)	9-mer cores(I/II)	Iterative self-consistent (ISC)	http://svrmhc.biolead.org

Notes:

* NetMHCpan/NetMHCIIpan/NetMHCcons can predict any HLA allele with a known sequence, thus the HLA number is unlimited.

# The descriptors contain both the peptides and HLAs.

$ The BLOSUM62 matrix was used for distance calculation.

Abbreviation: N/A, not available/applicable.

Artificial neural network

Since their first application to HLA-A*02:01 in 1995,36 ANNs have been widely used to predict peptide binding for a large number of HLA alleles. To construct an ANN model, the peptide sequences are transformed to numeric descriptors that are then fed to several layers of artificial neurons. The value of each artificial neuron is derived from the previous layer via mathematical formulae, and a final prediction value is calculated. The parameters within the formulae are determined during the training process by back propagation. Multiple papers and servers have implemented the ANN method, including ANNPred/nHLAPred,37 IEDB,34 MULTIPRED,38,39 NetMHC/NetMHCII,40,41 and NetMHCpan/NetMHCIIpan.42,43 ANNs can be utilized to make both qualitative and quantitative predictions for both classes of HLAs. This method has achieved reliable performance. It is still under active development and improvement on quite a few servers including IEDB and NetMHCpan/NetMHCIIpan. However, ANNs require a fixed number of input neurons; therefore, peptides of various lengths must first be converted to a fixed-length sequence by extra processes.
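To make the layer-by-layer calculation concrete, the sketch below runs a forward pass through a network with one hidden layer. It assumes a sigmoid activation and a single output neuron, which are illustrative choices rather than the architecture of any specific tool; the weights in the usage example are arbitrary, whereas a real model would obtain them from training by back propagation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each neuron is the sigmoid of a weighted sum of the previous layer plus a bias."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input descriptors -> hidden layer -> single prediction value in (0, 1)."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, [out_w], [out_b])[0]

# Arbitrary (untrained) weights for a network with 3 inputs and 2 hidden neurons.
prediction = forward([1, 0, 1],
                     hidden_w=[[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
                     hidden_b=[0.0, 0.1],
                     out_w=[1.2, -0.7],
                     out_b=0.05)
```

The fixed length of `inputs` is what forces peptides of different lengths to be converted to a fixed-length representation before they reach the network.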

Decision tree

Decision trees are branching structures constructed from the training samples.44 The splitting rules are determined during the training process. When a new sample arrives, it passes through the flowchart-like structure and finally reaches a classification prediction. The method was first used to make predictions for HLA-A*02:01 in 1999.45 Zhu et al implemented a C4.5 decision tree classifier to identify peptide binding for 16 HLA-A alleles.46 Decision trees are easier to interpret than ANNs due to their rule-based nature. Though this method is widely used in the machine learning field, it has rarely been applied to predicting HLA–peptide binding.18,44

Hidden Markov model

As a widely used method for pattern recognition, HMMs have been utilized for HLA–peptide binding predictions. In an HMM, a peptide is converted to different states for different positions. At each position or state, the probabilities of amino acids are calculated, and a final value is given by combining all the probabilities. There are different ways to construct HMMs, including fully connected HMMs,47 HMMs optimized with successive state splitting algorithm,48 and profile HMMs49 that can merge overlapping patterns. HMMs have the advantage of processing peptides with various lengths; however, different HMMs have to be developed separately for binders and nonbinders. Due to the model separation, HMMs have been developed only for a very limited number of HLAs.

Regression

Regression models have been developed for quantitative predictions of HLA–peptide binding. MHCPRED implemented a QSAR regression to predict HLA–peptide binding affinity.50 In this method, individual amino acid contributions at each position are calculated using partial least squares. SVRMHC utilized support vector regression (SVR), a regression derivative of SVM, for quantitative predictions.51 Multiple instance learning (MIL) and its regression derivative, multiple instance regression (MIR), were also used to predict HLA–peptide binding in MHCMIR/MHC2MIR.52,53 Regression models have the advantage of making quantitative predictions, which not only identify whether a peptide is a binder or nonbinder toward an HLA but also tell how strong the binding is. However, making an accurate quantitative prediction can be more challenging than making a qualitative one.

Support vector machine

The SVM creates a hyperplane in the high-dimensional space of training data to classify them into different groups. SVM has been used by a few servers for HLA–peptide binding prediction including MHC2PRED,54 MULTIPRED,55 SVMHC,56,57 Prediction Of Peptide Immunogenicity (POPI),58 and Kernel-based Inter-allele peptide binding prediction SyStem (KISS).59 The quantitative derivative of SVM was implemented by SVRMHC as we mentioned before.51 SVM usually achieves a high accuracy.60 Similar to ANN, SVM requires a fixed dimension or length of input data, limiting its applicability. However, Salomon and Flower proposed a kernel method derived from SVM that can handle peptides with various lengths using similarity scores.61

Consensus method

A consensus method uses a combination of the predictions from its component models. Each of the component models first makes a prediction separately, and the final prediction is then made considering all the predictions from the component models. IEDB recommended a consensus approach on its server.35,62,63 In IEDB, the binding prediction is made from four PSSM-based models for Class I HLAs, while for Class II HLAs, the result comes from nine models including several PSSM-based models and machine learning models. Three machine learning models (QSAR regression-based MHCPRED,50 SVM-based MHC2PRED,54 and SVR-based SVRMHC51) were utilized by IEDB for HLA–peptide binding prediction. When predicting an HLA–peptide binding, not every model is able to return a value. However, if three or more models provide predictions, the three top-performing models are selected and the median value is used as the final consensus score. A similar consensus approach was implemented by NetMHCcons,64 which combines ANN-based NetMHC, NetMHCpan, and PSSM-based PickPocket. Instead of median values, NetMHCcons uses the average log-transformed values as the final prediction scores. The consensus method generally outperforms single models since it can preferentially select the top-performing methods from benchmark tests. However, the applicability of the consensus method is limited by its individual components since it requires outputs simultaneously from those models.
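The median-of-top-three rule described above can be sketched as follows. This is an illustration under stated assumptions, not the IEDB implementation: the model names are placeholders, the benchmark ranking is supplied by the caller, and returning `None` when fewer than three models respond is an assumption, since the text does not say what happens in that case.

```python
from statistics import median

def consensus_score(predictions, benchmark_rank, top_n=3):
    """Median of the top_n best-ranked models that returned a value.

    predictions: dict of model name -> score, or None if the model gave no value.
    benchmark_rank: model names ordered from best to worst benchmark performance.
    """
    available = [m for m in benchmark_rank if predictions.get(m) is not None]
    if len(available) < top_n:
        return None  # assumption: no consensus without enough component outputs
    return median(predictions[m] for m in available[:top_n])

# Hypothetical scores from four placeholder models; model_b returns nothing.
rank = ["model_a", "model_b", "model_c", "model_d"]
scores = {"model_a": 120.0, "model_b": None, "model_c": 450.0, "model_d": 80.0}
consensus = consensus_score(scores, rank)
```

In this example the three available models are model_a, model_c, and model_d, so the consensus is the median of their three scores.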

Descriptors

Overview

Most machine learning methods take only peptide sequences as input features; therefore, individual models have to be developed for each HLA. One exception is KISS,59 which implements a kernel function of both peptide descriptors and HLA similarities. Another exception is NetMHCpan/NetMHCIIpan,42,43 which uses the sequences of both the peptides and the HLAs to construct a single pan-specific model for an entire class of HLAs. The developers of NetMHCpan/NetMHCIIpan studied different HLA structures and identified a series of residues on HLAs that closely interact with peptides to form a pseudo-sequence. Both the peptide sequences and HLA pseudo-sequences are input into the machine learning model at the same time. Such models make it possible to make predictions for HLA alleles with little or no experimental data as long as their sequences are known.

Some machine learning models such as HMMs,48 kernel functions,61 and MIL/MIR models52,53 naturally process peptides of various lengths. Most models only deal with peptides of a fixed length; therefore, separate models have to be developed for peptides of different lengths. This is problematic for Class II HLA binders since only part of each peptide interacts with the Class II HLA. To solve this problem, extra processes were implemented to identify the interacting region of peptides. The SVM-based method MHC2PRED54 utilized matrix optimization techniques to identify the 9-mer binding cores from Class II HLA binders. SVR-based SVRMHC51 used an iterative self-consistent (ISC) procedure65 to find the 9-mer cores with the information of HLA anchor positions. ANN-based NetMHC/NetMHCII40,41 and NetMHCpan/NetMHCIIpan42,43 took advantage of alignment-based NN-align or similar processes41,66 to get both the 9-mer cores and the peptide flanking residues for Class II HLA binders. Both the 9-mer cores and the flanking residues were used as features to develop the models.

Apart from the HMMs, the machine learning models do not use sequences directly as features. Machine learning models usually accept either binary categories (such as 0 and 1) or continuous numeric inputs. Therefore, input sequences need to be transformed to descriptors. The commonly used descriptors are sparse encoding, blocks substitution matrix (BLOSUM), and physicochemical properties.

Sparse encoding

Sparse encoding is simple but widely used by servers such as ANNPred/nHLAPred,37 MHC2PRED,54 NetMHC/NetMHCII,40,41 NetMHCpan/NetMHCIIpan,42,43 SVMHC,56,57 and SVRMHC.51 The concept of sparse encoding is similar to dummy variables in the machine learning field. Since each position within a sequence can be any of 20 different amino acids, a position is represented by a 20-number vector of 19 zeros and a single one, such as 10000 …, 01000 …, and 00100 …, depending on the actual amino acid. Each type of amino acid at a specific position is represented by a single and unique variable. Thus, a 9-mer peptide is converted to 20 × 9 = 180 binary variables.
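A minimal sketch of sparse encoding follows. The ordering of the 20 amino acids within each vector (alphabetical by one-letter code here) is an assumption; any fixed ordering works as long as it is used consistently for training and prediction.

```python
# Assumption: alphabetical order of the 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sparse_encode(peptide):
    """Concatenate one 20-element vector (19 zeros, a single one) per residue."""
    vector = []
    for residue in peptide:
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[AMINO_ACIDS.index(residue)] = 1  # raises ValueError for unknown residues
        vector.extend(one_hot)
    return vector

encoded = sparse_encode("GLCTLVAML")  # 9 residues -> 20 x 9 = 180 binary variables
```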

Blocks substitution matrix

The BLOSUMs are widely used in protein sequence alignment. The scores within BLOSUMs represent pairwise evolutionary distances between amino acids. Some machine learning methods utilize BLOSUMs to calculate sequence distances. For example, Salomon and Flower used the BLOSUM62 matrix to calculate peptide similarity scores in their kernel method.61 They also compared 83 different matrices including physicochemical and structural distance matrices and found BLOSUM62 matrix was among the top three. MHCMIR/MHC2MIR52,53 implemented the BLOSUM62 matrix to compute subsequence similarities as well. For some other machine learning models, the BLOSUMs are just used as descriptors. In addition to sparse encoding, NetMHC/NetMHCII40,41 and NetMHCpan/NetMHCIIpan42,43 implemented the BLOSUM50 matrix as descriptors. However, we did not find any benchmark data that compare sparse encoding versus BLOSUM.

Physicochemical properties

Besides the substitution matrices, the physicochemical properties of amino acids, such as hydrogen bond number, polarity, and hydrophobicity were used as descriptors. Zhang et al used physicochemical properties to group amino acids in their HMM.49 POPI used 20 physicochemical properties to represent each amino acid according to AAindex database 9.0.58,67 SVRMHC utilized both sparse encoding and their 11-factor physicochemical property descriptors in their method.51,68 They compared both types of descriptors and found that the two types of descriptors showed different performance on different HLA alleles. No conclusion was drawn to indicate which one is absolutely better than the other.

Performance

Recently published machine learning models achieved an area under the receiver operating characteristic curve (AUC) of around 0.85–0.95 for Class I HLAs and 0.75–0.85 for Class II HLAs.69 For some HLA alleles such as HLA-A*02:04, the AUC reached 0.98.28 The existing machine learning models were developed using data sets harvested at different times. Some HLA–peptide binding data are qualitative and some are quantitative, and the data set sizes differ, making direct performance comparison between the models difficult. Therefore, some benchmark tests were conducted using unified data sets.35,70–73

For prediction of peptides binding Class I HLAs, Peters et al tested three methods using a data set of 48 Class I MHCs.70 From the perspective of AUC, their ANN model (AUC = 0.957) outperformed their PSSM-based models (AUC = 0.934–0.952). They also benchmarked 16 publicly available tools such as MHCPRED,50 MULTIPRED,38,39 NetMHC,40 and SVMHC.56,57 Five Class I HLAs were evaluated but only two of them had been predicted by all these methods. For HLA-A*02:01, the performance was NetMHC (ANN) > MULTIPRED (ANN) > MHCPRED (QSAR regression) = SVMHC (SVM) > MULTIPRED (HMM), while for HLA-A*24:02, the result was NetMHC (ANN) > MULTIPRED (ANN) > MULTIPRED (HMM) > MHCPRED (QSAR regression) > SVMHC (SVM). Likewise, Trost et al tested 16 tools using the data of Peters et al70 and other literature data and found similar results.71 In 2008, Lin et al evaluated 30 servers on the binding data of tumor antigens toward seven Class I HLAs.72 The classification result indicated a rank of NetMHC (ANN) > IEDB (ANN) > MHCPRED (QSAR regression), while NetMHC (ANN) and IEDB (ANN) showed the best performance in quantitative predictions.

For prediction of peptides binding Class II HLAs, Wang et al evaluated several methods using their experimental data set that contains 16 Class II MHCs.35 Their result indicated a performance order of Consensus method > SVRMHC (SVR) > MHC2PRED (SVM) > MHCPRED (QSAR regression). Lin et al evaluated 21 methods using 103 test peptides from four protein antigens and seven Class II HLAs.73 Their classification result indicated that NetMHCIIpan (ANN) was the best, followed by two PSSM-based methods and MULTIPRED (SVM). They also showed that MHCPRED (QSAR regression) and MULTIPRED (HMM) had an AUC >0.775 when predicting the binding of promiscuous peptides.

Some methods were developed several years ago and have rarely been updated. Because different models support different, limited sets of HLAs, it is hard to benchmark the existing models on a large number of HLAs.70 However, the current benchmarks indicate that the ANN-based and consensus models such as IEDB and NetMHC have a good overall performance.
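Since the benchmarks above are reported mostly as AUC values, a short sketch of how an AUC can be computed may help. It uses the pairwise formulation: the AUC equals the fraction of (binder, nonbinder) pairs in which the binder receives the higher score, with ties counted as one half. The scores and labels below are made up, and the sketch assumes that a higher score means stronger predicted binding.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of binders and nonbinders.

    labels: 1 for binder, 0 for nonbinder. Higher scores are assumed to indicate binding.
    """
    binders = [s for s, y in zip(scores, labels) if y == 1]
    nonbinders = [s for s, y in zip(scores, labels) if y == 0]
    if not binders or not nonbinders:
        raise ValueError("need at least one binder and one nonbinder")
    wins = sum(1.0 if b > n else 0.5 if b == n else 0.0
               for b in binders for n in nonbinders)
    return wins / (len(binders) * len(nonbinders))

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # every binder outranks every nonbinder
partial = auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])   # one of four pairs is misordered
```

An AUC of 0.5 corresponds to random ranking, which is the baseline against which the reported values of 0.75–0.95 should be read.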

Challenge

Though various methods and tools have been developed to predict HLA–peptide binding, challenges in this field remain for researchers to address.

Limited support for HLAs

Most machine learning tools develop models for individual HLAs separately. Therefore, peptide-binding models have been developed for a very limited number of HLAs. To train a reliable model, a minimum of 15–50 binding peptides for a specific HLA37,51,57 is a precondition. Though more binding data are becoming available, the experimental data are still unevenly distributed across HLAs. Incorporating HLA sequences as input features to machine learning models is one of the solutions. For example, NetMHCpan/NetMHCIIpan42,43 utilized both the pseudo-sequence from the HLAs and the peptide sequence as features. Such a method can theoretically predict binding between any HLAs and peptides given their sequences; however, the prediction accuracy for HLAs with little or no experimental data is lower.

Peptide length

While some methods such as HMMs can naturally accept peptides of different lengths, others can only accept input peptide sequences of a fixed length. Therefore, separate models have to be developed to fit different lengths of peptides. This practice is challenging when it is applied to peptides that bind to Class II HLAs since the peptides are very diverse in length and only partially interact with Class II HLAs. Extra processes such as NN-align41,66 were implemented to identify a 9-mer core-binding region of these peptides so that they can be processed by the general machine learning models. However, Nielsen et al pointed out that the prediction models for Class II HLAs generally have an AUC 0.10 less than those for Class I HLAs.69 The problem of peptide length variety may be one of the causes. Though the models can perform well during cross-validation, they may have problems when used on new data sets or applications. In a benchmark test of four real protein antigens by Lin et al,73 no predictor showed good performance in predicting promiscuous peptides. The researchers mentioned that future improvement of HLA–peptide binding prediction includes minimizing false positives.

Prospective

Network approach

With more data becoming available, it is possible to analyze and predict HLA–peptide binding from a network viewpoint. HLA–peptide binding data can be transformed into a network where the HLAs and peptides are represented as the nodes and the binding data (categories or affinities) as the edges. Network-based prediction algorithms, such as collaborative filtering algorithm,74 network-based inference,75,76 and neighbor-edges based and unbiased leverage algorithm (Nebula), the latter developed in our laboratory,77 have been proposed for predictions of HLA–peptide binding. The network approaches are able to process different HLA alleles (or even different HLA classes) and peptides of various lengths without extra processes. Our submitted manuscript showed that Nebula outperformed existing methods. Since Nebula is not necessarily a machine learning method and does not require a training process, predictions even on a large network of 120,000 HLA–peptide binding pairs can be made within a second. In addition, the HLA–peptide binding network can be analyzed and clustered into modules that may reveal specific binding properties and patterns to better understand HLA–peptide interactions.
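The network representation described above can be sketched as a bipartite graph stored in two dictionaries. The scoring function below is a deliberately simple shared-neighbor heuristic chosen for illustration; it is not collaborative filtering, network-based inference, or Nebula, whose details are not given here. The binding pairs and peptide labels are made up.

```python
from collections import defaultdict

def build_network(binding_pairs):
    """Store HLA-peptide edges in both directions for fast neighbor lookup."""
    hla_to_peptides = defaultdict(set)
    peptide_to_hlas = defaultdict(set)
    for hla, peptide in binding_pairs:
        hla_to_peptides[hla].add(peptide)
        peptide_to_hlas[peptide].add(hla)
    return hla_to_peptides, peptide_to_hlas

def shared_neighbor_score(hla, peptide, hla_to_peptides, peptide_to_hlas):
    """Count binders that `hla` shares with each other HLA known to bind `peptide`."""
    own = hla_to_peptides.get(hla, set())
    return sum(len(own & hla_to_peptides[other])
               for other in peptide_to_hlas.get(peptide, set())
               if other != hla)

# Hypothetical binding pairs; P1..P4 stand in for peptide sequences of any length.
pairs = [("HLA-A*02:01", "P1"), ("HLA-A*02:01", "P2"),
         ("HLA-A*02:06", "P1"), ("HLA-A*02:06", "P2"), ("HLA-A*02:06", "P3"),
         ("HLA-B*07:02", "P4")]
h2p, p2h = build_network(pairs)
score_similar = shared_neighbor_score("HLA-A*02:01", "P3", h2p, p2h)
score_distant = shared_neighbor_score("HLA-B*07:02", "P3", h2p, p2h)
```

Because the nodes are just labels, peptides of any length and HLAs of either class fit in the same structure without a separate core-finding step, which is the property the text attributes to network approaches.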

Combining multiple approaches

The existing consensus methods combining several PSSM-based and machine learning-based methods generally showed better performance than a single method.35,62–64 Such methods take advantage of the best performing models to make better predictions with fewer outliers; however, the consensus methods are restrained by the prediction abilities of their component methods, such as a limited number of supported HLAs. As more crystal structures of HLA–peptide complexes become available, studying such structures can aid the interpretation and prediction of HLA–peptide binding. Based on the existing crystal structures, the structures of HLAs with known sequences can be modeled via homology modeling with high identities.13 Various structure-based methods, including molecular docking and dynamics, can be utilized for HLA–peptide binding predictions.78–80 Some modeling processes, such as molecular dynamics, may take a large amount of calculation time. However, with the development of cloud computing technologies, parallel computing can accelerate this process hundredfold or more. These structure-based approaches can be used for HLA alleles or peptides with little or no experimental data, which may complement the data-dependent machine learning methods. Combining such methods is promising for future development in this field.

Conclusion

Understanding and predicting HLA–peptide binding is an essential step for studies of the immune system, T-cell epitopes, and adverse drug reactions. Methods that are popular in computer science are widely used for HLA–peptide binding predictions. Different types of descriptors have been used to transform the peptide and HLA sequences into numbers that the models can accept. Some extra processes were implemented to deal with peptides of various lengths. Among the machine learning-based methods, the ANNs and consensus methods were among the top-performing methods. However, the existing methods have different kinds of limitations, such as the limited number of supported HLAs, difficulty in dealing with peptides of various lengths, and high false-positive rates in experimental validations. We have proposed the network method and a combination approach that uses multiple types of methods to address some of these challenges.

The findings and conclusions in this article have not been formally disseminated by the US Food and Drug Administration (FDA) and should not be construed to represent the FDA determination or policy.

Journal: Clin Dev Immunol Date: 2013-10-08
