Literature DB >> 24565049

An effective and effecient peptide binding prediction approach for a broad set of HLA-DR molecules based on ordered weighted averaging of binding pocket profiles.

Wen-Jun Shen, Shaohong Zhang, Hau-San Wong.   

Abstract

BACKGROUND: The immune system must detect a wide variety of microbial pathogens, such as viruses, bacteria, fungi and parasitic worms, to protect the host against disease. Antigenic peptides displayed by MHC II (class II Major Histocompatibility Complex) molecules is a pivotal process to activate CD4+ TH cells (Helper T cells). The activated TH cells can differentiate into effector cells which assist various cells in activating against pathogen invasion. Each MHC locus encodes a great number of allele variants. Yet this limited number of MHC molecules are required to display enormous number of antigenic peptides. Since the peptide binding measurements of MHC molecules by biochemical experiments are expensive, only a few of the MHC molecules have suffecient measured peptides. To perform accurate binding prediction for those MHC alleles without suffecient measured peptides, a number of computational algorithms were proposed in the last decades.
RESULTS: Here, we propose a new MHC II binding prediction approach, OWA-PSSM, which is a significantly extended version of a well known method called TEPITOPE. The TEPITOPE method is able to perform prediction for only 50 MHC alleles, while OWA-PSSM is able to perform prediction for much more, up to 879 HLA-DR molecules. We evaluate the method on five benchmark datasets. The method is demonstrated to be the best one in identifying binding cores compared with several other popular state-of-the-art approaches. Meanwhile, the method performs comparably to the TEPITOPE and NetMHCIIpan2.0 approaches in identifying HLA-DR epitopes and ligands, and it performs significantly better than TEPITOPEpan in the identification of HLA-DR ligands and MultiRTA in identifying HLA-DR T cell epitopes.
CONCLUSIONS: The proposed approach OWA-PSSM is fast and robust in identifying ligands, epitopes and binding cores for up to 879 MHC II molecules.

Entities:  

Year:  2013        PMID: 24565049      PMCID: PMC3908610          DOI: 10.1186/1477-5956-11-S1-S15

Source DB:  PubMed          Journal:  Proteome Sci        ISSN: 1477-5956            Impact factor:   2.480


Introduction

The immune system must detect a wide variety of microbial pathogens, such as viruses, bacteria, fungi and parasitic worms, to protect the host against disease. Antigenic peptides displayed by MHC II (class II Major Histocompatibility Complex) molecules is a pivotal process to activate CD4+ TH cells (Helper T cells). The activated TH cells can differentiate into effector cells which assist various cells in activating against pathogen invasion [1]. MHC I and II are the two main classes of MHC. MHC I molecules exist in all nucleated cells. CD8+ T cytotoxic cells only recognize antigenic peptides which are displayed by MHC I from cytosol to the surface of cells and eliminate the infected cells. On the other hand, MHC II molecules are normally found only in antigen-presenting cells (APCs). TH cells only recognize those foreign peptides that are displayed by MHC II from endocytosed proteins to the surface of APCs and then produce a large number of cytokineses to activate various cells to defend invasion [2,3]. The structures of MHC I and II are slightly different on the binding grooves. MHC I molecules have conserved residues which bind to the terminal residues of antigenic peptides, so they form close grooves. On the other hand, these kinds of conserved residues do not exist in the MHC II molecules, which form open grooves. Hence MHC II can accommodate longer peptides than MHC I, which results in increased diffeculty in performing binding prediction for MHC II [4-6]. The HLA (Human Leukocyte Antigen, MHC in humans) II molecules are encoded by the DP, DQ and DR loci. Each MHC locus encodes a great number of allele variants. Yet this limited number of MHC molecules are required to display enormous number of antigenic peptides. Each specific MHC molecule can bind to a great number of different peptides, and certain peptides can bind to several MHC molecules. Since the peptide binding measurements of MHC molecules by biochemical experiments are expensive, only a few of the MHC molecules have suffecient measured peptides. In [7], it is mentioned that in order to accurately describe the binding motif of MHC II, at least 100 to 200 measured peptides are required. To perform accurate binding prediction for those MHC alleles without suffecient measured peptides, a number of computational algorithms (referred to as pan-specific methods) were proposed in the last decade [8,9]. The TEPITOPE [10] method is the pioneering and most popular pan-specific approach for MHC II binding prediction. Its basic idea is if two HLA-DR alleles have identical pseudo sequence (The pseudo sequence is composed of several amino acids.) in the same pocket, they will share the same quantitative profile (The pocket profile measures the binding strength between a given pocket with the twenty basic amino acids.). The MHCIIMulti [11] method enables prediction of more than 500 HLA-DR molecules by using multiple instance learning. The NetMHCIIpan [12] method first transforms each DRB allele into a 21 amino acids pseudo-sequence, and uses the SMM-align [7] method to identify the binding cores and peptide flanking residues, next trains the model using an artificial neural network learning algorithm. The MultiRTA [13] method, which can perform prediction for both HLA-DR and HLA-DP molecules, calculates the binding affenity of a peptide by thermodynamic averaging over the binding affenities of all registers, and introduces a regularization constraint to avoid overfitting. The NetMHCIIpan-2.0 [14] method is a synthesis of NN-align [15], NetMHCpan and NetMHCIIpan. MULTIPRED2 [16] can perform prediction for 1077 HLAI and HLA-II alleles and 26 HLA supertypes. It can be regarded as a combination of the MULTIPRED, PEPVAC, NetMHCpan and NetMHCIIpan methods. The TEPITOPEpan [9] method, which builds on the TEPITOPE and PickPocket [17] methods, enables prediction for more than 700 HLA-DR molecules. Here, we propose a new MHC II binding prediction method, which we call OWA-PSSM. A preliminary study of a special case of this framework was first conducted in [18]. This method is a significantly extended version of the TEPITOPE method. Through introducing the ordered weighted averaging (OWA) weights [19,20], we develop a novel weighting scheme for those pocket profiles generated by TEPITOPE. Specifically, the gamma probability density function (PDF) [21] is employed to generate the OWA weights. The gamma PDF is a generalization of the exponential density function, and have close relationship with a number of continuous distributions. In our experiments, we will evaluate the performance of OWA-PSSM through comparing with four other popular state-of-the-art or recently proposed pan-specific methods, TEPITOPE, MultiRTA, NetMHCIIpan2.0 and TEPITOPEpan.

Materials and methods

We retrieved all HLA-DRB (HLA-DR β chain) protein sequences from the FTP site (ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/) provided by the IMGT/HLA database. Five independent benchmark datasets are employed to evaluate the performance of OWA-PSSM through comparing with the TEPITOPE, MultiRTA, NetMHCIIpan2.0 and TEPITOPEpan methods.

Data sets

Generation of 879 DRB alleles

A set of 879 DRB alleles are prepared as follows. We retrieved all DRB protein sequences from the FTP site provided by the IMGT/HLA database. Each DRB allele has an offecial name assigned by the WHO Nomenclature Committee for Factors of the HLA System. The first four digits coming after the gene name are used to distinguish different alleles. Hence for those alleles without difference until the fifth digit, we kept only the first one of them by sorting their offecial names in ascending order. Meanwhile, for each allele, we only considered the residues whose IMGT assigned residue indices range from 9 to 86 as they cover all pocket residues employed in TEPITOPE. Those alleles with absent amino acids in this range are omitted. The final step is to exclude non-expressed alleles and those alleles whose amino acid at residue 86 is neither glycine nor valine.

SMM-align DRB binding dataset

The substitution matrix and the parameters of the gamma PDF are determined by using the same dataset [7] as the TEPITOPEpan method. Hence these two methods will be compared in a more compatible way.

MHCBench DRB1*0401 binding dataset

The MHCBench server [22] provides eight benchmark datasets to evaluate MHC binding prediction methods. It is a popular benchmark to evaluate the performance of new methods by comparing with previously developed algorithms.

NetMHCIIpan-2.0 HLA-DR ligands

A large dataset studied in [14] consisting of 1164 HLA-DR ligands and 28 DRB alleles is evaluated.

NetMHCIIpan-2.0 HLA-DR T cell epitopes

Another large dataset studied in [14] consisting of 42 DRB alleles and 1325 epitopes is adopted to perform further evaluation.

X-ray crystallographic structures of pMHC II complex

The last dataset contains 41 X-ray crystallographic structures of pMHC II complexes (see Table 1). These 41 X-ray structures were retrieved from the PDB database [23]. For these 41 X-ray structures, each one contains an HLA-DR/peptide binding complex. The binding cores were directly obtained from the IMGT/3Dstructure database [24]. To the best of our knowledge, this dataset is the largest and most complete that has ever been studied for the prediction of MHC II binding cores.
Table 1

X-ray crystallographic structures of pMHC II binding complexes.

PDB IDDRB AllelePeptide SequenceCore
4E41DRB1*0101GELIGILNAAKVPADIGILNAAKV
1A6ADRB1*0301PVSKMRMATPLLMQAMRMATPLLM
1AQDDRB1*0101VGSDWRFLRGYHQYAWRFLRGYHQ
1BX2DRB1*1501ENPVVHFFKNIVTPRVHFFKNIVT
1DLHDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1FV1DRB5*0101NPVVHFFKNIVTPRTPPPSQFKNIVTPRT
1FYTDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1H15DRB5*0101GGVYHFVKKHVHESYHFVKKHVH
1HQRDRB5*0101VHFFKNIVTPRTPFKNIVTPRT
1HXYDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1J8HDRB1*0401PKYVKQNTLKLATYVKQNTLKL
1JWMDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1JWSDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1JWUDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1KG0DRB1*0101PKYVKQNTLKLATYVKQNTLKL
1KLGDRB1*0101GELIGILNAAKVPADIGILNAAKV
1KLUDRB1*0101GELIGTLNAAKVPADIGTLNAAKV
1LO5DRB1*0101PKYVKQNTLKLATYVKQNTLKL
1PYWDRB1*0101XFVKQNAAALFVKQNAAAL
1R5IDRB1*0101PKYVKQNTLKLATYVKQNTLKL
1SJEDRB1*0101PEVIPMFSALSEGATPVIPMFSALS
1SJHDRB1*0101PEVIPMFSALSEGVIPMFSALS
1T5WDRB1*0101AAYSDQATPLLLSPRYSDQATPLL
1T5XDRB1*0101AAYSDQATPLLLSPRYSDQATPLL
1YMMDRB1*1501ENPVVHFFKNIVTPRGGSGGGGGVHFFKNIVT
1ZGLDRB5*0101VHFFKNIVTPRTPGGFKNIVTPRT
2FSEDRB1*0101AGFKGEQGPKGEPGFKGEQGPKG
2G9HDRB1*0101PKYVKQNTLKLATYVKQNTLKL
2IAMDRB1*0101GELIGILNAAKVPADIGILNAAKV
2IANDRB1*0101GELIGTLNAAKVPADIGTLNAAKV
2ICWDRB1*0101PKYVKQNTLKLATYVKQNTLKL
2IPKDRB1*0101XPKWVKQNTLKLATWVKQNTLKL
2OJEDRB1*0101PKYVKQNTLKLATYVKQNTLKL
2Q6WDRB3*0101AWRSDEALPLGSWRSDEALPL
2SEBDRB1*0401AYMRADAAAGGAMRADAAAGG
3C5JDRB3*0301QVIILNHPGQISAIILNHPGQI
3L6FDRB1*0101APPAYEKLSAEQSPPYEKLSAEQS
3PDODRB1*0101KPVSKMRMATPLLMQALPMMRMATPLLM
3PGDDRB1*0101KMRMATPLLMQALPMMRMATPLLM
3S4SDRB1*0101PKYVKQNTLKLATYVKQNTLKL
3S5LDRB1*0101PKYVKQNTLKLATYVKQNTLKL
X-ray crystallographic structures of pMHC II binding complexes.

Methods

The proposed OWA-PSSM approach is introduced in the following subsections. The OWA-PSSM method is designed based on the PSSM (Position Specific Scoring Matrix) which is a popular technique in the prediction of MHC binding [9,10,17,25-28]. In general, the lengths of MHC II binding cores are nine amino acids. Every position at the binding core is related to a specific pocket. The PSSM is employed to specify the binding strengths between twenty basic amino acids with these nine pockets, such that the binding specificities of HLA-DR molecules could be quantified. For MHC II molecules, there are five anchor sites (sites 1, 4, 6, 7 and 9) at the binding core. These five anchor sites govern the binding strength of peptides with MHC II molecules [3]. The OWA weights are employed to define profiles for anchor pockets 4, 6, 7 and 9. For the remaining pockets, including the anchor pocket 1 and four non-anchor pockets 2, 3, 5 and 8, we adopt the same strategy as TEPITOPE to specify their quantitative profiles.

Generation of profiles for pockets 4 6 7 9

Here, the pocket pseudo-sequences and the associated profiles generated by TEPITOPE are referred to as raw pocket pseudo-sequences and raw pocket profiles, respectively. These raw pseudo-sequences are composed of several amino acids whose associated residue indices are given in Table 2. Eleven representative HLA-DR alleles are adopted by TEPITOPE to specify the different profiles for anchor pockets 4, 6, 7 and 9. These eleven alleles are DRB1*0101, DRB1*0301, DRB1*0401, DRB1*0402, DRB1*0404, DRB1*0701, DRB1*0801, DRB1*1101, DRB1*1302, DRB1*1501 and DRB5*0101. If two alleles have identical pseudosequences in the same pocket, they will have identical profiles. For a given pocket (pocket 4, 6, 7 or 9), we collect all the different raw pocket pseudo-sequences into a set , , |r= n, where i = 1, 2, ..., m, x ∈ {4, 6, 7, 9}, m is the number of unique pseudo-sequences, and n is the number of amino acids contained in a pseudo-sequence. Meanwhile, we collect all the different raw profiles into a set , and |p= 20, i = 1, 2, ..., m. There is a one-to-one correspondence between pand r. By using a substitution matrix, every raw pseudo-sequence is represented as a 20n-dimensional real vector, and is referred to as raw encoded pseudo-sequence collected into a set , .
Table 2

DRB Pocket Residue Indices.

PocketPocket residue indices
P186
P2-
P3-
P413, 70, 71, 74, 78
P5-
P611
P728, 30, 47, 61, 67, 71
P8-
P99, 37, 57, 60, 61
DRB Pocket Residue Indices. Next the radial basis function (RBF) is employed to measure the similarity between the encoded pseudo-sequences of a predicted allele a and a raw encoded pseudo-sequence. where vis the encoded pseudo-sequence for a predicted allele a. The pseudo-sequence for allele a is generated by using the pocket residue indices in Table 2. Obviously, 0 < K(v) ≤ 1 and K(v) = 1 if and only if v= v. After these similarity values are sorted in descending order, a specific ordered position would be associated with an OWA weight. A new pocket profile is generated as a weighted average over m raw pocket profiles in . A schematic illustration of the generation of a new profile is shown in Figure 1[18].
Figure 1

Schematic illustration of profile generation for pocket 4/6/7/9.

Schematic illustration of profile generation for pocket 4/6/7/9. Next we use the gamma distribution to generate OWA weights. The PDF of a gamma distribution is defined by: for x >0 and k, θ >0. Γ() denotes the gamma function. The gamma distribution is specified by its shape and scale parameters: k and θ. When k ≤ 1, the density function is decreasing while k >1, it is unimodal and the mode occurs at (k - 1)θ. If k = 1 and θ = µ, then the gamma distribution becomes the exponential distribution X ~ exp(1/µ). The OWA weight distribution is generated by discretizing the gamma PDF as follows: where m is the dimension of the OWA weights, k and θ are the shape and scale parameters respectively. After normalizing, the OWA weights are defined by: Let w= P (X = i), and these weights satisfy the following constraints: Given a predicted DRB allele a, let K= (k, k, ..., k), where k= K(v), , and the associated raw pocket profiles are . The elements of Kare sorted in descending order, and the re-ordered vector of Kis denoted as . The corresponding OWA weighting vector is denoted as W , W = (w1, w2, ..., w). We denote the pocket profiles associated with the re-ordered vector as , . Hence we define the pocket profile for allele a by: where x ∈ {4, 6, 7, 9}. In particular, Essentially, through measuring the similarity between the encoded pseudo-sequence of a with m raw encoded pseudo-sequences in the same pocket x, the smaller the kthe higher weight assigned to p, . The new profile weighting approach developed in this work is inspired by the OWA operator [19]. We generate profiles of pockets 4, 6, 7 and 9 for any allele in the set by using the OWA weights derived from the gamma PDF.

Generation of profiles for pockets 1 2 3 5 8

For the remaining pockets, including the anchor pocket 1 and four non-anchor pockets 2, 3, 5 and 8, we adopt the same strategy as TEPITOPE to specify quantitative profiles. Accurately quantifying pocket 1 is essential for the identification of binding cores, and this pocket is mainly characterized by the 86th residue of DRB protein sequences. For all the alleles in the set , the amino acid at the 86th residue is either glycine or valine. For the alleles with glycine at the 86th residue, the profile of pocket 1 is assigned to be -1 for aliphatic amino acids (Ile, Leu, Met, Val) and 0 for aromatic amino acids (Phe, Trp, Tyr). However, if this residue is valine, the profile is set to 0 for aliphatic and -1 for aromatic. Other amino acids are set to -999 [29]. This reflects that the binding cores of MHC II prefer aliphatic and aromatic amino acids at position 1. For the non-anchor pockets 2, 3, 5 and 8, their contributions to the pMHC binding is minimal. Hence we assign identical profiles to all alleles for pockets 2 and 3 and zero vectors for pockets 5 and 8 [29], respectively. Given an allele , the pocket profiles are denoted as , x = 1, 2, 3, 5, 8.

Position specific scoring matrices

For a given allele , the quantitative profiles for nine pockets are defined by Then the PSSM is defined by assembling nine pocket profiles together as The PSSM is a 20 by 9 matrix whose columns correspond to nine pockets and rows correspond to twenty basic amino acids.

Prediction measure and statistical tests

The AUC (Area Under the receiver operating characteristic Curve) is employed to measure the prediction performance, which is 1 for perfect prediction and 0.5 for random prediction. A paired t test is used for statistical comparison, and the AUC score comparison result is considered to be statistically significant if p is less than 0.05.

Results

The substitution matrix and the parameters k and θ of the gamma PDF are determined by using the dataset described in [7], which contains 14 HLA-DR alleles. The MHCBench, NetMHCIIpan-2.0 HLA-DR ligand and T cell epitope datasets were then used to extensively evaluate the performance of OWA-PSSM through comparing with TEPITOPE, MultiRTA, NetMHCIIpan2.0 and TEPITOPEpan. Furthermore, 41 X-ray crystallographic structures of pMHC II complexes were employed to evaluate the prediction quality of OWA-PSSM on identifying binding cores.

Determination of the substitution matrix and the gamma distribution parameters

A substitution matrix is used to encode an amino acid into a 20-dimensional real-valued vector. We could then apply the Gaussian kernel to compute similarities between encoded pseudo-sequences. The BLOSUM50 and BLOSUM62 [30] matrices are two important substitution matrices for MHC/peptide binding prediction [9,12,17]. Here, 59 symmetric substitution matrices are tested. The gamma probability density function is discretized to generate OWA weights, with those values controlled by the shape and scale parameters. Experiments were performed to explore the effect on MHC II prediction through varying the substitution matrix and gamma distribution parameters k and θ. We used the SMM-align dataset consisting of 14 HLADR alleles and 4603 peptides to determine the parameters. We tested 59 symmetric substitution matrices obtained from the AAindex database [31] whose data are collected from published literature. Optimal shape and scale parameters were chosen from set {0.1, 0.2, ..., 0.9} ∪ {1, 2, ..., 30}. We find that the chemical similarity substitution matrix [32] with shape parameter k = 0.2, θ = 0.6 performs best. The experiment results show that the shape parameter k ≤ 1 performs better than k > 1. The scale parameter measures the steepness of the OWA distribution, and the OWA distribution becomes steeper when θ is smaller. And thus in the following the chemical similarity matrix is used to encode the pocket pseudo-sequences, and the OWA weights applied to define profiles for 879 DRB alleles in the set are determined by the discretization of the gamma distribution with shape parameter 0.2 and scale parameter 0.6. The prediction performance for the SMM-align dataset is shown in Table 3. The performance of OWAPSSM is better than TEPITOPE and TEPITOPEpan. The average AUC scores over these 14 DRB alleles are 0.747 and 0.732 for the OWA-PSSM and TEPITOPEpan, respectively. The performance of OWA-PSSM, TEPITOPE and TEPITOPEpan are comparable for the alleles predictable by TEPITOPE. However, for those alleles not predictable by TEPITOPE, the average AUC scores are 0.776 and 0.711 for OWA-PSSM and TEPITOPEpan, respectively. OWA-PSSM outperforms TEPITOPEpan in all alleles not predictable by TEPITOPE.
Table 3

The performance of OWA-PSSM, TEPITOPE and TEPITOPEpan on the SMM-align dataset in terms of AUC.

Allele# peptidesTEPITOPETEPITOPEpanOWA-PSSM
DRB1*010112030.6470.6480.645
DRB1*03014740.7330.7390.731
DRB1*04014570.7540.7700.756
DRB1*04041680.8290.8320.830
DRB1*04051710.7890.7850.789
DRB1*07013100.7680.7680.771
DRB1*08021740.7690.7740.784
DRB1*09011170.6860.731
DRB1*11013590.7090.7000.715
DRB1*13021790.7210.7280.727
DRB1*15013650.7250.7270.731
DRB3*01011020.7240.869
DRB4*01011810.7220.729
DRB5*01013430.6540.6520.646

Average0.7320.747
Average I0.7360.7380.739
Average II0.7110.776

"Average" is the average over 14 alleles. "Average I" is the average over 11 alleles predictable by TEPITOPE. "Average II" is the average over 3 alleles not predictable by TEPITOPE. We obtain the PSSMs of TEPITOPE and TEPITOPEpan from their public servers ProPred (http://www.imtech.res.in/raghava/propred/page4.html) and TEPITOPEpan (http://www.biokdd.fudan.edu.cn/Service/TEPITOPEpan/TEPITOPEpan.html), respectively.

The performance of OWA-PSSM, TEPITOPE and TEPITOPEpan on the SMM-align dataset in terms of AUC. "Average" is the average over 14 alleles. "Average I" is the average over 11 alleles predictable by TEPITOPE. "Average II" is the average over 3 alleles not predictable by TEPITOPE. We obtain the PSSMs of TEPITOPE and TEPITOPEpan from their public servers ProPred (http://www.imtech.res.in/raghava/propred/page4.html) and TEPITOPEpan (http://www.biokdd.fudan.edu.cn/Service/TEPITOPEpan/TEPITOPEpan.html), respectively.

Performance on the MHCBench dataset

In order to evaluate the performance of OWA-PSSM compared with TEPITOPE, MultiRTA, NetMHCI-Ipan2.0 and TEPITOPEpan, eight datasets consisting of binders and non-binders for DRB1*0401 were retrieved from the MHCBench database. As shown in Table 4 the performance of OWA-PSSM is similar to those of TEPITOPE and NetMHCIIpan2.0. It can also be observed that OWA-PSSM performs best in all eight datasets, and significantly outperforms TEPITOPEpan (p <0.01, paired t-test) and MultiRTA (p <0.01, paired t-test).
Table 4

The performance on the MHCBench dataset in terms of AUC.

Dataset# peptidesOWA-PSSMTEPITOPETEPITOPEpanMultiRTANetMHCIIpan-2.0
Set 110170.7680.7660.7640.7130.765
Set 26730.7350.7340.7270.6850.739
Set 3a5900.7390.7360.7340.7010.710
Set 3b4950.7560.7530.7480.7150.753
Set 4a6460.7530.7500.7480.6990.759
Set 4b5840.7460.7450.7380.7060.751
Set 5a1170.6550.6680.6400.5990.649
Set 5b850.6640.6890.6460.5970.640

Average0.7270.7300.7180.6770.721

The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were computed by its stand-alone software package.

The performance on the MHCBench dataset in terms of AUC. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were computed by its stand-alone software package.

Performance on the HLA-DR ligand dataset

Here, we evaluate the OWA-PSSM method on a large-scale MHC II ligand dataset. This dataset covers 1164 HLA-DR ligands restricted to 28 HLA-DR alleles [5,14]. We applied a similar approach as NetMHCIIpan-2.0, in which each ligand source protein was divided into overlapping k-mers with lengths equal to the annotated ligand, and all k-mers without the annotated ligand were labeled as negatives. The results are shown in Table 5. The prediction performance of OWA-PSSM is significantly better than that of TEPITOPEpan (p <0.05, paired t-test). Specifically, the OWA-PSSM method outperforms TEPITOPEpan in 16 out of 28 alleles. Comparing the prediction accuracy of OWA-PSSM with those of MultiRTA and NetMHCIIpan2.0 on this ligand dataset, we find that their AUC score distributions are not significantly different (p >0.05, paired t-test). For the 17 alleles predictable by TEPITOPE, TEPITOPE performs best for 11 alleles and OWA-PSSM performs best for 6 alleles. However, these differences are not statistically significant (p >0.05, paired t-test).
Table 5

Prediction performance on the HLA-DR ligand dataset.

Allele# ligandsOWA-PSSMTEPITOPETEPITOPEpanMultiRTANetMHCIIpan-2.0
DRB1*0101530.8270.8330.8340.8330.835
DRB1*010250.8890.8950.8920.9350.927
DRB1*0301880.6670.6730.6710.6520.789
DRB1*04014680.8310.8330.8260.7710.875
DRB1*0402360.8820.8850.8800.7680.667
DRB1*040310.9540.9541.0000.845
DRB1*0404420.7790.7750.7970.7110.765
DRB1*0405360.8040.8090.7780.7290.856
DRB1*0701470.6980.6970.6960.7200.744
DRB1*0801390.6940.6970.6560.5410.643
DRB1*080210.9300.9160.9230.5320.978
DRB1*080310.2290.1490.3830.292
DRB1*090160.7500.6590.8420.957
DRB1*10011830.7830.7700.8270.866
DRB1*1101350.8280.8350.8310.8380.896
DRB1*110480.8680.8700.8560.8110.911
DRB1*1201110.8010.8280.8470.863
DRB1*1301160.8190.8240.8130.7450.724
DRB1*1302190.7430.7420.7350.7200.561
DRB1*140190.7120.7300.7040.810
DRB1*1501220.7200.7180.7170.6630.671
DRB1*150230.7730.7670.7740.7060.665
DRB1*160120.6210.6300.9180.849
DRB3*010120.8880.9180.9530.971
DRB3*030150.9070.7830.9390.948
DRB4*010160.5000.4910.5150.726
DRB4*010320.9410.7530.7450.827
DRB5*0101180.8230.8420.8350.7770.847

Average11640.7740.7560.7540.797
Average I0.7990.8010.7950.7320.786
Average II0.7350.6970.7890.814

"Average" is the average over 28 alleles. "Average I" is the average over 17 alleles predictable by TEPITOPE.

"Average II" is the average over 11 alleles not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were obtained directly from its publication.

Prediction performance on the HLA-DR ligand dataset. "Average" is the average over 28 alleles. "Average I" is the average over 17 alleles predictable by TEPITOPE. "Average II" is the average over 11 alleles not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were obtained directly from its publication.

Performance on the HLA-DR T cell epitope dataset

A T cell epitope is an antigen fragment that is recognized by T cells. Identifying T cell epitopes is vital for vaccine design. A large-scale T-cell epitopes dataset used in [14] is tested here. The prediction performance is showed in Table 6. The OWA-PSSM method performs significantly better than MultiRTA (p <0.01, paired t-test) in identifying T cell epitopes. Comparing the prediction accuracy of OWA-PSSM with those of TEPITOPEpan and NetMHCIIpan2.0 on this T cell epitope dataset, we find that their AUC score distributions are not significantly different (p >0.05, paired t-test). TEPITOPE is one of the best approaches in identifying T cell epitopes. Compared with this approach, OWA-PSSM performs best for 12 of the 20 alleles predictable by TEPITOPE, while TEPITOPE performs best on 8 alleles (the difference is not significant, p >0.05, paired t-test).
Table 6

Prediction performance on the HLA-DR T-cell epitope dataset.

Allele# epitopesOWA-PSSMTEPITOPETEPITOPEpanMultiRTANetMHCIIpan-2.0
DRB1*01011250.8070.7950.8080.7860.810
DRB1*010240.7900.7610.7920.8220.879
DRB1*010350.8370.7190.5280.667
DRB1*03011730.6320.6370.6400.6550.683
DRB1*04013420.7470.7410.7430.7070.775
DRB1*0402330.5750.5740.5710.5210.570
DRB1*0403140.9040.9050.8480.896
DRB1*0404460.7320.7320.7380.7150.744
DRB1*0405210.7450.7460.7220.5790.626
DRB1*040660.7660.7530.8690.741
DRB1*040740.8080.8240.7490.668
DRB1*040820.9300.9300.9300.9990.986
DRB1*0701560.7370.7200.7360.7360.742
DRB1*070310.9050.9110.9150.7070.896
DRB1*080140.5860.5540.6400.7160.663
DRB1*080220.8480.8660.8500.6850.754
DRB1*080320.5480.5160.7070.852
DRB1*0901130.7290.6970.6360.738
DRB1*100140.8700.8350.7890.875
DRB1*1101880.7520.7450.7510.7030.815
DRB1*110210.8430.8280.8220.5030.493
DRB1*110330.3330.3280.4800.510
DRB1*110460.7930.8100.8050.6660.807
DRB1*120130.8760.8870.8620.970
DRB1*1301150.7670.7830.7560.6420.632
DRB1*1302100.8320.8090.8130.7810.860
DRB1*130330.4820.5620.5150.604
DRB1*1401160.7180.7810.6970.789
DRB1*140410.9300.9490.9380.956
DRB1*140520.8610.8070.8480.839
DRB1*15011930.6880.6810.6900.6650.722
DRB1*1502200.6110.6050.6080.5700.681
DRB1*150320.8020.8290.5310.874
DRB1*160150.6840.6990.7210.724
DRB1*160230.8850.9120.8860.984
DRB3*0101120.8750.8330.8830.895
DRB3*0202100.5880.6130.4660.539
DRB3*030110.9880.8850.9060.966
DRB4*0101170.6630.5600.5830.789
DRB4*010310.9900.9900.9920.991
DRB5*0101550.7460.7380.7470.7520.802
DRB5*010210.9090.8700.7520.987

Average0.7640.7480.7580.7170.781
Average I0.7530.7480.7540.6960.747
Average II0.7750.7610.7360.812

"Average" is the average over 42 alleles. "Average I" is the average over 20 alleles predictable by TEPITOPE.

"Average II" is the average over 22 alleles not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were obtained directly from its publication.

Prediction performance on the HLA-DR T-cell epitope dataset. "Average" is the average over 42 alleles. "Average I" is the average over 20 alleles predictable by TEPITOPE. "Average II" is the average over 22 alleles not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA were obtained from their respective web servers. The prediction of NetMHCIIpan-2.0 were obtained directly from its publication.

Identification of the peptide binding cores

The lengths of the peptides which bind to MHC II range from 9 to 25 amino acids. However, only a segment of the peptide plays a significant role in binding to a MHC II molecule which is referred to as a binding core. Identifying the binding core correctly is very important for the MHC II/peptide binding study. A dataset containing 41 X-ray crystallographic structures is employed to evaluate the prediction of OWA-PSSM in identifying binding cores. As showed in Table 7, OWA-PSSM is the only method that correctly identifies the binding cores for all complexes. TEPITOPEpan, MultiRTA and NetMHCIIpan2.0 misidentify 2, 5 and 9 complexes, respectively. For the 39 MHC II/peptide complexes whose MHC II molecules are predictable by TEPITOPE, both TEPITOPE and OWA-PSSM can correctly predict the binding cores. However, OWAPSSM is able to identify the binding cores of all 41 complexes correctly regardless of whether the MHC II alleles are predictable by TEPITOPE or not.
Table 7

Comparison of OWA-PSSM with four pan-specific methods in identifying MHC II-peptide binding cores.

PDB IDOWA-PSSMTEPITOPETEPITOPEpanMultiRTANetMHCIIpan-2.0
4E41IGILNAAKVIGILNAAKVIGILNAAKVIGILNAAKVLIGILNAAK
1A6AMRMATPLLMMRMATPLLMMRMATPLLMMRMATPLLMMRMATPLLM
1AQDWRFLRGYHQWRFLRGYHQWRFLRGYHQWRFLRGYHQWRFLRGYHQ
1BX2VHFFKNIVTVHFFKNIVTVHFFKNIVTVHFFKNIVTVVHFFKNIV
1DLHYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1FV1FKNIVTPRTFKNIVTPRTFKNIVTPRTVHFFKNIVTFFKNIVTPR
1FYTYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1H15YHFVKKHVHYHFVKKHVHYHFVKKHVHYHFVKKHVHYHFVKKHVH
1HQRFKNIVTPRTFKNIVTPRTFKNIVTPRTVHFFKNIVTFFKNIVTPR
1HXYYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1J8HYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1JWMYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1JWSYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1JWUYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1KG0YVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1KLGIGILNAAKVIGILNAAKVIGILNAAKVIGILNAAKVLIGILNAAK
1KLUIGTLNAAKVIGTLNAAKVIGTLNAAKVIGTLNAAKVIGTLNAAKV
1LO5YVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1PYWFVKQNAAALFVKQNAAALFVKQNAAALFVKQNAAALFVKQNAAAL
1R5IYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
1SJEVIPMFSALSVIPMFSALSVIPMFSALSVIPMFSALSVIPMFSALS
1SJHVIPMFSALSVIPMFSALSVIPMFSALSVIPMFSALSVIPMFSALS
1T5WYSDQATPLLYSDQATPLLYSDQATPLLSDQATPLLLYSDQATPLL
1T5XYSDQATPLLYSDQATPLLYSDQATPLLSDQATPLLLYSDQATPLL
1YMMVHFFKNIVTVHFFKNIVTVHFFKNIVTVHFFKNIVTVHFFKNIVT
1ZGLFKNIVTPRTFKNIVTPRTFKNIVTPRTVHFFKNIVTFFKNIVTPR
2FSEFKGEQGPKGFKGEQGPKGFKGEQGPKGFKGEQGPKGFKGEQGPKG
2G9HYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
2IAMIGILNAAKVIGILNAAKVIGILNAAKVIGILNAAKVLIGILNAAK
2IANIGTLNAAKVIGTLNAAKVIGTLNAAKVIGTLNAAKVIGTLNAAKV
2ICWYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
2IPKWVKQNTLKLWVKQNTLKLWVKQNTLKLWVKQNTLKLWVKQNTLKL
2OJEYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
2Q6WWRSDEALPL-WRSDEALPLWRSDEALPLWRSDEALPL
2SEBMRADAAAGGMRADAAAGGYMRADAAAGMRADAAAGGYMRADAAAG
3C5JIILNHPGQI-VIILNHPGQIILNHPGQIIILNHPGQI
3L6FYEKLSAEQSYEKLSAEQSYEKLSAEQSYEKLSAEQSYEKLSAEQS
3PDOMRMATPLLMMRMATPLLMMRMATPLLMMRMATPLLMMRMATPLLM
3PGDMRMATPLLMMRMATPLLMMRMATPLLMMRMATPLLMKMRMATPLL
3S4SYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL
3S5LYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKLYVKQNTLKL

Correct41/4139/4136/4132/41
Correct I39/3939/3938/3934/3930/39
Correct II2/21/22/22/2

Incorrectly predicted binding cores are highlighted in bold. "Correct" gives the number of correctly identified cores over 41 X-ray structures. "Correct I" gives the number of correctly identified cores over 39 X-ray structures whose DRB alleles are predictable by TEPITOPE. "Correct II" gives the number of correctly identified cores over 2 X-ray structures whose DRB alleles are not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA, NetMHCIIpan-2.0 were obtained from their respective web servers.

Comparison of OWA-PSSM with four pan-specific methods in identifying MHC II-peptide binding cores. Incorrectly predicted binding cores are highlighted in bold. "Correct" gives the number of correctly identified cores over 41 X-ray structures. "Correct I" gives the number of correctly identified cores over 39 X-ray structures whose DRB alleles are predictable by TEPITOPE. "Correct II" gives the number of correctly identified cores over 2 X-ray structures whose DRB alleles are not predictable by TEPITOPE. The PSSMs of TEPITOPE, TEPITOPEpan and the results of MultiRTA, NetMHCIIpan-2.0 were obtained from their respective web servers.

Discussion and conclusion

The PSSM based approaches have been demonstrated to be a powerful technique for MHC/peptide binding prediction [26,33,34]. A PSSM is a motif matrix which can succinctly describe a MHC/peptide binding motif. The TEPITOPE method is the best known PSSM based pan-specific approach for MHC II binding prediction. It determines 35 distinct pocket profiles in vitro based on 11 HLA-DR alleles. Although TEPITOPE can solely perform prediction for 50 HLA-DR alleles, it has been shown to be one of the best performing approaches in HLA-DR ligand/epitope prediction. Furthermore, it is the best method in identifying HLA-DR binging cores. The TEPITOPEpan method is also a PSSM based method whose PSSMs are generated based on those 35 pocket profiles determined by TEPITOPE. TEPITOPEpan uses the same approach as PickPocket to compute the similarity score between two pocket pseudo-sequences and the weights over pocket profiles by using BLOSUM62. Based on this approach, the similarity score is likely to be negative, and while setting the negative score to zero, it is probable that the similarity scores among all pocket profiles are zero, and the weights cannot be computed. On the other hand, OWA-PSSM does not encounter this problem since its similarity scores and weights associated with pocket profiles are positive. The performance of OWA-PSSM and TEPITOPE is similar for the alleles predictable by TEPITOPE. While OWA-PSSM can make prediction for a much higher number of HLA-DR molecules then TEPITOPE. In addition, the method is extensively evaluated on five benchmark datasets, and is shown to be the best approach in identifying binding cores compared with four state-of-the-art or recently proposed pan-specific MHC II prediction approaches, TEPITOPE, MultiRTA, NetMHCIIpan2.0 and TEPITOPEpan. Additionally, the method performs comparably to TEPITOPE and NetMHCIIpan2.0 in identifying HLA-DR epitopes and ligands, and it significantly outperforms TEPITOPEpan in identifying HLA-DR ligands and MultiRTA in the identification of HLA-DR T cell epitopes. Here, we develop a new MHC II binding prediction approach, which we call OWA-PSSM. This method is a significantly extended version of the TEPITOPE method. In particular, we preserve the advantage of TEPITOPE. Positions 1, 4, 6, 7 and 9 in a binding core of a MHC II molecule are estimated to be anchor positions which determine the peptide binding affenity. Identifying the P1 position of the binding core is an essential step for MHC II ligand/epitope prediction, and the TEPITOPE method clearly reveals the MHC II molecules' preferred amino acids at the P1 position. As a result, OWA-PSSM is also successful in predicting MHC II/ligands, epitopes and binding cores by using a similar approach. For pocket 4, 6, 7 and 9, the MHC II molecules are highly polymorphic and the pseudo sequences in the same pocket are highly diverse, hence the TEPITOPE method is unable to predict those DRB alleles with pseudo sequences different from its original 35 pseudo sequences. In this work, through introducing a new weighting scheme that is inspired by the OWA operator, we can make prediction for up to 879 MHC II molecules. In addition, this method is fast and robust in identifying HLA-DR ligands, epitopes and binding cores.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WJS formulated the algorithm and performed the experiments. SZ designed the experimental protocols. HSW initiated the project and designed the overall algorithmic framework. All authors read and approved the manuscript.
  26 in total

1.  The Protein Data Bank.

Authors:  H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  AAindex: amino acid index database.

Authors:  S Kawashima; M Kanehisa
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

Review 3.  SYFPEITHI: database for MHC ligands and peptide motifs.

Authors:  H Rammensee; J Bachmann; N P Emmerich; O A Bachor; S Stevanović
Journal:  Immunogenetics       Date:  1999-11       Impact factor: 2.846

4.  ProPred: prediction of HLA-DR binding sites.

Authors:  H Singh; G P Raghava
Journal:  Bioinformatics       Date:  2001-12       Impact factor: 6.937

5.  EPIMHC: a curated database of MHC-binding peptides for customized computational vaccinology.

Authors:  Pedro A Reche; Hong Zhang; John-Paul Glutting; Ellis L Reinherz
Journal:  Bioinformatics       Date:  2005-01-18       Impact factor: 6.937

6.  Prediction of peptide-MHC binding using profiles.

Authors:  Pedro A Reche; Ellis L Reinherz
Journal:  Methods Mol Biol       Date:  2007

7.  Structural requirements for the interaction between peptide antigens and I-Ed molecules.

Authors:  A Sette; L Adorini; E Appella; S M Colón; C Miles; S Tanaka; C Ehrhardt; G Doria; Z A Nagy; S Buus
Journal:  J Immunol       Date:  1989-11-15       Impact factor: 5.422

8.  Repeating sequences and gene duplication in proteins.

Authors:  A D McLachlan
Journal:  J Mol Biol       Date:  1972-03-14       Impact factor: 5.469

9.  Precise prediction of major histocompatibility complex class II-peptide interaction based on peptide side chain scanning.

Authors:  J Hammer; E Bono; F Gallazzi; C Belunis; Z Nagy; F Sinigaglia
Journal:  J Exp Med       Date:  1994-12-01       Impact factor: 14.307

10.  Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method.

Authors:  Morten Nielsen; Claus Lundegaard; Ole Lund
Journal:  BMC Bioinformatics       Date:  2007-07-04       Impact factor: 3.169

View more
  3 in total

Review 1.  Strategies for the identification of T cell-recognized tumor antigens in hematological malignancies for improved graft-versus-tumor responses after allogeneic blood and marrow transplantation.

Authors:  Jenny Zilberberg; Rena Feinman; Robert Korngold
Journal:  Biol Blood Marrow Transplant       Date:  2014-11-20       Impact factor: 5.742

Review 2.  Best practices for bioinformatic characterization of neoantigens for clinical utility.

Authors:  Megan M Richters; Huiming Xia; Katie M Campbell; William E Gillanders; Obi L Griffith; Malachi Griffith
Journal:  Genome Med       Date:  2019-08-28       Impact factor: 11.117

3.  A Novel Peptide Binding Prediction Approach for HLA-DR Molecule Based on Sequence and Structural Information.

Authors:  Zhao Li; Yilei Zhao; Gaofeng Pan; Jijun Tang; Fei Guo
Journal:  Biomed Res Int       Date:  2016-05-31       Impact factor: 3.411

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.