Fei Luo, Yangyang Gao, Yongqiong Zhu, Juan Liu.
Abstract
BACKGROUND: HLA (human leukocyte antigen) class I molecules are encoded by a large family of genes and are highly polymorphic. More than 3000 HLA-I molecules have now been registered. Slight differences in the amino acid sequences of HLAs cause them to bind different sets of peptides. Over the past decades, many methods have been proposed to predict binding between peptides and HLA-I molecules, and they have achieved good performance. However, most of the experimental data they use is limited to a small number of HLA alleles, so they tend to reach high prediction accuracy only on data from similar alleles. Because the peptide and the HLA molecule together determine binding, it is necessary to consider the contributions of both simultaneously.
Year: 2013 PMID: 23815611 PMCID: PMC3654895 DOI: 10.1186/1471-2105-14-S8-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The framework of our method. The input data has two parts: the peptide and the HLA molecule. HLA molecules are processed by extracting the interacting amino acid residues and computing the contact energy. They are then encoded as classification features and fed into the classifier for training and prediction.
Figure 2. Interacting residues. (a) shows the binding sites of HLA-A and (b) those of HLA-B. Column numbers give the HLA residue index from the IMGT/HLA database, and row numbers give the residue index within the 9-residue peptide. Grey cells mark residue pairs where the HLA molecule and the peptide interact.
Prediction Results on Benchmark Dataset
| Allele | ANNBM | ARB | SMM | NetMHC | Other methods | Peptides |
|---|---|---|---|---|---|---|
| A*0101 | 0.977 | 0.964 | 0.98 | 0.982 | 0.955 | 1157 |
| A*0201 | 0.951 | 0.934 | 0.952 | 0.957 | 0.922 | 3089 |
| A*0202 | 0.891 | 0.875 | 0.899 | 0.9 | 0.793 | 1447 |
| A*0203 | 0.911 | 0.884 | 0.916 | 0.921 | 0.788 | 1443 |
| A*0206 | 0.906 | 0.872 | 0.914 | 0.927 | 0.735 | 1437 |
| A*0301 | 0.932 | 0.908 | 0.94 | 0.937 | 0.851 | 2094 |
| A*1101 | 0.945 | 0.918 | 0.948 | 0.951 | 0.869 | 1985 |
| A*2402 | 0.826 | 0.718 | 0.78 | 0.825 | 0.77 | 197 |
| A*2601 | 0.950 | 0.907 | 0.931 | 0.956 | 0.736 | 672 |
| A*2902 | 0.907 | 0.755 | 0.911 | 0.935 | 0.597 | 160 |
| A*3101 | 0.923 | 0.909 | 0.93 | 0.928 | 0.829 | 1869 |
| A*3301 | 0.915 | 0.892 | 0.925 | 0.915 | 0.807 | 1140 |
| A*6801 | 0.88 | 0.84 | 0.885 | 0.883 | 0.772 | 1141 |
| A*6802 | 0.883 | 0.865 | 0.898 | 0.899 | 0.643 | 1434 |
| B*0702 | 0.966 | 0.952 | 0.964 | 0.965 | 0.942 | 1262 |
| B*0801 | 0.968 | 0.936 | 0.943 | 0.955 | 0.766 | 708 |
| B*1501 | 0.939 | 0.9 | 0.952 | 0.941 | 0.816 | 978 |
| B*1801 | 0.848 | 0.573 | 0.853 | 0.838 | 0.779 | 118 |
| B*2705 | 0.957 | 0.915 | 0.94 | 0.938 | 0.926 | 969 |
| B*3501 | 0.873 | 0.851 | 0.889 | 0.875 | 0.792 | 736 |
| B*4002 | 0.858 | 0.541 | 0.842 | 0.754 | 0.775 | 118 |
| B*4402 | 0.824 | 0.533 | 0.74 | 0.778 | 0.783 | 119 |
| B*4403 | 0.791 | 0.461 | 0.77 | 0.763 | 0.698 | 119 |
| B*5101 | 0.894 | 0.822 | 0.868 | 0.886 | 0.82 | 244 |
| B*5301 | 0.886 | 0.871 | 0.882 | 0.899 | 0.861 | 254 |
| B*5401 | 0.911 | 0.847 | 0.921 | 0.903 | 0.799 | 255 |
| B*5701 | 0.96 | 0.428 | 0.871 | 0.826 | 0.767 | 59 |
| B*5801 | 0.972 | 0.889 | 0.964 | 0.961 | 0.899 | 988 |
| AVG | 0.909 | 0.791 | 0.874 | 0.901 | 0.796 | |
Table 1 compares our ANNBM with the methods evaluated in the work of Bjoern Peters on the benchmark dataset. We use the AUC (area under the ROC curve) of 5-fold cross-validation as the evaluation criterion. The first column gives the allele name, covering 14 HLA-A and 14 HLA-B molecules. Columns 2 to 5 give the 5-fold cross-validation AUC of ANNBM, ARB, SMM, and NetMHC, respectively. ANNBM is also compared with 16 other online prediction methods, including classifiers such as SVM; the best value among them is listed in column 6. The last column gives the number of peptides binding to the corresponding HLA molecule.
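The evaluation criterion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the stratified fold assignment, and the choice to average per-fold AUCs (rather than pool the scores) are all assumptions, and `fit` / `predict` stand in for whatever classifier is being evaluated.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (binder, non-binder) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=5, seed=0):
    """Deal shuffled indices of each class round-robin into k folds (assumed scheme)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

def cross_validated_auc(labels, features, fit, predict, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, and average the k AUC values."""
    fold_aucs = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        scores = [predict(model, features[i]) for i in held_out]
        fold_aucs.append(auc([labels[i] for i in held_out], scores))
    return sum(fold_aucs) / len(fold_aucs)
```

With fewer than `k` examples of either class, a fold can end up with a single class and `auc` raises an error; alleles with very few peptides would need a different split.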
Figure 3. Methods comparison. Using the results in Table 1, we divide the alleles into an HLA-A group (a) and an HLA-B group (b) and sort them in ascending order of peptide number to measure the correlation between dataset size and classification accuracy. The panels, from left to right and top to bottom, show the linear fit between peptide number (x axis) and accuracy (y axis) for five methods: ANNBM, ARB, SMM, NetMHC, and Other methods. The bottom-right panel shows the standard deviation of the classification accuracy. ANNBM has the smallest slope and standard deviation, which indicates that it is the least dependent on dataset size and the most stable.
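The slope and standard-deviation comparison in this caption can be reproduced in outline from Table 1. The sketch below uses only the HLA-B rows and only two of the five methods (ANNBM and ARB); the ordinary least-squares fit and the use of the population standard deviation are assumptions, since the caption does not say which variants were used.

```python
from statistics import mean, pstdev

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# HLA-B rows of Table 1: (peptide count, ANNBM AUC, ARB AUC)
HLA_B = [
    (1262, 0.966, 0.952), (708, 0.968, 0.936), (978, 0.939, 0.900),
    (118, 0.848, 0.573), (969, 0.957, 0.915), (736, 0.873, 0.851),
    (118, 0.858, 0.541), (119, 0.824, 0.533), (119, 0.791, 0.461),
    (244, 0.894, 0.822), (254, 0.886, 0.871), (255, 0.911, 0.847),
    (59, 0.960, 0.428), (988, 0.972, 0.889),
]

rows = sorted(HLA_B)  # ascending peptide count, as in the caption
peptides = [r[0] for r in rows]
summary = {}
for name, col in (("ANNBM", 1), ("ARB", 2)):
    aucs = [r[col] for r in rows]
    slope, _ = least_squares(peptides, aucs)
    summary[name] = (slope, pstdev(aucs))
    print(f"{name}: slope per 1000 peptides = {slope * 1000:.3f}, std = {pstdev(aucs):.3f}")
```

A smaller slope means accuracy changes less as the number of peptides grows, and a smaller standard deviation means accuracy varies less across alleles.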
Figure 4. ROC curves of ANNBM, ARB, SMM, and NetMHC on HLA-A*0201.
Figure 5. ROC curves of ANNBM, ARB, SMM, and NetMHC on HLA-B*4402.
Prediction Results on Unknown Alleles Dataset
| Allele | Supertype | ANNBM | NetMHC | NetMHCpan | Peptides |
|---|---|---|---|---|---|
| A*0101 | A1 | 0.854 | 0.672 | 0.873 | 1157 |
| A*0201 | A2 | 0.905 | 0.886 | 0.912 | 3089 |
| A*0202 | A2 | 0.840 | 0.784 | 0.815 | 1447 |
| A*0203 | A2 | 0.836 | 0.818 | 0.832 | 1443 |
| A*0206 | A2 | 0.883 | 0.826 | 0.847 | 1436 |
| A*0301 | A3 | 0.867 | 0.820 | 0.849 | 2094 |
| A*1101 | A3 | 0.879 | 0.851 | 0.866 | 1985 |
| A*2301 | A24 | 0.917 | 0.877 | 0.863 | 104 |
| A*2402 | A24 | 0.864 | 0.848 | 0.821 | 197 |
| A*2403 | A24 | 0.923 | 0.894 | 0.912 | 254 |
| A*2601 | A1 | 0.771 | 0.631 | 0.733 | 672 |
| A*2902 | A3 | 0.832 | 0.603 | 0.749 | 160 |
| A*3001 | A3 | 0.863 | 0.846 | 0.838 | 669 |
| A*3002 | A1 | 0.671 | 0.711 | 0.721 | 92 |
| A*3101 | A3 | 0.853 | 0.822 | 0.878 | 1869 |
| A*3301 | A3 | 0.838 | 0.699 | 0.763 | 1140 |
| A*6801 | A3 | 0.768 | 0.744 | 0.760 | 1141 |
| A*6802 | A2 | 0.812 | 0.664 | 0.669 | 1434 |
| A*6901 | A2 | 0.902 | 0.811 | 0.823 | 833 |
| B*0702 | B7 | 0.919 | 0.864 | 0.902 | 1262 |
| B*1501 | B62 | 0.687 | 0.536 | 0.750 | 978 |
| B*1801 | B62 | 0.823 | 0.775 | 0.729 | 969 |
| B*3501 | B7 | 0.805 | 0.737 | 0.762 | 736 |
| B*4001 | B44 | 0.852 | 0.818 | 0.870 | 1078 |
| B*4002 | B44 | 0.883 | 0.802 | 0.807 | 118 |
| B*4402 | B44 | 0.824 | 0.771 | 0.839 | 119 |
| B*4403 | B44 | 0.836 | 0.800 | 0.842 | 119 |
| B*4501 | B44 | 0.822 | 0.804 | 0.809 | 114 |
| B*5101 | B7 | 0.887 | 0.879 | 0.905 | 244 |
| B*5301 | B7 | 0.828 | 0.819 | 0.838 | 254 |
| B*5401 | B7 | 0.880 | 0.847 | 0.845 | 255 |
| B*5701 | B58 | 0.945 | 0.652 | 0.919 | 59 |
| B*5801 | B58 | 0.869 | 0.625 | 0.841 | 988 |
| AVG | | 0.847 | 0.774 | 0.824 | |
Table 2 shows that ANNBM obtains a higher average AUC than NetMHCpan and NetMHC, by 0.023 and 0.073 respectively. The NetMHC encoding does not take HLA molecule information into account. Although its training data comes from the same supertype and it achieves very good results on the allele-specific benchmark dataset, the differences between HLAs within a supertype are not reflected. It is therefore unsurprising that the prediction accuracy of NetMHC decreases and falls below that of ANNBM and NetMHCpan, both of which encode HLA molecule information. The two methods encode HLA molecules differently: ANNBM uses the B matrix and denotes each amino acid that could interact with the peptide by a single numerical value, while NetMHCpan uses the BLOSUM matrix and a 20-dimensional vector for each amino acid. ANNBM is therefore more efficient in storage and computation. The average AUC of ANNBM is greater than that of NetMHCpan, especially on A*0202 and B*3501, whose ROC curves are shown in Figures 6 and 7.
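The storage argument can be made concrete with a small sketch of the two encoding styles. Everything here is illustrative: the lookup tables hold placeholder numbers rather than real contact energies or BLOSUM scores, and the residue string and function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scalar_encode(residues, value_of):
    """One number per interacting residue (the style the text attributes to ANNBM)."""
    return [value_of[r] for r in residues]

def vector_encode(residues, row_of):
    """One 20-dimensional row per residue, flattened (the style attributed to NetMHCpan)."""
    return [x for r in residues for x in row_of[r]]

# Placeholder lookup tables: NOT real contact energies or BLOSUM scores.
placeholder_value = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
placeholder_row = {
    aa: [1.0 if other == aa else 0.0 for other in AMINO_ACIDS]
    for aa in AMINO_ACIDS
}

# Hypothetical string of HLA residues that contact the peptide.
interacting_residues = "YFAMYGEKVA"

scalar_features = scalar_encode(interacting_residues, placeholder_value)
vector_features = vector_encode(interacting_residues, placeholder_row)
print(len(scalar_features), len(vector_features))
```

For the same set of interacting residues, the scalar encoding produces one feature per residue and the vector encoding produces twenty, so the HLA part of the classifier input is twenty times smaller under the scalar scheme.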