Zhonghao Liu1, Yuxin Cui1, Zheng Xiong1, Alierza Nasiri1, Ansi Zhang2, Jianjun Hu3,4.
Abstract
Interactions between human leukocyte antigens (HLAs) and peptides play a critical role in the human immune system. Accurate computational prediction of HLA-binding peptides can be used for peptide drug discovery. Currently, the best prediction algorithms are neural network-based pan-specific models, which take advantage of the large amount of data across HLA alleles. However, current pan-specific models are all based on the pseudo-sequence encoding for modeling the binding context, which relies on 34 positions identified from HLA protein-peptide bound structures in early works. In this work, we propose a novel deep convolutional neural network (DCNN) model for HLA-peptide binding prediction, in which the encoding of the HLA sequence and the binding context are both learned by the network itself, without requiring HLA-peptide bound structure information. Our DCNN model is also characterized by its binding context extraction layer and its dual outputs: a binding affinity output and a binding probability output. Evaluation on public benchmark datasets shows that our DeepSeqPan model, trained without HLA structural information, achieves state-of-the-art performance on a large number of HLA alleles with good generalization capability. Since our model needs only raw sequences from the HLA-peptide binding pairs, it can be applied to binding prediction for HLAs without structure information and can also be applied to other protein binding problems such as protein-DNA and protein-RNA binding. The implementation code and trained models are freely available at https://github.com/pcpLiu/DeepSeqPan.
Year: 2019 PMID: 30692623 PMCID: PMC6349913 DOI: 10.1038/s41598-018-37214-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Interaction map of the HLA pseudo sequence in NetMHCPan. Reproduced from original paper.
Figure 2. DeepSeqPan Network Structure. (i) Peptide and HLA encoders. (ii) Binding context extractor. (iii) Affinity and binding predictors.
Figure 3. Peptide encoding example. Sequence HLNPNKTKR is encoded into a 2D tensor with dimension 1 (height) × 9 (width) × 20 (channel). Each of the 20 channels represents one amino acid type, and a channel value is set to 1 if the corresponding amino acid appears at that position of the input sequence.
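The encoding described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the channel order (the 20 standard amino acids sorted by one-letter code) and the function name `one_hot_peptide` are assumptions, since the caption does not state how channels are ordered.

```python
# Assumption: channels follow the 20 standard amino acids sorted by
# one-letter code. The caption does not specify the actual order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHANNEL = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_peptide(seq):
    """Return a nested list shaped 1 (height) x len(seq) (width) x 20 (channel)."""
    row = []
    for aa in seq:
        if aa not in CHANNEL:
            raise ValueError(f"unsupported residue: {aa!r}")
        channels = [0] * len(AMINO_ACIDS)
        channels[CHANNEL[aa]] = 1  # mark the residue present at this position
        row.append(channels)
    return [row]  # single row, so the height dimension is 1


tensor = one_hot_peptide("HLNPNKTKR")
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)
```

For the 9-residue example peptide this yields a 1 × 9 × 20 structure in which every position has exactly one active channel. The nested list could be wrapped in an array type of choice before being fed to a network.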
Five-fold cross validation on all training data and CD-HIT filtered training data.

| Training Dataset | Alleles | Seq Count | Binary Binding | | | |
|---|---|---|---|---|---|---|
| | | | AUC | SRCC | AUC | SRCC |
| BD2013 | All alleles | 121,787 | 0.94 | 0.73 | 0.94 | 0.70 |
| BD2013 | HLA-A | 72,618 | 0.94 | 0.75 | 0.94 | 0.73 |
| BD2013 | HLA-B | 46,915 | 0.94 | 0.68 | 0.94 | 0.64 |
| BD2013 | HLA-C | 2,254 | 0.89 | 0.70 | 0.89 | 0.69 |
| CD-HIT BD2013 | All alleles | 104,449 | 0.94 | 0.71 | 0.94 | 0.68 |
| CD-HIT BD2013 | HLA-A | 60,987 | 0.94 | 0.73 | 0.94 | 0.71 |
| CD-HIT BD2013 | HLA-B | 41,360 | 0.94 | 0.66 | 0.94 | 0.62 |
| CD-HIT BD2013 | HLA-C | 2,102 | 0.89 | 0.69 | 0.89 | 0.68 |
| BD2009 | All alleles | 88,742 | 0.93 | 0.69 | 0.93 | 0.68 |
| BD2009 | HLA-A | 57,173 | 0.93 | 0.72 | 0.93 | 0.71 |
| BD2009 | HLA-B | 31,569 | 0.93 | 0.62 | 0.93 | 0.60 |

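The two metrics reported in the table above, AUC and SRCC, can be computed from ranks alone. The following is a generic sketch of both measures, not the evaluation script used in the paper; the helper names and the toy numbers are made up for illustration, and how binder labels were derived for the AUC is not stated in this record.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic; labels are 0 or 1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy values for illustration only, not taken from the paper.
true_affinity = [0.9, 0.8, 0.4, 0.3, 0.1]
predicted = [0.7, 0.85, 0.5, 0.2, 0.3]
is_binder = [1, 1, 0, 0, 0]

print(auc(is_binder, predicted), srcc(true_affinity, predicted))
```

In practice, library routines such as `scipy.stats.spearmanr` and `sklearn.metrics.roc_auc_score` compute the same quantities; the hand-written version is shown only to make the definitions explicit.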
Comparison of Kim’s model and DeepSeqPan.

| HLA | Count | AUC (Kim) | AUC (DeepSeqPan) |
|---|---|---|---|
| All | 19,240 | 0.76 | 0.74 |
| HLA-A | 3,416 | 0.74 | 0.71 |
| HLA-B | 15,824 | 0.79 | 0.79 |

Performance comparison between LOAO cross validation and random 5-fold cross validation.

| Metric | Threshold | IC50 (LOAO) | IC50 (Random 5-fold) | Binary (LOAO) | Binary (Random 5-fold) |
|---|---|---|---|---|---|
| AUC | >0.7 | 74 | 80 | 74 | 78 |
| AUC | >0.8 | 50 | 72 | 52 | 70 |
| SRCC | >0.6 | 28 | 53 | 26 | 49 |
| SRCC | >0.7 | 14 | 34 | 15 | 32 |

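The difference between the two validation schemes compared above lies in how samples are split. The sketch below reads LOAO as leave-one-allele-out, which is an assumption because the record does not expand the abbreviation. The function names and the sample list are hypothetical, and apart from HLNPNKTKR the peptides are made-up placeholders.

```python
import random
from collections import defaultdict


def loao_splits(samples):
    """Yield (held_out_allele, train, test), holding out all samples of one allele."""
    by_allele = defaultdict(list)
    for sample in samples:
        by_allele[sample[0]].append(sample)
    for allele in sorted(by_allele):
        test = by_allele[allele]
        train = [s for s in samples if s[0] != allele]
        yield allele, train, test


def random_kfold_splits(samples, k=5, seed=0):
    """Yield (train, test) for k folds after shuffling, ignoring allele identity."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


# Hypothetical (allele, peptide) pairs for illustration only.
samples = [
    ("HLA-A*02:01", "HLNPNKTKR"),
    ("HLA-A*02:01", "KLMNPQRSV"),
    ("HLA-A*02:01", "ALDEFGHIV"),
    ("HLA-A*02:01", "GLYWTRSKL"),
    ("HLA-B*07:02", "APRTLVYLL"),
    ("HLA-B*07:02", "RPKSNIVLL"),
    ("HLA-B*07:02", "SPAGHTWEL"),
    ("HLA-C*07:01", "FRYNGLIHR"),
    ("HLA-C*07:01", "YRHDGGNVL"),
    ("HLA-C*07:01", "NRQDFKLMY"),
]

for allele, train, test in loao_splits(samples):
    print(allele, len(train), len(test))
```

Under the leave-one-allele-out split the model never sees the held-out allele during training, so it is the harder setting, which is consistent with the lower counts in the LOAO columns.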
Consistency inspection results.

| | Cross Validation | Benchmark Evaluation |
|---|---|---|
| Total samples | 121,787 | 19,741 |
| Consistent pred. | 116,688 (95.81%) | 17,004 (86.14%) |
| Correct IC50 pred. | 108,064 (88.73%) | 11,690 (59.21%) |
| Correct Binary pred. | 107,239 (88.05%) | 10,487 (53.12%) |

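One plausible way to tally the quantities in the consistency table is sketched below. Every definition here is an assumption, since this record does not spell them out: a prediction pair is treated as consistent when the two model outputs agree on the binder call, the IC50 output is taken to be in nM with a 500 nM binder cutoff, and the binary output is thresholded at 0.5. The function name and the records are hypothetical.

```python
IC50_BINDER_NM = 500.0  # assumed cutoff: predicted IC50 at or below this counts as a binder
PROB_BINDER = 0.5       # assumed cutoff: predicted probability at or above this counts as a binder


def consistency_report(records):
    """Tally agreement for records of (pred_ic50_nm, pred_prob, true_is_binder)."""
    total = consistent = ic50_correct = binary_correct = 0
    for pred_ic50_nm, pred_prob, true_is_binder in records:
        ic50_call = pred_ic50_nm <= IC50_BINDER_NM
        binary_call = pred_prob >= PROB_BINDER
        total += 1
        consistent += ic50_call == binary_call          # the two outputs agree
        ic50_correct += ic50_call == true_is_binder     # affinity output matches the label
        binary_correct += binary_call == true_is_binder  # probability output matches the label
    return {
        "total": total,
        "consistent": consistent,
        "ic50_correct": ic50_correct,
        "binary_correct": binary_correct,
    }


# Made-up records for illustration only.
records = [
    (50.0, 0.9, True),
    (20000.0, 0.1, False),
    (300.0, 0.4, True),
    (800.0, 0.7, False),
    (100.0, 0.8, False),
]

report = consistency_report(records)
print(report)
```

Dividing each count by `report["total"]` gives percentages of the kind shown in parentheses in the table.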
Figure 4. Correlation analysis between binary prediction values and regression prediction values on benchmark dataset.