Patrick J. Lawrence, Xia Ning
Abstract
In this work, we propose a new deep-learning model, MHCrank, to predict the probability that a peptide will be processed for presentation by MHC class I molecules. We find that the performance of our model is significantly higher than that of two previously published baseline methods: MHCflurry and netMHCpan. This improvement arises from utilizing both cleavage site-specific kernels and learned embeddings for amino acids. By visualizing site-specific amino acid enrichment patterns, we observe that MHCrank's top-ranked peptides exhibit enrichments at biologically relevant positions and are consistent with previous work. Furthermore, the cosine similarity matrix derived from MHCrank's learned embeddings for amino acids correlates highly with physicochemical properties that have been experimentally demonstrated to be instrumental in determining a peptide's favorability for processing. Altogether, the results reported in this work indicate that MHCrank demonstrates strong performance compared with existing methods and could have vast applicability in aiding drug and vaccine development.
Keywords: MHC class I; antigen processing; artificial intelligence; immunology; machine learning
Year: 2022 PMID: 36160050 PMCID: PMC9499997 DOI: 10.1016/j.crmeth.2022.100293
Source DB: PubMed Journal: Cell Rep Methods ISSN: 2667-2375
Figure 1. MHCrank model architecture
MHCrank takes a uniform-length N-flank + peptide + C-flank sequence, C-terminal cleavage site (see gray box), and the peptide’s original length before padding or trimming as input. The amino acids comprising the sequence and cleavage site-specific kernel (CSSK) undergo feature embedding. A convolution layer is applied to the embedding of the entire sequence. The remainder of the MHCrank architecture can be split into six components. Component (1) applies a mean pool to the convolution output corresponding to the N-flank. Component (2) applies a mean pool to the convolution output corresponding to the C-flank. The convolution output corresponding to the peptide sequence is forwarded to two stacked convolution layers. Components (3) and (4) each have two outputs (A and B) obtained from the output of these convolution layers. (3A) extracts the output corresponding to the peptide’s N-terminal amino acid. (4A) extracts the output corresponding to the peptide’s C-terminal amino acid. (3B) applies a mean pool to the peptide’s non-N-terminal amino acids. (4B) applies a mean pool to the peptide’s non-C-terminal amino acids. Component (5) applies a global kernel to the embedded CSSK. Component (6) is a single node that takes the peptide’s original length as input. Two dense layers are applied to the concatenated output of each component. The output from the second dense layer enters an output layer that predicts the probability of the input peptide undergoing antigen processing. Note that the layout of this diagram is largely inspired by the presentation of MHCflurry’s architecture (O’Donnell et al., 2020).
See also Figure S1.
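The data flow described in the Figure 1 caption can be sketched in code. The following is a minimal, illustrative Keras implementation of the six components; all layer widths, kernel sizes, and padded sequence lengths are assumptions for the sake of a runnable example, not the paper's actual hyperparameters.

```python
# Illustrative sketch of the Figure 1 data flow. Dimensions are assumed.
import tensorflow as tf
from tensorflow.keras import layers

N_FLANK, PEP_LEN, C_FLANK, CSSK_LEN = 10, 15, 10, 4  # assumed padded lengths
SEQ_LEN = N_FLANK + PEP_LEN + C_FLANK
VOCAB, EMB_DIM, FILTERS = 21, 16, 32                  # 20 amino acids + padding token

seq_in = layers.Input(shape=(SEQ_LEN,), name="flanked_sequence")
cssk_in = layers.Input(shape=(CSSK_LEN,), name="cleavage_site")
len_in = layers.Input(shape=(1,), name="original_peptide_length")

emb = layers.Embedding(VOCAB, EMB_DIM)                # shared feature embedding
x = layers.Conv1D(FILTERS, 3, padding="same", activation="relu")(emb(seq_in))

pep = x[:, N_FLANK:N_FLANK + PEP_LEN]                 # peptide region of conv output
c1 = layers.GlobalAveragePooling1D()(x[:, :N_FLANK])            # (1) mean pool, N-flank
c2 = layers.GlobalAveragePooling1D()(x[:, N_FLANK + PEP_LEN:])  # (2) mean pool, C-flank

pep = layers.Conv1D(FILTERS, 3, padding="same", activation="relu")(pep)
pep = layers.Conv1D(FILTERS, 3, padding="same", activation="relu")(pep)  # two stacked convs

c3a = pep[:, 0]                                       # (3A) peptide N-terminal residue
c4a = pep[:, -1]                                      # (4A) peptide C-terminal residue
c3b = layers.GlobalAveragePooling1D()(pep[:, 1:])     # (3B) non-N-terminal residues
c4b = layers.GlobalAveragePooling1D()(pep[:, :-1])    # (4B) non-C-terminal residues

# (5) a "global" kernel spanning the entire embedded CSSK
c5 = layers.Flatten()(layers.Conv1D(FILTERS, CSSK_LEN)(emb(cssk_in)))

# (6) the peptide's original length enters as a single node
h = layers.Concatenate()([c1, c2, c3a, c3b, c4a, c4b, c5, len_in])
h = layers.Dense(64, activation="relu")(h)
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="processing_probability")(h)

model = tf.keras.Model([seq_in, cssk_in, len_in], out)
```

The two dense layers over the concatenated components and the sigmoid output follow the caption directly; how the caption's "global kernel" is realized is an assumption here (a convolution whose kernel covers the full CSSK).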
Performance comparison: Mean AUC
| EL (netMHCpan4.0-EL) | Fw-top1 | Fw-top2 | Ba-top1 | Ba-top2 | C-top1 | C-top2 |
|---|---|---|---|---|---|---|
| 0.9050 | 0.9073† (0.0)∗∗∗∗ | 0.9120† (0.0)∗∗∗∗ | 0.9102† (0.0)∗∗∗∗ | 0.9147† (0.0)∗∗∗∗ | 0.9121† (0.0)∗∗∗∗ | 0.9153† (0.0)∗∗∗∗ |
Results comparing the performance of MHCrank's ensembles against netMHCpan4.0-EL with respect to mean AUC. † marks an MHCrank ensemble's improvement in performance over netMHCpan4.0-EL; p values are reported in parentheses to the right of the mean performance values. Statistically significant improvement of MHCrank's ensembles relative to netMHCpan4.0-EL's performance (after multiple-hypothesis correction) is denoted as follows: ∗∗∗∗p ≤ 0.001. See also Figure S4 and Tables S2–S4.
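The AUC reported above can be computed from raw model scores via the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive peptide is scored higher than a randomly chosen negative one. A minimal pure-Python sketch on toy data (not the paper's):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U formulation.

    labels: 1 for an observed hit, 0 otherwise; scores: model outputs.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: a perfect ranker gives AUC = 1.0
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```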
Performance comparison: Mean precision@k

| k | EL (netMHCpan4.0-EL) | Fw-top1 | Fw-top2 | Ba-top1 | Ba-top2 | C-top1 | C-top2 |
|---|---|---|---|---|---|---|---|
| 10 | 0.7065 | 0.7260† (0.0028)∗∗ | 0.7251† (0.0050)∗∗ | 0.6492 (0.0) | 0.6980 (0.1995) | 0.7089† (0.7117) | 0.7116† (0.4440) |
| 25 | 0.6434 | 0.6544† (0.0114)∗ | 0.6517† (0.0524) | 0.6314 (0.0047) | 0.6246 (0.0) | 0.6496† (0.1424) | 0.6452† (0.6729) |
| 50 | 0.5971 | 0.5973† (0.9488) | 0.5971 (1.0) | 0.5891 (0.0105) | 0.5922 (0.1142) | 0.6063† (0.0031)∗∗ | 0.6035† (0.0391) |
| 100 | 0.5495 | 0.5284 (0.0) | 0.5360 (0.0) | 0.5182 (0.0) | 0.5304 (0.0) | 0.5491 (0.8511) | 0.5434 (0.0074) |
| 250 | 0.4344 | 0.4202 (0.0) | 0.4410† (0.0)∗∗∗∗ | 0.4198 (0.0) | 0.4359† (0.3132) | 0.4412† (0.0)∗∗∗∗ | 0.4482† (0.0)∗∗∗∗ |
| 500 | 0.3288 | 0.3413† (0.0)∗∗∗∗ | 0.3544† (0.0)∗∗∗∗ | 0.3409† (0.0)∗∗∗∗ | 0.3552† (0.0)∗∗∗∗ | 0.3550† (0.0)∗∗∗∗ | 0.3604† (0.0)∗∗∗∗ |
Results comparing the performance of MHCrank's ensembles against netMHCpan4.0-EL with respect to mean precision@k. † marks an MHCrank ensemble's improvement in performance over netMHCpan4.0-EL; p values are reported in parentheses to the right of the mean performance values. Statistically significant improvement of MHCrank's ensembles relative to netMHCpan4.0-EL's performance (after multiple-hypothesis correction) is denoted as follows: ∗p ≤ 0.1; ∗∗p ≤ 0.05; ∗∗∗∗p ≤ 0.001. See also Figure S4 and Tables S2–S4.
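Precision@k, the metric in the table above, is the fraction of true hits among the k highest-scoring peptides. A minimal sketch with illustrative inputs:

```python
def precision_at_k(labels, scores, k):
    """Fraction of true hits (label 1) among the k highest-scoring items."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return sum(top_labels) / k

# Toy example: of the two top-scored peptides, one is a true hit
print(precision_at_k([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1], 2))  # → 0.5
```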
Performance comparison: Mean NDCG@k

| k | EL (netMHCpan4.0-EL) | Fw-top1 | Fw-top2 | Ba-top1 | Ba-top2 | C-top1 | C-top2 |
|---|---|---|---|---|---|---|---|
| 10 | 0.7253 | 0.7451† (0.0032)∗∗ | 0.7544† (0.0)∗∗∗∗ | 0.6705 (0.0) | 0.7166 (0.2004) | 0.7265† (0.8571) | 0.7357† (0.1268) |
| 25 | 0.6712 | 0.6846† (0.0031)∗∗ | 0.6877† (0.0002)∗∗∗ | 0.6486 (0.0) | 0.6547 (0.0003) | 0.6761† (0.2813) | 0.6770† (0.1905) |
| 50 | 0.6272 | 0.6317† (0.17375) | 0.6344† (0.0282) | 0.6113 (0.0) | 0.6198 (0.02749) | 0.6346† (0.02602) | 0.6348† (0.0213) |
| 100 | 0.5795 | 0.5658 (0.0) | 0.5735 (0.01295) | 0.5486 (0.0) | 0.5621 (0.0) | 0.5805† (0.0) | 0.5770 (0.0) |
| 250 | 0.4712 | 0.4202 (0.0) | 0.4780† (0.0)∗∗∗∗ | 0.4538 (0.0) | 0.4701 (0.4678) | 0.4772† (0.0001)∗∗∗∗ | 0.4833† (0.0)∗∗∗∗ |
| 500 | 0.3685 | 0.3779† (0.0)∗∗∗∗ | 0.3909† (0.0)∗∗∗∗ | 0.3743† (0.0)∗∗∗∗ | 0.3891† (0.0)∗∗∗∗ | 0.3909† (0.0)∗∗∗∗ | 0.3961† (0.0)∗∗∗∗ |
Results comparing the performance of MHCrank's ensembles against netMHCpan4.0-EL with respect to mean NDCG@k. † marks an MHCrank ensemble's improvement in performance over netMHCpan4.0-EL; p values are reported in parentheses to the right of the mean performance values. Statistically significant improvement of MHCrank's ensembles relative to netMHCpan4.0-EL's performance (after multiple-hypothesis correction) is denoted as follows: ∗∗p ≤ 0.05; ∗∗∗p ≤ 0.01; ∗∗∗∗p ≤ 0.001. See also Figure S4 and Tables S2–S4.
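Unlike precision@k, NDCG@k also rewards placing hits nearer the top of the ranked list: the discounted cumulative gain of the predicted ranking is normalized by that of the ideal (label-sorted) ranking. A minimal binary-relevance sketch:

```python
import math

def dcg_at_k(labels_ranked, k):
    """Discounted cumulative gain: gains decay logarithmically with rank."""
    return sum(l / math.log2(i + 2) for i, l in enumerate(labels_ranked[:k]))

def ndcg_at_k(labels, scores, k):
    """DCG of the score-induced ranking, normalized by the ideal DCG."""
    predicted = [l for _, l in
                 sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    ideal = sorted(labels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(predicted, k) / idcg if idcg > 0 else 0.0

# A perfect ranking scores 1.0; demoting the only hit to rank 2 scores less
print(ndcg_at_k([1, 0], [0.9, 0.1], 2))  # → 1.0
```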
Performance comparison: Percent change
| k | Precision@k | NDCG@k | AUC |
|---|---|---|---|
| 10 | 2.63†∗∗ | 4.00†∗∗∗∗ | 0.77†∗∗∗∗ |
| 25 | 1.29† | 2.47†∗∗∗ | – |
| 50 | 0.00 | 1.15† | – |
| 100 | −2.47 | −1.05 | – |
| 250 | 1.51†∗∗∗∗ | 1.45†∗∗∗∗ | – |
| 500 | 7.77†∗∗∗∗ | 6.09†∗∗∗∗ | – |
Results comparing the performance of MHCrank's ensembles against netMHCpan4.0-EL with respect to percent improvement. † marks an MHCrank ensemble's improvement in performance over netMHCpan4.0-EL. Statistically significant improvement of MHCrank's ensembles relative to netMHCpan4.0-EL's performance (after multiple-hypothesis correction) is denoted as follows: ∗∗p ≤ 0.05; ∗∗∗p ≤ 0.01; ∗∗∗∗p ≤ 0.001. p values are not reported for the percent improvements themselves, as these values are derived from the performance reported for Fw-top2 (Tables 1, 2, and 3). Note that the reported percent improvement in AUC was calculated over the entire dataset, not at a specific k threshold. See also Figure S4 and Tables S2–S4.
Figure 2. Amino acid enrichment of training peptides and top 100 predicted candidates
(A–D) The enrichment of amino acids in (A) 50,000 hits randomly sampled from the training dataset and in the top 100 peptides from the testing data as ranked by (B) MHCrank's Fw-top2 ensemble and by the (C) MHCflurry-AP and (D) netMHCpan4.0-EL baseline methods. Yellow boxes covering positions 2 and 9 in each panel highlight the enrichment of the peptides at their typical anchor positions.
See also Figures S2 and S3.
Figure 3. Performance of MHCrank with differing amino acid embedding methods
(A–C) Mean precision@k, NDCG@k, and AUC, respectively, for six identical MHCrank architectures, each trained with a distinct amino acid embedding method. The learned embedding method is used by all of the best-performing MHCrank ensembles, including Fw-top2. NormBLO refers to a normalized version of the BLOSUM matrix; PC is the embedding matrix we produce using physicochemical properties (see “amino acid representation”).
(D) Cosine similarities of embeddings learned by an MHCrank model randomly selected from the best-performing ensemble (Fw-top2).
See also Tables S5 and S6.
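The cosine similarity matrix of Figure 3D compares learned embedding vectors pairwise. A minimal pure-Python sketch; the three residues and their two-dimensional "embeddings" below are hypothetical stand-ins, not values learned by MHCrank:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

def cosine_matrix(embeddings):
    """Pairwise cosine similarities between per-residue embeddings."""
    keys = sorted(embeddings)
    return {(a, b): cosine_similarity(embeddings[a], embeddings[b])
            for a in keys for b in keys}

# Hypothetical toy embeddings: A and L nearly parallel, D orthogonal to A
emb = {"A": [1.0, 0.0], "L": [0.9, 0.1], "D": [0.0, 1.0]}
m = cosine_matrix(emb)
```

In this toy setting, similar residues yield similarities near 1 and dissimilar ones near 0, which is the pattern the paper relates to physicochemical properties.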
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Training data for MHCrank ensembles | | |
| Multiallelic benchmark dataset | | |
| Kared et al. SARS-CoV-2 dataset | | |
| Snyder et al. SARS-CoV-2 dataset | | |
| Python | Python Software Foundation | v3.6.6 |
| TensorFlow | | v2.2.1 |
| MHCrank | This paper | |
| MHCflurry-2.0 | | |
| netMHCpan-4.0 | | |
| netMHCpan-4.1 | | |
| DeepLigand | | |
| MHCSeqNet | | |
| CPU | Ohio Supercomputer Center | Intel Xeon 8268s Cascade Lakes |
| GPU | Ohio Supercomputer Center | NVIDIA Volta V100 |