Yuan-Ling Xia, Weihua Li, Yongping Li, Xing-Lai Ji, Yun-Xin Fu, Shu-Qun Liu.
Abstract
Modeling antigenic variation in influenza (flu) virus A H3N2 from amino acid sequences is a promising approach for improving the accuracy of vaccine immune-efficacy prediction and increasing the efficiency of vaccine screening. Antigenic drift and antigenic jump/shift, which arise from the accumulation of mutations with small or moderate effects and from major, abrupt changes with large effects on the surface antigen hemagglutinin (HA), respectively, are two types of antigenic variation that facilitate immune evasion of flu virus A and make it challenging to predict the antigenic properties of new viral strains. Despite considerable progress in modeling antigenic variation from amino acid sequences, few studies have explored deep learning frameworks, which may be particularly well suited to this task. Here, we propose a novel deep learning approach that combines a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) neural network to predict antigenic variation. In this approach, the CNN extracts the complex local contexts of amino acids while the BLSTM network captures long-distance sequence information. Compared with existing methods, our deep learning approach achieves the highest overall prediction performance on the validation dataset and, more encouragingly, reaches prediction agreements of 99.20% and 96.46% for strains in the forthcoming year and in the next two years, respectively, within an existing set of chronological amino acid sequences. These results indicate that our deep learning approach is a promising tool for antigenic variation prediction of flu virus A H3N2.
Year: 2021 PMID: 34697557 PMCID: PMC8541863 DOI: 10.1155/2021/9997669
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Flowchart of our deep learning approach using the one-dimensional CNN and the BLSTM module.
The main layers in the deep learning approach and the candidate values of their tunable parameters.
| Layer | Parameter |
|---|---|
| Convolution1D_1 | Filter (8, 16, 32, and 64), kernel size (2, 5, 10, and 15), strides (1) |
| MaxPooling_1 | Pool size (2), strides (1) |
| Convolution1D_2 | Filter (8, 16, 32, and 64), kernel size (2, 5, 10, and 15), strides (1) |
| MaxPooling_2 | Pool size (2), strides (1) |
| BLSTM_1 | Memory cell (32, 64, 128, and 256) |
| BLSTM_2 | Memory cell (32, 64, 128, and 256) |
| Dense_1 | Output space (64) |
| Dropout_1 | Rate (0.6) |
| Dense_2 | Output space (25) |
| Dropout_2 | Rate (0.6) |
| Softmax | Output space (1) |
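The parameter values in parentheses above are candidate settings explored during tuning rather than a single final configuration. As an illustration only, a minimal Keras-style sketch of this layer stack is given below, assuming arbitrary picks from the listed candidates (32 filters, kernel size 10, 64 BLSTM units) and a per-residue input encoding that is not specified in this record; the final sigmoid unit stands in for the table's size-1 softmax output.

```python
# Hypothetical sketch only: the layer types follow the table above, but the filter
# count, kernel size, BLSTM width and input encoding are assumptions of this sketch,
# not the authors' final settings.
from tensorflow.keras import layers, models

SEQ_LEN = 329     # assumed HA1 sequence length (illustrative, not from this record)
N_FEATURES = 21   # assumed per-residue feature/encoding dimension (illustrative)

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.Conv1D(filters=32, kernel_size=10, strides=1, activation="relu"),  # Convolution1D_1
    layers.MaxPooling1D(pool_size=2, strides=1),                              # MaxPooling_1
    layers.Conv1D(filters=32, kernel_size=10, strides=1, activation="relu"),  # Convolution1D_2
    layers.MaxPooling1D(pool_size=2, strides=1),                              # MaxPooling_2
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),             # BLSTM_1
    layers.Bidirectional(layers.LSTM(64)),                                    # BLSTM_2
    layers.Dense(64, activation="relu"),                                      # Dense_1
    layers.Dropout(0.6),                                                      # Dropout_1
    layers.Dense(25, activation="relu"),                                      # Dense_2
    layers.Dropout(0.6),                                                      # Dropout_2
    layers.Dense(1, activation="sigmoid"),  # the table lists a size-1 softmax output;
                                            # a single sigmoid unit is used here for
                                            # the binary variant/similar decision
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```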
The prediction results obtained from different deep learning models with the position feature alone and in combination with the other three features.
| Model | Agreement (%) | Sensitivity (%) | Specificity (%) | MCC |
|---|---|---|---|---|
| Position | 95.73 | 95.18 | 96.12 | 0.914 |
| Position-epitope | 97.16 | 96.85 | 97.34 | 0.939 |
| Position-Gly | 95.02 | 93.84 | 95.75 | 0.895 |
| Position-RBD | 94.74 | 92.42 | 96.44 | 0.892 |
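Agreement, sensitivity, specificity, and MCC reported throughout these tables are standard confusion-matrix statistics over binary labels (1 = antigenically variant pair, 0 = antigenically similar pair). A minimal helper that computes them, with a hypothetical name and interface, could look like:

```python
import math

def confusion_metrics(y_true, y_pred):
    """Agreement (accuracy), sensitivity, specificity, and MCC for binary labels.

    Assumes 1 = antigenically variant pair and 0 = antigenically similar pair;
    the function name and interface are hypothetical, for illustration only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    agreement = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return agreement, sensitivity, specificity, mcc
```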
The prediction performance of our deep learning approach and other existing approaches.
| Approaches | Training set | Validation set | Agreementᵃ (%) | Sensitivityᵃ (%) | Specificityᵃ (%) | MCCᵃ |
|---|---|---|---|---|---|---|
| Multiple regression [ | 181 HI experiments | 31878 pairs in Smith's datasetᵇ | 89.89 | — | — | — |
| Multiple regression on physicochemical properties [ | 394 HI experiments | 31878 pairs in Smith's datasetᵇ | 96.96 | 99.55 | 82.30 | 0.877 |
| Decision tree [ | 181 HI experiments | 31878 pairs in Smith's datasetᵇ | 96.20 | — | — | — |
| Joint random forest methodᶜ [ | 28690 pairs in Smith's dataset | 31878 pairs in Smith's datasetᵇ | 96.4 | 98.1 | 77.7 | 0.758 |
| Stacked autoencoderᵈ [ | 80% of the 8097 pairs in a concise version of Smith's dataset | 20% of the 8097 pairs in a concise version of Smith's dataset | 95 | 95 | 93 | — |
| Our deep learning approachᵉ | The filtered virus pairs formed by 70% of 253 strains in Smith's dataset | The filtered virus pairs formed by 30% of 253 strains in Smith's dataset | 97.16 | 96.85 | 97.34 | 0.939 |
ᵃ The mark "—" means that there are no relevant data in the literature.
ᵇ Smith's dataset contains 31878 pairwise comparisons among 253 viral strains belonging to 11 clusters; of these 31878 virus pairs, 27098 pairs composed of strains from different clusters contain antigenic variations, whereas 4780 pairs composed of strains from the same cluster possess similar antigens [36].
ᶜ Yao et al. performed 10-fold cross-validation on Smith's dataset.
ᵈ The stacked autoencoder model was developed on a concise dataset obtained by removing from Smith's dataset the sequence pairs containing more than 9 antigenic variation-causing mutations and then removing the redundant pairs.
ᵉ Our deep learning method was developed on a more concise dataset built from Smith's dataset (for details of dataset construction, see Section 2.1); the advantage of our dataset is that the strains forming the virus pairs in the training set and the validation set are completely non-overlapping.
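Footnote ᵉ stresses that training and validation pairs are built from disjoint strain subsets (70% vs. 30% of the 253 strains), so no strain contributes pairs to both sets. A minimal sketch of such a strain-level split, assuming a hypothetical `pair_labels` mapping from strain-name tuples to binary antigenic labels, could be:

```python
import random

def strain_level_split(strains, pair_labels, train_frac=0.7, seed=0):
    """Split strains 70/30, then keep only pairs whose two strains lie in the
    same split, so no strain contributes pairs to both sets (cf. footnote e).

    `pair_labels` maps (strain_a, strain_b) -> 0/1 antigenic-variant label;
    the function and variable names are hypothetical, for illustration only."""
    rng = random.Random(seed)
    shuffled = list(strains)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    train_strains = set(shuffled[:n_train])
    val_strains = set(shuffled[n_train:])
    train_pairs = {p: y for p, y in pair_labels.items()
                   if p[0] in train_strains and p[1] in train_strains}
    val_pairs = {p: y for p, y in pair_labels.items()
                 if p[0] in val_strains and p[1] in val_strains}
    return train_pairs, val_pairs
```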
The results of antigenic variation prediction for flu A H3N2 in the forthcoming year and in the next two years using our deep learning approach.
| Prediction duration | Agreement (%) | Sensitivity (%) | Specificity (%) | MCC |
|---|---|---|---|---|
| Next year | 99.20 | 98.59 | 99.32 | 0.972 |
| Next two years | 96.46 | 98.58 | 96.24 | 0.830 |
Comparison between the agreements obtained by our deep learning approach and the Antigen-Bridges method with three residue sets [21] for the strains in the forthcoming year and in the next two years.
| Approach (residue set) | Next year (%) | Next two years (%) |
|---|---|---|
| Antigen-Bridges (39-residue set) | 83.78 | 75.10 |
| Antigen-Bridges (44-residue set) | 79.75 | 72.48 |
| Antigen-Bridges (25-residue set) | 80.51 | 71.51 |
| Our deep learning approach | 99.20 | 96.46 |