Xinrui Zhou1, Rui Yin1, Chee-Keong Kwoh2, Jie Zheng3.
Abstract
BACKGROUND: The evolution of influenza A viruses leads to antigenic changes. Serological diagnosis of antigenicity is usually labor-intensive, time-consuming, and unsuitable for early-stage detection. Computational prediction of the antigenic relationship between emerging and old strains of influenza viruses from viral sequences can facilitate large-scale antigenic characterization, especially for viruses that require high-biosafety facilities, such as H5 and H7 influenza A viruses. However, most computational models require carefully designed subtype-specific features and are therefore restricted to a single subtype.
Keywords: Antigenicity; Classification; Encoding scheme; Influenza
Year: 2018 PMID: 30598102 PMCID: PMC6311925 DOI: 10.1186/s12864-018-5282-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 A pipeline for machine-learning projects and an illustration of CFreeEnS. a Encoding a non-numeric dataset into equal-length numeric vectors is necessary for both traditional machine learning models and deep neural networks. b CFreeEnS encodes m aligned protein sequence pairs of length l with k substitution matrices, resulting in a numeric feature matrix X with dimension m×k×l
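The encoding in Fig. 1b can be sketched as follows: each aligned sequence pair of length l is scored position by position under each of k substitution matrices, giving a k×l feature vector per pair (stacked over m pairs into the m×k×l matrix X). This is a minimal reading of the caption, not the authors' code; the two tiny 2×2 score tables below are illustrative stand-ins for real substitution matrices such as BLOSUM62.

```python
def encode_pair(seq_a, seq_b, matrices):
    """Return a flat list of k*l scores for one aligned sequence pair."""
    assert len(seq_a) == len(seq_b), "pairs must be aligned to equal length"
    features = []
    for m in matrices:                  # k substitution matrices
        for a, b in zip(seq_a, seq_b):  # l aligned positions
            features.append(m[(a, b)])
    return features

# Toy symmetric score tables over the alphabet {A, G} (hypothetical values)
m1 = {("A", "A"): 4, ("G", "G"): 6, ("A", "G"): -1, ("G", "A"): -1}
m2 = {("A", "A"): 1, ("G", "G"): 1, ("A", "G"): 0, ("G", "A"): 0}

x = encode_pair("AAG", "AGG", [m1, m2])
# k=2 matrices, l=3 positions -> a feature vector of length 6
```

Because the scores come directly from published substitution matrices, no subtype-specific feature engineering is needed, which is what lets one encoding serve all four subtypes.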
Datasets for training and testing the prediction model
| Subtype | Number of sequences | T | D/T | HA1 length |
|---|---|---|---|---|
| H1N1 | 68 | 355 | 0.50 | 327 |
| H3N2 | 621 | 791 | 0.47 | 329 |
| H5N1 | 148 | 293 | 0.57 | 320 |
| H9N2 | 29 | 118 | 0.68 | 317 |
| Combined | 866 | 1557 | 0.52 | 340 |
T: total number of viral pairs; D: number of antigenically distinct viral pairs; Combined: the combined dataset of H1N1, H3N2, H5N1 and H9N2
Fig. 2 Evaluation of all substitution matrices on single-subtype datasets. The 94 substitution matrices have an average testing accuracy above 80% with small standard deviation, except on A/H9N2. For each dataset, a different substitution matrix yields the highest testing accuracy
Fig. 3 Comparison of F-scores of models on single-subtype influenza virus datasets
Performance comparison among five strategies on four single-subtype datasets

| Dataset | Method | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| H1N1 | Liao et al. | 0.742 | 0.717 | 0.877 | 0.788 |
| | MutCounts | 0.824 | 0.802 | 0.884 | 0.840 |
| | Peng et al. | 0.661 | 0.671 | 0.711 | 0.683 |
| | RegionBand | 0.706 | 0.669 | – | 0.766 |
| | CFreeEnS | – | – | 0.887 | – |
| H3N2 | Liao et al. | 0.784 | 0.748 | 0.891 | 0.812 |
| | MutCounts | 0.843 | 0.841 | 0.851 | 0.845 |
| | Peng et al. | 0.720 | 0.658 | – | 0.777 |
| | RegionBand | 0.790 | 0.763 | 0.864 | 0.809 |
| | CFreeEnS | – | – | 0.882 | – |
| H5N1 | Liao et al. | 0.753 | 0.758 | 0.878 | 0.813 |
| | MutCounts | 0.863 | 0.859 | 0.915 | 0.885 |
| | Peng et al. | 0.846 | 0.857 | 0.908 | 0.880 |
| | RegionBand | 0.858 | 0.824 | – | 0.893 |
| | CFreeEnS | – | – | 0.965 | – |
| H9N2 | Liao et al. | 0.708 | 0.816 | 0.819 | 0.810 |
| | MutCounts | 0.775 | 0.823 | 0.914 | 0.859 |
| | Peng et al. | 0.633 | – | 0.601 | 0.702 |
| | RegionBand | 0.804 | 0.818 | 0.954 | 0.880 |
| | CFreeEnS | – | 0.860 | – | – |

The highest scores among the five strategies on each dataset are colored red in the original table; those values did not survive extraction and are shown as –
Performance comparison among five strategies on the combined dataset

| Dataset | Method | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| Combined | Liao et al. | 0.739 | 0.716 | 0.879 | 0.789 |
| | MutCounts | 0.698 | 0.675 | – | 0.781 |
| | Peng et al. | 0.741 | 0.757 | 0.800 | 0.775 |
| | RegionBand | 0.751 | 0.723 | 0.912 | 0.807 |
| | CFreeEnS-4 | – | – | 0.900 | – |

The highest scores among the five strategies are colored red in the original table; those values did not survive extraction and are shown as –
Fig. 4 Accuracy scores of transfer learning using three encoding schemes: MutCounts, RegionBand and CFreeEnS. MutCounts: features used in the method proposed by Liao et al. [3]; RegionBand: features used in the method proposed by Peng et al. [8]. All models use random forest as the downstream learning method
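The cross-subtype evaluation behind Fig. 4 amounts to: fit a classifier on encoded pairs from one subtype, then measure accuracy on pairs from another. The sketch below shows only that protocol; the paper's random-forest learner is replaced by a tiny nearest-centroid classifier so the example stays dependency-free, and the encoded feature vectors and labels are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit(X, y):
    """One centroid per class label (0 = antigenically similar, 1 = distinct)."""
    return {c: centroid([x for x, t in zip(X, y) if t == c]) for c in set(y)}

def predict(model, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, model[c]))
    return min(model, key=dist)

# Hypothetical encoded pairs: train on "subtype A", test on "subtype B"
X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_train = [0, 0, 1, 1]
X_test, y_test = [[0.1, 0.2], [1.0, 1.0]], [0, 1]

model = fit(X_train, y_train)
preds = [predict(model, x) for x in X_test]
acc = sum(p == t for p, t in zip(preds, y_test)) / len(preds)
```

The key point from Fig. 4 is that this transfer setup only works when the encoding (MutCounts, RegionBand, or CFreeEnS) is shared across subtypes, since the test subtype never appears in training.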