Zheng Kou, Xinyue Fan, Junjie Li, Zehui Shao, Xiaoli Qiang.
Abstract
BACKGROUND: Influenza B virus can cause epidemics with high pathogenicity and therefore poses a serious threat to public health. This paper proposes a feature representation algorithm to identify the pathogenicity phenotype of influenza B virus.
Keywords: Amino acid feature; Influenza B virus; Machine learning; Pathogenicity
Year: 2022 PMID: 35509019 PMCID: PMC9066401 DOI: 10.1186/s40249-022-00974-0
Source DB: PubMed Journal: Infect Dis Poverty ISSN: 2049-9957 Impact factor: 10.485
Fig. 1 Flowchart of pathogenicity identification of IBV. After the data were downloaded and cleaned, 40 signature positions were first screened on the basis of entropy. Six amino acid encoding methods with adjustable parameters were used to extract features. Then, 67 descriptors were proposed, and two types of informative outputs from the RF method were obtained and further optimized with the mRMR algorithm and the SFS strategy. Each strain was finally represented by two optimized, low-dimensional informative features, ‘class’ and ‘prob’. These optimal subsets were used to construct predictive models
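As a rough illustration of what two of the encodings named in the flowchart compute, the following Python sketch implements an amino acid composition (AAC) vector and a twenty-bit (BIT20) one-hot encoding. The residue ordering in `AMINO_ACIDS`, the function names, and the choice to encode the first `k` residues of the input are assumptions made for this example; the record does not specify them.

```python
# Assumed ordering of the 20 standard amino acids (not given in the source).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: the fraction of each of the 20 residues (20 values)."""
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def bit20(seq, k):
    """One-hot encode the first k residues with 20 bits each (20 * k values).

    Taking the *first* k residues is an assumption for illustration only.
    """
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    bits = []
    for residue in seq[:k]:
        bits.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return bits


composition = aac("ACDA")       # 20 fractions that sum to 1
encoded = bit20("ACDEFGHI", 4)  # 80 bits, exactly one set per residue
```

With `k = 4` the BIT20 vector has 80 entries, which agrees with the feature number listed for BIT20 (k = 4) in the descriptor table.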
Summary of feature descriptor and feature number
| Feature descriptor | Feature type | Feature number | Feature descriptor | Feature type | Feature number |
|---|---|---|---|---|---|
| 1 | AAC | 20 | 35 | GGAP (g = 15) | 441 |
| 2 | PseAAC (λ = 0) | 21 | 36 | GGAP (g = 16) | 441 |
| 3 | PseAAC (λ = 1) | 22 | 37 | GGAP (g = 17) | 441 |
| 4 | PseAAC (λ = 2) | 23 | 38 | BIT20 (k = 4) | 80 |
| 5 | PseAAC (λ = 3) | 24 | 39 | BIT20 (k = 8) | 160 |
| 6 | PseAAC (λ = 4) | 25 | 40 | BIT20 (k = 12) | 240 |
| 7 | PseAAC (λ = 5) | 26 | 41 | BIT20 (k = 16) | 320 |
| 8 | PseAAC (λ = 6) | 27 | 42 | BIT20 (k = 20) | 400 |
| 9 | PseAAC (λ = 7) | 28 | 43 | BIT20 (k = 24) | 480 |
| 10 | PseAAC (λ = 8) | 29 | 44 | BIT20 (k = 28) | 560 |
| 11 | PseAAC (λ = 9) | 30 | 45 | BIT20 (k = 32) | 640 |
| 12 | PseAAC (λ = 10) | 31 | 46 | BIT20 (k = 36) | 720 |
| 13 | PseAAC (λ = 11) | 32 | 47 | BIT20 (k = 40) | 800 |
| 14 | PseAAC (λ = 12) | 33 | 48 | BIT21 (k = 4) | 84 |
| 15 | PseAAC (λ = 13) | 34 | 49 | BIT21 (k = 8) | 168 |
| 16 | PseAAC (λ = 14) | 35 | 50 | BIT21 (k = 12) | 252 |
| 17 | PseAAC (λ = 15) | 36 | 51 | BIT21 (k = 16) | 336 |
| 18 | PseAAC (λ = 16) | 37 | 52 | BIT21 (k = 20) | 420 |
| 19 | PseAAC (λ = 17) | 38 | 53 | BIT21 (k = 24) | 504 |
| 20 | GGAP (g = 0) | 441 | 54 | BIT21 (k = 28) | 588 |
| 21 | GGAP (g = 1) | 441 | 55 | BIT21 (k = 32) | 672 |
| 22 | GGAP (g = 2) | 441 | 56 | BIT21 (k = 36) | 756 |
| 23 | GGAP (g = 3) | 441 | 57 | BIT21 (k = 40) | 840 |
| 24 | GGAP (g = 4) | 441 | 58 | OLP (k = 4) | 44 |
| 25 | GGAP (g = 5) | 441 | 59 | OLP (k = 8) | 88 |
| 26 | GGAP (g = 6) | 441 | 60 | OLP (k = 12) | 132 |
| 27 | GGAP (g = 7) | 441 | 61 | OLP (k = 16) | 176 |
| 28 | GGAP (g = 8) | 441 | 62 | OLP (k = 20) | 220 |
| 29 | GGAP (g = 9) | 441 | 63 | OLP (k = 24) | 264 |
| 30 | GGAP (g = 10) | 441 | 64 | OLP (k = 28) | 308 |
| 31 | GGAP (g = 11) | 441 | 65 | OLP (k = 32) | 352 |
| 32 | GGAP (g = 12) | 441 | 66 | OLP (k = 36) | 396 |
| 33 | GGAP (g = 13) | 441 | 67 | OLP (k = 40) | 440 |
| 34 | GGAP (g = 14) | 441 | | | |
AAC amino acid composition, PC-PseAAC parallel correlation-based pseudo-amino-acid composition, GGAP the G-gap dipeptide composition, BIT20 twenty-bit feature, BIT21 twenty-one-bit feature, OLP overlapping property feature
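The feature numbers in the table follow simple patterns in the parameters: 21 + λ for PseAAC, 21 × 21 for GGAP, and 20k, 21k, and 11k for BIT20, BIT21, and OLP. The Python sketch below rebuilds the 67 rows from those patterns. The formulas are inferred from the tabulated values and are not stated in this record, and the function name is hypothetical.

```python
def feature_counts():
    """Rebuild (feature type, parameter, feature number) for all 67 descriptors.

    The dimension formulas are inferred from the table, not taken from the authors.
    """
    rows = [("AAC", None, 20)]
    # PseAAC with lambda = 0..17 gives 21..38 features.
    rows += [("PseAAC", lam, 21 + lam) for lam in range(18)]
    # GGAP with g = 0..17 always gives 21 * 21 = 441 features.
    rows += [("GGAP", g, 21 * 21) for g in range(18)]
    # The k-dependent encodings use k = 4, 8, ..., 40.
    for name, per_position in (("BIT20", 20), ("BIT21", 21), ("OLP", 11)):
        rows += [(name, k, per_position * k) for k in range(4, 41, 4)]
    return rows


rows = feature_counts()
for number, (name, param, dim) in enumerate(rows, start=1):
    print(number, name, param, dim)
```

The row order matches the descriptor numbering in the table, so descriptor 35 is GGAP (g = 15) and descriptor 67 is OLP (k = 40).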
Fig. 2 Proportion of IBV in all positive samples per influenza season. The x-axis represents the seasons from 1997 to 2000. The y-axis represents the positive proportion for IBV. The ratio of 35% is shown by the dotted blue line
Amino acid set for pathogenicity identification
| Number | Protein | Position^a | Entropy | Number | Protein | Position | Entropy |
|---|---|---|---|---|---|---|---|
| 1 | PB1 | 57 | 0.66 | 21 | NA | 49 | 0.82 |
| 2 | PB1 | 752 | 0.68 | 22 | NA | 73 | 0.72 |
| 3 | PA | 352 | 0.73 | 23 | NA | 120 | 0.66 |
| 4 | HA | 47 | 0.70 | 24 | NA | 295 | 0.72 |
| 5 | HA | 74 | 0.69 | 25 | NA | 320 | 0.67 |
| 6 | HA | 115 | 0.67 | 26 | NA | 342 | 0.84 |
| 7 | HA | 128 | 0.94 | 27 | NA | 358 | 0.67 |
| 8 | HA | 132 | 0.70 | 28 | NA | 373 | 0.83 |
| 9 | HA | 135 | 1.06 | 29 | NA | 384 | 0.65 |
| 10 | HA | 145 | 0.69 | 30 | NA | 389 | 0.66 |
| 11 | HA | 149 | 0.67 | 31 | NA | 392 | 0.67 |
| 12 | HA | 161 | 0.99 | 32 | NA | 395 | 0.99 |
| 13 | HA | 162 | 0.70 | 33 | NA | 465 | 0.66 |
| 14 | HA | 173 | 0.65 | 34 | NB | 21 | 0.85 |
| 15 | HA | 200 | 0.68 | 35 | NB | 99 | 0.71 |
| 16 | HA | 228 | 0.72 | 36 | NS1 | 111 | 0.85 |
| 17 | HA | 231 | 0.67 | 37 | NS1 | 115 | 0.70 |
| 18 | NP | 9 | 0.66 | 38 | NS1 | 120 | 0.67 |
| 19 | NP | 66 | 0.65 | 39 | NS1 | 127 | 0.68 |
| 20 | NA | 45 | 0.66 | 40 | NEP | 88 | 0.71 |
PB1 polymerase basic protein 1, PA polymerase acid protein, HA hemagglutinin, NP nucleoprotein, NA neuraminidase, NB glycoprotein NB, NS1 nonstructural protein 1, NEP nuclear export protein
^a B/Wisconsin/23/2019 (EPI_ISL_357982) as reference strain
Fig. 3 Signature positions in the 11 viral proteins. A Profile of 40 signature positions from positive samples of IBV. B Profile of 40 signature positions from negative samples of IBV. The x-axis represents the signature position in viral proteins. The y-axis represents the entropy value
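The entropy values plotted here and listed in the table can be understood as per-position Shannon entropies over aligned sequences. The Python sketch below shows one way to compute them and to keep positions above a cutoff. The logarithm base, the `threshold` parameter, and both function names are assumptions; this record gives neither the base nor the cutoff that was used.

```python
import math
from collections import Counter


def column_entropy(residues, base=math.e):
    """Shannon entropy of the residues observed at one alignment position."""
    counts = Counter(residues)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def screen_positions(alignment, threshold, base=math.e):
    """Return (1-based position, entropy) pairs whose entropy is >= threshold.

    `alignment` is a list of equal-length sequences; `threshold` is a
    hypothetical cutoff chosen by the caller.
    """
    hits = []
    for position, column in enumerate(zip(*alignment), start=1):
        h = column_entropy(column, base)
        if h >= threshold:
            hits.append((position, h))
    return hits


toy_alignment = ["MKA", "MRA", "MKA", "MRA"]  # toy data, not from the study
hits = screen_positions(toy_alignment, threshold=0.5, base=2)
```

In the toy alignment only the second position varies, so it is the only one that passes the cutoff.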
Fig. 4 Optimization of informative features. A The SFS curves for the ACC of ‘class’ and ‘prob’ features. B The SFS curves for the MCC of ‘class’ and ‘prob’ features. The x-axis represents the incremental numbers of informative features. The y-axis represents the metric for the ACC and MCC. The ACC is marked in blue, while the MCC is marked in yellow
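The curves over an incremental number of features can be produced by adding features one at a time and re-scoring the growing subset. The Python sketch below assumes the features are already ranked (for example by mRMR) and that a caller-supplied `score_subset` function returns a metric such as ACC or MCC; both are assumptions for illustration. Note that this is a prefix-based variant over a fixed ranking. A classical sequential forward search would instead re-evaluate every remaining candidate at each step.

```python
def sequential_forward_search(ranked_features, score_subset):
    """Score growing prefixes of a ranked feature list and keep the best one.

    Returns (best_subset, best_score, curve), where curve is a list of
    (number_of_features, score) points like those plotted in an SFS curve.
    Ties are resolved in favour of the smaller subset.
    """
    curve = []
    for n in range(1, len(ranked_features) + 1):
        curve.append((n, score_subset(ranked_features[:n])))
    best_n, best_score = max(curve, key=lambda point: (point[1], -point[0]))
    return ranked_features[:best_n], best_score, curve


# Toy scorer (hypothetical): rewards two useful features, penalises the rest.
USEFUL = {"f1", "f2"}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & USEFUL) - 0.1 * len(chosen - USEFUL)


best_subset, best_score, curve = sequential_forward_search(
    ["f1", "f2", "f3", "f4"], toy_score
)
```

In practice `score_subset` would wrap a cross-validated classifier, which is the expensive part of the search.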
Performance of the informative features
| Features | ACC | SE | SP | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Class features | 94.0 | 94.1 | 93.8 | 87.9 | 814 | 806 | 53 | 51 |
| Probabilistic features | 93.9 | 94.6 | 93.1 | 87.7 | 818 | 800 | 59 | 47 |
| Optimal class features | 94.2 | 95.0 | 93.4 | 88.4 | 822 | 802 | 57 | 43 |
| Optimal probabilistic features | 94.1 | 94.9 | 93.3 | 88.2 | 820 | 802 | 58 | 44 |
SE sensitivity, SP specificity, ACC accuracy, MCC Matthews correlation coefficient, TP true positive, TN true negative, FP false positive, FN false negative
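The four metrics in the table follow from the confusion-matrix counts by their standard definitions. The Python sketch below computes them as percentages and checks the result against the "Class features" row (TP = 814, TN = 806, FP = 53, FN = 51). The function name is hypothetical, and reporting MCC as a percentage is inferred from the tabulated values.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return ACC, SE, SP and MCC as percentages from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)   # sensitivity: recall on the positive class
    sp = tn / (tn + fp)   # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {name: 100 * value for name, value in
            (("ACC", acc), ("SE", se), ("SP", sp), ("MCC", mcc))}


metrics = confusion_metrics(tp=814, tn=806, fp=53, fn=51)
print({name: round(value, 1) for name, value in metrics.items()})
```

Rounded to one decimal place, these values match the ACC, SE, SP, and MCC reported for the "Class features" row.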
Performance of the optimal class features
| Feature | ACC | SE | SP | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Optimal class features | 94.2 | 95.0 | 93.4 | 88.4 | 822 | 802 | 57 | 43 |
| OLP (k = 28) | 91.1 | 86.6 | 95.7 | 82.6 | 749 | 822 | 37 | 116 |
| PC-PseAAC (λ = 5) | 94.0 | 94.0 | 94.1 | 88.1 | 813 | 808 | 51 | 52 |
| GGAP (g = 5) | 93.6 | 92.7 | 94.4 | 87.1 | 802 | 811 | 48 | 63 |
| BIT20 (k = 12) | 91.0 | 86.2 | 95.7 | 82.3 | 746 | 822 | 37 | 119 |
SE sensitivity, SP specificity, ACC accuracy, MCC Matthews correlation coefficient, TP true positive, TN true negative, FP false positive, FN false negative, PC-PseAAC parallel correlation-based pseudo-amino-acid composition, GGAP the G-gap dipeptide composition, BIT20 twenty-bit feature, OLP overlapping property feature
Performance of the optimal probabilistic features
| Feature | ACC | SE | SP | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Optimal probabilistic features | 94.1 | 94.9 | 93.3 | 88.2 | 820 | 802 | 58 | 44 |
| BIT21 (k = 32) | 91.8 | 88.3 | 95.3 | 83.9 | 764 | 819 | 40 | 101 |
| BIT20 (k = 4) | 91.0 | 86.2 | 95.7 | 82.3 | 746 | 822 | 37 | 119 |
| AAC | 93.8 | 93.5 | 94.1 | 87.6 | 809 | 808 | 51 | 56 |
| BIT21 (k = 4) | 91.0 | 86.2 | 95.8 | 82.4 | 746 | 823 | 36 | 119 |
SE sensitivity, SP specificity, ACC accuracy, MCC Matthews correlation coefficient, TP true positive, TN true negative, FP false positive, FN false negative, AAC amino acid composition, BIT20 twenty-bit feature, BIT21 twenty-one-bit feature
Performance of the SFS strategy
| Learning strategies | ACC | SE | SP | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Optimal class features | 94.2 | 95.0 | 93.4 | 88.4 | 822 | 802 | 57 | 43 |
| Optimal probabilistic features | 94.1 | 94.9 | 93.3 | 88.2 | 820 | 802 | 58 | 44 |
| Major voting | 93.5 | 92.0 | 95.0 | 87.1 | 796 | 816 | 43 | 69 |
| Probability averaging | 93.0 | 90.9 | 95.2 | 86.2 | 786 | 818 | 41 | 79 |
SFS sequential forward search, SE sensitivity, SP specificity, ACC accuracy, MCC Matthews correlation coefficient, TP true positive, TN true negative, FP false positive, FN false negative
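The two baseline strategies in the table, majority voting and probability averaging, combine the outputs of several per-descriptor models directly rather than learning from them. The Python sketch below shows the usual form of each rule for binary labels. The 0.5 threshold, the treatment of ties as negative in the vote, and the function names are assumptions; this record does not describe how the authors handled those details.

```python
def majority_vote(class_preds):
    """Combine 0/1 labels from several models; one inner list per model.

    A sample is called positive only if strictly more than half of the
    models predict 1 (ties count as negative, an assumption).
    """
    n_models = len(class_preds)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*class_preds)]


def probability_average(prob_preds, threshold=0.5):
    """Average positive-class probabilities across models, then threshold."""
    n_models = len(prob_preds)
    return [1 if sum(probs) / n_models >= threshold else 0
            for probs in zip(*prob_preds)]


# Toy outputs from three hypothetical models.
voted = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
averaged = probability_average([[0.9, 0.2], [0.4, 0.3], [0.6, 0.9]])
```

The second sample in the averaging example shows how the two rules can differ in spirit: a single confident model (0.9) is outweighed by two low probabilities once the values are averaged.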
Fig. 5 Comparison of four traditional classifiers. A Performances of the optimal ‘class’ features. B Performances of the optimal ‘prob’ features. C ROC curves of the optimal ‘class’ features. D ROC curves of the optimal ‘prob’ features