Zeeshan Khaliq, Mikael Leijon, Sándor Belák, Jan Komorowski.
Abstract
BACKGROUND: The strategies that influenza A viruses (IAVs) use to adapt to new hosts when crossing the species barrier are complex and not yet completely understood. Several published studies have identified singular genomic signatures that indicate such a host switch. The complexity of the problem suggests that, in addition to these singular signatures, combinations of such genomic features may be used in nature to define adaptation to hosts.
Keywords: Combinatorial signatures; Host adaptation; Host-specific signatures; Influenza A virus; MCFS; Rosetta; Rough sets
Year: 2016 PMID: 27473048 PMCID: PMC4966792 DOI: 10.1186/s12864-016-2919-4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
The training data
| Protein | H1N1 avian sequences | H1N1 human sequences | H3N2 avian sequences | H3N2 human sequences | Total features | Features after MCFS (H1N1) | Features after MCFS (H3N2) |
|---|---|---|---|---|---|---|---|
| HA | 214 | 5205 | 164 | 3715 | 628 | 115 | 88 |
| NA | 205 | 3093 | 173 | 3412 | 517 | 93 | 79 |
| NS1 | 150 | 1258 | 150 | 1176 | 249 | 98 | 85 |
| NEP | 61 | 407 | 54 | 299 | 124 | 31 | 26 |
| NP | 125 | 839 | 93 | 773 | 506 | 61 | 69 |
| M1 | 45 | 467 | 42 | 355 | 275 | 18 | 15 |
| M2 | 65 | 461 | 64 | 503 | 98 | 25 | 23 |
| PA | 192 | 1677 | 143 | 1358 | 726 | 65 | 47 |
| PA-X | 57 | 164 | 45 | 244 | 252 | 28 | 24 |
| PB1 | 171 | 1654 | 132 | 1347 | 762 | 59 | 33 |
| PB2 | 184 | 1817 | 136 | 1297 | 776 | 52 | 42 |
| PB1-F2 | 151 | 224 | 112 | 737 | 101 | 64 | 54 |
Total features is the total number of amino acid (aa) positions investigated. Features after MCFS are the aa positions ranked as significant, i.e. those with power to discriminate avian from human sequences
10-fold cross-validation accuracies
| Protein | H1N1 mean accuracy (%) | H3N2 mean accuracy (%) |
|---|---|---|
| HA | 98 | 98.7 |
| M1 | 87.7 | 88.8 |
| M2 | 87.6 | 92.9 |
| NA | 93.9 | 98.6 |
| NP | 93 | 97.3 |
| NS1 | 93.1 | 98.9 |
| NEP | 83.4 | 95.3 |
| PA | 95.1 | 97.9 |
| PA-X | 95.9 | 97.7 |
| PB1 | 94.7 | 95.1 |
| PB1F2 | 95.5 | 92.3 |
| PB2 | 95.9 | 97.5 |
Cross-validation accuracies of the 100 classifiers were averaged
Performance of the models on their corresponding complete data sets
| Protein | H1N1 sensitivity | H1N1 specificity | H1N1 MCC | H3N2 sensitivity | H3N2 specificity | H3N2 MCC |
|---|---|---|---|---|---|---|
| HA | 0.999 | 0.953 | 0.961 | 1 | 0.987 | 0.993 |
| M1 | 1 | 0.881 | 0.934 | 0.994 | 1 | 0.971 |
| M2 | 1 | 0.859 | 0.918 | 0.996 | 0.873 | 0.908 |
| NA | 1 | 0.907 | 0.95 | 1 | 0.908 | 0.95 |
| NP | 1 | 0.864 | 0.92 | 0.994 | 0.957 | 0.946 |
| NS1 | 0.998 | 0.932 | 0.954 | 0.991 | 0.993 | 0.96 |
| NEP | 0.995 | 0.883 | 0.912 | 0.997 | 1 | 0.988 |
| PA-X | 0.901 | 1 | 0.856 | 1 | 1 | 1 |
| PA | 0.972 | 0.979 | 0.892 | 0.996 | 0.979 | 0.969 |
| PB1-F2 | 0.91 | 0.987 | 0.884 | 0.999 | 0.778 | 0.861 |
| PB1 | 0.993 | 0.93 | 0.923 | 1 | 0.879 | 0.932 |
| PB2 | 0.989 | 0.984 | 0.935 | 0.996 | 0.985 | 0.972 |
Sensitivity is the ability to correctly predict human sequences and specificity is the ability to correctly predict avian sequences where 1 means perfect prediction and 0 means no correct prediction. Matthews correlation coefficient (MCC) value is a measure of how well the model performs overall where 1 means a perfect classification, 0 is for a prediction no better than random and −1 indicates a total disagreement between predictions and observations. “na” means the measure could not be calculated for the given model
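The three measures defined in this note can be computed from a confusion matrix. The following is a minimal sketch, assuming human is the positive class (as the note states) and using the standard MCC formula; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, mcc).

    tp/fn: human sequences predicted as human/avian.
    tn/fp: avian sequences predicted as avian/human.
    A value that cannot be calculated is returned as None ("na").
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else None
    return sensitivity, specificity, mcc


# Hypothetical counts for illustration only (not from the paper).
sens, spec, mcc = classification_metrics(tp=990, fn=10, tn=90, fp=10)
print(sens, spec, mcc)
```

Because avian sequences are far fewer than human ones in the training data, MCC is a useful complement to sensitivity alone: it drops quickly when the minority class is misclassified.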
Example rule-based model
| Rule | Accuracy (%) | Support | Decision coverage (%) |
|---|---|---|---|
| IF P435 = I THEN host = Human | 99.9 | 5128 | 98.4 |
| IF P200 = S THEN host = Human | 99.9 | 4052 | 77.8 |
| IF P10 = Y THEN host = Human | 99.8 | 3998 | 76.7 |
| IF P88 = S THEN host = Human | 99.9 | 3989 | 76.5 |
| IF P6 = V THEN host = Human | 99.8 | 3936 | 75.5 |
| IF P222 = R THEN host = Human | 99.9 | 3823 | 73.4 |
| IF P220 = T THEN host = Human | 100.0 | 3584 | 68.8 |
| IF P516 = K THEN host = Human | 99.9 | 1818 | 34.9 |
| IF P200 = P and P222 = K THEN host = Avian | 91.3 | 229 | 97.7 |
| IF P130 = K THEN host = Avian | 91.3 | 218 | 93.0 |
| IF P2 = E and P222 = K THEN host = Avian | 96.2 | 208 | 93.5 |
| IF P137 = A and P544 = L THEN host = Avian | 96.1 | 205 | 92.1 |
| IF P78 = L and P435 = V THEN host = Avian | 97.1 | 204 | 92.5 |
| IF P9 = F THEN host = Avian | 98.5 | 204 | 93.9 |
| IF P6 = F THEN host = Avian | 98.2 | 169 | 77.6 |
| IF P14 = V THEN host = Avian | 99.4 | 165 | 76.6 |
| IF P173 = T THEN host = Avian | 98.7 | 158 | 72.9 |
The model presented here is for the HA protein of the H1N1 subtype. Models for the other proteins of both subtypes are listed in Additional file 2
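The three rule statistics in the table can be reproduced on toy data. The sketch below assumes the usual rough-set reading, which appears consistent with the table: support is the number of sequences matching the IF part, accuracy is the share of those with the predicted host, and decision coverage is the share of all sequences of that host that the rule classifies correctly. For example, 91.3 % of 229 is about 209 sequences, and 209 of the 214 avian H1N1 HA training sequences is about 97.7 %. The function name and the toy records are hypothetical.

```python
def rule_stats(sequences, conditions, host):
    """Evaluate one IF-THEN rule on a list of sequence records.

    sequences:  list of dicts mapping position names (e.g. "P200") to
                residues, plus a "host" key.
    conditions: dict of position -> required residue (the IF part).
    host:       predicted host (the THEN part).
    Returns (support, accuracy %, decision coverage %).
    """
    matched = [s for s in sequences
               if all(s.get(pos) == aa for pos, aa in conditions.items())]
    correct = sum(1 for s in matched if s["host"] == host)
    class_size = sum(1 for s in sequences if s["host"] == host)
    support = len(matched)
    accuracy = 100 * correct / support if support else None
    coverage = 100 * correct / class_size if class_size else None
    return support, accuracy, coverage


# Toy records for illustration only (not real sequences).
toy = [
    {"P200": "P", "P222": "K", "host": "Avian"},
    {"P200": "P", "P222": "K", "host": "Avian"},
    {"P200": "P", "P222": "K", "host": "Human"},
    {"P200": "S", "P222": "K", "host": "Avian"},
    {"P200": "S", "P222": "R", "host": "Human"},
]
support, accuracy, coverage = rule_stats(toy, {"P200": "P", "P222": "K"}, "Avian")
print(support, accuracy, coverage)
```

A conjunctive rule such as `P200 = P and P222 = K` matches only sequences that satisfy both conditions, which is how a combinatorial signature differs from a singular one.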
Performance of the rule-based models on the new, unseen data
| Protein | Human sequences (total) | Human sequences (correctly classified) | Avian sequences (total) | Avian sequences (correctly classified) | Accuracy (%) |
|---|---|---|---|---|---|
| HA-H1N1 | 108 | 105 | 2 | 2 | 97.3 |
| HA-H3N2 | 73 | 73 | 4 | 4 | 100.0 |
| M1-H1N1 | 25 | 25 | 0 | 0 | 100.0 |
| M1-H3N2 | 8 | 7 | 0 | 0 | 87.5 |
| M2-H1N1 | 30 | 26 | 2 | 2 | 87.5 |
| M2-H3N2 | 22 | 16 | 3 | 3 | 76.0 |
| NA-H1N1 | 33 | 33 | 2 | 1 | 97.1 |
| NA-H3N2 | 46 | 46 | 4 | 3 | 98.0 |
| NP-H1N1 | 13 | 13 | 2 | 2 | 100.0 |
| NP-H3N2 | 8 | 8 | 4 | 4 | 100.0 |
| NS1-H1N1 | 31 | 31 | 2 | 2 | 100.0 |
| NS1-H3N2 | 19 | 19 | 3 | 3 | 100.0 |
| NEP-H1N1 | 12 | 12 | 2 | 2 | 100.0 |
| NEP-H3N2 | 8 | 8 | 2 | 2 | 100.0 |
| PAX-H1N1 | 18 | 14 | 2 | 2 | 80.0 |
| PAX-H3N2 | 7 | 7 | 0 | 0 | 100.0 |
| PA-H1N1 | 34 | 29 | 2 | 2 | 86.1 |
| PA-H3N2 | 23 | 23 | 4 | 4 | 100.0 |
| PB1F2-H1N1 | 3 | 3 | 2 | 2 | 100.0 |
| PB1F2-H3N2 | 9 | 8 | 4 | 0 | 61.5 |
| PB1-H1N1 | 27 | 27 | 1 | 1 | 100.0 |
| PB1-H3N2 | 20 | 20 | 1 | 1 | 100.0 |
| PB2-H1N1 | 29 | 29 | 2 | 2 | 100.0 |
| PB2-H3N2 | 16 | 16 | 3 | 3 | 100.0 |
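The accuracy column appears to be the pooled share of correctly classified sequences across both hosts. The sketch below recomputes it for three rows of the table under that assumption; the helper name is hypothetical.

```python
def accuracy_percent(human_total, human_correct, avian_total, avian_correct):
    """Pooled accuracy in percent, rounded to one decimal place."""
    total = human_total + avian_total
    return round(100 * (human_correct + avian_correct) / total, 1)


# (human total, human correct, avian total, avian correct), copied from the table.
rows = {
    "HA-H1N1": (108, 105, 2, 2),
    "M2-H3N2": (22, 16, 3, 3),
    "PB1F2-H3N2": (9, 8, 4, 0),
}
for name, counts in rows.items():
    print(name, accuracy_percent(*counts))
```

Since the unseen avian sets contain at most four sequences, pooled accuracy is dominated by the human sequences; the PB1F2-H3N2 row, where none of the four avian sequences is classified correctly, shows why the per-host counts are worth reading alongside it.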
Fig. 1 Ciruvis diagrams of combinations from the rules of H1N1 models. Models having at least three combinations are shown. The outer circle shows the positions. The inner circle shows the position or positions to which the position of the outer circle is connected. The edges show these connections. The width and color of the edges are related to the connection score (low = yellow and thin, high = red and thick). The width of an outer position is the sum of all connections to it, scaled so that all positions together cover the whole circle [26]
Fig. 2 Ciruvis diagrams of combinations from the rules of H3N2 models. Models having at least three combinations are shown. The outer circle shows the positions. The inner circle shows the position or positions to which the position of the outer circle is connected. The edges show these connections. The width and color of the edges are related to the connection score (low = yellow and thin, high = red and thick). The width of an outer position is the sum of all connections to it, scaled so that all positions together cover the whole circle [26]
Amino acid residues having the most interactions in the models of both subtypes
| Subtype | Protein | Positions | Number of interactions |
|---|---|---|---|
| H1N1 | HA | 222K | 2 |
| H1N1 | M1 | 121T | 5 |
| H1N1 | M2 | 14G | 6 |
| H1N1 | NEP | 57S, 60S | 2 |
| H1N1 | PA | 28P, 277S | 3 |
| H1N1 | PA-X | 28P | 4 |
| H1N1 | PB1 | 179M, 741A | 3 |
| H1N1 | PB2 | 65E | 3 |
| H3N2 | M1 | 101R | 2 |
| H3N2 | M2 | 11T, 14G, 31S, 54R | 2 |
| H3N2 | NEP | 14M | 4 |
| H3N2 | PB1 | 212L | 2 |
| H3N2 | PB2 | 82N | 6 |
Novel singular aa positions associated with host adaptation
| Protein | Novel singular positions |
|---|---|
| HA | 6,9,10,14,23,47,66,69,78,88,91,94,130,173,189,200,220,222,435,516 |
| M1 | 30,116,142,207,209 |
| M2 | 13,16,31,36,43,51,54 |
| NA | 16,18,19,23,30,40,42,44,46,47,74,79,147,150,157,166,232,285,341,344,351,369,372,389,397,435,437,466 |
| NP | 31,53,98,146,444,450,498 |
| NS1 | 6,7,14,23,27,28,74,123,152,192,220,226 |
| NS2 | 6,7,14,32,34,48,83,86 |
| PA | 85,323,336,348,362,300 |
| PAX | 28,85,210,233 |
| PB1 | 12,54,59,113,175,212,339,435,576,586,587,619,709 |
| PB1F2 | 3,6,12,17,21,25,26,27,28,33,47,52,54,57,58,60,62,65,82 |
| PB2 | 54,65,354 |
Amino acid changes associated with host adaptation
| Protein (H1N1) | Position | Avian | Human | Protein (H3N2) | Position | Avian | Human |
|---|---|---|---|---|---|---|---|
| HA | 6 | F | V | HA | 78 | R | E |
| NA | 46 | P | T | NA | 30 | A | I |
| NA | 74 | L | V | NA | 40 | N | Y |
| NP | 100 | R | I,V | NA | 44 | I | S |
| NS1 | 6 | I | M | NP | 16 | G | D |
| NEP | 6 | I | M | PA-X | 28 | P | L |
| PB1-F2 | 58 | L | - | PA | 28 | P | L |
| PB2 | 588 | A | I | PA | 57 | R | Q |
| | | | | PB2 | 9 | D | N |
| | | | | PB2 | 64 | M | T |
Fig. 3 Phylogeny of the PB2 H3N2 protein of avian hosts, annotated with the top 5 avian rules from the PB2 H3N2 model. Each sequence is represented by its GenBank accession. The violet nodes mark the sequences that support rules 1, 2, 3, 4 and 5, which are 91.4 % of the total sequences. Similarly, the DarkViolet nodes mark the sequences that support rules 1, 2, 3 and 4 but lack support for rule 5, which are 2.2 % of the total sequences. The nodes with a LightBlue background are the new, unseen sequences. The unmarked nodes do not support the top 5 rules; they either supported rules other than the top 5 or were not classified by the models
Performance of the H1N1 models on H3N2 data and vice versa
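| Data - models | Protein | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| H3N2 data - H1N1 models | HA | 1 | 0 | na |
| H3N2 data - H1N1 models | M1 | 1 | 0.895 | 0.941 |
| H3N2 data - H1N1 models | M2 | 1 | 0.73 | 0.84 |
| H3N2 data - H1N1 models | NA | 1 | 0 | na |
| H3N2 data - H1N1 models | NP | 1 | 0.882 | 0.932 |
| H3N2 data - H1N1 models | NS1 | 1 | 0.747 | 0.85 |
| H3N2 data - H1N1 models | NEP | 1 | 0.648 | 0.78 |
| H3N2 data - H1N1 models | PA-X | 0 | 1 | na |
| H3N2 data - H1N1 models | PA | 0.021 | 0.93 | −0.11 |
| H3N2 data - H1N1 models | PB1-F2 | 0.023 | 1 | 0.056 |
| H3N2 data - H1N1 models | PB1 | 0.563 | 0.909 | 0.302 |
| H3N2 data - H1N1 models | PB2 | 0.979 | 0.949 | 0.873 |
| H1N1 data - H3N2 models | HA | 0 | na | na |
| H1N1 data - H3N2 models | M1 | 0.957 | 0.975 | 0.885 |
| H1N1 data - H3N2 models | M2 | 0.987 | 0.766 | 0.804 |
| H1N1 data - H3N2 models | NA | 1 | 0 | −0.004 |
| H1N1 data - H3N2 models | NP | 0.364 | 0.984 | 0.251 |
| H1N1 data - H3N2 models | NS1 | 0.365 | 0.993 | 0.237 |
| H1N1 data - H3N2 models | NEP | 0.027 | 1 | 0.061 |
| H1N1 data - H3N2 models | PA-X | 0.201 | 0.982 | 0.223 |
| H1N1 data - H3N2 models | PA | 0.247 | 0.995 | 0.177 |
| H1N1 data - H3N2 models | PB1-F2 | 0.991 | 0.804 | 0.832 |
| H1N1 data - H3N2 models | PB1 | 0.992 | 0.877 | 0.888 |
| H1N1 data - H3N2 models | PB2 | 0.956 | 0.951 | 0.786 |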
Sensitivity is the ability to correctly predict human sequences and specificity is the ability to correctly predict avian sequences where 1 means perfect prediction and 0 means no correct prediction. Matthews correlation coefficient (MCC) value is a measure of how well the model performs overall where 1 means a perfect classification, 0 is for a prediction no better than random and −1 indicates a total disagreement between predictions and observations. “na” means the measure could not be calculated for the given model
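The "na" entries follow from the definition of MCC. As a hedged reconstruction (this is the standard MCC formula with human as the positive class; the source itself does not print it):

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```

If a model labels every sequence as human, then $FN = 0$ and $TN = 0$, so the factor $(TN+FN)$ is zero and MCC is undefined. That would match rows with sensitivity 1, specificity 0 and MCC "na". Conversely, if every sequence is labeled avian, $TP = FP = 0$ and the factor $(TP+FP)$ vanishes, which would match a row with sensitivity 0 and specificity 1. A row that shows 1 and 0 together with a numeric MCC presumably reports rounded rather than exact values.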
Fig. 4 Phylogenetic tree of the M1 protein from sequences of both subtypes and both hosts, combined into a single tree. The human sequences of each subtype form their own distinct clades, whereas the avian sequences fall into a single clade
Amino acid positions discussed in literature from the models of both the subtypes for all proteins
| Protein | Positions | Description |
|---|---|---|
| M1 | 115,121,137 | Known signatures of host-adaptation [ |
| M1 | 30,142,207,209 | Affecting viral production on mutation [ |
| M1 | 121 | Affecting viral replication [ |
| M1 | 101 | Determinant of temperature sensitivity [ |
| M2 | 11,14,18,20,28,55,57,78,82,89,93 | Known signatures of host-adaptation [ |
| M2 | 31 | S31N is a known marker for amantadine resistance [ |
| M2 | 18,20 | Lie next to 17,19 which forms a di-sulphide bond [ |
| NS1 | 18,21,22,53,60,70,81,112,114,171,215,227 | Known signatures of host-adaptation [ |
| NS1 | 215 | Required for Crk/CrL-SH3 binding [ |
| NS1 | 123 | Necessary for interaction with PKR, resulting in an inhibition of eIF2alpha phosphorylation [ |
| NS1 | 95 | Along with others, has been shown to be necessary for binding p85beta and activating PI3K signaling [ |
| NS1 | 220 | Part of nuclear localization signal 2 essential for the importin-alpha binding [ |
| NEP(NS2) | 57,60,70,107 | Known signatures of host-adaptation [ |
| NP | 16,33,100,214,283,313,351,353,357,422 | Known signatures of host-adaptation [ |
| NP | 16 | D16G shown to decrease pathogenicity several fold [ |
| PA | 28,55,57,65,256,268,277,356,382,400,409 | Known signatures of host-adaptation [ |
| PA | 85,336 | Residues 85I and 336M are deemed important for enhanced polymerase activity in mammalian cells [ |
| PA | 57,65,85 | Shown to be involved in suppressing the host cell protein synthesis during infection [ |
| PB1 | 52,179,216,298,327,336,361,375,581,741 | Known signatures of host-adaptation [ |
| PB1 | 581 | Shown to be conferring temperature sensitivity to human influenza virus vaccine strains [ |
| PB1 | 473 | Mutation at position 473 has been shown to decrease polymerase activity [ |
| PB2 | 9,44,64,81,105,271,292,368,453,588,613,682,684 | Known signatures of host-adaptation [ |
| PB2 | 591 | 591Q is known to mimic the effect of 627K [ |
| PB2 | 271 | 271A shown to increase polymerase activity in mammalian cells [ |
| PB2 | 271,588 | Also been shown to be host range determinants [ |
| PB1-F2 | 16,23,42,66,70,73,76 | Known signatures of host-adaptation [ |
| PB1-F2 | 66 | Linked with affecting pathogenicity [ |
| NA | 46,47,74,147,157,341,351 | Under selection pressure with a shift of hosts from birds to humans [ |
| NA | 344 | Calcium ion binds here that stabilizes the molecule (UniProt: Q9IGQ6). |
| HA | 2,6,9,10,14 | Signal peptide domain |
| HA | 88,173,220,222 | Positions 71, 159, 206 and 208 of the fully mature HA with H3 numbering [ |