Zeeshan Khaliq, Mikael Leijon, Sándor Belák, Jan Komorowski.
Year: 2015 PMID: 26112351 PMCID: PMC4482282 DOI: 10.1186/s12866-015-0465-x
Source DB: PubMed Journal: BMC Microbiol ISSN: 1471-2180 Impact factor: 3.605
Fig. 1 Schematic representation of the applied computational modeling methodology
The training data
| Protein | H5N1 HP | H5N1 LP | Non-H5N1 HP | Non-H5N1 LP | Total HP | Total LP | All features | Significant features |
|---|---|---|---|---|---|---|---|---|
| HA | 1377 | 54 | 48 | 512 | 1425 | 566 | 616 | 82 |
| NA | 551 | 32 | 23 | 264 | 574 | 296 | 593 | 114 |
| M1 | 161 | 9 | 13 | 52 | 174 | 61 | 329 | 16 |
| M2 | 186 | 9 | 14 | 63 | 200 | 72 | 98 | 18 |
| NS1 | 425 | 16 | 22 | 148 | 447 | 164 | 249 | 71 |
| NS2 | 202 | 3 | 14 | 53 | 216 | 56 | 129 | 25 |
| NP | 294 | 12 | 22 | 113 | 316 | 125 | 511 | 22 |
| PA | 465 | 22 | 25 | 235 | 490 | 257 | 730 | 57 |
| PB1 | 405 | 26 | 25 | 223 | 430 | 249 | 775 | 44 |
| PB2 | 446 | 26 | 23 | 247 | 469 | 273 | 783 | 62 |
| PB1-F2 | 135 | 16 | 15 | 114 | 150 | 130 | 101 | 40 |
The HP and LP columns give the number of highly pathogenic and low pathogenic sequences for each protein, respectively. The ‘All features’ column is the total number of features (i.e. AAs) from which significant features were selected with Monte Carlo Feature Selection.
Fig. 2 Accuracies of the cross-validation and the testing of the models on new, unseen data. a Quality measures for the rule-based models. Averaged Accuracy is the average of mean accuracy from the 10-fold cross-validation loop for the models created on 100 under-sampled subsets for each protein. Standard deviation from the 10-fold cross-validation loop, averaged in a similar way as accuracy, is shown as error bars on the plot. b Re-classification of the training sequences of the H5N1 sequences. Accuracy is the percentage of correctly classified sequences. See also Additional file 4: Table S2. c Re-classification of the training sequences of the non-H5N1 sequences. Accuracy is the percentage of correctly classified sequences. See also Additional file 4: Table S3. d Accuracies of the classifiers when tested on the newly published unseen H5N1 sequences, i.e. sequences not included in the training of the models and with sequences identical to the training sequences removed. Accuracy is the percentage of correctly classified sequences. Classifiers consisted of the significant rules from all the rule-based models created for a given protein. See also Additional file 5: Table S4. e Accuracies of the classifiers when tested on the newly published unseen non-H5N1 sequences, i.e. sequences not included in the training of the models and with sequences identical to the training sequences removed. Accuracy is the percentage of correctly classified sequences. Classifiers consisted of the significant rules from all the rule-based models created for a given protein. See also Additional file 5: Table S5.
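
The averaging described for panel a can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the under-sampled subsets were drawn, so random sampling of the larger class down to the size of the smaller one is assumed here, and the fold accuracies are made-up values, not results from the paper. The function names `undersample` and `averaged_quality` are hypothetical.

```python
import random
from statistics import mean, stdev

def undersample(hp, lp, rng):
    """Assumed scheme: sample both classes down to the size of the smaller class."""
    n = min(len(hp), len(lp))
    return rng.sample(hp, n), rng.sample(lp, n)

def averaged_quality(fold_accuracies_per_subset):
    """Average the per-subset CV mean and the per-subset CV standard deviation."""
    means = [mean(folds) for folds in fold_accuracies_per_subset]
    sds = [stdev(folds) for folds in fold_accuracies_per_subset]
    return mean(means), mean(sds)

rng = random.Random(0)
# Placeholder identifiers; the counts are the HA totals from the training data table.
hp_ids = [f"HP_{i}" for i in range(1425)]
lp_ids = [f"LP_{i}" for i in range(566)]
subsets = [undersample(hp_ids, lp_ids, rng) for _ in range(100)]

# Made-up 10-fold accuracies for two subsets, only to show the averaging step.
toy_folds = [
    [0.98] * 5 + [1.0] * 5,
    [0.96] * 5 + [1.0] * 5,
]
avg_acc, avg_sd = averaged_quality(toy_folds)
print(f"averaged accuracy = {avg_acc:.3f}, averaged SD = {avg_sd:.3f}")
```

In a real run, each balanced subset would be passed through a 10-fold cross-validation of the rule-based model, and the resulting fold accuracies would replace `toy_folds`.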
The strongest rules for highly and low pathogenic viruses from the HA classifier
| Class | Rule | Accuracy (%) | Support | Class-Specific-Coverage (%) |
|---|---|---|---|---|
| HP | IF P43(HA1) = D THEN virus = HP | 99.8 | 1225 | 86 |
| HP | IF P83(HA1) = A THEN virus = HP | 100.0 | 807 | 57 |
| HP | IF P71(HA1) = I THEN virus = HP | 100.0 | 759 | 53 |
| LP | IF P43(HA1) = S THEN virus = LP | 95.2 | 589 | 99 |
| LP | IF P83(HA1) = D THEN virus = LP | 94.6 | 571 | 95 |
| LP | IF P107(HA1) = S THEN virus = LP | 95.8 | 552 | 93 |
| LP | IF P138(HA1) = N THEN virus = LP | 92.7 | 536 | 88 |
| LP | IF P309(HA1) = D THEN virus = LP | 94.9 | 533 | 89 |
| LP | IF P320(HA1) = V THEN virus = LP | 95.7 | 532 | 90 |
| LP | IF P195(HA1) = N THEN virus = LP | 88.8 | 400 | 63 |
| LP | IF P16(SP) = G THEN virus = LP | 89.3 | 392 | 62 |
| LP | IF P203(HA2) = I THEN virus = LP | 82.4 | 380 | 55 |
| LP | IF P6(SP) = I THEN virus = LP | 97.5 | 354 | 61 |
| LP | IF P7(SP) = A THEN virus = LP | 98.0 | 352 | 61 |
| LP | IF P3(SP) = R THEN virus = LP | 94.1 | 341 | 57 |
| LP | IF P240(HA1) = S THEN virus = LP | 95.2 | 332 | 56 |
| LP | IF P275(HA1) = D THEN virus = LP | 97.3 | 300 | 52 |
Accuracy is the percentage of the sequences in the support set correctly classified by the rule. Support is the number of sequences that satisfy the “IF” conditions of the rule. Class-Specific-Coverage is the percentage per class (i.e. HP or LP, respectively) of the sequences that support the rule and are correctly classified by the rule. For instance, if a rule is an HP class rule then the Class-Specific-Coverage gives the percentage of the HP sequences classified correctly by this rule
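
The three rule measures defined above can be computed as in the following minimal sketch. The data layout (a dictionary of residues per position plus a class label) and the function name `rule_metrics` are assumptions made for illustration, and the sequences are toy data, not the training set.

```python
def rule_metrics(sequences, conditions, predicted_class):
    """Compute support, accuracy (%) and class-specific coverage (%) of one rule.

    sequences: list of (residues_by_position, label) pairs (assumed layout)
    conditions: dict mapping position -> required residue (the "IF" part)
    predicted_class: class named in the "THEN" part, e.g. "HP" or "LP"
    """
    # Support set: labels of all sequences that satisfy every IF condition.
    support = [
        label
        for residues, label in sequences
        if all(residues.get(pos) == aa for pos, aa in conditions.items())
    ]
    correct = sum(1 for label in support if label == predicted_class)
    class_total = sum(1 for _, label in sequences if label == predicted_class)
    accuracy = 100.0 * correct / len(support) if support else 0.0
    coverage = 100.0 * correct / class_total if class_total else 0.0
    return {"support": len(support), "accuracy": accuracy, "coverage": coverage}

# Toy data: 5 HP and 5 LP sequences described by a single position.
toy = (
    [({"P43(HA1)": "D"}, "HP")] * 4
    + [({"P43(HA1)": "S"}, "HP")]
    + [({"P43(HA1)": "D"}, "LP")]
    + [({"P43(HA1)": "S"}, "LP")] * 4
)
metrics = rule_metrics(toy, {"P43(HA1)": "D"}, "HP")
print(metrics)
```

As a cross-check against the tables: the rule IF P43(HA1) = D THEN virus = HP has a support of 1225 and an accuracy of 99.8 %, i.e. roughly 1222 correctly classified sequences, and dividing by the 1425 HP sequences of HA in the training data gives about 86 %, which agrees with the reported Class-Specific-Coverage. A rule with several conditions joined by AND is handled by passing more than one entry in `conditions`.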
AAs and their combinations associated with high and low pathogenicity in all the proteins
| Protein | High pathogenicity: singular residues | High pathogenicity: combinations of residues | Low pathogenicity: singular residues | Low pathogenicity: combinations of residues |
|---|---|---|---|---|
| HA | D-43HA1, A-83HA1, I-71HA1 | - | S-43HA1, D-83HA1, S-107HA1, N-138HA1, D-309HA1, V-302HA1, A-7SP, I-6SP, D-275HA1, N-195HA1, S-240HA1, R-3SP, S-194HA1 | - |
| NA | N-369, G-386, T-288, H-100, D-269, H-41 | - | N-400, K-38, V-192, P-90, I-73, I-262, L-255, M-24, S-14, E-41, S-269, K-187, T-434, E-74, S-43 | - |
| NS1 | N-48, L-207, N-212 | R-59 & N-212; F-22 & N-48 | M-27 | P-208 & K-212; S-82 & R-113 & D-166; F-22 & S-82 & T-89 & R-113; E-55 & S-73 & S-82 & T-89 & R-113; S-48 & S-73 & S-82 & D-166; S-48 & S-82 & D-166; S-82 & A-107 & R-113; S-73 & S-82 & R-113; S-82 & R-113; S-82 & K-212 |
| NS2 | A-22, A-115, V-14 | V-6 & I-60 | - | V-49 & S-60 |
| M1 | A-166, N-232, N-224, K-27, I-168 | T-121 & I-168 | V-166, R-101 | V-166 & D-232 |
| M2 | E-14 | - | - | G-14 & E-66; G-14 & I-28; G-14 & K-18; I-28 & S-82; K-18 & I-28 |
| NP | S-34 | N-377 & N-482; S-34 & N-377; S-34 & N-482; R-77 & N-482; R-77 & N-377; A-373 & S-450; A-373 & N-377 | N-450 | K-77 & V-353 & S-377 |
| PA | T-129, S-58 | - | - | - |
| PB1 | I-149, V-14, L-384 | I-113 & I-149; V-14 & I-113; I-113 & K-386; T-59 & I-113 & K-215; I-113 & K-215 | - | A-14 & V-113 & G-154 & S-3 |
| PB2 | M-64, T-339 | - | I-478 | M-64 & I-478 |
| PB1-F2 | Y-57, P-48 | - | Q-48, D-50 | - |
AAs of the HA protein are labeled by region: HA1, HA2 or signal peptide (SP). AAs of the NA protein are numbered according to the full-length sequences, i.e. the sequences without the stalk deletion.
AA mutations associated with a shift of pathogenicity from low to high
| Protein | AA mutations associated with a change in pathogenicity from low to high |
|---|---|
| HA | S-43HA1-D, D-83HA1-A |
| NA | S-269-D, E-41-H |
| NS1 | S-48-N, K-212-N |
| NS2 | - |
| M1 | V-166-A |
| M2 | G-14-E |
| NP | K-77-R, S-377-N |
| PA | - |
| PB1 | - |
| PB2 | - |
| PB1-F2 | Q-48-P |
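
One way to read the mutation notation is as a pairing of a low-pathogenicity residue with a high-pathogenicity residue at the same position, e.g. S-43HA1 (LP) and D-43HA1 (HP) give S-43HA1-D. The sketch below applies this pairing to the singular HA residues. It is a heuristic reading only: the source does not describe how the mutations were derived, the pairing need not reproduce every row of the table, and the function names are hypothetical.

```python
def parse(token):
    """Split a token such as 'D-43HA1' into (site, residue) = ('43HA1', 'D')."""
    residue, site = token.split("-", 1)
    return site, residue

def low_to_high(hp_tokens, lp_tokens):
    """Pair LP and HP residues that occur at the same site (heuristic, see lead-in)."""
    hp_by_site = dict(parse(t) for t in hp_tokens)
    mutations = []
    for site, lp_residue in map(parse, lp_tokens):
        hp_residue = hp_by_site.get(site)
        if hp_residue is not None and hp_residue != lp_residue:
            mutations.append(f"{lp_residue}-{site}-{hp_residue}")
    return mutations

# Singular HA residues from the association table (first LP entries only).
ha_hp = ["D-43HA1", "A-83HA1", "I-71HA1"]
ha_lp = ["S-43HA1", "D-83HA1", "S-107HA1", "N-138HA1", "D-309HA1"]
ha_mutations = low_to_high(ha_hp, ha_lp)
print(ha_mutations)  # ['S-43HA1-D', 'D-83HA1-A']
```

For HA this pairing yields the two mutations listed in the table.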
Fig. 3 AAs appearing in the most significant rules marked on the 3D structures of different proteins. AA residues appearing in the rules are shown as spheres. Positions from the high pathogenicity rules are shown in blue, positions from the low pathogenicity rules are in magenta and mutations associated with the shift of pathogenicity from low to high as defined by the rules are shown in red. a Mapping of amino acid positions associated with pathogenicity from the rules onto the 3D structure of the HA protein of Influenza A virus (A/Hubei/1/2010 (H5N1)) (PDB ID: 4KTH). Chain A (HA1 residues) and chain B (HA2 residues) are presented in green, while the rest of the trimer is shown in gray. b A cartoon representation of chains A, B, C and D of the NA protein with AA positions from the rules (PDB ID: 2HU4). Chain A, the one marked with rule positions, is shown in green and the others in gray. Residue R-371, shown as a sphere in orange, is a part of the catalytic site of the protein. Cyan spheres constitute Oseltamivir 2, a substrate bound to the protein. c A cartoon representation of the NP protein trimer (PDB ID: 2IQH) with positions from the rules. Chain A is shown in green and the others are in gray. d AAs from the rules marked on a cartoon representation of NS1 (PDB ID: 3FST). e A cartoon representation of the PB2 protein cap-binding domain (PDB ID: 4CB4) with AAs from the rules.
| Rule | Accuracy (%) | Support |
|---|---|---|
| IF P22 = F AND P48 = N THEN virus = HP | 99.7 | 355 |