| Literature DB >> 24809455 |
Mansour Ebrahimi1, Parisa Aghagolzadeh2, Narges Shamabadi1, Ahmad Tahmasebi3, Mohammed Alsharifi4, David L Adelson4, Farhid Hemmatzadeh5, Esmaeil Ebrahimie4.
Abstract
The evolution of the influenza A virus to increase its host range is a major concern worldwide. The molecular mechanisms underlying host-range expansion are largely unknown. Influenza surface proteins play determining roles in the recognition of host sialic acid receptors and in host range. In an attempt to uncover the physico-chemical attributes that govern HA subtyping, we performed a large-scale functional analysis of over 7000 sequences of 16 different HA subtypes. A large number (896) of physico-chemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine learning models that can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on the calculated protein characteristics dataset as well as on 10 trimmed datasets generated by the attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross-validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), the percentage/frequency of Tyr, the percentage of Cys, and the frequencies of Trp and Glu (selected by 70% of attribute weighting algorithms) as the key features associated with HA subtyping. The Random Forest tree induction algorithm and the RBF kernel function of SVM (scaled by grid search) showed a high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in tracing the short mutation/reassortment paths by which the influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption.
Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represents a new avenue for understanding and predicting the possible future structure of influenza pandemics.
Year: 2014 PMID: 24809455 PMCID: PMC4014573 DOI: 10.1371/journal.pone.0096984
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
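The study mines 896 physico-chemical protein characteristics per HA sequence, many of them simple composition attributes (count, frequency, percentage of each residue). A minimal sketch of this style of feature extraction is shown below; it is illustrative only and does not reproduce the paper's full 896-attribute set, which was produced with dedicated tools.

```python
# Hedged sketch: computing amino-acid composition features of the kind mined in
# the study (count and percentage of each residue). The full 896-attribute set
# included many more properties (e.g. dipeptide counts, extinction coefficients).

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_features(seq: str) -> dict:
    """Return count and percentage for each of the 20 standard amino acids."""
    seq = seq.upper()
    n = len(seq)
    feats = {}
    for aa in AMINO_ACIDS:
        count = seq.count(aa)
        feats[f"count_{aa}"] = count
        feats[f"pct_{aa}"] = 100.0 * count / n if n else 0.0
    return feats

feats = composition_features("QYYCQW")  # toy peptide, not a real HA sequence
print(feats["count_Q"], feats["pct_Y"])
```

For a real HA analysis these features would be computed over each of the >7000 sequences and assembled into the FCdb-style feature matrix.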
The most important protein attributes (features) in structure of different HA subtypes selected by different attribute weighting algorithms.
| Attribute | Number of attribute weighting algorithms that indicated the attribute as important |
| The frequency of Gln | 8 |
| Percentage of Cys | 7 |
| Percentage of Tyr | 7 |
| The frequency of Tyr | 7 |
| The frequency of Glu | 7 |
| Percentage of Trp | 7 |
| Count of Ile | 6 |
| The frequency of Arg | 6 |
| Percentage of His | 6 |
| The frequency of Asp | 6 |
| Percentage of Met | 6 |
| Non-reduced Cys extinction coefficient at 280 nm | 6 |
| The frequency of Phe | 5 |
Total number of attribute weighting algorithms that identified the given attribute as important (weight higher than 0.5; Table S4). The weighting algorithms were PCA, SVM, Relief, Uncertainty, Gini Index, Chi Squared, Deviation, Rule, Information Gain, and Information Gain Ratio.
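The table's consensus logic can be sketched as follows: a feature is tallied once for every weighting algorithm that assigns it a weight above 0.5. The weights in the toy example are invented for illustration; the paper used the ten algorithms listed above.

```python
# Hedged sketch of the consensus feature-selection tally behind this table.
# A feature counts as "selected" by an algorithm when its weight exceeds 0.5.

def consensus_counts(weights_by_algorithm: dict, threshold: float = 0.5) -> dict:
    """weights_by_algorithm: {algorithm: {feature: weight}} -> {feature: n_algorithms_selecting}."""
    counts = {}
    for weights in weights_by_algorithm.values():
        for feature, weight in weights.items():
            if weight > threshold:
                counts[feature] = counts.get(feature, 0) + 1
    return counts

# Toy example with made-up weights (only 2 of the paper's 10 algorithms):
demo = {
    "Relief":    {"freq_Gln": 0.9, "pct_Cys": 0.4},
    "GiniIndex": {"freq_Gln": 0.8, "pct_Cys": 0.7},
}
print(consensus_counts(demo))  # {'freq_Gln': 2, 'pct_Cys': 1}
```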
Figure 1. Decision Tree from the Decision Stump model run with the Gini Index criterion.
As may be inferred from the figure, the count of Tyr was the most important, and the sole, protein attribute distinguishing the various HA subtypes of influenza A virus. When the value of this feature was equal to 26, 27, 28 or 29, the virus fell into a further class; if the value was equal to 18, 19, 20, 21 or 22, the virus belonged to the H3 class. When the count of Tyr was 17, the subtype of the virus was H4, but when the value was 23 or 24, the virus was associated with H5. The H6 subtype was identified when the count of Tyr was 26. Finally, when the value was 13, 14 or 15, the virus fell into the H7 class. Underneath, the host species for each virus class is depicted.
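The single-attribute rule read off this Decision Stump can be written down directly. Only the subtype mappings explicitly stated in the caption are encoded below; counts not covered there (including the clause whose class label did not survive in this record) return `None`.

```python
# Hedged sketch of the Figure 1 decision-stump rule: HA subtype from the count
# of Tyr residues, using only the mappings stated in the caption.

def subtype_from_tyr_count(tyr_count: int):
    if tyr_count in (18, 19, 20, 21, 22):
        return "H3"
    if tyr_count == 17:
        return "H4"
    if tyr_count in (23, 24):
        return "H5"
    if tyr_count == 26:
        return "H6"
    if tyr_count in (13, 14, 15):
        return "H7"
    return None  # not covered by the caption's stated rules

print(subtype_from_tyr_count(20))  # H3
```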
Figure 2. Decision Tree from the Random Tree model run with the Gini Index criterion.
As may be inferred from the figure, the frequency of Pro-Gly was the most important protein attribute for building the tree, and the counts or frequencies of other dipeptides were used to generate the tree branches and to distinguish the various HA subtypes of influenza A virus. With the defined values for the count of Phe-Met, the count of Asn-Met and the frequency of Trp-Leu, the virus subtypes were either H3 or H5. With different values for the count of Asn-Met, various virus subtypes were distinguished. All virus subclasses (except H6, H8, H10, H11 and H14) were classified by this model. Underneath, the common host for each subtype is depicted.
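The Random Tree splits on dipeptide attributes such as the frequency of Pro-Gly. A minimal sketch of how dipeptide counts and frequencies can be computed from a sequence (overlapping windows of length 2) is shown below; this is illustrative and not the paper's exact feature-extraction tool.

```python
# Hedged sketch: dipeptide counts and frequencies over overlapping 2-residue
# windows, the kind of attribute (e.g. frequency of Pro-Gly) used in Figure 2.

def dipeptide_counts(seq: str) -> dict:
    """Count every overlapping residue pair in the sequence."""
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def dipeptide_frequency(seq: str, pair: str) -> float:
    """Fraction of overlapping windows equal to the given pair."""
    total = max(len(seq) - 1, 1)
    return dipeptide_counts(seq).get(pair, 0) / total

print(dipeptide_counts("PGPG"))           # {'PG': 2, 'GP': 1}
print(dipeptide_frequency("PGPG", "PG"))  # 2/3
```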
The accuracy of four different tree induction models (each run with four criteria: Accuracy, Gain Ratio, Gini Index and Info Gain) on 11 datasets [the original protein features dataset (FCdb) as well as 10 datasets generated by trimming (filtering) the original FCdb dataset with attribute weighting algorithms], computed by 10-fold cross-validation.
| Decision Tree (4 criteria) | Decision Tree Parallel (4 criteria) | Decision Tree Stump (4 criteria) | Decision Tree Random Forest (4 criteria) |
| 99.52 / 99.52 / 99.52 / 99.52 | 94.22 / 99.27 / 99.46 / 99.47 | 71.63 / 48.48 / 71.18 / 71.98 | 88.90 / 91.72 / 99.51 / 99.54 |
| 99.33 / 99.33 / 99.33 / 99.33 | 89.30 / 99.56 / 99.51 / 99.33 | 71.63 / 49.80 / 71.18 / 71.98 | 86.18 / 95.92 / 99.52 / 99.44 |
| 70.73 / 70.73 / 70.73 / 70.73 | 55.01 / 70.73 / 96.78 / 96.80 | 50.65 / 50.73 / 58.85 / 58.85 | 55.13 / 70.96 / 96.65 / 96.80 |
| 99.24 / 99.24 / 99.24 / 99.24 | 89.19 / 99.39 / 99.29 / 99.32 | 71.63 / 49.80 / 71.18 / 71.98 | 82.47 / 94.45 / 99.43 / 99.14 |
| 99.43 / 99.43 / 99.43 / 99.43 | 91.69 / 99.40 / 99.35 / 99.25 | 71.63 / 48.48 / 71.18 / 71.98 | 87.37 / 93.12 / 99.01 / 99.33 |
| 70.73 / 70.73 / 70.73 / 70.73 | 55.01 / 70.73 / 96.78 / 96.80 | 50.65 / 50.73 / 58.85 / 58.85 | 55.13 / 70.96 / 96.65 / 96.80 |
| 99.35 / 99.35 / 99.35 / 99.35 | 87.58 / 99.31 / 99.37 / 99.22 | 71.63 / 49.80 / 71.18 / 71.98 | 87.88 / 96.64 / 99.24 / 98.87 |
| 99.47 / 99.47 / 99.47 / 99.47 | 91.36 / 99.37 / 99.24 / 99.39 | 71.63 / 49.80 / 71.18 / 71.86 | 82.51 / 93.81 / 99.36 / 98.86 |
| 99.40 / 99.40 / 99.40 / 99.40 | 91.70 / 99.40 / 99.32 / 99.25 | 71.63 / 49.80 / 71.18 / 71.98 | 88.62 / 95.01 / 99.70 / 99.65 |
| 98.99 / 98.99 / 98.99 / 98.99 | 93.64 / 92.48 / 99.09 / 99.09 | 70.99 / 46.40 / 70.05 / 72.03 | 86.13 / 89.00 / 98.56 / 98.06 |
| 99.06 / 96.87 / 74.36 / 46.04 | 76.13 / 96.87 / 74.93 / 74.50 | 48.48 / 90.00 / 58.38 / 58.38 | 97.65 / 99.31 / 98.34 / 97.73 |
This table presents the accuracy percentage of Tree Induction models (Decision Tree, Decision Tree Parallel, Decision Stump, Random Forest and Random Tree) run with four different criteria (Gain Ratio, Information Gain, Gini Index and Accuracy). The lowest and highest accuracies have been highlighted.
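The accuracies above come from 10-fold cross-validation: the data are split into k folds, each fold serves once as the test set, and the per-fold accuracies are averaged. A minimal pure-Python sketch of this protocol is shown below; a trivial majority-class "model" stands in for the paper's tree/SVM/Bayes/neural learners, and the data are invented.

```python
# Hedged sketch of k-fold cross-validation. `fit` trains a model on the
# training indices; `predict` scores one example. Striped folds keep the
# example simple; real pipelines typically shuffle or stratify the folds.

def k_fold_accuracy(X, y, fit, predict, k=10):
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # fold i = indices i, i+k, ...
    accs = []
    for test_idx in folds:
        if not test_idx:
            continue
        test_set = set(test_idx)
        train_idx = [i for i in range(n) if i not in test_set]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accs.append(correct / len(test_idx))
    return sum(accs) / len(accs)

# Majority-class baseline as the stand-in learner:
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model  # always predicts the majority class seen in training

X = list(range(20))            # dummy inputs
y = ["H3"] * 15 + ["H5"] * 5   # invented labels for illustration
print(k_fold_accuracy(X, y, fit_majority, predict_majority))
```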
The accuracy of Bayesian and Neural Network models on various datasets [11 datasets including the original protein features dataset (FCdb) as well as 10 datasets generated by trimming (filtering) the original FCdb dataset with attribute weighting algorithms], computed by 10-fold cross-validation.
| Bayesian Models (2 models) | Neural Network Models (2 models) |
| 98.72% / 98.59% | 99.71% / 99.73% |
| 88.55% / 96.70% | 79.94% / 82.71% |
| 98.90% / 98.26% | 99.70% / 99.69% |
| 98.12% / 98.79% | 99.70% / 99.70% |
| 84.38% / 63.36% | 99.70% / 99.73% |
| 99.22% / 98.69% | 78.81% / 81.44% |
| 98.95% / 98.65% | 99.63% / 99.59% |
| 98.53% / 98.97% | 99.70% / 99.67% |
| 84.38% / 63.36% | 98.44% / 98.37% |
| 92.69% / 97.89% | 99.71% / 99.71% |
| 99.18% / 97.51% | 99.73% / 99.69% |
This table presents the accuracy percentage of Bayesian (Naïve Bayes and Bayes Kernel) and Neural Network models (AutoMLP and Neural Net) run on all 11 datasets.
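The abstract notes that the RBF-kernel SVM was "scaled by grid search": every (C, gamma) pair on a log-spaced grid is scored, typically by cross-validated accuracy, and the best pair is kept. The sketch below shows that search skeleton; the scoring function is a made-up stand-in, whereas in practice it would train and evaluate the SVM with those parameters.

```python
# Hedged sketch of grid search over SVM hyperparameters (C, gamma). The
# `score` callback is assumed to return a quality measure such as
# cross-validated accuracy; `toy_score` below is purely illustrative.

import itertools

def grid_search(score, Cs, gammas):
    """Return (best_score, best_C, best_gamma) over the parameter grid."""
    best = None
    for C, gamma in itertools.product(Cs, gammas):
        s = score(C, gamma)
        if best is None or s > best[0]:
            best = (s, C, gamma)
    return best

# Toy score peaking at C=10, gamma=0.01 (invented, not from the paper):
def toy_score(C, gamma):
    return -abs(C - 10) - 100 * abs(gamma - 0.01)

print(grid_search(toy_score, [0.1, 1, 10, 100], [0.001, 0.01, 0.1]))
```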