Esmaeil Ebrahimie, Mansour Ebrahimi, Narjes Rahpayma Sarvestani, Mahdi Ebrahimi.
Abstract
Halophilic proteins tolerate high salt concentrations. Understanding the features that underlie halophilicity is the first step toward engineering halostable crops. To this end, we examined the protein features that contribute to the halo-tolerance of halophilic organisms. We compared more than 850 features of halophilic and non-halophilic proteins using various screening, clustering, decision tree, and generalized rule induction models to search for patterns that code for halo-tolerance. The attribute weighting algorithms selected up to 251 protein attributes as important contributors to halo-stability. Of these, 14 attributes were selected by 90% of the models, and the count of hydrogen received the highest value (1.0) in 70% of the attribute weighting models, which shows the importance of this attribute in feature selection modeling. Most of the other attributes were dipeptide frequencies. The number of groups did not change when K-Means and TwoStep clustering were run on datasets with or without feature selection filtering. Although the induced trees were shallow, their accuracies exceeded 94%, and the frequency of hydrophobic residues was identified as the most important feature for building the trees. The performance evaluation values were the same across the decision tree models, and the best percentage of correctness was recorded with the Exhaustive CHAID and CHAID models. We found no significant difference in the percentage of correctness, performance evaluation, or mean correctness of the decision tree models with or without feature selection. This is the first analysis of the performance of different screening, clustering, and decision tree algorithms for discriminating halophilic from non-halophilic proteins, and the results show that amino acid composition can be used to discriminate between halo-tolerant and halo-sensitive proteins.
Year: 2011 PMID: 21592393 PMCID: PMC3117752 DOI: 10.1186/1746-1448-7-1
Source DB: PubMed Journal: Saline Systems ISSN: 1746-1448
Results of supervised feature selection on the first 100 important protein attributes (with value equal to 1) contributing to halo-stability of studied proteins.
| Field | Value | Rank | Field | Value | Rank |
|---|---|---|---|---|---|
| Freq of Leu-Leu | 1.0 | Important | Freq of Tyr-Leu | 1.0 | Important |
| Freq of Phe-Pro | 1.0 | Important | Freq of Glu | 1.0 | Important |
| Freq of Leu-Ile | 1.0 | Important | Freq of Tyr-Ile | 1.0 | Important |
| Freq of Glu-Met | 1.0 | Important | Freq of carbon | 1.0 | Important |
| Freq of Trp-Lys | 1.0 | Important | Freq of Ile-Ile | 1.0 | Important |
| Freq of Leu | 1.0 | Important | Freq of Ile-Ala | 1.0 | Important |
| Freq of Val-Phe | 1.0 | Important | Freq of Met | 1.0 | Important |
| Freq of Trp-Pro | 1.0 | Important | Freq of Other | 1.0 | Important |
| Freq of Lys-Leu | 1.0 | Important | Freq of Val | 1.0 | Important |
| Freq of Leu-His | 1.0 | Important | Freq of Ala-Lys | 1.0 | Important |
| Freq of Leu-Arg | 1.0 | Important | Freq of Arg-Ile | 1.0 | Important |
| Freq of Ile | 1.0 | Important | Freq of Ile-Gly | 1.0 | Important |
| Freq of Val-Lys | 1.0 | Important | Freq of Lys-Thr | 1.0 | Important |
| Freq of Pro-Tyr | 1.0 | Important | Freq of Lys-His | 1.0 | Important |
| Freq of Val-Leu | 1.0 | Important | Freq of Phe-Ile | 1.0 | Important |
| Freq of Ser-Phe | 1.0 | Important | Freq of sulfur | 1.0 | Important |
| Freq of Val-Gln | 1.0 | Important | Freq of Ser-His | 1.0 | Important |
| Freq of Phe-Leu | 1.0 | Important | Freq of Lys-Val | 1.0 | Important |
| Freq of Asp-Trp | 1.0 | Important | Freq of Leu-Ser | 1.0 | Important |
| Freq of Gly-Leu | 1.0 | Important | Freq of His-Ser | 1.0 | Important |
| Freq of Leu-Trp | 1.0 | Important | Freq of Ala-Phe | 1.0 | Important |
| Freq of His | 1.0 | Important | Freq of nitrogen | 1.0 | Important |
| Freq of Phe | 1.0 | Important | Freq of Glu-Ser | 1.0 | Important |
| Freq of Lys | 1.0 | Important | Freq of Arg | 1.0 | Important |
| Freq of Tyr-Trp | 1.0 | Important | Freq of Met-Gly | 1.0 | Important |
| Freq of Gln-Leu | 1.0 | Important | Freq of Ile-Thr | 1.0 | Important |
| Freq of Leu-Val | 1.0 | Important | Freq of Pro-Leu | 1.0 | Important |
| Freq of Cys-Tyr | 1.0 | Important | Freq of Lys-Ile | 1.0 | Important |
| Freq of Leu-Lys | 1.0 | Important | Freq of Try | 1.0 | Important |
| Freq of Met-Leu | 1.0 | Important | Freq of Phe-Thr | 1.0 | Important |
| Freq of Thr-Phe | 1.0 | Important | Freq of Leu-Pro | 1.0 | Important |
| Freq of Val-Ile | 1.0 | Important | Freq of Ile-Val | 1.0 | Important |
| Freq of Leu-Gly | 1.0 | Important | Freq of Ile-Trp | 1.0 | Important |
| Freq of Gly-Ile | 1.0 | Important | Count of Trp-Lys | 1.0 | Important |
| Freq of Hydrophobic | 1.0 | Important | Freq of His-Leu | 1.0 | Important |
| Freq of Tyr-Lys | 1.0 | Important | Freq of Gly-Ala | 1.0 | Important |
| Freq of Thr-Val | 1.0 | Important | Freq of Ala-Val | 1.0 | Important |
| Freq of Ile-Asn | 1.0 | Important | Count of Trp-Pro | 1.0 | Important |
| Freq of His-Gln | 1.0 | Important | Freq of Val-Pro | 1.0 | Important |
| Freq of Glu-Gly | 1.0 | Important | Freq of Ser-Ile | 1.0 | Important |
| Freq of Leu-Tyr | 1.0 | Important | Freq of Glu-Lys | 1.0 | Important |
| Freq of Met-Arg | 1.0 | Important | Freq of oxygen | 1.0 | Important |
| Freq of Ala-Leu | 1.0 | Important | Freq of Thr-Leu | 1.0 | Important |
| Freq of Gln-Met | 1.0 | Important | Freq of Leu-Cys | 1.0 | Important |
| Freq of Trp-Leu | 1.0 | Important | Freq of Ile-Leu | 1.0 | Important |
| Freq of Thr-His | 1.0 | Important | Freq of Leu-Ala | 1.0 | Important |
| Freq of Ile-Arg | 1.0 | Important | Freq of Phe-Val | 1.0 | Important |
| Freq of Pro-Val | 1.0 | Important | Freq of Thr-Lys | 1.0 | Important |
| Freq of Tyr-Phe | 1.0 | Important | Freq of Tyr | 1.0 | Important |
| Count of hydrogen | 1.0 | Important | Freq of Glu-Ala | 1.0 | Important |
The algorithm considers one attribute at a time to determine how well each predictor alone predicts the target variable. The importance value of each variable is then calculated as (1 - p), where p is the p value of the appropriate test of association between the candidate predictor and the target variable. Since the target variable was categorical, p values based on the F statistic were used.
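The (1 - p) importance score can be illustrated for a single attribute with a short pure-Python sketch. It is not the screening implementation used in the study. It assumes a one-way ANOVA F test on two protein groups, and the helper names (`reg_inc_beta`, `f_survival`, `importance`) and the sample frequencies are hypothetical. The p value is obtained from the regularized incomplete beta function, evaluated with a standard continued-fraction expansion.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the continued fraction.
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered term of the continued fraction.
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_survival(f, df1, df2):
    """Upper-tail p value P(F >= f) of an F distribution with (df1, df2)."""
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))


def importance(group_a, group_b):
    """Return (F, p, 1 - p) for one attribute measured in two groups."""
    groups = [group_a, group_b]
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df1, df2 = len(groups) - 1, n - len(groups)
    if ss_within == 0.0:
        raise ValueError("zero within-group variance; F is undefined")
    f = (ss_between / df1) / (ss_within / df2)
    p = f_survival(f, df1, df2)
    return f, p, 1.0 - p


# Hypothetical frequencies of one attribute in the two protein groups.
halophilic = [0.12, 0.14, 0.13, 0.15, 0.16]
non_halophilic = [0.08, 0.07, 0.09, 0.06, 0.08]
f_stat, p_value, score = importance(halophilic, non_halophilic)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}, importance = {score:.4f}")
```

An attribute whose values barely differ between the groups gives a p value near 1 and therefore an importance near 0, while a well-separated attribute approaches an importance of 1.0.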
The most important protein attributes selected by all used attribute weighting algorithms in this study.
| Chi Square | Value | Deviation | Value | Gini Index | Value | Uncertainty | Value | Relief | Value | SVM | Value | PCA | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq of Cys | 1.00 | hydrogen | 1.00 | Count of hydrogen | 1.00 | Freq of Val-Gln | 1.00 | Hydrogen | 1.00 | Freq of Lys-Val | 1.00 | Count of Ser-Leu | 1.00 |
| Freq of Asn | 1.00 | Freq of Leu-Val | 0.95 | Freq of Val | 0.85 | Freq of Ala-Val | 0.96 | Freq of carbon | 0.98 | Freq of Val-Ile | 0.97 | Count of Ile | 1.00 |
| Freq of Val | 0.99 | Freq of Thr-Val | 0.95 | Freq of Leu-Val | 0.81 | Freq of Leu-Val | 0.95 | Freq of oxygen | 0.90 | Freq of Lys-Cys | 0.93 | Count of Met | 0.93 |
| Freq of Lys | 0.95 | Freq of Val | 0.88 | Freq of Thr-Val | 0.81 | Freq of Thr-Val | 0.92 | Freq of Hydrophobic Residues | 0.89 | Freq of Asn-Asn | 0.93 | Count of Phe | 0.92 |
| Freq of Leu-Val | 0.94 | Freq of Leu-Arg | 0.88 | Freq of Val-Val | 0.79 | Freq of Gly-Leu | 0.91 | Freq of Leu | 0.85 | Freq of Glu-Met | 0.92 | Count of Leu-Ser | 0.89 |
| Freq of Val-Gln | 0.93 | Freq of Leu | 0.88 | Freq of Ala-Tyr | 0.78 | hydrogen | 0.90 | Freq of nitrogen | 0.83 | Freq of Ala-Tyr | 0.89 | Count of Leucine (L) | 0.87 |
| Freq of Gly | 0.93 | Freq of Leu-Ala | 0.86 | Freq of Leu | 0.78 | Freq of carbon | 0.90 | Freq of Val | 0.78 | Freq of Lys-Thr | 0.88 | Count of Tyr | 0.83 |
| Freq of Leu-Gly | 0.92 | Freq of Leu-Leu | 0.86 | Freq of Ala-Val | 0.76 | Freq of Thr-Lys | 0.89 | Freq of Ile | 0.74 | Freq of Asp-Pro | 0.82 | Non-reduced Cys Ext coef | 0.82 |
| Freq of Leu-Leu | 0.92 | Freq of Gly | 0.86 | Freq of Gly | 0.76 | Freq of Leu-Arg | 0.88 | Freq of Hydrophilic Res | 0.73 | Freq of Val-Lys | 0.81 | Reduced Cys Ext coef | 0.82 |
| Freq of Leu-Arg | 0.92 | Freq of Asn | 0.86 | Freq of Hydrophobic Residues | 0.74 | Freq of Leu-Leu | 0.88 | Freq of Other | 0.72 | Freq of Thr-Val | 0.80 | Count of Leu-Phe | 0.81 |
| Freq of Ala-Val | 0.90 | Freq of Pro-Val | 0.86 | Freq of Gly-Ala | 0.74 | Freq of Nitrogen | 0.88 | Freq of Gly | 0.67 | Freq of Ile-Trp | 0.79 | Count of Tyr-Lys | 0.80 |
| Freq of Hydrophilic Residues | 0.89 | Freq of Lys-Leu | 0.84 | Freq of Leu-Leu | 0.72 | Freq of Val | 0.88 | Freq of Thr | 0.65 | Freq of Leu-Leu | 0.79 | Count of Ile-Leu | 0.78 |
| Freq of Val-Lys | 0.88 | Freq of Ala | 0.84 | Freq of Ala | 0.72 | Freq of Thr-His | 0.88 | Freq of Asp | 0.62 | Freq of Asn-Gln | 0.79 | Count of Leu-Tyr | 0.77 |
| Freq of Glu-Ala | 0.88 | Freq of Tryp | 0.84 | Freq of Leu-Arg | 0.71 | Freq of Cys | 0.88 | Freq of His | 0.61 | Freq of Ser-Cys | 0.78 | Count of sulfur | 0.77 |
| Freq of Thr-Val | 0.87 | Freq of Tyr | 0.84 | Freq of Gly-Leu | 0.71 | Freq of Trp-Pro | 0.88 | Freq of Phe-Pro | 0.61 | Freq of Phe-Pro | 0.77 | Count of Val-Ser | 0.72 |
| Freq of Gly-Leu | 0.87 | Freq of Val-Ser | 0.84 | Freq of Leu-Gly | 0.70 | Freq of Asn | 0.87 | Freq of Val-Ile | 0.61 | Freq of Gly-Leu | 0.77 | Count of Ala-Tyr | 0.72 |
| Freq of Glu | 0.86 | Freq of Hydrophobic Residues | 0.83 | Freq of Val-Lys | 0.70 | Freq of oxygen | 0.87 | Freq of Tyr | 0.61 | Freq of Pro-Arg | 0.77 | Count of Leu-Leu | 0.70 |

| Attribute | Value | Attribute | Value | Attribute | Value |
|---|---|---|---|---|---|
| Count of hydrogen | 1.0 | Count of hydrogen | 1.0 | Freq of hydrophobic residues | 1.0 |
| Freq of Leu-Val | 0.95 | Freq of Glu-Met | 0.91 | Freq of Ala | 1.0 |
| Freq of Thr-Val | 0.94 | Freq of Gly-Leu | 0.91 | Freq of hydrogen | 0.97 |
| Freq of Val | 0.88 | Freq of Leu-Leu | 0.91 | Freq of Gly | 0.96 |
| Freq of Leu-Arg | 0.87 | Freq of Leu | 0.91 | Freq of Lys | 0.96 |
| Freq of Leu | 0.86 | Freq of Val | 0.89 | Freq of Asn | 0.94 |
| Freq of Leu-Ala | 0.86 | Freq of Trp-Pro | 0.87 | Freq of Val | 0.92 |
| Freq of Leu-Leu | 0.86 | Freq of Ala-Tyr | 0.85 | Freq of Cys | 0.88 |
| Freq of Gly | 0.84 | Freq of Val-Val | 0.80 | Freq of Leu | 0.87 |
| Freq of Asn | 0.84 | Freq of Leu-Gly | 0.80 | Freq of Ser | 0.86 |
| Freq of Pro-Val | 0.84 | Freq of Val-Lys | 0.80 | Freq of Thr | 0.86 |
| Freq of Lys-Leu | 0.84 | | | Freq of Leu-Leu | 0.85 |
| Freq of Ala | 0.84 | | | Freq of hydrophilic residues | 0.81 |
| Freq of Trp | 0.83 | | | | |
| Freq of Tyr | 0.81 | | | | |
| Freq of Val-Ser | 0.81 | | | | |
| Freq of hydrophobic residues | 0.78 | | | | |
The figures are the importance values assigned to each feature by the corresponding algorithm.
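One way to read such a table is to count how many algorithms assign an attribute a weight above some cutoff. The sketch below does this for a handful of entries copied from the Chi Square, Deviation, Gini Index, and Uncertainty columns. The cutoff of 0.9, the function name `selection_share`, and the choice to treat an attribute missing from a column as not selected are assumptions made for illustration, not the procedure reported by the authors.

```python
from collections import Counter

# A few attribute weights copied from four columns of the table.
weights = {
    "Chi Square": {
        "Freq of Cys": 1.00, "Freq of Val": 0.99, "Freq of Leu-Val": 0.94,
        "Freq of Val-Gln": 0.93, "Freq of Thr-Val": 0.87,
    },
    "Deviation": {
        "Freq of Leu-Val": 0.95, "Freq of Thr-Val": 0.95, "Freq of Val": 0.88,
    },
    "Gini Index": {
        "Freq of Val": 0.85, "Freq of Leu-Val": 0.81, "Freq of Thr-Val": 0.81,
    },
    "Uncertainty": {
        "Freq of Val-Gln": 1.00, "Freq of Leu-Val": 0.95,
        "Freq of Thr-Val": 0.92, "Freq of Val": 0.88, "Freq of Cys": 0.88,
    },
}


def selection_share(weights, threshold):
    """Fraction of algorithms whose weight for an attribute is >= threshold.

    Assumption: an attribute absent from a column counts as not selected.
    """
    counts = Counter()
    for per_attribute in weights.values():
        for attribute, weight in per_attribute.items():
            if weight >= threshold:
                counts[attribute] += 1
    n_algorithms = len(weights)
    return {attribute: count / n_algorithms
            for attribute, count in counts.items()}


share = selection_share(weights, threshold=0.9)  # hypothetical cutoff
for attribute, fraction in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"{attribute}: selected by {fraction:.0%} of the algorithms")
```

With these entries, Freq of Leu-Val clears the cutoff in three of the four columns, which is the kind of cross-algorithm agreement the table is meant to expose.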
Figure 1. A decision tree generated by the CHAID modeling method without feature selection filtering, comparing halo-tolerant (T) with halo-sensitive (S) proteins.
Comparison of the percentage of correctness, percentage of wrongness, performance evaluation (T and F), and mean correct and incorrect for various decision tree models on datasets with and without feature selection, for the halo-tolerant and halo-stable protein groups (T/F groups).
| Decision tree model | % Correctness, without feature selection | % Correctness, with feature selection | % Wrongness, without feature selection | % Wrongness, with feature selection | Performance evaluation (T), without feature selection | Performance evaluation (T), with feature selection | Performance evaluation (F), without feature selection | Performance evaluation (F), with feature selection | Most important feature (protein attribute) in building the decision tree |
|---|---|---|---|---|---|---|---|---|---|
| | 99.31 | 99.31 | 0.69 | 0.69 | 0.053 | 0.053 | 2.833 | 2.833 | Count of hydrogen |
| | 99.31 | 99.31 | 0.69 | 0.69 | - | 0.053 | - | 2.833 | Count of hydrogen |
| | 99.31 | 98.97 | 0.61 | 1.03 | - | 0.057 | - | 2.725 | Count of hydrogen |
| | 98.97 | 99.31 | 1.03 | 0.69 | - | 0.053 | - | 2.833 | Frequency of Leu-Leu |
| | 99.66 | 99.31 | 0.34 | 0.69 | - | 0.053 | - | 2.833 | Frequency of oxygen |
| | 99.66 | 99.31 | 0.34 | 0.69 | - | 0.053 | - | 2.833 | Frequency of oxygen |
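The correctness and wrongness percentages are complementary shares of the classified proteins. A minimal sketch of that bookkeeping follows. The function name `percent_correct` and the toy labels are hypothetical and are not taken from the study's data.

```python
def percent_correct(actual, predicted):
    """Return (% correctness, % wrongness) for paired class labels."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    correct = 100.0 * hits / len(actual)
    return round(correct, 2), round(100.0 - correct, 2)


# Toy labels: T = halo-tolerant, S = halo-sensitive (hypothetical data).
actual = ["T", "T", "S", "S", "T", "S", "T", "S"]
predicted = ["T", "T", "S", "T", "T", "S", "T", "S"]
correct, wrong = percent_correct(actual, predicted)
print(correct, wrong)
```

Because the two values are computed from the same count, they always sum to 100, which gives a quick consistency check on each row of the table.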
The association rules found in the data by the generalized rule induction (GRI) method, comparing halo-sensitive and halo-tolerant (including halolysin) proteins.
| No | Antecedent | Support % | Confidence % |
|---|---|---|---|
| 1 | Aliphatic index > 97.672 | 16.9 | 77.55 |
| 2 | Length > 1116.000 | 7.93 | 69.57 |
| 3 | Aliphatic index > 97.672 and Length > 308.500 | 10.34 | 63.33 |
| 4 | Length > 1116.000 and Length < 1319.500 | 5.52 | 56.25 |
| 5 | Aliphatic index > 97.672 and Non-reduced Cys Extinction coefficient at 280 nm > 19180.000 and the count of Thr > 19.500 | 8.62 | 56.0 |
| 6 | Length > 1116.000 and Aliphatic index > 65.500 | 5.17 | 53.33 |
| 7 | Aliphatic index > 97.672 and Aliphatic index < 105.670 | 6.55 | 52.63 |
| 8 | Aliphatic index > 97.672 and Non-reduced Cys Extinction coefficient at 280 nm > 19180.000 and the count of Ala > 31.500 | 7.24 | 52.38 |
| 9 | Aliphatic index > 97.672 and Length > 308.500 and the count of Ala > 31.500 | 7.24 | 52.38 |
| 10 | Non-reduced Absorption at 280 nm 0.1% (= 1 g/l) > 1.019 | 7.93 | 52.17 |
| 11 | Aliphatic index > 97.672 and Non-reduced Absorption at 280 nm 0.1% (= 1 g/l) > 0.430 | 7.93 | 52.17 |
| 12 | Aliphatic index > 97.672 and Length > 308.500 and the percentage of Try > 0.162 | 7.93 | 52.17 |
| 13 | The count of hydrogen > 0.488 | 11.03 | 50.0 |
| 14 | Aliphatic index > 97.672 and the percentage of His > 1.095 | 7.59 | 50.0 |
| 15 | Aliphatic index > 97.672 and the count of Tyr > 5.500 and the count of Ala > 16.500 | 7.59 | 50.0 |
| 16 | Aliphatic index > 97.672 and the count of Val > 22.500 and the count of Trp-Arg < 1.500 | 7.59 | 50.0 |
| 17 | Aliphatic index > 97.672 and the count of Arg > 10.500 and the count of Ala-Leu > 2.500 | 7.59 | 50.0 |
| 18 | Aliphatic index > 97.672 and the count of Leu > 38.500 and the count of Ala > 16.500 | 7.59 | 50.0 |
| 19 | Aliphatic index > 97.672 and the count of Ile > 18.500 and the count of Ala > 16.500 | 7.59 | 50.0 |
| 20 | Aliphatic index > 97.672 and the count of Gly > 22.500 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 21 | Aliphatic index > 97.672 and the count of neutral charges > 243.000 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 22 | Aliphatic index > 97.672 and the count of hydrophobic residues > 154.500 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 23 | Aliphatic index > 97.672 and the count of oxygen > 460.500 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 24 | Aliphatic index > 97.672 and the count of nitrogen > 398.500 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 25 | Aliphatic index > 97.672 and the count of carbon > 1513.000 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 26 | Aliphatic index > 97.672 and the count of hydrogen > 2430.000 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 27 | Aliphatic index > 97.672 and Reduced Cys extinction coefficient at 280 nm > 18705.000 and the percentage of His > 1.095 | 7.59 | 50.0 |
| 28 | Aliphatic index > 97.672 and Non-reduced Cys extinction coefficient at 280 nm > 19180.000 and the percentage of His > 1.095 | 7.59 | 50.0 |
| 29 | Aliphatic index > 97.672 and Length > 308.500 and the count of Val-Gln > 0.500 | 7.59 | 50.0 |
| 30 | Aliphatic index > 97.672 and Length > 308.500 and the percentage of His > 1.924 | 6.9 | 50.0 |
| 31 | Length > 1116.000 and the count of Ala > 71.500 | 4.83 | 50.0 |
| 32 | Length > 1116.000 and Reduced Cys extinction coefficient at 280 nm > 69840.000 and Aliphatic index > 62.094 | 4.83 | 50.0 |
| 33 | Length > 1116.000 and Non-reduced Cys extinction coefficient at 280 nm > 70530.000 and Aliphatic index > 62.094 | 4.83 | 50.0 |
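The support and confidence columns can be reproduced for any rule once the antecedent is expressed as a test on a protein record. The sketch below assumes the usual rule induction convention, in which support is the share of all records that satisfy the antecedent and confidence is the share of those records that also satisfy the consequent. The consequent is taken to be "halo-tolerant", the field names are hypothetical, and the ten records are invented for illustration; only the two thresholds come from rule 3 of the table.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support %, confidence %) of the rule antecedent -> consequent."""
    matching = [r for r in records if antecedent(r)]
    if not matching:
        return 0.0, 0.0
    support = 100.0 * len(matching) / len(records)
    confirmed = sum(1 for r in matching if consequent(r))
    confidence = 100.0 * confirmed / len(matching)
    return support, confidence


# Invented protein records (aliphatic index, length, class label).
records = [
    {"aliphatic_index": 101.2, "length": 350, "halo_tolerant": True},
    {"aliphatic_index": 99.0, "length": 420, "halo_tolerant": True},
    {"aliphatic_index": 98.5, "length": 310, "halo_tolerant": False},
    {"aliphatic_index": 110.0, "length": 200, "halo_tolerant": True},
    {"aliphatic_index": 85.0, "length": 500, "halo_tolerant": False},
    {"aliphatic_index": 70.3, "length": 1200, "halo_tolerant": True},
    {"aliphatic_index": 90.1, "length": 280, "halo_tolerant": False},
    {"aliphatic_index": 97.0, "length": 400, "halo_tolerant": False},
    {"aliphatic_index": 102.4, "length": 600, "halo_tolerant": True},
    {"aliphatic_index": 60.0, "length": 150, "halo_tolerant": False},
]

# Thresholds from rule 3: Aliphatic index > 97.672 and Length > 308.500.
support, confidence = rule_metrics(
    records,
    antecedent=lambda r: r["aliphatic_index"] > 97.672 and r["length"] > 308.5,
    consequent=lambda r: r["halo_tolerant"],
)
print(f"support = {support:.2f}%, confidence = {confidence:.2f}%")
```

Adding a condition to an antecedent can only keep or shrink the set of matching records, so support can never rise when a rule is made more specific, whereas confidence can move in either direction.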