| Literature DB >> 21853079 |
Mansour Ebrahimi, Amir Lakizadeh, Parisa Agha-Golzadeh, Esmaeil Ebrahimie, Mahdi Ebrahimi.
Abstract
The engineering of thermostable enzymes is receiving increased attention. The paper, detergent, and biofuel industries, in particular, seek to use environmentally friendly enzymes in place of toxic chlorine chemicals. Enzymes typically function at temperatures below 60°C and denature if exposed to higher temperatures. In contrast, a small fraction of enzymes can withstand higher temperatures as a result of various structural adaptations. Understanding the protein attributes involved in this adaptation is the first step toward engineering thermostable enzymes. We employed various supervised and unsupervised machine learning algorithms, as well as attribute weighting approaches, to find amino acid composition attributes that contribute to enzyme thermostability. Specifically, we compared two groups of enzymes: mesostable and thermostable. Furthermore, a combination of attribute weighting with supervised and unsupervised clustering algorithms was used for prediction and modelling of protein thermostability from amino acid composition properties. Mining a large number of protein sequences (2090) with a variety of machine learning algorithms, based on the analysis of more than 800 amino acid attributes, increased the accuracy of this study. Moreover, these models successfully predicted thermostability from the primary structure of proteins. The results showed that expectation maximization clustering in combination with the Uncertainty and Correlation attribute weighting algorithms can classify thermostable and mesostable proteins with 100% accuracy. Seventy per cent of the weighting methods selected Gln content and the frequency of hydrophilic residues as the most important protein attributes. On the dipeptide level, the frequency of Asn-Gln was the key factor in distinguishing mesostable from thermostable enzymes.
This study demonstrates the feasibility of predicting thermostability irrespective of sequence similarity and will serve as a basis for engineering thermostable enzymes in the laboratory.
Year: 2011 PMID: 21853079 PMCID: PMC3154288 DOI: 10.1371/journal.pone.0023146
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
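The attributes analysed below (residue percentages, dipeptide frequencies, and the frequency of hydrophilic residues) can all be computed directly from a protein's primary sequence. A minimal sketch in Python; the function names and the hydrophilic residue grouping are illustrative assumptions, not taken from the paper:

```python
# Sketch of composition-attribute extraction from a primary sequence.
# The hydrophilic grouping below is an assumed, commonly used set;
# the paper's exact definition may differ.
HYDROPHILIC = set("RNDQEKH")

def residue_percentage(seq: str, residue: str) -> float:
    """Percentage of one residue type in the sequence (e.g. Gln content)."""
    return 100.0 * seq.count(residue) / len(seq)

def dipeptide_frequency(seq: str, dipeptide: str) -> float:
    """Frequency of a dipeptide among all overlapping residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return pairs.count(dipeptide) / len(pairs)

def hydrophilic_frequency(seq: str) -> float:
    """Fraction of residues falling in the hydrophilic set."""
    return sum(1 for r in seq if r in HYDROPHILIC) / len(seq)

seq = "MNQQNELKQGN"  # toy sequence, not from the study's dataset
print(round(residue_percentage(seq, "Q"), 2))    # Gln content
print(round(dipeptide_frequency(seq, "NQ"), 3))  # Asn-Gln frequency
```

Applied to the study's 2057 sequences, such functions yield the 800+ attribute columns that the weighting algorithms then rank.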
The most important protein attributes (features) selected by different attribute weighting algorithms.
| Attribute | Number of attribute weighting algorithms that selected the attribute as important |
| The percentage of Gln | 7 |
| The frequency of hydrophilic residues | 7 |
| The count of other residues | 7 |
| The percentage of Glu | 6 |
| The frequency of Asn | 6 |
| The frequency of Asn-Gln | 4 |
| The count of Asn-Asn | 4 |
| The frequency of Asn-Asn | 4 |
| The count of Gln | 3 |
| The percentage of Thr | 3 |
| The percentage of Val | 2 |
| The frequency of Arg | 2 |
| The count of Pro-Gln | 2 |
| The count of Lys-Gln | 2 |
This table presents the number of algorithms that selected the attribute. Weighting algorithms were PCA, SVM, Relief, Uncertainty, Gini index, Chi Squared, Deviation, Rule, Correlation, and Information Gain.
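Several of the listed weighting algorithms (Information Gain, Uncertainty, Gini Index) score a feature by how much it reduces class uncertainty. A minimal information-gain sketch in pure Python, with made-up toy data (the paper ran these weightings inside a data-mining package, not this code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Weight of a discretised feature: reduction in class entropy
    after splitting the samples by feature value."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A feature that separates the classes perfectly gets maximal weight
labels  = ["T", "T", "F", "F"]            # T: mesostable, F: thermostable
feature = ["low", "low", "high", "high"]  # e.g. binned Gln content
print(information_gain(feature, labels))  # 1.0
```

An uninformative feature (identical distribution in both classes) would score near zero, which is why only 14 of the 800+ attributes survive in the table above.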
Clustering of 10 datasets (generated by the 10 attribute weighting algorithms) into T (mesostable) and F (thermostable) classes by four different unsupervised clustering algorithms (K-Means, K-Medoids, SVC, and EMC).
| | Chi Squared | | Correlation | | Deviation | | Gini Index | | Information Gain | | Relief | | Rule | | PCA | | SVM | | Uncertainty | |
| | T | F | T | F | T | F | T | F | T | F | T | F | T | F | T | F | T | F | T | F |
| K-Means | 1461 | 596 | 1810 | 247 | 1222 | 835 | 1452 | 605 | 1333 | 724 | 1603 | 454 | 1601 | 456 | 434 | 1623 | 1076 | 981 | 372 | 1685 |
| K-Medoids | 487 | 1570 | 1521 | 536 | 104 | 1953 | 1570 | 487 | 1152 | 905 | 583 | 1474 | 1652 | 405 | 1768 | 289 | 892 | 1165 | 939 | 1118 |
| SVC | 363 | 1688 | 1701 | 328 | 1705 | 6 | 363 | 1688 | 570 | 1487 | 529 | 1324 | 1561 | 4 | 631 | 1426 | 0 | 2057 | 1089 | 947 |
| EMC | 0 | 2057 | 1544 | 513 | 0 | 0 | 0 | 2057 | 0 | 2057 | 0 | 2057 | 4 | 2053 | 0 | 2057 | 0 | 2057 | 1544 | 513 |
The actual numbers of T (mesostable) and F (thermostable) proteins in the original dataset were 1544 and 513, respectively. The highest accuracy (100%) was observed when the EMC clustering method was applied to the datasets generated by the Correlation and Uncertainty attribute weighting algorithms.
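Expectation maximization clustering (EMC) fits a mixture model by alternating between assigning soft class responsibilities (E-step) and re-estimating the component parameters (M-step). A tiny one-dimensional, two-component Gaussian sketch of the idea, with toy data; this is not the paper's implementation, which clustered the full multi-attribute datasets:

```python
import math

def em_1d(data, iters=50):
    """Two-component 1-D Gaussian mixture fitted by EM (illustrative sketch)."""
    mu  = [min(data), max(data)]  # initialise the means at the extremes
    var = [1.0, 1.0]
    pi  = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate means, variances, and mixing weights
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k]  = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
            pi[k]  = nk / len(data)
    return sorted(mu)

# Two well-separated toy clusters; the fitted means converge near 1.0 and 5.0
print(em_1d([1.0, 1.1, 0.9, 5.0, 5.1, 4.9]))
```

Because EM produces soft assignments and adapts cluster shapes, it can recover the mesostable/thermostable split exactly on the Correlation- and Uncertainty-weighted datasets where centroid methods such as K-Means fall short.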
Figure 1. Random Forest decision model on the Gini Index criterion.
The frequency of Asn-Gln was the most important attribute used to build the tree. The frequencies of Asn-Thr, Gly-Gly, and Asp-Pro were the other features used to build the rest of the tree. T: mesostable; F: thermostable.
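At each node, a Gini-criterion tree picks the attribute and threshold that minimise the weighted impurity of the two child nodes. A minimal single-split sketch; the frequency values and threshold below are made up for illustration and are not from the paper's model:

```python
def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_threshold(values, labels):
    """Threshold on one feature minimising weighted Gini impurity,
    as used at each node of a Gini-criterion decision tree."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left  = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Toy Asn-Gln frequencies (illustrative numbers, not the study's data)
freqs  = [0.010, 0.015, 0.060, 0.070]
labels = ["T", "T", "F", "F"]
print(best_gini_threshold(freqs, labels))  # 0.015
```

A Random Forest repeats this split search over bootstrapped samples and random feature subsets, which is why Asn-Gln frequency emerging as the dominant root attribute is a robust signal rather than a single-tree artefact.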
Figure 2. Decision tree on the Gain Ratio model.
As can be inferred from the figure, glutamine (Gln) content and the frequency of hydrophilic residues are the most important protein attributes for distinguishing mesostable (T) from thermostable (F) proteins.
Topologies and overall, mesostable, and thermostable prediction accuracies of the best neural networks run on the whole database (794 features) and on the stepwise feature-selected database (27 features).
| Type of neural network | Hidden layers | Neurons, layer 1 | Layer 2 | Layer 3 | Layer 4 | Accuracy, thermostable proteins | Accuracy, mesostable proteins | Overall accuracy |
| Feed-forward (27 features) | 2 | 40 | 20 | – | – | 0.84 | 0.95 | 0.91 |
| Elman (27 features) | 2 | 10 | 5 | – | – | 0.84 | 0.95 | 0.91 |
| Feed-forward (794 features) | 3 | 50 | 20 | 10 | – | 0.85 | 0.93 | 0.90 |
| Elman (794 features) | 4 | 50 | 25 | 10 | 5 | 0.83 | 0.95 | 0.91 |
Ten-fold cross validation of the Elman neural network (2 hidden layers with 10 and 5 neurons, respectively) run on the feature-selected dataset (27 features), presenting overall, mesostable, and thermostable prediction accuracies.
| Run | Size of training set | Size of test set | Accuracy, thermostable proteins | Accuracy, mesostable proteins | Overall accuracy |
| 1 | 1851 | 206 | 0.81 | 0.85 | 0.83 |
| 2 | 1851 | 206 | 0.76 | 0.92 | 0.86 |
| 3 | 1851 | 206 | 0.89 | 0.91 | 0.90 |
| 4 | 1851 | 206 | 0.89 | 0.96 | 0.93 |
| 5 | 1851 | 206 | 0.90 | 1.00 | 0.96 |
| 6 | 1851 | 206 | 0.74 | 0.99 | 0.89 |
| 7 | 1851 | 206 | 0.89 | 0.96 | 0.93 |
| 8 | 1851 | 206 | 0.84 | 0.97 | 0.92 |
| 9 | 1851 | 206 | 0.84 | 0.98 | 0.92 |
| 10 | 1854 | 203 | 0.88 | 0.93 | 0.91 |
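In ten-fold cross validation, the 2057 proteins are partitioned into ten test folds, and each network is trained on the remaining nine. A simple contiguous splitting sketch; the paper's exact partition (nine test sets of 206 and one of 203) was built slightly differently, so the fold sizes here only approximate it:

```python
def kfold_splits(n, k):
    """Return (train_indices, test_indices) pairs for k-fold cross
    validation on n samples, using contiguous folds whose sizes
    differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

splits = kfold_splits(2057, 10)  # the study's 2057 proteins, 10 folds
print([len(test) for _, test in splits])
print(len(splits[0][0]))  # 1851, matching the table's training-set size
```

Averaging the per-fold accuracies (overall mean ≈ 0.91 in the table above) gives a less optimistic, lower-variance estimate of generalisation than a single train/test split.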
Accuracies of the best neural networks found in this work in predicting the correct stability class for a dataset of 65 new proteins of known thermostability.
| Accuracy (%) | Feed-forward network | Elman network |
| Dataset without feature selection (all 794 features) | | |
| Thermostable proteins | 78.46 | 80.00 |
| Mesostable proteins | 93.48 | 97.83 |
| Overall | 85.97 | 88.91 |
| Dataset with feature selection (27 features) | | |
| Thermostable proteins | 78.46 | 78.46 |
| Mesostable proteins | 95.65 | 95.65 |
| Overall | 87.05 | 87.05 |