Akram Mohammed, Chittibabu Guda.
Abstract
BACKGROUND: Enzymes are the molecular machines that drive an organism's metabolism; hence, identifying the full enzyme complement of an organism is essential for building the metabolic blueprint of that species and for understanding the interplay of multiple species in an ecosystem. Experimentally characterizing the reactions of every enzyme in a genome is tedious and expensive. The problem is more pronounced in metagenomic samples, where even the species themselves are often not adequately cultured or characterized. Enzymes encoded by the gut microbiota play an essential role in host metabolism, which warrants accurate identification and annotation of the full enzyme complements of species in genomic and metagenomic projects. To fulfill this need, we develop and apply ECemble, an ensemble approach to identify enzymes and enzyme classes and to study the human gut metabolic pathways.
Year: 2015 PMID: 26099921 PMCID: PMC4474468 DOI: 10.1186/1471-2164-16-S7-S16
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Schematic representation of the ECemble method and its application.
Distribution of enzyme and non-enzyme features.
| Feature Set | Pfam | Prosite (PS) | Superfamily (SF) | ∑ (Pfam, PS, SF) |
|---|---|---|---|---|
| Database | 12273 | 1308 | 1962 | 15543 |
| Enzyme Only | 859 | 133 | 111 | 1103 |
| Non-Enzyme Only | 7658 | 525 | 1013 | 9196 |
| *Common features | 2514 | 647 | 781 | 3942 |
| Total unique features | 11031 | 1305 | 1905 | 14241 |
*Common between enzyme and non-enzyme sequences
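The rows of this table are related by simple arithmetic: for each source, the unique features are the enzyme-only, non-enzyme-only, and common features added together. The following minimal Python sketch checks that using only the counts in the table; the dictionary and function names are illustrative and do not come from the paper.

```python
# Feature counts copied from the table above; all names are illustrative.
FEATURES = {
    "Pfam":        {"enzyme_only": 859, "non_enzyme_only": 7658, "common": 2514},
    "Prosite":     {"enzyme_only": 133, "non_enzyme_only": 525,  "common": 647},
    "Superfamily": {"enzyme_only": 111, "non_enzyme_only": 1013, "common": 781},
}


def total_unique(counts):
    """Unique features = enzyme-only + non-enzyme-only + common."""
    return counts["enzyme_only"] + counts["non_enzyme_only"] + counts["common"]


def enzyme_features(counts):
    """Features observed in at least one enzyme sequence."""
    return counts["enzyme_only"] + counts["common"]


unique_per_source = {name: total_unique(c) for name, c in FEATURES.items()}
all_unique = sum(unique_per_source.values())
all_enzyme = sum(enzyme_features(c) for c in FEATURES.values())

print(unique_per_source)  # {'Pfam': 11031, 'Prosite': 1305, 'Superfamily': 1905}
print(all_unique)         # 14241
print(all_enzyme)         # 5045
```

The two totals, 14241 and 5045, match the feature counts reported as the data dimensionality of the mixed and enzyme-only datasets.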
Feature vectors and dimensionality for the dataset.
| Dataset | Number of feature vectors | Data dimensionality |
|---|---|---|
| Enzyme Sequences | 64948 | 5045x64948 |
| Mixed$ | 193240 | 14241x193240 |
$Both enzyme and non-enzyme sequences
Overall prediction accuracy of ECemble method.
| EC Level | Number of sequences | Correctly predicted* | % Overall accuracy | Classes | Model(s) | Average # of features per model |
|---|---|---|---|---|---|---|
| Level-0 | 193240 | 188866 | 97.74 | 2 | 1 | 14241 |
| Level-1 | 62674 | 62167 | 99.19 | 6 | 1 | 5045 |
| Level-2 | 62167 | 61721 | 99.28 | 51 | 6 | 811 |
| Level-3 | 61721 | 60931 | 98.72 | 169 | 51 | 95 |
| Level-4 | 60931 | 60199 | 98.80 | 1921 | 169 | 30 |
*Correctly predicted: instances predicted by at least 2 of the top 3 classifiers. Level-0 is the model that predicts whether a protein sequence is an enzyme or not. Level-1 through Level-4 correspond to the four levels of the EC number hierarchy.
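The "at least 2 of the top 3 classifiers" rule in the footnote is a majority vote. The sketch below shows one way to express that rule and to recompute the overall accuracy column from the counts in the table. It is a minimal illustration, not the authors' implementation; the function names and the example votes are hypothetical.

```python
from collections import Counter


def consensus(predictions, min_votes=2):
    """Return the label chosen by at least `min_votes` classifiers, else None."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_votes else None


def overall_accuracy(correct, total):
    """Percentage of correctly predicted instances, rounded to 2 decimals."""
    return round(100.0 * correct / total, 2)


# Hypothetical Level-1 votes from three classifiers.
print(consensus(["EC2", "EC2", "EC3"]))  # EC2 (two of three agree)
print(consensus(["EC1", "EC2", "EC3"]))  # None (no agreement)

# Level-0 row of the table: 188866 correct out of 193240 sequences.
print(overall_accuracy(188866, 193240))  # 97.74
```

When all three classifiers disagree, this sketch returns `None` rather than picking a label; how such cases were handled in the paper is not stated here.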
Ten-fold cross validation and testing accuracy for enzyme identification and enzyme classification.
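| Classifier | Enzyme Identification (EC L0): 10-fold CV* | Enzyme Identification (EC L0): Testing | Enzyme Classification (EC L1): 10-fold CV* | Enzyme Classification (EC L1): Testing |
|---|---|---|---|---|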
| DS | 66.39 | 66.39 | 39.12 | 39.31 |
| NBC | 92.60 | 92.46 | 96.11 | 95.88 |
| KNN | 94.38 | 94.38 | 97.80 | 97.56 |
| SVM | 95.69 | 94.86 | 98.34 | 98.39 |
| RFC | 98.42 | 94.60 | 97.50 | 97.28 |
*Ten-fold cross-validation accuracy. The ML classifiers used at EC L0 and EC L1 are Decision Stump (DS), Naïve Bayes Classifier (NBC), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest Classifier (RFC). At EC L0, the train and test sets contain 154,592 and 38,648 sequences, respectively; at EC L1, they contain 50,139 and 12,535 sequences, respectively.
Figure 2. ROC curves at Level-0 and Level-1 for the top 3 performing classifiers. (A) Testing at Level-0 using KNN, (B) Testing at Level-0 using RFC, (C) Testing at Level-0 using SVM, (D) Testing at Level-1 using KNN, (E) Testing at Level-1 using RFC, (F) Testing at Level-1 using SVM. Because of the high accuracies, the False Positive Rate (X-axis) is shown only up to 0.5 for Level-0 and up to 0.05 for Level-1.
Figure 3. Comparison of ECemble with BLAST and EFICAz methods.
Figure 4. Fractions of known and ECemble predicted enzymes in the proteomes of 10 model organisms from UniProt. The data are sorted from highest to lowest known enzyme fractions. Both reviewed and unreviewed sequences from UniProt were used.
ECemble predicted enzymes from reviewed human proteome.
| UniProt accession | Current annotation$ | ECemble prediction | Enzyme name | |
|---|---|---|---|---|
| Q9Y394 | 1.1.-.- | 1.1.1.n4* | (-)-trans-carveol dehydrogenase | |
| Q9NXC2 | 1.-.-.- | 1.1.1.n6* | D-chiro-inositol 3-dehydrogenase | |
| Q96CU9 | 1.-.-.- | 1.1.5.4* | Malate dehydrogenase (quinone) | |
| O75342 | 1.13.11.- | 1.13.11.12* | Linoleate 13S-lipoxygenase | |
| Q9H2A2 | 1.2.1.- | 1.2.1.16* | Succinate-semialdehyde dehydrogenase (NAD(P)(+)) | |
| Q8NBX0 | 1.-.-.- | 1.5.1.10* | Saccharopine dehydrogenase (NADP(+), L-glutamate-forming) | |
| Q9UNQ2 | 2.1.1.- | 2.1.1.182* | 16S rRNA (adenine(1518)-N(6)/adenine(1519)-N(6))-dimethyltransferase | |
| A6NJ78 | 2.1.1.- | 2.1.1.199* | 16S rRNA (cytosine(1402)-N(4))-methyltransferase | |
| P0C7V9 | 2.1.1.- | 2.1.1.199* | 16S rRNA (cytosine(1402)-N(4))-methyltransferase | |
| Q8N2H3 | 1.-.-.- | 2.1.1.74* | (FADH(2)-oxidizing) | |
| Q8NHS2 | 2.6.1.- | 2.6.1.9* | Histidinol-phosphate transaminase | |
| Q96C11 | 2.7.1.- | 2.7.1.16* | Ribulokinase | |
| A4D126 | 2.7.7.- | 2.7.7.60* | 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase | |
| Q7L5L3 | 3.1.-.- | 3.1.4.46* | Glycerophosphodiester phosphodiesterase | |
| O43897 | 3.4.24.- | 3.4.24.21* | Astacin | |
| Q9H6P5 | 3.4.25.- | 3.4.25.2* | HslU--HslV peptidase | |
| Q6DHV7 | 3.5.4.- | 3.5.4.2* | Adenine deaminase | |
| P48200 | None | 4.2.1.33* | 3-isopropylmalate dehydratase | |
| P17643 | 1.14.18.- | 1.14.11.13** | Gibberellin 2-beta-dioxygenase | |
| P80365 | 1.1.1.- | 1.3.1.9** | Enoyl-[acyl-carrier-protein] reductase (NADH) | |
| Q9UHB4 | 1.6.-.- | 1.8.1.2** | Sulfite reductase (NADPH) | |
| Q8N8Q3 | 3.1.26.- | 3.1.21.7** | Deoxyribonuclease V | |
| Q9UJA9 | 3.1.-.- | 3.6.1.27** | Undecaprenyl-diphosphate phosphatase | |
| Q15777 | 3.1.-.- | 3.6.1.41** | Bis(5'-nucleosyl)-tetraphosphatase (symmetrical) | |
| Q86YJ6 | 4.2.3.- | 4.2.1.20** | Tryptophan synthase | |
| Q9BYJ1 | 5.4.4.- | 1.13.11.12*** | Linoleate 13S-lipoxygenase | |
| P08910 | 3.1.1.- | 2.3.1.31*** | Homoserine O-acetyltransferase | |
| Q96SE0 | 3.1.1.- | 2.3.1.31*** | Homoserine O-acetyltransferase | |
| Q8WU67 | 3.1.1.- | 2.3.1.84*** | Alcohol O-acetyltransferase | |
| Q8WUS8 | 1.1.1.- | 5.1.3.20*** | ADP-glyceromanno-heptose 6-epimerase |
$Current annotation is based on UniProt database
n Preliminary EC numbers include an 'n' as part of the fourth digit in ENZYME database
* Good predictions that complement the current annotations
** Predictions that differ from the current annotations only at the subclass level
*** Predictions that differ from the current annotations at the class level
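One way to read the three footnote markers is to compare the current annotation and the prediction field by field and find the first EC level at which they conflict. The sketch below encodes that reading. It is an assumption made for illustration, not the authors' procedure: the function names are hypothetical, a `-` field is treated as "unspecified", an annotation of `None` is not handled, and individual rows of the table may not follow this simplified rule.

```python
def first_conflict_level(current, predicted):
    """Return the first EC level (1-4) where two EC numbers disagree.

    A '-' field in the current annotation is treated as unspecified and
    therefore compatible with any value. Returns None if nothing conflicts.
    """
    pairs = zip(current.split("."), predicted.split("."))
    for level, (cur, pred) in enumerate(pairs, start=1):
        if cur != "-" and cur != pred:
            return level
    return None


def footnote_marker(current, predicted):
    """Map the conflict level to the footnote markers used in the table."""
    level = first_conflict_level(current, predicted)
    if level is None:
        return "*"      # prediction complements the current annotation
    if level == 1:
        return "***"    # differs at the class level
    return "**"         # differs below the class level


print(footnote_marker("1.1.-.-", "1.1.1.n4"))      # *
print(footnote_marker("1.14.18.-", "1.14.11.13"))  # **
print(footnote_marker("3.1.1.-", "2.3.1.31"))      # ***
```

Treating `-` as a wildcard is what lets a partial annotation such as `1.1.-.-` count as consistent with a full four-level prediction.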
Figure 5. Distribution of pathways and enzymes in each of the major pathway categories. (A) Distribution of mapped pathways in each of the 8 major pathway categories. (B) Distribution of human and bacterial enzymes in each of the 8 major pathway categories.
Figure 6. Starch and sucrose metabolism pathway. The following color coding scheme is used for the pathways: Known human enzyme reactions (Light Red), Unknown human enzyme reactions that are predicted by ECemble (Pink), Known bacterial enzyme reactions in the gut microflora predicted by ECemble (Light green) and Unknown gut bacterial enzyme reactions that are predicted by ECemble (Light Blue).
Enzymes and their function in starch and sucrose metabolism pathway.
| Function | Enzyme used | EC Number |
|---|---|---|
| Sucrose → Sucrose-6P | Protein-N(pi)-phosphohistidine--sugar phosphotransferase | 2.7.1.69 |
| Sucrose-6P → α-D-Glucose-6P (glucose) | Beta-fructofuranosidase | 3.2.1.26 |
| Pectin → Pectate | Pectinesterase | 3.1.1.11 |
| Pectate → Galacturonate | Polygalacturonase | 3.2.1.15 |
| Xylan → Xylose | Xylan 1,4-beta-xylosidase | 3.2.1.37 |
Figure 7. Fructose and mannose metabolism pathway. The following color coding scheme is used for the pathways: Known human enzyme reactions (Light Red), Unknown human enzyme reactions that are predicted by ECemble (Pink), Known bacterial enzyme reactions in the gut microflora predicted by ECemble (Light green) and Unknown gut bacterial enzyme reactions that are predicted by ECemble (Light Blue).
Figure 8. Number of obesity and IBD enzymes in each of the major KEGG pathway categories. (A) Number of significant (p-value < 0.05; Fisher Exact Test) enzymes from obese vs. lean subjects in each major pathway category. (B) Number of significant (p-value < 0.05; Fisher Exact Test) enzymes from IBD vs. non-IBD subjects in each major pathway category.
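The significance threshold in this caption comes from a Fisher Exact Test. As a rough illustration of what such a test computes, the sketch below implements the two-sided test for a 2x2 table using only the Python standard library. The layout of the table (samples with or without an enzyme, split by group) and the counts are hypothetical assumptions; the caption does not specify how the contingency tables were built.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-7))


# Hypothetical counts: enzyme detected in 9 of 10 samples from one group
# and in 2 of 10 samples from the other.
p_value = fisher_exact_two_sided(9, 1, 2, 8)
print(p_value < 0.05)  # True
```

An exact test suits this setting because per-enzyme counts can be small, where approximations such as the chi-square test become unreliable.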
Class-wise statistics on the number of enzyme sequences, and the subclasses at each level.
| EC Level-1 (6 Classes) | Total Sequences | Subclasses at Level-2 | Subclasses at Level-3 | Subclasses at Level-4 |
|---|---|---|---|---|
| EC1: Oxidoreductases | 8662 | 22 | 90 | 658 |
| EC2: Transferases | 23604 | 9 | 31 | 751 |
| EC3: Hydrolases | 15183 | 11 | 49 | 781 |
| EC4: Lyases | 5525 | 7 | 15 | 268 |
| EC5: Isomerases | 4146 | 6 | 17 | 127 |
| EC6: Ligases | 7830 | 6 | 11 | 109 |
| Total |
There are 6 major classes at Level-1 that expand into 61, 213 and 2,693 subclasses at Levels 2, 3 and 4, respectively.