Uri Weingart, Yair Lavi, David Horn.
Abstract
BACKGROUND: Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence similarity or sequence motifs. We employ a novel motif method based on Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology, which searches for SPs on arbitrary proteins and determines from a protein's sequence whether it is an enzyme and, if so, what its EC classification is.
Year: 2009 PMID: 20034383 PMCID: PMC2811123 DOI: 10.1186/1471-2105-10-446
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Compilation of training and test datasets.
| Dataset | Selection Criteria from Swiss-Prot | Number of Proteins (and SPs) | Precision | Recall |
|---|---|---|---|---|
| Training set #1 | Single EC annotation and Date-Integrated before 7/1/2006 | 89,854 | 100% | 85% |
| "Enzyme Test Set" | EC annotation and Date-Integrated between 7/1/2006 and 7/1/2008 | 24,443 | 98% | 70% |
| "Ten Organism Test-Set" | EC annotation and Date-Integrated between 7/1/2006 and 7/1/2008 and all non-enzymes before 7/1/2008 | 4,509 | 98% | 76% |
| Training set #2 | Single EC annotation and Date-Integrated before 7/27/2009 | 201,169 | 100% | 94% |
Variation of precision and recall of DME (based on the 1st SP set) on the enzyme test-set as a function of the L3 threshold.
| L3 threshold | Precision | Recall |
|---|---|---|
| 5 | 95.1% | 72.4% |
| 6 | 95.8% | 72.3% |
| 7 | 98.4% | 70.0% |
| 8 | 99.4% | 67.1% |
| 9 | 99.5% | 66.2% |
| 10 | 99.5% | 65.4% |
| 11 | 99.5% | 65.0% |
| 12 | 99.6% | 64.8% |
| 13 | 99.6% | 63.9% |
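The trade-off in this table can be explored with a few lines of code. The following is a minimal sketch, not the authors' procedure: it copies the rows above and returns the lowest L3 threshold whose precision meets a chosen target. The function name `lowest_threshold` and the 98% target are assumptions made for illustration.

```python
# (L3 threshold, precision %, recall %) rows copied from the table above.
ROWS = [
    (5, 95.1, 72.4),
    (6, 95.8, 72.3),
    (7, 98.4, 70.0),
    (8, 99.4, 67.1),
    (9, 99.5, 66.2),
    (10, 99.5, 65.4),
    (11, 99.5, 65.0),
    (12, 99.6, 64.8),
    (13, 99.6, 63.9),
]


def lowest_threshold(rows, min_precision):
    """Return the row with the lowest threshold whose precision meets the target.

    Hypothetical helper: the selection rule is an assumption, not taken
    from the source. Returns None if no row reaches the target.
    """
    for threshold, precision, recall in sorted(rows):
        if precision >= min_precision:
            return threshold, precision, recall
    return None


# Assumed target of 98% precision, chosen only for illustration.
chosen = lowest_threshold(ROWS, 98.0)
print(chosen)
```

With a 98% target this picks threshold 7, whose precision and recall (98.4%, 70.0%) agree with the rounded values listed for the "Enzyme Test Set" in the dataset table.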
Figure 1. SP hits on the ten-organism test-set. The numbers of proteins in the ten-organism test-set carrying n SP matches, for n = 0 to 10. The inset shows the same data on a semi-log scale, emphasizing the sharp exponential decrease for low n, partially reflecting the existence of erroneous SP hits.
Comparison of results for the ten-organism test-set with those of a random model as a function of coverage-length at level 3 of the EC hierarchy.
| L3 | Real | Random | stdev | Noise |
|---|---|---|---|---|
| 0 | 3768 | 4150.33 | 18.8 | |
| 4 | 0 | 1.00 | 0 | |
| 5 | 41 | 41.67 | 1.53 | 1.02 |
| 6 | 305 | 256.00 | 19.1 | 0.84 |
| 7 | 106 | 54.33 | 1.15 | 0.51 |
| 8 | 13 | 3.67 | 2.89 | 0.28 |
| 9 | 5 | 0.00 | 0 | 0 |
| 10 | 2 | 0.00 | 0 | 0 |
| 11 | 1 | 0.00 | 0 | 0 |
| 12-15 | 25 | 2.00 | 1.73 | 0.08 |
| >15 | 243 | 0 | 0 | 0 |
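The Noise column is not defined in this excerpt, but its values are consistent with the ratio of the random-model count to the real count. The sketch below checks that reading against the rows where a noise value is listed and the real count is nonzero; the function name `noise_ratio` and the interpretation itself are assumptions inferred from the numbers.

```python
# (L3 coverage-length, real count, random-model mean) copied from the table above.
ROWS = [
    ("5", 41, 41.67),
    ("6", 305, 256.00),
    ("7", 106, 54.33),
    ("8", 13, 3.67),
    ("12-15", 25, 2.00),
]


def noise_ratio(real, random_mean):
    """Random-model count divided by real count (assumed definition of Noise).

    Returns None when the real count is zero, where the ratio is undefined.
    """
    if real == 0:
        return None
    return random_mean / real


# Round to two decimals to compare with the published column.
ratios = {label: round(noise_ratio(real, rnd), 2) for label, real, rnd in ROWS}
print(ratios)
```

Under this reading, a noise value near 1 (coverage-length 5) means the real hits are indistinguishable from chance, while values well below 1 at higher coverage-lengths indicate that most hits are unlikely to be random.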
DME predictions vs. Swiss-Prot EC (level 3) annotations for the 10 organism Test Set.
| | DME | Swiss-Prot | # proteins |
|---|---|---|---|
| A | P | P | 252 |
| B | P | DP | 4 |
| C | P | NP | 139 |
| D | NP | P | 76 |
| E | NP | NP | 4,038 |
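The counts in this table can be tied back to the precision and recall reported for the "Ten Organism Test-Set". The sketch below is one reading that reproduces those figures, under stated assumptions: row A counts as agreement, row B as disagreement, row D as a miss, and row C (a DME prediction with no Swiss-Prot annotation) is left out of the precision denominator because it cannot be judged against an annotation. These conventions are inferred from the numbers, not quoted from the source.

```python
# Protein counts per row of the table above.
counts = {"A": 252, "B": 4, "C": 139, "D": 76, "E": 4038}

# All five rows together should account for the whole test set.
total = sum(counts.values())

# Assumed convention: precision over predictions that can be checked
# against an existing Swiss-Prot annotation (rows A and B only).
precision = counts["A"] / (counts["A"] + counts["B"])

# Assumed convention: recall over all Swiss-Prot annotated enzymes
# (rows A, B and D).
recall = counts["A"] / (counts["A"] + counts["B"] + counts["D"])

print(total, round(precision * 100), round(recall * 100))
```

The total matches the 4,509 proteins of the test set, and the rounded values match the 98% precision and 76% recall listed for it.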
DME predictions for the ten-organism test-set are compared with recent Swiss-Prot EC assignments.
| DME Prediction (1st SP set) | L1 | L2 | L3 | L4 | Current Swiss-Prot EC annotation |
|---|---|---|---|---|---|
| 1.11.1 | 25 | 22 | 22 | 0 | 1 |
| 3.6.3 | 25 | 25 | 25 | 0 | 3.6.3.34 |
| 3.6.3 | 7 | 7 | 7 | 0 | 3.6.3 |
| 3.6.3 | 13 | 13 | 13 | 0 | 3.6.3 |
| 4.1.2 | 9 | 9 | 9 | 0 | 4.1.2.n3 |
| 3.6.3.17 | 14 | 8 | 8 | 8 | 3.6.3 |
| 3.6.1 | 58 | 58 | 58 | 0 | 3.6.1 |
L1 to L4 are the coverage-lengths at EC levels 1 to 4, respectively.
Numbers of sequences with consistent SP hits (same category at level 3 of the EC hierarchy) are compared between 5000 proteins randomly chosen from Sargasso-Sea data and a corresponding random model, as a function of coverage-length.
| L3 | Real | Random | stdev | Noise |
|---|---|---|---|---|
| 0 | 3,910 | 4,868 | 5.1 | |
| 7 | 235 | 127 | 5.5 | 0.54 |
| 8 | 71 | 6 | 2.1 | 0.08 |
| 9 | 40 | 0 | 0 | |
| 10 | 27 | 0 | 0 | |
| >10 | 717 | 0 | 0 | |
Figure 2. Numbers of enzymes predicted in Sargasso-Sea data. Numbers of enzymes predicted by DME in the Sargasso-Sea data. Shown are the thirty leading level 3 EC categories.
Leading occurrences of EC-numbers in Sargasso-Sea data
| EC | # proteins | Enzymatic activity |
|---|---|---|
| 2.7.7.6 | 5,993 | DNA-directed RNA polymerase |
| 1.6.99.5 | 2,999 | NADH dehydrogenase (quinone) |
| 5.99.1.3 | 2,610 | DNA topoisomerase (ATP-hydrolysing). DNA gyrase. |
| 6.3.5.5 | 2,198 | carbamoyl-phosphate synthase (glutamine-hydrolysing) |
| 3.6.3.14 | 2,169 | H+-transporting two-sector ATPase. ATP synthase. |
| 2.7.7.7 | 2,083 | DNA-directed DNA polymerase |
Some examples of doubly annotated enzymes uncovered by DME in the Sargasso-Sea data.
| Prediction a | Prediction b | # Proteins |
|---|---|---|
| 3.5.4.25 | 4.1.99.12 | 27 |
| 3.6.3.44 | 2.7.1.130 | 6 |
| 1.1.1.205 | 1.7.1.7 | 6 |
| 2.7.1.25 | 2.7.7.4 | 6 |
Figure 3. Enzymatic profiles. Enzymatic profiles of three metagenomes. The relative numbers of identified enzymes in the 30 leading sub-subclasses (EC level 3) of the Sargasso-Sea metagenome are compared with those of the gut microbiomes.
Absolute values of differences between enzymatic profiles, based on the DME predicted distributions at level 3 of EC.
| Metagenome | Sargasso | Subject7 | Subject8 |
|---|---|---|---|
| Sargasso | 0 | 0.42 | 0.41 |
| Subject7 | 0.42 | 0 | 0.18 |
| Subject8 | 0.41 | 0.18 | 0 |
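A distance of this kind can be sketched as the sum of absolute differences between two normalized level 3 EC distributions. The code below is a minimal illustration under that assumption; the exact normalization used for the table is not stated in this excerpt, and the function names and the toy counts are hypothetical, not data from the study.

```python
def normalize(counts):
    """Convert raw per-category counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no counts")
    return {ec: n / total for ec, n in counts.items()}


def profile_distance(counts_a, counts_b):
    """Sum of absolute differences between two normalized profiles.

    Assumed definition: categories missing from one profile count as 0.
    """
    pa, pb = normalize(counts_a), normalize(counts_b)
    categories = set(pa) | set(pb)
    return sum(abs(pa.get(ec, 0.0) - pb.get(ec, 0.0)) for ec in categories)


# Toy counts per EC level 3 category, invented for illustration only.
sample_a = {"2.7.7": 60, "3.6.3": 40}
sample_b = {"2.7.7": 30, "3.6.3": 50, "1.1.1": 20}

distance = profile_distance(sample_a, sample_b)
print(round(distance, 2))
```

By construction this measure is symmetric and zero for identical profiles, which matches the symmetric layout and zero diagonal of the table above.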