Yannick Pouliot, Peter D Karp.
Abstract
BACKGROUND: Using computational database searches, we have demonstrated previously that no gene sequences could be found for at least 36% of enzyme activities that have been assigned an Enzyme Commission number. Here we present a follow-up literature-based survey involving a statistically significant sample of such "orphan" activities. The survey was intended to determine whether sequences for these enzyme activities are truly unknown, or whether these sequences are absent from the public sequence databases but can be found in the literature.
Year: 2007 PMID: 17623104 PMCID: PMC1940265 DOI: 10.1186/1471-2105-8-244
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of a metabolic pathway involving a validated orphan.
Biological significance of selected validated orphans. The extent and significance of published research associated with a selection of validated orphans is detailed.
| 1.1.1.43 | Phosphogluconate 2-dehydrogenase | 1961 | 2417 | Positive reports of evaluation as a drug target against Trypanosome; trypanocidal activity has been reported; involved in 2-dehydro-D-gluconate degradation pathway |
| 5.1.3.17 | Heparosan- | 1979 | 16 | Involved in the biosynthesis of heparan sulfate, which binds proteins to modulate signaling events in embryogenesis. Mouse gene knock-out results in late lethal phenotype. |
| 2.3.1.105 | Alkylglycerophosphate 2-O-acetyltransferase | 1986 | 9 | Involved in platelet activating factor biosynthesis; possible involvement in ischemia |
| 3.1.3.59 | Alkylacetylglycerophosphatase | 1986 | 9 | Involved in platelet activating factor biosynthesis |
| 2.7.1.106 | Glucose-1,6-bisphosphate synthase | 1975 | 9 | Present in several mammalian tissues. Involved in glucose metabolism |
| 1.2.1.23 | 2-oxoaldehyde dehydrogenase (NAD+) | 1967 | 9 | Involved in the development of diabetic complications |
| 1.14.11.6 | Thymine dioxygenase | 1972 | 9 | Present in both lower and higher eukaryotes |
| 1.1.1.16 | Galactitol 2-dehydrogenase | 1956 | 5 | Insulin dysregulation |
| 2.3.1.14 | Glutamine N-phenylacetyltransferase | 1957 | 4 | Investigated as a predictor of carotid endarterectomy in middle-aged individuals |
| 1.2.1.25 | 2-oxoisovalerate dehydrogenase (acylating) | 1969 | 4 | Present in prokaryotes and eukaryotes. In the latter, participates in primary metabolism pathway for valine degradation |
E.C. 2.3.1.23 is listed in italics because it was cloned and sequenced in 2006, after the completion of this study
Figure 2. Literature survey process.
Summary of survey results
| Total number of putative orphans | 1,356 | |
| Number required to achieve 95% significance | 180 | 13.3% |
| Number orphans evaluated | 228 | 16.8% |
| Out of 228 orphans: | ||
| Number of artifactual orphans | 41 | 18.0% |
| Number of valid orphans | 187 | 82.0% |
| Max. number of salvageable orphans (all rankings) | 57 | 25.0% |
| Out of 57 salvageable orphans: | ||
| Excellent | 9 | 15.8% |
| Good | 23 | 40.4% |
| Marginal | 9 | 15.8% |
| Poor | 16 | 28.1% |
| Bacterial salvageable orphans | 26 | 45.6% |
| Eukaryotic salvageable orphans | 31 | 54.4% |
The survey was designed to achieve a maximum sampling error of 5%, 19 times out of 20. This corresponds to a minimum sample size of ~180 orphans. A total of 228 orphans were in fact surveyed. In a number of cases more than one instance of an orphan activity was evaluated because the activity was reported in more than one species. Consequently, 286 instances were evaluated.
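The percentages in the summary table can be re-derived from the counts alone. The following is a minimal sketch for checking them, not part of the original study; the dictionary keys and the `pct` helper are illustrative names, and the assumed denominators are 1,356 putative orphans, 228 evaluated orphans, and 57 salvageable orphans, as the table's sub-headings indicate.

```python
# Counts copied from the summary table; all names here are illustrative.
TOTAL_PUTATIVE = 1356

def pct(part: int, whole: int, digits: int = 1) -> float:
    """Return `part` as a percentage of `whole`, rounded to `digits` places."""
    return round(100.0 * part / whole, digits)

of_total = {"required_sample": 180, "evaluated": 228}
of_evaluated = {"artifactual": 41, "valid": 187, "salvageable": 57}
of_salvageable = {"excellent": 9, "good": 23, "marginal": 9, "poor": 16}
by_domain = {"bacterial": 26, "eukaryotic": 31}

evaluated = of_total["evaluated"]
salvageable = of_evaluated["salvageable"]

# Each group of rows uses the denominator named in its table sub-heading.
report = {}
report.update({k: pct(v, TOTAL_PUTATIVE) for k, v in of_total.items()})
report.update({k: pct(v, evaluated) for k, v in of_evaluated.items()})
report.update({k: pct(v, salvageable) for k, v in of_salvageable.items()})
report.update({k: pct(v, salvageable) for k, v in by_domain.items()})

for name, value in report.items():
    print(f"{name:16s} {value:5.1f}%")
```

Run as written, the sketch prints one percentage per table row, which can be compared against the published column. The counts are also internally consistent: artifactual plus valid orphans sum to 228, and both the ranking rows and the domain rows sum to 57.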
Species distribution of Eubacterial validated orphans
| Species | No. of Orphans | Species | No. of Orphans |
| Acinetobacter NCIB 9871 | 1 | Pasteurella tuberculosis | 2 |
| Actinoplanes missouriensis | 1 | Pedobacter heparinus | 1 |
| Aeromonas sp. | 1 | Propionibacterium pentosaceum | 1 |
| Alcaligenes eutrophus | 1 | Proteus mirabilis | 1 |
| Alcaligenes faecalis | 1 | Pseudomonas (species undefined) | 2 |
| Arthrobacter GJM-1 | 1 | Pseudomonas fluorescens | 3 |
| Arthrobacter oxydans | 1 | Pseudomonas graveolens | 1 |
| Arthrobacter sp. | 2 | Pseudomonas MS | 1 |
| Azotobacter vinelandii | 1 | Pseudomonas MSU-1 | 1 |
| Bacillus subtilis | 1 | Pseudomonas P-2 | 2 |
| Cellulomonas sp. | 1 | Pseudomonas putida | 7 |
| Clostridium cylindrosporum | 1 | Pseudomonas putida P2 | 1 |
| Clostridium kluyveri | 2 | Pseudomonas saccharophila | 2 |
| Clostridium pasteurianum | 1 | Pseudomonas sp. | 4 |
| Clostridium SB4 | 1 | Pseudomonas sp. P-501 | 1 |
| Clostridium sporogenes | 2 | Pseudomonas syringae GG | 1 |
| Corynebacterium cyclohexanicum | 1 | Pseudomonas testosteroni | 1 |
| Escherichia coli | 8 | Rhodococcus | 1 |
| Flavobacterium | 1 | Rhodopseudomonas sphaeroides | 1 |
| Flavobacterium sp. | 1 | Salmonella typhimurium | 1 |
| Klebsiella aerogenes | 1 | Streptococcus faecalis | 1 |
| Micrococcus denitrificans | 1 | Streptococcus mutans | 1 |
| Microorganism | 2 | Streptomyces virginiae | 1 |
| Mycobacterium tuberculosis | 1 | Thiobacillus thioparus | 1 |
| Nocardia (species undefined) | 1 | Unknown | 2 |
The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("species undefined", "unknown", etc).
Species distribution of Eukaryotic validated orphans
| Species | No. of Orphans | Species | No. of Orphans |
| Acrocylindrium sp. | 1 | Nectria haematococca/Fusarium solani f.sp. Phaseoli | 1 |
| Arachis hypogaea | 2 | Neurospora (subspecies undefined) | 1 |
| Asparagus officinalis | 1 | Neurospora crassa | 1 |
| Aspergillus niger | 2 | Ochromonas malhamensis | 1 |
| Avena coleoptiles | 1 | Ovis aries | 2 |
| Bauhinia monandra | 1 | Pea sativum var. Alaska | 1 |
| Bos taurus | 3 | Penicillium atrovenetum | 1 |
| Capra hircus | 1 | Penicillium chrysogenum | 1 |
| Catharanthus roseus | 1 | Penicillium patulum | 1 |
| Cavia porcellus | 3 | Phaseolus aureus | 3 |
| Chlorella | 1 | Phaseolus radiatus | 1 |
| Chrysosplenium americanum | 1 | Pisum sativum (variety unspecified) | 1 |
| Cichorium endivia | 1 | Pycnoporus coccineus | 1 |
| Citrus (subspecies undefined) | 1 | Raphanus sativus | 1 |
| Corydalis cava | 1 | Rat (subspecies undefined) | 18 |
| Cucurbita maxima | 1 | Rat Sprague-Dawley | 5 |
| Daucus carota | 1 | Rhodotorula glutinis | 2 |
| Entamoeba histolytica | 1 | Saccharomyces cerevisiae | 5 |
| Euglena gracilis | 1 | Saccharum officinarum | 1 |
| Flaveria spp. | 1 | Secale cereale | 1 |
| Fundulus heteroclitus | 1 | Sesamum indicum | 1 |
| Gallus gallus | 1 | Several | 1 |
| Homo sapiens | 5 | Sorghum bicolor | 2 |
| Hordeum (species undefined) | 1 | Spinacia | 1 |
| Hordeum vulgare subsp. Vulgare | 2 | Spinacia oleracea | 1 |
| Lasallia pustulata | 1 | Sus scrofa | 7 |
| Lilium longiflorum | 1 | Tecoma stans | 1 |
| Lupinus albus | 1 | Thea sinensis | 1 |
| Lycopersicon esculentum | 2 | Trypanosoma brucei brucei | 1 |
| Macaca mulatta | 1 | Tulipa cv. Apeldoorn | 1 |
| Mentha piperita | 1 | Unknown | 1 |
| Mesocricetus auratus | 1 | Yeast (species undefined) | 4 |
| Mold | 1 | Zea mays | 2 |
| Mouse (species undefined) | 3 |
The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("mold", "mouse", "unknown", etc).
Validated orphan activities
| 1.1.1.13 | difficult | 1.14.99.24 | difficult | 3.1.3.47 | good |
| 1.1.1.16 | difficult | 1.21.3.2 | difficult | 3.1.3.59 | difficult |
| 1.1.1.43 | difficult | 1.97.1.3 | difficult | 3.1.3.72 | difficult |
| 1.1.1.54 | difficult | 2.1.1.112 | difficult | 3.1.4.43 | difficult |
| 1.1.1.84 | excellent | 2.1.1.137 | artifact | 3.1.6.17 | good |
| 1.1.1.92 | marginal | 2.1.1.141 | artifact | 3.1.8.2 | artifact |
| 1.1.1.101 | difficult | 2.1.1.143 | artifact | 3.2.1.56 | difficult |
| 1.1.1.144 | difficult | 2.1.1.147 | difficult | 3.2.1.77 | difficult |
| 1.1.1.146 | artifact | 2.1.1.84 | difficult | 3.2.1.100 | good |
| 1.1.1.163 | artifact | 2.1.1.99 | difficult | 3.2.1.112 | excellent |
| 1.1.1.172 | good | 2.1.2.4 | difficult | 3.2.1.115 | difficult |
| 1.1.1.196 | good | 2.3.1.14 | difficult | 3.2.1.128 | excellent |
| 1.1.1.208 | poor | 2.3.1.23 | difficult | 3.2.1.136 | difficult |
| 1.1.1.226 | excellent | 2.3.1.24 | artifact | 3.2.1.137 | poor |
| 1.1.1.245 | artifact | 2.3.1.33 | difficult | 3.2.2.10 | difficult |
| 1.1.1.258 | artifact | 2.3.1.49 | difficult | 3.4.11.16 | good |
| 1.1.1.265 | excellent | 2.3.1.68 | marginal | 3.4.13.7 | difficult |
| 1.1.2.5 | artifact | 2.3.1.96 | difficult | 3.4.17.16 | good |
| 1.1.3.23 | difficult | 2.3.1.98 | good | 3.4.21.103 | artifact |
| 1.17.1.1 | difficult | 2.3.1.102 | artifact | 3.4.22.44 | artifact |
| 1.17.99.2 | artifact | 2.3.1.103 | poor | 3.4.22.46 | artifact |
| 1.2.1.18 | artifact | 2.3.1.105 | poor | 3.4.23.28 | difficult |
| 1.2.1.20 | difficult | 2.3.1.114 | poor | 3.4.23.30 | artifact |
| 1.2.1.23 | poor | 2.3.1.133 | marginal | 3.4.24.54 | artifact |
| 1.2.1.25 | difficult | 2.3.1.140 | difficult | 3.5.1.30 | good |
| 1.2.1.32 | artifact | 2.3.1.161 | artifact | 3.5.1.33 | artifact |
| 1.2.1.33 | difficult | 2.3.2.3 | artifact | 3.5.1.39 | poor |
| 1.2.1.52 | difficult | 2.3.2.7 | difficult | 3.5.1.58 | excellent |
| 1.2.1.54 | good | 2.3.3.3 | difficult | 3.5.1.62 | artifact |
| 1.2.1.63 | marginal | 2.4.1.23 | poor | 3.5.1.67 | difficult |
| 1.2.1.64 | artifact | 2.4.1.29 | difficult | 3.5.1.71 | poor |
| 1.2.3.6 | difficult | 2.4.1.41 | valid | 3.5.1.79 | artifact |
| 1.2.3.7 | poor | 2.4.1.43 | difficult | 3.5.2.13 | poor |
| 1.2.3.8 | artifact | 2.4.1.57 | artifact | 3.5.2.16 | artifact |
| 1.3.1.4 | difficult | 2.4.1.66 | artifact | 3.5.3.2 | good |
| 1.3.1.5 | difficult | 2.4.1.73 | artifact | 3.5.5.2 | difficult |
| 1.3.1.6 | artifact | 2.4.1.97 | difficult | 3.6.1.18 | excellent |
| 1.3.1.11 | difficult | 2.4.1.110 | poor | 3.6.1.2 | artifact |
| 1.3.1.37 | difficult | 2.4.1.125 | difficult | 3.6.1.52 | artifact |
| 1.3.7.1 | difficult | 2.4.1.126 | valid | 3.6.3.17 | artifact |
| 1.3.99.15 | artifact | 2.4.1.153 | poor | 3.6.3.24 | artifact |
| 1.3.99.21 | artifact | 2.4.1.167 | difficult | 3.6.3.28 | artifact |
| 1.4.1.11 | good | 2.4.1.176 | difficult | 3.6.4.4 | artifact |
| 1.4.1.17 | good | 2.4.1.180 | excellent | 4.1.1.24 | difficult |
| 1.4.99.4 | marginal | 2.4.1.184 | difficult | 4.1.1.52 | difficult |
| 1.4.99.5 | artifact | 2.4.1.215 | artifact | 4.1.1.56 | difficult |
| 1.5.1.21 | good | 2.4.2.35 | difficult | 4.1.1.75 | difficult |
| 1.5.99.11 | artifact | 2.5.1.4 | difficult | 4.1.2.23 | difficult |
| 1.6.5.7 | artifact | 2.5.1.42 | difficult | 4.1.2.28 | difficult |
| 1.7.3.1 | difficult | 2.6.1.22 | difficult | 4.1.2.35 | difficult |
| 1.7.3.5 | valid | 2.6.1.27 | poor | 4.1.3.35 | difficult |
| 1.8.1.5 | artifact | 2.6.1.32 | difficult | 4.2.1.5 | difficult |
| 1.10.1.1 | difficult | 2.6.1.33 | poor | 4.2.1.43 | good |
| 1.10.3.4 | difficult | 2.6.1.75 | good | 4.2.1.62 | good |
| 1.12.98.2 | artifact | 2.7.1.43 | difficult | 4.2.1.77 | difficult |
| 1.13.11.14 | difficult | 2.7.1.54 | difficult | 4.2.1.81 | difficult |
| 1.13.11.24 | artifact | 2.7.1.64 | difficult | 4.2.1.93 | difficult |
| 1.13.11.25 | artifact | 2.7.1.77 | difficult | 4.2.1.97 | marginal |
| 1.13.11.35 | difficult | 2.7.1.106 | difficult | 4.2.1.101 | artifact |
| 1.13.12.9 | good | 2.7.1.131 | poor | 4.2.2.14 | artifact |
| 1.14.11.10 | difficult | 2.7.1.134 | difficult | 4.2.3.19 | artifact |
| 1.14.11.6 | poor | 2.7.1.142 | difficult | 4.2.99.19 | artifact |
| 1.14.13.10 | marginal | 2.7.4.20 | difficult | 4.3.1.10 | difficult |
| 1.14.13.23 | good | 2.7.7.44 | difficult | 4.3.1.20 | difficult |
| 1.14.13.24 | artifact | 2.7.7.51 | difficult | 4.5.1.4 | difficult |
| 1.14.13.42 | difficult | 2.7.8.10 | difficult | 5.1.1.6 | difficult |
| 1.14.13.51 | difficult | 2.7.8.22 | difficult | 5.1.1.9 | marginal |
| 1.14.13.58 | excellent | 2.8.1.3 | excellent | 5.1.3.17 | difficult |
| 1.14.13.60 | difficult | 2.8.2.28 | difficult | 5.2.1.10 | difficult |
| 1.14.13.72 | difficult | 3.1.1.36 | difficult | 5.2.1.11 | difficult |
| 1.14.13.73 | artifact | 3.1.1.39 | difficult | 5.4.3.5 | artifact |
| 1.14.15.2 | difficult | 3.1.1.40 | poor | 5.5.1.11 | good |
| 1.14.16.5 | good | 3.1.1.78 | artifact | 5.5.1.12 | artifact |
| 1.14.99.18 | artifact | 3.1.2.11 | difficult | 5.5.1.3 | difficult |
| 1.14.99.22 | artifact | 3.1.3.14 | difficult | 6.3.1.6 | difficult |
| | | 3.1.3.38 | poor | 6.3.4.8 | difficult |
All 228 orphans reviewed in this study are listed. The salvageability of an orphan is ranked "difficult" when factors such as unclear species of origin, lack of molecular descriptors, or lack of comprehensive genome sequence hinder cloning of the cognate gene. Note that such rankings do not take into account the availability of molecular descriptors which enable the identification of a candidate gene in one species, and, through orthology, the identification of a candidate gene in a second species for which these descriptors are not available.
Domain distribution of validated orphans
| Eukaryota | 156 | 54.36% |
| Eubacteria | 113 | 39.37% |
| Unknown | 15 | 5.23% |
| Viruses | 2 | 0.70% |
| Archaea | 1 | 0.35% |
Orphans with "Unknown" listed for their domain tend to be microbes that were insufficiently characterized to place them in either the Eubacteria or Archaea domains.
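The percentages in the domain distribution table are consistent with a denominator equal to the sum of the five rows. The sketch below is an illustrative check rather than the authors' procedure; `domain_counts`, `total_instances`, and `domain_pct` are hypothetical names.

```python
# Counts copied from the domain distribution table; names are illustrative.
domain_counts = {
    "Eukaryota": 156,
    "Eubacteria": 113,
    "Unknown": 15,
    "Viruses": 2,
    "Archaea": 1,
}

# Assumption: the denominator is the sum of all rows in the table.
total_instances = sum(domain_counts.values())

# Percentages rounded to two decimals, as in the table.
domain_pct = {
    domain: round(100.0 * n / total_instances, 2)
    for domain, n in domain_counts.items()
}

for domain, value in domain_pct.items():
    print(f"{domain:11s} {domain_counts[domain]:4d} {value:6.2f}%")
```

Under that assumption the denominator is 287, one more than the 286 instances quoted elsewhere in the survey, so the two figures should not be treated as interchangeable.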
Top four most represented Eubacteria
| 27 | 35.06% | |
| 8 | 10.39% | |
| 7 | 9.09% | |
| 4 | 5.19% |
Availability of completely sequenced genomes for Eubacterial validated orphans
| Same species | 23 | 31.9% | 9 | 18.4% |
| Same genus, related species | 12 | 16.7% | 16 | 32.6% |
The number of available comprehensive genome sequences for validated Eubacterial orphans was tallied. Cases where the genome sequence of a species does not exist but where the sequence of a related species from the same genus is available are also listed, as are ongoing comprehensive genomic sequencing projects for genomes not currently available.
Figure 3. Distribution of enzymatic activities in validated orphans. The percentage of validated orphan activities belonging to each EC class is shown.
Figure 4. Publication year of original publications describing orphan activities. The publication date associated with the original source articles of all instances of orphans surveyed here is plotted (286 instances of orphans, corresponding to 228 activities), based upon the IUBMB record. In a number of cases more than one instance of an orphan activity was evaluated because the activity was reported in more than one species.
Example artifactual orphans
| 3.4.21.103 | Physarolisin (a proteinase) | 1982 | Q8MZS4 | IUBMB entry lists a 2003 paper describing a gene coding for a protein with this activity [28]. Sequence is in Swiss-Prot but ENZYME does not reference this sequence. | Lack of database cross-referencing presumably involving the long interval between the initial characterization of the activity and the cloning of the gene. | |
| 3.5.2.16 | Maleimide hydrolase | 1997 | Q93T25 | ENZYME and IUBMB entries do not reference a Swiss-Prot entry from a 2002 paper describing the cloning of a gene coding for this activity [29]. | Lack of database cross-referencing is not restricted to older orphans. | |
| 3.5.1.79 | Phthalyl amidase | 1995 | N/A | The sequence, listed in a patent associated with a 1996 paper by [30] in | Note: though the protein sequence is not available from the UniProt database, the DNA sequence is present in the EMBL database. | |
| 3.1.8.2 | Diisopropyl-fluoro-phosphatase | 1954 | Q44238 | ENZYME and IUBMB entries are not referencing a Swiss-Prot entry associated with a 1996 paper describing the cloning of a gene coding for an enzyme with this activity [31]. | This enzymatic activity detoxifies nerve gas. The gene is part of a widespread gene family with otherwise unknown function, with members in |
Figure 5. Salvageability ranking of validated orphans. The suitability of validated orphans for eventual cloning of at least one cognate gene was evaluated according to the ranking system described in the text. Out of 228 orphans, 57 were judged to be salvageable. A: Overall salvageability ranking (percentage out of 57); B: Domain distribution of salvageable orphans (number of orphans). Note that the total is greater than 57 because some orphans have different evaluations in the different species in which they have been reported. One orphan is also shared between Eubacteria and Eukaryotes.
Example validated orphans that are salvageable
| 1.1.1.226 | excellent | 1.4.1.11 | good |
| 1.1.1.265 | excellent | 1.4.1.17 | good |
| 1.14.13.58 | excellent | 1.5.1.21 | good |
| 2.4.1.180 | excellent | 2.3.1.98 | good |
| 2.8.1.3 | excellent | 2.6.1.58 | good |
| 3.2.1.112 | excellent | 2.6.1.75 | good |
| 3.2.1.128 | excellent | 3.1.3.47 | good |
| 3.5.1.58 | excellent | 3.1.6.17 | good |
| 3.6.1.18 | excellent | 3.2.1.100 | good |
| 1.1.1.172 | good | 3.4.17.16 | good |
| 1.1.1.196 | good | 3.5.1.30 | good |
| 1.13.12.9 | good | 3.5.3.2 | good |
| 1.14.13.23 | good | 4.2.1.43 | good |
| 1.14.16.5 | good | 4.2.1.62 | good |
| 1.2.1.54 | good | 5.5.1.11 | good |
Validated orphans with a salvageability ranking of "good" or better are listed.
Figure 6. Distribution of enzymatic activities for salvageable orphans ranked "good" and "excellent".
Selected salvageable orphans
| 3.5.1.30 | Good | None | 5-aminopentanamidase | Yes ( | Several | 67 | N/A | |
| 5.5.1.11 | Good | None | Dichloromuconate cycloisomerase | N/A | Yes | 40 ± 10 | N/A | |
| 4.2.1.97 | Marginal | None | Phaseollidin hydratase | No | Different species ( | monomer 1: 47 monomer 2: 49 | ||
| 2.3.1.103 | Poor | None | Sinapoylglucose–sinapoylglucose O-sinapoyltransferase | N/A | N/A | 55 | N/A |
A selection of orphans with different salvageability rankings is listed. Pathway names are those used in the MetaCyc database. *: The genomes of several strains of P. fluorescens are in the final stages of assembly and are essentially fully sequenced. N/A: not available.
Example of artifactual orphans resolved by this survey
| 1.1.1.163 | Cyclopentanol dehydrogenase | Q8GAV9 | |
| 1.13.11.24 | Quercetin 2,3-dioxygenase | P42106 | |
| 3.6.3.24 | Nickel-transporting ATPase | P33593 | |
| 2.1.1.143 | 24-methylenesterol C-methyltransferase | Q94JS4 | |
| 2.1.1.143 | 24-methylenesterol C-methyltransferase | Q39227 |
All Swiss-Prot entries listed here have been updated with the corresponding EC number.
Main data sources used by the orphan survey
| TrEMBL [32] | Comprehensive protein and DNA sequence data | Swiss Institute of Bioinformatics | Web |
| Comprehensive Microbial Resource (CMR [33]) | Extensive genomic data for microbial species | The Institute for Genomic Research | BioWarehouse |
| BioCyc databases | Collection of pathway/genome databases primarily concerned with microbial species | Bioinformatics Research Group, SRI International | BioWarehouse |
| IUBMB Enzyme Nomenclature [34] | Description of enzymes that have been assigned an EC number by the Enzyme Commission | Nomenclature Committee of the International Union of Biochemistry and Molecular Biology | Web and BioWarehouse |
| ENZYME [35] | Repository of information relative to the nomenclature of enzymes | Swiss Institute of Bioinformatics | Web and BioWarehouse |
| NCBI Taxonomy [36] | Taxonomy database | National Center for Biotechnology Information | Web and BioWarehouse |
| PubMed | Literature database | National Library of Medicine | Web |
| Name of enzyme activity |
| Is lack of sequence confirmed? |
| Bibliographical data (publication dates, authors, institutions) |
| Name of species |
| Can the species associated with the original publications be unambiguously identified? |
| Is a comprehensive genome sequence available for those species? |
| Are comprehensive genome sequences available from closely-related species? |
| Is there ongoing genomic sequencing for those species or from closely-related species? |
| Are molecular data such as Mr and pI available? |
| Does the purification and characterization procedure suggest that purifying this enzyme should be reasonably straightforward? |