Adrian K Arakaki, Weidong Tian, Jeffrey Skolnick.
Abstract
BACKGROUND: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.
Year: 2006 PMID: 17166279 PMCID: PMC1764738 DOI: 10.1186/1471-2164-7-315
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Table 1. Species and taxonomic groups represented in the genome sequence dataset. The taxonomic information is from the NCBI Taxonomy database [85] and the three-letter codes for the organisms are from KEGG [24].
1 Incomplete genome project. aae: Aquifex aeolicus, aci: Acinetobacter sp. ADP1, afu: Archaeoglobus fulgidus, ago: Ashbya gossypii, ama: Anaplasma marginale, ana: Anabaena sp. PCC7120, ape: Aeropyrum pernix, atc: Agrobacterium tumefaciens C58 (Cereon), ath: Arabidopsis thaliana, atu: Agrobacterium tumefaciens C58 (UWash/Dupont), baa: Bacillus anthracis A2012, bab: Buchnera aphidicola Bp, ban: Bacillus anthracis Ames, bar: Bacillus anthracis Ames 0581, bas: Buchnera aphidicola Sg, bat: Bacillus anthracis Sterne, bba: Bdellovibrio bacteriovorus, bbr: Bordetella bronchiseptica, bbu: Borrelia burgdorferi, bca: Bacillus cereus ATCC 10987, bce: Bacillus cereus ATCC 14579, bcl: Bacillus clausii, bcz: Bacillus cereus ZK, bfl: Blochmannia floridanus, bfr: Bacteroides fragilis, bga: Borrelia garinii, bha: Bacillus halodurans, bhe: Bartonella henselae, bja: Bradyrhizobium japonicum, bld: Bacillus licheniformis DSM13, bli: Bacillus licheniformis ATCC 14580, blo: Bifidobacterium longum, bma: Burkholderia mallei, bme: Brucella melitensis, bms: Brucella suis, bpa: Bordetella parapertussis, bpe: Bordetella pertussis, bps: Burkholderia pseudomallei, bqu: Bartonella quintana, bsu: Bacillus subtilis, bth: Bacteroides thetaiotaomicron, btk: Bacillus thuringiensis, buc: Buchnera aphidicola APS, cac: Clostridium acetobutylicum, cal: Candida albicans, cbu: Coxiella burnetii, cca: Chlamydophila caviae, ccr: Caulobacter crescentus, cdi: Corynebacterium diphtheriae, cef: Corynebacterium efficiens, cel: Caenorhabditis elegans, cgl: Corynebacterium glutamicum, cho: Cryptosporidium hominis, cje: Campylobacter jejuni NCTC11168, cjr: Campylobacter jejuni RM1221, cme: Cyanidioschyzon merolae, cmu: Chlamydia muridarum, cpa: Chlamydophila pneumoniae AR39, cpe: Clostridium perfringens, cpj: Chlamydophila pneumoniae J138, cpn: Chlamydophila pneumoniae CWL029, cpt: Chlamydophila pneumoniae TW183, cpv: Cryptosporidium parvum, ctc: Clostridium tetani E88, cte: Chlorobium tepidum, ctr: Chlamydia
trachomatis, cvi: Chromobacterium violaceum, ddi: Dictyostelium discoideum, det: Dehalococcoides ethenogenes, dme: Drosophila melanogaster, dps: Desulfotalea psychrophila, dra: Deinococcus radiodurans, dre: Danio rerio, dvu: Desulfovibrio vulgaris Hildenborough, eba: Azoarcus sp. EbN1, eca: Erwinia carotovora, ecc: Escherichia coli CFT073, ece: Escherichia coli O157 EDL933, ecj: Escherichia coli K-12 W3110, eco: Escherichia coli K-12 MG1655, ecs: Escherichia coli O157 Sakai, ecu: Encephalitozoon cuniculi, efa: Enterococcus faecalis, erg: Ehrlichia ruminantium Gardel, eru: Ehrlichia ruminantium Welgevonden (South Africa), erw: Ehrlichia ruminantium Welgevonden (France), fnu: Fusobacterium nucleatum, ftu: Francisella tularensis, gka: Geobacillus kaustophilus, gox: Gluconobacter oxydans, gsu: Geobacter sulfurreducens, gvi: Gloeobacter violaceus, hal: Halobacterium sp. NRC-1, hdu: Haemophilus ducreyi, hhe: Helicobacter hepaticus, hin: Haemophilus influenzae, hma: Haloarcula marismortui, hpj: Helicobacter pylori J99, hpy: Helicobacter pylori 26695, hsa: Homo sapiens, ilo: Idiomarina loihiensis, lac: Lactobacillus acidophilus, lic: Leptospira interrogans serovar Copenhageni, lil: Leptospira interrogans serovar lai, lin: Listeria innocua, ljo: Lactobacillus johnsonii, lla: Lactococcus lactis, lma: Leishmania major, lmf: Listeria monocytogenes F2365, lmo: Listeria monocytogenes EGD-e, lpf: Legionella pneumophila Lens, lpl: Lactobacillus plantarum, lpn: Legionella pneumophila Philadelphia 1, lpp: Legionella pneumophila Paris, lxx: Leifsonia xyli xyli CTCB07, mac: Methanosarcina acetivorans, mbo: Mycobacterium bovis, mca: Methylococcus capsulatus, mfl: Mesoplasma florum, mga: Mycoplasma gallisepticum, mge: Mycoplasma genitalium, mhy: Mycoplasma hyopneumoniae, mja: Methanococcus jannaschii, mka: Methanopyrus kandleri, mle: Mycobacterium leprae, mlo: Mesorhizobium loti, mma: Methanosarcina mazei, mmo: Mycoplasma mobile, mmp: Methanococcus maripaludis, mmu: Mus musculus, mmy: 
Mycoplasma mycoides, mpa: Mycobacterium avium paratuberculosis, mpe: Mycoplasma penetrans, mpn: Mycoplasma pneumoniae, mpu: Mycoplasma pulmonis, msu: Mannheimia succiniciproducens, mtc: Mycobacterium tuberculosis CDC1551, mth: Methanobacterium thermoautotrophicum, mtu: Mycobacterium tuberculosis H37Rv, neq: Nanoarchaeum equitans, neu: Nitrosomonas europaea, nfa: Nocardia farcinica, ngo: Neisseria gonorrhoeae, nma: Neisseria meningitidis Z2491 (serogroup A), nme: Neisseria meningitidis MC58 (serogroup B), oih: Oceanobacillus iheyensis, osa: Oryza sativa, pab: Pyrococcus abyssi, pac: Propionibacterium acnes, pae: Pseudomonas aeruginosa, pai: Pyrobaculum aerophilum, pcu: Parachlamydia sp. UWE25, pfa: Plasmodium falciparum, pfu: Pyrococcus furiosus, pgi: Porphyromonas gingivalis, pho: Pyrococcus horikoshii, plu: Photorhabdus luminescens, pma: Prochlorococcus marinus SS120, pmm: Prochlorococcus marinus MED4, pmt: Prochlorococcus marinus MIT9313, pmu: Pasteurella multocida, poy: Phytoplasma sp. 
onion yellows, ppr: Photobacterium profundum, ppu: Pseudomonas putida, pst: Pseudomonas syringae, pto: Picrophilus torridus, rba: Rhodopirellula baltica, rco: Rickettsia conorii, rno: Rattus norvegicus, rpa: Rhodopseudomonas palustris CGA009, rpr: Rickettsia prowazekii, rso: Ralstonia solanacearum GMI1000, rty: Rickettsia typhi, sac: Staphylococcus aureus COL, sag: Streptococcus agalactiae 2603, sam: Staphylococcus aureus MW2, san: Streptococcus agalactiae NEM316, sar: Staphylococcus aureus MRSA252, sas: Staphylococcus aureus MSSA476, sau: Staphylococcus aureus N315, sav: Staphylococcus aureus Mu50, sce: Saccharomyces cerevisiae, sco: Streptomyces coelicolor, sep: Staphylococcus epidermidis ATCC 12228, ser: Staphylococcus epidermidis RP62A, sfl: Shigella flexneri 301, sfx: Shigella flexneri 2457T, sil: Silicibacter pomeroyi, sma: Streptomyces avermitilis, sme: Sinorhizobium meliloti, smu: Streptococcus mutans, son: Shewanella oneidensis, spa: Streptococcus pyogenes MGAS10394, spg: Streptococcus pyogenes MGAS315, spm: Streptococcus pyogenes MGAS8232, spn: Streptococcus pneumoniae TIGR4, spo: Schizosaccharomyces pombe, spr: Streptococcus pneumoniae R6, sps: Streptococcus pyogenes SSI-1, spt: Salmonella enterica serovar Paratyphi A, spy: Streptococcus pyogenes SF370, sso: Sulfolobus solfataricus, stc: Streptococcus thermophilus CNRZ1066, sth: Symbiobacterium thermophilum, stl: Streptococcus thermophilus LMG18311, stm: Salmonella typhimurium LT2, sto: Sulfolobus tokodaii, stt: Salmonella enterica serovar typhi Ty2, sty: Salmonella typhi CT18, syc: Synechococcus sp. PCC6301, syn: Synechocystis sp. PCC6803, syw: Synechococcus sp. 
WH8102, tac: Thermoplasma acidophilum, tbr: Trypanosoma brucei, tde: Treponema denticola, tel: Thermosynechococcus elongatus, tko: Thermococcus kodakaraensis, tma: Thermotoga maritima, tpa: Treponema pallidum, tte: Thermoanaerobacter tengcongensis, tth: Thermus thermophilus HB27, ttj: Thermus thermophilus HB8, tvo: Thermoplasma volcanium, twh: Tropheryma whipplei Twist, tws: Tropheryma whipplei TW08/27, uur: Ureaplasma urealyticum, vch: Vibrio cholerae, vfi: Vibrio fischeri, vpa: Vibrio parahaemolyticus, vvu: Vibrio vulnificus CMCP6, vvy: Vibrio vulnificus YJ016, wbm: Wolbachia endosymbiont strain TRS of Brugia malayi, wbr: Wigglesworthia brevipalpis, wol: Wolbachia wMel, wsu: Wolinella succinogenes, xac: Xanthomonas axonopodis, xcc: Xanthomonas campestris, xfa: Xylella fastidiosa 9a5c, xft: Xylella fastidiosa Temecula1, xoo: Xanthomonas oryzae, ype: Yersinia pestis CO92, ypk: Yersinia pestis KIM, ypm: Yersinia pestis Mediaevalis, yps: Yersinia pseudotuberculosis, zmo: Zymomonas mobilis.
Figure 1. Enzyme content in organisms from the three domains of life. Number of enzymes as a function of the proteome size for archaeal (A), bacterial (B) and eukaryotic (C) genomes. The gray, magenta and green lines represent the regression line and the 95% and 99% prediction intervals, respectively. (D) Distribution of the fraction of enzymes in archaeal, bacterial and eukaryotic genomes. The statistics represented in the box-and-whisker plots are: outliers below the 10th percentile (circles, bottom), 10th percentile (whisker, bottom), 25th percentile (box, bottom), median (thick line), 75th percentile (box, top), 90th percentile (whisker, top) and outliers above the 90th percentile (circles, top).
Figure 2. Comparison of EFICAz predictions with KEGG annotations. Comparison of EFICAz predictions with KEGG annotations from the Genes database of March 5, 2005, Release 33.0+/03–5 (A-B) and of March 7, 2006, Release 37.0+/03–07 (C-D). We analyze two levels of enzyme function description: four-field EC numbers (A, C) and three-field EC numbers (B, D). For all, archaeal, bacterial and eukaryotic genomes, we plot the average percentage of enzymatic proteins per genome whose EFICAz-inferred and KEGG-provided annotations at the specified level of detail agree (green columns) or disagree (red columns), and whose enzyme function annotation at the specified level of detail is provided only by EFICAz (blue columns) or only by KEGG (yellow columns). The numeric values inserted in each stacked column are the corresponding average percentage of enzymatic proteins per genome +/- the standard deviation.
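The four categories plotted in Figure 2 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `truncate_ec` and `compare_annotations` are hypothetical, each protein is assumed to carry at most one EC number per source, and a `-` in an EC number is assumed to mark an unspecified field.

```python
from typing import Optional


def truncate_ec(ec: Optional[str], level: int) -> Optional[str]:
    """Return the first `level` EC fields, or None if the annotation
    is missing or not specified down to that level (assumption: '-'
    marks an unspecified field)."""
    if ec is None:
        return None
    fields = ec.split(".")[:level]
    if len(fields) < level or "-" in fields:
        return None
    return ".".join(fields)


def compare_annotations(eficaz_ec: Optional[str],
                        kegg_ec: Optional[str],
                        level: int) -> str:
    """Classify one protein into the categories of Figure 2
    at the given level of detail (3 or 4 EC fields)."""
    e = truncate_ec(eficaz_ec, level)
    k = truncate_ec(kegg_ec, level)
    if e is not None and k is not None:
        return "agree" if e == k else "disagree"
    if e is not None:
        return "eficaz_only"
    if k is not None:
        return "kegg_only"
    return "neither"
```

Under these assumptions, a protein annotated 2.7.1.2 by one source and 2.7.1.- by the other counts as agreeing at the three-field level but as annotated by only one source at the four-field level.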
Figure 3. Similarity of 64 previously hypothetical proteins to EFICAz training enzymes. Number of previously hypothetical proteins predicted to be enzymes by EFICAz at different intervals of maximal sequence identity to enzymes included in the EFICAz version 5.0 training set. The true enzyme function of these 64 previously hypothetical proteins has been recently determined; therefore, we could assess the precision of our predictions. Dark green, light green and red bars represent four-field EC number predictions with four, three or fewer than three correct EC fields, respectively. Yellow and orange bars represent three-field EC number predictions with three or fewer than three correct EC fields, respectively. The median of the distribution (24.8%) is indicated by the broken line.
Four-field EC number validation of EFICAz-predicted enzyme functions for 25 previously hypothetical proteins
| Domain | Org.¹ | Gene name² | PMID³ | True EC number⁴ | Predicted EC number⁵ | EC field agreement⁶ |
| --- | --- | --- | --- | --- | --- | --- |
| Eukarya | hsa | 54995 | 16261191 | 2.3.1.41 | 2.3.1.41 | 4 |
| Eukarya | hsa | 84779 | 16638120 | 2.3.1.88 | 2.3.1.88 | 4 |
| Bacteria | ana | alr3351 | 15695431 | 6.3.2.2 | 6.3.2.2 | 4 |
| Archaea | ape | APE0768 | 14551194 | 5.3.1.9 | 5.3.1.9 | 4 |
| Bacteria | eco | b0581 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | ecc | c0667 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | mle | ML1399 | 15500449 | 4.6.1.1 | 4.6.1.1 | 4 |
| Bacteria | pae | PA1167 | 15136569 | 4.2.2.3 | 4.2.2.3 | 4 |
| Eukarya | cel | R07B7.11 | 15676072 | 3.2.1.49/3.2.1.22 | 3.2.1.22 | 4 |
| Bacteria | mtu | Rv1647 | 15500449 | 4.6.1.1 | 4.6.1.1 | 4 |
| Bacteria | mtu | Rv1700 | 12906832 | 3.6.1.13 | 3.6.1.13 | 4 |
| Bacteria | mtu | Rv1885c | 15654876 | 5.4.99.5 | 5.4.99.5 | 4 |
| Bacteria | mtu | Rv2747 | 15838030 | 2.3.1.1 | 2.3.1.1 | 4 |
| Bacteria | sfx | S0496 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | sfl | SF0488 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | spt | SPA0821 | 15547259 | 2.5.1.17 | 2.5.1.17 | 4 |
| Bacteria | spt | SPA2151 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | sty | STY2255 | 15547259 | 2.5.1.17 | 2.5.1.17 | 4 |
| Bacteria | stt | t0824 | 15547259 | 2.5.1.17 | 2.5.1.17 | 4 |
| Archaea | tac | Ta1434 | 15044458 | 2.5.1.17 | 2.5.1.17 | 4 |
| Bacteria | ece | Z0720 | 15211520 | 6.3.2.2 | 6.3.2.2 | 4 |
| Bacteria | ecc | c0735 | 16411753 | 3.2.2.8 | 3.2.2.1 | 3 |
| Archaea | mja | MJ0044 | 16621811 | 2.7.4.- | 2.7.2.8 | 2 |
| Archaea | mja | MJ0936 | 15128743 | 3.1.4.- | 3.6.1.10 | 1 |
| Bacteria | mtu | Rv0805 | 16313172 | 3.1.4.17 | 3.6.1.10 | 1 |
1 The species names corresponding to the KEGG three-letter codes are listed in the footnote of Table 1.
2 Gene name from the Genes database of KEGG.
3 PMID: PubMed Unique Identifier, the journal citation accession number for the most relevant record in PubMed supporting the experimentally-derived annotation [65].
4 Experimentally-derived EC numbers.
5 EFICAz-predicted EC numbers.
6 Number of matching first n fields of the experimentally-derived and EFICAz-predicted EC numbers, with n = 1 to 4.
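The agreement measure defined in footnote 6 can be sketched in a few lines of Python. This is an illustrative reading of the footnote, not the authors' implementation: the function names are hypothetical, a `-` field is assumed to end the comparison, and when several experimentally derived EC numbers are listed for one protein, the best match is assumed to be the one reported.

```python
def ec_field_agreement(true_ec: str, predicted_ec: str) -> int:
    """Count the leading EC fields shared by two EC numbers.

    Comparison stops at the first mismatch or at the first
    unspecified field ('-') in either number.
    """
    count = 0
    for t, p in zip(true_ec.split("."), predicted_ec.split(".")):
        if t == "-" or p == "-" or t != p:
            break
        count += 1
    return count


def best_agreement(true_ecs: str, predicted_ec: str) -> int:
    """Handle proteins with several true EC numbers, separated by
    '/' or whitespace, by taking the best agreement (assumption)."""
    candidates = true_ecs.replace("/", " ").split()
    return max(ec_field_agreement(t, predicted_ec) for t in candidates)
```

With this reading, the c0735 row (true 3.2.2.8, predicted 3.2.2.1) gives 3, the MJ0044 row (true 2.7.4.-, predicted 2.7.2.8) gives 2, and the R07B7.11 row, which lists two true EC numbers, gives 4.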
Three-field EC number validation of enzyme functions predicted by EFICAz with four-field EC numbers for 12 previously hypothetical proteins
| Domain | Org.¹ | Gene name² | PMID³ | True EC number⁴ | Predicted EC number⁵ | EC field agreement⁶ |
| --- | --- | --- | --- | --- | --- | --- |
| Bacteria | ecc | c2186 | 16077126 | 1.1.1.- | 1.1.1.2 | ≤ 3 |
| Bacteria | ecc | c5454 | 15489502 | 3.1.3.- | 3.1.3.48 | ≤ 3 |
| Bacteria | lla | L124252 | 15901700 | 2.1.1.- | 2.1.1.14 | ≤ 3 |
| Bacteria | pae | PA1032 | 16461666 | 3.5.1.- | 3.5.1.11 | ≤ 3 |
| Bacteria | sfx | S0029 | 11027694 | 3.2.2.- | 3.2.2.1 | ≤ 3 |
| Bacteria | sfl | SF0027 | 11027694 | 3.2.2.- | 3.2.2.1 | ≤ 3 |
| Bacteria | spt | SPA2330 | 15157072 | 2.7.1.- | 2.7.1.2 | ≤ 3 |
| Bacteria | spt | SPA4373 | 15489502 | 3.1.3.- | 3.1.3.48 | ≤ 3 |
| Bacteria | sty | STY0426 | 15157072 | 2.7.1.- | 2.7.1.2 | ≤ 3 |
| Bacteria | stt | t2471 | 15157072 | 2.7.1.- | 2.7.1.2 | ≤ 3 |
| Bacteria | ece | Z0035 | 11027694 | 3.2.2.- | 3.2.2.1 | ≤ 3 |
| Bacteria | ece | Z0493 | 15157072 | 2.7.1.- | 2.7.1.2 | ≤ 3 |
1 The species names corresponding to the KEGG three-letter codes are listed in the footnote of Table 1.
2 Gene name from the Genes database of KEGG.
3 PMID: PubMed Unique Identifier, the journal citation accession number for the most relevant record in PubMed supporting the experimentally-derived annotation [65].
4 Experimentally-derived EC numbers.
5 EFICAz-predicted EC numbers.
6 Number of matching first n fields of the experimentally-derived and EFICAz-predicted EC numbers, with n = 1 to 4.
Three-field EC number validation of enzyme functions predicted by EFICAz with three-field EC numbers for 27 previously hypothetical proteins
| Domain | Org.¹ | Gene name² | PMID³ | True EC number⁴ | Predicted EC number⁵ | EC field agreement⁶ |
| --- | --- | --- | --- | --- | --- | --- |
| Archaea | afu | AF1938 | 11790732 | 6.2.1.1 | 6.2.1.- | 3 |
| Bacteria | bsu | BG11467 | 14635137 | 2.3.1.- | 2.3.1.- | 3 |
| Bacteria | bsu | BG11761 | 16242712 | 1.1.1.- | 1.1.1.- | 3 |
| Bacteria | bth | BT4131 | 15952775 | 3.1.3.- | 3.1.3.- | 3 |
| Bacteria | ecc | c1394 | 15157072 | 2.7.1.- | 2.7.1.- | 3 |
| Bacteria | ecc | c2089 | 16253988 | 2.8.3.- | 2.8.3.- | 3 |
| Bacteria | eco | b2873 | 11092864 | 3.5.2.- | 3.5.2.- | 3 |
| Bacteria | cef | CE0356 | 15225990 | 2.3.1.- | 2.3.1.- | 3 |
| Bacteria | lpf | lpl2377 | 16390437 | 2.7.3.- | 2.7.3.- | 3 |
| Bacteria | lpp | lpp2524 | 16390437 | 2.7.3.- | 2.7.3.- | 3 |
| Bacteria | lpp | lpp2599 | 11053398 | 2.1.1.- | 2.1.1.- | 3 |
| Archaea | mja | MJ0883 | 15165845 | 2.1.1.31 | 2.1.1.- | 3 |
| Archaea | pho | PH1035 | 15737605 | 2.4.1.- | 2.4.1.- | 3 |
| Archaea | pho | PH1915 | 16260766 | 2.1.1.- | 2.1.1.- | 3 |
| Archaea | pho | PH1948 | 16245322 | 2.1.1.- | 2.1.1.- | 3 |
| Bacteria | rpr | RP028 | 16364512 | 2.1.1.43 | 2.1.1.- | 3 |
| Bacteria | mtu | Rv0891c | 15500449 | 4.6.1.1 | 4.6.1.- | 3 |
| Bacteria | mtu | Rv1500 | 16257960 | 2.4.1.- | 2.4.1.- | 3 |
| Bacteria | mtu | Rv3225c | 12715873 | 2.7.1.- | 2.7.1.- | 3 |
| Bacteria | sco | SCO2599 | 12951512 | 3.1.4.- | 3.1.4.- | 3 |
| Bacteria | spn | SP1051 | 12571357 | 2.7.1.- | 2.7.1.- | 3 |
| Archaea | sto | ST0071 | 15212797 | 3.1.1.- | 3.1.1.- | 3 |
| Archaea | sto | ST0723 | 16618099 | 1.5.1.30 | 1.5.1.- | 3 |
| Bacteria | ttj | TTHA1280 | 16511182 | 2.1.1.- | 2.1.1.- | 3 |
| Bacteria | ypk | y0368 | 12923112 | 2.3.1.- | 2.3.1.- | 3 |
| Bacteria | ype | YPO3632 | 16452420 | 2.3.1.- | 2.3.1.- | 3 |
| Archaea | tac | Ta1419 | 14551194 | 5.3.1.8/5.3.1.9 | 6.1.1.- | 0 |
1 The species names corresponding to the KEGG three-letter codes are listed in the footnote of Table 1.
2 Gene name from the Genes database of KEGG.
3 PMID: PubMed Unique Identifier, the journal citation accession number for the most relevant record in PubMed supporting the experimentally-derived annotation [65].
4 Experimentally-derived EC numbers.
5 EFICAz-predicted EC numbers.
6 Number of matching first n fields of the experimentally-derived and EFICAz-predicted EC numbers, with n = 1 to 4.
Figure 4. Benchmark test of updated versions of EFICAz. Precision (A-C), recall (D-F) and number of enzyme types described by four-field EC numbers (G-I) for different versions of EFICAz, at different levels of maximal test-to-training sequence identity, averaged per enzyme type. Curves in red correspond to enzyme types for which at least 10 training sequences were available; curves in blue correspond to all enzyme types. The training of versions 2.0, 3.0 and 4.0 of EFICAz is based on Releases 2.0, 3.0 and 4.0 of UniProt, respectively. The new Swiss-Prot sequences added to UniProt 5.0 since the release of UniProt 2.0, 3.0 and 4.0 constitute the test sequences for versions 2.0, 3.0 and 4.0 of EFICAz, respectively. See Methods for a full description of the benchmark procedure.
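As a rough illustration of what "averaged per enzyme type" means for the precision and recall curves of Figure 4, the following Python sketch computes both quantities for each EC number and then takes the unweighted mean over EC numbers. It uses the standard definitions of precision and recall and is only an assumption about the general shape of the calculation; the exact benchmark procedure is the one given in the paper's Methods. The function name and the input format, a list of (true EC, predicted EC or None) pairs, are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def per_type_precision_recall(
    pairs: Iterable[Tuple[str, Optional[str]]]
) -> Tuple[float, float]:
    """Average precision and recall per enzyme type.

    Each pair is (true EC number, predicted EC number or None when
    no prediction was made). Every test sequence is assumed to be an
    enzyme with exactly one true EC number.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_ec, pred_ec in pairs:
        if pred_ec == true_ec:
            tp[true_ec] += 1
        else:
            fn[true_ec] += 1          # the true type was missed
            if pred_ec is not None:
                fp[pred_ec] += 1      # a wrong type was asserted

    # Precision is defined for types that were predicted at least once,
    # recall for types that occur at least once among the true labels.
    predicted_types = set(tp) | set(fp)
    true_types = set(tp) | set(fn)
    precision = sum(tp[ec] / (tp[ec] + fp[ec])
                    for ec in predicted_types) / len(predicted_types)
    recall = sum(tp[ec] / (tp[ec] + fn[ec])
                 for ec in true_types) / len(true_types)
    return precision, recall
```

Averaging per type rather than per sequence keeps a few heavily populated enzyme types from dominating the result, which is presumably why the caption specifies it.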
Source of Sequence Data for EFICAz training. The fifth column shows the number of enzymes annotated in Swiss-Prot with four-field EC numbers, which constitute the primary source for the training of EFICAz.
| EFICAz version/UniProt Release | UniProt Release Date | Number of Sequences in UniProt | Number of Sequences in Swiss-Prot | Number of Enzymes in Swiss-Prot |
| --- | --- | --- | --- | --- |
| 2.0 | Jul. 5, 2004 | 1,487,788 | 153,871 | 44,508 |
| 3.0 | Oct. 25, 2004 | 1,612,609 | 163,235 | 47,144 |
| 4.0 | Feb. 1, 2005 | 1,757,967 | 168,297 | 48,788 |
| 5.0 | May 10, 2005 | 1,896,046 | 181,571 | 53,314 |