D A Natale, U T Shankavaram, M Y Galperin, Y I Wolf, L Aravind, E V Koonin.
Abstract
BACKGROUND: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi.
Year: 2000 PMID: 11178258 PMCID: PMC15027 DOI: 10.1186/gb-2000-1-5-research0009
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. A flow chart of the genome annotation process using COGs. NR is the Non-Redundant sequence database at the National Center for Biotechnology Information.
Assignment of proteins to COGs
| Category | A. pernix | P. abyssi |
| Proteins assigned by COGNITOR | 1,123 | 1,421 |
| Proteins included in COGs* | 1,102 | 1,404 |
| True positives | 1,062 | 1,381 |
| Pre-existing COGs | 1,011 | 1,339 |
| New COGs | 27 | 3 |
| Divided† | 24 | 39 |
| False positives | 44 | 31 |
| Not accepted | 21 | 17 |
| Reassigned to a related COG | 21 | 14 |
| Reassigned to an unrelated COG | 2 | 0 |
| False negatives‡ | 17 | 9 |
*Includes true positives, reassigned false positives, and false negatives. †Not included in the 'Pre-existing COGs' or 'New COGs' numbers. ‡Proteins added during manual checking.
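The footnote defines how the rows of this table relate to one another, and the arithmetic can be checked directly. The following is a minimal sketch, assuming the first numeric column is A. pernix and the second is P. abyssi (an inference from the totals of 1,102 and 1,404 proteins in COGs reported for these species in the species comparison table); the dictionary keys are illustrative names, not terms from the paper.

```python
# Counts copied from the table above. The species assignment of the two
# columns is an assumption based on the "in COGs" totals given elsewhere.
counts = {
    "A. pernix": {
        "included_in_cogs": 1102, "true_positives": 1062,
        "preexisting": 1011, "new": 27, "divided": 24,
        "false_positives": 44, "not_accepted": 21,
        "reassigned_related": 21, "reassigned_unrelated": 2,
        "false_negatives": 17,
    },
    "P. abyssi": {
        "included_in_cogs": 1404, "true_positives": 1381,
        "preexisting": 1339, "new": 3, "divided": 39,
        "false_positives": 31, "not_accepted": 17,
        "reassigned_related": 14, "reassigned_unrelated": 0,
        "false_negatives": 9,
    },
}


def check_rows(c):
    """Check the row relationships stated in the table footnotes."""
    reassigned = c["reassigned_related"] + c["reassigned_unrelated"]
    return {
        # True positives split into pre-existing, new, and divided COGs.
        "true_positives": c["preexisting"] + c["new"] + c["divided"]
        == c["true_positives"],
        # False positives are either rejected or reassigned.
        "false_positives": c["not_accepted"] + reassigned
        == c["false_positives"],
        # Footnote *: true positives + reassigned false positives
        # + false negatives.
        "included_in_cogs": c["true_positives"] + reassigned
        + c["false_negatives"] == c["included_in_cogs"],
    }


results = {species: check_rows(c) for species, c in counts.items()}
for species, checks in results.items():
    print(species, checks)
```

All three relationships hold for both columns, which supports reading the indented rows as subdivisions of the true-positive and false-positive totals.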
Comparison of proteins in COGs for archaeal species
| Species* | Genome size (Mb) | Total ORFs | ORFs in COGs | Percentage of ORFs in COGs |
| Af | 2.18 | 2,411 | 1,755 | 73 |
| Mj | 1.74 | 1,747 | 1,252 | 72 |
| Mth | 1.75 | 1,871 | 1,339 | 72 |
| Ph | 1.74 | 2,072 | 1,333 | 64 |
| Pa | 1.77 | 1,765 | 1,404 | 79 |
| Ap | 1.67 | 2,694 | 1,102 | 41 |
| Ap† | 1.67 | 1,873 | 1,129 | 60 |
*For abbreviations see the Materials and methods section. †After adjusting for new ORFs and removal of likely false ORFs. This adjusted number of genes is the upper estimate of the actual total number of genes in A. pernix.
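The percentage column follows from the two ORF counts, and recomputing it shows how strongly the A. pernix figure depends on the ORF adjustment described in the footnote. This is a minimal sketch using the counts from the table; the function and variable names are illustrative. Recomputed values can differ from the table in the last digit, depending on how the published percentages were rounded.

```python
# (total ORFs, ORFs in COGs) per species, copied from the table above.
ORF_COUNTS = {
    "Af": (2411, 1755),
    "Mj": (1747, 1252),
    "Mth": (1871, 1339),
    "Ph": (2072, 1333),
    "Pa": (1765, 1404),
    "Ap": (2694, 1102),
    "Ap (adjusted)": (1873, 1129),
}


def percent_in_cogs(total_orfs, orfs_in_cogs):
    """Percentage of predicted ORFs that were assigned to COGs."""
    return 100.0 * orfs_in_cogs / total_orfs


coverage = {sp: percent_in_cogs(*pair) for sp, pair in ORF_COUNTS.items()}
for species, pct in coverage.items():
    print(f"{species}: {pct:.1f}%")
```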
Phyletic distribution of the archaeal COGs
| Functional category | Number of COGs including all five euryarchaeal species | Number of COGs including all five euryarchaeal species and A. pernix | Number of COGs including a subset of euryarchaeal species and A. pernix | Number of COGs including |
| Translation and ribosome biogenesis | 113 | 109 | 17 | 0 |
| Transcription | 26 | 25 | 11 | 0 |
| Replication, recombination, repair | 38 | 27 | 15 | 2 |
| Cell division and chromosome partitioning | 3 | 1 | 0 | 0 |
| Post-translational modification, protein turnover, chaperones | 19 | 15 | 4 | 2 |
| Cell envelope biogenesis, outer membrane | 8 | 7 | 8 | 0 |
| Cell motility and secretion | 8 | 8 | 6 | 0 |
| Inorganic ion transport and metabolism | 13 | 9 | 27 | 2 |
| Signal transduction | 4 | 4 | 5 | 1 |
| Carbohydrate transport and metabolism | 20 | 18 | 13 | 2 |
| Energy production and conversion | 36 | 22 | 33 | 18 |
| Amino acid transport and metabolism | 34 | 29 | 58 | 5 |
| Nucleotide transport and metabolism | 34 | 25 | 6 | 3 |
| Coenzyme metabolism | 25 | 21 | 30 | 2 |
| Lipid metabolism | 8 | 6 | 15 | 2 |
| General functional prediction only | 75 | 59 | 60 | 8 |
| Uncharacterized | 67 | 45 | 82 | 4 |
| Total | 514 | 416 | 273 | 50 |
Figure 2. The main phylogenetic patterns for the predicted proteins encoded in six archaeal genomes. Af, Archaeoglobus fulgidus; Mt, Methanobacterium thermoautotrophicum; Pa, Pyrococcus abyssi; Mj, Methanococcus jannaschii; Ph, Pyrococcus horikoshii; Ap, Aeropyrum pernix. 1, members of COGs including all archaeal species; 2, members of COGs including a subset of archaeal species; 3, members of COGs that include no archaeal species other than the analyzed one; 4, not in COGs. The percentage of proteins in each category is indicated.
Co-occurrence of genomes in COGs: A. pernix groups within the archaeal domain
| | Ap | Mj | Mth | Af | Ph | Pa | Tm | Ec | Bs | Ssp | Sc |
| Ap | - | 275 | 273 | 151 | 203 | 168 | | | | | |
| | 836 | 561 | 563 | 685 | 634 | 677 | | | | | |
| | - | 408 | 424 | 411 | 299 | 297 | | | | | |
| Mj | | - | 157 | 164 | 252 | 258 | 507 | 460 | 496 | 481 | 524 |
| | | 969 | 812 | 805 | 717 | 722 | 462 | 509 | 473 | 488 | 445 |
| | | - | 175 | 291 | 249 | 252 | 597 | 961 | 892 | 714 | 406 |
| Mth | | | - | 177 | 325 | 296 | 509 | 441 | 478 | 465 | 515 |
| | | | 987 | 810 | 662 | 700 | 478 | 546 | 509 | 522 | 472 |
| | | | - | 286 | 271 | 274 | 581 | 924 | 856 | 680 | 379 |
| Af | | | | - | 355 | 330 | 569 | 483 | 526 | 540 | 593 |
| | | | | 1096 | 741 | 780 | 527 | 613 | 570 | 556 | 503 |
| | | | | - | 192 | 194 | 532 | 857 | 795 | 646 | 348 |
| Ph | | | | | - | 46 | 460 | 436 | 453 | 495 | 487 |
| | | | | | 933 | 894 | 473 | 497 | 480 | 438 | 446 |
| | | | | | - | 80 | 586 | 973 | 885 | 764 | 405 |
| Pa | | | | | | - | 470 | 431 | 451 | 501 | 505 |
| | | | | | | 974 | 504 | 543 | 523 | 473 | 469 |
| | | | | | | - | 577 | 964 | 898 | 742 | 396 |
| Tm | | | | | | | - | 223 | 216 | 325 | 592 |
| | | | | | | | 1059 | 836 | 843 | 734 | 467 |
| | | | | | | | - | 634 | 522 | 468 | 384 |
| Ec | | | | | | | | - | 339 | 480 | 813 |
| | | | | | | | | 1470 | 1331 | 990 | 657 |
| | | | | | | | | - | 234 | 212 | 194 |
| Bs | | | | | | | | | - | 464 | 750 |
| | | | | | | | | | 1365 | 901 | 615 |
| | | | | | | | | | - | 301 | 236 |
| Ssp | | | | | | | | | | - | 632 |
| | | | | | | | | | | 1202 | 570 |
| | | | | | | | | | | - | 281 |
| Sc | | | | | | | | | | | - |
| | | | | | | | | | | | 851 |
| | | | | | | | | | | | - |
*In each cell, the middle line is the number of COGs in which the given two species co-occur; the top line is the number of COGs in which the genome in the corresponding row, but not the one in the corresponding column, is represented; conversely, the bottom line is the number of COGs in which the genome in the corresponding column, but not the one in the corresponding row, is represented. The diagonal cells show the total number of COGs that include representatives from the given genome. The cells that show the co-occurrence data among archaea show the numbers of COGs in red, and the cells that show the co-occurrence data for archaea and yeast show the numbers of COGs in blue. For abbreviations, see the Materials and methods section.
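The cell layout described in the footnote implies a simple identity: the top and middle numbers of a cell add up to the row genome's diagonal total, and the middle and bottom numbers add up to the column genome's diagonal total. The sketch below checks this for three cells, taking the first, unlabeled row of the table to be A. pernix (an assumption based on the table title and the diagonal value of 836). It also computes a Jaccard index per cell purely as an illustration of how co-occurrence counts can be turned into a similarity score; the source does not state which measure underlies the classification in Figure 3.

```python
# (only_row, shared, only_col) triples: top, middle, and bottom numbers of
# three cells in the first row, assumed here to be A. pernix (Ap).
ap_cells = {
    "Mj": (275, 561, 408),
    "Mth": (273, 563, 424),
    "Af": (151, 685, 411),
}

# Diagonal cells: total number of COGs that include each genome.
totals = {"Ap": 836, "Mj": 969, "Mth": 987, "Af": 1096}


def is_consistent(row_total, col_total, cell):
    """Top + middle gives the row total; middle + bottom the column total."""
    only_row, shared, only_col = cell
    return only_row + shared == row_total and shared + only_col == col_total


def jaccard(cell):
    """Shared COGs divided by COGs containing either genome.

    Illustrative only; not stated in the source as the authors' measure.
    """
    only_row, shared, only_col = cell
    return shared / (only_row + shared + only_col)


consistent = {
    sp: is_consistent(totals["Ap"], totals[sp], cell)
    for sp, cell in ap_cells.items()
}
similarity = {sp: jaccard(cell) for sp, cell in ap_cells.items()}

for sp in ap_cells:
    print(f"Ap vs {sp}: consistent={consistent[sp]}, "
          f"jaccard={similarity[sp]:.3f}")
```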
Figure 3. Classification of genomes by co-occurrence in the COGs. (a) A cluster dendrogram. (b) A neighbor-joining unrooted tree. For abbreviations, see the Materials and methods section.
COGs represented in all archaea but not in other species: probable archaeal synapomorphies
| COG number | (Predicted) function | Comments |
| 2511 | Glu-tRNAGln amidotransferase B subunit | This protein is homologous to bacterial B subunits and an archaeal paralog but contains a synapomorphic insert, the so-called GAD domain, shared with bacterial aspartyl-tRNA synthetases [49] |
| 2016 | Predicted RNA-binding protein, contains PUA domain | PUA domain is most common in archaea and is found also in pseudouridine synthases, archaeosine synthases and glutamate kinases [50] |
| 1370 | Predicted RNA-binding protein, contains PUA domain | This form of the PUA domain is present as a stand-alone protein in |
| 1746 | tRNA nucleotidyltransferase (CCA-adding enzyme) | Archaeal CCA-adding enzyme is only very distantly related to other members of the Polβ superfamily of nucleotidyltransferases [51] |
| 1395 | Predicted transcription regulators | The proteins of this family do not share similarity with other proteins beyond the DNA-binding helix-turn-helix domain [32] |
| 1389 | DNA topoisomerase VI, subunit B | These proteins contain an ATPase domain of the TopoII/MutL/HSP90/histidine kinase fold, but do not show a specific relationship to any other proteins of this class |
| 1591 | Holliday junction resolvase, archaeal-type | Distant homologs seen in some bacteria (L.A., K.S. Makarova and E.V.K., unpublished observations) |
| 1571 | Predicted DNA-binding proteins, possibly nucleotidyl transferase or nuclease | These proteins consist of two distinct, predicted DNA-binding domains (OB-fold and Zn-ribbon) and an uncharacterized, probably enzymatic domain that is unique for archaea (see text) |
| 1491 | Predicted DNA-binding protein | These proteins contain the helix-hairpin-helix module, but otherwise, do not show significant similarity to any other proteins |
| 1938 | Predicted ATP-grasp-domain-containing enzymes | Only distantly related to other ATP-grasp proteins; predicted to possess ATP-dependent carboligase or similar activity [19] |
| 1407 | Predicted calcineurin-type phosphoesterase | Only distantly related to other phosphohydrolases of the calcineurin fold [37] |
| 1782 | Predicted metal-dependent RNase of the metallo-β-lactamase fold | In spite of significant similarity to other families of metallo-β-lactamases, this family shows a clear synapomorphy, the presence of the RNA-binding KH domain [52] |
| 1608 | Predicted kinase related to acetylglutamate kinase | Only distantly related to other kinases of the same fold |
| 1829 | Predicted kinase of the actin/HSP70/sugar kinase fold | Only distantly related to other kinases of the same fold |
| 1907 | Predicted kinase of the actin/HSP70/sugar kinase fold | Only distantly related to other kinases of the same fold |
| 1831 | Predicted metal-dependent hydrolase of the urease superfamily | Only distantly related to other hydrolases of the same superfamily [53] |
| 1571 | Predicted DNA-binding protein containing the Zn-ribbon module | |
| 2034 | Conserved membrane protein | |
| 2064 | Conserved membrane protein | |
| 1339, 2090, 1581, 1460, 1786, 1701, 1931, 1909, 1888, 1382, 1849, 1630, 1303, 1325, 1679 | Uncharacterized proteins unique to archaea | |
Figure 4. COGs not represented in each of the archaeal species while including members of the remaining five species. For P. horikoshii and P. abyssi, the absence of the respective second pyrococcal species was allowed. For abbreviations, see the Materials and methods section.
Examples of COGs conserved in euryarchaea but missing in A. pernix
| COG number | Phylogenetic pattern* | Function | Comments |
| 0101 | amtks-yqvdcebrhujwgpolinx | Pseudouridylate synthase (tRNA psi55) | This RNA modification enzyme hitherto has been considered ubiquitous and essential [54] |
| 1549 | amtks- - - - - - - - - - - - - - - - - - - - | Archaeosine tRNA-ribosyltransferase, contains PUA domain | See text |
| 2036 | amtks-y- - - - - - - - - - - - - - - - - - | Histones H3 and H4 | In the crenarchaeon |
| 1933 | amtks- - - - - - - - - - - - - - - - - - - - | Unique archaeal DNA polymerase, large subunit | This DNA polymerase is highly conserved among the euryarchaea but so far has not been seen in other taxa [15] |
| 1311 | amtks-y- - - - - - - - - - - - - - - - - - | Small subunit of DNA polymerase, predicted phosphatase (calcineurin-like superfamily) | The absence of this subunit, which is represented by (predicted) active phosphatases in archaea and inactivated forms in eukaryotes [37] correlates with the absence of the large subunit |
| 1111 | amtks-y- - - - - - - - - - - - - - - - - | ERCC4-like helicase | A typical archaeal-eukaryotic repair protein, a predicted active helicase in euryarchaea and an inactivated form in eukaryotes [39] |
| 1107 | amtks- - - - - - - - - - - - - - - - - - - - | Archaea-specific RecJ-like exonuclease, contains DnaJ-type Zn finger domain | |
| 1243 | amtks-y-v- - - - - - - - - - - - - - - - | Transcription elongation factor ELP3 | Consists of an amino-terminal biotin-synthase-like domain and a carboxy-terminal histone acetylase domain (only the biotin-synthase-like domain is represented in some bacteria) |
| 0206 | amtks- -qvdcebrhuj-gpol-nx | Cell division GTPase FtsZ | A central component of bacterial and archaeal cell-division machinery which is homologous (although only weakly similar) and functionally analogous to eukaryotic tubulins [56]. So far, among prokaryotes, FtsZ has been found missing only in |
| 0455 | amtks- -qvdcebr-uj- - -ol-n- | ATPases involved in chromosome partitioning | There may be a correlation between the absence of FtsZ and the presence of only one chromosome-partitioning ATPase (the ortholog of bacterial Mrp) in |
| 1149 | amtks- - -v- - - - - - - - - - - - - - - | MinD superfamily P-loop ATPases containing an inserted ferredoxin domain | |
| 0065 | amtks-yqvdcebrh-j- - - - - -n- | 3-Isopropylmalate dehydratase large subunit | |
| 0066 | amtks-yqvdcebrh-j- - - - - -n- | 3-isopropylmalate dehydratase small subunit | |
| 0473 | amtks-yqvdcebrh-j- - - - - -nx | 3-isopropylmalate dehydrogenase | |
| 0119 | amtks-yqvdcebrh-j- - - - - -n- | Isopropylmalate/homocitrate/citramalate synthase | |
| 0015 | amtks-yqvdcebrhuj- - - - - -n- | Adenylosuccinate lyase | Most of the enzyme of the |
| 0104 | amtks-yqvdcebrhuj- - - - - -n- | Adenylosuccinate synthase | |
| 0034 | amtks-yqvdcebrh-j- - - - - -n- | Glutamine phosphoribosyl-pyrophosphate amidotransferase | |
| 0151 | amtks-yqvdcebrhuj- - - - - -n- | Phosphoribosylamine-glycine ligase | |
| 0150 | amtks-yqvdcebrh-j- - - - - -n- | Phosphoribosylamino-imidazole (AIR) synthetase PurM | |
| 0152 | amtks-yqvdcebrh-j- - - - - -nx | Phosphoribosylamino-imidazolesuccino-carboxamide (SAICAR) synthase | |
| 0041 | amtks-yqvdcebrh-j- - - - - -n- | Phosphoribosylcarboxy-aminoimidazole (NCAIR) mutase PurE | |
| 0047 | amtks-yqvdcebrh-j- - - - - -n- | Phosphoribosylformyl-glycinamidine (FGAM) synthase, glutamine amidotransferase domain | |
| 0046 | amtks-yqvdcebrh-j- - - - - -n- | Phosphoribosylformyl-glycinamidine (FGAM) synthase, synthetase domain | |
| 0340 | amtks-yqvdcebrhuj- - - -linx | Biotin-(acetyl-CoA carboxylase) ligase | |
| 0511 | amtks-yqvdcebrhuj- - - -linx | Biotin carboxyl carrier protein of acetyl-CoA carboxylase | |
| 0157 | amtks-yqv-cebrhu- - - - - - -n- | Nicotinate-nucleotide pyrophosphorylase |
*In the phylogenetic patterns, each letter indicates that a particular genome is represented in the given COG, and a dash indicates the absence of a representative from the corresponding genome. The one-letter code for genomes is as follows: a, Archaeoglobus fulgidus; m, Methanococcus jannaschii; t, Methanobacterium thermoautotrophicum; k, Pyrococcus horikoshii; s, Pyrococcus abyssi; z, Aeropyrum pernix; y, Saccharomyces cerevisiae; q, Aquifex aeolicus; v, Thermotoga maritima; d, Deinococcus radiodurans; c, Synechocystis sp.; e, Escherichia coli; b, Bacillus subtilis; r, Mycobacterium tuberculosis; h, Haemophilus influenzae; u, Helicobacter pylori; j, Campylobacter jejuni; w, Ureaplasma urealyticum; g, Mycoplasma genitalium; p, Mycoplasma pneumoniae; o, Borrelia burgdorferi; l, Treponema pallidum; i, Chlamydia trachomatis and C. pneumoniae; n, Neisseria meningitidis; x, Rickettsia prowazekii.
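The one-letter code in the footnote can be turned into a small lookup for reading the pattern strings in the table. This is a minimal sketch with hypothetical helper names (`decode_pattern`, `missing_from`). It matches letters rather than positions, because the spacing of dashes in the patterns as reproduced here is irregular.

```python
# One-letter genome codes, in the order given in the table footnote.
GENOME_CODES = {
    "a": "Archaeoglobus fulgidus",
    "m": "Methanococcus jannaschii",
    "t": "Methanobacterium thermoautotrophicum",
    "k": "Pyrococcus horikoshii",
    "s": "Pyrococcus abyssi",
    "z": "Aeropyrum pernix",
    "y": "Saccharomyces cerevisiae",
    "q": "Aquifex aeolicus",
    "v": "Thermotoga maritima",
    "d": "Deinococcus radiodurans",
    "c": "Synechocystis sp.",
    "e": "Escherichia coli",
    "b": "Bacillus subtilis",
    "r": "Mycobacterium tuberculosis",
    "h": "Haemophilus influenzae",
    "u": "Helicobacter pylori",
    "j": "Campylobacter jejuni",
    "w": "Ureaplasma urealyticum",
    "g": "Mycoplasma genitalium",
    "p": "Mycoplasma pneumoniae",
    "o": "Borrelia burgdorferi",
    "l": "Treponema pallidum",
    "i": "Chlamydia trachomatis and C. pneumoniae",
    "n": "Neisseria meningitidis",
    "x": "Rickettsia prowazekii",
}


def decode_pattern(pattern):
    """Genomes represented in a COG; dashes and spaces are ignored."""
    return [GENOME_CODES[ch] for ch in pattern if ch in GENOME_CODES]


def missing_from(pattern):
    """Genomes in the code table that the pattern does not include."""
    present = set(pattern)
    return [name for code, name in GENOME_CODES.items()
            if code not in present]


# COG0101 (pseudouridylate synthase) from the table above.
print(missing_from("amtks-yqvdcebrhujwgpolinx"))
```

For COG0101 the only genome reported as missing is Aeropyrum pernix, which is the point the table makes about this enzyme.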
A. pernix proteins conserved in a wide range of organisms but missing in euryarchaea (examples)
| Gene/COG number | Phylogenetic pattern* | Function | Comments | |
| APE1618/1048 | - - - - -zyq-d-ebr- - - - - - - - - -x | Aconitase A | Unlike other archaea with sequenced genomes, | |
| APE1816/0114 | - - - - -zy- -dcebrhu- - - - - -i-x | Fumarase | ||
| APE1677/1071 | - - - - -zy- -dc-br- - - -gp- -i-x | Pyruvate dehydrogenase E1 component, α-subunit | ||
| APE1674/0022 | - - - - -zy- -dc-br- - - -gp- -i-x | Pyruvate dehydrogenase E1 component, β-subunit | ||
| APE1671/0508 | - - - - -zy- -dcebrh- - -gp- -i-x | Dihydrolipoamide acyltransferase | ||
| APE1725/1290 | - - - - -z-q-dc-br-uj- - - - - -nx | Cytochrome b | As an aerobe, | |
| APE1623, APE0793_1/0843 | - - - - -z-q-dcebr-uj- - - - - -nx | Cytochrome c oxidase, heme b and copper-binding subunit | ||
| APE0793_2/1845 | - - - - -z-q-dcebr- - - - - - - - - -x | Cytochrome oxidase, subunit 3 | ||
| APE1498/1171 | - - - - -zy-vdcebrh- - - - - - - - -x | Threonine dehydratase | Specifically related to a subfamily of bacterial catabolic threonine dehydratases (e.g. | |
| APE1038/0295 | - - - - -zy-vd-ebrh- -wgpo- - - - | Cytidine deaminase | ||
| APE1353/0514 | - - - - -zy-dceb-h- - - - - -l- - - | DNA helicase (RecQ family) | APE1353 differs from other members of the RecQ family by the presence of a long amino-terminal extension that probably forms a non-globular domain. APE1353 shows no specific affinity to any of the bacterial orthologs. | |
| APE2450/0260 | - - - - -z-q-dcebrhu-wgp- -i-x | Leucyl aminopeptidase | ||
| APE0137/0405 | - - - - -zy- -dcebr-u- - - - - - - - - | Gamma-glutamyltranspeptidase | ||
| APE2464/1702 | - - - - -z-qvdcebr- - - - - - - - - - - | Phosphate starvation-inducible protein PhoH, predicted ATPase | ||
| APE0993/0813 | - - - - -z- - -d-eb-hu-wgp-l- - - | Purine-nucleoside phosphorylase | Correlates with the absence of | |
| APE2105/0813a | - - - - -z- - - - - -e- -h- - - - -l- - - | Uridine phosphorylase | ||
| APE0033/1866 | - - - - -zy- -d-eb-h- - - - - - - - - - | Phosphoenolpyruvate carboxykinase |
*The designations are as in Table 6.
Examples of differential genome display of Pyrococcus abyssi and Pyrococcus horikoshii using the COG approach
| Gene/COG Number | Phylogenetic pattern* | Function | Comment |
| Present in P. abyssi | |||
| PAB2044/0547 | amt-szyqvdcebrhuj- - - - - -n- | Anthranilate phosphoribosyltransferase | The entire branched pathway for aromatic amino acid biosynthesis appears to be present in |
| PAB2045/0147 | amt-szyqvdcebrhuj- - - - - -n- | Anthranilate synthase component I | |
| PAB2046/0512 | amt-szyqvdcebrhuj- - - - - -n- | Anthranilate synthase component II | |
| PAB0307/0082 | - - - -szyqvdceb-huj- - - - -inx | DAHP synthase | |
| PAB2049/0159 | amt-szyqvdcebrhuj- - - - -in- | Tryptophan synthase α subunit | |
| PAB2048/0133 | amt-szyqvdcebrhuj- - - - -in- | Tryptophan synthase β subunit | |
| PAB0250/0031 | - - - -szyqvdcebrhuj- - - - - -n- | Cysteine synthase | |
| PAB1595/2046 | a- - -szyq-dc-b- - -j- - - - - - - - | ATP sulfurylase | |
| PAB0781/0529 | a- - -szyq-dcebr- -j- - - - - -n- | Adenylylsulfate kinase | |
| PAB1839/0035 | - -t-szyqvdcebrh-jwgp-l-n- | Uracil phosphoribosyltransferase | |
| PAB2246/0827 | -m- -s- - - - - - -brhu-w-p- - - - - | Adenine-specific DNA methyltransferases | |
| PAB2154/0610 | amt-s- - - - - -e- -hujw-p- - -n- | Restriction enzymes type I helicase subunit | |
| Present in P. horikoshii | |||
| PH0369/0153 | - - -k- -y-v- -ebrh- - - - - -l- - - | Galactokinase | |
| PH1048, PH1046/2309 | - - -k-z-q-d- -b- - - - - - -o- - - - | Leucyl aminopeptidase (aminopeptidase T) | |
| PH0365/1085 | a- -k- -y-v- -e-rh- - - - - - - - - - | Galactose-1-phosphate uridylyltransferase | |
| PH0896/1230 | - - -k- -yqvd-eb- - -j- - - - - -n- | Co/Zn/Cd efflux system component | |
| PH0162/1353 | amtk- - -qv- - - -r- - - - - - - - - - - | Predicted hydrolase of the HD superfamily | |
| PH1032/0338 | -m-k- - - - - -ce- -hu- - - - -l- - - | Site-specific DNA methylase dam | |
| PH0873/1401 | - -tk- - -q-dceb-uj- - - - - - - - | GTPase subunit (McrB) of a restriction endonuclease | |
*The designations are as in Table 6.
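Differential genome display, as tabulated above, amounts to selecting COGs whose phylogenetic pattern contains the letter for one genome but not the letter for the other. The following is a minimal sketch of that selection for the two pyrococci (s, P. abyssi; k, P. horikoshii); the function name and the three-entry sample dictionary are illustrative, with patterns copied from the tables in this article and spaces removed.

```python
def differential_display(patterns, present, absent):
    """Return COG numbers whose pattern has `present` but lacks `absent`.

    `patterns` maps a COG number to its phylogenetic pattern string;
    `present` and `absent` are one-letter genome codes.
    """
    selected = []
    for cog, pattern in patterns.items():
        letters = set(pattern.replace(" ", ""))
        if present in letters and absent not in letters:
            selected.append(cog)
    return selected


# Three sample patterns taken from the tables in this article.
sample = {
    "0547": "amt-szyqvdcebrhuj------n-",  # anthranilate phosphoribosyltransferase
    "1085": "a--k--y-v--e-rh----------",  # galactose-1-phosphate uridylyltransferase
    "0101": "amtks-yqvdcebrhujwgpolinx",  # pseudouridylate synthase (both pyrococci)
}

only_p_abyssi = differential_display(sample, present="s", absent="k")
only_p_horikoshii = differential_display(sample, present="k", absent="s")
print(only_p_abyssi, only_p_horikoshii)
```

COGs present in both pyrococci, such as 0101 here, are excluded from both lists, so only genes that distinguish the two closely related genomes remain.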