Eugene V Koonin, Natalie D Fedorova, John D Jackson, Aviva R Jacobs, Dmitri M Krylov, Kira S Makarova, Raja Mazumder, Sergei L Mekhedov, Anastasia N Nikolskaya, B Sridhar Rao, Igor B Rogozin, Sergei Smirnov, Alexander V Sorokin, Alexander V Sverdlov, Sona Vasudevan, Yuri I Wolf, Jodie J Yin, Darren A Natale.
Abstract
BACKGROUND: Sequencing the genomes of multiple, taxonomically diverse eukaryotes enables in-depth comparative-genomic analysis, which is expected to help in reconstructing ancestral eukaryotic genomes and major events in eukaryotic evolution and in making functional predictions for currently uncharacterized conserved genes.
Year: 2004 PMID: 14759257 PMCID: PMC395751 DOI: 10.1186/gb-2004-5-2-r7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Assignment of proteins from each of the seven analyzed eukaryotic genomes to KOGs with different numbers of species and to LSEs. 0, Proteins without detectable homologs (singletons); 1, LSEs. Species abbreviations: Ath, Arabidopsis thaliana; Cel, Caenorhabditis elegans; Dme, Drosophila melanogaster; Ecu, Encephalitozoon cuniculi; Hsa, Homo sapiens; Sce, Saccharomyces cerevisiae; Spo, Schizosaccharomyces pombe.
KOGs and TWOGs with unexpected phyletic patterns (examples)
| KOG/TWOG number | Phyletic pattern* | (Predicted) structure and function | Prokaryotic homologs | Comments |
| TWOG0892 | ---H--E | Discoidin domain protein, potential regulator of proteasome activity | Detected in a few phylogenetically scattered bacteria, no COG so far [ | |
| TWOG0263 | A-----E | ATP/ADP translocase | ATP/ADP translocases of chlamydia, rickettsia, | ATP/ADP translocase is a hallmark of intracellular parasites and symbionts, which allows them to scavenge ATP from the host cell; chloroplast protein in plants. Could be acquired by plants and microsporidia via independent HGT from bacteria. [ |
| TWOG0689 | ---HY-- | Uncharacterized protein essential for propionate metabolism | PrpD protein of several bacteria and archaea (COG2079) | The yeast and human (and the orthologs from other vertebrates) proteins show the greatest similarity to different subsets of bacterial orthologs, which might suggest independent HGT events. |
| TWOG0871 | ---H-P- | Uncharacterized conserved protein, probably enzyme | COG4336, sporadic representation in several bacterial lineages | The human (and mouse) protein has an additional domain conserved in the archaeon |
| TWOG0788 | A----P- | Urease | Ureases of many bacterial species | Highly conserved enzyme present in plants and many fungi but not |
| 4751 | A--H--E | Recombination repair protein BRCA2, contains varying number of BRCA2 repeats | None | Although sequence conservation is limited to the BRC repeats [ |
| 4597 | A--H--E | TATA-binding protein 1-interacting protein | None | Probable multiple gene losses |
| 4486 | A--H--E | 3-methyl-adenine DNA glycosylase | Orthologs in many bacteria (COG2094) | The plant protein and those from mammals and microsporidia show the greatest similarity to different subsets of bacterial orthologs. Evolution might have included a combination of gene loss and independent HGT events |
| 1594 | A-D-Y-- | Predicted epimerase related to aldose 1-epimerase | Bacterial orthologs, primarily proteobacteria (COG0676) | Eukaryotic proteins are more closely related to each other than to bacterial orthologs, indicating monophyletic origin. Function remains unknown; might be involved in a distinct and still uncharacterized pathway of polysaccharide biosynthesis. LSE in |
| 4141 | ---HYPE | Rad52/22, protein involved in double-strand break repair | None | Probable gene loss in plants, insects and nematodes |
| 4528 | -CDH--E | Uncharacterized predicted enzyme, possibly a polynucleotide kinase (structure of the ortholog from the bacterium | Conserved in all archaea and several bacteria (COG1371) | Context analysis of archaeal and bacterial genomes suggests functional interaction between proteins of KOG5324 and KOG4246, RNA 3'-terminal phosphate cyclase (KOG4398, COG0430), and tRNA/rRNA cytosine C5-methylase (KOG1299/COG0144) ([ |
| 3833 | -CDH--E | Uncharacterized predicted enzyme, possibly a polynucleotide phosphatase | Conserved in all archaea and several bacteria (COG1690) | See comment for KOG5324 |
*Abbreviations: A, thale cress A. thaliana; C, nematode C. elegans; D, fruit fly D. melanogaster; E, microsporidian Encephalitozoon cuniculi; H, Homo sapiens; Y, budding yeast S. cerevisiae; P, fission yeast S. pombe; a letter indicates the presence of the respective species in the given KOG and a dash indicates its absence.
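The phyletic patterns in the table above can be read mechanically. The following is a minimal sketch, assuming the seven positions follow the order A, C, D, H, Y, P, E; that order is inferred from the patterns shown in the rows and is not stated explicitly in the legend. The names `SPECIES_ORDER` and `decode_pattern` are hypothetical.

```python
# Position order inferred from the table rows (an assumption, not stated in the legend).
SPECIES_ORDER = "ACDHYPE"


def decode_pattern(pattern):
    """Return the set of species letters present in a phyletic pattern.

    A letter marks presence of that species in the KOG; a dash marks absence.
    """
    if len(pattern) != len(SPECIES_ORDER):
        raise ValueError(f"expected {len(SPECIES_ORDER)} positions, got {len(pattern)}")
    present = set()
    for slot, symbol in zip(SPECIES_ORDER, pattern):
        if symbol == slot:
            present.add(slot)
        elif symbol != "-":
            raise ValueError(f"unexpected symbol {symbol!r} in slot {slot!r}")
    return present


# Example: the Rad52/22 row (pattern ---HYPE) lacks the plant, nematode and fly slots.
rad52_species = decode_pattern("---HYPE")
rad52_missing = set(SPECIES_ORDER) - rad52_species
```

Checking each symbol against its expected slot, rather than only counting letters, catches patterns whose columns have shifted during extraction.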
KOGs represented by exactly one ortholog in seven analyzed eukaryotic genomes (examples)
| KOG number | (Predicted) function | Multiprotein complex | Functional class* | Prokaryotic homologs | Fitness class†, yeast‡ | Fitness class†, worm§ | Comments |
| 0392 | SNF2 family DNA-dependent ATPase | TBP-DNA complex | Many bacteria and archaea (COG0553) | 0 | 1 | Involved in regulation of transcription from POL II promoters [ | |
| 0121 | Nuclear cap-binding protein complex, subunit CBP20 (RRM-domain-containing RNA-binding protein) | Cap-binding complex | A | Several bacteria (COG0724) | 1 | X | RRM-domain proteins show scattered presence in bacteria and might have been horizontally transferred from eukaryotes |
| 0213 | U2-snRNP associated splicing factor 3b, subunit 1 | Spliceosome | A | None | 0 | 0 | |
| 0227 | snRNA-associated protein, splicing factor 3a, subunit b (Prp11p) | Spliceosome | A | None | 0 | 0 | |
| 2268 | Predicted nucleic-acid-binding protein kinase of the RIO1 family; 40S ribosomal subunit biogenesis/18S rRNA processing | Pre-40S subunit | A | Orthologs in most archaea but not in bacteria (COG0478) | 0 | X | One of the very small number of protein kinases that show a clear-cut orthologous relationship between all eukaryotes and most archaea, and, apparently, the only one containing a helix-turn-helix nucleic-acid-binding domain. [ |
| 3031 | Protein required for 60S ribosomal subunit biogenesis; [ | Processosome | A | Distantly related to COG2136, represented by orthologs in most archaea, but not in bacteria (KSM, unpublished) | 0 | X | The COG2136 proteins appear to be subunits of the predicted archaeal exosome [ |
| 3045 | Predicted RNA methylase involved in rRNA processing | Processosome? | A | Distantly related to numerous Rossmann-fold methylases but prokaryotic orthologs could not be confidently identified | 1 | 1 | This protein (Rrp8p in yeast) has been shown to participate in the processing of rRNA and sequence analysis reveals the presence of a Rossmann-fold methylase domain [ |
| 3064 | RNA-binding nuclear protein containing a distinct C4 Zn-finger; implicated in the biogenesis of 60S ribosomal subunits [ | Processosome | A | None | 0 | 0 | Initially identified in yeast as the MAK16 protein required for dsRNA virus reproduction [ |
| 0291, 0302, 0306, 0310, 0319, 0650, 1272 | WD40-repeat proteins, subunits of rRNA processing complexes [ | Processosome | A | WD40-repeat proteins are present in several bacterial lineages and are particularly abundant in cyanobacteria but are missing in most archaea; none of them appear to be obvious orthologs of this protein (COG2319) | all 0 | X,X,1,X,1,1,1 | |
| 0284 | Polyadenylation factor I complex, subunit PFS2, WD40-repeat protein | Poly-adenylation complex | A | Same as above (COG2319) | 0 | X | |
| 0337 | RNA helicase involved in 28S rRNA processing | Processosome | A | Most of the archaea and bacteria (COG0513) | 0 | X | |
| 0343 | RNA helicase involved in 28S rRNA processing | Processosome | A | Most of the archaea and bacteria (COG0513) | 0 | X | |
| 1069 | 3'-5' exoribonuclease (RNAse PH), exosome subunit Rrp46 | Exosome | A | Most bacteria and archaea (COG0689) | 0 | 1 | |
| 1070 | Exosome subunit Rrp5 (RNA-binding S1 domain fused to TPR repeats) | Exosome | A | Most bacteria (COG0539, COG0457) | 0 | 1 | |
| 1135 | mRNA cleavage and polyadenylation complex subunit CFT2 (CPSF) | Cleavage and polyadenylation complex | A | Most archaea and some bacteria (COG1236) | 0 | 0 | |
| 1914 | mRNA cleavage and polyadenylation factor I complex, subunit RNA14 | Cleavage and polyadenylation complex | A | None | 0 | X | |
| 1975 | RNA (guanine-7-) methyltransferase (capping enzyme subunit) | Capping enzyme | A | Numerous methyltransferases (COG0500) but no ortholog | 0 | 1 | |
| 2051 | Nonsense-mediated mRNA decay complex, subunit 2 | NMD complex | A | None | 1 | X | |
| 2554 | Pseudouridylate synthase | ? | A | Most archaea and bacteria (COG0101) | 1 | 1 | |
| 2613 | Upf1p-interacting protein, NMD complex subunit Nmd3p | NMD complex | A | All archaea, no bacteria (COG1499) | 0 | X | |
| 2771 | tRNA-specific adenosine-34 deaminase subunit Tad3p | Heterodimeric RNA-specific deaminase | A | Most bacteria and some archaea (COG0590) | 0 | X | |
| 2780 | Protein involved in ribosomal large subunit assembly (RPF1), contains IMP4 domain | Processosome | A | Most archaea, no bacteria (COG2136) | 0 | 1 | |
| 2781 | Subunit of the small (ribosomal) subunit (SSU) processosome (snoRNP), IMP4 | Processosome | A | Most archaea, no bacteria (COG2136) | 0 | 1 | |
| 2874 | Protein involved in rRNA processing and ribosomal assembly | ? | A | All archaea, no bacteria (COG1094) | 0 | 1 | Predicted RNA-binding protein containing KH domain |
| 3013 | Exosome subunit Rrp4 | Exosome | A | Most archaea, no bacteria (COG1097) | 0 | X | |
| 3031 | Protein involved in large ribosome subunit assembly and 28S rRNA processing (Rrf2) | Processosome | A | None | 0 | X | Contains the BRIX domain |
| 3322 | RNAse P/MRP subunit, involved in processing of pre-tRNAs and the 5.8S rRNA | RNAse P/MRP holoenzyme | A | None | 0 | 1 | |
| 3448 | Predicted snRNP core protein | Spliceosome | A | All archaea, no bacteria (COG1958) | 0 | 1 | |
| 3482 | Small nuclear ribonucleoprotein (snRNP) SMF subunit | Spliceosome | A | All archaea, no bacteria (COG1958) | 0 | 0 | |
| 2463 | Predicted RNA-binding protein, consisting of a PIN domain and a Zn-ribbon. Involved in 26S proteasome assembly | 26S proteasome, pre-40S subunit | A,O | Represented by orthologs in all archaea but no bacteria (COG1349) | 0 | X | PIN domain has been detected in exosome subunits and is thought to have RNA-binding properties or even nuclease activity [ |
| 3273 | Predicted RNA-binding protein containing KH domain, interacts with Nob1p | 26S proteasome, pre-40S subunit | A,O | Orthologs in all archaea but no bacteria (COG1094) | 0 | 0 | This is the second predicted RNA-binding protein involved in proteasome assembly, [ |
| 1831 | Deadenylating 3'-5' exonuclease, negative regulator of PolII transcription | CCR4-NOT core complex | AK | None | 0 | 0 | |
| 1159 | NADP-dependent flavoprotein reductase, probably sulfite reductase subunit | ? | CL | Many bacteria (COG0369) | 0 | X | Genetic evidence of a role in DNA replication [ |
| 1800 | Ferredoxin/adrenodoxin reductase | ? | C | Most bacteria and some archaea (COG0493) | 0 | X | |
| 1173 | Anaphase-promoting complex (APC), Cdc16 subunit (TPR-repeat protein) | APC | D | Most of archaea and bacteria have TPR-repeat proteins (COG0457) but no orthologs of Cdc16 | 0 | 0 | |
| 3437 | Anaphase-promoting complex (APC), subunit 10 | APC | D | None | 1 | 1 | |
| 1358 | Serine palmitoyltransferase | ? | I | Most bacteria and some archaea (COG0156) | 0 | 0 | |
| 1511 | Mevalonate kinase | ? | I | Most archaea and some bacteria (COG1577) | 0 | X | |
| 3059 | N-acetylglucosaminyltransferase complex, subunit PIG-C/GPI2, involved in phosphatidylinositol biosynthesis | N-acetylglucosaminyltransferase complex | I | None | 0 | 1 | |
| 0467 | Translation elongation factor 2 paralog (GTPase) | ? | J | All (COG0480) | 0 | X | Involved in 60S ribosomal subunit maturation [ |
| 1147 | Glutamyl-tRNA synthetase | Multispecificity aminoacyl-tRNA synthetase complex | J | All (COG0008) | 0 | X | |
| 2784 | Phenylalanyl-tRNA synthetase, beta subunit | Heterodimeric phenylalanyl-tRNA synthetase | J | All (COG0016) | 0 | X | |
| 3123 | Diphthamide synthase (methyltransferase) | ? | J | All archaea, no bacteria (COG1798) | 1 | 1 | |
| 0261 | RNA polymerase III, largest subunit | RNAPIII holoenzyme | K | All (COG0086) | 0 | X | |
| 0262 | RNA polymerase I, largest subunit | RNAPI holoenzyme | K | All (COG0086) | 0 | X | |
| 0215 | RNA polymerase III, second largest subunit | RNAPIII holoenzyme | K | All (COG0085) | 0 | X | |
| 0216 | RNA polymerase I, second largest subunit | RNAPI holoenzyme | K | All (COG0085) | 0 | X | |
| 1063 | RNA polymerase II elongator complex, subunit ELP2, WD repeat protein | RNA polymerase II elongator complex | K | WD40-repeat proteins are present in several bacterial lineages and are particularly abundant in cyanobacteria but are missing in most archaea; none of them appear to be obvious orthologs of this protein (COG2319) | 1 | X | |
| 1131 | RNA polymerase II transcription initiation/nucleotide excision repair factor TFIIH, 5'-3' helicase subunit RAD3 | RNAPII holoenzyme | K | Most archaea and bacteria (COG1199) | 0 | X | |
| 1920 | RNA polymerase II Elongator subunit | RNAP II elongator complex | K | None | 1 | X | |
| 1932 | TBP-associated factor (Taf2p) | TFIID complex | K | None | 0 | X | |
| 2009 | Transcription initiation factor TFIIIB, Bdp1 subunit (Myb domain) | TFIIIB | K | None | 0 | 0 | |
| 2076 | RNA polymerase III transcription factor TFIIIC, TPR-repeat-containing protein | TFIIIC | K | Most of archaea and bacteria have TPR-repeat proteins (COG0457) but no orthologs of TFIIIC | 0 | X | |
| 2487 | RNA polymerase II transcription initiation/nucleotide excision repair factor TFIIH, subunit TFB4 | TFIIH | K | None | 0 | 1 | |
| 2691 | RNA polymerase II subunit 9 | RNAP II holoenzyme | K | Most archaea, no bacteria (COG1594) | 1 | X | |
| 2807 | RNA polymerase II transcription initiation/nucleotide excision repair factor TFIIH, SSL1 subunit | TFIIH | K | No orthologs although von Willebrand A domains are present in a variety of prokaryotic proteins | 0 | 0 | Consists of a von Willebrand A domain most closely related to those in the proteasome subunit RPN10 [ |
| 2907 | RNA polymerase I transcription factor TFIIS, subunit A12.2/RPA12 | TFIIS | K | All archaea, no bacteria (COG1594) | 1 | 0 | |
| 3169 | RNA polymerase II transcriptional regulation mediator | Mediator complex [ | K | None | 0 | X | |
| 3233 | RNA polymerase III subunit C34 | RNAP III holoenzyme | K | None | 0 | 1 | |
| 3297 | RNA polymerase III subunit C25 | RNAP III holoenzyme | K | All archaea, no bacteria (COG1095) | 0 | 0 | |
| 3438 | Subunit common to RNA polymerases I (A) and III (C); Rpc19p | RNAP I and III holoenzymes | K | 0 | 1 | ||
| 3471 | RNA polymerase II transcription initiation/nucleotide excision repair factor TFIIH, subunit TFB2 | TFIIH | K | None | 0 | X | |
| 3490 | Transcription elongation factor SPT4, Zn-ribbon protein | Chromatin-associated transcription complexes | K | None | 1 | 1 | |
| 3497 | RNA polymerase II subunit; Rpb10p | RNAP II holoenzyme | K | All archaea, no bacteria (COG1644) | 0 | X | |
| 3901 | Transcription initiation factor IID subunit (Taf13p) | TFIID | K | None | 0 | X | |
| 3949 | RNA polymerase II elongator complex, subunit ELP4 | RNAP II elongator complex | K | None | 1 | 1 | |
| 4086 | SOH1 protein potentially involved in Pol II transcription regulation and repair | SMCC complex [ | K | None | 1 | X | |
| 1532 | Predicted GTPase of the XAB1 family [ | TBP-free TAF(II) complex | L | All archaea and several bacteria (COG1100) | 0 | 0 | XP-A-binding protein in humans, thus implicated in repair ([ |
| 1533 | Predicted GTPase of the XAB1 family (paralog of KOG1757) [ | TBP-free TAF(II) complex? | L | All archaea and several bacteria (COG1100) | 0 | X | Might have a function in repair given the paralogous relationship with KOG1757. |
| 1625 | DNA polymerase α processivity subunit, inactivated phosphatase | DNA polymerase α holoenzyme | L | Small subunit of archaeal DNA polymerase II (COG1311) | 0 | 0 | The small, regulatory subunit of DNA polymerase α also forms a pan-eukaryotic KOG3044, which is a paralog of KOG0861 (the only recent duplication in KOG3044 is seen in vertebrates). In contrast, another paralog, the small subunit of DNA polymerase ε, is represented in animals, fungi and the early-branching protozoan |
| 0479 | DNA replication licensing factor MCM3 | Pre-replication complex | L | All archaea, no bacteria (COG1241) | 0 | X | |
| 0481 | DNA replication licensing factor MCM5 | Pre-replication complex | L | All archaea, no bacteria (COG1241) | 0 | X | |
| 0482 | DNA replication licensing factor MCM7 | Pre-replication complex | L | All archaea, no bacteria (COG1241) | 0 | 0 | |
| 0964 | Structural maintenance of chromosome protein 3 (cohesin subunit SMC3) | Sister chromatid cohesion complex | L | Many archaea and bacteria (COG1196) | 0 | X | |
| 0979 | Structural maintenance of chromosome protein 5 (cohesin subunit SMC5) | Sister chromatid cohesion complex | L | Many archaea and bacteria (COG1196) | 0 | X | |
| 1942 | TBP-interacting protein TIP49 (DNA helicase) | chromatin remodeling complex | L | Most of the archaea, no bacteria (COG1224) | 0 | 0 | |
| 1979 | DNA mismatch repair ATPase, MLH1 | Mismatch repair complex | L | Most bacteria and some archaea (COG0323) | 1 | 1 | |
| 2267 | DNA primase, large subunit | DNA polymerase α:primase complex | L | All archaea, no bacteria (COG2219) | 0 | 0 | |
| 2299 | Ribonuclease HI | Replisome | L | All archaea, most bacteria (COG0164) | 1 | X | |
| 2310 | DNA repair exonuclease MRE11 | MRN complex involved in double-strand break repair | L | All archaea, most bacteria (COG0420) | 1 | 1 | |
| 2929 | Origin recognition complex, subunit 2 (ORC2) | ORC | L | None | 1 | 1 | |
| 0179 | 20S proteasome, regulatory subunit beta type PSMB1/PRE7 (paralog of KOG0185) | 20S proteasome | O | All archaea but only actinomycetes among bacteria (COG0638) | 0 | 0 | |
| 0185 | 20S proteasome, regulatory subunit beta type PSMB4/PRE4 (paralog of KOG0179) | 20S proteasome | O | All archaea but only actinomycetes among bacteria (COG0638) | 0 | 0 | |
| 2708 | Predicted metalloprotease with chaperone activity (RNAse H/HSP70 fold) [ | Putative complex involved in translation regulation [ | O | Represented by orthologs in all archaea and bacteria (COG0533) | 0 | X | One of the few remaining uncharacterized proteins that are universally conserved in all cellular life forms. The only experimentally demonstrated activity is that of sialoglycoprotease but fusion with a distinct protein kinase in several archaea and analysis of gene neighborhood suggest a fundamental role in signal transduction, possibly translation regulation. [ |
| 0301 | Protein required for normal rates of ubiquitin-dependent proteolysis, contains WD40 repeats | Proteasome? | O | Same as above (COG2319) | 1 | X | |
| 0358 | Chaperonin complex component, TCP-1 delta subunit (CCT4) | TCP-1 | O | All archaea and nearly all bacteria (COG0459) | 0 | 0 | |
| 0363 | Chaperonin complex component, TCP-1 beta subunit (CCT2) | TCP-1 | O | All archaea and nearly all bacteria (COG0459) | 0 | 0 | |
| 0687 | 26S proteasome regulatory complex, subunit RPN7/PSMD6 | 26S proteasome | O | None | 0 | 0 | |
| 1299 | Vacuolar sorting protein VPS45/Stt10 (Sec1 family) | t-SNARE complex | O | None | 1 | X | Involved in t-SNARE complex assembly [ |
| 1349 | GPI-anchor transamidase complex, GPI8 subunit | GPI-anchor transamidase complex | O | Distantly related proteases in some bacteria (no COG) | 0 | 1 | |
| 1943 | Beta-tubulin folding cofactor D, involved in chromosome segregation | ? | O | None | 1 | 1 | |
| 2015 | NEDD8-activating complex, UBA3 subunit | NEDD8-activating complex | O | Most bacteria and some archaea (COG0476) | 1 | 1 | |
| 2126 | Phosphoethanolamine | ? | O | Several bacteria and archaea (COG1524) | 0 | X | |
| 2884 | 26S proteasome regulatory complex, subunit RPN10/PSMD4 | 26S proteasome regulatory complex | O | No orthologs although von Willebrand A domains are present in a variety of prokaryotic proteins | 1 | 1 | Contains von Willebrand A domain |
| 2908 | 26S proteasome regulatory complex, subunit RPN9/PSMD13 | 26S proteasome regulatory complex | O | None | 0 | 0 | Contains PINT domain |
| 0209 | Endoplasmic reticulum membrane P-type ATPase | ? | P | Many bacteria and some archaea (COG0474) | 1 | X | |
| 3379 | Uncharacterized member of the histidine triad superfamily of nucleotide hydrolases | ? | R | Most archaea and bacteria (COG0537) | 1 | X | Only biochemical function predicted. |
| 2635 | Coatomer (COPI) complex delta subunit | COPI complex | U | None | 0 | 0 | |
| 2927 | Membrane component of ER protein translocation apparatus (Sec62) | Sec complex | U | None | 0 | 1 | |
| 2978 | Dolichol-phosphate mannosyltransferase | ? | U | All archaea, most bacteria (COG0463) | 0 | X | |
| 3198 | Signal recognition particle, subunit Srp19 | Signal recognition particle | U | All archaea, no bacteria (COG1400) | 0 | X | |
| 3315 | Subunit of the targeting complex (TRAPP) involved in ER to Golgi trafficking | TRAPP | U | None | 0 | X | |
| 3369 | Subunit of the targeting complex (TRAPP) involved in ER to Golgi trafficking | TRAPP | U | None | 0 | X | |
| 1992 | Nuclear export receptor CSE1/CAS (importin beta) | ? | YU | None | 0 | X | |
| 2316 | PP-loop family ATP pyrophosphatase domain, which in fungi, plants and insects is fused to a duplicated translation inhibitor domain. The fusion, along with the phyletic pattern of the PP-ATPase domain, suggests an essential function in translation regulation | ? | A | Orthologs of the PP-loop domain are present in all archaea (COG2102) but not in bacteria. Orthologs of the translation inhibitor domain are found in most bacteria and several archaea (COG0251) | 1 | X | PP-loop ATPases have been previously implicated in base thiolation in various RNAs [ |
| 2523 | Predicted RNA-binding protein containing a PUA domain, probable role in RNA modification [ | Putative novel RNA modification complex | A | Orthologs present in all archaea (COG2016) but not in bacteria | 1 | X | Several of the archaeal orthologs of this protein form fusions with a PP-loop ATPase domain implicated in base thiolation [ |
| 0270, 0271, 1539 | WD40-repeat proteins | Processosome | A | WD40-repeat proteins are present in several bacterial lineages and are particularly abundant in cyanobacteria but are missing in most archaea; none of them appear to be obvious orthologs of this protein (COG2319) | all 0 | X,1,X | By analogy with other conserved WD40-repeat proteins, predicted to be subunits of rRNA processing/ribosome assembly complexes |
| 2321 | Nucleolar protein, contains WD40 repeats | rRNA processosome? | A | WD40-repeat proteins are present in several bacterial lineages and are particularly abundant in cyanobacteria but are missing in most archaea; none of them appear to be obvious orthologs of this protein (COG2319) | 0 | 1 | Probable subunit of an rRNA-processing complex |
| 1763 | Uncharacterized conserved protein containing a CCCH Zn-finger; possible role in RNA processing or splicing | ? | A | None | 1 | 1 | CCCH fingers have been shown to bind 3' untranslated regions in various mRNAs [ |
| 2837 | Protein containing a U1-type, RNA-binding C2H2 Zn-finger. Probable role in RNA splicing/processing | Spliceosome? | A | None | 0 | 0 | U1-type fingers are essential for the assembly of U1 RNP [ |
| 3073 | Predicted RNA-binding protein containing PIN domain and involved in 18S rRNA processing | Pre-40S subunit | A | Most archaea, no bacteria (COG1412) | 0 | 1 | Interacts with Nop14p and is required for 40S subunit biogenesis and 18S rRNA maturation (11694595). The presence of the PIN domain suggests RNA-binding and, possibly, RNAse activity |
| 3154 | Uncharacterized protein with potential function in translation or ribosomal biogenesis | Pre-40S subunit? | A? | Most archaea, no bacteria (COG2042) | 1 | X | The general functional prediction stems from the observation that the gene for this protein forms a predicted conserved operon with the gene for ribosomal protein L40E in several archaeal genomes |
| 3214 | Small protein containing a Zn-ribbon, possibly RNA-binding; potential role in RNA processing or transcription regulation | ? | A? | Conserved in Crenarchaeota (COG4888) | 1 | 1 | |
| 3800 | Predicted E3 ubiquitin ligase containing RING finger, subunit of transcription/repair factor TFIIH and CDK-activating kinase assembly factor | TFIIH | KO | None | 0 | X | |
| 3176 | Predicted α-helical protein, possibly involved in replication/repair; paralog of KOG3636 | A novel complex with PCNA involved in replication? | L? | Conserved in most (possibly all) archaea but not in bacteria (COG1711) | 0 | X | A function in DNA replication/repair and/or transcription is suggested by the analysis of the genome context of archaeal orthologs which form an evolutionarily conserved association with the genes for replication sliding clamp (PCNA ortholog) (K.S.M. and E.V.K., unpublished work) |
| 3303 | Predicted α-helical protein, possibly involved in replication/repair/transcription; paralog of KOG3508 | A novel complex with PCNA involved in replication? | L? | Conserved in most (possibly all) archaea but not in bacteria (COG1711) | 0 | 0 | A function in DNA replication/repair and/or transcription is suggested by the analysis of the genome context of archaeal orthologs which form an evolutionarily conserved association with the genes for replication sliding clamp (PCNA ortholog) (K.S.M. and E.V.K., unpublished work) |
| 0396 | Predicted E3 ubiquitin ligase | Ub ligase | O | None | 1 | 1 | The proteins in this KOG contain a modified RING domain, which might not be capable of metal-binding similarly to the U-box domain [ |
| 1443 | Multitransmembrane protein, predicted drug/metabolite transporter | ? | R | Most archaea and bacteria (COG0697) | 1 | X | |
| 2647 | Multitransmembrane protein, potential transporter | ? | R | Most bacteria and some archaea (COG0628) | 0 | 1 | |
| 2488 | Predicted N-acetyltransferase | ? | R | Most archaea and bacteria (COG0454) | 1 | X | Putative role in ribosomal maturation? |
| 3347 | Predicted nucleotide kinase; nuclear protein (Fap7p) | ? | R | Conserved in all archaea but not in bacteria (COG1936) | 0 | 1 | Involved in oxidative stress response in yeast [ |
| 3974 | Predicted sugar kinase | Putative novel complex with KOG2585 proteins | R | All archaea and most bacteria (COG0063) | 1 | 1 | Based on fusions seen in prokaryotes, predicted to interact functionally and, possibly, physically with uncharacterized proteins of KOG2585 (represented in all eukaryotes but includes paralogs in some species) |
| 2318 | Uncharacterized conserved protein | ? | S | None | 0 | 1 | |
| 3237 | Uncharacterized conserved protein containing coiled-coil domain | ? | S | None | 0 | 1 | Coiled-coil domains are often involved in complex assembly; this could be an uncharacterized component of the chromatin or the spliceosome |
*Abbreviations for the functional categories are as in Figure 3. †0, essential gene (lethal knockout); 1, non-essential gene (non-lethal knockout); X indicates that no data is available for the given gene. ‡Data from [85]. §Data from [86].
Figure 2. Distribution of the KOGs by the number of paralogs in each of the analyzed eukaryotic genomes. The species abbreviations are as in Figure 1.
Figure 3. Functional breakdown of the KOGs. Designations of functional categories: A, RNA processing and modification; B, chromatin structure and dynamics; C, energy production and conversion; D, cell-cycle control and mitosis; E, amino acid metabolism and transport; F, nucleotide metabolism and transport; G, carbohydrate metabolism and transport; H, coenzyme metabolism; I, lipid metabolism; J, translation; K, transcription; L, replication and repair; M, membrane and cell wall structure and biogenesis; O, post-translational modification, protein turnover, chaperone functions; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; T, signal transduction; U, intracellular trafficking and secretion; Y, nuclear structure; Z, cytoskeleton; R, general functional prediction only (typically, prediction of biochemical activity); S, function unknown. This breakdown is only for KOGs that included at least three species.
Figure 4. Variation of amino-acid substitution rates among KOGs. (a) Probability-density function for the distribution of evolutionary rates among the set of KOGs including all seven analyzed eukaryotic species. (b) Distribution functions for the evolutionary rates in different functional categories of KOGs. The designations of functional categories are as in Figure 3.
Evolutionary rates for different functional categories of KOGs*
| Functional category | Number of KOGs | Mean rate, substitutions per site | Standard deviation |
| J | 227 | 0.98 | 0.37 |
| H | 62 | 0.98 | 0.30 |
| A | 167 | 1.01 | 0.36 |
| C | 140 | 1.01 | 0.43 |
| O | 307 | 1.01 | 0.40 |
| F | 50 | 1.05 | 0.34 |
| E | 130 | 1.07 | 0.38 |
| L | 139 | 1.11 | 0.38 |
| B | 56 | 1.13 | 0.33 |
| Z | 64 | 1.13 | 0.46 |
| K | 209 | 1.15 | 0.42 |
| G | 115 | 1.16 | 0.43 |
| I | 110 | 1.16 | 0.32 |
| T | 200 | 1.18 | 0.39 |
| D | 111 | 1.19 | 0.40 |
| R | 415 | 1.23 | 0.42 |
| M | 33 | 1.26 | 0.47 |
| U | 196 | 1.27 | 0.42 |
| Q | 30 | 1.27 | 0.37 |
| P | 69 | 1.28 | 0.45 |
| N | 2 | 1.30 | 0.78 |
| S | 348 | 1.40 | 0.41 |
| All | 3203 | 1.16 | 0.42 |
*Only the KOGs that included at least one member from Arabidopsis were analyzed; the evolutionary rates are the average distances between the Arabidopsis representative in the given KOG and the proteins from other species (see Materials and methods for details). The functional categories are designated as in Figure 3.
Statistical significance of differences in evolutionary rates between selected functional categories of KOGs (t-test)
| | J | L | U | S |
| J | - | | | |
| L | 3 × 10⁻³ | - | | |
| U | 1 × 10⁻¹² | 3 × 10⁻⁴ | - | |
| S | 7 × 10⁻³³ | 5 × 10⁻¹³ | 2 × 10⁻⁴ | - |
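A comparison of this kind can be approximated from the summary statistics in the rates table (number of KOGs, mean rate, standard deviation). The sketch below uses Welch's unequal-variance t statistic with a normal approximation for the two-sided p-value. Both choices are assumptions, since the table says only "t-test". Because the published means and standard deviations are rounded, the result should match the tabulated significance only in rough order of magnitude. The function names are hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean2 - mean1) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (reasonable when df is large)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Categories J (translation) and L (replication and repair), values from the rates table.
t_jl, df_jl = welch_t(0.98, 0.37, 227, 1.11, 0.38, 139)
p_jl = two_sided_p_normal(t_jl)
```

With several hundred KOGs per category the degrees of freedom are large, which is why the normal approximation is used here instead of the exact t distribution.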
Figure 5. Parsimonious scenarios of loss and emergence of genes (KOGs) in eukaryotic evolution. (a) The coelomate topology of the phylogenetic tree of the eukaryotic crown group. (b) The ecdysozoan topology of the phylogenetic tree of the eukaryotic crown group. The numbers in boxes indicate the inferred number of KOGs in the respective ancestral forms. The numbers next to branches indicate the number of gene gains (emergence of KOGs) (numerator) and gene (KOG) losses (denominator) associated with the respective branches; a dash indicates that the number of losses for a given branch could not be determined. Proteins from each genome that did not belong to KOGs as well as LSEs were counted as gains on the terminal branches. The species abbreviations are as in Figure 1.
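To illustrate how a parsimonious gain/loss scenario can depend on the tree topology, here is a minimal sketch. It assumes a Dollo-style rule (a gene emerges once, at the last common ancestor of the species that carry it, and is lost on every branch below that point leading to a clade without it); the caption does not state which parsimony rule was used. The two topologies and the one-letter species codes (A, C, D, H, Y, P, E) are assumptions inferred from the tables, and `dollo_scenario` is a hypothetical name.

```python
# Trees as nested tuples of one-letter species codes (assumed topologies).
FUNGI_EC = (("Y", "P"), "E")
COELOMATE = ("A", ((("H", "D"), "C"), FUNGI_EC))    # human grouped with fly
ECDYSOZOAN = ("A", (("H", ("D", "C")), FUNGI_EC))   # fly grouped with nematode


def leaves(tree):
    """Set of species codes under a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def dollo_scenario(tree, present):
    """Return (clade where the gene emerged, list of clades where it was lost)."""
    present = set(present)
    if not present & leaves(tree):
        raise ValueError("gene is absent from every species in the tree")
    # Descend to the last common ancestor of all species carrying the gene.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    losses = []

    def walk(n):
        if isinstance(n, str):
            return
        for child in n:
            if leaves(child) & present:
                walk(child)
            else:
                # One loss on the branch leading to a clade that lacks the gene.
                losses.append(frozenset(leaves(child)))

    walk(node)
    return frozenset(leaves(node)), losses


# Pattern ---HYPE (present in H, Y, P, E) under the two topologies.
origin_c, losses_c = dollo_scenario(COELOMATE, "HYPE")
origin_e, losses_e = dollo_scenario(ECDYSOZOAN, "HYPE")
```

Under the coelomate tree the fly and nematode absences count as two separate losses, whereas under the ecdysozoan tree they collapse into a single loss in the fly-nematode ancestor. Absence in the outgroup (A) is read as emergence after its divergence, so this rule cannot count losses on that branch.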
Functional profiles of genes lost in different eukaryotic lineages
| Functional category | Lost genes (KOGs) | | | | | | | | | |
| | Hs* | Dm* | Coelomates/Ecdysozoa | Ce* | Animals | Sc | Sp | Yeasts | Ec | Fungi-Ec |
| Total | 162/114 | 520/369 | 37/188 | 541/751 | 193 | 299 | 202 | 55 | 1,969 | 802 |
| RNA processing and modification | 2/3 | 9/8 | 1/2 | 10/11 | 4 | 15 | 7 | 1 | 88 | 32 |
| Translation | 3/3 | 16/11 | 0/5 | 13/10 | 9 | 9 | 6 | 3 | 122 | 10 |
| Transcription | 5/2 | 16/12 | 0/4 | 29/33 | 2 | 16 | 9 | 4 | 83 | 40 |
| Replication and repair | 4/5 | 28/14 | 1/15 | 29/14 | 2 | 9 | 7 | 3 | 60 | 16 |
| Chromatin structure and dynamics | 1/1 | 8/6 | 0/2 | 8/6 | 0 | 5 | 3 | 1 | 29 | 11 |
| Energy production and conversion | 7/10 | 9/10 | 5/4 | 12/10 | 7 | 6 | 13 | 1 | 110 | 37 |
| Cell cycle control and mitosis | 3/3 | 11/6 | 0/5 | 15/11 | 3 | 12 | 3 | 1 | 61 | 16 |
| Amino acid metabolism and transport | 5/6 | 16/9 | 1/8 | 15/7 | 38 | 6 | 9 | 0 | 110 | 18 |
| Nucleotide metabolism and transport | 3/3 | 6/3 | 0/3 | 8/5 | 5 | 0 | 3 | 1 | 38 | 9 |
| Carbohydrate metabolism and transport | 3/3 | 13/10 | 1/4 | 18/14 | 8 | 10 | 16 | 3 | 70 | 41 |
| Coenzyme metabolism | 0/2 | 5/5 | 2/2 | 14/12 | 11 | 1 | 1 | 0 | 51 | 12 |
| Lipid metabolism | 1/5 | 27/19 | 4/12 | 18/6 | 4 | 9 | 19 | 2 | 74 | 33 |
| Membrane and cell wall structure and biogenesis | 5/4 | 10/10 | 2/2 | 9/11 | 7 | 5 | 3 | 0 | 37 | 15 |
| Post-translational modification, protein turnover, chaperone functions | 3/5 | 22/15 | 2/9 | 44/40 | 8 | 29 | 21 | 4 | 167 | 69 |
| Inorganic ion transport and metabolism | 2/4 | 8/8 | 2/2 | 8/7 | 9 | 2 | 6 | 4 | 50 | 14 |
| Secondary metabolites biosynthesis, transport and catabolism | 1/2 | 6/5 | 1/2 | 5/3 | 2 | 4 | 1 | 0 | 23 | 5 |
| Signal transduction | 5/3 | 32/22 | 0/10 | 30/37 | 4 | 16 | 7 | 3 | 110 | 52 |
| Intracellular trafficking and secretion | 4/3 | 10/8 | 0/2 | 14/14 | 3 | 5 | 11 | 0 | 116 | 22 |
| Nuclear structure | 0/0 | 3/3 | 0/0 | 5/6 | 0 | 1 | 0 | 0 | 16 | 5 |
| Cytoskeleton | 0/0 | 2/2 | 0/0 | 6/8 | 0 | 9 | 0 | 3 | 44 | 6 |
| General functional prediction only (typically, prediction of biochemical activity) | 14/13 | 79/55 | 5/29 | 88/72 | 30 | 55 | 24 | 11 | 241 | 134 |
| Function unknown | 91/34 | 184/128 | 10/66 | 143/414 | 37 | 75 | 33 | 10 | 269 | 205 |
*For each of the animals, the numerator indicates the number of genes lost under the coelomate topology of the species tree and the denominator indicates the number of genes lost under the ecdysozoan topology of the tree.
Groups of functionally linked genes co-eliminated during evolution of different eukaryotic lineages
| Functional group/complex | Lost KOGs* | | | | | | |
| | Hs | Dm | Ce | Coelomates/Ecdysozoa | Animals | Yeasts | Fungi-Ec |
| Mitochondrial ribosomal proteins | 3331, 3435/ 3331, 3435 | 3505, 4600, 4612/ None | 3505, 4122, 4600, 4612/ 4122 | None/ 3505, 4600, 4612 | 0899, 0938, 1740, 3254, 3278, 4844 | 0408, 1686, 1708, 4707 | |
| Spliceosome, including putative associated proteins | 1847, 1960/ 1847 | 1902, 1960, 2991, 3414 | None/ 1960 | 0105, 0107, 0117, 1365, 1588, 1676, 1847, 1996, 2191, 2242, 2548, 2991, 4207, 4211 | |||
| Exosome | 1004, 1613 | ||||||
| Replication origin-recognition complex | 2228, 2538, 4557 | 4557 | |||||
| Mismatch repair system | 0218, 0220, 0221, 1977 | 0218, 1977, 4120 | None/ 0218, 1977 | ||||
| Ubiquitin system/ proteasome-signalosome components | 0170, 0428, 1814, 4116, 4185, 4412 | 0168, 0170, 0320, 0421, 0423, 1364, 1571, 1645, 1871, 1873, 1887, 2561, 2932, 3061, 3250, 3268, 4146, 4159, 4275, 4412, 4413, 4414, 4692, 4761 | None/ 0170, 4412 | 0823, 1645, 1734 | 0311, 0423, 0427, 0827, 0895, 1100, 1139, 1464, 1571, 1812, 1887, 2561, 2932, 3011, 3050, 3268, 4185, 4248, 4265, 4275, 4413, 4414, 4427, 4642, 4692, 4761 | ||
| NADH-ubiquinone oxido-reductase/ NADH dehydro-genase | 2865, 2870, 3256, 3300, 3365, 3382, 3389, 3426, 3446, 3456, 3458, 3466, 3468, 4009, 4662, 4668, 4669, 4770, 4845 | ||||||
*For each of the animals, the numerator indicates the KOGs lost under the coelomate topology of the species tree, and the denominator indicates KOGs lost under the ecdysozoan topology.
Figure 6. Correspondence between eukaryotic and prokaryotic orthologous gene sets. (a) Representation of prokaryotic counterparts in different subsets of KOGs. CGA, crown group ancestor; non-CGA, KOGs not represented in the crown group ancestor; MSP, metazoa-specific KOGs. (b) Evidence of ancient duplications of eukaryotic genes revealed by the comparison of KOGs against COGs. The connections between KOGs and COGs detected by using RPS-BLAST (see text) were analyzed by single-linkage clustering.
Figure 7. Gene dispensability in yeast and worm and phyletic patterns of the respective KOGs. (a) Distribution of essential and non-essential genes among different size classes of KOGs and LSEs in the yeast Saccharomyces cerevisiae. (b) Distribution of essential and non-essential genes among different size classes of KOGs and LSEs in the nematode C. elegans. The number of species in the KOGs and LSEs is color-coded as indicated to the right of each plot.
Domain accretion in complex eukaryotes
| | Hsa | Dme | Ath | Cel | Sce | Spo | Ecu |
| Hsa | - | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ |
| Dme | 470 / 3214 / 805 | - | 2 × 10⁻¹ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ |
| Ath | 327 / 2224 / 530 | 354 / 2085 / 403 | - | 3 × 10⁻¹ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ |
| Cel | 347 / 2986 / 880 | 428 / 2962 / 650 | 334 / 2052 / 376 | - | 1 × 10⁻⁸ | <1 × 10⁻¹⁰ | <1 × 10⁻¹⁰ |
| Sce | 149 / 1789 / 504 | 161 / 1704 / 411 | 183 / 1769 / 374 | 197 / 1715 / 336 | - | 1 × 10⁻² | <1 × 10⁻¹⁰ |
| Spo | 100 / 1880 / 549 | 123 / 1807 / 426 | 135 / 1886 / 388 | 150 / 1808 / 359 | 158 / 2360 / 216 | - | <1 × 10⁻¹⁰ |
| Ecu | 10 / 700 / 332 | 17 / 738 / 254 | 12 / 739 / 235 | 14 / 748 / 244 | 13 / 816 / 158 | 19 / 835 / 140 | - |
For a given pair of species, each cell below the diagonal lists three numbers, in this order: the number of KOGs in which the average number of detected domains from the CDD collection (cut-off E = 10⁻³) is greater in the proteins from the row species (left) than in those from the column species (top); the number of KOGs with an equal average number of domains; and the number of KOGs in which the average number of domains is greater for the column species (for example, D. melanogaster has a greater number of detected domains than H. sapiens in 470 KOGs, the same number in 3,214 KOGs, and a smaller number in 805 KOGs). The numbers above the diagonal are the statistical significance of the difference, P(χ²).
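The legend does not say how P(χ²) was computed. As a worked illustration only, the sketch below assumes that the first and last of the three counts in a cell are compared against a 1:1 expectation, giving the statistic (a − c)² / (a + c), and that the P-value is taken from a χ² distribution with 2 degrees of freedom, for which the upper-tail probability is exactly exp(−x/2). Under these assumptions the four tabulated values that are not reported as bounds are recovered to one significant figure; this is consistent with the table but does not establish the authors' actual procedure. The function name is hypothetical.

```python
import math


def p_chi2_two_df(first, last):
    """P-value for the imbalance between two counts.

    Assumed test (not stated in the source): statistic (a - c)^2 / (a + c),
    referred to a chi-square distribution with 2 degrees of freedom,
    whose survival function is exp(-x / 2).
    """
    statistic = (first - last) ** 2 / (first + last)
    return math.exp(-statistic / 2.0)


# First and last counts of selected cells from the table (equal-count middle value omitted).
cells = {
    ("Ath", "Dme"): (354, 403),
    ("Cel", "Ath"): (334, 376),
    ("Sce", "Cel"): (197, 336),
    ("Spo", "Sce"): (158, 216),
    ("Dme", "Hsa"): (470, 805),
}

for (row, col), (first, last) in cells.items():
    print(f"{row} vs {col}: P = {p_chi2_two_df(first, last):.0e}")
```

The number of KOGs with equal domain counts does not enter this statistic, so a pair such as A. thaliana and D. melanogaster (354 against 403) gives a non-significant P-value even though more than 2,000 KOGs are tied.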