Feng Chen, Aaron J Mackey, Christian J Stoeckert, David S Roos.
Abstract
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. A total of 511,797 proteins (81.6% of the total dataset) were clustered into 70,388 ortholog groups. The ortholog database may be queried based on protein or group accession numbers, keyword descriptions or BLAST similarity. Ortholog groups exhibiting specific phyletic patterns may also be identified, using either a graphical interface or a text-based Phyletic Pattern Expression grammar. Information for ortholog groups includes the phyletic profile, the list of member proteins and a multiple sequence alignment, a statistical summary and graphical view of similarities, and a graphical representation of domain architecture. OrthoMCL software, the entire FASTA dataset employed and clustering results are available for download. OrthoMCL-DB provides a centralized warehouse for orthology prediction among multiple species, and will be updated and expanded as additional genome sequence data become available.
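The Markov clustering step named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the OrthoMCL implementation: the BLAST search and the inter-species normalization are not shown, the input is assumed to be a small symmetric matrix of already normalized similarity weights, and the function names, the inflation value and the iteration count are illustrative choices rather than values taken from the source.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                out[i][j] = m[i][j] / s
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster a symmetric similarity matrix by alternating expansion and inflation.

    `weights` is a hypothetical square list-of-lists of non-negative edge weights.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(weights)
    # Add self-loops so every column has a positive sum.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that still carry mass act as attractors; the columns they
    # attract form one cluster. Identical member sets are merged.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

Called on two fully connected triangles with no edges between them, `mcl` returns the two triangles as separate clusters. Higher inflation values generally yield finer clusters, which is why the parameter is exposed here.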
Year: 2006 PMID: 16381887 PMCID: PMC1347485 DOI: 10.1093/nar/gkj123
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. A phylogeny was constructed for 55 sequenced genomes based on orthologous gene content. See Table 1 for species abbreviations. The tree was drawn using Phylodendron.
Table 1. The 55 genomes included in OrthoMCL-DB, with clustering statistics
| Lineage | Abbreviation | Full name | Data source | Sequences | Clustered | Groups |
|---|---|---|---|---|---|---|
| Archaea | | | | | | |
| Euryarchaeota | hal | | GenBank | 2622 | 1878 | 1323 |
| | mja | Methanococcus jannaschii DSM 2661 | GenBank | 1786 | 1260 | 1054 |
| Crenarchaeota | sso | Sulfolobus solfataricus P2 | GenBank | 2977 | 2220 | 1357 |
| Nanoarchaeota | neq | Nanoarchaeum equitans Kin4-M | GenBank | 536 | 351 | 336 |
| Bacteria | | | | | | |
| Proteobacteria | wsu | Wolinella succinogenes DSM 1740 | GenBank | 2044 | 1617 | 1338 |
| | gsu | Geobacter sulfurreducens PCA | GenBank | 3446 | 2616 | 1987 |
| | atu | Agrobacterium tumefaciens C58 Uwash | GenBank | 5402 | 3826 | 2757 |
| | rso | Ralstonia solanacearum GMI1000 | GenBank | 5116 | 3856 | 2795 |
| | eco | Escherichia coli K12 | GenBank | 4242 | 3295 | 2536 |
| Aquifex | aae | Aquifex aeolicus VF5 | GenBank | 1560 | 1294 | 1165 |
| Thermotoga | tma | Thermotoga maritima MSB8 | GenBank | 1858 | 1473 | 1297 |
| Green nonsulfur | det | Dehalococcoides ethenogenes 195 | GenBank | 1580 | 1237 | 963 |
| Deinococci | dra | Deinococcus radiodurans R1 | GenBank | 3182 | 2249 | 1848 |
| Spirochetes | tpa | | GenBank | 1036 | 703 | 621 |
| Green sulfur | cte | Chlorobium tepidum TLS | GenBank | 2252 | 1554 | 1361 |
| Planctomyces/Pirella | rba | Rhodopirellula baltica SH_1 | GenBank | 7325 | 3624 | 2261 |
| Chlamydia | cpn | Chlamydophila pneumoniae CWL029 | GenBank | 1052 | 722 | 599 |
| Cyanobacteria | syn | | GenBank | 2517 | 1782 | 1526 |
| Actinobacteria | mtu | Mycobacterium tuberculosis H37Rv | GenBank | 3991 | 2963 | 1983 |
| Gram-positive | ban | Bacillus anthracis Ames | GenBank | 5311 | 3497 | 2361 |
| Eukaryota | | | | | | |
| Entamoeba | ehi | Entamoeba histolytica | TIGR | 9772 | 8149 | 2910 |
| Dictyostelium | ddi | Dictyostelium discoideum | dictyBase | 13 678 | 10 144 | 4974 |
| Plants/Algae | cme | Cyanidioschyzon merolae 10D | University of Tokyo | 5013 | 3802 | 3286 |
| | tps | Thalassiosira pseudonana | JGI | 11 397 | 7767 | 5211 |
| | ath | Arabidopsis thaliana | TIGR | 28 952 | 25 546 | 11 390 |
| | osa | Oryza sativa | TIGR | 88 149 | 78 731 | 18 933 |
| Apicomplexa | tgo | Toxoplasma gondii | ToxoDB | 7793 | 4522 | 3755 |
| | cpa | Cryptosporidium parvum Iowa | CryptoDB | 3396 | 3287 | 3222 |
| | cho | Cryptosporidium hominis TU502 | CryptoDB | 3886 | 3532 | 3427 |
| | pfa | Plasmodium falciparum 3D7 | PlasmoDB | 5363 | 5054 | 4371 |
| | pyo | Plasmodium yoelii 17XNL | PlasmoDB | 7850 | 6056 | 4252 |
| | pkn | Plasmodium knowlesi | PlasmoDB | 6890 | 4692 | 3878 |
| | the | Theileria parva | TIGR | 4035 | 3003 | 2455 |
| Fungi | sce | Saccharomyces cerevisiae S288C | SGD | 6702 | 5612 | 4633 |
| | spo | Schizosaccharomyces pombe | Sanger | 4984 | 4328 | 3726 |
| | yli | Yarrowia lipolytica CLIB99 | Genolevures | 6666 | 5549 | 4464 |
| | kla | Kluyveromyces lactis CLIB210 | Genolevures | 5331 | 4957 | 4592 |
| | dha | Debaryomyces hansenii CBS767 | Genolevures | 6896 | 5602 | 4581 |
| | cgl | Candida glabrata CBS138 | Genolevures | 5272 | 4947 | 4342 |
| | cne | Cryptococcus neoformans | TIGR | 5882 | 4743 | 3845 |
| | ago | Ashbya gossypii | AGD | 4726 | 4565 | 4335 |
| | ncr | Neurospora crassa OR74A | Whitehead | 10 617 | 6298 | 5102 |
| Microsporidium | ecu | Encephalitozoon cuniculi | GenBank | 1996 | 1348 | 1113 |
| Animals | cel | Caenorhabditis elegans | WORMBASE | 22 420 | 19 307 | 13 242 |
| | cbr | Caenorhabditis briggsae | Sanger | 19 334 | 16 948 | 13 227 |
| | dme | Drosophila melanogaster | Ensembl | 19 177 | 16 251 | 8640 |
| | aga | Anopheles gambiae | Ensembl | 15 802 | 12 645 | 8662 |
| | cin | Ciona intestinalis | Ensembl | 15 851 | 11 460 | 8140 |
| | fru | Fugu rubripes | Ensembl | 33 003 | 28 145 | 14 277 |
| | tni | Tetraodon nigroviridis | Ensembl | 28 005 | 18 707 | 13 861 |
| | dre | Danio rerio | Ensembl | 32 062 | 26 692 | 12 738 |
| | gga | Gallus gallus | Ensembl | 28 416 | 22 826 | 12 420 |
| | mmu | Mus musculus | Ensembl | 31 535 | 27 299 | 17 917 |
| | rno | Rattus norvegicus | Ensembl | 32 543 | 28 318 | 17 445 |
| | hsa | Homo sapiens | Ensembl | 33 869 | 28 948 | 16 586 |
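The Sequences, Clustered and Groups columns support simple per-genome ratios, sketched below for four rows copied from Table 1. The helper names are illustrative. Reading "Groups" as the number of ortholog groups containing at least one protein from that genome is an assumption, and the implied dataset size is only approximate because the 81.6% figure in the abstract is rounded.

```python
# (abbreviation, sequences, clustered, groups), copied from four rows of Table 1
ROWS = [
    ("rba", 7325, 3624, 2261),
    ("cpa", 3396, 3287, 3222),
    ("osa", 88149, 78731, 18933),
    ("hsa", 33869, 28948, 16586),
]


def clustered_fraction(sequences, clustered):
    """Fraction of a genome's proteins that were placed in some ortholog group."""
    return clustered / sequences


def proteins_per_group(clustered, groups):
    """Average number of this genome's proteins per group it appears in
    (assumes 'Groups' counts groups with at least one member from the genome)."""
    return clustered / groups


for abbrev, sequences, clustered, groups in ROWS:
    print(abbrev,
          round(100 * clustered_fraction(sequences, clustered), 1),
          round(proteins_per_group(clustered, groups), 2))

# Dataset-wide figures from the abstract: 511,797 clustered proteins are
# stated to be 81.6% of the total, in 70,388 groups.
implied_total = 511_797 / 0.816          # approximate, since 81.6% is rounded
mean_group_size = 511_797 / 70_388       # average proteins per ortholog group
```

A low clustered fraction (as for rba) points to many proteins with no detectable ortholog or recent paralog in the dataset, while a high proteins-per-group ratio (as for osa) is consistent with many within-species paralogs, though redundant gene models would produce the same signal.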
Figure 2. An OrthoMCL group is a cluster of sequences from multiple species predicted to be orthologous to each other. (A) Ortholog group summary information, including group size (# Sequences, # Taxa), BLAST statistics (% Match Pairs, Average E-value, Average % Coverage, Average % Identity) and the phyletic pattern profile for all species in the dataset is shown. Rows in the phyletic pattern profile table represent bacteria, archaea, single-cellular eukaryotes and multi-cellular eukaryotes (plants and animals); each box represents a single species, with black or white background denoting presence or absence in the ortholog group, and the number of protein sequences found in the ortholog group listed. Mouse-over expands abbreviations to provide the full species name. Links at top left access a tabular list of information for each member of the ortholog group (including links to the reference database), a graphical representation of Pfam domain architecture (B), a BioLayout graph of pairwise similarity scores (C), a MUSCLE multiple sequence alignment (D) and a sequence retrieval option. The example shown illustrates a ‘prolipoprotein diacylglyceryl transferase’, whose distribution is restricted to the bacteria.
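The phyletic pattern profile described in this caption amounts to a per-species count of group members, which can be sketched as follows. The `taxon|protein` identifier format, the function names and the example identifiers are assumptions made for illustration; the sketch does not reproduce the database's Phyletic Pattern Expression grammar. The species abbreviations are those listed in Table 1.

```python
from collections import Counter

# Species abbreviations from Table 1, grouped by domain.
BACTERIA = {"wsu", "gsu", "atu", "rso", "eco", "aae", "tma", "det",
            "dra", "tpa", "cte", "rba", "cpn", "syn", "mtu", "ban"}
ARCHAEA = {"hal", "mja", "sso", "neq"}


def phyletic_profile(members):
    """Count proteins per species for one ortholog group.

    `members` is a list of identifiers in a hypothetical 'taxon|protein' form.
    """
    return Counter(m.split("|", 1)[0] for m in members)


def restricted_to(profile, lineage):
    """True if every species present in the group belongs to `lineage`."""
    present = {species for species, count in profile.items() if count > 0}
    return bool(present) and present <= lineage


# Hypothetical group with members from three bacterial genomes.
group = ["eco|p1", "eco|p2", "mtu|p3", "ban|p4"]
profile = phyletic_profile(group)
print(dict(profile), restricted_to(profile, BACTERIA))
```

A group such as the bacteria-restricted example in the figure would pass `restricted_to(profile, BACTERIA)`, while any group with an archaeal or eukaryotic member would not.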