Brigitte Boeckmann, Marc Robinson-Rechavi, Ioannis Xenarios, Christophe Dessimoz.
Abstract
Phylogenomic databases provide orthology predictions for species with fully sequenced genomes. Although the goal seems well-defined, the content of these databases differs greatly. Seven ortholog databases (Ensembl Compara, eggNOG, HOGENOM, InParanoid, OMA, OrthoDB, Panther) were compared on the basis of reference trees. For three well-conserved protein families, we observed a generally high specificity of orthology assignments for these databases. We show that differences in the completeness of predicted gene relationships and in the phylogenetic information are, for the great majority, not due to the methods used, but to differences in the underlying database concepts. According to our metrics, none of the databases provides a fully correct and comprehensive protein classification. Our results provide a framework for meaningful and systematic comparisons of phylogenomic databases. In the future, a sustainable set of 'Gold standard' phylogenetic trees could provide a robust method for phylogenomic databases to assess their current quality status, measure changes following new database releases and diagnose improvements subsequent to an upgrade of the analysis procedure.
Year: 2011 PMID: 21737420 PMCID: PMC3178055 DOI: 10.1093/bib/bbr034
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Comparison of selected phylogenomic databases
| Database | Number of species and taxonomic range | Number of databases queried for input data | Homology detection and clustering | Multiple sequence alignment and tree-building | Grouping strategy | Goals include | Updates per year (estimated) |
|---|---|---|---|---|---|---|---|
| Compara (Ensembl 58) | 47 chordates and outgroups | 1 | BlastP Hcluster_sg | M-Coffee, TreeBeSt (NJtree, NJ and ML, species tree) | (i) Phylogenetic trees, (ii) Ortholog groups from species pairs | Gene phylogeny | 6 |
| eggNOG (release 2) | 630 species | 4 | Blast RBH triangular linkage clustering | Muscle, MAFFT and filters PhyML | Hierarchical groups based on up to six taxonomic levels | Comparative genomics, species phylogeny | 1 |
| HOGENOM (release 5) | 964 species | 12 | BlastP2 (low complexity filters) single-linkage clustering ≥50% similarity, ≥80% overlap | Muscle, Gblocks BioNJ, PhyML, FASTTREE and TREEFINDER | Phylogenetic trees | Gene phylogeny | 1 |
| InParanoid (release 7) | 99 eukaryotes and | 22 | BLAST (compositional adjustment, SEG) ≥50% overlap | Kalign NJ (100 replicates) | Ortholog groups from species pairs | Comparison of species pairs | 1 |
| OMA (May-2010) | 1000 species | 12 | Smith–Waterman with minimum length requirement | – | (i) Pure ortholog groups, (ii) Ortholog groups from species pairs and (iii) Hierarchical groups based on taxonomic nodes | Comparative genomics, phyletic profiles, species phylogeny | 2 |
| OrthoDB (release 3) | 40 vertebrates 23 arthropods 32 fungi | 8 | Smith–Waterman, RBH, triangular linkage clustering | – | Hierarchical groups based on a species phylogeny | Comparative genomics, species phylogeny | 1 |
| Panther (release 7) | 48 species | 13 | BlastP, HSP, single-linkage clusters (SLC) | MAFFT GIGA | (i) Phylogenetic trees and (ii) Ortholog groups from species pairs | Gene phylogeny, Function prediction | 1 |
Figure 1: Concepts of selected phylogenomic databases. Rows (from top to bottom) indicate the different database concepts, the structure of ortholog groups, the completeness of predicted gene relationships and the implied tree structures. The latter visualize the captured phylogenetic information.
Figure 2: Reference tree for the V-type ATPase β-subunit subfamily and corresponding ortholog predictions from seven phylogenomic databases. The different grouping strategies are clearly reflected: OMA, InParanoid and the unlabeled trees of HOGENOM occur as mutually exclusive groups, while all other databases possess hierarchical grouping strategies. Most orthology predictions coincide with those of the reference tree, but none of the phylogenomic databases is in full agreement with all of them: OMA groups are split into more groups than necessary, which results in fewer predicted gene relationships; InParanoid predicts the B2 subunit of Ornithorhynchus anatinus to be an ortholog of the human B1 subunit and lacks some of the arthropod orthologs; OrthoDB assigns corresponding 1:1 orthologs only for closely related species such as primates or rodents; eggNOG gives contradictory information on the B2 subunit of Xenopus tropicalis; the tree topology of Panther suggests lineage-specific duplications for the paralogs of X. tropicalis, Caenorhabditis elegans and C. briggsae; the tree of Compara includes an additional duplication event within the vertebrate B2 clade; HOGENOM differs from the reference tree only by the inversion of a speciation node (data not shown) and lacks one of the expected orthologs in the data set. Missing orthologs are also observed for OMA, InParanoid and Panther.
Explanation: the left block (headed ‘Ortholog hierarchies’) indicates the ortholog classification derived from the reference tree, with the largest homolog group given in the first column; different levels of orthologous hierarchies are shown as patterned cells in the right-hand columns. Corresponding groups defined by the phylogenomic databases are patterned accordingly, if relevant to the benchmarked ortholog classification. Triangle: gene duplication event. White cell: gene of a species that is not covered by the database. Plain gray cell: gene assigned to an unexpected ortholog group. Descending diagonal: expected gene that is missing from an ortholog group. Ascending diagonal: false positive prediction. Black horizontal bar: groups of the same hierarchical level within the same column. For OrthoDB the black bar also separates the three taxonomic sections of the database (VeRTebrate, ARThropods, FUNgi). For more details, see Supplementary Figure S3.
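As an illustration of how an ortholog classification can be read off a reference tree with marked duplication events, the following minimal Python sketch applies the standard tree-based definition: two genes are orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. The tree encoding, the function names and the toy gene labels are assumptions made for this example; they are not taken from the paper or its supplementary material.

```python
from itertools import combinations

# Assumed encoding: a leaf is a gene name (str); an internal node is a tuple
# (event, children) where event is "S" (speciation) or "D" (duplication).

def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [gene for child in children for gene in leaves(child)]

def pairwise_relations(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, children = node
    label = "ortholog" if event == "S" else "paralog"
    # Genes taken from two different child subtrees have this node as their LCA.
    for left, right in combinations([leaves(c) for c in children], 2):
        for x in left:
            for y in right:
                out[frozenset((x, y))] = label
    for child in children:
        pairwise_relations(child, out)
    return out

# Hypothetical toy tree: one duplication followed by a speciation in each copy.
toy_tree = ("D", [
    ("S", ["human_B1", "mouse_B1"]),
    ("S", ["human_B2", "mouse_B2"]),
])
relations = pairwise_relations(toy_tree)
```

In this toy tree, `human_B1` and `mouse_B1` are labeled orthologs, whereas every B1/B2 pair is labeled paralogous because the pair's last common ancestor is the duplication node.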
Benchmarking results based on three reference trees
The analyzed databases are OMA pure orthologous groups and pairwise groups, InParanoid, OrthoDB, eggNOG, Panther trees and HOGENOM. Databases with a hierarchical grouping concept are scored in two ways: based on the ortholog groups and based on the implied trees. For HOGENOM, the calculation is based on Robinson-Foulds distances. Columns: ‘Expected OTUs’: number of genes expected to be present in an ortholog group according to the species list of the phylogenomic database; ‘Mapped OTUs’: number of genes of the reference tree that are mapped to the ortholog groups; ‘Number groups’: number of groups relevant to the reference tree. Scores are calculated for the three types of gene relationships: orthology, orthology/paralogy and ‘extended’ gene relationships. The weighted average is shown at the bottom left of the table. For each column, the best achieved values are shaded dark gray, the second-best light gray. ‘Coverage’ indicates the weighted average of mapped genes, in percent. For each family, the number of genes and the number of relevant gene relationships are indicated within the gray header row.
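For readers unfamiliar with the Robinson-Foulds distance mentioned for HOGENOM, the sketch below computes the standard unnormalized version: the number of non-trivial leaf bipartitions found in one tree but not in the other. It assumes trees are given as nested tuples of leaf names over the same leaf set. How the paper converts such distances into table scores is not stated here, so this shows only the textbook metric, with illustrative function names and toy trees.

```python
def _clades(tree, acc):
    """Collect the leaf set below every internal node; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leafset = frozenset().union(*(_clades(child, acc) for child in tree))
    acc.add(leafset)
    return leafset

def splits(tree):
    """Return the non-trivial bipartitions of a tree, each in canonical form."""
    acc = set()
    all_leaves = _clades(tree, acc)
    ref = min(all_leaves)  # canonical side = the side NOT containing this leaf
    result = set()
    for clade in acc:
        side = clade if ref not in clade else all_leaves - clade
        # Skip trivial splits that separate fewer than two leaves from the rest.
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result

def robinson_foulds(t1, t2):
    """Unnormalized Robinson-Foulds distance between two trees."""
    if _clades(t1, set()) != _clades(t2, set()):
        raise ValueError("trees must have identical leaf sets")
    return len(splits(t1) ^ splits(t2))

# Hypothetical toy trees that differ in the placement of B and C.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(tree_a, tree_b)
```

Because bipartitions are stored relative to a fixed reference leaf, the result does not depend on where the nested-tuple representation happens to be rooted.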