Kimmen Sjölander, Ruchira S Datta, Yaoqing Shen, Grant M Shoffner.
Abstract
Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.
Year: 2011 PMID: 21712343 PMCID: PMC3178056 DOI: 10.1093/bib/bbr036
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Orthology and paralogy subtypes and the use of tree distances in PHOG. We present this toy example of gene family evolution to illustrate the main orthology subtypes and how the PHOG algorithm uses tree distances and topology jointly to infer orthologs. ‘Dup’ indicates a duplication event in the animal lineage, and ‘I’ represents a group of predicted inparalogs. Recall that super-orthology requires that all nodes on a path joining two sequences correspond to speciation events. The PHOG algorithm for super-orthology identification allows subtrees containing only members of a single species to be included in a PHOG super-orthology group; some of these will correspond to actual inparalogs while others will be multiple entries and/or isoforms of the same gene in protein sequence databases. The two boxed subtrees (PHOG-S 1 and PHOG-S 2) correspond to super-orthology groups by this definition, with PHOG-S 2 including a possible inparalogous subtree with human genes 2a, 2b and 2c. In contrast, the Schistosoma mansoni and yeast genes have no super-orthologs. Standard phylogenetic orthology prediction protocols consider only the tree topology, including the S. mansoni gene in an orthology group with the Gene 2 clade. However, PHOG uses both tree distance and topology to enhance orthology identification precision; since the tree distances between the S. mansoni gene and genes in PHOG-S 1 are smaller than those between it and genes in PHOG-S 2, it is excluded from PHOG-S 2. This toy example also illustrates the nontransitivity of the standard definition of orthology, which requires only that the most recent common ancestor of two genes correspond to a speciation event. By this definition, the yeast gene is orthologous to Mouse Gene 1 and Mouse Gene 2, and to Rat Gene 1 and Rat Gene 2 and to all of the other sequences in the tree. However, Mouse Gene 1 is clearly not orthologous to Rat Gene 2 (they are paralogs, since they are related by gene duplication).
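The two definitions in this caption can be sketched on a small labeled tree. The Python below is a minimal illustration, not the PHOG algorithm: the node names and topology are a hypothetical simplification of the toy example (a yeast outgroup plus one duplication giving two mouse/rat clades), and the tree-distance criterion that PHOG adds is not modeled.

```python
# Hypothetical toy gene tree: child -> parent, with an event label per
# internal node. Names and topology are illustrative, not read off Figure 1.
PARENT = {
    "yeast": "root",
    "animal_dup": "root",
    "gene1": "animal_dup",
    "gene2": "animal_dup",
    "mouse1": "gene1",
    "rat1": "gene1",
    "mouse2": "gene2",
    "rat2": "gene2",
}
EVENT = {
    "root": "speciation",
    "animal_dup": "duplication",
    "gene1": "speciation",
    "gene2": "speciation",
}


def ancestors(leaf):
    """Internal nodes from the leaf's parent up to the root, in order."""
    path = []
    node = leaf
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def path_nodes(a, b):
    """Internal nodes on the path joining leaves a and b; the MRCA is last."""
    up_a, up_b = ancestors(a), ancestors(b)
    mrca = next(n for n in up_a if n in up_b)
    return up_a[:up_a.index(mrca)] + up_b[:up_b.index(mrca)] + [mrca]


def is_ortholog(a, b):
    """Standard definition: the MRCA of a and b is a speciation node."""
    return EVENT[path_nodes(a, b)[-1]] == "speciation"


def is_super_ortholog(a, b):
    """Stricter definition: every node on the joining path is a speciation."""
    return all(EVENT[n] == "speciation" for n in path_nodes(a, b))
```

On this toy tree, `is_ortholog("yeast", "mouse1")` and `is_ortholog("yeast", "rat2")` both hold while `is_ortholog("mouse1", "rat2")` does not, which is the nontransitivity the caption describes. `is_super_ortholog("yeast", "mouse1")` fails because the joining path passes through the duplication node.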
Figure 2: Phylogenetic analysis of a human Lamin-B receptor (UniProt sequence Q14739). Orthologs selected by TreeFam in mouse and zebrafish (Danio rerio) are indicated with an asterisk. Sequence fragments are marked with a dagger. (A) Pfam domain architecture for Q14739. (B) Maximum likelihood (ML) tree of proteins sharing the same domain architecture identified using FlowerPower. (C) ML tree of proteins aligning to the N-terminal LBR_tudor domain; a subtree of the full tree is shown, restricted to the vertebrate lineage. Pfam domains found for the full-length amino acid sequences are displayed at right. (D) and (E) ML trees of sequences matching the C-terminal ERG4/ERG24 domain (restricted as in C to the vertebrate lineage) constructed using RAxML (D) and FastTree (E), respectively. Super-orthology groups are boxed with dashed lines; sequences within each super-orthology group have identical domain architectures and functions. In both D and E, the upper subtree contains the human Lamin-B receptor and orthologs; sequences in the lower subtree are missing the N-terminal LBR_tudor domain. Note that zebrafish protein A9ULT1 included by RAxML (albeit with low bootstrap support) was excluded by FastTree, allowing predicted super-orthologs in the lower subtree of E to expand to include the two Xenopus sequences. Homologs to Q14739 were retrieved using the PhyloBuilder webserver [25]; FlowerPower global–global homology clustering (i.e. requiring a common domain architecture) was used for the tree shown in B, and global–local mode was used for the domain phylogenies shown in C and D. Multiple sequence alignments for B–D were constructed with MAFFT [71], followed by masking columns with >70% gaps. Maximum likelihood trees were constructed using RAxML [64] with the JTT+Γ model and 20 discrete γ-rate categories, and for E using FastTree [72] with the same parameters. The statistical support of branches was evaluated by 100 bootstrap replicates. Trees were rooted using the mid-point method.
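The column-masking step mentioned in this caption (removing alignment columns with more than 70% gaps) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `mask_gappy_columns` is hypothetical, masking is assumed to mean dropping the column, and `-` and `.` are assumed to be the gap characters.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.7, gap_chars="-."):
    """Drop columns whose fraction of gap characters exceeds the threshold.

    `alignment` is a list of equal-length aligned sequences (strings).
    A column with exactly `max_gap_fraction` gaps is kept.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_rows = len(alignment)
    keep = [
        col
        for col in range(length)
        if sum(row[col] in gap_chars for row in alignment) / n_rows
        <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Usage: the third column is all gaps and is dropped; the second column
# is 50% gaps and is kept.
masked = mask_gappy_columns(["MK-A", "M--A", "M--C", "MR-A"])
```

Filtering by column rather than by sequence keeps every sequence in the alignment, so the leaf set of the resulting tree is unchanged.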