Abstract
With the increasing number of sequenced genomes and their comparisons, the detection of orthologs is crucial for reliable functional annotation and evolutionary analyses of genes and species. Yet, the dynamic remodeling of genome content through gain, loss, and transfer of genes, and through segmental and whole-genome duplication, hinders reliable orthology detection. Moreover, the lack of direct functional evidence and the questionable quality of some available genome sequences and annotations present additional difficulties in assessing orthology. This article reviews the existing computational methods and their potential accuracy in the high-throughput era of genome sequencing and anticipates open questions in terms of methodology, reliability, and computation. Appropriate taxon sampling, together with a combination of methods based on similarity, phylogeny, synteny, and evolutionary knowledge that may help detect speciation events, appears to be the most accurate strategy. This review also raises perspectives on the potential determination of orthology throughout the whole species phylogeny.
Keywords: HGT; evolutionary processes; genome annotation quality; genome trees; multidomains; phylogeny; synteny; taxon sampling
Year: 2016 PMID: 26966373 PMCID: PMC4778853 DOI: 10.4137/GEI.S37925
Source DB: PubMed Journal: Genomics Insights ISSN: 1178-6310
Figure 1. Homologs, paralogs, and orthologs.
Notes: This figure illustrates speciation and duplication events and their resulting consequences on gene terminology. The figure shows:
an intraspecies duplication of gene g giving rise to two genes g1 and g2 (note that g is no longer visible in species S);
a speciation event giving rise to two species A and B with the same gene content as S; in particular, g1 and g2 are denoted as g1a and g2a in A and g1b and g2b in B;
a duplication of g2b in B, assumed here, giving rise to g2b1 and g2b2 (note that g2b is no longer visible in B).
In this scheme and considering solely the last speciation event:
– g1 and g2 are homologs because they descend from g. Similarly, g1a and g1b are homologs because they descend from g1;
– g1 and g2 are in-paralogs, because they are duplicated in S;
– Similarly, g2b1 and g2b2 are in-paralogs because they are duplicated in B;
– g1a and g2a are out-paralogs because their ancestors are duplicated in S;
– Similarly, g1b and each of g2b1 and g2b2 are out-paralogs, because their ancestors are duplicated in S;
– g1a and g1b are orthologs because they are in distinct species A and B, respectively, with a common ancestor g1;
– g2a and g2b1 and g2a and g2b2 are orthologs because they are in distinct species A and B, respectively, with the same ancestor g2. g2b1 and g2b2 are also called co-orthologs to g2a.
Dashed arrows with different colors highlight pairs of orthologs, out-paralogs, and in-paralogs.
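The terminology above can be checked mechanically on the gene tree of Figure 1. The following is a minimal sketch, assuming the tree is encoded as child-to-parent links with an event label on each internal node; the names `PARENT`, `EVENT`, `ancestors`, and `classify` are introduced here for illustration and are not from the article. A pair of present-day genes is called orthologous when their last common ancestor is a speciation node, in-paralogous when it is a duplication that follows a speciation, and out-paralogous when it is a duplication that precedes the speciation.

```python
# Gene tree of Figure 1 as child -> parent links (illustrative encoding).
PARENT = {
    "g1": "g", "g2": "g",          # duplication of g in S
    "g1a": "g1", "g1b": "g1",      # speciation: g1 -> g1a (A), g1b (B)
    "g2a": "g2", "g2b": "g2",      # speciation: g2 -> g2a (A), g2b (B)
    "g2b1": "g2b", "g2b2": "g2b",  # duplication of g2b in B
}

# Event that took place at each internal node.
EVENT = {
    "g": "duplication",
    "g1": "speciation",
    "g2": "speciation",
    "g2b": "duplication",
}


def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def classify(x, y):
    """Classify two distinct present-day genes (leaves of the tree)."""
    seen = set(ancestors(y))
    # The last common ancestor is the first ancestor of x shared with y.
    lca = next(n for n in ancestors(x) if n in seen)
    if EVENT[lca] == "speciation":
        return "orthologs"
    # A duplication with a speciation above it happened after that speciation.
    older = ancestors(lca)[1:]
    after_speciation = any(EVENT[n] == "speciation" for n in older)
    return "in-paralogs" if after_speciation else "out-paralogs"


if __name__ == "__main__":
    for pair in [("g1a", "g1b"), ("g2a", "g2b1"), ("g2b1", "g2b2"), ("g1a", "g2a")]:
        print(pair, classify(*pair))
```

The sketch only handles leaves of the tree; the pair g1 and g2, which the caption describes within the ancestral species S, is outside its scope.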
Figure 2. Evolutionary processes.
Notes: This figure illustrates some significant evolutionary processes as revealed by large-scale comparative analyses of predicted proteomes: phylogeny, expansion, exchange, and reduction. Phylogeny is the direct descent from the ancestral genome to the present-day genome. Expansion (in red) includes gene duplication, segmental and whole-genome duplication, and genesis. Exchange (in blue) includes mainly HGT and introgression. Reduction is represented by gene loss. Rearrangements include inversions, translocations, fusions, and fissions.
Methods for orthology inference.
| METHOD | ALGORITHM |
|---|---|
| COG | Similarity—Single linkage clustering + Constraints |
| InParanoid/MultiParanoid | Similarity (pair-wise species)/Extends to multiple species |
| OrthoMCL | Similarity—MCL clustering algorithm |
| TribeMCL | Similarity—MCL clustering algorithm |
| eggNOG | Similarity—Detects false RBH due to gene fusion and protein domain shuffling |
| OrthoFocus | Similarity—extended RBH to handle many-to-one and many-to-many relationships |
| OrthoInspector | Similarity |
| SPO | Similarity (RBH)—Partition of orthologs includes Intra-species Partition and MCL clustering. |
| OrthoFinder | Similarity—Clustering |
| Roundup | Reciprocal Smallest Distance |
| RSD | Reciprocal Smallest Distance (evolutionary distance = estimated number of amino acid substitutions) |
| OMA | Similarity—Global sequence alignment |
| ME | Minimum Evolution Method |
| MSOAR | Similarity—Genome rearrangement—duplication |
| Orthostrapper | Phylogeny—bootstrap |
| RIO | Similarity (HMMER)—bootstrap—Phylogeny |
| PhIGs | Similarity—Multiple sequence alignments—Phylogenetic trees |
| PhyOP | Similarity (overlapping limits)—phylogeny based on dS (synonymous substitution rates) |
| TreeFam | Infers orthologs and paralogs from the phylogenetic tree |
| LOFT | Assigns hierarchical orthology numbers to genes based on a phylogenetic tree |
| EnsemblCompara GeneTrees | Clustering—multiple alignment—tree generation based on TreeBeST method |
| SYNERGY | Sequence similarity—species phylogeny—reconstruction of underlying gene evolutionary histories |
| PHOG | Precomputed phylogenetic trees followed by identification of orthologs as sequences from different species that are each other's reciprocal nearest neighbors |
| COCO-CL | Similarity—Correlation between sequences—single linkage clustering |
Note: This table shows some orthology inference methods with corresponding reference and a short description of their underlying algorithm.
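Several of the similarity-based entries in the table rely on reciprocal best hits (RBH). The following is a minimal sketch of the RBH criterion, assuming similarity scores between the proteins of two species have already been computed (for example from an all-against-all sequence search). The function names, gene identifiers, and scores are hypothetical and do not reproduce the implementation of any tool listed above.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps query -> {subject: score}. Ties are broken by
    dictionary order, which a real pipeline would handle explicitly.
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores for three genes of species A and two genes of species B.
A_VS_B = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b1": 85.0, "b2": 180.0},
    "a3": {"b2": 60.0},
}
B_VS_A = {
    "b1": {"a1": 248.0, "a2": 80.0},
    "b2": {"a1": 88.0, "a2": 181.0, "a3": 62.0},
}

if __name__ == "__main__":
    print(reciprocal_best_hits(A_VS_B, B_VS_A))
```

In this toy data, a3 finds b2 as its best hit, but b2 prefers a2, so a3 is left without an RBH partner. This one-to-one restriction is the reason some of the listed methods extend RBH to many-to-one and many-to-many relationships.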
Figure 3. Example of a misleading situation in orthology inference.
Notes: A species S is shown including a gene g that has been duplicated (Gd) into g1 and g2. A speciation event (Sp) gave rise to two species S1 and S2, followed by a duplication (Gd) solely in S2 of g1 (resulting in g1a and g1b) and of g2 (resulting in g2a and g2b). The neighboring genes g0 and g3 are conserved. If genes g1 in S1 and g2a and g2b in S2 are lost, most similarity and phylogenetic methods for orthology detection will erroneously assign orthology to g2, g1a, and g1b. These genes are not orthologous, because g2, g1a, and g1b do not result from the same ancestral gene after the speciation event. Conservation of their neighboring genes and synteny may help to infer the speciation and gene duplication events and therefore to conclude that these genes are not orthologous.
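The scenario can be replayed on a toy encoding. The following is a minimal sketch, assuming each gene copy is described by the ordered list of events (Gd or Sp) that separates it from the ancestral gene g; the encoding and the names `HISTORY`, `LOST`, `relation`, and `cross_species_relations` are illustrative choices, not the authors' method. Two copies are orthologs when their histories first diverge at the speciation, and paralogs when they first diverge at a duplication. Under the stated losses, the surviving copies are g2 in S1 and g1a and g1b in S2.

```python
# History of each gene copy as (event, branch) steps starting from gene g.
HISTORY = {
    ("S1", "g1"): (("Gd", "g1"), ("Sp", "S1")),
    ("S1", "g2"): (("Gd", "g2"), ("Sp", "S1")),
    ("S2", "g1a"): (("Gd", "g1"), ("Sp", "S2"), ("Gd", "a")),
    ("S2", "g1b"): (("Gd", "g1"), ("Sp", "S2"), ("Gd", "b")),
    ("S2", "g2a"): (("Gd", "g2"), ("Sp", "S2"), ("Gd", "a")),
    ("S2", "g2b"): (("Gd", "g2"), ("Sp", "S2"), ("Gd", "b")),
}

# Losses described in the caption of Figure 3.
LOST = frozenset({("S1", "g1"), ("S2", "g2a"), ("S2", "g2b")})


def relation(x, y):
    """Return the true relationship between two gene copies."""
    for step_x, step_y in zip(HISTORY[x], HISTORY[y]):
        if step_x != step_y:
            # The first differing step is the event that separated them.
            return "orthologs" if step_x[0] == "Sp" else "paralogs"
    raise ValueError("histories do not diverge")


def cross_species_relations(lost=frozenset()):
    """True relationship of every surviving S1/S2 gene pair."""
    kept = [gene for gene in HISTORY if gene not in lost]
    return {
        (x[1], y[1]): relation(x, y)
        for x in kept if x[0] == "S1"
        for y in kept if y[0] == "S2"
    }


if __name__ == "__main__":
    print(cross_species_relations())      # before any loss
    print(cross_species_relations(LOST))  # after the losses of Figure 3
```

After the losses, every remaining cross-species pair is paralogous, yet each gene has no better candidate partner in the other species. A method that only looks for the closest available match therefore has nothing to contradict a wrong orthology call.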
Figure 4. Assessment of members of orthologs in an SPO cluster by detecting motifs and their distribution.
Notes: Motifs in SPOs are illustrated with the example of SPO29.1, from the 12 mycobacterial species considered. This SPO contains proteins corresponding to mapA and mapB (methionine aminopeptidase). Column headings are as follows: (a) SpecCode_ProtID: species code (see coding conventions below) followed by the protein identifier; (b) Partition_RBH: partition of RBHs in pairwise proteome comparisons of the considered species, denoted Pl.r, where l is the number of proteins in the partition and r is an arbitrary index; (c) paralogs: paralogous class, where Pn.m is a partition of intraspecies RBHs and Cp.q is the cluster obtained by the mcl program (see Ref. 62 for more details on the coding scheme of Pn.m.Cp.q classes); and (d) motifs: distributions of motifs as obtained with the meme/mast programs. The distributions highlight motifs shared by all proteins (ancestral motifs: 3,6,2,4) and motifs shared by subsets of proteins. Checking the detailed description of paralogs allowed the last line (MYSM_MSMMEG5683) to be added, because only three members of the P10.11.C4.47 cluster were found by the RBH procedure.
| Code | Species | Code | Species |
|---|---|---|---|
| MYTU | | MYAV | |
| MYBO | | MYAP | |
| MYTC | | MYJL | |
| MYUL | | MYVA | |
| MYMA | | MYSM | |
| MYLE | | MYAB | |