Standardized benchmarking in the quest for orthologs.

Adrian M Altenhoff^1,2, Brigitte Boeckmann³, Salvador Capella-Gutierrez^4,5,6, Daniel A Dalquen⁷, Todd DeLuca⁸, Kristoffer Forslund⁹, Jaime Huerta-Cepas⁹, Benjamin Linard¹⁰, Cécile Pereira^11,12, Leszek P Pryszcz⁴, Fabian Schreiber¹³, Alan Sousa da Silva¹³, Damian Szklarczyk^14,15, Clément-Marie Train¹, Peer Bork^9,16,17, Odile Lecompte¹⁸, Christian von Mering^14,15, Ioannis Xenarios^3,19,20, Kimmen Sjölander²¹, Lars Juhl Jensen²², Maria J Martin¹³, Matthieu Muffato¹³, Toni Gabaldón^4,5,23, Suzanna E Lewis²⁴, Paul D Thomas²⁵, Erik Sonnhammer²⁶, Christophe Dessimoz^{7,20,27,28,29}.

Abstract

Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.

Year: 2016; PMID: 27043882; PMCID: PMC4827703; DOI: 10.1038/nmeth.3830

Source DB: PubMed; Journal: Nat Methods; ISSN: 1548-7091; Impact factor: 28.547

Main

Evolutionarily related genes (homologs) across different species are often divided into gene pairs that originated through speciation events (orthologs) and gene pairs that originated through duplication events (paralogs)[1]. This distinction is useful in a broad range of contexts, including phylogenetic tree inference, genome annotation, comparative genomics and gene function prediction[2,3,4]. Accordingly, dozens of methods[5] and resources[6,7,8] for orthology inference have been developed. Because the true evolutionary history of genes is typically unknown, assessing the performance of these orthology inference methods is not straightforward. Several indirect approaches have been proposed. Based on the notion that orthologs tend to be functionally more similar than paralogs (a notion now referred to as the ortholog conjecture[9,10,11,12]), Hulsen et al.[13] used several measures of functional conservation (coexpression levels, protein–protein interactions and protein domain conservation) to benchmark orthology inference methods. Chen et al.[14] proposed an unsupervised learning approach based on consensus among different orthology methods. Altenhoff and Dessimoz[15] introduced a phylogenetic benchmark measuring the concordance between gene trees reconstructed from putative orthologs and undisputed species trees. More recently, several 'gold standard' reference sets, either manually curated[16,17] or derived from trusted resources[18], have been used as benchmarks. Finally, Dalquen et al.[19] used simulated genomes to assess orthology inference in the presence of varying amounts of duplication, lateral gene transfer and sequencing artifacts. This wide array of benchmarking approaches poses considerable challenges to developers and users of orthology methods. Conceptually, the choice of an appropriate benchmark strongly depends on the application at hand. 
Practically, most methods are not available as stand-alone programs and thus cannot easily be compared on a common set of data. Likewise, some benchmarks rely on complex pipelines that may be difficult to implement. If public results are available as part of a publication or a resource, inconsistent genome releases or identifiers severely complicate comparisons. Some methods or benchmarks can also be computationally costly to run. As a result, users cannot easily identify appropriate tools, and methodological progress is hampered. Here, we report on a community effort to standardize and facilitate orthology benchmarking. For this effort, we established a shared reference data set and developed a web-based service for automatic orthology benchmarking (http://orthology.benchmarkservice.org). We then used these resources to run a community experiment to assess 15 well-established orthology inference methods and resources on a wide array of phylogenetic and functional benchmarks. By providing a way to automatically include new methods and disseminate results publicly, we hope to maintain an up-to-date and comprehensive assessment of state-of-the-art orthology tools.

Results

Here, we provide an overview of the benchmark service and orthology inference methods and then present benchmarking results in three categories: species discordance tests, reference gene trees and functional tests. The benchmark service alone required the evaluation of 70,390,701 orthologous relationships and the inference of 233,000 phylogenetic trees.

Benchmark service

To automate ortholog benchmarking on a broad range of tests (detailed below), we developed a publicly accessible web service (Fig. 1). Using this workflow, an orthology method developer first infers orthologs using the Quest for Orthologs (QfO) reference proteome data set. Orthology inference methods vary in the kind of output they provide (e.g., labeled gene trees and orthologous groups), but it is usually possible to reduce these to orthologous pairs, which thus constitute a natural 'common denominator' for benchmarking. The benchmark service accepts these pairwise ortholog predictions in OrthoXML[20] or tab-delimited format. As the OrthoXML format also supports InParanoid-style clusters and hierarchical orthologous groups, the service can automatically convert these to pairwise relationships.
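The reduction of grouped output to pairwise relationships can be sketched as follows. This is a minimal illustration, not the service's own converter: it assumes flat groups in which every cross-species pair is orthologous, and the function names, gene identifiers and the `species_of` mapping are hypothetical. Hierarchical orthologous groups would need their nested structure to be taken into account.

```python
from itertools import combinations


def groups_to_pairs(groups, species_of):
    """Expand flat orthologous groups into unordered cross-species gene pairs.

    Assumption: within a group, genes from different species are orthologs,
    and genes from the same species are treated as in-paralogs and skipped.
    """
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            if species_of[a] != species_of[b]:
                pairs.add((a, b))
    return pairs


def to_tab_delimited(pairs):
    """Render pairs as one tab-separated pair per line."""
    return "\n".join(f"{a}\t{b}" for a, b in sorted(pairs))


# Hypothetical identifiers, for illustration only.
species_of = {"HUMAN_A": "human", "MOUSE_A": "mouse", "MOUSE_A2": "mouse"}
groups = [{"HUMAN_A", "MOUSE_A", "MOUSE_A2"}]
pairs = groups_to_pairs(groups, species_of)
print(to_tab_delimited(pairs))
```

Sorting the members before pairing gives each unordered pair a single canonical orientation, so the same relationship is never emitted twice.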

Figure 1

The Orthology Benchmark service facilitates assessment and comparison of orthology inference methods.

Orthology method developers run their methods on a reference proteome set and submit the inferred orthologs to the service. The predictions are subjected to a battery of phylogenetic and functional tests, and the results are returned to the method developer, who can choose to disclose them publicly.

Next, the service ensures that only predictions among valid reference proteomes are provided (with scoring implicitly assuming that the uploaded inferences are complete). Benchmarks are then selected and run in parallel; some may take up to several hours. Finally, statistical analyses determine the method's performance on each benchmark data set. Where possible, performance is measured in terms of precision (i.e., positive predictive value: the proportion of ortholog predictions that are correct) and recall (i.e., sensitivity, or true positive rate: the proportion of actual orthologs that are correctly predicted). Raw data and results are stored and provided to the submitter, who can choose to make the results publicly available. In order to achieve transparency and encourage improvements, we have released source code under an open source license (Mozilla Public License Version 2.0) at https://github.com/qfo/benchmark-webservice (also Supplementary Software).
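As a minimal sketch of the two measures defined above, precision and recall can be computed from sets of unordered gene pairs. The identifiers are hypothetical, and the sketch assumes that the reference set is complete for the genes considered; the actual benchmarks estimate these quantities in benchmark-specific ways.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted pairs against reference pairs."""
    pred = normalize(predicted)
    ref = normalize(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 true pairs are found.
predicted = [("a", "b"), ("c", "d"), ("e", "f")]
reference = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
precision, recall = precision_recall(predicted, reference)
```

Because every reference pair missing from a submission lowers recall, this also illustrates why incomplete uploads are penalized.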

Methods investigated

We investigated a broad array of well-established methods, including three tree-based methods: Ensembl Compara[21], PANTHER 8.0 (ref. 22) and PhylomeDB[23]; seven graph-based methods (i.e., based on pairwise comparisons): Best Reciprocal Hits[24], Reciprocal Smallest Distance (RSD)[25], EggNOG[26], Hieranoid[27], InParanoid[28], OMA[29] and OrthoInspector[30]; and a meta-method incorporating both tree- and graph-based methods, MetaPhOrs[31]. For some methods, multiple variants are included in the analysis (Online Methods). Each method inferred orthologs on the 754,149 protein sequences from 66 reference genomes except for MetaPhOrs, which inferred orthologs on all but three prokaryotes (Online Methods).

Generalized species tree discordance test

Orthology was first defined in the context of species tree inference, which requires genes related through speciation[1]. The species tree discordance test exploits this relationship by assessing the accuracy of orthologs in terms of the accuracy of the species tree that can be reconstructed from them[15]. The original protocol was limited to species tree 'comb' topology (a specific type of tree in which all bifurcations occur along a single path) and a small number of taxa (up to six). Here we overcome these two limitations by generalizing the orthology sampling procedure to any tree topology and employing larger reference trees from the SwissTree initiative. Furthermore, to minimize the possibility of gene–species tree discordance due to incomplete lineage sorting, we avoided sampling orthologs among species separated by branches shorter than 10 million years (myr) (Online Methods and Supplementary Fig. 1). We observed different trade-offs between average discordance (Robinson–Foulds[32] distance, as a proxy for the false discovery rate, the complement of precision) and the number of trees that can be sampled (proxy for recall) across all methods (Fig. 2). An ideal method would be placed in the lower right corner of Figure 2. When considering eukaryotes, results with the highest precision and lowest recall were obtained with OMA groups. At the other extreme, PANTHER 8.0 (all) tended to yield the highest recall and lowest precision results. Among the more balanced methods, no method consistently obtained a better balance than the other methods across all data sets, but OrthoInspector, InParanoid and PANTHER (LDO only) performed well overall. In terms of broad categories, there is no obvious systematic difference in performance between tree-based (Ensembl, PANTHER and PhylomeDB) and graph-based methods (the rest) or between methods relying on a species tree (Ensembl, PANTHER, PhylomeDB, OMA GETHOGs, Hieranoid and EggNOG) and methods that do not. 
The latter point is perhaps unexpected, as one could expect knowledge of the species tree to provide an 'unfair' advantage in this particular benchmark. If there is any such effect, our results indicate that it is small.
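The Robinson–Foulds distance used as the discordance measure counts the bipartitions (splits) present in one tree but not in the other. The sketch below is a plain illustration, not the benchmark's implementation. It assumes trees written as nested tuples of leaf names, identical leaf sets in both trees, and no normalization of the resulting count.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`, appending every subtree's leaf set to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Collect the non-trivial splits of the leaf set induced by internal branches."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    splits = set()
    for clade in clades:
        other = all_leaves - clade
        # Skip trivial splits that isolate a single leaf (or nothing).
        if len(clade) > 1 and len(other) > 1:
            splits.add(frozenset([clade, other]))
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-taxon trees: the second swaps the positions of B and C.
reference_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(reference_tree, gene_tree)
```

Storing each split as an unordered pair of leaf sets makes the comparison independent of where the trees happen to be rooted.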

Figure 2

The Generalized Species Tree Discordance test assesses the congruence of inferred orthologs with a trusted reference tree.

Benchmarking results are shown for eukaryotes. A trade-off between precision (measured in terms of tree error in the y-axis) and recall (measured in terms of completed tree samples in the x-axis; Online Methods) can be observed. Only high-confidence branches of the reference tree (L90, Online Methods), at least 10 myr long, are considered. Error bars indicate 95% confidence intervals and the line indicates the 'Pareto frontier'.

These trends persisted when we measured recall in terms of the number of inferred orthologs (Supplementary Fig. 2) or when we focused on other clades (Supplementary Figs. 3, 4, 5). Among vertebrates, the results were largely consistent, but we noted minor differences in the ranking of individual methods, with InParanoid Core yielding the highest precision and MetaPhOrs the highest recall (Supplementary Fig. 3). We also benchmarked the methods for their ability to recover ortholog relationships among 'universal' genes by applying the species discordance test on a tree spanning across archaea, bacteria and eukaryotes. Once again, there were slight variations in the precise ranking of methods, but the overall trends were very similar to what was observed for eukaryotes only (Supplementary Fig. 5). Finally, if we included (high-confidence) short branches as well, the average concordance of reconstructed trees substantially decreased, both because short branches tend to be harder to infer and because of potential incomplete lineage sorting around them; however, the relative position of the methods remained practically unchanged, which was a further indication of the robustness of the benchmark (Supplementary Fig. 6).

Reference gene trees

The second series of orthology benchmarks employs evolutionary relationships of gene pairs derived from annotated high-quality gene trees. Such reference trees are inferred through a careful combination of computational inference and expert curation: results obtained at each step of the tree inference pipeline (homolog identification, alignment, tree inference and gene–species tree reconciliation) are individually inspected, poor-quality sequences are excluded from the analysis and results are typically assessed using multiple models. This manual oversight is expected to yield gene phylogenies with high statistical support and topological consistency. Concordance of orthology predictions was assessed with two sets of trees. The first was SwissTree[16,33], a small collection of large- and high-confidence gene family phylogenies with different types of challenges for orthology prediction and species from all domains. The second, TreeFam-A[34], consisted of a larger set of metazoan gene trees and thus covered a taxonomically restricted but wider range of protein families. Results obtained from the two benchmarks were quite similar (Fig. 3). On these benchmarks, virtually no trade-off between precision and recall appeared to be necessary. The best-performing methods were the ones that adopted a balanced precision–recall strategy, with MetaPhOrs doing particularly well. Methods with a more skewed precision–recall strategy (in particular, stringent OMA groups and permissive PANTHER (all)) fared poorly in comparison. This may be due in part to the nature of the reference gene tree data set, which focuses on gene families with a tractable evolutionary history. On ambiguous phylogenies, mistakes would become unavoidable and a skewed strategy could become preferable, depending on the application.

Figure 3

Benchmark results using sets of reference gene trees.

Evolutionary gene relationships are predicted for the QfO reference proteomes by 15 different methods. From the results, pairs of orthologous relationships are determined for each method and compared to those obtained from the reference gene trees of (a) SwissTree and (b) TreeFam-A. Error bars indicate 95% confidence intervals.

Functional benchmarks

The third series of benchmarks evaluated orthology in terms of functional similarity. Although orthology is an evolutionary and not a functional relationship, we chose to include functional benchmarks for two reasons. First, for similar levels of sequence divergence, orthologs have been shown to be moderately (but significantly) more conserved than paralogs in terms of Gene Ontology (GO) annotation similarity[11]. For a given evolutionary distance, more accurate orthology inference is thus likely to be correlated with more functionally similar gene pairs. Second, many users are interested in using orthologs to identify functionally conserved genes; for this purpose, functional benchmarks are directly relevant. We assessed functional similarity based on experimentally backed annotations from the UniProt–Gene Ontology Annotation (GOA) database[35] and Enzyme Commission (EC) numbers from the ENZYME database[36]. Though the two benchmarks consider different aspects of gene function, the results were largely consistent. In both cases, orthology inference methods showed a clear trade-off between precision (measured as the average Schlicker semantic similarity[37] of functional annotations associated with orthologs) and recall (measured as the number of ortholog relationships predicted; Fig. 4). The only exception was with the EC number benchmark, where MetaPhOrs falls beneath the 'Pareto frontier' (the frontier defined by the methods that are not outperformed by any other method in both precision and recall). However, MetaPhOrs is also the only method with missing taxa, and the three missing taxa contain a substantial number of genes with EC annotations (827 in total). This lack of EC annotations has a negative effect on the recall.
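The 'Pareto frontier' defined above can be computed directly from each method's (precision, recall) point. The following sketch follows that definition literally: a method is dropped only if some other method is strictly better in both dimensions. The method names and values are hypothetical and are not results from this study.

```python
def pareto_frontier(points):
    """Return the names of methods not outperformed in both precision and recall.

    `points` maps a method name to a (precision, recall) tuple.
    """
    frontier = []
    for name, (precision, recall) in points.items():
        dominated = any(
            other_precision > precision and other_recall > recall
            for other, (other_precision, other_recall) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical values: precision as a similarity score, recall as a pair count.
points = {
    "method_a": (0.9, 100),
    "method_b": (0.8, 200),
    "method_c": (0.7, 150),  # worse than method_b on both axes
    "method_d": (0.6, 300),
}
frontier = pareto_frontier(points)
```

Here `method_c` falls beneath the frontier because `method_b` has both higher precision and higher recall, whereas the other three each win on at least one axis.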

Figure 4

Benchmarks of functional similarity between inferred orthologous gene pairs.

Two different types of functional annotations are used: (a) experimentally supported GO annotations and (b) Enzyme Commission (EC) numbers. Error bars indicate 95% confidence intervals.

Discussion

The Orthology Benchmark service overcomes many of the practical complications previously associated with orthology benchmarking. It enables systematic comparison of a new method with state-of-the-art approaches on a wide range of benchmarks. It replaces current practice, which typically includes fewer methods, fewer tests and less empirical data. Previous benchmarking efforts required painstaking and error-prone mapping of proteins between different sources, releases and choice of alternative splicing variants. In contrast, by relying on a common set of data for all methods, the benchmark service ensures that the results obtained by different methods are directly comparable. The only caveat is that, since proteomes vary in quality and analytical difficulty, the results on the benchmark data set may not entirely reflect the quality of the orthology assignments otherwise provided by each resource. The choice of species included in the QfO reference proteomes (Online Methods) requires a compromise between (i) increasing the number of proteomes to make the benchmark set more representative of current resources and (ii) keeping the number of proteomes low to facilitate and encourage new submissions to the benchmark. Submissions performed on a subset of the proteomes are discouraged, as all missing predictions are counted as false negatives. This provides an incentive for submitters to analyze the entire reference proteome data set. We considered alternative ways of handling submissions on partial data, but these approaches had major flaws. For example, one alternative was to extrapolate scores obtained on the subset of proteomes considered in a particular submission to all data. However, this approach could introduce a bias in the analyses (e.g., some methods only predict orthologs for 'easy' pairs of proteomes). 
Another alternative was to restrict comparisons to the intersection of proteomes analyzed by all methods. However, this approach results in an excessive waste of information, as the intersection can only decrease with each additional method. Overall, results obtained across multiple phylogenetic and functional tests corroborated previous observations that the main difference among the established orthology inference methods lies in the trade-off they produce in terms of precision and recall[13,15,17]. However, this trade-off was not present in the reference gene tree test, perhaps because sequences with ambiguous location are typically excluded from these hand-curated trees. On these reference trees, the meta-method MetaPhOrs performed particularly well. The analysis also confirmed that the widely used reciprocal best hit approach has a relatively high precision but a relatively low recall[38,39]. Other methods fill different niches, with OMA group and PANTHER (all) often lying at the two extremes of the precision–recall trade-off. Among the more balanced approaches, InParanoid, Hieranoid and OrthoInspector showed solid performance in most benchmarks. The decision of whether to favor a skewed or a balanced approach to the precision–recall trade-off strongly depends on the application. For instance, hypothesis-generating analyses may favor a high recall, while phylogenomic species tree inference typically requires high precision. Because of this, we refrained from computing a combined score, which would necessarily entail a statement of preference with respect to this trade-off. To be deemed competitive, a method should ideally reach or exceed the Pareto frontier in at least a subset of the benchmarks. If it does not, the benchmark service may help uncover bugs or deeper flaws. Analogous to unit testing in software engineering, benchmarking can also provide quality control for new releases of established resources. 
In the course of the present community benchmarking effort, over a hundred sets of predictions were submitted to the service. Many submitters did not make their results publicly available, presumably after discovering poor outcome in some of the benchmarks. This clearly demonstrates the effectiveness of the benchmark service for quality control. The bane of benchmarking is circularity. Despite our best efforts, not all circularity could be avoided. Some methods used knowledge of the species tree in their inference; however, this potentially unfair advantage produced a negligible difference in performance for these methods. More generally, many methods were trained or fine-tuned using some of the benchmarks considered here. For instance, parameters of the meta-method MetaPhOrs were in part trained using TreeFam-A[31]. Similarly, the latest versions of InParanoid[28] and PhylomeDB[23] used the benchmark service for parameter fine-tuning. As for the functional benchmarks, although GO annotations derived from sequence comparisons were excluded, experiments are often guided by sequence similarity to proteins with known function. Thus, even when restricting analyses to experimentally backed GO annotations, we cannot avoid circularity entirely. However, because the benchmarks are collectively underpinned by a large amount of data from a broad range of species (tens of thousands of trees and hundreds of thousands of pairs of functional annotations), the risk of overfitting seems low, and this potential risk will be monitored by the QfO benchmarking working group. New benchmarks may be introduced over time to detect and discourage overfitting. Presently, the benchmark service uses orthologous gene pairs as 'common denominators' among all the methods. However, many resources provide richer outputs—such as reconciled gene trees or hierarchical orthologous groups—and may indeed be optimized for these. 
The performance on pairwise data is thus not entirely representative of what the data offer. In the future, however, the benchmark service could be extended to evaluate these richer, more specific orthology formats as well. Similarly, the benchmark service could also be extended to take into account confidence scores or posterior probabilities, which are particularly relevant to likelihood-based orthology inference methods[40,41].

Methods

Quest for orthologs reference proteomes and species tree.

The QfO consortium has defined a consensus data set of proteomes and common file formats[6,7] to be used by diverse orthology inference methods, allowing for standardized benchmarks and aiding integration of multiple ortholog sources. The QfO Reference Proteomes data sets were created as a collection of data providing a representative protein for each gene in the genome of selected species. Such data sets have been generated annually from the UniProt Knowledgebase (UniProtKB) database[42] for the past five years. To this end, a gene-centric pipeline has been developed and enhanced over these years by UniProt. The QfO Reference Proteomes are a manually compiled subset of the UniProt reference proteomes, comprising well-annotated model organisms and organisms of interest for biomedical research and phylogeny, with the intention to provide broad coverage of the tree of life. These complete, nonredundant reference proteomes are publicly available at ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO. The data sets are provided either in SeqXML[20] format or as a collection of FASTA files. The benchmarking effort reported here uses the reference proteomes data set released in 2011, which comprises 754,149 nonredundant protein sequences from 66 species (40 eukaryotes and 26 bacteria–archaea). The reference species tree used in this study was produced by the QfO species tree working group, which surveyed the literature to establish a well-supported tree topology for the 66 species[43] (Supplementary Fig. 1). The internal nodes of this reference species tree have assigned confidence levels based on the agreement among the resources surveyed (L90: congruent, significant branch support; L70: congruent; L50: one alternative species tree topology; L30: default level; L10: two or more alternative species tree topologies have been reported; for more detail, see Boeckmann et al.[43]). 
The latest version of the tree can be retrieved from http://swisstree.vital-it.ch/species_tree. To minimize the chance of including cases of incomplete lineage sorting in the species tree discordance benchmark, we estimated the evolutionary times of all internal branches using the timetree resource[44] and collapsed branches that were shorter than 10 myr.
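The branch-collapsing step can be illustrated with a small recursive function. This is a sketch under stated assumptions rather than the pipeline used in the study: nodes are hypothetical `(label, branch_length_myr, children)` tuples, only internal branches are collapsed, and the length of a collapsed branch is added to its children so that root-to-leaf times are preserved. The branch lengths in the example are made up.

```python
def collapse_short_branches(node, min_length=10.0):
    """Collapse internal branches shorter than `min_length` into multifurcations.

    A node is (label, branch_length, children); leaves have an empty children list.
    """
    label, length, children = node
    new_children = []
    for child in children:
        c_label, c_length, c_children = collapse_short_branches(child, min_length)
        if c_children and c_length < min_length:
            # Internal branch too short: attach its children directly to this node,
            # adding the collapsed length so total path lengths stay the same.
            new_children.extend(
                (g_label, g_length + c_length, g_children)
                for g_label, g_length, g_children in c_children
            )
        else:
            new_children.append((c_label, c_length, c_children))
    return (label, length, new_children)


# Made-up tree: the internal branch leading to n1 is only 5 myr long.
tree = ("root", 0.0, [
    ("n1", 5.0, [("sp1", 6.0, []), ("sp2", 6.0, [])]),
    ("sp3", 90.0, []),
])
collapsed = collapse_short_branches(tree)
```

After collapsing, `sp1`, `sp2` and `sp3` all descend directly from the root, so no ortholog sample depends on resolving the short branch.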

Orthology databases and methods.

EggNOG[26] (http://eggnogdb.embl.de) is a database of Orthologous Groups (OGs) and functional annotation covering prokaryotic and eukaryotic species. Since version 4.1, the EggNOG method is also capable of producing fine-grained (for example, pairwise) orthology predictions based on the automated analysis of phylogenetic trees. For this study, the complete set of 66 reference proteomes was independently analyzed using the EggNOG pipeline, which involved 1) joining proteins into inparalogous groups from closely related species and 2) de novo reconstruction of 38,513 OGs by clustering the obtained inparalogous groups based on triangles of their reciprocal best hits[45]. Phylogenetic analysis and automated tree interpretation for each OG was subsequently performed using the workflow described in PhylomeDB22 as implemented in the ETE Toolkit v2.3 (ref. 46). The phylogenetic approach used included testing three aligners (MAFFT[47] v6.861b, Muscle[48] v3.8.31 and Clustal Omega[49] v1.2.1) and five evolutionary models (LG, WAG, JTT, VT and MtREV); applying alignment consensus and soft trimming techniques (M-Coffee[50] v10, trimAl[51] v1.3); and using maximum likelihood tree inference (PhyML[52] v3). This workflow is labeled as eggnog41 when using the ETE-build command and was applied in a per-OG basis. Pairwise orthology predictions were derived from each tree using the species overlap algorithm[53] after rooting trees to midpoint. The predictions were submitted to the benchmark service in July 2015. Ensembl Compara[21] uses a gene–species tree reconciliation pipeline. The predictions were run using the code released in version 81 of the Ensembl (July 2015). However, Treebest (the software used to build phylogenetic trees) had to be adapted to accept alignments of protein sequences. Treebest makes a consensus out of trees built with various phylogenetic methods and some of them required nucleotide sequences, which were not provided in the QfO data set. 
The list of maximum-likelihood models and distance methods (used for neighbor joining) was thus updated to: WAG, JTT and Dayhoff instead of WAG and HKY (for maximum likelihood), and JTT, Kimura and mixed amino acid models instead of dN, dS and mixed nucleotide models (for neighbor joining). The predictions were submitted to the benchmark service in June 2015. An older submission based on version 66 of the Ensembl code (June 2011) is also present on the benchmark service. Hieranoid[27] performs pairwise orthology analysis using InParanoid at each node in a guide (species) tree as it progresses from its leaves to the root. This concept reduces the total runtime complexity from a quadratic to a linear function of the number of species. We ran Hieranoid 2.0. Hieranoid outputs ortholog groups structured as species trees with orthologs at all levels, hence there can be many outparalogs within an ortholog group. The trees were therefore parsed to extract ortholog pairs only at the last common ancestor of two species, for all species pairs. The predictions were submitted to the benchmark service in April 2015. InParanoid[28] is a graph-based algorithm that aims to generate orthologous groups that include all inparalogs but no outparalogs between species pairs. Version 4.1 of the algorithm was run with default parameters. Two variants were tested in this study: the regular InParanoid output containing all predicted pairs of orthologs (labeled InParanoid in the plots) and a high-confidence set including only orthologs with InParanoid's maximum confidence score of 1.0 (labeled Inparanoid (core)). The predictions were submitted to the benchmark service in June 2011. MetaPhOrs[31] (Meta Phylogeny-based Orthologs) is a repository of orthologs and paralogs that were computed using phylogenetic trees available in several databases or computed from graph-based orthologous groups. 
For each orthology–paralogy prediction, MetaPhOrs (http://orthology.phylomedb.org/) provides two reliability scores: Evidence Level (informing about number of repositories from which prediction is retrieved) and Consistency Score (defining overall agreement of source databases about given prediction). MetaPhOrs does not include predictions for the three reference genomes Streptomyces coelicolor, Thermotoga maritima and Pyrococcus kodakaraensis (strain KOD1). The predictions were submitted to the benchmark service in February 2013. OMA[29] (Pairs, Groups, HOGs) is a publicly available resource (http://omabrowser.org/) that provides orthology predictions among thousands of proteomes from all domains of life. OMA uses evolutionary distance estimates from Smith–Waterman alignments to infer orthologs. A distinct feature among graph-based methods is the witness of nonorthology step in its pipeline, where cases of differential gene losses get detected. OMA provides three different groupings of orthologs: (i) the raw pairwise ortholog relationships form the OMA Pairs, a gene-centric view that lists all the orthologs for a given gene. (ii) OMA Groups, a very stringent type of grouping where all member proteins are orthologous to one another within a group. OMA Groups have been designed mainly for species tree inference purposes, as gene trees built from them should be congruent with the species tree. (iii) Lastly, we constructed hierarchical orthologous groups (OMA HOGs). These are nested groups that contain genes that descend from a single common ancestral gene within a given taxonomic range using the GETHOGs algorithm[54]. The predictions were submitted to the benchmark service in June 2011 (OMA pairs and groups) and in March 2013 (OMA HOGs). OrthoInspector[30] is a database of precomputed orthology and inparalogy relationships and a stand-alone package allowing large-scale predictions of orthology between thousands of proteomes (http://lbgi.fr/orthoinspector/). 
The resource has recently undergone a major new release, with improved speed and visualisation tools, but the inference algorithm is unchanged from the initial graph-based method described in Linard et al.[55]. The predictions were submitted to the benchmark service in June 2011. PANTHER 8.0[22] is based on version 8.0 of the PANTHER database (http://pantherdb.org), released in 2012 (the current version is 10.0, released in 2015). Family membership of each sequence is based on HMM scoring to the PANTHER 'library' of HMMs (at both the family and subfamily levels). Sequences were aligned with MAFFT[56] and the resulting alignment was used to construct phylogenetic trees with the GIGA program[57]. GIGA (version 1.1 was used for PANTHER version 8.0) uses a species tree to guide tree construction, and all nodes in the tree are labeled as speciation or gene duplication events; these labeled nodes are used to infer orthologs (pairs of genes with a speciation event as their common ancestor). PANTHER predicts two types of orthologs: least-diverged orthologs (LDO) and other orthologs (O). LDO pairs can be simplistically thought of as 'the same gene' in two different species. Formally, the two genes created by each gene duplication event in the tree are treated asymmetrically: the least diverged duplicate (the one with the shortest branch immediately following the duplication) remains in the same LDO group as its ancestor, while the other duplicate founds a new LDO group. The benchmarking was performed on either LDO only, or all orthologs (including both LDO and O). The predictions were submitted to the benchmark service in February 2013. PhylomeDB[23] (http://phylomedb.org/) is a publicly available repository of phylomes, i.e., the complete collection of phylogenies for all genes of a given species in a predefined evolutionary context. PhylomeDB is unique among other repositories in that it follows an approach that is both gene centric and genome wide. 
PhylomeDB uses its phylogenetic trees to infer orthology and paralogy relationships. For the Quest for Orthologs project, 42 phylomes were reconstructed using different combinations of the 66 species in the benchmark. A total of 458,108 phylogenetic trees were generated, which were later combined to provide orthology predictions for all proteins included in the benchmark. Briefly, each tree was scanned and only the partition of up to 30 sequences, including the seed protein, was kept. Then, evolutionary relationships were computed for those protein sequences based on a species overlap approach. Redundant predictions across the 42 phylomes were unified using the Consistency Score (CS) as implemented in MetaPhOrs (see above). Only those predictions having a Consistency Score greater than or equal to 0.5 across the whole data set were called orthologs. The predictions were submitted to the benchmark service in June 2013.

RBH[24] (reciprocal best hit) is a classic method that identifies, between every pair of species, the pairs of genes with the mutually highest alignment score. Here, we use reciprocal blastp hits as orthologs, with an E-value cutoff of 1e–2, and we keep all hits that score ≥99% of the highest score. The predictions were submitted to the benchmark service in January 2016.

RSD[25] infers orthology relationships by finding pairs of genes whose nearest gene, computed using PAML, is the other gene in the pair. Candidate genes are also filtered using BLAST E-value and multiple-sequence alignment divergence thresholds. This method is implemented in the database RoundUp[58], a large-scale orthology database developed by the Wall Lab. The database is no longer maintained, but the source code is still available at https://github.com/todddeluca/reciprocal_smallest_distance/. To identify orthologs, we ran the algorithm with divergence and E-value cutoffs of 0.8 and 1e–5, respectively. The predictions were submitted to the benchmark service in February 2012.
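As an illustration of the RBH criterion described above, the following minimal Python sketch selects reciprocal hits between two species from pre-parsed alignment results. It is not the pipeline used for the benchmark submission: the tuple layout `(query, target, score, evalue)`, the function names and the toy data are assumptions made for this example. Only the E-value cutoff of 1e–2 and the 99% score tolerance are taken from the text.

```python
from collections import defaultdict


def best_hits(hits, max_evalue=1e-2, tolerance=0.99):
    """Map each query to the targets scoring >= tolerance * its top score.

    `hits` is an iterable of (query, target, score, evalue) tuples, for
    example parsed from tabular blastp output (hypothetical input layout).
    """
    by_query = defaultdict(list)
    for query, target, score, evalue in hits:
        if evalue <= max_evalue:  # discard hits that fail the E-value cutoff
            by_query[query].append((target, score))
    best = {}
    for query, scored in by_query.items():
        top = max(score for _, score in scored)
        best[query] = {t for t, s in scored if s >= tolerance * top}
    return best


def reciprocal_best_hits(hits_ab, hits_ba, **kwargs):
    """Return the (a, b) pairs that are (near-)best hits in both directions."""
    best_ab = best_hits(hits_ab, **kwargs)
    best_ba = best_hits(hits_ba, **kwargs)
    return {
        (a, b)
        for a, targets in best_ab.items()
        for b in targets
        if a in best_ba.get(b, set())
    }


# Toy data: species A queried against species B, and the reverse search.
hits_ab = [
    ("a1", "b1", 200, 1e-50),
    ("a1", "b2", 199, 1e-49),  # within 99% of a1's top score, so kept
    ("a2", "b2", 150, 1e-30),
    ("a3", "b3", 40, 0.5),     # fails the E-value cutoff
]
hits_ba = [
    ("b1", "a1", 200, 1e-50),
    ("b2", "a1", 199, 1e-49),
    ("b2", "a2", 150, 1e-30),  # far below b2's top score, so not a best hit
    ("b3", "a3", 40, 0.5),
]
rbh_pairs = reciprocal_best_hits(hits_ab, hits_ba)
```

Because of the score tolerance, a gene can end up with more than one reciprocal partner (here `a1` pairs with both `b1` and `b2`), so the relation is not strictly one-to-one.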

Benchmarks.

Generalized species tree discordance. The idea behind the species tree discordance test is simple. Two genes are orthologous if they started diverging through a speciation event. Therefore, if we sample putative orthologous genes such that all resulting genes are related through speciation events, the resulting tree should be congruent with the species tree. Previously, we presented a sampling strategy for fully imbalanced tree topologies[15]. Here, we extend this idea to arbitrary reference trees, including those with soft polytomies (unresolved nodes).

The following procedure is repeated a large number of times. We start with a random gene in a random genome. We then attempt to sample a maximal path along the tree by selecting an orthologous gene in the 'next' species in the tour from the list of reported orthologs (Supplementary Fig. 7a). If there are multiple possibilities in the choice of the 'next' species due to soft polytomies, or in the choice of the orthologous counterparts due to one-to-many or many-to-many orthology, a choice is made at random. If there is no predicted ortholog at any step along the path, the sample is deemed unsuccessful. Otherwise, if at least one orthologous counterpart is predicted at each step, this results in a set of n sequences.

Assuming that (i) the reference tree is correct, (ii) the retrieved orthologs are all correct and (iii) all within-species variation is fixed (i.e., no incomplete lineage sorting), it is easy to prove that the unrooted evolutionary tree relating these sequences should only contain speciation nodes and should therefore be congruent with the reference species tree. Proof: The n sequences sampled through the circular tour are sampled by starting from a random sequence and retrieving n − 1 pairs of orthologs.
By construction, these n − 1 pairs of orthologs belong to pairs of species that have distinct last common ancestors and thus coalesce in different speciation nodes in the phylogenetic tree of these sequences. Therefore, that tree contains at least n − 1 distinct speciation nodes. However, the rooted, fully resolved evolutionary tree of n species has exactly n − 1 internal nodes. Thus, all the internal nodes of the gene tree are speciation nodes. Since we assume that there is no incomplete lineage sorting, as long as the input orthologs are correct, the tree relating these sequences should be congruent with the species tree.

A least-squares distance tree is reconstructed for each set of putative orthologous sequences. After aligning the sequences with MAFFT[47], maximum likelihood distances and their variances (using the inverse Fisher information) are estimated for each pair of sequences using the EstimatePam() function in the Darwin programming environment[59]. Next, the gene tree is estimated using Darwin's MinSquareTree() function, which is a fast implementation of the weighted least-squares trees[60] constrained to non-negative branch lengths[61]. We have previously shown that orthology benchmarking results obtained with such distance trees are consistent with those obtained with more computationally demanding maximum likelihood trees[15]. The Robinson–Foulds[32] distance between this gene tree and the reference tree measures the false discovery rate, while the total number of trees is used as a proxy of recall. Due to the stochastic nature of the algorithm, repeated runs of the benchmark may lead to slightly (albeit nonsignificantly) different results.

Reference gene trees. Reference gene trees labeled with speciation and duplication events were downloaded from SwissTree on March 23, 2015 (http://swisstree.vital-it.ch/) and TreeFam-A version 7 (http://www.treefam.org/).
As sequences analyzed in these two resources can differ from those of the QfO reference proteomes, sequences were mapped based on gene identifiers or sequence identifiers. After mapping, for each family the $n(n-1)/2$ induced pairwise evolutionary relationships were extracted and compared with the orthology predictions from each method as follows. Let $G = \{g_i\}$ be the set of all genes in the reference gene tree and $R_{+} = \{(g_i, g_j) \mid g_i \in G,\ g_j \in G,\ g_i \neq g_j,\ \mathrm{label}(g_i, g_j) = \text{speciation}\}$ the set of true orthologs according to the reference tree. Likewise, let $R_{-}$ be the set of nonorthologous relations in that family and $P = \{(g_i, g_j)\}$ the set of all predictions made by the orthology method. With $P_G = \{(g_i, g_j) \mid (g_i, g_j) \in P \wedge g_i \in G \wedge g_j \in G\}$, we denote the set of predicted orthologs where both members are part of the reference gene family. Now, the true/false positives/negatives are simply $TP = P_G \cap R_{+}$, $FP = P_G \cap R_{-}$, $FN = R_{+} \setminus P_G$ and $TN = R_{-} \setminus P_G$. From these values we can compute the positive predictive value (PPV) and the true positive rate (TPR): $PPV = |TP| / (|TP| + |FP|)$ and $TPR = |TP| / (|TP| + |FN|)$. We can further estimate the uncertainties of these rates by treating them as binomially distributed random variables, for example, $\sigma^2(PPV) = PPV\,(1 - PPV) / (|TP| + |FP|)$. Finally, we combine all the families by averaging the rates. As an example, for the positive predictive value this results in $\overline{PPV} = \frac{1}{N} \sum_{f=1}^{N} PPV_f$, where $PPV_f$ is the value obtained for family $f$ and $N$ is the number of families.

Functional tests. We downloaded the Gene Ontology annotations[62] for all the genes in the reference genomes from the November 2014 release of UniProt-GOA[35] and excluded any annotation with a 'NOT' qualifier from this set. For the analysis shown here, we only use annotations with experimental evidence codes (EXP, IPI, IDA, IMP, IGI and IEP). Likewise, we collected the hierarchical EC number assignments of the ENZYME database[36] maintained by Swiss-Prot.
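The per-family rates defined above for the reference gene tree benchmark can be sketched in Python as follows. This is a hedged illustration, not the benchmark service's code: the function names, the representation of gene pairs as `frozenset` objects and the toy families are assumptions, as is the premise that every pair of genes within a family is labeled either orthologous or nonorthologous (so that false positives are the in-family predictions that are not true orthologs). The variance of TPR is written by analogy with the one given for PPV.

```python
def pair(a, b):
    """Unordered gene pair (hypothetical helper)."""
    return frozenset((a, b))


def family_rates(genes, true_orthologs, predicted):
    """PPV and TPR with binomial variances for one reference gene family."""
    # Keep only predictions whose two members both belong to the family.
    in_family = {p for p in predicted if p <= genes}
    tp = len(in_family & true_orthologs)
    fp = len(in_family - true_orthologs)  # in-family pairs that are not orthologs
    fn = len(true_orthologs - in_family)
    nan = float("nan")
    ppv = tp / (tp + fp) if tp + fp else nan
    tpr = tp / (tp + fn) if tp + fn else nan
    return {
        "PPV": ppv,
        "TPR": tpr,
        "var_PPV": ppv * (1 - ppv) / (tp + fp) if tp + fp else nan,
        "var_TPR": tpr * (1 - tpr) / (tp + fn) if tp + fn else nan,
    }


def average_rate(per_family, key):
    """Unweighted mean of one rate over all families."""
    return sum(r[key] for r in per_family) / len(per_family)


# Toy family 1: a, b and c are mutual orthologs; d is a paralog of all three.
family1 = family_rates(
    genes={"a", "b", "c", "d"},
    true_orthologs={pair("a", "b"), pair("a", "c"), pair("b", "c")},
    predicted={pair("a", "b"), pair("a", "c"), pair("a", "d"), pair("a", "x")},
)
# Toy family 2: a single orthologous pair, correctly predicted.
family2 = family_rates(
    genes={"e", "f"},
    true_orthologs={pair("e", "f")},
    predicted={pair("e", "f")},
)
mean_ppv = average_rate([family1, family2], "PPV")
```

In the first toy family, the prediction involving `x` is ignored because `x` is not part of the reference family, which mirrors the restriction to pairs whose members are both in the reference gene tree.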
The computation of the functional similarities between gene pairs is done in the same way for both types of data, using the approach of Schlicker et al.[37]: the semantic similarity between annotations sim(i,j) is measured using Lin's metric[63]; between any two genes, the most similar pairs of annotations are identified and their similarities averaged over the sets $p_i$ of function annotations associated with each protein $i$.
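To make the functional similarity computation more concrete, here is a minimal Python sketch on a toy ontology. It rests on several assumptions that are not stated in the text: Lin's metric is computed from the most informative common ancestor, term probabilities are supplied directly in a dictionary, and the gene-level score is the mean of the two directional best-match averages. The exact combination used in the benchmark may differ, and all names and data below are hypothetical.

```python
from math import log


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin_similarity(t1, t2, parents, prob):
    """Lin's metric: 2 * IC(most informative common ancestor) / (IC(t1) + IC(t2))."""
    def ic(term):
        return -log(prob[term])  # information content of a term

    common = ancestors(t1, parents) & ancestors(t2, parents)
    denominator = ic(t1) + ic(t2)
    if not common or denominator == 0:
        return 0.0  # no shared ancestor, or both terms are uninformative
    return 2 * max(ic(c) for c in common) / denominator


def gene_similarity(p1, p2, parents, prob):
    """Mean of the two directional best-match averages (assumed combination)."""
    def best_match_average(source, target):
        return sum(
            max(lin_similarity(i, j, parents, prob) for j in target)
            for i in source
        ) / len(source)

    return 0.5 * (best_match_average(p1, p2) + best_match_average(p2, p1))


# Toy ontology: root -> A, B and A -> A1, A2, with assumed term probabilities.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
prob = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}

sibling_sim = lin_similarity("A1", "A2", parents, prob)
pair_sim = gene_similarity({"A1"}, {"A2", "B"}, parents, prob)
```

In this toy case the sibling terms `A1` and `A2` have a Lin similarity of 0.5, and the unrelated annotation `B` lowers the gene-level score, since it has no informative ancestor in common with `A1`.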

Code availability.

The source code is available under an open source license (Mozilla Public License Version 2.0) at https://github.com/qfo/benchmark-webservice.

58 in total

1. The ENZYME database in 2000.

Authors: A Bairoch
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Darwin v. 2.0: an interpreted computer language for the biosciences.

Authors: G H Gonnet; M T Hallett; C Korostensky; L Bernardin
Journal: Bioinformatics Date: 2000-02 Impact factor: 6.937

3. The use of gene clusters to infer functional coupling.

Authors: R Overbeek; M Fonstein; M D'Souza; G D Pusch; N Maltsev
Journal: Proc Natl Acad Sci U S A Date: 1999-03-16 Impact factor: 11.205

4. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

Review 5. Orthologs, paralogs, and evolutionary genomics.

Authors: Eugene V Koonin
Journal: Annu Rev Genet Date: 2005 Impact factor: 16.830

6. Detecting putative orthologs.

Authors: D P Wall; H B Fraser; A E Hirsh
Journal: Bioinformatics Date: 2003-09-01 Impact factor: 6.937

7. Distinguishing homologous from analogous proteins.

Authors: W M Fitch
Journal: Syst Zool Date: 1970-06

8. Benchmarking ortholog identification methods using functional genomics data.

Authors: Tim Hulsen; Martijn A Huynen; Jacob de Vlieg; Peter M A Groenen
Journal: Genome Biol Date: 2006-04-13 Impact factor: 13.583

9. A new measure for functional similarity of gene products based on Gene Ontology.

Authors: Andreas Schlicker; Francisco S Domingues; Jörg Rahnenführer; Thomas Lengauer
Journal: BMC Bioinformatics Date: 2006-06-15 Impact factor: 3.169

10. M-Coffee: combining multiple sequence alignment methods with T-Coffee.

Authors: Iain M Wallace; Orla O'Sullivan; Desmond G Higgins; Cedric Notredame
Journal: Nucleic Acids Res Date: 2006-03-23 Impact factor: 16.971
