
Roundup 2.0: enabling comparative genomics for over 1800 genomes.

Todd F DeLuca, Jike Cui, Jae-Yoon Jung, Kristian Che St Gabriel, Dennis P Wall.

Abstract

Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available. Availability: http://roundup.hms.harvard.edu.


Year: 2012 PMID: 22247275 PMCID: PMC3289913 DOI: 10.1093/bioinformatics/bts006

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Orthologs are genes from different organisms that descend from a single ancestral gene in the most recent common ancestor (Fitch, 1970). In comparative genomics, they are used to infer the function of novel genes from the function of well-studied ones, to construct phylogenies and explore the evolution of genes and species, and to study sequence conservation and change. They are also valuable in analyzing gene networks, studying gene gain and loss, and finding genes in model organisms that correspond to human disease genes (Altenhoff and Dessimoz, 2009; Gabaldon ; Kristensen ). Advances in high-throughput genomic sequencing have made it possible to produce many datasets in a relatively short time. For example, from 2006 to 2011, the number of complete proteomes listed in UniProtKB (Magrane and Consortium, 2011), a repository of annotated protein sequences, increased from around 300 to over 2500. To overcome the engineering challenges of computing and publishing orthologs for such a large number of genomes, we redesigned the comparative genomics tool Roundup (DeLuca ) to scale with the rate of genome sequencing and to enable increasingly sophisticated comparative genomics analyses. Roundup 2.0 contains orthology data for over 1800 genomes, making it one of the most diverse collections among comparable orthology databases (Chen ; Datta ; Huerta-Cepas ; Kristensen ; Li ; Linard ; Ostlund ; Rouard ; Schneider ; Tatusov ). Roundup compares well to other major databases, with recent studies showing similar ortholog composition for model organisms (Altenhoff and Dessimoz, 2009; Chen ). The data in Roundup include clusters of orthologs for a wide range of sequence conservation, allowing searches for distant orthologs, and also phylogenetic profiles that enable functional investigation, phylogenetic analysis and prediction of network organization (Cui ).

2 ALGORITHMS

We used the reciprocal smallest distance (RSD) (Wall ) algorithm to infer orthologs. RSD improves the sensitivity of reciprocal best BLAST hits by considering global alignment and maximum likelihood evolutionary distance between sequences. As a pairwise orthology algorithm, RSD scales quadratically with the number of genomes in Roundup. Altenhoff and Dessimoz (2009) assessed 10 ortholog inference projects and methods, confirming the reliable performance of RSD over a wide array of genomes from the tree of life. For Roundup 2.0, we changed RSD to improve its speed, stability and ortholog inference. We replaced WU-BLAST (W.Gish, personal communication) with NCBI BLAST (Altschul ). We also replaced ClustalW (Thompson ) with Kalign (Lassmann and Sonnhammer, 2005). Kalign is faster than ClustalW and produces better alignments for more distantly related sequences. This change resulted in 9% closer maximum likelihood distances between orthologs computed using PAML 4.0 (Yang, 2007), and 0.3% more orthologs on average. Since the Roundup database stores orthologs for 12 combinations of divergence and E-value thresholds, RSD was modified to compute orthologs for any number of parameter combinations as quickly as for one parameter combination. This change should be of interest to researchers investigating the effect of different parameter settings and degree of global sequence similarity on ortholog inference. With the addition of other caching and file I/O changes, RSD is over six times faster than the previous version in our performance tests. In addition to housing the orthologs inferred by RSD, Roundup builds clusters of orthologous genes, i.e. ortholog groups, using deterministic single-linkage clustering. It partitions a graph into connected subgraphs by creating a cluster for every gene and then merging two clusters if a gene in one of the clusters is orthologous to a gene in the other one. The result is that every gene in a group is orthologous to at least one other gene in the group and to no genes in any other groups. In contrast to other orthology databases (Chen ; Schneider ; Tatusov ), Roundup orthologous groups are built on the fly using genomes selected by the user. This allows users to include exactly their genomes of interest and to explore the effects of including different genomes on the grouping of orthologs.
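
The single-linkage step described above can be illustrated with a short sketch. This is not the Roundup source code: it is a minimal union-find implementation, assuming orthologs arrive as pairs of gene identifiers. The function name `single_linkage_clusters` and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(ortholog_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group genes into connected components of the ortholog graph.

    Hypothetical helper: every gene starts in its own cluster, and two
    clusters are merged whenever a gene in one is orthologous to a gene
    in the other.
    """
    parent: Dict[str, str] = {}

    def find(gene: str) -> str:
        # Follow parent links to the cluster representative,
        # shortening the path along the way (path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in ortholog_pairs:
        # Create a singleton cluster the first time a gene is seen.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = defaultdict(list)
    for gene in parent:
        clusters[find(gene)].append(gene)

    # Sorting makes the output deterministic regardless of input order.
    return sorted(sorted(members) for members in clusters.values())
```

For example, the pairs `("hs:G1", "mm:G1")`, `("mm:G1", "dr:G1")` and `("hs:G2", "mm:G2")` yield two groups: `hs:G1` and `dr:G1` share a group even though they are not directly paired, because both are linked to `mm:G1`. Restricting the input pairs to a user-selected set of genomes before clustering mirrors the on-the-fly grouping described in the text.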

3 GENOMES AND ORTHOLOGS

The 1807 genomes in Roundup 2.0 are from UniProtKB (Magrane and Consortium, 2011) and include 226 Eukaryota, 1447 Bacteria, 113 Archaea, and 21 Viruses and Viroids. Computing the orthologs required approximately 63 CPU core-years, which took several weeks on our research computing cluster. Roundup used a fault-tolerant computational pipeline to compute orthologs for all 1 631 721 pairs of genomes across 12 parameter combinations selected to allow researchers access to results for a broad range of divergence and E-value threshold settings. As a result, there are over 11 billion orthologs available in Roundup. The genomes and orthologs are updated 2–4 times per year.
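
The genome and pair counts above follow from simple arithmetic, and the quadratic growth in pairs is what makes the computation expensive. The following sketch only re-derives the figures stated in the text; the variable names are illustrative.

```python
from math import comb

# Genome counts per domain, as stated in the text.
genome_counts = {
    "Eukaryota": 226,
    "Bacteria": 1447,
    "Archaea": 113,
    "Viruses and Viroids": 21,
}

n_genomes = sum(genome_counts.values())  # 1807
n_pairs = comb(n_genomes, 2)             # n * (n - 1) / 2 = 1 631 721

# Orthologs are stored for 12 divergence / E-value combinations,
# so each genome pair contributes 12 result sets.
n_parameter_combinations = 12
n_result_sets = n_pairs * n_parameter_combinations

print(n_genomes, n_pairs, n_result_sets)
```

Because the number of pairs grows as n(n - 1)/2, doubling the number of genomes roughly quadruples the number of pairwise comparisons.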

4 WEB INTERFACE

The Roundup website provides two ways to search for orthologs. First, the Browse query is a genome-centric search that retrieves all orthologs between one genome and a set of other genomes. Results can be filtered by gene name or gene identifier. To aid users in finding gene identifiers, a FASTA sequence may be used to retrieve a gene ID. The second query, Retrieve, returns all orthologs for all pairs of genomes in a set of genomes the user specifies. Query results are then clustered into groups of orthologous genes as described above. All genes in the groups are linked to UniProt and annotated with available gene names and GO Process terms provided by UniProtKB and Gene Ontology (Ashburner ). FASTA sequences for genes in orthologous groups are also provided for further analysis. In addition to the standard view of search results, there are summaries by GO Terms and by Gene Clusters. The orthologous groups may be downloaded in several formats: NEXUS, PHYLIP, OrthoXML, Phylogenetic Profile and Text. OrthoXML (Schmitt ) is provided to support interoperability with other orthology databases and the quest for orthologs (Gabaldon ; Kuzniar ). Query results are cached for up to 30 days and may be retrieved by using the initial URL. To support research and replication, we make available for download from the website: FASTA sequences for genomes; orthologs in OrthoXML and text formats; and code for RSD and Roundup. Orthologs are also available through an HTTP-based API. Roundup 2.0 is an important step toward keeping pace with the rate of genome sequencing. The features and flexibility of Roundup 2.0, coupled with the wide coverage of genomes, enable increasingly large-scale comparative genomics analyses that address key questions in phylogeny, genome evolution and systems biology.
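
To make the idea of a phylogenetic profile concrete, the sketch below derives a presence/absence vector for each orthologous group. The layout of Roundup's Phylogenetic Profile download is not described here, so this only illustrates the concept: genes are assumed to be `(genome, gene_id)` tuples, and the function name and example identifiers are hypothetical.

```python
from typing import List, Sequence, Tuple

Gene = Tuple[str, str]  # (genome name, gene identifier); an assumed representation


def phylogenetic_profiles(
    groups: Sequence[Sequence[Gene]], genomes: Sequence[str]
) -> List[List[int]]:
    """Return one 0/1 row per orthologous group and one column per genome.

    A 1 means the group contains at least one gene from that genome.
    """
    profiles = []
    for group in groups:
        present = {genome for genome, _gene_id in group}
        profiles.append([1 if genome in present else 0 for genome in genomes])
    return profiles


# Hypothetical groups over three genomes A, B and C.
example_groups = [
    [("A", "a1"), ("B", "b1")],
    [("B", "b2"), ("C", "c1"), ("C", "c2")],
]
example_profiles = phylogenetic_profiles(example_groups, ["A", "B", "C"])
print(example_profiles)  # [[1, 1, 0], [0, 1, 1]]
```

Groups with similar rows across many genomes are candidates for functional association, which is the kind of analysis such profiles are meant to support.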

REFERENCES (25 in total; first 10 listed)

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. Detecting putative orthologs.

Authors: D P Wall; H B Fraser; A E Hirsh
Journal: Bioinformatics Date: 2003-09-01 Impact factor: 6.937

4. Roundup: a multi-genome repository of orthologs and evolutionary distances.

Authors: Todd F Deluca; I-Hsien Wu; Jian Pu; Thomas Monaghan; Leonid Peshkin; Saurav Singh; Dennis P Wall
Journal: Bioinformatics Date: 2006-06-15 Impact factor: 6.937

5. Distinguishing homologous from analogous proteins.

Authors: W M Fitch
Journal: Syst Zool Date: 1970-06

6. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

7. Phylogenetically informed logic relationships improve detection of biological network organization.

Authors: Jike Cui; Todd F DeLuca; Jae-Yoon Jung; Dennis P Wall
Journal: BMC Bioinformatics Date: 2011-12-15 Impact factor: 3.169

8. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups.

Authors: Feng Chen; Aaron J Mackey; Christian J Stoeckert; David S Roos
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. Kalign--an accurate and fast multiple sequence alignment algorithm.

Authors: Timo Lassmann; Erik L L Sonnhammer
Journal: BMC Bioinformatics Date: 2005-12-12 Impact factor: 3.169

10. TreeFam: a curated database of phylogenetic trees of animal gene families.

Authors: Heng Li; Avril Coghlan; Jue Ruan; Lachlan James Coin; Jean-Karim Hériché; Lara Osmotherly; Ruiqiang Li; Tao Liu; Zhang Zhang; Lars Bolund; Gane Ka-Shu Wong; Weimou Zheng; Paramvir Dehal; Jun Wang; Richard Durbin
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
