InParanoid 6: eukaryotic ortholog clusters with inparalogs.

Ann-Charlotte Berglund, Erik Sjölund, Gabriel Ostlund, Erik L L Sonnhammer.

Abstract

The InParanoid eukaryotic ortholog database (http://InParanoid.sbc.su.se/) has been updated to version 6 and is now based on 35 species. We collected all available 'complete' eukaryotic proteomes and Escherichia coli, and calculated ortholog groups for all 595 species pairs using the InParanoid program. This resulted in 2 642 187 pairwise ortholog groups in total. The orthology-based species relations are presented in an orthophylogram. InParanoid clusters contain one or more orthologs from each of the two species. Multiple orthologs in the same species, i.e. inparalogs, result from gene duplications after the species divergence. A new InParanoid website has been developed which is optimized for speed both for users and for updating the system. The XML output format has been improved for efficient processing of the InParanoid ortholog clusters.

Year: 2007 PMID: 18055500 PMCID: PMC2238924 DOI: 10.1093/nar/gkm1020

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Many analyses in comparative genomics depend on correct mapping of orthologs between species. Orthologs are defined as genes in different species deriving from a single gene in the last common ancestor (1), and are therefore likely to have the same function. If an ortholog undergoes duplication in one species, the copies are referred to as inparalogs (2). Inparalogs are by definition co-orthologs to one or more orthologs in another species. In contrast, two genes deriving from a duplication that predated the speciation event between the species are referred to as outparalogs. The InParanoid program was developed to identify clusters of inparalogs while avoiding inclusion of outparalogs.

InParanoid is one of the first comprehensive ortholog databases (3,4), but nowadays more than 15 different ortholog databases exist (5). A reason for the multitude of ortholog databases is that different research questions have different needs. For instance, the COGs database (6) contains very large clusters of orthologs that often contain outparalogs (7). At the other extreme, the Homologene database (8) often places inparalogs in different clusters. For some applications, one extreme or the other may be appropriate. However, the average user is normally interested in simply finding all orthologs in species Y to a gene in species X, including all inparalogs but excluding outparalogs. InParanoid was developed to optimally serve this type of user. Two benchmarks have recently been published that try to objectively assess the quality of different ortholog databases (9,10). In both these tests, which look either at accuracy of functional annotation or at inferred accuracy, InParanoid was top ranked. This suggests that InParanoid is successful at balancing the false-negative and false-positive rates, and is appropriate as a general-purpose orthology tool.

The InParanoid program has been upgraded to version 2.0. This release contains a number of fixes to minor bugs that could lead to incorrect cluster merging or bootstrap values. These problems were, however, rare.

We here present InParanoid 6, comprising 34 eukaryotic species and one prokaryotic outgroup. The website has been completely reconstructed and has new front- and back-ends, yet looks very similar to the old site. The new design makes it much faster for the user, and allows easier updating of the system. With the new back-end, we can handle much larger datasets in the future without performance problems.
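The inparalog/outparalog distinction defined in this section comes down to whether a duplication happened after or before the speciation event. The following is a minimal sketch of that rule only, not of the InParanoid algorithm; the function name and the ages used are hypothetical illustrations.

```python
def classify_paralogs(duplication_mya, speciation_mya):
    """Classify two paralogous genes relative to a speciation event.

    Both arguments are ages in million years ago (hypothetical inputs).
    A duplication younger than the speciation yields inparalogs;
    a duplication that predates the speciation yields outparalogs.
    """
    if duplication_mya < speciation_mya:
        return "inparalogs"
    return "outparalogs"


# Hypothetical ages: the two species split 90 million years ago.
recent = classify_paralogs(duplication_mya=20, speciation_mya=90)
ancient = classify_paralogs(duplication_mya=400, speciation_mya=90)
print(recent, ancient)
```

In practice the duplication age is not known directly, which is why a program such as InParanoid has to infer the relationship from sequence similarity.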

DATA AND IMPLEMENTATION

The data was gathered from three different sources: Ensembl, NCBI and model organism databases (MODs). We only considered eukaryotic genomes sequenced to a coverage greater than 6X, with <1% unknown amino acids (X in the protein sequences). Most MOD data was packaged and uploaded to us directly by the staff at TAIR, WormBase, FlyBase, ZFIN, dictyBase, SGD and MGI, but three MODs were downloaded from their repositories. Before running InParanoid, each proteome was made non-redundant by keeping only the longest transcript from each gene. If this is not done first, different transcripts from the same gene can end up in different clusters if they exist in more than one species. Below we only list the number of non-redundant proteins for each species.

Nine of the proteomes were uploaded to us by MOD staff. Together, we have formed an informal consortium of MODs that want to cross-reference each other using orthologs from InParanoid. We particularly welcome this system as it allows us to use the most complete and recent set of proteins for each organism, and ensures that we use identifiers that work in the MODs so that web links to proteins are valid. We hope that more MODs will join in and provide their proteomes in a new and robust XML format that will be introduced for the next release.

From Ensembl, data was obtained for Aedes aegypti (transcripts for 15 419 genes), Anopheles gambiae (13 277), Apis mellifera (13 448), Bos taurus (22 280), Canis familiaris (19 314), Ciona intestinalis (14 278), Gallus gallus (16 715), Gasterosteus aculeatus (20 879), Homo sapiens (22 983), Macaca mulatta (22 045), Monodelphis domestica (19 597), Pan troglodytes (20 982), Rattus norvegicus (23 299), Takifugu rubripes (22 008), Tetraodon nigroviridis (28 005) and Xenopus tropicalis (18 473). Apis mellifera was taken from Ensembl release 38 and all other proteomes from release 43. From NCBI, we obtained Candida glabrata (5192), Cryptococcus neoformans (6487), Debaryomyces hansenii (6318), Entamoeba histolytica (9772), Escherichia coli K12 (4243), Kluyveromyces lactis (5336) and Yarrowia lipolytica (6544). The MODs uploaded proteomes for Arabidopsis thaliana (26 819), Caenorhabditis briggsae (19 334), Caenorhabditis elegans (20 084), Caenorhabditis remanei (25 595), Danio rerio (12 303), Dictyostelium discoideum (13 523), Drosophila melanogaster (13 854), Mus musculus (23 132) and Saccharomyces cerevisiae (5792). We obtained from other MODs Oryza sativa (77 853) (from http://www.gramene.org), Drosophila pseudoobscura (9871) (from http://www.flybase.org) and Schizosaccharomyces pombe (5003) (from http://www.sanger.ac.uk).
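The two preprocessing rules described in this section (keep only the longest transcript per gene, and require <1% unknown amino acids) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the `Transcript` record, the function names and the toy sequences are hypothetical, and it assumes the 1% threshold is applied to the proteome as a whole.

```python
from collections import namedtuple

# Hypothetical record type; real inputs would come from FASTA files.
Transcript = namedtuple("Transcript", ["gene_id", "transcript_id", "sequence"])


def longest_per_gene(transcripts):
    """Keep one transcript per gene: the longest (first one wins on ties)."""
    best = {}
    for t in transcripts:
        current = best.get(t.gene_id)
        if current is None or len(t.sequence) > len(current.sequence):
            best[t.gene_id] = t
    return list(best.values())


def unknown_fraction(proteins):
    """Fraction of residues that are the unknown amino acid 'X'."""
    total = sum(len(p.sequence) for p in proteins)
    unknown = sum(p.sequence.upper().count("X") for p in proteins)
    return unknown / total if total else 0.0


# Toy data: geneA has two transcripts, geneB has one with unknown residues.
transcripts = [
    Transcript("geneA", "A-1", "MKTAYIAKQR"),
    Transcript("geneA", "A-2", "MKTAYIAKQRQISFVK"),
    Transcript("geneB", "B-1", "MXXLLV"),
]
proteome = longest_per_gene(transcripts)
passes_filter = unknown_fraction(proteome) < 0.01  # assumed <1% criterion
print([t.transcript_id for t in proteome], passes_filter)
```

Reducing each gene to a single representative before clustering is what prevents alternative transcripts of one gene from landing in different ortholog clusters.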

InParanoid clustering

NCBI–Blast comparisons using these datasets were performed between each pair of species, involving four whole proteome runs per species pair (normal runs both ways plus two self-self runs). For the 35 proteomes this amounts to 595 species pairs, requiring 1225 whole-proteome Blast searches. These were executed on the SBC compute cluster comprising about 300 Linux nodes. The pairwise Blast results were used as the input for the InParanoid ortholog clustering procedure (3). The output from InParanoid 6 is available as XML, SQL, HTML and native format for downloading at the InParanoid homepage, and is searchable via the web interface. The XML format was defined in the RELAX NG schema language.
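The run counts quoted above follow from simple combinatorics: each of the 595 species pairs needs two cross-species runs, while each of the 35 self-self runs is needed only once and can be shared by every pair that involves that species. A short check of the arithmetic (variable names are illustrative):

```python
from itertools import combinations

n_species = 35

# Unordered species pairs: C(35, 2).
species_pairs = list(combinations(range(n_species), 2))
n_pairs = len(species_pairs)

# Two cross-species runs per pair (A vs B and B vs A).
cross_runs = 2 * n_pairs

# One self-self run per species, reusable across all pairs containing it.
self_runs = n_species

total_runs = cross_runs + self_runs
print(n_pairs, cross_runs, self_runs, total_runs)
```

The total equals 35 × 35, that is, every proteome searched against every proteome including itself.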

INPARANOID CONTENT

The 35 species present in the InParanoid database result in 595 pairwise ortholog lists. The information in these lists was used to generate a phylogenetic tree that reflects the level of orthology between the different species. We calculated the orthology distance from species A to B, dAB, as one minus the fraction of proteins in A that have orthologs in B, dAB = 1 − (number of proteins in A with orthologs in B)/(total number of proteins in A), and used the average orthology distance (dAB + dBA)/2 to construct a UPGMA tree, shown in Figure 1. This so-called ‘orthophylogram’ shows quantitatively the level of orthology between different clades. In general, it agrees with the standard taxonomic species tree, but we noted a few exceptions. Opossum (M. domestica), a marsupial mammal, is clustered together with placental mammals, and the zebrafish D. rerio clustered as an outgroup to the land animals rather than together with other fish. The latter anomaly is very minor as all fish are still neighbors in the tree, but the placement of opossum is surprising. If this placement is correct, then marsupials could have evolved from a particular lineage of placental mammals. Another difference is found in the yeast clade. In the taxonomy, K. lactis, S. cerevisiae and D. hansenii are clustered together, while C. glabrata is placed outside this group. These are arranged differently in the orthophylogram in that C. glabrata has traded places with D. hansenii, which now is placed as an outgroup to K. lactis, S. cerevisiae and C. glabrata. InParanoid's grouping is supported by 25S rDNA sequences (11). Surprisingly, the green plants are placed as a subgroup among single-cell organisms, next to the fungal group.
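The tree construction described above can be illustrated with a small pure-Python UPGMA sketch. It is an illustration under stated assumptions, not the authors' code: the distance from A to B is assumed to be one minus the fraction of proteins in A that have orthologs in B (consistent with the Figure 1 caption), the human and chimpanzee fractions are the ones quoted in this article, and the mouse distances are invented purely to make the example run.

```python
def average_distance(frac_ab, frac_ba):
    """Symmetric distance (dAB + dBA)/2, assuming d = 1 - ortholog fraction."""
    d_ab = 1.0 - frac_ab
    d_ba = 1.0 - frac_ba
    return (d_ab + d_ba) / 2.0


def upgma(names, dist):
    """UPGMA (average linkage) clustering.

    names: list of leaf labels.
    dist:  dict mapping index pairs (i, j), i < j, to distances.
    Returns a nested tuple (left_subtree, right_subtree, node_height).
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = dict(dist)
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of active clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        height = d[(i, j)] / 2.0
        # Distance to the merged cluster is the size-weighted mean.
        new_d = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop all entries that involve the two merged clusters.
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


names = ["human", "chimp", "mouse"]
dist = {
    (0, 1): average_distance(0.884, 0.94),  # fractions quoted in the article
    (0, 2): 0.25,                           # hypothetical value
    (1, 2): 0.27,                           # hypothetical value
}
tree = upgma(names, dist)
print(tree)
```

With these inputs human and chimpanzee are joined first, and mouse attaches to that pair at a height of half the averaged human-mouse and chimpanzee-mouse distance.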

Figure 1.

Orthophylogram of all 35 species in InParanoid 6. This UPGMA tree is based on the average fraction of orthologs between species. For instance, on average 91.2% of the proteins in H. sapiens and P. troglodytes are orthologous. The tree topology generally corresponds to the standard taxonomy, but a few exceptions were noted (see text).

It is worth noting that on average only 91.2% of the proteins in H. sapiens and chimpanzee P. troglodytes are orthologous. The individual figures are 88.4% for human and 94% for chimpanzee. This is surprisingly low since the genome-wide nucleotide divergence between human and chimpanzee is estimated at only 1.23% (12). The much higher difference observed for orthologs is not due to unique proteins in either proteome, as the fraction of homologs reported by InParanoid is 96.7% for human and 99.3% for chimpanzee. Rather, it reflects that the sequences were too divergent to be considered orthologs. This is, however, often caused by incomplete sequencing or errors in gene annotation.

The average number of inparalogs per cluster ranges from 1.001 (in Drosophila pseudoobscura when compared to D. melanogaster) to 7.160 (in O. sativa when compared to D. rerio). The overall mean number of inparalogs per species was 1.54, and the median was 1.25. The distribution of cluster sizes is shown in Figure 2. The highly duplicated genome of O. sativa is responsible for all average cluster sizes above four, and generates a separate peak in the distribution around five. In fact, O. sativa had on average more than four inparalogs per ortholog group when compared to every other non-plant species. It is surprising that the average number of inparalogs in O. sativa was so high when compared with D. rerio; when compared with D. rerio's phylogenetic neighbors the number was only around five. Although the rice proteome clearly contains the largest number of genes, our figures are probably somewhat overestimated. Evidence for this is that we were not able to find shared gene identifiers between any rice proteins in the MOD. This problem will be resolved in the future by collaborating directly with the rice MOD staff to get a better-annotated rice proteome.
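The per-species statistic discussed above, the average number of inparalogs per cluster, can be computed from a pairwise cluster table as in the sketch below. The cluster representation (a dict of protein lists per species) and the identifiers are hypothetical and do not reflect the InParanoid file formats; only the human and chimpanzee percentages are taken from the text.

```python
from statistics import mean


def mean_inparalogs(clusters, species):
    """Average number of proteins from `species` per ortholog cluster."""
    return mean(len(cluster[species]) for cluster in clusters)


# Hypothetical clusters from one A-versus-B comparison.
clusters = [
    {"A": ["a1"], "B": ["b1", "b2"]},
    {"A": ["a2"], "B": ["b3"]},
    {"A": ["a3", "a4"], "B": ["b4", "b5", "b6"]},
]
mean_a = mean_inparalogs(clusters, "A")
mean_b = mean_inparalogs(clusters, "B")

# Averaging the two one-directional ortholog percentages quoted in the text.
human_chimp_average = round((88.4 + 94.0) / 2, 1)
print(mean_a, mean_b, human_chimp_average)
```

A value close to 1 means almost every cluster is a one-to-one ortholog pair on that species' side, while larger values indicate lineage-specific duplications.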

Figure 2.

Histogram of average number of inparalogs/cluster per species for all species–species comparisons in InParanoid 6. The peak around five inparalogs/cluster is entirely caused by O. sativa, rice.

DATA AVAILABILITY

The InParanoid database is freely available at http://inparanoid.sbc.su.se. In addition to the data that can be searched and browsed through the web interface, FASTA files containing all proteins, protein description files, and ortholog tables in raw, SQL and XML format are available for each pairwise InParanoid analysis. The InParanoid program is freely available upon request to inparanoid@sbc.su.se.

REFERENCES (12 in total)

1. Overview and comparison of ortholog databases.

Authors: Andrey Alexeyenko; Julia Lindberg; Asa Pérez-Bercoff; Erik L L Sonnhammer
Journal: Drug Discov Today Technol Date: 2006

2. Distinguishing homologous from analogous proteins.

Authors: W M Fitch
Journal: Syst Zool Date: 1970-06

3. Initial sequence of the chimpanzee genome and comparison with the human genome.

Authors:
Journal: Nature Date: 2005-09-01 Impact factor: 49.962

4. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

Authors: M Remm; C E Storm; E L Sonnhammer
Journal: J Mol Biol Date: 2001-12-14 Impact factor: 5.469

5. Genome evolution in yeasts.

Authors: Bernard Dujon; David Sherman; Gilles Fischer; Pascal Durrens; Serge Casaregola; Ingrid Lafontaine; Jacky De Montigny; Christian Marck; Cécile Neuvéglise; Emmanuel Talla; Nicolas Goffard; Lionel Frangeul; Michel Aigle; Véronique Anthouard; Anna Babour; Valérie Barbe; Stéphanie Barnay; Sylvie Blanchin; Jean-Marie Beckerich; Emmanuelle Beyne; Claudine Bleykasten; Anita Boisramé; Jeanne Boyer; Laurence Cattolico; Fabrice Confanioleri; Antoine De Daruvar; Laurence Despons; Emmanuelle Fabre; Cécile Fairhead; Hélène Ferry-Dumazet; Alexis Groppi; Florence Hantraye; Christophe Hennequin; Nicolas Jauniaux; Philippe Joyet; Rym Kachouri; Alix Kerrest; Romain Koszul; Marc Lemaire; Isabelle Lesur; Laurence Ma; Héloïse Muller; Jean-Marc Nicaud; Macha Nikolski; Sophie Oztas; Odile Ozier-Kalogeropoulos; Stefan Pellenz; Serge Potier; Guy-Franck Richard; Marie-Laure Straub; Audrey Suleau; Dominique Swennen; Fredj Tekaia; Micheline Wésolowski-Louvel; Eric Westhof; Bénédicte Wirth; Maria Zeniou-Meyer; Ivan Zivanovic; Monique Bolotin-Fukuhara; Agnès Thierry; Christiane Bouchier; Bernard Caudron; Claude Scarpelli; Claude Gaillardin; Jean Weissenbach; Patrick Wincker; Jean-Luc Souciet
Journal: Nature Date: 2004-07-01 Impact factor: 49.962

6. Benchmarking ortholog identification methods using functional genomics data.

Authors: Tim Hulsen; Martijn A Huynen; Jacob de Vlieg; Peter M A Groenen
Journal: Genome Biol Date: 2006-04-13 Impact factor: 13.583

7. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Lewis Y Geer; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Vadim Miller; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2006-12-14 Impact factor: 16.971

8. Inparanoid: a comprehensive database of eukaryotic orthologs.

Authors: Kevin P O'Brien; Maido Remm; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. Assessing performance of orthology detection strategies applied to eukaryotic genomes.

Authors: Feng Chen; Aaron J Mackey; Jeroen K Vermunt; David S Roos
Journal: PLoS One Date: 2007-04-18 Impact factor: 3.240

10. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169
