
OPTIC: orthologous and paralogous transcripts in clades.

Andreas Heger, Chris P Ponting

Abstract

The genome sequences of a large number of metazoan species are now known. As multiple closely related genomes are sequenced, comparative studies that previously focussed on only pairs of genomes can now be extended over whole clades. The orthologous and paralogous transcripts in clades (OPTIC) database currently provides sets of gene predictions and orthology assignments for three clades: (i) amniotes, including human, dog, mouse, opossum, platypus and chicken (17 443 orthologous groups); (ii) a Drosophila clade of 12 species (12 889 orthologous groups) and (iii) a nematode clade of four species (13 626 orthologous groups). Gene predictions, multiple alignments and phylogenetic trees are freely available to browse and download from http://genserv.anat.ox.ac.uk/clades. Further genomes and clades will be added in the future.

Year:  2007        PMID: 17933761      PMCID: PMC2238935          DOI: 10.1093/nar/gkm852

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

New technologies and reduced costs are driving a marked increase in the number of genomes being sequenced. This steep rise in data presents opportunities for predicting the evolutionary relationships of genes not between pairs of genomes, as previously, but among the genomes of a clade of closely related species. Computational tools for gene prediction, orthology assignment and multiple alignment now need to be developed using phylogenetic approaches. To meet this challenge, we have developed a pipeline for gene prediction and orthology assignment for any clade of genomes (Heger and Ponting, in press). The current release of the orthologous and paralogous transcripts in clades (OPTIC) database contains three clades: 12 species from the genus Drosophila, an amniotic clade of five mammals with chicken as outgroup, and four Caenorhabditis nematodes (Table 1).
Table 1. Gene sets and orthology assignments in three clades

Species             Genes     Genes with orthologs (%)   Orphaned genes (%)
D. melanogaster*    13 836    13 563 (98)                273 (2)
D. simulans         13 203    12 318 (93)                885 (7)
D. sechellia        15 467    14 356 (93)                1111 (7)
D. erecta           14 199    13 471 (95)                728 (5)
D. yakuba           14 971    14 218 (95)                753 (5)
D. ananassae        14 337    13 205 (92)                1132 (8)
D. pseudoobscura    12 304    11 609 (94)                695 (6)
D. persimilis       12 973    11 876 (92)                1097 (8)
D. willistoni       13 144    11 360 (86)                1784 (14)
D. virilis          12 017    11 096 (92)                921 (8)
D. mojavensis       11 717    10 883 (93)                834 (7)
D. grimshawi        11 800    11 011 (93)                789 (7)
C. elegans*         20 093    14 037 (70)                6056 (30)
C. remanei          18 137    14 961 (82)                3176 (18)
C. PB2801           21 931    17 759 (81)                4172 (19)
C. briggsae         18 388    13 460 (73)                4928 (27)
H. sapiens*         22 611    19 339 (86)                3272 (14)
M. musculus*        24 442    20 758 (85)                3684 (15)
C. familiaris*      19 314    18 066 (94)                1248 (6)
M. domestica*       19 597    18 123 (92)                1474 (8)
O. anatinus*        18 596    15 312 (82)                3284 (18)
G. gallus*          16 715    13 893 (83)                2822 (17)

Gene sets marked with an asterisk (*) were obtained from ENSEMBL, whereas all others were predicted by the pipeline. Orphans are genes that have no ortholog in any of the other genomes in the clade. They reflect heuristic failures in our ortholog prediction pipeline or in gene prediction, as well as true gene losses.
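The percentage columns of Table 1 follow directly from the gene counts. The short sketch below recomputes them for three rows copied from the table; the function name `orphan_summary` is illustrative and not part of OPTIC, and rounding to whole percentages is assumed from the values shown.

```python
def orphan_summary(genes: int, with_orthologs: int) -> tuple:
    """Return (orphaned genes, % with orthologs, % orphaned), as whole percentages."""
    orphans = genes - with_orthologs
    pct_with_orthologs = round(100 * with_orthologs / genes)
    pct_orphaned = round(100 * orphans / genes)
    return orphans, pct_with_orthologs, pct_orphaned


# (genes, genes with orthologs) copied from Table 1
rows = {
    "D. melanogaster": (13836, 13563),
    "C. elegans": (20093, 14037),
    "H. sapiens": (22611, 19339),
}

for species, (genes, with_orthologs) in rows.items():
    print(species, orphan_summary(genes, with_orthologs))
```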

The pipeline predicts orthology for both orthologous groups and simple 1:1 ortholog sets. Here, orthologous groups contain orthologs and in-paralogs but exclude out-paralogs (1), that is, duplicated genes that were each present in the last common ancestor of a clade. Simple 1:1 ortholog sets are derived from orthologous groups by examining the gene tree and extracting sub-trees that contain exactly one gene per species. To enable inferences of gene duplication or loss, or of positive selection on individual codons, we supply amino acid and nucleotide multiple sequence alignments, and phylogenetic trees, for each orthologous group. All data may be searched or downloaded freely from http://genserv.anat.ox.ac.uk/download/clades.
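The derivation of simple 1:1 ortholog sets from a gene tree can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: the tree is a nested tuple, leaves are hypothetical labels of the form `species|gene`, and a sub-tree qualifies when it is maximal, has at least two leaves and contains no species more than once.

```python
from collections import Counter


def leaves(tree):
    """Return the leaf labels of a tree given as nested tuples of 'species|gene' strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def one_to_one_sets(tree):
    """Return maximal sub-trees (as leaf lists) holding at most one gene per species."""
    tips = leaves(tree)
    per_species = Counter(tip.split("|")[0] for tip in tips)
    if len(tips) >= 2 and max(per_species.values()) == 1:
        return [tips]            # this sub-tree is itself a 1:1 set
    if isinstance(tree, str):
        return []                # a lone gene is not an ortholog set
    # Otherwise a duplication lies below this node: descend into the children.
    return [found for child in tree for found in one_to_one_sets(child)]


# Hypothetical orthologous group with a duplication at the root and one in mouse.
group = (
    (("human|A1", "mouse|A1"), "dog|A1"),
    (("human|A2", "mouse|A2"), "mouse|A3"),
)
print(one_to_one_sets(group))
```

Here the first child of the root is returned whole, whereas the second is split because mouse occurs twice in it.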

Database construction

The pipeline requires a set of genome sequences and ENSEMBL (2) gene sets for each genome. If a gene set for a genome is unavailable, we predict transcripts by homology from a reference transcript set and then automatically derive a gene set from them (Heger and Ponting, in press). A quality control step removes partial predictions and marks as pseudogenes those predictions that contain in-frame stop-codons and frameshift insertions and deletions. A predicted gene set comprises both genes and pseudogenes. ENSEMBL and predicted gene sets are then submitted to an orthology assignment process. A full description of the pipeline, including parameter settings, is provided on the web site. Briefly, the pipeline implements the following steps:

1.  Gene prediction by homology from a transcript set using Exonerate (3).
2.  Pairwise orthology assignment between all pairs of genomes using (a) BlastP (4) all-against-all alignments of all translated transcripts and (b) PhyOP (5) tree-based orthology assignment of genes.
3.  Graph-based grouping of genes from all species into clusters.
4.  Multiple alignment of translated exons using MUSCLE (6).
5.  Estimation of phylogenetic tree topology using NJTree (7).
6.  Decomposition of clusters into orthologous groups.
7.  Branch length estimation using codeml from the PAML package (8).
8.  Computation of simple 1:1 ortholog sets.

Data are stored in a relational database and gene predictions are displayed within a GMOD genome browser (http://www.gmod.org). Software is open source and available without charge on request to the authors.
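The graph-based grouping step is not specified in detail here. One simple way to picture it, offered only as an assumption, is to treat pairwise orthology assignments as edges of a graph and to take its connected components as clusters. The sketch below does this with a union-find structure; the function name and the gene identifiers are hypothetical.

```python
def cluster_genes(pairs):
    """Group genes into clusters: connected components of the pairwise-orthology graph."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving keeps look-ups short
            gene = parent[gene]
        return gene

    # Merge the components of every pair of genes assigned as orthologs.
    for gene_a, gene_b in pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise assignments between human (hs), mouse (mm) and dog (cf) genes.
pairs = [("hs|G1", "mm|G1"), ("mm|G1", "cf|G1"), ("hs|G2", "mm|G2")]
print(cluster_genes(pairs))
```

In the pipeline, such clusters are subsequently aligned and decomposed into orthologous groups using the gene tree; connected components alone would not separate out-paralogs.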

Database contents

For the current release, we have applied our pipeline to three metazoan clades (Table 1), each containing between 4 and 12 species. Genes were predicted for Drosophila and Caenorhabditis species’ genome assemblies using D. melanogaster (9) and C. elegans (10) protein-coding transcripts as templates. Mammalian and chicken gene sets were from ENSEMBL release 42 (2). The web server provides an up-to-date list of genome assemblies for the current release. We find 12 889 orthologous groups in the Drosophila clade, 17 443 groups in the amniotic clade and 13 626 groups in the four Caenorhabditis species. Of these, 10 563 orthologous groups in the Drosophila clade, 9675 groups in the amniotic clade and 6545 groups in the Caenorhabditis clade contain the full species complements. The numbers of simple 1:1 ortholog sets are smaller (5241, 7587 and 5987 for the three clades, respectively) owing to gene duplications and absences from incomplete assemblies. For each orthologous group, we provide:

Transcript predictions: Predicted transcripts are available as exonic genomic coordinates, and as peptide and coding sequences.

Orthologs: Orthologous groups and simple 1:1 ortholog sets.

Multiple alignments: Multiple alignments of transcripts and genes within an orthologous group are provided both as aligned nucleic acid sequences and as aligned peptide sequences. Frameshift insertions or deletions in pseudogenes have been removed, and stop-codons have been masked in order to facilitate downstream analyses. Genes have been aligned by concatenating exons of all transcripts while maintaining frame.

Phylogenetic trees: For each orthologous group, we provide a phylogenetic tree. The topology of the tree has been calculated using NJTree, while branch lengths (nucleotide substitutions per site) have been assigned using PAML.
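Masking of stop-codons in coding sequences might look like the sketch below. The masking symbol is not stated here, so replacing each in-frame stop codon with `NNN` is an assumption, as are the use of the standard genetic code and the function name `mask_stop_codons`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def mask_stop_codons(cds: str, mask: str = "NNN") -> str:
    """Replace every in-frame stop codon of a coding sequence with `mask`."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(mask if codon.upper() in STOP_CODONS else codon for codon in codons)


# Hypothetical pseudogene fragment with an internal and a terminal stop codon.
print(mask_stop_codons("ATGTAAGGCTGA"))
```

Because only in-frame triplets are inspected, a stop-like triplet that spans two codons (as in `ATAGCC`) is left untouched, and the sequence length, and therefore the alignment frame, is preserved.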

Database access and web service

The web service permits interactive querying and browsing of orthologous groups and simple 1:1 ortholog sets for each clade (Figure 1). The species distribution of an orthologous group is summarized by a phylogenetic profile that denotes the presence (‘+’) or absence (‘0’) of one or more genes from each species in the group. For example, a search for orthologous groups in the amniotic clade with the phylogenetic profile ‘+++000’ lists 542 orthologous groups that contain genes in human, mouse and dog, but have no orthologs in opossum, platypus and chicken.
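A phylogenetic profile of this kind can be computed from the species represented in a group, as in the following sketch. The species order is assumed from the example above (human, mouse, dog, opossum, platypus, chicken), and the group identifiers and contents are hypothetical.

```python
AMNIOTES = ["human", "mouse", "dog", "opossum", "platypus", "chicken"]  # assumed order


def profile(group_species, order=AMNIOTES):
    """Return '+' for each species with one or more genes in the group, '0' otherwise."""
    present = set(group_species)
    return "".join("+" if species in present else "0" for species in order)


# Hypothetical orthologous groups, each listed as the species of its member genes.
groups = {
    101: ["human", "mouse", "dog", "dog"],
    102: ["human", "mouse", "dog", "opossum", "platypus", "chicken"],
    103: ["human", "chicken"],
}

hits = [group_id for group_id, species in groups.items() if profile(species) == "+++000"]
print(hits)
```

Group 101 matches even though it holds two dog genes, because ‘+’ records presence rather than copy number.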
Figure 1. Browsing the orthology database. A sample session starts with a query for all simple 1:1 ortholog sets (bottom left). It continues with a list of all simple 1:1 ortholog sets containing all six species from the amniotic clade, then by a selection of one particular ortholog set (number 114), and finally with a viewing of the gene-based multiple sequence alignment.

In queries for simple 1:1 ortholog sets, ‘1’ indicates that exactly one copy of this gene is present and ‘–’ indicates that this particular species should not be considered. Thus, the profile ‘111–––’ applied to simple 1:1 ortholog sets yields 13 788 simple 1:1 ortholog sets that contain exactly one gene in human, dog and mouse, and any number of homologs in opossum, platypus or chicken. For each orthologous group and simple 1:1 ortholog set, multiple alignments and a phylogenetic tree may be displayed. A synteny viewer also allows an assessment of whether orthologs occur in regions of conserved synteny. Genes of particular interest can be located either by identifier or by genomic location. Computational biologists interested in performing large-scale analyses can download complete datasets from the download area.
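The query semantics for simple 1:1 ortholog sets can be sketched as a copy-number check per species. This is an illustration only: the function name is hypothetical, the species order is assumed, and an ASCII hyphen stands in for the ‘–’ symbol.

```python
from collections import Counter

SPECIES = ["human", "mouse", "dog", "opossum", "platypus", "chicken"]  # assumed order


def matches(pattern, gene_species, order=SPECIES):
    """Check a copy-number pattern: '1' = exactly one gene, '-' = species ignored."""
    if len(pattern) != len(order):
        raise ValueError("pattern needs one symbol per species")
    counts = Counter(gene_species)
    for symbol, species in zip(pattern, order):
        if symbol == "-":
            continue                      # this species is not considered
        if symbol != "1":
            raise ValueError(f"unsupported symbol: {symbol!r}")
        if counts[species] != 1:
            return False                  # zero or several copies in this species
    return True


print(matches("111---", ["human", "mouse", "dog", "chicken", "chicken"]))
print(matches("111---", ["human", "human", "mouse", "dog"]))
```

The first query succeeds because the two chicken genes fall under an ignored position; the second fails because human contributes two genes.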

OUTLOOK

OPTIC is designed to provide precalculated phylogenetic datasets that benefit clade-wide genomic analyses. Our approach complements other existing projects (2,7,12, 12) in four respects: (i) we apply the pipeline to diverse organisms, and not just to experimental model organisms; (ii) we define clades with respect to phylogenetic distances that are amenable to evolutionary rate analysis (roughly, where the number of synonymous substitutions per synonymous site is <2.0 (5)); (iii) our orthology relationships are inferred by considering all species equally, in a phylogenetic approach; and (iv) we use all exons across all alternative transcripts as opposed to the longest transcripts only. A particularly useful feature of OPTIC is its provision of multiple alignments either for genes as concatenated exons, or for alternative transcripts. We plan to update gene predictions and orthology assignments and to add more genomes and clades when they become available.
  12 in total

1.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

2.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

3.  Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

Authors:  M Remm; C E Storm; E L Sonnhammer
Journal:  J Mol Biol       Date:  2001-12-14       Impact factor: 5.469

4.  PAML: a program package for phylogenetic analysis by maximum likelihood.

Authors:  Z Yang
Journal:  Comput Appl Biosci       Date:  1997-10

5.  The human phylome.

Authors:  Jaime Huerta-Cepas; Hernán Dopazo; Joaquín Dopazo; Toni Gabaldón
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

6.  TreeFam: a curated database of phylogenetic trees of animal gene families.

Authors:  Heng Li; Avril Coghlan; Jue Ruan; Lachlan James Coin; Jean-Karim Hériché; Lara Osmotherly; Ruiqiang Li; Tao Liu; Zhang Zhang; Lars Bolund; Gane Ka-Shu Wong; Weimou Zheng; Paramvir Dehal; Jun Wang; Richard Durbin
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

7.  Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human.

Authors:  Leo Goodstadt; Chris P Ponting
Journal:  PLoS Comput Biol       Date:  2006-09-29       Impact factor: 4.475

8.  FlyBase: genomes by the dozen.

Authors:  Madeline A Crosby; Joshua L Goodman; Victor B Strelets; Peili Zhang; William M Gelbart
Journal:  Nucleic Acids Res       Date:  2006-11-11       Impact factor: 16.971

9.  WormBase: new content and better access.

Authors:  Tamberlyn Bieri; Darin Blasiar; Philip Ozersky; Igor Antoshechkin; Carol Bastiani; Payan Canaran; Juancarlos Chan; Nansheng Chen; Wen J Chen; Paul Davis; Tristan J Fiedler; Lisa Girard; Michael Han; Todd W Harris; Ranjana Kishore; Raymond Lee; Sheldon McKay; Hans-Michael Müller; Cecilia Nakamura; Andrei Petcherski; Arun Rangarajan; Anthony Rogers; Gary Schindelman; Erich M Schwarz; Will Spooner; Mary Ann Tuli; Kimberly Van Auken; Daniel Wang; Xiaodong Wang; Gary Williams; Richard Durbin; Lincoln D Stein; Paul W Sternberg; John Spieth
Journal:  Nucleic Acids Res       Date:  2006-11-11       Impact factor: 16.971

10.  Inparanoid: a comprehensive database of eukaryotic orthologs.

Authors:  Kevin P O'Brien; Maido Remm; Erik L L Sonnhammer
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971
