OrthoMaM v10: Scaling-Up Orthologous Coding Sequence and Exon Alignments with More than One Hundred Mammalian Genomes.

Celine Scornavacca, Khalid Belkhir, Jimmy Lopez, Rémy Dernat, Frédéric Delsuc, Emmanuel J P Douzery, Vincent Ranwez.

Abstract

We present version 10 of OrthoMaM, a database of orthologous mammalian markers. OrthoMaM is already 11 years old and has kept improving since the outset, providing high-quality alignments and phylogenetic trees computed with state-of-the-art methods on up-to-date data. The main contribution of this version is the increase in the number of taxa: 116 mammalian genomes for 14,509 one-to-one orthologous genes. This has been made possible by combining genomic data deposited in Ensembl with additional good-quality genomes available only in NCBI. Version 10 users will benefit from pipeline improvements and a completely redesigned web interface.
© The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Keywords:  coding sequences; comparative genomics; mammals; orthologous sequences; phylogenomics

Year:  2019        PMID: 30698751      PMCID: PMC6445298          DOI: 10.1093/molbev/msz015

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


OrthoMaM is a database of high-quality orthologous sequence alignments and phylogenetic trees from mammalian genomes. It has been used, for instance, for developing new molecular markers, inferring mammalian phylogenies, simulating sequences for testing alignment filtering methods, and studying the evolution of base composition in protein-coding sequences. Previous versions of our database only included mammalian genomes from Ensembl (Ranwez et al. 2007; Douzery et al. 2014). With the progress of sequencing techniques, the number of genomes included in NCBI, but not yet in Ensembl, grows each year. This motivated us to totally rethink our database so that it also includes the annotated mammalian genomes available only in NCBI.

To identify the core set of orthologous sequences to be used, we rely on the OrthoMaM v8 pipeline described in Douzery et al. (2014). Briefly, we isolate 1-to-1 orthologous genes among pairs of our pillar placental species (Homo-Mus, Homo-Canis, and Mus-Canis) by using Ensembl v91 annotations (Zerbino et al. 2018). These clusters are then enriched by adding sequences of 69 additional mammals annotated in Ensembl v91 as 1-to-1 orthologues to the human gene, and turned into clusters of 1-to-1 orthologous CDSs by selecting the longest transcript of each gene. These proto-clusters are aligned using MAFFT at the amino acid level (Katoh and Standley 2013).

The CDSs of the 47 annotated mammal genomes available in NCBI (Rigden and Fernández 2017) but not in Ensembl as of March 2018 are then used to enrich the proto-alignments. For each proto-alignment, an HMM profile is created via hmmbuild using default parameters of the HMMER toolkit (Eddy 2011). Additionally, all HMM profiles are concatenated and summarized using hmmpress to construct an HMM database. Then, for each NCBI CDS Ci, hmmscan is used on the HMM database to get the best hits among the proto-alignments, denoted bestAl(Ci).
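The profile-building and scanning steps described above can be sketched as a list of HMMER command lines. This is only an illustrative sketch, not the OrthoMaM pipeline itself: the file names, the function names, and the use of `--tblout` to save a tabular hit list are assumptions, and whether the query sequences are translated before scanning is not specified here and is left out.

```python
from pathlib import Path
from typing import List, Sequence


def concatenate_profiles(profiles: Sequence[str], db_path: str) -> None:
    """Concatenate individual HMM profile files into one database file."""
    with open(db_path, "w") as out:
        for profile in profiles:
            out.write(Path(profile).read_text())


def hmm_database_commands(alignments: Sequence[str], db_path: str,
                          query_fasta: str, hits_table: str) -> List[List[str]]:
    """Return HMMER command lines for building, indexing, and scanning.

    All paths are hypothetical. The concatenation of the per-alignment
    profiles into db_path (see concatenate_profiles) must happen between
    the hmmbuild calls and the hmmpress call.
    """
    commands: List[List[str]] = []
    for aln in alignments:
        profile = str(Path(aln).with_suffix(".hmm"))
        # hmmbuild <hmmfile_out> <msafile>, run with default parameters
        commands.append(["hmmbuild", profile, aln])
    # hmmpress indexes the concatenated profile database
    commands.append(["hmmpress", db_path])
    # hmmscan searches each query sequence against the profile database
    commands.append(["hmmscan", "--tblout", hits_table, db_path, query_fasta])
    return commands
```

Each returned command could then be executed with `subprocess.run(cmd, check=True)`, provided the HMMER binaries are installed and on the search path.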
For each proto-alignment Aj, the most similar sequences for each species S, denoted bestSeq(S, Aj), are detected via hmmsearch. Outputs from hmmsearch and hmmscan are discarded if the first hit score is not substantially better than the second (i.e., they are kept only when hit2 < 0.9 × hit1) and are combined in a best-reciprocal-hit fashion. Suppose that we are given an NCBI CDS sequence Ci for species Sk, and denote by Aj = bestAl(Ci) the best proto-alignment for Ci. If the first sequence of bestSeq(Sk, Aj) equals Ci and none of the other sequences Cz ≠ Ci in bestSeq(Sk, Aj) is such that bestAl(Cz) = Aj, then Ci is considered to belong to Aj. This ensures that our orthology predictions for the NCBI CDSs are robust.

The enriched orthologous clusters have been carefully aligned and filtered via the OMM_MACSE pipeline to construct high-quality codon alignments relying on MACSE v2 (Ranwez et al. 2018), MAFFT (Katoh and Standley 2013), and HMMcleaner (Philippe et al. 2017). Phylogenetic trees for each CDS alignment are inferred by maximum likelihood (ML) with RAxML (Stamatakis 2014) under the GTR+Γ model (Yang 1994).

OrthoMaM v10 also includes an ad hoc nonorthologous sequence detection method. For a given marker, the distance between the most recent common ancestor (MRCA) of placental sequences and any sequence of the marker, denoted here Si, depends mostly on two factors: the considered marker (some genes evolve faster) and the genome to which Si belongs (some species evolve faster). We use linear regression to explain the patristic distance on inferred ML trees between a sequence Si and the placental MRCA based on those two factors. When the observed patristic distance departs strongly from the linear regression prediction (standardized residual > 3), we consider the sequence as spurious and remove it.

As in the previous versions of our database, we use exon positions in CDS alignments to infer exon orthology and alignments (see Douzery et al. 2014 for details).
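The best-reciprocal-hit rule used to assign an NCBI CDS to a proto-alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: hit lists are assumed to be already parsed into (identifier, score) pairs sorted by decreasing positive score, the 0.9 threshold is read as "keep the top hit only if the second score is below 0.9 times the first", and all function and variable names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Hits = List[Tuple[str, float]]  # (identifier, score), sorted by decreasing score


def unambiguous_best(hits: Hits, ratio: float = 0.9) -> Optional[str]:
    """Return the top hit, or None if the runner-up scores >= ratio * top score."""
    if not hits:
        return None
    if len(hits) > 1 and hits[1][1] >= ratio * hits[0][1]:
        return None  # first hit not substantially better than the second
    return hits[0][0]


def assign_alignment(cds: str, species: str,
                     scan_hits: Dict[str, Hits],
                     search_hits: Dict[Tuple[str, str], Hits],
                     ratio: float = 0.9) -> Optional[str]:
    """Return the proto-alignment a CDS belongs to, or None if unassigned.

    scan_hits[cds] plays the role of the hmmscan output for that CDS;
    search_hits[(species, alignment)] plays the role of bestSeq(S, A).
    """
    best_al = unambiguous_best(scan_hits.get(cds, []), ratio)
    if best_al is None:
        return None
    candidates = search_hits.get((species, best_al), [])
    # Reciprocity: the CDS must be the clear first hit of its best alignment.
    if unambiguous_best(candidates, ratio) != cds:
        return None
    # No other candidate of that species may also point back to this alignment.
    for other, _score in candidates[1:]:
        if unambiguous_best(scan_hits.get(other, []), ratio) == best_al:
            return None
    return best_al
```

For example, a CDS whose two best proto-alignments score 80 and 79 is left unassigned, because the second score is not below 0.9 × 80 = 72.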
Phylogenetic trees for each exon alignment are reconstructed as for CDSs.

The OrthoMaM v10 database is available at http://www.orthomam.univ-montp2.fr/orthomam_v10/. For each CDS and exon marker, we provide gene-level information (gene name, GO annotation), full sequence traceability information (sequence identifier in Ensembl/NCBI, filtering details), nucleotide and amino acid alignments, phylogenetic trees, as well as several evolutionary indicators such as relative evolutionary rate and G + C content. The improved web interface (see fig. 1) provides a user-friendly way to query the database based on alignment content (number of sequences, number of species that should be represented, and alignment length), evolutionary indicators, or sequence similarity using BLAST.
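The kind of query on evolutionary indicators and alignment content described above can be illustrated with a small filter over marker records, using as defaults the thresholds of the example query of fig. 1. This is a hypothetical sketch for markers that have already been downloaded: the `Marker` record, its field names, and the function name are assumptions and do not correspond to any interface of the OrthoMaM website.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Marker:
    """Hypothetical summary record for one CDS marker."""
    name: str
    relative_rate: float     # relative evolutionary rate
    gc3_percent: float       # GC content at third codon positions, in %
    gamma_alpha: float       # parameter of the gamma distribution
    alignment_length: int    # alignment length in characters


def select_markers(markers: Iterable[Marker],
                   rate: Tuple[float, float] = (0.5, 2.0),
                   gc3: Tuple[float, float] = (23.0, 45.0),
                   alpha: Tuple[float, float] = (1.0, 2.0),
                   min_length: int = 1000) -> List[Marker]:
    """Keep markers inside all ranges (inclusive) and longer than min_length."""
    return [
        m for m in markers
        if rate[0] <= m.relative_rate <= rate[1]
        and gc3[0] <= m.gc3_percent <= gc3[1]
        and alpha[0] <= m.gamma_alpha <= alpha[1]
        and m.alignment_length > min_length
    ]
```

Whether the range bounds are inclusive on the website is not stated; the sketch treats them as inclusive and the length threshold as strict.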
Fig. 1. Screenshots from the OrthoMaM website version 10. Here, the database is queried for CDSs with a relative evolutionary rate between 0.5 and 2, a GC3 between 23% and 45%, a parameter of the gamma distribution ranging from 1 to 2, and an alignment length >1,000 characters (top left). The result returns 126 target CDSs, for which information can be downloaded (bottom left). The longest CDS (ABCA13) is visualized with the corresponding phylogenetic information (top right) and AA alignment (bottom right).

This new reactive web site, completely redesigned to improve the user experience, together with the tripling of the number of species and the major update of the pipeline for orthology prediction and sequence filtering, should make OrthoMaM v10 a central resource for anyone interested in mammalian comparative genomics.
References

1.  Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods.

Authors:  Z Yang
Journal:  J Mol Evol       Date:  1994-09       Impact factor: 2.395

2.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

3.  OrthoMaM v8: a database of orthologous exons and coding sequences for comparative genomics in mammals.

Authors:  Emmanuel J P Douzery; Celine Scornavacca; Jonathan Romiguier; Khalid Belkhir; Nicolas Galtier; Frédéric Delsuc; Vincent Ranwez
Journal:  Mol Biol Evol       Date:  2014-04-09       Impact factor: 16.240

4.  Accelerated Profile HMM Searches.

Authors:  Sean R Eddy
Journal:  PLoS Comput Biol       Date:  2011-10-20       Impact factor: 4.475

5.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors:  Alexandros Stamatakis
Journal:  Bioinformatics       Date:  2014-01-21       Impact factor: 6.937

6.  The 2018 Nucleic Acids Research database issue and the online molecular biology database collection.

Authors:  Daniel J Rigden; Xosé M Fernández
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

7.  OrthoMaM: a database of orthologous genomic markers for placental mammal phylogenetics.

Authors:  Vincent Ranwez; Frédéric Delsuc; Sylvie Ranwez; Khalid Belkhir; Marie-Ka Tilak; Emmanuel Jp Douzery
Journal:  BMC Evol Biol       Date:  2007-11-30       Impact factor: 3.260

8.  Ensembl 2018.

Authors:  Daniel R Zerbino; Premanand Achuthan; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Jyothish Bhai; Konstantinos Billis; Carla Cummins; Astrid Gall; Carlos García Girón; Laurent Gil; Leo Gordon; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Jimmy Kiang To; Matthew R Laird; Ilias Lavidas; Zhicheng Liu; Jane E Loveland; Thomas Maurel; William McLaren; Benjamin Moore; Jonathan Mudge; Daniel N Murphy; Victoria Newman; Michael Nuhn; Denye Ogeh; Chuang Kee Ong; Anne Parker; Mateus Patricio; Harpreet Singh Riat; Helen Schuilenburg; Dan Sheppard; Helen Sparrow; Kieron Taylor; Anja Thormann; Alessandro Vullo; Brandon Walts; Amonida Zadissa; Adam Frankish; Sarah E Hunt; Myrto Kostadima; Nicholas Langridge; Fergal J Martin; Matthieu Muffato; Emily Perry; Magali Ruffier; Dan M Staines; Stephen J Trevanion; Bronwen L Aken; Fiona Cunningham; Andrew Yates; Paul Flicek
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

9.  MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons.

Authors:  Vincent Ranwez; Emmanuel J P Douzery; Cédric Cambon; Nathalie Chantret; Frédéric Delsuc
Journal:  Mol Biol Evol       Date:  2018-10-01       Impact factor: 16.240

