MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons.

Vincent Ranwez¹, Emmanuel J P Douzery², Cédric Cambon¹,², Nathalie Chantret¹, Frédéric Delsuc².

Abstract

Multiple sequence alignment is a prerequisite for many evolutionary analyses. Multiple Alignment of Coding Sequences (MACSE) is a multiple sequence alignment program that explicitly accounts for the underlying codon structure of protein-coding nucleotide sequences. Its unique characteristic allows building reliable codon alignments even in the presence of frameshifts. This facilitates downstream analyses such as selection pressure estimation based on the ratio of nonsynonymous to synonymous substitutions. Here, we present MACSE v2, a major update with an improved version of the initial algorithm enriched with a complete toolkit to handle multiple alignments of protein-coding sequences. A graphical interface now provides user-friendly access to the different subprograms.

Substances: Codon, Terminator

Year: 2018 PMID: 30165589 PMCID: PMC6188553 DOI: 10.1093/molbev/msy159

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

Brief Communication

Multiple Alignment of Coding Sequences (MACSE) was the first automatic solution developed to align multiple protein-coding nucleotide sequences based on their amino acid translation while allowing for the occurrence of frameshifts (Ranwez et al. 2011). Its key feature is to align DNA sequences at the nucleotide level, but with the possibility to include gap lengths that are not a multiple of three bases, that is, generating frameshifts, while scoring the resulting nucleotide alignments based on their amino acid translation. This allows one to produce nucleotide alignments that preserve the underlying codon structure while benefiting from the higher similarity of amino acid sequences. Since its first release in 2011, MACSE has been used in multiple contexts including comparative transcriptomic studies (Lan and Pritchard 2016), pseudogene evolution (Delsuc et al. 2015), genome-wide analyses of selection (Assis et al. 2012), metabarcoding analyses (Leray et al. 2013), and phylogenomic pipelines (Bragg et al. 2016). Here we present a major update of MACSE with an improved version enriched by a series of subprograms aimed at facilitating the production and handling of multiple alignments of protein-coding sequences. Altogether, the subprograms implemented in the new MACSE v2 release compose a powerful toolkit now easily accessible through a graphical user interface (fig. 1).
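To illustrate how a nucleotide alignment can keep its codon structure while still recording frameshifts, here is a minimal sketch that translates one aligned row codon by codon. It assumes the convention visible in fig. 1, where `!` marks a frameshift, and it assumes the standard genetic code; the function name `translate_aligned` and the use of `X` for unrecognized codons are hypothetical choices for this example, not part of MACSE.

```python
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
# Using the standard code here is an assumption made for illustration.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(nt_row):
    """Translate one row of a codon alignment, three columns at a time.

    '---' (a whole-codon gap) becomes '-', any codon containing '!'
    (a frameshift marker) becomes '!', stop codons become '*', and
    anything else not found in the table becomes 'X'.
    """
    if len(nt_row) % 3 != 0:
        raise ValueError("aligned row length must be a multiple of three")
    residues = []
    for start in range(0, len(nt_row), 3):
        codon = nt_row[start:start + 3].upper()
        if codon == "---":
            residues.append("-")
        elif "!" in codon:
            residues.append("!")
        else:
            residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


# One missing nucleotide in the second codon is padded with '!',
# so the downstream codons stay in frame.
print(translate_aligned("ATGGC!AAATAA---"))  # M!K*-
```

Because the frameshift is confined to a single marked codon, the remaining codons can still be compared at the amino acid level.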

Fig. 1.

The graphical user interface of MACSE v2 (left) lets the user select the desired subprogram, browse the file system to choose input FASTA files, and set parameter values. It automatically generates the corresponding command line (bottom left). When the user selects a new subprogram or clicks on an option field, brief help related to this program or option is displayed at the top of the interface (red arrows). An exemplar data set of 15 mitochondrial NADH dehydrogenase subunit 3 (nad3) gene sequences of turtles has been aligned by MACSE (parameters shown). The resulting alignment is displayed at the nucleotide (top right), codon (middle), and amino acid (bottom right) levels using SeaView v4.6.4 (Gouy et al. 2010). Exclamation marks (!) emphasize the frameshifts detected by MACSE, most of which correspond to programmed frameshift mutations (Russell and Beckenbach 2008).

The core alignment subprogram (alignSequences) has been improved in performance through a faster estimation of its objective function, namely the SP-score, thanks to recently derived optimal algorithmic solutions (Ranwez 2016). Additional parameters have also been introduced to control the speed/extensiveness ratio of the heuristic search for an alignment optimizing the SP-score. MACSE v2 uses a progressive alignment strategy to obtain an initial draft of the multiple sequence alignment, which is subsequently improved using the 2-cut refinement strategy. This widespread strategy, also used for instance by MUSCLE (Edgar 2004), consists of partitioning the current solution into two subalignments that are subsequently realigned. The resulting alignment replaces the previous one if its SP-score is improved, and the refinement process stops when no more improvements are found (see Ranwez et al. 2011; Ranwez 2016 for algorithmic details).

A tricky part of multiple sequence alignment is the choice of the elementary cost of each possible event.
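To make these elementary costs concrete, the following sketch computes a sum-of-pairs (SP) cost for a small amino acid alignment with affine gap costs. It is a naive pairwise loop, not the faster algorithms of Ranwez (2016), and the cost values, the cost-minimizing sign convention, and the names `pair_cost` and `sp_cost` are assumptions chosen for illustration rather than MACSE defaults.

```python
from itertools import combinations


def pair_cost(a, b, mismatch=1.0, gap_open=7.0, gap_extend=1.0):
    """Cost of the pairwise alignment induced by two rows of an MSA.

    Columns where both rows hold a gap are ignored. The first column of
    a gap run costs gap_open, each following column costs gap_extend.
    """
    cost = 0.0
    in_gap_a = in_gap_b = False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column absent from this pairwise projection
        if x == "-":
            cost += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            cost += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            cost += 0.0 if x == y else mismatch
            in_gap_a = in_gap_b = False
    return cost


def sp_cost(rows, **costs):
    """Sum the induced pairwise costs over all pairs of rows."""
    return sum(pair_cost(a, b, **costs) for a, b in combinations(rows, 2))


rows = ["MK-L", "MKAL", "M--L"]
print(sp_cost(rows))                # 22.0
print(sp_cost(rows, gap_open=2.0))  # 7.0
```

The same alignment scores very differently once `gap_open` changes, which is why the relative weights of these events matter so much in practice.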
For instance, the relative costs of gap openings and gap extensions with respect to amino acid substitution strongly impact the final result, and no efficient strategy has been found so far to select the ideal costs with respect to the sequences to be aligned (Wheeler and Kececioglu 2007). MACSE requires additional costs for frameshifts and stop codons that are no easier to set than traditional gap-associated costs. We provide default values that have proved to be effective based on our experience. This is further discussed in the MACSE online documentation, which provides guidelines for handling specific sequences such as pseudogenes or RNAseq contigs resulting from error-prone long-read sequencing technologies.

The trimNonHomologousFragments subprogram was developed to remove long sequence fragments that are unrelated to other sequences. Indeed, positioning long insertions in one or several sequences could drastically slow down and impede the alignment process. Moreover, long insertions often prove useless in the end since they are removed by alignment filtering tools in subsequent analyses. When a compatibility graph of maximal exact matches (MEMs) is constructed between two genomic sequences, they can be rapidly aligned after identification of the longest weighted path (Höhl et al. 2002). We extended this approach to handle the translation of nucleotide sequences in the three possible coding frames using a compressed amino acid alphabet. This allows identifying and trimming long insertions present in only a few sequences, as such regions are rarely part of long MEM paths.

The enrichAlignment subprogram can be used to sequentially add new DNA sequences to an existing alignment. Its input parameters allow defining criteria that the additional sequences should fulfil to be actually incorporated into the final alignment.
For instance, sequences can be automatically discarded when, once aligned, they would contain a stop codon, too many gaps, or more than a given number of frameshifts. The original alignment can either be sequentially enriched, or kept unchanged so that all sequences are compared with the same reference alignment. This latter option is especially useful for metabarcoding projects based on markers such as the mitochondrial Cytochrome Oxidase subunit I (cox1) gene. This typically involves enriching a reference alignment containing sequences from databases such as BOLD (Ratnasingham and Hebert 2007) or MIDORI (Machida et al. 2017) with thousands of newly generated sequences.

The reportMaskAA2NT subprogram takes as input a nucleotide alignment and a filtered version of the corresponding amino acid alignment, for example, produced by HMMcleaner (Philippe et al. 2017), and reports this filtering at the codon level. By default, it additionally filters out small sequence fragments mostly surrounded by gaps or filtered nucleotides.

Other MACSE v2 subprograms allow performing useful alignment manipulations such as translating sequences using different genetic codes in the same alignment (translateNT2AA); restricting a coding alignment to a subset of sequences and/or sites (splitAlignment, trimAlignment); or refining an existing alignment using the 2-cut strategy to improve its SP-score (refineAlignment).

The command line interface is still key in most analyses that require running MACSE v2 in parallel on hundreds or thousands of data sets using a computing cluster. However, as the number of subprograms and options increased significantly, we now provide a user-friendly graphical interface. This should make it easier for new users to adopt MACSE v2 and hopefully broaden its usage and application scope.

MACSE v2 is Java software freely available under the CECILL license (GPL variant) at https://bioweb.supagro.inra.fr/macse/, last accessed August 22, 2018.
MACSE v2 and OMM_MACSE, a pipeline strongly relying on the MACSE v2 toolkit that has been used to align the thousands of orthologous genes in the OrthoMAM database (Douzery et al. 2014), are also available through dedicated web services at http://mbb.univ-montp2.fr/MBB/, last accessed August 22, 2018.
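Returning to the acceptance criteria described for enrichAlignment (no stop codon, not too many gaps, a bounded number of frameshifts), a minimal sketch of such a filter on an already aligned nucleotide row could look as follows. The function name `passes_filters`, its thresholds, the use of `!` as the frameshift marker, and the standard-code stop codons are all assumptions for illustration; they do not reproduce MACSE's actual options or counting rules.

```python
# Stop codons of the standard genetic code; mitochondrial codes differ,
# so this set is an assumption made for the example.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_filters(aligned_nt, max_stops=0, max_gap_fraction=0.5,
                   max_frameshifts=2):
    """Return True if an aligned nucleotide row meets all three criteria.

    Frameshifts are counted as codons containing '!' and gaps as the
    fraction of '-' characters in the row (both simplifications).
    """
    if not aligned_nt:
        return False
    codons = [aligned_nt[i:i + 3].upper()
              for i in range(0, len(aligned_nt), 3)]
    stops = sum(codon in STOP_CODONS for codon in codons)
    frameshifts = sum("!" in codon for codon in codons)
    gap_fraction = aligned_nt.count("-") / len(aligned_nt)
    return (stops <= max_stops
            and frameshifts <= max_frameshifts
            and gap_fraction <= max_gap_fraction)


candidates = {
    "clean": "ATGGCCAAA",
    "stop": "ATGTAAAAA",
    "gappy": "ATG------",
    "shifted": "AT!GC!A!A",
}
kept = [name for name, row in candidates.items() if passes_filters(row)]
print(kept)  # ['clean']
```

Keeping the reference alignment fixed and applying such a test to each candidate independently mirrors the option, described above, in which all new sequences are compared with the same reference.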

References (15 in total)

1. Efficient multiple genome alignment.

Authors: Michael Höhl; Stefan Kurtz; Enno Ohlebusch
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

2. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

3. Multiple alignment by aligning alignments.

Authors: Travis J Wheeler; John D Kececioglu
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

4. SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building.

Authors: Manolo Gouy; Stéphane Guindon; Olivier Gascuel
Journal: Mol Biol Evol Date: 2009-10-23 Impact factor: 16.240

5. Metazoan mitochondrial gene sequence reference datasets for taxonomic assignment of environmental samples.

Authors: Ryuji J Machida; Matthieu Leray; Shian-Lei Ho; Nancy Knowlton
Journal: Sci Data Date: 2017-03-14 Impact factor: 6.444

6. Recoding of translation in turtle mitochondrial genomes: programmed frameshift mutations and evidence of a modified genetic code.

Authors: R David Russell; Andrew T Beckenbach
Journal: J Mol Evol Date: 2008-12 Impact factor: 2.395

7. Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals.

Authors: Xun Lan; Jonathan K Pritchard
Journal: Science Date: 2016-05-20 Impact factor: 47.728

8. A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents.

Authors: Matthieu Leray; Joy Y Yang; Christopher P Meyer; Suzanne C Mills; Natalia Agudelo; Vincent Ranwez; Joel T Boehm; Ryuji J Machida
Journal: Front Zool Date: 2013-06-14 Impact factor: 3.172

9. bold: The Barcode of Life Data System (http://www.barcodinglife.org).

Authors: Sujeevan Ratnasingham; Paul D N Hebert
Journal: Mol Ecol Notes Date: 2007-05-01

10. Two Simple and Efficient Algorithms to Compute the SP-Score Objective Function of a Multiple Sequence Alignment.

Authors: Vincent Ranwez
Journal: PLoS One Date: 2016-08-09 Impact factor: 3.240
