
VIRULIGN: fast codon-correct alignment and annotation of viral genomes.

Pieter J K Libin^1,2, Koen Deforche³, Ana B Abecasis⁴, Kristof Theys¹.

Abstract

SUMMARY: Virus sequence data are an essential resource for reconstructing spatiotemporal dynamics of viral spread as well as to inform treatment and prevention strategies. However, the potential benefit of these applications critically depends on accurate and correctly annotated alignments of genetically heterogeneous data. VIRULIGN was built for fast codon-correct alignments of large datasets, with standardized and formalized genome annotation and various alignment export formats.
AVAILABILITY AND IMPLEMENTATION: VIRULIGN is freely available at https://github.com/rega-cev/virulign as an open source software project. SUPPLEMENTARY INFORMATION: Supplementary data is available at Bioinformatics online.

Year: 2019 PMID: 30295730 PMCID: PMC6513156 DOI: 10.1093/bioinformatics/bty851

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Many viral pathogens, in particular RNA viruses, are fast evolving within and between hosts, and markers of adaptation to changing conditions can be detected in their genomes (Lemey ). Structural, functional and phenotypic predictions from viral genotypes have fostered advances in drug design, diagnostics and clinical management of viral infections (Houldcroft ; Pybus and Rambaut, 2009; Theys ). Virus genetic data are also a requisite for inference of evolutionary histories and active epidemiological surveillance (Dellicour ; Hadfield ; Libin ). However, genotype-dependent applications are strongly affected by the quality of underlying sequence alignments. The process of aligning virus sequences is challenged by their extensive genetic diversity and frequent insertions and deletions, and as a result a plethora of alignment software exists with different objectives and applications. Aligners for mapping and assembling sequence reads to study virus populations have significantly advanced in recent years (Posada-Cespedes ). Algorithms to align viral consensus or Sanger sequences, resulting in pairwise or multiple alignments, have made less progress over time. Such alignments are, however, crucial for various aspects of public health and diagnostics. Multiple sequence alignments (MSAs) of viral genes or genome sequences are often constructed by progressive-iterative approaches such as MAFFT, MUSCLE or Clustal Omega (Edgar, 2004; Katoh and Standley, 2013; Sievers ), partly due to their generic applicability and ease of use. These heuristic methods are less capable of mitigating frameshift errors and can be sensitive to noise in sequence data, which is detrimental when protein sequences need to be analyzed in the correct open reading frame (ORF) (e.g. use of codon substitution models in phylogenetics or detection of drug resistance mutations). Alternatively, guidance of the alignment process by a reference sequence can overcome these limitations (Tzou ).
However, the use of ill-annotated reference sequences hampers the outcome of the alignment. Moreover, inferior sequences in the dataset will have a large impact on the MSA result, and restraining the MSA process by their automatic rejection will further improve the reproducibility and quality of the alignment. As such, we developed VIRULIGN which is a fast reference-guided and codon-correct alignment and annotation tool for protein coding sequences of closely-related viruses.

2 Related work

Other codon-aware alignment tools are available (e.g. MACSE and TranslatorX) (Abascal ; Ranwez ), but unlike VIRULIGN they do not support guiding the alignment process with a reference sequence. HAlign shares VIRULIGN’s objective of aligning large datasets of closely related sequences, but does not focus on codon-correct alignments (Zou ).

3 Features

VIRULIGN is a cross-platform (GNU/Linux, Unix, MacOS and Windows) and easy-to-use command line application. VIRULIGN can handle large sequence datasets in a computationally efficient manner, as shown experimentally (see Section 5) and through an analysis of the algorithm’s computational complexity (see Section 4). Considering a single ORF, VIRULIGN’s alignment algorithm is designed for closely related viral genomes with a conserved gene order and corrects the alignment for codon anomalies resulting from single nucleotide alterations. Automated frameshift correction and genome annotation increase the quality of the alignment and reduce the need for manual editing, thereby addressing the need for reproducible research (Peng, 2011). A codon-correct MSA is essential for evolutionary hypothesis testing and phylogenetic inference using codon substitution models (Shapiro ), and for detecting footprints of selective constraints on coding sequence alignments. In addition, the identification of amino acid mutations (including insertions or deletions) associated with drug resistance (HIV-1, Hepatitis C virus, Influenza virus), disease outcome (Hepatitis B virus) or epidemic potential (Ebola virus, Chikungunya virus) is an important aspect of the management of infectious diseases. VIRULIGN enables its users to provide formalized protein annotation of the target CDS, relative to positions within a curated reference genome, through the use of an XML file. This annotation file can be easily defined by the user, and VIRULIGN provides pre-defined annotations for several viral pathogens (see Supplementary Material). The XML file supports the description of a single ORF. In order to handle multi-ORF genomes, multiple annotation files can be specified to produce distinct alignments for each of the different ORFs, which we demonstrate in the context of HIV in the Supplementary Material.
This feature facilitates genome-wide or protein-specific analyses, and provides virologists with a tool to evaluate and optimize reference sequences in terms of completeness and representativeness (Theys ). VIRULIGN can export the computed alignment to various output formats, where different options can be combined to obtain an appropriate alignment representation. Alignments can be exported, either in nucleotide or amino acid alphabet, as FASTA and CSV files, with the latter representing protein positions and mutations as distinct columns. VIRULIGN is an open-source project (GPLv2 license) written in the C++ programming language. VIRULIGN was previously used in different research areas in infectious diseases (see Supplementary Material for an extensive overview), and can be easily integrated in data management and analysis platforms for viral pathogens.
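To make the CSV representation described in this section concrete, the following Python sketch writes one row per sequence and one column per protein position. It is only an illustration under stated assumptions: the function name, the `seqid` header, the `PR_1`-style column labels and the toy sequences are hypothetical and are not meant to reproduce VIRULIGN's actual output layout.

```python
import csv
import io


def alignment_to_csv(names, aligned_aa, protein="PR"):
    """Write reference-aligned amino acid sequences as CSV text.

    Assumes every sequence has one character per reference position,
    with '-' marking a deletion relative to the reference.
    """
    length = len(aligned_aa[0])
    if any(len(seq) != length for seq in aligned_aa):
        raise ValueError("aligned sequences must have equal length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical header: sequence id, then one column per position.
    writer.writerow(["seqid"] + [f"{protein}_{i}" for i in range(1, length + 1)])
    for name, seq in zip(names, aligned_aa):
        writer.writerow([name] + list(seq))
    return buffer.getvalue()


# Toy example: a reference and one target with a deletion at position 3.
csv_text = alignment_to_csv(["ref", "t1"], ["MAKW", "MA-W"])
print(csv_text)
```

A position-per-column layout of this kind can be loaded directly into spreadsheet or statistics software, which is presumably why such a representation is convenient for mutation analyses.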

4 Methods

VIRULIGN attempts to construct an MSA of a set of target sequences $T$ with respect to the reference sequence r (Figure of the alignment process in Supplementary Material). For each target sequence $t \in T$, a codon-correct pairwise alignment with r is computed. During this procedure, different alignments are generated using the Needleman-Wunsch global alignment algorithm (Needleman and Wunsch, 1970). We will refer to the amino acid representation of the reference sequence r as AA(r). Firstly, we perform a Needleman-Wunsch nucleotide alignment of r and t, resulting in alignment $A_{nt}$. Secondly, the three ORFs of target sequence t are translated to their respective amino acid representation, and a reference is kept from each of the amino acids to their respective codon. Each of these amino acid sequences is aligned to AA(r) using the Needleman-Wunsch algorithm. From these three alignments, the alignment with the highest alignment score is selected. This amino acid alignment is then converted to a nucleotide alignment $A_{aa}$, by replacing each of the amino acids with their respective nucleotide codon. Thirdly, if $A_{nt}$ and $A_{aa}$ differ, we suspect that a frameshift has occurred in the target sequence. We then attempt to fix the frameshift by detecting the first isolated gap whose size is not a multiple of three, and replacing it by an n nucleotide symbol. Finally, we return to the second step and the procedure is repeated until no more frameshifts are detected, or the maximum number of frameshifts (i.e. a configuration option) has been exceeded. In the latter case, target sequence t is excluded from the MSA, and an error is reported. This procedure results in a set of codon-correct aligned target sequences, where each of these alignments contains information about possible insertions in the target sequence. This data structure can be exported to an MSA, in a variety of output formats (see Section 3), by iterating over the alignment columns in each of the pairwise alignments.
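Two steps of this procedure, selecting the best-scoring reading frame and locating the first gap whose length is not a multiple of three, can be sketched in Python as follows. This is a minimal illustration and not VIRULIGN's C++ implementation: the function names, the simple match/mismatch/gap scores and the toy sequences are assumptions made for this example, and only the alignment score (not the alignment itself) is computed.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt, frame=0):
    """Translate a nucleotide string in the given frame (0, 1 or 2)."""
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not VIRULIGN's.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_frame(ref_aa, target_nt):
    """Return (frame, score) of the translation that best matches ref_aa."""
    scores = [nw_score(ref_aa, translate(target_nt, f)) for f in range(3)]
    frame = max(range(3), key=scores.__getitem__)
    return frame, scores[frame]


def first_frameshift_gap(aligned_nt):
    """Return (start, length) of the first gap run not divisible by three."""
    for run in re.finditer(r"-+", aligned_nt):
        if len(run.group()) % 3 != 0:
            return run.start(), len(run.group())
    return None


ref_nt = "ATGGCCAAGTGGGAT"       # toy reference, translates to MAKWD
target_nt = "C" + ref_nt         # same sequence shifted by one nucleotide
print(best_frame(translate(ref_nt), target_nt))
print(first_frameshift_gap("ATG---GCC-AAG"))
```

In the toy example the target carries one extra leading nucleotide, so the second reading frame (index 1) recovers the reference protein, while the single-nucleotide gap in the second call is the kind of isolated gap the procedure treats as a suspected frameshift.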
Because of the way the VIRULIGN algorithm operates, alignment errors will be propagated as frameshift errors. VIRULIGN enables the user to control quality by providing a parameter to bound the number of allowed frameshift corrections. To derive VIRULIGN’s computational complexity, we observe that for each target sequence $t \in T$, a constant number of Needleman–Wunsch alignments is performed. It is well known that the computational complexity of a Needleman–Wunsch alignment of a sequence tuple (s1, s2) is $O(|s_1| \cdot |s_2|)$ in both space and time (Needleman and Wunsch, 1970). As in VIRULIGN we consider the reference sequence r and a set of target sequences $T$, and each target sequence t is aligned to r, the maximal Needleman–Wunsch computational complexity is $O(|r| \cdot \max_{t \in T} |t|)$. As this applies to all target sequences, VIRULIGN’s full computational complexity is $O(|T| \cdot |r| \cdot \max_{t \in T} |t|)$.
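The consequence of this complexity argument, namely that the work grows linearly with the number of target sequences, can be checked with a few lines of arithmetic. The Python sketch below counts dynamic-programming cells for one Needleman–Wunsch alignment per target and compares the count with the upper bound; the function names and the sequence lengths are hypothetical, and constant factors (several alignments per target, repeated frameshift rounds) are ignored.

```python
def nw_cells(ref_len, target_lens):
    """Exact number of DP cells for one alignment of each target to r."""
    return sum(ref_len * t_len for t_len in target_lens)


def nw_cells_bound(ref_len, target_lens):
    """Upper bound |T| * |r| * max |t| used in the complexity argument."""
    return len(target_lens) * ref_len * max(target_lens)


ref_len = 1000                   # hypothetical reference length
targets = [900, 1000, 1100]      # hypothetical target lengths

print(nw_cells(ref_len, targets), nw_cells_bound(ref_len, targets))
# Doubling the number of targets doubles the number of cells.
print(nw_cells(ref_len, targets * 2) == 2 * nw_cells(ref_len, targets))
```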

5 Application and future perspectives

We demonstrate VIRULIGN’s abilities by constructing MSAs of real genomic data of Dengue virus (DENV), HIV-1 and Zika virus (ZIKV), which were collected from public databases and also used in studies on viral diversity. Detailed information on the datasets and methods used is available as Supplementary Material. Firstly, full-length genomes from different genotypes of DENV serotype 1 (DENV-1) (n = 1433) were collected from GenBank. This dataset is representative of the DENV-1 worldwide epidemic and was aligned with VIRULIGN, MAFFT, MUSCLE and Clustal Omega. This example shows that, compared to the other tools, VIRULIGN generated an amino acid alignment in the correct ORF without the need for manual correction, while remaining computationally efficient. Secondly, a selected subset of full-length ZIKV genomes (n = 19) was aligned with VIRULIGN using an XML annotation file. The alignment was exported in an amino acid representation to illustrate, in conjunction with other command line utilities, the variability at a glycosylation motif that instigated the effort to correct the ZIKV reference sequence (Theys ). Thirdly, we conducted experiments in the context of HIV-1. HIV-1 exhibits three ORFs that together translate the complete set of viral proteins; however, these different reading frames complicate the alignment of the respective CDS. We used a curated set of full-length HIV-1 genomes (n = 2966) (Li et al., 2017) that was used to study HIV-1 subtype diversity. This dataset was aligned with VIRULIGN to select the gag poly-protein and identify encoded proteins in an efficient manner. Similar operations can be easily applied to other HIV-1 poly-proteins. As a second example, we used VIRULIGN to align a large HIV-1 dataset (n = 111 222) spanning the reverse transcriptase enzyme, obtained from the curated and public Stanford University HIV Drug Resistance Database (HIVDB).
An accurate alignment has significant clinical importance in the context of drug resistance detection. Due to its favorable computational complexity in this context, VIRULIGN produced a better alignment much faster than MAFFT. Through this example, we also demonstrate VIRULIGN’s capabilities to exclude erroneous sequences from the alignment (see Supplementary Material). Finally, we demonstrate the strength of VIRULIGN to quickly detect the presence of resistance mutations by reproducing findings from a recent study on HIV drug resistance. Future developments include a community-driven repository of standardized and curated genome annotations of representative reference sequences and the integration of VIRULIGN in tools for surveillance and genomic epidemiology. Additional areas of interest include the addition of functionalities for multi-ORF alignments, support for non-coding sequences and support for user-defined genetic codes (Taylor ). In this work, we focus on virus species with a relatively short genome. Nonetheless, we believe it to be an interesting direction for future work to explore VIRULIGN’s potential to align viruses with larger genomes (e.g. orthopoxviruses).
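The detection of resistance mutations discussed in this section relies on comparing each aligned amino acid sequence with the reference, position by position. A minimal Python sketch of that comparison is given below. The function name, the `del` notation and the toy sequences are assumptions made for illustration only; they are not taken from VIRULIGN or from the HIVDB data, and insertions relative to the reference are not handled.

```python
def list_mutations(ref_aa, target_aa, first_position=1):
    """List differences between a target and the reference it is aligned to.

    Assumes both strings have one character per reference position,
    with '-' in the target marking a deletion.
    """
    if len(ref_aa) != len(target_aa):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for offset, (ref, obs) in enumerate(zip(ref_aa, target_aa)):
        if obs != ref:
            change = "del" if obs == "-" else obs
            calls.append(f"{ref}{first_position + offset}{change}")
    return calls


# Toy sequences: one substitution and one deletion.
print(list_mutations("MAKWD", "MVK-D"))
# Numbering can start at any reference position of interest.
print(list_mutations("QYMDD", "QYVDD", first_position=180))
```

Because positions are numbered relative to the reference, a codon-correct alignment is what makes such mutation labels comparable across sequences.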

21 in total

1. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

2. Reproducible research in computational science.

Authors: Roger D Peng
Journal: Science Date: 2011-12-02 Impact factor: 47.728

3. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences.

Authors: Beth Shapiro; Andrew Rambaut; Alexei J Drummond
Journal: Mol Biol Evol Date: 2005-09-21 Impact factor: 16.240

Review 4. HIV evolutionary dynamics within and among hosts.

Authors: Philippe Lemey; Andrew Rambaut; Oliver G Pybus
Journal: AIDS Rev Date: 2006 Jul-Sep Impact factor: 2.500

5. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

6. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations.

Authors: Federico Abascal; Rafael Zardoya; Maximilian J Telford
Journal: Nucleic Acids Res Date: 2010-04-30 Impact factor: 16.971

7. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.

Authors: Vincent Ranwez; Sébastien Harispe; Frédéric Delsuc; Emmanuel J P Douzery
Journal: PLoS One Date: 2011-09-16 Impact factor: 3.240

8. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

Review 9. Evolutionary analysis of the dynamics of viral infectious disease.

Authors: Oliver G Pybus; Andrew Rambaut
Journal: Nat Rev Genet Date: 2009-08 Impact factor: 53.242

10. Virus-host co-evolution under a modified nuclear genetic code.

Authors: Derek J Taylor; Matthew J Ballinger; Shaun M Bowman; Jeremy A Bruenn
Journal: PeerJ Date: 2013-03-05 Impact factor: 2.984
