
IVA: accurate de novo assembly of RNA virus genomes.

Martin Hunt, Astrid Gall, Swee Hoe Ong, Jacqui Brener, Bridget Ferns, Philip Goulder, Eleni Nastouli, Jacqueline A Keane, Paul Kellam, Thomas D Otto.

Abstract

MOTIVATION: An accurate genome assembly from short read sequencing data is critical for downstream analysis, for example allowing investigation of variants within a sequenced population. However, assembling sequencing data from virus samples, especially RNA viruses, into a genome sequence is challenging due to the combination of viral population diversity and extremely uneven read depth caused by amplification bias in the inevitable reverse transcription and polymerase chain reaction amplification process of current methods.
RESULTS: We developed a new de novo assembler called IVA (Iterative Virus Assembler) designed specifically for read pairs sequenced at highly variable depth from RNA virus samples. We tested IVA on datasets from 140 sequenced samples from human immunodeficiency virus-1 or influenza-virus-infected people and demonstrated that IVA outperforms all other virus de novo assemblers.
AVAILABILITY AND IMPLEMENTATION: The software runs under Linux, has the GPLv3 licence and is freely available from http://sanger-pathogens.github.io/iva
© The Author 2015. Published by Oxford University Press.


Year:  2015        PMID: 25725497      PMCID: PMC4495290          DOI: 10.1093/bioinformatics/btv120

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The main challenge of assembling sequence data from an RNA virus sample into a consensus sequence lies in the extremely variable read depth from current sequencing approaches combined with the extensive viral population diversity. An example is shown in Figure 1 where regions of the genome are represented with different read depths, caused by the separate reverse transcription polymerase chain reaction amplification of overlapping regions of the genome before library preparation. Further, there is a relatively high rate of single base differences in the reads throughout the genome. These properties of the data cause standard assembly algorithms to produce multiple contigs covering the same region and, more significantly, miss regions of the genome entirely (Yang ).
Fig. 1.

Example HIV-1 assemblies. Plots show the proportion of single base differences per mapped read compared to the IVA contig, the read depth and contigs from PRICE, Trinity and VICUNA aligned to the single IVA contig. The minimum read depth is 63

Despite the availability of at least 40 genome assemblers (http://en.wikipedia.org/wiki/Sequence_assembly), VICUNA (Yang ) and PRICE (Ruby ) are currently the only assemblers designed for virus data. VICUNA tackles the assembly problem by first clustering the reads that should belong to the same contig, using min hashes to infer similarity. Contigs are generated and then merged to form the final output. PRICE begins with seed sequences, which are iteratively extended by generating new sequence from local assemblies of reads at contig ends. In addition, the RNA-seq assembler Trinity (Grabherr ) has been used to assemble virus data because it can handle irregular read depth. Trinity constructs de Bruijn graphs from clusters of the reads, then resolves each cluster into transcripts by tracing reads and their mates through the graphs. Our approach is similar to that of PRICE, except we extend contigs more conservatively using consensus kmers from the reads instead of using local assemblies. Also, our new assembler, called IVA (Iterative Virus Assembler), is a completely de novo assembler, whereas PRICE must be provided with seed sequences to be extended into contigs.

2 Methods

A flowchart describing the assembly process is shown in Supplementary Figure S1 and full details are in the Supplementary Material. Before assembling, adapter sequences are removed from the reads using Trimmomatic (Bolger ), followed by the trimming of polymerase chain reaction primer sequences.

After trimming the reads, the most abundant kmer among the reads is found using kmc (Deorowicz ). This short seed kmer is iteratively extended into a contig using reads that have a perfect match to that kmer, treating the reads as unpaired. A list of all possible extension sequences is made (one sequence per overhanging read). IVA identifies the kmer of length k among prefixes of the possible extension sequences, for the largest possible k, such that the kmer appears at least 10 times and is at least four times as abundant as the next most common kmer of length k. In this way, the seed is iteratively extended until its length reaches the insert size of the read pairs.

Contigs are extended in a similar manner to seed kmers, except that instead of using perfect string matches, reads are mapped to the contigs with SMALT (http://www.sanger.ac.uk/resources/software/smalt/). During mapping, IVA also uses SAMtools (Li ). Reads that are mapped as part of a perfect pair (in the correct orientation and separated by the correct distance) and that hang off a contig end are used to extend the contig. The sequence added to a contig end is constructed using the method described above for kmer extensions.

When no more contigs can be extended, they are cleaned as follows before generating a new seed. Contig ends are trimmed for quality, and overlapping contigs are merged based on sequence similarity found at their ends using nucmer (Kurtz ). Assembly stops either when a pre-defined maximum contig number is reached or when no new seeds can be made.
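The consensus kmer rule used for extension can be illustrated with a short Python sketch. This is not IVA's actual code: the function name `consensus_extension` and the parameters `min_count` and `min_ratio` are hypothetical, with defaults taken from the thresholds stated above (at least 10 occurrences, at least four times the next most common kmer). The sketch also assumes that an extension sequence shorter than k simply does not contribute a prefix of length k.

```python
from collections import Counter


def consensus_extension(extensions, min_count=10, min_ratio=4):
    """Return the longest prefix kmer that dominates the candidate extensions.

    extensions: list of strings, one overhanging sequence per read.
    Returns "" when no kmer of any length passes both thresholds.
    """
    max_k = max((len(s) for s in extensions), default=0)
    # Try the largest k first, so the first kmer that passes is the longest.
    for k in range(max_k, 0, -1):
        # Count every prefix of length k (assumption: shorter reads are skipped).
        counts = Counter(s[:k] for s in extensions if len(s) >= k)
        if not counts:
            continue
        ranked = counts.most_common(2)
        best, best_n = ranked[0]
        runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
        # Both conditions from the text: absolute support and relative dominance.
        if best_n >= min_count and best_n >= min_ratio * runner_up_n:
            return best
    return ""


# Twelve reads agree on ACGT, three disagree: the 4-mer passes both thresholds.
reads_a = ["ACGT"] * 12 + ["ACGA"] * 2 + ["ACTT"]
# An even split at the last base: only the shared 3-mer prefix is accepted.
reads_b = ["ACGT"] * 6 + ["ACGA"] * 6
print(consensus_extension(reads_a))  # ACGT
print(consensus_extension(reads_b))  # ACG
```

Searching from the largest k downwards means that disagreement near the end of the overhang shortens the extension instead of blocking it, which is one way to read the conservative behaviour described above.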

3 Results

We evaluated IVA, PRICE, Trinity and VICUNA with different parameters on Illumina paired reads from 42 human immunodeficiency virus 1 (HIV-1) samples and 98 Influenza A and B virus samples. See the Supplementary Material for the full analysis. To compare the assemblies for each sample, we picked the closest reference from a pool of genomes using Kraken (Wood and Salzberg, 2014). For the accession numbers and complete evaluation procedure, see the Supplementary Material. We generated quality metrics using (i) nucmer to compare contigs with a reference genome, (ii) GAGE (Salzberg ) analysis code and (iii) RATT (Otto ) to transfer annotation from the reference to the assembly.

The ideal assembler output is defined as exactly one contig for HIV-1, or exactly one contig for each Influenza virus genome segment, with the expected length compared to the closest reference and no duplication. IVA generated ideal assemblies for 57% of the HIV-1 samples and 21% of the Influenza virus samples (Table 1 and Supplementary Tables S1 and S2), significantly more than the other assemblers. These low numbers are generally due to contigs of incorrect length (Fig. 2a) or duplications in the assemblies (Fig. 2b, Supplementary Figs S2 and S3, Table 1 and Supplementary Tables S1 and S2). IVA had the smallest variation in these results, especially for the Influenza virus samples (Fig. 2, Supplementary Figs S2 and S3, Table 1 and Supplementary Tables S1 and S2).

The mean proportion of each reference genome assembled into contigs ranged from 89.8% (Trinity) to 98.3% (VICUNA) for HIV-1. The corresponding values for Influenza virus ranged from 89.8% (PRICE) to 98.8% (IVA). The mean percentage of annotation features transferred by RATT from IVA assemblies was 99.0% on both the HIV-1 and Influenza virus samples. This was more than the other assemblers, except VICUNA with alternative settings that achieved 99.2% mean annotation transfer, at the expense of a duplication rate more than double that of IVA (Supplementary Table S1).
There were few assembly errors—Trinity produced none, and IVA and VICUNA made one error each. The typical run time was under 10 h and none of the assemblers had excessive memory requirements (Supplementary Fig. S4). IVA was slightly slower on the HIV-1 samples but was comparable to PRICE and faster than VICUNA on the Influenza virus data.
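The definition of an ideal assembly can be expressed as a simple check. The sketch below is only an illustration under stated assumptions: the function `is_ideal_assembly`, its input layout (a mapping from each reference segment to the lengths of the contigs that match it) and the length tolerance `min_frac`/`max_frac` are all hypothetical. The text does not give the tolerance used to decide whether a contig has the expected length; the real evaluation procedure is in the Supplementary Material.

```python
def is_ideal_assembly(ref_lengths, contig_hits, min_frac=0.9, max_frac=1.1):
    """Check the 'ideal assembly' criterion for one sample.

    ref_lengths: dict mapping segment name -> reference segment length.
                 HIV-1 has a single entry; Influenza has one per segment.
    contig_hits: dict mapping segment name -> list of lengths of the
                 contigs that match that segment.
    min_frac, max_frac: ASSUMED tolerance for "expected length".
    """
    for segment, ref_len in ref_lengths.items():
        hits = contig_hits.get(segment, [])
        # Exactly one contig per segment: none means a missed region,
        # more than one means fragmentation or duplication.
        if len(hits) != 1:
            return False
        frac = hits[0] / ref_len
        if not (min_frac <= frac <= max_frac):
            return False
    return True


# One contig of roughly the reference length: ideal.
print(is_ideal_assembly({"HIV-1": 9700}, {"HIV-1": [9650]}))        # True
# Genome split across two contigs: not ideal.
print(is_ideal_assembly({"HIV-1": 9700}, {"HIV-1": [5000, 4700]}))  # False
```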
Table 1. Summary of assembly QC results

                                          HIV-1                            Influenza
                                          IVA   PRICE  Trinity  VICUNA     IVA   PRICE  Trinity  VICUNA
Ideal assemblies (%) (a)                  57.1  11.9   14.3     2.4        21.4  0.0    1.0      0.0
Mean reference bases assembled (%)        97.9  97.2   89.8     98.3       98.8  89.8   97.6     94.3
Mean % annotation transferred             99.0  90.0   86.2     97.3       99.0  92.1   96.1     95.3
Total assembly errors (b)                 1     4      0        1          0     6      0        0

(a) HIV-1: the entire genome must be assembled into a unique contig. Influenza: each segment must be assembled into a unique contig.

(b) An error is an inversion, relocation or translocation reported by GAGE. Numbers reported are the total across all assemblies. Supplementary Tables S1 and S2 expand on this table.

Fig. 2.

Comparison of assembly success. (a) For each segment of the reference, the longest matching contig was found. This plot shows the total length of these contigs for each assembly, as a percentage of the reference length. (b) Total assembly lengths, excluding contamination by only counting contigs that match the reference, as a percentage of the reference length
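The two quantities plotted in Fig. 2 can be computed as in the following sketch. It is an illustration only, not the evaluation code used in the paper: the function name `assembly_length_percentages` and the input layout (reference segment lengths plus, for each segment, the lengths of the contigs matching it) are assumptions, and it takes for granted that contigs have already been matched to reference segments, for example with nucmer.

```python
def assembly_length_percentages(ref_lengths, contig_hits):
    """Return (longest_pct, total_pct) for one assembly.

    longest_pct: sum over segments of the longest matching contig,
                 as a percentage of the total reference length (Fig. 2a).
    total_pct:   sum of all matching contig lengths, as a percentage of
                 the total reference length (Fig. 2b). Contigs that match
                 no segment are absent from contig_hits, so contamination
                 is excluded.
    """
    total_ref = sum(ref_lengths.values())
    longest = sum(max(contig_hits.get(seg, []), default=0) for seg in ref_lengths)
    total = sum(sum(contig_hits.get(seg, [])) for seg in ref_lengths)
    return 100 * longest / total_ref, 100 * total / total_ref


# Two segments of 1000 bases; segment "a" has a 900-base contig plus a
# redundant 300-base contig, segment "b" is covered by one full contig.
longest_pct, total_pct = assembly_length_percentages(
    {"a": 1000, "b": 1000},
    {"a": [900, 300], "b": [1000]},
)
print(longest_pct, total_pct)  # 95.0 110.0
```

A value below 100 for the first number points to missing or truncated sequence, while a second number well above the first points to duplicated sequence.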


4 Discussion

The small number of ideal assemblies produced by the available tools shows that assembling RNA virus genomes is challenging. However, IVA was consistently better at producing single sequences representing the consensus sequence of each virus population, especially on the Influenza virus data. In contrast, the other tools tended to either produce multiple copies of parts of each genome or miss entire regions from their output. In summary, we developed IVA specifically to assemble short read sequencing data from RNA virus samples and have shown that it produces significantly higher quality assemblies than existing approaches.
References

1.  GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Authors:  Steven L Salzberg; Adam M Phillippy; Aleksey Zimin; Daniela Puiu; Tanja Magoc; Sergey Koren; Todd J Treangen; Michael C Schatz; Arthur L Delcher; Michael Roberts; Guillaume Marçais; Mihai Pop; James A Yorke
Journal:  Genome Res       Date:  2012-01-06       Impact factor: 9.043

2.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

3.  Versatile and open software for comparing large genomes.

Authors:  Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal:  Genome Biol       Date:  2004-01-30       Impact factor: 13.583

4.  De novo assembly of highly diverse viral populations.

Authors:  Xiao Yang; Patrick Charlebois; Sante Gnerre; Matthew G Coole; Niall J Lennon; Joshua Z Levin; James Qu; Elizabeth M Ryan; Michael C Zody; Matthew R Henn
Journal:  BMC Genomics       Date:  2012-09-13       Impact factor: 3.969

5.  RATT: Rapid Annotation Transfer Tool.

Authors:  Thomas D Otto; Gary P Dillon; Wim S Degrave; Matthew Berriman
Journal:  Nucleic Acids Res       Date:  2011-02-08       Impact factor: 16.971

6.  Disk-based k-mer counting on a PC.

Authors:  Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Szymon Grabowski
Journal:  BMC Bioinformatics       Date:  2013-05-16       Impact factor: 3.169

7.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

8.  PRICE: software for the targeted assembly of components of (Meta) genomic sequence data.

Authors:  J Graham Ruby; Priya Bellare; Joseph L Derisi
Journal:  G3 (Bethesda)       Date:  2013-05-20       Impact factor: 3.154

9.  Kraken: ultrafast metagenomic sequence classification using exact alignments.

Authors:  Derrick E Wood; Steven L Salzberg
Journal:  Genome Biol       Date:  2014-03-03       Impact factor: 13.583

10.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937
