Literature DB >> 34567846

VGEA: an RNA viral assembly toolkit.

Paul E Oluniyi^1,2, Fehintola Ajogbasile^1,2, Judith Oguzie^1,2, Jessica Uwanibe^1,2, Adeyemi Kayode^1,2, Anise Happi², Alphonsus Ugwu^1,2, Testimony Olumade^1,2, Olusola Ogunsanya³, Philomena Ehiaghe Eromon², Onikepe Folarin^1,2, Simon D W Frost^4,5, Jonathan Heeney⁶, Christian T Happi^1,2.

Abstract

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information, these data can be used to study their epidemiology, evolution, transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics due to their high mutation rate and forming quasispecies in the same infected host, bringing about the need to implement advanced bioinformatics tools to assemble consensus genomes well-representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry-out de novo or reference-assisted assembly of viral genomes and assess the quality of the genomes obtained. Most of these tools however exist as standalone workflows and usually require huge computational resources. Here we present (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user's choice of reference sequences and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer. VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License. ©2021 Oluniyi et al.

Entities: Chemical

Keywords: Assembly; Genome; NGS; VGEA

Year: 2021 PMID： 34567846 PMCID： PMC8428259 DOI： 10.7717/peerj.12129

Source DB: PubMed Journal: PeerJ ISSN： 2167-8359 Impact factor: 2.984

Introduction

The most abundant biological entities on Earth are viruses as they can be found among all cellular forms of life. So far, over four thousand five hundred viral species have been discovered, from which a huge amount of sequence information has been collected by researchers and scientists all over the world (Pickett et al., 2012; Sharma, Priyadarshini & Vrati, 2015; Brister et al., 2015). In recent times (past two decades), a number of these viruses have emerged in the human population causing disease outbreaks and sometimes pandemics. These viruses include mainly: Influenza virus, Severe Acute Respiratory Syndrome (SARS) coronavirus, Middle East Respiratory Syndrome (MERS) coronavirus, Ebola virus, Yellow fever virus, Lassa virus (LASV), Zika virus (Chan, 2002; Bean et al., 2013; Folarin et al., 2016; Grubaugh et al., 2017; Metsky et al., 2017; Siddle et al., 2018; Ajogbasile et al., 2020) and SARS-CoV-2 (Chen et al., 2020; Holshue et al., 2020; Sohrabi et al., 2020). During these outbreaks and pandemics, genomic sequencing for identification and characterization of the transmission and evolution of the causative agents have proved to be critical in helping inform disease surveillance and epidemiology. Next Generation Sequencing (NGS) platforms have been widely accepted as high-throughput, open view technologies that have many attractive features for virus detection and assembly (Tang & Chiu, 2010; Mokili, Rohwer & Dutilh, 2012). NGS-based studies have vastly increased our understanding of viral diversity (Reyes et al., 2010; Cantalupo et al., 2011). Pathogen sequence data obtained from NGS experiments are a rich source of information, these data can be used to study their epidemiology, evolution, transmission patterns, and can also inform drug and vaccine design. The field of genomics, especially pathogen genomics has been transformed by NGS, with costs constantly decreasing, equipment becoming more portable/field deployable during outbreaks and remarkable increase in data availability. The huge amount of data being generated requires various processing steps such as removal of primers and adapters, quality filtering and control which is usually crucial for various downstream analysis. Several tools have been developed for these purposes, such as fastp (Chen et al., 2018) and Trimmomatic (Bolger, Lohse & Usadel, 2014). Reconstructing viral genomes from NGS data is usually achieved through de novo assembly (which is the process of assembling genomes using overlapping sequencing reads), or through a reference-guided approach (which involves mapping sequence reads to a reference genome). Numerous tools have been developed for these purposes; SPAdes (Bankevich et al., 2012), Burrows-Wheeler Alignment tool (BWA), V-GAP (Nakamura et al., 2016), VirusTAP (Yamashita, Sekizuka & Kuroda, 2016), V-Pipe (Posada-Céspedes et al., 2021) and viral-ngs (https://github.com/broadinstitute/viral-ngs), amongst others. Contigs generated by de novo assembly however do not provide a complete summary of reads, misassembly can result in the contigs having an incorrect structure, and for parts of the genome where contigs could not be assembled, no information is available. In addition, reference-guided assembly of viral genomes can lead to biased loss of information which can then skew epidemiological and evolutionary conclusions (Wymant et al., 2018). Variant analysis and genome quality assessment to detect variants and changes occurring across the genome of a virus is also a key step in viral genome analysis as viruses (especially RNA viruses) are known to have high mutation rates (Duffy, 2018). Variant analysis is important for detecting outbreak origins and for phylogenetic/phylogeographic studies and best practices for variant identification in microbial genomes have been proposed in literature and adopted to a large extent (Van der Auwera et al., 2013). A number of pipelines that have been developed for downstream analysis of viral genomes require high performance computing (HPC) clusters and/or cloud-based systems e.g., the V-pipe authors recommend running V-pipe on clusters because for most applications, running V-pipe on a local machine may not be efficient (https://github.com/cbg-ethz/V-pipe/wiki/advanced) and some of these pipelines are only web-based such as VirAmp (Wan et al., 2015) and VirusTAP (Yamashita, Sekizuka & Kuroda, 2016. Also, some pipelines have many dependencies to be installed especially if the analysis requires multiple tasks to be performed. In low-and-middle income countries (LMICs) where most scientists do not have access to HPC clusters or cloud-based systems and where internet connection is too unstable to regularly make use of web-based platforms for analysis, this can be a daunting task. The challenges listed above motivated the development of VGEA (Viral Genomes Easily Analyzed, available online at https://github.com/pauloluniyi/VGEA). VGEA makes use of existing bioinformatics pipeline/tools to carry out various viral genome analysis tasks and is built on an advanced workflow management system, Snakemake (Köster & Rahmann, 2012).

Materials and Methods

Datasets

We successfully tested and validated VGEA with SARS-CoV-2 (n = 20) and Lassa Virus (n = 20) datasets sequenced on the illumina MiSeq and illumina FGx sequencing machines in our laboratory at the African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University, Ede, Nigeria. Briefly, samples were inactivated in buffer AVL and viral RNA was extracted according to the QiAmp viral RNA mini kit (Qiagen) manufacturer’s instructions. Extracted RNA was treated with Turbo DNase to remove contaminating DNA, followed by cDNA synthesis with random hexamers. Sequencing libraries were prepared using the Nextera XT kit (Illumina) as previously described (Matranga et al., 2016) and sequenced on the Illumina Miseq platform with 101 base pair paired-end reads. We also tested and validated VGEA with HIV-1 datasets sequenced on the illumina HiSeq 2500 obtained from NCBI Sequence Read Archive (SRA). We made use of 60 test datasets (Lassa Virus (20), SARS-CoV-2 (20) and HIV-1 (20)) for the validation of the VGEA pipeline. All our test datasets are available on figshare (https://doi.org/10.6084/m9.figshare.13009997).

Implementation

The installation of VGEA requires the pipeline to be downloaded onto a personal computer and creation of a conda environment to set up all dependencies. Complete installation steps are in the github README file: https://github.com/pauloluniyi/VGEA/blob/master/README.md The analysis of VGEA is broken down into a set of ‘rules’ that links the output file of an analysis into the input of the next task in the general workflow (Fig. 1). The dependencies are fastp for read trimming and read-level quality control, BWA for mapping sequencing reads to the human reference genome, SAMtools for extracting unmapped reads and also for splitting bam files into fastq files, IVA for de novo assembly to generate contigs, shiver to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit for cleaning shiver assembly for QUAST, QUAST to evaluate/assess the quality of genome assemblies and MultiQC for aggregation of the results from fastp, BWA and QUAST

Figure 1

A schematic workflow of VGEA.

A schematic workflow of VGEA.

User-supplied paired-end fastq files are pre-processed and trimmed using FASTP followed by mapping to the human reference genome with BWA. Following mapping, a BAM file containing unaligned/unmapped reads is extracted using SAMTOOLS. This BAM file is then split into fastq files of forward and reverse reads also with SAMTOOLS after which de novo assembly is carried out using IVA. Following de novo assembly, SHIVER is used to map the reads and generate consensus sequences, and detailed minority variant information (full explanation of the shiver method is in File S1). SEQKIT is used to clean the SHIVER output for QUAST after which genome evaluation and assessment is carried out using QUAST. MULTIQC is then used for aggregation of results from BWA, FASTP and QUAST. All of these tools can be installed using a bioconda channel (Grüning et al., 2018). The input files for VGEA are paired-end fastq files. VGEA allows full customization of the pipeline, so users can modify the parameters used in running their samples. It is possible to modify every step of the workflow to suit the samples being processed. Users can also add more steps to the pipeline as they see fit. The pipeline runs on Linux/Unix and Mac. However, no prior programming is required to run the pipeline and, once the user supplies the input, the whole workflow can run automatically from beginning to end.

Results

VGEA carries out read trimming and quality control tasks on input FASTQ data using fastp (Fig. 2). This increases the quality of data used for subsequent steps of the pipeline.

Figure 2

Fastp pre-processing report for a SARS-CoV-2 test dataset analyzed using VGEA.

VGEA then maps reads to the human reference genome in order to remove human contaminants, the pipeline carries out this step using BWA. Genome assembly and consensus sequence generation is carried out, together with the generation of summary minority-variant information (base frequencies at each position) and detailed minority-variant information (all reads aligned to their correct position in the genome). VGEA carries out assembly using IVA and generates consensus sequences using shiver. Previous study by the shiver developers has shown the systematic superiority of mapping to shiver’s constructed reference compared with mapping the same reads to the closest of 3,249 references: median values of 13 bases called differently and more accurately, zero bases called differently and less accurately, and 205 bases of missing sequence recovered (Wymant et al., 2018). VGEA also assesses the quality of genome assemblies using QUAST. QUAST evaluates metrics such as contig sizes, misassemblies and structural variations, genome representation and its functional elements, variations of N50 based on aligned blocks and then presents these statistics in graphical form. QUAST also makes a histogram of several metrics including the number of complete genes, operons and the genome fraction (%). Finally, VGEA compiles the results of BWA, fastp and QUAST into a single MultiQC report (Fig. 3).

Figure 3

MultiQC report of five SARS-CoV-2 datasets analyzed using VGEA.

Performance evaluation

VGEA makes use of Snakemake’s benchmarking feature which allows the measurement of the CPU usage and wall clock time of each rule in the pipeline. This allows the user to know which step of the pipeline requires the least and highest amount of computational resources. Knowledge of this can help the user decide on the number of threads to dedicate to each rule as VGEA also makes use of Snakemake’s multi-threading feature. Table 1 shows the benchmarking values for a sample SARS-CoV-2 dataset analyzed using VGEA.

Table 1

Benchmarking values (time and CPU usage) for a SARS-CoV-2 dataset analyzed using VGEA.

VGEA rule name	Time (h:m:s)	Maximum RAM used (MB)
human_reference_index	1:01:53	4688.56
fastp	0:00:14	581.91
bwa_human	0:08:52	5960.95
samtools_extract	0:02:40	16.21
bamtofastq	0:01:39	6.61
^aiva	8:19:11	238.57
shiver_init	0:00:53	64.97
shiver_align_contigs	0:04:37	2509.64
shiver_map_reads	0:31:51	567.27
shiver_tidy	0:00:00	1.06
quast	0:00:33	72.51

Notes.

IVA was run using one CPU core and two threads so if allowed more computational resources, the assembly time will be even shorter.

Notes. IVA was run using one CPU core and two threads so if allowed more computational resources, the assembly time will be even shorter. We compared the contigs generated by VGEA’s assembly step with contigs generated using two other standalone and commonly used assembly pipelines, SPAdes (Bankevich et al., 2012) and Velvet (Zerbino & Birney, 2008). We compared against these two pipelines because most commonly used assembly workflows like viral-ngs and VirAmp are built on them. We carried out this comparison by making use of five different SARS-CoV-2 test datasets (namely CV18, CV29, CV45, CV115 and CV145 datasets available on FigShare and NCBI). We compared the assemblies to the SARS-CoV-2 reference genome, and N50/NG50, mis-assembly, mismatches and indel scores were used to evaluate the performance of each assembly method as recommended by Assemblathon 2 (Bradnam et al., 2013) (Table 2). Basic statistics were calculated using QUAST. All results of our performance evaluation and comparison are provided as File S2. All analyses were run on a 64-bit personal computer with 16GB RAM using four threads. SPAdes version 3.15.2 and Velvet version 1.2.10 were used for the comparison purposes using the default parameters.

Table 2

Performance comparison using different assembly pipelines.

Sample ID	# reads (x10 ⁶⁾	Pipeline	# contigs	Largest contig (bp)	N50	NG50	Genome fraction (%)	Mis assemblies	Mismatches	Indels	Maximum RAM used (MB)
CV18	3.2	VGEA SPAdes Velvet	42 384 68	29928 22141 1858	2294 1435 728	29928 22141 922	99.776 99.652 19.326	0 1 0	10 18 3	0 1 0	627 2447 1544
CV29	1.8	VGEA SPAdes Velvet	31 478 66	7731 24904 2877	3065 1136 942	7534 24904 1380	99.786 99.632 1.729	0 0 0	9 7 0	0 0 0	484 2314 807
CV45	6.2	VGEA SPAdes Velvet	30 45 535	16248 6779 5239	2603 1255 898	16248 2447 3030	98.291 94.957 14.256	1 0 0	11 35 0	0 12 0	666 2504 1360
CV115	2	VGEA ^aSPAdes Velvet	28 49 41	5225 1942 2847	2258 1068 819	3060 1828 931	96.957 - 68.134	0 – 0	12 – 9	0 – 0	177 1735 511
CV145	4.4	VGEA SPAdes Velvet	28 188 178	6807 3216 1798	2049 1190 682	4214 2477 1107	73.093 5.073 3.578	0 2 0	14 13 0	0 0 0	635 2547 1459

Notes.

QUAST gave no genome fraction value for this sample.

Notes. QUAST gave no genome fraction value for this sample. Evaluation statistics showed that contigs generated by VGEA had the highest NG50 score for four of the five datasets and the highest N50 scores across all five datasets. In all five datasets, VGEA’s contigs had the highest genome fraction covering greater than 95% in four. Comparison of maximum RAM used by VGEA, SPAdes and Velvet showed that VGEA used the least amount of RAM for the analyses of all five datasets used for comparison. SPAdes and Velvet however ran faster than VGEA for all analyses.

Discussion

VGEA is built on the snakemake workflow management system (BKöster & Rahmann, 2012), a workflow management system that allows the effortless deployment and execution of complex distributed computational workflows in any UNIX-based system, from local machines to high-performance computing clusters. It is a user-friendly, customizable and reproducible pipeline which can be deployed on a personal computer and which can run from start to finish with a single command. VGEA was designed with ease-of-use in mind and so all its dependencies can be installed in a conda environment under the bioconda channel (Grüning et al., 2018 making it particularly useful for scientists with little or no computational background and for scientists in LMICs who don’t have much access to high-performance computing clusters or cloud-computing resources. VGEA capitalizes on Snakemake’s multi-threading feature so that makes it possible for it to be deployed on laptops with greater computing performance or a computing server to improve its speed. The pipeline was tested with paired-end short-read sequencing data produced by the illumina platform (MiSeq, MiSeq FGx and HiSeq 2500). The results generated by the major steps of the VGEA pipeline are summed up together into a MultiQC report which can be easily interpreted and understood by anyone with little or no knowledge of bioinformatics.

Conclusion

VGEA was built primarily by biologists and in a manner that is easy to be employed by users without significant computational background. As new and innovative tools for viral genome analysis and assembly are increasingly being developed, these can easily be incorporated into the VGEA pipeline. We hope that other scientists can build upon and improve VGEA as a tool to extract more qualitative and quantitative information from viral genomes. Click here for additional data file. Click here for additional data file.

39 in total

1. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

2. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

3. Zika virus evolution and spread in the Americas.

Authors: Hayden C Metsky; Christian B Matranga; Shirlee Wohl; Stephen F Schaffner; Catherine A Freije; Sarah M Winnicki; Kendra West; James Qu; Mary Lynn Baniecki; Adrianne Gladden-Young; Aaron E Lin; Christopher H Tomkins-Tinch; Simon H Ye; Daniel J Park; Cynthia Y Luo; Kayla G Barnes; Rickey R Shah; Bridget Chak; Giselle Barbosa-Lima; Edson Delatorre; Yasmine R Vieira; Lauren M Paul; Amanda L Tan; Carolyn M Barcellona; Mario C Porcelli; Chalmers Vasquez; Andrew C Cannons; Marshall R Cone; Kelly N Hogan; Edgar W Kopp; Joshua J Anzinger; Kimberly F Garcia; Leda A Parham; Rosa M Gélvez Ramírez; Maria C Miranda Montoya; Diana P Rojas; Catherine M Brown; Scott Hennigan; Brandon Sabina; Sarah Scotland; Karthik Gangavarapu; Nathan D Grubaugh; Glenn Oliveira; Refugio Robles-Sikisaka; Andrew Rambaut; Lee Gehrke; Sandra Smole; M Elizabeth Halloran; Luis Villar; Salim Mattar; Ivette Lorenzana; Jose Cerbino-Neto; Clarissa Valim; Wim Degrave; Patricia T Bozza; Andreas Gnirke; Kristian G Andersen; Sharon Isern; Scott F Michael; Fernando A Bozza; Thiago M L Souza; Irene Bosch; Nathan L Yozwiak; Bronwyn L MacInnis; Pardis C Sabeti
Journal: Nature Date: 2017-05-24 Impact factor: 49.962

4. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.

Authors: Wei Shen; Shuai Le; Yan Li; Fuquan Hu
Journal: PLoS One Date: 2016-10-05 Impact factor: 3.240

Review 5. Metagenomics for the discovery of novel human viruses.

Authors: Patrick Tang; Charles Chiu
Journal: Future Microbiol Date: 2010-02 Impact factor: 3.165

6. Raw sewage harbors diverse viral populations.

Authors: Paul G Cantalupo; Byron Calgua; Guoyan Zhao; Ayalkibet Hundesa; Adam D Wier; Josh P Katz; Michael Grabe; Roger W Hendrix; Rosina Girones; David Wang; James M Pipas
Journal: MBio Date: 2011-10-04 Impact factor: 7.867

Review 7. Metagenomics and future perspectives in virus discovery.

Authors: John L Mokili; Forest Rohwer; Bas E Dutilh
Journal: Curr Opin Virol Date: 2012-01-20 Impact factor: 7.090

8. fastp: an ultra-fast all-in-one FASTQ preprocessor.

Authors: Shifu Chen; Yanqing Zhou; Yaru Chen; Jia Gu
Journal: Bioinformatics Date: 2018-09-01 Impact factor: 6.937

Review 9. Studying immunity to zoonotic diseases in the natural host - keeping it real.

Authors: Andrew G D Bean; Michelle L Baker; Cameron R Stewart; Christopher Cowled; Celine Deffrasnes; Lin-Fa Wang; John W Lowenthal
Journal: Nat Rev Immunol Date: 2013-10-25 Impact factor: 53.106

10. Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors: Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

1 in total

1. Metagenomic sequencing characterizes a wide diversity of viruses in field mosquito samples in Nigeria.

Authors: Judith U Oguzie; Udoka C Nwangwu; Paul E Oluniyi; Testimony J Olumade; Uwem E George; Akano Kazeem; Bolajoko E Bankole; Farida O Brimmo; Chukwuemeka C Asadu; Okechukwu C Chukwuekezie; Josephine C Ochu; Catherine O Makwe; Festus A Dogunro; Cosmas O Onwude; William E Nwachukwu; Ebuka K Ezihe; Gilkenny K Okonkwo; Ndubuisi E Umazi; Jacob Maikere; Nneka O Agashi; Emelda I Eloy; Stephen O Anokwu; Angela I Okoronkwo; Ebuka M Nwosu; Sandra O Etiki; Ifeoma M Ngwu; Chikwe Ihekweazu; Onikepe A Folarin; Isaac O O Komolafe; Christian T Happi
Journal: Sci Rep Date: 2022-05-10 Impact factor: 4.996

1 in total