Idowu B Olawoye1,2, Simon D W Frost3,4, Christian T Happi1,2.
Abstract
Next generation sequencing technologies have become more accessible and affordable over the years, and the entire genome sequences of several pathogens can now be deciphered in a few hours. However, there is a need to analyze multiple genomes within a short time in order to provide critical information about a pathogen of interest, such as drug resistance, mutations and the genetic relationship of isolates in an outbreak setting. Many pipelines that currently do this are stand-alone workflows and need substantial computational resources to analyze multiple genomes. We present an automated and scalable pipeline for monomorphic bacteria called BAGEP that performs quality control on paired-end FASTQ files, scans reads for contaminants using a taxonomic classifier, maps reads to a reference genome of choice for variant detection, detects antimicrobial resistance (AMR) genes, constructs a phylogenetic tree from core genome alignments and provides interactive single nucleotide polymorphism (SNP) visualization across core genomes in the data set. The objective of our research was to create an easy-to-use pipeline from existing bioinformatics tools that can be deployed on a personal computer. The pipeline was built on the Snakemake framework and uses existing tools for each processing step: fastp for quality trimming, snippy for variant calling, Centrifuge for taxonomic classification, Abricate for AMR gene detection, snippy-core for generating whole and core genome alignments, IQ-TREE for phylogenetic tree construction and vcfR for an interactive heatmap visualization that shows SNPs at specific locations across the genomes. BAGEP was successfully tested and validated with Mycobacterium tuberculosis (n = 20) and Salmonella enterica serovar Typhi (n = 20) genomes, which are about 4.4 million and 4.8 million base pairs long, respectively. On a laptop with 8 GB of RAM and a 2.5 GHz quad-core processor, the analysis of these test data sets took 122 and 61 minutes, respectively.
BAGEP is a fast, easy-to-run pipeline that calls SNPs accurately and can be executed on a mid-range laptop; it is freely available at https://github.com/idolawoye/BAGEP. ©2020 Olawoye et al.
Keywords: Bacteria genomics; Bioinformatics; Pipeline; Workflow
Year: 2020 PMID: 33194387 PMCID: PMC7597632 DOI: 10.7717/peerj.10121
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1: A schematic workflow of BAGEP.
Fastp is used for quality control on paired-end FASTQ reads. The processed reads are mapped against a reference genome provided by the user. Centrifuge is used to check the reads for contamination and to generate a taxonomic report, which is visualized with Krona. Variants are called using Freebayes and annotated with SnpEff (aided by Snippy). The resulting variant call format (VCF) files and genomes from each sample are collated with Snippy-core to produce a VCF file containing all samples, along with core and whole genome multiple sequence alignments. A maximum-likelihood phylogenetic tree is constructed with IQ-TREE, and an HTML file containing an interactive SNP visualization is generated from the VCF file. Finally, Abricate generates an AMR report from the whole genome multiple sequence alignments.
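The dependencies between the stages in this workflow can be sketched as a small directed graph. The Python example below is only an illustration of the ordering described in the caption, not BAGEP's actual Snakemake rules: the stage names are hypothetical labels, and it assumes that Centrifuge runs on the reads after fastp processing, which the caption does not state explicitly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages whose output it consumes.
# Stage names are illustrative labels, NOT BAGEP's real rule names.
STAGES = {
    "fastp_qc": set(),
    "centrifuge_krona": {"fastp_qc"},   # assumption: classifies processed reads
    "snippy_variants": {"fastp_qc"},    # mapping, variant calling, annotation
    "snippy_core": {"snippy_variants"},
    "iqtree_phylogeny": {"snippy_core"},
    "snp_heatmap": {"snippy_core"},
    "abricate_amr": {"snippy_core"},
}


def execution_order(stages):
    """Return one valid order in which the stages can be run."""
    return list(TopologicalSorter(stages).static_order())


order = execution_order(STAGES)
print(order)
```

A workflow manager such as Snakemake resolves this kind of ordering from the input and output files declared by each rule, so stages that do not depend on one another (here, the taxonomic check and the variant calling) can in principle run independently.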
Figure 2: Pre-processing reports from read 1 of one of the M. tuberculosis genomes.
(A) Read quality before filtering. (B) Read quality after filtering. (C) Base contents before filtering. (D) Base contents after filtering.
Figure 3: Taxonomic classification of reads in an isolate using Centrifuge and Krona.
Figure 4: Interactive visualization of SNPs showing their positions across genomes.
(A) Twenty (20) M. tuberculosis genomes. (B) Twenty (20) Salmonella enterica serovar Typhi genomes.
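A heatmap of this kind is essentially a matrix with one row per genome and one column per SNP position. The following Python sketch shows how such a presence/absence matrix could be assembled. It is a simplified illustration and not the vcfR-based code used by BAGEP; the sample names and positions are invented, and the input is a hypothetical stand-in for the per-sample records of a multi-sample VCF file.

```python
def snp_matrix(calls):
    """Build a samples x positions presence/absence matrix.

    `calls` maps a sample name to the set of reference positions at which
    that sample carries a SNP (a hypothetical, simplified stand-in for the
    per-sample records in a multi-sample VCF).
    """
    # Union of all SNP positions seen in any sample, in genome order.
    positions = sorted(set().union(*calls.values()))
    samples = sorted(calls)
    matrix = [
        [1 if pos in calls[sample] else 0 for pos in positions]
        for sample in samples
    ]
    return samples, positions, matrix


# Toy data: sample names and positions are invented for illustration.
calls = {
    "sample_A": {1021, 7570, 9304},
    "sample_B": {1021, 9304},
    "sample_C": {7570},
}

samples, positions, matrix = snp_matrix(calls)
for name, row in zip(samples, matrix):
    print(name, row)
```

Each row can then be drawn as one band of the heatmap, with the columns ordered by position along the reference genome.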