Robert A Petit, Timothy D Read.
Abstract
Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest, the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format, and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia.
Importance
It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.
Keywords: Lactobacillus; annotation; assembly; bacteria; genomics; software
Year: 2020 PMID: 32753501 PMCID: PMC7406220 DOI: 10.1128/mSystems.00190-20
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
List of bioinformatic tools used by the Bactopia Analysis Pipeline, version 1.4.0
| Name | Version | Description | Link | Reference(s) |
|---|---|---|---|---|
| AMRFinder+ | 3.6.7 | Finds acquired antimicrobial resistance genes and some point mutations in protein or assembled nucleotide sequences | ||
| Aragorn | 1.2.38 | Finds transfer RNA (tRNA) features | ||
| Ariba | 2.14.4 | Antimicrobial resistance identification by assembly | ||
| ART | 2016.06.05 | A set of simulation tools to generate synthetic next-generation sequencing reads | ||
| assembly-scan | 0.3.0 | Generates basic stats for an assembly | ||
| Barrnap | 0.9 | Bacterial ribosomal RNA predictor | ||
| BBMap | 38.76 | A suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data | ||
| BCFtools | 1.9 | Utilities for variant calling and manipulating VCFs and BCFs | ||
| Bedtools | 2.29.2 | A powerful tool set for genome arithmetic | ||
| BioPython | 1.76 | Tools for biological computation written in Python | ||
| BLAST+ | 2.9.0 | Basic local alignment search tool | ||
| Bowtie2 | 2.4.1 | A fast and sensitive gapped-read aligner | ||
| BWA | 0.7.17 | Burrows-Wheeler Aligner for short-read alignment | ||
| CD-HIT | 4.8.1 | Accelerated clustering of next-generation sequencing data | ||
| CheckM | 1.1.2 | Assesses the quality of microbial genomes recovered from isolates, single cells, and metagenomes | ||
| ClonalFrameML | 1.12 | Efficient inference of recombination in whole bacterial genomes | ||
| DiagrammeR | 1.0.0 | Graph and network visualization using tabular data in R | ||
| DIAMOND | 0.9.35 | Accelerated BLAST-compatible local sequence aligner | ||
| eggNOG-Mapper | 2.0.1 | Fast genome-wide functional annotation through orthology assignment | ||
| EMIRGE | 0.61.1 | Reconstructs full-length ribosomal genes from short-read sequencing data | ||
| FastANI | 1.3 | Fast whole-genome similarity (ANI) estimation | ||
| FastTree 2 | 2.1.10 | Approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences | ||
| fastq-dl | 1.0.3 | Downloads FASTQ files from SRA or ENA repositories | ||
| FastQC | 0.11.9 | A quality control analysis tool for high-throughput sequencing data | ||
| fastq-scan | 0.4.3 | Outputs FASTQ summary statistics in JSON format | ||
| FLASH | 1.2.11 | A fast and accurate tool to merge paired-end reads | ||
| freebayes | 1.3.2 | Bayesian haplotype-based genetic polymorphism discovery and genotyping | ||
| GNU Parallel | 20200122 | A shell tool for executing jobs in parallel | ||
| GTDB-tk | 1.0.2 | A tool kit for assigning objective taxonomic classifications to bacterial and archaeal genomes | ||
| HMMER | 3.3 | Biosequence analysis using profile hidden Markov models | ||
| Infernal | 1.1.2 | Searches DNA sequence databases for RNA structure and sequence similarities | ||
| IQ-TREE | 1.6.12 | Efficient phylogenomic software by maximum likelihood | ||
| ISMapper | 2.0 | Insertion sequence mapping software | ||
| Lighter | 1.1.2 | Fast and memory-efficient sequencing error corrector | ||
| MAFFT | 7.455 | Multiple alignment program for amino acid or nucleotide sequences | ||
| Mash | 2.2.2 | Fast genome and metagenome distance estimation using MinHash | ||
| Mashtree | 1.1.2 | Creates a tree using Mash distances | ||
| maskrc-svg | 0.5 | Masks recombination as detected by ClonalFrameML or Gubbins and draws an SVG | ||
| McCortex | 1.0 | |||
| MEGAHIT | 1.2.9 | Ultra-fast and memory-efficient (meta-)genome assembler | ||
| MinCED | 0.4.2 | Mining CRISPRs in environmental data sets | ||
| Minimap2 | 2.17 | A versatile pairwise aligner for genomic and spliced nucleotide sequences | ||
| ncbi-genome-download | 0.2.12 | Scripts to download genomes from the NCBI FTP servers | ||
| Nextflow | 19.10.0 | A DSL for data-driven computational pipelines | ||
| phyloFlash | 3.3b3 | Rapidly reconstructs the SSU rRNAs and explores the phylogenetic composition of an Illumina (meta)genomic data set | ||
| Pigz | 2.3.4 | A parallel implementation of gzip for modern multiprocessor, multicore machines | ||
| Pilon | 1.23 | An automated genome assembly improvement and variant detection tool | ||
| PIRATE | 1.0.3 | A toolbox for pan-genome analysis and threshold evaluation | ||
| pplacer | 1.1.alpha19 | Phylogenetic placement and downstream analysis | ||
| Prodigal | 2.6.3 | Fast, reliable protein-coding gene prediction for prokaryotic genomes | ||
| Prokka | 1.4.5 | Rapid prokaryotic genome annotation | ||
| QUAST | 5.0.2 | Quality assessment tool for genome assemblies | ||
| Racon | 1.4.13 | Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads | ||
| Roary | 3.13.0 | Rapid large-scale prokaryote pan genome analysis | ||
| samclip | 0.2 | Filter SAM file for soft and hard clipped alignments | ||
| SAMtools | 1.9 | Tools for manipulating next-generation sequencing data | ||
| Seqtk | 1.3 | A fast and lightweight tool for processing sequences in the FASTA or FASTQ format | ||
| Shovill | 1.0.9se | Faster assembly of Illumina reads | ||
| SKESA | 2.3.0 | Strategic k-mer extension for scrupulous assemblies | ||
| Snippy | 4.4.5 | Rapid haploid variant calling and core genome alignment | ||
| SnpEff | 4.3.1 | Genomic variant annotations and functional effect prediction toolbox | ||
| snp-dists | 0.6.3 | Pairwise SNP distance matrix from a FASTA sequence alignment | ||
| SNP-sites | 2.5.1 | Rapidly extracts SNPs from a multi-FASTA alignment | ||
| Sourmash | 3.2.0 | Compute and compare MinHash signatures for DNA data sets | ||
| SPAdes | 3.13.0 | An assembly toolkit containing various assembly pipelines | ||
| Trimmomatic | 0.39 | A flexible read trimming tool for Illumina NGS data | ||
| Unicycler | 0.4.8 | Hybrid assembly pipeline for bacterial genomes | ||
| vcf-annotator | 0.5 | Add biological annotations to variants in a VCF file | ||
| Vcflib | 1.0.0rc3 | A simple C++ library for parsing and manipulating VCF files | ||
| Velvet | 1.2.10 | Short-read de novo assembler using de Bruijn graphs | ||
| VSEARCH | 2.14.1 | Versatile open-source tool for metagenomics | ||
| vt | 2015.11.10 | A tool set for short-variant discovery in genetic sequence data |
VCF, variant call format; BCF, binary variant call format; SVG, scalable vector graphics; JSON, JavaScript Object Notation; DSL, domain-specific language; SSU, small subunit; NGS, next-generation sequencing.
FIG 1. Bactopia overview. (a) A general overview of the Bactopia workflow. (b) A detailed diagram of processing pathways within the Bactopia Analysis Pipeline showing optional data set inputs.
A comparison of bacterial genome analysis workflows
| Feature | Bactopia | ASA3P | Nullarbor | TORMES |
|---|---|---|---|---|
| Version | 1.4.0 | 1.3.0 | 2.0.20191013 | 1.1 |
| Release date | 1 July 2020 | 2 May 2020 | 13 October 2019 | 14 April 2020 |
| Latest commit | 1 July 2020 | 26 June 2020 | 15 March 2020 | 28 May 2020 |
| Sequence technology | Illumina, Hybrid (Nanopore, PacBio) | Illumina, Nanopore, PacBio | Illumina | Illumina |
| Single-end reads | Yes | Yes | No | No |
| Workflow | Nextflow | Groovy | Perl + Make | Bash |
| Resume if stopped | Yes | No | Yes | No |
| Reuse existing runs for expanded analysis | Yes | No | Yes | No |
| Built-in high-performance computing cluster and cloud capability | Yes | Yes | No | No |
| Individual program adjustable parameters | Yes | No | Yes | No |
| Batch processing from config file | Yes | Yes | Yes | Yes |
| Single sample processing from command line | Yes | No | Yes | No |
| Sequence depth downsample | Yes | No | Yes | No |
| Automatic reference selection for variant detection | Yes | No | No | No |
| Data download from SRA/ENA | Yes | No | No | No |
| Species identification | ||||
| Comparative analysis | Separate process | Built-in process | Built-in process | Built-in process |
| Summary | Text | HTML | HTML | R Markdown |
| Package manager | Bioconda | Bioconda and Brew | Conda YAML | |
| Container available | Yes | Yes | Yes | No |
| Documentation | Website | PDF manual | Readme | Readme |
| GitHub repository | | | | |
Summary of quality and coverage of Lactobacillus genome sequencing projects
| Quality rank | No. of samples | Original coverage | Post-Bactopia coverage | Per-read quality score | Read length (bp) | Contig count | % of assembled genome size compared to estimated genome size |
|---|---|---|---|---|---|---|---|
| Gold | 967 | 213× | 100× | Q35 | 100 | 52 | 92 |
| Silver | 386 | 160× | 100× | Q35 | 100 | 110 | 93 |
| Bronze | 205 | 102× | 100× | Q34 | 100 | 90 | 93 |
| Exclude | 48 | 26× | 22× | Q34 | 100 | 706 | 93 |
| Unprocessed | 58 |
All values except number of samples are medians.
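The relationship between the original and post-Bactopia coverage columns can be illustrated with a minimal sketch. It assumes that coverage is estimated as total sequenced bases divided by genome size and that reads are subsampled to a cap of 100×, a target inferred from the table rather than stated in it. The function names and the example numbers are hypothetical, and the sketch models only the cap, not any reads lost to quality filtering.

```python
def estimate_coverage(total_bases: int, genome_size: int) -> float:
    """Sequencing depth as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def downsample_fraction(coverage: float, target: float = 100.0) -> float:
    """Fraction of reads to keep so that depth does not exceed the target.

    The 100x default is an assumption taken from the table medians.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return min(1.0, target / coverage)


if __name__ == "__main__":
    # Hypothetical sample: 426 Mb of reads against a 2 Mb genome estimate.
    cov = estimate_coverage(426_000_000, 2_000_000)
    keep = downsample_fraction(cov)
    print(f"coverage {cov:.0f}x, keep {keep:.2%} of reads")
```

A sample already below the target keeps all of its reads, which is why low-coverage projects are not reduced further by this step.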
FIG 2. Maximum-likelihood phylogeny from reconstructed 16S rRNA genes. A phylogenetic representation of 1,470 samples using IQ-Tree (28–30). (a) A tree of the full set of samples. The outer ring represents the genus assigned by GTDB-Tk, as indicated. (b) The same tree as shown in panel a, but with the non-Lactobacillus clade collapsed. Major groups of Lactobacillus species (indicated with a letter) and the most sequenced Lactobacillus species have been labeled. The inner ring represents the average nucleotide identity (ANI), determined by FastANI (6), of samples to L. crispatus. The tree was built from a multiple-sequence alignment (31) of 16S genes reconstructed by phyloFlash (25) with 1,281 parsimony-informative sites. The likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −54,698. Taxonomic classifications were assigned by GTDB-Tk (21).
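The legend reports the number of parsimony-informative sites in the alignment. As a minimal sketch of what that count means, the code below uses the standard definition: a column is informative when at least two character states each occur in at least two sequences. The function names, the set of ignored characters, and the toy alignment are illustrative assumptions, not the counting code used by IQ-Tree.

```python
from collections import Counter


def is_parsimony_informative(column: str, ignore: str = "-N?") -> bool:
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(base for base in column.upper() if base not in ignore)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_parsimony_informative(alignment: list[str]) -> int:
    """Count informative columns in a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return sum(is_parsimony_informative("".join(col)) for col in zip(*alignment))


if __name__ == "__main__":
    # Hypothetical four-sequence alignment; columns 2, 4, and 5 are informative.
    toy = ["ACGTA", "ACGTT", "ATGAT", "ATGAA"]
    print(count_parsimony_informative(toy))
```

Constant columns and columns where a variant appears in only one sequence do not count, because they cannot favor one tree topology over another under parsimony.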
Lactobacillus crispatus genomes used in pan-genome analysis
| BioProject accession no. | BioSample accession no. | Experiment accession no. | Host | Source | Reference |
|---|---|---|---|---|---|
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Urine* | |
| | | | Human* | Unknown | |
| | | | Human* | Unknown | |
| | | | Human* | Unknown | |
| | | | Human* | Unknown | |
| | | | Human* | Vaginal* | |
| | | | Human | Urine | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human* | Vaginal* | |
| | | | Human | Eye | |
| | | | Human | Eye | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
| | | | Human | Gut | |
| | | | Chicken | Gut | |
| | | | Human | Gut | |
| | | | Turkey | Gut | |
| | | | Human | Eye | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Chicken | Gut | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
| | | | Human | Vaginal | |
Lactobacillus crispatus samples (n = 42) were used in the pan-genome analysis.
NCBI Assembly (beginning with GCF) or SRA experiment accession number.
The host and source were collected from metadata associated with the BioSample or available publications. In cases when a host and/or source was not explicitly stated, it was inferred from available metadata (denoted by an asterisk).
FIG 3. Core-genome maximum-likelihood phylogeny of Lactobacillus crispatus. A core-genome phylogenetic representation using IQ-Tree (28–30) of 42 L. crispatus samples. The putatively recombinant positions predicted using ClonalFrameML (37) were removed from the alignment with maskrc-svg (38). The tree was built from 972 core genes identified by Roary with 9,209 parsimony-informative sites. The log-likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −1,418,106.
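The notion of a core gene in this legend can be sketched from a gene presence/absence table. The example below assumes that a gene is core when it is found in at least 99% of samples, which is Roary's default core definition; the legend does not state which threshold was used. The function name, gene names, and sample names are hypothetical, and Roary's own clustering of orthologs is far more involved than this filter.

```python
def core_genes(presence: dict[str, set[str]], n_samples: int,
               min_fraction: float = 0.99) -> list[str]:
    """Return genes found in at least min_fraction of the samples.

    presence maps each gene cluster name to the set of samples carrying it.
    The 0.99 default is an assumption (Roary's default core definition).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sorted(
        gene for gene, samples in presence.items()
        if len(samples) / n_samples >= min_fraction
    )


if __name__ == "__main__":
    # Hypothetical presence/absence data for three samples.
    toy = {
        "dnaA": {"s1", "s2", "s3"},
        "gyrB": {"s1", "s2", "s3"},
        "tetM": {"s2"},
    }
    print(core_genes(toy, n_samples=3))
```

With 42 samples, a 99% threshold requires a gene to be present in all 42, so the core set in a collection of this size is effectively the strict intersection.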