
Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes.

Robert A Petit1, Timothy D Read2.   

Abstract

Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest, the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format, and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia.

IMPORTANCE It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. 
Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.
Copyright © 2020 Petit and Read.

Keywords:  Lactobacillus; annotation; assembly; bacteria; genomics; software

Year:  2020        PMID: 32753501      PMCID: PMC7406220          DOI: 10.1128/mSystems.00190-20

Source DB:  PubMed          Journal:  mSystems        ISSN: 2379-5077            Impact factor:   6.496


INTRODUCTION

Sequencing a bacterial genome, an activity that once required the infrastructure of a dedicated genome center, is now a routine task that even a small laboratory can undertake. Many open-source software tools have been created to handle various parts of the process of using raw read data for functions such as single nucleotide polymorphism (SNP) calling and de novo assembly. As a result of dedicated community efforts, it has recently become much easier to locally install these bioinformatic tools through package managers (Bioconda [1] and Brew [2]) or through the use of software containers (Docker and Singularity). Despite these advances, producers of bacterial sequence data face a bewildering array of choices when considering how to perform analysis, particularly when large numbers of genomes are involved and processing efficiency and scalability become major factors. Efficient bacterial multigenome analysis has been hampered by three missing functionalities. First is the need to have “workflows of workflows” that can integrate analyses and provide a simplified way to start with a collection of raw genome data, remove low-quality sequences, and perform the basic analytic steps of de novo assembly, mapping to a reference sequence, and taxonomic assignment. Second is the desire to incorporate user-specific knowledge of the species into the input of the main genome analysis pipeline. While many microbiologists are not expert bioinformaticians, they are experts in the organisms they study. Third is the need to create an output format from the main pipeline that could be used for future customized downstream analysis such as pan-genome analysis and basic visualization of phylogenies. Here, we introduce Bactopia, an integrated suite of workflows primarily designed for flexible analysis of Illumina genome sequencing projects of bacteria from the same taxon. 
Bactopia is based on Nextflow workflow software (3) and is designed to be scalable, allowing projects as small as a single genome to be run on a local desktop or projects including many thousands of genomes to be run as a batch on a cloud infrastructure. Running multiple tasks on a single platform standardizes the underlying data quality used for gene and variant calling between projects run in different laboratories. This structure also simplifies the user experience. In Bactopia, complex multigenome analysis can be run in a small number of commands. However, there are myriad options for fine-tuning data sets used for analysis and the functions of the system. The underlying Nextflow structure ensures reproducibility. To illustrate the functionality of the system, we performed a Bactopia analysis of 1,664 public genome samples of the Lactobacillus genus, an important component of the microbiome of humans and animals.

RESULTS

Design and implementation.

Bactopia links together open-source bioinformatics software, available from Bioconda (1), using Nextflow (3). Nextflow was chosen for its flexibility: Bactopia can be run locally, on clusters, or on cloud platforms with simple parameter changes. It also manages the parallel execution of tasks and creates checkpoints allowing users to resume jobs. Nextflow automates installation of the component software of the workflow through integration with Bioconda. For ease of deployment, Bactopia can be installed either through Bioconda, a Docker container, or a Singularity container. All of the software programs used by Bactopia (version 1.4.0) described in the manuscript are listed in Table 1 with their individual version numbers.
TABLE 1

List of bioinformatic tools used by the Bactopia Analysis Pipeline, version 1.4.0

Name | Version | Description (a) | Link | Reference(s)
AMRFinder+ | 3.6.7 | Finds acquired antimicrobial resistance genes and some point mutations in protein or assembled nucleotide sequences | https://github.com/ncbi/amr | 47
Aragorn | 1.2.38 | Finds transfer RNA (tRNA) features | http://130.235.244.92/ARAGORN/Downloads/ | 85
Ariba | 2.14.4 | Antimicrobial resistance identification by assembly | https://github.com/sanger-pathogens/ariba | 13
ART | 2016.06.05 | A set of simulation tools to generate synthetic next-generation sequencing reads | https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm | 59
assembly-scan | 0.3.0 | Generates basic stats for an assembly | https://github.com/rpetit3/assembly-scan | 73
Barrnap | 0.9 | Bacterial ribosomal RNA predictor | https://github.com/tseemann/barrnap | 86
BBMap | 38.76 | A suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data | https://jgi.doe.gov/data-and-tools/bbtools/ | 61
BCFtools | 1.9 | Utilities for variant calling and manipulating VCFs and BCFs | https://github.com/samtools/bcftools | 87
Bedtools | 2.29.2 | A powerful tool set for genome arithmetic | https://github.com/arq5x/bedtools2 | 79
BioPython | 1.76 | Tools for biological computation written in Python | https://github.com/biopython/biopython | 54
BLAST+ | 2.9.0 | Basic local alignment search tool | https://blast.ncbi.nlm.nih.gov/Blast.cgi | 53
Bowtie2 | 2.4.1 | A fast and sensitive gapped-read aligner | https://github.com/BenLangmead/bowtie2 | 88
BWA | 0.7.17 | Burrows-Wheeler Aligner for short-read alignment | https://github.com/lh3/bwa/ | 77
CD-HIT | 4.8.1 | Accelerated for clustering the next-generation sequencing data | https://github.com/weizhongli/cdhit | 55, 56
CheckM | 1.1.2 | Assesses the quality of microbial genomes recovered from isolates, single cells, and metagenomes | https://github.com/Ecogenomics/CheckM | 72
ClonalFrameML | 1.12 | Efficient inference of recombination in whole bacterial genomes | https://github.com/xavierdidelot/ClonalFrameML | 37
DiagrammeR | 1.0.0 | Graph and network visualization using tabular data in R | https://github.com/rich-iannone/DiagrammeR | 89
DIAMOND | 0.9.35 | Accelerated BLAST-compatible local sequence aligner | https://github.com/bbuchfink/diamond | 90
eggNOG-Mapper | 2.0.1 | Fast genome-wide functional annotation through orthology assignment | https://github.com/eggnogdb/eggnog-mapper | 80, 81
EMIRGE | 0.61.1 | Reconstructs full-length ribosomal genes from short-read sequencing data | https://github.com/csmiller/EMIRGE | 91
FastANI | 1.3 | Fast whole-genome similarity (ANI) estimation | https://github.com/ParBLiSS/FastANI | 6
FastTree 2 | 2.1.10 | Approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences | http://www.microbesonline.org/fasttree | 92
fastq-dl | 1.0.3 | Downloads FASTQ files from SRA or ENA repositories | https://github.com/rpetit3/fastq-dl | 58
FastQC | 0.11.9 | A quality control analysis tool for high throughput sequencing data | https://github.com/s-andrews/FastQC | 63
fastq-scan | 0.4.3 | Outputs FASTQ summary statistics in JSON format | https://github.com/rpetit3/fastq-scan | 64
FLASH | 1.2.11 | A fast and accurate tool to merge paired-end reads | https://ccb.jhu.edu/software/FLASH/ | 93
freebayes | 1.3.2 | Bayesian haplotype-based genetic polymorphism discovery and genotyping | https://github.com/ekg/freebayes | 94
GNU Parallel | 20200122 | A shell tool for executing jobs in parallel | https://www.gnu.org/software/parallel/ | 95
GTDB-tk | 1.0.2 | A tool kit for assigning objective taxonomic classifications to bacterial and archaeal genomes | https://github.com/Ecogenomics/GTDBTk | 21
HMMER | 3.3 | Biosequence analysis using profile hidden Markov models | http://hmmer.org/ | 23, 96, 97
Infernal | 1.1.2 | Searches DNA sequence databases for RNA structure and sequence similarities | http://eddylab.org/infernal/ | 98
IQ-TREE | 1.6.12 | Efficient phylogenomic software by maximum likelihood | https://github.com/Cibiv/IQ-TREE | 28
ISMapper | 2.0 | Insertion sequence mapping software | https://github.com/jhawkey/IS_mapper | 82
Lighter | 1.1.2 | Fast and memory-efficient sequencing error corrector | https://github.com/mourisl/Lighter | 62
MAFFT | 7.455 | Multiple alignment program for amino acid or nucleotide sequences | https://mafft.cbrc.jp/alignment/software/ | 31
Mash | 2.2.2 | Fast genome and metagenome distance estimation using MinHash | https://github.com/marbl/Mash | 17, 75
Mashtree | 1.1.2 | Creates a tree using Mash distances | https://github.com/lskatz/mashtree | 83
maskrc-svg | 0.5 | Masks recombination as detected by ClonalFrameML or Gubbins and draws an SVG | https://github.com/kwongj/maskrc-svg | 38
McCortex | 1.0 | De novo genome assembly and multisample variant calling | https://github.com/mcveanlab/mccortex | 74
MEGAHIT | 1.2.9 | Ultra-fast and memory-efficient (meta-)genome assembler | https://github.com/voutcn/megahit | 66
MinCED | 0.4.2 | Mining CRISPRs in environmental data sets | https://github.com/ctSkennerton/minced | 99
Minimap2 | 2.17 | A versatile pairwise aligner for genomic and spliced nucleotide sequences | https://github.com/lh3/minimap2 | 100
ncbi-genome-download | 0.2.12 | Scripts to download genomes from the NCBI FTP servers | https://github.com/kblin/ncbi-genome-download | 35
Nextflow | 19.10.0 | A DSL for data-driven computational pipelines | https://github.com/nextflow-io/nextflow | 3
phyloFlash | 3.3b3 | Rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an Illumina (meta)genomic data set | https://github.com/HRGV/phyloFlash | 25
Pigz | 2.3.4 | A parallel implementation of gzip for modern multiprocessor, multicore machines | https://zlib.net/pigz/ | 101
Pilon | 1.23 | An automated genome assembly improvement and variant detection tool | https://github.com/broadinstitute/pilon/ | 69
PIRATE | 1.0.3 | A toolbox for pan-genome analysis and threshold evaluation | https://github.com/SionBayliss/PIRATE | 84
pplacer | 1.1.alpha19 | Phylogenetic placement and downstream analysis | https://github.com/matsen/pplacer | 24
Prodigal | 2.6.3 | Fast, reliable protein-coding gene prediction for prokaryotic genomes | https://github.com/hyattpd/Prodigal | 22
Prokka | 1.4.5 | Rapid prokaryotic genome annotation | https://github.com/tseemann/prokka | 36
QUAST | 5.0.2 | Quality assessment tool for genome assemblies | http://quast.sourceforge.net/ | 71
Racon | 1.4.13 | Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads | https://github.com/lbcb-sci/racon | 102
Roary | 3.13.0 | Rapid large-scale prokaryote pan genome analysis | https://github.com/sanger-pathogens/Roary | 7
samclip | 0.2 | Filter SAM file for soft and hard clipped alignments | https://github.com/tseemann/samclip | 103
SAMtools | 1.9 | Tools for manipulating next-generation sequencing data | https://github.com/samtools/samtools | 104
Seqtk | 1.3 | A fast and lightweight tool for processing sequences in the FASTA or FASTQ format | https://github.com/lh3/seqtk | 105
Shovill | 1.0.9se | Faster assembly of Illumina reads | https://github.com/tseemann/shovill | 65
SKESA | 2.3.0 | Strategic k-mer extension for scrupulous assemblies | https://github.com/ncbi/SKESA | 67
Snippy | 4.4.5 | Rapid haploid variant calling and core genome alignment | https://github.com/tseemann/snippy | 76
SnpEff | 4.3.1 | Genomic variant annotations and functional effect prediction toolbox | http://snpeff.sourceforge.net/ | 106
snp-dists | 0.6.3 | Pairwise SNP distance matrix from a FASTA sequence alignment | https://github.com/tseemann/snp-dists | 39
SNP-sites | 2.5.1 | Rapidly extracts SNPs from a multi-FASTA alignment | https://github.com/sanger-pathogens/snp-sites | 107
Sourmash | 3.2.0 | Compute and compare MinHash signatures for DNA data sets | https://github.com/dib-lab/sourmash | 19
SPAdes | 3.13.0 | An assembly toolkit containing various assembly pipelines | https://github.com/ablab/spades | 26
Trimmomatic | 0.39 | A flexible read trimming tool for Illumina NGS data | http://www.usadellab.org/cms/index.php?page=trimmomatic | 108
Unicycler | 0.4.8 | Hybrid assembly pipeline for bacterial genomes | https://github.com/rrwick/Unicycler | 70
vcf-annotator | 0.5 | Add biological annotations to variants in a VCF file | https://github.com/rpetit3/vcf-annotator | 109
Vcflib | 1.0.0rc3 | A simple C++ library for parsing and manipulating VCF files | https://github.com/vcflib/vcflib | 110
Velvet | 1.2.10 | Short read de novo assembler using de Bruijn graphs | https://github.com/dzerbino/velvet | 68
VSEARCH | 2.14.1 | Versatile open-source tool for metagenomics | https://github.com/torognes/vsearch | 111
vt | 2015.11.10 | A tool set for short-variant discovery in genetic sequence data | https://github.com/atks/vt | 112

(a) VCF, variant call format; BCF, binary variant call format; SVG, scalable vector graphics; JSON, JavaScript Object Notation; DSL, domain-specific language; SSU, small subunit; NGS, next-generation sequencing.

There are three main components of Bactopia (Fig. 1; see also Fig. S1 in the supplemental material). Bactopia Data Sets (BaDs) is a framework for formatting organism-specific data sets to be used by the downstream analysis pipeline. The Bactopia Analysis Pipeline (BaAP) is a customizable workflow for the analysis of individual bacterial genome projects that is an extension and generalization of the previously published Staphylococcus aureus-specific Staphopia Analysis Pipeline (StAP) (4). The inputs to BaAP are FASTQ files from bacterial Illumina sequencing projects, either imported from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database or provided locally, and any reference data in the BaDs. Bactopia Tools (BaTs) is a set of workflows that use the output files from a BaAP project to run genomic analysis on multiple genomes. For this project we used BaTs to (i) summarize the results of running multiple bacterial genomes through BaAP, (ii) extract 16S gene sequences and create a phylogeny, (iii) assign taxonomic classifications with the Genome Taxonomy Database (GTDB) (5), (iv) determine subsets of Lactobacillus crispatus samples by average nucleotide identity (ANI) with FastANI (6), and (v) run pan-genome analysis for L. crispatus using Roary (7) and create a core-genome phylogeny.
FIG 1

Bactopia overview. (a) A general overview of the Bactopia workflow. (b) A detailed diagram of processing pathways within the Bactopia Analysis Pipeline showing optional data set inputs.

FIG S1

Bactopia Analysis Pipeline workflow.

Comparison to similar open-source software.

At the time of writing (February 2020), we knew of only three other actively maintained open-source generalist bacterial genomic workflow software programs that encompassed a similar range of functionality to Bactopia: ASA3P (8), TORMES (9), and the currently unpublished Nullarbor (10). The versions of these programs used many of the same component software programs (e.g., Prokka, SPAdes, BLAST+, and Roary) but differed in the philosophies underlying their design (Table 2). This made head-to-head runtime comparisons somewhat meaningless as each was aimed at a different analysis scenario and produced a different output. Bactopia was the most open-ended and flexible, allowing the user to customize input databases and providing a platform for downstream analysis by different BaTs rather than built-in pangenome and phylogeny creation. Bactopia also had some features not implemented in the other programs, such as SRA/ENA search and download and automated reference genome selection for identifying variants. Both Bactopia and ASA3P are highly scalable, and each can be seamlessly executed on local, cluster, and cloud environments with little effort required by the user. ASA3P was the only program to implement long-read assembly of multiple projects. TORMES was the only program to include a user-customizable RMarkdown for reporting and to have optional analyses specifically for Escherichia and Salmonella. Nullarbor was the only program to implement a prescreening method for filtering out potential biological outliers prior to full analysis.
TABLE 2

A comparison of bacterial genome analysis workflows

Feature | Bactopia | ASA3P | Nullarbor | TORMES
Version | 1.4.0 | 1.3.0 | 2.0.20191013 | 1.1
Release date | 1 July 2020 | 2 May 2020 | 13 October 2019 | 14 April 2020
Latest commit | 1 July 2020 | 26 June 2020 | 15 March 2020 | 28 May 2020
Sequence technology | Illumina, Hybrid (Nanopore, PacBio) | Illumina, Nanopore, PacBio | Illumina | Illumina
Single-end reads | Yes | Yes | No | No
Workflow | Nextflow | Groovy | Perl + Make | Bash
Resume if stopped | Yes | No | Yes | No
Reuse existing runs for expanded analysis | Yes | No | Yes | No
Built-in high-performance computing cluster and cloud capability | Yes | Yes | No | No
Individual program adjustable parameters | Yes | No | Yes | No
Batch processing from config file | Yes | Yes | Yes | Yes
Single sample processing from command line | Yes | No | Yes | No
Sequence depth downsample | Yes | No | Yes | No
Automatic reference selection for variant detection | Yes | No | No | No
Data download from SRA/ENA | Yes | No | No | No
Species identification | k-mers, 16S, ANI | k-mers, 16S, ANI | k-mers | k-mers
Comparative analysis | Separate process | Built-in process | Built-in process | Built-in process
Summary | Text | HTML | HTML | R Markdown
Package manager | Bioconda |  | Bioconda and Brew | Conda YAML
Container available | Yes | Yes | Yes | No
Documentation | Website | PDF manual | Readme | Readme
GitHub repository | https://github.com/bactopia/bactopia/ | https://github.com/oschwengers/asap | https://github.com/tseemann/nullarbor | https://github.com/nmquijada/tormes

Use case: the Lactobacillus genus.

We performed a Bactopia analysis of publicly available raw Illumina data labeled as belonging to the Lactobacillus genus. Lactobacillus is an important component of the human microbiome, and cultured samples have been sequenced by several research groups over the past few years. Lactobacillus crispatus and other Lactobacillus species are often the dominant bacteria of the human vagina and are associated with low pH and reduction in pathogen burden (11). Members of the genus are used in the food industry for fermentation in the production of yoghurt, kimchi, kombucha, and other common items. Lactobacillus is a common probiotic, although recent genome-based transmission studies showed that bloodstream infections can follow ingestion by immunocompromised patients (12).

In November 2019, we initiated Bactopia analysis using three commands: “bactopia datasets”, “bactopia search”, and the main “bactopia” command.

The “bactopia datasets” subcommand automated downloading of BaDs. With these parameters, we downloaded and formatted the following data sets: Ariba (13) reference databases for the Comprehensive Antibiotic Resistance Database (CARD) and the core Virulence Factor Database (VFDB) (14, 15), RefSeq Mash sketch (16, 17), GenBank Sourmash signatures (18, 19), PLSDB BLAST database and Mash sketch (20), and a clustered protein set and Mash sketch from completed Lactobacillus genomes () available from NCBI Assembly (RefSeq). This took 25 min to complete.

The “bactopia search” subcommand produced a list of accession numbers for 2,030 experiments that had been labeled as “Lactobacillus” (taxonomy identifier [taxon ID]: 1578) (Data Set S1). After filtering for only Illumina sequencing, 1,664 accession numbers for experiments remained (Data Set S2).

DATA SET S1

Results returned after querying ENA for Lactobacillus.

DATA SET S2

SRA/ENA accession numbers of experiments processed by Bactopia.

The main “bactopia” command automated BaAP processing of the list of accessions () using the downloaded BaDs (). Here, we chose a standard maximum coverage per genome of 100× (), based on the estimated genome size. We used the range of genome sizes (1.2 Mb to 3.7 Mb) for the completed Lactobacillus genomes to require that the estimated genome size for each sample be between 1 Mbp () and 4.2 Mbp (). Samples were processed on a 96-core SLURM cluster with 512 GB of available RAM. Analysis took approximately 2.5 days to complete, with an estimated runtime of 30 min per sample (determined by adding up the median process runtime, for 17 different processes in total, in BaAP). No individual process used more than 8 GB of memory, with all but five using less than 1 GB. Nextflow (3) recorded detailed statistics on resource usage, including CPU, memory, job duration, and input-output (I/O) (Data Set S3).

DATA SET S3

Nextflow runtime report for Lactobacillus genomes processed by Bactopia.
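The coverage cap and genome size window described above can be illustrated with a minimal Python sketch. It is not Bactopia's own code: the function names are hypothetical, coverage is assumed to be total sequenced bases divided by the estimated genome size, and the size bounds are assumed to be inclusive.

```python
def estimate_coverage(total_bases: int, genome_size: int) -> float:
    """Approximate sequencing depth as total bases over estimated genome size."""
    return total_bases / genome_size


def subsample_fraction(total_bases: int, genome_size: int,
                       max_coverage: float = 100.0) -> float:
    """Fraction of reads to keep so that coverage does not exceed max_coverage."""
    coverage = estimate_coverage(total_bases, genome_size)
    if coverage <= max_coverage:
        return 1.0  # already at or below the cap, keep every read
    return max_coverage / coverage


def genome_size_in_range(genome_size: int,
                         min_size: int = 1_000_000,
                         max_size: int = 4_200_000) -> bool:
    """Check the 1 Mbp to 4.2 Mbp window (inclusive bounds are an assumption)."""
    return min_size <= genome_size <= max_size


# Hypothetical sample: 426 Mbp of reads against a 2 Mbp estimated genome (213x).
fraction = subsample_fraction(426_000_000, 2_000_000)
print(f"keep {fraction:.1%} of reads")
print(genome_size_in_range(4_500_000))
```

Because the cap depends on the estimated genome size, an overestimated size lowers the apparent coverage and keeps more reads than a known genome size would.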

Analysis of Lactobacillus genomes using BaTs.

The BaAP outputted a directory of directories named after the unique experiment accession number for each sample. Within each sample directory were subdirectories for the output of each analysis run. These data structures were recognized by BaTs for subsequent analysis. We used BaT “summary” to generate a summary report of our analysis. The report includes an overview of sequence quality, assembly statistics, and predicted antimicrobial resistances and virulence factors. It also outputs a list of samples that fail to meet minimum sequencing depth and/or quality thresholds. BaT “summary” grouped samples as gold, silver, bronze, exclude, or unprocessed, based on BaAP completion, minimum sequencing coverage, per-read sequencing mean quality, minimum mean read length, and assembly quality (Table 3; Fig. S2). To be placed in a group, a sample had to meet each cutoff. Cutoffs were based on those used by the Staphopia Analysis Pipeline (StAP) (4) with the addition of a contig count cutoff. For this analysis we used the default values for these cutoffs to group our samples. Gold samples were defined as those having greater than 100× coverage, per-read mean quality greater than Q30, mean read length greater than 95 bp, and an assembly with fewer than 100 contigs. Silver samples were defined as those having greater than 50× coverage, per-read mean quality greater than Q20, mean read length greater than 75 bp, and an assembly with fewer than 200 contigs. Bronze samples were defined as those having greater than 20× coverage, per-read mean quality greater than Q12, mean read length greater than 49 bp, and an assembly with fewer than 500 contigs. A total of 106 samples (the exclude and unprocessed groups) were excluded from further analysis (Table S1). Forty-eight samples that failed to meet the minimum thresholds for bronze quality were assigned to the exclude group. 
Fifty-eight samples that were not processed by BaAP due to sequencing-related errors or because of the estimated genome sizes were grouped as unprocessed. Of these, one (SRA accession no. SRX4526092) was labeled as paired end but did not have both sets of reads, one (SRA accession no. SRX1490246) was identified to be an assembly converted to FASTQ format, and 14 had insufficient sequencing depth. The remaining 42 samples, unprocessed by BaAP, had an estimated genome size which exceeded 4.2 Mbp (set at runtime). We queried these samples against available GenBank and RefSeq sketches using Mash screen and Sourmash lca gather. There were 36 samples that contained evidence for Lactobacillus but also sequences for other bacterial species, phage, virus, and plant genomes. There were six samples that contained no evidence for Lactobacillus, four of which had matches to multiple bacterial species, and two of which had matches only to Saccharomyces cerevisiae.
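The gold, silver, and bronze cutoffs listed above can be sketched as a small Python function. This is an illustration, not the “summary” BaT's implementation, and `Cutoff` and `assign_rank` are hypothetical names. One assumption is labeled in the code: the coverage cutoff is treated as inclusive, because samples reduced to exactly 100× are reported as gold in Table 3. The unprocessed group is left out because it depends on BaAP completion rather than on these metrics.

```python
from typing import List, NamedTuple


class Cutoff(NamedTuple):
    rank: str
    min_coverage: float     # sequencing depth (x)
    min_quality: float      # per-read mean quality (Q score)
    min_read_length: float  # mean read length (bp)
    max_contigs: int        # assembly contig count must be below this


# Default cutoffs as stated in the text, ordered from strictest to most lenient.
CUTOFFS: List[Cutoff] = [
    Cutoff("gold", 100, 30, 95, 100),
    Cutoff("silver", 50, 20, 75, 200),
    Cutoff("bronze", 20, 12, 49, 500),
]


def assign_rank(coverage: float, quality: float,
                read_length: float, contigs: int) -> str:
    """Return the best rank for which the sample meets every cutoff."""
    for cutoff in CUTOFFS:
        if (coverage >= cutoff.min_coverage  # assumption: inclusive coverage cutoff
                and quality > cutoff.min_quality
                and read_length > cutoff.min_read_length
                and contigs < cutoff.max_contigs):
            return cutoff.rank
    return "exclude"


print(assign_rank(coverage=100, quality=35, read_length=100, contigs=52))
print(assign_rank(coverage=22, quality=34, read_length=100, contigs=706))
```

Because a sample has to meet every cutoff of a rank, a single weak metric (for example a fragmented assembly) is enough to lower the rank even when coverage and read quality are high.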
TABLE 3

Summary of Lactobacillus genome sequencing projects quality and coverage

Quality rank | No. of samples | Original coverage | Post-Bactopia coverage | Per-read quality score | Read length (bp) | Contig count | % of assembled genome size compared to estimated genome size
Gold | 967 | 213× | 100× | Q35 | 100 | 52 | 92
Silver | 386 | 160× | 100× | Q35 | 100 | 110 | 93
Bronze | 205 | 102× | 100× | Q34 | 100 | 90 | 93
Exclude | 48 | 26× | 22× | Q34 | 100 | 706 | 93
Unprocessed | 58 |  |  |  |  |  | 

All values except number of samples are medians.

FIG S2

Sequencing quality ranks per year from 2011 to 2019 of Lactobacillus genome projects. Genome projects were grouped into three ranks of increasing quality: bronze, silver, and gold. The rank was based on coverage, read length, per-read quality, and total assembled contigs. The highest rank, gold, represented 62% (n = 967) of the available Lactobacillus genome projects. Of the remaining genomes, 25% (n = 386) were ranked silver, and 13% (n = 205) were ranked bronze. Between the years of 2011 and 2019, gold-ranked samples consistently outnumbered silver- and bronze-ranked samples except for the years 2011 and 2015. However, it is likely that the total number of gold-ranked samples is underrepresented due to coverage reduction being based on the estimated genome size.

TABLE S1

Lactobacillus samples excluded from analysis.

There were 1,558 samples with gold, silver, or bronze quality (Table 3) that were used for further analysis. For these we found that, on average, the assembled genome size was about 12% smaller than the estimated genome size (Table 3; Fig. S3). If we assume that the assembled genome size is a better indicator of a sample’s genome size, the average coverage before quality control (QC) increased from 220× to 268×. In this use case, the Lactobacillus genus, it was necessary to estimate genome sizes, but in dealing with samples from a single species, it may be better to provide a known genome size.

FIG S3

Comparison of estimated genome size and assembled genome size. The assembled genome size (y axis) and estimated genome size (x axis) were plotted for each sample. The color of the dot is determined by the rank of the sample. The solid black line represents a 1:1 ratio between the assembled genome size and the estimated genome size. The genome size was estimated for each sample by Mash (17) using the raw sequences.

For visualization of the phylogenetic relationships of the samples, we used the “phyloflash” and “gtdb” BaTs. The “gtdb” BaT used GTDB-Tk (21) to assign a taxonomic classification to each sample. GTDB-Tk used the assembly to predict genes with Prodigal (22), identify GTDB marker genes (5) () for phylogenetic inference with HMMER3 (23), and find the maximum-likelihood placement of each sample on the GTDB-Tk reference tree with pplacer (24). A taxonomic classification was assigned to 1,554 samples, and 4 samples failed classification due to insufficient marker gene coverage or marker genes with multiple hits. The “phyloflash” BaT used the phyloFlash tool (25) to reconstruct a 16S rRNA gene from each sample that was used for phylogenetic reconstruction (Fig. 2). Samples that failed to meet quality cutoffs were excluded from this analysis (). The 16S rRNA gene was reconstructed from a SPAdes (26) assembly and annotated against the SILVA (27) rRNA database () for 1,470 samples. There were 88 samples that were excluded from the phylogeny: 12 samples that did not meet the requirement of a mean read length of 50 bp, 17 samples in which a 16S gene could not be reconstructed, 19 samples that had a mismatch in assembly and mapped-read taxon designations, and 40 samples that had 16S genes reconstructed for multiple species. A phylogenetic tree was created with IQ-TREE (28–30) based on a multiple-sequence alignment of the reconstructed 16S genes with MAFFT (31). Taxonomic classifications from GTDB-Tk were used to annotate the 16S genes with iTOL (32).
FIG 2

Maximum-likelihood phylogeny from reconstructed 16S rRNA genes. A phylogenetic representation of 1,470 samples using IQ-TREE (28–30). (a) A tree of the full set of samples. The outer ring represents the genus assigned by GTDB-Tk, as indicated. (b) The same tree as shown in panel a, but with the non-Lactobacillus clade collapsed. Major groups of Lactobacillus species (indicated with a letter) and the most sequenced Lactobacillus species have been labeled. The inner ring represents the average nucleotide identity (ANI), determined by FastANI (6), of samples to L. crispatus. The tree was built from a multiple-sequence alignment (31) of 16S genes reconstructed by phyloFlash (25) with 1,281 parsimony-informative sites. The likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −54,698. Taxonomic classifications were assigned by GTDB-Tk (21).

FIG S4

Assembled genome sizes are consistent within species.

TABLE S2

Samples with a non-Lactobacillus taxonomic classification.

A recent analysis of completed genomes in the NCBI found 239 discontinuous de novo Lactobacillus species using a 94% ANI cutoff (33). Based on GTDB taxonomic classification, which applies a 95% ANI cutoff, we identified 161 distinct Lactobacillus species in 1,554 samples. The five most sequenced Lactobacillus species, accounting for 45% of the total, were L. rhamnosus (n = 225), L. paracasei (n = 180), L. gasseri (n = 132), L. plantarum (n = 86), and L. fermentum (n = 80). Within these five species the assembled genome sizes were remarkably consistent (Fig. S4). There were 58 samples that were not classified as Lactobacillus, of which 34 were classified as Streptococcus pneumoniae by both 16S gene sequencing and GTDB (Table S2).
We found that 505 (∼33%) of 1,554 taxonomic classifications by 16S gene and GTDB were in conflict with the taxonomy according to the NCBI SRA, illustrating the importance of an unbiased approach to understanding sample context. In samples that had both a 16S and a GTDB taxonomic classification, there was disagreement in 154 out of 1,467 samples. Of these, 47% were accounted for by the recently described L. paragasseri (34) (n = 72). This possibly highlights a lag in the reclassification of assemblies in the NCBI Assembly database.
Analysis of the pan-genome of the entire genus using a tool such as Roary (7) would return only a few core genes, owing to sequence divergence of evolutionarily distant species. However, because the “roary” BaT can be supplied with a list of individual samples, it is possible to isolate the analysis to the species level. As an example of using BaTs to focus on a particular group within the larger set of results, we chose L. crispatus, a species commonly isolated from the human vagina and also found in the guts/feces of poultry. We used the “fastani” BaT to estimate the ANI of all samples against a single, randomly selected L. crispatus completed genome (NCBI Assembly accession no. GCF_003795065) with FastANI (6). A cutoff of greater than 95% ANI was used to categorize a sample as L. crispatus. A pan-genome analysis was conducted on only the samples categorized as L. crispatus using the “roary” BaT. The “roary” BaT downloaded all available completed L. crispatus genomes with ncbi-genome-download (35), formatted the completed genomes with Prokka (36), created a pan-genome and core-genome alignment with Roary (7), identified and masked recombination with ClonalFrameML (37) and maskrc-svg (38), and created a phylogenetic tree with IQ-TREE (28–30) and a pairwise SNP distance matrix with snp-dists (39).
ANI analysis revealed 38 samples as having >96.1% ANI to L. crispatus, with no other sample greater than 83.1%. Four completed L. crispatus genomes were also included in the analysis (Table 4), for a total of 42 genomes. The pan-genome of L. crispatus was revealed to have 7,037 gene families and 972 core genes (Fig. 3). Similar to a recent analysis by Pan et al. (40), L. crispatus was separated into two main phylogenetic groups, one associated with human vaginal isolates and the other having more mixed provenance and including chicken, turkey, and human gut isolates.
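The ANI-based selection described above can be sketched in a few lines of Python. This is only an illustration of applying the 95% cutoff, not the code used by the “fastani” BaT. It assumes FastANI's default tab-delimited output (query, reference, ANI, mapped fragments, total fragments); the function name and the example rows are hypothetical.

```python
import csv
import io

ANI_CUTOFF = 95.0  # percent; the cutoff stated in the text


def classify_by_ani(fastani_tsv, cutoff=ANI_CUTOFF):
    """Return the query names whose ANI to the reference exceeds the cutoff.

    Assumes tab-delimited rows of: query, reference, ANI (percent),
    mapped fragments, total fragments.
    """
    hits = []
    for row in csv.reader(fastani_tsv, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, ani = row[0], float(row[2])
        if ani > cutoff:
            hits.append(query)
    return hits


# Hypothetical example rows (not results from the study).
example = io.StringIO(
    "sample_A.fna\tGCF_003795065.fna\t98.2\t600\t650\n"
    "sample_B.fna\tGCF_003795065.fna\t82.9\t310\t640\n"
    "sample_C.fna\tGCF_003795065.fna\t96.4\t590\t655\n"
)
selected = classify_by_ani(example)
print(selected)
```

The resulting list of sample names is the kind of input that could then be passed to a species-level pan-genome analysis.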
TABLE 4

Lactobacillus crispatus genomes used in pan-genome analysis (a)

Accession no. (b)
BioProject   BioSample       Experiment (b)  Host (c)  Source (c)  Reference
PRJEB8104    SAMEA3319334    ERX1126086      Human*    Urine*
PRJEB8104    SAMEA3319350    ERX1126089      Human*    Urine*
PRJEB8104    SAMEA3319265    ERX1126106      Human*    Urine*
PRJEB8104    SAMEA3319366    ERX1126138      Human*    Urine*
PRJEB8104    SAMEA3319373    ERX1126140      Human*    Urine*
PRJEB8104    SAMEA3319383    ERX1126143      Human*    Urine*
PRJEB8104    SAMEA3319392    ERX1126150      Human*    Urine*
PRJEB22112   SAMEA104208649  ERX2150228      Human*    Urine*
PRJEB22112   SAMEA104208650  ERX2150229      Human*    Urine*
PRJEB3060    SAMEA1920319    ERX271950       Human*    Unknown
PRJEB3060    SAMEA1920326    ERX271958       Human*    Unknown
PRJEB3060    SAMEA1920319    ERX450852       Human*    Unknown
PRJEB3060    SAMEA1920326    ERX450860       Human*    Unknown
PRJNA50051   SAMN00109860    SRX026143       Human*    Vaginal*
PRJNA272101  SAMN03854351    SRX1090887      Human     Urine       113
PRJNA50053   SAMN00829399    SRX130900       Human*    Vaginal*
PRJNA50057   SAMN00829123    SRX130912       Human*    Vaginal*
PRJNA50067   SAMN00829125    SRX130914       Human*    Vaginal*
PRJNA52107   SAMN01057066    SRX155504       Human*    Vaginal*
PRJNA52105   SAMN01057067    SRX155505       Human*    Vaginal*
PRJNA52107   SAMN01057066    SRX155863       Human*    Vaginal*
PRJNA52105   SAMN01057067    SRX155875       Human*    Vaginal*
PRJNA379934  SAMN06624125    SRX2660270      Human     Eye
PRJNA222257  SAMN02369387    SRX456245       Human     Eye
PRJNA231221  SAMN11056458    SRX5949263      Human     Vaginal     114
PRJNA547620  SAMN11973370    SRX5986001      Human     Vaginal     115
PRJNA547620  SAMN11973369    SRX5986002      Human     Vaginal     115
PRJNA547620  SAMN11973371    SRX5986003      Human     Vaginal     115
PRJNA557339  SAMN12395213    SRX6613945      Human     Vaginal     116
PRJNA563077  SAMN12667791    SRX6959881      Human     Gut         40
PRJNA563077  SAMN12667801    SRX6959883      Chicken   Gut         40
PRJNA563077  SAMN12667803    SRX6959885      Human     Gut         40
PRJNA563077  SAMN12667804    SRX6959886      Turkey    Gut         40
PRJNA563077  SAMN12667805    SRX6959887      Human     Eye         40
PRJNA563077  SAMN12667793    SRX6959888      Chicken   Gut         40
PRJNA563077  SAMN12667794    SRX6959889      Chicken   Gut         40
PRJNA563077  SAMN12667795    SRX6959890      Chicken   Gut         40
PRJNA563077  SAMN12667796    SRX6959891      Chicken   Gut         40
PRJNA563077  SAMN12667797    SRX6959892      Chicken   Gut         40
PRJNA563077  SAMN12667798    SRX6959893      Chicken   Gut         40
PRJNA563077  SAMN12667799    SRX6959894      Chicken   Gut         40
PRJNA563077  SAMN12667800    SRX6959895      Chicken   Gut         40
PRJNA531669  SAMN11372136    GCF_009769205   Chicken   Gut         117
PRJNA231221  SAMN11056458    GCF_009730275   Human     Vaginal     114
PRJNA431864  SAMN08409124    GCF_003971565   Human     Vaginal     118
PRJNA499123  SAMN10343598    GCF_003795065   Human     Vaginal     119

(a) Lactobacillus crispatus samples (n = 42) were used in the pan-genome analysis.

(b) NCBI Assembly (beginning with GCF) or SRA experiment accession number.

(c) The host and source were collected from metadata associated with the BioSample or available publications. In cases when a host and/or source was not explicitly stated, it was inferred from available metadata (denoted by an asterisk).

FIG 3

Core-genome maximum-likelihood phylogeny of Lactobacillus crispatus. A core-genome phylogenetic representation using IQ-Tree (28–30) of 42 L. crispatus samples. The putatively recombinant positions predicted using ClonalFrameML (37) were removed from the alignment with maskrc-svg (38). The tree was built from 972 core genes identified by Roary with 9,209 parsimony-informative sites. The log-likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −1,418,106.

Last, we looked at patterns of antibiotic resistance across the genus using a table, generated by the “summary” BaT, of resistance genes and loci called by AMRFinder+ (41). Only 79 out of 1,496 Lactobacillus samples defined by GTDB-Tk (21) were found to have predicted resistance using AMRFinder+. The most common resistance categories were tetracyclines (67 samples), followed by macrolides, lincosamides, and aminoglycosides (16, 15, and 11 samples, respectively). Species with the highest proportion of resistance included L. amylovorus (12/14 tetracycline resistant) and L. crispatus (10/42 tetracycline resistant). Only three genomes of L. amylophilus were included in the study, but each contained matches to genes for macrolide, lincosamide, and tetracycline resistance. The linking thread between these species is that they are each commonly isolated from agricultural animals. The high proportion of L. crispatus samples isolated from chickens that were tetracycline resistant has been previously observed (42, 43) (Fig. 3).
A recent analysis of 184 Lactobacillus type strain genomes by Campedelli et al. (44) found a higher percentage of type strains with aminoglycoside (20/184), tetracycline (18/184), erythromycin (6/184), and clindamycin (60/184) resistance. Forty-two of the type strains had chloramphenicol resistance genes whereas, here, AMRFinder+ returned only 1/1,467 genes. These differences probably reflect a combination of the different sampling biases of the studies and the strategy of Campedelli et al. to use a relaxed threshold for hits to maximize sensitivity (blastp matches against the CARD database with amino acid sequence identity of 30% and query coverage of 70% [44]). Resistance is probably undercalled by both methods because of a lack of well-characterized resistance loci from the Lactobacillus genus to use for comparison.

DISCUSSION

Bactopia is a flexible workflow for bacterial genomics. It can be run on a laptop for a single bacterial sample, but, critically, the underlying Nextflow framework allows it to make efficient use of large clusters and cloud-computing environments to process the many thousands of genomes that are currently being generated. For users who are not familiar with bacterial genomic tools and/or who require a standardized pipeline, Bactopia is a one-stop shop that can be easily deployed using conda, Docker, and Singularity containers. For researchers with particular interest in individual species or genera, BaDs can be highly customized with taxon-specific databases. The current version of Bactopia has only minimal support for long-read data, but this is an area that we plan to expand in the future. We also plan to implement more comparative analyses in the form of additional BaTs. With a framework set in place for developing BaTs, it should be possible to make a toolbox of workflows that not only can be used for all bacteria but also are customized for annotating genes and loci specific to particular species.

MATERIALS AND METHODS

Bactopia Data Sets.

The Bactopia pipeline can be run without downloading and formatting Bactopia Data Sets (BaDs). However, providing them enriches the downstream analysis. Bactopia can import specific existing public data sets, as well as accessible user-provided data sets in the appropriate format. A subcommand (“bactopia datasets”) was created to automate downloading, building, and/or configuring these data sets for Bactopia. BaDs can be grouped into those that are general and those that are user supplied. General data sets include a Mash (17) sketch of the NCBI RefSeq (16) and PLSDB (20) databases and a Sourmash (19) signature of microbial genomes (including viral and fungal) from the NCBI GenBank (18) database. Ariba (13), a software program for detecting genes in raw read (FASTQ) files, uses a number of default reference databases for virulence and antibiotic resistance. The available Ariba data sets include ARG-ANNOT (45), CARD (15), MEGARes (46), NCBI Reference Gene Catalog (47), plasmidfinder (48), resfinder (49), SRST2 (50), VFDB (14), and VirulenceFinder (51). When an organism name is provided, additional data sets are set up. If a multilocus sequence typing (MLST) schema is available for the species, it is downloaded from PubMLST.org (52) and set up for BLAST+ (53) and Ariba. Each RefSeq completed genome for the species is downloaded using ncbi-genome-download (35). A Mash sketch is created from the set of downloaded completed genomes to be used for automatic reference selection for variant calling. Protein sequences are extracted from each genome with BioPython (54), clustered using CD-HIT (55, 56), and formatted to be used by Prokka (36) for annotation. Users may also provide their own organism-specific reference data sets to be used for BLAST+ alignment, short-read alignment, or variant calling.

Bactopia Analysis Pipeline.

The Bactopia Analysis Pipeline (BaAP) takes input FASTQ or preassembled genomes as FASTA files and optional user-specified BaDs and performs a number of workflows that are based on either de novo whole-genome assembly, reference mapping, or sequence decomposition (i.e., k-mer-based approaches) (Fig. 1b). BaAP has incorporated numerous existing bioinformatic tools (Table 1) into its workflow (Fig. 1b; see also Fig. S1 in the supplemental material). For each tool, many of the input parameters are exposed to the user, allowing for fine-tuning analysis.

BaAP: acquiring FASTQs.

Bactopia provides multiple ways for users to supply their FASTQ-formatted sequences. Input FASTQs can be local or downloaded from public repositories, and preassembled genomes can also be supplied as FASTA files. There is also an option for hybrid assembly of Illumina and long-read data. Local sequences can be processed one at a time or in batches. To process a single sample, the user provides the path to the FASTQ(s) and a sample name. For multiple samples, this method does not make efficient use of Nextflow’s queue system. Alternatively, users can provide a “file of filenames” (FOFN), which is a tab-delimited file with information about samples and paths to the corresponding FASTQ(s). By using the FOFN method, Nextflow queues each sample and makes efficient use of available resources. A subcommand (“bactopia prepare”) was created to automate the creation of an FOFN. Raw sequences available from public repositories (e.g., European Nucleotide Archive [ENA], Sequence Read Archive [SRA], DNA Data Bank of Japan [DDBJ], or NCBI Assembly) can also be processed by Bactopia. Sequences associated with a provided experiment accession number (e.g., DRX, ERX, or SRX prefix) or NCBI Assembly accession number (e.g., GCF or GCA prefix) are downloaded and processed exactly as local sequences would be. A subcommand (“bactopia search”) was created which allows users to query ENA to create a list of experiment accession numbers from the ENA Data Warehouse API (57) associated with a BioProject accession number, taxon ID, or organism name.
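To make the FOFN idea concrete, the following Python sketch groups FASTQ paths by sample and writes one tab-delimited row per sample. It is not the “bactopia prepare” implementation. The column names (sample, r1, r2), the file-naming scheme (`_R1`/`_R2` suffixes), and the function name are assumptions chosen for illustration, and the real FOFN schema may differ.

```python
import csv
import io
from collections import defaultdict


def build_fofn(fastq_paths):
    """Group FASTQ paths by sample and return tab-delimited FOFN text.

    Assumes paired files are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
    and single-end files <sample>.fastq.gz (a hypothetical naming scheme).
    """
    samples = defaultdict(dict)
    for path in sorted(fastq_paths):
        name = path.rsplit("/", 1)[-1]
        stem = name[: -len(".fastq.gz")] if name.endswith(".fastq.gz") else name
        if stem.endswith("_R1"):
            samples[stem[:-3]]["r1"] = path
        elif stem.endswith("_R2"):
            samples[stem[:-3]]["r2"] = path
        else:
            samples[stem]["r1"] = path  # treat as single-end

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "r1", "r2"])  # illustrative header, not Bactopia's schema
    for sample in sorted(samples):
        files = samples[sample]
        writer.writerow([sample, files.get("r1", ""), files.get("r2", "")])
    return out.getvalue()


# Hypothetical paths for illustration.
fofn_text = build_fofn([
    "/data/s2.fastq.gz",
    "/data/s1_R2.fastq.gz",
    "/data/s1_R1.fastq.gz",
])
print(fofn_text, end="")
```

Listing one sample per row is what lets a workflow manager schedule each sample as an independent job.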

BaAP: validating FASTQs.

The path for input FASTQ(s) is validated, and, if necessary, sequences from public repositories are downloaded using fastq-dl (58). If a preassembled genome is provided as an input, 2- by 250-bp paired-end reads are simulated using ART (59). Once validated, the FASTQ input(s) is tested to determine if it meets a minimum threshold for continued processing. All BaAP steps expect to use Illumina sequence data, which represent the great majority of genome projects currently generated. FASTQ files that are explicitly marked as non-Illumina or have properties that suggest that they are non-Illumina (e.g., read length or error profile) are excluded. By default, input FASTQs must exceed 2,241,820 bases (20× coverage of the smallest bacterial genome, Nasuia deltocephalinicola [60]) and 7,472 reads (minimum required base pairs/300 bp, the longest available reads from Illumina). If estimated, the genome size must be between 100,000 bp and 18,040,666 bp, which is based on the range of known bacterial genome sizes (N. deltocephalinicola, NCBI accession no. GCF_000442605, 112,091 bp; Minicystis rosea, NCBI accession no. GCF_001931535, 16,040,666 bp). A sample that fails these requirements is excluded from subsequent analysis. The threshold values can be adjusted by the user at runtime.
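The default thresholds follow directly from the numbers given above, as this short Python sketch shows. The constants come from the text; the function name and the choice to treat the genome-size bounds as inclusive are assumptions made for illustration, and this is not Bactopia's own validation code.

```python
SMALLEST_GENOME_BP = 112_091   # N. deltocephalinicola, from the text
MIN_COVERAGE = 20              # 20x coverage, from the text
LONGEST_READ_BP = 300          # longest Illumina read length, from the text
MIN_GENOME_SIZE = 100_000      # default lower bound, from the text
MAX_GENOME_SIZE = 18_040_666   # default upper bound, from the text

# Minimum bases: 20x coverage of the smallest known bacterial genome.
MIN_BASES = SMALLEST_GENOME_BP * MIN_COVERAGE
# Minimum reads: minimum bases divided by the longest read length.
MIN_READS = MIN_BASES // LONGEST_READ_BP


def passes_validation(total_bases, total_reads, genome_size=None):
    """Return True if a sample meets the default thresholds.

    Bases and reads must exceed their minimums. If a genome size estimate
    is available, it must fall within the allowed range (bounds treated as
    inclusive here, which is an assumption).
    """
    if total_bases <= MIN_BASES or total_reads <= MIN_READS:
        return False
    if genome_size is not None:
        return MIN_GENOME_SIZE <= genome_size <= MAX_GENOME_SIZE
    return True


print(MIN_BASES, MIN_READS)
```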

BaAP: FastQ quality control and generation of pFASTQ.

Input FASTQs that pass the validation steps undergo quality control steps to remove poor-quality reads. BBDuk, a component of BBTools (61), removes Illumina adapters and phiX contaminants and filters reads based on length and quality. Base calls are corrected using Lighter (62). At this stage, the default procedure is to downsample the FASTQ file to an average 100× genome coverage (if over 100×) with Reformat (from BBTools). This step, which was used in StAP (4), significantly saves computing time at little final cost to assembly or SNP calling accuracy. The genome size for coverage calculation is either provided by the user or estimated based on the FASTQ data by Mash (17). The user can provide their own value for downsampling FASTQs or disable it completely. Summary statistics before and after QC are created using FastQC (63) and fastq-scan (64). After QC, the original FASTQs are no longer used, and only the processed FASTQs (pFASTQ) are used in subsequent analysis.
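The downsampling decision reduces to a simple ratio, sketched here in Python. The function name is hypothetical and the example numbers are invented; the actual read subsampling is done by Reformat, which this sketch does not reproduce.

```python
def downsample_fraction(total_bases, genome_size, target_coverage=100):
    """Return the fraction of reads to keep to reach the target coverage.

    Returns 1.0 (keep everything) when coverage is already at or below
    the target, mirroring the "if over 100x" condition in the text.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    coverage = total_bases / genome_size
    if coverage <= target_coverage:
        return 1.0
    return target_coverage / coverage


# Hypothetical example: 700 Mb of reads for a 2.8-Mb genome is 250x coverage.
fraction = downsample_fraction(700_000_000, 2_800_000)
print(fraction)
```

Passing a different `target_coverage` corresponds to the user supplying their own downsampling value.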

BaAP: assembly, reference mapping, and decomposition.

BaAP uses Shovill (65) to create a draft de novo assembly with MEGAHIT (66), SKESA (67) (default), SPAdes (26), or Velvet (68) and makes corrections using Pilon (69) from the pFASTQ. Alternatively, if long reads were provided with paired-end pFASTQ, a hybrid assembly is created with Unicycler (70). The quality of the draft assembly is assessed by QUAST (71) and CheckM (72). Summary statistics for the draft assembly are created using assembly-scan (73). If the total size of the draft assembly fails to meet a user-specified minimum size, further assembly-based analyses are discontinued. Otherwise, a BLAST+ (53) nucleotide database is created from the contigs. The draft assembly is also annotated using Prokka (36). If available at runtime, Prokka will first annotate with a clustered RefSeq protein set, followed by its default databases. The annotated genes and proteins are then subjected to antimicrobial resistance prediction with AMRFinder+ (47). For each pFASTQ, sketches are created using Mash (k = 21,31) and Sourmash (19) (k = 21,31,51). McCortex (74) is used to count 31-mers in the pFASTQ.
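For readers unfamiliar with sequence decomposition, the following Python sketch shows what counting k-mers in a set of reads means. It is a conceptual illustration only and says nothing about how McCortex is implemented. Treating a k-mer and its reverse complement as the same key (a "canonical" k-mer) is a common convention and an assumption here; the example uses k = 3 so the result is easy to check by hand.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(reads, k=31):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


# Toy reads for illustration; real data would use k = 31.
counts = count_kmers(["ACGTA", "TACGT"], k=3)
print(dict(counts))
```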

BaAP: optional steps.

At runtime, Bactopia checks for BaDs specified by the command line (if any) and adjusts the settings of the pipeline accordingly. Examples of processes executed only if a BaDs is specified include Ariba (13) analysis for each available reference data set and sequence containment estimation against RefSeq (16) with mash screen (75), against GenBank (18) with sourmash lca gather (19), and against PLSDB (20) with mash screen and BLAST+. The sequence type (ST) of the sample is determined with BLAST+ and Ariba. The nearest reference RefSeq genome, based on mash (17) distance, is downloaded with ncbi-genome-download (35), and variants are called with Snippy (76). Alternatively, one or more reference genomes can be provided by the user. Users can also provide sequences for sequence alignment with BLAST+ and per-base coverage with BWA (77, 78) and Bedtools (79).
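Automatic reference selection by Mash distance can be illustrated with a small MinHash sketch in Python. This is a simplified illustration of the idea, not Mash itself: SHA-1 from the standard library stands in for the hash function, the sketch size and sequences are arbitrary, and all function names are hypothetical. The distance uses the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard similarity.

```python
import hashlib
import math
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def sketch(seq, k=21, size=500):
    """Bottom-`size` MinHash sketch: the smallest distinct k-mer hashes."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(canonical(seq[i:i + k]).encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=500):
    """Estimate Jaccard similarity from the smallest hashes of the merged sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard, k=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); taken as 1.0 when j is 0."""
    if jaccard <= 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


def nearest_reference(query, references, k=21, size=500):
    """Return (name, distance) of the reference closest to the query sequence."""
    query_sketch = sketch(query, k, size)
    distances = {
        name: mash_distance(jaccard_estimate(query_sketch, sketch(seq, k, size), size), k)
        for name, seq in references.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Synthetic sequences for illustration only (not real genomes).
rng = random.Random(0)
ref_a = "".join(rng.choice("ACGT") for _ in range(3000))
ref_b = "".join(rng.choice("ACGT") for _ in range(3000))
# The query is ref_a with one substitution every 300 bases.
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
query = "".join(swap[b] if i % 300 == 150 else b for i, b in enumerate(ref_a))

best_name, best_dist = nearest_reference(query, {"ref_a": ref_a, "ref_b": ref_b})
print(best_name, round(best_dist, 4))
```

Because only the sketches are compared, the same approach scales to picking one reference out of many completed genomes without aligning the sample to each of them.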

Bactopia tools.

After BaAP has successfully finished, it will create a directory for each strain with subdirectories for each analysis result. The directory structure is independent of the project or options chosen. Bactopia Tools (BaTs) are a set of comparative-analysis workflows written using Nextflow that take advantage of the predictable output structure from BaAP. Each BaT is created from the same framework, and a subcommand (“bactopia tools create”) is available to simplify the creation of future BaTs. Five BaTs were used for analyses in this article. The “summary” BaT outputs a summary report of the set of samples and a list of samples that failed to meet thresholds set by the user. This summary includes basic sequence and assembly stats as well as technical (pass/fail) information. The “roary” BaT creates a pan-genome of the set of samples with Roary (7), with the option to include RefSeq (16) completed genomes. The “fastani” BaT determines the pairwise average nucleotide identity (ANI) for each sample with FastANI (6). The “phyloflash” BaT reconstructs 16S rRNA gene sequences with phyloFlash (25). The “gtdb” BaT assigns taxonomic classifications from the Genome Taxonomy Database (GTDB) (5) with GTDB-Tk (21). Each Bactopia tool has a separate Nextflow workflow with its own conda environment, Docker image, and Singularity image. Additional BaTs are currently available for eggNOG-mapper (80, 81), ISMapper (82), Mashtree (83), and PIRATE (84).

Data availability.

Raw Illumina sequences of Lactobacillus samples used in this study were acquired from experiments submitted under BioProject accession numbers PRJDB1101, PRJDB1726, PRJDB4156, PRJDB4955, PRJDB5065, PRJDB5206, PRJDB6480, PRJDB6495, PRJEB10572, PRJEB11980, PRJEB14693, PRJEB18589, PRJEB19875, PRJEB21025, PRJEB21680, PRJEB22112, PRJEB22252, PRJEB23845, PRJEB24689, PRJEB24698, PRJEB24699, PRJEB24700, PRJEB24701, PRJEB24713, PRJEB24715, PRJEB25194, PRJEB2631, PRJEB26638, PRJEB2824, PRJEB29398, PRJEB29504, PRJEB2977, PRJEB3012, PRJEB3060, PRJEB31213, PRJEB31289, PRJEB31301, PRJEB31307, PRJEB5094, PRJEB8104, PRJEB8721, PRJEB9718, PRJNA165565, PRJNA176000, PRJNA176001, PRJNA183044, PRJNA184888, PRJNA185359, PRJNA185406, PRJNA185584, PRJNA185632, PRJNA185633, PRJNA188920, PRJNA188921, PRJNA212644, PRJNA217366, PRJNA218804, PRJNA219157, PRJNA222257, PRJNA224116, PRJNA227106, PRJNA227335, PRJNA231221, PRJNA234998, PRJNA235015, PRJNA235017, PRJNA247439, PRJNA247440, PRJNA247441, PRJNA247442, PRJNA247443, PRJNA247444, PRJNA247445, PRJNA247446, PRJNA247452, PRJNA254854, PRJNA255080, PRJNA257137, PRJNA257138, PRJNA257139, PRJNA257141, PRJNA257142, PRJNA257182, PRJNA257185, PRJNA257853, PRJNA257876, PRJNA258355, PRJNA258500, PRJNA267549, PRJNA269805, PRJNA269831, PRJNA269832, PRJNA269860, PRJNA269905, PRJNA270961, PRJNA270962, PRJNA270963, PRJNA270964, PRJNA270965, PRJNA270966, PRJNA270967, PRJNA270968, PRJNA270969, PRJNA270970, PRJNA270972, PRJNA270973, PRJNA270974, PRJNA272101, PRJNA272102, PRJNA283920, PRJNA289613, PRJNA29003, PRJNA291681, PRJNA296228, PRJNA296248, PRJNA296274, PRJNA296298, PRJNA296309, PRJNA296751, PRJNA296754, PRJNA298448, PRJNA299992, PRJNA300015, PRJNA300023, PRJNA300088, PRJNA300119, PRJNA300123, PRJNA300179, PRJNA302242, PRJNA303235, PRJNA303236, PRJNA305242, PRJNA306257, PRJNA309616, PRJNA312743, PRJNA315676, PRJNA316969, PRJNA322958, PRJNA322959, PRJNA322960, PRJNA322961, PRJNA336518, PRJNA342061, PRJNA342757, PRJNA347617, PRJNA348789, PRJNA376205, 
PRJNA377666, PRJNA379934, PRJNA381357, PRJNA382771, PRJNA388578, PRJNA392822, PRJNA397632, PRJNA400793, PRJNA434600, PRJNA436228, PRJNA474823, PRJNA474907, PRJNA476494, PRJNA477598, PRJNA481120, PRJNA484967, PRJNA492883, PRJNA493554, PRJNA496358, PRJNA50051, PRJNA50053, PRJNA50055, PRJNA50057, PRJNA50059, PRJNA50061, PRJNA50063, PRJNA50067, PRJNA50115, PRJNA50117, PRJNA50125, PRJNA50133, PRJNA50135, PRJNA50137, PRJNA50139, PRJNA50141, PRJNA50159, PRJNA50161, PRJNA50163, PRJNA50165, PRJNA50167, PRJNA50169, PRJNA50173, PRJNA504605, PRJNA504734, PRJNA505088, PRJNA52105, PRJNA52107, PRJNA52121, PRJNA525939, PRJNA530250, PRJNA533291, PRJNA533837, PRJNA542049, PRJNA542050, PRJNA542054, PRJNA543187, PRJNA544527, PRJNA547620, PRJNA552757, PRJNA554696, PRJNA554698, PRJNA557339, PRJNA562050, PRJNA563077, PRJNA573690, PRJNA577465, PRJNA578299, PRJNA68459, and PRJNA84. Links for the websites and software used in this study are as follows: Bactopia website and documentation, https://bactopia.github.io/; Github, https://www.github.com/bactopia/bactopia/; Zenodo Snapshot, https://doi.org/10.5281/zenodo.3926909; Bioconda, https://bioconda.github.io/recipes/bactopia/README.html; and the containers Docker, https://cloud.docker.com/u/bactopia/, and Singularity, https://cloud.sylabs.io/library/rpetit3/bactopia.