Erik L Clarke, Louis J Taylor, Chunyu Zhao, Andrew Connell, Jung-Jin Lee, Bryton Fett, Frederic D Bushman, Kyle Bittinger.
Abstract
BACKGROUND: Analysis of mixed microbial communities using metagenomic sequencing experiments requires multiple preprocessing and analytical steps to interpret the microbial and genetic composition of samples. Analytical steps include quality control, adapter trimming, host decontamination, metagenomic classification, read assembly, and alignment to reference genomes.
Keywords: Pipeline; Quality control; Shotgun metagenomic sequencing; Software; Sunbeam
Year: 2019 PMID: 30902113 PMCID: PMC6429786 DOI: 10.1186/s40168-019-0658-x
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Inputs, processes, and outputs for standard steps in the Sunbeam metagenomics pipeline
Fig. 2 Schematics of example extension inputs and contents. a Files for the extension sbx_metaspades_example, which uses MetaSPAdes to assemble reads from quality-controlled fastq.gz files. sbx_metaspades_example.rules lists the procedure necessary to generate assembly results from a pair of decontaminated, quality-controlled FASTQ input files. requirements.txt lists the software requirements for the package to be installed through Conda. b Files contained within the sbx_report extension: requirements.txt lists the software requirements for the package to be installed through Conda; sbx_report.rules contains the code for the rule, as above; final_report.Rmd is an R Markdown script that generates and visualizes the report; example.html is an example report; and README.md provides instructions for installing and running the extension. Sunbeam inputs required for each extension are shown as colored shapes above the extensions
Fig. 4 a Comparison between Komplexity and similar software (BBMask, DUST, and RepeatMasker). The small bar plot in the lower left shows the total nucleotides masked by each tool. The central bar plot shows the number of unique nucleotides masked by every tool combination; each combination is shown by the connected dots below. Bars displaying nucleotides masked by tool combinations that include Komplexity are colored red. b Example complexity score distributions calculated by Komplexity for reads from ten stool virome samples (high microbial biomass; [15]) and ten bronchoalveolar lavage (BAL) virome samples (low-biomass, high-host; [12]) using the default parameters
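The caption for Fig. 4b refers to per-read complexity scores but does not define them. As a rough illustration only, the sketch below assumes a score equal to the number of distinct k-mers in a read divided by the read length, with k = 4. Both the definition and the value of k are assumptions made for this example and are not taken from the caption; the function name `kmer_complexity` is hypothetical and is not part of Komplexity's interface.

```python
def kmer_complexity(seq: str, k: int = 4) -> float:
    """Return (number of distinct k-mers) / (sequence length).

    Assumed scoring scheme for illustration; k=4 is an assumed default.
    Reads shorter than k contain no k-mers and score 0.0.
    """
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    # Collect every overlapping k-mer into a set to count distinct ones.
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers) / len(seq)


# A homopolymer run has a single distinct 4-mer, so it scores very low.
low = kmer_complexity("AAAAAAAAAAAAAAAA")
# A mixed sequence has many distinct 4-mers and scores much higher.
high = kmer_complexity("ACGTTGCATCAGGCTA")
print(f"low-complexity read: {low:.4f}, mixed read: {high:.4f}")
```

Under this assumed definition, a read dominated by short repeats would fall toward the low end of a distribution like the one plotted in panel b, and a filter would mask or discard reads below a chosen threshold.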
Feature comparison for metagenomic pipelines
| | Sunbeam | SURPI | KneadData | EDGE | ATLAS |
|---|---|---|---|---|---|
| Architecture/usage | |||||
| Dependency management | Conda | Bash | Pip (partial) | Conda | |
| Modularity | Snakemake | Perl modules | Snakemake | ||
| Results reporting | Tables, coverage maps, figures | Tables, coverage maps | Tables, coverage maps | ||
| Extension framework | Sunbeam extensions | ||||
| Clinical certification | CLIA | ||||
| Data source | Local, SRA | Local | Local | Local | Local |
| Quality control | |||||
| Adapter trimming | Trimmomatic, Cutadapt | Cutadapt | Trimmomatic | FaQCs | BBDuk2 |
| Error correction | Tadpole | ||||
| Read quality | Fastqc | Cutadapt | Fastqc | FaQCs | BBDuk2 |
| Host filtering | Any | Human | Any | Any | Any |
| Low complexity | Komplexity | DUST | TRF | Mono- or dinucleotide repeats | BBDuk2 |
| Read subsampling/rarefaction | VSEARCH (extension) | ||||
| Sequence analysis | |||||
| Reference alignment | BWA | Bowtie2, MUMmer + JBrowse | BBMap | ||
| Classification | Kraken, (MetaPhlAn2, Kaiju extensions) | SNAP | GOTTCHA, Kraken, MetaPhlAn | DIAMOND | |
| Assembly | MEGAHIT | Minimo | IDBA-UD, SPAdes | MEGAHIT, SPAdes | |
| ORFs (aa) | Prodigal, BLASTp | Prokka | |||
| Full contig (nt) | Circularity, BLASTn | RAPSearch | BWA | DIAMOND | |
| Functional annotation | eggNOG (extension) | ENZYME/eggNOG/ dbCAN | |||
| Phylogeny reconstruction | PhaME, FastTree/RAxML | ||||
| Primer design | BW, Primer3 | ||||
Tools used by each pipeline: trimmomatic [46], cutadapt [45], tadpole [78], fastqc [47], FaQCs [79], BBDuk2 [80], DUST [36], TRF [81], VSEARCH [82], bwa [48], bowtie2 [83], BBMap [84], Kraken [49], SNAP [85], MUMmer [86], JBrowse [87], GOTTCHA [88], MetaPhlAn [58], DIAMOND [89], FastTree [90], MEGAHIT [51], SPAdes [91], Minimo [92], Prodigal [52], BLASTp [53], Prokka [93], BLASTn [53], eggNOG [94], ENZYME [95], dbCAN [96], Primer3 [97], RAPSearch [98], RAxML [99], conda [43], PhaME [100], Snakemake [30], SAMtools [54]
Fig. 3 a Nonmetric multidimensional scaling plots generated using the vegan package in R [76], using MetaPhlAn2 classifications of data from Lewis et al. [61]. Each point is colored by the cluster to which it was assigned in the Lewis et al. metadata: cluster 2 (red) is the dysbiotic cluster, while cluster 1 (blue) is the healthy-like cluster. b Inverse Simpson diversity by absolute latitude, calculated using the vegan package in R from the Kraken classification output of Sunbeam for Bahram et al. [63]. Points are colored by habitat. The polynomial regression line is shown in black. c Boxplots of unique Anelloviridae taxa in each sample from McCann et al. [64]. Each point corresponds to a single sample. d Heatmap from shallow shotgun analysis colored by proportional abundance. Each row corresponds to a bacterial taxon; each column represents a different reagent combination. Columns are grouped by time point, then by subject (top). All plots were generated using the ggplot2 R package [77]
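Panel b reports inverse Simpson diversity, which the authors computed with the vegan package in R. For readers unfamiliar with the index, the following is a minimal Python sketch of the standard formula, $1 / \sum_i p_i^2$, where $p_i$ is the proportion of reads assigned to taxon $i$. It is not the authors' code; the function name `inverse_simpson` and the example counts are hypothetical.

```python
from collections.abc import Sequence


def inverse_simpson(counts: Sequence[float]) -> float:
    """Inverse Simpson index, 1 / sum(p_i^2), from per-taxon counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    # Convert counts to proportions, ignoring taxa with zero reads.
    proportions = [c / total for c in counts if c > 0]
    return 1.0 / sum(p * p for p in proportions)


# Hypothetical read counts per taxon for two samples.
even_sample = [10, 10, 10, 10]   # four equally abundant taxa
skewed_sample = [97, 1, 1, 1]    # one dominant taxon
print(inverse_simpson(even_sample))
print(inverse_simpson(skewed_sample))
```

The index equals the number of taxa when all are equally abundant and approaches 1 as a single taxon dominates, which is why it can be read as an effective number of taxa.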
Memory usage, speed, and nucleotides masked for each program
| Tool | Microsatellite nucleotides masked (%) | Conserved coding sequence nucleotides masked (%) | Speed (kilobase/second) | Peak memory usage (megabytes) |
|---|---|---|---|---|
| Komplexity | 54.6 | 0.68 | | |
| RepeatMasker | | 0.75 | 0.64 ± 0.02 | 624 ± 3.0 |
| BBMask | 43 | | 1690 ± 440 | 385 ± 52 |
| DUST | 44.9 | 0.74 | 795 ± 10 | 17.0 ± 0.18 |
Columns show the percentage of nucleotides (microsatellite or conserved coding sequence) from reads masked by each tool, as well as the normalized time taken and peak memory usage of each tool while processing the dataset (1.1 megabases). The top-performing tool in each category is shown in italics