Luke R Thompson1,2, Sean R Anderson1,2, Paul A Den Uyl3, Nastassia V Patin2,4, Shen Jean Lim2,4, Grant Sanderson5, Kelly D Goodwin2.
Abstract
BACKGROUND: Amplicon sequencing (metabarcoding) is a common method to survey diversity of environmental communities whereby a single genetic locus is amplified and sequenced from the DNA of whole or partial organisms, organismal traces (e.g., skin, mucus, feces), or microbes in an environmental sample. Several software packages exist for analyzing amplicon data, among which QIIME 2 has emerged as a popular option because of its broad functionality, plugin architecture, provenance tracking, and interactive visualizations. However, each new analysis requires the user to keep track of input and output file names, parameters, and commands; this lack of automation and standardization is inefficient and creates barriers to meta-analysis and sharing of results.
Keywords: amplicon sequencing; eDNA; environmental DNA; meta-analysis; metabarcoding; microbiome
Year: 2022 PMID: 35902092 PMCID: PMC9334028 DOI: 10.1093/gigascience/giac066
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 7.658
Figure 1: The Tourmaline workflow. Install natively (macOS, Linux) or using a Docker container. Set up by cloning the Tourmaline repository (directory) from GitHub, initializing the directory from a previous run (optional), editing the configuration file (config.yaml, Supplementary Table S1), creating symbolic links to the reference database files, organizing the sequence files and/or editing the FASTQ manifest file, and editing and creating a symbolic link to the metadata file. Run by calling the Snakemake commands for denoise, taxonomy, diversity, and report—or running just the report command to generate all output if the parameters do not need to be changed between individual commands. It is recommended but not required to run the unfiltered commands before the filtered commands. The primary input and output files are listed. Detailed instructions for each step are provided in the Tourmaline Wiki [44].
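The run step described in this caption can be sketched as a short sequence of Snakemake invocations. The Python sketch below only builds the command lines and does not execute them. The rule names are the DADA2 paired-end, unfiltered rules listed in the benchmarking table, and `--cores` is the parameter mentioned there. The helper name `build_commands` is hypothetical, and the Tourmaline Wiki remains the authoritative source for the exact commands.

```python
import shlex

# Pseudo-rules for the DADA2 paired-end, unfiltered path, in run order.
# Names are taken from the benchmarking table in this document.
RULES = [
    "dada2_pe_denoise",
    "dada2_pe_taxonomy_unfiltered",
    "dada2_pe_diversity_unfiltered",
    "dada2_pe_report_unfiltered",
]


def build_commands(cores=1, report_only=False):
    """Return Snakemake command lines as argument lists (hypothetical helper).

    With report_only=True, only the report target is requested, which
    corresponds to generating all output with a single command.
    """
    targets = RULES[-1:] if report_only else RULES
    return [["snakemake", rule, "--cores", str(cores)] for rule in targets]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in build_commands(cores=8):
        print(shlex.join(cmd))
```

Running the commands one at a time leaves room to inspect each output and adjust `config.yaml` before the next step. The `report_only=True` form assumes the parameters do not need to change between steps.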
Figure 2: Step-by-step tutorial on Tourmaline using the provided test data, which are subsampled from the 16S rRNA amplicon data of a 2018 survey of Western Lake Erie. Key parameters in config.yaml and primary output for each command (pseudo-rule) are listed. Indicated output should be evaluated to determine the appropriate parameters for the next command. Evaluation of the primary outputs and rationale for parameter choice is shown for the test Lake Erie 16S rRNA data that come with the Tourmaline repository. See Supplementary Fig. S3 for screenshots of the primary output files.
Figure 3: Example of the main outputs of the Tourmaline workflow beyond the QIIME 2 outputs. Contents in panels A, E, F, and G are truncated. Screenshots of additional output files are provided in Supplementary Fig. S3. See Fig. 2 for commands, parameters, and guidance.
Benchmarking and parallel processing results from running the full 2018 Lake Erie 16S rRNA data set through Tourmaline with either 1 or 8 cores using a Tourmaline Docker container allocated with 32 GB RAM running on an 18-core iMac Pro (2017). The Snakemake command used the parameter --cores 1 or --cores 8, and parameters in config.yaml specifying the number of threads for individual rules were set to 1 or 8, respectively. Times reported are the elapsed real time between invocation and termination and are reported as HH:MM:SS. Times do not include the initial step of importing FASTQ files into a QIIME 2 archive (fastq_pe.qza), which took ~2 minutes. Parameters shown in the last column are those most relevant to the runtimes. Unless otherwise noted, the parameters used were the defaults in config.yaml.
| Rule | Time (--cores 1) | Time (--cores 8) | Parameters and details |
|---|---|---|---|
| dada2_pe_denoise | 02:05:43 | 00:38:10 | method: dada2-pe; 96 samples × 120,338 sequences per sample = 11,552,448 total sequences |
| dada2_pe_taxonomy_unfiltered | 01:31:55 | 00:12:39 | classify_method: consensus-vsearch; 12,379 representative sequences |
| dada2_pe_diversity_unfiltered | 01:18:09 | 01:13:49 | alignment_method: muscle; alignment_muscle_maxiters: 2; alignment_muscle_diags: -diags; odseq_distance_metric: linear; odseq_bootstrap_replicates: 100; odseq_threshold: 0.025; 12,379 representative sequences (lengths: min 240, max 418, avg 258) |
| dada2_pe_report_unfiltered | 00:00:05 | 00:00:05 | – |
| Total | 04:55:52 | 02:04:43 | – |
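The effect of parallel processing can be read from the table by dividing each 1-core time by the matching 8-core time. The sketch below does that arithmetic in Python, using only the HH:MM:SS values reported above. The helper names (`to_seconds`, `speedups`) are illustrative and are not part of Tourmaline.

```python
# Elapsed times (HH:MM:SS) copied from the benchmarking table:
# rule -> (time with --cores 1, time with --cores 8)
TIMES = {
    "dada2_pe_denoise": ("02:05:43", "00:38:10"),
    "dada2_pe_taxonomy_unfiltered": ("01:31:55", "00:12:39"),
    "dada2_pe_diversity_unfiltered": ("01:18:09", "01:13:49"),
    "dada2_pe_report_unfiltered": ("00:00:05", "00:00:05"),
}
REPORTED_TOTAL = ("04:55:52", "02:04:43")


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedups(times):
    """Return the 1-core / 8-core time ratio for each rule."""
    return {
        rule: to_seconds(one) / to_seconds(eight)
        for rule, (one, eight) in times.items()
    }


if __name__ == "__main__":
    for rule, ratio in speedups(TIMES).items():
        print(f"{rule}: {ratio:.2f}x")
    overall = to_seconds(REPORTED_TOTAL[0]) / to_seconds(REPORTED_TOTAL[1])
    print(f"Total: {overall:.2f}x")
```

From these numbers, the taxonomy step gains the most from 8 cores (roughly 7.3x), denoising gains about 3.3x, and the diversity step barely changes (about 1.06x), for an overall speedup of roughly 2.4x.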