Teresita M Porter, Mehrdad Hajibabaei.
Abstract
Multi-marker metabarcoding is increasingly used to generate biodiversity information across domains of life, from microbes to fungi to animals, for applications such as molecular ecology and biomonitoring in sectors ranging from academic research to regulatory agencies and industry. Current popular bioinformatic pipelines support microbial and fungal marker analysis, while ad hoc methods are often used to process animal metabarcode markers from the same study. MetaWorks provides a harmonized processing environment, pipeline, and taxonomic assignment approach for demultiplexed Illumina reads for all biota using a wide range of metabarcoding markers such as 16S, ITS, and COI. A Conda environment is provided to quickly gather most of the programs and dependencies for the pipeline. Several workflows are provided, including taxonomic assignment of exact sequence variants, optional generation of operational taxonomic units, and single-read processing. Pipelines are automated using Snakemake to minimize user intervention and facilitate scalability. All pipelines use the RDP classifier to provide taxonomic assignments with confidence measures. We extend the RDP classifier, which already supports taxonomic assignment of 16S (bacteria), ITS (fungi), and 28S (fungi), to also support COI (eukaryotes), rbcL (eukaryotes, land plants, diatoms), 12S (fish, vertebrates), 18S (eukaryotes, diatoms) and ITS (fungi, plants). MetaWorks handles ITS by trimming the flanking conserved rRNA gene regions, and handles protein-coding genes by providing two options for removing obvious pseudogenes. MetaWorks can be downloaded from https://github.com/terrimporter/MetaWorks and quickstart instructions, pipeline details, and a tutorial for new users can be found at https://terrimporter.github.io/MetaWorksSite.
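To illustrate how confidence measures on taxonomic assignments are typically used downstream, the following minimal Python sketch truncates a lineage at the first rank whose bootstrap support falls below a cutoff. The record layout, the function name `truncate_lineage`, the example taxa, and the 0.80 cutoff are all assumptions made for illustration; they are not taken from MetaWorks or from the RDP classifier's output format.

```python
from typing import List, Tuple

# Hypothetical record layout: (rank, taxon, bootstrap support between 0 and 1).
Lineage = List[Tuple[str, str, float]]


def truncate_lineage(lineage: Lineage, cutoff: float) -> Lineage:
    """Keep ranks from the top of the lineage down to the last rank
    whose support is at least `cutoff`; drop everything below it."""
    kept: Lineage = []
    for rank, taxon, support in lineage:
        if support < cutoff:
            break  # deeper ranks are not reported once support drops
        kept.append((rank, taxon, support))
    return kept


# Made-up assignment for a single sequence variant.
example: Lineage = [
    ("phylum", "Arthropoda", 1.00),
    ("class", "Insecta", 0.99),
    ("order", "Diptera", 0.95),
    ("family", "Chironomidae", 0.72),
    ("genus", "Chironomus", 0.40),
]

# The cutoff here is illustrative only; a suitable value depends on the
# marker, the read length, and the reference set.
confident = truncate_lineage(example, cutoff=0.80)
print([taxon for _, taxon, _ in confident])
```

With these made-up values the lineage is reported down to order only, because support at the family rank is below the chosen cutoff.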
Year: 2022 PMID: 36174014 PMCID: PMC9521933 DOI: 10.1371/journal.pone.0274260
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
RDP-trained reference sets that can be used with MetaWorks.
| Marker | Target taxa | Classifier availability | Number of included sequences | Number of included taxa at all ranks (species) | Source data |
|---|---|---|---|---|---|
| COI | Eukaryotes | | 1,221,528 | 154,351 (114,687) | BOLD |
| rbcL | Diatoms | | 3,504 | 1,432 (1,023) | Diat.barcode |
| rbcL | Land plants | | 148,258 | 61,398 (50,778) | INSDC |
| rbcL | Eukaryotes | | 164,454 | 65,742 (53,344) | INSDC |
| 12S | Fish | | 2,853 | 4,751 (2,833) | MitoFish |
| 12S | Vertebrates | | 10,654 | 15,007 (9,564) | INSDC |
| SSU (18S) | Diatoms | | 2,962 | 1,198 (828) | Diat.barcode |
| SSU (16S) | Vertebrates | | 72,195 | 21,282 (15,155) | INSDC |
| SSU (18S) | Eukaryotes | | 42,301 | 7,504 (5,440 genera) | SILVA |
| SSU (16S) | Prokaryotes | Built-in to the RDP classifier* | 13,212 | 3,247 (2,506 genera) | RDP |
| ITS | Fungi (Warcup) | Built-in to the RDP classifier | 17,878 | 10,621 (8,551) | Deshpande et al., 2016 |
| ITS | Fungi (UNITE 2014) | Built-in to the RDP classifier | 145,019 | 23,222 (20,337) | Abarenkov et al., 2010 |
| ITS | Fungi (UNITE 2021) | | 1,393,203 | 376,167 (352,588) | UNITE |
| ITS | Plants | | 104,387 | 72,632 (61,693) | PLANiTS |
| LSU | Fungi | Built-in to the RDP classifier | 11,442 | 2,633 (1,895) | Liu et al., 2012 |
Fig 1. MetaWorks workflow to produce taxonomically assigned exact sequence variants.
To aid reproducibility, a Conda environment is provided. Although MetaWorks provides multiple Snakemake workflows, here we show the main workflow, which generates taxonomically assigned exact sequence variants (ESVs). Input files are shown in the first panel (green), the ESV workflow in the centre panel (blue), and output files in the last panel (orange). The input files in white boxes are required by Snakemake to run the appropriate workflow. The input files in green need to be supplied by the user. Note that only custom-trained classifiers, such as the one for COI, need to be supplied by the user, whereas classifiers built in to the RDP classifier are used automatically, for example to process prokaryote 16S assignments. The denoising step shown here includes the removal of rare clusters, sequences with putative errors, and chimeric sequences. The results are provided in a comma-separated values (CSV) file that shows each ESV per sample with read counts and taxonomic assignments. Abbreviations: Demultiplexed Illumina paired-end reads (R1 + R2), internal transcribed spacer (ITS) region, open reading frame sequences (ORFs).
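As a rough illustration of working with a per-sample, per-ESV results table of this kind, the sketch below sums read counts per sample using only the Python standard library. The column names (`sample`, `esv_id`, `reads`, `genus`) and the demo rows are hypothetical and were chosen for this example; the header of the actual results file should be checked before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; adjust to match the real file header.
SAMPLE_COL = "sample"
READS_COL = "reads"


def reads_per_sample(handle):
    """Sum read counts over all rows (one row per ESV per sample)
    and return a dict mapping sample name to total reads."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row[SAMPLE_COL]] += int(row[READS_COL])
    return dict(totals)


# Made-up data standing in for a results file opened with open(path, newline="").
demo = io.StringIO(
    "sample,esv_id,reads,genus\n"
    "S1,ESV_1,120,Chironomus\n"
    "S1,ESV_2,30,Baetis\n"
    "S2,ESV_1,75,Chironomus\n"
)

totals = reads_per_sample(demo)
print(totals)
```

The same loop can be extended to group by a taxonomic column instead of the sample column, for example to tabulate reads per genus.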