Vanessa Arranz1, William S Pearman2, J David Aguirre2, Libby Liggins2.
Abstract
The use of DNA metabarcoding to characterise the biodiversity of environmental and community samples has exploded in recent years. However, taxonomic inferences from these studies are contingent on the quality and completeness of the sequence reference database used to characterise sample species-composition. In response, studies often develop custom reference databases to improve species assignment. The disadvantage of this approach is that it limits the potential for database re-use, and the transferability of inferences across studies. Here, we present the MARine Eukaryote Species (MARES) reference database for use in marine metabarcoding studies, created using a transparent and reproducible pipeline. MARES includes all COI sequences available in GenBank and BOLD for marine taxa, unified into a single taxonomy. Our pipeline facilitates the curation of sequences, synonymization of taxonomic identifiers used by different repositories, and formatting these data for use in taxonomic assignment tools. Overall, MARES provides a benchmark COI reference database for marine eukaryotes, and a standardised pipeline for (re)producing reference databases enabling integration and fair comparison of marine DNA metabarcoding results.
Year: 2020 PMID: 32620910 PMCID: PMC7334202 DOI: 10.1038/s41597-020-0549-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Bioinformatic pipeline for generating a custom reference database combining sequences retrieved from BOLD and GenBank for a taxonomic group of interest. Shaded boxes detail the workflow for each numbered step described in the methods and name the script required for each step (available on GitHub: https://github.com/wpearman1996/MARES_database_pipeline). Smaller open boxes describe the subroutines, including the functions, packages, and software required (in italics). Boxes with solid outlines indicate input files and boxes with dotted outlines indicate output files. Asterisks denote the original contributions of MARES to the previously published routines.
Published reference databases commonly used for taxonomic assignment in COI eukaryotic metabarcoding studies.
| Reference Database | Target organisms | Source repository | Method | Sequences | Unique species (%Unique species) | Marine species (%Marine species) | Reference |
|---|---|---|---|---|---|---|---|
| BOLD | Eukaryotes | BOLD | Keyword search | 5,586,934 | 169,705 (3.04) | 18,328 (10.80) | Ratnasingham and Hebert[ |
| GenBank | Eukaryotes | GenBank | Keyword search | 1,933,547 | 160,061 (8.28) | 17,943 (11.21) | NCBI Resource Coordinators[ |
| Midori | Metazoans | GenBank | Keyword search | 927,386 | 131,988 (14.23) | 14,057 (10.65) | Machida, |
| db_COI_MBPK | Eukaryotes | EMBL, BOLD | | 188,975 | 48,853 (25.85) | 6,844 (14.01) | Wangensteen and Turon[ |
| CRUX_CO1 | Eukaryotes | EMBL, GenBank | CRUX ( | 1,401,802 | 127,422 (9.10) | 15,737 (12.35) | Curd, |
| MARES_BAR | Marine eukaryotes | GenBank, BOLD | Keyword search | 1,224,187 | 61,123 (4.91) | 17,884 (29.26) | This data descriptor[ |
| MARES_NOBAR | Marine eukaryotes | GenBank, BOLD | Keyword search | 1,491,691 | 71,499 (4.79) | 19,154 (26.79) | This data descriptor[ |
The BOLD and GenBank reference databases were built using Steps 1 and 2 of the bioinformatic pipeline (Fig. 1). ‘BOLD’ was generated by retrieving all COI sequences available from the BOLD repository. ‘GenBank’ was generated with the keyword search Eukaryota and COI synonyms. ‘Unique species’ are those retained after a quality control procedure that keeps only unique, fully identified taxa with binomial species names. ‘% Unique species’ was calculated using the number of unique species as the numerator and the total number of sequences as the denominator. ‘Marine species’ is the number of unique species in each database that appeared in the World Register of Marine Species (WoRMS) and AlgaeBase[27]. ‘% Marine species’ was then calculated using the number of marine species as the numerator and the number of unique species as the denominator.
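As a quick check of how the two percentage columns are derived, the sketch below recomputes them for the BOLD row. The counts are copied from the table; the helper name `percent` is hypothetical and not part of the MARES pipeline. Note that the tabulated ‘% Marine species’ values are reproduced when the number of unique species is used as the denominator.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage, rounded to 2 decimals."""
    return round(100.0 * numerator / denominator, 2)


# Counts copied from the BOLD row of the table above.
bold_sequences = 5_586_934
bold_unique_species = 169_705
bold_marine_species = 18_328

# '% Unique species': unique species relative to all sequences.
pct_unique = percent(bold_unique_species, bold_sequences)

# '% Marine species': marine species relative to unique species.
pct_marine = percent(bold_marine_species, bold_unique_species)

print(f"% unique species: {pct_unique:.2f}")
print(f"% marine species: {pct_marine:.2f}")
```

The same two calls, applied to the other rows, give a simple way to spot transcription errors in the sequence and species counts.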
Pairwise β‐diversity measures for comparisons of species composition among reference databases. Below the diagonal is the total Jaccard’s dissimilarity (βJAC) and above the diagonal is the βratio representing the proportion of total Jaccard’s dissimilarity (βJAC) explained by nestedness (βJNE).
| | Midori | BOLD | GenBank | db_COI_MBPK | MARES_COI_NOBAR | MARES_COI_BAR | CRUX_CO1 |
|---|---|---|---|---|---|---|---|
| Midori | - | 0.32 | 0.57 | 0.64 | 0.26 | 0.32 | 0.11 |
| BOLD | 0.43 | - | 0.10 | 0.87 | 0.53 | 0.81 | 0.43 |
| GenBank | 0.26 | 0.34 | - | 0.81 | 0.62 | 0.65 | 0.59 |
| db_COI_MBPK | 0.70 | 0.73 | 0.72 | - | 0.06 | 0.04 | 0.82 |
| MARES_COI_NOBAR | 0.70 | 0.68 | 0.64 | 0.82 | - | 0.92 | 0.27 |
| MARES_COI_BAR | 0.73 | 0.67 | 0.69 | 0.80 | 0.15 | - | 0.39 |
| CRUX_CO1 | 0.23 | 0.40 | 0.29 | 0.65 | 0.69 | 0.69 | - |
Smaller values for the ratio indicate that dissimilarities are primarily due to databases containing different species, whereas larger values indicate dissimilarities are primarily driven by differences in the number of species present in each database.
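To make the two measures concrete, here is a minimal sketch of how βJAC and the βratio could be computed for one pair of species lists. It assumes the commonly used Baselga partition of Jaccard dissimilarity into turnover and nestedness components; the text above does not state which formulas or software the authors used, so the function name `jaccard_partition` and the toy species sets are purely illustrative.

```python
def jaccard_partition(species_a: set, species_b: set) -> dict:
    """Partition Jaccard dissimilarity into turnover and nestedness.

    Assumes the Baselga partition: beta_JAC = beta_JTU + beta_JNE.
    """
    a = len(species_a & species_b)  # species shared by both databases
    b = len(species_a - species_b)  # species only in the first database
    c = len(species_b - species_a)  # species only in the second database
    if a + b + c == 0:
        raise ValueError("both species sets are empty")

    beta_jac = (b + c) / (a + b + c)        # total Jaccard dissimilarity
    turnover_denominator = a + 2 * min(b, c)
    beta_jtu = (2 * min(b, c) / turnover_denominator
                if turnover_denominator else 0.0)  # turnover component
    beta_jne = beta_jac - beta_jtu          # nestedness component
    # Ratio is undefined when the two sets are identical (beta_jac == 0).
    beta_ratio = beta_jne / beta_jac if beta_jac else float("nan")
    return {"beta_jac": beta_jac, "beta_jtu": beta_jtu,
            "beta_jne": beta_jne, "beta_ratio": beta_ratio}


# Hypothetical toy databases, not taken from the study.
large_db = {"sp1", "sp2", "sp3", "sp4"}
subset_db = {"sp1", "sp2"}           # strict subset of large_db
swapped_db = {"sp1", "sp2", "sp5"}   # same richness as other_db, one species differs
other_db = {"sp1", "sp2", "sp3"}

nested = jaccard_partition(large_db, subset_db)
turnover = jaccard_partition(other_db, swapped_db)
print(nested["beta_jac"], nested["beta_ratio"])
print(turnover["beta_jac"], turnover["beta_ratio"])
```

In this toy case both pairs have the same βJAC, but the subset pair has a ratio of 1 (the dissimilarity is entirely a difference in species number) while the swapped pair has a ratio of 0 (the dissimilarity is entirely species replacement), which mirrors the interpretation given for the table.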
| Measurement(s) | DNA |
| Technology Type(s) | bioinformatics analysis |
| Factor Type(s) | DNA sequence |
| Sample Characteristic - Organism | Eukaryota |
| Sample Characteristic - Environment | marine environment |