Toward global integration of biodiversity big data: a harmonized metabarcode data generation module for terrestrial arthropods.

Paula Arribas¹, Carmelo Andújar¹, Kristine Bohmann², Jeremy R deWaard³,⁴, Evan P Economo⁵, Vasco Elbrecht⁶, Stefan Geisen⁷, Marta Goberna⁸, Henrik Krehenwinkel⁹, Vojtech Novotny¹⁰,¹¹, Lucie Zinger¹²,¹³, Thomas J Creedy¹⁴, Emmanouil Meramveliotakis¹⁵, Víctor Noguerales¹, Isaac Overcast¹², Hélène Morlon¹², Anna Papadopoulou¹⁵, Alfried P Vogler¹⁴,¹⁶, Brent C Emerson¹.

Abstract

Metazoan metabarcoding is emerging as an essential strategy for inventorying biodiversity, with diverse projects currently generating massive quantities of community-level data. The potential for integrating across such data sets offers new opportunities to better understand biodiversity and how it might respond to global change. However, large-scale syntheses may be compromised if metabarcoding workflows differ from each other. There are ongoing efforts to improve standardization for the reporting of inventory data. However, harmonization at the stage of generating metabarcode data has yet to be addressed. A modular framework for harmonized data generation offers a pathway to navigate the complex structure of terrestrial metazoan biodiversity. Here, through our collective expertise as practitioners, method developers, and researchers leading metabarcoding initiatives to inventory terrestrial biodiversity, we seek to initiate a harmonized framework for metabarcode data generation, with a terrestrial arthropod module. We develop an initial set of submodules covering the 5 main steps of metabarcode data generation: (i) sample acquisition; (ii) sample processing; (iii) DNA extraction; (iv) polymerase chain reaction amplification, library preparation, and sequencing; and (v) DNA sequence and metadata deposition, providing a backbone for a terrestrial arthropod module. To achieve this, we (i) identified key points for harmonization, (ii) reviewed the current state of the art, and (iii) distilled existing knowledge within submodules, thus promoting best practice by providing guidelines and recommendations to reduce the universe of methodological options. We advocate the adoption and further development of the terrestrial arthropod module. We further encourage the development of modules for other biodiversity fractions as an essential step toward large-scale biodiversity synthesis through harmonization.

Keywords: arthropods; biodiversity big data integration; biodiversity inventory; comparability; data generation; harmonization; metabarcoding; modular structure; reproducibility

Year: 2022 PMID: 35852418 PMCID: PMC9295367 DOI: 10.1093/gigascience/giac065

Source DB: PubMed Journal: GigaScience ISSN: 2047-217X Impact factor: 7.658

Background

DNA metabarcoding, involving polymerase chain reaction (PCR)–coupled high-throughput sequencing (HTS) directly from bulk or environmental samples, represents the most cost-efficient approach for obtaining molecular community profiles [1,2]. Metabarcoding is increasingly being used to characterize and monitor biodiversity and is recognized as a substantial advance leading to a step change in multiple fields of biodiversity science (e.g., [3-5]). Diverse projects, from local to global scales, are currently generating massive quantities of site-based community-level biodiversity inventory data, including hyperdiverse assemblages or groups for which classical sampling and identification is overly complicated and time-consuming. The potential for integrating across such data, from diverse sources and time series, offers new opportunities to better understand how biodiversity is structured in space and time, and the factors that regulate it. Additionally, such integration can be leveraged for better monitoring and the development of holistic biodiversity conservation strategies, in response to global change [4,6, 7]. However, collective international efforts are required to achieve optimal global integration and synthesis. While integrative efforts for harmonized site-based genomic inventories exist in the microbial realm (e.g., [8-10]), such a framework has yet to be extended to nonmicrobial fractions of biodiversity. However, there is an emerging consensus that such integration can be achieved within an HTS framework, analogous to the Genomic Observatories (GO) concept, first proposed by Davies et al. [11, 12]. If effective strategies can be developed to harmonize the data resulting from metabarcoding studies (i.e., metabarcode inventory data), these can potentially scale up to a noncentralized network within which global patterns and trends of biodiversity can be addressed [13]. 
There are ongoing efforts to maximize the potential for integrating across independent biodiversity data sets through improved standardization for the reporting of inventory data (Humboldt Core: [14]). In the case of molecular data specifically, the GEOME initiative [15, 16] promotes standardization for the reporting of taxonomic, genomic, and metadata through customizable yet standard-compliant spreadsheets that capture the temporal and geospatial context of a biosample. While recommendations have been made for the harmonization of bioinformatic processing of raw metabarcode read data from metazoan biodiversity fractions [17], harmonization at the stage of generating such metabarcode data has yet to be addressed and thus remains a fundamental impediment for data integration. The success of global microbial diversity assessment initiatives has pivoted on standardized metabarcoding protocols for sampling, DNA extraction, barcode amplification/enrichment, and library generation and sequencing of microbial/planktonic communities (e.g., [18,19] for the Earth Microbiome Project or [20-22] for the TARA Oceans and the Ocean Sampling Day). Despite pioneering efforts to harmonize metabarcode data generation beyond microbial biodiversity fractions (e.g., see [23, 24]), further efforts are required within this expanding research area.

A Harmonized Framework for the Generation of Metabarcode Data for Terrestrial Animals

Terrestrial metazoans constitute one of the most heterogeneous groups in terms of body size across the tree of life. Metabarcoding is emerging as an important approach for the inventorying of metazoan diversity and is increasingly being used across the fields of community ecology, evolutionary ecology, biogeography, conservation biology, and environmental management, among others. Given the rapid development of data generation in this area, the potential for downstream synthesis across independently generated data sets may be compromised if divergent strategies are being implemented. There is already concern that nuances in metabarcoding workflows make comparisons difficult (e.g., [25-28]). Guidance for the implementation of effective and robust sampling and sample-processing approaches is both timely and essential and will increase the potential for broader benefits to biodiversity science through harmonization. We believe that the overarching goal of a harmonized metabarcode framework for inventorying biodiversity should be to reduce unnecessary heterogeneity in the generation of metabarcode data, thus facilitating comparability and integration among independent metabarcode data sets. The development and implementation of consistent workflows for data generation is a key step for the bottom-up growth of a GO network for global integration and synthesis within biodiversity science, while the challenge is to also allow flexibility to successfully address objectives at the individual project level. It has previously been argued that a harmonized framework with a “modular” structure for data generation could offer a pathway to navigate through the complex structure of terrestrial metazoan biodiversity, by placing different fractions of terrestrial diversity at the core of each “module” [13]. 
Within such a framework, best practices and harmonized protocols for the generation of metabarcode data can be developed for different target fractions of biodiversity (e.g., terrestrial arthropods). Within individual modules, submodules serve as the fundamental building blocks that provide guidelines and recommendations for the 5 key steps to generate metabarcode data: (i) sample acquisition; (ii) sample processing; (iii) DNA extraction; (iv) PCR amplification, library preparation, and sequencing; and (v) DNA sequence and metadata deposition. Different data generation pipelines can be configured within a module by choosing among submodule options, allowing for variable requirements of different assemblages within the module (e.g., flying, aquatic, or ground arthropods within a terrestrial arthropod module) and different sample vouchering needs (e.g., destructive vs. nondestructive DNA extraction). Such a modular structure provides a harmonized framework for comparability across independent studies, by reducing redundant efforts and improving reporting and comparability, while retaining flexibility to incorporate additional submodules as the need arises (see Fig. 1, a schematic representation of the proposed modular structure).
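
The modular structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `Submodule` and `Module` and the `configure` method are hypothetical, and the submodule numbers and names are those used in the summary tables of this article (1.1, 1.2, and 2.1). Steps without a chosen submodule are simply reported as uncovered.

```python
from dataclasses import dataclass, field

# The five data generation steps listed in the text, in order.
STEPS = (
    "sample acquisition",
    "sample processing",
    "DNA extraction",
    "PCR amplification, library preparation, and sequencing",
    "DNA sequence and metadata deposition",
)


@dataclass(frozen=True)
class Submodule:
    code: str  # submodule number, e.g. "1.1"
    name: str
    step: int  # 1-based position in STEPS


@dataclass
class Module:
    """Hypothetical container for one biodiversity fraction."""
    name: str
    submodules: list = field(default_factory=list)

    def configure(self, codes):
        """Return (pipeline, uncovered_steps) for the chosen submodule codes."""
        known = {s.code: s for s in self.submodules}
        unknown = [c for c in codes if c not in known]
        if unknown:
            raise ValueError(f"unknown submodule codes: {unknown}")
        pipeline = {i: [] for i in range(1, len(STEPS) + 1)}
        for c in codes:
            pipeline[known[c].step].append(known[c])
        uncovered = [STEPS[i - 1] for i, subs in pipeline.items() if not subs]
        return pipeline, uncovered


arthropods = Module(
    "terrestrial arthropods",
    [
        Submodule("1.1", "Malaise trapping sample acquisition", 1),
        Submodule("1.2", "Pitfall trapping sample acquisition", 1),
        Submodule("2.1", "Arthropod community sample processing", 2),
    ],
)

# Combine both sampling submodules with the sample processing submodule.
pipeline, uncovered = arthropods.configure(["1.1", "1.2", "2.1"])
print([s.code for s in pipeline[1]])
print(uncovered)
```

Keeping the step index on each submodule means a pipeline can combine several options for one step (for example, both trapping techniques) while still showing which steps remain to be specified.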

Figure 1:

A harmonized framework with a “modular” structure for metazoan metabarcoding. Schematic representation of the modular structure proposed for building a harmonized framework for the generation of metabarcode data for different fractions of terrestrial animals. Different fractions of terrestrial animal diversity are at the core of each “module” (red rectangle, e.g., the terrestrial arthropods module), and within such a framework, best practices and harmonized protocols are developed as submodules (black blocks). Submodules within each module serve as the fundamental building blocks that provide guidelines and recommendations for the 5 well-defined steps for generating metabarcode data (left panel, rows 1 to 5). Within this framework, tailored data generation pipelines can be configured within a module, drawn from the set of alternative submodules.

Here, through our collective expertise as practitioners of metabarcoding, method developers, and researchers leading metabarcoding initiatives to inventory terrestrial arthropod biodiversity, we seek to initiate a harmonized framework for the generation of terrestrial metazoan metabarcode data. Specifically, we aim to provide an initial set of submodules (black blocks in Fig. 1) covering the 5 main steps of metabarcode data generation (rows 1 to 5 in Fig. 1) that constitute the backbone of a terrestrial arthropod module (red block in Fig. 1). We first (i) identify key points for harmonization within each of the 5 steps, (ii) review the current state of the art within the arthropod metabarcoding literature, and (iii) distill existing information and knowledge within submodules, thus promoting best practice by providing guidelines and recommendations to reduce the universe of methodological options. Standardization or harmonization of methods will, in some contexts, lead to trade-offs against what might be considered perfect methods [29]. 
Such trade-offs may limit the uptake of harmonized protocols, thus compromising the discovery of unifying principles from analyses synthesizing across comparable studies. Thus, rather than being overly prescriptive, we seek to propose a flexible framework that can be opted into with minimal compromise, to increase the comparative value of metabarcode data.

Harmonization for the Metabarcoding of Terrestrial Arthropods: The Terrestrial Arthropods Module

There are multiple reasons why techniques for inventorying and monitoring terrestrial arthropod biodiversity are urgently needed. First, arthropods comprise the majority of known animal species in terrestrial habitats. It has been estimated that there are 5.5 million insect species on Earth, most yet to be discovered, and up to 6.8 million species (range, 5.9–7.8 million) for all terrestrial arthropods [30]. In addition to this high diversity, arthropods present vast trait variation, which imposes a substantial challenge for assessing their responses to environmental change. We now face the challenge of declining arthropod abundance and richness, a very real and serious threat that society must urgently address [31, 32]. Arthropods are also a key biodiversity fraction for monitoring because they include many invasive species [33], requiring comparable baseline data to study the potential susceptibility and responses of communities to invasion. DNA metabarcoding has emerged as a powerful approach for characterizing complex, and in many cases largely unknown, arthropod assemblages [7, 34]. In response to this, researchers from diverse disciplines are shifting from conventional inventorying of arthropod diversity to DNA metabarcoding, with evidence for exponential growth in uptake [17]. Indeed, adaptations of microbial metabarcoding approaches to the macroscopic component of diversity have been heavily influenced by their application to the arthropod fauna (see [1, 35] for pioneering studies). Metabarcoding of DNA extracted from bulk samples of whole organisms (whole-organism community DNA, wocDNA) (i) is the most common and straightforward metabarcoding approach to inventory arthropod biodiversity, (ii) is comparable to standard methods of arthropod monitoring, and (iii) has high potential for harmonization [27]. Data generation practices for the metabarcoding of arthropod community samples are still in the early stages. 
Through the development and adoption of a standardized terrestrial arthropod data generation module, the potential for comparability across future large-scale biodiversity inventorying efforts can be optimized. There is sufficient background from which recommendations can be developed (e.g., [36-40]) to guide methodological decisions within the emerging research community. Recent global initiatives that pivot on arthropod wocDNA also provide a critical mass for developing harmonized data generation, while simultaneously highlighting the relevance and timeliness of a terrestrial arthropod module. These initiatives include the BIOSCAN initiative (https://ibol.org/programs/bioscan/) and its regional extensions such as BIOSCAN Europe (https://www.bioscaneurope.org/), BioAlfa, the Kruger Malaise Program [41], the SITE-100 project (https://www.site100.org/), the Insect Biome Atlas Project (https://insectbiomeatlas.org), LIFEPLAN (https://www.helsinki.fi/en/projects/lifeplan), and the OKEON initiative (https://okeon.unit.oist.jp/).

Identifying Key Points of Harmonization for Submodules within Each Data Generation Step

Sample acquisition step

A starting point for integration across independent biodiversity inventory efforts is a harmonized sample definition. In the case of terrestrial arthropods, sample definition is strongly linked to the sampling technique implemented. There is extensive evidence that different arthropod mass sampling techniques have differing capture efficiencies with regard to total community assemblages within which they are deployed, with no one method detecting the entire arthropod diversity within a site [42]. In this context, with the aim of standardizing insect inventorying and monitoring methods, Montgomery et al. [43] proposed 7 main sampling methods with the aim of maximizing data integration across insect monitoring efforts, including (i) Malaise trapping, (ii) light trapping, (iii) pan trapping, (iv) pitfall trapping, (v) beating sheets, (vi) acoustic monitoring, and (vii) active visual surveys. These complementary sampling methods provide an appropriate platform from which to develop sample acquisition submodules, which could be implemented individually or combined for more complex sampling designs. Most implementations of wocDNA metabarcoding to date are Malaise trap based, at scales ranging from local to global (e.g., [44-49]). Additionally, Malaise traps are frequently deployed together with other sampling techniques to generate plot-based arthropod inventory data (e.g., [50], SITE100, ForestGEO arthropod protocol) and are the sampling strategy of the Global Malaise Trap Program/BIOSCAN initiative [44], with more than 10,000 samples already generated worldwide. Malaise traps [51] are primarily effective for sampling flying insects (e.g., [52]) but have gained popularity for assessing terrestrial arthropod communities (e.g., [53]) and have been proposed as ideal for insect biomonitoring using metabarcoding [43,50]. 
Once installed, they require limited effort and can yield clean samples comprising almost exclusively arthropods and in very large numbers (up to 10,000 specimens per week in some cases). Moreover, they can remain in place and yield new samples through passive sampling with low handling time, making them suitable for time-resolved monitoring. Given these considerations, Malaise traps are an obvious sampling submodule candidate. Following the recommendations of Montgomery et al. [43], together with operational procedures adopted within the BIOSCAN initiative (https://biodiversitygenomics.net/resources/bioscan), Townes-style Malaise traps are preferred, with a 165 × 110 cm interception area being most common and 95% ethanol as the preservation agent (see [50]). Propylene glycol (50–100% propylene glycol, with water) is also frequently recommended, as its evaporation is negligible compared to ethanol and it adequately preserves DNA [54,55]. Sampling effort has typically been delimited to 1 week within most metabarcoding studies, representing a compromise between maximizing sampling effort and reducing potential problems with DNA degradation [38]. The Malaise trap should preferably be placed at the center of the habitat patch to be characterized and, when possible, the trap should be positioned at a right angle to the dominant insect flight line. While submodule implementation can be restricted to a single trap, we emphasize that biological replicates (simultaneous Malaise trapping events) are desirable within the same habitat patch [56] and can provide useful information regarding sampling efficiency (see, e.g., [57, 58] for occupancy modeling using some means of sampling replication for insects). Similarly, temporal replication is also desirable, considering the possible variability due to changing environmental conditions for optimal arthropod activity and species-specific idiosyncrasies. 
If temporal replication is not possible, trapping during maximum activity periods for flying insects is desirable. See Table 1 for a summary of key guidelines and recommendations for the 1.1 Malaise trapping sample acquisition submodule.

Table 1:

Summary of key guidelines and recommendations within the 1.1 Malaise trapping sample acquisition submodule

1.1 Malaise trapping sample acquisition submodule
Sample definition	Townes-style Malaise trap (165 × 110 cm interception area); One week per sample; Collecting fluid: >95% ethanol/50–95% propylene glycol; Center in habitat patch location; Position perpendicular to natural flight corridor; Spatial and temporal replicates
Sampling event metadata	Geographical coordinates; Date and period of trapping; Photo recording for habitat and microhabitat; Extreme weather events during trapping
Sample storage	>95% molecular-grade ethanol/propylene glycol; Fully submerged biomass; Storage conditions of −20°C or −80°C
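
The sample definition and sampling event metadata for a Malaise trap sample could be captured as a simple record with a few consistency checks. The sketch below is an assumption-laden illustration, not a reporting standard: the class `MalaiseSample`, its field names, and the example coordinates, dates, and file name are all hypothetical, and the checks only mirror the one-week interval, collecting fluids, coordinates, and photo record recommended in this submodule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MalaiseSample:
    """Hypothetical minimal metadata record for one Malaise trap sample."""
    latitude: float            # decimal degrees
    longitude: float           # decimal degrees
    start: date                # first day of the trapping event
    end: date                  # last day of the trapping event
    habitat_photo: str         # file name or identifier of the habitat photo
    collecting_fluid: str      # "ethanol" or "propylene glycol"
    extreme_weather: str = ""  # free-text note on rainfall or extreme events

    def issues(self):
        """Return a list of deviations from the recommendations."""
        problems = []
        if not -90 <= self.latitude <= 90:
            problems.append("latitude out of range")
        if not -180 <= self.longitude <= 180:
            problems.append("longitude out of range")
        days = (self.end - self.start).days
        if days != 7:
            problems.append(f"trapping period is {days} days, not one week")
        if not self.habitat_photo:
            problems.append("no habitat photo recorded")
        if self.collecting_fluid not in ("ethanol", "propylene glycol"):
            problems.append("collecting fluid not ethanol or propylene glycol")
        return problems


# Illustrative values only; they do not refer to a real sampling event.
ok = MalaiseSample(28.35, -16.5, date(2021, 6, 1), date(2021, 6, 8),
                   "site01_360.jpg", "ethanol")
bad = MalaiseSample(28.35, -16.5, date(2021, 6, 1), date(2021, 6, 11),
                    "", "ethanol", "heavy rain on day 3")
print(ok.issues())
print(bad.issues())
```

Returning a list of deviations rather than raising an error keeps the record usable when a project departs from a recommendation, while making that departure explicit for later integration.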

Recording metadata associated with sampling is also an important action for harmonization. Our opinion converges on a minimum set of metadata attributes for each sample: (i) the geographical coordinates of the Malaise trap, (ii) the date and time interval for the sampling event, and (iii) photo recording (ideally a 360° photo around each trap) of the habitat patch within which the Malaise trap is placed. In agreement with Montgomery et al. [43], we also recommend metadata reporting for the presence of rainfall, or extreme weather events, during the trapping. Detailed characterization of habitat and microhabitats within sampling sites would require time and resources that may limit module uptake. If needed, environmental characterization of sampling sites can potentially be extracted from remote sensing data (see [4]). For additional information on metadata reporting, see the section on DNA sequence and metadata sharing and storage.
Sample storage conditions, as the endpoint of the sample acquisition chain, carry implications for downstream data quality and are thus an important focus for harmonization. Sample storage conditions are consequential for the degradation of target DNA and/or the proliferation of nontarget biomass in the sample. As such, they can strongly impact metabarcoding biodiversity profiles [59]. However, the effect of this bias on mock arthropod samples, at least for short-term storage (i.e., <1 month), is of limited importance (see [38]). In the case of longer storage of arthropod community samples, we strongly recommend the use of >95% molecular-grade ethanol as a preservative using leak-proof glass or plastic vials or jars [60], ensuring that the entire bulk sample is fully submerged before storage, and then storing it at −20°C or −80°C. 
In the case of storage or transport safety constraints, propylene glycol (undiluted) can be used as an alternative to ethanol [61]. Such an approach will limit inherent biases in inventory data due to irregular DNA degradation. The storage of biological replicates is always desirable (Table 1). While Malaise trapping is notably efficient for aerially active arthropods, species with low mobility are less likely to be sampled (e.g., [62]). In this context, pitfall trapping offers a complementary passive sampling technique for ground active arthropods, and thus we consider it to be an appropriate candidate for the development of a complementary sampling submodule. The joint implementation of Malaise and pitfall trapping represents an appropriate compromise to limit the diversity of sampling techniques implemented, while seeking to capture a broad representation of arthropod biodiversity. Pitfall traps [63] are containers buried in the ground with their rim at surface level to capture ground-dwelling (epigeic) insects. Pitfall traps are the most effective method for sampling ground active arthropods and are an established and popular monitoring technique (e.g., the US National Ecological Observatory Network [NEON] [54]; the UK Environmental Change Network [64]). Pitfall and Malaise traps are highly complementary, sampling largely nonoverlapping fractions of arthropod assemblages with reduced additional effort, and they have already been jointly applied in several wocDNA metabarcoding studies (e.g., [48]). Guidelines for standardizing pitfall trapping, based on a review of the existing literature [65], have recommended plastic cups with an 11-cm diameter and a 9- to 11-cm depth, and a roof raised 1.5 cm above the trap entrance. The number of individuals sampled per trap can be limited, and as such, composite samples from multiple pitfall traps can be used to increase the sampling effort. 
There is some controversy over how far apart traps should be placed to be considered as independent samples (e.g., [66,67]). We suggest that the NEON protocol [54] provides a suitable framework for harmonization, within which a composite sample is generated using 4 pitfall traps arranged at the corners of a square with sides of 25 m. While submodule implementation can be restricted to a single composite sample (4 pitfall traps), biological replicates are desirable (e.g., [54]) and can be achieved by allowing several meters between replicate traps within each corner. Sampling effort is defined by the trapping interval and varies across studies, typically ranging from 3 days to 4 weeks (e.g., [48, 54, 68]). One week provides an appropriate interval and facilitates coordination with the setting and servicing of Malaise traps. Temporal replication is also desirable, and if not possible, trapping should be targeted toward periods of maximum arthropod activity [54]. Propylene glycol (ratio of 50–100% propylene glycol, with water, for a total volume between 100 and 200 mL, depending upon the dilution ratio) is the most frequently recommended collecting medium, as evaporation is negligible compared to ethanol, it is odorless, and it adequately preserves DNA ([54, 55], Table 2).
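
The composite pitfall layout and the collecting fluid dilution reduce to simple arithmetic, sketched below. The function names are hypothetical, trap positions are expressed as offsets in meters from the center of the square (an assumption made for illustration), and the accepted ranges are the 50–100% propylene glycol ratio and 100–200 mL total volume quoted in the text.

```python
import math


def pitfall_corners(side_m=25.0):
    """Offsets (x, y) in meters of 4 traps at the corners of a square,
    measured from the center of the square."""
    half = side_m / 2
    return [(-half, -half), (half, -half), (half, half), (-half, half)]


def fluid_volumes(total_ml, glycol_fraction):
    """Return (propylene glycol mL, water mL) for one trap."""
    if not 0.5 <= glycol_fraction <= 1.0:
        raise ValueError("propylene glycol ratio should be 50-100%")
    if not 100 <= total_ml <= 200:
        raise ValueError("total volume should be 100-200 mL")
    glycol = total_ml * glycol_fraction
    return glycol, total_ml - glycol


corners = pitfall_corners()
side = math.dist(corners[0], corners[1])      # distance between adjacent traps
diagonal = math.dist(corners[0], corners[2])  # distance between opposite traps
print(corners)
print(side, round(diagonal, 2))
print(fluid_volumes(150, 0.5))
```

With the default 25-m side, adjacent traps are 25 m apart and opposite traps about 35.4 m apart; a 150-mL fill at a 50% ratio needs 75 mL of propylene glycol and 75 mL of water.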

Table 2:

Summary of key guidelines and recommendations within the 1.2 Pitfall trapping sample acquisition submodule

1.2 Pitfall trapping sample acquisition submodule
Sample definition	Plastic cups with diameter 11 cm, depth 9–11 cm, and a roof raised 1.5 cm; Composite sample (4 pitfall traps, placed at the corners of a square with sides of 25 m); One week per sample; Collecting fluid: propylene glycol (50–95%); Spatial and temporal replicates
Sampling event metadata	Geographical coordinates; Date and period of trapping; Photo recording for habitat and microhabitat; Extreme weather events during trapping
Sample storage	>95% molecular-grade ethanol/propylene glycol; Fully submerged biomass; Storage conditions of −20°C or −80°C

Similar to Malaise traps, a minimum set of metadata attributes for each pitfall composite sample should include (i) the geographical coordinates of the trap, (ii) the period of the trapping event, and (iii) photo recording (ideally a 360° photo around each trap). Following Montgomery et al. [43], we also recommend metadata reporting for the presence of rainfall or extreme events during sampling. Finally, in order to minimize the degradation of target DNA and/or the proliferation of nontarget biomass in the sample during medium- to long-term storage, we strongly recommend the use of >95% molecular-grade ethanol, or propylene glycol, as described above for Malaise trap samples. See Table 2 for key guidelines and recommendations of the 1.2 Pitfall trapping sample acquisition submodule.

Sample processing step

In contrast to microbial or environmental DNA (eDNA) approaches, where samples can be directly processed for DNA extraction, the macroscopic nature of arthropod community samples has led to a broad range of sample processing protocols, among which size sorting is the most common. Size sorting is often used because larger specimens tend to release more DNA and may dominate the total sequence count in metabarcoding data [69]. Thus, sorting invertebrates into multiple size classes and then pooling the digested tissue according to DNA concentration, abundance, or richness in each class has become common practice (e.g., [1, 70, 71]), and size sorting has revealed improved efficiency in the detection of low biomass species (e.g., [40, 70]). However, increasing sequencing depth can also increase taxon recovery to comparable levels without size sorting [72]. More generally, it has been suggested that with sufficient sequencing depth and within reasonable size ranges, species recovery is not skewed by variable biomass of species and that a size-sorting step need not be carried out [71]. Please see the section on amplification, library preparation, and sequencing steps for details on sequencing depth. In addition to the fact that handling time for size sorting places high logistical constraints for large-scale studies, size-sorting procedures also reduce comparability across independent initiatives if not fully harmonized. Given these considerations, we consider size sorting to be unnecessary for a harmonized approach, but if incorporated, it should be of limited complexity (e.g., wet sieving into 2 size fractions, 4-mm sieve pooled 1:10 to 2:10 [>4 mm/<4 mm] [72]) and properly reported. Removing any form of biomass sorting/sample picking steps will also improve cost-effectiveness and facilitate broad implementation for biomonitoring [27]. 
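
If size sorting is retained, the pooling ratio quoted above (1:10 to 2:10 for the >4 mm and <4 mm fractions) translates into a simple split of the pooled material. The sketch below assumes, for illustration only, that the ratio is applied to volumes in microliters; the function name and the example totals are hypothetical.

```python
def pool_volumes(total_ul, large_parts, small_parts=10):
    """Split a pooled volume between the >4 mm and <4 mm fractions
    according to a large:small ratio such as 1:10 or 2:10."""
    if total_ul <= 0 or large_parts <= 0 or small_parts <= 0:
        raise ValueError("volume and ratio parts must be positive")
    parts = large_parts + small_parts
    large = total_ul * large_parts / parts
    return large, total_ul - large


low = pool_volumes(110, 1)   # 1:10 ratio
high = pool_volumes(120, 2)  # 2:10 ratio
print(low, high)
```

Reporting the ratio actually used alongside the sieve size is what allows size-sorted data sets to be compared across studies.
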
Biomass and abundance information is often fundamental for biodiversity analysis, including the global assessment of arthropod decline (see [73]). However, deriving abundance information from metabarcode data remains a challenge, primarily due to inherent biases during PCR amplification, but also because of variation in gene copy number, organelle number, and technical aspects of workflows for sampling, laboratory procedures, sequencing, and bioinformatic processing [5,69, 74]. Given these considerations, we consider that an arthropod community sample processing submodule should emphasize the importance of (i) providing a wet weight measurement for each sample and (ii) generating arthropod community sample photographs. Wet mass measurement can be used as a surrogate for sample biomass. It can be easily obtained from samples after filtering off excess ethanol using a nylon filtration fabric that retains smaller specimens (e.g., 20-µm filters). Photographic recording is not a commonly reported practice, but looking forward, we think it is very likely that the integration of quantitative morphological and molecular approaches will be an important area of interest and development [75]. There is potential for image-based specimen identification involving machine learning tools to be applied as an external validation of molecular-based diversity estimations, particularly for arthropod groups with limited cryptic variation between species [75-77]. While obtaining high-quality images of arthropod community samples may be time-consuming, we recommend, as a minimum, that such images should be taken at high resolution using a conventional stereoscope equipped with a built-in microscope camera or an external single-lens reflex camera with macro lens, over a white background (ideally submerged under ethanol in a plastic tray), and minimizing the overlap among individuals to provide a physical record of the sample. 
Vouchering selected specimens may be considered unnecessary when well-parameterized reference libraries are available (e.g., [78]) but is otherwise an important consideration for future taxonomic assignment of metabarcoding reads and for completing reference barcode databases (e.g., following BOLD guidelines; see [50, 79]). Vouchering also provides a resource for potential parallel efforts to generate high-throughput specimen-based genomic resources (i.e., partial or complete genomes, microbiomes, diet) for sites of special interest (SuperGOs [13]; i.e., sites where molecular community data are intensively generated at both the temporal and the genomic axes, consistent with the idea of “model ecosystems”). Vouchered barcode sequences are also of particular relevance for bioinformatic processing of metabarcode reads. It has been demonstrated that such sequences are fundamental for efficient and validated filtering of nuclear copies of mitochondrial sequences and that they control for taxonomically inflated estimates of community composition [80]. While sample processing is not the most problematic step for cross-contamination, contamination issues have been reported (e.g., [81]), and at least basic equipment cleaning between samples is required. See Table 3 for key guidelines and recommendations of the arthropod community sample processing submodule.

Table 3:

Summary of key guidelines and recommendations within the 2.1 Arthropod community sample processing submodule

2.1 Arthropod community sample processing submodule
Sample wet mass weight	20-µm nylon filtration fabric
Sample photography	White background; Ethanol submerged (white tray); Photographic scale
Size sorting	Minimize size sorting; 4-mm sieve
Vouchering specimens	Random or directed selection of specimens for being individually DNA extracted and barcoded

DNA extraction step

A fundamental consideration for harmonizing wocDNA extraction concerns whether a preextraction homogenization-grinding step (thus implying destruction of the specimens within an arthropod community sample) is needed. Such a step can facilitate homogeneous digestion across specimens and reduce digestion volumes. It is often achieved through manual grinding in a mortar after freezing in liquid nitrogen, grinding in ethanol, or mechanical bead beating. Nondestructive extraction protocols have been developed for unsorted arthropod samples to maintain exoskeletal integrity (e.g., [61, 70, 82]). Using mock arthropod community samples generated from material collected in Malaise traps, Nielsen et al. [82] found that homogenized samples yielded more DNA but generally produced more inconsistent results when compared to nondestructive extraction. When assessing the recovered taxonomic content of samples using operational taxonomic units (OTUs), intact samples performed at least comparably to, if not better than, homogenized samples. Thus, considering that efficiency seems to be comparable, avoiding a homogenization step will (i) reduce potential heterogeneity among studies, (ii) reduce processing time, (iii) reduce contamination risk, and (iv) maintain a physical archive accessible for future developments in image classification using deep learning for the extraction of additional data, such as abundances (see sample processing section). Given these considerations, nondestructive DNA extraction should be a core feature of the arthropod community sample DNA extraction submodule. When necessary (e.g., soil arthropods where a large fraction have hard exoskeletons; see [83]), semidestructive or destructive extraction submodules will need to be developed.
Nondestructive DNA extractions require large volumes of digestion buffer to extract wocDNA. Nielsen et al. [82] have demonstrated that OTU diversity estimates are not influenced by the (sub)volume of digestion buffer that is subsequently purified. Given this consideration, typical commercial kit extraction volumes of 100–200 µL can be considered an appropriate subsampling volume for subsequent purification. A broad range of DNA extraction protocols are being applied to wocDNA metabarcoding. It remains unclear how different extraction methods might impact downstream results, as evidence from eDNA approaches on the importance of the extraction method is contrasting [19, 84]. Manual (column-based) and robotic (bead-based) implementations of the Qiagen (Hilden, Germany) DNeasy Blood & Tissue kit and homologous kits have been widely used for extracting wocDNA from terrestrial invertebrates [34]. There is little evidence for PCR inhibitor issues for DNA extracts from arthropod community samples (but see [85]), and if they occur, they can be appropriately accounted for through dilution of DNA extracts before PCR amplification (see next section). Given these considerations, simple and efficient kit-based protocols that allow sample extraction at scale (e.g., Qiagen DNeasy Blood & Tissue and analogous kits; see [86]) provide an appropriate basis for harmonization. Negative controls and technical replicates are fundamental for quality control and can be used to filter out artifactual sequences [87], and as such, their incorporation in the extraction step will also facilitate validation and integration of data across studies. Biobanking of DNA from environmental samples has been strongly advocated for long-term biomonitoring [88]. Biobanking of DNA ensures opportunities for reanalysis of past data sets with future technologies, an important consideration given high method turnover and associated comparability issues. 
Aliquots of purified wocDNA are suitable for archiving, ideally using low-DNA binding tubes and freezers of −80°C or colder, but if this option is unavailable, storage at −20°C in nondefrosting freezers provides an adequate alternative. Several museums are already offering this service with affordable pricing (e.g., Smithsonian & Canadian museum in Ottawa). See Table 4 for key guidelines and recommendations for the arthropod community sample DNA extraction submodule.
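As a minimal sketch of how negative controls and technical replicates might be used to filter artifactual sequences, the following Python function keeps an OTU in a sample only if it is detected in a minimum number of replicates, with each detection required to exceed the highest read count that OTU reaches in any negative control. The function name, the nested-dictionary input layout, and the default of 2 supporting replicates are illustrative assumptions and are not rules taken from [87].

```python
from collections import defaultdict


def filter_otus(read_counts, negatives, min_replicates=2):
    """Keep OTUs supported by enough replicates after control subtraction.

    read_counts: {sample: {replicate: {otu: reads}}}   (assumed layout)
    negatives:   {control: {otu: reads}}               (assumed layout)
    """
    # Highest read count each OTU reaches in any negative control.
    neg_max = defaultdict(int)
    for counts in negatives.values():
        for otu, n in counts.items():
            neg_max[otu] = max(neg_max[otu], n)

    kept = {}
    for sample, replicates in read_counts.items():
        presence = defaultdict(int)  # replicates in which the OTU is detected
        totals = defaultdict(int)    # reads summed over supporting replicates
        for counts in replicates.values():
            for otu, n in counts.items():
                # A detection counts only if it exceeds the control maximum.
                if n > neg_max.get(otu, 0):
                    presence[otu] += 1
                    totals[otu] += n
        kept[sample] = {
            otu: totals[otu]
            for otu, k in presence.items()
            if k >= min_replicates
        }
    return kept
```

With three replicates per sample, an OTU seen in a single replicate, or seen only at read counts no higher than in a negative control, is dropped. The appropriate stringency depends on the study and should be reported with the data.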

Table 4:

Summary of key guidelines and recommendations proposed within the 3.1 Arthropod community sample DNA extraction submodule

3.1 Arthropod community sample DNA extraction submodule
Digestion	No physical homogenization step; High volumes of digestion buffer; Long digestion (shaking)
Purification	200 µL of digestion buffer; Qiagen DNeasy Blood & Tissue type; Negative controls and technical extraction replicates
Purified DNA storage	Biobanking of DNA aliquots; −80°C, −20°C nondefrosting freezers


Amplification, library preparation, and sequencing step

There is a clear trend toward the use of the Cytochrome c oxidase subunit I barcode region (COI barcode) for wocDNA metabarcoding of arthropods (e.g., [37, 40, 83, 89–92]). This can be largely attributed to (i) the good performance of different COI primers for arthropod community samples, (ii) the availability of large COI barcode reference databases, (iii) sufficient variation to typically allow taxonomic assignment at the species level, and (iv) the potential to identify and remove sequencing errors and spurious sequence assemblies by bioinformatic processing based on the predicted variation in protein-coding regions and the limited expected length variation within the COI barcode [89]. Multiple primer sets have been demonstrated to efficiently characterize arthropod community samples, particularly those incorporating degenerate nucleotide positions (i.e., positions that allow for the binding of more than 1 nucleotide) (see Fig. 2 in Elbrecht et al. [37]), with a trend toward using the second half (3′) of the COI barcode for metabarcoding studies (e.g., [40, 93]). The BF3 fragment (418 bp) provides better taxonomic resolution than other overlapping fragments. Furthermore, primers within this region are also unaffected by slippage and provide maximum overlap across already published studies [37]. Given these considerations, choosing primers of demonstrated efficiency within the BF3 region (BF3 + BF2 or III_B_F + Fol-degen-rev, among others; see [37]), or that overlap substantially with it, offers high potential for harmonizing across independent studies.

PCR conditions are strongly dependent on selected primers but also on sample composition and polymerase used. Ideally, PCR annealing temperatures and cycle numbers should be optimized using quantitative PCR [94]. However, in the absence of such optimization, steps can be taken to reduce unneeded variability across studies. 
The number of PCR cycles should be kept at or below 30 if possible, to limit the formation of intrasample chimeras ([95], reviewed in [5]). Serial dilution is a beneficial strategy, as the concentrations of both DNA and PCR inhibitors in extracts from arthropod community samples can be high, and potential inhibitors can be effectively diluted out (e.g., [96]). Comparisons of polymerase performance for metabarcoding [97] have revealed that polymerase choice impacts read abundance, but not occurrence. Among 6 commercially available polymerases tested, Qiagen Multiplex Master Mix provided the most accurate estimates of relative abundance but also generated the highest error rate [97]. While high-fidelity DNA polymerases can reduce PCR error rates [97, 98], their proofreading activity (3′→5′ exonuclease activity) can increase the rate of chimera formation [99, 100]. PCR volume does not appear to be an important consideration for harmonization, as it has been reported not to influence downstream results; smaller volumes do, however, provide an opportunity for cost savings via PCR miniaturization (lower cost from reduced quantities of reaction components [101]; Table 5).
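To make the notion of degenerate nucleotide positions concrete, the following Python sketch expands a primer written with standard IUPAC ambiguity codes into the set of unambiguous oligonucleotides it represents and counts them. The function names and the short example sequence `ACRTY` are hypothetical and chosen only for illustration; they are not any of the published primers discussed above.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct unambiguous sequences encoded by the primer."""
    n = 1
    for base in primer.upper():
        n *= len(IUPAC[base])
    return n


def expand(primer):
    """List every unambiguous sequence encoded by a degenerate primer."""
    options = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*options)]


# Hypothetical 5-base example: R = A or G, Y = C or T.
print(degeneracy("ACRTY"))  # 4
print(expand("ACRTY"))      # ['ACATC', 'ACATT', 'ACGTC', 'ACGTT']
```

Because degeneracy multiplies across positions, a primer with many ambiguous sites represents a large pool of oligonucleotides, which is one reason the exact primer sequences used should be reported alongside the data.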

Table 5:

Summary of key guidelines and recommendations proposed within the 4.1 Arthropod community sample DNA amplification, library preparation, and sequencing submodule

4.1 Arthropod community sample DNA amplification, library preparation, and sequencing submodule
Target DNA fragments and primers	COI locus; Second half (3′) of the COI barcode fragment; Degenerate primers (see Elbrecht et al. 2019)
PCR conditions	Minimize number of PCR cycles; Dilution of DNA extract; Nonproofreading Taq; PCR replicates (3), ideally individually labeled; Negative controls; Technical PCR replication; Cross-contamination control practices
Library preparation	Two-step protocol

Performing PCR replicates and pooling for library preparation or sequencing is a well-established standard in the metabarcoding literature, particularly for arthropod community samples, with strong recommendations for a minimum pooling of 3 PCR replicates [102, 103]. The use of multiple PCR replicates per sample to be individually sequenced (technical replication) is less common, but their importance has been highlighted. Together with PCR negative controls, technical PCR replicates can provide important quality control for the removal of PCR and sequencing artifacts [87, 94, 104]. Thus, negative controls and technical replication within individual sequencing runs should be considered essential practice to identify potential biases and errors from (i) cross-contamination, (ii) tag-jumping events [105], and (iii) false-negative detection. Given the high potential for cross-contamination within the PCR step, rigorous measures should be taken to minimize this risk (e.g., using filter tips, robotic platforms for plate aliquoting). Cross-contamination can be detected and filtered out by including technical replicates, together with positive and negative controls randomly distributed among different plates to bioinformatically curate data, reducing problems associated with tag switching and/or cross-contamination [106]. These should be included in the laboratory and sequencing workflow (e.g., [107]). An important measure that enables one to filter out potential contamination during data processing is to use different nucleotide tag and/or library index combinations for individual PCR replicates within samples, as this will allow for restrictive sequence processing across each replicate [87, 104]. 
Similarly, the number of reads assigned to a given tag/library index combination that were not used in the study can provide an estimation of the contamination rate and thus a minimum OTU relative abundance that should be considered as reliable [108]. Mock communities have been investigated as positive controls for estimating recovery bias, and the use of synthetic/exogenous internal standards has also been explored to estimate absolute abundance from metabarcode data [10, 109, 110]. In the context of harmonization across studies, universal positive controls harbor much potential for intercalibration. This has yet to be developed and tested, but could be the basis for further improvement within this submodule. Library preparation involves the addition of sample-specific nucleotide identifiers to amplicons and nucleotide tails for sequencing, for which there is considerable heterogeneity in the arthropod wocDNA metabarcoding literature. In their recent review, Bohmann et al. [106] identified and reviewed 3 main approaches to achieve sample-specific labeling and library preparation in metabarcoding studies. These include (i) a 1-step PCR approach in which sample DNA extracts are amplified, tagged, and built into sequence libraries in a single PCR reaction with fusion primers, then pooled and sequenced; (ii) a 2-step PCR, in which sample DNA extracts are PCR amplified with 2 primer sets: a first PCR with metabarcoding primers carrying the 5′ sequence overhangs and no nucleotide tags and a second PCR using sequence overhangs, allowing the amplicons to be indexed (i5 and i7 indexes); and (iii) a tagged PCR approach, in which DNA extracts are PCR amplified with metabarcoding primers that carry 5′ nucleotide tags, individually tagged PCR products are then pooled, and PCR-based or ligation-based library preparation is performed for pools of 5′ tagged amplicons. All 3 labeling strategies have been used for arthropod wocDNA metabarcoding (e.g., [70, 94, 111]). 
The 2-step approach, which is based on the Illumina 16S ribosomal RNA protocol, originally developed for microbiome studies, appears to be more commonly used. Tests comparing consistency and taxon detection efficiency between 1-step and 2-step PCR protocols (in this case implementing TruSeq Nano over first untagged PCR) using mock arthropod samples reveal better performance with the 2-step protocol [26]. Ligation-based tagged PCR library preparations have been advocated to avoid false assignment of sequences to samples by tag jumping [94, 112], a recognized problem within the PCR-based tagged approach [105, 106]. However, no study has yet compared performance between 2-step and ligation-based tagged PCR. Between these two, the 2-step approach is the more frequently used for arthropod metabarcoding and thus provides a suitable approach to minimize heterogeneity across studies (Table 5). The sequencing depth needed to recover all taxa is strongly dependent on the diversity and complexity of a given sample. A sequencing depth of 60,000 ± 55,000 reads per amplicon per sample is commonly reported [113]. Increasing sequencing depth can increase the detection rate of low-abundance taxa and reduce the impacts of differential processing protocols on perceived diversity [40]. However, increased sequencing depth increases the cost per sample (see Table 2 in Piper et al. [7] for a summary of the costs [2019] and Gb output for each platform) and inherently increases the detection of artifactual sequences, requiring additional procedures for their removal [5, 80, 104]. Whether sequencing depth is sufficient can be assessed by evaluating replicability [40] or by taxon recovery graphs on mock or composition-controlled communities of comparable nature [114]. The choice of sequencing platform also has potential to generate variation among data sets. 
This variation appears to be limited across currently popular platforms, such as Illumina MiSeq, Ion Torrent PGM, and Ion Torrent S5 [40]. However, as future sequencing platforms may present greater variation, it is important to report such details (e.g., sequencing platform, read length). See Table 5 for key guidelines and recommendations for the arthropod community sample DNA amplification, library preparation, and sequencing submodule.
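The idea of using reads assigned to unused tag/library index combinations to set a minimum reliable OTU relative abundance can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure of [108]: the expected number of stray reads per combination is taken as the mean read count across unused combinations, and the per-sample threshold is that value divided by the total reads of the sample. Function names and input layouts are hypothetical.

```python
def min_reliable_abundance(unused_combo_reads, sample_reads):
    """Per-sample minimum relative abundance treated as reliable.

    unused_combo_reads: read counts assigned to tag/index combinations
                        that were not used in the study (assumed input).
    sample_reads:       {sample: total reads} for combinations in use.
    """
    if not unused_combo_reads:
        raise ValueError("need at least one unused tag combination")
    # Assumption: the mean over unused combinations estimates the number
    # of stray reads that any single combination receives.
    stray = sum(unused_combo_reads) / len(unused_combo_reads)
    return {s: stray / total for s, total in sample_reads.items() if total > 0}


def drop_unreliable(otu_counts, threshold):
    """Remove OTUs whose relative abundance falls below the threshold."""
    total = sum(otu_counts.values())
    if total == 0:
        return {}
    return {otu: n for otu, n in otu_counts.items() if n / total >= threshold}
```

For example, if three unused combinations received 12, 8, and 10 reads, a sample with 50,000 reads would get a threshold of 10 / 50,000 = 0.0002, and an OTU represented by 5 of those reads would be discarded. A more conservative variant could use the maximum rather than the mean.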

Metadata and DNA sequence sharing and storage step

Metadata associated with the different steps of generating metabarcode data should be reported with DNA sequence data to enhance long-term reuse value (see [115]). The GEOME (Genomic Observatories Metadatabase) initiative [15, 16] offers a very useful platform, facilitating findable, accessible, interoperable, and reusable data archival practices (i.e., FAIR principles). Interoperability is central to GEOME, as metadata follow controlled vocabularies consistent with DarwinCore and MIxS standards [116, 117] and new records on GEOME are incorporated into the Global Biodiversity Information Facility, GBIF (https://www.gbif.org/). A customizable but standard-compliant single spreadsheet for metainformation, including (i) the reference to the submodules implemented within each data acquisition step (e.g., 1.2 sample acquisition submodule, 2.1 sample processing submodule, etc.) and (ii) all key information highlighted within each of the submodules, will facilitate downstream comparison among data sets. The metadata spreadsheet for the terrestrial arthropod module (GEOME spreadsheet) can be additionally included as supplementary publication material. Finally, GEOME also facilitates DNA data sharing through the deposition of raw genetic data to the Sequence Read Archive (SRA, www.ncbi.nlm.nih.gov/sra), while maintaining persistent links to standard-compliant metadata held in the GEOME database. SRA is thus an ideal platform for the storage of demultiplexed HTS files. Given the continuous development and improvement of bioinformatic tools for HTS data analysis, public archiving of raw DNA data is important to facilitate future synthetic analysis across historical data sets. See Table 6 for key guidelines and recommendations for the arthropod community sample metadata and DNA sequence sharing and storage submodule.
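A single-spreadsheet metadata record that references the submodules used at each step could look like the sketch below, written as a tab-separated file with Python's standard `csv` module. The first four column names are Darwin Core terms; all other column names, and every value in the example record, are hypothetical placeholders. This is not the actual GEOME template, whose columns should be taken from GEOME itself.

```python
import csv
import io

FIELDS = [
    # Darwin Core terms
    "materialSampleID", "eventDate", "decimalLatitude", "decimalLongitude",
    # Hypothetical columns referencing the submodule used at each step
    "acquisition_submodule", "processing_submodule", "extraction_submodule",
    "amplification_submodule", "deposition_submodule",
    # Hypothetical columns for key information highlighted in the submodules
    "primer_pair", "pcr_cycles", "pcr_replicates",
    "sequencing_platform", "read_length",
]

# Invented example values, for illustration only.
EXAMPLE = {
    "materialSampleID": "SAMPLE-001",
    "eventDate": "2021-06-15",
    "decimalLatitude": 28.5,
    "decimalLongitude": -16.3,
    "acquisition_submodule": "1.2",
    "processing_submodule": "2.1",
    "extraction_submodule": "3.1",
    "amplification_submodule": "4.1",
    "deposition_submodule": "5.1",
    "primer_pair": "hypothetical_F + hypothetical_R",
    "pcr_cycles": 30,
    "pcr_replicates": 3,
    "sequencing_platform": "Illumina MiSeq",
    "read_length": "2x300",
}


def to_tsv(records):
    """Serialize metadata records to TSV, rejecting incomplete records."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for rec in records:
        missing = [f for f in FIELDS if f not in rec]
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")
        writer.writerow(rec)
    return buf.getvalue()


print(to_tsv([EXAMPLE]))
```

Rejecting records with missing fields, rather than writing blanks, keeps every deposited sample comparable on the same set of reported parameters.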

Table 6:

Summary of key guidelines and recommendations proposed within the 5.1 Arthropod community sample metadata and DNA sequence sharing and storage submodule

5.1 Arthropod community sample metadata and DNA sequence sharing and storage submodule
Metadata	GEOME metadata submission; GEOME spreadsheet with the key information of the modules performed
DNA sequences	Raw data; SRA


Conclusions

Whole-organism community DNA metabarcoding is emerging as a powerful tool to characterize and compare arthropod communities, from the scale of local community composition through to global comparative analyses. For this potential to be fully realized, comparability across data sets generated by independent research groups is a fundamental prerequisite. There are several challenges to achieve this. First, as is the case for many new fields, early development has led to different strategies and tools, among which some will facilitate data comparability, while others will not. Here we have addressed this issue by suggesting a modular framework that seeks to reduce redundant efforts and improve comparability across studies by harmonization of common practice across different research initiatives, where that practice demonstrates utility. We have illustrated this framework with recommendations for a module for the characterization of terrestrial arthropods. A second challenge is that canalization of different practices to optimize comparability at the community level may, inadvertently, limit flexibility at the scale of individual studies. While this is to some extent unavoidable, the flexible structure we presented here seeks to broaden the applicability of a modular framework within the wocDNA metabarcoding community. Finally, unless appropriate data and metadata are provided for a given wocDNA metabarcode study, the opportunities for integrative analyses across historical data sets are likely to be limited. We address this challenge by advocating good reporting practice and highlight that the submodule structure provides a framework for the incorporation of new advances as they emerge within the field of metabarcoding. We advocate the adoption and development of the terrestrial arthropod module that we propose here, as an important step toward harmonization of metabarcode data. 
We further encourage the development of additional submodules for the terrestrial arthropod module (e.g., soil mesoarthropod sample acquisition, pan trapping for pollinator sample acquisition), as well as modules for other biodiversity fractions that are appropriate targets for wocDNA metabarcoding.

Abbreviations

COI barcode: Cytochrome c oxidase subunit I barcode region; GBIF: Global Biodiversity Information Facility; GEOME: Genomic Observatories Metadatabase; GO: Genomic Observatories; HTS: high-throughput sequencing; OTUs: operational taxonomic units; PCR: polymerase chain reaction; SRA: Sequence Read Archive; wocDNA: whole-organism community DNA.

Data Availability

Not applicable.

Competing Interests

A.P.V. is a cofounder and scientific advisor of NatureMetrics, a private company providing commercial services in DNA‐based monitoring. The authors declare that they have no other conflicts of interest.

Funding

The working group “Toward Harmonisation for the Generation of Metabarcoding Data: Soil Biodiversity and Terrestrial Arthropod modules” held in November 2020 (online) was organized by the iBioGen project, which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 810729. P.A. was funded through a Junior Leader Fellowship (LCF/BQ/PR21/11840006) by “la Caixa” Foundation (ID 100010434) and the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 847648.

Authors' Contributions

P.A. and B.C.E. conceptualized the manuscript. All authors contributed to the ideas and discussion of this review. P.A. and B.C.E. coordinated the working group meetings and led the writing with contributions from all authors. All authors read and approved the final manuscript.
