Thomas J Creedy, Carmelo Andújar, Emmanouil Meramveliotakis, Victor Noguerales, Isaac Overcast, Anna Papadopoulou, Hélène Morlon, Alfried P Vogler, Brent C Emerson, Paula Arribas.
Abstract
Metabarcoding of DNA extracted from community samples of whole organisms (whole organism community DNA, wocDNA) is increasingly being applied to terrestrial, marine and freshwater metazoan communities to provide rapid, accurate and high resolution data for novel molecular ecology research. The growth of this field has been accompanied by considerable development that builds on microbial metabarcoding methods to develop appropriate and efficient sampling and laboratory protocols for whole organism metazoan communities. However, considerably less attention has focused on ensuring bioinformatic methods are adapted and applied comprehensively in wocDNA metabarcoding. In this study we examined over 600 papers and identified 111 studies that performed COI metabarcoding of wocDNA. We then systematically reviewed the bioinformatic methods employed by these papers to identify the state-of-the-art. Our results show that the increasing use of wocDNA COI metabarcoding for metazoan diversity is characterised by a clear absence of bioinformatic harmonisation, and the temporal trends show little change in this situation. The reviewed literature showed (i) high heterogeneity across pipelines, tasks and tools used, (ii) limited or no adaptation of bioinformatic procedures to the nature of the COI fragment, and (iii) a worrying underreporting of tasks, software and parameters. Based upon these findings we propose a set of recommendations that we think the metabarcoding community should consider to ensure that bioinformatic methods are appropriate, comprehensive and comparable. We believe that adhering to these recommendations will improve the long-term integrative potential of wocDNA COI metabarcoding for biodiversity science.
Keywords: COI barcode; animal communities; bioinformatics; community ecology; high-throughput sequencing; metabarcoding
Year: 2021 PMID: 34496132 PMCID: PMC9292290 DOI: 10.1111/1755-0998.13502
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 8.678
Table of all bioinformatic tasks performed across the core papers set
| Task group | Task | Description | Number of papers reporting task | Number of papers not reporting software | Total number of software tools | Total number of software functions | Number of papers performing manually |
|---|---|---|---|---|---|---|---|
| Read preparation | Quality control | | 19 | 0 | 4 | 4 | 0 |
| | Adapter trimming | | 9 | 1 | 6 | 6 | 0 |
| | Demultiplexing | | 55 | 17 | 16 | 19 | 0 |
| | Pair merging | | 63 | 1 | 10 | 18 | 0 |
| | Quality trimming | | 20 | 1 | 8 | 10 | 0 |
| | Mate pairing | | 3 | 0 | 3 | 3 | 0 |
| | Primer trimming | | 66 | 8 | 15 | 17 | 0 |
| | Reverse complementation | | 7 | 3 | 2 | 2 | 0 |
| | Sequence conversion | | 3 | 0 | 2 | 3 | 0 |
| | Length trimming | | 10 | 3 | 6 | 7 | 0 |
| | Pair concatenation | | 8 | 4 | 4 | 4 | 0 |
| | Assembly | | 6 | 0 | 4 | 4 | 0 |
| | Degapping | | 1 | 0 | 1 | 1 | 0 |
| Sequence processing | Dereplication | | 58 | 10 | 11 | 19 | 0 |
| | Size sorting | | 10 | 2 | 3 | 4 | 0 |
| Filtering | Quality filtering | | 81 | 11 | 20 | 27 | 0 |
| | Similarity filtering | | 9 | 1 | 4 | 4 | 0 |
| | Length filtering | | 54 | 21 | 17 | 23 | 0 |
| | Preclustering | | 12 | 1 | 3 | 6 | 0 |
| | Denoising | | 18 | 1 | 8 | 8 | 0 |
| | Normalisation | | 2 | 0 | 1 | 1 | 1 |
| | Chimera filtering | | 63 | 4 | 6 | 16 | 1 |
| | Translation filtering | | 22 | 3 | 11 | 12 | 0 |
| | Frequency filtering | | 51 | 37 | 11 | 15 | 1 |
| | Taxonomy filtering | | 9 | 5 | 1 | 1 | 1 |
| | Mistag filtering | | 3 | 1 | 1 | 1 | 0 |
| Data generation | OTU delimitation | | 84 | 5 | 12 | 22 | 0 |
| | OTU mapping | | 30 | 3 | 7 | 11 | 0 |
| | Uncurated taxonomic assignment | | 55 | 2 | 11 | 13 | 0 |
| | Reference taxonomic assignment | | 60 | 9 | 18 | 23 | 1 |
Tasks are grouped into four groups by broad purpose, and a detailed definition of each task is given along with summary statistics of the implementation of each task across the 111 papers. For a list of the software used for each task, Table S1 is an expanded version of this table.
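The two count columns of the table can be combined into a per-task non-reporting rate, which is the quantity Figure 4b plots (the proportion of papers employing a task that did not report the software used). The following is a minimal sketch in Python; the counts for five tasks are copied from the table, while the function name `non_reporting_rate` and the choice of tasks are illustrative assumptions, not part of the original study.

```python
# Counts copied from the table above:
# task -> (papers reporting the task, papers not reporting the software used).
counts = {
    "Demultiplexing": (55, 17),
    "Length filtering": (54, 21),
    "Frequency filtering": (51, 37),
    "Taxonomy filtering": (9, 5),
    "OTU delimitation": (84, 5),
}


def non_reporting_rate(reported: int, no_software: int) -> float:
    """Fraction of papers performing a task that did not name the software used.

    Hypothetical helper for illustration; not taken from the reviewed paper.
    """
    if reported <= 0:
        raise ValueError("a task must be reported by at least one paper")
    return no_software / reported


rates = {task: non_reporting_rate(*pair) for task, pair in counts.items()}

# List tasks from the highest to the lowest non-reporting rate.
for task, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<22} {rate:.1%}")
```

Among these five tasks, frequency filtering has the highest rate (37 of 51 papers) and OTU delimitation the lowest (5 of 84).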
FIGURE 1. Year of publication of the articles in the core papers set. Bar fills and numbers refer to the number of articles within each research aim category. Note that only articles indexed by Web of Science by 3rd November 2020 were included.
FIGURE 2. Bioinformatic pipelines implemented by the core papers set. (a) Frequency distribution of the number of tasks by study. (b) Number of tasks by study against the year of publication, with best fit regression line in blue with shaded 95% confidence intervals around the line. Slight horizontal jitter added to points to better show density. (c) Network diagram of tasks and different pipeline routes through these tasks. All pipelines start and end on the respective orange nodes. All other nodes are coloured according to the four main categories of bioinformatic tasks: red for read preparation tasks, blue for sequence processing, green for filtering and purple for data generation tasks. Arrows link tasks performed consecutively, with direction of arrow showing order of tasks. Thickness of arrows shows relative frequency of pairs of consecutive tasks. Arrows coloured orange are the top 10% of consecutive task pairs by relative frequency; note that while this illustrates a possible complete pipeline from Start to End, this “average” pipeline is not in fact performed by any of the papers assessed by this review.
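The edge weights behind a network like panel (c) can be sketched as a count of directed pairs of consecutive tasks. The Python example below is an illustration under stated assumptions: the three pipelines are hypothetical (not taken from any reviewed paper), relative frequency is assumed to be the pair count divided by the total number of consecutive pairs, and "top 10%" is assumed to mean the top tenth of distinct pairs, rounded up. The original study may define these differently.

```python
import math
from collections import Counter

# Hypothetical pipelines for illustration only.
pipelines = [
    ["Pair merging", "Primer trimming", "Quality filtering",
     "Dereplication", "OTU delimitation"],
    ["Primer trimming", "Pair merging", "Quality filtering",
     "Dereplication", "Chimera filtering", "OTU delimitation"],
    ["Pair merging", "Primer trimming", "Quality filtering",
     "OTU delimitation"],
]


def edge_frequencies(pipelines):
    """Relative frequency of each directed pair of consecutive tasks.

    Every pipeline is wrapped in Start and End nodes, as in the figure.
    Normalising by the total pair count is an assumption of this sketch.
    """
    edges = Counter()
    for tasks in pipelines:
        path = ["Start", *tasks, "End"]
        edges.update(zip(path, path[1:]))
    total = sum(edges.values())
    return {edge: n / total for edge, n in edges.items()}


freqs = edge_frequencies(pipelines)

# Assumed reading of "top 10%": the top tenth of distinct pairs, rounded up.
n_top = math.ceil(0.10 * len(freqs))
top_edges = sorted(freqs, key=freqs.get, reverse=True)[:n_top]

for src, dst in top_edges:
    print(f"{src} -> {dst}: {freqs[(src, dst)]:.3f}")
```

Chaining the most frequent edges in this way can produce a route that no single pipeline follows, which is the caveat the caption raises about the "average" pipeline.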
FIGURE 3. Violin plot of standardised task position within pipelines. Increasing x‐axis position denotes later placement of task within pipelines; vertical dashed lines denote 25%, 50% and 75% of the way through the pipeline, respectively. Tasks are separated into task groups and ordered within task group by mean standardised pipeline position. Points denote task positions where tasks occurred too infrequently to compute density profile for violin plots. Values report the total number of papers implementing each task.
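One plausible way to standardise task position is to rescale each task's index so that the first task in a pipeline sits at 0 and the last at 1, which makes pipelines of different lengths comparable. The Python sketch below assumes that definition; the caption does not state the formula used, the two pipelines are hypothetical, and placing a single-task pipeline at 0.5 is an arbitrary choice of this example.

```python
from collections import defaultdict
from statistics import mean


def standardised_positions(tasks):
    """Return (task, position) pairs with positions rescaled to [0, 1].

    Assumed definition: index / (n - 1), so the first task is at 0 and the
    last at 1. A single-task pipeline is placed at 0.5 (arbitrary choice).
    """
    n = len(tasks)
    if n == 1:
        return [(tasks[0], 0.5)]
    return [(task, i / (n - 1)) for i, task in enumerate(tasks)]


# Hypothetical pipelines for illustration only.
pipelines = [
    ["Pair merging", "Primer trimming", "Quality filtering",
     "Dereplication", "OTU delimitation"],
    ["Primer trimming", "Pair merging", "OTU delimitation"],
]

# Collect every standardised position observed for each task.
positions = defaultdict(list)
for tasks in pipelines:
    for task, pos in standardised_positions(tasks):
        positions[task].append(pos)

# Mean standardised position per task, the value used to order tasks.
mean_position = {task: mean(values) for task, values in positions.items()}

for task, pos in sorted(mean_position.items(), key=lambda kv: kv[1]):
    print(f"{task:<20} {pos:.3f}")
```

Under this definition, the dashed lines at 25%, 50% and 75% correspond to positions 0.25, 0.5 and 0.75.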
FIGURE 4. Plots summarising the reporting of three key aspects of bioinformatic tools (software name, version and parameters) by the core papers. (a) Venn diagram shows the number of papers fully reporting each detail, that is, giving the software used for every task reported, and giving the parameters and version for each task where software is given; 86 papers reported at least one of the three details for all steps, 25 further papers failed to fully report all three details in all steps. (b) Bar chart details the proportion of papers employing a specific task that failed to report the software used for that task, with longer bars denoting a greater proportion of papers not reporting software for that specific task.
FIGURE 5. Consistency in software reporting and use over time. (a) The total number of unique software functions reported across all papers for each year of publication. (b) For each paper, the proportion of the total number of bioinformatic tasks for which the software used for a task was not reported. (c) The software homogeneity rate, calculated only when more than one paper reported a task in a given year. A value of 1 means all papers used the same tool for a given task in a given year. (d) The software dominance rate, calculated only when more than one paper reported a task in a given year. A value of 1 means all papers used the same tool for a given task in a given year. (b–d) Best fit regression lines are shown in blue with shaded 95% confidence intervals around the line. Horizontal jitter added to points to illustrate density within years; (c and d) colours denote different tasks, see Figure S1.