Standardizing metadata and taxonomic identification in metabarcoding studies.

Leho Tedersoo¹, Kelly S Ramirez², R Henrik Nilsson³, Aivi Kaljuvee⁴, Urmas Kõljalg⁴, Kessy Abarenkov¹.

Abstract

High-throughput sequencing-based metabarcoding studies produce vast amounts of ecological data, but a lack of consensus on standardization of metadata and how to refer to the species recovered severely hampers reanalysis and comparisons among studies. Here we propose an automated workflow covering data submission, compression, storage and public access to allow easy data retrieval and inter-study communication. Such standardized and readily accessible datasets facilitate data management, taxonomic comparisons and compilation of global metastudies.

Keywords: Data storage; Digital object identifiers (DOI); Environmental metadata; High-throughput sequencing (HTS); Interactive database; Internal transcribed spacer (ITS); Next-generation sequencing; Species hypotheses

Year: 2015 PMID: 26236474 PMCID: PMC4521374 DOI: 10.1186/s13742-015-0074-5

Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524

Background

The revolution of high-throughput sequencing (HTS) technologies has provided an unprecedented insight into the biodiversity and ecology of organisms, particularly those that are microscopic and difficult to culture. Such metabarcoding studies from terrestrial and aquatic ecosystems, as well as from living organisms, generate enormous amounts of sequence data and associated environmental metadata to address the properties of biodiversity. However, comparisons among datasets and global syntheses are severely hampered by the widespread use of different protocols in virtually every step of data generation and analysis, from sample preparation through laboratory work to the calculation of operational taxonomic units (OTUs), the proxies for species [1]. The naming of these OTUs is non-systematic and study-specific, which prevents wide-scale taxonomic communication [2]. Furthermore, the outcome of OTU calculation depends strongly on the choice of bioinformatics tools and clustering methods. Improved bioinformatics techniques reveal systematic errors in HTS datasets, and reanalysis of these datasets opens new perspectives in understanding biodiversity [3, 4]. The ability to recalculate data and integrate them into metastudies is essential for understanding the patterns of biodiversity over large spatiotemporal scales [5]. Compilation of metastudies requires easy access to environmental metadata and sequence data. Yet the metadata describing sampling location, habitat quality and interacting taxa are often excluded from the published supplementary datasets. Alternatively, these data remain buried in inaccessible forms in databases such as the Sequence Read Archive (SRA) [www.ncbi.nlm.nih.gov/sra], DataDryad [http://datadryad.org], or in spreadsheets on the authors’ computers [4].
For example, of 27 fungal metabarcoding datasets we tried to access, many were missing from public repositories (37 %), were locked (19 %), or lacked the information needed to de-multiplex samples (15 %). Contacting the corresponding authors eventually enabled us to recover 67 % of the requested data. Furthermore, even when the environmental metadata are available, they are typically located in user-defined data fields with different units or no units at all. To address these shortcomings, and building on existing standards and recently developed bioinformatics tools, we propose a workflow for standardized metadata and sequence data synthesis (Fig. 1). This workflow has the potential to form a basis for efficient data management, download, and scientific hypothesis testing in taxonomic and ecological metastudies.

Fig. 1

General data structure. a Suggested workflow using various bioinformatics tools and databases. DOI, digital object identifier; HTS, high-throughput sequencing; INSDC, International Nucleotide Sequence Database Collaboration; repres, representative; seq, sequencing. b Proposed minimum data fields for HTS metadata.

Data standardization

Long-established disciplines such as plant and animal sciences have a rich history of standardization, with Darwin Core [6] and the Access to Biological Collections Data (ABCD) [www.tdwg.org] representing the most widely used formats. These efforts have demonstrated the benefit of data standardization, particularly the ability to address research and management questions relevant to global change [5]. Initiatives such as the International Nucleotide Sequence Database Collaboration (INSDC) [www.insdc.org], Global Biodiversity Information Facility (GBIF) [www.gbif.org], Map of Life [www.mapoflife.org], Encyclopedia of Life [www.eol.org], International Barcode of Life (iBOL) [http://ibol.org], and Data Observation Network for Earth (DataONE) [www.dataone.org] are the global leaders in managing biodiversity information. In contrast, microbial ecology and other disciplines using metabarcoding tools for species identification have had no organized standardization efforts until recently. ‘Minimum information about a marker gene sequence’ (MIMARKS) is a recent effort developed by the Genomic Standards Consortium to implement standardization and description of sequence-based data [7]. Still, agreement on global standards does not automatically ensure enforcement. Providing data with scientific publications has long been encouraged, but leading publishers have only recently declared this a strict requirement. Unfortunately, this requirement has not been extended to standardization. Therefore, we advocate that journals and institutes should follow a set of standards agreed upon by the research community [1]. We recommend that metadata terminology should follow the MIMARKS and Darwin Core standards, because these formats have been generally approved and implemented by multiple initiatives and thus provide compatibility. Accordingly, the database modules for storing sequence data and metadata need to be very broad and flexible to cover virtually all fields of biology.
Similarly, the forms for data upload should be simple and follow both of the above standards. To simplify upload of sequence data and metadata, we propose a single spreadsheet for metainformation about the study or studies, site(s) and sample(s) as well as links to demultiplexed HTS files (Fig. 1). Efficient data management requires an interactive database associated with an online workbench for data curation, annotation and analysis. Because of its central role in data storage and well-developed modules for up-to-date taxonomy, the INSDC would preferably lead this development. Multiple commitments and a paucity of directed funding mean that the INSDC evolves slowly and thus lags well behind research needs. Therefore, research consortia have generated multiple platforms for data storage, bioinformatics analysis and statistical analysis, e.g. Quantitative Insights into Microbial Ecology (QIIME) [www.qiime.org], Ribosomal Database Project (RDP) [https://rdp.cme.msu.edu/], Silva [www.arb-silva.de/], Barcode of Life Data Systems (BOLD Systems) [www.boldsystems.org], and UNITE [https://unite.ut.ee]. The latter two platforms are suitably structured for managing barcoding data. Unfortunately, these initiatives remain largely unconnected. Integrating these community-driven efforts to create a central data portal requires interdisciplinary collaboration involving expert knowledge from multiple research fields to bridge the disciplines of taxonomy and ecology of organisms with genomics and bioinformatics.
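
As a rough illustration of how a single upload sheet of this kind could be checked at submission, the Python sketch below parses a tab-separated sheet and flags rows with unusable coordinates or a missing link to the sequence file. The column names decimalLatitude, decimalLongitude and eventDate follow Darwin Core terms; study_id, sample_id and sequence_file, as well as the checks themselves, are assumptions made for this example and are not the field list of Fig. 1.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical minimum columns for one row per sample (illustration only).
REQUIRED = ("study_id", "sample_id", "decimalLatitude",
            "decimalLongitude", "eventDate", "sequence_file")


@dataclass
class SampleRecord:
    study_id: str
    sample_id: str
    latitude: float       # decimal degrees
    longitude: float      # decimal degrees
    event_date: str       # kept as text; date parsing is left out of this sketch
    sequence_file: str    # link or path to the demultiplexed HTS file


def parse_metadata(text):
    """Parse a tab-separated metadata sheet and return (records, errors)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    records, errors = [], []
    # The header is line 1 of the sheet, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            lat = float(row["decimalLatitude"])
            lon = float(row["decimalLongitude"])
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: coordinates are not decimal degrees")
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append(f"line {line_no}: coordinates out of range")
            continue
        if not row["sequence_file"]:
            errors.append(f"line {line_no}: no link to a sequence file")
            continue
        records.append(SampleRecord(row["study_id"], row["sample_id"], lat, lon,
                                    row["eventDate"], row["sequence_file"]))
    return records, errors
```

Rejecting a row with an explicit message, rather than silently storing free text, is one way to keep units and formats comparable across studies; a production form would need many more fields and checks than shown here.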

Communication of species

Raw HTS data analysis is time consuming and requires substantial bioinformatics skills. To undertake comparisons across metabarcoding studies, noise removal and OTU calculation must be done following the same bioinformatics protocols [8]. This effectively precludes comparisons across HTS platforms, which differ in base calling and error rates. Thus, if all authors of HTS sequences were to reduce the size and complexity of their data before adding them to repositories, those data would be much more useful to the wider research community. The processing of uploaded data generated by different HTS technologies requires standardized, automated bioinformatics routines. The research community needs a consensus on how to implement these bioinformatics tools in a way that balances minimizing technical errors and tag switches, maximizing the sensitivity to biological variation, and meeting downstream reanalysis requirements. Given that most HTS platforms produce sequences with error rates approaching 1 %, we propose that sequence data be clustered at 99 % similarity using single linkage methods. Representative sequences of non-singleton taxonomic clusters should be selected automatically, based on their similarity to the consensus sequence of the cluster, and should carry information about relative abundance. Researchers wishing to use other similarity thresholds and algorithms can rapidly process these deposited representative sequences to that effect. The continuous development of bioinformatics tools for HTS data analysis means that it is also necessary to store raw data to enable future recalculation of clusters and representative sequences. To provide centralized species identification, representative sequences of metabarcoding studies should be further clustered along with Sanger sequences using multiple sequence similarity thresholds that represent species hypotheses [9].
This approach allows users to choose the biologically meaningful OTUs for further downstream analyses (Fig. 2). For direct taxonomic communication of OTUs across studies, we recommend the use of digital object identifiers (DOIs) of species hypotheses. Such a system was recently introduced in the UNITE platform for Sanger sequences of the internal transcribed spacer (ITS) region, the official fungal barcode [9]. Several HTS bioinformatics workbenches such as QIIME and mothur [www.mothur.org] have implemented the use of species hypotheses for fungal ITS sequences. Similarly, the BOLD system has implemented fixed-threshold species approximation and the barcode index number (BIN) naming system for the cytochrome c oxidase subunit I (COI) barcode of animals [10].
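
The clustering step proposed above can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions, not the routine of UNITE, QIIME, mothur or BOLD: sequences are assumed to be already aligned to equal length, similarity is the plain fraction of identical positions, and all pairs are compared, which would not scale to real HTS datasets. The function names are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage(seqs, threshold):
    """Cluster so that every member is >= threshold similar to at least
    one other member of its cluster (single linkage, via union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


def representative(cluster):
    """Member most similar to the per-column majority consensus."""
    consensus = "".join(Counter(col).most_common(1)[0][0]
                        for col in zip(*cluster))
    return max(cluster, key=lambda s: identity(s, consensus))


def summarize(seqs, threshold=0.99):
    """Representative sequence and relative abundance of each
    non-singleton cluster."""
    total = len(seqs)
    return [(representative(c), len(c) / total)
            for c in single_linkage(seqs, threshold) if len(c) > 1]
```

Calling `summarize` repeatedly with thresholds such as 0.97, 0.98, 0.99 and 1.0 gives a simple picture of how the same reads fall into coarser or finer clusters, which is the idea behind offering species hypotheses at several similarity levels. Because single linkage chains neighbours together, two reads in one cluster may differ from each other by more than the threshold.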

Fig. 2

Screenshot of PlutoF workbench [11] for managing species hypotheses in the UNITE database [https://unite.ut.ee]. Multiple alignment of one of 20 clades of the enigmatic fungal class Archaeorhizomycetes is shown. Species hypotheses (SH) based on 97.0-100.0 % sequence similarity thresholds are marked with color patterns. The representative sequence of each SH is shown in green text. User-annotated taxonomic and ecological metadata are also indicated.

Conclusions

Given the poor accessibility of high-throughput sequencing data and environmental metadata, there is an urgent need for a centralized system of standardized data deposition and management, an issue that affects many areas of biodiversity research. For example, the Global Soil Biodiversity Initiative (GSBI) has taken steps to standardize and integrate sequence information with classical morphology data [1]. Although there has been no clear leadership, these efforts demonstrate that the research community favours the implementation of standards and is prepared to move towards global guidelines. Regarding sequence data, we argue that these should be made available in both raw and quality-filtered formats, the latter allowing easy access for non-bioinformaticians. Communication of taxonomic identification between studies would be enabled by a permanent DOI-based naming system for OTUs [9, 10]. Standardization of sequence data, metadata and taxonomic communication will greatly improve our understanding of global biodiversity and the autecology of species.

References

1. Data archiving in ecology and evolution: best practices.

Authors: Michael C Whitlock
Journal: Trends Ecol Evol Date: 2010-12-14 Impact factor: 17.712

2. Towards a unified paradigm for sequence-based identification of fungi.

Authors: Urmas Kõljalg; R Henrik Nilsson; Kessy Abarenkov; Leho Tedersoo; Andy F S Taylor; Mohammad Bahram; Scott T Bates; Thomas D Bruns; Johan Bengtsson-Palme; Tony M Callaghan; Brian Douglas; Tiia Drenkhan; Ursula Eberhardt; Margarita Dueñas; Tine Grebenc; Gareth W Griffith; Martin Hartmann; Paul M Kirk; Petr Kohout; Ellen Larsson; Björn D Lindahl; Robert Lücking; María P Martín; P Brandon Matheny; Nhu H Nguyen; Tuula Niskanen; Jane Oja; Kabir G Peay; Ursula Peintner; Marko Peterson; Kadri Põldmaa; Lauri Saag; Irja Saar; Arthur Schüßler; James A Scott; Carolina Senés; Matthew E Smith; Ave Suija; D Lee Taylor; M Teresa Telleria; Michael Weiss; Karl-Henrik Larsson
Journal: Mol Ecol Date: 2013-09-24 Impact factor: 6.185

3. Meta-analysis of deep-sequenced fungal communities indicates limited taxon sharing between studies and the presence of biogeographic patterns.

Authors: Anjuli Meiser; Miklós Bálint; Imke Schmitt
Journal: New Phytol Date: 2013-10-01 Impact factor: 10.151

4. Accurate determination of microbial diversity from 454 pyrosequencing data.

Authors: Christopher Quince; Anders Lanzén; Thomas P Curtis; Russell J Davenport; Neil Hall; Ian M Head; L Fiona Read; William T Sloan
Journal: Nat Methods Date: 2009-08-09 Impact factor: 28.547

5. From integrative taxonomy to species description: one step beyond.

Authors: E Pante; C Schoelinck; N Puillandre
Journal: Syst Biol Date: 2014-10-29 Impact factor: 15.683

6. From barcoding single individuals to metabarcoding biological communities: towards an integrative approach to the study of global biodiversity.

Authors: Melania E Cristescu
Journal: Trends Ecol Evol Date: 2014-08-28 Impact factor: 17.712

7. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

8. Darwin Core: an evolving community-developed biodiversity data standard.

Authors: John Wieczorek; David Bloom; Robert Guralnick; Stan Blum; Markus Döring; Renato Giovanni; Tim Robertson; David Vieglais
Journal: PLoS One Date: 2012-01-06 Impact factor: 3.240

9. A DNA-based registry for all animal species: the barcode index number (BIN) system.

Authors: Sujeevan Ratnasingham; Paul D N Hebert
Journal: PLoS One Date: 2013-07-08 Impact factor: 3.240
