Reference datasets of tufA and UPA markers to identify algae in metabarcoding surveys.

Vanessa Rossetto Marcelino¹, Heroen Verbruggen¹.

Abstract

The data presented here are related to the research article "Multi-marker metabarcoding of coral skeletons reveals a rich microbiome and diverse evolutionary origins of endolithic algae" (Marcelino and Verbruggen, 2016) [1]. We provide reference datasets of the elongation factor Tu (tufA) and Universal Plastid Amplicon (UPA) markers in a format that is ready to use in the QIIME pipeline (Caporaso et al., 2010) [2]. In addition to sequences previously available in GenBank, the datasets include newly discovered endolithic algal lineages identified with both amplicon sequencing (Marcelino and Verbruggen, 2016) [1] and chloroplast genome data (Marcelino et al., 2016; Verbruggen et al., in press) [3], [4]. We also provide a script that converts GenBank flatfiles into reference datasets and can be used to build datasets for other markers. The tufA and UPA reference datasets are made publicly available here to facilitate biodiversity assessments of microalgal communities.

Keywords: Metabarcoding; Ostreobium; RDP classifier; Reference sequences; UPA; tufA

Year: 2017; PMID: 28243624; PMCID: PMC5320050; DOI: 10.1016/j.dib.2017.02.013

Source DB: PubMed; Journal: Data Brief; ISSN: 2352-3409

Value of the data

The tufA and UPA reference datasets facilitate biodiversity assessments of cyanobacterial and eukaryotic algal communities using high-throughput sequencing.

When used with the Naive Bayesian Classifier (RDP classifier) implemented in QIIME [2], [5], the taxonomic metadata of the reference datasets provided here allow operational taxonomic units (OTUs) to be classified at higher taxonomic ranks when no match is found at lower ranks. For example, an OTU with no close relatives at the species or genus level can still be classified at the family level, which facilitates the interpretation of the results.

The datasets incorporate recently discovered endolithic (limestone-boring) algal lineages [1], [3], [4] to facilitate the identification of these algae in other studies.

The script provided here facilitates the development of custom reference databases for non-standard metabarcoding markers.
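The fallback to higher ranks can be pictured with a small sketch. The RDP classifier reports a bootstrap confidence for each rank of an assignment, and the lineage is typically cut at the first rank whose confidence falls below a chosen threshold. The function name, the threshold of 0.8 and the confidence values below are illustrative assumptions, not values taken from the article or from QIIME defaults.

```python
def truncate_lineage(lineage, confidences, min_confidence=0.8):
    """Keep ranks from the top down until confidence drops below the threshold.

    lineage and confidences are parallel lists ordered from the highest rank
    to the lowest. The threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    for taxon, confidence in zip(lineage, confidences):
        if confidence < min_confidence:
            break  # everything below this rank is left unclassified
        kept.append(taxon)
    return kept


# Made-up example: an OTU that is well supported down to family level only.
example_lineage = [
    "Eukaryota", "Chlorophyta", "Ulvophyceae", "Bryopsidales",
    "Ostreobiaceae", "Ostreobium", "Ostreobium_sp",
]
example_confidences = [1.0, 1.0, 0.99, 0.97, 0.90, 0.40, 0.20]

assigned = truncate_lineage(example_lineage, example_confidences)
print(";".join(assigned))
```

With these made-up confidences the OTU is reported down to the family, which is the behaviour the hierarchical taxonomic metadata is meant to enable.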

Data

The datasets of this article provide reference sequences of the elongation factor Tu (tufA) and the Universal Plastid Amplicon (UPA) loci together with their corresponding taxonomic information. Supplementary File 1 is a set of identified tufA reference sequences in fasta format. Supplementary File 2 is a tab-delimited file containing the taxonomic information of the tufA reference sequences. The tufA reference dataset contains bacterial and chloroplast tufA sequences, including green algae, red algae, heterokonts, cryptophytes and haptophytes. Supplementary File 3 is a set of identified UPA reference sequences (a fragment of the 23S rDNA) in fasta format. Supplementary File 4 is a tab-delimited file containing the taxonomic information of the UPA reference sequences. The UPA reference dataset contains bacterial and chloroplast 23S rDNA sequences, including cyanobacteria, green algae, red algae, heterokonts, cryptophytes and haptophytes. Supplementary File 5 is a Python script that takes a GenBank (.gb) flatfile as input and produces the two files needed by the RDP classifier (QIIME version). This script requires Biopython [6].
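To make the two-file layout concrete, here is a minimal sketch of how a fasta file and a tab-delimited taxonomy file could be written from GenBank-style records. It is not the gb_2_RDP.py script: the function name, the made-up accession and the choice to append the organism name as the last rank are assumptions for illustration. The record objects only need the `id`, `seq` and `annotations` attributes that Biopython's `SeqRecord` exposes, so a stand-in object is used here to keep the example runnable without Biopython.

```python
import io
from types import SimpleNamespace


def write_reference_files(records, fasta_out, taxonomy_out):
    """Write one fasta entry and one 'id<TAB>lineage' line per record.

    records: iterable of objects with .id, .seq and .annotations, as provided
    by Biopython SeqRecord objects parsed from a GenBank flatfile.
    Returns the number of records written.
    """
    written = 0
    for record in records:
        lineage = list(record.annotations.get("taxonomy", []))
        organism = record.annotations.get("organism")
        if organism:
            # Assumption: use the organism name as the lowest rank.
            lineage.append(organism.replace(" ", "_"))
        fasta_out.write(f">{record.id}\n{str(record.seq)}\n")
        taxonomy_out.write(f"{record.id}\t{';'.join(lineage)}\n")
        written += 1
    return written


# Stand-in record with a made-up accession and a toy sequence.
demo_record = SimpleNamespace(
    id="XX000001.1",
    seq="ATGGCTCGT",
    annotations={
        "taxonomy": ["Eukaryota", "Viridiplantae", "Chlorophyta"],
        "organism": "Ostreobium sp.",
    },
)

fasta_buffer, taxonomy_buffer = io.StringIO(), io.StringIO()
record_count = write_reference_files([demo_record], fasta_buffer, taxonomy_buffer)
print(fasta_buffer.getvalue(), taxonomy_buffer.getvalue(), sep="")
```

With Biopython installed, the same function could be fed `SeqIO.parse("records.gb", "genbank")` (the file name is hypothetical). GenBank lineages vary in depth, so a real conversion would probably also need to normalize the number of ranks per sequence; this sketch does not attempt that.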

Experimental design, materials and methods

We produced reference datasets that can be used with the Naive Bayesian Classifier (RDP classifier) implemented in the QIIME pipeline [2], [5]. Each dataset consists of: 1) a fasta file containing the reference DNA sequences and short sequence identifiers and 2) a text file matching the sequence identifiers to their taxonomic metadata. To produce these datasets we first mined sequences from GenBank by querying the marker name and downloading all matching items as full GenBank records.

We added endolithic (limestone-boring) green algal lineages discovered with the tufA marker in our study "Multi-marker metabarcoding of coral skeletons reveals a rich microbiome and diverse evolutionary origins of endolithic algae" [1]. We identified these algal lineages in a phylogenetic context (see [1]) and included representatives of the main endolithic clades in the tufA reference dataset. We also retrieved a large diversity of algae with the UPA marker, but these lineages did not receive the same nomenclature as the tufA lineages because the correspondence between the tufA and the UPA algal clades was unknown. To match tufA and UPA clades we used chloroplast genome data. The complete chloroplast genomes of two endolithic algal strains (Ostreobium HV05042 and SAG699) were sequenced [3], [4] and added to the UPA reference dataset. Phylogenetically, these strains are in Ostreobium Clade 3 and Clade 4, respectively. Since the UPA dataset has no reference sequences for Ostreobium Clade 1 and Clade 2, OTUs belonging to these clades may be classified as Clade 3 or Clade 4, or may be classified only at higher taxonomic levels.

The reference datasets were equalized so that they do not contain identical sequences or a disproportionate number of closely related species, which benefits downstream taxonomic assignment (see [7]). To equalize the datasets, we built a UPGMA tree of the sequences with a JC69 model. We sliced this tree at 0.001 branch length units from the tips, which yielded several clades containing closely related sequences. From each of these clades we kept one reference sequence, chosen on the basis of quality (i.e. length and number of undefined bases). For the tufA OTUs obtained in Marcelino and Verbruggen [1] we used a threshold of 0.1 branch length units (1–3 OTUs per family) to avoid adding a disproportionately large number of endolithic algal lineages to the reference dataset.

The reference datasets were converted to a QIIME-friendly format with the gb_2_RDP.py script (Supplementary File 5), which uses the metadata contained in GenBank files to produce the taxonomic metadata required by RDP. The gb_2_RDP.py script is also available at: https://github.com/vrmarcelino/Make_Ref_Dataset/blob/master/gb_2_RDP.py
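The equalization step can be sketched in a few lines of pure Python. The article does not state which software was used, so this is only an illustration under stated assumptions: the sequences are already aligned to equal length, slicing a UPGMA tree at a given height above the tips is treated as average-linkage clustering that stops once the next merge would exceed that height (a UPGMA node sits at half the average distance between the groups it joins), and quality is scored as the number of defined bases followed by the number of undefined ones. All function names are hypothetical, and the naive clustering loop is only suitable for small inputs.

```python
import math
from itertools import combinations


def jc69_distance(seq_a, seq_b):
    """JC69 distance between two aligned sequences, skipping non-ACGT columns."""
    columns = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
               if x in "ACGT" and y in "ACGT"]
    if not columns:
        return float("inf")
    p = sum(x != y for x, y in columns) / len(columns)
    if p >= 0.75:
        return float("inf")  # JC69 correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def quality(seq):
    """Higher is better: more defined bases first, then fewer undefined ones."""
    upper = seq.upper()
    defined = sum(c in "ACGT" for c in upper)
    undefined = sum(c not in "ACGT-" for c in upper)
    return (defined, -undefined)


def equalize(seqs, tip_height=0.001):
    """Return ids of one representative per cluster of near-identical sequences.

    seqs: dict mapping sequence id to aligned sequence.
    tip_height: height above the tips at which the UPGMA tree is sliced.
    """
    dist = {frozenset(pair): jc69_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    clusters = [[name] for name in seqs]

    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        # UPGMA places the joining node at half the average distance.
        if average(clusters[i], clusters[j]) / 2.0 > tip_height:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(max(c, key=lambda name: quality(seqs[name])) for c in clusters)


# Toy alignment: "b" equals "a" except for one undefined base; "c" is divergent.
toy = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGTACGN",
    "c": "TTTTACGTACGTACGTACGT",
}
print(equalize(toy))
print(equalize(toy, tip_height=0.1))
```

In the toy alignment, the near-identical pair collapses to its higher-quality member at the 0.001 threshold, while the looser 0.1 threshold collapses all three sequences into a single representative.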

Specifications Table

Subject area	Biology
More specific subject area	Metabarcoding
Type of data	Text files (DNA sequence data, metadata and Python script)
How data was acquired	GenBank data compilation, amplicon sequencing and chloroplast genome sequencing
Data format	Filtered
Experimental factors	Endolithic algal lineages were identified with metabarcoding and chloroplast genome sequencing
Experimental features	Genes were extracted from GenBank data, closely related organisms were filtered out and the files were converted to a ready-to-use format.
Data source location	Melbourne, Australia
Data accessibility	The data are available with this article

References

1. Multi-marker metabarcoding of coral skeletons reveals a rich microbiome and diverse evolutionary origins of endolithic algae.

Authors: Vanessa Rossetto Marcelino; Heroen Verbruggen
Journal: Sci Rep; Date: 2016-08-22

2. QIIME allows analysis of high-throughput community sequencing data.

Authors: J Gregory Caporaso; Justin Kuczynski; Jesse Stombaugh; Kyle Bittinger; Frederic D Bushman; Elizabeth K Costello; Noah Fierer; Antonio Gonzalez Peña; Julia K Goodrich; Jeffrey I Gordon; Gavin A Huttley; Scott T Kelley; Dan Knights; Jeremy E Koenig; Ruth E Ley; Catherine A Lozupone; Daniel McDonald; Brian D Muegge; Meg Pirrung; Jens Reeder; Joel R Sevinsky; Peter J Turnbaugh; William A Walters; Jeremy Widmann; Tanya Yatsunenko; Jesse Zaneveld; Rob Knight
Journal: Nat Methods; Date: 2010-04-11

3. Evolutionary Dynamics of Chloroplast Genomes in Low Light: A Case Study of the Endolithic Green Alga Ostreobium quekettii.

Authors: Vanessa R Marcelino; Ma Chiela M Cremen; Christopher J Jackson; Anthony A W Larkum; Heroen Verbruggen
Journal: Genome Biol Evol; Date: 2016-10-05

4. Phylogenetic position of the coral symbiont Ostreobium (Ulvophyceae) inferred from chloroplast genome data.

Authors: Heroen Verbruggen; Vanessa R Marcelino; Michael D Guiry; Ma Chiela M Cremen; Christopher J Jackson
Journal: J Phycol; Date: 2017-05-12

5. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy.

Authors: Qiong Wang; George M Garrity; James M Tiedje; James R Cole
Journal: Appl Environ Microbiol; Date: 2007-06-22

6. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics; Date: 2009-03-20

7. The effect of training set on the classification of honey bee gut microbiota using the Naïve Bayesian Classifier.

Authors: Irene L G Newton; Guus Roeselers
Journal: BMC Microbiol; Date: 2012-09-26
