Javier Del Campo, Martin Kolisko, Vittorio Boscaro, Luciana F Santoferrara, Serafim Nenarokov, Ramon Massana, Laure Guillou, Alastair Simpson, Cedric Berney, Colomban de Vargas, Matthew W Brown, Patrick J Keeling, Laura Wegener Parfrey.
Abstract
Environmental sequencing has greatly expanded our knowledge of micro-eukaryotic diversity and ecology by revealing previously unknown lineages and their distribution. However, the value of these data is critically dependent on the quality of the reference databases used to assign an identity to environmental sequences. Existing databases contain errors and struggle to keep pace with rapidly changing eukaryotic taxonomy, the influx of novel diversity, and computational challenges related to assembling the high-quality alignments and trees needed for accurate characterization of lineage diversity. EukRef (eukref.org) is an ongoing community-driven initiative that addresses these challenges by bringing together taxonomists with expertise spanning the eukaryotic tree of life and microbial ecologists, who use environmental sequence data to develop reliable reference databases across the diversity of microbial eukaryotes. EukRef organizes and facilitates rigorous mining and annotation of sequence data by providing protocols, guidelines, and tools. The EukRef pipeline and tools allow users interested in a particular group of microbial eukaryotes to retrieve all sequences belonging to that group from the International Nucleotide Sequence Database Collaboration (INSDC) databases (GenBank, the European Nucleotide Archive [ENA], or the DNA DataBank of Japan [DDBJ]), to place those sequences in a phylogenetic tree, and to curate taxonomic and environmental information for the group. We provide guidelines to facilitate the process and to standardize taxonomic annotations. The final outputs of this process are (1) a reference tree and alignment, (2) a reference sequence database, including taxonomic and environmental information, and (3) a list of putative chimeras and other artifactual sequences. These products will be useful for the broad community as they become publicly available (at eukref.org) and are shared with existing reference databases.
Year: 2018 PMID: 30222734 PMCID: PMC6160240 DOI: 10.1371/journal.pbio.2005849
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Fig 1. Comparison of existing databases.
(A) Bar plots of taxonomic annotations of BioMarKs environmental sequences using the most popular reference databases for annotating 18S rRNA gene datasets for protist metabarcoding analyses (INSDC GenBank release 215, SILVA version 123.1, and PR2 version 4.2) at the first (level just below Eukarya) and second taxonomic ranks. White spaces in the boxes mark the changes between second-level ranks. Taxon names for the first rank in each database are listed below the bar plot. On top of each bar plot, within brackets, we show the number of taxa per rank. The taxon names for the first rank and bar plots are colored based on the eukaryotic supergroups defined by Burki 2014 [16]. (B) Distribution of the number of ranks assigned to terminal taxa (unique taxonomic strings) in the three databases. (C) Taxonomic agreement on the annotation of the 20 most abundant OTUs within BioMarKs using each database, listed on the x-axis as GenBank, SILVA, and PR2. Full taxonomic annotation available in S1 Table. INSDC, International Nucleotide Sequence Database Collaboration; OTU, operational taxonomic unit; PR2, Protist Ribosomal Reference Database; SAR, Stramenopiles, Alveolates, and Rhizaria; 18S rRNA, small subunit ribosomal RNA.
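The quantity plotted in panel B can be illustrated with a minimal sketch: deduplicate the taxonomic strings, then count how many ranks each one contains. The function name, the semicolon separator, and the example strings are assumptions made for illustration; they are not taken from the EukRef tools or from any of the three databases, and the count here includes the root rank.

```python
from collections import Counter


def rank_depth_distribution(taxonomy_strings, sep=";"):
    """Count how many unique taxonomic strings have each number of ranks.

    Hypothetical helper: assumes ranks are separated by `sep` and that a
    trailing separator carries no information.
    """
    # Normalize whitespace and trailing separators, then deduplicate.
    unique = {s.strip().strip(sep) for s in taxonomy_strings if s.strip()}
    depths = Counter()
    for taxonomy in unique:
        ranks = [r.strip() for r in taxonomy.split(sep) if r.strip()]
        depths[len(ranks)] += 1
    return dict(sorted(depths.items()))


# Illustrative strings only, not real database records.
example = [
    "Eukaryota;SAR;Alveolata;Ciliophora",
    "Eukaryota;SAR;Alveolata;Ciliophora;",  # same path, trailing separator
    "Eukaryota;Opisthokonta;Fungi",
    "Eukaryota;SAR;Stramenopiles",
]
distribution = rank_depth_distribution(example)
print(distribution)
```

Running the same count over each database's full set of taxonomic strings would give one distribution per database, which is the comparison the panel makes.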
Fig 2. Simplified scheme of the EukRef workflow.
Outputs are highlighted in red. HTES, high-throughput environmental sequencing.
Fig 3. Case study of Heterotrichea, Ciliophora.
(A) Phylogenetic tree of sequences used as input into the EukRef pipeline. (B) Phylogenetic tree following the EukRef pipeline. Branches and leaves in red correspond to those present in the input dataset (in A); branches and leaves in black are those acquired by the EukRef pipeline and exclude artifacts and sequences discarded during curation that fell outside the group of interest. The output tree was used as a guide to perform the taxonomic annotation. (C) Output of EukRef curation. Representative sequences: output phylogenetic tree depicting representative sequences clustered at a 97% similarity threshold. EukRef annotation: taxonomic annotation following curation, which is propagated to all sequences in the 97% cluster (N, number of sequences in each cluster). Column Ann: proportion of sequences for which annotation was unchanged (black), improved (pink), or corrected (red). Metadata are added to the curated database for each sequence based on the GenBank record and/or information in publications associated with the sequences. Column Env: proportion of sequences in the cluster found in marine, freshwater, brackish, or unknown environments. Column Scr: proportion of sequences derived from environmental sequencing or known isolates (either cultures or morphologically identified cells). Fully curated reference database of Heterotrichea available in S3 Data. Ann, annotation; Env, environment; Scr, source.
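The propagation step and the Ann column described in panel C can be sketched as follows. This is a minimal illustration, not the EukRef implementation: the function names, the sequence identifiers, and in particular the rule used to separate "improved" from "corrected" (a curated annotation that extends the original counts as improved, one that contradicts it counts as corrected) are assumptions, since the caption does not define those categories. Clusters are assumed to be non-empty.

```python
from collections import Counter


def classify_change(original, curated):
    """Hypothetical rule comparing two annotations given as lists of ranks."""
    if original == curated:
        return "unchanged"
    if curated[:len(original)] == original:
        return "improved"   # curated annotation extends the original
    return "corrected"      # curated annotation contradicts the original


def propagate(clusters, curated_by_rep):
    """Assign each representative's curated annotation to its whole cluster.

    clusters: {representative_id: {member_id: original_annotation}}
    curated_by_rep: {representative_id: curated_annotation}
    """
    summary = {}
    for rep, members in clusters.items():
        curated = curated_by_rep[rep]
        changes = Counter(
            classify_change(original, curated) for original in members.values()
        )
        total = sum(changes.values())
        summary[rep] = {
            "annotation": curated,
            "n": total,
            "ann": {
                key: changes[key] / total
                for key in ("unchanged", "improved", "corrected")
            },
        }
    return summary


# Illustrative identifiers and annotations only, not real records.
clusters = {
    "repA": {
        "seq1": ["Ciliophora", "Heterotrichea", "Stentor"],
        "seq2": ["Ciliophora", "Heterotrichea"],
        "seq3": ["Ciliophora", "Spirotrichea"],
        "seq4": ["Ciliophora", "Heterotrichea", "Stentor"],
    }
}
curated_by_rep = {"repA": ["Ciliophora", "Heterotrichea", "Stentor"]}
summary = propagate(clusters, curated_by_rep)
print(summary["repA"]["n"], summary["repA"]["ann"])
```

The Env and Scr columns could be summarized the same way, by counting a per-sequence environment or source label within each cluster instead of the annotation change.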