ITSoneDB: a comprehensive collection of eukaryotic ribosomal RNA Internal Transcribed Spacer 1 (ITS1) sequences.

Monica Santamaria¹, Bruno Fosso¹, Flavio Licciulli², Bachir Balech¹, Ilaria Larini³, Giorgio Grillo², Giorgio De Caro², Sabino Liuni², Graziano Pesole¹,³.

Abstract

A holistic understanding of environmental communities is the new challenge of metagenomics. Accordingly, the amplicon-based or metabarcoding approach, largely applied to investigate bacterial microbiomes, is moving to the eukaryotic world too. Indeed, the analysis of metabarcoding data may provide a comprehensive assessment of both bacterial and eukaryotic composition in a variety of environments, including the human body. In this respect, whereas hypervariable regions of the 16S rRNA are the de facto standard barcode for bacteria, the Internal Transcribed Spacer 1 (ITS1) of the ribosomal RNA gene cluster has shown a high potential in discriminating eukaryotes at deep taxonomic levels. As metabarcoding data analysis relies on the availability of a well-curated barcode reference resource, a comprehensive collection of ITS1 sequences supplied with robust taxonomies is highly needed. To address this issue, we created ITSoneDB (available at http://itsonedb.cloud.ba.infn.it/), which in its current version hosts 985 240 ITS1 sequences spanning over 134 000 eukaryotic species. Each ITS1 is mapped on the NCBI reference taxonomy with its start and end positions precisely annotated. ITSoneDB has been developed in accordance with the FAIR guidelines by enabling users to query and download its content through a simple web interface and to access relevant metadata by cross-linking to the European Nucleotide Archive.

Year: 2018 PMID: 29036529 PMCID: PMC5753230 DOI: 10.1093/nar/gkx855

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The increasing amount of DNA sequence information generated by worldwide metagenomics initiatives is enabling the identification of a growing number of taxa and genes in any natural and anthropic environment. Most microbiome studies so far have focused on assessing the taxonomic composition of prokaryotic communities based on one or more hypervariable regions of the 16S rRNA. Nowadays, a more holistic investigation that also includes viral and eukaryotic components is becoming the new challenge (1). In this framework, the amplicon-based metagenomic approach, or metabarcoding, in which selected short, variable and standardized DNA regions, named DNA barcodes (2,3), are simultaneously amplified and sequenced from an ensemble of organisms sharing the same habitat, is rapidly gaining popularity for unravelling eukaryotic diversity as well. In addition to Fungi, which are particularly intriguing due to their ubiquity, high diversity and often cryptic manifestations, scientists are looking with growing interest at other Eukaryotes (4–7). Indeed, despite their wide distribution, their diversity and their important role in biogeochemical cycles (8–11), eukaryotic microbial species appear to be consistently underestimated. This has recently been confirmed in the marine environment by the Tara Oceans Initiative (12). Finally, metabarcoding has also been used to monitor macro-fauna biodiversity in aquatic environments (13). The taxonomic classification of high-throughput sequence reads generated by metabarcoding experiments is usually based on first mapping each read to the relevant barcode reference collection and then inferring its most likely taxonomic assignment at species or higher taxonomic rank (14). In this respect, reference databases aimed at supporting the characterization of eukaryotic communities are still far from reliable and exhaustive, possibly causing biased taxonomic inferences.
To address this drawback, a remarkable worldwide effort is ongoing to collect and harmonize the vast amount of sequence data already available for several barcodes, including the Internal Transcribed Spacer (ITS) of the rRNA gene cluster, one of the most promising for the assessment of eukaryotic biodiversity. ITS has already been proposed as the standard DNA barcode for fungi and plants (15–18) and is widely used for discriminating taxa in other biological groups, including algae, protists and animals (4,19,20). In particular, many recent lines of evidence have highlighted the great potential of the ITS1 sub-region in discriminating eukaryotes at deeper taxonomic levels, particularly in Fungi. A comparative analysis of ITS1 and ITS2 in 10 major groups of eukaryotes, in terms of PCR primer universality, length of amplification product and GC content affecting DNA sequencing outcome, and species discrimination power, robustly supports the hypothesis that ITS1 is a better DNA barcode than ITS2 for eukaryotic species (4). However, comprehensive and quality-controlled resources of ITS1 sequences, supplied with unbiased and unambiguous taxonomies and interfaced with state-of-the-art pipelines for metagenomics analysis, are still lacking. Such resources should be taxonomically comprehensive and unbiased, well controlled, and easy to access and retrieve. Their absence is a critical obstacle to taking full advantage of the most reliable bioinformatic tools, such as QIIME (21) and its recent upgrade QIIME 2 (https://qiime2.org/), Mothur (22), MICCA (23) or BioMaS (Bioinformatic analysis of metagenomics ampliconS) (24), which rely on comparing the meta-barcode sequences against the relevant reference databases to infer their taxonomic origin.
To fill this gap we have developed ITSoneDB (http://itsonedb.cloud.ba.infn.it/), which covers the whole eukaryotic domain and establishes the first and only curated reference database aimed at ITS1-based metagenomic surveys. Indeed, similar well-annotated and updated databases, such as UNITE (User-friendly Nordic ITS Ectomycorrhiza Database, http://unite.ut.ee) and ITS2 Database (http://its2.bioapps.biozentrum.uni-wuerzburg.de/), concern the entire ITS sequence and its ITS2 sub-region, respectively. Currently, ITSoneDB collects about one million ITS1 sequences spanning over 134 000 species (according to NCBI taxonomy). The annotation of ITS1 region boundaries has been refined by coupling the information available in the original European Nucleotide Archive (ENA) entries with that inferred by mapping Hidden Markov Models (HMM) corresponding to the conserved ITS1 flanking genes (see ‘Materials and Methods’ section). We have undertaken a number of actions to ensure that our data follow the FAIR principles (25). In particular, Findability and Accessibility are already granted by user-friendly query and cross-link systems that allow users to retrieve and download the sequences stored in the database and to get their associated metadata, respectively. We are working to extend the Interoperability and Re-usability features by integrating ITSoneDB in a cloud-based Galaxy workbench in which users may run established metagenomics analysis pipelines, thus providing complete and reusable workflows for taxonomic annotation of eukaryotic microbiomes. ITSoneDB can be easily interfaced with state-of-the-art metabarcoding analysis tools such as QIIME (21) or BioMaS (24), as well as other popular metagenomic analysis pipelines. Moreover, we plan to integrate our database in the EBI metagenomics portal in order to increase its accessibility and use.

DATABASE CONTENT

Currently, ITSoneDB collects 985 240 ITS1 sequences corresponding to 134 598 species (according to NCBI taxonomy). In 276 362 sequences (28%) the ITS1 region position is inferred only by HMM profile mapping, while in 543 266 (55.1%) it is obtained only from the ENA entry feature table. The location of the ITS1 region in the remaining 165 612 sequences (16.8%) is inferred by both approaches. Table 1 reports the number of sequences and species in Eukaryota as a whole and in the major taxa represented in ITSoneDB; Fungi are the most represented taxonomic group, covering almost 70% of the database content. Supplementary Figure S1 shows a more detailed taxonomic spread of ITSoneDB sequences across Eukaryotes. The length of ITS1 sequences collected in ITSoneDB mainly ranges between 50 and 1000 bp, with 91.7% of the sequences between 100 and 300 bp long.

Table 1.

ITSoneDB content statistics

Taxon	Taxid	Total sequences	ENA annotation only	HMM annotation only	ENA and HMM	Species
Eukaryota	2759	985 240	543 266	276 362	165 612	134 598
Fungi	4751	684 540	378 049	221 723	84 768	53 552
Metazoa	33208	54 782	32 186	9084	13 512	9438
Viridiplantae	33090	203 437	113 572	32 503	57 362	66 595
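
The shares quoted in the text can be re-derived from the Eukaryota and Fungi rows of Table 1. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative only and are not part of ITSoneDB.

```python
# Sequence counts for Eukaryota, copied from Table 1.
total = 985_240
counts = {
    "ENA annotation only": 543_266,
    "HMM annotation only": 276_362,
    "ENA and HMM": 165_612,
}

# The three annotation categories should partition the whole collection.
assert sum(counts.values()) == total


def share(n, whole):
    """Return n as a percentage of whole."""
    return 100.0 * n / whole


shares = {label: share(n, total) for label, n in counts.items()}
fungi_share = share(684_540, total)  # Fungi total sequences from Table 1

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"Fungi: {fungi_share:.1f}%")
```

The same check can be repeated for the Fungi, Metazoa and Viridiplantae rows, where the three annotation columns likewise sum to the total number of sequences.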

Each ITSoneDB entry is composed of three main sections: the first consists of an overall entry description in which general information about the entire sequence, such as coverage, function and length, and taxonomic classification are reported. The second one, named ‘ITS1 sequence’, reports information about the ITS1 region annotation inferred from ENA feature tables and/or HMM profiles mapping. The last section, indicated as ‘18S rRNA HMM profile — target sequence alignment’ and ‘5.8S rRNA HMM profile — target sequence alignment’ (see Figure 1), only available if ITS1 boundaries have been refined by HMM mapping, shows the alignments of HMM profiles and the corresponding regions in the sequence.

Figure 1.

Snapshot of an ITSoneDB entry. (A) reports the entry information directly extracted from ENA: (i) accession number (with a hyperlink to the corresponding ENA item), (ii) version, (iii) description, (iv) sequence length (the whole sequence length, not only the ITS1), (v) taxon name (scientific name of the organism the sequence belongs to, with a hyperlink to the corresponding NCBI taxonomy entry), (vi) taxon rank (taxonomic class) and lineage (full taxonomic path associated with the taxon name). (B) reports the ITS1 position annotation from ENA and HMM. (C) and (D) show the alignments of the sequence with the 18S and 5.8S rRNA HMM profiles, respectively.

DATABASE FEATURES

ITSoneDB is publicly and freely accessible through a web browser at a permanent URL. Single or multiple entries can be selected for web visualization and/or retrieval through different query options (located at the top left of the home page). The ‘simple search’ box allows querying the database by species name(s), taxon name(s) or ENA accession number(s). An auto-completion feature makes it easy to choose the desired query terms. The ‘tree search’ option allows simple navigation across the taxonomic tree (NCBI taxonomy) and the selection of the taxa of interest by checking the adjacent box; the total numbers of ITS1 sequences with ENA and HMM localization are displayed next to the taxon name. Alternatively, the ‘advanced search’ option allows constructing a refined query by applying boolean operators to previously performed queries (shown in the ‘executed query’ box). For instance, prior queries for the species Aspergillus aculeatus (query#1) and Zygowilliopsis californica (query#2), performed separately, can be combined through a composing panel to obtain all entries of the two species as follows: query#1 OR query#2. The advanced search may also be refined by defining parameters regarding sequence length, ITS1 annotation method (ENA, HMM or both) and the E-values and/or posterior probability values supporting the HMM matches. Each entry in the query output can be visualized as a web page showing the accession number (cross-linked to ENA), a brief description, the full lineage description from the NCBI taxonomy (hyper-linked to the NCBI taxonomy database), the sequence length and taxon rank. The entries of interest can be exported to a FASTA-formatted DNA sequence file by choosing one or both annotation options (ENA annotations and/or HMM).
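
Conceptually, combining earlier queries with boolean operators amounts to set operations on their result lists. The sketch below illustrates the idea only and does not describe how the web application implements it. The accession identifiers are hypothetical placeholders, and the AND and NOT operators are assumptions, since the text shows only an OR example.

```python
# Hypothetical result sets: accession identifiers returned by two earlier queries.
query_1 = {"ACC_A1", "ACC_A2", "ACC_A3"}  # e.g. entries for Aspergillus aculeatus
query_2 = {"ACC_Z1", "ACC_Z2"}            # e.g. entries for Zygowilliopsis californica


def combine(left, op, right):
    """Combine two result sets with a boolean operator (OR, AND or NOT)."""
    operations = {
        "OR": set.union,          # entries found by either query
        "AND": set.intersection,  # entries found by both queries
        "NOT": set.difference,    # entries in the left query but not in the right
    }
    return operations[op.upper()](left, right)


both_species = combine(query_1, "OR", query_2)
print(sorted(both_species))
```
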
Moreover, ITSoneDB offers an additional export feature limited to a representative sequence per species that returns the centroid of a population of sequences belonging to the same species (see ‘Materials and Methods’ section for additional details). Another important feature of ITSoneDB, especially for local analysis, is the possibility to export the entire database or the species representative sequences (options available at top right of the home page).
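
As an illustration of what "the centroid of a population of sequences" means in practice, the following minimal sketch picks the centroid of the largest cluster from clustering records. It assumes the records are in the tab-separated UC format that VSEARCH can write; the sequence labels are hypothetical, and breaking ties by the lowest cluster number is an assumption of this sketch, not something stated in the text.

```python
from collections import defaultdict


def representative_from_uc(uc_lines):
    """Return the centroid label of the largest cluster in UC-format records."""
    centroid = {}            # cluster number -> centroid label
    size = defaultdict(int)  # cluster number -> number of member sequences
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip malformed or empty lines
        record_type, cluster, label = fields[0], int(fields[1]), fields[8]
        if record_type == "S":    # seed record: this sequence is a centroid
            centroid[cluster] = label
            size[cluster] += 1
        elif record_type == "H":  # hit record: this sequence joined a cluster
            size[cluster] += 1
        # "C" summary records are ignored; sizes are counted from S and H.
    if not centroid:
        return None
    # Largest cluster wins; ties go to the lowest cluster number (assumption).
    best = max(centroid, key=lambda c: (size[c], -c))
    return centroid[best]


# Hypothetical records for four sequences of one species.
example = [
    "S\t0\t210\t*\t*\t*\t*\t*\tseqA\t*",
    "H\t0\t208\t98.6\t+\t0\t0\t208M\tseqB\tseqA",
    "S\t1\t190\t*\t*\t*\t*\t*\tseqC\t*",
    "H\t0\t211\t97.2\t+\t0\t0\t211M\tseqD\tseqA",
]
print(representative_from_uc(example))
```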

MATERIALS AND METHODS

In order to generate, maintain and update ITSoneDB, we designed a multi-step Python and BASH workflow (Supplementary Figure S2). In the first step, the nucleotide entries from the ENA (European Nucleotide Archive, http://www.ebi.ac.uk/ena) database (26) are downloaded locally. The current version, ITSoneDB 1.131, which extends and upgrades a previous version limited to Fungi (27), has been populated from ENA release 131 (02/27/2017), which comprises 803 147 518 entries. In order to reduce both the computational requirements and processing time, only the Plant (PLN), Environmental (ENV), Human (HUM), Fungal (FUN), Other Mammal (MAM), Invertebrates (INV), Other Vertebrates (VRT), Mus musculus (MUS), Other Rodents (ROD) and Unclassified (UNC) divisions belonging to the Genome Sequencing Scan (GSS), High-Throughput cDNA Sequencing (HTC), High-Throughput Genome Sequencing (HTG), Standard (STD), Patent (PAT), EST (expressed sequence tag) and Transcriptome Shotgun Assembly (TSA) data classes have been downloaded locally. Afterward, only the eukaryotic entries have been selected, reducing the data count to 83 008 723 items. In the second step, the accession number, the description and the available annotation under specific feature keys (i.e. rRNA, misc_rRNA, misc_feature and source) have been extracted from each entry and stored in a TSV (tab-separated values) file, while the associated nucleotide sequence has been saved in a FASTA file. This procedure also allowed us to associate taxonomic information (i.e. NCBI taxonomy identifier and taxonomic path) with each ENA accession number. Subsequently, a comprehensive, ad hoc developed and manually curated dictionary of 110 common ITS1 synonyms (see Supplementary Table S1) has been used to filter the data stored in the TSV files and select the entries where the ITS1 start and end positions were specifically annotated.
At the same time, the ITS1 boundaries have been validated or defined de novo by using a similarity-based approach. ITS1 is flanked by two highly conserved genes encoding the 18S and 5.8S ribosomal RNAs, whose conservation profiles can be suitably modeled by HMMs. The HMMs for 18S (RF01960) and 5.8S (RF00002) rRNAs have been generated by using the reference multiple alignments (Stockholm format) available in the RFAM database (28,29). The 18S and 5.8S rRNA HMMs have then been mapped against the previously extracted ENA sequences by using hmmsearch (HMMER 3.1 package) (30). Statistically significant HMM matches, with an E-value ≤ 0.001 as threshold, have been considered for ITS1 boundary definition. In order to retain matches where the terminal portion of the 18S HMM profile and/or the initial portion of the 5.8S HMM profile aligned to the initial or terminal part of the query sequence, respectively, we also considered matches with an E-value > 0.001 but with an average posterior probability (a measure of alignment accuracy) >0.85 (30). The information regarding ITS1 boundaries extracted from the entry annotation and/or defined by inferring the 18S and 5.8S locations was then merged to generate the tables used to populate the database. Finally, for each species represented in the database a representative entry was selected: all the sequences belonging to the same species were extracted and clustered with VSEARCH (31) at a 97% identity threshold. The representative sequence corresponded to the centroid of the largest cluster.
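
The synonym-based filtering step can be pictured with the following minimal sketch, which is not the actual workflow code. The column names, the accession identifiers and the four synonyms are hypothetical stand-ins; the real dictionary of 110 synonyms is given in Supplementary Table S1 and is not reproduced here. The feature keys are those listed in the text.

```python
import csv
import io
import re

# Hypothetical stand-ins for the curated synonym dictionary.
ITS1_SYNONYMS = ["internal transcribed spacer 1", "ITS1", "ITS 1", "ITS-1"]
# Feature keys as listed in the text.
FEATURE_KEYS = {"rRNA", "misc_rRNA", "misc_feature", "source"}

# Case-insensitive match of any synonym, not embedded in a longer token
# (so that "ITS10" or "ITS2" are not mistaken for ITS1).
_alternatives = "|".join(
    re.escape(s) for s in sorted(ITS1_SYNONYMS, key=len, reverse=True)
)
ITS1_PATTERN = re.compile(
    rf"(?<![A-Za-z0-9])(?:{_alternatives})(?![A-Za-z0-9])", re.IGNORECASE
)


def select_its1_rows(tsv_text):
    """Return (accession, start, end) for rows that annotate ITS1 coordinates."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        if row["feature_key"] not in FEATURE_KEYS:
            continue
        if not ITS1_PATTERN.search(row["note"]):
            continue
        if not (row["start"].isdigit() and row["end"].isdigit()):
            continue  # both boundaries must be explicitly annotated
        start, end = int(row["start"]), int(row["end"])
        if start <= end:
            selected.append((row["accession"], start, end))
    return selected


# Hypothetical TSV content with made-up accessions.
rows = [
    ["accession", "feature_key", "start", "end", "note"],
    ["ACC0001", "misc_feature", "120", "340", "internal transcribed spacer 1"],
    ["ACC0002", "rRNA", "1", "150", "5.8S ribosomal RNA"],
    ["ACC0003", "misc_feature", "10", "200", "ITS2"],
    ["ACC0004", "CDS", "5", "90", "ITS1"],
    ["ACC0005", "misc_feature", "", "", "ITS1, partial"],
]
example_tsv = "\n".join("\t".join(r) for r in rows)
print(select_its1_rows(example_tsv))
```
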
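
The acceptance rule for HMM matches and the way the two flanking genes delimit ITS1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code. The `HmmHit` record and its field names are hypothetical. The relaxed posterior-probability criterion is applied here to every match, whereas the text restricts it to matches at the sequence ends. Placing the ITS1 start one base after the 18S match and the ITS1 end one base before the 5.8S match is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

E_VALUE_MAX = 0.001   # significance threshold stated in the text
POSTERIOR_MIN = 0.85  # average posterior probability cut-off stated in the text


@dataclass
class HmmHit:
    """Hypothetical summary of one HMM match on a sequence (1-based coordinates)."""
    model: str            # "18S" or "5.8S"
    seq_from: int         # first matched position on the sequence
    seq_to: int           # last matched position on the sequence
    e_value: float
    mean_posterior: float


def accept(hit: HmmHit) -> bool:
    """Keep significant matches, or weaker ones with a high average posterior."""
    return hit.e_value <= E_VALUE_MAX or hit.mean_posterior > POSTERIOR_MIN


def its1_bounds(
    hit_18s: Optional[HmmHit], hit_58s: Optional[HmmHit]
) -> Tuple[Optional[int], Optional[int]]:
    """Infer ITS1 start and end from the flanking matches (None if unavailable)."""
    start = hit_18s.seq_to + 1 if hit_18s and accept(hit_18s) else None
    end = hit_58s.seq_from - 1 if hit_58s and accept(hit_58s) else None
    if start is not None and end is not None and start > end:
        return None, None  # inconsistent flanks: no usable ITS1 interval
    return start, end


# Hypothetical example: a partial 18S match at the start of the sequence,
# accepted through its posterior probability, and a clear 5.8S match.
partial_18s = HmmHit("18S", 1, 45, e_value=2e-2, mean_posterior=0.93)
clear_58s = HmmHit("5.8S", 260, 415, e_value=1e-30, mean_posterior=0.97)
print(its1_bounds(partial_18s, clear_58s))
```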

DATABASE ARCHITECTURE AND WEB INTERFACE

The ITSoneDB implementation is based on a three-tier architecture: client, server and database (Supplementary Figure S3). In the database layer, data and metadata are stored in a MySQL (version 5.5) relational DBMS (Database Management System) using InnoDB as the storage engine in order to implement persistent queries. The Graphical User Interface (GUI) is developed as a Java web application on the Java Platform, Enterprise Edition (Java EE). It uses the jQuery/jQuery UI JavaScript framework on the client layer, and Java servlets and JavaServer Pages (JSP) on the server layer. The web application is deployed in a Tomcat web server (https://tomcat.apache.org). To implement the communication between the data layer and the web application, the Hibernate ORM (Object Relational Mapping, http://hibernate.org/orm/) has been adopted. It provides a framework for mapping an object-oriented domain model to a relational database, enabling us to handle the data layer as objects in the web pages.

FUTURE DEVELOPMENTS

The first release of ITSoneDB, together with all the tools developed to build and populate it, has been designed to be further improved and expanded. Many of the future improvements will be carried out in the framework of the activities of the ELIXIR research infrastructure for biological data (https://www.elixir-europe.org) in collaboration with the EMBL-EBI and Norway nodes. First, we plan to constantly update the data by retrieving and curating the new ITS1 sequences available in primary databases and other resources, including those hosting shotgun and amplicon-based metagenomics datasets. During these updates, we will take all necessary actions to maintain the long-term usability and value of our data in line with the FAIR data principles. We also plan to connect ITSoneDB to the UniEuk Initiative (32) (http://unieuk.org) in order to map its content on a more curated and harmonized taxonomy. In order to further support users involved in metagenomics experiments, we plan to implement services for (i) calculating the ‘barcoding gap’ in a custom-defined taxonomic range; (ii) designing ‘universal’ PCR primers effective in that range; and (iii) performing multi-query sequence similarity searches suitable for large metagenomic datasets. A new section of the database will also be created, focused on organisms living in the marine environment, in order to address its largely unknown complexity. This section will be linked to other marine reference databases such as MarRef, MarDB and MarCat (available at https://mmp.sfb.uit.no) developed within the ELIXIR project framework. By integrating ITSoneDB in a cloud-based analysis workspace, we will be able to efficiently parallelize and manage the pairwise and multiple alignments required by the newly planned functions (barcoding gap computation and primer design, respectively).
Furthermore, we will release and update ad hoc pre-formatted versions of ITSoneDB sequences and taxonomy in order to allow the use of popular metabarcoding analysis tools such as BioMaS (24), QIIME (21), LCA-classifier (used by META-pipe, https://arxiv.org/abs/1604.04103), MAPseq (used by EBI metagenomics, EMG (33)) (http://www.biorxiv.org/content/early/2017/04/12/126953), MOTHUR (22) or UCHIME (34). Sequences, taxonomy and other metadata will also be made available in human- and machine-readable tabular formats. We also plan to create a Galaxy workbench in order to allow users to select a specific analysis pipeline for metabarcoding data analysis using ITSoneDB as a reference collection, fostering data sharing, transparency and reproducibility. Finally, we will work in collaboration with EBI in the ELIXIR framework to provide ITSoneDB as a reference database in a specific workflow for ITS1 within the EBI metagenomics portal. This will further increase its use, exposure and interoperability.

DATA AVAILABILITY

ITSoneDB is freely accessible as a web application at http://itsonedb.cloud.ba.infn.it/.

33 in total

1. Biological identifications through DNA barcodes.

Authors: Paul D N Hebert; Alina Cywinska; Shelley L Ball; Jeremy R deWaard
Journal: Proc Biol Sci Date: 2003-02-07 Impact factor: 5.349

2. The promise of DNA barcoding for taxonomy.

Authors: Paul D N Hebert; T Ryan Gregory
Journal: Syst Biol Date: 2005-10 Impact factor: 15.683

Review 3. Protists are microbes too: a perspective.

Authors: David A Caron; Alexandra Z Worden; Peter D Countway; Elif Demir; Karla B Heidelberg
Journal: ISME J Date: 2008-11-13 Impact factor: 10.302

4. ITS1: a DNA barcode better than ITS2 in eukaryotes?

Authors: Xin-Cun Wang; Chang Liu; Liang Huang; Johan Bengtsson-Palme; Haimei Chen; Jian-Hui Zhang; Dayong Cai; Jian-Qin Li
Journal: Mol Ecol Resour Date: 2014-09-24 Impact factor: 7.090

5. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies.

Authors: Johannes Dröge; Alice C McHardy
Journal: Brief Bioinform Date: 2012-07-31 Impact factor: 11.622

6. Rfam 12.0: updates to the RNA families database.

Authors: Eric P Nawrocki; Sarah W Burge; Alex Bateman; Jennifer Daub; Ruth Y Eberhardt; Sean R Eddy; Evan W Floden; Paul P Gardner; Thomas A Jones; John Tate; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 19.160

7. DNA Barcoding Green Microalgae Isolated from Neotropical Inland Waters.

Authors: Sámed I I A Hadi; Hugo Santana; Patrícia P M Brunale; Taísa G Gomes; Márcia D Oliveira; Alexandre Matthiensen; Marcos E C Oliveira; Flávia C P Silva; Bruno S A F Brasil
Journal: PLoS One Date: 2016-02-22 Impact factor: 3.240

8. MICCA: a complete and accurate software for taxonomic profiling of metagenomic data.

Authors: Davide Albanese; Paolo Fontana; Carlotta De Filippo; Duccio Cavalieri; Claudio Donati
Journal: Sci Rep Date: 2015-05-19 Impact factor: 4.379

9. A comprehensive method for amplicon-based and metagenomic characterization of viruses, bacteria, and eukaryotes in freshwater samples.

Authors: Miguel I Uyaguari-Diaz; Michael Chan; Bonnie L Chaban; Matthew A Croxen; Jan F Finke; Janet E Hill; Michael A Peabody; Thea Van Rossum; Curtis A Suttle; Fiona S L Brinkman; Judith Isaac-Renton; Natalie A Prystajecky; Patrick Tang
Journal: Microbiome Date: 2016-07-19 Impact factor: 14.650

10. DNA barcoding in the cycadales: testing the potential of proposed barcoding markers for species identification of cycads.

Authors: Chodon Sass; Damon P Little; Dennis Wm Stevenson; Chelsea D Specht
Journal: PLoS One Date: 2007-11-07 Impact factor: 3.240

10 in total
