Literature DB >> 19920122

The ITS2 Database III--sequences and structures for phylogeny.

Christian Koetschan¹, Frank Förster, Alexander Keller, Tina Schleicher, Benjamin Ruderisch, Roland Schwarz, Tobias Müller, Matthias Wolf, Jörg Schultz.

Abstract

The internal transcribed spacer 2 (ITS2) is a widely used phylogenetic marker. In the past, it has mainly been used for species level classifications. Nowadays, a wider applicability becomes apparent. Here, the conserved structure of the RNA molecule plays a vital role. We have developed the ITS2 Database (http://its2.bioapps.biozentrum.uni-wuerzburg.de) which holds information about sequence, structure and taxonomic classification of all ITS2 in GenBank. In the new version, we use Hidden Markov models (HMMs) for the identification and delineation of the ITS2 resulting in a major redesign of the annotation pipeline. This allowed the identification of more than 160,000 correct full length and more than 50,000 partial structures. In the web interface, these can now be searched with a modified BLAST considering both sequence and structure, enabling rapid taxon sampling. Novel sequences can be annotated using the HMM based approach and modelled according to multiple template structures. Sequences can be searched for known and newly identified motifs. Together, the database and the web server build an exhaustive resource for ITS2 based phylogenetic analyses.

Entities: CellLine Chemical Disease Species

Mesh：

Substances：
DNA, Ribosomal Spacer

Year: 2009 PMID： 19920122 PMCID： PMC2808966 DOI： 10.1093/nar/gkp966

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

The internal transcribed spacer 2 (ITS2) of the nuclear rDNA cistron is a widely used phylogenetic marker. In its early years it was specifically used for low-level phylogenetic analyses, i.e. of species within the same genus. At that time, only nucleotide information of the fast evolving sequence was used. With analyses of the two-dimensional structure of the molecule it became evident that the structure is highly conserved throughout the eukaryotes (1–3). The combination of a fast evolving sequence with a slow evolving structure within one molecule suggested its capability for higher level classifications (4). In the last years, the ITS2 has been revealed to be more than just an excellent phylogenetic marker. Its applications include usage as a marker for species identification in environmental samples (phylochips) (5,6), as a target molecule for barcoding (7,8) and for distinguishing species (9). In many of these cases, the structure plays a fundamental role. Even though sequence databases typically include a large quantity of ITS2 sequences, no coherent information source existed so far including both sequence and structure information, with ITS2 specific annotations. As a consequence of this lack, every scientist had to predict the structure of each molecule in his/her dataset more or less manually. Even worse, in the majority of phylogenetic procedures as e.g. alignment or tree calculation the structure could not be used at all as the corresponding software was not capable of integrating the structure information. In order to tackle these problems and to be better able to exploit the power of this intriguing molecule, we have developed the ITS2 Database. Its goal is to provide a valid structure for every ITS2 sequence within GenBank and thereby to become an exhaustive data source for sequence/structure based phylogenetic analyses, as well as offering tools capable of exploiting the information surplus obtained by these secondary structures. In this article, we describe additions to the ITS2 Database in terms of (i) new developments in automated structure prediction, (ii) new features for the access to the data via the Web interface and (iii) new tools for the analysis of ITS2 sequences.

DATA GENERATION

In the previous version of the database, we used a BLAST (10) based approach for the detection of ITS2 in un-annotated GenBank (11) sequences. We were able to predict the structure of more than 35 000 ITS2 where the start- and end-positions were either lacking or misidentified. As BLAST per se is a local alignment tool (hence the name) and the sequence length is very variable throughout the eukaryotes, heuristics had to be implemented to identify the start and end points of the ITS2. To improve this approach, we have recently developed a Hidden Markov Model (HMM) based method for the correct delineation of the ITS2 (12). Start and end position are inferred from the surrounding 5.8S and 28S regions, that are highly conserved. This method initiated a complete re-design of data generation for the ITS2 Database (Figure 1). In the initial step, we searched through the complete nucleotide database (nt) of GenBank for potential ITS2 sequences using hmmsearch (13). Simultaneously, all annotated ITS2 were extracted from GenBank. In cases where both methods were informative about the position of the ITS2, the HMM based information superseded that from GenBank. This led to 196 697 sequences with positional information of the ITS2 (Database accessed at the 22 June 2009). In the second step, all retained sequences were folded using UNAfold (14). Typical ITS2 features were shown by 63 645 structures, namely the conserved core of four helices with the third as the longest. This was a substantial increase compared to the previous approach where only GenBank annotations were taken into account. This indicated the necessity of a correct delineation for the folding step. In the next step, these structures served as templates in the homology modelling process. In contrast to the previous approach, we iterated the homology modelling process until no further new correct structures were identified. This resulted in an additional 99 010 predicted full-length structures, further underlining the presence of a conserved structural core of the ITS2 throughout all eukaryotes. Remaining sequences which could either not be homology modelled or where start and end position could not be predicted run through a final step resulting in partial structures. A BLAST search against all identified sequence structure pairs was performed. All significant hits (E-value < 10−10) were extended in both directions by five bases. Finally, we applied a less strict homology modelling which required at least two concatenated helices with a transfer larger than 75% each. This resulted in more than 50 000 partial structures. Using the modified pipeline, which would run in a single core 1221 days, we now provide structural information for over 210 000 ITS2, doubling the number of the previous version. As a detailed taxonomic breakdown (Table 1) the best coverage is found in fungi and plants with 80 and 93%, respectively. Only for ∼25% of the metazoan ITS2 sequences, a structure could be predicted. This could indicate a deviation from the ‘common core’. It could also be caused by problems of UNAfold to identify the correct fold, leading to a paucity of templates for homology modelling. Additionally, the ITS2 Database now contains a record for each GenBank entry which was identified either via textual annotation or our HMM based annotation tool, rendering it as an exhaustive resource for ITS2 sequences and structures.

Figure 1.

Flow chart of the new pipeline for the ITS2 annotation.

Table 1.

Taxonomic breakdown of predicted ITS2 structures

	Structure		Partials		All
	Count	Percentage	Count	Percentage	Count	Percentage
Alveolata	1750	34.67	947	18.76	5048	53.43
Amoebozoa	19	13.01	9	6.16	146	19.18
Apusozoa	0	0.00	0	0.00	35	0.00
Choanoflagellida	0	0.00	0	0.00	1	0.00
Cryptophyta	25	38.46	17	26.15	65	64.62
Environmental samples	26	28.26	7	7.61	92	35.87
Euglenozoa	3	0.62	191	39.71	481	40.33
Fornicata	0	0.00	0	0.00	3	0.00
Fungi	79 251	59.14	28 124	20.99	134 005	80.13
Fungi/Metazoa incertae sedis	2	2.86	0	0.00	70	2.86
Haptophyceae	6	19.35	3	9.68	31	29.03
Heterolobosea	1	0.59	1	0.59	170	1.18
Metazoa	4754	20.14	1357	5.75	23 603	25.89
Nucleariidae	0	0.00	0	0.00	2	0.00
Parabasalidea	1	0.51	0	0.00	197	0.51
Rhizaria	12	2.66	2	0.44	451	3.10
Rhodophyta	27	3.52	28	3.65	768	7.16
Stramenopiles	4441	52.01	2537	29.71	8539	81.72
Viridiplantae	72 322	72.95	20 488	20.67	99 141	93.61
Sum	162 640	59.61	53 711	19.69	272 848	79.29

Flow chart of the new pipeline for the ITS2 annotation. Taxonomic breakdown of predicted ITS2 structures

WEB INTERFACE

Search tab

In addition to a search for sequences and structures with GenBank identifiers or species information, we now also provide a BLAST based search. However, standard BLAST procedures are frequently not able to identify distantly related ITS2 sequences because of their high sequence divergence. To overcome this hindrance, we have implemented a sequence and structure based BLAST search that includes information about the highly conserved structure for the homology search. The sequence-structure BLAST uses an ITS2 specific 12 × 12 scoring matrix representing each nucleotide/structure combination as tuple. This matrix is also used in 4SALE (15) and, as corresponding rate matrix, in ProfDistS (16) for automatic sequence-structure alignment and phylogenetic reconstruction, respectively. Thus, species sampling that starts with any sequence of interest and covers broad taxonomic ranges has become as simple as a BLAST search.

Annotate tab

The web interface does not only present access to the information stored in the database. Further, it provides tools for researchers to process newly determined sequences and to integrate them with already published ones. As shown in the data generation pipeline, correct delineation of the ITS2 sequence can be crucial for structure prediction. We therefore have implemented a web-based interface for the HMM based annotation. It integrates ﬁve taxon-speciﬁc HMMs for searches and several individually selectable parameters, as e.g. cut-off E-value or size limitation. As a result, delimited ITS2 sequences are shown as well as the predicted hybrid of 5.8S and 28S rRNA as a confirmation of the HMM annotation’s accuracy (12).

Model tab

After annotation of newly retained ITS2 sequences and selection of a taxon sampling from the ITS2 Database, secondary structures may be determined by two means: First, prediction may be accomplished by homology modelling with the complete set of sequences and structures of the database serving as templates (Predict tab). A second approach is to identify the best template structure within the taxon sampling and use it for homology modelling of the remainders (Model tab). To date, one had to manually run through all possible templates and select the one which resulted in the highest helix transfer percentages. To avoid this tedious and somewhat arbitrary procedure, we now provide the possibility to use multiple sequence-structure pairs to model multiple target sequences. The database will calculate all against all structures and select the template which resulted in the homology prediction with highest percentages of helix transfers for all target sequences. Similarly, suboptimal structures of a sequence as e.g. retained from minimum free energy folding software, may be given as template input for a set of sequences. As a result, the database will model the structure for all requested sequences with the best fitting suboptimal secondary structure. This is needed, as sometimes the energetically best structure is not the biologically correct one. As the complete homology modelling approach is independent of the ITS2, it may be used to predict the secondary structure of any RNA given a homologous molecule with a known structure.

Motif tab

In addition to the overall structure, conserved motifs like an UGGU sequence preceding the apex of the third helix and a pyrimidine–pyrimidine mismatch in the second helix have been described for the ITS2 (2). In the aforementioned study, identification of these motifs was based on a small dataset and performed mainly by manual inspection. With the availability of the large set of ITS2 sequences in our database, we searched in an automatic way (17) for highly conserved motifs in the ITS2. From our pool of homology models, we randomly extracted a set of unique species. Analysing separately fungal and plant alignments, known and novel motifs were identified. Although the UGGU motif 5′ side to the apex of helix III differs in its composition for fungi, it is located in a corresponding position. For both kingdoms, the U–U mismatch is surrounded by two motifs: one to the left of helix II and one to the right between helix II and III with additional AAA (Figure 2). Having transformed these sequence motifs into HMMs, we now provide identification of these motifs in sequences of interest (Motif tab).

Figure 2.

General ITS2 topology and visualization of plant HMM motifs for the secondary structure of D. brevis (gi: 31281745). Annotation from HMMs of 5.8S and 28S are displayed as dotted lines tracing the outline of their position, whereas the ITS2 motif HMMs are represented by coloured lines. In parts of these motifs, nucleotide frequencies are presented (21,22). Nucleotides are coloured yellow in unpaired regions, whereas paired nucleotides are blue. CBCs between secondary structures of D. brevis and D. scapigeroides (gi: 31281755) are shown in red.

The ITS2 of Dahlia brevis as an example

As an example to illustrate the information that can be extracted from the database and the Web interface we analysed the ITS2 of D. brevis (18). Looking up the entry for the GenBank identifier 31281745 in the ITS2 Database revealed a stereotypical ITS2 structure (Figure 2). It adopts the common four helix structure with the third as the longest. Additionally, all sequence motifs characteristic for plants are present. In a comparison with another species, here D. scapigeroides (gi: 31281755), two Compensatory Base Changes (CBCs) could readily be identified. Indeed, two sequences belong with a probability of 93% to two different species, if at least one CBC is present (9). It should be mentioned, that the CBC criterion works only in one direction. The presence of more than one CBCs indicates with high probability two different species, if there is no CBC, there still could be two species. As D. brevis follows all the stereotypes of an ITS2 as the best scoring sequence resulting from all motif searches, it was selected as the ‘May 2009’ ITS2 in the newly added rubric ‘ITS2 of the Month’.

CONCLUSIONS

With the new pipeline for structure prediction, the ITS2 Database now provides information about the structure of more than 210 000 ITS2 molecules, nearly 80% of all ITS2 sequences in GenBank, covering all major taxonomic units. Having the structure available is only the first step for a successful phylogenetic analysis. It would be a pity to use the structure only for the manual refinement of an alignment and neglect it in all other steps. We thus have developed additional stand-alone programs for the entire procedure, which includes automatic alignment calculation [4SALE (15)] as well as tree reconstruction [ProfDistS (16)] considering both, sequences AND secondary structures (these programs have to be downloaded separately). Together, they are seamlessly integrated into a pipeline from sequence through structure and finally to the phylogenetic tree (19). Finally, species boundaries in the dataset can be estimated using the CBCanalyzer [(20), meanwhile also implemented in 4SALE]. The application of secondary structures for the reconstruction of phylogenies improves not only the stability of resulting trees, but more importantly increases the accuracy of phylogenetic estimations (muanuscript under preparation). Thus, it would be desirable to include structural information not only for the ITS2, but also for other frequently used phylogenetic RNA markers.

FUNDING

Deutsche Forschungsgemeinschaft (DFG) (grant Mu-2831/1-1 to C.K. and T.S.); BIGSS graduate school (to A.K.); Bundesministerium für Bildung und Forschung (BMBF) (grant FUNCRYPTA to F.F.). Conflict of interest statement. None declared.

21 in total

1. ITS2 is a double-edged tool for eukaryote evolutionary comparisons.

Authors: Annette W Coleman
Journal: Trends Genet Date: 2003-07 Impact factor: 11.639

2. A common core of secondary structure of the internal transcribed spacer 2 (ITS2) throughout the Eukaryota.

Authors: Jörg Schultz; Stefanie Maisel; Daniel Gerlach; Tobias Müller; Matthias Wolf
Journal: RNA Date: 2005-04 Impact factor: 4.942

3. CBCAnalyzer: inferring phylogenies based on compensatory base changes in RNA secondary structures.

Authors: Matthias Wolf; Joachim Friedrich; Thomas Dandekar; Tobias Müller
Journal: In Silico Biol Date: 2005-03-16

4. Distinguishing species.

Authors: Tobias Müller; Nicole Philippi; Thomas Dandekar; Jörg Schultz; Matthias Wolf
Journal: RNA Date: 2007-07-24 Impact factor: 4.942

5. ITS2 sequences as barcodes for identifying and analyzing spider mites (Acari: Tetranychidae).

Authors: Tselila Ben-David; Sarah Melamed; Uri Gerson; Shai Morin
Journal: Exp Appl Acarol Date: 2007-03-09 Impact factor: 2.132

Review 6. ITS2 sequence-structure analysis in phylogenetics: a how-to manual for molecular systematics.

Authors: Jörg Schultz; Matthias Wolf
Journal: Mol Phylogenet Evol Date: 2009-08 Impact factor: 4.286

Review 7. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

8. The internal transcribed spacer 2 exhibits a common secondary structure in green algae and flowering plants.

Authors: J C Mai; A W Coleman
Journal: J Mol Evol Date: 1997-03 Impact factor: 2.395

9. Using ITS2 secondary structure to create species-specific oligonucleotide probes for fungi.

Authors: Frank C Landis; Andrea Gargas
Journal: Mycologia Date: 2007 Sep-Oct Impact factor: 2.696

10. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal: Nucleic Acids Res Date: 2007-12-11 Impact factor: 16.971

72 in total

1. Aspergillus Secondary Metabolite Database, a resource to understand the Secondary metabolome of Aspergillus genus.

Authors: Varahalarao Vadlapudi; Nabajyoti Borah; Kanaka Raju Yellusani; Sriramya Gade; Prabhakar Reddy; Maheshwari Rajamanikyam; Lakshmi Narasimha Santosh Vempati; Satya Prakash Gubbala; Pankaj Chopra; Suryanarayana Murty Upadhyayula; Ramars Amanchy
Journal: Sci Rep Date: 2017-08-04 Impact factor: 4.379

2. Evaluation of the internal transcribed spacer 2 (ITS2) as a molecular marker for phylogenetic inference using sequence and secondary structure information in blow flies (Diptera: Calliphoridae).

Authors: M A T Marinho; A C M Junqueira; A M L Azeredo-Espin
Journal: Genetica Date: 2011-12-25 Impact factor: 1.082

3. Phylogeny of Oedogoniales, Chaetophorales and Chaetopeltidales (Chlorophyceae): inferences from sequence-structure analysis of ITS2.

Authors: Mark A Buchheim; Danica M Sutherland; Tina Schleicher; Frank Förster; Matthias Wolf
Journal: Ann Bot Date: 2011-10-25 Impact factor: 4.357

4. Discordance in variation of the ITS region and the mitochondrial COI gene in the subterranean amphipod Crangonyx islandicus.

Authors: Etienne Kornobis; Snæbjörn Pálsson
Journal: J Mol Evol Date: 2011-08-04 Impact factor: 2.395

5. Formal revision of the Alexandrium tamarense species complex (Dinophyceae) taxonomy: the introduction of five species with emphasis on molecular-based (rDNA) classification.

Authors: Uwe John; R Wayne Litaker; Marina Montresor; Shauna Murray; Michael L Brosnahan; Donald M Anderson
Journal: Protist Date: 2014-10-13

6. Tree diversity and species identity effects on soil fungi, protists and animals are context dependent.

Authors: Leho Tedersoo; Mohammad Bahram; Tomáš Cajthaml; Sergei Põlme; Indrek Hiiesalu; Sten Anslan; Helery Harend; Franz Buegger; Karin Pritsch; Julia Koricheva; Kessy Abarenkov
Journal: ISME J Date: 2015-07-14 Impact factor: 10.302

7. Molecular phylogeny of 21 tropical bamboo species reconstructed by integrating non-coding internal transcribed spacer (ITS1 and 2) sequences and their consensus secondary structure.

Authors: Jayadri Sekhar Ghosh; Samik Bhattacharya; Amita Pal
Journal: Genetica Date: 2017-04-24 Impact factor: 1.082

8. Structural characterization of the internal transcribed spacer 2 (ITS2) of the ribosomal DNA (rDNA) cluster in Calyptratae (Diptera: Schizophora) and its implications for molecular phylogenetic analyses.

Authors: M A T Marinho; A M L Azeredo-Espin; N I T Zanchin
Journal: J Mol Evol Date: 2013-02-19 Impact factor: 2.395

9. Use of ITS2 region as the universal DNA barcode for plants and animals.

Authors: Hui Yao; Jingyuan Song; Chang Liu; Kun Luo; Jianping Han; Ying Li; Xiaohui Pang; Hongxi Xu; Yingjie Zhu; Peigen Xiao; Shilin Chen
Journal: PLoS One Date: 2010-10-01 Impact factor: 3.240

10. A molecular phylogeny of Hypnales (Bryophyta) inferred from ITS2 sequence-structure data.

Authors: Benjamin Merget; Matthias Wolf
Journal: BMC Res Notes Date: 2010-11-25