NOBAI: a web server for character coding of geometrical and statistical features in RNA structure.

Vegeir Knudsen, Gustavo Caetano-Anollés.

Abstract

The Numeration of Objects in Biology: Alignment Inferences (NOBAI) web server provides a web interface to the applications in the NOBAI software package. This software codes topological and thermodynamic information related to the secondary structure of RNA molecules as multi-state phylogenetic characters, builds character matrices directly in NEXUS format and provides sequence randomization options. The web server is an effective tool that facilitates the search for evolutionary history embedded in the structure of functional RNA molecules. The NOBAI web server is accessible at 'http://www.manet.uiuc.edu/nobai/nobai.php'. This web site is free and open to all users and there is no login requirement.

Substances: RNA

Year: 2008; PMID: 18448469; PMCID: PMC2447726; DOI: 10.1093/nar/gkn220

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

Functional RNA molecules are highly diverse and are structured mainly by a set of short A-form helices that are typically about 10 base pairs in length. These helices define elements of secondary structure, which are then arranged in space through tertiary contacts and delimit motifs identifiable at the sequence or structure levels (1,2). Since these elements define the overall fold of the molecule, evolutionary history can be studied directly from secondary structure (3–6). Evolutionary studies have used both geometrical and statistical features (characters) describing topological and thermodynamic attributes of RNA molecules to infer rooted phylogenetic trees (branching histories of inheritance) (5–11). Such trees reveal evolutionary patterns that are generally overlooked by traditional phylogenetic methods that focus on sequence. Geometrical and statistical characters gave trees that were congruent and similarly rooted, and that generally matched trees reconstructed from sequences. The approach is robust and has been applied to a wide variety of molecules at different taxonomical levels, from the subspecies/species levels to the universal tree. For example, patterns in spacer rRNA structures matched diversification patterns of phytopathogenic fungi following continental pathogen introduction (5) or habitat adaptation (6), and resolved intrageneric relationships in Caribbean gorgonian corals (12). At higher taxonomical levels, analysis of rRNA structure helped resolve rapid radiations in cladoceran orders of arthropods (13) and deep phylogenetic relationships in the grasses (7). Trees of life were even reconstructed from the structure of several molecules, including the small and large subunits of rRNA (6,9) and tRNA (11). Finally, structural evolution was traced in ribosomes (9), and the origins and evolution of eukaryotic retrotransposable elements (8) and tRNA (10) were established.
In order to facilitate phylogenetic analysis, we developed NOBAI (Numeration of Objects in Biology: Alignment Inferences), a software package that automates the task of coding information in the secondary structure of folded RNA sequences. NOBAI consists of four separate modules, which are written in either C or C++: MARTEN (Molecular Analysis and Recording Tool for Evolutionary Numeration), STOAT (Statistics Of Architectural Topology), STRINGGEN and OMROKGEN. MARTEN takes as input sequence and structural information in DCSE format (14), codes geometrical features in paired and unpaired regions of the molecules as linearly ordered multi-state characters and produces a file in NEXUS format (15) for analysis with standard phylogenetic software. STOAT takes as input nucleotide sequences in FASTA format, folds the molecules (with or without constraints) using routines in RNAfold (16,17), calculates four normalized morphospace parameters that describe molecular mechanic properties of the molecules and produces a file in NEXUS format. Finally, STRINGGEN and OMROKGEN are sequence randomization tools that read sequence strings and either generate all possible recombinations or rearrange sequences by single nucleotide permutation. All four applications in the NOBAI software package are UNIX command line based and they do not have a graphical user interface (GUI). This paper presents the NOBAI web server. The server is an Apache installation that provides a web interface to the four applications in NOBAI. The communication between the applications, the web server and the user (such as the uploading of an input file and the presentation of the computational results) is done by Perl-CGI scripts. Sample input data for each application is hyperlinked to the application interfaces. The execution time for these samples (when assuming no load on the server) should be less than 10 seconds. 
The computational results are made available for download from a temporary web page as both individual files and a compressed tar file.

METHODS

Figure 1A illustrates how the secondary structure of functional RNA molecules can be used to reconstruct evolutionary history in the form of phylogenetic trees. This method requires that we know the geometrical shape of the molecules, either by inference using alignment and/or folding algorithms or experimentally through crystallographic or other methods. The actual secondary structure of a specific RNA molecule can be found in databases [e.g. the EUROPEAN RIBOSOMAL RNA DATABASE (18) or Rfam (19)] or can be obtained using predictive folding software [e.g. the Vienna package (17)].

Figure 1.

An illustration of the NOBAI software package. (A) A sketch of how the secondary structure of RNA molecules can be used to reconstruct the evolutionary history of organisms. H, hairpin; Sx, stem number x; SSx, substructure number x; Mx, molecule number x; F, the Frobenius norm; Q, Shannon entropy; P, base-pairing propensity; and S, mean stem length. (B) A sketch of the processing performed by MARTEN and STOAT (the two main programs in the NOBAI package). (C) A sketch of the processing performed by STRINGGEN and OMROKGEN.

Geometrical characters describe individual molecular components (substructures, SS), or the entire molecule. For example, the sample molecule in Figure 1A can be divided into six SS following the DCSE format, one substructure for each stem strand. These SS are further associated with three parameters: the number of unpaired nucleotides (U), the number of paired nucleotides (P) and the number of bulged nucleotides (B). For our sample molecule the first stem strand (S1) has 2 unpaired nucleotides, 5 paired nucleotides and 0 nucleotides in bulges or internal loops. These three numbers can then be used in a geometrical character data matrix to represent S1, as shown in Figure 1A. Statistical characters define a morphospace that provides global descriptions of elements of secondary structure of folded nucleotide sequences. These morphospace characters include the Shannon entropy of the base-pairing probability matrix (Q), the Frobenius norm (F), the base-pairing propensity (P) and the mean stem length (S) (20,21). Their phylogenetic significance has been previously discussed (7). Table 1 provides equations defining these four statistical parameters. As illustrated in Figure 1A, integers can be assigned to each parameter in a statistical character data matrix.
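The coding of geometrical characters described above can be sketched in a few lines. This is a minimal illustration in Python, not the MARTEN implementation (which is written in C or C++): the `StrandCounts` container, the `character_row` function and the capping of counts at the highest state are all assumptions made for the example. Only the S1 values (2 unpaired, 5 paired, 0 bulged) come from the text; the second strand is made up.

```python
from typing import NamedTuple


class StrandCounts(NamedTuple):
    """Counts for one stem strand (substructure). Hypothetical container."""
    unpaired: int  # U: unpaired nucleotides
    paired: int    # P: paired nucleotides
    bulged: int    # B: nucleotides in bulges or internal loops


def character_row(strands, max_state=63):
    """Flatten per-strand (U, P, B) counts into one row of character states.

    Capping at max_state is an assumption of this sketch; the text does not
    say how counts above the number of available states are handled.
    """
    row = []
    for strand in strands:
        for count in (strand.unpaired, strand.paired, strand.bulged):
            row.append(min(count, max_state))
    return row


s1 = StrandCounts(unpaired=2, paired=5, bulged=0)  # values quoted in the text
s2 = StrandCounts(unpaired=1, paired=5, bulged=0)  # illustrative only
row = character_row([s1, s2])
print(row)
```

Each molecule contributes one such row, so molecules with the same number of substructures line up column by column in the data matrix.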

Table 1.

Shannon entropy
Frobenius norm
Base-pairing propensity
Mean stem length

Statistical characters of folded RNA sequences used by STOAT. Terms used in the equations: P, the base-pairing probability; A, the length of stem number k (i.e. the number of paired nucleotides in that stem); C, the number of paired bases; N, the number of bases; and B, the number of stems.

The generated character data matrices contain characters describing the RNA molecules. Figure 1A shows how data matrices display character states and are used to generate phylogenetic trees. Several software packages, including PAUP* (22), take these matrices in NEXUS format as input for phylogenetic analysis. If the character data matrices are flipped (i.e. rows become columns and vice versa), then the data matrices can be used to create phylogenetic trees of SS in addition to the traditional phylogenetic trees of molecules. These trees of SS are very useful and have been used to define the origins and evolution of RNA structure (8–10).
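The four statistical characters named in Table 1 can be computed along the following lines. This is a hedged sketch, not STOAT's code: it uses plain, unnormalized textbook forms (entropy as the negative sum of p log2 p over pairs i < j, the Frobenius norm as the square root of the sum of squared probabilities, propensity as C/N, and mean stem length as the sum of stem lengths divided by B), whereas STOAT reports normalized parameters whose exact normalization is not assumed here. Stem length is counted in base pairs, and a stem is taken to be a run of directly stacked pairs; both choices are assumptions. All function names are hypothetical.

```python
import math


def pair_table(dot_bracket):
    """Return the sorted list of (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], []
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def stem_lengths(pairs):
    """Group pairs into runs of directly stacked pairs; return run lengths."""
    pair_set = set(pairs)
    lengths = []
    for i, j in pairs:
        if (i - 1, j + 1) in pair_set:
            continue  # interior pair; its stem was counted from the outer pair
        length = 1
        while (i + length, j - length) in pair_set:
            length += 1
        lengths.append(length)
    return lengths


def statistical_characters(prob, dot_bracket):
    """Compute Q, F, P and S from a pairing-probability matrix and a structure."""
    n = len(dot_bracket)
    entropy = 0.0
    squares = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = prob[i][j]
            if p > 0.0:
                entropy -= p * math.log2(p)
            squares += p * p
    pairs = pair_table(dot_bracket)
    lengths = stem_lengths(pairs)
    return {
        "Q": entropy,                     # Shannon entropy (unnormalized)
        "F": math.sqrt(squares),          # Frobenius norm (unnormalized)
        "P": 2 * len(pairs) / n,          # paired bases C over bases N
        "S": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Toy input: a 6-nt hairpin with two made-up pairing probabilities.
structure = "((..))"
prob = [[0.0] * 6 for _ in range(6)]
prob[0][5] = 1.0
prob[1][4] = 0.5
stats = statistical_characters(prob, structure)
print(stats)
```

With this stem definition, a bulge or internal loop splits a helix into separate stems, which matches the way STOAT is described as dividing such stems into more than one substructure.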

WEB SERVER

The NOBAI web server provides a web interface to the applications in the NOBAI software package. The web server is available at ‘www.manet.uiuc.edu/nobai/nobai.php’, while the source code of the applications can be obtained directly from the authors. The service is free and open to all users and there is no login requirement. The web server has a separate web interface for each of the four applications. Each interface also contains links to sample input data and a documentation page. The documentation pages are small manuals that briefly describe the application and its input parameters. These manuals were originally written with UNIX man-page macros and then converted to HTML by groff (23). The results of the computations performed on the web server are made available for download on a temporary web page after the calculation has ended. This web page presents the computational results as both individual files and a tar file (24). The tar file is compressed with gzip (25) and contains all the result files plus the user-provided input. The tar file unpacks all files into the directory in which it is stored. All files related to a specific computation are kept in a separate directory on the server for at least 24 hours. However, once the temporary web page has been lost, the user needs to recall the exact web address of a file in order to download it. Figure 1 (B and C) gives an overview of the input, output and processing of the four applications in the NOBAI software package. The two main programs are MARTEN and STOAT, which code phylogenetic characters associated with RNA secondary structures. OMROKGEN and STRINGGEN, on the other hand, are tools that generate new nucleotide strings in which the positions of the nucleotides in the original sequence are rearranged. All four applications print information to standard output during the computation. On the web server, this information is saved in the stdout.txt file.
A brief description of each of the four web interfaces is given below.

MARTEN

MARTEN accepts only input files in the DCSE format. The main output from MARTEN is a geometrical character data matrix written to a NEXUS file. The suffix of the NEXUS-formatted file is ‘nex’. MARTEN also translates the input sequences from the DCSE format to the FASTA format. The suffix of the FASTA file is ‘fas’. The default number of character states is 64, which is also the maximum; the user can select a smaller number. Captures of the MARTEN web interface and result page are shown in Figure 2.
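For readers unfamiliar with NEXUS, the shape of such a character matrix file can be sketched as follows. This is an illustration only: the `to_nexus` function is hypothetical, and the symbol alphabet (digits followed by uppercase letters, 36 states) is an assumption, since the symbols MARTEN uses for its 64 states are not described here. The taxon names and states are made up.

```python
import string

# Assumed symbol alphabet: digits, then uppercase letters (36 states in total).
SYMBOLS = string.digits + string.ascii_uppercase


def to_nexus(matrix, symbols=SYMBOLS):
    """Render {taxon name: [integer states]} as a NEXUS DATA block."""
    names = list(matrix)
    nchar = len(matrix[names[0]])
    width = max(len(name) for name in names)
    rows = []
    for name in names:
        states = matrix[name]
        if len(states) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters")
        if any(s < 0 or s >= len(symbols) for s in states):
            raise ValueError(f"{name}: state outside the symbol alphabet")
        coded = "".join(symbols[s] for s in states)
        rows.append(f"    {name.ljust(width)} {coded}")
    header = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(names)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD SYMBOLS="{symbols}" MISSING=?;',
        "  MATRIX",
    ]
    return "\n".join(header + rows + ["  ;", "END;"]) + "\n"


nexus_text = to_nexus({"M1": [2, 5, 0], "M2": [1, 10, 0]})
print(nexus_text)
```

Declaring the characters as linearly ordered is left to the downstream phylogenetic program and is not shown in this sketch.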

Figure 2.

Two examples of web pages of the NOBAI web server. (A) Capture of the MARTEN web interface with the default parameters. A brief description of the input parameters can be found under the ‘Documentation’ hyperlink. (B) Capture of a temporary result page generated by the MARTEN web interface. The result files can be downloaded either as separate files or as a compressed tar file.

Note that on certain occasions DCSE files downloaded from the EUROPEAN RIBOSOMAL RNA DATABASE may contain sequences for which the ‘helix numbers’ are not located directly underneath the stem strands. If the number of stem strands for those sequences is less than the number of ‘helix numbers’, then MARTEN may fail to assign the stem strands to a ‘helix number’. When this situation occurs, MARTEN excludes the sequence from the computation. The user can instruct MARTEN to search for ‘helix numbers’ outside the stem strands, but if the calculation fails it is often better to edit the input file manually instead.

STOAT

STOAT reads an input file in the FASTA format and invokes the RNAfold program (version 1.4) from the Vienna package (17). RNAfold returns the minimum free energy (MFE) structure and the probability of base-pairing between bases i and j in the sequence (p). The base-pairing probability is derived from the partition function (26) and is needed to calculate the Shannon entropy and the Frobenius norm. Parameters such as the stem lengths (A) and the number of stems (B) are calculated from the secondary structure of the folded nucleotides. Note that STOAT divides stems containing bulges or internal loops into more than one substructure (e.g. stem 2 of the sample molecule in Figure 1A is divided into two SS). STOAT outputs two statistical character data matrices, each written to a NEXUS file. These two files have the suffixes ‘_l.nex’ and ‘_g.nex’. The local ‘_l.nex’ file gives the character states relative to a linear scale based on the minimum and maximum values for that run. The global ‘_g.nex’ file gives the character states relative to 0 and 1. The statistical characters used in the matrices are the Shannon entropy, the Frobenius norm, the base-pairing propensity and the mean stem length (Table 1). In the ‘_g.nex’ file, the mean stem length is given as S⁻¹. The default number of character states is 31 and the maximum number is 64. Figure 3 shows a local NEXUS file generated by STOAT and a consensus phylogenetic tree generated from it using PAUP*.
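The difference between the local and global files can be illustrated with a small sketch of linear scaling into integer character states. The functions below are hypothetical and the rounding rule (nearest state, halves rounded up) is an assumption; the text states only that the local file scales between the run's minimum and maximum and the global file between 0 and 1.

```python
def to_state(value, low, high, n_states=31):
    """Map value linearly from [low, high] onto integer states 0..n_states-1."""
    if high == low:
        return 0  # degenerate range: every value gets the same state
    fraction = (value - low) / (high - low)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the scale
    return int(fraction * (n_states - 1) + 0.5)


def local_states(values, n_states=31):
    """Scale relative to the minimum and maximum of this run."""
    low, high = min(values), max(values)
    return [to_state(v, low, high, n_states) for v in values]


def global_states(values, n_states=31):
    """Scale relative to the fixed interval from 0 to 1."""
    return [to_state(v, 0.0, 1.0, n_states) for v in values]


values = [0.25, 0.5, 0.75]            # made-up parameter values
print(local_states(values, n_states=5))
print(global_states(values, n_states=5))
```

Local scaling uses the full range of states within one run, while global scaling keeps states comparable between runs at the cost of leaving some states unused.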

Figure 3.

(A) Character data matrix produced by STOAT in NEXUS format (_l.nex file) from 5S rRNA archaeal structures. (B) Strict consensus phylogenetic tree reconstructed with PAUP* using maximum parsimony as the optimality criterion. Note that only the top segment of the NEXUS file is shown and that the 5S rRNA sequences are provided as sample input data in the web server.

STOAT also translates sequences from FASTA to DCSE format. This translator does not align the sequences. Consequently, the DCSE format of the sequences is written to separate files.

STRINGGEN and OMROKGEN

STRINGGEN creates all possible recombinations of a given nucleic acid sequence. The application reads a nucleotide sequence and applies a loop to select all possible sequences of the same length and base composition. Because of computational limitations, the maximum sequence length for the web interface is 15 nucleotides. OMROKGEN also rearranges sequences. However, the rearrangement is performed by a permutation procedure described in refs (20,27). This procedure consists of three perfect shuffles, each swapping nucleotides sequentially at all sites with a randomly chosen site elsewhere in the sequence. On the OMROKGEN web interface the user can specify the number of sequences to be generated. These randomization tools shuffle sequences of any defined nucleotide composition and can be used to dissect the effects of nucleotide composition and order on the stability of folded nucleic acid molecules (28). However, they may not be suitable for applications that require dinucleotide shuffling.
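The two randomization strategies can be sketched as follows. Both functions are hypothetical readings of the descriptions above, not the STRINGGEN or OMROKGEN code: the first enumerates every distinct sequence with the same length and base composition, and the second performs three passes in which each site is swapped with a randomly chosen other site. The procedure published in refs (20,27) may differ in detail.

```python
import random
from collections import Counter


def distinct_rearrangements(seq):
    """Yield every distinct sequence with the length and composition of seq."""
    counts = Counter(seq)
    bases = sorted(counts)

    def extend(prefix):
        if len(prefix) == len(seq):
            yield "".join(prefix)
            return
        for base in bases:
            if counts[base] > 0:
                counts[base] -= 1      # use one copy of this base
                prefix.append(base)
                yield from extend(prefix)
                prefix.pop()           # backtrack
                counts[base] += 1

    yield from extend([])


def swap_shuffle(seq, passes=3, rng=None):
    """Shuffle seq by swapping each site with a random other site, per pass."""
    rng = rng or random.Random()
    bases = list(seq)
    n = len(bases)
    if n < 2:
        return seq
    for _ in range(passes):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1                 # guarantees j differs from i
            bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)


all_forms = list(distinct_rearrangements("AAC"))
print(all_forms)
shuffled = swap_shuffle("ACGUACGU", rng=random.Random(0))
print(shuffled)
```

The number of distinct rearrangements grows factorially with sequence length, which is consistent with the 15-nucleotide limit on the STRINGGEN web interface. Neither function preserves dinucleotide frequencies.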

CONCLUSIONS

A software package has been developed to code geometrical and statistical phylogenetic characters from the secondary structure of folded RNA sequences. The applications in this software package have been made freely accessible on a web server open for all users. This software package with its web interfaces will facilitate the search for evolutionary patterns embedded in the structure of functional RNA.

REFERENCES (10 of 22 listed)

T Hermann; D J Patel. Stitching together RNA tertiary architectures (review). J Mol Biol, 1999-12-10.

L J Collins; V Moulton; D Penny. Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP. J Mol Evol, 2000-09.

D R Maddison; D L Swofford; W P Maddison. NEXUS: an extensible file format for systematic information. Syst Biol, 1997-12.

Gustavo Caetano-Anollés. Evolved RNA secondary structure and the rooting of the universal tree of life. J Mol Evol, 2002-03.

Gustavo Caetano-Anollés. Tracing the evolution of RNA structure in ribosomes. Nucleic Acids Res, 2002-06-01.

Jan Wuyts; Guy Perrière; Yves Van De Peer. The European ribosomal RNA database. Nucleic Acids Res, 2004-01-01.

Ivo L Hofacker. Vienna RNA secondary structure server. Nucleic Acids Res, 2003-07-01.

Paul P Gardner; Barbara R Holland; Vincent Moulton; Mike Hendy; David Penny. Optimal alphabets for an RNA world. Proc Biol Sci, 2003-06-07.

Timothy D Swain; Derek J Taylor. Structural rRNA characters support monophyly of raptorial limbs and paraphyly of limb specialization in water fleas. Proc Biol Sci, 2003-05-07.

Feng-Jie Sun; Gustavo Caetano-Anollés. Evolutionary patterns in the sequence and structure of transfer RNA: early origins of archaea and viruses. PLoS Comput Biol, 2008-03-07.
