SAVoR: a server for sequencing annotation and visualization of RNA structures.

Fan Li, Paul Ryvkin, Daniel M Childress, Otto Valladares, Brian D Gregory, Li-San Wang.

Abstract

RNA secondary structure is required for the proper regulation of the cellular transcriptome. This is because the functionality, processing, localization and stability of RNAs are all dependent on the folding of these molecules into intricate structures through specific base pairing interactions encoded in their primary nucleotide sequences. Thus, as the number of RNA sequencing (RNA-seq) data sets and the variety of protocols for this technology grow rapidly, it is becoming increasingly pertinent to develop tools that can analyze and visualize this sequence data in the context of RNA secondary structure. Here, we present Sequencing Annotation and Visualization of RNA structures (SAVoR), a web server, which seamlessly links RNA structure predictions with sequencing data and genomic annotations to produce highly informative and annotated models of RNA secondary structure. SAVoR accepts read alignment data from RNA-seq experiments and computes a series of per-base values such as read abundance and sequence variant frequency. These values can then be visualized on a customizable secondary structure model. SAVoR is freely available at http://tesla.pcbi.upenn.edu/savor.

Year: 2012 PMID: 22492627 PMCID: PMC3394343 DOI: 10.1093/nar/gks310

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The secondary structure of an RNA molecule comprises specific base pairing interactions encoded within the primary nucleotide sequence. The formation of secondary structure is vital to the maturation and function of many classes of RNAs. For example, the classic clover-leaf folding pattern of tRNAs is necessary for their function in translation, while the processing of multiple classes of small regulatory RNAs requires formation of specific secondary structures (1,2). Recently, the advent of high-throughput RNA sequencing (RNA-seq) has enabled unbiased, genome-wide studies of many RNA populations within the cell. RNA-seq and its variant protocols have recently been used to study a wide range of biological phenomena, including RNA silencing (3,4), RNA–protein interactions (5,6) and protein translation (7), to name a few. These experiments, along with several recent studies of RNA base pairing (4,8–10), have highlighted the functional significance of RNA structure on a global scale.

Although the importance of RNA secondary structure is clear, most existing tools for RNA-seq analysis, such as DESeq (11), Myrna (12), Cufflinks (13), and Galaxy (14), primarily report RNA-seq analyses in the context of linear transcript models and do not support a structure-based interpretation. On the other hand, tools that do enable visualization and annotation of RNA structure models [e.g. RNAstructure (15), RNAfold (16), etc.] are focused on the problem of RNA secondary structure prediction and are not easily applicable to analysis of RNA-seq data.

To address this gap, we have developed Sequencing Annotation and Visualization of RNA structures (SAVoR), which neatly integrates common RNA-seq analyses with a structure-based annotation and visualization framework (Figure 1). To do this, SAVoR extracts sequencing data from user-specified RNA-seq alignment files and computes a series of per-nucleotide values such as read abundance and sequence variant frequency, which are then directly plotted on a customizable structural model. The entire process is streamlined via a simple web interface and is completely platform independent. The uses of SAVoR range from a quick look at a transcript of interest to fully customized and annotated publication quality structure models.

Figure 1.

The SAVoR workflow. Upon validation of user input, the primary sequence and genomic location of the user-submitted transcript(s) are determined, and intersecting sequence reads are converted to the desired annotation values. The secondary structure is then determined and plotted with the specified visualization options.

SAVOR WEB SERVER

Input

SAVoR requires the user to enter an RNA transcript sequence and a secondary structure as input. The sequence can be entered as a primary sequence in plain text or FASTA format, an Rfam (17) or transcript [RefSeq (18), SGD (19) or TAIR (20)] accession number, or the genomic location by chromosomal range and strand information (Figure 2). Currently, the input sequence is restricted to 20 000 nt in length. If an Rfam accession number is entered and multiple matching entries are located (often the case for repetitive RNA elements such as tRNAs), then SAVoR lists all matching entries from which the user can then select the desired locus. If a primary sequence is entered and any genome-based annotation is selected, then BLASTN (21) with ‘-gapopen 999 -gapextend 999’ and otherwise default parameters will be used to determine the genomic location of the input sequence. The user will be prompted to select from a list of the top 20 BLASTN results that pass an E-value cutoff of 1e−3. The user can use a simple drop-down menu to select the reference genome, which is used by SAVoR to retrieve database entries and primary sequence data. SAVoR currently supports the latest reference genome releases for human, mouse, Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (budding yeast), Arabidopsis thaliana and Caenorhabditis elegans and contains 3831 Rfam entries and 167 157 RefSeq/SGD/TAIR entries.
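As a rough illustration of the input checks described above, the following Python sketch validates a plain-text or FASTA sequence against the 20 000 nt limit and parses a genomic range with strand. The function names, the accepted alphabet (A, C, G, U, T, N) and the `chrom:start-end:strand` string layout are assumptions made for this example; they are not taken from SAVoR's own code.

```python
import re

MAX_LENGTH = 20000           # input length limit stated in the text
VALID_BASES = set("ACGUTN")  # assumed alphabet; the server's exact rules may differ


def validate_sequence(text):
    """Return the cleaned sequence from plain-text or single-record FASTA input."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    # Drop a FASTA header line if one is present.
    if lines and lines[0].startswith(">"):
        lines = lines[1:]
    seq = "".join(lines).upper()
    if not seq:
        raise ValueError("empty sequence")
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"sequence exceeds {MAX_LENGTH} nt")
    bad = sorted(set(seq) - VALID_BASES)
    if bad:
        raise ValueError(f"invalid characters: {''.join(bad)}")
    return seq


# Hypothetical layout: chromosome, 1-based inclusive range, then strand.
_COORD = re.compile(r"^(\w+):([\d,]+)-([\d,]+):([+-])$")


def parse_location(location):
    """Parse e.g. 'chr1:1,000-1,050:+' into (chrom, start, end, strand)."""
    match = _COORD.match(location.strip())
    if match is None:
        raise ValueError(f"cannot parse location: {location!r}")
    chrom, start, end, strand = match.groups()
    start, end = int(start.replace(",", "")), int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    if end - start + 1 > MAX_LENGTH:
        raise ValueError(f"range exceeds {MAX_LENGTH} nt")
    return chrom, start, end, strand
```

Rejecting malformed input before any database lookup or BLASTN search mirrors the validation step shown at the start of the workflow in Figure 1.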

Figure 2.

SAVoR can be used to visualize RNA-seq results in the context of predicted RNA secondary structures. In this example, the ‘User Input’ tab of the SAVoR web interface is populated with a sample input. The desired transcript is specified by UCSC-style genomic coordinates, and a custom structure is provided in dot-parenthesis notation. The ‘coverage’ annotation type has been selected and the URL for a web-accessible BAM file of read alignments provided. All optional visualization settings have been selected, and a blue-red color scheme will be used.

Specifying the secondary structure

Depending on the type of input sequence and RNA-seq data, the user has four options to specify how the model of RNA secondary structure is generated. First, SAVoR can retrieve the secondary structure from the Rfam database when the input is an Rfam accession ID. Second and third, the RNAfold program can be used without or with experimental constraints to fold the sequence into its minimum free energy state. If the constrained option is selected, the log2 abundance ratios of structure-informative RNA-seq data sets (4,8–10) are used to derive experimental constraints for structure prediction. Specifically, in the resulting structure model, a base will be paired when the dsRNA-seq to ssRNA-seq abundance ratio for that nucleotide exceeds a given threshold, and unpaired when the ssRNA-seq to dsRNA-seq ratio does (4,9). SAVoR will then use RNAfold to find the best secondary structure model based on the given constraints. Finally, the user can enter a specific secondary structure using the common dot-parenthesis notation (22).
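The two structure inputs above can be sketched in Python as follows. The first function reads a dot-parenthesis string into a table of pairing partners using a stack. The second turns per-base log2 dsRNA-seq/ssRNA-seq ratios into a constraint string. The threshold values are placeholders, and the symbols `|` (base must pair), `x` (base must not pair) and `.` (no constraint) are assumed here to match the constraint notation accepted by RNAfold; both functions are illustrative and not SAVoR's own code.

```python
def parse_dot_bracket(structure):
    """Return a list where entry i is the 0-based partner of base i, or None."""
    pairs = [None] * len(structure)
    stack = []  # positions of '(' still waiting for a partner
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i + 1}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i + 1}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1] + 1}")
    return pairs


def constraints_from_log_ratios(log_ratios, paired_above=1.0, unpaired_below=-1.0):
    """Build a constraint string from per-base log2(dsRNA/ssRNA) ratios.

    The thresholds are hypothetical defaults; None marks a base without data.
    """
    chars = []
    for value in log_ratios:
        if value is None:
            chars.append(".")   # no data, leave unconstrained
        elif value > paired_above:
            chars.append("|")   # strong double-stranded signal
        elif value < unpaired_below:
            chars.append("x")   # strong single-stranded signal
        else:
            chars.append(".")   # ambiguous, leave unconstrained
    return "".join(chars)
```

For instance, `parse_dot_bracket("((..))")` pairs the first base with the last and the second with the fifth, which is the information a layout program needs to draw the stem.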

Generating per-base annotations

Next, the user specifies the type of annotation to be overlaid on the RNA secondary structure model. Importantly, SAVoR supports remote access to indexed BAM files, which are highly compact files that contain read alignments from an RNA-seq experiment. SAVoR directly extracts sequencing reads that intersect with the input RNA transcript without requiring the user to upload the entire BAM file. Extracted reads are then converted to per-nucleotide annotation values. SAVoR can generate four different annotation types based on BAM files from RNA-seq or other types of high-throughput sequencing experiments: (i) read abundance (number of reads that cover each nucleotide base), (ii) endpoint abundance (number of reads whose 5′ or 3′ endpoint occurs at each nucleotide base), (iii) per-base mismatch frequency and (iv) per-base normalized log2 abundance ratio (for this analysis, the user is required to enter the URLs of two BAM files, which will be used by SAVoR to compute ratios). It is worth noting that when log2 abundance or abundance ratio is selected, pseudo counts (adding 1 to the count of every position) are used to avoid taking the logarithm of zero or dividing by zero.

Alternatively, the user can upload a text file of custom annotation values using the UCSC Genome Browser BED format, a flexible tabular file format for genomic locations and associated data (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). Currently, the BED-format file is limited to 5 Mb in size. Finally, the user can select from a series of visualization options that specify the markup and color scheme of the output structural model (these options are described in detail on the SAVoR website). Figure 2 shows an example input page that uses genomic coordinates for sequence input, a custom dot-parenthesis structure, and read coverage annotation with default visualization settings. The user can try out this sample input by clicking the ‘Sample input’ link on the SAVoR home page.
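A minimal Python sketch of three of these per-base quantities is given below. It assumes that the reads overlapping the transcript have already been extracted and converted to 1-based, inclusive (start, end) intervals in transcript coordinates on the forward strand; the function names are hypothetical. The normalization step of the log2 ratio is omitted because the text does not specify how it is computed.

```python
import math


def coverage(reads, length):
    """Number of reads covering each position of a transcript of given length."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        lo, hi = max(start, 1), min(end, length)
        if lo > hi:
            continue  # read lies entirely outside the transcript
        diff[lo - 1] += 1
        diff[hi] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def endpoint_counts(reads, length, use_start=True):
    """Number of reads whose start (or end) falls on each position."""
    counts = [0] * length
    for start, end in reads:
        pos = start if use_start else end
        if 1 <= pos <= length:
            counts[pos - 1] += 1
    return counts


def log2_ratio(numerator, denominator, pseudocount=1):
    """Per-base log2 ratio of two count vectors, adding a pseudocount to each."""
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(numerator, denominator)
    ]
```

With the pseudocount of 1 described above, a position with no reads in either data set gives log2(1/1) = 0 instead of an undefined value.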

Output

After the user submits the input data, the ‘Output’ tab is automatically displayed (Figure 3). Progress indicators are shown for each step of the SAVoR workflow, along with warnings at steps that may require additional processing time. SAVoR validates all user-supplied data (e.g. that the input sequence only contains valid nucleotide characters) and reports any detected anomalies to help the user fix any errors in the input. Upon completion, the output structural model is directly displayed in the web browser, along with the calculated annotation values in tabular format. A legend showing the color scheme and annotation type is displayed in the top left corner of the output model. The sequence and structure are displayed using the default layout by RNAplot (16) with annotation values overlaid. The 5′ and 3′ ends of the transcript, as well as the position of every 10th nucleotide, are marked to facilitate location of a specific region of interest. The entire model can be scaled and panned as desired using standard browser tools.
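To make the overlay step concrete, the sketch below maps annotation values onto a blue-to-red gradient and lists the positions of every 10th nucleotide. The linear interpolation, the hex color output and both function names are assumptions for illustration; the text does not describe how SAVoR computes its colors.

```python
def value_to_hex(value, vmin, vmax):
    """Map a value linearly onto a blue (low) to red (high) gradient."""
    if vmax <= vmin:
        fraction = 0.5  # degenerate range: use the midpoint color
    else:
        fraction = (value - vmin) / (vmax - vmin)
        fraction = min(1.0, max(0.0, fraction))  # clamp out-of-range values
    red = round(255 * fraction)
    blue = round(255 * (1 - fraction))
    return f"#{red:02x}00{blue:02x}"


def tick_positions(length, step=10):
    """1-based positions of every step-th nucleotide, for position labels."""
    return list(range(step, length + 1, step))
```

Clamping keeps outliers at the ends of the scale, so a single extreme value cannot push a color outside the legend.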

Figure 3.

SAVoR produces highly informative, user-specified, annotated models of RNA secondary structure. The ‘Output’ tab of the SAVoR web interface is dynamically rendered to show progress indicators and error messages. Each step of the pipeline is indicated, and a ‘Complete’ message appears upon successful completion. When the entire process is finished, the output (an annotated model of RNA secondary structure) is displayed, and links for downloading the output files are provided. Additionally, the calculated per-base annotation values are displayed in tabular form and can be downloaded as a plain text file.

Links to the output structure model in SVG, PDF, and PNG formats, and the annotation values in plain text format are provided as well. Importantly, files generated by each user submission are uniquely named and can only be viewed via these output links. The results are kept by the server for at least 72 h. If changes to the input data are desired, the user can simply click on the ‘User Input’ tab and directly modify the stored input values. While resubmission of the input form will result in rerunning of the entire SAVoR pipeline and generation of new output files, a typical SAVoR run requires <30 s for a 1 kb sequence.

Example uses

While we have streamlined the workflow design to strengthen its accessibility, SAVoR is also very flexible. We describe three example use cases to illustrate this point. Corresponding figures can be found in the Gallery page on the SAVoR website.

Visualizing read distribution across a known transcript: The user specifies an Rfam or RefSeq ID, the ‘coverage’ annotation option, and read alignments as a BAM file. This type of model can be used to look for biases in read distribution such as those derived from small RNAs produced from a precursor transcript.

Comparing experimental and computational base pairing predictions: The user specifies the ‘RNAfold’ structure prediction method and ‘log-ratio’ annotation option, and provides two BAM files (containing structure-informative RNA-seq data) as input. The resulting output can be used to compare base pairing predictions from a free-energy based computational approach (RNAfold) with experimentally derived base pairing data (log-ratio).

Visualizing single nucleotide polymorphisms (SNPs): The user uploads a UCSC Genome Browser BED format file of customized per-nucleotide values along with any set of sequence and structure inputs. For example, we can upload a file containing SNP coordinates and use this to color known SNPs on the secondary structure; this allows the user to examine if population diversity correlates with predicted or experimentally determined RNA structural constraints.
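For the SNP use case, the conversion from BED records to per-nucleotide values can be sketched as below. BED intervals are 0-based and half-open, so each record is shifted to 1-based coordinates before being placed on the transcript. Reading the value from the fifth (score) column, ignoring the BED strand column and the function name itself are assumptions of this example rather than documented SAVoR behavior.

```python
def bed_to_per_base(bed_text, chrom, start, end, strand="+"):
    """Per-nucleotide values for a transcript at chrom:start-end (1-based, inclusive)."""
    length = end - start + 1
    values = [0.0] * length
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 5 or fields[0] != chrom:
            continue
        # BED intervals are 0-based, half-open: [chromStart, chromEnd).
        bed_start, bed_end = int(fields[1]), int(fields[2])
        score = float(fields[4])
        for genomic in range(max(bed_start + 1, start), min(bed_end, end) + 1):
            offset = genomic - start
            # On the minus strand, transcript position 1 is the highest coordinate.
            index = offset if strand == "+" else length - 1 - offset
            values[index] = score
    return values
```

A record such as `chr1 102 103 rs1 1` (a hypothetical SNP) marks genomic position 103, which is the fourth nucleotide of a forward-strand transcript spanning chr1:100-105.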

Implementation

The SAVoR web server runs Apache 2.2.3 on a CentOS 5.7 machine with 2× Intel Xeon E5450 3.00 GHz processors and 16 GB RAM. Asynchronous JavaScript and XML (AJAX) technology is used to dynamically render PHP output into formatted HTML. A local MySQL database is used to store Rfam and RefSeq/SGD/TAIR entries, and a local installation of BLAST+ is used to retrieve sequence and genomic locus information. Structure prediction is optionally performed using a local installation of RNAfold (version 1.8.4), and backbone layout is done using RNAplot. SAMtools (23) is used to extract annotation values from BAM files, and custom Perl and Ruby scripts are used to process BED files. Inkscape (version 0.47) is used to convert from the native SVG format to publication quality PDF and PNG output files. SAVoR has been tested extensively, and the internal programs were used to generate annotated models of RNA secondary structure in our recent publications (4,9).

CONCLUSIONS

The incredible power and versatility of high-throughput RNA-seq approaches have spurred many insights into RNA function, biogenesis, and structure, and offer almost endless possibilities for future studies of RNA biology (24–27). Interpretation of these data is fast becoming a bottleneck, and substantial efforts to aid in this process are currently necessary. With SAVoR, we have developed a unique and user-friendly interface to streamline the interpretation of RNA-seq data in the context of RNA secondary structure. Specifically, our web server directly computes per-nucleotide quantities from RNA-seq data sets and overlays these annotation values on a structural model. The uses of this web-based tool range from quick checks of data quality to production of fully customized publication quality figures, and will aid researchers in many aspects of RNA-seq analysis. We plan to extend SAVoR to directly retrieve annotations from the UCSC Genome Browser and other public genomic databases, thereby removing the need for users to generate their own annotation files and improving accessibility to different data sources beyond sequencing alignments. We also plan to add other methods for structure prediction and visualization such as conservation-informed prediction (28,29), and implement other RNA secondary structure layout options including circular and linear structure plots. These additional features will further aid interpretation of genome-scale data in the context of RNA secondary structure.

FUNDING

Funding for open access charge: Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine. This work was supported by National Science Foundation [MCB-1053846 to B.D.G.]; Penn Genome Frontiers Institute [pilot award to B.D.G. and L.S.W.]; National Institutes of Health [NIA AG10124 to B.D.G. and L.S.W.; NHGRI 5T32HG000046-13 to F.L. and P.R.]; SmithKline Beecham Center of Excellence in Geriatric Medicine through the Penn Institute on Aging [to M.C.].

Conflict of interest statement. None declared.

28 in total

1. Memory efficient folding algorithms for circular RNA secondary structures.

Authors: Ivo L Hofacker; Peter F Stadler
Journal: Bioinformatics Date: 2006-02-01 Impact factor: 6.937

2. The centrality of RNA.

Authors: Phillip A Sharp
Journal: Cell Date: 2009-02-20 Impact factor: 41.582

3. The dynamic landscapes of RNA architecture.

Authors: José Almeida Cruz; Eric Westhof
Journal: Cell Date: 2009-02-20 Impact factor: 41.582

4. RNAz 2.0: improved noncoding RNA detection.

Authors: Andreas R Gruber; Sven Findeiß; Stefan Washietl; Ivo L Hofacker; Peter F Stadler
Journal: Pac Symp Biocomput Date: 2010

5. Global analysis of RNA secondary structure in two metazoans.

Authors: Fan Li; Qi Zheng; Paul Ryvkin; Isabelle Dragomir; Yaanik Desai; Subhadra Aiyer; Otto Valladares; Jamie Yang; Shelly Bambina; Leah R Sabin; John I Murray; Todd Lamitina; Arjun Raj; Sara Cherry; Li-San Wang; Brian D Gregory
Journal: Cell Rep Date: 2012-01-26 Impact factor: 9.423

6. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling.

Authors: Nicholas T Ingolia; Sina Ghaemmaghami; John R S Newman; Jonathan S Weissman
Journal: Science Date: 2009-02-12 Impact factor: 47.728

9. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583
