RNA-Rocket: an RNA-Seq analysis resource for infectious disease research.

Andrew S Warren¹, Cristina Aurrecoechea¹, Brian Brunk², Prerak Desai¹, Scott Emrich², Gloria I Giraldo-Calderón¹, Omar Harb², Deborah Hix¹, Daniel Lawson¹, Dustin Machi¹, Chunhong Mao¹, Michael McClelland¹, Eric Nordberg¹, Maulik Shukla¹, Leslie B Vosshall¹, Alice R Wattam¹, Rebecca Will¹, Hyun Seung Yoo¹, Bruno Sobral¹.

Abstract

MOTIVATION: RNA-Seq is a method for profiling transcription using high-throughput sequencing and is an important component of many research projects that aim to study transcript isoforms, condition-specific expression and transcriptional structure. The methods, tools and technologies used to perform RNA-Seq analysis continue to change, creating a bioinformatics challenge for researchers who wish to exploit these data. Resources that bring together genomic data, analysis tools, educational material and computational infrastructure can minimize the overhead required of life science researchers.
RESULTS: RNA-Rocket is a free service that provides access to RNA-Seq and ChIP-Seq analysis tools for studying infectious diseases. The site makes available thousands of pre-indexed genomes, their annotations and the ability to stream results to the bioinformatics resources VectorBase, EuPathDB and PATRIC. The site also provides a combination of experimental data and metadata, examples of pre-computed analysis, step-by-step guides and a user interface designed to enable both novice and experienced users of RNA-Seq data.
AVAILABILITY AND IMPLEMENTATION: RNA-Rocket is available at rnaseq.pathogenportal.org. Source code for this project can be found at github.com/cidvbi/PathogenPortal.
CONTACT: anwarren@vt.edu
SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.

Year: 2015 PMID: 25573919 PMCID: PMC4410666 DOI: 10.1093/bioinformatics/btv002

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Transcriptomic analysis using high-throughput sequencing continues to increase in popularity owing to its low sequencing costs, sensitivity, reproducibility and ability to sample the entire transcriptome (Wang). Because RNA-Seq is an active area of research, its protocols, data types and tools continue to vary and evolve. As a result, there is no ‘one size fits all’ solution for RNA-Seq analysis. The range of considerations facing life scientists who want to leverage this technology can demand a significant investment of time and resources.
For this reason, we have created RNA-Rocket, an RNA-Seq analysis service that enables infectious disease research for prokaryotic and eukaryotic pathogens as well as vectors and host genomes. RNA-Rocket is built on Galaxy (Blankenberg; Giardine; Goecks), with modifications that simplify the process for routine use and provide a guided user experience. RNA-Rocket integrates data from the PATRIC, EuPathDB and VectorBase Bioinformatics Resource Centers (BRCs) and is provided by Pathogen Portal (pathogenportal.org), a resource linking all BRCs funded by the National Institute of Allergy and Infectious Diseases (NIAID). The RNA-Rocket service leverages multiple open-source software tools to provide a free resource where users can upload their RNA-Seq data, align them against a genome and generate quantitative transcript profiles. The service also streams alignment and annotation results back to the appropriate BRC so that users can view their results there, using the annotations and tools that the BRC provides in support of transcriptomic analysis.
The Pathosystems Resource Integration Center (PATRIC) is the all-bacteria Bioinformatics Resource Center (patricbrc.org) (Wattam). PATRIC provides researchers with an online resource that stores and integrates a variety of data types (e.g. genomics, transcriptomics, protein-protein interactions, 3-D protein structures and sequence typing data) along with any associated metadata. The eukaryotic pathogen databases (EuPathDB: eupathdb.org) provide access to a variety of data types from important human and veterinary parasites such as Plasmodium (malaria), Cryptosporidium (cryptosporidiosis) and Kinetoplastida (e.g. Trypanosoma brucei and Leishmania species) (Aurrecoechea). VectorBase (vectorbase.org) is a bioinformatics resource for invertebrate vectors of human parasites and pathogens (Megy). It currently hosts the genomes of 35 organisms, including mosquitoes (20 of which are Anopheles species), tsetse flies, ticks, lice, kissing bugs and sandflies.

2 Implementation

RNA-Rocket takes advantage of many open-source projects to enable users to upload and analyze their own data. We use the Galaxy system to consolidate and provide the tools and services necessary to process high-throughput sequencing data. Galaxy offers several benefits: it records provenance information for data creation, including the tools and parameters used to process data; it supports batch analysis of multiple samples; it provides a mechanism for sharing results across research groups and for publishing them for external reference, for example in presentations or publications; and it integrates tools and projects from the larger bioinformatics community.
Before users can run an analysis on the RNA-Rocket site, they must first upload their data in FASTQ format. Using standard Galaxy interfaces, RNA-Rocket supports upload via URL, FTP, HTTP and direct transfer via the European Nucleotide Archive (Leinonen), which can be searched using ENA, SRA, GEO and ArrayExpress identifiers.
To enable basic RNA-Seq processing, RNA-Rocket provides users with a set of pre-determined Galaxy workflows configured to use existing BRC genomes and annotations. The primary workflow for RNA-Seq analysis aligns short-read data to a reference genome using Bowtie2 (Langmead and Salzberg, 2012) or TopHat2 (Kim), assembles transcripts using Cufflinks, and generates coverage bedGraph and BigWig files using BEDTools (Quinlan and Hall, 2010) and UCSC tools (Kuhn), respectively. This workflow generates BAM files and tab-delimited output, which can be used to determine transcript structure and the level of expression in the target organism. The site also supports differential expression analysis using Cuffdiff, part of the Cufflinks suite, and visualization of the resulting data using CummeRbund (Trapnell).
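The primary workflow described above can be sketched as an ordered list of command lines. This is a minimal illustration, not the Galaxy workflow that RNA-Rocket actually ships: the function name, the file names and the choice of single-end Bowtie2 alignment are all assumptions, and the function only builds the commands without running them.

```python
from pathlib import Path


def build_rnaseq_commands(reads_fastq, index_prefix, annotation_gtf,
                          chrom_sizes, out_dir):
    """Return (argv, stdout_path) steps for one single-end sample, in run order.

    All paths are hypothetical placeholders. stdout_path is None unless the
    tool writes its result to standard output and needs a redirect.
    """
    out = Path(out_dir)
    sam = str(out / "aligned.sam")
    bam = str(out / "aligned.sorted.bam")
    bedgraph = str(out / "coverage.bedGraph")
    bigwig = str(out / "coverage.bw")
    return [
        # 1. Align reads against a pre-built Bowtie2 index.
        (["bowtie2", "-x", str(index_prefix), "-U", str(reads_fastq),
          "-S", sam], None),
        # 2. Coordinate-sort the alignments and write BAM.
        (["samtools", "sort", "-o", bam, sam], None),
        # 3. Assemble transcripts, using the annotation as a guide (-g).
        (["cufflinks", "-g", str(annotation_gtf),
          "-o", str(out / "cufflinks"), bam], None),
        # 4. Per-base coverage in bedGraph format; bedtools prints to stdout.
        (["bedtools", "genomecov", "-ibam", bam, "-bg"], bedgraph),
        # 5. Convert to BigWig with the UCSC utility. It expects a sorted
        #    bedGraph and a two-column chromosome-sizes file.
        (["bedGraphToBigWig", bedgraph, str(chrom_sizes), bigwig], None),
    ]
```

Each step could be passed to `subprocess.run`, redirecting standard output to `stdout_path` where one is given. A TopHat2-based variant would replace steps 1 and 2.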
When users submit jobs to RNA-Rocket, the jobs are queued and run on a first-come, first-served basis on a compute cluster using a modern, high-density computer architecture. The site features an interactive concept diagram, which highlights the appropriate processing step(s) based on the concept in which the user is interested. Clicking on a component of the concept diagram displays information about the processing steps that fulfill that component (Fig. 1). To assist users, the site provides a ‘Launch Pad’ menu system that breaks down the context and rationale for executing a particular step, the input required from the user and the output that is generated.

Fig. 1. Data flow in RNA-Rocket

RNA-Rocket is designed to make users aware of three major concerns: the quality of the base calls in their sequencing reads, the number of reads aligned from their sample and PCR bias. Base call quality can vary depending on the sequencing technology and sample preparation. Poor-quality sequencing can reduce the certainty with which a read can be mapped to the genome (Li). Although modern aligners endeavor to account for Phred quality scores when performing alignment, users should be aware that low-quality input can lead to low levels of alignment. To help users account for this, RNA-Rocket provides access to the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) for determining the quality profile of sample reads, as well as Sickle (https://github.com/najoshi/sickle) and Trimmomatic (Bolger), which can be used to automatically trim low-quality base calls from the ends of sequencing reads. Once alignment is performed, the user also has the option to check the quality profile and the number of reads mapped using a modified version of the SAMStat tool (Lassmann). The RNA-Rocket site highlights these quality control steps in both the concept diagram and the interactive menu system. These three options enable the researcher to maximize the amount of their sample being used through iterative pre-processing, alignment and evaluation.
RNA-Rocket also provides a quality control option for removing PCR bias that can occur as a result of library preparation (Aird). Certain sequences may be overrepresented if their composition leads to bias in the amplification process (Benjamini and Speed, 2012). For paired-end data, this option uses Picard tools (http://picard.sourceforge.net/) to ‘collapse’ multiple read pairs that have identical coordinates for the first and second read into a single representative pair.
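The idea of collapsing read pairs with identical coordinates can be illustrated with a short sketch. It is not Picard's implementation: the `ReadPair` record, its field names and the rule for choosing the representative (the pair with the highest summed base quality) are assumptions made for this example.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for an aligned read pair."""
    name: str
    chrom: str
    start1: int            # aligned start of the first read
    strand1: str
    start2: int            # aligned start of the second read
    strand2: str
    base_quality_sum: int  # summed base qualities over both reads


def collapse_duplicate_pairs(pairs):
    """Keep one representative per group of pairs with identical coordinates.

    Assumption of this sketch: the representative is the pair with the
    highest summed base quality; ties keep the pair seen first.
    """
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start1, pair.strand1,
               pair.start2, pair.strand2)
        kept = best.get(key)
        if kept is None or pair.base_quality_sum > kept.base_quality_sum:
            best[key] = pair
    # Dicts preserve insertion order, so groups come back in first-seen order.
    return list(best.values())
```

For example, two pairs that both align to chr1 at positions 100 and 350 on the same strands form one group, and only the higher-quality pair is returned. Pairs with different coordinates pass through unchanged.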
In addition to RNA-Seq analysis, RNA-Rocket supports ChIP-Seq (chromatin immunoprecipitation sequencing) analysis by providing access to the peak-calling program MACS (Zhang). After mapping the reads to the reference genome, a user may use MACS to identify and quantify genomic regions enriched in ChIP signal. The mapped reads and genome coverage can be viewed directly in the BRC genome browsers, and the peak-calling results can be viewed and downloaded via the RNA-Rocket web interface.
The RNA-Rocket site is updated daily with genomes and annotations from each of the contributing BRCs. Thousands of genomes are organized and indexed using Bowtie2 (Langmead and Salzberg, 2012) and SAMtools (Li) to enable alignment and bias correction for abundance estimation, respectively. Reference resources are organized by BRC so that results can be streamed and analyzed within the context of the data provider.

3 Results

The RNA-Rocket project modifies the existing Galaxy code so that Galaxy workflows can be constructed in advance by system administrators and shared with users through a tiered menu system. This menu system, referred to as ‘Launch Pad’, organizes RNA-Seq processing steps conceptually and gives increasing detail as the user progresses towards launching a job, i.e. an analysis step. The system also asks users to populate their project space, known as a ‘history’ in Galaxy terms, with the necessary files before they configure the parameters for a job. This is designed to minimize confusion when setting up an analysis and to promote organization in projects involving many files and processing steps.
Using the workflow system, RNA-Rocket provides pre-formulated solutions for common problems that users encounter. These workflows are easily adapted to new tools and are publicly available for download to enable offline analysis and customization by researchers.
RNA-Rocket provides example data from each of the BRC projects so that users can familiarize themselves with the site using real data. Some of these data are provided through the Driving Biological Project (DBP) initiative, which comprises research projects competitively enabled through NIAID’s BRC program and designed to drive innovation at the BRC sites based on the needs of the research community. Each dataset is provided as both ‘before’ and ‘after’ project spaces so that users can import the project into their own user space, run their own analysis and view pre-existing results. See the Supplementary Material for more details on these data. By combining experimental concepts with file-based requirements, the user interface aims to guide life science researchers through the process of RNA-Seq data analysis while making them aware of quality control caveats.
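The Launch Pad requirement that a history hold the necessary files before a job is configured amounts to a simple precondition check. The sketch below is purely illustrative: the dictionary layout, the `datatype` key and the function name are hypothetical and do not reflect Galaxy's or RNA-Rocket's internal data model.

```python
def missing_inputs(history, required_datatypes):
    """Return the required datatypes that are absent from a history.

    `history` is a list of dicts with a hypothetical "datatype" key.
    The result keeps the order of `required_datatypes`.
    """
    present = {item["datatype"] for item in history}
    return [dt for dt in required_datatypes if dt not in present]
```

A menu step could refuse to open its parameter form while `missing_inputs(history, required)` is non-empty and instead list the files the user still needs to add.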
After results have been computed at the RNA-Rocket site, they can be streamed back to the respective BRC, depending on the reference organism selected for analysis. This allows users to process and analyze their RNA-Seq data remotely without having to download potentially large files to their own computers.
RNA-Rocket is a free service that life-science researchers can use to process and analyze their RNA-Seq data. The site maintains up-to-date genome and annotation data through NIAID’s Bioinformatics Resource Centers. By leveraging BRC data and the Galaxy system, the RNA-Rocket project can provide up-to-date tools and capability despite the rapidly changing landscape of RNA-Seq analysis.

References

1. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

2. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

3. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

4. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

5. SAMStat: monitoring biases in next generation sequencing data.

Authors: Timo Lassmann; Yoshihide Hayashizaki; Carsten O Daub
Journal: Bioinformatics Date: 2010-11-18 Impact factor: 6.937

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

7. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries.

Authors: Daniel Aird; Michael G Ross; Wei-Sheng Chen; Maxwell Danielsson; Timothy Fennell; Carsten Russ; David B Jaffe; Chad Nusbaum; Andreas Gnirke
Journal: Genome Biol Date: 2011-02-21 Impact factor: 13.583

8. The European Nucleotide Archive.

Authors: Rasko Leinonen; Ruth Akhtar; Ewan Birney; Lawrence Bower; Ana Cerdeno-Tárraga; Ying Cheng; Iain Cleland; Nadeem Faruque; Neil Goodgame; Richard Gibson; Gemma Hoad; Mikyung Jang; Nima Pakseresht; Sheila Plaister; Rajesh Radhakrishnan; Kethi Reddy; Siamak Sobhany; Petra Ten Hoopen; Robert Vaughan; Vadim Zalunin; Guy Cochrane
Journal: Nucleic Acids Res Date: 2010-10-23 Impact factor: 16.971

9. VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics.

Authors: Karine Megy; Scott J Emrich; Daniel Lawson; David Campbell; Emmanuel Dialynas; Daniel S T Hughes; Gautier Koscielny; Christos Louis; Robert M Maccallum; Seth N Redmond; Andrew Sheehan; Pantelis Topalis; Derek Wilson
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

10. Model-based analysis of ChIP-Seq (MACS).

Authors: Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal: Genome Biol Date: 2008-09-17 Impact factor: 13.583
