easyRNASeq: a bioconductor package for processing RNA-Seq data.

Nicolas Delhomme, Ismaël Padioleau, Eileen E Furlong, Lars M Steinmetz.

Abstract

MOTIVATION: RNA sequencing is becoming a standard for expression profiling experiments, and many tools have been developed in the past few years to analyze RNA-Seq data. Numerous 'Bioconductor' packages are available for loading next-generation sequencing data in R, e.g. ShortRead and Rsamtools, as well as for performing differential gene expression analyses, e.g. DESeq and edgeR. However, the processing tasks lying in between require the precise interplay of many Bioconductor packages, e.g. Biostrings and IRanges; otherwise, external solutions must be sought.
RESULTS: We developed 'easyRNASeq', an R package that simplifies the processing of RNA sequencing data, hiding the complex interplay of the required packages behind a single function.
AVAILABILITY: The package is implemented in R (as of version 2.15) and is available from Bioconductor (as of version 2.10) at http://bioconductor.org/packages/release/bioc/html/easyRNASeq.html, where installation and usage instructions can be found.
CONTACT: delhomme@embl.de

Year: 2012; PMID: 22847932; PMCID: PMC3463124; DOI: 10.1093/bioinformatics/bts477

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Since RNA sequencing became widely used for expression profiling (RNA-Seq, Mortazavi et al.), numerous tools have been developed as part of the R/Bioconductor project (Gentleman et al.) to load RNA-Seq data in R. The first, ‘ShortRead’ (Morgan et al.), parses manufacturer-specific formats. It gave way to the ‘Rsamtools’ package as the ‘SAM/BAM’ format (Li et al.) became a de facto standard for reporting next-generation sequencing (NGS) alignment data. In parallel, analysis packages were adapted (‘edgeR’, Robinson et al.) or newly developed (‘DESeq’, Anders and Huber, 2010) to accommodate NGS specificities. Recently, the ‘Bioconductor’ Core Team released several packages to connect these parts of the process, e.g. GenomicRanges and GenomicFeatures. However, combining them appropriately requires a good understanding of their functionalities and, depending on the data, different combinations of these packages have to be used, which implies a long learning curve for RNA-Seq neophytes. Meanwhile, the increase in sequencer yield has led to the generalization of protocols that allow several samples to be run in a single lane, a process called ‘multiplexing’ (Lefrançois et al.). De-multiplexing the resulting data is a processing step that no R package currently addresses. Here, we describe ‘easyRNASeq’, an R package that eases RNA-Seq processing by combining the necessary packages in a single wrapper that ensures the pertinence of the provided data and information and helps users avoid RNA-Seq processing pitfalls. In addition, it introduces functionalities to handle data produced by recent NGS protocols.

2 easyRNASeq

The easyRNASeq package combines the following steps: reading in sequenced reads, retrieving annotations, summarizing read counts by the feature of interest (e.g. exon or gene) and, finally, reporting results, normalized or not, in formats suitable for downstream analyses. This is achieved by using and extending the functionalities of many Bioconductor packages (Fig. 1), and is provided to end users as a single function that wraps the entire process.

Fig. 1. Packages wrapped by easyRNASeq. At every step of the process, easyRNASeq encapsulates and extends lower level package functionalities, finally merging them into a single high-level function: easyRNASeq.

2.1 Reading data

Depending on the alignment format, either manufacturer-specific (e.g. Illumina export) or the de facto standard BAM (Binary Alignment/Map), the data are parsed by ShortRead or Rsamtools, respectively. The coverage is extracted per base pair and divided by the read length, i.e. the reads’ coverage proportion is reported per base pair, and stored in an ‘IRanges’ run-length encoded (RLE) vector. When applied to non-spliced regions, this approach yields results identical to the common approach of counting reads per se, as shown in the ‘RNASeqTutorial’ vignette. However, it assigns reads spanning exon–exon junctions (EEJ) more accurately: unlike common methods that arbitrarily select one side of an EEJ, the read coverage proportion is here distributed across the EEJ.
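The coverage-proportion idea described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, it does not reproduce the package's code or any IRanges structure, and the function names and toy coordinates are assumptions made purely for illustration. Each aligned base contributes 1/read length, so summing the per-base values over a region returns the read count when reads lie entirely within it, and a fractional count on each side of a junction for a spliced read.

```python
from collections import defaultdict
from fractions import Fraction


def coverage_proportion(reads, read_length):
    """Return per-base coverage divided by the read length.

    `reads` is a list of reads; each read is a list of (start, end) aligned
    blocks, 1-based and inclusive. A spliced read has more than one block.
    """
    cov = defaultdict(Fraction)
    for blocks in reads:
        for start, end in blocks:
            for pos in range(start, end + 1):
                cov[pos] += Fraction(1, read_length)
    return cov


def summarize(cov, start, end):
    """Sum the coverage proportion over a region such as an exon."""
    return sum(cov.get(pos, Fraction(0)) for pos in range(start, end + 1))


# Hypothetical toy data: 4-bp reads, exon A = 1..10, exon B = 21..30.
reads = [
    [(2, 5)],             # unspliced read inside exon A
    [(4, 7)],             # unspliced read inside exon A
    [(8, 10), (21, 21)],  # spliced read: 3 bases in exon A, 1 base in exon B
]
cov = coverage_proportion(reads, read_length=4)
count_a = summarize(cov, 1, 10)   # two whole reads plus 3/4 of the spliced read
count_b = summarize(cov, 21, 30)  # the remaining 1/4 of the spliced read
```

In this toy case the two exon totals add up to the three reads supplied, which is the property that makes the proportional assignment comparable to plain read counting.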

2.2 Loading annotations

Genic annotations are retrieved using ‘biomaRt’ (Durinck et al.) or read from ‘General Feature Format 3’ (GFF3) or ‘Gene Transfer Format’ (GTF) files using ‘genomeIntervals’. The annotation set is stored in a ‘RangedData’ object (IRanges). To reduce loading times, ‘RangedData’ or ‘GRangesList’ objects from the R environment or RData (rda) files can be used, provided they describe exons, features, transcripts or genes (a ‘feature’ is the representation of a genomic locus, not necessarily genic, e.g. an enhancer).
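As a rough illustration of what reading exon annotations from a GFF3 file involves, the sketch below parses the nine tab-separated GFF3 columns and keeps the exon records. It is plain Python, not the ‘genomeIntervals’ reader used by the package; the function name and the example records are hypothetical.

```python
def parse_gff3_exons(lines):
    """Yield (seqid, start, end, strand, parent) for each exon record."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # keep only well-formed exon records
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = fields
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in attrs.split(";") if "=" in item
        )
        yield seqid, int(start), int(end), strand, attributes.get("Parent")


# Hypothetical records, for illustration only.
gff3 = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\texample\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=tx1",
    "chr1\texample\texon\t300\t500\t.\t+\t.\tID=exon2;Parent=tx1",
]
exons = list(parse_gff3_exons(gff3))
```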

2.3 Counting reads

The reads’ coverage is summarized according to the chosen features: exons, features, transcripts or gene models. Here, a gene model is defined as the set of non-overlapping loci (i.e. synthetic exons) that represent all the possible exons and untranslated regions of a gene. easyRNASeq is not limited to genic summarization; e.g. promoter ‘features’ can be used to look for eRNAs (Kim et al.).
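One plausible way to build such a gene model is to merge the exons of all transcripts of a gene into non-overlapping intervals. The sketch below does this in plain Python. It is an assumption about the construction, not the package's implementation: the function name is hypothetical, and the choice to merge directly adjacent exons as well as overlapping ones is an illustrative one.

```python
def synthetic_exons(exons):
    """Merge overlapping or adjacent exons into non-overlapping loci.

    `exons` is an iterable of (start, end) pairs, 1-based and inclusive,
    pooled from all transcripts of one gene.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or abuts the previous locus: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(locus) for locus in merged]


# Hypothetical exons pooled from several transcripts of one gene.
transcript_exons = [(100, 200), (150, 250), (400, 500), (450, 480), (600, 700)]
gene_model = synthetic_exons(transcript_exons)
```

Summarizing over the merged loci means a base shared by several transcripts is counted once per gene rather than once per transcript.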

2.4 Output

Four formats are offered: count table (the default), ‘CountDataSet’ (DESeq object), ‘DGEList’ (edgeR object) and ‘RNAseq’ (easyRNASeq object). The count table reports raw counts or, if preferred, ‘reads per kb of feature per million reads’ (RPKM, Mortazavi et al.). For DESeq and edgeR, an object of their respective class is returned. If desired, these objects are first subjected to their respective normalization, in which case quality assessment (QA) plots are drawn to evaluate it. Finally, RNAseq objects allow different count summarizations (e.g. by exon, by gene) to be performed on the same data without re-processing, a useful feature when first assessing a dataset.
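The RPKM definition quoted above translates directly into a short calculation. The sketch below is plain Python with a hypothetical function name and made-up numbers; which reads enter the per-million denominator (e.g. all reads or only aligned reads) is left to the caller and is not specified here.

```python
def rpkm(count, feature_length_bp, total_reads):
    """Reads per kb of feature per million reads.

    count             reads (or summed coverage proportion) in the feature
    feature_length_bp feature length in base pairs
    total_reads       number of reads used as the per-million denominator
    """
    if feature_length_bp <= 0 or total_reads <= 0:
        raise ValueError("feature length and total reads must be positive")
    # 1e9 = 1e3 (bp per kb) * 1e6 (reads per million)
    return count * 1e9 / (feature_length_bp * total_reads)


# Made-up example: 500 reads on a 2 kb feature in a 10 M read library.
example = rpkm(count=500, feature_length_bp=2000, total_reads=10_000_000)
```

Here 500 reads over 2 kb is 250 reads per kb, and dividing by 10 million reads gives 25 RPKM.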

2.5 Performance

Fetching annotations using ‘biomaRt’ takes about 10 min over an average network connection. Generating the gene models takes up to 15 min for large genomes, e.g. Homo sapiens. If these annotations are readily available, processing a 36-bp single-end ‘Illumina’ lane with 100 M reads (a BAM file of 3.2 GB) requires 3 GB RAM and 3 min on an Intel 2.4 GHz CPU.

3 DE-MULTIPLEXING

The current data yield allows several samples to be sequenced as a single library, in which short nucleotide ‘barcodes’ (4–6 bp) uniquely identify the samples. The resulting sequences must be ‘de-multiplexed’, a functionality introduced by easyRNASeq that splits the result file into sample-specific files. To account for sequencing errors, flexibility in identifying the barcode is necessary; it is achieved using thresholds based on the Hamming distance (Hamming, 1950). QA plots help validate the chosen thresholds and assess the multiplexing efficacy.
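Threshold-based barcode matching of this kind can be sketched as follows. This is a plain Python illustration, not the package's implementation: the function names, the `max_mismatch` parameter, the rule that ties are left unassigned and the assumption that the barcode occupies the first bases of each read are all hypothetical choices made for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, barcodes, max_mismatch=1):
    """Return the sample whose barcode is uniquely closest, or None."""
    distances = sorted(
        (hamming(observed, barcode), sample)
        for sample, barcode in barcodes.items()
    )
    best_distance, best_sample = distances[0]
    if best_distance > max_mismatch:
        return None  # too many mismatches
    if len(distances) > 1 and distances[1][0] == best_distance:
        return None  # ambiguous: two barcodes are equally close
    return best_sample


def demultiplex(reads, barcodes, max_mismatch=1):
    """Split reads into per-sample bins, trimming the leading barcode."""
    barcode_length = len(next(iter(barcodes.values())))
    bins = {sample: [] for sample in barcodes}
    bins["unassigned"] = []
    for read in reads:
        prefix = read[:barcode_length]
        sample = None
        if len(prefix) == barcode_length:
            sample = assign_barcode(prefix, barcodes, max_mismatch)
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(read[barcode_length:])
    return bins


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}
reads = ["ACGTTTTTGG", "TGCAAACCGG", "ACGAGGGGCC", "GGGGACGTAC"]
bins = demultiplex(reads, barcodes, max_mismatch=1)
```

Raising `max_mismatch` recovers more reads with sequencing errors in the barcode but increases the risk of ambiguous or wrong assignments, which is why the threshold needs to be validated against the barcode set in use.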

4 CONCLUSIONS

This note presents the Bioconductor easyRNASeq package. It introduces a method that effectively hides the complex interplay of numerous Bioconductor packages from the end user. Its output can be formatted for further processing by analysis packages such as DESeq or edgeR. Finally, it contains additional features such as de-multiplexing and support for gapped alignments. Future developments will integrate recent trends such as strand-specific sequencing or differential exon usage detection methods.

REFERENCES

1. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.

Authors: Steffen Durinck; Yves Moreau; Arek Kasprzyk; Sean Davis; Bart De Moor; Alvis Brazma; Wolfgang Huber
Journal: Bioinformatics Date: 2005-08-15 Impact factor: 6.937

2. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors: Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

5. Widespread transcription at neuronal activity-regulated enhancers.

Authors: Tae-Kyung Kim; Martin Hemberg; Jesse M Gray; Allen M Costa; Daniel M Bear; Jing Wu; David A Harmin; Mike Laptewicz; Kellie Barbara-Haley; Scott Kuersten; Eirene Markenscoff-Papadimitriou; Dietmar Kuhl; Haruhiko Bito; Paul F Worley; Gabriel Kreiman; Michael E Greenberg
Journal: Nature Date: 2010-04-14 Impact factor: 49.962

6. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

7. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

8. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.

Authors: Martin Morgan; Simon Anders; Michael Lawrence; Patrick Aboyoun; Hervé Pagès; Robert Gentleman
Journal: Bioinformatics Date: 2009-08-03 Impact factor: 6.937

9. Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing.

Authors: Philippe Lefrançois; Ghia M Euskirchen; Raymond K Auerbach; Joel Rozowsky; Theodore Gibson; Christopher M Yellman; Mark Gerstein; Michael Snyder
Journal: BMC Genomics Date: 2009-01-21 Impact factor: 3.969
