QuasR: quantification and annotation of short reads in R.

Dimos Gaidatzis¹, Anita Lerch¹, Florian Hahne², Michael B Stadler¹.

Abstract

SUMMARY: QuasR is a package for the integrated analysis of high-throughput sequencing data in R, covering all steps from read preprocessing, alignment and quality control to quantification. QuasR supports different experiment types (including RNA-seq, ChIP-seq and Bis-seq) and analysis variants (e.g. paired-end, stranded, spliced and allele-specific), and is integrated in Bioconductor so that its output can be directly processed for statistical analysis and visualization.
AVAILABILITY AND IMPLEMENTATION: QuasR is implemented in R and C/C++. Source code and binaries for major platforms (Linux, OS X and MS Windows) are available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/QuasR.html). The package includes a 'vignette' with step-by-step examples for typical workflows.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2014 PMID: 25417205 PMCID: PMC4382904 DOI: 10.1093/bioinformatics/btu781

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

High-throughput sequencing has become a powerful research tool in a wide range of applications, such as transcriptome profiling (RNA-seq), measurement of DNA-protein interactions or chromatin modifications (ChIP-seq) and DNA methylation (Bis-seq). In recent years, there have been many efforts to provide software in R/Bioconductor (Gentleman et al., 2004) to simplify data processing and biological interpretation, such as an efficient framework for working with genomic ranges (Lawrence et al., 2013), or tools for read alignment (Liao et al., 2013), quality control (Morgan et al., 2009) and statistical analysis (Anders and Huber, 2010; Robinson et al., 2010). It is, however, still challenging to conduct a complete analysis from raw data to a publishable result in the form of a single R script, which would greatly improve documentation and thus facilitate the exchange of analysis details with coworkers. Often it is necessary to perform a subset of tasks outside of R, for example in the operating system's shell. Typically, the tools used in the analysis have to be downloaded and installed from distinct sources, and resolving software dependencies can be time-consuming or become a major obstacle for non-expert researchers trying to perform or reproduce an analysis. Here, we introduce the Bioconductor package QuasR, which aims to overcome these issues by abstracting many technical details of high-throughput sequencing data analysis. QuasR builds on top of the functionality provided by Bioconductor and external tools such as bowtie (Langmead et al., 2009) or SpliceMap (Au et al., 2010), and extends it to support additional analysis types, such as DNA methylation and allele-specific analysis. QuasR is available for all major platforms (Linux, OS X and MS Windows), and its output can be directly used for downstream analyses and visualization, thus allowing an uninterrupted workflow from raw data to scientific results.

2 Feature overview

The user interface of QuasR consists of only a handful of functions (Fig. 1) and a single class (qProject) that is returned by qAlign and serves as input to all downstream processing.

Fig. 1. QuasR consists of one class (qProject) and five main functions. Typical visualizations of the function outputs are shown as insets.

By default, qAlign uses bowtie to align single or paired-end reads to a reference genome, and unmapped reads are optionally further aligned to alternative references, for example to quantify the level of vector contamination. For convenience, the genome can be obtained through Bioconductor (42 genome assemblies for 21 different species are available in release 2.14), in which case it is automatically downloaded and indexed if necessary. qAlign also supports spliced alignments and alignment of bisulfite-converted reads (both directional and undirectional bisulfite libraries). Pre-existing alignments in BAM format (Li et al., 2009) that have been generated outside of QuasR can be imported, thereby enabling the use of any alignment software and strategy that produces output in the supported format. Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences, allowing it to recognize pre-existing BAM files so that they do not have to be recreated.

qQCReport produces a set of quality control plots that allow assessment of the technical quality of sequencing data and alignments, and help to identify over-represented reads and libraries with low sequence complexity.

qCount is the main function for quantification. It is used to count alignments in known genomic intervals (promoters, exons, genes, etc.) or in peak regions identified outside of QuasR. It avoids redundant counting of individual alignments (e.g. when combining transcripts from the same gene). qCount provides fine-grained control over quantification, for example to include only alignments that are (anti-)sense to the query region, to select alignments based on mapping quality or to report counts for exon–exon junctions.
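The idea of non-redundant counting can be illustrated with a minimal Python sketch. This is not QuasR code and does not reproduce qCount's implementation: the function names, the toy coordinates and the choice to assign each alignment by its start position are assumptions made for illustration. Exons of the same gene that overlap across transcripts are merged first, so that an alignment falling into the shared region is counted once for the gene.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def count_alignments(read_starts, exons_by_gene):
    """Count each alignment at most once per gene.

    read_starts   -- alignment start positions on a single chromosome
    exons_by_gene -- dict: gene name -> list of (start, end) exons, which
                     may overlap when several transcripts are combined
    """
    counts = {}
    for gene, exons in exons_by_gene.items():
        merged = merge_intervals(exons)
        starts = [s for s, _ in merged]
        n = 0
        for pos in read_starts:
            # Find the last merged interval that starts at or before pos.
            i = bisect_right(starts, pos) - 1
            if i >= 0 and pos < merged[i][1]:
                n += 1
        counts[gene] = n
    return counts


# Hypothetical toy data: two transcripts of geneA share part of an exon.
exons_by_gene = {
    "geneA": [(100, 200), (150, 250), (400, 500)],
    "geneB": [(1000, 1100)],
}
read_starts = [120, 160, 240, 260, 450, 1050]

counts = count_alignments(read_starts, exons_by_gene)

# Naive per-exon counting for geneA, which counts the read at 160 twice.
naive_gene_a = sum(
    1 for pos in read_starts for s, e in exons_by_gene["geneA"] if s <= pos < e
)

print(counts)
print(naive_gene_a)
```

In this toy data the naive per-exon sum for geneA is larger than the merged count, because the alignment starting at position 160 lies in both overlapping exons.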
The count tables produced by qCount can be directly used for statistical analysis in dedicated packages (Anders and Huber, 2010; Robinson et al., 2010). qProfile is similar to qCount, with the main difference that it returns a spatial profile of counts, giving the number of alignments at different positions relative to the query. qMeth is used in Bis-seq experiments to obtain the numbers of methylated and unmethylated cytosines for selected sequence contexts.

For experimental systems with known heterozygous loci (for example an F1 cross between two divergent mouse inbred strains), QuasR supports allele-specific analysis. To avoid alignment bias, qAlign automatically injects the known single nucleotide variations into the reference genome to produce two new versions of that genome. The reads are aligned to both genomes, and the best alignment for each read is retained. The quantification of such allele-specific alignments by qCount, qProfile or qMeth produces three numbers instead of one per sample and query feature, corresponding to the alignment counts for the reference allele, the alternative allele and the unclassifiable alignments. An alignment is unclassifiable if the read (or both reads in a paired-end experiment) did not overlap a known polymorphism.

All QuasR functions are designed to make use of the parallel package for parallel processing on computers with multiple cores or on compute clusters.

The package vignette (available at http://www.bioconductor.org/packages/release/bioc/vignettes/QuasR/inst/doc/QuasR.pdf) contains more details on QuasR functions, as well as step-by-step examples for typical analysis tasks. In addition, the supplementary online material provides QuasR code recipes illustrating the installation of the QuasR package, removal of adaptor sequences, quantification of RNA expression and DNA methylation, as well as allele-specific analysis.
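The allele-specific strategy described above (two genome versions, best alignment retained, three counts per feature) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the QuasR implementation: read positions are taken as given rather than found by an aligner, alignment quality is reduced to a mismatch count, ties are treated as unclassifiable, and the labels "R", "A" and "U" as well as all function names and sequences are hypothetical.

```python
from collections import Counter


def inject_snvs(reference, snvs, allele):
    """Return a copy of `reference` carrying one allele at every SNV.

    snvs   -- dict: 0-based position -> (reference base, alternative base)
    allele -- 0 for the reference allele, 1 for the alternative allele
    """
    seq = list(reference)
    for pos, bases in snvs.items():
        seq[pos] = bases[allele]
    return "".join(seq)


def mismatches(read, genome, start):
    """Number of mismatching bases when `read` is placed at `start`."""
    window = genome[start:start + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def classify(read, start, ref_genome, alt_genome):
    """Assign a read to the genome version it matches better."""
    m_ref = mismatches(read, ref_genome, start)
    m_alt = mismatches(read, alt_genome, start)
    if m_ref < m_alt:
        return "R"  # reference allele
    if m_alt < m_ref:
        return "A"  # alternative allele
    return "U"      # tie, e.g. the read overlaps no known polymorphism


# Hypothetical toy genome with two known SNVs.
reference = "ACGTACGTACGTACGTACGT"
snvs = {5: ("C", "T"), 14: ("G", "A")}
ref_genome = inject_snvs(reference, snvs, 0)
alt_genome = inject_snvs(reference, snvs, 1)

# (start position, read sequence); positions are assumed to be known.
reads = [(3, "TACGT"), (3, "TATGT"), (8, "ACGT"), (12, "ACATA")]

allele_counts = Counter(
    classify(seq, start, ref_genome, alt_genome) for start, seq in reads
)
print(dict(allele_counts))
```

The read at position 8 covers neither SNV, so it matches both genome versions equally well and ends up in the unclassifiable category, mirroring the third count described in the text.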

3 Conclusions

By abstracting technical details, QuasR greatly simplifies the analysis of high-throughput sequencing data and makes it accessible to a wider community. Even the installation of the required software tools can be done conveniently from within R, irrespective of the compute platform. Furthermore, through its integration with the Bioconductor infrastructure, genome sequences and gene annotation can be obtained in the same manner, and the R packaging system provides a solid infrastructure to track and document the versions of both software and annotation that are used in a given analysis, which is a prerequisite for reproducible research. QuasR unites the preprocessing and alignment of raw sequence reads with the numerous downstream analysis tools available in Bioconductor and for the R environment, and enables well-integrated, single-script workflows that document all steps of an analysis and their parameters in a format that is simple to share and reproduce.

References

1. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

2. Detection of splice junctions from paired-end RNA-seq data by SpliceMap.

Authors: Kin Fai Au; Hui Jiang; Lan Lin; Yi Xing; Wing Hung Wong
Journal: Nucleic Acids Res Date: 2010-04-05 Impact factor: 16.971

3. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

4. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

5. Software for computing and annotating genomic ranges.

Authors: Michael Lawrence; Wolfgang Huber; Hervé Pagès; Patrick Aboyoun; Marc Carlson; Robert Gentleman; Martin T Morgan; Vincent J Carey
Journal: PLoS Comput Biol Date: 2013-08-08 Impact factor: 4.475

6. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

7. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

8. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.

Authors: Martin Morgan; Simon Anders; Michael Lawrence; Patrick Aboyoun; Hervé Pagès; Robert Gentleman
Journal: Bioinformatics Date: 2009-08-03 Impact factor: 6.937

9. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.

Authors: Yang Liao; Gordon K Smyth; Wei Shi
Journal: Nucleic Acids Res Date: 2013-04-04 Impact factor: 16.971
