Girafe--an R/Bioconductor package for functional exploration of aligned next-generation sequencing reads.

Joern Toedling, Constance Ciaudo, Olivier Voinnet, Edith Heard, Emmanuel Barillot.

Abstract

The R/Bioconductor package girafe facilitates the functional exploration of alignments of sequence reads from next-generation sequencing data to a genome. It allows users to investigate the genomic intervals together with the aligned reads and to work with, visualise and export these intervals. Moreover, the package operates within and extends the ever-growing Bioconductor framework and thus enables users to leverage a multitude of methods for their data in order to answer specific research questions.
AVAILABILITY AND IMPLEMENTATION: The R package girafe is available from the Bioconductor web site: http://www.bioconductor.org/packages/release/bioc/html/girafe.html. An extensive vignette and the Bioconductor mailing lists provide additional documentation and help for using the package.

Year:  2010        PMID: 20861030      PMCID: PMC2971573          DOI: 10.1093/bioinformatics/btq531

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Next-generation sequencing (NGS) technologies provide users with millions of comparatively short RNA or DNA reads from biological samples of interest. The first step in the analysis of these data is usually to align the reads to the chosen reference genome, using powerful, specific alignment tools. For the secondary analysis that follows, there is a need for an integrated work environment that allows users to explore and annotate the genome intervals with the aligned reads. Besides providing functionality for data exploration, this environment must also provide interfaces to other available tools, especially as the field of NGS analysis software is comparatively new and rapidly evolving. Previously described, excellent tools that provide frameworks for working with aligned reads include Galaxy (Giardine et al., 2005) and SAMtools (Li et al., 2009). Here, we describe the R/Bioconductor package girafe, which enables users to investigate the genome intervals with the aligned reads, henceforth referred to as aligned intervals, and to work with, visualise and export these aligned intervals. One advantage of girafe over other tools for working with aligned reads is that the package operates within the open source, open development and constantly growing Bioconductor framework (Gentleman et al., 2004). Thus, this package enables users to leverage a multitude of methods in the analysis of their data in order to answer specific research questions.

2 AVAILABLE FUNCTIONALITY

In the following, we present some functionalities of girafe. The package is built on, and greatly enhances, the functionalities of the Bioconductor packages genomeIntervals and ShortRead (Morgan et al., 2009). For this demonstration, we use example data downloaded from the Gene Expression Omnibus database (Edgar et al., 2002, GSE10364). The data are Solexa reads obtained from small RNA profiling of mouse oocytes (Tam et al., 2008).

Importing aligned reads: The reads were mapped to the mouse genome (assembly mm9) using the Bowtie aligner (Langmead et al., 2009). The resulting file can be read into R, using the ShortRead package, and converted into an object of class AlignedGenomeIntervals, the core class of package girafe.

Exploring aligned intervals: Objects of this class can easily be explored using standard R functions to obtain summary statistics answering questions, such as (i) how long are the reads aligned to specific intervals? or (ii) how many intervals are located on each chromosome?

Processing the aligned intervals: Basic interval operations, such as sorting, shifting and determining intersections and unions of interval sets, are readily supported. Moreover, the function reduce provides a flexible way to combine, or merge, aligned intervals. One intention could be to combine aligned reads at exactly the same position, which only differ in their sequence due to sequencing errors. Another example objective could be to combine overlapping short reads that may be (degradation) products of the same primary transcript.

Visualisation: The package girafe contains functions for visualising aligned intervals with the powerful plotting facilities of R. These visualisation functions are a flexible alternative to those provided by genome browsers and may be especially relevant for sequencing data from organisms which are not well represented in genome browsers. Figure 1 shows aligned intervals from the example data in a 500 bp region on the X chromosome. The reads aligned in this region correspond to two miRNAs reported to be highly expressed in these data (Tam et al., 2008).
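The two exploratory questions mentioned above (how long are the aligned reads, and how many intervals lie on each chromosome) amount to simple tabulations. The following is a minimal, language-neutral sketch in Python rather than R; it does not use girafe, and the `AlignedInterval` record, its fields and the example coordinates are hypothetical stand-ins chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedInterval:
    """Hypothetical stand-in for one aligned interval (not the girafe class)."""
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"
    reads: int   # number of reads aligned to exactly this interval


def intervals_per_chromosome(intervals):
    """Count how many aligned intervals are located on each chromosome."""
    return Counter(iv.chrom for iv in intervals)


def width_distribution(intervals):
    """Tabulate interval widths, i.e. the lengths of the aligned reads."""
    return Counter(iv.end - iv.start + 1 for iv in intervals)


# Made-up example data, not taken from the GSE10364 data set.
example = [
    AlignedInterval("chrX", 100, 121, "-", 5),
    AlignedInterval("chrX", 150, 171, "-", 2),
    AlignedInterval("chr1", 10, 32, "+", 1),
]

per_chrom = intervals_per_chromosome(example)
widths = width_distribution(example)
```

In R, the same tabulations would typically be obtained with standard functions such as `table` and `summary` applied to the accessors of the interval object, as described in the package vignette.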
Fig. 1. Visualisation of genome intervals with aligned reads. Two miRNAs (shown in pink) have been annotated on the Crick strand within this region on the X chromosome. Reads were exclusively aligned to the Crick strand in this region and mostly correspond to the mature and star sequences of the miRNAs. Intervals with uniquely aligned reads are shown in blue while grey marks non-unique reads. For mmu-mir-503 (left), the original data are shown while for mmu-mir-322 (right) the overlapping AlignedGenomeIntervals have been merged using the function reduce.
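The two merging strategies described for reduce (combining reads at exactly the same position, or combining overlapping reads) can be illustrated with a short sketch. This is Python written for illustration only, not the girafe implementation; the `Interval` record, the `merge_intervals` function and its `exact` flag are assumptions of this sketch, and read counts are simply summed on merging.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical aligned interval with a read count (not the girafe class)."""
    chrom: str
    strand: str
    start: int  # inclusive
    end: int    # inclusive
    reads: int


def merge_intervals(intervals, exact=False):
    """Merge intervals on the same chromosome and strand.

    exact=True  : merge only intervals with identical start and end.
    exact=False : merge any intervals that overlap by at least one base.
    The inputs are left unmodified; merged copies are returned.
    """
    merged = []
    ordered = sorted(intervals, key=lambda i: (i.chrom, i.strand, i.start, i.end))
    for iv in ordered:
        last = merged[-1] if merged else None
        same_group = (
            last is not None
            and (last.chrom, last.strand) == (iv.chrom, iv.strand)
        )
        if exact:
            combine = same_group and (last.start, last.end) == (iv.start, iv.end)
        else:
            combine = same_group and iv.start <= last.end
        if combine:
            last.end = max(last.end, iv.end)
            last.reads += iv.reads
        else:
            merged.append(Interval(iv.chrom, iv.strand, iv.start, iv.end, iv.reads))
    return merged


# Made-up example: two reads at the same position, one overlapping, one distant.
reads = [
    Interval("chrX", "-", 100, 121, 3),
    Interval("chrX", "-", 100, 121, 2),
    Interval("chrX", "-", 110, 131, 1),
    Interval("chrX", "-", 200, 221, 4),
]

exact_merge = merge_intervals(reads, exact=True)
overlap_merge = merge_intervals(reads, exact=False)
```

With exact merging, only the two reads at positions 100 to 121 are combined; with overlap merging, the read at 110 to 131 joins them as well, which corresponds to the merged display on the right-hand side of Figure 1.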

Summarising the data using sliding windows: The data can be searched for genome regions of defined interest using a sliding-window approach. For each window, the number of intervals with aligned reads, the total number of reads aligned, the number of unique reads aligned, the fraction of intervals on the Watson strand, and the highest number of aligned reads at a single interval within the window are reported.

Overlap with annotated genome features: A frequent task is to determine the overlap of the aligned intervals with genome elements that are described in databases, in order to annotate the aligned reads. girafe includes functions for efficiently determining these overlaps and allows the user to specify custom requirements, such as a minimum fraction of the total interval length, for considering intervals and features to be truly overlapping.

Exporting the data: The girafe package contains methods for exporting the data into tab-delimited text files, which can be uploaded to genome browsers for further visualisation and exploration. Currently supported formats include ‘bed’, ‘bedGraph’ and ‘wiggle’.

Vignette: The package vignette provides more detailed examples together with the corresponding R source code and discusses memory usage and interactions with other Bioconductor packages.
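The per-window statistics and the minimum-fraction overlap criterion described above can be sketched as follows. This is an illustrative Python sketch, not girafe code: the `Interval` record, the function names `window_summaries` and `overlaps`, and the parameters `winsize`, `step` and `min_frac` are all hypothetical. It assumes that "unique reads" means reads whose interval has a single match position in the genome, and it uses a naive scan over all intervals per window rather than an efficient index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical aligned interval on one chromosome (not the girafe class)."""
    start: int    # 1-based, inclusive
    end: int      # inclusive
    strand: str   # "+" = Watson, "-" = Crick
    reads: int    # number of reads aligned to this interval
    unique: bool  # True if the reads have a single match position (assumption)


def window_summaries(intervals, chrom_length, winsize, step):
    """Summarise aligned intervals in sliding windows along one chromosome."""
    summaries = []
    for win_start in range(1, chrom_length + 1, step):
        win_end = min(win_start + winsize - 1, chrom_length)
        hits = [iv for iv in intervals
                if iv.start <= win_end and iv.end >= win_start]
        if not hits:
            continue  # report only windows that contain aligned intervals
        summaries.append({
            "start": win_start,
            "end": win_end,
            "n_intervals": len(hits),
            "n_reads": sum(iv.reads for iv in hits),
            "n_unique": sum(iv.reads for iv in hits if iv.unique),
            "frac_plus": sum(iv.strand == "+" for iv in hits) / len(hits),
            "max_reads": max(iv.reads for iv in hits),
        })
    return summaries


def overlaps(interval, feature_start, feature_end, min_frac=0.0):
    """True if the interval shares at least one base with the feature and
    the shared part covers at least min_frac of the interval's length."""
    shared = min(interval.end, feature_end) - max(interval.start, feature_start) + 1
    if shared <= 0:
        return False
    width = interval.end - interval.start + 1
    return shared / width >= min_frac


# Made-up example data on a 1000 bp toy chromosome.
example = [
    Interval(100, 121, "-", 5, True),
    Interval(150, 171, "-", 2, False),
    Interval(900, 921, "+", 1, True),
]

summaries = window_summaries(example, chrom_length=1000, winsize=500, step=250)
half_covered = overlaps(example[0], 110, 200, min_frac=0.5)
strict = overlaps(example[0], 110, 200, min_frac=0.6)
```

In the example, the first interval shares 12 of its 22 bases with the feature spanning 110 to 200, so it counts as overlapping when at least half of the interval must be covered but not when 60% is required.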

3 CONCLUSION

The R/Bioconductor package girafe provides users with a powerful, flexible and extensible framework to explore NGS data, following alignment of the reads to a reference genome. The package interacts with other Bioconductor packages and allows export of the data in various formats for exploring them in other software, with the aim of not restricting the user to a limited set of analysis tools. The field of NGS analysis software is growing rapidly. Thus, future developments of the package will include adding further methods for working with aligned genome intervals, reducing the memory footprint, and providing additional interfaces to other R/Bioconductor packages and other software.
REFERENCES

1.  Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.

Authors:  Ron Edgar; Michael Domrachev; Alex E Lash
Journal:  Nucleic Acids Res       Date:  2002-01-01       Impact factor: 16.971

2.  Galaxy: a platform for interactive large-scale genome analysis.

Authors:  Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal:  Genome Res       Date:  2005-09-16       Impact factor: 9.043

3.  Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes.

Authors:  Oliver H Tam; Alexei A Aravin; Paula Stein; Angelique Girard; Elizabeth P Murchison; Sihem Cheloufi; Emily Hodges; Martin Anger; Ravi Sachidanandam; Richard M Schultz; Gregory J Hannon
Journal:  Nature       Date:  2008-04-10       Impact factor: 49.962

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

5.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

6.  Bioconductor: open software development for computational biology and bioinformatics.

Authors:  Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal:  Genome Biol       Date:  2004-09-15       Impact factor: 13.583

7.  ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.

Authors:  Martin Morgan; Simon Anders; Michael Lawrence; Patrick Aboyoun; Hervé Pagès; Robert Gentleman
Journal:  Bioinformatics       Date:  2009-08-03       Impact factor: 6.937
