Michael Lawrence, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, Vincent J Carey.
Abstract
We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
Year: 2013 PMID: 23950696 PMCID: PMC3738458 DOI: 10.1371/journal.pcbi.1003118
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Tabular (top) and visual (bottom) representation of the exons for the human KRAS gene, derived from the UCSC known gene annotation.
In the table, the columns seqnames, start and end locate the exons in the genome. The strand column indicates the direction of transcription. The exons are grouped into transcripts by tx_id, and the exon IDs are given by exon_id. Virtually all genomic data sets fit this pattern: genomic location, followed by a series of columns, often including strand and/or score, that annotate that location. In the plot, the rectangles represent exonic regions, and the arrows represent the introns, as well as the strand.
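The location-plus-annotation pattern described in this caption can be sketched as a small record type. This is only an illustration in Python, not the GRanges class itself: the class name `GenomicRange`, the `meta` field and the coordinates below are hypothetical toy values, not the real KRAS exons. Coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass, field


@dataclass
class GenomicRange:
    """One annotated range: a genomic location plus free-form annotation columns."""
    seqname: str            # chromosome or contig name
    start: int              # 1-based, inclusive (assumed convention)
    end: int                # 1-based, inclusive
    strand: str = "*"       # "+", "-" or "*" (unknown / not applicable)
    meta: dict = field(default_factory=dict)  # e.g. tx_id, exon_id, score

    @property
    def width(self) -> int:
        # With inclusive endpoints, a range covering one base has width 1.
        return self.end - self.start + 1


# Toy rows shaped like the table in the figure (values are made up).
exons = [
    GenomicRange("chrT", 100, 199, "-", {"tx_id": 1, "exon_id": 10}),
    GenomicRange("chrT", 300, 349, "-", {"tx_id": 1, "exon_id": 11}),
    GenomicRange("chrT", 300, 420, "-", {"tx_id": 2, "exon_id": 12}),
]

# Group exon IDs into transcripts by tx_id, as the caption describes.
by_tx = {}
for ex in exons:
    by_tx.setdefault(ex.meta["tx_id"], []).append(ex.meta["exon_id"])
```

Here `by_tx` maps each transcript ID to its exon IDs, mirroring how the table groups exons by `tx_id`.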
Summary of the Ranges API.
| Category | Function | Description |
| Accessors | | Get or set the starts, ends and widths |
| | | Get or set the names |
| | | Get or set metadata on elements or object |
| | | Number of ranges in the vector |
| | | Range formed from min(start) and max(end) |
| Ordering | | Compare ranges, ordering by start then width |
| | | Sort by the ordering defined above |
| | | Find ranges with multiple instances |
| | | Find unique instances, removing duplicates |
| Arithmetic | | Shrink or expand ranges |
| | | Move the ranges by specified amount |
| | | Change width, anchoring on start, end or mid |
| | | Separation between ranges (closest endpoints) |
| | | Clamp ranges to within some start and end |
| | | Generate adjacent regions on start or end |
| Set operations | | Merge overlapping and adjacent ranges |
| | | Set operations on reduced ranges |
| | | Parallel set operations, on each |
| | | Find regions not covered by reduced ranges |
| | | Ranges formed from union of endpoints |
| Overlaps | | Find all overlaps for each |
| | | Count overlaps of each |
| | | Find nearest neighbors (closest endpoints) |
| | | Find nearest |
| | | Find ranges in |
| Coverage | | Count ranges covering each position |
| Extraction | | Get or set by logical or numeric index |
| | | Get integer sequence from |
| | | Subset |
| | | Conventional R semantics |
| Split, combine | | Split ranges by a factor into a |
| | | Concatenate two or more range objects |
Categorized listing and description of the API for range-based objects, such as IRanges, RangesList, GRanges and GRangesList.
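To make a few of the arithmetic and set-operation descriptions concrete ("Move the ranges by specified amount", "Clamp ranges to within some start and end", "Find regions not covered by reduced ranges"), here is a minimal Python sketch. The function names `shift`, `clamp` and `gaps` are chosen for this illustration and are not taken from the table. Ranges are assumed to be `(start, end)` tuples with 1-based inclusive endpoints, and dropping ranges that fall entirely outside the clamp window is a design choice of this sketch.

```python
def shift(ranges, amount):
    """Move every range by `amount` positions (negative moves left)."""
    return [(s + amount, e + amount) for s, e in ranges]


def clamp(ranges, lo, hi):
    """Clip ranges to the window [lo, hi]; ranges entirely outside are dropped."""
    out = []
    for s, e in ranges:
        if e < lo or s > hi:
            continue  # no position of this range lies inside the window
        out.append((max(s, lo), min(e, hi)))
    return out


def gaps(ranges):
    """Regions between the ranges that no range covers."""
    out = []
    covered_end = None  # right-most position covered so far
    for s, e in sorted(ranges):
        if covered_end is not None and s > covered_end + 1:
            out.append((covered_end + 1, s - 1))
        covered_end = e if covered_end is None else max(covered_end, e)
    return out


r = [(1, 4), (3, 6), (10, 12)]
shifted = shift(r, 5)       # [(6, 9), (8, 11), (15, 17)]
clamped = clamp(r, 2, 10)   # [(2, 4), (3, 6), (10, 10)]
uncovered = gaps(r)         # [(7, 9)]
```

Adjacent ranges such as `(1, 4)` and `(5, 6)` leave no gap under the inclusive-endpoint convention assumed here.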
Figure 2. Illustration of the reduce and disjoin operations on the last exon from each of the KRAS transcripts.
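The two operations in this figure can be sketched in a few lines of Python. This is an illustration of the semantics given in the API summary ("Merge overlapping and adjacent ranges" and "Ranges formed from union of endpoints"), not the packages' implementation. Ranges are assumed to be `(start, end)` tuples with 1-based inclusive endpoints, and the helper names `reduce_ranges` and `disjoin_ranges` are hypothetical.

```python
def reduce_ranges(ranges):
    """Merge overlapping and adjacent ranges into a minimal sorted set."""
    merged = []
    for start, end in sorted(ranges):
        # With inclusive endpoints, (1, 5) and (6, 8) are adjacent and merge.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def disjoin_ranges(ranges):
    """Cut the covered region at every range endpoint, giving non-overlapping pieces."""
    # Breakpoints: every start, and the position just past every end.
    points = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    pieces = []
    for left, nxt in zip(points, points[1:]):
        right = nxt - 1
        # Keep the piece only if some input range covers it.
        if any(s <= left and right <= e for s, e in ranges):
            pieces.append((left, right))
    return pieces


exons = [(1, 10), (5, 15), (20, 25)]   # toy coordinates, not real KRAS exons
reduced = reduce_ranges(exons)         # [(1, 15), (20, 25)]
disjoint = disjoin_ranges(exons)       # [(1, 4), (5, 10), (11, 15), (20, 25)]
```

Both results cover exactly the same positions; reduce yields the fewest ranges, while disjoin yields the finest partition in which each piece is covered by a fixed subset of the inputs.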
Figure 3. Illustration of overlap (top) and adjacency (bottom) relationships.
The any mode detects hits with partial or complete overlap, while within requires that the query range represents a subregion of the subject range.
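The difference between the two modes can be shown with a naive all-pairs scan. This Python sketch only illustrates the definitions in the caption; the packages are described as using efficient algorithms, which a quadratic loop is not. The function name `find_overlaps`, the `mode` parameter and the 0-based index pairs it returns are assumptions of this example, with ranges as `(start, end)` tuples using inclusive endpoints.

```python
def find_overlaps(query, subject, mode="any"):
    """Return (query_index, subject_index) pairs that match under `mode`.

    mode="any":    the two ranges share at least one position.
    mode="within": the query range is a subregion of the subject range.
    """
    if mode not in ("any", "within"):
        raise ValueError("mode must be 'any' or 'within'")
    hits = []
    for qi, (qs, qe) in enumerate(query):
        for si, (ss, se) in enumerate(subject):
            if mode == "any":
                matched = qs <= se and ss <= qe
            else:
                matched = ss <= qs and qe <= se
            if matched:
                hits.append((qi, si))
    return hits


query = [(5, 12), (3, 4), (30, 40)]
subject = [(1, 10), (20, 35)]

any_hits = find_overlaps(query, subject, "any")        # [(0, 0), (1, 0), (2, 1)]
within_hits = find_overlaps(query, subject, "within")  # [(1, 0)]
```

Every `within` hit is also an `any` hit, but not the reverse: `(5, 12)` overlaps `(1, 10)` only partially, so it appears in the first result alone.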
Figure 4. Illustration of overlap computations between two GRangesList objects.
Each set of rectangles linked by solid lines represents a compound range, i.e., an element of the list. Ranges in the query (top) are being matched against ranges in the subject (bottom). The labels between them indicate the type of overlap (any, within, none).
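One plausible reading of these labels for compound ranges is sketched below in Python. The exact rule the packages apply to list elements is not spelled out in the caption, so the rule used here is an assumption of this example: a pair is "within" when every query range lies inside some subject range, "any" when at least one pair of ranges overlaps, and "none" otherwise. The function name `classify` is hypothetical, and each compound range is a list of `(start, end)` tuples with inclusive endpoints.

```python
def overlaps(a, b):
    """True if the two simple ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def inside(a, b):
    """True if simple range `a` is a subregion of simple range `b`."""
    return b[0] <= a[0] and a[1] <= b[1]


def classify(query, subject):
    """Label the relationship between two compound ranges (assumed rule)."""
    if not query or not subject:
        return "none"
    # Assumed: "within" needs every query piece inside some subject piece.
    if all(any(inside(q, s) for s in subject) for q in query):
        return "within"
    # Assumed: "any" needs just one overlapping pair of pieces.
    if any(overlaps(q, s) for q in query for s in subject):
        return "any"
    return "none"


subject = [(10, 20), (30, 40)]                   # e.g. a two-exon transcript (toy values)
label_a = classify([(12, 15), (32, 38)], subject)  # "within"
label_b = classify([(5, 12), (50, 60)], subject)   # "any"
label_c = classify([(1, 5), (45, 50)], subject)    # "none"
```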
Contents of the krasA object, representing the exons in isoform A of KRAS.
Ranges for the first three reads in the ctcfReads object, storing the read alignments for the CTCF sample.
Figure 5. Visualization of the coverage of bases by GFP- and CTCF-bound fragments (top) in the context of part of the gene model for Rrp1, Entrez gene 18114 (bottom).
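Coverage, summarized in the API listing as "Count ranges covering each position", can be computed in a single pass with a difference array. The Python sketch below illustrates that idea and is not necessarily how the packages compute or store coverage. The function name `coverage` and its `length` parameter are assumptions of this example, as are the toy fragment coordinates; ranges are `(start, end)` tuples with 1-based inclusive endpoints that must lie within `1..length`.

```python
def coverage(ranges, length):
    """Return a list where element i-1 is the number of ranges covering position i."""
    diff = [0] * (length + 1)
    for start, end in ranges:
        if not (1 <= start <= end <= length):
            raise ValueError(f"range {(start, end)} is outside 1..{length}")
        diff[start - 1] += 1   # coverage rises at the first covered position
        diff[end] -= 1         # and falls just past the last covered position
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


fragments = [(1, 4), (3, 6), (3, 3)]   # toy fragments, not real ChIP-seq reads
depth = coverage(fragments, 8)         # [1, 1, 3, 2, 1, 1, 0, 0]
```

The cost is proportional to the number of ranges plus the sequence length, rather than to the sum of the range widths.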
Partial output of countVariants applied to a BAM file from an ENCODE CTCF ChIP-seq experiment.
The GRanges instance includes location-specific information on 24 attributes of each call, including information on sequencer cycle, base call quality distribution, and other features of BAM-based variant calling as performed by GSNAP [11].
Figure 6. Top panels: distributions of alternate nucleotide proportions for on- and off-SNP allele-dependent CTCF binding events. Bottom panels: relationships between average call quality values and alternate nucleotide proportions are depicted using a 2D density estimate (darker regions correspond to higher density).
Selected packages based on the Ranges infrastructure.
| Term | Count | Example packages |
| Genetics | 16 | NarrowPeaks, nucleR, GenomicFeatures, mosaics |
| Preprocessing | 11 | MEDIPS, biovizBase, TSSi, HMMcopy |
| Infrastructure | 9 | Genominator, AnnotationDbi, ggbio, pdInfoBuilder |
| GeneExpression | 8 | GGtools, easyRNASeq, Repitools, TransView |
| Sequencing | 5 | girafe, triform, seqbias, rSFFreader |
| Microarray | 4 | methyAnalysis, Gviz, MinimumDistance, charm |
| Clustering | 4 | chroGPS, methVisual, DirichletMultinomial, PICS |
| GenomicSequence | 3 | rGADEM, MotifDb, MotIV |
| QualityControl | 3 | ShortRead, R453Plus1Toolbox, htSeqTools |
| Statistics | 2 | oneChannelGUI, PING |
| OneChannel | 2 | xmapcore, annmap |
| DataRepresentation | 2 | genoset, FunciSNP |
| GeneticVariability | 2 | VanillaICE, SNPchip |
| Bioinformatics | 2 | DiffBind, segmentSeq |
| ChIPseq | 2 | chipseq, BayesPeak |
| Other | 10 | ChromHeatMap, gwascat, ChIPpeakAnno, OTUbase |
Categories are biocViews terms. Up to 4 packages were randomly sampled from Bioconductor packages that explicitly declare a dependence on the IRanges, GenomicRanges, or GenomicFeatures packages.