Literature DB >> 27999613

A bedr way of genomic interval processing.

Syed Haider1, Daryl Waggott1, Emilie Lalonde1,2, Clement Fung1, Fei-Fei Liu2,3, Paul C Boutros1,2.   

Abstract

BACKGROUND: Next-generation sequencing is making it critical to robustly and rapidly handle genomic ranges within standard pipelines. Standard use-cases include annotating sequence ranges with gene or other genomic annotation, merging multiple experiments together and subsequently quantifying and visualizing the overlap. The most widely-used tools for these tasks work at the command-line (e.g. BEDTools) and the small number of available R packages are either slow or have distinct semantics and features from command-line interfaces.
RESULTS: To provide a robust R-based interface to standard command-line tools for genomic coordinate manipulation, we created bedr. This open-source R package can use either BEDTools or BEDOPS as a back-end and performs data-manipulation extremely quickly, creating R data structures that can be readily interfaced with existing computational pipelines. It includes data-visualization capabilities and a number of data-access functions that interface with standard databases like UCSC and COSMIC.
CONCLUSIONS: bedr package provides an open source solution to enable genomic interval data manipulation and restructuring in R programming language which is commonly used in bioinformatics, and therefore would be useful to bioinformaticians and genomic researchers.

Entities:  

Keywords:  BED format; Data integration; Genomic intervals; Sequence algebra

Year:  2016        PMID: 27999613      PMCID: PMC5157088          DOI: 10.1186/s13029-016-0059-5

Source DB:  PubMed          Journal:  Source Code Biol Med        ISSN: 1751-0473


Background

With the advent of high-throughput sequencing technologies, data scientists are facing immense challenges in large-scale sequence analysis and in integrating genomic annotations. For instance, comparing new experiments with previously published datasets, translating genomic coordinates between different assemblies of an organism as well as finding cross-species orthologues are some of the common use-cases in basic science experiments. To assist these tasks genomic features are routinely represented and shared using Browser Extensible Display (BED; [1]), Distributed Annotation System (DAS; [2]), General Feature Format (GFF), Gene Transfer Format (GTF) and Variant Call Format (VCF). These all enable cross-sectional analysis of genomic studies across multiple programming languages, thereby enabling seamless data-integration. R is the de facto standard for statistical analysis and visualization in computational biology [3] for both exploratory prototyping and rigorous production pipelines. To this end R has adopted several packages, such as GenomicRanges and IRanges that expressly deal with genomic intervals [4]. Albeit powerful, these existing tools require understanding of bespoke data structures and classes/objects. To address these issues we implemented a formal BED-operations framework called bedr, which is an R API offering utility functions implementing commonly used genomic operations as well as offering a formal R interface to interact with BEDTools and BEDOPS. bedr is fully compatible with the ubiquitous BED tools [5, 6] paradigm and integrates seamlessly with R-based work-flows.

Implementation

bedr is implemented in R and supports the two main BED engines: BEDTools and BEDOPS [5, 6]. It works on Unix-like operating systems including Mac OSX. A high-level conceptual overview of the package usage is shown in Fig. 1. bedr functions run on any computing platform (commodity computer, cluster or cloud), and facilitates integration of multiple data sources (e.g. gene annotations) and tools (e.g. BEDTools) with user-specified genomic coordinates, yielding fast results in R data structures such as data.frame and list. All API calls are written in R, which internally calls BEDTools and/or BEDOPS utilities in the native format using system commands. Further, bedr data can be readily converted to BED [1] and GRanges [4] objects for wider interoperability.
Fig. 1

Overview of bedr package. bedr can run on a commodity linux based computer or a cloud/cluster. Users can interface with the underlying driver engines such as BEDTools/BEDOPS/tabix/GenomicRanges through bedr methods in R. This enables integration of user-specified multiple genomic intervals with reference data sources such as gene annotations (e.g. UCSC) and disease specific features (e.g. COSMIC). Such integration spans general-purpose genomic interval operations of intersection (*), union (sum) and joins. Output is returned in R friendly data structures for convenience in subsequent downstream analyses. These data structures are readily convertible to standard data exchange formats such as BED and GRanges using bedr utility methods

Overview of bedr package. bedr can run on a commodity linux based computer or a cloud/cluster. Users can interface with the underlying driver engines such as BEDTools/BEDOPS/tabix/GenomicRanges through bedr methods in R. This enables integration of user-specified multiple genomic intervals with reference data sources such as gene annotations (e.g. UCSC) and disease specific features (e.g. COSMIC). Such integration spans general-purpose genomic interval operations of intersection (*), union (sum) and joins. Output is returned in R friendly data structures for convenience in subsequent downstream analyses. These data structures are readily convertible to standard data exchange formats such as BED and GRanges using bedr utility methods

Results and discussion

The primary input to most bedr methods is a regions object, which is represented as either an R vector of multiple region strings as illustrated below or a data.frame of regions with three columns: chr, start, and end. The regions object returned by various bedr methods matches the input format; vector or data.frame. Here we briefly summarize a subset of key bedr functionalities. For further details on a range of bedr utilities, please see package’s help and vignettes for detailed examples and workflows.

Sort & merge

This functionality enables sorting of genomic regions in both natural and lexographical order using R, unix, BEDTools and BEDOPS engines. Following examples demonstrate the usage of these engines: regions <- get.example.regions() region <- regions[[1]] bedr.sort.region( x = region, engine = "unix", method = "natural" ) bedr.sort.region( x = region, engine = "R", method = "lexicographical" ) bedr.sort.region( x = region, engine = "bedtools" ) bedr.sort.region( x = region, engine = "bedops" ) The above code will generate the following outputs of sorted regions: # natural sort (unix) "chr1:10-100" "chr1:101-200" "chr1:200-210" "chr1:211-212" "chr2:10-50" "chr2:40-60" "chr10:50-100" "chr20:1-5" # lexicographical sort (R) "chr1:10-100" "chr1:101-200" "chr1:200-210" "chr1:211-212" "chr10:50-100" "chr2:10-50" "chr2:40-60" "chr20:1-5" # lexicographical sort (bedtools) "chr1:10-100" "chr1:101-200" "chr1:200-210" "chr1:211-212" "chr10:50-100" "chr2:10-50" "chr2:40-60" "chr20:1-5" # lexicographical sort (bedops) "chr1:10-100" "chr1:101-200" "chr1:200-210" "chr1:211-212" "chr10:50-100" "chr2:10-50" "chr2:40-60" "chr20:1-5" As shown above, various types of sorting results are presented in a similar R data structures regardless of which sorting engine is used (unix, R, bedtools or bedops) and their respective output style. Also, BEDTools and BEDOPS do not support natural sorting, and if method = “natural” is requested with these two engines, bedr automatically defaults to using engine = “unix” of “R” to perform sorting. Note, sorting of large number of regions through R will be slow and may also result in high memory overhead. Much of the command-line interaction with BEDTools and BEDOPS is performed through temporary files followed by efficient piping/parsing of the output straight into R data structures. This ensures that memory intensive sorting tasks (or any other genomic operations discussed below) are managed by the optimized engines such as (BEDTools or BEDOPS), and therefore memory operations in R are limited to subsequent parsing of output. In addition to sort operations, bedr also supports identification of overlapping regions which can be collapsed to avoid downstream analytical challenges such as many:many join results (Fig. 2), e.g.
Fig. 2

Illustration of key bedr operations. bedr regions objects represent a collection of sub-regions specified as R vector or data.frame. Three partially overlapping example regions (a, b and c) located at the beginning of human chromosome 1 (red mark on ideogram, 1-250 bp) are shown here. Vertical gray separators between sub-regions indicate regions that are 1 base pair apart. Overlapping regions can be merged, joined, subtracted resulting in new regions objects as shown here. Associated source code snippets are documented in the Results section. Regions object flank (b, 5 bp) exemplifies bedr utility flank.regions creating flanking (up and/or downstream) regions of a specified length; +/-5 bp in the example shown here

Illustration of key bedr operations. bedr regions objects represent a collection of sub-regions specified as R vector or data.frame. Three partially overlapping example regions (a, b and c) located at the beginning of human chromosome 1 (red mark on ideogram, 1-250 bp) are shown here. Vertical gray separators between sub-regions indicate regions that are 1 base pair apart. Overlapping regions can be merged, joined, subtracted resulting in new regions objects as shown here. Associated source code snippets are documented in the Results section. Regions object flank (b, 5 bp) exemplifies bedr utility flank.regions creating flanking (up and/or downstream) regions of a specified length; +/-5 bp in the example shown here bedr.merge.region(x = region) The above code will generate the following output of merged regions: "chr1:10-100" "chr1:101-210" "chr1:211-212" "chr10:50-100" "chr2:10-60" "chr20:1-5" Sort and merge can be combined into one step given they are generally run as a tandem preprocessing step: bedr.snm.region(x = region) The above code will generate the following vector output of sorted and merged regions: "chr1:10-100" "chr1:101-210" "chr1:211-212" "chr10:50-100" "chr2:10-60" "chr20:1-5"

Join

This functionality enables joining two region-based datasets using intervals as an index or primary key. The output is left outer join with respect to the first regions object (Fig. 2), e.g. regions.a <- bedr.merge.region( x = regions[[1]] ) regions.b <- bedr.merge.region( x = regions[[2]] ) regions.c <- bedr.merge.region( x = regions[[4]] ) bedr.join.region( x = regions.a, y = regions.b ) The above code will generate the following output, containing regions of regions.a in the first column, while any overlapping regions from regions.b are listed in columns 2 to 4 (chr, start, end). Regions in regions.a with no overlap are encoded as: . and -1 Similarly, another bedr function bedr.join.multiple.region() supports merging of multiple sets of regions (Fig. 2), e.g. bedr.join.multiple.region( x = list( a = regions.a, b = regions.b, c = regions.c ) ) The above code will generate the output data.frame shown below. The table lists all the sub-regions and their presence across the three sets of region objects (regions.a, regions.b, and regions.c) passed to the function. For instance, sub-region chr1:1-10 (column: index) overlaps with 2 region objects (b and c). This presence is shown as comma separated list in ‘names’ column as well as a truth table in the subsequent columns. The number of columns representing the truth table will match the number of region objects passed to the function bedr.join.multiple.region().

Subtract and intersect

The subtract utility identifies regions exclusive to first set of regions, and intersect function identifies sub-regions of first set which overlap with the second set of regions (Fig. 2), e.g. bedr.subtract.region( x = regions.a, y = regions.b ) The above code will generate the following output which lists the sub-regions exclusive to regions.a: "chr1:10-100" "chr10:50-100" "chr20:1-5" Intersect utility makes use of bed.join.region() and finds regions in the second set which overlap with the regions in the first set. An example is shown in the Results section “Join”. Similarly in.region(x = regions.a, y = regions.b) and its R style convenience operator %in.region% can be used to test (logical) presence of overlapping regions, e.g. in.region( x = regions.a, y = regions.b ) FALSE TRUE TRUE FALSE TRUE FALSE bedr also provides an interface to find overlapping regions using Tabix [7]. This can be done using the following bedr call: regions.d <- c( "1:1000-100000", "1:1000000-1100000" ) cosmic.vcf.example <- system.file( "extdata/CosmicCodingMuts_v66_20130725_ex.vcf.gz", package = "bedr" ) head( tabix( region = regions.d, file.name = cosmic.vcf.example, check.chr = FALSE ) ) which identifies regions overlapping with COSMIC coding mutations file resulting in the following data.frame (only first six rows are shown below):

Third-party compatibility

Given that bedr can process regions data as R’s vector as well as data.frame data structure, it is easily transformable to other third-party sequence and region objects. For instance, bedr provides a utility adaptor to convert regions into BED data.frame as shown below: regions.a.bed <- convert2bed( x = regions.a ) which can be further converted to a widely compatible GRanges [4] object, as shown below: library("GenomicRanges") makeGRangesFromDataFrame( df = regions.a.bed ) The above code will create a GRanges object shown in the output below, which can be further customized/extended with additional annotations such as strand and genomic feature names. To perform feature meta-analysis and annotation retrieval/conversion (see example workflow in Additional file 1), bedr facilitates downloads from UCSC [8], COSMIC [9] and HUGO [10] including reference genome annotations, repeat sequences, black lists and disease candidate features. Also, bedr has a fully integrated unit-testing framework allowing users to verify integrity of bedr functions when using customized development or installations.

Visualization

For results of common operations such as intersect, Venn diagrams of overlapping features between 2 to 5 sets of regions (2- to 5-way Venn diagrams) can be generated automatically [11]. The overlap criterion can be defined in a number of ways including unique intervals, gene length or user-specified size as a fraction of sub-region’s length, e.g. bedr.plot.region( input = list( a = regions.a, b = regions.b ), feature = "bp", fraction.overlap = 0.1 ) The above code will generate a base pair level overlap of sequence objects regions.a and regions.b, and show the results as a Venn diagram highlighting lengths of exclusive and overlapping regions as shown below: Further, bedr output is ideally suited for alternative complex set visualization tools such as UpSetR [12] and Gviz [13].

Conclusions

We created bedr; an R package to support genomic operations using the BEDTools [6] and BEDOPS [5] engines. bedr implements an API in R which offers a number of utility functions such as intersect, merge, sort and plotting of genomic intervals as well as provides a unified interface to BEDTools and BEDOPS. These functions are efficient, powerful and perform complex feature annotations and cross-sectional operations on genomic regions. Given that bedr supports two well-established genomic engines, its output is comparable to the native output of these tools, however in R data structures. These features of bedr are urgently needed by bioinformatics research community and will be a timely addition to the catalogue of sequence analysis tools. Further, the interoperability of bedr data structures with BED and GRanges data.frame/objects makes it an easy-to-fit component in existing genomic pipelines. bedr is freely available as an open-source package through CRAN and lends itself for customized extensions needed for in-house sequencing-analysis pipelines as well as future bioinformatics protocols.

Availability and requirements

Project name: bedr Project home page: http://cran.r-project.org/web/packages/bedr Operating system(s): OSX, Linux/Unix Programming language: R Other requirements: BEDTools, BEDOPS License: e.g. GNU GPL-2 Any restrictions to use by non-academics: None
index V4 V5 V6
1 2 3 4 5 6 chr1:10-100 chr1:101-210 chr1:211-212 chr10:50-100 chr2:10-60 chr20:1-5 . chr1 chr1 . chr2 . -1 111 111 -1 40 -1 -1 250 250 -1 60 -1
index n.overlaps names a b c
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 chr1:1-10 chr1:10-20 chr1:20-100 chr1:100-101 chr1:101-111 chr1:111-210 chr1:210-211 chr1:211-212 chr1:212-240 chr1:240-250 chr1:2000-2010 chr10:50-100 chr10:100-110 chr10:110-150 chr2:1-5 chr2:5-10 chr2:10-20 chr2:20-30 chr2:30-40 chr2:40-60 chr20:1-5 chr20:6-7 chr20:7-10 chr20:10-12 2 1 2 1 2 3 2 3 2 1 1 1 1 2 2 1 2 1 2 3 1 1 2 1 b,c 0 1 1 a 1 0 0 a,c 1 0 1 c 0 0 1 a,c 1 0 1 a,b,c 1 1 1 b,c 0 1 1 a,b,c 1 1 1 b,c 0 1 1 b 0 1 0 b 0 1 0 a 1 0 0 b 0 1 0 b,c 0 1 1 b,c 0 1 1 c 0 0 1 a,c 1 0 1 a 1 0 0 a,c 1 0 1 a,b,c 1 1 1 a 1 0 0 b 0 1 0 b,c 0 1 1 c 0 0 1
CHROM POS ID REF ALT QUAL FILTER
1 2 3 4 5 6 1 1 1 1 1 1 69345 69523 69538 69539 69540 69569 COSM911918 COSM426644 COSM75742 COSM1343690 COSM1560546 COSM1599955 C G G T G T A T A C T C NA NA NA NA NA NA <NA> <NA> <NA> <NA> <NA> <NA>
INFO
1 2 3 4 5 6 GENE=OR4F5;STRAND=+;CDS=c.255C>A;AA=p.I85I;CNT=1 GENE=OR4F5;STRAND=+;CDS=c.433G>T;AA=p.G145C;CNT=1 GENE=OR4F5;STRAND=+;CDS=c.448G>A;AA=p.V150M;CNT=1 GENE=OR4F5;STRAND=+;CDS=c.449T>C;AA=p.V150A;CNT=1 GENE=OR4F5;STRAND=+;CDS=c.450G>T;AA=p.V150V;CNT=1 GENE=OR4F5;STRAND=+;CDS=c.479T>C;AA=p.L160P;CNT=2
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] [2] [3] [4] [5] [6] chr1 chr1 chr1 chr10 chr2 chr20 [10, 100] [101, 210] [211, 212] [50, 100] [10, 60] [1, 5] * * * * * *
- - - - - - - seqinfo: 4 sequences from an unspecified genome;no seqlengths
  12 in total

1.  The UCSC Table Browser data retrieval tool.

Authors:  Donna Karolchik; Angela S Hinrichs; Terrence S Furey; Krishna M Roskin; Charles W Sugnet; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2011-01-05       Impact factor: 6.937

3.  BEDOPS: high-performance genomic feature operations.

Authors:  Shane Neph; M Scott Kuehn; Alex P Reynolds; Eric Haugen; Robert E Thurman; Audra K Johnson; Eric Rynes; Matthew T Maurano; Jeff Vierstra; Sean Thomas; Richard Sandstrom; Richard Humbert; John A Stamatoyannopoulos
Journal:  Bioinformatics       Date:  2012-05-09       Impact factor: 6.937

Review 4.  Orchestrating high-throughput genomic analysis with Bioconductor.

Authors:  Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal:  Nat Methods       Date:  2015-02       Impact factor: 28.547

5.  Visualizing Genomic Data Using Gviz and Bioconductor.

Authors:  Florian Hahne; Robert Ivanek
Journal:  Methods Mol Biol       Date:  2016

6.  BEDTools: a flexible suite of utilities for comparing genomic features.

Authors:  Aaron R Quinlan; Ira M Hall
Journal:  Bioinformatics       Date:  2010-01-28       Impact factor: 6.937

7.  The distributed annotation system.

Authors:  R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal:  BMC Bioinformatics       Date:  2001-10-10       Impact factor: 3.169

8.  The UCSC Genome Browser database: 2015 update.

Authors:  Kate R Rosenbloom; Joel Armstrong; Galt P Barber; Jonathan Casper; Hiram Clawson; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Glenn Hickey; Angie S Hinrichs; Robert Hubley; Donna Karolchik; Katrina Learned; Brian T Lee; Chin H Li; Karen H Miga; Ngan Nguyen; Benedict Paten; Brian J Raney; Arian F A Smit; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal:  Nucleic Acids Res       Date:  2014-11-26       Impact factor: 19.160

9.  Genenames.org: the HGNC resources in 2015.

Authors:  Kristian A Gray; Bethan Yates; Ruth L Seal; Mathew W Wright; Elspeth A Bruford
Journal:  Nucleic Acids Res       Date:  2014-10-31       Impact factor: 19.160

10.  COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors:  Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal:  Nucleic Acids Res       Date:  2014-10-29       Impact factor: 16.971

View more
  9 in total

1.  Identifying high-confidence capture Hi-C interactions using CHiCANE.

Authors:  Erle M Holgersen; Andrea Gillespie; Olivia C Leavy; Joseph S Baxter; Alisa Zvereva; Gareth Muirhead; Nichola Johnson; Orsolya Sipos; Nicola H Dryden; Laura R Broome; Yi Chen; Igor Kozin; Frank Dudbridge; Olivia Fletcher; Syed Haider
Journal:  Nat Protoc       Date:  2021-04-09       Impact factor: 13.491

2.  From landrace to modern hybrid broccoli: the genomic and morphological domestication syndrome within a diverse B. oleracea collection.

Authors:  Zachary Stansell; Thomas Björkman
Journal:  Hortic Res       Date:  2020-10-01       Impact factor: 6.793

3.  Erratum to: A bedr way of genomic interval processing.

Authors:  Syed Haider; Daryl Waggott; Emilie Lalonde; Clement Fung; Fei-Fei Liu; Paul C Boutros
Journal:  Source Code Biol Med       Date:  2017-01-09

4.  valr: Reproducible genome interval analysis in R.

Authors:  Kent A Riemondy; Ryan M Sheridan; Austin Gillen; Yinni Yu; Christopher G Bennett; Jay R Hesselberth
Journal:  F1000Res       Date:  2017-06-29

5.  Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection.

Authors:  Anna Y Lee; Adam D Ewing; Kyle Ellrott; Yin Hu; Kathleen E Houlahan; J Christopher Bare; Shadrielle Melijah G Espiritu; Vincent Huang; Kristen Dang; Zechen Chong; Cristian Caloian; Takafumi N Yamaguchi; Michael R Kellen; Ken Chen; Thea C Norman; Stephen H Friend; Justin Guinney; Gustavo Stolovitzky; David Haussler; Adam A Margolin; Joshua M Stuart; Paul C Boutros
Journal:  Genome Biol       Date:  2018-11-06       Impact factor: 13.583

6.  FIREVAT: finding reliable variants without artifacts in human cancer samples using etiologically relevant mutational signatures.

Authors:  Hyunbin Kim; Andy Jinseok Lee; Jongkeun Lee; Hyonho Chun; Young Seok Ju; Dongwan Hong
Journal:  Genome Med       Date:  2019-12-17       Impact factor: 11.117

7.  Quantifying the influence of mutation detection on tumour subclonal reconstruction.

Authors:  Lydia Y Liu; Vinayak Bhandari; Adriana Salcedo; Shadrielle M G Espiritu; Quaid D Morris; Thomas Kislinger; Paul C Boutros
Journal:  Nat Commun       Date:  2020-12-07       Impact factor: 14.919

8.  DNA methylation analysis of normal colon organoids from familial adenomatous polyposis patients reveals novel insight into colon cancer development.

Authors:  Matthew A Devall; Stephen Eaton; Mourad Wagdy Ali; Christopher H Dampier; Daniel Weisenberger; Steven M Powell; Li Li; Graham Casey
Journal:  Clin Epigenetics       Date:  2022-08-23       Impact factor: 7.259

9.  Establishment and characterization of immortalized human breast cancer cell lines from breast cancer patient-derived xenografts (PDX).

Authors:  Yongxian Zhuang; Jordan M Grainger; Peter T Vedell; Jia Yu; Ann M Moyer; Huanyao Gao; Xiao-Yang Fan; Sisi Qin; Duan Liu; Krishna R Kalari; Matthew P Goetz; Judy C Boughey; Richard M Weinshilboum; Liewei Wang
Journal:  NPJ Breast Cancer       Date:  2021-06-18
  9 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.