
BEDTools: a flexible suite of utilities for comparing genomic features.

Aaron R Quinlan, Ira M Hall.

Abstract

MOTIVATION: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner.
RESULTS: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets.
AVAILABILITY AND IMPLEMENTATION: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools
CONTACT: aaronquinlan@gmail.com; imh4y@virginia.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year:  2010        PMID: 20110278      PMCID: PMC2832824          DOI: 10.1093/bioinformatics/btq033

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Determining whether distinct sets of genomic features (e.g. aligned sequence reads, gene annotations, ESTs, genetic polymorphisms, mobile elements, etc.) overlap or are associated with one another is a fundamental task in genomics research. Such comparisons serve to characterize experimental results, infer causality or coincidence (or lack thereof) and assess the biological impact of genomic discoveries. Genomic features are commonly represented by the Browser Extensible Data (BED) or General Feature Format (GFF) formats and are typically compared using either the UCSC Genome Browser's (Kent et al., 2002) ‘Table Browser’ or using the Galaxy (Giardine et al., 2005) interface. While these tools offer a convenient and reliable method for such analyses, they are not amenable to large and/or ad hoc datasets owing to the inherent need to interact with a remote or local web site installation. Moreover, complicated analyses often require iterative testing and refinement. In this sense, faster and more flexible tools allow one to conduct a greater number and more diverse set of experiments. This necessity is made more acute by the data volume produced by current DNA sequencing technologies. In an effort to address these needs, we have developed BEDTools, a fast and flexible suite of utilities for common operations on genomic features.

2 FEATURES AND METHODS

2.1 Common scenarios

Genomic analyses often seek to compare features that are discovered in an experiment to known annotations for the same species. When genomic features from two distinct sets share at least one base pair in common, they are defined as ‘intersecting’ or ‘overlapping’. For example, a typical question might be ‘Which of my novel genetic variants overlap with exons?’ One straightforward approach to identify overlapping features is to iterate through each feature in set A and repeatedly ask if it overlaps with any of the features in set B. While effective, this approach is unreasonably slow when screening for overlaps between, for example, millions of DNA sequence alignments and the RepeatMasker (Smit et al., 1996–2004) track for the human genome. This inefficiency is compounded when asking more complicated questions involving many disparate sets of genomic features. BEDTools was developed to efficiently address such questions without requiring an installation of the UCSC or Galaxy browsers. The BEDTools suite is designed for use in a UNIX environment and works seamlessly with existing UNIX utilities (e.g. grep, awk, sort, etc.), thereby allowing complex experiments to be conducted with a single UNIX pipeline.
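The straightforward all-against-all approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BEDTools code: the function names and the toy coordinates are hypothetical, and intervals are assumed to follow the BED convention of 0-based, half-open coordinates.

```python
from typing import List, Tuple

# (chrom, start, end) with BED-style 0-based, half-open coordinates (assumed).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def naive_intersect(set_a: List[Interval], set_b: List[Interval]) -> List[Interval]:
    """Return every feature of A that overlaps any feature of B.

    Each A feature is compared against each B feature, so the cost grows
    as |A| * |B|, which is what makes this approach slow on large inputs.
    """
    return [a for a in set_a if any(overlaps(a, b) for b in set_b)]


# Hypothetical toy data: three variants and two exons.
variants = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 100, 101)]
exons = [("chr1", 90, 150), ("chr2", 300, 400)]
hits = naive_intersect(variants, exons)
print(hits)
```

With millions of alignments in A and millions of annotations in B, the nested comparison becomes the bottleneck, which motivates the indexing scheme in the next section.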

2.2 Language and algorithmic approach

BEDTools incorporates the genome-binning algorithm used by the UCSC Genome Browser (Kent et al., 2002). This clever approach uses a hierarchical indexing scheme to assign genomic features to discrete ‘bins’ (e.g. 16 kb segments) along the length of a chromosome. This expedites searches for overlapping features, since one must only compare features between two sets that share the same (or nearby) bins. As illustrated in Supplementary Figure 1, calculating feature overlaps for large datasets (e.g. millions of sequence alignments) with BEDTools is substantially faster than using the tools available on the public Galaxy web site. The software is written in C++ and supports alignments in BAM format (Li et al., 2009) through use of the BAMTools libraries (Barnett et al., http://sourceforge.net/projects/bamtools/).
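To make the binning idea concrete, the following Python sketch builds a small hierarchical bin index and queries it. It is a simplified illustration under stated assumptions, not the UCSC or BEDTools implementation: the finest bin size of 16 kb comes from the text, while the 8-fold widening per level, the number of levels and all function names are assumptions chosen for this example.

```python
from collections import defaultdict

FIRST_SHIFT = 14  # 2**14 = 16,384 bp: the "16 kb" finest bin from the text
NEXT_SHIFT = 3    # assumed: each coarser level is 8x wider than the previous
LEVELS = 6        # assumed: the coarsest bin then spans 2**29 bp


def bin_of(start, end):
    """Return (level, index) of the smallest bin containing [start, end).

    Assumes end > start.
    """
    s, e = start >> FIRST_SHIFT, (end - 1) >> FIRST_SHIFT
    for level in range(LEVELS):
        if s == e:
            return (level, s)
        s >>= NEXT_SHIFT
        e >>= NEXT_SHIFT
    raise ValueError("interval exceeds the coarsest bin")


def bins_overlapping(start, end):
    """Yield every (level, index) bin that could hold a feature overlapping [start, end)."""
    s, e = start >> FIRST_SHIFT, (end - 1) >> FIRST_SHIFT
    for level in range(LEVELS):
        for i in range(s, e + 1):
            yield (level, i)
        s >>= NEXT_SHIFT
        e >>= NEXT_SHIFT


def build_index(features):
    """Group (chrom, start, end, name) features by chromosome and bin."""
    index = defaultdict(list)
    for chrom, start, end, name in features:
        index[(chrom,) + bin_of(start, end)].append((start, end, name))
    return index


def query(index, chrom, start, end):
    """Names of indexed features overlapping [start, end) on chrom."""
    hits = []
    for level, i in bins_overlapping(start, end):
        for f_start, f_end, name in index.get((chrom, level, i), []):
            if f_start < end and start < f_end:
                hits.append(name)
    return hits


# Hypothetical features; b3 straddles the boundary between two 16 kb bins.
features = [
    ("chr1", 100, 200, "b1"),
    ("chr1", 20000, 20100, "b2"),
    ("chr1", 16000, 17000, "b3"),
]
index = build_index(features)
found = query(index, "chr1", 150, 16500)
print(found)
```

Only features stored in bins touched by the query are examined, so most of the second set is never compared at all. A feature that crosses a bin boundary is stored at a coarser level, which is why the query also visits the enclosing bins at every level.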

2.3 Supported operations

Table 1 illustrates the wide range of operations that BEDTools supports. Many of the tools have extensive parameters that allow user-defined overlap criteria and fine control over how results are reported. Importantly, we have also defined a concise format (BEDPE) to facilitate comparisons of discontinuous features (e.g. paired-end sequence reads) to each other (pairToPair), and to genomic features in traditional BED format (pairToBed). This functionality is crucial for interpreting genomic rearrangements detected by paired-end mapping, and for identifying fusion genes or alternative splicing patterns by RNA-seq. To facilitate comparisons with data produced by current DNA sequencing technologies, intersectBed and pairToBed compute overlaps between sequence alignments in BAM format (Li et al., 2009), and a general purpose tool is provided to convert BAM alignments to BED format, thus facilitating the use of BAM alignments with all other BEDTools (Table 1). The following examples illustrate the use of intersectBed to isolate single nucleotide polymorphisms (SNPs) that overlap with genes, pairToBed to create a BAM file containing only those alignments that overlap with exons and intersectBed coupled with samtools to create a SAM file of alignments that do not intersect (-v) with repeats.
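The sketch below is not the command-line usage itself; it mimics in Python, on plain tuples, two of the behaviours described in this section: reporting only features that do not intersect a second set (the inverted report selected with -v), and testing a discontinuous paired-end record against BED features. The function names and toy records are hypothetical, and reporting a pair when either end overlaps is an assumption made for this example, one of several plausible overlap criteria.

```python
def overlaps(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """Same chromosome and at least one shared base (0-based, half-open)."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def non_intersecting(features_a, features_b):
    """Keep the A features that overlap no B feature (an inverted, '-v' style report)."""
    return [
        a for a in features_a
        if not any(overlaps(*a[:3], *b[:3]) for b in features_b)
    ]


def pair_hits_bed(pair, features_b):
    """True when either end of a paired-end record overlaps a B feature.

    `pair` holds the two ends of the pair as
    (chrom1, start1, end1, chrom2, start2, end2).
    """
    end1, end2 = pair[0:3], pair[3:6]
    return any(
        overlaps(*end1, *b[:3]) or overlaps(*end2, *b[:3])
        for b in features_b
    )


# Hypothetical alignments screened against a repeat annotation.
alignments = [("chr1", 100, 150), ("chr1", 1000, 1050)]
repeats = [("chr1", 120, 300)]
unique_alignments = non_intersecting(alignments, repeats)

# Hypothetical read pairs screened against exons.
pairs = [
    ("chr1", 100, 150, "chr1", 900, 950),
    ("chr1", 2000, 2050, "chr2", 10, 60),
]
exons = [("chr1", 920, 1000)]
exonic_pairs = [p for p in pairs if pair_hits_bed(p, exons)]

print(unique_alignments)
print(exonic_pairs)
```

Treating the two ends of a pair as separate intervals is what allows a single record to describe a feature that spans a gap, or even two chromosomes.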
Table 1. Summary of supported operations available in the BEDTools suite

Utility              Description
intersectBed*†       Returns overlaps between two BED files.
pairToBed†           Returns overlaps between a BEDPE file and a BED file.
bamToBed†            Converts BAM alignments to BED or BEDPE format.
pairToPair           Returns overlaps between two BEDPE files.
windowBed            Returns overlaps between two BED files within a user-defined window.
closestBed           Returns the closest feature to each entry in a BED file.
subtractBed*         Removes the portion of an interval that is overlapped by another feature.
mergeBed*            Merges overlapping features into a single feature.
coverageBed*         Summarizes the depth and breadth of coverage of features in one BED file relative to another.
genomeCoverageBed    Histogram or a ‘per base’ report of genome coverage.
fastaFromBed         Creates FASTA sequences from BED intervals.
maskFastaFromBed     Masks a FASTA file based upon BED coordinates.
shuffleBed           Permutes the locations of features within a genome.
slopBed              Adjusts features by a requested number of base pairs.
sortBed              Sorts BED files in useful ways.
linksBed             Creates HTML links from a BED file.
complementBed*       Returns intervals not spanned by features in a BED file.

Utilities marked with a dagger (†) support sequence alignments in BAM. Utilities with an asterisk were compared with Galaxy and found to yield identical results.

Other notable tools include coverageBed, which calculates the depth and breadth of genomic coverage of one feature set (e.g. mapped sequence reads) relative to another; shuffleBed, which permutes the genomic positions of BED features to allow calculations of statistical enrichment; mergeBed, which combines overlapping features; and utilities that search for nearby yet non-overlapping features (closestBed and windowBed). BEDTools also includes utilities for extracting and masking FASTA sequences (Pearson and Lipman, 1988) based upon BED intervals. Tools with similar functionality to those provided by Galaxy were directly compared for correctness using the ‘knownGene’ and ‘RepeatMasker’ tracks from the hg19 build of the human genome. The results from all analogous tools were found to be identical (Table 1).
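Two of the operations named above, combining overlapping features and searching for nearby features within a window, can be sketched in Python as follows. This is a minimal illustration rather than the BEDTools implementation: the function names are hypothetical, and the sketch assumes that only features sharing at least one base are merged, so features that merely touch end to start stay separate.

```python
def merge(features):
    """Combine overlapping (chrom, start, end) features into single features.

    Features are sorted by chromosome and start; a feature is folded into
    the previous one when it begins before that one ends.
    """
    merged = []
    for chrom, start, end in sorted(features):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def within_window(a, features_b, window):
    """B features that fall within `window` bp of feature a on either side.

    Implemented by widening a by `window` bases and testing for overlap.
    """
    chrom, start, end = a
    lo, hi = max(0, start - window), end + window
    return [b for b in features_b if b[0] == chrom and b[1] < hi and lo < b[2]]


# Hypothetical toy data.
merged = merge([("chr1", 15, 30), ("chr1", 10, 20), ("chr1", 40, 50), ("chr2", 5, 8)])
nearby = within_window(("chr1", 100, 110), [("chr1", 150, 160), ("chr1", 500, 510)], 50)
print(merged)
print(nearby)
```

Expressing the window search as an overlap test on a widened interval means it can reuse whatever overlap machinery is already in place.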

2.4 Other advantages

Except for the novel paired-end functionality and support for alignments in BAM format, many of the genomic comparisons supported by BEDTools can be performed in one way or another with available web-based tools. However, BEDTools offers several important advantages. First, it can read data from standard input and write to standard output, which allows complex set operations to be performed by combining BEDTools operations with each other or with existing UNIX utilities. Second, most of the tools can distinguish DNA strands when searching for overlaps, which allows orientation to be considered when interpreting paired-end mapping or RNA-seq data. Third, the use of BEDTools mitigates the need to interact with local or public instances of the UCSC Genome Browser or Galaxy, which can be a major bottleneck when working with large genomics datasets. Finally, the speed and extensive functionality of BEDTools allow greater flexibility in defining and refining genomic comparisons. These features allow for diverse and complex comparisons to be made between ever-larger genomic datasets.
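The first two advantages, stream-based input and output and strand-aware overlap, can be illustrated with a short Python sketch. It is a hypothetical stand-in, not BEDTools code: the function names are invented for this example, and input lines are assumed to be tab-delimited six-column BED records (chrom, start, end, name, score, strand).

```python
def parse_bed6(line):
    """Extract (chrom, start, end, strand) from a six-column BED line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[5]


def stream_intersect(lines_a, features_b, same_strand=False):
    """Yield each input line whose feature overlaps a B feature.

    `lines_a` can be any iterable of lines, so a file, a list or a
    standard input stream all work. With same_strand=True, only B
    features on the same strand count as overlaps.
    """
    for line in lines_a:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end, strand = parse_bed6(line)
        for b_chrom, b_start, b_end, b_strand in features_b:
            if same_strand and strand != b_strand:
                continue
            if chrom == b_chrom and start < b_end and b_start < end:
                yield line
                break  # report each A feature once


# Hypothetical data: one gene on the plus strand and three reads.
genes = [("chr1", 100, 200, "+")]
reads = [
    "chr1\t150\t180\tr1\t0\t+\n",
    "chr1\t150\t180\tr2\t0\t-\n",
    "chr1\t300\t320\tr3\t0\t+\n",
]
any_strand = list(stream_intersect(reads, genes))
same_strand_only = list(stream_intersect(reads, genes, same_strand=True))
print("".join(any_strand), end="")
print("".join(same_strand_only), end="")
```

Because the function consumes and produces lines lazily, passing a standard input stream as `lines_a` and writing each yielded line to standard output would let such a filter sit in the middle of a pipeline next to grep, awk or sort.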
REFERENCES

1.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

2.  Galaxy: a platform for interactive large-scale genome analysis.

Authors:  Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal:  Genome Res       Date:  2005-09-16       Impact factor: 9.043

3.  Improved tools for biological sequence comparison.

Authors:  W R Pearson; D J Lipman
Journal:  Proc Natl Acad Sci U S A       Date:  1988-04       Impact factor: 11.205

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937
